Lesson 2

Core data structures

Hash maps, on-disk layout, and testable building blocks.

3-4 hours advanced Module 1: Foundation & Data Structures

Learning Objectives

  • Choose the right data structure for your key-value store

  • Implement HashMap with proper hash functions

  • Handle value serialization strategies

  • Build comprehensive testing frameworks

  • Master Go’s testing and benchmarking tools

Lesson 2.1: HashMap Implementation

Choosing the Right Data Structure

Before we implement, let’s understand the options and trade-offs for our core in-memory key-value store.

Option 1: Go’s Native Map (map[string]interface{})

Characteristics:

  • Built-in hash table implementation

  • O(1) average case lookup

  • Not thread-safe (race conditions with concurrent access)

  • Keys are unordered (iteration order randomized)

  • Fast for basic use cases

When to use:

  • Single-threaded applications

  • Prototyping and testing

  • When simplicity matters more than concurrency

Option 2: Custom HashMap with Sharding

Characteristics:

  • Divide keys into N shards (sub-maps)

  • Each shard has its own lock

  • Multiple goroutines can operate on different shards concurrently

  • More complex but much better concurrency

  • Still unordered

Performance improvement:

  • Single lock: throughput = 100K ops/sec (8 cores bottlenecked)

  • 16 shards: throughput = 1.2M ops/sec (8 cores utilized)

When to use:

  • Multi-threaded applications

  • High concurrency, many readers/writers

  • Good balance of simplicity and performance

Option 3: Skip List

Characteristics:

  • Probabilistic data structure maintaining sorted order

  • O(log n) lookup, insert, delete

  • Enables range queries efficiently

  • Can be made lock-free with careful design

  • More memory overhead than hash tables (~33%)

When to use:

  • Need sorted iteration (range queries)

  • Building LSM tree memtable

  • Distributed systems (Bigtable, HBase use this)

Option 4: B-Tree or B+ Tree

Characteristics:

  • Balanced tree structure

  • O(log n) operations

  • High degree (many children per node)

  • Excellent disk I/O patterns

  • More complex implementation

When to use:

  • Disk-based storage (SSTables)

  • Very large datasets

  • Need to minimize disk seeks

Decision: Sharded HashMap for MVP

For Week 2 (MVP): We’ll implement a sharded hash map because:

  • ✓ Good balance of simplicity and performance

  • ✓ Thread-safe for concurrent operations

  • ✓ Sufficient for most use cases

  • ✓ Foundation for learning concurrency patterns

Later (Week 5): We’ll add sorted keys via LSM tree when we add persistence.

Go’s Native Map vs Custom Implementation

Go’s Map Under the Hood

type hmap struct {
    count      int           // number of elements
    flags      uint8         // flags tracking rehashing state
    B          uint8         // log_2 of buckets array size
    noverflow  uint16        // overflow buckets count
    hash0      uint32        // seed hash
    buckets    unsafe.Pointer // bucket array
    oldbuckets unsafe.Pointer // previous bucket array (during resize)
    nevacuate  uintptr       // progress of old bucket evacuation
    extra      *mapextra     // overflow buckets, if any
}

type bmap struct {
    tophash [8]uint8                    // hash values for 8 entries
    keys    [8]keytype                  // keys
    values  [8]valuetype               // values
    pad     uintptr
    overflow *bmap                      // overflow bucket for collision chain
}

Key Points:

  • Bucket size: 8 entries per bucket (cache line optimization)

  • Collision handling: overflow buckets linked as a chain

  • Growth: Doubles bucket count when load factor > 6.5/8 = ~0.8

  • Rehashing: Incremental during operations, not all at once

Why We Can’t Just Use map[string]interface{}

Problem: Not thread-safe

// Problem: Not thread-safe
m := make(map[string]interface{})

go func() {
    m["key1"] = "value1"  // goroutine 1 writes
}()

go func() {
    _ = m["key1"]         // goroutine 2 reads
}()
// Result: RACE CONDITION - panics or corrupts data

Thread Safety Considerations

What Makes Maps Unsafe?

During a resize operation, the map is internally inconsistent. Two concurrent operations can see different states:

Goroutine A: Reading key="foo" during resize
    1. Compute hash(foo) = 0x123abc
    2. Look in old bucket[3]
    3. Find value = "bar"
    4. Return "bar"

Goroutine B: Writing key="baz" = "qux" during resize
    1. Acquire exclusive access (panic or data corruption)
    2. Modify bucket array
    3. Element A sees inconsistent state

Solution: Synchronization

Option 1: Coarse-grained lock (simple, slower)

type Map struct {
    mu    sync.RWMutex
    data  map[string]interface{}
}

func (m *Map) Get(key string) interface{} {
    m.mu.RLock()
    defer m.mu.RUnlock()
    return m.data[key]
}

Option 2: Fine-grained locking (complex, faster)

type ShardedMap struct {
    shards []*mapShard
}

type mapShard struct {
    mu    sync.RWMutex
    data  map[string]interface{}
}

Option 3: sync.Map (Go 1.9+, specific use case)

var m sync.Map

m.Store("key", "value")
val, _ := m.Load("key")

When to use each:

  • Coarse-grained: Simple, fine for modest concurrency

  • Sharded: High concurrency, many goroutines

  • sync.Map: Write-rarely, read-heavy workloads

Lesson 2.2: Basic Operations (Get, Put, Delete)

Architecture of Our Sharded HashMap

┌─────────────────────────────────┐
│   ShardedMap (thread-safe)      │
├─────────────────────────────────┤
│ shards [16]                     │
└──┬──────────┬────────┬─────────┘
   │          │        │
   ▼          ▼        ▼
┌────────┐ ┌────────┐ ┌────────┐
│Shard 0 │ │Shard 1 │...│Shard15│
├────────┤ ├────────┤ ├────────┤
│Lock    │ │Lock    │ │Lock    │
│HashMap │ │HashMap │ │HashMap │
└────────┘ └────────┘ └────────┘

Implementing Get Operation

// Full implementation with documentation
type ShardedMap struct {
    shards []*mapShard
}

type mapShard struct {
    mu   sync.RWMutex
    data map[string]interface{}
}

func NewShardedMap(shardCount int) *ShardedMap {
    shards := make([]*mapShard, shardCount)
    for i := 0; i < shardCount; i++ {
        shards[i] = &mapShard{
            data: make(map[string]interface{}),
        }
    }
    return &ShardedMap{shards: shards}
}

// Get retrieves a value by key
// Returns (value, exists)
// Time complexity: O(1) average case
// Thread-safe: Multiple readers allowed simultaneously
func (m *ShardedMap) Get(key string) (interface{}, bool) {
    // Step 1: Determine which shard owns this key
    shard := m.getShard(key)
    
    // Step 2: Acquire read lock (allows multiple concurrent readers)
    shard.mu.RLock()
    defer shard.mu.RUnlock()
    
    // Step 3: Look up value in shard's map
    val, exists := shard.data[key]
    return val, exists
}

// getShard returns the shard responsible for a given key
// Uses hash function to distribute keys uniformly
func (m *ShardedMap) getShard(key string) *mapShard {
    // Hash the key to get a number
    h := fnv.New64a()
    h.Write([]byte(key))
    
    // Map hash to shard index using modulo
    shardIndex := int(h.Sum64() % uint64(len(m.shards)))
    return m.shards[shardIndex]
}

Implementing Put Operation

// Put stores a value at the given key
// If key exists, overwrites the value
// Time complexity: O(1) average case
// Thread-safe: Exclusive access to shard during write
func (m *ShardedMap) Put(key string, value interface{}) error {
    // Input validation
    if key == "" {
        return errors.New("key cannot be empty")
    }
    if value == nil {
        return errors.New("value cannot be nil")
    }
    
    // Step 1: Get the appropriate shard
    shard := m.getShard(key)
    
    // Step 2: Acquire write lock (exclusive access)
    shard.mu.Lock()
    defer shard.mu.Unlock()
    
    // Step 3: Store in map
    shard.data[key] = value
    
    return nil
}

// Thread safety notes:
// - Lock is held for entire operation
// - Other goroutines cannot access this shard during Put
// - Readers on OTHER shards still proceed normally (parallelism)

Implementing Delete Operation

// Delete removes a key and its value
// Returns true if key existed, false otherwise
// Time complexity: O(1) average case
// Thread-safe: Exclusive access to shard during deletion
func (m *ShardedMap) Delete(key string) bool {
    shard := m.getShard(key)
    
    shard.mu.Lock()
    defer shard.mu.Unlock()
    
    _, existed := shard.data[key]
    delete(shard.data, key)
    return existed
}

Handling Key Collisions

Hash Collision: Two different keys hash to the same value

Example: Hash function: hash(key) = len(key) % 16
  hash("a") = 1
  hash("bcd") = 3
  hash("e") = 1  <- COLLISION with "a"!

How Go’s Map Handles Collisions:

    1. Bucket array has 16 buckets (for example)
    1. Each bucket stores up to 8 entries inline
    1. Hash collision → same bucket
    1. If bucket full, create overflow bucket (linked list)

Our Sharded Map:

  • Primary collision handling: Shard hash (% number of shards)

  • Secondary collision handling: Go’s map internal chaining

Value Serialization Strategies

Our in-memory store stores interface{} which can be any Go type. But what about networking and persistence?

Option 1: JSON Serialization

type JSONCodec struct{}

func (c JSONCodec) Encode(v interface{}) ([]byte, error) {
    return json.Marshal(v)
}

func (c JSONCodec) Decode(data []byte, v interface{}) error {
    return json.Unmarshal(data, v)
}

// Usage:
codec := JSONCodec{}
data, _ := codec.Encode("hello")  // => []byte(`"hello"`)
var s string
codec.Decode(data, &s)  // => s = "hello"

Pros:

  • Human-readable

  • Widely supported

Cons:

  • Verbose

  • Slow

  • Type information lost

Option 2: Protocol Buffers

protobuf
syntax = "proto3";

message Value {
    oneof data {
        string string_value = 1;
        int64 int_value = 2;
        bytes bytes_value = 3;
    }
}

Pros:

  • Compact

  • Fast

  • Schema versioning

Cons:

  • Requires code generation

  • Steeper learning curve

Option 3: Custom Binary Encoding

// Simple binary format for our use case
type BinaryCodec struct{}

// Format: [type_byte][length_varint][data...]
func (c BinaryCodec) Encode(v interface{}) ([]byte, error) {
    buf := bytes.NewBuffer(nil)
    
    switch val := v.(type) {
    case string:
        buf.WriteByte(1)  // type: string
        binary.Write(buf, binary.BigEndian, int32(len(val)))
        buf.WriteString(val)
    case int64:
        buf.WriteByte(2)  // type: int64
        binary.Write(buf, binary.BigEndian, val)
    case []byte:
        buf.WriteByte(3)  // type: bytes
        binary.Write(buf, binary.BigEndian, int32(len(val)))
        buf.Write(val)
    default:
        return nil, errors.New("unsupported type")
    }
    
    return buf.Bytes(), nil
}

Pros:

  • Minimal overhead

  • Fast

  • Control over format

Cons:

  • Manual implementation

  • Limited type support

Decision for our course: Start with storing byte slices ([]byte)

This simplifies serialization and forces explicit encoding decisions.

Lesson 2.3: Testing Strategy

Unit Testing Fundamentals

Test Structure in Go - Test functions must start with “Test”

// Filename: store_test.go (same directory as store.go)
// Test functions must start with "Test"

func TestStoreGet(t *testing.T) {
    // Arrange: Set up test fixtures
    store := NewShardedStore(4)
    store.Put(context.Background(), []byte("key1"), []byte("value1"))
    
    // Act: Execute the code under test
    result, err := store.Get(context.Background(), []byte("key1"))
    
    // Assert: Verify the result
    if err != nil {
        t.Fatalf("unexpected error: %v", err)
    }
    
    if string(result) != "value1" {
        t.Errorf("got %q, want %q", string(result), "value1")
    }
}

Subtests for Organization

func TestStore(t *testing.T) {
    store := NewShardedStore(4)
    
    t.Run("Put and Get", func(t *testing.T) {
        store.Put(context.Background(), []byte("k"), []byte("v"))
        v, _ := store.Get(context.Background(), []byte("k"))
        if string(v) != "v" {
            t.Errorf("got %q, want %q", string(v), "v")
        }
    })
    
    t.Run("Get missing key", func(t *testing.T) {
        _, err := store.Get(context.Background(), []byte("missing"))
        if err != ErrKeyNotFound {
            t.Errorf("got %v, want ErrKeyNotFound", err)
        }
    })
    
    t.Run("Delete", func(t *testing.T) {
        store.Put(context.Background(), []byte("k2"), []byte("v2"))
        store.Delete(context.Background(), []byte("k2"))
        _, err := store.Get(context.Background(), []byte("k2"))
        if err != ErrKeyNotFound {
            t.Errorf("expected key to be deleted")
        }
    })
}

Table-Driven Tests

Table-driven tests reduce duplication and make it easy to add more test cases.

func TestStoreOperations(t *testing.T) {
    tests := []struct {
        name      string
        key       []byte
        value     []byte
        setupFn   func(*ShardedStore)
        wantValue []byte
        wantErr   bool
    }{
        {
            name:      "put and get string",
            key:       []byte("user:1"),
            value:     []byte("alice"),
            wantValue: []byte("alice"),
        },
        {
            name:      "put and get binary",
            key:       []byte("binary:1"),
            value:     []byte{0x00, 0x01, 0x02},
            wantValue: []byte{0x00, 0x01, 0x02},
        },
        {
            name:    "get nonexistent key",
            key:     []byte("missing"),
            wantErr: true,
        },
        {
            name:  "overwrite value",
            key:   []byte("counter"),
            value: []byte("2"),
            setupFn: func(s *ShardedStore) {
                s.Put(context.Background(), []byte("counter"), []byte("1"))
            },
            wantValue: []byte("2"),
        },
    }
    
    for _, tt := range tests {
        t.Run(tt.name, func(t *testing.T) {
            store := NewShardedStore(4)
            if tt.setupFn != nil {
                tt.setupFn(store)
            }
            
            store.Put(context.Background(), tt.key, tt.value)
            got, err := store.Get(context.Background(), tt.key)
            
            if (err != nil) != tt.wantErr {
                t.Errorf("error = %v, wantErr %v", err, tt.wantErr)
            }
            
            if !tt.wantErr && !bytes.Equal(got, tt.wantValue) {
                t.Errorf("got %v, want %v", got, tt.wantValue)
            }
        })
    }
}

Benchmarking Basics

Benchmarks measure performance characteristics.

func BenchmarkGet(b *testing.B) {
    store := NewShardedStore(16)
    store.Put(context.Background(), []byte("key"), []byte("value"))
    
    b.ResetTimer()  // Don't count setup time
    
    for i := 0; i < b.N; i++ {
        store.Get(context.Background(), []byte("key"))
    }
}

// Run: go test -bench=. -benchmem
// Output: BenchmarkGet-8   5000000   256 ns/op   64 B/alloc
//                                     ^timing     ^memory

Interpreting results:

  • 5000000 iterations

  • 256 ns/op - nanoseconds per operation

  • 64 B/alloc - bytes allocated per operation

Property-Based Testing Introduction

Property-based testing generates random inputs and verifies invariants.

func TestStoreProperties(t *testing.T) {
    // Property: After Put, Get returns same value
    for trial := 0; trial < 100; trial++ {
        store := NewShardedStore(4)
        
        key := []byte(randomString(10))
        value := []byte(randomString(100))
        
        store.Put(context.Background(), key, value)
        got, _ := store.Get(context.Background(), key)
        
        if !bytes.Equal(got, value) {
            t.Errorf("trial %d: put-get invariant failed", trial)
        }
    }
}

func TestConcurrentSafety(t *testing.T) {
    // Property: Concurrent operations don't cause data corruption
    store := NewShardedStore(16)
    done := make(chan bool)
    
    // Writers
    for w := 0; w < 10; w++ {
        go func(writerID int) {
            for i := 0; i < 1000; i++ {
                key := fmt.Sprintf("key:%d:%d", writerID, i)
                store.Put(context.Background(), []byte(key), []byte("value"))
            }
            done <- true
        }(w)
    }
    
    // Readers
    for r := 0; r < 10; r++ {
        go func() {
            for i := 0; i < 1000; i++ {
                for w := 0; w < 10; w++ {
                    key := fmt.Sprintf("key:%d:%d", w, i)
                    store.Get(context.Background(), []byte(key))
                }
            }
            done <- true
        }()
    }
    
    // Wait for all goroutines
    for i := 0; i < 20; i++ {
        <-done
    }
    
    // If we get here without panic or corruption, test passes
    t.Log("concurrent safety property verified")
}

Hands-on Lab: Implement thread-unsafe in-memory store

Build a basic key-value store without concurrency concerns.

Lab Tasks:

  1. Implement Get method with proper validation

  2. Implement Put method with value copying

  3. Implement Delete method

  4. Write comprehensive tests (basic operations, edge cases)

  5. Add benchmarks for Get, Put, Delete operations

  6. Test with binary data and various key types

  7. Achieve >90% code coverage

Summary & Learning Outcomes

What You’ve Learned:

  • Data structure trade-offs and when to use each

  • Go’s native map implementation details

  • Thread safety considerations and solutions

  • Value serialization strategies

  • Comprehensive testing techniques

  • Benchmarking and performance measurement

Key Takeaways

  • Choose the right data structure for your use case

  • Always consider thread safety in concurrent environments

  • Test-driven development leads to better code quality

  • Benchmark early and often to catch performance issues

  • Property-based testing helps find edge cases