Building the Core Data Structure

Learning Objectives

  • Choose the right data structure for your key-value store
  • Implement HashMap with proper hash functions
  • Handle value serialization strategies
  • Build comprehensive testing frameworks
  • Master Go's testing and benchmarking tools

Lesson 2.1: HashMap Implementation

Choosing the Right Data Structure

Before we implement, let's understand the options and trade-offs for our core in-memory key-value store.

Option 1: Go's Native Map (map[string]interface{})

Characteristics:
  • Built-in hash table implementation
  • O(1) average case lookup
  • Not thread-safe (race conditions with concurrent access)
  • Keys are unordered (iteration order is randomized)
  • Fast for basic use cases
When to use:
  • Single-threaded applications
  • Prototyping and testing
  • When simplicity matters more than concurrency

Option 2: Custom HashMap with Sharding

Characteristics:
  • Divide keys into N shards (sub-maps)
  • Each shard has its own lock
  • Multiple goroutines can operate on different shards concurrently
  • More complex, but much better concurrency
  • Still unordered
Performance improvement (illustrative figures; actual numbers depend on workload and hardware):
  • Single lock: ~100K ops/sec (8 cores bottlenecked on one mutex)
  • 16 shards: ~1.2M ops/sec (8 cores actually utilized)
When to use:
  • Multi-threaded applications
  • High concurrency, many readers and writers
  • Good balance of simplicity and performance

Option 3: Skip List

Characteristics:
  • Probabilistic data structure that maintains sorted order (node layout sketched below)
  • O(log n) lookup, insert, delete
  • Supports range queries efficiently
  • Can be made lock-free with careful design
  • More memory overhead than hash tables (extra forward pointers per node)
When to use:
  • Need sorted iteration (range queries)
  • Building an LSM tree memtable
  • Distributed systems (Bigtable, HBase use skip lists)
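
For readers who have not seen one, here is a minimal, illustrative sketch of a skip-list node in Go. The type and field names are assumptions for illustration, not part of our store:

// Each node keeps one forward pointer per level; higher levels skip over
// many nodes at once, which is what gives O(log n) search over sorted keys.
type skipNode struct {
    key   string
    value []byte
    next  []*skipNode // next[i] is the successor of this node at level i
}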

Option 4: B-Tree or B+ Tree

Characteristics:
  • Balanced tree structure
  • O(log n) operations
  • High degree (many children per node)
  • Excellent disk I/O patterns
  • More complex implementation
When to use:
  • Disk-based storage (SSTables)
  • Very large datasets
  • Need to minimize disk seeks

Decision: Sharded HashMap for MVP

For Week 2 (MVP): We'll implement a sharded hash map because:

  • ✓ Good balance of simplicity and performance
  • ✓ Thread-safe for concurrent operations
  • ✓ Sufficient for most use cases
  • ✓ Foundation for learning concurrency patterns

Later (Week 5): We'll add sorted keys via LSM tree when we add persistence.

Go's Native Map vs Custom Implementation

Go's Map Under the Hood

type hmap struct {
    count      int           // number of elements
    flags      uint8         // flags tracking rehashing state
    B          uint8         // log_2 of buckets array size
    noverflow  uint16        // overflow buckets count
    hash0      uint32        // hash seed
    buckets    unsafe.Pointer // bucket array
    oldbuckets unsafe.Pointer // previous bucket array (during resize)
    nevacuate  uintptr       // progress of old bucket evacuation
    extra      *mapextra     // overflow buckets, if any
}

// Conceptual bucket layout: the real runtime bmap declares only tophash;
// keys, values, padding, and the overflow pointer follow it implicitly in memory.
type bmap struct {
    tophash  [8]uint8      // top hash byte for each of the 8 entries
    keys     [8]keytype    // keys, stored contiguously
    values   [8]valuetype  // values, stored contiguously
    overflow *bmap         // overflow bucket for the collision chain
}

Key Points:

  • Bucket size: 8 entries per bucket (cache line optimization)
  • Collision handling: overflow buckets linked as a chain
  • Growth: Doubles the bucket count when the average load exceeds 6.5 entries per bucket (about 81% full)
  • Rehashing: Incremental during operations, not all at once

Why We Can't Just Use map[string]interface{}

Problem: Not thread-safe

// Problem: Not thread-safe
m := make(map[string]interface{})

go func() {
    m["key1"] = "value1"  // goroutine 1 writes
}()

go func() {
    _ = m["key1"]         // goroutine 2 reads
}()
// Result: data race. The runtime usually aborts with a fatal
// "concurrent map read and map write" error, or data is silently corrupted.
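
Go ships a race detector that catches exactly this kind of bug; running the program or its tests with the -race flag reports conflicting accesses along with stack traces:

go run -race main.go      # main.go stands in for whatever file holds the program
go test -race ./...       # run the whole test suite under the race detector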

Thread Safety Considerations

What Makes Maps Unsafe?

During a resize operation, the map is internally inconsistent. Two concurrent operations can see different states:

Goroutine A: reading key="foo" during a resize
    1. Compute hash("foo") = 0x123abc
    2. Look in old bucket[3]
    3. Find value = "bar"
    4. Return "bar"

Goroutine B: writing key="baz" = "qux" during the same resize
    1. Modify the bucket array with no exclusive access enforced
    2. Evacuate entries from the old buckets into the new, larger array
    3. Goroutine A can now observe a half-moved, inconsistent state

Solution: Synchronization

Option 1: Coarse-grained lock (simple, slower)

type Map struct {
    mu    sync.RWMutex
    data  map[string]interface{}
}

func (m *Map) Get(key string) interface{} {
    m.mu.RLock()
    defer m.mu.RUnlock()
    return m.data[key]
}

Option 2: Fine-grained locking (complex, faster)

type ShardedMap struct {
    shards []*mapShard
}

type mapShard struct {
    mu    sync.RWMutex
    data  map[string]interface{}
}

Option 3: sync.Map (Go 1.9+, specific use case)

var m sync.Map

m.Store("key", "value")
val, _ := m.Load("key")

When to use each:

  • Coarse-grained: Simple, fine for modest concurrency
  • Sharded: High concurrency, many goroutines
  • sync.Map: Read-heavy workloads where keys are written once and read many times, or where goroutines touch disjoint key sets

Lesson 2.2: Basic Operations (Get, Put, Delete)

Architecture of Our Sharded HashMap

┌─────────────────────────────────┐
│   ShardedMap (thread-safe)      │
├─────────────────────────────────┤
│ shards [16]                     │
└───┬──────────┬───────────────┬──┘
    │          │               │
    ▼          ▼               ▼
┌────────┐ ┌────────┐     ┌─────────┐
│Shard 0 │ │Shard 1 │ ... │Shard 15 │
├────────┤ ├────────┤     ├─────────┤
│Lock    │ │Lock    │     │Lock     │
│HashMap │ │HashMap │     │HashMap  │
└────────┘ └────────┘     └─────────┘

Implementing Get Operation

// Full implementation with documentation
type ShardedMap struct {
    shards []*mapShard
}

type mapShard struct {
    mu   sync.RWMutex
    data map[string]interface{}
}

func NewShardedMap(shardCount int) *ShardedMap {
    shards := make([]*mapShard, shardCount)
    for i := 0; i < shardCount; i++ {
        shards[i] = &mapShard{
            data: make(map[string]interface{}),
        }
    }
    return &ShardedMap{shards: shards}
}

// Get retrieves a value by key
// Returns (value, exists)
// Time complexity: O(1) average case
// Thread-safe: Multiple readers allowed simultaneously
func (m *ShardedMap) Get(key string) (interface{}, bool) {
    // Step 1: Determine which shard owns this key
    shard := m.getShard(key)
    
    // Step 2: Acquire read lock (allows multiple concurrent readers)
    shard.mu.RLock()
    defer shard.mu.RUnlock()
    
    // Step 3: Look up value in shard's map
    val, exists := shard.data[key]
    return val, exists
}

// getShard returns the shard responsible for a given key
// Uses hash function to distribute keys uniformly
func (m *ShardedMap) getShard(key string) *mapShard {
    // Hash the key to get a number
    h := fnv.New64a()
    h.Write([]byte(key))
    
    // Map hash to shard index using modulo
    shardIndex := int(h.Sum64() % uint64(len(m.shards)))
    return m.shards[shardIndex]
}

Implementing Put Operation

// Put stores a value at the given key
// If key exists, overwrites the value
// Time complexity: O(1) average case
// Thread-safe: Exclusive access to shard during write
func (m *ShardedMap) Put(key string, value interface{}) error {
    // Input validation
    if key == "" {
        return errors.New("key cannot be empty")
    }
    if value == nil {
        return errors.New("value cannot be nil")
    }
    
    // Step 1: Get the appropriate shard
    shard := m.getShard(key)
    
    // Step 2: Acquire write lock (exclusive access)
    shard.mu.Lock()
    defer shard.mu.Unlock()
    
    // Step 3: Store in map
    shard.data[key] = value
    
    return nil
}

// Thread safety notes:
// - Lock is held for entire operation
// - Other goroutines cannot access this shard during Put
// - Readers on OTHER shards still proceed normally (parallelism)

Implementing Delete Operation

// Delete removes a key and its value
// Returns true if key existed, false otherwise
// Time complexity: O(1) average case
// Thread-safe: Exclusive access to shard during deletion
func (m *ShardedMap) Delete(key string) bool {
    shard := m.getShard(key)
    
    shard.mu.Lock()
    defer shard.mu.Unlock()
    
    _, existed := shard.data[key]
    delete(shard.data, key)
    return existed
}
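
Putting the three operations together, here is a minimal usage sketch. It assumes the ShardedMap code above lives in the same package; package main and the imports are included only to make the snippet self-contained:

package main

import (
    "fmt"
    "log"
)

func main() {
    m := NewShardedMap(16)

    // Write, read, then delete a single key.
    if err := m.Put("user:1", "alice"); err != nil {
        log.Fatal(err)
    }

    if v, ok := m.Get("user:1"); ok {
        fmt.Println("got:", v) // got: alice
    }

    if m.Delete("user:1") {
        fmt.Println("deleted user:1")
    }
    if _, ok := m.Get("user:1"); !ok {
        fmt.Println("user:1 is no longer present")
    }
}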

Handling Key Collisions

Hash Collision: Two different keys hash to the same value

Example, using a toy hash function hash(key) = len(key) % 16:
  hash("a") = 1
  hash("bcd") = 3
  hash("e") = 1  <- COLLISION with "a"!
How Go's Map Handles Collisions:
  1. The bucket array has, say, 16 buckets
  2. Each bucket stores up to 8 entries inline
  3. A hash collision lands two keys in the same bucket
  4. If the bucket is full, an overflow bucket is chained on (linked list)
Our Sharded Map:
  • First level: the key's hash, modulo the number of shards, picks a shard
  • Second level: Go's map inside each shard resolves collisions with its own bucket chaining (exercised by the test sketch below)
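
To gain confidence that collisions are handled correctly end to end, a hedged test sketch against the interface{}-based ShardedMap above writes far more keys than there are shards, which guarantees many keys per shard, and then checks that every value is still retrievable:

func TestManyKeysSurviveCollisions(t *testing.T) {
    m := NewShardedMap(16)

    // 10,000 keys across 16 shards forces hundreds of keys into each shard.
    for i := 0; i < 10000; i++ {
        key := fmt.Sprintf("key-%d", i)
        if err := m.Put(key, i); err != nil {
            t.Fatalf("Put(%q) failed: %v", key, err)
        }
    }

    for i := 0; i < 10000; i++ {
        key := fmt.Sprintf("key-%d", i)
        v, ok := m.Get(key)
        if !ok || v.(int) != i {
            t.Fatalf("Get(%q) = %v, %v; want %d, true", key, v, ok, i)
        }
    }
}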

Value Serialization Strategies

Our in-memory store stores interface{} which can be any Go type. But what about networking and persistence?

Option 1: JSON Serialization

type JSONCodec struct{}

func (c JSONCodec) Encode(v interface{}) ([]byte, error) {
    return json.Marshal(v)
}

func (c JSONCodec) Decode(data []byte, v interface{}) error {
    return json.Unmarshal(data, v)
}

// Usage:
codec := JSONCodec{}
data, _ := codec.Encode("hello")  // => []byte(`"hello"`)
var s string
codec.Decode(data, &s)  // => s = "hello"
Pros:
  • Human-readable
  • Widely supported
Cons:
  • Verbose
  • Slow
  • Type information lost (see the example below)
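
"Type information lost" is easy to demonstrate: encoding/json decodes every JSON number into a float64 when the target is interface{}, so an int64 does not round-trip as an int64:

// encoding/json turns any JSON number into float64 when decoding into interface{}
var v interface{}
_ = json.Unmarshal([]byte(`42`), &v)
fmt.Printf("%T\n", v)  // prints: float64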

Option 2: Protocol Buffers

syntax = "proto3";

message Value {
    oneof data {
        string string_value = 1;
        int64 int_value = 2;
        bytes bytes_value = 3;
    }
}
Pros:
  • Compact
  • Fast
  • Schema versioning
Cons:
  • Requires code generation
  • Steeper learning curve

Option 3: Custom Binary Encoding

// Simple binary format for our use case
type BinaryCodec struct{}

// Format: [type_byte][payload]; strings and byte slices are prefixed with a 4-byte big-endian length
func (c BinaryCodec) Encode(v interface{}) ([]byte, error) {
    buf := bytes.NewBuffer(nil)
    
    switch val := v.(type) {
    case string:
        buf.WriteByte(1)  // type: string
        binary.Write(buf, binary.BigEndian, int32(len(val)))
        buf.WriteString(val)
    case int64:
        buf.WriteByte(2)  // type: int64
        binary.Write(buf, binary.BigEndian, val)
    case []byte:
        buf.WriteByte(3)  // type: bytes
        binary.Write(buf, binary.BigEndian, int32(len(val)))
        buf.Write(val)
    default:
        return nil, errors.New("unsupported type")
    }
    
    return buf.Bytes(), nil
}
Pros:
  • Minimal overhead
  • Fast
  • Control over format
Cons:
  • Manual implementation (a matching Decode sketch follows below)
  • Limited type support
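
The codec also needs the reverse direction. Here is a hedged Decode sketch that mirrors the Encode format above; returning the decoded value instead of filling a caller-supplied pointer is an assumption made for brevity (uses bytes, encoding/binary, errors, and io from the standard library):

func (c BinaryCodec) Decode(data []byte) (interface{}, error) {
    if len(data) < 1 {
        return nil, errors.New("empty payload")
    }
    buf := bytes.NewReader(data[1:])

    // readLen reads the 4-byte big-endian length prefix used by variable-size types.
    readLen := func() (int32, error) {
        var n int32
        err := binary.Read(buf, binary.BigEndian, &n)
        if err == nil && n < 0 {
            err = errors.New("negative length prefix")
        }
        return n, err
    }

    switch data[0] {
    case 1: // string: 4-byte length, then the bytes
        n, err := readLen()
        if err != nil {
            return nil, err
        }
        s := make([]byte, n)
        if _, err := io.ReadFull(buf, s); err != nil {
            return nil, err
        }
        return string(s), nil
    case 2: // int64: 8 bytes, big-endian
        var v int64
        if err := binary.Read(buf, binary.BigEndian, &v); err != nil {
            return nil, err
        }
        return v, nil
    case 3: // bytes: 4-byte length, then raw bytes
        n, err := readLen()
        if err != nil {
            return nil, err
        }
        b := make([]byte, n)
        if _, err := io.ReadFull(buf, b); err != nil {
            return nil, err
        }
        return b, nil
    default:
        return nil, errors.New("unknown type byte")
    }
}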

Decision for our course: Start with storing byte slices ([]byte)

This simplifies serialization and forces explicit encoding decisions.

Lesson 2.3: Testing Strategy

Unit Testing Fundamentals

Test Structure in Go - Test functions must start with "Test"

// Filename: store_test.go (same directory as store.go)
// Test functions must start with "Test"

func TestStoreGet(t *testing.T) {
    // Arrange: Set up test fixtures
    store := NewShardedStore(4)
    store.Put(context.Background(), []byte("key1"), []byte("value1"))
    
    // Act: Execute the code under test
    result, err := store.Get(context.Background(), []byte("key1"))
    
    // Assert: Verify the result
    if err != nil {
        t.Fatalf("unexpected error: %v", err)
    }
    
    if string(result) != "value1" {
        t.Errorf("got %q, want %q", string(result), "value1")
    }
}

Subtests for Organization

func TestStore(t *testing.T) {
    store := NewShardedStore(4)
    
    t.Run("Put and Get", func(t *testing.T) {
        store.Put(context.Background(), []byte("k"), []byte("v"))
        v, _ := store.Get(context.Background(), []byte("k"))
        if string(v) != "v" {
            t.Errorf("got %q, want %q", string(v), "v")
        }
    })
    
    t.Run("Get missing key", func(t *testing.T) {
        _, err := store.Get(context.Background(), []byte("missing"))
        if err != ErrKeyNotFound {
            t.Errorf("got %v, want ErrKeyNotFound", err)
        }
    })
    
    t.Run("Delete", func(t *testing.T) {
        store.Put(context.Background(), []byte("k2"), []byte("v2"))
        store.Delete(context.Background(), []byte("k2"))
        _, err := store.Get(context.Background(), []byte("k2"))
        if err != ErrKeyNotFound {
            t.Errorf("expected key to be deleted")
        }
    })
}

Table-Driven Tests

Table-driven tests reduce duplication and make it easy to add more test cases.

func TestStoreOperations(t *testing.T) {
    tests := []struct {
        name      string
        key       []byte
        value     []byte
        setupFn   func(*ShardedStore)
        wantValue []byte
        wantErr   bool
    }{
        {
            name:      "put and get string",
            key:       []byte("user:1"),
            value:     []byte("alice"),
            wantValue: []byte("alice"),
        },
        {
            name:      "put and get binary",
            key:       []byte("binary:1"),
            value:     []byte{0x00, 0x01, 0x02},
            wantValue: []byte{0x00, 0x01, 0x02},
        },
        {
            name:    "get nonexistent key",
            key:     []byte("missing"),
            wantErr: true,
        },
        {
            name:  "overwrite value",
            key:   []byte("counter"),
            value: []byte("2"),
            setupFn: func(s *ShardedStore) {
                s.Put(context.Background(), []byte("counter"), []byte("1"))
            },
            wantValue: []byte("2"),
        },
    }
    
    for _, tt := range tests {
        t.Run(tt.name, func(t *testing.T) {
            store := NewShardedStore(4)
            if tt.setupFn != nil {
                tt.setupFn(store)
            }
            
            if tt.value != nil {
                store.Put(context.Background(), tt.key, tt.value)
            }
            got, err := store.Get(context.Background(), tt.key)
            
            if (err != nil) != tt.wantErr {
                t.Errorf("error = %v, wantErr %v", err, tt.wantErr)
            }
            
            if !tt.wantErr && !bytes.Equal(got, tt.wantValue) {
                t.Errorf("got %v, want %v", got, tt.wantValue)
            }
        })
    }
}

Benchmarking Basics

Benchmarks measure performance characteristics.

func BenchmarkGet(b *testing.B) {
    store := NewShardedStore(16)
    store.Put(context.Background(), []byte("key"), []byte("value"))
    
    b.ResetTimer()  // Don't count setup time
    
    for i := 0; i < b.N; i++ {
        store.Get(context.Background(), []byte("key"))
    }
}

// Run: go test -bench=. -benchmem
// Output: BenchmarkGet-8   5000000   256 ns/op   64 B/op
//                                    ^timing     ^memory allocated per op

Interpreting results:

  • 5000000 - number of iterations the benchmark ran
  • 256 ns/op - nanoseconds per operation
  • 64 B/op - bytes allocated per operation
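
Because the whole point of sharding is concurrency, it is also worth benchmarking under parallel load. testing.B's RunParallel runs the body from GOMAXPROCS goroutines; the sketch below assumes the same NewShardedStore constructor and byte-slice API as the earlier test examples:

func BenchmarkGetParallel(b *testing.B) {
    store := NewShardedStore(16)
    for i := 0; i < 1000; i++ {
        key := []byte(fmt.Sprintf("key:%d", i))
        store.Put(context.Background(), key, []byte("value"))
    }

    b.ResetTimer()
    b.RunParallel(func(pb *testing.PB) {
        i := 0
        for pb.Next() {
            key := []byte(fmt.Sprintf("key:%d", i%1000))
            store.Get(context.Background(), key)
            i++
        }
    })
}

// Run: go test -bench=GetParallel -benchmem -cpu=1,4,8
// Comparing the -cpu columns shows how throughput scales with cores.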

Property-Based Testing Introduction

Property-based testing generates random inputs and verifies invariants.

func TestStoreProperties(t *testing.T) {
    // Property: After Put, Get returns same value
    for trial := 0; trial < 100; trial++ {
        store := NewShardedStore(4)
        
        key := []byte(randomString(10))
        value := []byte(randomString(100))
        
        store.Put(context.Background(), key, value)
        got, _ := store.Get(context.Background(), key)
        
        if !bytes.Equal(got, value) {
            t.Errorf("trial %d: put-get invariant failed", trial)
        }
    }
}
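
The test above assumes a randomString helper. A minimal sketch using math/rand is enough for generating test inputs (it is not suitable for anything security-sensitive):

// randomString returns a pseudo-random lowercase string of length n (test helper).
func randomString(n int) string {
    const letters = "abcdefghijklmnopqrstuvwxyz"
    b := make([]byte, n)
    for i := range b {
        b[i] = letters[rand.Intn(len(letters))]
    }
    return string(b)
}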

func TestConcurrentSafety(t *testing.T) {
    // Property: Concurrent operations don't cause data corruption
    store := NewShardedStore(16)
    done := make(chan bool)
    
    // Writers
    for w := 0; w < 10; w++ {
        go func(writerID int) {
            for i := 0; i < 1000; i++ {
                key := fmt.Sprintf("key:%d:%d", writerID, i)
                store.Put(context.Background(), []byte(key), []byte("value"))
            }
            done <- true
        }(w)
    }
    
    // Readers
    for r := 0; r < 10; r++ {
        go func() {
            for i := 0; i < 1000; i++ {
                for w := 0; w < 10; w++ {
                    key := fmt.Sprintf("key:%d:%d", w, i)
                    store.Get(context.Background(), []byte(key))
                }
            }
            done <- true
        }()
    }
    
    // Wait for all goroutines
    for i := 0; i < 20; i++ {
        <-done
    }
    
    // If we get here without panic or corruption, test passes
    t.Log("concurrent safety property verified")
}

Hands-on Lab: Implement thread-unsafe in-memory store

Build a basic key-value store without concurrency concerns.

Lab Tasks:

  1. Implement Get method with proper validation
  2. Implement Put method with value copying
  3. Implement Delete method
  4. Write comprehensive tests (basic operations, edge cases)
  5. Add benchmarks for Get, Put, Delete operations
  6. Test with binary data and various key types
  7. Achieve >90% code coverage
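
One possible starting skeleton, assuming a byte-slice API in the spirit of the tests above (the names, and the absence of a context parameter, are suggestions rather than requirements):

// ErrKeyNotFound is returned when the requested key is absent.
var ErrKeyNotFound = errors.New("key not found")

// Store is a deliberately thread-unsafe in-memory key-value store.
type Store struct {
    data map[string][]byte
}

func NewStore() *Store {
    return &Store{data: make(map[string][]byte)}
}

// Put copies the value so later mutations of the caller's slice
// cannot change what the store holds.
func (s *Store) Put(key, value []byte) error {
    if len(key) == 0 {
        return errors.New("key cannot be empty")
    }
    v := make([]byte, len(value))
    copy(v, value)
    s.data[string(key)] = v
    return nil
}

func (s *Store) Get(key []byte) ([]byte, error) {
    v, ok := s.data[string(key)]
    if !ok {
        return nil, ErrKeyNotFound
    }
    return v, nil
}

func (s *Store) Delete(key []byte) error {
    if _, ok := s.data[string(key)]; !ok {
        return ErrKeyNotFound
    }
    delete(s.data, string(key))
    return nil
}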

Summary & Learning Outcomes

What You've Learned:

  • Data structure trade-offs and when to use each
  • Go's native map implementation details
  • Thread safety considerations and solutions
  • Value serialization strategies
  • Comprehensive testing techniques
  • Benchmarking and performance measurement

Key Takeaways

  • Choose the right data structure for your use case
  • Always consider thread safety in concurrent environments
  • Test-driven development leads to better code quality
  • Benchmark early and often to catch performance issues
  • Property-based testing helps find edge cases