Learning Objectives

Understand LSM Tree architecture and its benefits
Implement Skip Lists for sorted memtables
Build Sorted String Tables (SSTables) for disk storage
Master dual memtable design for concurrent access
Implement background flushing mechanisms

Lesson 5.1: LSM Tree Architecture

The Problem with Append-Only Logs

Current system has limitations that LSM Trees solve:

Current System Issues:

Data stored in memory (HashMap) - not sorted
All writes logged to disk (WAL) - not indexed
No efficient range queries
Recovery must replay entire log

In-Memory:
  "user:1" → "alice"
  "user:2" → "bob"
  "user:3" → "charlie"
  (randomly ordered)

To find all keys between "user:1" and "user:3":
  └─ Must scan all keys (no index)
  └─ O(n) operation

To persist to disk efficiently:
  └─ WAL is append-only, not indexed
  └─ Recovery must replay entire log

LSM Tree Overview

LSM = Log-Structured Merge Tree The key insight: Optimize for writes, not reads

┌────────────────────────────────────┐
│         In-Memory (Memtable)       │
│    Sorted structure (SkipList)     │
│  Size limit: 2MB                   │
└────────────────────────────────────┘
           │
           │ (When full)
           ▼
┌────────────────────────────────────┐
│    Level 0: SSTables (on disk)     │
│  Multiple small files (0-4)        │
└────────────────────────────────────┘
           │
           │ (Compaction)
           ▼
┌────────────────────────────────────┐
│    Level 1: SSTables (on disk)     │
│  Fewer, larger files (10-20)       │
└────────────────────────────────────┘
           │
           │ (Compaction)
           ▼
┌────────────────────────────────────┐
│    Level 2: SSTables (on disk)     │
│  Even fewer, even larger files     │
└────────────────────────────────────┘

Memtable and Immutable Memtable

Current problem: Single memtable causes contention between reads and writes.

Solution: Dual memtables

Writer goroutines:
  └─ Write to Active memtable

Reader goroutines:
  ├─ Search Active memtable
  └─ Search Immutable memtable

Flusher goroutine (background):
  ├─ When Active full
  ├─ Swap: Active → Immutable
  ├─ Create new Active
  └─ Flush Immutable to disk

Timeline:
t0: Active = MT0 (empty), Immutable = none
t1: Writes to MT0
t2: MT0 full → Immutable = MT0, Active = MT1 (new)
t3: Writes to MT1
t4: Flusher: Flush Immutable to disk
t5: Immutable = MT1, Active = MT2

Benefits:

Writes don’t block on disk I/O
Reads can search both memtables
Flushing happens in background

Sorted String Tables (SSTables)

What is an SSTable? A file on disk containing sorted key-value pairs.

SSTable format:
┌──────────────────────────────┐
│ Header                       │
│ - Version: 1                 │
│ - EntryCount: 1000           │
│ - MinKey: "user:100"         │
│ - MaxKey: "user:999"         │
│ - BloomFilter: (optional)    │
├──────────────────────────────┤
│ Data Blocks                  │
│ Block 0: Entries 0-99        │
│ Block 1: Entries 100-199     │
│ ... (more blocks)            │
│ Block N: Entries 900-999     │
├──────────────────────────────┤
│ Index Block                  │
│ - Block 0 key range          │
│ - Block 1 key range          │
│ - ...                        │
└──────────────────────────────┘

Advantages over WAL:

Sorted keys enable range queries
Index allows binary search
Immutable (no overwrites)
Easy to delete entire file

Multi-level Compaction

Compaction Strategy:

Level 0 (Size: 4MB, Files: 4)
├─ sstable-0.db (1MB)
├─ sstable-1.db (1MB)
├─ sstable-2.db (1MB)
└─ sstable-3.db (1MB)

When adding 5th file → Compact to Level 1:
  └─ Merge all 4 files + new file
  └─ Write to Level 1 (5MB)

Level 1 (Size: 10MB, Files: 2)
├─ sstable-0-1.db (5MB)
└─ sstable-1-1.db (5MB)

When Level 1 exceeds 100MB → Compact to Level 2:
  └─ Merge overlapping files
  └─ Write to Level 2 (100MB+)

Pattern:
├─ Level 0: Small, recent data
├─ Level 1: Larger, older data
├─ Level 2: Large, very old data
└─ Each level ~10x larger than previous

Read and Write Amplification

Write Amplification

How many times data is written:

Without LSM (direct to disk):
  Write amplification: 1x
  (data written once)

With LSM (simplified):
  Level 0: Write once
  Compact to L1: Rewrite once
  Compact to L2: Rewrite again
  Total: ~3-5x

Better with good compaction strategy:
  └─ Leveled compaction: ~10x
  └─ Tiered compaction: ~2-5x

Read Amplification

How many files checked per read:

Without structure (WAL):
  Read amplification: N (scan all)
  
With LSM:
  Read key "user:500":
  1. Check memtable (instant)
  2. Check Level 0 (~4 files, sorted)
  3. Use index/bloom filter
  4. Check Level 1 (1 file via binary search)
  
  Total: ~5-10 seeks (vs scanning N files)

Trade-off:

LSM trades write efficiency (multiple writes) for read efficiency (fewer files to check).

Lesson 5.2: Memtable Implementation

Skip List Data Structure

We need a sorted, concurrent data structure for the memtable.

Option 1: Red-Black Tree

Balanced: O(log n)
Sorted: Yes
Complex rebalancing

Option 2: Skip List

Level 3:           17 ────────────────────┐
                                          │
Level 2:     10 ────── 17 ──────── 26 ────┤
                                    │      │
Level 1:  6 ── 10 ── 17 ── 26 ── 42 ┤     │
          │    │     │     │    │   │     │
Level 0:  6 ── 10 ── 17 ── 26 ── 42 ── 73 ┴─ 99

Probabilistic: O(log n) average
Sorted: Yes
Lock-free variants possible
Simple to implement

Skip List Implementation

package storage

import (
    "math"
    "math/rand"
    "sync"
)

const (
    MaxLevel = 16
    pFactor  = 0.5
)

type SkipNode struct {
    key   []byte
    value []byte
    next  []*SkipNode
}

type SkipList struct {
    head     *SkipNode
    level    int
    mu       sync.RWMutex
    size     int64
}

// NewSkipList creates empty skip list
func NewSkipList() *SkipList {
    head := &SkipNode{
        key:  nil,
        next: make([]*SkipNode, MaxLevel),
    }
    return &SkipList{
        head:  head,
        level: 1,
        size:  0,
    }
}

// randomLevel generates random level for new node
func randomLevel() int {
    level := 1
    for level < MaxLevel && rand.Float64() < pFactor {
        level++
    }
    return level
}

// Put inserts or updates a key
func (sl *SkipList) Put(key, value []byte) {
    sl.mu.Lock()
    defer sl.mu.Unlock()
    
    level := randomLevel()
    update := make([]*SkipNode, MaxLevel)
    current := sl.head
    
    // Find update nodes at each level
    for i := sl.level - 1; i >= 0; i-- {
        for current.next[i] != nil &&
            bytesLess(current.next[i].key, key) {
            current = current.next[i]
        }
        update[i] = current
    }
    
    // Check if key exists
    if current.next[0] != nil && bytesEqual(current.next[0].key, key) {
        current.next[0].value = value
        return
    }
    
    // Create new node
    newNode := &SkipNode{
        key:  key,
        value: value,
        next: make([]*SkipNode, level),
    }
    
    // Insert at each level
    for i := 0; i < level; i++ {
        newNode.next[i] = update[i].next[i]
        update[i].next[i] = newNode
    }
    
    // Update level if needed
    if level > sl.level {
        for i := sl.level; i < level; i++ {
            sl.head.next[i] = newNode
        }
        sl.level = level
    }
    
    sl.size++
}

Flush Triggers and Thresholds

// Memtable with auto-flush
type Memtable struct {
    active      *SkipList
    immutable   *SkipList
    mu          sync.RWMutex
    
    maxSize     int64
    currentSize int64
    
    flushCh     chan *SkipList
}

// Put with auto-flush
func (m *Memtable) Put(key, value []byte) error {
    m.mu.Lock()
    
    // Check if should flush
    newSize := m.currentSize + int64(len(key)+len(value))
    if newSize > m.maxSize {
        // Swap memtables
        m.immutable = m.active
        m.active = NewSkipList()
        m.currentSize = 0
        
        // Signal flush
        select {
        case m.flushCh <- m.immutable:
        default:
            // Flush queue full, apply backpressure
            m.mu.Unlock()
            m.flushCh <- m.immutable
            m.mu.Lock()
        }
    }
    
    m.active.Put(key, value)
    m.currentSize = newSize
    m.mu.Unlock()
    
    return nil
}

// Get checks both memtables
func (m *Memtable) Get(key []byte) ([]byte, bool) {
    m.mu.RLock()
    defer m.mu.RUnlock()
    
    // Check active
    if v, ok := m.active.Get(key); ok {
        return v, ok
    }
    
    // Check immutable
    if v, ok := m.immutable.Get(key); ok {
        return v, ok
    }
    
    return nil, false
}

Background Flushing Goroutines

// StartFlusher starts background flushing
func (db *DB) StartFlusher() {
    go func() {
        for sl := range db.memtable.flushCh {
            if err := db.flushMemtable(sl); err != nil {
                log.Printf("flush error: %v", err)
            }
        }
    }()
}

// flushMemtable converts memtable to SSTable
func (db *DB) flushMemtable(mt *SkipList) error {
    // Create SSTable writer
    sstPath := filepath.Join(db.dataDir, 
        fmt.Sprintf("sstable-%d.sst", time.Now().UnixNano()))
    
    writer, err := NewSSTWriter(sstPath)
    if err != nil {
        return err
    }
    defer writer.Close()
    
    // Iterate and write all entries
    iter := mt.NewIterator()
    for iter.Next() {
        writer.Add(iter.Key(), iter.Value())
    }
    
    return writer.Finish()
}

Hands-on Lab: Build memtable with automatic flushing

Implement a complete memtable system with Skip Lists and background flushing.

Lab Tasks:

Implement Skip List data structure
Build dual memtable system (active + immutable)
Add automatic flush triggers
Implement background flushing goroutine
Create SSTable writer for persistence
Write comprehensive tests
Benchmark concurrent performance

Summary & Learning Outcomes

What You’ve Learned:

LSM Tree architecture and multi-level design
Skip List implementation for sorted memtables
SSTable format and advantages over WAL
Dual memtable design for concurrent access
Background flushing mechanisms
Read/write amplification trade-offs

Key Takeaways

LSM Trees optimize for writes while maintaining read efficiency
Skip Lists provide O(log n) sorted operations with simple implementation
Dual memtables eliminate read/write contention
Background flushing keeps writes non-blocking
SSTables enable efficient range queries and indexing

SSTables & LSM trees

Course Navigation