Lesson 5
SSTables & LSM trees
Memtables, sorted tables, and read paths.
Course Navigation
Back to courseLearning Objectives
-
Understand LSM Tree architecture and its benefits
-
Implement Skip Lists for sorted memtables
-
Build Sorted String Tables (SSTables) for disk storage
-
Master dual memtable design for concurrent access
-
Implement background flushing mechanisms
Lesson 5.1: LSM Tree Architecture
The Problem with Append-Only Logs
Current system has limitations that LSM Trees solve:
Current System Issues:
-
Data stored in memory (HashMap) - not sorted
-
All writes logged to disk (WAL) - not indexed
-
No efficient range queries
-
Recovery must replay entire log
In-Memory:
"user:1" → "alice"
"user:2" → "bob"
"user:3" → "charlie"
(randomly ordered)
To find all keys between "user:1" and "user:3":
└─ Must scan all keys (no index)
└─ O(n) operation
To persist to disk efficiently:
└─ WAL is append-only, not indexed
└─ Recovery must replay entire log
LSM Tree Overview
LSM = Log-Structured Merge Tree The key insight: Optimize for writes, not reads
┌────────────────────────────────────┐
│ In-Memory (Memtable) │
│ Sorted structure (SkipList) │
│ Size limit: 2MB │
└────────────────────────────────────┘
│
│ (When full)
▼
┌────────────────────────────────────┐
│ Level 0: SSTables (on disk) │
│ Multiple small files (0-4) │
└────────────────────────────────────┘
│
│ (Compaction)
▼
┌────────────────────────────────────┐
│ Level 1: SSTables (on disk) │
│ Fewer, larger files (10-20) │
└────────────────────────────────────┘
│
│ (Compaction)
▼
┌────────────────────────────────────┐
│ Level 2: SSTables (on disk) │
│ Even fewer, even larger files │
└────────────────────────────────────┘
Memtable and Immutable Memtable
Current problem: Single memtable causes contention between reads and writes.
Solution: Dual memtables
Writer goroutines:
└─ Write to Active memtable
Reader goroutines:
├─ Search Active memtable
└─ Search Immutable memtable
Flusher goroutine (background):
├─ When Active full
├─ Swap: Active → Immutable
├─ Create new Active
└─ Flush Immutable to disk
Timeline:
t0: Active = MT0 (empty), Immutable = none
t1: Writes to MT0
t2: MT0 full → Immutable = MT0, Active = MT1 (new)
t3: Writes to MT1
t4: Flusher: Flush Immutable to disk
t5: Immutable = MT1, Active = MT2
Benefits:
-
Writes don’t block on disk I/O
-
Reads can search both memtables
-
Flushing happens in background
Sorted String Tables (SSTables)
What is an SSTable? A file on disk containing sorted key-value pairs.
SSTable format:
┌──────────────────────────────┐
│ Header │
│ - Version: 1 │
│ - EntryCount: 1000 │
│ - MinKey: "user:100" │
│ - MaxKey: "user:999" │
│ - BloomFilter: (optional) │
├──────────────────────────────┤
│ Data Blocks │
│ Block 0: Entries 0-99 │
│ Block 1: Entries 100-199 │
│ ... (more blocks) │
│ Block N: Entries 900-999 │
├──────────────────────────────┤
│ Index Block │
│ - Block 0 key range │
│ - Block 1 key range │
│ - ... │
└──────────────────────────────┘
Advantages over WAL:
-
Sorted keys enable range queries
-
Index allows binary search
-
Immutable (no overwrites)
-
Easy to delete entire file
Multi-level Compaction
Compaction Strategy:
Level 0 (Size: 4MB, Files: 4)
├─ sstable-0.db (1MB)
├─ sstable-1.db (1MB)
├─ sstable-2.db (1MB)
└─ sstable-3.db (1MB)
When adding 5th file → Compact to Level 1:
└─ Merge all 4 files + new file
└─ Write to Level 1 (5MB)
Level 1 (Size: 10MB, Files: 2)
├─ sstable-0-1.db (5MB)
└─ sstable-1-1.db (5MB)
When Level 1 exceeds 100MB → Compact to Level 2:
└─ Merge overlapping files
└─ Write to Level 2 (100MB+)
Pattern:
├─ Level 0: Small, recent data
├─ Level 1: Larger, older data
├─ Level 2: Large, very old data
└─ Each level ~10x larger than previous
Read and Write Amplification
Write Amplification
How many times data is written:
Without LSM (direct to disk):
Write amplification: 1x
(data written once)
With LSM (simplified):
Level 0: Write once
Compact to L1: Rewrite once
Compact to L2: Rewrite again
Total: ~3-5x
Better with good compaction strategy:
└─ Leveled compaction: ~10x
└─ Tiered compaction: ~2-5x
Read Amplification
How many files checked per read:
Without structure (WAL):
Read amplification: N (scan all)
With LSM:
Read key "user:500":
1. Check memtable (instant)
2. Check Level 0 (~4 files, sorted)
3. Use index/bloom filter
4. Check Level 1 (1 file via binary search)
Total: ~5-10 seeks (vs scanning N files)
Trade-off:
LSM trades write efficiency (multiple writes) for read efficiency (fewer files to check).
Lesson 5.2: Memtable Implementation
Skip List Data Structure
We need a sorted, concurrent data structure for the memtable.
Option 1: Red-Black Tree
50
/ \
30 70
/ \
20 40
-
Balanced: O(log n)
-
Sorted: Yes
-
Complex rebalancing
Option 2: Skip List
Level 3: 17 ────────────────────┐
│
Level 2: 10 ────── 17 ──────── 26 ────┤
│ │
Level 1: 6 ── 10 ── 17 ── 26 ── 42 ┤ │
│ │ │ │ │ │ │
Level 0: 6 ── 10 ── 17 ── 26 ── 42 ── 73 ┴─ 99
-
Probabilistic: O(log n) average
-
Sorted: Yes
-
Lock-free variants possible
-
Simple to implement
Skip List Implementation
package storage
import (
"math"
"math/rand"
"sync"
)
const (
MaxLevel = 16
pFactor = 0.5
)
type SkipNode struct {
key []byte
value []byte
next []*SkipNode
}
type SkipList struct {
head *SkipNode
level int
mu sync.RWMutex
size int64
}
// NewSkipList creates empty skip list
func NewSkipList() *SkipList {
head := &SkipNode{
key: nil,
next: make([]*SkipNode, MaxLevel),
}
return &SkipList{
head: head,
level: 1,
size: 0,
}
}
// randomLevel generates random level for new node
func randomLevel() int {
level := 1
for level < MaxLevel && rand.Float64() < pFactor {
level++
}
return level
}
// Put inserts or updates a key
func (sl *SkipList) Put(key, value []byte) {
sl.mu.Lock()
defer sl.mu.Unlock()
level := randomLevel()
update := make([]*SkipNode, MaxLevel)
current := sl.head
// Find update nodes at each level
for i := sl.level - 1; i >= 0; i-- {
for current.next[i] != nil &&
bytesLess(current.next[i].key, key) {
current = current.next[i]
}
update[i] = current
}
// Check if key exists
if current.next[0] != nil && bytesEqual(current.next[0].key, key) {
current.next[0].value = value
return
}
// Create new node
newNode := &SkipNode{
key: key,
value: value,
next: make([]*SkipNode, level),
}
// Insert at each level
for i := 0; i < level; i++ {
newNode.next[i] = update[i].next[i]
update[i].next[i] = newNode
}
// Update level if needed
if level > sl.level {
for i := sl.level; i < level; i++ {
sl.head.next[i] = newNode
}
sl.level = level
}
sl.size++
}
Flush Triggers and Thresholds
// Memtable with auto-flush
type Memtable struct {
active *SkipList
immutable *SkipList
mu sync.RWMutex
maxSize int64
currentSize int64
flushCh chan *SkipList
}
// Put with auto-flush
func (m *Memtable) Put(key, value []byte) error {
m.mu.Lock()
// Check if should flush
newSize := m.currentSize + int64(len(key)+len(value))
if newSize > m.maxSize {
// Swap memtables
m.immutable = m.active
m.active = NewSkipList()
m.currentSize = 0
// Signal flush
select {
case m.flushCh <- m.immutable:
default:
// Flush queue full, apply backpressure
m.mu.Unlock()
m.flushCh <- m.immutable
m.mu.Lock()
}
}
m.active.Put(key, value)
m.currentSize = newSize
m.mu.Unlock()
return nil
}
// Get checks both memtables
func (m *Memtable) Get(key []byte) ([]byte, bool) {
m.mu.RLock()
defer m.mu.RUnlock()
// Check active
if v, ok := m.active.Get(key); ok {
return v, ok
}
// Check immutable
if v, ok := m.immutable.Get(key); ok {
return v, ok
}
return nil, false
}
Background Flushing Goroutines
// StartFlusher starts background flushing
func (db *DB) StartFlusher() {
go func() {
for sl := range db.memtable.flushCh {
if err := db.flushMemtable(sl); err != nil {
log.Printf("flush error: %v", err)
}
}
}()
}
// flushMemtable converts memtable to SSTable
func (db *DB) flushMemtable(mt *SkipList) error {
// Create SSTable writer
sstPath := filepath.Join(db.dataDir,
fmt.Sprintf("sstable-%d.sst", time.Now().UnixNano()))
writer, err := NewSSTWriter(sstPath)
if err != nil {
return err
}
defer writer.Close()
// Iterate and write all entries
iter := mt.NewIterator()
for iter.Next() {
writer.Add(iter.Key(), iter.Value())
}
return writer.Finish()
}
Hands-on Lab: Build memtable with automatic flushing
Implement a complete memtable system with Skip Lists and background flushing.
Lab Tasks:
-
Implement Skip List data structure
-
Build dual memtable system (active + immutable)
-
Add automatic flush triggers
-
Implement background flushing goroutine
-
Create SSTable writer for persistence
-
Write comprehensive tests
-
Benchmark concurrent performance
Summary & Learning Outcomes
What You’ve Learned:
-
LSM Tree architecture and multi-level design
-
Skip List implementation for sorted memtables
-
SSTable format and advantages over WAL
-
Dual memtable design for concurrent access
-
Background flushing mechanisms
-
Read/write amplification trade-offs
Key Takeaways
-
LSM Trees optimize for writes while maintaining read efficiency
-
Skip Lists provide O(log n) sorted operations with simple implementation
-
Dual memtables eliminate read/write contention
-
Background flushing keeps writes non-blocking
-
SSTables enable efficient range queries and indexing