Lesson 6
Compaction & optimization
Merging levels, Bloom filters, and tuning.
Course Navigation
Back to courseLearning Objectives
-
Master different compaction strategies (size-tiered, leveled, universal)
-
Implement Bloom filters for 10-100x speedup on negative lookups
-
Build block caching for hot data optimization
-
Learn compression techniques and trade-offs
-
Implement batched writes and resource throttling
Lesson 6.1: Compaction Strategies
Why Compaction Matters
As the database operates, SSTables accumulate at different levels, causing performance degradation.
Without compaction:
Level 0: sstable-0, sstable-1, sstable-2, ... sstable-100 (100 files!)
Level 1: sstable-l1-0, sstable-l1-1, ...
Level 2: ...
Problem:
- Read "user:500": Must check 100+ files in Level 0
- Each file access = disk seek (expensive)
- Read latency degrades over time
- Disk space grows unbounded
With compaction:
Before:
Level 0: 100 small files (1-10MB each)
Level 1: 10 medium files (50MB each)
After compaction:
Level 0: 4 small files (5MB each)
Level 1: 5 medium files (100MB each)
Benefits:
- Read latency: O(100 seeks) → O(5 seeks)
- Disk space: Unused space reclaimed
- Tombstone cleanup: Delete entries removed
Compaction Strategies: Size-Tiered
Size-Tiered Compaction (SimpleDB, Cassandra):
Trigger: When level has too many files (e.g., > 10 files)
Action:
1. Select all files at level N that fit in memory
2. Merge into single file
3. Write to level N+1
4. Delete source files
Timeline:
t0: Level 0: [f0(5MB), f1(5MB), f2(5MB)] → Trigger (3 files)
t1: Merge into temp file (15MB)
t2: Move to Level 1: [f1_0(15MB)] + existing files
t3: Level 0: [] (files deleted)
t4: New writes go to Level 0 again
Advantages:
-
Simple to implement
-
Good for write-heavy workloads
-
Low CPU overhead during compaction
Disadvantages:
-
Write amplification: ~5-10x
-
Uneven file sizes at same level
-
Space overhead during merge
Leveled Compaction
Leveled Compaction (LevelDB, RocksDB):
Design:
- Level 0: Multiple files (overlap allowed)
- Level 1+: Single file per key range (NO overlap)
Trigger: When level size exceeds limit (e.g., 10MB)
Action:
1. Pick file from level N
2. Find overlapping files in level N+1
3. Merge all into level N+1
4. Delete source files
Timeline:
Level 0: [file0(4MB), file1(3MB), file2(2MB)] → Size: 9MB (below 10MB)
[file0(4MB), file1(3MB), file2(2MB), file3(5MB)] → Size: 14MB (TRIGGER)
Select file0 from L0:
- Key range: ["a", "g"]
- Find overlapping in L1: [file_l1_0 ("a"-"m")]
- Merge: file0 + file_l1_0 → file_l1_0'
- Delete: file0, file_l1_0
Result:
Level 0: [file1(3MB), file2(2MB), file3(5MB)]
Level 1: [file_l1_0'(overlapping merged)]
Advantages:
-
Write amplification: ~10x (predictable)
-
Bounded file count per level
-
Space efficient
-
Consistent performance
Disadvantages:
-
Complex implementation
-
Higher CPU during compaction
-
Must maintain key range invariants
Compaction Trade-offs Comparison
Aspect Size-Tiered Leveled Universal
Write Amplification 5-10x ~10x ~5x
Read Amplification Medium Low Low-Medium
Space Overhead High Low Medium
Implementation Simple Complex Medium
Best For Write-heavy Read-heavy Balanced
Lesson 6.2: Optimization Techniques
Bloom Filters for Faster Lookups
Problem: To check if key exists, must read SSTable (expensive disk I/O).
Without Bloom Filter:
Read "user:500":
1. Check memtable: Not found
2. Check SSTable 1: Read file (disk I/O) - Not found
3. Check SSTable 2: Read file (disk I/O) - Not found
4. Check SSTable 3: Read file (disk I/O) - Not found
Result: 3 disk I/Os for missing key!
Solution: Bloom Filter
A probabilistic data structure that answers “is key in this file?” quickly:
-
No false negatives (if says no, definitely not there)
-
Few false positives (might say yes when actually no)
-
Fast negative lookups
-
Small space (1-2 bits per entry)
Bloom Filter:
- Bit array: [0, 0, 1, 0, 1, 1, 0, 1, ...]
- Hash functions: h1, h2, h3
Put "user:1":
Set bits at positions: h1("user:1"), h2("user:1"), h3("user:1")
Check "user:1":
All three positions set? YES, probably in set
Check "user:500":
All three positions set? NO, definitely not in set!
Block Cache for Hot Data
Problem: Frequently accessed data requires disk reads.
// BlockCache caches SSTable blocks
type BlockCache struct {
mu sync.RWMutex
cache map[string]*CacheEntry
maxSize int64
size int64
}
type CacheEntry struct {
block *DataBlock
lastUsed time.Time
}
// Get retrieves from cache
func (bc *BlockCache) Get(key string) (*DataBlock, bool) {
bc.mu.RLock()
defer bc.mu.RUnlock()
entry, ok := bc.cache[key]
if ok {
entry.lastUsed = time.Now()
return entry.block, true
}
return nil, false
}
// Put stores in cache
func (bc *BlockCache) Put(key string, block *DataBlock) {
bc.mu.Lock()
defer bc.mu.Unlock()
// Check if should evict
if bc.size+block.Size() > bc.maxSize {
bc.evictLRU()
}
bc.cache[key] = &CacheEntry{
block: block,
lastUsed: time.Now(),
}
bc.size += block.Size()
}
Compression Strategies
Problem: SSTables consume disk space.
import (
"github.com/golang/snappy"
"compress/gzip"
)
// CompressionCodec compresses/decompresses blocks
type CompressionCodec interface {
Compress(data []byte) ([]byte, error)
Decompress(data []byte) ([]byte, error)
}
// SnappyCodec uses Snappy (fast)
type SnappyCodec struct{}
func (sc SnappyCodec) Compress(data []byte) ([]byte, error) {
return snappy.Encode(nil, data), nil
}
func (sc SnappyCodec) Decompress(data []byte) ([]byte, error) {
return snappy.Decode(nil, data)
}
// GzipCodec uses Gzip (better ratio)
type GzipCodec struct {
level int
}
func (gc GzipCodec) Compress(data []byte) ([]byte, error) {
var buf bytes.Buffer
w, _ := gzip.NewWriterLevel(&buf, gc.level)
w.Write(data)
w.Close()
return buf.Bytes(), nil
}
Compression Trade-offs:
Snappy (Fast):
-
Compression ratio: 0.5-0.7
-
CPU overhead: Low
-
Best for: Recent data, frequent reads
Gzip (Better ratio):
-
Compression ratio: 0.3-0.5
-
CPU overhead: High
-
Best for: Older data, infrequent reads
Batched Writes
Problem: Writing entries one-by-one is slow due to lock contention.
// Batch groups multiple operations
type Batch struct {
entries []*LogEntry
}
// Add adds operation to batch
func (b *Batch) Put(key, value []byte) {
b.entries = append(b.entries, &LogEntry{
Op: OpPut,
Key: key,
Value: value,
})
}
// Commit applies all operations atomically
func (b *Batch) Commit(db *DB) error {
db.mu.Lock()
defer db.mu.Unlock()
// All writes in one lock
for _, entry := range b.entries {
switch entry.Op {
case OpPut:
db.memtable.Put(entry.Key, entry.Value)
case OpDelete:
db.memtable.Delete(entry.Key)
}
}
// Single WAL write
return db.wal.WriteBatch(b.entries)
}
// Usage
batch := &Batch{}
batch.Put([]byte("k1"), []byte("v1"))
batch.Put([]byte("k2"), []byte("v2"))
batch.Put([]byte("k3"), []byte("v3"))
batch.Commit(db) // All or nothing
Hands-on Lab: Implement Leveled Compaction with Bloom filters
Build a complete compaction engine with optimization techniques.
Lab Tasks:
-
Implement LeveledCompactor with size-based triggers
-
Build Bloom filters with configurable false positive rate
-
Implement block cache with LRU eviction
-
Add compression support (Snappy + Gzip)
-
Implement batched writes
-
Write comprehensive tests and benchmarks
-
Measure performance improvements
Summary & Learning Outcomes
What You’ve Learned:
-
Three major compaction strategies and their trade-offs
-
Bloom filters for 10-100x speedup on negative lookups
-
Block caching for hot data optimization
-
Compression techniques and space/CPU trade-offs
-
Batched writes for reduced lock contention
-
Resource throttling and background processing
Key Metrics Achieved:
-
Compaction reduces file count by 90%
-
Bloom filters skip ~99% of negative lookups
-
Block cache achieves 10+ MB/s for hot data
-
Compression ratio: 0.3-0.5 (3-5GB from 10GB)
-
Batched writes reduce lock contention by 10x