Append-Only Log (Write-Ahead Log)

Learning Objectives

• Understand why Write-Ahead Logging is essential for durability
• Master file I/O patterns and binary encoding in Go
• Implement crash recovery mechanisms
• Learn different durability levels and their trade-offs
• Build a complete WAL with checksums and corruption detection

Lesson 4.1: WAL Fundamentals

Why Append-Only Logs Matter

The Write-Ahead Log (WAL) is the foundation of database durability. It ensures that no committed data is lost, even if the system crashes.

The Durability Problem

Without persistence, all data lives in memory:

┌─────────────────────────────────────┐
│     In-Memory Store (RAM)           │
│  ┌──────────────────────────────┐   │
│  │ "user:1" → "Alice"           │   │
│  │ "user:2" → "Bob"             │   │
│  │ "counter" → 42                │   │
│  └──────────────────────────────┘   │
│                                     │
│         PROBLEM:                    │
│  Power failure → All data lost!     │
│  System crash → All data lost!      │
│  OOM killer → All data lost!        │
└─────────────────────────────────────┘

Real-world scenario:

Application running:
1. Receive Put("user:100", "Alice")
2. Store in memory
3. Return success to client

Client: "Data saved!"

Immediately after:
- Power failure
- Server reboots
- Data is gone!

Client: "Where's my data?!"

The Write-Ahead Log Solution

┌──────────────────────────────────────────┐
│      Application                         │
│  Put("user:1", "Alice")                  │
└─────────────────┬──────────────────────┘
                  │
                  ▼
         ┌────────────────┐
         │ Write to WAL   │
         │ (Disk)         │
         └────────────────┘
              ▲    │
              │    │ fsync()
              │    ▼
         ┌────────────────┐
         │  OS Buffer     │
         │  (Page Cache)  │
         └────────────────┘
              │
              │ fsync()
              ▼
         ┌────────────────┐
         │  Disk Storage  │
         │  (Persistent)  │
         └────────────────┘
                  │
                  ▼
         Return to client
         (Data is safe!)
                  │
                  ▼
         Update in-memory store

Key principle: Write to disk BEFORE acknowledging to client.

Timeline:
t0: Client sends Put("key", "value")
t1: Server writes entry to WAL
t2: Server calls fsync() - blocks until on disk
t3: Server updates in-memory store
t4: Server returns success to client
    (If crash between t0-t4, can recover from WAL)

If crash at t3:
  └─ On restart: Replay WAL, recover Put
  └─ Data is safe

If crash after t4:
  └─ On restart: Replay WAL
  └─ Data was already in memory
  └─ Data is safe

Log Structure and Format Design

Basic Log Entry Format

Entry:
┌────────────┬─────────┬────────┬─────────┬──────────┐
│ Operation  │ KeyLen  │ Key    │ValueLen │ Value    │
│ 1 byte     │ 4 bytes │ VarLen │ 4 bytes │ VarLen   │
└────────────┴─────────┴────────┴─────────┴──────────┘

Example:
Operation: 0x01 (Put)
KeyLen: 0x00000004 (4 bytes)
Key: "user"
ValueLen: 0x00000005 (5 bytes)
Value: "alice"

Log File Format (Multiple Entries)

WAL File (append-only):

┌─────────────────────────────────────────┐
│ Header                                  │
│ - Version: 1                            │
│ - CreatedAt: timestamp                  │
│ - Checksum: CRC32                       │
├─────────────────────────────────────────┤
│ Entry 1: Put("user:1", "alice")        │
├─────────────────────────────────────────┤
│ Entry 2: Put("user:2", "bob")          │
├─────────────────────────────────────────┤
│ Entry 3: Delete("user:3")              │
├─────────────────────────────────────────┤
│ Entry 4: Put("counter", "42")          │
├─────────────────────────────────────────┤
│ ... (more entries)                      │
│                                         │
│ File only grows (append-only)           │
│ Never modifies existing entries         │
└─────────────────────────────────────────┘

Checksums for Corruption Detection

// Entry with checksum
type LogEntry struct {
    Op        byte        // 0: Put, 1: Delete
    KeyLen    uint32
    Key       []byte
    ValueLen  uint32
    Value     []byte
    Checksum  uint32      // CRC32 of (Op + KeyLen + Key + ValueLen + Value)
}

// Verify on read
func (e *LogEntry) Verify() error {
    computed := crc32.ChecksumIEEE(e.Serialize())
    if computed != e.Checksum {
        return ErrChecksumMismatch
    }
    return nil
}

Lesson 4.2: WAL Implementation

File I/O in Go

Basic File Operations

import "os"

// Create/open file
f, err := os.OpenFile(
    "wal.log",
    os.O_CREATE|os.O_WRONLY|os.O_APPEND,  // Create, write-only, append
    0644,  // Permissions: rw-r--r--
)
if err != nil {
    return err
}
defer f.Close()

// Write data
data := []byte("hello")
n, err := f.Write(data)
if err != nil {
    return err
}
fmt.Printf("Wrote %d bytes\n", n)

// Force to disk
if err := f.Sync(); err != nil {
    return err
}

// Get file size
info, err := f.Stat()
if err != nil {
    return err
}
fmt.Printf("File size: %d\n", info.Size())

// Read from file
f2, _ := os.Open("wal.log")
buf := make([]byte, 1024)
n, err := f2.Read(buf)

Binary Encoding

Go provides binary encoding in the `encoding/binary` package:

import "encoding/binary"

// Write fixed-size integers
var buf bytes.Buffer
binary.Write(&buf, binary.BigEndian, uint32(42))  // Big-endian encoding
binary.Write(&buf, binary.LittleEndian, uint32(42))  // Little-endian

// Read fixed-size integers
var val uint32
binary.Read(bytes.NewReader(buf.Bytes()), binary.BigEndian, &val)

// Variable-length encoding (varint)
buf2 := bytes.NewBuffer(nil)
binary.PutUvarint([]byte{}, 300)  // Encodes efficiently

Varint efficiency:

• Small values (0-127): 1 byte
• Medium values (128-16383): 2 bytes
• Large values: 3-10 bytes

Example:

• Encode 42: 0x2A (1 byte)
• Encode 300: 0xAC 0x02 (2 bytes)
• Encode 1000: 0xE8 0x07 (2 bytes)

Writing Log Entries

package persistence

import (
    "bytes"
    "encoding/binary"
    "hash/crc32"
    "io"
    "os"
    "sync"
)

// LogEntry represents a single write operation
type LogEntry struct {
    Op       uint8   // 0: Put, 1: Delete
    Key      []byte
    Value    []byte  // Empty for Delete
    Checksum uint32
}

// WriteAheadLog manages persistent writes
type WriteAheadLog struct {
    file       *os.File
    mu         sync.Mutex
    offset     int64       // Current file offset
    unfsynced  int         // Entries since last fsync
    syncEvery  int         // Sync every N writes
    buf        *bytes.Buffer
}

// NewWriteAheadLog creates a new WAL
func NewWriteAheadLog(filepath string, syncEvery int) (*WriteAheadLog, error) {
    f, err := os.OpenFile(
        filepath,
        os.O_CREATE|os.O_WRONLY|os.O_APPEND,
        0644,
    )
    if err != nil {
        return nil, err
    }
    
    // Get current file size
    info, err := f.Stat()
    if err != nil {
        f.Close()
        return nil, err
    }
    
    return &WriteAheadLog{
        file:      f,
        offset:    info.Size(),
        syncEvery: syncEvery,
        buf:       bytes.NewBuffer(make([]byte, 0, 4096)),
    }, nil
}

// Write appends an entry to the WAL
func (w *WriteAheadLog) Write(entry *LogEntry) error {
    w.mu.Lock()
    defer w.mu.Unlock()
    
    // Serialize entry
    data, err := w.serializeEntry(entry)
    if err != nil {
        return err
    }
    
    // Write to file
    if _, err := w.file.Write(data); err != nil {
        return err
    }
    
    w.offset += int64(len(data))
    w.unfsynced++
    
    // Check if should fsync
    if w.unfsynced >= w.syncEvery {
        if err := w.file.Sync(); err != nil {
            return err
        }
        w.unfsynced = 0
    }
    
    return nil
}

Durability Guarantees

fsync() and OS Buffers

The operating system buffers writes in memory (page cache):

Application writes data:
  write() → OS Page Cache → Returns immediately

Data in page cache is NOT durable yet!
  └─ Another power failure → Data lost

fsync() forces to disk:
  fsync() → Blocks until on persistent storage → Returns

Now data is durable!

Performance trade-off:

Without fsync():

• Write latency: 1-10 µs (memory operation)
• Throughput: 100M+ ops/sec
• Durability: NONE (data lost on crash)

With fsync():

• Write latency: 1-10 ms (disk I/O)
• Throughput: 100-1000 ops/sec
• Durability: FULL (data survives crashes)

Durability Levels

Level 0: No Durability (Caching)

func (wal *WAL) Write(entry *LogEntry) error {
    buf := entry.Serialize()
    _, err := wal.file.Write(buf)
    return err
    // No fsync() - data in OS cache only
}

Level 1: Write-Ahead Log with Periodic fsync

func (wal *WAL) Write(entry *LogEntry) error {
    buf := entry.Serialize()
    _, err := wal.file.Write(buf)
    if err != nil {
        return err
    }
    
    wal.unfsynced++
    
    // fsync every N writes
    if wal.unfsynced >= 1000 {
        if err := wal.file.Sync(); err != nil {
            return err
        }
        wal.unfsynced = 0
    }
    
    return nil
}

Level 2: Strict Durability (fsync every write)

func (wal *WAL) Write(entry *LogEntry) error {
    buf := entry.Serialize()
    _, err := wal.file.Write(buf)
    if err != nil {
        return err
    }
    
    // Immediate fsync - guarantees durability
    return wal.file.Sync()
}

Choosing Durability Level

Level	Latency	Throughput	Durability	Use Case
0	<1µs	100M ops/sec	None	Caching only
1	100µs	10K ops/sec	Likely	Web sessions
2	10ms	100 ops/sec	Strict	Financial

Lesson 4.3: Crash Recovery

Reading and Replay

// LogReader reads entries from a WAL file
type LogReader struct {
    file   *os.File
    reader io.Reader
}

// NewLogReader creates a reader for the WAL
func NewLogReader(filepath string) (*LogReader, error) {
    f, err := os.Open(filepath)
    if err != nil {
        return nil, err
    }
    
    return &LogReader{
        file:   f,
        reader: f,
    }, nil
}

// Read reads next entry
func (lr *LogReader) Read() (*LogEntry, error) {
    entry := &LogEntry{}
    
    // Read operation type
    opBuf := make([]byte, 1)
    if _, err := io.ReadFull(lr.reader, opBuf); err != nil {
        if err == io.EOF {
            return nil, io.EOF
        }
        return nil, err
    }
    entry.Op = opBuf[0]
    
    // Read key
    var keyLen uint32
    if err := binary.Read(lr.reader, binary.BigEndian, &keyLen); err != nil {
        return nil, err
    }
    entry.Key = make([]byte, keyLen)
    if _, err := io.ReadFull(lr.reader, entry.Key); err != nil {
        return nil, err
    }
    
    // Read value
    var valLen uint32
    if err := binary.Read(lr.reader, binary.BigEndian, &valLen); err != nil {
        return nil, err
    }
    entry.Value = make([]byte, valLen)
    if _, err := io.ReadFull(lr.reader, entry.Value); err != nil {
        return nil, err
    }
    
    // Read and verify checksum
    var checksum uint32
    if err := binary.Read(lr.reader, binary.BigEndian, &checksum); err != nil {
        return nil, err
    }
    entry.Checksum = checksum
    
    // Verify data integrity
    if err := entry.VerifyChecksum(); err != nil {
        return nil, err
    }
    
    return entry, nil
}

Crash Recovery Implementation

// RecoveryManager replays WAL on startup
type RecoveryManager struct {
    walPath string
    store   Store
}

// NewRecoveryManager creates a recovery manager
func NewRecoveryManager(walPath string, store Store) *RecoveryManager {
    return &RecoveryManager{
        walPath: walPath,
        store:   store,
    }
}

// Recover replays the WAL and recovers to last consistent state
func (rm *RecoveryManager) Recover(ctx context.Context) error {
    // Check if WAL exists
    if _, err := os.Stat(rm.walPath); err != nil {
        if os.IsNotExist(err) {
            return nil  // No WAL, clean start
        }
        return err
    }
    
    reader, err := NewLogReader(rm.walPath)
    if err != nil {
        return err
    }
    defer reader.Close()
    
    var recovered, skipped int
    
    for {
        entry, err := reader.Read()
        if err == io.EOF {
            break
        }
        if err != nil {
            // Corrupted entry, skip
            skipped++
            continue
        }
        
        // Replay operation
        switch entry.Op {
        case 0:  // Put
            if err := rm.store.Put(ctx, entry.Key, entry.Value); err != nil {
                return err
            }
            recovered++
        case 1:  // Delete
            if err := rm.store.Delete(ctx, entry.Key); err != nil {
                return err
            }
            recovered++
        }
    }
    
    log.Printf("Recovered %d entries, skipped %d corrupted entries", recovered, skipped)
    return nil
}

Log Rotation

// LogRotator manages WAL file rotation
type LogRotator struct {
    baseDir        string
    maxFileSize    int64
    currentWAL     *WriteAheadLog
    rotationMu     sync.Mutex
}

// Rotate switches to a new WAL file
func (lr *LogRotator) Rotate() error {
    lr.rotationMu.Lock()
    defer lr.rotationMu.Unlock()
    
    // Close current file
    if lr.currentWAL != nil {
        if err := lr.currentWAL.Sync(); err != nil {
            return err
        }
        if err := lr.currentWAL.Close(); err != nil {
            return err
        }
    }
    
    // Create new WAL with timestamp
    filename := fmt.Sprintf("wal-%d.log", time.Now().UnixNano())
    filepath := filepath.Join(lr.baseDir, filename)
    
    wal, err := NewWriteAheadLog(filepath, 100)
    if err != nil {
        return err
    }
    
    lr.currentWAL = wal
    return nil
}

Hands-on Lab: Implement Write-Ahead Log with crash recovery

Build a persistent WAL with crash recovery.

Lab Tasks:

1. Implement WriteAheadLog with checksums
2. Support configurable fsync frequency
3. Implement LogReader with checksum verification
4. Build RecoveryManager for crash recovery
5. Write comprehensive tests
6. Benchmark different fsync rates
7. Test with corrupted entries

Summary & Learning Outcomes

What You've Learned:

• Why Write-Ahead Logging is essential for durability
• File I/O patterns and binary encoding in Go
• Different durability levels and their trade-offs
• Checksums and corruption detection
• Crash recovery mechanisms
• Log rotation and management

Key Takeaways

• Always write to disk before acknowledging to client
• Choose durability level based on your use case
• Use checksums to detect corruption
• Design for crash recovery from day one
• fsync() is expensive but necessary for durability

← Previous: Concurrency and Thread Safety Next: SSTables and LSM Trees →

Course Navigation

All Lessons