 Monitoring, Metrics, and Observability
Learning Objectives
- • Master Prometheus metrics for system monitoring
- • Implement structured logging with zap
- • Use performance profiling to identify bottlenecks
- • Build health checks for orchestration
- • Design comprehensive observability systems
- • Monitor the four golden signals of system health
Lesson 11.1: Prometheus Metrics
What are Metrics?
Metrics are measurements that help you understand system behavior:
Counter
Only increases
requests_total
errors_total
bytes_written_total
Gauge
Can go up or down
current_connections
memory_usage_bytes
queue_length
Histogram
Distribution of values
latency_seconds with buckets
request_size_bytes
response_time
Summary
Percentiles computed on client side
quantile="0.5" value="0.1"
quantile="0.95" value="0.5"
quantile="0.99" value="1.0"
Four Golden Signals
Every system needs these four signals for comprehensive monitoring:
1. Latency - How long requests take
   Metric: request_duration_seconds
   Target: p99 < 100ms
2. Traffic - How much demand the system is handling
   Metric: requests_per_second
   Target: 10,000+ ops/sec
3. Errors - How many requests fail
   Metric: error_rate_percent
   Target: < 1%
4. Saturation - How utilized the system is
   Metric: cpu_percent, memory_percent
   Target: < 70%
Implementing Prometheus Metrics
package metrics
import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)
type Metrics struct {
    // Counters - always increase
    RequestsTotal prometheus.Counter
    ErrorsTotal   prometheus.Counter
    
    // Gauges - can go up/down
    ConnectionCount prometheus.Gauge
    MemoryBytes     prometheus.Gauge
    
    // Histograms - distribution with buckets
    RequestLatency  prometheus.Histogram
}
// NewMetrics creates all metrics
func NewMetrics() *Metrics {
    return &Metrics{
        RequestsTotal: promauto.NewCounter(prometheus.CounterOpts{
            Name: "kvdb_requests_total",
            Help: "Total number of requests",
        }),
        
        ErrorsTotal: promauto.NewCounter(prometheus.CounterOpts{
            Name: "kvdb_errors_total",
            Help: "Total number of errors",
        }),
        
        ConnectionCount: promauto.NewGauge(prometheus.GaugeOpts{
            Name: "kvdb_connections",
            Help: "Current number of connections",
        }),
        
        MemoryBytes: promauto.NewGauge(prometheus.GaugeOpts{
            Name: "kvdb_memory_bytes",
            Help: "Memory usage in bytes",
        }),
        
        RequestLatency: promauto.NewHistogram(prometheus.HistogramOpts{
            Name: "kvdb_request_latency_seconds",
            Help: "Request latency",
            Buckets: []float64{0.001, 0.01, 0.1, 1},
        }),
    }
}
// RecordRequest records a request
func (m *Metrics) RecordRequest(duration float64, hasError bool) {
    m.RequestsTotal.Inc()
    m.RequestLatency.Observe(duration)
    
    if hasError {
        m.ErrorsTotal.Inc()
    }
}
// UpdateConnections updates connection gauge
func (m *Metrics) UpdateConnections(count int64) {
    m.ConnectionCount.Set(float64(count))
}
// UpdateMemory updates memory gauge
func (m *Metrics) UpdateMemory(bytes int64) {
    m.MemoryBytes.Set(float64(bytes))
}
Using Metrics in Server
import (
    "net/http"
    "time"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)
type Server struct {
    metrics *Metrics
}
// HandleRequest records metrics for each request
func (s *Server) HandleRequest(operation string) error {
    start := time.Now()
    
    // Do work
    err := s.doWork()
    
    // Record metrics
    duration := time.Since(start).Seconds()
    s.metrics.RecordRequest(duration, err != nil)
    
    return err
}
// ExposeMetrics starts Prometheus endpoint
func (s *Server) ExposeMetrics(addr string) {
    http.Handle("/metrics", promhttp.Handler())
    go http.ListenAndServe(addr, nil)
}
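The MemoryBytes gauge needs something to feed it. Here is a minimal sketch, assuming the Server and Metrics types above, that samples the Go runtime in the background (startMemoryGaugeLoop is an illustrative name):
import (
    "runtime"
    "time"
)
// startMemoryGaugeLoop periodically pushes heap usage into the memory gauge.
func (s *Server) startMemoryGaugeLoop(interval time.Duration) {
    go func() {
        var ms runtime.MemStats
        for {
            runtime.ReadMemStats(&ms)
            s.metrics.UpdateMemory(int64(ms.HeapAlloc))
            time.Sleep(interval)
        }
    }()
}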
Lesson 11.2: Structured Logging
Problem: Unstructured Logs
❌ BAD
2025-01-16 10:15:23 Request from 192.168.1.1 with key user:100 took 50ms
Problems:
- Hard to parse
- Inconsistent format
- Can't filter/search easily
Solution: Structured Logging
✅ GOOD
{
  "timestamp": "2025-01-16T10:15:23Z",
  "level": "error",
  "message": "request failed",
  "client_ip": "192.168.1.1",
  "key": "user:100",
  "duration_ms": 50,
  "error": "timeout"
}
Benefits:
- Easy to parse
- Consistent format
- Can filter/search
- Can aggregate
Using zap Logger
import "go.uber.org/zap"
// Create logger
logger, _ := zap.NewProduction()
defer logger.Sync()
// Log with fields
logger.Info("operation completed",
    zap.String("operation", "GET"),
    zap.String("key", "user:100"),
    zap.Duration("latency", 50*time.Millisecond),
    zap.Int("status_code", 200),
)
// Log errors
logger.Error("operation failed",
    zap.String("operation", "PUT"),
    zap.Error(err),
    zap.Duration("latency", 100*time.Millisecond),
)
// Log levels
logger.Debug("debug info")
logger.Info("informational")
logger.Warn("warning")
logger.Error("error")Lesson 11.3: Performance Profiling
Lesson 11.3: Performance Profiling
CPU Profiling
import (
    "runtime/pprof"
    "os"
)
// Start profiling (errors elided for brevity)
f, _ := os.Create("cpu.prof")
defer f.Close()
pprof.StartCPUProfile(f)
defer pprof.StopCPUProfile() // defers run LIFO, so profiling stops before the file closes
// Run operations
for i := 0; i < 1000000; i++ {
    store.Get([]byte("key"))
}
// Analyze: go tool pprof cpu.prof
Memory Profiling
import "runtime/pprof"
// Capture heap
f, _ := os.Create("heap.prof")
pprof.WriteHeapProfile(f)
f.Close()
// Analyze: go tool pprof heap.prof
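Profiles for hot paths can also come from benchmarks instead of hand-written loops, via go test -bench=Get -cpuprofile=cpu.prof -memprofile=heap.prof. A sketch (BenchmarkGet and newTestStore are illustrative names, not part of the course code):
import "testing"
// BenchmarkGet exercises the read path; the -cpuprofile and -memprofile
// flags make `go test` write the same profile files analyzed above.
func BenchmarkGet(b *testing.B) {
    store := newTestStore(b) // assumed test helper
    key := []byte("key")
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        store.Get(key)
    }
}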
import _ "net/http/pprof"
// Available endpoints:
// /debug/pprof/profile?seconds=30  - CPU profile
// /debug/pprof/heap                - Memory profile
// /debug/pprof/goroutine           - Goroutine profile
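The blank import only registers these handlers on http.DefaultServeMux; they still need an HTTP server to be reachable. A minimal sketch that serves them on a separate internal-only port (6060 is a common convention, not a requirement):
import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
)
// startPprofServer exposes the profiling endpoints on an internal port
func startPprofServer() {
    go func() {
        if err := http.ListenAndServe("localhost:6060", nil); err != nil {
            log.Printf("pprof server stopped: %v", err)
        }
    }()
}
// Then, for example: go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30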
Lesson 11.4: Health Checks
Liveness Probe (Is server alive?)
func (s *Server) Liveness() bool {
    // Minimal check: the process is responding and the DB handle is initialized
    return s.db != nil
}
// HTTP endpoint
func (s *Server) handleLiveness(w http.ResponseWriter, r *http.Request) {
    if s.Liveness() {
        w.WriteHeader(200)
        w.Write([]byte("alive"))
    } else {
        w.WriteHeader(500)
    }
}
// Usage: GET /healthz
Readiness Probe (Can serve requests?)
func (s *Server) Readiness() (bool, string) {
    // Database check (if Get reports a missing key as an error, make sure
    // the "health" key exists, or treat "not found" as healthy here)
    _, err := s.db.Get(context.Background(), []byte("health"))
    if err != nil {
        return false, "database down"
    }
    
    // Replication check
    lag := s.replication.GetLag()
    if lag > 5*time.Second {
        return false, "replication lag too high"
    }
    
    // Disk check
    if s.getDiskUsagePercent() > 90 {
        return false, "disk full"
    }
    
    return true, "ready"
}
// HTTP endpoint
func (s *Server) handleReadiness(w http.ResponseWriter, r *http.Request) {
    ready, reason := s.Readiness()
    if ready {
        w.WriteHeader(200)
        w.Write([]byte("ready"))
    } else {
        w.WriteHeader(503)
        w.Write([]byte(reason))
    }
}
// Usage: GET /readyz
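A minimal sketch of wiring the probes and the metrics endpoint onto one mux (serveAdmin is an illustrative name; the handler names follow the snippets above):
import (
    "log"
    "net/http"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)
// serveAdmin exposes health probes and metrics on a dedicated listener
func (s *Server) serveAdmin(addr string) {
    mux := http.NewServeMux()
    mux.HandleFunc("/healthz", s.handleLiveness)
    mux.HandleFunc("/readyz", s.handleReadiness)
    mux.Handle("/metrics", promhttp.Handler())
    go func() {
        if err := http.ListenAndServe(addr, mux); err != nil {
            log.Printf("admin server stopped: %v", err)
        }
    }()
}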
Lab 11.1: Monitoring System
Objective
Build a comprehensive monitoring system with metrics, logging, profiling, and health checks.
Requirements
- • Prometheus Metrics: Counters, gauges, histograms for all operations
- • Structured Logging: JSON logs with zap for all events
- • Performance Profiling: CPU and memory profiling endpoints
- • Health Checks: Liveness and readiness probes
- • Monitoring Dashboard: Grafana dashboard with key metrics
- • Alerting: Alert rules for critical metrics
Starter Code
type MonitoringSystem struct {
    metrics *Metrics
    logger  *zap.Logger
}
func NewMonitoringSystem() *MonitoringSystem {
    metrics := NewMetrics()
    logger, _ := zap.NewProduction()
    return &MonitoringSystem{metrics, logger}
}
func (ms *MonitoringSystem) RecordOperation(op string, duration time.Duration, err error) {
    ms.metrics.RecordRequest(duration.Seconds(), err != nil)
    
    if err != nil {
        ms.logger.Error("operation failed",
            zap.String("operation", op),
            zap.Duration("latency", duration),
            zap.Error(err),
        )
    } else {
        ms.logger.Info("operation succeeded",
            zap.String("operation", op),
            zap.Duration("latency", duration),
        )
    }
}
// TODO: Implement health checks
func (ms *MonitoringSystem) Liveness() bool {
    return true
}
func (ms *MonitoringSystem) Readiness() (bool, string) {
    return true, "ready"
}
// TODO: Implement profiling endpoints
func (ms *MonitoringSystem) SetupProfiling() {
    // Add /debug/pprof endpoints
}
Test Template
func TestMetrics(t *testing.T) {
    metrics := NewMetrics()
    
    // Record some operations
    metrics.RecordRequest(0.1, false)
    metrics.RecordRequest(0.2, true)
    metrics.UpdateConnections(100)
    
    // Verify metrics were recorded
    // (In real test, you'd check Prometheus registry)
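    // A possible concrete check, assuming the test helpers in
    // "github.com/prometheus/client_golang/prometheus/testutil" are imported:
    if got := testutil.ToFloat64(metrics.RequestsTotal); got != 2 {
        t.Errorf("requests_total = %v, want 2", got)
    }
    if got := testutil.ToFloat64(metrics.ErrorsTotal); got != 1 {
        t.Errorf("errors_total = %v, want 1", got)
    }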
}
func TestHealthChecks(t *testing.T) {
    ms := NewMonitoringSystem()
    
    // Test liveness
    assert.True(t, ms.Liveness())
    
    // Test readiness
    ready, reason := ms.Readiness()
    assert.True(t, ready)
    assert.Equal(t, "ready", reason)
}
func TestLogging(t *testing.T) {
    logger, _ := zap.NewDevelopment()
    
    logger.Info("test message",
        zap.String("key", "value"),
        zap.Int("count", 42),
    )
    
    // In real test, you'd capture log output
}
Acceptance Criteria
- ✅ All operations record metrics
- ✅ Structured JSON logs for all events
- ✅ CPU and memory profiling working
- ✅ Liveness probe responds correctly
- ✅ Readiness probe checks all dependencies
- ✅ Grafana dashboard shows key metrics
- ✅ Alert rules trigger on critical conditions
- ✅ > 90% code coverage
- ✅ All tests pass
Summary: Week 11 Complete
By completing Week 11, you've learned and implemented:
1. Prometheus Metrics
- • Counter, Gauge, Histogram, Summary
- • Four golden signals monitoring
- • Custom metrics for database operations
- • Prometheus endpoint exposure
2. Structured Logging
- • JSON structured logs with zap
- • Consistent log format
- • Easy filtering and searching
- • Log levels and context
3. Performance Profiling
- • CPU profiling for hot paths
- • Memory profiling for leaks
- • HTTP profiling endpoints
- • Goroutine analysis
4. Health Checks
- • Liveness probe (is alive?)
- • Readiness probe (can serve?)
- • Dependency health checks
- • Orchestration integration
Key Skills Mastered:
- ✅ Monitor the system with Prometheus metrics
- ✅ Debug with structured JSON logs
- ✅ Profile CPU and memory usage
- ✅ Build health checks for orchestration
- ✅ Track the four golden signals
- ✅ Build production-ready observability
Ready for Week 12?
Next week we'll focus on performance optimization, load testing, and capacity planning to get the most out of your database.
Continue to Week 12: Performance Optimization →