Learning Objectives

Master Prometheus metrics for system monitoring
Implement structured logging with zap
Use performance profiling to identify bottlenecks
Build health checks for orchestration
Design comprehensive observability systems
Monitor the four golden signals of system health

Lesson 11.1: Prometheus Metrics

What are Metrics?

Metrics are measurements that help you understand system behavior:

Counter

Only increases (resets to zero when the process restarts)

requests_total
errors_total
bytes_written_total

Gauge

Can go up or down

current_connections
memory_usage_bytes
queue_length

Histogram

Distribution of values, counted into buckets

latency_seconds with buckets
request_size_bytes
response_time_seconds
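To make "distribution with buckets" concrete, here is a minimal plain-Go sketch of cumulative bucket counting. The bucketHistogram type is hypothetical and is not how the Prometheus client library is implemented; it only illustrates the shape of the data a histogram exposes (one cumulative count per upper bound, plus a sum and a total count).

```go
package main

import "fmt"

// bucketHistogram is a hypothetical, simplified stand-in for a histogram:
// one cumulative counter per upper bound, plus a running sum and count.
type bucketHistogram struct {
	upperBounds []float64 // sorted ascending; an implicit +Inf bucket follows
	counts      []uint64  // counts[i] = number of observations <= upperBounds[i]
	sum         float64
	count       uint64 // total observations, i.e. the +Inf bucket
}

func newBucketHistogram(upperBounds []float64) *bucketHistogram {
	return &bucketHistogram{
		upperBounds: upperBounds,
		counts:      make([]uint64, len(upperBounds)),
	}
}

// Observe records one value. Buckets are cumulative, so every bucket whose
// upper bound is at least v is incremented.
func (h *bucketHistogram) Observe(v float64) {
	for i, ub := range h.upperBounds {
		if v <= ub {
			h.counts[i]++
		}
	}
	h.sum += v
	h.count++
}

func main() {
	h := newBucketHistogram([]float64{0.001, 0.01, 0.1, 1})
	for _, v := range []float64{0.0005, 0.004, 0.05, 0.05, 0.3, 2.5} {
		h.Observe(v)
	}
	for i, ub := range h.upperBounds {
		fmt.Printf("le=%g count=%d\n", ub, h.counts[i])
	}
	fmt.Printf("le=+Inf count=%d\n", h.count)
}
```

With these six sample values the four buckets end up holding 1, 2, 4, and 5 observations, and the 2.5 second outlier appears only in the +Inf total of 6.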

Summary

Quantiles computed on the client side; unlike histogram buckets, they cannot be aggregated across instances

quantile="0.5" value="0.1"
quantile="0.95" value="0.5"
quantile="0.99" value="1.0"
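As a rough illustration of what a quantile value means, the sketch below computes exact quantiles over a small retained sample using the nearest-rank method. This is an assumption made for clarity: real client libraries use streaming estimators rather than sorting every observation, and the quantile function here is hypothetical.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// quantile returns the q-quantile (0 < q <= 1) of samples using the
// nearest-rank method. The input slice is copied, so the caller's data
// is left unsorted. An empty input yields NaN.
func quantile(samples []float64, q float64) float64 {
	if len(samples) == 0 {
		return math.NaN()
	}
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)
	rank := int(math.Ceil(q * float64(len(sorted)))) // 1-based rank
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

func main() {
	// Hypothetical request latencies in milliseconds.
	latencies := []float64{90, 10, 50, 30, 100, 20, 70, 40, 80, 60}
	for _, q := range []float64{0.5, 0.95, 0.99} {
		fmt.Printf("quantile=%g value=%g\n", q, quantile(latencies, q))
	}
}
```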

Four Golden Signals

The four golden signals cover the minimum every service should monitor:

1. Latency - How long requests take
   Metric: request_duration_seconds
   Target: p99 < 100ms

2. Traffic - How much work the system is handling
   Metric: requests_per_second
   Target: 10,000+ ops/sec

3. Errors - How many requests fail
   Metric: error_rate_percent
   Target: < 1%

4. Saturation - How heavily the system's resources are utilized
   Metric: cpu_percent, memory_percent
   Target: < 70%
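Traffic and error rate are usually derived from counters rather than stored directly: take two readings of a counter and divide the difference by the elapsed time. The sketch below shows that arithmetic in plain Go; the function names and sample numbers are illustrative assumptions, not part of any library.

```go
package main

import "fmt"

// ratePerSecond returns how fast a counter grew between two readings.
// It returns 0 when no time has elapsed, to avoid dividing by zero.
func ratePerSecond(prevCount, curCount, elapsedSeconds float64) float64 {
	if elapsedSeconds <= 0 {
		return 0
	}
	return (curCount - prevCount) / elapsedSeconds
}

// errorRatePercent returns failed requests as a percentage of all requests
// seen between two readings. It returns 0 when there were no requests.
func errorRatePercent(prevReq, curReq, prevErr, curErr float64) float64 {
	requests := curReq - prevReq
	if requests <= 0 {
		return 0
	}
	return (curErr - prevErr) / requests * 100
}

func main() {
	// Hypothetical counter readings taken 10 seconds apart.
	prevReq, curReq := 1000.0, 121000.0
	prevErr, curErr := 10.0, 610.0

	fmt.Printf("traffic: %.0f req/s\n", ratePerSecond(prevReq, curReq, 10))
	fmt.Printf("errors:  %.2f%%\n", errorRatePercent(prevReq, curReq, prevErr, curErr))
}
```

In this made-up interval the service handled 12,000 requests per second with a 0.5% error rate, which would satisfy both the traffic and error targets listed above.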

Implementing Prometheus Metrics

package metrics

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

type Metrics struct {
    // Counters - always increase
    RequestsTotal prometheus.Counter
    ErrorsTotal   prometheus.Counter
    
    // Gauges - can go up/down
    ConnectionCount prometheus.Gauge
    MemoryBytes     prometheus.Gauge
    
    // Histograms - distribution with buckets
    RequestLatency  prometheus.Histogram
}

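// NewMetrics creates all metrics and registers them with the default registry.
// promauto panics on duplicate registration, so call this once per process.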
func NewMetrics() *Metrics {
    return &Metrics{
        RequestsTotal: promauto.NewCounter(prometheus.CounterOpts{
            Name: "kvdb_requests_total",
            Help: "Total number of requests",
        }),
        
        ErrorsTotal: promauto.NewCounter(prometheus.CounterOpts{
            Name: "kvdb_errors_total",
            Help: "Total number of errors",
        }),
        
        ConnectionCount: promauto.NewGauge(prometheus.GaugeOpts{
            Name: "kvdb_connections",
            Help: "Current number of connections",
        }),
        
        MemoryBytes: promauto.NewGauge(prometheus.GaugeOpts{
            Name: "kvdb_memory_bytes",
            Help: "Memory usage in bytes",
        }),
        
        RequestLatency: promauto.NewHistogram(prometheus.HistogramOpts{
            Name: "kvdb_request_latency_seconds",
            Help: "Request latency",
            Buckets: []float64{0.001, 0.01, 0.1, 1},
        }),
    }
}

// RecordRequest records a request
func (m *Metrics) RecordRequest(duration float64, hasError bool) {
    m.RequestsTotal.Inc()
    m.RequestLatency.Observe(duration)
    
    if hasError {
        m.ErrorsTotal.Inc()
    }
}

// UpdateConnections updates connection gauge
func (m *Metrics) UpdateConnections(count int64) {
    m.ConnectionCount.Set(float64(count))
}

// UpdateMemory updates memory gauge
func (m *Metrics) UpdateMemory(bytes int64) {
    m.MemoryBytes.Set(float64(bytes))
}

Using Metrics in Server

import (
    "log"
    "net/http"
    "time"

    "github.com/prometheus/client_golang/prometheus/promhttp"
)

type Server struct {
    metrics *Metrics
}

// HandleRequest records metrics for each request
func (s *Server) HandleRequest(operation string) error {
    start := time.Now()
    
    // Do work (doWork stands in for the actual operation)
    err := s.doWork()
    
    // Record metrics
    duration := time.Since(start).Seconds()
    s.metrics.RecordRequest(duration, err != nil)
    
    return err
}

// ExposeMetrics serves the Prometheus endpoint on addr in the background
func (s *Server) ExposeMetrics(addr string) {
    http.Handle("/metrics", promhttp.Handler())
    go func() {
        // ListenAndServe always returns a non-nil error; log it instead of dropping it
        if err := http.ListenAndServe(addr, nil); err != nil {
            log.Printf("metrics server stopped: %v", err)
        }
    }()
}
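For HTTP services, the same timing logic can live in one middleware instead of being repeated in every handler. The sketch below uses only the standard library; the requestRecorder interface is an assumption chosen to match the RecordRequest method of the Metrics type defined earlier in this lesson, and fakeRecorder is a test double so the example runs without Prometheus. Treating only 5xx responses as errors is also an assumption you may want to change.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// requestRecorder is a hypothetical interface satisfied by any type with a
// RecordRequest(duration, hasError) method.
type requestRecorder interface {
	RecordRequest(duration float64, hasError bool)
}

// statusWriter remembers the status code written by the wrapped handler.
type statusWriter struct {
	http.ResponseWriter
	status int
}

func (w *statusWriter) WriteHeader(code int) {
	w.status = code
	w.ResponseWriter.WriteHeader(code)
}

// instrument times every request and reports 5xx responses as errors.
func instrument(rec requestRecorder, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		sw := &statusWriter{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(sw, r)
		rec.RecordRequest(time.Since(start).Seconds(), sw.status >= 500)
	})
}

// fakeRecorder counts calls so the example can run without Prometheus.
type fakeRecorder struct{ requests, failures int }

func (f *fakeRecorder) RecordRequest(_ float64, hasError bool) {
	f.requests++
	if hasError {
		f.failures++
	}
}

// runDemo sends one successful and one failing request through the
// middleware and returns what the recorder saw.
func runDemo() (requests, failures int) {
	rec := &fakeRecorder{}
	mux := http.NewServeMux()
	mux.HandleFunc("/ok", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "ok")
	})
	mux.HandleFunc("/fail", func(w http.ResponseWriter, r *http.Request) {
		http.Error(w, "boom", http.StatusInternalServerError)
	})
	h := instrument(rec, mux)
	for _, path := range []string{"/ok", "/fail"} {
		h.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest(http.MethodGet, path, nil))
	}
	return rec.requests, rec.failures
}

func main() {
	requests, failures := runDemo()
	fmt.Printf("requests=%d failures=%d\n", requests, failures)
}
```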

Lesson 11.2: Structured Logging

Problem: Unstructured Logs

❌ BAD

2025-01-16 10:15:23 Request from 192.168.1.1 with key user:100 took 50ms

Problems:
- Hard to parse
- Inconsistent format
- Can't filter/search easily

Solution: Structured Logging

✅ GOOD

{
  "timestamp": "2025-01-16T10:15:23Z",
  "level": "error",
  "message": "request failed",
  "client_ip": "192.168.1.1",
  "key": "user:100",
  "duration_ms": 50,
  "error": "timeout"
}

Benefits:
- Easy to parse
- Consistent format
- Can filter/search
- Can aggregate
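The "easy to parse" and "can filter" benefits can be demonstrated with nothing more than the standard library. The sketch below decodes newline-delimited JSON logs and keeps only one level. The logEntry struct mirrors the fields of the example above; the second sample line and the filterByLevel helper are made up for illustration.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// logEntry mirrors the fields of the structured log example.
type logEntry struct {
	Timestamp  string `json:"timestamp"`
	Level      string `json:"level"`
	Message    string `json:"message"`
	ClientIP   string `json:"client_ip"`
	Key        string `json:"key"`
	DurationMS int    `json:"duration_ms"`
	Error      string `json:"error"`
}

// filterByLevel parses newline-delimited JSON logs and returns the entries
// at the given level. Lines that are not valid JSON are skipped.
func filterByLevel(logs, level string) []logEntry {
	var out []logEntry
	sc := bufio.NewScanner(strings.NewReader(logs))
	for sc.Scan() {
		var e logEntry
		if err := json.Unmarshal(sc.Bytes(), &e); err != nil {
			continue
		}
		if e.Level == level {
			out = append(out, e)
		}
	}
	return out
}

// sampleLogs holds one line from the example above plus one invented line.
const sampleLogs = `{"timestamp":"2025-01-16T10:15:23Z","level":"error","message":"request failed","client_ip":"192.168.1.1","key":"user:100","duration_ms":50,"error":"timeout"}
{"timestamp":"2025-01-16T10:15:24Z","level":"info","message":"request completed","client_ip":"192.168.1.2","key":"user:101","duration_ms":3}`

func main() {
	for _, e := range filterByLevel(sampleLogs, "error") {
		fmt.Printf("%s %s key=%s duration_ms=%d error=%s\n",
			e.Timestamp, e.Message, e.Key, e.DurationMS, e.Error)
	}
}
```

Doing the same with the unstructured line would require a hand-written regular expression that breaks as soon as the message format changes.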

Using zap Logger

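package main

import (
    "errors"
    "log"
    "time"

    "go.uber.org/zap"
)

func main() {
    // Create a JSON logger; the production config logs at Info level and above
    logger, err := zap.NewProduction()
    if err != nil {
        log.Fatalf("create logger: %v", err)
    }
    defer logger.Sync() // flush buffered entries on exit

    // Log with fields
    logger.Info("operation completed",
        zap.String("operation", "GET"),
        zap.String("key", "user:100"),
        zap.Duration("latency", 50*time.Millisecond),
        zap.Int("status_code", 200),
    )

    // Log errors (opErr is a placeholder for a real failure)
    opErr := errors.New("timeout")
    logger.Error("operation failed",
        zap.String("operation", "PUT"),
        zap.Error(opErr),
        zap.Duration("latency", 100*time.Millisecond),
    )

    // Log levels
    logger.Debug("debug info") // dropped by the production config
    logger.Info("informational")
    logger.Warn("warning")
    logger.Error("error")
}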

Lesson 11.3: Performance Profiling

CPU Profiling

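import (
    "log"
    "os"
    "runtime/pprof"
)

// Start profiling
f, err := os.Create("cpu.prof")
if err != nil {
    log.Fatal(err)
}
defer f.Close() // deferred first, so it runs last

if err := pprof.StartCPUProfile(f); err != nil {
    log.Fatal(err)
}
defer pprof.StopCPUProfile() // runs before f.Close, flushing samples to the file

// Run operations (store is the key-value store being profiled)
for i := 0; i < 1000000; i++ {
    store.Get([]byte("key"))
}

// Analyze: go tool pprof cpu.prof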
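When profiling from a test or a benchmark harness, it can be convenient to wrap the start/stop pair in a helper that accepts any io.Writer. The sketch below is one way to do that with the standard library; profileCPU, busyWork, and profileSize are hypothetical names, and the workload is a stand-in for real store operations.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"runtime/pprof"
)

// profileCPU runs work while a CPU profile is collected and writes the
// profile to w. Only one CPU profile can be active per process, so
// StartCPUProfile returns an error if another one is already running.
func profileCPU(w io.Writer, work func()) error {
	if err := pprof.StartCPUProfile(w); err != nil {
		return fmt.Errorf("start CPU profile: %w", err)
	}
	// StopCPUProfile blocks until all profile data has been written to w.
	defer pprof.StopCPUProfile()
	work()
	return nil
}

// sink keeps the compiler from optimizing the workload away.
var sink int

// busyWork is a placeholder CPU-bound workload.
func busyWork() {
	sum := 0
	for i := 0; i < 50_000_000; i++ {
		sum += i % 7
	}
	sink = sum
}

// profileSize profiles busyWork into memory and returns the size of the
// resulting profile in bytes.
func profileSize() (int, error) {
	var buf bytes.Buffer
	if err := profileCPU(&buf, busyWork); err != nil {
		return 0, err
	}
	return buf.Len(), nil
}

func main() {
	n, err := profileSize()
	if err != nil {
		fmt.Println("profiling failed:", err)
		return
	}
	fmt.Printf("collected %d bytes of profile data\n", n)
}
```

Passing an *os.File instead of the buffer produces a file that go tool pprof can read.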

Memory Profiling

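import (
    "os"
    "runtime"
    "runtime/pprof"
)

// Capture heap (errors ignored for brevity; check them in real code)
f, _ := os.Create("heap.prof")
runtime.GC() // refresh allocation statistics before the snapshot
pprof.WriteHeapProfile(f)
f.Close()

// Analyze: go tool pprof heap.prof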

HTTP Profiling Endpoint

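import _ "net/http/pprof" // registers handlers on http.DefaultServeMux

// The endpoints are served by any server that uses the default mux,
// such as one started with http.ListenAndServe(addr, nil):
// /debug/pprof/profile?seconds=30  - CPU profile
// /debug/pprof/heap                - Memory profile
// /debug/pprof/goroutine           - Goroutine profile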
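Profiling endpoints reveal internals, so a common precaution is to mount them on a separate mux bound to a private address. The sketch below registers the handlers explicitly using the exported functions of net/http/pprof; newDebugMux and probe are hypothetical helpers, and localhost:6060 is only an example address. Importing the package still registers handlers on the default mux as a side effect, so this only helps if the default mux is not served publicly.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/http/pprof"
)

// newDebugMux returns a mux that serves only the profiling handlers.
func newDebugMux() *http.ServeMux {
	mux := http.NewServeMux()
	// Index serves the listing page and named profiles such as heap and goroutine.
	mux.HandleFunc("/debug/pprof/", pprof.Index)
	mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
	return mux
}

// probe sends an in-memory GET request to h and returns the status code.
func probe(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	mux := newDebugMux()
	fmt.Println("goroutine profile status:", probe(mux, "/debug/pprof/goroutine?debug=1"))
	fmt.Println("heap profile status:", probe(mux, "/debug/pprof/heap"))

	// In a real server, bind the debug mux to a private address, for example:
	//   go http.ListenAndServe("localhost:6060", mux)
}
```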

Lesson 11.4: Health Checks

Liveness Probe (Is server alive?)

func (s *Server) Liveness() bool {
    // Simple check - has the server been initialized with a database handle?
    return s.db != nil
}

// HTTP endpoint
func (s *Server) handleLiveness(w http.ResponseWriter, r *http.Request) {
    if s.Liveness() {
        w.WriteHeader(http.StatusOK)
        w.Write([]byte("alive"))
    } else {
        w.WriteHeader(http.StatusInternalServerError)
    }
}

// Usage: GET /healthz

Readiness Probe (Can serve requests?)

func (s *Server) Readiness() (bool, string) {
    // Database check, bounded so a hung database cannot stall the probe
    // (the 1s timeout is an illustrative value)
    ctx, cancel := context.WithTimeout(context.Background(), time.Second)
    defer cancel()
    if _, err := s.db.Get(ctx, []byte("health")); err != nil {
        return false, "database down"
    }
    
    // Replication check
    lag := s.replication.GetLag()
    if lag > 5*time.Second {
        return false, "replication lag too high"
    }
    
    // Disk check
    if s.getDiskUsagePercent() > 90 {
        return false, "disk usage above 90%"
    }
    
    return true, "ready"
}

// HTTP endpoint
func (s *Server) handleReadiness(w http.ResponseWriter, r *http.Request) {
    ready, reason := s.Readiness()
    if ready {
        w.WriteHeader(http.StatusOK)
        w.Write([]byte("ready"))
    } else {
        w.WriteHeader(http.StatusServiceUnavailable)
        w.Write([]byte(reason))
    }
}

// Usage: GET /readyz
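As the number of dependencies grows, the readiness logic can be expressed as a list of named checks that share one deadline. The following is a minimal standard-library sketch under that assumption; the check type, readinessHandler, and the stubbed database and replication checks are all hypothetical and stand in for the real s.db and s.replication calls.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

// check is a hypothetical named dependency probe. Real checks should pass
// ctx to their I/O calls so the shared deadline is honored.
type check struct {
	name string
	run  func(ctx context.Context) error
}

// readinessHandler runs the checks in order under one deadline. It answers
// 200 "ready" when all pass, or 503 naming the first failing check.
func readinessHandler(timeout time.Duration, checks ...check) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), timeout)
		defer cancel()
		for _, c := range checks {
			if err := c.run(ctx); err != nil {
				http.Error(w, fmt.Sprintf("%s: %v", c.name, err), http.StatusServiceUnavailable)
				return
			}
		}
		fmt.Fprint(w, "ready")
	}
}

// probe sends an in-memory GET /readyz and returns the status and body.
func probe(h http.Handler) (int, string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/readyz", nil))
	return rec.Code, strings.TrimSpace(rec.Body.String())
}

// demo probes a healthy server and one with a failing replication check.
func demo() (healthyCode, degradedCode int, degradedBody string) {
	dbOK := check{name: "database", run: func(ctx context.Context) error { return nil }}
	replLag := check{name: "replication", run: func(ctx context.Context) error {
		return errors.New("lag 8s exceeds 5s")
	}}
	healthyCode, _ = probe(readinessHandler(time.Second, dbOK))
	degradedCode, degradedBody = probe(readinessHandler(time.Second, dbOK, replLag))
	return healthyCode, degradedCode, degradedBody
}

func main() {
	healthy, degraded, body := demo()
	fmt.Println("healthy:", healthy)
	fmt.Println("degraded:", degraded, body)
}
```

Returning the name of the failing check in the response body makes it easier to see from orchestrator events why an instance was taken out of rotation.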

Lab 11.1: Monitoring System

Objective

Build a comprehensive monitoring system with metrics, logging, profiling, and health checks.

Requirements

Prometheus Metrics: Counters, gauges, histograms for all operations
Structured Logging: JSON logs with zap for all events
Performance Profiling: CPU and memory profiling endpoints
Health Checks: Liveness and readiness probes
Monitoring Dashboard: Grafana dashboard with key metrics
Alerting: Alert rules for critical metrics

Starter Code

type MonitoringSystem struct {
    metrics *Metrics
    logger  *zap.Logger
}

func NewMonitoringSystem() *MonitoringSystem {
    logger, err := zap.NewProduction()
    if err != nil {
        logger = zap.NewNop() // fall back to a no-op logger instead of a nil pointer
    }
    return &MonitoringSystem{metrics: NewMetrics(), logger: logger}
}

func (ms *MonitoringSystem) RecordOperation(op string, duration time.Duration, err error) {
    ms.metrics.RecordRequest(duration.Seconds(), err != nil)
    
    if err != nil {
        ms.logger.Error("operation failed",
            zap.String("operation", op),
            zap.Duration("latency", duration),
            zap.Error(err),
        )
    } else {
        ms.logger.Info("operation succeeded",
            zap.String("operation", op),
            zap.Duration("latency", duration),
        )
    }
}

// TODO: Implement health checks
func (ms *MonitoringSystem) Liveness() bool {
    return true
}

func (ms *MonitoringSystem) Readiness() (bool, string) {
    return true, "ready"
}

// TODO: Implement profiling endpoints
func (ms *MonitoringSystem) SetupProfiling() {
    // Add /debug/pprof endpoints
}

Test Template

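// NewMetrics registers its collectors with the default Prometheus registry,
// and registering the same name twice panics, so the tests share one instance.
var testSystem = NewMonitoringSystem()

func TestMetrics(t *testing.T) {
    metrics := testSystem.metrics
    before := testutil.ToFloat64(metrics.RequestsTotal)
    
    // Record some operations
    metrics.RecordRequest(0.1, false)
    metrics.RecordRequest(0.2, true)
    metrics.UpdateConnections(100)
    
    // Verify metrics were recorded, using testutil from
    // github.com/prometheus/client_golang/prometheus/testutil
    assert.Equal(t, before+2, testutil.ToFloat64(metrics.RequestsTotal))
    assert.Equal(t, 100.0, testutil.ToFloat64(metrics.ConnectionCount))
}

func TestHealthChecks(t *testing.T) {
    ms := testSystem
    
    // Test liveness
    assert.True(t, ms.Liveness())
    
    // Test readiness
    ready, reason := ms.Readiness()
    assert.True(t, ready)
    assert.Equal(t, "ready", reason)
}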

func TestLogging(t *testing.T) {
    logger, _ := zap.NewDevelopment()
    
    logger.Info("test message",
        zap.String("key", "value"),
        zap.Int("count", 42),
    )
    
    // In real test, you'd capture log output
}

Acceptance Criteria

✅ All operations record metrics
✅ Structured JSON logs for all events
✅ CPU and memory profiling working
✅ Liveness probe responds correctly
✅ Readiness probe checks all dependencies
✅ Grafana dashboard shows key metrics
✅ Alert rules trigger on critical conditions
✅ > 90% code coverage
✅ All tests pass

Summary: Week 11 Complete

By completing Week 11, you’ve learned and implemented:

1. Prometheus Metrics

Counter, Gauge, Histogram, Summary
Four golden signals monitoring
Custom metrics for database operations
Prometheus endpoint exposure

2. Structured Logging

JSON structured logs with zap
Consistent log format
Easy filtering and searching
Log levels and context

3. Performance Profiling

CPU profiling for hot paths
Memory profiling for leaks
HTTP profiling endpoints
Goroutine analysis

4. Health Checks

Liveness probe (is alive?)
Readiness probe (can serve?)
Dependency health checks
Orchestration integration

Key Skills Mastered:

✅ Monitor system with Prometheus metrics
✅ Debug with structured JSON logs
✅ Profile CPU and memory usage
✅ Health checks for orchestration
✅ Four golden signals monitoring
✅ Production-ready observability

Ready for Week 12?

Next week we’ll focus on performance optimization, load testing, and capacity planning to maximize system performance.
