Monitoring, Metrics, and Observability

Learning Objectives

  • Master Prometheus metrics for system monitoring
  • Implement structured logging with zap
  • Use performance profiling to identify bottlenecks
  • Build health checks for orchestration
  • Design comprehensive observability systems
  • Monitor the four golden signals of system health

Lesson 11.1: Prometheus Metrics

What are Metrics?

Metrics are numeric measurements that help you understand system behavior over time. Prometheus defines four core metric types:

Counter - only increases (resets to zero on process restart)

  • requests_total
  • errors_total
  • bytes_written_total

Gauge - can go up or down

  • current_connections
  • memory_usage_bytes
  • queue_length

Histogram - distribution of observed values in server-side buckets

  • latency_seconds (with buckets)
  • request_size_bytes
  • response_time

Summary - quantiles computed on the client side (a Go sketch follows); example output:

  • quantile="0.5"  value="0.1"
  • quantile="0.95" value="0.5"
  • quantile="0.99" value="1.0"

Four Golden Signals

Google's Site Reliability Engineering book popularized these four signals as the minimum every user-facing system should monitor; a PromQL sketch for each follows the list:

1. Latency - how long requests take to complete
   Metric: request_duration_seconds
   Example target: p99 < 100ms

2. Traffic - how much demand is placed on the system
   Metric: requests_per_second
   Example target: 10,000+ ops/sec

3. Errors - what fraction of requests fail
   Metric: error_rate_percent
   Example target: < 1%

4. Saturation - how close the system is to its capacity
   Metric: cpu_percent, memory_percent
   Example target: < 70%
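Once the kvdb_* metrics from the next section are exposed, each signal maps to a short PromQL query. A hedged sketch (the 5-minute windows and the memory capacity are assumptions to adjust for your deployment):

# Latency: p99 over the last 5 minutes
histogram_quantile(0.99, rate(kvdb_request_latency_seconds_bucket[5m]))

# Traffic: requests per second
rate(kvdb_requests_total[5m])

# Errors: fraction of requests that failed
rate(kvdb_errors_total[5m]) / rate(kvdb_requests_total[5m])

# Saturation: memory used relative to an assumed 8 GiB limit
kvdb_memory_bytes / (8 * 1024 * 1024 * 1024)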

Implementing Prometheus Metrics

package metrics

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

type Metrics struct {
    // Counters - always increase
    RequestsTotal prometheus.Counter
    ErrorsTotal   prometheus.Counter
    
    // Gauges - can go up/down
    ConnectionCount prometheus.Gauge
    MemoryBytes     prometheus.Gauge
    
    // Histograms - distribution with buckets
    RequestLatency  prometheus.Histogram
}

// NewMetrics creates all metrics
func NewMetrics() *Metrics {
    return &Metrics{
        RequestsTotal: promauto.NewCounter(prometheus.CounterOpts{
            Name: "kvdb_requests_total",
            Help: "Total number of requests",
        }),
        
        ErrorsTotal: promauto.NewCounter(prometheus.CounterOpts{
            Name: "kvdb_errors_total",
            Help: "Total number of errors",
        }),
        
        ConnectionCount: promauto.NewGauge(prometheus.GaugeOpts{
            Name: "kvdb_connections",
            Help: "Current number of connections",
        }),
        
        MemoryBytes: promauto.NewGauge(prometheus.GaugeOpts{
            Name: "kvdb_memory_bytes",
            Help: "Memory usage in bytes",
        }),
        
        RequestLatency: promauto.NewHistogram(prometheus.HistogramOpts{
            Name: "kvdb_request_latency_seconds",
            Help: "Request latency",
            Buckets: []float64{0.001, 0.01, 0.1, 1},
        }),
    }
}

// RecordRequest records a request
func (m *Metrics) RecordRequest(duration float64, hasError bool) {
    m.RequestsTotal.Inc()
    m.RequestLatency.Observe(duration)
    
    if hasError {
        m.ErrorsTotal.Inc()
    }
}

// UpdateConnections updates connection gauge
func (m *Metrics) UpdateConnections(count int64) {
    m.ConnectionCount.Set(float64(count))
}

// UpdateMemory updates memory gauge
func (m *Metrics) UpdateMemory(bytes int64) {
    m.MemoryBytes.Set(float64(bytes))
}
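The metrics above are unlabeled. To break requests down by operation (GET, PUT, DELETE), client_golang provides labeled vector variants; a brief sketch, with the metric and label names chosen for illustration:

// CounterVec and HistogramVec partition a metric by label values.
requestsByOp := promauto.NewCounterVec(prometheus.CounterOpts{
    Name: "kvdb_requests_by_op_total",
    Help: "Total requests, partitioned by operation",
}, []string{"operation"})

latencyByOp := promauto.NewHistogramVec(prometheus.HistogramOpts{
    Name:    "kvdb_request_latency_by_op_seconds",
    Help:    "Request latency by operation",
    Buckets: []float64{0.001, 0.01, 0.1, 1},
}, []string{"operation"})

// Record against a specific label value.
requestsByOp.WithLabelValues("GET").Inc()
latencyByOp.WithLabelValues("GET").Observe(0.002)

Keep label cardinality low: every distinct label value creates a separate time series in Prometheus.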

Using Metrics in Server

import (
    "log"
    "net/http"
    "time"

    "github.com/prometheus/client_golang/prometheus/promhttp"
)

type Server struct {
    metrics *Metrics
}

// HandleRequest records metrics for each request
func (s *Server) HandleRequest(operation string) error {
    start := time.Now()
    
    // Do work
    err := s.doWork()
    
    // Record metrics
    duration := time.Since(start).Seconds()
    s.metrics.RecordRequest(duration, err != nil)
    
    return err
}

// ExposeMetrics serves the Prometheus endpoint on addr (e.g. ":9090")
func (s *Server) ExposeMetrics(addr string) {
    http.Handle("/metrics", promhttp.Handler())
    go func() {
        if err := http.ListenAndServe(addr, nil); err != nil {
            log.Printf("metrics endpoint failed: %v", err)
        }
    }()
}

Lesson 11.2: Structured Logging

Problem: Unstructured Logs

❌ BAD

2025-01-16 10:15:23 Request from 192.168.1.1 with key user:100 took 50ms

Problems:
- Hard to parse
- Inconsistent format
- Can't filter/search easily

Solution: Structured Logging

✅ GOOD

{
  "timestamp": "2025-01-16T10:15:23Z",
  "level": "error",
  "message": "request failed",
  "client_ip": "192.168.1.1",
  "key": "user:100",
  "duration_ms": 50,
  "error": "timeout"
}

Benefits:
- Easy to parse
- Consistent format
- Can filter/search
- Can aggregate

Using zap Logger

import "go.uber.org/zap"

// Create logger (check the error in real code)
logger, _ := zap.NewProduction()
defer logger.Sync()

// Log with fields
logger.Info("operation completed",
    zap.String("operation", "GET"),
    zap.String("key", "user:100"),
    zap.Duration("latency", 50*time.Millisecond),
    zap.Int("status_code", 200),
)

// Log errors
logger.Error("operation failed",
    zap.String("operation", "PUT"),
    zap.Error(err),
    zap.Duration("latency", 100*time.Millisecond),
)

// Log levels
logger.Debug("debug info")
logger.Info("informational")
logger.Warn("warning")
logger.Error("error")
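Beyond per-call fields, zap's With attaches fields to a child logger once, so every subsequent entry carries them; this is useful for per-component or per-request context. A brief sketch (the field names and reqID variable are illustrative):

// Attach shared context once; every entry below includes it
storeLog := logger.With(
    zap.String("component", "storage"),
    zap.String("request_id", reqID),
)

storeLog.Info("compaction started", zap.Int("sstables", 4))
storeLog.Warn("compaction slow", zap.Duration("elapsed", 2*time.Second))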

Lesson 11.3: Performance Profiling

CPU Profiling

import (
    "log"
    "os"
    "runtime/pprof"
)

// Start profiling. Note the defer order: deferred calls run
// last-in, first-out, so StopCPUProfile must be deferred after
// f.Close() to flush the profile before the file is closed.
f, err := os.Create("cpu.prof")
if err != nil {
    log.Fatal(err)
}
defer f.Close()

if err := pprof.StartCPUProfile(f); err != nil {
    log.Fatal(err)
}
defer pprof.StopCPUProfile()

// Run the workload to profile
for i := 0; i < 1000000; i++ {
    store.Get([]byte("key"))
}

// Analyze: go tool pprof cpu.prof

Memory Profiling

import "runtime/pprof"

// Capture heap
f, _ := os.Create("heap.prof")
pprof.WriteHeapProfile(f)
f.Close()

// Analyze: go tool pprof heap.prof

HTTP Profiling Endpoint

import _ "net/http/pprof"

// Available endpoints:
// /debug/pprof/profile?seconds=30  - CPU profile
// /debug/pprof/heap                - Memory profile
// /debug/pprof/goroutine           - Goroutine profile
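Note that the blank import only registers these handlers on http.DefaultServeMux; a server must still be listening for them to be reachable. A minimal sketch, assuming a separate private port for profiling:

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof handlers on DefaultServeMux
)

func startProfiling() {
    go func() {
        // Bind to localhost so profiles are not publicly reachable
        if err := http.ListenAndServe("localhost:6060", nil); err != nil {
            log.Printf("pprof endpoint failed: %v", err)
        }
    }()
}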

Lesson 11.4: Health Checks

Liveness Probe (Is server alive?)

func (s *Server) Liveness() bool {
    // Keep liveness trivial: if this code runs at all, the
    // process is alive. Dependency checks belong in readiness.
    return s.db != nil
}

// HTTP endpoint
func (s *Server) handleLiveness(w http.ResponseWriter, r *http.Request) {
    if s.Liveness() {
        w.WriteHeader(200)
        w.Write([]byte("alive"))
    } else {
        w.WriteHeader(500)
    }
}

// Usage: GET /healthz

Readiness Probe (Can serve requests?)

func (s *Server) Readiness() (bool, string) {
    // Database check
    _, err := s.db.Get(context.Background(), []byte("health"))
    if err != nil {
        return false, "database down"
    }
    
    // Replication check
    lag := s.replication.GetLag()
    if lag > 5*time.Second {
        return false, "replication lag too high"
    }
    
    // Disk check
    if s.getDiskUsagePercent() > 90 {
        return false, "disk full"
    }
    
    return true, "ready"
}

// HTTP endpoint
func (s *Server) handleReadiness(w http.ResponseWriter, r *http.Request) {
    ready, reason := s.Readiness()
    if ready {
        w.WriteHeader(200)
        w.Write([]byte("ready"))
    } else {
        w.WriteHeader(503)
        w.Write([]byte(reason))
    }
}

// Usage: GET /readyz
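To expose the probes, register both handlers on the server's mux; an orchestrator such as Kubernetes then polls the two paths shown above. A minimal wiring sketch, assuming the handlers from this lesson:

func (s *Server) RegisterHealthEndpoints(mux *http.ServeMux) {
    mux.HandleFunc("/healthz", s.handleLiveness)
    mux.HandleFunc("/readyz", s.handleReadiness)
}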

Lab 11.1: Monitoring System

Objective

Build a comprehensive monitoring system with metrics, logging, profiling, and health checks.

Requirements

  • Prometheus Metrics: Counters, gauges, histograms for all operations
  • Structured Logging: JSON logs with zap for all events
  • Performance Profiling: CPU and memory profiling endpoints
  • Health Checks: Liveness and readiness probes
  • Monitoring Dashboard: Grafana dashboard with key metrics
  • Alerting: Alert rules for critical metrics

Starter Code

type MonitoringSystem struct {
    metrics *Metrics
    logger  *zap.Logger
}

func NewMonitoringSystem() *MonitoringSystem {
    metrics := NewMetrics()
    logger, _ := zap.NewProduction()
    return &MonitoringSystem{metrics, logger}
}

func (ms *MonitoringSystem) RecordOperation(op string, duration time.Duration, err error) {
    ms.metrics.RecordRequest(duration.Seconds(), err != nil)
    
    if err != nil {
        ms.logger.Error("operation failed",
            zap.String("operation", op),
            zap.Duration("latency", duration),
            zap.Error(err),
        )
    } else {
        ms.logger.Info("operation succeeded",
            zap.String("operation", op),
            zap.Duration("latency", duration),
        )
    }
}

// TODO: Implement health checks
func (ms *MonitoringSystem) Liveness() bool {
    return true
}

func (ms *MonitoringSystem) Readiness() (bool, string) {
    return true, "ready"
}

// TODO: Implement profiling endpoints
func (ms *MonitoringSystem) SetupProfiling() {
    // Add /debug/pprof endpoints
}

Test Template

func TestMetrics(t *testing.T) {
    metrics := NewMetrics()
    
    // Record some operations
    metrics.RecordRequest(0.1, false)
    metrics.RecordRequest(0.2, true)
    metrics.UpdateConnections(100)
    
    // Verify metrics were recorded
    // (In real test, you'd check Prometheus registry)
}
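To make that verification concrete, client_golang's testutil package reads a counter's current value directly. A sketch, with one caveat: because NewMetrics registers via promauto on the global registry, it can only run once per test binary (to isolate tests, inject a fresh prometheus.NewRegistry via promauto.With instead):

import "github.com/prometheus/client_golang/prometheus/testutil"

func TestMetricsValues(t *testing.T) {
    metrics := NewMetrics() // must not already be registered globally

    metrics.RecordRequest(0.1, false)
    metrics.RecordRequest(0.2, true)

    // ToFloat64 returns the current value of a counter or gauge
    if got := testutil.ToFloat64(metrics.RequestsTotal); got != 2 {
        t.Errorf("RequestsTotal = %v, want 2", got)
    }
    if got := testutil.ToFloat64(metrics.ErrorsTotal); got != 1 {
        t.Errorf("ErrorsTotal = %v, want 1", got)
    }
}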

func TestHealthChecks(t *testing.T) {
    ms := NewMonitoringSystem()
    
    // Test liveness
    assert.True(t, ms.Liveness())
    
    // Test readiness
    ready, reason := ms.Readiness()
    assert.True(t, ready)
    assert.Equal(t, "ready", reason)
}

func TestLogging(t *testing.T) {
    logger, _ := zap.NewDevelopment()
    
    logger.Info("test message",
        zap.String("key", "value"),
        zap.Int("count", 42),
    )
    
    // In real test, you'd capture log output
}

Acceptance Criteria

  • ✅ All operations record metrics
  • ✅ Structured JSON logs for all events
  • ✅ CPU and memory profiling working
  • ✅ Liveness probe responds correctly
  • ✅ Readiness probe checks all dependencies
  • ✅ Grafana dashboard shows key metrics
  • ✅ Alert rules trigger on critical conditions
  • ✅ > 90% code coverage
  • ✅ All tests pass

Summary: Week 11 Complete

By completing Week 11, you've learned and implemented:

1. Prometheus Metrics

  • Counter, Gauge, Histogram, Summary
  • Four golden signals monitoring
  • Custom metrics for database operations
  • Prometheus endpoint exposure

2. Structured Logging

  • JSON structured logs with zap
  • Consistent log format
  • Easy filtering and searching
  • Log levels and context

3. Performance Profiling

  • CPU profiling for hot paths
  • Memory profiling for leaks
  • HTTP profiling endpoints
  • Goroutine analysis

4. Health Checks

  • Liveness probe (is alive?)
  • Readiness probe (can serve?)
  • Dependency health checks
  • Orchestration integration

Key Skills Mastered:

  • ✅ Monitor system with Prometheus metrics
  • ✅ Debug with structured JSON logs
  • ✅ Profile CPU and memory usage
  • ✅ Health checks for orchestration
  • ✅ Four golden signals monitoring
  • ✅ Production-ready observability

Ready for Week 12?

Next week we'll focus on performance optimization, load testing, and capacity planning to maximize system performance.

Continue to Week 12: Performance Optimization →