Lesson 11
Monitoring & observability
Metrics, logs, health checks, and SLO thinking.
Learning Objectives
- Master Prometheus metrics for system monitoring
- Implement structured logging with zap
- Use performance profiling to identify bottlenecks
- Build health checks for orchestration
- Design comprehensive observability systems
- Monitor the four golden signals of system health
Lesson 11.1: Prometheus Metrics
What are Metrics?
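Metrics are numeric measurements, sampled over time, that help you understand system behavior. Prometheus defines four metric types:
Counter: only increases (and resets to zero when the process restarts)
- requests_total
- errors_total
- bytes_written_total
Gauge: can go up or down
- current_connections
- memory_usage_bytes
- queue_length
Histogram: distribution of values, counted into configurable buckets
- latency_seconds with buckets
- request_size_bytes
- response_time_seconds
Summary: quantiles computed on the client side
- request_duration_seconds{quantile="0.5"} 0.1
- request_duration_seconds{quantile="0.95"} 0.5
- request_duration_seconds{quantile="0.99"} 1.0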
Four Golden Signals
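Track these four signals for every user-facing system. The targets below are examples; derive your own from your SLOs:
1. Latency - How long requests take
Metric: request_duration_seconds
Target: p99 < 100ms
2. Traffic - How much work the system is doing
Metric: requests per second, derived from a counter, e.g. rate(kvdb_requests_total[1m])
Target: 10,000+ ops/sec
3. Errors - How many requests fail
Metric: error rate in percent, e.g. 100 * rate(kvdb_errors_total[5m]) / rate(kvdb_requests_total[5m])
Target: < 1%
4. Saturation - How heavily the system's resources are utilized
Metric: cpu_percent, memory_percent
Target: < 70%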
Implementing Prometheus Metrics
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

type Metrics struct {
	// Counters - always increase
	RequestsTotal prometheus.Counter
	ErrorsTotal   prometheus.Counter
	// Gauges - can go up/down
	ConnectionCount prometheus.Gauge
	MemoryBytes     prometheus.Gauge
	// Histograms - distribution with buckets
	RequestLatency prometheus.Histogram
}

// NewMetrics creates all metrics and registers them with the default
// Prometheus registry. Call it once per process: promauto panics if the
// same metric name is registered twice.
func NewMetrics() *Metrics {
	return &Metrics{
		RequestsTotal: promauto.NewCounter(prometheus.CounterOpts{
			Name: "kvdb_requests_total",
			Help: "Total number of requests",
		}),
		ErrorsTotal: promauto.NewCounter(prometheus.CounterOpts{
			Name: "kvdb_errors_total",
			Help: "Total number of errors",
		}),
		ConnectionCount: promauto.NewGauge(prometheus.GaugeOpts{
			Name: "kvdb_connections",
			Help: "Current number of connections",
		}),
		MemoryBytes: promauto.NewGauge(prometheus.GaugeOpts{
			Name: "kvdb_memory_bytes",
			Help: "Memory usage in bytes",
		}),
		RequestLatency: promauto.NewHistogram(prometheus.HistogramOpts{
			Name: "kvdb_request_latency_seconds",
			Help: "Request latency in seconds",
			// Upper bounds in seconds: 1ms, 10ms, 100ms, 1s
			Buckets: []float64{0.001, 0.01, 0.1, 1},
		}),
	}
}

// RecordRequest records one request: its latency in seconds and whether it failed
func (m *Metrics) RecordRequest(duration float64, hasError bool) {
	m.RequestsTotal.Inc()
	m.RequestLatency.Observe(duration)
	if hasError {
		m.ErrorsTotal.Inc()
	}
}

// UpdateConnections updates connection gauge
func (m *Metrics) UpdateConnections(count int64) {
	m.ConnectionCount.Set(float64(count))
}

// UpdateMemory updates memory gauge
func (m *Metrics) UpdateMemory(bytes int64) {
	m.MemoryBytes.Set(float64(bytes))
}
Using Metrics in Server
import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

type Server struct {
	metrics *Metrics
}

// HandleRequest records metrics for each request.
// doWork stands in for the real operation.
func (s *Server) HandleRequest(operation string) error {
	start := time.Now()
	// Do work
	err := s.doWork()
	// Record metrics
	duration := time.Since(start).Seconds()
	s.metrics.RecordRequest(duration, err != nil)
	return err
}

// ExposeMetrics serves the Prometheus endpoint in the background
func (s *Server) ExposeMetrics(addr string) {
	http.Handle("/metrics", promhttp.Handler())
	go func() {
		// ListenAndServe always returns a non-nil error; do not drop it silently
		if err := http.ListenAndServe(addr, nil); err != nil {
			log.Printf("metrics server stopped: %v", err)
		}
	}()
}
Lesson 11.2: Structured Logging
Problem: Unstructured Logs
❌ BAD
2025-01-16 10:15:23 Request from 192.168.1.1 with key user:100 took 50ms
Problems:
- Hard to parse
- Inconsistent format
- Can't filter/search easily
Solution: Structured Logging
✅ GOOD
{
"timestamp": "2025-01-16T10:15:23Z",
"level": "error",
"message": "request failed",
"client_ip": "192.168.1.1",
"key": "user:100",
"duration_ms": 50,
"error": "timeout"
}
Benefits:
- Easy to parse
- Consistent format
- Can filter/search
- Can aggregate
Using zap Logger
import (
	"log"
	"time"

	"go.uber.org/zap"
)

// Create logger
logger, err := zap.NewProduction()
if err != nil {
	log.Fatalf("create logger: %v", err)
}
defer logger.Sync() // flush buffered entries before exit

// Log with fields
logger.Info("operation completed",
	zap.String("operation", "GET"),
	zap.String("key", "user:100"),
	zap.Duration("latency", 50*time.Millisecond),
	zap.Int("status_code", 200),
)

// Log errors (opErr is the error returned by the failed operation)
logger.Error("operation failed",
	zap.String("operation", "PUT"),
	zap.Error(opErr),
	zap.Duration("latency", 100*time.Millisecond),
)

// Log levels (the production config logs Info and above, so Debug is dropped)
logger.Debug("debug info")
logger.Info("informational")
logger.Warn("warning")
logger.Error("error")
Lesson 11.3: Performance Profiling
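CPU Profiling
import (
	"log"
	"os"
	"runtime/pprof"
)

// Start profiling
f, err := os.Create("cpu.prof")
if err != nil {
	log.Fatal(err)
}
// Deferred calls run in reverse order: register Close first so that
// StopCPUProfile flushes the profile before the file is closed.
defer f.Close()
if err := pprof.StartCPUProfile(f); err != nil {
	log.Fatal(err)
}
defer pprof.StopCPUProfile()

// Run operations
for i := 0; i < 1000000; i++ {
	store.Get([]byte("key"))
}
// Analyze: go tool pprof cpu.prof
Memory Profiling
import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
)

// Capture heap
f, err := os.Create("heap.prof")
if err != nil {
	log.Fatal(err)
}
defer f.Close()
runtime.GC() // refresh allocation statistics before the snapshot
if err := pprof.WriteHeapProfile(f); err != nil {
	log.Fatal(err)
}
// Analyze: go tool pprof heap.prof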
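HTTP Profiling Endpoint
import (
	"net/http"
	_ "net/http/pprof" // the blank import registers handlers on http.DefaultServeMux
)

// The endpoints are only reachable once a server serves http.DefaultServeMux,
// e.g. go http.ListenAndServe("localhost:6060", nil)
// Available endpoints:
// /debug/pprof/profile?seconds=30 - CPU profile
// /debug/pprof/heap - Memory profile
// /debug/pprof/goroutine - Goroutine profile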
Lesson 11.4: Health Checks
Liveness Probe (Is server alive?)
func (s *Server) Liveness() bool {
	// Keep this cheap: report alive once the process has initialized its store.
	// Dependency checks belong in the readiness probe, not here.
	return s.db != nil
}

// HTTP endpoint
func (s *Server) handleLiveness(w http.ResponseWriter, r *http.Request) {
	if !s.Liveness() {
		w.WriteHeader(http.StatusInternalServerError)
		return
	}
	w.WriteHeader(http.StatusOK)
	w.Write([]byte("alive"))
}
// Usage: GET /healthz
Readiness Probe (Can serve requests?)
func (s *Server) Readiness() (bool, string) {
	// Database check, bounded so a hung database cannot hang the probe.
	// Assumes the "health" key exists or that a missing key is not an error.
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	if _, err := s.db.Get(ctx, []byte("health")); err != nil {
		return false, "database down"
	}
	// Replication check
	lag := s.replication.GetLag()
	if lag > 5*time.Second {
		return false, "replication lag too high"
	}
	// Disk check
	if s.getDiskUsagePercent() > 90 {
		return false, "disk usage above 90%"
	}
	return true, "ready"
}

// HTTP endpoint
func (s *Server) handleReadiness(w http.ResponseWriter, r *http.Request) {
	ready, reason := s.Readiness()
	if !ready {
		// 503 tells the orchestrator to stop routing traffic here
		w.WriteHeader(http.StatusServiceUnavailable)
		w.Write([]byte(reason))
		return
	}
	w.WriteHeader(http.StatusOK)
	w.Write([]byte("ready"))
}
// Usage: GET /readyz
Lab 11.1: Monitoring System
Objective
Build a comprehensive monitoring system with metrics, logging, profiling, and health checks.
Requirements
- Prometheus Metrics: Counters, gauges, histograms for all operations
- Structured Logging: JSON logs with zap for all events
- Performance Profiling: CPU and memory profiling endpoints
- Health Checks: Liveness and readiness probes
- Monitoring Dashboard: Grafana dashboard with key metrics
- Alerting: Alert rules for critical metrics
Starter Code
type MonitoringSystem struct {
	metrics *Metrics
	logger  *zap.Logger
}

func NewMonitoringSystem() *MonitoringSystem {
	metrics := NewMetrics()
	logger, err := zap.NewProduction()
	if err != nil {
		panic(err) // starter code: fail fast if the logger cannot be built
	}
	return &MonitoringSystem{metrics: metrics, logger: logger}
}

func (ms *MonitoringSystem) RecordOperation(op string, duration time.Duration, err error) {
	ms.metrics.RecordRequest(duration.Seconds(), err != nil)
	if err != nil {
		ms.logger.Error("operation failed",
			zap.String("operation", op),
			zap.Duration("latency", duration),
			zap.Error(err),
		)
	} else {
		ms.logger.Info("operation succeeded",
			zap.String("operation", op),
			zap.Duration("latency", duration),
		)
	}
}

// TODO: Implement health checks
func (ms *MonitoringSystem) Liveness() bool {
	return true
}

func (ms *MonitoringSystem) Readiness() (bool, string) {
	return true, "ready"
}

// TODO: Implement profiling endpoints
func (ms *MonitoringSystem) SetupProfiling() {
	// Add /debug/pprof endpoints
}
Test Template
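// NewMetrics registers on the default Prometheus registry, so calling it twice
// in one test binary panics. Build the system once and share it across tests.
var testSystem = NewMonitoringSystem()

func TestMetrics(t *testing.T) {
	metrics := testSystem.metrics
	// testutil is github.com/prometheus/client_golang/prometheus/testutil
	requestsBefore := testutil.ToFloat64(metrics.RequestsTotal)
	errorsBefore := testutil.ToFloat64(metrics.ErrorsTotal)
	// Record some operations
	metrics.RecordRequest(0.1, false)
	metrics.RecordRequest(0.2, true)
	metrics.UpdateConnections(100)
	// Verify metrics were recorded
	assert.Equal(t, requestsBefore+2, testutil.ToFloat64(metrics.RequestsTotal))
	assert.Equal(t, errorsBefore+1, testutil.ToFloat64(metrics.ErrorsTotal))
	assert.Equal(t, 100.0, testutil.ToFloat64(metrics.ConnectionCount))
}

func TestHealthChecks(t *testing.T) {
	ms := testSystem
	// Test liveness
	assert.True(t, ms.Liveness())
	// Test readiness
	ready, reason := ms.Readiness()
	assert.True(t, ready)
	assert.Equal(t, "ready", reason)
}

func TestLogging(t *testing.T) {
	logger, err := zap.NewDevelopment()
	if err != nil {
		t.Fatalf("create logger: %v", err)
	}
	logger.Info("test message",
		zap.String("key", "value"),
		zap.Int("count", 42),
	)
	// In real test, you'd capture log output
}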
Acceptance Criteria
- ✅ All operations record metrics
- ✅ Structured JSON logs for all events
- ✅ CPU and memory profiling working
- ✅ Liveness probe responds correctly
- ✅ Readiness probe checks all dependencies
- ✅ Grafana dashboard shows key metrics
- ✅ Alert rules trigger on critical conditions
- ✅ > 90% code coverage
- ✅ All tests pass
Summary: Week 11 Complete
By completing Week 11, you’ve learned and implemented:
1. Prometheus Metrics
- Counter, Gauge, Histogram, Summary
- Four golden signals monitoring
- Custom metrics for database operations
- Prometheus endpoint exposure
2. Structured Logging
- JSON structured logs with zap
- Consistent log format
- Easy filtering and searching
- Log levels and context
3. Performance Profiling
- CPU profiling for hot paths
- Memory profiling for leaks
- HTTP profiling endpoints
- Goroutine analysis
4. Health Checks
- Liveness probe (is alive?)
- Readiness probe (can serve?)
- Dependency health checks
- Orchestration integration
Key Skills Mastered:
- ✅ Monitor system with Prometheus metrics
- ✅ Debug with structured JSON logs
- ✅ Profile CPU and memory usage
- ✅ Health checks for orchestration
- ✅ Four golden signals monitoring
- ✅ Production-ready observability
Ready for Week 12?
Next week we’ll focus on performance optimization, load testing, and capacity planning to maximize system performance.