Lag Management and Performance Monitoring

Master Kafka lag management: monitoring strategies, Burrow setup, Prometheus integration, and building production-ready dashboards for consumer performance.

What is Consumer Lag?

Consumer lag = How far behind your consumer is from the latest message in a partition.

Partition 0: [msg-0] [msg-1] [msg-2] [msg-3] [msg-4] [msg-5] [msg-6]
                                    ↑                    ↑
                              consumer offset        latest offset
                                    (3)                  (6)
                                   
                            Lag = 6 - 3 = 3 messages
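The arithmetic above is simple enough to sketch directly. The offsets below are hypothetical: partition 0 uses the numbers from the diagram, and the other two partitions are made up for illustration:

```python
# Hypothetical offsets for one consumer group (partition -> offset).
# Partition 0 matches the diagram above: latest = 6, committed = 3.
latest_offsets = {0: 6, 1: 42, 2: 17}
committed_offsets = {0: 3, 1: 42, 2: 10}

# Lag per partition = log-end offset minus the group's committed offset
per_partition_lag = {
    p: latest_offsets[p] - committed_offsets[p] for p in latest_offsets
}
total_lag = sum(per_partition_lag.values())

print(per_partition_lag)  # {0: 3, 1: 0, 2: 7}
print(total_lag)          # 10
```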

Why lag matters:

  • High lag = Slow processing = Data staleness
  • Zero lag = Real-time processing = Good
  • Negative lag = Committed offset ahead of the log-end offset = Shouldn't happen (usually stale or racy metrics, occasionally a real bug)

The 7 Common Lag Patterns

Here are the patterns I see in production and how to fix them:

1. The Steady Climb

Lag: 0 → 100 → 500 → 1000 → 5000 → 10000...
Cause: Consumer can't keep up with producer rate
Fix: Scale consumers or optimize processing

2. The Spiky Pattern

Lag: 0 → 0 → 0 → 5000 → 0 → 0 → 0...
Cause: Bursty traffic or slow processing spikes
Fix: Increase batch size, add buffering

3. The Staircase

Lag: 0 → 1000 → 1000 → 1000 → 2000 → 2000...
Cause: Consumer stuck on specific messages
Fix: Dead letter queue, skip problematic messages

4. The Roller Coaster

Lag: 0 → 5000 → 0 → 8000 → 0 → 12000...
Cause: Rebalancing storms
Fix: Static membership, cooperative rebalancing

5. The Plateau

Lag: 0 → 10000 → 10000 → 10000 → 10000...
Cause: Consumer crashed, not processing
Fix: Health checks, auto-restart

6. The Sawtooth

Lag: 0 → 1000 → 0 → 1000 → 0 → 1000...
Cause: Normal batch processing
Fix: Nothing - this is healthy

7. The Explosion

Lag: 0 → 0 → 0 → 1000000 → 2000000...
Cause: Consumer completely stopped
Fix: Emergency scaling, circuit breakers
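Several of these patterns can be distinguished automatically from a lag time series. Here's a rough heuristic sketch — the function name, thresholds, and labels are illustrative, not a standard API:

```python
def classify_lag_pattern(lag_samples):
    """Heuristically classify a consumer-lag time series.

    lag_samples: lag values sampled at a fixed interval, oldest first.
    The labels mirror the patterns above; tune the rules for your workload.
    """
    if not lag_samples:
        return 'no-data'
    deltas = [b - a for a, b in zip(lag_samples, lag_samples[1:])]
    if max(lag_samples) == 0:
        return 'healthy'                # lag never built up
    if lag_samples[-1] > 0 and deltas and all(d == 0 for d in deltas[-3:]):
        return 'plateau'                # stuck at a level: consumer likely stopped
    if all(d >= 0 for d in deltas) and lag_samples[-1] > lag_samples[0]:
        return 'steady-climb'           # consumer can't keep up with producers
    if lag_samples.count(0) >= len(lag_samples) // 2:
        return 'sawtooth-or-spiky'      # drains back to zero: usually healthy
    return 'irregular'
```

In practice you'd feed this from whatever lag metric you scrape (JMX, Burrow, or the Admin API below) and page only on the climbing and plateau cases.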

Measuring Lag: The Right Way

There are multiple ways to measure lag, each with trade-offs:

1. JMX Metrics (Most Common)

# Per partition
kafka.consumer:type=consumer-fetch-manager-metrics,client-id=*,topic=*,partition=*
  records-lag-max
  records-lag-avg

# Per consumer group
kafka.consumer:type=consumer-coordinator-metrics,client-id=*
  rebalance-rate-per-hour

2. Kafka Admin API

from kafka import KafkaAdminClient, KafkaConsumer

admin = KafkaAdminClient(bootstrap_servers='localhost:9092')
consumer = KafkaConsumer(bootstrap_servers='localhost:9092')

# Committed offsets for the group (TopicPartition -> OffsetAndMetadata)
committed = admin.list_consumer_group_offsets('my-group')

# Log-end (latest) offsets for the same partitions
latest = consumer.end_offsets(list(committed.keys()))

# Calculate lag per partition
for tp, meta in committed.items():
    lag = latest[tp] - meta.offset
    print(f"{tp.topic}[{tp.partition}]: {lag} messages behind")

3. Burrow (LinkedIn's Tool)

# Install Burrow (requires Go; bare 'go get' installs are deprecated since Go 1.17)
go install github.com/linkedin/Burrow@latest

# Start Burrow
./burrow --config-dir=./config

# Check lag via HTTP API
curl http://localhost:8000/v3/kafka/local/consumer/my-group/lag
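The same endpoint is easy to consume programmatically. A minimal sketch using only the standard library — the response field names (`status`, `totallag`) follow Burrow's documented v3 response shape, but verify them against your Burrow version:

```python
import json
import urllib.request

def fetch_group_lag(base_url, cluster, group):
    """Fetch a consumer group's lag summary from Burrow's v3 HTTP API."""
    url = f'{base_url}/v3/kafka/{cluster}/consumer/{group}/lag'
    with urllib.request.urlopen(url) as resp:
        return summarize(json.load(resp))

def summarize(body):
    # 'status' and 'totallag' are Burrow's field names; check your version
    status = body['status']
    return {'state': status['status'], 'total_lag': status['totallag']}

# Usage: fetch_group_lag('http://localhost:8000', 'local', 'my-group')
```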

Setting Up Prometheus + Grafana

The gold standard for Kafka monitoring:

1. JMX Exporter

# Attach the JMX exporter java agent when starting the broker
# (kafka-server-start.sh is a shell script, so pass the agent via KAFKA_OPTS)
export KAFKA_OPTS="-javaagent:/path/to/jmx_prometheus_javaagent.jar=8080:config.yaml"
bin/kafka-server-start.sh config/server.properties

2. Prometheus Config

# prometheus.yml
scrape_configs:
  - job_name: 'kafka-brokers'
    static_configs:
      - targets: ['broker1:8080', 'broker2:8080', 'broker3:8080']
  
  - job_name: 'kafka-consumers'
    static_configs:
      - targets: ['consumer1:8080', 'consumer2:8080']

3. Grafana Dashboard

# Key panels to include:
- Consumer lag by group and partition
- Consumer throughput (messages/sec)
- Broker metrics (CPU, memory, disk)
- Rebalance frequency
- Error rates
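Example PromQL queries for the first two panels. The metric names below assume the widely used kafka_exporter (`kafka_consumergroup_lag`, `kafka_consumergroup_current_offset`); adjust them to whatever your exporter actually exposes:

```promql
# Consumer lag by group and topic
sum by (consumergroup, topic) (kafka_consumergroup_lag)

# Consumer throughput (messages/sec, derived from committed offsets)
sum by (consumergroup, topic) (rate(kafka_consumergroup_current_offset[5m]))
```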

Alerting Strategies

Don't just monitor - alert on the right things:

🚨 Critical Alerts

  • Lag > 10,000 messages - Consumer falling behind
  • Zero consumer throughput - Consumer stopped
  • Rebalance frequency > 5/hour - Instability
  • Consumer error rate > 1% - Processing issues

⚠️ Warning Alerts

  • Lag > 1,000 messages - Getting behind
  • Consumer throughput < 50% of normal - Slow processing
  • Memory usage > 80% - Resource pressure
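The thresholds above translate directly into Prometheus alerting rules. A sketch, again assuming kafka_exporter's `kafka_consumergroup_lag` metric — swap in your exporter's metric name and tune the `for:` windows:

```yaml
# alert-rules.yml (loaded via rule_files in prometheus.yml)
groups:
  - name: kafka-consumer-alerts
    rules:
      - alert: ConsumerLagCritical
        expr: sum by (consumergroup) (kafka_consumergroup_lag) > 10000
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Group {{ $labels.consumergroup }} is falling behind"
      - alert: ConsumerLagWarning
        expr: sum by (consumergroup) (kafka_consumergroup_lag) > 1000
        for: 10m
        labels:
          severity: warning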

Lag-Based Autoscaling with KEDA

Automatically scale consumers based on lag:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: kafka-consumer-scaler
spec:
  scaleTargetRef:
    name: kafka-consumer
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
  - type: kafka
    metadata:
      bootstrapServers: kafka:9092
      consumerGroup: my-group
      topic: my-topic
      lagThreshold: '1000'

End-to-End Latency Measurement

Lag tells you how far behind you are, but not how long messages take to flow through the system:

import logging
import time
import uuid

logger = logging.getLogger(__name__)

# Producer side: embed a wall-clock timestamp in each message
message = {
    'data': payload,
    'produced_at': time.time(),
    'message_id': str(uuid.uuid4()),  # str() keeps it JSON-serializable
}

# Consumer side
def process_message(message):
    start_time = time.time()
    produced_at = message['produced_at']
    
    # Process message
    result = do_work(message['data'])
    
    # Calculate latencies (producer/consumer clocks must be reasonably in sync)
    processing_latency = time.time() - start_time
    end_to_end_latency = time.time() - produced_at
    
    # Log metrics
    logger.info(f"E2E: {end_to_end_latency:.2f}s, Processing: {processing_latency:.2f}s")

Debugging Slow Consumers

When lag is high, here's your debugging checklist:

1. Check Consumer Metrics

  • Are consumers actually running?
  • Are they polling frequently enough?
  • Are they committing offsets?

2. Check Processing Logic

  • Is processing taking too long?
  • Are there database bottlenecks?
  • Are there external API calls?

3. Check Resource Usage

  • CPU usage high?
  • Memory usage high?
  • Network I/O saturated?

4. Check Partition Distribution

  • Are all partitions being consumed?
  • Is one partition hot?
  • Are consumers evenly distributed?
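For step 4, the kafka-consumer-groups CLI that ships with Kafka answers all three questions at a glance, showing per-partition offsets, lag, and which consumer instance owns each partition:

```shell
# Describe the group: CURRENT-OFFSET, LOG-END-OFFSET, LAG, and
# consumer assignment for every partition
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
    --describe --group my-group

# List all consumer groups on the cluster
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list
```

A single partition with far higher LAG than its siblings points to a hot key; partitions with no assigned CONSUMER-ID point to a dead or missing consumer.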

Production Monitoring Checklist

✅ Must Have

  • Consumer lag monitoring (per partition)
  • Consumer throughput tracking
  • Error rate monitoring
  • Rebalance frequency alerts
  • End-to-end latency tracking

✅ Nice to Have

  • Consumer group health dashboard
  • Partition distribution visualization
  • Historical lag trends
  • Automated scaling based on lag
  • Dead letter queue monitoring

Key Takeaways

  1. Monitor lag patterns, not just numbers - patterns tell the real story
  2. Set up proper alerting - don't just monitor, alert on what matters
  3. Use Prometheus + Grafana - industry standard for a reason
  4. Implement autoscaling - let the system scale itself
  5. Measure end-to-end latency - lag is just one part of the picture

Next Steps

Ready to optimize Kafka storage? Check out our next lesson on Storage, Retention, and Log Management where we'll learn how Kafka stores data and optimize for performance.