Lag Management and Performance Monitoring

Master Kafka lag management: monitoring strategies, Burrow setup, Prometheus integration, and building production-ready dashboards for consumer performance.

What is Consumer Lag?

Consumer lag = How far behind your consumer is from the latest message in a partition.

Partition 0: [msg-0] [msg-1] [msg-2] [msg-3] [msg-4] [msg-5] [msg-6]
                                    ↑                    ↑
                              consumer offset        latest offset
                                    (3)                  (6)
                                   
                            Lag = 6 - 3 = 3 messages
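The arithmetic above is simple enough to sketch directly. The offsets below are hypothetical: partition 0 uses the numbers from the diagram, and the other two partitions are made up for illustration:

```python
# Hypothetical offsets for one consumer group (partition -> offset).
# Partition 0 matches the diagram above: latest = 6, committed = 3.
latest_offsets = {0: 6, 1: 42, 2: 17}
committed_offsets = {0: 3, 1: 42, 2: 10}

# Lag per partition = log-end offset minus the group's committed offset
per_partition_lag = {
    p: latest_offsets[p] - committed_offsets[p] for p in latest_offsets
}
total_lag = sum(per_partition_lag.values())

print(per_partition_lag)  # {0: 3, 1: 0, 2: 7}
print(total_lag)          # 10
```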

Why lag matters:

  • High lag = Slow processing = Data staleness
  • Zero lag = Real-time processing = Good
  • Negative lag = Committed offset ahead of the log-end offset = Shouldn't happen (usually stale or racy metrics, occasionally a real bug)

The 7 Common Lag Patterns

Here are the patterns I see in production and how to fix them:

1. The Steady Climb

Lag: 0 → 100 → 500 → 1000 → 5000 → 10000...
Cause: Consumer can't keep up with producer rate
Fix: Scale consumers or optimize processing

2. The Spiky Pattern

Lag: 0 → 0 → 0 → 5000 → 0 → 0 → 0...
Cause: Bursty traffic or slow processing spikes
Fix: Increase batch size, add buffering

3. The Staircase

Lag: 0 → 1000 → 1000 → 1000 → 2000 → 2000...
Cause: Consumer stuck on specific messages
Fix: Dead letter queue, skip problematic messages

4. The Roller Coaster

Lag: 0 → 5000 → 0 → 8000 → 0 → 12000...
Cause: Rebalancing storms
Fix: Static membership, cooperative rebalancing

5. The Plateau

Lag: 0 → 10000 → 10000 → 10000 → 10000...
Cause: Consumer crashed, not processing
Fix: Health checks, auto-restart

6. The Sawtooth

Lag: 0 → 1000 → 0 → 1000 → 0 → 1000...
Cause: Normal batch processing
Fix: Nothing - this is healthy

7. The Explosion

Lag: 0 → 0 → 0 → 1000000 → 2000000...
Cause: Consumer completely stopped
Fix: Emergency scaling, circuit breakers
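Several of these patterns can be distinguished automatically from a lag time series. Here's a rough heuristic sketch — the function name, thresholds, and labels are illustrative, not a standard API:

```python
def classify_lag_pattern(lag_samples):
    """Heuristically classify a consumer-lag time series.

    lag_samples: lag values sampled at a fixed interval, oldest first.
    The labels mirror the patterns above; tune the rules for your workload.
    """
    if not lag_samples:
        return 'no-data'
    deltas = [b - a for a, b in zip(lag_samples, lag_samples[1:])]
    if max(lag_samples) == 0:
        return 'healthy'                # lag never built up
    if lag_samples[-1] > 0 and deltas and all(d == 0 for d in deltas[-3:]):
        return 'plateau'                # stuck at a level: consumer likely stopped
    if all(d >= 0 for d in deltas) and lag_samples[-1] > lag_samples[0]:
        return 'steady-climb'           # consumer can't keep up with producers
    if lag_samples.count(0) >= len(lag_samples) // 2:
        return 'sawtooth-or-spiky'      # drains back to zero: usually healthy
    return 'irregular'
```

In practice you'd feed this from whatever lag metric you scrape (JMX, Burrow, or the Admin API below) and page only on the climbing and plateau cases.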

Measuring Lag: The Right Way

There are multiple ways to measure lag, each with trade-offs:

1. JMX Metrics (Most Common)

# Per partition
kafka.consumer:type=consumer-fetch-manager-metrics,client-id=*,topic=*,partition=*
  records-lag-max
  records-lag-avg

# Per consumer group
kafka.consumer:type=consumer-coordinator-metrics,client-id=*
  rebalance-rate-per-hour

2. Kafka Admin API

from kafka import KafkaAdminClient, KafkaConsumer

admin = KafkaAdminClient(bootstrap_servers='localhost:9092')
consumer = KafkaConsumer(bootstrap_servers='localhost:9092')

# Committed offsets for the group (TopicPartition -> OffsetAndMetadata)
committed = admin.list_consumer_group_offsets('my-group')

# Log-end (latest) offsets for the same partitions
latest = consumer.end_offsets(list(committed.keys()))

# Calculate lag per partition
for tp, meta in committed.items():
    lag = latest[tp] - meta.offset
    print(f"{tp.topic}[{tp.partition}]: {lag} messages behind")

3. Burrow (LinkedIn's Tool)

# Install Burrow (requires Go; bare 'go get' installs are deprecated since Go 1.17)
go install github.com/linkedin/Burrow@latest

# Start Burrow
./burrow --config-dir=./config

# Check lag via HTTP API
curl http://localhost:8000/v3/kafka/local/consumer/my-group/lag
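The same endpoint is easy to consume programmatically. A minimal sketch using only the standard library — the response field names (`status`, `totallag`) follow Burrow's documented v3 response shape, but verify them against your Burrow version:

```python
import json
import urllib.request

def fetch_group_lag(base_url, cluster, group):
    """Fetch a consumer group's lag summary from Burrow's v3 HTTP API."""
    url = f'{base_url}/v3/kafka/{cluster}/consumer/{group}/lag'
    with urllib.request.urlopen(url) as resp:
        return summarize(json.load(resp))

def summarize(body):
    # 'status' and 'totallag' are Burrow's field names; check your version
    status = body['status']
    return {'state': status['status'], 'total_lag': status['totallag']}

# Usage: fetch_group_lag('http://localhost:8000', 'local', 'my-group')
```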

Setting Up Prometheus + Grafana

The gold standard for Kafka monitoring:

1. JMX Exporter

# Attach the JMX exporter java agent when starting the broker
# (kafka-server-start.sh is a shell script, so pass the agent via KAFKA_OPTS)
export KAFKA_OPTS="-javaagent:/path/to/jmx_prometheus_javaagent.jar=8080:config.yaml"
bin/kafka-server-start.sh config/server.properties

2. Prometheus Config

# prometheus.yml
scrape_configs:
  - job_name: 'kafka-brokers'
    static_configs:
      - targets: ['broker1:8080', 'broker2:8080', 'broker3:8080']
  
  - job_name: 'kafka-consumers'
    static_configs:
      - targets: ['consumer1:8080', 'consumer2:8080']

3. Grafana Dashboard

# Key panels to include:
- Consumer lag by group and partition
- Consumer throughput (messages/sec)
- Broker metrics (CPU, memory, disk)
- Rebalance frequency
- Error rates
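Example PromQL queries for the first two panels. The metric names below assume the widely used kafka_exporter (`kafka_consumergroup_lag`, `kafka_consumergroup_current_offset`); adjust them to whatever your exporter actually exposes:

```promql
# Consumer lag by group and topic
sum by (consumergroup, topic) (kafka_consumergroup_lag)

# Consumer throughput (messages/sec, derived from committed offsets)
sum by (consumergroup, topic) (rate(kafka_consumergroup_current_offset[5m]))
```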

Alerting Strategies

Don't just monitor - alert on the right things:

🚨 Critical Alerts

  • Lag > 10,000 messages - Consumer falling behind
  • Zero consumer throughput - Consumer stopped
  • Rebalance frequency > 5/hour - Instability
  • Consumer error rate > 1% - Processing issues

⚠️ Warning Alerts

  • Lag > 1,000 messages - Getting behind
  • Consumer throughput < 50% of normal - Slow processing
  • Memory usage > 80% - Resource pressure
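The thresholds above translate directly into Prometheus alerting rules. A sketch, again assuming kafka_exporter's `kafka_consumergroup_lag` metric — swap in your exporter's metric name and tune the `for:` windows:

```yaml
# alert-rules.yml (loaded via rule_files in prometheus.yml)
groups:
  - name: kafka-consumer-alerts
    rules:
      - alert: ConsumerLagCritical
        expr: sum by (consumergroup) (kafka_consumergroup_lag) > 10000
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Group {{ $labels.consumergroup }} is falling behind"
      - alert: ConsumerLagWarning
        expr: sum by (consumergroup) (kafka_consumergroup_lag) > 1000
        for: 10m
        labels:
          severity: warning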

Lag-Based Autoscaling with KEDA

Automatically scale consumers based on lag:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: kafka-consumer-scaler
spec:
  scaleTargetRef:
    name: kafka-consumer
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
  - type: kafka
    metadata:
      bootstrapServers: kafka:9092
      consumerGroup: my-group
      topic: my-topic
      lagThreshold: '1000'

End-to-End Latency Measurement

Lag tells you how far behind you are, but not how long messages take to flow through the system:

import logging
import time
import uuid

logger = logging.getLogger(__name__)

# Producer side: embed a wall-clock timestamp in each message
message = {
    'data': payload,
    'produced_at': time.time(),
    'message_id': str(uuid.uuid4()),  # str() keeps it JSON-serializable
}

# Consumer side
def process_message(message):
    start_time = time.time()
    produced_at = message['produced_at']
    
    # Process message
    result = do_work(message['data'])
    
    # Calculate latencies (producer/consumer clocks must be reasonably in sync)
    processing_latency = time.time() - start_time
    end_to_end_latency = time.time() - produced_at
    
    # Log metrics
    logger.info(f"E2E: {end_to_end_latency:.2f}s, Processing: {processing_latency:.2f}s")

Debugging Slow Consumers

When lag is high, here's your debugging checklist:

1. Check Consumer Metrics

  • Are consumers actually running?
  • Are they polling frequently enough?
  • Are they committing offsets?

2. Check Processing Logic

  • Is processing taking too long?
  • Are there database bottlenecks?
  • Are there external API calls?

3. Check Resource Usage

  • CPU usage high?
  • Memory usage high?
  • Network I/O saturated?

4. Check Partition Distribution

  • Are all partitions being consumed?
  • Is one partition hot?
  • Are consumers evenly distributed?
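For step 4, the kafka-consumer-groups CLI that ships with Kafka answers all three questions at a glance, showing per-partition offsets, lag, and which consumer instance owns each partition:

```shell
# Describe the group: CURRENT-OFFSET, LOG-END-OFFSET, LAG, and
# consumer assignment for every partition
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
    --describe --group my-group

# List all consumer groups on the cluster
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list
```

A single partition with far higher LAG than its siblings points to a hot key; partitions with no assigned CONSUMER-ID point to a dead or missing consumer.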

Production Monitoring Checklist

✅ Must Have

  • Consumer lag monitoring (per partition)
  • Consumer throughput tracking
  • Error rate monitoring
  • Rebalance frequency alerts
  • End-to-end latency tracking

✅ Nice to Have

  • Consumer group health dashboard
  • Partition distribution visualization
  • Historical lag trends
  • Automated scaling based on lag
  • Dead letter queue monitoring

Key Takeaways

  1. Monitor lag patterns, not just numbers - patterns tell the real story
  2. Set up proper alerting - don't just monitor, alert on what matters
  3. Use Prometheus + Grafana - industry standard for a reason
  4. Implement autoscaling - let the system scale itself
  5. Measure end-to-end latency - lag is just one part of the picture

Next Steps

Ready to optimize Kafka storage? Check out our next lesson on Storage, Retention, and Log Management where we'll learn how Kafka stores data and optimize for performance.