Lesson 4
Rebalancing Deep Dive and Optimization
Master Kafka rebalancing: lifecycle, Cooperative-Sticky Assignor, heartbeat management, and performance tuning. Prevent rebalance storms in production.
Why Rebalancing is the #1 Production Problem
Here’s what keeps Kafka operators up at night:
🚨 The Rebalance Storm
-
Consumer 1 crashes
-
Rebalance starts - ALL consumers stop processing
-
Consumer 2 restarts during rebalance
-
Another rebalance starts
-
Consumer 3 times out during processing
-
Another rebalance…
-
Your lag spikes to 1M messages
-
PagerDuty explodes
In production, this can cascade into a 30-minute outage. Let’s fix it.
The Rebalancing Lifecycle
Understanding what happens during a rebalance:
Phase 1: Detection
-
Coordinator detects change (consumer join/leave/timeout)
-
Triggers rebalance protocol
Phase 2: Stop the World
-
ALL consumers stop consuming
-
Current processing continues but no new messages
-
This is the expensive part
Phase 3: Assignment
-
Coordinator calculates new partition assignments
-
Sends assignments to all consumers
Phase 4: Resume
-
Consumers resume from their assigned partitions
-
Processing restarts
Eager vs Cooperative Rebalancing
This is the biggest improvement in Kafka’s history:
Eager Rebalancing (Old Way)
Before: Consumer1=[P0,P1,P2], Consumer2=[P3,P4,P5]
Consumer3 joins
During rebalance: ALL consumers stop
After: Consumer1=[P0,P1], Consumer2=[P2,P3], Consumer3=[P4,P5]
Result: 3 consumers stopped, 3 consumers restarted
Cooperative Rebalancing (New Way)
Before: Consumer1=[P0,P1,P2], Consumer2=[P3,P4,P5]
Consumer3 joins
During rebalance: Only P4,P5 pause briefly
After: Consumer1=[P0,P1], Consumer2=[P2,P3], Consumer3=[P4,P5]
Result: Minimal disruption
Enable Cooperative Rebalancing
partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor
Available since: Kafka 2.4+
Impact: 90% reduction in rebalance time
Heartbeat and Session Tuning
These parameters control when consumers get kicked out:
session.timeout.ms
Default: 45000ms (45 seconds)
What it means: “If I don’t hear from you in 45s, I’ll kick you out”
Tuning:
-
Lower (30s): Faster failure detection, more sensitive to GC pauses
-
Higher (60s): More tolerant of GC, slower failure detection
heartbeat.interval.ms
Default: 3000ms (3 seconds)
What it means: “I’ll send heartbeat every 3s”
Rule: Must be 1 per minute”
Long rebalance times
“Rebalancing group” AND “duration > 30 seconds”
Consumer timeouts
“Member was removed due to consumer poll timeout”
## Production Configuration
### Recommended Settings
Rebalancing
partition.assignment.strategy = org.apache.kafka.clients.consumer.CooperativeStickyAssignor group.instance.id = ${HOSTNAME}-${RANDOM} # Stable ID
Timeouts
session.timeout.ms = 45000 heartbeat.interval.ms = 3000 max.poll.interval.ms = 300000
Performance
max.poll.records = 500 fetch.min.bytes = 1024 fetch.max.wait.ms = 500
## Key Takeaways
1. **Use cooperative rebalancing** - 90% reduction in downtime
2. **Enable static membership** for containerized deployments
3. **Tune timeouts carefully** - balance between responsiveness and stability
4. **Monitor rebalance frequency** - alert on excessive rebalancing
5. **Implement graceful shutdown** to prevent cascading failures
### Next Steps
Ready to monitor your consumers? Check out our next lesson on **Lag Management and Performance Monitoring** where we'll learn how to track and optimize consumer performance.