Lesson 10
Production Operations and Advanced Patterns
Master Kafka production: Connect, MirrorMaker 2, Kubernetes deployment, advanced patterns (Event Sourcing, CQRS), and troubleshooting.
Course Navigation
Back to courseProduction Readiness Checklist
Before going to production, ensure you’ve covered these essentials:
✅ Infrastructure
-
3+ brokers - Fault tolerance
-
SSD storage - Performance
-
Dedicated resources - No resource contention
-
Network redundancy - Multiple network paths
-
Backup strategy - Data protection
✅ Security
-
SSL/TLS encryption - Data in transit
-
SASL authentication - Identity verification
-
ACL authorization - Access control
-
Quota limits - Resource protection
-
Audit logging - Compliance
✅ Monitoring
-
Consumer lag alerts - Processing health
-
Broker metrics - Resource usage
-
Error rate monitoring - Data quality
-
End-to-end latency - Performance tracking
-
Capacity planning - Growth management
Kafka on Kubernetes with Strimzi
Strimzi is the gold standard for running Kafka on Kubernetes:
1. Install Strimzi Operator
# Install Strimzi operator
kubectl create namespace kafka
kubectl create -f 'https://strimzi.io/install/latest?namespace=kafka' -n kafka
# Wait for operator to be ready
kubectl wait --for=condition=Ready pod -l name=strimzi-cluster-operator -n kafka
2. Deploy Kafka Cluster
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
name: my-cluster
namespace: kafka
spec:
kafka:
version: 3.5.0
replicas: 3
listeners:
- name: plain
port: 9092
type: internal
tls: false
- name: tls
port: 9093
type: internal
tls: true
config:
offsets.topic.replication.factor: 3
transaction.state.log.replication.factor: 3
transaction.state.log.min.isr: 2
default.replication.factor: 3
min.insync.replicas: 2
inter.broker.protocol.version: "3.5"
storage:
type: persistent-claim
size: 100Gi
class: fast-ssd
zookeeper:
replicas: 3
storage:
type: persistent-claim
size: 10Gi
class: fast-ssd
3. Deploy Kafka Connect
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnect
metadata:
name: my-connect-cluster
namespace: kafka
spec:
version: 3.5.0
replicas: 2
bootstrapServers: my-cluster-kafka-bootstrap:9092
config:
config.storage.replication.factor: 3
offset.storage.replication.factor: 3
status.storage.replication.factor: 3
group.id: connect-cluster-group
offset.storage.topic: connect-cluster-offsets
config.storage.topic: connect-cluster-configs
status.storage.topic: connect-cluster-status
Advanced Architectural Patterns
These patterns solve common distributed systems problems:
Event Sourcing
Concept: Store all changes as events, not just current state
# Instead of updating user record
user.name = "John Doe"
# Store events
events = [
{"type": "UserCreated", "id": 123, "name": "John"},
{"type": "UserNameChanged", "id": 123, "old_name": "John", "new_name": "John Doe"}
]
Benefits:
- Audit trail - Complete history of changes
- Time travel - Replay events to any point in time
- Debugging - See exactly what happened
- Compliance - Regulatory requirements
CQRS (Command Query Responsibility Segregation)
Concept: Separate read and write models
# Write side (Commands)
POST /users - Create user
PUT /users/123 - Update user
# Read side (Queries)
GET /users/123 - Get user
GET /users/search?name=John - Search users
Benefits:
- Independent scaling - Scale reads and writes separately
- Optimized models - Different schemas for different use cases
- Performance - Read models can be denormalized
Saga Pattern
Concept: Manage distributed transactions with compensating actions
# Order processing saga
1. Reserve inventory → Success
2. Charge credit card → Success
3. Ship order → FAILED
4. Compensate: Refund credit card
5. Compensate: Release inventory
Change Data Capture (CDC)
Concept: Capture database changes and stream them to Kafka
# Database change → Kafka event
INSERT INTO users (id, name) VALUES (123, 'John')
↓
{"op": "c", "before": null, "after": {"id": 123, "name": "John"}}
Kafka Connect and Ecosystem
Kafka Connect integrates Kafka with external systems:
Popular Connectors
Source Connectors (Data In)
-
Debezium - Database CDC
-
JDBC - Any SQL database
-
Elasticsearch - Search indexing
-
MongoDB - Document database
-
Twitter - Social media feeds
Sink Connectors (Data Out)
-
Elasticsearch - Search indexing
-
HDFS - Data lake storage
-
S3 - Cloud storage
-
JDBC - Database writes
-
MongoDB - Document storage
Debezium CDC Example
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnector
metadata:
name: debezium-connector
namespace: kafka
spec:
class: io.debezium.connector.postgresql.PostgresConnector
tasksMax: 1
config:
database.hostname: postgres
database.port: 5432
database.user: postgres
database.password: postgres
database.dbname: mydb
database.server.name: mydb
table.include.list: public.users
plugin.name: pgoutput
slot.name: kafka_slot
publication.name: kafka_publication
Multi-Datacenter Replication
Replicate data across datacenters for disaster recovery:
MirrorMaker 2.0
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaMirrorMaker2
metadata:
name: my-mirror-maker-2
namespace: kafka
spec:
version: 3.5.0
replicas: 2
connectCluster: "target-cluster"
clusters:
- alias: "source-cluster"
bootstrapServers: source-kafka:9092
- alias: "target-cluster"
bootstrapServers: target-kafka:9092
mirrors:
- sourceCluster: "source-cluster"
targetCluster: "target-cluster"
sourceConnector:
tasksMax: 2
config:
replication.factor: 3
offset-syncs.topic.replication.factor: 3
sync.topic.acls.enabled: "false"
checkpointConnector:
tasksMax: 2
config:
checkpoints.topic.replication.factor: 3
sync.group.offsets.enabled: "true"
refresh.topics.interval.seconds: 60
refresh.groups.interval.seconds: 60
Production Monitoring Stack
Complete monitoring solution for Kafka in production:
Prometheus + Grafana
# Prometheus config
global:
scrape_interval: 15s
scrape_configs:
- job_name: 'kafka-brokers'
static_configs:
- targets: ['kafka-0:9092', 'kafka-1:9092', 'kafka-2:9092']
metrics_path: /metrics
scrape_interval: 5s
- job_name: 'kafka-connect'
static_configs:
- targets: ['connect-0:8083', 'connect-1:8083']
metrics_path: /metrics
scrape_interval: 5s
Key Dashboards
Kafka Cluster Overview
-
Broker health and resource usage
-
Topic partition distribution
-
Consumer group lag
-
Producer/consumer throughput
Consumer Monitoring
-
Lag per consumer group
-
Processing rate trends
-
Error rates and exceptions
-
Rebalancing frequency
Troubleshooting Production Issues
Common production problems and how to fix them:
1. Consumer Lag Spikes
🚨 Symptoms
-
Lag jumps from 0 to 100,000+ messages
-
Consumer throughput drops to zero
-
Processing stops completely
🔍 Debugging Steps
-
Check consumer health - are they running?
-
Look for rebalancing events in logs
-
Check for processing errors or exceptions
-
Verify partition assignment
-
Check resource usage (CPU, memory)
2. Broker Out of Disk Space
🚨 Symptoms
-
Broker stops accepting writes
-
Log files grow rapidly
-
Consumer lag increases
🔍 Debugging Steps
-
Check disk usage:
df -h -
Identify large log files:
du -sh /var/kafka-logs/* -
Check retention settings
-
Clean up old segments
-
Scale up disk or add more brokers
3. Rebalancing Storms
🚨 Symptoms
-
Constant rebalancing events
-
Consumers keep getting kicked out
-
Processing is unstable
🔍 Debugging Steps
-
Check session timeout settings
-
Look for GC pauses in logs
-
Verify network connectivity
-
Check for processing timeouts
-
Enable static membership
Performance Tuning Checklist
Broker Tuning
-
num.network.threads = 3 * number of disks
-
num.io.threads = 8 * number of disks
-
socket.send.buffer.bytes = 102400
-
socket.receive.buffer.bytes = 102400
-
log.flush.interval.messages = 10000
Producer Tuning
-
batch.size = 65536 (64KB)
-
linger.ms = 10
-
compression.type = lz4
-
acks = all
-
retries = 2147483647
Consumer Tuning
-
fetch.min.bytes = 1024
-
fetch.max.wait.ms = 500
-
max.poll.records = 500
-
session.timeout.ms = 45000
-
heartbeat.interval.ms = 3000
Disaster Recovery Planning
Prepare for the worst-case scenarios:
Backup Strategy
-
Topic replication - MirrorMaker 2.0
-
Configuration backup - Git repository
-
Certificate backup - Secure storage
-
Schema backup - Schema Registry export
Recovery Procedures
-
RTO (Recovery Time Objective) - 4 hours
-
RPO (Recovery Point Objective) - 15 minutes
-
Failover testing - Monthly drills
-
Documentation - Step-by-step procedures
Key Takeaways
- Production readiness is a process - not a checklist
- Monitor everything - you can’t fix what you can’t see
- Plan for failure - disaster recovery is essential
- Use proven patterns - Event Sourcing, CQRS, Saga
- Test your recovery procedures - practice makes perfect
🎉 Congratulations!
You’ve completed the Apache Kafka Mastery course! You now have the knowledge and skills to build production-ready Kafka systems that can handle millions of events per second.
Next steps:
-
Build a real project using Kafka
-
Join the Kafka community
-
Contribute to open source projects
-
Share your knowledge with others