Production Operations and Advanced Patterns
Master Kafka production operations: Kubernetes deployment, monitoring, advanced patterns, troubleshooting, and building production-ready Kafka systems.
Production Readiness Checklist
Before going to production, ensure you've covered these essentials:
✅ Infrastructure
- 3+ brokers - Fault tolerance
- SSD storage - Performance
- Dedicated resources - No resource contention
- Network redundancy - Multiple network paths
- Backup strategy - Data protection
✅ Security
- SSL/TLS encryption - Data in transit
- SASL authentication - Identity verification
- ACL authorization - Access control
- Quota limits - Resource protection
- Audit logging - Compliance
✅ Monitoring
- Consumer lag alerts - Processing health
- Broker metrics - Resource usage
- Error rate monitoring - Data quality
- End-to-end latency - Performance tracking
- Capacity planning - Growth management
Kafka on Kubernetes with Strimzi
Strimzi is the de facto standard operator for running Kafka on Kubernetes:
1. Install Strimzi Operator
# Install Strimzi operator
kubectl create namespace kafka
kubectl create -f 'https://strimzi.io/install/latest?namespace=kafka' -n kafka
# Wait for operator to be ready
kubectl wait --for=condition=Ready pod -l name=strimzi-cluster-operator -n kafka
2. Deploy Kafka Cluster
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
  namespace: kafka
spec:
  kafka:
    version: 3.5.0
    replicas: 3
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
      - name: tls
        port: 9093
        type: internal
        tls: true
    config:
      offsets.topic.replication.factor: 3
      transaction.state.log.replication.factor: 3
      transaction.state.log.min.isr: 2
      default.replication.factor: 3
      min.insync.replicas: 2
      inter.broker.protocol.version: "3.5"
    storage:
      type: persistent-claim
      size: 100Gi
      class: fast-ssd
  zookeeper:
    replicas: 3
    storage:
      type: persistent-claim
      size: 10Gi
      class: fast-ssd
3. Deploy Kafka Connect
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnect
metadata:
  name: my-connect-cluster
  namespace: kafka
spec:
  version: 3.5.0
  replicas: 2
  bootstrapServers: my-cluster-kafka-bootstrap:9092
  config:
    config.storage.replication.factor: 3
    offset.storage.replication.factor: 3
    status.storage.replication.factor: 3
    group.id: connect-cluster-group
    offset.storage.topic: connect-cluster-offsets
    config.storage.topic: connect-cluster-configs
    status.storage.topic: connect-cluster-status
Advanced Architectural Patterns
These patterns solve common distributed systems problems:
Event Sourcing
Concept: Store all changes as events, not just current state
# Instead of updating user record
user.name = "John Doe"
# Store events
events = [
  {"type": "UserCreated", "id": 123, "name": "John"},
  {"type": "UserNameChanged", "id": 123, "old_name": "John", "new_name": "John Doe"}
]
Benefits:
- Audit trail - Complete history of changes
- Time travel - Replay events to any point in time
- Debugging - See exactly what happened
- Compliance - Regulatory requirements
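A minimal sketch of the write path, assuming the confluent-kafka Python client; the broker address and the user-events topic are placeholders. Keying by the aggregate id keeps every event for one entity in the same partition, so replays happen in order.
import json
from confluent_kafka import Producer

# Placeholder broker address for illustration
producer = Producer({"bootstrap.servers": "localhost:9092"})

def append_event(aggregate_id, event):
    # Key by the aggregate id so all events for one user land in the
    # same partition and replay in the order they were written
    producer.produce(
        "user-events",
        key=str(aggregate_id),
        value=json.dumps(event).encode("utf-8"),
    )

append_event(123, {"type": "UserCreated", "id": 123, "name": "John"})
append_event(123, {"type": "UserNameChanged", "id": 123,
                   "old_name": "John", "new_name": "John Doe"})
producer.flush()
Event-sourced topics are usually configured with long or unlimited retention rather than compaction, because every intermediate event matters, not just the latest value per key.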
CQRS (Command Query Responsibility Segregation)
Concept: Separate read and write models
# Write side (Commands)
POST /users - Create user
PUT /users/123 - Update user
# Read side (Queries)  
GET /users/123 - Get user
GET /users/search?name=John - Search users
Benefits:
- Independent scaling - Scale reads and writes separately
- Optimized models - Different schemas for different use cases
- Performance - Read models can be denormalized
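A sketch of the read side under the same assumptions: a projector consumes the hypothetical user-events topic and maintains a denormalized view that queries read directly.
import json
from confluent_kafka import Consumer

# Placeholder broker, topic, and group names
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "user-read-model",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["user-events"])

users_by_id = {}  # in-memory read model; a real projector would write to a database or cache

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        event = json.loads(msg.value())
        if event["type"] == "UserCreated":
            users_by_id[event["id"]] = {"id": event["id"], "name": event["name"]}
        elif event["type"] == "UserNameChanged":
            users_by_id[event["id"]]["name"] = event["new_name"]
finally:
    consumer.close()
Because the projector owns its own schema, the read model can be rebuilt at any time by resetting the group's offsets and replaying the topic.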
Saga Pattern
Concept: Manage distributed transactions with compensating actions
# Order processing saga
1. Reserve inventory → Success
2. Charge credit card → Success  
3. Ship order → FAILED
4. Compensate: Refund credit card
5. Compensate: Release inventory
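A minimal orchestration sketch of that flow in plain Python; the service calls are stubs standing in for commands published to Kafka and the reply events they would produce.
def reserve_inventory(order): print("inventory reserved")
def release_inventory(order): print("inventory released (compensation)")
def charge_card(order): print("card charged")
def refund_card(order): print("card refunded (compensation)")
def ship_order(order): raise RuntimeError("carrier unavailable")  # simulated failure

def process_order(order):
    # Each step is paired with the compensating action that undoes it
    steps = [
        (reserve_inventory, release_inventory),
        (charge_card, refund_card),
        (ship_order, lambda order: None),
    ]
    completed = []
    for action, compensate in steps:
        try:
            action(order)
            completed.append(compensate)
        except Exception:
            # Undo the steps that already succeeded, newest first
            for undo in reversed(completed):
                undo(order)
            raise

try:
    process_order({"order_id": 42})
except RuntimeError:
    print("order failed and was fully compensated")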
Change Data Capture (CDC)
Concept: Capture database changes and stream them to Kafka
# Database change → Kafka event
INSERT INTO users (id, name) VALUES (123, 'John')
↓
{"op": "c", "before": null, "after": {"id": 123, "name": "John"}}
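A sketch of a downstream consumer handling such change events, assuming the flat Debezium-style envelope shown above (schemas disabled) and placeholder broker, group, and topic names:
import json
from confluent_kafka import Consumer

# Debezium topic names typically follow <server>.<schema>.<table>
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "cdc-demo",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["mydb.public.users"])

while True:
    msg = consumer.poll(1.0)
    # Skip empty polls, errors, and tombstone records (null value)
    if msg is None or msg.error() or msg.value() is None:
        continue
    change = json.loads(msg.value())
    op = change["op"]  # "c" = create, "u" = update, "d" = delete, "r" = snapshot read
    if op in ("c", "u", "r"):
        print("upsert:", change["after"])
    elif op == "d":
        print("delete:", change["before"])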
Kafka Connect and Ecosystem
Kafka Connect integrates Kafka with external systems:
Popular Connectors
Source Connectors (Data In)
- Debezium - Database CDC
- JDBC - Any SQL database
- Elasticsearch - Search indexing
- MongoDB - Document database
- Twitter - Social media feeds
Sink Connectors (Data Out)
- Elasticsearch - Search indexing
- HDFS - Data lake storage
- S3 - Cloud storage
- JDBC - Database writes
- MongoDB - Document storage
Debezium CDC Example
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnector
metadata:
  name: debezium-connector
  namespace: kafka
spec:
  class: io.debezium.connector.postgresql.PostgresConnector
  tasksMax: 1
  config:
    database.hostname: postgres
    database.port: 5432
    database.user: postgres
    database.password: postgres
    database.dbname: mydb
    database.server.name: mydb
    table.include.list: public.users
    plugin.name: pgoutput
    slot.name: kafka_slot
    publication.name: kafka_publication
Multi-Datacenter Replication
Replicate data across datacenters for disaster recovery:
MirrorMaker 2.0
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaMirrorMaker2
metadata:
  name: my-mirror-maker-2
  namespace: kafka
spec:
  version: 3.5.0
  replicas: 2
  connectCluster: "target-cluster"
  clusters:
  - alias: "source-cluster"
    bootstrapServers: source-kafka:9092
  - alias: "target-cluster"
    bootstrapServers: target-kafka:9092
  mirrors:
  - sourceCluster: "source-cluster"
    targetCluster: "target-cluster"
    sourceConnector:
      tasksMax: 2
      config:
        replication.factor: 3
        offset-syncs.topic.replication.factor: 3
        sync.topic.acls.enabled: "false"
    checkpointConnector:
      tasksMax: 2
      config:
        checkpoints.topic.replication.factor: 3
        sync.group.offsets.enabled: "true"
        refresh.topics.interval.seconds: 60
        refresh.groups.interval.seconds: 60
Production Monitoring Stack
Complete monitoring solution for Kafka in production:
Prometheus + Grafana
# Prometheus config
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'kafka-brokers'
    static_configs:
      # Assumes the Prometheus JMX exporter is enabled on the brokers
      # (Strimzi exposes it on port 9404); the Kafka client port 9092
      # does not serve HTTP metrics
      - targets: ['kafka-0:9404', 'kafka-1:9404', 'kafka-2:9404']
    metrics_path: /metrics
    scrape_interval: 5s
  - job_name: 'kafka-connect'
    static_configs:
      # Connect workers likewise need the JMX exporter; the REST API on
      # port 8083 does not expose Prometheus metrics
      - targets: ['connect-0:9404', 'connect-1:9404']
    metrics_path: /metrics
    scrape_interval: 5s
Key Dashboards
Kafka Cluster Overview
- Broker health and resource usage
- Topic partition distribution
- Consumer group lag
- Producer/consumer throughput
Consumer Monitoring
- Lag per consumer group (see the sketch below)
- Processing rate trends
- Error rates and exceptions
- Rebalancing frequency
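The lag figures above can also be sampled directly from code as the basis for custom alerts. A minimal sketch assuming the confluent-kafka client, with placeholder topic, group, and partition count:
from confluent_kafka import Consumer, TopicPartition

# Placeholder connection details
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "orders-service",
    "enable.auto.commit": False,
})

partitions = [TopicPartition("orders", p) for p in range(3)]  # 3 partitions assumed
for tp in consumer.committed(partitions, timeout=10):
    low, high = consumer.get_watermark_offsets(tp, timeout=10)
    committed = tp.offset if tp.offset >= 0 else low  # negative offset means no commit yet
    print(f"partition {tp.partition}: lag = {high - committed}")

consumer.close()
For fleet-wide alerting, tools such as Burrow or kafka-exporter expose the same per-group lag to Prometheus.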
Troubleshooting Production Issues
Common production problems and how to fix them:
1. Consumer Lag Spikes
🚨 Symptoms
- Lag jumps from 0 to 100,000+ messages
- Consumer throughput drops to zero
- Processing stops completely
🔍 Debugging Steps
- Check consumer health - are they running?
- Look for rebalancing events in logs
- Check for processing errors or exceptions
- Verify partition assignment
- Check resource usage (CPU, memory)
2. Broker Out of Disk Space
🚨 Symptoms
- Broker stops accepting writes
- Log files grow rapidly
- Consumer lag increases
🔍 Debugging Steps
- Check disk usage: df -h
- Identify large log files: du -sh /var/kafka-logs/*
- Check retention settings and tighten them if needed (see the sketch below)
- Clean up old segments
- Scale up disk or add more brokers
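If retention turns out to be the culprit, it can be tightened per topic without a broker restart. A sketch using confluent-kafka's AdminClient, with a placeholder topic name and a 24-hour retention:
from confluent_kafka.admin import AdminClient, ConfigResource

admin = AdminClient({"bootstrap.servers": "localhost:9092"})  # placeholder broker

# Reduce retention on one topic to 24 hours so old segments become
# eligible for deletion on the next log-cleanup pass
resource = ConfigResource(
    ConfigResource.Type.TOPIC, "orders",
    set_config={"retention.ms": str(24 * 60 * 60 * 1000)},
)
for res, future in admin.alter_configs([resource]).items():
    future.result()  # raises if the brokers rejected the change
    print(f"updated {res}")
Note that the legacy alter_configs call replaces the topic's full override set; newer client versions also offer incremental_alter_configs, which touches only the keys you pass.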
3. Rebalancing Storms
🚨 Symptoms
- Constant rebalancing events
- Consumers keep getting kicked out
- Processing is unstable
🔍 Debugging Steps
- Check session timeout settings
- Look for GC pauses in logs
- Verify network connectivity
- Check for processing timeouts
- Enable static membership (see the sketch below)
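A sketch of consumer settings that damp rebalancing, assuming the confluent-kafka client; the values are starting points, and group.instance.id must stay stable per pod (for example, derived from the StatefulSet ordinal).
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",    # placeholder broker
    "group.id": "orders-service",
    "group.instance.id": "orders-worker-0",   # static membership: keep stable across restarts
    "session.timeout.ms": 45000,              # tolerate longer pauses before eviction
    "heartbeat.interval.ms": 3000,
    "max.poll.interval.ms": 300000,           # must exceed worst-case processing time per poll loop
})
consumer.subscribe(["orders"])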
Performance Tuning Checklist
Broker Tuning
- num.network.threads = 3 (default) - increase only if network threads saturate
- num.io.threads = 8 (default) - set to at least the number of data disks
- socket.send.buffer.bytes = 102400
- socket.receive.buffer.bytes = 102400
- log.flush.interval.messages = 10000
Producer Tuning
- batch.size = 65536 (64KB)
- linger.ms = 10
- compression.type = lz4
- acks = all
- retries = 2147483647 (effectively infinite; bound the total time with delivery.timeout.ms)
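A sketch of the same tuning for the confluent-kafka Python client (a recent librdkafka build is assumed, since batch.size is a newer option); idempotence plus a delivery timeout stands in for the raw retries value, which is a starting point to benchmark rather than a universal recommendation.
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",  # placeholder broker
    "batch.size": 65536,                    # 64 KB batches
    "linger.ms": 10,                        # wait briefly so batches fill up
    "compression.type": "lz4",
    "acks": "all",                          # wait for all in-sync replicas
    "enable.idempotence": True,             # safe retries without duplicates or reordering
    "delivery.timeout.ms": 120000,          # overall bound on retries
})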
Consumer Tuning
- fetch.min.bytes = 1024
- fetch.max.wait.ms = 500
- max.poll.records = 500
- session.timeout.ms = 45000
- heartbeat.interval.ms = 3000
Disaster Recovery Planning
Prepare for the worst-case scenarios:
Backup Strategy
- Topic replication - MirrorMaker 2.0
- Configuration backup - Git repository
- Certificate backup - Secure storage
- Schema backup - Schema Registry export
Recovery Procedures
- RTO (Recovery Time Objective) - define a target, e.g. 4 hours
- RPO (Recovery Point Objective) - define a target, e.g. 15 minutes
- Failover testing - Monthly drills
- Documentation - Step-by-step procedures
Key Takeaways
- Production readiness is a process - not a checklist
- Monitor everything - you can't fix what you can't see
- Plan for failure - disaster recovery is essential
- Use proven patterns - Event Sourcing, CQRS, Saga
- Test your recovery procedures - practice makes perfect
🎉 Congratulations!
You've completed the Apache Kafka Mastery course! You now have the knowledge and skills to build production-ready Kafka systems that can handle millions of events per second.
Next steps:
- Build a real project using Kafka
- Join the Kafka community
- Contribute to open source projects
- Share your knowledge with others