Production Operations and Advanced Patterns

Master Kafka production operations: Kubernetes deployment, monitoring, advanced patterns, troubleshooting, and building production-ready Kafka systems.

Production Readiness Checklist

Before going to production, ensure you've covered these essentials:

✅ Infrastructure

  • 3+ brokers - Fault tolerance
  • SSD storage - Performance
  • Dedicated resources - No resource contention
  • Network redundancy - Multiple network paths
  • Backup strategy - Data protection

✅ Security

  • SSL/TLS encryption - Data in transit
  • SASL authentication - Identity verification
  • ACL authorization - Access control
  • Quota limits - Resource protection
  • Audit logging - Compliance

✅ Monitoring

  • Consumer lag alerts - Processing health
  • Broker metrics - Resource usage
  • Error rate monitoring - Data quality
  • End-to-end latency - Performance tracking
  • Capacity planning - Growth management

Kafka on Kubernetes with Strimzi

Strimzi is the de facto standard operator for running Kafka on Kubernetes:

1. Install Strimzi Operator

# Install Strimzi operator
kubectl create namespace kafka
kubectl create -f 'https://strimzi.io/install/latest?namespace=kafka' -n kafka

# Wait for operator to be ready
kubectl wait --for=condition=Ready pod -l name=strimzi-cluster-operator -n kafka

2. Deploy Kafka Cluster

apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
  namespace: kafka
spec:
  kafka:
    version: 3.5.0
    replicas: 3
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
      - name: tls
        port: 9093
        type: internal
        tls: true
    config:
      offsets.topic.replication.factor: 3
      transaction.state.log.replication.factor: 3
      transaction.state.log.min.isr: 2
      default.replication.factor: 3
      min.insync.replicas: 2
      inter.broker.protocol.version: "3.5"
    storage:
      type: persistent-claim
      size: 100Gi
      class: fast-ssd
  zookeeper:
    replicas: 3
    storage:
      type: persistent-claim
      size: 10Gi
      class: fast-ssd
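
Assuming the manifest above is saved as kafka-cluster.yaml (the filename is arbitrary), apply it and wait for Strimzi to report the cluster Ready:

# Deploy the cluster and block until the operator marks it Ready
kubectl apply -f kafka-cluster.yaml -n kafka
kubectl wait kafka/my-cluster --for=condition=Ready --timeout=300s -n kafka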

3. Deploy Kafka Connect

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnect
metadata:
  name: my-connect-cluster
  namespace: kafka
  annotations:
    # Required so the operator manages connectors declared as KafkaConnector resources (used below)
    strimzi.io/use-connector-resources: "true"
spec:
  version: 3.5.0
  replicas: 2
  bootstrapServers: my-cluster-kafka-bootstrap:9092
  config:
    config.storage.replication.factor: 3
    offset.storage.replication.factor: 3
    status.storage.replication.factor: 3
    group.id: connect-cluster-group
    offset.storage.topic: connect-cluster-offsets
    config.storage.topic: connect-cluster-configs
    status.storage.topic: connect-cluster-status

Advanced Architectural Patterns

These patterns solve common distributed systems problems:

Event Sourcing

Concept: Store all changes as events, not just current state

# Instead of updating user record
user.name = "John Doe"

# Store events
events = [
  {"type": "UserCreated", "id": 123, "name": "John"},
  {"type": "UserNameChanged", "id": 123, "old_name": "John", "new_name": "John Doe"}
]

Benefits:

  • Audit trail - Complete history of changes
  • Time travel - Replay events to any point in time
  • Debugging - See exactly what happened
  • Compliance - Regulatory requirements
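
In Kafka terms, event sourcing means appending these events to a topic instead of overwriting a row. A minimal sketch using the kafka-python client; the broker address (localhost:9092) and topic name (user-events) are assumptions, and keying by user ID keeps each user's events in order within a partition.

import json
from kafka import KafkaProducer

# Events are appended to the log, never updated in place
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=str.encode,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

events = [
    {"type": "UserCreated", "id": 123, "name": "John"},
    {"type": "UserNameChanged", "id": 123, "old_name": "John", "new_name": "John Doe"},
]

for event in events:
    # Keying by user ID preserves per-user ordering within a partition
    producer.send("user-events", key=str(event["id"]), value=event)

producer.flush()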

CQRS (Command Query Responsibility Segregation)

Concept: Separate read and write models

# Write side (Commands)
POST /users - Create user
PUT /users/123 - Update user

# Read side (Queries)  
GET /users/123 - Get user
GET /users/search?name=John - Search users

Benefits:

  • Independent scaling - Scale reads and writes separately
  • Optimized models - Different schemas for different use cases
  • Performance - Read models can be denormalized
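
On the read side, a projection consumes the event stream and maintains a denormalized view that queries hit directly. A minimal sketch, reusing the hypothetical user-events topic from the event sourcing example; an in-memory dict stands in for a real read store such as Redis or Elasticsearch.

import json
from kafka import KafkaConsumer

# Read model: a denormalized view rebuilt from the event stream
users_by_id = {}

consumer = KafkaConsumer(
    "user-events",
    bootstrap_servers="localhost:9092",
    group_id="user-read-model",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

for message in consumer:
    event = message.value
    if event["type"] == "UserCreated":
        users_by_id[event["id"]] = {"id": event["id"], "name": event["name"]}
    elif event["type"] == "UserNameChanged":
        users_by_id[event["id"]]["name"] = event["new_name"]
    # GET /users/123 is now served from users_by_id, not the write database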

Saga Pattern

Concept: Manage distributed transactions with compensating actions

# Order processing saga
1. Reserve inventory → Success
2. Charge credit card → Success  
3. Ship order → FAILED
4. Compensate: Refund credit card
5. Compensate: Release inventory
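
A sketch of the orchestration logic, stripped of Kafka plumbing: each step is paired with a compensating action, and a failure triggers the compensations in reverse order. All function names are hypothetical stand-ins for real service calls.

# Hypothetical order-processing steps (stubs for illustration)
def reserve_inventory():  print("inventory reserved")
def release_inventory():  print("inventory released")
def charge_credit_card(): print("card charged")
def refund_credit_card(): print("card refunded")
def ship_order():         raise RuntimeError("carrier unavailable")

def run_saga(steps):
    """Run (action, compensation) pairs; on failure, compensate in reverse."""
    completed = []
    try:
        for action, compensation in steps:
            action()
            completed.append(compensation)
    except Exception:
        for compensation in reversed(completed):
            compensation()
        raise

try:
    run_saga([
        (reserve_inventory, release_inventory),
        (charge_credit_card, refund_credit_card),
        (ship_order, lambda: print("shipment cancelled")),
    ])
except RuntimeError as err:
    print("saga rolled back:", err)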

Change Data Capture (CDC)

Concept: Capture database changes and stream them to Kafka

# Database change → Kafka event
INSERT INTO users (id, name) VALUES (123, 'John')
↓
{"op": "c", "before": null, "after": {"id": 123, "name": "John"}}

Kafka Connect and Ecosystem

Kafka Connect integrates Kafka with external systems:

Popular Connectors

Source Connectors (Data In)

  • Debezium - Database CDC
  • JDBC - Any SQL database
  • MongoDB - Document database
  • Twitter - Social media feeds

Sink Connectors (Data Out)

  • Elasticsearch - Search indexing
  • HDFS - Data lake storage
  • S3 - Cloud storage
  • JDBC - Database writes
  • MongoDB - Document storage

Debezium CDC Example

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnector
metadata:
  name: debezium-connector
  namespace: kafka
  labels:
    # Must match the name of the KafkaConnect cluster that will run this connector
    strimzi.io/cluster: my-connect-cluster
spec:
  # The Connect image must include the Debezium PostgreSQL connector plugin
  class: io.debezium.connector.postgresql.PostgresConnector
  tasksMax: 1
  config:
    database.hostname: postgres
    database.port: 5432
    database.user: postgres
    database.password: postgres
    database.dbname: mydb
    database.server.name: mydb
    table.include.list: public.users
    plugin.name: pgoutput
    slot.name: kafka_slot
    publication.name: kafka_publication
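
Once applied, the operator reconciles the connector, and its status (including task state and errors) appears on the custom resource:

# Check that the operator accepted the connector and that its tasks are running
kubectl get kafkaconnector debezium-connector -n kafka
kubectl describe kafkaconnector debezium-connector -n kafka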

Multi-Datacenter Replication

Replicate data across datacenters for disaster recovery:

MirrorMaker 2.0

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaMirrorMaker2
metadata:
  name: my-mirror-maker-2
  namespace: kafka
spec:
  version: 3.5.0
  replicas: 2
  connectCluster: "target-cluster"
  clusters:
  - alias: "source-cluster"
    bootstrapServers: source-kafka:9092
  - alias: "target-cluster"
    bootstrapServers: target-kafka:9092
  mirrors:
  - sourceCluster: "source-cluster"
    targetCluster: "target-cluster"
    sourceConnector:
      tasksMax: 2
      config:
        replication.factor: 3
        offset-syncs.topic.replication.factor: 3
        sync.topic.acls.enabled: "false"
    checkpointConnector:
      tasksMax: 2
      config:
        checkpoints.topic.replication.factor: 3
        sync.group.offsets.enabled: "true"
        refresh.topics.interval.seconds: 60
        refresh.groups.interval.seconds: 60

Production Monitoring Stack

Complete monitoring solution for Kafka in production:

Prometheus + Grafana

# Prometheus config
# Brokers do not serve Prometheus metrics on the client port (9092); scrape a
# JMX/Prometheus exporter endpoint instead. Port 9404 below assumes the
# Strimzi/JMX Prometheus Exporter - adjust the targets to your deployment.
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'kafka-brokers'
    static_configs:
      - targets: ['kafka-0:9404', 'kafka-1:9404', 'kafka-2:9404']
    metrics_path: /metrics
    scrape_interval: 5s

  - job_name: 'kafka-connect'
    static_configs:
      - targets: ['connect-0:9404', 'connect-1:9404']
    metrics_path: /metrics
    scrape_interval: 5s
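
Consumer lag is usually the first thing to alert on. A hedged example rule: the metric name kafka_consumergroup_lag comes from the commonly used kafka-exporter (it differs per exporter), and the threshold and duration are placeholders to tune.

# Example Prometheus alerting rule for consumer lag
groups:
  - name: kafka-consumer-lag
    rules:
      - alert: ConsumerGroupLagHigh
        expr: sum(kafka_consumergroup_lag) by (consumergroup, topic) > 10000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Consumer group {{ $labels.consumergroup }} is lagging on {{ $labels.topic }}"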

Key Dashboards

Kafka Cluster Overview

  • Broker health and resource usage
  • Topic partition distribution
  • Consumer group lag
  • Producer/consumer throughput

Consumer Monitoring

  • Lag per consumer group
  • Processing rate trends
  • Error rates and exceptions
  • Rebalancing frequency

Troubleshooting Production Issues

Common production problems and how to fix them:

1. Consumer Lag Spikes

🚨 Symptoms

  • Lag jumps from 0 to 100,000+ messages
  • Consumer throughput drops to zero
  • Processing stops completely

🔍 Debugging Steps

  1. Check consumer health - are they running?
  2. Look for rebalancing events in logs
  3. Check for processing errors or exceptions
  4. Verify partition assignment
  5. Check resource usage (CPU, memory)
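
For steps 1, 2, and 4 above, the stock CLI shows per-partition lag, partition assignment, and the group state; the group name and bootstrap address are placeholders.

# Per-partition lag, committed offsets, and which consumer owns each partition
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group my-consumer-group

# Group state (Stable, PreparingRebalance, Empty, ...)
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group my-consumer-group --state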

2. Broker Out of Disk Space

🚨 Symptoms

  • Broker stops accepting writes
  • Log files grow rapidly
  • Consumer lag increases

🔍 Debugging Steps

  1. Check disk usage: df -h
  2. Identify large log files: du -sh /var/kafka-logs/*
  3. Check retention settings
  4. Clean up old segments
  5. Scale up disk or add more brokers
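
For steps 3 and 4 above, the stock CLI can inspect and temporarily tighten a topic's retention; the topic name and retention value are placeholders.

# Inspect retention overrides on a topic
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name orders --describe

# Temporarily tighten retention so old segments are deleted sooner (1 day here)
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name orders --alter \
  --add-config retention.ms=86400000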

3. Rebalancing Storms

🚨 Symptoms

  • Constant rebalancing events
  • Consumers keep getting kicked out
  • Processing is unstable

🔍 Debugging Steps

  1. Check session timeout settings
  2. Look for GC pauses in logs
  3. Verify network connectivity
  4. Check for processing timeouts
  5. Enable static membership
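
For step 5, a hedged consumer configuration sketch: the group and instance IDs are examples, and the timeout values are starting points to tune for your workload.

# consumer.properties - example values
group.id=payments-processor
# Static membership: a restart within session.timeout.ms does not trigger a rebalance
group.instance.id=payments-processor-pod-1
session.timeout.ms=45000
heartbeat.interval.ms=15000
# Give slow message processing room before the consumer is considered dead
max.poll.interval.ms=300000
max.poll.records=100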

Performance Tuning Checklist

Broker Tuning

  • num.network.threads = 3 (default; raise it on brokers handling heavy client traffic)
  • num.io.threads = 8 (at minimum, one per data disk)
  • socket.send.buffer.bytes = 102400
  • socket.receive.buffer.bytes = 102400
  • log.flush.interval.messages - leave at the default and rely on the OS page cache plus replication; forcing frequent flushes hurts throughput

Producer Tuning

  • batch.size = 65536 (64KB)
  • linger.ms = 10
  • compression.type = lz4
  • acks = all
  • retries = 2147483647
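
The same settings expressed for the kafka-python client (parameter names are snake_case); the values are the examples from the list above, not universal recommendations, and lz4 compression requires the lz4 package.

from kafka import KafkaProducer

# kafka-python equivalents of the producer settings listed above
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    batch_size=65536,          # 64 KB batches
    linger_ms=10,              # wait up to 10 ms to fill a batch
    compression_type="lz4",    # requires the lz4 package
    acks="all",                # wait for all in-sync replicas
    retries=2147483647,        # retry transient send failures
)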

Consumer Tuning

  • fetch.min.bytes = 1024
  • fetch.max.wait.ms = 500
  • max.poll.records = 500
  • session.timeout.ms = 45000
  • heartbeat.interval.ms = 3000

Disaster Recovery Planning

Prepare for the worst-case scenarios:

Backup Strategy

  • Topic replication - MirrorMaker 2.0
  • Configuration backup - Git repository
  • Certificate backup - Secure storage
  • Schema backup - Schema Registry export

Recovery Procedures

  • RTO (Recovery Time Objective) - define a target, e.g. 4 hours
  • RPO (Recovery Point Objective) - define a target, e.g. 15 minutes
  • Failover testing - Monthly drills
  • Documentation - Step-by-step procedures

Key Takeaways

  1. Production readiness is a process - not a checklist
  2. Monitor everything - you can't fix what you can't see
  3. Plan for failure - disaster recovery is essential
  4. Use proven patterns - Event Sourcing, CQRS, Saga
  5. Test your recovery procedures - practice makes perfect

🎉 Congratulations!

You've completed the Apache Kafka Mastery course! You now have the knowledge and skills to build production-ready Kafka systems that can handle millions of events per second.

Next steps:

  • Build a real project using Kafka
  • Join the Kafka community
  • Contribute to open source projects
  • Share your knowledge with others