Production Operations and Advanced Patterns

Master Kafka production operations: Kubernetes deployment, monitoring, advanced patterns, troubleshooting, and building production-ready Kafka systems.

Production Readiness Checklist

Before going to production, ensure you've covered these essentials:

✅ Infrastructure

  • 3+ brokers - Fault tolerance
  • SSD storage - Performance
  • Dedicated resources - No resource contention
  • Network redundancy - Multiple network paths
  • Backup strategy - Data protection

✅ Security

  • SSL/TLS encryption - Data in transit
  • SASL authentication - Identity verification
  • ACL authorization - Access control
  • Quota limits - Resource protection
  • Audit logging - Compliance

✅ Monitoring

  • Consumer lag alerts - Processing health
  • Broker metrics - Resource usage
  • Error rate monitoring - Data quality
  • End-to-end latency - Performance tracking
  • Capacity planning - Growth management

Kafka on Kubernetes with Strimzi

Strimzi is the de facto standard operator for running Kafka on Kubernetes:

1. Install Strimzi Operator

# Install Strimzi operator
kubectl create namespace kafka
kubectl create -f 'https://strimzi.io/install/latest?namespace=kafka' -n kafka

# Wait for operator to be ready
kubectl wait --for=condition=Ready pod -l name=strimzi-cluster-operator -n kafka

2. Deploy Kafka Cluster

apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
  namespace: kafka
spec:
  kafka:
    version: 3.5.0
    replicas: 3
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
      - name: tls
        port: 9093
        type: internal
        tls: true
    config:
      offsets.topic.replication.factor: 3
      transaction.state.log.replication.factor: 3
      transaction.state.log.min.isr: 2
      default.replication.factor: 3
      min.insync.replicas: 2
      inter.broker.protocol.version: "3.5"
    storage:
      type: persistent-claim
      size: 100Gi
      class: fast-ssd
  zookeeper:
    replicas: 3
    storage:
      type: persistent-claim
      size: 10Gi
      class: fast-ssd
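
Assuming the manifest above is saved as kafka-cluster.yaml (the filename is arbitrary), apply it and wait for Strimzi to report the cluster Ready:

# Deploy the cluster and block until the operator marks it Ready
kubectl apply -f kafka-cluster.yaml -n kafka
kubectl wait kafka/my-cluster --for=condition=Ready --timeout=300s -n kafka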

3. Deploy Kafka Connect

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnect
metadata:
  name: my-connect-cluster
  namespace: kafka
  annotations:
    # Required so the operator manages connectors declared as KafkaConnector resources (used below)
    strimzi.io/use-connector-resources: "true"
spec:
  version: 3.5.0
  replicas: 2
  bootstrapServers: my-cluster-kafka-bootstrap:9092
  config:
    config.storage.replication.factor: 3
    offset.storage.replication.factor: 3
    status.storage.replication.factor: 3
    group.id: connect-cluster-group
    offset.storage.topic: connect-cluster-offsets
    config.storage.topic: connect-cluster-configs
    status.storage.topic: connect-cluster-status

Advanced Architectural Patterns

These patterns solve common distributed systems problems:

Event Sourcing

Concept: Store all changes as events, not just current state

# Instead of updating user record
user.name = "John Doe"

# Store events
events = [
  {"type": "UserCreated", "id": 123, "name": "John"},
  {"type": "UserNameChanged", "id": 123, "old_name": "John", "new_name": "John Doe"}
]

Benefits:

  • Audit trail - Complete history of changes
  • Time travel - Replay events to any point in time
  • Debugging - See exactly what happened
  • Compliance - Regulatory requirements
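
In Kafka terms, event sourcing means appending these events to a topic instead of overwriting a row. A minimal sketch using the kafka-python client; the broker address (localhost:9092) and topic name (user-events) are assumptions, and keying by user ID keeps each user's events in order within a partition.

import json
from kafka import KafkaProducer

# Events are appended to the log, never updated in place
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=str.encode,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

events = [
    {"type": "UserCreated", "id": 123, "name": "John"},
    {"type": "UserNameChanged", "id": 123, "old_name": "John", "new_name": "John Doe"},
]

for event in events:
    # Keying by user ID preserves per-user ordering within a partition
    producer.send("user-events", key=str(event["id"]), value=event)

producer.flush()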

CQRS (Command Query Responsibility Segregation)

Concept: Separate read and write models

# Write side (Commands)
POST /users - Create user
PUT /users/123 - Update user

# Read side (Queries)  
GET /users/123 - Get user
GET /users/search?name=John - Search users

Benefits:

  • Independent scaling - Scale reads and writes separately
  • Optimized models - Different schemas for different use cases
  • Performance - Read models can be denormalized
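
On the read side, a projection consumes the event stream and maintains a denormalized view that queries hit directly. A minimal sketch, reusing the hypothetical user-events topic from the event sourcing example; an in-memory dict stands in for a real read store such as Redis or Elasticsearch.

import json
from kafka import KafkaConsumer

# Read model: a denormalized view rebuilt from the event stream
users_by_id = {}

consumer = KafkaConsumer(
    "user-events",
    bootstrap_servers="localhost:9092",
    group_id="user-read-model",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

for message in consumer:
    event = message.value
    if event["type"] == "UserCreated":
        users_by_id[event["id"]] = {"id": event["id"], "name": event["name"]}
    elif event["type"] == "UserNameChanged":
        users_by_id[event["id"]]["name"] = event["new_name"]
    # GET /users/123 is now served from users_by_id, not the write database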

Saga Pattern

Concept: Manage distributed transactions with compensating actions

# Order processing saga
1. Reserve inventory → Success
2. Charge credit card → Success  
3. Ship order → FAILED
4. Compensate: Refund credit card
5. Compensate: Release inventory
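
A sketch of the orchestration logic, stripped of Kafka plumbing: each step is paired with a compensating action, and a failure triggers the compensations in reverse order. All function names are hypothetical stand-ins for real service calls.

# Hypothetical order-processing steps (stubs for illustration)
def reserve_inventory():  print("inventory reserved")
def release_inventory():  print("inventory released")
def charge_credit_card(): print("card charged")
def refund_credit_card(): print("card refunded")
def ship_order():         raise RuntimeError("carrier unavailable")

def run_saga(steps):
    """Run (action, compensation) pairs; on failure, compensate in reverse."""
    completed = []
    try:
        for action, compensation in steps:
            action()
            completed.append(compensation)
    except Exception:
        for compensation in reversed(completed):
            compensation()
        raise

try:
    run_saga([
        (reserve_inventory, release_inventory),
        (charge_credit_card, refund_credit_card),
        (ship_order, lambda: print("shipment cancelled")),
    ])
except RuntimeError as err:
    print("saga rolled back:", err)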

Change Data Capture (CDC)

Concept: Capture database changes and stream them to Kafka

# Database change → Kafka event
INSERT INTO users (id, name) VALUES (123, 'John')
↓
{"op": "c", "before": null, "after": {"id": 123, "name": "John"}}

Kafka Connect and Ecosystem

Kafka Connect integrates Kafka with external systems:

Popular Connectors

Source Connectors (Data In)

  • Debezium - Database CDC
  • JDBC - Any SQL database
  • MongoDB - Document database
  • Twitter - Social media feeds

Sink Connectors (Data Out)

  • Elasticsearch - Search indexing
  • HDFS - Data lake storage
  • S3 - Cloud storage
  • JDBC - Database writes
  • MongoDB - Document storage

Debezium CDC Example

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnector
metadata:
  name: debezium-connector
  namespace: kafka
  labels:
    # Must match the name of the KafkaConnect cluster that will run this connector
    strimzi.io/cluster: my-connect-cluster
spec:
  # The Connect image must include the Debezium PostgreSQL connector plugin
  class: io.debezium.connector.postgresql.PostgresConnector
  tasksMax: 1
  config:
    database.hostname: postgres
    database.port: 5432
    database.user: postgres
    database.password: postgres
    database.dbname: mydb
    database.server.name: mydb
    table.include.list: public.users
    plugin.name: pgoutput
    slot.name: kafka_slot
    publication.name: kafka_publication
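
Once applied, the operator reconciles the connector, and its status (including task state and errors) appears on the custom resource:

# Check that the operator accepted the connector and that its tasks are running
kubectl get kafkaconnector debezium-connector -n kafka
kubectl describe kafkaconnector debezium-connector -n kafka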

Multi-Datacenter Replication

Replicate data across datacenters for disaster recovery:

MirrorMaker 2.0

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaMirrorMaker2
metadata:
  name: my-mirror-maker-2
  namespace: kafka
spec:
  version: 3.5.0
  replicas: 2
  connectCluster: "target-cluster"
  clusters:
  - alias: "source-cluster"
    bootstrapServers: source-kafka:9092
  - alias: "target-cluster"
    bootstrapServers: target-kafka:9092
  mirrors:
  - sourceCluster: "source-cluster"
    targetCluster: "target-cluster"
    sourceConnector:
      tasksMax: 2
      config:
        replication.factor: 3
        offset-syncs.topic.replication.factor: 3
        sync.topic.acls.enabled: "false"
    checkpointConnector:
      tasksMax: 2
      config:
        checkpoints.topic.replication.factor: 3
        sync.group.offsets.enabled: "true"
        refresh.topics.interval.seconds: 60
        refresh.groups.interval.seconds: 60

Production Monitoring Stack

Complete monitoring solution for Kafka in production:

Prometheus + Grafana

# Prometheus config
# Brokers do not serve Prometheus metrics on the client port (9092); scrape a
# JMX/Prometheus exporter endpoint instead. Port 9404 below assumes the
# Strimzi/JMX Prometheus Exporter - adjust the targets to your deployment.
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'kafka-brokers'
    static_configs:
      - targets: ['kafka-0:9404', 'kafka-1:9404', 'kafka-2:9404']
    metrics_path: /metrics
    scrape_interval: 5s

  - job_name: 'kafka-connect'
    static_configs:
      - targets: ['connect-0:9404', 'connect-1:9404']
    metrics_path: /metrics
    scrape_interval: 5s
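
Consumer lag is usually the first thing to alert on. A hedged example rule: the metric name kafka_consumergroup_lag comes from the commonly used kafka-exporter (it differs per exporter), and the threshold and duration are placeholders to tune.

# Example Prometheus alerting rule for consumer lag
groups:
  - name: kafka-consumer-lag
    rules:
      - alert: ConsumerGroupLagHigh
        expr: sum(kafka_consumergroup_lag) by (consumergroup, topic) > 10000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Consumer group {{ $labels.consumergroup }} is lagging on {{ $labels.topic }}"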

Key Dashboards

Kafka Cluster Overview

  • Broker health and resource usage
  • Topic partition distribution
  • Consumer group lag
  • Producer/consumer throughput

Consumer Monitoring

  • Lag per consumer group
  • Processing rate trends
  • Error rates and exceptions
  • Rebalancing frequency

Troubleshooting Production Issues

Common production problems and how to fix them:

1. Consumer Lag Spikes

🚨 Symptoms

  • Lag jumps from 0 to 100,000+ messages
  • Consumer throughput drops to zero
  • Processing stops completely

🔍 Debugging Steps

  1. Check consumer health - are they running?
  2. Look for rebalancing events in logs
  3. Check for processing errors or exceptions
  4. Verify partition assignment
  5. Check resource usage (CPU, memory)
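
For steps 1, 2, and 4 above, the stock CLI shows per-partition lag, partition assignment, and the group state; the group name and bootstrap address are placeholders.

# Per-partition lag, committed offsets, and which consumer owns each partition
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group my-consumer-group

# Group state (Stable, PreparingRebalance, Empty, ...)
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group my-consumer-group --state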

2. Broker Out of Disk Space

🚨 Symptoms

  • Broker stops accepting writes
  • Log files grow rapidly
  • Consumer lag increases

🔍 Debugging Steps

  1. Check disk usage: df -h
  2. Identify large log files: du -sh /var/kafka-logs/*
  3. Check retention settings
  4. Clean up old segments
  5. Scale up disk or add more brokers
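
For steps 3 and 4 above, the stock CLI can inspect and temporarily tighten a topic's retention; the topic name and retention value are placeholders.

# Inspect retention overrides on a topic
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name orders --describe

# Temporarily tighten retention so old segments are deleted sooner (1 day here)
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name orders --alter \
  --add-config retention.ms=86400000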

3. Rebalancing Storms

🚨 Symptoms

  • Constant rebalancing events
  • Consumers keep getting kicked out
  • Processing is unstable

🔍 Debugging Steps

  1. Check session timeout settings
  2. Look for GC pauses in logs
  3. Verify network connectivity
  4. Check for processing timeouts
  5. Enable static membership
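
For step 5, a hedged consumer configuration sketch: the group and instance IDs are examples, and the timeout values are starting points to tune for your workload.

# consumer.properties - example values
group.id=payments-processor
# Static membership: a restart within session.timeout.ms does not trigger a rebalance
group.instance.id=payments-processor-pod-1
session.timeout.ms=45000
heartbeat.interval.ms=15000
# Give slow message processing room before the consumer is considered dead
max.poll.interval.ms=300000
max.poll.records=100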

Performance Tuning Checklist

Broker Tuning

  • num.network.threads = 3 (default; raise it on brokers handling heavy client traffic)
  • num.io.threads = 8 (at minimum, one per data disk)
  • socket.send.buffer.bytes = 102400
  • socket.receive.buffer.bytes = 102400
  • log.flush.interval.messages - leave at the default and rely on the OS page cache plus replication; forcing frequent flushes hurts throughput

Producer Tuning

  • batch.size = 65536 (64KB)
  • linger.ms = 10
  • compression.type = lz4
  • acks = all
  • retries = 2147483647
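
The same settings expressed for the kafka-python client (parameter names are snake_case); the values are the examples from the list above, not universal recommendations, and lz4 compression requires the lz4 package.

from kafka import KafkaProducer

# kafka-python equivalents of the producer settings listed above
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    batch_size=65536,          # 64 KB batches
    linger_ms=10,              # wait up to 10 ms to fill a batch
    compression_type="lz4",    # requires the lz4 package
    acks="all",                # wait for all in-sync replicas
    retries=2147483647,        # retry transient send failures
)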

Consumer Tuning

  • fetch.min.bytes = 1024
  • fetch.max.wait.ms = 500
  • max.poll.records = 500
  • session.timeout.ms = 45000
  • heartbeat.interval.ms = 3000

Disaster Recovery Planning

Prepare for the worst-case scenarios:

Backup Strategy

  • Topic replication - MirrorMaker 2.0
  • Configuration backup - Git repository
  • Certificate backup - Secure storage
  • Schema backup - Schema Registry export

Recovery Procedures

  • RTO (Recovery Time Objective) - define a target, e.g. 4 hours
  • RPO (Recovery Point Objective) - define a target, e.g. 15 minutes
  • Failover testing - Monthly drills
  • Documentation - Step-by-step procedures

Key Takeaways

  1. Production readiness is a process - not a checklist
  2. Monitor everything - you can't fix what you can't see
  3. Plan for failure - disaster recovery is essential
  4. Use proven patterns - Event Sourcing, CQRS, Saga
  5. Test your recovery procedures - practice makes perfect

🎉 Congratulations!

You've completed the Apache Kafka Mastery course! You now have the knowledge and skills to build production-ready Kafka systems that can handle millions of events per second.

Next steps:

  • Build a real project using Kafka
  • Join the Kafka community
  • Contribute to open source projects
  • Share your knowledge with others