Distributed Messaging System - Trade-offs and Alternatives

Delivery Semantics Trade-offs

At-Most-Once vs At-Least-Once vs Exactly-Once

At-Most-Once:
✓ Highest throughput (no acks)
✓ Lowest latency (<1ms)
✓ Simplest implementation
✗ Possible message loss
✗ No durability guarantees

Use Cases:
- Metrics and monitoring
- Non-critical logs
- Real-time analytics (approximate)

At-Least-Once:
✓ No message loss
✓ Good throughput
✓ Acceptable latency (5-10ms)
✗ Possible duplicates
✗ Requires idempotent processing

Use Cases:
- Event processing
- Data pipelines
- Most production workloads

Exactly-Once:
✓ No duplicates
✓ No message loss
✓ Strong guarantees
✗ Lower throughput (30% overhead)
✗ Higher latency (20-30ms)
✗ Complex implementation

Use Cases:
- Financial transactions
- Payment processing
- Critical data pipelines

Recommendation: At-least-once for most use cases

Ordering Guarantees

Partition-Level vs Global Ordering

Partition-Level Ordering:
✓ High throughput (parallel processing)
✓ Scalable (add partitions)
✓ Simple implementation
✗ No cross-partition ordering
✗ Requires key-based routing

Example:
Topic: user-events (10 partitions)
user_id=123 → Partition 0 (ordered)
user_id=456 → Partition 5 (ordered)
No ordering between users

Global Ordering (Single Partition):
✓ Strict ordering
✓ Simple to reason about
✗ Limited throughput (single partition bottleneck)
✗ Single point of failure
✗ Cannot scale horizontally

Example:
Topic: transactions (1 partition)
All messages ordered globally
Max throughput: ~100K msg/s

Recommendation: Partition-level ordering with key-based routing

Replication Strategies

Synchronous vs Asynchronous Replication

Synchronous (acks=all):
✓ No data loss
✓ Strong durability
✓ Consistent reads
✗ Higher latency (+3-5ms)
✗ Lower throughput (-30%)
✗ Availability impact (need quorum)

Configuration:
acks=all
min.insync.replicas=2

Asynchronous (acks=1):
✓ Low latency (<2ms)
✓ High throughput
✓ High availability
✗ Possible data loss (leader failure)
✗ Eventual consistency
✗ Replica lag

Configuration:
acks=1
min.insync.replicas=1

Hybrid (acks=1, RF=3):
✓ Balanced latency (2-3ms)
✓ Good throughput
✓ Acceptable durability
✗ Small data loss window
✗ Replica lag possible

Recommendation: Synchronous for critical data, async for logs

Storage Models

Log-Based vs Queue-Based

Log-Based (Kafka-style):
✓ Multiple consumers (replay)
✓ Time-based retention
✓ High throughput
✓ Durable storage
✗ Storage grows with time
✗ No message deletion

Use Cases:
- Event sourcing
- Stream processing
- Data pipelines
- Audit logs

Queue-Based (RabbitMQ-style):
✓ Message deletion after consumption
✓ Lower storage requirements
✓ Priority queues
✓ Message TTL
✗ Single consumer per message
✗ No replay capability
✗ Lower throughput

Use Cases:
- Task queues
- Job processing
- Request-response patterns
- Point-to-point messaging

Recommendation: Log-based for events, queue-based for tasks

Partitioning Strategies

Hash-Based vs Range-Based vs Custom

Hash-Based (Default):
✓ Even distribution
✓ Simple implementation
✓ No hot partitions
✗ No ordering across keys
✗ Cannot query by range

Implementation:
partition = hash(key) % num_partitions

Range-Based:
✓ Range queries possible
✓ Ordered data
✓ Time-series friendly
✗ Hot partitions (recent data)
✗ Uneven distribution
✗ Complex rebalancing

Implementation:
partition = key_range_to_partition(key)

Custom Partitioner:
✓ Business logic based
✓ Flexible distribution
✓ Optimize for use case
✗ Complex implementation
✗ Requires maintenance
✗ May cause hot partitions

Recommendation: Hash-based for most use cases

Consumer Group vs Multiple Topics

Fan-out Patterns

Single Topic + Multiple Consumer Groups:
✓ Single write
✓ Multiple independent consumers
✓ Easy to add consumers
✓ Shared storage
✗ All consumers see all messages
✗ Cannot filter at broker

Example:
Topic: user-events
├── Consumer Group: analytics (all events)
├── Consumer Group: notifications (all events)
└── Consumer Group: audit (all events)

Multiple Topics:
✓ Filtered at producer
✓ Independent retention
✓ Different configurations
✓ Cleaner separation
✗ Multiple writes
✗ More storage
✗ More complexity

Example:
├── Topic: user-login-events
├── Topic: user-purchase-events
└── Topic: user-profile-events

Recommendation: Single topic + multiple groups for fan-out

Compression Trade-offs

Compression Algorithms

None:
✓ Lowest CPU usage
✓ Lowest latency
✗ Highest network usage
✗ Highest storage cost
✗ Lowest throughput

Snappy:
✓ Fast compression (2:1 ratio)
✓ Low CPU overhead
✓ Good balance
✗ Moderate compression

LZ4:
✓ Very fast (2:1 ratio)
✓ Lowest CPU overhead
✓ Best for speed
✗ Moderate compression

Gzip:
✓ Best compression (3:1 ratio)
✓ Lowest storage/network
✗ High CPU usage
✗ Higher latency

Zstd:
✓ Best balance (3:1 ratio)
✓ Fast compression
✓ Configurable levels
✗ Newer (less tested)

Recommendation:
- Logs: Gzip (high compression)
- Events: Snappy (balanced)
- Metrics: LZ4 (speed)

Kafka vs RabbitMQ vs Pulsar

Technology Comparison

Kafka:
✓ Highest throughput (1M+ msg/s)
✓ Durable log storage
✓ Multiple consumers (replay)
✓ Stream processing (Kafka Streams)
✗ Complex operations
✗ No message priority
✗ No request-response

Best For:
- Event streaming
- Log aggregation
- Data pipelines
- Real-time analytics

RabbitMQ:
✓ Flexible routing
✓ Message priority
✓ Request-response patterns
✓ Easy to operate
✗ Lower throughput (100K msg/s)
✗ No replay capability
✗ Message deletion after consumption

Best For:
- Task queues
- Job processing
- Microservices communication
- Traditional messaging

Pulsar:
✓ Multi-tenancy
✓ Geo-replication
✓ Tiered storage
✓ Unified messaging + streaming
✗ Newer (less mature)
✗ More complex
✗ Smaller ecosystem

Best For:
- Multi-tenant platforms
- Global deployments
- Unified architecture
- Cloud-native applications

Recommendation: Kafka for most use cases

Push vs Pull Model

Consumer Patterns

Pull Model (Kafka):
✓ Consumer controls rate
✓ Backpressure handling
✓ Batch processing
✓ Replay capability
✗ Polling overhead
✗ Higher latency (polling interval)

Implementation:
while (true) {
    records = consumer.poll(100ms);
    process(records);
}

Push Model (RabbitMQ):
✓ Lower latency (immediate)
✓ No polling overhead
✓ Simple implementation
✗ No backpressure
✗ Can overwhelm consumer
✗ No replay

Implementation:
consumer.onMessage(message -> {
    process(message);
});

Recommendation: Pull for high throughput, push for low latency

Consistency Models

Strong vs Eventual Consistency

Strong Consistency:
✓ Immediate consistency
✓ Predictable behavior
✓ Easier to reason about
✗ Higher latency
✗ Lower availability (CAP theorem)
✗ Requires coordination

Implementation:
- Synchronous replication
- Quorum writes
- Raft/Paxos consensus

Eventual Consistency:
✓ Low latency
✓ High availability
✓ Partition tolerance
✗ Temporary inconsistency
✗ Conflict resolution needed
✗ Complex application logic

Implementation:
- Asynchronous replication
- Last-write-wins
- Vector clocks

Recommendation: Eventual consistency for messaging

Metadata Storage

ZooKeeper vs Raft vs Etcd

ZooKeeper:
✓ Mature and stable
✓ Large ecosystem
✓ Well-tested
✗ Complex operations
✗ Separate deployment
✗ Java dependency

Raft (KRaft):
✓ Simpler than ZooKeeper
✓ Integrated with Kafka
✓ No external dependency
✓ Faster metadata operations
✗ Newer (less battle-tested)
✗ Migration complexity

Etcd:
✓ Simple API
✓ Strong consistency
✓ Good performance
✗ Not Kafka-native
✗ Additional component
✗ Smaller ecosystem

Recommendation: KRaft for new deployments, ZooKeeper for existing

Cost vs Performance

Infrastructure Trade-offs

High Performance (Expensive):
- NVMe SSDs: $0.20/GB/month
- 100 Gbps network
- 64-core servers
- RF=3, acks=all
- Cost: $50K/month
- Throughput: 10M msg/s
- Latency: <5ms P99

Balanced (Recommended):
- SATA SSDs: $0.10/GB/month
- 10 Gbps network
- 16-core servers
- RF=3, acks=1
- Cost: $20K/month
- Throughput: 3M msg/s
- Latency: <10ms P99

Cost-Optimized (Cheap):
- HDDs: $0.05/GB/month
- 1 Gbps network
- 8-core servers
- RF=2, acks=1
- Cost: $8K/month
- Throughput: 500K msg/s
- Latency: <50ms P99

These trade-offs help make informed decisions based on specific requirements, constraints, and priorities.