Distributed Messaging System - Scale and Constraints

Traffic Scale Analysis

Message Volume

Peak Messages: 10 million messages per second globally
Average Messages: 3 million messages per second
Daily Messages: 250 billion messages per day
Message Size: Average 1KB, max 10MB
Burst Multiplier: 10x during peak events
Retention Period: 7 days default, up to 30 days

Topic and Partition Scale

Total Topics: 100,000 active topics
Partitions per Topic: Average 10, max 1000
Total Partitions: 1 million partitions across cluster
Messages per Partition: 1000 messages/second average
Partition Size: 100GB average, 1TB max
Replication Factor: 3x (3 copies of each partition)

Producer and Consumer Scale

Concurrent Producers: 50,000 active producers
Concurrent Consumers: 200,000 active consumers
Consumer Groups: 10,000 consumer groups
Consumers per Group: Average 20, max 1000
Producer Connections: 100,000 concurrent connections
Consumer Connections: 500,000 concurrent connections

Storage Requirements

Message Storage (Per Broker)

Active Messages: 10M msg/s × 1KB × 7 days = 6TB per broker
Replication: 6TB × 3 = 18TB total per broker
Index Data: 10% of message data = 1.8TB
Metadata: Topics, partitions, offsets = 100GB
Total per Broker: ~20TB
Cluster (100 brokers): 2PB total storage

Offset Storage

Consumer Offsets: 200K consumers × 10 partitions × 100 bytes = 200MB
Offset History: 30 days × 200MB = 6GB
Offset Commits: 1M commits/second × 100 bytes = 100MB/s write rate

Metadata Storage

Topic Metadata: 100K topics × 10KB = 1GB
Partition Metadata: 1M partitions × 5KB = 5GB
Consumer Group Metadata: 10K groups × 50KB = 500MB
Broker Metadata: 100 brokers × 1MB = 100MB
Total Metadata: ~7GB (easily fits in memory)

Network Bandwidth

Inbound Traffic (Producers)

Message Data: 10M msg/s × 1KB = 10GB/s
Protocol Overhead: 20% = 2GB/s
Replication Traffic: 10GB/s × 2 (to replicas) = 20GB/s
Total Inbound: ~32GB/s peak

Outbound Traffic (Consumers)

Message Data: 10M msg/s × 1KB × 20 consumers avg = 200GB/s
Protocol Overhead: 20% = 40GB/s
Total Outbound: ~240GB/s peak

Internal Traffic (Replication)

Leader to Follower: 10GB/s × 2 replicas = 20GB/s
Follower Fetch: 20GB/s
Metadata Sync: 100MB/s
Total Internal: ~40GB/s

Compute Requirements

CPU Resources (Per Broker)

Message Processing: 10K msg/s × 0.1ms = 1 CPU core
Compression/Decompression: 10K msg/s × 0.5ms = 5 CPU cores
Replication: 10K msg/s × 0.2ms = 2 CPU cores
Network I/O: 2 CPU cores
Disk I/O: 2 CPU cores
Total per Broker: ~12 CPU cores
Cluster (100 brokers): 1,200 CPU cores

Memory Resources (Per Broker)

Page Cache: 64GB (for recent messages)
Producer Buffers: 10GB (1000 producers × 10MB)
Consumer Buffers: 20GB (2000 consumers × 10MB)
Metadata Cache: 10GB
JVM Heap: 16GB
Total per Broker: ~120GB RAM
Cluster (100 brokers): 12TB RAM

Disk I/O (Per Broker)

Write IOPS: 10K msg/s × 3 (replication) = 30K IOPS
Read IOPS: 20K msg/s (consumers) = 20K IOPS
Total IOPS: ~50K IOPS per broker
Sequential Write: 10GB/s
Sequential Read: 20GB/s
Storage Type: NVMe SSD required

Latency Constraints

End-to-End Latency Breakdown

Producer → Broker: 2ms
  - Network: 1ms
  - Broker processing: 0.5ms
  - Disk write: 0.5ms

Broker → Replica: 3ms
  - Network: 1ms
  - Replica processing: 1ms
  - Disk write: 1ms

Broker → Consumer: 2ms
  - Network: 1ms
  - Consumer processing: 1ms

Total: 7ms (P50), 20ms (P99)

Latency Requirements by Use Case

Real-time Analytics: <10ms P99
Event Processing: <50ms P99
Batch Processing: <1s P99
Log Aggregation: <5s P99
Data Replication: <100ms P99

Replication and Consistency

Replication Configuration

Replication Factor: 3
- 1 Leader (handles reads/writes)
- 2 Followers (replicate data)

Synchronous Replication:
- Wait for all replicas to acknowledge
- Latency: +3ms
- Durability: 99.999999999%

Asynchronous Replication:
- Don't wait for replicas
- Latency: +0ms
- Durability: 99.99%

Recommended: Synchronous for critical data, async for logs

Consistency Guarantees

Write Consistency: Synchronous replication to quorum (2/3)
Read Consistency: Read from leader only
Offset Consistency: Strongly consistent offset commits
Metadata Consistency: Raft consensus for metadata
Convergence Time: <1 second for replica sync

Failure Scenarios and Tolerances

Broker Failures

Single Broker Failure: <1% traffic impact (automatic failover)
Multiple Broker Failures: Can tolerate (N-1)/2 failures
Leader Failure: <5 seconds to elect new leader
Follower Failure: No impact (leader continues serving)
Network Partition: Quorum-based decisions prevent split-brain

Recovery Time Objectives

Broker Restart: <30 seconds (load from disk)
Leader Election: <5 seconds (Raft consensus)
Partition Rebalance: <30 seconds (consumer group)
Replica Sync: <5 minutes (catch up from leader)
Full Cluster Recovery: <15 minutes (disaster recovery)

Cost Analysis

Infrastructure Costs (Monthly)

Compute: 100 brokers × $500/month = $50,000
Storage: 2PB × $0.10/GB/month = $200,000
Network: 500TB egress × $0.05/GB = $25,000
Load Balancers: 10 × $50/month = $500
Monitoring: $5,000/month
Total Infrastructure: ~$280,000/month

Operational Costs (Monthly)

Engineering: $50,000/month (2 FTEs)
Support: $20,000/month (on-call)
Training: $5,000/month
Total Operational: ~$75,000/month

Cost Per Message

Total Monthly Cost: $355,000
Monthly Messages: 250B × 30 = 7.5 trillion messages
Cost Per Million Messages: $0.047
Cost Per Message: $0.000000047 (~$0.05 per billion)

Scaling Strategies

Horizontal Scaling

Add Brokers: Linear scaling up to 1000 brokers
Add Partitions: Increase parallelism
Add Consumer Groups: Increase consumption rate
Geographic Distribution: Deploy in multiple regions
Auto-Scaling: Scale based on message rate

Vertical Scaling

Larger Brokers: Up to 128 cores, 512GB RAM
Faster Storage: NVMe SSD, RAID 10
Network Upgrades: 100 Gbps interfaces
Memory Optimization: Larger page cache

Performance Optimization

Batching: Batch 10,000 messages per request
Compression: 3:1 compression ratio
Zero-Copy: Sendfile for efficient data transfer
Page Cache: Leverage OS page cache
Sequential I/O: Optimize for sequential disk access

Bottleneck Analysis

Primary Bottlenecks

Disk I/O: 50K IOPS per broker limit
Network Bandwidth: 100 Gbps per broker limit
Replication Lag: Synchronous replication adds 3ms
Consumer Lag: Slow consumers cause backpressure
Partition Count: Too many partitions increase overhead

Mitigation Strategies

NVMe SSDs: 1M IOPS per drive
100 Gbps NICs: 10x bandwidth increase
Async Replication: Reduce latency for non-critical data
Consumer Scaling: Add more consumers to groups
Partition Optimization: Balance partition count vs overhead

This scale analysis provides the foundation for designing a messaging system that can handle massive message volumes while maintaining low latency and high reliability.