Distributed Messaging System - Scale and Constraints
Traffic Scale Analysis
Message Volume
- Peak Messages: 10 million messages per second globally
- Average Messages: 3 million messages per second
- Daily Messages: 250 billion messages per day
- Message Size: Average 1KB, max 10MB
- Burst Multiplier: 10x during peak events
- Retention Period: 7 days default, up to 30 days
Topic and Partition Scale
- Total Topics: 100,000 active topics
- Partitions per Topic: Average 10, max 1000
- Total Partitions: 1 million partitions across cluster
- Messages per Partition: 1000 messages/second average
- Partition Size: 100GB average, 1TB max
- Replication Factor: 3x (3 copies of each partition)
Producer and Consumer Scale
- Concurrent Producers: 50,000 active producers
- Concurrent Consumers: 200,000 active consumers
- Consumer Groups: 10,000 consumer groups
- Consumers per Group: Average 20, max 1000
- Producer Connections: 100,000 concurrent connections
- Consumer Connections: 500,000 concurrent connections
Storage Requirements
Message Storage (Per Broker)
- Active Messages: 10M msg/s × 1KB × 7 days = 6TB per broker
- Replication: 6TB × 3 = 18TB total per broker
- Index Data: 10% of message data = 1.8TB
- Metadata: Topics, partitions, offsets = 100GB
- Total per Broker: ~20TB
- Cluster (100 brokers): 2PB total storage
Offset Storage
- Consumer Offsets: 200K consumers × 10 partitions × 100 bytes = 200MB
- Offset History: 30 days × 200MB = 6GB
- Offset Commits: 1M commits/second × 100 bytes = 100MB/s write rate
Metadata Storage
- Topic Metadata: 100K topics × 10KB = 1GB
- Partition Metadata: 1M partitions × 5KB = 5GB
- Consumer Group Metadata: 10K groups × 50KB = 500MB
- Broker Metadata: 100 brokers × 1MB = 100MB
- Total Metadata: ~7GB (easily fits in memory)
Network Bandwidth
Inbound Traffic (Producers)
- Message Data: 10M msg/s × 1KB = 10GB/s
- Protocol Overhead: 20% = 2GB/s
- Replication Traffic: 10GB/s × 2 (to replicas) = 20GB/s
- Total Inbound: ~32GB/s peak
Outbound Traffic (Consumers)
- Message Data: 10M msg/s × 1KB × 20 consumers avg = 200GB/s
- Protocol Overhead: 20% = 40GB/s
- Total Outbound: ~240GB/s peak
Internal Traffic (Replication)
- Leader to Follower: 10GB/s × 2 replicas = 20GB/s
- Follower Fetch: 20GB/s
- Metadata Sync: 100MB/s
- Total Internal: ~40GB/s
Compute Requirements
CPU Resources (Per Broker)
- Message Processing: 10K msg/s × 0.1ms = 1 CPU core
- Compression/Decompression: 10K msg/s × 0.5ms = 5 CPU cores
- Replication: 10K msg/s × 0.2ms = 2 CPU cores
- Network I/O: 2 CPU cores
- Disk I/O: 2 CPU cores
- Total per Broker: ~12 CPU cores
- Cluster (100 brokers): 1,200 CPU cores
Memory Resources (Per Broker)
- Page Cache: 64GB (for recent messages)
- Producer Buffers: 10GB (1000 producers × 10MB)
- Consumer Buffers: 20GB (2000 consumers × 10MB)
- Metadata Cache: 10GB
- JVM Heap: 16GB
- Total per Broker: ~120GB RAM
- Cluster (100 brokers): 12TB RAM
Disk I/O (Per Broker)
- Write IOPS: 10K msg/s × 3 (replication) = 30K IOPS
- Read IOPS: 20K msg/s (consumers) = 20K IOPS
- Total IOPS: ~50K IOPS per broker
- Sequential Write: 10GB/s
- Sequential Read: 20GB/s
- Storage Type: NVMe SSD required
Latency Constraints
End-to-End Latency Breakdown
Producer → Broker: 2ms
- Network: 1ms
- Broker processing: 0.5ms
- Disk write: 0.5ms
Broker → Replica: 3ms
- Network: 1ms
- Replica processing: 1ms
- Disk write: 1ms
Broker → Consumer: 2ms
- Network: 1ms
- Consumer processing: 1ms
Total: 7ms (P50), 20ms (P99)Latency Requirements by Use Case
- Real-time Analytics: <10ms P99
- Event Processing: <50ms P99
- Batch Processing: <1s P99
- Log Aggregation: <5s P99
- Data Replication: <100ms P99
Replication and Consistency
Replication Configuration
Replication Factor: 3
- 1 Leader (handles reads/writes)
- 2 Followers (replicate data)
Synchronous Replication:
- Wait for all replicas to acknowledge
- Latency: +3ms
- Durability: 99.999999999%
Asynchronous Replication:
- Don't wait for replicas
- Latency: +0ms
- Durability: 99.99%
Recommended: Synchronous for critical data, async for logsConsistency Guarantees
- Write Consistency: Synchronous replication to quorum (2/3)
- Read Consistency: Read from leader only
- Offset Consistency: Strongly consistent offset commits
- Metadata Consistency: Raft consensus for metadata
- Convergence Time: <1 second for replica sync
Failure Scenarios and Tolerances
Broker Failures
- Single Broker Failure: <1% traffic impact (automatic failover)
- Multiple Broker Failures: Can tolerate (N-1)/2 failures
- Leader Failure: <5 seconds to elect new leader
- Follower Failure: No impact (leader continues serving)
- Network Partition: Quorum-based decisions prevent split-brain
Recovery Time Objectives
- Broker Restart: <30 seconds (load from disk)
- Leader Election: <5 seconds (Raft consensus)
- Partition Rebalance: <30 seconds (consumer group)
- Replica Sync: <5 minutes (catch up from leader)
- Full Cluster Recovery: <15 minutes (disaster recovery)
Cost Analysis
Infrastructure Costs (Monthly)
- Compute: 100 brokers × $500/month = $50,000
- Storage: 2PB × $0.10/GB/month = $200,000
- Network: 500TB egress × $0.05/GB = $25,000
- Load Balancers: 10 × $50/month = $500
- Monitoring: $5,000/month
- Total Infrastructure: ~$280,000/month
Operational Costs (Monthly)
- Engineering: $50,000/month (2 FTEs)
- Support: $20,000/month (on-call)
- Training: $5,000/month
- Total Operational: ~$75,000/month
Cost Per Message
- Total Monthly Cost: $355,000
- Monthly Messages: 250B × 30 = 7.5 trillion messages
- Cost Per Million Messages: $0.047
- Cost Per Message: $0.000000047 (~$0.05 per billion)
Scaling Strategies
Horizontal Scaling
- Add Brokers: Linear scaling up to 1000 brokers
- Add Partitions: Increase parallelism
- Add Consumer Groups: Increase consumption rate
- Geographic Distribution: Deploy in multiple regions
- Auto-Scaling: Scale based on message rate
Vertical Scaling
- Larger Brokers: Up to 128 cores, 512GB RAM
- Faster Storage: NVMe SSD, RAID 10
- Network Upgrades: 100 Gbps interfaces
- Memory Optimization: Larger page cache
Performance Optimization
- Batching: Batch 10,000 messages per request
- Compression: 3:1 compression ratio
- Zero-Copy: Sendfile for efficient data transfer
- Page Cache: Leverage OS page cache
- Sequential I/O: Optimize for sequential disk access
Bottleneck Analysis
Primary Bottlenecks
- Disk I/O: 50K IOPS per broker limit
- Network Bandwidth: 100 Gbps per broker limit
- Replication Lag: Synchronous replication adds 3ms
- Consumer Lag: Slow consumers cause backpressure
- Partition Count: Too many partitions increase overhead
Mitigation Strategies
- NVMe SSDs: 1M IOPS per drive
- 100 Gbps NICs: 10x bandwidth increase
- Async Replication: Reduce latency for non-critical data
- Consumer Scaling: Add more consumers to groups
- Partition Optimization: Balance partition count vs overhead
This scale analysis provides the foundation for designing a messaging system that can handle massive message volumes while maintaining low latency and high reliability.