Scale & Constraints

📖 6 min read 📄 Part 2 of 10

Distributed Messaging System - Scale and Constraints

Traffic Scale Analysis

Message Volume

  • Peak Messages: 10 million messages per second globally
  • Average Messages: 3 million messages per second
  • Daily Messages: 250 billion messages per day
  • Message Size: Average 1KB, max 10MB
  • Burst Multiplier: 10x during peak events
  • Retention Period: 7 days default, up to 30 days

Topic and Partition Scale

  • Total Topics: 100,000 active topics
  • Partitions per Topic: Average 10, max 1000
  • Total Partitions: 1 million partitions across cluster
  • Messages per Partition: 1000 messages/second average
  • Partition Size: 100GB average, 1TB max
  • Replication Factor: 3x (3 copies of each partition)

Producer and Consumer Scale

  • Concurrent Producers: 50,000 active producers
  • Concurrent Consumers: 200,000 active consumers
  • Consumer Groups: 10,000 consumer groups
  • Consumers per Group: Average 20, max 1000
  • Producer Connections: 100,000 concurrent connections
  • Consumer Connections: 500,000 concurrent connections

Storage Requirements

Message Storage (Per Broker)

  • Active Messages: 10M msg/s × 1KB × 7 days = 6TB per broker
  • Replication: 6TB × 3 = 18TB total per broker
  • Index Data: 10% of message data = 1.8TB
  • Metadata: Topics, partitions, offsets = 100GB
  • Total per Broker: ~20TB
  • Cluster (100 brokers): 2PB total storage

Offset Storage

  • Consumer Offsets: 200K consumers × 10 partitions × 100 bytes = 200MB
  • Offset History: 30 days × 200MB = 6GB
  • Offset Commits: 1M commits/second × 100 bytes = 100MB/s write rate

Metadata Storage

  • Topic Metadata: 100K topics × 10KB = 1GB
  • Partition Metadata: 1M partitions × 5KB = 5GB
  • Consumer Group Metadata: 10K groups × 50KB = 500MB
  • Broker Metadata: 100 brokers × 1MB = 100MB
  • Total Metadata: ~7GB (easily fits in memory)

Network Bandwidth

Inbound Traffic (Producers)

  • Message Data: 10M msg/s × 1KB = 10GB/s
  • Protocol Overhead: 20% = 2GB/s
  • Replication Traffic: 10GB/s × 2 (to replicas) = 20GB/s
  • Total Inbound: ~32GB/s peak

Outbound Traffic (Consumers)

  • Message Data: 10M msg/s × 1KB × 20 consumers avg = 200GB/s
  • Protocol Overhead: 20% = 40GB/s
  • Total Outbound: ~240GB/s peak

Internal Traffic (Replication)

  • Leader to Follower: 10GB/s × 2 replicas = 20GB/s
  • Follower Fetch: 20GB/s
  • Metadata Sync: 100MB/s
  • Total Internal: ~40GB/s

Compute Requirements

CPU Resources (Per Broker)

  • Message Processing: 10K msg/s × 0.1ms = 1 CPU core
  • Compression/Decompression: 10K msg/s × 0.5ms = 5 CPU cores
  • Replication: 10K msg/s × 0.2ms = 2 CPU cores
  • Network I/O: 2 CPU cores
  • Disk I/O: 2 CPU cores
  • Total per Broker: ~12 CPU cores
  • Cluster (100 brokers): 1,200 CPU cores

Memory Resources (Per Broker)

  • Page Cache: 64GB (for recent messages)
  • Producer Buffers: 10GB (1000 producers × 10MB)
  • Consumer Buffers: 20GB (2000 consumers × 10MB)
  • Metadata Cache: 10GB
  • JVM Heap: 16GB
  • Total per Broker: ~120GB RAM
  • Cluster (100 brokers): 12TB RAM

Disk I/O (Per Broker)

  • Write IOPS: 10K msg/s × 3 (replication) = 30K IOPS
  • Read IOPS: 20K msg/s (consumers) = 20K IOPS
  • Total IOPS: ~50K IOPS per broker
  • Sequential Write: 10GB/s
  • Sequential Read: 20GB/s
  • Storage Type: NVMe SSD required

Latency Constraints

End-to-End Latency Breakdown

Producer → Broker: 2ms
  - Network: 1ms
  - Broker processing: 0.5ms
  - Disk write: 0.5ms

Broker → Replica: 3ms
  - Network: 1ms
  - Replica processing: 1ms
  - Disk write: 1ms

Broker → Consumer: 2ms
  - Network: 1ms
  - Consumer processing: 1ms

Total: 7ms (P50), 20ms (P99)

Latency Requirements by Use Case

  • Real-time Analytics: <10ms P99
  • Event Processing: <50ms P99
  • Batch Processing: <1s P99
  • Log Aggregation: <5s P99
  • Data Replication: <100ms P99

Replication and Consistency

Replication Configuration

Replication Factor: 3
- 1 Leader (handles reads/writes)
- 2 Followers (replicate data)

Synchronous Replication:
- Wait for all replicas to acknowledge
- Latency: +3ms
- Durability: 99.999999999%

Asynchronous Replication:
- Don't wait for replicas
- Latency: +0ms
- Durability: 99.99%

Recommended: Synchronous for critical data, async for logs

Consistency Guarantees

  • Write Consistency: Synchronous replication to quorum (2/3)
  • Read Consistency: Read from leader only
  • Offset Consistency: Strongly consistent offset commits
  • Metadata Consistency: Raft consensus for metadata
  • Convergence Time: <1 second for replica sync

Failure Scenarios and Tolerances

Broker Failures

  • Single Broker Failure: <1% traffic impact (automatic failover)
  • Multiple Broker Failures: Can tolerate (N-1)/2 failures
  • Leader Failure: <5 seconds to elect new leader
  • Follower Failure: No impact (leader continues serving)
  • Network Partition: Quorum-based decisions prevent split-brain

Recovery Time Objectives

  • Broker Restart: <30 seconds (load from disk)
  • Leader Election: <5 seconds (Raft consensus)
  • Partition Rebalance: <30 seconds (consumer group)
  • Replica Sync: <5 minutes (catch up from leader)
  • Full Cluster Recovery: <15 minutes (disaster recovery)

Cost Analysis

Infrastructure Costs (Monthly)

  • Compute: 100 brokers × $500/month = $50,000
  • Storage: 2PB × $0.10/GB/month = $200,000
  • Network: 500TB egress × $0.05/GB = $25,000
  • Load Balancers: 10 × $50/month = $500
  • Monitoring: $5,000/month
  • Total Infrastructure: ~$280,000/month

Operational Costs (Monthly)

  • Engineering: $50,000/month (2 FTEs)
  • Support: $20,000/month (on-call)
  • Training: $5,000/month
  • Total Operational: ~$75,000/month

Cost Per Message

  • Total Monthly Cost: $355,000
  • Monthly Messages: 250B × 30 = 7.5 trillion messages
  • Cost Per Million Messages: $0.047
  • Cost Per Message: $0.000000047 (~$0.05 per billion)

Scaling Strategies

Horizontal Scaling

  • Add Brokers: Linear scaling up to 1000 brokers
  • Add Partitions: Increase parallelism
  • Add Consumer Groups: Increase consumption rate
  • Geographic Distribution: Deploy in multiple regions
  • Auto-Scaling: Scale based on message rate

Vertical Scaling

  • Larger Brokers: Up to 128 cores, 512GB RAM
  • Faster Storage: NVMe SSD, RAID 10
  • Network Upgrades: 100 Gbps interfaces
  • Memory Optimization: Larger page cache

Performance Optimization

  • Batching: Batch 10,000 messages per request
  • Compression: 3:1 compression ratio
  • Zero-Copy: Sendfile for efficient data transfer
  • Page Cache: Leverage OS page cache
  • Sequential I/O: Optimize for sequential disk access

Bottleneck Analysis

Primary Bottlenecks

  1. Disk I/O: 50K IOPS per broker limit
  2. Network Bandwidth: 100 Gbps per broker limit
  3. Replication Lag: Synchronous replication adds 3ms
  4. Consumer Lag: Slow consumers cause backpressure
  5. Partition Count: Too many partitions increase overhead

Mitigation Strategies

  1. NVMe SSDs: 1M IOPS per drive
  2. 100 Gbps NICs: 10x bandwidth increase
  3. Async Replication: Reduce latency for non-critical data
  4. Consumer Scaling: Add more consumers to groups
  5. Partition Optimization: Balance partition count vs overhead

This scale analysis provides the foundation for designing a messaging system that can handle massive message volumes while maintaining low latency and high reliability.