Load Balancer - Scale and Constraints
Traffic Scale Analysis
Request Volume
- Total Requests per Day: 100 billion
- Average Requests per Second: 1.16 million RPS
- Peak Requests per Second: 3.5 million RPS (3x average during traffic spikes)
- Off-Peak Requests per Second: 500,000 RPS
- Request Growth Rate: 25% year-over-year
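The average and peak rates follow from the daily total, and the growth rate compounds annually. A minimal sketch of that arithmetic, assuming load is averaged over a full 86,400-second day; the function and constant names are illustrative only:

```python
SECONDS_PER_DAY = 86_400

def average_rps(requests_per_day: float) -> float:
    """Average requests per second, assuming load spread over a full day."""
    return requests_per_day / SECONDS_PER_DAY

def compound(value: float, annual_growth: float, years: int) -> float:
    """Grow `value` for `years` at `annual_growth` (0.25 means 25% per year)."""
    return value * (1 + annual_growth) ** years

DAILY_REQUESTS = 100e9   # 100 billion requests per day
PEAK_FACTOR = 3          # peak is 3x average during traffic spikes
GROWTH = 0.25            # 25% year-over-year

avg = average_rps(DAILY_REQUESTS)
peak = PEAK_FACTOR * avg
print(f"average: {avg / 1e6:.2f}M RPS, peak: {peak / 1e6:.2f}M RPS")

# Project the daily volume and average rate for years 1 through 3.
for years_out in range(3):
    daily = compound(DAILY_REQUESTS, GROWTH, years_out)
    print(f"year {years_out + 1}: {daily / 1e9:.1f}B requests/day, "
          f"{average_rps(daily) / 1e6:.2f}M RPS average")
```

The unrounded peak is about 3.47M RPS; the 3.5M figure above is that value rounded up.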
Connection Metrics
- Concurrent Connections: 10 million active connections
- New Connections per Second: 100,000 (TCP handshakes)
- Average Connection Duration: 30 seconds (HTTP/1.1 keep-alive)
- Long-lived Connections: 500,000 (WebSocket, gRPC streams)
- Connection Reuse Rate: 85% (keep-alive efficiency)
- Peak Concurrent Connections: 25 million (during traffic spikes)
Backend Server Pool
- Total Backend Servers: 10,000 servers across all pools
- Server Pools: 50 distinct service pools
- Average Pool Size: 200 servers per pool
- Largest Pool: 2,000 servers (main API service)
- Smallest Pool: 10 servers (admin service)
Request Size Distribution
Small Requests (API calls, JSON):
Average request size: 2KB
Average response size: 5KB
Percentage of traffic: 70%
Medium Requests (page loads, data fetches):
Average request size: 10KB
Average response size: 50KB
Percentage of traffic: 25%
Large Requests (file uploads, media):
Average request size: 1MB
Average response size: 5MB
Percentage of traffic: 5%
Traffic Patterns
- Diurnal Pattern: Peak is 3x the daily average; peak-to-trough ratio is 7x (3.5M vs 500,000 RPS)
- Weekly Pattern: 20% higher on weekdays vs weekends
- Seasonal Spikes: 5x during major events (Black Friday, product launches)
- Geographic Distribution:
  - North America: 35% of traffic
  - Europe: 30% of traffic
  - Asia-Pacific: 25% of traffic
  - Rest of World: 10% of traffic
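The geographic split translates into per-region request rates that regional capacity has to absorb. A rough sketch, assuming the regional shares stay constant during a seasonal event and that the 5x seasonal factor applies to the average rate (neither is stated in the figures above); `regional_rps` is an illustrative helper:

```python
AVG_RPS = 1_160_000      # average requests per second, from Request Volume
SEASONAL_FACTOR = 5      # multiplier during major events (assumed relative to average)

REGION_SHARE = {
    "North America": 0.35,
    "Europe": 0.30,
    "Asia-Pacific": 0.25,
    "Rest of World": 0.10,
}

def regional_rps(total_rps: float, shares: dict[str, float]) -> dict[str, float]:
    """Split a global request rate across regions by traffic share."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("regional shares must sum to 100%")
    return {region: total_rps * share for region, share in shares.items()}

average = regional_rps(AVG_RPS, REGION_SHARE)
seasonal = regional_rps(AVG_RPS * SEASONAL_FACTOR, REGION_SHARE)

for region in REGION_SHARE:
    print(f"{region}: {average[region]:,.0f} RPS average, "
          f"{seasonal[region]:,.0f} RPS during a seasonal event")
```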
Network Bandwidth Calculations
Inbound Bandwidth (Client → Load Balancer)
Small requests (70% of 1.16M RPS):
812,000 RPS × 2KB = 1.6 GB/s
Medium requests (25% of 1.16M RPS):
290,000 RPS × 10KB = 2.9 GB/s
Large requests (5% of 1.16M RPS):
58,000 RPS × 1MB = 58 GB/s
Total Inbound: ~62.5 GB/s average
Peak Inbound: ~187.5 GB/s (3x average)
Outbound Bandwidth (Load Balancer → Client)
Small responses (70% of 1.16M RPS):
812,000 RPS × 5KB = 4.06 GB/s
Medium responses (25% of 1.16M RPS):
290,000 RPS × 50KB = 14.5 GB/s
Large responses (5% of 1.16M RPS):
58,000 RPS × 5MB = 290 GB/s
Total Outbound: ~308.5 GB/s average
Peak Outbound: ~925 GB/s (3x average)
Inter-LB Communication Bandwidth
Health check traffic:
10,000 backends × 1KB probe × every 5 seconds = 2 MB/s
Configuration sync between LB instances:
100 LB instances × 10KB config × every 10 seconds = 100 KB/s
Metrics aggregation:
100 LB instances × 5KB metrics × every second = 500 KB/s
State synchronization (session tables):
100,000 session updates/sec × 200 bytes = 20 MB/s
Total Inter-LB: ~22.6 MB/s
Backend Bandwidth (Load Balancer → Backend Servers)
Same as inbound + protocol overhead:
62.5 GB/s × 1.05 (headers, framing) = 65.6 GB/s
Health check probes:
10,000 servers × 1KB × every 5 seconds = 2 MB/s
Total Backend Bandwidth: ~65.6 GB/s
Compute Requirements
CPU Requirements per Load Balancer Instance
Packet Processing:
- TCP/IP stack processing: 1 CPU core per 100K packets/sec
- At 15,000 RPS per LB: ~2 cores for packet processing
SSL/TLS Termination:
- RSA 2048-bit handshake: 1,000 handshakes/sec per core
- ECDHE key exchange: 5,000 handshakes/sec per core
- At 1,000 new TLS connections/sec per LB: ~1 core
HTTP Parsing (L7):
- Header parsing: 50,000 requests/sec per core
- URL routing/matching: 30,000 requests/sec per core
- At 15,000 RPS per LB: ~1 core
Health Checking:
- 100 backends per LB instance: negligible CPU
Connection Management:
- Connection tracking, timeouts: ~0.5 cores
Logging and Metrics:
- Access log generation: ~0.5 cores
- Metrics computation: ~0.5 cores
Total per LB Instance: 8 cores recommended (with headroom)
Production Spec: 16 vCPU (2x headroom for spikes)
Memory Requirements per Load Balancer Instance
Connection Tracking Table:
100,000 concurrent connections × 512 bytes per entry = 50 MB
SSL Session Cache:
50,000 sessions × 4KB per session = 200 MB
Routing Table:
10,000 rules × 1KB per rule = 10 MB
Backend Server State:
200 backends × 2KB per backend = 400 KB
Request Buffers:
100,000 connections × 16KB buffer = 1.6 GB
Response Buffers:
100,000 connections × 64KB buffer = 6.4 GB
HTTP Header Cache:
10,000 entries × 2KB = 20 MB
Rate Limiting State:
1,000,000 IP entries × 64 bytes = 64 MB
Kernel Network Buffers:
TCP socket buffers: 2 GB
Application Code and Libraries: 500 MB
Total per LB Instance: ~11 GB
Production Spec: 32 GB RAM (with headroom for spikes)
Load Balancer Fleet Sizing
Total cluster capacity needed: 1.16M RPS average, 3.5M RPS peak
Per-LB capacity: 15,000 RPS (comfortable operating point)
LB instances needed for average: 1,160,000 / 15,000 = 78 instances
LB instances needed for peak: 3,500,000 / 15,000 = 234 instances
With N+2 redundancy and 60% target utilization:
Production fleet: 234 / 0.6 = 390 instances (peak capacity)
Minimum fleet: 78 / 0.6 = 130 instances (average load)
Auto-scaling range: 130 - 400 instances
Storage Requirements
Connection State Storage (In-Memory)
Active connection table:
10M connections × 512 bytes = 5 GB (distributed across fleet)
Session persistence table:
2M sticky sessions × 256 bytes = 512 MB (distributed)
Rate limiting counters:
10M unique IPs × 64 bytes = 640 MB (per instance, approximate)
Configuration Storage (Persistent)
Backend server configurations:
10,000 servers × 2KB = 20 MB
Routing rules:
50,000 rules × 1KB = 50 MB
SSL certificates:
5,000 certificates × 10KB = 50 MB
ACL rules:
100,000 rules × 256 bytes = 25 MB
Total Configuration: ~145 MB
Stored in: etcd/Consul with replication factor 3 = 435 MB
Health Check Data Storage
Health check results (last 24 hours):
10,000 servers × 1 check/5 sec × 86,400 sec × 100 bytes = 17 GB/day
Health check history (30 days):
17 GB × 30 = 510 GB
Stored in: Time-series database (Prometheus/InfluxDB)
With downsampling after 7 days: ~100 GB effective
Metrics and Logging Storage
Access logs:
1.16M RPS × 500 bytes per log × 86,400 seconds = 50 TB/day
Metrics (aggregated):
100 LB instances × 200 metrics × 1 data point/sec × 16 bytes = 27 GB/day
Error logs:
0.1% error rate × 1.16M RPS × 1KB = 100 GB/day
Total Logging: ~50 TB/day
With 30-day retention: 1.5 PB
With compression (10:1): 150 TB
Performance Targets
Latency Requirements
- Added Latency (L4 forwarding): <0.5ms p50, <1ms p99
- Added Latency (L7 with SSL): <2ms p50, <5ms p99
- SSL Handshake Time: <10ms p50, <30ms p99
- Health Check Round-trip: <100ms
- Configuration Propagation: <5 seconds across fleet
- Failover Detection: <10 seconds
- Failover Completion: <30 seconds
Throughput Targets
- Cluster Throughput: 3.5M RPS peak capacity
- Per-Instance Throughput: 15,000 RPS at L7, 100,000 RPS at L4
- Per-Instance Bandwidth: 10 Gbps
- New Connections per Instance: 10,000/sec
- SSL Handshakes per Instance: 5,000/sec
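Instance counts follow from dividing the cluster throughput target by the per-instance target and then applying a utilization ceiling. A minimal sketch using the 60% target utilization from the fleet sizing section; `instances_needed` and its parameters are illustrative, and adding N+2 spares on top of the utilization-adjusted count is an assumption about how redundancy is applied:

```python
import math

def instances_needed(cluster_rps: int, per_instance_rps: int,
                     target_utilization_pct: int = 100, spares: int = 0) -> int:
    """Instances required to serve cluster_rps while keeping each instance
    at or below target_utilization_pct of its rated throughput."""
    if not 0 < target_utilization_pct <= 100:
        raise ValueError("target_utilization_pct must be in (0, 100]")
    at_full_load = math.ceil(cluster_rps / per_instance_rps)
    # Integer percent avoids floating-point surprises in the ceiling.
    return math.ceil(at_full_load * 100 / target_utilization_pct) + spares

PEAK_RPS = 3_500_000
L7_RPS_PER_INSTANCE = 15_000
L4_RPS_PER_INSTANCE = 100_000

print("L7, fully loaded:   ", instances_needed(PEAK_RPS, L7_RPS_PER_INSTANCE))
print("L7, 60% utilization:", instances_needed(PEAK_RPS, L7_RPS_PER_INSTANCE, 60))
print("L7, 60% plus N+2:   ", instances_needed(PEAK_RPS, L7_RPS_PER_INSTANCE, 60, spares=2))
print("L4, 60% utilization:", instances_needed(PEAK_RPS, L4_RPS_PER_INSTANCE, 60))
```

At L7 this reproduces the 234 and 390 instance figures from the fleet sizing section. The L4 line shows how much smaller a pure L4 tier could be at the stated 100,000 RPS per instance.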
Availability Targets
- System Availability: 99.999% (5.26 minutes downtime/year)
- Zero-downtime Deployments: Rolling updates with no dropped connections
- Failover Time: <5 seconds for instance failure
- Data Plane Availability: 99.9999% (31.5 seconds downtime/year)
- Control Plane Availability: 99.99% (52.6 minutes downtime/year)
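The downtime budgets above are the availability percentages converted to time. A short sketch of the conversion, assuming a 365-day year (which matches the figures quoted); the function name is illustrative:

```python
SECONDS_PER_YEAR = 365 * 24 * 3600   # assumes a 365-day year

def downtime_seconds_per_year(availability_pct: float) -> float:
    """Allowed downtime per year, in seconds, for a given availability target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return (1 - availability_pct / 100) * SECONDS_PER_YEAR

targets = (
    ("System", 99.999),
    ("Data plane", 99.9999),
    ("Control plane", 99.99),
)
for name, pct in targets:
    seconds = downtime_seconds_per_year(pct)
    print(f"{name} ({pct}%): {seconds:.1f} s/year = {seconds / 60:.2f} min/year")
```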
Reliability Targets
- Packet Loss: <0.001%
- Connection Reset Rate: <0.01%
- Health Check False Positive Rate: <0.1%
- Configuration Error Rate: 0% (validated before deployment)
Cost Estimation
Compute Costs (Monthly)
Load Balancer Instances (average fleet of 200):
200 instances × 16 vCPU, 32GB RAM
On-demand: 200 × $400/month = $80,000/month
Reserved (1-year): 200 × $250/month = $50,000/month
Auto-scaling buffer (peak adds 200 more):
200 instances × $400/month × 20% utilization = $16,000/month
Total Compute: $66,000/month (with reserved pricing)
Network Costs (Monthly)
Outbound bandwidth (client-facing):
308.5 GB/s average × 2,592,000 seconds/month = 800 PB/month
At $0.02/GB (bulk pricing): $16M/month
With CDN offload (90%): $1.6M/month
Inter-AZ traffic:
22.6 MB/s × 2,592,000 seconds = 58.6 TB/month
At $0.01/GB: $586/month
Total Network: ~$1.6M/month (with CDN offload)
Storage Costs (Monthly)
Configuration storage (etcd cluster):
3 nodes × $200/month = $600/month
Metrics storage (Prometheus/InfluxDB):
100 GB × $0.10/GB = $10/month
Log storage (S3 + Elasticsearch):
150 TB compressed × $0.023/GB = $3,450/month
Elasticsearch (hot): 5 TB × $0.10/GB = $500/month
Total Storage: ~$4,560/month
SSL Certificate Costs
Wildcard certificates: 10 × $500/year = $5,000/year = $417/month
DV certificates (Let's Encrypt): Free
EV certificates: 5 × $1,000/year = $5,000/year = $417/month
Total Certificates: ~$834/month
Total Cost Summary
Compute: $66,000/month
Network: $1,600,000/month
Storage: $4,560/month
Certificates: $834/month
Monitoring/Tools: $5,000/month
Operations (team): $100,000/month
Total: ~$1,776,394/month
Cost per request: $1,776,394 / (100B requests/day × 30 days = 3T requests) = $0.00000059 per request
Cost per GB transferred: $1,776,394 / (800 PB) = $0.0022 per GB
Growth Projections
Year 1 → Year 3 Projections
                   Year 1     Year 2     Year 3
Requests/day:      100B       125B       156B
Peak RPS:          3.5M       4.4M       5.5M
Connections:       10M        12.5M      15.6M
Backend servers:   10,000     12,500     15,600
LB instances:      200        250        312
Monthly cost:      $1.8M      $2.2M      $2.8M
Capacity Planning Triggers
- Scale-up trigger: Average CPU > 60% for 5 minutes
- Scale-up trigger: Connection count > 80% of capacity
- Scale-up trigger: Request queue depth > 1000
- Scale-down trigger: Average CPU < 30% for 15 minutes
- Capacity review: Quarterly planning with 6-month lead time
- Hardware refresh: Every 3 years for physical LB appliances
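The scale-up and scale-down triggers above can be expressed as a single decision function. A minimal sketch, assuming "for N minutes" is approximated by a windowed fleet average and that any scale-up condition takes precedence over scale-down; `FleetSample` and `scaling_decision` are hypothetical names, not part of any autoscaler API:

```python
from dataclasses import dataclass

@dataclass
class FleetSample:
    cpu_avg_5m: float              # fleet-average CPU over the last 5 minutes, 0..1
    cpu_avg_15m: float             # fleet-average CPU over the last 15 minutes, 0..1
    connection_utilization: float  # active connections / connection capacity, 0..1
    queue_depth: int               # pending requests in the request queue

def scaling_decision(sample: FleetSample) -> str:
    """Return "scale_up", "scale_down", or "hold" for one fleet sample."""
    scale_up = (
        sample.cpu_avg_5m > 0.60
        or sample.connection_utilization > 0.80
        or sample.queue_depth > 1000
    )
    if scale_up:
        return "scale_up"
    if sample.cpu_avg_15m < 0.30:
        return "scale_down"
    return "hold"

print(scaling_decision(FleetSample(0.65, 0.50, 0.50, 10)))   # CPU trigger
print(scaling_decision(FleetSample(0.20, 0.25, 0.30, 0)))    # low CPU
print(scaling_decision(FleetSample(0.45, 0.45, 0.50, 100)))  # within bounds
```

A stricter reading of the triggers would require every sample in the window to cross the threshold, not just the window average. Checking scale-up first avoids shrinking the fleet while connections or queues are saturated even if CPU is low.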
Technology Evolution Considerations
- HTTP/3 (QUIC): Reduces connection overhead, changes connection tracking model
- eBPF/XDP: Processes packets in the kernel at the driver level, ahead of the regular network stack; up to 10x throughput improvement
- ARM processors: Better performance-per-watt for packet processing
- SmartNICs: Offload SSL and packet processing to network cards
- 5G growth: More mobile connections, smaller request sizes, higher connection churn
Scaling Bottlenecks
Connection Table Limits
- Problem: 10M connections × 512 bytes = 5 GB distributed state
- Bottleneck: Single-instance memory limits connection count
- Solution: Stateless design with external session store, or connection-level sharding
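Connection-level sharding can be sketched as a stable hash of the TCP 4-tuple, so that each instance holds only its share of the table. This is an illustration under simple assumptions (hash-modulo placement, the 512-byte entry size used above); `shard_for` and `table_bytes` are hypothetical helpers, and the addresses are documentation-range examples:

```python
import hashlib

ENTRY_BYTES = 512   # per-connection tracking entry size used in this document

def table_bytes(connections: int, entry_bytes: int = ENTRY_BYTES) -> int:
    """Memory needed to track `connections` entries."""
    return connections * entry_bytes

def shard_for(src_ip: str, src_port: int, dst_ip: str, dst_port: int,
              shards: int) -> int:
    """Map a TCP 4-tuple to a shard index using a stable hash."""
    key = f"{src_ip}:{src_port}-{dst_ip}:{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % shards

total = table_bytes(10_000_000)
print(f"fleet-wide table: {total / 1e9:.2f} GB")
print(f"per shard across 100 instances: {total / 100 / 1e6:.1f} MB")
print("shard:", shard_for("203.0.113.7", 51544, "198.51.100.1", 443, 100))
```

Plain modulo placement remaps most connections whenever the shard count changes; consistent hashing is the usual way to limit that churn when the fleet auto-scales.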
SSL Handshake CPU
- Problem: RSA-2048 limits throughput to 1,000 handshakes/sec per core
- Bottleneck: New connection storms during failover
- Solution: ECDHE (5x faster), session resumption, 0-RTT with TLS 1.3
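The effect of these mitigations on a reconnect storm can be estimated from the per-core handshake rates quoted in this document. A rough sketch, assuming resumed sessions cost no handshake CPU (a simplification) and using a hypothetical 50% resumption rate; the function name is illustrative:

```python
import math

# Full handshakes per second per core, as quoted in this document.
HANDSHAKES_PER_CORE = {"RSA-2048": 1_000, "ECDHE": 5_000}

def cores_for_handshakes(new_tls_per_sec: float, scheme: str,
                         resumption_rate: float = 0.0) -> int:
    """Cores needed for full handshakes; resumed sessions are treated as free."""
    if not 0.0 <= resumption_rate <= 1.0:
        raise ValueError("resumption_rate must be between 0 and 1")
    full_handshakes = new_tls_per_sec * (1 - resumption_rate)
    return math.ceil(full_handshakes / HANDSHAKES_PER_CORE[scheme])

STORM = 100_000   # new connections/sec fleet-wide, from Connection Metrics

print("RSA-2048:", cores_for_handshakes(STORM, "RSA-2048"), "cores")
print("ECDHE:", cores_for_handshakes(STORM, "ECDHE"), "cores")
print("ECDHE, 50% resumed:", cores_for_handshakes(STORM, "ECDHE", 0.5), "cores")
```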
Bandwidth Saturation
- Problem: Large responses (media) saturate NIC bandwidth
- Bottleneck: 10 Gbps NIC limits per-instance throughput
- Solution: 25/100 Gbps NICs, DSR for large responses, CDN offload
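Whether a 10 Gbps NIC is enough depends on the per-instance share of outbound traffic. A back-of-the-envelope sketch, assuming traffic is spread evenly across the 200-instance average fleet used in the cost estimate and a hypothetical 70% maximum NIC utilization; the helper names are illustrative:

```python
from typing import Optional

NIC_OPTIONS_GBPS = (10, 25, 100)

def per_instance_gbps(total_gigabytes_per_sec: float, instances: int) -> float:
    """Convert fleet-wide GB/s into Gbps per instance (1 byte = 8 bits)."""
    return total_gigabytes_per_sec * 8 / instances

def smallest_sufficient_nic(required_gbps: float,
                            options=NIC_OPTIONS_GBPS,
                            max_utilization: float = 0.7) -> Optional[int]:
    """Smallest NIC speed that carries required_gbps at or below max_utilization."""
    for nic in sorted(options):
        if required_gbps <= nic * max_utilization:
            return nic
    return None   # no single NIC option is large enough

direct = per_instance_gbps(308.5, 200)            # all responses through the LB
offloaded = per_instance_gbps(308.5 * 0.1, 200)   # 90% served by the CDN

print(f"no offload: {direct:.2f} Gbps -> NIC {smallest_sufficient_nic(direct)} Gbps")
print(f"90% CDN offload: {offloaded:.2f} Gbps -> NIC {smallest_sufficient_nic(offloaded)} Gbps")
```

Under these assumptions the average outbound load alone exceeds a 10 Gbps NIC per instance unless most large responses are offloaded.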
Health Check Thundering Herd
- Problem: All LB instances checking all backends simultaneously
- Bottleneck: Backend servers overwhelmed by health checks
- Solution: Staggered checks, shared health state, hierarchical health checking
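Staggering can be sketched as giving each LB/backend pair a deterministic phase offset within the check interval, so probes spread out instead of arriving together. This is an illustration, not a specific product's mechanism; it assumes the 100 LB instances and 5-second interval used earlier, and `stagger_offset` is a hypothetical helper:

```python
import hashlib

def probes_per_backend_per_sec(lb_instances: int, interval_s: float) -> float:
    """Probe rate seen by one backend when every LB instance checks it."""
    return lb_instances / interval_s

def stagger_offset(lb_id: str, backend_id: str, interval_s: float) -> float:
    """Deterministic phase offset in [0, interval_s) for one LB/backend pair."""
    digest = hashlib.sha256(f"{lb_id}|{backend_id}".encode()).digest()
    # 4 bytes give a fraction in [0, 1) that cannot round up to 1.0.
    fraction = int.from_bytes(digest[:4], "big") / 2**32
    return fraction * interval_s

print("probes/sec per backend:", probes_per_backend_per_sec(100, 5.0))
for lb in ("lb-1", "lb-2", "lb-3"):
    offset = stagger_offset(lb, "backend-42", 5.0)
    print(f"{lb} probes backend-42 at +{offset:.3f}s in each 5s window")
```

Staggering smooths the arrival of probes but does not reduce their total number; shared or hierarchical health state is what lowers the per-backend probe rate.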
This scale analysis provides the foundation for designing a load balancer that handles traffic at this volume while keeping added latency under 1 ms at L4 (under 5 ms at L7 with SSL) and meeting the five-nines availability target.