Scale & Constraints

Part 2 of 10

Load Balancer - Scale and Constraints

Traffic Scale Analysis

Request Volume

  • Total Requests per Day: 100 billion
  • Average Requests per Second: 1.16 million RPS
  • Peak Requests per Second: 3.5 million RPS (3x average during traffic spikes)
  • Off-Peak Requests per Second: 500,000 RPS
  • Request Growth Rate: 25% year-over-year
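
The average and peak rates above follow directly from the daily total. A minimal sketch of that arithmetic, assuming a flat 86,400-second day and the 3x peak factor quoted in the list (the function names are illustrative, not from any library):

```python
SECONDS_PER_DAY = 86_400


def average_rps(requests_per_day: float) -> float:
    """Mean requests per second over a 24-hour day."""
    return requests_per_day / SECONDS_PER_DAY


def peak_rps(requests_per_day: float, peak_factor: float = 3.0) -> float:
    """Peak rate as a multiple of the daily average (3x in the list above)."""
    return average_rps(requests_per_day) * peak_factor


if __name__ == "__main__":
    daily = 100e9  # 100 billion requests per day
    print(f"average: {average_rps(daily) / 1e6:.2f}M RPS")  # average: 1.16M RPS
    print(f"peak:    {peak_rps(daily) / 1e6:.2f}M RPS")     # peak:    3.47M RPS
```

The peak comes out at about 3.47M RPS, which the list rounds up to 3.5M.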

Connection Metrics

  • Concurrent Connections: 10 million active connections
  • New Connections per Second: 100,000 (TCP handshakes)
  • Average Connection Duration: 30 seconds (HTTP/1.1 keep-alive)
  • Long-lived Connections: 500,000 (WebSocket, gRPC streams)
  • Connection Reuse Rate: 85% (keep-alive efficiency)
  • Peak Concurrent Connections: 25 million (during traffic spikes)

Backend Server Pool

  • Total Backend Servers: 10,000 servers across all pools
  • Server Pools: 50 distinct service pools
  • Average Pool Size: 200 servers per pool
  • Largest Pool: 2,000 servers (main API service)
  • Smallest Pool: 10 servers (admin service)

Request Size Distribution

Small Requests (API calls, JSON):
  Average request size: 2KB
  Average response size: 5KB
  Percentage of traffic: 70%

Medium Requests (page loads, data fetches):
  Average request size: 10KB
  Average response size: 50KB
  Percentage of traffic: 25%

Large Requests (file uploads, media):
  Average request size: 1MB
  Average response size: 5MB
  Percentage of traffic: 5%

Traffic Patterns

  • Diurnal Pattern: Daily peak is about 3x the average rate and about 7x the off-peak trough (3.5M vs 500K RPS)
  • Weekly Pattern: 20% higher on weekdays vs weekends
  • Seasonal Spikes: 5x during major events (Black Friday, product launches)
  • Geographic Distribution:
    • North America: 35% of traffic
    • Europe: 30% of traffic
    • Asia-Pacific: 25% of traffic
    • Rest of World: 10% of traffic

Network Bandwidth Calculations

Inbound Bandwidth (Client → Load Balancer)

Small requests (70% of 1.16M RPS):
  812,000 RPS × 2KB = 1.6 GB/s

Medium requests (25% of 1.16M RPS):
  290,000 RPS × 10KB = 2.9 GB/s

Large requests (5% of 1.16M RPS):
  58,000 RPS × 1MB = 58 GB/s

Total Inbound: ~62.5 GB/s average
Peak Inbound: ~187.5 GB/s (3x average)

Outbound Bandwidth (Load Balancer → Client)

Small responses (70% of 1.16M RPS):
  812,000 RPS × 5KB = 4.06 GB/s

Medium responses (25% of 1.16M RPS):
  290,000 RPS × 50KB = 14.5 GB/s

Large responses (5% of 1.16M RPS):
  58,000 RPS × 5MB = 290 GB/s

Total Outbound: ~308.5 GB/s average
Peak Outbound: ~925 GB/s (3x average)
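
The inbound and outbound totals can be reproduced from the request size distribution. The sketch below assumes decimal units (1KB = 1,000 bytes), as the calculations above do; `TRAFFIC_MIX` and `bandwidth_gb_per_s` are illustrative names.

```python
# (share of requests, request bytes, response bytes) per traffic class
TRAFFIC_MIX = {
    "small": (0.70, 2_000, 5_000),
    "medium": (0.25, 10_000, 50_000),
    "large": (0.05, 1_000_000, 5_000_000),
}


def bandwidth_gb_per_s(total_rps: float) -> tuple[float, float]:
    """Return (inbound, outbound) client-facing bandwidth in GB/s."""
    inbound = sum(share * total_rps * req for share, req, _ in TRAFFIC_MIX.values())
    outbound = sum(share * total_rps * resp for share, _, resp in TRAFFIC_MIX.values())
    return inbound / 1e9, outbound / 1e9


if __name__ == "__main__":
    avg_in, avg_out = bandwidth_gb_per_s(1.16e6)
    print(f"average: {avg_in:.1f} GB/s in, {avg_out:.1f} GB/s out")
    peak_in, peak_out = bandwidth_gb_per_s(3 * 1.16e6)
    print(f"peak:    {peak_in:.1f} GB/s in, {peak_out:.1f} GB/s out")
```

The large class is only 5% of requests but contributes more than 90% of the bytes in both directions, so the bandwidth estimate is most sensitive to that share.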

Inter-LB Communication Bandwidth

Health check traffic:
  10,000 backends × 1KB probe × every 5 seconds = 2 MB/s

Configuration sync between LB instances:
  100 LB instances × 10KB config × every 10 seconds = 100 KB/s

Metrics aggregation:
  100 LB instances × 5KB metrics × every second = 500 KB/s

State synchronization (session tables):
  100,000 session updates/sec × 200 bytes = 20 MB/s

Total Inter-LB: ~22.6 MB/s

Backend Bandwidth (Load Balancer → Backend Servers)

Same as inbound + protocol overhead:
  62.5 GB/s × 1.05 (headers, framing) = 65.6 GB/s

Health check probes:
  10,000 servers × 1KB × every 5 seconds = 2 MB/s

Total Backend Bandwidth: ~65.6 GB/s

Compute Requirements

CPU Requirements per Load Balancer Instance

Packet Processing:
  - TCP/IP stack processing: 1 CPU core per 100K packets/sec
  - At 15,000 RPS per LB (~200K packets/sec, i.e. roughly 13 packets per request): ~2 cores for packet processing

SSL/TLS Termination:
  - RSA 2048-bit handshake: 1,000 handshakes/sec per core
  - ECDHE key exchange: 5,000 handshakes/sec per core
  - At 1,000 new TLS connections/sec per LB: ~1 core

HTTP Parsing (L7):
  - Header parsing: 50,000 requests/sec per core
  - URL routing/matching: 30,000 requests/sec per core
  - At 15,000 RPS per LB: ~1 core

Health Checking:
  - 100 backends per LB instance: negligible CPU

Connection Management:
  - Connection tracking, timeouts: ~0.5 cores

Logging and Metrics:
  - Access log generation: ~0.5 cores
  - Metrics computation: ~0.5 cores

Total per LB Instance: ~5.5 cores estimated, 8 cores recommended (with headroom)
Production Spec: 16 vCPU (2x headroom for spikes)
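
The per-instance CPU estimate can be written as a small function of request rate and TLS handshake rate. This is a rough sketch using the per-core throughput figures above. `packets_per_request` is an assumption that does not appear in the text; it is set to 13 so that 15,000 RPS maps to roughly the 2 cores quoted for packet processing.

```python
def lb_cores(rps: float, new_tls_per_s: float, packets_per_request: float = 13.0) -> float:
    """Estimate CPU cores needed by one L7 load balancer instance."""
    packet_cores = rps * packets_per_request / 100_000  # 100K packets/sec per core
    tls_cores = new_tls_per_s / 1_000                   # RSA-2048 worst case
    http_cores = rps / 50_000 + rps / 30_000            # header parsing + URL routing
    fixed_cores = 0.5 + 0.5 + 0.5                       # conn mgmt, access logs, metrics
    return packet_cores + tls_cores + http_cores + fixed_cores


if __name__ == "__main__":
    cores = lb_cores(rps=15_000, new_tls_per_s=1_000)
    print(f"estimated cores: {cores:.2f}")  # estimated cores: 5.25
```

Rounding each line item up, as the breakdown above does, gives about 5.5 cores, which is what the 8-core recommendation and 16 vCPU production spec leave headroom over.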

Memory Requirements per Load Balancer Instance

Connection Tracking Table:
  100,000 concurrent connections × 512 bytes per entry = 50 MB

SSL Session Cache:
  50,000 sessions × 4KB per session = 200 MB

Routing Table:
  10,000 rules × 1KB per rule = 10 MB

Backend Server State:
  200 backends × 2KB per backend = 400 KB

Request Buffers:
  100,000 connections × 16KB buffer = 1.6 GB

Response Buffers:
  100,000 connections × 64KB buffer = 6.4 GB

HTTP Header Cache:
  10,000 entries × 2KB = 20 MB

Rate Limiting State:
  1,000,000 IP entries × 64 bytes = 64 MB

Kernel Network Buffers:
  TCP socket buffers: 2 GB

Application Code and Libraries: 500 MB

Total per LB Instance: ~11 GB
Production Spec: 32 GB RAM (with headroom for spikes)
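
Summing the line items confirms the ~11 GB total and shows where the memory goes. A minimal sketch, assuming decimal units (1KB = 1,000 bytes) as in the figures above; the dictionary keys are illustrative labels:

```python
MEMORY_BUDGET_BYTES = {
    "connection_tracking": 100_000 * 512,
    "ssl_session_cache": 50_000 * 4_000,
    "routing_table": 10_000 * 1_000,
    "backend_state": 200 * 2_000,
    "request_buffers": 100_000 * 16_000,
    "response_buffers": 100_000 * 64_000,
    "http_header_cache": 10_000 * 2_000,
    "rate_limiting": 1_000_000 * 64,
    "kernel_socket_buffers": 2_000_000_000,
    "code_and_libraries": 500_000_000,
}


def total_gb(budget: dict[str, int]) -> float:
    """Total memory in GB across all line items."""
    return sum(budget.values()) / 1e9


if __name__ == "__main__":
    total = total_gb(MEMORY_BUDGET_BYTES)
    print(f"total: {total:.1f} GB")
    # List the largest consumers first.
    for name, size in sorted(MEMORY_BUDGET_BYTES.items(), key=lambda kv: -kv[1])[:3]:
        print(f"{name}: {size / 1e9:.1f} GB ({size / 1e9 / total:.0%})")
```

Request and response buffers together account for roughly three quarters of the total, so per-connection buffer sizes are the first thing to tune if memory becomes the limit.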

Load Balancer Fleet Sizing

Total cluster capacity needed: 1.16M RPS average, 3.5M RPS peak
Per-LB capacity: 15,000 RPS (comfortable operating point)
LB instances needed for average: 1,160,000 / 15,000 = 78 instances
LB instances needed for peak: 3,500,000 / 15,000 = 234 instances

At 60% target utilization (N+2 redundancy adds two spare instances on top of each figure):
  Production fleet: 234 / 0.6 = 390 instances (peak capacity)
  Minimum fleet: 78 / 0.6 = 130 instances (average load)

Auto-scaling range: 130 - 400 instances
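
The fleet sizing steps above can be sketched as one function. The name `instances_needed` and the `spares` parameter are illustrative; utilization is passed as an integer percentage so that the rounding stays exact.

```python
import math


def instances_needed(
    rps: int,
    per_instance_rps: int = 15_000,
    target_utilization_pct: int = 60,
    spares: int = 0,
) -> int:
    """Instances required to serve `rps` at the target utilization, plus spares."""
    raw = math.ceil(rps / per_instance_rps)              # instances at 100% load
    sized = math.ceil(raw * 100 / target_utilization_pct)  # derate to the target
    return sized + spares


if __name__ == "__main__":
    print(instances_needed(1_160_000))            # 130 (average load)
    print(instances_needed(3_500_000))            # 390 (peak load)
    print(instances_needed(3_500_000, spares=2))  # 392 (peak with N+2 spares)
```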

Storage Requirements

Connection State Storage (In-Memory)

Active connection table:
  10M connections × 512 bytes = 5 GB (distributed across fleet)

Session persistence table:
  2M sticky sessions × 256 bytes = 512 MB (distributed)

Rate limiting counters:
  10M unique IPs × 64 bytes = 640 MB (fleet-wide, approximate; about 64 MB per instance at 1M tracked IPs)

Configuration Storage (Persistent)

Backend server configurations:
  10,000 servers × 2KB = 20 MB

Routing rules:
  50,000 rules × 1KB = 50 MB

SSL certificates:
  5,000 certificates × 10KB = 50 MB

ACL rules:
  100,000 rules × 256 bytes = 25 MB

Total Configuration: ~145 MB
Stored in: etcd/Consul with replication factor 3 = 435 MB

Health Check Data Storage

Health check results (last 24 hours):
  10,000 servers × 1 check/5 sec × 86,400 sec × 100 bytes = 17 GB/day

Health check history (30 days):
  17 GB × 30 = 510 GB

Stored in: Time-series database (Prometheus/InfluxDB)
With downsampling after 7 days: ~100 GB effective

Metrics and Logging Storage

Access logs:
  1.16M RPS × 500 bytes per log × 86,400 seconds = 50 TB/day

Metrics (aggregated):
  100 LB instances × 200 metrics × 1 data point/sec × 16 bytes = 27 GB/day

Error logs:
  0.1% error rate × 1.16M RPS × 1KB = 100 GB/day

Total Logging: ~50 TB/day
With 30-day retention: 1.5 PB
With compression (10:1): 150 TB
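
The log volume figures follow from rate × entry size × time. A short sketch of that calculation, assuming decimal units and the 30-day retention and 10:1 compression ratio stated above (function names are illustrative):

```python
SECONDS_PER_DAY = 86_400


def daily_volume_tb(events_per_second: float, bytes_per_event: float) -> float:
    """Uncompressed volume written per day, in TB."""
    return events_per_second * bytes_per_event * SECONDS_PER_DAY / 1e12


def retained_volume_tb(
    daily_tb: float, retention_days: int = 30, compression_ratio: float = 10.0
) -> float:
    """Compressed volume held at steady state for the retention window, in TB."""
    return daily_tb * retention_days / compression_ratio


if __name__ == "__main__":
    access = daily_volume_tb(1.16e6, 500)            # access logs
    errors = daily_volume_tb(1.16e6 * 0.001, 1_000)  # 0.1% of requests, 1KB each
    print(f"access logs: {access:.1f} TB/day, error logs: {errors * 1000:.0f} GB/day")
    print(f"retained: {retained_volume_tb(access + errors):.0f} TB")
```

Access logs dominate, so sampling or trimming the 500-byte entry has far more effect on storage than anything done to metrics or error logs.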

Performance Targets

Latency Requirements

  • Added Latency (L4 forwarding): <0.5ms p50, <1ms p99
  • Added Latency (L7 with SSL): <2ms p50, <5ms p99
  • SSL Handshake Time: <10ms p50, <30ms p99
  • Health Check Round-trip: <100ms
  • Configuration Propagation: <5 seconds across fleet
  • Failover Detection: <10 seconds
  • Failover Completion: <30 seconds

Throughput Targets

  • Cluster Throughput: 3.5M RPS peak capacity
  • Per-Instance Throughput: 15,000 RPS at L7, 100,000 RPS at L4
  • Per-Instance Bandwidth: 10 Gbps
  • New Connections per Instance: 10,000/sec
  • SSL Handshakes per Instance: 5,000/sec

Availability Targets

  • System Availability: 99.999% (5.26 minutes downtime/year)
  • Zero-downtime Deployments: Rolling updates with no dropped connections
  • Failover Time: <5 seconds for instance failure
  • Data Plane Availability: 99.9999% (31.5 seconds downtime/year)
  • Control Plane Availability: 99.99% (52.6 minutes downtime/year)
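
Each availability percentage above maps to a yearly downtime budget. A minimal sketch of the conversion, assuming a 365-day year:

```python
SECONDS_PER_YEAR = 365 * 86_400  # 31,536,000


def downtime_seconds_per_year(availability_pct: float) -> float:
    """Allowed downtime per year for a given availability percentage."""
    return (1 - availability_pct / 100) * SECONDS_PER_YEAR


if __name__ == "__main__":
    for label, pct in [
        ("system", 99.999),
        ("data plane", 99.9999),
        ("control plane", 99.99),
    ]:
        seconds = downtime_seconds_per_year(pct)
        print(f"{label}: {pct}% -> {seconds:.1f} s/year ({seconds / 60:.2f} min)")
```

At five nines the whole yearly budget is a little over five minutes, so a single failover that takes 30 seconds consumes roughly a tenth of it.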

Reliability Targets

  • Packet Loss: <0.001%
  • Connection Reset Rate: <0.01%
  • Health Check False Positive Rate: <0.1%
  • Configuration Error Rate: 0% (validated before deployment)

Cost Estimation

Compute Costs (Monthly)

Load Balancer Instances (average fleet of 200):
  200 instances × 16 vCPU, 32GB RAM
  On-demand: 200 × $400/month = $80,000/month
  Reserved (1-year): 200 × $250/month = $50,000/month

Auto-scaling buffer (peak adds 200 more):
  200 instances × $400/month × 20% utilization = $16,000/month

Total Compute: $66,000/month (with reserved pricing)

Network Costs (Monthly)

Outbound bandwidth (client-facing):
  308.5 GB/s average × 2,592,000 seconds/month = 800 PB/month
  At $0.02/GB (bulk pricing): $16M/month
  With CDN offload (90%): $1.6M/month

Inter-AZ traffic (the inter-LB communication estimated above):
  22.6 MB/s × 2,592,000 seconds = 58.6 TB/month
  At $0.01/GB: $586/month

Total Network: ~$1.6M/month (with CDN offload)

Storage Costs (Monthly)

Configuration storage (etcd cluster):
  3 nodes × $200/month = $600/month

Metrics storage (Prometheus/InfluxDB):
  100 GB × $0.10/GB = $10/month

Log storage (S3 + Elasticsearch):
  150 TB compressed × $0.023/GB = $3,450/month
  Elasticsearch (hot): 5 TB × $0.10/GB = $500/month

Total Storage: ~$4,560/month

SSL Certificate Costs

Wildcard certificates: 10 × $500/year = $5,000/year = $417/month
DV certificates (Let's Encrypt): Free
EV certificates: 5 × $1,000/year = $5,000/year = $417/month

Total Certificates: ~$834/month

Total Cost Summary

Compute:           $66,000/month
Network:        $1,600,000/month
Storage:            $4,560/month
Certificates:         $834/month
Monitoring/Tools:   $5,000/month
Operations (team):$100,000/month

Total:          ~$1,776,394/month
Cost per request: $1,776,394 / (100B requests/day × 30 days = 3T requests/month) = $0.0000006 per request (about $0.59 per million requests)
Cost per GB transferred: $1,776,394 / (800 PB) = $0.0022 per GB
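
The summary can be checked by summing the line items and dividing by monthly volume. This sketch assumes a 30-day month, matching the 2,592,000 seconds/month used in the network cost calculation, and divides the monthly total by a full month of requests; the names are illustrative.

```python
MONTHLY_COSTS_USD = {
    "compute": 66_000,
    "network": 1_600_000,
    "storage": 4_560,
    "certificates": 834,
    "monitoring_tools": 5_000,
    "operations_team": 100_000,
}

DAYS_PER_MONTH = 30


def total_monthly_cost() -> int:
    return sum(MONTHLY_COSTS_USD.values())


def cost_per_request(requests_per_day: float = 100e9) -> float:
    """Monthly cost divided by the number of requests served in a month."""
    return total_monthly_cost() / (requests_per_day * DAYS_PER_MONTH)


def cost_per_gb(gb_per_month: float = 800e6) -> float:
    """Monthly cost divided by outbound volume (800 PB = 800 million GB)."""
    return total_monthly_cost() / gb_per_month


if __name__ == "__main__":
    print(f"total: ${total_monthly_cost():,}/month")  # total: $1,776,394/month
    print(f"per million requests: ${cost_per_request() * 1e6:.2f}")
    print(f"per GB: ${cost_per_gb():.4f}")
```

Network is about 90% of the total, so the per-request cost is driven almost entirely by bandwidth pricing and the CDN offload ratio.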

Growth Projections

Year 1 → Year 3 Projections

                    Year 1          Year 2          Year 3
Requests/day:       100B            125B            156B
Peak RPS:           3.5M            4.4M            5.5M
Connections:        10M             12.5M           15.6M
Backend servers:    10,000          12,500          15,600
LB instances:       200             250             312
Monthly cost:       $1.8M           $2.2M           $2.8M
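
Each row of the table is the Year 1 value compounded at the 25% year-over-year growth rate. A minimal sketch (the `project` helper is illustrative):

```python
def project(year1_value: float, growth_rate: float = 0.25, years: int = 3) -> list[float]:
    """Compound a Year 1 value forward; index 0 is Year 1."""
    return [year1_value * (1 + growth_rate) ** n for n in range(years)]


if __name__ == "__main__":
    print(project(100))     # requests/day in billions: [100.0, 125.0, 156.25]
    print(project(10_000))  # backend servers: [10000.0, 12500.0, 15625.0]
    print(project(200))     # LB instances: [200.0, 250.0, 312.5]
```

The table rounds the Year 3 values (156B, 15,600, 312).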

Capacity Planning Triggers

  • Scale-up trigger: Average CPU > 60% for 5 minutes
  • Scale-up trigger: Connection count > 80% of capacity
  • Scale-up trigger: Request queue depth > 1000
  • Scale-down trigger: Average CPU < 30% for 15 minutes
  • Capacity review: Quarterly planning with 6-month lead time
  • Hardware refresh: Every 3 years for physical LB appliances
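
The scale-up and scale-down triggers above can be sketched as a single decision function. This is a simplified illustration rather than a real autoscaler: `FleetSample` and its fields are hypothetical, and "for 5 minutes" is approximated by a 5-minute rolling average instead of a sustained-threshold check.

```python
from dataclasses import dataclass


@dataclass
class FleetSample:
    cpu_pct_5min: float            # average CPU over the last 5 minutes
    cpu_pct_15min: float           # average CPU over the last 15 minutes
    connection_utilization: float  # connections / capacity, 0.0 to 1.0
    queue_depth: int               # pending requests


def scaling_decision(sample: FleetSample) -> str:
    """Return 'scale_up', 'scale_down', or 'hold' for one fleet sample."""
    # Any single scale-up trigger is enough, and scale-up wins over scale-down.
    if (
        sample.cpu_pct_5min > 60
        or sample.connection_utilization > 0.80
        or sample.queue_depth > 1000
    ):
        return "scale_up"
    if sample.cpu_pct_15min < 30:
        return "scale_down"
    return "hold"


if __name__ == "__main__":
    busy = FleetSample(cpu_pct_5min=72, cpu_pct_15min=55, connection_utilization=0.5, queue_depth=10)
    idle = FleetSample(cpu_pct_5min=20, cpu_pct_15min=22, connection_utilization=0.2, queue_depth=0)
    print(scaling_decision(busy))  # scale_up
    print(scaling_decision(idle))  # scale_down
```

Using a longer window for scale-down than for scale-up, as the triggers specify, keeps the fleet from oscillating when load hovers near a threshold.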

Technology Evolution Considerations

  • HTTP/3 (QUIC): Reduces connection overhead, changes connection tracking model
  • eBPF/XDP: Processes packets in the kernel at the driver level, before the full network stack; can raise L4 forwarding throughput substantially (up to ~10x in some workloads)
  • ARM processors: Better performance-per-watt for packet processing
  • SmartNICs: Offload SSL and packet processing to network cards
  • 5G growth: More mobile connections, smaller request sizes, higher connection churn

Scaling Bottlenecks

Connection Table Limits

  • Problem: 10M connections × 512 bytes = 5 GB distributed state
  • Bottleneck: Single-instance memory limits connection count
  • Solution: Stateless design with external session store, or connection-level sharding

SSL Handshake CPU

  • Problem: RSA-2048 limits to 1,000 handshakes/sec/core
  • Bottleneck: New connection storms during failover
  • Solution: ECDHE key exchange (5x the handshake rate per core in the figures above), session resumption, 0-RTT with TLS 1.3

Bandwidth Saturation

  • Problem: Large responses (media) saturate NIC bandwidth
  • Bottleneck: 10 Gbps NIC limits per-instance throughput
  • Solution: 25/100 Gbps NICs, DSR for large responses, CDN offload
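
A quick check shows how soon the NIC becomes the limit. The sketch below combines the request size distribution with the 15,000 RPS per-instance target and assumes every response passes through the load balancer, before any CDN offload or DSR; the function names are illustrative.

```python
# (share of requests, response bytes) from the request size distribution
RESPONSE_MIX = [(0.70, 5_000), (0.25, 50_000), (0.05, 5_000_000)]


def average_response_bytes() -> float:
    """Traffic-weighted mean response size."""
    return sum(share * size for share, size in RESPONSE_MIX)


def outbound_gbps(rps: float) -> float:
    """Outbound bandwidth in Gbps for one instance at the given request rate."""
    return rps * average_response_bytes() * 8 / 1e9


def max_rps_for_nic(nic_gbps: float) -> float:
    """Request rate at which outbound traffic alone fills the NIC."""
    return nic_gbps * 1e9 / 8 / average_response_bytes()


if __name__ == "__main__":
    print(f"average response: {average_response_bytes() / 1e3:.0f} KB")
    print(f"at 15,000 RPS: {outbound_gbps(15_000):.1f} Gbps")
    print(f"10 Gbps NIC saturates near {max_rps_for_nic(10):,.0f} RPS")
```

Under these assumptions a 10 Gbps NIC fills at well under the 15,000 RPS CPU-based target, which is why the large-response path needs a faster NIC, DSR, or CDN offload.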

Health Check Thundering Herd

  • Problem: All LB instances checking all backends simultaneously
  • Bottleneck: Backend servers overwhelmed by health checks
  • Solution: Staggered checks, shared health state, hierarchical health checking
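
One way to stagger checks is to give each (load balancer, backend) pair a fixed offset inside the check interval. The sketch below derives that offset from a hash so no coordination is needed; the identifiers and the 5-second interval are illustrative, and this is only one of several possible staggering schemes.

```python
import hashlib


def check_offset_seconds(lb_id: str, backend: str, interval_s: float = 5.0) -> float:
    """Deterministic offset in [0, interval_s) for one LB/backend pair."""
    digest = hashlib.sha256(f"{lb_id}:{backend}".encode()).digest()
    # The first 4 bytes give an integer in [0, 2**32), so the fraction is below 1.
    fraction = int.from_bytes(digest[:4], "big") / 2**32
    return fraction * interval_s


if __name__ == "__main__":
    backend = "10.0.0.1:8080"
    for n in range(5):
        lb = f"lb-{n}"
        print(f"{lb} probes {backend} at +{check_offset_seconds(lb, backend):.2f}s")
```

Because the offset depends only on the two identifiers, each instance keeps the same schedule across restarts, and probes from different instances land at different points in the interval instead of arriving together.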

This scale analysis is the foundation for designing a load balancer fleet that handles 3.5M RPS at peak while adding under 1ms of latency at L4 (under 5ms at L7 with SSL, p99) and meeting a five-nines availability target.