Load Balancer - Scaling Considerations
Horizontal Scaling Strategies
DNS-Based Load Distribution
Architecture:
Client -> DNS Resolver -> [LB IP 1, LB IP 2, LB IP 3, ...] -> LB Instance -> Backend
How it works:
1. DNS returns multiple A records for the LB domain
2. Client (or resolver) picks one IP (round-robin or random)
3. Each IP maps to a different LB instance or cluster
Configuration example (Route 53):
lb.example.com A 10.0.1.100 (weight: 50, health-check: enabled)
lb.example.com A 10.0.2.100 (weight: 50, health-check: enabled)
lb.example.com A 10.0.3.100 (weight: 50, health-check: enabled)
Advantages:
- Simple to implement
- No special network infrastructure
- Works across regions and providers
Limitations:
- DNS TTL causes stale routing (clients cache old IPs)
- Uneven distribution (resolvers cache for many clients)
- Slow failover (TTL-dependent, typically 30-300 seconds)
- No connection-aware balancing
Mitigation:
- Low TTL (30 seconds) for faster failover
- Health-checked DNS (Route 53, Cloudflare) removes unhealthy IPs
- Over-provision each LB to handle neighbor's traffic during failoverBGP Anycast
Architecture:
Client -> Internet -> BGP Routing -> Nearest LB (same IP advertised from multiple locations)
How it works:
1. Same IP prefix announced from multiple Points of Presence (PoPs)
2. BGP routing directs traffic to nearest PoP (by AS path length)
3. Each PoP has independent LB instances
Implementation:
- Announce 203.0.113.0/24 from US-East, US-West, EU-West, AP-Southeast
- Each location runs independent LB cluster
- BGP communities control traffic engineering
Advantages:
- Automatic geographic routing (lowest latency)
- Built-in DDoS absorption (attack distributed across PoPs)
- Instant failover (BGP withdrawal, ~30 seconds convergence)
- No DNS TTL issues
Limitations:
- Requires own AS number and IP space
- BGP convergence can cause connection resets
- Limited control over exact traffic distribution
- Complex operational requirements
Production usage:
- Cloudflare, Google, AWS CloudFront all use Anycast
- Typical convergence: 30-90 seconds for BGP updates
- Mitigation for connection resets: TCP Anycast with flow-aware routingECMP (Equal-Cost Multi-Path)
Architecture:
Client -> Router -> ECMP Hash -> [LB Instance 1, LB Instance 2, ...]
How it works:
1. Router has multiple equal-cost routes to LB instances
2. Per-flow hash (5-tuple) determines which LB receives each connection
3. All packets for same flow go to same LB instance
Hash function:
hash = crc32(src_ip, dst_ip, src_port, dst_port, protocol) % num_paths
Advantages:
- Wire-speed distribution (hardware routing)
- Per-flow consistency (no connection splitting)
- Linear scaling (add more LB instances = more paths)
- Sub-millisecond failover with BFD (Bidirectional Forwarding Detection)
Limitations:
- Adding/removing LB instances causes flow redistribution
- Hash imbalance possible with few large flows
- Requires L3 network infrastructure control
Resilient hashing (solving redistribution):
- Consistent hashing at router level
- Only affected flows (1/N) redistributed on member change
- Supported by modern routers (Juniper, Arista, Cisco)
Typical deployment:
- Top-of-rack switch with 16-64 ECMP paths
- Each path = one LB instance
- BFD health checking with 50ms detectionLayer 4 DSR (Direct Server Return) Scaling
Architecture:
Client -> LB (inbound only) -> Backend
Backend -> Client (direct, bypasses LB)
How it works:
1. LB receives packet, selects backend
2. LB rewrites destination MAC (L2 DSR) or encapsulates (L3 DSR/IP-in-IP)
3. Backend processes request, responds directly to client
4. Response traffic never touches LB
Bandwidth multiplication:
- LB only handles inbound traffic (typically 10-20% of total)
- 10 Gbps LB NIC effectively handles 50-100 Gbps of total traffic
- Critical for media/streaming workloads with large responses
Implementation options:
- L2 DSR: Backend and LB on same L2 segment, MAC rewrite
- L3 DSR (IP-in-IP): Backend can be anywhere, tunnel encapsulation
- L3 DSR (GRE): Similar to IP-in-IP with GRE encapsulation
Limitations:
- No response modification (can't add headers, compress, etc.)
- No SSL termination at LB (backend must handle TLS)
- Health checking more complex (can't observe responses)
- Backend must accept traffic for LB's VIPConnection Draining During Scale Events
Graceful Backend Removal
Drain process:
1. Mark backend as DRAINING (stop sending new connections)
2. Existing connections continue until completion or timeout
3. Monitor active connection count
4. After all connections close (or timeout), remove backend
Timeline:
t=0: Admin initiates drain
t=0: New connections stop routing to this backend
t=0-5m: Existing connections complete naturally
t=5m: Force-close remaining connections (configurable timeout)
t=5m: Backend fully removed from pool
Connection handling during drain:
- HTTP/1.1: Send "Connection: close" header on next response
- HTTP/2: Send GOAWAY frame
- WebSocket: Send close frame with 1001 (Going Away)
- TCP (L4): Wait for natural close or timeoutLB Instance Removal (Self-Drain)
When scaling down LB instances:
1. Remove LB from ECMP/DNS (stop new connections arriving)
2. Existing connections continue processing
3. For each active connection:
a. If HTTP keep-alive: close after current request completes
b. If long-lived (WebSocket): migrate or gracefully close
4. Wait for drain timeout (default: 300 seconds)
5. Force-terminate remaining connections
6. Shutdown LB instance
Connection migration (advanced):
- Transfer connection state to peer LB instance
- Client sees brief pause but no connection reset
- Requires shared state or state transfer protocol
- Used by: AWS NLB, Google MaglevZero-Downtime Deployment of LB Software
Rolling update strategy:
1. Deploy new version to 1 instance (canary)
2. Monitor error rates for 5 minutes
3. If healthy, proceed with rolling update (10% at a time)
4. Each instance: drain connections -> update -> rejoin cluster
5. Total deployment time: ~30 minutes for 200 instances
Blue-green deployment:
1. Spin up parallel fleet with new version
2. Shift traffic gradually (10% -> 50% -> 100%)
3. Monitor for issues at each stage
4. Rollback: shift traffic back to old fleet
5. Decommission old fleet after validation periodHigh Availability Configurations
Active-Active (Recommended)
Architecture:
All LB instances actively serving traffic simultaneously
Configuration:
- 200 LB instances, all receiving traffic via ECMP
- No primary/secondary distinction
- Each instance independently makes routing decisions
- Shared configuration via etcd/Consul
- Independent health checking (or shared health service)
Failure handling:
- Instance failure: ECMP/BFD detects in <1 second
- Traffic redistributed to remaining instances
- No state loss (stateless design or replicated state)
- Capacity planning: N+2 redundancy minimum
Advantages:
- Maximum resource utilization (all instances active)
- Linear scaling
- No failover delay
- No split-brain riskActive-Standby (Hot Standby)
Architecture:
Primary LB handles all traffic; standby monitors and takes over on failure
Configuration:
- Primary: Holds VIP, processes all traffic
- Standby: Monitors primary via heartbeat, ready to take over
- VRRP (Virtual Router Redundancy Protocol) for VIP failover
Failover process:
1. Standby detects primary failure (missed heartbeats)
2. Standby sends gratuitous ARP for VIP
3. Traffic redirected to standby within 1-3 seconds
4. Existing connections may reset (unless state is synchronized)
State synchronization options:
- Connection table replication (real-time mirroring)
- Session state in external store (Redis)
- Stateless design (no sync needed, connections reset on failover)
Limitations:
- 50% resource waste (standby idle during normal operation)
- Failover causes brief disruption
- State sync adds complexity and latency
- Single point of scaling (limited to one instance capacity)
Use case: Small deployments, hardware LB appliances (F5, Citrix)Multi-Region Active-Active
Architecture:
Independent LB clusters in each region, GeoDNS routes to nearest
US-East Cluster: 80 LB instances (35% of global traffic)
EU-West Cluster: 60 LB instances (30% of global traffic)
AP-Southeast Cluster: 50 LB instances (25% of global traffic)
US-West Cluster: 20 LB instances (10% of global traffic)
Cross-region failover:
1. Region health check fails (all instances in region down)
2. GeoDNS removes region from rotation (30-60 second TTL)
3. Traffic shifts to next-nearest region
4. Receiving region auto-scales to handle additional load
Capacity planning for failover:
- Each region sized to handle 150% of normal load
- Allows absorbing one region's traffic during failure
- Auto-scaling can add 50% more capacity within 5 minutesGeographic Load Balancing
GeoDNS Routing
Implementation:
- DNS resolver's source IP mapped to geographic location
- Response contains IP of nearest LB cluster
- Fallback to global anycast if geo-lookup fails
Example (Cloudflare/Route 53):
Client in New York -> resolves to US-East LB cluster
Client in London -> resolves to EU-West LB cluster
Client in Tokyo -> resolves to AP-Northeast LB cluster
Accuracy considerations:
- GeoIP databases ~95% accurate at country level
- ~80% accurate at city level
- VPN/proxy users may be misrouted
- Enterprise DNS resolvers may be in different location than users
- EDNS Client Subnet (ECS) improves accuracy for large resolversLatency-Based Routing
Implementation:
- Measure actual latency from client regions to each LB cluster
- Route to lowest-latency cluster (not necessarily geographically nearest)
- Continuous measurement updates routing decisions
Measurement methods:
- Active probing from LB clusters to known resolver IPs
- Passive measurement from real client connections (TCP handshake RTT)
- Third-party latency data (RIPE Atlas, Cloudflare Radar)
Example routing decision:
Client in São Paulo:
- US-East: 120ms RTT
- EU-West: 180ms RTT
- US-West: 150ms RTT
-> Route to US-East (lowest latency, not geographically nearest)Failover Routing
Priority-based failover:
Primary: US-East cluster (handles 100% when healthy)
Secondary: US-West cluster (takes over if primary fails)
Tertiary: EU-West cluster (last resort)
Health check configuration:
- Check interval: 10 seconds
- Failure threshold: 3 consecutive failures
- Recovery threshold: 2 consecutive successes
- Failover time: 30-60 seconds (health check + DNS TTL)
Weighted failover (gradual):
- Healthy: 100% to primary
- Degraded: 70% primary, 30% secondary
- Failed: 0% primary, 100% secondaryAuto-Scaling Triggers and Capacity Planning
Scaling Metrics and Thresholds
Scale-up triggers (any one triggers scaling):
- CPU utilization > 60% for 3 minutes
- Active connections > 80% of capacity for 2 minutes
- Request queue depth > 1000 for 1 minute
- Request latency p99 > 10ms for 5 minutes
- Bandwidth utilization > 70% for 3 minutes
- SSL handshake queue > 500 for 1 minute
Scale-down triggers (all must be true):
- CPU utilization < 30% for 15 minutes
- Active connections < 40% of capacity for 15 minutes
- No scale-up event in last 30 minutes
- Current time not in predicted peak window
Scaling parameters:
- Scale-up increment: 20% of current fleet (minimum 5 instances)
- Scale-down increment: 10% of current fleet (minimum 2 instances)
- Cooldown after scale-up: 5 minutes
- Cooldown after scale-down: 15 minutes
- Minimum fleet size: 130 instances
- Maximum fleet size: 400 instancesPredictive Scaling
Approach: ML-based traffic prediction + pre-scaling
Training data:
- Historical traffic patterns (last 90 days)
- Day-of-week patterns
- Time-of-day patterns
- Known events calendar (product launches, sales, etc.)
Pre-scaling actions:
- Scale up 30 minutes before predicted peak
- Scale up 2 hours before known events
- Maintain 40% headroom during predicted peaks
- Weekend/holiday scaling profiles
Example schedule:
Weekday:
06:00 UTC: Scale to 150 instances (morning ramp)
12:00 UTC: Scale to 250 instances (peak hours)
22:00 UTC: Scale to 180 instances (evening)
02:00 UTC: Scale to 130 instances (overnight minimum)Capacity Planning Process
Quarterly review:
1. Analyze growth trends (requests, connections, bandwidth)
2. Project 6-month forward capacity needs
3. Identify bottlenecks (CPU, memory, network, SSL)
4. Plan infrastructure procurement/reservation
5. Update auto-scaling limits
Key metrics for planning:
- Traffic growth rate: 25% YoY
- Peak-to-average ratio: 3x
- Event spike multiplier: 5x
- Required headroom: 40% above peak
Capacity formula:
Required instances = (peak_rps × event_multiplier) / (per_instance_capacity × target_utilization)
= (3.5M × 1.5) / (15,000 × 0.6)
= 583 instances (maximum event capacity)Performance Optimization
Kernel Bypass (DPDK)
What: Data Plane Development Kit - bypasses kernel network stack
Why: Kernel processing adds 5-10μs per packet; DPDK reduces to <1μs
Architecture:
Normal: NIC -> Kernel -> Socket -> Application
DPDK: NIC -> DPDK PMD (Poll Mode Driver) -> Application (direct memory access)
Performance gains:
- Packet processing: 10-100x improvement
- Latency: <1μs per packet (vs 5-10μs with kernel)
- Throughput: 10M+ packets/sec per core (vs 1M with kernel)
- Zero-copy: Packets processed in-place in NIC ring buffer
Implementation considerations:
- Dedicates CPU cores to packet processing (busy-polling)
- Requires hugepages for memory (2MB or 1GB pages)
- Bypasses kernel firewall (iptables) - must implement in userspace
- No standard socket API - custom networking code required
- Used by: VPP (fd.io), Seastar, F5 BIG-IP
Trade-offs:
+ Massive performance improvement for packet-heavy workloads
- Higher CPU utilization (dedicated cores always busy)
- More complex development and debugging
- Loses kernel network features (must reimplement)XDP/eBPF (eXpress Data Path)
What: Programmable packet processing at the kernel's lowest level
Why: Near-DPDK performance while keeping kernel integration
Architecture:
NIC -> XDP Hook (eBPF program) -> [DROP | PASS | REDIRECT | TX]
Performance:
- Processing at NIC driver level (before sk_buff allocation)
- 10M+ packets/sec per core
- <5μs latency addition
- Can offload to NIC hardware (XDP offload)
Use cases for LB:
- DDoS mitigation (drop bad packets at line rate)
- L4 load balancing (redirect packets to backends)
- Connection tracking (lightweight state in eBPF maps)
- Rate limiting (token bucket in eBPF)
Example XDP load balancer (simplified):
1. Parse packet headers (IP + TCP/UDP)
2. Lookup connection in eBPF hash map
3. If existing: rewrite destination, XDP_TX
4. If new: select backend (hash), create entry, XDP_TX
5. If invalid: XDP_DROP
Production usage:
- Facebook Katran (L4 LB, handles all Facebook traffic)
- Cloudflare (DDoS mitigation)
- Cilium (Kubernetes networking and LB)
Trade-offs:
+ Kernel integration maintained (monitoring, debugging)
+ No dedicated CPU cores (event-driven)
+ Can coexist with normal kernel networking
- Limited program complexity (eBPF verifier constraints)
- L7 processing still requires userspace
- Newer technology, less tooling maturityConnection Multiplexing
Problem: 10,000 backend servers × 100 connections each = 1M backend connections
Solution: Multiplex many client connections over fewer backend connections
HTTP/1.1 Connection Pooling:
- Maintain pool of keep-alive connections to each backend
- Reuse connections across different clients
- Pool size: 100-500 connections per backend
- Reduces TCP handshake overhead by 90%+
HTTP/2 Multiplexing:
- Single TCP connection carries multiple concurrent streams
- 1 connection to backend handles 100+ concurrent requests
- Reduces backend connection count by 100x
- Eliminates head-of-line blocking at HTTP level
gRPC Multiplexing:
- Built on HTTP/2, same multiplexing benefits
- Persistent connections with stream-level load balancing
- Connection count: 1-4 per backend (sufficient for most loads)
Backend connection pool sizing:
Per-backend pool: min=10, max=500, idle_timeout=60s
Total backend connections: 10,000 backends × 100 avg = 1M connections
With HTTP/2: 10,000 backends × 4 connections = 40,000 connections (25x reduction)Keep-Alive Optimization
Client-side keep-alive:
- Default timeout: 60 seconds
- Max requests per connection: 1000
- Reduces new connection rate by 85%
- Saves SSL handshake cost on subsequent requests
Backend keep-alive:
- Persistent connections to backends
- Timeout: 300 seconds (longer than client-side)
- Pre-warmed connections (maintain minimum pool)
- Health check reuses keep-alive connections
TCP optimization:
- TCP Fast Open (TFO): Save 1 RTT on reconnection
- TCP keepalive probes: Detect dead connections (interval=30s, count=3)
- SO_REUSEPORT: Multiple LB threads share same port
- TCP_NODELAY: Disable Nagle's algorithm for low latency
TLS optimization:
- Session resumption (TLS session tickets): Skip full handshake
- 0-RTT (TLS 1.3 early data): Send data with first flight
- OCSP stapling: Avoid client-side OCSP lookup
- Certificate compression: Reduce handshake sizeMemory and CPU Optimization
Memory optimization:
- Hugepages (2MB/1GB): Reduce TLB misses for large hash tables
- NUMA-aware allocation: Keep data on same NUMA node as processing CPU
- Object pooling: Pre-allocate connection entries, avoid malloc/free
- Zero-copy sendfile: Avoid copying data between kernel and userspace
CPU optimization:
- CPU pinning: Dedicate cores to packet processing
- RSS (Receive Side Scaling): Distribute NIC interrupts across cores
- RFS (Receive Flow Steering): Route packets to CPU processing the flow
- Busy polling: Avoid interrupt overhead for high-throughput
- SIMD instructions: Batch packet header parsing
Cache optimization:
- Cache-line aligned data structures (64 bytes)
- Hot/cold data separation (frequently accessed fields together)
- Prefetching: Prefetch next connection entry during current processing
- Avoid false sharing: Pad per-CPU counters to cache line boundariesMonitoring and Observability at Scale
Key Metrics for Scaling Decisions
Infrastructure metrics:
- CPU utilization per instance (target: <60%)
- Memory utilization per instance (target: <80%)
- NIC bandwidth utilization (target: <70%)
- Connection table utilization (target: <80%)
- SSL session cache hit rate (target: >90%)
Traffic metrics:
- Requests per second (total and per-pool)
- New connections per second
- Active connections
- Bandwidth (inbound/outbound)
- Request queue depth
Quality metrics:
- Latency percentiles (p50, p90, p99, p999)
- Error rates (4xx, 5xx, connection errors)
- Health check pass rate
- Retry rate
- Circuit breaker trip rate
Alerting thresholds:
- P1 (page): Error rate > 1%, latency p99 > 100ms, availability < 99.9%
- P2 (ticket): CPU > 70%, connections > 85% capacity, cert expiring < 7 days
- P3 (review): Growth rate exceeding projections, cost anomaliesThis scaling guide ensures the load balancer can handle exponential traffic growth while maintaining sub-millisecond latency overhead, five-nines availability, and cost-effective resource utilization.