Load Balancer - Scaling Considerations

Horizontal Scaling Strategies

DNS-Based Load Distribution

Architecture:
  Client -> DNS Resolver -> [LB IP 1, LB IP 2, LB IP 3, ...] -> LB Instance -> Backend

How it works:
  1. DNS returns multiple A records for the LB domain
  2. Client (or resolver) picks one IP (round-robin or random)
  3. Each IP maps to a different LB instance or cluster

Configuration example (Route 53):
  lb.example.com  A  10.0.1.100  (weight: 50, health-check: enabled)
  lb.example.com  A  10.0.2.100  (weight: 50, health-check: enabled)
  lb.example.com  A  10.0.3.100  (weight: 50, health-check: enabled)

Advantages:
  - Simple to implement
  - No special network infrastructure
  - Works across regions and providers

Limitations:
  - DNS TTL causes stale routing (clients cache old IPs)
  - Uneven distribution (resolvers cache for many clients)
  - Slow failover (TTL-dependent, typically 30-300 seconds)
  - No connection-aware balancing

Mitigation:
  - Low TTL (30 seconds) for faster failover
  - Health-checked DNS (Route 53, Cloudflare) removes unhealthy IPs
  - Over-provision each LB to handle neighbor's traffic during failover

BGP Anycast

Architecture:
  Client -> Internet -> BGP Routing -> Nearest LB (same IP advertised from multiple locations)

How it works:
  1. Same IP prefix announced from multiple Points of Presence (PoPs)
  2. BGP routing directs traffic to nearest PoP (by AS path length)
  3. Each PoP has independent LB instances

Implementation:
  - Announce 203.0.113.0/24 from US-East, US-West, EU-West, AP-Southeast
  - Each location runs independent LB cluster
  - BGP communities control traffic engineering

Advantages:
  - Automatic geographic routing (lowest latency)
  - Built-in DDoS absorption (attack distributed across PoPs)
  - Instant failover (BGP withdrawal, ~30 seconds convergence)
  - No DNS TTL issues

Limitations:
  - Requires own AS number and IP space
  - BGP convergence can cause connection resets
  - Limited control over exact traffic distribution
  - Complex operational requirements

Production usage:
  - Cloudflare, Google, AWS CloudFront all use Anycast
  - Typical convergence: 30-90 seconds for BGP updates
  - Mitigation for connection resets: TCP Anycast with flow-aware routing

ECMP (Equal-Cost Multi-Path)

Architecture:
  Client -> Router -> ECMP Hash -> [LB Instance 1, LB Instance 2, ...]

How it works:
  1. Router has multiple equal-cost routes to LB instances
  2. Per-flow hash (5-tuple) determines which LB receives each connection
  3. All packets for same flow go to same LB instance

Hash function:
  hash = crc32(src_ip, dst_ip, src_port, dst_port, protocol) % num_paths

Advantages:
  - Wire-speed distribution (hardware routing)
  - Per-flow consistency (no connection splitting)
  - Linear scaling (add more LB instances = more paths)
  - Sub-millisecond failover with BFD (Bidirectional Forwarding Detection)

Limitations:
  - Adding/removing LB instances causes flow redistribution
  - Hash imbalance possible with few large flows
  - Requires L3 network infrastructure control

Resilient hashing (solving redistribution):
  - Consistent hashing at router level
  - Only affected flows (1/N) redistributed on member change
  - Supported by modern routers (Juniper, Arista, Cisco)

Typical deployment:
  - Top-of-rack switch with 16-64 ECMP paths
  - Each path = one LB instance
  - BFD health checking with 50ms detection

Layer 4 DSR (Direct Server Return) Scaling

Architecture:
  Client -> LB (inbound only) -> Backend
  Backend -> Client (direct, bypasses LB)

How it works:
  1. LB receives packet, selects backend
  2. LB rewrites destination MAC (L2 DSR) or encapsulates (L3 DSR/IP-in-IP)
  3. Backend processes request, responds directly to client
  4. Response traffic never touches LB

Bandwidth multiplication:
  - LB only handles inbound traffic (typically 10-20% of total)
  - 10 Gbps LB NIC effectively handles 50-100 Gbps of total traffic
  - Critical for media/streaming workloads with large responses

Implementation options:
  - L2 DSR: Backend and LB on same L2 segment, MAC rewrite
  - L3 DSR (IP-in-IP): Backend can be anywhere, tunnel encapsulation
  - L3 DSR (GRE): Similar to IP-in-IP with GRE encapsulation

Limitations:
  - No response modification (can't add headers, compress, etc.)
  - No SSL termination at LB (backend must handle TLS)
  - Health checking more complex (can't observe responses)
  - Backend must accept traffic for LB's VIP

Connection Draining During Scale Events

Graceful Backend Removal

Drain process:
  1. Mark backend as DRAINING (stop sending new connections)
  2. Existing connections continue until completion or timeout
  3. Monitor active connection count
  4. After all connections close (or timeout), remove backend

Timeline:
  t=0:    Admin initiates drain
  t=0:    New connections stop routing to this backend
  t=0-5m: Existing connections complete naturally
  t=5m:   Force-close remaining connections (configurable timeout)
  t=5m:   Backend fully removed from pool

Connection handling during drain:
  - HTTP/1.1: Send "Connection: close" header on next response
  - HTTP/2: Send GOAWAY frame
  - WebSocket: Send close frame with 1001 (Going Away)
  - TCP (L4): Wait for natural close or timeout

LB Instance Removal (Self-Drain)

When scaling down LB instances:

1. Remove LB from ECMP/DNS (stop new connections arriving)
2. Existing connections continue processing
3. For each active connection:
   a. If HTTP keep-alive: close after current request completes
   b. If long-lived (WebSocket): migrate or gracefully close
4. Wait for drain timeout (default: 300 seconds)
5. Force-terminate remaining connections
6. Shutdown LB instance

Connection migration (advanced):
  - Transfer connection state to peer LB instance
  - Client sees brief pause but no connection reset
  - Requires shared state or state transfer protocol
  - Used by: AWS NLB, Google Maglev

Zero-Downtime Deployment of LB Software

Rolling update strategy:
  1. Deploy new version to 1 instance (canary)
  2. Monitor error rates for 5 minutes
  3. If healthy, proceed with rolling update (10% at a time)
  4. Each instance: drain connections -> update -> rejoin cluster
  5. Total deployment time: ~30 minutes for 200 instances

Blue-green deployment:
  1. Spin up parallel fleet with new version
  2. Shift traffic gradually (10% -> 50% -> 100%)
  3. Monitor for issues at each stage
  4. Rollback: shift traffic back to old fleet
  5. Decommission old fleet after validation period

High Availability Configurations

Active-Active (Recommended)

Architecture:
  All LB instances actively serving traffic simultaneously

Configuration:
  - 200 LB instances, all receiving traffic via ECMP
  - No primary/secondary distinction
  - Each instance independently makes routing decisions
  - Shared configuration via etcd/Consul
  - Independent health checking (or shared health service)

Failure handling:
  - Instance failure: ECMP/BFD detects in <1 second
  - Traffic redistributed to remaining instances
  - No state loss (stateless design or replicated state)
  - Capacity planning: N+2 redundancy minimum

Advantages:
  - Maximum resource utilization (all instances active)
  - Linear scaling
  - No failover delay
  - No split-brain risk

Active-Standby (Hot Standby)

Architecture:
  Primary LB handles all traffic; standby monitors and takes over on failure

Configuration:
  - Primary: Holds VIP, processes all traffic
  - Standby: Monitors primary via heartbeat, ready to take over
  - VRRP (Virtual Router Redundancy Protocol) for VIP failover

Failover process:
  1. Standby detects primary failure (missed heartbeats)
  2. Standby sends gratuitous ARP for VIP
  3. Traffic redirected to standby within 1-3 seconds
  4. Existing connections may reset (unless state is synchronized)

State synchronization options:
  - Connection table replication (real-time mirroring)
  - Session state in external store (Redis)
  - Stateless design (no sync needed, connections reset on failover)

Limitations:
  - 50% resource waste (standby idle during normal operation)
  - Failover causes brief disruption
  - State sync adds complexity and latency
  - Single point of scaling (limited to one instance capacity)

Use case: Small deployments, hardware LB appliances (F5, Citrix)

Multi-Region Active-Active

Architecture:
  Independent LB clusters in each region, GeoDNS routes to nearest

  US-East Cluster: 80 LB instances (35% of global traffic)
  EU-West Cluster: 60 LB instances (30% of global traffic)
  AP-Southeast Cluster: 50 LB instances (25% of global traffic)
  US-West Cluster: 20 LB instances (10% of global traffic)

Cross-region failover:
  1. Region health check fails (all instances in region down)
  2. GeoDNS removes region from rotation (30-60 second TTL)
  3. Traffic shifts to next-nearest region
  4. Receiving region auto-scales to handle additional load

Capacity planning for failover:
  - Each region sized to handle 150% of normal load
  - Allows absorbing one region's traffic during failure
  - Auto-scaling can add 50% more capacity within 5 minutes

Geographic Load Balancing

GeoDNS Routing

Implementation:
  - DNS resolver's source IP mapped to geographic location
  - Response contains IP of nearest LB cluster
  - Fallback to global anycast if geo-lookup fails

Example (Cloudflare/Route 53):
  Client in New York -> resolves to US-East LB cluster
  Client in London -> resolves to EU-West LB cluster
  Client in Tokyo -> resolves to AP-Northeast LB cluster

Accuracy considerations:
  - GeoIP databases ~95% accurate at country level
  - ~80% accurate at city level
  - VPN/proxy users may be misrouted
  - Enterprise DNS resolvers may be in different location than users
  - EDNS Client Subnet (ECS) improves accuracy for large resolvers

Latency-Based Routing

Implementation:
  - Measure actual latency from client regions to each LB cluster
  - Route to lowest-latency cluster (not necessarily geographically nearest)
  - Continuous measurement updates routing decisions

Measurement methods:
  - Active probing from LB clusters to known resolver IPs
  - Passive measurement from real client connections (TCP handshake RTT)
  - Third-party latency data (RIPE Atlas, Cloudflare Radar)

Example routing decision:
  Client in São Paulo:
    - US-East: 120ms RTT
    - EU-West: 180ms RTT
    - US-West: 150ms RTT
    -> Route to US-East (lowest latency, not geographically nearest)

Failover Routing

Priority-based failover:
  Primary: US-East cluster (handles 100% when healthy)
  Secondary: US-West cluster (takes over if primary fails)
  Tertiary: EU-West cluster (last resort)

Health check configuration:
  - Check interval: 10 seconds
  - Failure threshold: 3 consecutive failures
  - Recovery threshold: 2 consecutive successes
  - Failover time: 30-60 seconds (health check + DNS TTL)

Weighted failover (gradual):
  - Healthy: 100% to primary
  - Degraded: 70% primary, 30% secondary
  - Failed: 0% primary, 100% secondary

Auto-Scaling Triggers and Capacity Planning

Scaling Metrics and Thresholds

Scale-up triggers (any one triggers scaling):
  - CPU utilization > 60% for 3 minutes
  - Active connections > 80% of capacity for 2 minutes
  - Request queue depth > 1000 for 1 minute
  - Request latency p99 > 10ms for 5 minutes
  - Bandwidth utilization > 70% for 3 minutes
  - SSL handshake queue > 500 for 1 minute

Scale-down triggers (all must be true):
  - CPU utilization < 30% for 15 minutes
  - Active connections < 40% of capacity for 15 minutes
  - No scale-up event in last 30 minutes
  - Current time not in predicted peak window

Scaling parameters:
  - Scale-up increment: 20% of current fleet (minimum 5 instances)
  - Scale-down increment: 10% of current fleet (minimum 2 instances)
  - Cooldown after scale-up: 5 minutes
  - Cooldown after scale-down: 15 minutes
  - Minimum fleet size: 130 instances
  - Maximum fleet size: 400 instances

Predictive Scaling

Approach: ML-based traffic prediction + pre-scaling

Training data:
  - Historical traffic patterns (last 90 days)
  - Day-of-week patterns
  - Time-of-day patterns
  - Known events calendar (product launches, sales, etc.)

Pre-scaling actions:
  - Scale up 30 minutes before predicted peak
  - Scale up 2 hours before known events
  - Maintain 40% headroom during predicted peaks
  - Weekend/holiday scaling profiles

Example schedule:
  Weekday:
    06:00 UTC: Scale to 150 instances (morning ramp)
    12:00 UTC: Scale to 250 instances (peak hours)
    22:00 UTC: Scale to 180 instances (evening)
    02:00 UTC: Scale to 130 instances (overnight minimum)

Capacity Planning Process

Quarterly review:
  1. Analyze growth trends (requests, connections, bandwidth)
  2. Project 6-month forward capacity needs
  3. Identify bottlenecks (CPU, memory, network, SSL)
  4. Plan infrastructure procurement/reservation
  5. Update auto-scaling limits

Key metrics for planning:
  - Traffic growth rate: 25% YoY
  - Peak-to-average ratio: 3x
  - Event spike multiplier: 5x
  - Required headroom: 40% above peak

Capacity formula:
  Required instances = (peak_rps × event_multiplier) / (per_instance_capacity × target_utilization)
  = (3.5M × 1.5) / (15,000 × 0.6)
  = 583 instances (maximum event capacity)

Performance Optimization

Kernel Bypass (DPDK)

What: Data Plane Development Kit - bypasses kernel network stack
Why: Kernel processing adds 5-10μs per packet; DPDK reduces to <1μs

Architecture:
  Normal: NIC -> Kernel -> Socket -> Application
  DPDK:   NIC -> DPDK PMD (Poll Mode Driver) -> Application (direct memory access)

Performance gains:
  - Packet processing: 10-100x improvement
  - Latency: <1μs per packet (vs 5-10μs with kernel)
  - Throughput: 10M+ packets/sec per core (vs 1M with kernel)
  - Zero-copy: Packets processed in-place in NIC ring buffer

Implementation considerations:
  - Dedicates CPU cores to packet processing (busy-polling)
  - Requires hugepages for memory (2MB or 1GB pages)
  - Bypasses kernel firewall (iptables) - must implement in userspace
  - No standard socket API - custom networking code required
  - Used by: VPP (fd.io), Seastar, F5 BIG-IP

Trade-offs:
  + Massive performance improvement for packet-heavy workloads
  - Higher CPU utilization (dedicated cores always busy)
  - More complex development and debugging
  - Loses kernel network features (must reimplement)

XDP/eBPF (eXpress Data Path)

What: Programmable packet processing at the kernel's lowest level
Why: Near-DPDK performance while keeping kernel integration

Architecture:
  NIC -> XDP Hook (eBPF program) -> [DROP | PASS | REDIRECT | TX]

Performance:
  - Processing at NIC driver level (before sk_buff allocation)
  - 10M+ packets/sec per core
  - <5μs latency addition
  - Can offload to NIC hardware (XDP offload)

Use cases for LB:
  - DDoS mitigation (drop bad packets at line rate)
  - L4 load balancing (redirect packets to backends)
  - Connection tracking (lightweight state in eBPF maps)
  - Rate limiting (token bucket in eBPF)

Example XDP load balancer (simplified):
  1. Parse packet headers (IP + TCP/UDP)
  2. Lookup connection in eBPF hash map
  3. If existing: rewrite destination, XDP_TX
  4. If new: select backend (hash), create entry, XDP_TX
  5. If invalid: XDP_DROP

Production usage:
  - Facebook Katran (L4 LB, handles all Facebook traffic)
  - Cloudflare (DDoS mitigation)
  - Cilium (Kubernetes networking and LB)

Trade-offs:
  + Kernel integration maintained (monitoring, debugging)
  + No dedicated CPU cores (event-driven)
  + Can coexist with normal kernel networking
  - Limited program complexity (eBPF verifier constraints)
  - L7 processing still requires userspace
  - Newer technology, less tooling maturity

Connection Multiplexing

Problem: 10,000 backend servers × 100 connections each = 1M backend connections
Solution: Multiplex many client connections over fewer backend connections

HTTP/1.1 Connection Pooling:
  - Maintain pool of keep-alive connections to each backend
  - Reuse connections across different clients
  - Pool size: 100-500 connections per backend
  - Reduces TCP handshake overhead by 90%+

HTTP/2 Multiplexing:
  - Single TCP connection carries multiple concurrent streams
  - 1 connection to backend handles 100+ concurrent requests
  - Reduces backend connection count by 100x
  - Eliminates head-of-line blocking at HTTP level

gRPC Multiplexing:
  - Built on HTTP/2, same multiplexing benefits
  - Persistent connections with stream-level load balancing
  - Connection count: 1-4 per backend (sufficient for most loads)

Backend connection pool sizing:
  Per-backend pool: min=10, max=500, idle_timeout=60s
  Total backend connections: 10,000 backends × 100 avg = 1M connections
  With HTTP/2: 10,000 backends × 4 connections = 40,000 connections (25x reduction)

Keep-Alive Optimization

Client-side keep-alive:
  - Default timeout: 60 seconds
  - Max requests per connection: 1000
  - Reduces new connection rate by 85%
  - Saves SSL handshake cost on subsequent requests

Backend keep-alive:
  - Persistent connections to backends
  - Timeout: 300 seconds (longer than client-side)
  - Pre-warmed connections (maintain minimum pool)
  - Health check reuses keep-alive connections

TCP optimization:
  - TCP Fast Open (TFO): Save 1 RTT on reconnection
  - TCP keepalive probes: Detect dead connections (interval=30s, count=3)
  - SO_REUSEPORT: Multiple LB threads share same port
  - TCP_NODELAY: Disable Nagle's algorithm for low latency

TLS optimization:
  - Session resumption (TLS session tickets): Skip full handshake
  - 0-RTT (TLS 1.3 early data): Send data with first flight
  - OCSP stapling: Avoid client-side OCSP lookup
  - Certificate compression: Reduce handshake size

Memory and CPU Optimization

Memory optimization:
  - Hugepages (2MB/1GB): Reduce TLB misses for large hash tables
  - NUMA-aware allocation: Keep data on same NUMA node as processing CPU
  - Object pooling: Pre-allocate connection entries, avoid malloc/free
  - Zero-copy sendfile: Avoid copying data between kernel and userspace

CPU optimization:
  - CPU pinning: Dedicate cores to packet processing
  - RSS (Receive Side Scaling): Distribute NIC interrupts across cores
  - RFS (Receive Flow Steering): Route packets to CPU processing the flow
  - Busy polling: Avoid interrupt overhead for high-throughput
  - SIMD instructions: Batch packet header parsing

Cache optimization:
  - Cache-line aligned data structures (64 bytes)
  - Hot/cold data separation (frequently accessed fields together)
  - Prefetching: Prefetch next connection entry during current processing
  - Avoid false sharing: Pad per-CPU counters to cache line boundaries

Monitoring and Observability at Scale

Key Metrics for Scaling Decisions

Infrastructure metrics:
  - CPU utilization per instance (target: <60%)
  - Memory utilization per instance (target: <80%)
  - NIC bandwidth utilization (target: <70%)
  - Connection table utilization (target: <80%)
  - SSL session cache hit rate (target: >90%)

Traffic metrics:
  - Requests per second (total and per-pool)
  - New connections per second
  - Active connections
  - Bandwidth (inbound/outbound)
  - Request queue depth

Quality metrics:
  - Latency percentiles (p50, p90, p99, p999)
  - Error rates (4xx, 5xx, connection errors)
  - Health check pass rate
  - Retry rate
  - Circuit breaker trip rate

Alerting thresholds:
  - P1 (page): Error rate > 1%, latency p99 > 100ms, availability < 99.9%
  - P2 (ticket): CPU > 70%, connections > 85% capacity, cert expiring < 7 days
  - P3 (review): Growth rate exceeding projections, cost anomalies

This scaling guide ensures the load balancer can handle exponential traffic growth while maintaining sub-millisecond latency overhead, five-nines availability, and cost-effective resource utilization.