Load Balancer - Database Design
Overview
A load balancer's "database" is fundamentally different from traditional application databases. Most state is held in-memory for microsecond-level access, with persistent storage used only for configuration, health history, and metrics. The design prioritizes read speed and lock-free concurrent access over durability.
Connection Tracking Tables
Active Connection Table (In-Memory Hash Map)
ConnectionEntry {
connection_id: uint64 // Hash of 5-tuple (src_ip, src_port, dst_ip, dst_port, protocol)
source_ip: uint32 // Client IP (IPv4) or uint128 (IPv6)
source_port: uint16 // Client ephemeral port
destination_ip: uint32 // Selected backend IP
destination_port: uint16 // Backend service port
protocol: uint8 // TCP=6, UDP=17
state: enum { // Connection lifecycle state
SYN_RECEIVED,
ESTABLISHED,
FIN_WAIT,
CLOSE_WAIT,
TIME_WAIT,
CLOSED
}
created_at: uint64 // Unix timestamp (nanoseconds)
last_activity: uint64 // Last packet timestamp
idle_timeout: uint32 // Seconds before timeout
bytes_in: uint64 // Bytes from client
bytes_out: uint64 // Bytes from backend
packets_in: uint64 // Packets from client
packets_out: uint64 // Packets from backend
backend_server_id: uint32 // Reference to backend server
pool_id: uint16 // Service pool identifier
ssl_session_id: bytes[32] // TLS session for resumption
http_request_count: uint32 // Requests on this connection (keep-alive)
flags: uint16 // Bitfield: is_websocket, is_http2, is_ssl, etc.
}
// Size: ~180 bytes per entry
// Total for 10M connections: ~1.8 GBConnection Lookup Data Structures
// Primary lookup: Hash table indexed by 5-tuple hash
// O(1) average lookup time, critical for packet forwarding path
struct connection_hash_table {
uint32_t bucket_count; // 2^24 = 16M buckets
uint32_t entry_count; // Current active connections
connection_entry* buckets[]; // Array of bucket pointers
rwlock_t bucket_locks[]; // Per-bucket read-write locks
};
// Secondary index: Backend server -> connections (for drain operations)
// Used when removing a backend to find all affected connections
struct backend_connection_index {
uint32_t server_id;
linked_list connection_ids; // All connections to this backend
uint32_t count;
};
// Timeout wheel: Hierarchical timing wheel for connection expiry
// Efficiently expires idle connections without scanning all entries
struct timing_wheel {
uint32_t tick_ms; // 100ms per tick
uint32_t wheel_size; // 4096 slots (covers ~7 minutes)
linked_list slots[4096]; // Connections expiring at each tick
uint64_t current_tick;
struct timing_wheel* overflow; // For longer timeouts
};Connection Table Partitioning Strategy
Partitioning approach: Lock-free concurrent hash map with sharding
Shard count: 256 shards (power of 2 for fast modulo)
Shard assignment: connection_id % 256
Per-shard lock: Read-write spinlock (readers don't block each other)
Memory layout (cache-line aligned):
Each shard: 64-byte aligned structure
Each entry: 192 bytes (padded to cache line boundary)
Avoids false sharing between CPU cores
Eviction policy:
- TIME_WAIT entries: 120 seconds then evict
- Idle connections: configurable timeout (default 300s)
- Emergency eviction: LRU when table > 90% capacityServer Pool Configuration Schema
Backend Server Registry (Persistent - etcd/Consul)
{
"server_id": "backend-us-east-1a-api-0042",
"pool_id": "pool-api-production",
"hostname": "api-0042.us-east-1a.internal",
"ip_address": "10.0.42.100",
"port": 8080,
"weight": 100,
"max_connections": 10000,
"max_requests_per_second": 5000,
"status": "HEALTHY",
"admin_status": "ENABLED",
"metadata": {
"availability_zone": "us-east-1a",
"instance_type": "c5.2xlarge",
"version": "v2.3.1",
"canary": false,
"warm_up_time_seconds": 30
},
"tls_config": {
"enabled": true,
"verify_certificate": true,
"ca_certificate": "/certs/internal-ca.pem",
"sni_hostname": "api.internal.example.com"
},
"created_at": "2024-01-15T10:30:00Z",
"updated_at": "2024-01-20T14:22:00Z",
"last_health_check": "2024-01-20T14:22:05Z"
}Server Pool Configuration
{
"pool_id": "pool-api-production",
"name": "API Production Pool",
"algorithm": "LEAST_CONNECTIONS",
"algorithm_config": {
"weighted": true,
"slow_start_duration_seconds": 30,
"power_of_two_choices": true
},
"health_check": {
"protocol": "HTTP",
"path": "/health",
"port": 8080,
"interval_seconds": 5,
"timeout_seconds": 2,
"unhealthy_threshold": 3,
"healthy_threshold": 2,
"expected_status_codes": [200],
"expected_body_regex": "\"status\":\"ok\"",
"tls_enabled": true
},
"session_persistence": {
"enabled": true,
"type": "COOKIE",
"cookie_name": "SERVERID",
"ttl_seconds": 3600
},
"connection_limits": {
"max_connections_per_backend": 10000,
"max_pending_requests": 1000,
"connection_timeout_ms": 5000,
"idle_timeout_seconds": 300,
"max_request_time_seconds": 60
},
"circuit_breaker": {
"enabled": true,
"error_threshold_percent": 50,
"evaluation_window_seconds": 30,
"min_request_count": 100,
"recovery_timeout_seconds": 60
},
"retry_policy": {
"max_retries": 2,
"retry_on": ["connection_error", "5xx", "reset"],
"retry_budget_percent": 20
},
"servers": ["backend-us-east-1a-api-0042", "backend-us-east-1a-api-0043"],
"created_at": "2024-01-10T08:00:00Z",
"updated_at": "2024-01-20T12:00:00Z"
}Health Check Results Storage
Health Check State (In-Memory)
HealthCheckState {
server_id: string
current_status: enum { HEALTHY, UNHEALTHY, DEGRADED, UNKNOWN }
consecutive_successes: uint32
consecutive_failures: uint32
last_check_time: timestamp
last_success_time: timestamp
last_failure_time: timestamp
last_response_time_ms: float32
avg_response_time_ms: float32 // Exponential moving average
last_status_code: uint16
last_error: string // "connection_refused", "timeout", etc.
flap_count: uint32 // Status changes in last hour
in_slow_start: bool // Recently became healthy
}Health Check History (Time-Series Database)
-- InfluxDB/TimescaleDB schema for health check history
CREATE TABLE health_check_results (
time TIMESTAMPTZ NOT NULL,
server_id TEXT NOT NULL,
pool_id TEXT NOT NULL,
status TEXT NOT NULL, -- 'success', 'failure', 'timeout'
response_time_ms DOUBLE PRECISION,
status_code INTEGER,
error_message TEXT,
check_type TEXT -- 'http', 'tcp', 'grpc'
);
-- Hypertable for automatic time-based partitioning
SELECT create_hypertable('health_check_results', 'time');
-- Retention policy: raw data for 7 days, downsampled for 90 days
CREATE MATERIALIZED VIEW health_check_hourly
WITH (timescaledb.continuous) AS
SELECT
time_bucket('1 hour', time) AS bucket,
server_id,
pool_id,
COUNT(*) AS total_checks,
COUNT(*) FILTER (WHERE status = 'success') AS successful_checks,
AVG(response_time_ms) AS avg_response_time,
MAX(response_time_ms) AS max_response_time,
PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY response_time_ms) AS p99_response_time
FROM health_check_results
GROUP BY bucket, server_id, pool_id;Load Metrics and Statistics Tables
Real-Time Metrics (In-Memory Ring Buffers)
// Per-backend server metrics (updated on every request)
struct backend_metrics {
atomic_uint64_t active_connections;
atomic_uint64_t total_requests;
atomic_uint64_t total_bytes_in;
atomic_uint64_t total_bytes_out;
atomic_uint64_t error_count_4xx;
atomic_uint64_t error_count_5xx;
atomic_uint64_t connection_errors;
atomic_uint64_t timeout_errors;
// Ring buffer for response time tracking (last 1000 requests)
struct ring_buffer {
uint32_t values[1024]; // Response times in microseconds
uint32_t head;
uint32_t count;
} response_times;
// Per-second counters (circular buffer, last 60 seconds)
struct per_second_counter {
uint64_t requests[60];
uint64_t errors[60];
uint64_t bytes[60];
uint32_t current_slot;
} rate_counters;
};
// Global LB instance metrics
struct lb_instance_metrics {
atomic_uint64_t total_connections;
atomic_uint64_t active_connections;
atomic_uint64_t requests_per_second;
atomic_uint64_t bytes_per_second_in;
atomic_uint64_t bytes_per_second_out;
atomic_uint64_t ssl_handshakes_per_second;
atomic_uint64_t health_checks_per_second;
atomic_uint64_t dropped_connections;
atomic_uint64_t rejected_connections; // Rate limited
double cpu_utilization;
double memory_utilization;
uint64_t uptime_seconds;
};Aggregated Metrics (Persistent - Prometheus/InfluxDB)
# Prometheus metrics exposed by each LB instance
# Connection metrics
lb_connections_active{pool="api", backend="server-042"} 1523
lb_connections_total{pool="api", backend="server-042"} 5847291
lb_connections_failed{pool="api", backend="server-042", reason="timeout"} 42
# Request metrics
lb_requests_total{pool="api", method="GET", status="200"} 1000000
lb_requests_total{pool="api", method="POST", status="500"} 150
lb_request_duration_seconds_bucket{pool="api", le="0.001"} 500000
lb_request_duration_seconds_bucket{pool="api", le="0.01"} 900000
lb_request_duration_seconds_bucket{pool="api", le="0.1"} 990000
lb_request_duration_seconds_bucket{pool="api", le="1.0"} 999000
# Bandwidth metrics
lb_bytes_received_total{pool="api"} 1099511627776
lb_bytes_sent_total{pool="api"} 5497558138880
# Health check metrics
lb_health_check_status{pool="api", backend="server-042"} 1
lb_health_check_duration_seconds{pool="api", backend="server-042"} 0.005
# Rate limiting metrics
lb_rate_limited_requests_total{pool="api", rule="per_ip"} 5000
lb_rate_limited_requests_total{pool="api", rule="global"} 200Session Persistence / Sticky Session Storage
Cookie-Based Session Table
SessionEntry {
session_id: string[64] // Cookie value or hash
client_identifier: string // IP, cookie, or header value
backend_server_id: uint32 // Pinned backend server
pool_id: uint16 // Pool this session belongs to
created_at: uint64 // Creation timestamp
last_accessed: uint64 // Last request timestamp
expires_at: uint64 // Absolute expiry
request_count: uint32 // Requests in this session
flags: uint8 // is_secure, is_http_only
}
// Size: ~160 bytes per entry
// For 2M active sessions: ~320 MBSession Storage Options
Option 1: Local In-Memory (Default)
- Fastest access: O(1) hash lookup
- Problem: Session lost on LB instance failure
- Mitigation: Client retry goes to same backend via consistent hashing
Option 2: Distributed Cache (Redis Cluster)
- Access time: 0.5-1ms network round-trip
- Benefit: Survives LB instance failure
- Schema:
Key: "session:{pool_id}:{session_id}"
Value: "{backend_server_id}:{expires_at}:{metadata}"
TTL: session timeout (e.g., 3600 seconds)
Option 3: Consistent Hashing (Stateless)
- No storage needed
- Client IP/header hashed to backend
- Problem: Backend removal causes session redistribution
- Mitigation: Bounded load consistent hashing
Option 4: Backend-Generated Cookie
- LB inserts cookie with encoded backend ID
- No lookup needed on subsequent requests
- Cookie format: "SERVERID=s042.pool-api|{hmac_signature}"
- Verification: HMAC check (no storage, CPU only)Session Persistence Hash Ring
// Consistent hash ring for stateless session affinity
struct hash_ring {
uint32_t ring_size; // 2^32 virtual ring
uint32_t vnodes_per_server; // 150 virtual nodes per backend
struct vnode {
uint32_t hash_point; // Position on ring
uint32_t server_id; // Backend server
uint32_t weight; // Server weight
} *vnodes; // Sorted array of virtual nodes
uint32_t vnode_count; // Total virtual nodes
};
// Lookup: O(log n) binary search on sorted vnode array
// Memory: 150 vnodes x 10,000 servers x 12 bytes = 18 MBConfiguration Management
Routing Rules Schema
{
"rule_id": "route-api-v2-canary",
"priority": 100,
"match": {
"hosts": ["api.example.com", "*.api.example.com"],
"paths": ["/v2/*"],
"methods": ["GET", "POST"],
"headers": {
"X-Canary": ["true"],
"X-Api-Version": ["2.*"]
},
"query_params": {
"beta": ["1"]
},
"source_ips": ["10.0.0.0/8"]
},
"action": {
"type": "ROUTE",
"pool_id": "pool-api-v2-canary",
"weight": 10
},
"transforms": {
"request_headers": {
"add": {"X-Forwarded-Proto": "https", "X-Request-ID": "${uuid}"},
"remove": ["X-Internal-Debug"]
},
"response_headers": {
"add": {"Strict-Transport-Security": "max-age=31536000"},
"remove": ["Server", "X-Powered-By"]
},
"url_rewrite": {
"prefix_replace": {"/v2/": "/api/"}
}
},
"rate_limit": {
"requests_per_second": 1000,
"burst": 2000,
"key": "source_ip"
},
"enabled": true,
"created_at": "2024-01-15T10:00:00Z",
"version": 42
}ACL (Access Control List) Schema
{
"acl_id": "acl-block-bad-actors",
"priority": 1,
"action": "DENY",
"match": {
"source_ips": [
"192.168.1.0/24",
"10.99.0.0/16"
],
"source_countries": ["XX"],
"user_agents": ["BadBot/*", "ScrapingTool/*"],
"request_rate_exceeds": 100
},
"response": {
"status_code": 403,
"body": "{\"error\": \"Access denied\"}",
"headers": {"Content-Type": "application/json"}
},
"logging": {
"enabled": true,
"sample_rate": 1.0
},
"expires_at": "2024-06-01T00:00:00Z",
"created_at": "2024-01-20T08:00:00Z"
}SSL Certificate Storage
{
"cert_id": "cert-api-example-com",
"domains": ["api.example.com", "*.api.example.com"],
"certificate_pem": "-----BEGIN CERTIFICATE-----\n...",
"private_key_encrypted": "vault:secret/ssl/api-example-com#key",
"chain_pem": "-----BEGIN CERTIFICATE-----\n...",
"type": "WILDCARD",
"issuer": "Let's Encrypt",
"valid_from": "2024-01-01T00:00:00Z",
"valid_until": "2024-04-01T00:00:00Z",
"auto_renew": true,
"renewal_days_before_expiry": 30,
"ocsp_stapling": true,
"ct_logs": true,
"fingerprint_sha256": "AB:CD:EF:...",
"key_type": "ECDSA_P256",
"created_at": "2024-01-01T00:00:00Z",
"last_renewed": "2024-01-01T00:00:00Z"
}In-Memory Data Structures
Connection Lookup (Critical Path)
Data Structure: Open-addressing hash table with linear probing
- Why: Cache-friendly, no pointer chasing, predictable memory access
- Load factor: 0.7 (resize at 70% capacity)
- Hash function: xxHash64 on 5-tuple
- Collision resolution: Linear probing with Robin Hood hashing
- Lookup time: O(1) average, ~2 cache misses typical
Alternative for high-concurrency: Lock-free concurrent hash map
- Approach: Per-bucket CAS (Compare-And-Swap) operations
- Read path: Completely lock-free
- Write path: CAS retry loop (typically succeeds first attempt)Backend Selection (Per-Request)
Data Structure: Weighted selection array + atomic counters
Round Robin:
- Circular array of server IDs (weighted: server appears N times)
- Atomic counter for current position
- O(1) selection
Least Connections:
- Min-heap ordered by connection count
- O(log n) selection, O(log n) update
- Alternative: Approximate with Power of Two Choices (O(1))
Consistent Hashing:
- Sorted array of virtual nodes (binary search)
- O(log n) selection
- Jump consistent hash variant: O(1) with limitations
Power of Two Choices:
- Random selection of 2 backends, pick least loaded
- O(1) selection
- Near-optimal load distribution with minimal overheadRate Limiting (Per-Request Check)
Data Structure: Token bucket with sliding window
Per-IP rate limiter:
- Hash map: IP -> token_bucket
- Token bucket: {tokens: float, last_refill: timestamp, rate: float}
- Memory: 10M IPs x 24 bytes = 240 MB
- Cleanup: LRU eviction for inactive IPs
Global rate limiter:
- Sliding window counter (fixed window + proportional previous window)
- Single atomic counter per window
- Memory: negligible
Distributed rate limiting:
- Local approximate counting + periodic sync
- Redis-backed for exact global limits
- Trade-off: accuracy vs latencyIP Lookup Structures (ACL Matching)
Data Structure: Radix tree (Patricia trie) for CIDR matching
// Supports O(W) lookup where W = address width (32 for IPv4, 128 for IPv6)
// Memory efficient for sparse IP ranges
// Supports longest-prefix matching for CIDR blocks
For small ACLs (<1000 rules): Sorted array with binary search
For large ACLs (>10000 rules): Radix tree with bitmap compression
For country-based blocking: Pre-computed GeoIP database (MaxMind)
Memory usage:
100,000 CIDR rules x ~64 bytes per node = 6.4 MB
GeoIP database: ~50 MBData Consistency and Synchronization
Configuration Propagation
Mechanism: Watch-based propagation via etcd/Consul
1. Admin updates configuration via API
2. Configuration written to etcd with new version number
3. All LB instances watch etcd key prefix
4. On change notification, LB instances pull new config
5. Config validated locally before applying
6. Atomic swap of configuration pointer (no lock needed for readers)
7. Old config retained for rollback (last 5 versions)
Propagation time: <5 seconds (typical <1 second)
Consistency model: Eventual consistency (brief window of mixed configs)Health State Sharing
Approach: Gossip protocol for health state dissemination
1. Each LB instance performs health checks on assigned backends
2. Health state changes propagated via gossip (SWIM protocol)
3. Convergence time: O(log n) rounds, ~2 seconds for 100 instances
4. Consistency: Eventual (brief window where instances disagree)
5. Conflict resolution: Most recent timestamp wins
Alternative: Centralized health checker
- Dedicated health check service
- Publishes results to shared store (Redis/etcd)
- LB instances subscribe to health updates
- Pro: Reduces health check load on backends
- Con: Single point of failure (mitigated with HA)Data Retention and Cleanup
Retention Policies
Connection state: Evicted on connection close + TIME_WAIT (120s)
Session persistence: TTL-based (configurable, default 1 hour)
Rate limiting counters: LRU eviction, max 10M entries
Health check raw data: 7 days
Health check aggregated: 90 days
Access logs: 30 days (hot), 1 year (cold/S3)
Metrics: 15 seconds resolution for 24 hours
1 minute resolution for 7 days
5 minute resolution for 30 days
1 hour resolution for 1 year
Configuration history: Last 100 versions (indefinite)
SSL certificates: Until expiry + 30 daysThis database design prioritizes in-memory performance for the data plane (packet forwarding path) while using persistent storage for the control plane (configuration, metrics, and audit). The separation ensures that configuration changes and monitoring never impact forwarding performance.