Database Design

📖 11 min read 📄 Part 4 of 10

Load Balancer - Database Design

Overview

A load balancer's "database" is fundamentally different from traditional application databases. Most state is held in-memory for microsecond-level access, with persistent storage used only for configuration, health history, and metrics. The design prioritizes read speed and lock-free concurrent access over durability.

Connection Tracking Tables

Active Connection Table (In-Memory Hash Map)

ConnectionEntry {
  connection_id:       uint64          // Hash of 5-tuple (src_ip, src_port, dst_ip, dst_port, protocol)
  source_ip:           uint32          // Client IP (IPv4) or uint128 (IPv6)
  source_port:         uint16          // Client ephemeral port
  destination_ip:      uint32          // Selected backend IP
  destination_port:    uint16          // Backend service port
  protocol:            uint8           // TCP=6, UDP=17
  state:               enum {          // Connection lifecycle state
                         SYN_RECEIVED,
                         ESTABLISHED,
                         FIN_WAIT,
                         CLOSE_WAIT,
                         TIME_WAIT,
                         CLOSED
                       }
  created_at:          uint64          // Unix timestamp (nanoseconds)
  last_activity:       uint64          // Last packet timestamp
  idle_timeout:        uint32          // Seconds before timeout
  bytes_in:            uint64          // Bytes from client
  bytes_out:           uint64          // Bytes from backend
  packets_in:          uint64          // Packets from client
  packets_out:         uint64          // Packets from backend
  backend_server_id:   uint32          // Reference to backend server
  pool_id:             uint16          // Service pool identifier
  ssl_session_id:      bytes[32]       // TLS session for resumption
  http_request_count:  uint32          // Requests on this connection (keep-alive)
  flags:               uint16          // Bitfield: is_websocket, is_http2, is_ssl, etc.
}
// Size: ~180 bytes per entry
// Total for 10M connections: ~1.8 GB

Connection Lookup Data Structures

// Primary lookup: Hash table indexed by 5-tuple hash
// O(1) average lookup time, critical for packet forwarding path

struct connection_hash_table {
    uint32_t          bucket_count;     // 2^24 = 16M buckets
    uint32_t          entry_count;      // Current active connections
    connection_entry* buckets[];        // Array of bucket pointers
    rwlock_t          bucket_locks[];   // Per-bucket read-write locks
};

// Secondary index: Backend server -> connections (for drain operations)
// Used when removing a backend to find all affected connections

struct backend_connection_index {
    uint32_t          server_id;
    linked_list       connection_ids;   // All connections to this backend
    uint32_t          count;
};

// Timeout wheel: Hierarchical timing wheel for connection expiry
// Efficiently expires idle connections without scanning all entries

struct timing_wheel {
    uint32_t          tick_ms;          // 100ms per tick
    uint32_t          wheel_size;       // 4096 slots (covers ~7 minutes)
    linked_list       slots[4096];      // Connections expiring at each tick
    uint64_t          current_tick;
    struct timing_wheel* overflow;      // For longer timeouts
};

Connection Table Partitioning Strategy

Partitioning approach: Lock-free concurrent hash map with sharding

Shard count: 256 shards (power of 2 for fast modulo)
Shard assignment: connection_id % 256
Per-shard lock: Read-write spinlock (readers don't block each other)

Memory layout (cache-line aligned):
  Each shard: 64-byte aligned structure
  Each entry: 192 bytes (padded to cache line boundary)
  Avoids false sharing between CPU cores

Eviction policy:
  - TIME_WAIT entries: 120 seconds then evict
  - Idle connections: configurable timeout (default 300s)
  - Emergency eviction: LRU when table > 90% capacity

Server Pool Configuration Schema

Backend Server Registry (Persistent - etcd/Consul)

{
  "server_id": "backend-us-east-1a-api-0042",
  "pool_id": "pool-api-production",
  "hostname": "api-0042.us-east-1a.internal",
  "ip_address": "10.0.42.100",
  "port": 8080,
  "weight": 100,
  "max_connections": 10000,
  "max_requests_per_second": 5000,
  "status": "HEALTHY",
  "admin_status": "ENABLED",
  "metadata": {
    "availability_zone": "us-east-1a",
    "instance_type": "c5.2xlarge",
    "version": "v2.3.1",
    "canary": false,
    "warm_up_time_seconds": 30
  },
  "tls_config": {
    "enabled": true,
    "verify_certificate": true,
    "ca_certificate": "/certs/internal-ca.pem",
    "sni_hostname": "api.internal.example.com"
  },
  "created_at": "2024-01-15T10:30:00Z",
  "updated_at": "2024-01-20T14:22:00Z",
  "last_health_check": "2024-01-20T14:22:05Z"
}

Server Pool Configuration

{
  "pool_id": "pool-api-production",
  "name": "API Production Pool",
  "algorithm": "LEAST_CONNECTIONS",
  "algorithm_config": {
    "weighted": true,
    "slow_start_duration_seconds": 30,
    "power_of_two_choices": true
  },
  "health_check": {
    "protocol": "HTTP",
    "path": "/health",
    "port": 8080,
    "interval_seconds": 5,
    "timeout_seconds": 2,
    "unhealthy_threshold": 3,
    "healthy_threshold": 2,
    "expected_status_codes": [200],
    "expected_body_regex": "\"status\":\"ok\"",
    "tls_enabled": true
  },
  "session_persistence": {
    "enabled": true,
    "type": "COOKIE",
    "cookie_name": "SERVERID",
    "ttl_seconds": 3600
  },
  "connection_limits": {
    "max_connections_per_backend": 10000,
    "max_pending_requests": 1000,
    "connection_timeout_ms": 5000,
    "idle_timeout_seconds": 300,
    "max_request_time_seconds": 60
  },
  "circuit_breaker": {
    "enabled": true,
    "error_threshold_percent": 50,
    "evaluation_window_seconds": 30,
    "min_request_count": 100,
    "recovery_timeout_seconds": 60
  },
  "retry_policy": {
    "max_retries": 2,
    "retry_on": ["connection_error", "5xx", "reset"],
    "retry_budget_percent": 20
  },
  "servers": ["backend-us-east-1a-api-0042", "backend-us-east-1a-api-0043"],
  "created_at": "2024-01-10T08:00:00Z",
  "updated_at": "2024-01-20T12:00:00Z"
}

Health Check Results Storage

Health Check State (In-Memory)

HealthCheckState {
  server_id:              string
  current_status:         enum { HEALTHY, UNHEALTHY, DEGRADED, UNKNOWN }
  consecutive_successes:  uint32
  consecutive_failures:   uint32
  last_check_time:        timestamp
  last_success_time:      timestamp
  last_failure_time:      timestamp
  last_response_time_ms:  float32
  avg_response_time_ms:   float32    // Exponential moving average
  last_status_code:       uint16
  last_error:             string     // "connection_refused", "timeout", etc.
  flap_count:             uint32     // Status changes in last hour
  in_slow_start:          bool       // Recently became healthy
}

Health Check History (Time-Series Database)

-- InfluxDB/TimescaleDB schema for health check history
CREATE TABLE health_check_results (
    time              TIMESTAMPTZ NOT NULL,
    server_id         TEXT NOT NULL,
    pool_id           TEXT NOT NULL,
    status            TEXT NOT NULL,        -- 'success', 'failure', 'timeout'
    response_time_ms  DOUBLE PRECISION,
    status_code       INTEGER,
    error_message     TEXT,
    check_type        TEXT                  -- 'http', 'tcp', 'grpc'
);

-- Hypertable for automatic time-based partitioning
SELECT create_hypertable('health_check_results', 'time');

-- Retention policy: raw data for 7 days, downsampled for 90 days
CREATE MATERIALIZED VIEW health_check_hourly
WITH (timescaledb.continuous) AS
SELECT
    time_bucket('1 hour', time) AS bucket,
    server_id,
    pool_id,
    COUNT(*) AS total_checks,
    COUNT(*) FILTER (WHERE status = 'success') AS successful_checks,
    AVG(response_time_ms) AS avg_response_time,
    MAX(response_time_ms) AS max_response_time,
    PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY response_time_ms) AS p99_response_time
FROM health_check_results
GROUP BY bucket, server_id, pool_id;

Load Metrics and Statistics Tables

Real-Time Metrics (In-Memory Ring Buffers)

// Per-backend server metrics (updated on every request)
struct backend_metrics {
    atomic_uint64_t  active_connections;
    atomic_uint64_t  total_requests;
    atomic_uint64_t  total_bytes_in;
    atomic_uint64_t  total_bytes_out;
    atomic_uint64_t  error_count_4xx;
    atomic_uint64_t  error_count_5xx;
    atomic_uint64_t  connection_errors;
    atomic_uint64_t  timeout_errors;

    // Ring buffer for response time tracking (last 1000 requests)
    struct ring_buffer {
        uint32_t  values[1024];    // Response times in microseconds
        uint32_t  head;
        uint32_t  count;
    } response_times;

    // Per-second counters (circular buffer, last 60 seconds)
    struct per_second_counter {
        uint64_t  requests[60];
        uint64_t  errors[60];
        uint64_t  bytes[60];
        uint32_t  current_slot;
    } rate_counters;
};

// Global LB instance metrics
struct lb_instance_metrics {
    atomic_uint64_t  total_connections;
    atomic_uint64_t  active_connections;
    atomic_uint64_t  requests_per_second;
    atomic_uint64_t  bytes_per_second_in;
    atomic_uint64_t  bytes_per_second_out;
    atomic_uint64_t  ssl_handshakes_per_second;
    atomic_uint64_t  health_checks_per_second;
    atomic_uint64_t  dropped_connections;
    atomic_uint64_t  rejected_connections;    // Rate limited
    double           cpu_utilization;
    double           memory_utilization;
    uint64_t         uptime_seconds;
};

Aggregated Metrics (Persistent - Prometheus/InfluxDB)

# Prometheus metrics exposed by each LB instance

# Connection metrics
lb_connections_active{pool="api", backend="server-042"} 1523
lb_connections_total{pool="api", backend="server-042"} 5847291
lb_connections_failed{pool="api", backend="server-042", reason="timeout"} 42

# Request metrics
lb_requests_total{pool="api", method="GET", status="200"} 1000000
lb_requests_total{pool="api", method="POST", status="500"} 150
lb_request_duration_seconds_bucket{pool="api", le="0.001"} 500000
lb_request_duration_seconds_bucket{pool="api", le="0.01"} 900000
lb_request_duration_seconds_bucket{pool="api", le="0.1"} 990000
lb_request_duration_seconds_bucket{pool="api", le="1.0"} 999000

# Bandwidth metrics
lb_bytes_received_total{pool="api"} 1099511627776
lb_bytes_sent_total{pool="api"} 5497558138880

# Health check metrics
lb_health_check_status{pool="api", backend="server-042"} 1
lb_health_check_duration_seconds{pool="api", backend="server-042"} 0.005

# Rate limiting metrics
lb_rate_limited_requests_total{pool="api", rule="per_ip"} 5000
lb_rate_limited_requests_total{pool="api", rule="global"} 200

Session Persistence / Sticky Session Storage

Cookie-Based Session Table

SessionEntry {
  session_id:        string[64]      // Cookie value or hash
  client_identifier: string          // IP, cookie, or header value
  backend_server_id: uint32          // Pinned backend server
  pool_id:           uint16          // Pool this session belongs to
  created_at:        uint64          // Creation timestamp
  last_accessed:     uint64          // Last request timestamp
  expires_at:        uint64          // Absolute expiry
  request_count:     uint32          // Requests in this session
  flags:             uint8           // is_secure, is_http_only
}
// Size: ~160 bytes per entry
// For 2M active sessions: ~320 MB

Session Storage Options

Option 1: Local In-Memory (Default)
  - Fastest access: O(1) hash lookup
  - Problem: Session lost on LB instance failure
  - Mitigation: Client retry goes to same backend via consistent hashing

Option 2: Distributed Cache (Redis Cluster)
  - Access time: 0.5-1ms network round-trip
  - Benefit: Survives LB instance failure
  - Schema:
    Key: "session:{pool_id}:{session_id}"
    Value: "{backend_server_id}:{expires_at}:{metadata}"
    TTL: session timeout (e.g., 3600 seconds)

Option 3: Consistent Hashing (Stateless)
  - No storage needed
  - Client IP/header hashed to backend
  - Problem: Backend removal causes session redistribution
  - Mitigation: Bounded load consistent hashing

Option 4: Backend-Generated Cookie
  - LB inserts cookie with encoded backend ID
  - No lookup needed on subsequent requests
  - Cookie format: "SERVERID=s042.pool-api|{hmac_signature}"
  - Verification: HMAC check (no storage, CPU only)

Session Persistence Hash Ring

// Consistent hash ring for stateless session affinity
struct hash_ring {
    uint32_t          ring_size;        // 2^32 virtual ring
    uint32_t          vnodes_per_server; // 150 virtual nodes per backend
    struct vnode {
        uint32_t      hash_point;       // Position on ring
        uint32_t      server_id;        // Backend server
        uint32_t      weight;           // Server weight
    } *vnodes;                          // Sorted array of virtual nodes
    uint32_t          vnode_count;      // Total virtual nodes
};

// Lookup: O(log n) binary search on sorted vnode array
// Memory: 150 vnodes x 10,000 servers x 12 bytes = 18 MB

Configuration Management

Routing Rules Schema

{
  "rule_id": "route-api-v2-canary",
  "priority": 100,
  "match": {
    "hosts": ["api.example.com", "*.api.example.com"],
    "paths": ["/v2/*"],
    "methods": ["GET", "POST"],
    "headers": {
      "X-Canary": ["true"],
      "X-Api-Version": ["2.*"]
    },
    "query_params": {
      "beta": ["1"]
    },
    "source_ips": ["10.0.0.0/8"]
  },
  "action": {
    "type": "ROUTE",
    "pool_id": "pool-api-v2-canary",
    "weight": 10
  },
  "transforms": {
    "request_headers": {
      "add": {"X-Forwarded-Proto": "https", "X-Request-ID": "${uuid}"},
      "remove": ["X-Internal-Debug"]
    },
    "response_headers": {
      "add": {"Strict-Transport-Security": "max-age=31536000"},
      "remove": ["Server", "X-Powered-By"]
    },
    "url_rewrite": {
      "prefix_replace": {"/v2/": "/api/"}
    }
  },
  "rate_limit": {
    "requests_per_second": 1000,
    "burst": 2000,
    "key": "source_ip"
  },
  "enabled": true,
  "created_at": "2024-01-15T10:00:00Z",
  "version": 42
}

ACL (Access Control List) Schema

{
  "acl_id": "acl-block-bad-actors",
  "priority": 1,
  "action": "DENY",
  "match": {
    "source_ips": [
      "192.168.1.0/24",
      "10.99.0.0/16"
    ],
    "source_countries": ["XX"],
    "user_agents": ["BadBot/*", "ScrapingTool/*"],
    "request_rate_exceeds": 100
  },
  "response": {
    "status_code": 403,
    "body": "{\"error\": \"Access denied\"}",
    "headers": {"Content-Type": "application/json"}
  },
  "logging": {
    "enabled": true,
    "sample_rate": 1.0
  },
  "expires_at": "2024-06-01T00:00:00Z",
  "created_at": "2024-01-20T08:00:00Z"
}

SSL Certificate Storage

{
  "cert_id": "cert-api-example-com",
  "domains": ["api.example.com", "*.api.example.com"],
  "certificate_pem": "-----BEGIN CERTIFICATE-----\n...",
  "private_key_encrypted": "vault:secret/ssl/api-example-com#key",
  "chain_pem": "-----BEGIN CERTIFICATE-----\n...",
  "type": "WILDCARD",
  "issuer": "Let's Encrypt",
  "valid_from": "2024-01-01T00:00:00Z",
  "valid_until": "2024-04-01T00:00:00Z",
  "auto_renew": true,
  "renewal_days_before_expiry": 30,
  "ocsp_stapling": true,
  "ct_logs": true,
  "fingerprint_sha256": "AB:CD:EF:...",
  "key_type": "ECDSA_P256",
  "created_at": "2024-01-01T00:00:00Z",
  "last_renewed": "2024-01-01T00:00:00Z"
}

In-Memory Data Structures

Connection Lookup (Critical Path)

Data Structure: Open-addressing hash table with linear probing
  - Why: Cache-friendly, no pointer chasing, predictable memory access
  - Load factor: 0.7 (resize at 70% capacity)
  - Hash function: xxHash64 on 5-tuple
  - Collision resolution: Linear probing with Robin Hood hashing
  - Lookup time: O(1) average, ~2 cache misses typical

Alternative for high-concurrency: Lock-free concurrent hash map
  - Approach: Per-bucket CAS (Compare-And-Swap) operations
  - Read path: Completely lock-free
  - Write path: CAS retry loop (typically succeeds first attempt)

Backend Selection (Per-Request)

Data Structure: Weighted selection array + atomic counters

Round Robin:
  - Circular array of server IDs (weighted: server appears N times)
  - Atomic counter for current position
  - O(1) selection

Least Connections:
  - Min-heap ordered by connection count
  - O(log n) selection, O(log n) update
  - Alternative: Approximate with Power of Two Choices (O(1))

Consistent Hashing:
  - Sorted array of virtual nodes (binary search)
  - O(log n) selection
  - Jump consistent hash variant: O(1) with limitations

Power of Two Choices:
  - Random selection of 2 backends, pick least loaded
  - O(1) selection
  - Near-optimal load distribution with minimal overhead

Rate Limiting (Per-Request Check)

Data Structure: Token bucket with sliding window

Per-IP rate limiter:
  - Hash map: IP -> token_bucket
  - Token bucket: {tokens: float, last_refill: timestamp, rate: float}
  - Memory: 10M IPs x 24 bytes = 240 MB
  - Cleanup: LRU eviction for inactive IPs

Global rate limiter:
  - Sliding window counter (fixed window + proportional previous window)
  - Single atomic counter per window
  - Memory: negligible

Distributed rate limiting:
  - Local approximate counting + periodic sync
  - Redis-backed for exact global limits
  - Trade-off: accuracy vs latency

IP Lookup Structures (ACL Matching)

Data Structure: Radix tree (Patricia trie) for CIDR matching

// Supports O(W) lookup where W = address width (32 for IPv4, 128 for IPv6)
// Memory efficient for sparse IP ranges
// Supports longest-prefix matching for CIDR blocks

For small ACLs (<1000 rules): Sorted array with binary search
For large ACLs (>10000 rules): Radix tree with bitmap compression
For country-based blocking: Pre-computed GeoIP database (MaxMind)

Memory usage:
  100,000 CIDR rules x ~64 bytes per node = 6.4 MB
  GeoIP database: ~50 MB

Data Consistency and Synchronization

Configuration Propagation

Mechanism: Watch-based propagation via etcd/Consul

1. Admin updates configuration via API
2. Configuration written to etcd with new version number
3. All LB instances watch etcd key prefix
4. On change notification, LB instances pull new config
5. Config validated locally before applying
6. Atomic swap of configuration pointer (no lock needed for readers)
7. Old config retained for rollback (last 5 versions)

Propagation time: <5 seconds (typical <1 second)
Consistency model: Eventual consistency (brief window of mixed configs)

Health State Sharing

Approach: Gossip protocol for health state dissemination

1. Each LB instance performs health checks on assigned backends
2. Health state changes propagated via gossip (SWIM protocol)
3. Convergence time: O(log n) rounds, ~2 seconds for 100 instances
4. Consistency: Eventual (brief window where instances disagree)
5. Conflict resolution: Most recent timestamp wins

Alternative: Centralized health checker
  - Dedicated health check service
  - Publishes results to shared store (Redis/etcd)
  - LB instances subscribe to health updates
  - Pro: Reduces health check load on backends
  - Con: Single point of failure (mitigated with HA)

Data Retention and Cleanup

Retention Policies

Connection state:        Evicted on connection close + TIME_WAIT (120s)
Session persistence:     TTL-based (configurable, default 1 hour)
Rate limiting counters:  LRU eviction, max 10M entries
Health check raw data:   7 days
Health check aggregated: 90 days
Access logs:             30 days (hot), 1 year (cold/S3)
Metrics:                 15 seconds resolution for 24 hours
                         1 minute resolution for 7 days
                         5 minute resolution for 30 days
                         1 hour resolution for 1 year
Configuration history:   Last 100 versions (indefinite)
SSL certificates:        Until expiry + 30 days

This database design prioritizes in-memory performance for the data plane (packet forwarding path) while using persistent storage for the control plane (configuration, metrics, and audit). The separation ensures that configuration changes and monitoring never impact forwarding performance.