Database Design

📖 9 min read 📄 Part 4 of 10

CDN Network - Database Design

Content Metadata Schema

Cache Entry (Per-Object Metadata)

-- Stored in-memory and on local SSD at each edge server
-- Optimized for fast lookup by URL hash

CacheEntry {
    url_hash:          BYTES(32)       -- SHA-256 of normalized URL (primary key)
    original_url:      STRING(4096)    -- Full URL including query params
    content_hash:      BYTES(32)       -- SHA-256 of response body (for dedup)
    content_type:      STRING(128)     -- MIME type (e.g., "image/webp")
    content_encoding:  STRING(32)      -- Compression (gzip, br, identity)
    content_length:    INT64           -- Size in bytes

    -- Cache control
    ttl_seconds:       INT32           -- Time-to-live from origin
    created_at:        TIMESTAMP       -- When cached at this edge
    expires_at:        TIMESTAMP       -- Absolute expiration time
    last_modified:     TIMESTAMP       -- Origin Last-Modified header
    etag:              STRING(256)     -- Origin ETag for conditional requests

    -- Vary handling
    vary_headers:      STRING(512)     -- Vary header value
    vary_key:          BYTES(32)       -- Hash of varied header values

    -- Storage location
    disk_path:         STRING(512)     -- Path on local SSD/HDD
    disk_offset:       INT64           -- Offset within storage file
    memory_cached:     BOOL            -- Currently in RAM cache

    -- Access patterns
    hit_count:         INT64           -- Total hits since cached
    last_accessed:     TIMESTAMP       -- For LRU eviction
    access_frequency:  FLOAT32         -- Exponentially weighted moving avg

    -- Response headers to preserve
    response_headers:  MAP<STRING, STRING>
    status_code:       INT16           -- Original response status

    -- Metadata
    origin_id:         STRING(64)      -- Which origin served this
    distribution_id:   STRING(64)      -- Customer distribution identifier
    is_negative_cache: BOOL            -- Cached 404/error response
}

Variant Storage (for Vary header support)

CacheVariant {
    url_hash:       BYTES(32)          -- Same URL
    variant_key:    BYTES(32)          -- Hash of vary header values
    cache_entry_id: BYTES(32)          -- Points to specific CacheEntry
    vary_values:    MAP<STRING, STRING> -- Actual header values
}

Cache Index Data Structures

Primary Index: Hash Table (In-Memory)

Architecture: Open-addressing hash table with linear probing
- Key: SHA-256(normalized_url + variant_key) = 32 bytes
- Value: Pointer to metadata + disk location = 24 bytes
- Load factor: 0.7 (resize at 70% capacity)
- Total entries per server: 500M - 2B objects
- Memory usage: ~30-100 GB per server

Lookup complexity: O(1) average, O(n) worst case
Insert complexity: O(1) amortized

LRU Eviction Structure

Architecture: Segmented LRU (SLRU) with two segments

Probationary Segment (20% of cache):
- New objects enter here
- Evicted first when space needed
- Promoted to protected on second access

Protected Segment (80% of cache):
- Frequently accessed objects
- Evicted only when probationary is empty
- Demoted to probationary tail on eviction

Implementation: Doubly-linked list with hash map for O(1) ops
- Move to head on access: O(1)
- Evict from tail: O(1)
- Lookup by key: O(1) via hash map

Bloom Filters (Negative Lookup Optimization)

Purpose: Quickly determine if an object is NOT in cache
- Avoids expensive disk lookups for uncached objects
- False positive rate: 1% (tunable)
- Size: ~10 bits per object = 1.2 GB for 1 billion objects
- Hash functions: 7 (optimal for 1% FPR)

Usage flow:
1. Request arrives - check bloom filter
2. If "no" - guaranteed miss - fetch from origin
3. If "maybe" - check hash table - may still be miss

Consistent Hash Ring (Intra-PoP Distribution)

Purpose: Distribute objects across servers within a PoP
- Virtual nodes: 150-200 per physical server
- Hash function: xxHash64 for speed
- Rebalancing on server add/remove: only 1/N objects move
- Weighted nodes: more capacity = more virtual nodes

Ring structure:
- Sorted array of (hash_value, server_id) pairs
- Binary search for lookup: O(log N)
- N = num_virtual_nodes (typically 10,000-50,000 per PoP)

Origin Configuration

Origin Server Registry

CREATE TABLE origins (
    origin_id           UUID PRIMARY KEY,
    distribution_id     UUID NOT NULL,
    hostname            VARCHAR(253) NOT NULL,
    ip_addresses        INET[],
    protocol            ENUM('HTTP', 'HTTPS', 'HTTP2'),
    port                INT DEFAULT 443,

    -- Connection settings
    connect_timeout_ms  INT DEFAULT 5000,
    read_timeout_ms     INT DEFAULT 30000,
    keepalive_timeout   INT DEFAULT 60000,
    max_connections     INT DEFAULT 1024,

    -- TLS to origin
    origin_ssl_protocols    VARCHAR(64) DEFAULT 'TLSv1.2,TLSv1.3',
    origin_ssl_verify       BOOL DEFAULT TRUE,
    custom_ca_cert          TEXT,
    client_cert_id          UUID,

    -- Custom headers
    custom_origin_headers   JSONB,
    host_header_override    VARCHAR(253),

    -- Failover
    failover_origin_id      UUID REFERENCES origins(origin_id),
    failover_criteria       JSONB,

    -- Health
    health_status           ENUM('HEALTHY', 'DEGRADED', 'UNHEALTHY'),
    last_health_check       TIMESTAMP,
    consecutive_failures    INT DEFAULT 0,

    created_at              TIMESTAMP DEFAULT NOW(),
    updated_at              TIMESTAMP DEFAULT NOW()
);

CREATE TABLE origin_health_checks (
    check_id            UUID PRIMARY KEY,
    origin_id           UUID REFERENCES origins(origin_id),
    check_type          ENUM('HTTP', 'HTTPS', 'TCP'),
    path                VARCHAR(2048) DEFAULT '/health',
    interval_seconds    INT DEFAULT 30,
    timeout_ms          INT DEFAULT 5000,
    healthy_threshold   INT DEFAULT 3,
    unhealthy_threshold INT DEFAULT 2,
    expected_status     INT[] DEFAULT '{200}',
    expected_body       VARCHAR(1024),
    enabled             BOOL DEFAULT TRUE
);

Origin Group (Failover Configuration)

CREATE TABLE origin_groups (
    group_id            UUID PRIMARY KEY,
    distribution_id     UUID NOT NULL,
    name                VARCHAR(128),
    primary_origin_id   UUID REFERENCES origins(origin_id),
    failover_origins    UUID[],
    failover_codes      INT[] DEFAULT '{500,502,503,504}',
    failover_timeout_ms INT DEFAULT 10000
);

SSL Certificate Storage

CREATE TABLE ssl_certificates (
    cert_id             UUID PRIMARY KEY,
    distribution_id     UUID NOT NULL,
    domain_names        VARCHAR(253)[],

    -- Certificate data (encrypted at rest)
    certificate_pem     TEXT NOT NULL,
    private_key_pem     TEXT NOT NULL,
    ca_bundle_pem       TEXT,

    -- Metadata
    issuer              VARCHAR(256),
    serial_number       VARCHAR(128),
    fingerprint_sha256  CHAR(64),
    key_algorithm       ENUM('RSA-2048','RSA-4096','ECDSA-P256','ECDSA-P384'),

    -- Validity
    not_before          TIMESTAMP NOT NULL,
    not_after           TIMESTAMP NOT NULL,
    auto_renew          BOOL DEFAULT TRUE,
    renewal_days_before INT DEFAULT 30,

    -- Deployment status
    status              ENUM('PENDING_VALIDATION','ISSUED','DEPLOYED','EXPIRED','REVOKED'),
    deployed_edges      INT DEFAULT 0,
    last_deployed_at    TIMESTAMP,

    -- OCSP
    ocsp_response       BYTEA,
    ocsp_expires_at     TIMESTAMP,

    created_at          TIMESTAMP DEFAULT NOW(),
    updated_at          TIMESTAMP DEFAULT NOW()
);

CREATE INDEX idx_cert_domains ON ssl_certificates USING GIN (domain_names);
CREATE INDEX idx_cert_expiry ON ssl_certificates (not_after) WHERE status = 'DEPLOYED';

Access Logs Schema

Real-Time Log Entry

-- Volume: 10-50 million log entries/second globally

AccessLog {
    timestamp:          TIMESTAMP(microseconds)
    edge_id:            STRING(32)
    server_id:          STRING(32)

    -- Client info
    client_ip:          BYTES(16)
    client_port:        UINT16
    client_country:     CHAR(2)
    client_asn:         UINT32
    user_agent:         STRING(512)

    -- Request
    method:             ENUM(GET, HEAD, POST, PUT, DELETE, OPTIONS)
    host:               STRING(253)
    path:               STRING(4096)
    query_string:       STRING(4096)
    protocol:           ENUM(HTTP/1.1, HTTP/2, HTTP/3)
    request_bytes:      INT64

    -- Response
    status_code:        UINT16
    response_bytes:     INT64
    content_type:       STRING(128)

    -- Cache
    cache_status:       ENUM(HIT, MISS, EXPIRED, STALE, BYPASS, ERROR)
    cache_key:          STRING(4096)
    time_to_first_byte: FLOAT32

    -- Timing
    total_time_ms:      FLOAT32
    origin_time_ms:     FLOAT32
    tls_time_ms:        FLOAT32

    -- Connection
    connection_reused:  BOOL
    tls_version:        STRING(8)
    tls_cipher:         STRING(64)

    -- Distribution
    distribution_id:    STRING(64)
    origin_id:          STRING(64)

    -- Security
    waf_action:         ENUM(ALLOW, BLOCK, COUNT, CHALLENGE)
    waf_rule_id:        STRING(64)
    bot_score:          FLOAT32
}

Log Storage Strategy

Hot tier (0-24 hours):   Kafka/Kinesis for real-time analytics
Warm tier (1-30 days):   ClickHouse/Druid for ad-hoc queries
Cold tier (30-365 days): S3/GCS in Parquet for compliance
Archive (1+ years):      Glacier/Archive tier for legal hold

Compression: 10-20x with columnar + zstd
Partitioning: By distribution_id, date, edge_region

Analytics Aggregation Tables

Per-Minute Aggregation

CREATE TABLE analytics_1min (
    distribution_id     UUID,
    edge_region         VARCHAR(32),
    timestamp_minute    TIMESTAMP,

    -- Request metrics
    total_requests      BIGINT,
    cache_hits          BIGINT,
    cache_misses        BIGINT,
    status_2xx          BIGINT,
    status_3xx          BIGINT,
    status_4xx          BIGINT,
    status_5xx          BIGINT,

    -- Bandwidth
    bytes_sent          BIGINT,
    bytes_received      BIGINT,

    -- Latency percentiles
    latency_p50_ms      FLOAT,
    latency_p95_ms      FLOAT,
    latency_p99_ms      FLOAT,
    latency_avg_ms      FLOAT,

    -- Error tracking
    origin_errors       BIGINT,
    tls_errors          BIGINT,
    timeout_errors      BIGINT,

    PRIMARY KEY (distribution_id, edge_region, timestamp_minute)
);

-- Roll up to hourly, daily, monthly
CREATE TABLE analytics_1hour  (LIKE analytics_1min);
CREATE TABLE analytics_1day   (LIKE analytics_1min);
CREATE TABLE analytics_1month (LIKE analytics_1min);

Top-N Tables

CREATE TABLE top_urls_hourly (
    distribution_id     UUID,
    hour                TIMESTAMP,
    url_path            VARCHAR(4096),
    request_count       BIGINT,
    bytes_transferred   BIGINT,
    cache_hit_ratio     FLOAT,
    avg_latency_ms      FLOAT,
    PRIMARY KEY (distribution_id, hour, url_path)
);

CREATE TABLE top_referrers_hourly (
    distribution_id     UUID,
    hour                TIMESTAMP,
    referrer            VARCHAR(4096),
    request_count       BIGINT,
    unique_visitors     BIGINT
);

Configuration Management

Distribution Configuration

CREATE TABLE distributions (
    distribution_id     UUID PRIMARY KEY,
    account_id          UUID NOT NULL,
    domain_names        VARCHAR(253)[],
    cdn_domain          VARCHAR(253) UNIQUE,

    -- Default behavior
    default_origin_id   UUID REFERENCES origins(origin_id),
    default_ttl         INT DEFAULT 86400,

    -- Features
    http2_enabled       BOOL DEFAULT TRUE,
    http3_enabled       BOOL DEFAULT TRUE,
    ipv6_enabled        BOOL DEFAULT TRUE,
    compression_enabled BOOL DEFAULT TRUE,

    -- Access
    price_class         ENUM('ALL', 'NA_EU', 'NA_EU_ASIA'),
    geo_restrictions    JSONB,

    -- Status
    status              ENUM('DEPLOYING','ACTIVE','DISABLED','DELETING'),
    last_deployed_at    TIMESTAMP,
    config_version      INT DEFAULT 1,

    created_at          TIMESTAMP DEFAULT NOW(),
    updated_at          TIMESTAMP DEFAULT NOW()
);

Cache Behavior Rules

CREATE TABLE cache_behaviors (
    behavior_id         UUID PRIMARY KEY,
    distribution_id     UUID REFERENCES distributions(distribution_id),
    priority            INT NOT NULL,
    path_pattern        VARCHAR(2048) NOT NULL,

    -- Origin
    origin_id           UUID REFERENCES origins(origin_id),
    origin_group_id     UUID REFERENCES origin_groups(group_id),

    -- Cache settings
    min_ttl             INT DEFAULT 0,
    max_ttl             INT DEFAULT 31536000,
    default_ttl         INT DEFAULT 86400,

    -- Query string handling
    query_string_behavior ENUM('NONE','WHITELIST','ALL','ALL_EXCEPT'),
    query_string_keys   VARCHAR(128)[],

    -- Headers
    forward_headers     ENUM('NONE', 'WHITELIST', 'ALL'),
    whitelisted_headers VARCHAR(128)[],

    -- Cookies
    forward_cookies     ENUM('NONE', 'WHITELIST', 'ALL'),
    whitelisted_cookies VARCHAR(128)[],

    -- Protocol
    viewer_protocol     ENUM('HTTP_HTTPS','REDIRECT_HTTPS','HTTPS_ONLY'),
    allowed_methods     VARCHAR(16)[] DEFAULT '{GET,HEAD}',
    compress            BOOL DEFAULT TRUE,

    -- Edge functions
    edge_function_associations JSONB
);

Purge Queue

CREATE TABLE purge_requests (
    purge_id            UUID PRIMARY KEY,
    distribution_id     UUID NOT NULL,
    requested_at        TIMESTAMP DEFAULT NOW(),
    requested_by        UUID,

    -- What to purge
    purge_type          ENUM('URL', 'PREFIX', 'WILDCARD', 'ALL'),
    paths               VARCHAR(4096)[],

    -- Status tracking
    status              ENUM('PENDING','IN_PROGRESS','COMPLETED','FAILED'),
    edges_total         INT,
    edges_completed     INT DEFAULT 0,
    completed_at        TIMESTAMP,
    error_message       TEXT,

    caller_quota_used   INT
);

CREATE INDEX idx_purge_status
    ON purge_requests (distribution_id, status, requested_at DESC);

DNS Zone Records and Routing Policies

DNS Record Storage

CREATE TABLE dns_zones (
    zone_id             UUID PRIMARY KEY,
    domain_name         VARCHAR(253) NOT NULL,
    distribution_id     UUID REFERENCES distributions(distribution_id),
    primary_ns          VARCHAR(253),
    admin_email         VARCHAR(253),
    serial_number       BIGINT,
    refresh_seconds     INT DEFAULT 3600,
    retry_seconds       INT DEFAULT 900,
    expire_seconds      INT DEFAULT 604800,
    min_ttl             INT DEFAULT 60
);

CREATE TABLE dns_records (
    record_id           UUID PRIMARY KEY,
    zone_id             UUID REFERENCES dns_zones(zone_id),
    name                VARCHAR(253) NOT NULL,
    type                ENUM('A','AAAA','CNAME','NS','TXT','MX','SRV'),
    ttl                 INT DEFAULT 60,

    -- Record value
    value               VARCHAR(4096),
    priority            INT,

    -- Routing policy
    routing_policy      ENUM('SIMPLE','WEIGHTED','LATENCY','GEOLOCATION','FAILOVER','MULTIVALUE'),
    weight              INT,
    region              VARCHAR(32),
    geo_location        VARCHAR(64),
    health_check_id     UUID,
    failover_type       ENUM('PRIMARY', 'SECONDARY'),
    set_identifier      VARCHAR(128),

    -- Anycast
    anycast_group_id    UUID,

    enabled             BOOL DEFAULT TRUE,
    created_at          TIMESTAMP DEFAULT NOW()
);

CREATE INDEX idx_dns_lookup ON dns_records (zone_id, name, type)
    WHERE enabled = TRUE;

Routing Policy Engine

GeoDNS Resolution Flow:
1. Client DNS query arrives at authoritative nameserver
2. Extract client IP (or EDNS Client Subnet if available)
3. Map IP to geographic location (MaxMind GeoIP database)
4. Query routing policies for matching geo rules
5. Apply health check status (exclude unhealthy endpoints)
6. Return closest healthy edge PoP IP(s)
7. TTL: 30-60 seconds (balance freshness vs DNS load)

Anycast Resolution Flow:
1. All edge PoPs announce same IP prefix via BGP
2. Internet routing naturally directs to nearest PoP
3. No DNS-level geographic routing needed
4. Failover: withdraw BGP announcement from unhealthy PoP

Data Store Technology Choices

Data Type Technology Rationale
Cache index Custom in-memory hash table Ultra-low latency; no serialization
Cache content Local SSD/HDD custom filesystem Direct I/O, no FS overhead
Configuration etcd / Consul Distributed consensus, watches
Certificates HashiCorp Vault / AWS ACM Encrypted, access-controlled
Access logs Kafka then ClickHouse then S3 Throughput, queries, cheap archive
Analytics ClickHouse / Apache Druid Real-time OLAP, pre-aggregation
DNS records Custom in-memory store Sub-millisecond lookups required
Purge queue Redis Streams / Kafka Ordered, distributed delivery
Customer config PostgreSQL + Redis cache ACID for writes, fast reads