Database Design

📖 11 min read 📄 Part 4 of 10

Resource Allocation Service - Database Design

Overview

The resource allocation service requires a multi-tier storage architecture that balances strong consistency for critical allocation decisions with high throughput for status updates and historical analytics. This design draws from patterns used in Kubernetes (etcd for state), Borg (Paxos-based cell state), and YARN (ZooKeeper + LevelDB).


Storage Architecture Overview

Storage Tiers:

Hot Path (In-Memory) | Warm Path (Consistent)    | Cold Path (Analytical)
---------------------+---------------------------+------------------------
Priority Queues      | etcd / CockroachDB        | PostgreSQL / ClickHouse
Resource Bitmaps     | Redis Cluster             | Object Storage (S3)
Scoring Cache        |                           | Data Lake (Parquet)
Node State Cache     |                           |
---------------------+---------------------------+------------------------
Latency: <1ms        | Latency: 1-10ms           | Latency: 100ms-seconds
Consistency: N/A     | Consistency: Strong       | Consistency: Eventual
Durability: None     | Durability: Yes           | Durability: Yes

Resource Inventory Schema

Node Registry (etcd + PostgreSQL)

-- Primary node inventory (PostgreSQL for queryability, synced from etcd)
CREATE TABLE nodes (
    node_id             UUID PRIMARY KEY,
    hostname            VARCHAR(255) NOT NULL UNIQUE,
    cluster_id          VARCHAR(100) NOT NULL,
    zone                VARCHAR(50) NOT NULL,
    region              VARCHAR(50) NOT NULL,
    rack_id             VARCHAR(100),

    -- Capacity (total physical resources)
    cpu_total_millicores    BIGINT NOT NULL,
    memory_total_bytes      BIGINT NOT NULL,
    gpu_total               INTEGER DEFAULT 0,
    gpu_type                VARCHAR(50),
    fpga_total              INTEGER DEFAULT 0,
    ephemeral_storage_bytes BIGINT NOT NULL,
    max_pods                INTEGER DEFAULT 110,

    -- Allocatable (capacity minus system reserved)
    cpu_allocatable_millicores  BIGINT NOT NULL,
    memory_allocatable_bytes    BIGINT NOT NULL,
    gpu_allocatable             INTEGER DEFAULT 0,

    -- Current availability (updated frequently)
    cpu_available_millicores    BIGINT NOT NULL,
    memory_available_bytes      BIGINT NOT NULL,
    gpu_available               INTEGER DEFAULT 0,
    pods_running                INTEGER DEFAULT 0,

    -- Node metadata
    labels              JSONB DEFAULT '{}',
    taints              JSONB DEFAULT '[]',
    annotations         JSONB DEFAULT '{}',
    conditions          JSONB DEFAULT '[]',

    -- Lifecycle
    status              VARCHAR(20) DEFAULT 'Ready',
    registered_at       TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW(),
    last_heartbeat_at   TIMESTAMP WITH TIME ZONE,
    unschedulable       BOOLEAN DEFAULT FALSE,

    -- Versioning for optimistic concurrency
    resource_version    BIGINT NOT NULL DEFAULT 1,
    updated_at          TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_nodes_status ON nodes(status) WHERE status = 'Ready';
CREATE INDEX idx_nodes_zone ON nodes(zone, status);
CREATE INDEX idx_nodes_gpu ON nodes(gpu_type, gpu_available) WHERE gpu_total > 0;
CREATE INDEX idx_nodes_labels ON nodes USING GIN(labels);
CREATE INDEX idx_nodes_available_cpu ON nodes(cpu_available_millicores) WHERE status = 'Ready';
CREATE INDEX idx_nodes_cluster ON nodes(cluster_id, status);

-- Extended resources (custom/plugin resources)
CREATE TABLE node_extended_resources (
    node_id         UUID REFERENCES nodes(node_id) ON DELETE CASCADE,
    resource_name   VARCHAR(255) NOT NULL,
    capacity        BIGINT NOT NULL,
    allocatable     BIGINT NOT NULL,
    available       BIGINT NOT NULL,
    PRIMARY KEY (node_id, resource_name)
);

etcd Key Structure for Real-Time State

# Node state (authoritative, real-time)
/registry/nodes/{node_id}/spec       -> NodeSpec (capacity, labels, taints)
/registry/nodes/{node_id}/status     -> NodeStatus (conditions, available resources)
/registry/nodes/{node_id}/lease      -> Lease (heartbeat TTL)

# Example etcd value for node status:
{
  "nodeId": "node-abc123",
  "allocatable": {
    "cpu": "63500m",
    "memory": "253Gi",
    "nvidia.com/gpu": "8",
    "pods": "110"
  },
  "available": {
    "cpu": "24000m",
    "memory": "128Gi",
    "nvidia.com/gpu": "3",
    "pods": "67"
  },
  "conditions": [
    {"type": "Ready", "status": "True", "lastHeartbeat": "2024-01-15T10:30:00Z"},
    {"type": "MemoryPressure", "status": "False"},
    {"type": "DiskPressure", "status": "False"}
  ],
  "resourceVersion": 48291
}

Allocation Records

Active Allocations Table

CREATE TABLE allocations (
    allocation_id       UUID PRIMARY KEY,
    tenant_id           VARCHAR(100) NOT NULL,
    namespace           VARCHAR(255) NOT NULL,
    job_name            VARCHAR(255),

    -- Resource specification
    resource_type       VARCHAR(50) NOT NULL,
    cpu_request_millicores  BIGINT,
    cpu_limit_millicores    BIGINT,
    memory_request_bytes    BIGINT,
    memory_limit_bytes      BIGINT,
    gpu_request             INTEGER DEFAULT 0,
    storage_request_bytes   BIGINT DEFAULT 0,
    extended_resources      JSONB DEFAULT '{}',

    -- Scheduling metadata
    priority            INTEGER NOT NULL DEFAULT 0,
    priority_class      VARCHAR(50),
    preemption_policy   VARCHAR(20) DEFAULT 'PreemptLowerPriority',
    qos_class           VARCHAR(20),

    -- Placement
    node_id             UUID REFERENCES nodes(node_id),
    zone                VARCHAR(50),

    -- Constraints (stored for rescheduling)
    node_selector       JSONB DEFAULT '{}',
    affinity_rules      JSONB DEFAULT '{}',
    tolerations         JSONB DEFAULT '[]',
    topology_constraints JSONB DEFAULT '[]',

    -- Lifecycle
    state               VARCHAR(20) NOT NULL DEFAULT 'Pending',
    created_at          TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW(),
    scheduled_at        TIMESTAMP WITH TIME ZONE,
    started_at          TIMESTAMP WITH TIME ZONE,
    completed_at        TIMESTAMP WITH TIME ZONE,

    -- TTL and expiration
    ttl_seconds         INTEGER,
    expires_at          TIMESTAMP WITH TIME ZONE,

    -- Versioning
    resource_version    BIGINT NOT NULL DEFAULT 1,
    updated_at          TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW()
);

-- Performance-critical indexes
CREATE INDEX idx_alloc_tenant ON allocations(tenant_id, state);
CREATE INDEX idx_alloc_node ON allocations(node_id, state) WHERE state = 'Running';
CREATE INDEX idx_alloc_pending ON allocations(priority DESC, created_at ASC)
    WHERE state = 'Pending';
CREATE INDEX idx_alloc_state ON allocations(state, updated_at);
CREATE INDEX idx_alloc_expiry ON allocations(expires_at)
    WHERE expires_at IS NOT NULL AND state = 'Running';
CREATE INDEX idx_alloc_namespace ON allocations(tenant_id, namespace, state);

Quota Management Tables

-- Hierarchical quota definition
CREATE TABLE resource_quotas (
    quota_id            UUID PRIMARY KEY,
    tenant_id           VARCHAR(100) NOT NULL,
    namespace           VARCHAR(255),
    team_id             VARCHAR(100),

    resource_type       VARCHAR(100) NOT NULL,

    -- Limits
    hard_limit          BIGINT NOT NULL,
    soft_limit          BIGINT,
    burst_limit         BIGINT,
    burst_duration_sec  INTEGER DEFAULT 300,

    -- Current usage (denormalized for fast checks)
    current_usage       BIGINT NOT NULL DEFAULT 0,
    peak_usage          BIGINT NOT NULL DEFAULT 0,
    peak_usage_at       TIMESTAMP WITH TIME ZONE,

    -- Reserved (committed but not yet running)
    reserved            BIGINT NOT NULL DEFAULT 0,

    -- Scheduling
    priority_weight     FLOAT DEFAULT 1.0,
    preemptible         BOOLEAN DEFAULT FALSE,

    -- Metadata
    created_at          TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW(),
    updated_at          TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW(),
    created_by          VARCHAR(255),

    UNIQUE(tenant_id, namespace, resource_type)
);

CREATE INDEX idx_quota_tenant ON resource_quotas(tenant_id);
CREATE INDEX idx_quota_namespace ON resource_quotas(tenant_id, namespace);
CREATE INDEX idx_quota_usage ON resource_quotas(resource_type, current_usage);

-- Quota usage history (for trending and capacity planning)
CREATE TABLE quota_usage_history (
    id                  BIGSERIAL PRIMARY KEY,
    tenant_id           VARCHAR(100) NOT NULL,
    namespace           VARCHAR(255),
    resource_type       VARCHAR(100) NOT NULL,
    usage_value         BIGINT NOT NULL,
    hard_limit          BIGINT NOT NULL,
    utilization_pct     FLOAT NOT NULL,
    recorded_at         TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_quota_history_time ON quota_usage_history(tenant_id, recorded_at DESC);

Scheduling Queue Schema

CREATE TABLE scheduling_queue (
    job_id              UUID PRIMARY KEY,
    allocation_id       UUID REFERENCES allocations(allocation_id),
    tenant_id           VARCHAR(100) NOT NULL,
    namespace           VARCHAR(255) NOT NULL,

    -- Priority and ordering
    priority            INTEGER NOT NULL,
    priority_class      VARCHAR(50),
    queue_position      BIGSERIAL,

    -- Resource requirements (denormalized for fast filtering)
    cpu_request         BIGINT NOT NULL,
    memory_request      BIGINT NOT NULL,
    gpu_request         INTEGER DEFAULT 0,

    -- Constraints summary
    requires_gpu        BOOLEAN DEFAULT FALSE,
    requires_zone       VARCHAR(50),
    node_selector_hash  VARCHAR(64),

    -- Scheduling state
    queue_state         VARCHAR(20) NOT NULL DEFAULT 'Active',
    attempts            INTEGER DEFAULT 0,
    last_attempt_at     TIMESTAMP WITH TIME ZONE,
    next_retry_at       TIMESTAMP WITH TIME ZONE,
    backoff_seconds     INTEGER DEFAULT 1,
    unschedulable_reason VARCHAR(500),

    -- Gang scheduling
    gang_id             UUID,
    gang_min_members    INTEGER,

    -- Timestamps
    enqueued_at         TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW(),
    nominated_node_id   UUID,

    -- Full constraint specification
    affinity_rules      JSONB DEFAULT '{}',
    tolerations         JSONB DEFAULT '[]',
    topology_constraints JSONB DEFAULT '[]'
);

CREATE INDEX idx_queue_priority ON scheduling_queue(priority DESC, queue_position ASC)
    WHERE queue_state = 'Active';
CREATE INDEX idx_queue_gang ON scheduling_queue(gang_id) WHERE gang_id IS NOT NULL;
CREATE INDEX idx_queue_retry ON scheduling_queue(next_retry_at) WHERE queue_state = 'Backoff';
CREATE INDEX idx_queue_tenant ON scheduling_queue(tenant_id, queue_state);

Historical Usage Tables

-- Hourly aggregated usage (for capacity planning and chargeback)
CREATE TABLE usage_hourly (
    id                  BIGSERIAL,
    tenant_id           VARCHAR(100) NOT NULL,
    namespace           VARCHAR(255) NOT NULL,
    resource_type       VARCHAR(100) NOT NULL,
    hour_start          TIMESTAMP WITH TIME ZONE NOT NULL,

    -- Usage metrics
    avg_usage           BIGINT NOT NULL,
    max_usage           BIGINT NOT NULL,
    min_usage           BIGINT NOT NULL,
    p95_usage           BIGINT,
    p99_usage           BIGINT,

    -- Allocation metrics
    total_requested     BIGINT NOT NULL,
    total_allocated     BIGINT NOT NULL,
    total_limit         BIGINT NOT NULL,

    -- Efficiency metrics
    utilization_pct     FLOAT,
    waste_pct           FLOAT,

    -- Cost
    cost_units          DECIMAL(12,4),

    PRIMARY KEY (id, hour_start)
) PARTITION BY RANGE (hour_start);

-- Create monthly partitions
CREATE TABLE usage_hourly_2024_01 PARTITION OF usage_hourly
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

CREATE INDEX idx_usage_tenant_time ON usage_hourly(tenant_id, hour_start DESC);
CREATE INDEX idx_usage_resource ON usage_hourly(resource_type, hour_start DESC);

-- Daily summary materialized view
CREATE MATERIALIZED VIEW usage_daily_summary AS
SELECT
    tenant_id,
    namespace,
    resource_type,
    DATE_TRUNC('day', hour_start) AS day,
    AVG(avg_usage) AS daily_avg_usage,
    MAX(max_usage) AS daily_peak_usage,
    SUM(cost_units) AS daily_cost,
    AVG(utilization_pct) AS avg_utilization
FROM usage_hourly
GROUP BY tenant_id, namespace, resource_type, DATE_TRUNC('day', hour_start);

Preemption Records

CREATE TABLE preemption_events (
    preemption_id       UUID PRIMARY KEY,

    -- Preempting job (higher priority)
    preempting_job_id   UUID NOT NULL,
    preempting_tenant   VARCHAR(100) NOT NULL,
    preempting_priority INTEGER NOT NULL,

    -- Preempted job (lower priority, being evicted)
    preempted_job_id    UUID NOT NULL,
    preempted_tenant    VARCHAR(100) NOT NULL,
    preempted_priority  INTEGER NOT NULL,
    preempted_node_id   UUID NOT NULL,

    -- Resources freed
    cpu_freed_millicores    BIGINT,
    memory_freed_bytes      BIGINT,
    gpu_freed               INTEGER DEFAULT 0,

    -- Context
    reason              VARCHAR(500) NOT NULL,
    preemption_policy   VARCHAR(50),

    -- Timing
    initiated_at        TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW(),
    grace_period_sec    INTEGER DEFAULT 30,
    evicted_at          TIMESTAMP WITH TIME ZONE,
    rescheduled_at      TIMESTAMP WITH TIME ZONE,

    -- Outcome
    outcome             VARCHAR(20),

    -- Audit
    initiated_by        VARCHAR(255)
);

CREATE INDEX idx_preemption_time ON preemption_events(initiated_at DESC);
CREATE INDEX idx_preemption_victim ON preemption_events(preempted_tenant, initiated_at DESC);
CREATE INDEX idx_preemption_node ON preemption_events(preempted_node_id, initiated_at DESC);

Node Health and Heartbeat Tracking

-- Node heartbeat tracking (high-write, time-series)
CREATE TABLE node_heartbeats (
    node_id             UUID NOT NULL,
    heartbeat_time      TIMESTAMP WITH TIME ZONE NOT NULL,

    -- Resource snapshot at heartbeat time
    cpu_usage_millicores    BIGINT,
    memory_usage_bytes      BIGINT,
    gpu_utilization_pct     FLOAT[],
    disk_usage_bytes        BIGINT,
    network_rx_bytes        BIGINT,
    network_tx_bytes        BIGINT,

    -- Pod status
    pods_running            INTEGER,
    pods_pending            INTEGER,
    containers_running      INTEGER,

    -- Health indicators
    load_average_1m         FLOAT,
    load_average_5m         FLOAT,
    oom_kills_total         INTEGER,

    PRIMARY KEY (node_id, heartbeat_time)
) PARTITION BY RANGE (heartbeat_time);

-- Node condition changes (event-driven)
CREATE TABLE node_condition_changes (
    id                  BIGSERIAL PRIMARY KEY,
    node_id             UUID NOT NULL,
    condition_type      VARCHAR(50) NOT NULL,
    old_status          VARCHAR(10),
    new_status          VARCHAR(10) NOT NULL,
    reason              VARCHAR(255),
    message             TEXT,
    changed_at          TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_condition_node ON node_condition_changes(node_id, changed_at DESC);
CREATE INDEX idx_condition_type ON node_condition_changes(condition_type, new_status, changed_at DESC);

Redis Structure for Real-Time Heartbeats

# Last heartbeat timestamp (for lease checking)
HSET node:heartbeat:{node_id} last_seen {unix_timestamp}
HSET node:heartbeat:{node_id} cpu_available 24000
HSET node:heartbeat:{node_id} memory_available 137438953472
HSET node:heartbeat:{node_id} gpu_available 3
HSET node:heartbeat:{node_id} pods_running 43

# Sorted set for detecting stale nodes
ZADD node:heartbeat:timestamps {unix_timestamp} {node_id}

# To find nodes that missed heartbeats (> 40 seconds old):
ZRANGEBYSCORE node:heartbeat:timestamps -inf {now - 40}

In-Memory Data Structures

Priority Queue (Scheduling Queue)

// Multi-level priority queue for pending allocations
type SchedulingQueue struct {
    // Active queue: ready to be scheduled
    activeQueue    *PriorityQueue  // Sorted by (priority DESC, timestamp ASC)

    // Backoff queue: failed scheduling, waiting for retry
    backoffQueue   *PriorityQueue  // Sorted by next_retry_at

    // Unschedulable queue: no feasible nodes currently
    unschedulable  *PriorityQueue  // Re-evaluated on cluster state changes

    // Nomination map: jobs waiting for preemption to complete
    nominations    map[string]*NominatedJob

    mu             sync.RWMutex
}

type QueueItem struct {
    JobID           string
    Priority        int32
    Timestamp       time.Time
    ResourceRequest ResourceList
    Constraints     SchedulingConstraints
    Attempts        int
    BackoffDuration time.Duration
}

Resource Bitmap (Node Resource Tracking)

// Bitmap for tracking discrete resource allocation (GPUs, NUMA nodes)
type ResourceBitmap struct {
    bits     []uint64
    total    int
    free     int
    mu       sync.RWMutex
}

// Example: 8 GPUs on a node
// bits = [0b11110011] -> GPUs 2,3 are free; 0,1,4,5,6,7 allocated

// NUMA-aware allocation
type NUMATopology struct {
    nodes       []NUMANode
    distances   [][]int        // NUMA distance matrix
}

type NUMANode struct {
    id          int
    cpuBitmap   ResourceBitmap
    memoryMB    int64
    gpuBitmap   ResourceBitmap
}

Interval Tree (Time-Based Reservations)

// Interval tree for managing time-based resource reservations
type ReservationTree struct {
    root    *IntervalNode
    mu      sync.RWMutex
}

type Reservation struct {
    ID          string
    TenantID    string
    StartTime   time.Time
    EndTime     time.Time
    Resources   ResourceList
    NodeID      string
}

// Operations:
// - Insert reservation: O(log n)
// - Find overlapping reservations: O(log n + k) where k = overlaps
// - Find available capacity at time T: O(log n)
// - Remove expired reservations: O(log n)

Node Scoring Cache

// Cache pre-computed node scores to avoid recalculation
type NodeScoreCache struct {
    scores      map[string]*CachedScore
    generation  int64
    ttl         time.Duration
    mu          sync.RWMutex
}

type CachedScore struct {
    NodeID          string
    BalancedScore   float64
    SpreadScore     float64
    AffinityScore   float64
    TaintScore      float64
    ComputedAt      time.Time
    Generation      int64
}

Data Consistency Model

Data Consistency Rationale
Resource claims (binding) Strong (linearizable) Prevent double-allocation
Quota updates Strong (serializable) Prevent over-quota
Node capacity Strong (per-node) Accurate scheduling decisions
Scheduling queue Strong (within partition) Priority ordering
Heartbeats Eventual (bounded staleness) High write volume
Usage metrics Eventual Analytics, not critical path
Audit logs Eventual (ordered per entity) Compliance, not real-time

Optimistic Concurrency Control

-- All critical updates use resource_version for OCC
UPDATE allocations
SET state = 'Running',
    node_id = $1,
    resource_version = resource_version + 1,
    updated_at = NOW()
WHERE allocation_id = $2
  AND resource_version = $3
  AND state = 'Binding';

-- If 0 rows affected: conflict detected, retry with fresh state

Data Retention and Archival

Retention Policy:
| Data Type           | Hot Storage  | Archive             |
|---------------------|--------------|---------------------|
| Active allocations  | Indefinite   | N/A                 |
| Completed allocs    | 7 days       | 2 years (S3)        |
| Heartbeats          | 1 hour       | 30 days (sampled)   |
| Usage hourly        | 90 days      | 3 years (Parquet)   |
| Preemption events   | 30 days      | 1 year              |
| Audit logs          | 90 days      | 7 years (compliance)|
| Quota history       | 1 year       | 3 years             |

This multi-tier database design ensures the scheduler can make fast, consistent allocation decisions while maintaining the historical data needed for capacity planning, chargeback, and compliance.