Resource Allocation Service - Database Design
Overview
The resource allocation service requires a multi-tier storage architecture that balances strong consistency for critical allocation decisions with high throughput for status updates and historical analytics. This design draws from patterns used in Kubernetes (etcd for state), Borg (Paxos-based cell state), and YARN (ZooKeeper + LevelDB).
Storage Architecture Overview
Storage Tiers:
Hot Path (In-Memory) | Warm Path (Consistent) | Cold Path (Analytical)
---------------------+---------------------------+------------------------
Priority Queues | etcd / CockroachDB | PostgreSQL / ClickHouse
Resource Bitmaps | Redis Cluster | Object Storage (S3)
Scoring Cache | | Data Lake (Parquet)
Node State Cache | |
---------------------+---------------------------+------------------------
Latency: <1ms | Latency: 1-10ms | Latency: 100ms-seconds
Consistency: N/A | Consistency: Strong | Consistency: Eventual
Durability: None | Durability: Yes | Durability: YesResource Inventory Schema
Node Registry (etcd + PostgreSQL)
-- Primary node inventory (PostgreSQL for queryability, synced from etcd)
CREATE TABLE nodes (
node_id UUID PRIMARY KEY,
hostname VARCHAR(255) NOT NULL UNIQUE,
cluster_id VARCHAR(100) NOT NULL,
zone VARCHAR(50) NOT NULL,
region VARCHAR(50) NOT NULL,
rack_id VARCHAR(100),
-- Capacity (total physical resources)
cpu_total_millicores BIGINT NOT NULL,
memory_total_bytes BIGINT NOT NULL,
gpu_total INTEGER DEFAULT 0,
gpu_type VARCHAR(50),
fpga_total INTEGER DEFAULT 0,
ephemeral_storage_bytes BIGINT NOT NULL,
max_pods INTEGER DEFAULT 110,
-- Allocatable (capacity minus system reserved)
cpu_allocatable_millicores BIGINT NOT NULL,
memory_allocatable_bytes BIGINT NOT NULL,
gpu_allocatable INTEGER DEFAULT 0,
-- Current availability (updated frequently)
cpu_available_millicores BIGINT NOT NULL,
memory_available_bytes BIGINT NOT NULL,
gpu_available INTEGER DEFAULT 0,
pods_running INTEGER DEFAULT 0,
-- Node metadata
labels JSONB DEFAULT '{}',
taints JSONB DEFAULT '[]',
annotations JSONB DEFAULT '{}',
conditions JSONB DEFAULT '[]',
-- Lifecycle
status VARCHAR(20) DEFAULT 'Ready',
registered_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW(),
last_heartbeat_at TIMESTAMP WITH TIME ZONE,
unschedulable BOOLEAN DEFAULT FALSE,
-- Versioning for optimistic concurrency
resource_version BIGINT NOT NULL DEFAULT 1,
updated_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_nodes_status ON nodes(status) WHERE status = 'Ready';
CREATE INDEX idx_nodes_zone ON nodes(zone, status);
CREATE INDEX idx_nodes_gpu ON nodes(gpu_type, gpu_available) WHERE gpu_total > 0;
CREATE INDEX idx_nodes_labels ON nodes USING GIN(labels);
CREATE INDEX idx_nodes_available_cpu ON nodes(cpu_available_millicores) WHERE status = 'Ready';
CREATE INDEX idx_nodes_cluster ON nodes(cluster_id, status);
-- Extended resources (custom/plugin resources)
CREATE TABLE node_extended_resources (
node_id UUID REFERENCES nodes(node_id) ON DELETE CASCADE,
resource_name VARCHAR(255) NOT NULL,
capacity BIGINT NOT NULL,
allocatable BIGINT NOT NULL,
available BIGINT NOT NULL,
PRIMARY KEY (node_id, resource_name)
);etcd Key Structure for Real-Time State
# Node state (authoritative, real-time)
/registry/nodes/{node_id}/spec -> NodeSpec (capacity, labels, taints)
/registry/nodes/{node_id}/status -> NodeStatus (conditions, available resources)
/registry/nodes/{node_id}/lease -> Lease (heartbeat TTL)
# Example etcd value for node status:
{
"nodeId": "node-abc123",
"allocatable": {
"cpu": "63500m",
"memory": "253Gi",
"nvidia.com/gpu": "8",
"pods": "110"
},
"available": {
"cpu": "24000m",
"memory": "128Gi",
"nvidia.com/gpu": "3",
"pods": "67"
},
"conditions": [
{"type": "Ready", "status": "True", "lastHeartbeat": "2024-01-15T10:30:00Z"},
{"type": "MemoryPressure", "status": "False"},
{"type": "DiskPressure", "status": "False"}
],
"resourceVersion": 48291
}Allocation Records
Active Allocations Table
CREATE TABLE allocations (
allocation_id UUID PRIMARY KEY,
tenant_id VARCHAR(100) NOT NULL,
namespace VARCHAR(255) NOT NULL,
job_name VARCHAR(255),
-- Resource specification
resource_type VARCHAR(50) NOT NULL,
cpu_request_millicores BIGINT,
cpu_limit_millicores BIGINT,
memory_request_bytes BIGINT,
memory_limit_bytes BIGINT,
gpu_request INTEGER DEFAULT 0,
storage_request_bytes BIGINT DEFAULT 0,
extended_resources JSONB DEFAULT '{}',
-- Scheduling metadata
priority INTEGER NOT NULL DEFAULT 0,
priority_class VARCHAR(50),
preemption_policy VARCHAR(20) DEFAULT 'PreemptLowerPriority',
qos_class VARCHAR(20),
-- Placement
node_id UUID REFERENCES nodes(node_id),
zone VARCHAR(50),
-- Constraints (stored for rescheduling)
node_selector JSONB DEFAULT '{}',
affinity_rules JSONB DEFAULT '{}',
tolerations JSONB DEFAULT '[]',
topology_constraints JSONB DEFAULT '[]',
-- Lifecycle
state VARCHAR(20) NOT NULL DEFAULT 'Pending',
created_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW(),
scheduled_at TIMESTAMP WITH TIME ZONE,
started_at TIMESTAMP WITH TIME ZONE,
completed_at TIMESTAMP WITH TIME ZONE,
-- TTL and expiration
ttl_seconds INTEGER,
expires_at TIMESTAMP WITH TIME ZONE,
-- Versioning
resource_version BIGINT NOT NULL DEFAULT 1,
updated_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW()
);
-- Performance-critical indexes
CREATE INDEX idx_alloc_tenant ON allocations(tenant_id, state);
CREATE INDEX idx_alloc_node ON allocations(node_id, state) WHERE state = 'Running';
CREATE INDEX idx_alloc_pending ON allocations(priority DESC, created_at ASC)
WHERE state = 'Pending';
CREATE INDEX idx_alloc_state ON allocations(state, updated_at);
CREATE INDEX idx_alloc_expiry ON allocations(expires_at)
WHERE expires_at IS NOT NULL AND state = 'Running';
CREATE INDEX idx_alloc_namespace ON allocations(tenant_id, namespace, state);Quota Management Tables
-- Hierarchical quota definition
CREATE TABLE resource_quotas (
quota_id UUID PRIMARY KEY,
tenant_id VARCHAR(100) NOT NULL,
namespace VARCHAR(255),
team_id VARCHAR(100),
resource_type VARCHAR(100) NOT NULL,
-- Limits
hard_limit BIGINT NOT NULL,
soft_limit BIGINT,
burst_limit BIGINT,
burst_duration_sec INTEGER DEFAULT 300,
-- Current usage (denormalized for fast checks)
current_usage BIGINT NOT NULL DEFAULT 0,
peak_usage BIGINT NOT NULL DEFAULT 0,
peak_usage_at TIMESTAMP WITH TIME ZONE,
-- Reserved (committed but not yet running)
reserved BIGINT NOT NULL DEFAULT 0,
-- Scheduling
priority_weight FLOAT DEFAULT 1.0,
preemptible BOOLEAN DEFAULT FALSE,
-- Metadata
created_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW(),
updated_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW(),
created_by VARCHAR(255),
UNIQUE(tenant_id, namespace, resource_type)
);
CREATE INDEX idx_quota_tenant ON resource_quotas(tenant_id);
CREATE INDEX idx_quota_namespace ON resource_quotas(tenant_id, namespace);
CREATE INDEX idx_quota_usage ON resource_quotas(resource_type, current_usage);
-- Quota usage history (for trending and capacity planning)
CREATE TABLE quota_usage_history (
id BIGSERIAL PRIMARY KEY,
tenant_id VARCHAR(100) NOT NULL,
namespace VARCHAR(255),
resource_type VARCHAR(100) NOT NULL,
usage_value BIGINT NOT NULL,
hard_limit BIGINT NOT NULL,
utilization_pct FLOAT NOT NULL,
recorded_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_quota_history_time ON quota_usage_history(tenant_id, recorded_at DESC);Scheduling Queue Schema
CREATE TABLE scheduling_queue (
job_id UUID PRIMARY KEY,
allocation_id UUID REFERENCES allocations(allocation_id),
tenant_id VARCHAR(100) NOT NULL,
namespace VARCHAR(255) NOT NULL,
-- Priority and ordering
priority INTEGER NOT NULL,
priority_class VARCHAR(50),
queue_position BIGSERIAL,
-- Resource requirements (denormalized for fast filtering)
cpu_request BIGINT NOT NULL,
memory_request BIGINT NOT NULL,
gpu_request INTEGER DEFAULT 0,
-- Constraints summary
requires_gpu BOOLEAN DEFAULT FALSE,
requires_zone VARCHAR(50),
node_selector_hash VARCHAR(64),
-- Scheduling state
queue_state VARCHAR(20) NOT NULL DEFAULT 'Active',
attempts INTEGER DEFAULT 0,
last_attempt_at TIMESTAMP WITH TIME ZONE,
next_retry_at TIMESTAMP WITH TIME ZONE,
backoff_seconds INTEGER DEFAULT 1,
unschedulable_reason VARCHAR(500),
-- Gang scheduling
gang_id UUID,
gang_min_members INTEGER,
-- Timestamps
enqueued_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW(),
nominated_node_id UUID,
-- Full constraint specification
affinity_rules JSONB DEFAULT '{}',
tolerations JSONB DEFAULT '[]',
topology_constraints JSONB DEFAULT '[]'
);
CREATE INDEX idx_queue_priority ON scheduling_queue(priority DESC, queue_position ASC)
WHERE queue_state = 'Active';
CREATE INDEX idx_queue_gang ON scheduling_queue(gang_id) WHERE gang_id IS NOT NULL;
CREATE INDEX idx_queue_retry ON scheduling_queue(next_retry_at) WHERE queue_state = 'Backoff';
CREATE INDEX idx_queue_tenant ON scheduling_queue(tenant_id, queue_state);Historical Usage Tables
-- Hourly aggregated usage (for capacity planning and chargeback)
CREATE TABLE usage_hourly (
id BIGSERIAL,
tenant_id VARCHAR(100) NOT NULL,
namespace VARCHAR(255) NOT NULL,
resource_type VARCHAR(100) NOT NULL,
hour_start TIMESTAMP WITH TIME ZONE NOT NULL,
-- Usage metrics
avg_usage BIGINT NOT NULL,
max_usage BIGINT NOT NULL,
min_usage BIGINT NOT NULL,
p95_usage BIGINT,
p99_usage BIGINT,
-- Allocation metrics
total_requested BIGINT NOT NULL,
total_allocated BIGINT NOT NULL,
total_limit BIGINT NOT NULL,
-- Efficiency metrics
utilization_pct FLOAT,
waste_pct FLOAT,
-- Cost
cost_units DECIMAL(12,4),
PRIMARY KEY (id, hour_start)
) PARTITION BY RANGE (hour_start);
-- Create monthly partitions
CREATE TABLE usage_hourly_2024_01 PARTITION OF usage_hourly
FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
CREATE INDEX idx_usage_tenant_time ON usage_hourly(tenant_id, hour_start DESC);
CREATE INDEX idx_usage_resource ON usage_hourly(resource_type, hour_start DESC);
-- Daily summary materialized view
CREATE MATERIALIZED VIEW usage_daily_summary AS
SELECT
tenant_id,
namespace,
resource_type,
DATE_TRUNC('day', hour_start) AS day,
AVG(avg_usage) AS daily_avg_usage,
MAX(max_usage) AS daily_peak_usage,
SUM(cost_units) AS daily_cost,
AVG(utilization_pct) AS avg_utilization
FROM usage_hourly
GROUP BY tenant_id, namespace, resource_type, DATE_TRUNC('day', hour_start);Preemption Records
CREATE TABLE preemption_events (
preemption_id UUID PRIMARY KEY,
-- Preempting job (higher priority)
preempting_job_id UUID NOT NULL,
preempting_tenant VARCHAR(100) NOT NULL,
preempting_priority INTEGER NOT NULL,
-- Preempted job (lower priority, being evicted)
preempted_job_id UUID NOT NULL,
preempted_tenant VARCHAR(100) NOT NULL,
preempted_priority INTEGER NOT NULL,
preempted_node_id UUID NOT NULL,
-- Resources freed
cpu_freed_millicores BIGINT,
memory_freed_bytes BIGINT,
gpu_freed INTEGER DEFAULT 0,
-- Context
reason VARCHAR(500) NOT NULL,
preemption_policy VARCHAR(50),
-- Timing
initiated_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW(),
grace_period_sec INTEGER DEFAULT 30,
evicted_at TIMESTAMP WITH TIME ZONE,
rescheduled_at TIMESTAMP WITH TIME ZONE,
-- Outcome
outcome VARCHAR(20),
-- Audit
initiated_by VARCHAR(255)
);
CREATE INDEX idx_preemption_time ON preemption_events(initiated_at DESC);
CREATE INDEX idx_preemption_victim ON preemption_events(preempted_tenant, initiated_at DESC);
CREATE INDEX idx_preemption_node ON preemption_events(preempted_node_id, initiated_at DESC);Node Health and Heartbeat Tracking
-- Node heartbeat tracking (high-write, time-series)
CREATE TABLE node_heartbeats (
node_id UUID NOT NULL,
heartbeat_time TIMESTAMP WITH TIME ZONE NOT NULL,
-- Resource snapshot at heartbeat time
cpu_usage_millicores BIGINT,
memory_usage_bytes BIGINT,
gpu_utilization_pct FLOAT[],
disk_usage_bytes BIGINT,
network_rx_bytes BIGINT,
network_tx_bytes BIGINT,
-- Pod status
pods_running INTEGER,
pods_pending INTEGER,
containers_running INTEGER,
-- Health indicators
load_average_1m FLOAT,
load_average_5m FLOAT,
oom_kills_total INTEGER,
PRIMARY KEY (node_id, heartbeat_time)
) PARTITION BY RANGE (heartbeat_time);
-- Node condition changes (event-driven)
CREATE TABLE node_condition_changes (
id BIGSERIAL PRIMARY KEY,
node_id UUID NOT NULL,
condition_type VARCHAR(50) NOT NULL,
old_status VARCHAR(10),
new_status VARCHAR(10) NOT NULL,
reason VARCHAR(255),
message TEXT,
changed_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_condition_node ON node_condition_changes(node_id, changed_at DESC);
CREATE INDEX idx_condition_type ON node_condition_changes(condition_type, new_status, changed_at DESC);Redis Structure for Real-Time Heartbeats
# Last heartbeat timestamp (for lease checking)
HSET node:heartbeat:{node_id} last_seen {unix_timestamp}
HSET node:heartbeat:{node_id} cpu_available 24000
HSET node:heartbeat:{node_id} memory_available 137438953472
HSET node:heartbeat:{node_id} gpu_available 3
HSET node:heartbeat:{node_id} pods_running 43
# Sorted set for detecting stale nodes
ZADD node:heartbeat:timestamps {unix_timestamp} {node_id}
# To find nodes that missed heartbeats (> 40 seconds old):
ZRANGEBYSCORE node:heartbeat:timestamps -inf {now - 40}In-Memory Data Structures
Priority Queue (Scheduling Queue)
// Multi-level priority queue for pending allocations
type SchedulingQueue struct {
// Active queue: ready to be scheduled
activeQueue *PriorityQueue // Sorted by (priority DESC, timestamp ASC)
// Backoff queue: failed scheduling, waiting for retry
backoffQueue *PriorityQueue // Sorted by next_retry_at
// Unschedulable queue: no feasible nodes currently
unschedulable *PriorityQueue // Re-evaluated on cluster state changes
// Nomination map: jobs waiting for preemption to complete
nominations map[string]*NominatedJob
mu sync.RWMutex
}
type QueueItem struct {
JobID string
Priority int32
Timestamp time.Time
ResourceRequest ResourceList
Constraints SchedulingConstraints
Attempts int
BackoffDuration time.Duration
}Resource Bitmap (Node Resource Tracking)
// Bitmap for tracking discrete resource allocation (GPUs, NUMA nodes)
type ResourceBitmap struct {
bits []uint64
total int
free int
mu sync.RWMutex
}
// Example: 8 GPUs on a node
// bits = [0b11110011] -> GPUs 2,3 are free; 0,1,4,5,6,7 allocated
// NUMA-aware allocation
type NUMATopology struct {
nodes []NUMANode
distances [][]int // NUMA distance matrix
}
type NUMANode struct {
id int
cpuBitmap ResourceBitmap
memoryMB int64
gpuBitmap ResourceBitmap
}Interval Tree (Time-Based Reservations)
// Interval tree for managing time-based resource reservations
type ReservationTree struct {
root *IntervalNode
mu sync.RWMutex
}
type Reservation struct {
ID string
TenantID string
StartTime time.Time
EndTime time.Time
Resources ResourceList
NodeID string
}
// Operations:
// - Insert reservation: O(log n)
// - Find overlapping reservations: O(log n + k) where k = overlaps
// - Find available capacity at time T: O(log n)
// - Remove expired reservations: O(log n)Node Scoring Cache
// Cache pre-computed node scores to avoid recalculation
type NodeScoreCache struct {
scores map[string]*CachedScore
generation int64
ttl time.Duration
mu sync.RWMutex
}
type CachedScore struct {
NodeID string
BalancedScore float64
SpreadScore float64
AffinityScore float64
TaintScore float64
ComputedAt time.Time
Generation int64
}Data Consistency Model
| Data | Consistency | Rationale |
|---|---|---|
| Resource claims (binding) | Strong (linearizable) | Prevent double-allocation |
| Quota updates | Strong (serializable) | Prevent over-quota |
| Node capacity | Strong (per-node) | Accurate scheduling decisions |
| Scheduling queue | Strong (within partition) | Priority ordering |
| Heartbeats | Eventual (bounded staleness) | High write volume |
| Usage metrics | Eventual | Analytics, not critical path |
| Audit logs | Eventual (ordered per entity) | Compliance, not real-time |
Optimistic Concurrency Control
-- All critical updates use resource_version for OCC
UPDATE allocations
SET state = 'Running',
node_id = $1,
resource_version = resource_version + 1,
updated_at = NOW()
WHERE allocation_id = $2
AND resource_version = $3
AND state = 'Binding';
-- If 0 rows affected: conflict detected, retry with fresh stateData Retention and Archival
Retention Policy:
| Data Type | Hot Storage | Archive |
|---------------------|--------------|---------------------|
| Active allocations | Indefinite | N/A |
| Completed allocs | 7 days | 2 years (S3) |
| Heartbeats | 1 hour | 30 days (sampled) |
| Usage hourly | 90 days | 3 years (Parquet) |
| Preemption events | 30 days | 1 year |
| Audit logs | 90 days | 7 years (compliance)|
| Quota history | 1 year | 3 years |This multi-tier database design ensures the scheduler can make fast, consistent allocation decisions while maintaining the historical data needed for capacity planning, chargeback, and compliance.