Resource Allocation Service - Scale and Constraints

Overview

A production-grade resource allocation service must handle the scale of modern cloud infrastructure. This document defines the quantitative constraints that drive architectural decisions, referencing real-world systems like Google Borg (managing millions of tasks across hundreds of thousands of machines), Kubernetes (supporting 5,000+ node clusters), and YARN (managing petabytes of resources across thousands of nodes).

Cluster Scale

Node Inventory

Metric	Small Cluster	Medium Cluster	Large Cluster (Target)
Total Nodes	500	5,000	50,000
CPU Cores (total)	32,000	320,000	3,200,000
Memory (total)	128 TB	1.28 PB	12.8 PB
GPU Units	200	2,000	20,000
FPGA Units	50	500	5,000
NVMe Storage	500 TB	5 PB	50 PB
Network Bandwidth	25 Gbps/node	25-100 Gbps/node	100 Gbps/node

Node Configurations (Heterogeneous Fleet)

Standard Compute Node:
- 64 CPU cores (Intel Xeon / AMD EPYC)
- 256 GB RAM
- 2 TB NVMe SSD
- 25 Gbps network

GPU-Accelerated Node:
- 48 CPU cores
- 512 GB RAM
- 8x NVIDIA A100 (80GB each)
- 4 TB NVMe SSD
- 100 Gbps network (RDMA-capable)

High-Memory Node:
- 128 CPU cores
- 2 TB RAM
- 8 TB NVMe SSD
- 25 Gbps network

Storage-Optimized Node:
- 32 CPU cores
- 128 GB RAM
- 24x 16TB HDD + 2x 3.2TB NVMe
- 25 Gbps network

Reference: Real-World Scale

Google Borg: Manages hundreds of thousands of machines, billions of task-hours per month
Kubernetes: Officially supports up to 5,000 nodes per cluster, 150,000 pods
YARN at Meta: 300,000+ nodes, managing exabytes of storage
Azure: Millions of VMs across global regions

Request Volume

Allocation Request Throughput

Operation	Average Rate	Peak Rate	Burst (10s window)
Pod/Task Scheduling	500/sec	5,000/sec	10,000/sec
Resource Allocation Requests	1,000/sec	10,000/sec	25,000/sec
Resource Release	800/sec	8,000/sec	20,000/sec
Quota Check	5,000/sec	50,000/sec	100,000/sec
Status Queries	10,000/sec	100,000/sec	200,000/sec
Node Heartbeats	50,000/sec (1/sec/node)	50,000/sec	50,000/sec
Scheduling Decisions	500/sec	5,000/sec	10,000/sec

Request Patterns

Daily Pattern:
- Business hours (9am-6pm): 3x average load
- Batch job submission (2am-4am): 5x burst for ETL/ML training
- Deployment windows: 10x burst during rolling updates
- Auto-scaling events: Correlated bursts across tenants

Weekly Pattern:
- Monday morning: 2x average (cold start after weekend)
- Friday evening: 0.5x average (workload wind-down)
- End of month: 1.5x (batch reporting jobs)

Resource Types and Quantities

Compute Resources

Resource Type	Unit	Granularity	Overcommit Ratio
CPU	millicores	1m (0.001 core)	2:1 - 10:1
Memory	bytes	1 MiB	1.25:1 - 2:1
GPU	devices	1 GPU or fractional (MIG)	1:1 (no overcommit)
FPGA	devices	1 FPGA	1:1
Ephemeral Storage	bytes	1 GiB	1.5:1
Persistent Storage	bytes	1 GiB	1:1
Network Bandwidth	bits/sec	1 Mbps	3:1 - 5:1
Hugepages (2Mi)	pages	1 page	1:1
Hugepages (1Gi)	pages	1 page	1:1

Extended Resources

Custom/Extended Resources:
- software-licenses/matlab: 500 total seats
- hardware/infiniband: 200 ports
- hardware/rdma: 200 interfaces
- accelerators/tpu-v4: 1,000 chips
- networking/elastic-ip: 10,000 addresses
- custom/encryption-slots: 5,000 HSM slots

Resource Dimensions per Allocation

A single allocation request may specify:

Requests (guaranteed minimum): CPU 100m, Memory 256Mi
Limits (maximum allowed): CPU 2000m, Memory 4Gi
Node Affinity: zone=us-east-1a, gpu-type=a100
Anti-Affinity: spread across failure domains
Tolerations: Accept tainted nodes (e.g., GPU-only nodes)
Topology Constraints: Max skew across zones

Tenant Scale

Multi-Tenancy Dimensions

Metric	Value
Total Tenants (Organizations)	1,000
Namespaces per Tenant	50-200
Total Namespaces	100,000
Resource Quotas (active)	500,000
Service Accounts	1,000,000
Active Jobs/Pods per Tenant	500-50,000
Total Active Allocations	5,000,000

Quota Hierarchy

Organization (Tenant)
├── Team A (50% of org quota)
│   ├── Namespace: production (70% of team quota)
│   ├── Namespace: staging (20% of team quota)
│   └── Namespace: development (10% of team quota)
├── Team B (30% of org quota)
│   ├── Namespace: ml-training (80% of team quota)
│   └── Namespace: ml-inference (20% of team quota)
└── Team C (20% of org quota)
    └── Namespace: batch-processing (100% of team quota)

Per-Namespace Quota Example:
- CPU: 1000 cores (request), 2000 cores (limit)
- Memory: 4 TB (request), 8 TB (limit)
- GPU: 100 devices
- Pods: 10,000
- Services: 500
- PersistentVolumeClaims: 1,000

Scheduling Latency Targets

Latency SLOs

Operation	P50	P95	P99	Max Acceptable
Pod Scheduling (simple)	5ms	20ms	50ms	100ms
Pod Scheduling (complex constraints)	20ms	100ms	500ms	2s
Batch Job Admission	50ms	200ms	1s	5s
Gang Scheduling (multi-pod)	100ms	500ms	2s	10s
Preemption Decision	10ms	50ms	200ms	500ms
Quota Check	1ms	5ms	10ms	50ms
Node Scoring (per node)	0.01ms	0.05ms	0.1ms	1ms
Resource Reservation	2ms	10ms	50ms	100ms

Scheduling Throughput Targets

Kubernetes Reference (5,000 node cluster):
- Scheduling throughput: 100+ pods/sec sustained
- Startup latency (schedule to running): < 5 seconds
- API server latency: < 1 second for 99th percentile

Our Target (50,000 node cluster):
- Scheduling throughput: 500+ allocations/sec sustained
- Burst scheduling: 5,000 allocations/sec for 30 seconds
- End-to-end allocation latency: < 2 seconds (P99)
- Scheduling queue drain time: < 60 seconds after burst

Latency Budget Breakdown

Total Scheduling Latency Budget: 100ms (P95 for simple pods)

1. API Admission + Validation:     10ms
2. Queue Wait Time:                 5ms (P50), 50ms (P95)
3. Feasibility Filtering:          15ms (filter 50K nodes → ~500 candidates)
4. Scoring/Ranking:                25ms (score 500 candidates)
5. Optimistic Binding:             10ms
6. Persistence (etcd write):       15ms
7. Node Notification:              20ms
                                   ─────
Total:                             100ms

Storage Requirements

State Categories

State Type	Storage System	Size	Access Pattern
Resource Inventory (current)	etcd / Redis	500 MB	Read-heavy, strong consistency
Active Allocations	etcd / Redis	2 GB	Read-write, strong consistency
Scheduling Queue	In-memory + WAL	200 MB	Write-heavy, ordered
Quota State	etcd / PostgreSQL	100 MB	Read-heavy, strong consistency
Node Health/Heartbeats	In-memory + Redis	1 GB	Write-heavy, time-series
Allocation History	PostgreSQL / ClickHouse	5 TB/year	Append-only, analytical
Audit Logs	Kafka → S3	10 TB/year	Append-only, compliance
Metrics/Telemetry	Prometheus / InfluxDB	2 TB/year	Time-series, aggregated

etcd Constraints (Critical Path)

etcd Limitations (Kubernetes reference):
- Max database size: 8 GB (recommended)
- Max request size: 1.5 MB
- Max key-value size: 1.5 MB
- Recommended QPS: < 10,000 read/sec, < 1,000 write/sec
- Raft consensus latency: 1-10ms (same datacenter)

Implications:
- Cannot store all allocation history in etcd
- Must shard or tier state across storage systems
- Node heartbeats should NOT go through etcd
- Use watch mechanism for change propagation

Storage Growth Projections

Year 1:
- Active state: 5 GB
- Historical data: 5 TB
- Audit logs: 10 TB
- Total: ~15 TB

Year 3 (10x growth):
- Active state: 50 GB
- Historical data: 150 TB
- Audit logs: 300 TB
- Total: ~450 TB (with compression: ~100 TB)

Network Requirements

Control Plane Traffic

Traffic Type	Rate	Payload Size	Protocol
Node Heartbeats	50,000/sec	1-5 KB	gRPC
Resource Status Updates	10,000/sec	2-10 KB	gRPC streaming
API Requests (external)	10,000/sec	1-50 KB	HTTPS/REST
Scheduler ↔ etcd	5,000/sec	0.5-5 KB	gRPC
Watch Notifications	100,000/sec	0.5-2 KB	gRPC streaming
Inter-scheduler Sync	1,000/sec	10-100 KB	gRPC

Bandwidth Calculations

Control Plane Bandwidth:
- Heartbeats: 50,000/sec × 2KB = 100 MB/sec = 800 Mbps
- Status updates: 10,000/sec × 5KB = 50 MB/sec = 400 Mbps
- API traffic: 10,000/sec × 10KB = 100 MB/sec = 800 Mbps
- Watch streams: 100,000/sec × 1KB = 100 MB/sec = 800 Mbps
- Total control plane: ~3 Gbps sustained

Network Requirements:
- Scheduler nodes: 10 Gbps NIC minimum
- etcd nodes: 10 Gbps NIC, < 1ms RTT between peers
- API gateway: 25 Gbps aggregate
- Cross-AZ traffic: Minimize (cost + latency)

Failure Detection Timing

Heartbeat Configuration:
- Heartbeat interval: 10 seconds
- Heartbeat timeout: 40 seconds (4 missed heartbeats)
- Node lease duration: 40 seconds
- Pod eviction timeout: 5 minutes (after node marked NotReady)

Implications:
- 50,000 nodes × 1 heartbeat/10s = 5,000 heartbeats/sec
- Must process within 1 second to avoid false positives
- Network partitions detected within 40 seconds
- Full rescheduling triggered within 5 minutes

Cost Modeling

Resource Utilization Targets

Metric	Target	Acceptable Range	Alert Threshold
CPU Utilization (cluster avg)	65%	50-80%	< 40% or > 85%
Memory Utilization	75%	60-85%	< 50% or > 90%
GPU Utilization	80%	70-95%	< 60% or > 95%
Storage Utilization	70%	50-80%	> 85%
Network Utilization	40%	20-60%	> 70%

Overcommit Ratios and Risk

Overcommit Strategy:
┌─────────────────┬───────────┬──────────────┬─────────────────┐
│ Resource        │ Ratio     │ Risk Level   │ Mitigation      │
├─────────────────┼───────────┼──────────────┼─────────────────┤
│ CPU             │ 4:1       │ Medium       │ Throttling       │
│ Memory          │ 1.5:1     │ High         │ OOM killer       │
│ GPU             │ 1:1       │ N/A          │ No overcommit    │
│ Ephemeral Disk  │ 1.5:1     │ Medium       │ Eviction         │
│ Network         │ 5:1       │ Low          │ Traffic shaping  │
└─────────────────┴───────────┴──────────────┴─────────────────┘

Cost Impact:
- Without overcommit: $10M/month infrastructure
- With CPU 4:1 overcommit: $4M/month (60% savings)
- Risk: 2-5% of workloads experience throttling during peaks

Cost Per Resource Unit

Hourly Cost Model (approximate):
- 1 CPU core: $0.03/hour
- 1 GB RAM: $0.004/hour
- 1 GPU (A100): $3.00/hour
- 1 TB SSD: $0.10/hour
- 1 TB HDD: $0.02/hour
- 1 Gbps network: $0.01/hour

Monthly Cluster Cost (50,000 nodes):
- Compute: $70M (hardware amortization + power + cooling)
- Network: $5M
- Storage: $15M
- Operations: $10M
- Total: ~$100M/month

Efficiency Target:
- Improve utilization from 40% → 65% = $37.5M/month savings
- Justifies significant investment in scheduler sophistication

Chargeback Model

Tenant Billing Dimensions:
1. Reserved capacity (guaranteed, pay regardless of use)
2. On-demand usage (actual consumption, higher rate)
3. Burst/spot usage (preemptible, lowest rate)
4. Priority premium (higher priority = higher cost multiplier)

Priority Cost Multipliers:
- Critical (P0): 3.0x base rate
- High (P1): 2.0x base rate
- Normal (P2): 1.0x base rate
- Low (P3): 0.5x base rate
- Best-effort (P4): 0.2x base rate

Constraint Summary for Architecture Decisions

Constraint	Impact on Design
50K nodes	Partitioned scheduling, hierarchical state
500 alloc/sec sustained	Optimistic concurrency, batch processing
100ms P95 scheduling	In-memory state, pre-computed scores
5M active allocations	Sharded state store, efficient indexing
50K heartbeats/sec	Dedicated heartbeat path, not through API server
8GB etcd limit	Tiered storage, external state for history
1000 tenants	Hierarchical quotas, fair-share scheduling
4:1 CPU overcommit	Monitoring, throttling, priority-based eviction
$100M/month cost	Every 1% utilization improvement = $1M savings

These constraints directly inform the choice of scheduling algorithm, state management approach, and system architecture described in subsequent sections.