Resource Allocation Service - Scale and Constraints
Overview
A production-grade resource allocation service must handle the scale of modern cloud infrastructure. This document defines the quantitative constraints that drive architectural decisions, referencing real-world systems like Google Borg (managing millions of tasks across hundreds of thousands of machines), Kubernetes (supporting 5,000+ node clusters), and YARN (managing petabytes of resources across thousands of nodes).
Cluster Scale
Node Inventory
| Metric | Small Cluster | Medium Cluster | Large Cluster (Target) |
| --- | --- | --- | --- |
| Total Nodes | 500 | 5,000 | 50,000 |
| CPU Cores (total) | 32,000 | 320,000 | 3,200,000 |
| Memory (total) | 128 TB | 1.28 PB | 12.8 PB |
| GPU Units | 200 | 2,000 | 20,000 |
| FPGA Units | 50 | 500 | 5,000 |
| NVMe Storage | 500 TB | 5 PB | 50 PB |
| Network Bandwidth | 25 Gbps/node | 25-100 Gbps/node | 100 Gbps/node |
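The core and memory totals in this table match a fleet sized as if every node were a Standard Compute Node (64 cores, 256 GB RAM). The sketch below reproduces that arithmetic; the function and constant names are illustrative and not part of any existing codebase.

```python
# Per-node figures taken from the Standard Compute Node configuration.
CORES_PER_NODE = 64
RAM_GB_PER_NODE = 256

def fleet_totals(nodes: int) -> dict:
    """Aggregate cores and memory (decimal TB) for a homogeneous fleet."""
    return {
        "cores": nodes * CORES_PER_NODE,
        "memory_tb": nodes * RAM_GB_PER_NODE / 1000,
    }

if __name__ == "__main__":
    for nodes in (500, 5_000, 50_000):
        print(nodes, fleet_totals(nodes))
```

A real heterogeneous fleet would weight each node type by its share of the inventory, so these totals are best read as a sizing baseline.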
Node Configurations (Heterogeneous Fleet)
Standard Compute Node:
- 64 CPU cores (Intel Xeon / AMD EPYC)
- 256 GB RAM
- 2 TB NVMe SSD
- 25 Gbps network
GPU-Accelerated Node:
- 48 CPU cores
- 512 GB RAM
- 8x NVIDIA A100 (80GB each)
- 4 TB NVMe SSD
- 100 Gbps network (RDMA-capable)
High-Memory Node:
- 128 CPU cores
- 2 TB RAM
- 8 TB NVMe SSD
- 25 Gbps network
Storage-Optimized Node:
- 32 CPU cores
- 128 GB RAM
- 24x 16TB HDD + 2x 3.2TB NVMe
- 25 Gbps network
Reference: Real-World Scale
- Google Borg: Manages hundreds of thousands of machines, billions of task-hours per month
- Kubernetes: Officially supports up to 5,000 nodes per cluster, 150,000 pods
- YARN at Meta: 300,000+ nodes, managing exabytes of storage
- Azure: Millions of VMs across global regions
Request Volume
Allocation Request Throughput
| Operation | Average Rate | Peak Rate | Burst (10s window) |
| --- | --- | --- | --- |
| Pod/Task Scheduling | 500/sec | 5,000/sec | 10,000/sec |
| Resource Allocation Requests | 1,000/sec | 10,000/sec | 25,000/sec |
| Resource Release | 800/sec | 8,000/sec | 20,000/sec |
| Quota Check | 5,000/sec | 50,000/sec | 100,000/sec |
| Status Queries | 10,000/sec | 100,000/sec | 200,000/sec |
| Node Heartbeats | 50,000/sec (1/sec/node) | 50,000/sec | 50,000/sec |
| Scheduling Decisions | 500/sec | 5,000/sec | 10,000/sec |
Request Patterns
Daily Pattern:
- Business hours (9am-6pm): 3x average load
- Batch job submission (2am-4am): 5x burst for ETL/ML training
- Deployment windows: 10x burst during rolling updates
- Auto-scaling events: Correlated bursts across tenants
Weekly Pattern:
- Monday morning: 2x average (cold start after weekend)
- Friday evening: 0.5x average (workload wind-down)
- End of month: 1.5x (batch reporting jobs)
Resource Types and Quantities
Compute Resources
| Resource Type | Unit | Granularity | Overcommit Ratio |
| --- | --- | --- | --- |
| CPU | millicores | 1m (0.001 core) | 2:1 - 10:1 |
| Memory | bytes | 1 MiB | 1.25:1 - 2:1 |
| GPU | devices | 1 GPU or fractional (MIG) | 1:1 (no overcommit) |
| FPGA | devices | 1 FPGA | 1:1 |
| Ephemeral Storage | bytes | 1 GiB | 1.5:1 |
| Persistent Storage | bytes | 1 GiB | 1:1 |
| Network Bandwidth | bits/sec | 1 Mbps | 3:1 - 5:1 |
| Hugepages (2Mi) | pages | 1 page | 1:1 |
| Hugepages (1Gi) | pages | 1 page | 1:1 |
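To make the overcommit ratios concrete, the sketch below shows one way an admission check could turn a ratio into allocatable capacity. The helper names are hypothetical, and the 4:1 ratio in the usage lines is only an example from within the CPU range in the table.

```python
def allocatable_millicores(physical_cores: int, overcommit: float) -> int:
    """CPU that may be promised to workloads, in millicores."""
    return int(physical_cores * 1000 * overcommit)

def can_admit(requested_m: int, allocated_m: int,
              physical_cores: int, overcommit: float) -> bool:
    """True if the new request still fits under the overcommitted capacity."""
    capacity = allocatable_millicores(physical_cores, overcommit)
    return allocated_m + requested_m <= capacity

if __name__ == "__main__":
    # A 64-core node at 4:1 can promise 256,000 millicores in total.
    print(allocatable_millicores(64, 4.0))
    print(can_admit(2_000, 255_000, 64, 4.0))
```

Resources listed at 1:1 (GPU, FPGA, hugepages, persistent storage) reduce to a plain capacity check under the same function.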
Extended Resources
Custom/Extended Resources:
- software-licenses/matlab: 500 total seats
- hardware/infiniband: 200 ports
- hardware/rdma: 200 interfaces
- accelerators/tpu-v4: 1,000 chips
- networking/elastic-ip: 10,000 addresses
- custom/encryption-slots: 5,000 HSM slots
Resource Dimensions per Allocation
A single allocation request may specify:
- Requests (guaranteed minimum): CPU 100m, Memory 256Mi
- Limits (maximum allowed): CPU 2000m, Memory 4Gi
- Node Affinity: zone=us-east-1a, gpu-type=a100
- Anti-Affinity: spread across failure domains
- Tolerations: Accept tainted nodes (e.g., GPU-only nodes)
- Topology Constraints: Max skew across zones
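The dimensions above could be carried in a single request record. The following is a minimal sketch; the class name, field names, and validation rules are assumptions made for illustration and do not describe an existing API.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class AllocationRequest:
    cpu_request_m: int      # guaranteed CPU, millicores
    cpu_limit_m: int        # maximum CPU, millicores
    mem_request_mib: int    # guaranteed memory, MiB
    mem_limit_mib: int      # maximum memory, MiB
    node_affinity: dict = field(default_factory=dict)
    tolerations: tuple = ()
    max_zone_skew: int = 1  # topology constraint (hypothetical default)

    def validate(self) -> list:
        """Return a list of human-readable problems; empty means valid."""
        errors = []
        if not 0 < self.cpu_request_m <= self.cpu_limit_m:
            errors.append("cpu request must be positive and <= limit")
        if not 0 < self.mem_request_mib <= self.mem_limit_mib:
            errors.append("memory request must be positive and <= limit")
        if self.max_zone_skew < 1:
            errors.append("max zone skew must be >= 1")
        return errors

# The example values from the list above: 100m/2000m CPU, 256Mi/4Gi memory.
example = AllocationRequest(
    100, 2000, 256, 4096,
    node_affinity={"zone": "us-east-1a", "gpu-type": "a100"},
    tolerations=("gpu-only",),
)

if __name__ == "__main__":
    print(example.validate())
```

Validating at admission time keeps malformed requests out of the scheduling queue, where they would otherwise consume filtering and scoring budget.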
Tenant Scale
Multi-Tenancy Dimensions
| Metric | Value |
| --- | --- |
| Total Tenants (Organizations) | 1,000 |
| Namespaces per Tenant | 50-200 |
| Total Namespaces | 100,000 |
| Resource Quotas (active) | 500,000 |
| Service Accounts | 1,000,000 |
| Active Jobs/Pods per Tenant | 500-50,000 |
| Total Active Allocations | 5,000,000 |
Quota Hierarchy
Organization (Tenant)
├── Team A (50% of org quota)
│   ├── Namespace: production (70% of team quota)
│   ├── Namespace: staging (20% of team quota)
│   └── Namespace: development (10% of team quota)
├── Team B (30% of org quota)
│   ├── Namespace: ml-training (80% of team quota)
│   └── Namespace: ml-inference (20% of team quota)
└── Team C (20% of org quota)
    └── Namespace: batch-processing (100% of team quota)
Per-Namespace Quota Example:
- CPU: 1000 cores (request), 2000 cores (limit)
- Memory: 4 TB (request), 8 TB (limit)
- GPU: 100 devices
- Pods: 10,000
- Services: 500
- PersistentVolumeClaims: 1,000
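The hierarchy resolves to absolute namespace quotas by multiplying shares down the tree. The sketch below uses the percentages from the tree; the 10,000-core organization quota is a made-up input, and integer percentages are used to avoid floating-point drift.

```python
# Shares copied from the quota tree, as integer percentages.
SHARES = {
    "Team A": (50, {"production": 70, "staging": 20, "development": 10}),
    "Team B": (30, {"ml-training": 80, "ml-inference": 20}),
    "Team C": (20, {"batch-processing": 100}),
}

def namespace_quotas(org_quota: int) -> dict:
    """Map (team, namespace) to its absolute share of the org quota."""
    quotas = {}
    for team, (team_pct, namespaces) in SHARES.items():
        for namespace, ns_pct in namespaces.items():
            quotas[(team, namespace)] = org_quota * team_pct * ns_pct // 10_000
    return quotas

if __name__ == "__main__":
    for key, cores in namespace_quotas(10_000).items():
        print(key, cores)
```

Because every level sums to 100%, the namespace quotas add back up to the organization quota, which makes a useful invariant to check whenever shares are edited.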
Scheduling Latency Targets
Latency SLOs
| Operation | P50 | P95 | P99 | Max Acceptable |
| --- | --- | --- | --- | --- |
| Pod Scheduling (simple) | 5ms | 20ms | 50ms | 100ms |
| Pod Scheduling (complex constraints) | 20ms | 100ms | 500ms | 2s |
| Batch Job Admission | 50ms | 200ms | 1s | 5s |
| Gang Scheduling (multi-pod) | 100ms | 500ms | 2s | 10s |
| Preemption Decision | 10ms | 50ms | 200ms | 500ms |
| Quota Check | 1ms | 5ms | 10ms | 50ms |
| Node Scoring (per node) | 0.01ms | 0.05ms | 0.1ms | 1ms |
| Resource Reservation | 2ms | 10ms | 50ms | 100ms |
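Checking measured latencies against these SLOs comes down to a percentile comparison. The sketch below uses the nearest-rank percentile definition and the simple pod scheduling row; the function names and the choice of percentile method are assumptions, and a production system would more likely read percentiles from histogram metrics.

```python
# SLO targets for "Pod Scheduling (simple)", in milliseconds.
SLO_MS = {50: 5, 95: 20, 99: 50}

def percentile(samples, pct: int):
    """Nearest-rank percentile of a non-empty sample list."""
    ordered = sorted(samples)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceil
    return ordered[rank - 1]

def slo_violations(samples_ms) -> dict:
    """Return {percentile: observed} for every percentile over its target."""
    violations = {}
    for pct, target in SLO_MS.items():
        observed = percentile(samples_ms, pct)
        if observed > target:
            violations[pct] = observed
    return violations

if __name__ == "__main__":
    healthy = [3] * 90 + [15] * 8 + [40] * 2
    degraded = [3] * 90 + [30] * 10
    print(slo_violations(healthy))
    print(slo_violations(degraded))
```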
Scheduling Throughput Targets
Kubernetes Reference (5,000 node cluster):
- Scheduling throughput: 100+ pods/sec sustained
- Startup latency (schedule to running): < 5 seconds
- API server latency: < 1 second for 99th percentile
Our Target (50,000 node cluster):
- Scheduling throughput: 500+ allocations/sec sustained
- Burst scheduling: 5,000 allocations/sec for 30 seconds
- End-to-end allocation latency: < 2 seconds (P99)
- Scheduling queue drain time: < 60 seconds after burst
Latency Budget Breakdown
Total Scheduling Latency Budget: 100ms (P95 for simple pods)
1. API Admission + Validation: 10ms
2. Queue Wait Time: 5ms (P50), 50ms (P95)
3. Feasibility Filtering: 15ms (filter 50K nodes → ~500 candidates)
4. Scoring/Ranking: 25ms (score 500 candidates)
5. Optimistic Binding: 10ms
6. Persistence (etcd write): 15ms
7. Node Notification: 20ms
─────
Total: 100ms (using the P50 queue wait of 5ms)
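The budget can be checked by summing the stages. The sketch below does that for both queue-wait figures; the dictionary keys are illustrative names for the seven stages listed above.

```python
# Stage budgets in milliseconds, excluding queue wait.
FIXED_STAGES_MS = {
    "api_admission_validation": 10,
    "feasibility_filtering": 15,
    "scoring_ranking": 25,
    "optimistic_binding": 10,
    "persistence_etcd_write": 15,
    "node_notification": 20,
}
QUEUE_WAIT_MS = {"p50": 5, "p95": 50}

def total_budget_ms(queue_percentile: str) -> int:
    """Sum of all stages for the given queue-wait percentile."""
    return sum(FIXED_STAGES_MS.values()) + QUEUE_WAIT_MS[queue_percentile]

if __name__ == "__main__":
    print(total_budget_ms("p50"))  # 100
    print(total_budget_ms("p95"))  # 145
```

The 100ms total only holds when queue wait stays at its P50 value. With the P95 queue wait the same stages add up to 145ms, so queue depth is the stage to watch when the budget is at risk.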
Storage Requirements
State Categories
| State Type | Storage System | Size | Access Pattern |
| --- | --- | --- | --- |
| Resource Inventory (current) | etcd / Redis | 500 MB | Read-heavy, strong consistency |
| Active Allocations | etcd / Redis | 2 GB | Read-write, strong consistency |
| Scheduling Queue | In-memory + WAL | 200 MB | Write-heavy, ordered |
| Quota State | etcd / PostgreSQL | 100 MB | Read-heavy, strong consistency |
| Node Health/Heartbeats | In-memory + Redis | 1 GB | Write-heavy, time-series |
| Allocation History | PostgreSQL / ClickHouse | 5 TB/year | Append-only, analytical |
| Audit Logs | Kafka → S3 | 10 TB/year | Append-only, compliance |
| Metrics/Telemetry | Prometheus / InfluxDB | 2 TB/year | Time-series, aggregated |
etcd Constraints (Critical Path)
etcd Limitations (Kubernetes reference):
- Max database size: 8 GB (recommended)
- Max request size: 1.5 MB
- Max key-value size: 1.5 MB
- Recommended QPS: < 10,000 read/sec, < 1,000 write/sec
- Raft consensus latency: 1-10ms (same datacenter)
Implications:
- Cannot store all allocation history in etcd
- Must shard or tier state across storage systems
- Node heartbeats should NOT go through etcd
- Use watch mechanism for change propagation
Storage Growth Projections
Year 1:
- Active state: 5 GB
- Historical data: 5 TB
- Audit logs: 10 TB
- Total: ~15 TB
Year 3 (10x growth):
- Active state: 50 GB
- Historical data: 150 TB
- Audit logs: 300 TB
- Total: ~450 TB (with compression: ~100 TB)
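The active-state projections can be compared against the 8 GB etcd recommendation to estimate how many shards would be needed. The sketch below deliberately takes the worst case that all active state lives in etcd, and the 50% headroom factor is a hypothetical safety margin, not a figure from this document.

```python
import math

ETCD_MAX_GB = 8                      # recommended maximum database size
ACTIVE_STATE_GB = {"year_1": 5, "year_3": 50}

def shards_needed(active_gb: float, headroom: float = 0.5) -> int:
    """Shards required if each one is filled only to `headroom` of the limit."""
    usable_gb = ETCD_MAX_GB * headroom
    return math.ceil(active_gb / usable_gb)

if __name__ == "__main__":
    for year, size_gb in ACTIVE_STATE_GB.items():
        fits = size_gb <= ETCD_MAX_GB
        print(year, "fits in one etcd:", fits, "shards:", shards_needed(size_gb))
```

Under these assumptions, Year 1 active state fits within a single recommended-size database but Year 3 does not, which is why sharding or tiering has to be designed in from the start.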
Network Requirements
Control Plane Traffic
| Traffic Type | Rate | Payload Size | Protocol |
| --- | --- | --- | --- |
| Node Heartbeats | 50,000/sec | 1-5 KB | gRPC |
| Resource Status Updates | 10,000/sec | 2-10 KB | gRPC streaming |
| API Requests (external) | 10,000/sec | 1-50 KB | HTTPS/REST |
| Scheduler ↔ etcd | 5,000/sec | 0.5-5 KB | gRPC |
| Watch Notifications | 100,000/sec | 0.5-2 KB | gRPC streaming |
| Inter-scheduler Sync | 1,000/sec | 10-100 KB | gRPC |
Bandwidth Calculations
Control Plane Bandwidth:
- Heartbeats: 50,000/sec × 2 KB = 100 MB/sec = 800 Mbps
- Status updates: 10,000/sec × 5 KB = 50 MB/sec = 400 Mbps
- API traffic: 10,000/sec × 10 KB = 100 MB/sec = 800 Mbps
- Watch streams: 100,000/sec × 1 KB = 100 MB/sec = 800 Mbps
- Total control plane: ~3 Gbps sustained
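The bandwidth figures follow from rate × payload × 8 bits. The sketch below reproduces the four line items using decimal units (1 KB = 1,000 bytes); the dictionary keys are illustrative names.

```python
# (messages per second, assumed average payload in KB)
TRAFFIC = {
    "heartbeats": (50_000, 2),
    "status_updates": (10_000, 5),
    "api_traffic": (10_000, 10),
    "watch_streams": (100_000, 1),
}

def mbps(rate_per_sec: int, payload_kb: float) -> float:
    """Bandwidth in megabits per second for a message stream."""
    return rate_per_sec * payload_kb * 8 / 1000

def total_gbps() -> float:
    return sum(mbps(rate, kb) for rate, kb in TRAFFIC.values()) / 1000

if __name__ == "__main__":
    for name, (rate, kb) in TRAFFIC.items():
        print(name, mbps(rate, kb), "Mbps")
    print("total", total_gbps(), "Gbps")
```

The four streams sum to 2.8 Gbps, which the list rounds to ~3 Gbps. Scheduler-to-etcd traffic and inter-scheduler sync are not part of that total.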
Network Requirements:
- Scheduler nodes: 10 Gbps NIC minimum
- etcd nodes: 10 Gbps NIC, < 1 ms RTT between peers
- API gateway: 25 Gbps aggregate
- Cross-AZ traffic: Minimize (cost + latency)
Failure Detection Timing
Heartbeat Configuration:
- Heartbeat interval: 10 seconds
- Heartbeat timeout: 40 seconds (4 missed heartbeats)
- Node lease duration: 40 seconds
- Pod eviction timeout: 5 minutes (after node marked NotReady)
Implications:
- 50,000 nodes × 1 heartbeat/10 s = 5,000 heartbeats/sec (the 50,000/sec figure used for capacity planning elsewhere in this document assumes 1 heartbeat/sec/node)
- Must process within 1 second to avoid false positives
- Network partitions detected within 40 seconds
- Full rescheduling triggered within 5 minutes
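The timing rules above amount to a small state machine keyed on the age of the last heartbeat. The sketch below is one possible shape; the class name and the "Ready" / "NotReady" / "Evict" labels are assumptions for illustration, and timestamps are passed in explicitly so the logic is easy to test.

```python
HEARTBEAT_INTERVAL_S = 10
LEASE_DURATION_S = 40       # 4 missed heartbeats
EVICTION_TIMEOUT_S = 300    # 5 minutes after the node is marked NotReady

class LeaseTracker:
    """Tracks the last heartbeat per node and derives node state from it."""

    def __init__(self):
        self.last_seen = {}

    def heartbeat(self, node: str, now: float) -> None:
        self.last_seen[node] = now

    def state(self, node: str, now: float) -> str:
        age = now - self.last_seen[node]
        if age <= LEASE_DURATION_S:
            return "Ready"
        if age <= LEASE_DURATION_S + EVICTION_TIMEOUT_S:
            return "NotReady"
        return "Evict"

def heartbeats_per_sec(nodes: int, interval_s: int = HEARTBEAT_INTERVAL_S) -> float:
    return nodes / interval_s

if __name__ == "__main__":
    tracker = LeaseTracker()
    tracker.heartbeat("node-1", now=0.0)
    for t in (10.0, 41.0, 341.0):
        print(t, tracker.state("node-1", now=t))
    print(heartbeats_per_sec(50_000))
```

Keeping this state in memory, off the etcd write path, is what lets the heartbeat rate scale independently of the consensus store.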
Cost Modeling
Resource Utilization Targets
| Metric | Target | Acceptable Range | Alert Threshold |
| --- | --- | --- | --- |
| CPU Utilization (cluster avg) | 65% | 50-80% | < 40% or > 85% |
| Memory Utilization | 75% | 60-85% | < 50% or > 90% |
| GPU Utilization | 80% | 70-95% | < 60% or > 95% |
| Storage Utilization | 70% | 50-80% | > 85% |
| Network Utilization | 40% | 20-60% | > 70% |
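These thresholds map naturally onto a three-level classification. In the sketch below, the "warn" level for values that fall between the acceptable range and the alert threshold is an assumption, since the table defines only the range and the alert bounds.

```python
# resource: (acceptable_low, acceptable_high, alert_below, alert_above)
# alert_below is None where the table defines only an upper alert bound.
THRESHOLDS = {
    "cpu": (50, 80, 40, 85),
    "memory": (60, 85, 50, 90),
    "gpu": (70, 95, 60, 95),
    "storage": (50, 80, None, 85),
    "network": (20, 60, None, 70),
}

def classify(resource: str, utilization_pct: float) -> str:
    """Return 'ok', 'warn', or 'alert' for a utilization percentage."""
    low, high, alert_below, alert_above = THRESHOLDS[resource]
    if alert_below is not None and utilization_pct < alert_below:
        return "alert"
    if utilization_pct > alert_above:
        return "alert"
    if low <= utilization_pct <= high:
        return "ok"
    return "warn"

if __name__ == "__main__":
    for resource, pct in [("cpu", 65), ("cpu", 83), ("storage", 30), ("network", 75)]:
        print(resource, pct, classify(resource, pct))
```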
Overcommit Ratios and Risk
Overcommit Strategy:
┌─────────────────┬───────────┬──────────────┬─────────────────┐
│ Resource        │ Ratio     │ Risk Level   │ Mitigation      │
├─────────────────┼───────────┼──────────────┼─────────────────┤
│ CPU             │ 4:1       │ Medium       │ Throttling      │
│ Memory          │ 1.5:1     │ High         │ OOM killer      │
│ GPU             │ 1:1       │ N/A          │ No overcommit   │
│ Ephemeral Disk  │ 1.5:1     │ Medium       │ Eviction        │
│ Network         │ 5:1       │ Low          │ Traffic shaping │
└─────────────────┴───────────┴──────────────┴─────────────────┘
Cost Impact:
- Without overcommit: $10M/month infrastructure
- With CPU 4:1 overcommit: $4M/month (60% savings)
- Risk: 2-5% of workloads experience throttling during peaks
Cost Per Resource Unit
Hourly Cost Model (approximate):
- 1 CPU core: $0.03/hour
- 1 GB RAM: $0.004/hour
- 1 GPU (A100): $3.00/hour
- 1 TB SSD: $0.10/hour
- 1 TB HDD: $0.02/hour
- 1 Gbps network: $0.01/hour
Monthly Cluster Cost (50,000 nodes):
- Compute: $70M (hardware amortization + power + cooling)
- Network: $5M
- Storage: $15M
- Operations: $10M
- Total: ~$100M/month
Efficiency Target:
- Improve utilization from 40% → 65% = $37.5M/month savings
- Justifies significant investment in scheduler sophistication
Chargeback Model
Tenant Billing Dimensions:
1. Reserved capacity (guaranteed, pay regardless of use)
2. On-demand usage (actual consumption, higher rate)
3. Burst/spot usage (preemptible, lowest rate)
4. Priority premium (higher priority = higher cost multiplier)
Priority Cost Multipliers:
- Critical (P0): 3.0x base rate
- High (P1): 2.0x base rate
- Normal (P2): 1.0x base rate
- Low (P3): 0.5x base rate
- Best-effort (P4): 0.2x base rate
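A chargeback calculation could combine the hourly unit costs with the priority multipliers. The sketch below assumes the hourly cost model serves as the "base rate", which this document does not state explicitly, and it ignores the reserved, on-demand, and spot distinctions for brevity. All names are illustrative.

```python
# Unit costs from the hourly cost model (USD per unit-hour).
HOURLY_RATE = {"cpu_core": 0.03, "ram_gb": 0.004, "gpu_a100": 3.00}

# Priority cost multipliers applied to the base rate.
PRIORITY_MULTIPLIER = {"P0": 3.0, "P1": 2.0, "P2": 1.0, "P3": 0.5, "P4": 0.2}

def hourly_charge(usage: dict, priority: str) -> float:
    """Charge in USD for one hour of the given resource usage."""
    base = sum(HOURLY_RATE[resource] * qty for resource, qty in usage.items())
    return round(base * PRIORITY_MULTIPLIER[priority], 4)

if __name__ == "__main__":
    usage = {"cpu_core": 100, "ram_gb": 400, "gpu_a100": 2}
    for priority in PRIORITY_MULTIPLIER:
        print(priority, hourly_charge(usage, priority))
```

Rounding happens once at the end so that per-resource rounding errors do not accumulate across large tenants.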
Constraint Summary for Architecture Decisions
| Constraint | Impact on Design |
| --- | --- |
| 50K nodes | Partitioned scheduling, hierarchical state |
| 500 alloc/sec sustained | Optimistic concurrency, batch processing |
| 100ms P95 scheduling | In-memory state, pre-computed scores |
| 5M active allocations | Sharded state store, efficient indexing |
| 50K heartbeats/sec | Dedicated heartbeat path, not through API server |
| 8GB etcd limit | Tiered storage, external state for history |
| 1000 tenants | Hierarchical quotas, fair-share scheduling |
| 4:1 CPU overcommit | Monitoring, throttling, priority-based eviction |
| $100M/month cost | Every 1% utilization improvement = $1M savings |
These constraints directly inform the choice of scheduling algorithm, state management approach, and system architecture described in subsequent sections.