Scale & Constraints

Part 2 of 10

Resource Allocation Service - Scale and Constraints

Overview

A production-grade resource allocation service must handle the scale of modern cloud infrastructure. This document defines the quantitative constraints that drive architectural decisions, referencing real-world systems like Google Borg (managing millions of tasks across hundreds of thousands of machines), Kubernetes (supporting 5,000+ node clusters), and YARN (managing petabytes of resources across thousands of nodes).


Cluster Scale

Node Inventory

Metric | Small Cluster | Medium Cluster | Large Cluster (Target)
Total Nodes | 500 | 5,000 | 50,000
CPU Cores (total) | 32,000 | 320,000 | 3,200,000
Memory (total) | 128 TB | 1.28 PB | 12.8 PB
GPU Units | 200 | 2,000 | 20,000
FPGA Units | 50 | 500 | 5,000
NVMe Storage | 500 TB | 5 PB | 50 PB
Network Bandwidth | 25 Gbps/node | 25-100 Gbps/node | 100 Gbps/node

Node Configurations (Heterogeneous Fleet)

Standard Compute Node:
- 64 CPU cores (Intel Xeon / AMD EPYC)
- 256 GB RAM
- 2 TB NVMe SSD
- 25 Gbps network

GPU-Accelerated Node:
- 48 CPU cores
- 512 GB RAM
- 8x NVIDIA A100 (80GB each)
- 4 TB NVMe SSD
- 100 Gbps network (RDMA-capable)

High-Memory Node:
- 128 CPU cores
- 2 TB RAM
- 8 TB NVMe SSD
- 25 Gbps network

Storage-Optimized Node:
- 32 CPU cores
- 128 GB RAM
- 24x 16TB HDD + 2x 3.2TB NVMe
- 25 Gbps network

Reference: Real-World Scale

  • Google Borg: Manages hundreds of thousands of machines, billions of task-hours per month
  • Kubernetes: Officially supports up to 5,000 nodes per cluster, 150,000 pods
  • YARN at Meta: 300,000+ nodes, managing exabytes of storage
  • Azure: Millions of VMs across global regions

Request Volume

Allocation Request Throughput

Operation | Average Rate | Peak Rate | Burst (10s window)
Pod/Task Scheduling | 500/sec | 5,000/sec | 10,000/sec
Resource Allocation Requests | 1,000/sec | 10,000/sec | 25,000/sec
Resource Release | 800/sec | 8,000/sec | 20,000/sec
Quota Check | 5,000/sec | 50,000/sec | 100,000/sec
Status Queries | 10,000/sec | 100,000/sec | 200,000/sec
Node Heartbeats | 50,000/sec (1/sec/node) | 50,000/sec | 50,000/sec
Scheduling Decisions | 500/sec | 5,000/sec | 10,000/sec

Request Patterns

Daily Pattern:
- Business hours (9am-6pm): 3x average load
- Batch job submission (2am-4am): 5x burst for ETL/ML training
- Deployment windows: 10x burst during rolling updates
- Auto-scaling events: Correlated bursts across tenants

Weekly Pattern:
- Monday morning: 2x average (cold start after weekend)
- Friday evening: 0.5x average (workload wind-down)
- End of month: 1.5x (batch reporting jobs)

Resource Types and Quantities

Compute Resources

Resource Type | Unit | Granularity | Overcommit Ratio
CPU | millicores | 1m (0.001 core) | 2:1 - 10:1
Memory | bytes | 1 MiB | 1.25:1 - 2:1
GPU | devices | 1 GPU or fractional (MIG) | 1:1 (no overcommit)
FPGA | devices | 1 FPGA | 1:1
Ephemeral Storage | bytes | 1 GiB | 1.5:1
Persistent Storage | bytes | 1 GiB | 1:1
Network Bandwidth | bits/sec | 1 Mbps | 3:1 - 5:1
Hugepages (2Mi) | pages | 1 page | 1:1
Hugepages (1Gi) | pages | 1 page | 1:1
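As a rough illustration of how an overcommit ratio turns physical capacity into schedulable capacity, the sketch below multiplies a node's physical resources by ratios from the ranges above. The helper name `schedulable_capacity` is hypothetical, and the Standard Compute Node's 256 GB of RAM is treated as 256 GiB for simplicity.

```python
def schedulable_capacity(physical: float, overcommit_ratio: float) -> float:
    """Capacity the scheduler may hand out for a given overcommit ratio.

    Hypothetical helper for illustration; a ratio of 1.0 means no overcommit.
    """
    if overcommit_ratio < 1.0:
        raise ValueError("overcommit ratio must be >= 1.0")
    return physical * overcommit_ratio


# Standard Compute Node: 64 cores, 256 GiB RAM (assumed GiB), no GPUs.
cpu_millicores = schedulable_capacity(64 * 1000, 4.0)   # CPU at 4:1
memory_mib = schedulable_capacity(256 * 1024, 1.5)      # memory at 1.5:1

# GPU-Accelerated Node: 8 GPUs, never overcommitted.
gpus = schedulable_capacity(8, 1.0)

print(cpu_millicores, memory_mib, gpus)
```

The ratios chosen here (4:1 for CPU, 1.5:1 for memory) are one point inside the ranges in the table, not a recommendation.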

Extended Resources

Custom/Extended Resources:
- software-licenses/matlab: 500 total seats
- hardware/infiniband: 200 ports
- hardware/rdma: 200 interfaces
- accelerators/tpu-v4: 1,000 chips
- networking/elastic-ip: 10,000 addresses
- custom/encryption-slots: 5,000 HSM slots

Resource Dimensions per Allocation

A single allocation request may specify:

  • Requests (guaranteed minimum): CPU 100m, Memory 256Mi
  • Limits (maximum allowed): CPU 2000m, Memory 4Gi
  • Node Affinity: zone=us-east-1a, gpu-type=a100
  • Anti-Affinity: spread across failure domains
  • Tolerations: Accept tainted nodes (e.g., GPU-only nodes)
  • Topology Constraints: Max skew across zones
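One possible way to model these dimensions is sketched below. The class and field names (`AllocationRequest`, `Node`, `is_feasible`) are assumptions made for illustration, and only three of the dimensions (requests, node affinity, tolerations) are checked; anti-affinity and topology spread are left out to keep the example short.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    labels: dict[str, str]
    taints: set[str]
    free_cpu_m: int      # unallocated CPU, in millicores
    free_mem_mib: int    # unallocated memory, in MiB


@dataclass
class AllocationRequest:
    cpu_request_m: int
    mem_request_mib: int
    cpu_limit_m: int
    mem_limit_mib: int
    node_affinity: dict[str, str] = field(default_factory=dict)
    tolerations: set[str] = field(default_factory=set)


def is_feasible(req: AllocationRequest, node: Node) -> bool:
    """Feasibility filter: requests must fit, labels must match, taints must be tolerated."""
    # Requests (the guaranteed minimum), not limits, must fit in free capacity.
    if req.cpu_request_m > node.free_cpu_m or req.mem_request_mib > node.free_mem_mib:
        return False
    # Every affinity label must be present on the node with the same value.
    if any(node.labels.get(k) != v for k, v in req.node_affinity.items()):
        return False
    # Every taint on the node must be covered by a toleration.
    return node.taints <= req.tolerations


req = AllocationRequest(
    cpu_request_m=100, mem_request_mib=256,
    cpu_limit_m=2000, mem_limit_mib=4096,
    node_affinity={"zone": "us-east-1a", "gpu-type": "a100"},
    tolerations={"gpu-only"},
)
gpu_node = Node({"zone": "us-east-1a", "gpu-type": "a100"}, {"gpu-only"}, 48_000, 524_288)
cpu_node = Node({"zone": "us-east-1a"}, set(), 64_000, 262_144)

print(is_feasible(req, gpu_node), is_feasible(req, cpu_node))
```

The request fits the GPU node but is rejected by the plain compute node because the `gpu-type` label is missing there.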

Tenant Scale

Multi-Tenancy Dimensions

Metric | Value
Total Tenants (Organizations) | 1,000
Namespaces per Tenant | 50-200
Total Namespaces | 100,000
Resource Quotas (active) | 500,000
Service Accounts | 1,000,000
Active Jobs/Pods per Tenant | 500-50,000
Total Active Allocations | 5,000,000

Quota Hierarchy

Organization (Tenant)
├── Team A (50% of org quota)
│   ├── Namespace: production (70% of team quota)
│   ├── Namespace: staging (20% of team quota)
│   └── Namespace: development (10% of team quota)
├── Team B (30% of org quota)
│   ├── Namespace: ml-training (80% of team quota)
│   └── Namespace: ml-inference (20% of team quota)
└── Team C (20% of org quota)
    └── Namespace: batch-processing (100% of team quota)
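The percentages in this hierarchy resolve to absolute quotas by multiplying down the tree. The sketch below does that with integer percentages to avoid floating-point drift. The function name `resolve_quotas` and the 10,000-core organization quota are assumptions chosen for illustration.

```python
def resolve_quotas(
    org_quota: int,
    tree: dict[str, tuple[int, dict[str, int]]],
) -> dict[str, int]:
    """Turn percentage shares (team % of org, namespace % of team) into absolute quotas."""
    resolved: dict[str, int] = {}
    for team, (team_pct, namespaces) in tree.items():
        for namespace, ns_pct in namespaces.items():
            # org_quota * team_pct/100 * ns_pct/100, kept in integer arithmetic
            resolved[f"{team}/{namespace}"] = org_quota * team_pct * ns_pct // 10_000
    return resolved


# Shares copied from the hierarchy above; the org quota is hypothetical.
tree = {
    "team-a": (50, {"production": 70, "staging": 20, "development": 10}),
    "team-b": (30, {"ml-training": 80, "ml-inference": 20}),
    "team-c": (20, {"batch-processing": 100}),
}
quotas = resolve_quotas(10_000, tree)

for path, cores in quotas.items():
    print(f"{path}: {cores} cores")
```

Because every level's shares sum to 100%, the namespace quotas add back up to the organization quota, which is a useful invariant to check when quotas are edited.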

Per-Namespace Quota Example:
- CPU: 1000 cores (request), 2000 cores (limit)
- Memory: 4 TB (request), 8 TB (limit)
- GPU: 100 devices
- Pods: 10,000
- Services: 500
- PersistentVolumeClaims: 1,000

Scheduling Latency Targets

Latency SLOs

Operation | P50 | P95 | P99 | Max Acceptable
Pod Scheduling (simple) | 5ms | 20ms | 50ms | 100ms
Pod Scheduling (complex constraints) | 20ms | 100ms | 500ms | 2s
Batch Job Admission | 50ms | 200ms | 1s | 5s
Gang Scheduling (multi-pod) | 100ms | 500ms | 2s | 10s
Preemption Decision | 10ms | 50ms | 200ms | 500ms
Quota Check | 1ms | 5ms | 10ms | 50ms
Node Scoring (per node) | 0.01ms | 0.05ms | 0.1ms | 1ms
Resource Reservation | 2ms | 10ms | 50ms | 100ms

Scheduling Throughput Targets

Kubernetes Reference (5,000 node cluster):
- Scheduling throughput: 100+ pods/sec sustained
- Startup latency (schedule to running): < 5 seconds
- API server latency: < 1 second for 99th percentile

Our Target (50,000 node cluster):
- Scheduling throughput: 500+ allocations/sec sustained
- Burst scheduling: 5,000 allocations/sec for 30 seconds
- End-to-end allocation latency: < 2 seconds (P99)
- Scheduling queue drain time: < 60 seconds after burst
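The burst and drain targets together imply a minimum processing rate. A back-of-the-envelope sketch, assuming arrivals drop back to the 500/sec average as soon as the burst ends and that the scheduler processes at a constant rate (both are simplifying assumptions, and `min_service_rate` is a hypothetical helper):

```python
def min_service_rate(avg_rate: float, burst_rate: float,
                     burst_secs: float, drain_secs: float) -> float:
    """Smallest constant service rate that clears the burst backlog in time.

    Backlog built during the burst:   (burst_rate - mu) * burst_secs
    Backlog removed during the drain: (mu - avg_rate) * drain_secs
    Setting the two equal and solving for mu gives the expression below.
    """
    return (burst_rate * burst_secs + avg_rate * drain_secs) / (burst_secs + drain_secs)


mu = min_service_rate(avg_rate=500, burst_rate=5_000, burst_secs=30, drain_secs=60)
peak_backlog = (5_000 - mu) * 30

print(f"required service rate: {mu:.0f}/sec, peak queue depth: {peak_backlog:.0f}")
```

Under these assumptions the scheduler needs about 2,000 allocations/sec of real capacity, four times the sustained target, and the queue must hold roughly 90,000 pending requests at the end of the burst.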

Latency Budget Breakdown

Total Scheduling Latency Budget: 100ms (P95 for simple pods)

1. API Admission + Validation:     10ms
2. Queue Wait Time:                 5ms (P50), 50ms (P95)
3. Feasibility Filtering:          15ms (filter 50K nodes → ~500 candidates)
4. Scoring/Ranking:                25ms (score 500 candidates)
5. Optimistic Binding:             10ms
6. Persistence (etcd write):       15ms
7. Node Notification:              20ms
                                   ─────
Total:                             100ms (with the P50 queue wait; 145ms with the P95 queue wait)
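Summing the stages makes the role of queue wait explicit. The sketch below copies the stage values from the breakdown above; the dictionary and function names are illustrative only.

```python
# Stage budgets in milliseconds, copied from the breakdown above.
FIXED_STAGES_MS = {
    "api_admission_validation": 10,
    "feasibility_filtering": 15,
    "scoring_ranking": 25,
    "optimistic_binding": 10,
    "persistence_etcd_write": 15,
    "node_notification": 20,
}
QUEUE_WAIT_MS = {"p50": 5, "p95": 50}
BUDGET_MS = 100


def total_latency_ms(queue_percentile: str) -> int:
    """Total of all fixed stages plus queue wait at the chosen percentile."""
    return sum(FIXED_STAGES_MS.values()) + QUEUE_WAIT_MS[queue_percentile]


for pct in ("p50", "p95"):
    total = total_latency_ms(pct)
    status = "within" if total <= BUDGET_MS else "over"
    print(f"queue wait {pct}: {total}ms ({status} the {BUDGET_MS}ms budget)")
```

The fixed stages alone consume 95ms, so the 100ms budget only holds while queue wait stays near its P50; at the P95 queue wait the total reaches 145ms.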

Storage Requirements

State Categories

State Type | Storage System | Size | Access Pattern
Resource Inventory (current) | etcd / Redis | 500 MB | Read-heavy, strong consistency
Active Allocations | etcd / Redis | 2 GB | Read-write, strong consistency
Scheduling Queue | In-memory + WAL | 200 MB | Write-heavy, ordered
Quota State | etcd / PostgreSQL | 100 MB | Read-heavy, strong consistency
Node Health/Heartbeats | In-memory + Redis | 1 GB | Write-heavy, time-series
Allocation History | PostgreSQL / ClickHouse | 5 TB/year | Append-only, analytical
Audit Logs | Kafka → S3 | 10 TB/year | Append-only, compliance
Metrics/Telemetry | Prometheus / InfluxDB | 2 TB/year | Time-series, aggregated

etcd Constraints (Critical Path)

etcd Limitations (Kubernetes reference):
- Max database size: 8 GB (recommended maximum; the default quota is 2 GB)
- Max request size: 1.5 MiB (default)
- Max key-value size: bounded by the 1.5 MiB request limit
- Recommended QPS: < 10,000 read/sec, < 1,000 write/sec
- Raft consensus latency: 1-10ms (same datacenter)

Implications:
- Cannot store all allocation history in etcd
- Must shard or tier state across storage systems
- Node heartbeats should NOT go through etcd
- Use watch mechanism for change propagation

Storage Growth Projections

Year 1:
- Active state: 5 GB
- Historical data: 5 TB
- Audit logs: 10 TB
- Total: ~15 TB

Year 3 (10x growth; history and audit figures are cumulative, sized as three years at the 10x annual rate):
- Active state: 50 GB
- Historical data: 150 TB
- Audit logs: 300 TB
- Total: ~450 TB (with compression: ~100 TB)

Network Requirements

Control Plane Traffic

Traffic Type | Rate | Payload Size | Protocol
Node Heartbeats | 50,000/sec | 1-5 KB | gRPC
Resource Status Updates | 10,000/sec | 2-10 KB | gRPC streaming
API Requests (external) | 10,000/sec | 1-50 KB | HTTPS/REST
Scheduler ↔ etcd | 5,000/sec | 0.5-5 KB | gRPC
Watch Notifications | 100,000/sec | 0.5-2 KB | gRPC streaming
Inter-scheduler Sync | 1,000/sec | 10-100 KB | gRPC

Bandwidth Calculations

Control Plane Bandwidth:
- Heartbeats: 50,000/sec × 2KB = 100 MB/sec = 800 Mbps
- Status updates: 10,000/sec × 5KB = 50 MB/sec = 400 Mbps
- API traffic: 10,000/sec × 10KB = 100 MB/sec = 800 Mbps
- Watch streams: 100,000/sec × 1KB = 100 MB/sec = 800 Mbps
- Total control plane: ~3 Gbps sustained
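The same arithmetic can be reproduced in a few lines, which makes it easy to re-run with different payload assumptions. The rates and payload sizes below are the ones used in the calculation above; decimal units (1 KB = 1,000 bytes) are assumed, and the names are illustrative.

```python
# (messages per second, assumed average payload in KB)
FLOWS = {
    "heartbeats": (50_000, 2),
    "status_updates": (10_000, 5),
    "api_traffic": (10_000, 10),
    "watch_streams": (100_000, 1),
}


def flow_mbps(rate_per_sec: int, payload_kb: int) -> int:
    """Bandwidth in Mbps: KB/sec * 8 bits per byte / 1,000 kilobits per megabit."""
    return rate_per_sec * payload_kb * 8 // 1_000


per_flow = {name: flow_mbps(rate, kb) for name, (rate, kb) in FLOWS.items()}
total_mbps = sum(per_flow.values())

for name, mbps in per_flow.items():
    print(f"{name}: {mbps} Mbps")
print(f"total: {total_mbps / 1000:.1f} Gbps")
```

The four flows add up to 2.8 Gbps, which the text rounds to about 3 Gbps. Scheduler-to-etcd traffic and inter-scheduler sync are not included in this total.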

Network Requirements:
- Scheduler nodes: 10 Gbps NIC minimum
- etcd nodes: 10 Gbps NIC, < 1ms RTT between peers
- API gateway: 25 Gbps aggregate
- Cross-AZ traffic: Minimize (cost + latency)

Failure Detection Timing

Heartbeat Configuration:
- Heartbeat interval: 10 seconds
- Heartbeat timeout: 40 seconds (4 missed heartbeats)
- Node lease duration: 40 seconds
- Pod eviction timeout: 5 minutes (after node marked NotReady)

Implications:
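- 50,000 nodes × 1 heartbeat/10s = 5,000 heartbeats/sec at this interval (the 50,000/sec figure used elsewhere in this document assumes 1 heartbeat/sec/node, a 10x more conservative sizing basis)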
- Must process within 1 second to avoid false positives
- Network partitions detected within 40 seconds
- Full rescheduling triggered within 5 minutes
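A minimal sketch of the timeout logic described above, assuming the scheduler keeps a map from node name to the timestamp of its last heartbeat. The function names and the in-memory dictionary are hypothetical; a real implementation would also need to handle clock skew and lease renewal.

```python
HEARTBEAT_INTERVAL_S = 10.0
HEARTBEAT_TIMEOUT_S = 40.0   # 4 missed heartbeats


def heartbeat_rate(nodes: int, interval_s: float) -> float:
    """Cluster-wide heartbeats per second for a given per-node interval."""
    return nodes / interval_s


def not_ready_nodes(last_seen: dict[str, float], now: float,
                    timeout_s: float = HEARTBEAT_TIMEOUT_S) -> list[str]:
    """Names of nodes whose last heartbeat is older than the timeout."""
    return sorted(name for name, ts in last_seen.items() if now - ts > timeout_s)


rate = heartbeat_rate(50_000, HEARTBEAT_INTERVAL_S)

# Timestamps are seconds on an arbitrary clock, chosen for the example.
last_seen = {"node-a": 100.0, "node-b": 65.0, "node-c": 59.0}
stale = not_ready_nodes(last_seen, now=100.0)

print(rate, stale)
```

Here `node-b` has been silent for 35 seconds and is still considered healthy, while `node-c` at 41 seconds has crossed the 40-second timeout and would be marked NotReady.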

Cost Modeling

Resource Utilization Targets

Metric | Target | Acceptable Range | Alert Threshold
CPU Utilization (cluster avg) | 65% | 50-80% | < 40% or > 85%
Memory Utilization | 75% | 60-85% | < 50% or > 90%
GPU Utilization | 80% | 70-95% | < 60% or > 95%
Storage Utilization | 70% | 50-80% | > 85%
Network Utilization | 40% | 20-60% | > 70%

Overcommit Ratios and Risk

Overcommit Strategy:
┌─────────────────┬───────────┬──────────────┬─────────────────┐
│ Resource        │ Ratio     │ Risk Level   │ Mitigation      │
├─────────────────┼───────────┼──────────────┼─────────────────┤
│ CPU             │ 4:1       │ Medium       │ Throttling      │
│ Memory          │ 1.5:1     │ High         │ OOM killer      │
│ GPU             │ 1:1       │ N/A          │ No overcommit   │
│ Ephemeral Disk  │ 1.5:1     │ Medium       │ Eviction        │
│ Network         │ 5:1       │ Low          │ Traffic shaping │
└─────────────────┴───────────┴──────────────┴─────────────────┘

Cost Impact:
- Without overcommit: $10M/month infrastructure
- With CPU 4:1 overcommit: $4M/month (60% savings)
- Risk: 2-5% of workloads experience throttling during peaks

Cost Per Resource Unit

Hourly Cost Model (approximate):
- 1 CPU core: $0.03/hour
- 1 GB RAM: $0.004/hour
- 1 GPU (A100): $3.00/hour
- 1 TB SSD: $0.10/hour
- 1 TB HDD: $0.02/hour
- 1 Gbps network: $0.01/hour

Monthly Cluster Cost (50,000 nodes):
- Compute: $70M (hardware amortization + power + cooling)
- Network: $5M
- Storage: $15M
- Operations: $10M
- Total: ~$100M/month

Efficiency Target:
- Improve utilization from 40% → 65% = ~$38.5M/month savings (the same work fits on 40/65 ≈ 61.5% of the fleet)
- Justifies significant investment in scheduler sophistication
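The efficiency figure can be derived under one simplifying assumption: cluster cost scales linearly with provisioned capacity, so running the same work at higher utilization needs proportionally less hardware. The helper name below is hypothetical.

```python
def monthly_savings(monthly_cost: float, util_from: float, util_to: float) -> float:
    """Savings from running the same work at higher utilization.

    The work occupies util_from of today's capacity; at util_to it needs only
    util_from / util_to of that capacity, and cost is assumed proportional.
    """
    if not 0 < util_from <= util_to <= 1:
        raise ValueError("expected 0 < util_from <= util_to <= 1")
    return monthly_cost * (1 - util_from / util_to)


savings_musd = monthly_savings(100.0, 0.40, 0.65)   # cost in $M/month
print(f"~${savings_musd:.1f}M/month")
```

With the $100M/month cluster cost this gives roughly $38.5M/month, close to the savings figure quoted above. The linear-cost assumption ignores fixed costs such as operations staff, so the real number would likely be somewhat lower.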

Chargeback Model

Tenant Billing Dimensions:
1. Reserved capacity (guaranteed, pay regardless of use)
2. On-demand usage (actual consumption, higher rate)
3. Burst/spot usage (preemptible, lowest rate)
4. Priority premium (higher priority = higher cost multiplier)

Priority Cost Multipliers:
- Critical (P0): 3.0x base rate
- High (P1): 2.0x base rate
- Normal (P2): 1.0x base rate
- Low (P3): 0.5x base rate
- Best-effort (P4): 0.2x base rate
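Combining the per-unit hourly rates with the priority multipliers gives a simple charge calculation. The sketch below covers only the priority dimension; reserved, on-demand, and spot pricing tiers are not modeled, and the names `BASE_RATES_USD` and `hourly_charge` are assumptions for illustration.

```python
# Hourly base rates in USD, taken from the cost model in this document.
BASE_RATES_USD = {
    "cpu_core": 0.03,
    "ram_gb": 0.004,
    "gpu_a100": 3.00,
}

PRIORITY_MULTIPLIER = {
    "P0": 3.0,   # Critical
    "P1": 2.0,   # High
    "P2": 1.0,   # Normal
    "P3": 0.5,   # Low
    "P4": 0.2,   # Best-effort
}


def hourly_charge(usage: dict[str, float], priority: str) -> float:
    """Base cost of the resources used, scaled by the priority multiplier."""
    base = sum(BASE_RATES_USD[resource] * qty for resource, qty in usage.items())
    return base * PRIORITY_MULTIPLIER[priority]


# Example workload: 8 cores, 32 GB RAM, one A100.
usage = {"cpu_core": 8, "ram_gb": 32, "gpu_a100": 1}
charge_p1 = hourly_charge(usage, "P1")
charge_p4 = hourly_charge(usage, "P4")

print(f"P1: ${charge_p1:.2f}/hour, P4: ${charge_p4:.2f}/hour")
```

The same workload costs ten times more per hour at P1 than at P4, which is the incentive the multipliers are meant to create: tenants pay for priority only where they need it.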

Constraint Summary for Architecture Decisions

Constraint | Impact on Design
50K nodes | Partitioned scheduling, hierarchical state
500 alloc/sec sustained | Optimistic concurrency, batch processing
100ms P95 scheduling | In-memory state, pre-computed scores
5M active allocations | Sharded state store, efficient indexing
50K heartbeats/sec | Dedicated heartbeat path, not through API server
8GB etcd limit | Tiered storage, external state for history
1000 tenants | Hierarchical quotas, fair-share scheduling
4:1 CPU overcommit | Monitoring, throttling, priority-based eviction
$100M/month cost | Every 1% utilization improvement = $1M savings

These constraints directly inform the choice of scheduling algorithm, state management approach, and system architecture described in subsequent sections.