Resource Allocation Service - Tradeoffs and Alternatives
Overview
Resource allocation system design involves fundamental tradeoffs between scheduling quality, latency, fairness, and system complexity. This section analyzes the key architectural decisions, comparing approaches used by production systems: Google Borg (centralized, optimistic), Omega (shared-state, parallel), Sparrow (distributed, sampling-based), Kubernetes (centralized, plugin-based), Mesos (two-level, offer-based), and YARN (centralized with application-level scheduling).
Centralized vs Distributed Scheduling
Centralized Scheduler (Borg / Kubernetes Model)
Architecture:
┌──────────────────────────┐
│ Single Scheduler │
│ (or active-standby) │
│ │
│ - Full cluster view │
│ - Optimal placement │
│ - Single point of truth │
└────────────┬─────────────┘
│
┌──────────┼──────────┐
│ │ │
Node 1 Node 2 Node NAdvantages:
- Global optimization: sees all resources, makes globally optimal decisions
- Simpler consistency: single writer for resource state
- Easier policy enforcement: quotas, priorities, fairness in one place
- Proven at scale: Borg manages Google's entire fleet centrally
Disadvantages:
- Throughput bottleneck: single scheduler limits scheduling rate
- Head-of-line blocking: complex scheduling decisions delay simple ones
- Single point of failure: requires HA (active-standby adds failover latency)
- Scaling ceiling: ~5,000 nodes per scheduler instance (Kubernetes)
When to choose: Most production deployments. Simpler to reason about, debug, and operate. Sufficient for clusters up to 10K nodes with optimization.
Shared-State Parallel Scheduling (Omega Model)
Architecture:
┌─────────────────────────────────────────┐
│ Shared Cluster State │
│ (optimistic concurrency control) │
└───┬──────────┬──────────┬──────────┬───┘
│ │ │ │
Scheduler Scheduler Scheduler Scheduler
(Service) (Batch) (ML) (System)Advantages:
- Parallel scheduling: multiple schedulers operate simultaneously
- No head-of-line blocking: each scheduler handles its own workload type
- Higher throughput: linear scaling with number of schedulers
- Specialization: each scheduler optimized for its workload
Disadvantages:
- Conflict resolution: schedulers may claim same resources (5-30% conflict rate)
- Wasted work: conflicting decisions must be retried
- Complexity: distributed state management, conflict detection
- Fairness harder: no single arbiter for cross-scheduler fairness
When to choose: Very large clusters (50K+ nodes) with diverse workload types where a single scheduler is a bottleneck.
Distributed Scheduling (Sparrow Model)
Architecture:
Job arrives → Pick d random nodes → Probe for availability → Schedule on first available
No central state. Each scheduling decision is independent.
Uses "power of two choices" or "batch sampling" for quality.Advantages:
- Extremely low latency: O(1) scheduling decisions
- No single point of failure: fully decentralized
- Infinite horizontal scaling: add more scheduler instances
- Simple implementation: no distributed consensus needed
Disadvantages:
- Suboptimal placement: no global view, can't do bin packing
- No constraint support: affinity, anti-affinity, topology impossible
- No preemption: can't reason about priorities globally
- Poor for heterogeneous clusters: random probing misses specialized nodes
When to choose: Homogeneous clusters with simple, short-lived tasks (e.g., serverless function invocations, sub-second tasks). Not suitable for complex scheduling requirements.
Decision Matrix
| Factor | Centralized | Shared-State | Distributed |
|---|---|---|---|
| Scheduling quality | Optimal | Near-optimal | Acceptable |
| Latency (P50) | 5-20ms | 5-20ms | <1ms |
| Throughput | 100-500/sec | 500-5000/sec | 10000+/sec |
| Constraint support | Full | Full | None/Limited |
| Complexity | Low | High | Low |
| Failure handling | Failover delay | Graceful | No impact |
| Cluster size | <10K nodes | 10K-100K nodes | Any |
Optimistic vs Pessimistic Concurrency for Resource Claims
Optimistic Concurrency (Omega / etcd model)
Flow:
1. Read current state (resource_version = 42)
2. Make scheduling decision locally
3. Write new state with expected version (CAS: version 42 → 43)
4. If conflict: re-read, re-decide, retry
Characteristics:
- No locks held during decision-making
- High throughput under low contention
- Degrades under high contention (many retries)
- Used by: Kubernetes (etcd), Omega, CockroachDBBest for: Read-heavy workloads, low-to-medium contention, when scheduling decisions take time (complex constraints).
Pessimistic Concurrency (Lock-based)
Flow:
1. Acquire lock on target node's resources
2. Make scheduling decision
3. Update state
4. Release lock
Characteristics:
- Guaranteed no conflicts
- Lower throughput (lock contention)
- Risk of deadlocks with multiple resources
- Simpler retry logic (no conflicts to handle)
- Used by: Traditional databases, some YARN implementationsBest for: High-contention scenarios, simple scheduling decisions, when correctness is more important than throughput.
Hybrid Approach (Recommended)
Strategy:
- Use optimistic concurrency for the common path (90% of allocations)
- Fall back to pessimistic locking for:
- Gang scheduling (must lock multiple nodes atomically)
- Preemption (must ensure eviction before new placement)
- Quota updates (must be strictly serialized)
Implementation:
- etcd with resource versions for node bindings (optimistic)
- Distributed locks (Redis/etcd) for gang scheduling (pessimistic)
- Serialized queue for quota counter updates (pessimistic)Fair Share vs Priority-Based vs Deadline-Based Scheduling
Fair Share Scheduling (DRF - Dominant Resource Fairness)
Principle: Each tenant gets resources proportional to their fair share weight.
Example (2 tenants, equal weight):
Cluster: 100 CPU, 100 GPU
Tenant A (CPU-heavy): dominant resource = CPU
Tenant B (GPU-heavy): dominant resource = GPU
DRF allocation:
- Tenant A: 67 CPU, 33 GPU (dominant share: 67%)
- Tenant B: 33 CPU, 67 GPU (dominant share: 67%)
- Both tenants have equal dominant resource share
Properties:
+ Sharing incentive: no tenant is worse off than equal partition
+ Strategy-proof: tenants can't game the system by lying about needs
+ Envy-free: no tenant prefers another's allocation
+ Pareto efficient: can't improve one without hurting another
Limitations:
- Ignores urgency: a critical production job gets same share as dev experiment
- Slow convergence: takes time to reach fair state after changes
- Doesn't handle deadlines or SLAsPriority-Based Scheduling
Principle: Higher priority jobs always scheduled before lower priority.
Priority Levels (Kubernetes model):
2000000000: system-node-critical (kubelet, kube-proxy)
1000000000: system-cluster-critical (DNS, scheduler)
1000: production-critical
500: production-normal
100: batch-high
0: batch-normal (default)
-100: best-effort
-1000: preemptible
Properties:
+ Clear SLA enforcement: critical jobs always get resources
+ Simple to understand and implement
+ Fast scheduling: just pick highest priority pending job
Limitations:
- Starvation: low-priority jobs may never run
- Priority inflation: everyone wants "critical" priority
- No fairness guarantee between same-priority tenants
- Requires governance: who assigns priorities?Deadline-Based Scheduling (EDF - Earliest Deadline First)
Principle: Schedule jobs closest to their deadline first.
Example:
Job A: needs 4 GPU, deadline in 2 hours
Job B: needs 2 GPU, deadline in 30 minutes
Job C: needs 8 GPU, deadline in 6 hours
Scheduling order: B (30min) → A (2h) → C (6h)
Properties:
+ Optimal for meeting deadlines (provably optimal for single resource)
+ Natural for batch/ETL workloads with SLAs
+ Self-adjusting: urgency increases as deadline approaches
Limitations:
- Requires accurate runtime estimates (hard to predict)
- Cascading failures: one missed deadline can cause chain reaction
- Not suitable for long-running services (no deadline)
- Complex for multi-resource scheduling (NP-hard)Hybrid Approach (Production Recommendation)
Recommended: Priority bands with fair-share within each band
Band 1 (System Critical): Strict priority, always scheduled first
Band 2 (Production): Fair-share among production tenants
Band 3 (Batch): Fair-share with deadline awareness
Band 4 (Best-Effort): Backfill only, fully preemptible
Within each band:
- Fair-share ensures no tenant is starved
- Priority breaks ties within fair-share
- Deadline awareness for batch jobs with SLAs
- Aging: jobs waiting too long get priority boostPreemptive vs Non-Preemptive Allocation
Preemptive (Borg / Kubernetes)
Mechanism:
1. High-priority job arrives, no capacity available
2. Scheduler identifies lower-priority victims on target nodes
3. Victims receive graceful termination signal (SIGTERM)
4. After grace period (30s default), forced kill (SIGKILL)
5. Resources freed, high-priority job scheduled
Preemption Policies:
- PreemptLowerPriority: Can preempt any lower-priority pod
- PreemptNever: Never preempt (wait in queue instead)
- PreemptSamePriority: Can preempt same or lower priority (rare)
Victim Selection Algorithm:
1. Find nodes where preemption would make job schedulable
2. For each candidate node, find minimum set of victims
3. Prefer victims with: lowest priority, most recent start, no PDB
4. Select node that minimizes total preemption impactAdvantages:
- Critical jobs always get resources quickly
- Better resource utilization (no idle reserved capacity)
- Enables overcommitment with safety valve
Disadvantages:
- Wasted work: preempted jobs lose progress (unless checkpointed)
- Cascading preemptions: preempted job may preempt another
- Complexity: PodDisruptionBudgets, grace periods, rescheduling
- User experience: unpredictable for low-priority workloads
Non-Preemptive (Queue-based)
Mechanism:
1. High-priority job arrives, no capacity available
2. Job waits in priority queue
3. When resources freed (job completes/releases), highest priority dequeued
4. No running jobs are ever interrupted
Properties:
- Predictable: once running, job won't be interrupted
- Simpler: no eviction logic, no grace periods
- Wasteful: resources may be held by low-priority jobs while high-priority waitsWhen to choose non-preemptive:
- Workloads that can't be checkpointed (lose all progress on preemption)
- Environments where predictability > efficiency
- Small clusters where preemption cascades are dangerous
Recommendation: Use preemptive scheduling with safeguards:
- PodDisruptionBudgets limit simultaneous evictions
- Grace periods allow checkpointing
- Preemption cooldown prevents thrashing
- Only preempt across priority bands (not within same priority)
Static vs Dynamic Resource Partitioning
Static Partitioning
Model: Fixed resource pools per tenant/workload type
Cluster: 1000 nodes
├── Production pool: 400 nodes (dedicated, never shared)
├── Batch pool: 300 nodes (dedicated to batch)
├── ML pool: 200 nodes (GPU nodes, dedicated to ML)
└── Shared pool: 100 nodes (overflow for any tenant)
Properties:
+ Predictable performance (no noisy neighbors)
+ Simple capacity planning
+ Strong isolation guarantees
- Poor utilization (pools often underutilized)
- Inflexible (can't respond to demand changes)
- Wasteful (production pool idle at night, batch pool idle during day)Dynamic Partitioning
Model: Resources allocated on-demand with quotas as guardrails
Cluster: 1000 nodes (shared pool)
Quotas:
├── Production: guaranteed 400, burst to 600
├── Batch: guaranteed 200, burst to 500
├── ML: guaranteed 150, burst to 300
└── Dev: guaranteed 50, burst to 200
At any time: sum(guaranteed) <= total capacity
Burst capacity: shared, first-come-first-served, preemptible
Properties:
+ Higher utilization (30-50% improvement over static)
+ Responsive to demand changes
+ Burst capacity for unexpected needs
- Noisy neighbor risk (burst traffic affects others)
- More complex scheduling logic
- Harder to predict performanceRecommendation: Dynamic partitioning with guaranteed minimums. Each tenant gets a guaranteed base allocation (never preempted) plus access to burst capacity (preemptible). This provides the isolation benefits of static partitioning with the efficiency of dynamic sharing.
Reservation-Based vs Best-Effort Allocation
Reservation-Based
Model: Resources reserved in advance, guaranteed availability
Use Cases:
- Scheduled batch jobs (reserve GPU for nightly training)
- Maintenance windows (reserve capacity for migration)
- SLA-bound workloads (guarantee resources for production)
Implementation:
- Advance reservation API: "Reserve 8 GPU from 2am-6am on Jan 20"
- Scheduler maintains reservation calendar (interval tree)
- Reserved resources excluded from general pool during window
- Unused reservations released after grace period (15 min)
Trade-offs:
+ Guaranteed availability at scheduled time
+ Predictable for capacity planning
+ Enables offline optimization (schedule reservations optimally)
- Reduces flexibility (reserved resources can't be used by others)
- Waste if reservation not fully utilized
- Complex: cancellation, modification, conflict resolutionBest-Effort Allocation
Model: Request resources when needed, get them if available
Use Cases:
- Interactive workloads (unpredictable demand)
- Development/testing (flexible timing)
- Burst traffic handling
Implementation:
- Submit request, scheduler finds resources immediately
- If unavailable: queue, wait, or fail
- No advance planning needed
Trade-offs:
+ Maximum flexibility and utilization
+ Simple API and mental model
+ Resources available to anyone when not in use
- No availability guarantee
- May wait in queue during peak times
- Hard to plan capacity for time-sensitive workloadsRecommendation: Support both models. Use reservations for predictable, time-sensitive workloads (nightly ML training, scheduled maintenance). Use best-effort for everything else. Allow reserved but unused capacity to be used by best-effort workloads (preemptible when reservation window starts).
Monolithic Scheduler vs Two-Level Scheduler
Monolithic Scheduler (Borg / Kubernetes)
Single scheduler makes ALL placement decisions:
- Knows about all resources
- Knows about all pending jobs
- Applies all policies (priority, fairness, affinity, etc.)
- Single codebase to maintain
Scaling approach: Make the single scheduler faster
- Parallel filtering
- Score caching
- Batch binding
- Equivalence classes (group similar pods)Two-Level Scheduler (Mesos)
Level 1 (Resource Manager): Offers resources to frameworks
Level 2 (Framework Schedulers): Accept offers and place tasks
Example:
Mesos Master: "Here's an offer: Node-5 has 16 CPU, 64GB"
Spark Framework: "I'll take 8 CPU, 32GB for my executor"
Mesos Master: "Offer accepted. Remaining: 8 CPU, 32GB"
Marathon Framework: "I'll take 4 CPU, 16GB for my service"
Problems with Offers:
1. Offer starvation: Framework holds offer without using it
Solution: Offer timeout (5 seconds to accept/reject)
2. Resource hoarding: Framework accepts more than needed
Solution: Revocable offers, usage monitoring
3. Information hiding: Framework doesn't see full cluster state
Solution: Provide cluster summary with offers
4. Offer fragmentation: Small offers that no framework wants
Solution: Offer aggregation, minimum offer sizeComparison
| Aspect | Monolithic | Two-Level |
|---|---|---|
| Scheduling quality | Optimal (global view) | Suboptimal (partial view) |
| Extensibility | Plugin-based | Framework-based |
| Isolation | Shared scheduler | Separate schedulers |
| Complexity | Single complex system | Multiple simpler systems |
| Latency | Depends on queue depth | Offer round-trip adds latency |
| Real-world | Kubernetes, Borg | Mesos, early YARN |
Industry trend: Monolithic with plugins (Kubernetes Scheduling Framework) has won. Two-level scheduling added complexity without proportional benefits for most use cases.
Resource Quotas vs Limits vs Requests
Requests (Guaranteed Minimum)
Definition: Minimum resources guaranteed to the workload
Scheduler behavior: Only places pod on node with enough unrequested capacity
Runtime behavior: Resources always available (reserved on node)
Example: requests.cpu = 2000m
- Scheduler ensures node has 2000m unrequested CPU
- Pod guaranteed 2000m CPU even under contention
- Used for scheduling decisions and QoS classificationLimits (Maximum Allowed)
Definition: Maximum resources the workload can consume
Scheduler behavior: NOT used for scheduling (only requests matter)
Runtime behavior: Enforced by kernel (cgroups)
Example: limits.cpu = 4000m
- Pod can burst up to 4000m when node has spare capacity
- Throttled (not killed) when exceeding CPU limit
- OOM-killed when exceeding memory limit
- Enables overcommitment (sum of limits > node capacity)Quotas (Tenant-Level Caps)
Definition: Maximum total resources a tenant/namespace can request
Admission behavior: Rejects new allocations that would exceed quota
Runtime behavior: No enforcement (quotas are admission-time only)
Example: quota.cpu = 1000 cores for namespace "ml-team"
- ml-team can have at most 1000 cores of CPU requests
- Individual pods still have their own requests/limits
- Prevents any single tenant from consuming entire clusterInteraction Model
Hierarchy:
Cluster Capacity (physical limit)
└── Node Allocatable (capacity - system reserved)
└── Namespace Quota (admin-set cap per tenant)
└── Pod Limit (max per workload)
└── Pod Request (guaranteed per workload)
Example:
Cluster: 10,000 CPU cores
Node: 64 cores allocatable
Namespace quota: 500 cores
Pod limit: 16 cores
Pod request: 4 cores
This namespace can run: 500/4 = 125 pods (by request)
Each pod can burst to: 16 cores (if node has capacity)
Total burst potential: 125 * 16 = 2000 cores (but quota only allows 500 requested)Design Decision: How Strict Should Quotas Be?
Strict Quotas (Hard Enforcement):
- Reject any request that would exceed quota
- Predictable, prevents resource exhaustion
- Can block legitimate burst needs
- Used for: GPU, licenses, expensive resources
Soft Quotas (Warning + Burst):
- Allow exceeding quota temporarily
- Alert administrators when exceeded
- Burst allocations are preemptible
- Used for: CPU, memory in non-production
Hierarchical Quotas (Nested):
- Organization → Team → Namespace
- Unused quota at one level can be borrowed by siblings
- Prevents waste while maintaining fairness
- Complex to implement and reason aboutSummary: Recommended Architecture
For a production resource allocation service targeting 50K nodes:
- Scheduling model: Centralized with partitioning (zone-based)
- Concurrency: Optimistic (with pessimistic fallback for gang scheduling)
- Fairness: Priority bands with fair-share within bands
- Preemption: Yes, across priority bands with PDB protection
- Partitioning: Dynamic with guaranteed minimums
- Reservations: Supported for predictable workloads
- Scheduler architecture: Monolithic with plugin framework
- Resource model: Requests + Limits + Quotas (Kubernetes model)
This combination provides the best balance of scheduling quality, operational simplicity, and scalability for most production environments.