Resource Allocation Service - Tradeoffs and Alternatives

Overview

Resource allocation system design involves fundamental tradeoffs between scheduling quality, latency, fairness, and system complexity. This section analyzes the key architectural decisions, comparing approaches used by production systems: Google Borg (centralized, optimistic), Omega (shared-state, parallel), Sparrow (distributed, sampling-based), Kubernetes (centralized, plugin-based), Mesos (two-level, offer-based), and YARN (centralized with application-level scheduling).

Centralized vs Distributed Scheduling

Centralized Scheduler (Borg / Kubernetes Model)

Architecture:
  ┌──────────────────────────┐
  │   Single Scheduler       │
  │   (or active-standby)    │
  │                          │
  │  - Full cluster view     │
  │  - Optimal placement     │
  │  - Single point of truth │
  └────────────┬─────────────┘
               │
    ┌──────────┼──────────┐
    │          │          │
  Node 1    Node 2    Node N

Advantages:

Global optimization: sees all resources, makes globally optimal decisions
Simpler consistency: single writer for resource state
Easier policy enforcement: quotas, priorities, fairness in one place
Proven at scale: Borg manages Google's entire fleet centrally

Disadvantages:

Throughput bottleneck: single scheduler limits scheduling rate
Head-of-line blocking: complex scheduling decisions delay simple ones
Single point of failure: requires HA (active-standby adds failover latency)
Scaling ceiling: ~5,000 nodes per scheduler instance (Kubernetes)

When to choose: Most production deployments. Simpler to reason about, debug, and operate. Sufficient for clusters up to 10K nodes with optimization.

Shared-State Parallel Scheduling (Omega Model)

Architecture:
  ┌─────────────────────────────────────────┐
  │         Shared Cluster State            │
  │    (optimistic concurrency control)     │
  └───┬──────────┬──────────┬──────────┬───┘
      │          │          │          │
  Scheduler  Scheduler  Scheduler  Scheduler
  (Service)  (Batch)    (ML)       (System)

Advantages:

Parallel scheduling: multiple schedulers operate simultaneously
No head-of-line blocking: each scheduler handles its own workload type
Higher throughput: linear scaling with number of schedulers
Specialization: each scheduler optimized for its workload

Disadvantages:

Conflict resolution: schedulers may claim same resources (5-30% conflict rate)
Wasted work: conflicting decisions must be retried
Complexity: distributed state management, conflict detection
Fairness harder: no single arbiter for cross-scheduler fairness

When to choose: Very large clusters (50K+ nodes) with diverse workload types where a single scheduler is a bottleneck.

Distributed Scheduling (Sparrow Model)

Architecture:
  Job arrives → Pick d random nodes → Probe for availability → Schedule on first available

  No central state. Each scheduling decision is independent.
  Uses "power of two choices" or "batch sampling" for quality.

Advantages:

Extremely low latency: O(1) scheduling decisions
No single point of failure: fully decentralized
Infinite horizontal scaling: add more scheduler instances
Simple implementation: no distributed consensus needed

Disadvantages:

Suboptimal placement: no global view, can't do bin packing
No constraint support: affinity, anti-affinity, topology impossible
No preemption: can't reason about priorities globally
Poor for heterogeneous clusters: random probing misses specialized nodes

When to choose: Homogeneous clusters with simple, short-lived tasks (e.g., serverless function invocations, sub-second tasks). Not suitable for complex scheduling requirements.

Decision Matrix

Factor	Centralized	Shared-State	Distributed
Scheduling quality	Optimal	Near-optimal	Acceptable
Latency (P50)	5-20ms	5-20ms	<1ms
Throughput	100-500/sec	500-5000/sec	10000+/sec
Constraint support	Full	Full	None/Limited
Complexity	Low	High	Low
Failure handling	Failover delay	Graceful	No impact
Cluster size	<10K nodes	10K-100K nodes	Any

Optimistic vs Pessimistic Concurrency for Resource Claims

Optimistic Concurrency (Omega / etcd model)

Flow:
1. Read current state (resource_version = 42)
2. Make scheduling decision locally
3. Write new state with expected version (CAS: version 42 → 43)
4. If conflict: re-read, re-decide, retry

Characteristics:
- No locks held during decision-making
- High throughput under low contention
- Degrades under high contention (many retries)
- Used by: Kubernetes (etcd), Omega, CockroachDB

Best for: Read-heavy workloads, low-to-medium contention, when scheduling decisions take time (complex constraints).

Pessimistic Concurrency (Lock-based)

Flow:
1. Acquire lock on target node's resources
2. Make scheduling decision
3. Update state
4. Release lock

Characteristics:
- Guaranteed no conflicts
- Lower throughput (lock contention)
- Risk of deadlocks with multiple resources
- Simpler retry logic (no conflicts to handle)
- Used by: Traditional databases, some YARN implementations

Best for: High-contention scenarios, simple scheduling decisions, when correctness is more important than throughput.

Hybrid Approach (Recommended)

Strategy:
- Use optimistic concurrency for the common path (90% of allocations)
- Fall back to pessimistic locking for:
  - Gang scheduling (must lock multiple nodes atomically)
  - Preemption (must ensure eviction before new placement)
  - Quota updates (must be strictly serialized)

Implementation:
- etcd with resource versions for node bindings (optimistic)
- Distributed locks (Redis/etcd) for gang scheduling (pessimistic)
- Serialized queue for quota counter updates (pessimistic)

Fair Share vs Priority-Based vs Deadline-Based Scheduling

Fair Share Scheduling (DRF - Dominant Resource Fairness)

Principle: Each tenant gets resources proportional to their fair share weight.

Example (2 tenants, equal weight):
  Cluster: 100 CPU, 100 GPU
  Tenant A (CPU-heavy): dominant resource = CPU
  Tenant B (GPU-heavy): dominant resource = GPU
  
  DRF allocation:
  - Tenant A: 67 CPU, 33 GPU (dominant share: 67%)
  - Tenant B: 33 CPU, 67 GPU (dominant share: 67%)
  - Both tenants have equal dominant resource share

Properties:
+ Sharing incentive: no tenant is worse off than equal partition
+ Strategy-proof: tenants can't game the system by lying about needs
+ Envy-free: no tenant prefers another's allocation
+ Pareto efficient: can't improve one without hurting another

Limitations:
- Ignores urgency: a critical production job gets same share as dev experiment
- Slow convergence: takes time to reach fair state after changes
- Doesn't handle deadlines or SLAs

Priority-Based Scheduling

Principle: Higher priority jobs always scheduled before lower priority.

Priority Levels (Kubernetes model):
  2000000000: system-node-critical (kubelet, kube-proxy)
  1000000000: system-cluster-critical (DNS, scheduler)
       1000: production-critical
        500: production-normal
        100: batch-high
          0: batch-normal (default)
       -100: best-effort
      -1000: preemptible

Properties:
+ Clear SLA enforcement: critical jobs always get resources
+ Simple to understand and implement
+ Fast scheduling: just pick highest priority pending job

Limitations:
- Starvation: low-priority jobs may never run
- Priority inflation: everyone wants "critical" priority
- No fairness guarantee between same-priority tenants
- Requires governance: who assigns priorities?

Deadline-Based Scheduling (EDF - Earliest Deadline First)

Principle: Schedule jobs closest to their deadline first.

Example:
  Job A: needs 4 GPU, deadline in 2 hours
  Job B: needs 2 GPU, deadline in 30 minutes
  Job C: needs 8 GPU, deadline in 6 hours
  
  Scheduling order: B (30min) → A (2h) → C (6h)

Properties:
+ Optimal for meeting deadlines (provably optimal for single resource)
+ Natural for batch/ETL workloads with SLAs
+ Self-adjusting: urgency increases as deadline approaches

Limitations:
- Requires accurate runtime estimates (hard to predict)
- Cascading failures: one missed deadline can cause chain reaction
- Not suitable for long-running services (no deadline)
- Complex for multi-resource scheduling (NP-hard)

Hybrid Approach (Production Recommendation)

Recommended: Priority bands with fair-share within each band

Band 1 (System Critical): Strict priority, always scheduled first
Band 2 (Production):      Fair-share among production tenants
Band 3 (Batch):           Fair-share with deadline awareness
Band 4 (Best-Effort):     Backfill only, fully preemptible

Within each band:
- Fair-share ensures no tenant is starved
- Priority breaks ties within fair-share
- Deadline awareness for batch jobs with SLAs
- Aging: jobs waiting too long get priority boost

Preemptive vs Non-Preemptive Allocation

Preemptive (Borg / Kubernetes)

Mechanism:
1. High-priority job arrives, no capacity available
2. Scheduler identifies lower-priority victims on target nodes
3. Victims receive graceful termination signal (SIGTERM)
4. After grace period (30s default), forced kill (SIGKILL)
5. Resources freed, high-priority job scheduled

Preemption Policies:
- PreemptLowerPriority: Can preempt any lower-priority pod
- PreemptNever: Never preempt (wait in queue instead)
- PreemptSamePriority: Can preempt same or lower priority (rare)

Victim Selection Algorithm:
1. Find nodes where preemption would make job schedulable
2. For each candidate node, find minimum set of victims
3. Prefer victims with: lowest priority, most recent start, no PDB
4. Select node that minimizes total preemption impact

Advantages:

Critical jobs always get resources quickly
Better resource utilization (no idle reserved capacity)
Enables overcommitment with safety valve

Disadvantages:

Wasted work: preempted jobs lose progress (unless checkpointed)
Cascading preemptions: preempted job may preempt another
Complexity: PodDisruptionBudgets, grace periods, rescheduling
User experience: unpredictable for low-priority workloads

Non-Preemptive (Queue-based)

Mechanism:
1. High-priority job arrives, no capacity available
2. Job waits in priority queue
3. When resources freed (job completes/releases), highest priority dequeued
4. No running jobs are ever interrupted

Properties:
- Predictable: once running, job won't be interrupted
- Simpler: no eviction logic, no grace periods
- Wasteful: resources may be held by low-priority jobs while high-priority waits

When to choose non-preemptive:

Workloads that can't be checkpointed (lose all progress on preemption)
Environments where predictability > efficiency
Small clusters where preemption cascades are dangerous

Recommendation: Use preemptive scheduling with safeguards:

PodDisruptionBudgets limit simultaneous evictions
Grace periods allow checkpointing
Preemption cooldown prevents thrashing
Only preempt across priority bands (not within same priority)

Static vs Dynamic Resource Partitioning

Static Partitioning

Model: Fixed resource pools per tenant/workload type

  Cluster: 1000 nodes
  ├── Production pool: 400 nodes (dedicated, never shared)
  ├── Batch pool: 300 nodes (dedicated to batch)
  ├── ML pool: 200 nodes (GPU nodes, dedicated to ML)
  └── Shared pool: 100 nodes (overflow for any tenant)

Properties:
+ Predictable performance (no noisy neighbors)
+ Simple capacity planning
+ Strong isolation guarantees
- Poor utilization (pools often underutilized)
- Inflexible (can't respond to demand changes)
- Wasteful (production pool idle at night, batch pool idle during day)

Dynamic Partitioning

Model: Resources allocated on-demand with quotas as guardrails

  Cluster: 1000 nodes (shared pool)
  Quotas:
  ├── Production: guaranteed 400, burst to 600
  ├── Batch: guaranteed 200, burst to 500
  ├── ML: guaranteed 150, burst to 300
  └── Dev: guaranteed 50, burst to 200
  
  At any time: sum(guaranteed) <= total capacity
  Burst capacity: shared, first-come-first-served, preemptible

Properties:
+ Higher utilization (30-50% improvement over static)
+ Responsive to demand changes
+ Burst capacity for unexpected needs
- Noisy neighbor risk (burst traffic affects others)
- More complex scheduling logic
- Harder to predict performance

Recommendation: Dynamic partitioning with guaranteed minimums. Each tenant gets a guaranteed base allocation (never preempted) plus access to burst capacity (preemptible). This provides the isolation benefits of static partitioning with the efficiency of dynamic sharing.

Reservation-Based vs Best-Effort Allocation

Reservation-Based

Model: Resources reserved in advance, guaranteed availability

Use Cases:
- Scheduled batch jobs (reserve GPU for nightly training)
- Maintenance windows (reserve capacity for migration)
- SLA-bound workloads (guarantee resources for production)

Implementation:
- Advance reservation API: "Reserve 8 GPU from 2am-6am on Jan 20"
- Scheduler maintains reservation calendar (interval tree)
- Reserved resources excluded from general pool during window
- Unused reservations released after grace period (15 min)

Trade-offs:
+ Guaranteed availability at scheduled time
+ Predictable for capacity planning
+ Enables offline optimization (schedule reservations optimally)
- Reduces flexibility (reserved resources can't be used by others)
- Waste if reservation not fully utilized
- Complex: cancellation, modification, conflict resolution

Best-Effort Allocation

Model: Request resources when needed, get them if available

Use Cases:
- Interactive workloads (unpredictable demand)
- Development/testing (flexible timing)
- Burst traffic handling

Implementation:
- Submit request, scheduler finds resources immediately
- If unavailable: queue, wait, or fail
- No advance planning needed

Trade-offs:
+ Maximum flexibility and utilization
+ Simple API and mental model
+ Resources available to anyone when not in use
- No availability guarantee
- May wait in queue during peak times
- Hard to plan capacity for time-sensitive workloads

Recommendation: Support both models. Use reservations for predictable, time-sensitive workloads (nightly ML training, scheduled maintenance). Use best-effort for everything else. Allow reserved but unused capacity to be used by best-effort workloads (preemptible when reservation window starts).

Monolithic Scheduler vs Two-Level Scheduler

Monolithic Scheduler (Borg / Kubernetes)

Single scheduler makes ALL placement decisions:
- Knows about all resources
- Knows about all pending jobs
- Applies all policies (priority, fairness, affinity, etc.)
- Single codebase to maintain

Scaling approach: Make the single scheduler faster
- Parallel filtering
- Score caching
- Batch binding
- Equivalence classes (group similar pods)

Two-Level Scheduler (Mesos)

Level 1 (Resource Manager): Offers resources to frameworks
Level 2 (Framework Schedulers): Accept offers and place tasks

Example:
  Mesos Master: "Here's an offer: Node-5 has 16 CPU, 64GB"
  Spark Framework: "I'll take 8 CPU, 32GB for my executor"
  Mesos Master: "Offer accepted. Remaining: 8 CPU, 32GB"
  Marathon Framework: "I'll take 4 CPU, 16GB for my service"

Problems with Offers:
1. Offer starvation: Framework holds offer without using it
   Solution: Offer timeout (5 seconds to accept/reject)
   
2. Resource hoarding: Framework accepts more than needed
   Solution: Revocable offers, usage monitoring
   
3. Information hiding: Framework doesn't see full cluster state
   Solution: Provide cluster summary with offers
   
4. Offer fragmentation: Small offers that no framework wants
   Solution: Offer aggregation, minimum offer size

Comparison

Aspect	Monolithic	Two-Level
Scheduling quality	Optimal (global view)	Suboptimal (partial view)
Extensibility	Plugin-based	Framework-based
Isolation	Shared scheduler	Separate schedulers
Complexity	Single complex system	Multiple simpler systems
Latency	Depends on queue depth	Offer round-trip adds latency
Real-world	Kubernetes, Borg	Mesos, early YARN

Industry trend: Monolithic with plugins (Kubernetes Scheduling Framework) has won. Two-level scheduling added complexity without proportional benefits for most use cases.

Resource Quotas vs Limits vs Requests

Requests (Guaranteed Minimum)

Definition: Minimum resources guaranteed to the workload
Scheduler behavior: Only places pod on node with enough unrequested capacity
Runtime behavior: Resources always available (reserved on node)

Example: requests.cpu = 2000m
- Scheduler ensures node has 2000m unrequested CPU
- Pod guaranteed 2000m CPU even under contention
- Used for scheduling decisions and QoS classification

Limits (Maximum Allowed)

Definition: Maximum resources the workload can consume
Scheduler behavior: NOT used for scheduling (only requests matter)
Runtime behavior: Enforced by kernel (cgroups)

Example: limits.cpu = 4000m
- Pod can burst up to 4000m when node has spare capacity
- Throttled (not killed) when exceeding CPU limit
- OOM-killed when exceeding memory limit
- Enables overcommitment (sum of limits > node capacity)

Quotas (Tenant-Level Caps)

Definition: Maximum total resources a tenant/namespace can request
Admission behavior: Rejects new allocations that would exceed quota
Runtime behavior: No enforcement (quotas are admission-time only)

Example: quota.cpu = 1000 cores for namespace "ml-team"
- ml-team can have at most 1000 cores of CPU requests
- Individual pods still have their own requests/limits
- Prevents any single tenant from consuming entire cluster

Interaction Model

Hierarchy:
  Cluster Capacity (physical limit)
    └── Node Allocatable (capacity - system reserved)
        └── Namespace Quota (admin-set cap per tenant)
            └── Pod Limit (max per workload)
                └── Pod Request (guaranteed per workload)

Example:
  Cluster: 10,000 CPU cores
  Node: 64 cores allocatable
  Namespace quota: 500 cores
  Pod limit: 16 cores
  Pod request: 4 cores
  
  This namespace can run: 500/4 = 125 pods (by request)
  Each pod can burst to: 16 cores (if node has capacity)
  Total burst potential: 125 * 16 = 2000 cores (but quota only allows 500 requested)

Design Decision: How Strict Should Quotas Be?

Strict Quotas (Hard Enforcement):
- Reject any request that would exceed quota
- Predictable, prevents resource exhaustion
- Can block legitimate burst needs
- Used for: GPU, licenses, expensive resources

Soft Quotas (Warning + Burst):
- Allow exceeding quota temporarily
- Alert administrators when exceeded
- Burst allocations are preemptible
- Used for: CPU, memory in non-production

Hierarchical Quotas (Nested):
- Organization → Team → Namespace
- Unused quota at one level can be borrowed by siblings
- Prevents waste while maintaining fairness
- Complex to implement and reason about

Summary: Recommended Architecture

For a production resource allocation service targeting 50K nodes:

Scheduling model: Centralized with partitioning (zone-based)
Concurrency: Optimistic (with pessimistic fallback for gang scheduling)
Fairness: Priority bands with fair-share within bands
Preemption: Yes, across priority bands with PDB protection
Partitioning: Dynamic with guaranteed minimums
Reservations: Supported for predictable workloads
Scheduler architecture: Monolithic with plugin framework
Resource model: Requests + Limits + Quotas (Kubernetes model)

This combination provides the best balance of scheduling quality, operational simplicity, and scalability for most production environments.