Resource Allocation Service - Variations and Follow-ups

Overview

Resource allocation is a broad domain with many specialized variations. This section covers common extensions asked in system design interviews, each representing a distinct set of challenges beyond the core scheduling problem. We also include detailed answers to frequently asked follow-up questions.

Variation 1: GPU/TPU Scheduling for ML Workloads

Unique Challenges

GPU scheduling differs from CPU scheduling in several fundamental ways:
1. Discrete allocation: GPUs are whole devices (or MIG partitions), not divisible like CPU millicores
2. Topology matters: GPU-to-GPU communication over NVLink has roughly 10x the bandwidth of PCIe Gen4 x16
3. Memory is fixed: GPU memory (80GB A100) cannot be overcommitted
4. Cost: A GPU-hour can cost 10-100x a CPU core-hour, depending on GPU model and pricing
5. Long-running: Training jobs run for hours/days (not seconds)
6. Gang scheduling: Distributed training needs all GPUs simultaneously

NVIDIA MIG (Multi-Instance GPU) Scheduling

A100 80GB can be partitioned into MIG instances:
- 7x 1g.10gb (7 instances, 10GB each)
- 3x 2g.20gb + 1x 1g.10gb
- 2x 3g.40gb
- 1x 7g.80gb (full GPU)

Scheduling Implications:
- Must track MIG profiles per GPU, not just "GPU count"
- Enabling or disabling MIG mode requires a GPU reset, and changing the instance layout requires the affected instances to be idle (disruptive)
- Different workloads need different profiles
- Scheduler must match workload to available MIG profile

Schema Extension:
{
  "node_id": "gpu-node-01",
  "gpus": [
    {
      "gpu_id": 0,
      "model": "A100-80GB",
      "mig_mode": true,
      "instances": [
        {"profile": "3g.40gb", "allocated_to": "job-123"},
        {"profile": "2g.20gb", "allocated_to": null},
        {"profile": "1g.10gb", "allocated_to": "job-456"}
      ]
    }
  ]
}
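
A minimal sketch of profile matching against an inventory shaped like the schema above. The function name `allocate_mig` and the first-fit policy are assumptions for illustration; a production scheduler would also weigh fragmentation and topology.

```python
from typing import Optional, Tuple

# Hypothetical inventory in the same shape as the schema extension.
NODE = {
    "node_id": "gpu-node-01",
    "gpus": [
        {
            "gpu_id": 0,
            "model": "A100-80GB",
            "mig_mode": True,
            "instances": [
                {"profile": "3g.40gb", "allocated_to": "job-123"},
                {"profile": "2g.20gb", "allocated_to": None},
                {"profile": "1g.10gb", "allocated_to": "job-456"},
            ],
        }
    ],
}


def allocate_mig(node: dict, profile: str, job_id: str) -> Optional[Tuple[int, int]]:
    """Claim the first free MIG instance with the requested profile.

    Returns (gpu_id, instance_index), or None if nothing matches.
    """
    for gpu in node["gpus"]:
        if not gpu["mig_mode"]:
            continue  # whole-GPU devices are handled elsewhere
        for idx, inst in enumerate(gpu["instances"]):
            if inst["profile"] == profile and inst["allocated_to"] is None:
                inst["allocated_to"] = job_id
                return gpu["gpu_id"], idx
    return None
```

Because the match is on the exact profile string, a request for "2g.20gb" is never satisfied by carving up a larger free instance; that would require a MIG reconfiguration.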

Topology-Aware GPU Scheduling

Example 8-GPU NVLink mesh (illustrative; NVSwitch-based systems such as the DGX A100 connect all 8 GPUs at full NVLink bandwidth, so every pair is equivalent):
  GPU 0 ←NVLink→ GPU 1 ←NVLink→ GPU 2 ←NVLink→ GPU 3
    ↕ NVLink        ↕ NVLink        ↕ NVLink        ↕ NVLink
  GPU 4 ←NVLink→ GPU 5 ←NVLink→ GPU 6 ←NVLink→ GPU 7

Scheduling Rule: For multi-GPU jobs, prefer GPUs directly connected via NVLink
- 2-GPU job: Allocate GPU 0+1 (NVLink connected) NOT GPU 0+7 (no direct NVLink; traffic falls back to PCIe)
- 4-GPU job: Allocate GPU 0+1+4+5 (NVLink-connected ring)
- 8-GPU job: Full node allocation

Performance Impact:
- NVLink: 600 GB/s bidirectional
- PCIe Gen4 x16: ~32 GB/s per direction (~64 GB/s bidirectional)
- Wrong topology: 2-5x slower for communication-heavy training
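
One way to turn the scheduling rule into a scoring function is to count NVLink edges inside each candidate GPU set. The sketch below assumes the 2x4 grid drawn in the diagram above; the edge list and the function names are illustrative, not a vendor API.

```python
from itertools import combinations

# NVLink edges read off the 2x4 grid in the diagram (illustrative only).
NVLINK_EDGES = {
    (0, 1), (1, 2), (2, 3),
    (4, 5), (5, 6), (6, 7),
    (0, 4), (1, 5), (2, 6), (3, 7),
}


def nvlink_edges_within(gpus):
    """Count NVLink edges whose endpoints are both in the candidate set."""
    chosen = set(gpus)
    return sum(1 for a, b in NVLINK_EDGES if a in chosen and b in chosen)


def pick_gpus(free_gpus, count):
    """Return the `count` free GPUs with the most internal NVLink edges.

    Brute force is acceptable for 8 GPUs (at most 70 candidate sets).
    Ties resolve to the lexicographically smallest set; returns None
    when fewer than `count` GPUs are free.
    """
    candidates = combinations(sorted(free_gpus), count)
    return max(candidates, key=nvlink_edges_within, default=None)
```

With all eight GPUs free, `pick_gpus(range(8), 4)` selects GPUs 0, 1, 4 and 5, matching the 4-GPU rule above.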

Variation 2: Spot/Preemptible Instance Management

Design for Spot Capacity

Spot Instance Characteristics:
- 60-90% cheaper than on-demand
- Can be reclaimed with 2-minute warning (AWS) or 30-second (GCP)
- Availability varies by instance type, zone, and time
- Best for: fault-tolerant, checkpointable workloads

Architecture:
┌─────────────────────────────────────────────┐
│           Spot Instance Manager              │
├─────────────┬──────────────┬────────────────┤
│ Price       │ Availability │ Interruption   │
│ Tracker     │ Predictor    │ Handler        │
└─────────────┴──────────────┴────────────────┘

Price Tracker:
- Monitor spot prices across instance types and zones
- Maintain price history for prediction
- Alert when prices approach on-demand threshold

Availability Predictor:
- ML model trained on historical interruption patterns
- Predict: "p4d.24xlarge in us-east-1a has 85% chance of lasting 4+ hours"
- Input features: time of day, day of week, recent interruption rate

Interruption Handler:
1. Receive interruption notice (2 min warning)
2. Trigger workload checkpoint (save training state)
3. Attempt migration to another spot instance
4. Fall back to on-demand if no spot available
5. Resume from checkpoint on new instance
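
The handler steps can be sketched as a single function that takes the environment-specific actions as callables. All four hooks (`checkpoint`, `find_spot`, `launch_on_demand`, `resume`) are hypothetical and would be supplied by the platform.

```python
from typing import Callable, Optional


def handle_interruption(
    checkpoint: Callable[[], str],
    find_spot: Callable[[], Optional[str]],
    launch_on_demand: Callable[[], str],
    resume: Callable[[str, str], None],
) -> str:
    """Run steps 2-5 once an interruption notice (step 1) has arrived.

    Returns the ID of the instance the workload was resumed on.
    """
    checkpoint_id = checkpoint()        # step 2: save state before anything else
    target = find_spot()                # step 3: try another spot instance
    if target is None:
        target = launch_on_demand()     # step 4: fall back to on-demand
    resume(target, checkpoint_id)       # step 5: resume from the checkpoint
    return target
```

Checkpointing comes first because the warning window is short; placement can continue after the original instance is gone, but unsaved state cannot be recovered.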

Spot-Aware Scheduling Policy

Job Classification:
- Spot-eligible: Fault-tolerant, checkpointable (batch, training)
- Spot-ineligible: Stateful, latency-sensitive (databases, APIs)

Scheduling Strategy:
1. Place spot-eligible jobs on spot instances first
2. Maintain minimum on-demand capacity for spot-ineligible
3. Diversify across instance types (reduce correlated interruptions)
4. Keep 10-20% on-demand buffer for spot fallback

Cost Optimization:
- Spot fleet: Mix of instance types to reduce interruption risk
- Max price: Set the maximum price at the on-demand rate (you pay the current spot price; a high ceiling reduces price-driven interruptions)
- Zone diversification: Spread across 3+ AZs
- Instance diversification: Use 5+ instance types per workload

Variation 3: Multi-Cloud Resource Allocation

Cross-Cloud Scheduling

Architecture:
┌─────────────────────────────────────────────────────┐
│              Multi-Cloud Resource Manager             │
├──────────────┬──────────────┬────────────────────────┤
│ AWS Provider │ GCP Provider │ Azure Provider         │
│ - EC2/EKS   │ - GCE/GKE   │ - VMs/AKS             │
│ - Spot/RI   │ - Preemptible│ - Spot/Reserved        │
└──────────────┴──────────────┴────────────────────────┘

Abstraction Layer:
- Normalize resource types across clouds
- Unified API regardless of underlying provider
- Handle provider-specific quirks (naming, limits, APIs)

Scheduling Decisions:
1. Cost: Which cloud is cheapest for this workload right now?
2. Latency: Which cloud is closest to the data?
3. Compliance: Which clouds meet data residency requirements?
4. Availability: Which cloud has capacity for this instance type?
5. Lock-in: Avoid concentrating too much on one provider
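
A rough sketch of how these five factors might combine: compliance, capacity, and a concentration cap act as hard filters, while cost and latency feed a weighted score. The `CloudOption` fields, the weights, and the 70% cap are all assumed values for illustration.

```python
from dataclasses import dataclass


@dataclass
class CloudOption:
    name: str
    hourly_cost: float     # USD/hour for the normalized instance shape
    latency_ms: float      # network latency to the workload's data
    compliant: bool        # meets data residency requirements
    has_capacity: bool     # instance type currently available
    current_share: float   # fraction of the fleet already on this provider


def choose_cloud(options, max_share=0.7, cost_weight=0.6, latency_weight=0.4):
    """Filter on hard constraints, then pick the lowest weighted score."""
    eligible = [
        o for o in options
        if o.compliant and o.has_capacity and o.current_share < max_share
    ]
    if not eligible:
        return None
    max_cost = max(o.hourly_cost for o in eligible)
    max_latency = max(o.latency_ms for o in eligible)

    def score(o):
        # Normalize each term to [0, 1] so the weights are comparable.
        cost_term = o.hourly_cost / max_cost if max_cost else 0.0
        latency_term = o.latency_ms / max_latency if max_latency else 0.0
        return cost_weight * cost_term + latency_weight * latency_term

    return min(eligible, key=score)
```

Treating compliance as a filter rather than a score term means a cheap but non-compliant provider can never win on price.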

Challenges:
- Network latency between clouds: 20-100ms (vs <1ms within cloud)
- Data transfer costs: $0.01-0.09/GB between clouds
- Different APIs, instance types, pricing models
- Credential management across providers
- Inconsistent SLAs and support models

Variation 4: Serverless Resource Management

Cold Start Optimization

Serverless Scheduling Challenges:
- Cold start: 100ms-10s to provision new execution environment
- Scale to zero: No resources consumed when idle
- Burst: 0 to 10,000 concurrent executions in seconds
- Short-lived: Most invocations complete in <1 second

Cold Start Mitigation Strategies:

1. Warm Pool (Pre-provisioned):
   - Maintain pool of pre-initialized containers
   - Size based on predicted demand (ML model)
   - Trade-off: cost of idle containers vs cold start latency
   - Target: 95% of invocations served from warm pool

2. Snapshot/Restore (Firecracker):
   - Snapshot initialized VM state to disk
   - Restore from snapshot in 5-50ms (vs 500ms+ full boot)
   - Used by AWS Lambda (Firecracker microVMs)
   - Trade-off: snapshot storage cost vs boot time

3. Predictive Scaling:
   - Analyze invocation patterns (time of day, day of week)
   - Pre-warm containers before predicted traffic spike
   - Example: Pre-warm 100 containers at 8:55am for 9am traffic

4. Tiered Warm Pools:
   - Hot: Running, ready to serve (0ms overhead)
   - Warm: Paused, resume in 5-10ms
   - Cold: Snapshot on disk, restore in 50-100ms
   - Frozen: No resources, full cold start 500ms+
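
The tiered lookup can be sketched as a walk from the fastest tier to the slowest. The class name `TieredPool` is hypothetical, and the overheads use the upper bounds from the tier list above.

```python
# Upper-bound startup overheads in ms, taken from the tier list.
TIER_OVERHEAD_MS = {"hot": 0, "warm": 10, "cold": 100}
FROZEN_OVERHEAD_MS = 500  # full cold start


class TieredPool:
    """Counts of ready execution environments per tier (in-memory sketch)."""

    def __init__(self, hot=0, warm=0, cold=0):
        self.available = {"hot": hot, "warm": warm, "cold": cold}

    def acquire(self):
        """Take an environment from the fastest non-empty tier.

        Returns (tier, startup_overhead_ms). Falls through to a full
        cold start when every tier is empty.
        """
        for tier, overhead in TIER_OVERHEAD_MS.items():
            if self.available[tier] > 0:
                self.available[tier] -= 1
                return tier, overhead
        return "frozen", FROZEN_OVERHEAD_MS
```

A background refill loop (not shown) would promote environments between tiers based on predicted demand.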

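Resource Allocation for Serverless (figures follow AWS Lambda):
- CPU: Allocated in proportion to configured memory; each invocation is capped by a timeout (up to 900s)
- Memory: Fixed per function (128MB-10GB)
- Concurrency: 1000 concurrent executions by default (an account-level quota per region; per-function limits are set with reserved concurrency)
- Burst: Provider-imposed limit on how quickly concurrency can ramp up (historically up to 3000 immediate concurrent executions in the largest regions)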

Variation 5: Network Bandwidth Allocation

Traffic Shaping and QoS

Network Resource Model:
- Bandwidth: Guaranteed minimum + burst maximum
- Latency: Priority queuing for latency-sensitive traffic
- Connections: Rate limiting on new connections/sec

Allocation Tiers:
┌─────────────────────────────────────────────┐
│ Priority 1: Control Plane (scheduler, etcd) │  Guaranteed 1 Gbps
├─────────────────────────────────────────────┤
│ Priority 2: Production Services (APIs)      │  Guaranteed 5 Gbps
├─────────────────────────────────────────────┤
│ Priority 3: Batch Data Transfer             │  Guaranteed 2 Gbps
├─────────────────────────────────────────────┤
│ Priority 4: Best-Effort (backups, logs)     │  No guarantee
└─────────────────────────────────────────────┘
Total node bandwidth: 25 Gbps
Guaranteed sum: 8 Gbps (32% reserved)
Remaining 17 Gbps: Shared proportionally by weight
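
The arithmetic behind the tiers can be sketched as guarantees plus a weighted share of the remainder. The weights below are assumed values, since the table does not specify them, and the result is each tier's ceiling when every tier is saturated.

```python
def allocate_bandwidth(total_gbps, tiers):
    """Split link capacity: guarantees first, then the remainder by weight.

    `tiers` maps name -> (guaranteed_gbps, weight).
    """
    guaranteed = sum(g for g, _ in tiers.values())
    if guaranteed > total_gbps:
        raise ValueError("guarantees exceed link capacity")
    remaining = total_gbps - guaranteed
    total_weight = sum(w for _, w in tiers.values())
    return {
        name: g + (remaining * w / total_weight if total_weight else 0.0)
        for name, (g, w) in tiers.items()
    }


# Guarantees from the table; weights (1, 4, 2, 1) are illustrative.
shares = allocate_bandwidth(25, {
    "control-plane": (1, 1),
    "production": (5, 4),
    "batch": (2, 2),
    "best-effort": (0, 1),
})
```

With these weights, production services can use up to 5 + 17 × 4/8 = 13.5 Gbps under full contention. HTB expresses the same idea with a `rate` (guarantee) and `ceil` (burst maximum) per class.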

Implementation:
- Linux TC (Traffic Control) with HTB (Hierarchical Token Bucket)
- Kubernetes: Bandwidth plugin with annotations
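- Cloud: Per-instance bandwidth limits enforced by the provider (VPC flow logs give visibility only; security groups filter traffic but do not rate-limit it)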
- SDN: OpenFlow rules for per-tenant bandwidth enforcement

Variation 6: Storage Tiering and Data Placement

Tiered Storage Allocation

Storage Tiers:
┌──────────────────────────────────────────────────────┐
│ Tier 0: NVMe SSD (local)  │ 1M IOPS, 0.1ms latency │
│ Tier 1: Network SSD (EBS) │ 64K IOPS, 1ms latency  │
│ Tier 2: HDD (bulk)        │ 500 IOPS, 10ms latency │
│ Tier 3: Object Store (S3) │ 5K req/s, 50ms latency │
└──────────────────────────────────────────────────────┘

Data Placement Scheduler:
- Hot data (accessed in last hour): Tier 0/1
- Warm data (accessed in last week): Tier 1/2
- Cold data (accessed in last month): Tier 2/3
- Archive (rarely accessed): Tier 3 with lifecycle policy

Scheduling Considerations:
- Data locality: Schedule compute near the data (avoid network transfer)
- Replication: Ensure data replicas are in different failure domains
- Migration: Background process moves data between tiers
- Capacity: Each tier has limited capacity, must balance across tiers

Co-location Scheduling:
1. Job needs dataset X (100GB, stored on nodes 5, 12, 23)
2. Scheduler prefers nodes 5, 12, 23 (data locality score boost)
3. If those nodes are full: schedule nearby (same rack) and stream data
4. Last resort: schedule anywhere and copy data (expensive)
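
The co-location preference can be sketched as a three-level locality score. The 2/1/0 scale and the `rack_of` mapping are illustrative assumptions; a real scheduler would blend this score with its other placement criteria.

```python
def locality_score(node, replica_nodes, rack_of):
    """Score a candidate node for a job whose dataset lives on replica_nodes.

    2 = data is local, 1 = same rack as a replica, 0 = remote copy needed.
    """
    if node in replica_nodes:
        return 2
    replica_racks = {rack_of[n] for n in replica_nodes}
    return 1 if rack_of[node] in replica_racks else 0


def place(candidates, replica_nodes, rack_of):
    """Pick the schedulable candidate with the best locality score.

    `candidates` holds only nodes with enough free capacity; returns
    None when the list is empty.
    """
    return max(
        candidates,
        key=lambda n: locality_score(n, replica_nodes, rack_of),
        default=None,
    )
```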

Variation 7: License Management

Floating License Allocation

License Types:
1. Node-locked: Tied to specific machine (no scheduling needed)
2. Floating: Pool of N licenses, any N users can use simultaneously
3. Named-user: Tied to specific user (but any machine)
4. Concurrent: Max N simultaneous sessions

Floating License Scheduler:
┌─────────────────────────────────────────┐
│         License Manager Service          │
├─────────────────────────────────────────┤
│ License Pool:                            │
│   MATLAB: 50 total, 42 checked out      │
│   Ansys:  20 total, 18 checked out      │
│   Cadence: 10 total, 10 checked out     │ ← FULL
└─────────────────────────────────────────┘

Checkout Flow:
1. Job requires license: "needs 1x MATLAB license"
2. Scheduler checks: license available? (42 < 50: yes)
3. Scheduler atomically: allocate compute + checkout license
4. Job runs with license held
5. Job completes: release compute + checkin license

Challenges:
- License server is SPOF (need HA license server)
- Stale checkouts: job crashes without releasing license
  Solution: Heartbeat-based lease (auto-release after 5 min no heartbeat)
- Priority: Critical simulation needs license, but all checked out
  Solution: Preempt lowest-priority license holder
- Cost: Licenses are expensive ($10K-100K/year each)
  Solution: Usage tracking, right-sizing license pool
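
A minimal in-memory sketch of the heartbeat-based lease for stale checkouts. The class name `LicensePool` and its methods are hypothetical; a real service would persist state and coordinate with the vendor's license server.

```python
import time


class LicensePool:
    """Floating-license pool where a checkout expires without heartbeats."""

    def __init__(self, total, lease_seconds=300, clock=time.monotonic):
        self.total = total
        self.lease_seconds = lease_seconds  # 5 minutes, as in the text
        self.clock = clock                  # injectable for testing
        self.last_heartbeat = {}            # job_id -> time of last heartbeat

    def _expire_stale(self):
        now = self.clock()
        for job_id, seen in list(self.last_heartbeat.items()):
            if now - seen > self.lease_seconds:
                del self.last_heartbeat[job_id]  # auto-release crashed holder

    def checkout(self, job_id):
        """Return True if a license was granted to job_id."""
        self._expire_stale()
        if len(self.last_heartbeat) >= self.total:
            return False
        self.last_heartbeat[job_id] = self.clock()
        return True

    def heartbeat(self, job_id):
        """Renew the lease; returns False if it has already expired."""
        self._expire_stale()
        if job_id not in self.last_heartbeat:
            return False
        self.last_heartbeat[job_id] = self.clock()
        return True

    def checkin(self, job_id):
        self.last_heartbeat.pop(job_id, None)
```

A job whose `heartbeat` returns False has lost its license and should stop or re-acquire before continuing.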

Integration with Resource Scheduler:
- Treat licenses as extended resources: "licenses/matlab: 1"
- Scheduler filters nodes where license server is reachable
- Quota system includes license counts per tenant
- Chargeback based on license-hours consumed

Variation 8: Real-Time vs Batch Scheduling

Real-Time Scheduling Requirements

Real-Time Workloads:
- Latency SLA: <10ms response time
- Jitter: <1ms variation
- Availability: 99.99% (52 min downtime/year)
- Examples: Trading systems, autonomous vehicles, industrial control

Real-Time Scheduling Constraints:
- CPU isolation: Dedicated cores (no sharing, no context switching)
- Memory: Pinned (no swapping, no NUMA migration)
- Interrupts: IRQ affinity (route interrupts away from RT cores)
- Kernel: PREEMPT_RT patch or real-time kernel
- Network: DPDK/RDMA (bypass kernel network stack)

Resource Model for RT:
{
  "requests": {
    "cpu": "4000m",
    "memory": "16Gi",
    "rt-cpu": "4",           // 4 dedicated cores (not shared)
    "hugepages-1Gi": "8",   // 8 x 1GB hugepages (pinned memory)
    "rdma/hca": "1"          // 1 RDMA network interface
  },
  "scheduling": {
    "cpuPolicy": "static",   // Exclusive CPU cores
    "memoryPolicy": "static", // NUMA-pinned memory
    "topologyPolicy": "single-numa-node" // All resources from same NUMA
  }
}

Batch Scheduling Optimizations

Batch Workload Characteristics:
- Throughput-oriented (not latency-sensitive)
- Can tolerate queuing (minutes to hours acceptable)
- Often large (hundreds of tasks per job)
- Checkpointable (can resume after interruption)

Batch-Specific Optimizations:
1. Backfill scheduling: Fill gaps with small jobs while waiting for large job
2. Job packing: Group similar jobs for better bin packing
3. Deadline-aware: Schedule based on job deadline, not arrival time
4. Preemptible: Batch jobs yield to interactive workloads
5. Spot-friendly: Use cheap preemptible instances

Backfill Algorithm:
  Queue: [Large-Job (needs 100 GPU, resources ready in 2 hours), Small-Job (needs 2 GPU)]
  
  Without backfill: Small-Job waits 2 hours behind Large-Job
  With backfill: Small-Job runs immediately on available 2 GPU
  Constraint: Small-Job must complete before Large-Job's resources are ready
  
  This can improve utilization by roughly 15-30% for batch-heavy clusters
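
The backfill constraint can be sketched as a conservative check: a queued job may start early only if it fits in the idle GPUs and its estimated runtime ends before the large job's reservation begins. The function names and the assumption that every idle GPU is earmarked for the large job are simplifications.

```python
def can_backfill(job_gpus, job_runtime_s, free_gpus, reservation_starts_in_s):
    """True if the job fits now and finishes before the reservation starts."""
    return job_gpus <= free_gpus and job_runtime_s <= reservation_starts_in_s


def backfill(queue, free_gpus, reservation_starts_in_s):
    """Return the names of queued jobs to start now, in queue order.

    `queue` holds (name, gpus, estimated_runtime_s) tuples for the jobs
    waiting behind the large job at the head of the queue.
    """
    started = []
    for name, gpus, runtime_s in queue:
        if can_backfill(gpus, runtime_s, free_gpus, reservation_starts_in_s):
            started.append(name)
            free_gpus -= gpus  # GPUs stay busy until the job finishes
    return started
```

The check depends on runtime estimates; jobs that overrun must be preempted or killed, otherwise the large job's start slips.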

Common Interview Follow-Up Questions

Q1: "How do you prevent resource starvation?"

Detailed Answer:

Starvation occurs when a job waits indefinitely because higher-priority jobs
always consume available resources.

Prevention Strategies:

1. Priority Aging:
   - Increase effective priority by 1 every 60 seconds of wait time
   - After 1 hour: +60 priority boost
   - Eventually, any job becomes high enough priority to schedule
   - Cap: Aged priority cannot exceed system-critical level

2. Guaranteed Minimum Quota:
   - Each tenant guaranteed minimum resources regardless of priority
   - Example: "ml-team always gets at least 10 GPU"
   - Guaranteed resources cannot be preempted
   - Prevents complete starvation of any tenant

3. Fair-Share Within Priority Bands:
   - Within same priority level, use DRF (Dominant Resource Fairness)
   - No single tenant can monopolize a priority band
   - Ensures proportional sharing among equals

4. Starvation Detection + Alert:
   - Monitor: jobs waiting > 10 minutes at each priority level
   - Alert: if any job waits > 30 minutes
   - Auto-action: boost priority or trigger preemption review
   - Dashboard: queue depth and wait time per tenant

5. Maximum Resource Holding Time:
   - Low-priority jobs have TTL (e.g., 24 hours max)
   - After TTL: job terminated, resources freed
   - Prevents indefinite resource holding by low-priority work
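
The priority aging rule from strategy 1 reduces to a short formula. The constant `SYSTEM_CRITICAL = 1000` is an assumed ceiling; the actual value is deployment-specific.

```python
SYSTEM_CRITICAL = 1000  # assumed priority of system-critical work


def effective_priority(base, wait_seconds, boost_interval_s=60,
                       cap=SYSTEM_CRITICAL - 1):
    """Base priority plus 1 per full interval waited, capped below critical.

    A job submitted at or above the cap keeps its own priority unchanged.
    """
    aged = base + int(wait_seconds // boost_interval_s)
    return max(base, min(aged, cap))
```

For example, `effective_priority(100, 3600)` gives 160, matching the +60 boost after one hour described above.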

Q2: "How do you handle resource fragmentation?"

Detailed Answer:

Fragmentation: Resources exist but are scattered across nodes in unusable chunks.

Detection:
- Metric: "Allocatable but unschedulable resources"
- Example: 100 GPU free across cluster, but no single node has 8 free
- Fragmentation ratio = unschedulable_free / total_free
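
The fragmentation ratio can be computed directly from per-node free counts. This sketch assumes "unschedulable" means free GPUs on nodes too small to host the largest request size of interest.

```python
def fragmentation_ratio(free_gpus_per_node, largest_request):
    """unschedulable_free / total_free for a given request size.

    Returns 0.0 for a cluster with no free GPUs.
    """
    total_free = sum(free_gpus_per_node)
    if total_free == 0:
        return 0.0
    unschedulable = sum(f for f in free_gpus_per_node if f < largest_request)
    return unschedulable / total_free
```

In the example above (100 free GPUs spread as 4 per node across 25 nodes), the ratio for an 8-GPU request is 1.0: every free GPU is stranded.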

Mitigation Strategies:

1. Proactive Bin Packing:
   - Score function penalizes creating small fragments
   - Prefer nodes that result in large contiguous free blocks
   - Example: Place 2-GPU job on node with 3 free (leaves 1 free)
     rather than node with 8 free (breaks up a full 8-GPU block)

2. Descheduler (Periodic Defragmentation):
   - Run every 10 minutes
   - Identify "badly placed" pods (creating fragmentation)
   - Evict and reschedule to consolidate free space
   - Respect PodDisruptionBudgets and cap evictions (e.g., max 1 per cycle)
   - Priority: Only move low-priority, short-running pods

3. Compaction (Live Migration):
   - For VMs: Live migrate to consolidate
   - For containers: Checkpoint + restore on new node
   - Expensive but effective for persistent fragmentation
   - Schedule during low-traffic windows

4. Reserved Blocks:
   - Reserve contiguous blocks for large jobs
   - Gradually migrate small jobs off reserved blocks
   - Ensures large gang-scheduled jobs can always be placed

Q3: "How do you handle a scheduler failure?"

Detailed Answer:

Failure Modes and Recovery:

1. Scheduler Process Crash:
   - Detection: Health check fails (2 missed heartbeats = 20 seconds)
   - Recovery: Standby scheduler promoted (leader election via etcd)
   - Impact: 20-30 second scheduling pause
   - Running jobs: Unaffected (already placed on nodes)
   - Pending jobs: Delayed until new leader elected

2. Scheduler State Corruption:
   - Detection: Consistency check fails (state vs etcd mismatch)
   - Recovery: Rebuild in-memory state from etcd (source of truth)
   - Time: 30-60 seconds for 50K node cluster
   - Impact: Scheduling pause during rebuild

3. etcd Cluster Failure:
   - Detection: etcd health check fails
   - Impact: No new scheduling decisions (can't persist bindings)
   - Running jobs: Continue running (nodes have local state)
   - Recovery: etcd cluster recovery from backup
   - Mitigation: etcd runs on dedicated, redundant hardware

4. Split-Brain (Network Partition):
   - Scenario: Scheduler can reach some nodes but not etcd
   - Behavior: Scheduler stops making decisions (can't persist)
   - Fencing: Old leader must be fenced before new leader acts
   - Implementation: Lease-based leadership (expires automatically)

High Availability Design:
- 3 scheduler replicas (1 active, 2 standby)
- Leader election via etcd lease (15-second TTL)
- Failover time: < 30 seconds
- State reconstruction: < 60 seconds
- Zero impact on running workloads during failover

Q4: "How do you handle a sudden spike in allocation requests?"

Detailed Answer:

Spike Scenarios:
- Auto-scaler triggers: 1000 new pods in 10 seconds
- Deployment rollout: 500 pods replacing 500 pods
- Disaster recovery: Entire zone fails, 10,000 pods need rescheduling

Handling Strategies:

1. Request Queuing with Backpressure:
   - Queue depth limit: 10,000 pending requests
   - Beyond limit: Return 429 (Too Many Requests) with Retry-After
   - Priority queue: Critical requests processed first during spike

2. Batch Scheduling:
   - Group similar requests (same resource profile)
   - Schedule batch in single pass (amortize filtering cost)
   - 10x throughput improvement for homogeneous batches

3. Admission Control:
   - Rate limit per tenant: 100 requests/sec
   - Global rate limit: 5000 requests/sec
   - Burst allowance: 10x rate for 30 seconds
   - Graceful degradation: Simplify scoring during overload

4. Precomputed Scheduling:
   - For known patterns (deployments), pre-compute placement
   - Cache node scores (valid for 5 seconds)
   - Skip scoring for pods identical to recently scheduled pods
   - "Equivalence classes": Group identical pods, schedule once

5. Cluster Autoscaler Integration:
   - If spike exceeds cluster capacity: trigger node provisioning
   - Optimistic scheduling: Assume new nodes will arrive
   - Backfill: Schedule what fits now, queue rest for new nodes
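
Per-tenant admission control is commonly built on a token bucket, sketched below. Mapping the burst allowance to a bucket capacity is an assumption; the class and parameter names are illustrative.

```python
class TokenBucket:
    """Rate limiter: steady `rate` tokens/s with a burst of `capacity` tokens."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity   # start full so an initial burst is allowed
        self.updated = now

    def allow(self, now, cost=1):
        """Return True and consume tokens if the request fits, else False.

        `now` is a timestamp in seconds supplied by the caller.
        """
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A rejected request would map to the 429 response with Retry-After described in strategy 1.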

Q5: "How would you design chargeback/showback for resource usage?"

Detailed Answer:

Chargeback Model:

1. Metering (What to measure):
   - CPU-seconds consumed (actual usage, not just requested)
   - Memory-byte-seconds (memory in use integrated over time; peak memory × duration is a simpler upper bound)
   - GPU-seconds (allocated GPU × duration)
   - Storage-byte-seconds (provisioned storage × duration)
   - Network-bytes (ingress + egress)
   - License-seconds (license checkout duration)

2. Rating (How to price):
   - Base rate per resource-unit-hour
   - Priority multiplier (critical = 3x, normal = 1x, preemptible = 0.2x)
   - Time-of-day multiplier (peak hours = 1.5x, off-peak = 0.8x)
   - Commitment discount (reserved capacity = 0.6x)
   - Spot discount (running on spot capacity = 0.3x; separate from the priority multiplier)

3. Billing Granularity:
   - Minimum billing increment: 1 second (like per-second EC2 billing; AWS Lambda bills per millisecond)
   - Aggregation: Hourly for reporting, monthly for billing
   - Rounding: Round up to nearest second

4. Waste Detection:
   - Requested but unused: (requested - actual_usage) / requested
   - Alert if waste > 50% for > 1 hour
   - Recommend right-sizing based on actual usage patterns
   - Show "potential savings" in dashboard

5. Implementation:
   - Metrics pipeline: cAdvisor → Prometheus → Usage Aggregator
   - Storage: ClickHouse for high-cardinality usage data
   - Reporting: Daily/weekly/monthly cost reports per tenant
   - API: GET /api/v1/billing/tenants/{id}/usage?period=monthly
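
The rating and waste-detection steps can be sketched as two small functions. The multipliers come from the rating list above; `rate_per_gpu_hour` is a hypothetical internal rate, and only the priority and time-of-day multipliers are applied here.

```python
PRIORITY_MULTIPLIER = {"critical": 3.0, "normal": 1.0, "preemptible": 0.2}


def gpu_charge(gpu_seconds, rate_per_gpu_hour, priority="normal", peak=False):
    """Charge for GPU time using priority and time-of-day multipliers."""
    hours = gpu_seconds / 3600
    time_multiplier = 1.5 if peak else 0.8
    return hours * rate_per_gpu_hour * PRIORITY_MULTIPLIER[priority] * time_multiplier


def waste_ratio(requested, actual_usage):
    """(requested - actual_usage) / requested, clamped at 0.

    Returns 0.0 when nothing was requested.
    """
    if requested <= 0:
        return 0.0
    return max(0.0, (requested - actual_usage) / requested)
```

A tenant that requests 8 cores and uses 2 has a waste ratio of 0.75, which would trip the 50% alert threshold.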

Additional Variations for Deep Dives

Variation 9: Heterogeneous Hardware Scheduling

- Mix of CPU architectures (x86, ARM, RISC-V)
- Specialized accelerators (TPU, Inferentia, Trainium)
- Different memory technologies (DDR5, HBM, CXL)
- Scheduling must match workload to optimal hardware

Variation 10: Energy-Aware Scheduling

- Schedule workloads when renewable energy is available
- Consolidate during low-demand to power down nodes
- Carbon-aware scheduling (prefer low-carbon regions)
- Power capping: Limit total cluster power consumption

Variation 11: Confidential Computing Scheduling

- TEE (Trusted Execution Environment) allocation
- SGX enclaves: Limited EPC memory per node
- SEV-SNP VMs: Attestation before scheduling
- Scheduling must verify hardware security capabilities

These variations demonstrate the breadth of resource allocation challenges and provide rich material for interview discussions at any depth.