Resource Allocation Service - Variations and Follow-ups
Overview
Resource allocation is a broad domain with many specialized variations. This section covers common extensions asked in system design interviews, each representing a distinct set of challenges beyond the core scheduling problem. We also include detailed answers to frequently asked follow-up questions.
Variation 1: GPU/TPU Scheduling for ML Workloads
Unique Challenges
GPU scheduling differs fundamentally from CPU scheduling:
Key Differences from CPU Scheduling:
1. Discrete allocation: GPUs are whole devices (or MIG partitions), not divisible like CPU millicores
2. Topology matters: GPU-to-GPU communication over NVLink is roughly 10x faster than over PCIe
3. Memory is fixed: GPU memory (e.g., 80GB on an A100) cannot be overcommitted
4. Cost: 1 GPU-hour can cost on the order of 100x a CPU core-hour
5. Long-running: Training jobs run for hours/days (not seconds)
6. Gang scheduling: Distributed training needs all GPUs simultaneously
NVIDIA MIG (Multi-Instance GPU) Scheduling
A100 80GB can be partitioned into MIG instances:
- 7x 1g.10gb (7 instances, 10GB each)
- 3x 2g.20gb + 1x 1g.10gb
- 2x 3g.40gb
- 1x 7g.80gb (full GPU)
Scheduling Implications:
- Must track MIG profiles per GPU, not just "GPU count"
- Reconfiguring MIG requires GPU reset (disruptive)
- Different workloads need different profiles
- Scheduler must match workload to available MIG profile
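The matching step above can be sketched as a best-fit search over free MIG instances. This is a minimal illustration, not an NVIDIA or Kubernetes API: `MigInstance`, `find_instance`, `allocate`, and the "smallest adequate profile" rule are all hypothetical names and choices made for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Memory (GB) per A100-80GB MIG profile, read off the profile names above.
PROFILE_MEMORY_GB = {"1g.10gb": 10, "2g.20gb": 20, "3g.40gb": 40, "7g.80gb": 80}


@dataclass
class MigInstance:
    node_id: str
    gpu_id: int
    profile: str
    allocated_to: Optional[str] = None  # job id, or None when free


def find_instance(instances, required_gb):
    """Return the smallest free instance with enough memory, or None."""
    candidates = [
        inst for inst in instances
        if inst.allocated_to is None
        and PROFILE_MEMORY_GB[inst.profile] >= required_gb
    ]
    # Best fit (an assumed policy): the smallest adequate profile
    # keeps larger instances free for larger jobs.
    return min(candidates, key=lambda i: PROFILE_MEMORY_GB[i.profile], default=None)


def allocate(instances, job_id, required_gb):
    """Bind job_id to a matching instance; return it, or None if nothing fits."""
    inst = find_instance(instances, required_gb)
    if inst is not None:
        inst.allocated_to = job_id
    return inst
```

A job that needs 15GB would land on a free `2g.20gb` instance rather than a `3g.40gb` one, which is the point of tracking profiles instead of a bare GPU count.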
Schema Extension:
{
  "node_id": "gpu-node-01",
  "gpus": [
    {
      "gpu_id": 0,
      "model": "A100-80GB",
      "mig_mode": true,
      "instances": [
        {"profile": "3g.40gb", "allocated_to": "job-123"},
        {"profile": "2g.20gb", "allocated_to": null},
        {"profile": "1g.10gb", "allocated_to": "job-456"}
      ]
    }
  ]
}
Topology-Aware GPU Scheduling
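Example 8-GPU NVLink Topology (illustrative ladder layout; on NVSwitch-based systems such as the DGX A100, every GPU pair gets full NVLink bandwidth):
GPU 0 ←NVLink→ GPU 1 ←NVLink→ GPU 2 ←NVLink→ GPU 3
  ↕ NVLink       ↕ NVLink       ↕ NVLink       ↕ NVLink
GPU 4 ←NVLink→ GPU 5 ←NVLink→ GPU 6 ←NVLink→ GPU 7
Scheduling Rule: For multi-GPU jobs, prefer GPUs connected via NVLink
- 2-GPU job: Allocate GPU 0+1 (directly NVLink connected), NOT GPU 0+7 (no direct NVLink in this layout, so traffic falls back to PCIe)
- 4-GPU job: Allocate GPU 0+1+4+5 (an NVLink-connected block)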
- 8-GPU job: Full node allocation
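The scheduling rule can be sketched as a search for the GPU subset with the most direct NVLink pairs. The edge set below encodes the ladder-style links drawn in the diagram and is an assumption of this example; `NVLINK_EDGES`, `nvlink_count`, and `pick_gpus` are hypothetical names, and a real scheduler would read the topology from the driver rather than hard-code it.

```python
from itertools import combinations

# Direct NVLink pairs of the ladder layout in the diagram (illustrative).
NVLINK_EDGES = {
    (0, 1), (1, 2), (2, 3),
    (4, 5), (5, 6), (6, 7),
    (0, 4), (1, 5), (2, 6), (3, 7),
}


def nvlink_count(gpus):
    """Number of GPU pairs in the set that share a direct NVLink."""
    return sum(1 for pair in combinations(sorted(gpus), 2) if pair in NVLINK_EDGES)


def pick_gpus(free_gpus, k):
    """Choose k free GPUs maximizing direct NVLink pairs; None if too few are free."""
    free = sorted(free_gpus)
    if len(free) < k:
        return None
    # Exhaustive search is cheap for 8 GPUs (at most 70 subsets when k = 4).
    return max(combinations(free, k), key=nvlink_count)
```

With all eight GPUs free, `pick_gpus(range(8), 2)` selects GPUs 0 and 1, and `pick_gpus(range(8), 4)` selects the 0, 1, 4, 5 block, matching the examples in the rule.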
Performance Impact:
- NVLink (A100): 600 GB/s bidirectional
- PCIe Gen4 x16: ~32 GB/s per direction (~64 GB/s bidirectional)
- Wrong topology: 2-5x slower for communication-heavy training
Variation 2: Spot/Preemptible Instance Management
Design for Spot Capacity
Spot Instance Characteristics:
- 60-90% cheaper than on-demand
- Can be reclaimed with a 2-minute warning (AWS) or a 30-second warning (GCP)
- Availability varies by instance type, zone, and time
- Best for: fault-tolerant, checkpointable workloads
Architecture:
┌─────────────────────────────────────────────┐
│ Spot Instance Manager │
├─────────────┬──────────────┬────────────────┤
│ Price │ Availability │ Interruption │
│ Tracker │ Predictor │ Handler │
└─────────────┴──────────────┴────────────────┘
Price Tracker:
- Monitor spot prices across instance types and zones
- Maintain price history for prediction
- Alert when prices approach on-demand threshold
Availability Predictor:
- ML model trained on historical interruption patterns
- Predict: "p4d.24xlarge in us-east-1a has 85% chance of lasting 4+ hours"
- Input features: time of day, day of week, recent interruption rate
Interruption Handler:
1. Receive interruption notice (2 min warning)
2. Trigger workload checkpoint (save training state)
3. Attempt migration to another spot instance
4. Fall back to on-demand if no spot available
5. Resume from checkpoint on new instance
Spot-Aware Scheduling Policy
Job Classification:
- Spot-eligible: Fault-tolerant, checkpointable (batch, training)
- Spot-ineligible: Stateful, latency-sensitive (databases, APIs)
Scheduling Strategy:
1. Place spot-eligible jobs on spot instances first
2. Maintain minimum on-demand capacity for spot-ineligible
3. Diversify across instance types (reduce correlated interruptions)
4. Keep 10-20% on-demand buffer for spot fallback
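A minimal sketch of the classification and placement strategy above. `SpotJob`, `spot_eligible`, and `choose_capacity` are hypothetical names, and modeling the on-demand reserve as a count of slots held back for spot-ineligible work is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass
class SpotJob:
    name: str
    checkpointable: bool
    latency_sensitive: bool
    stateful: bool


def spot_eligible(job):
    """Fault-tolerant, checkpointable work may run on spot capacity."""
    return job.checkpointable and not (job.latency_sensitive or job.stateful)


def choose_capacity(job, spot_free, on_demand_free, on_demand_reserved):
    """Return "spot", "on-demand", or None (meaning: queue the job).

    on_demand_reserved is the number of on-demand slots held back for
    spot-ineligible work; spot-eligible jobs may only fall back to the
    on-demand slots above that reserve.
    """
    if spot_eligible(job):
        if spot_free > 0:
            return "spot"
        if on_demand_free > on_demand_reserved:
            return "on-demand"  # fallback buffer
        return None
    return "on-demand" if on_demand_free > 0 else None
```

The reserve check is what keeps a burst of batch jobs from consuming the on-demand capacity that databases and APIs depend on.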
Cost Optimization:
- Spot fleet: Mix of instance types to reduce interruption risk
- Bid strategy: Set the maximum price at the on-demand price (you still pay the market price, and availability is maximized)
- Zone diversification: Spread across 3+ AZs
- Instance diversification: Use 5+ instance types per workload
Variation 3: Multi-Cloud Resource Allocation
Cross-Cloud Scheduling
Architecture:
┌─────────────────────────────────────────────────────┐
│ Multi-Cloud Resource Manager │
├──────────────┬──────────────┬────────────────────────┤
│ AWS Provider │ GCP Provider │ Azure Provider │
│ - EC2/EKS │ - GCE/GKE │ - VMs/AKS │
│ - Spot/RI │ - Preemptible│ - Spot/Reserved │
└──────────────┴──────────────┴────────────────────────┘
Abstraction Layer:
- Normalize resource types across clouds
- Unified API regardless of underlying provider
- Handle provider-specific quirks (naming, limits, APIs)
Scheduling Decisions:
1. Cost: Which cloud is cheapest for this workload right now?
2. Latency: Which cloud is closest to the data?
3. Compliance: Which clouds meet data residency requirements?
4. Availability: Which cloud has capacity for this instance type?
5. Lock-in: Avoid concentrating too much on one provider
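One way to combine these five decisions is to treat compliance and availability as hard filters and fold cost, latency, and lock-in into a weighted penalty. The sketch below assumes that split; `CloudOffer`, `choose_cloud`, and the default weights are hypothetical and would need tuning for a real fleet.

```python
from dataclasses import dataclass


@dataclass
class CloudOffer:
    provider: str
    hourly_cost: float    # USD per hour for the requested shape
    latency_ms: float     # network latency to the dataset
    compliant: bool       # meets data-residency requirements
    has_capacity: bool    # instance type currently available
    current_share: float  # fraction of the fleet already on this provider


def choose_cloud(offers, w_cost=0.5, w_latency=0.3, w_lockin=0.2):
    """Filter on hard constraints, then pick the lowest weighted penalty."""
    eligible = [o for o in offers if o.compliant and o.has_capacity]
    if not eligible:
        return None
    # Normalize by the worst eligible value; "or 1.0" avoids dividing by zero.
    max_cost = max(o.hourly_cost for o in eligible) or 1.0
    max_latency = max(o.latency_ms for o in eligible) or 1.0

    def penalty(offer):
        # Every term lies in [0, 1]; lower is better.
        return (w_cost * offer.hourly_cost / max_cost
                + w_latency * offer.latency_ms / max_latency
                + w_lockin * offer.current_share)

    return min(eligible, key=penalty)
```

Keeping compliance out of the weighted sum matters: a cheap but non-compliant provider should never win on price.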
Challenges:
- Network latency between clouds: 20-100ms (vs <1ms within cloud)
- Data transfer costs: $0.01-0.09/GB between clouds
- Different APIs, instance types, pricing models
- Credential management across providers
- Inconsistent SLAs and support models
Variation 4: Serverless Resource Management
Cold Start Optimization
Serverless Scheduling Challenges:
- Cold start: 100ms-10s to provision new execution environment
- Scale to zero: No resources consumed when idle
- Burst: 0 to 10,000 concurrent executions in seconds
- Short-lived: Most invocations complete in <1 second
Cold Start Mitigation Strategies:
1. Warm Pool (Pre-provisioned):
- Maintain pool of pre-initialized containers
- Size based on predicted demand (ML model)
- Trade-off: cost of idle containers vs cold start latency
- Target: 95% of invocations served from warm pool
2. Snapshot/Restore (Firecracker):
- Snapshot initialized VM state to disk
- Restore from snapshot in 5-50ms (vs 500ms+ full boot)
- Used by AWS Lambda (Firecracker microVMs)
- Trade-off: snapshot storage cost vs boot time
3. Predictive Scaling:
- Analyze invocation patterns (time of day, day of week)
- Pre-warm containers before predicted traffic spike
- Example: Pre-warm 100 containers at 8:55am for 9am traffic
4. Tiered Warm Pools:
- Hot: Running, ready to serve (0ms overhead)
- Warm: Paused, resume in 5-10ms
- Cold: Snapshot on disk, restore in 50-100ms
- Frozen: No resources, full cold start 500ms+
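The tiered lookup can be sketched as "take an environment from the fastest non-empty tier". `TieredPool` and its environment ids are hypothetical, and the overheads are the upper ends of the ranges listed above, with 500 ms standing in for the open-ended full cold start.

```python
from collections import deque

# Tier order and startup overhead in ms (upper bounds of the ranges above).
TIERS = [("hot", 0), ("warm", 10), ("cold", 100)]
FULL_COLD_START_MS = 500  # stand-in for "500ms+"


class TieredPool:
    def __init__(self, hot=0, warm=0, cold=0):
        counts = {"hot": hot, "warm": warm, "cold": cold}
        # One queue of pre-provisioned environment ids per tier.
        self.pools = {
            name: deque(f"{name}-{i}" for i in range(counts[name]))
            for name, _ in TIERS
        }

    def acquire(self):
        """Return (environment_id, tier, overhead_ms) from the fastest non-empty tier."""
        for name, overhead_ms in TIERS:
            if self.pools[name]:
                return self.pools[name].popleft(), name, overhead_ms
        # Every pool is empty: pay the full cold start.
        return "fresh", "frozen", FULL_COLD_START_MS
```

A production pool would also refill and demote environments over time; this sketch only shows the acquisition path.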
Resource Allocation for Serverless:
- CPU: Time-sliced (128ms-900s per invocation)
- Memory: Fixed per function (128MB-10GB)
- Concurrency: Account-level limit per region (1000 default on AWS Lambda), optionally reserved per function
- Burst: Account-level burst limit (3000 concurrent)
Variation 5: Network Bandwidth Allocation
Traffic Shaping and QoS
Network Resource Model:
- Bandwidth: Guaranteed minimum + burst maximum
- Latency: Priority queuing for latency-sensitive traffic
- Connections: Rate limiting on new connections/sec
Allocation Tiers:
┌─────────────────────────────────────────────┐
│ Priority 1: Control Plane (scheduler, etcd) │ Guaranteed 1 Gbps
├─────────────────────────────────────────────┤
│ Priority 2: Production Services (APIs) │ Guaranteed 5 Gbps
├─────────────────────────────────────────────┤
│ Priority 3: Batch Data Transfer │ Guaranteed 2 Gbps
├─────────────────────────────────────────────┤
│ Priority 4: Best-Effort (backups, logs) │ No guarantee
└─────────────────────────────────────────────┘
Total node bandwidth: 25 Gbps
Guaranteed sum: 8 Gbps (32% reserved)
Remaining 17 Gbps: Shared proportionally by weight
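The arithmetic above can be sketched as "guarantee plus a weight-proportional share of the spare capacity". The weights in this example are assumptions (the tiers above do not specify them), and `allocate_bandwidth` is a hypothetical helper that describes the steady state when every class is saturating its share.

```python
def allocate_bandwidth(total_gbps, guaranteed, weights):
    """Give each class its guarantee plus a weight-proportional share of the rest.

    Assumes at least one weight is positive.
    """
    spare = total_gbps - sum(guaranteed.values())
    if spare < 0:
        raise ValueError("guarantees exceed link capacity")
    total_weight = sum(weights.values())
    return {
        name: guaranteed.get(name, 0.0) + spare * weight / total_weight
        for name, weight in weights.items()
    }


guaranteed = {"control": 1.0, "production": 5.0, "batch": 2.0, "best_effort": 0.0}
# Assumed weights for sharing the 17 Gbps that is not reserved.
weights = {"control": 1, "production": 4, "batch": 3, "best_effort": 2}
shares = allocate_bandwidth(25.0, guaranteed, weights)
```

With these assumed weights, production ends up with 5 + 17 × 4/10 = 11.8 Gbps and best-effort traffic with 3.4 Gbps when all four classes are busy.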
Implementation:
- Linux TC (Traffic Control) with HTB (Hierarchical Token Bucket)
- Kubernetes: Bandwidth plugin with annotations
- Cloud: Per-instance bandwidth limits by instance type, with VPC flow logs for visibility (security groups filter traffic but do not rate limit it)
- SDN: OpenFlow rules for per-tenant bandwidth enforcement
Variation 6: Storage Tiering and Data Placement
Tiered Storage Allocation
Storage Tiers:
┌──────────────────────────────────────────────────────┐
│ Tier 0: NVMe SSD (local) │ 1M IOPS, 0.1ms latency │
│ Tier 1: Network SSD (EBS) │ 64K IOPS, 1ms latency │
│ Tier 2: HDD (bulk) │ 500 IOPS, 10ms latency │
│ Tier 3: Object Store (S3) │ 5K req/s, 50ms latency │
└──────────────────────────────────────────────────────┘
Data Placement Scheduler:
- Hot data (accessed in last hour): Tier 0/1
- Warm data (accessed in last week): Tier 1/2
- Cold data (accessed in last month): Tier 2/3
- Archive (rarely accessed): Tier 3 with lifecycle policy
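The placement rules above can be sketched as a mapping from last-access age to a target tier, plus a pass that lists the objects whose current tier differs. Where the rules allow two tiers (for example "Tier 0/1"), this sketch picks the faster one, which is an assumption; `target_tier` and `plan_migrations` are hypothetical names.

```python
from datetime import datetime, timedelta


def target_tier(last_access, now):
    """Map data temperature to a storage tier number (0 = fastest)."""
    age = now - last_access
    if age <= timedelta(hours=1):
        return 0  # hot
    if age <= timedelta(weeks=1):
        return 1  # warm
    if age <= timedelta(days=30):
        return 2  # cold
    return 3      # archive


def plan_migrations(objects, now):
    """objects: iterable of (key, current_tier, last_access).

    Returns (key, current_tier, target_tier) for every object that should move.
    """
    moves = []
    for key, current, last_access in objects:
        target = target_tier(last_access, now)
        if target != current:
            moves.append((key, current, target))
    return moves
```

The background migration process mentioned below would consume a plan like this, throttled so that tier moves do not compete with foreground I/O.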
Scheduling Considerations:
- Data locality: Schedule compute near the data (avoid network transfer)
- Replication: Ensure data replicas are in different failure domains
- Migration: Background process moves data between tiers
- Capacity: Each tier has limited capacity, must balance across tiers
Co-location Scheduling:
1. Job needs dataset X (100GB, stored on nodes 5, 12, 23)
2. Scheduler prefers nodes 5, 12, 23 (data locality score boost)
3. If those nodes are full: schedule nearby (same rack) and stream data
4. Last resort: schedule anywhere and copy data (expensive)
Variation 7: License Management
Floating License Allocation
License Types:
1. Node-locked: Tied to specific machine (no scheduling needed)
2. Floating: Pool of N licenses, any N users can use simultaneously
3. Named-user: Tied to specific user (but any machine)
4. Concurrent: Max N simultaneous sessions
Floating License Scheduler:
┌─────────────────────────────────────────┐
│ License Manager Service │
├─────────────────────────────────────────┤
│ License Pool: │
│ MATLAB: 50 total, 42 checked out │
│ Ansys: 20 total, 18 checked out │
│ Cadence: 10 total, 10 checked out │ ← FULL
└─────────────────────────────────────────┘
Checkout Flow:
1. Job requires license: "needs 1x MATLAB license"
2. Scheduler checks: license available? (42 < 50: yes)
3. Scheduler atomically: allocate compute + checkout license
4. Job runs with license held
5. Job completes: release compute + checkin license
Challenges:
- License server is SPOF (need HA license server)
- Stale checkouts: job crashes without releasing license
Solution: Heartbeat-based lease (auto-release after 5 min no heartbeat)
- Priority: Critical simulation needs license, but all checked out
Solution: Preempt lowest-priority license holder
- Cost: Licenses are expensive ($10K-100K/year each)
Solution: Usage tracking, right-sizing license pool
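The checkout flow and the heartbeat-based lease can be sketched together as a small in-memory pool. This is a single-process illustration under stated assumptions: `LicensePool` is a hypothetical class, the 300-second lease mirrors the 5-minute example above, and a real deployment would need the checkout to be atomic with the compute allocation and backed by an HA license server.

```python
import time


class LicensePool:
    def __init__(self, total, lease_seconds=300.0, clock=time.monotonic):
        self.total = total
        self.lease_seconds = lease_seconds
        self.clock = clock     # injectable so the lease can be tested
        self.holders = {}      # job_id -> time of last heartbeat

    def _reap(self):
        """Release licenses whose holders stopped heartbeating."""
        now = self.clock()
        stale = [job for job, seen in self.holders.items()
                 if now - seen > self.lease_seconds]
        for job_id in stale:
            del self.holders[job_id]

    def checkout(self, job_id):
        """Return True if job_id now holds a license."""
        self._reap()
        if job_id in self.holders or len(self.holders) >= self.total:
            return False
        self.holders[job_id] = self.clock()
        return True

    def heartbeat(self, job_id):
        """Renew the lease; returns False if the license was already reclaimed."""
        self._reap()
        if job_id not in self.holders:
            return False
        self.holders[job_id] = self.clock()
        return True

    def checkin(self, job_id):
        self.holders.pop(job_id, None)
```

A crashed job simply stops calling `heartbeat`, so its license returns to the pool the next time anyone calls `checkout` after the lease has expired.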
Integration with Resource Scheduler:
- Treat licenses as extended resources: "licenses/matlab: 1"
- Scheduler filters nodes where license server is reachable
- Quota system includes license counts per tenant
- Chargeback based on license-hours consumed
Variation 8: Real-Time vs Batch Scheduling
Real-Time Scheduling Requirements
Real-Time Workloads:
- Latency SLA: <10ms response time
- Jitter: <1ms variation
- Availability: 99.99% (~52.6 min downtime/year)
- Examples: Trading systems, autonomous vehicles, industrial control
Real-Time Scheduling Constraints:
- CPU isolation: Dedicated cores (no sharing, no context switching)
- Memory: Pinned (no swapping, no NUMA migration)
- Interrupts: IRQ affinity (route interrupts away from RT cores)
- Kernel: PREEMPT_RT patch or real-time kernel
- Network: DPDK/RDMA (bypass kernel network stack)
Resource Model for RT:
{
  "requests": {
    "cpu": "4000m",
    "memory": "16Gi",
    "rt-cpu": "4",                         // 4 dedicated cores (not shared)
    "hugepages-1Gi": "8",                  // 8 x 1GB hugepages (pinned memory)
    "rdma/hca": "1"                        // 1 RDMA network interface
  },
  "scheduling": {
    "cpuPolicy": "static",                 // Exclusive CPU cores
    "memoryPolicy": "static",              // NUMA-pinned memory
    "topologyPolicy": "single-numa-node"   // All resources from same NUMA
  }
}
Batch Scheduling Optimizations
Batch Workload Characteristics:
- Throughput-oriented (not latency-sensitive)
- Can tolerate queuing (minutes to hours acceptable)
- Often large (hundreds of tasks per job)
- Checkpointable (can resume after interruption)
Batch-Specific Optimizations:
1. Backfill scheduling: Fill gaps with small jobs while waiting for large job
2. Job packing: Group similar jobs for better bin packing
3. Deadline-aware: Schedule based on job deadline, not arrival time
4. Preemptible: Batch jobs yield to interactive workloads
5. Spot-friendly: Use cheap preemptible instances
Backfill Algorithm:
Queue: [Large-Job (needs 100 GPU, ETA: 2 hours), Small-Job (needs 2 GPU)]
Without backfill: Small-Job waits 2 hours behind Large-Job
With backfill: Small-Job runs immediately on available 2 GPU
Constraint: Small-Job must complete before Large-Job's resources are ready
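The backfill constraint can be sketched as a single pass over the jobs queued behind the head job. This is a simplified illustration that assumes the head job cannot start yet and that runtime estimates are trustworthy; `BatchJob` and `backfill` are hypothetical names, and production backfill schedulers track reservations per node rather than one cluster-wide GPU count.

```python
from dataclasses import dataclass


@dataclass
class BatchJob:
    name: str
    gpus: int
    runtime_hours: float  # user-supplied estimate


def backfill(queue, free_gpus, head_start_in_hours):
    """Pick jobs behind the queue head that fit in idle GPUs and are
    expected to finish before the head job's resources are ready."""
    if not queue:
        return []
    started = []
    for job in queue[1:]:
        fits = job.gpus <= free_gpus
        finishes_in_time = job.runtime_hours <= head_start_in_hours
        if fits and finishes_in_time:
            started.append(job)
            free_gpus -= job.gpus
    return started
```

For the queue above, with 2 GPUs idle and the large job 2 hours from starting, a 2-GPU job estimated at 1 hour is started immediately, while anything estimated to run longer than 2 hours keeps waiting.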
Backfill can improve utilization by 15-30% for batch-heavy clusters
Common Interview Follow-Up Questions
Q1: "How do you prevent resource starvation?"
Detailed Answer:
Starvation occurs when a job waits indefinitely because higher-priority jobs
always consume available resources.
Prevention Strategies:
1. Priority Aging:
- Increase effective priority by 1 every 60 seconds of wait time
- After 1 hour: +60 priority boost
- Eventually, any job becomes high enough priority to schedule
- Cap: Aged priority cannot exceed system-critical level
2. Guaranteed Minimum Quota:
- Each tenant guaranteed minimum resources regardless of priority
- Example: "ml-team always gets at least 10 GPU"
- Guaranteed resources cannot be preempted
- Prevents complete starvation of any tenant
3. Fair-Share Within Priority Bands:
- Within same priority level, use DRF (Dominant Resource Fairness)
- No single tenant can monopolize a priority band
- Ensures proportional sharing among equals
4. Starvation Detection + Alert:
- Monitor: jobs waiting > 10 minutes at each priority level
- Alert: if any job waits > 30 minutes
- Auto-action: boost priority or trigger preemption review
- Dashboard: queue depth and wait time per tenant
5. Maximum Resource Holding Time:
- Low-priority jobs have TTL (e.g., 24 hours max)
- After TTL: job terminated, resources freed
- Prevents indefinite resource holding by low-priority work
Q2: "How do you handle resource fragmentation?"
Detailed Answer:
Fragmentation: Resources exist but are scattered across nodes in unusable chunks.
Detection:
- Metric: "Allocatable but unschedulable resources"
- Example: 100 GPU free across cluster, but no single node has 8 free
- Fragmentation ratio = unschedulable_free / total_free
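The ratio can be computed per request size. The sketch below assumes one interpretation of "unschedulable": free capacity on any node that cannot hold a single request of the given size. `fragmentation_ratio` is a hypothetical helper, and the node counts in the usage lines are made up to reproduce the 100-GPU example.

```python
def fragmentation_ratio(free_per_node, request_size):
    """unschedulable_free / total_free for a given per-node request size."""
    total_free = sum(free_per_node)
    if total_free == 0:
        return 0.0
    # Free units on a node too small for one request count as unschedulable.
    unschedulable = sum(free for free in free_per_node if free < request_size)
    return unschedulable / total_free


# 25 nodes with 4 free GPUs each: 100 GPUs free, none usable by an 8-GPU job.
fully_fragmented = fragmentation_ratio([4] * 25, request_size=8)
# One node can take the job; the other half of the free GPUs cannot.
half_fragmented = fragmentation_ratio([8, 4, 4], request_size=8)
```

Tracking this ratio for the largest common request size (for example, 8 GPUs per node) gives an early signal that defragmentation is worth its disruption cost.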
Mitigation Strategies:
1. Proactive Bin Packing:
- Score function penalizes creating small fragments
- Prefer nodes that result in large contiguous free blocks
- Example: Place a 2-GPU job on a node with 3 free GPUs (leaving 1 free)
rather than on a node with 8 free (which would break up a large block)
2. Descheduler (Periodic Defragmentation):
- Run every 10 minutes
- Identify "badly placed" pods (creating fragmentation)
- Evict and reschedule to consolidate free space
- Respect PodDisruptionBudgets and cap evictions (e.g., max 1 eviction per cycle)
- Priority: Only move low-priority, short-running pods
3. Compaction (Live Migration):
- For VMs: Live migrate to consolidate
- For containers: Checkpoint + restore on new node
- Expensive but effective for persistent fragmentation
- Schedule during low-traffic windows
4. Reserved Blocks:
- Reserve contiguous blocks for large jobs
- Gradually migrate small jobs off reserved blocks
- Ensures large gang-scheduled jobs can always be placed
Q3: "How do you handle a scheduler failure?"
Detailed Answer:
Failure Modes and Recovery:
1. Scheduler Process Crash:
- Detection: Health check fails (2 missed heartbeats = 20 seconds)
- Recovery: Standby scheduler promoted (leader election via etcd)
- Impact: 20-30 second scheduling pause
- Running jobs: Unaffected (already placed on nodes)
- Pending jobs: Delayed until new leader elected
2. Scheduler State Corruption:
- Detection: Consistency check fails (state vs etcd mismatch)
- Recovery: Rebuild in-memory state from etcd (source of truth)
- Time: 30-60 seconds for 50K node cluster
- Impact: Scheduling pause during rebuild
3. etcd Cluster Failure:
- Detection: etcd health check fails
- Impact: No new scheduling decisions (can't persist bindings)
- Running jobs: Continue running (nodes have local state)
- Recovery: etcd cluster recovery from backup
- Mitigation: etcd runs on dedicated, redundant hardware
4. Split-Brain (Network Partition):
- Scenario: Scheduler can reach some nodes but not etcd
- Behavior: Scheduler stops making decisions (can't persist)
- Fencing: Old leader must be fenced before new leader acts
- Implementation: Lease-based leadership (expires automatically)
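Lease-based leadership can be sketched with an in-memory stand-in for the lease record. This is an illustration only: `LeaseLock` is a hypothetical class, the 15-second TTL is borrowed from the HA design in this answer, and a real implementation would use the coordination store's own leases and compare-and-swap transactions and would have to account for clock skew.

```python
import time


class LeaseLock:
    """In-memory stand-in for a lease held in a coordination service."""

    def __init__(self, ttl_seconds=15.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested
        self.holder = None
        self.expires_at = 0.0

    def try_acquire(self, candidate):
        """Acquire or renew the lease; returns True if candidate is now leader."""
        now = self.clock()
        if self.holder in (None, candidate) or now >= self.expires_at:
            self.holder = candidate
            self.expires_at = now + self.ttl
            return True
        return False

    def is_leader(self, candidate):
        """A scheduler should check this before persisting any binding."""
        return self.holder == candidate and self.clock() < self.expires_at
```

Because the lease expires on its own, a partitioned leader that cannot renew stops being leader without any action from the standby, which is the fencing property described above.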
High Availability Design:
- 3 scheduler replicas (1 active, 2 standby)
- Leader election via etcd lease (15-second TTL)
- Failover time: < 30 seconds
- State reconstruction: < 60 seconds
- Zero impact on running workloads during failover
Q4: "How do you handle a sudden spike in allocation requests?"
Detailed Answer:
Spike Scenarios:
- Auto-scaler triggers: 1000 new pods in 10 seconds
- Deployment rollout: 500 pods replacing 500 pods
- Disaster recovery: Entire zone fails, 10,000 pods need rescheduling
Handling Strategies:
1. Request Queuing with Backpressure:
- Queue depth limit: 10,000 pending requests
- Beyond limit: Return 429 (Too Many Requests) with Retry-After
- Priority queue: Critical requests processed first during spike
2. Batch Scheduling:
- Group similar requests (same resource profile)
- Schedule batch in single pass (amortize filtering cost)
- 10x throughput improvement for homogeneous batches
3. Admission Control:
- Rate limit per tenant: 100 requests/sec
- Global rate limit: 5000 requests/sec
- Burst allowance: 10x rate for 30 seconds
- Graceful degradation: Simplify scoring during overload
4. Precomputed Scheduling:
- For known patterns (deployments), pre-compute placement
- Cache node scores (valid for 5 seconds)
- Skip scoring for pods identical to recently scheduled pods
- "Equivalence classes": Group identical pods, schedule once
5. Cluster Autoscaler Integration:
- If spike exceeds cluster capacity: trigger node provisioning
- Optimistic scheduling: Assume new nodes will arrive
- Backfill: Schedule what fits now, queue rest for new nodes
Q5: "How would you design chargeback/showback for resource usage?"
Detailed Answer:
Chargeback Model:
1. Metering (What to measure):
- CPU-seconds consumed (actual usage, not just requested)
- Memory-byte-seconds (peak memory × duration)
- GPU-seconds (allocated GPU × duration)
- Storage-byte-seconds (provisioned storage × duration)
- Network-bytes (ingress + egress)
- License-seconds (license checkout duration)
2. Rating (How to price):
- Base rate per resource-unit-hour
- Priority multiplier (critical = 3x, normal = 1x, preemptible = 0.2x)
- Time-of-day multiplier (peak hours = 1.5x, off-peak = 0.8x)
- Commitment discount (reserved capacity = 0.6x)
- Spot discount (workloads running on spot capacity = 0.3x)
3. Billing Granularity:
- Minimum billing increment: 1 second (similar to per-second billing on AWS EC2)
- Aggregation: Hourly for reporting, monthly for billing
- Rounding: Round up to nearest second
4. Waste Detection:
- Requested but unused: (requested - actual_usage) / requested
- Alert if waste > 50% for > 1 hour
- Recommend right-sizing based on actual usage patterns
- Show "potential savings" in dashboard
5. Implementation:
- Metrics pipeline: cAdvisor → Prometheus → Usage Aggregator
- Storage: ClickHouse for high-cardinality usage data
- Reporting: Daily/weekly/monthly cost reports per tenant
- API: GET /api/v1/billing/tenants/{id}/usage?period=monthly
Additional Variations for Deep Dives
Variation 9: Heterogeneous Hardware Scheduling
- Mix of CPU architectures (x86, ARM, RISC-V)
- Specialized accelerators (TPU, Inferentia, Trainium)
- Different memory technologies (DDR5, HBM, CXL)
- Scheduling must match workload to optimal hardware
Variation 10: Energy-Aware Scheduling
- Schedule workloads when renewable energy is available
- Consolidate during low-demand to power down nodes
- Carbon-aware scheduling (prefer low-carbon regions)
- Power capping: Limit total cluster power consumption
Variation 11: Confidential Computing Scheduling
- TEE (Trusted Execution Environment) allocation
- SGX enclaves: Limited EPC memory per node
- SEV-SNP VMs: Attestation before scheduling
- Scheduling must verify hardware security capabilities
These variations demonstrate the breadth of resource allocation challenges and provide rich material for interview discussions at any depth.