Resource Allocation Service - Security and Privacy
Overview
A resource allocation service is a critical infrastructure component that controls access to compute, storage, and network resources. Security failures can lead to unauthorized resource consumption (financial impact), data breaches (through co-located workloads), denial of service (resource exhaustion), and compliance violations. This section covers security architecture for multi-tenant resource managers, drawing from practices in Kubernetes RBAC, AWS IAM, and Google Borg's security model.
Multi-Tenant Isolation
Resource Isolation Layers
Isolation Stack (Defense in Depth):
Layer 5: Quota Enforcement ← Prevent resource monopolization
Layer 4: Network Isolation ← Prevent cross-tenant communication
Layer 3: Runtime Isolation ← Prevent workload interference
Layer 2: Kernel Isolation ← Prevent privilege escalation
Layer 1: Hardware Isolation ← Prevent side-channel attacks
Implementation per Layer:
Layer 5 - Quota Enforcement:
- Hard quotas per tenant (cannot exceed)
- Rate limiting on API requests
- Admission webhooks reject over-quota requests
- Strongly consistent, real-time quota accounting (not eventually consistent)
Layer 4 - Network Isolation:
- Network policies (default deny between namespaces)
- Separate VLANs/VPCs per tenant (for strict isolation)
- Encrypted inter-pod communication (mTLS via service mesh)
- DNS isolation (tenants can't resolve other tenants' services)
Layer 3 - Runtime Isolation:
- Separate cgroups per tenant (CPU, memory, I/O limits)
- Seccomp profiles (restrict system calls)
- AppArmor/SELinux (mandatory access control)
- Read-only root filesystem
Layer 2 - Kernel Isolation:
- Namespaces (PID, network, mount, user, IPC, UTS)
- User namespaces (container root != host root)
- Capability dropping (no CAP_SYS_ADMIN)
- Seccomp-BPF (syscall filtering)
Layer 1 - Hardware Isolation:
- Dedicated nodes for sensitive tenants (no co-location)
- Hardware-backed TEE (Intel SGX, AMD SEV)
- IOMMU for device isolation
- Separate NUMA domains per tenant
Noisy Neighbor Prevention
Problem: One tenant's workload degrades another's performance.
CPU Noisy Neighbor:
- Cause: Tenant A uses all CPU, starving Tenant B
- Prevention: CFS bandwidth throttling (cpu.cfs_quota_us on cgroup v1, cpu.max on cgroup v2)
- Detection: Monitor CPU throttling metrics per tenant
- Mitigation: Guaranteed CPU by setting requests equal to limits (Kubernetes QoS class: Guaranteed)
Memory Noisy Neighbor:
- Cause: Tenant A causes memory pressure, triggering OOM for Tenant B
- Prevention: Memory limits enforced by cgroups (memory.max)
- Detection: Monitor memory pressure indicators (PSI)
- Mitigation: OOM score adjustment (protect high-priority tenants)
I/O Noisy Neighbor:
- Cause: Tenant A saturates disk I/O, slowing Tenant B
- Prevention: cgroup I/O controller for I/O weight and bandwidth limits (blkio on cgroup v1, io.weight/io.max on cgroup v2)
- Detection: Monitor I/O latency percentiles per tenant
- Mitigation: I/O scheduling priority (ionice)
Network Noisy Neighbor:
- Cause: Tenant A floods network, causing packet drops for Tenant B
- Prevention: Traffic shaping (tc/HTB per tenant)
- Detection: Monitor network queue drops per tenant
- Mitigation: Per-tenant bandwidth guarantees
Cache Noisy Neighbor (LLC - Last Level Cache):
- Cause: Tenant A evicts Tenant B's data from shared CPU cache
- Prevention: Intel CAT (Cache Allocation Technology)
- Detection: Monitor LLC miss rate per tenant
- Mitigation: Dedicated cache partitions for sensitive workloads
Tenant Isolation Levels
Level 1 - Soft Isolation (Default):
- Shared nodes, namespace separation
- Cgroup limits, network policies
- Suitable for: Internal teams, trusted tenants
- Cost: 1x (maximum resource sharing)
Level 2 - Medium Isolation:
- Dedicated node pools per tenant
- No workload co-location
- Shared control plane
- Suitable for: Different business units, compliance boundaries
- Cost: 1.3-1.5x (reduced bin packing efficiency)
Level 3 - Hard Isolation:
- Dedicated clusters per tenant
- Separate control planes
- Separate network infrastructure
- Suitable for: External customers, regulated industries
- Cost: 2-3x (dedicated infrastructure overhead)
Level 4 - Hardware Isolation:
- Dedicated physical hardware
- Air-gapped networks
- HSM-backed encryption
- Suitable for: Government, classified workloads
- Cost: 5-10x (no sharing whatsoever)
RBAC for Resource Management
Role-Based Access Control Model
RBAC Hierarchy:
Platform Admin (highest privilege; membership tightly restricted)
├── Can: Create/delete clusters, set global policies
├── Can: Override any quota, preempt any workload
└── Cannot: Access tenant data/workloads directly
Tenant Admin
├── Can: Manage quotas within tenant allocation
├── Can: Create/delete namespaces for their tenant
├── Can: Set priority classes within their tenant
└── Cannot: Exceed tenant-level quota, affect other tenants
Namespace Admin
├── Can: Deploy workloads in their namespace
├── Can: Set resource requests/limits within namespace quota
├── Can: View namespace usage metrics
└── Cannot: Modify namespace quota, access other namespaces
Developer
├── Can: Deploy workloads with pre-approved resource profiles
├── Can: View own workload status and logs
├── Can: Scale within pre-set limits
└── Cannot: Set priority above normal, request GPU without approval
Service Account (Automated)
├── Can: Perform specific operations (deploy, scale)
├── Can: Access specific resources (own namespace only)
├── Scoped: Time-limited tokens, specific API groups
└── Cannot: Escalate privileges, access secrets outside scope
RBAC Policy Examples
# Kubernetes-style RBAC policies
# Tenant Admin: Full control within their tenant's namespaces
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: tenant-admin
rules:
  - apiGroups: [""]
    resources: ["pods", "services", "configmaps", "secrets"]
    verbs: ["*"]
  - apiGroups: ["apps"]
    resources: ["deployments", "statefulsets"]
    verbs: ["*"]
  - apiGroups: ["scheduling.k8s.io"]
    resources: ["priorityclasses"]
    verbs: ["get", "list"] # Can view but not create priority classes
  - apiGroups: [""]
    resources: ["resourcequotas"]
    verbs: ["get", "list"] # Can view but not modify quotas
---
# Developer: Limited deployment capabilities
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: developer
  namespace: ml-team-dev
rules:
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "create", "update"]
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "delete"] # Can delete pods in this namespace (RBAC cannot scope to "own" pods)
  - apiGroups: [""]
    resources: ["pods/log", "pods/exec"]
    verbs: ["get", "create"] # Can view logs and exec into pods
# Resource limits enforced by LimitRange (not RBAC)
Admission Control (Policy Enforcement)
Admission Pipeline:
Request → Authentication → Authorization (RBAC) → Admission Controllers → Persist
Key Admission Controllers:
1. ResourceQuota Admission:
- Reject if allocation would exceed namespace quota
- Atomic: check + reserve in single transaction
- Prevents race conditions on quota
2. LimitRange Admission:
- Enforce min/max resource requests per pod
- Inject defaults if not specified
- Prevent: pods requesting 1000 CPU cores (typo protection)
3. PodSecurity Admission:
- Enforce security standards (restricted, baseline, privileged)
- Reject pods running as root (unless explicitly allowed)
- Reject privileged containers in non-system namespaces
4. Custom Webhook (Organization Policies):
- Require cost-center label on all allocations
- Enforce naming conventions
- Require approval for GPU requests > 8
- Block scheduling to production nodes from dev namespaces
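The ResourceQuota admission step above depends on checking and reserving quota as one indivisible operation. The following is a minimal Python sketch of that idea, not the service's actual implementation: `NamespaceQuota`, `QuotaExceeded`, and the in-memory counter are hypothetical names introduced for illustration, and a real service would use a transactional store rather than a process-local lock.

```python
import threading
from dataclasses import dataclass, field


class QuotaExceeded(Exception):
    """Raised when a reservation would push usage past the hard limit."""


@dataclass
class NamespaceQuota:
    # Hypothetical in-memory ledger for one resource in one namespace.
    limit: int
    used: int = 0
    _lock: threading.Lock = field(default_factory=threading.Lock, repr=False)

    def reserve(self, requested: int) -> int:
        """Check and reserve inside one critical section; return new usage."""
        if requested <= 0:
            raise ValueError("requested must be positive")
        with self._lock:
            # The check and the update share the lock, so two concurrent
            # requests cannot both pass the check against stale usage.
            if self.used + requested > self.limit:
                raise QuotaExceeded(
                    f"Quota exceeded: limit {self.limit}, "
                    f"used {self.used}, requested {requested}"
                )
            self.used += requested
            return self.used

    def release(self, amount: int) -> None:
        """Return previously reserved quota, never dropping below zero."""
        with self._lock:
            self.used = max(0, self.used - amount)
```

Holding the lock across both the comparison and the increment is what prevents the race condition; splitting them into separate calls would reintroduce it.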
Example Custom Policy:
{
  "rule": "gpu-approval-required",
  "condition": "spec.resources.requests['nvidia.com/gpu'] > 8",
  "action": "deny",
  "message": "GPU requests > 8 require manager approval. File request at go/gpu-request",
  "exceptions": ["namespace:ml-platform-prod", "priorityClass:system-critical"]
}
Quota Enforcement and Abuse Prevention
Quota Enforcement Mechanisms
Enforcement Points:
1. Admission Time (Preventive):
- Check quota BEFORE accepting allocation request
- Atomic: read quota + reserve in single transaction
- Reject with clear error: "Quota exceeded: GPU limit 100, used 98, requested 4"
2. Runtime (Detective):
- Monitor actual usage vs allocated
- Detect: workloads exceeding limits (memory, CPU burst)
- Action: Throttle (CPU) or OOM-kill (memory) or evict (disk)
3. Periodic Reconciliation (Corrective):
- Every 5 minutes: reconcile quota counters with actual state
- Fix drift: crashed pods that didn't release quota
- Alert: if reconciliation finds > 5% discrepancy
Abuse Prevention
Abuse Scenarios and Mitigations:
1. Quota Gaming (Request Inflation):
- Attack: Request maximum resources but use minimum (waste)
- Detection: utilization < 10% for > 1 hour
- Mitigation: VPA recommendations, usage-based billing, auto-downsize
2. Priority Escalation:
- Attack: Mark all jobs as "critical" to get faster scheduling
- Prevention: Priority class assignment requires admin approval
- Detection: Alert if tenant's avg priority > threshold
- Mitigation: Priority classes are cluster-scoped (not tenant-controlled)
3. Resource Squatting:
- Attack: Allocate resources and hold them idle (blocking others)
- Detection: Allocated but unused for > TTL
- Mitigation: Idle resource reclamation (evict after 30 min idle)
- Exception: Reserved capacity (explicitly paid for)
4. Burst Abuse:
- Attack: Continuously burst above quota (never return to base)
- Detection: Burst duration > allowed window
- Mitigation: Hard burst timeout, preempt burst allocations
- Penalty: Reduce burst allowance for repeat offenders
5. Sybil Attack (Multiple Identities):
- Attack: Create multiple tenants to multiply quota
- Prevention: Tenant creation requires admin approval
- Detection: Correlate tenants by billing account, IP, behavior
- Mitigation: Organization-level aggregate quotas
6. API Abuse (DoS on Scheduler):
- Attack: Flood scheduler with allocation requests
- Prevention: Per-tenant rate limiting (100 req/sec)
- Detection: Request rate > 10x normal for tenant
- Mitigation: Progressive rate limiting, temporary ban
Audit Logging for Compliance
Audit Log Schema
{
  "auditId": "audit-7f3a2b1c-4d5e-6f7a-8b9c-0d1e2f3a4b5c",
  "timestamp": "2024-01-15T10:30:00.123Z",
  "level": "RequestResponse",
  "actor": {
    "userId": "user-abc123",
    "username": "alice@example.com",
    "groups": ["ml-team", "gpu-users"],
    "serviceAccount": null,
    "sourceIP": "10.0.1.42",
    "userAgent": "kubectl/v1.28.0"
  },
  "action": {
    "verb": "create",
    "resource": "allocations",
    "namespace": "ml-team",
    "name": "training-job-42",
    "apiGroup": "scheduler.example.com",
    "apiVersion": "v1"
  },
  "request": {
    "resources": {"cpu": "8000m", "memory": "32Gi", "nvidia.com/gpu": "4"},
    "priority": 100,
    "priorityClass": "high-priority"
  },
  "response": {
    "code": 201,
    "status": "Success",
    "allocationId": "alloc-def456",
    "nodeAssigned": "gpu-worker-042"
  },
  "context": {
    "quotaBefore": {"gpu_used": 68, "gpu_limit": 100},
    "quotaAfter": {"gpu_used": 72, "gpu_limit": 100},
    "schedulingLatencyMs": 15,
    "preemptionsTriggered": 0
  },
  "metadata": {
    "cluster": "prod-us-east-1",
    "region": "us-east-1",
    "environment": "production"
  }
}
Audit Levels
Level 1 - Metadata Only:
- Who, what, when (no request/response bodies)
- Low storage cost, always enabled
- Use for: Read operations, status checks
Level 2 - Request:
- Metadata + request body
- Use for: Write operations (create, update, delete)
- Captures: What was requested
Level 3 - RequestResponse:
- Metadata + request + response bodies
- Highest storage cost
- Use for: Security-sensitive operations (quota changes, preemptions)
- Captures: Full context for forensic analysis
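The three levels above amount to a small decision rule: reads get metadata only, writes capture the request, and security-sensitive writes capture both request and response. A minimal Python sketch of that rule follows. The names `AuditLevel`, `SENSITIVE_RESOURCES`, `WRITE_VERBS`, and `audit_level` are hypothetical, and the resource names in the sensitive set are an assumption based on the "quota changes, preemptions" wording.

```python
from enum import Enum


class AuditLevel(Enum):
    METADATA = "Metadata"
    REQUEST = "Request"
    REQUEST_RESPONSE = "RequestResponse"


# Assumed classification; a real deployment would load this from policy.
SENSITIVE_RESOURCES = frozenset({"resourcequotas", "preemptions"})
WRITE_VERBS = frozenset({"create", "update", "delete"})


def audit_level(verb: str, resource: str) -> AuditLevel:
    """Pick the audit level for one API operation."""
    if verb in WRITE_VERBS:
        if resource in SENSITIVE_RESOURCES:
            # Security-sensitive write: keep full request and response.
            return AuditLevel.REQUEST_RESPONSE
        # Ordinary write: record what was requested.
        return AuditLevel.REQUEST
    # Reads and status checks: who, what, when only.
    return AuditLevel.METADATA
```

Keeping the rule as a pure function makes it straightforward to unit-test the policy separately from the logging pipeline.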
Audit Policy Example:
rules:
  - level: RequestResponse
    resources: ["resourcequotas", "priorityclasses", "preemptions"]
    verbs: ["create", "update", "delete"]
  - level: Request
    resources: ["allocations", "jobs"]
    verbs: ["create", "delete"]
  - level: Metadata
    resources: ["allocations", "nodes"]
    verbs: ["get", "list", "watch"]
  - level: None
    resources: ["healthz", "readyz"] # Skip health checks
Compliance Requirements
SOC 2 Type II:
- All resource allocation decisions logged
- Retention: 1 year minimum
- Access to audit logs restricted (separate from operational access)
- Regular review of access patterns (quarterly)
GDPR (if tenant data involved):
- Data residency: Resources allocated in compliant regions only
- Right to erasure: Can delete tenant's allocation history
- Data minimization: Don't log unnecessary PII in audit records
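- Breach notification: Notify the supervisory authority within 72 hours of becoming aware of a personal data breach, which requires prompt detection of unauthorized access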
HIPAA (healthcare workloads):
- PHI workloads on dedicated, encrypted nodes
- Access logging for all PHI-adjacent resources
- BAA (Business Associate Agreement) with cloud provider
- Encryption at rest and in transit for all PHI data
FedRAMP (government workloads):
- Dedicated infrastructure (no multi-tenancy)
- FIPS 140-2 (or its successor, FIPS 140-3) validated encryption
- Continuous monitoring and reporting
- Personnel security clearances for operators
Secure Communication
Control Plane Security
Communication Channels:
1. Client → API Server:
- Protocol: HTTPS (TLS 1.3)
- Authentication: Bearer tokens (JWT) or client certificates
- Authorization: RBAC check on every request
- Rate limiting: Per-client, per-tenant
2. API Server → etcd:
- Protocol: gRPC with mTLS
- Authentication: Client certificates (auto-rotated)
- Encryption: TLS 1.3 (data in transit)
- etcd encryption: AES-256-GCM (data at rest)
- Access: Only API server can reach etcd (network policy)
3. Scheduler → Nodes (Agent):
- Protocol: gRPC with mTLS
- Authentication: Node bootstrap tokens → client certificates
- Certificate rotation: Every 24 hours (automatic)
- Node identity: Verified via TPM attestation (optional)
4. Node → API Server (Heartbeat):
- Protocol: gRPC with mTLS
- Authentication: Node client certificate
- Frequency: Every 10 seconds
- Payload: Signed with node's private key
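The mTLS channels listed above share two settings: a TLS 1.3 floor and mandatory client certificates. The following is a minimal sketch of a server-side context with those settings, using Python's standard `ssl` module. The function name and the file-path parameters are hypothetical. The paths are optional only so the sketch can be built without real key material; a working server needs all three.

```python
import ssl
from typing import Optional


def build_mtls_server_context(
    cert_file: Optional[str] = None,
    key_file: Optional[str] = None,
    ca_file: Optional[str] = None,
) -> ssl.SSLContext:
    """Build a server context that requires TLS 1.3 and a client cert."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # Refuse anything older than TLS 1.3.
    ctx.minimum_version = ssl.TLSVersion.TLSv1_3
    # Mutual TLS: the handshake fails unless the peer presents a
    # certificate that chains to a trusted CA.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if cert_file and key_file:
        # Server identity (hypothetical paths supplied by the caller).
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    if ca_file:
        # Internal PKI root used to verify client certificates.
        ctx.load_verify_locations(cafile=ca_file)
    return ctx
```

In use, the returned context would wrap the listening socket or be handed to the gRPC or HTTP server library, which is outside the scope of this sketch.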
Certificate Management:
- CA: Internal PKI (not public CA)
- Rotation: Long-lived certificates valid for 1 year, rotated at 80% lifetime (node client certificates rotate every 24 hours, as noted above)
- Revocation: CRL or OCSP for compromised nodes
- Bootstrap: One-time token for initial certificate request
Secrets Management for Scheduler
Secrets the Scheduler Handles:
- etcd encryption keys (for encrypting resource state)
- TLS certificates (for API server, node communication)
- Cloud provider credentials (for autoscaler)
- Webhook tokens (for admission controllers)
Storage:
- Never in environment variables or config files
- Use: HashiCorp Vault, AWS Secrets Manager, or Kubernetes Secrets (encrypted)
- Rotation: Automatic, zero-downtime rotation
- Access: Least privilege (scheduler only gets what it needs)
Encryption at Rest:
- etcd data encrypted with AES-256-GCM
- Encryption key stored in KMS (AWS KMS, GCP Cloud KMS)
- Key rotation: Every 90 days
- Envelope encryption: Data key encrypted by master key
Resource Exhaustion Attacks and Prevention
Attack Vectors
1. Fork Bomb (Process Exhaustion):
- Attack: Workload spawns unlimited processes
- Prevention: pids.max cgroup limit (e.g., 4096 processes)
- Detection: Process count spike alert
2. Memory Bomb (OOM):
- Attack: Workload allocates memory until node OOMs
- Prevention: memory.max cgroup limit (hard cap)
- Detection: Memory usage approaching limit
- Response: OOM killer targets the offending container (not others)
3. Disk Bomb (Storage Exhaustion):
- Attack: Write unlimited data to ephemeral storage
- Prevention: Ephemeral storage limits (eviction on exceed)
- Detection: Disk usage monitoring per pod
- Response: Pod eviction when exceeding limit
4. Network Flood (Bandwidth Exhaustion):
- Attack: Generate massive network traffic
- Prevention: Per-pod bandwidth limits (tc/HTB)
- Detection: Network traffic anomaly detection
- Response: Traffic shaping, pod eviction for repeat offenders
5. API Flood (Control Plane DoS):
- Attack: Flood API server with requests
- Prevention: Per-client rate limiting, priority queuing
- Detection: Request rate anomaly
- Response: Progressive throttling, temporary IP ban
6. Crypto Mining (Resource Theft):
- Attack: Deploy crypto miners consuming CPU/GPU
- Prevention: Image scanning, runtime behavior analysis
- Detection: High CPU/GPU usage with no legitimate output
- Response: Terminate workload, alert security team
7. Scheduler Manipulation:
- Attack: Submit thousands of unschedulable pods to overload scheduler
- Prevention: Admission quota on pending pods per namespace
- Detection: Queue depth spike without corresponding resource requests
- Response: Reject new submissions, alert tenant admin
Defense-in-Depth Configuration
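# Pod Security Policy implementing the "restricted" profile
# Note: PodSecurityPolicy was removed in Kubernetes v1.25; newer clusters enforce the
# equivalent "restricted" Pod Security Standard through Pod Security Admission.
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restricted
  annotations:
    # PSP has no seccompProfile field; seccomp is configured through annotations
    seccomp.security.alpha.kubernetes.io/allowedProfileNames: 'runtime/default'
    seccomp.security.alpha.kubernetes.io/defaultProfileName: 'runtime/default'
spec:
  privileged: false
  hostNetwork: false
  hostIPC: false
  hostPID: false
  runAsUser:
    rule: MustRunAsNonRoot
  runAsGroup:
    rule: MustRunAs
    ranges: [{min: 1000, max: 65534}]
  fsGroup:
    rule: MustRunAs
    ranges: [{min: 1000, max: 65534}]
  # seLinux and supplementalGroups are required by the PSP schema
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: MustRunAs
    ranges: [{min: 1000, max: 65534}]
  volumes: ['configMap', 'emptyDir', 'projected', 'secret', 'downwardAPI', 'persistentVolumeClaim']
  allowedCapabilities: []
  requiredDropCapabilities: ['ALL']
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
Data Residency Requirements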
Region-Aware Scheduling
Data Residency Constraints:
Requirement: "EU customer data must be processed in EU regions only"
Implementation:
1. Label nodes with region/jurisdiction:
   labels:
     topology.kubernetes.io/region: eu-west-1
     compliance/data-residency: eu
     compliance/jurisdiction: gdpr
2. Tenant configuration specifies residency requirements:
   {
     "tenant_id": "eu-customer-corp",
     "data_residency": "eu",
     "allowed_regions": ["eu-west-1", "eu-central-1", "eu-north-1"],
     "prohibited_regions": ["us-*", "cn-*"]
   }
3. Scheduler enforces at admission time:
- Check: Does requested node match tenant's residency requirements?
- Reject: If no compliant nodes available, fail with clear error
- Never: Schedule EU-resident workload on US node (even temporarily)
4. Audit: Log all placement decisions with residency context
- Prove: Workload X was always in EU region
- Alert: If any scheduling decision violates residency
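The admission-time check in step 3 can be sketched as a filter over candidate nodes. This is a minimal Python illustration under stated assumptions: `TenantResidency`, `ResidencyError`, `region_permitted`, and `compliant_nodes` are hypothetical names, node regions are passed in as a plain mapping, and the `us-*` style patterns are treated as shell-style globs.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import List, Mapping, Sequence


class ResidencyError(Exception):
    """Raised when no node satisfies the tenant's residency policy."""


@dataclass(frozen=True)
class TenantResidency:
    tenant_id: str
    allowed_regions: Sequence[str]
    prohibited_regions: Sequence[str] = ()


def region_permitted(policy: TenantResidency, region: str) -> bool:
    """A region must be explicitly allowed and match no prohibited glob."""
    if any(fnmatchcase(region, pat) for pat in policy.prohibited_regions):
        return False
    return region in policy.allowed_regions


def compliant_nodes(
    policy: TenantResidency, node_regions: Mapping[str, str]
) -> List[str]:
    """Return node names the tenant may run on; fail loudly if none exist."""
    nodes = sorted(
        name for name, region in node_regions.items()
        if region_permitted(policy, region)
    )
    if not nodes:
        # Reject rather than fall back to a non-compliant region.
        raise ResidencyError(
            f"No compliant nodes for tenant {policy.tenant_id}; "
            f"allowed regions: {list(policy.allowed_regions)}"
        )
    return nodes
```

The function raises instead of returning an empty list so that a caller cannot accidentally treat "no compliant nodes" as "no constraint".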
Cross-Border Data Transfer:
- Scheduler prevents cross-region scheduling for restricted tenants
- Data replication respects residency (no replicas outside allowed regions)
- Backup storage must also comply (S3 bucket in same region)
- Disaster recovery: Only to pre-approved regions
Encryption of Resource Metadata
Data Classification
Classification of Scheduler Data:
Highly Sensitive:
- Tenant credentials and tokens
- Encryption keys
- Node bootstrap secrets
- Audit logs containing PII
→ Encryption: AES-256-GCM, stored in KMS-backed secret store
Sensitive:
- Allocation details (what resources each tenant uses)
- Quota configurations (reveals business capacity)
- Scheduling constraints (reveals architecture)
- Cost/billing data
→ Encryption: At rest (disk encryption), in transit (TLS)
Internal:
- Node capacity and health metrics
- Scheduling queue state
- Performance metrics
→ Encryption: In transit (TLS), at rest (volume encryption)
Public:
- API documentation
- Cluster-level aggregate metrics (anonymized)
- Open-source scheduler configuration
→ Encryption: In transit only (TLS)
Encryption Implementation
At Rest:
- etcd: --encryption-provider-config with AES-CBC or AES-GCM
- PostgreSQL: Transparent Data Encryption (TDE) or volume encryption
- Redis: Encrypted volumes (EBS encryption) + AUTH password
- Backups: Encrypted with separate key (stored in different KMS)
- Logs: Encrypted S3 bucket with SSE-KMS
In Transit:
- All internal communication: mTLS (mutual TLS 1.3)
- External API: TLS 1.3 with strong cipher suites
- etcd peer communication: TLS with client cert verification
- Node heartbeats: gRPC with mTLS (node client certificate)
Key Management:
- Master keys: AWS KMS / GCP Cloud KMS / HashiCorp Vault
- Data encryption keys: Generated per-resource, encrypted by master key
- Rotation: Master key every 365 days, data keys every 90 days
- Access: Scheduler service account only (least privilege IAM policy)
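The rotation schedule above can be checked mechanically. The following is a small Python sketch, assuming key creation timestamps are available from the KMS or secret store. `ROTATION_PERIODS` and `rotation_due` are hypothetical names; the periods are the ones stated in the list.

```python
from datetime import datetime, timedelta, timezone

# Rotation periods taken from the key management policy above.
ROTATION_PERIODS = {
    "master": timedelta(days=365),
    "data": timedelta(days=90),
}


def rotation_due(kind: str, created_at: datetime, now: datetime) -> bool:
    """Return True once a key of the given kind has reached its period."""
    return now - created_at >= ROTATION_PERIODS[kind]


if __name__ == "__main__":
    created = datetime(2024, 1, 1, tzinfo=timezone.utc)
    today = datetime(2024, 4, 15, tzinfo=timezone.utc)
    # 105 days have elapsed: past the data-key period, within the master-key period.
    print(rotation_due("data", created, today))
    print(rotation_due("master", created, today))
```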
Sensitive Field Encryption (Application-Level):
- Tenant-specific secrets in allocation specs: Encrypted before storage
- Environment variables in job specs: Encrypted at rest, decrypted at runtime
- Labels containing PII: Hashed or encrypted
- Audit log PII fields: Encrypted with tenant-specific key (for GDPR deletion)
Security Monitoring and Incident Response
Security Metrics and Alerts
Real-Time Monitoring:
1. Authentication Failures:
- Threshold: > 10 failures/minute from same source
- Action: Temporary IP block, alert security team
2. Authorization Denials:
- Threshold: > 50 denials/minute for same user
- Action: Alert, investigate potential privilege escalation attempt
3. Quota Violations:
- Threshold: Any attempt to exceed hard quota
- Action: Log, alert tenant admin
4. Unusual Scheduling Patterns:
- Threshold: Tenant scheduling 10x normal rate
- Action: Alert, investigate potential abuse
5. Preemption Anomalies:
- Threshold: > 100 preemptions/hour for same tenant
- Action: Alert, investigate potential priority manipulation
6. Node Compromise Indicators:
- Threshold: Node reporting impossible resource values
- Action: Cordon node, alert security, investigate
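Several of the thresholds above have the form "more than N events per time window from the same source". The following is a minimal Python sketch of a sliding-window detector for that pattern. `SlidingWindowAlert` is a hypothetical name, timestamps are passed in explicitly so the logic is easy to test, and state is kept in memory only.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class SlidingWindowAlert:
    """Flags a key once its event count in the window exceeds a threshold."""

    def __init__(self, threshold: int, window_seconds: float) -> None:
        self.threshold = threshold
        self.window = window_seconds
        self._events: Dict[str, Deque[float]] = defaultdict(deque)

    def record(self, key: str, now: float) -> bool:
        """Record one event at time `now`; return True if the alert fires."""
        events = self._events[key]
        events.append(now)
        # Drop events that have aged out of the window.
        cutoff = now - self.window
        while events and events[0] <= cutoff:
            events.popleft()
        return len(events) > self.threshold


# Example: authentication failures, > 10 per minute from the same source.
auth_failures = SlidingWindowAlert(threshold=10, window_seconds=60)
```

The same class covers the authorization-denial and preemption thresholds by changing the key (user or tenant) and the window length.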
Incident Response Playbook:
1. Detect: Automated alerting on security metrics
2. Contain: Isolate affected tenant/node (cordon, revoke tokens)
3. Investigate: Audit log analysis, timeline reconstruction
4. Remediate: Patch vulnerability, rotate credentials
5. Recover: Restore service, verify integrity
6. Report: Post-incident report, update policies
Zero-Trust Architecture
Principles Applied to Resource Allocation:
1. Never Trust, Always Verify:
- Every API call authenticated (no anonymous access)
- Every scheduling decision authorized (RBAC check)
- Node identity verified on every heartbeat (certificate validation)
2. Least Privilege:
- Scheduler: Only access to scheduling-related etcd keys
- Nodes: Only report own status, receive own directives
- Tenants: Only see own resources and metrics
- Operators: Read-only by default, write requires approval
3. Assume Breach:
- Encrypt all data (even internal communication)
- Segment network (scheduler can't reach tenant workloads)
- Rotate credentials frequently (24-hour certificate lifetime)
- Monitor for lateral movement (unexpected access patterns)
4. Explicit Verification:
- Multi-factor for admin operations (quota changes, preemptions)
- Approval workflow for sensitive changes (priority escalation)
- Break-glass procedure for emergency access (logged, time-limited)
This security architecture ensures the resource allocation service protects tenant isolation, prevents abuse, maintains compliance, and provides the audit trail necessary for regulated environments.