Security & Privacy

Part 9 of 10

Resource Allocation Service - Security and Privacy

Overview

A resource allocation service is a critical infrastructure component that controls access to compute, storage, and network resources. Security failures can lead to unauthorized resource consumption (financial impact), data breaches (through co-located workloads), denial of service (resource exhaustion), and compliance violations. This section covers security architecture for multi-tenant resource managers, drawing from practices in Kubernetes RBAC, AWS IAM, and Google Borg's security model.


Multi-Tenant Isolation

Resource Isolation Layers

Isolation Stack (Defense in Depth):

Layer 5: Quota Enforcement     ← Prevent resource monopolization
Layer 4: Network Isolation     ← Prevent cross-tenant communication
Layer 3: Runtime Isolation     ← Prevent workload interference
Layer 2: Kernel Isolation      ← Prevent privilege escalation
Layer 1: Hardware Isolation    ← Prevent side-channel attacks

Implementation per Layer:

Layer 5 - Quota Enforcement:
- Hard quotas per tenant (cannot exceed)
- Rate limiting on API requests
- Admission webhooks reject over-quota requests
- Strongly consistent quota accounting (not eventually consistent)

Layer 4 - Network Isolation:
- Network policies (default deny between namespaces)
- Separate VLANs/VPCs per tenant (for strict isolation)
- Encrypted inter-pod communication (mTLS via service mesh)
- DNS isolation (tenants can't resolve other tenants' services)

Layer 3 - Runtime Isolation:
- Separate cgroups per tenant (CPU, memory, I/O limits)
- Seccomp profiles (restrict system calls)
- AppArmor/SELinux (mandatory access control)
- Read-only root filesystem

Layer 2 - Kernel Isolation:
- Namespaces (PID, network, mount, user, IPC, UTS)
- User namespaces (container root != host root)
- Capability dropping (no CAP_SYS_ADMIN)
- Seccomp-BPF (syscall filtering)

Layer 1 - Hardware Isolation:
- Dedicated nodes for sensitive tenants (no co-location)
- Hardware-backed TEE (Intel SGX, AMD SEV)
- IOMMU for device isolation
- Separate NUMA domains per tenant

Noisy Neighbor Prevention

Problem: One tenant's workload degrades another's performance.

CPU Noisy Neighbor:
- Cause: Tenant A uses all CPU, starving Tenant B
- Prevention: CFS bandwidth throttling (cpu.cfs_quota_us in cgroup v1, cpu.max in cgroup v2)
- Detection: Monitor CPU throttling metrics per tenant
- Mitigation: Guaranteed CPU via requests (Kubernetes QoS: Guaranteed)
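To make the CFS bandwidth setting concrete, the following is a minimal sketch of how a CPU limit expressed in millicores could be translated into a quota for one scheduling period. The function names, the default 100 ms period, and the 1 ms floor are assumptions for illustration, not values taken from any particular scheduler.

```python
DEFAULT_PERIOD_US = 100_000  # assumed CFS period of 100 ms
MIN_QUOTA_US = 1_000         # assumed floor of 1 ms so tiny limits stay schedulable


def cfs_quota_us(limit_millicores: int, period_us: int = DEFAULT_PERIOD_US) -> int:
    """Return the CPU time in microseconds a cgroup may consume per period."""
    if limit_millicores <= 0:
        raise ValueError("CPU limit must be positive")
    # 1000 millicores == one full core == one full period of CPU time.
    quota = limit_millicores * period_us // 1000
    return max(quota, MIN_QUOTA_US)


def cgroup_v2_cpu_max(limit_millicores: int, period_us: int = DEFAULT_PERIOD_US) -> str:
    """Format the value as '<quota> <period>', the layout cgroup v2 cpu.max uses."""
    return f"{cfs_quota_us(limit_millicores, period_us)} {period_us}"


if __name__ == "__main__":
    print(cgroup_v2_cpu_max(500))   # half a core
    print(cgroup_v2_cpu_max(2000))  # two cores
```

A tenant that exhausts its quota is throttled until the next period begins, which is why throttling counters are a useful detection signal.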

Memory Noisy Neighbor:
- Cause: Tenant A causes memory pressure, triggering OOM for Tenant B
- Prevention: Memory limits enforced by cgroups (memory.max)
- Detection: Monitor memory pressure indicators (PSI)
- Mitigation: OOM score adjustment (protect high-priority tenants)

I/O Noisy Neighbor:
- Cause: Tenant A saturates disk I/O, slowing Tenant B
- Prevention: I/O weight and bandwidth limits (blkio controller in cgroup v1, io.weight and io.max in cgroup v2)
- Detection: Monitor I/O latency percentiles per tenant
- Mitigation: I/O scheduling priority (ionice, which takes effect only with I/O schedulers that honor priorities, such as BFQ)

Network Noisy Neighbor:
- Cause: Tenant A floods network, causing packet drops for Tenant B
- Prevention: Traffic shaping (tc/HTB per tenant)
- Detection: Monitor network queue drops per tenant
- Mitigation: Per-tenant bandwidth guarantees

Cache Noisy Neighbor (LLC - Last Level Cache):
- Cause: Tenant A evicts Tenant B's data from shared CPU cache
- Prevention: Intel CAT (Cache Allocation Technology)
- Detection: Monitor LLC miss rate per tenant
- Mitigation: Dedicated cache partitions for sensitive workloads

Tenant Isolation Levels

Level 1 - Soft Isolation (Default):
- Shared nodes, namespace separation
- Cgroup limits, network policies
- Suitable for: Internal teams, trusted tenants
- Cost: 1x (maximum resource sharing)

Level 2 - Medium Isolation:
- Dedicated node pools per tenant
- No workload co-location
- Shared control plane
- Suitable for: Different business units, compliance boundaries
- Cost: 1.3-1.5x (reduced bin packing efficiency)

Level 3 - Hard Isolation:
- Dedicated clusters per tenant
- Separate control planes
- Separate network infrastructure
- Suitable for: External customers, regulated industries
- Cost: 2-3x (dedicated infrastructure overhead)

Level 4 - Hardware Isolation:
- Dedicated physical hardware
- Air-gapped networks
- HSM-backed encryption
- Suitable for: Government, classified workloads
- Cost: 5-10x (no sharing whatsoever)

RBAC for Resource Management

Role-Based Access Control Model

RBAC Hierarchy:

Platform Admin (God mode - very restricted)
├── Can: Create/delete clusters, set global policies
├── Can: Override any quota, preempt any workload
└── Cannot: Access tenant data/workloads directly

Tenant Admin
├── Can: Manage quotas within tenant allocation
├── Can: Create/delete namespaces for their tenant
├── Can: Set priority classes within their tenant
└── Cannot: Exceed tenant-level quota, affect other tenants

Namespace Admin
├── Can: Deploy workloads in their namespace
├── Can: Set resource requests/limits within namespace quota
├── Can: View namespace usage metrics
└── Cannot: Modify namespace quota, access other namespaces

Developer
├── Can: Deploy workloads with pre-approved resource profiles
├── Can: View own workload status and logs
├── Can: Scale within pre-set limits
└── Cannot: Set priority above normal, request GPU without approval

Service Account (Automated)
├── Can: Perform specific operations (deploy, scale)
├── Can: Access specific resources (own namespace only)
├── Scoped: Time-limited tokens, specific API groups
└── Cannot: Escalate privileges, access secrets outside scope

RBAC Policy Examples

# Kubernetes-style RBAC policies

# Tenant Admin: Full control within their tenant's namespaces
# (grant with a RoleBinding in each tenant namespace, not a ClusterRoleBinding)
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: tenant-admin
rules:
- apiGroups: [""]
  resources: ["pods", "services", "configmaps", "secrets"]
  verbs: ["*"]
- apiGroups: ["apps"]
  resources: ["deployments", "statefulsets"]
  verbs: ["*"]
- apiGroups: ["scheduling.k8s.io"]
  resources: ["priorityclasses"]
  verbs: ["get", "list"]  # Can view but not create priority classes (cluster-scoped, so this rule needs a ClusterRoleBinding)
- apiGroups: [""]
  resources: ["resourcequotas"]
  verbs: ["get", "list"]  # Can view but not modify quotas

---
# Developer: Limited deployment capabilities
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: developer
  namespace: ml-team-dev
rules:
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["get", "list", "create", "update"]
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "delete"]  # Can delete any pod in this namespace (RBAC cannot scope to "own" pods)
- apiGroups: [""]
  resources: ["pods/log", "pods/exec"]
  verbs: ["get", "create"]  # Can view logs and exec into pods
# Resource limits enforced by LimitRange (not RBAC)
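The way such rules are evaluated can be sketched in a few lines. The example below is a simplified illustration, not the Kubernetes authorizer: the Rule type and is_allowed function are hypothetical names, and only the additive "allow if any rule matches" behavior and the "*" wildcard are modeled.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Rule:
    api_groups: Tuple[str, ...]
    resources: Tuple[str, ...]
    verbs: Tuple[str, ...]


def _matches(allowed: Tuple[str, ...], value: str) -> bool:
    # "*" is treated as a wildcard that matches any value.
    return "*" in allowed or value in allowed


def is_allowed(rules: Iterable[Rule], api_group: str, resource: str, verb: str) -> bool:
    """RBAC is additive: a request passes if any single rule matches. There are no deny rules."""
    return any(
        _matches(r.api_groups, api_group)
        and _matches(r.resources, resource)
        and _matches(r.verbs, verb)
        for r in rules
    )


# The developer role from the policy above, expressed as data.
DEVELOPER_RULES = [
    Rule(("apps",), ("deployments",), ("get", "list", "create", "update")),
    Rule(("",), ("pods",), ("get", "list", "delete")),
    Rule(("",), ("pods/log", "pods/exec"), ("get", "create")),
]

if __name__ == "__main__":
    print(is_allowed(DEVELOPER_RULES, "apps", "deployments", "create"))  # permitted
    print(is_allowed(DEVELOPER_RULES, "", "resourcequotas", "update"))   # not permitted
```

Because nothing can be denied explicitly, anything absent from the rule list (such as modifying resourcequotas) is rejected by default.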

Admission Control (Policy Enforcement)

Admission Pipeline:
Request → Authentication → Authorization (RBAC) → Admission Controllers → Persist

Key Admission Controllers:

1. ResourceQuota Admission:
   - Reject if allocation would exceed namespace quota
   - Atomic: check + reserve in single transaction
   - Prevents race conditions on quota

2. LimitRange Admission:
   - Enforce min/max resource requests per pod
   - Inject defaults if not specified
   - Prevent: pods requesting 1000 CPU cores (typo protection)

3. PodSecurity Admission:
   - Enforce security standards (restricted, baseline, privileged)
   - Reject pods running as root (unless explicitly allowed)
   - Reject privileged containers in non-system namespaces

4. Custom Webhook (Organization Policies):
   - Require cost-center label on all allocations
   - Enforce naming conventions
   - Require approval for GPU requests > 8
   - Block scheduling to production nodes from dev namespaces

Example Custom Policy:
{
  "rule": "gpu-approval-required",
  "condition": "spec.resources.requests['nvidia.com/gpu'] > 8",
  "action": "deny",
  "message": "GPU requests > 8 require manager approval. File request at go/gpu-request",
  "exceptions": ["namespace:ml-platform-prod", "priorityClass:system-critical"]
}
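A webhook applying this kind of policy might look roughly like the sketch below. It assumes the condition string has already been parsed into a resource name and a numeric threshold (the "resource" and "threshold" keys are hypothetical), and the shape of the request dictionary is likewise an assumption for illustration.

```python
GPU_POLICY = {
    "rule": "gpu-approval-required",
    "resource": "nvidia.com/gpu",  # assumption: pre-parsed from the condition string
    "threshold": 8,                # assumption: pre-parsed from the condition string
    "action": "deny",
    "message": "GPU requests > 8 require manager approval. File request at go/gpu-request",
    "exceptions": ["namespace:ml-platform-prod", "priorityClass:system-critical"],
}


def evaluate(policy: dict, request: dict) -> tuple:
    """Return (allowed, message) for one allocation request."""
    # Exceptions are matched first so exempt namespaces and priority classes bypass the rule.
    identities = {
        f"namespace:{request.get('namespace', '')}",
        f"priorityClass:{request.get('priorityClass', '')}",
    }
    if identities & set(policy["exceptions"]):
        return True, ""

    requested = int(request.get("requests", {}).get(policy["resource"], 0))
    if requested > policy["threshold"] and policy["action"] == "deny":
        return False, policy["message"]
    return True, ""


if __name__ == "__main__":
    req = {"namespace": "ml-team", "requests": {"nvidia.com/gpu": "16"}}
    print(evaluate(GPU_POLICY, req))
```

Checking exceptions before the condition keeps the exempt path cheap and makes the deny message appear only for requests that actually need approval.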

Quota Enforcement and Abuse Prevention

Quota Enforcement Mechanisms

Enforcement Points:

1. Admission Time (Preventive):
   - Check quota BEFORE accepting allocation request
   - Atomic: read quota + reserve in single transaction
   - Reject with clear error: "Quota exceeded: GPU limit 100, used 98, requested 4"

2. Runtime (Detective):
   - Monitor actual usage vs allocated
   - Detect: workloads exceeding limits (memory, CPU burst)
   - Action: Throttle (CPU), OOM-kill (memory), or evict (disk)

3. Periodic Reconciliation (Corrective):
   - Every 5 minutes: reconcile quota counters with actual state
   - Fix drift: crashed pods that didn't release quota
   - Alert: if reconciliation finds > 5% discrepancy
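The admission-time and reconciliation steps above can be sketched as a small in-memory ledger. This is an illustration only: the QuotaLedger class is hypothetical, an in-process lock stands in for what would be a database transaction or a compare-and-swap in a real service, and measuring drift as a fraction of the limit is an assumption about how the 5% threshold is defined.

```python
import threading


class QuotaExceeded(Exception):
    pass


class QuotaLedger:
    def __init__(self, resource: str, limit: int):
        self.resource = resource
        self.limit = limit
        self.used = 0
        self._lock = threading.Lock()

    def reserve(self, amount: int) -> None:
        """Check and reserve under one lock so concurrent requests cannot both pass the check."""
        with self._lock:
            if self.used + amount > self.limit:
                raise QuotaExceeded(
                    f"Quota exceeded: {self.resource} limit {self.limit}, "
                    f"used {self.used}, requested {amount}"
                )
            self.used += amount

    def release(self, amount: int) -> None:
        with self._lock:
            self.used = max(0, self.used - amount)

    def reconcile(self, actual_used: int, alert_threshold: float = 0.05) -> bool:
        """Reset the counter to observed usage. Returns True when drift exceeded the threshold."""
        with self._lock:
            drift = abs(self.used - actual_used) / self.limit if self.limit else 0.0
            self.used = actual_used
            return drift > alert_threshold


if __name__ == "__main__":
    ledger = QuotaLedger("GPU", 100)
    ledger.reserve(98)
    try:
        ledger.reserve(4)
    except QuotaExceeded as exc:
        print(exc)
```

Reconciliation deliberately overwrites the counter with observed state, so quota leaked by crashed pods is returned on the next pass rather than accumulating.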

Abuse Prevention

Abuse Scenarios and Mitigations:

1. Quota Gaming (Request Inflation):
   - Attack: Request maximum resources but use minimum (waste)
   - Detection: utilization < 10% for > 1 hour
   - Mitigation: VPA recommendations, usage-based billing, auto-downsize

2. Priority Escalation:
   - Attack: Mark all jobs as "critical" to get faster scheduling
   - Prevention: Priority class assignment requires admin approval
   - Detection: Alert if tenant's avg priority > threshold
   - Mitigation: Priority classes are cluster-scoped (not tenant-controlled)

3. Resource Squatting:
   - Attack: Allocate resources and hold them idle (blocking others)
   - Detection: Allocated but unused for > TTL
   - Mitigation: Idle resource reclamation (evict after 30 min idle)
   - Exception: Reserved capacity (explicitly paid for)

4. Burst Abuse:
   - Attack: Continuously burst above quota (never return to base)
   - Detection: Burst duration > allowed window
   - Mitigation: Hard burst timeout, preempt burst allocations
   - Penalty: Reduce burst allowance for repeat offenders

5. Sybil Attack (Multiple Identities):
   - Attack: Create multiple tenants to multiply quota
   - Prevention: Tenant creation requires admin approval
   - Detection: Correlate tenants by billing account, IP, behavior
   - Mitigation: Organization-level aggregate quotas

6. API Abuse (DoS on Scheduler):
   - Attack: Flood scheduler with allocation requests
   - Prevention: Per-tenant rate limiting (100 req/sec)
   - Detection: Request rate > 10x normal for tenant
   - Mitigation: Progressive rate limiting, temporary ban
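Per-tenant rate limiting of the kind listed under API abuse is commonly built on a token bucket. The sketch below is one minimal way to do it, assuming a sustained rate of 100 requests per second with an equal burst size. The class names and the injectable clock are illustrative choices, not part of any specific scheduler.

```python
import time


class TokenBucket:
    def __init__(self, rate: float, burst: float, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = burst       # maximum tokens that can accumulate
        self.tokens = float(burst)
        self._clock = clock
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self._clock()
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self._last) * self.rate)
        self._last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class TenantRateLimiter:
    """Keeps one bucket per tenant so one tenant's flood cannot consume another's budget."""

    def __init__(self, rate: float = 100.0, burst: float = 100.0, clock=time.monotonic):
        self._rate, self._burst, self._clock = rate, burst, clock
        self._buckets = {}

    def allow(self, tenant_id: str) -> bool:
        bucket = self._buckets.get(tenant_id)
        if bucket is None:
            bucket = self._buckets[tenant_id] = TokenBucket(self._rate, self._burst, self._clock)
        return bucket.allow()


if __name__ == "__main__":
    limiter = TenantRateLimiter()
    accepted = sum(limiter.allow("tenant-a") for _ in range(150))
    print(f"accepted {accepted} of 150 back-to-back requests")
```

Progressive limiting could be layered on top by lowering a tenant's rate after repeated rejections. That policy is left out here to keep the sketch short.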

Audit Logging for Compliance

Audit Log Schema

{
  "auditId": "audit-7f3a2b1c-4d5e-6f7a-8b9c-0d1e2f3a4b5c",
  "timestamp": "2024-01-15T10:30:00.123Z",
  "level": "RequestResponse",
  
  "actor": {
    "userId": "user-abc123",
    "username": "alice@example.com",
    "groups": ["ml-team", "gpu-users"],
    "serviceAccount": null,
    "sourceIP": "10.0.1.42",
    "userAgent": "kubectl/v1.28.0"
  },
  
  "action": {
    "verb": "create",
    "resource": "allocations",
    "namespace": "ml-team",
    "name": "training-job-42",
    "apiGroup": "scheduler.example.com",
    "apiVersion": "v1"
  },
  
  "request": {
    "resources": {"cpu": "8000m", "memory": "32Gi", "nvidia.com/gpu": "4"},
    "priority": 100,
    "priorityClass": "high-priority"
  },
  
  "response": {
    "code": 201,
    "status": "Success",
    "allocationId": "alloc-def456",
    "nodeAssigned": "gpu-worker-042"
  },
  
  "context": {
    "quotaBefore": {"gpu_used": 68, "gpu_limit": 100},
    "quotaAfter": {"gpu_used": 72, "gpu_limit": 100},
    "schedulingLatencyMs": 15,
    "preemptionsTriggered": 0
  },
  
  "metadata": {
    "cluster": "prod-us-east-1",
    "region": "us-east-1",
    "environment": "production"
  }
}

Audit Levels

Level 1 - Metadata Only:
- Who, what, when (no request/response bodies)
- Low storage cost, always enabled
- Use for: Read operations, status checks

Level 2 - Request:
- Metadata + request body
- Use for: Write operations (create, update, delete)
- Captures: What was requested

Level 3 - RequestResponse:
- Metadata + request + response bodies
- Highest storage cost
- Use for: Security-sensitive operations (quota changes, preemptions)
- Captures: Full context for forensic analysis

Audit Policy Example:
rules:
- level: RequestResponse
  resources: ["resourcequotas", "priorityclasses", "preemptions"]
  verbs: ["create", "update", "delete"]
  
- level: Request
  resources: ["allocations", "jobs"]
  verbs: ["create", "delete"]
  
- level: Metadata
  resources: ["allocations", "nodes"]
  verbs: ["get", "list", "watch"]
  
- level: None
  resources: ["healthz", "readyz"]  # Skip health checks
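Assuming rules are evaluated top to bottom and the first match wins (an assumption of this sketch; the policy itself does not state the ordering semantics), selecting the audit level for a request could look like this. The audit_level function and the Metadata fallback for unmatched requests are illustrative assumptions.

```python
AUDIT_RULES = [
    {"level": "RequestResponse",
     "resources": {"resourcequotas", "priorityclasses", "preemptions"},
     "verbs": {"create", "update", "delete"}},
    {"level": "Request",
     "resources": {"allocations", "jobs"},
     "verbs": {"create", "delete"}},
    {"level": "Metadata",
     "resources": {"allocations", "nodes"},
     "verbs": {"get", "list", "watch"}},
    # verbs=None means the rule applies to every verb.
    {"level": "None", "resources": {"healthz", "readyz"}, "verbs": None},
]


def audit_level(resource: str, verb: str, rules=AUDIT_RULES, default: str = "Metadata") -> str:
    """Return the level of the first rule matching the request, or the default."""
    for rule in rules:
        verb_ok = rule["verbs"] is None or verb in rule["verbs"]
        if resource in rule["resources"] and verb_ok:
            return rule["level"]
    return default


if __name__ == "__main__":
    for resource, verb in [("resourcequotas", "update"), ("allocations", "create"),
                           ("allocations", "get"), ("healthz", "get")]:
        print(resource, verb, "->", audit_level(resource, verb))
```

With first-match semantics, more specific or more sensitive rules belong at the top, since a broad rule placed earlier would shadow them.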

Compliance Requirements

SOC 2 Type II:
- All resource allocation decisions logged
- Retention: 1 year minimum (a common target; SOC 2 itself does not prescribe a fixed retention period)
- Access to audit logs restricted (separate from operational access)
- Regular review of access patterns (quarterly)

GDPR (if tenant data involved):
- Data residency: Resources allocated in compliant regions only
- Right to erasure: Can delete tenant's allocation history
- Data minimization: Don't log unnecessary PII in audit records
- Breach notification: Notify the supervisory authority within 72 hours of becoming aware of a breach, which requires fast detection of unauthorized access

HIPAA (healthcare workloads):
- PHI workloads on dedicated, encrypted nodes
- Access logging for all PHI-adjacent resources
- BAA (Business Associate Agreement) with cloud provider
- Encryption at rest and in transit for all PHI data

FedRAMP (government workloads):
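- Isolated infrastructure within the authorization boundary (commonly a dedicated government region)
- FIPS 140-2 or 140-3 validated cryptographic modules
- Continuous monitoring and reporting
- Personnel screening for operators (background checks, plus clearances where the agency requires them)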

Secure Communication

Control Plane Security

Communication Channels:

1. Client → API Server:
   - Protocol: HTTPS (TLS 1.3)
   - Authentication: Bearer tokens (JWT) or client certificates
   - Authorization: RBAC check on every request
   - Rate limiting: Per-client, per-tenant

2. API Server → etcd:
   - Protocol: gRPC with mTLS
   - Authentication: Client certificates (auto-rotated)
   - Encryption: TLS 1.3 (data in transit)
   - Encryption at rest: AES-256-GCM, applied by the API server before data is written to etcd
   - Access: Only API server can reach etcd (network policy)

3. Scheduler → Nodes (Agent):
   - Protocol: gRPC with mTLS
   - Authentication: Node bootstrap tokens → client certificates
   - Certificate rotation: Every 24 hours (automatic)
   - Node identity: Verified via TPM attestation (optional)

4. Node → API Server (Heartbeat):
   - Protocol: gRPC with mTLS
   - Authentication: Node client certificate
   - Frequency: Every 10 seconds
   - Payload: Signed with node's private key

Certificate Management:
- CA: Internal PKI (not public CA)
- Rotation: Certificates rotated at 80% of lifetime (1-year validity for long-lived certificates; node certificates are short-lived and rotate every 24 hours)
- Revocation: CRL or OCSP for compromised nodes
- Bootstrap: One-time token for initial certificate request
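The "rotate at 80% of lifetime" rule is simple arithmetic over the certificate validity window. A minimal sketch, with hypothetical function names and the 0.8 fraction as a parameter:

```python
from datetime import datetime, timedelta, timezone


def rotation_time(not_before: datetime, not_after: datetime, fraction: float = 0.8) -> datetime:
    """Return the instant at which the given fraction of the validity window has elapsed."""
    if not_after <= not_before:
        raise ValueError("not_after must be later than not_before")
    return not_before + (not_after - not_before) * fraction


def needs_rotation(not_before: datetime, not_after: datetime, now: datetime,
                   fraction: float = 0.8) -> bool:
    return now >= rotation_time(not_before, not_after, fraction)


if __name__ == "__main__":
    issued = datetime(2024, 1, 1, tzinfo=timezone.utc)
    one_year = issued + timedelta(days=365)
    one_day = issued + timedelta(hours=24)
    print("1-year certificate rotates at:", rotation_time(issued, one_year))
    print("24-hour certificate rotates at:", rotation_time(issued, one_day))
```

For a 1-year certificate this lands on day 292, and for a 24-hour certificate at roughly 19 hours 12 minutes, leaving the remaining 20% of the lifetime as a buffer for retries if issuance fails.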

Secrets Management for Scheduler

Secrets the Scheduler Handles:
- etcd encryption keys (for encrypting resource state)
- TLS certificates (for API server, node communication)
- Cloud provider credentials (for autoscaler)
- Webhook tokens (for admission controllers)

Storage:
- Never in environment variables or config files
- Use: HashiCorp Vault, AWS Secrets Manager, or Kubernetes Secrets (with encryption at rest enabled)
- Rotation: Automatic, zero-downtime rotation
- Access: Least privilege (scheduler only gets what it needs)

Encryption at Rest:
- etcd data encrypted with AES-256-GCM
- Encryption key stored in KMS (AWS KMS, GCP Cloud KMS)
- Key rotation: Data keys every 90 days (master key annually)
- Envelope encryption: Data key encrypted by master key

Resource Exhaustion Attacks and Prevention

Attack Vectors

1. Fork Bomb (Process Exhaustion):
   - Attack: Workload spawns unlimited processes
   - Prevention: pids.max cgroup limit (e.g., 4096 processes)
   - Detection: Process count spike alert

2. Memory Bomb (OOM):
   - Attack: Workload allocates memory until node OOMs
   - Prevention: memory.max cgroup limit (hard cap)
   - Detection: Memory usage approaching limit
   - Response: OOM killer targets the offending container (not others)

3. Disk Bomb (Storage Exhaustion):
   - Attack: Write unlimited data to ephemeral storage
   - Prevention: Ephemeral storage limits (eviction on exceed)
   - Detection: Disk usage monitoring per pod
   - Response: Pod eviction when exceeding limit

4. Network Flood (Bandwidth Exhaustion):
   - Attack: Generate massive network traffic
   - Prevention: Per-pod bandwidth limits (tc/HTB)
   - Detection: Network traffic anomaly detection
   - Response: Traffic shaping, pod eviction for repeat offenders

5. API Flood (Control Plane DoS):
   - Attack: Flood API server with requests
   - Prevention: Per-client rate limiting, priority queuing
   - Detection: Request rate anomaly
   - Response: Progressive throttling, temporary IP ban

6. Crypto Mining (Resource Theft):
   - Attack: Deploy crypto miners consuming CPU/GPU
   - Prevention: Image scanning, runtime behavior analysis
   - Detection: High CPU/GPU usage with no legitimate output
   - Response: Terminate workload, alert security team

7. Scheduler Manipulation:
   - Attack: Submit thousands of unschedulable pods to overload scheduler
   - Prevention: Admission quota on pending pods per namespace
   - Detection: Pending-queue depth spikes with no matching increase in successful placements
   - Response: Reject new submissions, alert tenant admin

Defense-in-Depth Configuration

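# Pod Security Standards (Restricted)
# PodSecurityPolicy (policy/v1beta1) was removed in Kubernetes 1.25, so the
# restricted profile is enforced here with Pod Security Admission namespace labels.
apiVersion: v1
kind: Namespace
metadata:
  name: ml-team-dev
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
---
# Example workload that satisfies the restricted profile
apiVersion: v1
kind: Pod
metadata:
  name: restricted-example
  namespace: ml-team-dev
spec:
  hostNetwork: false
  hostIPC: false
  hostPID: false
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    runAsGroup: 1000
    fsGroup: 1000
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: app
    image: registry.example.com/app:1.0  # placeholder image
    securityContext:
      privileged: false
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop: ["ALL"]
# The restricted profile permits the configMap, emptyDir, projected, secret,
# downwardAPI, and persistentVolumeClaim volume types (among a few others).
# Pod Security Admission cannot enforce UID/GID ranges such as 1000-65534 or
# readOnlyRootFilesystem; use a custom admission webhook for those.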

Data Residency Requirements

Region-Aware Scheduling

Data Residency Constraints:

Requirement: "EU customer data must be processed in EU regions only"

Implementation:
1. Label nodes with region/jurisdiction:
   labels:
     topology.kubernetes.io/region: eu-west-1
     compliance/data-residency: eu
     compliance/jurisdiction: gdpr

2. Tenant configuration specifies residency requirements:
   {
     "tenant_id": "eu-customer-corp",
     "data_residency": "eu",
     "allowed_regions": ["eu-west-1", "eu-central-1", "eu-north-1"],
     "prohibited_regions": ["us-*", "cn-*"]
   }

3. Scheduler enforces at admission time:
   - Check: Does requested node match tenant's residency requirements?
   - Reject: If no compliant nodes available, fail with clear error
   - Never: Schedule EU-resident workload on US node (even temporarily)

4. Audit: Log all placement decisions with residency context
   - Prove: Workload X was always in EU region
   - Alert: If any scheduling decision violates residency
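The admission-time residency check in step 3 can be sketched as a node filter. The example below reuses the tenant configuration and node labels shown above. The function names are hypothetical, and treating any region missing from the allow-list as forbidden (deny by default) is an assumption of this sketch.

```python
from fnmatch import fnmatchcase

TENANT = {
    "tenant_id": "eu-customer-corp",
    "data_residency": "eu",
    "allowed_regions": ["eu-west-1", "eu-central-1", "eu-north-1"],
    "prohibited_regions": ["us-*", "cn-*"],
}

REGION_LABEL = "topology.kubernetes.io/region"
RESIDENCY_LABEL = "compliance/data-residency"


def region_permitted(tenant: dict, region: str) -> bool:
    """Prohibited patterns win. Otherwise the region must be explicitly allowed."""
    if any(fnmatchcase(region, pattern) for pattern in tenant["prohibited_regions"]):
        return False
    return region in tenant["allowed_regions"]


def compliant_nodes(tenant: dict, nodes: list) -> list:
    """Return names of nodes whose labels satisfy the tenant's residency requirements."""
    return [
        node["name"]
        for node in nodes
        if region_permitted(tenant, node["labels"].get(REGION_LABEL, ""))
        and node["labels"].get(RESIDENCY_LABEL) == tenant["data_residency"]
    ]


if __name__ == "__main__":
    nodes = [
        {"name": "node-eu-1", "labels": {REGION_LABEL: "eu-west-1", RESIDENCY_LABEL: "eu"}},
        {"name": "node-us-1", "labels": {REGION_LABEL: "us-east-1", RESIDENCY_LABEL: "us"}},
    ]
    print(compliant_nodes(TENANT, nodes))
```

An empty result corresponds to the "fail with clear error" branch: the request is rejected rather than placed on a non-compliant node.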

Cross-Border Data Transfer:
- Scheduler prevents cross-region scheduling for restricted tenants
- Data replication respects residency (no replicas outside allowed regions)
- Backup storage must also comply (for example, an S3 bucket in an allowed region)
- Disaster recovery: Only to pre-approved regions

Encryption of Resource Metadata

Data Classification

Classification of Scheduler Data:

Highly Sensitive:
- Tenant credentials and tokens
- Encryption keys
- Node bootstrap secrets
- Audit logs containing PII
→ Encryption: AES-256-GCM, stored in KMS-backed secret store

Sensitive:
- Allocation details (what resources each tenant uses)
- Quota configurations (reveals business capacity)
- Scheduling constraints (reveals architecture)
- Cost/billing data
→ Encryption: At rest (disk encryption), in transit (TLS)

Internal:
- Node capacity and health metrics
- Scheduling queue state
- Performance metrics
→ Encryption: In transit (TLS), at rest (volume encryption)

Public:
- API documentation
- Cluster-level aggregate metrics (anonymized)
- Open-source scheduler configuration
→ Encryption: In transit only (TLS)

Encryption Implementation

At Rest:
- etcd: kube-apiserver --encryption-provider-config with the aescbc or aesgcm provider (or the kms provider for envelope encryption)
- PostgreSQL: Transparent Data Encryption (TDE) or volume encryption
- Redis: Encrypted volumes (EBS encryption) + AUTH password
- Backups: Encrypted with a separate key (stored in a different KMS)
- Logs: Encrypted S3 bucket with SSE-KMS

In Transit:
- All internal communication: mTLS (mutual TLS 1.3)
- External API: TLS 1.3 with strong cipher suites
- etcd peer communication: TLS with client cert verification
- Node heartbeats: gRPC with mTLS (node client certificate)

Key Management:
- Master keys: AWS KMS / GCP Cloud KMS / HashiCorp Vault
- Data encryption keys: Generated per-resource, encrypted by master key
- Rotation: Master key every 365 days, data keys every 90 days
- Access: Scheduler service account only (least privilege IAM policy)

Sensitive Field Encryption (Application-Level):
- Tenant-specific secrets in allocation specs: Encrypted before storage
- Environment variables in job specs: Encrypted at rest, decrypted at runtime
- Labels containing PII: Hashed or encrypted
- Audit log PII fields: Encrypted with tenant-specific key (for GDPR deletion)
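For the "hashed" option on PII-bearing labels, a keyed hash is usually preferable to a plain digest, because low-entropy values such as email addresses can otherwise be recovered by guessing. The sketch below uses HMAC-SHA256 with a per-tenant key. The function names and the choice of which label keys count as PII are assumptions for illustration.

```python
import hashlib
import hmac


def pseudonymize(value: str, tenant_key: bytes) -> str:
    """Return a keyed, deterministic digest of a PII value (HMAC-SHA256, hex encoded)."""
    return hmac.new(tenant_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def scrub_labels(labels: dict, pii_keys: set, tenant_key: bytes) -> dict:
    """Replace the values of PII-bearing labels and leave all other labels untouched."""
    return {
        key: pseudonymize(value, tenant_key) if key in pii_keys else value
        for key, value in labels.items()
    }


if __name__ == "__main__":
    key = b"example-tenant-key"  # placeholder; a real key would come from the KMS
    labels = {"owner-email": "alice@example.com", "cost-center": "ml-42"}
    print(scrub_labels(labels, {"owner-email"}, key))
```

Because the digest is deterministic per tenant, records can still be correlated within that tenant. Destroying the tenant key removes the ability to link digests back to values, though whether that satisfies an erasure request is a legal question outside the scope of this sketch.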

Security Monitoring and Incident Response

Security Metrics and Alerts

Real-Time Monitoring:

1. Authentication Failures:
   - Threshold: > 10 failures/minute from same source
   - Action: Temporary IP block, alert security team
   
2. Authorization Denials:
   - Threshold: > 50 denials/minute for same user
   - Action: Alert, investigate potential privilege escalation attempt

3. Quota Violations:
   - Threshold: Any attempt to exceed hard quota
   - Action: Log, alert tenant admin

4. Unusual Scheduling Patterns:
   - Threshold: Tenant scheduling 10x normal rate
   - Action: Alert, investigate potential abuse

5. Preemption Anomalies:
   - Threshold: > 100 preemptions/hour for same tenant
   - Action: Alert, investigate potential priority manipulation

6. Node Compromise Indicators:
   - Threshold: Node reporting impossible resource values
   - Action: Cordon node, alert security, investigate
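Rate-based thresholds such as "more than 10 failures per minute from the same source" can be evaluated with a sliding window per key. The sketch below is one minimal approach. The SlidingWindowAlert class is a hypothetical name, and timestamps are passed in explicitly (in seconds) so the logic stays independent of the wall clock.

```python
from collections import defaultdict, deque


class SlidingWindowAlert:
    def __init__(self, threshold: int, window_seconds: float = 60.0):
        self.threshold = threshold
        self.window = window_seconds
        self._events = defaultdict(deque)  # key -> timestamps inside the window

    def record(self, key: str, timestamp: float) -> bool:
        """Record one event. Return True when the count in the window exceeds the threshold."""
        events = self._events[key]
        events.append(timestamp)
        # Drop events that have aged out of the window.
        while events and events[0] <= timestamp - self.window:
            events.popleft()
        return len(events) > self.threshold


if __name__ == "__main__":
    auth_failures = SlidingWindowAlert(threshold=10)
    for second in range(12):
        if auth_failures.record("10.0.1.42", float(second)):
            print(f"alert at t={second}s: too many authentication failures")
```

The same class covers the authorization-denial and preemption thresholds by changing the key (user or tenant), the threshold, and the window length.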

Incident Response Playbook:
1. Detect: Automated alerting on security metrics
2. Contain: Isolate affected tenant/node (cordon, revoke tokens)
3. Investigate: Audit log analysis, timeline reconstruction
4. Remediate: Patch vulnerability, rotate credentials
5. Recover: Restore service, verify integrity
6. Report: Post-incident report, update policies

Zero-Trust Architecture

Principles Applied to Resource Allocation:

1. Never Trust, Always Verify:
   - Every API call authenticated (no anonymous access)
   - Every scheduling decision authorized (RBAC check)
   - Node identity verified on every heartbeat (certificate validation)

2. Least Privilege:
   - Scheduler: Only access to scheduling-related etcd keys
   - Nodes: Only report own status, receive own directives
   - Tenants: Only see own resources and metrics
   - Operators: Read-only by default, write requires approval

3. Assume Breach:
   - Encrypt all data (even internal communication)
   - Segment network (scheduler can't reach tenant workloads)
   - Rotate credentials frequently (24-hour certificate lifetime)
   - Monitor for lateral movement (unexpected access patterns)

4. Explicit Verification:
   - Multi-factor for admin operations (quota changes, preemptions)
   - Approval workflow for sensitive changes (priority escalation)
   - Break-glass procedure for emergency access (logged, time-limited)

This security architecture ensures the resource allocation service protects tenant isolation, prevents abuse, maintains compliance, and provides the audit trail necessary for regulated environments.