Variations & Follow-ups

📖 9 min read 📄 Part 8 of 10

Container Orchestration System - Variations and Follow-up Questions

Common Variations

1. Serverless on Kubernetes (Knative)

Problem: Auto-scale to zero and scale from zero

Implementation:

Components:
- Knative Serving: Serverless containers
- Knative Eventing: Event-driven architecture
- Auto-scaler: Scale based on requests

Features:
- Scale to zero (no pods when idle)
- Scale from zero (<1 second cold start)
- Request-based autoscaling
- Traffic splitting (canary, blue-green)

Challenges:
- Cold start latency
- Connection draining
- State management
- Cost vs performance

2. StatefulSets (Stateful Applications)

Problem: Run databases and stateful apps in Kubernetes

Key Differences from Deployments:

Stable Identity:
- Predictable pod names (app-0, app-1, app-2)
- Stable network identity
- Stable storage

Ordered Operations:
- Sequential pod creation (0 → 1 → 2)
- Sequential pod deletion (2 → 1 → 0)
- Rolling updates in order

Persistent Storage:
- PVC per pod
- Survives pod rescheduling
- Automatic volume attachment

Use Cases:
- Databases (MySQL, PostgreSQL, MongoDB)
- Message queues (Kafka, RabbitMQ)
- Distributed systems (Elasticsearch, Cassandra)

3. DaemonSets (One Pod per Node)

Problem: Run system daemons on every node

Use Cases:

Monitoring:
- Node exporter (Prometheus)
- Log collector (Fluentd)
- Metrics agent

Networking:
- CNI plugin
- Kube-proxy
- Network policy agent

Storage:
- CSI driver
- Volume manager

Security:
- Security agent
- Vulnerability scanner

Scheduling:

Behavior:
- Automatically scheduled on new nodes
- Tolerates all taints (runs on all nodes)
- Not counted in resource limits
- Survives node drain

Implementation:
- Controller watches for new nodes
- Creates pod on each node
- Deletes pod when node removed

4. Jobs and CronJobs (Batch Processing)

Problem: Run batch workloads and scheduled tasks

Jobs:

Features:
- Run to completion
- Parallel execution
- Retry on failure
- Cleanup after completion

Use Cases:
- Data processing
- Batch analytics
- Database migrations
- Report generation

Configuration:
completions: 10  # Run 10 times
parallelism: 3   # 3 at a time
backoffLimit: 3  # Max 3 retries

CronJobs:

Features:
- Scheduled execution
- Cron syntax
- Concurrency policy
- History limits

Use Cases:
- Backups
- Cleanup tasks
- Periodic reports
- Health checks

Configuration:
schedule: "0 2 * * *"  # Daily at 2 AM
concurrencyPolicy: Forbid  # Don't overlap
successfulJobsHistoryLimit: 3
failedJobsHistoryLimit: 1

5. Multi-Cluster Management

Problem: Manage 100+ clusters from single control plane

Implementation:

Cluster API:
- Declarative cluster management
- Provision clusters via YAML
- Lifecycle management
- Upgrade automation

Cluster Federation:
- Federated resources
- Cross-cluster scheduling
- Global load balancing
- Disaster recovery

Fleet Management:
- GitOps for configuration
- Policy enforcement
- Compliance checking
- Cost optimization

Challenges:
- Cross-cluster networking
- Consistent configuration
- Unified monitoring
- Access control

Follow-up Questions

Technical Deep Dives

Q1: How does the scheduler make decisions?

Scheduling Algorithm:

1. Filtering Phase:
   - Check node resources (CPU, memory)
   - Check node selector labels
   - Check taints and tolerations
   - Check affinity/anti-affinity rules
   - Result: Feasible nodes

2. Scoring Phase:
   For each feasible node, calculate score (0-100):
   
   Resource Balance (30%):
   score = 100 × (1 - abs(cpu_usage - memory_usage))
   
   Image Locality (20%):
   score = 100 if image cached, 0 otherwise
   
   Pod Affinity (25%):
   score = 100 × (matching_pods / total_pods)
   
   Node Affinity (25%):
   score = 100 if preferred, 50 if allowed, 0 otherwise
   
   Total Score = Σ(weight × score)

3. Selection:
   - Pick highest-scoring node
   - Bind pod to node
   - Update etcd

Complexity: O(n log n) where n = 5,000 nodes
Time: <1 second per pod

Q2: How do you handle node failures?

Detection and Recovery:

1. Node Failure Detection:
   - Kubelet stops sending heartbeats
   - Node controller marks NotReady after 40s
   - Pods marked as Unknown after 5 minutes

2. Pod Eviction:
   - Controller evicts pods from failed node
   - Pods marked for deletion
   - New pods scheduled on healthy nodes

3. Rescheduling:
   - Scheduler places pods on new nodes
   - Kubelet starts containers
   - Services updated with new endpoints

4. Storage Handling:
   - Detach volumes from failed node
   - Attach to new node
   - Mount in new pod

Timeline:
- Detection: 40 seconds
- Eviction: 5 minutes
- Rescheduling: 1 minute
- Total: ~6 minutes

Optimization:
- Reduce detection time (faster heartbeats)
- Parallel rescheduling
- Pre-pull images
- Fast volume attachment

Q3: How do you implement rolling updates without downtime?

Rolling Update Strategy:

Configuration:
maxSurge: 1        # Max 1 extra pod during update
maxUnavailable: 0  # No downtime

Process:
1. Current: 3 pods (v1)
2. Create 1 new pod (v2) → 4 pods total
3. Wait for new pod ready
4. Delete 1 old pod (v1) → 3 pods total
5. Repeat until all pods updated

Readiness Probes:
- Check if pod ready to serve traffic
- Don't route traffic until ready
- Prevent serving errors

Rollback:
- If new pods fail readiness
- Automatically rollback
- Keep old ReplicaSet
- Instant rollback possible

Timeline:
- 3 pods, 30-second startup
- Total: 90 seconds for complete rollout
- Zero downtime throughout

Q4: How do you handle secrets securely?

Secret Management:

1. Storage:
   - Encrypted at rest in etcd
   - Encryption key in KMS
   - Key rotation every 90 days

2. Distribution:
   - Mounted as volume or env var
   - Only to pods that need them
   - Encrypted in transit (TLS)

3. Access Control:
   - RBAC for secret access
   - Namespace isolation
   - Service account permissions

4. External Secrets:
   - Integration with Vault, AWS Secrets Manager
   - Sync to Kubernetes secrets
   - Automatic rotation

5. Best Practices:
   - Don't log secrets
   - Don't commit to git
   - Use external secret managers
   - Audit secret access

Implementation:
apiVersion: v1
kind: Secret
metadata:
  name: db-secret
type: Opaque
data:
  username: YWRtaW4=  # base64 encoded
  password: cGFzc3dvcmQ=

Pod Usage:
env:
- name: DB_PASSWORD
  valueFrom:
    secretKeyRef:
      name: db-secret
      key: password

Q5: How do you implement service mesh?

Service Mesh (Istio):

Architecture:
- Sidecar proxy (Envoy) per pod
- Control plane (Istiod)
- Telemetry collection

Features:
1. Traffic Management:
   - Intelligent routing
   - Load balancing
   - Retries and timeouts
   - Circuit breakers

2. Security:
   - mTLS between services
   - Authorization policies
   - Certificate management

3. Observability:
   - Distributed tracing
   - Metrics collection
   - Access logs

Implementation:
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews
  http:
  - match:
    - headers:
        user:
          exact: "tester"
    route:
    - destination:
        host: reviews
        subset: v2
  - route:
    - destination:
        host: reviews
        subset: v1
      weight: 90
    - destination:
        host: reviews
        subset: v2
      weight: 10

Overhead:
- CPU: 0.5 cores per pod
- Memory: 50 MB per pod
- Latency: 1-2ms additional

Benefits: Worth overhead for advanced features

Scalability Questions

Q6: How would you scale to 10,000 nodes?

Current: 5,000 nodes
Target: 10,000 nodes (2x)

Challenges:
1. etcd Limits:
   - Max 8 GB database
   - Max 1,000 writes/second
   - Max 10,000 watch streams

2. API Server:
   - 10K req/s → 20K req/s
   - More connections
   - Higher memory

3. Scheduler:
   - 10K node evaluations per pod
   - Slower scheduling
   - More CPU needed

Solutions:
1. etcd:
   - Aggressive compaction
   - Pagination for lists
   - Separate etcd for events
   - Client-side caching

2. API Server:
   - Scale to 20 instances
   - Larger cache
   - Request prioritization

3. Scheduler:
   - Pre-filter nodes
   - Parallel scoring
   - Cache node info
   - Batch scheduling

4. Networking:
   - eBPF for performance
   - Optimize network policies
   - Better CNI plugin

Cost: $70M/year (2x current)

Q7: How do you handle cluster upgrades?

Upgrade Strategy:

1. Control Plane Upgrade:
   - Upgrade API servers (rolling)
   - Upgrade controller manager
   - Upgrade scheduler
   - Upgrade etcd (careful!)
   - Time: 30 minutes

2. Node Upgrade:
   - Cordon node (mark unschedulable)
   - Drain node (evict pods)
   - Upgrade kubelet and runtime
   - Uncordon node
   - Repeat for all nodes
   - Time: 2-4 hours for 5,000 nodes

3. Validation:
   - Test in staging first
   - Canary upgrade (10% nodes)
   - Monitor metrics
   - Rollback if issues

4. Automation:
   - Cluster API for automation
   - GitOps for configuration
   - Automated testing
   - Gradual rollout

Best Practices:
- Upgrade one minor version at a time
- Backup etcd before upgrade
- Test in non-production first
- Have rollback plan
- Schedule during low traffic

Business Logic Questions

Q8: How do you implement multi-tenancy?

Multi-Tenancy Strategies:

1. Namespace Isolation:
   - One namespace per tenant
   - Resource quotas
   - Network policies
   - RBAC

2. Resource Quotas:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-quota
spec:
  hard:
    requests.cpu: "100"
    requests.memory: "200Gi"
    persistentvolumeclaims: "50"
    services.loadbalancers: "5"

3. Network Policies:
   - Isolate tenant traffic
   - Allow only necessary communication
   - Deny by default

4. RBAC:
   - Tenant-specific service accounts
   - Namespace-scoped permissions
   - No cluster-wide access

5. Monitoring:
   - Per-tenant metrics
   - Cost allocation
   - Usage tracking

Challenges:
- Noisy neighbor (resource contention)
- Security isolation
- Fair scheduling
- Cost attribution

Q9: How do you implement disaster recovery?

DR Strategy:

1. Backup:
   - etcd snapshots (hourly)
   - Application data backups
   - Configuration backups
   - Store in multiple regions

2. Multi-Region:
   - Primary cluster: Active
   - Secondary cluster: Standby
   - Async replication
   - DNS failover

3. Recovery Process:
   - Restore etcd from backup
   - Start control plane
   - Nodes reconnect
   - Pods rescheduled
   - Services restored

4. Testing:
   - Monthly DR drills
   - Automated testing
   - Measure RTO/RPO
   - Update procedures

Targets:
- RTO: 5 minutes
- RPO: 1 minute
- Availability: 99.99%

Q10: How do you implement cost optimization?

Cost Optimization Strategies:

1. Right-Sizing:
   - VPA recommendations
   - Analyze actual usage
   - Adjust requests/limits
   - 30% cost savings

2. Bin Packing:
   - Pack pods efficiently
   - Minimize wasted resources
   - 70%+ utilization
   - 20% cost savings

3. Spot Instances:
   - Use for fault-tolerant workloads
   - 70% cheaper than on-demand
   - Handle interruptions
   - 40% cost savings

4. Cluster Autoscaling:
   - Scale down during low traffic
   - Remove idle nodes
   - Right-size node types
   - 25% cost savings

5. Resource Quotas:
   - Prevent over-provisioning
   - Enforce limits per team
   - Chargeback model
   - Accountability

Total Savings: 60% vs unoptimized

These variations and follow-up questions demonstrate the breadth of capabilities and considerations required when building a production-grade container orchestration platform.