Container Orchestration System - Variations and Follow-up Questions
Common Variations
1. Serverless on Kubernetes (Knative)
Problem: Auto-scale to zero and scale from zero
Implementation:
Components:
- Knative Serving: Serverless containers
- Knative Eventing: Event-driven architecture
- Auto-scaler: Scale based on requests
Features:
- Scale to zero (no pods when idle)
- Scale from zero (<1 second cold start)
- Request-based autoscaling
- Traffic splitting (canary, blue-green)
Challenges:
- Cold start latency
- Connection draining
- State management
- Cost vs performance2. StatefulSets (Stateful Applications)
Problem: Run databases and stateful apps in Kubernetes
Key Differences from Deployments:
Stable Identity:
- Predictable pod names (app-0, app-1, app-2)
- Stable network identity
- Stable storage
Ordered Operations:
- Sequential pod creation (0 → 1 → 2)
- Sequential pod deletion (2 → 1 → 0)
- Rolling updates in order
Persistent Storage:
- PVC per pod
- Survives pod rescheduling
- Automatic volume attachment
Use Cases:
- Databases (MySQL, PostgreSQL, MongoDB)
- Message queues (Kafka, RabbitMQ)
- Distributed systems (Elasticsearch, Cassandra)3. DaemonSets (One Pod per Node)
Problem: Run system daemons on every node
Use Cases:
Monitoring:
- Node exporter (Prometheus)
- Log collector (Fluentd)
- Metrics agent
Networking:
- CNI plugin
- Kube-proxy
- Network policy agent
Storage:
- CSI driver
- Volume manager
Security:
- Security agent
- Vulnerability scannerScheduling:
Behavior:
- Automatically scheduled on new nodes
- Tolerates all taints (runs on all nodes)
- Not counted in resource limits
- Survives node drain
Implementation:
- Controller watches for new nodes
- Creates pod on each node
- Deletes pod when node removed4. Jobs and CronJobs (Batch Processing)
Problem: Run batch workloads and scheduled tasks
Jobs:
Features:
- Run to completion
- Parallel execution
- Retry on failure
- Cleanup after completion
Use Cases:
- Data processing
- Batch analytics
- Database migrations
- Report generation
Configuration:
completions: 10 # Run 10 times
parallelism: 3 # 3 at a time
backoffLimit: 3 # Max 3 retriesCronJobs:
Features:
- Scheduled execution
- Cron syntax
- Concurrency policy
- History limits
Use Cases:
- Backups
- Cleanup tasks
- Periodic reports
- Health checks
Configuration:
schedule: "0 2 * * *" # Daily at 2 AM
concurrencyPolicy: Forbid # Don't overlap
successfulJobsHistoryLimit: 3
failedJobsHistoryLimit: 15. Multi-Cluster Management
Problem: Manage 100+ clusters from single control plane
Implementation:
Cluster API:
- Declarative cluster management
- Provision clusters via YAML
- Lifecycle management
- Upgrade automation
Cluster Federation:
- Federated resources
- Cross-cluster scheduling
- Global load balancing
- Disaster recovery
Fleet Management:
- GitOps for configuration
- Policy enforcement
- Compliance checking
- Cost optimization
Challenges:
- Cross-cluster networking
- Consistent configuration
- Unified monitoring
- Access controlFollow-up Questions
Technical Deep Dives
Q1: How does the scheduler make decisions?
Scheduling Algorithm:
1. Filtering Phase:
- Check node resources (CPU, memory)
- Check node selector labels
- Check taints and tolerations
- Check affinity/anti-affinity rules
- Result: Feasible nodes
2. Scoring Phase:
For each feasible node, calculate score (0-100):
Resource Balance (30%):
score = 100 × (1 - abs(cpu_usage - memory_usage))
Image Locality (20%):
score = 100 if image cached, 0 otherwise
Pod Affinity (25%):
score = 100 × (matching_pods / total_pods)
Node Affinity (25%):
score = 100 if preferred, 50 if allowed, 0 otherwise
Total Score = Σ(weight × score)
3. Selection:
- Pick highest-scoring node
- Bind pod to node
- Update etcd
Complexity: O(n log n) where n = 5,000 nodes
Time: <1 second per podQ2: How do you handle node failures?
Detection and Recovery:
1. Node Failure Detection:
- Kubelet stops sending heartbeats
- Node controller marks NotReady after 40s
- Pods marked as Unknown after 5 minutes
2. Pod Eviction:
- Controller evicts pods from failed node
- Pods marked for deletion
- New pods scheduled on healthy nodes
3. Rescheduling:
- Scheduler places pods on new nodes
- Kubelet starts containers
- Services updated with new endpoints
4. Storage Handling:
- Detach volumes from failed node
- Attach to new node
- Mount in new pod
Timeline:
- Detection: 40 seconds
- Eviction: 5 minutes
- Rescheduling: 1 minute
- Total: ~6 minutes
Optimization:
- Reduce detection time (faster heartbeats)
- Parallel rescheduling
- Pre-pull images
- Fast volume attachmentQ3: How do you implement rolling updates without downtime?
Rolling Update Strategy:
Configuration:
maxSurge: 1 # Max 1 extra pod during update
maxUnavailable: 0 # No downtime
Process:
1. Current: 3 pods (v1)
2. Create 1 new pod (v2) → 4 pods total
3. Wait for new pod ready
4. Delete 1 old pod (v1) → 3 pods total
5. Repeat until all pods updated
Readiness Probes:
- Check if pod ready to serve traffic
- Don't route traffic until ready
- Prevent serving errors
Rollback:
- If new pods fail readiness
- Automatically rollback
- Keep old ReplicaSet
- Instant rollback possible
Timeline:
- 3 pods, 30-second startup
- Total: 90 seconds for complete rollout
- Zero downtime throughoutQ4: How do you handle secrets securely?
Secret Management:
1. Storage:
- Encrypted at rest in etcd
- Encryption key in KMS
- Key rotation every 90 days
2. Distribution:
- Mounted as volume or env var
- Only to pods that need them
- Encrypted in transit (TLS)
3. Access Control:
- RBAC for secret access
- Namespace isolation
- Service account permissions
4. External Secrets:
- Integration with Vault, AWS Secrets Manager
- Sync to Kubernetes secrets
- Automatic rotation
5. Best Practices:
- Don't log secrets
- Don't commit to git
- Use external secret managers
- Audit secret access
Implementation:
apiVersion: v1
kind: Secret
metadata:
name: db-secret
type: Opaque
data:
username: YWRtaW4= # base64 encoded
password: cGFzc3dvcmQ=
Pod Usage:
env:
- name: DB_PASSWORD
valueFrom:
secretKeyRef:
name: db-secret
key: passwordQ5: How do you implement service mesh?
Service Mesh (Istio):
Architecture:
- Sidecar proxy (Envoy) per pod
- Control plane (Istiod)
- Telemetry collection
Features:
1. Traffic Management:
- Intelligent routing
- Load balancing
- Retries and timeouts
- Circuit breakers
2. Security:
- mTLS between services
- Authorization policies
- Certificate management
3. Observability:
- Distributed tracing
- Metrics collection
- Access logs
Implementation:
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
name: reviews
spec:
hosts:
- reviews
http:
- match:
- headers:
user:
exact: "tester"
route:
- destination:
host: reviews
subset: v2
- route:
- destination:
host: reviews
subset: v1
weight: 90
- destination:
host: reviews
subset: v2
weight: 10
Overhead:
- CPU: 0.5 cores per pod
- Memory: 50 MB per pod
- Latency: 1-2ms additional
Benefits: Worth overhead for advanced featuresScalability Questions
Q6: How would you scale to 10,000 nodes?
Current: 5,000 nodes
Target: 10,000 nodes (2x)
Challenges:
1. etcd Limits:
- Max 8 GB database
- Max 1,000 writes/second
- Max 10,000 watch streams
2. API Server:
- 10K req/s → 20K req/s
- More connections
- Higher memory
3. Scheduler:
- 10K node evaluations per pod
- Slower scheduling
- More CPU needed
Solutions:
1. etcd:
- Aggressive compaction
- Pagination for lists
- Separate etcd for events
- Client-side caching
2. API Server:
- Scale to 20 instances
- Larger cache
- Request prioritization
3. Scheduler:
- Pre-filter nodes
- Parallel scoring
- Cache node info
- Batch scheduling
4. Networking:
- eBPF for performance
- Optimize network policies
- Better CNI plugin
Cost: $70M/year (2x current)Q7: How do you handle cluster upgrades?
Upgrade Strategy:
1. Control Plane Upgrade:
- Upgrade API servers (rolling)
- Upgrade controller manager
- Upgrade scheduler
- Upgrade etcd (careful!)
- Time: 30 minutes
2. Node Upgrade:
- Cordon node (mark unschedulable)
- Drain node (evict pods)
- Upgrade kubelet and runtime
- Uncordon node
- Repeat for all nodes
- Time: 2-4 hours for 5,000 nodes
3. Validation:
- Test in staging first
- Canary upgrade (10% nodes)
- Monitor metrics
- Rollback if issues
4. Automation:
- Cluster API for automation
- GitOps for configuration
- Automated testing
- Gradual rollout
Best Practices:
- Upgrade one minor version at a time
- Backup etcd before upgrade
- Test in non-production first
- Have rollback plan
- Schedule during low trafficBusiness Logic Questions
Q8: How do you implement multi-tenancy?
Multi-Tenancy Strategies:
1. Namespace Isolation:
- One namespace per tenant
- Resource quotas
- Network policies
- RBAC
2. Resource Quotas:
apiVersion: v1
kind: ResourceQuota
metadata:
name: tenant-quota
spec:
hard:
requests.cpu: "100"
requests.memory: "200Gi"
persistentvolumeclaims: "50"
services.loadbalancers: "5"
3. Network Policies:
- Isolate tenant traffic
- Allow only necessary communication
- Deny by default
4. RBAC:
- Tenant-specific service accounts
- Namespace-scoped permissions
- No cluster-wide access
5. Monitoring:
- Per-tenant metrics
- Cost allocation
- Usage tracking
Challenges:
- Noisy neighbor (resource contention)
- Security isolation
- Fair scheduling
- Cost attributionQ9: How do you implement disaster recovery?
DR Strategy:
1. Backup:
- etcd snapshots (hourly)
- Application data backups
- Configuration backups
- Store in multiple regions
2. Multi-Region:
- Primary cluster: Active
- Secondary cluster: Standby
- Async replication
- DNS failover
3. Recovery Process:
- Restore etcd from backup
- Start control plane
- Nodes reconnect
- Pods rescheduled
- Services restored
4. Testing:
- Monthly DR drills
- Automated testing
- Measure RTO/RPO
- Update procedures
Targets:
- RTO: 5 minutes
- RPO: 1 minute
- Availability: 99.99%Q10: How do you implement cost optimization?
Cost Optimization Strategies:
1. Right-Sizing:
- VPA recommendations
- Analyze actual usage
- Adjust requests/limits
- 30% cost savings
2. Bin Packing:
- Pack pods efficiently
- Minimize wasted resources
- 70%+ utilization
- 20% cost savings
3. Spot Instances:
- Use for fault-tolerant workloads
- 70% cheaper than on-demand
- Handle interruptions
- 40% cost savings
4. Cluster Autoscaling:
- Scale down during low traffic
- Remove idle nodes
- Right-size node types
- 25% cost savings
5. Resource Quotas:
- Prevent over-provisioning
- Enforce limits per team
- Chargeback model
- Accountability
Total Savings: 60% vs unoptimizedThese variations and follow-up questions demonstrate the breadth of capabilities and considerations required when building a production-grade container orchestration platform.