Container Orchestration System - Database Design
etcd Schema Design
Key Structure
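Kubernetes uses etcd as its primary datastore. Keys are hierarchical: namespaced resources follow the pattern below, and cluster-scoped resources (nodes, persistent volumes) omit the namespace segment: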
/registry/{resource_type}/{namespace}/{name}
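A minimal sketch of how such keys could be assembled, assuming the simplified pattern above. The helper names `registry_key` and `namespace_prefix` are illustrative and not part of Kubernetes or etcd.

```python
from typing import Optional

REGISTRY_PREFIX = "/registry"  # prefix taken from the key pattern above


def registry_key(resource_type: str, name: str, namespace: Optional[str] = None) -> str:
    """Build a key for one resource; pass no namespace for cluster-scoped types."""
    parts = [REGISTRY_PREFIX, resource_type]
    if namespace is not None:
        parts.append(namespace)
    parts.append(name)
    return "/".join(parts)


def namespace_prefix(resource_type: str, namespace: str) -> str:
    """Prefix for a range read over one namespace.

    The trailing slash keeps "prod" from also matching "production".
    """
    return f"{REGISTRY_PREFIX}/{resource_type}/{namespace}/"
```

Because the layout is hierarchical, listing every pod in a namespace becomes a single prefix range read over `namespace_prefix("pods", "default")`.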
Examples:
/registry/pods/default/nginx-pod
/registry/services/kube-system/kube-dns
/registry/deployments/production/web-app
/registry/configmaps/default/app-config
Resource Storage
Pods
Key: /registry/pods/{namespace}/{pod_name}
Value: {
  "metadata": {
    "name": "nginx-pod",
    "namespace": "default",
    "uid": "uuid",
    "resourceVersion": "12345",
    "creationTimestamp": "2026-01-08T10:00:00Z",
    "labels": {"app": "nginx", "tier": "frontend"}
  },
  "spec": {
    "containers": [{
      "name": "nginx",
      "image": "nginx:1.21",
      "ports": [{"containerPort": 80}],
      "resources": {
        "requests": {"cpu": "100m", "memory": "128Mi"},
        "limits": {"cpu": "200m", "memory": "256Mi"}
      }
    }],
    "nodeName": "node-1"
  },
  "status": {
    "phase": "Running",
    "podIP": "10.0.1.5",
    "startTime": "2026-01-08T10:00:05Z",
    "containerStatuses": [{
      "name": "nginx",
      "ready": true,
      "restartCount": 0,
      "state": {"running": {"startedAt": "2026-01-08T10:00:05Z"}}
    }]
  }
}
Services
Key: /registry/services/{namespace}/{service_name}
Value: {
  "metadata": {
    "name": "nginx-service",
    "namespace": "default"
  },
  "spec": {
    "type": "ClusterIP",
    "clusterIP": "10.0.0.1",
    "ports": [{
      "port": 80,
      "targetPort": 80,
      "protocol": "TCP"
    }],
    "selector": {"app": "nginx"}
  },
  "status": {
    "loadBalancer": {}
  }
}
Nodes
Key: /registry/nodes/{node_name}
Value: {
  "metadata": {
    "name": "node-1",
    "labels": {
      "kubernetes.io/hostname": "node-1",
      "node.kubernetes.io/instance-type": "m5.2xlarge",
      "topology.kubernetes.io/zone": "us-west-2a"
    }
  },
  "spec": {
    "podCIDR": "10.0.1.0/24",
    "providerID": "aws:///us-west-2a/i-1234567890"
  },
  "status": {
    "capacity": {
      "cpu": "8",
      "memory": "32Gi",
      "pods": "110"
    },
    "allocatable": {
      "cpu": "7800m",
      "memory": "30Gi",
      "pods": "110"
    },
    "conditions": [{
      "type": "Ready",
      "status": "True",
      "lastHeartbeatTime": "2026-01-08T10:00:00Z"
    }]
  }
}
etcd Optimization
Key Design Patterns
1. Pagination:
- List operations with limit
- Continue token for next page
- Avoid loading all objects
2. Watch Mechanism:
- Efficient change notification
- Bookmark for resume
- Resource version tracking
3. Compaction:
- Remove old revisions
- Keep last 5 minutes
- Reduce database size
4. Defragmentation:
- Reclaim space
- Run during low traffic
- Improve performance
etcd Performance Tuning
Configuration:
- Snapshot count: 10,000
- Heartbeat interval: 100ms
- Election timeout: 1000ms
- Max request bytes: 1.5 MB
- Max concurrent streams: 128
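The timing values above depend on each other and on the network between members. The sketch below checks them against three rules of thumb that are assumptions here, not guarantees: election timeout at least 5x the heartbeat interval, election timeout at least 10x the peer round-trip time, and heartbeat interval no shorter than the round-trip time. `EtcdTuning` and `check_timing` are illustrative names.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EtcdTuning:
    """Values from the configuration list above."""
    snapshot_count: int = 10_000
    heartbeat_interval_ms: int = 100
    election_timeout_ms: int = 1000
    max_request_bytes: int = 1_572_864  # assumes "1.5 MB" means 1.5 MiB
    max_concurrent_streams: int = 128


def check_timing(cfg: EtcdTuning, peer_rtt_ms: float) -> List[str]:
    """Return warnings; an empty list means the timings look sane."""
    warnings = []
    # Assumed rule of thumb: election timeout >= 5x heartbeat interval.
    if cfg.election_timeout_ms < 5 * cfg.heartbeat_interval_ms:
        warnings.append("election timeout below 5x heartbeat interval")
    # Assumed rule of thumb: election timeout >= 10x peer round-trip time.
    if cfg.election_timeout_ms < 10 * peer_rtt_ms:
        warnings.append("election timeout below 10x peer round-trip time")
    # Assumed rule of thumb: heartbeat interval >= peer round-trip time.
    if cfg.heartbeat_interval_ms < peer_rtt_ms:
        warnings.append("heartbeat interval below peer round-trip time")
    return warnings
```

With the defaults and a 10 ms round trip, `check_timing(EtcdTuning(), peer_rtt_ms=10)` returns no warnings. A cross-region cluster with a 150 ms round trip would trigger two.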
Hardware:
- SSD storage (required)
- 8 cores CPU
- 16 GB RAM
- Dedicated disk for WAL
- Low-latency network
Monitoring:
- Write latency (target: <10ms)
- Commit latency (target: <50ms)
- Database size (limit: 8 GB)
- Compaction duration
Caching Layer
API Server Cache
In-Memory Cache:
- Cache frequently accessed objects
- Reduce etcd load
- Serve reads from cache
- Invalidate on writes
Cache Strategy:
- Cache pods, services, nodes
- TTL: 30 seconds
- Size: 10 GB per API server
- Hit rate: 80%
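The TTL-plus-invalidation strategy above can be sketched as a small read-through cache. This is an illustration of the strategy only, under the assumption that a `load` callback stands in for the etcd read. The production kube-apiserver keeps its cache current through an etcd watch, not through a TTL. `TTLCache` is an illustrative name.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Read-through cache with per-entry expiry and explicit invalidation."""

    def __init__(self, ttl_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[str, Tuple[float, Any]] = {}  # key -> (expires_at, value)
        self.hits = 0
        self.misses = 0

    def get(self, key: str, load: Callable[[str], Any]) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]
        self.misses += 1
        value = load(key)  # fall through to the backing store
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key: str) -> None:
        """Call on every write so the next read reloads the object."""
        self._entries.pop(key, None)

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Tracking `hits` and `misses` makes the 80% hit-rate target measurable instead of assumed.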
Benefits:
- Reduced etcd load (80% reduction)
- Faster API responses (<10ms)
- Better scalability
Informer Cache (Client-Side)
Informer Pattern:
- List all objects initially
- Watch for changes
- Maintain local cache
- Serve from cache
Implementation:
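informer = NewInformer(
    listFunc,                 // initial list of all objects
    watchFunc,                // watch for changes after the list
    resyncPeriod: 30 seconds  // periodically re-deliver cached objects to handlers
)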
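The list-then-watch pattern can be sketched in a few lines. This is a simplified model, not the client-go implementation: `list_func` and `watch_func` are hypothetical callables supplied by the caller, objects are plain dicts keyed by `name`, and the periodic resync is omitted.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Obj = Dict[str, str]
# (event type, object, resource version); event type is ADDED, MODIFIED or DELETED
Event = Tuple[str, Obj, int]


class Informer:
    def __init__(self,
                 list_func: Callable[[], Tuple[List[Obj], int]],
                 watch_func: Callable[[int], Iterable[Event]]):
        self._list = list_func
        self._watch = watch_func
        self.cache: Dict[str, Obj] = {}
        self.resource_version = 0

    def sync(self) -> None:
        """Initial list: replace the cache with one consistent snapshot."""
        objects, version = self._list()
        self.cache = {obj["name"]: obj for obj in objects}
        self.resource_version = version

    def process_events(self) -> None:
        """Watch from the last seen version and apply each change locally."""
        for event_type, obj, version in self._watch(self.resource_version):
            if event_type == "DELETED":
                self.cache.pop(obj["name"], None)
            else:  # ADDED and MODIFIED both overwrite the cached copy
                self.cache[obj["name"]] = obj
            self.resource_version = version

    def get(self, name: str) -> Optional[Obj]:
        """Serve reads from the local cache without calling the API server."""
        return self.cache.get(name)
```

Recording the resource version after every event is what lets a dropped watch resume where it left off instead of starting over with a full list.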
Benefits:
- No repeated API calls
- Real-time updates
- Reduced API server load
- Efficient client-side caching
Auxiliary Storage
Container Registry
Registry: Docker Hub, ECR, GCR, Harbor
Storage:
- Image layers: 5 TB
- Manifests: 100 GB
- Metadata: 10 GB
Caching:
- Node-local image cache
- Pull-through cache
- Reduce registry load
Performance:
- Image pull: 10 images/second per node
- Bandwidth: 1 Gbps per node
- Parallel pulls: 5 concurrent
Persistent Volumes
Volume Types:
- Local: Node-local SSD
- NFS: Network file system
- Cloud: EBS, GCE PD
- Distributed: Ceph, GlusterFS
Metadata (etcd):
Key: /registry/persistentvolumes/{pv_name}
Value: {
  "spec": {
    "capacity": {"storage": "100Gi"},
    "accessModes": ["ReadWriteOnce"],
    "storageClassName": "fast-ssd",
    "hostPath": {"path": "/mnt/data"}
  },
  "status": {
    "phase": "Bound"
  }
}
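Matching a claim against PV records shaped like the one above reduces to a filter over a few fields. The sketch below assumes a flattened, hypothetical claim dict (`storage`, `accessModes`, `storageClassName`) and prefers the smallest volume that fits. It ignores selectors, volume modes and node affinity.

```python
from typing import Dict, List, Optional

_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_quantity(quantity: str) -> int:
    """Parse a binary-suffixed storage quantity such as "100Gi" into bytes."""
    for suffix, factor in _SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # plain byte count


def _capacity(pv: Dict) -> int:
    return parse_quantity(pv["spec"]["capacity"]["storage"])


def matches(pv: Dict, claim: Dict) -> bool:
    """A PV qualifies if it is unbound, in the right class, and large enough."""
    spec = pv["spec"]
    return (
        pv.get("status", {}).get("phase") == "Available"
        and spec.get("storageClassName") == claim.get("storageClassName")
        and set(claim["accessModes"]) <= set(spec["accessModes"])
        and _capacity(pv) >= parse_quantity(claim["storage"])
    )


def pick_volume(pvs: List[Dict], claim: Dict) -> Optional[Dict]:
    """Return the smallest matching PV, or None if nothing fits."""
    candidates = [pv for pv in pvs if matches(pv, claim)]
    return min(candidates, key=_capacity, default=None)
```

Choosing the smallest fit keeps large volumes free for claims that actually need them. A `None` result is the point where dynamic provisioning would take over.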
Volume Binding:
- Static: Pre-created PVs
- Dynamic: Provision on demand
- Binding: Match PVC to PV
Backup and Recovery
etcd Backup
Backup Strategy:
- Snapshot every hour
- Retention: 24 hours
- Cross-region replication
- Encrypted backups
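The 24-hour retention rule can be sketched as a pure function over snapshot timestamps. The snapshot names and the choice to always keep the newest snapshot, even when it is past the window, are assumptions of this sketch and not part of the strategy above.

```python
from datetime import datetime, timedelta
from typing import Dict, List

RETENTION = timedelta(hours=24)  # retention window from the strategy above


def expired_snapshots(snapshots: Dict[str, datetime], now: datetime,
                      retention: timedelta = RETENTION) -> List[str]:
    """Return names of snapshots older than the window, oldest first.

    The newest snapshot is never returned, so a stalled backup job
    cannot cause the last usable snapshot to be pruned.
    """
    if not snapshots:
        return []
    newest = max(snapshots, key=lambda name: snapshots[name])
    cutoff = now - retention
    old = [name for name, taken in snapshots.items()
           if taken < cutoff and name != newest]
    return sorted(old, key=lambda name: snapshots[name])
```

Keeping the function free of file-system calls makes the policy easy to test. The caller deletes whatever names it returns.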
Backup Process:
etcdctl snapshot save backup.db
etcdctl snapshot status backup.db
Restore Process:
etcdctl snapshot restore backup.db \
--data-dir=/var/lib/etcd-restore
Recovery Time: <5 minutes
Disaster Recovery
Scenario: Complete cluster failure
Recovery Steps:
1. Restore etcd from backup
2. Start control plane components
3. Nodes reconnect automatically
4. Kubelet reconciles pod state
5. Controllers recreate missing resources
Recovery Time: <10 minutes
Data Loss: up to 1 hour (writes since the last hourly snapshot)
Data Consistency
Consistency Models
Strong Consistency (etcd):
- All writes go through Raft
- Linearizable reads
- Guaranteed ordering
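Because every write goes through Raft, a write commits only once a majority of members have persisted it. A minimal sketch of the resulting quorum arithmetic follows. The function names are illustrative.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with the given member count."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be lost while writes still commit."""
    return members - quorum(members)
```

A 3-member cluster tolerates one failure and a 5-member cluster tolerates two. A 4-member cluster still tolerates only one, which is why odd sizes are the usual choice.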
Eventual Consistency (Status):
- Pod status updates
- Node status updates
- Endpoint updates
- 5-10 second lag acceptable
Optimistic Concurrency:
- Resource version for updates
- Conflict detection
- Retry on conflict
Conflict Resolution
Update Conflict:
1. Client reads object (version: 100)
2. Client modifies object
3. Client updates with version 100
4. If current version != 100: Conflict
5. Client re-reads the object and retries with the new version
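The steps above can be sketched as a read-modify-write loop. `Store` is a hypothetical in-memory stand-in for the API server, used only so the example runs on its own. `Conflict` plays the role of the 409 response.

```python
import copy
from typing import Callable, Dict


class Conflict(Exception):
    """The supplied resourceVersion is stale (HTTP 409 in the real API)."""


class Store:
    """In-memory stand-in for the API server; for illustration only."""

    def __init__(self) -> None:
        self._objects: Dict[str, Dict] = {}
        self._revision = 0

    def _save(self, name: str, obj: Dict) -> None:
        self._revision += 1
        stored = copy.deepcopy(obj)
        stored.setdefault("metadata", {})["resourceVersion"] = str(self._revision)
        self._objects[name] = stored

    def create(self, name: str, obj: Dict) -> None:
        self._save(name, obj)

    def get(self, name: str) -> Dict:
        return copy.deepcopy(self._objects[name])

    def update(self, name: str, obj: Dict) -> None:
        current = self._objects[name]["metadata"]["resourceVersion"]
        if obj["metadata"]["resourceVersion"] != current:
            raise Conflict(name)  # step 4: version mismatch
        self._save(name, obj)


def update_with_retry(store: Store, name: str,
                      mutate: Callable[[Dict], None],
                      max_attempts: int = 5) -> Dict:
    for _ in range(max_attempts):
        obj = store.get(name)        # step 1: read object and its version
        mutate(obj)                  # step 2: modify the local copy
        try:
            store.update(name, obj)  # step 3: write back with the read version
            return store.get(name)
        except Conflict:
            continue                 # step 5: re-read and try again
    raise Conflict(f"{name}: gave up after {max_attempts} attempts")
```

The mutation is reapplied to a fresh copy on every attempt, so a concurrent writer's changes are preserved and not overwritten.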
Implementation:
PUT /api/v1/namespaces/default/pods/nginx
{
  "metadata": {
    "resourceVersion": "100"
  },
  ...
}
Response: 409 Conflict (if version mismatch)
Data Retention
Retention Policies
Active Resources:
- Pods, Services, Deployments: Until deleted
- Nodes: Until deregistered
Events:
- Retention: 1 hour
- Separate etcd instance
- Prevent main etcd bloat
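Routing events to their own etcd instance amounts to a per-resource lookup in front of the storage layer. The sketch below uses hypothetical endpoint URLs and an illustrative helper name. In a real cluster this routing is configured on the API server and not written by hand.

```python
from typing import Dict, List

# Hypothetical endpoints; real deployments supply their own.
MAIN_ETCD: List[str] = ["https://etcd-main:2379"]
OVERRIDES: Dict[str, List[str]] = {
    "events": ["https://etcd-events:2379"],  # high-churn, short-lived data
}


def etcd_servers_for(resource_type: str) -> List[str]:
    """Pick the etcd endpoints for a resource type, defaulting to the main cluster."""
    return OVERRIDES.get(resource_type, MAIN_ETCD)
```

Only events leave the main cluster here, so pods, services and nodes keep their strongly consistent store while event churn no longer competes with them for write capacity.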
Audit Logs:
- Retention: 1 year
- External storage (S3)
- Compliance requirement
Metrics:
- Recent: 15 days (Prometheus)
- Historical: 1 year (long-term storage)
This design lets Kubernetes manage large clusters efficiently: etcd's Raft consensus provides consistency and reliability, while the API server and informer caches keep read latency and etcd load low.