Key-Value Store - Scale and Constraints
Traffic Estimation
Request Volume
- Total Requests: 100 billion operations per day
- Read Operations: 70 billion reads per day (70% of traffic)
- Write Operations: 30 billion writes per day (30% of traffic)
- Peak Traffic: 3x average during peak hours
- Operations per Second: ~1.16M ops/sec average, 3.5M ops/sec peak
- Batch Operations: 20% of requests are batch operations
User and Client Metrics
- Active Clients: 100,000 application servers
- Concurrent Connections: 500,000 active connections
- Requests per Client: 1,000 ops/sec per client average
- Connection Lifetime: 1 hour average connection duration
- Client Distribution: 60% same region, 30% cross-region, 10% global
- Client Types: 40% web servers, 40% microservices, 20% batch jobs
Storage Capacity Planning
Data Volume Estimation
- Total Keys: 10 billion unique keys
- Average Key Size: 50 bytes (UUID or composite key)
- Average Value Size: 1KB (mix of small and large values)
- Total Raw Data: 10TB (10 billion × 1KB)
- Replication Factor: 3x for high availability
- Total Storage with Replication: 30TB
- Metadata Overhead: 10% (3TB for indexes, versions, tombstones)
- Total Storage Required: 33TB
Storage Growth
- Daily New Keys: 100 million new keys per day
- Daily Data Growth: 100GB raw data per day
- Monthly Growth: 3TB per month
- Annual Growth: 36TB per year
- Storage Retention: 2 years for active data
- Archive Storage: 5 years for compliance
- Projected 3-Year Storage: 150TB
Value Size Distribution
- Tiny (<100 bytes): 30% of keys (user sessions, flags)
- Small (100B-1KB): 40% of keys (user profiles, configs)
- Medium (1KB-10KB): 20% of keys (cached API responses)
- Large (10KB-100KB): 8% of keys (documents, images metadata)
- Very Large (100KB-1MB): 2% of keys (serialized objects)
- Maximum Value Size: 1MB hard limit
Performance Requirements
Latency Targets
- P50 Read Latency: <500μs (in-memory cache hit)
- P95 Read Latency: <1ms (local SSD read)
- P99 Read Latency: <2ms (includes network overhead)
- P99.9 Read Latency: <10ms (cross-AZ read)
- P50 Write Latency: <2ms (synchronous replication)
- P95 Write Latency: <5ms (quorum write)
- P99 Write Latency: <10ms (includes fsync)
- P99.9 Write Latency: <50ms (slow disk or network)
Throughput Requirements
- Per-Node Read Throughput: 100,000 reads/sec
- Per-Node Write Throughput: 50,000 writes/sec
- Cluster Read Throughput: 10M+ reads/sec
- Cluster Write Throughput: 5M+ writes/sec
- Batch Operation Throughput: 1M+ ops/sec
- Scan Throughput: 100K keys/sec per node
Concurrent Operations
- Concurrent Reads: 100,000 simultaneous read operations
- Concurrent Writes: 50,000 simultaneous write operations
- Concurrent Connections: 10,000 per node, 500K cluster-wide
- Concurrent Transactions: 10,000 active transactions
- Concurrent Scans: 1,000 active scan operations
- Connection Pool Size: 100 connections per client
Network Bandwidth
Inbound Traffic
- Read Requests: 70B reads/day × 100 bytes = 7TB/day
- Write Requests: 30B writes/day × 1KB = 30TB/day
- Total Inbound: 37TB/day = 428MB/sec average
- Peak Inbound: 1.3GB/sec during peak hours
- Per-Node Inbound: 4.3MB/sec average per 100 nodes
Outbound Traffic
- Read Responses: 70B reads/day × 1KB = 70TB/day
- Write Acknowledgments: 30B writes/day × 100 bytes = 3TB/day
- Total Outbound: 73TB/day = 845MB/sec average
- Peak Outbound: 2.5GB/sec during peak hours
- Per-Node Outbound: 8.5MB/sec average per 100 nodes
Replication Traffic
- Intra-Region Replication: 30TB/day × 2 replicas = 60TB/day
- Cross-Region Replication: 30TB/day × 1 replica = 30TB/day
- Total Replication: 90TB/day = 1GB/sec average
- Peak Replication: 3GB/sec during peak hours
- Replication Lag Target: <100ms intra-region, <1s cross-region
Memory Requirements
Per-Node Memory Allocation
- Data Cache: 64GB for hot data (80% of RAM)
- Index Cache: 8GB for key indexes and bloom filters
- Connection Buffers: 4GB for client connections (10K × 400KB)
- Write Buffer: 4GB for write-ahead log and memtable
- Operating System: 4GB for OS and system processes
- Total per Node: 84GB RAM (using 96GB machines)
Cluster Memory
- Number of Nodes: 100 nodes for 100M ops/sec
- Total Cluster Memory: 8.4TB RAM
- Cache Hit Rate: 95% for read operations
- Memory Efficiency: 90% effective utilization
- Cache Eviction: LRU policy with TTL awareness
Memory Distribution
- Hot Data (1 hour): 40GB per node (frequently accessed)
- Warm Data (24 hours): 20GB per node (occasionally accessed)
- Metadata: 4GB per node (indexes, bloom filters, stats)
- Reserved: 16GB per node (headroom for spikes)
Compute Resources
CPU Requirements
- Per-Node CPU: 32 cores (2 × 16-core processors)
- CPU Utilization: 60% average, 80% peak
- Read Operations: 0.1ms CPU time per read
- Write Operations: 0.5ms CPU time per write
- Compaction: 10% CPU reserved for background tasks
- Replication: 5% CPU for replication processing
Cluster Compute
- Total Nodes: 100 nodes for target throughput
- Total CPU Cores: 3,200 cores
- Compute Capacity: 1.16M ops/sec requires ~60% CPU
- Headroom: 40% capacity for growth and spikes
- Auto-Scaling: Add nodes when CPU >70% for 5 minutes
Disk I/O Requirements
Disk Specifications
- Disk Type: NVMe SSD for low latency
- Disk Capacity: 1TB per node (10TB usable per node with compression)
- Read IOPS: 500K IOPS per disk
- Write IOPS: 100K IOPS per disk
- Sequential Read: 3GB/sec per disk
- Sequential Write: 2GB/sec per disk
I/O Patterns
- Read I/O: 70% cache hits, 30% disk reads = 30K IOPS per node
- Write I/O: Sequential writes to WAL + compaction = 20K IOPS per node
- Compaction I/O: Background compaction = 10K IOPS per node
- Total I/O: 60K IOPS per node (well within SSD limits)
- I/O Amplification: 3x for LSM-tree compaction
Write-Ahead Log (WAL)
- WAL Size: 10GB per node (1 hour of writes)
- WAL Rotation: Every 1GB or 10 minutes
- WAL Sync: fsync every 100ms or 1000 writes
- WAL Replay: <1 minute for node recovery
- WAL Replication: Async replication to replicas
Replication and Consistency
Replication Configuration
- Replication Factor: 3 (1 primary + 2 replicas)
- Replication Strategy: Quorum-based (R + W > N)
- Consistency Level: Configurable per operation
- Strong: R=2, W=2 (majority quorum)
- Eventual: R=1, W=1 (fastest)
- Read-Your-Writes: W=2, R=1 with session affinity
- Replication Lag: <100ms for 99% of writes
- Cross-Region Lag: <1 second for async replication
Data Distribution
- Partitioning Strategy: Consistent hashing with virtual nodes
- Virtual Nodes: 256 vnodes per physical node
- Partition Count: 25,600 partitions (100 nodes × 256 vnodes)
- Partition Size: 1.3GB per partition (33TB / 25,600)
- Rebalancing: Automatic when nodes added/removed
- Hot Partition Handling: Split hot partitions dynamically
Availability and Fault Tolerance
Failure Scenarios
- Node Failure Rate: 1% of nodes fail per month
- Expected Failures: 1 node failure per month in 100-node cluster
- Failure Detection: <5 seconds using heartbeats
- Failover Time: <30 seconds to promote replica
- Recovery Time: <5 minutes to restore replication factor
- Data Loss: Zero data loss with quorum writes
Disaster Recovery
- Backup Frequency: Incremental every hour, full daily
- Backup Retention: 30 days for point-in-time recovery
- Backup Size: 33TB full backup, 1.4TB incremental per day
- Restore Time: <4 hours for full cluster restore
- Cross-Region Replication: Async replication to 2 regions
- RPO: <5 minutes (time between backups)
- RTO: <1 hour (time to failover to backup region)
Cost Estimation
Infrastructure Costs (Monthly)
- Compute: 100 nodes × $500/month = $50,000
- Storage: 33TB × $0.10/GB/month = $3,300
- Network: 100TB/month × $0.05/GB = $5,000
- Backup Storage: 100TB × $0.02/GB/month = $2,000
- Total Infrastructure: $60,300/month
Operational Costs (Monthly)
- Monitoring: $2,000/month
- Support: $5,000/month
- Personnel: 3 engineers × $15,000/month = $45,000
- Total Operational: $52,000/month
Total Cost of Ownership
- Monthly Total: $112,300
- Annual Total: $1,347,600
- Cost per Operation: $0.00000135 per operation
- Cost per GB Stored: $3.40 per GB per month
This scale analysis provides the foundation for capacity planning, infrastructure provisioning, and cost optimization for a production-grade distributed key-value store.