Scale & Constraints

📖 7 min read 📄 Part 2 of 10

Key-Value Store - Scale and Constraints

Traffic Estimation

Request Volume

Total Requests: 100 billion operations per day
Read Operations: 70 billion reads per day (70% of traffic)
Write Operations: 30 billion writes per day (30% of traffic)
Peak Traffic: 3x average during peak hours
Operations per Second: ~1.16M ops/sec average, 3.5M ops/sec peak
Batch Operations: 20% of requests are batch operations

User and Client Metrics

Active Clients: 100,000 application servers
Concurrent Connections: 500,000 active connections
Requests per Client: 1,000 ops/sec per client average
Connection Lifetime: 1 hour average connection duration
Client Distribution: 60% same region, 30% cross-region, 10% global
Client Types: 40% web servers, 40% microservices, 20% batch jobs

Storage Capacity Planning

Data Volume Estimation

Total Keys: 10 billion unique keys
Average Key Size: 50 bytes (UUID or composite key)
Average Value Size: 1KB (mix of small and large values)
Total Raw Data: 10TB (10 billion × 1KB)
Replication Factor: 3x for high availability
Total Storage with Replication: 30TB
Metadata Overhead: 10% (3TB for indexes, versions, tombstones)
Total Storage Required: 33TB

Storage Growth

Daily New Keys: 100 million new keys per day
Daily Data Growth: 100GB raw data per day
Monthly Growth: 3TB per month
Annual Growth: 36TB per year
Storage Retention: 2 years for active data
Archive Storage: 5 years for compliance
Projected 3-Year Storage: 150TB

Value Size Distribution

Tiny (<100 bytes): 30% of keys (user sessions, flags)
Small (100B-1KB): 40% of keys (user profiles, configs)
Medium (1KB-10KB): 20% of keys (cached API responses)
Large (10KB-100KB): 8% of keys (documents, images metadata)
Very Large (100KB-1MB): 2% of keys (serialized objects)
Maximum Value Size: 1MB hard limit

Performance Requirements

Latency Targets

P50 Read Latency: <500μs (in-memory cache hit)
P95 Read Latency: <1ms (local SSD read)
P99 Read Latency: <2ms (includes network overhead)
P99.9 Read Latency: <10ms (cross-AZ read)
P50 Write Latency: <2ms (synchronous replication)
P95 Write Latency: <5ms (quorum write)
P99 Write Latency: <10ms (includes fsync)
P99.9 Write Latency: <50ms (slow disk or network)

Throughput Requirements

Per-Node Read Throughput: 100,000 reads/sec
Per-Node Write Throughput: 50,000 writes/sec
Cluster Read Throughput: 10M+ reads/sec
Cluster Write Throughput: 5M+ writes/sec
Batch Operation Throughput: 1M+ ops/sec
Scan Throughput: 100K keys/sec per node

Concurrent Operations

Concurrent Reads: 100,000 simultaneous read operations
Concurrent Writes: 50,000 simultaneous write operations
Concurrent Connections: 10,000 per node, 500K cluster-wide
Concurrent Transactions: 10,000 active transactions
Concurrent Scans: 1,000 active scan operations
Connection Pool Size: 100 connections per client

Network Bandwidth

Inbound Traffic

Read Requests: 70B reads/day × 100 bytes = 7TB/day
Write Requests: 30B writes/day × 1KB = 30TB/day
Total Inbound: 37TB/day = 428MB/sec average
Peak Inbound: 1.3GB/sec during peak hours
Per-Node Inbound: 4.3MB/sec average per 100 nodes

Outbound Traffic

Read Responses: 70B reads/day × 1KB = 70TB/day
Write Acknowledgments: 30B writes/day × 100 bytes = 3TB/day
Total Outbound: 73TB/day = 845MB/sec average
Peak Outbound: 2.5GB/sec during peak hours
Per-Node Outbound: 8.5MB/sec average per 100 nodes

Replication Traffic

Intra-Region Replication: 30TB/day × 2 replicas = 60TB/day
Cross-Region Replication: 30TB/day × 1 replica = 30TB/day
Total Replication: 90TB/day = 1GB/sec average
Peak Replication: 3GB/sec during peak hours
Replication Lag Target: <100ms intra-region, <1s cross-region

Memory Requirements

Per-Node Memory Allocation

Data Cache: 64GB for hot data (80% of RAM)
Index Cache: 8GB for key indexes and bloom filters
Connection Buffers: 4GB for client connections (10K × 400KB)
Write Buffer: 4GB for write-ahead log and memtable
Operating System: 4GB for OS and system processes
Total per Node: 84GB RAM (using 96GB machines)

Cluster Memory

Number of Nodes: 100 nodes for 100M ops/sec
Total Cluster Memory: 8.4TB RAM
Cache Hit Rate: 95% for read operations
Memory Efficiency: 90% effective utilization
Cache Eviction: LRU policy with TTL awareness

Memory Distribution

Hot Data (1 hour): 40GB per node (frequently accessed)
Warm Data (24 hours): 20GB per node (occasionally accessed)
Metadata: 4GB per node (indexes, bloom filters, stats)
Reserved: 16GB per node (headroom for spikes)

Compute Resources

CPU Requirements

Per-Node CPU: 32 cores (2 × 16-core processors)
CPU Utilization: 60% average, 80% peak
Read Operations: 0.1ms CPU time per read
Write Operations: 0.5ms CPU time per write
Compaction: 10% CPU reserved for background tasks
Replication: 5% CPU for replication processing

Cluster Compute

Total Nodes: 100 nodes for target throughput
Total CPU Cores: 3,200 cores
Compute Capacity: 1.16M ops/sec requires ~60% CPU
Headroom: 40% capacity for growth and spikes
Auto-Scaling: Add nodes when CPU >70% for 5 minutes

Disk I/O Requirements

Disk Specifications

Disk Type: NVMe SSD for low latency
Disk Capacity: 1TB per node (10TB usable per node with compression)
Read IOPS: 500K IOPS per disk
Write IOPS: 100K IOPS per disk
Sequential Read: 3GB/sec per disk
Sequential Write: 2GB/sec per disk

I/O Patterns

Read I/O: 70% cache hits, 30% disk reads = 30K IOPS per node
Write I/O: Sequential writes to WAL + compaction = 20K IOPS per node
Compaction I/O: Background compaction = 10K IOPS per node
Total I/O: 60K IOPS per node (well within SSD limits)
I/O Amplification: 3x for LSM-tree compaction

Write-Ahead Log (WAL)

WAL Size: 10GB per node (1 hour of writes)
WAL Rotation: Every 1GB or 10 minutes
WAL Sync: fsync every 100ms or 1000 writes
WAL Replay: <1 minute for node recovery
WAL Replication: Async replication to replicas

Replication and Consistency

Replication Configuration

Replication Factor: 3 (1 primary + 2 replicas)
Replication Strategy: Quorum-based (R + W > N)
Consistency Level: Configurable per operation
- Strong: R=2, W=2 (majority quorum)
- Eventual: R=1, W=1 (fastest)
- Read-Your-Writes: W=2, R=1 with session affinity
Replication Lag: <100ms for 99% of writes
Cross-Region Lag: <1 second for async replication

Data Distribution

Partitioning Strategy: Consistent hashing with virtual nodes
Virtual Nodes: 256 vnodes per physical node
Partition Count: 25,600 partitions (100 nodes × 256 vnodes)
Partition Size: 1.3GB per partition (33TB / 25,600)
Rebalancing: Automatic when nodes added/removed
Hot Partition Handling: Split hot partitions dynamically

Availability and Fault Tolerance

Failure Scenarios

Node Failure Rate: 1% of nodes fail per month
Expected Failures: 1 node failure per month in 100-node cluster
Failure Detection: <5 seconds using heartbeats
Failover Time: <30 seconds to promote replica
Recovery Time: <5 minutes to restore replication factor
Data Loss: Zero data loss with quorum writes

Disaster Recovery

Backup Frequency: Incremental every hour, full daily
Backup Retention: 30 days for point-in-time recovery
Backup Size: 33TB full backup, 1.4TB incremental per day
Restore Time: <4 hours for full cluster restore
Cross-Region Replication: Async replication to 2 regions
RPO: <5 minutes (time between backups)
RTO: <1 hour (time to failover to backup region)

Cost Estimation

Infrastructure Costs (Monthly)

Compute: 100 nodes × $500/month = $50,000
Storage: 33TB × $0.10/GB/month = $3,300
Network: 100TB/month × $0.05/GB = $5,000
Backup Storage: 100TB × $0.02/GB/month = $2,000
Total Infrastructure: $60,300/month

Operational Costs (Monthly)

Monitoring: $2,000/month
Support: $5,000/month
Personnel: 3 engineers × $15,000/month = $45,000
Total Operational: $52,000/month

Total Cost of Ownership

Monthly Total: $112,300
Annual Total: $1,347,600
Cost per Operation: $0.00000135 per operation
Cost per GB Stored: $3.40 per GB per month

This scale analysis provides the foundation for capacity planning, infrastructure provisioning, and cost optimization for a production-grade distributed key-value store.