Scale & Constraints

📖 7 min read 📄 Part 2 of 10

Key-Value Store - Scale and Constraints

Traffic Estimation

Request Volume

  • Total Requests: 100 billion operations per day
  • Read Operations: 70 billion reads per day (70% of traffic)
  • Write Operations: 30 billion writes per day (30% of traffic)
  • Peak Traffic: 3x average during peak hours
  • Operations per Second: ~1.16M ops/sec average, 3.5M ops/sec peak
  • Batch Operations: 20% of requests are batch operations

User and Client Metrics

  • Active Clients: 100,000 application servers
  • Concurrent Connections: 500,000 active connections
  • Requests per Client: 1,000 ops/sec per client average
  • Connection Lifetime: 1 hour average connection duration
  • Client Distribution: 60% same region, 30% cross-region, 10% global
  • Client Types: 40% web servers, 40% microservices, 20% batch jobs

Storage Capacity Planning

Data Volume Estimation

  • Total Keys: 10 billion unique keys
  • Average Key Size: 50 bytes (UUID or composite key)
  • Average Value Size: 1KB (mix of small and large values)
  • Total Raw Data: 10TB (10 billion × 1KB)
  • Replication Factor: 3x for high availability
  • Total Storage with Replication: 30TB
  • Metadata Overhead: 10% (3TB for indexes, versions, tombstones)
  • Total Storage Required: 33TB

Storage Growth

  • Daily New Keys: 100 million new keys per day
  • Daily Data Growth: 100GB raw data per day
  • Monthly Growth: 3TB per month
  • Annual Growth: 36TB per year
  • Storage Retention: 2 years for active data
  • Archive Storage: 5 years for compliance
  • Projected 3-Year Storage: 150TB

Value Size Distribution

  • Tiny (<100 bytes): 30% of keys (user sessions, flags)
  • Small (100B-1KB): 40% of keys (user profiles, configs)
  • Medium (1KB-10KB): 20% of keys (cached API responses)
  • Large (10KB-100KB): 8% of keys (documents, images metadata)
  • Very Large (100KB-1MB): 2% of keys (serialized objects)
  • Maximum Value Size: 1MB hard limit

Performance Requirements

Latency Targets

  • P50 Read Latency: <500μs (in-memory cache hit)
  • P95 Read Latency: <1ms (local SSD read)
  • P99 Read Latency: <2ms (includes network overhead)
  • P99.9 Read Latency: <10ms (cross-AZ read)
  • P50 Write Latency: <2ms (synchronous replication)
  • P95 Write Latency: <5ms (quorum write)
  • P99 Write Latency: <10ms (includes fsync)
  • P99.9 Write Latency: <50ms (slow disk or network)

Throughput Requirements

  • Per-Node Read Throughput: 100,000 reads/sec
  • Per-Node Write Throughput: 50,000 writes/sec
  • Cluster Read Throughput: 10M+ reads/sec
  • Cluster Write Throughput: 5M+ writes/sec
  • Batch Operation Throughput: 1M+ ops/sec
  • Scan Throughput: 100K keys/sec per node

Concurrent Operations

  • Concurrent Reads: 100,000 simultaneous read operations
  • Concurrent Writes: 50,000 simultaneous write operations
  • Concurrent Connections: 10,000 per node, 500K cluster-wide
  • Concurrent Transactions: 10,000 active transactions
  • Concurrent Scans: 1,000 active scan operations
  • Connection Pool Size: 100 connections per client

Network Bandwidth

Inbound Traffic

  • Read Requests: 70B reads/day × 100 bytes = 7TB/day
  • Write Requests: 30B writes/day × 1KB = 30TB/day
  • Total Inbound: 37TB/day = 428MB/sec average
  • Peak Inbound: 1.3GB/sec during peak hours
  • Per-Node Inbound: 4.3MB/sec average per 100 nodes

Outbound Traffic

  • Read Responses: 70B reads/day × 1KB = 70TB/day
  • Write Acknowledgments: 30B writes/day × 100 bytes = 3TB/day
  • Total Outbound: 73TB/day = 845MB/sec average
  • Peak Outbound: 2.5GB/sec during peak hours
  • Per-Node Outbound: 8.5MB/sec average per 100 nodes

Replication Traffic

  • Intra-Region Replication: 30TB/day × 2 replicas = 60TB/day
  • Cross-Region Replication: 30TB/day × 1 replica = 30TB/day
  • Total Replication: 90TB/day = 1GB/sec average
  • Peak Replication: 3GB/sec during peak hours
  • Replication Lag Target: <100ms intra-region, <1s cross-region

Memory Requirements

Per-Node Memory Allocation

  • Data Cache: 64GB for hot data (80% of RAM)
  • Index Cache: 8GB for key indexes and bloom filters
  • Connection Buffers: 4GB for client connections (10K × 400KB)
  • Write Buffer: 4GB for write-ahead log and memtable
  • Operating System: 4GB for OS and system processes
  • Total per Node: 84GB RAM (using 96GB machines)

Cluster Memory

  • Number of Nodes: 100 nodes for 100M ops/sec
  • Total Cluster Memory: 8.4TB RAM
  • Cache Hit Rate: 95% for read operations
  • Memory Efficiency: 90% effective utilization
  • Cache Eviction: LRU policy with TTL awareness

Memory Distribution

  • Hot Data (1 hour): 40GB per node (frequently accessed)
  • Warm Data (24 hours): 20GB per node (occasionally accessed)
  • Metadata: 4GB per node (indexes, bloom filters, stats)
  • Reserved: 16GB per node (headroom for spikes)

Compute Resources

CPU Requirements

  • Per-Node CPU: 32 cores (2 × 16-core processors)
  • CPU Utilization: 60% average, 80% peak
  • Read Operations: 0.1ms CPU time per read
  • Write Operations: 0.5ms CPU time per write
  • Compaction: 10% CPU reserved for background tasks
  • Replication: 5% CPU for replication processing

Cluster Compute

  • Total Nodes: 100 nodes for target throughput
  • Total CPU Cores: 3,200 cores
  • Compute Capacity: 1.16M ops/sec requires ~60% CPU
  • Headroom: 40% capacity for growth and spikes
  • Auto-Scaling: Add nodes when CPU >70% for 5 minutes

Disk I/O Requirements

Disk Specifications

  • Disk Type: NVMe SSD for low latency
  • Disk Capacity: 1TB per node (10TB usable per node with compression)
  • Read IOPS: 500K IOPS per disk
  • Write IOPS: 100K IOPS per disk
  • Sequential Read: 3GB/sec per disk
  • Sequential Write: 2GB/sec per disk

I/O Patterns

  • Read I/O: 70% cache hits, 30% disk reads = 30K IOPS per node
  • Write I/O: Sequential writes to WAL + compaction = 20K IOPS per node
  • Compaction I/O: Background compaction = 10K IOPS per node
  • Total I/O: 60K IOPS per node (well within SSD limits)
  • I/O Amplification: 3x for LSM-tree compaction

Write-Ahead Log (WAL)

  • WAL Size: 10GB per node (1 hour of writes)
  • WAL Rotation: Every 1GB or 10 minutes
  • WAL Sync: fsync every 100ms or 1000 writes
  • WAL Replay: <1 minute for node recovery
  • WAL Replication: Async replication to replicas

Replication and Consistency

Replication Configuration

  • Replication Factor: 3 (1 primary + 2 replicas)
  • Replication Strategy: Quorum-based (R + W > N)
  • Consistency Level: Configurable per operation
    • Strong: R=2, W=2 (majority quorum)
    • Eventual: R=1, W=1 (fastest)
    • Read-Your-Writes: W=2, R=1 with session affinity
  • Replication Lag: <100ms for 99% of writes
  • Cross-Region Lag: <1 second for async replication

Data Distribution

  • Partitioning Strategy: Consistent hashing with virtual nodes
  • Virtual Nodes: 256 vnodes per physical node
  • Partition Count: 25,600 partitions (100 nodes × 256 vnodes)
  • Partition Size: 1.3GB per partition (33TB / 25,600)
  • Rebalancing: Automatic when nodes added/removed
  • Hot Partition Handling: Split hot partitions dynamically

Availability and Fault Tolerance

Failure Scenarios

  • Node Failure Rate: 1% of nodes fail per month
  • Expected Failures: 1 node failure per month in 100-node cluster
  • Failure Detection: <5 seconds using heartbeats
  • Failover Time: <30 seconds to promote replica
  • Recovery Time: <5 minutes to restore replication factor
  • Data Loss: Zero data loss with quorum writes

Disaster Recovery

  • Backup Frequency: Incremental every hour, full daily
  • Backup Retention: 30 days for point-in-time recovery
  • Backup Size: 33TB full backup, 1.4TB incremental per day
  • Restore Time: <4 hours for full cluster restore
  • Cross-Region Replication: Async replication to 2 regions
  • RPO: <5 minutes (time between backups)
  • RTO: <1 hour (time to failover to backup region)

Cost Estimation

Infrastructure Costs (Monthly)

  • Compute: 100 nodes × $500/month = $50,000
  • Storage: 33TB × $0.10/GB/month = $3,300
  • Network: 100TB/month × $0.05/GB = $5,000
  • Backup Storage: 100TB × $0.02/GB/month = $2,000
  • Total Infrastructure: $60,300/month

Operational Costs (Monthly)

  • Monitoring: $2,000/month
  • Support: $5,000/month
  • Personnel: 3 engineers × $15,000/month = $45,000
  • Total Operational: $52,000/month

Total Cost of Ownership

  • Monthly Total: $112,300
  • Annual Total: $1,347,600
  • Cost per Operation: $0.00000135 per operation
  • Cost per GB Stored: $3.40 per GB per month

This scale analysis provides the foundation for capacity planning, infrastructure provisioning, and cost optimization for a production-grade distributed key-value store.