Scaling Considerations

📖 9 min read 📄 Part 6 of 10

Key-Value Store - Scaling Considerations

Horizontal Scaling Strategy

Adding Nodes to Cluster

Initial Cluster (3 nodes):
Node A: tokens [0, 3, 6, 9, ...]     → 33% data
Node B: tokens [1, 4, 7, 10, ...]    → 33% data
Node C: tokens [2, 5, 8, 11, ...]    → 33% data

After Adding Node D:
Node A: tokens [0, 4, 8, 12, ...]    → 25% data
Node B: tokens [1, 5, 9, 13, ...]    → 25% data
Node C: tokens [2, 6, 10, 14, ...]   → 25% data
Node D: tokens [3, 7, 11, 15, ...]   → 25% data (streamed from A, B, C)

Node Addition Process

  1. Join Cluster: New node contacts seed nodes
  2. Token Assignment: Coordinator assigns virtual nodes
  3. Data Streaming: Stream data for assigned tokens from existing nodes
  4. Parallel Streaming: Stream from multiple nodes simultaneously
  5. Incremental Handoff: Gradually take over traffic
  6. Completion: Mark node as NORMAL, update routing tables

Scaling Metrics

  • Time to Add Node: 30-60 minutes for 1TB data transfer
  • Network Impact: 100MB/s streaming bandwidth per node
  • Performance Impact: <5% latency increase during streaming
  • Availability: No downtime, cluster remains available
  • Rebalancing: Automatic rebalancing of data distribution

Read Scaling

Read Path Optimization

  • Cache Layers: Multi-level caching (L1: in-process, L2: distributed)
  • Read Replicas: Route reads to nearest replica
  • Bloom Filters: Reduce unnecessary disk reads by 90%+
  • Compression: Reduce I/O with Snappy/LZ4 compression
  • Batch Reads: Combine multiple reads into single request
  • Connection Pooling: Reuse connections to reduce overhead

Read Scaling Strategies

Read Scaling Approaches:

1. Replica Scaling:
   - Add more replicas (N=3 → N=5)
   - Distribute read load across replicas
   - Trade-off: More storage, better read throughput

2. Caching:
   - In-memory cache for hot keys
   - 95% cache hit rate → 20x read throughput
   - Trade-off: Memory cost, cache invalidation complexity

3. Read-Only Replicas:
   - Dedicated replicas for read-only workloads
   - No write overhead, optimized for reads
   - Trade-off: Eventual consistency, replication lag

4. Geographic Distribution:
   - Replicas in multiple regions
   - Route reads to nearest replica
   - Trade-off: Cross-region replication cost

Read Bottlenecks and Solutions

Bottleneck Impact Solution
Disk I/O High latency Add SSDs, increase cache
Network High latency Add replicas, use compression
CPU High CPU usage Optimize serialization, add nodes
Hot Keys Uneven load Replicate hot keys, use cache
Large Values Slow reads Compress values, use chunking

Write Scaling

Write Path Optimization

  • Batch Writes: Group multiple writes into single batch
  • Async Replication: Replicate asynchronously to non-quorum nodes
  • Write-Ahead Log: Sequential writes for durability
  • MemTable Buffering: Buffer writes in memory before flushing
  • Compaction Throttling: Limit compaction impact on writes
  • Parallel Writes: Write to multiple partitions concurrently

Write Scaling Strategies

Write Scaling Approaches:

1. Partition Scaling:
   - Increase partition count
   - Distribute writes across more partitions
   - Trade-off: More metadata overhead

2. Quorum Tuning:
   - Lower write quorum (W=1 instead of W=2)
   - Faster writes, lower consistency
   - Trade-off: Consistency vs performance

3. Async Replication:
   - Replicate to N-1 nodes asynchronously
   - Acknowledge after W nodes
   - Trade-off: Replication lag, potential data loss

4. Write Batching:
   - Batch multiple writes into single request
   - Amortize network and serialization overhead
   - Trade-off: Increased latency for individual writes

Write Bottlenecks and Solutions

Bottleneck Impact Solution
Disk I/O Write latency Use NVMe SSDs, increase MemTable size
Replication Write latency Async replication, lower quorum
Compaction Write stalls Throttle compaction, add nodes
Network Replication lag Increase bandwidth, compress data
Contention Lock contention Partition data, use lock-free structures

Storage Scaling

Storage Capacity Planning

Storage Growth Projection:

Year 1:
- Initial: 10TB
- Growth: 100GB/day × 365 = 36.5TB
- Total: 46.5TB

Year 2:
- Initial: 46.5TB
- Growth: 150GB/day × 365 = 54.75TB
- Total: 101.25TB

Year 3:
- Initial: 101.25TB
- Growth: 200GB/day × 365 = 73TB
- Total: 174.25TB

With 3x replication: 174.25TB × 3 = 522.75TB
With 10% overhead: 522.75TB × 1.1 = 575TB

Storage Tiering

Hot Tier (NVMe SSD):
- Data: Last 30 days
- Size: 10TB
- Cost: $1,000/TB = $10,000
- Latency: <1ms

Warm Tier (SATA SSD):
- Data: 30-365 days
- Size: 50TB
- Cost: $200/TB = $10,000
- Latency: <5ms

Cold Tier (Object Storage):
- Data: >365 days
- Size: 500TB
- Cost: $20/TB = $10,000
- Latency: <100ms

Total Cost: $30,000 (vs $575,000 for all NVMe)

Compaction Scaling

  • Parallel Compaction: Run multiple compactions concurrently
  • Incremental Compaction: Compact small portions at a time
  • Throttled Compaction: Limit I/O impact during peak hours
  • Scheduled Compaction: Run major compactions during off-peak
  • Selective Compaction: Prioritize hot partitions

Network Scaling

Network Bandwidth Requirements

Network Traffic Breakdown:

Client Traffic:
- Inbound: 428MB/s (read requests + write data)
- Outbound: 845MB/s (read responses + write acks)
- Total: 1.27GB/s

Replication Traffic:
- Intra-Region: 694MB/s (2x replication)
- Cross-Region: 347MB/s (1x replication)
- Total: 1.04GB/s

Total Bandwidth: 2.31GB/s = 18.5Gbps
Per-Node (100 nodes): 185Mbps

Recommendation: 10Gbps NICs per node

Network Optimization

  • Compression: Reduce bandwidth by 3-5x with Snappy/LZ4
  • Batching: Combine multiple operations to reduce overhead
  • Connection Pooling: Reuse connections to reduce handshake overhead
  • Multiplexing: Send multiple requests over single connection
  • Regional Routing: Route traffic to nearest datacenter
  • CDN Integration: Cache static data at edge locations

Hot Key Handling

Hot Key Detection

# Pseudo-code for hot key detection
class HotKeyDetector:
    def __init__(self):
        self.access_counts = {}  # key → count
        self.window_size = 60    # 60 seconds
        self.threshold = 10000   # 10K requests/minute
    
    def record_access(self, key):
        self.access_counts[key] = self.access_counts.get(key, 0) + 1
        
        if self.access_counts[key] > self.threshold:
            self.mark_as_hot(key)
    
    def mark_as_hot(self, key):
        # Replicate to more nodes
        # Add to cache
        # Alert monitoring system

Hot Key Mitigation Strategies

  1. Replication: Replicate hot keys to more nodes (N=3 → N=10)
  2. Caching: Cache hot keys in memory on all nodes
  3. Sharding: Split hot key into multiple sub-keys
  4. Rate Limiting: Limit requests per client for hot keys
  5. Load Balancing: Distribute hot key reads across replicas

Hot Partition Handling

Hot Partition Split:

Before:
Partition 1: [key1, key2, ..., hot_key, ..., keyN]100K requests/sec

After:
Partition 1a: [key1, key2, ..., keyM]40K requests/sec
              
Partition 1b: [hot_key]  (replicated to 10 nodes)50K requests/sec
              
Partition 1c: [keyM+1, ..., keyN]10K requests/sec

Geographic Distribution

Multi-Region Architecture

Region: US-East (Primary)
- 50 nodes
- Handles: 60% of traffic
- Latency: <2ms local

Region: US-West (Secondary)
- 30 nodes
- Handles: 25% of traffic
- Latency: <5ms local, <50ms cross-region

Region: EU-West (Secondary)
- 20 nodes
- Handles: 15% of traffic
- Latency: <5ms local, <100ms cross-region

Replication:
- Async replication between regions
- Lag: <1 second
- Conflict resolution: Last-write-wins with vector clocks

Cross-Region Replication

  • Async Replication: Replicate writes asynchronously to other regions
  • Conflict Resolution: Use vector clocks or timestamps
  • Selective Replication: Replicate only critical data globally
  • Regional Failover: Automatic failover to nearest region
  • Consistency: Eventual consistency across regions

Auto-Scaling

Auto-Scaling Triggers

Scale-Up Triggers:
  - CPU usage > 70% for 5 minutes
  - Memory usage > 80% for 5 minutes
  - Disk usage > 85%
  - Request latency P99 > 20ms for 10 minutes
  - Queue depth > 10,000 for 5 minutes

Scale-Down Triggers:
  - CPU usage < 30% for 30 minutes
  - Memory usage < 50% for 30 minutes
  - Request latency P99 < 5ms for 30 minutes
  - Queue depth < 1,000 for 30 minutes

Constraints:
  - Min nodes: 3 (for replication)
  - Max nodes: 1000
  - Scale-up: Add 10% of current nodes
  - Scale-down: Remove 5% of current nodes
  - Cooldown: 15 minutes between scaling actions

Auto-Scaling Process

  1. Monitor Metrics: Collect CPU, memory, latency, throughput
  2. Evaluate Triggers: Check if scaling triggers are met
  3. Calculate Capacity: Determine number of nodes to add/remove
  4. Provision Nodes: Launch new instances or terminate existing
  5. Join Cluster: New nodes join cluster and receive data
  6. Rebalance: Redistribute data across new topology
  7. Verify: Confirm cluster health and performance

Performance Optimization

Latency Optimization

  • Reduce Network Hops: Co-locate clients and servers
  • Optimize Serialization: Use Protocol Buffers or MessagePack
  • Connection Pooling: Reuse connections to avoid handshake
  • Batch Operations: Combine multiple operations
  • Async I/O: Use non-blocking I/O for concurrency
  • Zero-Copy: Avoid unnecessary data copying

Throughput Optimization

  • Parallel Processing: Process multiple requests concurrently
  • Batch Processing: Process multiple operations together
  • Pipeline Requests: Send multiple requests without waiting
  • Compression: Reduce data transfer size
  • Efficient Serialization: Use binary protocols
  • Connection Multiplexing: Multiple requests per connection

Resource Optimization

  • Memory Management: Efficient memory allocation and pooling
  • CPU Optimization: Optimize hot code paths
  • Disk I/O: Sequential writes, efficient compaction
  • Network: Compression, batching, connection pooling
  • Garbage Collection: Tune GC for low latency

Monitoring and Alerting

Key Metrics to Monitor

Performance Metrics:
- Latency: P50, P95, P99, P99.9 for reads and writes
- Throughput: Operations per second
- Error Rate: Failed operations per second
- Cache Hit Rate: Percentage of reads from cache

Resource Metrics:
- CPU Usage: Per node and cluster average
- Memory Usage: Per node and cluster average
- Disk Usage: Per node and cluster total
- Network Bandwidth: Inbound and outbound

Cluster Metrics:
- Node Count: Total, healthy, unhealthy
- Replication Lag: Time delay for replication
- Compaction Queue: Pending compaction tasks
- Repair Progress: Data repair status

Business Metrics:
- Active Keys: Total number of keys
- Data Size: Total data size across cluster
- Request Rate: Requests per second
- Cost: Infrastructure cost per operation

Alerting Thresholds

Critical Alerts:
  - Node down for > 5 minutes
  - Cluster availability < 99%
  - P99 latency > 100ms for 10 minutes
  - Error rate > 1% for 5 minutes
  - Disk usage > 90%

Warning Alerts:
  - CPU usage > 80% for 10 minutes
  - Memory usage > 85% for 10 minutes
  - Replication lag > 5 seconds
  - Cache hit rate < 90%
  - Compaction queue > 100 tasks

This comprehensive scaling guide provides strategies and best practices for scaling a distributed key-value store from initial deployment to handling billions of operations per day.