Scaling Considerations

📖 7 min read 📄 Part 6 of 10

API Rate Limiter - Scaling Considerations

Horizontal Scaling Strategies

Stateless Rate Limiter Servers

Benefits:
- Easy to add/remove servers
- No session affinity required
- Simple load balancing
- Fast deployment and rollback

Implementation:
- All state in Redis/database
- Servers are interchangeable
- Auto-scaling based on CPU/memory
- Rolling deployments without downtime

Scaling Pattern:
Normal Load: 50 servers
Peak Load: 150 servers (3x)
Auto-scale triggers:
- CPU > 70% for 5 minutes → scale up
- CPU < 30% for 15 minutes → scale down

Redis Cluster Scaling

Single Redis Instance Limits:
- 100K operations/second
- 256 GB memory
- Single point of failure

Redis Cluster Solution:
- 20 master nodes
- 20 replica nodes (1 per master)
- 2M operations/second total
- 5 TB total memory
- Automatic failover

Sharding Strategy:
- Hash slot based (16384 slots)
- Consistent hashing
- Automatic rebalancing
- Minimal data movement

Scaling Process:
1. Add new master+replica pair
2. Rebalance hash slots
3. Migrate data gradually
4. Update client routing

Database Sharding

Vertical Sharding (by table):
- Rules table: Shard 0
- User quotas: Shards 1-10
- Audit logs: Shards 11-20
- Metrics: Separate time-series DB

Horizontal Sharding (by user_id):
Shard 0: user_id % 10 = 0
Shard 1: user_id % 10 = 1
...
Shard 9: user_id % 10 = 9

Benefits:
- Distribute load evenly
- Parallel query execution
- Independent scaling
- Fault isolation

Performance Optimization

Multi-Level Caching

L1 Cache (In-Memory):
- Location: Application server
- Size: 100 MB per server
- TTL: 60 seconds
- Hit Rate: 80%
- Latency: <1ms

L2 Cache (Redis):
- Location: Redis cluster
- Size: 130 GB per region
- TTL: 300 seconds
- Hit Rate: 95%
- Latency: <2ms

L3 Cache (Database Query Cache):
- Location: PostgreSQL
- Size: 50 GB
- TTL: 600 seconds
- Hit Rate: 70%
- Latency: <10ms

Cache Hierarchy:
Request → L1 → L2 → L3 → Database
         ↓     ↓     ↓
      80%   15%   4%   1%

Batch Processing

Counter Updates:
- Buffer updates locally (100ms)
- Batch write to Redis
- Reduce network calls by 10x
- Trade-off: Slight accuracy loss

Configuration Changes:
- Batch propagate every 5 seconds
- Reduce config sync overhead
- Eventual consistency acceptable

Metrics Collection:
- Aggregate locally (1 minute)
- Batch write to time-series DB
- Reduce write load by 60x

Connection Pooling

Redis Connection Pool:
- Min connections: 10 per server
- Max connections: 100 per server
- Idle timeout: 300 seconds
- Connection reuse: 1000 requests

Database Connection Pool:
- Min connections: 5 per server
- Max connections: 50 per server
- Connection lifetime: 3600 seconds
- Statement caching enabled

Benefits:
- Reduce connection overhead
- Better resource utilization
- Faster request processing
- Lower latency variance

Geographic Distribution

Multi-Region Deployment

Regions:
- US-East (Primary)
- US-West (Secondary)
- EU-West (Primary)
- Asia-Pacific (Primary)
- South America (Secondary)

Traffic Routing:
- GeoDNS for region selection
- <50ms latency within region
- <200ms cross-region
- Automatic failover

Data Replication:
- Configuration: Synchronous (strong consistency)
- Counters: Asynchronous (eventual consistency)
- Metrics: Asynchronous (best effort)

Cross-Region Synchronization

Configuration Sync:
- Method: Active-active replication
- Latency: <1 second
- Consistency: Strong (Raft consensus)
- Conflict Resolution: Last-write-wins with version vectors

Counter Sync:
- Method: Periodic aggregation
- Frequency: Every 10 seconds
- Consistency: Eventual
- Accuracy: 99% within 1 second

Implementation:
1. Local enforcement (fast)
2. Background sync to global store
3. Periodic reconciliation
4. Conflict detection and resolution

Load Balancing Strategies

Request Distribution

Layer 4 Load Balancing (TCP):
- Algorithm: Least connections
- Health checks: TCP port 6379 (Redis)
- Failover: <1 second
- Session persistence: Source IP hash

Layer 7 Load Balancing (HTTP):
- Algorithm: Round robin with weights
- Health checks: HTTP /health endpoint
- Sticky sessions: Not required (stateless)
- SSL termination: At load balancer

Geographic Load Balancing:
- DNS-based routing
- Latency-based selection
- Automatic failover to nearest region
- Traffic splitting for A/B testing

Auto-Scaling Configuration

# Kubernetes HPA (Horizontal Pod Autoscaler)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: rate-limiter-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: rate-limiter
  minReplicas: 50
  maxReplicas: 200
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  - type: Pods
    pods:
      metric:
        name: requests_per_second
      target:
        type: AverageValue
        averageValue: "20000"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60

Bottleneck Identification and Mitigation

Redis Bottlenecks

Problem: Redis CPU saturation at 100K ops/sec

Solutions:
1. Redis Cluster (20 nodes)
   - Throughput: 2M ops/sec
   - Cost: 20x infrastructure
   
2. Read Replicas (5 replicas per master)
   - Read throughput: 500K ops/sec
   - Write throughput: 100K ops/sec
   
3. Client-side caching
   - Reduce Redis load by 80%
   - Trade-off: Stale data up to 60s
   
4. Pipelining
   - Batch 100 commands
   - Reduce network RTT by 100x

Network Bottlenecks

Problem: 10 Gbps network saturation

Solutions:
1. Upgrade to 100 Gbps interfaces
   - Cost: $5000/month per server
   - Benefit: 10x capacity
   
2. Compression
   - Reduce payload size by 70%
   - CPU overhead: +5%
   
3. Protocol optimization
   - Binary protocol vs JSON
   - Reduce overhead by 50%
   
4. Regional caching
   - Serve 95% locally
   - Cross-region: 5%

Database Write Bottlenecks

Problem: 10K writes/sec limit per shard

Solutions:
1. Batch writes
   - Aggregate 100 writes
   - Reduce load by 100x
   
2. Async writes
   - Write to queue first
   - Process in background
   
3. More shards
   - 10 shards → 100K writes/sec
   - Linear scaling
   
4. Time-series database
   - Optimized for metrics
   - 1M writes/sec per node

Capacity Planning

Growth Projections

Current (Year 1):
- 10M requests/second
- 100M users
- 500 servers
- $142K/month

Year 2 (2x growth):
- 20M requests/second
- 200M users
- 1000 servers
- $284K/month

Year 3 (3x growth):
- 30M requests/second
- 300M users
- 1500 servers
- $426K/month

Scaling Strategy:
- Horizontal scaling (linear cost)
- Optimize algorithms (reduce cost/request)
- Regional expansion (reduce latency)

Resource Allocation

Per 1M requests/second:
- Servers: 50 (20K req/s each)
- Redis: 2 clusters (10 nodes each)
- Database: 5 shards
- Load Balancers: 5
- Cost: $14,200/month

Breakdown:
- Compute: 60% ($8,520)
- Storage: 15% ($2,130)
- Network: 20% ($2,840)
- Other: 5% ($710)

Monitoring and Alerting for Scale

Key Metrics to Monitor

Performance Metrics:
- Requests per second (by region, endpoint)
- Latency (P50, P95, P99)
- Error rate (4xx, 5xx)
- Cache hit rate

Resource Metrics:
- CPU utilization (per server)
- Memory usage (per server)
- Network bandwidth (in/out)
- Disk I/O (IOPS, throughput)

Business Metrics:
- Throttle rate (by user, tier)
- Quota utilization
- API key usage
- Revenue impact

Alerts:
- P99 latency > 10ms (warning)
- Error rate > 1% (critical)
- CPU > 80% (warning)
- Redis memory > 90% (critical)
- Throttle rate > 10% (warning)

Scaling Triggers

Scale Up Triggers:
- CPU > 70% for 5 minutes
- Memory > 80% for 5 minutes
- Request queue > 1000
- P99 latency > 20ms
- Redis ops > 80K/sec

Scale Down Triggers:
- CPU < 30% for 15 minutes
- Memory < 50% for 15 minutes
- Request queue < 100
- P99 latency < 5ms

Cooldown Periods:
- Scale up: 60 seconds
- Scale down: 300 seconds

Cost Optimization

Right-Sizing Resources

Before Optimization:
- 500 servers × $200/month = $100K
- Over-provisioned by 40%

After Optimization:
- 300 servers × $200/month = $60K
- Savings: $40K/month (40%)

Strategies:
- Use spot instances (70% savings)
- Reserved instances (40% savings)
- Auto-scaling (30% savings)
- Resource right-sizing (20% savings)

Caching ROI

Without Caching:
- Database queries: 10M/sec
- Database cost: $50K/month

With Redis Caching (95% hit rate):
- Database queries: 500K/sec
- Redis cost: $5K/month
- Database cost: $5K/month
- Total: $10K/month
- Savings: $40K/month (80%)

This comprehensive scaling strategy ensures the rate limiting system can handle massive growth while maintaining performance and controlling costs.