Scaling Considerations

📖 10 min read 📄 Part 6 of 10

Uber Backend - Scaling Considerations

Horizontal Scaling Strategies

1. Stateless Application Servers

Challenge: Scale API servers to handle 50K requests/second

Solution:

  • Deploy 5,000+ application server instances across regions
  • Use container orchestration (Kubernetes) for auto-scaling
  • Load balance with consistent hashing for cache affinity
  • Scale based on CPU (70% threshold) and request queue depth

Implementation:

Auto-scaling Rules:
- Scale up: CPU > 70% for 2 minutes OR queue depth > 100
- Scale down: CPU < 30% for 10 minutes AND queue depth < 10
- Min instances: 1000 per region
- Max instances: 10,000 per region
- Scale increment: 10% of current capacity

2. Database Sharding and Partitioning

Challenge: Handle 100K database writes/second

Solution:

  • Shard by city_id for geographic locality
  • 500 database shards globally
  • Each shard: 1 primary + 2 read replicas
  • Use connection pooling (PgBouncer) to manage connections

Sharding Strategy:

City-based Sharding:
- Top 50 cities: Dedicated shards (1 city per shard)
- Medium cities: 5-10 cities per shard
- Small cities: 50-100 cities per shard

Benefits:
- 95% of queries stay within single shard
- Easy to scale hot cities independently
- Geographic data locality

Challenges:
- Cross-city trips require distributed transactions
- Uneven load distribution
- Shard rebalancing complexity

3. Caching at Multiple Layers

Challenge: Reduce database load and improve response times

Solution:

L1 Cache (Application Memory):
- Driver locations: 5-second TTL
- Surge pricing: 60-second TTL
- Size: 2GB per application instance

L2 Cache (Redis Cluster):
- User profiles: 1-hour TTL
- Trip history: 5-minute TTL
- Payment methods: 10-minute TTL
- Size: 10TB across 200 Redis nodes

L3 Cache (CDN):
- Static assets: 24-hour TTL
- Map tiles: 7-day TTL
- Profile images: 1-day TTL
- 200+ edge locations globally

Geospatial Scaling

1. Driver Location Indexing

Challenge: Query 750K location updates/second and find nearby drivers

Solution:

  • Use Redis GEORADIUS for real-time queries
  • S2 Geometry library for spatial indexing
  • Partition by city/region for horizontal scaling
  • Update indices asynchronously via Kafka

Architecture:

GPS Update Flow:
Driver App → Load Balancer → Location API → Kafka Topic
                                                  ↓
                                          Stream Processor
                                                  ↓
                                    ┌─────────────┴─────────────┐
                                    ↓                           ↓
                            Redis Geospatial              Cassandra
                            (Real-time index)          (Historical data)

Query Performance:
- GEORADIUS query: <10ms for 50 drivers within 5km
- Index update: <5ms per location
- Throughput: 1M updates/second per Redis cluster

2. Geofencing and Zone Management

Challenge: Detect pickup/dropoff arrivals and manage surge zones

Solution:

  • Use H3 hexagonal grid system for zone management
  • Level 9 hexagons (~0.1 km² per cell)
  • Pre-compute zone boundaries and cache
  • Stream processing for real-time geofence detection

Implementation:

H3 Geospatial Index:
- City divided into ~10,000 hexagonal cells
- Each cell tracks: active requests, available drivers
- Update frequency: Every 1-2 minutes
- Storage: 10KB per cell × 10K cells = 100MB per city

Geofence Detection:
1. Driver sends location update
2. Convert lat/lng to H3 cell ID
3. Check if cell matches pickup/dropoff zone
4. Trigger arrival event if match
5. Processing time: <50ms per update

Real-time Matching Optimization

1. Matching Service Scaling

Challenge: Match 15K ride requests/second with optimal drivers

Solution:

  • Parallel matching across multiple matching service instances
  • Pre-filter driver pool using geospatial queries
  • Async matching with result caching
  • Circuit breaker for graceful degradation

Matching Pipeline:

Request → Load Balancer → Matching Service Instance
                                    ↓
                          1. Geospatial Query (Redis)
                             - Find drivers within 5km
                             - Filter by vehicle type
                             - Result: 50 candidate drivers
                                    ↓
                          2. Scoring & Ranking
                             - Calculate score for each driver
                             - Rank by composite score
                             - Select top 3 drivers
                                    ↓
                          3. Send Ride Offers
                             - Push notification to drivers
                             - Wait for acceptance (15s timeout)
                             - First to accept wins
                                    ↓
                          4. Fallback & Retry
                             - If no acceptance, expand radius
                             - Retry with new driver pool
                             - Max 3 retry attempts

Performance:
- Matching latency: <5 seconds (95th percentile)
- Success rate: 95% on first attempt
- Throughput: 15K matches/second peak

2. Driver Pool Management

Challenge: Maintain real-time view of available drivers

Solution:

  • Separate service for driver availability tracking
  • Heartbeat mechanism with 30-second intervals
  • Automatic status updates based on trip state
  • Distributed cache for driver status

Driver Status Management:

Status Transitions:
OFFLINE → AVAILABLE → MATCHED → ON_TRIP → AVAILABLE

Heartbeat System:
- Driver app sends heartbeat every 30 seconds
- Server marks driver offline after 60 seconds of no heartbeat
- Exponential backoff for reconnection attempts

Availability Index:
- Redis sorted set: ZADD available:drivers {timestamp} {driver_id}
- Remove stale entries: ZREMRANGEBYSCORE available:drivers 0 {60_seconds_ago}
- Query available drivers: ZRANGEBYSCORE available:drivers {now-60s} {now}

Payment Processing Scaling

1. Async Payment Processing

Challenge: Process 1,800 payment transactions/second

Solution:

  • Decouple payment processing from trip completion
  • Use message queue for async processing
  • Retry logic with exponential backoff
  • Idempotency keys to prevent duplicate charges

Payment Flow:

Trip CompletedKafka EventPayment Service1. Calculate Fare
                              2. Fraud Check
                              3. Authorize Payment
                              4. Capture FundsSuccessUpdate Trip
                              FailureRetry QueueMax 3 retries with backoff
                              Final failureManual review

2. Payment Gateway Redundancy

Challenge: Ensure payment availability despite gateway failures

Solution:

  • Multiple payment gateway integrations (Stripe, Braintree, Adyen)
  • Automatic failover on gateway errors
  • Route based on payment method and region
  • Circuit breaker pattern for failing gateways

Gateway Selection:

Primary Gateway: Stripe (70% of traffic)
Secondary Gateway: Braintree (20% of traffic)
Tertiary Gateway: Adyen (10% of traffic)

Failover Logic:
1. Try primary gateway
2. If error rate > 5%, circuit breaker opens
3. Route to secondary gateway
4. Monitor and close circuit after 5 minutes
5. Gradual traffic shift back to primary

Data Pipeline Scaling

1. Real-time Analytics Pipeline

Challenge: Process 10 billion events per day for analytics

Solution:

  • Kafka for event streaming (10M messages/second)
  • Apache Flink for stream processing
  • ClickHouse for real-time analytics queries
  • S3 for data lake storage

Pipeline Architecture:

Event Sources → Kafka Topics → Flink Jobs → ClickHouse
                                    ↓
                              S3 Data Lake
                                    ↓
                              Spark Batch Jobs
                                    ↓
                              Data Warehouse

Event Types:
- trip_events: 50M events/day
- location_updates: 6.5B events/day
- payment_events: 50M events/day
- user_actions: 500M events/day
- Total: 7B+ events/day

2. ML Model Serving

Challenge: Serve ML predictions at scale (surge pricing, fraud detection, ETA)

Solution:

  • Model serving infrastructure (TensorFlow Serving, SageMaker)
  • Feature store for real-time feature computation
  • Model versioning and A/B testing
  • Caching for frequently requested predictions

ML Serving Architecture:

Request → Feature Store → Model Server → Response
              ↓
        Feature Cache
        (Redis)

Models:
- Surge Pricing: 100K predictions/second
- Fraud Detection: 2K predictions/second
- ETA Calculation: 50K predictions/second
- Demand Forecasting: Batch processing every 5 minutes

Performance:
- Prediction latency: <50ms (p95)
- Feature computation: <20ms (p95)
- Cache hit rate: 80%

Network and CDN Optimization

1. Global Load Balancing

Challenge: Route users to nearest data center with lowest latency

Solution:

  • GeoDNS for geographic routing
  • Anycast IP addresses for automatic routing
  • Health checks and automatic failover
  • Traffic splitting for gradual rollouts

Routing Strategy:

User RequestGeoDNSNearest RegionRegional Load BalancerAvailability Zone LBApplication Servers

Latency Targets:
- Same region: <50ms
- Cross-region (same continent): <150ms
- Cross-continent: <300ms

2. CDN for Static Assets

Challenge: Serve map tiles and static assets globally

Solution:

  • CloudFront/Cloudflare CDN with 200+ PoPs
  • Aggressive caching for static content
  • Image optimization and compression
  • HTTP/2 and HTTP/3 for multiplexing

CDN Configuration:

Cache Rules:
- Map tiles: 7-day TTL, 95% hit rate
- Profile images: 1-day TTL, 85% hit rate
- App assets: 24-hour TTL, 90% hit rate
- API responses: No caching

Performance:
- CDN bandwidth: 5 GB/s peak
- Origin bandwidth: 250 MB/s (5% of total)
- Cost savings: 90% reduction in origin traffic

Database Optimization

1. Read Replica Scaling

Challenge: Handle 500K database reads/second

Solution:

  • 3x read replicas per primary database
  • Read/write splitting at application layer
  • Replica lag monitoring (<1 second acceptable)
  • Automatic failover on replica failure

Read Scaling Strategy:

Write Operations → Primary Database
Read Operations → Read Replicas (Round-robin)

Replica Configuration:
- Async replication with <1s lag
- Dedicated replicas for analytics queries
- Geographic replicas for cross-region reads
- Connection pooling: 1000 connections per replica

Load Distribution:
- Primary: 20% (writes + critical reads)
- Replica 1: 30% (general reads)
- Replica 2: 30% (general reads)
- Replica 3: 20% (analytics reads)

2. Query Optimization

Challenge: Maintain sub-100ms query response times

Solution:

  • Comprehensive indexing strategy
  • Query plan analysis and optimization
  • Materialized views for complex aggregations
  • Partitioning for large tables

Optimization Techniques:

Indexing:
- B-tree indices for equality/range queries
- GiST indices for geospatial queries
- Covering indices to avoid table lookups
- Partial indices for filtered queries

Partitioning:
- trips table: Partition by month (time-based)
- transactions table: Partition by month
- location_history: Partition by week
- Benefits: Faster queries, easier archival

Materialized Views:
- driver_earnings_daily: Refresh every hour
- city_metrics_hourly: Refresh every 15 minutes
- surge_zone_stats: Refresh every 2 minutes

Monitoring and Observability

1. Metrics Collection

Challenge: Monitor health of 2,000+ microservices

Solution:

  • Prometheus for metrics collection
  • Grafana for visualization
  • Custom dashboards per service
  • Automated alerting on SLA violations

Key Metrics:

Application Metrics:
- Request rate, latency, error rate (RED metrics)
- Database query performance
- Cache hit rates
- Queue depths and processing lag

Business Metrics:
- Trips per second
- Matching success rate
- Payment success rate
- Average wait time
- Driver utilization

Infrastructure Metrics:
- CPU, memory, disk usage
- Network throughput
- Database connections
- Cache memory usage

2. Distributed Tracing

Challenge: Debug issues across microservices

Solution:

  • Jaeger for distributed tracing
  • Trace sampling (1% of requests)
  • Correlation IDs across all services
  • Performance bottleneck identification

Tracing Implementation:

Request Flow Tracing:
Rider App → API Gateway → Matching Service → Location Service
                                                    ↓
                                            Payment Service
                                                    ↓
                                            Notification Service

Trace Data:
- Service name and operation
- Start time and duration
- Parent/child span relationships
- Tags: user_id, trip_id, city_id
- Logs: Important events and errors

Performance Analysis:
- Identify slow services
- Detect cascading failures
- Optimize critical paths
- Capacity planning

Disaster Recovery and Failover

1. Multi-Region Failover

Challenge: Maintain service during regional outages

Solution:

  • Active-active deployment in multiple regions
  • Cross-region data replication
  • Automatic traffic shifting on failures
  • Regular disaster recovery drills

Failover Strategy:

Normal Operation:
- US-West: 40% traffic
- US-East: 30% traffic
- Europe: 20% traffic
- Asia-Pacific: 10% traffic

Region Failure:
1. Health checks detect failure
2. GeoDNS removes failed region from rotation
3. Traffic redistributed to healthy regions
4. Capacity auto-scales to handle increased load
5. Failover time: <2 minutes

Data Consistency:
- Critical data: Sync replication (payments)
- User data: Async replication (5-minute lag)
- Analytics data: Async replication (1-hour lag)

2. Graceful Degradation

Challenge: Maintain core functionality during partial outages

Solution:

  • Feature flags for non-critical features
  • Circuit breakers for failing dependencies
  • Fallback mechanisms for external services
  • Priority queuing for critical operations

Degradation Levels:

Level 1 (Normal): All features available
Level 2 (Minor): Disable non-critical features
  - Ride scheduling
  - Promotions
  - Advanced analytics
Level 3 (Major): Core features only
  - Ride matching
  - Basic tracking
  - Payment processing
Level 4 (Critical): Emergency mode
  - Manual dispatch
  - Cash payments only
  - SMS-based communication

This comprehensive scaling strategy enables Uber to handle millions of concurrent rides globally while maintaining sub-second response times and high reliability through horizontal scaling, geographic distribution, and intelligent caching.