Uber Backend - Scaling Considerations
Horizontal Scaling Strategies
1. Stateless Application Servers
Challenge: Scale API servers to handle 50K requests/second
Solution:
- Deploy 5,000+ application server instances across regions
- Use container orchestration (Kubernetes) for auto-scaling
- Load balance with consistent hashing for cache affinity
- Scale based on CPU (70% threshold) and request queue depth
Implementation:
Auto-scaling Rules:
- Scale up: CPU > 70% for 2 minutes OR queue depth > 100
- Scale down: CPU < 30% for 10 minutes AND queue depth < 10
- Min instances: 1000 per region
- Max instances: 10,000 per region
- Scale increment: 10% of current capacity2. Database Sharding and Partitioning
Challenge: Handle 100K database writes/second
Solution:
- Shard by city_id for geographic locality
- 500 database shards globally
- Each shard: 1 primary + 2 read replicas
- Use connection pooling (PgBouncer) to manage connections
Sharding Strategy:
City-based Sharding:
- Top 50 cities: Dedicated shards (1 city per shard)
- Medium cities: 5-10 cities per shard
- Small cities: 50-100 cities per shard
Benefits:
- 95% of queries stay within single shard
- Easy to scale hot cities independently
- Geographic data locality
Challenges:
- Cross-city trips require distributed transactions
- Uneven load distribution
- Shard rebalancing complexity3. Caching at Multiple Layers
Challenge: Reduce database load and improve response times
Solution:
L1 Cache (Application Memory):
- Driver locations: 5-second TTL
- Surge pricing: 60-second TTL
- Size: 2GB per application instance
L2 Cache (Redis Cluster):
- User profiles: 1-hour TTL
- Trip history: 5-minute TTL
- Payment methods: 10-minute TTL
- Size: 10TB across 200 Redis nodes
L3 Cache (CDN):
- Static assets: 24-hour TTL
- Map tiles: 7-day TTL
- Profile images: 1-day TTL
- 200+ edge locations globallyGeospatial Scaling
1. Driver Location Indexing
Challenge: Query 750K location updates/second and find nearby drivers
Solution:
- Use Redis GEORADIUS for real-time queries
- S2 Geometry library for spatial indexing
- Partition by city/region for horizontal scaling
- Update indices asynchronously via Kafka
Architecture:
GPS Update Flow:
Driver App → Load Balancer → Location API → Kafka Topic
↓
Stream Processor
↓
┌─────────────┴─────────────┐
↓ ↓
Redis Geospatial Cassandra
(Real-time index) (Historical data)
Query Performance:
- GEORADIUS query: <10ms for 50 drivers within 5km
- Index update: <5ms per location
- Throughput: 1M updates/second per Redis cluster2. Geofencing and Zone Management
Challenge: Detect pickup/dropoff arrivals and manage surge zones
Solution:
- Use H3 hexagonal grid system for zone management
- Level 9 hexagons (~0.1 km² per cell)
- Pre-compute zone boundaries and cache
- Stream processing for real-time geofence detection
Implementation:
H3 Geospatial Index:
- City divided into ~10,000 hexagonal cells
- Each cell tracks: active requests, available drivers
- Update frequency: Every 1-2 minutes
- Storage: 10KB per cell × 10K cells = 100MB per city
Geofence Detection:
1. Driver sends location update
2. Convert lat/lng to H3 cell ID
3. Check if cell matches pickup/dropoff zone
4. Trigger arrival event if match
5. Processing time: <50ms per updateReal-time Matching Optimization
1. Matching Service Scaling
Challenge: Match 15K ride requests/second with optimal drivers
Solution:
- Parallel matching across multiple matching service instances
- Pre-filter driver pool using geospatial queries
- Async matching with result caching
- Circuit breaker for graceful degradation
Matching Pipeline:
Request → Load Balancer → Matching Service Instance
↓
1. Geospatial Query (Redis)
- Find drivers within 5km
- Filter by vehicle type
- Result: 50 candidate drivers
↓
2. Scoring & Ranking
- Calculate score for each driver
- Rank by composite score
- Select top 3 drivers
↓
3. Send Ride Offers
- Push notification to drivers
- Wait for acceptance (15s timeout)
- First to accept wins
↓
4. Fallback & Retry
- If no acceptance, expand radius
- Retry with new driver pool
- Max 3 retry attempts
Performance:
- Matching latency: <5 seconds (95th percentile)
- Success rate: 95% on first attempt
- Throughput: 15K matches/second peak2. Driver Pool Management
Challenge: Maintain real-time view of available drivers
Solution:
- Separate service for driver availability tracking
- Heartbeat mechanism with 30-second intervals
- Automatic status updates based on trip state
- Distributed cache for driver status
Driver Status Management:
Status Transitions:
OFFLINE → AVAILABLE → MATCHED → ON_TRIP → AVAILABLE
Heartbeat System:
- Driver app sends heartbeat every 30 seconds
- Server marks driver offline after 60 seconds of no heartbeat
- Exponential backoff for reconnection attempts
Availability Index:
- Redis sorted set: ZADD available:drivers {timestamp} {driver_id}
- Remove stale entries: ZREMRANGEBYSCORE available:drivers 0 {60_seconds_ago}
- Query available drivers: ZRANGEBYSCORE available:drivers {now-60s} {now}Payment Processing Scaling
1. Async Payment Processing
Challenge: Process 1,800 payment transactions/second
Solution:
- Decouple payment processing from trip completion
- Use message queue for async processing
- Retry logic with exponential backoff
- Idempotency keys to prevent duplicate charges
Payment Flow:
Trip Completed → Kafka Event → Payment Service
↓
1. Calculate Fare
2. Fraud Check
3. Authorize Payment
4. Capture Funds
↓
Success → Update Trip
Failure → Retry Queue
↓
Max 3 retries with backoff
Final failure → Manual review2. Payment Gateway Redundancy
Challenge: Ensure payment availability despite gateway failures
Solution:
- Multiple payment gateway integrations (Stripe, Braintree, Adyen)
- Automatic failover on gateway errors
- Route based on payment method and region
- Circuit breaker pattern for failing gateways
Gateway Selection:
Primary Gateway: Stripe (70% of traffic)
Secondary Gateway: Braintree (20% of traffic)
Tertiary Gateway: Adyen (10% of traffic)
Failover Logic:
1. Try primary gateway
2. If error rate > 5%, circuit breaker opens
3. Route to secondary gateway
4. Monitor and close circuit after 5 minutes
5. Gradual traffic shift back to primaryData Pipeline Scaling
1. Real-time Analytics Pipeline
Challenge: Process 10 billion events per day for analytics
Solution:
- Kafka for event streaming (10M messages/second)
- Apache Flink for stream processing
- ClickHouse for real-time analytics queries
- S3 for data lake storage
Pipeline Architecture:
Event Sources → Kafka Topics → Flink Jobs → ClickHouse
↓
S3 Data Lake
↓
Spark Batch Jobs
↓
Data Warehouse
Event Types:
- trip_events: 50M events/day
- location_updates: 6.5B events/day
- payment_events: 50M events/day
- user_actions: 500M events/day
- Total: 7B+ events/day2. ML Model Serving
Challenge: Serve ML predictions at scale (surge pricing, fraud detection, ETA)
Solution:
- Model serving infrastructure (TensorFlow Serving, SageMaker)
- Feature store for real-time feature computation
- Model versioning and A/B testing
- Caching for frequently requested predictions
ML Serving Architecture:
Request → Feature Store → Model Server → Response
↓
Feature Cache
(Redis)
Models:
- Surge Pricing: 100K predictions/second
- Fraud Detection: 2K predictions/second
- ETA Calculation: 50K predictions/second
- Demand Forecasting: Batch processing every 5 minutes
Performance:
- Prediction latency: <50ms (p95)
- Feature computation: <20ms (p95)
- Cache hit rate: 80%Network and CDN Optimization
1. Global Load Balancing
Challenge: Route users to nearest data center with lowest latency
Solution:
- GeoDNS for geographic routing
- Anycast IP addresses for automatic routing
- Health checks and automatic failover
- Traffic splitting for gradual rollouts
Routing Strategy:
User Request → GeoDNS → Nearest Region
↓
Regional Load Balancer
↓
Availability Zone LB
↓
Application Servers
Latency Targets:
- Same region: <50ms
- Cross-region (same continent): <150ms
- Cross-continent: <300ms2. CDN for Static Assets
Challenge: Serve map tiles and static assets globally
Solution:
- CloudFront/Cloudflare CDN with 200+ PoPs
- Aggressive caching for static content
- Image optimization and compression
- HTTP/2 and HTTP/3 for multiplexing
CDN Configuration:
Cache Rules:
- Map tiles: 7-day TTL, 95% hit rate
- Profile images: 1-day TTL, 85% hit rate
- App assets: 24-hour TTL, 90% hit rate
- API responses: No caching
Performance:
- CDN bandwidth: 5 GB/s peak
- Origin bandwidth: 250 MB/s (5% of total)
- Cost savings: 90% reduction in origin trafficDatabase Optimization
1. Read Replica Scaling
Challenge: Handle 500K database reads/second
Solution:
- 3x read replicas per primary database
- Read/write splitting at application layer
- Replica lag monitoring (<1 second acceptable)
- Automatic failover on replica failure
Read Scaling Strategy:
Write Operations → Primary Database
Read Operations → Read Replicas (Round-robin)
Replica Configuration:
- Async replication with <1s lag
- Dedicated replicas for analytics queries
- Geographic replicas for cross-region reads
- Connection pooling: 1000 connections per replica
Load Distribution:
- Primary: 20% (writes + critical reads)
- Replica 1: 30% (general reads)
- Replica 2: 30% (general reads)
- Replica 3: 20% (analytics reads)2. Query Optimization
Challenge: Maintain sub-100ms query response times
Solution:
- Comprehensive indexing strategy
- Query plan analysis and optimization
- Materialized views for complex aggregations
- Partitioning for large tables
Optimization Techniques:
Indexing:
- B-tree indices for equality/range queries
- GiST indices for geospatial queries
- Covering indices to avoid table lookups
- Partial indices for filtered queries
Partitioning:
- trips table: Partition by month (time-based)
- transactions table: Partition by month
- location_history: Partition by week
- Benefits: Faster queries, easier archival
Materialized Views:
- driver_earnings_daily: Refresh every hour
- city_metrics_hourly: Refresh every 15 minutes
- surge_zone_stats: Refresh every 2 minutesMonitoring and Observability
1. Metrics Collection
Challenge: Monitor health of 2,000+ microservices
Solution:
- Prometheus for metrics collection
- Grafana for visualization
- Custom dashboards per service
- Automated alerting on SLA violations
Key Metrics:
Application Metrics:
- Request rate, latency, error rate (RED metrics)
- Database query performance
- Cache hit rates
- Queue depths and processing lag
Business Metrics:
- Trips per second
- Matching success rate
- Payment success rate
- Average wait time
- Driver utilization
Infrastructure Metrics:
- CPU, memory, disk usage
- Network throughput
- Database connections
- Cache memory usage2. Distributed Tracing
Challenge: Debug issues across microservices
Solution:
- Jaeger for distributed tracing
- Trace sampling (1% of requests)
- Correlation IDs across all services
- Performance bottleneck identification
Tracing Implementation:
Request Flow Tracing:
Rider App → API Gateway → Matching Service → Location Service
↓
Payment Service
↓
Notification Service
Trace Data:
- Service name and operation
- Start time and duration
- Parent/child span relationships
- Tags: user_id, trip_id, city_id
- Logs: Important events and errors
Performance Analysis:
- Identify slow services
- Detect cascading failures
- Optimize critical paths
- Capacity planningDisaster Recovery and Failover
1. Multi-Region Failover
Challenge: Maintain service during regional outages
Solution:
- Active-active deployment in multiple regions
- Cross-region data replication
- Automatic traffic shifting on failures
- Regular disaster recovery drills
Failover Strategy:
Normal Operation:
- US-West: 40% traffic
- US-East: 30% traffic
- Europe: 20% traffic
- Asia-Pacific: 10% traffic
Region Failure:
1. Health checks detect failure
2. GeoDNS removes failed region from rotation
3. Traffic redistributed to healthy regions
4. Capacity auto-scales to handle increased load
5. Failover time: <2 minutes
Data Consistency:
- Critical data: Sync replication (payments)
- User data: Async replication (5-minute lag)
- Analytics data: Async replication (1-hour lag)2. Graceful Degradation
Challenge: Maintain core functionality during partial outages
Solution:
- Feature flags for non-critical features
- Circuit breakers for failing dependencies
- Fallback mechanisms for external services
- Priority queuing for critical operations
Degradation Levels:
Level 1 (Normal): All features available
Level 2 (Minor): Disable non-critical features
- Ride scheduling
- Promotions
- Advanced analytics
Level 3 (Major): Core features only
- Ride matching
- Basic tracking
- Payment processing
Level 4 (Critical): Emergency mode
- Manual dispatch
- Cash payments only
- SMS-based communicationThis comprehensive scaling strategy enables Uber to handle millions of concurrent rides globally while maintaining sub-second response times and high reliability through horizontal scaling, geographic distribution, and intelligent caching.