Uber Backend - Tradeoffs and Alternatives
Architecture Tradeoffs
1. Microservices vs Monolith
Chosen: Microservices Architecture
Advantages:
- Independent scaling of services (matching, payment, location)
- Technology diversity (use best tool for each service)
- Team autonomy and parallel development
- Fault isolation (one service failure doesn't crash entire system)
- Easier to understand and maintain individual services
Disadvantages:
- Increased operational complexity (2,000+ services)
- Network latency between services
- Distributed transaction challenges
- More complex debugging and monitoring
- Higher infrastructure costs
Alternative: Modular Monolith
- Single deployable unit with clear module boundaries
- Simpler deployment and debugging
- Lower operational overhead
- Easier to start, harder to scale
- Why not chosen: Cannot scale components independently, single point of failure
Hybrid Approach:
Core Services (Microservices):
- Matching Service (needs independent scaling)
- Location Service (high throughput requirements)
- Payment Service (PCI compliance isolation)
Supporting Services (Modular Monolith):
- User management
- Notifications
- Analytics
- Admin tools
Benefits: Balance between scalability and operational simplicity
2. Synchronous vs Asynchronous Communication
Chosen: Hybrid Approach
Synchronous (REST/gRPC):
- Use for: User-facing APIs, critical path operations
- Advantages: Immediate response, simpler error handling
- Disadvantages: Tight coupling, cascading failures
- Examples: Ride request, driver matching, trip status
Asynchronous (Kafka/Message Queues):
- Use for: Background processing, analytics, notifications
- Advantages: Decoupling, better fault tolerance, load leveling
- Disadvantages: Eventual consistency, complex debugging
- Examples: Payment capture, analytics events, email notifications
Decision Matrix:
Synchronous:
✓ Ride matching (need immediate response)
✓ Payment authorization (need confirmation)
✓ Trip status updates (real-time requirement)
Asynchronous:
✓ Payment capture (can be delayed)
✓ Analytics processing (eventual consistency OK)
✓ Email receipts (not time-critical)
✓ Driver payouts (batch processing)
3. Strong vs Eventual Consistency
Chosen: Per-Use-Case Consistency Model
Strong Consistency:
Use Cases:
- Payment transactions (ACID compliance)
- Trip status changes (prevent double-booking)
- Driver availability (prevent double-matching)
Implementation:
- Synchronous database writes
- Distributed transactions (2PC when necessary)
- Pessimistic locking for critical sections
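The pessimistic-locking idea for driver availability can be sketched in-process as follows. This is a minimal illustration, not the production design: the `DriverPool` class and its `claim` method are hypothetical names, and a `threading.Lock` stands in for whatever database row lock or distributed lock a real deployment would use.

```python
import threading


class DriverPool:
    """Tracks which drivers are free; names here are illustrative only."""

    def __init__(self, driver_ids):
        self._available = set(driver_ids)
        self._lock = threading.Lock()  # stand-in for a row or distributed lock

    def claim(self, driver_id):
        """Atomically mark a driver as taken. Returns True only for the winner."""
        with self._lock:  # critical section: check and update happen together
            if driver_id in self._available:
                self._available.remove(driver_id)
                return True
            return False


pool = DriverPool(["driver-1", "driver-2"])
results = []


def request_ride():
    results.append(pool.claim("driver-1"))


# Eight concurrent ride requests compete for the same driver.
threads = [threading.Thread(target=request_ride) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Only one of the competing requests can have claimed the driver.
winners = results.count(True)
```

Holding the lock across both the check and the update is what prevents two requests from seeing the driver as free at the same time.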
Tradeoff: Lower availability, higher latency
Eventual Consistency:
Use Cases:
- Driver locations (5-second staleness acceptable)
- Surge pricing (1-minute lag acceptable)
- Analytics data (1-hour lag acceptable)
- User profiles (cache with TTL)
Implementation:
- Async replication
- Event sourcing
- CQRS pattern
- Cache-aside pattern
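As an illustration of the cache-aside pattern with a TTL (the approach listed for user profiles), here is a minimal sketch. `TTLCache`, `get_user_profile`, and `fake_db` are hypothetical names, the 60-second TTL is an assumed value, and the in-memory dictionary stands in for a real cache such as Redis.

```python
import time


class TTLCache:
    """Tiny in-memory cache with per-entry expiry (hypothetical helper)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: treat as a miss
            return None
        return value

    def set(self, key, value):
        self._entries[key] = (value, self._clock() + self._ttl)


def get_user_profile(user_id, cache, load_from_db):
    """Cache-aside read: try the cache, fall back to the database, then populate."""
    cached = cache.get(user_id)
    if cached is not None:
        return cached
    profile = load_from_db(user_id)
    cache.set(user_id, profile)
    return profile


# Usage with a fake clock and a fake database that records every read.
now = [0.0]
db_reads = []


def fake_db(user_id):
    db_reads.append(user_id)
    return {"id": user_id, "name": "Rider"}


cache = TTLCache(ttl_seconds=60, clock=lambda: now[0])
get_user_profile("u1", cache, fake_db)  # miss: reads the database
get_user_profile("u1", cache, fake_db)  # hit: served from the cache
now[0] = 61.0
get_user_profile("u1", cache, fake_db)  # expired: reads the database again
```

Between a database update and the TTL expiry, readers may see the old profile, which is the stale-read window this consistency model accepts.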
Tradeoff: Higher availability, potential stale reads
CAP Theorem Application:
Partition Tolerance (P): Always required in a distributed system
Consistency (C) vs Availability (A):
- Payment Service: Choose C (consistency over availability)
- Location Service: Choose A (availability over consistency)
- Matching Service: Balance (eventual consistency with compensation)
Database Tradeoffs
1. SQL vs NoSQL
Chosen: Polyglot Persistence
PostgreSQL (SQL):
Use Cases:
- Trips, users, payments (transactional data)
- Complex queries with joins
- ACID compliance requirements
Advantages:
- Strong consistency
- Rich query capabilities
- Mature ecosystem
- ACID transactions
Disadvantages:
- Vertical scaling limits
- Complex sharding
- Schema migrations
Cassandra (NoSQL):
Use Cases:
- GPS location history (time-series data)
- High write throughput
- Multi-datacenter replication
Advantages:
- Linear scalability
- High write throughput
- Multi-datacenter support
- No single point of failure
Disadvantages:
- Eventual consistency
- Limited query flexibility
- No joins or transactions
Redis (In-Memory):
Use Cases:
- Driver locations (geospatial queries)
- Session management
- Real-time caching
Advantages:
- Sub-millisecond latency
- Geospatial support
- Pub/sub capabilities
Disadvantages:
- Memory constraints
- Data persistence challenges
- Limited query capabilities
Alternative: Single Database
- Use PostgreSQL for everything
- Simpler architecture, easier to manage
- Why not chosen: Cannot handle write throughput, expensive to scale
2. Database Sharding Strategies
Chosen: Geographic (City-based) Sharding
Advantages:
- Data locality (most queries within same city)
- Reduced cross-shard queries (95% stay local)
- Easy to scale hot cities independently
- Natural business boundary
Disadvantages:
- Uneven load distribution (NYC vs small cities)
- Cross-city trips require distributed transactions
- Shard rebalancing complexity
Alternative 1: Hash-based Sharding (User ID)
Advantages:
- Even distribution across shards
- Predictable shard assignment
- Simple implementation
Disadvantages:
- No data locality
- More cross-shard queries
- Difficult to query by city
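The difference between the two routing schemes can be sketched in a few lines. This is an illustration under assumptions: the shard count, the shard names, and the `CITY_SHARDS` map are all hypothetical, and SHA-256 is just one reasonable choice of stable hash.

```python
import hashlib

NUM_HASH_SHARDS = 8  # assumed shard count

# Hypothetical city-to-shard map; large cities can be given their own shard.
CITY_SHARDS = {"nyc": "shard-nyc", "sf": "shard-sf"}
DEFAULT_CITY_SHARD = "shard-shared"


def hash_shard(user_id: str) -> int:
    """Hash-based: stable and evenly spread, but unrelated to geography."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_HASH_SHARDS


def city_shard(city: str) -> str:
    """City-based: every trip in a city lands on the same shard."""
    return CITY_SHARDS.get(city.lower(), DEFAULT_CITY_SHARD)
```

With `hash_shard`, the trips for a single city are scattered across all shards, so a query such as "active trips in NYC" has to fan out to every shard. With `city_shard`, the same query touches one shard. The sketch uses `hashlib` rather than Python's built-in `hash()` because string hashes from the built-in are randomized per process by default.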
Why not chosen: Most queries are city-specific, hash sharding loses locality
Alternative 2: Range-based Sharding (Time)
Advantages:
- Easy to archive old data
- Simple to add new shards
- Good for time-series queries
Disadvantages:
- Hot shard problem (all writes to latest shard)
- Uneven load distribution
- Complex cross-time-range queries
Why not chosen: Creates hot spots, doesn't match query patterns
Hybrid Approach:
Primary Sharding: City-based (for operational data)
Secondary Sharding: Time-based (for historical data)
Implementation:
- Active trips: City-based shards
- Historical trips (>30 days): Time-based shards
- Benefits: Operational efficiency + easy archival
Real-time Communication Tradeoffs
1. WebSocket vs Server-Sent Events vs Long Polling
Chosen: WebSocket for Real-time, REST for Fallback
WebSocket:
Advantages:
- Full-duplex communication
- Low latency (<100ms)
- Efficient for frequent updates
- Native mobile support
Disadvantages:
- Connection management complexity
- Load balancer challenges (sticky sessions)
- Firewall/proxy issues
- Higher server resource usage
Use Cases:
- Driver location updates
- Trip status changes
- Real-time notifications
Server-Sent Events (SSE):
Advantages:
- Simpler than WebSocket
- Automatic reconnection
- HTTP-based (easier through proxies)
Disadvantages:
- One-way communication only
- No support in legacy browsers (e.g., Internet Explorer)
- Connection limits per domain over HTTP/1.1
Why not chosen: Need bidirectional communication for driver updates
Long Polling:
Advantages:
- Works everywhere (HTTP-based)
- Simple implementation
- No special infrastructure
Disadvantages:
- Higher latency (1-2 seconds)
- More server resources
- Inefficient for frequent updates
Use Case: Fallback when WebSocket unavailable
2. Push vs Pull for Location Updates
Chosen: Push-based with Batching
Push (Driver sends updates):
Advantages:
- Real-time updates (3-5 second intervals)
- Server has latest data
- Better for matching algorithm
Disadvantages:
- High server load (750K updates/second)
- Battery drain on driver devices
- Network bandwidth usage
Implementation:
- Batch updates (send 5 locations at once)
- Adaptive frequency (faster when on trip)
- Compression to reduce bandwidth
Pull (Server requests updates):
Advantages:
- Server controls update frequency
- Can reduce load during high traffic
- Simpler client implementation
Disadvantages:
- Higher latency
- More network requests
- Polling overhead
Why not chosen: Latency too high for real-time matching
Matching Algorithm Tradeoffs
1. Centralized vs Distributed Matching
Chosen: Distributed Matching with Regional Coordination
Centralized Matching:
Advantages:
- Global optimization
- Simpler algorithm
- Consistent matching logic
Disadvantages:
- Single point of failure
- Scalability bottleneck
- High latency for distant regions
Why not chosen: Cannot scale to global operations
Distributed Matching:
Advantages:
- Regional scalability
- Lower latency
- Fault isolation
Disadvantages:
- Suboptimal global matching
- Cross-region coordination complexity
- Potential duplicate matches
Implementation:
- Match within city/region first
- Expand to neighboring regions if no match
- Distributed lock to prevent double-matching
2. Greedy vs Optimal Matching
Chosen: Greedy with Constraints
Greedy Matching (First Available):
Algorithm:
1. Find nearest available driver
2. Send ride offer
3. First to accept wins
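Step 1 of this flow can be sketched as a nearest-neighbor search. This is a minimal illustration assuming a small in-memory driver list with hypothetical field names; a production system would use a geospatial index rather than a linear scan, and the offer and accept steps are not modeled here.

```python
import math


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def nearest_available_driver(pickup, drivers):
    """Pick the closest driver marked available, or None if there is none."""
    candidates = [d for d in drivers if d["available"]]
    if not candidates:
        return None
    return min(candidates, key=lambda d: haversine_km(pickup, d["location"]))


# Made-up sample data: d2 is closest but busy, so d1 should be chosen over d3.
drivers = [
    {"id": "d1", "location": (40.7580, -73.9855), "available": True},
    {"id": "d2", "location": (40.7484, -73.9857), "available": False},
    {"id": "d3", "location": (40.7306, -73.9866), "available": True},
]
pickup = (40.7480, -73.9850)
chosen = nearest_available_driver(pickup, drivers)
```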
Advantages:
- Fast matching (<5 seconds)
- Simple implementation
- Low computational cost
Disadvantages:
- Suboptimal for driver earnings
- May not minimize rider wait time
- Doesn't consider future demand
Optimal Matching (Global Optimization):
Algorithm:
1. Consider all pending requests and available drivers
2. Solve assignment problem (Hungarian algorithm)
3. Optimize for total wait time or earnings
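To make the assignment problem concrete, the sketch below finds the minimum-total-wait assignment by brute force. It is deliberately not the Hungarian algorithm: trying every permutation is O(n!) and only workable for tiny inputs, but it returns the same optimum the Hungarian algorithm would find in O(n³). The function name and the wait-time matrix are made up for illustration.

```python
from itertools import permutations


def optimal_assignment(cost):
    """Return (total, pairs) minimising summed cost for a square matrix.

    cost[r][d] is the wait time if request r is served by driver d.
    Brute force over all permutations, so only usable for tiny inputs.
    """
    n = len(cost)
    best_total, best_perm = None, None
    for perm in permutations(range(n)):
        total = sum(cost[r][perm[r]] for r in range(n))
        if best_total is None or total < best_total:
            best_total, best_perm = total, perm
    return best_total, list(enumerate(best_perm))


# Wait times in minutes: rows are requests, columns are drivers.
wait = [
    [1, 2, 9],
    [2, 10, 9],
    [9, 9, 3],
]
total, pairs = optimal_assignment(wait)
```

In this matrix, serving requests one at a time with the nearest free driver gives request 0 driver 0 (1 minute), which forces request 1 onto a 9-minute driver. The global solution instead gives request 0 a slightly worse driver so that the total wait is lower.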
Advantages:
- Optimal solution
- Better driver utilization
- Minimizes total wait time
Disadvantages:
- High computational cost (O(n³))
- Requires batch processing
- Delayed matching (30-60 seconds)
Why not chosen: Latency too high for real-time matching
Hybrid Approach:
Greedy with Scoring:
1. Find nearby drivers (geospatial query)
2. Score each driver:
- Distance to pickup (40%)
- Driver rating (20%)
- Acceptance rate (15%)
- Earnings balance (15%)
- Time since last trip (10%)
3. Select top 3 drivers
4. Send offers simultaneously
5. First to accept wins
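Steps 2 and 3 can be sketched as a weighted sum followed by a top-k selection. The weights come from the list above; everything else is an assumption for illustration, in particular that each factor has already been normalised to the range 0 to 1 with higher meaning better (so distance is expressed as a `proximity` value). The feature names and sample drivers are hypothetical.

```python
WEIGHTS = {
    "proximity": 0.40,         # distance to pickup, inverted so closer is higher
    "rating": 0.20,
    "acceptance_rate": 0.15,
    "earnings_balance": 0.15,  # higher for drivers who have earned less so far
    "idle_time": 0.10,         # time since last trip
}


def score_driver(features):
    """Weighted sum of features, each assumed pre-normalised to [0, 1]."""
    return sum(WEIGHTS[name] * features[name] for name in WEIGHTS)


def top_candidates(drivers, k=3):
    """Return the k highest-scoring drivers, best first."""
    ranked = sorted(drivers, key=lambda d: score_driver(d["features"]), reverse=True)
    return ranked[:k]


def make_driver(driver_id, proximity, rating, acceptance, earnings, idle):
    return {"id": driver_id, "features": {
        "proximity": proximity, "rating": rating, "acceptance_rate": acceptance,
        "earnings_balance": earnings, "idle_time": idle}}


# Made-up candidates returned by the geospatial query in step 1.
nearby = [
    make_driver("d1", 0.9, 0.8, 0.7, 0.5, 0.2),
    make_driver("d2", 0.5, 1.0, 0.9, 0.9, 0.9),
    make_driver("d3", 0.2, 0.6, 0.5, 0.4, 0.1),
    make_driver("d4", 1.0, 0.4, 0.4, 0.2, 0.0),
]
offer_to = [d["id"] for d in top_candidates(nearby)]
```

Here the closest driver (d4) does not rank first, because the non-distance factors together carry 60% of the weight.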
Benefits: Fast matching with better optimization than pure greedy
Payment Processing Tradeoffs
1. Sync vs Async Payment Processing
Chosen: Async with Immediate Authorization
Synchronous Payment:
Flow:
Trip Complete → Calculate Fare → Process Payment → Update Trip
Advantages:
- Immediate confirmation
- Simpler error handling
- Strong consistency
Disadvantages:
- Blocks trip completion on payment
- Higher latency
- Payment gateway failures block trips
Why not chosen: Payment failures shouldn't prevent trip completion
Asynchronous Payment:
Flow:
Trip Complete → Update Trip → Queue Payment → Process Async
Advantages:
- Faster trip completion
- Better fault tolerance
- Load leveling
Disadvantages:
- Eventual consistency
- Complex error handling
- Retry logic required
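One common shape for the retry logic is exponential backoff with jitter, sketched below. The function names, the five-attempt limit, and the 1-second base and 60-second cap are assumptions, and `capture` stands for any callable that raises on failure. A real worker would retry only on errors known to be transient and would make the capture call idempotent so a retry cannot charge twice.

```python
import random
import time


def backoff_delays(max_attempts, base=1.0, cap=60.0):
    """Upper bound on the wait before each retry: base * 2**n, capped."""
    return [min(cap, base * (2 ** n)) for n in range(max_attempts - 1)]


def capture_with_retry(capture, max_attempts=5, base=1.0, cap=60.0,
                       sleep=time.sleep, jitter=random.random):
    """Call capture() until it succeeds or attempts run out.

    Returns the result of capture(); re-raises the last error on exhaustion.
    """
    delays = backoff_delays(max_attempts, base, cap)
    for attempt in range(max_attempts):
        try:
            return capture()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Full jitter: wait a random fraction of the capped delay.
            sleep(delays[attempt] * jitter())
```

The `sleep` and `jitter` parameters are injectable so the schedule can be exercised in tests without actually waiting.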
Implementation:
1. Authorize payment at trip start (hold funds)
2. Complete trip immediately
3. Capture payment asynchronously
4. Retry on failure with exponential backoff
2. Single vs Multiple Payment Gateways
Chosen: Multiple Gateways with Failover
Single Gateway:
Advantages:
- Simpler integration
- Lower maintenance
- Consistent behavior
Disadvantages:
- Single point of failure
- Vendor lock-in
- No negotiating leverage
Why not chosen: Too risky for critical payment infrastructure
Multiple Gateways:
Implementation:
- Primary: Stripe (70% traffic)
- Secondary: Braintree (20% traffic)
- Tertiary: Adyen (10% traffic)
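A 70/20/10 split like this can be sketched as a deterministic weighted router. The percentages are from the list above; the function names, the hash-bucket approach, and the `unhealthy` set (a stand-in for whatever health tracking the real system uses) are assumptions for illustration. No gateway SDK is called here.

```python
import hashlib

# Traffic split from the list above, as cumulative percentage thresholds.
GATEWAY_SPLIT = [("stripe", 70), ("braintree", 90), ("adyen", 100)]


def bucket_for(payment_id: str) -> int:
    """Map a payment ID to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(payment_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def choose_gateway(payment_id: str, unhealthy=frozenset()) -> str:
    """Pick a gateway by weight, skipping any marked unhealthy."""
    bucket = bucket_for(payment_id)
    ordered = [name for name, _ in GATEWAY_SPLIT]
    weighted = next(name for name, upper in GATEWAY_SPLIT if bucket < upper)
    # Try the weighted choice first, then the remaining gateways in priority order.
    for name in [weighted] + [g for g in ordered if g != weighted]:
        if name not in unhealthy:
            return name
    raise RuntimeError("no healthy payment gateway available")
```

Hashing the payment ID, rather than drawing a random number, means a retried payment is routed to the same gateway as the original attempt while that gateway stays healthy, which simplifies reconciliation.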
Advantages:
- High availability
- Vendor negotiation leverage
- Geographic optimization
- Risk mitigation
Disadvantages:
- Complex integration
- Higher maintenance
- Reconciliation complexity
Failover Logic:
1. Try primary gateway
2. If error rate > 5%, open circuit breaker
3. Route to secondary gateway
4. Monitor and close circuit after 5 minutes
Surge Pricing Tradeoffs
1. Real-time vs Predictive Pricing
Chosen: Hybrid (Real-time with ML Prediction)
Real-time Pricing:
Algorithm:
- Calculate supply/demand ratio every 1-2 minutes
- Apply surge multiplier based on ratio
- Update immediately
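The ratio-to-multiplier step might look like the sketch below. The source does not give a formula, so the linear mapping, the `sensitivity` value, and the 3.0 cap are all assumptions. The ratio is expressed here as demand over supply so that a larger ratio means a larger multiplier.

```python
def surge_multiplier(open_requests, available_drivers,
                     sensitivity=0.5, max_multiplier=3.0):
    """Map a demand/supply ratio to a price multiplier.

    The linear formula, sensitivity, and cap are illustrative assumptions.
    """
    if available_drivers <= 0:
        return max_multiplier  # no supply at all: use the cap
    ratio = open_requests / available_drivers
    if ratio <= 1.0:
        return 1.0  # supply meets demand: no surge
    raw = 1.0 + sensitivity * (ratio - 1.0)
    return round(min(max_multiplier, raw), 2)
```

For example, 200 open requests against 100 available drivers gives a ratio of 2.0 and, under these assumed parameters, a multiplier of 1.5.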
Advantages:
- Reflects current conditions
- Simple to understand
- Responsive to changes
Disadvantages:
- Can be volatile
- Reactive (not proactive)
- May cause price shocks
Predictive Pricing:
Algorithm:
- ML model predicts demand 15-30 minutes ahead
- Pre-emptively adjust pricing
- Smooth price transitions
Advantages:
- Proactive supply management
- Smoother price changes
- Better driver positioning
Disadvantages:
- Prediction errors
- Complex to explain
- Requires significant ML infrastructure
Implementation:
- Use ML to predict demand
- Apply real-time adjustments
- Smooth transitions to avoid shocks
- Cap maximum surge multiplier
2. Zone-based vs Individual Pricing
Chosen: Zone-based (H3 Hexagons)
Zone-based Pricing:
Implementation:
- Divide city into hexagonal zones (H3)
- Calculate surge per zone
- All trips in zone get same multiplier
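The lookup pattern can be sketched with a plain square grid. This is a simplified stand-in and does not use the H3 library: real H3 cells are hexagons with their own index format, while the `zone_id` function, the 0.01-degree cell size, and the sample multiplier below are hypothetical.

```python
import math

CELL_DEGREES = 0.01  # assumed cell size, roughly 1 km of latitude


def zone_id(lat: float, lon: float) -> tuple:
    """Square-grid cell for a point; a simplified stand-in for an H3 index."""
    return (math.floor(lat / CELL_DEGREES), math.floor(lon / CELL_DEGREES))


# One multiplier per zone, refreshed by the pricing job (value is made up).
zone_surge = {zone_id(40.7580, -73.9855): 1.8}


def surge_for(lat: float, lon: float) -> float:
    """Every pickup inside a zone gets that zone's multiplier; default is 1.0."""
    return zone_surge.get(zone_id(lat, lon), 1.0)
```

Because the multiplier is keyed by zone rather than by trip, the table stays small enough to cache and distribute, and a price lookup is a single dictionary read.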
Advantages:
- Scalable (10K zones vs 10M trips)
- Predictable for users
- Easier to cache and distribute
Disadvantages:
- Less precise
- Zone boundary issues
- May not reflect micro-patterns
Individual Pricing:
Implementation:
- Calculate surge for each trip request
- Consider exact pickup/dropoff locations
- Personalized pricing
Advantages:
- Most accurate pricing
- Can optimize per trip
- Better revenue optimization
Disadvantages:
- Computationally expensive
- Difficult to explain
- Potential fairness issues
- Regulatory concerns
Why not chosen: Scalability and transparency concerns
Monitoring and Observability Tradeoffs
1. Sampling vs Full Tracing
Chosen: Adaptive Sampling
Full Tracing:
Advantages:
- Complete visibility
- No missed issues
- Detailed debugging
Disadvantages:
- High storage cost (10TB/day)
- Performance overhead
- Analysis complexity
Why not chosen: Cost and performance impact too high
Adaptive Sampling:
Implementation:
- Sample 1% of successful requests
- Sample 100% of errors
- Sample 10% of slow requests (>1s)
- Sample 100% of critical paths (payments)
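These rules can be sketched as a single decision function. The rates and the 1-second threshold come from the list above; the function names and the choice to derive the decision from a hash of the trace ID are assumptions. No tracing library API is used.

```python
import hashlib

SUCCESS_RATE = 0.01     # 1% of successful requests
SLOW_RATE = 0.10        # 10% of slow requests
SLOW_THRESHOLD_S = 1.0  # "slow" means longer than 1 second


def _fraction(trace_id: str) -> float:
    """Stable pseudo-random value in [0, 1) derived from the trace ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def should_sample(trace_id, is_error, duration_s, is_critical_path):
    """Apply the sampling rules in priority order."""
    if is_error or is_critical_path:
        return True  # errors and critical paths are always kept
    rate = SLOW_RATE if duration_s > SLOW_THRESHOLD_S else SUCCESS_RATE
    return _fraction(trace_id) < rate
```

Deriving the decision from the trace ID instead of a random draw means any service that sees the same trace ID and the same inputs reaches the same decision. Since the error flag and duration are only known once a request finishes, this check would have to run after completion rather than when the trace starts.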
Advantages:
- Reduced cost (100GB/day vs 10TB/day)
- Lower performance impact
- Focus on important traces
Disadvantages:
- May miss rare issues
- Statistical analysis required
- Sampling bias
This comprehensive analysis of tradeoffs demonstrates the complex decision-making required to build a global-scale ride-sharing platform, balancing performance, cost, complexity, and user experience.