Trade-offs & Alternatives

📖 10 min read 📄 Part 7 of 10

Uber Backend - Tradeoffs and Alternatives

Architecture Tradeoffs

1. Microservices vs Monolith

Chosen: Microservices Architecture

Advantages:

  • Independent scaling of services (matching, payment, location)
  • Technology diversity (use best tool for each service)
  • Team autonomy and parallel development
  • Fault isolation (one service failure doesn't crash entire system)
  • Easier to understand and maintain individual services

Disadvantages:

  • Increased operational complexity (2,000+ services)
  • Network latency between services
  • Distributed transaction challenges
  • More complex debugging and monitoring
  • Higher infrastructure costs

Alternative: Modular Monolith

  • Single deployable unit with clear module boundaries
  • Simpler deployment and debugging
  • Lower operational overhead
  • Easier to start, harder to scale
  • Why not chosen: Cannot scale components independently, single point of failure

Hybrid Approach:

Core Services (Microservices):
- Matching Service (needs independent scaling)
- Location Service (high throughput requirements)
- Payment Service (PCI compliance isolation)

Supporting Services (Modular Monolith):
- User management
- Notifications
- Analytics
- Admin tools

Benefits: Balance between scalability and operational simplicity

2. Synchronous vs Asynchronous Communication

Chosen: Hybrid Approach

Synchronous (REST/gRPC):

  • Use for: User-facing APIs, critical path operations
  • Advantages: Immediate response, simpler error handling
  • Disadvantages: Tight coupling, cascading failures
  • Examples: Ride request, driver matching, trip status

Asynchronous (Kafka/Message Queues):

  • Use for: Background processing, analytics, notifications
  • Advantages: Decoupling, better fault tolerance, load leveling
  • Disadvantages: Eventual consistency, complex debugging
  • Examples: Payment processing, analytics events, email notifications

Decision Matrix:

Synchronous:
✓ Ride matching (need immediate response)
✓ Payment authorization (need confirmation)
✓ Trip status updates (real-time requirement)

Asynchronous:
✓ Payment capture (can be delayed)
✓ Analytics processing (eventual consistency OK)
✓ Email receipts (not time-critical)
✓ Driver payouts (batch processing)

3. Strong vs Eventual Consistency

Chosen: Per-Use-Case Consistency Model

Strong Consistency:

Use Cases:
- Payment transactions (ACID compliance)
- Trip status changes (prevent double-booking)
- Driver availability (prevent double-matching)

Implementation:
- Synchronous database writes
- Distributed transactions (2PC when necessary)
- Pessimistic locking for critical sections

Tradeoff: Lower availability, higher latency

Eventual Consistency:

Use Cases:
- Driver locations (5-second staleness acceptable)
- Surge pricing (1-minute lag acceptable)
- Analytics data (1-hour lag acceptable)
- User profiles (cache with TTL)

Implementation:
- Async replication
- Event sourcing
- CQRS pattern
- Cache-aside pattern

Tradeoff: Higher availability, potential stale reads

CAP Theorem Application:

Partition Tolerance (P): Always required in distributed system

Consistency (C) vs Availability (A):
- Payment Service: Choose C (consistency over availability)
- Location Service: Choose A (availability over consistency)
- Matching Service: Balance (eventual consistency with compensation)

Database Tradeoffs

1. SQL vs NoSQL

Chosen: Polyglot Persistence

PostgreSQL (SQL):

Use Cases:
- Trips, users, payments (transactional data)
- Complex queries with joins
- ACID compliance requirements

Advantages:
- Strong consistency
- Rich query capabilities
- Mature ecosystem
- ACID transactions

Disadvantages:
- Vertical scaling limits
- Complex sharding
- Schema migrations

Cassandra (NoSQL):

Use Cases:
- GPS location history (time-series data)
- High write throughput
- Multi-datacenter replication

Advantages:
- Linear scalability
- High write throughput
- Multi-datacenter support
- No single point of failure

Disadvantages:
- Eventual consistency
- Limited query flexibility
- No joins or transactions

Redis (In-Memory):

Use Cases:
- Driver locations (geospatial queries)
- Session management
- Real-time caching

Advantages:
- Sub-millisecond latency
- Geospatial support
- Pub/sub capabilities

Disadvantages:
- Memory constraints
- Data persistence challenges
- Limited query capabilities

Alternative: Single Database

  • Use PostgreSQL for everything
  • Simpler architecture, easier to manage
  • Why not chosen: Cannot handle write throughput, expensive to scale

2. Database Sharding Strategies

Chosen: Geographic (City-based) Sharding

Advantages:

  • Data locality (most queries within same city)
  • Reduced cross-shard queries (95% stay local)
  • Easy to scale hot cities independently
  • Natural business boundary

Disadvantages:

  • Uneven load distribution (NYC vs small cities)
  • Cross-city trips require distributed transactions
  • Shard rebalancing complexity

Alternative 1: Hash-based Sharding (User ID)

Advantages:
- Even distribution across shards
- Predictable shard assignment
- Simple implementation

Disadvantages:
- No data locality
- More cross-shard queries
- Difficult to query by city

Why not chosen: Most queries are city-specific, hash sharding loses locality

Alternative 2: Range-based Sharding (Time)

Advantages:
- Easy to archive old data
- Simple to add new shards
- Good for time-series queries

Disadvantages:
- Hot shard problem (all writes to latest shard)
- Uneven load distribution
- Complex cross-time-range queries

Why not chosen: Creates hot spots, doesn't match query patterns

Hybrid Approach:

Primary Sharding: City-based (for operational data)
Secondary Sharding: Time-based (for historical data)

Implementation:
- Active trips: City-based shards
- Historical trips (>30 days): Time-based shards
- Benefits: Operational efficiency + easy archival

Real-time Communication Tradeoffs

1. WebSocket vs Server-Sent Events vs Long Polling

Chosen: WebSocket for Real-time, REST for Fallback

WebSocket:

Advantages:
- Full-duplex communication
- Low latency (<100ms)
- Efficient for frequent updates
- Native mobile support

Disadvantages:
- Connection management complexity
- Load balancer challenges (sticky sessions)
- Firewall/proxy issues
- Higher server resource usage

Use Cases:
- Driver location updates
- Trip status changes
- Real-time notifications

Server-Sent Events (SSE):

Advantages:
- Simpler than WebSocket
- Automatic reconnection
- HTTP-based (easier through proxies)

Disadvantages:
- One-way communication only
- Limited browser support
- Connection limits per domain

Why not chosen: Need bidirectional communication for driver updates

Long Polling:

Advantages:
- Works everywhere (HTTP-based)
- Simple implementation
- No special infrastructure

Disadvantages:
- Higher latency (1-2 seconds)
- More server resources
- Inefficient for frequent updates

Use Case: Fallback when WebSocket unavailable

2. Push vs Pull for Location Updates

Chosen: Push-based with Batching

Push (Driver sends updates):

Advantages:
- Real-time updates (3-5 second intervals)
- Server has latest data
- Better for matching algorithm

Disadvantages:
- High server load (750K updates/second)
- Battery drain on driver devices
- Network bandwidth usage

Implementation:
- Batch updates (send 5 locations at once)
- Adaptive frequency (faster when on trip)
- Compression to reduce bandwidth

Pull (Server requests updates):

Advantages:
- Server controls update frequency
- Can reduce load during high traffic
- Simpler client implementation

Disadvantages:
- Higher latency
- More network requests
- Polling overhead

Why not chosen: Latency too high for real-time matching

Matching Algorithm Tradeoffs

1. Centralized vs Distributed Matching

Chosen: Distributed Matching with Regional Coordination

Centralized Matching:

Advantages:
- Global optimization
- Simpler algorithm
- Consistent matching logic

Disadvantages:
- Single point of failure
- Scalability bottleneck
- High latency for distant regions

Why not chosen: Cannot scale to global operations

Distributed Matching:

Advantages:
- Regional scalability
- Lower latency
- Fault isolation

Disadvantages:
- Suboptimal global matching
- Cross-region coordination complexity
- Potential duplicate matches

Implementation:
- Match within city/region first
- Expand to neighboring regions if no match
- Distributed lock to prevent double-matching

2. Greedy vs Optimal Matching

Chosen: Greedy with Constraints

Greedy Matching (First Available):

Algorithm:
1. Find nearest available driver
2. Send ride offer
3. First to accept wins

Advantages:
- Fast matching (<5 seconds)
- Simple implementation
- Low computational cost

Disadvantages:
- Suboptimal for driver earnings
- May not minimize rider wait time
- Doesn't consider future demand

Optimal Matching (Global Optimization):

Algorithm:
1. Consider all pending requests and available drivers
2. Solve assignment problem (Hungarian algorithm)
3. Optimize for total wait time or earnings

Advantages:
- Optimal solution
- Better driver utilization
- Minimizes total wait time

Disadvantages:
- High computational cost (O(n³))
- Requires batch processing
- Delayed matching (30-60 seconds)

Why not chosen: Latency too high for real-time matching

Hybrid Approach:

Greedy with Scoring:
1. Find nearby drivers (geospatial query)
2. Score each driver:
   - Distance to pickup (40%)
   - Driver rating (20%)
   - Acceptance rate (15%)
   - Earnings balance (15%)
   - Time since last trip (10%)
3. Select top 3 drivers
4. Send offers simultaneously
5. First to accept wins

Benefits: Fast matching with better optimization than pure greedy

Payment Processing Tradeoffs

1. Sync vs Async Payment Processing

Chosen: Async with Immediate Authorization

Synchronous Payment:

Flow:
Trip Complete → Calculate Fare → Process Payment → Update Trip

Advantages:
- Immediate confirmation
- Simpler error handling
- Strong consistency

Disadvantages:
- Blocks trip completion on payment
- Higher latency
- Payment gateway failures block trips

Why not chosen: Payment failures shouldn't prevent trip completion

Asynchronous Payment:

Flow:
Trip Complete → Update Trip → Queue Payment → Process Async

Advantages:
- Faster trip completion
- Better fault tolerance
- Load leveling

Disadvantages:
- Eventual consistency
- Complex error handling
- Retry logic required

Implementation:
1. Authorize payment at trip start (hold funds)
2. Complete trip immediately
3. Capture payment asynchronously
4. Retry on failure with exponential backoff

2. Single vs Multiple Payment Gateways

Chosen: Multiple Gateways with Failover

Single Gateway:

Advantages:
- Simpler integration
- Lower maintenance
- Consistent behavior

Disadvantages:
- Single point of failure
- Vendor lock-in
- No negotiating leverage

Why not chosen: Too risky for critical payment infrastructure

Multiple Gateways:

Implementation:
- Primary: Stripe (70% traffic)
- Secondary: Braintree (20% traffic)
- Tertiary: Adyen (10% traffic)

Advantages:
- High availability
- Vendor negotiation leverage
- Geographic optimization
- Risk mitigation

Disadvantages:
- Complex integration
- Higher maintenance
- Reconciliation complexity

Failover Logic:
1. Try primary gateway
2. If error rate > 5%, open circuit breaker
3. Route to secondary gateway
4. Monitor and close circuit after 5 minutes

Surge Pricing Tradeoffs

1. Real-time vs Predictive Pricing

Chosen: Hybrid (Real-time with ML Prediction)

Real-time Pricing:

Algorithm:
- Calculate supply/demand ratio every 1-2 minutes
- Apply surge multiplier based on ratio
- Update immediately

Advantages:
- Reflects current conditions
- Simple to understand
- Responsive to changes

Disadvantages:
- Can be volatile
- Reactive (not proactive)
- May cause price shocks

Predictive Pricing:

Algorithm:
- ML model predicts demand 15-30 minutes ahead
- Pre-emptively adjust pricing
- Smooth price transitions

Advantages:
- Proactive supply management
- Smoother price changes
- Better driver positioning

Disadvantages:
- Prediction errors
- Complex to explain
- Requires significant ML infrastructure

Implementation:
- Use ML to predict demand
- Apply real-time adjustments
- Smooth transitions to avoid shocks
- Cap maximum surge multiplier

2. Zone-based vs Individual Pricing

Chosen: Zone-based (H3 Hexagons)

Zone-based Pricing:

Implementation:
- Divide city into hexagonal zones (H3)
- Calculate surge per zone
- All trips in zone get same multiplier

Advantages:
- Scalable (10K zones vs 10M trips)
- Predictable for users
- Easier to cache and distribute

Disadvantages:
- Less precise
- Zone boundary issues
- May not reflect micro-patterns

Individual Pricing:

Implementation:
- Calculate surge for each trip request
- Consider exact pickup/dropoff locations
- Personalized pricing

Advantages:
- Most accurate pricing
- Can optimize per trip
- Better revenue optimization

Disadvantages:
- Computationally expensive
- Difficult to explain
- Potential fairness issues
- Regulatory concerns

Why not chosen: Scalability and transparency concerns

Monitoring and Observability Tradeoffs

1. Sampling vs Full Tracing

Chosen: Adaptive Sampling

Full Tracing:

Advantages:
- Complete visibility
- No missed issues
- Detailed debugging

Disadvantages:
- High storage cost (10TB/day)
- Performance overhead
- Analysis complexity

Why not chosen: Cost and performance impact too high

Adaptive Sampling:

Implementation:
- Sample 1% of successful requests
- Sample 100% of errors
- Sample 10% of slow requests (>1s)
- Sample 100% of critical paths (payments)

Advantages:
- Reduced cost (100GB/day vs 10TB/day)
- Lower performance impact
- Focus on important traces

Disadvantages:
- May miss rare issues
- Statistical analysis required
- Sampling bias

This comprehensive analysis of tradeoffs demonstrates the complex decision-making required to build a global-scale ride-sharing platform, balancing performance, cost, complexity, and user experience.