Scaling Considerations

📖 7 min read 📄 Part 6 of 10

Ticketmaster - Scaling Considerations

Pre-Sale Preparation

Infrastructure Pre-Scaling

T-24 hours: Capacity Planning
- Analyze historical data for similar events
- Estimate peak concurrent users
- Calculate required infrastructure

T-12 hours: Pre-Scaling
- Scale application servers: 500 → 10,000 instances
- Scale database read replicas: 100 → 500 replicas
- Scale Redis cache: 100 → 1,000 nodes
- Pre-warm CDN cache with event assets

T-1 hour: Final Checks
- Load test queue system
- Verify database connections
- Test payment gateway capacity
- Enable enhanced monitoring

CDN Pre-Warming

Assets to Cache:
- Event page HTML
- Seat map images
- Venue images
- JavaScript/CSS bundles
- API responses (event details)

Process:
1. Generate cache keys for all assets
2. Push to CDN edge locations (200+ PoPs)
3. Verify cache hit rates (target: 95%)
4. Monitor cache performance

Queue System Scaling

Distributed Queue Architecture

Queue Sharding:
- Shard by event_id
- 1,000 Redis nodes
- 10,000 users per node
- 10M total capacity

Queue Processing:
- Process 10K users/second
- Batch dequeue (100 users at a time)
- Parallel processing across nodes
- Auto-scaling based on queue length

Queue Performance Optimization

Optimization Techniques:
1. Batch Operations:
   - Dequeue 100 users at once
   - Update positions in batch
   - Reduce Redis round trips

2. Connection Pooling:
   - Reuse Redis connections
   - 1000 connections per app server
   - Connection multiplexing

3. Async Processing:
   - Non-blocking queue operations
   - Event-driven architecture
   - WebSocket for real-time updates

4. Caching:
   - Cache queue statistics
   - Cache wait time estimates
   - 30-second TTL

Database Scaling

Read Scaling

Read Replica Strategy:
- 1 primary + 10 read replicas per shard
- Route reads to replicas (90% of traffic)
- Route writes to primary (10% of traffic)
- Async replication (<1 second lag)

Query Optimization:
- Index all foreign keys
- Composite indexes for common queries
- Covering indexes to avoid table lookups
- Query result caching (30-second TTL)

Write Scaling

Write Optimization:
1. Batch Writes:
   - Group multiple seat updates
   - Single transaction for batch
   - Reduce transaction overhead

2. Async Writes:
   - Analytics data (async)
   - Audit logs (async)
   - Email notifications (async)

3. Write Sharding:
   - Shard by event_id
   - Isolate hot events
   - Prevent cross-shard contention

4. Connection Pooling:
   - PgBouncer for connection management
   - 10,000 connections per shard
   - Connection reuse

Optimistic Locking at Scale

Conflict Resolution:
1. User attempts to reserve seat
2. Read seat with version number
3. Update with version check
4. If conflict (version mismatch):
   - Retry with exponential backoff
   - Max 3 retries
   - If all fail, suggest alternative seats

Performance:
- 50K concurrent reservation attempts
- 95% success rate on first try
- 4% success on retry
- 1% failure (seat taken)

Optimization:
- Pre-filter available seats
- Reduce contention with smart seat selection
- Suggest seats with lower contention

Inventory Management Scaling

Seat Availability Caching

Cache Strategy:
L1 (Application Memory):
- Hot events: Full seat map
- TTL: 10 seconds
- Size: 100 MB per instance

L2 (Redis):
- All events: Availability summary
- TTL: 30 seconds
- Size: 10 GB total

L3 (Database):
- Source of truth
- Real-time updates
- Optimistic locking

Cache Invalidation:
- On seat purchase: Invalidate section
- On reservation: Update availability count
- Broadcast to all app servers

Reservation Expiry at Scale

Challenge: Release 100K expired reservations/minute

Solution:
1. Partition by expiry time (1-minute buckets)
2. Process each bucket in parallel
3. Batch release (1000 seats at a time)
4. Update availability cache
5. Notify users in queue

Implementation:
- Background job every 30 seconds
- 100 worker threads
- Process 1000 seats per thread
- 100K seats released per run

Payment Processing Scaling

Async Payment Architecture

Payment Flow:
1. User completes checkout
2. Reserve seats (synchronous)
3. Queue payment for processing (async)
4. Return order confirmation immediately
5. Process payment in background
6. Update order status
7. Send confirmation email

Benefits:
- Faster checkout (no payment wait)
- Better fault tolerance
- Load leveling
- Retry capability

Payment Gateway Scaling

Multiple Gateways:
- Primary: Stripe (70% traffic)
- Secondary: Braintree (20% traffic)
- Tertiary: Adyen (10% traffic)

Failover Strategy:
1. Try primary gateway
2. If error rate > 5%, circuit breaker opens
3. Route to secondary gateway
4. Monitor and close circuit after 5 minutes

Capacity:
- 10K transactions/second per gateway
- 30K total capacity
- Auto-scaling based on queue depth

Bot Detection Scaling

Multi-Layer Bot Defense

Layer 1 - CDN/WAF (Cloudflare):
- Block known bot IPs
- Rate limit by IP (100 req/min)
- Challenge suspicious traffic
- DDoS protection

Layer 2 - Queue System:
- Device fingerprinting
- One session per user
- Behavioral analysis
- CAPTCHA challenges

Layer 3 - Application:
- ML-based fraud detection
- Purchase velocity checks
- Account reputation scoring
- Pattern analysis

Layer 4 - Post-Purchase:
- Transaction analysis
- Resale monitoring
- Account linking detection

CAPTCHA at Scale

Challenge: Serve 10M CAPTCHAs during major sale

Solution:
- reCAPTCHA v3 (invisible, score-based)
- Challenge only high-risk users (20%)
- 2M CAPTCHA challenges
- 99.9% success rate

Optimization:
- Cache CAPTCHA tokens (5-minute TTL)
- Reuse tokens for multiple requests
- Async verification
- Fallback to v2 if v3 fails

Monitoring and Auto-Scaling

Key Metrics

System Metrics:
- CPU utilization (target: 70%)
- Memory usage (target: 80%)
- Network throughput
- Disk I/O

Application Metrics:
- Request rate (req/s)
- Response time (p50, p95, p99)
- Error rate (%)
- Queue length

Business Metrics:
- Tickets sold/minute
- Checkout success rate
- Bot detection rate
- Revenue/minute

Auto-Scaling Rules

Scale Up Triggers:
- CPU > 70% for 2 minutes
- Queue length > 100K
- Response time p95 > 2 seconds
- Error rate > 1%

Scale Down Triggers:
- CPU < 30% for 10 minutes
- Queue length < 10K
- Response time p95 < 500ms
- Error rate < 0.1%

Scaling Limits:
- Min instances: 500
- Max instances: 50,000
- Scale increment: 10% of current
- Cooldown: 5 minutes

Cost Optimization

Reserved vs On-Demand

Baseline Capacity (Reserved):
- 500 application servers
- 100 database servers
- 100 cache nodes
- Cost: $500K/month
- Savings: 60% vs on-demand

Burst Capacity (On-Demand):
- 49,500 application servers
- 400 database servers
- 900 cache nodes
- Duration: 2-3 hours per major sale
- Cost: $400K per sale
- 50 major sales/year: $20M/year

Total Annual: $26M

Spot Instances

Use Cases:
- Queue processing workers
- Analytics processing
- Background jobs
- Non-critical services

Savings:
- 70% cheaper than on-demand
- $6M annual savings
- Acceptable interruption rate (5%)

Geographic Distribution

Multi-Region Architecture

Regions:
- US-East (Primary): 40% traffic
- US-West: 30% traffic
- Europe: 20% traffic
- Asia-Pacific: 10% traffic

Data Replication:
- Event data: Replicated to all regions
- User data: Replicated to all regions
- Orders: Replicated to backup region
- Inventory: Single region (strong consistency)

Latency:
- Same region: <50ms
- Cross-region: <150ms
- Global: <300ms

Disaster Recovery

Failover Strategy

Scenario: Primary region failure

Immediate Actions:
1. Detect failure (health checks)
2. Route traffic to backup region
3. Promote read replicas to primary
4. Scale backup region capacity
5. Notify operations team

Recovery Time:
- Detection: 1 minute
- Failover: 2 minutes
- Total RTO: 3 minutes
- RPO: 1 minute (replication lag)

Testing:
- Monthly failover drills
- Chaos engineering
- Load testing backup region

This comprehensive scaling strategy enables Ticketmaster to handle 100x traffic spikes during major sales while maintaining performance, fairness, and cost efficiency.