Ticketmaster - Scaling Considerations
Pre-Sale Preparation
Infrastructure Pre-Scaling
T-24 hours: Capacity Planning
- Analyze historical data for similar events
- Estimate peak concurrent users
- Calculate required infrastructure
T-12 hours: Pre-Scaling
- Scale application servers: 500 → 10,000 instances
- Scale database read replicas: 100 → 500 replicas
- Scale Redis cache: 100 → 1,000 nodes
- Pre-warm CDN cache with event assets
T-1 hour: Final Checks
- Load test queue system
- Verify database connections
- Test payment gateway capacity
- Enable enhanced monitoringCDN Pre-Warming
Assets to Cache:
- Event page HTML
- Seat map images
- Venue images
- JavaScript/CSS bundles
- API responses (event details)
Process:
1. Generate cache keys for all assets
2. Push to CDN edge locations (200+ PoPs)
3. Verify cache hit rates (target: 95%)
4. Monitor cache performanceQueue System Scaling
Distributed Queue Architecture
Queue Sharding:
- Shard by event_id
- 1,000 Redis nodes
- 10,000 users per node
- 10M total capacity
Queue Processing:
- Process 10K users/second
- Batch dequeue (100 users at a time)
- Parallel processing across nodes
- Auto-scaling based on queue lengthQueue Performance Optimization
Optimization Techniques:
1. Batch Operations:
- Dequeue 100 users at once
- Update positions in batch
- Reduce Redis round trips
2. Connection Pooling:
- Reuse Redis connections
- 1000 connections per app server
- Connection multiplexing
3. Async Processing:
- Non-blocking queue operations
- Event-driven architecture
- WebSocket for real-time updates
4. Caching:
- Cache queue statistics
- Cache wait time estimates
- 30-second TTLDatabase Scaling
Read Scaling
Read Replica Strategy:
- 1 primary + 10 read replicas per shard
- Route reads to replicas (90% of traffic)
- Route writes to primary (10% of traffic)
- Async replication (<1 second lag)
Query Optimization:
- Index all foreign keys
- Composite indexes for common queries
- Covering indexes to avoid table lookups
- Query result caching (30-second TTL)Write Scaling
Write Optimization:
1. Batch Writes:
- Group multiple seat updates
- Single transaction for batch
- Reduce transaction overhead
2. Async Writes:
- Analytics data (async)
- Audit logs (async)
- Email notifications (async)
3. Write Sharding:
- Shard by event_id
- Isolate hot events
- Prevent cross-shard contention
4. Connection Pooling:
- PgBouncer for connection management
- 10,000 connections per shard
- Connection reuseOptimistic Locking at Scale
Conflict Resolution:
1. User attempts to reserve seat
2. Read seat with version number
3. Update with version check
4. If conflict (version mismatch):
- Retry with exponential backoff
- Max 3 retries
- If all fail, suggest alternative seats
Performance:
- 50K concurrent reservation attempts
- 95% success rate on first try
- 4% success on retry
- 1% failure (seat taken)
Optimization:
- Pre-filter available seats
- Reduce contention with smart seat selection
- Suggest seats with lower contentionInventory Management Scaling
Seat Availability Caching
Cache Strategy:
L1 (Application Memory):
- Hot events: Full seat map
- TTL: 10 seconds
- Size: 100 MB per instance
L2 (Redis):
- All events: Availability summary
- TTL: 30 seconds
- Size: 10 GB total
L3 (Database):
- Source of truth
- Real-time updates
- Optimistic locking
Cache Invalidation:
- On seat purchase: Invalidate section
- On reservation: Update availability count
- Broadcast to all app serversReservation Expiry at Scale
Challenge: Release 100K expired reservations/minute
Solution:
1. Partition by expiry time (1-minute buckets)
2. Process each bucket in parallel
3. Batch release (1000 seats at a time)
4. Update availability cache
5. Notify users in queue
Implementation:
- Background job every 30 seconds
- 100 worker threads
- Process 1000 seats per thread
- 100K seats released per runPayment Processing Scaling
Async Payment Architecture
Payment Flow:
1. User completes checkout
2. Reserve seats (synchronous)
3. Queue payment for processing (async)
4. Return order confirmation immediately
5. Process payment in background
6. Update order status
7. Send confirmation email
Benefits:
- Faster checkout (no payment wait)
- Better fault tolerance
- Load leveling
- Retry capabilityPayment Gateway Scaling
Multiple Gateways:
- Primary: Stripe (70% traffic)
- Secondary: Braintree (20% traffic)
- Tertiary: Adyen (10% traffic)
Failover Strategy:
1. Try primary gateway
2. If error rate > 5%, circuit breaker opens
3. Route to secondary gateway
4. Monitor and close circuit after 5 minutes
Capacity:
- 10K transactions/second per gateway
- 30K total capacity
- Auto-scaling based on queue depthBot Detection Scaling
Multi-Layer Bot Defense
Layer 1 - CDN/WAF (Cloudflare):
- Block known bot IPs
- Rate limit by IP (100 req/min)
- Challenge suspicious traffic
- DDoS protection
Layer 2 - Queue System:
- Device fingerprinting
- One session per user
- Behavioral analysis
- CAPTCHA challenges
Layer 3 - Application:
- ML-based fraud detection
- Purchase velocity checks
- Account reputation scoring
- Pattern analysis
Layer 4 - Post-Purchase:
- Transaction analysis
- Resale monitoring
- Account linking detectionCAPTCHA at Scale
Challenge: Serve 10M CAPTCHAs during major sale
Solution:
- reCAPTCHA v3 (invisible, score-based)
- Challenge only high-risk users (20%)
- 2M CAPTCHA challenges
- 99.9% success rate
Optimization:
- Cache CAPTCHA tokens (5-minute TTL)
- Reuse tokens for multiple requests
- Async verification
- Fallback to v2 if v3 failsMonitoring and Auto-Scaling
Key Metrics
System Metrics:
- CPU utilization (target: 70%)
- Memory usage (target: 80%)
- Network throughput
- Disk I/O
Application Metrics:
- Request rate (req/s)
- Response time (p50, p95, p99)
- Error rate (%)
- Queue length
Business Metrics:
- Tickets sold/minute
- Checkout success rate
- Bot detection rate
- Revenue/minuteAuto-Scaling Rules
Scale Up Triggers:
- CPU > 70% for 2 minutes
- Queue length > 100K
- Response time p95 > 2 seconds
- Error rate > 1%
Scale Down Triggers:
- CPU < 30% for 10 minutes
- Queue length < 10K
- Response time p95 < 500ms
- Error rate < 0.1%
Scaling Limits:
- Min instances: 500
- Max instances: 50,000
- Scale increment: 10% of current
- Cooldown: 5 minutesCost Optimization
Reserved vs On-Demand
Baseline Capacity (Reserved):
- 500 application servers
- 100 database servers
- 100 cache nodes
- Cost: $500K/month
- Savings: 60% vs on-demand
Burst Capacity (On-Demand):
- 49,500 application servers
- 400 database servers
- 900 cache nodes
- Duration: 2-3 hours per major sale
- Cost: $400K per sale
- 50 major sales/year: $20M/year
Total Annual: $26MSpot Instances
Use Cases:
- Queue processing workers
- Analytics processing
- Background jobs
- Non-critical services
Savings:
- 70% cheaper than on-demand
- $6M annual savings
- Acceptable interruption rate (5%)Geographic Distribution
Multi-Region Architecture
Regions:
- US-East (Primary): 40% traffic
- US-West: 30% traffic
- Europe: 20% traffic
- Asia-Pacific: 10% traffic
Data Replication:
- Event data: Replicated to all regions
- User data: Replicated to all regions
- Orders: Replicated to backup region
- Inventory: Single region (strong consistency)
Latency:
- Same region: <50ms
- Cross-region: <150ms
- Global: <300msDisaster Recovery
Failover Strategy
Scenario: Primary region failure
Immediate Actions:
1. Detect failure (health checks)
2. Route traffic to backup region
3. Promote read replicas to primary
4. Scale backup region capacity
5. Notify operations team
Recovery Time:
- Detection: 1 minute
- Failover: 2 minutes
- Total RTO: 3 minutes
- RPO: 1 minute (replication lag)
Testing:
- Monthly failover drills
- Chaos engineering
- Load testing backup regionThis comprehensive scaling strategy enables Ticketmaster to handle 100x traffic spikes during major sales while maintaining performance, fairness, and cost efficiency.