Scaling Considerations

📖 7 min read 📄 Part 6 of 10

Google Search - Scaling Considerations

Index Scaling

Horizontal Index Sharding

Challenge: 100 PB index across 10,000 shards

Strategy:
- Hash-based sharding by term
- Each shard: 10 GB compressed
- 3x replication for reliability
- Independent shard updates

Shard Distribution:
shard_id = hash(term) % 10000

Benefits:
- Parallel query processing
- Independent scaling
- Fault isolation
- Even load distribution

Index Compression

Techniques:
1. Delta Encoding:
   - Store differences between doc_ids
   - [12345, 12347, 12350] → [12345, +2, +3]
   - 50% size reduction

2. Variable-Length Encoding:
   - Small numbers use fewer bytes
   - Varint encoding
   - 30% additional reduction

3. Dictionary Compression:
   - Common terms in dictionary
   - Reference by ID
   - 20% additional reduction

Total Compression: 10x (100 PB → 10 PB)

Tiered Index Storage

Hot Tier (SSD):
- Recent pages (<30 days)
- Popular pages (high traffic)
- 20% of index, 80% of queries
- <10ms access time

Warm Tier (HDD):
- Medium-age pages (30-365 days)
- Medium traffic
- 50% of index, 15% of queries
- <50ms access time

Cold Tier (Archive):
- Old pages (>1 year)
- Low traffic
- 30% of index, 5% of queries
- <500ms access time

Benefits:
- Cost optimization (10x cheaper)
- Performance for common queries
- Complete coverage

Query Processing Scaling

Parallel Query Execution

Query Fan-out:
1. Query arrives at frontend
2. Fan out to all 10,000 index shards
3. Each shard searches independently
4. Aggregate results from all shards
5. Rank and return top results

Timeout Strategy:
- Per-shard timeout: 200ms
- If shard times out: Skip and continue
- Graceful degradation: Return partial results
- 99.9% shard availability = 99.9% query success

Optimization:
- Early termination (top-k)
- Skip low-quality shards
- Cache popular queries
- Pre-compute common results

Query Result Caching

Multi-Level Cache:

L1 (Edge CDN):
- Popular queries (top 1%)
- TTL: 5 minutes
- Hit rate: 30%
- Latency: <10ms

L2 (Regional):
- Common queries (top 10%)
- TTL: 1 hour
- Hit rate: 20%
- Latency: <50ms

L3 (Data Center):
- All queries
- TTL: 24 hours
- Hit rate: 10%
- Latency: <100ms

Total Cache Hit Rate: 60%
Cache Miss: Full index search (200-500ms)

Crawler Scaling

Distributed Crawling

Architecture:
- 10,000 crawler instances
- Each handles specific domains
- Coordinated via central scheduler
- 40K pages/second total

Politeness:
- 1-10 second delay per domain
- Respect robots.txt
- Adaptive rate limiting
- Avoid overloading sites

Prioritization:
priority = (pagerank × 0.4) + 
           (freshness_need × 0.3) + 
           (link_count × 0.2) + 
           (user_interest × 0.1)

High Priority:
- News sites (every 5 minutes)
- Popular sites (hourly)
- Changed pages (detected via checksums)

Low Priority:
- Rarely updated sites (weekly)
- Low-quality sites (monthly)
- Deep pages (quarterly)

Incremental Crawling

Strategy:
1. Detect page changes:
   - HTTP Last-Modified header
   - ETag comparison
   - Content checksum

2. Crawl only changed pages:
   - 10% of pages change daily
   - Crawl 10B pages instead of 100B
   - 10x efficiency improvement

3. Update index incrementally:
   - No full rebuild
   - Update changed documents
   - Merge with main index

Benefits:
- Reduced bandwidth (90% savings)
- Faster index updates
- Lower infrastructure cost

Ranking Scaling

Pre-computation

Offline Computation:
- PageRank: Computed monthly
- Document quality scores: Weekly
- Link analysis: Weekly
- Topic classification: At index time

Online Computation:
- Query-document relevance: Real-time
- Personalization: Real-time
- Freshness boost: Real-time
- ML re-ranking: Real-time (<50ms)

Benefits:
- Faster query processing
- Complex algorithms offline
- Simple scoring online

ML Model Optimization

Model Compression:
- Quantization: 32-bit → 8-bit (4x smaller)
- Pruning: Remove 90% of weights
- Distillation: Large model → small model
- Result: 100x faster inference

Serving:
- Model sharding across servers
- Batch inference (100 queries at once)
- GPU acceleration
- <50ms inference time

A/B Testing:
- 1% traffic to new model
- Monitor quality metrics
- Gradual rollout if successful
- Rollback if quality degrades

Geographic Distribution

Multi-Region Architecture

Regions:
- North America: 8 data centers
- Europe: 6 data centers
- Asia-Pacific: 4 data centers
- Other: 2 data centers

Data Replication:
- Index: Replicated to all regions
- User data: Replicated globally (Spanner)
- Query logs: Regional with async replication

Query Routing:
- GeoDNS routes to nearest region
- Latency-based routing
- Failover to backup region

Benefits:
- Low latency globally (<100ms)
- High availability (99.99%)
- Disaster recovery

Edge Computing

Edge Locations: 200+ PoPs

Cached at Edge:
- Autocomplete suggestions
- Popular query results
- Static assets (JS, CSS, images)
- Knowledge graph cards

Benefits:
- <50ms latency for cached queries
- Reduced origin load (60% offload)
- Better user experience

Auto-Scaling

Dynamic Scaling

Metrics:
- Queries per second
- CPU utilization
- Query latency (p95, p99)
- Cache hit rate

Scale-Up Triggers:
- QPS > 80% capacity
- CPU > 70% for 5 minutes
- Latency p95 > 500ms
- Cache hit rate < 50%

Scale-Down Triggers:
- QPS < 40% capacity
- CPU < 30% for 30 minutes
- Latency p95 < 200ms

Scaling:
- Add/remove 10% capacity at a time
- 5-minute cooldown between changes
- Pre-scale for known traffic patterns

Traffic Patterns

Daily Pattern:
- 6 AM - 9 AM: 150% of average (morning)
- 12 PM - 2 PM: 120% of average (lunch)
- 8 PM - 11 PM: 180% of average (evening peak)
- 2 AM - 5 AM: 50% of average (night)

Weekly Pattern:
- Monday-Friday: 120% of average
- Saturday-Sunday: 80% of average

Seasonal:
- Holiday season: 200% of average
- Back to school: 150% of average
- Summer: 90% of average

Pre-Scaling:
- Scale up 1 hour before peak
- Scale down 2 hours after peak
- Maintain minimum capacity

Cost Optimization

Resource Efficiency

Compute:
- Spot instances for batch jobs (70% savings)
- Reserved instances for baseline (60% savings)
- Auto-scaling for peak traffic

Storage:
- Tiered storage (hot/warm/cold)
- Compression (10x reduction)
- Deduplication (20% savings)

Network:
- CDN for static content (80% offload)
- Compression (gzip, brotli)
- Regional caching

Total Savings: 50% vs on-demand pricing

Query Optimization

Techniques:
1. Early Termination:
   - Stop after finding top-k results
   - Don't process all shards
   - 50% CPU savings

2. Approximate Algorithms:
   - Approximate top-k
   - Sampling for large result sets
   - 30% CPU savings

3. Caching:
   - 60% cache hit rate
   - Avoid index search
   - 60% cost savings

4. Query Rewriting:
   - Simplify complex queries
   - Remove redundant terms
   - 20% CPU savings

Total: 80% cost reduction vs naive approach

Monitoring and Alerting

Key Metrics

Availability:
- Uptime: 99.99% target
- Error rate: <0.01%
- Timeout rate: <0.1%

Performance:
- Query latency: p50 <200ms, p95 <500ms, p99 <1s
- Index freshness: 80% <24 hours
- Crawl rate: 40K pages/second

Quality:
- Click-through rate: 60%+
- Zero-result rate: <5%
- Spam in results: <0.1%

Alerts:
- Critical: Uptime <99.9%, latency p95 >1s
- Warning: Uptime <99.95%, latency p95 >500ms
- Info: Unusual traffic patterns

This comprehensive scaling strategy enables Google Search to handle billions of queries daily while maintaining sub-second latency and high availability through distributed systems, intelligent caching, and continuous optimization.