Google Search - Scaling Considerations
Index Scaling
Horizontal Index Sharding
Challenge: 100 PB index across 10,000 shards
Strategy:
- Hash-based sharding by term
- Each shard: 10 GB compressed
- 3x replication for reliability
- Independent shard updates
Shard Distribution:
shard_id = hash(term) % 10000
Benefits:
- Parallel query processing
- Independent scaling
- Fault isolation
- Even load distributionIndex Compression
Techniques:
1. Delta Encoding:
- Store differences between doc_ids
- [12345, 12347, 12350] → [12345, +2, +3]
- 50% size reduction
2. Variable-Length Encoding:
- Small numbers use fewer bytes
- Varint encoding
- 30% additional reduction
3. Dictionary Compression:
- Common terms in dictionary
- Reference by ID
- 20% additional reduction
Total Compression: 10x (100 PB → 10 PB)Tiered Index Storage
Hot Tier (SSD):
- Recent pages (<30 days)
- Popular pages (high traffic)
- 20% of index, 80% of queries
- <10ms access time
Warm Tier (HDD):
- Medium-age pages (30-365 days)
- Medium traffic
- 50% of index, 15% of queries
- <50ms access time
Cold Tier (Archive):
- Old pages (>1 year)
- Low traffic
- 30% of index, 5% of queries
- <500ms access time
Benefits:
- Cost optimization (10x cheaper)
- Performance for common queries
- Complete coverageQuery Processing Scaling
Parallel Query Execution
Query Fan-out:
1. Query arrives at frontend
2. Fan out to all 10,000 index shards
3. Each shard searches independently
4. Aggregate results from all shards
5. Rank and return top results
Timeout Strategy:
- Per-shard timeout: 200ms
- If shard times out: Skip and continue
- Graceful degradation: Return partial results
- 99.9% shard availability = 99.9% query success
Optimization:
- Early termination (top-k)
- Skip low-quality shards
- Cache popular queries
- Pre-compute common resultsQuery Result Caching
Multi-Level Cache:
L1 (Edge CDN):
- Popular queries (top 1%)
- TTL: 5 minutes
- Hit rate: 30%
- Latency: <10ms
L2 (Regional):
- Common queries (top 10%)
- TTL: 1 hour
- Hit rate: 20%
- Latency: <50ms
L3 (Data Center):
- All queries
- TTL: 24 hours
- Hit rate: 10%
- Latency: <100ms
Total Cache Hit Rate: 60%
Cache Miss: Full index search (200-500ms)Crawler Scaling
Distributed Crawling
Architecture:
- 10,000 crawler instances
- Each handles specific domains
- Coordinated via central scheduler
- 40K pages/second total
Politeness:
- 1-10 second delay per domain
- Respect robots.txt
- Adaptive rate limiting
- Avoid overloading sites
Prioritization:
priority = (pagerank × 0.4) +
(freshness_need × 0.3) +
(link_count × 0.2) +
(user_interest × 0.1)
High Priority:
- News sites (every 5 minutes)
- Popular sites (hourly)
- Changed pages (detected via checksums)
Low Priority:
- Rarely updated sites (weekly)
- Low-quality sites (monthly)
- Deep pages (quarterly)Incremental Crawling
Strategy:
1. Detect page changes:
- HTTP Last-Modified header
- ETag comparison
- Content checksum
2. Crawl only changed pages:
- 10% of pages change daily
- Crawl 10B pages instead of 100B
- 10x efficiency improvement
3. Update index incrementally:
- No full rebuild
- Update changed documents
- Merge with main index
Benefits:
- Reduced bandwidth (90% savings)
- Faster index updates
- Lower infrastructure costRanking Scaling
Pre-computation
Offline Computation:
- PageRank: Computed monthly
- Document quality scores: Weekly
- Link analysis: Weekly
- Topic classification: At index time
Online Computation:
- Query-document relevance: Real-time
- Personalization: Real-time
- Freshness boost: Real-time
- ML re-ranking: Real-time (<50ms)
Benefits:
- Faster query processing
- Complex algorithms offline
- Simple scoring onlineML Model Optimization
Model Compression:
- Quantization: 32-bit → 8-bit (4x smaller)
- Pruning: Remove 90% of weights
- Distillation: Large model → small model
- Result: 100x faster inference
Serving:
- Model sharding across servers
- Batch inference (100 queries at once)
- GPU acceleration
- <50ms inference time
A/B Testing:
- 1% traffic to new model
- Monitor quality metrics
- Gradual rollout if successful
- Rollback if quality degradesGeographic Distribution
Multi-Region Architecture
Regions:
- North America: 8 data centers
- Europe: 6 data centers
- Asia-Pacific: 4 data centers
- Other: 2 data centers
Data Replication:
- Index: Replicated to all regions
- User data: Replicated globally (Spanner)
- Query logs: Regional with async replication
Query Routing:
- GeoDNS routes to nearest region
- Latency-based routing
- Failover to backup region
Benefits:
- Low latency globally (<100ms)
- High availability (99.99%)
- Disaster recoveryEdge Computing
Edge Locations: 200+ PoPs
Cached at Edge:
- Autocomplete suggestions
- Popular query results
- Static assets (JS, CSS, images)
- Knowledge graph cards
Benefits:
- <50ms latency for cached queries
- Reduced origin load (60% offload)
- Better user experienceAuto-Scaling
Dynamic Scaling
Metrics:
- Queries per second
- CPU utilization
- Query latency (p95, p99)
- Cache hit rate
Scale-Up Triggers:
- QPS > 80% capacity
- CPU > 70% for 5 minutes
- Latency p95 > 500ms
- Cache hit rate < 50%
Scale-Down Triggers:
- QPS < 40% capacity
- CPU < 30% for 30 minutes
- Latency p95 < 200ms
Scaling:
- Add/remove 10% capacity at a time
- 5-minute cooldown between changes
- Pre-scale for known traffic patternsTraffic Patterns
Daily Pattern:
- 6 AM - 9 AM: 150% of average (morning)
- 12 PM - 2 PM: 120% of average (lunch)
- 8 PM - 11 PM: 180% of average (evening peak)
- 2 AM - 5 AM: 50% of average (night)
Weekly Pattern:
- Monday-Friday: 120% of average
- Saturday-Sunday: 80% of average
Seasonal:
- Holiday season: 200% of average
- Back to school: 150% of average
- Summer: 90% of average
Pre-Scaling:
- Scale up 1 hour before peak
- Scale down 2 hours after peak
- Maintain minimum capacityCost Optimization
Resource Efficiency
Compute:
- Spot instances for batch jobs (70% savings)
- Reserved instances for baseline (60% savings)
- Auto-scaling for peak traffic
Storage:
- Tiered storage (hot/warm/cold)
- Compression (10x reduction)
- Deduplication (20% savings)
Network:
- CDN for static content (80% offload)
- Compression (gzip, brotli)
- Regional caching
Total Savings: 50% vs on-demand pricingQuery Optimization
Techniques:
1. Early Termination:
- Stop after finding top-k results
- Don't process all shards
- 50% CPU savings
2. Approximate Algorithms:
- Approximate top-k
- Sampling for large result sets
- 30% CPU savings
3. Caching:
- 60% cache hit rate
- Avoid index search
- 60% cost savings
4. Query Rewriting:
- Simplify complex queries
- Remove redundant terms
- 20% CPU savings
Total: 80% cost reduction vs naive approachMonitoring and Alerting
Key Metrics
Availability:
- Uptime: 99.99% target
- Error rate: <0.01%
- Timeout rate: <0.1%
Performance:
- Query latency: p50 <200ms, p95 <500ms, p99 <1s
- Index freshness: 80% <24 hours
- Crawl rate: 40K pages/second
Quality:
- Click-through rate: 60%+
- Zero-result rate: <5%
- Spam in results: <0.1%
Alerts:
- Critical: Uptime <99.9%, latency p95 >1s
- Warning: Uptime <99.95%, latency p95 >500ms
- Info: Unusual traffic patternsThis comprehensive scaling strategy enables Google Search to handle billions of queries daily while maintaining sub-second latency and high availability through distributed systems, intelligent caching, and continuous optimization.