Google Search - Interview Tips
Interview Approach
Time Management (45-60 minutes)
Phase 1: Requirements (5-10 min)
- Clarify scope (web search, not images/videos initially)
- Understand scale (100B pages, 8B queries/day)
- Identify key features (crawl, index, search, rank)
Phase 2: High-Level Design (10-15 min)
- Draw architecture: Crawler → Parser → Indexer → Query Processor
- Explain data flow
- Discuss major components
Phase 3: Deep Dive (15-20 min)
- Focus on 2-3 components:
* Inverted index structure
* Ranking algorithm
* Distributed query processing
Phase 4: Tradeoffs (5-10 min)
- Full vs incremental crawl
- Simple vs ML ranking
- Exact vs approximate results
Phase 5: Wrap-up (5 min)
- Monitoring and quality
- Security and privacy
- Future improvementsKey Topics to Emphasize
1. Inverted Index
What to Say:
✓ "I'll use an inverted index: Term → [Document IDs]"
✓ "Each posting list contains doc_id, term frequency, positions"
✓ "Shard by term hash across 10,000 servers"
✓ "Compress with delta encoding for 10x reduction"
✗ Avoid: "Store all documents in one database" (doesn't scale)
Sample Answer:
"For the index, I'll build an inverted index:
Structure:
- Key: Term (e.g., 'machine')
- Value: Posting list of documents containing the term
- Each entry: {doc_id, term_frequency, positions, field_flags}
Example:
'machine' → [
{doc: 123, tf: 5, pos: [10,45,67], field: TITLE|BODY},
{doc: 456, tf: 2, pos: [5,234], field: BODY},
...
]
Sharding:
- Hash term to shard: shard_id = hash('machine') % 10000
- 10,000 shards, 10 GB each
- Query all shards in parallel
Compression:
- Delta encoding for doc_ids
- Variable-length encoding
- 10x compression (100 PB → 10 PB)
This enables fast term lookup (O(1)) and parallel query processing."2. Crawling Strategy
What to Say:
✓ "Prioritized breadth-first crawl with PageRank"
✓ "Respect robots.txt and politeness delays"
✓ "Incremental crawling for efficiency"
✗ Avoid: "Crawl everything every day" (too expensive)
Sample Answer:
"For crawling, I'll use a prioritized approach:
URL Frontier:
- Priority queue of URLs to crawl
- Priority = (PageRank × 0.4) + (freshness × 0.3) +
(link_count × 0.2) + (user_interest × 0.1)
Crawl Strategy:
1. Start with seed URLs (popular sites)
2. Extract links from fetched pages
3. Add to frontier with priority
4. Crawl high-priority URLs first
5. Breadth-first within priority level
Politeness:
- 1-10 second delay per domain
- Respect robots.txt
- Adaptive rate limiting
- 10,000 distributed crawlers
Incremental:
- Detect changes with checksums
- Only crawl changed pages (10% daily)
- 90% bandwidth savings
This balances coverage, freshness, and cost."3. Ranking Algorithm
What to Say:
✓ "Multi-stage ranking: BM25 → ML re-ranking → personalization"
✓ "Consider relevance, authority (PageRank), freshness, engagement"
✓ "ML model with 200+ features trained on click data"
✗ Avoid: "Just sort by keyword match" (poor quality)
Sample Answer:
"For ranking, I'll use a multi-stage pipeline:
Stage 1 - Initial Scoring (BM25):
- Retrieve 10,000 matching documents
- Score by term frequency and IDF
- Fast (<50ms)
Stage 2 - ML Re-Ranking:
- Features: 200+ signals
* Content: TF-IDF, term proximity, field importance
* Authority: PageRank, domain authority
* Engagement: CTR, dwell time, bounce rate
* Freshness: Publish date, crawl recency
- Model: LambdaMART (gradient boosting)
- Re-rank top 100 documents
- <50ms inference
Stage 3 - Personalization:
- Location (city-level)
- Language preference
- Device type
- Top 20 documents
Stage 4 - Diversity:
- Ensure variety of sources
- Limit 2 results per domain
- Final top 10
Total: <200ms end-to-end"4. Distributed Query Processing
What to Say:
✓ "Fan out query to all 10,000 index shards in parallel"
✓ "Each shard returns top-k results"
✓ "Aggregate and merge results"
✓ "Early termination for performance"
✗ Avoid: "Query shards sequentially" (too slow)
Sample Answer:
"For query processing, I'll use parallel execution:
Query Flow:
1. Parse query at frontend
2. Fan out to all 10,000 index shards
3. Each shard:
- Fetch posting lists for query terms
- Intersect lists (AND) or union (OR)
- Score top 1000 documents
- Return to aggregator
4. Aggregator:
- Merge results from all shards
- Sort by score
- Return top 100 to ranking service
Timeout Handling:
- Per-shard timeout: 200ms
- If shard times out: Skip and continue
- Graceful degradation
- 99.9% shard availability = good results
Optimization:
- Early termination (stop after top-k)
- Skip low-quality shards
- Cache popular queries (30% hit rate)
This achieves <500ms query latency at scale."Common Pitfalls to Avoid
1. Ignoring Scale
❌ Bad: "Store all pages in MySQL"
✓ Good: "Distributed storage across 10,000 shards"
❌ Bad: "Crawl all pages daily"
✓ Good: "Incremental crawling, prioritize important pages"
Key Point: 100 billion pages requires distributed systems2. Oversimplifying Ranking
❌ Bad: "Rank by keyword count"
✓ Good: "Multi-factor ranking with ML"
❌ Bad: "Just use TF-IDF"
✓ Good: "BM25 + PageRank + ML + personalization"
Key Point: Ranking quality is competitive advantage3. Neglecting Freshness
❌ Bad: "Crawl everything monthly"
✓ Good: "Adaptive crawl frequency, real-time for news"
❌ Bad: "Rebuild index weekly"
✓ Good: "Incremental updates, visible in hours"
Key Point: Users expect fresh results4. Poor Query Performance
❌ Bad: "Query shards sequentially"
✓ Good: "Parallel query with 200ms timeout"
❌ Bad: "No caching"
✓ Good: "Multi-level cache, 60% hit rate"
Key Point: Sub-second latency is criticalStrong Talking Points
Demonstrate Trade-off Thinking
"For crawling, I considered three approaches:
1. Full Crawl Daily:
Pros: Complete coverage, simple
Cons: Expensive (100B pages), slow
2. Incremental Only:
Pros: Fast, efficient
Cons: May miss new pages
3. Hybrid (My Choice):
Pros: Complete coverage + fast updates
Cons: More complex
Implementation:
- Full crawl: Monthly
- Incremental: Daily (10% of pages)
- Real-time: News (5-minute cycle)
I chose hybrid because it balances coverage, freshness, and cost."Show Scalability Awareness
"At Google's scale (100B pages, 8B queries/day):
1. Index: 100 PB compressed, 10,000 shards
2. Crawlers: 10,000 instances, 40K pages/second
3. Query Servers: 100,000 instances, 92K queries/second
4. Cost: $1B/year infrastructure
Every design decision must consider:
- Horizontal scalability
- Fault tolerance
- Cost efficiency
- Performance at scale"Mention Real-World Considerations
"Beyond technical design:
1. Quality: Human raters, A/B testing, metrics
2. Spam: Detection and penalties
3. Privacy: GDPR compliance, user controls
4. Business: Ad integration, revenue
5. Competition: Differentiation, innovation
These aren't just technical problems, they're business problems."Follow-up Question Strategies
When Asked "How do you prevent spam?"
Answer:
"Spam detection uses multiple signals:
1. Content Analysis:
- Keyword stuffing detection
- Hidden text detection
- Thin content detection
2. Link Analysis:
- Unnatural link patterns
- Link farms
- Paid links
3. Behavioral Signals:
- High bounce rate
- Low dwell time
- User reports
4. ML Model:
- 100+ spam signals
- Gradient boosting
- Human-labeled training data
Actions:
- Low spam score: Index normally
- Medium: Rank lower
- High: Don't index
- Extreme: Ban domain
This blocks >99% of spam while minimizing false positives."When Asked "How do you handle typos?"
Answer:
"Multi-layer spell correction:
1. Dictionary Check:
- Compare against known words
- Edit distance (Levenshtein)
- Suggest corrections
2. Context-Based:
- N-gram language model
- Consider surrounding words
- Domain-specific terms
3. Query Logs:
- Learn from user corrections
- Popular misspellings
- Trending terms
4. Implementation:
if (no_results || low_quality):
suggest = spell_check(query)
show 'Did you mean: {suggest}?'
if (obvious_typo):
auto_correct(query)
show 'Showing results for: {corrected}'
This handles 95%+ of typos automatically."Red Flags to Avoid
Don't Say:
❌ "Just use Elasticsearch"
✓ "Build custom inverted index optimized for scale"
❌ "Rank by keyword density"
✓ "Multi-factor ranking with ML"
❌ "Store everything in one database"
✓ "Distributed storage with sharding"
❌ "Crawl everything every day"
✓ "Incremental crawling with prioritization"
❌ "No need for caching"
✓ "Multi-level caching for performance"Closing Strong
Summarize Your Design
"To summarize my Google Search design:
1. Crawling: Prioritized breadth-first, 40K pages/second
2. Indexing: Inverted index, 10,000 shards, 100 PB compressed
3. Query: Parallel processing, <500ms latency
4. Ranking: Multi-stage (BM25 + ML + personalization)
5. Scale: 100B pages indexed, 8B queries/day
Key strengths:
- Handles massive scale (100B pages)
- Fast queries (<500ms)
- High-quality results (ML ranking)
- Fresh content (incremental updates)
Areas for improvement:
- Better spam detection
- More personalization
- Improved voice search
- Knowledge graph expansion
I'm happy to dive deeper into any component."This interview guide provides the structure and talking points needed to excel in a Google Search system design interview, demonstrating understanding of massive scale, distributed systems, and search quality.