Google Search - Interview Tips

Interview Approach

Time Management (45-60 minutes)

Phase 1: Requirements (5-10 min)
- Clarify scope (web search, not images/videos initially)
- Understand scale (100B pages, 8B queries/day)
- Identify key features (crawl, index, search, rank)

Phase 2: High-Level Design (10-15 min)
- Draw architecture: Crawler → Parser → Indexer → Query Processor
- Explain data flow
- Discuss major components

Phase 3: Deep Dive (15-20 min)
- Focus on 2-3 components:
  * Inverted index structure
  * Ranking algorithm
  * Distributed query processing

Phase 4: Tradeoffs (5-10 min)
- Full vs incremental crawl
- Simple vs ML ranking
- Exact vs approximate results

Phase 5: Wrap-up (5 min)
- Monitoring and quality
- Security and privacy
- Future improvements

Key Topics to Emphasize

1. Inverted Index

What to Say:

✓ "I'll use an inverted index: Term → [Document IDs]"
✓ "Each posting list contains doc_id, term frequency, positions"
✓ "Shard by term hash across 10,000 servers"
✓ "Compress with delta encoding for 10x reduction"
✗ Avoid: "Store all documents in one database" (doesn't scale)

Sample Answer:
"For the index, I'll build an inverted index:

Structure:
- Key: Term (e.g., 'machine')
- Value: Posting list of documents containing the term
- Each entry: {doc_id, term_frequency, positions, field_flags}

Example:
'machine' → [
  {doc: 123, tf: 5, pos: [10,45,67], field: TITLE|BODY},
  {doc: 456, tf: 2, pos: [5,234], field: BODY},
  ...
]

Sharding:
- Hash term to shard: shard_id = hash('machine') % 10000
- 10,000 shards, 10 GB each
- Query all shards in parallel

Compression:
- Delta encoding for doc_ids
- Variable-length encoding
- 10x compression (100 PB → 10 PB)

This enables fast term lookup (O(1)) and parallel query processing."

2. Crawling Strategy

What to Say:

✓ "Prioritized breadth-first crawl with PageRank"
✓ "Respect robots.txt and politeness delays"
✓ "Incremental crawling for efficiency"
✗ Avoid: "Crawl everything every day" (too expensive)

Sample Answer:
"For crawling, I'll use a prioritized approach:

URL Frontier:
- Priority queue of URLs to crawl
- Priority = (PageRank × 0.4) + (freshness × 0.3) + 
            (link_count × 0.2) + (user_interest × 0.1)

Crawl Strategy:
1. Start with seed URLs (popular sites)
2. Extract links from fetched pages
3. Add to frontier with priority
4. Crawl high-priority URLs first
5. Breadth-first within priority level

Politeness:
- 1-10 second delay per domain
- Respect robots.txt
- Adaptive rate limiting
- 10,000 distributed crawlers

Incremental:
- Detect changes with checksums
- Only crawl changed pages (10% daily)
- 90% bandwidth savings

This balances coverage, freshness, and cost."

3. Ranking Algorithm

What to Say:

✓ "Multi-stage ranking: BM25 → ML re-ranking → personalization"
✓ "Consider relevance, authority (PageRank), freshness, engagement"
✓ "ML model with 200+ features trained on click data"
✗ Avoid: "Just sort by keyword match" (poor quality)

Sample Answer:
"For ranking, I'll use a multi-stage pipeline:

Stage 1 - Initial Scoring (BM25):
- Retrieve 10,000 matching documents
- Score by term frequency and IDF
- Fast (<50ms)

Stage 2 - ML Re-Ranking:
- Features: 200+ signals
  * Content: TF-IDF, term proximity, field importance
  * Authority: PageRank, domain authority
  * Engagement: CTR, dwell time, bounce rate
  * Freshness: Publish date, crawl recency
- Model: LambdaMART (gradient boosting)
- Re-rank top 100 documents
- <50ms inference

Stage 3 - Personalization:
- Location (city-level)
- Language preference
- Device type
- Top 20 documents

Stage 4 - Diversity:
- Ensure variety of sources
- Limit 2 results per domain
- Final top 10

Total: <200ms end-to-end"

4. Distributed Query Processing

What to Say:

✓ "Fan out query to all 10,000 index shards in parallel"
✓ "Each shard returns top-k results"
✓ "Aggregate and merge results"
✓ "Early termination for performance"
✗ Avoid: "Query shards sequentially" (too slow)

Sample Answer:
"For query processing, I'll use parallel execution:

Query Flow:
1. Parse query at frontend
2. Fan out to all 10,000 index shards
3. Each shard:
   - Fetch posting lists for query terms
   - Intersect lists (AND) or union (OR)
   - Score top 1000 documents
   - Return to aggregator
4. Aggregator:
   - Merge results from all shards
   - Sort by score
   - Return top 100 to ranking service

Timeout Handling:
- Per-shard timeout: 200ms
- If shard times out: Skip and continue
- Graceful degradation
- 99.9% shard availability = good results

Optimization:
- Early termination (stop after top-k)
- Skip low-quality shards
- Cache popular queries (30% hit rate)

This achieves <500ms query latency at scale."

Common Pitfalls to Avoid

1. Ignoring Scale

❌ Bad: "Store all pages in MySQL"
✓ Good: "Distributed storage across 10,000 shards"

❌ Bad: "Crawl all pages daily"
✓ Good: "Incremental crawling, prioritize important pages"

Key Point: 100 billion pages requires distributed systems

2. Oversimplifying Ranking

❌ Bad: "Rank by keyword count"
✓ Good: "Multi-factor ranking with ML"

❌ Bad: "Just use TF-IDF"
✓ Good: "BM25 + PageRank + ML + personalization"

Key Point: Ranking quality is competitive advantage

3. Neglecting Freshness

❌ Bad: "Crawl everything monthly"
✓ Good: "Adaptive crawl frequency, real-time for news"

❌ Bad: "Rebuild index weekly"
✓ Good: "Incremental updates, visible in hours"

Key Point: Users expect fresh results

4. Poor Query Performance

❌ Bad: "Query shards sequentially"
✓ Good: "Parallel query with 200ms timeout"

❌ Bad: "No caching"
✓ Good: "Multi-level cache, 60% hit rate"

Key Point: Sub-second latency is critical

Strong Talking Points

Demonstrate Trade-off Thinking

"For crawling, I considered three approaches:

1. Full Crawl Daily:
   Pros: Complete coverage, simple
   Cons: Expensive (100B pages), slow
   
2. Incremental Only:
   Pros: Fast, efficient
   Cons: May miss new pages
   
3. Hybrid (My Choice):
   Pros: Complete coverage + fast updates
   Cons: More complex
   
   Implementation:
   - Full crawl: Monthly
   - Incremental: Daily (10% of pages)
   - Real-time: News (5-minute cycle)
   
I chose hybrid because it balances coverage, freshness, and cost."

Show Scalability Awareness

"At Google's scale (100B pages, 8B queries/day):

1. Index: 100 PB compressed, 10,000 shards
2. Crawlers: 10,000 instances, 40K pages/second
3. Query Servers: 100,000 instances, 92K queries/second
4. Cost: $1B/year infrastructure

Every design decision must consider:
- Horizontal scalability
- Fault tolerance
- Cost efficiency
- Performance at scale"

Mention Real-World Considerations

"Beyond technical design:

1. Quality: Human raters, A/B testing, metrics
2. Spam: Detection and penalties
3. Privacy: GDPR compliance, user controls
4. Business: Ad integration, revenue
5. Competition: Differentiation, innovation

These aren't just technical problems, they're business problems."

Follow-up Question Strategies

When Asked "How do you prevent spam?"

Answer:
"Spam detection uses multiple signals:

1. Content Analysis:
   - Keyword stuffing detection
   - Hidden text detection
   - Thin content detection

2. Link Analysis:
   - Unnatural link patterns
   - Link farms
   - Paid links

3. Behavioral Signals:
   - High bounce rate
   - Low dwell time
   - User reports

4. ML Model:
   - 100+ spam signals
   - Gradient boosting
   - Human-labeled training data

Actions:
- Low spam score: Index normally
- Medium: Rank lower
- High: Don't index
- Extreme: Ban domain

This blocks >99% of spam while minimizing false positives."

When Asked "How do you handle typos?"

Answer:
"Multi-layer spell correction:

1. Dictionary Check:
   - Compare against known words
   - Edit distance (Levenshtein)
   - Suggest corrections

2. Context-Based:
   - N-gram language model
   - Consider surrounding words
   - Domain-specific terms

3. Query Logs:
   - Learn from user corrections
   - Popular misspellings
   - Trending terms

4. Implementation:
   if (no_results || low_quality):
       suggest = spell_check(query)
       show 'Did you mean: {suggest}?'
   
   if (obvious_typo):
       auto_correct(query)
       show 'Showing results for: {corrected}'

This handles 95%+ of typos automatically."

Red Flags to Avoid

Don't Say:

❌ "Just use Elasticsearch"
✓ "Build custom inverted index optimized for scale"

❌ "Rank by keyword density"
✓ "Multi-factor ranking with ML"

❌ "Store everything in one database"
✓ "Distributed storage with sharding"

❌ "Crawl everything every day"
✓ "Incremental crawling with prioritization"

❌ "No need for caching"
✓ "Multi-level caching for performance"

Closing Strong

Summarize Your Design

"To summarize my Google Search design:

1. Crawling: Prioritized breadth-first, 40K pages/second
2. Indexing: Inverted index, 10,000 shards, 100 PB compressed
3. Query: Parallel processing, <500ms latency
4. Ranking: Multi-stage (BM25 + ML + personalization)
5. Scale: 100B pages indexed, 8B queries/day

Key strengths:
- Handles massive scale (100B pages)
- Fast queries (<500ms)
- High-quality results (ML ranking)
- Fresh content (incremental updates)

Areas for improvement:
- Better spam detection
- More personalization
- Improved voice search
- Knowledge graph expansion

I'm happy to dive deeper into any component."

This interview guide provides the structure and talking points needed to excel in a Google Search system design interview, demonstrating understanding of massive scale, distributed systems, and search quality.