Interview Tips for Google Docs System Design

Overview

Designing a collaborative document editing system like Google Docs is one of the most challenging system design problems. This guide provides strategies for approaching the interview and common pitfalls to avoid.

Interview Structure (45-60 minutes)

Phase 1: Requirements Clarification (5-10 minutes)

Goal: Understand the scope and constraints

Key Questions to Ask:

Scale: How many daily active users? Concurrent editors per document?
Features: Real-time collaboration? Offline support? Rich text formatting?
Latency: What's acceptable latency for seeing others' edits? (100ms? 500ms?)
Consistency: Strong consistency or eventual consistency?
Conflict Resolution: How should we handle concurrent edits?
Platform: Web only? Mobile apps? Desktop apps?

Red Flags to Avoid:

❌ Jumping into design without clarifying requirements
❌ Assuming features without asking (e.g., "I'll assume we need comments")
❌ Not discussing scale constraints

Phase 2: High-Level Design (15-20 minutes)

Goal: Present the overall architecture

What to Cover:

Client-Server Architecture: WebSocket for real-time, REST for CRUD
Core Components: OT Server, Document Service, Storage Layer
Data Flow: User edit → OT transformation → Broadcast → Database
Key Technologies: WebSocket, Spanner/PostgreSQL, Redis, CDN

How to Present:

Start with a simple diagram:

┌─────────┐
│ Client  │ ←→ WebSocket ←→ ┌──────────────┐
│ Browser │                  │  OT Server   │
└─────────┘                  └──────────────┘
                                     ↓
                             ┌──────────────┐
                             │   Database   │
                             │   (Spanner)  │
                             └──────────────┘

Then explain the flow:
1. User types "hello" at position 10
2. Client sends operation to OT server via WebSocket
3. OT server transforms operation against concurrent operations
4. OT server broadcasts to all connected clients
5. OT server persists to database

Red Flags to Avoid:

❌ Diving into implementation details too early
❌ Not explaining the data flow
❌ Ignoring real-time requirements

Phase 3: Deep Dive (20-25 minutes)

Goal: Demonstrate technical depth

Topics Interviewers Often Probe:

1. Operational Transform (Most Important)

What to Explain:

OT is an algorithm for conflict-free collaborative editing
Transforms concurrent operations to maintain consistency
Example: Two users insert at same position

Example to Use:

Initial document: "Hello"
User A: Insert "X" at position 5 → "HelloX"
User B: Insert "Y" at position 5 → "HelloY"

Without OT: Conflict! Which one wins?

With OT:
1. User A's operation arrives first: "HelloX"
2. User B's operation is transformed:
   - Original: Insert "Y" at position 5
   - Transformed: Insert "Y" at position 6 (after "X")
3. Final result: "HelloXY" (consistent for both users)

Key Points:

O(n²) complexity for n concurrent operations
Requires central server for coordination
Alternative: CRDT (mention but don't deep dive unless asked)

2. WebSocket Scaling

What to Explain:

Each server handles 50K connections
Use sticky sessions (route user to same server)
Load balancer with consistent hashing
Horizontal scaling by adding more servers

Capacity Calculation:

5M concurrent users
÷ 50K connections per server
= 100 servers needed
× 3 regions
= 300 servers globally

3. Database Design

What to Explain:

Spanner for strong consistency (document state)
Bigtable for revision history (append-only)
Redis for OT state and presence (in-memory)

Schema to Present:

-- Documents table
CREATE TABLE documents (
  id UUID PRIMARY KEY,
  title VARCHAR(255),
  content TEXT,
  version INT,
  owner_id UUID,
  created_at TIMESTAMP,
  updated_at TIMESTAMP
);

-- Operations table (for OT)
CREATE TABLE operations (
  id UUID PRIMARY KEY,
  document_id UUID,
  user_id UUID,
  type VARCHAR(20),  -- insert, delete, format
  position INT,
  content TEXT,
  version INT,
  timestamp TIMESTAMP
);

4. Conflict Resolution

What to Explain:

OT handles conflicts automatically
Transform operations to preserve intent
Last-writer-wins for metadata (title, permissions)

Common Follow-up: "What if the OT server crashes?" Answer: Operations are persisted before acknowledgment. On restart, replay from last checkpoint.

Phase 4: Scaling and Optimization (10-15 minutes)

Goal: Show you can handle massive scale

Topics to Cover:

1. Hot Document Problem

Problem: Document with 1000+ concurrent editors Solution:

Hierarchical OT (multiple OT servers with coordinator)
Operation batching (send 10 ops together instead of individually)
Read-only mode for viewers (don't send operations)

2. Global Distribution

Problem: Users in different continents Solution:

Regional OT servers (US, EU, Asia)
Local operations have low latency (10-20ms)
Cross-region operations have higher latency (100-200ms)
Acceptable tradeoff for global collaboration

3. Caching Strategy

Layers:

L1: Client-side (IndexedDB) - 60% hit rate
L2: Redis - 35% hit rate
L3: CDN (for exports) - 90% hit rate
Overall: 95% cache hit rate

4. Cost Optimization

Mention:

Reserved instances for compute (40% savings)
Compression for storage and network (70% reduction)
Tiered storage for old revisions (cold storage)

Common Interview Questions and Answers

Q1: "How does Operational Transform work?"

Answer Structure:

Define the problem (concurrent edits)
Explain transformation with example
Mention complexity (O(n²))
Discuss alternative (CRDT)

Example Answer: "OT is an algorithm that transforms concurrent operations to maintain consistency. For example, if two users insert text at the same position, OT transforms the second operation to account for the first. The complexity is O(n²) for n concurrent operations, which is why we limit concurrent editors to 100 per document. An alternative is CRDT, which has O(n) complexity but larger operation size."

Q2: "How do you handle offline editing?"

Answer Structure:

Local storage (IndexedDB)
Queue operations while offline
Sync when online
Conflict resolution using OT

Example Answer: "We store the document in IndexedDB on the client. When offline, operations are queued locally. When the user comes back online, we fetch server operations since last sync, transform our local operations against them using OT, and send the transformed operations to the server. This ensures consistency even after offline editing."

Q3: "How do you scale WebSocket connections?"

Answer Structure:

Connection limits per server (50K)
Sticky sessions with load balancer
Horizontal scaling
Connection pooling

Example Answer: "Each server can handle about 50K WebSocket connections. We use a load balancer with consistent hashing to route users to the same server (sticky sessions). To scale, we add more servers horizontally. For 5M concurrent users, we need about 100 servers per region, or 300 globally across 3 regions."

Q4: "What database would you use and why?"

Answer Structure:

Requirements (strong consistency, global distribution)
Choice (Spanner)
Alternatives (PostgreSQL, MongoDB)
Tradeoffs

Example Answer: "I'd use Google Spanner for document state because it provides strong consistency with global distribution. This ensures all users see the same document state. For revision history, I'd use Bigtable for its append-only efficiency. For OT state and presence, I'd use Redis for low latency. Alternatives like PostgreSQL work for single-region deployments, but don't scale globally."

Q5: "How do you handle a document with 1000 concurrent editors?"

Answer Structure:

Acknowledge the challenge (O(n²) complexity)
Hierarchical OT solution
Operation batching
Read-only mode for excess users

Example Answer: "With 1000 editors, OT complexity becomes a bottleneck (1M transformations). I'd implement hierarchical OT with multiple OT servers coordinated by a master. Each server handles 100 editors. I'd also batch operations (send 10 at once) to reduce network overhead. For viewers beyond 500, I'd offer read-only mode to reduce load."

Q6: "How do you ensure data consistency across regions?"

Answer Structure:

Spanner's global consistency
Regional OT servers
Synchronous replication for writes
Acceptable latency tradeoff

Example Answer: "Spanner provides strong consistency across regions using TrueTime API. For real-time collaboration, I use regional OT servers that handle local operations with low latency (10-20ms). Cross-region operations take longer (100-200ms) due to synchronous replication, but this is acceptable for global collaboration. Users see local changes immediately and remote changes with slight delay."

Q7: "How would you implement version history?"

Answer Structure:

Store all operations (append-only)
Snapshots every N operations
Restore by replaying operations
Compression and tiering

Example Answer: "I'd store all operations in an append-only log in Bigtable. Every 100 operations, I'd create a snapshot of the full document. To restore version N, I find the nearest snapshot and replay operations from snapshot to N. Old revisions are compressed and moved to cold storage after 90 days. This balances storage cost with restore performance."

Q8: "What happens if the OT server crashes?"

Answer Structure:

Operations persisted before acknowledgment
Replay from last checkpoint
Redis persistence for in-memory state
Failover to replica

Example Answer: "All operations are persisted to Spanner before acknowledgment, so no data loss occurs. The OT server maintains in-memory state in Redis with persistence enabled. On crash, a replica takes over and replays operations from the last checkpoint. Users may see a brief disconnection (1-2 seconds) but no data loss."

Key Concepts to Master

1. Operational Transform

What: Algorithm for conflict-free collaborative editing
Why: Maintains consistency with concurrent edits
How: Transform operations against each other
Complexity: O(n²) for n concurrent operations
Alternative: CRDT (O(n) but larger operations)

2. WebSocket vs HTTP

WebSocket: Full-duplex, low latency, persistent connection
HTTP: Request-response, higher latency, stateless
When to use WebSocket: Real-time collaboration
When to use HTTP: CRUD operations, file uploads

3. Strong vs Eventual Consistency

Strong: All users see same state (Spanner)
Eventual: Temporary inconsistencies (Cassandra)
Tradeoff: Consistency vs latency vs availability
Choice: Strong for document state, eventual for presence

4. Caching Strategy

Multi-level: Client, Redis, CDN
Invalidation: Write-through, TTL-based
Hit rate: Target 95%+ to reduce database load

5. Scaling Patterns

Horizontal: Add more servers
Vertical: Bigger servers (limited)
Sharding: Partition by document ID
Replication: Multiple copies for reads

Common Mistakes to Avoid

1. Not Discussing OT

❌ "We'll just use last-writer-wins for conflicts" ✅ "We'll use Operational Transform to handle concurrent edits"

Why: OT is the core of collaborative editing. Not mentioning it shows lack of depth.

2. Ignoring Real-Time Requirements

❌ "Users can refresh to see updates" ✅ "We'll use WebSocket for real-time synchronization"

Why: Real-time is a fundamental requirement for Google Docs.

3. Oversimplifying Conflict Resolution

❌ "We'll lock the document when someone is editing" ✅ "We'll use OT to transform concurrent operations"

Why: Locking defeats the purpose of collaborative editing.

4. Not Considering Scale

❌ "We'll use a single server for everything" ✅ "We'll need 300 WebSocket servers for 5M concurrent users"

Why: Scale is a key constraint that affects architecture decisions.

5. Forgetting About Offline Mode

❌ "Users must be online to edit" ✅ "We'll support offline editing with sync on reconnection"

Why: Offline support is expected in modern applications.

6. Ignoring Security

❌ Not mentioning encryption or access control ✅ "We'll use TLS for transit, AES-256 for rest, and RBAC for access control"

Why: Security is critical for document editing systems.

How to Stand Out

1. Mention Specific Technologies

Instead of: "We'll use a database" Say: "We'll use Spanner for strong consistency and global distribution"

2. Provide Numbers

Instead of: "We'll need multiple servers" Say: "We'll need 100 servers per region for 5M concurrent users at 50K connections per server"

3. Discuss Tradeoffs

Instead of: "We'll use OT" Say: "We'll use OT over CRDT because it provides better user experience despite O(n²) complexity"

4. Consider Edge Cases

What if 1000 users edit simultaneously?
What if the OT server crashes?
What if a user edits offline for a week?
What if someone pastes 1MB of text?

5. Think About Operations

How do we monitor OT server health?
How do we debug conflicts?
How do we handle schema migrations?
How do we test OT correctness?

Time Management Tips

Allocate Time Wisely

Requirements: 10% (5 minutes)
High-level design: 30% (15 minutes)
Deep dive: 40% (20 minutes)
Scaling: 20% (10 minutes)

If Running Out of Time

Prioritize:

OT explanation (most important)
WebSocket architecture
Database design
Scaling strategy

Skip if needed:

Monitoring and alerting
Deployment strategy
Cost optimization
Advanced features (comments, suggestions)

If Ahead of Schedule

Discuss:

Rich text formatting
Comments and suggestions
Version history
Offline mode
Security and compliance

Final Checklist

Before ending the interview, ensure you've covered:

Operational Transform explanation with example
WebSocket architecture for real-time sync
Database choice with justification (Spanner)
Conflict resolution strategy
Scaling approach (horizontal, regional)
Caching strategy (multi-level)
Offline support (IndexedDB, sync)
Security basics (encryption, access control)
At least one tradeoff discussion
Capacity calculations with numbers

Summary

Key Takeaways:

Master OT: It's the heart of collaborative editing
Think Real-Time: WebSocket, low latency, immediate updates
Scale Matters: Calculate capacity, discuss bottlenecks
Tradeoffs: OT vs CRDT, strong vs eventual consistency
Be Specific: Name technologies, provide numbers
Consider Edge Cases: Hot documents, crashes, offline editing
Security: Don't forget encryption and access control

Success Formula:

Clear communication > Perfect solution
Tradeoff discussion > Single approach
Specific examples > Generic statements
Capacity calculations > Hand-waving
Asking questions > Making assumptions

Good luck with your interview! Remember, the interviewer wants to see your thought process, not just the final design.