Tradeoffs and Alternatives for Google Docs

Overview

Designing a collaborative document editing system involves fundamental tradeoffs between consistency, performance, complexity, and user experience. This document explores key decisions and their alternatives.

1. Operational Transform vs CRDT

Operational Transform (OT)

Chosen Approach

Pros:

Deterministic conflict resolution
Smaller operation size (10-100 bytes)
Well-understood for text editing
Preserves user intent accurately
Efficient for sequential operations

Cons:

O(n²) complexity for concurrent operations
Requires central server for coordination
Complex implementation (transformation functions)
Difficult to reason about correctness
Server becomes bottleneck at scale

Example:

// OT operation
{
  type: 'insert',
  position: 10,
  text: 'hello',
  userId: 'user123',
  version: 42
}

// Transform against concurrent operation
function transform(op1, op2) {
  if (op1.position < op2.position) {
    return op1; // No change needed
  } else {
    return {
      ...op1,
      position: op1.position + op2.text.length
    };
  }
}

Conflict-Free Replicated Data Types (CRDT)

Alternative Approach

Pros:

No central coordination needed
O(n) complexity for merging
Eventual consistency guaranteed
Better for peer-to-peer systems
Simpler to scale horizontally

Cons:

Larger operation size (100-1000 bytes)
More complex data structures
Harder to preserve user intent
Memory overhead for tombstones
Interleaving issues with concurrent edits

Example:

// CRDT operation (YJS-style)
{
  id: { client: 'user123', clock: 42 },
  left: { client: 'user456', clock: 38 },
  right: { client: 'user789', clock: 40 },
  content: 'hello',
  deleted: false
}

// Merge is automatic - no transformation needed
function merge(doc1, doc2) {
  return doc1.union(doc2); // Commutative and associative
}

Decision: OT for Google Docs

Rationale:

Text editing benefits from deterministic ordering
Central server architecture already in place
Smaller operation size reduces bandwidth
Better user experience (predictable behavior)
Worth the complexity for quality

When to use CRDT:

Peer-to-peer collaboration (no server)
Offline-first applications
Eventually consistent systems
Simpler implementation requirements

2. Strong Consistency vs Eventual Consistency

Strong Consistency (Spanner)

Chosen for Document State

Pros:

Linearizable reads and writes
No conflicting versions
Simpler application logic
Guaranteed correctness
ACID transactions

Cons:

Higher latency (50-100ms cross-region)
Limited by speed of light
More expensive infrastructure
Lower availability during partitions
Reduced write throughput

Use Cases:

Document content and structure
User permissions and access control
Critical metadata (owner, created date)

Eventual Consistency (Cassandra/DynamoDB)

Alternative for Non-Critical Data

Pros:

Lower latency (5-10ms)
Higher availability (99.99%)
Better write throughput
Cheaper infrastructure
Partition tolerance

Cons:

Temporary inconsistencies
Conflict resolution needed
More complex application logic
Potential data loss scenarios
Harder to reason about

Use Cases:

User presence (online/offline status)
Cursor positions
View counts and analytics
Comment notifications
Activity logs

Hybrid Approach

Strong Consistency (Spanner):
├── Document content
├── Document structure
├── User permissions
└── Revision history

Eventual Consistency (Redis/Cassandra):
├── User presence
├── Cursor positions
├── Typing indicators
└── View analytics

3. WebSocket vs Server-Sent Events vs HTTP/2

WebSocket

Chosen Approach

Pros:

Full-duplex communication
Low latency (5-10ms)
Efficient for bidirectional data
Single persistent connection
Native browser support

Cons:

Connection management complexity
Load balancing challenges (sticky sessions)
Firewall/proxy issues
Higher server resource usage
Difficult to scale horizontally

Capacity: 50,000 connections per server

Server-Sent Events (SSE)

Alternative for One-Way Updates

Pros:

Simpler than WebSocket
Automatic reconnection
HTTP-based (better firewall support)
Built-in event IDs
Lower server resource usage

Cons:

One-way only (server to client)
Requires separate HTTP for client updates
Limited browser support (no IE)
Connection limits (6 per domain)
No binary data support

Capacity: 100,000 connections per server

HTTP/2 with Long Polling

Alternative for Compatibility

Pros:

Universal browser support
No special infrastructure needed
Works through all proxies
Multiplexing support
Simpler load balancing

Cons:

Higher latency (100-500ms)
More server requests
Inefficient for real-time updates
Higher bandwidth usage
Battery drain on mobile

Capacity: 500,000 requests per server

Decision Matrix

Feature              WebSocket    SSE      HTTP/2
─────────────────────────────────────────────────
Latency              5-10ms       10-20ms  100-500ms
Bidirectional        Yes          No       Yes
Browser Support      95%          85%      99%
Firewall Friendly    No           Yes      Yes
Scalability          Medium       High     High
Complexity           High         Low      Medium
Real-time Quality    Excellent    Good     Poor

Choice: WebSocket for real-time editing
        SSE for notifications
        HTTP/2 for API calls

4. Centralized vs Distributed OT

Centralized OT Server

Chosen Approach

Pros:

Simpler transformation logic
Guaranteed operation ordering
Easier to implement correctly
Single source of truth
Better debugging

Cons:

Single point of failure
Latency for remote users
Scalability bottleneck
Regional affinity issues
Higher operational complexity

Architecture:

All Editors → Central OT Server → Database

Distributed OT with Regional Servers

Hybrid Approach

Pros:

Lower latency for regional users
Better fault tolerance
Horizontal scalability
Regional data residency
Load distribution

Cons:

Complex coordination protocol
Potential for conflicts
Harder to maintain consistency
More infrastructure
Increased operational cost

Architecture:

US Editors → US OT Server ──┐
EU Editors → EU OT Server ──┼→ Global Coordinator → Database
Asia Editors → Asia OT Server ─┘

Decision: Regional OT with Global Coordination

Rationale:

Balance between latency and consistency
Regional servers for local operations
Global coordinator for cross-region conflicts
Acceptable 100-200ms for remote updates

5. Synchronous vs Asynchronous Replication

Synchronous Replication

Chosen for Critical Data

Pros:

No data loss on failure
Immediate consistency
Simpler recovery
Guaranteed durability
ACID compliance

Cons:

Higher write latency (50-100ms)
Reduced availability
Network dependency
Lower throughput
More expensive

Use Cases:

Document content writes
Permission changes
User authentication

Asynchronous Replication

Alternative for Non-Critical Data

Pros:

Lower write latency (5-10ms)
Higher availability
Better throughput
Network independence
Cheaper infrastructure

Cons:

Potential data loss (seconds)
Temporary inconsistency
Complex conflict resolution
Harder to debug
Replication lag issues

Use Cases:

Analytics events
Activity logs
Presence updates
View counts

Replication Strategy

Critical Path (Synchronous):
User Edit → OT Server → Spanner (3 replicas, sync) → Acknowledge

Non-Critical Path (Asynchronous):
User View → API Server → Kafka → Cassandra (async) → Return

6. Optimistic vs Pessimistic Locking

Optimistic Locking

Chosen Approach

Pros:

Better user experience (no blocking)
Higher concurrency
Lower latency
Simpler implementation
Scales better

Cons:

Conflicts require resolution
Potential for lost updates
Retry logic needed
More complex error handling
User confusion on conflicts

Implementation:

// Optimistic locking with version
async function saveDocument(docId, content, version) {
  const result = await db.update({
    id: docId,
    content: content,
    version: version + 1,
    where: { version: version } // Only update if version matches
  });
  
  if (result.rowsAffected === 0) {
    throw new ConflictError('Document was modified by another user');
  }
}

Pessimistic Locking

Alternative for Critical Sections

Pros:

No conflicts (guaranteed)
Simpler conflict resolution
Predictable behavior
Data integrity guaranteed
Easier to reason about

Cons:

Poor user experience (blocking)
Lower concurrency
Higher latency
Deadlock potential
Doesn't scale well

Implementation:

// Pessimistic locking with mutex
async function saveDocument(docId, content) {
  const lock = await acquireLock(docId, timeout = 5000);
  try {
    await db.update({ id: docId, content: content });
  } finally {
    await releaseLock(lock);
  }
}

Decision: Optimistic for Collaborative Editing

Rationale:

OT handles conflicts automatically
Users expect real-time updates
Blocking would break collaboration
Conflicts are rare with OT

7. Monolithic vs Microservices

Microservices Architecture

Chosen Approach

Services:

Document Service (CRUD operations)
OT Service (operation transformation)
Collaboration Service (WebSocket, presence)
Storage Service (file uploads)
Export Service (PDF, DOCX generation)
Search Service (full-text search)
Auth Service (authentication)

Pros:

Independent scaling
Technology flexibility
Team autonomy
Fault isolation
Easier deployment

Cons:

Distributed system complexity
Network latency between services
Data consistency challenges
More operational overhead
Debugging difficulty

Monolithic Architecture

Alternative for Simplicity

Pros:

Simpler deployment
Lower latency (in-process calls)
Easier debugging
Stronger consistency
Lower operational cost

Cons:

Scaling challenges
Technology lock-in
Team coordination needed
Deployment risk (all-or-nothing)
Harder to maintain at scale

Decision: Microservices for Scale

Rationale:

Different services have different scaling needs
OT service needs CPU, Storage needs I/O
Independent deployment reduces risk
Team autonomy improves velocity

8. SQL vs NoSQL for Document Storage

Spanner (NewSQL)

Chosen Approach

Pros:

Strong consistency + horizontal scaling
SQL interface (familiar)
ACID transactions
Global distribution
Automatic sharding

Cons:

Expensive ($300/TB/month)
Higher latency (50-100ms)
Vendor lock-in (Google Cloud)
Complex operational model
Overkill for simple use cases

PostgreSQL (SQL)

Alternative for Simplicity

Pros:

Mature and stable
Rich feature set
Strong consistency
Lower cost
Better tooling

Cons:

Vertical scaling limits
Manual sharding needed
No global distribution
Single region only
Complex replication setup

MongoDB (NoSQL)

Alternative for Flexibility

Pros:

Flexible schema
Horizontal scaling
Lower latency
Simpler operations
Lower cost

Cons:

Eventual consistency
No ACID transactions (older versions)
Complex query patterns
Data duplication
Consistency challenges

Decision Matrix

Requirement          Spanner  PostgreSQL  MongoDB
──────────────────────────────────────────────────
Strong Consistency   ✓        ✓           ✗
Global Distribution  ✓        ✗           ✓
Horizontal Scaling   ✓        ✗           ✓
ACID Transactions    ✓        ✓           ✓ (4.0+)
Cost Efficiency      ✗        ✓           ✓
Operational Simple   ✗        ✓           ✓

Choice: Spanner for global consistency
        PostgreSQL for single-region deployments
        MongoDB for flexible schema needs

Summary of Key Decisions

Decision	Chosen	Alternative	Rationale
Conflict Resolution	OT	CRDT	Better UX, deterministic
Consistency Model	Strong	Eventual	Data correctness critical
Real-time Protocol	WebSocket	SSE/HTTP/2	Low latency, bidirectional
OT Architecture	Regional	Centralized	Balance latency/consistency
Replication	Sync	Async	No data loss acceptable
Locking Strategy	Optimistic	Pessimistic	Better concurrency
Architecture	Microservices	Monolithic	Independent scaling
Database	Spanner	PostgreSQL	Global distribution

Each decision involves tradeoffs between performance, consistency, complexity, and cost. The choices reflect Google Docs' requirements for real-time collaboration, global scale, and strong consistency.