Tradeoffs and Alternatives for Google Docs

Overview

Designing a collaborative document editing system involves fundamental tradeoffs between consistency, performance, complexity, and user experience. This document explores key decisions and their alternatives.

1. Operational Transform vs CRDT

Operational Transform (OT)

Chosen Approach

Pros:

  • Deterministic conflict resolution
  • Smaller operation size (10-100 bytes)
  • Well-understood for text editing
  • Preserves user intent accurately
  • Efficient for sequential operations

Cons:

  • O(n²) transformations in the worst case when n operations are concurrent
  • Requires central server for coordination
  • Complex implementation (transformation functions)
  • Difficult to reason about correctness
  • Server becomes bottleneck at scale

Example:

// OT operation
{
  type: 'insert',
  position: 10,
  text: 'hello',
  userId: 'user123',
  version: 42
}

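// Transform op1 against a concurrent insert op2 (insert-vs-insert case only)
function transform(op1, op2) {
  // op1 keeps its position if it lands before op2, or ties and wins the
  // userId tie-break (both sites must break ties the same way to converge)
  if (op1.position < op2.position ||
      (op1.position === op2.position && op1.userId < op2.userId)) {
    return op1;
  }
  // Otherwise shift op1 right by the length of op2's inserted text
  return { ...op1, position: op1.position + op2.text.length };
}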
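
The property a transform function has to deliver is convergence: two sites that apply the same concurrent operations in different orders must end with the same text. The sketch below checks this for two inserts on a plain string. The helper names (applyInsert, transformInsert) and the userId tie-break are assumptions made for illustration, and deletes are not handled.

```javascript
// Apply an insert operation to a plain string document.
function applyInsert(doc, op) {
  return doc.slice(0, op.position) + op.text + doc.slice(op.position);
}

// Transform insert `op` so it can be applied after concurrent insert `other`.
// Ties on position are broken by userId (an assumed, arbitrary rule).
function transformInsert(op, other) {
  const staysPut = op.position < other.position ||
    (op.position === other.position && op.userId < other.userId);
  return staysPut ? op : { ...op, position: op.position + other.text.length };
}

const base = 'abcdef';
const opA = { type: 'insert', position: 2, text: 'XX', userId: 'alice' };
const opB = { type: 'insert', position: 4, text: 'YY', userId: 'bob' };

// Site 1 applies A first, then B transformed against A.
const site1 = applyInsert(applyInsert(base, opA), transformInsert(opB, opA));
// Site 2 applies B first, then A transformed against B.
const site2 = applyInsert(applyInsert(base, opB), transformInsert(opA, opB));

console.log(site1 === site2, site1); // true abXXcdYYef
```

A production OT engine needs a transform for every pair of operation types (insert/delete, delete/delete, and so on), which is where most of the implementation complexity listed above comes from.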

Conflict-Free Replicated Data Types (CRDT)

Alternative Approach

Pros:

  • No central coordination needed
  • O(n) complexity for merging
  • Eventual consistency guaranteed
  • Better for peer-to-peer systems
  • Simpler to scale horizontally

Cons:

  • Larger operation size (100-1000 bytes)
  • More complex data structures
  • Harder to preserve user intent
  • Memory overhead for tombstones
  • Interleaving issues with concurrent edits

Example:

// CRDT operation (YJS-style)
{
  id: { client: 'user123', clock: 42 },
  left: { client: 'user456', clock: 38 },
  right: { client: 'user789', clock: 40 },
  content: 'hello',
  deleted: false
}

// Merging needs no transformation: replicas converge in any merge order
function merge(doc1, doc2) {
  return doc1.union(doc2); // Commutative, associative, and idempotent
}
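
To make the union idea concrete, here is a minimal sketch of a state-based merge over items keyed by their (client, clock) id. It covers only set union and tombstones. It does not implement the ordering rules a real sequence CRDT such as Yjs uses to place items, and all names here (replicaOf, mergeReplicas, snapshot) are hypothetical.

```javascript
// Key an item by its unique (client, clock) id, as in the operation above.
const itemKey = (id) => `${id.client}:${id.clock}`;

// Build a replica: a Map from item key to item.
function replicaOf(items) {
  return new Map(items.map((item) => [itemKey(item.id), item]));
}

// State-based merge: union of items; a deletion seen by either side wins.
function mergeReplicas(a, b) {
  const merged = new Map(a);
  for (const [key, item] of b) {
    const existing = merged.get(key);
    merged.set(key, existing
      ? { ...existing, deleted: existing.deleted || item.deleted }
      : item);
  }
  return merged;
}

// Order-independent view of a replica, used to compare merge results.
function snapshot(replica) {
  return [...replica.keys()].sort().map((k) => `${k}:${replica.get(k).deleted}`);
}

const r1 = replicaOf([
  { id: { client: 'user123', clock: 42 }, content: 'hello', deleted: false },
]);
const r2 = replicaOf([
  { id: { client: 'user123', clock: 42 }, content: 'hello', deleted: true },
  { id: { client: 'user456', clock: 7 }, content: ' world', deleted: false },
]);

// Merging in either order yields the same set of items and tombstones.
const mergedAB = snapshot(mergeReplicas(r1, r2));
const mergedBA = snapshot(mergeReplicas(r2, r1));
```

Because the deleted flag only ever moves from false to true, the tombstone for user123:42 survives the merge in both directions. This is also why tombstones accumulate and cost memory.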

Decision: OT for Google Docs

Rationale:

  • Text editing benefits from deterministic ordering
  • Central server architecture already in place
  • Smaller operation size reduces bandwidth
  • Better user experience (predictable behavior)
  • Worth the complexity for quality

When to use CRDT:

  • Peer-to-peer collaboration (no server)
  • Offline-first applications
  • Eventually consistent systems
  • Simpler implementation requirements

2. Strong Consistency vs Eventual Consistency

Strong Consistency (Spanner)

Chosen for Document State

Pros:

  • Linearizable reads and writes
  • No conflicting versions
  • Simpler application logic
  • Guaranteed correctness
  • ACID transactions

Cons:

  • Higher latency (50-100ms cross-region)
  • Limited by speed of light
  • More expensive infrastructure
  • Lower availability during partitions
  • Reduced write throughput

Use Cases:

  • Document content and structure
  • User permissions and access control
  • Critical metadata (owner, created date)

Eventual Consistency (Cassandra/DynamoDB)

Alternative for Non-Critical Data

Pros:

  • Lower latency (5-10ms)
  • Higher availability (99.99%)
  • Better write throughput
  • Cheaper infrastructure
  • Partition tolerance

Cons:

  • Temporary inconsistencies
  • Conflict resolution needed
  • More complex application logic
  • Potential data loss scenarios
  • Harder to reason about

Use Cases:

  • User presence (online/offline status)
  • Cursor positions
  • View counts and analytics
  • Comment notifications
  • Activity logs

Hybrid Approach

Strong Consistency (Spanner):
├── Document content
├── Document structure
├── User permissions
└── Revision history

Eventual Consistency (Redis/Cassandra):
├── User presence
├── Cursor positions
├── Typing indicators
└── View analytics
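
One way to keep this split enforceable in code is a single routing table that maps each kind of data to a consistency tier. The sketch below is illustrative only: the kind names, the store labels, and storeFor are assumptions, and a real system would return client objects rather than strings.

```javascript
// Tier assignments copied from the hybrid layout above.
const TIER_BY_KIND = new Map([
  ['documentContent', 'strong'],
  ['documentStructure', 'strong'],
  ['userPermissions', 'strong'],
  ['revisionHistory', 'strong'],
  ['userPresence', 'eventual'],
  ['cursorPositions', 'eventual'],
  ['typingIndicators', 'eventual'],
  ['viewAnalytics', 'eventual'],
]);

// Hypothetical store labels standing in for real database clients.
const STORE_BY_TIER = { strong: 'spanner', eventual: 'redis-or-cassandra' };

function storeFor(kind) {
  const tier = TIER_BY_KIND.get(kind);
  if (!tier) {
    // Fail closed: unknown data is never silently given weaker guarantees.
    throw new Error(`No consistency tier configured for "${kind}"`);
  }
  return STORE_BY_TIER[tier];
}

console.log(storeFor('userPermissions')); // spanner
console.log(storeFor('cursorPositions')); // redis-or-cassandra
```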

3. WebSocket vs Server-Sent Events vs HTTP/2

WebSocket

Chosen Approach

Pros:

  • Full-duplex communication
  • Low latency (5-10ms)
  • Efficient for bidirectional data
  • Single persistent connection
  • Native browser support

Cons:

  • Connection management complexity
  • Load balancing challenges (sticky sessions)
  • Firewall/proxy issues
  • Higher server resource usage
  • Difficult to scale horizontally

Capacity: 50,000 connections per server

Server-Sent Events (SSE)

Alternative for One-Way Updates

Pros:

  • Simpler than WebSocket
  • Automatic reconnection
  • HTTP-based (better firewall support)
  • Built-in event IDs
  • Lower server resource usage

Cons:

  • One-way only (server to client)
  • Requires separate HTTP for client updates
  • Limited browser support (no IE)
  • Connection limits (6 per domain over HTTP/1.1)
  • No binary data support

Capacity: 100,000 connections per server

HTTP/2 with Long Polling

Alternative for Compatibility

Pros:

  • Universal browser support
  • No special infrastructure needed
  • Works through all proxies
  • Multiplexing support
  • Simpler load balancing

Cons:

  • Higher latency (100-500ms)
  • More server requests
  • Inefficient for real-time updates
  • Higher bandwidth usage
  • Battery drain on mobile

Capacity: 500,000 requests per server

Decision Matrix

Feature              WebSocket    SSE      HTTP/2
─────────────────────────────────────────────────
Latency              5-10ms       10-20ms  100-500ms
Bidirectional        Yes          No       Yes
Browser Support      95%          85%      99%
Firewall Friendly    No           Yes      Yes
Scalability          Medium       High     High
Complexity           High         Low      Medium
Real-time Quality    Excellent    Good     Poor

Choice: WebSocket for real-time editing
        SSE for notifications
        HTTP/2 for API calls
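
The per-server capacities quoted above translate directly into fleet size. The sketch below does that arithmetic for the two persistent-connection options. The figure of 1,000,000 concurrent connections and the 30% headroom are hypothetical inputs, not numbers from this document.

```javascript
// Per-server connection capacities quoted in this section.
const CAPACITY = { websocket: 50000, sse: 100000 };

// headroom is the fraction of each server kept free (0.3 = run at 70%).
function serversNeeded(concurrentConnections, perServer, headroom = 0) {
  return Math.ceil(concurrentConnections / (perServer * (1 - headroom)));
}

const peak = 1000000; // hypothetical peak of concurrent connections

console.log(serversNeeded(peak, CAPACITY.websocket));      // 20
console.log(serversNeeded(peak, CAPACITY.sse));            // 10
console.log(serversNeeded(peak, CAPACITY.websocket, 0.3)); // 29
```

At equal load the WebSocket tier needs roughly twice as many servers as SSE, which is the resource cost paid for bidirectional, low-latency editing.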

4. Centralized vs Distributed OT

Centralized OT Server

Alternative for Simplicity

Pros:

  • Simpler transformation logic
  • Guaranteed operation ordering
  • Easier to implement correctly
  • Single source of truth
  • Better debugging

Cons:

  • Single point of failure
  • Latency for remote users
  • Scalability bottleneck
  • Regional affinity issues
  • Higher operational complexity

Architecture:

All Editors → Central OT Server → Database

Distributed OT with Regional Servers

Hybrid Approach

Pros:

  • Lower latency for regional users
  • Better fault tolerance
  • Horizontal scalability
  • Regional data residency
  • Load distribution

Cons:

  • Complex coordination protocol
  • Potential for conflicts
  • Harder to maintain consistency
  • More infrastructure
  • Increased operational cost

Architecture:

US Editors   → US OT Server   ──┐
EU Editors   → EU OT Server   ──┼→ Global Coordinator → Database
Asia Editors → Asia OT Server ──┘

Decision: Regional OT with Global Coordination

Rationale:

  • Balance between latency and consistency
  • Regional servers for local operations
  • Global coordinator for cross-region conflicts
  • 100-200ms of added latency is acceptable for cross-region updates
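
The coordinator's core job can be sketched as assigning one global order to operations forwarded by the regional servers. This is a simplified illustration under stated assumptions: the class and method names are hypothetical, and the regional transform step that would run against the returned operations is omitted.

```javascript
class GlobalCoordinator {
  constructor() {
    this.nextSeq = 1;
    this.log = []; // all committed operations, in global order
  }

  // Assign the next global sequence number to an op forwarded by a region.
  commit(region, op) {
    const entry = { ...op, region, globalSeq: this.nextSeq++ };
    this.log.push(entry);
    return entry;
  }

  // Ops from other regions that `region` has not seen since `afterSeq`.
  pendingFor(region, afterSeq) {
    return this.log.filter(
      (entry) => entry.globalSeq > afterSeq && entry.region !== region
    );
  }
}

const coordinator = new GlobalCoordinator();
coordinator.commit('us', { type: 'insert', position: 0, text: 'a' });
coordinator.commit('eu', { type: 'insert', position: 1, text: 'b' });
coordinator.commit('us', { type: 'insert', position: 2, text: 'c' });

// The EU server, which has seen nothing yet, must catch up on both US ops.
const euBacklog = coordinator.pendingFor('eu', 0).map((e) => e.globalSeq);
console.log(euBacklog); // [ 1, 3 ]
```

Each regional server would transform its local, not-yet-committed operations against this backlog before applying it, so only cross-region traffic pays the extra round trip.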

5. Synchronous vs Asynchronous Replication

Synchronous Replication

Chosen for Critical Data

Pros:

  • No data loss on failure
  • Immediate consistency
  • Simpler recovery
  • Guaranteed durability
  • ACID compliance

Cons:

  • Higher write latency (50-100ms)
  • Reduced availability
  • Network dependency
  • Lower throughput
  • More expensive

Use Cases:

  • Document content writes
  • Permission changes
  • User authentication

Asynchronous Replication

Alternative for Non-Critical Data

Pros:

  • Lower write latency (5-10ms)
  • Higher availability
  • Better throughput
  • Network independence
  • Cheaper infrastructure

Cons:

  • Potential data loss (seconds)
  • Temporary inconsistency
  • Complex conflict resolution
  • Harder to debug
  • Replication lag issues

Use Cases:

  • Analytics events
  • Activity logs
  • Presence updates
  • View counts

Replication Strategy

Critical Path (Synchronous):
User Edit → OT Server → Spanner (3 replicas, sync) → Acknowledge

Non-Critical Path (Asynchronous):
User View → API Server → Kafka (enqueue) → Return; Kafka → Cassandra (async)
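
The latency gap between the two paths follows from how many replica acknowledgements a write waits for. The sketch below assumes a majority-quorum commit (2 of 3 replicas), and the replica latencies are hypothetical values chosen for illustration.

```javascript
// A write that needs `acksRequired` acknowledgements completes when the
// acksRequired-th fastest replica responds.
function commitLatencyMs(replicaLatenciesMs, acksRequired) {
  if (acksRequired < 1 || acksRequired > replicaLatenciesMs.length) {
    throw new RangeError('acksRequired must be between 1 and the replica count');
  }
  const sorted = [...replicaLatenciesMs].sort((a, b) => a - b);
  return sorted[acksRequired - 1];
}

// Hypothetical round trips: one local replica, two in remote regions.
const latencies = [5, 60, 90];

console.log(commitLatencyMs(latencies, 2)); // 60 (majority quorum, synchronous)
console.log(commitLatencyMs(latencies, 3)); // 90 (wait for every replica)
console.log(commitLatencyMs(latencies, 1)); // 5  (ack after local write, async)
```

With these inputs the synchronous path lands in the 50-100ms range quoted above, while the asynchronous path acknowledges at local speed and lets the remote replicas catch up afterwards.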

6. Optimistic vs Pessimistic Locking

Optimistic Locking

Chosen Approach

Pros:

  • Better user experience (no blocking)
  • Higher concurrency
  • Lower latency
  • Simpler implementation
  • Scales better

Cons:

  • Conflicts require resolution
  • Potential for lost updates
  • Retry logic needed
  • More complex error handling
  • User confusion on conflicts

Implementation:

// Optimistic locking with version
async function saveDocument(docId, content, version) {
  const result = await db.update({
    id: docId,
    content: content,
    version: version + 1,
    where: { version: version } // Only update if version matches
  });
  
  if (result.rowsAffected === 0) {
    throw new ConflictError('Document was modified by another user');
  }
}
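
A version check only detects the conflict; the caller still has to recover from it. A common pattern is to re-read the latest version, re-apply the edit, and try again. The sketch below shows that loop against an in-memory Map standing in for the database. The names compareAndSet and saveWithRetry are hypothetical, and the code is synchronous to keep the example short.

```javascript
class ConflictError extends Error {}

// In-memory stand-in for the documents table: docId -> { content, version }.
const store = new Map([['doc1', { content: 'draft', version: 1 }]]);

// Write only if the stored version still matches the one the caller read.
function compareAndSet(docId, content, expectedVersion) {
  const row = store.get(docId);
  if (row.version !== expectedVersion) {
    throw new ConflictError('Document was modified by another user');
  }
  store.set(docId, { content, version: expectedVersion + 1 });
}

// Re-read and re-apply the edit until the versioned write succeeds.
function saveWithRetry(docId, edit, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const { content, version } = store.get(docId);
    try {
      compareAndSet(docId, edit(content), version);
      return version + 1; // the new version number
    } catch (err) {
      if (!(err instanceof ConflictError) || attempt === maxAttempts) throw err;
    }
  }
}

// Simulate a concurrent writer that sneaks in during the first attempt.
let interfere = true;
const newVersion = saveWithRetry('doc1', (text) => {
  if (interfere) {
    interfere = false;
    compareAndSet('doc1', 'draft by someone else', 1);
  }
  return text + '!';
});

console.log(newVersion, store.get('doc1').content); // 3 draft by someone else!
```

The retry re-applies the edit to the other writer's text instead of overwriting it, so neither update is lost.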

Pessimistic Locking

Alternative for Critical Sections

Pros:

  • No conflicts (guaranteed)
  • Simpler conflict resolution
  • Predictable behavior
  • Data integrity guaranteed
  • Easier to reason about

Cons:

  • Poor user experience (blocking)
  • Lower concurrency
  • Higher latency
  • Deadlock potential
  • Doesn't scale well

Implementation:

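// Pessimistic locking with mutex
async function saveDocument(docId, content) {
  // Wait up to 5 s for the lock (acquireLock's options object is an assumed API)
  const lock = await acquireLock(docId, { timeout: 5000 });
  try {
    await db.update({ id: docId, content: content });
  } finally {
    // Always release the lock, even if the update throws
    await releaseLock(lock);
  }
}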
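
One common way to limit the deadlock risk listed above is to make every lock a lease that expires on its own. The sketch below is an in-memory illustration with an injected clock. LockTable, its methods, and the 5000 ms default lease are assumptions, not part of the design described here.

```javascript
class LockTable {
  constructor() {
    this.locks = new Map(); // docId -> { owner, expiresAt }
  }

  // Returns true if `owner` now holds the lock. An expired lease held by
  // someone else is reclaimed; re-acquiring your own lock renews it.
  tryAcquire(docId, owner, nowMs, leaseMs = 5000) {
    const held = this.locks.get(docId);
    if (held && held.owner !== owner && held.expiresAt > nowMs) {
      return false;
    }
    this.locks.set(docId, { owner, expiresAt: nowMs + leaseMs });
    return true;
  }

  // Only the current owner may release the lock.
  release(docId, owner) {
    const held = this.locks.get(docId);
    if (held && held.owner === owner) {
      this.locks.delete(docId);
    }
  }
}

const locks = new LockTable();
console.log(locks.tryAcquire('doc1', 'alice', 0));    // true
console.log(locks.tryAcquire('doc1', 'bob', 1000));   // false (alice holds it)
console.log(locks.tryAcquire('doc1', 'bob', 5000));   // true  (alice's lease expired)
```

Even with leases, the second editor is blocked for up to the full lease duration, which is the user-experience cost that rules this approach out for live collaboration.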

Decision: Optimistic for Collaborative Editing

Rationale:

  • OT handles conflicts automatically
  • Users expect real-time updates
  • Blocking would break collaboration
  • Conflicts are rare with OT

7. Monolithic vs Microservices

Microservices Architecture

Chosen Approach

Services:

  • Document Service (CRUD operations)
  • OT Service (operation transformation)
  • Collaboration Service (WebSocket, presence)
  • Storage Service (file uploads)
  • Export Service (PDF, DOCX generation)
  • Search Service (full-text search)
  • Auth Service (authentication)

Pros:

  • Independent scaling
  • Technology flexibility
  • Team autonomy
  • Fault isolation
  • Easier deployment

Cons:

  • Distributed system complexity
  • Network latency between services
  • Data consistency challenges
  • More operational overhead
  • Debugging difficulty

Monolithic Architecture

Alternative for Simplicity

Pros:

  • Simpler deployment
  • Lower latency (in-process calls)
  • Easier debugging
  • Stronger consistency
  • Lower operational cost

Cons:

  • Scaling challenges
  • Technology lock-in
  • Team coordination needed
  • Deployment risk (all-or-nothing)
  • Harder to maintain at scale

Decision: Microservices for Scale

Rationale:

  • Different services have different scaling needs
  • OT service needs CPU, Storage needs I/O
  • Independent deployment reduces risk
  • Team autonomy improves velocity

8. SQL vs NoSQL for Document Storage

Spanner (NewSQL)

Chosen Approach

Pros:

  • Strong consistency + horizontal scaling
  • SQL interface (familiar)
  • ACID transactions
  • Global distribution
  • Automatic sharding

Cons:

  • Expensive ($300/TB/month)
  • Higher latency (50-100ms)
  • Vendor lock-in (Google Cloud)
  • Complex operational model
  • Overkill for simple use cases

PostgreSQL (SQL)

Alternative for Simplicity

Pros:

  • Mature and stable
  • Rich feature set
  • Strong consistency
  • Lower cost
  • Better tooling

Cons:

  • Vertical scaling limits
  • Manual sharding needed
  • No global distribution
  • Single region only
  • Complex replication setup

MongoDB (NoSQL)

Alternative for Flexibility

Pros:

  • Flexible schema
  • Horizontal scaling
  • Lower latency
  • Simpler operations
  • Lower cost

Cons:

  • Eventual consistency
  • No multi-document ACID transactions before version 4.0
  • Complex query patterns
  • Data duplication
  • Consistency challenges

Decision Matrix

Requirement          Spanner  PostgreSQL  MongoDB
──────────────────────────────────────────────────
Strong Consistency   ✓        ✓           ✗
Global Distribution  ✓        ✗           ✓
Horizontal Scaling   ✓        ✗           ✓
ACID Transactions    ✓        ✓           ✓ (4.0+)
Cost Efficiency      ✗        ✓           ✓
Simple Operations    ✗        ✓           ✓

Choice: Spanner for global consistency
        PostgreSQL for single-region deployments
        MongoDB for flexible schema needs
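
The matrix can also be read mechanically: encode each row as a boolean and keep the databases that satisfy every required property. The sketch below transcribes the table above as is. The property names and the candidates function are hypothetical, and the values are the document's judgments, not independent measurements.

```javascript
// Decision matrix transcribed from the table above.
const MATRIX = {
  Spanner: {
    strongConsistency: true, globalDistribution: true, horizontalScaling: true,
    acidTransactions: true, costEfficient: false, simpleOperations: false,
  },
  PostgreSQL: {
    strongConsistency: true, globalDistribution: false, horizontalScaling: false,
    acidTransactions: true, costEfficient: true, simpleOperations: true,
  },
  MongoDB: {
    strongConsistency: false, globalDistribution: true, horizontalScaling: true,
    acidTransactions: true, costEfficient: true, simpleOperations: true,
  },
};

// Return every database that satisfies all of the required properties.
function candidates(required) {
  return Object.keys(MATRIX).filter((db) =>
    required.every((property) => MATRIX[db][property] === true)
  );
}

console.log(candidates(['strongConsistency', 'globalDistribution'])); // [ 'Spanner' ]
console.log(candidates(['strongConsistency', 'costEfficient']));      // [ 'PostgreSQL' ]
console.log(candidates(['horizontalScaling', 'costEfficient']));      // [ 'MongoDB' ]
```

Under this matrix, requiring both strong consistency and global distribution leaves Spanner as the only candidate, which matches the choice made for Google Docs.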

Summary of Key Decisions

Decision             Chosen         Alternative   Rationale
───────────────────────────────────────────────────────────────────────────
Conflict Resolution  OT             CRDT          Better UX, deterministic
Consistency Model    Strong         Eventual      Data correctness critical
Real-time Protocol   WebSocket      SSE/HTTP/2    Low latency, bidirectional
OT Architecture      Regional       Centralized   Balance latency/consistency
Replication          Sync           Async         Data loss not acceptable
Locking Strategy     Optimistic     Pessimistic   Better concurrency
Architecture         Microservices  Monolithic    Independent scaling
Database             Spanner        PostgreSQL    Global distribution

Each decision involves tradeoffs between performance, consistency, complexity, and cost. The choices reflect Google Docs' requirements for real-time collaboration, global scale, and strong consistency.