Tradeoffs and Alternatives for Google Docs
Overview
Designing a collaborative document editing system involves fundamental tradeoffs between consistency, performance, complexity, and user experience. This document explores key decisions and their alternatives.
1. Operational Transform vs CRDT
Operational Transform (OT)
Chosen Approach
Pros:
- Deterministic conflict resolution
- Smaller operation size (10-100 bytes)
- Well-understood for text editing
- Preserves user intent accurately
- Efficient for sequential operations
Cons:
- O(n²) complexity for concurrent operations
- Requires central server for coordination
- Complex implementation (transformation functions)
- Difficult to reason about correctness
- Server becomes bottleneck at scale
Example:
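```javascript
// OT operation
const op = {
  type: 'insert',
  position: 10,
  text: 'hello',
  userId: 'user123',
  version: 42
};

// Transform insert op1 against a concurrent insert op2 that was applied first.
function transform(op1, op2) {
  // Equal positions are ordered by userId so every replica breaks the tie the same way.
  const op1First =
    op1.position < op2.position ||
    (op1.position === op2.position && op1.userId < op2.userId);
  if (op1First) {
    return op1; // No change needed
  }
  // op2 landed at or before op1, so shift op1 right by the inserted length.
  return {
    ...op1,
    position: op1.position + op2.text.length
  };
}
```
Conflict-Free Replicated Data Types (CRDT)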
Alternative Approach
Pros:
- No central coordination needed
- O(n) complexity for merging
- Eventual consistency guaranteed
- Better for peer-to-peer systems
- Simpler to scale horizontally
Cons:
- Larger operation size (100-1000 bytes)
- More complex data structures
- Harder to preserve user intent
- Memory overhead for tombstones
- Interleaving issues with concurrent edits
Example:
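```javascript
// CRDT operation (YJS-style)
const crdtItem = {
  id: { client: 'user123', clock: 42 },
  left: { client: 'user456', clock: 38 },
  right: { client: 'user789', clock: 40 },
  content: 'hello',
  deleted: false
};

// Merge is automatic - no transformation needed.
// Each doc is modeled here as a Map from a "client:clock" key to its item.
function merge(doc1, doc2) {
  const merged = new Map(doc1);
  for (const [key, incoming] of doc2) {
    const existing = merged.get(key);
    // Union of items; a delete seen on either side is kept as a tombstone.
    merged.set(key, existing
      ? { ...existing, deleted: existing.deleted || incoming.deleted }
      : incoming);
  }
  return merged; // Commutative, associative, and idempotent
}
```
Decision: OT for Google Docs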
Rationale:
- Text editing benefits from deterministic ordering
- Central server architecture already in place
- Smaller operation size reduces bandwidth
- Better user experience (predictable behavior)
- Worth the complexity for quality
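The "deterministic ordering" point can be made concrete with a small convergence check. The sketch below is illustrative only: it handles inserts on a plain string, and the `userId` tie-break, `applyInsert`, and `transformInsert` are assumptions made for this example rather than Google Docs' actual algorithm.

```javascript
// Apply an insert operation to a plain string document.
function applyInsert(doc, op) {
  return doc.slice(0, op.position) + op.text + doc.slice(op.position);
}

// Transform insert op1 against a concurrent insert op2 that was applied first.
// Ties on position are broken by userId (an assumption for this sketch).
function transformInsert(op1, op2) {
  const op1First =
    op1.position < op2.position ||
    (op1.position === op2.position && op1.userId < op2.userId);
  return op1First ? op1 : { ...op1, position: op1.position + op2.text.length };
}

const base = 'Hello world';
const opA = { type: 'insert', position: 5, text: ',', userId: 'alice' };
const opB = { type: 'insert', position: 5, text: '!', userId: 'bob' };

// Replica 1 applies A first, then B transformed against A.
const replica1 = applyInsert(applyInsert(base, opA), transformInsert(opB, opA));
// Replica 2 applies B first, then A transformed against B.
const replica2 = applyInsert(applyInsert(base, opB), transformInsert(opA, opB));

console.log(replica1, '|', replica2, '|', replica1 === replica2);
```

Both replicas end with the same text even though they applied the operations in opposite orders, which is the property the central server relies on.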
When to use CRDT:
- Peer-to-peer collaboration (no server)
- Offline-first applications
- Eventually consistent systems
- Simpler implementation requirements
2. Strong Consistency vs Eventual Consistency
Strong Consistency (Spanner)
Chosen for Document State
Pros:
- Linearizable reads and writes
- No conflicting versions
- Simpler application logic
- Guaranteed correctness
- ACID transactions
Cons:
- Higher latency (50-100ms cross-region)
- Limited by speed of light
- More expensive infrastructure
- Lower availability during partitions
- Reduced write throughput
Use Cases:
- Document content and structure
- User permissions and access control
- Critical metadata (owner, created date)
Eventual Consistency (Cassandra/DynamoDB)
Alternative for Non-Critical Data
Pros:
- Lower latency (5-10ms)
- Higher availability (99.99%)
- Better write throughput
- Cheaper infrastructure
- Partition tolerance
Cons:
- Temporary inconsistencies
- Conflict resolution needed
- More complex application logic
- Potential data loss scenarios
- Harder to reason about
Use Cases:
- User presence (online/offline status)
- Cursor positions
- View counts and analytics
- Comment notifications
- Activity logs
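For data such as presence and cursor positions, the "conflict resolution needed" cost is often small because a last-write-wins rule is good enough. The following is a minimal sketch under that assumption; the `updatedAt` and `replica` fields and the `mergePresence` helper are hypothetical names, not part of any particular datastore's API.

```javascript
// Decide which of two entries for the same user should win.
function isNewer(a, b) {
  if (a.updatedAt !== b.updatedAt) return a.updatedAt > b.updatedAt;
  return a.replica > b.replica; // Deterministic tie-break on equal timestamps
}

// Last-write-wins merge of two presence maps (userId -> entry).
function mergePresence(local, remote) {
  const merged = new Map(local);
  for (const [userId, entry] of remote) {
    const current = merged.get(userId);
    if (!current || isNewer(entry, current)) merged.set(userId, entry);
  }
  return merged;
}

const usView = new Map([
  ['alice', { cursor: 12, updatedAt: 1000, replica: 'us' }],
  ['bob', { cursor: 40, updatedAt: 1005, replica: 'us' }]
]);
const euView = new Map([
  ['alice', { cursor: 15, updatedAt: 1010, replica: 'eu' }],
  ['carol', { cursor: 3, updatedAt: 1002, replica: 'eu' }]
]);

// Merging in either direction yields the same state.
const usThenEu = mergePresence(usView, euView);
const euThenUs = mergePresence(euView, usView);
console.log(usThenEu.get('alice').cursor, euThenUs.get('alice').cursor);
```

A stale cursor for a few hundred milliseconds is harmless, which is why this class of data does not need the stronger guarantees used for document content.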
Hybrid Approach
```
Strong Consistency (Spanner):
├── Document content
├── Document structure
├── User permissions
└── Revision history

Eventual Consistency (Redis/Cassandra):
├── User presence
├── Cursor positions
├── Typing indicators
└── View analytics
```
3. WebSocket vs Server-Sent Events vs HTTP/2
WebSocket
Chosen Approach
Pros:
- Full-duplex communication
- Low latency (5-10ms)
- Efficient for bidirectional data
- Single persistent connection
- Native browser support
Cons:
- Connection management complexity
- Load balancing challenges (sticky sessions)
- Firewall/proxy issues
- Higher server resource usage
- Difficult to scale horizontally
Capacity: 50,000 connections per server
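Part of the connection management complexity is reconnection: if a server holding tens of thousands of sockets restarts, clients that all retry at once can overload its replacement. A common mitigation is exponential backoff with jitter. The sketch below assumes a hypothetical helper `reconnectDelayMs`; the base and maximum delays are illustrative defaults, not values from the source.

```javascript
// Delay before reconnect attempt number `attempt` (0-based), using
// exponential backoff with full jitter. `random` is injectable for testing.
function reconnectDelayMs(attempt, { baseMs = 500, maxMs = 30000, random = Math.random } = {}) {
  const ceiling = Math.min(maxMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// Upper bound of the delay for the first eight attempts (random fixed at 1).
const ceilings = [0, 1, 2, 3, 4, 5, 6, 7].map((attempt) =>
  reconnectDelayMs(attempt, { random: () => 1 })
);
console.log(ceilings);
```

A client would wait `reconnectDelayMs(attempt)` milliseconds before opening a new WebSocket and reset `attempt` to zero after a successful connection.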
Server-Sent Events (SSE)
Alternative for One-Way Updates
Pros:
- Simpler than WebSocket
- Automatic reconnection
- HTTP-based (better firewall support)
- Built-in event IDs
- Lower server resource usage
Cons:
- One-way only (server to client)
- Requires separate HTTP for client updates
- Limited browser support (no IE)
- Connection limits (6 per domain over HTTP/1.1)
- No binary data support
Capacity: 100,000 connections per server
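The built-in event IDs and automatic reconnection work together: the browser sends the last ID it saw in a `Last-Event-ID` header when it reconnects, and the server can replay what was missed. The sketch below shows the wire format and a replay filter. The helper names, the numeric IDs, and the `comment` and `share` event types are assumptions made for illustration.

```javascript
// Serialize one event in the text/event-stream format.
// Multi-line data is sent as one "data:" line per line of text.
function formatSseEvent({ id, event, data }) {
  const lines = [];
  if (id !== undefined) lines.push(`id: ${id}`);
  if (event !== undefined) lines.push(`event: ${event}`);
  for (const line of String(data).split('\n')) lines.push(`data: ${line}`);
  return lines.join('\n') + '\n\n'; // A blank line terminates the event
}

// Events to resend after a reconnect, given the Last-Event-ID header value.
function eventsAfter(history, lastEventId) {
  const last = lastEventId === undefined ? 0 : Number(lastEventId);
  return history.filter((e) => e.id > last);
}

const history = [
  { id: 1, event: 'comment', data: 'Alice commented' },
  { id: 2, event: 'comment', data: 'Bob replied' },
  { id: 3, event: 'share', data: 'Doc shared\nwith Carol' }
];

// A client that last saw event 1 reconnects and receives events 2 and 3.
const replay = eventsAfter(history, '1').map((e) => formatSseEvent(e)).join('');
console.log(replay);
```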
HTTP/2 with Long Polling
Alternative for Compatibility
Pros:
- Universal browser support
- No special infrastructure needed
- Works through all proxies
- Multiplexing support
- Simpler load balancing
Cons:
- Higher latency (100-500ms)
- More server requests
- Inefficient for real-time updates
- Higher bandwidth usage
- Battery drain on mobile
Capacity: 500,000 requests per server
Decision Matrix
| Feature | WebSocket | SSE | HTTP/2 |
|---|---|---|---|
| Latency | 5-10ms | 10-20ms | 100-500ms |
| Bidirectional | Yes | No | Yes |
| Browser Support | 95% | 85% | 99% |
| Firewall Friendly | No | Yes | Yes |
| Scalability | Medium | High | High |
| Complexity | High | Low | Medium |
| Real-time Quality | Excellent | Good | Poor |
Choice:
- WebSocket for real-time editing
- SSE for notifications
- HTTP/2 for API calls
4. Centralized vs Distributed OT
Centralized OT Server
Alternative Approach
Pros:
- Simpler transformation logic
- Guaranteed operation ordering
- Easier to implement correctly
- Single source of truth
- Better debugging
Cons:
- Single point of failure
- Latency for remote users
- Scalability bottleneck
- Regional affinity issues
- Higher operational complexity
Architecture:
```
All Editors → Central OT Server → Database
```
Distributed OT with Regional Servers
Chosen Approach (Hybrid)
Pros:
- Lower latency for regional users
- Better fault tolerance
- Horizontal scalability
- Regional data residency
- Load distribution
Cons:
- Complex coordination protocol
- Potential for conflicts
- Harder to maintain consistency
- More infrastructure
- Increased operational cost
Architecture:
```
US Editors   → US OT Server   ──┐
EU Editors   → EU OT Server   ──┼→ Global Coordinator → Database
Asia Editors → Asia OT Server ──┘
```
Decision: Regional OT with Global Coordination
Rationale:
- Balance between latency and consistency
- Regional servers for local operations
- Global coordinator for cross-region conflicts
- 100-200ms latency is acceptable for cross-region updates
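One way to picture the global coordinator is as a sequencer: regional servers accept edits locally, the coordinator assigns a single global order, and every region replays that order. The sketch below shows only this ordering step. `GlobalCoordinator` and `RegionalServer` are hypothetical names, operations are plain strings, and transformation of concurrent operations is deliberately omitted.

```javascript
// Hypothetical sequencer: assigns one global order to operations from all regions.
class GlobalCoordinator {
  constructor() {
    this.log = [];
  }
  commit(region, op) {
    const entry = { seq: this.log.length + 1, region, op };
    this.log.push(entry);
    return entry;
  }
  entriesAfter(seq) {
    return this.log.slice(seq); // Entry with seq N is stored at index N - 1
  }
}

// Hypothetical regional server: accepts local edits and replays the global log.
class RegionalServer {
  constructor(region, coordinator) {
    this.region = region;
    this.coordinator = coordinator;
    this.appliedSeq = 0;
    this.applied = [];
  }
  submit(op) {
    const { seq } = this.coordinator.commit(this.region, op);
    this.sync();
    return seq;
  }
  sync() {
    for (const entry of this.coordinator.entriesAfter(this.appliedSeq)) {
      this.applied.push(entry.op);
      this.appliedSeq = entry.seq;
    }
  }
}

const coordinator = new GlobalCoordinator();
const usServer = new RegionalServer('us', coordinator);
const euServer = new RegionalServer('eu', coordinator);

usServer.submit('insert "a"');
euServer.submit('insert "b"');
usServer.submit('insert "c"');
euServer.sync(); // EU catches up on the last US operation

console.log(usServer.applied.join(' | ') === euServer.applied.join(' | '));
```

In a real deployment the `commit` call is the cross-region round trip, which is where the 100-200ms budget for remote updates is spent.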
5. Synchronous vs Asynchronous Replication
Synchronous Replication
Chosen for Critical Data
Pros:
- No data loss on failure
- Immediate consistency
- Simpler recovery
- Guaranteed durability
- ACID compliance
Cons:
- Higher write latency (50-100ms)
- Reduced availability
- Network dependency
- Lower throughput
- More expensive
Use Cases:
- Document content writes
- Permission changes
- User authentication
Asynchronous Replication
Alternative for Non-Critical Data
Pros:
- Lower write latency (5-10ms)
- Higher availability
- Better throughput
- Network independence
- Cheaper infrastructure
Cons:
- Potential data loss (seconds)
- Temporary inconsistency
- Complex conflict resolution
- Harder to debug
- Replication lag issues
Use Cases:
- Analytics events
- Activity logs
- Presence updates
- View counts
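The difference between the two modes comes down to when the write is acknowledged. The in-memory simulation below is a rough sketch of that difference; `Replica`, `writeSync`, and `AsyncReplicator` are hypothetical names, and real systems add consensus, retries, and rollback of partial writes that are left out here.

```javascript
// In-memory stand-in for a storage replica.
class Replica {
  constructor(name) {
    this.name = name;
    this.up = true;
    this.entries = [];
  }
  write(entry) {
    if (!this.up) return false;
    this.entries.push(entry);
    return true;
  }
}

// Synchronous: acknowledge only after a majority of replicas stored the entry.
function writeSync(replicas, entry) {
  const acks = replicas.filter((r) => r.write(entry)).length;
  const majority = Math.floor(replicas.length / 2) + 1;
  return acks >= majority;
}

// Asynchronous: acknowledge after the primary write; followers catch up on flush().
class AsyncReplicator {
  constructor(primary, followers) {
    this.primary = primary;
    this.followers = followers;
    this.pending = [];
  }
  write(entry) {
    const ok = this.primary.write(entry);
    if (ok) this.pending.push(entry);
    return ok;
  }
  flush() {
    for (const entry of this.pending) {
      for (const follower of this.followers) follower.write(entry);
    }
    this.pending = [];
  }
}

// Sync path: one of three replicas is down, the write still reaches a majority.
const replicas = [new Replica('a'), new Replica('b'), new Replica('c')];
replicas[2].up = false;
const syncAck = writeSync(replicas, 'edit-1');

// Async path: the primary acknowledges, then fails before followers catch up.
const primary = new Replica('primary');
const follower = new Replica('follower');
const replicator = new AsyncReplicator(primary, [follower]);
const asyncAck = replicator.write('view-1');
primary.up = false;
const lostOnFollower = !follower.entries.includes('view-1');

console.log({ syncAck, asyncAck, lostOnFollower });
```

The async write was acknowledged yet exists only on the failed primary, which is the data loss window listed in the cons above.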
Replication Strategy
```
Critical Path (Synchronous):
User Edit → OT Server → Spanner (3 replicas, sync) → Acknowledge

Non-Critical Path (Asynchronous):
User View → API Server → Kafka → Cassandra (async) → Return
```
6. Optimistic vs Pessimistic Locking
Optimistic Locking
Chosen Approach
Pros:
- Better user experience (no blocking)
- Higher concurrency
- Lower latency
- Simpler implementation
- Scales better
Cons:
- Conflicts require resolution
- Rejected writes are lost unless retried or merged
- Retry logic needed
- More complex error handling
- User confusion on conflicts
Implementation:
```javascript
class ConflictError extends Error {}

// Optimistic locking with version
async function saveDocument(docId, content, version) {
  const result = await db.update({
    id: docId,
    content: content,
    version: version + 1,
    where: { version: version } // Only update if version matches
  });
  if (result.rowsAffected === 0) {
    throw new ConflictError('Document was modified by another user');
  }
  return version + 1; // New version for the caller's next save
}
```
Pessimistic Locking
Alternative for Critical Sections
Pros:
- No conflicts (guaranteed)
- Simpler conflict resolution
- Predictable behavior
- Data integrity guaranteed
- Easier to reason about
Cons:
- Poor user experience (blocking)
- Lower concurrency
- Higher latency
- Deadlock potential
- Doesn't scale well
Implementation:
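```javascript
// Pessimistic locking with mutex
async function saveDocument(docId, content) {
  // Wait up to 5 seconds for the lock; acquireLock is assumed to reject on timeout.
  const lock = await acquireLock(docId, { timeout: 5000 });
  try {
    await db.update({ id: docId, content: content });
  } finally {
    await releaseLock(lock); // Always release, even if the update throws
  }
}
```
Decision: Optimistic for Collaborative Editing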
Rationale:
- OT handles conflicts automatically
- Users expect real-time updates
- Blocking would break collaboration
- Conflicts are rare with OT
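When a version check does fail, the usual response is to re-read and retry rather than surface an error to the user. The sketch below illustrates that loop against an in-memory store. `VersionedStore` and `saveWithRetry` are hypothetical, and the conflicting write is injected by hand to force one retry.

```javascript
// In-memory store with a compare-and-set update keyed on version.
class VersionedStore {
  constructor() {
    this.docs = new Map();
  }
  read(id) {
    return this.docs.get(id) ?? { content: '', version: 0 };
  }
  update(id, content, expectedVersion) {
    if (this.read(id).version !== expectedVersion) return false; // Someone else won
    this.docs.set(id, { content, version: expectedVersion + 1 });
    return true;
  }
}

// Re-read and re-apply the edit until the version check passes.
function saveWithRetry(store, id, edit, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const { content, version } = store.read(id);
    if (store.update(id, edit(content), version)) return version + 1;
  }
  throw new Error(`Gave up after ${maxAttempts} conflicting attempts`);
}

const store = new VersionedStore();
store.update('doc1', 'Hello', 0);

// Simulate another user saving between our read and our write, once.
let interfered = false;
const newVersion = saveWithRetry(store, 'doc1', (text) => {
  if (!interfered) {
    interfered = true;
    store.update('doc1', 'Hello there', 1);
  }
  return text + '!';
});

console.log(newVersion, store.read('doc1').content);
```

The first attempt is rejected, and the second re-applies the edit on top of the other user's change, so neither update is lost.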
7. Monolithic vs Microservices
Microservices Architecture
Chosen Approach
Services:
- Document Service (CRUD operations)
- OT Service (operation transformation)
- Collaboration Service (WebSocket, presence)
- Storage Service (file uploads)
- Export Service (PDF, DOCX generation)
- Search Service (full-text search)
- Auth Service (authentication)
Pros:
- Independent scaling
- Technology flexibility
- Team autonomy
- Fault isolation
- Easier deployment
Cons:
- Distributed system complexity
- Network latency between services
- Data consistency challenges
- More operational overhead
- Debugging difficulty
Monolithic Architecture
Alternative for Simplicity
Pros:
- Simpler deployment
- Lower latency (in-process calls)
- Easier debugging
- Stronger consistency
- Lower operational cost
Cons:
- Scaling challenges
- Technology lock-in
- Team coordination needed
- Deployment risk (all-or-nothing)
- Harder to maintain at scale
Decision: Microservices for Scale
Rationale:
- Different services have different scaling needs
- OT service needs CPU, Storage needs I/O
- Independent deployment reduces risk
- Team autonomy improves velocity
8. SQL vs NoSQL for Document Storage
Spanner (NewSQL)
Chosen Approach
Pros:
- Strong consistency + horizontal scaling
- SQL interface (familiar)
- ACID transactions
- Global distribution
- Automatic sharding
Cons:
- Expensive ($300/TB/month)
- Higher latency (50-100ms)
- Vendor lock-in (Google Cloud)
- Complex operational model
- Overkill for simple use cases
PostgreSQL (SQL)
Alternative for Simplicity
Pros:
- Mature and stable
- Rich feature set
- Strong consistency
- Lower cost
- Better tooling
Cons:
- Vertical scaling limits
- Manual sharding needed
- No built-in global distribution
- Single writable primary; copies in other regions are read replicas
- Complex replication setup
MongoDB (NoSQL)
Alternative for Flexibility
Pros:
- Flexible schema
- Horizontal scaling
- Lower latency
- Simpler operations
- Lower cost
Cons:
- Eventual consistency
- No multi-document ACID transactions before version 4.0
- Complex query patterns
- Data duplication
- Consistency challenges
Decision Matrix
| Requirement | Spanner | PostgreSQL | MongoDB |
|---|---|---|---|
| Strong Consistency | ✓ | ✓ | ✗ |
| Global Distribution | ✓ | ✗ | ✓ |
| Horizontal Scaling | ✓ | ✗ | ✓ |
| ACID Transactions | ✓ | ✓ | ✓ (4.0+) |
| Cost Efficiency | ✗ | ✓ | ✓ |
| Operational Simplicity | ✗ | ✓ | ✓ |
Choice:
- Spanner for global consistency
- PostgreSQL for single-region deployments
- MongoDB for flexible schema needs
Summary of Key Decisions
| Decision | Chosen | Alternative | Rationale |
|---|---|---|---|
| Conflict Resolution | OT | CRDT | Better UX, deterministic |
| Consistency Model | Strong | Eventual | Data correctness critical |
| Real-time Protocol | WebSocket | SSE/HTTP/2 | Low latency, bidirectional |
| OT Architecture | Regional | Centralized | Balance latency/consistency |
| Replication | Sync | Async | Data loss is unacceptable |
| Locking Strategy | Optimistic | Pessimistic | Better concurrency |
| Architecture | Microservices | Monolithic | Independent scaling |
| Database | Spanner | PostgreSQL | Global distribution |
Each decision involves tradeoffs between performance, consistency, complexity, and cost. The choices reflect Google Docs' requirements for real-time collaboration, global scale, and strong consistency.