Scaling Considerations

📖 9 min read 📄 Part 6 of 10

Scaling Considerations for Google Docs

Overview

Scaling a collaborative document editing system to support 100M+ daily active users with real-time synchronization requires careful consideration of operational transform servers, document distribution, and global infrastructure.

1. Operational Transform Server Scaling

Challenge: OT Computation Complexity

  • Problem: O(n²) complexity for n concurrent operations
  • Impact: 100 concurrent editors = 10,000 transform operations
  • Bottleneck: Single OT server becomes CPU-bound at 50-100 concurrent editors

Solution: Hierarchical OT Architecture

┌─────────────────────────────────────────────────────────┐
│                    Global OT Coordinator                 │
│              (Handles cross-region conflicts)            │
└─────────────────────────────────────────────────────────┘
                            │
        ┌───────────────────┼───────────────────┐
        ▼                   ▼                   ▼
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│ Regional OT  │    │ Regional OT  │    │ Regional OT  │
│   Server     │    │   Server     │    │   Server     │
│  (US-East)   │    │  (EU-West)   │    │  (AP-South)  │
└──────────────┘    └──────────────┘    └──────────────┘
        │                   │                   │
    ┌───┴───┐           ┌───┴───┐           ┌───┴───┐
    ▼       ▼           ▼       ▼           ▼       ▼
  OT-1    OT-2        OT-3    OT-4        OT-5    OT-6
  (50     (50         (50     (50         (50     (50
  users)  users)      users)  users)      users)  users)

Implementation Strategy:

  • Document Affinity: Pin document to specific OT server
  • Load-Based Routing: Route new editors to least-loaded server
  • Hot Document Detection: Identify documents with >50 concurrent editors
  • Dynamic Splitting: Split hot documents across multiple OT servers with coordination layer

OT Server Capacity Planning

Single OT Server Capacity:
- 50 concurrent editors (optimal)
- 100 concurrent editors (maximum)
- 500 operations/second processing rate
- 16 CPU cores, 64GB RAM

Cluster Sizing:
- 100M DAU × 5% concurrent = 5M concurrent users
- Average 3 users per document = 1.67M active documents
- 1% hot documents (>10 editors) = 16,700 hot documents
- Hot documents need 16,700 × 2 = 33,400 OT servers
- Regular documents need 1.65M / 50 = 33,000 OT servers
- Total: ~66,000 OT servers globally

2. Document Sharding and Distribution

Sharding Strategy: Document ID-Based

-- Shard key: document_id
-- Sharding function: HASH(document_id) % num_shards

Shard Distribution:
- 1 billion documents
- 10,000 shards
- 100,000 documents per shard
- Shard size: ~50GB (500KB avg document)

Hot Shard Problem

Detection:

  • Monitor QPS per shard (threshold: 10,000 QPS)
  • Track concurrent editors per shard
  • Identify celebrity documents (viral docs with 1000+ editors)

Mitigation:

┌─────────────────────────────────────────────────────────┐
│                    Hot Document Handling                 │
└─────────────────────────────────────────────────────────┘

1. Read Replicas:
   Primary Shard → Read Replica 1 (read-only)
                → Read Replica 2 (read-only)
                → Read Replica 3 (read-only)

2. Caching Layer:
   - Cache document content in Redis (TTL: 5 minutes)
   - Cache recent operations in memory
   - Serve 90% of reads from cache

3. Rate Limiting:
   - Limit new editors to 1000 per document
   - Queue additional editors with "view-only" mode
   - Upgrade to edit mode when slots available

3. Global Distribution Architecture

Multi-Region Deployment

┌──────────────────────────────────────────────────────────┐
│                    Global Load Balancer                   │
│              (GeoDNS + Anycast Routing)                   │
└──────────────────────────────────────────────────────────┘
                            │
        ┌───────────────────┼───────────────────┐
        ▼                   ▼                   ▼
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│   US Region  │    │   EU Region  │    │  Asia Region │
│              │    │              │    │              │
│ • 5 zones    │    │ • 4 zones    │    │ • 6 zones    │
│ • 40% users  │    │ • 30% users  │    │ • 30% users  │
└──────────────┘    └──────────────┘    └──────────────┘

Cross-Region Synchronization

Challenge: Speed of light latency

  • US-East to EU-West: ~80ms RTT
  • US-East to Asia-Pacific: ~180ms RTT
  • EU-West to Asia-Pacific: ~200ms RTT

Strategy: Regional OT with Global Coordination

User in US edits document:
1. Operation sent to US Regional OT Server (5ms)
2. Operation applied locally and acknowledged (10ms)
3. Operation replicated to Spanner (cross-region) (50-100ms)
4. Other regions pull updates asynchronously (100-200ms)

Result:
- Local users see changes in 15ms
- Remote users see changes in 100-200ms
- Acceptable for collaborative editing

4. WebSocket Connection Scaling

Connection Distribution

Target: 5M concurrent WebSocket connections

Per Server Capacity:
- 50,000 connections per server (C5.4xlarge)
- 8 vCPUs, 32GB RAM
- Network bandwidth: 5 Gbps

Required Servers:
- 5M / 50K = 100 WebSocket servers per region
- 3 regions × 100 = 300 servers globally
- Add 50% buffer = 450 servers total

Connection Management Strategy

// Sticky session routing
const getWebSocketServer = (userId) => {
  const hash = murmurhash(userId);
  const serverIndex = hash % numServers;
  return webSocketServers[serverIndex];
};

// Connection pooling
class WebSocketPool {
  constructor(maxConnections = 50000) {
    this.connections = new Map();
    this.maxConnections = maxConnections;
  }
  
  addConnection(userId, ws) {
    if (this.connections.size >= this.maxConnections) {
      throw new Error('Connection limit reached');
    }
    this.connections.set(userId, ws);
  }
  
  removeConnection(userId) {
    this.connections.delete(userId);
  }
  
  broadcast(documentId, operation) {
    // Send to all users editing this document
    const editors = this.getDocumentEditors(documentId);
    editors.forEach(userId => {
      const ws = this.connections.get(userId);
      if (ws && ws.readyState === WebSocket.OPEN) {
        ws.send(JSON.stringify(operation));
      }
    });
  }
}

5. Database Scaling

Spanner Scaling for Document State

Configuration:
- 3 regions (US, EU, Asia)
- 5 nodes per region
- 15 nodes total
- 10TB storage per node
- 150TB total capacity

Capacity:
- 1 billion documents × 100KB avg = 100TB
- Fits within 150TB capacity
- Headroom for growth: 50TB

Performance:
- 10,000 QPS per node
- 15 nodes × 10K = 150,000 QPS total
- Actual load: 50,000 QPS (33% utilization)

Bigtable Scaling for Revision History

Configuration:
- 3 clusters (US, EU, Asia)
- 30 nodes per cluster
- 90 nodes total
- 8TB SSD per node
- 720TB total capacity

Capacity:
- 1 billion documents × 50 revisions avg = 50B revisions
- 50B × 10KB avg = 500TB
- Fits within 720TB capacity

Performance:
- 10,000 QPS per node (reads)
- 90 nodes × 10K = 900,000 QPS total
- Actual load: 100,000 QPS (11% utilization)

Redis Scaling for OT State

Configuration:
- Redis Cluster with 100 shards
- 3 replicas per shard
- 300 Redis instances total
- 64GB RAM per instance
- 19.2TB total memory

Capacity:
- 1.67M active documents × 10KB state = 16.7GB
- Fits easily in 19.2TB capacity
- 99.9% memory available for caching

Performance:
- 100,000 ops/sec per instance
- 300 instances × 100K = 30M ops/sec
- Actual load: 5M ops/sec (17% utilization)

6. Caching Strategy

Multi-Level Cache Hierarchy

L1 Cache (Client-Side):
- IndexedDB for offline document storage
- 100MB per user
- Cache hit rate: 60%

L2 Cache (Redis):
- Document content (TTL: 5 minutes)
- Recent operations (TTL: 1 minute)
- User presence (TTL: 30 seconds)
- Cache hit rate: 35%

L3 Cache (CDN):
- Static assets (images, fonts)
- Document exports (PDF, DOCX)
- Cache hit rate: 90%

Overall Cache Hit Rate: 95%
Database Load Reduction: 20x

Cache Invalidation Strategy

// Write-through cache for document updates
async function updateDocument(docId, operation) {
  // 1. Apply operation to database
  await spanner.applyOperation(docId, operation);
  
  // 2. Invalidate cache
  await redis.del(`doc:${docId}`);
  
  // 3. Broadcast to active editors
  await broadcastOperation(docId, operation);
  
  // 4. Update cache with new version (optional)
  const newDoc = await spanner.getDocument(docId);
  await redis.setex(`doc:${docId}`, 300, JSON.stringify(newDoc));
}

7. Autoscaling Configuration

OT Server Autoscaling

autoscaling:
  metric: cpu_utilization
  target: 70%
  min_instances: 1000
  max_instances: 100000
  scale_up:
    threshold: 80%
    cooldown: 60s
    increment: 10%
  scale_down:
    threshold: 50%
    cooldown: 300s
    decrement: 5%

WebSocket Server Autoscaling

autoscaling:
  metric: connection_count
  target: 40000  # 80% of 50K capacity
  min_instances: 100
  max_instances: 1000
  scale_up:
    threshold: 45000
    cooldown: 30s
    increment: 20%
  scale_down:
    threshold: 30000
    cooldown: 600s
    decrement: 10%

8. Performance Optimization

Operation Batching

// Batch multiple operations to reduce network overhead
class OperationBatcher {
  constructor(flushInterval = 50) {
    this.operations = [];
    this.flushInterval = flushInterval;
    this.timer = null;
  }
  
  addOperation(op) {
    this.operations.push(op);
    
    if (!this.timer) {
      this.timer = setTimeout(() => this.flush(), this.flushInterval);
    }
    
    // Flush immediately if batch is large
    if (this.operations.length >= 10) {
      this.flush();
    }
  }
  
  flush() {
    if (this.operations.length === 0) return;
    
    const batch = this.operations;
    this.operations = [];
    clearTimeout(this.timer);
    this.timer = null;
    
    // Send batch to server
    this.sendBatch(batch);
  }
}

Compression

// Compress operations before transmission
const compressOperation = (operation) => {
  const json = JSON.stringify(operation);
  const compressed = pako.deflate(json);
  return compressed;
};

// Typical compression ratios:
// - Text operations: 5:1
// - JSON metadata: 3:1
// - Network bandwidth savings: 70%

9. Monitoring and Alerting

Key Metrics

Performance Metrics:
- Operation latency (p50, p95, p99)
- WebSocket connection count
- OT server CPU utilization
- Database query latency
- Cache hit rate

Business Metrics:
- Active documents
- Concurrent editors per document
- Operations per second
- Document creation rate
- User engagement time

Alerts:
- OT server CPU > 80% (scale up)
- WebSocket connections > 45K per server (scale up)
- Operation latency p99 > 500ms (investigate)
- Cache hit rate < 90% (tune cache)
- Database errors > 0.1% (critical)

10. Cost Optimization

Infrastructure Costs (Monthly)

Compute:
- OT servers: 66,000 × $150 = $9.9M
- WebSocket servers: 450 × $150 = $67.5K
- API servers: 1,000 × $150 = $150K
Total Compute: $10.1M/month

Storage:
- Spanner: 150TB × $300/TB = $45K
- Bigtable: 720TB × $170/TB = $122K
- Redis: 19.2TB × $500/TB = $9.6M
Total Storage: $9.8M/month

Network:
- Egress: 10PB × $50/TB = $500K
- Cross-region: 5PB × $20/TB = $100K
Total Network: $600K/month

Grand Total: $20.5M/month
Cost per DAU: $0.205

Optimization Strategies

  1. Reserved Instances: 40% savings on compute
  2. Spot Instances: Use for non-critical OT servers
  3. Compression: Reduce storage and network costs by 70%
  4. Tiered Storage: Move old revisions to cold storage
  5. Regional Optimization: Deploy more capacity in high-usage regions

Summary

Scaling Google Docs to 100M+ DAU requires:

  • 66,000 OT servers for real-time collaboration
  • Hierarchical OT architecture to handle hot documents
  • Multi-region deployment with regional OT coordination
  • 450 WebSocket servers for 5M concurrent connections
  • Spanner + Bigtable for globally distributed storage
  • Multi-level caching achieving 95% cache hit rate
  • Autoscaling based on CPU and connection metrics
  • $20.5M/month infrastructure cost ($0.205 per DAU)

The key to scaling is distributing OT computation, using regional coordination for global documents, and aggressive caching to reduce database load.