Distributed Tracing System - Scale and Constraints

Overview

A distributed tracing system at scale must handle the telemetry output of thousands of microservices, each generating spans for every request they process. The scale challenges are unique: write-heavy workloads with append-only semantics, time-bounded queries, and the need to reconstruct full request paths from individually reported spans. Systems like Jaeger at Uber, Zipkin at Twitter, and Datadog APM handle millions of spans per second in production.

Span Volume Estimation

Source Traffic Analysis

Infrastructure Assumptions:
- Microservices: 2,000 services in production
- Average requests per service: 500 req/sec
- Total system requests: 1,000,000 req/sec (with fan-out)
- Average spans per request: 8-12 (depth of call chain)
- Peak multiplier: 3x during traffic spikes

Span Generation Rate:
- Steady state: 1,000,000 req/sec × 10 spans/req = 10M spans/sec generated
- Peak: 30M spans/sec generated
- Daily total: 10M × 86,400 = 864 billion spans/day generated

Span Size Breakdown

Minimal Span (structured binary, e.g., Thrift/Protobuf):
- trace_id: 16 bytes (128-bit)
- span_id: 8 bytes (64-bit)
- parent_span_id: 8 bytes
- operation_name: 50 bytes average
- service_name: 30 bytes average
- start_time: 8 bytes (epoch microseconds)
- duration: 8 bytes (microseconds)
- status/flags: 4 bytes
- Subtotal: ~132 bytes

Tags (key-value pairs, average 5 per span):
- http.method: 10 bytes
- http.status_code: 8 bytes
- http.url: 100 bytes
- peer.service: 30 bytes
- component: 20 bytes
- Subtotal: ~170 bytes

Logs/Events (average 1-2 per span):
- timestamp: 8 bytes
- event message: 100 bytes
- Subtotal: ~120 bytes

Process/Resource Info:
- hostname: 30 bytes
- ip: 15 bytes
- version: 20 bytes
- Subtotal: ~65 bytes

Total per span (binary): ~500 bytes
Total per span (JSON wire format): ~1.2 KB
Total per span (with request/response bodies): ~5 KB (when enabled)

Trace Volume and Structure

Trace Characteristics

Trace Composition:
- Average spans per trace: 10
- Median spans per trace: 6
- P99 spans per trace: 50 (complex workflows)
- P99.9 spans per trace: 200+ (batch operations)
- Maximum trace depth: 15-20 levels
- Maximum trace width (fan-out): 100+ parallel spans

Trace Duration Distribution:
- P50: 50ms
- P90: 200ms
- P95: 500ms
- P99: 2,000ms
- P99.9: 10,000ms (long-running operations)

Trace Volume:
- Total traces/sec: 1,000,000 (one trace per user request)
- After head-based sampling (1%): 10,000 traces/sec stored
- After tail-based sampling (error + slow): 15,000 traces/sec stored

Trace Assembly Complexity

Span Arrival Patterns:
- Spans from a single trace arrive over 0-30 second window
- Out-of-order arrival is common (network delays)
- Late spans (>60 seconds) occur for async operations
- Orphan spans (missing parent): ~0.1% of all spans

Assembly Requirements:
- Buffer window: 60 seconds for trace completion
- Memory per in-flight trace: ~5 KB (10 spans × 500 bytes)
- Concurrent in-flight traces: 60 sec × 10,000 traces/sec = 600,000
- Memory for assembly buffer: 600,000 × 5 KB = 3 GB

Storage Requirements

Raw Span Storage

With 1% Head-Based Sampling:
- Stored spans/sec: 100,000
- Span size (compressed): 200 bytes (2.5:1 compression ratio)
- Write throughput: 100,000 × 200 bytes = 20 MB/sec
- Daily storage: 20 MB/sec × 86,400 = 1.7 TB/day
- Monthly storage: 1.7 TB × 30 = 51 TB/month

With Tail-Based Sampling (keeps errors + slow + 0.5% random):
- Stored spans/sec: 150,000
- Daily storage: 2.6 TB/day
- Monthly storage: 78 TB/month

Replication (3x for durability):
- Effective monthly storage: 51-78 TB × 3 = 153-234 TB/month

Index Storage

Primary Indexes:
- Service + time index: ~10% of raw data = 5-8 TB/month
- Operation + time index: ~8% of raw data = 4-6 TB/month
- Duration index: ~5% of raw data = 2.5-4 TB/month
- Tag indexes (selective): ~15% of raw data = 8-12 TB/month
- Total index overhead: ~38% of raw data

Secondary Indexes:
- Full-text search (Elasticsearch): 2x raw data size
- Service dependency graph: ~100 MB (aggregated)
- Pre-computed metrics: ~500 GB/month

Retention Policies

Tiered Retention Strategy:
┌─────────────────────────────────────────────────────────┐
│ Tier      │ Duration │ Data Type        │ Storage       │
├─────────────────────────────────────────────────────────┤
│ Hot       │ 2 days   │ Full spans       │ SSD/NVMe     │
│ Warm      │ 7 days   │ Full spans       │ SSD          │
│ Standard  │ 30 days  │ Full spans       │ HDD          │
│ Cold      │ 90 days  │ Sampled spans    │ Object store │
│ Archive   │ 1 year   │ Aggregated only  │ S3/GCS       │
└─────────────────────────────────────────────────────────┘

Total Active Storage (30-day full retention):
- Raw spans: 51 TB
- Indexes: 20 TB
- Replicated: 213 TB total cluster storage

Query Load and Latency Targets

Query Patterns

Query Type Distribution:
- Trace by ID lookup: 60% of queries (direct link from logs/alerts)
- Search by service + time: 20% of queries
- Search by operation + duration: 10% of queries
- Tag-based search: 5% of queries
- Dependency graph queries: 3% of queries
- Analytics/aggregation: 2% of queries

Query Volume:
- Total queries/sec: 500 (from developers, dashboards, alerts)
- Peak queries/sec: 2,000 (during incidents)
- Concurrent users: 200 (normal), 1,000 (incident)

Latency Targets

Query Latency SLOs:
┌──────────────────────────────────────────────────────┐
│ Query Type              │ P50    │ P95    │ P99      │
├──────────────────────────────────────────────────────┤
│ Trace by ID            │ 50ms   │ 200ms  │ 500ms   │
│ Service search (1hr)   │ 200ms  │ 1s     │ 3s      │
│ Tag search (1hr)       │ 500ms  │ 2s     │ 5s      │
│ Duration filter        │ 200ms  │ 1s     │ 3s      │
│ Dependency graph       │ 100ms  │ 500ms  │ 1s      │
│ Analytics aggregation  │ 1s     │ 5s     │ 10s     │
└──────────────────────────────────────────────────────┘

Trace Assembly Latency:
- Small trace (5 spans): 10ms
- Medium trace (20 spans): 50ms
- Large trace (100+ spans): 200ms
- Very large trace (1000+ spans): 1-2s

Sampling Rates and Storage Impact

Sampling Strategy Comparison

Strategy 1: Fixed 1% Head-Based Sampling
- Spans stored: 100K/sec
- Storage: 1.7 TB/day
- Cost: $$$
- Coverage: Random, may miss errors
- Latency overhead: Minimal (decision at trace start)

Strategy 2: Adaptive Sampling
- High-traffic services (>10K req/sec): 0.1% sampling
- Medium-traffic services (1K-10K req/sec): 1% sampling
- Low-traffic services (<1K req/sec): 10% sampling
- Error traces: 100% sampling
- Slow traces (>P99): 100% sampling
- Effective storage: 2.5 TB/day
- Coverage: Better error/latency visibility

Strategy 3: Tail-Based Sampling (1% random + all errors + all slow)
- Random sample: 100K spans/sec
- Error traces: ~5K spans/sec (0.5% error rate × full trace)
- Slow traces: ~10K spans/sec (P99 traces)
- Total: ~150K spans/sec
- Storage: 2.6 TB/day
- Coverage: Best for debugging, higher cost

Strategy 4: Dynamic Rate Limiting
- Budget: 200K spans/sec maximum
- Distribute budget across services proportionally
- Priority: errors > slow > random
- Storage: Fixed at 3.5 TB/day
- Coverage: Predictable cost, good coverage

Storage Savings Analysis

Full Fidelity (no sampling):
- 10M spans/sec × 500 bytes = 5 GB/sec = 432 TB/day
- Completely impractical for storage and cost

1% Sampling:
- 100K spans/sec = 1.7 TB/day (99.6% reduction)

0.1% Sampling:
- 10K spans/sec = 170 GB/day (99.96% reduction)

Compression Impact (LZ4/Snappy):
- Typical ratio: 3:1 to 5:1 for span data
- 1.7 TB/day → 400-570 GB/day after compression

Network Bandwidth Requirements

Agent to Collector

Per-Host Agent Traffic:
- Spans generated per host: 500-5,000 spans/sec
- Span size (protobuf): 500 bytes
- Raw bandwidth: 250 KB/sec - 2.5 MB/sec per host
- With batching (100 spans/batch): 5-50 batches/sec
- Batch size: ~50 KB
- Actual bandwidth (with headers): 300 KB/sec - 3 MB/sec per host

Fleet-Wide Agent Traffic:
- Hosts: 10,000
- Total agent → collector bandwidth: 3-30 GB/sec
- After sampling at agent (1%): 30-300 MB/sec
- Network overhead (TCP, TLS): +15%
- Effective bandwidth needed: 35-345 MB/sec

Collector to Storage

Collector Output:
- After processing/enrichment: spans grow ~20% (added metadata)
- Collector → Kafka: 120K spans/sec × 600 bytes = 72 MB/sec
- Kafka → Storage: 72 MB/sec (same throughput)
- With replication (3x Kafka): 216 MB/sec internal Kafka traffic

Collector → Elasticsearch (for indexing):
- Index documents: 120K docs/sec × 800 bytes = 96 MB/sec
- Bulk API batching reduces overhead

Cross-Region Traffic (if centralized):
- Region → Central: 20-50 MB/sec per region
- 5 regions: 100-250 MB/sec cross-region
- Cost: $0.02/GB = $170-430K/month for cross-region transfer

Compute Requirements

Collection Tier

Collector Instances:
- Throughput per collector: 50K spans/sec (CPU-bound on validation)
- CPU per collector: 4 cores (validation, enrichment, batching)
- Memory per collector: 8 GB (buffering, in-flight batches)
- Collectors needed: 100K spans/sec ÷ 50K = 2 collectors minimum
- With headroom (3x): 6 collectors
- Instance type: c5.2xlarge or equivalent

Collector Processing Pipeline:
- Deserialize: 0.5μs per span
- Validate: 1μs per span
- Enrich (add process info): 0.5μs per span
- Sample decision: 0.2μs per span
- Serialize to batch: 0.3μs per span
- Total: ~3μs per span per core
- Throughput: ~330K spans/sec per core

Indexing Tier

Elasticsearch Cluster (for search):
- Index rate: 100K documents/sec
- Nodes needed: 10 data nodes (10K docs/sec each)
- CPU per node: 16 cores
- Memory per node: 64 GB (JVM heap: 31 GB)
- Storage per node: 4 TB NVMe SSD
- Instance type: r5.4xlarge or i3.4xlarge

ClickHouse Cluster (for analytics):
- Insert rate: 100K rows/sec
- Nodes needed: 3 nodes
- CPU per node: 32 cores
- Memory per node: 128 GB
- Storage per node: 10 TB NVMe

Query Serving Tier

Query Nodes:
- Concurrent queries: 500
- Query complexity: varies (simple lookup to complex aggregation)
- Nodes needed: 5 query nodes
- CPU per node: 8 cores
- Memory per node: 32 GB (caching hot traces)
- Cache hit rate target: 60% for trace-by-ID queries

API Gateway:
- Requests/sec: 2,000 peak
- Instances: 3 (with load balancer)
- CPU per instance: 2 cores
- Memory per instance: 4 GB

Cost Estimation at Scale

Monthly Cost Breakdown

Compute Costs:
┌─────────────────────────────────────────────────────────────┐
│ Component          │ Instances │ Type        │ Monthly Cost  │
├─────────────────────────────────────────────────────────────┤
│ Collectors         │ 6         │ c5.2xlarge  │ $1,500        │
│ Kafka Brokers      │ 6         │ r5.2xlarge  │ $3,600        │
│ ES Data Nodes      │ 10        │ r5.4xlarge  │ $12,000       │
│ ES Master Nodes    │ 3         │ c5.xlarge   │ $450          │
│ Cassandra Nodes    │ 9         │ i3.2xlarge  │ $9,000        │
│ Query Servers      │ 5         │ c5.2xlarge  │ $1,250        │
│ API Gateway        │ 3         │ c5.xlarge   │ $450          │
│ ClickHouse         │ 3         │ r5.8xlarge  │ $5,400        │
├─────────────────────────────────────────────────────────────┤
│ Compute Subtotal   │           │             │ $33,650       │
└─────────────────────────────────────────────────────────────┘

Storage Costs:
┌─────────────────────────────────────────────────────────────┐
│ Storage Type       │ Volume    │ Unit Cost   │ Monthly Cost  │
├─────────────────────────────────────────────────────────────┤
│ SSD (hot/warm)     │ 50 TB     │ $0.10/GB    │ $5,000        │
│ HDD (standard)     │ 100 TB    │ $0.04/GB    │ $4,000        │
│ S3 (cold/archive)  │ 200 TB    │ $0.023/GB   │ $4,600        │
│ Kafka storage      │ 5 TB      │ $0.10/GB    │ $500          │
├─────────────────────────────────────────────────────────────┤
│ Storage Subtotal   │           │             │ $14,100       │
└─────────────────────────────────────────────────────────────┘

Network Costs:
┌─────────────────────────────────────────────────────────────┐
│ Traffic Type       │ Volume    │ Unit Cost   │ Monthly Cost  │
├─────────────────────────────────────────────────────────────┤
│ Cross-AZ          │ 50 TB     │ $0.01/GB    │ $500          │
│ Cross-Region      │ 10 TB     │ $0.02/GB    │ $200          │
│ Internet egress   │ 1 TB      │ $0.09/GB    │ $90           │
├─────────────────────────────────────────────────────────────┤
│ Network Subtotal   │           │             │ $790          │
└─────────────────────────────────────────────────────────────┘

Total Monthly Cost: ~$48,540
Annual Cost: ~$582,500

Cost per million spans stored: ~$0.33
Cost per trace stored: ~$0.000003

Cost Optimization Levers

1. Aggressive Sampling (0.1% vs 1%):
   - Reduces storage 10x: saves ~$12,000/month
   - Reduces compute 5x: saves ~$15,000/month
   - Trade-off: less debugging visibility

2. Compression (LZ4 on storage):
   - Reduces storage 3-5x: saves ~$8,000/month
   - Minimal CPU overhead

3. Tiered Storage (move to S3 after 7 days):
   - Reduces SSD usage 75%: saves ~$3,750/month
   - Slightly slower queries for old data

4. Reserved Instances (1-year commitment):
   - 30-40% discount on compute: saves ~$10,000/month

5. Spot Instances for batch processing:
   - 60-70% discount for analytics workloads
   - Saves ~$3,000/month on ClickHouse batch jobs

Key Constraints Summary

Hard Constraints:
- Span ingestion must not add >1ms latency to application requests
- Agent CPU overhead must be <2% of host CPU
- Agent memory overhead must be <100 MB per host
- Trace retrieval by ID must complete in <500ms (P99)
- System must handle 3x traffic spikes without data loss
- No single point of failure in collection pipeline

Soft Constraints:
- Search queries should complete in <3s (P95)
- Sampling decisions should be consistent per trace
- Cross-service context propagation overhead <100μs
- Dashboard refresh latency <5s
- Alert evaluation latency <30s from span arrival

This scale analysis provides the foundation for capacity planning and architecture decisions in a production distributed tracing system comparable to Uber's Jaeger deployment or Datadog's APM infrastructure.