Distributed Tracing System - Scale and Constraints
Overview
A distributed tracing system at scale must handle the telemetry output of thousands of microservices, each generating spans for every request they process. The scale challenges are unique: write-heavy workloads with append-only semantics, time-bounded queries, and the need to reconstruct full request paths from individually reported spans. Systems like Jaeger at Uber, Zipkin at Twitter, and Datadog APM handle millions of spans per second in production.
Span Volume Estimation
Source Traffic Analysis
Infrastructure Assumptions:
- Microservices: 2,000 services in production
- Average requests per service: 500 req/sec
- Total system requests: 1,000,000 req/sec (with fan-out)
- Average spans per request: 8-12 (depth of call chain)
- Peak multiplier: 3x during traffic spikes
Span Generation Rate:
- Steady state: 1,000,000 req/sec × 10 spans/req = 10M spans/sec generated
- Peak: 30M spans/sec generated
- Daily total: 10M × 86,400 = 864 billion spans/day generatedSpan Size Breakdown
Minimal Span (structured binary, e.g., Thrift/Protobuf):
- trace_id: 16 bytes (128-bit)
- span_id: 8 bytes (64-bit)
- parent_span_id: 8 bytes
- operation_name: 50 bytes average
- service_name: 30 bytes average
- start_time: 8 bytes (epoch microseconds)
- duration: 8 bytes (microseconds)
- status/flags: 4 bytes
- Subtotal: ~132 bytes
Tags (key-value pairs, average 5 per span):
- http.method: 10 bytes
- http.status_code: 8 bytes
- http.url: 100 bytes
- peer.service: 30 bytes
- component: 20 bytes
- Subtotal: ~170 bytes
Logs/Events (average 1-2 per span):
- timestamp: 8 bytes
- event message: 100 bytes
- Subtotal: ~120 bytes
Process/Resource Info:
- hostname: 30 bytes
- ip: 15 bytes
- version: 20 bytes
- Subtotal: ~65 bytes
Total per span (binary): ~500 bytes
Total per span (JSON wire format): ~1.2 KB
Total per span (with request/response bodies): ~5 KB (when enabled)Trace Volume and Structure
Trace Characteristics
Trace Composition:
- Average spans per trace: 10
- Median spans per trace: 6
- P99 spans per trace: 50 (complex workflows)
- P99.9 spans per trace: 200+ (batch operations)
- Maximum trace depth: 15-20 levels
- Maximum trace width (fan-out): 100+ parallel spans
Trace Duration Distribution:
- P50: 50ms
- P90: 200ms
- P95: 500ms
- P99: 2,000ms
- P99.9: 10,000ms (long-running operations)
Trace Volume:
- Total traces/sec: 1,000,000 (one trace per user request)
- After head-based sampling (1%): 10,000 traces/sec stored
- After tail-based sampling (error + slow): 15,000 traces/sec storedTrace Assembly Complexity
Span Arrival Patterns:
- Spans from a single trace arrive over 0-30 second window
- Out-of-order arrival is common (network delays)
- Late spans (>60 seconds) occur for async operations
- Orphan spans (missing parent): ~0.1% of all spans
Assembly Requirements:
- Buffer window: 60 seconds for trace completion
- Memory per in-flight trace: ~5 KB (10 spans × 500 bytes)
- Concurrent in-flight traces: 60 sec × 10,000 traces/sec = 600,000
- Memory for assembly buffer: 600,000 × 5 KB = 3 GBStorage Requirements
Raw Span Storage
With 1% Head-Based Sampling:
- Stored spans/sec: 100,000
- Span size (compressed): 200 bytes (2.5:1 compression ratio)
- Write throughput: 100,000 × 200 bytes = 20 MB/sec
- Daily storage: 20 MB/sec × 86,400 = 1.7 TB/day
- Monthly storage: 1.7 TB × 30 = 51 TB/month
With Tail-Based Sampling (keeps errors + slow + 0.5% random):
- Stored spans/sec: 150,000
- Daily storage: 2.6 TB/day
- Monthly storage: 78 TB/month
Replication (3x for durability):
- Effective monthly storage: 51-78 TB × 3 = 153-234 TB/monthIndex Storage
Primary Indexes:
- Service + time index: ~10% of raw data = 5-8 TB/month
- Operation + time index: ~8% of raw data = 4-6 TB/month
- Duration index: ~5% of raw data = 2.5-4 TB/month
- Tag indexes (selective): ~15% of raw data = 8-12 TB/month
- Total index overhead: ~38% of raw data
Secondary Indexes:
- Full-text search (Elasticsearch): 2x raw data size
- Service dependency graph: ~100 MB (aggregated)
- Pre-computed metrics: ~500 GB/monthRetention Policies
Tiered Retention Strategy:
┌─────────────────────────────────────────────────────────┐
│ Tier │ Duration │ Data Type │ Storage │
├─────────────────────────────────────────────────────────┤
│ Hot │ 2 days │ Full spans │ SSD/NVMe │
│ Warm │ 7 days │ Full spans │ SSD │
│ Standard │ 30 days │ Full spans │ HDD │
│ Cold │ 90 days │ Sampled spans │ Object store │
│ Archive │ 1 year │ Aggregated only │ S3/GCS │
└─────────────────────────────────────────────────────────┘
Total Active Storage (30-day full retention):
- Raw spans: 51 TB
- Indexes: 20 TB
- Replicated: 213 TB total cluster storageQuery Load and Latency Targets
Query Patterns
Query Type Distribution:
- Trace by ID lookup: 60% of queries (direct link from logs/alerts)
- Search by service + time: 20% of queries
- Search by operation + duration: 10% of queries
- Tag-based search: 5% of queries
- Dependency graph queries: 3% of queries
- Analytics/aggregation: 2% of queries
Query Volume:
- Total queries/sec: 500 (from developers, dashboards, alerts)
- Peak queries/sec: 2,000 (during incidents)
- Concurrent users: 200 (normal), 1,000 (incident)Latency Targets
Query Latency SLOs:
┌──────────────────────────────────────────────────────┐
│ Query Type │ P50 │ P95 │ P99 │
├──────────────────────────────────────────────────────┤
│ Trace by ID │ 50ms │ 200ms │ 500ms │
│ Service search (1hr) │ 200ms │ 1s │ 3s │
│ Tag search (1hr) │ 500ms │ 2s │ 5s │
│ Duration filter │ 200ms │ 1s │ 3s │
│ Dependency graph │ 100ms │ 500ms │ 1s │
│ Analytics aggregation │ 1s │ 5s │ 10s │
└──────────────────────────────────────────────────────┘
Trace Assembly Latency:
- Small trace (5 spans): 10ms
- Medium trace (20 spans): 50ms
- Large trace (100+ spans): 200ms
- Very large trace (1000+ spans): 1-2sSampling Rates and Storage Impact
Sampling Strategy Comparison
Strategy 1: Fixed 1% Head-Based Sampling
- Spans stored: 100K/sec
- Storage: 1.7 TB/day
- Cost: $$$
- Coverage: Random, may miss errors
- Latency overhead: Minimal (decision at trace start)
Strategy 2: Adaptive Sampling
- High-traffic services (>10K req/sec): 0.1% sampling
- Medium-traffic services (1K-10K req/sec): 1% sampling
- Low-traffic services (<1K req/sec): 10% sampling
- Error traces: 100% sampling
- Slow traces (>P99): 100% sampling
- Effective storage: 2.5 TB/day
- Coverage: Better error/latency visibility
Strategy 3: Tail-Based Sampling (1% random + all errors + all slow)
- Random sample: 100K spans/sec
- Error traces: ~5K spans/sec (0.5% error rate × full trace)
- Slow traces: ~10K spans/sec (P99 traces)
- Total: ~150K spans/sec
- Storage: 2.6 TB/day
- Coverage: Best for debugging, higher cost
Strategy 4: Dynamic Rate Limiting
- Budget: 200K spans/sec maximum
- Distribute budget across services proportionally
- Priority: errors > slow > random
- Storage: Fixed at 3.5 TB/day
- Coverage: Predictable cost, good coverageStorage Savings Analysis
Full Fidelity (no sampling):
- 10M spans/sec × 500 bytes = 5 GB/sec = 432 TB/day
- Completely impractical for storage and cost
1% Sampling:
- 100K spans/sec = 1.7 TB/day (99.6% reduction)
0.1% Sampling:
- 10K spans/sec = 170 GB/day (99.96% reduction)
Compression Impact (LZ4/Snappy):
- Typical ratio: 3:1 to 5:1 for span data
- 1.7 TB/day → 400-570 GB/day after compressionNetwork Bandwidth Requirements
Agent to Collector
Per-Host Agent Traffic:
- Spans generated per host: 500-5,000 spans/sec
- Span size (protobuf): 500 bytes
- Raw bandwidth: 250 KB/sec - 2.5 MB/sec per host
- With batching (100 spans/batch): 5-50 batches/sec
- Batch size: ~50 KB
- Actual bandwidth (with headers): 300 KB/sec - 3 MB/sec per host
Fleet-Wide Agent Traffic:
- Hosts: 10,000
- Total agent → collector bandwidth: 3-30 GB/sec
- After sampling at agent (1%): 30-300 MB/sec
- Network overhead (TCP, TLS): +15%
- Effective bandwidth needed: 35-345 MB/secCollector to Storage
Collector Output:
- After processing/enrichment: spans grow ~20% (added metadata)
- Collector → Kafka: 120K spans/sec × 600 bytes = 72 MB/sec
- Kafka → Storage: 72 MB/sec (same throughput)
- With replication (3x Kafka): 216 MB/sec internal Kafka traffic
Collector → Elasticsearch (for indexing):
- Index documents: 120K docs/sec × 800 bytes = 96 MB/sec
- Bulk API batching reduces overhead
Cross-Region Traffic (if centralized):
- Region → Central: 20-50 MB/sec per region
- 5 regions: 100-250 MB/sec cross-region
- Cost: $0.02/GB = $170-430K/month for cross-region transferCompute Requirements
Collection Tier
Collector Instances:
- Throughput per collector: 50K spans/sec (CPU-bound on validation)
- CPU per collector: 4 cores (validation, enrichment, batching)
- Memory per collector: 8 GB (buffering, in-flight batches)
- Collectors needed: 100K spans/sec ÷ 50K = 2 collectors minimum
- With headroom (3x): 6 collectors
- Instance type: c5.2xlarge or equivalent
Collector Processing Pipeline:
- Deserialize: 0.5μs per span
- Validate: 1μs per span
- Enrich (add process info): 0.5μs per span
- Sample decision: 0.2μs per span
- Serialize to batch: 0.3μs per span
- Total: ~3μs per span per core
- Throughput: ~330K spans/sec per coreIndexing Tier
Elasticsearch Cluster (for search):
- Index rate: 100K documents/sec
- Nodes needed: 10 data nodes (10K docs/sec each)
- CPU per node: 16 cores
- Memory per node: 64 GB (JVM heap: 31 GB)
- Storage per node: 4 TB NVMe SSD
- Instance type: r5.4xlarge or i3.4xlarge
ClickHouse Cluster (for analytics):
- Insert rate: 100K rows/sec
- Nodes needed: 3 nodes
- CPU per node: 32 cores
- Memory per node: 128 GB
- Storage per node: 10 TB NVMeQuery Serving Tier
Query Nodes:
- Concurrent queries: 500
- Query complexity: varies (simple lookup to complex aggregation)
- Nodes needed: 5 query nodes
- CPU per node: 8 cores
- Memory per node: 32 GB (caching hot traces)
- Cache hit rate target: 60% for trace-by-ID queries
API Gateway:
- Requests/sec: 2,000 peak
- Instances: 3 (with load balancer)
- CPU per instance: 2 cores
- Memory per instance: 4 GBCost Estimation at Scale
Monthly Cost Breakdown
Compute Costs:
┌─────────────────────────────────────────────────────────────┐
│ Component │ Instances │ Type │ Monthly Cost │
├─────────────────────────────────────────────────────────────┤
│ Collectors │ 6 │ c5.2xlarge │ $1,500 │
│ Kafka Brokers │ 6 │ r5.2xlarge │ $3,600 │
│ ES Data Nodes │ 10 │ r5.4xlarge │ $12,000 │
│ ES Master Nodes │ 3 │ c5.xlarge │ $450 │
│ Cassandra Nodes │ 9 │ i3.2xlarge │ $9,000 │
│ Query Servers │ 5 │ c5.2xlarge │ $1,250 │
│ API Gateway │ 3 │ c5.xlarge │ $450 │
│ ClickHouse │ 3 │ r5.8xlarge │ $5,400 │
├─────────────────────────────────────────────────────────────┤
│ Compute Subtotal │ │ │ $33,650 │
└─────────────────────────────────────────────────────────────┘
Storage Costs:
┌─────────────────────────────────────────────────────────────┐
│ Storage Type │ Volume │ Unit Cost │ Monthly Cost │
├─────────────────────────────────────────────────────────────┤
│ SSD (hot/warm) │ 50 TB │ $0.10/GB │ $5,000 │
│ HDD (standard) │ 100 TB │ $0.04/GB │ $4,000 │
│ S3 (cold/archive) │ 200 TB │ $0.023/GB │ $4,600 │
│ Kafka storage │ 5 TB │ $0.10/GB │ $500 │
├─────────────────────────────────────────────────────────────┤
│ Storage Subtotal │ │ │ $14,100 │
└─────────────────────────────────────────────────────────────┘
Network Costs:
┌─────────────────────────────────────────────────────────────┐
│ Traffic Type │ Volume │ Unit Cost │ Monthly Cost │
├─────────────────────────────────────────────────────────────┤
│ Cross-AZ │ 50 TB │ $0.01/GB │ $500 │
│ Cross-Region │ 10 TB │ $0.02/GB │ $200 │
│ Internet egress │ 1 TB │ $0.09/GB │ $90 │
├─────────────────────────────────────────────────────────────┤
│ Network Subtotal │ │ │ $790 │
└─────────────────────────────────────────────────────────────┘
Total Monthly Cost: ~$48,540
Annual Cost: ~$582,500
Cost per million spans stored: ~$0.33
Cost per trace stored: ~$0.000003Cost Optimization Levers
1. Aggressive Sampling (0.1% vs 1%):
- Reduces storage 10x: saves ~$12,000/month
- Reduces compute 5x: saves ~$15,000/month
- Trade-off: less debugging visibility
2. Compression (LZ4 on storage):
- Reduces storage 3-5x: saves ~$8,000/month
- Minimal CPU overhead
3. Tiered Storage (move to S3 after 7 days):
- Reduces SSD usage 75%: saves ~$3,750/month
- Slightly slower queries for old data
4. Reserved Instances (1-year commitment):
- 30-40% discount on compute: saves ~$10,000/month
5. Spot Instances for batch processing:
- 60-70% discount for analytics workloads
- Saves ~$3,000/month on ClickHouse batch jobsKey Constraints Summary
Hard Constraints:
- Span ingestion must not add >1ms latency to application requests
- Agent CPU overhead must be <2% of host CPU
- Agent memory overhead must be <100 MB per host
- Trace retrieval by ID must complete in <500ms (P99)
- System must handle 3x traffic spikes without data loss
- No single point of failure in collection pipeline
Soft Constraints:
- Search queries should complete in <3s (P95)
- Sampling decisions should be consistent per trace
- Cross-service context propagation overhead <100μs
- Dashboard refresh latency <5s
- Alert evaluation latency <30s from span arrivalThis scale analysis provides the foundation for capacity planning and architecture decisions in a production distributed tracing system comparable to Uber's Jaeger deployment or Datadog's APM infrastructure.