Distributed Tracing System - Tradeoffs and Alternatives

Overview

Designing a distributed tracing system involves numerous architectural decisions where each choice optimizes for different qualities. Understanding these tradeoffs deeply is critical for system design interviews, as interviewers expect candidates to articulate why they chose one approach over another and under what conditions the alternative would be preferable. This section covers the major decision points with production context from systems like Jaeger, Zipkin, AWS X-Ray, and Datadog APM.

Head-Based vs Tail-Based Sampling

Head-Based Sampling

How It Works:
- Sampling decision made at trace creation (first span)
- Decision encoded in trace context (W3C traceparent sampled flag)
- All downstream services respect the decision
- Deterministic: hash(trace_id) < threshold → sample

Example (1% sampling):
  trace_id = "5b8aa5a2d2c872e8321cf37308d69df2"
  hash(trace_id) % 10000 = 42  → 42 < 100 (1%) → SAMPLE
  
  trace_id = "7c9bb6b3e3d983f9432df48419e70ef3"
  hash(trace_id) % 10000 = 8734 → 8734 >= 100 → DROP
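
The deterministic decision above can be sketched in a few lines. This is a minimal illustration rather than any particular tracer's implementation: the function name `should_sample`, the use of SHA-256, and the 10,000-bucket granularity are assumptions made for the example, so the bucket values it produces will not match the illustrative 42 and 8734 shown above.

```python
import hashlib

BUCKETS = 10_000  # 0.01% granularity, matching the "% 10000" example above


def should_sample(trace_id: str, rate: float) -> bool:
    """Return True if this trace falls inside the sampled fraction.

    Hashing the trace ID (instead of drawing a random number) means any
    service that sees the same trace ID computes the same answer.
    """
    digest = hashlib.sha256(trace_id.encode("ascii")).digest()
    bucket = int.from_bytes(digest[:8], "big") % BUCKETS
    return bucket < round(rate * BUCKETS)
```

In practice only the service that creates the first span needs to evaluate this; downstream services read the sampled flag from the propagated trace context.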

Advantages:
+ Minimal overhead (single comparison per trace)
+ No buffering required
+ Predictable storage costs (rate × span_size × time)
+ Simple implementation (roughly 10 lines of code)
+ Works with any collection architecture
+ Consistent: all services agree on sampling decision

Disadvantages:
- Blind to trace outcome (errors, latency)
- At 1% sampling, roughly 99% of error traces are dropped along with the rest of the traffic
- Cannot adapt to traffic patterns
- Important rare events may never be captured
- Fixed cost regardless of trace value

Best For:
- High-traffic systems where storage cost dominates
- Systems with uniform traffic patterns
- When simplicity and low overhead are priorities
- Initial deployment before investing in tail-based infrastructure

Tail-Based Sampling

How It Works:
- ALL spans are collected initially (100% collection)
- Spans buffered at collector until trace is "complete"
- Sampling decision made after seeing full trace
- Keep/drop based on trace characteristics

Decision Policies (evaluated in priority order):
1. has_error = true → ALWAYS KEEP
2. duration > dynamic_p99_threshold → ALWAYS KEEP
3. matches_alert_rule = true → ALWAYS KEEP
4. hash(trace_id) mapped to [0, 1) < 0.005 → KEEP (0.5% baseline, deterministic per trace)
5. DEFAULT → DROP
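
The priority-ordered policies above could be evaluated roughly as follows. This is a sketch under stated assumptions: `TraceSummary`, `baseline_fraction`, and `decide` are hypothetical names, the p99 threshold is passed in by the caller instead of being computed dynamically, and SHA-256 stands in for whatever hash a real collector would use.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class TraceSummary:
    trace_id: str
    has_error: bool
    duration_ms: float
    matches_alert_rule: bool = False


def baseline_fraction(trace_id: str) -> float:
    """Map a trace ID to a stable, roughly uniform value between 0 and 1."""
    digest = hashlib.sha256(trace_id.encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def decide(trace: TraceSummary, p99_threshold_ms: float,
           baseline_rate: float = 0.005) -> tuple[bool, str]:
    """Evaluate the policies in priority order; the first match wins."""
    if trace.has_error:
        return True, "error"
    if trace.duration_ms > p99_threshold_ms:
        return True, "slow"
    if trace.matches_alert_rule:
        return True, "alert_rule"
    if baseline_fraction(trace.trace_id) < baseline_rate:
        return True, "baseline"
    return False, "default_drop"
```

Returning the matching reason alongside the decision makes it possible to tag stored traces with why they were kept, which helps when interpreting statistics computed from a biased sample.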

Advantages:
+ Can retain 100% of error traces (subject to buffer capacity and completion timeouts)
+ Can retain all latency outliers above the configured threshold
+ Adaptive to actual trace characteristics
+ Better signal-to-noise ratio in stored data
+ Can implement complex business rules

Disadvantages:
- Higher memory usage (buffer all in-flight traces)
- 30-60 second delay before storage (waiting for trace completion)
- Requires trace-ID affinity routing (all spans → same collector)
- Complex trace completion detection (when is a trace "done"?)
- Higher network bandwidth (collect everything, discard most)
- Agent overhead: must report ALL spans to collector
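
Trace-ID affinity routing, listed above as a requirement, can be illustrated with rendezvous (highest-random-weight) hashing. This is one possible technique, not necessarily what any given collector uses; the collector names and the helper `collector_for` are hypothetical.

```python
import hashlib


def _score(collector: str, trace_id: str) -> int:
    """Stable pseudo-random score for a (collector, trace) pair."""
    data = f"{collector}|{trace_id}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def collector_for(trace_id: str, collectors: list[str]) -> str:
    """Route every span of a trace to the collector with the highest score."""
    if not collectors:
        raise ValueError("no collectors available")
    return max(collectors, key=lambda c: _score(c, trace_id))
```

Compared with a plain `hash(trace_id) % N`, removing one collector only remaps the traces that were assigned to it, which limits how many in-flight traces are split across collectors during a scale-down.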

Resource Overhead:
- Memory: 6-16 GB per collector (buffering 600K concurrent traces at roughly 10-27 KB each)
- Network: 10-100x more agent→collector traffic vs head-based
- CPU: Trace assembly and policy evaluation per trace
- Latency: 30-60s additional delay to storage

Best For:
- Systems where error visibility is critical (payment, auth)
- When debugging rare issues is a priority
- Organizations with budget for infrastructure overhead
- Mature tracing deployments needing better signal quality

Hybrid Approach (Production Recommendation)

Combine both strategies:
1. Head-based: 0.5% random baseline (ensures coverage)
2. Tail-based: 100% of errors and slow traces (ensures quality)
3. Rate limiting: Cap total output at budget (ensures cost control)

Implementation:
- Agent applies head-based sampling (reduces network traffic)
- Collector applies tail-based on the sampled subset
- Additionally, agent ALWAYS forwards error spans regardless of head-based decision
- Collector merges: head-sampled traces + error traces + slow traces
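
The agent-side and collector-side rules above can be sketched as two small functions. The `Span` fields and function names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str
    span_id: str
    is_error: bool
    head_sampled: bool  # decision carried in the propagated trace context


def should_forward(span: Span) -> bool:
    """Agent-side rule: forward head-sampled spans, plus every error span."""
    return span.head_sampled or span.is_error


def merge_kept_traces(head_sampled: set[str], errors: set[str],
                      slow: set[str]) -> set[str]:
    """Collector-side union of trace IDs selected by any of the three paths."""
    return head_sampled | errors | slow
```

One caveat worth stating in an interview: an error span forwarded from a trace that was not head-sampled arrives without its sibling spans, so the collector may only be able to store a partial trace for it.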

Commercial platforms such as Datadog and Honeycomb use variations of this combination in production.

Push vs Pull Collection Model

Push Model (Chosen by Most Systems)

Architecture:
  Application → Agent → Collector
  (spans pushed from source to destination)

How It Works:
- SDK/agent batches spans and sends to collector
- Collector provides gRPC/HTTP endpoint
- Agent retries on failure with exponential backoff
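
The retry behavior described above might look like the following. It is a sketch, not a real exporter: `send` is a hypothetical transport callable supplied by the caller, and jitter (which production clients usually add to avoid synchronized retries) is omitted for brevity.

```python
import time
from typing import Callable, Sequence


def send_with_backoff(
    batch: Sequence[dict],
    send: Callable[[Sequence[dict]], bool],
    max_attempts: int = 5,
    base_delay_s: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to deliver one batch, doubling the wait after each failure.

    `send` returns True on success and False on a retryable failure
    (for example an HTTP 429 or 503 from the collector).
    """
    for attempt in range(max_attempts):
        if send(batch):
            return True
        if attempt < max_attempts - 1:
            sleep(base_delay_s * (2 ** attempt))
    return False  # caller decides whether to buffer or drop the batch
```

Passing `sleep` as a parameter keeps the function testable without real delays.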

Advantages:
+ Lower latency (spans arrive as soon as batch is ready)
+ Simpler agent (just serialize and send)
+ Works across network boundaries (agent initiates connection)
+ Natural backpressure (429/503 responses)
+ Scales with application (more apps = more push traffic)

Disadvantages:
- Collector must handle variable load (traffic spikes)
- Agent must handle collector unavailability (buffering)
- Discovery: agent needs to know collector address
- DoS risk: compromised or misconfigured agents could flood the collector

Used By: Jaeger, Zipkin, Datadog, OpenTelemetry (default)

Pull Model

Architecture:
  Application → Local Buffer ← Collector (pulls)
  (collector scrapes spans from agents)

How It Works:
- Agent exposes endpoint with buffered spans
- Collector periodically scrapes agents (like Prometheus)
- Agent clears buffer after successful scrape

Advantages:
+ Collector controls ingestion rate (natural rate limiting)
+ Easier to implement backpressure
+ Collector knows all sources (service discovery)
+ Simpler failure handling (just retry scrape)

Disadvantages:
- Higher latency (scrape interval adds delay)
- Agent must buffer more data (between scrapes)
- Doesn't work well across network boundaries (firewalls)
- Scrape overhead scales with number of agents (O(n) connections)
- Not suitable for high-volume span data (too much to buffer between scrapes)

Used By: Prometheus (for metrics, not traces)
Not commonly used for tracing due to volume and latency requirements.

Verdict: Push model wins for tracing due to volume and latency needs.
Pull model works for metrics (lower volume, periodic aggregation).

Storage Engine Comparison

Cassandra

Architecture: Wide-column store, masterless, eventual consistency

Strengths for Tracing:
+ Excellent write throughput (100K+ raw writes/sec per node)
+ Linear horizontal scaling (add nodes = more capacity)
+ Tunable consistency (LOCAL_ONE for indexes, LOCAL_QUORUM for spans)
+ Built-in TTL (automatic data expiration)
+ Multi-datacenter replication (built-in)
+ Proven at scale: Jaeger, which originated at Uber, supports Cassandra as a storage backend

Weaknesses for Tracing:
- Poor for complex queries (no JOINs, limited filtering)
- No full-text search capability
- Read amplification for secondary indexes
- Operational complexity (compaction tuning, repair)
- High write amplification for denormalized indexes (5-8x)

Best For: Primary span storage, trace-by-ID retrieval
Not For: Tag search, analytics, aggregations

Capacity: ~15K spans/sec per node once the 5-8x index write amplification is applied to raw write throughput, 5K reads/sec per node
Cost: Moderate (requires skilled operators)

Elasticsearch

Architecture: Distributed search engine, inverted index, near-real-time

Strengths for Tracing:
+ Excellent full-text and structured search
+ Flexible schema (dynamic mapping for tags)
+ Rich query DSL (nested, bool, range, aggregations)
+ Near-real-time indexing (1-second default refresh interval, often raised for write-heavy workloads)
+ Built-in aggregation framework
+ Kibana integration for visualization

Weaknesses for Tracing:
- Lower write throughput than Cassandra (~10K docs/sec per node)
- High memory requirements (JVM heap for indexing)
- Index management complexity (ILM, rollover, shrink)
- Expensive at scale (RAM-hungry nodes)
- GC pauses can cause latency spikes
- Not ideal for high-cardinality fields

Best For: Span search by tags, full-text log search
Not For: Primary storage at very high write volumes

Capacity: 10K docs/sec per node, complex queries in <1s
Cost: High (memory-intensive nodes, operational overhead)

ClickHouse

Architecture: Column-oriented OLAP database, MPP query engine

Strengths for Tracing:
+ Exceptional compression (10:1 for trace data)
+ Blazing fast aggregations (columnar scan)
+ Excellent for time-series analytics
+ Materialized views for pre-aggregation
+ Low storage cost per byte
+ SQL interface (familiar, powerful)
+ LowCardinality type is well suited to service_name and operation_name

Weaknesses for Tracing:
- Not designed for point lookups (trace-by-ID)
- Eventual consistency on inserts (async replication)
- Limited full-text search capability
- Fewer operational tools than Cassandra/ES
- Merge operations can spike CPU
- Less mature ecosystem for tracing specifically

Best For: Analytics, dashboards, percentile computation, metrics from traces
Not For: Real-time trace-by-ID lookup, flexible tag search

Capacity: 100K+ inserts/sec per node, sub-second aggregations over billions of rows
Cost: Low (excellent compression, fewer nodes needed)

Grafana Tempo (Object Storage)

Architecture: Object storage (S3/GCS) + lightweight index

Strengths:
+ Extremely low cost (S3 Standard pricing: about $0.023/GB-month)
+ Effectively unlimited scalability (object storage)
+ No operational overhead for storage layer
+ Simple architecture (stateless ingesters + object store)
+ Good for trace-by-ID lookups

Weaknesses:
- Limited tag search without TraceQL or an external index
- Higher query latency (object storage read latency)
- Trace discovery often depends on other signals (trace IDs in logs, or exemplars on metrics stored in Prometheus or Grafana Mimir)
- Newer, less battle-tested at extreme scale

Best For: Cost-sensitive deployments, trace-by-ID workflows
Not For: Complex search queries without additional indexing

Used By: Grafana Cloud, organizations prioritizing cost over search flexibility

Comparison Matrix

| Criteria              | Cassandra | Elasticsearch | ClickHouse | Tempo/S3  |
|-----------------------|-----------|---------------|------------|-----------|
| Write throughput      | ★★★★★    | ★★★          | ★★★★★     | ★★★★★    |
| Trace-by-ID lookup    | ★★★★★    | ★★★          | ★★★       | ★★★★     |
| Tag search            | ★★        | ★★★★★        | ★★★       | ★         |
| Analytics/aggregation | ★         | ★★★          | ★★★★★     | ★         |
| Compression           | ★★★      | ★★           | ★★★★★     | ★★★★     |
| Operational simplicity| ★★        | ★★           | ★★★       | ★★★★★    |
| Cost at scale         | ★★★      | ★★           | ★★★★      | ★★★★★    |
| Ecosystem/tooling     | ★★★★     | ★★★★★        | ★★★       | ★★★      |
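
One way to use the matrix is to weight the criteria by workload and rank the backends. The sketch below encodes the star ratings from the table as integers; the criterion keys and the function `rank_backends` are hypothetical names, and the weights are whatever the designer chooses, not values from the source.

```python
# Star ratings copied from the comparison matrix (1 = worst, 5 = best).
SCORES = {
    "Cassandra": {"write": 5, "trace_by_id": 5, "tag_search": 2, "analytics": 1,
                  "compression": 3, "ops_simplicity": 2, "cost": 3, "ecosystem": 4},
    "Elasticsearch": {"write": 3, "trace_by_id": 3, "tag_search": 5, "analytics": 3,
                      "compression": 2, "ops_simplicity": 2, "cost": 2, "ecosystem": 5},
    "ClickHouse": {"write": 5, "trace_by_id": 3, "tag_search": 3, "analytics": 5,
                   "compression": 5, "ops_simplicity": 3, "cost": 4, "ecosystem": 3},
    "Tempo/S3": {"write": 5, "trace_by_id": 4, "tag_search": 1, "analytics": 1,
                 "compression": 4, "ops_simplicity": 5, "cost": 5, "ecosystem": 3},
}


def rank_backends(weights: dict[str, float]) -> list[tuple[str, float]]:
    """Return (backend, weighted score) pairs, best first.

    Criteria missing from `weights` count as weight 0.
    """
    totals = {
        name: sum(weights.get(criterion, 0.0) * stars
                  for criterion, stars in row.items())
        for name, row in SCORES.items()
    }
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

For example, `rank_backends({"cost": 1, "ops_simplicity": 1})` puts Tempo/S3 first, while weighting only `tag_search` puts Elasticsearch first, which mirrors the "Best For" lines above.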

OpenTelemetry vs Proprietary Instrumentation

OpenTelemetry (Recommended)

What It Is: Vendor-neutral observability framework (CNCF project)
Components: SDK, API, Collector, Protocol (OTLP)

Advantages:
+ Vendor-neutral (switch backends without re-instrumenting)
+ Industry standard (supported by most major APM vendors)
+ Rich auto-instrumentation (HTTP, gRPC, DB, messaging)
+ Single SDK for traces, metrics, and logs
+ Active community (Google, Microsoft, AWS, Datadog contribute)
+ Semantic conventions standardize tag names

Disadvantages:
- Larger SDK footprint than minimal proprietary agents
- Rapid evolution (breaking changes are possible in components that are not yet marked stable)
- Auto-instrumentation may capture too much (noise)
- Configuration complexity (many options)
- Performance overhead slightly higher than hand-tuned proprietary

When to Choose:
- Multi-vendor strategy (avoid lock-in)
- Greenfield projects
- Organizations with diverse tech stacks
- When future flexibility matters more than current optimization

Proprietary Instrumentation (Datadog, New Relic, Dynatrace)

Advantages:
+ Tighter integration with vendor's analysis features
+ Often lower overhead (optimized for specific backend)
+ Better auto-instrumentation for specific frameworks
+ Single vendor support (one point of accountability)
+ Advanced features (AI-powered analysis, code-level profiling)

Disadvantages:
- Vendor lock-in (expensive to switch)
- Limited to vendor's supported languages/frameworks
- Cost scales with data volume (vendor pricing)
- Less community innovation

When to Choose:
- Small team needing managed solution
- Already committed to a specific APM vendor
- When vendor-specific features justify lock-in
- When operational simplicity trumps flexibility

Sidecar Proxy Tracing vs Library Instrumentation

Sidecar/Service Mesh Tracing (Envoy, Istio, Linkerd)

How It Works:
- Sidecar proxy intercepts all network traffic
- Automatically generates spans for every request
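- No application code changes required to generate spans, but the application must copy trace headers from inbound to outbound requests so that spans join into one trace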
- Propagates trace context via headers
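
Header-based propagation relies on the application passing the trace headers it received on to its outbound calls, since the proxy cannot correlate inbound and outbound requests on its own. A minimal sketch, assuming the W3C `traceparent` and `tracestate` headers are the ones in use (meshes may use other header sets such as B3); the function names are hypothetical.

```python
import re
from typing import Optional

# Version-00 traceparent: 00-<32 hex trace id>-<16 hex parent id>-<2 hex flags>
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")
PROPAGATED_HEADERS = ("traceparent", "tracestate")  # assumed header set


def parse_traceparent(header: str) -> Optional[dict]:
    """Parse a version-00 traceparent header; return None if it is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # all-zero IDs are not valid
    return {"trace_id": trace_id, "parent_id": parent_id,
            "sampled": bool(int(flags, 16) & 0x01)}


def forward_trace_headers(inbound: dict[str, str]) -> dict[str, str]:
    """Copy trace headers from an inbound request for use on outbound calls."""
    lowered = {name.lower(): value for name, value in inbound.items()}
    return {name: lowered[name] for name in PROPAGATED_HEADERS if name in lowered}
```

The application would merge the result of `forward_trace_headers` into the headers of every downstream request it makes while handling the inbound one.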

Advantages:
+ Near-zero application code changes (only trace header forwarding)
+ Language-agnostic (works with any runtime)
+ Consistent instrumentation across all services
+ Captures network-level details (retries, timeouts, TLS)
+ Managed by platform team (not application developers)

Disadvantages:
- Only captures network boundaries (no in-process visibility)
- Cannot instrument database calls, cache operations, business logic
- Adds network hop latency (~1-2ms per request)
- Resource overhead (sidecar memory: 50-100 MB per pod)
- Complex debugging (proxy issues vs application issues)
- Limited tag/attribute customization

Best For:
- Initial tracing rollout (quick wins, no code changes)
- Polyglot environments with many languages
- Network-level visibility (service mesh observability)
- Organizations where developers won't instrument code

Library Instrumentation (OpenTelemetry SDK)

How It Works:
- SDK embedded in application process
- Auto-instrumentation hooks into frameworks
- Manual instrumentation for business logic
- Direct span creation and context propagation

Advantages:
+ Full visibility (network + in-process + business logic)
+ Rich attributes (business context, user info)
+ Lower latency (no network hop for span creation)
+ Fine-grained control (custom spans, events, links)
+ Database query tracing, cache operations, queue interactions

Disadvantages:
- Requires code changes (even if minimal with auto-instrumentation)
- Language-specific (need SDK per language)
- Developer adoption required (training, code reviews)
- Version management across services
- Potential for instrumentation bugs affecting application

Best For:
- Deep debugging capability
- Business-context-rich traces
- Performance-sensitive applications
- Mature engineering organizations

Recommendation: Both (Layered Approach)

Layer 1: Service mesh (baseline network visibility, zero effort)
Layer 2: Auto-instrumentation (framework-level, minimal effort)
Layer 3: Manual instrumentation (business logic, targeted effort)

This gives:
- Immediate value from mesh (all services traced day 1)
- Deeper value from SDK (critical paths instrumented)
- Business context where it matters most (key workflows)

Centralized vs Distributed Trace Assembly

Centralized Assembly

How It Works:
- All spans sent to central location
- Assembly service reconstructs traces from spans
- Single point for trace-by-ID queries

Advantages:
+ Simple query model (one place to look)
+ Consistent trace view (no partial traces)
+ Easier to implement trace completion detection
+ Single storage system to manage

Disadvantages:
- Single point of failure
- Cross-region bandwidth costs
- Higher write latency for remote regions
- Scaling bottleneck at assembly layer
- Data sovereignty concerns (GDPR)

Used By: Most single-region deployments, Jaeger (default)

Distributed Assembly (Federated)

How It Works:
- Each region stores its own spans
- Query layer fans out to all regions
- Trace assembled from distributed fragments
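
The fan-out-and-merge step can be sketched as below. The per-region `fetch` callables, the span dictionary keys (`span_id`, `start_us`), and the function name are assumptions; a real query layer would issue the regional requests concurrently and with timeouts, whereas this sketch calls them one after another.

```python
from typing import Callable, Iterable


def assemble_trace(
    trace_id: str,
    regions: dict[str, Callable[[str], Iterable[dict]]],
) -> tuple[list[dict], list[str]]:
    """Query every region and merge the span fragments of one trace.

    Returns the merged spans (deduplicated by span_id, ordered by start
    time) and the names of regions that failed, so the caller can mark
    the trace as possibly partial.
    """
    spans_by_id: dict[str, dict] = {}
    failed: list[str] = []
    for region, fetch in regions.items():
        try:
            for span in fetch(trace_id):
                spans_by_id.setdefault(span["span_id"], span)
        except Exception:
            failed.append(region)
    merged = sorted(spans_by_id.values(), key=lambda s: s["start_us"])
    return merged, failed
```

Ordering by start time across regions is only as reliable as the clocks involved, which is the clock-skew concern listed under the disadvantages of this approach.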

Advantages:
+ Data locality (low write latency)
+ No cross-region bandwidth for writes
+ GDPR compliance (data stays in region)
+ No single point of failure
+ Each region scales independently

Disadvantages:
- Complex query routing (which regions have this trace?)
- Partial trace visibility if region is down
- Higher query latency for cross-region traces
- Consistency challenges (clock skew across regions)
- More complex operational model

Used By: Large multi-region deployments, AWS X-Ray (regional)

Real-Time Analysis vs Batch Processing

Real-Time (Stream Processing)

How It Works:
- Analyze spans as they arrive (sub-second)
- Stream processors (Flink, Kafka Streams) compute metrics
- Alerts fire immediately on anomalies
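
As a plain-Python stand-in for what a stream processor would compute, the sketch below keeps a sliding time window of span outcomes and reports when the error rate crosses a threshold. The class name, default window, threshold, and minimum span count are all assumptions, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque


class SlidingErrorRate:
    """Track the error rate over the most recent `window_s` seconds."""

    def __init__(self, window_s: float = 30.0, threshold: float = 0.05,
                 min_spans: int = 20) -> None:
        self.window_s = window_s
        self.threshold = threshold
        self.min_spans = min_spans
        self._events: deque = deque()  # (timestamp_s, is_error) pairs
        self._errors = 0

    def observe(self, timestamp_s: float, is_error: bool) -> bool:
        """Record one span; return True if the alert condition currently holds."""
        self._events.append((timestamp_s, is_error))
        self._errors += int(is_error)
        cutoff = timestamp_s - self.window_s
        # Evict spans that have fallen out of the window.
        while self._events and self._events[0][0] <= cutoff:
            _, old_error = self._events.popleft()
            self._errors -= int(old_error)
        total = len(self._events)
        return total >= self.min_spans and self._errors / total > self.threshold
```

The `min_spans` guard avoids alerting on a window that contains only a handful of spans, where a single error would otherwise look like a large error rate.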

Use Cases:
- Latency spike detection (alert within 30 seconds)
- Error rate monitoring (real-time dashboards)
- Service dependency graph (live updates)
- Adaptive sampling rate adjustment

Trade-offs:
+ Immediate visibility into issues
+ Faster incident detection and response
- Higher compute cost (always running)
- Approximate results (streaming aggregations)
- Complex exactly-once semantics

Batch Processing

How It Works:
- Accumulate spans, process periodically (hourly/daily)
- Batch jobs compute exact aggregations
- Results available after processing window
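
Because a batch job sees the complete dataset, it can compute an exact percentile instead of an approximation. A minimal sketch using the nearest-rank definition (one of several percentile definitions); the function name is hypothetical.

```python
import math


def exact_percentile(durations_ms: list[float], pct: float) -> float:
    """Nearest-rank percentile over the complete dataset."""
    if not durations_ms:
        raise ValueError("no data")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(durations_ms)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]
```

Sorting the full dataset is what makes this exact and also what makes it unsuitable for the streaming path, where sketches or histograms are typically used instead.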

Use Cases:
- Exact percentile computation (full dataset)
- Trend analysis (week-over-week comparison)
- Capacity planning (historical patterns)
- Cost allocation (per-service resource usage)

Trade-offs:
+ Exact results (process complete dataset)
+ Lower compute cost (run periodically)
+ Simpler implementation (batch SQL queries)
- Delayed visibility (hours behind real-time)
- Not suitable for alerting
- Large batch jobs can be expensive

Recommendation: Lambda Architecture

Real-time layer: Stream processing for alerts and dashboards (approximate)
Batch layer: Periodic jobs for exact analytics and reporting
Serving layer: Merge real-time and batch results for queries

This gives alerting within seconds AND exact historical analytics.

Full Fidelity vs Sampled Tracing

Full Fidelity (100% of traces):
- Storage: 432 TB/day at 10M spans/sec (assumes ~500 bytes per span)
- Cost: ~$500K/month in storage alone
- Benefit: Never miss any trace, perfect debugging
- Used by: Honeycomb (for small-medium scale)
- Practical limit: ~100K spans/sec before cost is prohibitive

Sampled Tracing (1% of traces):
- Storage: 4.3 TB/day
- Cost: ~$5K/month in storage
- Benefit: 100x cost reduction
- Trade-off: May miss important traces (mitigated by tail-based sampling)
- Used by: Most production systems at scale

Decision Framework:
- < 10K spans/sec: Full fidelity is affordable (~$500/month)
- 10K-100K spans/sec: Consider full fidelity if budget allows
- 100K-1M spans/sec: Sampling required, use tail-based
- > 1M spans/sec: Aggressive sampling + pre-aggregated metrics
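
The storage figures and tiers above follow from simple arithmetic. The sketch below assumes 500 bytes per span, which is the size implied by 432 TB/day at 10M spans/sec, and decimal terabytes; the function names are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def storage_tb_per_day(spans_per_sec: float, bytes_per_span: float = 500,
                       sample_rate: float = 1.0) -> float:
    """Daily stored volume in decimal terabytes (1 TB = 1e12 bytes)."""
    return spans_per_sec * bytes_per_span * SECONDS_PER_DAY * sample_rate / 1e12


def recommend(spans_per_sec: float) -> str:
    """Map throughput to the tiers of the decision framework above."""
    if spans_per_sec < 10_000:
        return "full fidelity"
    if spans_per_sec < 100_000:
        return "full fidelity if budget allows"
    if spans_per_sec <= 1_000_000:
        return "tail-based sampling"
    return "aggressive sampling + pre-aggregated metrics"
```

With these assumptions, `storage_tb_per_day(10_000_000)` gives 432 TB/day and a 1% sample rate gives about 4.3 TB/day, matching the figures in this section.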

The trend is toward "collect everything, store selectively", with
tail-based sampling offering much of the visibility of full fidelity at close to the storage cost of sampling.

This tradeoff analysis provides the depth needed to justify architectural decisions in a system design interview and to show an understanding of real-world production constraints.