Distributed Tracing System - Database Design
Overview
The database design for a distributed tracing system must optimize for two fundamentally different access patterns: high-throughput append-only writes (span ingestion) and complex read queries (trace retrieval, search, analytics). This dual nature drives a polyglot persistence approach where different storage engines serve different query patterns. Production systems like Jaeger use Cassandra + Elasticsearch, while newer systems like Tempo use object storage with a search index.
Span Schema
Core Span Data Model
// OpenTelemetry Span Protocol Buffer Definition
message Span {
bytes trace_id = 1; // 16 bytes, globally unique
bytes span_id = 2; // 8 bytes, unique within a trace
bytes parent_span_id = 3; // 8 bytes, empty for root span
string name = 4; // operation name
SpanKind kind = 5; // CLIENT, SERVER, PRODUCER, CONSUMER, INTERNAL
fixed64 start_time_unix_nano = 6;
fixed64 end_time_unix_nano = 7;
repeated KeyValue attributes = 8;
repeated Event events = 10;
repeated Link links = 12;
Status status = 14;
Resource resource = 15;
}
message Event {
fixed64 time_unix_nano = 1;
string name = 2;
repeated KeyValue attributes = 3;
}
message Link {
bytes trace_id = 1;
bytes span_id = 2;
repeated KeyValue attributes = 3;
}
message Status {
StatusCode code = 1; // UNSET, OK, ERROR
string message = 2;
}Semantic Conventions (Standard Tags)
HTTP Spans:
- http.method: GET, POST, PUT, DELETE
- http.url: full URL
- http.status_code: 200, 404, 500
- http.request_content_length: bytes
Database Spans:
- db.system: mysql, postgresql, redis
- db.statement: "SELECT * FROM users WHERE id = ?"
- db.operation: SELECT, INSERT, UPDATE
RPC Spans:
- rpc.system: grpc, thrift
- rpc.service: service name
- rpc.method: method name
Messaging Spans:
- messaging.system: kafka, rabbitmq, sqs
- messaging.destination: topic/queue name
- messaging.operation: send, receive, processTrace Assembly Data Model
Trace as a Directed Acyclic Graph (DAG)
A trace is a collection of spans sharing the same trace_id.
Spans form a DAG via parent_span_id references.
Root span has no parent (entry point of the request).
Example Trace DAG:
[API Gateway: GET /order/123]
(root span)
/ \
[Auth: validate] [Order Service: getOrder]
/ | \
[DB: SELECT] [Cache: GET] [Payment: getStatus]
|
[DB: SELECT payment]
Assembly Algorithm:
1. Collect all spans with matching trace_id
2. Build parent-to-children adjacency list
3. Identify root(s): spans with no parent_span_id
4. Compute critical path (longest sequential chain)
5. Calculate trace duration: max(end_time) - min(start_time)Trace Metadata (Computed at Assembly)
{
"trace_id": "abc123def456...",
"root_service": "api-gateway",
"root_operation": "GET /order/123",
"start_time": "2024-01-15T10:30:00.000Z",
"duration_ms": 245,
"span_count": 7,
"service_count": 4,
"services": ["api-gateway", "auth-service", "order-service", "payment-service"],
"has_error": false,
"critical_path_ms": 220,
"depth": 4,
"width": 3
}Primary Storage: Cassandra
Schema Design
CREATE KEYSPACE tracing WITH replication = {
'class': 'NetworkTopologyStrategy',
'us-east-1': 3,
'us-west-2': 3
};
-- Main spans table: partitioned by trace_id
CREATE TABLE tracing.spans_by_trace (
trace_id blob,
span_id blob,
parent_span_id blob,
operation_name text,
service_name text,
span_kind tinyint,
start_time timestamp,
duration_us bigint,
status_code tinyint,
status_message text,
tags map<text, text>,
events list<frozen<span_event>>,
links list<frozen<span_link>>,
process_tags map<text, text>,
PRIMARY KEY (trace_id, span_id)
) WITH compaction = {
'class': 'TimeWindowCompactionStrategy',
'compaction_window_size': '1',
'compaction_window_unit': 'HOURS'
}
AND default_time_to_live = 604800
AND gc_grace_seconds = 86400;
CREATE TYPE tracing.span_event (
timestamp timestamp,
name text,
attributes map<text, text>
);
CREATE TYPE tracing.span_link (
trace_id blob,
span_id blob,
attributes map<text, text>
);
-- Service index: find traces by service and time range
CREATE TABLE tracing.traces_by_service (
service_name text,
date_bucket text,
start_time timestamp,
trace_id blob,
operation_name text,
duration_us bigint,
has_error boolean,
span_count int,
PRIMARY KEY ((service_name, date_bucket), start_time, trace_id)
) WITH CLUSTERING ORDER BY (start_time DESC, trace_id ASC)
AND default_time_to_live = 604800;
-- Duration index: find slow traces
CREATE TABLE tracing.traces_by_duration (
service_name text,
date_bucket text,
duration_bucket text,
duration_us bigint,
trace_id blob,
start_time timestamp,
operation_name text,
PRIMARY KEY ((service_name, date_bucket, duration_bucket), duration_us, trace_id)
) WITH CLUSTERING ORDER BY (duration_us DESC, trace_id ASC)
AND default_time_to_live = 604800;
-- Tag index: find traces by specific tag values
CREATE TABLE tracing.traces_by_tag (
service_name text,
tag_key text,
tag_value text,
date_bucket text,
start_time timestamp,
trace_id blob,
duration_us bigint,
PRIMARY KEY ((service_name, tag_key, tag_value, date_bucket), start_time, trace_id)
) WITH CLUSTERING ORDER BY (start_time DESC, trace_id ASC)
AND default_time_to_live = 604800;Write Path Optimization
Batch Write Strategy:
1. Group spans by trace_id (same partition)
2. Use unlogged batches for spans within same trace
3. Write index entries asynchronously
4. Use LOCAL_QUORUM for spans, LOCAL_ONE for indexes
Write Amplification:
- 1 span write produces: 1 spans_by_trace + 1 traces_by_service +
1 traces_by_duration + N traces_by_tag (per indexed tag)
- Typical amplification: 5-8x per span
- Mitigated by: selective tag indexing, async index writesSearch Storage: Elasticsearch
Index Template
{
"index_patterns": ["jaeger-span-*"],
"settings": {
"number_of_shards": 5,
"number_of_replicas": 1,
"index.refresh_interval": "5s",
"index.translog.durability": "async",
"index.codec": "best_compression",
"index.mapping.total_fields.limit": 2000
},
"mappings": {
"properties": {
"traceID": { "type": "keyword" },
"spanID": { "type": "keyword" },
"parentSpanID": { "type": "keyword" },
"operationName": { "type": "keyword" },
"serviceName": { "type": "keyword" },
"startTime": { "type": "date", "format": "epoch_micros" },
"startTimeMillis": { "type": "date" },
"duration": { "type": "long" },
"tags": {
"type": "nested",
"properties": {
"key": { "type": "keyword" },
"value": { "type": "keyword" }
}
},
"tag": {
"type": "object",
"properties": {
"http_method": { "type": "keyword" },
"http_status_code": { "type": "keyword" },
"http_url": { "type": "keyword" },
"error": { "type": "boolean" }
}
},
"process": {
"properties": {
"serviceName": { "type": "keyword" },
"tags": {
"type": "nested",
"properties": {
"key": { "type": "keyword" },
"value": { "type": "keyword" }
}
}
}
}
}
}
}Index Lifecycle Management
{
"policy": {
"phases": {
"hot": {
"actions": {
"rollover": { "max_size": "50gb", "max_age": "1d" },
"set_priority": { "priority": 100 }
}
},
"warm": {
"min_age": "2d",
"actions": {
"shrink": { "number_of_shards": 1 },
"forcemerge": { "max_num_segments": 1 },
"allocate": { "require": { "data": "warm" } }
}
},
"cold": {
"min_age": "7d",
"actions": {
"freeze": {},
"allocate": { "require": { "data": "cold" } }
}
},
"delete": {
"min_age": "30d",
"actions": { "delete": {} }
}
}
}
}Analytics Storage: ClickHouse
Schema for Aggregated Analytics
CREATE TABLE spans (
trace_id FixedString(16),
span_id FixedString(8),
parent_span_id FixedString(8),
service_name LowCardinality(String),
operation_name LowCardinality(String),
span_kind Enum8('INTERNAL'=0,'SERVER'=1,'CLIENT'=2,'PRODUCER'=3,'CONSUMER'=4),
start_time DateTime64(6),
duration_us UInt64,
status_code Enum8('UNSET'=0, 'OK'=1, 'ERROR'=2),
http_method LowCardinality(Nullable(String)),
http_status_code Nullable(UInt16),
tags Map(String, String),
INDEX idx_service service_name TYPE set(100) GRANULARITY 4,
INDEX idx_duration duration_us TYPE minmax GRANULARITY 4
) ENGINE = MergeTree()
PARTITION BY toDate(start_time)
ORDER BY (service_name, operation_name, start_time, trace_id)
TTL start_time + INTERVAL 30 DAY
SETTINGS index_granularity = 8192;
-- Pre-aggregated service metrics
CREATE MATERIALIZED VIEW service_metrics_mv
ENGINE = SummingMergeTree()
PARTITION BY toDate(time_bucket)
ORDER BY (service_name, operation_name, time_bucket)
AS SELECT
service_name,
operation_name,
toStartOfMinute(start_time) AS time_bucket,
count() AS request_count,
countIf(status_code = 'ERROR') AS error_count,
sum(duration_us) AS total_duration_us,
min(duration_us) AS min_duration_us,
max(duration_us) AS max_duration_us
FROM spans
WHERE span_kind = 'SERVER'
GROUP BY service_name, operation_name, time_bucket;
-- Service dependency graph
CREATE MATERIALIZED VIEW service_dependencies_mv
ENGINE = SummingMergeTree()
PARTITION BY toDate(time_bucket)
ORDER BY (caller_service, callee_service, time_bucket)
AS SELECT
s1.service_name AS caller_service,
s2.service_name AS callee_service,
toStartOfHour(s1.start_time) AS time_bucket,
count() AS call_count,
countIf(s2.status_code = 'ERROR') AS error_count,
avg(s2.duration_us) AS avg_duration_us
FROM spans s1
INNER JOIN spans s2 ON s1.trace_id = s2.trace_id
AND s1.span_id = s2.parent_span_id
WHERE s1.service_name != s2.service_name
GROUP BY caller_service, callee_service, time_bucket;Time-Series Partitioning Strategy
Partition Design
Time-Based Partitioning:
| Storage Engine | Partition Key | Partition Size |
|----------------|---------------------|-----------------|
| Cassandra | (service, date) | 1-10 GB |
| Elasticsearch | Daily index rollover| ~50 GB/index |
| ClickHouse | toDate(start_time) | 50-100 GB |
Avoiding Hot Partitions in Cassandra:
- Problem: Popular services create hot partitions
- Solution: Composite partition key with time bucketing
-- Distribute across time buckets
PRIMARY KEY ((service_name, date_bucket), start_time)
-- Add hash bucket for very hot services
PRIMARY KEY ((service_name, date_bucket, hash_bucket), start_time)
-- hash_bucket = trace_id % 16Tag Indexing for Flexible Queries
Tag Index Strategy
Challenge: Tags are arbitrary key-value pairs with high cardinality
- Total unique tag keys: ~500
- High-cardinality: user_id, request_id (millions of values)
- Low-cardinality: http.method, status_code (tens of values)
Indexing Approach:
1. Whitelist indexed tags (configured per service):
- http.method, http.status_code, error, component
- Custom business tags: customer_tier, feature_flag
2. Skip high-cardinality tags in indexes:
- user_id, session_id → only in raw span storage
- Queryable via trace_id lookup, not via search
3. Bloom filters for existence queries:
- "Does any span have tag X=Y?"
- False positive rate: 1%
- Memory: ~10 bits per entryCardinality Management
-- Track tag cardinality for index decisions
SELECT
tag_key,
uniqExact(tag_value) as cardinality,
count() as usage_count
FROM spans
ARRAY JOIN mapKeys(tags) AS tag_key, mapValues(tags) AS tag_value
WHERE start_time > now() - INTERVAL 1 DAY
GROUP BY tag_key
ORDER BY cardinality DESC;
-- Results guide indexing:
-- request_id | 50,000,000 → DO NOT INDEX
-- http.url | 5,000 → INDEX (moderate)
-- http.method | 7 → INDEX (low cardinality)
-- http.status_code | 25 → INDEX (low cardinality)Service Dependency Graph Storage
-- Cassandra: Current service graph
CREATE TABLE tracing.service_graph (
caller_service text,
callee_service text,
protocol text,
last_seen timestamp,
call_rate_per_sec float,
error_rate float,
avg_duration_ms float,
PRIMARY KEY (caller_service, callee_service, protocol)
);
-- Operations per service
CREATE TABLE tracing.service_operations (
service_name text,
operation_name text,
span_kind text,
last_seen timestamp,
PRIMARY KEY (service_name, operation_name)
);Retention and TTL Policies
Multi-Tier Retention
| Data Type | Hot | Warm | Cold | Archive |
|------------------------|--------|--------|---------|---------|
| Raw spans | 2 days | 7 days | 30 days | - |
| Span indexes | 2 days | 7 days | 14 days | - |
| Service metrics (1min) | 7 days | 30 days| - | - |
| Service metrics (1hr) | 30 days| 90 days| 1 year | - |
| Dependency graph | 30 days| 90 days| 1 year | - |
| Error traces | 7 days | 30 days| 90 days | 1 year |
Selective Retention (Keep Important Traces Longer):
1. Error traces (status_code = ERROR): 90 days
2. Slow traces (duration > P99): 30 days
3. Traces matching alert rules: 90 days
4. Manually bookmarked traces: 1 year
5. Traces linked to incidents: 1 yearStorage Engine Comparison
| Criteria | Cassandra | Elasticsearch | ClickHouse |
|------------------|---------------|---------------|-----------------|
| Write throughput | Excellent | Good | Excellent |
| Read by ID | Excellent | Good | Good |
| Full-text search | Poor | Excellent | Moderate |
| Aggregations | Poor | Moderate | Excellent |
| Compression | Good (3:1) | Moderate (2:1)| Excellent (10:1)|
| Operational cost | Moderate | High | Low |
| Best for | Span storage | Search/filter | Analytics |
Recommended Architecture:
- Primary storage: Cassandra (all spans, trace-by-ID retrieval)
- Search index: Elasticsearch (tag search, filtering)
- Analytics: ClickHouse (aggregations, percentiles, dashboards)
- Long-term: S3/GCS (archived traces, cost-effective retention)This database design supports the full spectrum of distributed tracing workloads from high-throughput ingestion to complex analytical queries, following patterns proven at scale by Jaeger, Zipkin, and commercial APM systems.