Distributed Tracing System - Security and Privacy
Overview
Distributed tracing systems capture detailed information about every request flowing through an organization's infrastructure. This creates a unique security and privacy challenge: the system designed to improve observability can inadvertently become a repository of sensitive data, a target for attackers seeking to understand system architecture, or a compliance liability under regulations like GDPR, CCPA, and HIPAA. Production tracing systems must implement defense-in-depth across data collection, storage, access, and retention.
PII in Trace Data
Common Sources of Sensitive Data in Spans
High Risk - Frequently Captured Accidentally:
1. HTTP URLs with query parameters:
/api/users?email=john@example.com&ssn=123-45-6789
2. Request/Response bodies (when body capture is enabled):
{"credit_card": "4111-1111-1111-1111", "cvv": "123"}
3. Database queries with literal values:
"SELECT * FROM users WHERE email = 'john@example.com'"
4. Error messages containing user data:
"Failed to process payment for user john.doe@company.com"
5. HTTP headers:
Authorization: Bearer eyJhbGciOiJIUzI1NiIs...
Cookie: session_id=abc123; user_email=john@example.com
6. gRPC request metadata:
user-id: 12345
x-forwarded-for: 192.168.1.100
Medium Risk - Context-Dependent:
7. User IDs (may be considered PII in some jurisdictions)
8. IP addresses (PII under GDPR)
9. Device identifiers
10. Geographic coordinates
11. Custom business tags (order amounts, account numbers)
Low Risk - Generally Safe:
- Service names, operation names
- HTTP methods, status codes
- Duration, timestamp
- Span/trace IDs
- Generic error codes
Data Scrubbing Pipeline
Scrubbing Architecture:
Agent → Collector → [Scrubbing Pipeline] → Storage
Pipeline Stages:
1. Allowlist Filtering (most restrictive, safest):
- Only permit explicitly approved tag keys
- Default: DENY all unknown tags
- Approved: http.method, http.status_code, db.system, error
2. Blocklist Filtering (catch known sensitive patterns):
- Block: authorization, cookie, x-api-key, password
- Block: credit_card, ssn, social_security
- Block: any header matching /^x-.*-secret/
3. Regex-Based Redaction:
patterns:
  - name: credit_card
    regex: '\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b'
    replacement: '[REDACTED_CC]'
  - name: email
    regex: '\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
    replacement: '[REDACTED_EMAIL]'
  - name: ssn
    regex: '\b\d{3}-\d{2}-\d{4}\b'
    replacement: '[REDACTED_SSN]'
  - name: jwt_token
    regex: 'eyJ[A-Za-z0-9_-]*\.eyJ[A-Za-z0-9_-]*\.[A-Za-z0-9_-]*'
    replacement: '[REDACTED_JWT]'
  - name: ipv4
    regex: '\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b'
    replacement: '[REDACTED_IP]'
4. URL Sanitization:
- Parse URL, remove query parameters by default
- Allowlist safe query params: page, limit, sort, filter
- Path parameter normalization: /users/12345 → /users/{id}
5. Database Query Normalization:
- Replace literal values with placeholders
- "WHERE email = 'john@example.com'" → "WHERE email = ?"
- "INSERT INTO orders VALUES (123, 'item')" → "INSERT INTO orders VALUES (?, ?)"
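Stages 3 to 5 above can be sketched in a few lines of Python. This is a minimal illustration, not the collector's actual implementation: the function names (redact_text, sanitize_url, normalize_sql) are hypothetical, the path normalization assumes that only purely numeric path segments are identifiers, and the SQL normalizer handles only simple quoted-string and numeric literals.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Stage 3 patterns, copied from the list above. JWTs are handled first so that
# later patterns never see fragments of a token.
REDACTIONS = [
    (re.compile(r"eyJ[A-Za-z0-9_-]*\.eyJ[A-Za-z0-9_-]*\.[A-Za-z0-9_-]*"), "[REDACTED_JWT]"),
    (re.compile(r"\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b"), "[REDACTED_CC]"),
    (re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"), "[REDACTED_EMAIL]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[REDACTED_SSN]"),
    (re.compile(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b"), "[REDACTED_IP]"),
]

# Stage 4 allowlist of query parameters that may be kept.
SAFE_QUERY_PARAMS = {"page", "limit", "sort", "filter"}


def redact_text(value: str) -> str:
    """Apply every redaction pattern to a tag value or log message."""
    for pattern, replacement in REDACTIONS:
        value = pattern.sub(replacement, value)
    return value


def sanitize_url(url: str) -> str:
    """Drop non-allowlisted query parameters and normalize numeric path segments."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k in SAFE_QUERY_PARAMS]
    # Assumption: a path segment made only of digits is an identifier.
    path = re.sub(r"/\d+(?=/|$)", "/{id}", parts.path)
    return urlunsplit((parts.scheme, parts.netloc, path, urlencode(kept), ""))


def normalize_sql(statement: str) -> str:
    """Replace string and numeric literals with '?' placeholders."""
    statement = re.sub(r"'(?:[^']|'')*'", "?", statement)   # quoted strings
    return re.sub(r"\b\d+(?:\.\d+)?\b", "?", statement)     # bare numbers
```

Running the regex pass after URL and query normalization gives a second line of defense for values that the structural stages miss, which matches the ordering of the pipeline stages.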
Implementation (OpenTelemetry Collector Processor):
processors:
  attributes:
    actions:
      - key: http.url
        action: hash # Hash instead of storing raw URL
      - key: db.statement
        action: extract
        pattern: '^(?P<operation>\w+)\s' # Copy the operation type into an "operation" attribute
      - key: db.statement
        action: delete # extract leaves the source attribute in place, so remove the raw statement
      - key: http.request.header.authorization
        action: delete # Remove entirely
      - key: user.email
        action: delete
Hashing vs Deletion vs Tokenization
Strategy Comparison:
| Approach | Reversible | Queryable | Use Case |
|-------------|------------|-----------|------------------------------|
| Deletion | No | No | Truly sensitive (passwords) |
| Hashing | No | Yes* | Correlation without exposure |
| Tokenization| Yes | Yes | Compliance with data access |
| Masking | No | Partial | Partial visibility needed |
*Queryable if you hash the search term the same way
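The "queryable" footnote can be illustrated with a short sketch. It uses a keyed hash (HMAC-SHA256) rather than a bare SHA-256, which is a deliberate substitution: emails and user IDs are guessable, so an unkeyed hash can be matched by hashing candidate values. The function name pseudonymize, the normalization step, and the way the key is supplied are assumptions for illustration; in practice the key would come from a secrets manager.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Return a stable keyed hash of an identifier.

    The same (value, key) pair always yields the same digest, so traces for
    one user still correlate, and a search term can be hashed the same way.
    """
    normalized = value.strip().lower()  # assumption: identifiers are case-insensitive
    return hmac.new(secret_key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


# Usage: hash at ingestion, and hash the search term identically at query time.
key = b"example-key-loaded-from-a-secrets-manager"  # hypothetical key
stored_tag = pseudonymize("john.doe@company.com", key)
search_tag = pseudonymize("John.Doe@company.com ", key)
```

Here stored_tag and search_tag are equal, so the search finds the user's traces without the raw email ever being stored. Rotating the key breaks correlation with older traces, which may or may not be desirable.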
Example - User ID Handling:
- Raw: user_id = "john.doe@company.com"
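- Hashed: user_id = HMAC-SHA256(secret_key, "john.doe@company.com") = "a1b2c3..."
→ Can still correlate all traces for same user
→ Cannot be reversed without the key (a plain unsalted SHA-256 of an email can be recovered by hashing candidate addresses, so prefer a keyed hash)
→ Can search by hashing the email you're looking for with the same key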
- Tokenized: user_id = token_lookup("john.doe@company.com") = "USR_7x9k2m"
→ Separate token service maps token ↔ real value
→ Token service has strict access controls
→ Traces store only token, not real value
Access Control
Role-Based Access Control (RBAC)
Role Hierarchy:
┌─────────────────────────────────────────────────────────────┐
│ Role │ Permissions │
├─────────────────────────────────────────────────────────────┤
│ Platform Admin │ All traces, all services, config mgmt │
│ Service Owner │ Own service traces + direct dependencies │
│ Developer │ Own team's service traces │
│ On-Call Engineer │ All traces (time-limited during incident)│
│ Auditor │ Read-only, all traces, audit logs │
│ External Partner │ Specific service, limited time range │
└─────────────────────────────────────────────────────────────┘
Permission Model:
{
  "role": "service_owner",
  "principal": "team-payments",
  "permissions": {
    "traces": {
      "read": {
        "services": ["payment-service", "payment-gateway", "fraud-detection"],
        "include_dependencies": true,
        "max_depth": 2,
        "time_range": "30d"
      },
      "search": {
        "services": ["payment-service"],
        "allowed_tag_filters": ["http.status_code", "error", "customer.tier"],
        "blocked_tag_filters": ["user.email", "card.number"]
      }
    },
    "admin": {
      "sampling_config": ["payment-service"],
      "retention_config": false
    }
  }
}
Service-Level Trace Isolation
Problem: Developer on Team A should not see internal traces of Team B's service.
A trace spanning both teams should show Team A's spans in detail,
but Team B's spans only as opaque boxes (service name + duration).
Implementation:
1. Trace Filtering at Query Time:
def get_trace(trace_id, requesting_user):
    spans = storage.get_spans(trace_id)
    user_services = get_allowed_services(requesting_user)
    filtered_spans = []
    for span in spans:
        if span.service_name in user_services:
            filtered_spans.append(span)  # Full detail
        else:
            filtered_spans.append(redact_span(span))  # Opaque
    return assemble_trace(filtered_spans)
def redact_span(span):
    return Span(
        span_id=span.span_id,
        parent_span_id=span.parent_span_id,
        service_name=span.service_name,  # Keep for topology
        operation_name="[REDACTED]",
        duration=span.duration,  # Keep for timing
        tags={},  # Remove all tags
        logs=[],  # Remove all logs
        status=span.status,  # Keep error indicator
    )
2. Attribute-Level Redaction:
- Some tags visible to all (http.status_code, error)
- Some tags visible only to service owner (db.statement, request.body)
- Configured per-tag, per-role
3. Time-Limited Escalation:
- During incidents: grant temporary access to all traces
- Requires approval from service owner or on-call
- Auto-expires after incident resolution (max 24 hours)
- All escalated access is audit-logged
Data Retention and GDPR Compliance
Right to Erasure (Article 17)
Challenge: User requests deletion of all their data.
Traces may contain user identifiers scattered across thousands of spans.
Approaches:
1. Pseudonymization at Ingestion (Preferred):
- Replace user identifiers with pseudonyms at collection time
- Mapping table: real_user_id → pseudonym (encrypted, separate storage)
- Deletion: delete mapping entry → traces become anonymous
- Traces remain intact for debugging (no user association)
2. Trace Deletion by User ID:
- Requires index: user_id → [trace_ids]
- Delete all traces containing user's data
- Expensive: may delete traces shared with other users
- Risk: incomplete deletion if user_id appears in unexpected fields
3. Crypto-Shredding:
- Encrypt user-specific data with per-user key
- Store encrypted values in span tags
- Deletion: destroy the user's encryption key
- Data becomes unreadable without key
- Traces remain for structural analysis
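Approach 1 (pseudonymization at ingestion) can be sketched as follows. This is an in-memory stand-in for the encrypted mapping table, written under stated assumptions: the class name PseudonymStore, its method names, and the "pseu_" prefix are hypothetical. Pseudonyms are random rather than derived from the user ID, so once the mapping entry is deleted nothing links the pseudonym back to the user.

```python
import secrets
from typing import Dict, Optional


class PseudonymStore:
    """Maps real user IDs to random pseudonyms (stand-in for a separate, encrypted table)."""

    def __init__(self) -> None:
        self._by_user: Dict[str, str] = {}

    def pseudonym_for(self, real_user_id: str) -> str:
        """Return the user's pseudonym, creating one on first use."""
        if real_user_id not in self._by_user:
            self._by_user[real_user_id] = "pseu_" + secrets.token_hex(8)
        return self._by_user[real_user_id]

    def resolve(self, pseudonym: str) -> Optional[str]:
        """Reverse lookup for authorized investigations; None once erased."""
        for user_id, value in self._by_user.items():
            if value == pseudonym:
                return user_id
        return None

    def erase(self, real_user_id: str) -> bool:
        """Handle an erasure request by deleting the mapping entry."""
        return self._by_user.pop(real_user_id, None) is not None


# Usage: spans carry only the pseudonym; erasure leaves spans intact but anonymous.
store = PseudonymStore()
tag_value = store.pseudonym_for("user-42")   # stored on spans as the user tag
store.erase("user-42")                        # right-to-erasure request
```

After erase, resolve(tag_value) returns None, so existing traces can still be used for debugging but no longer point to a person. A real deployment would also need to remove the entry from backups of the mapping table.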
Implementation of Crypto-Shredding:
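# At ingestion
user_key = key_store.get_key(user_id)  # per-user key
encrypted_email = encrypt(user_key, user.email)
span.set_attribute("user.email.encrypted", encrypted_email)
span.set_attribute("user.key_id", user_key.id)
# At query time (authorized user): look the key up by the ID stored on the span
key_id = span.attributes["user.key_id"]
user_key = key_store.get_key_by_id(key_id)  # hypothetical lookup by key ID
email = decrypt(user_key, span.attributes["user.email.encrypted"])
# For deletion (GDPR request): destroy the key, including any backup copies
key_store.delete_key(user_id)
# All attributes encrypted under that key become permanently unreadable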
Recommended: Pseudonymization + short retention (7-30 days)
This minimizes GDPR exposure while maintaining debugging utility.
Data Minimization (Article 5)
Principle: Collect only what is necessary for the stated purpose.
Application to Tracing:
- Purpose: Debug production issues, monitor performance
- Necessary: service names, operation names, durations, error codes
- NOT necessary: full request bodies, user emails, IP addresses
Implementation:
1. Default deny: Only collect explicitly approved attributes
2. Purpose limitation: Document why each collected field is needed
3. Storage limitation: Minimum retention that serves debugging needs
4. Regular review: Quarterly audit of collected attributes
Data Protection Impact Assessment (DPIA):
- Required when tracing processes personal data at scale
- Document: what data, why collected, how protected, retention period
- Review: annually or when collection scope changes
Retention Policy Enforcement
Automated Retention:
retention_policies:
  - name: "standard_traces"
    condition: "default"
    duration: "7d"
    action: "delete"
  - name: "error_traces"
    condition: "has_error = true"
    duration: "30d"
    action: "delete"
  - name: "gdpr_user_traces"
    condition: "contains_pii = true"
    duration: "72h" # Minimize PII exposure window
    action: "delete"
  - name: "compliance_audit"
    condition: "service IN ('auth', 'payment')"
    duration: "1y"
    action: "archive_anonymized" # Keep structure, remove PII
Deletion Verification:
- After TTL expiration, verify data is actually deleted
- Check all storage tiers (hot, warm, cold, backups)
- Verify deletion from indexes (Elasticsearch, Cassandra)
- Audit log: "Deleted 1.2M spans older than 7 days at 2024-01-15T03:00:00Z"
Encryption
In-Transit Encryption
Agent → Collector:
- Protocol: gRPC with TLS 1.3 (mutual TLS preferred)
- Certificate: Auto-rotated via cert-manager or SPIFFE/SPIRE
- Cipher suites: TLS_AES_256_GCM_SHA384, TLS_CHACHA20_POLY1305_SHA256
- Minimum: TLS 1.2 (disable TLS 1.0, 1.1)
mTLS Configuration:
# Collector server config
tls:
  cert_file: /certs/collector.crt
  key_file: /certs/collector.key
  client_ca_file: /certs/ca.crt # Verify agent certificates
  client_auth: RequireAndVerifyClientCert
# Agent client config
tls:
  cert_file: /certs/agent.crt
  key_file: /certs/agent.key
  ca_file: /certs/ca.crt # Verify collector certificate
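For components written in Python, the same mTLS settings can be expressed with the standard ssl module. The sketch below is illustrative only: the function names are hypothetical, and the file paths are optional so the contexts can be constructed without certificates on disk. Requiring a verified client certificate (CERT_REQUIRED on the server side) corresponds to client_auth: RequireAndVerifyClientCert in the config above.

```python
import ssl
from typing import Optional


def collector_server_context(cert_file: Optional[str] = None,
                             key_file: Optional[str] = None,
                             client_ca_file: Optional[str] = None) -> ssl.SSLContext:
    """Server-side context that refuses clients without a valid certificate."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2   # disables TLS 1.0 and 1.1
    ctx.verify_mode = ssl.CERT_REQUIRED            # mutual TLS: client cert is mandatory
    if cert_file and key_file:
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    if client_ca_file:
        ctx.load_verify_locations(cafile=client_ca_file)  # CA that signs agent certs
    return ctx


def agent_client_context(cert_file: Optional[str] = None,
                         key_file: Optional[str] = None,
                         ca_file: Optional[str] = None) -> ssl.SSLContext:
    """Client-side context that verifies the collector and presents its own cert."""
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=ca_file)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    if cert_file and key_file:
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx
```

In a deployment, the paths would be the ones from the config above (for example /certs/collector.crt), and the returned contexts would be passed to whatever server or client library terminates the connection.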
Collector → Kafka:
- SASL_SSL authentication
- TLS for data in transit
- ACLs for topic-level access control
Collector → Storage:
- Cassandra: Client-to-node encryption (TLS)
- Elasticsearch: HTTPS with authentication
- ClickHouse: TLS + password authentication
- S3: HTTPS (always encrypted in transit)
Internal Network:
- Service mesh (Istio/Linkerd): automatic mTLS between all pods
- Eliminates need for application-level TLS configuration
- Certificate rotation handled by mesh control plane
At-Rest Encryption
Storage Layer Encryption:
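- Cassandra: Transparent Data Encryption (TDE) with AES-256 where the distribution provides it (for example DataStax Enterprise); otherwise encrypt the underlying volumes
- Elasticsearch: Encrypted-at-rest via OS-level (dm-crypt/LUKS)
- ClickHouse: Encrypted disks (AES-256-CTR)
- S3: SSE-S3 (S3-managed keys) or SSE-KMS (KMS-managed keys)
- Kafka: Encrypted broker volumes (dm-crypt/LUKS or encrypted cloud disks); Kafka does not encrypt log segments itself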
Key Management:
- Use AWS KMS / GCP Cloud KMS / HashiCorp Vault
- Separate keys per data classification level
- Key rotation: every 90 days (automatic)
- Key access audit: who accessed which key, when
Encryption Hierarchy:
Master Key (KMS, HSM-backed)
└── Data Encryption Key (DEK) per storage volume
└── Encrypted span data
Performance Impact:
- AES-NI hardware acceleration: <5% CPU overhead
- Negligible latency impact for reads/writes
- Key caching eliminates KMS round-trips for hot keys
Multi-Tenant Trace Isolation
Tenant Isolation Architecture
Multi-Tenancy Models:
1. Shared Infrastructure, Logical Isolation:
- All tenants share same Kafka, Cassandra, ES clusters
- Isolation via tenant_id field in every span
- Query layer enforces tenant filtering
- Risk: Noisy neighbor, data leakage bugs
2. Shared Infrastructure, Physical Isolation:
- Separate Kafka topics per tenant
- Separate Cassandra keyspaces per tenant
- Separate ES indices per tenant
- Better isolation, higher operational cost
3. Dedicated Infrastructure (Enterprise):
- Separate clusters per tenant
- Complete isolation
- Highest cost, strongest guarantees
Implementation (Logical Isolation):
# Every span tagged with tenant at ingestion
span.resource.attributes["tenant.id"] = extract_tenant(api_key)
# Storage: tenant_id in partition key
PRIMARY KEY ((tenant_id, trace_id), span_id)
# Query enforcement (middleware)
def query_traces(request):
    tenant_id = authenticate(request).tenant_id
    # ALWAYS inject tenant filter server-side; it is never taken from the client
    query = request.query.with_filter(tenant_id=tenant_id)
    return storage.execute(query)
# Prevent cross-tenant data access
def get_trace(trace_id, tenant_id):
    # Read by the full partition key so another tenant's trace is never fetched
    trace = storage.get(tenant_id, trace_id)
    # Second check in case the storage layer misbehaves
    if trace is not None and trace.tenant_id != tenant_id:
        raise PermissionDenied("Trace belongs to different tenant")
    return trace
Resource Quotas per Tenant:
quotas:
  tenant_a:
    max_spans_per_sec: 100000
    max_storage_gb: 500
    max_retention_days: 30
    max_queries_per_min: 1000
  tenant_b:
    max_spans_per_sec: 50000
    max_storage_gb: 200
    max_retention_days: 14
    max_queries_per_min: 500
Data Leakage Prevention
Defense in Depth:
1. Ingestion: Validate tenant_id matches API key at collector
2. Storage: tenant_id in partition key (partition-level separation)
3. Query: Mandatory tenant filter injected by middleware (not client)
4. API: Response validation - verify all returned spans match tenant
5. Audit: Log all cross-tenant access attempts (should be zero)
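Layer 4 (response validation) is simple enough to sketch directly. The Span dataclass, TenantLeakError, and validate_response below are hypothetical names; the point is that the check runs on the outgoing response, independently of whatever filter the query layer applied, and fails closed.

```python
from dataclasses import dataclass
from typing import Iterable, List


class TenantLeakError(Exception):
    """Raised when a response contains spans from another tenant."""


@dataclass(frozen=True)
class Span:
    span_id: str
    trace_id: str
    tenant_id: str


def validate_response(spans: Iterable[Span], tenant_id: str) -> List[Span]:
    """Return the spans unchanged, or raise if any belongs to a different tenant."""
    spans = list(spans)
    leaked = [s.span_id for s in spans if s.tenant_id != tenant_id]
    if leaked:
        # Fail closed: return nothing rather than a partially filtered result,
        # and let the caller write the violation to the audit log.
        raise TenantLeakError(f"{len(leaked)} span(s) outside tenant {tenant_id!r}")
    return spans
```

Raising instead of silently dropping the foreign spans is a design choice: a mismatch here means an earlier layer has a bug, and that should surface as an alert rather than be masked.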
Testing:
- Chaos testing: Attempt cross-tenant queries (must fail)
- Penetration testing: Try to bypass tenant filters
- Data audit: Periodic scan for spans without tenant_id
- Canary tenants: Synthetic tenants with known data for verification
Audit Logging for Trace Access
What to Audit
Audit Events:
1. Trace Access:
- Who accessed which trace_id
- When (timestamp)
- From where (IP, user agent)
- What was returned (span count, services visible)
2. Search Queries:
- Who searched for what
- Query parameters (service, tags, time range)
- Result count
- Whether results contained sensitive services
3. Configuration Changes:
- Sampling rate changes (who, when, old value, new value)
- Retention policy changes
- Access control changes
- Scrubbing rule changes
4. Administrative Actions:
- Trace deletion (manual)
- User permission grants/revokes
- API key creation/rotation
- Export/download of trace data
Audit Log Schema:
{
  "timestamp": "2024-01-15T10:30:00Z",
  "event_type": "trace.access",
  "principal": {
    "user_id": "eng-jane-doe",
    "team": "payments",
    "role": "developer",
    "ip_address": "10.0.1.42"
  },
  "action": {
    "type": "get_trace",
    "trace_id": "5b8aa5a2d2c872e8321cf37308d69df2",
    "services_accessed": ["payment-service", "order-service"],
    "span_count_returned": 12
  },
  "context": {
    "reason": "incident_investigation",
    "incident_id": "INC-2024-0142",
    "escalation_approved_by": "oncall-lead"
  }
}
Audit Log Storage and Retention
Requirements:
- Immutable (append-only, no modification)
- Tamper-evident (cryptographic chaining or write-once storage)
- Long retention (1-7 years for compliance)
- Searchable (for investigations)
- Separate from trace data (different access controls)
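The "tamper-evident (cryptographic chaining)" requirement can be illustrated with a small hash chain: each entry stores the hash of the previous entry, so editing or removing any earlier record invalidates everything after it. This is a sketch under assumptions, not a production design; append_entry, verify_chain, and the all-zero genesis value are hypothetical, and a real log would live in append-only storage rather than a Python list.

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS_HASH = "0" * 64  # assumed starting value for the first entry


def _digest(prev_hash: str, event: Dict[str, Any]) -> str:
    # Canonical JSON so the same event always serializes to the same bytes.
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_entry(log: List[Dict[str, Any]], event: Dict[str, Any]) -> Dict[str, Any]:
    """Append an audit event, chaining it to the previous entry's hash."""
    prev_hash = log[-1]["entry_hash"] if log else GENESIS_HASH
    entry = {"event": event, "prev_hash": prev_hash,
             "entry_hash": _digest(prev_hash, event)}
    log.append(entry)
    return entry


def verify_chain(log: List[Dict[str, Any]]) -> bool:
    """Recompute every hash; any modified, reordered, or removed entry fails."""
    prev_hash = GENESIS_HASH
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["entry_hash"] != _digest(prev_hash, entry["event"]):
            return False
        prev_hash = entry["entry_hash"]
    return True
```

Chaining only proves tampering relative to the latest hash, so that hash should periodically be published somewhere the log's writers cannot alter, such as write-once storage.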
Storage Options:
- AWS CloudTrail (managed, immutable, S3-backed)
- Immutable S3 bucket (Object Lock, WORM compliance)
- Dedicated Elasticsearch index (separate cluster from traces)
- Blockchain-anchored (hash chain for tamper evidence)
Access to Audit Logs:
- Only security team and compliance officers
- Separate authentication from trace system
- Alert on: bulk access patterns, unusual hours, sensitive service access
Sampling Bias and Security Implications
Security Risks of Sampling
Problem: Sampling can create blind spots that attackers exploit.
Attack Scenarios:
1. Sampling Evasion:
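- Attacker hides a malicious request among many benign ones to dilute sampling
- At 1% probabilistic sampling, any single request has a 99% chance of leaving no trace
- On average only 1 in 100 attacker requests is recorded, so a one-off exploit is unlikely to be captured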
- Solution: Always trace requests matching security rules
2. Trace ID Manipulation:
- Attacker crafts trace_id to influence sampling decision
- If sampling = hash(trace_id) < threshold, attacker picks trace_id
that hashes above threshold → never sampled
- Solution: Server-side sampling decision, ignore client trace flags
for untrusted sources
3. Trace Bombing (DoS):
- Attacker sends requests with unique trace_ids at high rate
- Each trace_id creates new trace in tail-based sampling buffer
- Exhausts collector memory
- Solution: Rate limiting per source, max traces per client
4. Information Leakage via Trace Context:
- Trace context headers reveal internal architecture
- traceparent header exposes the trace_id and the caller's span_id → lets outsiders correlate requests across hops
- tracestate may contain internal routing info
- Solution: Strip tracestate at trust boundaries,
generate new trace_id at external-facing gateways
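Scenarios 1 and 2 share a mitigation that can be sketched in code: make the sampling decision on the server with a keyed hash the client cannot predict, and bypass probability entirely for security-relevant requests. This is one possible approach, not the only one. The function names and attribute keys are assumptions, and consistent decisions across services would require every sampler to share the same secret.

```python
import hashlib
import hmac
from typing import Mapping


def should_sample(trace_id: str, rate: float, secret: bytes, force: bool = False) -> bool:
    """Deterministic server-side sampling decision for a trace.

    Without the secret an attacker cannot choose a trace_id that lands
    above the threshold, unlike a plain hash(trace_id) < threshold scheme.
    """
    if force:
        return True
    digest = hmac.new(secret, trace_id.encode("utf-8"), hashlib.sha256).digest()
    bucket = int.from_bytes(digest[:8], "big")   # uniform in [0, 2**64)
    return bucket < int(rate * 2**64)            # integer compare avoids float edge cases


def is_security_relevant(attrs: Mapping[str, object]) -> bool:
    """Hypothetical rule set: always trace auth failures and rate-limited requests."""
    return bool(attrs.get("auth.failed") or attrs.get("rate_limit_triggered"))


# Usage at the edge: the client's sampled flag is ignored for untrusted sources.
secret = b"sampling-secret-from-config"  # hypothetical
attrs = {"auth.failed": True}
decision = should_sample("4bf92f3577b34da6a3ce929d0e0e4736", 0.01, secret,
                         force=is_security_relevant(attrs))
```

In the usage above the decision is True regardless of the 1% rate, because the failed-auth attribute forces sampling.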
Mitigations:
security_sampling_rules:
  - condition: "source = external AND path matches /admin/*"
    action: always_sample # Never miss admin access
  - condition: "auth.failed = true"
    action: always_sample # Always trace auth failures
  - condition: "rate_limit_triggered = true"
    action: always_sample # Trace rate-limited requests
  - condition: "source = untrusted"
    action: ignore_client_sampling_decision
Secure Context Propagation Across Trust Boundaries
Trust Boundary Definition
Trust Boundaries in Tracing:
1. External → Internal (internet → API gateway)
2. Internal → External (service → third-party API)
3. Team → Team (different security domains within org)
4. Region → Region (different compliance jurisdictions)
5. Cloud → On-Premise (different network security models)
Context Propagation Rules by Boundary:
External → Internal:
- DO: Accept W3C traceparent (trace_id + parent_id + flags)
- DO NOT: Accept tracestate from untrusted sources
- DO NOT: Trust sampling decision from external (re-evaluate)
- DO: Generate new internal span_id
- DO: Rate limit trace creation from external sources
- DO: Validate trace_id format (reject malformed)
Internal → External:
- DO: Propagate traceparent (enables end-to-end tracing with partners)
- DO NOT: Include internal tracestate
- DO NOT: Include baggage items with internal data
- DO: Log outgoing trace_id for correlation
- CONSIDER: Generate new trace_id at boundary (break trace linkage)
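The "validate trace_id format (reject malformed)" rule can be made concrete with a strict parser for the W3C traceparent header. The sketch below assumes the gateway accepts only the version 00 layout (version, 32-hex trace-id, 16-hex parent-id, 2-hex flags, all lowercase) and rejects everything else; parse_traceparent and the TraceParent tuple are hypothetical names.

```python
import re
from typing import NamedTuple, Optional

# version "00", 16-byte trace-id, 8-byte parent-id, 1-byte flags, lowercase hex only
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceParent(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool  # client's claim; re-evaluate it for untrusted sources


def parse_traceparent(header: Optional[str]) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is missing or malformed."""
    if not header:
        return None
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are defined as invalid by the W3C Trace Context spec.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return TraceParent(trace_id, parent_id, bool(int(flags, 16) & 0x01))
```

A gateway following the rules above would treat a None result as "no external context" and start a fresh trace, and would not honor the sampled field from untrusted callers.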
Implementation at API Gateway:
def handle_external_request(request):
    # Extract external context (if present)
    external_context = extract_w3c_context(request.headers)
    if external_context and is_trusted_source(request):
        # Trusted partner: continue their trace
        internal_span = tracer.start_span(
            "gateway.receive",
            context=external_context,
            kind=SpanKind.SERVER
        )
    else:
        # Untrusted: start new trace, link to external
        internal_span = tracer.start_span(
            "gateway.receive",
            links=[Link(external_context)] if external_context else [],
            kind=SpanKind.SERVER
        )
    # Apply security sampling rules
    if should_force_sample(request):
        internal_span.set_sampled(True)
    # Strip sensitive headers before propagating internally
    sanitized_headers = strip_sensitive_headers(request.headers)
    return forward_to_backend(request, internal_span, sanitized_headers)
Baggage Security
W3C Baggage: Key-value pairs propagated across all services in a trace.
Risk: Baggage is visible to ALL downstream services.
Security Concerns:
1. Data exposure: Baggage "user_id=12345" visible to every service
2. Size explosion: Malicious client adds large baggage items
3. Injection: Baggage values used in queries without sanitization
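The first two concerns can be addressed by filtering the baggage header at the trust boundary. The sketch below is a simplified illustration: filter_baggage is a hypothetical name, the limits and key lists mirror the baggage policy values in this section, and parsing is reduced to comma-separated key=value members with any ";property" suffix discarded (percent-decoding is omitted).

```python
from fnmatch import fnmatchcase
from typing import Dict

MAX_ITEMS = 10
MAX_KEY_LENGTH = 64
MAX_VALUE_LENGTH = 256
MAX_TOTAL_SIZE = 4096  # bytes
ALLOWED_KEYS = {"request.priority", "feature.flags", "tenant.id"}
BLOCKED_PATTERNS = ("user.*", "auth.*", "internal.*")


def filter_baggage(header: str) -> Dict[str, str]:
    """Return only the baggage entries that satisfy the policy."""
    if len(header.encode("utf-8")) > MAX_TOTAL_SIZE:
        return {}  # oversized header: drop everything rather than parse it
    kept: Dict[str, str] = {}
    for member in header.split(","):
        key, sep, value = member.partition("=")
        key = key.strip()
        value = value.split(";", 1)[0].strip()  # discard ";property" metadata
        if not sep or not key:
            continue
        if any(fnmatchcase(key, pattern) for pattern in BLOCKED_PATTERNS):
            continue  # user, auth, and internal routing data never propagate
        if key not in ALLOWED_KEYS:
            continue  # default deny for unknown keys
        if len(key) > MAX_KEY_LENGTH or len(value) > MAX_VALUE_LENGTH:
            continue
        kept[key] = value
        if len(kept) >= MAX_ITEMS:
            break
    return kept
```

For example, filter_baggage("tenant.id=acme,user.id=12345,feature.flags=beta;ttl=1") keeps tenant.id and feature.flags and drops user.id. The third concern (injection) still has to be handled where baggage values are consumed, since filtering keys does not make the values safe to interpolate into queries.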
Policies:
baggage_policy:
  max_items: 10
  max_key_length: 64
  max_value_length: 256
  max_total_size: 4096 # bytes
  allowed_keys:
    - "request.priority"
    - "feature.flags"
    - "tenant.id"
  blocked_keys:
    - "user.*" # No user data in baggage
    - "auth.*" # No auth data in baggage
    - "internal.*" # No internal routing in baggage
  sanitization:
    - strip_at_external_boundary: true
    - validate_values: true # No special characters
    - log_violations: true
Compliance Framework Summary
| Regulation | Key Requirements for Tracing |
|-----------|--------------------------------------------------|
| GDPR | Data minimization, right to erasure, DPO, |
| | consent, cross-border transfer restrictions |
| CCPA | Right to know, right to delete, opt-out of sale |
| HIPAA | PHI protection, access controls, audit trails, |
| | encryption, BAA with vendors |
| SOC 2 | Access controls, monitoring, incident response, |
| | change management, encryption |
| PCI DSS | No storage of full card numbers, encryption, |
| | access logging, network segmentation |
| SOX | Audit trails, access controls, data integrity |
Implementation Checklist:
□ PII scrubbing pipeline configured and tested
□ Encryption in transit (TLS 1.2+) for all connections
□ Encryption at rest for all storage tiers
□ RBAC with least-privilege access
□ Audit logging for all trace access
□ Retention policies aligned with compliance requirements
□ Data deletion capability (GDPR right to erasure)
□ Cross-border transfer controls (EU data stays in EU)
□ Regular security assessments and penetration testing
□ Incident response plan for trace data breaches
□ Data Processing Agreement (DPA) with trace storage vendors
□ Privacy Impact Assessment (PIA) documented
This security and privacy guide ensures a distributed tracing system meets enterprise compliance requirements while maintaining its core debugging utility, following practices established by production systems at organizations subject to strict regulatory oversight.