Distributed Tracing System - Security and Privacy

Overview

Distributed tracing systems capture detailed information about every request flowing through an organization's infrastructure. This creates a unique security and privacy challenge: the system designed to improve observability can inadvertently become a repository of sensitive data, a target for attackers seeking to understand system architecture, or a compliance liability under regulations like GDPR, CCPA, and HIPAA. Production tracing systems must implement defense-in-depth across data collection, storage, access, and retention.

PII in Trace Data

Common Sources of Sensitive Data in Spans

High Risk - Frequently Captured Accidentally:
1. HTTP URLs with query parameters:
   /api/users?email=john@example.com&ssn=123-45-6789
   
2. Request/Response bodies (when body capture is enabled):
   {"credit_card": "4111-1111-1111-1111", "cvv": "123"}
   
3. Database queries with literal values:
   "SELECT * FROM users WHERE email = 'john@example.com'"
   
4. Error messages containing user data:
   "Failed to process payment for user john.doe@company.com"
   
5. HTTP headers:
   Authorization: Bearer eyJhbGciOiJIUzI1NiIs...
   Cookie: session_id=abc123; user_email=john@example.com
   
6. gRPC request metadata:
   user-id: 12345
   x-forwarded-for: 192.168.1.100

Medium Risk - Context-Dependent:
7. User IDs (may be considered PII in some jurisdictions)
8. IP addresses (treated as personal data under GDPR in most contexts)
9. Device identifiers
10. Geographic coordinates
11. Custom business tags (order amounts, account numbers)

Low Risk - Generally Safe:
- Service names, operation names
- HTTP methods, status codes
- Duration, timestamp
- Span/trace IDs
- Generic error codes

Data Scrubbing Pipeline

Scrubbing Architecture:
  Agent → Collector → [Scrubbing Pipeline] → Storage

Pipeline Stages:
1. Allowlist Filtering (most restrictive, safest):
   - Only permit explicitly approved tag keys
   - Default: DENY all unknown tags
   - Approved: http.method, http.status_code, db.system, error

2. Blocklist Filtering (catch known sensitive patterns):
   - Block: authorization, cookie, x-api-key, password
   - Block: credit_card, ssn, social_security
   - Block: any header matching /^x-.*-secret/

3. Regex-Based Redaction:
   patterns:
     - name: credit_card
       regex: '\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b'
       replacement: '[REDACTED_CC]'
     - name: email
       regex: '\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
       replacement: '[REDACTED_EMAIL]'
     - name: ssn
       regex: '\b\d{3}-\d{2}-\d{4}\b'
       replacement: '[REDACTED_SSN]'
     - name: jwt_token
       regex: 'eyJ[A-Za-z0-9_-]*\.eyJ[A-Za-z0-9_-]*\.[A-Za-z0-9_-]*'
       replacement: '[REDACTED_JWT]'
     - name: ipv4
       regex: '\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b'
       replacement: '[REDACTED_IP]'

4. URL Sanitization:
   - Parse URL, remove query parameters by default
   - Allowlist safe query params: page, limit, sort, filter
   - Path parameter normalization: /users/12345 → /users/{id}

5. Database Query Normalization:
   - Replace literal values with placeholders
   - "WHERE email = 'john@example.com'" → "WHERE email = ?"
   - "INSERT INTO orders VALUES (123, 'item')" → "INSERT INTO orders VALUES (?, ?)"
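
The allowlist, regex redaction, URL sanitization, and query normalization stages above can be sketched in a few functions. This is a minimal illustration, not a production scrubber: the function names, the ALLOWED_KEYS set, and the rule that any all-digit path segment is an ID are assumptions made for the example.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical policy values, loosely mirroring the stage list above.
ALLOWED_KEYS = {"http.method", "http.status_code", "http.url",
                "db.system", "db.statement", "error"}
SAFE_QUERY_PARAMS = {"page", "limit", "sort", "filter"}

REDACTIONS = [
    (re.compile(r"eyJ[A-Za-z0-9_-]*\.eyJ[A-Za-z0-9_-]*\.[A-Za-z0-9_-]*"), "[REDACTED_JWT]"),
    (re.compile(r"\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b"), "[REDACTED_CC]"),
    (re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"), "[REDACTED_EMAIL]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[REDACTED_SSN]"),
]


def sanitize_url(url):
    """Drop query parameters that are not allowlisted and normalize numeric path IDs."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k in SAFE_QUERY_PARAMS]
    # Assumption: a path segment made only of digits is an identifier.
    path = re.sub(r"/\d+(?=/|$)", "/{id}", parts.path)
    return urlunsplit((parts.scheme, parts.netloc, path, urlencode(kept), ""))


def normalize_sql(statement):
    """Replace quoted string literals, then bare numeric literals, with '?'."""
    statement = re.sub(r"'(?:[^']|'')*'", "?", statement)
    return re.sub(r"\b\d+(?:\.\d+)?\b", "?", statement)


def redact_text(value):
    """Apply every regex redaction rule in order."""
    for pattern, replacement in REDACTIONS:
        value = pattern.sub(replacement, value)
    return value


def scrub_attributes(attributes):
    """Default-deny: unknown keys are dropped, surviving string values are cleaned."""
    scrubbed = {}
    for key, value in attributes.items():
        if key not in ALLOWED_KEYS:
            continue
        if isinstance(value, str):
            if key == "http.url":
                value = sanitize_url(value)
            elif key == "db.statement":
                value = normalize_sql(value)
            value = redact_text(value)
        scrubbed[key] = value
    return scrubbed
```

With a default-deny allowlist in front, the blocklist stage mostly acts as a safety net for keys that were allowlisted by mistake, which is why it is omitted from this sketch.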

Implementation (OpenTelemetry Collector Processor):
  processors:
    attributes:
      actions:
        - key: http.url
          action: hash        # Hash instead of storing raw URL
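        - key: db.statement
          action: extract     # Copies the named group into a new "operation" attribute
          pattern: '^(?P<operation>\w+)\s'
        - key: db.statement
          action: delete      # extract leaves the source attribute in place, so drop it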
        - key: http.request.header.authorization
          action: delete      # Remove entirely
        - key: user.email
          action: delete

Hashing vs Deletion vs Tokenization

Strategy Comparison:
| Approach     | Reversible | Queryable | Use Case                     |
|--------------|------------|-----------|------------------------------|
| Deletion     | No         | No        | Truly sensitive (passwords)  |
| Hashing      | No**       | Yes*      | Correlation without exposure |
| Tokenization | Yes        | Yes       | Compliance with data access  |
| Masking      | No         | Partial   | Partial visibility needed    |

*Queryable if you hash the search term the same way
**Hashes of guessable values (emails, phone numbers) can be confirmed by hashing candidates; use a keyed hash (HMAC) for such fields

Example - User ID Handling:
- Raw: user_id = "john.doe@company.com"
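- Hashed: user_id = HMAC-SHA256(secret_key, "john.doe@company.com") = "a1b2c3..."
  → Can still correlate all traces for the same user
  → Cannot be reversed without the key (an unkeyed SHA256 of an email can be
    confirmed by hashing candidate addresses, so prefer a keyed hash)
  → Can search by hashing the email you're looking for with the same key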

- Tokenized: user_id = token_lookup("john.doe@company.com") = "USR_7x9k2m"
  → Separate token service maps token ↔ real value
  → Token service has strict access controls
  → Traces store only token, not real value
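
A short sketch of why the hashing strategy benefits from a secret key. It compares a plain SHA-256 digest with an HMAC-SHA256 digest against a guessing attacker. The helper names and the candidate list are hypothetical; only hashlib and hmac from the standard library are used.

```python
import hashlib
import hmac


def plain_hash(value):
    """Unkeyed SHA-256: anyone can recompute it for a guessed input."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def keyed_hash(value, key):
    """HMAC-SHA256: recomputing the digest requires the secret key."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def guess_input(target_digest, candidates, hash_fn):
    """Return the candidate whose digest equals the target, or None."""
    for candidate in candidates:
        if hmac.compare_digest(hash_fn(candidate), target_digest):
            return candidate
    return None


# Hypothetical attacker knowledge: a list of plausible email addresses.
candidates = ["jane@company.com", "john.doe@company.com"]
secret_key = b"kept-in-a-secrets-manager"  # assumption: never stored with the traces

stored_plain = plain_hash("john.doe@company.com")
stored_keyed = keyed_hash("john.doe@company.com", secret_key)

recovered_from_plain = guess_input(stored_plain, candidates, plain_hash)
recovered_from_keyed = guess_input(stored_keyed, candidates, plain_hash)
```

Here recovered_from_plain is the original email, while recovered_from_keyed is None because the attacker cannot compute the HMAC without the key. Searching still works for authorized tooling, which hashes the search term with the same key.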

Access Control

Role-Based Access Control (RBAC)

Role Hierarchy:
┌─────────────────────────────────────────────────────────────┐
│ Role              │ Permissions                              │
├─────────────────────────────────────────────────────────────┤
│ Platform Admin    │ All traces, all services, config mgmt    │
│ Service Owner     │ Own service traces + direct dependencies │
│ Developer         │ Own team's service traces                │
│ On-Call Engineer  │ All traces (time-limited during incident)│
│ Auditor           │ Read-only, all traces, audit logs        │
│ External Partner  │ Specific service, limited time range     │
└─────────────────────────────────────────────────────────────┘

Permission Model:
{
  "role": "service_owner",
  "principal": "team-payments",
  "permissions": {
    "traces": {
      "read": {
        "services": ["payment-service", "payment-gateway", "fraud-detection"],
        "include_dependencies": true,
        "max_depth": 2,
        "time_range": "30d"
      },
      "search": {
        "services": ["payment-service"],
        "allowed_tag_filters": ["http.status_code", "error", "customer.tier"],
        "blocked_tag_filters": ["user.email", "card.number"]
      }
    },
    "admin": {
      "sampling_config": ["payment-service"],
      "retention_config": false
    }
  }
}

Service-Level Trace Isolation

Problem: Developer on Team A should not see internal traces of Team B's service.
A trace spanning both teams should show Team A's spans in detail,
but Team B's spans only as opaque boxes (service name + duration).

Implementation:

1. Trace Filtering at Query Time:
   def get_trace(trace_id, requesting_user):
       spans = storage.get_spans(trace_id)
       user_services = get_allowed_services(requesting_user)
       
       filtered_spans = []
       for span in spans:
           if span.service_name in user_services:
               filtered_spans.append(span)  # Full detail
           else:
               filtered_spans.append(redact_span(span))  # Opaque
       
       return assemble_trace(filtered_spans)
   
   def redact_span(span):
       return Span(
           span_id=span.span_id,
           parent_span_id=span.parent_span_id,
           service_name=span.service_name,  # Keep for topology
           operation_name="[REDACTED]",
           duration=span.duration,           # Keep for timing
           tags={},                          # Remove all tags
           logs=[],                          # Remove all logs
           status=span.status               # Keep error indicator
       )

2. Attribute-Level Redaction:
   - Some tags visible to all (http.status_code, error)
   - Some tags visible only to service owner (db.statement, request.body)
   - Configured per-tag, per-role

3. Time-Limited Escalation:
   - During incidents: grant temporary access to all traces
   - Requires approval from service owner or on-call
   - Auto-expires after incident resolution (max 24 hours)
   - All escalated access is audit-logged
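
Attribute-level redaction and time-limited escalation could look roughly like the following. The TAG_VISIBILITY table, role names, and the Escalation class are illustrative assumptions; the 24-hour cap comes from the list above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical per-tag, per-role visibility; tags not listed are hidden from everyone.
TAG_VISIBILITY = {
    "http.status_code": {"developer", "service_owner", "auditor"},
    "error": {"developer", "service_owner", "auditor"},
    "db.statement": {"service_owner"},
    "request.body": {"service_owner"},
}

MAX_ESCALATION = timedelta(hours=24)


def visible_tags(tags, role):
    """Keep only the tags the given role is allowed to see."""
    return {key: value for key, value in tags.items()
            if role in TAG_VISIBILITY.get(key, set())}


@dataclass(frozen=True)
class Escalation:
    """A temporary grant of access to all traces during an incident."""
    user: str
    approved_by: str
    granted_at: datetime
    duration: timedelta

    @property
    def expires_at(self):
        # The requested duration is capped at the 24-hour maximum.
        return self.granted_at + min(self.duration, MAX_ESCALATION)

    def is_active(self, now):
        return self.granted_at <= now < self.expires_at
```

A query handler would first check for an active Escalation and only then fall back to visible_tags for the caller's normal role. Every check against an escalation would also be written to the audit log.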

Data Retention and GDPR Compliance

Right to Erasure (Article 17)

Challenge: User requests deletion of all their data.
Traces may contain user identifiers scattered across thousands of spans.

Approaches:

1. Pseudonymization at Ingestion (Preferred):
   - Replace user identifiers with pseudonyms at collection time
   - Mapping table: real_user_id → pseudonym (encrypted, separate storage)
   - Deletion: delete mapping entry → pseudonym can no longer be linked to the user
     (provided no other identifying attributes remain in the spans)
   - Traces remain intact for debugging (no user association)

2. Trace Deletion by User ID:
   - Requires index: user_id → [trace_ids]
   - Delete all traces containing user's data
   - Expensive: may delete traces shared with other users
   - Risk: incomplete deletion if user_id appears in unexpected fields

3. Crypto-Shredding:
   - Encrypt user-specific data with per-user key
   - Store encrypted values in span tags
   - Deletion: destroy the user's encryption key
   - Data becomes unreadable without key
   - Traces remain for structural analysis

Implementation of Crypto-Shredding:
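  # At ingestion
  user_key = key_store.get_or_create_key(user_id)   # one key per user
  encrypted_email = encrypt(user_key, user.email)
  span.set_attribute("user.email.encrypted", encrypted_email)
  span.set_attribute("user.key_id", user_key.id)
  
  # At query time (authorized user)
  key_id = span.attributes["user.key_id"]
  encrypted_email = span.attributes["user.email.encrypted"]
  user_key = key_store.get_key_by_id(key_id)         # fails once the key is destroyed
  email = decrypt(user_key, encrypted_email)
  
  # For deletion (GDPR request)
  key_store.delete_key(user_id)
  # All attributes encrypted with that key become permanently unreadable,
  # provided no copies of the key survive in caches or backups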

Recommended: Pseudonymization + short retention (7-30 days)
This minimizes GDPR exposure while maintaining debugging utility.
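
A minimal sketch of the pseudonymization approach, assuming an in-memory mapping stands in for the encrypted, separately stored mapping table. PseudonymStore and its method names are hypothetical.

```python
import secrets


class PseudonymStore:
    """Maps real user IDs to random pseudonyms; spans only ever see the pseudonym."""

    def __init__(self):
        self._pseudonym_by_user = {}

    def pseudonym_for(self, user_id):
        """Return a stable pseudonym for the user, creating one on first use."""
        if user_id not in self._pseudonym_by_user:
            # Random, so the pseudonym cannot be derived from the user ID.
            self._pseudonym_by_user[user_id] = "anon_" + secrets.token_hex(16)
        return self._pseudonym_by_user[user_id]

    def erase(self, user_id):
        """Erasure request: drop the mapping. Returns True if an entry existed."""
        return self._pseudonym_by_user.pop(user_id, None) is not None


store = PseudonymStore()
span_attributes = {"user.id": store.pseudonym_for("user-42")}  # set at ingestion
store.erase("user-42")  # later: the stored pseudonym no longer maps to anyone
```

After erase, existing spans keep their pseudonym, so traces for that former user can still be grouped together, but nothing in the store links them to a real identity. A returning user receives a fresh pseudonym.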

Data Minimization (Article 5)

Principle: Collect only what is necessary for the stated purpose.

Application to Tracing:
- Purpose: Debug production issues, monitor performance
- Necessary: service names, operation names, durations, error codes
- NOT necessary: full request bodies, user emails, IP addresses

Implementation:
1. Default deny: Only collect explicitly approved attributes
2. Purpose limitation: Document why each collected field is needed
3. Storage limitation: Minimum retention that serves debugging needs
4. Regular review: Quarterly audit of collected attributes

Data Protection Impact Assessment (DPIA):
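- Required under GDPR Article 35 when processing is likely to result in a high risk to individuals (for example, large-scale processing of personal data in traces)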
- Document: what data, why collected, how protected, retention period
- Review: annually or when collection scope changes

Retention Policy Enforcement

Automated Retention:
  retention_policies:
    - name: "standard_traces"
      condition: "default"
      duration: "7d"
      action: "delete"
    
    - name: "error_traces"
      condition: "has_error = true"
      duration: "30d"
      action: "delete"
    
    - name: "gdpr_user_traces"
      condition: "contains_pii = true"
      duration: "72h"          # Minimize PII exposure window
      action: "delete"
    
    - name: "compliance_audit"
      condition: "service IN ('auth', 'payment')"
      duration: "1y"
      action: "archive_anonymized"  # Keep structure, remove PII
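
One way to evaluate the policies above is an ordered list where the first match wins. The precedence chosen here (PII first, so personal data never outlives 72 hours) is an assumption, since the configuration does not say how overlapping conditions are resolved. The trace field names and the 365-day reading of "1y" are also assumptions.

```python
from datetime import datetime, timedelta, timezone

# (name, predicate, retention, action), checked top to bottom.
POLICIES = [
    ("gdpr_user_traces", lambda t: t.get("contains_pii", False),
     timedelta(hours=72), "delete"),
    ("compliance_audit", lambda t: t.get("service") in {"auth", "payment"},
     timedelta(days=365), "archive_anonymized"),
    ("error_traces", lambda t: t.get("has_error", False),
     timedelta(days=30), "delete"),
    ("standard_traces", lambda t: True,
     timedelta(days=7), "delete"),
]


def select_policy(trace):
    """Return (name, retention, action) of the first policy whose predicate matches."""
    for name, matches, retention, action in POLICIES:
        if matches(trace):
            return name, retention, action
    raise LookupError("no policy matched")  # unreachable while a default policy exists


def due_action(trace, now):
    """Return the action to apply if the trace has outlived its retention, else None."""
    _, retention, action = select_policy(trace)
    return action if now - trace["ended_at"] >= retention else None
```

A periodic job would call due_action for each trace (or each storage partition) and hand the result to the deletion or archival worker.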

Deletion Verification:
- After TTL expiration, verify data is actually deleted
- Check all storage tiers (hot, warm, cold, backups)
- Verify deletion from indexes (Elasticsearch, Cassandra)
- Audit log: "Deleted 1.2M spans older than 7 days at 2024-01-15T03:00:00Z"

Encryption

In-Transit Encryption

Agent → Collector:
- Protocol: gRPC over TLS 1.3 (mutual TLS preferred)
- Certificate: Auto-rotated via cert-manager or SPIFFE/SPIRE
- Cipher suites (TLS 1.3): TLS_AES_256_GCM_SHA384, TLS_CHACHA20_POLY1305_SHA256
- Minimum version: TLS 1.2 for clients that cannot negotiate 1.3 (disable TLS 1.0 and 1.1)

mTLS Configuration:
  # Collector server config
  tls:
    cert_file: /certs/collector.crt
    key_file: /certs/collector.key
    client_ca_file: /certs/ca.crt    # Verify agent certificates
    client_auth: RequireAndVerifyClientCert
  
  # Agent client config
  tls:
    cert_file: /certs/agent.crt
    key_file: /certs/agent.key
    ca_file: /certs/ca.crt           # Verify collector certificate
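
For a collector-side endpoint written in Python, the server half of the configuration above maps onto the standard ssl module roughly as follows. The function name is hypothetical, and the file arguments are optional here only so the sketch can run without certificates on disk; a real server would always load them.

```python
import ssl


def build_collector_tls_context(cert_file=None, key_file=None, client_ca_file=None):
    """Server-side TLS context that requires a verified client certificate (mTLS)."""
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # Refuse TLS 1.0 and 1.1; TLS 1.3 is negotiated when both sides support it.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    # Equivalent of "RequireAndVerifyClientCert": the handshake fails without a valid cert.
    context.verify_mode = ssl.CERT_REQUIRED
    if cert_file and key_file:
        context.load_cert_chain(certfile=cert_file, keyfile=key_file)
    if client_ca_file:
        # CA bundle used to verify agent certificates.
        context.load_verify_locations(cafile=client_ca_file)
    return context


context = build_collector_tls_context()
```

In deployment the call would pass the paths from the configuration, for example build_collector_tls_context("/certs/collector.crt", "/certs/collector.key", "/certs/ca.crt").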

Collector → Kafka:
- SASL_SSL authentication
- TLS for data in transit
- ACLs for topic-level access control

Collector → Storage:
- Cassandra: Client-to-node encryption (TLS)
- Elasticsearch: HTTPS with authentication
- ClickHouse: TLS + password authentication
- S3: HTTPS (enforce with a bucket policy that denies requests where aws:SecureTransport is false; plain HTTP is otherwise accepted)

Internal Network:
- Service mesh (Istio/Linkerd): automatic mTLS between all pods
- Eliminates need for application-level TLS configuration
- Certificate rotation handled by mesh control plane

At-Rest Encryption

Storage Layer Encryption:
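- Cassandra: Transparent Data Encryption (TDE) with AES-256 in DataStax Enterprise; open-source Apache Cassandra relies on volume-level encryption (dm-crypt/LUKS, encrypted cloud volumes) for SSTables
- Elasticsearch: Encrypted at rest at the OS level (dm-crypt/LUKS)
- ClickHouse: Encrypted disks (AES-256-CTR)
- S3: SSE-S3 (S3-managed keys) or SSE-KMS (KMS keys, AWS-managed or customer-managed)
- Kafka: No native at-rest encryption; encrypt broker volumes (dm-crypt/LUKS, encrypted cloud volumes) or encrypt payloads before producing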

Key Management:
- Use AWS KMS / GCP Cloud KMS / HashiCorp Vault
- Separate keys per data classification level
- Key rotation: every 90 days (automatic)
- Key access audit: who accessed which key, when

Encryption Hierarchy:
  Master Key (KMS, HSM-backed)
    └── Data Encryption Key (DEK) per storage volume
         └── Encrypted span data

Performance Impact:
- AES-NI hardware acceleration: typically <5% CPU overhead (verify on your own workload)
- Negligible latency impact for reads/writes
- Key caching eliminates KMS round-trips for hot keys

Multi-Tenant Trace Isolation

Tenant Isolation Architecture

Multi-Tenancy Models:

1. Shared Infrastructure, Logical Isolation:
   - All tenants share same Kafka, Cassandra, ES clusters
   - Isolation via tenant_id field in every span
   - Query layer enforces tenant filtering
   - Risk: Noisy neighbor, data leakage bugs

2. Shared Infrastructure, Resource-Level Isolation:
   - Separate Kafka topics per tenant
   - Separate Cassandra keyspaces per tenant
   - Separate ES indices per tenant
   - Better isolation, higher operational cost

3. Dedicated Infrastructure (Enterprise):
   - Separate clusters per tenant
   - Complete isolation
   - Highest cost, strongest guarantees

Implementation (Logical Isolation):
  # Every span tagged with tenant at ingestion
  span.resource.attributes["tenant.id"] = extract_tenant(api_key)
  
  # Storage: tenant_id in partition key
  PRIMARY KEY ((tenant_id, trace_id), span_id)
  
  # Query enforcement (middleware)
  def query_traces(request):
      tenant_id = authenticate(request).tenant_id
      # ALWAYS inject tenant filter - cannot be bypassed
      query = request.query.with_filter(tenant_id=tenant_id)
      return storage.execute(query)
  
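  # Prevent cross-tenant data access
  def get_trace(trace_id, tenant_id):
      # Look up by the full partition key so another tenant's trace is never read
      trace = storage.get(tenant_id, trace_id)
      if trace is None:
          # Same error whether the trace is missing or owned by another tenant,
          # so callers cannot probe for the existence of foreign trace IDs
          raise NotFound("Trace not found")
      return trace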

Resource Quotas per Tenant:
  quotas:
    tenant_a:
      max_spans_per_sec: 100000
      max_storage_gb: 500
      max_retention_days: 30
      max_queries_per_min: 1000
    tenant_b:
      max_spans_per_sec: 50000
      max_storage_gb: 200
      max_retention_days: 14
      max_queries_per_min: 500
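
The max_spans_per_sec quota could be enforced at the collector with one token bucket per tenant. This is a sketch under assumptions: the class names are hypothetical, the burst size is set equal to the per-second rate, and the clock is injectable so the behaviour can be checked deterministically.

```python
import time


class TokenBucket:
    """Refills at `rate` tokens per second up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock
        self.updated = clock()

    def allow(self, cost=1):
        """Consume `cost` tokens if available; otherwise reject without consuming."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class TenantSpanLimiter:
    """One bucket per tenant, sized from the quota configuration."""

    def __init__(self, quotas, clock=time.monotonic):
        self._buckets = {
            tenant: TokenBucket(q["max_spans_per_sec"], q["max_spans_per_sec"], clock)
            for tenant, q in quotas.items()
        }

    def admit(self, tenant_id, span_count=1):
        """Unknown tenants are rejected; known tenants are limited by their bucket."""
        bucket = self._buckets.get(tenant_id)
        return bucket is not None and bucket.allow(span_count)
```

Rejected batches would normally be dropped with a counter increment rather than queued, so one tenant's burst cannot consume collector memory that other tenants depend on.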

Data Leakage Prevention

Defense in Depth:
1. Ingestion: Validate tenant_id matches API key at collector
2. Storage: Tenant_id in partition key (tenants never share a partition)
3. Query: Mandatory tenant filter injected by middleware (not client)
4. API: Response validation - verify all returned spans match tenant
5. Audit: Log all cross-tenant access attempts (should be zero)

Testing:
- Chaos testing: Attempt cross-tenant queries (must fail)
- Penetration testing: Try to bypass tenant filters
- Data audit: Periodic scan for spans without tenant_id
- Canary tenants: Synthetic tenants with known data for verification
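
The response-validation layer (step 4 above) is small enough to sketch in full. It assumes spans are dictionaries carrying a tenant_id field; TenantLeakError is a hypothetical exception name.

```python
class TenantLeakError(Exception):
    """Raised when a response would expose spans from another tenant."""


def validate_response(spans, tenant_id):
    """Last line of defense: refuse to return any span not owned by the caller's tenant.

    Spans with a missing tenant_id are treated as leaks as well.
    """
    leaked = [span for span in spans if span.get("tenant_id") != tenant_id]
    if leaked:
        raise TenantLeakError(
            f"{len(leaked)} span(s) do not belong to tenant {tenant_id!r}"
        )
    return spans
```

Failing the whole response, instead of silently filtering, makes a broken tenant filter visible immediately. The same function doubles as the assertion in canary-tenant tests.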

Audit Logging for Trace Access

What to Audit

Audit Events:
1. Trace Access:
   - Who accessed which trace_id
   - When (timestamp)
   - From where (IP, user agent)
   - What was returned (span count, services visible)

2. Search Queries:
   - Who searched for what
   - Query parameters (service, tags, time range)
   - Result count
   - Whether results contained sensitive services

3. Configuration Changes:
   - Sampling rate changes (who, when, old value, new value)
   - Retention policy changes
   - Access control changes
   - Scrubbing rule changes

4. Administrative Actions:
   - Trace deletion (manual)
   - User permission grants/revokes
   - API key creation/rotation
   - Export/download of trace data

Audit Log Schema:
{
  "timestamp": "2024-01-15T10:30:00Z",
  "event_type": "trace.access",
  "principal": {
    "user_id": "eng-jane-doe",
    "team": "payments",
    "role": "developer",
    "ip_address": "10.0.1.42"
  },
  "action": {
    "type": "get_trace",
    "trace_id": "5b8aa5a2d2c872e8321cf37308d69df2",
    "services_accessed": ["payment-service", "order-service"],
    "span_count_returned": 12
  },
  "context": {
    "reason": "incident_investigation",
    "incident_id": "INC-2024-0142",
    "escalation_approved_by": "oncall-lead"
  }
}

Audit Log Storage and Retention

Requirements:
- Immutable (append-only, no modification)
- Tamper-evident (cryptographic chaining or write-once storage)
- Long retention (1-7 years for compliance)
- Searchable (for investigations)
- Separate from trace data (different access controls)
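
The tamper-evident requirement can be met with cryptographic chaining, where each record stores the hash of its predecessor. A minimal sketch follows. The record layout and function names are assumptions, and a real deployment would also anchor the latest hash somewhere the log writer cannot modify.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder predecessor for the first record


def _digest(prev_hash, event):
    """Hash the previous record's hash together with a canonical encoding of the event."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_event(log, event):
    """Append an event, linking it to the hash of the last record."""
    prev_hash = log[-1]["hash"] if log else GENESIS_HASH
    record = {"event": event, "prev_hash": prev_hash, "hash": _digest(prev_hash, event)}
    log.append(record)
    return record


def verify_chain(log):
    """Recompute every link; any edited, removed, or reordered record breaks the chain."""
    prev_hash = GENESIS_HASH
    for record in log:
        if record["prev_hash"] != prev_hash:
            return False
        if record["hash"] != _digest(prev_hash, record["event"]):
            return False
        prev_hash = record["hash"]
    return True
```

Truncating the tail of the log is not detected by the chain alone, which is why the most recent hash should be published or stored externally.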

Storage Options:
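- AWS CloudTrail (managed, S3-backed; it records AWS API activity, so application-level trace access events need CloudTrail Lake custom events or a separate store)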
- Immutable S3 bucket (Object Lock, WORM compliance)
- Dedicated Elasticsearch index (separate cluster from traces)
- Blockchain-anchored (hash chain for tamper evidence)

Access to Audit Logs:
- Only security team and compliance officers
- Separate authentication from trace system
- Alert on: bulk access patterns, unusual hours, sensitive service access

Sampling Bias and Security Implications

Security Risks of Sampling

Problem: Sampling can create blind spots that attackers exploit.

Attack Scenarios:

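1. Sampling Evasion:
   - With 1% probabilistic sampling, a single malicious request has a 99% chance of leaving no trace
   - On average only 1 in 100 attacker requests is recorded
   - With rate-limited sampling, an attacker can also flood benign requests to
     use up the sampling budget before sending the malicious one
   - Solution: Always trace requests matching security rules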

2. Trace ID Manipulation:
   - Attacker crafts trace_id to influence sampling decision
   - If sampling = hash(trace_id) < threshold, attacker picks trace_id
     that hashes above threshold → never sampled
   - Solution: Server-side sampling decision, ignore client trace flags
     for untrusted sources

3. Trace Bombing (DoS):
   - Attacker sends requests with unique trace_ids at high rate
   - Each trace_id creates new trace in tail-based sampling buffer
   - Exhausts collector memory
   - Solution: Rate limiting per source, max traces per client

4. Information Leakage via Trace Context:
   - Trace context headers exchanged with external parties can reveal internal details
   - traceparent exposes internal trace and span IDs, which allow correlation across requests
   - tracestate may contain vendor-specific or internal routing info
   - Solution: Strip tracestate at trust boundaries,
     generate new trace_id at external-facing gateways

Mitigations:
  security_sampling_rules:
    - condition: "source = external AND path matches /admin/*"
      action: always_sample    # Never miss admin access
    - condition: "auth.failed = true"
      action: always_sample    # Always trace auth failures
    - condition: "rate_limit_triggered = true"
      action: always_sample    # Trace rate-limited requests
    - condition: "source = untrusted"
      action: ignore_client_sampling_decision
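
A sketch of a server-side sampling decision that applies the rules above and resists trace ID manipulation. It assumes the request is a plain dictionary with the field names shown; the function name and the 1% base rate are illustrative. The client's sampled flag is never consulted.

```python
import hashlib
import hmac


def should_sample(request, trace_id, secret, base_rate=0.01):
    """Decide sampling on the server; security-relevant requests are always kept."""
    if request.get("auth_failed") or request.get("rate_limit_triggered"):
        return True
    if request.get("source") == "external" and request.get("path", "").startswith("/admin/"):
        return True
    # Keyed hash of the trace ID: without the secret, a client cannot choose a
    # trace ID that is guaranteed to fall above the threshold.
    digest = hmac.new(secret, trace_id.encode("utf-8"), hashlib.sha256).digest()
    bucket = int.from_bytes(digest[:8], "big")      # uniform in [0, 2**64)
    return bucket < int(base_rate * 2**64)          # integer compare avoids float rounding
```

Because the decision is a deterministic function of the trace ID and the secret, every collector sharing the secret reaches the same verdict for the same trace, which keeps sampled traces complete.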

Secure Context Propagation Across Trust Boundaries

Trust Boundary Definition

Trust Boundaries in Tracing:
1. External → Internal (internet → API gateway)
2. Internal → External (service → third-party API)
3. Team → Team (different security domains within org)
4. Region → Region (different compliance jurisdictions)
5. Cloud → On-Premise (different network security models)

Context Propagation Rules by Boundary:

External → Internal:
  - DO: Accept W3C traceparent (trace_id + parent_id + flags)
  - DO NOT: Accept tracestate from untrusted sources
  - DO NOT: Trust sampling decision from external (re-evaluate)
  - DO: Generate new internal span_id
  - DO: Rate limit trace creation from external sources
  - DO: Validate trace_id format (reject malformed)

Internal → External:
  - DO: Propagate traceparent (enables end-to-end tracing with partners)
  - DO NOT: Include internal tracestate
  - DO NOT: Include baggage items with internal data
  - DO: Log outgoing trace_id for correlation
  - CONSIDER: Generate new trace_id at boundary (break trace linkage)
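
The "validate trace_id format" and "do not accept tracestate" rules can be sketched as below. The parser follows the W3C traceparent layout (version, 32-hex trace ID, 16-hex parent ID, 2-hex flags) and is deliberately strict: it accepts only the four-field form. The function names and the set of headers dropped for untrusted callers are assumptions.

```python
import re

_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")

# Headers not accepted from untrusted callers (assumed policy).
UNTRUSTED_DROP = {"tracestate", "baggage"}


def parse_traceparent(header):
    """Return the parsed fields, or None if the header is malformed or invalid."""
    if not header:
        return None
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        # Informational only: untrusted callers do not decide sampling.
        "client_sampled": bool(int(flags, 16) & 0x01),
    }


def inbound_headers(headers, trusted):
    """Drop tracestate and baggage unless the caller is a trusted source."""
    if trusted:
        return dict(headers)
    return {k: v for k, v in headers.items() if k.lower() not in UNTRUSTED_DROP}
```

A gateway would call parse_traceparent before creating its own span, and treat a None result the same way as a request that carried no trace context at all.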

Implementation at API Gateway:
  def handle_external_request(request):
      # Extract external context (if present)
      external_context = extract_w3c_context(request.headers)
      
      if external_context and is_trusted_source(request):
          # Trusted partner: continue their trace
          internal_span = tracer.start_span(
              "gateway.receive",
              context=external_context,
              kind=SpanKind.SERVER
          )
      else:
          # Untrusted: start new trace, link to external
          internal_span = tracer.start_span(
              "gateway.receive",
              links=[Link(external_context)] if external_context else [],
              kind=SpanKind.SERVER
          )
      
      # Apply security sampling rules
      if should_force_sample(request):
          internal_span.set_sampled(True)
      
      # Strip sensitive headers before propagating internally
      sanitized_headers = strip_sensitive_headers(request.headers)
      
      return forward_to_backend(request, internal_span, sanitized_headers)

Baggage Security

W3C Baggage: Key-value pairs propagated across all services in a trace.
Risk: Baggage is visible to ALL downstream services.

Security Concerns:
1. Data exposure: Baggage "user_id=12345" visible to every service
2. Size explosion: Malicious client adds large baggage items
3. Injection: Baggage values used in queries without sanitization

Policies:
  baggage_policy:
    max_items: 10
    max_key_length: 64
    max_value_length: 256
    max_total_size: 4096  # bytes
    
    allowed_keys:
      - "request.priority"
      - "feature.flags"
      - "tenant.id"
    
    blocked_keys:
      - "user.*"          # No user data in baggage
      - "auth.*"          # No auth data in baggage
      - "internal.*"      # No internal routing in baggage
    
    sanitization:
      - strip_at_external_boundary: true
      - validate_values: true  # No special characters
      - log_violations: true
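
The policy above could be enforced by a filter like the following. It is a sketch: the function name, the character set treated as safe for values, and the order of the checks are assumptions. Blocked patterns are matched with fnmatchcase from the standard library.

```python
import re
from fnmatch import fnmatchcase

MAX_ITEMS = 10
MAX_KEY_LENGTH = 64
MAX_VALUE_LENGTH = 256
MAX_TOTAL_SIZE = 4096  # bytes, keys plus values

ALLOWED_KEYS = {"request.priority", "feature.flags", "tenant.id"}
BLOCKED_PATTERNS = ["user.*", "auth.*", "internal.*"]
SAFE_VALUE = re.compile(r"[A-Za-z0-9._:,-]*")  # assumed definition of "no special characters"


def filter_baggage(items):
    """Return (kept, violations); violations are (key, reason) pairs for logging."""
    kept, violations, total = {}, [], 0
    for key, value in items.items():
        if any(fnmatchcase(key, pattern) for pattern in BLOCKED_PATTERNS):
            violations.append((key, "blocked key"))
        elif key not in ALLOWED_KEYS:
            violations.append((key, "not allowlisted"))
        elif len(key) > MAX_KEY_LENGTH or len(value) > MAX_VALUE_LENGTH:
            violations.append((key, "too long"))
        elif not SAFE_VALUE.fullmatch(value):
            violations.append((key, "unsafe characters"))
        else:
            size = len(key.encode("utf-8")) + len(value.encode("utf-8"))
            if len(kept) >= MAX_ITEMS or total + size > MAX_TOTAL_SIZE:
                violations.append((key, "limit exceeded"))
            else:
                kept[key] = value
                total += size
    return kept, violations
```

At an external boundary the gateway would skip this filter and discard all inbound baggage, per strip_at_external_boundary. The filter applies to baggage set by internal services.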

Compliance Framework Summary

| Regulation | Key Requirements for Tracing                      |
|------------|---------------------------------------------------|
| GDPR       | Data minimization, right to erasure, DPO,         |
|            | lawful basis, cross-border transfer restrictions  |
| CCPA       | Right to know, right to delete, opt-out of sale   |
| HIPAA      | PHI protection, access controls, audit trails,    |
|            | encryption, BAA with vendors                      |
| SOC 2      | Access controls, monitoring, incident response,   |
|            | change management, encryption                     |
| PCI DSS    | No storage of CVV or unprotected card numbers,    |
|            | encryption, access logging, network segmentation  |
| SOX        | Audit trails, access controls, data integrity     |
Implementation Checklist:
□ PII scrubbing pipeline configured and tested
□ Encryption in transit (TLS 1.2+) for all connections
□ Encryption at rest for all storage tiers
□ RBAC with least-privilege access
□ Audit logging for all trace access
□ Retention policies aligned with compliance requirements
□ Data deletion capability (GDPR right to erasure)
□ Cross-border transfer controls (EU data stays in the EU, or is transferred under an approved mechanism such as SCCs)
□ Regular security assessments and penetration testing
□ Incident response plan for trace data breaches
□ Data Processing Agreement (DPA) with trace storage vendors
□ Data Protection Impact Assessment (DPIA) documented

Applied together, these controls help a distributed tracing system meet enterprise compliance requirements while preserving its core debugging utility.