Variations & Follow-ups

Part 8 of 10

Metrics Monitoring System - Variations and Follow-ups

Common Variations

1. Distributed Tracing Integration

  • Correlate metrics with traces
  • Trace ID in metric labels
  • Span metrics generation
  • Service dependency mapping
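
Span metrics generation can be sketched as a fold over finished spans. The example below is a minimal illustration, not any tracer's actual API: the span fields (`service`, `operation`, `duration_s`, `error`) and the bucket boundaries are assumptions chosen for the example.

```python
from collections import defaultdict

def span_metrics(spans, buckets=(0.1, 0.5, 1.0)):
    """Derive call counts, error counts, and a cumulative latency
    histogram per (service, operation) from finished spans.

    The span dict layout is hypothetical, chosen for this sketch.
    """
    out = defaultdict(
        lambda: {"calls": 0, "errors": 0, "le": {b: 0 for b in buckets}}
    )
    for span in spans:
        m = out[(span["service"], span["operation"])]
        m["calls"] += 1
        if span.get("error"):
            m["errors"] += 1
        # Cumulative buckets: a span counts in every bucket it fits under.
        for bound in buckets:
            if span["duration_s"] <= bound:
                m["le"][bound] += 1
    return dict(out)
```

Keying on (service, operation) rather than on trace ID keeps the derived series bounded, which matters for the cardinality concerns discussed later.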

2. Log Aggregation Integration

  • Metrics from logs
  • Log-based alerts
  • Unified observability platform
  • Pivoting between logs and metrics without losing context
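
Deriving metrics from logs usually amounts to parsing each line and incrementing a labeled counter. The sketch below assumes a hypothetical log format of `<timestamp> <LEVEL> <service> <message>`; real formats vary, so the regular expression is illustrative only.

```python
import re
from collections import Counter

# Assumed line layout: "<timestamp> <LEVEL> <service> <message...>"
LINE = re.compile(r"^\S+ (?P<level>[A-Z]+) (?P<service>\S+) ")

def count_by_level(lines):
    """Count log lines per (service, level); unparseable lines are skipped."""
    counts = Counter()
    for line in lines:
        match = LINE.match(line)
        if match:
            counts[(match["service"], match["level"])] += 1
    return counts
```

A log-based alert could then be a threshold on, for example, the ERROR count for one service over a window.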

3. Application Performance Monitoring (APM)

  • Code-level metrics
  • Transaction tracing
  • Error tracking
  • Performance profiling

4. Infrastructure Monitoring

  • Host metrics (CPU, memory, disk)
  • Network metrics
  • Container metrics
  • Cloud provider metrics

5. Business Metrics

  • Revenue tracking
  • User engagement
  • Conversion rates
  • Custom business KPIs

Follow-up Questions

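Q: How do you handle high-cardinality metrics? A: Keep unbounded values (user IDs, request IDs) out of labels, drop or rewrite labels with metric relabeling, monitor series counts per metric, drop metrics that exceed their budget, and use recording rules to pre-aggregate.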
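
As a rough sketch of label limiting plus cardinality monitoring, the class below strips a configured set of labels and refuses new series once a metric reaches its budget. `CardinalityLimiter`, its parameters, and the budget policy are assumptions made for illustration, not a real library's interface.

```python
from collections import defaultdict

class CardinalityLimiter:
    """Admit samples only while a metric stays under its series budget."""

    def __init__(self, max_series_per_metric, drop_labels=frozenset()):
        self.max_series = max_series_per_metric
        self.drop_labels = drop_labels
        self._seen = defaultdict(set)  # metric name -> set of label tuples

    def admit(self, metric, labels):
        # Relabeling step: remove labels known to be unbounded.
        kept = tuple(sorted(
            (k, v) for k, v in labels.items() if k not in self.drop_labels
        ))
        series = self._seen[metric]
        if kept in series:
            return True           # existing series, always accepted
        if len(series) >= self.max_series:
            return False          # new series would exceed the budget
        series.add(kept)
        return True

    def cardinality(self, metric):
        return len(self._seen[metric])
```

Rejecting only new series, rather than the whole metric, keeps existing dashboards working while the offending label is tracked down.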

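Q: How do you prevent alert fatigue? A: Group related alerts into one notification, route by team and severity, rate-limit repeat notifications, define severity levels, link each alert to a runbook, and auto-resolve alerts when the condition clears.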
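
Alert grouping can be illustrated as bucketing alerts by a chosen subset of labels so each bucket produces one notification. The label names `alertname` and `cluster` are assumptions for this sketch.

```python
from collections import defaultdict

def group_alerts(alerts, group_by=("alertname", "cluster")):
    """Bucket alerts (dicts of labels) by the values of the group_by labels.

    Alerts missing a grouping label fall into the bucket for "".
    """
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)
```

With this grouping, fifty instances failing the same check in one cluster would yield a single notification listing fifty alerts instead of fifty pages.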

Q: How do you handle metric gaps? A: Write staleness markers when a series stops reporting, interpolate for visualization, alert on detected gaps, and track scrape failures per target.
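
Gap detection reduces to comparing the spacing of consecutive samples with the expected scrape interval. A minimal sketch, where the `tolerance` factor is an assumed tuning knob:

```python
def find_gaps(timestamps, scrape_interval, tolerance=1.5):
    """Return (start, end) pairs where consecutive samples are further
    apart than scrape_interval * tolerance. Timestamps must be sorted."""
    gaps = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        if cur - prev > scrape_interval * tolerance:
            gaps.append((prev, cur))
    return gaps
```

For a 15-second interval, `find_gaps([0, 15, 30, 75, 90], 15)` reports one gap between 30 and 75.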

Q: How do you scale to 1M monitored services? A: Use hierarchical federation, run regional clusters, relay metrics between tiers, sample where full fidelity is not needed, and give high-volume services dedicated clusters.

Q: How do you ensure alert delivery? A: Send through multiple notification channels, retry failed sends, require delivery confirmation, escalate unacknowledged alerts, and park undeliverable notifications in a dead letter queue.
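
One way to combine retries, delivery confirmation, and channel fallback is sketched below. The `channels` structure (a list of name and send-function pairs, where `send` returns True on confirmed delivery) is hypothetical, as is the exponential backoff schedule.

```python
import time

def deliver(alert, channels, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Try each channel in order, retrying with exponential backoff.

    Returns the name of the channel that confirmed delivery, or None if
    every channel failed (the caller would then dead-letter the alert).
    """
    for name, send in channels:
        for attempt in range(max_attempts):
            try:
                if send(alert):
                    return name
            except Exception:
                pass  # treat any send error as a failed attempt
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)
    return None
```

Passing `sleep` as a parameter keeps the function testable without real delays.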

Q: How do you handle clock skew? A: Prefer server-side timestamps, keep hosts synchronized with NTP, validate incoming timestamps against the server clock, and tolerate out-of-order samples within a bounded window.
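
Timestamp validation and out-of-order handling could look roughly like the function below. The thresholds `max_skew` and `out_of_order_window` are assumed values for illustration, in seconds.

```python
def accept_timestamp(sample_ts, server_now, last_stored_ts=None,
                     max_skew=300.0, out_of_order_window=60.0):
    """Return the timestamp to store, or None to reject the sample."""
    if sample_ts is None:
        return server_now  # no client timestamp: stamp on the server
    if abs(sample_ts - server_now) > max_skew:
        return None        # client clock is too far from the server clock
    if (last_stored_ts is not None
            and sample_ts < last_stored_ts - out_of_order_window):
        return None        # too old to merge into the series
    return sample_ts
```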

Q: How do you optimize storage costs? A: Compress aggressively, downsample older data, set retention policies, move cold data to cheaper storage tiers, and drop metrics nobody queries.
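
Downsampling replaces raw samples with one summary per time bucket. The sketch below keeps min, max, average, and count so that later queries can still answer most aggregate questions; the output tuple layout is an assumption of this example.

```python
def downsample(samples, bucket_seconds):
    """Aggregate (timestamp, value) pairs into fixed-width buckets.

    Returns a sorted list of (bucket_start, min, max, avg, count).
    """
    buckets = {}
    for ts, value in samples:
        start = ts - ts % bucket_seconds
        buckets.setdefault(start, []).append(value)
    return [
        (start, min(vals), max(vals), sum(vals) / len(vals), len(vals))
        for start, vals in sorted(buckets.items())
    ]
```

Keeping the count alongside the average lets coarser rollups be computed later as weighted averages rather than averages of averages.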

Q: How do you implement multi-tenancy? A: Isolate tenants in separate namespaces, enforce per-tenant quotas, apply access control, and allocate cost by tenant usage.
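
Quota enforcement can be sketched as a per-tenant counter checked at ingestion. `TenantQuotas` and its fixed-window reset are hypothetical simplifications; a production system would more likely use sliding windows or token buckets.

```python
class TenantQuotas:
    """Track samples ingested per tenant within the current window."""

    def __init__(self, limits, default_limit=0):
        self.limits = dict(limits)        # tenant -> samples per window
        self.default_limit = default_limit
        self.used = {}

    def try_ingest(self, tenant, samples):
        limit = self.limits.get(tenant, self.default_limit)
        used = self.used.get(tenant, 0)
        if used + samples > limit:
            return False                  # over quota: reject the batch
        self.used[tenant] = used + samples
        return True

    def reset_window(self):
        self.used.clear()
```

The `used` counters double as input for cost allocation, since they record how much each tenant ingested.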

Edge Cases

Metric Storms

  • Sudden spike in metric volume
  • Cardinality explosion
  • Scrape timeouts
  • Handling: Rate limiting, backpressure, sampling, circuit breakers
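
Rate limiting at the ingestion edge is often done with a token bucket. The sketch below takes the current time as an argument so it stays deterministic; the class name and parameters are assumptions for illustration.

```python
class TokenBucket:
    """Allow up to `capacity` samples in a burst, refilled at `rate` per second."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = now

    def allow(self, now, cost=1):
        # Refill according to elapsed time, never beyond capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller applies backpressure or drops the sample
```

When `allow` returns False, the ingester can signal backpressure to the sender or fall back to sampling, depending on how much loss is acceptable.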

Network Partitions

  • Split brain scenarios
  • Inconsistent data
  • Alert duplication
  • Handling: Quorum-based decisions, conflict resolution, deduplication
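
Deduplication after a partition can be approximated by fingerprinting each alert's label set and suppressing repeats within a window, so two alerting replicas firing the same alert produce one notification. The window length and class name are assumptions of this sketch.

```python
import hashlib

def fingerprint(labels):
    """Stable hash of a label set, independent of key order."""
    canonical = "\x00".join(f"{k}={v}" for k, v in sorted(labels.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

class Deduplicator:
    """Suppress notifications for an alert already sent within `window` seconds."""

    def __init__(self, window):
        self.window = window
        self._last_sent = {}

    def should_notify(self, labels, now):
        fp = fingerprint(labels)
        last = self._last_sent.get(fp)
        if last is not None and now - last < self.window:
            return False
        self._last_sent[fp] = now
        return True
```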

Service Restarts

  • Counter resets
  • Missing data
  • State loss
  • Handling: Reset detection, rate calculations, state persistence
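
Reset detection for counters commonly treats any decrease as a restart from zero and counts the post-reset value in full. The function below is a simplified sketch of that idea; it does not extrapolate to window boundaries as some query engines do.

```python
def counter_rate(samples):
    """Per-second rate from (timestamp, counter_value) pairs sorted by time.

    A value lower than its predecessor is assumed to be a counter reset.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # After a reset, the whole current value is new increase.
        increase += cur if cur < prev else cur - prev
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0
```

For samples `[(0, 100), (10, 110), (20, 5), (30, 20)]` the increase is 10 + 5 + 15 = 30 over 30 seconds, so the rate is 1.0 per second despite the restart.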

Together, these variations, follow-up questions, and edge cases cover the main ways a metrics monitoring design is extended and stress-tested at scale.