Metrics Monitoring System - Variations and Follow-ups
Common Variations
1. Distributed Tracing Integration
- Correlate metrics with traces
- Trace ID in metric labels
- Span metrics generation
- Service dependency mapping
2. Log Aggregation Integration
- Metrics from logs
- Log-based alerts
- Unified observability platform
- Context switching between logs and metrics
3. Application Performance Monitoring (APM)
- Code-level metrics
- Transaction tracing
- Error tracking
- Performance profiling
4. Infrastructure Monitoring
- Host metrics (CPU, memory, disk)
- Network metrics
- Container metrics
- Cloud provider metrics
5. Business Metrics
- Revenue tracking
- User engagement
- Conversion rates
- Custom business KPIs
Follow-up Questions
Q: How do you handle high-cardinality metrics? A: Limit labels, use metric relabeling, monitor cardinality, drop problematic metrics, use recording rules for aggregations.
Q: How do you prevent alert fatigue? A: Alert grouping, intelligent routing, rate limiting, severity levels, runbook links, auto-resolution.
Q: How do you handle metric gaps? A: Staleness markers, interpolation for visualization, gap detection alerts, scrape failure tracking.
Q: How do you scale to 1M monitored services? A: Hierarchical federation, regional clusters, metric relaying, sampling, dedicated clusters for high-volume services.
Q: How do you ensure alert delivery? A: Multiple notification channels, retry logic, delivery confirmation, escalation policies, dead letter queue.
Q: How do you handle clock skew? A: Server-side timestamps, NTP synchronization, timestamp validation, out-of-order handling.
Q: How do you optimize storage costs? A: Aggressive compression, downsampling, retention policies, tiered storage, drop unused metrics.
Q: How do you implement multi-tenancy? A: Tenant isolation, separate namespaces, quota enforcement, access control, cost allocation.
Edge Cases
Metric Storms
- Sudden spike in metric volume
- Cardinality explosion
- Scrape timeouts
- Handling: Rate limiting, backpressure, sampling, circuit breakers
Network Partitions
- Split brain scenarios
- Inconsistent data
- Alert duplication
- Handling: Quorum-based decisions, conflict resolution, deduplication
Service Restarts
- Counter resets
- Missing data
- State loss
- Handling: Reset detection, rate calculations, state persistence
These variations and follow-ups demonstrate comprehensive understanding of monitoring systems at scale.