Metrics Monitoring System - Scale and Constraints
Traffic Scale Analysis
Metric Ingestion Volume
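The figures listed in this subsection follow directly from the target count, metrics per target, and scrape interval. A minimal sketch of that arithmetic, assuming a 30-day month; the function name `ingestion_volume` is illustrative only:

```python
SECONDS_PER_DAY = 86_400

def ingestion_volume(targets: int, metrics_per_target: int, scrape_interval_s: int) -> dict:
    """Derive steady-state sample rates from the scrape configuration."""
    samples_per_sec = targets * metrics_per_target / scrape_interval_s
    samples_per_day = samples_per_sec * SECONDS_PER_DAY
    return {
        "samples_per_sec": samples_per_sec,
        "samples_per_day": samples_per_day,
        "samples_per_month": samples_per_day * 30,  # assumption: 30-day month
    }

volume = ingestion_volume(targets=100_000, metrics_per_target=100, scrape_interval_s=15)
print(f"{volume['samples_per_sec']:,.0f} samples/sec")
print(f"{volume['samples_per_day'] / 1e9:.1f} billion samples/day")
print(f"{volume['samples_per_month'] / 1e12:.2f} trillion samples/month")
```

The peak figure of 2 million samples/sec is a stated planning input (3x normal), not something this calculation derives.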
Monitored Targets: 100,000 services
Metrics per Target: 100 metrics average
Scrape Interval: 15 seconds
Samples per Second: 100K × 100 / 15 ≈ 666,667 samples/sec
Peak Ingestion: 2 million samples/sec (3x normal)
Daily Samples: 57.6 billion samples
Monthly Samples: 1.7 trillion samples
Data Size Calculations
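The data size figures in this subsection chain together from the daily sample count. A rough sketch of that chain, assuming decimal units (1 GB = 10^9 bytes) and the document's own rounding of raw data plus metadata to 1 TB/day; variable names are illustrative:

```python
SAMPLE_BYTES = 16        # 8-byte timestamp + 8-byte value
COMPRESSION_RATIO = 10   # 10:1, as assumed in this document

samples_per_day = 57.6e9
raw_gb_per_day = samples_per_day * SAMPLE_BYTES / 1e9

# Assumption: raw data plus per-series metadata is rounded up to 1 TB/day,
# matching the "With Metadata" line rather than computing overhead exactly.
with_metadata_gb_per_day = 1_000

compressed_gb_per_day = with_metadata_gb_per_day / COMPRESSION_RATIO
monthly_tb = compressed_gb_per_day * 30 / 1_000
annual_tb = monthly_tb * 12

print(f"raw: {raw_gb_per_day:.1f} GB/day")
print(f"compressed: {compressed_gb_per_day:.0f} GB/day")
print(f"monthly: {monthly_tb:.0f} TB, annual: {annual_tb:.0f} TB")
```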
Sample Size: 16 bytes (8B timestamp + 8B value)
Metadata per Series: 200 bytes (labels, tags)
Daily Raw Data: 57.6B × 16B = 922 GB/day
With Metadata: 922 GB + overhead = 1 TB/day
Compressed (10:1): 100 GB/day
Monthly Storage: 3 TB compressed
Annual Storage: 36 TB compressed
Query Load
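The query rates in this subsection combine dashboard refreshes with direct API traffic. A small sketch of that sum; `query_load` is a hypothetical helper, and the peak figure is taken as a given input rather than derived:

```python
def query_load(users: int, queries_per_dashboard: int, refresh_s: int, api_qps: int):
    """Return (dashboard queries/sec, total queries/sec) at steady state."""
    dashboard_qps = users * queries_per_dashboard / refresh_s
    return dashboard_qps, dashboard_qps + api_qps

dashboard_qps, total_qps = query_load(
    users=10_000, queries_per_dashboard=10, refresh_s=30, api_qps=1_000
)
peak_factor = 15_000 / total_qps  # implied ratio of stated peak to steady state

print(f"dashboard: {dashboard_qps:,.0f} qps, total: {total_qps:,.0f} qps")
print(f"implied peak factor: {peak_factor:.2f}x")
```

The implied peak factor of roughly 3.5x is close to the 3x peak assumed for ingestion.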
Dashboard Users: 10,000 concurrent
Queries per User: 10 queries per dashboard
Refresh Rate: 30 seconds
Query Rate: 10K × 10 / 30 = 3,333 queries/sec
API Queries: 1,000 queries/sec
Total Query Load: 4,333 queries/sec
Peak Query Load: 15,000 queries/sec
Storage Requirements
Time-series Database Storage
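Each retention tier in this subsection is sized as series × samples per series × bytes per sample, then divided by the compression ratio. A sketch under those assumptions (16 bytes per sample, 10:1 compression, decimal terabytes); `tier_storage` is an illustrative name:

```python
SECONDS_PER_DAY = 86_400

def tier_storage(series: int, retention_days: int, resolution_s: int,
                 bytes_per_sample: int = 16, compression: int = 10):
    """Return (samples, raw_tb, compressed_tb) for one retention tier."""
    samples = series * retention_days * SECONDS_PER_DAY // resolution_s
    raw_tb = samples * bytes_per_sample / 1e12
    return samples, raw_tb, raw_tb / compression

high_res = tier_storage(100_000_000, retention_days=15, resolution_s=15)
downsampled = tier_storage(100_000_000, retention_days=365, resolution_s=60)

total_tb = high_res[2] + downsampled[2]
replicated_tb = total_tb * 3  # replication factor 3

print(f"high-res: {high_res[0] / 1e12:.2f}T samples, {high_res[2]:.1f} TB compressed")
print(f"downsampled: {downsampled[0] / 1e12:.2f}T samples, {downsampled[2]:.1f} TB compressed")
print(f"total: {total_tb:.1f} TB, replicated: {replicated_tb:.1f} TB")
```

Without intermediate rounding the totals come out slightly above the listed figures (about 97.9 TB and 294 TB), because the listed values round each tier before summing.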
Active Series: 100 million unique time series
Sample Rate: 15 seconds
Retention: 15 days high-res, 1 year downsampled
High-Resolution (15 days):
- Samples: 100M × (15 days × 86400s / 15s) = 8.64 trillion samples
- Storage: 8.64T × 16B = 138 TB raw
- Compressed: 13.8 TB
Downsampled (1 year):
- 1-minute resolution: 100M × (365 × 1440) = 52.6 trillion samples
- Storage: 52.6T × 16B = 841 TB raw
- Compressed: 84 TB
Total Storage: 13.8 TB + 84 TB = 97.8 TB
With Replication (3x): 293 TB
Metadata Storage
Time Series: 100 million series
Metadata per Series: 200 bytes
Total Metadata: 100M × 200B = 20 GB
Index Size: 50 GB
Total: 70 GB (negligible compared to time-series data)
Cache Storage
Query Result Cache: 1 TB
Recent Data Cache: 500 GB
Metadata Cache: 50 GB
Total Cache: 1.55 TB in memory
Compute Requirements
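The node counts in this section all follow one pattern: rate × CPU time per item gives the cores required, which is rounded up to whole nodes and then doubled for headroom. A minimal sketch of that pattern; `nodes_needed` is a hypothetical helper, and it assumes work parallelizes perfectly across cores:

```python
import math

def nodes_needed(rate_per_sec: float, cpu_ms_per_item: float,
                 cores_per_node: int, headroom: int = 2):
    """Return (cores, base_nodes, nodes_with_headroom) for a stateless tier."""
    cores = rate_per_sec * cpu_ms_per_item / 1000  # core-seconds needed per second
    base_nodes = math.ceil(cores / cores_per_node)
    return cores, base_nodes, base_nodes * headroom

ingestion = nodes_needed(2_000_000, 0.1, cores_per_node=32)
query = nodes_needed(15_000, 50, cores_per_node=48)
alerting = nodes_needed(100_000 / 30, 10, cores_per_node=32)  # 100K rules every 30 s

# Storage nodes are sized by capacity rather than CPU.
storage_nodes = math.ceil(293 / 10)  # 293 TB replicated, 10 TB per node

for name, (cores, base, padded) in [("ingestion", ingestion), ("query", query), ("alerting", alerting)]:
    print(f"{name}: {cores:.0f} cores, {base} nodes, {padded} with headroom")
print(f"storage: {storage_nodes} nodes")
```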
Ingestion Nodes
Ingestion Rate: 2M samples/sec peak
Processing per Sample: 0.1ms
CPU Required: 2M × 0.1ms = 200 CPU cores
Nodes (32 cores each): 7 nodes
With Headroom (2x): 14 nodes
Memory per Node: 64 GB
Total Memory: 896 GB
Query Nodes
Query Rate: 15K queries/sec peak
Processing per Query: 50ms average
CPU Required: 15K × 50ms = 750 CPU cores
Nodes (48 cores each): 16 nodes
With Headroom (2x): 32 nodes
Memory per Node: 128 GB (for caching)
Total Memory: 4 TB
Alert Evaluation Nodes
Alert Rules: 100,000 rules
Evaluation Interval: 30 seconds
Evaluations per Second: 3,333 evaluations/sec
Processing per Evaluation: 10ms
CPU Required: 33 CPU cores
Nodes (32 cores each): 2 nodes
With Headroom (2x): 4 nodes
Storage Nodes
Total Storage: 293 TB (with replication)
Storage per Node: 10 TB
Nodes Required: 30 nodes
CPU per Node: 16 cores
Memory per Node: 128 GB
Network Bandwidth
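The bandwidth figures in this section convert a message rate and payload size into bits per second. A sketch of the conversion, assuming decimal units (1 KB = 1,000 bytes, 1 Gbps = 10^9 bits/sec) as the listed figures do; names are illustrative:

```python
def gbps(bytes_per_sec: float) -> float:
    """Convert bytes/sec to gigabits/sec using decimal units."""
    return bytes_per_sec * 8 / 1e9

# Inbound
ingest_gbps = gbps(2_000_000 * 200)   # peak samples/sec x 200-byte wire size
api_gbps = gbps(1_000 * 1_000)        # 1K requests/sec x 1 KB

# Outbound
query_gbps = gbps(15_000 * 100_000)       # 15K queries/sec x 100 KB
dashboard_gbps = gbps(10_000 * 50_000 / 30)
alert_gbps = gbps(100 * 5_000)

# Internal: replication is derived; the other two are stated estimates.
replication_gbps = gbps(400e6 * 3)
internal_gbps = replication_gbps + 5 + 2  # + query distribution + compaction

print(f"inbound: {ingest_gbps + api_gbps:.1f} Gbps")
print(f"outbound: {query_gbps + dashboard_gbps + alert_gbps:.1f} Gbps")
print(f"internal: {internal_gbps:.1f} Gbps")
```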
Inbound Traffic
Metric Ingestion: 2M samples/sec × 200 bytes = 400 MB/sec = 3.2 Gbps
API Requests: 1K requests/sec × 1 KB = 1 MB/sec = 8 Mbps
Total Inbound: 3.2 Gbps peak
Outbound Traffic
Query Responses: 15K queries/sec × 100 KB = 1.5 GB/sec = 12 Gbps
Dashboard Updates: 10K dashboards × 50 KB / 30s = 16.7 MB/sec = 133 Mbps
Alert Notifications: 100 alerts/sec × 5 KB = 500 KB/sec = 4 Mbps
Total Outbound: 12.1 Gbps peak
Internal Traffic
Replication: 400 MB/sec × 3 = 1.2 GB/sec = 9.6 Gbps
Query Distribution: 5 Gbps
Compaction: 2 Gbps
Total Internal: 16.6 Gbps
Database Scaling
Time-series Database (Prometheus/VictoriaMetrics)
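To make "sharding by metric hash" with a replication factor of 3 concrete, here is one possible placement scheme. It is a simplified modulo-based sketch, not the mechanism Prometheus or VictoriaMetrics actually uses; `replica_nodes` and the series key format are assumptions for illustration:

```python
import hashlib

NUM_NODES = 30
REPLICATION_FACTOR = 3

def replica_nodes(series_key: str, num_nodes: int = NUM_NODES,
                  rf: int = REPLICATION_FACTOR) -> list:
    """Map a series to `rf` distinct node indexes: a hashed primary plus its successors."""
    digest = hashlib.sha256(series_key.encode("utf-8")).digest()
    primary = int.from_bytes(digest[:8], "big") % num_nodes
    return [(primary + i) % num_nodes for i in range(rf)]

# Hypothetical series key: metric name plus sorted labels.
key = 'http_requests_total{instance="api-1",method="GET"}'
print(replica_nodes(key))
```

Plain modulo placement reshuffles most series whenever the node count changes; a production design would more likely use consistent hashing or a similar scheme to limit data movement.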
Cluster Size: 30 nodes
Sharding: By metric hash
Replication Factor: 3
Write Throughput: 2M samples/sec
Query Throughput: 15K queries/sec
Storage per Node: 10 TB
Memory per Node: 128 GB
Metadata Store (PostgreSQL)
Cluster Size: 5 nodes (primary + 4 replicas)
Data Size: 70 GB
Connections: 10,000 concurrent
Query Latency: <10ms p95
Replication: Streaming replication
Cache Layer (Redis)
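The cache layer described in this subsection pairs LRU eviction with a hit rate target. The toy in-process cache below illustrates how those two interact; it is an exact LRU written for illustration, whereas Redis implements an approximated LRU, and the class and method names are hypothetical:

```python
from collections import OrderedDict

class LRUCache:
    """Fixed-capacity cache that evicts the least recently used entry."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._items = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._items:
            self._items.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._items[key]
        self.misses += 1
        return None

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used entry

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

cache = LRUCache(capacity=2)
cache.put("query:a", "result-a")
cache.put("query:b", "result-b")
cache.get("query:a")              # hit; "query:b" is now the oldest entry
cache.put("query:c", "result-c")  # evicts "query:b"
print(cache.get("query:b"), f"hit rate: {cache.hit_rate():.2f}")
```

Tracking the hit rate this way is what makes a target such as 80% measurable against real traffic.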
Cluster Size: 20 nodes
Memory per Node: 128 GB
Total Cache: 2.56 TB
Hit Rate Target: 80%
Eviction Policy: LRU
Persistence: RDB snapshots
Message Queue Scaling
Metric Ingestion Queue (Kafka)
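The cluster parameters in this subsection imply a per-partition and per-broker load that is worth checking. A back-of-envelope sketch, assuming partitions and partition leaders are spread evenly across brokers; `kafka_load` is an illustrative helper, not a Kafka API:

```python
def kafka_load(peak_msgs_per_sec: int, brokers: int, partitions: int,
               replication_factor: int) -> dict:
    """Estimate per-partition and per-broker load under even distribution."""
    return {
        "msgs_per_partition": peak_msgs_per_sec / partitions,
        "leader_msgs_per_broker": peak_msgs_per_sec / brokers,
        # Each message is written once per replica somewhere in the cluster.
        "written_msgs_per_broker": peak_msgs_per_sec * replication_factor / brokers,
        "partition_replicas_per_broker": partitions * replication_factor / brokers,
    }

load = kafka_load(peak_msgs_per_sec=2_000_000, brokers=10,
                  partitions=1_000, replication_factor=3)
for name, value in load.items():
    print(f"{name}: {value:,.0f}")
```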
Cluster Size: 10 brokers
Partitions: 1,000 partitions
Replication Factor: 3
Throughput: 2M messages/sec
Retention: 24 hours
Storage: 10 TB
Alert Notification Queue
Queue: RabbitMQ / SQS
Throughput: 1,000 messages/sec
Retention: 7 days
Dead Letter Queue: Yes
Geographic Distribution
Data Center Regions
Primary Regions: 3 (US-East, US-West, EU-West)
Secondary Regions: 2 (Asia-Pacific, South America)
Edge Locations: 50+ (for metric collection)
Cross-region Latency: <100ms
Replication: Async between regions
Regional Distribution
Per Region:
- Ingestion Nodes: 5 nodes
- Query Nodes: 10 nodes
- Storage Nodes: 10 nodes
- Cache Nodes: 7 nodes
- Total: 32 nodes per region
Cost Analysis
Infrastructure Costs (Monthly)
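The monthly infrastructure total in this subsection is a sum of line items, and the unit costs are that total divided by a chosen denominator. A sketch of the roll-up; the per-series figure assumes the 100 million active series from the storage estimate as its denominator:

```python
line_items = {
    "Compute (Ingestion)": 14 * 200,
    "Compute (Query)": 32 * 300,
    "Compute (Storage)": 30 * 400,
    "Compute (Alert)": 4 * 200,
    "Storage (SSD)": 300_000 * 0.10,  # 300 TB expressed in GB at $0.10/GB
    "Network": 5_000,
    "Cache (Redis)": 20 * 500,
    "Message Queue": 2_000,
}

total_monthly = sum(line_items.values())
cost_per_service = total_monthly / 100_000      # 100,000 monitored services
cost_per_series = total_monthly / 100_000_000   # assumption: 100M active series

print(f"total: ${total_monthly:,.0f}/month")
print(f"per service: ${cost_per_service:.2f}, per series: ${cost_per_series:.4f}")
```

Storage is the single largest line item, so the compression ratio and retention policy have the most leverage on the total.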
Compute (Ingestion): 14 × $200 = $2,800
Compute (Query): 32 × $300 = $9,600
Compute (Storage): 30 × $400 = $12,000
Compute (Alert): 4 × $200 = $800
Storage (SSD): 300 TB × $0.10/GB = $30,000
Network: $5,000
Cache (Redis): 20 × $500 = $10,000
Message Queue: $2,000
Total Monthly: $72,200
Cost per Time Series: $0.0007 per active series per month ($72,200 / 100M series)
Cost per Service: $0.72 per service per month ($72,200 / 100,000 services)
Operational Costs (Monthly)
Engineering: $200,000
DevOps/SRE: $100,000
Support: $50,000
Total Monthly: $350,000
Performance Benchmarks
Ingestion Performance
Write Latency: p50: 5ms, p95: 20ms, p99: 50ms
Write Throughput: 2M samples/sec sustained
Batch Size: 1,000 samples per batch
Compression Ratio: 10:1
Query Performance
Simple Query (1 metric, 1 hour): p50: 50ms, p95: 200ms, p99: 500ms
Complex Query (10 metrics, 24 hours): p50: 500ms, p95: 2s, p99: 5s
Aggregation Query: p50: 200ms, p95: 1s, p99: 3s
Dashboard Load: p50: 500ms, p95: 2s, p99: 5s
Alert Evaluation
Rule Evaluation: p50: 100ms, p95: 500ms, p99: 1s
Alert Firing Latency: p50: 30s, p95: 60s, p99: 90s
Notification Delivery: p50: 1s, p95: 5s, p99: 10s
Scaling Bottlenecks
Write Path Bottlenecks
- Network I/O: Ingestion node network saturation
- Disk I/O: Storage node write throughput
- Memory: In-memory buffer overflow
- CPU: Compression and encoding overhead
Read Path Bottlenecks
- Query Complexity: Complex aggregations across many series
- Time Range: Queries spanning long time periods
- Cardinality: High-cardinality label queries
- Cache Misses: Cold data access
Storage Bottlenecks
- Compaction: Background compaction impacting queries
- High Cardinality: Memory usage for index
- Retention: Cleanup of old data
- Replication: Cross-region replication lag
Capacity Planning
Growth Projections (Annual)
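Year-over-year rates such as those in this subsection compound, so multi-year projections grow faster than the annual percentages suggest. A small sketch, assuming each rate holds constant; `project` is an illustrative helper:

```python
def project(value: float, yoy_growth: float, years: int) -> float:
    """Compound a starting value at a constant year-over-year growth rate."""
    return value * (1 + yoy_growth) ** years

YEARS = 3
services = project(100_000, 0.50, YEARS)
metrics_per_service = project(100, 0.30, YEARS)
query_qps = project(4_333, 0.60, YEARS)

# Services and metrics per service multiply, so scraped metrics grow by
# 1.5 x 1.3 = 1.95x per year.
scraped_metrics = services * metrics_per_service

print(f"services after {YEARS} years: {services:,.0f}")
print(f"metrics per service: {metrics_per_service:.1f}")
print(f"scraped metrics: {scraped_metrics:,.0f}")
print(f"steady-state query load: {query_qps:,.0f} qps")
```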
Monitored Services: +50% YoY
Metrics per Service: +30% YoY
Query Volume: +60% YoY
Storage: +80% YoY (including historical data)
Alert Rules: +40% YoY
Scaling Triggers
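The triggers in this subsection can be expressed as simple threshold checks over current cluster statistics. A minimal sketch, with utilization values expressed as fractions; the `ClusterStats` fields and the `scaling_actions` function are hypothetical names, not part of any monitoring tool:

```python
from dataclasses import dataclass

@dataclass
class ClusterStats:
    cpu_utilization: float       # 0.0 to 1.0
    memory_utilization: float    # 0.0 to 1.0
    disk_usage: float            # 0.0 to 1.0
    query_latency_ms: float
    query_latency_sla_ms: float
    ingestion_lag_s: float
    cache_hit_rate: float        # 0.0 to 1.0

def scaling_actions(stats: ClusterStats) -> list:
    """Return the scaling actions whose trigger thresholds are exceeded."""
    actions = []
    if stats.cpu_utilization > 0.70:
        actions.append("Add compute nodes")
    if stats.memory_utilization > 0.80:
        actions.append("Add memory or nodes")
    if stats.disk_usage > 0.75:
        actions.append("Add storage capacity")
    if stats.query_latency_ms > stats.query_latency_sla_ms:
        actions.append("Add query nodes")
    if stats.ingestion_lag_s > 10:
        actions.append("Add ingestion nodes")
    if stats.cache_hit_rate < 0.75:
        actions.append("Increase cache size")
    return actions

example = ClusterStats(cpu_utilization=0.85, memory_utilization=0.60, disk_usage=0.50,
                       query_latency_ms=120, query_latency_sla_ms=200,
                       ingestion_lag_s=12, cache_hit_rate=0.90)
print(scaling_actions(example))
```

In practice such checks would usually be evaluated over a sustained window rather than a single reading, to avoid scaling on momentary spikes.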
CPU Utilization > 70%: Add compute nodes
Memory Utilization > 80%: Add memory or nodes
Disk Usage > 75%: Add storage capacity
Query Latency > SLA: Add query nodes
Ingestion Lag > 10s: Add ingestion nodes
Cache Hit Rate < 75%: Increase cache size
Capacity Reserves
Compute: 40% reserve capacity
Storage: 50% reserve capacity
Network: 60% reserve capacity
Cache: 30% reserve capacity
This scale analysis provides the foundation for architecting a monitoring system that can handle massive metric volumes while maintaining performance.