Metrics Monitoring System - Scale and Constraints

Traffic Scale Analysis

Metric Ingestion Volume

Monitored Targets: 100,000 services
Metrics per Target: 100 metrics average
Scrape Interval: 15 seconds
Samples per Second: 100K × 100 / 15 = 666,666 samples/sec
Peak Ingestion: 2 million samples/sec (3x normal)
Daily Samples: 57.6 billion samples
Monthly Samples: 1.7 trillion samples

Data Size Calculations

Sample Size: 16 bytes (8B timestamp + 8B value)
Metadata per Series: 200 bytes (labels, tags)
Daily Raw Data: 57.6B × 16B = 922 GB/day
With Metadata: 922 GB + overhead = 1 TB/day
Compressed (10:1): 100 GB/day
Monthly Storage: 3 TB compressed
Annual Storage: 36 TB compressed

Query Load

Dashboard Users: 10,000 concurrent
Queries per User: 10 queries per dashboard
Refresh Rate: 30 seconds
Query Rate: 10K × 10 / 30 = 3,333 queries/sec
API Queries: 1,000 queries/sec
Total Query Load: 4,333 queries/sec
Peak Query Load: 15,000 queries/sec

Storage Requirements

Time-series Database Storage

Active Series: 100 million unique time series
Sample Rate: 15 seconds
Retention: 15 days high-res, 1 year downsampled

High-Resolution (15 days):
- Samples: 100M × (15 days × 86400s / 15s) = 8.64 trillion samples
- Storage: 8.64T × 16B = 138 TB raw
- Compressed: 13.8 TB

Downsampled (1 year):
- 1-minute resolution: 100M × (365 × 1440) = 52.6 trillion samples
- Storage: 52.6T × 16B = 841 TB raw
- Compressed: 84 TB

Total Storage: 13.8 TB + 84 TB = 97.8 TB
With Replication (3x): 293 TB

Metadata Storage

Time Series: 100 million series
Metadata per Series: 200 bytes
Total Metadata: 100M × 200B = 20 GB
Index Size: 50 GB
Total: 70 GB (negligible compared to time-series data)

Cache Storage

Query Result Cache: 1 TB
Recent Data Cache: 500 GB
Metadata Cache: 50 GB
Total Cache: 1.55 TB in memory

Compute Requirements

Ingestion Nodes

Ingestion Rate: 2M samples/sec peak
Processing per Sample: 0.1ms
CPU Required: 2M × 0.1ms = 200 CPU cores
Nodes (32 cores each): 7 nodes
With Headroom (2x): 14 nodes
Memory per Node: 64 GB
Total Memory: 896 GB

Query Nodes

Query Rate: 15K queries/sec peak
Processing per Query: 50ms average
CPU Required: 15K × 50ms = 750 CPU cores
Nodes (48 cores each): 16 nodes
With Headroom (2x): 32 nodes
Memory per Node: 128 GB (for caching)
Total Memory: 4 TB

Alert Evaluation Nodes

Alert Rules: 100,000 rules
Evaluation Interval: 30 seconds
Evaluations per Second: 3,333 evaluations/sec
Processing per Evaluation: 10ms
CPU Required: 33 CPU cores
Nodes (32 cores each): 2 nodes
With Headroom (2x): 4 nodes

Storage Nodes

Total Storage: 293 TB (with replication)
Storage per Node: 10 TB
Nodes Required: 30 nodes
CPU per Node: 16 cores
Memory per Node: 128 GB

Network Bandwidth

Inbound Traffic

Metric Ingestion: 2M samples/sec × 200 bytes = 400 MB/sec = 3.2 Gbps
API Requests: 1K requests/sec × 1 KB = 1 MB/sec = 8 Mbps
Total Inbound: 3.2 Gbps peak

Outbound Traffic

Query Responses: 15K queries/sec × 100 KB = 1.5 GB/sec = 12 Gbps
Dashboard Updates: 10K dashboards × 50 KB / 30s = 16.7 MB/sec = 133 Mbps
Alert Notifications: 100 alerts/sec × 5 KB = 500 KB/sec = 4 Mbps
Total Outbound: 12.1 Gbps peak

Internal Traffic

Replication: 400 MB/sec × 3 = 1.2 GB/sec = 9.6 Gbps
Query Distribution: 5 Gbps
Compaction: 2 Gbps
Total Internal: 16.6 Gbps

Database Scaling

Time-series Database (Prometheus/VictoriaMetrics)

Cluster Size: 30 nodes
Sharding: By metric hash
Replication Factor: 3
Write Throughput: 2M samples/sec
Query Throughput: 15K queries/sec
Storage per Node: 10 TB
Memory per Node: 128 GB

Metadata Store (PostgreSQL)

Cluster Size: 5 nodes (primary + 4 replicas)
Data Size: 70 GB
Connections: 10,000 concurrent
Query Latency: <10ms p95
Replication: Streaming replication

Cache Layer (Redis)

Cluster Size: 20 nodes
Memory per Node: 128 GB
Total Cache: 2.56 TB
Hit Rate Target: 80%
Eviction Policy: LRU
Persistence: RDB snapshots

Message Queue Scaling

Metric Ingestion Queue (Kafka)

Cluster Size: 10 brokers
Partitions: 1,000 partitions
Replication Factor: 3
Throughput: 2M messages/sec
Retention: 24 hours
Storage: 10 TB

Alert Notification Queue

Queue: RabbitMQ / SQS
Throughput: 1,000 messages/sec
Retention: 7 days
Dead Letter Queue: Yes

Geographic Distribution

Data Center Regions

Primary Regions: 3 (US-East, US-West, EU-West)
Secondary Regions: 2 (Asia-Pacific, South America)
Edge Locations: 50+ (for metric collection)
Cross-region Latency: <100ms
Replication: Async between regions

Regional Distribution

Per Region:
- Ingestion Nodes: 5 nodes
- Query Nodes: 10 nodes
- Storage Nodes: 10 nodes
- Cache Nodes: 7 nodes
- Total: 32 nodes per region

Cost Analysis

Infrastructure Costs (Monthly)

Compute (Ingestion): 14 × $200 = $2,800
Compute (Query): 32 × $300 = $9,600
Compute (Storage): 30 × $400 = $12,000
Compute (Alert): 4 × $200 = $800
Storage (SSD): 300 TB × $0.10/GB = $30,000
Network: $5,000
Cache (Redis): 20 × $500 = $10,000
Message Queue: $2,000
Total Monthly: $72,200
Cost per Metric: $0.0007 per metric per month
Cost per Service: $0.72 per service per month

Operational Costs (Monthly)

Engineering: $200,000
DevOps/SRE: $100,000
Support: $50,000
Total Monthly: $350,000

Performance Benchmarks

Ingestion Performance

Write Latency: p50: 5ms, p95: 20ms, p99: 50ms
Write Throughput: 2M samples/sec sustained
Batch Size: 1,000 samples per batch
Compression Ratio: 10:1

Query Performance

Simple Query (1 metric, 1 hour): p50: 50ms, p95: 200ms, p99: 500ms
Complex Query (10 metrics, 24 hours): p50: 500ms, p95: 2s, p99: 5s
Aggregation Query: p50: 200ms, p95: 1s, p99: 3s
Dashboard Load: p50: 500ms, p95: 2s, p99: 5s

Alert Evaluation

Rule Evaluation: p50: 100ms, p95: 500ms, p99: 1s
Alert Firing Latency: p50: 30s, p95: 60s, p99: 90s
Notification Delivery: p50: 1s, p95: 5s, p99: 10s

Scaling Bottlenecks

Write Path Bottlenecks

Network I/O: Ingestion node network saturation
Disk I/O: Storage node write throughput
Memory: In-memory buffer overflow
CPU: Compression and encoding overhead

Read Path Bottlenecks

Query Complexity: Complex aggregations across many series
Time Range: Queries spanning long time periods
Cardinality: High-cardinality label queries
Cache Misses: Cold data access

Storage Bottlenecks

Compaction: Background compaction impacting queries
High Cardinality: Memory usage for index
Retention: Cleanup of old data
Replication: Cross-region replication lag

Capacity Planning

Growth Projections (Annual)

Monitored Services: +50% YoY
Metrics per Service: +30% YoY
Query Volume: +60% YoY
Storage: +80% YoY (including historical data)
Alert Rules: +40% YoY

Scaling Triggers

CPU Utilization > 70%: Add compute nodes
Memory Utilization > 80%: Add memory or nodes
Disk Usage > 75%: Add storage capacity
Query Latency > SLA: Add query nodes
Ingestion Lag > 10s: Add ingestion nodes
Cache Hit Rate < 75%: Increase cache size

Capacity Reserves

Compute: 40% reserve capacity
Storage: 50% reserve capacity
Network: 60% reserve capacity
Cache: 30% reserve capacity

This scale analysis provides the foundation for architecting a monitoring system that can handle massive metric volumes while maintaining performance.