Architecture

Part 3 of 10

Metrics Monitoring System - Architecture

High-Level Architecture

┌─────────────────────────────────────────────────────────────┐
│                    Monitored Services                        │
├─────────────────────────────────────────────────────────────┤
│  Applications   Databases     Servers     Containers        │
│  (Metrics       (Metrics      (Node       (cAdvisor)        │
│   Endpoint)      Exporter)     Exporter)                    │
└────────┬────────────┬─────────────┬────────────┬────────────┘
         │            │             │            │
         │ Pull       │ Pull        │ Pull       │ Push
         │            │             │            │
    ┌────▼────────────▼─────────────▼────────────▼────┐
    │           Service Discovery                      │
    │  (Consul, K8s API, DNS, Static Config)          │
    └────────────────────┬────────────────────────────┘
                         │
         ┌───────────────┴───────────────┐
         │                               │
    ┌────▼────┐                    ┌─────▼─────┐
    │ Scraper │                    │   Push    │
    │  Pool   │                    │  Gateway  │
    └────┬────┘                    └─────┬─────┘
         │                               │
         └───────────────┬───────────────┘
                         │
                    ┌────▼────┐
                    │  Kafka  │ (Metric Stream)
                    └────┬────┘
                         │
         ┌───────────────┼───────────────┐
         │               │               │
    ┌────▼────┐    ┌─────▼─────┐   ┌───▼────┐
    │  Write  │    │   Alert   │   │  Rule  │
    │  Path   │    │ Evaluator │   │ Manager│
    └────┬────┘    └─────┬─────┘   └────────┘
         │               │
         │               ▼
         │         ┌──────────┐
         │         │  Alert   │
         │         │ Manager  │
         │         └─────┬────┘
         │               │
         │               ▼
         │        ┌──────────────┐
         │        │ Notification │
         │        │   Service    │
         │        └──────────────┘
         │
    ┌────▼────────────────────────────┐
    │    Time-Series Database         │
    │  (Prometheus / VictoriaMetrics) │
    └────┬────────────────────────────┘
         │
    ┌────▼─────────────────────────────┐
    │           Query Layer            │
    ├──────────────────────────────────┤
    │   Query Engine    Cache    API   │
    └────┬─────────────────────────────┘
         │
    ┌────▼─────────────────────────────┐
    │        Application Layer         │
    ├──────────────────────────────────┤
    │    Dashboard    API    Grafana   │
    └──────────────────────────────────┘

Component Details

1. Service Discovery

  • Automatically discovers monitoring targets
  • Supports Kubernetes, Consul, DNS, static configs
  • Updates target list dynamically
  • Checks the health of discovered targets
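
The dynamic target list can be sketched as a registry that re-polls a set of discovery providers and reports what changed. This is a minimal illustration, not the system's actual implementation: `Target`, `Provider`, and `TargetRegistry` are hypothetical names, and each provider is assumed to be a plain callable that a real deployment would back with Consul, the Kubernetes API, DNS, or a static file.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Target:
    """One scrape target (hypothetical shape)."""
    job: str
    address: str  # "host:port"


# A provider is any callable returning the targets it currently knows about.
Provider = Callable[[], Iterable[Target]]


class TargetRegistry:
    """Merges targets from several discovery providers into one set."""

    def __init__(self, providers: list[Provider]):
        self.providers = providers
        self.targets: set[Target] = set()

    def refresh(self) -> tuple[set[Target], set[Target]]:
        """Re-poll every provider and return (added, removed) since the last refresh."""
        current: set[Target] = set()
        for provider in self.providers:
            current.update(provider())
        added = current - self.targets
        removed = self.targets - current
        self.targets = current
        return added, removed
```

Calling `refresh()` on a timer lets the scraper pool start scraping the `added` targets and stop scraping the `removed` ones without a restart.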

2. Metric Collection

  • Scraper Pool: Pulls metrics from targets' HTTP endpoints
  • Push Gateway: Accepts metrics pushed by targets
  • Batching: Groups collected metrics into batches
  • Compression: Compresses metric data
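
To make batching and compression concrete, here is a small sketch using only the Python standard library. The sample shape (a dict with `name`, `value`, `ts`), the JSON wire format, and the choice of gzip are all assumptions made for illustration; the source does not specify them.

```python
import gzip
import json


def make_batches(samples, batch_size):
    """Yield consecutive slices of at most batch_size samples."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


def encode_batch(batch):
    """Serialize a batch as JSON and gzip it (assumed wire format)."""
    return gzip.compress(json.dumps(batch).encode("utf-8"))


def decode_batch(payload):
    """Inverse of encode_batch."""
    return json.loads(gzip.decompress(payload).decode("utf-8"))
```

Because metric names and label sets repeat heavily within a batch, larger batches generally compress better than sending samples one at a time.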

3. Write Path

  • Kafka for buffering and reliability
  • Write to time-series database
  • Replication across nodes
  • Compression and encoding
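
The source does not say which encoding the write path uses. As one illustrative possibility, delta encoding stores each timestamp as the difference from the previous one, which turns regularly spaced scrape times into small, repetitive numbers that compress well. The function names below are hypothetical.

```python
def delta_encode(timestamps):
    """Keep the first timestamp, then store each difference from the previous one."""
    if not timestamps:
        return []
    encoded = [timestamps[0]]
    for prev, cur in zip(timestamps, timestamps[1:]):
        encoded.append(cur - prev)
    return encoded


def delta_decode(deltas):
    """Rebuild the original timestamps by running a cumulative sum."""
    decoded = []
    total = 0
    for delta in deltas:
        total += delta
        decoded.append(total)
    return decoded
```

For a 15-second scrape interval, `delta_encode([1000, 1015, 1030, 1046])` yields `[1000, 15, 15, 16]`.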

4. Time-Series Database

  • Stores metric data efficiently
  • Columnar storage format
  • Automatic downsampling
  • Retention policies
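
Downsampling and retention can be illustrated on a plain list of `(timestamp, value)` pairs. This sketch assumes integer timestamps in seconds and averages the values in each bucket; a real time-series database would do this on its own storage format and may keep min, max, or count as well.

```python
from collections import defaultdict


def downsample(points, bucket_seconds):
    """Average (timestamp, value) points into fixed-width time buckets."""
    buckets = defaultdict(list)
    for ts, value in points:
        # Align each point to the start of its bucket.
        buckets[ts - ts % bucket_seconds].append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]


def apply_retention(points, now, max_age_seconds):
    """Drop points older than max_age_seconds relative to now."""
    cutoff = now - max_age_seconds
    return [(ts, value) for ts, value in points if ts >= cutoff]
```

A typical policy combines the two: keep raw points for a short window and only the downsampled series for older data.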

5. Alert System

  • Evaluates alert rules periodically
  • Manages alert state (pending, firing, resolved)
  • Routes alerts to notification channels
  • Deduplication and grouping
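
Deduplication and grouping can be sketched as two small functions over alert dicts. The alert shape (`name` plus a `labels` dict) and the idea of identifying an alert by its name and sorted labels are assumptions for this example, loosely modeled on how label-based alerting systems tend to work.

```python
from collections import defaultdict


def fingerprint(alert):
    """Identity of an alert: its name plus its sorted label pairs."""
    return (alert["name"], tuple(sorted(alert["labels"].items())))


def deduplicate(alerts):
    """Keep the first alert seen for each fingerprint."""
    seen = {}
    for alert in alerts:
        seen.setdefault(fingerprint(alert), alert)
    return list(seen.values())


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the group_by labels."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)
```

Grouping by a label such as `team` lets the notification service send one message per team instead of one per alert.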

6. Query Layer

  • PromQL-compatible query language
  • Query optimization and caching
  • Aggregation and downsampling
  • Federation support

7. Visualization

  • Dashboard builder
  • Real-time metric graphs
  • Alert visualization
  • Custom panels

Data Flow

Metric Ingestion Flow

1. Service Discovery finds targets
2. Scraper pulls metrics every 15s
3. Metrics sent to Kafka
4. Write path consumes from Kafka
5. Data written to time-series DB
6. Indexes updated
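
The ingestion steps above can be traced end to end in a few lines. In this sketch a `queue.Queue` stands in for Kafka and a dict keyed by series stands in for the time-series database and its index; both are simplifications, and `fetch` is a hypothetical callable that returns the text a target exposes. The parser handles only simple `name value` lines, not the full exposition format (any `{label}` text is kept as part of the name, and trailing timestamps are not supported).

```python
import queue


def parse_exposition(text, scraped_at):
    """Parse simple `name value` lines, skipping blank lines and # comments."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, value = line.rsplit(" ", 1)
        samples.append({"name": name, "value": float(value), "ts": scraped_at})
    return samples


def scrape(fetch, target, stream, scraped_at):
    """Steps 2-3: pull from one target and publish each sample to the stream."""
    for sample in parse_exposition(fetch(target), scraped_at):
        sample["target"] = target
        stream.put(sample)


def drain_to_store(stream, store):
    """Steps 4-6: consume the stream and append each sample under its series key."""
    while True:
        try:
            sample = stream.get_nowait()
        except queue.Empty:
            break
        key = (sample["target"], sample["name"])
        store.setdefault(key, []).append((sample["ts"], sample["value"]))
```

Putting a queue between `scrape` and `drain_to_store` is what lets the scraper keep collecting when the write path is briefly slow.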

Query Flow

1. User requests dashboard
2. Query service checks cache
3. If miss, query time-series DB
4. Apply aggregations
5. Cache result
6. Return to user
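
The query flow is a cache-aside pattern, sketched below with a time-to-live on each entry. `QueryCache` is a hypothetical name, `backend` is assumed to be any callable that runs a query against the time-series database, and the TTL is an assumption, since the source does not say how cached results expire.

```python
import time


class QueryCache:
    """Cache-aside wrapper: check the cache, fall back to the backend on a miss."""

    def __init__(self, backend, ttl_seconds, clock=time.monotonic):
        self.backend = backend  # callable: query string -> result
        self.ttl = ttl_seconds
        self.clock = clock      # injectable so expiry can be tested
        self.entries = {}       # query -> (expires_at, result)

    def get(self, query):
        now = self.clock()
        hit = self.entries.get(query)
        if hit is not None and hit[0] > now:
            return hit[1]  # fresh cache hit
        # Miss or expired entry: query the backend and cache the result.
        result = self.backend(query)
        self.entries[query] = (now + self.ttl, result)
        return result
```

A short TTL suits dashboards that refresh every few seconds: repeated panel loads hit the cache, while the data shown is never older than the TTL.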

Alert Flow

1. Alert evaluator runs rules every 30s
2. Query time-series DB for metrics
3. Evaluate conditions
4. Update alert state
5. If firing, send to notification service
6. Deliver via configured channels
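
Steps 3 to 5 of the alert flow amount to a small state machine over the pending, firing, and resolved states. The sketch below adds two assumptions that are not in the source: an `inactive` state for alerts that have never triggered, and a `for_seconds` hold time that a condition must stay true before the alert moves from pending to firing.

```python
def next_state(state, pending_since, condition_true, now, for_seconds):
    """Return (new_state, pending_since) after one evaluation cycle."""
    if not condition_true:
        # A firing alert resolves; anything else drops back to inactive.
        return ("resolved" if state == "firing" else "inactive"), None
    if state == "firing":
        return "firing", pending_since
    if state == "pending":
        if now - pending_since >= for_seconds:
            return "firing", pending_since
        return "pending", pending_since
    # inactive or resolved: the condition has just become true.
    return "pending", now
```

The evaluator would call this once per rule on each 30-second cycle and hand the alert to the notification service only on the transition into `firing`, which keeps brief spikes from paging anyone.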

This architecture separates collection, storage, alerting, and querying into distinct components, with Kafka buffering between collection and its consumers, so each stage can be scaled on its own.