Scale & Constraints

📖 7 min read 📄 Part 2 of 10

Web Analytics Tool - Scale and Constraints

Traffic Scale Analysis

Global Event Volume

Daily Events: 100 billion events
Peak Events/Second: 5 million events/sec
Average Events/Second: 1.16 million events/sec
Events per Website (avg): 10,000 events/day
Events per User Session: 15-20 events

Data Ingestion Rate

Raw Data Ingestion: 200 TB/day
Compressed Storage: 50 TB/day (4:1 compression)
Monthly Storage Growth: 1.5 PB/month
Annual Storage Growth: 18 PB/year
Total Historical Data: 100+ PB

User Scale

Tracked Websites: 10 million active sites
Unique Visitors Tracked: 5 billion monthly
Active Sessions: 50 million concurrent
Dashboard Users: 1 million concurrent
Report Queries: 100,000 queries/second
API Requests: 100,000 requests/second

Storage Requirements

Event Data Storage

Event Record Size: 2 KB average
Daily Raw Events: 100B × 2KB = 200 TB
With Metadata: 200 TB × 1.2 = 240 TB
Compressed (4:1): 60 TB/day
Replicated (3x): 180 TB/day

Time-based Storage Tiers

Hot Storage (7 days):
- Raw events: 420 TB
- Aggregated data: 50 TB
- Total: 470 TB in SSD/NVMe

Warm Storage (8-90 days):
- Raw events: 5 PB
- Aggregated data: 500 TB
- Total: 5.5 PB in SSD

Cold Storage (90+ days):
- Raw events: 50 PB
- Aggregated data: 5 PB
- Total: 55 PB in HDD/Object Storage

Aggregated Data Storage

Minute-level aggregates: 100 GB/day
Hour-level aggregates: 10 GB/day
Day-level aggregates: 1 GB/day
Month-level aggregates: 100 MB/day
Total aggregated storage: 10 TB (historical)

Compute Requirements

Real-time Processing

Stream Processing Nodes: 1,000 nodes
CPU per Node: 32 cores
Memory per Node: 128 GB
Total CPU Cores: 32,000 cores
Total Memory: 128 TB
Processing Capacity: 5M events/sec

Batch Processing

Batch Processing Cluster: 5,000 nodes
CPU per Node: 64 cores
Memory per Node: 256 GB
Total CPU Cores: 320,000 cores
Total Memory: 1.25 PB
Daily Batch Jobs: 10,000 jobs

Query Processing

Query Nodes: 2,000 nodes
CPU per Node: 48 cores
Memory per Node: 384 GB
SSD Cache per Node: 2 TB
Total Query Capacity: 100,000 queries/sec
Cache Hit Rate: 80%

Network Bandwidth

Inbound Traffic

Event Ingestion: 2 Gbps average, 10 Gbps peak
API Requests: 500 Mbps average, 2 Gbps peak
Data Imports: 100 Mbps average, 1 Gbps peak
Total Inbound: 2.6 Gbps average, 13 Gbps peak

Outbound Traffic

Dashboard Queries: 5 Gbps average, 20 Gbps peak
API Responses: 2 Gbps average, 8 Gbps peak
Data Exports: 1 Gbps average, 5 Gbps peak
Report Emails: 100 Mbps average, 500 Mbps peak
Total Outbound: 8.1 Gbps average, 33.5 Gbps peak

Internal Traffic

Replication Traffic: 10 Gbps
Inter-service Communication: 20 Gbps
Data Pipeline: 50 Gbps
Cache Synchronization: 5 Gbps
Total Internal: 85 Gbps

Database Scaling

Event Store (Time-series Database)

Database: ClickHouse / TimescaleDB
Cluster Size: 500 nodes
Sharding Strategy: By website_id + timestamp
Replication Factor: 3
Partition Size: 1 day per partition
Retention: 14 months detailed, 5 years aggregated
Query Performance: <100ms for simple, <3s for complex

Metadata Store (Relational)

Database: PostgreSQL with Citus
Cluster Size: 50 nodes
Data Size: 5 TB
Tables: websites, users, properties, dimensions
Replication: Primary-replica with 3 replicas
Query Performance: <50ms for reads, <200ms for writes

Session Store (Key-Value)

Database: Redis Cluster
Cluster Size: 100 nodes
Memory per Node: 256 GB
Total Memory: 25 TB
Active Sessions: 50 million
TTL: 30 minutes
Throughput: 10M ops/sec

Aggregation Store (OLAP)

Database: Apache Druid / Apache Pinot
Cluster Size: 200 nodes
Data Size: 50 TB
Segments: 100,000 segments
Query Latency: <1 second
Ingestion Rate: 1M events/sec

Caching Strategy

CDN Caching

CDN Providers: CloudFront, Fastly
Edge Locations: 200+ locations
Cached Assets: JavaScript SDK, static reports
Cache Hit Rate: 95%
Bandwidth Savings: 90% reduction

Application Cache

Cache: Redis / Memcached
Cluster Size: 200 nodes
Memory per Node: 128 GB
Total Cache: 25 TB
Cache Hit Rate: 80%
TTL: 5 minutes to 1 hour

Query Result Cache

Cache: Redis with LRU eviction
Cache Size: 10 TB
Cached Queries: 100 million queries
Hit Rate: 70%
TTL: 15 minutes for real-time, 1 hour for historical
Eviction: LRU with size-based limits

Message Queue Scaling

Event Ingestion Queue

Queue: Apache Kafka
Cluster Size: 100 brokers
Partitions: 10,000 partitions
Replication Factor: 3
Retention: 7 days
Throughput: 5M messages/sec
Storage: 500 TB

Processing Queues

Internal Queues: RabbitMQ / SQS
Queue Count: 50 queues
Message Rate: 1M messages/sec
Message Size: 5 KB average
Retention: 14 days
Dead Letter Queue: Yes

Geographic Distribution

Data Center Regions

Primary Regions: 5 (US-East, US-West, EU-West, Asia-Pacific, South America)
Secondary Regions: 10 (regional presence)
Edge Locations: 200+ (CDN and tracking endpoints)
Cross-region Latency: <100ms between primary regions
Replication: Async replication between regions

Regional Traffic Distribution

North America: 40% (40B events/day)
Europe: 30% (30B events/day)
Asia-Pacific: 20% (20B events/day)
South America: 5% (5B events/day)
Other: 5% (5B events/day)

Regional Infrastructure

Per Region:
- Ingestion Nodes: 200 nodes
- Processing Nodes: 1,000 nodes
- Storage: 20 PB
- Query Nodes: 400 nodes
- Cache Nodes: 40 nodes

Cost Analysis

Infrastructure Costs (Monthly)

Compute (EC2/VMs): $2,000,000
Storage (S3/Object): $500,000
Database (RDS/Managed): $800,000
Network (Data Transfer): $300,000
CDN: $100,000
Monitoring/Logging: $50,000
Total Monthly: $3,750,000
Cost per Event: $0.00000375
Cost per Website: $0.375/month

Operational Costs (Monthly)

Engineering Team: $500,000
DevOps/SRE: $200,000
Support Team: $150,000
Data Center: $100,000
Licenses: $50,000
Total Monthly: $1,000,000

Performance Benchmarks

Latency Requirements

Event Ingestion: p50: 20ms, p95: 50ms, p99: 100ms
Real-time Processing: p50: 1s, p95: 3s, p99: 5s
Simple Query: p50: 100ms, p95: 500ms, p99: 1s
Complex Query: p50: 1s, p95: 5s, p99: 10s
Dashboard Load: p50: 500ms, p95: 2s, p99: 3s
API Response: p50: 100ms, p95: 500ms, p99: 1s

Throughput Benchmarks

Event Ingestion: 5M events/sec sustained
Query Processing: 100K queries/sec
API Requests: 100K requests/sec
Data Export: 10K exports/hour
Report Generation: 1M reports/hour
Batch Processing: 100B events/hour

Resource Utilization Targets

CPU Utilization: 60-70% average
Memory Utilization: 70-80% average
Disk I/O: <80% capacity
Network: <70% capacity
Cache Hit Rate: >80%
Query Success Rate: >99.9%

Scaling Bottlenecks

Ingestion Bottlenecks

  • Network I/O: 10 Gbps per ingestion node limit
  • Kafka Throughput: 1M messages/sec per broker
  • Serialization: CPU-bound JSON parsing
  • Validation: CPU-intensive data validation

Processing Bottlenecks

  • Sessionization: Memory-intensive session state management
  • Aggregation: CPU-intensive metric calculations
  • Enrichment: I/O-bound geo-location and device lookups
  • Deduplication: Memory-intensive bloom filters

Query Bottlenecks

  • High Cardinality: Queries on dimensions with millions of values
  • Time Range: Queries spanning multiple partitions
  • Joins: Cross-partition joins are expensive
  • Aggregations: Group-by on high-cardinality dimensions

Storage Bottlenecks

  • Write Amplification: Multiple indexes and replicas
  • Compaction: Background compaction impacts query performance
  • Hot Partitions: Popular websites cause uneven load
  • Cold Data Access: Slow queries on archived data

Capacity Planning

Growth Projections (Annual)

Event Volume: +50% YoY
Storage: +60% YoY (including historical data)
Compute: +40% YoY
Network: +45% YoY
Websites: +30% YoY
Users: +35% YoY

Scaling Triggers

CPU Utilization > 75%: Add compute nodes
Memory Utilization > 85%: Add memory or nodes
Disk Usage > 80%: Add storage capacity
Query Latency > SLA: Add query nodes
Event Lag > 10 seconds: Add processing nodes
Cache Hit Rate < 75%: Increase cache size

Capacity Reserves

Compute: 30% reserve capacity
Storage: 40% reserve capacity
Network: 50% reserve capacity
Cache: 25% reserve capacity
Database: 35% reserve capacity

This scale analysis provides the foundation for architecting a web analytics platform that can handle massive data volumes while maintaining performance and reliability.