Top-K Analysis System - Problem Statement
Overview
Design a system that identifies the most popular items (top-K) in real-time from a high-volume stream of requests. Examples include trending topics on Twitter, most viewed videos on YouTube, top search queries on Google, or most visited URLs. The system must handle billions of events while maintaining accuracy and low latency.
Functional Requirements
Core Features
- Real-time Top-K: Identify top K items from streaming data
- Multiple Time Windows: Last minute, hour, day, week
- Multiple Dimensions: Top URLs, users, products, searches
- Approximate Counting: Use probabilistic data structures
- Historical Top-K: Query historical top-K lists
- Trending Detection: Identify rapidly rising items
Use Cases
- Trending Topics: Twitter trending hashtags
- Popular Videos: YouTube most viewed
- Top Searches: Google trending searches
- Hot Products: E-commerce best sellers
- Popular URLs: Most visited pages
- Top Users: Most active users
Query Types
- Global Top-K: Top K items across all data
- Filtered Top-K: Top K within category/region
- Time-windowed: Top K in last hour/day
- Comparative: Compare top-K across time periods
- Personalized: Top K for user segment
Non-Functional Requirements
Performance
- Ingestion Rate: 1 million events/second
- Query Latency: <100ms for top-K query
- Update Latency: <1 second for real-time top-K
- Memory Efficiency: O(K) space complexity
- Accuracy: 95%+ accuracy for top-K
Scalability
- Daily Events: 100 billion events/day
- Unique Items: 1 billion unique items
- Concurrent Queries: 10,000 queries/second
- Time Windows: Multiple windows simultaneously
- Dimensions: 10+ different dimensions
Reliability
- Uptime: 99.9% availability
- Data Durability: 99.99% durability
- Graceful Degradation: Serve stale data if needed
- Fault Tolerance: Handle node failures
Scale Constraints
Event Volume
Events/Second: 1 million
Daily Events: 86.4 billion
Event Size: 100 bytes
Daily Data: 8.64 TB raw
Compressed: 1.7 TB (5:1)Top-K Tracking
K Value: 100 (top 100)
Time Windows: 5 windows (1min, 5min, 1hour, 1day, 1week)
Dimensions: 10 dimensions
Total Top-K Lists: 50 lists
Memory per List: 10 KB
Total Memory: 500 KB (negligible)Success Metrics
Accuracy
- Top-K accuracy: >95%
- False positive rate: <5%
- Ranking accuracy: >90%
Performance
- Query latency: p95 <100ms
- Update latency: p95 <1s
- Ingestion latency: p95 <50ms
- System uptime: >99.9%
This problem statement establishes the foundation for a real-time top-K analysis system.