Problem Statement

📖 3 min read 📄 Part 1 of 10

Top-K Analysis System - Problem Statement

Overview

Design a system that identifies the most popular items (top-K) in real-time from a high-volume stream of requests. Examples include trending topics on Twitter, most viewed videos on YouTube, top search queries on Google, or most visited URLs. The system must handle billions of events while maintaining accuracy and low latency.

Functional Requirements

Core Features

  • Real-time Top-K: Identify top K items from streaming data
  • Multiple Time Windows: Last minute, hour, day, week
  • Multiple Dimensions: Top URLs, users, products, searches
  • Approximate Counting: Use probabilistic data structures
  • Historical Top-K: Query historical top-K lists
  • Trending Detection: Identify rapidly rising items

Use Cases

  • Trending Topics: Twitter trending hashtags
  • Popular Videos: YouTube most viewed
  • Top Searches: Google trending searches
  • Hot Products: E-commerce best sellers
  • Popular URLs: Most visited pages
  • Top Users: Most active users

Query Types

  • Global Top-K: Top K items across all data
  • Filtered Top-K: Top K within category/region
  • Time-windowed: Top K in last hour/day
  • Comparative: Compare top-K across time periods
  • Personalized: Top K for user segment

Non-Functional Requirements

Performance

  • Ingestion Rate: 1 million events/second
  • Query Latency: <100ms for top-K query
  • Update Latency: <1 second for real-time top-K
  • Memory Efficiency: O(K) space complexity
  • Accuracy: 95%+ accuracy for top-K

Scalability

  • Daily Events: 100 billion events/day
  • Unique Items: 1 billion unique items
  • Concurrent Queries: 10,000 queries/second
  • Time Windows: Multiple windows simultaneously
  • Dimensions: 10+ different dimensions

Reliability

  • Uptime: 99.9% availability
  • Data Durability: 99.99% durability
  • Graceful Degradation: Serve stale data if needed
  • Fault Tolerance: Handle node failures

Scale Constraints

Event Volume

Events/Second: 1 million
Daily Events: 86.4 billion
Event Size: 100 bytes
Daily Data: 8.64 TB raw
Compressed: 1.7 TB (5:1)

Top-K Tracking

K Value: 100 (top 100)
Time Windows: 5 windows (1min, 5min, 1hour, 1day, 1week)
Dimensions: 10 dimensions
Total Top-K Lists: 50 lists
Memory per List: 10 KB
Total Memory: 500 KB (negligible)

Success Metrics

Accuracy

  • Top-K accuracy: >95%
  • False positive rate: <5%
  • Ranking accuracy: >90%

Performance

  • Query latency: p95 <100ms
  • Update latency: p95 <1s
  • Ingestion latency: p95 <50ms
  • System uptime: >99.9%

This problem statement establishes the foundation for a real-time top-K analysis system.