Data Structures for Scale: Complete Guide for System Design
Overview
Standard data structures (arrays, hash maps, trees) don't scale to billions of elements. This section covers probabilistic and specialized data structures that trade perfect accuracy for massive space/time savings β the secret weapons of large-scale systems.
1. Bloom Filters
What Problem Do They Solve?
"Is this element in the set?" β answered in O(1) time and constant memory, with a small false positive rate but ZERO false negatives.
How They Work
Bit array of m bits, k hash functions
Insert "hello":
h1("hello") = 3 β set bit 3
h2("hello") = 7 β set bit 7
h3("hello") = 11 β set bit 11
Bit array: [0,0,0,1,0,0,0,1,0,0,0,1,0,0,0]
β β β
Query "hello": bits 3,7,11 all set β "PROBABLY YES"
Query "world": h1=2, h2=7, h3=14 β bit 2 is 0 β "DEFINITELY NO"
Query "test": h1=3, h2=5, h3=11 β bit 5 is 0 β "DEFINITELY NO"False Positive Rate
Formula: FPR β (1 - e^(-kn/m))^k
Where:
m = number of bits
n = number of elements inserted
k = number of hash functions
Optimal k = (m/n) Γ ln(2) β 0.693 Γ (m/n)Sizing Guide
| Elements (n) | FPR Target | Bits (m) | Hash Functions (k) | Memory |
|---|---|---|---|---|
| 1 million | 1% | 9.6M bits | 7 | 1.2 MB |
| 1 million | 0.1% | 14.4M bits | 10 | 1.8 MB |
| 100 million | 1% | 960M bits | 7 | 120 MB |
| 1 billion | 1% | 9.6B bits | 7 | 1.2 GB |
Rule of thumb: ~10 bits per element for 1% FPR.
Use Cases in System Design
| System | Use Case | Benefit |
|---|---|---|
| Cassandra/HBase | Check if SSTable contains key | Avoid disk reads |
| Chrome | Malicious URL detection | Fast local check before network |
| Medium | Recommend articles not yet seen | Avoid showing duplicates |
| Akamai CDN | "Is this URL cached?" | Reduce origin requests |
| Bitcoin | SPV wallet transaction verification | Lightweight verification |
Variants
- Counting Bloom Filter: Replace bits with counters β supports deletion
- Cuckoo Filter: Better space efficiency, supports deletion, faster lookups
- Scalable Bloom Filter: Grows dynamically by adding filter layers
2. Count-Min Sketch
What Problem Does It Solve?
"How many times has this element appeared?" β frequency estimation in constant memory with bounded error.
How It Works
d hash functions, each mapping to a row of w counters
w counters
βββ¬ββ¬ββ¬ββ¬ββ¬ββ¬ββ¬ββ
h1: β0β0β3β0β0β2β0β0β β row 1
βββΌββΌββΌββΌββΌββΌββΌββ€
h2: β0β1β0β0β3β0β0β2β β row 2
βββΌββΌββΌββΌββΌββΌββΌββ€
h3: β0β0β0β2β0β0β3β0β β row 3
βββΌββΌββΌββΌββΌββΌββΌββ€
h4: β1β0β0β0β0β3β0β0β β row 4
βββ΄ββ΄ββ΄ββ΄ββ΄ββ΄ββ΄ββ
Insert("apple"):
Increment counter at h1("apple"), h2("apple"), h3("apple"), h4("apple")
Query("apple"):
Return MIN(counter[h1("apple")], counter[h2("apple")], ...)
MIN gives least over-counted estimateError Bounds
With width w and depth d:
Error β€ (e/w) Γ N with probability β₯ 1 - (1/e)^d
Where N = total count of all elements
Typical: w=2048, d=5 β error β€ 0.1% of total count, 99.3% confidence
Memory: w Γ d Γ counter_size = 2048 Γ 5 Γ 4 bytes = 40KBUse Cases
| Application | What's Counted | Why CMS |
|---|---|---|
| Network monitoring | Packet flows per IP | Millions of IPs, limited memory |
| Ad click tracking | Clicks per ad | Real-time, approximate is fine |
| Heavy hitters detection | Top-K frequent items | Find popular items in stream |
| Rate limiting | Requests per user | Distributed, memory-efficient |
| NLP | Word frequency | Corpus too large for exact counts |
3. HyperLogLog
What Problem Does It Solve?
"How many unique elements are in this set?" β cardinality estimation using only ~12KB of memory regardless of set size.
Intuition
Observation: If you hash elements uniformly, the maximum number of
leading zeros in any hash tells you about the cardinality.
If max leading zeros = k, estimated cardinality β 2^k
Example:
hash("alice") = 001010... β 2 leading zeros
hash("bob") = 000001... β 5 leading zeros β max
hash("carol") = 010100... β 1 leading zero
Max leading zeros = 5 β estimate β 2^5 = 32 unique elementsHow HyperLogLog Improves This
1. Split hash into two parts:
- First p bits β bucket index (2^p buckets, typically p=14 β 16384 buckets)
- Remaining bits β count leading zeros
2. Each bucket stores max leading zeros seen
3. Final estimate = harmonic mean across all buckets Γ correction factor
Buckets (p=14, 16384 buckets, 6 bits each):
ββββββ¬βββββ¬βββββ¬βββββ¬ββββββ¬βββββ
β B0 β B1 β B2 β B3 β ... β Bn β
β 5 β 3 β 7 β 2 β β 4 β
ββββββ΄βββββ΄βββββ΄βββββ΄ββββββ΄βββββ
Memory: 16384 Γ 6 bits = 12KB (counts up to billions!)Accuracy
Standard error: 1.04 / βm where m = number of buckets
With 16384 buckets (12KB): error β 0.81%
With 65536 buckets (48KB): error β 0.41%
For counting 1 billion unique items:
Exact: would need ~1GB+ (storing all items)
HyperLogLog: 12KB with <1% errorUse Cases
| System | What's Counted | Memory Saved |
|---|---|---|
| Redis (PFCOUNT) | Unique visitors per page | GB β 12KB |
| Google BigQuery | COUNT(DISTINCT) approximation | Massive savings |
| Presto/Trino | Approximate distinct counts | Faster queries |
| Network monitoring | Unique source IPs | Per-interface tracking |
| Advertising | Unique ad impressions | Per-campaign counting |
Redis HyperLogLog Commands
PFADD visitors:page1 "user123" "user456" "user789"
PFADD visitors:page1 "user123" // duplicate, no effect
PFCOUNT visitors:page1 // Returns ~3
// Merge multiple HLLs (union of sets)
PFMERGE visitors:total visitors:page1 visitors:page2
PFCOUNT visitors:total // Unique visitors across all pages4. Consistent Hashing
The Problem
Simple hash: server = hash(key) % N
If N changes (server added/removed):
Almost ALL keys remap to different servers
β Massive cache invalidation
β "Cache stampede" on database
Example: 4 servers β 5 servers
hash(key) % 4 = 2 β hash(key) % 5 = 3 (different server!)
~80% of keys move when adding 1 server to 4How Consistent Hashing Works
Hash ring (0 to 2^32 - 1):
0
β
Node C βββ€
β
2^32 ββββββββββββββββββββ 2^8
β β
β βββ Node A
β β
2^24 ββββββββββββββββββββ 2^16
β
βββ Node B
β
Key placement: hash(key) β walk clockwise β first node encountered
Adding Node D between A and B:
Only keys between B and D move to D
All other keys stay put!
Keys moved: ~1/N of total (instead of ~(N-1)/N with modulo)Virtual Nodes
Problem: With few physical nodes, distribution is uneven.
Without virtual nodes (3 servers):
Node A: 45% of keys β unbalanced!
Node B: 20% of keys
Node C: 35% of keys
With virtual nodes (100-200 per physical node):
Node A: A_0, A_1, A_2, ... A_149 (150 points on ring)
Node B: B_0, B_1, B_2, ... B_149
Node C: C_0, C_1, C_2, ... C_149
Result: ~33% each (statistically balanced)Benefits of virtual nodes:
- Even distribution regardless of physical node count
- Heterogeneous nodes: powerful server gets more virtual nodes
- Smoother rebalancing when nodes join/leave
Rebalancing
Node D joins (between positions 100-200 on ring):
Before: Keys 100-200 β Node A
After: Keys 100-200 β Node D (only these keys move)
Steps:
1. Node D announces its position(s) on ring
2. Node A transfers keys in range [100, 200] to Node D
3. Ring metadata updated in all nodes
4. Future requests for those keys go to Node DReal-World Usage
| System | Implementation |
|---|---|
| Amazon DynamoDB | Consistent hashing with virtual nodes |
| Apache Cassandra | Token ring with vnodes |
| Memcached | Client-side consistent hashing (ketama) |
| Nginx | Upstream consistent hashing |
| Discord | Guild (server) distribution |
5. Skip Lists
What Problem Do They Solve?
Sorted data structure with O(log n) search, insert, delete β like a balanced BST but simpler to implement and naturally concurrent.
How They Work
Level 3: HEAD ββββββββββββββββββββββββββββββββββββ 50 ββββββββ NIL
Level 2: HEAD ββββββββββ 20 ββββββββββββββββββββββ 50 ββββ 70 β NIL
Level 1: HEAD ββββ 10 ββ 20 ββββ 30 ββββββββ 50 ββ 60 ββ 70 β NIL
Level 0: HEAD β 5 β 10 β 20 β 25 β 30 β 40 β 50 β 60 β 70 β NIL
Search for 40:
1. Start at HEAD, Level 3: 50 > 40, go down
2. Level 2: 20 < 40, move right. 50 > 40, go down
3. Level 1: 30 < 40, move right. 50 > 40, go down
4. Level 0: 40 = 40, FOUND!Probabilistic Balancing
When inserting a new element:
Level 0: always (100%)
Level 1: with probability p (typically p=0.5 or p=0.25)
Level 2: with probability pΒ²
Level k: with probability p^k
Expected height: O(log n)
Expected search time: O(log n)
No rebalancing needed (unlike AVL/Red-Black trees)Why Skip Lists Over Balanced Trees?
| Aspect | Skip List | Red-Black Tree |
|---|---|---|
| Implementation | Simpler | Complex rotations |
| Concurrency | Lock-free possible | Rotations need locks |
| Range queries | Natural (follow level 0) | In-order traversal |
| Memory | More pointers | Fixed 2 children |
| Cache performance | Worse (pointer chasing) | Better (locality) |
Use in Redis Sorted Sets
ZADD leaderboard 100 "alice"
ZADD leaderboard 200 "bob"
ZADD leaderboard 150 "carol"
ZRANGEBYSCORE leaderboard 100 200 // Range query: O(log n + k)
ZRANK leaderboard "bob" // Rank query: O(log n)Redis uses skip lists because:
- O(log n) insert/delete/search
- O(log n + k) range queries (k = result size)
- Simpler to implement than balanced trees
- Easy to modify for concurrent access
6. Merkle Trees
What Problem Do They Solve?
Efficiently verify data integrity and detect differences between large datasets using O(log n) comparisons.
How They Work
Root Hash
H(H12 + H34)
/ \
H12 H34
H(H1+H2) H(H3+H4)
/ \ / \
H1 H2 H3 H4
H(D1) H(D2) H(D3) H(D4)
β β β β
D1 D2 D3 D4
(data) (data) (data) (data)
If D3 changes:
H3 changes β H34 changes β Root changes
To find what changed: compare roots β compare H12 vs H34 β
H34 differs β compare H3 vs H4 β H3 differs β D3 changed
Only O(log n) comparisons needed!Anti-Entropy in Distributed Databases
Node A Node B
ββββββββββββ ββββββββββββ
βRoot: abc1 β βRoot: abc1 β β Same! No sync needed
ββββββββββββ ββββββββββββ
Node A Node B
ββββββββββββ ββββββββββββ
βRoot: abc1 β βRoot: def2 β β Different! Sync needed
ββββββββββββ ββββββββββββ
β β
βββ Compare subtrees ββββββββββββ
Only transfer differing branches
(not entire dataset)Use Cases
| System | Purpose |
|---|---|
| Cassandra | Anti-entropy repair between replicas |
| DynamoDB | Detect inconsistencies between replicas |
| Git | Content-addressable storage, diff detection |
| Bitcoin/Ethereum | Transaction verification (SPV) |
| IPFS | Content integrity verification |
| ZFS | Data integrity checksums |
| Certificate Transparency | Append-only log verification |
Properties
- Tamper detection: Any change propagates to root
- Efficient diff: O(log n) to find differences
- Efficient proof: Prove element exists with O(log n) hashes
- Append-friendly: Adding leaves only updates one path
7. Trie / Radix Tree
Trie (Prefix Tree)
Insert: "cat", "car", "card", "care", "dog", "do"
(root)
/ \
c d
| |
a o
/ \ | \
t r g (end:"do")
|\
d e
(end:"card") (end:"care")
Search "car": traverse cβaβr β found (is end node)
Search "cap": traverse cβaβp β 'p' not found β NOT IN TRIE
Prefix "ca": traverse cβa β return all descendants: cat, car, card, careRadix Tree (Compressed Trie)
Compress single-child chains:
Trie: c β a β t
c β a β r β d
c β a β r β e
Radix Tree: "ca" β "t"
β "r" β "d"
β "e"
Saves memory by merging single-child pathsUse Cases
| Application | Why Trie/Radix |
|---|---|
| Autocomplete | Prefix search in O(prefix length) |
| IP routing (longest prefix match) | Radix tree for CIDR blocks |
| Spell checker | Find words within edit distance |
| DNS resolution | Domain name lookup |
| File systems (ext4) | Directory entry lookup |
| HTTP routers | URL path matching |
IP Routing Example
Routing table as radix tree:
192.168.0.0/16 β Gateway A
192.168.1.0/24 β Gateway B (more specific)
10.0.0.0/8 β Gateway C
Lookup 192.168.1.5:
Traverse: 192 β 168 β 1 β 5
Longest match: 192.168.1.0/24 β Gateway BPerformance
| Operation | Trie | Radix Tree | Hash Map |
|---|---|---|---|
| Exact lookup | O(k) | O(k) | O(1) avg |
| Prefix search | O(k + results) | O(k + results) | O(n) scan |
| Ordered iteration | O(n) | O(n) | O(n log n) sort |
| Memory | High (per-char nodes) | Lower (compressed) | Lower |
Where k = key length, n = total elements
Comparison: When to Use What
| Data Structure | Question It Answers | Space | Error |
|---|---|---|---|
| Bloom Filter | "Is X in the set?" | O(n) bits | False positives only |
| Count-Min Sketch | "How often does X appear?" | O(1) | Over-counts |
| HyperLogLog | "How many unique items?" | O(1) ~12KB | Β±1% |
| Consistent Hashing | "Which server handles X?" | O(nΓvnodes) | None |
| Skip List | "Sorted operations on X?" | O(n) | None |
| Merkle Tree | "Are datasets A and B the same?" | O(n) | None |
| Trie | "What starts with prefix X?" | O(nΓk) | None |
Interview Cheat Sheet
| When interviewer asks... | Data structure to mention |
|---|---|
| "Check if URL is malicious" | Bloom filter (fast negative check) |
| "Find top-K trending topics" | Count-Min Sketch + min-heap |
| "Count unique visitors" | HyperLogLog |
| "Distribute data across servers" | Consistent hashing with vnodes |
| "Implement a leaderboard" | Skip list (Redis sorted set) |
| "Detect data corruption" | Merkle tree |
| "Autocomplete suggestions" | Trie / Radix tree |
| "Rate limiting per user" | Count-Min Sketch or sliding window |
| "Deduplicate events" | Bloom filter (check before processing) |
| "Sync replicas efficiently" | Merkle tree (anti-entropy) |