πŸ—οΈ Data Structures

Data Structures for Scale

πŸ“– 13 min read 🧠 Complete Guide

Data Structures for Scale: Complete Guide for System Design

Overview

Standard data structures (arrays, hash maps, trees) don't scale to billions of elements. This section covers probabilistic and specialized data structures that trade perfect accuracy for massive space/time savings β€” the secret weapons of large-scale systems.


1. Bloom Filters

What Problem Do They Solve?

"Is this element in the set?" β€” answered in O(1) time and constant memory, with a small false positive rate but ZERO false negatives.

How They Work

Bit array of m bits, k hash functions

Insert "hello":
  h1("hello") = 3   β†’ set bit 3
  h2("hello") = 7   β†’ set bit 7
  h3("hello") = 11  β†’ set bit 11

Bit array: [0,0,0,1,0,0,0,1,0,0,0,1,0,0,0]
                  ↑           ↑           ↑

Query "hello": bits 3,7,11 all set β†’ "PROBABLY YES"
Query "world": h1=2, h2=7, h3=14 β†’ bit 2 is 0 β†’ "DEFINITELY NO"
Query "test":  h1=3, h2=5, h3=11 β†’ bit 5 is 0 β†’ "DEFINITELY NO"

False Positive Rate

Formula: FPR β‰ˆ (1 - e^(-kn/m))^k

Where:
  m = number of bits
  n = number of elements inserted
  k = number of hash functions

Optimal k = (m/n) Γ— ln(2) β‰ˆ 0.693 Γ— (m/n)

Sizing Guide

Elements (n) FPR Target Bits (m) Hash Functions (k) Memory
1 million 1% 9.6M bits 7 1.2 MB
1 million 0.1% 14.4M bits 10 1.8 MB
100 million 1% 960M bits 7 120 MB
1 billion 1% 9.6B bits 7 1.2 GB

Rule of thumb: ~10 bits per element for 1% FPR.

Use Cases in System Design

System Use Case Benefit
Cassandra/HBase Check if SSTable contains key Avoid disk reads
Chrome Malicious URL detection Fast local check before network
Medium Recommend articles not yet seen Avoid showing duplicates
Akamai CDN "Is this URL cached?" Reduce origin requests
Bitcoin SPV wallet transaction verification Lightweight verification

Variants

  • Counting Bloom Filter: Replace bits with counters β†’ supports deletion
  • Cuckoo Filter: Better space efficiency, supports deletion, faster lookups
  • Scalable Bloom Filter: Grows dynamically by adding filter layers

2. Count-Min Sketch

What Problem Does It Solve?

"How many times has this element appeared?" β€” frequency estimation in constant memory with bounded error.

How It Works

d hash functions, each mapping to a row of w counters

         w counters
    β”Œβ”€β”¬β”€β”¬β”€β”¬β”€β”¬β”€β”¬β”€β”¬β”€β”¬β”€β”
h1: β”‚0β”‚0β”‚3β”‚0β”‚0β”‚2β”‚0β”‚0β”‚  ← row 1
    β”œβ”€β”Όβ”€β”Όβ”€β”Όβ”€β”Όβ”€β”Όβ”€β”Όβ”€β”Όβ”€β”€
h2: β”‚0β”‚1β”‚0β”‚0β”‚3β”‚0β”‚0β”‚2β”‚  ← row 2
    β”œβ”€β”Όβ”€β”Όβ”€β”Όβ”€β”Όβ”€β”Όβ”€β”Όβ”€β”Όβ”€β”€
h3: β”‚0β”‚0β”‚0β”‚2β”‚0β”‚0β”‚3β”‚0β”‚  ← row 3
    β”œβ”€β”Όβ”€β”Όβ”€β”Όβ”€β”Όβ”€β”Όβ”€β”Όβ”€β”Όβ”€β”€
h4: β”‚1β”‚0β”‚0β”‚0β”‚0β”‚3β”‚0β”‚0β”‚  ← row 4
    β””β”€β”΄β”€β”΄β”€β”΄β”€β”΄β”€β”΄β”€β”΄β”€β”΄β”€β”˜

Insert("apple"):
  Increment counter at h1("apple"), h2("apple"), h3("apple"), h4("apple")

Query("apple"):
  Return MIN(counter[h1("apple")], counter[h2("apple")], ...)
  
  MIN gives least over-counted estimate

Error Bounds

With width w and depth d:
  Error ≀ (e/w) Γ— N  with probability β‰₯ 1 - (1/e)^d
  
  Where N = total count of all elements

Typical: w=2048, d=5 β†’ error ≀ 0.1% of total count, 99.3% confidence
Memory: w Γ— d Γ— counter_size = 2048 Γ— 5 Γ— 4 bytes = 40KB

Use Cases

Application What's Counted Why CMS
Network monitoring Packet flows per IP Millions of IPs, limited memory
Ad click tracking Clicks per ad Real-time, approximate is fine
Heavy hitters detection Top-K frequent items Find popular items in stream
Rate limiting Requests per user Distributed, memory-efficient
NLP Word frequency Corpus too large for exact counts

3. HyperLogLog

What Problem Does It Solve?

"How many unique elements are in this set?" β€” cardinality estimation using only ~12KB of memory regardless of set size.

Intuition

Observation: If you hash elements uniformly, the maximum number of 
leading zeros in any hash tells you about the cardinality.

If max leading zeros = k, estimated cardinality β‰ˆ 2^k

Example:
  hash("alice") = 001010...  β†’ 2 leading zeros
  hash("bob")   = 000001...  β†’ 5 leading zeros  ← max
  hash("carol") = 010100...  β†’ 1 leading zero

  Max leading zeros = 5 β†’ estimate β‰ˆ 2^5 = 32 unique elements

How HyperLogLog Improves This

1. Split hash into two parts:
   - First p bits β†’ bucket index (2^p buckets, typically p=14 β†’ 16384 buckets)
   - Remaining bits β†’ count leading zeros

2. Each bucket stores max leading zeros seen

3. Final estimate = harmonic mean across all buckets Γ— correction factor

Buckets (p=14, 16384 buckets, 6 bits each):
β”Œβ”€β”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”
β”‚ B0 β”‚ B1 β”‚ B2 β”‚ B3 β”‚ ... β”‚ Bn β”‚
β”‚ 5  β”‚ 3  β”‚ 7  β”‚ 2  β”‚     β”‚ 4  β”‚
β””β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”˜

Memory: 16384 Γ— 6 bits = 12KB (counts up to billions!)

Accuracy

Standard error: 1.04 / √m  where m = number of buckets

With 16384 buckets (12KB): error β‰ˆ 0.81%
With 65536 buckets (48KB): error β‰ˆ 0.41%

For counting 1 billion unique items:
  Exact: would need ~1GB+ (storing all items)
  HyperLogLog: 12KB with <1% error

Use Cases

System What's Counted Memory Saved
Redis (PFCOUNT) Unique visitors per page GB β†’ 12KB
Google BigQuery COUNT(DISTINCT) approximation Massive savings
Presto/Trino Approximate distinct counts Faster queries
Network monitoring Unique source IPs Per-interface tracking
Advertising Unique ad impressions Per-campaign counting

Redis HyperLogLog Commands

PFADD visitors:page1 "user123" "user456" "user789"
PFADD visitors:page1 "user123"  // duplicate, no effect

PFCOUNT visitors:page1  // Returns ~3

// Merge multiple HLLs (union of sets)
PFMERGE visitors:total visitors:page1 visitors:page2
PFCOUNT visitors:total  // Unique visitors across all pages

4. Consistent Hashing

The Problem

Simple hash: server = hash(key) % N

If N changes (server added/removed):
  Almost ALL keys remap to different servers
  β†’ Massive cache invalidation
  β†’ "Cache stampede" on database

Example: 4 servers β†’ 5 servers
  hash(key) % 4 = 2  β†’  hash(key) % 5 = 3  (different server!)
  ~80% of keys move when adding 1 server to 4

How Consistent Hashing Works

Hash ring (0 to 2^32 - 1):

                    0
                    β”‚
           Node C ───
                    β”‚
    2^32 ──────────────────── 2^8
         β”‚                    β”‚
         β”‚                    │── Node A
         β”‚                    β”‚
    2^24 ──────────────────── 2^16
                    β”‚
                    │── Node B
                    β”‚

Key placement: hash(key) β†’ walk clockwise β†’ first node encountered

Adding Node D between A and B:
  Only keys between B and D move to D
  All other keys stay put!
  
  Keys moved: ~1/N of total (instead of ~(N-1)/N with modulo)

Virtual Nodes

Problem: With few physical nodes, distribution is uneven.

Without virtual nodes (3 servers):
  Node A: 45% of keys  ← unbalanced!
  Node B: 20% of keys
  Node C: 35% of keys

With virtual nodes (100-200 per physical node):
  Node A: A_0, A_1, A_2, ... A_149  (150 points on ring)
  Node B: B_0, B_1, B_2, ... B_149
  Node C: C_0, C_1, C_2, ... C_149

  Result: ~33% each (statistically balanced)

Benefits of virtual nodes:

  • Even distribution regardless of physical node count
  • Heterogeneous nodes: powerful server gets more virtual nodes
  • Smoother rebalancing when nodes join/leave

Rebalancing

Node D joins (between positions 100-200 on ring):

Before: Keys 100-200 β†’ Node A
After:  Keys 100-200 β†’ Node D (only these keys move)

Steps:
1. Node D announces its position(s) on ring
2. Node A transfers keys in range [100, 200] to Node D
3. Ring metadata updated in all nodes
4. Future requests for those keys go to Node D

Real-World Usage

System Implementation
Amazon DynamoDB Consistent hashing with virtual nodes
Apache Cassandra Token ring with vnodes
Memcached Client-side consistent hashing (ketama)
Nginx Upstream consistent hashing
Discord Guild (server) distribution

5. Skip Lists

What Problem Do They Solve?

Sorted data structure with O(log n) search, insert, delete β€” like a balanced BST but simpler to implement and naturally concurrent.

How They Work

Level 3: HEAD ──────────────────────────────────── 50 ──────── NIL
Level 2: HEAD ────────── 20 ────────────────────── 50 ──── 70 ─ NIL
Level 1: HEAD ──── 10 ── 20 ──── 30 ──────── 50 ── 60 ── 70 ─ NIL
Level 0: HEAD ─ 5 ─ 10 ─ 20 ─ 25 ─ 30 ─ 40 ─ 50 ─ 60 ─ 70 ─ NIL

Search for 40:
1. Start at HEAD, Level 3: 50 > 40, go down
2. Level 2: 20 < 40, move right. 50 > 40, go down
3. Level 1: 30 < 40, move right. 50 > 40, go down
4. Level 0: 40 = 40, FOUND!

Probabilistic Balancing

When inserting a new element:
  Level 0: always (100%)
  Level 1: with probability p (typically p=0.5 or p=0.25)
  Level 2: with probability pΒ²
  Level k: with probability p^k

Expected height: O(log n)
Expected search time: O(log n)
No rebalancing needed (unlike AVL/Red-Black trees)

Why Skip Lists Over Balanced Trees?

Aspect Skip List Red-Black Tree
Implementation Simpler Complex rotations
Concurrency Lock-free possible Rotations need locks
Range queries Natural (follow level 0) In-order traversal
Memory More pointers Fixed 2 children
Cache performance Worse (pointer chasing) Better (locality)

Use in Redis Sorted Sets

ZADD leaderboard 100 "alice"
ZADD leaderboard 200 "bob"
ZADD leaderboard 150 "carol"

ZRANGEBYSCORE leaderboard 100 200  // Range query: O(log n + k)
ZRANK leaderboard "bob"            // Rank query: O(log n)

Redis uses skip lists because:

  • O(log n) insert/delete/search
  • O(log n + k) range queries (k = result size)
  • Simpler to implement than balanced trees
  • Easy to modify for concurrent access

6. Merkle Trees

What Problem Do They Solve?

Efficiently verify data integrity and detect differences between large datasets using O(log n) comparisons.

How They Work

                    Root Hash
                   H(H12 + H34)
                  /              \
            H12                    H34
         H(H1+H2)              H(H3+H4)
          /     \               /     \
        H1       H2          H3       H4
      H(D1)    H(D2)       H(D3)    H(D4)
        β”‚        β”‚            β”‚        β”‚
       D1       D2           D3       D4
     (data)   (data)       (data)   (data)

If D3 changes:
  H3 changes β†’ H34 changes β†’ Root changes
  
To find what changed: compare roots β†’ compare H12 vs H34 β†’ 
  H34 differs β†’ compare H3 vs H4 β†’ H3 differs β†’ D3 changed

Only O(log n) comparisons needed!

Anti-Entropy in Distributed Databases

Node A                          Node B
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚Root: abc1 β”‚                   β”‚Root: abc1 β”‚ ← Same! No sync needed
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Node A                          Node B  
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚Root: abc1 β”‚                   β”‚Root: def2 β”‚ ← Different! Sync needed
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
     β”‚                               β”‚
     └── Compare subtrees β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         Only transfer differing branches
         (not entire dataset)

Use Cases

System Purpose
Cassandra Anti-entropy repair between replicas
DynamoDB Detect inconsistencies between replicas
Git Content-addressable storage, diff detection
Bitcoin/Ethereum Transaction verification (SPV)
IPFS Content integrity verification
ZFS Data integrity checksums
Certificate Transparency Append-only log verification

Properties

  • Tamper detection: Any change propagates to root
  • Efficient diff: O(log n) to find differences
  • Efficient proof: Prove element exists with O(log n) hashes
  • Append-friendly: Adding leaves only updates one path

7. Trie / Radix Tree

Trie (Prefix Tree)

Insert: "cat", "car", "card", "care", "dog", "do"

         (root)
        /      \
       c        d
       |        |
       a        o
      / \       |  \
     t   r      g   (end:"do")
         |\ 
         d  e
    (end:"card") (end:"care")

Search "car": traverse c→a→r → found (is end node)
Search "cap": traverse c→a→p → 'p' not found → NOT IN TRIE
Prefix "ca": traverse c→a → return all descendants: cat, car, card, care

Radix Tree (Compressed Trie)

Compress single-child chains:

Trie:          c β†’ a β†’ t
               c β†’ a β†’ r β†’ d
               c β†’ a β†’ r β†’ e

Radix Tree:    "ca" β†’ "t"
                    β†’ "r" β†’ "d"
                         β†’ "e"

Saves memory by merging single-child paths

Use Cases

Application Why Trie/Radix
Autocomplete Prefix search in O(prefix length)
IP routing (longest prefix match) Radix tree for CIDR blocks
Spell checker Find words within edit distance
DNS resolution Domain name lookup
File systems (ext4) Directory entry lookup
HTTP routers URL path matching

IP Routing Example

Routing table as radix tree:

192.168.0.0/16  β†’ Gateway A
192.168.1.0/24  β†’ Gateway B  (more specific)
10.0.0.0/8      β†’ Gateway C

Lookup 192.168.1.5:
  Traverse: 192 β†’ 168 β†’ 1 β†’ 5
  Longest match: 192.168.1.0/24 β†’ Gateway B

Performance

Operation Trie Radix Tree Hash Map
Exact lookup O(k) O(k) O(1) avg
Prefix search O(k + results) O(k + results) O(n) scan
Ordered iteration O(n) O(n) O(n log n) sort
Memory High (per-char nodes) Lower (compressed) Lower

Where k = key length, n = total elements


Comparison: When to Use What

Data Structure Question It Answers Space Error
Bloom Filter "Is X in the set?" O(n) bits False positives only
Count-Min Sketch "How often does X appear?" O(1) Over-counts
HyperLogLog "How many unique items?" O(1) ~12KB Β±1%
Consistent Hashing "Which server handles X?" O(nΓ—vnodes) None
Skip List "Sorted operations on X?" O(n) None
Merkle Tree "Are datasets A and B the same?" O(n) None
Trie "What starts with prefix X?" O(nΓ—k) None

Interview Cheat Sheet

When interviewer asks... Data structure to mention
"Check if URL is malicious" Bloom filter (fast negative check)
"Find top-K trending topics" Count-Min Sketch + min-heap
"Count unique visitors" HyperLogLog
"Distribute data across servers" Consistent hashing with vnodes
"Implement a leaderboard" Skip list (Redis sorted set)
"Detect data corruption" Merkle tree
"Autocomplete suggestions" Trie / Radix tree
"Rate limiting per user" Count-Min Sketch or sliding window
"Deduplicate events" Bloom filter (check before processing)
"Sync replicas efficiently" Merkle tree (anti-entropy)