Data Structures for Scale: Complete Guide for System Design

Overview

Standard data structures (arrays, hash maps, trees) don't scale to billions of elements. This section covers probabilistic and specialized data structures that trade perfect accuracy for massive space/time savings — the secret weapons of large-scale systems.

1. Bloom Filters

What Problem Do They Solve?

"Is this element in the set?" — answered in O(1) time and constant memory, with a small false positive rate but ZERO false negatives.

How They Work

Bit array of m bits, k hash functions

Insert "hello":
  h1("hello") = 3   → set bit 3
  h2("hello") = 7   → set bit 7
  h3("hello") = 11  → set bit 11

Bit array: [0,0,0,1,0,0,0,1,0,0,0,1,0,0,0]
                  ↑           ↑           ↑

Query "hello": bits 3,7,11 all set → "PROBABLY YES"
Query "world": h1=2, h2=7, h3=14 → bit 2 is 0 → "DEFINITELY NO"
Query "test":  h1=3, h2=5, h3=11 → bit 5 is 0 → "DEFINITELY NO"

False Positive Rate

Formula: FPR ≈ (1 - e^(-kn/m))^k

Where:
  m = number of bits
  n = number of elements inserted
  k = number of hash functions

Optimal k = (m/n) × ln(2) ≈ 0.693 × (m/n)

Sizing Guide

Elements (n)	FPR Target	Bits (m)	Hash Functions (k)	Memory
1 million	1%	9.6M bits	7	1.2 MB
1 million	0.1%	14.4M bits	10	1.8 MB
100 million	1%	960M bits	7	120 MB
1 billion	1%	9.6B bits	7	1.2 GB

Rule of thumb: ~10 bits per element for 1% FPR.

Use Cases in System Design

System	Use Case	Benefit
Cassandra/HBase	Check if SSTable contains key	Avoid disk reads
Chrome	Malicious URL detection	Fast local check before network
Medium	Recommend articles not yet seen	Avoid showing duplicates
Akamai CDN	"Is this URL cached?"	Reduce origin requests
Bitcoin	SPV wallet transaction verification	Lightweight verification

Variants

Counting Bloom Filter: Replace bits with counters → supports deletion
Cuckoo Filter: Better space efficiency, supports deletion, faster lookups
Scalable Bloom Filter: Grows dynamically by adding filter layers

2. Count-Min Sketch

What Problem Does It Solve?

"How many times has this element appeared?" — frequency estimation in constant memory with bounded error.

How It Works

d hash functions, each mapping to a row of w counters

         w counters
    ┌─┬─┬─┬─┬─┬─┬─┬─┐
h1: │0│0│3│0│0│2│0│0│  ← row 1
    ├─┼─┼─┼─┼─┼─┼─┼─┤
h2: │0│1│0│0│3│0│0│2│  ← row 2
    ├─┼─┼─┼─┼─┼─┼─┼─┤
h3: │0│0│0│2│0│0│3│0│  ← row 3
    ├─┼─┼─┼─┼─┼─┼─┼─┤
h4: │1│0│0│0│0│3│0│0│  ← row 4
    └─┴─┴─┴─┴─┴─┴─┴─┘

Insert("apple"):
  Increment counter at h1("apple"), h2("apple"), h3("apple"), h4("apple")

Query("apple"):
  Return MIN(counter[h1("apple")], counter[h2("apple")], ...)
  
  MIN gives least over-counted estimate

Error Bounds

With width w and depth d:
  Error ≤ (e/w) × N  with probability ≥ 1 - (1/e)^d
  
  Where N = total count of all elements

Typical: w=2048, d=5 → error ≤ 0.1% of total count, 99.3% confidence
Memory: w × d × counter_size = 2048 × 5 × 4 bytes = 40KB

Use Cases

Application	What's Counted	Why CMS
Network monitoring	Packet flows per IP	Millions of IPs, limited memory
Ad click tracking	Clicks per ad	Real-time, approximate is fine
Heavy hitters detection	Top-K frequent items	Find popular items in stream
Rate limiting	Requests per user	Distributed, memory-efficient
NLP	Word frequency	Corpus too large for exact counts

3. HyperLogLog

What Problem Does It Solve?

"How many unique elements are in this set?" — cardinality estimation using only ~12KB of memory regardless of set size.

Intuition

Observation: If you hash elements uniformly, the maximum number of 
leading zeros in any hash tells you about the cardinality.

If max leading zeros = k, estimated cardinality ≈ 2^k

Example:
  hash("alice") = 001010...  → 2 leading zeros
  hash("bob")   = 000001...  → 5 leading zeros  ← max
  hash("carol") = 010100...  → 1 leading zero

  Max leading zeros = 5 → estimate ≈ 2^5 = 32 unique elements

How HyperLogLog Improves This

1. Split hash into two parts:
   - First p bits → bucket index (2^p buckets, typically p=14 → 16384 buckets)
   - Remaining bits → count leading zeros

2. Each bucket stores max leading zeros seen

3. Final estimate = harmonic mean across all buckets × correction factor

Buckets (p=14, 16384 buckets, 6 bits each):
┌────┬────┬────┬────┬─────┬────┐
│ B0 │ B1 │ B2 │ B3 │ ... │ Bn │
│ 5  │ 3  │ 7  │ 2  │     │ 4  │
└────┴────┴────┴────┴─────┴────┘

Memory: 16384 × 6 bits = 12KB (counts up to billions!)

Accuracy

Standard error: 1.04 / √m  where m = number of buckets

With 16384 buckets (12KB): error ≈ 0.81%
With 65536 buckets (48KB): error ≈ 0.41%

For counting 1 billion unique items:
  Exact: would need ~1GB+ (storing all items)
  HyperLogLog: 12KB with <1% error

Use Cases

System	What's Counted	Memory Saved
Redis (PFCOUNT)	Unique visitors per page	GB → 12KB
Google BigQuery	COUNT(DISTINCT) approximation	Massive savings
Presto/Trino	Approximate distinct counts	Faster queries
Network monitoring	Unique source IPs	Per-interface tracking
Advertising	Unique ad impressions	Per-campaign counting

Redis HyperLogLog Commands

PFADD visitors:page1 "user123" "user456" "user789"
PFADD visitors:page1 "user123"  // duplicate, no effect

PFCOUNT visitors:page1  // Returns ~3

// Merge multiple HLLs (union of sets)
PFMERGE visitors:total visitors:page1 visitors:page2
PFCOUNT visitors:total  // Unique visitors across all pages

4. Consistent Hashing

The Problem

Simple hash: server = hash(key) % N

If N changes (server added/removed):
  Almost ALL keys remap to different servers
  → Massive cache invalidation
  → "Cache stampede" on database

Example: 4 servers → 5 servers
  hash(key) % 4 = 2  →  hash(key) % 5 = 3  (different server!)
  ~80% of keys move when adding 1 server to 4

How Consistent Hashing Works

Hash ring (0 to 2^32 - 1):

                    0
                    │
           Node C ──┤
                    │
    2^32 ──────────────────── 2^8
         │                    │
         │                    │── Node A
         │                    │
    2^24 ──────────────────── 2^16
                    │
                    │── Node B
                    │

Key placement: hash(key) → walk clockwise → first node encountered

Adding Node D between A and B:
  Only keys between B and D move to D
  All other keys stay put!
  
  Keys moved: ~1/N of total (instead of ~(N-1)/N with modulo)

Virtual Nodes

Problem: With few physical nodes, distribution is uneven.

Without virtual nodes (3 servers):
  Node A: 45% of keys  ← unbalanced!
  Node B: 20% of keys
  Node C: 35% of keys

With virtual nodes (100-200 per physical node):
  Node A: A_0, A_1, A_2, ... A_149  (150 points on ring)
  Node B: B_0, B_1, B_2, ... B_149
  Node C: C_0, C_1, C_2, ... C_149

  Result: ~33% each (statistically balanced)

Benefits of virtual nodes:

Even distribution regardless of physical node count
Heterogeneous nodes: powerful server gets more virtual nodes
Smoother rebalancing when nodes join/leave

Rebalancing

Node D joins (between positions 100-200 on ring):

Before: Keys 100-200 → Node A
After:  Keys 100-200 → Node D (only these keys move)

Steps:
1. Node D announces its position(s) on ring
2. Node A transfers keys in range [100, 200] to Node D
3. Ring metadata updated in all nodes
4. Future requests for those keys go to Node D

Real-World Usage

System	Implementation
Amazon DynamoDB	Consistent hashing with virtual nodes
Apache Cassandra	Token ring with vnodes
Memcached	Client-side consistent hashing (ketama)
Nginx	Upstream consistent hashing
Discord	Guild (server) distribution

5. Skip Lists

What Problem Do They Solve?

Sorted data structure with O(log n) search, insert, delete — like a balanced BST but simpler to implement and naturally concurrent.

How They Work

Level 3: HEAD ──────────────────────────────────── 50 ──────── NIL
Level 2: HEAD ────────── 20 ────────────────────── 50 ──── 70 ─ NIL
Level 1: HEAD ──── 10 ── 20 ──── 30 ──────── 50 ── 60 ── 70 ─ NIL
Level 0: HEAD ─ 5 ─ 10 ─ 20 ─ 25 ─ 30 ─ 40 ─ 50 ─ 60 ─ 70 ─ NIL

Search for 40:
1. Start at HEAD, Level 3: 50 > 40, go down
2. Level 2: 20 < 40, move right. 50 > 40, go down
3. Level 1: 30 < 40, move right. 50 > 40, go down
4. Level 0: 40 = 40, FOUND!

Probabilistic Balancing

When inserting a new element:
  Level 0: always (100%)
  Level 1: with probability p (typically p=0.5 or p=0.25)
  Level 2: with probability p²
  Level k: with probability p^k

Expected height: O(log n)
Expected search time: O(log n)
No rebalancing needed (unlike AVL/Red-Black trees)

Why Skip Lists Over Balanced Trees?

Aspect	Skip List	Red-Black Tree
Implementation	Simpler	Complex rotations
Concurrency	Lock-free possible	Rotations need locks
Range queries	Natural (follow level 0)	In-order traversal
Memory	More pointers	Fixed 2 children
Cache performance	Worse (pointer chasing)	Better (locality)

Use in Redis Sorted Sets

ZADD leaderboard 100 "alice"
ZADD leaderboard 200 "bob"
ZADD leaderboard 150 "carol"

ZRANGEBYSCORE leaderboard 100 200  // Range query: O(log n + k)
ZRANK leaderboard "bob"            // Rank query: O(log n)

Redis uses skip lists because:

O(log n) insert/delete/search
O(log n + k) range queries (k = result size)
Simpler to implement than balanced trees
Easy to modify for concurrent access

6. Merkle Trees

What Problem Do They Solve?

Efficiently verify data integrity and detect differences between large datasets using O(log n) comparisons.

How They Work

                    Root Hash
                   H(H12 + H34)
                  /              \
            H12                    H34
         H(H1+H2)              H(H3+H4)
          /     \               /     \
        H1       H2          H3       H4
      H(D1)    H(D2)       H(D3)    H(D4)
        │        │            │        │
       D1       D2           D3       D4
     (data)   (data)       (data)   (data)

If D3 changes:
  H3 changes → H34 changes → Root changes
  
To find what changed: compare roots → compare H12 vs H34 → 
  H34 differs → compare H3 vs H4 → H3 differs → D3 changed

Only O(log n) comparisons needed!

Anti-Entropy in Distributed Databases

Node A                          Node B
┌──────────┐                   ┌──────────┐
│Root: abc1 │                   │Root: abc1 │ ← Same! No sync needed
└──────────┘                   └──────────┘

Node A                          Node B  
┌──────────┐                   ┌──────────┐
│Root: abc1 │                   │Root: def2 │ ← Different! Sync needed
└──────────┘                   └──────────┘
     │                               │
     └── Compare subtrees ───────────┘
         Only transfer differing branches
         (not entire dataset)

Use Cases

System	Purpose
Cassandra	Anti-entropy repair between replicas
DynamoDB	Detect inconsistencies between replicas
Git	Content-addressable storage, diff detection
Bitcoin/Ethereum	Transaction verification (SPV)
IPFS	Content integrity verification
ZFS	Data integrity checksums
Certificate Transparency	Append-only log verification

Properties

Tamper detection: Any change propagates to root
Efficient diff: O(log n) to find differences
Efficient proof: Prove element exists with O(log n) hashes
Append-friendly: Adding leaves only updates one path

7. Trie / Radix Tree

Trie (Prefix Tree)

Insert: "cat", "car", "card", "care", "dog", "do"

         (root)
        /      \
       c        d
       |        |
       a        o
      / \       |  \
     t   r      g   (end:"do")
         |\ 
         d  e
    (end:"card") (end:"care")

Search "car": traverse c→a→r → found (is end node)
Search "cap": traverse c→a→p → 'p' not found → NOT IN TRIE
Prefix "ca": traverse c→a → return all descendants: cat, car, card, care

Radix Tree (Compressed Trie)

Compress single-child chains:

Trie:          c → a → t
               c → a → r → d
               c → a → r → e

Radix Tree:    "ca" → "t"
                    → "r" → "d"
                         → "e"

Saves memory by merging single-child paths

Use Cases

Application	Why Trie/Radix
Autocomplete	Prefix search in O(prefix length)
IP routing (longest prefix match)	Radix tree for CIDR blocks
Spell checker	Find words within edit distance
DNS resolution	Domain name lookup
File systems (ext4)	Directory entry lookup
HTTP routers	URL path matching

IP Routing Example

Routing table as radix tree:

192.168.0.0/16  → Gateway A
192.168.1.0/24  → Gateway B  (more specific)
10.0.0.0/8      → Gateway C

Lookup 192.168.1.5:
  Traverse: 192 → 168 → 1 → 5
  Longest match: 192.168.1.0/24 → Gateway B

Performance

Operation	Trie	Radix Tree	Hash Map
Exact lookup	O(k)	O(k)	O(1) avg
Prefix search	O(k + results)	O(k + results)	O(n) scan
Ordered iteration	O(n)	O(n)	O(n log n) sort
Memory	High (per-char nodes)	Lower (compressed)	Lower

Where k = key length, n = total elements

Comparison: When to Use What

Data Structure	Question It Answers	Space	Error
Bloom Filter	"Is X in the set?"	O(n) bits	False positives only
Count-Min Sketch	"How often does X appear?"	O(1)	Over-counts
HyperLogLog	"How many unique items?"	O(1) ~12KB	±1%
Consistent Hashing	"Which server handles X?"	O(n×vnodes)	None
Skip List	"Sorted operations on X?"	O(n)	None
Merkle Tree	"Are datasets A and B the same?"	O(n)	None
Trie	"What starts with prefix X?"	O(n×k)	None

Interview Cheat Sheet

When interviewer asks...	Data structure to mention
"Check if URL is malicious"	Bloom filter (fast negative check)
"Find top-K trending topics"	Count-Min Sketch + min-heap
"Count unique visitors"	HyperLogLog
"Distribute data across servers"	Consistent hashing with vnodes
"Implement a leaderboard"	Skip list (Redis sorted set)
"Detect data corruption"	Merkle tree
"Autocomplete suggestions"	Trie / Radix tree
"Rate limiting per user"	Count-Min Sketch or sliding window
"Deduplicate events"	Bloom filter (check before processing)
"Sync replicas efficiently"	Merkle tree (anti-entropy)