Caching Strategies: Complete Guide for System Design

Overview

Caching is the single most impactful technique for scaling read-heavy systems. Understanding cache internals, eviction policies, and consistency patterns is essential for every system design interview.

The fundamental tradeoff: Speed vs. freshness. Cached data is fast but potentially stale.

1. Cache Eviction Policies

LRU (Least Recently Used)

How it works: Evict the item that hasn't been accessed for the longest time.

Implementation:

Data Structure: HashMap + Doubly Linked List

HashMap: key → node pointer (O(1) lookup)
Linked List: ordered by access time (most recent at head)

Operations:
- GET: Move accessed node to head, return value — O(1)
- PUT: Add to head, if full evict tail — O(1)

┌─────────────────────────────────────────┐
│ Head                              Tail   │
│ [Most Recent] ↔ [...] ↔ [...] ↔ [LRU]  │
│                                   ↑      │
│                              Evict this   │
└─────────────────────────────────────────┘

Pros: Simple, works well for temporal locality Cons: Susceptible to scan pollution (one full scan evicts all hot items)

Real-world use: Memcached, Redis (approximated with sampling)

LFU (Least Frequently Used)

How it works: Evict the item with the lowest access count.

Implementation:

Data Structure: HashMap + Frequency Buckets (min-heap or linked lists)

freq=1: [item_D, item_E]  ← evict from lowest frequency
freq=2: [item_B]
freq=5: [item_A, item_C]

On access: increment frequency, move to next bucket
On eviction: remove from lowest frequency bucket (LRU within bucket)

Pros: Keeps genuinely popular items cached Cons:

New items start at freq=1 (cold start problem)
Historical frequency may not reflect current popularity
Items that were popular long ago stay cached

Real-world use: Redis 4.0+ (LFU mode)

ARC (Adaptive Replacement Cache)

How it works: Dynamically balances between recency (LRU) and frequency (LFU).

┌─────────────────────────────────────────────────────┐
│                                                       │
│  Ghost B1          T1 (Recency)                      │
│  [evicted LRU]  ← [recent items] ──────┐            │
│                                          │            │
│                    Cache (size c)    ◄───┤ Adaptive   │
│                                          │ boundary   │
│  Ghost B2          T2 (Frequency)        │            │
│  [evicted LFU]  ← [frequent items] ─────┘            │
│                                                       │
└─────────────────────────────────────────────────────┘

Adaptation:
- Hit in B1 ghost → increase T1 size (recency matters more)
- Hit in B2 ghost → increase T2 size (frequency matters more)

Pros: Self-tuning, handles varying workloads Cons: More complex, higher memory overhead (ghost lists)

Real-world use: ZFS, IBM DB2

W-TinyLFU (Window Tiny Least Frequently Used)

How it works: Used by Caffeine (Java's best cache library). Combines:

Admission window (1% of cache, LRU) — gives new items a chance
Main cache (99%, segmented LRU) — holds proven items
TinyLFU frequency sketch — decides admission to main cache

┌──────────────────────────────────────────────────────┐
│                                                        │
│  New Item ──► Window (1%) ──► TinyLFU Filter ──► Main │
│              (LRU)            (frequency check)  (SLRU)│
│                                    │                   │
│                                    ▼                   │
│                              Reject if new item's      │
│                              frequency < victim's      │
│                              frequency                 │
└──────────────────────────────────────────────────────┘

TinyLFU uses Count-Min Sketch:

4-bit counters (max frequency 15)
Periodic halving (aging) to adapt to changing patterns
~8 bytes per entry overhead

Pros: Near-optimal hit rates, low overhead, handles all workload types Cons: Complex implementation

Real-world use: Caffeine (Java), Ristretto (Go)

Comparison

Policy	Hit Rate	Memory Overhead	Scan Resistant	Adaptive
LRU	Good	Low	No	No
LFU	Good (stable workloads)	Medium	Yes	No
ARC	Very Good	High (2x)	Yes	Yes
W-TinyLFU	Best	Low	Yes	Yes

2. Cache Invalidation Strategies

"There are only two hard things in Computer Science: cache invalidation and naming things." — Phil Karlton

TTL-Based (Time-To-Live)

SET user:123 "{name: 'Alice'}" EX 300  // Expires in 5 minutes

Timeline:
t=0    SET (fresh)
t=150  GET → hit (still fresh)
t=300  GET → miss (expired, refetch from DB)
t=301  SET (refreshed from DB)

Pros: Simple, bounded staleness, self-healing Cons: Stale during TTL window, thundering herd on expiry

TTL selection guidelines:

Data Type	TTL	Rationale
User session	30 min	Security requirement
Product price	60s	Business-critical accuracy
User profile	5-15 min	Moderate change frequency
Static config	1-24 hours	Rarely changes
DNS record	300s	Balance freshness/load

Event-Based Invalidation

┌─────────┐    write     ┌─────────┐   invalidate   ┌───────┐
│  Client  │────────────►│   DB    │───────────────►│ Cache  │
└─────────┘              └─────────┘                └───────┘
                              │
                              │ publish event
                              ▼
                    ┌──────────────────┐
                    │  Message Queue    │
                    │  (Kafka/Redis)    │
                    └────────┬─────────┘
                             │ consume
                             ▼
                    ┌──────────────────┐
                    │ Cache Invalidator │──► DELETE cache key
                    └──────────────────┘

Pros: Near-real-time freshness, precise invalidation Cons: Complex infrastructure, eventual consistency window

Write-Through

Write Request
     │
     ▼
┌─────────┐  1. Write  ┌───────┐  2. Write  ┌─────┐
│  Client  │──────────►│ Cache  │───────────►│  DB  │
└─────────┘            └───────┘             └─────┘
                       (synchronous)

Behavior: Every write updates both cache and DB synchronously.

Pros: Cache always consistent with DB, simple reads Cons: Write latency increased (two writes), cache may hold rarely-read data

Write-Behind (Write-Back)

Write Request
     │
     ▼
┌─────────┐  1. Write  ┌───────┐  2. Async batch  ┌─────┐
│  Client  │──────────►│ Cache  │─ ─ ─ ─ ─ ─ ─ ─►│  DB  │
└─────────┘            └───────┘   (delayed)       └─────┘
                       Returns immediately

Behavior: Write to cache immediately, asynchronously persist to DB.

Pros: Very fast writes, batching reduces DB load Cons: Data loss risk if cache crashes before persist, complexity

Write-Around

Write Request
     │
     ├──────────────────────────────────────►┌─────┐
     │                                        │  DB  │
     │         (cache not updated)            └─────┘
     │
     │  Subsequent read:
     │  Cache miss → read from DB → populate cache

Behavior: Writes go directly to DB, bypassing cache.

Pros: Cache not polluted with write-once data Cons: First read after write is always a cache miss

3. Cache Stampede / Thundering Herd Prevention

The Problem

Popular cache key expires at t=100

t=100: 1000 concurrent requests all get cache miss
       → 1000 identical DB queries simultaneously
       → DB overwhelmed, cascading failure

Timeline:
t=99:  Cache HIT (all fast)
t=100: Cache EXPIRED
t=100: Request 1 → DB query (slow)
t=100: Request 2 → DB query (slow)  ← stampede!
t=100: Request 3 → DB query (slow)
...
t=100: Request 1000 → DB query (slow)
t=101: All 1000 responses come back, all write to cache

Solution 1: Locking (Mutex)

GET cache_key
IF miss:
    IF acquire_lock(cache_key, timeout=5s):
        value = query_database()
        SET cache_key value EX 300
        release_lock(cache_key)
        return value
    ELSE:
        // Another thread is rebuilding
        sleep(50ms)
        retry GET cache_key  // Should be populated now

Pros: Only one DB query, simple Cons: Other requests wait (increased latency), lock management

Solution 2: Probabilistic Early Expiration

def get_with_early_expiry(key):
    value, ttl, fetch_time = cache.get_with_metadata(key)
    
    # Probabilistically refresh before actual expiry
    # As TTL approaches 0, probability of refresh increases
    remaining_ttl = ttl - (now() - fetch_time)
    
    if remaining_ttl < 0 or random() < exp(-remaining_ttl / beta):
        # Refresh in background
        async_refresh(key)
    
    return value

Pros: No coordination needed, gradual refresh Cons: Occasional redundant refreshes, tuning required

Solution 3: Background Refresh

┌─────────────────────────────────────────────┐
│ Cache Entry:                                  │
│   value: "data"                              │
│   hard_ttl: 300s (actual expiry)             │
│   soft_ttl: 240s (trigger refresh)           │
│                                               │
│ At soft_ttl: trigger background refresh       │
│ Serve stale data until refresh completes      │
└─────────────────────────────────────────────┘

Pros: Zero latency impact, always serves data Cons: Briefly stale during refresh, background job infrastructure

Solution 4: Request Coalescing (Singleflight)

// Go's singleflight pattern
group := singleflight.Group{}

value, err, shared := group.Do(cacheKey, func() (interface{}, error) {
    // Only ONE goroutine executes this
    return fetchFromDatabase(key)
})
// All concurrent callers get the same result

Pros: Exactly one DB call per key, no locks Cons: All waiters blocked until first completes

4. Multi-Level Caching Architecture

┌─────────────────────────────────────────────────────────────┐
│                                                               │
│  L1: Application Cache (in-process)                          │
│  ┌─────────────────────────────────────┐                    │
│  │ Caffeine/Guava (JVM heap)           │  Latency: ~100ns   │
│  │ Size: 100MB-1GB per instance        │  Hit rate: 60-80%  │
│  └─────────────────────────────────────┘                    │
│           │ miss                                             │
│           ▼                                                  │
│  L2: Distributed Cache                                       │
│  ┌─────────────────────────────────────┐                    │
│  │ Redis Cluster / Memcached           │  Latency: ~1ms     │
│  │ Size: 10GB-1TB shared               │  Hit rate: 90-95%  │
│  └─────────────────────────────────────┘                    │
│           │ miss                                             │
│           ▼                                                  │
│  L3: CDN Edge Cache                                          │
│  ┌─────────────────────────────────────┐                    │
│  │ CloudFront / Fastly / Akamai        │  Latency: ~10-50ms │
│  │ Size: Distributed globally          │  (from edge)       │
│  └─────────────────────────────────────┘                    │
│           │ miss                                             │
│           ▼                                                  │
│  Origin: Database / API                                      │
│  ┌─────────────────────────────────────┐                    │
│  │ PostgreSQL / DynamoDB               │  Latency: 5-50ms   │
│  └─────────────────────────────────────┘                    │
│                                                               │
└─────────────────────────────────────────────────────────────┘

L1: Application-Level Cache

Characteristics:

In-process memory (no network hop)
Per-instance (not shared between servers)
Fastest possible access (~100ns)
Limited by JVM/process heap size
Lost on restart/deploy

Best for: Hot configuration, user sessions, computed results

Invalidation challenge: Each instance has its own copy. Options:

Short TTL (accept brief staleness)
Pub/sub notification (Redis pub/sub, Kafka)
Versioned keys (increment version on change)

L2: Distributed Cache (Redis/Memcached)

Redis vs Memcached:

Feature	Redis	Memcached
Data structures	Rich (lists, sets, sorted sets, hashes)	Key-value only
Persistence	RDB + AOF	None
Replication	Built-in (master-replica)	None
Clustering	Redis Cluster (auto-sharding)	Client-side sharding
Memory efficiency	Less (overhead per key)	More (slab allocator)
Max value size	512MB	1MB
Threads	Single-threaded (6.0+ I/O threads)	Multi-threaded
Use case	Feature-rich caching, pub/sub	Simple high-throughput caching

Redis Cluster topology:

┌─────────────┐  ┌─────────────┐  ┌─────────────┐
│ Master 1     │  │ Master 2     │  │ Master 3     │
│ Slots 0-5460 │  │ Slots 5461-  │  │ Slots 10923- │
│              │  │ 10922        │  │ 16383        │
└──────┬──────┘  └──────┬──────┘  └──────┬──────┘
       │                 │                 │
┌──────┴──────┐  ┌──────┴──────┐  ┌──────┴──────┐
│ Replica 1    │  │ Replica 2    │  │ Replica 3    │
└─────────────┘  └─────────────┘  └─────────────┘

L3: CDN Cache

Best for: Static assets, API responses that are cacheable, media files

Cache-Control headers:

Cache-Control: public, max-age=3600, s-maxage=86400
                │              │              │
                │              │              └─ CDN caches for 24h
                │              └─ Browser caches for 1h
                └─ CDN can cache (not private)

Cache-Control: private, no-store
                │         │
                │         └─ Don't cache at all
                └─ Only browser can cache (not CDN)

5. Cache Consistency in Distributed Systems

The Consistency Problem

Server A writes to DB, invalidates local cache
Server B still has stale data in its local cache
User hits Server B → gets stale data

Timeline:
t=0: User updates profile (Server A)
t=0: Server A invalidates its L1 cache
t=0: Server A writes to DB
t=1: User reads profile (routed to Server B)
t=1: Server B returns stale L1 cache → INCONSISTENT

Solutions

1. Cache-Aside with Short TTL:

Read:  Check cache → miss → read DB → populate cache (TTL=60s)
Write: Update DB → delete cache key

Consistency window: up to TTL seconds of staleness

2. Pub/Sub Invalidation:

Write to DB → Publish invalidation event → All instances delete key

┌──────────┐     ┌─────────┐     ┌──────────┐
│ Server A  │────►│  Redis   │────►│ Server B  │
│ (writer)  │     │  Pub/Sub │     │ (reader)  │
└──────────┘     └─────────┘     └──────────┘
                       │
                       ▼
                 ┌──────────┐
                 │ Server C  │
                 └──────────┘

3. Versioned Cache Keys:

Key format: user:{id}:v{version}

Write: increment version in DB
Read:  get current version → construct key → cache lookup

Old versions naturally expire via TTL
No explicit invalidation needed

4. Lease-based Invalidation (Facebook's Memcache):

On cache miss: server gets a "lease" (token)
On cache fill: must present valid lease
If key was invalidated between miss and fill: lease is invalid → don't cache stale data

6. When NOT to Cache

❌ Don't cache when:

Scenario	Why
Write-heavy data	Cache constantly invalidated, overhead > benefit
Large objects accessed once	Wastes cache space, evicts hot items
Highly personalized data	Low reuse across users, huge cache size
Data requiring strong consistency	Staleness is unacceptable (financial)
Cheap-to-compute data	Caching overhead exceeds computation cost
Unpredictable access patterns	No temporal/spatial locality

✅ Cache when:

Read-to-write ratio > 10:1
Data is expensive to compute or fetch
Data is accessed by many users (shared)
Slight staleness is acceptable
Access patterns show temporal locality

Cache Sizing

Required cache size = working_set_size × (1 + overhead_factor)

Working set = items accessed within one TTL period
Overhead = ~20-30% for hash table, metadata, fragmentation

Example:
- 10M users, 80/20 rule → 2M active users
- Profile size: 2KB
- Working set: 2M × 2KB = 4GB
- With overhead: 4GB × 1.3 = ~5.2GB Redis

7. Common Caching Patterns in System Design

Read-Through Cache

Application ──► Cache ──► (on miss) ──► Database
                  │
                  └── Cache handles miss logic internally

Refresh-Ahead

Cache proactively refreshes entries BEFORE they expire
Based on access patterns and predicted need

Good for: items with predictable access (e.g., homepage data)

Cache Warming

On deploy/restart:
1. Query DB for top-N hot keys
2. Pre-populate cache
3. Start serving traffic

Prevents cold-start stampede after deployment

Interview Cheat Sheet

When interviewer asks...	Key points to mention
"How to speed up reads?"	Multi-level cache, cache-aside pattern, TTL strategy
"Cache invalidation?"	Event-based for freshness, TTL as safety net
"What if cache goes down?"	Circuit breaker, fallback to DB, cache warming on recovery
"Cache consistency?"	Accept eventual consistency, pub/sub invalidation, versioned keys
"Thundering herd?"	Locking, singleflight, probabilistic early refresh
"What eviction policy?"	W-TinyLFU for best hit rate, LRU for simplicity
"Redis vs Memcached?"	Redis for features/persistence, Memcached for pure speed/simplicity
"How much cache do we need?"	Working set × overhead, 80/20 rule, measure hit rate