Key-Value Store - System Architecture

High-Level Architecture Overview

System Architecture Principles

Distributed Hash Table (DHT): Consistent hashing for data distribution
Masterless Architecture: No single point of failure, peer-to-peer design
Tunable Consistency: Configurable consistency levels per operation
Horizontal Scalability: Linear scaling by adding nodes
Fault Tolerance: Automatic replication and failover
Eventually Consistent: AP system with optional strong consistency

Core Architecture Components

Key-Value Store Architecture — Client → Router → Hash Ring → Storage Nodes → Replication

Data Partitioning and Distribution

Consistent Hashing Architecture

Consistent Hashing — Hash ring with 4 nodes showing vnode ranges and key distribution

Virtual Nodes (VNodes)

VNodes per Physical Node: 256 virtual nodes
Purpose: Even data distribution and flexible rebalancing
Token Assignment: Random or evenly spaced tokens on ring
Rebalancing: Move vnodes instead of entire node's data
Hot Spot Mitigation: Split hot vnodes to multiple physical nodes
Failure Handling: Distribute failed node's vnodes across cluster

Partition Strategy

Partition Key: Hash of primary key determines partition
Hash Function: MD5 or MurmurHash3 for uniform distribution
Partition Count: 256 vnodes × 100 nodes = 25,600 partitions
Partition Size: Target 1-2GB per partition
Replication: Each partition replicated to N nodes (N=3)
Preference List: Ordered list of N nodes responsible for key

Storage Engine Architecture

LSM-Tree Storage Engine

LSM-Tree Storage Engine — Write path (left) flushes through levels; Read path (right) checks each layer

Write-Ahead Log (WAL)

Purpose: Durability and crash recovery
Format: Append-only sequential writes
Sync Policy: fsync every 100ms or 1000 writes
Rotation: New WAL file every 1GB or 10 minutes
Replay: Reconstruct MemTable on node restart
Cleanup: Delete WAL after MemTable flushed to SSTable

MemTable (In-Memory Buffer)

Data Structure: Skip list or red-black tree for sorted keys
Size: 64MB default, configurable per table
Write Buffer: Multiple MemTables for concurrent writes
Flush Trigger: Size threshold or time-based
Immutable MemTable: Read-only during flush to SSTable
Concurrent Access: Lock-free reads, synchronized writes

SSTable (Sorted String Table)

Format: Immutable sorted key-value files on disk
Structure: Data blocks + index blocks + bloom filter + metadata
Compression: Snappy or LZ4 for data blocks
Block Size: 4KB-64KB for efficient I/O
Index: Sparse index for fast key lookup
Bloom Filter: Probabilistic filter to avoid unnecessary reads

Compaction Strategy

Leveled Compaction

Leveled Compaction — SSTables merge from overlapping L0 into non-overlapping sorted levels

Compaction Process

Trigger: Level size exceeds threshold (L0: 4 files, L1: 10MB, L2: 100MB)
Selection: Pick overlapping SSTables from adjacent levels
Merge: Merge-sort keys, apply tombstones, remove duplicates
Output: Write new SSTables to next level
Cleanup: Delete old SSTables after compaction
Throttling: Limit compaction I/O to avoid impacting reads/writes

Compaction Strategies

Size-Tiered: Merge SSTables of similar size (write-heavy workloads)
Leveled: Merge into non-overlapping levels (read-heavy workloads)
Time-Window: Compact by time windows (time-series data)
Hybrid: Combine strategies for different data patterns

Replication and Consistency

Quorum-Based Replication

Quorum Replication — Write needs W=2 acks; Read returns latest from R=2 responses (W+R > N ensures consistency)

Consistency Levels

ONE: Wait for 1 replica (fastest, least consistent)
QUORUM: Wait for majority (N/2 + 1) replicas (balanced)
ALL: Wait for all N replicas (slowest, most consistent)
LOCAL_QUORUM: Quorum within local datacenter
EACH_QUORUM: Quorum in each datacenter
ANY: Write to any node including hints (highest availability)

Conflict Resolution

Last-Write-Wins (LWW): Use timestamp to resolve conflicts
Vector Clocks: Track causality for concurrent writes
Application-Defined: Custom merge functions for complex types
Read Repair: Fix inconsistencies during read operations
Anti-Entropy: Background process to sync replicas

Hinted Handoff

Purpose: Handle temporary node failures
Mechanism: Store writes for unavailable nodes
Hint Storage: Separate storage for hints per target node
Hint Replay: Deliver hints when target node recovers
Hint Expiration: Delete hints after 3 hours
Hint Throttling: Limit hint replay rate to avoid overwhelming recovered node

Membership and Failure Detection

Gossip Protocol

Gossip Protocol — Nodes exchange heartbeats and cluster state periodically for failure detection

Failure Detection

Heartbeat Interval: 1 second between gossip messages
Failure Threshold: Mark node as suspected after 10 seconds
Failure Confirmation: Mark node as down after 30 seconds
Phi Accrual Detector: Adaptive failure detection based on history
False Positive Rate: <1% false positives
Recovery Detection: Automatic detection when node comes back

Cluster Membership

Seed Nodes: Bootstrap nodes for new nodes to join
Join Process: New node contacts seed, receives cluster state
Token Assignment: Coordinator assigns vnodes to new node
Data Streaming: Transfer data for assigned vnodes
Decommission: Gracefully remove node and redistribute data
Replace: Replace failed node with new node at same tokens

Caching Architecture

Multi-Level Cache

Multi-Level Cache — L1 in-process (fastest) → L2 distributed Redis → L3 persistent SSTables (slowest)

Cache Strategies

Read-Through: Load from storage on cache miss
Write-Through: Write to cache and storage synchronously
Write-Behind: Write to cache, async write to storage
Cache-Aside: Application manages cache explicitly
TTL-Based: Automatic expiration for time-sensitive data
Invalidation: Explicit cache invalidation on updates

Query Processing

Read Path Optimization

Check MemTable: Search in-memory buffer first
Check Immutable MemTable: Search flushing buffer
Bloom Filter Check: Skip SSTables that don't contain key
Index Lookup: Use SSTable index to find data block
Data Block Read: Read and decompress data block
Merge Results: Combine results from multiple SSTables
Return Latest: Return value with highest timestamp

Write Path Optimization

Write to WAL: Append to write-ahead log for durability
Write to MemTable: Insert into in-memory sorted structure
Replicate: Send write to N-1 replica nodes
Wait for Quorum: Wait for W replicas to acknowledge
Return Success: Acknowledge write to client
Background Flush: Async flush MemTable to SSTable

Batch Operations

Batch Writes: Group multiple writes into single transaction
Batch Reads: Fetch multiple keys in single request
Pipelining: Send multiple requests without waiting for responses
Parallel Execution: Execute independent operations concurrently
Atomic Batches: All-or-nothing semantics for batch writes

Monitoring and Observability

Key Metrics

Latency: P50, P95, P99, P99.9 for reads and writes
Throughput: Operations per second (reads, writes, deletes)
Error Rate: Failed operations per second
Cache Hit Rate: Percentage of reads served from cache
Compaction: Compaction queue size and throughput
Replication Lag: Time delay for replica synchronization

Health Checks

Node Health: CPU, memory, disk, network utilization
Cluster Health: Number of nodes up/down/suspected
Data Health: Replication factor, consistency level compliance
Performance Health: Latency and throughput SLA compliance
Capacity Health: Disk space, memory usage, connection count

This comprehensive architecture provides the foundation for building a highly scalable, available, and performant distributed key-value store.