Key-Value Store - Database Design

Data Model

Core Data Structure

Key-Value Pair:
{
  key: string,              // Primary identifier (max 1KB)
  value: bytes,             // Arbitrary data (max 1MB)
  timestamp: int64,         // Microsecond precision
  ttl: int32,              // Time-to-live in seconds (optional)
  version: int64,          // Version number for optimistic locking
  metadata: {
    created_at: timestamp,
    updated_at: timestamp,
    size: int32,
    checksum: string
  }
}

Key Design Patterns

Simple Keys: user:12345, session:abc-def-ghi
Hierarchical Keys: app:prod:cache:user:12345
Composite Keys: tenant_id:user_id:resource_type:resource_id
Time-Based Keys: metrics:2024-01-08:server1:cpu
Hash-Based Keys: shard_0:user:12345 (for manual sharding)

Value Types and Encoding

String: UTF-8 encoded text
Binary: Raw bytes (images, serialized objects)
JSON: Structured data with schema flexibility
Protocol Buffers: Efficient binary serialization
MessagePack: Compact binary JSON-like format
Compressed: Snappy/LZ4 compression for large values

Storage Schema

SSTable File Format

SSTable File Format — Immutable sorted file with header, data blocks, sparse index, bloom filter, metadata, and footer

Write-Ahead Log (WAL) Format

Write-Ahead Log (WAL) File Format — Append-only log with header, sequential records, and per-record CRC32 checksums for crash recovery

Metadata Management

Node Metadata

-- Stored in distributed configuration service (ZooKeeper/etcd)

Node:
  node_id: UUID
  hostname: string
  ip_address: string
  port: int
  datacenter: string
  rack: string
  status: JOINING|NORMAL|LEAVING|DOWN
  tokens: [int64]  -- Virtual node tokens
  load: float      -- Data size in GB
  cpu_usage: float
  memory_usage: float
  last_heartbeat: timestamp

Partition Metadata

Partition:
  partition_id: int64
  token_range: (start_token, end_token)
  replicas: [node_id]  -- Ordered preference list
  primary_node: node_id
  size_bytes: int64
  key_count: int64
  last_compaction: timestamp
  status: HEALTHY|DEGRADED|UNAVAILABLE

SSTable Metadata

SSTable:
  sstable_id: UUID
  node_id: UUID
  level: int  -- LSM tree level (0-6)
  file_path: string
  size_bytes: int64
  key_count: int64
  min_key: string
  max_key: string
  min_timestamp: int64
  max_timestamp: int64
  bloom_filter_size: int64
  compression_ratio: float
  created_at: timestamp
  last_accessed: timestamp

Indexing Strategy

Primary Index (Key-Based)

Structure: Hash index for MemTable, sorted index for SSTables
Lookup: O(1) for MemTable, O(log N) for SSTables
Storage: In-memory for MemTable, on-disk for SSTables
Updates: Append-only, no in-place updates
Compaction: Merge and rebuild indexes during compaction

Bloom Filters

Purpose: Avoid unnecessary SSTable reads
Size: 10 bits per key (1% false positive rate)
Hash Functions: 3-5 independent hash functions
Memory: 1.25 bytes per key (for 1% FPR)
Effectiveness: Reduce disk reads by 90%+

Secondary Indexes (Optional)

Secondary Index Structure:
  index_name: string
  indexed_field: string
  index_type: HASH|BTREE|BITMAP
  
Index Entry:
  indexed_value: any
  primary_keys: [string]  -- List of keys with this value
  
Example:
  Index: user_email
  Entry: "john@example.com" → ["user:123", "session:abc"]

Data Partitioning

Partition Key Selection

Hash Partitioning: MD5(key) % partition_count
Range Partitioning: Lexicographic key ranges
Composite Partitioning: Hash(tenant_id) + Range(timestamp)
Custom Partitioning: Application-defined partition function

Partition Distribution

Consistent Hash Ring:
  Total Tokens: 2^64
  Nodes: 100
  VNodes per Node: 256
  Total VNodes: 25,600
  
Token Assignment:
  Node 1: [0, 256, 512, ..., 25344]
  Node 2: [1, 257, 513, ..., 25345]
  ...
  Node 100: [99, 355, 611, ..., 25443]
  
Key Placement:
  hash("user:123") = 42 → Token 42 → Node 1
  Replicas: Node 1 (primary), Node 2, Node 3

Replication Schema

Replication Metadata

Replication_Config:
  keyspace: string
  replication_factor: int  -- N (typically 3)
  replication_strategy: SIMPLE|NETWORK_TOPOLOGY
  datacenters: {
    dc1: replication_factor: 3
    dc2: replication_factor: 2
  }
  
Replica_State:
  key: string
  replicas: [
    {
      node_id: UUID,
      version: int64,
      timestamp: int64,
      status: SYNCED|LAGGING|DIVERGED
    }
  ]

Version Vectors

Version Vector (for conflict resolution):
  key: "user:123"
  versions: {
    node_1: 5,  -- Node 1 has seen 5 updates
    node_2: 4,  -- Node 2 has seen 4 updates
    node_3: 5   -- Node 3 has seen 5 updates
  }
  
Conflict Detection:
  V1 = {node_1: 5, node_2: 4, node_3: 5}
  V2 = {node_1: 4, node_2: 5, node_3: 5}
  → Concurrent writes detected (neither dominates)
  → Application must resolve conflict

Data Lifecycle Management

TTL (Time-To-Live)

TTL Implementation:
  - Store expiration timestamp with each key
  - Lazy deletion: Check TTL on read
  - Active deletion: Background process scans and deletes
  - Tombstone: Mark as deleted, remove during compaction
  
TTL Storage:
  key: "session:abc"
  value: {...}
  ttl: 3600  -- seconds
  expires_at: current_time + 3600

Compaction and Garbage Collection

Tombstone Removal: Delete tombstones older than gc_grace_seconds
Version Pruning: Keep only latest version of each key
Expired Key Removal: Delete keys past TTL
Orphan Cleanup: Remove data for deleted partitions
Space Reclamation: Reclaim disk space from deleted data

Data Archival

Archival Strategy:
  - Hot Data: <30 days, on NVMe SSD
  - Warm Data: 30-365 days, on SATA SSD
  - Cold Data: >365 days, on object storage (S3)
  
Archival Process:
  1. Identify old SSTables (last_accessed > 30 days)
  2. Upload to object storage
  3. Create stub with pointer to archived location
  4. Delete local SSTable
  5. On read: Fetch from object storage if needed

Consistency and Conflict Resolution

Conflict Resolution Strategies

Last-Write-Wins (LWW):
  - Use timestamp to determine winner
  - Simple but may lose concurrent updates
  
  Example:
    Write 1: key="user:123", value="Alice", ts=100
    Write 2: key="user:123", value="Bob", ts=105
    Result: value="Bob" (higher timestamp wins)

Vector Clocks:
  - Track causality for each replica
  - Detect concurrent writes
  - Preserve all conflicting versions
  
  Example:
    Node A: [A:1] → "Alice"
    Node B: [B:1] → "Bob"
    Conflict: Both versions preserved
    Application resolves: merge or choose one

Read Repair

Read Repair Process:
  1. Client reads with R=2 from 3 replicas
  2. Coordinator receives:
     - Node A: value="Alice", version=5
     - Node B: value="Bob", version=6
  3. Coordinator detects inconsistency
  4. Return latest value ("Bob") to client
  5. Background: Update Node A to version=6

Backup and Recovery

Backup Schema

Backup Metadata:
  backup_id: UUID
  backup_type: FULL|INCREMENTAL
  started_at: timestamp
  completed_at: timestamp
  size_bytes: int64
  node_count: int
  sstable_count: int
  status: IN_PROGRESS|COMPLETED|FAILED
  
Backup Contents:
  - SSTables: All SSTable files
  - Metadata: Node and partition metadata
  - WAL: Recent write-ahead logs
  - Configuration: Cluster configuration
  - Schemas: Keyspace and table schemas

Recovery Process

Restore Metadata: Recreate cluster topology
Restore SSTables: Copy SSTable files to nodes
Rebuild Indexes: Recreate bloom filters and indexes
Replay WAL: Apply recent writes from WAL
Verify Consistency: Run repair to ensure consistency
Resume Operations: Bring cluster online

This database design provides a solid foundation for implementing a production-grade distributed key-value store with strong durability, consistency, and performance characteristics.