Database Internals: Complete Guide for System Design

Overview

Every system stores data. Understanding how databases work internally lets you choose the right database, design efficient schemas, and explain performance characteristics in interviews.

1. B-Tree vs LSM-Tree

These are the two dominant storage engine architectures. Your choice fundamentally affects read/write performance.

B-Tree (Used by: PostgreSQL, MySQL/InnoDB, Oracle)

How it works:

Self-balancing tree structure
Each node is a fixed-size page (typically 4-16KB)
Internal nodes contain keys and pointers to child pages
Leaf nodes contain keys and values (or pointers to values)
All leaves at same depth (balanced)

                    [30 | 60]                    ← Root (internal)
                   /    |    \
          [10|20]    [40|50]    [70|80|90]      ← Internal nodes
         /  |  \    /  |  \    /  |  |  \
        L1  L2  L3 L4  L5  L6 L7 L8 L9 L10    ← Leaf nodes (data)

Read path: O(log n) — traverse from root to leaf Write path: O(log n) — find leaf, update in place, split if full

Write amplification:

Updating a single byte requires rewriting entire page (4-16KB)
Page splits cascade upward
Write amplification factor: ~10-30x

Advantages:

Consistent read performance
Good for read-heavy workloads
Efficient range queries (leaves are linked)
Mature, well-understood

LSM-Tree (Used by: RocksDB, Cassandra, LevelDB, HBase)

How it works:

Writes go to in-memory buffer (memtable) — very fast
When memtable is full, flush to disk as sorted SSTable (Sorted String Table)
Background compaction merges SSTables to reduce read amplification

Write Path:
                                    
  Write ──► MemTable (sorted, in-memory)
                │
                │ (flush when full, ~64MB)
                ▼
  Level 0:  [SST] [SST] [SST]     ← Unsorted between files
                │
                │ (compaction)
                ▼
  Level 1:  [SST] [SST] [SST] [SST]  ← Sorted, non-overlapping
                │
                │ (compaction)
                ▼
  Level 2:  [SST] [SST] ... [SST]    ← 10x larger than L1

Read path:

Check memtable
Check each level's bloom filter
Binary search within SSTable
Worst case: check all levels

Write amplification:

Each compaction rewrites data
Leveled compaction: ~10-30x write amplification
But sequential writes (much faster than random B-tree writes)

Advantages:

Much higher write throughput (sequential I/O)
Better space efficiency (no fragmentation)
Compaction can be tuned for workload

Comparison

Aspect	B-Tree	LSM-Tree
Read latency	Predictable, O(log n)	Variable (check multiple levels)
Write throughput	Lower (random I/O)	Higher (sequential I/O)
Space amplification	Higher (page fragmentation)	Lower (compacted)
Write amplification	10-30x (page rewrites)	10-30x (compaction)
Read amplification	1x (single path)	1-N levels checked
Range queries	Excellent (linked leaves)	Good (sorted SSTables)
Use case	Read-heavy, OLTP	Write-heavy, time-series

Interview tip: "For our write-heavy logging system, an LSM-tree engine like RocksDB gives us 5-10x better write throughput. For our user profile service with mostly reads, PostgreSQL's B-tree is more predictable."

2. Write-Ahead Logging (WAL)

The Problem

If a database crashes mid-write, data could be corrupted. We need atomicity and durability.

How WAL Works

Write Request
     │
     ▼
┌─────────────┐
│ Write to WAL │ ← Sequential append (fast)
│ (on disk)    │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│ Update       │ ← In-memory (fast)
│ Buffer Pool  │
└──────┬──────┘
       │
       ▼ (periodically)
┌─────────────┐
│ Checkpoint   │ ← Flush dirty pages to data files
│ to Disk      │
└─────────────┘

Key principle: Write the intent (WAL) before modifying data. If crash occurs:

WAL intact → replay to recover
WAL incomplete → discard partial transaction

WAL Record Structure

┌──────────────────────────────────────────┐
│ LSN | TxID | Type | PageID | Before | After │
├──────────────────────────────────────────┤
│ 001 | T1   | UPD  | P5     | old    | new   │
│ 002 | T1   | UPD  | P8     | old    | new   │
│ 003 | T1   | CMT  | -      | -      | -     │  ← Commit marker
│ 004 | T2   | UPD  | P3     | old    | new   │
│ 005 | T2   | ABT  | -      | -      | -     │  ← Abort marker
└──────────────────────────────────────────┘

Crash Recovery (ARIES Algorithm)

Analysis: Scan WAL to find active transactions at crash time
Redo: Replay all committed transactions (bring data up to date)
Undo: Roll back uncommitted transactions (using before-images)

Performance Implications

WAL writes are sequential → fast on HDD and SSD
fsync after each commit ensures durability (but adds latency)
Group commit: batch multiple transactions' fsync calls
Typical: fsync every 10ms, batching 100s of commits

3. MVCC (Multi-Version Concurrency Control)

The Problem

How do readers and writers coexist without blocking each other?

How MVCC Works

Instead of locking rows, keep multiple versions:

Transaction Timeline:
                                    
T1 starts (snapshot at time=100)
                                    
Row "user_1" versions:
┌─────────────────────────────────────────────┐
│ Version │ Created │ Expired │ Value          │
├─────────────────────────────────────────────┤
│ v1      │ 50      │ 150     │ name="Alice"   │
│ v2      │ 150     │ ∞       │ name="Alicia"  │
└─────────────────────────────────────────────┘

T1 (snapshot=100) sees v1 ("Alice")  ← consistent snapshot
T3 (snapshot=200) sees v2 ("Alicia") ← sees committed update

PostgreSQL's MVCC Implementation

Each row has xmin (creating transaction) and xmax (deleting transaction)
UPDATE = INSERT new version + mark old version expired
Visibility check: is xmin committed and xmax not yet committed from my snapshot?
Dead tuples accumulate → VACUUM process cleans them up

MVCC Benefits

Readers never block writers
Writers never block readers
Consistent snapshots without locks
Enables time-travel queries (as of timestamp X)

MVCC Costs

Storage overhead (multiple versions)
Garbage collection needed (vacuum/compaction)
Long-running transactions hold old versions alive
Write conflicts still need detection (first-writer-wins or abort)

4. Transaction Isolation Levels

The Anomalies

Anomaly	Description	Example
Dirty Read	Read uncommitted data	See a transfer before it commits
Non-Repeatable Read	Same query, different results	Balance changes between two reads
Phantom Read	New rows appear in range query	New order appears in date range
Write Skew	Constraint violated across rows	Two doctors both go off-call

Isolation Levels

Weakest ──────────────────────────────────────────── Strongest
                                                      
Read         Read          Repeatable    Snapshot     Serializable
Uncommitted  Committed     Read          Isolation    
                                                      
Allows:      Allows:       Allows:       Allows:      Prevents:
- Dirty      - Non-repeat  - Phantoms    - Write      ALL
- Non-repeat - Phantoms    - Write skew    skew       anomalies
- Phantoms   - Write skew                             
- Write skew

Practical Examples

Read Committed (PostgreSQL default):

-- Transaction 1                    -- Transaction 2
BEGIN;                              BEGIN;
SELECT balance FROM accounts        
WHERE id = 1;  -- Returns 1000     
                                    UPDATE accounts SET balance = 500
                                    WHERE id = 1;
                                    COMMIT;
SELECT balance FROM accounts        
WHERE id = 1;  -- Returns 500! (non-repeatable read)
COMMIT;

Serializable (strongest):

-- Transaction 1                    -- Transaction 2
BEGIN ISOLATION LEVEL SERIALIZABLE; BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT count(*) FROM doctors        
WHERE on_call = true;  -- Returns 2
                                    SELECT count(*) FROM doctors
                                    WHERE on_call = true;  -- Returns 2
UPDATE doctors SET on_call = false  
WHERE id = 1;                       UPDATE doctors SET on_call = false
                                    WHERE id = 2;
COMMIT;  -- Succeeds                COMMIT;  -- ABORTED! (serialization failure)

Choosing Isolation Level

Use Case	Level	Why
Analytics/reporting	Read Committed	Stale data acceptable
E-commerce catalog	Read Committed	Performance over consistency
Financial transfers	Serializable	Must prevent all anomalies
Inventory management	Repeatable Read	Prevent double-selling
Booking systems	Serializable	Prevent overbooking

5. Indexing Strategies

B-Tree Index (Default)

CREATE INDEX idx_users_email ON users(email);

Balanced tree structure
O(log n) lookups
Supports: equality, range, prefix, ORDER BY
Best for: high-cardinality columns, range queries

Hash Index

CREATE INDEX idx_users_id ON users USING hash(id);

O(1) lookups for equality
Does NOT support range queries
Best for: exact-match lookups only (e.g., primary key)

Composite Index

CREATE INDEX idx_orders_user_date ON orders(user_id, created_at DESC);

Leftmost prefix rule: Can use (user_id) or (user_id, created_at), NOT (created_at) alone
Order matters: put equality conditions first, range conditions last
Covering index: includes all columns needed → index-only scan

Bitmap Index

Bit array per distinct value
Excellent for low-cardinality columns (status, gender, boolean)
Efficient for AND/OR combinations
Used in data warehouses (Oracle, PostgreSQL partial)

GIN (Generalized Inverted Index)

CREATE INDEX idx_posts_tags ON posts USING gin(tags);
CREATE INDEX idx_docs_search ON documents USING gin(to_tsvector('english', content));

Maps values to sets of rows containing them
Best for: full-text search, array containment, JSONB queries
Slower to update than B-tree (must update multiple entries)

GiST (Generalized Search Tree)

CREATE INDEX idx_locations_geo ON locations USING gist(coordinates);

Supports geometric/spatial queries
Range types, nearest-neighbor searches
Best for: PostGIS spatial data, range overlaps

Inverted Index (Elasticsearch, Lucene)

Document 1: "the quick brown fox"
Document 2: "the lazy brown dog"

Inverted Index:
┌─────────┬──────────────┐
│ Term    │ Documents     │
├─────────┼──────────────┤
│ brown   │ [1, 2]       │
│ dog     │ [2]          │
│ fox     │ [1]          │
│ lazy    │ [2]          │
│ quick   │ [1]          │
│ the     │ [1, 2]       │
└─────────┴──────────────┘

Index Selection Guide

Query Pattern	Best Index Type
WHERE id = X	B-tree or Hash
WHERE date BETWEEN A AND B	B-tree
WHERE tags @> '{python}'	GIN
WHERE ST_DWithin(point, ...)	GiST
Full-text search	GIN (tsvector) or Inverted
WHERE status IN ('active', 'pending')	Bitmap or partial B-tree

6. Connection Pooling and Query Optimization

Connection Pooling

The problem: Each database connection costs ~5-10MB RAM and takes 50-100ms to establish (TCP + TLS + auth).

Without pooling:
App Server ──[new conn]──► DB  (per request, expensive)
App Server ──[new conn]──► DB
App Server ──[new conn]──► DB

With pooling (PgBouncer, HikariCP):
                    ┌─── Conn 1 ───┐
App Server ──► Pool ├─── Conn 2 ───┤──► DB
                    └─── Conn 3 ───┘

Pool sizing formula (PostgreSQL):

connections = (core_count * 2) + effective_spindle_count

Typical: 20-50 connections per database instance
More connections ≠ more throughput (context switching overhead)

Pooling modes:

Mode	Behavior	Use Case
Session	Conn held for entire session	Prepared statements
Transaction	Conn returned after each tx	Most web apps
Statement	Conn returned after each query	Simple queries only

Query Optimization

EXPLAIN ANALYZE output reading:

EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id = 123 AND status = 'active';

-- Good: Index Scan
Index Scan using idx_orders_user_status on orders
  Index Cond: (user_id = 123 AND status = 'active')
  Rows: 15, Time: 0.5ms

-- Bad: Sequential Scan
Seq Scan on orders
  Filter: (user_id = 123 AND status = 'active')
  Rows Removed by Filter: 999985
  Rows: 15, Time: 850ms

Common optimization patterns:

Add missing indexes for WHERE/JOIN/ORDER BY columns
Use covering indexes to avoid table lookups
**Avoid SELECT *** — fetch only needed columns
Pagination: Use keyset pagination over OFFSET
Denormalize for read-heavy queries (materialized views)
Partition large tables by date/region

7. SQL vs NoSQL Decision Framework

When to Choose SQL (PostgreSQL, MySQL)

✅ Choose SQL when:

Complex queries with JOINs across multiple entities
ACID transactions are critical (financial, inventory)
Data has clear relationships and structure
Need strong consistency guarantees
Schema is relatively stable
Reporting and analytics on structured data

When to Choose NoSQL

Document Store (MongoDB, DynamoDB):

Flexible schema, evolving data models
Self-contained documents (embed related data)
High write throughput with eventual consistency
Example: product catalogs, user profiles, CMS

Wide-Column (Cassandra, HBase):

Massive write throughput (time-series, IoT)
Known query patterns (design table per query)
Multi-datacenter replication
Example: event logging, metrics, messaging

Key-Value (Redis, DynamoDB):

Simple lookups by key
Caching, session storage
Extremely low latency requirements
Example: shopping carts, feature flags

Graph (Neo4j, Neptune):

Highly connected data with complex traversals
Social networks, recommendation engines
Fraud detection (pattern matching)
Example: friend-of-friend queries, knowledge graphs

Decision Matrix

                    Need ACID?
                   /          \
                 Yes           No
                  │             │
          Complex JOINs?    Write-heavy?
         /        \         /        \
       Yes        No      Yes        No
        │          │        │          │
   PostgreSQL  PostgreSQL  Cassandra  Query pattern?
                (simpler)  DynamoDB   /      \
                                   Simple   Complex
                                     │        │
                                   Redis    MongoDB
                                   DynamoDB

Hybrid Approaches (Common in Practice)

Primary Store	Secondary Store	Purpose
PostgreSQL	Redis	Cache hot data
PostgreSQL	Elasticsearch	Full-text search
DynamoDB	S3	Archive old data
MongoDB	Redis	Session cache
PostgreSQL	Kafka → Cassandra	Event sourcing

Interview tip: "I'd use PostgreSQL as our primary store for user data and orders because we need ACID transactions. For the activity feed, we'd use Cassandra because it's write-heavy with simple access patterns. Redis sits in front for caching frequently accessed profiles."

Interview Cheat Sheet

When interviewer asks...	Key points to mention
"How would you store this data?"	Access patterns → choose engine → justify
"How to handle concurrent writes?"	MVCC, isolation levels, optimistic locking
"Database is slow, what do you check?"	EXPLAIN, missing indexes, connection pool, N+1
"How to ensure data isn't lost?"	WAL, replication, backups, fsync
"SQL or NoSQL?"	Access patterns, consistency needs, scale requirements
"How does indexing work?"	B-tree structure, composite index rules, covering indexes