Database Internals: Complete Guide for System Design
Overview
Every system stores data. Understanding how databases work internally lets you choose the right database, design efficient schemas, and explain performance characteristics in interviews.
1. B-Tree vs LSM-Tree
These are the two dominant storage engine architectures. Your choice fundamentally affects read/write performance.
B-Tree (Used by: PostgreSQL, MySQL/InnoDB, Oracle)
How it works:
- Self-balancing tree structure
- Each node is a fixed-size page (typically 4-16KB)
- Internal nodes contain keys and pointers to child pages
- Leaf nodes contain keys and values (or pointers to values)
- All leaves at same depth (balanced)
[30 | 60] ← Root (internal)
/ | \
[10|20] [40|50] [70|80|90] ← Internal nodes
/ | \ / | \ / | | \
L1 L2 L3 L4 L5 L6 L7 L8 L9 L10 ← Leaf nodes (data)Read path: O(log n) — traverse from root to leaf Write path: O(log n) — find leaf, update in place, split if full
Write amplification:
- Updating a single byte requires rewriting entire page (4-16KB)
- Page splits cascade upward
- Write amplification factor: ~10-30x
Advantages:
- Consistent read performance
- Good for read-heavy workloads
- Efficient range queries (leaves are linked)
- Mature, well-understood
LSM-Tree (Used by: RocksDB, Cassandra, LevelDB, HBase)
How it works:
- Writes go to in-memory buffer (memtable) — very fast
- When memtable is full, flush to disk as sorted SSTable (Sorted String Table)
- Background compaction merges SSTables to reduce read amplification
Write Path:
Write ──► MemTable (sorted, in-memory)
│
│ (flush when full, ~64MB)
▼
Level 0: [SST] [SST] [SST] ← Unsorted between files
│
│ (compaction)
▼
Level 1: [SST] [SST] [SST] [SST] ← Sorted, non-overlapping
│
│ (compaction)
▼
Level 2: [SST] [SST] ... [SST] ← 10x larger than L1Read path:
- Check memtable
- Check each level's bloom filter
- Binary search within SSTable
- Worst case: check all levels
Write amplification:
- Each compaction rewrites data
- Leveled compaction: ~10-30x write amplification
- But sequential writes (much faster than random B-tree writes)
Advantages:
- Much higher write throughput (sequential I/O)
- Better space efficiency (no fragmentation)
- Compaction can be tuned for workload
Comparison
| Aspect | B-Tree | LSM-Tree |
|---|---|---|
| Read latency | Predictable, O(log n) | Variable (check multiple levels) |
| Write throughput | Lower (random I/O) | Higher (sequential I/O) |
| Space amplification | Higher (page fragmentation) | Lower (compacted) |
| Write amplification | 10-30x (page rewrites) | 10-30x (compaction) |
| Read amplification | 1x (single path) | 1-N levels checked |
| Range queries | Excellent (linked leaves) | Good (sorted SSTables) |
| Use case | Read-heavy, OLTP | Write-heavy, time-series |
Interview tip: "For our write-heavy logging system, an LSM-tree engine like RocksDB gives us 5-10x better write throughput. For our user profile service with mostly reads, PostgreSQL's B-tree is more predictable."
2. Write-Ahead Logging (WAL)
The Problem
If a database crashes mid-write, data could be corrupted. We need atomicity and durability.
How WAL Works
Write Request
│
▼
┌─────────────┐
│ Write to WAL │ ← Sequential append (fast)
│ (on disk) │
└──────┬──────┘
│
▼
┌─────────────┐
│ Update │ ← In-memory (fast)
│ Buffer Pool │
└──────┬──────┘
│
▼ (periodically)
┌─────────────┐
│ Checkpoint │ ← Flush dirty pages to data files
│ to Disk │
└─────────────┘Key principle: Write the intent (WAL) before modifying data. If crash occurs:
- WAL intact → replay to recover
- WAL incomplete → discard partial transaction
WAL Record Structure
┌──────────────────────────────────────────┐
│ LSN | TxID | Type | PageID | Before | After │
├──────────────────────────────────────────┤
│ 001 | T1 | UPD | P5 | old | new │
│ 002 | T1 | UPD | P8 | old | new │
│ 003 | T1 | CMT | - | - | - │ ← Commit marker
│ 004 | T2 | UPD | P3 | old | new │
│ 005 | T2 | ABT | - | - | - │ ← Abort marker
└──────────────────────────────────────────┘Crash Recovery (ARIES Algorithm)
- Analysis: Scan WAL to find active transactions at crash time
- Redo: Replay all committed transactions (bring data up to date)
- Undo: Roll back uncommitted transactions (using before-images)
Performance Implications
- WAL writes are sequential → fast on HDD and SSD
fsyncafter each commit ensures durability (but adds latency)- Group commit: batch multiple transactions' fsync calls
- Typical: fsync every 10ms, batching 100s of commits
3. MVCC (Multi-Version Concurrency Control)
The Problem
How do readers and writers coexist without blocking each other?
How MVCC Works
Instead of locking rows, keep multiple versions:
Transaction Timeline:
T1 starts (snapshot at time=100)
Row "user_1" versions:
┌─────────────────────────────────────────────┐
│ Version │ Created │ Expired │ Value │
├─────────────────────────────────────────────┤
│ v1 │ 50 │ 150 │ name="Alice" │
│ v2 │ 150 │ ∞ │ name="Alicia" │
└─────────────────────────────────────────────┘
T1 (snapshot=100) sees v1 ("Alice") ← consistent snapshot
T3 (snapshot=200) sees v2 ("Alicia") ← sees committed updatePostgreSQL's MVCC Implementation
- Each row has
xmin(creating transaction) andxmax(deleting transaction) - UPDATE = INSERT new version + mark old version expired
- Visibility check: is
xmincommitted andxmaxnot yet committed from my snapshot? - Dead tuples accumulate → VACUUM process cleans them up
MVCC Benefits
- Readers never block writers
- Writers never block readers
- Consistent snapshots without locks
- Enables time-travel queries (as of timestamp X)
MVCC Costs
- Storage overhead (multiple versions)
- Garbage collection needed (vacuum/compaction)
- Long-running transactions hold old versions alive
- Write conflicts still need detection (first-writer-wins or abort)
4. Transaction Isolation Levels
The Anomalies
| Anomaly | Description | Example |
|---|---|---|
| Dirty Read | Read uncommitted data | See a transfer before it commits |
| Non-Repeatable Read | Same query, different results | Balance changes between two reads |
| Phantom Read | New rows appear in range query | New order appears in date range |
| Write Skew | Constraint violated across rows | Two doctors both go off-call |
Isolation Levels
Weakest ──────────────────────────────────────────── Strongest
Read Read Repeatable Snapshot Serializable
Uncommitted Committed Read Isolation
Allows: Allows: Allows: Allows: Prevents:
- Dirty - Non-repeat - Phantoms - Write ALL
- Non-repeat - Phantoms - Write skew skew anomalies
- Phantoms - Write skew
- Write skew Practical Examples
Read Committed (PostgreSQL default):
-- Transaction 1 -- Transaction 2
BEGIN; BEGIN;
SELECT balance FROM accounts
WHERE id = 1; -- Returns 1000
UPDATE accounts SET balance = 500
WHERE id = 1;
COMMIT;
SELECT balance FROM accounts
WHERE id = 1; -- Returns 500! (non-repeatable read)
COMMIT;Serializable (strongest):
-- Transaction 1 -- Transaction 2
BEGIN ISOLATION LEVEL SERIALIZABLE; BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT count(*) FROM doctors
WHERE on_call = true; -- Returns 2
SELECT count(*) FROM doctors
WHERE on_call = true; -- Returns 2
UPDATE doctors SET on_call = false
WHERE id = 1; UPDATE doctors SET on_call = false
WHERE id = 2;
COMMIT; -- Succeeds COMMIT; -- ABORTED! (serialization failure)Choosing Isolation Level
| Use Case | Level | Why |
|---|---|---|
| Analytics/reporting | Read Committed | Stale data acceptable |
| E-commerce catalog | Read Committed | Performance over consistency |
| Financial transfers | Serializable | Must prevent all anomalies |
| Inventory management | Repeatable Read | Prevent double-selling |
| Booking systems | Serializable | Prevent overbooking |
5. Indexing Strategies
B-Tree Index (Default)
CREATE INDEX idx_users_email ON users(email);- Balanced tree structure
- O(log n) lookups
- Supports: equality, range, prefix, ORDER BY
- Best for: high-cardinality columns, range queries
Hash Index
CREATE INDEX idx_users_id ON users USING hash(id);- O(1) lookups for equality
- Does NOT support range queries
- Best for: exact-match lookups only (e.g., primary key)
Composite Index
CREATE INDEX idx_orders_user_date ON orders(user_id, created_at DESC);- Leftmost prefix rule: Can use (user_id) or (user_id, created_at), NOT (created_at) alone
- Order matters: put equality conditions first, range conditions last
- Covering index: includes all columns needed → index-only scan
Bitmap Index
- Bit array per distinct value
- Excellent for low-cardinality columns (status, gender, boolean)
- Efficient for AND/OR combinations
- Used in data warehouses (Oracle, PostgreSQL partial)
GIN (Generalized Inverted Index)
CREATE INDEX idx_posts_tags ON posts USING gin(tags);
CREATE INDEX idx_docs_search ON documents USING gin(to_tsvector('english', content));- Maps values to sets of rows containing them
- Best for: full-text search, array containment, JSONB queries
- Slower to update than B-tree (must update multiple entries)
GiST (Generalized Search Tree)
CREATE INDEX idx_locations_geo ON locations USING gist(coordinates);- Supports geometric/spatial queries
- Range types, nearest-neighbor searches
- Best for: PostGIS spatial data, range overlaps
Inverted Index (Elasticsearch, Lucene)
Document 1: "the quick brown fox"
Document 2: "the lazy brown dog"
Inverted Index:
┌─────────┬──────────────┐
│ Term │ Documents │
├─────────┼──────────────┤
│ brown │ [1, 2] │
│ dog │ [2] │
│ fox │ [1] │
│ lazy │ [2] │
│ quick │ [1] │
│ the │ [1, 2] │
└─────────┴──────────────┘Index Selection Guide
| Query Pattern | Best Index Type |
|---|---|
| WHERE id = X | B-tree or Hash |
| WHERE date BETWEEN A AND B | B-tree |
| WHERE tags @> '{python}' | GIN |
| WHERE ST_DWithin(point, ...) | GiST |
| Full-text search | GIN (tsvector) or Inverted |
| WHERE status IN ('active', 'pending') | Bitmap or partial B-tree |
6. Connection Pooling and Query Optimization
Connection Pooling
The problem: Each database connection costs ~5-10MB RAM and takes 50-100ms to establish (TCP + TLS + auth).
Without pooling:
App Server ──[new conn]──► DB (per request, expensive)
App Server ──[new conn]──► DB
App Server ──[new conn]──► DB
With pooling (PgBouncer, HikariCP):
┌─── Conn 1 ───┐
App Server ──► Pool ├─── Conn 2 ───┤──► DB
└─── Conn 3 ───┘Pool sizing formula (PostgreSQL):
connections = (core_count * 2) + effective_spindle_count- Typical: 20-50 connections per database instance
- More connections ≠ more throughput (context switching overhead)
Pooling modes:
| Mode | Behavior | Use Case |
|---|---|---|
| Session | Conn held for entire session | Prepared statements |
| Transaction | Conn returned after each tx | Most web apps |
| Statement | Conn returned after each query | Simple queries only |
Query Optimization
EXPLAIN ANALYZE output reading:
EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id = 123 AND status = 'active';
-- Good: Index Scan
Index Scan using idx_orders_user_status on orders
Index Cond: (user_id = 123 AND status = 'active')
Rows: 15, Time: 0.5ms
-- Bad: Sequential Scan
Seq Scan on orders
Filter: (user_id = 123 AND status = 'active')
Rows Removed by Filter: 999985
Rows: 15, Time: 850msCommon optimization patterns:
- Add missing indexes for WHERE/JOIN/ORDER BY columns
- Use covering indexes to avoid table lookups
- **Avoid SELECT *** — fetch only needed columns
- Pagination: Use keyset pagination over OFFSET
- Denormalize for read-heavy queries (materialized views)
- Partition large tables by date/region
7. SQL vs NoSQL Decision Framework
When to Choose SQL (PostgreSQL, MySQL)
✅ Choose SQL when:
- Complex queries with JOINs across multiple entities
- ACID transactions are critical (financial, inventory)
- Data has clear relationships and structure
- Need strong consistency guarantees
- Schema is relatively stable
- Reporting and analytics on structured data
When to Choose NoSQL
Document Store (MongoDB, DynamoDB):
- Flexible schema, evolving data models
- Self-contained documents (embed related data)
- High write throughput with eventual consistency
- Example: product catalogs, user profiles, CMS
Wide-Column (Cassandra, HBase):
- Massive write throughput (time-series, IoT)
- Known query patterns (design table per query)
- Multi-datacenter replication
- Example: event logging, metrics, messaging
Key-Value (Redis, DynamoDB):
- Simple lookups by key
- Caching, session storage
- Extremely low latency requirements
- Example: shopping carts, feature flags
Graph (Neo4j, Neptune):
- Highly connected data with complex traversals
- Social networks, recommendation engines
- Fraud detection (pattern matching)
- Example: friend-of-friend queries, knowledge graphs
Decision Matrix
Need ACID?
/ \
Yes No
│ │
Complex JOINs? Write-heavy?
/ \ / \
Yes No Yes No
│ │ │ │
PostgreSQL PostgreSQL Cassandra Query pattern?
(simpler) DynamoDB / \
Simple Complex
│ │
Redis MongoDB
DynamoDB Hybrid Approaches (Common in Practice)
| Primary Store | Secondary Store | Purpose |
|---|---|---|
| PostgreSQL | Redis | Cache hot data |
| PostgreSQL | Elasticsearch | Full-text search |
| DynamoDB | S3 | Archive old data |
| MongoDB | Redis | Session cache |
| PostgreSQL | Kafka → Cassandra | Event sourcing |
Interview tip: "I'd use PostgreSQL as our primary store for user data and orders because we need ACID transactions. For the activity feed, we'd use Cassandra because it's write-heavy with simple access patterns. Redis sits in front for caching frequently accessed profiles."
Interview Cheat Sheet
| When interviewer asks... | Key points to mention |
|---|---|
| "How would you store this data?" | Access patterns → choose engine → justify |
| "How to handle concurrent writes?" | MVCC, isolation levels, optimistic locking |
| "Database is slow, what do you check?" | EXPLAIN, missing indexes, connection pool, N+1 |
| "How to ensure data isn't lost?" | WAL, replication, backups, fsync |
| "SQL or NoSQL?" | Access patterns, consistency needs, scale requirements |
| "How does indexing work?" | B-tree structure, composite index rules, covering indexes |