🗄️ Databases

Database Internals

📖 12 min read 🧠 Complete Guide

Database Internals: Complete Guide for System Design

Overview

Every system stores data. Understanding how databases work internally lets you choose the right database, design efficient schemas, and explain performance characteristics in interviews.


1. B-Tree vs LSM-Tree

These are the two dominant storage engine architectures. Your choice fundamentally affects read/write performance.

B-Tree (Used by: PostgreSQL, MySQL/InnoDB, Oracle)

How it works:

  • Self-balancing tree structure
  • Each node is a fixed-size page (typically 4-16KB)
  • Internal nodes contain keys and pointers to child pages
  • Leaf nodes contain keys and values (or pointers to values)
  • All leaves at same depth (balanced)
                    [30 | 60]                    ← Root (internal)
                   /    |    \
          [10|20]    [40|50]    [70|80|90]      ← Internal nodes
         /  |  \    /  |  \    /  |  |  \
        L1  L2  L3 L4  L5  L6 L7 L8 L9 L10    ← Leaf nodes (data)

Read path: O(log n) — traverse from root to leaf Write path: O(log n) — find leaf, update in place, split if full

Write amplification:

  • Updating a single byte requires rewriting entire page (4-16KB)
  • Page splits cascade upward
  • Write amplification factor: ~10-30x

Advantages:

  • Consistent read performance
  • Good for read-heavy workloads
  • Efficient range queries (leaves are linked)
  • Mature, well-understood

LSM-Tree (Used by: RocksDB, Cassandra, LevelDB, HBase)

How it works:

  1. Writes go to in-memory buffer (memtable) — very fast
  2. When memtable is full, flush to disk as sorted SSTable (Sorted String Table)
  3. Background compaction merges SSTables to reduce read amplification
Write Path:
                                    
  Write ──► MemTable (sorted, in-memory)
                │
                │ (flush when full, ~64MB)
                ▼
  Level 0:  [SST] [SST] [SST]Unsorted between files
                │
                │ (compaction)
                ▼
  Level 1:  [SST] [SST] [SST] [SST]Sorted, non-overlapping
                │
                │ (compaction)
                ▼
  Level 2:  [SST] [SST] ... [SST]10x larger than L1

Read path:

  1. Check memtable
  2. Check each level's bloom filter
  3. Binary search within SSTable
  4. Worst case: check all levels

Write amplification:

  • Each compaction rewrites data
  • Leveled compaction: ~10-30x write amplification
  • But sequential writes (much faster than random B-tree writes)

Advantages:

  • Much higher write throughput (sequential I/O)
  • Better space efficiency (no fragmentation)
  • Compaction can be tuned for workload

Comparison

Aspect B-Tree LSM-Tree
Read latency Predictable, O(log n) Variable (check multiple levels)
Write throughput Lower (random I/O) Higher (sequential I/O)
Space amplification Higher (page fragmentation) Lower (compacted)
Write amplification 10-30x (page rewrites) 10-30x (compaction)
Read amplification 1x (single path) 1-N levels checked
Range queries Excellent (linked leaves) Good (sorted SSTables)
Use case Read-heavy, OLTP Write-heavy, time-series

Interview tip: "For our write-heavy logging system, an LSM-tree engine like RocksDB gives us 5-10x better write throughput. For our user profile service with mostly reads, PostgreSQL's B-tree is more predictable."


2. Write-Ahead Logging (WAL)

The Problem

If a database crashes mid-write, data could be corrupted. We need atomicity and durability.

How WAL Works

Write Request
     │
     ▼
┌─────────────┐
│ Write to WAL │ ← Sequential append (fast)
│ (on disk)    │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│ Update       │ ← In-memory (fast)
│ Buffer Pool  │
└──────┬──────┘
       │
       ▼ (periodically)
┌─────────────┐
│ Checkpoint   │ ← Flush dirty pages to data files
│ to Disk      │
└─────────────┘

Key principle: Write the intent (WAL) before modifying data. If crash occurs:

  • WAL intact → replay to recover
  • WAL incomplete → discard partial transaction

WAL Record Structure

┌──────────────────────────────────────────┐
│ LSN | TxID | Type | PageID | Before | After │
├──────────────────────────────────────────┤
│ 001 | T1   | UPD  | P5     | old    | new   │
│ 002 | T1   | UPD  | P8     | old    | new   │
│ 003 | T1   | CMT  | -      | -      | -     │  ← Commit marker
│ 004 | T2   | UPD  | P3     | old    | new   │
│ 005 | T2   | ABT  | -      | -      | -     │  ← Abort marker
└──────────────────────────────────────────┘

Crash Recovery (ARIES Algorithm)

  1. Analysis: Scan WAL to find active transactions at crash time
  2. Redo: Replay all committed transactions (bring data up to date)
  3. Undo: Roll back uncommitted transactions (using before-images)

Performance Implications

  • WAL writes are sequential → fast on HDD and SSD
  • fsync after each commit ensures durability (but adds latency)
  • Group commit: batch multiple transactions' fsync calls
  • Typical: fsync every 10ms, batching 100s of commits

3. MVCC (Multi-Version Concurrency Control)

The Problem

How do readers and writers coexist without blocking each other?

How MVCC Works

Instead of locking rows, keep multiple versions:

Transaction Timeline:
                                    
T1 starts (snapshot at time=100)
                                    
Row "user_1" versions:
┌─────────────────────────────────────────────┐
│ Version │ Created │ Expired │ Value          │
├─────────────────────────────────────────────┤
│ v1      │ 50      │ 150     │ name="Alice"   │
│ v2      │ 150     │ ∞       │ name="Alicia"  │
└─────────────────────────────────────────────┘

T1 (snapshot=100) sees v1 ("Alice")  ← consistent snapshot
T3 (snapshot=200) sees v2 ("Alicia") ← sees committed update

PostgreSQL's MVCC Implementation

  • Each row has xmin (creating transaction) and xmax (deleting transaction)
  • UPDATE = INSERT new version + mark old version expired
  • Visibility check: is xmin committed and xmax not yet committed from my snapshot?
  • Dead tuples accumulate → VACUUM process cleans them up

MVCC Benefits

  • Readers never block writers
  • Writers never block readers
  • Consistent snapshots without locks
  • Enables time-travel queries (as of timestamp X)

MVCC Costs

  • Storage overhead (multiple versions)
  • Garbage collection needed (vacuum/compaction)
  • Long-running transactions hold old versions alive
  • Write conflicts still need detection (first-writer-wins or abort)

4. Transaction Isolation Levels

The Anomalies

Anomaly Description Example
Dirty Read Read uncommitted data See a transfer before it commits
Non-Repeatable Read Same query, different results Balance changes between two reads
Phantom Read New rows appear in range query New order appears in date range
Write Skew Constraint violated across rows Two doctors both go off-call

Isolation Levels

Weakest ──────────────────────────────────────────── Strongest
                                                      
Read         Read          Repeatable    Snapshot     Serializable
Uncommitted  Committed     Read          Isolation    
                                                      
Allows:      Allows:       Allows:       Allows:      Prevents:
- Dirty      - Non-repeat  - Phantoms    - Write      ALL
- Non-repeat - Phantoms    - Write skew    skew       anomalies
- Phantoms   - Write skew                             
- Write skew                                          

Practical Examples

Read Committed (PostgreSQL default):

-- Transaction 1                    -- Transaction 2
BEGIN;                              BEGIN;
SELECT balance FROM accounts        
WHERE id = 1;  -- Returns 1000     
                                    UPDATE accounts SET balance = 500
                                    WHERE id = 1;
                                    COMMIT;
SELECT balance FROM accounts        
WHERE id = 1;  -- Returns 500! (non-repeatable read)
COMMIT;

Serializable (strongest):

-- Transaction 1                    -- Transaction 2
BEGIN ISOLATION LEVEL SERIALIZABLE; BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT count(*) FROM doctors        
WHERE on_call = true;  -- Returns 2
                                    SELECT count(*) FROM doctors
                                    WHERE on_call = true;  -- Returns 2
UPDATE doctors SET on_call = false  
WHERE id = 1;                       UPDATE doctors SET on_call = false
                                    WHERE id = 2;
COMMIT;  -- Succeeds                COMMIT;  -- ABORTED! (serialization failure)

Choosing Isolation Level

Use Case Level Why
Analytics/reporting Read Committed Stale data acceptable
E-commerce catalog Read Committed Performance over consistency
Financial transfers Serializable Must prevent all anomalies
Inventory management Repeatable Read Prevent double-selling
Booking systems Serializable Prevent overbooking

5. Indexing Strategies

B-Tree Index (Default)

CREATE INDEX idx_users_email ON users(email);
  • Balanced tree structure
  • O(log n) lookups
  • Supports: equality, range, prefix, ORDER BY
  • Best for: high-cardinality columns, range queries

Hash Index

CREATE INDEX idx_users_id ON users USING hash(id);
  • O(1) lookups for equality
  • Does NOT support range queries
  • Best for: exact-match lookups only (e.g., primary key)

Composite Index

CREATE INDEX idx_orders_user_date ON orders(user_id, created_at DESC);
  • Leftmost prefix rule: Can use (user_id) or (user_id, created_at), NOT (created_at) alone
  • Order matters: put equality conditions first, range conditions last
  • Covering index: includes all columns needed → index-only scan

Bitmap Index

  • Bit array per distinct value
  • Excellent for low-cardinality columns (status, gender, boolean)
  • Efficient for AND/OR combinations
  • Used in data warehouses (Oracle, PostgreSQL partial)

GIN (Generalized Inverted Index)

CREATE INDEX idx_posts_tags ON posts USING gin(tags);
CREATE INDEX idx_docs_search ON documents USING gin(to_tsvector('english', content));
  • Maps values to sets of rows containing them
  • Best for: full-text search, array containment, JSONB queries
  • Slower to update than B-tree (must update multiple entries)

GiST (Generalized Search Tree)

CREATE INDEX idx_locations_geo ON locations USING gist(coordinates);
  • Supports geometric/spatial queries
  • Range types, nearest-neighbor searches
  • Best for: PostGIS spatial data, range overlaps

Inverted Index (Elasticsearch, Lucene)

Document 1: "the quick brown fox"
Document 2: "the lazy brown dog"

Inverted Index:
┌─────────┬──────────────┐
│ TermDocuments     │
├─────────┼──────────────┤
│ brown   │ [1, 2]       │
│ dog     │ [2]          │
│ fox     │ [1]          │
│ lazy    │ [2]          │
│ quick   │ [1]          │
│ the     │ [1, 2]       │
└─────────┴──────────────┘

Index Selection Guide

Query Pattern Best Index Type
WHERE id = X B-tree or Hash
WHERE date BETWEEN A AND B B-tree
WHERE tags @> '{python}' GIN
WHERE ST_DWithin(point, ...) GiST
Full-text search GIN (tsvector) or Inverted
WHERE status IN ('active', 'pending') Bitmap or partial B-tree

6. Connection Pooling and Query Optimization

Connection Pooling

The problem: Each database connection costs ~5-10MB RAM and takes 50-100ms to establish (TCP + TLS + auth).

Without pooling:
App Server ──[new conn]──► DB  (per request, expensive)
App Server ──[new conn]──► DB
App Server ──[new conn]──► DB

With pooling (PgBouncer, HikariCP):
                    ┌─── Conn 1 ───┐
App Server ──► Pool ├─── Conn 2 ───┤──► DB
                    └─── Conn 3 ───┘

Pool sizing formula (PostgreSQL):

connections = (core_count * 2) + effective_spindle_count
  • Typical: 20-50 connections per database instance
  • More connections ≠ more throughput (context switching overhead)

Pooling modes:

Mode Behavior Use Case
Session Conn held for entire session Prepared statements
Transaction Conn returned after each tx Most web apps
Statement Conn returned after each query Simple queries only

Query Optimization

EXPLAIN ANALYZE output reading:

EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id = 123 AND status = 'active';

-- Good: Index Scan
Index Scan using idx_orders_user_status on orders
  Index Cond: (user_id = 123 AND status = 'active')
  Rows: 15, Time: 0.5ms

-- Bad: Sequential Scan
Seq Scan on orders
  Filter: (user_id = 123 AND status = 'active')
  Rows Removed by Filter: 999985
  Rows: 15, Time: 850ms

Common optimization patterns:

  1. Add missing indexes for WHERE/JOIN/ORDER BY columns
  2. Use covering indexes to avoid table lookups
  3. **Avoid SELECT *** — fetch only needed columns
  4. Pagination: Use keyset pagination over OFFSET
  5. Denormalize for read-heavy queries (materialized views)
  6. Partition large tables by date/region

7. SQL vs NoSQL Decision Framework

When to Choose SQL (PostgreSQL, MySQL)

Choose SQL when:

  • Complex queries with JOINs across multiple entities
  • ACID transactions are critical (financial, inventory)
  • Data has clear relationships and structure
  • Need strong consistency guarantees
  • Schema is relatively stable
  • Reporting and analytics on structured data

When to Choose NoSQL

Document Store (MongoDB, DynamoDB):

  • Flexible schema, evolving data models
  • Self-contained documents (embed related data)
  • High write throughput with eventual consistency
  • Example: product catalogs, user profiles, CMS

Wide-Column (Cassandra, HBase):

  • Massive write throughput (time-series, IoT)
  • Known query patterns (design table per query)
  • Multi-datacenter replication
  • Example: event logging, metrics, messaging

Key-Value (Redis, DynamoDB):

  • Simple lookups by key
  • Caching, session storage
  • Extremely low latency requirements
  • Example: shopping carts, feature flags

Graph (Neo4j, Neptune):

  • Highly connected data with complex traversals
  • Social networks, recommendation engines
  • Fraud detection (pattern matching)
  • Example: friend-of-friend queries, knowledge graphs

Decision Matrix

                    Need ACID?
                   /          \
                 Yes           No
                  │             │
          Complex JOINs?    Write-heavy?
         /        \         /        \
       Yes        No      Yes        No
        │          │        │          │
   PostgreSQL  PostgreSQL  Cassandra  Query pattern?
                (simpler)  DynamoDB   /      \
                                   Simple   Complex
                                     │        │
                                   Redis    MongoDB
                                   DynamoDB  

Hybrid Approaches (Common in Practice)

Primary Store Secondary Store Purpose
PostgreSQL Redis Cache hot data
PostgreSQL Elasticsearch Full-text search
DynamoDB S3 Archive old data
MongoDB Redis Session cache
PostgreSQL Kafka → Cassandra Event sourcing

Interview tip: "I'd use PostgreSQL as our primary store for user data and orders because we need ACID transactions. For the activity feed, we'd use Cassandra because it's write-heavy with simple access patterns. Redis sits in front for caching frequently accessed profiles."


Interview Cheat Sheet

When interviewer asks... Key points to mention
"How would you store this data?" Access patterns → choose engine → justify
"How to handle concurrent writes?" MVCC, isolation levels, optimistic locking
"Database is slow, what do you check?" EXPLAIN, missing indexes, connection pool, N+1
"How to ensure data isn't lost?" WAL, replication, backups, fsync
"SQL or NoSQL?" Access patterns, consistency needs, scale requirements
"How does indexing work?" B-tree structure, composite index rules, covering indexes