Distributed Systems Theory: Complete Guide for System Design
Overview
Distributed systems theory provides the foundational constraints and impossibility results that govern every large-scale system. Understanding these concepts lets you reason about what's achievable and make informed tradeoffs.
1. CAP Theorem
Statement
In a distributed system experiencing a network partition, you must choose between:
- Consistency (C): Every read receives the most recent write or an error
- Availability (A): Every request receives a non-error response (may be stale)
- Partition Tolerance (P): System continues operating despite network partitions
Key insight: Partitions WILL happen in any distributed system. So the real choice is between C and A during a partition.
Normal operation: You can have C + A + P (no partition occurring)
During partition:
βββββββββββββββ β³ βββββββββββββββ
β Node A β (partition) β Node B β
β data: v2 β β³ β data: v1 β
βββββββββββββββ βββββββββββββββ
CP choice: Node B refuses reads (unavailable) until partition heals
AP choice: Node B serves stale v1 (inconsistent) but stays availableReal-World Examples
| System | Choice | Behavior During Partition |
|---|---|---|
| ZooKeeper | CP | Minority partition becomes unavailable |
| etcd | CP | Raft leader in majority partition serves; minority rejects |
| Cassandra | AP | All nodes serve (potentially stale), reconcile later |
| DynamoDB | AP (tunable) | Serves with eventual consistency by default |
| MongoDB | CP | Primary in majority partition; secondaries reject writes |
| PostgreSQL (single) | CA | No partition tolerance (single node) |
| CockroachDB | CP | Ranges without majority become unavailable |
Common Misconceptions
- "You pick 2 of 3" β Wrong. P is not optional. You always need P. The choice is CP or AP.
- "CAP applies all the time" β Wrong. Only during partitions. Normally you get all three.
- "A system is entirely CP or AP" β Wrong. Different operations can make different choices.
- "Consistency means ACID" β Wrong. CAP consistency = linearizability (single-copy semantics).
2. PACELC Theorem
Extending CAP
CAP only describes behavior during partitions. PACELC adds: what tradeoff do you make when there's NO partition?
IF Partition:
Choose Availability or Consistency (same as CAP)
ELSE (normal operation):
Choose Latency or Consistency
Format: PA/EL, PC/EC, PA/EC, PC/ELSystem Classification
| System | During Partition | Normal Operation | Classification |
|---|---|---|---|
| DynamoDB | Availability | Latency | PA/EL |
| Cassandra | Availability | Latency | PA/EL |
| MongoDB | Consistency | Consistency | PC/EC |
| CockroachDB | Consistency | Consistency | PC/EC |
| PNUTS (Yahoo) | Availability | Consistency | PA/EC |
| Cosmos DB | Tunable | Tunable | Tunable |
Interview insight: "Even when there's no partition, we still face a latency-consistency tradeoff. Synchronous replication gives consistency but adds latency. Async replication is fast but risks stale reads."
3. Consistency Models
Spectrum (Strongest β Weakest)
Linearizability (strongest)
β
βΌ
Sequential Consistency
β
βΌ
Causal Consistency
β
βΌ
Eventual Consistency (weakest)Linearizability
Definition: Operations appear to execute atomically at some point between invocation and response. Equivalent to a single-copy system.
Timeline (real time β):
Client A: |--write(x=1)--|
Client B: |--read(x)--| must return 1
Client C: |--read(x)--| must return 1
Once ANY client sees x=1, ALL subsequent reads must see x=1Provides: Real-time ordering. If op A completes before op B starts, A is ordered before B. Cost: Requires coordination (consensus), high latency Used by: ZooKeeper, etcd, Spanner (TrueTime)
Sequential Consistency
Definition: All operations appear in SOME total order consistent with each process's program order. But this order doesn't need to match real-time.
Client A: write(x=1), write(x=2)
Client B: read(x)=2, read(x)=1 β INVALID (violates A's program order)
Client B: read(x)=1, read(x)=2 β VALID
Client A: write(x=1)
Client B: write(x=2)
Client C: read(x)=1, read(x)=2 β VALID
Client D: read(x)=2, read(x)=1 β INVALID (must agree with C's order)Difference from linearizability: No real-time constraint. Operations can appear reordered as long as per-process order is preserved.
Causal Consistency
Definition: Operations that are causally related are seen in the same order by all processes. Concurrent (unrelated) operations may be seen in different orders.
Causal relationship: A "happened before" B if:
1. A and B are in the same process, A before B
2. A is a send, B is the corresponding receive
3. Transitivity: AβB and BβC implies AβC
Example:
Client A: write(x=1)
Client B: read(x)=1, write(y=2) β y=2 causally depends on x=1
Client C: MUST see x=1 before y=2 (causal order)
MAY see other unrelated writes in any orderUsed by: MongoDB (causal consistency sessions), COPS
Eventual Consistency
Definition: If no new updates are made, all replicas will eventually converge to the same value. No ordering guarantees during convergence.
t=0: Client writes x=1 to Node A
t=1: Node A has x=1, Node B has x=0 (stale)
t=2: Node A has x=1, Node B has x=0 (still propagating)
t=5: Node A has x=1, Node B has x=1 (converged)
During t=1 to t=4: reads from Node B return stale dataConflict resolution strategies:
- Last-Writer-Wins (LWW): Timestamp-based, simple but loses data
- Vector clocks: Detect conflicts, application resolves
- CRDTs: Mathematically guaranteed convergence without coordination
Used by: DynamoDB, Cassandra, DNS, S3
4. Consensus Algorithms
Why Consensus?
Distributed systems need agreement on:
- Who is the leader?
- What is the committed log?
- What is the current configuration?
Raft (Understandable Consensus)
Three roles: Leader, Follower, Candidate
Leader Election
Normal operation:
ββββββββββ heartbeat ββββββββββββ heartbeat ββββββββββββ
β Leader ββββββββββββββΊβ Follower βββββββββββββββ Follower β
β (term 1)β β β β β
ββββββββββ ββββββββββββ ββββββββββββ
Election timeout (no heartbeat received):
ββββββββββββ RequestVote ββββββββββββ
β Candidate βββββββββββββββΊβ Follower β
β (term 2) ββββββββββββββββ (grants) β
ββββββββββββ VoteGranted ββββββββββββ
β
β Majority votes received
βΌ
ββββββββββ
β Leader β (term 2, begins serving)
β β
ββββββββββElection rules:
- Follower times out β becomes Candidate, increments term
- Candidate votes for itself, requests votes from all others
- Each node votes for at most one candidate per term
- Candidate with majority becomes Leader
- If split vote β timeout β new election with higher term
Log Replication
Client request: "SET x=5"
Leader Followers
β β
β 1. Append to local log β
β β
βββ AppendEntries ββββββββΊβ 2. Send to followers
β β
ββββ Success ββββββββββββββ 3. Followers append
β β
β 4. Entry committed β (majority acknowledged)
β (apply to state) β
β β
βββ AppendEntries ββββββββΊβ 5. Notify followers to commit
β (commitIndex updated) βCommit rule: Entry is committed when replicated to majority (βn/2β + 1 nodes).
Safety guarantee: If a log entry is committed, it will be present in the logs of all future leaders.
Raft in Practice
| Aspect | Typical Configuration |
|---|---|
| Cluster size | 3, 5, or 7 nodes (odd for majority) |
| Election timeout | 150-300ms (randomized) |
| Heartbeat interval | 50-100ms |
| Tolerates failures | (n-1)/2 nodes can fail |
| 3-node cluster | Tolerates 1 failure |
| 5-node cluster | Tolerates 2 failures |
Paxos (Intuition)
Three roles: Proposer, Acceptor, Learner
Two phases:
- Prepare: Proposer asks acceptors "Will you accept my proposal n?"
- Accept: If majority promises, proposer sends actual value
Phase 1 (Prepare):
Proposer ββ[Prepare(n)]βββΊ Acceptors
Acceptors ββ[Promise(n)]βββΊ Proposer (if n > any previous)
Phase 2 (Accept):
Proposer ββ[Accept(n, value)]βββΊ Acceptors
Acceptors ββ[Accepted(n, value)]βββΊ LearnersWhy Raft over Paxos?
- Raft: single leader, simpler to understand and implement
- Paxos: more general, but notoriously difficult to implement correctly
- Most production systems use Raft (etcd, CockroachDB, TiKV)
5. Clock Synchronization
The Problem
In distributed systems, there's no global clock. Each node has its own clock that drifts.
Lamport Clocks (Logical Clocks)
Rules:
- Before each event, increment local counter
- When sending message, attach current counter
- When receiving message, set counter = max(local, received) + 1
Node A: 1 2 3 6 7
β β β β² β
β β ββββββΊβββββ β
β β msg(3) β
Node B: 1 2 3 4 5 6
β β²
ββββββΊββββββββββ
msg(2)
Node C: 1 2 3 4 5 6Limitation: If L(a) < L(b), we CANNOT conclude a happened before b. Only the converse: if aβb then L(a) < L(b).
Vector Clocks
Each node maintains a vector of counters (one per node):
Node A: [A:1, B:0, C:0]
Node A sends to B: message carries [A:1, B:0, C:0]
Node B receives: B = max([A:0, B:1, C:0], [A:1, B:0, C:0]) + increment B
B = [A:1, B:2, C:0]Comparison rules:
- V1 < V2 if all components of V1 β€ V2 and at least one is strictly less
- V1 β₯ V2 (concurrent) if neither V1 < V2 nor V2 < V1
[2,1,0] < [2,2,0] β causally ordered
[2,1,0] β₯ [1,2,0] β concurrent (conflict!)Used by: DynamoDB (simplified), Riak (for conflict detection)
Hybrid Logical Clocks (HLC)
Combines physical time with logical ordering:
HLC = (physical_time, logical_counter)
- Uses physical time when possible (for human-meaningful timestamps)
- Falls back to logical counter when physical clocks are close
- Bounded drift from physical timeUsed by: CockroachDB, YugabyteDB
Google's TrueTime
TrueTime API:
tt.now() β returns [earliest, latest] interval
Guarantee: actual time is within the interval
Uncertainty: typically 1-7ms (GPS + atomic clocks)
Spanner's rule: wait out uncertainty before committing
β Ensures linearizability with ~7ms commit latency6. Failure Models
Hierarchy (Weakest β Strongest Assumption)
Byzantine (arbitrary behavior)
β
βΌ
Omission (may fail to send/receive)
β
βΌ
Crash-Recovery (may crash and restart)
β
βΌ
Crash-Fail (crashes permanently)Crash-Fail
- Node stops and never recovers
- Simplest model to reason about
- Consensus needs: n β₯ 2f + 1 nodes to tolerate f failures
- Example: Raft with 5 nodes tolerates 2 permanent failures
Crash-Recovery
- Node may crash and restart with durable state
- Must persist state before acknowledging
- WAL ensures recovery is possible
- Example: Database with WAL can recover after crash
Omission Failures
- Node fails to send or receive messages
- Network partitions are a form of omission failure
- Harder to detect than crashes (is it slow or dead?)
Byzantine Failures
- Node may behave arbitrarily (including maliciously)
- May send conflicting messages to different nodes
- Consensus needs: n β₯ 3f + 1 nodes to tolerate f Byzantine failures
- Example: Blockchain consensus (untrusted participants)
Interview context: Most system design interviews assume crash-fail or crash-recovery. Byzantine tolerance is relevant for blockchain/multi-party systems.
7. FLP Impossibility Result
Statement
In an asynchronous distributed system with even ONE possible crash failure, there is NO deterministic algorithm that guarantees consensus.
Intuition
The problem: In an asynchronous system, you cannot distinguish between:
1. A crashed node
2. A very slow node
If you wait forever β no liveness (system hangs)
If you decide without the slow node β it might have voted differently
No timeout is "correct" because the system is asynchronous
(no bound on message delivery time)Practical Implications
FLP doesn't mean consensus is impossible in practice. It means:
- You cannot have ALL of: safety + liveness + fault tolerance in async systems
- Practical systems sacrifice one partially:
- Raft/Paxos: May temporarily lose liveness (during leader election)
- Randomized algorithms: Probabilistic termination (always terminates with high probability)
- Partial synchrony: Assume system is eventually synchronous (timeouts work eventually)
8. Split-Brain Problem
What Is Split-Brain?
Normal:
βββββββββββββββββββββββββββββββββββββββ
β [Leader] ββ [Follower] ββ [Follower] β
βββββββββββββββββββββββββββββββββββββββ
Network partition:
ββββββββββββββββ β³ ββββββββββββββββ
β [Leader A] β β [Leader B] β β TWO leaders!
β [Follower] β β [Follower] β
ββββββββββββββββ ββββββββββββββββ
Both sides accept writes β data divergence β SPLIT BRAINSolutions
1. Majority Quorum
5-node cluster, partition into [3] and [2]:
[Node1, Node2, Node3] | [Node4, Node5]
Majority (3/5) | Minority (2/5)
Keeps leader | Steps down
Accepts writes | Rejects writesRule: Only the partition with majority (>n/2) can elect a leader and accept writes.
2. Fencing Tokens
Epoch/Term numbers prevent stale leaders:
Leader A (term=1) gets partitioned
Leader B (term=2) elected by majority
Leader A recovers, tries to write with term=1
Storage layer rejects: term=1 < current_term=2
ββββββββββ write(term=1) βββββββββββ
βLeader A βββββββββββββββββΊβ Storage ββββΊ REJECTED (stale term)
β(stale) β β (term=2) β
ββββββββββ βββββββββββ3. STONITH (Shoot The Other Node In The Head)
When split-brain detected:
1. Surviving partition sends "kill" signal to other partition
2. Uses out-of-band mechanism (IPMI, power management)
3. Guarantees the other side is truly dead before proceeding
Used in: Pacemaker/Corosync clusters, some database HA setups4. Witness/Tiebreaker Node
ββββββββββ ββββββββββ
β Node A β β³ β Node B β
β (data) β β (data) β
ββββββ¬ββββ ββββββ¬ββββ
β β
βββββββββ¬ββββββββββββ
β
ββββββββ΄βββββββ
β Witness β β Lightweight node, no data
β (tiebreaker)β Votes in elections only
βββββββββββββββSplit-Brain in Practice
| System | Prevention Mechanism |
|---|---|
| ZooKeeper | Majority quorum (2f+1 nodes) |
| etcd/Raft | Term numbers + majority |
| Redis Sentinel | Quorum of sentinels must agree |
| PostgreSQL Patroni | DCS (etcd/ZK) for leader lock |
| Kafka | ISR (In-Sync Replicas) + controller |
| Elasticsearch | Minimum master nodes setting |
Quick Reference: Theorem Application
Designing a system? Ask:
1. What consistency do we need?
- Financial: Linearizability (CP)
- Social feed: Eventual consistency (AP)
- Shopping cart: Causal consistency
2. What's our partition strategy?
- CP: Reject requests from minority partition
- AP: Serve all requests, reconcile later
3. How do we detect/handle failures?
- Heartbeats + timeouts for crash detection
- Quorum writes for durability
- Fencing tokens for stale leader prevention
4. How do we order events?
- Single leader: Leader assigns order
- Multi-leader: Vector clocks for conflict detection
- Global order needed: Consensus (Raft)Interview Cheat Sheet
| When interviewer asks... | Key points to mention |
|---|---|
| "How to handle network partitions?" | CAP tradeoff, quorum, fencing tokens |
| "How to ensure consistency?" | Consensus (Raft), linearizability, cost in latency |
| "What if two nodes disagree?" | Conflict resolution: LWW, vector clocks, CRDTs |
| "How to detect failures?" | Heartbeats, phi accrual detector, timeout tuning |
| "How to order events?" | Lamport clocks for partial order, vector clocks for causality |
| "What consistency model?" | Match to requirements: strong for money, eventual for feeds |
| "How does leader election work?" | Raft: term numbers, majority vote, election timeout |
| "What about split-brain?" | Majority quorum, fencing tokens, STONITH |