πŸ”— Distributed

Distributed Systems Theory

πŸ“– 13 min read 🧠 Complete Guide

Distributed Systems Theory: Complete Guide for System Design

Overview

Distributed systems theory provides the foundational constraints and impossibility results that govern every large-scale system. Understanding these concepts lets you reason about what's achievable and make informed tradeoffs.


1. CAP Theorem

Statement

In a distributed system experiencing a network partition, you must choose between:

  • Consistency (C): Every read receives the most recent write or an error
  • Availability (A): Every request receives a non-error response (may be stale)
  • Partition Tolerance (P): System continues operating despite network partitions

Key insight: Partitions WILL happen in any distributed system. So the real choice is between C and A during a partition.

Normal operation: You can have C + A + P (no partition occurring)

During partition:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”         β•³         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Node A      β”‚    (partition)    β”‚  Node B      β”‚
β”‚  data: v2    β”‚         β•³         β”‚  data: v1    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

CP choice: Node B refuses reads (unavailable) until partition heals
AP choice: Node B serves stale v1 (inconsistent) but stays available

Real-World Examples

System Choice Behavior During Partition
ZooKeeper CP Minority partition becomes unavailable
etcd CP Raft leader in majority partition serves; minority rejects
Cassandra AP All nodes serve (potentially stale), reconcile later
DynamoDB AP (tunable) Serves with eventual consistency by default
MongoDB CP Primary in majority partition; secondaries reject writes
PostgreSQL (single) CA No partition tolerance (single node)
CockroachDB CP Ranges without majority become unavailable

Common Misconceptions

  1. "You pick 2 of 3" β€” Wrong. P is not optional. You always need P. The choice is CP or AP.
  2. "CAP applies all the time" β€” Wrong. Only during partitions. Normally you get all three.
  3. "A system is entirely CP or AP" β€” Wrong. Different operations can make different choices.
  4. "Consistency means ACID" β€” Wrong. CAP consistency = linearizability (single-copy semantics).

2. PACELC Theorem

Extending CAP

CAP only describes behavior during partitions. PACELC adds: what tradeoff do you make when there's NO partition?

IF Partition:
    Choose Availability or Consistency (same as CAP)
ELSE (normal operation):
    Choose Latency or Consistency
    
Format: PA/EL, PC/EC, PA/EC, PC/EL

System Classification

System During Partition Normal Operation Classification
DynamoDB Availability Latency PA/EL
Cassandra Availability Latency PA/EL
MongoDB Consistency Consistency PC/EC
CockroachDB Consistency Consistency PC/EC
PNUTS (Yahoo) Availability Consistency PA/EC
Cosmos DB Tunable Tunable Tunable

Interview insight: "Even when there's no partition, we still face a latency-consistency tradeoff. Synchronous replication gives consistency but adds latency. Async replication is fast but risks stale reads."


3. Consistency Models

Spectrum (Strongest β†’ Weakest)

Linearizability (strongest)
     β”‚
     β–Ό
Sequential Consistency
     β”‚
     β–Ό
Causal Consistency
     β”‚
     β–Ό
Eventual Consistency (weakest)

Linearizability

Definition: Operations appear to execute atomically at some point between invocation and response. Equivalent to a single-copy system.

Timeline (real time β†’):

Client A: |--write(x=1)--|
Client B:          |--read(x)--| must return 1
Client C:                   |--read(x)--| must return 1

Once ANY client sees x=1, ALL subsequent reads must see x=1

Provides: Real-time ordering. If op A completes before op B starts, A is ordered before B. Cost: Requires coordination (consensus), high latency Used by: ZooKeeper, etcd, Spanner (TrueTime)

Sequential Consistency

Definition: All operations appear in SOME total order consistent with each process's program order. But this order doesn't need to match real-time.

Client A: write(x=1), write(x=2)
Client B: read(x)=2, read(x)=1  ← INVALID (violates A's program order)
Client B: read(x)=1, read(x)=2  ← VALID

Client A: write(x=1)
Client B: write(x=2)
Client C: read(x)=1, read(x)=2  ← VALID
Client D: read(x)=2, read(x)=1  ← INVALID (must agree with C's order)

Difference from linearizability: No real-time constraint. Operations can appear reordered as long as per-process order is preserved.

Causal Consistency

Definition: Operations that are causally related are seen in the same order by all processes. Concurrent (unrelated) operations may be seen in different orders.

Causal relationship: A "happened before" B if:
1. A and B are in the same process, A before B
2. A is a send, B is the corresponding receive
3. Transitivity: A→B and B→C implies A→C

Example:
Client A: write(x=1)
Client B: read(x)=1, write(y=2)  ← y=2 causally depends on x=1
Client C: MUST see x=1 before y=2 (causal order)
          MAY see other unrelated writes in any order

Used by: MongoDB (causal consistency sessions), COPS

Eventual Consistency

Definition: If no new updates are made, all replicas will eventually converge to the same value. No ordering guarantees during convergence.

t=0: Client writes x=1 to Node A
t=1: Node A has x=1, Node B has x=0 (stale)
t=2: Node A has x=1, Node B has x=0 (still propagating)
t=5: Node A has x=1, Node B has x=1 (converged)

During t=1 to t=4: reads from Node B return stale data

Conflict resolution strategies:

  • Last-Writer-Wins (LWW): Timestamp-based, simple but loses data
  • Vector clocks: Detect conflicts, application resolves
  • CRDTs: Mathematically guaranteed convergence without coordination

Used by: DynamoDB, Cassandra, DNS, S3


4. Consensus Algorithms

Why Consensus?

Distributed systems need agreement on:

  • Who is the leader?
  • What is the committed log?
  • What is the current configuration?

Raft (Understandable Consensus)

Three roles: Leader, Follower, Candidate

Leader Election

Normal operation:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”  heartbeat  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  heartbeat  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Leader  │────────────►│ Follower  │◄────────────│ Follower  β”‚
β”‚ (term 1)β”‚            β”‚           β”‚             β”‚           β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜             β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜             β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Election timeout (no heartbeat received):
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  RequestVote  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Candidate │─────────────►│ Follower  β”‚
β”‚ (term 2)  │◄─────────────│ (grants)  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   VoteGranted β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
     β”‚
     β”‚ Majority votes received
     β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Leader  β”‚  (term 2, begins serving)
β”‚         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Election rules:

  1. Follower times out β†’ becomes Candidate, increments term
  2. Candidate votes for itself, requests votes from all others
  3. Each node votes for at most one candidate per term
  4. Candidate with majority becomes Leader
  5. If split vote β†’ timeout β†’ new election with higher term

Log Replication

Client request: "SET x=5"

Leader                    Followers
  β”‚                         β”‚
  β”‚ 1. Append to local log  β”‚
  β”‚                         β”‚
  │── AppendEntries ───────►│  2. Send to followers
  β”‚                         β”‚
  │◄── Success ─────────────│  3. Followers append
  β”‚                         β”‚
  β”‚ 4. Entry committed       β”‚  (majority acknowledged)
  β”‚    (apply to state)      β”‚
  β”‚                         β”‚
  │── AppendEntries ───────►│  5. Notify followers to commit
  β”‚   (commitIndex updated)  β”‚

Commit rule: Entry is committed when replicated to majority (⌊n/2βŒ‹ + 1 nodes).

Safety guarantee: If a log entry is committed, it will be present in the logs of all future leaders.

Raft in Practice

Aspect Typical Configuration
Cluster size 3, 5, or 7 nodes (odd for majority)
Election timeout 150-300ms (randomized)
Heartbeat interval 50-100ms
Tolerates failures (n-1)/2 nodes can fail
3-node cluster Tolerates 1 failure
5-node cluster Tolerates 2 failures

Paxos (Intuition)

Three roles: Proposer, Acceptor, Learner

Two phases:

  1. Prepare: Proposer asks acceptors "Will you accept my proposal n?"
  2. Accept: If majority promises, proposer sends actual value
Phase 1 (Prepare):
Proposer ──[Prepare(n)]──► Acceptors
Acceptors ──[Promise(n)]──► Proposer  (if n > any previous)

Phase 2 (Accept):
Proposer ──[Accept(n, value)]──► Acceptors
Acceptors ──[Accepted(n, value)]──► Learners

Why Raft over Paxos?

  • Raft: single leader, simpler to understand and implement
  • Paxos: more general, but notoriously difficult to implement correctly
  • Most production systems use Raft (etcd, CockroachDB, TiKV)

5. Clock Synchronization

The Problem

In distributed systems, there's no global clock. Each node has its own clock that drifts.

Lamport Clocks (Logical Clocks)

Rules:

  1. Before each event, increment local counter
  2. When sending message, attach current counter
  3. When receiving message, set counter = max(local, received) + 1
Node A:  1    2    3         6    7
         β”‚    β”‚    β”‚         β–²    β”‚
         β”‚    β”‚    β””β”€β”€β”€β”€β–Ίβ”€β”€β”€β”€β”˜    β”‚
         β”‚    β”‚         msg(3)    β”‚
Node B:  1    2    3    4    5    6
              β”‚              β–²
              β””β”€β”€β”€β”€β–Ίβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                   msg(2)
Node C:  1    2    3    4    5    6

Limitation: If L(a) < L(b), we CANNOT conclude a happened before b. Only the converse: if a→b then L(a) < L(b).

Vector Clocks

Each node maintains a vector of counters (one per node):

Node A: [A:1, B:0, C:0]
Node A sends to B: message carries [A:1, B:0, C:0]
Node B receives: B = max([A:0, B:1, C:0], [A:1, B:0, C:0]) + increment B
                 B = [A:1, B:2, C:0]

Comparison rules:

  • V1 < V2 if all components of V1 ≀ V2 and at least one is strictly less
  • V1 βˆ₯ V2 (concurrent) if neither V1 < V2 nor V2 < V1
[2,1,0] < [2,2,0]  β†’ causally ordered
[2,1,0] βˆ₯ [1,2,0]  β†’ concurrent (conflict!)

Used by: DynamoDB (simplified), Riak (for conflict detection)

Hybrid Logical Clocks (HLC)

Combines physical time with logical ordering:

HLC = (physical_time, logical_counter)

- Uses physical time when possible (for human-meaningful timestamps)
- Falls back to logical counter when physical clocks are close
- Bounded drift from physical time

Used by: CockroachDB, YugabyteDB

Google's TrueTime

TrueTime API:
  tt.now() β†’ returns [earliest, latest] interval
  
  Guarantee: actual time is within the interval
  Uncertainty: typically 1-7ms (GPS + atomic clocks)
  
  Spanner's rule: wait out uncertainty before committing
  β†’ Ensures linearizability with ~7ms commit latency

6. Failure Models

Hierarchy (Weakest β†’ Strongest Assumption)

Byzantine (arbitrary behavior)
     β”‚
     β–Ό
Omission (may fail to send/receive)
     β”‚
     β–Ό
Crash-Recovery (may crash and restart)
     β”‚
     β–Ό
Crash-Fail (crashes permanently)

Crash-Fail

  • Node stops and never recovers
  • Simplest model to reason about
  • Consensus needs: n β‰₯ 2f + 1 nodes to tolerate f failures
  • Example: Raft with 5 nodes tolerates 2 permanent failures

Crash-Recovery

  • Node may crash and restart with durable state
  • Must persist state before acknowledging
  • WAL ensures recovery is possible
  • Example: Database with WAL can recover after crash

Omission Failures

  • Node fails to send or receive messages
  • Network partitions are a form of omission failure
  • Harder to detect than crashes (is it slow or dead?)

Byzantine Failures

  • Node may behave arbitrarily (including maliciously)
  • May send conflicting messages to different nodes
  • Consensus needs: n β‰₯ 3f + 1 nodes to tolerate f Byzantine failures
  • Example: Blockchain consensus (untrusted participants)

Interview context: Most system design interviews assume crash-fail or crash-recovery. Byzantine tolerance is relevant for blockchain/multi-party systems.


7. FLP Impossibility Result

Statement

In an asynchronous distributed system with even ONE possible crash failure, there is NO deterministic algorithm that guarantees consensus.

Intuition

The problem: In an asynchronous system, you cannot distinguish between:
1. A crashed node
2. A very slow node

If you wait forever β†’ no liveness (system hangs)
If you decide without the slow node β†’ it might have voted differently

No timeout is "correct" because the system is asynchronous
(no bound on message delivery time)

Practical Implications

FLP doesn't mean consensus is impossible in practice. It means:

  • You cannot have ALL of: safety + liveness + fault tolerance in async systems
  • Practical systems sacrifice one partially:
    • Raft/Paxos: May temporarily lose liveness (during leader election)
    • Randomized algorithms: Probabilistic termination (always terminates with high probability)
    • Partial synchrony: Assume system is eventually synchronous (timeouts work eventually)

8. Split-Brain Problem

What Is Split-Brain?

Normal:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  [Leader] ←→ [Follower] ←→ [Follower] β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Network partition:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β•³    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  [Leader A]   β”‚         β”‚  [Leader B]   β”‚  ← TWO leaders!
β”‚  [Follower]   β”‚         β”‚  [Follower]   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Both sides accept writes β†’ data divergence β†’ SPLIT BRAIN

Solutions

1. Majority Quorum

5-node cluster, partition into [3] and [2]:

[Node1, Node2, Node3]  |  [Node4, Node5]
     Majority (3/5)    |    Minority (2/5)
     Keeps leader      |    Steps down
     Accepts writes    |    Rejects writes

Rule: Only the partition with majority (>n/2) can elect a leader and accept writes.

2. Fencing Tokens

Epoch/Term numbers prevent stale leaders:

Leader A (term=1) gets partitioned
Leader B (term=2) elected by majority

Leader A recovers, tries to write with term=1
Storage layer rejects: term=1 < current_term=2

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”  write(term=1)  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚Leader A │───────────────►│ Storage  │──► REJECTED (stale term)
β”‚(stale)  β”‚                β”‚ (term=2) β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜                 β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

3. STONITH (Shoot The Other Node In The Head)

When split-brain detected:
1. Surviving partition sends "kill" signal to other partition
2. Uses out-of-band mechanism (IPMI, power management)
3. Guarantees the other side is truly dead before proceeding

Used in: Pacemaker/Corosync clusters, some database HA setups

4. Witness/Tiebreaker Node

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Node A  β”‚    β•³    β”‚ Node B  β”‚
β”‚ (data)  β”‚         β”‚ (data)  β”‚
β””β”€β”€β”€β”€β”¬β”€β”€β”€β”˜         β””β”€β”€β”€β”€β”¬β”€β”€β”€β”˜
     β”‚                   β”‚
     β””β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
             β”‚
      β”Œβ”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”
      β”‚   Witness    β”‚  ← Lightweight node, no data
      β”‚  (tiebreaker)β”‚     Votes in elections only
      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Split-Brain in Practice

System Prevention Mechanism
ZooKeeper Majority quorum (2f+1 nodes)
etcd/Raft Term numbers + majority
Redis Sentinel Quorum of sentinels must agree
PostgreSQL Patroni DCS (etcd/ZK) for leader lock
Kafka ISR (In-Sync Replicas) + controller
Elasticsearch Minimum master nodes setting

Quick Reference: Theorem Application

Designing a system? Ask:

1. What consistency do we need?
   - Financial: Linearizability (CP)
   - Social feed: Eventual consistency (AP)
   - Shopping cart: Causal consistency

2. What's our partition strategy?
   - CP: Reject requests from minority partition
   - AP: Serve all requests, reconcile later

3. How do we detect/handle failures?
   - Heartbeats + timeouts for crash detection
   - Quorum writes for durability
   - Fencing tokens for stale leader prevention

4. How do we order events?
   - Single leader: Leader assigns order
   - Multi-leader: Vector clocks for conflict detection
   - Global order needed: Consensus (Raft)

Interview Cheat Sheet

When interviewer asks... Key points to mention
"How to handle network partitions?" CAP tradeoff, quorum, fencing tokens
"How to ensure consistency?" Consensus (Raft), linearizability, cost in latency
"What if two nodes disagree?" Conflict resolution: LWW, vector clocks, CRDTs
"How to detect failures?" Heartbeats, phi accrual detector, timeout tuning
"How to order events?" Lamport clocks for partial order, vector clocks for causality
"What consistency model?" Match to requirements: strong for money, eventual for feeds
"How does leader election work?" Raft: term numbers, majority vote, election timeout
"What about split-brain?" Majority quorum, fencing tokens, STONITH