Back-of-Envelope Estimation: Complete Guide for System Design

Overview

Every system design interview expects you to ground your design in real numbers. Back-of-envelope estimation demonstrates that you can reason quantitatively about scale, identify bottlenecks before they happen, and make informed capacity decisions.

The goal isn't precision — it's demonstrating structured thinking and arriving at the right order of magnitude.

1. Powers of 2 Reference Table

Memorize these. They come up constantly.

Power	Exact Value	Approximate	Common Name
2^10	1,024	~1 Thousand	1 KB
2^20	1,048,576	~1 Million	1 MB
2^30	1,073,741,824	~1 Billion	1 GB
2^40	1,099,511,627,776	~1 Trillion	1 TB
2^50	~1.13 × 10^15	~1 Quadrillion	1 PB
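
To see how far the decimal shortcuts drift from the exact binary values, the rows above can be compared programmatically. This is a minimal sketch; `approximation_error` is a hypothetical helper name, and it assumes the exponent is a multiple of 10.

```python
def approximation_error(power_of_two: int) -> float:
    """Relative gap between 2^n and its decimal stand-in 10^(3n/10).

    Hypothetical helper for illustration; n must be a multiple of 10.
    """
    exact = 2 ** power_of_two
    approx = 10 ** (3 * power_of_two // 10)
    return (exact - approx) / approx


for n in (10, 20, 30, 40, 50):
    print(f"2^{n} = {2 ** n:,} is {approximation_error(n):.1%} above 10^{3 * n // 10}")
```

The gap grows with each row but stays under 13% through 2^50, which is well inside order-of-magnitude tolerance.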

Quick Conversions

1 KB = 1,000 bytes (for estimation, use 10^3)
1 MB = 1,000 KB = 10^6 bytes
1 GB = 1,000 MB = 10^9 bytes
1 TB = 1,000 GB = 10^12 bytes
1 PB = 1,000 TB = 10^15 bytes

Time:
1 day = 86,400 seconds ≈ 10^5 seconds (use 100K)
1 month = 2,592,000 seconds (30 days) ≈ 2.5 × 10^6 seconds (use 2.5M)
1 year = 31,536,000 seconds ≈ 3 × 10^7 seconds (use 30M)

Useful Multipliers

For QPS calculations, use:
  Daily requests / 100,000 ≈ average QPS (about 14% below the exact ÷ 86,400 figure)
  Peak QPS ≈ 2-5× average QPS

2. Latency Numbers Every Programmer Should Know

The Complete Table (2024 Numbers)

Operation	Latency	Notes
L1 cache reference	1 ns
Branch mispredict	3 ns
L2 cache reference	4 ns
L3 cache reference	12 ns
Mutex lock/unlock	17 ns
Main memory reference	100 ns
Compress 1KB (Snappy)	3,000 ns (3 μs)
Read 1MB sequentially (memory)	3,000 ns (3 μs)
SSD random read	16,000 ns (16 μs)
Read 1MB sequentially (SSD)	49,000 ns (49 μs)
Round trip same datacenter	500,000 ns (500 μs)	0.5 ms
Read 1MB sequentially (HDD)	825,000 ns (825 μs)
Redis GET (same DC)	500,000-1,000,000 ns	0.5-1 ms
TCP handshake (same region)	0.5-1 ms
Database query (indexed)	1-5 ms
Disk seek (HDD)	2,000,000 ns (2 ms)
TLS handshake	2-10 ms	Depends on version
TCP handshake (cross-region)	50-150 ms
Send packet CA→Netherlands→CA	150,000,000 ns	150 ms

Visual Scale

1 ns    ████ L1 cache
4 ns    ████████ L2 cache
100 ns  ████████████████████████████████████████ RAM

                    ↓ ~160x gap (ns → μs) ↓

16 μs   ████ SSD random read
49 μs   ████████████ SSD sequential 1MB
500 μs  ████████████████████████████████████████████████████ Network (same DC)

                    ↓ ~4x gap (μs → ms) ↓

2 ms    ████ HDD seek
5 ms    ████████████ DB query
150 ms  ████████████████████████████████████████████████████ Cross-continent

Key Takeaways for Design

Memory is ~100x faster than SSD → Cache aggressively
An SSD random read is ~30x faster than a same-DC round trip → Minimize network hops
Same-DC network is 300x faster than cross-continent → Co-locate services
Sequential reads can be ~50x faster than random reads, especially on HDD → Design for sequential access
Compression is cheap → Compress before sending over network
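
The ratios behind these takeaways can be recomputed directly from the latency table. This is a minimal sketch: the dictionary keys and the `slowdown` helper are illustrative names, while the values are the table's own.

```python
# Latencies in nanoseconds, copied from the latency table.
LATENCY_NS = {
    "main_memory_reference": 100,
    "ssd_random_read": 16_000,
    "same_dc_round_trip": 500_000,
    "cross_continent_round_trip": 150_000_000,
}


def slowdown(fast: str, slow: str) -> float:
    """How many times slower `slow` is than `fast`."""
    return LATENCY_NS[slow] / LATENCY_NS[fast]


pairs = [
    ("main_memory_reference", "ssd_random_read"),
    ("ssd_random_read", "same_dc_round_trip"),
    ("same_dc_round_trip", "cross_continent_round_trip"),
]
for fast, slow in pairs:
    print(f"{fast} -> {slow}: {slowdown(fast, slow):.0f}x")
```

Keeping the raw numbers in one place makes it easy to re-derive a ratio when an interviewer questions it, rather than relying on a memorized multiplier.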

3. How to Estimate QPS from DAU

Framework

Given: DAU (Daily Active Users)

Step 1: Estimate actions per user per day
Step 2: Calculate daily requests
Step 3: Convert to QPS (÷ 86,400 ≈ ÷ 100,000)
Step 4: Estimate peak (2-5× average)

Formula:
  Average QPS = (DAU × actions_per_user) / 86,400
  Peak QPS = Average QPS × peak_factor (typically 2-5×)
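
The formula translates into a few lines of code. The sketch below assumes a default peak factor of 3 (any value in the 2-5× range works) and exposes the divisor so the exact 86,400 and the rounded 100,000 can be compared; `estimate_qps` is an illustrative name.

```python
SECONDS_PER_DAY = 86_400


def estimate_qps(dau: float, actions_per_user: float,
                 peak_factor: float = 3.0,
                 seconds_per_day: float = SECONDS_PER_DAY) -> tuple[float, float]:
    """Return (average_qps, peak_qps) for a given daily active user count."""
    average = dau * actions_per_user / seconds_per_day
    return average, average * peak_factor


# Exact divisor vs. the 100K mental-math shortcut, for 300M DAU x 10 actions.
exact_avg, exact_peak = estimate_qps(300e6, 10)
rough_avg, rough_peak = estimate_qps(300e6, 10, seconds_per_day=100_000)
print(f"exact: {exact_avg:,.0f} avg, {exact_peak:,.0f} peak")
print(f"rough: {rough_avg:,.0f} avg, {rough_peak:,.0f} peak")
```

Both results sit in the same order of magnitude, which is the point of the shortcut. The rounded divisor gives the lower figure, so it is worth leaning toward a larger peak factor when using it.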

Common Ratios

Platform Type	Actions/User/Day	Read:Write Ratio
Social media (read-heavy)	20-50 reads, 1-2 writes	100:1
Messaging	20-40 messages sent	1:1 (send/receive)
E-commerce	5-10 page views, 0.1 purchases	100:1
Search engine	5-10 searches	Read-only
Video streaming	2-5 videos watched	1000:1 (views:uploads)
File storage	3-5 reads, 1-2 writes	5:1

Worked Example: Twitter-like Service

Given: 300M DAU

Tweets:
  - Average user posts 0.5 tweets/day
  - Write QPS = 300M × 0.5 / 100K = 1,500 QPS
  - Peak write QPS = 1,500 × 3 = 4,500 QPS

Timeline reads:
  - Average user reads timeline 10 times/day
  - Read QPS = 300M × 10 / 100K = 30,000 QPS
  - Peak read QPS = 30,000 × 3 = 90,000 QPS

Fanout:
  - Average 200 followers per user
  - Fanout writes = 1,500 × 200 = 300,000 writes/sec to timelines

4. Storage Estimation Framework

Formula

Total Storage = per_item_size × items_per_user_per_day × total_users × retention_days

Growth rate = new_items_per_day × per_item_size
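
As a sketch of the storage formula, the function below collapses the per-user item rate and the user count into a single `items_per_day` input and adds an optional replication factor. Both function names and the decimal-unit formatter are illustrative assumptions, not part of any library.

```python
def storage_bytes(item_size_bytes: float, items_per_day: float,
                  retention_days: float, replication_factor: int = 1) -> float:
    """Raw bytes accumulated over the retention window, times replica count."""
    return item_size_bytes * items_per_day * retention_days * replication_factor


def human(n_bytes: float) -> str:
    """Format bytes using decimal units (1 KB = 10^3 bytes)."""
    for unit in ("B", "KB", "MB", "GB", "TB", "PB", "EB"):
        if n_bytes < 1000 or unit == "EB":
            return f"{n_bytes:.1f} {unit}"
        n_bytes /= 1000


# Illustrative inputs: 50M items/day at 2.2 MB each.
print("per day: ", human(storage_bytes(2.2e6, 50e6, 1)))
print("per year:", human(storage_bytes(2.2e6, 50e6, 365)))
print("per year, 3 replicas:", human(storage_bytes(2.2e6, 50e6, 365, 3)))
```

Passing the replication factor explicitly keeps raw and provisioned storage separate, which makes it harder to forget one of them.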

Common Data Sizes

Data Type	Typical Size
User ID (UUID)	16 bytes
Timestamp	8 bytes
Short text (tweet)	280 bytes (ASCII; up to ~1 KB in UTF-8)
Email/message	5-50 KB
User profile (JSON)	1-5 KB
Thumbnail image	10-50 KB
Standard photo	200 KB - 2 MB
High-res photo	2-10 MB
1 min video (720p)	10-20 MB
1 min video (1080p)	30-60 MB
1 min video (4K)	100-300 MB
1 hour video (1080p)	2-4 GB
Database row (typical)	200 bytes - 2 KB
Log entry	200-500 bytes

Worked Example: Instagram-like Service

Given: 500M DAU, 2B total users

Photos:
  - 10% of DAU upload daily = 50M photos/day
  - Average photo size: 2MB (original) + 200KB (thumbnails) ≈ 2.2MB
  - Daily storage: 50M × 2.2MB = 110TB/day
  - Annual storage: 110TB × 365 = ~40PB/year

Metadata:
  - Per photo: user_id(8B) + timestamp(8B) + location(16B) + 
    caption(500B) + tags(200B) ≈ 1KB
  - Daily metadata: 50M × 1KB = 50GB/day
  - Annual metadata: 50GB × 365 = ~18TB/year

Total after 5 years:
  - Photos: ~200PB
  - Metadata: ~90TB

5. Bandwidth Estimation

Formula

Bandwidth = QPS × average_response_size

Ingress (incoming) = write_QPS × request_payload_size
Egress (outgoing) = read_QPS × response_payload_size
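
A common slip in bandwidth math is mixing bytes and bits. The sketch below keeps payloads in bytes and results in bits per second; the function names and the `cdn_hit_ratio` parameter are illustrative assumptions.

```python
def bandwidth_bps(qps: float, payload_bytes: float) -> float:
    """Bits per second needed to move `payload_bytes` per request at `qps`."""
    return qps * payload_bytes * 8


def streaming_egress_bps(concurrent_viewers: float, bitrate_bps: float,
                         cdn_hit_ratio: float = 0.0) -> float:
    """Origin egress for video: viewers x bitrate, minus what the CDN absorbs."""
    return concurrent_viewers * bitrate_bps * (1.0 - cdn_hit_ratio)


# Illustrative: 1,000 reads/sec of 50 KB responses.
print(f"API egress: {bandwidth_bps(1_000, 50_000) / 1e6:.0f} Mbps")

# Illustrative: 10M concurrent viewers at 5 Mbps, with and without a CDN.
print(f"No CDN:  {streaming_egress_bps(10e6, 5e6) / 1e12:.0f} Tbps")
print(f"90% CDN: {streaming_egress_bps(10e6, 5e6, 0.9) / 1e12:.0f} Tbps")
```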

Worked Example: Video Streaming Service

Given: 100M DAU, average 1 hour of video/day

Streaming bandwidth:
  - Concurrent viewers (assume 10% at peak): 10M
  - Bitrate: 5 Mbps (1080p adaptive)
  - Peak egress: 10M × 5 Mbps = 50 Tbps (50 terabits/sec)
  
  This is why CDNs are essential!
  With CDN (90% cache hit): Origin serves 5 Tbps

Upload bandwidth:
  - 500K videos uploaded/day
  - Average video: 500MB
  - Upload bandwidth: 500K × 500MB / 86,400s = ~3 GB/s = 24 Gbps

Bandwidth Cost Reference

AWS Data Transfer (2024 approximate):
  - Inbound: Free
  - Outbound (first 10TB): $0.09/GB
  - Outbound (next 40TB): $0.085/GB
  - Outbound (100TB+): $0.07/GB
  - CloudFront: $0.02-0.085/GB (cheaper than direct)
  - Same-region transfer: $0.01/GB
  - Cross-region: $0.02/GB

Example: Serving 1PB/month outbound
  = 1,000,000 GB × $0.07 = $70,000/month
  With CloudFront: ~$50,000/month
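
Tiered pricing can be modeled as successive slices of traffic, each billed at its own rate. In the sketch below, `EGRESS_TIERS` is a hypothetical simplification of the list above (everything past the first 50 TB is billed at $0.07/GB), not an official price schedule.

```python
# (tier size in GB, price per GB); None means "all remaining traffic".
# Hypothetical simplification: everything past 50 TB is billed at $0.07/GB.
EGRESS_TIERS = [(10_000, 0.09), (40_000, 0.085), (None, 0.07)]


def egress_cost(total_gb: float, tiers=EGRESS_TIERS) -> float:
    """Monthly cost when each successive slice of traffic has its own rate."""
    remaining, cost = total_gb, 0.0
    for size_gb, price in tiers:
        slice_gb = remaining if size_gb is None else min(remaining, size_gb)
        cost += slice_gb * price
        remaining -= slice_gb
        if remaining <= 0:
            break
    return cost


print(f"${egress_cost(1_000_000):,.0f} per month for 1 PB")
```

At this volume the tiered total lands within about 1% of the flat $0.07/GB shortcut, so in an interview the flat rate is usually enough.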

6. Cost Estimation

Compute Costs (2024 Approximate)

Resource	Cost	Notes
AWS EC2 m5.xlarge (4 vCPU, 16GB)	$0.192/hr ≈ $140/month	General purpose
AWS EC2 c5.2xlarge (8 vCPU, 16GB)	$0.34/hr ≈ $245/month	Compute optimized
AWS EC2 r5.2xlarge (8 vCPU, 64GB)	$0.504/hr ≈ $363/month	Memory optimized
AWS Lambda	$0.20 per 1M requests + duration	Serverless
Kubernetes pod (1 vCPU, 2GB)	~$50-70/month	Managed K8s

Storage Costs

Storage Type	Cost/GB/Month	Use Case
S3 Standard	$0.023	Frequently accessed
S3 Infrequent Access	$0.0125	Monthly access
S3 Glacier	$0.004	Archival
EBS gp3 (SSD)	$0.08	Database volumes
EBS io2 (high IOPS)	$0.125 + IOPS cost	High-performance DB
RDS PostgreSQL (db.r5.xlarge)	~$500/month	Managed database
ElastiCache Redis (r5.large)	~$200/month	In-memory cache
DynamoDB (on-demand)	$1.25/M writes, $0.25/M reads	NoSQL

Quick Cost Estimation Template

Monthly cost estimate for a service with:
  - 10M DAU, 1000 QPS average, 3000 QPS peak

Compute (handle 3000 QPS):
  - Each server handles ~500 QPS
  - Need: 6 servers + 2 redundancy = 8 servers
  - 8 × m5.xlarge = 8 × $140 = $1,120/month

Database:
  - Primary + replica: 2 × db.r5.xlarge = $1,000/month
  - Storage (500GB): 500 × $0.08 = $40/month

Cache:
  - Redis cluster (3 nodes): 3 × $200 = $600/month

Storage (S3):
  - 10TB media: 10,000 × $0.023 = $230/month

CDN:
  - 50TB egress: ~$4,000/month

Load Balancer:
  - ALB: ~$50/month

Total: ~$7,040/month ≈ $85K/year
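
The template is easy to keep honest as a small script, so that changing one assumption (say, QPS per server) updates the total. This is a sketch using the unit prices quoted in this section; `servers_needed` and the `line_items` keys are illustrative names.

```python
import math


def servers_needed(peak_qps: float, qps_per_server: float, spares: int = 2) -> int:
    """Servers required at peak, plus a fixed number of spares."""
    return math.ceil(peak_qps / qps_per_server) + spares


# Unit prices (USD/month) taken from the tables in this section.
line_items = {
    "compute": servers_needed(3_000, 500) * 140,   # m5.xlarge
    "database": 2 * 500,                           # primary + replica
    "db_storage": 500 * 0.08,                      # 500 GB on gp3
    "cache": 3 * 200,                              # 3 Redis nodes
    "s3": 10_000 * 0.023,                          # 10 TB media
    "cdn": 4_000,                                  # 50 TB egress
    "load_balancer": 50,
}
monthly_total = sum(line_items.values())
for name, cost in line_items.items():
    print(f"{name:>14}: ${cost:>8,.0f}  ({cost / monthly_total:.0%})")
print(f"{'total':>14}: ${monthly_total:>8,.0f}/month, ${monthly_total * 12:,.0f}/year")
```

Printing each line item's share of the total shows at a glance which component dominates the bill.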

7. Common Estimation Examples

Example 1: Twitter/X

Scale:
  - 400M DAU
  - 500M tweets/day
  - Average tweet: 280 chars + metadata ≈ 500 bytes
  - Average user follows 200 accounts
  - Timeline: 200 tweets shown

QPS:
  - Tweet writes: 500M / 100K = 5,000 QPS
  - Timeline reads: 400M × 10 reads/day / 100K = 40,000 QPS
  - Peak: 5× = 200,000 read QPS

Storage (tweets only):
  - Daily: 500M × 500B = 250GB/day
  - Annual: 250GB × 365 = ~90TB/year
  - 5 years: ~450TB

Fanout:
  - 5,000 tweets/sec × 200 avg followers = 1M timeline writes/sec
  - Celebrity tweet (50M followers): single tweet → 50M writes

Example 2: YouTube

Scale:
  - 2B monthly users, 500M DAU
  - 500 hours of video uploaded per minute
  - Average video: 5 minutes, 50MB (compressed)

Upload:
  - 500 hours/min × 1,440 min/day = 720,000 hours/day
  - 720,000 × 60 min × 50MB/5min ≈ 430TB/day raw
  - With transcoding (5 resolutions, treated as 5× the raw size): 430TB × 5 ≈ 2.2PB/day

Storage:
  - Daily new content: ~2.2PB (all resolutions)
  - Annual: ~800PB (~0.8EB)
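
Because the upload rate mixes hours and minutes, the unit chain is easy to slip on. The sketch below makes each conversion explicit; `daily_upload_bytes` and its `renditions` parameter are illustrative names, and treating every rendition as full-size is a deliberate upper bound.

```python
MINUTES_PER_DAY = 24 * 60


def daily_upload_bytes(hours_uploaded_per_minute: float,
                       bytes_per_video_minute: float,
                       renditions: int = 1) -> float:
    """Bytes stored per day from an upload rate given in hours-per-minute."""
    video_minutes_per_day = hours_uploaded_per_minute * 60 * MINUTES_PER_DAY
    return video_minutes_per_day * bytes_per_video_minute * renditions


# 50 MB per 5-minute video is 10 MB per minute of video.
raw = daily_upload_bytes(500, 10e6)
all_renditions = daily_upload_bytes(500, 10e6, renditions=5)
print(f"raw: {raw / 1e12:.0f} TB/day")
print(f"5 renditions: {all_renditions / 1e15:.2f} PB/day")
print(f"annual: {all_renditions * 365 / 1e15:.0f} PB")
```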

Streaming:
  - 500M DAU × 40 min average watch time
  - Concurrent viewers (peak): ~50M
  - Bandwidth: 50M × 5Mbps = 250 Tbps peak
  - CDN handles 95%: Origin = 12.5 Tbps

QPS:
  - Video views: 500M × 8 videos/day / 100K = 40,000 QPS
  - Search: 500M × 3 searches/day / 100K = 15,000 QPS

Example 3: WhatsApp

Scale:
  - 2B users, 500M DAU
  - 100B messages/day

QPS:
  - Messages: 100B / 100K = 1,000,000 QPS (1M QPS!)
  - Peak: 3M QPS

Storage:
  - Average message: 100 bytes (text)
  - Daily text: 100B messages × 100 bytes = 10TB/day
  - Media messages (20% of total): 20B × 200KB avg = 4PB/day
  - 30-day retention: 120PB media

Connection management:
  - 500M concurrent connections (WebSocket/MQTT)
  - Each connection: ~10KB memory
  - Total memory for connections: 500M × 10KB = 5TB RAM
  - At 64GB per server: ~80 servers by memory alone
  - Actual: even with efficient protocols, ~2M connections per server is the practical limit
  - Need: 500M / 2M = ~250 connection servers
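
The connection estimate involves two separate limits, memory and a per-server connection cap, and the larger server count wins. This is a minimal sketch; the `connection_servers` helper and its dictionary keys are illustrative.

```python
import math


def connection_servers(concurrent: float, bytes_per_conn: float,
                       ram_per_server: float, max_conns_per_server: float) -> dict:
    """Server count under two limits: RAM and a per-server connection cap."""
    total_ram = concurrent * bytes_per_conn
    return {
        "total_ram_bytes": total_ram,
        "by_memory": math.ceil(total_ram / ram_per_server),
        "by_connection_cap": math.ceil(concurrent / max_conns_per_server),
    }


# 500M connections, 10 KB each, 64 GB servers, 2M connections per server.
plan = connection_servers(500e6, 10e3, 64e9, 2e6)
print(plan)
print("provision:", max(plan["by_memory"], plan["by_connection_cap"]))
```

Here the connection cap, not RAM, determines the fleet size, which is why the per-connection memory footprint matters less than it first appears.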

8. Rules of Thumb

The 80/20 Rule (Pareto Principle)

- 20% of users generate 80% of traffic
- 20% of data is accessed 80% of the time (cache this!)
- 20% of features handle 80% of use cases

Application:
  - Cache size: 20% of total data covers 80% of reads
  - Hot partition: 20% of keys get 80% of requests
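
Applied to cache sizing, the 80/20 rule gives a quick node count. The sketch below assumes a 20% hot fraction and uses this guide's Redis rule of thumb of 25 GB usable per instance as the default node size; `cache_plan` is a hypothetical helper.

```python
import math


def cache_plan(dataset_bytes: float, hot_fraction: float = 0.2,
               usable_bytes_per_node: float = 25e9) -> tuple[float, int]:
    """Return (cache_bytes, node_count) for caching the hot slice of a dataset."""
    cache_bytes = dataset_bytes * hot_fraction
    return cache_bytes, math.ceil(cache_bytes / usable_bytes_per_node)


# Illustrative: 1.2 TB of frequently read data.
size, nodes = cache_plan(1.2e12)
print(f"cache {size / 1e9:.0f} GB across {nodes} nodes")
```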

Read/Write Ratios

System Type	Read:Write	Implication
Social media	100:1 to 1000:1	Optimize for reads, cache heavily
Messaging	1:1	Balance read/write paths
Logging/Analytics	1:100	Optimize for writes (LSM-tree)
E-commerce catalog	100:1	Cache product pages
Financial trading	1:1	Low latency both directions

Server Capacity Rules of Thumb

Single server can handle:
  - Web server (Nginx): 10,000-100,000 concurrent connections
  - Application server: 500-2,000 QPS (depends on complexity)
  - Database (PostgreSQL): 5,000-20,000 QPS (simple queries)
  - Redis: 100,000-200,000 QPS
  - Kafka broker: 100,000-200,000 messages/sec

Memory:
  - Modern server: 64-512GB RAM
  - Redis: 25GB usable per instance (leave room for overhead)
  - JVM application: 4-32GB heap typical

Disk:
  - SSD IOPS: 10,000-100,000 (depends on drive)
  - SSD throughput: 500MB/s - 3GB/s
  - HDD IOPS: 100-200
  - HDD throughput: 100-200MB/s

Network Rules of Thumb

- 1 Gbps link: ~125 MB/s throughput
- 10 Gbps link: ~1.25 GB/s throughput
- Typical server NIC: 10-25 Gbps
- Cross-AZ latency: 1-2ms
- Cross-region latency: 50-200ms
- CDN edge to user: 5-30ms

Estimation Process (Interview Template)

Step 1: Clarify scale
  "How many users? DAU? Geographic distribution?"

Step 2: Estimate traffic
  - Actions per user per day
  - Calculate QPS (average and peak)
  - Identify read vs write ratio

Step 3: Estimate storage
  - Per-item size × items/day × retention
  - Separate hot (SSD/cache) from cold (HDD/S3)

Step 4: Estimate bandwidth
  - QPS × payload size
  - Identify if CDN is needed

Step 5: Estimate compute
  - QPS / per-server-capacity = number of servers
  - Add redundancy (N+2 or 3× for HA)

Step 6: Identify bottlenecks
  - Which resource hits limits first?
  - Where do we need to scale horizontally?
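
Step 6 can be made mechanical by computing how many servers each resource would demand on its own and taking the largest. This is a sketch under stated assumptions: 1,000 QPS per application server and a 10 Gbps NIC (1.25 GB/s), both drawn from the rules of thumb in this section; the function names are illustrative.

```python
import math


def servers_by_resource(peak_qps: float, response_bytes: float,
                        qps_per_server: float = 1_000,
                        nic_bytes_per_sec: float = 1.25e9) -> dict:
    """Servers needed if each resource were the only constraint."""
    return {
        "cpu": math.ceil(peak_qps / qps_per_server),
        "network": math.ceil(peak_qps * response_bytes / nic_bytes_per_sec),
    }


def bottleneck(requirements: dict) -> str:
    """The resource demanding the most servers is the first limit you hit."""
    return max(requirements, key=requirements.get)


small = servers_by_resource(90_000, 2_000)        # 2 KB JSON responses
large = servers_by_resource(90_000, 5_000_000)    # 5 MB media responses
print(small, "->", bottleneck(small))
print(large, "->", bottleneck(large))
```

The same traffic is CPU-bound with small payloads and network-bound with large ones, which is the cue to move media behind a CDN rather than add application servers.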

Interview Cheat Sheet

When interviewer asks...	Framework to use
"How many servers do we need?"	QPS / per-server-capacity + redundancy
"How much storage?"	per_item × items/day × retention
"Can a single database handle this?"	Compare QPS to DB limits (5-20K QPS)
"Do we need a cache?"	If read QPS > DB capacity, yes
"Do we need a CDN?"	If serving media to global users, yes
"What's the cost?"	Compute + storage + bandwidth + managed services
"How to handle peak traffic?"	Auto-scaling, over-provision 3×, queue overflow

Common Mistakes to Avoid

Forgetting peak vs average: design for peak, not average.
Ignoring metadata: indexes, replicas, and overhead add 2-3× to raw data size.
Forgetting redundancy: always multiply by the replication factor (typically 3×).
Over-precise numbers: round aggressively (86,400 ≈ 100,000) and don't waste time on arithmetic.
Not stating assumptions: always say "assuming X" so the interviewer can correct you.