Design Instagram - System Architecture

High-Level Architecture Overview

Architecture Principles

Microservices Architecture: 150+ independent services
Event-Driven Design: Asynchronous processing via message queues
Media-First Design: Optimized for image and video delivery
Multi-Region Deployment: Active-active across 15+ regions globally
CDN-Heavy: 90%+ media served from CDN edge locations
Read-Optimized: Aggressive caching for read-heavy workload (500:1 ratio)

System Architecture Diagram

Instagram High-Level Architecture — Upload & Feed Paths

Core Services Architecture

1. Upload Service

Responsibilities:

Handle photo and video uploads
Generate pre-signed S3 URLs for direct upload
Validate file types and sizes
Trigger media processing pipeline
Store metadata in database

Upload Flow:

1. Client requests upload URL
2. Upload Service generates pre-signed S3 URL
3. Client uploads directly to S3
4. S3 triggers Lambda/SNS notification
5. Media Processing Service processes file
6. Metadata stored in database
7. CDN cache warmed
8. User notified of completion

Technology Stack:

Language: Go (high performance, concurrent)
Storage: S3 for media files
Database: PostgreSQL for metadata
Queue: SQS for processing jobs
Cache: Redis for upload status

2. Media Processing Service

Responsibilities:

Image compression and resizing
Video transcoding to multiple formats
Generate thumbnails
Apply filters and effects
Extract metadata (EXIF, dimensions)
Content moderation (NSFW detection)

Processing Pipeline:

Media Processing Pipeline

Image Processing:

Original: Store as-is (2MB average)
Feed: 1080×1080, 200KB (JPEG, quality 85)
Thumbnail: 150×150, 20KB
Profile: 320×320, 50KB
Format: JPEG for photos, MP4 for videos

Video Processing:

Original: Store as-is (20MB average)
1080p: H.264, 5MB
720p: H.264, 2MB
480p: H.264, 1MB
Thumbnail: First frame, 100KB
Format: MP4 with H.264 codec

3. Feed Service

Responsibilities:

Generate personalized home feed
Rank posts using ML algorithms
Handle pagination and infinite scroll
Merge different content types (posts, stories, ads)
Cache feed results

Feed Generation Strategy:

Hybrid Approach:
1. Regular Users (<10K followers):
   - Fan-out on write (push model)
   - Pre-compute feed when post is created
   - Store in Cassandra partitioned by user_id

2. Celebrity Users (>1M followers):
   - Fan-out on read (pull model)
   - Fetch posts on-demand
   - Merge with pre-computed feed
   - Cache aggressively

3. Feed Ranking:
   - Fetch candidate posts (1000+)
   - Score using ML model
   - Rank by score
   - Apply business rules (diversity, freshness)
   - Return top 20 posts

Ranking Algorithm:

Score = f(
  recency (time since post),
  engagement (likes, comments, shares),
  user affinity (interaction history with author),
  content type (photo, video, carousel),
  media quality (resolution, clarity),
  hashtag relevance,
  location relevance
)

ML Model:
- Input: User features + Post features
- Output: Engagement probability (0-1)
- Training: Offline on historical data
- Serving: Real-time feature computation

4. Story Service

Responsibilities:

Handle story uploads and viewing
Manage 24-hour expiration
Track story views and analytics
Handle story replies
Manage story highlights

Story Architecture:

Story Upload:
1. Upload to S3 with 24h lifecycle policy
2. Store metadata in Redis (TTL: 24h)
3. Publish event to Kafka
4. Fan-out to followers (push notification)

Story Viewing:
1. Fetch active stories from Redis
2. If miss, query database
3. Check if already viewed
4. Increment view count
5. Mark as viewed for user
6. Return story with analytics

Story Expiration:
1. S3 lifecycle policy deletes after 24h
2. Redis TTL expires keys
3. Background job cleans database
4. CDN cache invalidation

5. Social Graph Service

Responsibilities:

Manage follow/unfollow relationships
Get followers and following lists
Suggest users to follow
Handle private account approvals
Manage blocked and muted users

Graph Storage:

Following Relationship (Cassandra):
CREATE TABLE following (
  user_id BIGINT,
  following_id BIGINT,
  followed_at TIMESTAMP,
  PRIMARY KEY (user_id, following_id)
);

Followers Relationship (Cassandra):
CREATE TABLE followers (
  user_id BIGINT,
  follower_id BIGINT,
  followed_at TIMESTAMP,
  PRIMARY KEY (user_id, follower_id)
);

Cache (Redis):
Key: followers:{user_id}
Value: Set of follower_ids
TTL: 1 hour

Key: following:{user_id}
Value: Set of following_ids
TTL: 1 hour

6. Engagement Service

Responsibilities:

Handle likes, comments, shares
Update engagement counts
Manage comment threads
Handle mentions and tags
Send engagement notifications

Engagement Flow:

Like Post:
1. Client sends like request
2. Engagement Service validates
3. Write to database (async)
4. Update like count (eventual consistency)
5. Publish event to Kafka
6. Notification Service sends notification
7. Update cache
8. Return success to client

Comment on Post:
1. Client sends comment
2. Validate content (spam, profanity)
3. Store comment in database
4. Update comment count
5. Publish event to Kafka
6. Notification Service notifies post author
7. Return comment to client

7. Search and Discovery Service

Responsibilities:

Index posts, users, hashtags, locations
Full-text search
Autocomplete suggestions
Trending hashtags
Explore page recommendations

Search Architecture:

Indexing Pipeline:
Post Created → Kafka → Indexing Worker → Elasticsearch

Search Query:
Client → Search Service → Elasticsearch → Results

Index Structure:
- Posts Index: 60TB (recent 6 months)
- Users Index: 20TB
- Hashtags Index: 10TB
- Locations Index: 10TB

Search Features:
- Full-text search
- Fuzzy matching
- Autocomplete (prefix search)
- Filters (date, location, user)
- Ranking by relevance and engagement

8. Notification Service

Responsibilities:

Send push notifications
In-app notifications
Email notifications
Notification preferences
Notification batching

Notification Types:

Like on post
Comment on post
New follower
Mention in post/comment
Story reply
Live video from followed user

Notification Pipeline:

Event → Kafka → Notification Worker → Filter by Preferences →
Push Service (APNs/FCM) → Device

Batching:
- Group similar notifications (10 likes → "X and 9 others liked")
- Delay non-urgent notifications (5 minutes)
- Send digest for high-volume users

9. Recommendation Service

Responsibilities:

Suggest users to follow
Recommend posts for Explore page
Suggest hashtags
Recommend locations
Personalized content discovery

Recommendation Algorithm:

Collaborative Filtering:
- Find similar users based on follow graph
- Recommend users followed by similar users

Content-Based Filtering:
- Analyze user's liked posts
- Recommend similar content

Hybrid Approach:
- Combine collaborative and content-based
- Use ML model to weight signals
- Apply business rules (diversity, freshness)

Features:
- User engagement history
- Follow graph
- Content preferences (hashtags, locations)
- Demographic information
- Time of day, day of week

Data Flow Diagrams

Photo Upload Flow

1. User selects photo in app
2. App requests upload URL from Upload Service
3. Upload Service generates pre-signed S3 URL
4. App uploads photo directly to S3
5. S3 triggers SNS notification
6. Media Processing Service picks up job
7. Process: Resize, compress, generate thumbnails
8. Store processed images in S3
9. Update metadata in database
10. Warm CDN cache
11. Publish event to Kafka
12. Fan-out Service distributes to followers
13. Notification Service sends notifications
14. Return success to user

Feed Fetch Flow

1. User opens app and requests feed
2. API Gateway authenticates request
3. Feed Service checks cache (Redis)
4. If cache hit: Return cached feed
5. If cache miss:
   a. Get list of users they follow
   b. Fetch recent posts from Timeline DB
   c. Merge celebrity posts (pull model)
   d. Fetch post metadata and media URLs
   e. Rank posts using ML model
   f. Apply business rules
   g. Cache result (TTL: 5 minutes)
6. Return feed to client
7. Client fetches media from CDN

Technology Stack

Programming Languages

Backend Services: Go (high performance), Python (ML)
Media Processing: C++ (image/video processing)
ML Models: Python (TensorFlow, PyTorch)
Scripts: Python (automation, data analysis)

Databases

Post Metadata: Cassandra (time-series, high write)
User Data: PostgreSQL (relational, ACID)
Social Graph: Redis + Cassandra
Feed Storage: Cassandra (timeline data)
Analytics: ClickHouse (OLAP)

Caching

Application Cache: Redis Cluster (50TB)
CDN: CloudFront / Cloudflare (90% hit rate)
Browser Cache: HTTP caching headers

Message Queue

Event Streaming: Apache Kafka (high throughput)
Task Queue: Amazon SQS (job processing)
Pub/Sub: Redis Pub/Sub (real-time updates)

Storage

Object Storage: Amazon S3 (media files)
Block Storage: EBS (database volumes)
Archival: S3 Glacier (old content)

Search

Full-text Search: Elasticsearch (distributed)
Autocomplete: Redis (sorted sets)

ML Infrastructure

Training: TensorFlow on GPU clusters
Serving: TensorFlow Serving
Feature Store: Feast
Experiment Tracking: MLflow

Monitoring

Metrics: Prometheus + Grafana
Logging: ELK Stack
Tracing: Jaeger
Alerting: PagerDuty

Infrastructure

Container Orchestration: Kubernetes
Service Mesh: Istio
Load Balancing: NGINX, HAProxy
DNS: Route 53 (GeoDNS)

This architecture provides a scalable, reliable, and performant foundation for building an Instagram-like platform capable of handling billions of photos and videos with real-time delivery and high availability.