Design Instagram - System Architecture
High-Level Architecture Overview
Architecture Principles
- Microservices Architecture: 150+ independent services
- Event-Driven Design: Asynchronous processing via message queues
- Media-First Design: Optimized for image and video delivery
- Multi-Region Deployment: Active-active across 15+ regions globally
- CDN-Heavy: 90%+ media served from CDN edge locations
- Read-Optimized: Aggressive caching for read-heavy workload (500:1 ratio)
System Architecture Diagram
┌─────────────────────────────────────────────────────────────────┐
│ Client Layer │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ iOS │ │ Android │ │ Web │ │
│ │ App │ │ App │ │ Browser │ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
└───────┼─────────────┼─────────────┼──────────────────────────────┘
│ │ │
└─────────────┴─────────────┘
│
┌─────────────┴─────────────┐
│ Global Load Balancer │
│ (GeoDNS + Anycast) │
└─────────────┬─────────────┘
│
┌─────────────┴─────────────┐
│ API Gateway │
│ (Auth, Rate Limit, SSL) │
└─────────────┬─────────────┘
│
┌─────────────┴──────────────────────────────┐
│ │
┌───────┴────────┐ ┌──────────────┴──────────┐
│ Upload Path │ │ Feed Path │
│ (Media Post) │ │ (Content Fetch) │
└───────┬────────┘ └──────────────┬──────────┘
│ │
┌───────┴────────┐ ┌──────────────┴──────────┐
│ Upload Service│ │ Feed Service │
└───────┬────────┘ └──────────────┬──────────┘
│ │
┌───────┴────────┐ ┌──────────────┴──────────┐
│ Media │ │ Feed Cache │
│ Processing │ │ (Redis) │
└───────┬────────┘ └──────────────┬──────────┘
│ │
┌───────┴────────┐ ┌──────────────┴──────────┐
│ S3 Storage │ │ Feed DB │
│ + CDN │ │ (Cassandra) │
└────────────────┘ └─────────────────────────┘Core Services Architecture
1. Upload Service
Responsibilities:
- Handle photo and video uploads
- Generate pre-signed S3 URLs for direct upload
- Validate file types and sizes
- Trigger media processing pipeline
- Store metadata in database
Upload Flow:
1. Client requests upload URL
2. Upload Service generates pre-signed S3 URL
3. Client uploads directly to S3
4. S3 triggers Lambda/SNS notification
5. Media Processing Service processes file
6. Metadata stored in database
7. CDN cache warmed
8. User notified of completionTechnology Stack:
- Language: Go (high performance, concurrent)
- Storage: S3 for media files
- Database: PostgreSQL for metadata
- Queue: SQS for processing jobs
- Cache: Redis for upload status
2. Media Processing Service
Responsibilities:
- Image compression and resizing
- Video transcoding to multiple formats
- Generate thumbnails
- Apply filters and effects
- Extract metadata (EXIF, dimensions)
- Content moderation (NSFW detection)
Processing Pipeline:
┌─────────────────┐
│ S3 Upload │
└────────┬────────┘
│
┌────────┴────────┐
│ SQS Queue │
└────────┬────────┘
│
┌────────┴────────────────────────┐
│ Processing Workers (GPU) │
│ - Image: Resize, Compress │
│ - Video: Transcode, Thumbnail │
│ - ML: NSFW, Face Detection │
└────────┬────────────────────────┘
│
┌────────┴────────┐
│ Store Processed│
│ Media in S3 │
└────────┬────────┘
│
┌────────┴────────┐
│ Update Metadata│
│ in Database │
└────────┬────────┘
│
┌────────┴────────┐
│ Warm CDN Cache │
└─────────────────┘Image Processing:
- Original: Store as-is (2MB average)
- Feed: 1080×1080, 200KB (JPEG, quality 85)
- Thumbnail: 150×150, 20KB
- Profile: 320×320, 50KB
- Format: JPEG for photos, MP4 for videos
Video Processing:
- Original: Store as-is (20MB average)
- 1080p: H.264, 5MB
- 720p: H.264, 2MB
- 480p: H.264, 1MB
- Thumbnail: First frame, 100KB
- Format: MP4 with H.264 codec
3. Feed Service
Responsibilities:
- Generate personalized home feed
- Rank posts using ML algorithms
- Handle pagination and infinite scroll
- Merge different content types (posts, stories, ads)
- Cache feed results
Feed Generation Strategy:
Hybrid Approach:
1. Regular Users (<10K followers):
- Fan-out on write (push model)
- Pre-compute feed when post is created
- Store in Cassandra partitioned by user_id
2. Celebrity Users (>1M followers):
- Fan-out on read (pull model)
- Fetch posts on-demand
- Merge with pre-computed feed
- Cache aggressively
3. Feed Ranking:
- Fetch candidate posts (1000+)
- Score using ML model
- Rank by score
- Apply business rules (diversity, freshness)
- Return top 20 postsRanking Algorithm:
Score = f(
recency (time since post),
engagement (likes, comments, shares),
user affinity (interaction history with author),
content type (photo, video, carousel),
media quality (resolution, clarity),
hashtag relevance,
location relevance
)
ML Model:
- Input: User features + Post features
- Output: Engagement probability (0-1)
- Training: Offline on historical data
- Serving: Real-time feature computation4. Story Service
Responsibilities:
- Handle story uploads and viewing
- Manage 24-hour expiration
- Track story views and analytics
- Handle story replies
- Manage story highlights
Story Architecture:
Story Upload:
1. Upload to S3 with 24h lifecycle policy
2. Store metadata in Redis (TTL: 24h)
3. Publish event to Kafka
4. Fan-out to followers (push notification)
Story Viewing:
1. Fetch active stories from Redis
2. If miss, query database
3. Check if already viewed
4. Increment view count
5. Mark as viewed for user
6. Return story with analytics
Story Expiration:
1. S3 lifecycle policy deletes after 24h
2. Redis TTL expires keys
3. Background job cleans database
4. CDN cache invalidation5. Social Graph Service
Responsibilities:
- Manage follow/unfollow relationships
- Get followers and following lists
- Suggest users to follow
- Handle private account approvals
- Manage blocked and muted users
Graph Storage:
Following Relationship (Cassandra):
CREATE TABLE following (
user_id BIGINT,
following_id BIGINT,
followed_at TIMESTAMP,
PRIMARY KEY (user_id, following_id)
);
Followers Relationship (Cassandra):
CREATE TABLE followers (
user_id BIGINT,
follower_id BIGINT,
followed_at TIMESTAMP,
PRIMARY KEY (user_id, follower_id)
);
Cache (Redis):
Key: followers:{user_id}
Value: Set of follower_ids
TTL: 1 hour
Key: following:{user_id}
Value: Set of following_ids
TTL: 1 hour6. Engagement Service
Responsibilities:
- Handle likes, comments, shares
- Update engagement counts
- Manage comment threads
- Handle mentions and tags
- Send engagement notifications
Engagement Flow:
Like Post:
1. Client sends like request
2. Engagement Service validates
3. Write to database (async)
4. Update like count (eventual consistency)
5. Publish event to Kafka
6. Notification Service sends notification
7. Update cache
8. Return success to client
Comment on Post:
1. Client sends comment
2. Validate content (spam, profanity)
3. Store comment in database
4. Update comment count
5. Publish event to Kafka
6. Notification Service notifies post author
7. Return comment to client7. Search and Discovery Service
Responsibilities:
- Index posts, users, hashtags, locations
- Full-text search
- Autocomplete suggestions
- Trending hashtags
- Explore page recommendations
Search Architecture:
Indexing Pipeline:
Post Created → Kafka → Indexing Worker → Elasticsearch
Search Query:
Client → Search Service → Elasticsearch → Results
Index Structure:
- Posts Index: 60TB (recent 6 months)
- Users Index: 20TB
- Hashtags Index: 10TB
- Locations Index: 10TB
Search Features:
- Full-text search
- Fuzzy matching
- Autocomplete (prefix search)
- Filters (date, location, user)
- Ranking by relevance and engagement8. Notification Service
Responsibilities:
- Send push notifications
- In-app notifications
- Email notifications
- Notification preferences
- Notification batching
Notification Types:
- Like on post
- Comment on post
- New follower
- Mention in post/comment
- Story reply
- Live video from followed user
Notification Pipeline:
Event → Kafka → Notification Worker → Filter by Preferences →
Push Service (APNs/FCM) → Device
Batching:
- Group similar notifications (10 likes → "X and 9 others liked")
- Delay non-urgent notifications (5 minutes)
- Send digest for high-volume users9. Recommendation Service
Responsibilities:
- Suggest users to follow
- Recommend posts for Explore page
- Suggest hashtags
- Recommend locations
- Personalized content discovery
Recommendation Algorithm:
Collaborative Filtering:
- Find similar users based on follow graph
- Recommend users followed by similar users
Content-Based Filtering:
- Analyze user's liked posts
- Recommend similar content
Hybrid Approach:
- Combine collaborative and content-based
- Use ML model to weight signals
- Apply business rules (diversity, freshness)
Features:
- User engagement history
- Follow graph
- Content preferences (hashtags, locations)
- Demographic information
- Time of day, day of weekData Flow Diagrams
Photo Upload Flow
1. User selects photo in app
2. App requests upload URL from Upload Service
3. Upload Service generates pre-signed S3 URL
4. App uploads photo directly to S3
5. S3 triggers SNS notification
6. Media Processing Service picks up job
7. Process: Resize, compress, generate thumbnails
8. Store processed images in S3
9. Update metadata in database
10. Warm CDN cache
11. Publish event to Kafka
12. Fan-out Service distributes to followers
13. Notification Service sends notifications
14. Return success to userFeed Fetch Flow
1. User opens app and requests feed
2. API Gateway authenticates request
3. Feed Service checks cache (Redis)
4. If cache hit: Return cached feed
5. If cache miss:
a. Get list of users they follow
b. Fetch recent posts from Timeline DB
c. Merge celebrity posts (pull model)
d. Fetch post metadata and media URLs
e. Rank posts using ML model
f. Apply business rules
g. Cache result (TTL: 5 minutes)
6. Return feed to client
7. Client fetches media from CDNTechnology Stack
Programming Languages
- Backend Services: Go (high performance), Python (ML)
- Media Processing: C++ (image/video processing)
- ML Models: Python (TensorFlow, PyTorch)
- Scripts: Python (automation, data analysis)
Databases
- Post Metadata: Cassandra (time-series, high write)
- User Data: PostgreSQL (relational, ACID)
- Social Graph: Redis + Cassandra
- Feed Storage: Cassandra (timeline data)
- Analytics: ClickHouse (OLAP)
Caching
- Application Cache: Redis Cluster (50TB)
- CDN: CloudFront / Cloudflare (90% hit rate)
- Browser Cache: HTTP caching headers
Message Queue
- Event Streaming: Apache Kafka (high throughput)
- Task Queue: Amazon SQS (job processing)
- Pub/Sub: Redis Pub/Sub (real-time updates)
Storage
- Object Storage: Amazon S3 (media files)
- Block Storage: EBS (database volumes)
- Archival: S3 Glacier (old content)
Search
- Full-text Search: Elasticsearch (distributed)
- Autocomplete: Redis (sorted sets)
ML Infrastructure
- Training: TensorFlow on GPU clusters
- Serving: TensorFlow Serving
- Feature Store: Feast
- Experiment Tracking: MLflow
Monitoring
- Metrics: Prometheus + Grafana
- Logging: ELK Stack
- Tracing: Jaeger
- Alerting: PagerDuty
Infrastructure
- Container Orchestration: Kubernetes
- Service Mesh: Istio
- Load Balancing: NGINX, HAProxy
- DNS: Route 53 (GeoDNS)
This architecture provides a scalable, reliable, and performant foundation for building an Instagram-like platform capable of handling billions of photos and videos with real-time delivery and high availability.