Design Instagram - Interview Tips
Interview Approach Strategy
Step 1: Requirements Clarification (5-10 minutes)
Essential Questions to Ask
Functional Requirements:
- "What are the core features? Just photo sharing, or also stories, DMs, shopping?"
- "Do we need to support videos?"
- "Should we implement filters and effects?"
- "Do we need real-time features like live streaming?"
Scale Requirements:
- "How many users? 500M, 1B, 2B?"
- "How many photos/videos uploaded per day?"
- "What's the read-to-write ratio?"
- "What's acceptable latency for feed loads and uploads?"
Non-Functional Requirements:
- "What's the availability requirement? 99.9%, 99.95%?"
- "Do we need to support multiple regions?"
- "Are there compliance requirements (GDPR, CCPA)?"
- "What's the budget constraint?"
Sample Clarification Dialog
Interviewer: "Design Instagram"
You: "Great! Let me clarify the requirements:
Core Features:
- Users can upload photos and videos
- Users can follow other users
- Users see a feed of posts from people they follow
- Users can like, comment, and share posts
- Should I include stories (24-hour content)?
Scale:
- Are we targeting Instagram's scale (2B users, 500M DAU)?
- Should I assume 100M photos and 50M videos per day?
- Is the read-to-write ratio around 500:1?
Performance:
- Is <1 second feed load acceptable?
- Should photo upload be <5 seconds?
- Do we need real-time updates or is eventual consistency okay?"
Interviewer: "Yes, focus on core photo/video sharing and feed.
Instagram scale is good. Real-time is nice-to-have."
You: "Perfect! Let me start with back-of-envelope calculations..."
Step 2: Back-of-the-Envelope Calculations (5 minutes)
Calculate Scale
Users:
- Total users: 2B
- DAU: 500M (25%)
- Average followers: 150
Content:
- Photos per day: 100M
- Videos per day: 50M
- Stories per day: 500M
- Photos per second: 100M / 86400 = ~1,157 TPS
- Peak TPS: 3x = 3,500 TPS
Feed Requests:
- Each user checks feed 10 times/day
- Total requests: 500M × 10 = 5B requests/day
- Requests per second: 5B / 86400 = ~58K RPS
- Peak RPS: 3x = 174K RPS
Storage:
- Photo size: 2MB (original), 200KB (compressed)
- Video size: 20MB (original), 5MB (compressed)
- Daily storage: 100M × 2MB + 50M × 20MB = 1.2PB/day
- With compression: 100M × 200KB + 50M × 5MB = 270TB/day
Bandwidth:
- Ingress: 1.2PB/day = 14GB/s
- Egress: 5B requests × 20 posts × 200KB = 20PB/day = 231GB/s
- With CDN (90% hit rate): Origin serves 23GB/s
Pro Tip: Write these calculations on the whiteboard. Shows you understand scale.
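The estimates in this step can be double-checked with a few lines of arithmetic. The sketch below only restates the numbers given above, assuming decimal units (1MB = 10^6 bytes). The variable names are illustrative.

```python
# Back-of-envelope check of the Step 2 estimates (decimal units assumed).
SECONDS_PER_DAY = 86_400
KB, MB, GB, TB, PB = 1e3, 1e6, 1e9, 1e12, 1e15

dau = 500e6
photos_per_day = 100e6
videos_per_day = 50e6

# Write and read rates
photo_tps = photos_per_day / SECONDS_PER_DAY           # ~1,157 uploads/s
feed_requests_per_day = dau * 10                       # 5B requests/day
feed_rps = feed_requests_per_day / SECONDS_PER_DAY     # ~58K requests/s

# Daily storage, original vs compressed
raw_per_day = photos_per_day * 2 * MB + videos_per_day * 20 * MB
compressed_per_day = photos_per_day * 200 * KB + videos_per_day * 5 * MB

# Bandwidth: 20 posts of 200KB per feed request, 90% CDN hit rate
ingress = raw_per_day / SECONDS_PER_DAY
egress_per_day = feed_requests_per_day * 20 * 200 * KB
egress = egress_per_day / SECONDS_PER_DAY
origin_egress = egress * (1 - 0.90)

print(f"photo uploads: {photo_tps:,.0f}/s (peak {3 * photo_tps:,.0f}/s)")
print(f"feed requests: {feed_rps:,.0f}/s (peak {3 * feed_rps:,.0f}/s)")
print(f"storage/day:   {raw_per_day / PB:.1f} PB raw, {compressed_per_day / TB:.0f} TB compressed")
print(f"bandwidth:     {ingress / GB:.0f} GB/s in, {egress / GB:.0f} GB/s out, {origin_egress / GB:.0f} GB/s from origin")
```

In an interview the same arithmetic is done by hand; rounding 86,400 to 100,000 seconds per day is usually accurate enough.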
Step 3: High-Level Design (10-15 minutes)
Start Simple
Step 1: Basic Architecture
┌─────────┐     ┌─────────┐     ┌─────────┐
│ Clients │────▶│   API   │────▶│Database │
└─────────┘     │ Gateway │     └─────────┘
                └─────────┘
Step 2: Add Core Services
┌─────────┐     ┌─────────┐     ┌──────────┐     ┌─────────┐
│ Clients │────▶│   API   │────▶│  Upload  │────▶│   S3    │
└─────────┘     │ Gateway │     │ Service  │     └─────────┘
                └─────────┘     └──────────┘
                     │
                     ▼
                ┌──────────┐
                │   Feed   │
                │ Service  │
                └──────────┘
Step 3: Add Processing and CDN
[Full architecture with media processing, CDN, caching, etc.]
Pro Tip: Draw incrementally. Don't overwhelm with complexity upfront.
Key Components to Mention
1. API Gateway: Authentication, rate limiting, routing
2. Upload Service: Handle photo/video uploads
3. Media Processing Service: Resize, compress, transcode
4. Feed Service: Generate personalized feeds
5. Story Service: Handle 24-hour ephemeral content
6. User Service: User profiles and authentication
7. Social Graph Service: Follow relationships
8. Search Service: Full-text search
9. Notification Service: Push notifications
10. S3 + CDN: Media storage and delivery
11. Cache Layer: Redis for performance
12. Message Queue: Kafka for async processing
Step 4: Deep Dive (15-20 minutes)
Critical Design Decisions
Decision 1: Media Storage Strategy
Interviewer: "How do you store and serve billions of photos?"
Strong Answer:
"I'd use a multi-tier storage and delivery strategy:
1. Storage (S3):
- Store original and compressed versions
- Original: 2MB (for editing, downloads)
- Compressed: 200KB (for feed display)
- Thumbnail: 20KB (for grid view)
- Lifecycle policies for archival
2. CDN (CloudFront):
- 90% cache hit rate
- 200+ edge locations globally
- Automatic image optimization
- Reduce origin load to 10%
3. Optimization:
- Compress images (JPEG quality 85, WebP)
- Multiple sizes for different contexts
- Lazy loading for images
- Progressive JPEG for faster rendering
This provides fast delivery (<500ms) while minimizing storage costs."
Decision 2: Feed Generation Strategy
Interviewer: "How do you generate personalized feeds?"
Strong Answer:
"I'd use a hybrid fan-out approach:
1. Regular Users (<10K followers):
- Fan-out on write (push model)
- Pre-compute feeds when post is created
- Store in Cassandra partitioned by user_id
- Fast reads (<100ms)
2. Celebrity Users (>1M followers):
- Fan-out on read (pull model)
- Fetch posts on-demand
- Merge with pre-computed feed
- Cache aggressively (TTL: 5 minutes)
3. Feed Ranking:
- Fetch candidate posts (1000+)
- Score using ML model (engagement prediction)
- Rank by score
- Apply business rules (diversity, freshness)
- Return top 20 posts
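As a rough illustration of the push and pull paths in points 1 and 2, here is a minimal in-memory sketch. It is not a production design: the dictionaries stand in for Cassandra and the cache, a single hypothetical threshold replaces the two follower cut-offs, and posts are ordered by recency instead of an ML score.

```python
from collections import defaultdict

# Assumption: one cut-off, matching the ">1M followers" rule for celebrities.
CELEBRITY_THRESHOLD = 1_000_000

followers = defaultdict(set)          # author_id -> follower ids
following = defaultdict(set)          # user_id -> followed author ids
precomputed_feed = defaultdict(list)  # user_id -> [(timestamp, post_id)]
posts_by_author = defaultdict(list)   # author_id -> [(timestamp, post_id)]


def follow(user_id, author_id):
    followers[author_id].add(user_id)
    following[user_id].add(author_id)


def is_celebrity(author_id, threshold=CELEBRITY_THRESHOLD):
    return len(followers[author_id]) > threshold


def publish(author_id, post_id, ts, threshold=CELEBRITY_THRESHOLD):
    """Fan-out on write for regular authors; celebrities are only recorded."""
    posts_by_author[author_id].append((ts, post_id))
    if is_celebrity(author_id, threshold):
        return  # pull model: followers fetch these posts at read time
    for follower_id in followers[author_id]:
        precomputed_feed[follower_id].append((ts, post_id))


def read_feed(user_id, limit=20, threshold=CELEBRITY_THRESHOLD):
    """Merge the pre-computed feed with celebrity posts fetched on demand."""
    candidates = set(precomputed_feed[user_id])  # set removes duplicates
    for author_id in following[user_id]:
        if is_celebrity(author_id, threshold):
            candidates.update(posts_by_author[author_id])
    newest_first = sorted(candidates, reverse=True)
    return [post_id for _, post_id in newest_first[:limit]]
```

Passing a small `threshold` makes the two paths easy to exercise in a test: an author above it never writes into follower feeds, yet their posts still appear in `read_feed`.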
This balances write and read performance across different user types."
Decision 3: Handling Celebrity Posts
Interviewer: "What if a celebrity with 600M followers posts a photo?"
Strong Answer:
"This is the 'hot user problem'. Solutions:
1. Skip fan-out for celebrity users:
- Don't write to 600M feeds
- Fetch celebrity posts on-demand
- Merge with pre-computed feed
2. Separate infrastructure:
- Dedicated cache for celebrity posts
- Higher cache TTL (1 hour vs 5 minutes)
- CDN caching for popular content
3. Rate limiting:
- Limit fan-out to 20K writes/second
- Process in batches of 1000
- Use priority queue (active users first)
4. Async processing:
- Fan-out via Kafka
- Non-blocking for user
- Eventual consistency acceptable
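A quick calculation shows why skipping fan-out matters here, and how the batching in point 3 might look. This is a sketch using only the figures quoted above (600M followers, 20K writes/second, batches of 1000). The function names and the `last_active` mapping are hypothetical.

```python
def fanout_duration_hours(follower_count, writes_per_second):
    """Time to write one post into every follower's feed at a capped rate."""
    return follower_count / writes_per_second / 3600


def prioritized_batches(last_active, batch_size=1000):
    """Yield follower ids in batches, most recently active users first.

    last_active: dict mapping follower_id -> last-active timestamp.
    """
    ordered = sorted(last_active, key=last_active.get, reverse=True)
    for start in range(0, len(ordered), batch_size):
        yield ordered[start:start + batch_size]


hours = fanout_duration_hours(600_000_000, 20_000)
print(f"Full fan-out would take about {hours:.1f} hours")
```

At the capped write rate a single post would take more than eight hours to reach every feed, which is why the pull model is the primary answer and rate-limited fan-out is the fallback for accounts below the celebrity threshold.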
This prevents celebrity posts from overwhelming the system."
Common Pitfalls to Avoid
Pitfall 1: Ignoring Media Processing
❌ Wrong: "Store photos directly in database"
✅ Right: "Upload to S3, process asynchronously, store multiple sizes"
Always consider media processing pipeline.
Pitfall 2: Not Considering Storage Costs
❌ Wrong: "Store all photos at original resolution"
✅ Right: "Compress to 200KB for feed, store original for editing"
Storage costs are significant at scale: 1.2PB/day of originals is roughly 2EB over 5 years, versus about 0.5EB at the compressed 270TB/day.
Pitfall 3: Forgetting CDN
❌ Wrong: "Serve all media from origin servers"
✅ Right: "Use CDN with 90% hit rate to reduce origin load"
CDN is critical for media-heavy applications.
Pitfall 4: Not Handling Celebrity Users
❌ Wrong: "Fan-out to all followers for every post"
✅ Right: "Use hybrid approach based on follower count"
Celebrity users require special handling.
Pitfall 5: Ignoring Stories
❌ Wrong: Only discussing permanent posts
✅ Right: "Stories require 24-hour expiration, different storage strategy"
Stories are a major feature of Instagram.
Impressive Points to Mention
Technical Depth
1. Perceptual Hashing:
"For deduplication, I'd use perceptual hashing (pHash):
- Hash images based on visual content
- Detect duplicate uploads
- Store once, reference multiple times
- 20% storage savings"
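To make the idea concrete, the sketch below implements average hash (aHash), a simpler relative of pHash, which additionally applies a discrete cosine transform. It assumes the image has already been downscaled to an 8x8 grayscale grid. The function names and the distance threshold are illustrative.

```python
def average_hash(pixels):
    """Return a 64-bit hash for an 8x8 grid of grayscale values (0-255).

    Each bit records whether a pixel is brighter than the grid's mean, so
    uniform brightness changes leave the hash unchanged.
    """
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(hash_a, hash_b):
    """Count the bits that differ between two hashes."""
    return bin(hash_a ^ hash_b).count("1")


def is_near_duplicate(hash_a, hash_b, max_distance=5):
    # max_distance is a tunable assumption, not a standard value.
    return hamming_distance(hash_a, hash_b) <= max_distance
```

Two uploads whose hashes fall within the distance threshold can be treated as candidates for storing once and referencing multiple times.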
2. Adaptive Bitrate Streaming:
"For videos, I'd use adaptive bitrate streaming:
- Transcode to multiple quality levels (1080p, 720p, 480p)
- HLS/DASH protocol
- Client selects quality based on bandwidth
- Better user experience"
3. Image Optimization:
"For image delivery, I'd use:
- WebP format for modern browsers (roughly 30% smaller than JPEG)
- Progressive JPEG for faster rendering
- Responsive images (srcset)
- Lazy loading for images below fold"
4. Content ID Matching:
"For copyright protection, I'd use:
- Hash-based matching for images
- Audio fingerprinting for videos
- Visual similarity detection using ML
- Automated DMCA takedown process"
System Design Patterns
1. Circuit Breaker:
"Prevent cascading failures with circuit breakers:
- Open circuit after N failures
- Fail fast instead of waiting
- Periodic health checks to close circuit"
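A minimal circuit breaker along these lines might look like the sketch below. It is one possible variant: instead of a separate health checker, it lets a single trial call through once a timeout has elapsed. The class name, the defaults, and the injectable `clock` are assumptions for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or reaching the threshold, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping each downstream dependency in its own breaker instance keeps a failing service from tying up callers that could otherwise fail fast.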
2. Bulkhead Pattern:
"Isolate resources to prevent total failure:
- Separate thread pools per service
- Limit connections per dependency
- One service failure doesn't affect others"
3. Cache-Aside Pattern:
"For feed caching:
- Check cache first
- If miss, query database
- Update cache with result
- Set appropriate TTL (5 minutes)"
Real-World Considerations
1. Cost Optimization:
"To reduce costs:
- Aggressive image compression (2MB → 200KB)
- CDN caching (90% hit rate)
- Tiered storage (hot/warm/cold)
- Deduplication (20% savings)
- Target: <$0.52 per DAU per month"
2. Monitoring:
"Key metrics to monitor:
- Upload success rate (>99.9%)
- Feed load time (p95 <1s)
- Image load time (p95 <500ms)
- CDN hit rate (>90%)
- Error rate (<0.1%)"
3. Disaster Recovery:
"Multi-region active-active:
- Async replication between regions
- RPO: 1 hour
- RTO: 4 hours
- Automatic failover with health checks"
Time Management
45-Minute Interview Breakdown
0-5 min: Requirements clarification
5-10 min: Back-of-envelope calculations
10-25 min: High-level design and core components
25-40 min: Deep dive into 2-3 components
40-45 min: Wrap-up, edge cases, questions
Adjust based on interviewer's focus.
What to Prioritize
Must Cover:
- High-level architecture
- Media storage and delivery (S3 + CDN)
- Feed generation strategy
- Database design
- Scaling approach
Nice to Have:
- Stories feature
- Security and privacy
- Monitoring and alerting
- Cost optimization
Skip if Time Limited:
- Detailed API design
- Specific code implementations
- Advanced ML algorithms
Sample Interview Questions and Answers
Q: "How would you implement the photo upload flow?"
Strong Answer:
"I'd use an asynchronous upload flow:
1. Client requests upload URL:
- POST /api/v1/posts/upload
- Server generates pre-signed S3 URL
- Returns URL to client
2. Client uploads directly to S3:
- Direct upload to S3 (no server bottleneck)
- S3 triggers SNS notification
- Fast response to user (<1 second)
3. Async processing:
- SQS queue picks up notification
- Media processing workers:
a. Resize to multiple sizes
b. Compress (2MB → 200KB)
c. Generate thumbnail
d. Extract metadata
- Store processed images in S3
- Update metadata in database
4. Completion:
- Warm CDN cache
- Publish event to Kafka
- Fan-out to followers
- Notify user of completion
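For the resize step (3a), a worker first has to decide the output dimensions. The sketch below plans renditions that preserve aspect ratio and never upscale. The rendition names and target widths are hypothetical, since the flow above only says "multiple sizes".

```python
# Hypothetical target widths in pixels; tune per product surface.
RENDITION_WIDTHS = {"feed": 1080, "grid": 320, "thumbnail": 150}


def plan_renditions(width, height, targets=RENDITION_WIDTHS):
    """Return {rendition_name: (width, height)} for an uploaded image."""
    plan = {}
    for name, target_width in targets.items():
        new_width = min(width, target_width)  # never upscale a small image
        new_height = max(1, round(height * new_width / width))
        plan[name] = (new_width, new_height)
    return plan


print(plan_renditions(4000, 3000))
```

The actual resizing and compression would be done by an image library inside the media processing workers; only the planning logic is shown here.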
This provides fast upload response while handling processing asynchronously."
Q: "How do you handle stories that expire after 24 hours?"
Strong Answer:
"I'd use multiple mechanisms for story expiration:
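1. S3 Lifecycle Policy:
- Expire story objects after 1 day (lifecycle rules are set in whole days)
- Automatic deletion by S3, which runs asynchronously and can lag past the 24-hour mark
- No manual cleanup needed for the media objects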
2. Redis TTL:
- Store story metadata in Redis
- Set 24-hour TTL
- Automatic expiration
3. Database Cleanup:
- Mark stories as expired
- Background job runs hourly
- Delete expired stories
- Update user stats
4. CDN Cache Invalidation:
- Invalidate CDN cache after 24 hours
- Prevent serving expired stories
- Set Cache-Control max-age to the story's remaining lifetime (at most 86400 seconds), so an edge copy cannot outlive the story
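One way to keep edge caches from serving a story past its expiry is to derive the cache header from the time the story has left, so a story fetched at hour 23 is cached for one hour and not another full day. The sketch below assumes Unix timestamps in seconds. The function names are illustrative.

```python
STORY_LIFETIME_SECONDS = 24 * 60 * 60


def is_expired(created_at, now):
    """True once a story has reached its 24-hour lifetime."""
    return now - created_at >= STORY_LIFETIME_SECONDS


def story_cache_control(created_at, now):
    """Build a Cache-Control value that ends when the story does."""
    remaining = int(STORY_LIFETIME_SECONDS - (now - created_at))
    if remaining <= 0:
        return "no-store"  # expired stories should not be cached at all
    return f"max-age={remaining}"
```

The same `is_expired` check can back the read path, so a story whose storage or cache cleanup has lagged is still filtered out before it reaches a client.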
This ensures stories disappear after 24 hours with eventual consistency."
Final Tips
Do's ✅
- Ask clarifying questions upfront
- Start with simple design, add complexity gradually
- Explain reasoning for each decision
- Discuss tradeoffs explicitly
- Consider media storage and delivery
- Mention CDN and caching strategies
- Think about celebrity users
- Be prepared to dive deep into any component
Don'ts ❌
- Don't jump straight to implementation
- Don't ignore storage costs
- Don't forget about CDN
- Don't overlook media processing
- Don't ignore celebrity user problem
- Don't forget about stories
- Don't assume unlimited resources
- Don't forget to manage time
If You Get Stuck
1. Ask for hints: "Could you give me a hint about X?"
2. Think out loud: "I'm considering two approaches..."
3. Discuss tradeoffs: "Approach A has X benefit but Y drawback..."
4. Relate to experience: "In my previous project, we used..."
5. Be honest: "I'm not familiar with X, but I would approach it by..."
Remember: The goal is to demonstrate your system design thinking process, not to create a perfect solution. Show how you approach complex problems, consider tradeoffs, and make informed decisions.