AI/ML System Design: Complete Guide for 2025-2026 Interviews
Overview
AI/ML system design is now a critical interview topic. With the explosion of LLMs, RAG systems, and AI-powered features, companies expect candidates to understand how to build, serve, and scale ML systems in production.
This section covers: LLM serving, RAG pipelines, vector databases, embeddings, feature stores, model serving patterns, recommendation systems, and cost optimization.
1. LLM Serving Architecture
The Challenge
LLMs are massive (7B-405B+ parameters) and computationally expensive. Serving them requires specialized infrastructure.
Model Sharding Strategies
Single GPU Memory: 80GB (A100) or 141GB (H200)
Model Size: LLaMA 70B = ~140GB in FP16
Strategy 1: Tensor Parallelism (within a node)
┌───────────────────────────────────────────┐
│ Single Layer split across GPUs            │
│                                           │
│ GPU 0: [W_0:n/2]    GPU 1: [W_n/2:n]      │
│        ↓     AllReduce     ↓              │
│ Each GPU computes partial result          │
│ Combine via AllReduce                     │
└───────────────────────────────────────────┘
Best for: Latency-sensitive (reduces per-layer time)
Strategy 2: Pipeline Parallelism (across nodes)
┌────────────┐    ┌────────────┐    ┌────────────┐
│   GPU 0    │───►│   GPU 1    │───►│   GPU 2    │
│ Layers 0-9 │    │Layers 10-19│    │Layers 20-29│
└────────────┘    └────────────┘    └────────────┘
Best for: Throughput (pipeline multiple requests)
Strategy 3: Data Parallelism (replicas)
┌────────────┐    ┌────────────┐    ┌────────────┐
│  Replica 0 │    │  Replica 1 │    │  Replica 2 │
│ Full Model │    │ Full Model │    │ Full Model │
└────────────┘    └────────────┘    └────────────┘
      │                 │                 │
      └───────── Load Balancer ───────────┘
Best for: Scaling throughput with smaller models
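Strategy 1 can be illustrated without any GPU. The sketch below is a toy model only: two "devices" are simulated with plain Python lists, the weight matrix is split by rows, and `all_reduce_sum` is a hypothetical stand-in for a real collective such as NCCL AllReduce. It assumes the input dimension divides evenly across devices.

```python
def matvec(x, W):
    """Compute x @ W for a vector x (length n) and a matrix W (n rows, m columns)."""
    cols = len(W[0])
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(cols)]

def all_reduce_sum(partials):
    """Stand-in for AllReduce: element-wise sum of every device's partial output."""
    return [sum(vals) for vals in zip(*partials)]

def tensor_parallel_matvec(x, W, num_devices=2):
    shard = len(x) // num_devices  # assumes len(x) is divisible by num_devices
    partials = []
    for d in range(num_devices):
        lo, hi = d * shard, (d + 1) * shard
        # Each device holds only its slice of W and the matching slice of x.
        partials.append(matvec(x[lo:hi], W[lo:hi]))
    return all_reduce_sum(partials)

x = [1.0, 2.0, 3.0, 4.0]
W = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [1.0, 2.0]]
full = matvec(x, W)                     # single-device result
sharded = tensor_parallel_matvec(x, W)  # same result from two half-size shards
print(full, sharded)
```

Each simulated device stores half of the weights, which is the memory benefit; the price is one communication step per layer to combine the partial results.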
KV Cache
Problem: Autoregressive generation recomputes attention for all previous tokens
Without KV Cache:
Token 1: compute attention over [1]
Token 2: compute attention over [1, 2]
Token 3: compute attention over [1, 2, 3] ← redundant!
...
Token N: compute attention over [1, 2, ..., N] ← O(N²) total
With KV Cache:
Token 1: compute K1, V1, store in cache
Token 2: compute K2, V2, attend to cached [K1,V1] + [K2,V2]
Token 3: compute K3, V3, attend to cached [K1,V1,K2,V2] + [K3,V3]
...
Token N: only compute KN, VN, attend to all cached ← O(N) per token
KV Cache size per request:
= 2 × num_layers × hidden_dim × seq_length × bytes_per_param
GPT-4 class (128 layers, 12288 hidden, 8K context, FP16):
= 2 × 128 × 12288 × 8192 × 2 bytes ≈ 52GB per request!
This is why context length is expensive.
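The arithmetic above is easy to check in code. This is a minimal sketch: `kv_projections` counts per-token K/V computations with and without a cache, and `kv_cache_bytes` applies the size formula as written. The function names are illustrative, and the model shape is the same hypothetical one used in the text.

```python
def kv_projections(n_tokens, use_cache):
    """Count how many per-token K/V projections decoding n_tokens requires."""
    if use_cache:
        return n_tokens                    # one new K/V pair per step
    return n_tokens * (n_tokens + 1) // 2  # step t recomputes all t pairs

def kv_cache_bytes(num_layers, hidden_dim, seq_length, bytes_per_param=2):
    """Per-request cache size; the leading 2 covers the K and V tensors."""
    return 2 * num_layers * hidden_dim * seq_length * bytes_per_param

size = kv_cache_bytes(num_layers=128, hidden_dim=12288, seq_length=8192)
print(kv_projections(1000, use_cache=False))  # 500500
print(kv_projections(1000, use_cache=True))   # 1000
print(f"{size / 1e9:.1f} GB")                 # 51.5 GB
```

The formula assumes standard multi-head attention, where K and V span the full hidden dimension. Models that use grouped-query or multi-query attention store fewer K/V heads, so their cache is proportionally smaller.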
Batching Strategies
Static Batching:
Wait for N requests, process together
Problem: all requests wait for longest sequence
Continuous Batching (vLLM, TGI):
┌──────────────────────────────────────────────────┐
│ Iteration 1: [Req1_tok5, Req2_tok3, Req3_tok1]   │
│ Iteration 2: [Req1_tok6, Req2_tok4, Req3_tok2]   │
│ Iteration 3: [Req1_DONE, Req2_tok5, Req3_tok3]   │
│ Iteration 4: [Req4_tok1, Req2_tok6, Req3_tok4]   │ ← Req4 joins!
└──────────────────────────────────────────────────┘
Requests join/leave batch dynamically
No waiting for slowest request
2-5× throughput improvement over static batching
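The scheduling difference can be sketched with a small simulation. This is a simplified model, not how vLLM or TGI are implemented: every request is reduced to its number of output tokens (assumed positive), each iteration emits one token per active request, and prefill cost is ignored.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    """A finished request frees its slot for the next queued request."""
    queue = deque(lengths)
    active = []  # remaining tokens for each in-flight request
    steps = 0
    while queue or active:
        # Fill free slots before each iteration.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        # One decode iteration: every active request emits one token.
        active = [r - 1 for r in active if r - 1 > 0]
        steps += 1
    return steps

lengths = [2, 10, 3, 9, 2, 8]  # output tokens per request, in arrival order
static_steps = static_batching_steps(lengths, batch_size=3)
continuous_steps = continuous_batching_steps(lengths, batch_size=3)
print(static_steps, continuous_steps)  # 19 13
```

In this toy workload the gain comes entirely from short requests no longer holding a slot idle while the longest request in their batch finishes.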
LLMs have knowledge cutoffs and can hallucinate. Retrieval-Augmented Generation (RAG) grounds responses in documents retrieved at query time.
Without RAG:
User: "What's our refund policy?"
LLM: "I think your refund policy is..." (hallucination)
With RAG:
User: "What's our refund policy?"
System: [retrieves actual policy document]
LLM: "According to your policy document, refunds are..." (grounded)
1. Hybrid Search (Dense + Sparse):
Score = α × vector_similarity + (1-α) × BM25_score
Combines semantic understanding with keyword matching
2. Re-ranking:
Query → Retrieve top-100 → Re-rank with cross-encoder → Return top-10
Cross-encoder is more accurate but slower (can't pre-compute)
3. Query Expansion:
Original: "How to fix memory leaks?"
Expanded: "How to fix memory leaks? memory management garbage collection heap"
4. Hypothetical Document Embedding (HyDE):
Generate hypothetical answer → embed that → search for similar real docs
5. Multi-query RAG:
Break complex question into sub-questions
Retrieve for each sub-question
Synthesize final answer from all retrieved context
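Techniques 1 and 2 compose naturally, and a rough sketch of the two together looks like this. Several pieces are assumptions rather than part of the text above: the min-max normalization (added because raw BM25 scores are unbounded while cosine similarity is not), the helper names, and the hard-coded `cross_encoder` scores, which stand in for a real model so the example runs offline.

```python
def min_max(scores):
    """Rescale scores to [0, 1] so vector and BM25 values are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def hybrid_scores(vector_scores, bm25_scores, alpha=0.5):
    """Score = alpha * vector_similarity + (1 - alpha) * BM25_score."""
    v, b = min_max(vector_scores), min_max(bm25_scores)
    return {d: alpha * v.get(d, 0.0) + (1 - alpha) * b.get(d, 0.0)
            for d in set(v) | set(b)}

def retrieve_then_rerank(first_stage, rerank_score, first_k=100, final_k=10):
    """Keep the top first_k by first-stage score, then reorder with a slower scorer."""
    candidates = sorted(first_stage, key=first_stage.get, reverse=True)[:first_k]
    return sorted(candidates, key=rerank_score, reverse=True)[:final_k]

vector_scores = {"a": 0.9, "b": 0.8, "c": 0.5}
bm25_scores = {"a": 2.0, "b": 12.0, "c": 7.0}
scores = hybrid_scores(vector_scores, bm25_scores, alpha=0.5)

# Hypothetical cross-encoder output for the same three documents.
cross_encoder = {"a": 0.95, "b": 0.60, "c": 0.10}
top = retrieve_then_rerank(scores, cross_encoder.get, first_k=2, final_k=2)
print(top)  # ['a', 'b']
```

The first stage ranks "b" highest because of its strong keyword match; the re-ranker then promotes "a". Only the two surviving candidates are ever passed to the expensive scorer.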
RAG Evaluation Metrics
| Metric                | What It Measures                     | Target |
|-----------------------|--------------------------------------|--------|
| Retrieval Recall@K    | % of relevant docs in top-K          | >80%   |
| Retrieval Precision@K | % of top-K that are relevant         | >60%   |
| Faithfulness          | Does answer match retrieved context? | >90%   |
| Answer Relevancy      | Does answer address the question?    | >85%   |
| Context Relevancy     | Is retrieved context relevant?       | >70%   |
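The two retrieval metrics have simple closed forms. A minimal sketch, assuming `retrieved` is a ranked list of document IDs, `relevant` is the labeled ground-truth set, and k is at least 1 (the IDs below are made up for illustration):

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant docs that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k results that are relevant."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / k

retrieved = ["d1", "d4", "d2", "d9", "d7"]
relevant = {"d1", "d2", "d3"}
recall = recall_at_k(retrieved, relevant, k=5)        # 2 of 3 relevant docs found
precision = precision_at_k(retrieved, relevant, k=5)  # 2 of 5 results relevant
print(round(recall, 3), precision)
```

Faithfulness and relevancy are not set-overlap metrics; they are usually scored by an LLM judge or human raters, so they are left out of this sketch.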
3. Vector Databases
Why Vector Databases?
Traditional databases search by exact match or range. Vector databases search by similarity in high-dimensional space.
Traditional DB: SELECT * FROM docs WHERE title = 'refund policy'
Vector DB: Find documents semantically similar to "how do I get my money back"
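At its core, similarity search ranks stored vectors by a distance measure such as cosine similarity. The sketch below does this by brute force. The 3-dimensional vectors are hand-written placeholders, not output from a real embedding model, and `most_similar` is an illustrative name.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(query_vec, index, top_k=1):
    """Brute-force scan: score every stored vector, return the best top_k keys."""
    ranked = sorted(index,
                    key=lambda doc: cosine_similarity(query_vec, index[doc]),
                    reverse=True)
    return ranked[:top_k]

# Hypothetical embeddings; a real system would call an embedding model.
index = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.0],
    "careers page": [0.0, 0.1, 0.9],
}
query_vec = [0.8, 0.2, 0.1]  # stands in for "how do I get my money back"
print(most_similar(query_vec, index))  # ['refund policy']
```

A vector database replaces the linear scan with an approximate index so that the same query stays fast at millions of vectors.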
Comparison of Vector Databases
| Database | Type           | Strengths                   | Weaknesses               | Best For                   |
|----------|----------------|-----------------------------|--------------------------|----------------------------|
| Pinecone | Managed SaaS   | Zero-ops, fast, scalable    | Cost, vendor lock-in     | Startups, quick deployment |
| Weaviate | Open-source    | Hybrid search, modules      | Memory-heavy             | Multi-modal search         |
| Milvus   | Open-source    | High performance, GPU       | Complex operations       | Large-scale production     |
| pgvector | PostgreSQL ext | Familiar, ACID, joins       | Limited scale            | Small-medium datasets      |
| Qdrant   | Open-source    | Rust performance, filtering | Newer ecosystem          | Filtered similarity search |
| Chroma   | Open-source    | Simple API, embedded        | Not for production scale | Prototyping, small apps    |
When to Use Which
< 1M vectors + existing PostgreSQL → pgvector
< 10M vectors + need managed service → Pinecone
> 10M vectors + need control → Milvus or Qdrant
Need hybrid search (vector + keyword) → Weaviate
Prototyping / local dev → Chroma
Scaling Vector Search
10M vectors × 1536 dimensions × 4 bytes = ~60GB raw data
With index overhead: ~100-150GB
Options:
1. Vertical scaling: Single node with enough RAM (up to ~100M vectors)
2. Sharding: Distribute vectors across nodes by partition key
3. Tiered storage: Hot vectors in memory, cold on SSD
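The sizing arithmetic above extends to a quick capacity check. This is a back-of-the-envelope sketch: the 2x index overhead is an assumed value chosen to land inside the ~100-150GB range quoted above, and the 64 GB node size is hypothetical.

```python
import math

def raw_vector_bytes(num_vectors, dims, bytes_per_value=4):
    """Raw storage for FP32 vectors, before any index structures."""
    return num_vectors * dims * bytes_per_value

def shards_needed(total_bytes, ram_per_node_bytes):
    """Minimum number of nodes if the whole index must stay in RAM."""
    return math.ceil(total_bytes / ram_per_node_bytes)

raw = raw_vector_bytes(10_000_000, 1536)       # 61_440_000_000 bytes, about 61 GB
with_index = raw * 2                           # assumed 2x overhead
nodes = shards_needed(with_index, 64 * 10**9)  # hypothetical 64 GB nodes
print(raw / 1e9, with_index / 1e9, nodes)
```

The same two functions make it easy to see when vertical scaling stops being an option and sharding or tiered storage takes over.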
4. Embedding Generation and Indexing
Embedding Models
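| Model                         | Dimensions | Context     | Speed       | Quality   |
|-------------------------------|------------|-------------|-------------|-----------|
| OpenAI text-embedding-3-small | 1536       | 8191 tokens | Fast        | Good      |
| OpenAI text-embedding-3-large | 3072       | 8191 tokens | Medium      | Best      |
| Cohere embed-v3               | 1024       | 512 tokens  | Fast        | Very Good |
| BGE-large-en                  | 1024       | 512 tokens  | Self-hosted | Very Good |
| E5-mistral-7b                 | 4096       | 32K tokens  | Slow        | Excellent |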
Vector Index Types
HNSW (Hierarchical Navigable Small World)
Multi-layer graph structure:
Layer 2: [A] ───────────────────── [D]   (few nodes, long edges)
Layer 1: [A] ──── [B] ──── [C] ─── [D]   (more nodes)
Layer 0: [A][B][C][D][E][F][G][H]        (all nodes, short edges)
Search: Start at top layer, greedily descend
- Top layers: fast coarse navigation
- Bottom layers: precise local search
Performance:
- Build time: O(N × log N)
- Search time: O(log N)
- Memory: O(N × M) where M = max connections per node
- Recall@10: 95-99% typical
Best for: Low-latency search, moderate dataset sizes (fits in RAM)
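The descent described above can be sketched with a heavily simplified model. Real HNSW keeps a candidate list (the ef parameter) at the bottom layer and builds its graph probabilistically; this toy version uses hand-built chain graphs over 1-D points and pure greedy search at every layer. The helpers `chain`, `greedy_search`, and `layered_search` are illustrative names, not a library API.

```python
import math

def chain(names):
    """Toy graph: link each node to its immediate neighbors in the sequence."""
    graph = {n: [] for n in names}
    for left, right in zip(names, names[1:]):
        graph[left].append(right)
        graph[right].append(left)
    return graph

def greedy_search(graph, vectors, query, entry):
    """Move to the closest neighbor until no neighbor is closer than the current node."""
    current = entry
    while True:
        best = min(graph[current], key=lambda n: math.dist(vectors[n], query))
        if math.dist(vectors[best], query) >= math.dist(vectors[current], query):
            return current
        current = best

def layered_search(layers, vectors, query, entry):
    """layers is ordered from the sparse top layer down to the full bottom layer."""
    for graph in layers:
        entry = greedy_search(graph, vectors, query, entry)
    return entry

# Nodes A..H sit at positions 0..7 on a line, mirroring the diagram above.
vectors = {name: [float(i)] for i, name in enumerate("ABCDEFGH")}
layers = [chain("AD"), chain("ABCD"), chain("ABCDEFGH")]
result = layered_search(layers, vectors, query=[5.2], entry="A")
print(result)  # F
```

The top layer jumps from A to D in a single hop, and the bottom layer only has to walk D, E, F. That shortcut is what keeps search cost roughly logarithmic in practice.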
IVF (Inverted File Index)
1. Cluster vectors into K centroids (K-means)
2. At query time, find nearest nprobe centroids
3. Search only vectors in those clusters
┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐
│ Cluster 0 │ │ Cluster 1 │ │ Cluster 2 │ │ Cluster 3 │
│ •••••     │ │ ••••      │ │ •••••••   │ │ ••        │
│ ••        │ │ •••       │ │ ••        │ │ •••••     │
└───────────┘ └───────────┘ └───────────┘ └───────────┘
Query: Find nearest centroid(s) → search only those clusters
nprobe=1: Fast but may miss results
nprobe=10: Slower but higher recall
Best for: Large datasets, tunable speed/accuracy tradeoff
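The effect of nprobe shows up even on four points. In this sketch the centroids are fixed by hand instead of being learned with K-means, and the data is 2-D; both are simplifications for illustration, and the function names are not from any library.

```python
import math

def assign_to_clusters(vectors, centroids):
    """Build inverted lists: centroid index -> ids of the vectors nearest to it."""
    lists = {c: [] for c in range(len(centroids))}
    for vid, vec in vectors.items():
        nearest = min(range(len(centroids)),
                      key=lambda c: math.dist(vec, centroids[c]))
        lists[nearest].append(vid)
    return lists

def ivf_search(query, vectors, centroids, lists, nprobe=1, top_k=1):
    """Scan only the nprobe clusters whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)),
                   key=lambda c: math.dist(query, centroids[c]))[:nprobe]
    candidates = [vid for c in probe for vid in lists[c]]
    return sorted(candidates, key=lambda vid: math.dist(query, vectors[vid]))[:top_k]

centroids = [[0.0, 0.0], [10.0, 0.0]]  # assumed to come from a prior K-means run
vectors = {"a": [1.0, 0.0], "b": [4.9, 0.0], "c": [9.0, 0.0], "d": [6.0, 0.0]}
lists = assign_to_clusters(vectors, centroids)

query = [5.1, 0.0]  # just across the cluster boundary from "b"
fast = ivf_search(query, vectors, centroids, lists, nprobe=1)
thorough = ivf_search(query, vectors, centroids, lists, nprobe=2)
print(fast, thorough)  # ['d'] ['b']
```

With nprobe=1 the search never looks at the cluster containing "b", the true nearest neighbor. Raising nprobe recovers it at the cost of scanning more vectors.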
PQ (Product Quantization)
Compress vectors by splitting into subvectors and quantizing each:
Original: [0.1, 0.3, 0.5, 0.7, 0.2, 0.4, 0.6, 0.8] (32 bytes in FP32)
Split: [0.1, 0.3] [0.5, 0.7] [0.2, 0.4] [0.6, 0.8] (4 subvectors)
Quantize: [code_23] [code_45] [code_12] [code_67] (4 bytes!)
Compression: 8× reduction (32 bytes → 4 bytes)
Trade-off: ~5-10% recall loss for 8× memory savings
Best for: Memory-constrained environments, billion-scale datasets
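Encoding and decoding the example vector takes only a few lines. This is a sketch under stated assumptions: the codebook is hand-written and shared across all four subspaces, whereas a real PQ index learns a separate codebook per subspace with K-means, typically with 256 entries so each code fits in one byte.

```python
import math

def pq_encode(vector, codebooks):
    """Replace each subvector with the index of its nearest codebook entry."""
    sub_len = len(vector) // len(codebooks)
    codes = []
    for i, codebook in enumerate(codebooks):
        sub = vector[i * sub_len:(i + 1) * sub_len]
        codes.append(min(range(len(codebook)),
                         key=lambda c: math.dist(sub, codebook[c])))
    return codes

def pq_decode(codes, codebooks):
    """Approximate reconstruction: concatenate the chosen codebook entries."""
    return [x for code, codebook in zip(codes, codebooks) for x in codebook[code]]

# Toy codebook with 4 entries, reused for every subspace (an assumption).
codebook = [[0.1, 0.3], [0.5, 0.7], [0.2, 0.5], [0.7, 0.8]]
codebooks = [codebook] * 4

original = [0.1, 0.3, 0.5, 0.7, 0.2, 0.4, 0.6, 0.8]
codes = pq_encode(original, codebooks)     # 4 small integers instead of 8 floats
approx = pq_decode(codes, codebooks)
print(codes)   # [0, 1, 2, 3]
print(approx)  # [0.1, 0.3, 0.5, 0.7, 0.2, 0.5, 0.7, 0.8]
```

The last two subvectors come back slightly off ([0.2, 0.5] instead of [0.2, 0.4]). That quantization error is the source of the recall loss mentioned above.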
Index Selection Guide
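| Dataset Size | Memory Budget | Latency Requirement | Recommended         |
|--------------|---------------|---------------------|---------------------|
| < 100K       | Any           | < 10ms              | Flat (brute force)  |
| 100K - 10M   | Fits in RAM   | < 10ms              | HNSW                |
| 1M - 100M    | Limited       | < 50ms              | IVF + PQ            |
| 100M - 1B    | Very limited  | < 100ms             | IVF + PQ + disk     |
| > 1B         | Distributed   | < 200ms             | Sharded HNSW or IVF |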
5. Feature Stores
What Is a Feature Store?
A centralized system for managing, storing, and serving ML features consistently across training and inference.
Item features → Item embedding → Similarity
User profile = weighted average of liked item embeddings
Recommendation = items most similar to user profile
Advantages: No cold-start for items (features available immediately)
Disadvantages: Limited discovery (recommends items similar to past likes)
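The profile-and-rank steps above can be sketched directly. The item embeddings here are 2-D placeholders, the weights stand in for interaction strength such as ratings, and the function names are illustrative.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def user_profile(liked, item_embeddings):
    """liked maps item_id -> weight; returns the weighted average embedding."""
    total = sum(liked.values())
    dim = len(next(iter(item_embeddings.values())))
    return [sum(w * item_embeddings[i][d] for i, w in liked.items()) / total
            for d in range(dim)]

def recommend(profile, item_embeddings, exclude, top_k=1):
    """Rank items the user has not seen by similarity to the profile."""
    candidates = [i for i in item_embeddings if i not in exclude]
    return sorted(candidates,
                  key=lambda i: cosine(profile, item_embeddings[i]),
                  reverse=True)[:top_k]

# Hypothetical embeddings: first axis is "sci-fi", second is "romance".
items = {
    "sci-fi-1": [1.0, 0.0],
    "sci-fi-2": [0.9, 0.1],
    "romance-1": [0.0, 1.0],
    "sci-fi-romance": [0.6, 0.4],
}
liked = {"sci-fi-1": 2.0, "sci-fi-romance": 1.0}
profile = user_profile(liked, items)
recs = recommend(profile, items, exclude=liked)
print(recs)  # ['sci-fi-2']
```

A brand-new item can be recommended as soon as its embedding exists, which is the cold-start advantage. The profile never leaves the region of items already liked, which is the discovery limitation.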
Teacher (large model) → Student (small model)
GPT-4 (expensive, accurate) generates training data
Fine-tune GPT-3.5 or Llama-7B on that data
Deploy smaller model for 90% of requests
Cost reduction: 10-50× for similar quality on specific tasks
3. Quantization
FP32 → FP16 → INT8 → INT4
Model: LLaMA 70B (weights only; KV cache and activations need extra memory)
FP32: 280GB (does not fit on a single GPU; 4× A100 80GB)
FP16: 140GB (2× A100 80GB)
INT8: 70GB (1× A100 80GB)
INT4: 35GB (1× A100 40GB or 2× 24GB consumer GPUs)
Quality loss:
FP16: ~0% loss
INT8 (GPTQ/AWQ): 1-3% loss
INT4: 3-8% loss (acceptable for many tasks)
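The memory figures follow directly from bytes per parameter. A small sketch, counting weights only (no KV cache, activations, or framework overhead) and assuming 80 GB GPUs; the function names are illustrative.

```python
import math

BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

def weight_memory_gb(num_params, precision):
    """Memory for model weights alone, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

def min_gpus(num_params, precision, gpu_memory_gb):
    """Smallest GPU count whose combined memory holds the weights."""
    return math.ceil(weight_memory_gb(num_params, precision) / gpu_memory_gb)

for precision in BYTES_PER_PARAM:
    gb = weight_memory_gb(70e9, precision)
    print(precision, gb, min_gpus(70e9, precision, gpu_memory_gb=80))
```

Actual deployments need headroom beyond these minimums, since the KV cache grows with batch size and context length.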
4. Routing and Cascading
┌──────────────┐
│ User Request │
└──────┬───────┘
       │
       ▼
┌──────────────┐   Simple query?    ┌─────────────┐
│    Router    │───────────────────►│ Small Model │ (cheap)
│ (classifier) │                    │  (7B/13B)   │
│              │   Complex query?   └─────────────┘
│              │───────────────────►┌─────────────┐
│              │                    │ Large Model │ (expensive)
└──────────────┘                    │ (70B/GPT-4) │
                                    └─────────────┘
Result: 70% of queries handled by cheap model
Average cost reduction: 60-80%
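A quick calculation shows where the savings figure comes from. In this sketch the router is a word-count heuristic standing in for a trained classifier, and the 10:1 cost ratio between the two models is an assumed number for illustration, as is the 12-word threshold.

```python
def is_simple(query, max_words=12):
    """Hypothetical stand-in for a trained classifier: short queries count as simple."""
    return len(query.split()) <= max_words

def route(query):
    return "small" if is_simple(query) else "large"

def blended_cost(simple_fraction, small_cost, large_cost):
    """Expected per-query cost when simple_fraction of traffic goes to the small model."""
    return simple_fraction * small_cost + (1 - simple_fraction) * large_cost

baseline = 1.0  # relative cost of sending every query to the large model
mixed = blended_cost(0.70, small_cost=0.10, large_cost=1.0)
savings = 1 - mixed / baseline
print(route("What is the capital of France?"))  # small
print(round(savings, 2))                         # 0.63
```

Under these assumptions, routing 70% of traffic to a model that costs a tenth as much saves about 63%. The calculation ignores the cost of the router itself and any quality loss from misrouted queries.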
5. Semantic Caching
Cache LLM responses for semantically similar queries:
Query: "What is the capital of France?"
→ Generate response, cache with embedding
Query: "What's France's capital city?"
→ Embedding similarity > 0.95 → return cached response
Hit rate: 10-30% for customer support, FAQ-style queries
Savings: Proportional to hit rate
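The lookup flow above can be sketched as a small class. The `SemanticCache` name, the linear scan, and the `toy_embed` lookup table are all illustrative assumptions; a production cache would call a real embedding model and store entries in a vector index.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

class SemanticCache:
    def __init__(self, embed, threshold=0.95):
        self.embed = embed          # function: query text -> embedding vector
        self.threshold = threshold  # minimum similarity that counts as a hit
        self.entries = []           # list of (embedding, response) pairs

    def get(self, query):
        """Return the most similar cached response at or above the threshold, else None."""
        q = self.embed(query)
        best_response, best_sim = None, self.threshold
        for emb, response in self.entries:
            sim = cosine(q, emb)
            if sim >= best_sim:
                best_response, best_sim = response, sim
        return best_response

    def put(self, query, response):
        self.entries.append((self.embed(query), response))

# Hand-written placeholder embeddings so the example runs without a model.
toy_embeddings = {
    "What is the capital of France?": [0.90, 0.10, 0.00],
    "What's France's capital city?": [0.88, 0.12, 0.02],
    "How do I reset my password?": [0.00, 0.20, 0.90],
}
toy_embed = toy_embeddings.__getitem__

cache = SemanticCache(toy_embed)
cache.put("What is the capital of France?", "Paris")
hit = cache.get("What's France's capital city?")  # similar enough: cached answer
miss = cache.get("How do I reset my password?")   # unrelated: falls through to the LLM
print(hit, miss)  # Paris None
```

The threshold is the main tuning knob: lowering it raises the hit rate but increases the risk of returning an answer to a subtly different question.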