AI/ML System Design: Complete Guide for 2025-2026 Interviews

Overview

AI/ML system design is now a critical interview topic. With the explosion of LLMs, RAG systems, and AI-powered features, companies expect candidates to understand how to build, serve, and scale ML systems in production.

This section covers: LLM serving, RAG pipelines, vector databases, embeddings, feature stores, model serving patterns, recommendation systems, and cost optimization.


1. LLM Serving Architecture

The Challenge

LLMs are massive (7B-405B+ parameters) and computationally expensive. Serving them requires specialized infrastructure.

Model Sharding Strategies

Single GPU Memory: 80GB (A100/H100) or 141GB (H200)
Model Size: LLaMA 70B = ~140GB in FP16

Strategy 1: Tensor Parallelism (within a node)
┌──────────────────────────────────────────┐
│ Single Layer split across GPUs           │
│                                          │
│ GPU 0: [W_0:n/2]  GPU 1: [W_n/2:n]       │
│         ↕ AllReduce ↕                    │
│ Each GPU computes partial result         │
│ Combine via AllReduce                    │
└──────────────────────────────────────────┘
Best for: Latency-sensitive (reduces per-layer time)
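
The split-and-combine idea in Strategy 1 can be sketched in plain Python. This is a toy illustration, not how a real framework implements it: the two "GPUs" are just loop iterations, `all_reduce_sum` stands in for a real AllReduce collective, and all function names and values are made up for the example. It splits the weight matrix along the input dimension, so each device produces a partial output that must be summed.

```python
def matvec(x, w):
    """Compute x @ w for a vector x (length n) and a matrix w (n rows, m columns)."""
    cols = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(cols)]


def all_reduce_sum(partials):
    """Element-wise sum of per-device partial outputs (stand-in for AllReduce)."""
    return [sum(vals) for vals in zip(*partials)]


def tensor_parallel_matvec(x, w, num_devices):
    """Give each device a slice of w's rows and the matching slice of x, then combine."""
    shard = len(x) // num_devices  # assumes len(x) is divisible by num_devices
    partials = []
    for d in range(num_devices):
        lo, hi = d * shard, (d + 1) * shard
        partials.append(matvec(x[lo:hi], w[lo:hi]))  # this device's partial result
    return all_reduce_sum(partials)


x = [1.0, 2.0, 3.0, 4.0]
w = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [1.0, 2.0]]
full = matvec(x, w)                                    # single-device reference
sharded = tensor_parallel_matvec(x, w, num_devices=2)  # two simulated devices
print(full, sharded)
```

Both calls produce the same vector; the sharded version only ever touches half of the weights per device, which is the point of the technique.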

Strategy 2: Pipeline Parallelism (across nodes)
┌────────────┐   ┌────────────┐   ┌────────────┐
│ GPU 0      │──►│ GPU 1      │──►│ GPU 2      │
│Layers 0-9  │   │Layers 10-19│   │Layers 20-29│
└────────────┘   └────────────┘   └────────────┘
Best for: Throughput (pipeline multiple requests)

Strategy 3: Data Parallelism (replicas)
┌────────────┐  ┌────────────┐  ┌────────────┐
│ Replica 0  │  │ Replica 1  │  │ Replica 2  │
│ Full Model │  │ Full Model │  │ Full Model │
└────────────┘  └────────────┘  └────────────┘
      ↑               ↑               ↑
      └──────── Load Balancer ────────┘
Best for: Scaling throughput with smaller models

KV Cache

Problem: Autoregressive generation recomputes attention for all previous tokens

Without KV Cache:
  Token 1: compute attention over [1]
  Token 2: compute attention over [1, 2]
  Token 3: compute attention over [1, 2, 3]  ← redundant!
  ...
  Token N: compute attention over [1, 2, ..., N]  ← O(N²) total

With KV Cache:
  Token 1: compute K1, V1, store in cache
  Token 2: compute K2, V2, attend to cached [K1,V1] + [K2,V2]
  Token 3: compute K3, V3, attend to cached [K1,V1,K2,V2] + [K3,V3]
  ...
  Token N: only compute KN, VN, attend to all cached  ← O(N) per token

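KV Cache size per request:
  = 2 × num_layers × hidden_dim × seq_length × bytes_per_param
  (the factor 2 covers keys and values; assumes standard multi-head attention without grouped-query attention)

  Illustrative GPT-4-class model (128 layers, 12288 hidden, 8K context, FP16; the real architecture is not public):
  = 2 × 128 × 12288 × 8192 × 2 bytes ≈ 52GB per request!

  This is why context length is expensive.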
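
The formula is easy to check numerically. The sketch below uses the same illustrative configuration (128 layers, hidden size 12288, 8K context, FP16); the function name is made up for the example, and the calculation ignores grouped-query attention and other optimizations that shrink the cache in practice.

```python
def kv_cache_bytes(num_layers, hidden_dim, seq_length, bytes_per_param=2):
    """KV cache size for one request: one key and one value vector per layer per token."""
    return 2 * num_layers * hidden_dim * seq_length * bytes_per_param


size = kv_cache_bytes(num_layers=128, hidden_dim=12288, seq_length=8192)
print(f"{size / 1e9:.1f} GB per request")  # about 51.5 GB

# Doubling the context doubles the cache, since the size is linear in seq_length.
doubled = kv_cache_bytes(num_layers=128, hidden_dim=12288, seq_length=16384)
```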

Batching Strategies

Static Batching:
  Wait for N requests, process together
  Problem: all requests wait for longest sequence

Continuous Batching (vLLM, TGI):
  ┌─────────────────────────────────────────────────┐
  │ Iteration 1: [Req1_tok5, Req2_tok3, Req3_tok1]  │
  │ Iteration 2: [Req1_tok6, Req2_tok4, Req3_tok2]  │
  │ Iteration 3: [Req1_DONE, Req2_tok5, Req3_tok3]  │
  │ Iteration 4: [Req4_tok1, Req2_tok6, Req3_tok4]  │ ← Req4 joins!
  └─────────────────────────────────────────────────┘

  Requests join/leave batch dynamically
  No waiting for slowest request
  2-5× throughput improvement over static batching
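
A toy simulation makes the difference concrete. It counts decoding iterations only, under simplifying assumptions: every request generates one token per iteration, prefill cost and memory limits are ignored, and the output lengths are invented. It is a sketch of the scheduling idea, not of how vLLM or TGI implement it.

```python
def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Finished requests free their slot immediately; waiting requests join."""
    waiting = list(lengths)  # tokens still to generate, in arrival order
    running = []             # remaining tokens for each in-flight request
    steps = 0
    while waiting or running:
        # Fill free slots before each decoding iteration.
        while waiting and len(running) < batch_size:
            running.append(waiting.pop(0))
        steps += 1  # one iteration generates one token per running request
        running = [r - 1 for r in running if r > 1]
    return steps


lengths = [2, 8, 3, 7, 2, 2]  # hypothetical output lengths in tokens
static_steps = static_batching_steps(lengths, batch_size=3)
continuous_steps = continuous_batching_steps(lengths, batch_size=3)
print(static_steps, continuous_steps)  # 15 9
```

With these lengths the static scheduler needs 15 iterations because short requests sit idle behind long ones, while the continuous scheduler finishes in 9.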

Serving Stack

┌───────────────────────────────────────────────────┐
│                   API Gateway                     │
│            (rate limiting, auth)                  │
├───────────────────────────────────────────────────┤
│              Request Router                       │
│    (model selection, prompt routing)              │
├───────────────────────────────────────────────────┤
│           Inference Engine                        │
│    (vLLM / TensorRT-LLM / TGI)                    │
│    • Continuous batching                          │
│    • PagedAttention (KV cache management)         │
│    • Speculative decoding                         │
├───────────────────────────────────────────────────┤
│              GPU Cluster                          │
│    (A100/H100/H200, NVLink, InfiniBand)           │
└───────────────────────────────────────────────────┘

2. RAG (Retrieval-Augmented Generation) Pipeline

Why RAG?

LLMs have knowledge cutoffs and hallucinate. RAG grounds responses in actual documents.

Without RAG:
  User: "What's our refund policy?"
  LLM: "I think your refund policy is..." (hallucination)

With RAG:
  User: "What's our refund policy?"
  System: [retrieves actual policy document]
  LLM: "According to your policy document, refunds are..." (grounded)

Complete RAG Architecture

┌───────────────────────────────────────────────────────────────┐
│                     INGESTION PIPELINE                        │
│                                                               │
│  Documents ──► Chunking ──► Embedding ──► Vector DB           │
│  (PDF,HTML,    (512-1024    (OpenAI,      (Pinecone,          │
│   Markdown)    tokens)      Cohere)       Weaviate)           │
└───────────────────────────────────────────────────────────────┘

┌───────────────────────────────────────────────────────────────┐
│                     QUERY PIPELINE                            │
│                                                               │
│  User Query                                                   │
│      │                                                        │
│      ▼                                                        │
│  Query Embedding ──► Vector Search ──► Top-K Chunks           │
│      │                                      │                 │
│      │              ┌───────────────────────┘                 │
│      ▼              ▼                                         │
│  ┌───────────────────────────────────┐                        │
│  │ Prompt: "Given this context:      │                        │
│  │ {retrieved_chunks}                │                        │
│  │ Answer this question:             │                        │
│  │ {user_query}"                     │                        │
│  └───────────────────────────────────┘                        │
│      │                                                        │
│      ▼                                                        │
│  LLM Response (grounded in retrieved context)                 │
└───────────────────────────────────────────────────────────────┘

Chunking Strategies

Strategy              | Chunk Size      | Overlap        | Best For
Fixed-size            | 512 tokens      | 50-100 tokens  | General purpose
Sentence-based        | Variable        | 1-2 sentences  | Conversational content
Paragraph-based       | Variable        | None           | Well-structured docs
Semantic              | Variable        | None           | Mixed content types
Recursive (LangChain) | 512-1024 tokens | 100-200 tokens | Code + text
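
Fixed-size chunking with overlap is the simplest row of the table to sketch. The example below works on a list of token IDs and uses integers as stand-ins for real tokens; the function name and defaults are illustrative, and a real pipeline would tokenize with the embedding model's own tokenizer.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=50):
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this chunk already reaches the end of the document
    return chunks


tokens = list(range(1200))  # stand-in for real token IDs
chunks = chunk_tokens(tokens, chunk_size=512, overlap=50)
print([len(c) for c in chunks])  # [512, 512, 276]
```

The last 50 tokens of each chunk reappear at the start of the next one, so a sentence cut at a chunk boundary is still seen whole in at least one chunk.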

Advanced RAG Techniques

1. Hybrid Search (Dense + Sparse):
   Score = α × vector_similarity + (1-α) × BM25_score
   (α in [0, 1]; normalize both scores to a common range first, since BM25 is unbounded)
   Combines semantic understanding with keyword matching

2. Re-ranking:
   Query → Retrieve top-100 → Re-rank with cross-encoder → Return top-10
   Cross-encoder is more accurate but slower (it scores each query-document pair jointly, so document scores can't be pre-computed)

3. Query Expansion:
   Original: "How to fix memory leaks?"
   Expanded: "How to fix memory leaks? memory management garbage collection heap"
   
4. Hypothetical Document Embeddings (HyDE):
   Generate hypothetical answer → embed that → search for similar real docs

5. Multi-query RAG:
   Break complex question into sub-questions
   Retrieve for each sub-question
   Synthesize final answer from all retrieved context
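
The hybrid search score from technique 1 can be sketched as follows. The min-max normalization step is an assumption added here so that cosine similarities and BM25 scores are on the same scale; the document IDs and score values are invented, and production systems often use other fusion schemes.

```python
def min_max(scores):
    """Rescale scores to [0, 1] so dense and sparse scores are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_rank(vector_scores, bm25_scores, alpha=0.5):
    """Rank documents by alpha * dense + (1 - alpha) * sparse, best first."""
    dense, sparse = min_max(vector_scores), min_max(bm25_scores)
    docs = set(dense) | set(sparse)
    fused = {
        d: alpha * dense.get(d, 0.0) + (1 - alpha) * sparse.get(d, 0.0)
        for d in docs
    }
    # Sort by score descending, breaking ties by document ID for determinism.
    return sorted(fused, key=lambda d: (-fused[d], d))


vector_scores = {"doc_a": 0.9, "doc_b": 0.8, "doc_c": 0.1}  # hypothetical cosine scores
bm25_scores = {"doc_b": 12.0, "doc_c": 9.0, "doc_d": 3.0}   # hypothetical BM25 scores
ranking = hybrid_rank(vector_scores, bm25_scores, alpha=0.5)
print(ranking)  # ['doc_b', 'doc_a', 'doc_c', 'doc_d']
```

Here doc_b wins because it scores well on both signals, even though doc_a has the highest vector similarity.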

RAG Evaluation Metrics

Metric                | What It Measures                     | Target
Retrieval Recall@K    | % of relevant docs in top-K          | >80%
Retrieval Precision@K | % of top-K that are relevant         | >60%
Faithfulness          | Does answer match retrieved context? | >90%
Answer Relevancy      | Does answer address the question?    | >85%
Context Relevancy     | Is retrieved context relevant?       | >70%
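
The two retrieval metrics are simple to compute once relevance labels exist. A minimal sketch, with made-up document IDs and a hand-labeled relevant set:

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc in top_k if doc in relevant) / len(top_k)


def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top-k."""
    if not relevant:
        return 0.0
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)


retrieved = ["d1", "d7", "d3", "d9", "d4"]  # ranked retriever output (hypothetical)
relevant = {"d1", "d3", "d5"}               # ground-truth labels (hypothetical)
p5 = precision_at_k(retrieved, relevant, k=5)  # 2 of 5 retrieved are relevant
r5 = recall_at_k(retrieved, relevant, k=5)     # 2 of 3 relevant were retrieved
```

Faithfulness and relevancy, by contrast, need a judge (human or LLM) and cannot be computed from IDs alone.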

3. Vector Databases

Why Vector Databases?

Traditional databases search by exact match or range. Vector databases search by similarity in high-dimensional space.

Traditional DB: SELECT * FROM docs WHERE title = 'refund policy'
Vector DB: Find documents semantically similar to "how do I get my money back"

Comparison of Vector Databases

Database | Type           | Strengths                   | Weaknesses               | Best For
Pinecone | Managed SaaS   | Zero-ops, fast, scalable    | Cost, vendor lock-in     | Startups, quick deployment
Weaviate | Open-source    | Hybrid search, modules      | Memory-heavy             | Multi-modal search
Milvus   | Open-source    | High performance, GPU       | Complex operations       | Large-scale production
pgvector | PostgreSQL ext | Familiar, ACID, joins       | Limited scale            | Small-medium datasets
Qdrant   | Open-source    | Rust performance, filtering | Newer ecosystem          | Filtered similarity search
Chroma   | Open-source    | Simple API, embedded        | Not for production scale | Prototyping, small apps

When to Use Which

< 1M vectors + existing PostgreSQL → pgvector
< 10M vectors + need managed service → Pinecone
> 10M vectors + need control → Milvus or Qdrant
Need hybrid search (vector + keyword) → Weaviate
Prototyping / local dev → Chroma

Scaling Vector Search

10M vectors × 1536 dimensions × 4 bytes = ~60GB raw data
With index overhead: ~100-150GB

Options:
1. Vertical scaling: Single node with enough RAM (up to ~100M vectors)
2. Sharding: Distribute vectors across nodes by partition key
3. Tiered storage: Hot vectors in memory, cold on SSD

4. Embedding Generation and Indexing

Embedding Models

Model                         | Dimensions | Context     | Speed       | Quality
OpenAI text-embedding-3-small | 1536       | 8191 tokens | Fast        | Good
OpenAI text-embedding-3-large | 3072       | 8191 tokens | Medium      | Best
Cohere embed-v3               | 1024       | 512 tokens  | Fast        | Very Good
BGE-large-en                  | 1024       | 512 tokens  | Self-hosted | Very Good
E5-mistral-7b                 | 4096       | 32K tokens  | Slow        | Excellent

Vector Index Types

HNSW (Hierarchical Navigable Small World)

Multi-layer graph structure:

Layer 2: [A] ─────────────────── [D]        (few nodes, long edges)
Layer 1: [A] ──── [B] ──── [C] ── [D]      (more nodes)
Layer 0: [A] [B] [C] [D] [E] [F] [G] [H]  (all nodes, short edges)

Search: Start at top layer, greedily descend
  - Top layers: fast coarse navigation
  - Bottom layers: precise local search

Performance:
  - Build time: O(N × log N)
  - Search time: O(log N)
  - Memory: O(N × M) where M = max connections per node
  - Recall@10: 95-99% typical

Best for: Low-latency search, moderate dataset sizes (fits in RAM)

IVF (Inverted File Index)

1. Cluster vectors into K centroids (K-means)
2. At query time, find nearest nprobe centroids
3. Search only vectors in those clusters

┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐
│Cluster 0 │  │Cluster 1 │  │Cluster 2 │  │Cluster 3 │
│ •••••    │  │ ••••     │  │ •••••••  │  │ ••       │
│ ••       │  │ •••      │  │ ••       │  │ •••••    │
└──────────┘  └──────────┘  └──────────┘  └──────────┘

Query: Find nearest centroid(s) → search only those clusters
  nprobe=1: Fast but may miss results
  nprobe=10: Slower but higher recall

Best for: Large datasets, tunable speed/accuracy tradeoff
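
The nprobe trade-off can be shown with a toy IVF index. To keep the sketch short, the centroids are supplied by hand instead of being learned with K-means, and the class name, vector IDs, and coordinates are all invented for illustration.

```python
import math


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class ToyIVF:
    """Inverted file index with fixed centroids (the K-means step is omitted)."""

    def __init__(self, centroids):
        self.centroids = centroids
        self.lists = [[] for _ in centroids]  # one posting list per cluster

    def add(self, vec_id, vec):
        nearest = min(range(len(self.centroids)),
                      key=lambda c: l2(vec, self.centroids[c]))
        self.lists[nearest].append((vec_id, vec))

    def search(self, query, k=1, nprobe=1):
        # Visit only the nprobe clusters whose centroids are closest to the query.
        order = sorted(range(len(self.centroids)),
                       key=lambda c: l2(query, self.centroids[c]))
        candidates = [item for c in order[:nprobe] for item in self.lists[c]]
        candidates.sort(key=lambda item: l2(query, item[1]))
        return [vec_id for vec_id, _ in candidates[:k]]


index = ToyIVF(centroids=[[0.0, 0.0], [10.0, 0.0]])
for vec_id, vec in [("a", [4.9, 0.0]), ("b", [1.0, 0.0]),
                    ("c", [9.0, 0.0]), ("d", [5.6, 0.0])]:
    index.add(vec_id, vec)

query = [5.1, 0.0]
fast = index.search(query, k=1, nprobe=1)      # probes one cluster: returns "d"
thorough = index.search(query, k=1, nprobe=2)  # probes both: returns "a"
```

The true nearest neighbor "a" sits just across the cluster boundary, so nprobe=1 misses it. This is the recall loss that a larger nprobe buys back at the cost of scanning more vectors.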

PQ (Product Quantization)

Compress vectors by splitting into subvectors and quantizing each:

Original: [0.1, 0.3, 0.5, 0.7, 0.2, 0.4, 0.6, 0.8]  (8 floats = 32 bytes in FP32)
Split:    [0.1, 0.3] [0.5, 0.7] [0.2, 0.4] [0.6, 0.8]  (4 subvectors)
Quantize: [code_23]  [code_45]  [code_12]  [code_67]     (4 bytes: one 1-byte codebook index per subvector)

Compression: 8× reduction (32 bytes → 4 bytes)
Trade-off: ~5-10% recall loss for 8× memory savings

Best for: Memory-constrained environments, billion-scale datasets
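
Encoding and decoding with product quantization can be sketched on the same 8-dimensional example. The codebooks here are tiny hand-written ones with two entries per subspace, purely for illustration; a real index would learn 256 entries per subspace with K-means so that each code fits in one byte.

```python
def nearest_code(subvec, codebook):
    """Index of the codebook entry closest to subvec (squared L2 distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(subvec, codebook[i])),
    )


def pq_encode(vec, codebooks):
    """Replace each subvector with the index of its nearest codebook entry."""
    d = len(vec) // len(codebooks)  # dimensions per subvector
    return [nearest_code(vec[i * d:(i + 1) * d], cb)
            for i, cb in enumerate(codebooks)]


def pq_decode(codes, codebooks):
    """Approximate reconstruction: concatenate the selected codebook entries."""
    out = []
    for code, cb in zip(codes, codebooks):
        out.extend(cb[code])
    return out


# Hypothetical codebooks: 4 subspaces, 2 entries each.
codebooks = [[[0.0, 0.25], [0.5, 0.75]] for _ in range(4)]
vec = [0.1, 0.3, 0.5, 0.7, 0.2, 0.4, 0.6, 0.8]

codes = pq_encode(vec, codebooks)     # [0, 1, 0, 1]
approx = pq_decode(codes, codebooks)  # lossy reconstruction of vec
```

Only the four small integers in `codes` are stored per vector; the reconstruction error visible in `approx` is the source of the recall loss.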

Index Selection Guide

Dataset Size | Memory Budget | Latency Requirement | Recommended
< 100K       | Any           | < 10ms              | Flat (brute force)
100K - 10M   | Fits in RAM   | < 10ms              | HNSW
1M - 100M    | Limited       | < 50ms              | IVF + PQ
100M - 1B    | Very limited  | < 100ms             | IVF + PQ + disk
> 1B         | Distributed   | < 200ms             | Sharded HNSW or IVF

5. Feature Stores

What Is a Feature Store?

A centralized system for managing, storing, and serving ML features consistently across training and inference.

┌───────────────────────────────────────────────────────────┐
│                    FEATURE STORE                          │
│                                                           │
│  ┌───────────────────┐        ┌───────────────────┐       │
│  │  Offline Store    │        │  Online Store     │       │
│  │  (S3/BigQuery)    │        │  (Redis/DynamoDB) │       │
│  │                   │        │                   │       │
│  │  Historical data  │        │  Latest values    │       │
│  │  Training jobs    │        │  Low-latency      │       │
│  │  Batch features   │        │  Real-time serving│       │
│  └─────────┬─────────┘        └─────────┬─────────┘       │
│            │                            │                 │
│            └──────────────┬─────────────┘                 │
│                           │                               │
│                   Feature Registry                        │
│                (schema, lineage, docs)                    │
└───────────────────────────────────────────────────────────┘
             ▲                            ▲
             │                            │
     Training Pipeline             Inference Service
    (batch, historical)           (real-time, latest)

Online vs Offline Features

Aspect      | Online Store          | Offline Store
Latency     | < 10ms                | Minutes to hours
Freshness   | Real-time to minutes  | Hours to days
Storage     | Redis, DynamoDB       | S3, BigQuery, Hive
Use case    | Model inference       | Model training
Data volume | Latest values only    | Full history
Cost        | Higher (fast storage) | Lower (bulk storage)

Feature Freshness Requirements

Real-time features (< 1 second):
  - Current cart contents
  - Live session behavior
  - Real-time fraud signals

Near-real-time (1 second - 1 hour):
  - User's recent clicks
  - Trending topics
  - Current inventory levels

Batch features (hours - days):
  - User lifetime value
  - Historical purchase patterns
  - Aggregated statistics

Popular Feature Stores

Tool                     | Type        | Best For
Feast                    | Open-source | Kubernetes-native, flexible
Tecton                   | Managed     | Enterprise, real-time features
Databricks Feature Store | Managed     | Databricks ecosystem
Amazon SageMaker FS      | Managed     | AWS-native ML pipelines
Hopsworks                | Open-source | Full ML platform

6. ML Model Serving Patterns

Deployment Strategies

1. A/B Testing:
┌──────────────┐     90%     ┌─────────────┐
│   Traffic    │────────────►│  Model v1   │
│   Router     │             └─────────────┘
│              │     10%     ┌─────────────┐
│              │────────────►│  Model v2   │
└──────────────┘             └─────────────┘
Measure: conversion rate, engagement, revenue

2. Canary Deployment:
Phase 1: 1% traffic → new model (monitor for errors)
Phase 2: 10% traffic → new model (monitor metrics)
Phase 3: 50% traffic → new model (compare to baseline)
Phase 4: 100% traffic → new model (full rollout)

3. Shadow Mode:
┌──────────────┐  100%  ┌─────────────┐
│   Traffic    │───────►│  Model v1   │ ──► Response to user
│              │        └─────────────┘
│              │  100%  ┌─────────────┐
│              │───────►│  Model v2   │ ──► Log only (not served)
└──────────────┘        └─────────────┘
Compare predictions offline, no user impact
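
For A/B tests and canary rollouts, the traffic router usually needs a stable split: the same user should see the same model on every request. One common way to get that is hash-based bucketing, sketched below. The salt string, user ID format, and function name are assumptions for the example.

```python
import hashlib


def assign_variant(user_id, treatment_percent, salt="model-v2-rollout"):
    """Deterministically map a user to 'v1' or 'v2' based on a hash bucket in [0, 100)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "v2" if bucket < treatment_percent else "v1"


# Canary phases: raising the percentage only ever moves users from v1 to v2,
# because each user's bucket never changes.
phase_1 = assign_variant("user-42", treatment_percent=1)
phase_2 = assign_variant("user-42", treatment_percent=10)
phase_4 = assign_variant("user-42", treatment_percent=100)  # always "v2"
```

Changing the salt reshuffles the buckets, which is useful when a new experiment should not reuse the previous experiment's treatment group.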

Model Serving Infrastructure

┌───────────────────────────────────────────────────┐
│              Model Serving Platform               │
├───────────────────────────────────────────────────┤
│                                                   │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐   │
│  │ Model A    │  │ Model B    │  │ Model C    │   │
│  │ (v1.2)     │  │ (v3.0)     │  │ (v1.0)     │   │
│  │ GPU: 1×A10 │  │ GPU: 4×A100│  │ CPU only   │   │
│  └────────────┘  └────────────┘  └────────────┘   │
│                                                   │
│  Features:                                        │
│  • Auto-scaling (based on QPS/latency)            │
│  • Model versioning and rollback                  │
│  • A/B testing and canary                         │
│  • Monitoring (latency, accuracy, drift)          │
│  • Batching for throughput                        │
│                                                   │
│  Tools: TorchServe, Triton, Seldon, KServe        │
└───────────────────────────────────────────────────┘

Monitoring ML Models in Production

Metric             | What to Monitor           | Alert Threshold
Latency (p50, p99) | Inference time            | p99 > 200ms
Throughput         | Requests/second           | < expected baseline
Error rate         | Failed predictions        | > 0.1%
Data drift         | Input distribution shift  | KL divergence > threshold
Prediction drift   | Output distribution shift | Significant change
Feature freshness  | Staleness of features     | > SLA
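
A data drift check based on KL divergence can be sketched for a single binned feature. The bin proportions and the 0.1 alert threshold below are invented for illustration; in practice the threshold is tuned per feature, and the small `eps` is only there to avoid division by zero on empty bins.

```python
import math


def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


training = [0.25, 0.25, 0.25, 0.25]    # feature histogram at training time
live_stable = [0.25, 0.25, 0.25, 0.25]
live_shifted = [0.70, 0.10, 0.10, 0.10]

DRIFT_THRESHOLD = 0.1  # hypothetical alert threshold

stable_score = kl_divergence(live_stable, training)
shifted_score = kl_divergence(live_shifted, training)
alert = shifted_score > DRIFT_THRESHOLD
```

The stable window scores essentially zero, while the shifted window scores about 0.45 and would trigger the alert.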

7. Recommendation System Architecture

Approaches

┌───────────────────────────────────────────────────────────────┐
│              RECOMMENDATION PIPELINE                          │
│                                                               │
│  Stage 1: Candidate Generation (1M → 1000 items)              │
│  ┌──────────────────────────────────────────────────┐         │
│  │ • Collaborative Filtering (user-item matrix)     │         │
│  │ • Content-Based (item similarity)                │         │
│  │ • Knowledge Graph (entity relationships)         │         │
│  │ • Popular/Trending (fallback)                    │         │
│  └──────────────────────────────────────────────────┘         │
│           │                                                   │
│           ▼                                                   │
│  Stage 2: Scoring/Ranking (1000 → 100 items)                  │
│  ┌──────────────────────────────────────────────────┐         │
│  │ • Deep learning model (features + embeddings)    │         │
│  │ • Multi-objective: relevance, diversity, recency │         │
│  │ • Business rules (boost promoted content)        │         │
│  └──────────────────────────────────────────────────┘         │
│           │                                                   │
│           ▼                                                   │
│  Stage 3: Re-ranking/Filtering (100 → 20 items)               │
│  ┌──────────────────────────────────────────────────┐         │
│  │ • Diversity injection (MMR)                      │         │
│  │ • Freshness boost                                │         │
│  │ • Content policy filtering                       │         │
│  │ • Already-seen deduplication                     │         │
│  └──────────────────────────────────────────────────┘         │
│           │                                                   │
│           ▼                                                   │
│  Final: Serve top-20 recommendations                          │
└───────────────────────────────────────────────────────────────┘

Collaborative Filtering

User-Item Matrix:
         Item1  Item2  Item3  Item4  Item5
User A:  [5     3      ?      1      ?   ]
User B:  [4     ?      4      1      ?   ]
User C:  [?     1      ?      5      4   ]
User D:  [1     ?      5      4      ?   ]

Matrix Factorization (ALS/SVD):
  User matrix (N×K) × Item matrix (K×M) ≈ Rating matrix (N×M)
  K = latent factors (typically 50-200)

  Predict User A's rating for Item3:
  = dot_product(user_A_embedding, item_3_embedding)
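
Once the factor matrices are learned, prediction is just a dot product. The sketch below uses K=2 and hand-picked embedding values so the arithmetic is easy to follow; real embeddings come out of ALS or SVD training, which is not shown here.

```python
def predict_rating(user_vec, item_vec):
    """Predicted rating = dot product of user and item latent factors."""
    return sum(u * v for u, v in zip(user_vec, item_vec))


def recommend(user_vec, unrated_items, top_n=1):
    """Rank a user's unrated items by predicted rating, highest first."""
    ranked = sorted(unrated_items,
                    key=lambda name: predict_rating(user_vec, unrated_items[name]),
                    reverse=True)
    return ranked[:top_n]


user_a = [1.0, 0.5]  # hypothetical K=2 embedding for User A
unrated = {          # hypothetical embeddings for the items User A has not rated
    "Item3": [3.0, 2.0],
    "Item5": [1.0, 1.0],
}

item3_score = predict_rating(user_a, unrated["Item3"])  # 1.0*3.0 + 0.5*2.0 = 4.0
top = recommend(user_a, unrated, top_n=1)               # ["Item3"]
```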

Content-Based Filtering

Item features → Item embedding → Similarity

User profile = weighted average of liked item embeddings
Recommendation = items most similar to user profile

Advantages: No cold-start for items (features available immediately)
Disadvantages: Limited discovery (recommends similar to past)

Hybrid Approach (Production Standard)

Final_Score = α × collaborative_score
            + β × content_score
            + γ × popularity_score
            + δ × recency_score
            + business_rules_boost

Weights learned via online A/B testing

Cold Start Solutions

Problem                    | Solution
New user (no history)      | Popular items, demographic-based, onboarding quiz
New item (no interactions) | Content-based features, explore/exploit
New system (no data)       | Editorial curation, import external signals

8. Cost Optimization for LLM Inference

The Cost Problem

GPT-4 API pricing (2024):
  Input: $30 per 1M tokens
  Output: $60 per 1M tokens

At scale (1M requests/day, averaging 1000 input and 1000 output tokens each):
  Daily cost: 1M × 1000 tokens × ($30+$60)/1M = $90,000/day
  Monthly: $2.7M

Self-hosted (8×H100 for 70B model):
  Hardware: ~$250K (or $30K/month cloud rental, about $1,000/day)
  Can serve: ~100 requests/second
  Cost per request: ~$0.003 at roughly 330K requests/day, lower at higher utilization (vs $0.09 for GPT-4)
  Break-even: ~10K requests/day ($1,000/day ÷ $0.09 per request ≈ 11K)
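
The cost arithmetic can be reproduced with a short sketch. It assumes each request uses 1000 input and 1000 output tokens (the split that yields the $90,000/day figure) and treats the $30K/month rental as a flat $1,000/day regardless of load; the function name and parameters are illustrative.

```python
def daily_api_cost(requests_per_day, input_tokens, output_tokens,
                   input_price_per_m, output_price_per_m):
    """Daily API spend in dollars, given per-request token counts and per-1M-token prices."""
    tokens_cost = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return requests_per_day * tokens_cost / 1_000_000


api_daily = daily_api_cost(1_000_000, 1000, 1000, 30.0, 60.0)  # 90000.0
api_per_request = daily_api_cost(1, 1000, 1000, 30.0, 60.0)    # 0.09

rental_per_day = 30_000 / 30                                   # $30K/month cloud rental
break_even_requests = rental_per_day / api_per_request         # roughly 11K requests/day
```

Below the break-even volume the API is cheaper; above it the fixed rental cost is spread over enough requests that self-hosting wins, ignoring engineering and operations overhead.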

Optimization Strategies

1. Prompt Caching

Many requests share common system prompts or context:

Without caching:
  Request 1: [System prompt (2000 tokens)] + [User query (100 tokens)]
  Request 2: [System prompt (2000 tokens)] + [User query (150 tokens)]
  Total: 4250 tokens processed

With prefix caching (vLLM, Anthropic):
  Request 1: [System prompt (2000 tokens)] + [User query (100 tokens)]
  Request 2: [CACHED prefix] + [User query (150 tokens)]
  Total: 2250 tokens processed (47% savings)

2. Model Distillation

Teacher (large model) → Student (small model)

GPT-4 (expensive, accurate) generates training data
Fine-tune GPT-3.5 or Llama-7B on that data
Deploy smaller model for 90% of requests

Cost reduction: 10-50× for similar quality on specific tasks

3. Quantization

FP32 → FP16 → INT8 → INT4

Model: LLaMA 70B (weights only, excluding KV cache and activations)
  FP32: 280GB (4× A100 80GB; does not fit on a single GPU)
  FP16: 140GB (2× A100 80GB)
  INT8:  70GB (1× A100 80GB)
  INT4:  35GB (1× A100 40GB or 2× 24GB consumer GPUs)

Quality loss:
  FP16: ~0% loss
  INT8: 1-3% loss
  INT4 (GPTQ/AWQ): 3-8% loss (acceptable for many tasks)
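
The memory figures follow directly from bytes per parameter. The sketch below covers weights only; KV cache, activations, and framework overhead come on top, so the GPU counts are lower bounds. The helper names are made up for the example.

```python
import math

BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}


def weight_memory_gb(num_params, precision):
    """Memory needed for model weights alone, in GB (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


def min_gpus_for_weights(num_params, precision, gpu_memory_gb):
    """Smallest GPU count whose combined memory holds the weights."""
    return math.ceil(weight_memory_gb(num_params, precision) / gpu_memory_gb)


params_70b = 70e9
sizes = {p: weight_memory_gb(params_70b, p) for p in BYTES_PER_PARAM}
# sizes: FP32 280.0, FP16 140.0, INT8 70.0, INT4 35.0

gpus_fp16 = min_gpus_for_weights(params_70b, "FP16", gpu_memory_gb=80)  # 2
gpus_int4 = min_gpus_for_weights(params_70b, "INT4", gpu_memory_gb=24)  # 2
```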

4. Routing and Cascading

┌──────────────┐
│ User Request │
└──────┬───────┘
       │
       ▼
┌──────────────┐     Simple query?     ┌─────────────┐
│   Router     │──────────────────────►│ Small Model │ (cheap)
│ (classifier) │                       │ (7B/13B)    │
│              │                       └─────────────┘
│              │     Complex query?    ┌─────────────┐
│              │──────────────────────►│ Large Model │ (expensive)
│              │                       │ (70B/GPT-4) │
└──────────────┘                       └─────────────┘

Result: 70% of queries handled by cheap model
Average cost reduction: 50-70% (bounded by the share of traffic the cheap model absorbs)

5. Semantic Caching

Cache LLM responses for semantically similar queries:

Query: "What is the capital of France?"
→ Generate response, cache with embedding

Query: "What's France's capital city?"
→ Embedding similarity > 0.95 → return cached response

Hit rate: 10-30% for customer support, FAQ-style queries
Savings: Proportional to hit rate
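
A minimal semantic cache can be sketched with a linear scan over stored embeddings. The three-dimensional vectors below are hand-written stand-ins for real embedding model output, and the class and method names are invented for the example; a production cache would use a vector index instead of a scan and would also need expiry and invalidation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """Return a cached response when a new query embedding is close enough to an old one."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def put(self, embedding, response):
        self.entries.append((embedding, response))

    def get(self, embedding):
        best = max(self.entries, key=lambda e: cosine(embedding, e[0]), default=None)
        if best is not None and cosine(embedding, best[0]) > self.threshold:
            return best[1]
        return None  # cache miss: caller should invoke the LLM and then put()


cache = SemanticCache(threshold=0.95)
cache.put([0.90, 0.10, 0.00], "Paris")     # "What is the capital of France?"

hit = cache.get([0.88, 0.12, 0.01])        # "What's France's capital city?"
miss = cache.get([0.00, 0.20, 0.90])       # an unrelated question
```

The threshold controls the trade-off: set it too low and users get answers to questions they did not ask, too high and the hit rate collapses.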

Cost Comparison Table

Optimization          | Effort | Cost Reduction    | Quality Impact
Prompt caching        | Low    | 20-50%            | None
Semantic caching      | Medium | 10-30%            | Minimal
Model routing         | Medium | 50-70%            | 1-5% on routed queries
Quantization (INT8)   | Low    | 50% compute       | 1-3%
Distillation          | High   | 80-95%            | 5-15% (task-specific)
Batching optimization | Low    | 30-50% throughput | None

System Design Interview: AI/ML Questions

Common Questions and Approaches

Question                             | Key Components
"Design a chatbot"                   | RAG pipeline, LLM serving, conversation memory, guardrails
"Design a recommendation system"     | Candidate gen, ranking, feature store, A/B testing
"Design a search engine with AI"     | Embedding search, hybrid retrieval, re-ranking
"Design an AI code assistant"        | RAG over codebase, streaming, context window management
"Design a content moderation system" | Classification model, human-in-loop, confidence thresholds
"Design a fraud detection system"    | Feature store, real-time scoring, model monitoring

Interview Cheat Sheet

When interviewer asks...          | Key points to mention
"How to serve an LLM?"            | Model sharding, KV cache, continuous batching, vLLM
"How to reduce hallucination?"    | RAG pipeline, grounding, citation, confidence scores
"Vector DB choice?"               | pgvector for small, Pinecone for managed, Milvus for scale
"How to handle LLM costs?"        | Routing, caching, distillation, quantization
"Cold start for recommendations?" | Content-based features, popular items, explore/exploit
"How to evaluate RAG quality?"    | Retrieval recall, faithfulness, answer relevancy
"Feature freshness?"              | Online store (Redis) for real-time, offline for batch
"How to deploy new model safely?" | Shadow mode → canary → A/B test → full rollout