AI/ML System Design: Complete Guide for 2025-2026 Interviews

Overview

AI/ML system design is now a critical interview topic. With the explosion of LLMs, RAG systems, and AI-powered features, companies expect candidates to understand how to build, serve, and scale ML systems in production.

This section covers: LLM serving, RAG pipelines, vector databases, embeddings, feature stores, model serving patterns, recommendation systems, and cost optimization.

1. LLM Serving Architecture

The Challenge

LLMs are massive (7B-405B+ parameters) and computationally expensive. Serving them requires specialized infrastructure.

Model Sharding Strategies

Single GPU Memory: 80GB (A100/H100) or 141GB (H200)
Model Size: LLaMA 70B = ~140GB in FP16

Strategy 1: Tensor Parallelism (within a node)
┌─────────────────────────────────────────┐
│ Single Layer split across GPUs           │
│                                          │
│ GPU 0: [W_0:n/2]  GPU 1: [W_n/2:n]     │
│         ↕ AllReduce ↕                    │
│ Each GPU computes partial result         │
│ Combine via AllReduce                    │
└─────────────────────────────────────────┘
Best for: Latency-sensitive (reduces per-layer time)
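
As a rough illustration of the split in the diagram, the sketch below simulates two "GPUs" in plain Python. It assumes a row-wise split of the weight matrix, so each device produces a partial sum and an all-reduce adds the partials together. The function names are hypothetical, and real systems use collective-communication libraries rather than Python lists.

```python
def matvec(x, W):
    """Compute y = x @ W for a vector x (length n) and a matrix W (n rows, m columns)."""
    n, m = len(W), len(W[0])
    return [sum(x[i] * W[i][j] for i in range(n)) for j in range(m)]


def all_reduce_sum(partials):
    """Element-wise sum of per-device partial outputs (stand-in for AllReduce)."""
    return [sum(vals) for vals in zip(*partials)]


def tensor_parallel_matvec(x, W, num_devices=2):
    """Shard W by rows (and x to match), compute partial results, then all-reduce."""
    n = len(W)
    shard = (n + num_devices - 1) // num_devices  # rows per device, rounded up
    partials = []
    for d in range(num_devices):
        lo, hi = d * shard, min((d + 1) * shard, n)
        if lo >= hi:
            continue  # more devices than rows: this device holds nothing
        partials.append(matvec(x[lo:hi], W[lo:hi]))
    return all_reduce_sum(partials)


x = [1.0, 2.0, 3.0, 4.0]
W = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [1.0, -1.0]]
print(tensor_parallel_matvec(x, W))  # same result as matvec(x, W)
```

Each device only touches half of the weights, which is why per-layer latency drops, at the price of one communication step per sharded layer.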

Strategy 2: Pipeline Parallelism (across nodes)
┌──────────┐   ┌──────────┐   ┌──────────┐
│ GPU 0     │──►│ GPU 1     │──►│ GPU 2     │
│ Layers 0-9│   │Layers10-19│   │Layers20-29│
└──────────┘   └──────────┘   └──────────┘
Best for: Throughput (pipeline multiple requests)

Strategy 3: Data Parallelism (replicas)
┌──────────┐  ┌──────────┐  ┌──────────┐
│ Replica 0 │  │ Replica 1 │  │ Replica 2 │
│ Full Model│  │ Full Model│  │ Full Model│
└──────────┘  └──────────┘  └──────────┘
     ↑              ↑              ↑
     └──── Load Balancer ──────────┘
Best for: Scaling throughput when each replica can hold the full model

KV Cache

Problem: Without a cache, every decoding step recomputes keys and values for all previous tokens

Without KV Cache:
  Token 1: compute K, V and attention over [1]
  Token 2: compute K, V and attention over [1, 2]
  Token 3: compute K, V and attention over [1, 2, 3]  ← redundant!
  ...
  Token N: compute K, V and attention over [1, 2, ..., N]  ← O(N²) K/V computations in total

With KV Cache:
  Token 1: compute K1, V1, store in cache
  Token 2: compute K2, V2, attend to cached [K1,V1] + [K2,V2]
  Token 3: compute K3, V3, attend to cached [K1,V1,K2,V2] + [K3,V3]
  ...
  Token N: compute only KN, VN, attend to all cached  ← O(N) K/V computations in total; attention is still O(N) per token

KV Cache size per request:
  = 2 × num_layers × hidden_dim × seq_length × bytes_per_param
  
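  Hypothetical large dense model (128 layers, 12288 hidden, 8K context, FP16, standard multi-head attention):
  = 2 × 128 × 12288 × 8192 × 2 bytes ≈ 52GB per request!
  
  This is why long context is expensive. Grouped-query attention (GQA) shrinks the cache by the ratio num_kv_heads / num_heads.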
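
The formula above can be checked with a few lines of Python. This is a minimal sketch: the optional `num_heads` / `num_kv_heads` parameters are an added assumption to show how grouped-query attention would shrink the cache, and the 96/8 head counts in the usage line are illustrative only.

```python
def kv_cache_bytes(num_layers, hidden_dim, seq_length, bytes_per_param=2,
                   num_heads=None, num_kv_heads=None):
    """KV cache size for one request.

    The factor 2 covers keys and values. If num_heads and num_kv_heads are given,
    the per-token K/V width is scaled by num_kv_heads / num_heads (GQA assumption).
    """
    kv_dim = hidden_dim
    if num_heads and num_kv_heads:
        kv_dim = hidden_dim * num_kv_heads // num_heads
    return 2 * num_layers * kv_dim * seq_length * bytes_per_param


full = kv_cache_bytes(num_layers=128, hidden_dim=12288, seq_length=8192)
print(f"{full / 1e9:.1f} GB")  # 51.5 GB, the ~52GB figure from the text

gqa = kv_cache_bytes(128, 12288, 8192, num_heads=96, num_kv_heads=8)
print(f"{gqa / 1e9:.1f} GB")   # 4.3 GB with the illustrative 96/8 head split
```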

Batching Strategies

Static Batching:
  Wait for N requests, process together
  Problem: all requests wait for longest sequence

Continuous Batching (vLLM, TGI):
  ┌─────────────────────────────────────────────────┐
  │ Iteration 1: [Req1_tok5, Req2_tok3, Req3_tok1]  │
  │ Iteration 2: [Req1_tok6, Req2_tok4, Req3_tok2]  │
  │ Iteration 3: [Req1_DONE, Req2_tok5, Req3_tok3]  │
  │ Iteration 4: [Req4_tok1, Req2_tok6, Req3_tok4]  │ ← Req4 joins!
  └─────────────────────────────────────────────────┘
  
  Requests join/leave batch dynamically
  No waiting for slowest request
  2-5× throughput improvement over static batching
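
To make the difference concrete, here is a toy simulation that counts decode iterations under both policies. It is a simplified sketch: it assumes one token per request per iteration, ignores prefill cost and memory limits, and the function names are hypothetical rather than taken from vLLM or TGI.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes before the next batch starts."""
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))


def continuous_batching_steps(lengths, batch_size):
    """Finished requests free their slot; waiting requests join at the next iteration."""
    queue = deque(lengths)
    active = []  # tokens still to generate for each in-flight request
    steps = 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.popleft())      # fill free slots from the queue
        active = [r - 1 for r in active]        # one decode iteration: one token each
        active = [r for r in active if r > 0]   # drop requests that just finished
        steps += 1
    return steps


output_lengths = [2, 8, 3, 1, 4, 2]  # tokens to generate per request
print(static_batching_steps(output_lengths, batch_size=3))      # 12
print(continuous_batching_steps(output_lengths, batch_size=3))  # 8
```

In this toy workload the long request (8 tokens) no longer blocks the short ones, so the same GPU finishes all six requests in fewer iterations.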

Serving Stack

┌─────────────────────────────────────────────────┐
│                   API Gateway                     │
│            (rate limiting, auth)                  │
├─────────────────────────────────────────────────┤
│              Request Router                       │
│    (model selection, prompt routing)             │
├─────────────────────────────────────────────────┤
│           Inference Engine                        │
│    (vLLM / TensorRT-LLM / TGI)                  │
│    • Continuous batching                         │
│    • PagedAttention (KV cache management)        │
│    • Speculative decoding                        │
├─────────────────────────────────────────────────┤
│              GPU Cluster                          │
│    (A100/H100/H200, NVLink, InfiniBand)         │
└─────────────────────────────────────────────────┘

2. RAG (Retrieval-Augmented Generation) Pipeline

Why RAG?

LLMs have knowledge cutoffs and hallucinate. RAG grounds responses in actual documents.

Without RAG:
  User: "What's our refund policy?"
  LLM: "I think your refund policy is..." (hallucination)

With RAG:
  User: "What's our refund policy?"
  System: [retrieves actual policy document]
  LLM: "According to your policy document, refunds are..." (grounded)

Complete RAG Architecture

┌─────────────────────────────────────────────────────────────┐
│                     INGESTION PIPELINE                        │
│                                                               │
│  Documents ──► Chunking ──► Embedding ──► Vector DB          │
│  (PDF,HTML,    (512-1024    (OpenAI,      (Pinecone,         │
│   Markdown)    tokens)      Cohere)       Weaviate)          │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│                     QUERY PIPELINE                            │
│                                                               │
│  User Query                                                   │
│      │                                                        │
│      ▼                                                        │
│  Query Embedding ──► Vector Search ──► Top-K Chunks          │
│      │                                      │                 │
│      │              ┌───────────────────────┘                 │
│      ▼              ▼                                         │
│  ┌──────────────────────────────────┐                        │
│  │ Prompt: "Given this context:      │                        │
│  │ {retrieved_chunks}                │                        │
│  │ Answer this question:             │                        │
│  │ {user_query}"                     │                        │
│  └──────────────────────────────────┘                        │
│      │                                                        │
│      ▼                                                        │
│  LLM Response (grounded in retrieved context)                │
└─────────────────────────────────────────────────────────────┘

Chunking Strategies

Strategy	Chunk Size	Overlap	Best For
Fixed-size	512 tokens	50-100 tokens	General purpose
Sentence-based	Variable	1-2 sentences	Conversational content
Paragraph-based	Variable	None	Well-structured docs
Semantic	Variable	None	Mixed content types
Recursive (LangChain)	512-1024	100-200	Code + text
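
The fixed-size row can be sketched as follows. This is a minimal example that assumes the document is already tokenized into a list; whitespace splitting stands in for a real tokenizer, and `chunk_tokens` is a hypothetical helper name.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=64):
    """Split a token list into fixed-size chunks; consecutive chunks share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this chunk already reaches the end of the document
    return chunks


text = "refunds are issued within 14 days of purchase if the item is unused"
for chunk in chunk_tokens(text.split(), chunk_size=6, overlap=2):
    print(" ".join(chunk))
```

The overlap keeps sentences that straddle a boundary retrievable from at least one chunk, at the cost of storing some tokens twice.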

Advanced RAG Techniques

1. Hybrid Search (Dense + Sparse):
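   Score = α × vector_similarity + (1-α) × BM25_score
   Normalize both scores to a common range first (BM25 is unbounded), or fuse ranks instead (e.g., Reciprocal Rank Fusion)
   Combines semantic understanding with keyword matching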

2. Re-ranking:
   Query → Retrieve top-100 → Re-rank with cross-encoder → Return top-10
   Cross-encoder scores each (query, document) pair jointly: more accurate but slower, since document representations can't be pre-computed

3. Query Expansion:
   Original: "How to fix memory leaks?"
   Expanded: "How to fix memory leaks? memory management garbage collection heap"
   
4. Hypothetical Document Embeddings (HyDE):
   Generate hypothetical answer → embed that → search for similar real docs

5. Multi-query RAG:
   Break complex question into sub-questions
   Retrieve for each sub-question
   Synthesize final answer from all retrieved context
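
As an illustration of technique 1, the sketch below fuses dense and sparse scores with the α-weighted formula. It assumes min-max normalization to bring both score types into [0, 1] before mixing, which is one common choice and not the only one. The document IDs and scores are made up.

```python
def min_max(scores):
    """Rescale a {doc_id: score} dict to [0, 1]; constant inputs map to 0.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_scores(vector_scores, bm25_scores, alpha=0.5):
    """Score = alpha * dense + (1 - alpha) * sparse, on normalized scores.

    A document missing from one retriever contributes 0.0 for that retriever.
    Returns (doc_id, score) pairs sorted from best to worst.
    """
    dense, sparse = min_max(vector_scores), min_max(bm25_scores)
    docs = set(dense) | set(sparse)
    fused = {d: alpha * dense.get(d, 0.0) + (1 - alpha) * sparse.get(d, 0.0)
             for d in docs}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)


vector_scores = {"a": 0.9, "b": 0.5, "c": 0.1}  # cosine similarities
bm25_scores = {"a": 2.0, "b": 12.0, "d": 7.0}   # unbounded keyword scores
for doc, score in hybrid_scores(vector_scores, bm25_scores, alpha=0.5):
    print(doc, round(score, 3))
```

Here "b" ranks first because it is reasonably strong on both signals, even though "a" wins on vector similarity alone.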

RAG Evaluation Metrics

Metric	What It Measures	Target
Retrieval Recall@K	% of relevant docs in top-K	>80%
Retrieval Precision@K	% of top-K that are relevant	>60%
Faithfulness	Does answer match retrieved context?	>90%
Answer Relevancy	Does answer address the question?	>85%
Context Relevancy	Is retrieved context relevant?	>70%
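
The two retrieval metrics in the table are simple to compute once relevance labels exist. A minimal sketch, assuming `retrieved` is a ranked list of document IDs and `relevant` is a labeled set (both hypothetical inputs):

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top-k results."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k results that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return len(set(top_k) & set(relevant)) / len(top_k)


retrieved = ["d1", "d7", "d3", "d9", "d4"]  # ranked retriever output
relevant = {"d1", "d3", "d5"}               # ground-truth labels
print(recall_at_k(retrieved, relevant, k=3))     # 2 of 3 relevant docs found
print(precision_at_k(retrieved, relevant, k=5))  # 2 of 5 results relevant
```

Faithfulness and answer relevancy need a judge (human or model) and cannot be computed from IDs alone.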

3. Vector Databases

Why Vector Databases?

Traditional databases search by exact match or range. Vector databases search by similarity in high-dimensional space.

Traditional DB: SELECT * FROM docs WHERE title = 'refund policy'
Vector DB: Find documents semantically similar to "how do I get my money back"

Comparison of Vector Databases

Database	Type	Strengths	Weaknesses	Best For
Pinecone	Managed SaaS	Zero-ops, fast, scalable	Cost, vendor lock-in	Startups, quick deployment
Weaviate	Open-source	Hybrid search, modules	Memory-heavy	Multi-modal search
Milvus	Open-source	High performance, GPU	Complex operations	Large-scale production
pgvector	PostgreSQL ext	Familiar, ACID, joins	Limited scale	Small-medium datasets
Qdrant	Open-source	Rust performance, filtering	Newer ecosystem	Filtered similarity search
Chroma	Open-source	Simple API, embedded	Not for production scale	Prototyping, small apps

When to Use Which

< 1M vectors + existing PostgreSQL → pgvector
< 10M vectors + need managed service → Pinecone
> 10M vectors + need control → Milvus or Qdrant
Need hybrid search (vector + keyword) → Weaviate
Prototyping / local dev → Chroma

Scaling Vector Search

10M vectors × 1536 dimensions × 4 bytes = ~60GB raw data
With index overhead: ~100-150GB

Options:
1. Vertical scaling: Single node with enough RAM (up to ~100M vectors, usually with quantization or lower-dimensional embeddings)
2. Sharding: Distribute vectors across nodes by partition key
3. Tiered storage: Hot vectors in memory, cold on SSD

4. Embedding Generation and Indexing

Embedding Models

Model	Dimensions	Context	Speed	Quality
OpenAI text-embedding-3-small	1536	8191 tokens	Fast	Good
OpenAI text-embedding-3-large	3072	8191 tokens	Medium	Best
Cohere embed-v3	1024	512 tokens	Fast	Very Good
BGE-large-en	1024	512 tokens	Hardware-dependent (self-hosted)	Very Good
E5-mistral-7b	4096	32K tokens	Slow	Excellent

Vector Index Types

HNSW (Hierarchical Navigable Small World)

Multi-layer graph structure:

Layer 2: [A] ─────────────────── [D]        (few nodes, long edges)
Layer 1: [A] ──── [B] ──── [C] ── [D]      (more nodes)
Layer 0: [A] [B] [C] [D] [E] [F] [G] [H]  (all nodes, short edges)

Search: Start at top layer, greedily descend
  - Top layers: fast coarse navigation
  - Bottom layers: precise local search

Performance:
  - Build time: O(N × log N)
  - Search time: O(log N)
  - Memory: O(N × M) where M = max connections per node
  - Recall@10: 95-99% typical

Best for: Low-latency search, moderate dataset sizes (fits in RAM)

IVF (Inverted File Index)

1. Cluster vectors into K centroids (K-means)
2. At query time, find nearest nprobe centroids
3. Search only vectors in those clusters

┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐
│Cluster 0 │  │Cluster 1 │  │Cluster 2 │  │Cluster 3 │
│ •••••    │  │ ••••     │  │ •••••••  │  │ ••       │
│ ••       │  │ •••      │  │ ••       │  │ •••••    │
└─────────┘  └─────────┘  └─────────┘  └─────────┘

Query: Find nearest centroid(s) → search only those clusters
  nprobe=1: Fast but may miss results
  nprobe=10: Slower but higher recall

Best for: Large datasets, tunable speed/accuracy tradeoff
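
The three IVF steps can be sketched in plain Python. This is a toy version under stated assumptions: a few iterations of Lloyd's k-means seeded with the first K vectors (real libraries use better initialization), Euclidean distance, and a hypothetical `IVFIndex` class that is not any library's API.

```python
import math


def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(vectors, k, iters=10):
    """Plain Lloyd's algorithm; the first k vectors seed the centroids (a simplification)."""
    centroids = [list(v) for v in vectors[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda c: dist(v, centroids[c]))
            clusters[nearest].append(v)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


class IVFIndex:
    def __init__(self, vectors, k):
        self.centroids = kmeans(vectors, k)
        self.lists = [[] for _ in range(k)]  # inverted lists: cluster -> [(id, vector)]
        for i, v in enumerate(vectors):
            c = min(range(k), key=lambda j: dist(v, self.centroids[j]))
            self.lists[c].append((i, v))

    def search(self, query, top_k=1, nprobe=1):
        """Scan only the nprobe clusters whose centroids are closest to the query."""
        order = sorted(range(len(self.centroids)),
                       key=lambda j: dist(query, self.centroids[j]))
        candidates = [item for c in order[:nprobe] for item in self.lists[c]]
        candidates.sort(key=lambda item: dist(query, item[1]))
        return [i for i, _ in candidates[:top_k]]


vectors = [[0, 0], [10, 10], [0, 1], [1, 0], [10, 11], [11, 10]]
index = IVFIndex(vectors, k=2)
print(index.search([9, 9], top_k=1, nprobe=1))  # scans one cluster only
```

Raising `nprobe` to the number of clusters turns the search back into brute force, which is the speed/recall dial described above.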

PQ (Product Quantization)

Compress vectors by splitting into subvectors and quantizing each:

Original: [0.1, 0.3, 0.5, 0.7, 0.2, 0.4, 0.6, 0.8]  (32 bytes in FP32)
Split:    [0.1, 0.3] [0.5, 0.7] [0.2, 0.4] [0.6, 0.8]  (4 subvectors)
Quantize: [code_23]  [code_45]  [code_12]  [code_67]     (4 bytes!)

Compression: 8× reduction (32 bytes → 4 bytes)
Trade-off: ~5-10% recall loss for 8× memory savings

Best for: Memory-constrained environments, billion-scale datasets
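
The encode/decode steps can be sketched with the same 8-dimensional vector. The codebook below is hand-written for illustration; in practice each subspace gets its own codebook learned with k-means, with up to 256 centroids so that each code fits in one byte.

```python
def encode(vector, codebooks):
    """Replace each subvector with the index of its nearest centroid in that subspace."""
    sub_dim = len(vector) // len(codebooks)
    codes = []
    for m, codebook in enumerate(codebooks):
        sub = vector[m * sub_dim:(m + 1) * sub_dim]
        nearest = min(
            range(len(codebook)),
            key=lambda c: sum((a - b) ** 2 for a, b in zip(sub, codebook[c])),
        )
        codes.append(nearest)
    return codes


def decode(codes, codebooks):
    """Approximate reconstruction: concatenate the selected centroids."""
    return [x for code, codebook in zip(codes, codebooks) for x in codebook[code]]


# One tiny illustrative codebook, reused for all 4 subspaces (an assumption for brevity).
codebook = [[0.0, 0.25], [0.25, 0.5], [0.5, 0.75], [0.75, 1.0]]
codebooks = [codebook] * 4

original = [0.1, 0.3, 0.5, 0.7, 0.2, 0.4, 0.6, 0.8]
codes = encode(original, codebooks)
print(codes)                     # one small integer per subvector
print(decode(codes, codebooks))  # lossy approximation of the original
```

Only the 4 codes are stored per vector; the reconstruction error is the recall cost mentioned in the trade-off line.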

Index Selection Guide

Dataset Size	Memory Budget	Latency Requirement	Recommended
< 100K	Any	< 10ms	Flat (brute force)
100K - 10M	Fits in RAM	< 10ms	HNSW
1M - 100M	Limited	< 50ms	IVF + PQ
100M - 1B	Very limited	< 100ms	IVF + PQ + disk
> 1B	Distributed	< 200ms	Sharded HNSW or IVF

5. Feature Stores

What Is a Feature Store?

A centralized system for managing, storing, and serving ML features consistently across training and inference.

┌─────────────────────────────────────────────────────────┐
│                    FEATURE STORE                          │
│                                                           │
│  ┌─────────────────┐        ┌─────────────────┐         │
│  │  Offline Store   │        │  Online Store    │         │
│  │  (S3/BigQuery)   │        │  (Redis/DynamoDB) │        │
│  │                   │        │                   │        │
│  │  Historical data  │        │  Latest values    │        │
│  │  Training jobs    │        │  Low-latency      │        │
│  │  Batch features   │        │  Real-time serving │       │
│  └────────┬──────────┘        └────────┬──────────┘       │
│           │                             │                  │
│           └──────────┬──────────────────┘                  │
│                      │                                     │
│              Feature Registry                              │
│           (schema, lineage, docs)                          │
└─────────────────────────────────────────────────────────┘
         ▲                              ▲
         │                              │
    Training Pipeline              Inference Service
    (batch, historical)            (real-time, latest)

Online vs Offline Features

Aspect	Online Store	Offline Store
Latency	< 10ms	Minutes to hours
Freshness	Real-time to minutes	Hours to days
Storage	Redis, DynamoDB	S3, BigQuery, Hive
Use case	Model inference	Model training
Data volume	Latest values only	Full history
Cost	Higher (fast storage)	Lower (bulk storage)
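
The online/offline split can be illustrated with a tiny in-memory sketch. `TinyFeatureStore` is a hypothetical class, not the API of Feast or any other product. The `as_of` lookup is an added illustration of how training jobs might read historical values without seeing later data.

```python
class TinyFeatureStore:
    def __init__(self):
        self.offline = []  # full history: (entity_id, feature, timestamp, value)
        self.online = {}   # latest only: (entity_id, feature) -> (timestamp, value)

    def write(self, entity_id, feature, timestamp, value):
        """Append to the offline history and update the online value if it is newer."""
        self.offline.append((entity_id, feature, timestamp, value))
        key = (entity_id, feature)
        if key not in self.online or timestamp >= self.online[key][0]:
            self.online[key] = (timestamp, value)

    def get_online(self, entity_id, feature):
        """Low-latency path for inference: a single key lookup."""
        entry = self.online.get((entity_id, feature))
        return entry[1] if entry else None

    def get_historical(self, entity_id, feature, as_of):
        """Training path: the value that was current at time `as_of`."""
        rows = [(ts, v) for e, f, ts, v in self.offline
                if e == entity_id and f == feature and ts <= as_of]
        return max(rows, key=lambda r: r[0])[1] if rows else None


store = TinyFeatureStore()
store.write("user_1", "clicks_7d", timestamp=100, value=3)
store.write("user_1", "clicks_7d", timestamp=200, value=5)
print(store.get_online("user_1", "clicks_7d"))                 # 5
print(store.get_historical("user_1", "clicks_7d", as_of=150))  # 3
```

Both paths read from the same writes, which is the consistency between training and inference that a feature store is meant to provide.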

Feature Freshness Requirements

Real-time features (< 1 second):
  - Current cart contents
  - Live session behavior
  - Real-time fraud signals

Near-real-time (1 second - 1 hour):
  - User's recent clicks
  - Trending topics
  - Current inventory levels

Batch features (hours - days):
  - User lifetime value
  - Historical purchase patterns
  - Aggregated statistics

Popular Feature Stores

Tool	Type	Best For
Feast	Open-source	Kubernetes-native, flexible
Tecton	Managed	Enterprise, real-time features
Databricks Feature Store	Managed	Databricks ecosystem
Amazon SageMaker FS	Managed	AWS-native ML pipelines
Hopsworks	Open-source	Full ML platform

6. ML Model Serving Patterns

Deployment Strategies

1. A/B Testing:
┌──────────────┐     90%     ┌─────────────┐
│   Traffic     │────────────►│  Model v1    │
│   Router      │             └─────────────┘
│               │     10%     ┌─────────────┐
│               │────────────►│  Model v2    │
└──────────────┘             └─────────────┘
Measure: conversion rate, engagement, revenue

2. Canary Deployment:
Phase 1: 1% traffic → new model (monitor for errors)
Phase 2: 10% traffic → new model (monitor metrics)
Phase 3: 50% traffic → new model (compare to baseline)
Phase 4: 100% traffic → new model (full rollout)

3. Shadow Mode:
┌──────────────┐  100%  ┌─────────────┐ ──► Response to user
│   Traffic     │───────►│  Model v1    │
│               │        └─────────────┘
│               │  100%  ┌─────────────┐ ──► Log only (not served)
│               │───────►│  Model v2    │
└──────────────┘        └─────────────┘
Compare predictions offline, no user impact
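
The traffic splits used by A/B tests and canary phases are often implemented with a stable hash of the user ID. A minimal sketch follows, assuming SHA-256 bucketing into 100 buckets; the salt and the model names are placeholders.

```python
import hashlib


def bucket(user_id, salt="exp-1"):
    """Map a user to a stable bucket in [0, 100) so assignment is sticky across requests."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def route(user_id, new_model_percent, salt="exp-1"):
    """Send users whose bucket falls below the rollout percentage to the new model."""
    return "model_v2" if bucket(user_id, salt) < new_model_percent else "model_v1"


users = [f"user{i}" for i in range(10_000)]
for percent in (1, 10, 50, 100):  # the canary phases from the text
    share = sum(route(u, percent) == "model_v2" for u in users) / len(users)
    print(f"{percent}% target -> {share:.1%} observed")
```

Because the bucket depends only on the user ID and salt, a user who saw the new model at 1% keeps seeing it at 10% and 50%, which keeps the experience consistent during rollout.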

Model Serving Infrastructure

┌─────────────────────────────────────────────────┐
│              Model Serving Platform               │
├─────────────────────────────────────────────────┤
│                                                   │
│  ┌───────────┐  ┌───────────┐  ┌───────────┐   │
│  │ Model A    │  │ Model B    │  │ Model C    │   │
│  │ (v1.2)     │  │ (v3.0)     │  │ (v1.0)     │   │
│  │ GPU: 1×A10 │  │ GPU: 4×A100│  │ CPU only   │   │
│  └───────────┘  └───────────┘  └───────────┘   │
│                                                   │
│  Features:                                        │
│  • Auto-scaling (based on QPS/latency)           │
│  • Model versioning and rollback                 │
│  • A/B testing and canary                        │
│  • Monitoring (latency, accuracy, drift)         │
│  • Batching for throughput                       │
│                                                   │
│  Tools: TorchServe, Triton, Seldon, KServe      │
└─────────────────────────────────────────────────┘

Monitoring ML Models in Production

Metric	What to Monitor	Alert Threshold
Latency (p50, p99)	Inference time	p99 > 200ms
Throughput	Requests/second	< expected baseline
Error rate	Failed predictions	> 0.1%
Data drift	Input distribution shift	KL divergence > threshold
Prediction drift	Output distribution shift	Significant change
Feature freshness	Staleness of features	> SLA
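
The data-drift row can be made concrete with a small KL-divergence check over histogram buckets. This is a sketch: the bucket counts are invented, the 0.1 threshold is a placeholder that would need tuning per feature, and the smoothing constant is an assumption to avoid division by zero.

```python
import math


def kl_divergence(p, q, eps=1e-9):
    """KL(P || Q) for two histograms given as equal-length lists of counts or probabilities."""
    p = [x + eps for x in p]  # smoothing so empty buckets do not produce log(0)
    q = [x + eps for x in q]
    sp, sq = sum(p), sum(q)
    return sum((a / sp) * math.log((a / sp) / (b / sq)) for a, b in zip(p, q))


def drift_alert(live_counts, baseline_counts, threshold=0.1):
    """True when the live input distribution has moved away from the training baseline."""
    return kl_divergence(live_counts, baseline_counts) > threshold


baseline = [90, 10]  # e.g. share of desktop vs mobile requests at training time
live = [50, 50]      # the same buckets measured in production
print(round(kl_divergence(live, baseline), 3))
print(drift_alert(live, baseline))
```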

7. Recommendation System Architecture

Approaches

┌─────────────────────────────────────────────────────────────┐
│              RECOMMENDATION PIPELINE                          │
│                                                               │
│  Stage 1: Candidate Generation (1M → 1000 items)            │
│  ┌─────────────────────────────────────────────────┐        │
│  │ • Collaborative Filtering (user-item matrix)     │        │
│  │ • Content-Based (item similarity)                │        │
│  │ • Knowledge Graph (entity relationships)         │        │
│  │ • Popular/Trending (fallback)                    │        │
│  └─────────────────────────────────────────────────┘        │
│           │                                                   │
│           ▼                                                   │
│  Stage 2: Scoring/Ranking (1000 → 100 items)                │
│  ┌─────────────────────────────────────────────────┐        │
│  │ • Deep learning model (features + embeddings)    │        │
│  │ • Multi-objective: relevance, diversity, recency │        │
│  │ • Business rules (boost promoted content)        │        │
│  └─────────────────────────────────────────────────┘        │
│           │                                                   │
│           ▼                                                   │
│  Stage 3: Re-ranking/Filtering (100 → 20 items)             │
│  ┌─────────────────────────────────────────────────┐        │
│  │ • Diversity injection (MMR)                      │        │
│  │ • Freshness boost                               │        │
│  │ • Content policy filtering                      │        │
│  │ • Already-seen deduplication                    │        │
│  └─────────────────────────────────────────────────┘        │
│           │                                                   │
│           ▼                                                   │
│  Final: Serve top-20 recommendations                         │
└─────────────────────────────────────────────────────────────┘
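
Stage 3 mentions diversity injection with MMR (Maximal Marginal Relevance). A minimal greedy sketch follows, assuming cosine similarity between item embeddings and a trade-off weight `lam`. The items, relevance scores, and embeddings are made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def mmr(candidates, relevance, embeddings, k, lam=0.7):
    """Greedily pick items maximizing lam * relevance - (1 - lam) * max similarity to picks."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(item):
            redundancy = max(
                (cosine(embeddings[item], embeddings[s]) for s in selected),
                default=0.0,
            )
            return lam * relevance[item] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


relevance = {"a": 0.9, "b": 0.85, "c": 0.6}
embeddings = {"a": [1.0, 0.0], "b": [0.99, 0.1], "c": [0.0, 1.0]}  # b is a near-duplicate of a
print(mmr(["a", "b", "c"], relevance, embeddings, k=2, lam=0.5))  # ['a', 'c']
print(mmr(["a", "b", "c"], relevance, embeddings, k=2, lam=1.0))  # ['a', 'b']
```

With `lam=1.0` the function degenerates to pure relevance ranking; lowering it trades some relevance for variety.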

Collaborative Filtering

User-Item Matrix:
         Item1  Item2  Item3  Item4  Item5
User A:  [5     3      ?      1      ?   ]
User B:  [4     ?      4      1      ?   ]
User C:  [?     1      ?      5      4   ]
User D:  [1     ?      5      4      ?   ]

Matrix Factorization (ALS/SVD):
  User matrix (N×K) × Item matrix (K×M) ≈ Rating matrix (N×M)
  K = latent factors (typically 50-200)
  
  Predict User A's rating for Item3:
  = dot_product(user_A_embedding, item_3_embedding)

Content-Based Filtering

Item features → Item embedding → Similarity

User profile = weighted average of liked item embeddings
Recommendation = items most similar to user profile

Advantages: No cold-start for items (features available immediately)
Disadvantages: Limited discovery (recommends similar to past)

Hybrid Approach (Production Standard)

Final_Score = α × collaborative_score 
            + β × content_score 
            + γ × popularity_score
            + δ × recency_score
            + business_rules_boost

Weights tuned offline, then validated via online A/B testing

Cold Start Solutions

Problem	Solution
New user (no history)	Popular items, demographic-based, onboarding quiz
New item (no interactions)	Content-based features, explore/exploit
New system (no data)	Editorial curation, import external signals

8. Cost Optimization for LLM Inference

The Cost Problem

GPT-4 API pricing (2024):
  Input: $30 per 1M tokens
  Output: $60 per 1M tokens

At scale (1M requests/day, averaging 1000 input + 1000 output tokens each):
  Daily cost: 1M × 1000 tokens × ($30+$60)/1M = $90,000/day
  Monthly: $2.7M

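Self-hosted (8×H100 for 70B model):
  Hardware: ~$250K (or $30K/month cloud rental)
  Can serve: ~100 requests/second at peak (depends heavily on sequence lengths and batching)
  Cost per request: ~$0.001 at 1M requests/day ($30K / 30M requests), vs $0.09 for GPT-4
  Break-even: ~11K requests/day ($30K/month ÷ $0.09 per request)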
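
The arithmetic behind these figures can be reproduced in a few lines. The sketch assumes each request uses 1000 input and 1000 output tokens (the reading under which the $90,000/day figure works out), a 30-day month, and a fixed $30K/month rental. It ignores engineering and operations cost.

```python
def api_cost_per_request(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Dollar cost of one request given per-million-token prices."""
    return (input_tokens * input_price_per_m + output_tokens * output_price_per_m) / 1_000_000


def break_even_requests_per_day(monthly_fixed_cost, api_cost, days=30):
    """Daily volume at which a fixed monthly bill equals the pay-per-request API bill."""
    return monthly_fixed_cost / api_cost / days


per_request = api_cost_per_request(1000, 1000, 30.0, 60.0)
print(f"API cost per request: ${per_request:.2f}")           # $0.09
print(f"Daily at 1M requests: ${per_request * 1_000_000:,.0f}")  # $90,000
print(f"Self-hosted per request at 1M/day: ${30_000 / (1_000_000 * 30):.4f}")  # $0.0010
print(f"Break-even: {break_even_requests_per_day(30_000, per_request):,.0f} requests/day")
```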

Optimization Strategies

1. Prompt Caching

Many requests share common system prompts or context:

Without caching:
  Request 1: [System prompt (2000 tokens)] + [User query (100 tokens)]
  Request 2: [System prompt (2000 tokens)] + [User query (150 tokens)]
  Total: 4250 tokens processed

With prefix caching (vLLM, Anthropic):
  Request 1: [System prompt (2000 tokens)] + [User query (100 tokens)]
  Request 2: [CACHED prefix] + [User query (150 tokens)]
  Total: 2250 tokens processed (47% savings)

2. Model Distillation

Teacher (large model) → Student (small model)

GPT-4 (expensive, accurate) generates training data
Fine-tune GPT-3.5 or Llama-7B on that data
Deploy smaller model for 90% of requests

Cost reduction: 10-50× for similar quality on specific tasks

3. Quantization

FP32 → FP16 → INT8 → INT4

Model: LLaMA 70B
  FP32: 280GB (impossible on a single GPU; needs 4× A100 80GB)
  FP16: 140GB (2× A100 80GB)
  INT8:  70GB (1× A100 80GB)
  INT4:  35GB (1× A100 40GB, or 2× 24GB consumer GPUs)

Quality loss:
  FP16: ~0% loss
  INT8 (e.g., LLM.int8(), SmoothQuant): 1-3% loss
  INT4 (e.g., GPTQ, AWQ): 3-8% loss (acceptable for many tasks)
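
The memory figures follow directly from parameter count times bits per weight, and the basic idea of quantization fits in a few lines. The sketch below uses symmetric absmax INT8 quantization of a plain list as an illustration. Methods such as GPTQ or AWQ are considerably more sophisticated, and this code is not a substitute for them.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone (no KV cache or activations), in GB."""
    return num_params * bits_per_param / 8 / 1e9


def quantize_int8(weights):
    """Symmetric absmax quantization: map floats to integers in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # fall back to 1.0 if all weights are zero
    return [round(w / scale) for w in weights], scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integer codes."""
    return [q * scale for q in quantized]


for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(70e9, bits):.0f} GB")  # 280, 140, 70, 35

weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
print(q)  # [50, -127, 0, 100]
print(dequantize(q, scale))
```

The rounding error per weight is at most half the scale, which is where the small quality loss comes from.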

4. Routing and Cascading

┌──────────────┐
│ User Request  │
└──────┬───────┘
       │
       ▼
┌──────────────┐     Simple query?     ┌─────────────┐
│   Router      │─────────────────────►│ Small Model  │ (cheap)
│  (classifier) │                       │ (7B/13B)     │
│               │     Complex query?    └─────────────┘
│               │─────────────────────►┌─────────────┐
│               │                       │ Large Model  │ (expensive)
└──────────────┘                       │ (70B/GPT-4)  │
                                       └─────────────┘

Result: 70% of queries handled by cheap model
Average cost reduction: 50-70% (bounded by the share of traffic routed to the small model)
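
A rough sketch of the routing idea and its cost effect follows. The router is a toy keyword and length heuristic standing in for a trained classifier, and the hint list, word limit, and 10× price gap are all illustrative assumptions.

```python
COMPLEX_HINTS = ("explain", "analyze", "compare", "step by step", "write code")


def route_query(query, max_simple_words=15):
    """Toy router: long queries or ones with 'complex' hints go to the large model."""
    text = query.lower()
    if len(text.split()) > max_simple_words or any(h in text for h in COMPLEX_HINTS):
        return "large"
    return "small"


def blended_savings(share_small, cost_small, cost_large):
    """Fractional cost reduction versus sending every query to the large model."""
    blended = share_small * cost_small + (1 - share_small) * cost_large
    return 1 - blended / cost_large


print(route_query("What are your opening hours?"))                             # small
print(route_query("Compare these two contracts and explain the differences"))  # large
print(f"{blended_savings(0.7, cost_small=0.1, cost_large=1.0):.0%}")           # 63%
```

Even with a free small model, savings cannot exceed the share of traffic it handles, so router accuracy and the routed share matter more than the exact price gap.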

5. Semantic Caching

Cache LLM responses for semantically similar queries:

Query: "What is the capital of France?"
→ Generate response, cache with embedding

Query: "What's France's capital city?"
→ Embedding similarity > 0.95 → return cached response

Hit rate: 10-30% for customer support, FAQ-style queries
Savings: Proportional to hit rate
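
A semantic cache can be sketched as a list of (embedding, response) pairs with a similarity threshold. The `embed` function is injected by the caller; the lookup table below is a stand-in with made-up vectors, since a real deployment would call an embedding model and use a vector index instead of a linear scan.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


class SemanticCache:
    def __init__(self, embed, threshold=0.95):
        self.embed = embed          # callable: str -> list[float]
        self.threshold = threshold  # minimum similarity to count as a hit
        self.entries = []           # (embedding, cached response)

    def put(self, query, response):
        self.entries.append((self.embed(query), response))

    def get(self, query):
        """Return the cached response of the most similar stored query, or None on a miss."""
        q = self.embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best is not None and cosine(q, best[0]) > self.threshold:
            return best[1]
        return None


# Stand-in embeddings (made up); a real system would call an embedding model here.
TOY_VECTORS = {
    "What is the capital of France?": [1.0, 0.0, 0.1],
    "What's France's capital city?": [0.98, 0.05, 0.12],
    "How do I reset my password?": [0.0, 1.0, 0.0],
}

cache = SemanticCache(embed=TOY_VECTORS.__getitem__)
cache.put("What is the capital of France?", "Paris")
print(cache.get("What's France's capital city?"))  # Paris (cache hit)
print(cache.get("How do I reset my password?"))    # None (miss, call the LLM)
```

The threshold is the main tuning knob: too low and users get answers to a different question, too high and the hit rate collapses.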

Cost Comparison Table

Optimization	Effort	Cost Reduction	Quality Impact
Prompt caching	Low	20-50%	None
Semantic caching	Medium	10-30%	Minimal
Model routing	Medium	50-70%	1-5% on routed queries
Quantization (INT8)	Low	~50% memory vs FP16	1-3%
Distillation	High	80-95%	5-15% (task-specific)
Batching optimization	Low	50-80% (2-5× throughput)	None

System Design Interview: AI/ML Questions

Common Questions and Approaches

Question	Key Components
"Design a chatbot"	RAG pipeline, LLM serving, conversation memory, guardrails
"Design a recommendation system"	Candidate gen, ranking, feature store, A/B testing
"Design a search engine with AI"	Embedding search, hybrid retrieval, re-ranking
"Design an AI code assistant"	RAG over codebase, streaming, context window management
"Design a content moderation system"	Classification model, human-in-loop, confidence thresholds
"Design a fraud detection system"	Feature store, real-time scoring, model monitoring

Interview Cheat Sheet

When interviewer asks...	Key points to mention
"How to serve an LLM?"	Model sharding, KV cache, continuous batching, vLLM
"How to reduce hallucination?"	RAG pipeline, grounding, citation, confidence scores
"Vector DB choice?"	pgvector for small, Pinecone for managed, Milvus for scale
"How to handle LLM costs?"	Routing, caching, distillation, quantization
"Cold start for recommendations?"	Content-based features, popular items, explore/exploit
"How to evaluate RAG quality?"	Retrieval recall, faithfulness, answer relevancy
"Feature freshness?"	Online store (Redis) for real-time, offline for batch
"How to deploy new model safely?"	Shadow mode → canary → A/B test → full rollout