Trade-offs & Alternatives

Part 7 of 10

CDN Network - Trade-offs and Alternatives

Push vs Pull CDN Models

Pull CDN (Origin-Pull)

How it works:
1. Client requests content from CDN edge
2. Edge checks local cache
3. On cache miss: edge fetches from origin server
4. Edge caches response and serves to client
5. Subsequent requests served from cache until TTL expires
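
The five steps above can be sketched as a small pull-through cache. This is a minimal illustration, not any vendor's implementation: `PullEdgeCache`, its `fetch_from_origin` callback, and the injectable `clock` are hypothetical names chosen for the example.

```python
import time
from typing import Callable, Dict, Tuple


class PullEdgeCache:
    """Toy origin-pull cache: fetch on miss, serve from cache until the TTL expires."""

    def __init__(self, fetch_from_origin: Callable[[str], bytes],
                 ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch_from_origin      # hypothetical origin client
        self._ttl = ttl_seconds
        self._clock = clock                  # injectable so expiry can be tested
        self._store: Dict[str, Tuple[bytes, float]] = {}  # url -> (body, expires_at)

    def get(self, url: str) -> Tuple[bytes, str]:
        now = self._clock()
        entry = self._store.get(url)
        if entry is not None and now < entry[1]:
            return entry[0], "HIT"           # step 5: served from cache
        body = self._fetch(url)              # step 3: origin round-trip on miss
        self._store[url] = (body, now + self._ttl)  # step 4: cache, then serve
        return body, "MISS"
```

The first `get` for a URL pays the origin round-trip; later calls return `"HIT"` until `ttl_seconds` has elapsed, which is exactly the first-request penalty listed under the disadvantages.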

Advantages:
+ Origin is always the source of truth
+ No manual content management required
+ Automatic cache population based on demand
+ Storage efficient: only caches what users actually request
+ Simple operational model for content publishers
+ Works well with dynamic content (short TTLs)

Disadvantages:
- First request for any object is slow (origin round-trip)
- Cache miss storms on TTL expiry for popular content
- Origin must handle burst of cache-fill requests
- Cold cache after PoP restart or new PoP deployment
- Unpredictable origin load patterns

Best for:
- Websites with large content catalogs (long tail)
- Content that changes frequently
- Organizations without dedicated CDN operations teams
- Use cases where freshness is more important than latency

Real-world examples: CloudFront, Cloudflare, Fastly (default mode)

Push CDN (Origin-Push)

How it works:
1. Content publisher uploads content directly to CDN storage
2. CDN distributes content to edge locations proactively
3. Client requests are always served from cache
4. Publisher responsible for updating/invalidating content

Advantages:
+ Every request is a cache hit (no cold start)
+ Predictable, consistent latency for all users
+ Origin servers not needed for serving (CDN is the origin)
+ Better for large files (pre-positioned, no timeout risk)
+ Controlled bandwidth usage (upload during off-peak)

Disadvantages:
- Publisher must manage content lifecycle explicitly
- Storage costs for pre-positioning across all PoPs
- Stale content risk if invalidation fails
- Complex workflow for content updates
- Doesn't scale well for millions of unique URLs
- Wasted storage for content nobody requests

Best for:
- Video on demand (pre-position popular titles)
- Software distribution (OS updates, game patches)
- Static sites with known, finite content sets
- Live event preparation (pre-warm before broadcast)

Real-world examples: Netflix Open Connect, Akamai NetStorage

Hybrid Approach (Production Reality)

Most production CDNs use a hybrid model:
- Pull for general web content (images, CSS, JS, API responses)
- Push for known high-demand content (video catalog, software updates)
- Pre-warm (push) for anticipated traffic (product launches, events)
- Pull with origin shield to reduce origin load

Decision matrix:
| Factor              | Use Pull        | Use Push          |
|---------------------|-----------------|-------------------|
| Content volume      | >1M unique URLs | <100K unique URLs |
| Update frequency    | Frequent        | Infrequent        |
| Access pattern      | Long tail       | Head-heavy        |
| File size           | <10 MB          | >100 MB           |
| Latency tolerance   | First-hit delay OK | Zero delay required |
| Operational model   | Hands-off       | Active management |

Anycast vs GeoDNS Routing

Anycast Routing

How it works:
- Same IP address announced via BGP from every PoP
- Internet routing (BGP) naturally directs packets to nearest PoP
- No DNS-level routing decisions needed
- Single IP serves global traffic

Technical implementation:
- Each PoP announces the same prefix (typically a /24 for IPv4 or a /48 for IPv6)
- BGP best-path selection picks the route, usually the shortest AS path once local policy has been applied
- Failover: withdraw BGP announcement from unhealthy PoP
- Traffic shifts to next-nearest PoP automatically
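
A toy model of the failover behavior described above. It assumes route choice depends only on AS-path length, which is a deliberate simplification of real BGP best-path selection; the PoP names and path lengths are made up for illustration.

```python
from typing import Dict, Optional, Set


def select_pop(as_path_lengths: Dict[str, int], announcing: Set[str]) -> Optional[str]:
    """Pick the announcing PoP with the shortest AS path (ties broken by name)."""
    candidates = [(length, pop)
                  for pop, length in as_path_lengths.items()
                  if pop in announcing]
    return min(candidates)[1] if candidates else None


# Hypothetical AS-path lengths as seen from one client network.
paths = {"fra": 2, "lhr": 3, "iad": 5}
announcing = {"fra", "lhr", "iad"}

before = select_pop(paths, announcing)   # nearest PoP while all are healthy
announcing.discard("fra")                # unhealthy PoP withdraws its announcement
after = select_pop(paths, announcing)    # traffic shifts to the next-nearest PoP
```

Here `before` is `"fra"` and `after` is `"lhr"`: no DNS change is involved, only the withdrawal.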

Advantages:
+ Single IP address simplifies DNS configuration
+ Automatic failover via BGP (no DNS TTL delay)
+ Inherent DDoS resilience (attack distributed across PoPs)
+ DNS answer is the same for every client, so it can be cached with long TTLs
+ Works for UDP protocols (DNS, QUIC) where TCP state isn't an issue
+ Failover time: 30-90 seconds (BGP convergence)

Disadvantages:
- Limited control over routing decisions
- BGP routing may not choose lowest-latency path
- Established TCP connections can be reset (RST) when a BGP route change moves them to a different PoP
- Cannot route based on content type or customer
- Debugging routing issues is complex
- Some ISPs have suboptimal BGP policies
- Cannot easily do weighted traffic splitting

Mitigation for TCP issues:
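- Use QUIC/HTTP3 where possible: reconnects are cheap, though connection migration mainly covers client address changes, so a shift to another PoP still needs shared state or a new connection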
- Implement connection draining before BGP withdrawal
- Use ECMP (Equal-Cost Multi-Path) for gradual shifts

GeoDNS (Geographic DNS Routing)

How it works:
- Authoritative DNS server estimates the client's location (from the resolver's IP, or from EDNS Client Subnet when the resolver sends it)
- Returns IP address of nearest/best PoP for that location
- Different clients get different IP addresses
- Each PoP has unique IP addresses

Technical implementation:
- Authoritative DNS servers with geolocation database
- EDNS Client Subnet (ECS) for client-subnet accuracy instead of resolver-level accuracy
- Health checks determine which PoPs are available
- Weighted responses for load balancing within a region
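
The health-check and weighting logic above might look like the following sketch. The region table, the documentation-range IP addresses, the weights, and the `FALLBACK_REGION` choice are all assumptions for illustration; a real authoritative server would derive the region from a geolocation database or ECS data.

```python
import random
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical region -> [(PoP IP, weight)] table.
POPS_BY_REGION: Dict[str, List[Tuple[str, int]]] = {
    "eu": [("192.0.2.10", 80), ("192.0.2.20", 20)],
    "us": [("198.51.100.10", 100)],
}
FALLBACK_REGION = "us"  # assumption: where traffic goes when a region has no healthy PoP


def resolve(region: str, healthy: Set[str], rng: random.Random) -> Optional[str]:
    """Return one healthy PoP IP for the region, chosen by weight."""
    for candidate_region in (region, FALLBACK_REGION):
        pool = [(ip, weight)
                for ip, weight in POPS_BY_REGION.get(candidate_region, [])
                if ip in healthy]               # health checks filter the pool
        if pool:
            ips, weights = zip(*pool)
            return rng.choices(ips, weights=weights, k=1)[0]
    return None                                 # nothing healthy anywhere
```

Changing the weights is what enables the gradual traffic shifts and canary deployments listed under the advantages below.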

Advantages:
+ Fine-grained control over traffic routing
+ Can route based on customer, content type, or load
+ Weighted routing for gradual traffic shifts
+ Easy A/B testing and canary deployments
+ Can implement latency-based routing (not just geo)
+ Supports complex failover policies
+ Per-customer routing decisions possible

Disadvantages:
- DNS TTL creates failover delay (30-300 seconds)
- DNS caching at resolvers may serve stale records
- EDNS Client Subnet not universally supported
- More complex DNS infrastructure required
- Multiple IP addresses to manage and monitor
- DNS resolution adds latency to first request
- Resolver location != user location (VPNs, public DNS)

Accuracy challenges:
- Google Public DNS (8.8.8.8): user may be far from resolver
- Without ECS: route based on resolver location, not user
- VPN users: appear in VPN exit location, not actual location
- Mobile users: IP geolocation less accurate

Comparison Matrix

| Factor                   | Anycast                 | GeoDNS                               |
|--------------------------|-------------------------|--------------------------------------|
| Failover speed           | 30-90s (BGP)            | 30-300s (DNS TTL)                    |
| Routing accuracy         | Network-level           | Application-level                    |
| DDoS resilience          | Excellent (distributed) | Good (can be targeted)               |
| Operational complexity   | High (BGP expertise)    | Medium (DNS management)              |
| Granularity of control   | Low                     | High                                 |
| Protocol support         | All (L3 routing)        | Any protocol addressed by DNS name   |
| TCP connection stability | Risk during failover    | Stable (each PoP has its own IP)     |
| Cost                     | Higher (BGP transit)    | Lower (DNS only)                     |

Production recommendation:

Use both together (industry standard):
- Anycast for DNS resolution itself (fast, resilient DNS)
- GeoDNS logic within anycast DNS to return best edge IP
- Anycast for HTTP/3 (QUIC) traffic, where re-establishing a connection after a route change is cheap
- GeoDNS for HTTP/1.1 and HTTP/2 (TCP stability)

Cache-Everything vs Selective Caching

Cache-Everything Approach

Philosophy: Cache all responses by default, opt-out for specific paths

Implementation:
- Default TTL for all responses (e.g., 1 hour)
- Override with Cache-Control headers from origin
- Cache POST responses only where the endpoint is known to be safe and read-only (e.g., search or query APIs that happen to use POST)
- Cache error responses (negative caching) with short TTL

Advantages:
+ Maximum origin offload
+ Simple mental model ("everything is cached")
+ Catches cacheable content that wasn't explicitly marked
+ Better performance for forgotten/misconfigured resources
+ Reduces origin infrastructure costs significantly

Disadvantages:
- Risk of caching personalized/sensitive content
- Stale data for dynamic content without proper headers
- Cache poisoning risk if cache key isn't comprehensive
- Debugging issues: "why am I seeing old content?"
- Storage waste for truly uncacheable content
- Privacy concerns (cached authenticated responses)

Safeguards needed:
- Never cache responses with Set-Cookie headers
- Never cache responses to requests with Authorization header (unless explicit)
- Strip/ignore query params that don't affect response
- Implement cache tags for targeted invalidation
- Monitor for accidental PII caching
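
A minimal sketch of the first three safeguards. The `IGNORED_PARAMS` set is an assumed example list, and treating only `public` as the explicit opt-in for authorized requests is a simplification of what HTTP caching rules allow. Sorting query parameters also assumes their order does not affect the response.

```python
from typing import Mapping
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list of parameters that never change the response body.
IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid"}


def _header(headers: Mapping[str, str], name: str) -> str:
    """Case-insensitive header lookup; returns '' when the header is absent."""
    return next((v for k, v in headers.items() if k.lower() == name), "")


def may_cache(request_headers: Mapping[str, str],
              response_headers: Mapping[str, str]) -> bool:
    if _header(response_headers, "set-cookie"):
        return False                       # never cache Set-Cookie responses
    if _header(request_headers, "authorization"):
        directives = {d.strip().split("=")[0].lower()
                      for d in _header(response_headers, "cache-control").split(",")}
        return "public" in directives      # only with an explicit opt-in
    return True


def cache_key(url: str) -> str:
    """Normalize a URL: lowercase host, drop ignored params, sort the rest."""
    parts = urlsplit(url)
    kept = sorted((k, v)
                  for k, v in parse_qsl(parts.query, keep_blank_values=True)
                  if k not in IGNORED_PARAMS)
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path,
                       urlencode(kept), ""))
```

Stripping parameters raises the hit ratio, but dropping one that does affect the response produces wrong cache hits, so the ignore list should be reviewed per application.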

Selective Caching Approach

Philosophy: Only cache content explicitly marked as cacheable

Implementation:
- Require explicit Cache-Control: public, max-age=N
- Pass through anything without cache headers
- Whitelist specific path patterns for caching
- Default behavior: proxy without caching

Advantages:
+ No risk of caching sensitive/personalized content
+ Origin has full control over what's cached
+ Simpler debugging (cache behavior is explicit)
+ No stale data surprises
+ Better for compliance-sensitive applications

Disadvantages:
- Lower cache hit ratio (many cacheable things not marked)
- Higher origin load (more pass-through requests)
- Requires origin developers to set proper headers
- Missed optimization opportunities
- More expensive to operate (more origin capacity needed)

When to use:
- Financial/healthcare applications (compliance requirements)
- Highly personalized content (e-commerce with user context)
- APIs with authentication (risk of response leakage)
- Early-stage products (before caching strategy is mature)

Recommended Hybrid

Tier 1 (Always cache): Static assets with versioned URLs
  /static/*, /assets/*, *.js?v=*, *.css?v=*
  TTL: 1 year (immutable)

Tier 2 (Cache with revalidation): Semi-static content
  /images/*, /fonts/*, /media/*
  TTL: 1 day, stale-while-revalidate: 1 hour

Tier 3 (Short cache): Dynamic but shared content
  /api/public/*, /feed/*, /catalog/*
  TTL: 10-60 seconds

Tier 4 (No cache): Personalized/sensitive content
  /api/user/*, /account/*, /checkout/*
  Cache-Control: private, no-store
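
One way to express the four tiers as a path-to-header mapping. This is a sketch under assumptions: the versioned-query patterns (`*.js?v=*`) are omitted because it matches on the path only, Tier 3 uses 30 seconds as one value from the 10-60 second range, and unmatched paths fail closed to `private, no-store`.

```python
from fnmatch import fnmatchcase

# (path patterns, Cache-Control value), checked in order.
TIER_RULES = [
    (("/static/*", "/assets/*"),
     "public, max-age=31536000, immutable"),                 # Tier 1: 1 year
    (("/images/*", "/fonts/*", "/media/*"),
     "public, max-age=86400, stale-while-revalidate=3600"),  # Tier 2: 1 day + 1 hour
    (("/api/public/*", "/feed/*", "/catalog/*"),
     "public, max-age=30"),                                  # Tier 3: short cache
    (("/api/user/*", "/account/*", "/checkout/*"),
     "private, no-store"),                                   # Tier 4: never cache
]
DEFAULT_POLICY = "private, no-store"  # assumption: unknown paths are not cached


def cache_control_for(path: str) -> str:
    """Return the Cache-Control value for the first tier whose pattern matches."""
    for patterns, policy in TIER_RULES:
        if any(fnmatchcase(path, pattern) for pattern in patterns):
            return policy
    return DEFAULT_POLICY
```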

TTL-Based vs Event-Based Invalidation

TTL-Based Invalidation

How it works:
- Each cached object has an expiration time
- After TTL expires, next request triggers revalidation
- Origin confirms freshness (304) or sends new content (200)
- No active invalidation needed

Advantages:
+ Simple to implement and understand
+ No invalidation infrastructure needed
+ Self-healing: stale content eventually refreshes
+ Origin load is easy to reason about (roughly one revalidation per object per edge per TTL period)
+ Works without any coordination between systems

Disadvantages:
- Content can be stale for up to TTL duration
- Short TTLs increase origin load
- Long TTLs risk serving outdated content
- No way to force immediate update
- TTL is a guess (how long will content be valid?)
- Thundering herd on popular content TTL expiry

Optimization: stale-while-revalidate
- Serve stale content immediately
- Revalidate in background asynchronously
- User gets fast response, content refreshes behind the scenes
- Eliminates latency penalty of revalidation
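
The decision an edge makes for each cached object can be written as a small function. The state names are invented for this sketch; `max_age_s` and `swr_s` stand for the `max-age` and `stale-while-revalidate` values in seconds.

```python
def freshness_state(age_s: float, max_age_s: float, swr_s: float) -> str:
    """Classify a cached object by its age in seconds."""
    if age_s < max_age_s:
        return "fresh"                        # serve from cache, no origin contact
    if age_s < max_age_s + swr_s:
        return "stale-serve-and-revalidate"   # serve stale now, refresh in background
    return "expired-fetch-synchronously"      # too old: client waits for the origin
```

With `max_age_s=3600` and `swr_s=300`, an object can be served for up to 3900 seconds after it was fetched, so the stale window adds to the worst-case staleness rather than fitting inside the TTL.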

Event-Based Invalidation (Purge)

How it works:
- Content cached with long TTL (hours/days/forever)
- When content changes, publish invalidation event
- CDN purges affected objects from all edges
- Next request fetches fresh content from origin

Advantages:
+ Content is fresh within seconds of a change (purge on change)
+ Can use very long TTLs (better hit ratio)
+ Immediate consistency when needed
+ Precise control over what's invalidated
+ Lower origin load (fewer revalidation requests)

Disadvantages:
- Requires invalidation infrastructure (pub/sub, queues)
- Purge propagation takes time (1-30 seconds globally)
- Brief inconsistency window during propagation
- Complexity: must track what to purge when data changes
- Purge storms can overwhelm the system
- Risk of over-purging (invalidating too much)
- Cost: purge APIs often have rate limits and charges

Implementation patterns:
- Webhook on CMS publish → trigger CDN purge API
- Database change data capture → purge affected URLs
- Cache tags: tag objects with logical groups, purge by tag
- Surrogate keys: purge all objects with a given key
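
The cache-tag pattern boils down to a reverse index from tag to URLs. The sketch below shows it for a single in-memory cache; `TaggedCache` and the tag names are hypothetical, and a real CDN would also have to propagate the purge to every edge.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Set


class TaggedCache:
    """In-memory cache with a tag -> URLs index for purge-by-tag."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}
        self._urls_by_tag: Dict[str, Set[str]] = defaultdict(set)

    def put(self, url: str, body: bytes, tags: Iterable[str]) -> None:
        self._objects[url] = body
        for tag in tags:
            self._urls_by_tag[tag].add(url)

    def get(self, url: str) -> Optional[bytes]:
        return self._objects.get(url)

    def purge_tag(self, tag: str) -> int:
        """Remove every object carrying the tag; return how many were removed."""
        purged = 0
        for url in self._urls_by_tag.pop(tag, set()):
            # The URL may already be gone via another tag; that is harmless.
            if self._objects.pop(url, None) is not None:
                purged += 1
        return purged
```

For example, tagging every product page with its category lets a single `purge_tag("category-shoes")` invalidate the whole category without tracking individual URLs.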

Comparison

| Factor                    | TTL-Based               | Event-Based                |
|---------------------------|-------------------------|----------------------------|
| Freshness guarantee       | Eventual (within TTL)   | Near-immediate             |
| Implementation complexity | Low                     | High                       |
| Origin load               | Higher (revalidation)   | Lower (long TTLs)          |
| Consistency               | Weak                    | Strong (after propagation) |
| Operational overhead      | None                    | Purge infrastructure       |
| Best for                  | Slowly changing content | Frequently updated content |
| Failure mode              | Stale content           | Missing purge = stale      |

Production recommendation:

Use both together:
- TTL as safety net (content eventually refreshes even if purge fails)
- Event-based purge for immediate freshness on critical updates
- stale-while-revalidate for non-critical content
- Cache tags for efficient bulk invalidation

Example: E-commerce product page
- TTL: 1 hour (safety net)
- Purge on: price change, stock change, description update
- stale-while-revalidate: 5 minutes (non-critical updates)
- Result: Usually fresh within seconds; if a purge is missed, stale for at most about 1 hour 5 minutes (TTL plus the stale-while-revalidate window)

Single-Tier vs Multi-Tier Caching Hierarchy

Single-Tier (Edge Only)

Architecture: Client → Edge PoP → Origin

Advantages:
+ Simplest architecture
+ Lowest latency for cache hits (one hop)
+ Fewer points of failure
+ Easier to debug and monitor
+ Lower infrastructure cost

Disadvantages:
- Each PoP independently fetches from origin
- Origin can receive up to N fetches per object per TTL period (N = number of PoPs)
- Cold PoPs have poor hit ratios
- Popular content fetched redundantly by every PoP
- Origin must handle high request volume

When appropriate:
- Small number of PoPs (<20)
- Origin can handle the load
- Content is highly cacheable (>95% hit ratio)
- Low content diversity (small catalog)

Multi-Tier (Edge + Shield + Origin)

Architecture: Client → Edge PoP → Regional Shield → Origin

Two-tier variant:
  Client → Edge → Shield → Origin
  - Shield consolidates misses from 20-50 edge PoPs
  - Origin sees roughly 1/20th to 1/50th of the miss traffic for objects requested at every edge (best case)

Three-tier variant:
  Client → Edge → Regional Mid-Tier → Global Shield → Origin
  - Edge: hot content, small cache
  - Regional mid-tier: warm content, medium cache
  - Global shield: cold content, large cache
  - Origin: only truly uncached content
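
The shield's consolidation effect can be checked with a small count. This sketch assumes the best case for the shield: every cache starts cold, nothing expires, and every edge requests every object once.

```python
def origin_fetches(num_edges: int, num_objects: int, use_shield: bool) -> int:
    """Count origin fetches when each edge requests each object once, caches cold."""
    shield_cache = set()
    fetches = 0
    for _ in range(num_edges):
        edge_cache = set()                   # each edge has its own cache
        for obj in range(num_objects):
            if obj in edge_cache:
                continue                     # edge hit
            if use_shield:
                if obj not in shield_cache:  # shield miss: one origin fetch
                    shield_cache.add(obj)
                    fetches += 1
            else:
                fetches += 1                 # edge miss goes straight to origin
            edge_cache.add(obj)
    return fetches


single_tier = origin_fetches(num_edges=30, num_objects=100, use_shield=False)
two_tier = origin_fetches(num_edges=30, num_objects=100, use_shield=True)
```

With 30 edges and 100 objects, `single_tier` is 3000 and `two_tier` is 100, a 30x reduction. Real traffic is less uniform, so measured savings are lower than this upper bound.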

Advantages:
+ Dramatically reduces origin load (60-90% reduction)
+ Better hit ratios at shield (aggregated demand)
+ Origin can be smaller/cheaper
+ Handles flash crowds better (shield absorbs)
+ Enables request coalescing at shield layer
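
Request coalescing means that concurrent misses for the same URL share one upstream fetch. A thread-based sketch follows; `CoalescingFetcher` and `_Flight` are hypothetical names, and the example deliberately leaves out the cache itself to focus on the in-flight bookkeeping.

```python
import threading
from typing import Callable, Dict, Optional


class _Flight:
    """State shared by all callers waiting on one in-flight fetch."""

    def __init__(self) -> None:
        self.done = threading.Event()
        self.result: Optional[bytes] = None
        self.error: Optional[BaseException] = None


class CoalescingFetcher:
    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        self._fetch = fetch
        self._lock = threading.Lock()
        self._in_flight: Dict[str, _Flight] = {}

    def get(self, url: str) -> bytes:
        with self._lock:
            flight = self._in_flight.get(url)
            leader = flight is None
            if leader:                       # first caller performs the fetch
                flight = _Flight()
                self._in_flight[url] = flight
        if leader:
            try:
                flight.result = self._fetch(url)
            except Exception as exc:         # followers see the same failure
                flight.error = exc
            finally:
                with self._lock:
                    del self._in_flight[url]
                flight.done.set()            # wake every waiting follower
        else:
            flight.done.wait()               # followers wait for the leader
        if flight.error is not None:
            raise flight.error
        return flight.result
```

If five requests for the same uncached object arrive while the first fetch is still running, the upstream sees one request instead of five, which is how a shield absorbs a flash crowd.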

Disadvantages:
- Additional latency on cache miss (extra hop)
- More complex architecture to operate
- Shield becomes a potential bottleneck/SPOF
- Higher infrastructure cost (shield servers)
- Debugging cache behavior across tiers is complex
- Stale content can persist longer (cached at multiple tiers)

When appropriate:
- Large number of PoPs (>50)
- Origin is expensive or capacity-limited
- Content catalog is large (long tail)
- Flash crowd protection is important
- Origin is geographically distant from most users

Proprietary CDN vs Multi-CDN vs Build-Your-Own

Single Proprietary CDN (CloudFront, Akamai, Cloudflare)

Advantages:
+ Turnkey solution, fast time to market
+ Global infrastructure already deployed
+ DDoS protection included
+ Managed SSL certificates
+ Edge compute capabilities
+ 24/7 NOC and support
+ Continuous platform improvements

Disadvantages:
- Vendor lock-in (proprietary APIs, edge functions)
- Limited customization
- Cost at scale (roughly $0.02-0.15/GB list price vs about $0.005/GB internal cost)
- Single point of failure (provider outage)
- Limited visibility into infrastructure
- Feature roadmap controlled by vendor

Cost at scale:
- 1 PB/month: ~$20,000-50,000/month
- 10 PB/month: ~$100,000-300,000/month
- 100 PB/month: ~$500,000-2,000,000/month
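
Converting the monthly figures above into an effective per-GB price makes the volume discount visible. The sketch assumes decimal units (1 PB = 1,000,000 GB).

```python
GB_PER_PB = 1_000_000  # decimal units: 1 PB = 10^6 GB


def cost_per_gb(monthly_cost_usd: float, petabytes_per_month: float) -> float:
    """Effective price per GB for a given monthly bill and traffic volume."""
    return monthly_cost_usd / (petabytes_per_month * GB_PER_PB)


# (PB/month, low estimate, high estimate) taken from the list above.
tiers = [(1, 20_000, 50_000), (10, 100_000, 300_000), (100, 500_000, 2_000_000)]
effective = {pb: (cost_per_gb(low, pb), cost_per_gb(high, pb))
             for pb, low, high in tiers}
```

This gives about $0.02-0.05/GB at 1 PB, $0.01-0.03/GB at 10 PB, and $0.005-0.02/GB at 100 PB. Only at the largest tier does the low end approach the $0.005 internal cost quoted earlier.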

Multi-CDN Strategy

Advantages:
+ No single provider SPOF
+ Best performance per region (use best CDN per geo)
+ Cost optimization (competitive bidding)
+ Leverage each CDN's strengths
+ Negotiating leverage with providers

Disadvantages:
- Operational complexity (multiple dashboards, APIs)
- Cache fragmentation (content split across CDNs)
- Inconsistent feature sets across providers
- Complex purge coordination
- Higher total cost than single CDN (less volume discount)
- Need traffic management layer (DNS or client-side)

Implementation cost:
- Traffic management platform: $50K-200K/year
- Engineering overhead: 1-2 FTEs dedicated
- Monitoring across CDNs: additional tooling costs

Build-Your-Own CDN

Advantages:
+ Full control over every aspect
+ Lowest cost at massive scale (>100 PB/month)
+ Custom optimizations for specific use case
+ No vendor dependencies
+ Competitive advantage (unique capabilities)

Disadvantages:
- Enormous upfront investment ($50M-500M)
- 2-5 year build timeline to reach parity
- Requires specialized networking/systems talent (50-200 engineers)
- Ongoing operational burden (hardware, peering, NOC)
- Must build DDoS protection, WAF, etc.
- Regulatory compliance in each country

Break-even analysis:
- Below 50 PB/month: use managed CDN
- 50-500 PB/month: multi-CDN or hybrid (own + managed)
- Above 500 PB/month: build your own (Netflix, Google, Facebook)

Companies that built their own:
- Netflix (Open Connect): 100+ Tbps, ISP-embedded
- Google (GFE/Cloud CDN): integrated with search/YouTube
- Facebook (Edge PoPs): social content delivery
- Apple: iCloud and media delivery

Edge Compute vs Origin Compute

Edge Compute

Execute logic at CDN edge, close to users

Use cases:
- A/B testing (route users to variants without origin)
- Authentication/authorization (validate tokens at edge)
- URL rewriting and redirects
- Header manipulation (add security headers, CORS)
- Image/video optimization (resize, format conversion)
- Geolocation-based content (language, pricing)
- Bot detection and blocking
- Request/response transformation
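
As an example of the first use case, A/B assignment can be done at the edge with a deterministic hash, so no origin call or stored state is needed. It is written in Python for readability, although real edge runtimes often use JavaScript or WebAssembly; the function name and parameters are assumptions for illustration.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, percent_in_b: int) -> str:
    """Deterministically place a user in variant A or B for one experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100   # stable bucket in 0..99
    return "B" if bucket < percent_in_b else "A"
```

Because the bucket depends only on the experiment name and user ID, every PoP makes the same decision for the same user, which fits the stateless constraint below.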

Constraints:
- Limited CPU time (5-50ms typical)
- Limited memory (128 MB typical)
- No persistent storage (stateless)
- Limited network access (restricted subrequests)
- Cold start considerations
- Debugging is harder (distributed execution)

Best for:
- Latency-sensitive logic
- Simple transformations
- Decisions that don't need backend data
- High-volume, low-complexity operations

Origin Compute

Execute logic at centralized origin servers

Use cases:
- Complex business logic
- Database queries and transactions
- Machine learning inference (large models)
- Long-running computations
- Stateful operations
- Third-party API integrations

Advantages over edge:
+ Far larger compute budgets (no per-request CPU or memory caps)
+ Access to databases and state
+ Full programming language support
+ Easier debugging and monitoring
+ Simpler deployment model
+ No cold start concerns (always running)

Best for:
- Complex application logic
- Data-intensive operations
- Operations requiring consistency
- Long-running processes
- Operations needing large memory/CPU

Decision Framework

| Factor                 | Edge                | Origin               |
|------------------------|---------------------|----------------------|
| Latency requirement    | <50ms               | <500ms acceptable    |
| Computation complexity | Simple              | Complex              |
| State needed           | None/minimal        | Database access      |
| Request volume         | Very high           | Moderate             |
| Personalization        | Light (geo, device) | Heavy (user history) |
| Data dependencies      | None                | Multiple services    |

HTTP/2 Push vs Preload Hints

HTTP/2 Server Push (Deprecated)

How it worked:
- Server proactively sends resources before client requests them
- Pushed alongside the HTML response
- Client receives CSS/JS without additional round trips

Why it failed:
- Pushed resources often already in browser cache (wasted bandwidth)
- No way to know client's cache state before pushing
- Complex implementation for marginal benefit
- Disabled by default in Chrome 106 (2022), with other browsers following
- CDN implementation was inconsistent

Performance impact:
- Best case: saved 1 RTT for critical resources
- Worst case: wasted bandwidth pushing cached resources
- Average: negligible improvement in real-world measurements

Preload Hints (103 Early Hints)

How it works:
- Server sends 103 Early Hints response before final response
- Contains Link: <resource>; rel=preload headers
- Browser begins fetching hinted resources immediately
- Final response (200) arrives with full content

Advantages over Server Push:
+ Browser checks cache before fetching (no waste)
+ Works with CDN caching (hints can be cached too)
+ Simpler implementation
+ An ordinary 1xx response, so not tied to HTTP/2 (though some browsers only act on it over HTTP/2 and HTTP/3)
+ CDN can send hints while waiting for origin response

CDN implementation:
1. Client requests HTML page
2. CDN sends 103 with preload hints (from cache or config)
3. CDN fetches HTML from origin (or cache)
4. Client already loading CSS/JS while waiting for HTML
5. CDN sends 200 with HTML content

Performance benefit:
- Hides origin processing time: the browser fetches hinted resources while the origin is still generating the page
- 100-500ms improvement for pages with slow origins
- No wasted bandwidth (browser respects cache)

Configuration example:
  Link: </styles/main.css>; rel=preload; as=style
  Link: </scripts/app.js>; rel=preload; as=script
  Link: </fonts/inter.woff2>; rel=preload; as=font; crossorigin
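
The headers above could be generated from a resource list as follows. This is a sketch of the wire format only, using HTTP/1.1 framing for readability; `preload_link` and `early_hints_response` are hypothetical helper names, not part of any CDN's API.

```python
from typing import Iterable, Tuple


def preload_link(path: str, as_type: str, crossorigin: bool = False) -> str:
    """Build one Link header value for a preload hint."""
    value = f"<{path}>; rel=preload; as={as_type}"
    if crossorigin:
        value += "; crossorigin"   # fonts are fetched in CORS mode
    return value


def early_hints_response(resources: Iterable[Tuple]) -> bytes:
    """Serialize a 103 Early Hints interim response with one Link per resource."""
    lines = ["HTTP/1.1 103 Early Hints"]
    lines += [f"Link: {preload_link(*resource)}" for resource in resources]
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


hints = early_hints_response([
    ("/styles/main.css", "style"),
    ("/scripts/app.js", "script"),
    ("/fonts/inter.woff2", "font", True),
])
```

The final 200 response follows on the same connection once the HTML is ready.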

Recommendation

- Do NOT use HTTP/2 Server Push (deprecated, removed from browsers)
- DO use 103 Early Hints for critical resources
- DO use <link rel="preload"> in HTML for important resources
- DO use CDN-level early hints configuration
- Consider: preconnect hints for third-party origins

Summary: Decision Quick Reference

| Decision               | Choose A When...                       | Choose B When...                             |
|------------------------|----------------------------------------|----------------------------------------------|
| Push vs Pull           | Known content, large files, events     | Dynamic content, large catalogs              |
| Anycast vs GeoDNS      | DDoS concern, UDP/QUIC, simplicity     | Fine control, TCP stability, A/B testing     |
| Cache-all vs Selective | Performance priority, static content   | Security priority, personalized content      |
| TTL vs Event purge     | Simple ops, slowly changing content    | Freshness critical, CMS integration          |
| Single vs Multi-tier   | Few PoPs, simple architecture          | Many PoPs, origin protection needed          |
| Managed vs Build       | <50 PB/month, fast time to market      | >500 PB/month, unique requirements           |
| Edge vs Origin compute | Low latency, simple logic, high volume | Complex logic, state needed, moderate volume |