Scale & Constraints

📖 6 min read 📄 Part 2 of 10

Webhook Notification Service - Scale and Constraints

Traffic Scale Analysis

Webhook Volume

  • Peak Webhooks: 100,000 webhooks per second
  • Average Webhooks: 30,000 webhooks per second
  • Daily Webhooks: 2.5 billion webhooks per day
  • Event Volume: 100 million events per day
  • Webhook per Event: Average 25 webhooks per event (fan-out)
  • Retry Rate: 10% of webhooks require retry

Webhook Distribution

  • Registered Webhooks: 1 million active webhook endpoints
  • Active Subscriptions: 10 million event subscriptions
  • Events per Webhook: Average 10 events subscribed
  • High-Volume Webhooks: 10K webhooks (>1000 deliveries/day)
  • Medium-Volume Webhooks: 100K webhooks (100-1000 deliveries/day)
  • Low-Volume Webhooks: 890K webhooks (<100 deliveries/day)

Storage Requirements

Webhook Configuration Storage

Per Webhook:
- Webhook ID: 16 bytes (UUID)
- URL: 200 bytes average
- Secret: 64 bytes (HMAC key)
- Configuration: 500 bytes (retry, timeout, filters)
- Metadata: 200 bytes (created_at, updated_at, status)
- Total per webhook: ~1 KB

Total Webhooks: 1M × 1 KB = 1 GB
With indexes (2x): 2 GB
With replication (3x): 6 GB

Event Queue Storage

Per Event:
- Event ID: 16 bytes
- Event type: 50 bytes
- Payload: 5 KB average
- Metadata: 200 bytes
- Total per event: ~5.3 KB

Queue Depth: 1M events × 5.3 KB = 5.3 GB
With replication (3x): 16 GB

Delivery History Storage

Per Delivery:
- Delivery ID: 16 bytes
- Webhook ID: 16 bytes
- Event ID: 16 bytes
- Status: 20 bytes
- Response: 1 KB average
- Timestamps: 24 bytes
- Total per delivery: ~1.1 KB

Daily Deliveries: 2.5B × 1.1 KB = 2.75 TB/day
Monthly: 82.5 TB
With compression (10:1): 8.25 TB/month
Retention (90 days): 24.75 TB

Network Bandwidth

Outbound Traffic (Webhook Deliveries)

Webhook Requests:
- 100K webhooks/s × 5 KB = 500 MB/s
- HTTP overhead (20%): 100 MB/s
- Total outbound: ~600 MB/s peak

Retry Traffic:
- 10K retries/s × 5 KB = 50 MB/s
- Total with retries: ~650 MB/s peak

Inbound Traffic (Responses)

Webhook Responses:
- 100K responses/s × 1 KB = 100 MB/s
- HTTP overhead (20%): 20 MB/s
- Total inbound: ~120 MB/s peak

Internal Traffic (Queue)

Event Publishing:
- 100K events/s × 5.3 KB = 530 MB/s

Queue Replication:
- 530 MB/s × 2 (replicas) = 1.06 GB/s

Total Internal: ~1.6 GB/s

Compute Requirements

Webhook Dispatcher

Per Dispatcher:
- Event processing: 2 CPU cores
- Webhook matching: 2 CPU cores
- Queue management: 1 CPU core
- Total: 5 CPU cores

Cluster: 20 dispatchers × 5 cores = 100 CPU cores

Delivery Workers

Per Worker:
- HTTP client: 2 CPU cores
- Retry logic: 1 CPU core
- Logging: 0.5 CPU cores
- Total: 3.5 CPU cores

Cluster: 1000 workers × 3.5 cores = 3,500 CPU cores

Total Compute

Dispatchers: 100 cores
Workers: 3,500 cores
Coordinators: 12 cores
Total: 3,612 CPU cores

Memory Requirements

Dispatcher Service

Per Dispatcher:
- Event queue: 5 GB
- Webhook cache: 2 GB
- Connection pool: 1 GB
- Application: 2 GB
- Total: ~10 GB

Cluster: 20 dispatchers × 10 GB = 200 GB

Worker Nodes

Per Worker:
- HTTP connections: 1 GB
- Retry queue: 500 MB
- Response buffers: 500 MB
- Application: 1 GB
- Total: ~3 GB

Cluster: 1000 workers × 3 GB = 3 TB

Database and Cache

PostgreSQL:
- Webhook config: 6 GB
- Delivery history: 25 TB
- Total: ~25 TB

Redis:
- Event queue: 16 GB
- Webhook cache: 10 GB
- Retry queue: 5 GB
- Total: ~31 GB

Latency Constraints

End-to-End Latency Breakdown

Event Generated: 0ms
Event Published to Queue: 10ms
Dispatcher Picks Up: 50ms (poll interval)
Webhook Matching: 5ms
Worker Picks Up: 50ms (poll interval)
HTTP Request: 200ms (network + endpoint processing)
Response Processing: 10ms
Status Update: 20ms
Total: 345ms (P50), 1000ms (P99)

Retry Latency

Attempt 1: Immediate
Attempt 2: +1 second
Attempt 3: +2 seconds
Attempt 4: +4 seconds
Attempt 5: +8 seconds
Total: 15 seconds for all retries

Failure Scenarios

Endpoint Failures

Timeout (30s):
- Impact: Slow delivery, resource usage
- Recovery: Retry with backoff
- Rate: 5% of deliveries

5xx Errors:
- Impact: Temporary failure
- Recovery: Retry immediately
- Rate: 2% of deliveries

4xx Errors:
- Impact: Permanent failure
- Recovery: Don't retry (except 429)
- Rate: 1% of deliveries

System Failures

Worker Failure:
- Impact: In-flight webhooks lost
- Recovery: Requeue from database
- Time: <30 seconds

Queue Failure:
- Impact: No new deliveries
- Recovery: Failover to replica
- Time: <10 seconds

Database Failure:
- Impact: Cannot track status
- Recovery: Replica promotion
- Time: <30 seconds

Cost Analysis

Infrastructure Costs (Monthly)

Dispatchers:
- 20 instances × $200 = $4,000

Workers:
- 1000 instances × $100 = $100,000

Database:
- 25 TB storage × $0.10/GB = $2,500
- Compute: $5,000

Queue (Kafka):
- 10 brokers × $500 = $5,000

Load Balancers:
- 5 × $50 = $250

Total Infrastructure: ~$117,000/month

Operational Costs (Monthly)

Engineering: $20,000
Monitoring: $2,000
Support: $5,000
Total Operational: $27,000

Total Monthly Cost: $144,000

Cost Per Webhook

Monthly Deliveries: 2.5B × 30 = 75 billion
Cost per webhook: $144,000 ÷ 75B = $0.00000192
Cost per 1000 webhooks: $0.00192
Cost per million webhooks: $1.92

Scaling Strategies

Horizontal Scaling

Add Workers:
- Linear scaling up to 10,000 workers
- Auto-scale based on queue depth
- Handle traffic spikes

Add Dispatchers:
- Scale up to 100 dispatchers
- Partition events by type
- Parallel processing

Add Queue Brokers:
- Scale Kafka cluster
- Add partitions
- Increase throughput

Bottleneck Analysis

Primary Bottlenecks

1. HTTP Request Latency:
   - Slow endpoints (>5s response)
   - Mitigation: Timeout, async processing

2. Retry Storm:
   - Many webhooks retrying simultaneously
   - Mitigation: Jitter, rate limiting

3. Queue Throughput:
   - 100K events/s limit
   - Mitigation: More partitions, brokers

4. Database Writes:
   - 10K writes/s per shard
   - Mitigation: Batch writes, sharding

5. Network Bandwidth:
   - 10 Gbps per server
   - Mitigation: Compression, batching

This scale analysis provides the foundation for designing a webhook service that handles massive delivery volumes while maintaining reliability and low latency.