Webhook Notification Service - Scale and Constraints
Traffic Scale Analysis
Webhook Volume
- Peak Webhooks: 100,000 webhooks per second
- Average Webhooks: 30,000 webhooks per second
- Daily Webhooks: 2.5 billion webhooks per day
- Event Volume: 100 million events per day
- Webhook per Event: Average 25 webhooks per event (fan-out)
- Retry Rate: 10% of webhooks require retry
Webhook Distribution
- Registered Webhooks: 1 million active webhook endpoints
- Active Subscriptions: 10 million event subscriptions
- Events per Webhook: Average 10 events subscribed
- High-Volume Webhooks: 10K webhooks (>1000 deliveries/day)
- Medium-Volume Webhooks: 100K webhooks (100-1000 deliveries/day)
- Low-Volume Webhooks: 890K webhooks (<100 deliveries/day)
Storage Requirements
Webhook Configuration Storage
Per Webhook:
- Webhook ID: 16 bytes (UUID)
- URL: 200 bytes average
- Secret: 64 bytes (HMAC key)
- Configuration: 500 bytes (retry, timeout, filters)
- Metadata: 200 bytes (created_at, updated_at, status)
- Total per webhook: ~1 KB
Total Webhooks: 1M × 1 KB = 1 GB
With indexes (2x): 2 GB
With replication (3x): 6 GBEvent Queue Storage
Per Event:
- Event ID: 16 bytes
- Event type: 50 bytes
- Payload: 5 KB average
- Metadata: 200 bytes
- Total per event: ~5.3 KB
Queue Depth: 1M events × 5.3 KB = 5.3 GB
With replication (3x): 16 GBDelivery History Storage
Per Delivery:
- Delivery ID: 16 bytes
- Webhook ID: 16 bytes
- Event ID: 16 bytes
- Status: 20 bytes
- Response: 1 KB average
- Timestamps: 24 bytes
- Total per delivery: ~1.1 KB
Daily Deliveries: 2.5B × 1.1 KB = 2.75 TB/day
Monthly: 82.5 TB
With compression (10:1): 8.25 TB/month
Retention (90 days): 24.75 TBNetwork Bandwidth
Outbound Traffic (Webhook Deliveries)
Webhook Requests:
- 100K webhooks/s × 5 KB = 500 MB/s
- HTTP overhead (20%): 100 MB/s
- Total outbound: ~600 MB/s peak
Retry Traffic:
- 10K retries/s × 5 KB = 50 MB/s
- Total with retries: ~650 MB/s peakInbound Traffic (Responses)
Webhook Responses:
- 100K responses/s × 1 KB = 100 MB/s
- HTTP overhead (20%): 20 MB/s
- Total inbound: ~120 MB/s peakInternal Traffic (Queue)
Event Publishing:
- 100K events/s × 5.3 KB = 530 MB/s
Queue Replication:
- 530 MB/s × 2 (replicas) = 1.06 GB/s
Total Internal: ~1.6 GB/sCompute Requirements
Webhook Dispatcher
Per Dispatcher:
- Event processing: 2 CPU cores
- Webhook matching: 2 CPU cores
- Queue management: 1 CPU core
- Total: 5 CPU cores
Cluster: 20 dispatchers × 5 cores = 100 CPU coresDelivery Workers
Per Worker:
- HTTP client: 2 CPU cores
- Retry logic: 1 CPU core
- Logging: 0.5 CPU cores
- Total: 3.5 CPU cores
Cluster: 1000 workers × 3.5 cores = 3,500 CPU coresTotal Compute
Dispatchers: 100 cores
Workers: 3,500 cores
Coordinators: 12 cores
Total: 3,612 CPU coresMemory Requirements
Dispatcher Service
Per Dispatcher:
- Event queue: 5 GB
- Webhook cache: 2 GB
- Connection pool: 1 GB
- Application: 2 GB
- Total: ~10 GB
Cluster: 20 dispatchers × 10 GB = 200 GBWorker Nodes
Per Worker:
- HTTP connections: 1 GB
- Retry queue: 500 MB
- Response buffers: 500 MB
- Application: 1 GB
- Total: ~3 GB
Cluster: 1000 workers × 3 GB = 3 TBDatabase and Cache
PostgreSQL:
- Webhook config: 6 GB
- Delivery history: 25 TB
- Total: ~25 TB
Redis:
- Event queue: 16 GB
- Webhook cache: 10 GB
- Retry queue: 5 GB
- Total: ~31 GBLatency Constraints
End-to-End Latency Breakdown
Event Generated: 0ms
Event Published to Queue: 10ms
Dispatcher Picks Up: 50ms (poll interval)
Webhook Matching: 5ms
Worker Picks Up: 50ms (poll interval)
HTTP Request: 200ms (network + endpoint processing)
Response Processing: 10ms
Status Update: 20ms
Total: 345ms (P50), 1000ms (P99)Retry Latency
Attempt 1: Immediate
Attempt 2: +1 second
Attempt 3: +2 seconds
Attempt 4: +4 seconds
Attempt 5: +8 seconds
Total: 15 seconds for all retriesFailure Scenarios
Endpoint Failures
Timeout (30s):
- Impact: Slow delivery, resource usage
- Recovery: Retry with backoff
- Rate: 5% of deliveries
5xx Errors:
- Impact: Temporary failure
- Recovery: Retry immediately
- Rate: 2% of deliveries
4xx Errors:
- Impact: Permanent failure
- Recovery: Don't retry (except 429)
- Rate: 1% of deliveriesSystem Failures
Worker Failure:
- Impact: In-flight webhooks lost
- Recovery: Requeue from database
- Time: <30 seconds
Queue Failure:
- Impact: No new deliveries
- Recovery: Failover to replica
- Time: <10 seconds
Database Failure:
- Impact: Cannot track status
- Recovery: Replica promotion
- Time: <30 secondsCost Analysis
Infrastructure Costs (Monthly)
Dispatchers:
- 20 instances × $200 = $4,000
Workers:
- 1000 instances × $100 = $100,000
Database:
- 25 TB storage × $0.10/GB = $2,500
- Compute: $5,000
Queue (Kafka):
- 10 brokers × $500 = $5,000
Load Balancers:
- 5 × $50 = $250
Total Infrastructure: ~$117,000/monthOperational Costs (Monthly)
Engineering: $20,000
Monitoring: $2,000
Support: $5,000
Total Operational: $27,000
Total Monthly Cost: $144,000Cost Per Webhook
Monthly Deliveries: 2.5B × 30 = 75 billion
Cost per webhook: $144,000 ÷ 75B = $0.00000192
Cost per 1000 webhooks: $0.00192
Cost per million webhooks: $1.92Scaling Strategies
Horizontal Scaling
Add Workers:
- Linear scaling up to 10,000 workers
- Auto-scale based on queue depth
- Handle traffic spikes
Add Dispatchers:
- Scale up to 100 dispatchers
- Partition events by type
- Parallel processing
Add Queue Brokers:
- Scale Kafka cluster
- Add partitions
- Increase throughputBottleneck Analysis
Primary Bottlenecks
1. HTTP Request Latency:
- Slow endpoints (>5s response)
- Mitigation: Timeout, async processing
2. Retry Storm:
- Many webhooks retrying simultaneously
- Mitigation: Jitter, rate limiting
3. Queue Throughput:
- 100K events/s limit
- Mitigation: More partitions, brokers
4. Database Writes:
- 10K writes/s per shard
- Mitigation: Batch writes, sharding
5. Network Bandwidth:
- 10 Gbps per server
- Mitigation: Compression, batchingThis scale analysis provides the foundation for designing a webhook service that handles massive delivery volumes while maintaining reliability and low latency.