Webhook Notification Service - System Architecture
High-Level Architecture
┌─────────────────────────────────────────────────────────┐
│ Application Services │
│ (Generate events: payments, orders, users) │
└────────────┬────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Event Publisher │
│ (Publish events to message queue) │
└────────────┬────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Event Queue (Kafka) │
│ (Durable event storage and distribution) │
└────────────┬────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Webhook Dispatcher Service │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │Dispatch 1│ │Dispatch 2│ │Dispatch 3│ ... │
│ └──────────┘ └──────────┘ └──────────┘ │
└────────────┬────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Delivery Queue (Redis) │
│ (Pending webhook deliveries) │
└────────────┬────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Delivery Workers │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Worker 1 │ │ Worker 2 │ │ Worker 3 │ ... │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
└───────┼────────────┼────────────┼────────────────────────┘
│ │ │
▼ ▼ ▼
┌─────────────────────────────────────────────────────────┐
│ External Webhook Endpoints │
│ (Customer servers receiving webhooks) │
└─────────────────────────────────────────────────────────┘Event Flow
Event to Webhook Flow
1. Event Generated
├── Event type: payment.success
├── Event data: {amount: 100, user_id: 123}
└── Event ID: uuid
2. Event Published
├── Publish to Kafka topic
├── Partition by event type
└── Replicate for durability
3. Dispatcher Consumes Event
├── Read from Kafka
├── Find matching webhooks
├── Fan-out to all subscribers
└── Queue for delivery
4. Worker Delivers Webhook
├── Pick from delivery queue
├── Build HTTP request
├── Add signature
├── Send to endpoint
└── Handle response
5. Record Result
├── Update delivery status
├── Schedule retry if failed
├── Update metrics
└── Trigger alerts if neededWebhook Dispatcher Architecture
Dispatcher Components
┌─────────────────────────────────────────────────────────┐
│ Webhook Dispatcher │
├─────────────────────────────────────────────────────────┤
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Event │ │ Webhook │ │ Fan-out │ │
│ │ Consumer │ │ Matcher │ │ Engine │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Event │ │ Subscription │ │ Delivery │ │
│ │ Validator │ │ Cache │ │ Queue │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────┘Webhook Matching
Event: payment.success
Matching Process:
1. Query subscriptions for event type
SELECT webhook_id FROM subscriptions
WHERE event_type = 'payment.success'
AND enabled = TRUE
2. Apply filters (if configured)
- Amount > $100
- User tier = 'premium'
- Region = 'US'
3. Fan-out to matched webhooks
- Create delivery record for each
- Add to delivery queue
- Track fan-out count
Optimization:
- Cache subscriptions in Redis
- Index by event type
- Batch database queries
- Parallel fan-outDelivery Worker Architecture
Worker Components
┌─────────────────────────────────────────────────────────┐
│ Delivery Worker │
├─────────────────────────────────────────────────────────┤
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Queue │ │ HTTP Client │ │ Retry │ │
│ │ Consumer │ │ │ │ Manager │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Signature │ │ Circuit │ │ Result │ │
│ │ Generator │ │ Breaker │ │ Recorder │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────┘HTTP Request Building
Webhook Request:
POST https://customer.example.com/webhooks
Content-Type: application/json
X-Webhook-ID: webhook_abc123
X-Event-ID: event_xyz789
X-Event-Type: payment.success
X-Signature: sha256=abc123def456...
X-Delivery-Attempt: 1
User-Agent: WebhookService/1.0
{
"event_id": "event_xyz789",
"event_type": "payment.success",
"timestamp": "2024-01-08T10:00:00Z",
"data": {
"payment_id": "pay_123",
"amount": 100.00,
"currency": "USD",
"status": "success"
}
}
Signature Calculation:
signature = HMAC-SHA256(secret, payload)
X-Signature: sha256={signature}Response Handling
Success (2xx):
- Record successful delivery
- Update metrics
- Complete delivery
Retry (5xx, timeout, network error):
- Calculate retry delay
- Schedule retry
- Increment attempt count
- Update status
Permanent Failure (4xx except 429):
- Record failure
- Move to dead letter queue
- Alert webhook owner
- Don't retry
Rate Limited (429):
- Respect Retry-After header
- Schedule retry accordingly
- Implement backoffRetry Architecture
Retry Queue Management
Priority Queues by Attempt:
Attempt 1 (Immediate):
- Priority: Highest
- Delay: 0 seconds
- Queue: retry_queue_0
Attempt 2:
- Priority: High
- Delay: 1 second
- Queue: retry_queue_1
Attempt 3:
- Priority: Medium
- Delay: 2 seconds
- Queue: retry_queue_2
Attempt 4:
- Priority: Low
- Delay: 4 seconds
- Queue: retry_queue_4
Attempt 5:
- Priority: Lowest
- Delay: 8 seconds
- Queue: retry_queue_8
Dead Letter Queue:
- After max retries
- Manual intervention
- Replay capabilityExponential Backoff with Jitter
Retry Delay Calculation:
base_delay = 1000ms
max_delay = 60000ms
attempt = retry_count
delay = min(base_delay * 2^attempt, max_delay)
jitter = random(0, delay * 0.1)
final_delay = delay + jitter
Example:
Attempt 1: 1000ms + jitter(0-100ms) = 1000-1100ms
Attempt 2: 2000ms + jitter(0-200ms) = 2000-2200ms
Attempt 3: 4000ms + jitter(0-400ms) = 4000-4400ms
Attempt 4: 8000ms + jitter(0-800ms) = 8000-8800ms
Attempt 5: 16000ms + jitter(0-1600ms) = 16000-17600ms
Jitter Benefits:
- Prevents thundering herd
- Distributes retry load
- Reduces endpoint pressureCircuit Breaker Pattern
Circuit Breaker States
Closed (Normal):
- All requests allowed
- Monitor failure rate
- Transition to Open if failures > 50%
Open (Failing):
- All requests blocked
- Wait for timeout (60 seconds)
- Transition to Half-Open
Half-Open (Testing):
- Allow limited requests (10%)
- Test endpoint recovery
- Transition to Closed if success
- Transition to Open if failure
Implementation:
class CircuitBreaker {
State state = CLOSED;
int failureCount = 0;
int successCount = 0;
long openedAt = 0;
boolean allowRequest(String webhookId) {
if (state == CLOSED) {
return true;
} else if (state == OPEN) {
if (now() - openedAt > 60000) {
state = HALF_OPEN;
return true;
}
return false;
} else { // HALF_OPEN
return random() < 0.1; // 10% of requests
}
}
void recordSuccess() {
if (state == HALF_OPEN) {
successCount++;
if (successCount >= 10) {
state = CLOSED;
failureCount = 0;
}
}
}
void recordFailure() {
failureCount++;
if (state == CLOSED && failureCount >= 50) {
state = OPEN;
openedAt = now();
} else if (state == HALF_OPEN) {
state = OPEN;
openedAt = now();
}
}
}Monitoring and Observability
Key Metrics
Delivery Metrics:
- Webhooks delivered/second
- Delivery success rate
- Delivery latency (P50, P95, P99)
- Retry rate
- Dead letter queue size
Endpoint Metrics:
- Response time per endpoint
- Success rate per endpoint
- Error rate per endpoint
- Circuit breaker state
System Metrics:
- Queue depth
- Worker utilization
- Dispatcher lag
- Database performanceDistributed Tracing
Trace webhook delivery:
Span 1: Event Published
- Duration: 10ms
- Tags: event_id, event_type
Span 2: Webhook Matched
- Duration: 5ms
- Tags: webhook_count
Span 3: Delivery Queued
- Duration: 2ms
- Tags: queue_name
Span 4: HTTP Request
- Duration: 200ms
- Tags: endpoint_url, status_code
Span 5: Result Recorded
- Duration: 20ms
- Tags: delivery_status
Total: 237ms
Identify bottlenecks and optimizeThis architecture provides a robust, scalable foundation for reliable webhook delivery at massive scale.