webhook

Introduction

A webhook is a user-defined HTTP callback. When an event occurs in System A, it sends an HTTP POST request to a pre-configured URL in System B — delivering the event data instantly, without System B ever asking for it. This "don't call us, we'll call you" model is the backbone of real-time integrations across the modern web.

Polling vs. Webhooks — a fundamental trade-off:

Approach	How It Works	Latency	Efficiency	Complexity
Polling	Client repeatedly asks "anything new?" at fixed intervals	Interval-bound (seconds to minutes)	Wasteful — 95%+ of requests return empty	Simple client, simple server
Long Polling	Client holds an open request; server responds when event occurs	Near real-time	Better than polling, but ties up connections	Moderate (connection management)
Webhooks	Server pushes event to client's URL when it happens	Real-time (milliseconds)	Optimal — zero wasted requests	Client must expose an endpoint

Webhooks are used everywhere:

Stripe sends payment_intent.succeeded when a charge completes
GitHub sends push events when code is committed to a repository
Shopify sends orders/create when a customer places an order
Twilio sends delivery receipts when an SMS is delivered

In each case, the provider (Stripe, GitHub) makes an HTTP POST to your server with event data. Your server processes the event and returns 200 OK. If your server fails to respond, the provider retries — typically with exponential backoff.

The engineering challenge is deceptively simple on the surface but hides real complexity:

Reliability — What happens when your server is down? Events must not be lost.
Idempotency — Retries mean the same event may arrive multiple times. Processing it twice would be catastrophic (e.g., charging a customer twice).
Security — Anyone can send a POST request to your endpoint. How do you verify it came from the real provider and wasn't forged by an attacker?
Ordering — Events may arrive out of order. An invoice.paid event might arrive before invoice.created.
Throughput — During flash sales or viral events, webhook volume can spike 5-10× normal. Your handler must absorb the burst without dropping events.

LLD Connection: This problem connects to the Message Queue Low-Level Design, where you implement the producer-consumer pattern that decouples event ingestion from processing.

Figure: Comparison of polling versus webhook architecture showing push model efficiency

Functional Requirements

We extract the core operations from the problem statement:

"receive" event notifications → ACCEPT incoming HTTP requests
"execute" corresponding operations → PROCESS event payload
"persist" original data and results → STORE for auditing/debugging
"ensure" events are processed even when components fail → GUARANTEE at-least-once delivery

FR1 — Accept Event Notifications. The service exposes a webhook endpoint that receives HTTP POST requests from external providers (e.g., Stripe, GitHub, Shopify). Each request contains an event payload describing what happened. The service validates the request authenticity, acknowledges receipt immediately with 200 OK, and enqueues the event for asynchronous processing.

FR2 — Process Events Reliably. Each accepted event is processed exactly according to its type — updating records, triggering workflows, or notifying downstream systems. The original event data and processing results are persisted for tracking, auditing, and debugging. If any component fails mid-processing, the event is retried automatically until it succeeds or is moved to a dead letter queue for manual investigation.

Out of Scope

Webhook registration/subscription management — How providers register callback URLs (handled by provider's API)
Outbound webhook delivery — Sending webhooks to external consumers (inverse problem)
Business logic implementation — What happens after events are processed (domain-specific)
Authentication/authorization — User identity management

Scale Requirements

Metric	Value
Event volume	1,000,000 events per day
Average event size	~5 KB
Peak traffic multiplier	5× normal (during flash sales, releases)
Latency target	< 200 ms end-to-end (event arrival → processing complete)
Data retention	30 days for all events
Delivery guarantee	At-least-once processing

Non-Functional Requirements

Requirement	Target	Rationale
High Availability	99.9% uptime	Missing events from Stripe = missed payments = revenue loss
Low Latency	< 200 ms end-to-end	Acknowledgment must reach the provider before it times out (~5-30 sec); prompt processing keeps downstream state current
At-Least-Once Processing	Zero event loss after acceptance	If we return `200 OK`, we committed to processing the event
Idempotency	Duplicate events produce same result	Network retries from providers will send duplicates; processing twice = data corruption
Security	Verify event authenticity	Open endpoint is an attack surface; must validate HMAC signatures
Durability	30-day event retention	Audit trail for debugging, compliance, and dispute resolution

The critical insight: returning 200 OK is a contract. When we return 200, we tell the external provider "we received your event and will process it." If we lose the event after acknowledging it, the provider won't retry — and the event is gone forever. This is why at-least-once processing after acknowledgment is the most important non-functional requirement.

Resource Estimation

Traffic Estimation

Metric	Normal	Peak (5×)
Events per day	1,000,000	5,000,000
Events per second (avg)	~12/sec	~58/sec
Events per second (peak burst)	~58/sec	~290/sec

With 1M events/day: 1,000,000 / 86,400 ≈ 11.6 events/sec average.
Peak hours concentrate ~40% of daily traffic in 8 hours: 400,000 / 28,800 ≈ 14/sec normal peak.
Flash sale bursts (5× multiplier): 14 × 5 ≈ 70/sec sustained, with micro-bursts up to ~290/sec (5× the ~58/sec peak-day average from the table).
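The arithmetic above can be reproduced in a few lines. This is a minimal sketch: the constant names are illustrative, the 40%-in-8-hours split is the assumption stated in the text, and reading the ~290/sec micro-burst figure as 5× the peak-day average is an interpretation of the table rather than something the text spells out.

```python
EVENTS_PER_DAY = 1_000_000
SECONDS_PER_DAY = 86_400
PEAK_MULTIPLIER = 5

# Average rate on a normal day.
avg_per_sec = EVENTS_PER_DAY / SECONDS_PER_DAY

# Assumption from the text: 40% of daily traffic lands in an 8-hour window.
peak_hours_per_sec = 0.40 * EVENTS_PER_DAY / (8 * 3600)

# Flash sale: 5x the normal peak-hour rate, sustained.
flash_sale_per_sec = peak_hours_per_sec * PEAK_MULTIPLIER

# Table view: 5x the daily average, then 5x again for short bursts.
peak_day_avg_per_sec = avg_per_sec * PEAK_MULTIPLIER
micro_burst_per_sec = peak_day_avg_per_sec * PEAK_MULTIPLIER

print(f"average:        {avg_per_sec:.1f}/sec")
print(f"peak hours:     {peak_hours_per_sec:.1f}/sec")
print(f"flash sale:     {flash_sale_per_sec:.1f}/sec")
print(f"peak-day avg:   {peak_day_avg_per_sec:.1f}/sec")
print(f"micro-burst:    {micro_burst_per_sec:.1f}/sec")
```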

Storage Estimation

Data	Calculation	Result
Daily event storage	1M events × 5 KB	5 GB/day
30-day retention	5 GB × 30	150 GB
Processing results	~1 KB per event × 1M × 30	~30 GB
Total storage (with overhead)	~180 GB × 1.3	~235 GB
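The storage figures follow the same pattern. A short sketch, assuming decimal units (1 GB = 1,000,000 KB) as the table appears to, with illustrative constant names:

```python
EVENTS_PER_DAY = 1_000_000
EVENT_SIZE_KB = 5
RESULT_SIZE_KB = 1
RETENTION_DAYS = 30
OVERHEAD_FACTOR = 1.3   # headroom for indexes and row overhead, per the table
KB_PER_GB = 1_000_000   # decimal units (assumption)

daily_gb = EVENTS_PER_DAY * EVENT_SIZE_KB / KB_PER_GB
events_gb = daily_gb * RETENTION_DAYS
results_gb = EVENTS_PER_DAY * RESULT_SIZE_KB * RETENTION_DAYS / KB_PER_GB
total_gb = (events_gb + results_gb) * OVERHEAD_FACTOR

print(f"daily: {daily_gb:.0f} GB, events: {events_gb:.0f} GB, "
      f"results: {results_gb:.0f} GB, total: {total_gb:.0f} GB")
```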

Infrastructure Estimation

Component	Requirement
Request handlers	3+ instances behind load balancer (each handles ~100 req/sec)
Message queue	Managed service (SQS/Kafka) with replication
Queue consumers	3-5 instances (each processes ~20 events/sec with DB writes)
Database	PostgreSQL with read replica, ~235 GB disk

The system is I/O-bound, not CPU-bound — most time is spent waiting on network (receiving HTTP) and disk (writing to database). This means we can handle high throughput with relatively few compute instances.

Data Model

The data model captures two key entities: the raw event received from the provider and the processing result produced by our system.

CREATE TABLE webhook_events (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    event_id        VARCHAR(255) UNIQUE NOT NULL,     -- Provider's unique event identifier (idempotency key)
    event_type      VARCHAR(100) NOT NULL,             -- e.g., 'payment_intent.succeeded', 'order.created'
    provider        VARCHAR(50) NOT NULL,              -- e.g., 'stripe', 'github', 'shopify'
    payload         JSONB NOT NULL,                    -- Raw event payload from provider
    headers         JSONB,                             -- Original HTTP headers (for debugging)
    signature       VARCHAR(512),                      -- HMAC signature from provider
    received_at     TIMESTAMP DEFAULT NOW(),           -- When we received the event
    status          VARCHAR(20) DEFAULT 'pending',     -- pending, processing, completed, failed, dead_letter
    retry_count     INTEGER DEFAULT 0,
    last_error      TEXT,                              -- Last processing error message
    processed_at    TIMESTAMP,                         -- When processing completed
    expires_at      TIMESTAMP DEFAULT NOW() + INTERVAL '30 days'  -- TTL for retention policy
);

-- No separate index on event_id: the UNIQUE constraint already creates one, which serves the idempotency lookup
CREATE INDEX idx_events_status ON webhook_events(status) WHERE status != 'completed';  -- Partial index for pending events
CREATE INDEX idx_events_provider_type ON webhook_events(provider, event_type);          -- Filter by provider/type
CREATE INDEX idx_events_expires ON webhook_events(expires_at);       -- TTL cleanup job

CREATE TABLE processing_results (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    event_id        VARCHAR(255) REFERENCES webhook_events(event_id),
    action_taken    VARCHAR(200) NOT NULL,              -- e.g., 'updated_payment_status', 'created_order'
    result_data     JSONB,                              -- Processing output details
    duration_ms     INTEGER,                            -- Processing time in milliseconds
    created_at      TIMESTAMP DEFAULT NOW()
);

Key design decisions:

event_id as idempotency key: The provider's unique event identifier (e.g., Stripe's evt_1MqLSbK...) is stored with a UNIQUE constraint. Before processing, we check if this event_id already exists — if so, we skip processing and return success.
JSONB for payload: Events from different providers have different schemas. JSONB stores the raw payload without requiring a fixed schema, while still allowing efficient queries with GIN indexes.
Partial index on status: The WHERE status != 'completed' clause means the index only covers pending/failed events — a small fraction of total rows. Queries for "events that need processing" are fast without indexing all 30M completed events.
TTL with expires_at: A background job runs daily to delete events where expires_at < NOW(), enforcing the 30-day retention policy.
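The daily cleanup job can be sketched as a batched delete. This is an illustration only: it uses an in-memory SQLite table as a stand-in for PostgreSQL so it runs anywhere, stores expires_at as epoch seconds, and the function name purge_expired and the batch size are hypothetical.

```python
import sqlite3

def purge_expired(conn: sqlite3.Connection, now_epoch: int, batch_size: int = 1000) -> int:
    """Delete expired rows in small batches; return the total rows deleted."""
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM webhook_events WHERE rowid IN ("
            "  SELECT rowid FROM webhook_events WHERE expires_at < ? LIMIT ?)",
            (now_epoch, batch_size),
        )
        conn.commit()  # commit per batch so locks are held only briefly
        if cur.rowcount == 0:
            return total
        total += cur.rowcount

# Demo: ten events expiring at t=100..109, cleanup runs at t=105.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE webhook_events (event_id TEXT UNIQUE NOT NULL, expires_at INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO webhook_events VALUES (?, ?)",
    [(f"evt_{i}", 100 + i) for i in range(10)],
)
deleted = purge_expired(conn, now_epoch=105, batch_size=2)
remaining = conn.execute("SELECT COUNT(*) FROM webhook_events").fetchone()[0]
print(deleted, remaining)
```

Batching keeps each transaction short. In the PostgreSQL schema above, processing_results references webhook_events(event_id), so a real job would likely need to delete the result rows first or declare the foreign key with ON DELETE CASCADE.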

API Endpoints

The webhook service exposes a single inbound endpoint. External providers send events to this URL.

Receive Webhook Event

POST /webhook

Request Headers:

Content-Type: application/json
X-Webhook-Signature: sha256=5d7f3c8a1b...    -- HMAC signature for verification
X-Webhook-Timestamp: 1710691200               -- Unix timestamp to prevent replay attacks
X-Event-Id: evt_1MqLSbKJFk9d2k               -- Provider's unique event identifier

Request Body:

{
  "event_id": "evt_1MqLSbKJFk9d2k",
  "event_type": "payment_intent.succeeded",
  "created_at": "2026-03-17T10:00:00Z",
  "data": {
    "object": {
      "id": "pi_3MqLSbKJFk9d2k",
      "amount": 5000,
      "currency": "usd",
      "status": "succeeded",
      "customer": "cus_NffrFeUfNV2Hib"
    }
  }
}

Response (Success — 200 OK):

{
  "status": "accepted",
  "event_id": "evt_1MqLSbKJFk9d2k"
}

Response (Duplicate — 200 OK):

{
  "status": "already_processed",
  "event_id": "evt_1MqLSbKJFk9d2k"
}

Response (Invalid Signature — 401 Unauthorized):

{
  "error": "invalid_signature",
  "message": "HMAC signature verification failed"
}

Status Code	Meaning
`200 OK`	Event accepted and queued for processing
`401 Unauthorized`	HMAC signature verification failed
`429 Too Many Requests`	Rate limit exceeded
`500 Internal Server Error`	Server failure — provider should retry

Why always return 200 for duplicates? If the provider sent the same event twice (due to their retry logic) and we already processed it, returning 200 tells them "we've got it, stop retrying." Returning an error would cause them to keep retrying indefinitely.

Internal Endpoints (for monitoring/debugging)

GET /events/:event_id/status

Returns processing status, retry count, and result data for a specific event. Used by internal dashboards for debugging — not exposed to external providers.

High Level Design

We build the architecture incrementally, starting from the simplest possible design and evolving it as we discover problems that need solving.

1. Basic Design — Direct Processing

Starting point: The most straightforward approach is a single service that receives the HTTP request, processes the event, writes to the database, and returns 200 OK.

The request handler does everything: validate the request, execute business logic (e.g., update payment status), write to the database, and respond.

Figure: Basic webhook design. External Service sends POST to Request Handler, which writes directly to Database

The critical flaw: The request handler has too many responsibilities: HTTP handling, business logic, and database writes. If it crashes after executing the business logic but before the database write succeeds, the side effects have happened with no record of the event, and the provider's retry repeats them. Worse, if the handler is slow (heavy business logic), it blocks the HTTP response. External providers typically have aggressive timeouts (5-30 seconds). If we don't respond fast enough, the provider considers the delivery failed and retries, even though we might have already processed the event.

We need to separate concerns: accept the event quickly, then process it asynchronously.

2. Message Queue for Reliability and Decoupling

The solution is a classic producer-consumer pattern with a message queue:

Request Handler (producer): Receives HTTP, validates, enqueues event, returns 200 OK
Message Queue (buffer): Durably stores events until consumed
Queue Consumer (consumer): Pulls events, processes them, writes results to database

This separation gives us three critical properties:

Property	How the Queue Provides It
Fast acknowledgment	Handler returns `200 OK` immediately after enqueue (~10 ms), not after processing (~200 ms)
Failure recovery	If a consumer crashes, the message stays in the queue and another consumer picks it up
Load buffering	During a 5× traffic spike, the queue absorbs the burst; consumers drain at their own pace

Figure: Webhook architecture with message queue decoupling request handlers from consumers

Request Flow — Step by Step

Figure: Sequence diagram showing the complete webhook processing flow from event arrival to completion

Step 1 — Event arrives. Stripe sends POST /webhook with event payload and HMAC signature in headers.

Step 2 — Signature verification. The handler computes HMAC-SHA256 over the request body using the shared secret key. If the computed hash doesn't match the signature in the header, return 401 Unauthorized.

Step 3 — Idempotency check. Query the database: does an event with this event_id already exist? If yes, return 200 OK with status: "already_processed". No further action needed.

Step 4 — Persist event record. Insert a new row in webhook_events with status: 'pending'. This creates the audit trail immediately.

Step 5 — Enqueue for processing. Push the event to the message queue. The queue durably stores the message.

Step 6 — Acknowledge to provider. Return 200 OK. Total time: ~10-50 ms. The provider knows we received the event.

Step 7 — Consumer processes event. A queue consumer pulls the message, executes the business logic (e.g., update payment status, trigger email), and writes the result to the database.

Step 8 — Acknowledge to queue. Only after the database write succeeds does the consumer ACK the message. If the consumer crashes before ACK, the message becomes visible again after a visibility timeout, and another consumer retries it.

This is the key reliability guarantee: The message stays in the queue until we prove (via database write + ACK) that processing succeeded. No event loss is possible after step 6.
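The eight steps can be traced end to end with in-memory stand-ins. This is a simplified sketch, not a production handler: a dict replaces the webhook_events table, a deque replaces the durable queue, the secret is a hypothetical test value, and the signature here covers only the body for brevity.

```python
import hashlib
import hmac
import json
from collections import deque

SECRET = b"test-secret"   # hypothetical shared secret
events = {}               # event_id -> status (stands in for webhook_events)
queue = deque()           # stands in for the durable message queue

def sign(body: bytes) -> str:
    return "sha256=" + hmac.new(SECRET, body, hashlib.sha256).hexdigest()

def handle(body: bytes, signature: str):
    """Steps 1-6: verify, deduplicate, persist, enqueue, acknowledge."""
    if not hmac.compare_digest(sign(body), signature):            # Step 2
        return 401, {"error": "invalid_signature"}
    event = json.loads(body)
    event_id = event["event_id"]
    if event_id in events:                                        # Step 3
        return 200, {"status": "already_processed", "event_id": event_id}
    events[event_id] = "pending"                                  # Step 4
    queue.append(event)                                           # Step 5
    return 200, {"status": "accepted", "event_id": event_id}      # Step 6

def consume_one() -> None:
    """Steps 7-8: process, record the result, then ACK by removing the message."""
    event = queue[0]                         # peek: message stays queued until ACK
    events[event["event_id"]] = "completed"  # business logic + DB write go here
    queue.popleft()                          # ACK only after the write succeeded

body = json.dumps({"event_id": "evt_demo", "event_type": "payment_intent.succeeded"}).encode()
print(handle(body, sign(body)))      # first delivery is accepted
print(handle(body, sign(body)))      # retry is recognized as a duplicate
print(handle(body, "sha256=bad"))    # forged signature is rejected
consume_one()
print(events)
```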

3. Handling Failures at Every Layer

Each component in the pipeline can fail. The architecture must handle every failure mode without losing events.

Request Handler Failures

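Before enqueue: If the handler crashes before persisting the event, the external provider never receives an acknowledgment and retries the delivery (typically 3-5 times with exponential backoff). Nothing was saved, so the retry is a fresh delivery and no data is lost. If the crash falls between the insert (step 4) and the enqueue (step 5), the retry finds the event_id in the database and is acknowledged as a duplicate even though the event was never queued; a periodic job that re-enqueues rows stuck in 'pending' closes this gap.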

After enqueue, before response: The event is safely in the queue, but the provider didn't receive 200 OK. The provider retries and sends the event again. Our idempotency check catches the duplicate — the event_id already exists in the database — and we return 200 OK without reprocessing.

Message Queue Failures

Figure: Message queue failure protection with durability and replication

Durable queues persist messages to disk, surviving process crashes. Multi-node replication (Kafka's replication.factor=3, or SQS's built-in multi-AZ) ensures that even if an entire server dies, messages are preserved on other nodes.

Queue Consumer Failures

Figure: Queue consumer failure recovery with multiple instances and message redelivery

The message queue uses a visibility timeout mechanism. When a consumer pulls a message, the message becomes invisible to other consumers for a configured period (e.g., 30 seconds). If the consumer successfully processes the event and ACKs within this window, the message is permanently deleted. If the consumer crashes, the visibility timeout expires and the message reappears — allowing another consumer to pick it up.

This is why consumer-side idempotency is critical. The same event may be delivered to multiple consumers (if the first consumer was slow or crashed). Each consumer must check the event_id before executing business logic to avoid duplicate processing.
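The visibility timeout mechanism is easy to see in a toy model. The class below is a hypothetical in-memory illustration, not the API of SQS or any other queue; the clock is passed in explicitly so the behavior is deterministic.

```python
class VisibilityQueue:
    """Toy queue: a received message is hidden until its timeout expires or it is ACKed."""

    def __init__(self, visibility_timeout: float):
        self.visibility_timeout = visibility_timeout
        self._messages = {}  # message_id -> (body, invisible_until)

    def send(self, message_id: str, body: str) -> None:
        self._messages[message_id] = (body, 0.0)

    def receive(self, now: float):
        """Return the first visible message and hide it, or None if nothing is visible."""
        for message_id, (body, invisible_until) in self._messages.items():
            if invisible_until <= now:
                self._messages[message_id] = (body, now + self.visibility_timeout)
                return message_id, body
        return None

    def ack(self, message_id: str) -> None:
        """Permanently delete a message after successful processing."""
        self._messages.pop(message_id, None)

q = VisibilityQueue(visibility_timeout=30)
q.send("m1", "evt_demo")
first = q.receive(now=0)       # consumer A takes the message, then crashes
hidden = q.receive(now=10)     # still invisible to consumer B
redelivered = q.receive(now=31)  # timeout expired: consumer B gets it
q.ack("m1")                    # B finishes and ACKs
print(first, hidden, redelivered, q.receive(now=100))
```

The redelivery at t=31 is exactly the case where consumer A may have been slow rather than dead, which is why the consumer-side check on event_id matters.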

Database Failures

Database failures are handled with standard resilience patterns:

Write retries with exponential backoff — If the first write fails, retry after 100 ms, then 200 ms, 400 ms, etc. Most transient failures (connection timeout, deadlock) resolve within a few retries.
Database replication with automatic failover — A standby replica promotes to primary if the primary fails. The application reconnects to the new primary within seconds.
Consumer waits for DB recovery — If the database is down for an extended period, the consumer stops ACKing messages. Messages accumulate in the queue (which has much higher capacity than the DB). When the database recovers, consumers drain the backlog.

4. Complete Architecture

Figure: Complete webhook processing architecture with all components and failure handling

Component Ownership & Scaling

Component	Responsibility	Scaling Strategy	Failure Mode
Load Balancer	Route requests, health checks	Managed (ALB/NLB)	Multi-AZ automatic
Request Handlers	HMAC verification, idempotency check, enqueue	Horizontal (add instances)	LB routes away from dead instances
Message Queue	Durable event buffering	Managed (SQS/Kafka)	Multi-AZ replication
Queue Consumers	Event processing, DB writes	Horizontal (add consumers)	Visibility timeout + redelivery
Database	Event & result storage	Vertical + read replicas	Automatic failover to standby
Dead Letter Queue	Capture poison messages	Same as main queue	Alert + manual investigation

Dead Letter Queue (DLQ)

After a configurable number of retry attempts (e.g., 5), a message moves to the dead letter queue. This prevents a single malformed event from blocking the entire pipeline. Common DLQ scenarios:

Malformed payload — Provider sent invalid JSON that can't be parsed
Missing handler — Event type has no registered processor
Persistent downstream failure — External API that the processor calls is permanently down
Bug in consumer code — Logic error that crashes on specific event patterns

A monitoring alert fires when the DLQ receives messages. Engineers investigate, fix the root cause, and replay the events from the DLQ back into the main queue.
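A minimal sketch of the DLQ routing and replay described above, using two in-memory deques. The message shape (body, receive_count, last_error) and the function names are assumptions made for illustration; managed queues track the receive count for you.

```python
import json
from collections import deque

MAX_RECEIVES = 5  # matches the example retry limit in the text

def consume(main: deque, dlq: deque, process) -> None:
    """Take one message; on failure requeue it, or dead-letter it once the limit is reached."""
    message = main.popleft()
    message["receive_count"] += 1
    try:
        process(message["body"])
    except Exception as exc:
        message["last_error"] = str(exc)
        if message["receive_count"] >= MAX_RECEIVES:
            dlq.append(message)    # poison message: stop retrying, raise an alert
        else:
            main.append(message)   # transient failure: try again later

def replay(dlq: deque, main: deque) -> int:
    """After the root cause is fixed, move every DLQ message back to the main queue."""
    moved = 0
    while dlq:
        message = dlq.popleft()
        message["receive_count"] = 0
        main.append(message)
        moved += 1
    return moved

main_queue = deque([{"body": "not valid json", "receive_count": 0}])
dead_letters = deque()
for _ in range(MAX_RECEIVES):
    consume(main_queue, dead_letters, json.loads)  # json.loads fails every time
print(len(main_queue), len(dead_letters))
```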

Deep Dive Questions

How do we secure the webhook endpoint against forged requests?

Our webhook endpoint is a publicly accessible URL. Anyone who discovers it can send fake events — spoofing payment confirmations, fabricating order updates, or flooding us with garbage data. We need multiple layers of defense.

Layer 1: HMAC Signature Verification

The webhook provider (e.g., Stripe) and our service share a secret key (configured during webhook registration). When the provider sends an event:

The provider computes HMAC-SHA256(secret_key, timestamp + "." + request_body) and includes the hash in the X-Webhook-Signature header, with the timestamp sent in X-Webhook-Timestamp
Our handler computes the same HMAC using the shared secret and the received body
If the hashes match, the request is authentic — only someone with the secret key could produce that signature

import hashlib
import hmac
import os

WEBHOOK_SECRET = os.environ["WEBHOOK_SECRET"]  # Shared secret, stored securely

def verify_signature(request_body: bytes, signature_header: str, timestamp: str) -> bool:
    """Verify the HMAC signature on an incoming webhook request."""
    # Construct the signed payload: timestamp + "." + body
    # Including timestamp prevents replay attacks
    signed_payload = f"{timestamp}.".encode() + request_body

    # Compute expected signature
    expected = hmac.new(
        WEBHOOK_SECRET.encode(),
        signed_payload,
        hashlib.sha256,
    ).hexdigest()

    # Constant-time comparison prevents timing attacks
    return hmac.compare_digest(f"sha256={expected}", signature_header)

Why hmac.compare_digest instead of ==? Regular string comparison short-circuits on the first differing character. An attacker could measure response times to deduce the expected signature one character at a time (timing attack). compare_digest takes constant time regardless of where strings differ.

Why include the timestamp? Without it, an attacker who intercepts a valid request could replay it later. By requiring the timestamp to be within a window (e.g., ±5 minutes), we reject stale requests.
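The freshness check itself is separate from the signature comparison. A minimal sketch, assuming the timestamp header carries Unix seconds as in the request example; the function name and the injectable now parameter are illustrative.

```python
import time
from typing import Optional

TOLERANCE_SEC = 300  # the ±5 minute window from the text

def is_timestamp_fresh(timestamp_header: Optional[str],
                       now: Optional[float] = None,
                       tolerance: int = TOLERANCE_SEC) -> bool:
    """Return True if the header is a Unix timestamp within the tolerance window."""
    try:
        sent_at = int(timestamp_header)
    except (TypeError, ValueError):
        return False  # missing or malformed header is treated as stale
    current = time.time() if now is None else now
    return abs(current - sent_at) <= tolerance

print(is_timestamp_fresh("1710691200", now=1710691200 + 60))
print(is_timestamp_fresh("1710691200", now=1710691200 + 3600))
```

This only helps because the timestamp is part of the signed payload: an attacker who rewrites the header to a current value invalidates the signature.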

Figure: HMAC signature verification flow showing provider signing and service verification

Layer 2: IP Allowlisting

Configure the load balancer or firewall to accept webhook requests only from known provider IP ranges. Stripe publishes its webhook IP addresses; so do GitHub and Shopify.

Provider	IP Range Documentation
Stripe	Published in Stripe docs, updated periodically
GitHub	Available via `GET https://api.github.com/meta`
Shopify	Published in Shopify docs

Limitation: Provider IPs can change. IP allowlisting is a defense-in-depth measure — not a primary authentication mechanism. Always use HMAC verification as the primary check.
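If the allowlist is enforced in application code rather than at the firewall, Python's standard ipaddress module is enough. The CIDR blocks below are reserved documentation ranges used as placeholders, not real provider addresses; a real deployment would load the provider's published ranges and refresh them periodically.

```python
import ipaddress

# Placeholder ranges (reserved for documentation), NOT real provider IPs.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

def is_allowed(source_ip: str) -> bool:
    """Return True if source_ip parses and falls inside an allowed network."""
    try:
        address = ipaddress.ip_address(source_ip)
    except ValueError:
        return False  # unparseable address: reject
    return any(address in network for network in ALLOWED_NETWORKS)

print(is_allowed("192.0.2.17"), is_allowed("203.0.113.5"))
```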

Layer 3: Rate Limiting

Set rate limits per IP or per API key to prevent denial-of-service attacks:

Normal provider traffic: ~12 events/sec average, ~70/sec peak → set limit at 200/sec per provider IP
Abuse traffic: If any source exceeds 200/sec, return 429 Too Many Requests

Rate limiting protects against both malicious flooding and buggy providers that accidentally send duplicate events in loops.
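A token bucket is one common way to enforce such a limit. The sketch below is a single-process illustration with an explicit clock; the class name and the per-source dictionary are assumptions, and a multi-instance deployment would need shared state (for example in the load balancer or a cache).

```python
class TokenBucket:
    """Allows `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.updated_at = now

    def allow(self, now: float) -> bool:
        # Refill according to elapsed time, never exceeding capacity.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should answer 429 Too Many Requests

buckets = {}  # source IP -> TokenBucket

def allow_request(source_ip: str, now: float) -> bool:
    bucket = buckets.setdefault(source_ip, TokenBucket(rate=200, capacity=200, now=now))
    return bucket.allow(now)

accepted = sum(allow_request("192.0.2.17", now=0.0) for _ in range(250))
print(accepted)  # requests admitted out of 250 arriving in the same instant
```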

Defense Summary

Layer	Protects Against	Implementation
HMAC Signatures	Forged/spoofed requests	Compute + constant-time compare in handler
Timestamp validation	Replay attacks (old captured requests)	Reject if timestamp > 5 min old
IP Allowlisting	Requests from unauthorized sources	Load balancer/firewall rules
Rate Limiting	DoS attacks, buggy providers	Token bucket per source IP

How do we handle duplicate webhook deliveries?

Duplicate events are inevitable, not exceptional. They occur from:

Provider retries — Provider didn't receive 200 OK (network issue) and resends
Consumer retries — Consumer crashed mid-processing; message redelivered after visibility timeout
Intentional replay — Provider resends events after an outage recovery

Without idempotency, processing the same payment_intent.succeeded event twice could fulfill the order or credit the account twice. Processing the same order.created event twice creates two orders.

Idempotency Key Strategy

Every webhook event has a unique identifier assigned by the provider (e.g., Stripe's evt_1MqLSbKJFk9d2k). We use this as an idempotency key:

async def handle_webhook(request: Request) -> Response:
    body = await request.body()
    event_data = json.loads(body)
    event_id = event_data["event_id"]

    # Step 1: Check if we've seen this event before
    existing = await db.fetch_one(
        "SELECT id, status FROM webhook_events WHERE event_id = $1", event_id
    )

    if existing:
        # Already seen — skip processing, return success so provider stops retrying
        return Response({"status": "already_processed", "event_id": event_id}, status=200)

    # Step 2: Insert with UNIQUE constraint as safety net
    try:
        await db.execute(
            """INSERT INTO webhook_events (event_id, event_type, provider, payload, status)
               VALUES ($1, $2, $3, $4, 'pending')""",
            event_id, event_data["event_type"], event_data.get("provider", "unknown"),
            json.dumps(event_data),
        )
    except UniqueViolationError:
        # Race condition: another handler inserted between our SELECT and INSERT
        return Response({"status": "already_processed", "event_id": event_id}, status=200)

    # Step 3: Enqueue for async processing
    await queue.enqueue({"event_id": event_id, "payload": event_data})
    return Response({"status": "accepted", "event_id": event_id}, status=200)

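Why check-then-insert instead of just INSERT with ON CONFLICT? The SELECT lets the handler return early for a known duplicate and inspect its current status, but it does not guarantee uniqueness by itself: two handlers can both see "not found" for the same event. The UNIQUE constraint is what closes that race. The cost is a second round trip on the common path, since most events are new. A single INSERT ... ON CONFLICT (event_id) DO NOTHING gives the same guarantee in one statement and raises no exception, so it is a reasonable alternative when the extra status lookup is not needed.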
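For comparison, the single-statement INSERT ... ON CONFLICT variant can be sketched as follows. SQLite (which also supports ON CONFLICT ... DO NOTHING) stands in for PostgreSQL so the example runs without a server; the function name record_event is hypothetical, and the affected-row count is how the caller tells a new event from a duplicate.

```python
import sqlite3

def record_event(conn: sqlite3.Connection, event_id: str, event_type: str) -> bool:
    """Insert the event; return True if it was new, False if it was a duplicate."""
    cur = conn.execute(
        "INSERT INTO webhook_events (event_id, event_type) VALUES (?, ?) "
        "ON CONFLICT (event_id) DO NOTHING",
        (event_id, event_type),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows affected means the UNIQUE key already existed

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE webhook_events (event_id TEXT UNIQUE NOT NULL, event_type TEXT NOT NULL)"
)
first = record_event(conn, "evt_demo", "payment_intent.succeeded")
second = record_event(conn, "evt_demo", "payment_intent.succeeded")
print(first, second)
```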

Consumer-Side Idempotency

The handler-side check prevents duplicate enqueueing. But messages can still be delivered twice to consumers (visibility timeout expiry, queue retry). Consumers must also be idempotent:

async def process_event(message: dict):
    event_id = message["event_id"]

    # Atomic status update: only succeeds if status is still 'pending'
    rows_updated = await db.execute(
        """UPDATE webhook_events SET status = 'processing'
           WHERE event_id = $1 AND status = 'pending'""",
        event_id,
    )

    if rows_updated == 0:
        # Already processing or completed by another consumer
        return

    try:
        result = await execute_business_logic(message["payload"])
        await db.execute(
            """UPDATE webhook_events SET status = 'completed', processed_at = NOW()
               WHERE event_id = $1""",
            event_id,
        )
        await db.execute(
            """INSERT INTO processing_results (event_id, action_taken, result_data, duration_ms)
               VALUES ($1, $2, $3, $4)""",
            event_id, result.action, json.dumps(result.data), result.duration_ms,
        )
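    except Exception as e:
        # Reset to 'pending' (not 'failed') so the redelivered message can claim
        # the event again; the claim above only matches rows in 'pending'.
        await db.execute(
            """UPDATE webhook_events SET status = 'pending', retry_count = retry_count + 1,
               last_error = $2 WHERE event_id = $1""",
            event_id, str(e),
        )
        raise  # Re-raise so the queue doesn't ACK; the message will be redelivered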

The key line is WHERE event_id = $1 AND status = 'pending'. This is an atomic compare-and-swap. If two consumers try to process the same event simultaneously, only one succeeds in changing the status from 'pending' to 'processing'. The other gets rows_updated = 0 and exits immediately.

Queue-Level Deduplication

Some message queues offer built-in deduplication:

Queue	Deduplication Feature
AWS SQS FIFO	Content-based deduplication within 5-minute window
Apache Kafka	`enable.idempotence=true` on producer prevents duplicate publishes
RabbitMQ	No built-in; implement at application level

Queue deduplication is a defense-in-depth addition — it reduces duplicates but doesn't eliminate them (e.g., messages delivered across the deduplication window). Application-level idempotency is still required.

How do we handle events that arrive out of order?

Webhook providers send events independently. Network latency, retry timing, and provider-side batching can cause events to arrive in a different order than they occurred. For example:

Stripe sends invoice.created at 10:00:00
Stripe sends invoice.paid at 10:00:05
invoice.paid arrives at our service at 10:00:06
invoice.created, delayed by a network retry, arrives at 10:00:08

If our processor blindly processes events in arrival order, it would try to mark an invoice as "paid" before it exists in our database.

Strategy 1: Fetch Latest State from Source of Truth

Instead of relying on event data to update local state, fetch the current state from the provider's API when processing each event:

async def process_invoice_event(event: dict):
    invoice_id = event["data"]["object"]["id"]
    event_type = event["event_type"]

    # Don't trust the event payload for state — fetch latest from Stripe
    latest_invoice = await stripe_client.get_invoice(invoice_id)

    # Upsert with the latest data regardless of event order
    await db.execute(
        """INSERT INTO invoices (id, status, amount, updated_at)
           VALUES ($1, $2, $3, $4)
           ON CONFLICT (id) DO UPDATE SET
               status = EXCLUDED.status,
               amount = EXCLUDED.amount,
               updated_at = EXCLUDED.updated_at
           WHERE invoices.updated_at < EXCLUDED.updated_at""",
        latest_invoice.id,
        latest_invoice.status,
        latest_invoice.amount,
        latest_invoice.updated_at,
    )

The WHERE invoices.updated_at < EXCLUDED.updated_at clause ensures we never overwrite newer data with older data. Because every event triggers a fresh fetch, processing invoice.created after invoice.paid still writes the invoice's current state; if the fetched snapshot is no newer than the stored row (for example, when two consumers race), the UPDATE silently does nothing.

Strategy 2: Timestamp-Based Conflict Resolution

When you can't call the provider's API (rate limits, latency concerns), use the event's timestamp to determine ordering:

async def process_event_with_timestamp(event: dict):
    event_id = event["event_id"]
    entity_id = event["data"]["object"]["id"]
    event_timestamp = parse_datetime(event["created_at"])

    # Check if we already have a more recent event for this entity
    latest = await db.fetch_one(
        """SELECT event_timestamp FROM entity_state
           WHERE entity_id = $1 ORDER BY event_timestamp DESC LIMIT 1""",
        entity_id,
    )

    if latest and latest.event_timestamp >= event_timestamp:
        # This event is older than what we already processed — skip it
        await db.execute(
            """UPDATE webhook_events SET status = 'skipped_stale'
               WHERE event_id = $1""",
            event_id,
        )
        return

    # Process the event — it's the newest we've seen for this entity
    await apply_event_to_state(entity_id, event)

Key Takeaways

Never assume event order. Design processing logic that produces correct results regardless of arrival sequence.
Use the provider's API as the source of truth. Event payloads are notifications, not authoritative state updates.
Timestamp-based conflict resolution works when provider API calls are impractical. The WHERE updated_at < new_timestamp pattern prevents stale overwrites.
Log skipped events. When an out-of-order event is skipped, mark it as skipped_stale in the database for debugging — don't silently drop it.

How do we design a robust retry strategy with exponential backoff?

When event processing fails, we need to retry — but naively retrying immediately can overwhelm a struggling dependency. If the database is temporarily overloaded and 1,000 events fail simultaneously, immediately retrying all 1,000 creates a thundering herd that makes the situation worse.

Exponential Backoff with Jitter

The standard approach: increase the delay between retries exponentially, and add random jitter to prevent synchronized retries.

delay = random(0, min(max_delay, base_delay × 2^attempt))

import random

MAX_RETRIES = 5
BASE_DELAY_SEC = 1.0
MAX_DELAY_SEC = 60.0

def calculate_retry_delay(attempt: int) -> float:
    """Calculate delay with exponential backoff + full jitter."""
    # Exponential: 1s, 2s, 4s, 8s, 16s
    exponential = BASE_DELAY_SEC * (2 ** attempt)

    # Cap at max delay
    capped = min(exponential, MAX_DELAY_SEC)

    # Full jitter: random value between 0 and capped delay
    # This spreads retries uniformly, preventing thundering herd
    return random.uniform(0, capped)

# Example retry schedule:
# Attempt 0: 0 - 1 sec    (immediate to 1s)
# Attempt 1: 0 - 2 sec
# Attempt 2: 0 - 4 sec
# Attempt 3: 0 - 8 sec
# Attempt 4: 0 - 16 sec
# After 5 failures → Dead Letter Queue

Why Full Jitter Over Equal Jitter?

Strategy	Formula	Problem
No jitter	`base × 2^attempt`	All failed events retry at exactly the same time → thundering herd
Equal jitter	`base × 2^attempt / 2 + random(0, base × 2^attempt / 2)`	Better, but retries still clustered around midpoint
Full jitter	`random(0, base × 2^attempt)`	Retries spread uniformly across the entire window → optimal load distribution

AWS's analysis (the "Exponential Backoff and Jitter" post on the AWS Architecture Blog) found that full jitter completes with less total work than equal jitter when many clients retry against a shared resource.
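To make the table concrete, the three strategies can be written side by side. This is a sketch only; the function names and the rng parameter (included so a seeded generator can be passed for repeatable experiments) are assumptions, and each function applies the same cap as calculate_retry_delay.

```python
import random

def no_jitter(base, attempt, cap):
    """Deterministic backoff: every client computes the same delay."""
    return min(base * 2 ** attempt, cap)

def equal_jitter(base, attempt, cap, rng=random):
    """Half the window is fixed, the other half is random."""
    window = min(base * 2 ** attempt, cap)
    return window / 2 + rng.uniform(0, window / 2)

def full_jitter(base, attempt, cap, rng=random):
    """The whole window is random, from zero up to the capped delay."""
    window = min(base * 2 ** attempt, cap)
    return rng.uniform(0, window)
```

For attempt 3 with a 1-second base, no_jitter always returns 8 seconds, equal_jitter returns a value between 4 and 8 seconds, and full_jitter returns a value anywhere between 0 and 8 seconds.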

Dead Letter Queue (DLQ) Policy

After MAX_RETRIES failed attempts, the event moves to the dead letter queue. The consumer must not keep retrying — the event is likely a poison message (malformed data, unhandled event type, persistent downstream failure). Retrying forever would waste resources and potentially block the queue.

Retry Attempt	Expected Delay (mean of full jitter)	Cumulative Wait (expected)
1	~0.5 sec	~0.5 sec
2	~1 sec	~1.5 sec
3	~2 sec	~3.5 sec
4	~4 sec	~7.5 sec
5	~8 sec	~15.5 sec
→ DLQ	Event moves to dead letter queue	Alert triggered

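Total time before DLQ: about 15.5 seconds on average and at most 31 seconds (1 + 2 + 4 + 8 + 16), excluding processing time. Fast enough to catch brief transient failures (a network blip, a quick DB failover) but not so aggressive that it overwhelms recovering systems.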
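The figures in the table follow from the fact that a delay drawn uniformly from 0 to a window has a mean of half that window. A small sketch that reproduces them, reusing the constants from the earlier snippet (the function name retry_schedule is an assumption, and processing time between attempts is ignored):

```python
BASE_DELAY_SEC = 1.0
MAX_DELAY_SEC = 60.0
MAX_RETRIES = 5

def retry_schedule(base=BASE_DELAY_SEC, cap=MAX_DELAY_SEC, retries=MAX_RETRIES):
    """Return (retry_number, expected_delay, expected_total, worst_case_total) rows."""
    rows, expected_total, worst_total = [], 0.0, 0.0
    for attempt in range(retries):
        window = min(base * 2 ** attempt, cap)
        expected_total += window / 2  # mean of uniform(0, window)
        worst_total += window         # upper bound of uniform(0, window)
        rows.append((attempt + 1, window / 2, expected_total, worst_total))
    return rows

for row in retry_schedule():
    print(row)
```

The last row is (5, 8.0, 15.5, 31.0): an expected 15.5 seconds of waiting before the event reaches the DLQ, and 31 seconds in the worst case.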

Implementation with SQS Visibility Timeout

Amazon SQS has no built-in exponential backoff for redelivery: a failed message simply reappears once the visibility timeout expires. The workaround is to adjust the visibility timeout per message and use it as the retry delay:

Consumer pulls message, attempts processing, fails
Consumer calls ChangeMessageVisibility with the calculated backoff delay
Message becomes invisible for that duration, then reappears for the next attempt
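The steps above can be sketched as a helper that a consumer calls when processing fails. This is an illustration under stated assumptions: sqs_client is expected to be a boto3 SQS client (or any object with the same change_message_visibility method), the message is assumed to have been received with the ApproximateReceiveCount system attribute requested, and schedule_retry is a hypothetical name.

```python
import random

BASE_DELAY_SEC = 1.0
MAX_DELAY_SEC = 60.0

def schedule_retry(sqs_client, queue_url, message):
    """Hide a failed message for a jittered backoff period; return the delay used."""
    # ApproximateReceiveCount is 1 on the first delivery, so subtract 1
    # to get a zero-based attempt number.
    receive_count = int(message["Attributes"]["ApproximateReceiveCount"])
    window = min(BASE_DELAY_SEC * 2 ** (receive_count - 1), MAX_DELAY_SEC)
    # VisibilityTimeout is a whole number of seconds.
    delay = round(random.uniform(0, window))
    sqs_client.change_message_visibility(
        QueueUrl=queue_url,
        ReceiptHandle=message["ReceiptHandle"],
        VisibilityTimeout=delay,
    )
    return delay
```

Because the timeout is rounded to whole seconds, early attempts with a 1-second base collapse to a delay of 0 or 1; a larger base delay may be preferable when using this mechanism.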

To enforce the retry limit, configure SQS's built-in redrive policy, which works with or without the custom backoff above:

{
  "maxReceiveCount": 5,
  "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789012:webhook-dlq"
}

Once a message has been received 5 times without being deleted, SQS moves it to the DLQ automatically instead of delivering it again.
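Applied programmatically, the redrive policy is passed to the queue as a JSON string. A minimal sketch, assuming sqs_client is a boto3 SQS client; the helper names build_redrive_policy and attach_dlq are hypothetical.

```python
import json

def build_redrive_policy(dlq_arn, max_receive_count=5):
    """Serialize the redrive policy; SQS expects it as a JSON string."""
    return json.dumps(
        {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": max_receive_count}
    )

def attach_dlq(sqs_client, queue_url, dlq_arn, max_receive_count=5):
    """Attach the dead letter queue to the source queue."""
    sqs_client.set_queue_attributes(
        QueueUrl=queue_url,
        Attributes={"RedrivePolicy": build_redrive_policy(dlq_arn, max_receive_count)},
    )
```

Keeping max_receive_count equal to MAX_RETRIES means the consumer-side backoff and the queue-side limit agree on when an event is considered poisoned.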

What observability do we need for a webhook processing pipeline?

A webhook pipeline processes events from external systems — systems we don't control. When something goes wrong, we need to identify whether the issue is on our side (handler bug, DB outage) or the provider's side (malformed payload, changed API). Observability is critical because webhook failures are often silent — no user is clicking a button and seeing an error page.

Key Metrics to Track

Metric	What It Measures	Alert Threshold
Ingestion rate	Events received per second	< 50% of expected baseline for > 5 min (provider may be down)
Queue depth	Messages waiting in queue	> 10,000 (consumers can't keep up)
Processing latency (P50, P95, P99)	Time from enqueue to DB write	P99 > 500 ms
Error rate	Failed processing attempts / total	> 5% over 5-minute window
DLQ ingestion rate	Events moving to dead letter queue	> 0 (any DLQ event warrants investigation)
Retry rate	Messages redelivered / total consumed	> 10% (indicates systematic failures)
Signature rejection rate	HMAC verification failures / total	> 1% (possible secret rotation issue or attack)
Consumer lag	Time gap between the newest enqueued message and the message the consumer is currently processing	> 60 seconds
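The thresholds in the table can be expressed as a single check over a metrics snapshot. The sketch below is illustrative: the snapshot keys and the function name breached_thresholds are assumptions, and the time-window conditions (for example "for > 5 min") are left to the monitoring system rather than modeled here.

```python
def breached_thresholds(m: dict) -> list:
    """Return the names of metrics whose current value crosses its threshold."""
    consumed = max(m["consumed"], 1)  # avoid division by zero on an idle pipeline
    received = max(m["received"], 1)
    checks = {
        "ingestion_rate": m["ingest_per_sec"] < 0.5 * m["baseline_per_sec"],
        "queue_depth": m["queue_depth"] > 10_000,
        "p99_latency": m["p99_latency_ms"] > 500,
        "error_rate": m["failed"] / consumed > 0.05,
        "dlq": m["dlq_events"] > 0,
        "retry_rate": m["redelivered"] / consumed > 0.10,
        "signature_rejections": m["sig_rejected"] / received > 0.01,
        "consumer_lag": m["lag_sec"] > 60,
    }
    return [name for name, hit in checks.items() if hit]
```

A healthy snapshot returns an empty list; a snapshot with a deep queue, one DLQ event, and an 8% failure rate returns queue_depth, error_rate, and dlq.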

Structured Logging for Every Event

Each event should produce a log trail that allows full reconstruction of its lifecycle:

{
  "timestamp": "2026-03-17T10:00:00.123Z",
  "level": "INFO",
  "event_id": "evt_1MqLSbKJFk9d2k",
  "event_type": "payment_intent.succeeded",
  "provider": "stripe",
  "stage": "processing_complete",
  "duration_ms": 45,
  "action_taken": "updated_payment_status",
  "retry_count": 0,
  "trace_id": "abc123-def456"
}
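A log line in this shape can be produced with the standard library alone. This is a sketch rather than a prescribed logger setup: the logger name "webhooks" and the helper log_event_stage are assumptions, and attaching handlers or formatters is left to the application.

```python
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("webhooks")

def log_event_stage(event_id, event_type, provider, stage, **fields):
    """Emit one JSON log line for a lifecycle stage and return it."""
    now = datetime.now(timezone.utc).isoformat(timespec="milliseconds")
    record = {
        "timestamp": now.replace("+00:00", "Z"),
        "level": "INFO",
        "event_id": event_id,
        "event_type": event_type,
        "provider": provider,
        "stage": stage,
        **fields,  # e.g. duration_ms, action_taken, retry_count, trace_id
    }
    line = json.dumps(record)
    logger.info(line)
    return line
```

Calling log_event_stage at each stage (received, enqueued, processing_complete) with the same event_id and trace_id makes it possible to reconstruct an event's path with a single query.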

Dashboard Layout

A webhook monitoring dashboard should answer three questions at a glance:

Is the system healthy? — Ingestion rate, processing latency, error rate. Green/yellow/red indicators.
Is there a backlog? — Queue depth, consumer lag. If the queue is growing, consumers need scaling.
Are there poison messages? — DLQ count, top error categories. Which event types are failing and why?

Alerting Strategy

Severity	Condition	Response
P1 (page)	DLQ receiving events	Investigate immediately — events failing permanently
P1 (page)	Queue depth > 50K and growing	Consumers offline or DB down — immediate scaling/investigation
P2 (ticket)	Error rate > 5% for > 10 min	Systematic failure — check downstream dependencies
P3 (log)	Signature rejection spike	Possible secret rotation or attack — check provider status
P3 (log)	Ingestion rate drops > 50%	Provider may be experiencing an outage

Staff-Level Discussion Topics

The following topics contain open-ended architectural questions designed for staff+ conversations where you demonstrate systems thinking, trade-off analysis, and cross-cutting architectural decisions.

Achieving Exactly-Once Processing Semantics

Context: Your webhook pipeline guarantees at-least-once processing. But "at-least-once" means some events may be processed twice. For payment events, duplicate processing means double-charging customers. Product demands "exactly-once" guarantees.

Discussion Points:

Why is true exactly-once delivery impossible in distributed systems? How does the Two Generals' Problem apply here?
How do you achieve exactly-once semantics (not delivery) using idempotency? What's the difference?
What are the trade-offs between database-level idempotency (UNIQUE constraints) vs application-level idempotency (idempotency key cache)?
How do you handle the case where the event was successfully processed but the status update to "completed" failed? The next retry will re-process it.
Can you use database transactions spanning the event processing and status update to achieve atomicity? What are the limitations?
How would you implement an idempotency key TTL that balances memory usage against deduplication window?
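As a starting point for the transaction question above, the idempotency record and the business effect can be committed together, so a crash between them cannot leave one without the other. The sketch below uses SQLite; the tables (processed_events, balances) and the helper process_once are hypothetical, and the approach only covers side effects that live in the same database.

```python
import sqlite3

def process_once(conn, event_id, apply_effect):
    """Run apply_effect(conn) at most once per event_id; return True if it ran."""
    try:
        with conn:  # commits on success, rolls back on any exception
            # The PRIMARY KEY rejects a second insert of the same event_id.
            conn.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)", (event_id,)
            )
            apply_effect(conn)
    except sqlite3.IntegrityError:
        # Treated as a duplicate delivery. Caveat: a constraint violation
        # raised inside apply_effect would land here as well.
        return False
    return True

# Illustrative in-memory setup.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE balances (account TEXT PRIMARY KEY, cents INTEGER)")
conn.execute("INSERT INTO balances VALUES ('acct_1', 0)")
conn.commit()

def credit(c):
    c.execute("UPDATE balances SET cents = cents + 500 WHERE account = 'acct_1'")
```

If apply_effect raises, the idempotency row is rolled back with it, so the next retry processes the event normally. External calls (emails, third-party APIs) sit outside the transaction and need their own idempotency keys.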

Multi-Provider Webhook Architecture

Context: Your platform integrates with 15 different webhook providers (Stripe, GitHub, Shopify, Twilio, SendGrid, etc.). Each provider has a different payload format, signature scheme, retry policy, and event taxonomy. The codebase is becoming unmaintainable with provider-specific if/else chains everywhere.

Discussion Points:

How do you design a provider-agnostic webhook processing framework? What abstractions make sense?
How do you handle different authentication schemes? (HMAC-SHA256 for Stripe and for GitHub's X-Hub-Signature-256 header, HMAC-SHA1 for Twilio and for GitHub's legacy X-Hub-Signature header, basic auth for others)
How do you normalize different event schemas into a common internal format?
How do you handle provider-specific quirks? (Different retry intervals, different header names for signatures, different timestamp formats)
What testing strategy ensures that changes for one provider don't break another?
How do you handle provider API version changes that alter webhook payload formats?
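One abstraction worth discussing is a per-provider adapter that bundles signature verification and payload normalization behind a common interface. The sketch below is deliberately simplified: provider_a and provider_b, their header names, and their payload fields are all made up, and real providers use more involved signature formats (timestamps, prefixes, multiple signatures) than a bare hex digest.

```python
import hmac
import json
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass(frozen=True)
class ProviderAdapter:
    signature_header: str              # header carrying the hex digest
    digest: str                        # hashlib algorithm name, e.g. "sha256"
    normalize: Callable[[dict], dict]  # provider payload -> internal format

    def verify(self, secret: bytes, body: bytes, headers: Dict[str, str]) -> bool:
        expected = hmac.new(secret, body, self.digest).hexdigest()
        received = headers.get(self.signature_header, "")
        # Constant-time comparison on bytes.
        return hmac.compare_digest(expected.encode(), received.encode())

# Hypothetical providers for illustration only.
ADAPTERS = {
    "provider_a": ProviderAdapter(
        "X-A-Signature", "sha256",
        lambda p: {"event_id": p["id"], "event_type": p["type"]},
    ),
    "provider_b": ProviderAdapter(
        "X-B-Signature", "sha1",
        lambda p: {"event_id": p["event_id"], "event_type": p["kind"]},
    ),
}

def handle(provider: str, secret: bytes, body: bytes, headers: Dict[str, str]) -> dict:
    """Verify the signature, then return the event in the common internal format."""
    adapter = ADAPTERS[provider]
    if not adapter.verify(secret, body, headers):
        raise PermissionError("signature mismatch")
    return {"provider": provider, **adapter.normalize(json.loads(body))}
```

Adding a provider then means registering one adapter rather than touching shared if/else chains, and each adapter can be tested in isolation against recorded payloads.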

Scaling Webhook Processing to 100× Current Volume

Context: Your platform grows from 1M events/day to 100M events/day. The current architecture (single queue, PostgreSQL for all events) is hitting limits. Database write throughput is maxed out, query performance degrades with 3 billion rows, and the single queue becomes a bottleneck.

Discussion Points:

How do you partition the message queue? By provider? By event type? By tenant? What are the trade-offs?
When does PostgreSQL stop being appropriate for event storage? What alternatives exist? (TimescaleDB, Cassandra, DynamoDB, S3 + Athena)
How do you implement a tiered storage strategy? (Hot: recent 24h in fast DB, Warm: 7 days in standard DB, Cold: 30 days in object storage)
How do you handle the thundering herd problem when a provider sends 10M events in 1 minute after an outage?
What queue partitioning strategy ensures fair processing across providers while preventing one noisy provider from starving others?
How do you monitor and autoscale consumers based on queue depth, processing latency, and error rates?

Disaster Recovery and Data Consistency

Context: Your primary database fails during a flash sale. The message queue has 50,000 unprocessed events. Your disaster recovery plan needs to handle this scenario without losing events or creating duplicates.

Discussion Points:

What happens to in-flight messages when the database is unavailable? How do consumers behave?
How do you design the system so that queue messages survive a complete database rebuild?
What's the recovery procedure after a database failover? How do you verify no events were lost?
How do you reconcile state between the queue (events in flight), the database (events partially processed), and the provider (events already acknowledged)?
Should you implement a "replay" capability to re-process events from a specific time window? How?
How do you test disaster recovery procedures without impacting production?

Level Expectations

Dimension	Mid-Level (L4)	Senior (L5)	Staff (L6)
Requirements & Estimation	List basic features (accept events, persist); identify availability as NFR	Quantify traffic (events/sec), storage (GB), compute; define at-least-once guarantee	SLA definition; cost analysis; multi-provider normalization strategy
Architecture	Basic handler → database; mention a queue	Queue-based async pipeline; dead letter queue; separate handler and consumer roles	Partitioned queues; tiered storage; multi-region replication; graceful degradation
Security	Mention HMAC verification	Implement HMAC with timestamp + constant-time comparison; IP allowlisting; rate limiting	Threat modeling; secret rotation strategy; zero-trust between services
Reliability	"Use a queue for reliability"	Idempotency at handler and consumer level; exponential backoff with jitter; visibility timeout mechanics	Exactly-once semantics discussion; reconciliation procedures; DR planning
Observability	Basic logging	Structured logging per event lifecycle; key metrics (queue depth, error rate, latency percentiles)	Full alerting strategy; SLO-based monitoring; cross-provider correlation dashboards

Summary

Figure: Architecture evolution from direct handler to queue-based resilient pipeline.

Key Design Decisions

Message Queue for Decoupling. The handler's only job is to validate, persist the raw event, and enqueue. Business logic runs asynchronously in consumers. This separation gives us fast acknowledgment (~50 ms), failure isolation, and independent scaling of ingestion vs processing.

At-Least-Once with Idempotency. The queue guarantees at-least-once delivery; application-level idempotency (provider's event_id as unique key + atomic status updates) ensures duplicate processing is harmless. True exactly-once delivery is impossible in distributed systems — idempotency is the practical solution.

HMAC Signature Verification. Every request is authenticated using the shared secret before any processing. Constant-time comparison prevents timing attacks. Timestamp validation prevents replay attacks. IP allowlisting and rate limiting provide defense in depth.

Exponential Backoff with Full Jitter. Failed events retry after a random delay drawn from a window that doubles on each attempt (up to 1s → 2s → 4s → 8s → 16s), which spreads retries out and prevents a thundering herd. After 5 failures, events move to the dead letter queue for human investigation.

Fetch Latest State for Ordering. Out-of-order events are handled by fetching current state from the provider's API rather than trusting event payload. Timestamp-based conflict resolution (WHERE updated_at < new_timestamp) prevents stale overwrites when API calls are impractical.

Architecture Principles Applied

Principle	Application
Separation of concerns	Handler does HTTP + enqueue; Consumer does business logic + persistence
Fail-safe defaults	Return `200 OK` only after successful enqueue; ACK message only after DB write
Defense in depth	HMAC + IP allowlist + rate limiting for security; handler + consumer idempotency for deduplication
Async over sync	Event processing decoupled from HTTP response; provider gets fast `200 OK` regardless of processing time
Design for failure	Every component failure mode has a recovery path; no single failure loses an event

Common Pitfalls

Pitfall	Why It Fails	Better Approach
Synchronous processing in handler	Slow processing → timeout → provider retries → duplicate events	Enqueue immediately, process async
No idempotency check	Provider retry delivers duplicate → event processed twice	Use `event_id` as idempotency key with UNIQUE constraint
Immediate retry on failure	1,000 events fail → 1,000 immediate retries → overwhelm DB	Exponential backoff with full jitter
`==` for signature comparison	Timing attack reveals expected signature character by character	`hmac.compare_digest` for constant-time comparison
Trust event payload for state	Out-of-order events corrupt local state	Fetch latest from provider API or use timestamp-based resolution
No dead letter queue	Poison message blocks queue forever	Move to DLQ after N failures; alert for investigation
Return `200` before enqueue	Handler crashes after `200` but before enqueue → event lost forever	Return `200` only after successful enqueue + DB insert