Why do payment gateways send duplicate webhooks?

Payment gateways use at-least-once delivery with retry policies (Stripe retries up to 25 times over 72 hours) to ensure events are not lost due to network failures or transient consumer errors. Any non-2xx response or connection timeout schedules a retry, which can create duplicates if the original request actually succeeded.

What is the safest idempotency key for webhook deduplication?

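Prefer the gateway-provided event ID (e.g. the evt_... value in the id field of a Stripe event body) as the primary key; the provider assigns it once per event and reuses it on every retry. Fall back to a SHA-256 hash of the raw payload only when that field is absent or malformed, and include the signature in the hash only if the provider sends it unchanged on retries (Stripe re-signs each delivery attempt with a new timestamp).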

Should I return 200 or 202 from a webhook endpoint?

Return 202 Accepted immediately after atomically claiming the idempotency key, then process asynchronously. This decouples the HTTP acknowledgment from ledger writes, preventing gateway-side retry storms caused by reverse-proxy idle timeouts cutting the connection before your worker finishes.

Handling Duplicate Webhook Deliveries in Payment Gateways

Payment ecosystems rely on asynchronous, event-driven delivery where network volatility and gateway-side retry policies routinely produce duplicate webhook payloads. The at-least-once delivery model that underpins every major payment provider — Stripe, Adyen, PayPal, Braintree — means duplicates are a design property, not an anomaly. Your consumer must turn that into effectively-once ledger behaviour.

This runbook walks through every implementation step: picking the right idempotency key, claiming it atomically, decoupling acknowledgment from processing, and hardening the full flow with a payment-event state machine. Prerequisites: familiarity with idempotency key generation strategies and a working understanding of atomic operations and transaction scoping.

Why Duplicates Are Inevitable

Exactly-once delivery requires distributed consensus between sender and receiver — consensus that introduces latency incompatible with financial SLAs. Instead, gateways guarantee delivery at the cost of potential repetition:

Stripe retries up to 25 times over a 72-hour window using exponential backoff.
Adyen and PayPal retry on HTTP 5xx responses and connection resets, with retry windows of 24–48 hours.
Reverse proxy idle timeouts (typically 30–60 s) sever connections before a slow worker returns 200 OK, causing the provider to schedule a retry even though the original request completed.
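Queue redelivery: in RabbitMQ or SQS, a worker crash before acknowledgment makes the message available for delivery again; in Kafka, a crash before the offset commit causes the message to be re-read. Either way, redelivery happens regardless of whether the ledger write succeeded.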

The diagram below shows the three primary duplication paths and where consumer-side idempotency intercepts each one.

Step-by-Step Implementation

Step 1 — Choose and Extract an Idempotency Key

Pick the narrowest, most stable identifier the gateway provides. In order of preference:

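Gateway event ID: Stripe puts the evt_... identifier in the id field of the event body (the Stripe-Signature header carries only a timestamp and signatures, not the event ID); Adyen sends X-Adyen-Hmac-Signature alongside an eventId body field. This is a provider-assigned identifier that stays the same across retries. Use it directly.
Composite SHA-256 fallback: when the provider does not supply a stable event ID, compute SHA-256(hmac_signature || payload_bytes). This deduplicates only if the signature is byte-identical on every retry; Stripe, for example, re-signs each delivery attempt with a fresh timestamp, so for such providers hash payload_bytes alone. A SHA-256 digest is 256 bits long, which makes accidental collisions negligible even across billions of events.
Anti-pattern: never use X-Request-ID alone; it is typically generated per HTTP request and changes on every retry, breaking deduplication entirely.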

Key derivation for the composite case:

import hashlib

def derive_idempotency_key(signature_header: str, raw_body: bytes) -> str:
    """
    Derives a stable 64-hex-char deduplication key from the gateway
    HMAC signature and the raw request body bytes.

    Only valid when the provider sends a byte-identical signature on every
    retry; otherwise pass an empty signature_header to hash the body alone.
    """
    digest = hashlib.sha256(
        signature_header.encode("utf-8") + raw_body
    ).hexdigest()
    return digest  # 64 hex chars = 256-bit digest
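
As a rough illustration of the preference order above, the sketch below takes the provider event ID from the JSON body when one is present and otherwise hashes the raw bytes. The helper name choose_idempotency_key, the lookup of an id field (modelled on Stripe's event body), and the decision to hash the payload without the signature are all assumptions made for this example, not part of any gateway SDK.

```python
import hashlib
import json
from typing import Tuple


def choose_idempotency_key(raw_body: bytes) -> Tuple[str, str]:
    """Return (key, key_source) for one webhook delivery.

    Hypothetical helper: prefers a provider event ID found in the JSON
    body's "id" field and falls back to SHA-256 over the raw bytes.
    """
    try:
        parsed = json.loads(raw_body)
    except ValueError:  # covers invalid JSON and undecodable bytes
        parsed = None

    if isinstance(parsed, dict):
        event_id = parsed.get("id")
        if isinstance(event_id, str) and event_id:
            return event_id, "event_id"

    # Fallback: hash only the payload so the key is identical on every retry,
    # even when the provider re-signs each delivery attempt.
    return hashlib.sha256(raw_body).hexdigest(), "composite"


first_attempt = b'{"id": "evt_123", "type": "payment_intent.succeeded"}'
retry_attempt = b'{"id": "evt_123", "type": "payment_intent.succeeded"}'
no_id_payload = b'{"type": "payment_intent.succeeded", "amount": 5000}'

key_a, source_a = choose_idempotency_key(first_attempt)
key_b, _ = choose_idempotency_key(retry_attempt)
fallback_key, fallback_source = choose_idempotency_key(no_id_payload)
```

Returning the key source alongside the key makes it straightforward to label duplicate metrics by whether the event ID or the composite hash was used.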

Step 2 — Claim the Key Atomically

Redis SET NX is the fastest path; PostgreSQL unique constraints give ACID durability. Choose based on whether false negatives (cache eviction expiring a key before the retry window closes) are acceptable for your SLA.

Redis (Node.js / Express + BullMQ):

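app.post('/webhooks/payment', async (req, res) => {
  // Stripe carries the event ID in the body's `id` field (assumes a JSON
  // body parser); the header is kept only as a local-testing convenience.
  const eventId = req.body?.id || req.headers['stripe-event-id'] || deriveKey(req);
  const key = `idemp:webhook:${eventId}`;

  // SET NX + 7-day TTL in one round-trip; no TOCTOU race
  const claimed = await redis.set(key, '1', 'EX', 604800, 'NX');
  if (!claimed) {
    // Duplicate: an earlier delivery already claimed this event
    return res.status(200).json({ status: 'duplicate' });
  }

  try {
    await paymentQueue.add('process', req.body, { jobId: eventId });
  } catch (err) {
    // Release the claim so the gateway's next retry can be accepted
    await redis.del(key);
    return res.status(503).send();
  }
  res.status(202).send();
});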

PostgreSQL (Python / FastAPI):

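import json

from fastapi import FastAPI, Request, Response

app = FastAPI()
# db, payment_queue and derive_idempotency_key (Step 1) are assumed to be
# set up elsewhere; db.execute is assumed to return the RETURNING value,
# or None when the insert was skipped.

@app.post("/webhooks/payment")
async def handle_webhook(request: Request):
    raw_body = await request.body()
    try:
        # Stripe carries the event ID in the body's "id" field
        body_id = json.loads(raw_body).get("id")
    except (ValueError, AttributeError):
        body_id = None
    event_id = (
        body_id
        or request.headers.get("stripe-event-id")  # local-testing convenience
        or derive_idempotency_key(
            request.headers.get("stripe-signature", ""), raw_body
        )
    )

    async with db.transaction():
        # INSERT ... ON CONFLICT is a single atomic statement
        result = await db.execute(
            """
            INSERT INTO webhook_idempotency (event_id, received_at)
            VALUES ($1, NOW())
            ON CONFLICT (event_id) DO NOTHING
            RETURNING event_id
            """,
            event_id,
        )
        if not result:
            return {"status": "duplicate"}

        # An enqueue failure raises and rolls back the claim, so a retry can succeed
        await payment_queue.enqueue(raw_body.decode())
        return Response(status_code=202)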

The PostgreSQL table requires a UNIQUE INDEX ON webhook_idempotency(event_id) and a received_at column with an index to support TTL-based pruning (see Step 5).
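
To try the claim semantics without a PostgreSQL instance, the following sketch reproduces them with Python's built-in sqlite3 module. It is only an analogue: INSERT OR IGNORE stands in for INSERT ... ON CONFLICT DO NOTHING, the cursor's rowcount stands in for the RETURNING clause, and the helper name claim_event is hypothetical.

```python
import sqlite3


def claim_event(conn: sqlite3.Connection, event_id: str) -> bool:
    """Return True if this call claimed event_id, False if it was a duplicate."""
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "INSERT OR IGNORE INTO webhook_idempotency (event_id, received_at) "
            "VALUES (?, datetime('now'))",
            (event_id,),
        )
        # rowcount is 1 when the row was inserted and 0 when the primary-key
        # constraint caused the insert to be skipped.
        return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE webhook_idempotency ("
    " event_id TEXT PRIMARY KEY,"
    " received_at TEXT NOT NULL)"
)

# Three deliveries of the same event: only the first one wins the claim.
results = [claim_event(conn, "evt_test_1") for _ in range(3)]
row_count = conn.execute(
    "SELECT COUNT(*) FROM webhook_idempotency"
).fetchone()[0]
```

Because the uniqueness check and the write are one statement, there is no window in which two callers can both observe "not yet claimed".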

Step 3 — Decouple Acknowledgment from Processing

Return 202 Accepted immediately after claiming the key. Processing happens asynchronously in a worker. This prevents reverse proxy idle timeout (30–60 s) from closing the connection before a slow ledger write completes, which would cause the gateway to schedule a retry even though the event was already claimed.
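
The acknowledge-then-process flow can be sketched in a few lines of asyncio. Everything here is a stand-in chosen for illustration: an in-memory set plays the role of the atomic key store, an asyncio.Queue plays the role of the job queue, and the 50 ms sleep simulates a slow ledger write.

```python
import asyncio

claimed_keys = set()  # stand-in for Redis SET NX or a unique index
ledger = []           # stand-in for the payment ledger


async def handle_webhook(event_id: str, queue: asyncio.Queue) -> int:
    """Claim the key, enqueue the event, and return an HTTP status code."""
    if event_id in claimed_keys:
        return 200  # duplicate: acknowledge without re-enqueueing
    # No await between the check and the add, so no other task can interleave.
    claimed_keys.add(event_id)
    await queue.put(event_id)
    return 202  # acknowledged before any ledger work has started


async def worker(queue: asyncio.Queue) -> None:
    """Drain the queue; the slow ledger write happens off the request path."""
    while True:
        event_id = await queue.get()
        await asyncio.sleep(0.05)  # simulated slow ledger write
        ledger.append(event_id)
        queue.task_done()


async def main() -> list:
    queue = asyncio.Queue()
    worker_task = asyncio.create_task(worker(queue))
    # Deliver the same event twice, as a gateway retry would.
    statuses = [await handle_webhook("evt_1", queue) for _ in range(2)]
    await queue.join()  # wait until the single enqueued job is processed
    worker_task.cancel()
    return statuses


statuses = asyncio.run(main())
```

The handler's response time depends only on the claim and the enqueue, not on how long the worker takes.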

Go (net/http + Redis Lua atomic SET):

var dedupeScript = redis.NewScript(`
  if redis.call("SET", KEYS[1], "1", "NX", "EX", ARGV[1]) then
    return 1
  else
    return 0
  end
`)

func WebhookHandler(w http.ResponseWriter, r *http.Request) {
    eventID := r.Header.Get("X-Gateway-Event-ID")
    if eventID == "" {
        eventID = deriveKey(r)
    }

    claimed, err := dedupeScript.Run(
        r.Context(), redisClient,
        []string{fmt.Sprintf("idemp:webhook:%s", eventID)},
        604800, // 7-day TTL in seconds
    ).Int()
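    if err != nil {
        // Redis unavailable: answer 5xx so the gateway retries instead of
        // silently dropping an event that was never claimed
        w.WriteHeader(http.StatusServiceUnavailable)
        return
    }
    if claimed == 0 {
        w.WriteHeader(http.StatusOK) // duplicate: idempotent response to gateway
        return
    }

    // Enqueue before acknowledging; the ledger work itself runs in a worker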
    if err := enqueuePayment(r.Context(), eventID, r.Body); err != nil {
        // Rollback the Redis key so the next retry can reclaim it
        redisClient.Del(r.Context(), fmt.Sprintf("idemp:webhook:%s", eventID))
        w.WriteHeader(http.StatusServiceUnavailable)
        return
    }
    w.WriteHeader(http.StatusAccepted)
}

Step 4 — Guard with a Payment-Event State Machine

The idempotency key prevents reprocessing at the HTTP boundary; a state machine prevents it at the ledger boundary. Map every payment event type to an explicit lifecycle:

received → validated → processing → settled
                                  ↘ failed

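Before writing to the ledger, query the current state. If the event is already settled, return immediately without re-applying the credit. If the state is processing, either another worker currently holds the event or a previous worker was interrupted; re-execute the idempotent transaction only once updated_at is older than your processing timeout, so that two workers never apply the same event concurrently.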

-- Postgres: claim a processing slot only when state allows it
UPDATE payment_events
SET    state = 'processing', updated_at = NOW()
WHERE  event_id = $1
  AND  state IN ('received', 'validated')
RETURNING event_id, state;
-- Zero rows returned → already settled or in-flight; skip ledger write

This is especially important for partial-capture and refund webhooks, where out-of-order delivery (a payment.refunded arriving before payment.captured) can corrupt balance calculations. Reject out-of-order transitions explicitly rather than silently ignoring them.
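
A minimal in-memory sketch of explicit transition checking follows. The transition table mirrors the lifecycle above plus the received-to-processing shortcut the SQL permits; the names ALLOWED_TRANSITIONS, InvalidTransition, and transition are illustrative assumptions.

```python
# Allowed next states for each current state; terminal states allow none.
ALLOWED_TRANSITIONS = {
    "received": {"validated", "processing"},
    "validated": {"processing"},
    "processing": {"settled", "failed"},
    "settled": set(),
    "failed": set(),
}


class InvalidTransition(Exception):
    """Raised when an event tries to move to a state the lifecycle forbids."""


def transition(current: str, target: str) -> str:
    """Return the new state, or raise instead of silently ignoring the request."""
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise InvalidTransition(f"{current} -> {target}")
    return target


# Happy path: walk one event through the full lifecycle.
state = "received"
for step in ("validated", "processing", "settled"):
    state = transition(state, step)

# A second worker picking up an already-settled event is rejected loudly.
try:
    transition("settled", "processing")
    rejected = False
except InvalidTransition:
    rejected = True
```

Raising on an invalid transition gives the caller a clear signal to dead-letter or requeue the event rather than dropping it.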

Step 5 — Manage TTLs to Bound Storage Growth

Set the idempotency key TTL to at least the maximum gateway retry window plus 20%:

Stripe: 72-hour window → set TTL to 90 hours (324 000 s)
Adyen: 48-hour window → set TTL to 60 hours (216 000 s)
PayPal: 24-hour window → set TTL to 30 hours (108 000 s)
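
The arithmetic behind these values can be checked with a short sketch. The retry windows are the ones quoted above; the helper name min_ttl_seconds is hypothetical, and integer arithmetic is used to avoid floating-point rounding.

```python
RETRY_WINDOW_HOURS = {"stripe": 72, "adyen": 48, "paypal": 24}
LISTED_TTL_SECONDS = {"stripe": 324_000, "adyen": 216_000, "paypal": 108_000}


def min_ttl_seconds(window_hours: int, margin_percent: int = 20) -> int:
    """Smallest TTL in seconds covering the retry window plus a safety margin."""
    window_seconds = window_hours * 3600
    # Ceiling division keeps the result an exact integer.
    return (window_seconds * (100 + margin_percent) + 99) // 100


minimums = {
    provider: min_ttl_seconds(hours)
    for provider, hours in RETRY_WINDOW_HOURS.items()
}
covered = all(
    LISTED_TTL_SECONDS[provider] >= minimums[provider] for provider in minimums
)
```

The 20% floor works out to 311 040 s, 207 360 s, and 103 680 s respectively; the listed TTLs correspond to a 25% margin, so each one clears the floor.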

For Redis, the EX argument on SET NX handles this automatically. For PostgreSQL, run a nightly pruning job:

DELETE FROM webhook_idempotency
WHERE received_at < NOW() - INTERVAL '90 hours';

Add a BRIN index on received_at rather than a B-tree index; BRIN has near-zero write overhead and is ideal for append-only time-ordered tables.

Verification and Testing

Simulate a duplicate delivery locally:

# Send the same payload twice with identical event_id
EVENT_ID="evt_test_$(date +%s)"

curl -X POST http://localhost:3000/webhooks/payment \
  -H "Content-Type: application/json" \
  -H "Stripe-Event-Id: ${EVENT_ID}" \
  -d '{"type":"payment_intent.succeeded","amount":5000}'

# Second call must return 200 {"status":"duplicate"} — not 202
curl -X POST http://localhost:3000/webhooks/payment \
  -H "Content-Type: application/json" \
  -H "Stripe-Event-Id: ${EVENT_ID}" \
  -d '{"type":"payment_intent.succeeded","amount":5000}'

Inspect Redis state:

redis-cli GET "idemp:webhook:${EVENT_ID}"
# Expected: "1"
redis-cli TTL "idemp:webhook:${EVENT_ID}"
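# Expected: a positive integer just under the TTL you configured
# (604800 with the 7-day value from Step 2; 324000 with the 90-hour Stripe value from Step 5)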

Inspect PostgreSQL dedup rows:

SELECT event_id, received_at
FROM   webhook_idempotency
WHERE  event_id = 'evt_test_...'
ORDER  BY received_at DESC;
-- Must show exactly one row regardless of how many POST requests arrived

Verify ledger write count:

SELECT COUNT(*) FROM payment_ledger WHERE event_id = 'evt_test_...';
-- Must be 1, not 2

Failure Scenarios and Debugging

Failure Scenario	Remediation Steps	Observability Hooks
Worker crashes after Redis claim but before ledger write	The Redis key blocks retries until TTL expires. Set TTL no shorter than the maximum retry window. If processing must restart, add a `DEL` of the Redis key in the crash-recovery path, then re-enqueue. Alternatively, use a transactional outbox to write the ledger row and the idempotency record in the same database transaction.	Span attribute `webhook.recovery=true`; alert on `webhook_processing_abandoned_total > 0` for more than 5 m
Redis eviction expires key before retry window closes	Switch to a PostgreSQL-backed idempotency table (eviction-proof) for payment events, or set `maxmemory-policy noeviction` on the Redis instance used for webhook dedup. Do not share this Redis with a cache that uses LRU eviction.	`idempotency_cache_miss_on_retry_total` counter; log field `idempotency.false_negative=true`
Concurrent workers race to claim the same event	Ensure the Redis `SET NX` or PostgreSQL `INSERT ... ON CONFLICT` is the first operation in the request handler — before any database read or business logic. Never split the check and the write into two separate commands. Use `pg_try_advisory_lock` as an additional guard before the state machine update if needed.	`webhook_race_detected_total` counter; trace span `idempotency.race=true`
Out-of-order event delivery corrupts ledger state	Reject state transitions that are not valid from the current state (e.g. `payment.refunded` arriving before `payment.captured`). Dead-letter invalid transitions; replay after the prerequisite event is processed. Implement exponential backoff with jitter on the requeue.	Log field `payment.state_transition_rejected=true`; `webhook_invalid_transition_total` counter per event type
Proxy timeout causes duplicate before handler returns 202	Return `202 Accepted` within 100 ms of receiving the request. Raise the reverse proxy’s idle timeout to be at least 2× your p99 handler latency. Instrument time-to-first-byte at the proxy layer and alert when it exceeds 5 s.	Span attribute `http.response_latency_ms`; alert `webhook_handler_p99_latency_ms > 5000`
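
For the out-of-order scenario in the table, one common way to requeue with exponential backoff and jitter is the "full jitter" variant sketched below. The function name requeue_delay and the 1 s base and 300 s cap are assumptions chosen for illustration, not values prescribed by any gateway.

```python
import random
from typing import Optional


def requeue_delay(
    attempt: int,
    base_s: float = 1.0,
    cap_s: float = 300.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Full-jitter backoff: uniform in [0, min(cap_s, base_s * 2**attempt)]."""
    source = rng if rng is not None else random
    ceiling = min(cap_s, base_s * (2 ** attempt))
    return source.uniform(0.0, ceiling)


# A seeded generator keeps the example reproducible.
rng = random.Random(42)
delays = [requeue_delay(attempt, rng=rng) for attempt in range(12)]
ceilings = [min(300.0, 1.0 * 2 ** attempt) for attempt in range(12)]
```

Drawing the delay uniformly below the ceiling spreads requeued events out in time, which avoids many workers retrying in lockstep.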

SRE Observability Checklist

Emit all of the following for every webhook consumer handling payment events:

webhook_received_total — counter, labels: provider, event_type. Baseline for all rate calculations.
webhook_duplicate_total — counter, labels: provider, event_type, key_source (event_id vs composite). Alert when rate(webhook_duplicate_total[5m]) / rate(webhook_received_total[5m]) > 0.05.
idempotency_cache_hit_ratio — gauge: rate(hits[5m]) / (rate(hits[5m]) + rate(misses[5m])). Values below 0.95 during a retry storm indicate key eviction.
webhook_processing_latency_ms (p50/p95/p99) — histogram. Alert p99 > 30000 (30 s), which approaches Stripe’s retry trigger threshold.
ledger_reconciliation_mismatch_total — counter. Any value above 0 must page immediately (P1). Cross-reference against the provider’s /events API to identify gaps.
OpenTelemetry span attributes on every request: webhook.event_id, webhook.duplicate (bool), idempotency.cache_state (HIT/MISS), payment.state_transition (e.g. validated→processing), webhook.provider.
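
The two ratio-based signals in the checklist reduce to simple arithmetic over counter deltas within a window. The sketch below shows that arithmetic in isolation; the function names and sample counts are hypothetical, and a real deployment would normally evaluate these expressions in the metrics backend instead.

```python
def duplicate_ratio(duplicates: int, received: int) -> float:
    """Share of deliveries in the window that were duplicates (0.0 when idle)."""
    return duplicates / received if received else 0.0


def cache_hit_ratio(hits: int, misses: int) -> float:
    """Share of idempotency lookups that found an existing key (0.0 when idle)."""
    total = hits + misses
    return hits / total if total else 0.0


def should_alert_on_duplicates(
    duplicates: int, received: int, threshold: float = 0.05
) -> bool:
    """Mirror of the 5% duplicate-rate alert threshold from the checklist."""
    return duplicate_ratio(duplicates, received) > threshold


# Sample window: 200 deliveries received, 12 of them duplicates.
noisy_window = should_alert_on_duplicates(duplicates=12, received=200)
quiet_window = should_alert_on_duplicates(duplicates=5, received=200)
# Sample retry storm: 90 lookups hit an existing key, 10 missed.
storm_hit_ratio = cache_hit_ratio(hits=90, misses=10)
```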

Structured log fields per event:

{
  "timestamp": "2026-06-23T09:14:33.201Z",
  "service": "payment-webhook-consumer",
  "event_id": "evt_1Nq...",
  "provider": "stripe",
  "idempotency_key_source": "event_id",
  "idempotency_cache_state": "MISS",
  "payment_state_transition": "received→processing",
  "processing_latency_ms": 18,
  "duplicate": false
}
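
One way to emit a record of this shape is a small helper that serialises the fields to a single JSON line. This is a sketch using only the standard library; the helper name build_log_record is an assumption, and the field names are taken from the example above.

```python
import json
from datetime import datetime, timezone


def build_log_record(
    event_id: str,
    provider: str,
    key_source: str,
    cache_state: str,
    state_transition: str,
    latency_ms: int,
    duplicate: bool,
) -> str:
    """Serialise one webhook event to a single-line JSON log record."""
    now = datetime.now(timezone.utc)
    record = {
        # ISO 8601 in UTC with millisecond precision and a trailing "Z"
        "timestamp": now.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
        "service": "payment-webhook-consumer",
        "event_id": event_id,
        "provider": provider,
        "idempotency_key_source": key_source,
        "idempotency_cache_state": cache_state,
        "payment_state_transition": state_transition,
        "processing_latency_ms": latency_ms,
        "duplicate": duplicate,
    }
    # ensure_ascii=False keeps the arrow in the transition readable.
    return json.dumps(record, ensure_ascii=False)


line = build_log_record(
    event_id="evt_example",
    provider="stripe",
    key_source="event_id",
    cache_state="MISS",
    state_transition="received→processing",
    latency_ms=18,
    duplicate=False,
)
parsed = json.loads(line)
```

Keeping each record on one line lets log shippers treat every event as a single parseable unit.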

Webhook Delivery Guarantees — the parent page explaining at-least-once semantics, retry matrices, and the delivery contract every payment provider offers.
Using Redis SET NX for Distributed Request Deduplication — deep dive on the atomic SET NX pattern, Lua scripting, and cluster-mode considerations used in Step 2.
Wrapping Database Transactions for Safe Retries — covers the transactional outbox pattern that eliminates the crash-between-claim-and-write failure mode described in the failure scenarios table.
Implementing Exponential Backoff Without Overlapping Retries — jitter strategies for re-queuing out-of-order events without amplifying retry storms.