What is idempotency in distributed systems?

Idempotency is the guarantee that invoking an operation multiple times with identical inputs produces the same system state and observable outputs as invoking it once. In distributed APIs, it converts at-least-once delivery into application-level exactly-once processing.

How long should an idempotency key TTL be?

TTL should match your business retry window. Payment processors typically use 48–72 hours. High-frequency systems with short retry horizons (e.g., 15-minute windows) can use shorter TTLs, but cache eviction before retry expiry causes duplicate processing.

Idempotency Fundamentals & API Guarantees

Which HTTP methods are idempotent by default?

GET, HEAD, OPTIONS, and TRACE are safe and idempotent per RFC 9110. PUT and DELETE are idempotent but not safe. POST and PATCH are neither safe nor inherently idempotent, so they require explicit idempotency-key mechanics.

In distributed backend architectures, network unreliability is a baseline operating condition. When clients retry failed requests, gateways time out, or message brokers redeliver payloads, systems must guarantee that repeated invocations do not corrupt state or trigger duplicate side effects. Idempotency is the architectural contract that transforms at-least-once delivery into application-level exactly-once semantics. This page covers the engineering reality of idempotency, end-to-end request flow, failure boundary mapping, deduplication mechanics, trade-off matrices, and production-ready implementation strategies.

Engineering Contract

In pure mathematics, idempotency is defined as f(f(x)) = f(x). Applied to distributed systems, this translates to a concrete guarantee: invoking an operation multiple times with identical inputs yields the same system state and identical observable outputs as invoking it once.

Mathematical purity rarely survives network partitions, clock drift, or partial transaction commits. In practice, idempotency is a contractual guarantee enforced by application code, not a framework default. It requires explicit state tracking, deterministic execution paths, and careful isolation of side effects — external webhooks, ledger postings, inventory decrements, email dispatches. The precise failure mode idempotency prevents is the duplicate side effect: a payment charged twice, a counter incremented on every retry, or an account created multiple times from a single user action.

Understanding HTTP method semantics and safety establishes the baseline: RFC 9110 categorises methods by safety (read-only, no state mutation) and idempotency (repeated calls yield identical state). GET, HEAD, OPTIONS, and TRACE are safe and idempotent by protocol design. PUT and DELETE are idempotent but not safe. POST and PATCH are neither — so they demand explicit idempotency-key mechanics at every state-mutating boundary.

Conceptual Architecture: End-to-End Request Flow

The idempotency lifecycle has five discrete stages. A failure at any stage produces a specific class of duplicate or inconsistency — mapping these precisely is what makes deduplication reliable.

Figure: Five-stage idempotency pipeline. A cache hit short-circuits after stages 1–3; a concurrent retry on a PENDING key either blocks until the key is COMPLETED and then returns the cached result, or receives 409 Conflict.

Key State Machine

Every idempotency key exists in one of three states:

(none) — key has never been seen; proceed to execute.
PENDING — a concurrent request is mid-execution; block or return 409 Conflict.
COMPLETED — execution finished; return the cached response without re-executing.

An ERROR sub-state (key reserved but execution failed) requires explicit handling: either release the key to allow a clean retry, or persist the error response so retries receive a consistent failure rather than re-executing partial work.
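The states above can be sketched as a small transition table. This is a minimal illustration, assuming ERROR is modelled as a first-class state and that releasing a failed key returns it to the unseen state. The names `KeyState`, `ALLOWED`, and `transition` are hypothetical and not taken from any library.

```python
from enum import Enum


class KeyState(Enum):
    NONE = "none"            # key never seen
    PENDING = "pending"      # reserved, execution in flight
    COMPLETED = "completed"  # finished, response cached
    ERROR = "error"          # reserved, but execution failed


# Allowed transitions (an assumption for this sketch, not a standard).
ALLOWED = {
    KeyState.NONE: {KeyState.PENDING},
    KeyState.PENDING: {KeyState.COMPLETED, KeyState.ERROR},
    KeyState.ERROR: {KeyState.NONE},  # release the key for a clean retry
    KeyState.COMPLETED: set(),        # terminal: replay the cached response
}


def transition(current: KeyState, target: KeyState) -> KeyState:
    """Return the new state, or raise if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```

If the design persists error responses instead of releasing the key, ERROR would become terminal in the same way COMPLETED is here.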

Failure Boundary Map

Idempotency guarantees must be enforced at precise architectural boundaries. A request traverses multiple failure domains, each introducing its own partial-commit scenario.

Layer	Failure Mode	Idempotency Concern
Load balancer	TCP connection reset after forwarding	Client retries reach a different backend instance; key store must be shared across instances
API gateway	Request buffered, response dropped on timeout	Service executes successfully; client never receives `200`; next retry must hit deduplication cache, not re-execute
Service mesh	Circuit breaker trips mid-flight; mTLS handshake failure	Partial routing: some replicas processed the request, some did not; idempotency key must be reserved before fan-out
Application DB	Transaction commits but broker publish fails	Business state mutated; event not emitted; transactional outbox pattern resolves this gap
Message broker	At-least-once redelivery; duplicate event on consumer restart	Consumers must check deduplication store before processing; see webhook delivery guarantees
Idempotency store	Cache eviction before TTL; Redis failover	Key lost mid-TTL window; next retry executes as first-time request — must alert on `idempotency_store_miss_rate` spike

The critical insight is that gateway-masked success is the most dangerous failure mode. The backend completes the operation, the response is lost in transit, and without a deduplication cache the next client retry triggers a second execution. Atomic key reservation before execution closes this window.

Implementation Patterns Overview

Each implementation domain has a dedicated deep-dive. The table below maps the domain to the variant it covers and its primary trade-off.

Domain	Variants Covered	Primary Trade-off	Page
Key generation & entropy	UUIDv4, UUIDv7, HMAC-deterministic	Randomness vs. time-ordering vs. payload coupling	Idempotency Key Generation Strategies
HTTP semantics & safety	`GET`/`PUT`/`DELETE`/`POST`/`PATCH` per RFC 9110	Protocol guarantees vs. application-layer enforcement surface	HTTP Method Semantics & Safety
Retry mechanics	Exponential backoff, jitter, circuit breakers	Retry aggression vs. infrastructure load	Retry Logic & Backoff Fundamentals
Async delivery	Webhooks, HMAC signing, replay windows	Delivery latency vs. duplication surface	Webhook Delivery Guarantees
Storage backends	Redis SET NX, PostgreSQL UPSERT, DynamoDB conditional writes	Latency vs. durability vs. cost	Redis Cache-Based Deduplication · Database Unique Constraints & Upserts
Key TTL & eviction	Redis TTL, Postgres partial index, DynamoDB TTL attribute	Memory pressure vs. deduplication window	Idempotency Key Storage & TTL Management
Atomic transactions	Outbox pattern, 2PC avoidance, optimistic locking	Throughput vs. consistency boundary	Transaction Scoping & Atomic Operations
Distributed locks	Redlock, ZooKeeper, single-node Redis	Availability vs. safety under partition	Distributed Lock Acquisition Patterns

Trade-off Matrix

Choosing a backend for idempotency state requires balancing four operational dimensions: write latency, read latency, durability, and consistency under failure.

Idempotency store trade-offs. Redis dominates on latency; PostgreSQL on audit durability; DynamoDB on multi-region availability. Most payment systems pair Redis for hot-path deduplication with PostgreSQL for durable audit records.

Hybrid Strategy

Production payment systems typically run Redis as the hot-path deduplication layer (sub-millisecond reads, TTL-managed eviction) with a PostgreSQL write-through for durable audit. The Redis key holds the request outcome for the retry window; the Postgres row lives indefinitely for dispute resolution. Redis cache-based deduplication covers the SET key value NX EX 172800 atomic reservation pattern in detail.

Request Deduplication Mechanics

Idempotency Key Lifecycle in Detail

When a request arrives at the API layer, the deduplication pipeline executes this sequence:

1. Extract Idempotency-Key header; reject with 400 Bad Request if absent or malformed.
2. Attempt atomic reservation: SET key PENDING NX EX 172800 (Redis) or INSERT ... ON CONFLICT DO NOTHING (Postgres).
3. If reservation fails (key exists):
   - State COMPLETED → return cached status, headers, and body verbatim.
   - State PENDING → return 409 Conflict with Retry-After: 1 or block with a short poll.
4. Execute business logic within a database transaction.
5. On success: atomically update key to COMPLETED and cache the full HTTP response.
6. On failure: transition key to ERROR; decide whether to release for retry or cache the error response.

This sequence ensures that concurrent retries for the same key never execute business logic twice. The NX flag (or ON CONFLICT DO NOTHING) is the critical gate — without it, a race condition between two simultaneous retries allows both to proceed past the cache lookup.
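The sequence above can be sketched in-process. This is a minimal illustration, not a production store: `InMemoryKeyStore` is a hypothetical stand-in for Redis or Postgres, its lock emulates the atomic set-if-absent gate, and on failure it simply releases the key (one of the two options described for the ERROR state).

```python
import threading
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

Response = Tuple[int, str]  # (status code, body)


@dataclass
class Record:
    state: str                          # "PENDING" or "COMPLETED"
    response: Optional[Response] = None


class InMemoryKeyStore:
    """Stand-in for a shared store; the lock emulates an atomic set-if-absent."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._records: Dict[str, Record] = {}

    def reserve(self, key: str) -> Optional[Record]:
        """Reserve atomically. Returns None on success, else the existing record."""
        with self._lock:
            existing = self._records.get(key)
            if existing is not None:
                return existing
            self._records[key] = Record(state="PENDING")
            return None

    def complete(self, key: str, response: Response) -> None:
        with self._lock:
            self._records[key] = Record(state="COMPLETED", response=response)

    def release(self, key: str) -> None:
        with self._lock:
            self._records.pop(key, None)


def handle(store: InMemoryKeyStore, key: Optional[str],
           operation: Callable[[], Response]) -> Response:
    if not key:
        return (400, "missing Idempotency-Key")
    existing = store.reserve(key)
    if existing is not None:
        if existing.state == "COMPLETED":
            return existing.response    # replay the cached response verbatim
        return (409, "request in progress")
    try:
        response = operation()          # business logic runs at most once per key
    except Exception:
        store.release(key)              # this sketch releases the key on failure
        raise
    store.complete(key, response)
    return response
```

A real deployment would replace the lock with the store's own atomic primitive, since an in-process lock does not protect requests that land on different backend instances.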

Key Generation & Collision Avoidance

Key integrity relies on sufficient entropy. UUIDv4 provides 122 bits of randomness, so collision probability is negligible at any realistic scale. UUIDv7 places a 48-bit Unix millisecond timestamp in its most significant bits, which gives time-ordered keys and improves B-tree index locality and cache eviction predictability at the cost of reduced randomness (up to 74 random bits instead of 122). For replay protection, a deterministic hash such as SHA-256(method + path + canonical_body) can supplement client keys, but it couples deduplication to payload structure: the body must be canonicalised before hashing, because otherwise any field reorder or whitespace change produces a new key. For an in-depth comparison of these generation strategies, see idempotency key generation strategies.
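As a rough sketch of the random and deterministic approaches, the snippet below uses only the Python standard library. The function names are hypothetical, and the canonical form (sorted keys, compact separators, newline-joined fields) is an assumption chosen for illustration; any stable encoding agreed between client and server would serve.

```python
import hashlib
import json
import uuid


def random_key() -> str:
    """Client-generated random key (UUIDv4)."""
    return str(uuid.uuid4())


def deterministic_key(method: str, path: str, body: dict) -> str:
    """SHA-256 over method, path, and a canonical JSON encoding of the body."""
    # Sorting keys and fixing separators makes the hash independent of
    # field order and whitespace in the original payload.
    canonical_body = json.dumps(body, sort_keys=True, separators=(",", ":"))
    material = "\n".join([method.upper(), path, canonical_body])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()
```

With this canonical form, `{"a": 1, "b": 2}` and `{"b": 2, "a": 1}` hash to the same key, while any change to a value yields a different one.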

Retries, Timeouts & Network Instability

The Retry Storm Problem

Naive client retries without backoff amplify duplicate processing, exhaust connection pools, and saturate downstream databases. When a transient failure hits, hundreds of clients may retry simultaneously, creating a thundering herd. Idempotency keys neutralise the business impact — duplicate executions are blocked — but they do not prevent infrastructure degradation. Connection pools still exhaust; thread pools still saturate.

Exponential backoff with jitter is the algorithmic counterpart to server-side deduplication. Backoff spaces retries at exponentially growing intervals; jitter randomises each delay to prevent synchronised waves. Clients must also honour Retry-After response headers and implement circuit breakers to halt retries when backend health signals sustained degradation: a circuit breaker open for 30 seconds is far cheaper than 10,000 concurrent retries hammering a degraded database.
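A minimal sketch of the delay calculation, using the "full jitter" variant in which each delay is drawn uniformly between zero and the capped exponential ceiling. The function name and the default `base` and `cap` values are assumptions for illustration, not recommendations.

```python
import random
from typing import Optional


def full_jitter_delay(attempt: int, base: float = 0.5, cap: float = 30.0,
                      rng: Optional[random.Random] = None) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    rng = rng or random.Random()
    ceiling = min(cap, base * (2 ** attempt))  # exponential growth, capped
    return rng.uniform(0.0, ceiling)           # full jitter: anywhere in [0, ceiling]
```

A client would sleep for `full_jitter_delay(n)` before its n-th retry, and prefer the server's Retry-After value when one is returned.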

Timeout Alignment

Client retry windows and server-side idempotency TTLs must be aligned. If a client stops retrying after 24 hours but the idempotency key expires at 12 hours, any retry after the 12-hour mark executes as a first-time request. The rule: idempotency_key_ttl ≥ client_max_retry_duration + clock_skew_buffer. For most payment flows this means 48 hours minimum.
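The alignment rule can be checked mechanically, for example in a configuration test. The helper names below are hypothetical, and the one-hour skew buffer in the usage note is an assumed value.

```python
from datetime import timedelta


def min_key_ttl(client_max_retry: timedelta,
                clock_skew_buffer: timedelta) -> timedelta:
    """Smallest TTL that still covers the client's last possible retry."""
    return client_max_retry + clock_skew_buffer


def ttl_is_safe(key_ttl: timedelta, client_max_retry: timedelta,
                clock_skew_buffer: timedelta) -> bool:
    """True when idempotency_key_ttl >= client_max_retry_duration + buffer."""
    return key_ttl >= min_key_ttl(client_max_retry, clock_skew_buffer)
```

For the example above, `ttl_is_safe(timedelta(hours=12), timedelta(hours=24), timedelta(hours=1))` is false, while a 48-hour TTL passes the same check.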

Asynchronous Workflows & Event-Driven Deduplication

Webhook & Callback Idempotency

Asynchronous delivery mechanisms — message brokers, HTTP callbacks, webhooks — operate on strict at-least-once semantics. Brokers guarantee delivery but not uniqueness. Webhook consumers must implement HMAC-SHA256 signature verification to authenticate payloads, replay windows (reject events older than 300 seconds) to block stale redeliveries, and a deduplication store keyed on the event’s unique identifier. Webhook delivery guarantees covers the full consumer pattern.
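The three consumer checks can be sketched as follows. The signing scheme (HMAC-SHA256 over `timestamp.payload`) is an assumption for illustration, since providers differ in what they sign, and the in-process `seen` set stands in for a shared deduplication store. All names here are hypothetical.

```python
import hashlib
import hmac
import time
from typing import Optional, Set

REPLAY_WINDOW_SECONDS = 300


def sign(secret: bytes, timestamp: int, payload: bytes) -> str:
    """Hex HMAC-SHA256 over 'timestamp.payload' (assumed scheme)."""
    message = str(timestamp).encode("ascii") + b"." + payload
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def accept_event(secret: bytes, event_id: str, timestamp: int, payload: bytes,
                 signature: str, seen: Set[str],
                 now: Optional[float] = None) -> str:
    now = time.time() if now is None else now
    expected = sign(secret, timestamp, payload)
    # Constant-time comparison avoids leaking signature bytes via timing.
    if not hmac.compare_digest(expected, signature):
        return "rejected: bad signature"
    if abs(now - timestamp) > REPLAY_WINDOW_SECONDS:
        return "rejected: outside replay window"
    if event_id in seen:
        return "duplicate"              # acknowledge, but do not reprocess
    seen.add(event_id)
    return "process"
```

In a multi-instance consumer the membership check and insert would need to be a single atomic operation against the shared store, as with request keys.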

Transactional Outbox Pattern

The transactional outbox pattern solves the dual-write problem: writing a domain event to a message broker and updating application state are two operations that can fail independently. The outbox writes both within the same database transaction — business state and a pending event record commit atomically. A relay process reads the outbox and publishes to the broker; the consumer processes with idempotency. If the relay crashes, it restarts and republishes from the last unacknowledged row — the consumer’s deduplication store absorbs the duplicate. Transaction scoping and atomic operations covers outbox implementation with Postgres advisory locks and Debezium CDC.
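A compact sketch of the pattern, using SQLite from the Python standard library as a stand-in for the application database. The table layout, function names, and the `publish` callback are hypothetical; a real relay would publish to a broker and might use CDC instead of polling.

```python
import sqlite3
from typing import Callable


def create_schema(conn: sqlite3.Connection) -> None:
    conn.executescript("""
        CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL);
        CREATE TABLE outbox (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            event_type TEXT NOT NULL,
            payload TEXT NOT NULL,
            published INTEGER NOT NULL DEFAULT 0
        );
    """)


def place_order(conn: sqlite3.Connection, order_id: str) -> None:
    # One transaction: the state change and the pending event commit
    # together, or neither does.
    with conn:
        conn.execute(
            "INSERT INTO orders (id, status) VALUES (?, 'PLACED')", (order_id,))
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES ('order_placed', ?)",
            (order_id,))


def relay_once(conn: sqlite3.Connection,
               publish: Callable[[str, str], None]) -> int:
    """Publish every unpublished row in order; return how many were sent."""
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox "
        "WHERE published = 0 ORDER BY id").fetchall()
    for row_id, event_type, payload in rows:
        publish(event_type, payload)  # a crash after this line causes a re-publish
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

Because the row is marked published only after `publish` returns, a relay crash between the two steps leads to a duplicate publish on restart, which is why the consumer still needs its own deduplication.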

State Management & Finite State Machines

Idempotent State Transitions

Designing idempotent APIs requires mapping operations to explicit, deterministic state transitions. Instead of imperative commands like increment_balance(), use declarative transitions like apply_delta(idempotency_key, delta) where the key guards the delta from being applied twice. Database operations must use atomic check-and-set (SELECT ... FOR UPDATE + conditional UPDATE) or optimistic concurrency control (WHERE version = expected_version) to prevent race conditions between concurrent requests that bypass the idempotency cache.

Finite State Machines for Workflows

Complex workflows such as payment authorisation → capture → settlement benefit from finite state machines (FSMs) that enforce valid transitions and guard against illegal operations during retries. An FSM ensures that a CAPTURE request on an already-CAPTURED order returns the original success response rather than attempting a duplicate capture. Explicit state guards at the domain layer prevent race conditions between the cache check and the database write, a window that distributed lock acquisition patterns close in high-concurrency scenarios.
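A minimal in-memory sketch of such a state guard follows. The state names, the `TRANSITIONS` table, and the `Payment` class are hypothetical, and a real implementation would persist the state and guard it with a transaction or lock.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

Response = Tuple[int, str]

# Valid (state, action) pairs and the state each one leads to.
TRANSITIONS = {
    ("AUTHORISED", "capture"): "CAPTURED",
    ("CAPTURED", "settle"): "SETTLED",
}


@dataclass
class Payment:
    state: str = "AUTHORISED"
    responses: Dict[str, Response] = field(default_factory=dict)

    def apply(self, action: str) -> Response:
        if action in self.responses:
            # Action already succeeded: replay the original response.
            return self.responses[action]
        target = TRANSITIONS.get((self.state, action))
        if target is None:
            return (409, f"cannot {action} from {self.state}")
        self.state = target
        response = (200, f"{action} ok")
        self.responses[action] = response
        return response
```

A repeated `capture` returns the stored `(200, "capture ok")` without touching the state, while `settle` before `capture` is rejected with 409.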

Anti-Patterns & Pitfalls

1. Missing Key Validation

Accepting any string as an idempotency key without format validation invites cache poisoning. A client sending an empty string or a predictable low-entropy key (e.g., "1") can accidentally or maliciously trigger false-positive deduplication. Enforce a UUID format and a 32–128 character length limit, and reject requests with missing or malformed keys with 400 Bad Request rather than falling through permissively.
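A validation helper along these lines might look like the sketch below. The function name is hypothetical, and normalising to the canonical lowercase hyphenated form is an assumption that keeps equivalent spellings of one UUID from occupying separate keys.

```python
import uuid
from typing import Optional


def validate_key(raw: Optional[str]) -> Optional[str]:
    """Return the canonical key, or None if the request should get 400."""
    if raw is None:
        return None
    candidate = raw.strip()
    if not 32 <= len(candidate) <= 128:
        return None
    try:
        # uuid.UUID raises ValueError for anything that is not a UUID.
        return str(uuid.UUID(candidate))
    except ValueError:
        return None
```

The caller maps a `None` result to 400 Bad Request and uses the returned canonical string as the store key.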

2. Stale Cache Replay

Returning a cached 200 OK without verifying that the underlying resource still exists or has not been administratively overridden produces incorrect client state. Example: a refund is processed and cached as COMPLETED; an admin manually reverses the refund in the database; the next retry returns the original success response even though the database no longer reflects a refund. Solution: for long-lived keys (> 1 hour), include a resource version in the cached response and validate it on replay, or expire the key aggressively.

3. Ignoring Partial Failures

Transitioning a key to COMPLETED when downstream services failed — returning 200 OK while a dependent ledger posting failed — leaves the system in a permanently inconsistent state. Every retry receives a cached success but the business operation never completed. Design explicit ERROR key states, surface them as 409 Conflict or 424 Failed Dependency, and provide a resolution endpoint to explicitly release or re-execute the failed operation.

4. Over-Engineering Safe Endpoints

Applying idempotency-key mechanics to GET endpoints wastes cache capacity and adds latency. Safe, read-only methods carry protocol-level idempotency guarantees — no deduplication store is needed. Reserve explicit key enforcement for POST and PATCH endpoints that mutate state.

5. TTL Shorter Than Retry Window

Evicting idempotency keys before client retries expire causes duplicate processing without any error signal: the system silently executes the operation a second time. Set idempotency_key_ttl in your TTL management configuration to at least client_max_retry_duration + 24 h.

6. Non-Atomic Key Reservation

Using a read-then-write sequence (GET key; if absent: SET key) instead of a single atomic operation creates a race condition: two concurrent requests both read “absent” and both proceed to execute. The NX flag in Redis SET key value NX EX ttl and INSERT ... ON CONFLICT DO NOTHING in PostgreSQL perform the check and the write as one atomic operation, as does a DynamoDB conditional write. Any check-then-set pattern split across two round trips is incorrect under concurrent load.

Production Readiness Checklist

Enforce Idempotency-Key on all mutating endpoints — reject POST and PATCH requests that omit the header with 400 Bad Request; include the required format in the error body.
Reserve keys atomically — use SET key PENDING NX EX 172800 (Redis) or INSERT ... ON CONFLICT DO NOTHING (PostgreSQL); never use read-then-write sequences.
Cache the full HTTP response — store status code, response headers, and body verbatim; replay must be byte-for-byte identical to satisfy client contracts.
Set TTL ≥ 48 h for payment endpoints — align to business retry windows; alert when idempotency_key_ttl < client_max_retry_duration.
Emit structured audit log entries — log key_hash, state, outcome, request_id, timestamp on every hit, miss, and state transition for SRE observability and dispute resolution.
Implement graceful degradation — when the idempotency store is unreachable, reject mutating requests with 503 Service Unavailable rather than processing without deduplication; include Retry-After header.
Add ERROR state handling — define explicit resolution paths for partial failures; never silently promote a failed operation to COMPLETED.
Test under concurrent load — simulate 50 simultaneous duplicate requests to verify atomic key reservation and absence of duplicate side effects; use wrk or k6 with a shared idempotency key.
Run chaos scenarios — kill the idempotency store mid-flight, induce gateway timeouts after backend success, and restart consumers during broker redelivery; verify no duplicate mutations occur.
Monitor idempotency_hit_rate — a sustained rate > 5% indicates clients are retrying aggressively; investigate underlying reliability causes rather than tuning the deduplication layer.

HTTP Method Semantics & Safety — RFC 9110 method categories, safe vs. idempotent classifications, and where protocol guarantees end and application enforcement begins.
Idempotency Key Generation Strategies — UUIDv4 vs. UUIDv7 vs. HMAC-deterministic keys: entropy, index locality, and replay-protection trade-offs.
Retry Logic & Backoff Fundamentals — exponential backoff, full jitter, circuit breakers, and Retry-After alignment with server-side TTLs.
Webhook Delivery Guarantees — at-least-once broker semantics, HMAC signature verification, replay windows, and idempotent consumer patterns.
Backend Implementation & Storage Patterns — Redis, PostgreSQL, and DynamoDB implementations of idempotency state, covering atomic operations, TTL management, and outbox patterns.
Distributed Coordination & Locking Strategies — Redlock, ZooKeeper, and lease-based locks that close the race-condition window between cache check and database write.
Observability & Operations for Idempotent Systems — metrics, distributed tracing, and chaos experiments that prove the deduplication contract holds in production.