When should I retry a 500 Internal Server Error?

Retry 500s only if the endpoint is provably idempotent and you have observed that the server recovers within your backoff window. A naked 500 may indicate state corruption; pair it with a circuit breaker and cap retries at 2.

What is the difference between full jitter and equal jitter?

Full jitter picks a random value between 0 and the full exponential ceiling, giving maximum spread but occasionally very short delays. Equal jitter guarantees at least half the exponential value is preserved, balancing spread with predictable minimum wait times.

How do I prevent retry storms in microservices?

Combine jittered exponential backoff, per-client concurrency limits, and circuit breakers. Instrument retry rates as a percentage of baseline traffic and shed load automatically when the ratio exceeds 20%.

Retry Logic & Backoff Fundamentals

Part of: Idempotency Fundamentals & API Guarantees

Distributed systems fail at the network, not in the code. TCP resets, DNS timeouts, and transient 503 responses are not exceptional — they are the steady-state operating environment of any multi-service architecture. Retry logic is the mechanism that converts those transient faults into transparent recoveries, but it introduces its own failure modes: retry storms, duplicate state mutations, and SLA violations caused by compounded latency. This page defines the contract a correct retry strategy must honour, then works through the algorithms and implementation patterns that fulfil it — grounded in the idempotency guarantees that make safe re-execution possible in the first place.

Guarantee Model

A retry policy provides at-least-once delivery at the network boundary. It cannot alone provide exactly-once semantics — that requires idempotent server-side processing anchored by a deduplicated idempotency key. Without that anchor, each retry is a potential duplicate execution.

The contract breaks under three conditions:

Partition with state visibility loss. The server received and processed the request but the response was lost in transit. The client cannot distinguish this from a server failure, so it retries — sending a duplicate.
Clock skew across nodes. Timeout calculations derived from wall-clock time diverge between caller and callee, causing one side to abandon a request the other is still processing.
Thundering herd after partial recovery. Every client whose retry timer fires simultaneously floods the recovering service, re-triggering the failure.

All three are addressed by the algorithms below, but none can be resolved by backoff alone — they require coordinated idempotency enforcement upstream.

Failure Classification: Transient vs. Permanent

Retry logic must strictly separate recoverable network anomalies from deterministic client errors. Retrying a 400 Bad Request is wasteful; retrying a 504 Gateway Timeout is correct.

Retry these status codes:

408 Request Timeout — server-side timeout, safe to retry with backoff
429 Too Many Requests — rate-limited; honour the Retry-After header before re-sending
502 Bad Gateway, 503 Service Unavailable, 504 Gateway Timeout — upstream unavailability, primary retry targets

Do not retry these:

400 Bad Request, 422 Unprocessable Entity — malformed payload; retrying wastes cycles
401 Unauthorized, 403 Forbidden — credential issue; retrying will not fix it
404 Not Found — resource absent; retrying changes nothing
409 Conflict — state conflict that requires application-level resolution, not blind retry

Treat carefully:

500 Internal Server Error — may indicate state corruption; only retry if the endpoint is verified idempotent and you cap attempts at 2
503 with no Retry-After — apply exponential backoff; the service may be in a degraded rolling-restart state

As covered in HTTP Method Semantics & Safety, GET and HEAD are safe to retry unconditionally. POST, PATCH, and DELETE require explicit idempotency guards before any automatic retry fires.
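The three buckets above can be expressed as a small lookup. This is a minimal sketch, not a canonical policy: the names `RetryDecision` and `classify_status` are hypothetical, and treating every unlisted status code as fail-fast is an assumption of the sketch rather than something the classification above specifies.

```python
from enum import Enum


class RetryDecision(Enum):
    RETRY = "retry"          # transient fault: retry with backoff
    FAIL_FAST = "fail_fast"  # deterministic client error: surface immediately
    CAREFUL = "careful"      # retry only if the endpoint is idempotent, max 2 attempts


# Status codes listed under "Retry these" in the text.
RETRYABLE = {408, 429, 502, 503, 504}


def classify_status(status: int) -> RetryDecision:
    """Map an HTTP status code to one of the three retry buckets."""
    if status in RETRYABLE:
        return RetryDecision.RETRY
    if status == 500:
        return RetryDecision.CAREFUL
    # 400, 401, 403, 404, 409, 422 land here; so does any unlisted code (assumption).
    return RetryDecision.FAIL_FAST
```

A caller would typically consult `classify_status` before entering the backoff loop, and apply the tighter attempt cap when the result is `CAREFUL`.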

Core Algorithm: Exponential Backoff with Jitter

The canonical retry algorithm adds an exponentially growing delay between attempts and randomises that delay to prevent synchronised storms. The sequence below is the production-standard implementation.

Step-by-step protocol

1. Attempt the request.
2. On a retryable failure, compute the ceiling: cap = min(base × 2^attempt, max_delay).
3. Sample a jittered wait: wait = random(0, cap) (full jitter) or wait = cap/2 + random(0, cap/2) (equal jitter).
4. Sleep for wait milliseconds, then repeat from step 1.
5. After max_attempts, surface the failure to the caller.

Sequence diagram: jittered exponential retry

Backoff algorithm comparison

Algorithm	Formula	Thundering Herd Risk	Minimum Wait Preserved	Best For
Linear	`base × attempt`	High (synchronised)	Yes	Predictable, low-concurrency jobs
Exponential (no jitter)	`base × 2^attempt`	Very high	Yes	Single-client scenarios only
Full jitter	`random(0, min(base × 2^attempt, max_delay))`	Minimal	No	High-concurrency API clients
Equal jitter	`cap/2 + random(0, cap/2)`	Low	Yes (≥ cap/2)	Production payment / fintech APIs
Decorrelated jitter	`min(max_delay, random(base, prev_wait × 3))`	Minimal	Varies	AWS SDK–style clients

Equal jitter is the recommended default for production systems: it prevents storms while guaranteeing a meaningful minimum wait that prevents tight retry loops during fast, persistent failures.
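The jittered rows of the comparison can be written as three small functions. This is a sketch under stated assumptions: the function names are hypothetical, `attempt` is zero-based as in the step-by-step protocol, and the decorrelated variant clamps its result to `max_delay`. The `rng` parameter exists only so a seeded generator can be injected for repeatable tests.

```python
import random


def ceiling(base, attempt, max_delay):
    """cap = min(base * 2^attempt, max_delay), with attempt starting at 0."""
    return min(base * (2 ** attempt), max_delay)


def full_jitter(base, attempt, max_delay, rng=random):
    """Uniform over [0, cap]: maximum spread, no guaranteed minimum wait."""
    return rng.uniform(0, ceiling(base, attempt, max_delay))


def equal_jitter(base, attempt, max_delay, rng=random):
    """Half the ceiling is fixed, the other half is random: wait is in [cap/2, cap]."""
    cap = ceiling(base, attempt, max_delay)
    return cap / 2 + rng.uniform(0, cap / 2)


def decorrelated_jitter(base, prev_wait, max_delay, rng=random):
    """Next wait depends on the previous wait rather than on the attempt number."""
    return min(max_delay, rng.uniform(base, prev_wait * 3))
```

For decorrelated jitter the caller keeps the previous wait between attempts and seeds it with `base` on the first retry.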

Implementation Variants

Variant A: Node.js (async/await, equal jitter)

async function retryWithBackoff(fn, { maxAttempts = 5, baseMs = 200, maxMs = 10000 } = {}) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (!isRetryable(err) || attempt === maxAttempts - 1) throw err;
      const cap = Math.min(baseMs * 2 ** attempt, maxMs);
      const wait = cap / 2 + Math.random() * (cap / 2); // equal jitter
      await new Promise(resolve => setTimeout(resolve, wait));
    }
  }
}

function isRetryable(err) {
  const retryableCodes = new Set([408, 429, 502, 503, 504]);
  return err.status && retryableCodes.has(err.status);
}

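Never schedule the delay with a bare setTimeout call inside the loop without awaiting it. setTimeout returns immediately, so the loop would not pause and every attempt would fire back-to-back with no backoff. The await new Promise(...) pattern suspends the function until the timer fires and leaves the event loop free for other work.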

Variant B: Go (buffered channel concurrency cap)

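import (
    "context"
    "math/rand"
    "time"
)

// retryWithBackoff calls fn up to maxAttempts times, sleeping with equal jitter
// between attempts. isRetryable is assumed to be defined elsewhere in the package.
func retryWithBackoff(ctx context.Context, fn func() error) error {
    const maxAttempts = 5
    base := 200 * time.Millisecond
    maxDelay := 10 * time.Second
    var err error
    for attempt := 0; attempt < maxAttempts; attempt++ {
        if err = fn(); err == nil {
            return nil
        }
        if !isRetryable(err) || attempt == maxAttempts-1 {
            return err
        }
        // Named ceiling rather than cap to avoid shadowing the built-in cap().
        ceiling := base * (1 << attempt)
        if ceiling > maxDelay {
            ceiling = maxDelay
        }
        // Equal jitter: half the ceiling is fixed, the other half is random.
        wait := ceiling/2 + time.Duration(rand.Int63n(int64(ceiling/2)))
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-time.After(wait):
        }
    }
    return err // not reached with maxAttempts > 0; never report success by default
}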

Always pass a context.Context so callers can cancel in-flight retry loops. Unbounded goroutine spawning during an outage will exhaust memory; cap concurrent retries with a semaphore channel (make(chan struct{}, maxConcurrent)).

Variant C: Python (tenacity library)

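import requests
from tenacity import retry, retry_if_exception, stop_after_attempt, wait_random_exponential

API_BASE_URL = "https://api.example.com"  # placeholder host; requests needs an absolute URL

def is_retryable(exc):
    # raise_for_status() raises requests.HTTPError, which carries the status on exc.response
    status = getattr(getattr(exc, "response", None), "status_code", None)
    return status in {408, 429, 502, 503, 504}

@retry(
    retry=retry_if_exception(is_retryable),
    wait=wait_random_exponential(multiplier=0.2, max=10),
    stop=stop_after_attempt(5),
)
def call_payment_api(payload: dict) -> dict:
    url = f"{API_BASE_URL}/payments"
    response = requests.post(url, json=payload, headers={"Idempotency-Key": payload["key"]})
    response.raise_for_status()
    return response.json()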

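wait_random_exponential implements full jitter. For equal jitter, add a deterministic half to a random half: wait_exponential(multiplier=0.1, max=5) + wait_random_exponential(multiplier=0.1, max=5). A fixed floor such as wait_fixed(0.1) does not grow with the attempt number, so it does not give equal jitter.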

Variant D: Java (ThreadPoolExecutor with backoff)

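import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

// HttpException and isRetryable(int) are assumed to be defined by the application.
public <T> T retryWithBackoff(Callable<T> task, int maxAttempts, long baseMs, long maxMs)
        throws Exception {
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
        try {
            return task.call();
        } catch (HttpException e) {
            if (!isRetryable(e.getStatusCode()) || attempt == maxAttempts - 1) throw e;
            long cap = Math.min(baseMs * (1L << attempt), maxMs);
            long half = Math.max(1, cap / 2); // nextLong requires a positive bound
            long wait = half + ThreadLocalRandom.current().nextLong(half); // equal jitter
            TimeUnit.MILLISECONDS.sleep(wait);
        }
    }
    throw new IllegalStateException("unreachable");
}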

Tune ThreadPoolExecutor max size and queue capacity before deploying. Implement RejectedExecutionHandler (e.g. CallerRunsPolicy) so backpressure degrades gracefully instead of throwing RejectedExecutionException.

Circuit Breakers: Stopping Retry Amplification

Backoff limits individual client load. Circuit breakers limit aggregate load from an entire service fleet. When a downstream service is truly down, hundreds of clients each running 5-attempt backoff loops send up to five times the normal request volume at it, and the multiplier compounds when several layers of a call chain each retry. The circuit breaker short-circuits all of them.

State machine

Standard circuit breaker thresholds for a production API tier:

Open threshold: ≥ 50% error rate over a 10-second sliding window, or ≥ 10 consecutive failures
Open timeout: 30 seconds before attempting half-open transition
Half-open probe budget: 1–5% of traffic; 2 consecutive successes to close

Pair circuit breakers with distributed lock acquisition patterns when the protected resource uses a shared lease — the circuit needs to know whether a failure is in the lock layer or the underlying service.
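The closed, open, and half-open transitions can be sketched as a small state machine. This is a simplified illustration, not a production breaker: it implements only the consecutive-failure trigger (not the 50% sliding-window error rate), it admits all traffic while half-open rather than a 1–5% probe budget, and the class name, method names, and injectable `clock` are hypothetical.

```python
import time


class CircuitBreaker:
    """closed -> open after N consecutive failures; open -> half_open after a
    timeout; half_open -> closed after M consecutive successes, else back to open."""

    def __init__(self, failure_threshold=10, open_timeout_s=30.0,
                 close_after_successes=2, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_timeout_s = open_timeout_s
        self.close_after_successes = close_after_successes
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        """Return True if a call may proceed; moves open -> half_open on timeout."""
        if self.state == "open":
            if self.clock() - self.opened_at >= self.open_timeout_s:
                self.state = "half_open"
                self.successes = 0
                return True
            return False
        return True

    def record_success(self) -> None:
        if self.state == "half_open":
            self.successes += 1
            if self.successes >= self.close_after_successes:
                self.state = "closed"
                self.failures = 0
        elif self.state == "closed":
            self.failures = 0  # a success breaks the consecutive-failure streak

    def record_failure(self) -> None:
        if self.state == "open":
            return  # late results from in-flight calls do not extend the timeout
        if self.state == "half_open":
            self._trip()
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self._trip()

    def _trip(self) -> None:
        self.state = "open"
        self.opened_at = self.clock()
        self.failures = 0
```

A retry loop would call `allow_request()` before each attempt and skip the attempt entirely while the breaker is open, so backoff and the breaker work together rather than independently.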

Idempotency & Deduplication Coordination

Retries are safe only when re-execution produces the same result. This requires the server to recognise and suppress duplicate requests. The mechanism is an idempotency key — a client-generated token attached to every non-safe request and stored server-side for the duration of the retry window.

Distributed lock acquisition sequence

Client attaches Idempotency-Key: <uuid> to the request header.
Gateway or service layer executes: SET idempotency:<key> PROCESSING NX EX 90 in Redis.
If SET returns OK, process the request and overwrite the key with the serialised response: SET idempotency:<key> <response_json> XX EX 90.
If SET returns nil (key exists), poll until the value is no longer PROCESSING, then return the cached response — no re-execution.

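The TTL of 90 seconds must exceed the maximum retry window. With a 5-attempt policy and a 10 s delay cap, the four waits between attempts can total at most 4 × 10 s = 40 seconds. With base=200ms the backoff waits alone sum to at most 3 seconds (0.2 + 0.4 + 0.8 + 1.6), so the 40-second figure is a conservative bound that leaves room for per-request timeouts. A 90-second TTL gives that bound a margin of just over 2×. For full details see using Redis SETNX for distributed request deduplication.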
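The four-step sequence can be traced end to end with an in-memory stand-in for Redis. This is a sketch under stated assumptions: `InMemoryStore` and `handle_request` are hypothetical names, and the store mimics only the SET NX, SET XX, and expiry behaviour the sequence relies on. With the redis-py client the analogous call would be `client.set(key, value, nx=True, ex=90)`. Polling is left to the caller here: a `None` return means another attempt is still in flight.

```python
import time

PROCESSING = "PROCESSING"


class InMemoryStore:
    """Stand-in for Redis supporting only SET NX / SET XX with a TTL, and GET."""

    def __init__(self, clock=time.monotonic):
        self._data = {}  # key -> (value, expires_at)
        self._clock = clock

    def get(self, key):
        entry = self._data.get(key)
        if entry is None or entry[1] <= self._clock():
            self._data.pop(key, None)  # lazily drop expired entries
            return None
        return entry[0]

    def set(self, key, value, ttl_s, nx=False, xx=False) -> bool:
        exists = self.get(key) is not None
        if (nx and exists) or (xx and not exists):
            return False
        self._data[key] = (value, self._clock() + ttl_s)
        return True


def handle_request(store, idempotency_key, execute, ttl_s=90):
    """Run execute() at most once per key within the TTL; replay the cached response."""
    store_key = f"idempotency:{idempotency_key}"
    if store.set(store_key, PROCESSING, ttl_s, nx=True):
        response = execute()  # if this raises, the marker stays until the TTL expires
        store.set(store_key, response, ttl_s, xx=True)
        return response
    cached = store.get(store_key)
    if cached == PROCESSING:
        return None  # first attempt still running: caller should poll again
    return cached
```

A retried request that carries the same key receives the stored response and `execute` is not invoked a second time.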

Deduplication cache TTL alignment

Max retry attempts	Max delay cap	Worst-case span	Recommended TTL
3	4 s	~10 s	30 s
5	10 s	~40 s	90 s
7	30 s	~180 s	360 s

Premature TTL eviction causes re-execution of already-processed requests. Excessive TTLs waste Redis memory and increase cache stampede risk during node restarts. Budget memory as: peak_rps × (key_size_bytes + response_size_bytes) × ttl_seconds.
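The worst-case spans in the table appear to follow (attempts − 1) × max delay, meaning every wait is assumed to hit the cap. That reading is an assumption, and the helper names below are hypothetical. The sketch computes that conservative bound, the tighter sum of actual backoff ceilings, and a TTL with a 2× margin.

```python
def conservative_span(max_attempts, max_delay_s):
    """Bound that assumes each of the (max_attempts - 1) waits hits max_delay."""
    return (max_attempts - 1) * max_delay_s


def backoff_wait_sum(max_attempts, base_s, max_delay_s):
    """Largest possible total sleep: every wait lands exactly on its own ceiling."""
    return sum(
        min(base_s * 2 ** attempt, max_delay_s)
        for attempt in range(max_attempts - 1)
    )


def recommended_ttl(max_attempts, max_delay_s, margin=2.0):
    """TTL that covers the conservative span with the given safety margin."""
    return margin * conservative_span(max_attempts, max_delay_s)
```

For the 5-attempt, 10 s policy this yields a 40 s span and an 80 s TTL, which the table rounds up to 90 s.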

Database fallback when cache is unavailable

When Redis is unavailable, fall back to database-level deduplication:

-- PostgreSQL: atomic insert; returns the new row, or no rows if the key already exists
INSERT INTO idempotent_requests (key, response, created_at)
VALUES ($1, $2, now())
ON CONFLICT (key) DO NOTHING
RETURNING id, response;

If the INSERT returns no rows, the key already exists — query the existing row and return its response column. Wrap the check and insert in a single transaction at READ COMMITTED isolation. Use SERIALIZABLE only for financial ledgers where phantom reads could cause double-settlement, as documented in wrapping database transactions for safe retries.
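The insert-then-read flow can be exercised locally with SQLite standing in for PostgreSQL. This is a sketch only: `INSERT OR IGNORE` plus `cursor.rowcount` replaces `ON CONFLICT ... DO NOTHING RETURNING` so the example runs on older SQLite builds, the table is reduced to two columns, and the function names are hypothetical.

```python
import sqlite3


def create_table(conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS idempotent_requests ("
        "key TEXT PRIMARY KEY, response TEXT NOT NULL)"
    )


def dedupe_or_store(conn, key, response):
    """Return (stored_response, was_duplicate) for the given idempotency key."""
    with conn:  # one transaction: commit on success, roll back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO idempotent_requests (key, response) VALUES (?, ?)",
            (key, response),
        )
        if cur.rowcount == 1:
            return response, False  # first time this key was seen
        # Insert was ignored: the key already exists, so return the stored response.
        row = conn.execute(
            "SELECT response FROM idempotent_requests WHERE key = ?", (key,)
        ).fetchone()
        return row[0], True
```

The primary key constraint is what makes the check atomic: two concurrent inserts for the same key cannot both succeed.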

Edge Cases & Failure Scenarios

Failure Scenario	Remediation Steps	Observability Hooks
Response lost in transit — server processed but client never received `200`	Client retries with same idempotency key; server detects existing key and returns cached response without re-executing	`idempotency.cache_hit_total` counter; trace span attribute `retry.deduplicated=true`
Redis eviction during active retry window	Extend TTL to 2× worst-case span; keep idempotency keys on a dedicated Redis instance with `maxmemory-policy noeviction` (the policy is instance-wide, not per keyspace); fall back to DB upsert	`redis.evicted_keys` gauge; alert when `idempotency.cache_miss_on_retry_total > 0`
Thundering herd: 500 clients retry simultaneously after a 30-second outage	Per-client jitter prevents exact synchronisation; add circuit breaker with 50% error threshold to hold clients during recovery; use adaptive concurrency limits (token-bucket per service)	`http.retry.concurrent_attempts` histogram; `circuit_breaker.state` gauge (`0=closed`, `1=open`, `2=half-open`)
`Retry-After` header absent on `429`	Parse `X-RateLimit-Reset` epoch if present; default to `base × 2^attempt` with cap; never ignore `429` and retry immediately	`http.rate_limit.retries_without_header_total` counter; log `retry_after_source: header
Idempotency key collision across tenants	Prefix keys with tenant ID: `tenantId:uuid`; enforce namespace separation at the gateway layer	`idempotency.key_collision_total` counter segmented by `tenant_id`
Clock skew invalidates TTL before retry window closes	Synchronise service clocks via NTP (max drift < 500 ms); add 10-second safety margin to all TTL calculations	`ntp.offset_ms` gauge; alert when offset exceeds 200 ms
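The fallback order in the `Retry-After` row (header first, then `X-RateLimit-Reset`, then capped exponential backoff) might look like the sketch below. Several things here are assumptions: the function name, the plain-dict `headers` with exact-case keys, and the source labels `"x-ratelimit-reset"` and `"backoff"` returned alongside the delay. `Retry-After` is accepted both as delta-seconds and as an HTTP-date.

```python
import time
from email.utils import parsedate_to_datetime


def retry_delay_seconds(headers, attempt, base_s=0.2, max_delay_s=10.0, now=None):
    """Return (delay_seconds, source) for a rate-limited response."""
    now = time.time() if now is None else now

    retry_after = headers.get("Retry-After")
    if retry_after is not None:
        value = retry_after.strip()
        if value.isdigit():  # delta-seconds form, e.g. "120"
            return float(value), "header"
        try:  # HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
            return max(0.0, parsedate_to_datetime(value).timestamp() - now), "header"
        except (TypeError, ValueError):
            pass  # unparseable header: fall through to the next source

    reset = headers.get("X-RateLimit-Reset")
    if reset is not None and reset.strip().isdigit():  # epoch seconds
        return max(0.0, float(reset) - now), "x-ratelimit-reset"

    # No usable header: capped exponential backoff, never an immediate retry.
    return min(base_s * 2 ** attempt, max_delay_s), "backoff"
```

Returning the source alongside the delay makes it straightforward to count retries that had to proceed without a header.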

Operational Concerns

SRE alert thresholds

Instrument these metrics on every service running a retry policy:

http.retry.rate — alert when retry attempts exceed 20% of baseline request rate over a 5-minute window
http.retry.deduplication_hit_rate — sustained hit rate above 5% indicates clients are retrying more than expected; investigate upstream instability
http.backoff.duration_ms (p99 histogram) — alert when p99 exceeds 8,000 ms (approaching the 10-second cap)
circuit_breaker.open_duration_seconds — alert when any circuit stays open for more than 60 seconds
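The `http.retry.rate` threshold can be evaluated with a simple sliding window. In this sketch, "baseline request rate" is taken to mean first attempts within the window, which is an interpretation rather than something the alert definition states; the class and method names are hypothetical, and timestamps are passed in explicitly so the logic is easy to test.

```python
from collections import deque


class RetryRateMonitor:
    """Tracks retries as a fraction of first attempts over a sliding window."""

    def __init__(self, window_s=300.0, threshold=0.20):
        self.window_s = window_s
        self.threshold = threshold
        self._events = deque()  # (timestamp, is_retry), oldest first

    def record(self, timestamp, is_retry):
        self._events.append((timestamp, is_retry))
        self._evict(timestamp)

    def _evict(self, now):
        # Drop events that have aged out of the window.
        while self._events and self._events[0][0] <= now - self.window_s:
            self._events.popleft()

    def retry_ratio(self, now):
        self._evict(now)
        retries = sum(1 for _, is_retry in self._events if is_retry)
        first_attempts = len(self._events) - retries
        if first_attempts == 0:
            return float("inf") if retries else 0.0
        return retries / first_attempts

    def should_alert(self, now):
        return self.retry_ratio(now) > self.threshold
```

In a real service the same ratio would more likely be derived from existing request counters in the metrics backend than from per-request events held in process memory.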

Memory and storage budgeting

At 1,000 requests/second with a 128-byte average idempotency key + 512-byte cached response, a 90-second TTL requires:

1,000 rps × 640 bytes × 90 s = ~57.6 MB

With 20% Redis overhead for hash structures: budget 70 MB per service instance. Scale linearly with RPS. For TTL management and eviction strategy details see the dedicated storage patterns page.
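The arithmetic above generalises to a one-line helper for other traffic profiles. The function name is hypothetical and the 20% overhead figure is the one quoted in the text, not a measured value.

```python
def dedup_cache_bytes(peak_rps, key_bytes, response_bytes, ttl_s, overhead=0.20):
    """Return (raw_bytes, bytes_with_overhead) for the deduplication cache."""
    raw = peak_rps * (key_bytes + response_bytes) * ttl_s
    return raw, raw * (1 + overhead)


raw, budget = dedup_cache_bytes(1_000, 128, 512, 90)
print(f"raw: {raw / 1e6:.1f} MB, with overhead: {budget / 1e6:.1f} MB")
```

With the figures from the text this gives 57.6 MB raw and roughly 69 MB with overhead, consistent with the 70 MB budget.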

Index strategy for database-backed deduplication

CREATE UNIQUE INDEX CONCURRENTLY idx_idempotent_requests_key
    ON idempotent_requests (key);

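-- Index on created_at so TTL sweeps can find expired rows without a full-table scan.
-- A partial index with WHERE created_at > now() - interval '90 seconds' is not an option:
-- PostgreSQL only allows IMMUTABLE functions in an index predicate, and now() is STABLE.
CREATE INDEX CONCURRENTLY idx_idempotent_requests_created_at
    ON idempotent_requests (created_at);

The unique index serves the deduplication lookup itself. The created_at index keeps TTL-based batch deletes cheap, which in turn keeps the active working set small. Run VACUUM ANALYZE idempotent_requests after TTL-based batch deletes so dead tuples can be reused and planner statistics stay current.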

Production Readiness Checklist

Classify all HTTP status codes into retry / fail-fast / treat-carefully buckets and document the policy in the service runbook.
Implement equal jitter or full jitter — never bare exponential backoff without randomisation.
Cap maximum retry attempts at 5 for synchronous paths; move longer retry sequences to an async queue or outbox pattern.
Set idempotency key TTL to at least 2× the worst-case retry span (minimum 90 seconds for a 5-attempt policy).
Enforce Idempotency-Key header on every POST, PATCH, and DELETE at the API gateway; reject requests without it on financial endpoints.
Wire circuit breakers with a 50% error-rate threshold and 30-second open timeout on all external service calls.
Instrument http.retry.rate, idempotency.cache_hit_total, and circuit_breaker.state; wire alerts before go-live.
Test retry behaviour under chaos conditions: simulate Redis unavailability, 503 storms, and clock-skew injection.
Validate out-of-order delivery: confirm business logic remains correct when a delayed retry arrives after a concurrent request has already committed.

Implementing Exponential Backoff Without Overlapping Retries — step-by-step runbook for synchronising retry windows across distributed nodes
Idempotency Key Generation Strategies — how to produce collision-resistant keys that survive network partitions
HTTP Method Semantics & Safety — which methods are safe to retry unconditionally and which require idempotency guards
Redis Cache-Based Deduplication — Redis data structures and eviction policies for high-throughput deduplication
Mitigating Thundering Herd During Retry Storms — adaptive concurrency limits and request shedding under sustained outage
Idempotency Fundamentals & API Guarantees — parent section covering the full idempotency guarantee model