Error Handling: Quick Reference
When the board asks “what happens when this fails?” your answer must include error classification, retry strategy, circuit breaker behavior, dead letter queue design, and monitoring. This page gives you the complete error handling playbook. For full details, see Error Handling Patterns Deep Dive.
Error Classification — Do This First
Each error category demands a different action. Retrying a 400 Bad Request forever is an anti-pattern. Failing immediately on a 503 wastes a recoverable situation.
| Category | HTTP Codes | Examples | Correct Action | Wrong Action |
|---|---|---|---|---|
| Transient | 408, 429, 500, 502, 503, 504 | Timeout, rate limit, service unavailable | Retry with exponential backoff | Fail immediately |
| Persistent | 400, 401, 403, 404, 422 | Bad request, unauthorized, not found | Route to DLQ, alert team | Retry (will never succeed) |
| Systemic | Repeated 503, connection refused | System fully down, cert expired | Circuit breaker, fallback mode | Keep retrying (wastes resources) |
| Data quality | 400, 422 (validation) | Missing fields, invalid format, dupes | Reject, notify, data cleansing | Silently drop or force through |
| Capacity | 429 (Salesforce daily limit) | API limit exhausted, governor limits | Queue and defer, throttle | Fail the entire batch |
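The table above can be sketched as a first-pass classifier keyed on status code. This is an illustrative sketch, not a Salesforce API; note the overlaps called out in the table (400/422 can be data quality, 429 can be capacity), so a real implementation refines the result with response context.

```python
# Status-code sets from the classification table above
TRANSIENT = {408, 429, 500, 502, 503, 504}
PERSISTENT = {400, 401, 403, 404, 422}

def classify(status_code, consecutive_failures=0):
    """First-pass classification by HTTP status.

    Codes overlap across categories: 400/422 may be data quality
    (inspect the validation error body) and 429 may be capacity
    (check for a daily-limit indicator), so refine with context.
    """
    if consecutive_failures >= 5:   # repeated failures look systemic
        return "systemic"
    if status_code in TRANSIENT:
        return "transient"
    if status_code in PERSISTENT:
        return "persistent"
    return "unknown"
```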
The Complete Error Handling Stack
This is the layered strategy you present at the board. Every integration touchpoint should use this stack.
```mermaid
flowchart TD
    A[Integration Call] --> B{Success?}
    B -->|Yes| C[Log Success]
    B -->|No| D[Classify Error]
    D --> E{Transient?<br/>5xx / 429 / timeout}
    E -->|Yes| F["Retry with<br/>Exponential Backoff"]
    F --> G{Retries<br/>exhausted?}
    G -->|No| A
    G -->|Yes| H[Dead Letter Queue]
    E -->|No| I{Systemic?<br/>Repeated failures}
    I -->|Yes| J[Circuit Breaker<br/>OPENS]
    J --> K[Fail Fast for<br/>subsequent calls]
    K --> H
    I -->|No| L[Persistent / Data Quality<br/>will never succeed with retry]
    L --> H
    H --> M[Alert Ops Team]
    H --> N[Dashboard Update]
```
Walk the board through a failure story
“When the ERP returns a 503, we retry 3 times with exponential backoff (1s, 2s, 4s). After 3 failures, the circuit breaker opens — subsequent calls fail fast for 60 seconds. The failed message routes to the DLQ (Integration_Error__c), PagerDuty alerts the integration team, and a Jira ticket is auto-created. Once ERP recovers, the circuit breaker half-opens, tests one call, and if it succeeds, resumes normal flow. The team replays DLQ messages through the monitoring dashboard.”
Pattern 1: Retry with Exponential Backoff
Parameters
| Parameter | Value | Rationale |
|---|---|---|
| Max retries | 3-5 | Enough for transient; not so many it wastes time on persistent failures |
| Base delay | 1 second | Starting wait before first retry |
| Multiplier | 2x (exponential) | 1s → 2s → 4s → 8s → 16s |
| Max delay cap | 60 seconds | Prevent absurdly long waits |
| Jitter | Random 0-1s added | Prevents thundering herd |
Retry Timing Table
| Attempt | Delay | Cumulative |
|---|---|---|
| 1 | 1s (+jitter) | ~1-2s |
| 2 | 2s (+jitter) | ~3-5s |
| 3 | 4s (+jitter) | ~7-10s |
| 4 | 8s (+jitter) | ~15-19s |
| 5 | 16s (+jitter) | ~31-36s |
Jitter is not optional
Without jitter, 100 failed clients all retry at the same intervals, creating repeated traffic spikes against an already-stressed system. This is the thundering herd problem. Always add random jitter. Salesforce concurrent API limits (25 org-wide) make this even more critical.
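The parameters and timing table above can be sketched as follows. This is a minimal illustration, not a library API; `TransientError` and `retry_with_backoff` are hypothetical names, and a real implementation would only catch errors classified as transient.

```python
import random
import time

class TransientError(Exception):
    """5xx / 429 / timeout -- safe to retry."""

def retry_with_backoff(call, max_retries=3, base_delay=1.0,
                       multiplier=2.0, max_delay=60.0, jitter=1.0):
    """Retry `call` on transient errors with exponential backoff + jitter."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except TransientError:
            if attempt == max_retries:
                raise  # retries exhausted -> caller routes to DLQ
            # 1s, 2s, 4s, ... capped at max_delay, plus random jitter
            delay = min(base_delay * multiplier ** attempt, max_delay)
            time.sleep(delay + random.uniform(0, jitter))
```

Raising on exhaustion (rather than swallowing) lets the caller route the message to the DLQ with full error context.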
Pattern 2: Circuit Breaker
Prevents your system from hammering a dead external service. Three states:
```mermaid
stateDiagram-v2
    [*] --> Closed
    Closed --> Open : Failure threshold hit (5 consecutive)
    Open --> HalfOpen : Timeout expires (60s)
    HalfOpen --> Closed : Test call succeeds
    HalfOpen --> Open : Test call fails
```
| State | Behavior | Salesforce Implementation |
|---|---|---|
| Closed | Normal — calls pass through, track failures | Standard callout behavior |
| Open | All calls fail fast — no callout attempted | Check Platform Cache / Custom Metadata before calling |
| Half-Open | Allow one test call to check recovery | Scheduled job or manual reset tests one call |
Configuration
| Parameter | Recommended | Purpose |
|---|---|---|
| Failure threshold | 5 consecutive | Opens the circuit |
| Open timeout | 30-60 seconds | Time before testing recovery |
| Success threshold | 2-3 in half-open | Confirms recovery before closing |
Salesforce has no native circuit breaker
You must build it. Options: (1) Platform Cache — fastest reads, non-durable (resets on cache eviction); (2) Custom Metadata Type — durable, requires metadata deploy to change state; (3) Custom Settings — middle ground, editable via API. Platform Cache is the most practical for real-time checks.
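The three-state machine and the configuration table above can be sketched as an in-memory class. This is an illustrative sketch only; a Salesforce build would persist the state in Platform Cache rather than instance fields, and all names here are hypothetical.

```python
import time

class CircuitBreaker:
    """Closed -> Open on N consecutive failures; Open -> Half-Open after timeout."""

    def __init__(self, failure_threshold=5, open_timeout=60.0, success_threshold=2):
        self.failure_threshold = failure_threshold
        self.open_timeout = open_timeout
        self.success_threshold = success_threshold
        self.state = "CLOSED"
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def call(self, fn):
        if self.state == "OPEN":
            if time.time() - self.opened_at >= self.open_timeout:
                self.state = "HALF_OPEN"          # allow one test call
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        self.successes = 0
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"                   # open (or re-open) the circuit
            self.opened_at = time.time()
            self.failures = 0

    def _on_success(self):
        if self.state == "HALF_OPEN":
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = "CLOSED"             # recovery confirmed
                self.successes = 0
        else:
            self.failures = 0                     # reset the consecutive counter
```

The fail-fast `RuntimeError` is what protects the struggling downstream system: no callout is attempted while the circuit is open.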
Pattern 3: Dead Letter Queue (DLQ)
Messages that exhaust all retries are parked in a DLQ for inspection, diagnosis, and eventual reprocessing.
DLQ Record Design
| Field | Purpose |
|---|---|
| Source_System__c | Where message originated |
| Target_System__c | Where it was going |
| Payload__c | Original message (Long Text Area) |
| Error_Message__c | What went wrong |
| Error_Code__c | HTTP status / exception type |
| Retry_Count__c | How many attempts were made |
| First_Failure__c | When it first failed |
| Last_Failure__c | When retries exhausted |
| Status__c | New / Under Review / Resubmitted / Archived |
| Correlation_ID__c | Links to original transaction |
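The field layout above can be sketched as a record builder. This is an illustrative shape only (persistence into `Integration_Error__c` or any other store is out of scope), and the function name is hypothetical.

```python
import json
from datetime import datetime, timezone

def to_dlq_record(payload, error, retry_count, correlation_id,
                  source="Salesforce", target="ERP", first_failure=None):
    """Shape a failed message into the DLQ field layout above."""
    now = datetime.now(timezone.utc).isoformat()
    return {
        "Source_System__c": source,
        "Target_System__c": target,
        "Payload__c": json.dumps(payload),          # original message, replayable
        "Error_Message__c": str(error),
        "Error_Code__c": getattr(error, "code", "UNKNOWN"),
        "Retry_Count__c": retry_count,
        "First_Failure__c": first_failure or now,
        "Last_Failure__c": now,
        "Status__c": "New",                         # New -> Under Review -> Resubmitted
        "Correlation_ID__c": correlation_id,
    }
```

Storing the raw payload is what makes replay possible; without it, the DLQ is only an error log.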
Salesforce DLQ Implementation Options
| Approach | Best For | Retention |
|---|---|---|
| Custom Object (Integration_Error__c) | Audit trail, reporting, dashboards | Permanent |
| Platform Events (Error_Event__e) | Real-time alerting to monitoring tools | 24-72h |
| Big Object | High-volume error logging | Permanent, archive-oriented |
| MuleSoft Anypoint MQ DLQ | Middleware-managed integrations | Configurable |
| External (Splunk, Datadog) | Centralized ops monitoring | Per tool |
Pattern 4: Idempotency
At-least-once delivery means duplicates will happen. Every receiver must be able to process the same message twice with exactly the same effect as processing it once.
Idempotency Strategy Quick Pick
| Strategy | How It Works | When to Use |
|---|---|---|
| Upsert + External ID | SF upsert is naturally idempotent | Data sync (default choice) |
| Idempotency key | Sender includes unique key; receiver checks before processing | Custom business logic |
| Natural key dedup | Check by business key (Order Number) before insert | When unique business key exists |
| Payload hash | Hash message content, reject duplicates | No client-side key available |
| Timestamp comparison | Only process if newer than last processed | Simple, but clock skew risk |
```mermaid
flowchart TD
    A[Incoming Message] --> B{Has Idempotency<br/>Key?}
    B -->|No| C[Generate from payload<br/>hash or natural key]
    B -->|Yes| D{Key already<br/>processed?}
    C --> D
    D -->|Yes| E[Skip - return<br/>cached result]
    D -->|No| F[Process message]
    F --> G[Store key + result]
    G --> H[Return result]
```
Upsert is your best friend
For data synchronization, always use upsert with an External ID instead of separate insert/update logic. It is idempotent by design — sending the same record twice produces the same result. This single practice prevents the majority of integration duplicate bugs.
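The key-check flow above can be sketched as follows. This is an illustrative sketch with hypothetical names; the in-memory dict stands in for a durable store, and the check-then-store would need to be atomic under concurrency.

```python
import hashlib
import json

_processed = {}  # idempotency key -> cached result (durable store in practice)

def handle(message, process):
    """Process `message` at most once; replays return the cached result."""
    # Use the sender's key, or derive one from a payload hash (table above)
    key = message.get("idempotency_key") or hashlib.sha256(
        json.dumps(message["body"], sort_keys=True).encode()).hexdigest()
    if key in _processed:
        return _processed[key]      # duplicate: skip, return cached result
    result = process(message["body"])
    _processed[key] = result        # store key + result together
    return result
```

Returning the cached result (rather than an error) keeps replays transparent to the sender, which matters during DLQ reprocessing.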
Pattern 5: Monitoring and Alerting
Error handling without monitoring means failures are discovered by end users days later. Build alerting first, not as an afterthought.
What to Monitor — Alert Thresholds
| Metric | Warning | Critical |
|---|---|---|
| Integration failure rate | > 5% of transactions | > 20% of transactions |
| DLQ depth | > 100 messages | Growing for 30+ min |
| API call consumption | > 80% of daily limit | > 95% of daily limit |
| Response time (real-time) | > 5 seconds | > 10 seconds |
| Circuit breaker state | — | Any circuit open |
| Event subscriber lag | > 1 hour behind | > 12 hours behind |
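Level-based thresholds from the table above can be sketched as a simple mapping (trend-based criteria like "DLQ growing for 30+ min" additionally need rate tracking, which is out of scope here). Names and threshold constants are illustrative.

```python
def alert_level(value, warning, critical):
    """Map a metric reading onto the Warning/Critical thresholds above."""
    if value >= critical:
        return "CRITICAL"
    if value >= warning:
        return "WARNING"
    return "OK"

# Example thresholds from the table (failure rate as a fraction,
# API consumption as a fraction of the daily limit)
FAILURE_RATE = (0.05, 0.20)
API_CONSUMPTION = (0.80, 0.95)
```

Usage: `alert_level(0.06, *FAILURE_RATE)` flags a warning; the critical path should page (PagerDuty) rather than just dashboard.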
Monitoring Stack
| Layer | Salesforce-Native | External |
|---|---|---|
| Metrics collection | Event Monitoring (Shield add-on) | Splunk, Datadog, ELK |
| Dashboards | Custom dashboard on Integration_Error__c | Grafana, Datadog |
| Alerting | Flow email alerts, Platform Events | PagerDuty, OpsGenie, Slack |
| Ticketing | Auto-create Case from Flow | Jira, ServiceNow |
Reverse-Engineered Use Cases
Scenario 1: ERP Goes Down During Order Processing
Situation: Salesforce sends orders to SAP via Fire-and-Forget (Platform Events + MuleSoft). SAP goes down for 2 hours during peak.
What you’d present:
- First 5 failures: MuleSoft retries with exponential backoff (1s, 2s, 4s, 8s, 16s + jitter)
- After 5 failures: Circuit breaker opens. Subsequent orders fail fast (no SAP call attempted)
- Failed orders: Route to Anypoint MQ dead letter queue with full payload and error context
- Alert: PagerDuty pages integration team; auto-created Jira ticket
- Recovery: After 60s, circuit breaker half-opens, tests one order. SAP still down — circuit stays open
- SAP recovers: Half-open test succeeds. Circuit closes. Normal flow resumes
- DLQ replay: Integration team bulk-replays 2 hours of queued orders from DLQ
- Idempotency: SAP uses order number as idempotency key — replayed orders that partially processed are safe
Scenario 2: Bulk API Partial Failure
Situation: Nightly sync of 500,000 Account records from the data warehouse via Bulk API 2.0. The job completes with 498,000 successes and 2,000 failures.
What you’d present:
- Successful records: Committed normally (no rollback of successes)
- Failed records: Download error results file (GET /jobs/ingest/{id}/failedResults)
- Classify failures: 1,800 validation rule failures (data quality), 200 duplicate External ID conflicts
- Data quality errors: Route to data steward dashboard for cleansing, fix source data, resubmit only failed records
- Duplicate errors: Investigate — likely stale dedup window. Switch to upsert if using insert
- Monitoring: Dashboard shows 99.6% success rate (within SLA), alert on the 2,000 failures for review
Scenario 3: Event Subscriber Falls Behind
Situation: External analytics system subscribes to CDC on Opportunity via Pub/Sub API. The analytics system goes down for maintenance over a 4-day weekend. CDC retention is 3 days.
What you’d present:
- Day 1-3: Events accumulate on bus. When subscriber reconnects, it replays from last checkpoint
- Day 3+: Events older than 3 days are lost — beyond retention window
- Recovery: Subscriber detects gap event from Salesforce, triggers batch reconciliation job
- Reconciliation: Run Bulk API 2.0 query for all Opportunities modified in the last 5 days, full sync
- Prevention: Monitor subscriber lag; alert when lag > 12 hours (gives 2.5 days to fix before data loss)
- Design improvement: Hybrid architecture — CDC for near-RT, nightly batch sync as safety net
Anti-Pattern Quick Reference
| Anti-Pattern | Why It Fails | Do This Instead |
|---|---|---|
| Retry forever | Wastes resources, masks permanent failures | Max retries + DLQ |
| Retry without backoff | Hammers struggling system | Exponential backoff + jitter |
| Retry all errors equally | 400 will never succeed on retry | Classify first, only retry transient |
| Swallow errors silently | Nobody knows integration is broken | Log + alert + DLQ |
| No idempotency | Duplicates on retry | External ID upsert or idempotency keys |
| Manual-only recovery | Does not scale | Automated retry + manual for edge cases |
| Monitor reactively | Users discover failures days later | Proactive alerting with thresholds |
The cardinal sin of integration
Building an integration with no error handling and no monitoring. When it fails — and it will — nobody knows until a business user reports missing data days or weeks later. By then, the data inconsistency may be unrecoverable. Build error handling and monitoring first, not as an afterthought. The board will grill you on this.