
Error Handling: Quick Reference

When the board asks “what happens when this fails?” your answer must include error classification, retry strategy, circuit breaker behavior, dead letter queue design, and monitoring. This page gives you the complete error handling playbook. For full details, see Error Handling Patterns Deep Dive.

Error Classification — Do This First

Every error category demands a different action. Retrying a 400 Bad Request forever is an anti-pattern; failing immediately on a 503 throws away a request that would likely have succeeded on retry.

| Category | HTTP Codes | Examples | Correct Action | Wrong Action |
|---|---|---|---|---|
| Transient | 408, 429, 500, 502, 503, 504 | Timeout, rate limit, service unavailable | Retry with exponential backoff | Fail immediately |
| Persistent | 400, 401, 403, 404, 422 | Bad request, unauthorized, not found | Route to DLQ, alert team | Retry (will never succeed) |
| Systemic | Repeated 503, connection refused | System fully down, cert expired | Circuit breaker, fallback mode | Keep retrying (wastes resources) |
| Data quality | 400, 422 (validation) | Missing fields, invalid format, dupes | Reject, notify, data cleansing | Silently drop or force through |
| Capacity | 429 (Salesforce daily limit) | API limit exhausted, governor limits | Queue and defer, throttle | Fail the entire batch |
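
The classification table can be sketched as a lookup function. This is an illustrative Python sketch (the document's platform is Salesforce/Apex, so treat this as pseudocode-grade); the function name, parameters, and action strings are assumptions, not a real API. Note that 429 appears in both the transient and capacity rows, so the caller must signal whether the daily limit is exhausted:

```python
# Error classes per the table above. A 429 is treated as a capacity problem
# only when the caller knows the daily limit is exhausted; otherwise it is
# an ordinary transient rate-limit response.
TRANSIENT = {408, 429, 500, 502, 503, 504}
PERSISTENT = {400, 401, 403, 404, 422}

def classify_error(status_code, consecutive_failures=0,
                   daily_limit_exhausted=False, validation_error=False):
    """Return (category, action) for an HTTP error response (illustrative)."""
    if consecutive_failures >= 5:
        return ("systemic", "open_circuit_breaker")
    if status_code == 429 and daily_limit_exhausted:
        return ("capacity", "queue_and_defer")
    if status_code in TRANSIENT:
        return ("transient", "retry_with_backoff")
    if validation_error and status_code in {400, 422}:
        return ("data_quality", "reject_and_notify")
    if status_code in PERSISTENT:
        return ("persistent", "route_to_dlq")
    return ("unknown", "route_to_dlq")
```

The key design point: classification runs before any retry logic, so persistent errors never enter the retry loop.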

The Complete Error Handling Stack

This is the layered strategy you present at the board. Every integration touchpoint should use this stack.

```mermaid
flowchart TD
    A[Integration Call] --> B{Success?}
    B -->|Yes| C[Log Success]
    B -->|No| D[Classify Error]
    D --> E{Transient?<br/>5xx / 429 / timeout}
    E -->|Yes| F["Retry with<br/>Exponential Backoff"]
    F --> G{Retries<br/>exhausted?}
    G -->|No| A
    G -->|Yes| H[Dead Letter Queue]
    E -->|No| I{Systemic?<br/>Repeated failures}
    I -->|Yes| J[Circuit Breaker<br/>OPENS]
    J --> K[Fail Fast for<br/>subsequent calls]
    K --> H
    I -->|No| L[Persistent / Data Quality<br/>will never succeed with retry]
    L --> H
    H --> M[Alert Ops Team]
    H --> N[Dashboard Update]
```

Walk the board through a failure story

“When the ERP returns a 503, we retry 3 times with exponential backoff (1s, 2s, 4s). After 3 failures, the circuit breaker opens — subsequent calls fail fast for 60 seconds. The failed message routes to the DLQ (Integration_Error__c), PagerDuty alerts the integration team, and a Jira ticket is auto-created. Once ERP recovers, the circuit breaker half-opens, tests one call, and if it succeeds, resumes normal flow. The team replays DLQ messages through the monitoring dashboard.”

Pattern 1: Retry with Exponential Backoff

Parameters

| Parameter | Value | Rationale |
|---|---|---|
| Max retries | 3-5 | Enough for transient; not so many it wastes time on persistent failures |
| Base delay | 1 second | Starting wait before first retry |
| Multiplier | 2x (exponential) | 1s → 2s → 4s → 8s → 16s |
| Max delay cap | 60 seconds | Prevent absurdly long waits |
| Jitter | Random 0-1s added | Prevents thundering herd |

Retry Timing Table

| Attempt | Delay | Cumulative |
|---|---|---|
| 1 | 1s (+jitter) | ~1-2s |
| 2 | 2s (+jitter) | ~3-5s |
| 3 | 4s (+jitter) | ~7-10s |
| 4 | 8s (+jitter) | ~15-19s |
| 5 | 16s (+jitter) | ~31-36s |

Jitter is not optional

Without jitter, 100 failed clients all retry at the same intervals, creating repeated traffic spikes against an already-stressed system. This is the thundering herd problem. Always add random jitter. Salesforce's org-wide limit of 25 concurrent long-running requests makes this even more critical.
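
The backoff parameters above reduce to a one-line formula. This Python sketch is illustrative (function name and signature are assumptions); the defaults mirror the parameter table:

```python
import random

def backoff_delay(attempt, base=1.0, multiplier=2.0, cap=60.0, jitter=1.0):
    """Seconds to wait before retry `attempt` (1-based).

    Exponential growth capped at `cap`, plus a uniform random addition
    of up to `jitter` seconds to break up the thundering herd.
    """
    delay = min(cap, base * multiplier ** (attempt - 1))
    return delay + random.uniform(0, jitter)
```

For attempts 1 through 5 with the defaults, the base delays are 1s, 2s, 4s, 8s, 16s, each with up to 1s of jitter added, matching the timing table above.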

Pattern 2: Circuit Breaker

Prevents your system from hammering a dead external service. Three states:

```mermaid
stateDiagram-v2
    [*] --> Closed
    Closed --> Open : Failure threshold hit (5 consecutive)
    Open --> HalfOpen : Timeout expires (60s)
    HalfOpen --> Closed : Test call succeeds
    HalfOpen --> Open : Test call fails
```

| State | Behavior | Salesforce Implementation |
|---|---|---|
| Closed | Normal — calls pass through, track failures | Standard callout behavior |
| Open | All calls fail fast — no callout attempted | Check Platform Cache / Custom Metadata before calling |
| Half-Open | Allow one test call to check recovery | Scheduled job or manual reset tests one call |

Configuration

| Parameter | Recommended | Purpose |
|---|---|---|
| Failure threshold | 5 consecutive | Opens the circuit |
| Open timeout | 30-60 seconds | Time before testing recovery |
| Success threshold | 2-3 in half-open | Confirms recovery before closing |

Salesforce has no native circuit breaker

You must build it. Options: (1) Platform Cache — fastest reads, non-durable (resets on cache eviction); (2) Custom Metadata Type — durable, requires metadata deploy to change state; (3) Custom Settings — middle ground, editable via API. Platform Cache is the most practical for real-time checks.
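
The state machine above fits in a small class. This is a minimal in-memory Python sketch for illustration only — a real Salesforce implementation would persist state in Platform Cache as described, and the class/method names here are assumptions. The injectable `clock` makes the timeout testable:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker sketch: closed -> open -> half-open -> closed."""

    def __init__(self, failure_threshold=5, open_timeout=60.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_timeout = open_timeout
        self.clock = clock
        self.failures = 0
        self.state = "closed"
        self.opened_at = None

    def allow_request(self):
        """Gate every callout: fail fast while open, probe when timed out."""
        if self.state == "open":
            if self.clock() - self.opened_at >= self.open_timeout:
                self.state = "half_open"  # allow one test call through
                return True
            return False                  # fail fast, no callout attempted
        return True

    def record_success(self):
        self.failures = 0
        self.state = "closed"

    def record_failure(self):
        self.failures += 1
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
```

Callers check `allow_request()` before every callout and report the outcome afterward; a single failed probe in half-open re-opens the circuit for another full timeout window.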

Pattern 3: Dead Letter Queue (DLQ)

Messages that exhaust all retries are parked in a DLQ for inspection, diagnosis, and eventual reprocessing.

DLQ Record Design

| Field | Purpose |
|---|---|
| Source_System__c | Where message originated |
| Target_System__c | Where it was going |
| Payload__c | Original message (Long Text Area) |
| Error_Message__c | What went wrong |
| Error_Code__c | HTTP status / exception type |
| Retry_Count__c | How many attempts were made |
| First_Failure__c | When it first failed |
| Last_Failure__c | When retries exhausted |
| Status__c | New / Under Review / Resubmitted / Archived |
| Correlation_ID__c | Links to original transaction |

Salesforce DLQ Implementation Options

| Approach | Best For | Retention |
|---|---|---|
| Custom Object (Integration_Error__c) | Audit trail, reporting, dashboards | Permanent |
| Platform Events (Error_Event__e) | Real-time alerting to monitoring tools | 24-72h |
| Big Object | High-volume error logging | Permanent, archive-oriented |
| MuleSoft Anypoint MQ DLQ | Middleware-managed integrations | Configurable |
| External (Splunk, Datadog) | Centralized ops monitoring | Per tool |
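
Assembling a DLQ record from a failed message can be sketched directly from the field design above. Illustrative Python only — a real implementation would insert an `Integration_Error__c` sObject in Apex; the function name and defaults are assumptions:

```python
import uuid
from datetime import datetime, timezone

def build_dlq_record(payload, error_message, error_code, retry_count,
                     source="Salesforce", target="ERP",
                     first_failure=None, correlation_id=None):
    """Build a dict mirroring the Integration_Error__c field design."""
    now = datetime.now(timezone.utc).isoformat()
    return {
        "Source_System__c": source,
        "Target_System__c": target,
        "Payload__c": payload,                     # full original message
        "Error_Message__c": error_message,
        "Error_Code__c": error_code,
        "Retry_Count__c": retry_count,
        "First_Failure__c": first_failure or now,
        "Last_Failure__c": now,                    # retries exhausted here
        "Status__c": "New",                        # entry point of the triage flow
        "Correlation_ID__c": correlation_id or str(uuid.uuid4()),
    }
```

Capturing the full payload and a correlation ID at parking time is what makes later replay possible without re-deriving the original request.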

Pattern 4: Idempotency

At-least-once delivery means duplicates will happen. Every receiver must handle the same message twice without side effects.

Idempotency Strategy Quick Pick

| Strategy | How It Works | When to Use |
|---|---|---|
| Upsert + External ID | SF upsert is naturally idempotent | Data sync (default choice) |
| Idempotency key | Sender includes unique key; receiver checks before processing | Custom business logic |
| Natural key dedup | Check by business key (Order Number) before insert | When unique business key exists |
| Payload hash | Hash message content, reject duplicates | No client-side key available |
| Timestamp comparison | Only process if newer than last processed | Simple, but clock skew risk |

```mermaid
flowchart TD
    A[Incoming Message] --> B{Has Idempotency<br/>Key?}
    B -->|No| C[Generate from payload<br/>hash or natural key]
    B -->|Yes| D{Key already<br/>processed?}
    C --> D
    D -->|Yes| E[Skip - return<br/>cached result]
    D -->|No| F[Process message]
    F --> G[Store key + result]
    G --> H[Return result]
```

Upsert is your best friend

For data synchronization, always use upsert with an External ID instead of separate insert/update logic. It is idempotent by design — sending the same record twice produces the same result. This single practice prevents the majority of integration duplicate bugs.
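
The idempotency-key flow above reduces to a check-process-store wrapper. A minimal Python sketch, assuming an in-memory key store (a production receiver would use a durable store so duplicates are caught across restarts); all names here are illustrative:

```python
# Idempotency key -> cached result. In production this must be a durable,
# shared store (database table, cache with TTL), not process memory.
_processed = {}

def handle_message(key, payload, process):
    """Process `payload` at most once per idempotency `key`.

    A duplicate delivery returns the cached result and produces
    no side effects, which is what at-least-once delivery requires.
    """
    if key in _processed:
        return _processed[key]       # skip: already handled
    result = process(payload)        # first delivery: do the work
    _processed[key] = result         # store key + result atomically in prod
    return result
```

The subtle requirement is that storing the key and producing the side effect happen atomically; otherwise a crash between the two reintroduces duplicates.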

Pattern 5: Monitoring and Alerting

Error handling without monitoring means failures are discovered by end users days later. Build alerting first, not as an afterthought.

What to Monitor — Alert Thresholds

| Metric | Warning | Critical |
|---|---|---|
| Integration failure rate | > 5% of transactions | > 20% of transactions |
| DLQ depth | > 100 messages | Growing for 30+ min |
| API call consumption | > 80% of daily limit | > 95% of daily limit |
| Response time (real-time) | > 5 seconds | > 10 seconds |
| Circuit breaker state | — | Any circuit open |
| Event subscriber lag | > 1 hour behind | > 12 hours behind |
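
The threshold table maps directly to a small evaluation function. Illustrative Python; the metric names and threshold values mirror the table and are defaults to tune, not platform constants:

```python
def alert_level(metric, value):
    """Map a metric reading to ok / warning / critical per the table above."""
    thresholds = {
        # metric: (warning, critical); None = no simple numeric threshold
        "failure_rate": (0.05, 0.20),      # fraction of transactions failing
        "dlq_depth": (100, None),          # "growing 30+ min" needs trend data
        "api_consumption": (0.80, 0.95),   # fraction of daily limit used
        "response_time_s": (5, 10),        # real-time call latency
    }
    warn, crit = thresholds[metric]
    if crit is not None and value > crit:
        return "critical"
    if warn is not None and value > warn:
        return "warning"
    return "ok"
```

Trend-based conditions (DLQ depth growing for 30+ minutes, any circuit open) need stateful checks and are deliberately outside this point-in-time lookup.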

Monitoring Stack

| Layer | Salesforce-Native | External |
|---|---|---|
| Metrics collection | Event Monitoring (Shield add-on) | Splunk, Datadog, ELK |
| Dashboards | Custom dashboard on Integration_Error__c | Grafana, Datadog |
| Alerting | Flow email alerts, Platform Events | PagerDuty, OpsGenie, Slack |
| Ticketing | Auto-create Case from Flow | Jira, ServiceNow |

Reverse-Engineered Use Cases

Scenario 1: ERP Goes Down During Order Processing

Situation: Salesforce sends orders to SAP via Fire-and-Forget (Platform Events + MuleSoft). SAP goes down for 2 hours during peak.

What you’d present:

  1. First 5 failures: MuleSoft retries with exponential backoff (1s, 2s, 4s, 8s, 16s + jitter)
  2. After 5 failures: Circuit breaker opens. Subsequent orders fail fast (no SAP call attempted)
  3. Failed orders: Route to Anypoint MQ dead letter queue with full payload and error context
  4. Alert: PagerDuty pages integration team; auto-created Jira ticket
  5. Recovery: After 60s, circuit breaker half-opens, tests one order. SAP still down — circuit stays open
  6. SAP recovers: Half-open test succeeds. Circuit closes. Normal flow resumes
  7. DLQ replay: Integration team bulk-replays 2 hours of queued orders from DLQ
  8. Idempotency: SAP uses order number as idempotency key — replayed orders that partially processed are safe

Scenario 2: Bulk API Partial Failure

Situation: Nightly sync of 500,000 Account records from data warehouse via Bulk API 2.0. Job completes with 498,000 success and 2,000 failures.

What you’d present:

  1. Successful records: Committed normally (no rollback of successes)
  2. Failed records: Download error results file (GET /jobs/ingest/{id}/failedResults)
  3. Classify failures: 1,800 validation rule failures (data quality), 200 duplicate External ID conflicts
  4. Data quality errors: Route to data steward dashboard for cleansing, fix source data, resubmit only failed records
  5. Duplicate errors: Investigate — likely stale dedup window. Switch to upsert if using insert
  6. Monitoring: Dashboard shows 99.6% success rate (within SLA), alert on the 2,000 failures for review
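
Step 3's classification can be automated against the downloaded error file. A Python sketch, assuming the standard `sf__Id`/`sf__Error` columns that Bulk API 2.0 failedResults files contain; the bucketing keywords are illustrative:

```python
import csv
import io
from collections import Counter

def bucket_failures(failed_results_csv):
    """Group a Bulk API 2.0 failedResults CSV by failure class."""
    buckets = Counter()
    for row in csv.DictReader(io.StringIO(failed_results_csv)):
        error = row.get("sf__Error", "")
        if "FIELD_CUSTOM_VALIDATION_EXCEPTION" in error:
            buckets["data_quality"] += 1   # route to data steward dashboard
        elif "DUPLICATE" in error:
            buckets["duplicate"] += 1      # investigate dedup / switch to upsert
        else:
            buckets["other"] += 1
    return buckets
```

Bucketing first means each class of failure gets its own remediation path instead of one undifferentiated resubmit.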

Scenario 3: Event Subscriber Falls Behind

Situation: External analytics system subscribes to CDC on Opportunity via Pub/Sub API. The analytics system goes down for maintenance over a 4-day weekend. CDC retention is 3 days.

What you’d present:

  1. Day 1-3: Events accumulate on bus. When subscriber reconnects, it replays from last checkpoint
  2. Day 3+: Events older than 3 days are lost — beyond retention window
  3. Recovery: Subscriber detects gap event from Salesforce, triggers batch reconciliation job
  4. Reconciliation: Run Bulk API 2.0 query for all Opportunities modified in the last 5 days, full sync
  5. Prevention: Monitor subscriber lag; alert when lag > 12 hours (gives 2.5 days to fix before data loss)
  6. Design improvement: Hybrid architecture — CDC for near-RT, nightly batch sync as safety net
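
The prevention step (monitor lag, alert before the retention window closes) can be sketched as a single check. Illustrative Python; thresholds mirror the scenario's numbers and the `now` parameter exists only to make the check testable:

```python
from datetime import datetime, timedelta, timezone

def lag_alert(last_processed, now=None, retention_days=3,
              warn_hours=1, crit_hours=12):
    """Return (level, hours_until_data_loss) for an event subscriber.

    With 3-day retention, alerting at 12h of lag leaves roughly
    2.5 days to repair the subscriber before events expire.
    """
    now = now or datetime.now(timezone.utc)
    lag = now - last_processed
    hours_to_data_loss = retention_days * 24 - lag.total_seconds() / 3600
    if lag >= timedelta(hours=crit_hours):
        return ("critical", hours_to_data_loss)
    if lag >= timedelta(hours=warn_hours):
        return ("warning", hours_to_data_loss)
    return ("ok", hours_to_data_loss)
```

Framing the alert as hours-until-data-loss, rather than raw lag, tells the on-call engineer exactly how much runway remains.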

Anti-Pattern Quick Reference

| Anti-Pattern | Why It Fails | Do This Instead |
|---|---|---|
| Retry forever | Wastes resources, masks permanent failures | Max retries + DLQ |
| Retry without backoff | Hammers struggling system | Exponential backoff + jitter |
| Retry all errors equally | 400 will never succeed on retry | Classify first, only retry transient |
| Swallow errors silently | Nobody knows integration is broken | Log + alert + DLQ |
| No idempotency | Duplicates on retry | External ID upsert or idempotency keys |
| Manual-only recovery | Does not scale | Automated retry + manual for edge cases |
| Monitor reactively | Users discover failures days later | Proactive alerting with thresholds |

The cardinal sin of integration

Building an integration with no error handling and no monitoring. When it fails — and it will — nobody knows until a business user reports missing data days or weeks later. By then, the data inconsistency may be unrecoverable. Build error handling and monitoring first, not as an afterthought. The board will grill you on this.

Sources