Error Handling: Quick Reference
When the board asks “what happens when this fails?” your answer must include error classification, retry strategy, circuit breaker behavior, dead letter queue design, and monitoring. This page gives you the complete error handling playbook. For full details, see Error Handling Patterns Deep Dive.
Error Classification — Do This First
Each error category demands a different action. Retrying a 400 Bad Request forever is an anti-pattern. Failing immediately on a 503 wastes a recoverable situation.
| Category | HTTP Codes | Examples | Correct Action | Wrong Action |
|---|---|---|---|---|
| Transient | 408, 429, 500, 502, 503, 504 | Timeout, rate limit, service unavailable | Retry with exponential backoff | Fail immediately |
| Persistent | 400, 401, 403, 404, 422 | Bad request, unauthorized, not found | Route to DLQ, alert team | Retry (will never succeed) |
| Systemic | Repeated 503, connection refused | System fully down, cert expired | Circuit breaker, fallback mode | Keep retrying (wastes resources) |
| Data quality | 400, 422 (validation) | Missing fields, invalid format, dupes | Reject, notify, data cleansing | Silently drop or force through |
| Capacity | 429 (Salesforce daily limit) | API limit exhausted, governor limits | Queue and defer, throttle | Fail the entire batch |
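The table above can be sketched as a first-pass classifier keyed on status code. This is an illustrative sketch, not a Salesforce API; note the overlaps called out in the table (400/422 can be data quality, 429 can be capacity), so a real implementation refines the result with response context.

```python
# Status-code sets from the classification table above
TRANSIENT = {408, 429, 500, 502, 503, 504}
PERSISTENT = {400, 401, 403, 404, 422}

def classify(status_code, consecutive_failures=0):
    """First-pass classification by HTTP status.

    Codes overlap across categories: 400/422 may be data quality
    (inspect the validation error body) and 429 may be capacity
    (check for a daily-limit indicator), so refine with context.
    """
    if consecutive_failures >= 5:   # repeated failures look systemic
        return "systemic"
    if status_code in TRANSIENT:
        return "transient"
    if status_code in PERSISTENT:
        return "persistent"
    return "unknown"
```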
The Complete Error Handling Stack
This is the layered strategy you present at the board. Every integration touchpoint should use this stack.
```mermaid
flowchart TD
    A[Integration Call] --> B{Success?}
    B -->|Yes| C[Log Success]
    B -->|No| D[Classify Error]
    D --> E{Transient?<br/>5xx / 429 / timeout}
    E -->|Yes| F["Retry with<br/>Exponential Backoff"]
    F --> G{Retries<br/>exhausted?}
    G -->|No| A
    G -->|Yes| H[Dead Letter Queue]
    E -->|No| I{Systemic?<br/>Repeated failures}
    I -->|Yes| J[Circuit Breaker<br/>OPENS]
    J --> K[Fail Fast for<br/>subsequent calls]
    K --> H
    I -->|No| L[Persistent / Data Quality<br/>will never succeed with retry]
    L --> H
    H --> M[Alert Ops Team]
    H --> N[Dashboard Update]
```
Walk the board through a failure story
“When the ERP returns a 503, we retry 3 times with exponential backoff (1s, 2s, 4s). After 3 failures, the circuit breaker opens — subsequent calls fail fast for 60 seconds. The failed message routes to the DLQ (Integration_Error__c), PagerDuty alerts the integration team, and a Jira ticket is auto-created. Once ERP recovers, the circuit breaker half-opens, tests one call, and if it succeeds, resumes normal flow. The team replays DLQ messages through the monitoring dashboard.”
Pattern 1: Retry with Exponential Backoff
Parameters
| Parameter | Value | Rationale |
|---|---|---|
| Max retries | 3-5 | Enough for transient; not so many it wastes time on persistent failures |
| Base delay | 1 second | Starting wait before first retry |
| Multiplier | 2x (exponential) | 1s → 2s → 4s → 8s → 16s |
| Max delay cap | 60 seconds | Prevent absurdly long waits |
| Jitter | Random 0-1s added | Prevents thundering herd |
Retry Timing Table
| Attempt | Delay | Cumulative |
|---|---|---|
| 1 | 1s (+jitter) | ~1-2s |
| 2 | 2s (+jitter) | ~3-5s |
| 3 | 4s (+jitter) | ~7-10s |
| 4 | 8s (+jitter) | ~15-19s |
| 5 | 16s (+jitter) | ~31-36s |
Jitter is not optional
Without jitter, 100 failed clients all retry at the same intervals, creating repeated traffic spikes against an already-stressed system. This is the thundering herd problem. Always add random jitter. Salesforce concurrent API limits (25 org-wide) make this even more critical.
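The parameters and timing table above can be sketched as follows. This is a minimal illustration, not a library API; `TransientError` and `retry_with_backoff` are hypothetical names, and a real implementation would only catch errors classified as transient.

```python
import random
import time

class TransientError(Exception):
    """5xx / 429 / timeout -- safe to retry."""

def retry_with_backoff(call, max_retries=3, base_delay=1.0,
                       multiplier=2.0, max_delay=60.0, jitter=1.0):
    """Retry `call` on transient errors with exponential backoff + jitter."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except TransientError:
            if attempt == max_retries:
                raise  # retries exhausted -> caller routes to DLQ
            # 1s, 2s, 4s, ... capped at max_delay, plus random jitter
            delay = min(base_delay * multiplier ** attempt, max_delay)
            time.sleep(delay + random.uniform(0, jitter))
```

Raising on exhaustion (rather than swallowing) lets the caller route the message to the DLQ with full error context.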
Pattern 2: Circuit Breaker
Prevents your system from hammering a dead external service. Three states:
```mermaid
stateDiagram-v2
    [*] --> Closed
    Closed --> Open : Failure threshold hit (5 consecutive)
    Open --> HalfOpen : Timeout expires (60s)
    HalfOpen --> Closed : Test call succeeds
    HalfOpen --> Open : Test call fails
```
| State | Behavior | Salesforce Implementation |
|---|---|---|
| Closed | Normal — calls pass through, track failures | Standard callout behavior |
| Open | All calls fail fast — no callout attempted | Check Platform Cache / Custom Metadata before calling |
| Half-Open | Allow one test call to check recovery | Scheduled job or manual reset tests one call |
Configuration
| Parameter | Recommended | Purpose |
|---|---|---|
| Failure threshold | 5 consecutive | Opens the circuit |
| Open timeout | 30-60 seconds | Time before testing recovery |
| Success threshold | 2-3 in half-open | Confirms recovery before closing |
Salesforce has no native circuit breaker
You must build it. Options: (1) Platform Cache — fastest reads, non-durable (resets on cache eviction); (2) Custom Metadata Type — durable, requires metadata deploy to change state; (3) Custom Settings — middle ground, editable via API. Platform Cache is the most practical for real-time checks.
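The three-state machine and the configuration table above can be sketched as an in-memory class. This is an illustrative sketch only; a Salesforce build would persist the state in Platform Cache rather than instance fields, and all names here are hypothetical.

```python
import time

class CircuitBreaker:
    """Closed -> Open on N consecutive failures; Open -> Half-Open after timeout."""

    def __init__(self, failure_threshold=5, open_timeout=60.0, success_threshold=2):
        self.failure_threshold = failure_threshold
        self.open_timeout = open_timeout
        self.success_threshold = success_threshold
        self.state = "CLOSED"
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def call(self, fn):
        if self.state == "OPEN":
            if time.time() - self.opened_at >= self.open_timeout:
                self.state = "HALF_OPEN"          # allow one test call
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        self.successes = 0
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"                   # open (or re-open) the circuit
            self.opened_at = time.time()
            self.failures = 0

    def _on_success(self):
        if self.state == "HALF_OPEN":
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = "CLOSED"             # recovery confirmed
                self.successes = 0
        else:
            self.failures = 0                     # reset the consecutive counter
```

The fail-fast `RuntimeError` is what protects the struggling downstream system: no callout is attempted while the circuit is open.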
Pattern 3: Dead Letter Queue (DLQ)
Messages that exhaust all retries are parked in a DLQ for inspection, diagnosis, and eventual reprocessing.
DLQ Record Design
| Field | Purpose |
|---|---|
| Source_System__c | Where message originated |
| Target_System__c | Where it was going |
| Payload__c | Original message (Long Text Area) |
| Error_Message__c | What went wrong |
| Error_Code__c | HTTP status / exception type |
| Retry_Count__c | How many attempts were made |
| First_Failure__c | When it first failed |
| Last_Failure__c | When retries exhausted |
| Status__c | New / Under Review / Resubmitted / Archived |
| Correlation_ID__c | Links to original transaction |
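The field layout above can be sketched as a record builder. This is an illustrative shape only (persistence into `Integration_Error__c` or any other store is out of scope), and the function name is hypothetical.

```python
import json
from datetime import datetime, timezone

def to_dlq_record(payload, error, retry_count, correlation_id,
                  source="Salesforce", target="ERP", first_failure=None):
    """Shape a failed message into the DLQ field layout above."""
    now = datetime.now(timezone.utc).isoformat()
    return {
        "Source_System__c": source,
        "Target_System__c": target,
        "Payload__c": json.dumps(payload),          # original message, replayable
        "Error_Message__c": str(error),
        "Error_Code__c": getattr(error, "code", "UNKNOWN"),
        "Retry_Count__c": retry_count,
        "First_Failure__c": first_failure or now,
        "Last_Failure__c": now,
        "Status__c": "New",                         # New -> Under Review -> Resubmitted
        "Correlation_ID__c": correlation_id,
    }
```

Storing the raw payload is what makes replay possible; without it, the DLQ is only an error log.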
Salesforce DLQ Implementation Options
| Approach | Best For | Retention |
|---|---|---|
| Custom Object (Integration_Error__c) | Audit trail, reporting, dashboards | Permanent |
| Platform Events (Error_Event__e) | Real-time alerting to monitoring tools | 24-72h |
| Big Object | High-volume error logging | Permanent, archive-oriented |
| MuleSoft Anypoint MQ DLQ | Middleware-managed integrations | Configurable |
| External (Splunk, Datadog) | Centralized ops monitoring | Per tool |
Pattern 4: Idempotency
At-least-once delivery means duplicates will happen. Every receiver must be able to process the same message twice with exactly the same effect as processing it once.
Idempotency Strategy Quick Pick
| Strategy | How It Works | When to Use |
|---|---|---|
| Upsert + External ID | SF upsert is naturally idempotent | Data sync (default choice) |
| Idempotency key | Sender includes unique key; receiver checks before processing | Custom business logic |
| Natural key dedup | Check by business key (Order Number) before insert | When unique business key exists |
| Payload hash | Hash message content, reject duplicates | No client-side key available |
| Timestamp comparison | Only process if newer than last processed | Simple, but clock skew risk |
```mermaid
flowchart TD
    A[Incoming Message] --> B{Has Idempotency<br/>Key?}
    B -->|No| C[Generate from payload<br/>hash or natural key]
    B -->|Yes| D{Key already<br/>processed?}
    C --> D
    D -->|Yes| E[Skip - return<br/>cached result]
    D -->|No| F[Process message]
    F --> G[Store key + result]
    G --> H[Return result]
```
Upsert is your best friend
For data synchronization, always use upsert with an External ID instead of separate insert/update logic. It is idempotent by design — sending the same record twice produces the same result. This single practice prevents the majority of integration duplicate bugs.
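The key-check flow above can be sketched as follows. This is an illustrative sketch with hypothetical names; the in-memory dict stands in for a durable store, and the check-then-store would need to be atomic under concurrency.

```python
import hashlib
import json

_processed = {}  # idempotency key -> cached result (durable store in practice)

def handle(message, process):
    """Process `message` at most once; replays return the cached result."""
    # Use the sender's key, or derive one from a payload hash (table above)
    key = message.get("idempotency_key") or hashlib.sha256(
        json.dumps(message["body"], sort_keys=True).encode()).hexdigest()
    if key in _processed:
        return _processed[key]      # duplicate: skip, return cached result
    result = process(message["body"])
    _processed[key] = result        # store key + result together
    return result
```

Returning the cached result (rather than an error) keeps replays transparent to the sender, which matters during DLQ reprocessing.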
Pattern 5: Monitoring and Alerting
Error handling without monitoring means failures are discovered by end users days later. Build alerting first, not as an afterthought.
What to Monitor — Alert Thresholds
| Metric | Warning | Critical |
|---|---|---|
| Integration failure rate | > 5% of transactions | > 20% of transactions |
| DLQ depth | > 100 messages | Growing for 30+ min |
| API call consumption | > 80% of daily limit | > 95% of daily limit |
| Response time (real-time) | > 5 seconds | > 10 seconds |
| Circuit breaker state | — | Any circuit open |
| Event subscriber lag | > 1 hour behind | > 12 hours behind |
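Level-based thresholds from the table above can be sketched as a simple mapping (trend-based criteria like "DLQ growing for 30+ min" additionally need rate tracking, which is out of scope here). Names and threshold constants are illustrative.

```python
def alert_level(value, warning, critical):
    """Map a metric reading onto the Warning/Critical thresholds above."""
    if value >= critical:
        return "CRITICAL"
    if value >= warning:
        return "WARNING"
    return "OK"

# Example thresholds from the table (failure rate as a fraction,
# API consumption as a fraction of the daily limit)
FAILURE_RATE = (0.05, 0.20)
API_CONSUMPTION = (0.80, 0.95)
```

Usage: `alert_level(0.06, *FAILURE_RATE)` flags a warning; the critical path should page (PagerDuty) rather than just dashboard.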
Monitoring Stack
| Layer | Salesforce-Native | External |
|---|---|---|
| Metrics collection | Event Monitoring (Shield add-on) | Splunk, Datadog, ELK |
| Dashboards | Custom dashboard on Integration_Error__c | Grafana, Datadog |
| Alerting | Flow email alerts, Platform Events | PagerDuty, OpsGenie, Slack |
| Ticketing | Auto-create Case from Flow | Jira, ServiceNow |
Reverse-Engineered Use Cases
Scenario 1: ERP Goes Down During Order Processing
Situation: Salesforce sends orders to SAP via Fire-and-Forget (Platform Events + MuleSoft). SAP goes down for 2 hours during peak.
What you’d present:
- First 5 failures: MuleSoft retries with exponential backoff (1s, 2s, 4s, 8s, 16s + jitter)
- After 5 failures: Circuit breaker opens. Subsequent orders fail fast (no SAP call attempted)
- Failed orders: Route to Anypoint MQ dead letter queue with full payload and error context
- Alert: PagerDuty pages integration team; auto-created Jira ticket
- Recovery: After 60s, circuit breaker half-opens, tests one order. SAP still down — circuit stays open
- SAP recovers: Half-open test succeeds. Circuit closes. Normal flow resumes
- DLQ replay: Integration team bulk-replays 2 hours of queued orders from DLQ
- Idempotency: SAP uses order number as idempotency key — replayed orders that partially processed are safe
Scenario 2: Bulk API Partial Failure
Situation: Nightly sync of 500,000 Account records from the data warehouse via Bulk API 2.0. The job completes with 498,000 successes and 2,000 failures.
What you’d present:
- Successful records: Committed normally (no rollback of successes)
- Failed records: Download error results file (GET /jobs/ingest/{id}/failedResults)
- Classify failures: 1,800 validation rule failures (data quality), 200 duplicate External ID conflicts
- Data quality errors: Route to data steward dashboard for cleansing, fix source data, resubmit only failed records
- Duplicate errors: Investigate — likely stale dedup window. Switch to upsert if using insert
- Monitoring: Dashboard shows 99.6% success rate (within SLA), alert on the 2,000 failures for review
Scenario 3: Event Subscriber Falls Behind
Situation: External analytics system subscribes to CDC on Opportunity via Pub/Sub API. The analytics system goes down for maintenance over a 4-day weekend. CDC retention is 3 days.
What you’d present:
- Day 1-3: Events accumulate on bus. When subscriber reconnects, it replays from last checkpoint
- Day 3+: Events older than 3 days are lost — beyond retention window
- Recovery: Subscriber detects gap event from Salesforce, triggers batch reconciliation job
- Reconciliation: Run Bulk API 2.0 query for all Opportunities modified in the last 5 days, full sync
- Prevention: Monitor subscriber lag; alert when lag > 12 hours (gives 2.5 days to fix before data loss)
- Design improvement: Hybrid architecture — CDC for near-RT, nightly batch sync as safety net
Anti-Pattern Quick Reference
| Anti-Pattern | Why It Fails | Do This Instead |
|---|---|---|
| Retry forever | Wastes resources, masks permanent failures | Max retries + DLQ |
| Retry without backoff | Hammers struggling system | Exponential backoff + jitter |
| Retry all errors equally | 400 will never succeed on retry | Classify first, only retry transient |
| Swallow errors silently | Nobody knows integration is broken | Log + alert + DLQ |
| No idempotency | Duplicates on retry | External ID upsert or idempotency keys |
| Manual-only recovery | Does not scale | Automated retry + manual for edge cases |
| Monitor reactively | Users discover failures days later | Proactive alerting with thresholds |
The cardinal sin of integration
Building an integration with no error handling and no monitoring. When it fails — and it will — nobody knows until a business user reports missing data days or weeks later. By then, the data inconsistency may be unrecoverable. Build error handling and monitoring first, not as an afterthought. The board will grill you on this.