Error Handling Patterns
Integration error handling separates architects who design for the happy path from those who design for reality. External systems go down, networks fail, data is malformed, and rate limits are exceeded. This page covers the patterns every CTA must know to handle failures gracefully.
CTA board expectations
When you present an integration architecture, the board WILL ask: “What happens when this fails?” If your answer is “we retry” without specifics on strategy, limits, and fallback behavior, you will lose points. Be specific about retry count, backoff strategy, dead letter handling, and monitoring.
Error Categories
Before choosing a pattern, classify the error. Different error types demand different responses.
| Category | Examples | Correct Response | Wrong Response |
|---|---|---|---|
| Transient | Network timeout, 503 Service Unavailable, rate limit (429) | Retry with backoff | Fail immediately |
| Persistent | 404 Not Found, 400 Bad Request, invalid data | Route to dead letter queue, alert | Retry infinitely |
| Systemic | External system fully down, certificate expired | Circuit breaker, fallback | Keep retrying (wastes resources) |
| Data quality | Missing required fields, invalid format, duplicates | Reject and notify, data cleansing | Silently drop or force through |
| Capacity | Bulk API daily limit reached, governor limits | Queue and defer, throttle | Fail the entire batch |
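The classification step above can be sketched as a small routing function. This is a hypothetical sketch in Python (the function name and category strings are illustrative, not from any Salesforce or middleware library), mapping an HTTP outcome to one of the categories in the table:

```python
from typing import Optional

def classify_error(status_code: Optional[int], timed_out: bool = False) -> str:
    """Map an HTTP outcome to one of the error categories above (simplified sketch)."""
    if timed_out:
        return "transient"      # network timeout: retry with backoff
    if status_code is None:
        return "systemic"       # connection refused / TLS failure: circuit breaker
    if status_code in (429, 502, 503, 504):
        return "transient"      # rate limit or temporary server trouble: retry
    if 400 <= status_code < 500:
        return "persistent"     # 400/404 etc. will never succeed on retry: DLQ
    return "persistent"         # unknown errors default to no-retry plus alert
```

Real classifiers also inspect response bodies and exception types; status code alone is a starting point.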
Pattern 1: Retry with Exponential Backoff
Automatically retry failed operations with increasing delays between attempts. This is the foundation of transient error handling.
How It Works
```mermaid
flowchart TD
A[API Call] --> B{Success?}
B -->|Yes| C[Process Response]
B -->|No| D{Retryable Error?<br/>5xx, timeout, 429}
D -->|No| E[Route to Dead Letter Queue]
D -->|Yes| F{Retry Count<br/>< Max Retries?}
F -->|No| E
F -->|Yes| G["Wait: base_delay * 2^attempt<br/>+ random jitter"]
G --> A
```
Implementation Parameters
| Parameter | Recommended Value | Rationale |
|---|---|---|
| Max retries | 3-5 | Enough for transient issues, not so many that persistent failures waste time |
| Base delay | 1 second | Starting delay before first retry |
| Max delay | 60 seconds | Cap to prevent excessively long waits |
| Backoff multiplier | 2x (exponential) | 1s, 2s, 4s, 8s, 16s… |
| Jitter | Random 0-1s added | Prevents thundering herd when many clients retry simultaneously |
Retry Timing Example
| Attempt | Delay (no jitter) | Delay (with jitter) |
|---|---|---|
| 1 | 1 second | 1.0 - 2.0 seconds |
| 2 | 2 seconds | 2.0 - 3.0 seconds |
| 3 | 4 seconds | 4.0 - 5.0 seconds |
| 4 | 8 seconds | 8.0 - 9.0 seconds |
| 5 | 16 seconds | 16.0 - 17.0 seconds |
| Total | ~31 seconds | ~31 - 36 seconds |
Jitter prevents thundering herd
Without jitter, if 100 clients all fail at the same time, they all retry at exactly the same intervals — creating repeated spikes. Random jitter spreads retries across time. This is especially important for Salesforce integrations where concurrent API limits are shared across the org.
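The retry loop and delay formula above can be sketched as follows. This is a minimal Python illustration under the table's recommended parameters (function names are hypothetical; a middleware or Apex implementation would follow the same shape):

```python
import random
import time

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  jitter: float = 1.0) -> float:
    """Delay before retry `attempt` (0-based): base * 2^attempt, capped, plus random jitter."""
    return min(cap, base * (2 ** attempt)) + random.uniform(0.0, jitter)

def call_with_retries(operation, max_retries: int = 5,
                      is_retryable=lambda exc: True):
    """Run `operation`; retry retryable failures with exponential backoff and jitter."""
    for attempt in range(max_retries):
        try:
            return operation()
        except Exception as exc:
            if not is_retryable(exc) or attempt == max_retries - 1:
                raise  # exhausted or non-retryable: caller routes to the DLQ
            time.sleep(backoff_delay(attempt))
```

Note that the jitter term is what desynchronizes clients after a shared outage; without it every client computes the identical delay sequence.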
Pattern 2: Circuit Breaker
Prevents a system from repeatedly calling an external service that is known to be down. Modeled after electrical circuit breakers.
States
```mermaid
stateDiagram-v2
[*] --> Closed
Closed --> Open : Failure threshold exceeded
Open --> HalfOpen : Timeout period expires
HalfOpen --> Closed : Test call succeeds
HalfOpen --> Open : Test call fails
note right of Closed
Normal operation.
Calls pass through.
Track failure count.
end note
note right of Open
All calls fail fast.
No calls to external system.
Wait for timeout.
end note
note right of HalfOpen
Allow ONE test call.
If succeeds: reset to Closed.
If fails: back to Open.
end note
```
Implementation in Salesforce
| State | Behavior | Salesforce Implementation |
|---|---|---|
| Closed | Normal operation, calls pass through | Standard callout behavior |
| Open | All calls fail immediately, no callout attempted | Check Custom Metadata / Platform Cache before callout |
| Half-Open | Allow one test call to check if service recovered | Scheduled job or manual reset attempts one call |
Configuration Parameters
| Parameter | Recommended | Purpose |
|---|---|---|
| Failure threshold | 5 consecutive failures | Number of failures before opening circuit |
| Open timeout | 30-60 seconds | How long to wait before testing recovery |
| Success threshold | 2-3 successes in half-open | Successes needed to close circuit |
Salesforce implementation options
Salesforce does not have a native circuit breaker. Implement using: (1) Custom Metadata Type to store circuit state, (2) Platform Cache for fast state checks, or (3) Custom Settings with hierarchical access. Platform Cache is fastest but non-durable; Custom Metadata requires deployment; Custom Settings are a middle ground.
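Since there is no native implementation, the state machine above has to be coded by hand. Here is a minimal in-memory sketch in Python (class and method names are hypothetical); a Salesforce version would persist `state`, `failures`, and `opened_at` in Platform Cache or Custom Metadata as described in the note:

```python
import time

class CircuitBreaker:
    """Minimal in-memory circuit breaker implementing the three states above."""
    CLOSED, OPEN, HALF_OPEN = "closed", "open", "half_open"

    def __init__(self, failure_threshold=5, open_timeout=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_timeout = open_timeout
        self.clock = clock
        self.state = self.CLOSED
        self.failures = 0
        self.opened_at = None

    def allow_request(self) -> bool:
        if self.state == self.OPEN:
            if self.clock() - self.opened_at >= self.open_timeout:
                self.state = self.HALF_OPEN   # timeout expired: allow a probe call
                return True
            return False                      # still open: fail fast, no callout
        return True                           # closed or half-open: call proceeds

    def record_success(self):
        self.state = self.CLOSED
        self.failures = 0

    def record_failure(self):
        if self.state == self.HALF_OPEN:
            self._open()                      # probe failed: back to open
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self._open()

    def _open(self):
        self.state = self.OPEN
        self.opened_at = self.clock()
        self.failures = 0
```

This sketch lets any number of calls through in half-open; a stricter version would gate a single probe at a time and count successes against the success threshold from the table.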
Pattern 3: Dead Letter Queue (DLQ)
Messages that cannot be processed after all retries are routed to a dead letter queue for manual inspection, reprocessing, or alerting.
Flow
```mermaid
flowchart LR
A[Message] --> B{Process}
B -->|Success| C[Done]
B -->|Failure| D{Retries<br/>exhausted?}
D -->|No| E[Retry Queue<br/>with backoff]
E --> B
D -->|Yes| F[Dead Letter Queue]
F --> G[Alert Operations Team]
F --> H[Manual Review Dashboard]
H --> I{Fixable?}
I -->|Yes| J[Fix & Resubmit]
J --> B
I -->|No| K[Log & Archive]
```
Salesforce DLQ Options
| Approach | Best For | Persistence |
|---|---|---|
| Custom Object (Integration_Error__c) | Full audit trail, reporting | Permanent (until deleted) |
| Platform Events (Error_Event__e) | Real-time alerting | 24-72 hours |
| Big Object | High-volume error logging | Permanent, archive-oriented |
| Middleware DLQ (MuleSoft/Anypoint MQ) | Middleware-managed integrations | Configurable retention |
| External monitoring (Splunk, Datadog) | Centralized ops monitoring | Per tool retention |
DLQ Record Design
A well-designed DLQ record captures everything needed for diagnosis and reprocessing:
| Field | Purpose |
|---|---|
| Source System | Where the message originated |
| Target System | Where it was being sent |
| Payload | The original message content |
| Error Message | What went wrong |
| Error Code | HTTP status, exception type |
| Retry Count | How many attempts were made |
| First Failure Timestamp | When it first failed |
| Last Failure Timestamp | When retries were exhausted |
| Status | New / Under Review / Resubmitted / Archived |
| Correlation ID | Links to the original transaction |
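As a concrete illustration, the field list above maps naturally onto a record type. This Python dataclass is an in-code analog of a custom object such as Integration_Error__c (field and method names are hypothetical):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class DeadLetterRecord:
    """In-code analog of the DLQ fields listed above."""
    source_system: str
    target_system: str
    payload: str                 # original message content, for resubmission
    error_message: str
    error_code: str              # HTTP status or exception type
    correlation_id: str          # links back to the original transaction
    retry_count: int = 0
    first_failure_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    last_failure_at: Optional[datetime] = None
    status: str = "New"          # New / Under Review / Resubmitted / Archived

    def exhaust(self, retries: int) -> None:
        """Mark retries exhausted and stamp the final failure time."""
        self.retry_count = retries
        self.last_failure_at = datetime.now(timezone.utc)
```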
Pattern 4: Idempotency
Ensuring that processing the same message multiple times produces the same result. This is mandatory for any at-least-once delivery system.
Why It Matters
Platform Events, CDC, and most middleware deliver at-least-once. Duplicates WILL happen due to:
- Network retries at the transport layer
- Subscriber reconnection replaying events
- Middleware retry on ambiguous failures
- Bulk API partial retries
Implementation Strategies
```mermaid
flowchart TD
A[Incoming Message] --> B{Has Idempotency Key?}
B -->|No| C[Generate from payload<br/>hash or natural key]
B -->|Yes| D{Key already<br/>processed?}
C --> D
D -->|Yes| E[Return cached result<br/>Skip processing]
D -->|No| F[Process message]
F --> G[Store idempotency key<br/>with result]
G --> H[Return result]
```
| Strategy | How It Works | Pros | Cons |
|---|---|---|---|
| Idempotency key | Client sends unique key; server checks before processing | Most reliable | Requires key storage and lookup |
| Natural key dedup | Use business key (Order Number) to detect duplicates | No extra infrastructure | Requires unique business key |
| Upsert operations | Use External ID for upsert instead of insert | Built into Salesforce | Only works for CRUD, not business logic |
| Payload hash | Hash the message content, check for duplicate hashes | Works without client changes | Distinct messages with identical content are wrongly deduplicated; hash collisions are possible but rare |
| Timestamp comparison | Only process if timestamp is newer than last processed | Simple | Clock skew issues |
Salesforce upsert is your friend
For data synchronization, always use upsert with an External ID field rather than separate insert/update logic. Upsert is naturally idempotent — sending the same record twice produces the same result. This single recommendation prevents a large class of integration bugs.
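The key-based dedup flow from the diagram above can be sketched as follows. This is a hypothetical Python illustration (class name and in-memory dict are illustrative; a production system needs a durable key store with atomic check-and-set):

```python
import hashlib
import json
from typing import Optional

class IdempotentProcessor:
    """Dedupe by idempotency key, falling back to a canonical payload hash."""

    def __init__(self, handler):
        self.handler = handler   # the actual business logic
        self._results = {}       # idempotency key -> cached result

    def process(self, payload: dict, idempotency_key: Optional[str] = None):
        # No client-supplied key: derive one from a canonical payload hash.
        key = idempotency_key or hashlib.sha256(
            json.dumps(payload, sort_keys=True).encode()
        ).hexdigest()
        if key in self._results:
            return self._results[key]   # duplicate delivery: skip side effects
        result = self.handler(payload)
        self._results[key] = result     # store key and result together
        return result
```

Note the `sort_keys=True`: without canonical serialization, the same logical payload can hash differently depending on field order.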
Pattern 5: Monitoring and Alerting
Error handling without monitoring is like having a fire alarm with no sound. You must know when integrations fail and respond before business impact.
Monitoring Architecture
```mermaid
flowchart TB
subgraph "Integration Layer"
INT[Integration Processes]
DLQ[Dead Letter Queue]
LOGS[Error Logs]
end
subgraph "Monitoring Platform"
COLLECT[Log Collector<br/>Splunk/Datadog/ELK]
ALERT[Alert Rules Engine]
DASH[Operations Dashboard]
end
subgraph "Response"
EMAIL[Email Alerts]
SLACK[Slack/Teams Notifications]
PAGER[PagerDuty/OpsGenie]
TICKET[Auto-Create Case/Jira]
end
INT --> COLLECT
DLQ --> COLLECT
LOGS --> COLLECT
COLLECT --> ALERT
COLLECT --> DASH
ALERT -->|Warning| EMAIL
ALERT -->|Warning| SLACK
ALERT -->|Critical| PAGER
ALERT -->|All| TICKET
```
What to Monitor
| Metric | Threshold | Alert Level |
|---|---|---|
| Integration failure rate | > 5% of transactions | Warning |
| Integration failure rate | > 20% of transactions | Critical |
| DLQ depth | > 100 messages | Warning |
| DLQ depth growing | Increasing for 30+ minutes | Critical |
| API call consumption | > 80% of daily limit | Warning |
| API call consumption | > 95% of daily limit | Critical |
| Average response time | > 5 seconds (for real-time) | Warning |
| Circuit breaker open | Any circuit open | Critical |
| Event subscriber lag | > 1 hour behind | Warning |
| Event subscriber lag | > 12 hours behind | Critical (approaching retention limit) |
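An alert rules engine evaluates each metric sample against a threshold table like the one above. Here is a minimal Python sketch (metric names are hypothetical; the numeric thresholds mirror the table and should be tuned per org):

```python
from typing import Optional

# metric: (warning threshold, critical threshold; None = no critical tier here)
THRESHOLDS = {
    "failure_rate_pct": (5.0, 20.0),
    "dlq_depth":        (100.0, None),
    "api_usage_pct":    (80.0, 95.0),
    "response_time_s":  (5.0, None),
    "subscriber_lag_h": (1.0, 12.0),
}

def alert_level(metric: str, value: float) -> Optional[str]:
    """Return 'critical', 'warning', or None for a single metric sample."""
    warning, critical = THRESHOLDS[metric]
    if critical is not None and value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return None
```

Trend-based rules (such as "DLQ depth increasing for 30+ minutes") need a window of samples rather than a single reading, so they sit outside this simple point-in-time check.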
Salesforce-Native Monitoring Options
| Tool | What It Monitors | Cost |
|---|---|---|
| Event Monitoring | API calls, logins, report exports | Shield add-on |
| Custom Dashboard | Integration_Error__c records | Included |
| Flow Email Alerts | Trigger on error records | Included |
| Platform Events | Real-time error broadcasting | Included |
| Einstein Analytics | Trend analysis on error patterns | Add-on |
Combining Patterns: The Complete Error Handling Stack
In a CTA scenario, you should present a layered error handling strategy, not just a single pattern.
```mermaid
flowchart TD
A[Integration Call] --> B{Success?}
B -->|Yes| C[Log Success Metric]
B -->|No| D[Classify Error]
D --> E{Transient?}
E -->|Yes| F[Retry with<br/>Exponential Backoff]
F --> G{Retries<br/>Exhausted?}
G -->|No| A
G -->|Yes| H[Dead Letter Queue]
E -->|No| I{Systemic?}
I -->|Yes| J[Circuit Breaker<br/>Opens]
J --> K[Fail Fast for<br/>Subsequent Calls]
K --> H
I -->|No| L{Data Quality?}
L -->|Yes| M[Reject to DLQ<br/>with Validation Details]
L -->|No| H
H --> N[Alert Operations]
H --> O[Dashboard Update]
J --> N
```
CTA presentation strategy
When presenting error handling at the review board, walk through a specific failure scenario end-to-end: “When the ERP is unavailable, the order event goes to the retry queue with exponential backoff. After 5 retries over 30 seconds, the circuit breaker opens. Subsequent calls fail fast. The failed message routes to the DLQ, PagerDuty alerts the integration team, and a Jira ticket is auto-created. The integration team reviews the DLQ dashboard, and once the ERP is back, they resubmit from the DLQ.”
End-to-End Failure Scenario: ERP Goes Down
This sequence diagram shows exactly how the patterns above work together when the ERP becomes unavailable during order processing — the type of walkthrough that scores well at the CTA board.
```mermaid
sequenceDiagram
participant SF as Salesforce
participant MW as Middleware
participant CB as Circuit Breaker<br/>(state store)
participant ERP as ERP System
participant DLQ as Dead Letter Queue
participant OPS as Operations Team
SF->>MW: Order Event (Platform Event)
MW->>CB: Check circuit state
CB-->>MW: CLOSED (healthy)
MW->>ERP: POST /orders (attempt 1)
ERP-->>MW: 503 Service Unavailable
MW->>MW: Wait 1s (backoff)
MW->>ERP: POST /orders (attempt 2)
ERP-->>MW: 503 Service Unavailable
MW->>MW: Wait 2s (backoff)
MW->>ERP: POST /orders (attempt 3)
ERP-->>MW: 503 Service Unavailable
MW->>CB: Report failure #3 (threshold reached)
CB->>CB: State → OPEN
MW->>DLQ: Route failed order to DLQ
MW->>OPS: PagerDuty alert + auto-create Jira
Note over CB: 60 seconds pass...
CB->>CB: State → HALF-OPEN
MW->>ERP: Test call (single probe)
ERP-->>MW: 200 OK
CB->>CB: State → CLOSED
MW->>OPS: Notify: ERP recovered
OPS->>DLQ: Trigger bulk resubmit
DLQ->>MW: Replay failed orders
MW->>ERP: POST /orders (resubmit)
ERP-->>MW: 200 OK
```
Anti-Patterns
| Anti-Pattern | Why It Fails | Better Approach |
|---|---|---|
| Retry forever | Wastes resources, masks permanent failures | Max retries + DLQ |
| Retry without backoff | Hammers already-struggling systems | Exponential backoff with jitter |
| Swallow errors silently | Nobody knows the integration is broken | Log, alert, DLQ |
| Single retry for all errors | 400 Bad Request will never succeed with retry | Classify errors, only retry transient |
| No idempotency | Duplicate processing on retry | Idempotency keys or upsert |
| Manual-only error recovery | Does not scale, human bottleneck | Automated reprocessing with manual review for edge cases |
Related Content
- Risk Management — integration failures are a top risk category; error handling patterns feed directly into risk registers and mitigation plans
- Data Quality & Governance — data quality errors are a major category of integration failures; governance processes prevent bad data from propagating
- Review Board Presentation & Q&A — judges frequently ask “what happens when this fails?” — prepare error handling explanations for every integration
Sources
- Salesforce Integration Patterns: Error Handling
- MuleSoft: Error Handling Best Practices
- Martin Fowler, “Circuit Breaker Pattern” (martinfowler.com)
- Michael Nygard, “Release It! Design and Deploy Production-Ready Software”
- AWS: Exponential Backoff and Jitter
- CTA Study Group notes on integration error handling scenarios