Error Handling Patterns
Error handling separates architects who design for the happy path from those who design for reality. External systems go down. Networks fail. Data arrives malformed. Rate limits get exceeded. These patterns cover what every CTA must know to handle failures properly.
CTA board expectations
When you present an integration architecture, the board WILL ask: “What happens when this fails?” If your answer is “we retry” without specifics on strategy, limits, and fallback behavior, you will lose points. Be specific about retry count, backoff strategy, dead letter handling, and monitoring.
Error Categories
Classify the error before choosing a pattern. Different error types demand different responses.
| Category | Examples | Correct Response | Wrong Response |
|---|---|---|---|
| Transient | Network timeout, 503 Service Unavailable, rate limit (429) | Retry with backoff | Fail immediately |
| Persistent | 404 Not Found, 400 Bad Request, invalid data | Route to dead letter queue, alert | Retry infinitely |
| Systemic | External system fully down, certificate expired | Circuit breaker, fallback | Keep retrying (wastes resources) |
| Data quality | Missing required fields, invalid format, duplicates | Reject and notify, data cleansing | Silently drop or force through |
| Capacity | Bulk API daily limit reached, governor limits | Queue and defer, throttle | Fail the entire batch |
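The classification step in the table above can be sketched as a small routing function. This is a minimal illustration, not a definitive implementation: the names `Category` and `classify`, the flag parameters, and the default threshold of five consecutive failures are assumptions made for the example, and a real integration would inspect richer error details than an HTTP status.

```python
from enum import Enum
from typing import Optional


class Category(Enum):
    # Each value is the "Correct Response" from the table above.
    TRANSIENT = "retry with backoff"
    PERSISTENT = "route to dead letter queue, alert"
    SYSTEMIC = "circuit breaker, fallback"
    DATA_QUALITY = "reject and notify"
    CAPACITY = "queue and defer"


def classify(
    status: Optional[int],
    *,
    validation_failed: bool = False,
    daily_limit_reached: bool = False,
    consecutive_failures: int = 0,
    systemic_threshold: int = 5,  # assumed value for illustration
) -> Category:
    """Pick an error category; `status=None` stands for a network timeout."""
    if validation_failed:
        return Category.DATA_QUALITY
    if daily_limit_reached:
        return Category.CAPACITY
    if consecutive_failures >= systemic_threshold:
        return Category.SYSTEMIC
    if status is None or status in (429, 503):
        return Category.TRANSIENT
    # Simplification: everything else (400, 404, ...) is treated as persistent.
    return Category.PERSISTENT
```

For example, `classify(503)` maps to the transient category, while the same 503 seen after five consecutive failures is treated as systemic.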
Pattern 1: Retry with Exponential Backoff
Retry failed operations with increasing delays between attempts. The foundation of transient error handling.
How It Works
Implementation Parameters
| Parameter | Recommended Value | Rationale |
|---|---|---|
| Max retries | 3-5 | Enough for transient issues, not so many that persistent failures waste time |
| Base delay | 1 second | Starting delay before first retry |
| Max delay | 60 seconds | Cap to prevent excessively long waits |
| Backoff multiplier | 2x (exponential) | 1s, 2s, 4s, 8s, 16s… |
| Jitter | Random 0-1s added | Prevents thundering herd when many clients retry simultaneously |
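A minimal sketch of a retry loop that uses the parameters above. `TransientError` and `call_with_retry` are hypothetical names introduced for illustration; the `sleep` parameter exists only so the waiting can be swapped out in tests.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for retryable failures (timeout, 503, 429)."""


def call_with_retry(
    operation: Callable[[], T],
    max_retries: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    multiplier: float = 2.0,
    jitter: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, retrying transient failures with exponential backoff."""
    retries = 0
    while True:
        try:
            return operation()
        except TransientError:
            if retries >= max_retries:
                raise  # retries exhausted; the caller decides what happens next
            # 1s, 2s, 4s, ... capped at max_delay, plus 0..jitter seconds.
            delay = min(base_delay * multiplier ** retries, max_delay)
            sleep(delay + random.uniform(0.0, jitter))
            retries += 1
```

Only `TransientError` is caught, so persistent errors such as a 400 propagate immediately instead of being retried.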
Retry Timing Example
| Attempt | Delay (no jitter) | Delay (with jitter) |
|---|---|---|
| 1 | 1 second | 1.0 - 2.0 seconds |
| 2 | 2 seconds | 2.0 - 3.0 seconds |
| 3 | 4 seconds | 4.0 - 5.0 seconds |
| 4 | 8 seconds | 8.0 - 9.0 seconds |
| 5 | 16 seconds | 16.0 - 17.0 seconds |
| Total | ~31 seconds | ~31 - 36 seconds |
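The arithmetic behind the timing table can be reproduced in a few lines. The function name `schedule` is an assumption for this sketch; the defaults mirror the recommended parameters.

```python
def schedule(attempts=5, base=1.0, multiplier=2.0, max_delay=60.0, jitter=1.0):
    """Return (attempt, shortest delay, longest delay) for each retry."""
    rows = []
    for attempt in range(1, attempts + 1):
        delay = min(base * multiplier ** (attempt - 1), max_delay)
        rows.append((attempt, delay, delay + jitter))
    return rows


rows = schedule()
total_without_jitter = sum(low for _, low, _ in rows)       # 1 + 2 + 4 + 8 + 16
total_with_max_jitter = sum(high for _, _, high in rows)    # one extra second per retry
print(rows)
print(total_without_jitter, total_with_max_jitter)
```

The totals come out to 31 seconds without jitter and at most 36 seconds with it, matching the last row of the table.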
Jitter prevents thundering herd
Without jitter, 100 clients failing at the same time all retry at exactly the same intervals, creating repeated spikes. Random jitter spreads retries across time. This is especially important for Salesforce integrations, where concurrent API limits are shared across the org.
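The effect can be illustrated with a small simulation of 100 clients that all fail at the same instant. This is only a sketch: the function names and the 100 ms bucket size are assumptions chosen for the example.

```python
import random


def first_retry_times(clients: int, jitter: float, seed: int = 42) -> list:
    """Seconds after a shared failure at which each client first retries."""
    rng = random.Random(seed)
    base_delay = 1.0
    return [base_delay + rng.uniform(0.0, jitter) for _ in range(clients)]


def peak_per_100ms(times) -> int:
    """Largest number of retries landing in any single 100 ms window."""
    buckets = {}
    for t in times:
        bucket = int(t * 10)
        buckets[bucket] = buckets.get(bucket, 0) + 1
    return max(buckets.values())


no_jitter_peak = peak_per_100ms(first_retry_times(100, jitter=0.0))
jitter_peak = peak_per_100ms(first_retry_times(100, jitter=1.0))
print(no_jitter_peak, jitter_peak)
```

Without jitter, all 100 retries land in the same window; with up to one second of jitter they spread across roughly ten windows.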
Pattern 2: Circuit Breaker
A circuit breaker stops a system from repeatedly calling an external service that is known to be down. It is modeled after electrical circuit breakers.
States
Implementation in Salesforce
| State | Behavior | Salesforce Implementation |
|---|---|---|
| Closed | Normal operation, calls pass through | Standard callout behavior |
| Open | All calls fail immediately, no callout attempted | Check Custom Metadata / Platform Cache before callout |
| Half-Open | Allow one test call to check if service recovered | Scheduled job or manual reset attempts one call |
Configuration Parameters
| Parameter | Recommended | Purpose |
|---|---|---|
| Failure threshold | 5 consecutive failures | Number of failures before opening circuit |
| Open timeout | 30-60 seconds | How long to wait before testing recovery |
| Success threshold | 2-3 successes in half-open | Successes needed to close circuit |
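The three states and the configuration parameters above can be sketched as a small in-memory state machine. This is an illustration under stated assumptions, not a Salesforce implementation: the class and method names are hypothetical, state lives in process memory rather than in Platform Cache or Custom Settings, and the `clock` parameter exists so time can be controlled in tests.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, open_timeout=60.0,
                 success_threshold=2, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_timeout = open_timeout
        self.success_threshold = success_threshold
        self._clock = clock
        self.state = "CLOSED"
        self._failures = 0
        self._successes = 0
        self._opened_at = 0.0

    def call(self, operation):
        if self.state == "OPEN":
            if self._clock() - self._opened_at < self.open_timeout:
                raise CircuitOpenError("failing fast; no callout attempted")
            # Timeout elapsed: let test calls through.
            self.state = "HALF_OPEN"
            self._successes = 0
        try:
            result = operation()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self._failures += 1
        # Any failure while half-open reopens the circuit immediately.
        if self.state == "HALF_OPEN" or self._failures >= self.failure_threshold:
            self.state = "OPEN"
            self._opened_at = self._clock()
            self._failures = 0

    def _on_success(self):
        if self.state == "HALF_OPEN":
            self._successes += 1
            if self._successes >= self.success_threshold:
                self.state = "CLOSED"
        self._failures = 0  # a success breaks the run of consecutive failures
```

A caller wraps each callout as `breaker.call(lambda: send(order))` and treats `CircuitOpenError` as a signal to fall back or queue the message.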
Salesforce implementation options
Salesforce has no native circuit breaker. Implement using: (1) Custom Metadata Type to store circuit state, (2) Platform Cache for fast state checks, or (3) Custom Settings with hierarchical access. Platform Cache is fastest but non-durable. Custom Metadata requires deployment. Custom Settings are a middle ground.
Pattern 3: Dead Letter Queue (DLQ)
Messages that cannot be processed after all retries go to a dead letter queue for inspection, reprocessing, or alerting.
Flow
Salesforce DLQ Options
| Approach | Best For | Persistence |
|---|---|---|
| Custom Object (Integration_Error__c) | Full audit trail, reporting | Permanent (until deleted) |
| Platform Events (Error_Event__e) | Real-time alerting | 24-72 hours |
| Big Object | High-volume error logging | Permanent, archive-oriented |
| Middleware DLQ (MuleSoft/Anypoint MQ) | Middleware-managed integrations | Configurable retention |
| External monitoring (Splunk, Datadog) | Centralized ops monitoring | Per tool retention |
DLQ Record Design
A well-designed DLQ record captures everything needed for diagnosis and reprocessing:
| Field | Purpose |
|---|---|
| Source System | Where the message originated |
| Target System | Where it was being sent |
| Payload | The original message content |
| Error Message | What went wrong |
| Error Code | HTTP status, exception type |
| Retry Count | How many attempts were made |
| First Failure Timestamp | When it first failed |
| Last Failure Timestamp | When retries were exhausted |
| Status | New / Under Review / Resubmitted / Archived |
| Correlation ID | Links to the original transaction |
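As a rough sketch, the fields above map onto a record type like the following. The class names and the sample values are hypothetical and exist only to show the shape; in Salesforce the equivalent would be fields on a custom object such as Integration_Error__c.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum


class Status(Enum):
    NEW = "New"
    UNDER_REVIEW = "Under Review"
    RESUBMITTED = "Resubmitted"
    ARCHIVED = "Archived"


@dataclass
class DeadLetter:
    source_system: str
    target_system: str
    payload: str             # original message, kept verbatim for replay
    error_message: str
    error_code: str          # HTTP status or exception type
    retry_count: int
    first_failure: datetime
    last_failure: datetime
    correlation_id: str      # links back to the original transaction
    status: Status = Status.NEW

    def mark_resubmitted(self) -> None:
        """Record that the payload was replayed to the target system."""
        self.status = Status.RESUBMITTED


# Illustrative values only.
record = DeadLetter(
    source_system="Salesforce",
    target_system="ERP",
    payload='{"orderNumber": "SO-1001"}',
    error_message="Service Unavailable",
    error_code="503",
    retry_count=5,
    first_failure=datetime(2025, 1, 1, 12, 0, 0, tzinfo=timezone.utc),
    last_failure=datetime(2025, 1, 1, 12, 0, 31, tzinfo=timezone.utc),
    correlation_id="txn-0001",
)
```

Storing the payload verbatim alongside the correlation ID is what makes resubmission possible without going back to the source system.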
Pattern 4: Idempotency
Processing the same message multiple times must produce the same result. Mandatory for any at-least-once delivery system.
Why It Matters
Platform Events, CDC, and most middleware deliver at-least-once. Duplicates will happen because of:
- Network retries at the transport layer
- Subscriber reconnection replaying events
- Middleware retry on ambiguous failures
- Bulk API partial retries
Implementation Strategies
| Strategy | How It Works | Pros | Cons |
|---|---|---|---|
| Idempotency key | Client sends unique key; server checks before processing | Most reliable | Requires key storage and lookup |
| Natural key dedup | Use business key (Order Number) to detect duplicates | No extra infrastructure | Requires unique business key |
| Upsert operations | Use External ID for upsert instead of insert | Built into Salesforce | Only works for CRUD, not business logic |
| Payload hash | Hash the message content, check for duplicate hashes | Works without client changes | Hash collisions (rare): different messages may produce the same hash |
| Timestamp comparison | Only process if timestamp is newer than last processed | Simple | Clock skew issues |
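The idempotency key strategy from the table can be sketched as a thin wrapper around a message handler. `IdempotentProcessor` is a hypothetical name, and the in-memory dictionary stands in for the durable key store a real system would need.

```python
from typing import Any, Callable, Dict


class IdempotentProcessor:
    """Run a handler at most once per idempotency key."""

    def __init__(self, handler: Callable[[Any], Any]) -> None:
        self._handler = handler
        # Assumption: in-memory stand-in for a durable key store.
        self._results: Dict[str, Any] = {}

    def process(self, key: str, message: Any) -> Any:
        if key in self._results:
            # Duplicate delivery: return the stored result, do no new work.
            return self._results[key]
        result = self._handler(message)
        self._results[key] = result
        return result


created_orders = []


def create_order(message):
    created_orders.append(message["orderNumber"])
    return "created " + message["orderNumber"]


processor = IdempotentProcessor(create_order)
first = processor.process("key-1", {"orderNumber": "SO-1001"})
second = processor.process("key-1", {"orderNumber": "SO-1001"})  # redelivery
```

Both calls return the same result, and the order is created only once. The key lookup and the write are not atomic here, so concurrent consumers would need a store with a unique constraint or similar guard.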
Salesforce upsert is your friend
For data synchronization, always use upsert with an External ID field rather than separate insert/update logic. Upsert is naturally idempotent: sending the same record twice produces the same result. This single recommendation prevents a large class of integration bugs.
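Why upsert is idempotent can be seen in a toy model where a dictionary keyed by External ID stands in for the target object. This is a sketch of the semantics only, not the Salesforce API; the function name and field names are assumptions.

```python
from typing import Dict


def upsert(table: Dict[str, dict], external_id: str, fields: dict) -> str:
    """Insert the record if the External ID is new, otherwise update it."""
    if external_id in table:
        table[external_id].update(fields)
        return "updated"
    table[external_id] = dict(fields)
    return "created"


accounts: Dict[str, dict] = {}
upsert(accounts, "ERP-42", {"Name": "Acme", "Tier": "Gold"})
snapshot = {key: dict(value) for key, value in accounts.items()}
upsert(accounts, "ERP-42", {"Name": "Acme", "Tier": "Gold"})  # duplicate delivery
```

After the duplicate delivery, `accounts` is identical to `snapshot`: still one record with the same field values, whereas a plain insert would have created a second record.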
Pattern 5: Monitoring and Alerting
Error handling without monitoring is a fire alarm with no sound. Failures must be detected and addressed before they create business impact.
Monitoring Architecture
What to Monitor
| Metric | Threshold | Alert Level |
|---|---|---|
| Integration failure rate | > 5% of transactions | Warning |
| Integration failure rate | > 20% of transactions | Critical |
| DLQ depth | > 100 messages | Warning |
| DLQ depth growing | Increasing for 30+ minutes | Critical |
| API call consumption | > 80% of daily limit | Warning |
| API call consumption | > 95% of daily limit | Critical |
| Average response time | > 5 seconds (for real-time) | Warning |
| Circuit breaker open | Any circuit open | Critical |
| Event subscriber lag | > 1 hour behind | Warning |
| Event subscriber lag | > 12 hours behind | Critical (approaching retention limit) |
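Several of the thresholds above follow the same warning/critical shape, so they can be evaluated from a small lookup table. The metric keys and the function name are hypothetical; rates are expressed as fractions and lag in hours.

```python
from typing import Optional

# (warning threshold, critical threshold), taken from the table above.
THRESHOLDS = {
    "failure_rate": (0.05, 0.20),
    "api_consumption": (0.80, 0.95),
    "subscriber_lag_hours": (1.0, 12.0),
}


def alert_level(metric: str, value: float) -> Optional[str]:
    """Return "Critical", "Warning", or None for a metric reading."""
    warning, critical = THRESHOLDS[metric]
    if value > critical:
        return "Critical"
    if value > warning:
        return "Warning"
    return None
```

For instance, a failure rate of 7% yields a warning and 25% yields a critical alert. Trend-based rules such as "DLQ depth growing for 30+ minutes" need a time series and are not covered by this sketch.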
Salesforce-Native Monitoring Options
| Tool | What It Monitors | Cost |
|---|---|---|
| Event Monitoring | API calls, logins, report exports | Shield add-on |
| Custom Dashboard | Integration_Error__c records | Included |
| Flow Email Alerts | Trigger on error records | Included |
| Platform Events | Real-time error broadcasting | Included |
| CRM Analytics (formerly Einstein Analytics) | Trend analysis on error patterns | Add-on |
Combining Patterns: The Complete Error Handling Stack
In a CTA scenario, present a layered error handling strategy, not just a single pattern.
CTA presentation strategy
When presenting error handling at the review board, walk through a specific failure scenario end-to-end: “When the ERP is unavailable, the order event goes to the retry queue with exponential backoff. After 5 retries over 30 seconds, the circuit breaker opens. Subsequent calls fail fast. The failed message routes to the DLQ, PagerDuty alerts the integration team, and a Jira ticket is auto-created. The integration team reviews the DLQ dashboard, and once the ERP is back, they resubmit from the DLQ.”
End-to-End Failure Scenario: ERP Goes Down
This sequence diagram shows how the patterns work together when the ERP becomes unavailable during order processing. This type of walkthrough scores well at the CTA board.
Detailed walkthrough
This sequence has five distinct phases. Reading it as a runtime narrative rather than an architecture diagram is exactly how you should present it to the review board.
Phase 1: Normal handoff. Salesforce fires a Platform Event when an order is submitted. Middleware receives it and immediately checks circuit state. The circuit breaker returns CLOSED, meaning the ERP is considered healthy. Middleware makes its first POST to /orders. The ERP returns a 503.
Phase 2: Retry with exponential backoff. The 503 is a retryable error (transient, server-side). Middleware waits one second and tries again. Another 503. It waits two seconds and tries a third time. Another 503. The backoff interval doubles between attempts (1s, 2s) deliberately. A recovering ERP under load needs breathing room. If every failing client retries at identical intervals, the recovered system receives a traffic spike at the exact moment it is trying to stabilize, which can re-collapse it. The increasing wait distributes pressure. Three attempts is enough to distinguish a brief self-correcting flap from a genuine outage.
Phase 3: Circuit trips. After the third failure, middleware reports to the circuit breaker state store. This scenario sets the failure threshold at three consecutive failures, tighter than the five listed under Configuration Parameters, so the threshold is met and the breaker flips from CLOSED to OPEN. Two things happen simultaneously: the failed order routes to the DLQ, and operations gets a PagerDuty alert plus an auto-created Jira ticket for traceability. Any subsequent order events that arrive while the circuit is OPEN fail fast without touching the ERP. This stops a broken integration from wasting resources and amplifying load on an already-struggling system.
Phase 4: Half-open probe. After 60 seconds, the circuit moves to HALF-OPEN. One test call goes out to the ERP. If it succeeds, the breaker resets to CLOSED. If it fails, it snaps back to OPEN and the cooldown restarts. No bulk traffic crosses until the single probe succeeds.
Phase 5: Recovery and replay. The ERP returns 200 OK on the probe. Circuit closes. Operations receives a recovery notification and triggers a bulk DLQ resubmit. The queued orders replay through middleware to the ERP in sequence. Every order that arrived during the outage is eventually delivered, with a complete audit trail from original Platform Event timestamp through successful resubmit.
The zero-data-loss guarantee comes from the DLQ, not from the retry mechanism. Retries handle transient glitches. The DLQ handles the cases retries cannot resolve. Together they are why this pattern scores well at the board.
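The five phases can be condensed into a toy simulation. Everything here is a simplification under stated assumptions: the `Middleware` class and its method names are hypothetical, backoff waits are omitted, orders that arrive while the circuit is OPEN are assumed to go to the DLQ, and the first replayed order doubles as the half-open probe rather than a separate test call.

```python
class Middleware:
    def __init__(self, send, max_attempts=3, cooldown=60.0):
        self.send = send                # hypothetical ERP call returning an HTTP status
        self.max_attempts = max_attempts
        self.cooldown = cooldown
        self.state = "CLOSED"
        self.opened_at = 0.0
        self.dlq = []
        self.alerts = []

    def handle(self, order, now):
        if self.state == "OPEN":
            if now - self.opened_at < self.cooldown:
                self.dlq.append(order)  # fail fast: no ERP call
                return "dead-lettered"
            self.state = "HALF_OPEN"    # cooldown elapsed: allow one probe
        attempts = 1 if self.state == "HALF_OPEN" else self.max_attempts
        for _ in range(attempts):
            if self.send(order) == 200:
                if self.state == "HALF_OPEN":
                    self.alerts.append("recovered")
                self.state = "CLOSED"
                return "delivered"
            # The 1s and 2s backoff waits would happen here.
        self.state = "OPEN"
        self.opened_at = now
        self.dlq.append(order)
        self.alerts.append("circuit open")
        return "dead-lettered"

    def replay(self, now):
        """Resubmit everything in the DLQ, in arrival order."""
        pending, self.dlq = self.dlq, []
        return [self.handle(order, now) for order in pending]


erp_up = False
delivered = []


def send_to_erp(order):
    if erp_up:
        delivered.append(order)
        return 200
    return 503


mw = Middleware(send_to_erp)
mw.handle("order-1", now=0.0)    # phases 1-3: three 503s, circuit opens
mw.handle("order-2", now=10.0)   # arrives while OPEN: fails fast into the DLQ
erp_up = True
results = mw.replay(now=61.0)    # phases 4-5: probe succeeds, DLQ drains
```

After the replay, both orders have reached the ERP in their original sequence, the DLQ is empty, and the circuit is CLOSED again.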
Anti-Patterns
| Anti-Pattern | Why It Fails | Better Approach |
|---|---|---|
| Retry forever | Wastes resources, masks permanent failures | Max retries + DLQ |
| Retry without backoff | Hammers already-struggling systems | Exponential backoff with jitter |
| Swallow errors silently | Nobody knows the integration is broken | Log, alert, DLQ |
| Single retry for all errors | 400 Bad Request will never succeed with retry | Classify errors, only retry transient |
| No idempotency | Duplicate processing on retry | Idempotency keys or upsert |
| Manual-only error recovery | Does not scale, creates a human bottleneck | Automated reprocessing with manual review for edge cases |
Related Topics
- Risk Management: integration failures are a top risk category; error handling feeds directly into risk registers
- Data Quality & Governance: data quality errors are a major category of integration failures; governance prevents bad data from propagating
- Review Board Presentation & Q&A: judges ask “what happens when this fails?” on every integration. Prepare error handling explanations.
Sources
- Salesforce Integration Patterns: Error Handling
- MuleSoft: Error Handling Best Practices
- Martin Fowler, “Circuit Breaker Pattern”
- Michael Nygard, “Release It! Design and Deploy Production-Ready Software”
- AWS: Exponential Backoff and Jitter
- CTA Study Group notes on integration error handling scenarios
Personal study notes for the Salesforce CTA exam. Content compiled from VJ's study notes, official Salesforce documentation, community sources, and online publicly available content, then organized and presented with AI assistance. Not affiliated with Salesforce. © 2025–2026 VJ Srivastava.