
Error Handling Patterns

Error handling separates architects who design for the happy path from those who design for reality. External systems go down. Networks fail. Data arrives malformed. Rate limits get exceeded. These patterns cover what every CTA must know to handle failures properly.

CTA board expectations

When you present an integration architecture, the board WILL ask: “What happens when this fails?” If your answer is “we retry” without specifics on strategy, limits, and fallback behavior, you will lose points. Be specific about retry count, backoff strategy, dead letter handling, and monitoring.


Error Categories

Classify the error before choosing a pattern. Different error types demand different responses.

| Category | Examples | Correct Response | Wrong Response |
| --- | --- | --- | --- |
| Transient | Network timeout, 503 Service Unavailable, rate limit (429) | Retry with backoff | Fail immediately |
| Persistent | 404 Not Found, 400 Bad Request, invalid data | Route to dead letter queue, alert | Retry infinitely |
| Systemic | External system fully down, certificate expired | Circuit breaker, fallback | Keep retrying (wastes resources) |
| Data quality | Missing required fields, invalid format, duplicates | Reject and notify, data cleansing | Silently drop or force through |
| Capacity | Bulk API daily limit reached, governor limits | Queue and defer, throttle | Fail the entire batch |
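
The classification step can be sketched as a small function that maps a failure to one of the five categories and then to the response from the table. This is a minimal illustration, not a Salesforce API: the flag names (`timed_out`, `validation_failed`, `daily_limit_hit`, `target_down`) and the response labels are assumptions made for the example.

```python
from enum import Enum
from typing import Optional


class ErrorCategory(Enum):
    TRANSIENT = "transient"
    PERSISTENT = "persistent"
    SYSTEMIC = "systemic"
    DATA_QUALITY = "data_quality"
    CAPACITY = "capacity"


# Hypothetical labels for the "Correct Response" column of the table.
RESPONSES = {
    ErrorCategory.TRANSIENT: "retry_with_backoff",
    ErrorCategory.PERSISTENT: "dead_letter_and_alert",
    ErrorCategory.SYSTEMIC: "circuit_breaker_and_fallback",
    ErrorCategory.DATA_QUALITY: "reject_and_notify",
    ErrorCategory.CAPACITY: "queue_and_defer",
}


def classify(status: Optional[int] = None, *, timed_out: bool = False,
             validation_failed: bool = False, daily_limit_hit: bool = False,
             target_down: bool = False) -> ErrorCategory:
    """Map a failure to a category; the keyword flags are illustrative inputs."""
    if validation_failed:
        return ErrorCategory.DATA_QUALITY
    if daily_limit_hit:
        return ErrorCategory.CAPACITY
    if target_down:
        return ErrorCategory.SYSTEMIC
    # Timeouts, 429 rate limits, and 5xx responses are treated as transient.
    if timed_out or status == 429 or (status is not None and status >= 500):
        return ErrorCategory.TRANSIENT
    # Anything else (for example 400 or 404) will not succeed on retry.
    return ErrorCategory.PERSISTENT
```

Looking up `RESPONSES[classify(503)]` selects the retry path, while `classify(404)` falls through to the dead letter path.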

Pattern 1: Retry with Exponential Backoff

Retry failed operations with increasing delays between attempts. The foundation of transient error handling.

How It Works

Failed API calls classified as retryable get exponential backoff with jitter until max retries; non-retryable or exhausted calls route to the dead letter queue.
Figure 1. Exponential backoff doubles the wait between each retry attempt while jitter randomizes exact timing to prevent synchronized retry spikes. Only 5xx, timeout, and 429 errors are retryable. All other 4xx client errors route directly to the dead letter queue.
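
The flow in Figure 1 can be sketched as a retry loop. This is a minimal sketch under stated assumptions: `CallFailed` is a hypothetical exception type, the dead letter queue is a plain list, and `sleep` is injectable so the loop can be exercised without real waiting.

```python
import random
import time
from typing import Callable, List, Optional


class CallFailed(Exception):
    """Hypothetical exception carrying the outcome of a failed call."""

    def __init__(self, status: Optional[int] = None, timed_out: bool = False):
        super().__init__(f"status={status}, timed_out={timed_out}")
        self.status = status
        self.timed_out = timed_out


def is_retryable(err: CallFailed) -> bool:
    """Timeouts, 429, and 5xx are retryable; every other status is not."""
    if err.timed_out:
        return True
    return err.status is not None and (err.status == 429 or 500 <= err.status <= 599)


def call_with_retry(operation: Callable[[], object], dead_letter: List[dict],
                    max_retries: int = 5, base_delay: float = 1.0,
                    max_delay: float = 60.0,
                    sleep: Callable[[float], None] = time.sleep):
    """Run `operation`; retry transient failures, dead-letter everything else."""
    retries = 0
    while True:
        try:
            return operation()
        except CallFailed as err:
            if not is_retryable(err) or retries >= max_retries:
                # Non-retryable or exhausted: record the failure for review.
                dead_letter.append({"status": err.status, "retries": retries})
                return None
            # Exponential delay capped at max_delay, plus 0-1s of jitter.
            delay = min(base_delay * 2 ** retries, max_delay) + random.uniform(0.0, 1.0)
            sleep(delay)
            retries += 1
```

Returning `None` on failure keeps the sketch short; a real integration would more likely raise or return a result object so that a legitimate `None` response is not confused with a dead-lettered call.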

Implementation Parameters

| Parameter | Recommended Value | Rationale |
| --- | --- | --- |
| Max retries | 3-5 | Enough for transient issues, not so many that persistent failures waste time |
| Base delay | 1 second | Starting delay before first retry |
| Max delay | 60 seconds | Cap to prevent excessively long waits |
| Backoff multiplier | 2x (exponential) | 1s, 2s, 4s, 8s, 16s… |
| Jitter | Random 0-1s added | Prevents thundering herd when many clients retry simultaneously |

Retry Timing Example

| Attempt | Delay (no jitter) | Delay (with jitter) |
| --- | --- | --- |
| 1 | 1 second | 1.0 - 2.0 seconds |
| 2 | 2 seconds | 2.0 - 3.0 seconds |
| 3 | 4 seconds | 4.0 - 5.0 seconds |
| 4 | 8 seconds | 8.0 - 9.0 seconds |
| 5 | 16 seconds | 16.0 - 17.0 seconds |
| Total | ~31 seconds | ~31 - 36 seconds |
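
The timing table follows from one formula: delay = min(base × multiplier^(attempt − 1), max delay) + jitter. A minimal sketch of that calculation is below; the function name and the choice to apply the cap before adding jitter are assumptions for illustration.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base: float = 1.0, multiplier: float = 2.0,
                  max_delay: float = 60.0, jitter: float = 1.0,
                  rng: Optional[random.Random] = None) -> float:
    """Seconds to wait before retry number `attempt` (1-based)."""
    source = rng if rng is not None else random
    # Exponential growth, capped so late attempts never wait longer than max_delay.
    exponential = min(base * multiplier ** (attempt - 1), max_delay)
    # Uniform jitter in [0, jitter] spreads out clients that failed together.
    return exponential + source.uniform(0.0, jitter)


# Reproduce the "no jitter" column of the table.
schedule = [backoff_delay(n, jitter=0.0) for n in range(1, 6)]
print(schedule, sum(schedule))
```

With jitter disabled, the five delays sum to 31 seconds. With up to one second of jitter per attempt, the total lands between 31 and 36 seconds.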

Jitter prevents thundering herd

Without jitter, 100 clients failing at the same time all retry at exactly the same intervals, creating repeated spikes. Random jitter spreads retries across time. This is especially important for Salesforce integrations, where concurrent API limits are shared across the org.


Pattern 2: Circuit Breaker

Stops a system from repeatedly calling an external service that is known to be down. Modeled after electrical circuit breakers.

States

Three-state machine transitions from Closed to Open on failure threshold, to HalfOpen after timeout, then back to Closed on test success or Open on test failure.
Figure 2. The circuit breaker prevents cascading failures by stopping calls to a known-down system and failing fast instead. The HalfOpen probe allows automatic recovery without manual intervention once the external system comes back online.

Implementation in Salesforce

| State | Behavior | Salesforce Implementation |
| --- | --- | --- |
| Closed | Normal operation, calls pass through | Standard callout behavior |
| Open | All calls fail immediately, no callout attempted | Check Custom Metadata / Platform Cache before callout |
| Half-Open | Allow one test call to check if service recovered | Scheduled job or manual reset attempts one call |

Configuration Parameters

| Parameter | Recommended | Purpose |
| --- | --- | --- |
| Failure threshold | 5 consecutive failures | Number of failures before opening circuit |
| Open timeout | 30-60 seconds | How long to wait before testing recovery |
| Success threshold | 2-3 successes in half-open | Successes needed to close circuit |
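
The three states and the parameters above can be sketched as a small in-memory state machine. This is a language-neutral illustration in Python, not Apex. In Salesforce the state would live in Platform Cache, Custom Metadata, or Custom Settings, and the class and method names here are assumptions.

```python
import time
from typing import Callable


class CircuitBreaker:
    """In-memory sketch of the Closed / Open / Half-Open state machine."""

    def __init__(self, failure_threshold: int = 5, open_timeout: float = 60.0,
                 success_threshold: int = 2,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_timeout = open_timeout
        self.success_threshold = success_threshold
        self._clock = clock
        self.state = "CLOSED"
        self._failures = 0
        self._successes = 0
        self._opened_at = 0.0

    def allow_request(self) -> bool:
        """Check before every callout; False means fail fast."""
        if self.state == "OPEN":
            if self._clock() - self._opened_at >= self.open_timeout:
                # Cooldown elapsed: let probe traffic through.
                self.state = "HALF_OPEN"
                self._successes = 0
                return True
            return False
        # Simplification: HALF_OPEN does not limit concurrent probes here.
        return True

    def record_success(self) -> None:
        if self.state == "HALF_OPEN":
            self._successes += 1
            if self._successes >= self.success_threshold:
                self.state = "CLOSED"
                self._failures = 0
        else:
            self._failures = 0  # only consecutive failures count

    def record_failure(self) -> None:
        if self.state == "HALF_OPEN":
            self._trip()  # failed probe: back to OPEN, cooldown restarts
            return
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._trip()

    def _trip(self) -> None:
        self.state = "OPEN"
        self._opened_at = self._clock()
        self._failures = 0
```

A caller wraps each callout: skip it when `allow_request()` returns `False`, otherwise make the call and report the outcome with `record_success()` or `record_failure()`.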

Salesforce implementation options

Salesforce has no native circuit breaker. Implement using: (1) Custom Metadata Type to store circuit state, (2) Platform Cache for fast state checks, or (3) Custom Settings with hierarchical access. Platform Cache is fastest but non-durable. Custom Metadata requires deployment. Custom Settings are a middle ground.


Pattern 3: Dead Letter Queue (DLQ)

Messages that cannot be processed after all retries go to a dead letter queue for inspection, reprocessing, or alerting.

Flow

Messages that exhaust retries route to a dead letter queue, triggering operations alerts and a manual review dashboard where fixable messages are corrected and resubmitted.
Figure 3. The dead letter queue captures messages that failed all retry attempts, preserving them for human review and resubmission. Without a DLQ, failed messages are silently lost and the integration appears to run while data quietly fails to transfer.

Salesforce DLQ Options

| Approach | Best For | Persistence |
| --- | --- | --- |
| Custom Object (Integration_Error__c) | Full audit trail, reporting | Permanent (until deleted) |
| Platform Events (Error_Event__e) | Real-time alerting | 24-72 hours |
| Big Object | High-volume error logging | Permanent, archive-oriented |
| Middleware DLQ (MuleSoft/Anypoint MQ) | Middleware-managed integrations | Configurable retention |
| External monitoring (Splunk, Datadog) | Centralized ops monitoring | Per tool retention |

DLQ Record Design

A well-designed DLQ record captures everything needed for diagnosis and reprocessing:

| Field | Purpose |
| --- | --- |
| Source System | Where the message originated |
| Target System | Where it was being sent |
| Payload | The original message content |
| Error Message | What went wrong |
| Error Code | HTTP status, exception type |
| Retry Count | How many attempts were made |
| First Failure Timestamp | When it first failed |
| Last Failure Timestamp | When retries were exhausted |
| Status | New / Under Review / Resubmitted / Archived |
| Correlation ID | Links to the original transaction |
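
As a data structure, the record above might look like the following sketch. The Python field names are illustrative stand-ins for fields on a custom object such as Integration_Error__c, and the `record_failure` helper is an assumption added to show how the retry count and timestamps move together.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class DlqStatus(Enum):
    NEW = "New"
    UNDER_REVIEW = "Under Review"
    RESUBMITTED = "Resubmitted"
    ARCHIVED = "Archived"


@dataclass
class DlqRecord:
    source_system: str
    target_system: str
    payload: str               # original message content, kept for reprocessing
    error_message: str
    error_code: str            # HTTP status or exception type
    correlation_id: str        # links back to the original transaction
    first_failure_at: datetime
    last_failure_at: datetime
    retry_count: int = 0
    status: DlqStatus = DlqStatus.NEW

    def record_failure(self, when: datetime, error_message: str, error_code: str) -> None:
        """Update the record after another failed attempt."""
        self.retry_count += 1
        self.last_failure_at = when
        self.error_message = error_message
        self.error_code = error_code
```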

Pattern 4: Idempotency

Processing the same message multiple times must produce the same result. Mandatory for any at-least-once delivery system.

Why It Matters

Platform Events, CDC, and most middleware deliver at-least-once. Duplicates will happen because of:

  • Network retries at the transport layer
  • Subscriber reconnection replaying events
  • Middleware retry on ambiguous failures
  • Bulk API partial retries

Implementation Strategies

Incoming messages are checked against stored idempotency keys; already-processed messages return cached results while new messages process and store their key.
Figure 4. Idempotency key checks prevent duplicate processing when at-least-once delivery systems (Platform Events, CDC, middleware retries) deliver the same message more than once. Keys generated from natural business identifiers are more reliable than payload hashes.

| Strategy | How It Works | Pros | Cons |
| --- | --- | --- | --- |
| Idempotency key | Client sends unique key; server checks before processing | Most reliable | Requires key storage and lookup |
| Natural key dedup | Use business key (Order Number) to detect duplicates | No extra infrastructure | Requires unique business key |
| Upsert operations | Use External ID for upsert instead of insert | Built into Salesforce | Only works for CRUD, not business logic |
| Payload hash | Hash the message content, check for duplicate hashes | Works without client changes | Hash collisions (rare): different messages may produce the same hash |
| Timestamp comparison | Only process if timestamp is newer than last processed | Simple | Clock skew issues |
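
The idempotency key strategy can be sketched as a wrapper around the message handler. This is a minimal illustration: the in-memory dictionary stands in for a durable key store, the class name is an assumption, and concurrent delivery of the same key is not handled.

```python
from typing import Any, Callable, Dict


class IdempotentProcessor:
    """Process each idempotency key at most once and cache the result."""

    def __init__(self, handler: Callable[[dict], Any]):
        self._handler = handler
        # Assumption: a real system would use a durable store with an expiry policy.
        self._results: Dict[str, Any] = {}

    def process(self, idempotency_key: str, message: dict) -> Any:
        if idempotency_key in self._results:
            # Duplicate delivery: return the cached result, skip the handler.
            return self._results[idempotency_key]
        result = self._handler(message)
        self._results[idempotency_key] = result
        return result


calls = []
processor = IdempotentProcessor(lambda msg: calls.append(msg) or f"created {msg['order']}")
first = processor.process("order-1001", {"order": "1001"})
second = processor.process("order-1001", {"order": "1001"})  # redelivery
```

After the redelivery, the handler has run once and both calls return the same result. The key in this example is built from the order number, a natural business identifier.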

Salesforce upsert is your friend

For data synchronization, always use upsert with an External ID field rather than separate insert/update logic. Upsert is naturally idempotent: sending the same record twice produces the same result. This single recommendation prevents a large class of integration bugs.
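
The difference between insert and upsert under redelivery can be shown with a toy model. This is not the Salesforce API: the two functions below mimic the semantics using plain Python collections, and the External ID field name `ERP_Order_Id__c` is hypothetical.

```python
from typing import Dict, List


def insert(table: List[dict], record: dict) -> None:
    """Plain insert: every delivery creates a new row."""
    table.append(dict(record))


def upsert(table: Dict[str, dict], external_id_field: str, record: dict) -> str:
    """Create the row if the External ID is new, otherwise update it in place."""
    key = record[external_id_field]
    outcome = "updated" if key in table else "created"
    table[key] = {**table.get(key, {}), **record}
    return outcome


order = {"ERP_Order_Id__c": "SO-42", "Amount": 100}

inserted: List[dict] = []
insert(inserted, order)
insert(inserted, order)  # duplicate delivery creates a second row

upserted: Dict[str, dict] = {}
outcomes = [upsert(upserted, "ERP_Order_Id__c", order),
            upsert(upserted, "ERP_Order_Id__c", order)]
```

The insert path ends with two rows for the same order. The upsert path ends with one row, created on the first delivery and updated with identical values on the second.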


Pattern 5: Monitoring and Alerting

Error handling without monitoring is a fire alarm with no sound. Failures must be detected and addressed before they create business impact.

Monitoring Architecture

Integration processes, dead letter queues, and error logs feed a log collector that drives an alert rules engine and operations dashboard with tiered response channels.
Figure 5. Tiered alerting routes warning-level events to email and Slack for awareness while critical events page on-call engineers through PagerDuty. All alerts auto-create tickets for traceability, and the operations dashboard provides continuous visibility without alert fatigue.

What to Monitor

| Metric | Threshold | Alert Level |
| --- | --- | --- |
| Integration failure rate | > 5% of transactions | Warning |
| Integration failure rate | > 20% of transactions | Critical |
| DLQ depth | > 100 messages | Warning |
| DLQ depth growing | Increasing for 30+ minutes | Critical |
| API call consumption | > 80% of daily limit | Warning |
| API call consumption | > 95% of daily limit | Critical |
| Average response time | > 5 seconds (for real-time) | Warning |
| Circuit breaker open | Any circuit open | Critical |
| Event subscriber lag | > 1 hour behind | Warning |
| Event subscriber lag | > 12 hours behind | Critical (approaching retention limit) |
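
The two-level thresholds in the table can be expressed as a small rule evaluation. This sketch covers only the metrics that have both a warning and a critical level; the metric keys and the function name are assumptions, and a real rules engine would usually live in the monitoring tool rather than in integration code.

```python
from typing import Dict, Optional, Tuple

# (warning, critical) thresholds taken from the table; keys are illustrative.
THRESHOLDS: Dict[str, Tuple[float, float]] = {
    "failure_rate": (0.05, 0.20),          # fraction of transactions failing
    "api_consumption": (0.80, 0.95),       # fraction of daily API limit used
    "subscriber_lag_hours": (1.0, 12.0),   # hours behind the event stream
}


def alert_level(metric: str, value: float) -> Optional[str]:
    """Return 'critical', 'warning', or None for a metric reading."""
    warning, critical = THRESHOLDS[metric]
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return None
```

For example, `alert_level("failure_rate", 0.06)` yields a warning, while a reading of exactly 0.05 yields no alert because the thresholds are strict.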

Salesforce-Native Monitoring Options

| Tool | What It Monitors | Cost |
| --- | --- | --- |
| Event Monitoring | API calls, logins, report exports | Shield add-on |
| Custom Dashboard | Integration_Error__c records | Included |
| Flow Email Alerts | Trigger on error records | Included |
| Platform Events | Real-time error broadcasting | Included |
| CRM Analytics (formerly Einstein Analytics) | Trend analysis on error patterns | Add-on |

Combining Patterns: The Complete Error Handling Stack

In a CTA scenario, present a layered error handling strategy, not just a single pattern.

Error classification drives pattern selection: transient errors retry with backoff, systemic errors open the circuit breaker, data quality errors reject to DLQ immediately.
Figure 6. Error classification is the foundation of the full error handling stack. Transient errors retry, systemic failures trigger the circuit breaker to stop wasting resources, and data quality rejections go directly to the DLQ with validation context for the operations team.

CTA presentation strategy

When presenting error handling at the review board, walk through a specific failure scenario end-to-end: “When the ERP is unavailable, the order event goes to the retry queue with exponential backoff. After 5 retries over roughly 30 seconds, the circuit breaker opens. Subsequent calls fail fast. The failed message routes to the DLQ, PagerDuty alerts the integration team, and a Jira ticket is auto-created. The integration team reviews the DLQ dashboard, and once the ERP is back, they resubmit from the DLQ.”

End-to-End Failure Scenario: ERP Goes Down

This sequence diagram shows how the patterns work together when the ERP becomes unavailable during order processing. This type of walkthrough scores well at the CTA board.

Order event triggers three delivery attempts against a down ERP, opens the circuit breaker after threshold, routes to DLQ with alert, then recovers and resubmits on ERP restoration.
Figure 7. Walking through a complete ERP outage scenario end-to-end demonstrates how retry, circuit breaker, DLQ, and alerting work together. No orders are lost: they queue in the DLQ and are resubmitted once the circuit closes, with full operations visibility throughout.
Detailed walkthrough

This sequence has five distinct phases. Reading it as a runtime narrative rather than an architecture diagram is exactly how you should present it to the review board.

Phase 1: Normal handoff. Salesforce fires a Platform Event when an order is submitted. Middleware receives it and immediately checks circuit state. The circuit breaker returns CLOSED, meaning the ERP is considered healthy. Middleware makes its first POST to /orders. The ERP returns a 503.

Phase 2: Retry with exponential backoff. The 503 is a retryable error (transient, server-side). Middleware waits one second and tries again. Another 503. It waits two seconds and tries a third time. Another 503. The backoff interval doubles between attempts (1s, 2s) deliberately. A recovering ERP under load needs breathing room. If every failing client retries at identical intervals, the recovered system receives a traffic spike at the exact moment it is trying to stabilize, which can re-collapse it. The increasing wait distributes pressure. Three attempts is enough to distinguish a brief self-correcting flap from a genuine outage.

Phase 3: Circuit trips. After the third failure, middleware reports to the circuit breaker state store. The failure threshold, set to three consecutive failures in this scenario, is met and the breaker flips from CLOSED to OPEN. Two things happen simultaneously: the failed order routes to the DLQ, and operations gets a PagerDuty alert plus an auto-created Jira ticket for traceability. Any subsequent order events that arrive while the circuit is OPEN fail fast without touching the ERP. This stops a broken integration from wasting resources and amplifying load on an already-struggling system.

Phase 4: Half-open probe. After 60 seconds, the circuit moves to HALF-OPEN. One test call goes out to the ERP. If it succeeds, the breaker resets to CLOSED. If it fails, it snaps back to OPEN and the cooldown restarts. No bulk traffic crosses until the single probe succeeds.

Phase 5: Recovery and replay. The ERP returns 200 OK on the probe. Circuit closes. Operations receives a recovery notification and triggers a bulk DLQ resubmit. The queued orders replay through middleware to the ERP in sequence. Every order that arrived during the outage is eventually delivered, with a complete audit trail from original Platform Event timestamp through successful resubmit.

The zero-data-loss guarantee comes from the DLQ, not from the retry mechanism. Retries handle transient glitches. The DLQ handles the cases retries cannot resolve. Together they are why this pattern scores well at the board.
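
The five phases can be condensed into a toy simulation. This is a sketch under several assumptions: the `Middleware` class and `post_order` callable are hypothetical, backoff sleeps and the 60-second cooldown are omitted, the half-open probe reuses the oldest DLQ entry as its test call, and replay assumes the ERP handles resubmitted orders idempotently.

```python
from typing import Callable, List


class Middleware:
    """Toy model of the walkthrough: retry, trip, dead-letter, probe, replay."""

    def __init__(self, post_order: Callable[[str], int], max_attempts: int = 3):
        self.post_order = post_order      # returns an HTTP status code
        self.max_attempts = max_attempts  # failure threshold used in this scenario
        self.circuit = "CLOSED"
        self.dlq: List[str] = []
        self.alerts: List[str] = []

    def handle(self, order_id: str) -> str:
        if self.circuit == "OPEN":
            self.dlq.append(order_id)     # fail fast: no call reaches the ERP
            return "dead_lettered"
        for _ in range(self.max_attempts):  # backoff waits omitted for brevity
            if self.post_order(order_id) == 200:
                return "delivered"
        # All attempts failed: trip the breaker, dead-letter, and alert.
        self.circuit = "OPEN"
        self.dlq.append(order_id)
        self.alerts.append("circuit opened")
        return "dead_lettered"

    def recover(self) -> int:
        """Probe with the oldest DLQ entry, then replay the rest in order."""
        if self.circuit != "OPEN" or not self.dlq:
            return 0
        if self.post_order(self.dlq[0]) != 200:
            return 0                      # probe failed: circuit stays OPEN
        self.circuit = "CLOSED"
        self.dlq.pop(0)
        replayed = 1
        # Replay stops at the first failure, leaving the remainder in the DLQ.
        while self.dlq and self.post_order(self.dlq[0]) == 200:
            self.dlq.pop(0)
            replayed += 1
        return replayed
```

Orders that arrive while the ERP is down end up in `dlq` in arrival order, and a successful `recover()` drains it once the ERP answers again, which is the zero-data-loss property described above.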


Anti-Patterns

| Anti-Pattern | Why It Fails | Better Approach |
| --- | --- | --- |
| Retry forever | Wastes resources, masks permanent failures | Max retries + DLQ |
| Retry without backoff | Hammers already-struggling systems | Exponential backoff with jitter |
| Swallow errors silently | Nobody knows the integration is broken | Log, alert, DLQ |
| Same retry policy for all errors | 400 Bad Request will never succeed with retry | Classify errors, only retry transient |
| No idempotency | Duplicate processing on retry | Idempotency keys or upsert |
| Manual-only error recovery | Does not scale, creates a human bottleneck | Automated reprocessing with manual review for edge cases |

  • Risk Management: integration failures are a top risk category; error handling feeds directly into risk registers
  • Data Quality & Governance: data quality errors are a major category of integration failures; governance prevents bad data from propagating
  • Review Board Presentation & Q&A: judges ask “what happens when this fails?” on every integration. Prepare error handling explanations.

Sources

Personal study notes for the Salesforce CTA exam. Content compiled from VJ's study notes, official Salesforce documentation, community sources, and online publicly available content, then organized and presented with AI assistance. Not affiliated with Salesforce. © 2025–2026 VJ Srivastava.