Error Handling Patterns

Integration error handling separates architects who design for the happy path from those who design for reality. External systems go down, networks fail, data is malformed, and rate limits are exceeded. This page covers the patterns every CTA must know to handle failures gracefully.

CTA board expectations

When you present an integration architecture, the board WILL ask: “What happens when this fails?” If your answer is “we retry” without specifics on strategy, limits, and fallback behavior, you will lose points. Be specific about retry count, backoff strategy, dead letter handling, and monitoring.


Error Categories

Before choosing a pattern, classify the error. Different error types demand different responses.

| Category | Examples | Correct Response | Wrong Response |
|---|---|---|---|
| Transient | Network timeout, 503 Service Unavailable, rate limit (429) | Retry with backoff | Fail immediately |
| Persistent | 404 Not Found, 400 Bad Request, invalid data | Route to dead letter queue, alert | Retry infinitely |
| Systemic | External system fully down, certificate expired | Circuit breaker, fallback | Keep retrying (wastes resources) |
| Data quality | Missing required fields, invalid format, duplicates | Reject and notify, data cleansing | Silently drop or force through |
| Capacity | Bulk API daily limit reached, governor limits | Queue and defer, throttle | Fail the entire batch |
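The classification step can be sketched as a simple dispatcher. This is a minimal illustration, assuming HTTP status codes are the only error signal; a real classifier would also inspect exception types, payload validation results, and limit counters to detect the systemic, data quality, and capacity categories:

```python
def classify_error(status_code: int) -> str:
    """Map an HTTP status code to an error category from the table above."""
    if status_code in (408, 429) or 500 <= status_code <= 599:
        return "transient"    # timeouts, rate limits, server errors: retry
    if status_code in (400, 404, 422):
        return "persistent"   # bad request / not found: never retry, DLQ
    return "unknown"          # anything else: route to DLQ for manual review
```

The key design point is that the retry decision is made once, up front, so that persistent errors never enter the retry loop.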

Pattern 1: Retry with Exponential Backoff

Automatically retry failed operations with increasing delays between attempts. This is the foundation of transient error handling.

How It Works

flowchart TD
    A[API Call] --> B{Success?}
    B -->|Yes| C[Process Response]
    B -->|No| D{Retryable Error?<br/>5xx, timeout, 429}
    D -->|No| E[Route to Dead Letter Queue]
    D -->|Yes| F{Retry Count<br/>< Max Retries?}
    F -->|No| E
    F -->|Yes| G["Wait: base_delay * 2^attempt<br/>+ random jitter"]
    G --> A

Implementation Parameters

| Parameter | Recommended Value | Rationale |
|---|---|---|
| Max retries | 3-5 | Enough for transient issues, not so many that persistent failures waste time |
| Base delay | 1 second | Starting delay before the first retry |
| Max delay | 60 seconds | Cap to prevent excessively long waits |
| Backoff multiplier | 2x (exponential) | 1s, 2s, 4s, 8s, 16s… |
| Jitter | Random 0-1s added | Prevents thundering herd when many clients retry simultaneously |

Retry Timing Example

| Attempt | Delay (no jitter) | Delay (with jitter) |
|---|---|---|
| 1 | 1 second | 1.0-2.0 seconds |
| 2 | 2 seconds | 2.0-3.0 seconds |
| 3 | 4 seconds | 4.0-5.0 seconds |
| 4 | 8 seconds | 8.0-9.0 seconds |
| 5 | 16 seconds | 16.0-17.0 seconds |
| Total | ~31 seconds | ~31-36 seconds |

Jitter prevents thundering herd

Without jitter, if 100 clients all fail at the same time, they all retry at exactly the same intervals — creating repeated spikes. Random jitter spreads retries across time. This is especially important for Salesforce integrations where concurrent API limits are shared across the org.
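The delay schedule above reduces to one line of arithmetic. A minimal sketch, with the parameter defaults taken from the recommendation table (function name and signature are illustrative):

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  jitter: float = 1.0) -> float:
    """Delay before retry `attempt` (1-based): base * 2^(attempt-1),
    capped at `cap`, plus uniform random jitter to spread out retries."""
    delay = min(cap, base * 2 ** (attempt - 1))
    return delay + random.uniform(0, jitter)
```

Attempts 1 through 5 yield 1s, 2s, 4s, 8s, and 16s plus 0-1s of jitter, matching the timing table above; the cap keeps later attempts from waiting longer than 60 seconds.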


Pattern 2: Circuit Breaker

Prevents a system from repeatedly calling an external service that is known to be down. Modeled after electrical circuit breakers.

States

stateDiagram-v2
    [*] --> Closed
    Closed --> Open : Failure threshold exceeded
    Open --> HalfOpen : Timeout period expires
    HalfOpen --> Closed : Test call succeeds
    HalfOpen --> Open : Test call fails

    note right of Closed
        Normal operation.
        Calls pass through.
        Track failure count.
    end note

    note right of Open
        All calls fail fast.
        No calls to external system.
        Wait for timeout.
    end note

    note right of HalfOpen
        Allow ONE test call.
        If succeeds: reset to Closed.
        If fails: back to Open.
    end note

Implementation in Salesforce

| State | Behavior | Salesforce Implementation |
|---|---|---|
| Closed | Normal operation, calls pass through | Standard callout behavior |
| Open | All calls fail immediately, no callout attempted | Check Custom Metadata / Platform Cache before callout |
| Half-Open | Allow one test call to check if service recovered | Scheduled job or manual reset attempts one call |

Configuration Parameters

| Parameter | Recommended | Purpose |
|---|---|---|
| Failure threshold | 5 consecutive failures | Number of failures before opening the circuit |
| Open timeout | 30-60 seconds | How long to wait before testing recovery |
| Success threshold | 2-3 successes in half-open | Successes needed to close the circuit |

Salesforce implementation options

Salesforce does not have a native circuit breaker. Implement using: (1) Custom Metadata Type to store circuit state, (2) Platform Cache for fast state checks, or (3) Custom Settings with hierarchical access. Platform Cache is fastest but non-durable; Custom Metadata requires deployment; Custom Settings are a middle ground.


Pattern 3: Dead Letter Queue (DLQ)

Messages that cannot be processed after all retries are routed to a dead letter queue for manual inspection, reprocessing, or alerting.

Flow

flowchart LR
    A[Message] --> B{Process}
    B -->|Success| C[Done]
    B -->|Failure| D{Retries<br/>exhausted?}
    D -->|No| E[Retry Queue<br/>with backoff]
    E --> B
    D -->|Yes| F[Dead Letter Queue]
    F --> G[Alert Operations Team]
    F --> H[Manual Review Dashboard]
    H --> I{Fixable?}
    I -->|Yes| J[Fix & Resubmit]
    J --> B
    I -->|No| K[Log & Archive]

Salesforce DLQ Options

| Approach | Best For | Persistence |
|---|---|---|
| Custom Object (Integration_Error__c) | Full audit trail, reporting | Permanent (until deleted) |
| Platform Events (Error_Event__e) | Real-time alerting | 24-72 hours |
| Big Object | High-volume error logging | Permanent, archive-oriented |
| Middleware DLQ (MuleSoft/Anypoint MQ) | Middleware-managed integrations | Configurable retention |
| External monitoring (Splunk, Datadog) | Centralized ops monitoring | Per-tool retention |

DLQ Record Design

A well-designed DLQ record captures everything needed for diagnosis and reprocessing:

| Field | Purpose |
|---|---|
| Source System | Where the message originated |
| Target System | Where it was being sent |
| Payload | The original message content |
| Error Message | What went wrong |
| Error Code | HTTP status, exception type |
| Retry Count | How many attempts were made |
| First Failure Timestamp | When it first failed |
| Last Failure Timestamp | When retries were exhausted |
| Status | New / Under Review / Resubmitted / Archived |
| Correlation ID | Links to the original transaction |
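As a concrete shape, the field list above maps onto a simple record type. A sketch using a Python dataclass standing in for the Integration_Error__c custom object (field names are illustrative, not the actual API names):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

def _now() -> datetime:
    return datetime.now(timezone.utc)

@dataclass
class DeadLetterRecord:
    """One DLQ entry, carrying everything needed for diagnosis and resubmit."""
    source_system: str       # where the message originated
    target_system: str       # where it was being sent
    payload: str             # original message content, stored verbatim
    error_message: str       # what went wrong
    error_code: str          # HTTP status or exception type
    correlation_id: str      # links back to the original transaction
    retry_count: int = 0
    status: str = "New"      # New / Under Review / Resubmitted / Archived
    first_failure: datetime = field(default_factory=_now)
    last_failure: datetime = field(default_factory=_now)
```

Storing the payload verbatim is what makes resubmission possible without reconstructing the original request.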

Pattern 4: Idempotency

Ensuring that processing the same message multiple times produces the same result. This is mandatory for any at-least-once delivery system.

Why It Matters

Platform Events, CDC, and most middleware deliver at-least-once. Duplicates WILL happen due to:

  • Network retries at the transport layer
  • Subscriber reconnection replaying events
  • Middleware retry on ambiguous failures
  • Bulk API partial retries

Implementation Strategies

flowchart TD
    A[Incoming Message] --> B{Has Idempotency Key?}
    B -->|No| C[Generate from payload<br/>hash or natural key]
    B -->|Yes| D{Key already<br/>processed?}
    C --> D
    D -->|Yes| E[Return cached result<br/>Skip processing]
    D -->|No| F[Process message]
    F --> G[Store idempotency key<br/>with result]
    G --> H[Return result]

| Strategy | How It Works | Pros | Cons |
|---|---|---|---|
| Idempotency key | Client sends a unique key; server checks it before processing | Most reliable | Requires key storage and lookup |
| Natural key dedup | Use a business key (e.g., Order Number) to detect duplicates | No extra infrastructure | Requires a unique business key |
| Upsert operations | Use an External ID for upsert instead of insert | Built into Salesforce | Only works for CRUD, not business logic |
| Payload hash | Hash the message content, check for duplicate hashes | Works without client changes | Hash collisions (rare) can treat different messages as duplicates |
| Timestamp comparison | Only process if the timestamp is newer than the last processed | Simple | Clock skew issues |
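The idempotency-key strategy, with the payload-hash strategy as a fallback when the client sends no key, can be sketched as a wrapper around the real handler. An in-memory dict stands in for the durable key store a production system would need; the class and method names are illustrative:

```python
import hashlib
import json

class IdempotentProcessor:
    """Process each idempotency key exactly once; replay the cached
    result for duplicates instead of reprocessing."""

    def __init__(self, handler):
        self.handler = handler
        self.results = {}   # idempotency key -> cached result

    def key_for(self, payload: dict) -> str:
        # Payload-hash fallback: canonicalize, then hash the content
        canonical = json.dumps(payload, sort_keys=True)
        return hashlib.sha256(canonical.encode()).hexdigest()

    def process(self, payload: dict, key=None):
        key = key or self.key_for(payload)
        if key in self.results:
            return self.results[key]   # duplicate: skip, return cached result
        result = self.handler(payload)
        self.results[key] = result     # store key WITH the result
        return result
```

Caching the result alongside the key matters: a duplicate delivery gets the same response as the original, which is what makes retries safe for the caller.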

Salesforce upsert is your friend

For data synchronization, always use upsert with an External ID field rather than separate insert/update logic. Upsert is naturally idempotent — sending the same record twice produces the same result. This single recommendation prevents a large class of integration bugs.


Pattern 5: Monitoring and Alerting

Error handling without monitoring is like having a fire alarm with no sound. You must know when integrations fail and respond before business impact.

Monitoring Architecture

flowchart TB
    subgraph "Integration Layer"
        INT[Integration Processes]
        DLQ[Dead Letter Queue]
        LOGS[Error Logs]
    end

    subgraph "Monitoring Platform"
        COLLECT[Log Collector<br/>Splunk/Datadog/ELK]
        ALERT[Alert Rules Engine]
        DASH[Operations Dashboard]
    end

    subgraph "Response"
        EMAIL[Email Alerts]
        SLACK[Slack/Teams Notifications]
        PAGER[PagerDuty/OpsGenie]
        TICKET[Auto-Create Case/Jira]
    end

    INT --> COLLECT
    DLQ --> COLLECT
    LOGS --> COLLECT

    COLLECT --> ALERT
    COLLECT --> DASH

    ALERT -->|Warning| EMAIL
    ALERT -->|Warning| SLACK
    ALERT -->|Critical| PAGER
    ALERT -->|All| TICKET

What to Monitor

| Metric | Threshold | Alert Level |
|---|---|---|
| Integration failure rate | > 5% of transactions | Warning |
| Integration failure rate | > 20% of transactions | Critical |
| DLQ depth | > 100 messages | Warning |
| DLQ depth growing | Increasing for 30+ minutes | Critical |
| API call consumption | > 80% of daily limit | Warning |
| API call consumption | > 95% of daily limit | Critical |
| Average response time | > 5 seconds (for real-time) | Warning |
| Circuit breaker open | Any circuit open | Critical |
| Event subscriber lag | > 1 hour behind | Warning |
| Event subscriber lag | > 12 hours behind | Critical (approaching retention limit) |
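An alert rule for the failure-rate metric reduces to a threshold check. A minimal sketch using the 5% / 20% thresholds from the table (function name and return values are illustrative):

```python
def failure_rate_alert(failed: int, total: int) -> str:
    """Classify the integration failure rate per the thresholds above."""
    if total == 0:
        return "ok"            # no traffic, nothing to alert on
    rate = failed / total
    if rate > 0.20:
        return "critical"      # page the on-call team
    if rate > 0.05:
        return "warning"       # email / Slack notification
    return "ok"
```

A real rules engine would evaluate this over a sliding window rather than raw totals, so a burst of old failures does not keep an alert latched.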

Salesforce-Native Monitoring Options

| Tool | What It Monitors | Cost |
|---|---|---|
| Event Monitoring | API calls, logins, report exports | Shield add-on |
| Custom Dashboard | Integration_Error__c records | Included |
| Flow Email Alerts | Trigger on error records | Included |
| Platform Events | Real-time error broadcasting | Included |
| Einstein Analytics | Trend analysis on error patterns | Add-on |

Combining Patterns: The Complete Error Handling Stack

In a CTA scenario, you should present a layered error handling strategy, not just a single pattern.

flowchart TD
    A[Integration Call] --> B{Success?}
    B -->|Yes| C[Log Success Metric]
    B -->|No| D[Classify Error]
    D --> E{Transient?}
    E -->|Yes| F[Retry with<br/>Exponential Backoff]
    F --> G{Retries<br/>Exhausted?}
    G -->|No| A
    G -->|Yes| H[Dead Letter Queue]
    E -->|No| I{Systemic?}
    I -->|Yes| J[Circuit Breaker<br/>Opens]
    J --> K[Fail Fast for<br/>Subsequent Calls]
    K --> H
    I -->|No| L{Data Quality?}
    L -->|Yes| M[Reject to DLQ<br/>with Validation Details]
    L -->|No| H
    H --> N[Alert Operations]
    H --> O[Dashboard Update]
    J --> N

CTA presentation strategy

When presenting error handling at the review board, walk through a specific failure scenario end-to-end: “When the ERP is unavailable, the order event goes to the retry queue with exponential backoff. After 5 retries over 30 seconds, the circuit breaker opens. Subsequent calls fail fast. The failed message routes to the DLQ, PagerDuty alerts the integration team, and a Jira ticket is auto-created. The integration team reviews the DLQ dashboard, and once the ERP is back, they resubmit from the DLQ.”

End-to-End Failure Scenario: ERP Goes Down

This sequence diagram shows exactly how the patterns above work together when the ERP becomes unavailable during order processing — the type of walkthrough that scores well at the CTA board.

sequenceDiagram
    participant SF as Salesforce
    participant MW as Middleware
    participant CB as Circuit Breaker<br/>(state store)
    participant ERP as ERP System
    participant DLQ as Dead Letter Queue
    participant OPS as Operations Team

    SF->>MW: Order Event (Platform Event)
    MW->>CB: Check circuit state
    CB-->>MW: CLOSED (healthy)
    MW->>ERP: POST /orders (attempt 1)
    ERP-->>MW: 503 Service Unavailable

    MW->>MW: Wait 1s (backoff)
    MW->>ERP: POST /orders (attempt 2)
    ERP-->>MW: 503 Service Unavailable

    MW->>MW: Wait 2s (backoff)
    MW->>ERP: POST /orders (attempt 3)
    ERP-->>MW: 503 Service Unavailable

    MW->>CB: Report failure #3 (threshold reached)
    CB->>CB: State → OPEN

    MW->>DLQ: Route failed order to DLQ
    MW->>OPS: PagerDuty alert + auto-create Jira

    Note over CB: 60 seconds pass...
    CB->>CB: State → HALF-OPEN

    MW->>ERP: Test call (single probe)
    ERP-->>MW: 200 OK

    CB->>CB: State → CLOSED
    MW->>OPS: Notify: ERP recovered
    OPS->>DLQ: Trigger bulk resubmit
    DLQ->>MW: Replay failed orders
    MW->>ERP: POST /orders (resubmit)
    ERP-->>MW: 200 OK

Anti-Patterns

| Anti-Pattern | Why It Fails | Better Approach |
|---|---|---|
| Retry forever | Wastes resources, masks permanent failures | Max retries + DLQ |
| Retry without backoff | Hammers already-struggling systems | Exponential backoff with jitter |
| Swallow errors silently | Nobody knows the integration is broken | Log, alert, DLQ |
| Single retry for all errors | 400 Bad Request will never succeed with retry | Classify errors, only retry transient |
| No idempotency | Duplicate processing on retry | Idempotency keys or upsert |
| Manual-only error recovery | Does not scale, human bottleneck | Automated reprocessing with manual review for edge cases |

  • Risk Management — integration failures are a top risk category; error handling patterns feed directly into risk registers and mitigation plans
  • Data Quality & Governance — data quality errors are a major category of integration failures; governance processes prevent bad data from propagating
  • Review Board Presentation & Q&A — judges frequently ask “what happens when this fails?” — prepare error handling explanations for every integration
