Error Handling Patterns
Integration error handling separates architects who design for the happy path from those who design for reality. External systems go down, networks fail, data is malformed, and rate limits are exceeded. This page covers the patterns every CTA must know to handle failures gracefully.
CTA board expectations
When you present an integration architecture, the board WILL ask: “What happens when this fails?” If your answer is “we retry” without specifics on strategy, limits, and fallback behavior, you will lose points. Be specific about retry count, backoff strategy, dead letter handling, and monitoring.
Error Categories
Before choosing a pattern, classify the error. Different error types demand different responses.
| Category | Examples | Correct Response | Wrong Response |
|---|---|---|---|
| Transient | Network timeout, 503 Service Unavailable, rate limit (429) | Retry with backoff | Fail immediately |
| Persistent | 404 Not Found, 400 Bad Request, invalid data | Route to dead letter queue, alert | Retry infinitely |
| Systemic | External system fully down, certificate expired | Circuit breaker, fallback | Keep retrying (wastes resources) |
| Data quality | Missing required fields, invalid format, duplicates | Reject and notify, data cleansing | Silently drop or force through |
| Capacity | Bulk API daily limit reached, governor limits | Queue and defer, throttle | Fail the entire batch |
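The classification step above can be sketched as a small routing function. This is a hypothetical sketch in Python (the function name and category strings are illustrative, not from any Salesforce or middleware library), mapping an HTTP outcome to one of the categories in the table:

```python
from typing import Optional

def classify_error(status_code: Optional[int], timed_out: bool = False) -> str:
    """Map an HTTP outcome to one of the error categories above (simplified sketch)."""
    if timed_out:
        return "transient"      # network timeout: retry with backoff
    if status_code is None:
        return "systemic"       # connection refused / TLS failure: circuit breaker
    if status_code in (429, 502, 503, 504):
        return "transient"      # rate limit or temporary server trouble: retry
    if 400 <= status_code < 500:
        return "persistent"     # 400/404 etc. will never succeed on retry: DLQ
    return "persistent"         # unknown errors default to no-retry plus alert
```

Real classifiers also inspect response bodies and exception types; status code alone is a starting point.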
Pattern 1: Retry with Exponential Backoff
Automatically retry failed operations with increasing delays between attempts. This is the foundation of transient error handling.
How It Works
```mermaid
flowchart TD
A[API Call] --> B{Success?}
B -->|Yes| C[Process Response]
B -->|No| D{Retryable Error?<br/>5xx, timeout, 429}
D -->|No| E[Route to Dead Letter Queue]
D -->|Yes| F{Retry Count<br/>< Max Retries?}
F -->|No| E
F -->|Yes| G["Wait: base_delay * 2^attempt<br/>+ random jitter"]
G --> A
```
Implementation Parameters
| Parameter | Recommended Value | Rationale |
|---|---|---|
| Max retries | 3-5 | Enough for transient issues, not so many that persistent failures waste time |
| Base delay | 1 second | Starting delay before first retry |
| Max delay | 60 seconds | Cap to prevent excessively long waits |
| Backoff multiplier | 2x (exponential) | 1s, 2s, 4s, 8s, 16s… |
| Jitter | Random 0-1s added | Prevents thundering herd when many clients retry simultaneously |
Retry Timing Example
| Attempt | Delay (no jitter) | Delay (with jitter) |
|---|---|---|
| 1 | 1 second | 1.0 - 2.0 seconds |
| 2 | 2 seconds | 2.0 - 3.0 seconds |
| 3 | 4 seconds | 4.0 - 5.0 seconds |
| 4 | 8 seconds | 8.0 - 9.0 seconds |
| 5 | 16 seconds | 16.0 - 17.0 seconds |
| Total | ~31 seconds | ~31 - 36 seconds |
Jitter prevents thundering herd
Without jitter, if 100 clients all fail at the same time, they all retry at exactly the same intervals — creating repeated spikes. Random jitter spreads retries across time. This is especially important for Salesforce integrations where concurrent API limits are shared across the org.
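The retry loop and delay formula above can be sketched as follows. This is a minimal Python illustration under the table's recommended parameters (function names are hypothetical; a middleware or Apex implementation would follow the same shape):

```python
import random
import time

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  jitter: float = 1.0) -> float:
    """Delay before retry `attempt` (0-based): base * 2^attempt, capped, plus random jitter."""
    return min(cap, base * (2 ** attempt)) + random.uniform(0.0, jitter)

def call_with_retries(operation, max_retries: int = 5,
                      is_retryable=lambda exc: True):
    """Run `operation`; retry retryable failures with exponential backoff and jitter."""
    for attempt in range(max_retries):
        try:
            return operation()
        except Exception as exc:
            if not is_retryable(exc) or attempt == max_retries - 1:
                raise  # exhausted or non-retryable: caller routes to the DLQ
            time.sleep(backoff_delay(attempt))
```

Note that the jitter term is what desynchronizes clients after a shared outage; without it every client computes the identical delay sequence.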
Pattern 2: Circuit Breaker
Prevents a system from repeatedly calling an external service that is known to be down. Modeled after electrical circuit breakers.
States
```mermaid
stateDiagram-v2
[*] --> Closed
Closed --> Open : Failure threshold exceeded
Open --> HalfOpen : Timeout period expires
HalfOpen --> Closed : Test call succeeds
HalfOpen --> Open : Test call fails
note right of Closed
Normal operation.
Calls pass through.
Track failure count.
end note
note right of Open
All calls fail fast.
No calls to external system.
Wait for timeout.
end note
note right of HalfOpen
Allow ONE test call.
If succeeds: reset to Closed.
If fails: back to Open.
end note
```
Implementation in Salesforce
| State | Behavior | Salesforce Implementation |
|---|---|---|
| Closed | Normal operation, calls pass through | Standard callout behavior |
| Open | All calls fail immediately, no callout attempted | Check Custom Metadata / Platform Cache before callout |
| Half-Open | Allow one test call to check if service recovered | Scheduled job or manual reset attempts one call |
Configuration Parameters
| Parameter | Recommended | Purpose |
|---|---|---|
| Failure threshold | 5 consecutive failures | Number of failures before opening circuit |
| Open timeout | 30-60 seconds | How long to wait before testing recovery |
| Success threshold | 2-3 successes in half-open | Successes needed to close circuit |
Salesforce implementation options
Salesforce does not have a native circuit breaker. Implement using: (1) Custom Metadata Type to store circuit state, (2) Platform Cache for fast state checks, or (3) Custom Settings with hierarchical access. Platform Cache is fastest but non-durable; Custom Metadata requires deployment; Custom Settings are a middle ground.
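Since there is no native implementation, the state machine above has to be coded by hand. Here is a minimal in-memory sketch in Python (class and method names are hypothetical); a Salesforce version would persist `state`, `failures`, and `opened_at` in Platform Cache or Custom Metadata as described in the note:

```python
import time

class CircuitBreaker:
    """Minimal in-memory circuit breaker implementing the three states above."""
    CLOSED, OPEN, HALF_OPEN = "closed", "open", "half_open"

    def __init__(self, failure_threshold=5, open_timeout=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_timeout = open_timeout
        self.clock = clock
        self.state = self.CLOSED
        self.failures = 0
        self.opened_at = None

    def allow_request(self) -> bool:
        if self.state == self.OPEN:
            if self.clock() - self.opened_at >= self.open_timeout:
                self.state = self.HALF_OPEN   # timeout expired: allow a probe call
                return True
            return False                      # still open: fail fast, no callout
        return True                           # closed or half-open: call proceeds

    def record_success(self):
        self.state = self.CLOSED
        self.failures = 0

    def record_failure(self):
        if self.state == self.HALF_OPEN:
            self._open()                      # probe failed: back to open
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self._open()

    def _open(self):
        self.state = self.OPEN
        self.opened_at = self.clock()
        self.failures = 0
```

This sketch lets any number of calls through in half-open; a stricter version would gate a single probe at a time and count successes against the success threshold from the table.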
Pattern 3: Dead Letter Queue (DLQ)
Messages that cannot be processed after all retries are routed to a dead letter queue for manual inspection, reprocessing, or alerting.
Flow
```mermaid
flowchart LR
A[Message] --> B{Process}
B -->|Success| C[Done]
B -->|Failure| D{Retries<br/>exhausted?}
D -->|No| E[Retry Queue<br/>with backoff]
E --> B
D -->|Yes| F[Dead Letter Queue]
F --> G[Alert Operations Team]
F --> H[Manual Review Dashboard]
H --> I{Fixable?}
I -->|Yes| J[Fix & Resubmit]
J --> B
I -->|No| K[Log & Archive]
```
Salesforce DLQ Options
| Approach | Best For | Persistence |
|---|---|---|
| Custom Object (Integration_Error__c) | Full audit trail, reporting | Permanent (until deleted) |
| Platform Events (Error_Event__e) | Real-time alerting | 24-72 hours |
| Big Object | High-volume error logging | Permanent, archive-oriented |
| Middleware DLQ (MuleSoft/Anypoint MQ) | Middleware-managed integrations | Configurable retention |
| External monitoring (Splunk, Datadog) | Centralized ops monitoring | Per tool retention |
DLQ Record Design
A well-designed DLQ record captures everything needed for diagnosis and reprocessing:
| Field | Purpose |
|---|---|
| Source System | Where the message originated |
| Target System | Where it was being sent |
| Payload | The original message content |
| Error Message | What went wrong |
| Error Code | HTTP status, exception type |
| Retry Count | How many attempts were made |
| First Failure Timestamp | When it first failed |
| Last Failure Timestamp | When retries were exhausted |
| Status | New / Under Review / Resubmitted / Archived |
| Correlation ID | Links to the original transaction |
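As a concrete illustration, the field list above maps naturally onto a record type. This Python dataclass is an in-code analog of a custom object such as Integration_Error__c (field and method names are hypothetical):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class DeadLetterRecord:
    """In-code analog of the DLQ fields listed above."""
    source_system: str
    target_system: str
    payload: str                 # original message content, for resubmission
    error_message: str
    error_code: str              # HTTP status or exception type
    correlation_id: str          # links back to the original transaction
    retry_count: int = 0
    first_failure_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    last_failure_at: Optional[datetime] = None
    status: str = "New"          # New / Under Review / Resubmitted / Archived

    def exhaust(self, retries: int) -> None:
        """Mark retries exhausted and stamp the final failure time."""
        self.retry_count = retries
        self.last_failure_at = datetime.now(timezone.utc)
```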
Pattern 4: Idempotency
Ensuring that processing the same message multiple times produces the same result. This is mandatory for any at-least-once delivery system.
Why It Matters
Platform Events, CDC, and most middleware deliver at-least-once. Duplicates WILL happen due to:
- Network retries at the transport layer
- Subscriber reconnection replaying events
- Middleware retry on ambiguous failures
- Bulk API partial retries
Implementation Strategies
```mermaid
flowchart TD
A[Incoming Message] --> B{Has Idempotency Key?}
B -->|No| C[Generate from payload<br/>hash or natural key]
B -->|Yes| D{Key already<br/>processed?}
C --> D
D -->|Yes| E[Return cached result<br/>Skip processing]
D -->|No| F[Process message]
F --> G[Store idempotency key<br/>with result]
G --> H[Return result]
```
| Strategy | How It Works | Pros | Cons |
|---|---|---|---|
| Idempotency key | Client sends unique key; server checks before processing | Most reliable | Requires key storage and lookup |
| Natural key dedup | Use business key (Order Number) to detect duplicates | No extra infrastructure | Requires unique business key |
| Upsert operations | Use External ID for upsert instead of insert | Built into Salesforce | Only works for CRUD, not business logic |
| Payload hash | Hash the message content, check for duplicate hashes | Works without client changes | Distinct messages with identical content are wrongly deduplicated; hash collisions are possible but rare |
| Timestamp comparison | Only process if timestamp is newer than last processed | Simple | Clock skew issues |
Salesforce upsert is your friend
For data synchronization, always use upsert with an External ID field rather than separate insert/update logic. Upsert is naturally idempotent — sending the same record twice produces the same result. This single recommendation prevents a large class of integration bugs.
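The key-based dedup flow from the diagram above can be sketched as follows. This is a hypothetical Python illustration (class name and in-memory dict are illustrative; a production system needs a durable key store with atomic check-and-set):

```python
import hashlib
import json
from typing import Optional

class IdempotentProcessor:
    """Dedupe by idempotency key, falling back to a canonical payload hash."""

    def __init__(self, handler):
        self.handler = handler   # the actual business logic
        self._results = {}       # idempotency key -> cached result

    def process(self, payload: dict, idempotency_key: Optional[str] = None):
        # No client-supplied key: derive one from a canonical payload hash.
        key = idempotency_key or hashlib.sha256(
            json.dumps(payload, sort_keys=True).encode()
        ).hexdigest()
        if key in self._results:
            return self._results[key]   # duplicate delivery: skip side effects
        result = self.handler(payload)
        self._results[key] = result     # store key and result together
        return result
```

Note the `sort_keys=True`: without canonical serialization, the same logical payload can hash differently depending on field order.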
Pattern 5: Monitoring and Alerting
Error handling without monitoring is like having a fire alarm with no sound. You must know when integrations fail and respond before business impact.
Monitoring Architecture
```mermaid
flowchart TB
subgraph "Integration Layer"
INT[Integration Processes]
DLQ[Dead Letter Queue]
LOGS[Error Logs]
end
subgraph "Monitoring Platform"
COLLECT[Log Collector<br/>Splunk/Datadog/ELK]
ALERT[Alert Rules Engine]
DASH[Operations Dashboard]
end
subgraph "Response"
EMAIL[Email Alerts]
SLACK[Slack/Teams Notifications]
PAGER[PagerDuty/OpsGenie]
TICKET[Auto-Create Case/Jira]
end
INT --> COLLECT
DLQ --> COLLECT
LOGS --> COLLECT
COLLECT --> ALERT
COLLECT --> DASH
ALERT -->|Warning| EMAIL
ALERT -->|Warning| SLACK
ALERT -->|Critical| PAGER
ALERT -->|All| TICKET
```
What to Monitor
| Metric | Threshold | Alert Level |
|---|---|---|
| Integration failure rate | > 5% of transactions | Warning |
| Integration failure rate | > 20% of transactions | Critical |
| DLQ depth | > 100 messages | Warning |
| DLQ depth growing | Increasing for 30+ minutes | Critical |
| API call consumption | > 80% of daily limit | Warning |
| API call consumption | > 95% of daily limit | Critical |
| Average response time | > 5 seconds (for real-time) | Warning |
| Circuit breaker open | Any circuit open | Critical |
| Event subscriber lag | > 1 hour behind | Warning |
| Event subscriber lag | > 12 hours behind | Critical (approaching retention limit) |
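An alert rules engine evaluates each metric sample against a threshold table like the one above. Here is a minimal Python sketch (metric names are hypothetical; the numeric thresholds mirror the table and should be tuned per org):

```python
from typing import Optional

# metric: (warning threshold, critical threshold; None = no critical tier here)
THRESHOLDS = {
    "failure_rate_pct": (5.0, 20.0),
    "dlq_depth":        (100.0, None),
    "api_usage_pct":    (80.0, 95.0),
    "response_time_s":  (5.0, None),
    "subscriber_lag_h": (1.0, 12.0),
}

def alert_level(metric: str, value: float) -> Optional[str]:
    """Return 'critical', 'warning', or None for a single metric sample."""
    warning, critical = THRESHOLDS[metric]
    if critical is not None and value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return None
```

Trend-based rules (such as "DLQ depth increasing for 30+ minutes") need a window of samples rather than a single reading, so they sit outside this simple point-in-time check.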
Salesforce-Native Monitoring Options
| Tool | What It Monitors | Cost |
|---|---|---|
| Event Monitoring | API calls, logins, report exports | Shield add-on |
| Custom Dashboard | Integration_Error__c records | Included |
| Flow Email Alerts | Trigger on error records | Included |
| Platform Events | Real-time error broadcasting | Included |
| Einstein Analytics | Trend analysis on error patterns | Add-on |
Combining Patterns: The Complete Error Handling Stack
In a CTA scenario, you should present a layered error handling strategy, not just a single pattern.
```mermaid
flowchart TD
A[Integration Call] --> B{Success?}
B -->|Yes| C[Log Success Metric]
B -->|No| D[Classify Error]
D --> E{Transient?}
E -->|Yes| F[Retry with<br/>Exponential Backoff]
F --> G{Retries<br/>Exhausted?}
G -->|No| A
G -->|Yes| H[Dead Letter Queue]
E -->|No| I{Systemic?}
I -->|Yes| J[Circuit Breaker<br/>Opens]
J --> K[Fail Fast for<br/>Subsequent Calls]
K --> H
I -->|No| L{Data Quality?}
L -->|Yes| M[Reject to DLQ<br/>with Validation Details]
L -->|No| H
H --> N[Alert Operations]
H --> O[Dashboard Update]
J --> N
```
CTA presentation strategy
When presenting error handling at the review board, walk through a specific failure scenario end-to-end: “When the ERP is unavailable, the order event goes to the retry queue with exponential backoff. After 5 retries over 30 seconds, the circuit breaker opens. Subsequent calls fail fast. The failed message routes to the DLQ, PagerDuty alerts the integration team, and a Jira ticket is auto-created. The integration team reviews the DLQ dashboard, and once the ERP is back, they resubmit from the DLQ.”
End-to-End Failure Scenario: ERP Goes Down
This sequence diagram shows exactly how the patterns above work together when the ERP becomes unavailable during order processing — the type of walkthrough that scores well at the CTA board.
```mermaid
sequenceDiagram
participant SF as Salesforce
participant MW as Middleware
participant CB as Circuit Breaker<br/>(state store)
participant ERP as ERP System
participant DLQ as Dead Letter Queue
participant OPS as Operations Team
SF->>MW: Order Event (Platform Event)
MW->>CB: Check circuit state
CB-->>MW: CLOSED (healthy)
MW->>ERP: POST /orders (attempt 1)
ERP-->>MW: 503 Service Unavailable
MW->>MW: Wait 1s (backoff)
MW->>ERP: POST /orders (attempt 2)
ERP-->>MW: 503 Service Unavailable
MW->>MW: Wait 2s (backoff)
MW->>ERP: POST /orders (attempt 3)
ERP-->>MW: 503 Service Unavailable
MW->>CB: Report failure #3 (threshold reached)
CB->>CB: State → OPEN
MW->>DLQ: Route failed order to DLQ
MW->>OPS: PagerDuty alert + auto-create Jira
Note over CB: 60 seconds pass...
CB->>CB: State → HALF-OPEN
MW->>ERP: Test call (single probe)
ERP-->>MW: 200 OK
CB->>CB: State → CLOSED
MW->>OPS: Notify: ERP recovered
OPS->>DLQ: Trigger bulk resubmit
DLQ->>MW: Replay failed orders
MW->>ERP: POST /orders (resubmit)
ERP-->>MW: 200 OK
```
Anti-Patterns
| Anti-Pattern | Why It Fails | Better Approach |
|---|---|---|
| Retry forever | Wastes resources, masks permanent failures | Max retries + DLQ |
| Retry without backoff | Hammers already-struggling systems | Exponential backoff with jitter |
| Swallow errors silently | Nobody knows the integration is broken | Log, alert, DLQ |
| Single retry for all errors | 400 Bad Request will never succeed with retry | Classify errors, only retry transient |
| No idempotency | Duplicate processing on retry | Idempotency keys or upsert |
| Manual-only error recovery | Does not scale, human bottleneck | Automated reprocessing with manual review for edge cases |
Related Content
- Risk Management — integration failures are a top risk category; error handling patterns feed directly into risk registers and mitigation plans
- Data Quality & Governance — data quality errors are a major category of integration failures; governance processes prevent bad data from propagating
- Review Board Presentation & Q&A — judges frequently ask “what happens when this fails?” — prepare error handling explanations for every integration
Sources
- Salesforce Integration Patterns: Error Handling
- MuleSoft: Error Handling Best Practices
- Martin Fowler, “Circuit Breaker Pattern” (martinfowler.com)
- Michael Nygard, “Release It! Design and Deploy Production-Ready Software”
- AWS: Exponential Backoff and Jitter
- CTA Study Group notes on integration error handling scenarios