
Large Data Volumes

Large Data Volume (LDV) is one of the most impactful areas for a CTA because performance problems caused by data volume cascade through every layer of the solution — UI responsiveness, report performance, integration throughput, and user satisfaction. The platform handles millions of records well when you design for it. It handles them terribly when you do not.

What Qualifies as LDV

There is no single magic number, but these thresholds signal LDV territory:

| Volume | Classification | Concern Level |
|---|---|---|
| < 500K records per object | Normal | Standard best practices suffice |
| 500K - 1M records | Approaching LDV | Start monitoring query plans |
| 1M - 10M records | LDV | Active optimization required |
| 10M - 100M records | High LDV | Skinny tables, custom indexes, archival needed |
| 100M+ records | Extreme LDV | Consider Big Objects, external storage, or Data Cloud |

It is relative, not absolute

LDV is relative to query patterns. A 500K-record object with no filters in list views is worse than a 5M-record object with highly selective queries. Always evaluate volume in the context of how data is accessed.
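
To make this concrete, here is a minimal sketch contrasting a non-selective filter with a selective one on the same data. The object and field names (`Shipment__c`, `Status__c`) are hypothetical, and the row counts are assumed for illustration:

```apex
// Hypothetical 5M-record custom object Shipment__c.
// Status__c is unindexed and matches roughly 40% of rows, so the
// optimizer falls back to a full table scan:
List<Shipment__c> slow = [SELECT Id FROM Shipment__c WHERE Status__c = 'Open'];

// CreatedDate is a standard index, and a narrow date range targets far
// fewer rows than the selectivity threshold, so the index is used:
List<Shipment__c> fast = [
    SELECT Id FROM Shipment__c
    WHERE CreatedDate = LAST_N_DAYS:7 AND Status__c = 'Open'
];
```

The same object is "LDV" in the first query and unremarkable in the second — the volume did not change, the access pattern did.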


Indexing Deep Dive

Indexes are the primary weapon against LDV performance issues. Understanding what is indexed, what can be indexed, and how the query optimizer uses indexes is essential CTA knowledge.

Standard Indexes (Automatic)

These fields are automatically indexed on every object:

| Field | Notes |
|---|---|
| Id | Primary key, always indexed |
| Name | Standard name field |
| OwnerId | Record owner |
| CreatedDate | Record creation timestamp |
| LastModifiedDate | Last modification timestamp |
| SystemModstamp | System modification timestamp |
| RecordTypeId | Record type identifier |
| Lookup fields | All lookup and master-detail fields |
| Unique fields | Any field marked as unique or external ID |

Custom Indexes

Custom indexes must be requested from Salesforce Support (or enabled in specific editions). Types:

| Index Type | Description | Use Case |
|---|---|---|
| Single-column | Index on one custom field | Filter by a single field |
| Two-column | Composite index on two fields | Filter by two fields together |
| Skinny table | Denormalized table with a subset of fields | Complex objects with many fields |

Skinny Tables

Skinny tables are a special Salesforce optimization for objects with many columns. They create a separate, narrow table with only the fields you need for specific queries.

graph LR
    subgraph BaseObject["Base Object Table (Wide)"]
        BO[Account<br/>100+ columns<br/>Id, Name, Phone, Fax,<br/>Custom1__c ... Custom80__c]
    end

    subgraph SkinnyTable["Skinny Table (Narrow)"]
        ST[Account Skinny<br/>5-8 columns<br/>Id, Name, Region__c,<br/>Status__c, OwnerId]
    end

    subgraph QueryOptimizer["Query Optimizer"]
        QO{Evaluate query<br/>columns needed}
    end

    QO -->|"All columns in<br/>skinny table"| ST
    QO -->|"Columns outside<br/>skinny table"| BO
    ST -->|"Faster: fewer columns,<br/>no deleted records"| R[Query Results]
    BO -->|"Slower: wide rows,<br/>includes deleted"| R

    style ST fill:#51cf66,color:#fff
    style BO fill:#ff6b6b,color:#fff

Characteristics:

  • Must be requested from Salesforce Support
  • Contain a subset of fields from the original object
  • Do not include soft-deleted records (IsDeleted = true) — this alone can significantly reduce table size
  • Rows are automatically synchronized between the base object and the skinny table
  • The query optimizer determines at runtime whether to use the skinny table
  • Custom indexes on the base table are replicated on the skinny table and often perform better due to reduced table joins
  • Work with both standard and custom objects
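
A sketch of how the optimizer's column check plays out in practice — the skinny table composition here is assumed, matching the diagram above, not a prescribed configuration:

```apex
// Assuming a skinny table on Account containing only Id, Name,
// Region__c, Status__c, and OwnerId.

// Every column referenced (SELECT and WHERE) is in the skinny table,
// so the optimizer can serve the query from the narrow table:
List<Account> hot = [
    SELECT Id, Name, Region__c FROM Account
    WHERE Status__c = 'Active' AND Region__c = 'EMEA'
];

// One column outside the skinny table (Fax) forces the query back
// to the wide base table, losing the benefit entirely:
List<Account> cold = [
    SELECT Id, Name, Fax FROM Account
    WHERE Status__c = 'Active' AND Region__c = 'EMEA'
];
```

This all-or-nothing behavior is why skinny tables should be scoped to a small, stable set of fields driven by your hottest queries.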

Skinny table maintenance

When you add or remove fields from a skinny table, you must contact Salesforce Support. New fields added to the object are not automatically included. This creates an operational dependency that must be factored into your governance model.

Indexing Architecture

The Salesforce multitenant architecture stores custom field data in a way that makes the underlying database tables unsuitable for direct indexing. To solve this, the platform creates separate index tables that mirror the data in an indexable format.

graph TD
    subgraph MultitenantDB["Multitenant Database"]
        DT[Data Table<br/>Generic columns: Value0, Value1...<br/>All orgs, all objects intermixed]
    end

    subgraph IndexLayer["Platform Index Layer"]
        IT[Index Table<br/>Copies of field data<br/>with data type info]
        SI[Standard Indexes<br/>Id, Name, OwnerId,<br/>CreatedDate, Lookups]
        CI[Custom Indexes<br/>Requested via Support<br/>Single or Two-column]
    end

    subgraph QueryPath["Query Execution"]
        QO[Query Optimizer]
        IP[Index Path<br/>Fast: uses index seek]
        TS[Table Scan Path<br/>Slow: reads all rows]
    end

    DT --> IT
    IT --> SI
    IT --> CI
    QO --> IP
    QO --> TS
    SI --> QO
    CI --> QO

    style IP fill:#51cf66,color:#fff
    style TS fill:#ff6b6b,color:#fff

Why custom indexes require support cases

Because the platform manages the index tables internally, only Salesforce can add or modify custom indexes. This is by design in the multitenant architecture — the platform must control index creation to balance resource usage across all tenants on the shared infrastructure.


Query Selectivity

The Salesforce query optimizer decides whether to use an index or perform a full table scan. The decision is based on selectivity — how many records the filter is expected to return.

Selectivity Thresholds

graph TD
    A[Query Submitted] --> B{Is the filter field<br/>indexed?}
    B -->|No| C[Full Table Scan]
    B -->|Yes| D{Does filter return<br/>< threshold?}
    D -->|Standard Index:<br/>30% of first 1M records,<br/>15% beyond 1M,<br/>max 1,000,000 rows| E[Index Used]
    D -->|Custom Index:<br/>10% of first 1M records,<br/>5% beyond 1M,<br/>max 333,333 rows| E
    D -->|Above threshold| C
    E --> F[Fast Query]
    C --> G{Table size?}
    G -->|Small| H[Acceptable Performance]
    G -->|Large LDV| I[Slow/Timeout]

    style I fill:#ff6b6b,color:#fff
    style F fill:#51cf66,color:#fff

Selectivity Rules

| Condition | Threshold | Notes |
|---|---|---|
| Standard index | 30% of the first 1M records, then 15% beyond 1M; maximum 1,000,000 targeted records | Tiered calculation |
| Custom index | 10% of the first 1M records, then 5% beyond 1M; maximum 333,333 targeted records | Tiered calculation |
| Multiple filters (AND) | Selective if each filter is individually selective, or if the combined filter is selective | Optimizer evaluates each filter individually |
| Multiple filters (OR) | Every filter must be selective | A single non-selective filter makes the whole query non-selective |
| NOT operator | Generally non-selective | Avoid NOT and != in WHERE clauses on LDV objects |
| NULL checks | Non-selective | WHERE Field__c = null is almost never selective |
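
The tiered math is worth working through once. For a 5,000,000-record object: a standard index is usable when the filter targets fewer than 30% × 1M + 15% × 4M = 900,000 rows; a custom index when it targets fewer than 10% × 1M + 5% × 4M = 300,000 rows. As an illustrative sketch (this is arithmetic the optimizer applies internally, not a platform API):

```apex
Long total = 5000000;                      // rows in the object
Long beyond = Math.max(total - 1000000, 0);

// Standard index: 30% of first 1M + 15% of the rest, capped at 1,000,000
Long stdThreshold = Math.min(
    Math.min(total, 1000000) * 30 / 100 + beyond * 15 / 100, 1000000);

// Custom index: 10% of first 1M + 5% of the rest, capped at 333,333
Long custThreshold = Math.min(
    Math.min(total, 1000000) * 10 / 100 + beyond * 5 / 100, 333333);

System.debug(stdThreshold);   // 900000
System.debug(custThreshold);  // 300000
```

Note that the custom-index cap of 333,333 binds sooner than the percentage tiers as volume grows — above roughly 7.7M records, the cap is the effective threshold.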

Query Optimizer Internal Flow

The Lightning Platform query optimizer acts as a pre-processor between your SOQL and the underlying database. It evaluates multiple potential execution plans, assigns a cost to each, and selects the cheapest path.

graph TD
    A[SOQL Query Submitted] --> B[Parse WHERE clause<br/>and identify filter fields]
    B --> C{Any filter fields<br/>indexed?}
    C -->|No| D[Full Table Scan<br/>Cost = high]
    C -->|Yes| E[Evaluate each<br/>indexed filter]
    E --> F{Check selectivity<br/>per filter}
    F --> G[Calculate estimated<br/>cardinality per path]
    G --> H[Assign cost to<br/>each execution plan]
    H --> I{Compare costs:<br/>Index vs Table Scan}
    I -->|Index cheaper| J[Use Index Path]
    I -->|Table Scan cheaper<br/>or index not selective| D
    J --> K{Multiple selective<br/>indexes available?}
    K -->|Yes| L[Choose index with<br/>lowest cost estimate]
    K -->|No| M[Use the one<br/>selective index]
    L --> N[Execute Query]
    M --> N
    D --> N

    style J fill:#51cf66,color:#fff
    style D fill:#ff6b6b,color:#fff

CTA depth signal

Mentioning that the optimizer assigns cost values and compares execution plans — rather than simply “using indexes” — demonstrates a deeper understanding of the platform internals. State: “The query optimizer evaluates multiple execution paths, assigns cost estimates based on cardinality and selectivity, and selects the lowest-cost plan.”

The Query Plan Tool

The Query Plan tool (available in Developer Console) shows how the optimizer plans to execute a query.

Key fields in query plan output:

| Field | Meaning |
|---|---|
| Cardinality | Estimated number of records returned |
| Fields | Fields being evaluated for index use |
| Leading Operation Type | Index (good) or TableScan (bad for LDV) |
| Cost | Relative cost estimate — lower is better; a value below 1 means the query is selective enough to use an index |
| sObject Cardinality | Total records in the object |

CTA exam technique

When discussing LDV in your review board presentation, mention the Query Plan tool and selectivity thresholds by name. State specific numbers: “The custom index on External_ID__c will be selective because it targets fewer than 10% of the first million records.” This demonstrates practical depth.


Data Skew

Data skew occurs when data distribution is uneven, causing record locking, sharing recalculation delays, and query performance problems. Three types matter for the CTA exam.

graph TD
    subgraph AccountSkew["Account Data Skew"]
        A1[Account A<br/>50,000 Contacts] --- A2[Account B<br/>12 Contacts]
        A1 --- A3[Account C<br/>8 Contacts]
        style A1 fill:#ff6b6b,color:#fff
    end

    subgraph OwnershipSkew["Ownership Skew"]
        O1[User X<br/>Owns 200,000 records] --- O2[User Y<br/>Owns 500 records]
        O1 --- O3[User Z<br/>Owns 300 records]
        style O1 fill:#ff6b6b,color:#fff
    end

    subgraph LookupSkew["Lookup Skew"]
        L1["Campaign: 'General'<br/>150,000 Opps point here"] --- L2["Campaign: 'Q4 Push'<br/>200 Opps"]
        L1 --- L3["Campaign: 'Webinar'<br/>80 Opps"]
        style L1 fill:#ff6b6b,color:#fff
    end

    AccountSkew -->|"Causes"| R1[Sharing recalc delays<br/>Related list timeouts]
    OwnershipSkew -->|"Causes"| R2[Role hierarchy changes<br/>trigger mass re-sharing]
    LookupSkew -->|"Causes"| R3[Record lock contention<br/>on parent record]

Account Data Skew

When a single Account has an extremely large number of child records (Contacts, Opportunities, Cases).

Problems:

  • Sharing recalculation takes longer (all children re-evaluate)
  • Record locking during DML on child records
  • List views and related lists time out

Thresholds:

  • Caution zone: > 10,000 child records on a single parent
  • Danger zone: > 100,000 child records

Mitigations:

  • Split mega-accounts into logical sub-accounts (by region, division)
  • Use custom lookup instead of master-detail to avoid sharing cascade
  • Implement lazy loading for related lists
  • Archive older child records

Ownership Skew

When a single user or queue owns a disproportionate number of records.

Problems:

  • Sharing recalculation bottleneck when the owner changes roles
  • Role hierarchy changes trigger mass re-sharing
  • Transfer operations time out or hit limits

Thresholds:

  • Caution zone: > 10,000 records owned by one user
  • Danger zone: > 100,000 records

Mitigations:

  • Distribute ownership across multiple users or queues
  • Use a dedicated “system user” for integration-created records with appropriate sharing rules
  • Avoid assigning all records to a single queue

Lookup Skew

When a lookup field on many records points to the same parent record (e.g., all Opportunities referencing a “General” Campaign).

Problems:

  • Lock contention when updating the parent record
  • Performance degradation on the parent record page

Mitigations:

  • Avoid “catch-all” parent records
  • Consider removing optional lookups when values are rarely populated
  • Use deferred sharing calculation for batch operations
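
One common remediation for an existing catch-all parent is to spread children round-robin across several pre-created "bucket" parents. A minimal sketch — the class, campaign naming convention, and bucket count are all illustrative assumptions:

```apex
public class CampaignBucketAssigner {
    // Distribute new Opportunities across several bucket campaigns
    // (e.g. 'General 01' .. 'General 10') instead of one skewed parent,
    // reducing lock contention on any single Campaign record.
    public static void assign(List<Opportunity> opps) {
        List<Campaign> buckets = [
            SELECT Id FROM Campaign
            WHERE Name LIKE 'General %' ORDER BY Name
        ];
        Integer i = 0;
        for (Opportunity opp : opps) {
            opp.CampaignId = buckets[Math.mod(i++, buckets.size())].Id;
        }
    }
}
```

Reporting then aggregates across the bucket records, trading a small analytics inconvenience for the removal of a single contention hotspot.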

graph TD
    A[Data Skew Detected] --> B{Which type?}
    B --> C[Account Skew]
    B --> D[Ownership Skew]
    B --> E[Lookup Skew]

    C --> C1[Split mega-accounts]
    C --> C2[Archive child records]
    C --> C3[Switch to lookup relationship]

    D --> D1[Distribute ownership]
    D --> D2[Multiple queues]
    D --> D3[System integration user]

    E --> E1[Eliminate catch-all parents]
    E --> E2[Remove sparse lookups]
    E --> E3[Deferred sharing calc]

Archival Strategies

Not all data needs to live in the primary Salesforce database. Archival is about moving aged or infrequently accessed data to cheaper, more appropriate storage.

Archival Decision Flowchart

graph TD
    A[Data approaching LDV<br/>thresholds] --> B{Is data actively<br/>queried by users?}
    B -->|Yes| C{Can queries be<br/>optimized?}
    B -->|No| D{Compliance or<br/>legal retention?}
    C -->|Yes| E[Optimize indexes<br/>and queries first]
    C -->|No| F{Data needed for<br/>reporting?}
    D -->|Yes| G[Archive with<br/>retention policy]
    D -->|No| H[Delete with<br/>hard delete]
    F -->|Yes| I{Real-time or<br/>historical?}
    F -->|No| G
    I -->|Real-time| J[Keep on platform<br/>with skinny tables]
    I -->|Historical| K{Volume?}
    K -->|< 100M| L[Big Objects]
    K -->|> 100M| M[External Storage<br/>+ Salesforce Connect]

Archival Options

| Option | Best For | Limitations |
|---|---|---|
| Big Objects | Audit trails, historical data up to billions of records | Standard SOQL on indexed fields only (Async SOQL retired in Summer '21); no triggers, no reports |
| External storage + Connect | Massive datasets needing Salesforce visibility | Callout limits, latency, requires an OData adapter |
| Data Cloud | Analytics on large datasets, unified profiles | Separate license, learning curve |
| Hard delete | Data with no retention requirement | Irreversible, audit implications |
| Export to data lake | Long-term retention, compliance | No Salesforce visibility without Connect |

Batch Apex for LDV

Batch Apex is the primary tool for processing large datasets that exceed governor limits in synchronous contexts.

Key Design Patterns

  • Scope size: Default 200 records per execute. Reduce for complex processing, increase for simple operations (max 2,000)
  • Query locators: Database.QueryLocator supports up to 50 million records (vs 50,000 for standard SOQL)
  • Chaining: Chain batch jobs for multi-step processing (Job A finishes, starts Job B in finish())
  • Stateful: Use Database.Stateful to maintain state across execute calls (but increases memory usage)
  • Concurrency: Up to 5 batch jobs actively processing at once, with up to 100 batch jobs queued in the flex queue. The platform manages concurrency automatically
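
The patterns above can be sketched in a single class. This is a minimal illustration, not a production implementation — the class names, query, and the chained job are all hypothetical:

```apex
public class ArchiveOldCasesBatch implements
        Database.Batchable<SObject>, Database.Stateful {

    private Integer archivedCount = 0;  // Stateful: survives across execute() calls

    public Database.QueryLocator start(Database.BatchableContext bc) {
        // A QueryLocator streams up to 50M records into the job
        return Database.getQueryLocator(
            'SELECT Id FROM Case WHERE ClosedDate < LAST_N_YEARS:2');
    }

    public void execute(Database.BatchableContext bc, List<Case> scope) {
        archivedCount += scope.size();
        delete scope;  // or copy to a Big Object / external store first
    }

    public void finish(Database.BatchableContext bc) {
        System.debug('Archived ' + archivedCount + ' cases');
        // Chain the next stage of a multi-step process here, e.g.:
        // Database.executeBatch(new RebuildSummariesBatch(), 500);
    }
}

// Invocation with a reduced scope size for heavier per-record work:
// Database.executeBatch(new ArchiveOldCasesBatch(), 500);
```

Keep `Database.Stateful` members small (counters, error lists) — the entire state is serialized between batches, so large collections inflate heap and serialization cost.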

LDV Batch Patterns

sequenceDiagram
    participant Scheduler
    participant BatchJob
    participant Database
    participant ExternalSystem

    Scheduler->>BatchJob: start()
    BatchJob->>Database: QueryLocator (up to 50M records)
    loop Each batch of N records
        Database->>BatchJob: execute(scope)
        BatchJob->>BatchJob: Process records
        BatchJob->>Database: DML operations
    end
    BatchJob->>BatchJob: finish()
    BatchJob->>ExternalSystem: Send notification / chain next job

Platform Cache

Platform Cache stores frequently accessed data in memory to reduce SOQL queries and improve performance.

Cache Types

| Type | Scope | Capacity | Use Case |
|---|---|---|---|
| Org Cache | Shared across all users | Varies by edition — 10 MB default for Enterprise, 30 MB for Unlimited/Performance; additional capacity purchasable up to 3 GB | Configuration data, reference data, exchange rates |
| Session Cache | Per user session | Allocated per partition from the org's total cache allocation | User preferences, shopping cart, multi-step wizard |

LDV Cache Patterns

  • Reference data: Cache picklist values, product catalogs, pricing tiers to avoid repeated queries
  • Computed results: Cache roll-up calculations or aggregated metrics
  • External callout responses: Cache API responses from external systems
  • Cache-aside pattern: Check cache first, query database on miss, populate cache
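
The cache-aside pattern is a few lines of Apex. A minimal sketch, assuming a multi-currency org and a partition named `ratesPartition` (both the partition and the class are illustrative):

```apex
public class ExchangeRateService {
    // Fully qualified key: local namespace, partition name, key name
    private static final String KEY = 'local.ratesPartition.usdEur';

    public static Decimal getUsdEurRate() {
        // 1. Check the org cache first
        Decimal rate = (Decimal) Cache.Org.get(KEY);
        if (rate == null) {
            // 2. Cache miss: fall back to SOQL
            rate = [SELECT ConversionRate FROM CurrencyType
                    WHERE IsoCode = 'EUR' LIMIT 1].ConversionRate;
            // 3. Populate the cache with a 1-hour TTL
            Cache.Org.put(KEY, rate, 3600);
        }
        return rate;
    }
}
```

Because step 2 always works without the cache, eviction or TTL expiry degrades performance gracefully rather than breaking correctness — which is exactly the property the invalidation note below demands.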

Cache invalidation

Cache entries have a TTL (time-to-live) and can be evicted under memory pressure. Never assume cached data is current. Design your solution to gracefully handle cache misses with a fallback to SOQL.


LDV Strategy Decision Guide

graph TD
    A[LDV Performance<br/>Problem Identified] --> B{What symptom?}

    B -->|Slow list views| C[Check query selectivity]
    B -->|Slow reports| D[Check report filters<br/>and scope]
    B -->|API timeouts| E[Check bulk query<br/>batch size]
    B -->|Record locking| F[Check data skew]
    B -->|Sharing timeout| G[Check ownership<br/>and account skew]

    C --> C1{Query selective?}
    C1 -->|No| C2[Add custom index<br/>or skinny table]
    C1 -->|Yes| C3[Check data volume<br/>growth rate]
    C3 --> C4{Growth sustainable?}
    C4 -->|No| C5[Implement archival<br/>strategy]
    C4 -->|Yes| C6[Monitor and<br/>reassess quarterly]

    D --> D1[Add indexed date<br/>range filters]
    E --> E1[Reduce batch size<br/>Use PK chunking]
    F --> F1[Split skewed records]
    G --> G1[Redistribute ownership]

LDV Checklist for CTA Scenarios

When you encounter LDV in a CTA scenario, systematically address:

  1. Identify LDV objects — Which objects will exceed 1M records? In what timeframe?
  2. Query patterns — How will users access this data? List views, reports, searches?
  3. Index strategy — Which fields need custom indexes? Any skinny table candidates?
  4. Data skew risk — Are there mega-accounts, single owners, or catch-all lookups?
  5. Archival plan — What is the retention policy? Where will archived data go?
  6. Growth projection — What is the data growth rate? When will you hit the next threshold?
  7. Batch processing — What jobs run against LDV objects? Appropriate scope sizes?
  8. Integration impact — How do integrations query LDV objects? Bulk API? PK chunking?
  9. Monitoring — How will you detect performance degradation before users notice?

Cross-Domain Impact

  • Security — Data skew and ownership distribution affect sharing performance (Security)
  • Integration — Bulk API strategies and PK chunking for LDV extracts (Integration)
  • Data Modeling — Relationship type affects query performance (Data Modeling)
  • System Architecture — Org strategy and multi-org implications for LDV (System Architecture)
