Skip to content

Large Data Volumes

Large Data Volume (LDV) is one of the highest-impact areas for a CTA. Performance problems caused by data volume cascade through every layer of the solution: UI responsiveness, report performance, integration throughput, and user satisfaction. The platform handles millions of records well when designed for it. It handles them terribly otherwise.

What Qualifies as LDV

There is no single magic number, but these thresholds signal LDV territory:

VolumeClassificationConcern Level
< 500K records per objectNormalStandard best practices suffice
500K - 1M recordsApproaching LDVStart monitoring query plans
1M - 10M recordsLDVActive optimization required
10M - 100M recordsHigh LDVSkinny tables, custom indexes, archival needed
100M+ recordsExtreme LDVConsider Big Objects, external storage, or Data Cloud

It is relative, not absolute

LDV is relative to query patterns. A 500K-record object with no filters in list views is worse than a 5M-record object with highly selective queries. Always evaluate volume in the context of how data is accessed.


Indexing Deep Dive

Indexes are the primary defense against LDV performance issues. Understanding what is indexed, what can be indexed, and how the query optimizer uses indexes is essential.

Standard Indexes (Automatic)

These fields are automatically indexed on every object:

FieldNotes
IdPrimary key, always indexed
NameStandard name field
OwnerIdRecord owner
CreatedDateRecord creation timestamp
LastModifiedDateLast modification timestamp
SystemModstampSystem modification timestamp
RecordTypeIdRecord type identifier
Lookup fieldsAll lookup and master-detail fields
Unique fieldsAny field marked as unique or external ID

Custom Indexes

Custom indexes must be requested from Salesforce Support (or enabled in specific editions). Types:

Index TypeDescriptionUse Case
Single-columnIndex on one custom fieldFilter by a single field
Two-columnComposite index on two fieldsFilter by two fields together
Skinny tableDenormalized table with subset of fieldsComplex objects with many fields

Skinny Tables

Skinny tables are a special Salesforce optimization for objects with many columns. They create a separate, narrow table containing only the fields needed for specific queries.

Diagram showing how the query optimizer routes queries to a narrow skinny table instead of the wide base object table, reducing row scan overhead for LDV objects.
Figure 1. Skinny tables work by materializing a narrow subset of fields from a wide object into a separate table that excludes soft-deleted records. The query optimizer selects the skinny path only when all requested columns fit within it. Any additional column forces the full base table scan.

Characteristics:

  • Must be requested from Salesforce Support
  • Contain a subset of fields from the original object
  • Do not include soft-deleted records (IsDeleted = true), which alone can significantly reduce table size
  • Rows are automatically synchronized between the base object and the skinny table
  • The query optimizer determines at runtime whether to use the skinny table
  • Custom indexes on the base table are replicated on the skinny table and often perform better due to reduced table joins
  • Work with both standard and custom objects

Skinny table maintenance

Adding or removing fields from a skinny table requires a Salesforce Support case. New fields added to the object are not automatically included. This creates an operational dependency that must be factored into the governance model.

CTA Exam Relevance

The board probes whether candidates know when skinny tables are the right tool versus when custom indexes suffice. Skinny tables are not a default recommendation because they add governance overhead (Support case for every field change). Be prepared to articulate the specific scenario: wide objects (50+ columns) with queries that use a narrow subset of fields, where custom indexes alone are insufficient because the table scan overhead is in row width, not selectivity.

Indexing Architecture

The Salesforce multitenant architecture stores custom field data in a way that makes the underlying database tables unsuitable for direct indexing. The platform solves this by creating separate index tables that mirror the data in an indexable format.

Diagram of the Salesforce multitenant index layer showing how separate index tables mirror field data from the shared base data table to support standard and custom indexes.
Figure 2. Because the multitenant data table uses generic value columns shared across all orgs and objects, direct database indexing is not possible. Salesforce maintains separate index tables per field type, which is why custom indexes require a Support case. The platform controls index creation to balance resource usage across all tenants.

Why custom indexes require support cases

The platform manages index tables internally, so only Salesforce can add or modify custom indexes. This is by design in the multitenant architecture: the platform must control index creation to balance resource usage across all tenants on shared infrastructure.


Query Selectivity

The Salesforce query optimizer decides whether to use an index or perform a full table scan. The decision is based on selectivity, meaning how many records the filter is expected to return.

Selectivity Thresholds

Flowchart showing how the query optimizer decides between index path and full table scan based on selectivity thresholds for standard and custom indexes on LDV objects.
Figure 3. Standard indexes allow up to 30% selectivity on the first million records; custom indexes are stricter at 10%. When a filter exceeds its threshold, the optimizer abandons the index and falls back to a full table scan, which causes timeouts on objects with millions of records.

Selectivity Rules

ConditionThresholdNotes
Standard index30% of the first 1M records, then 15% beyond 1M. Max threshold: 1,000,000 targeted recordsTiered calculation
Custom index10% of the first 1M records, then 5% beyond 1M. Max threshold: 333,333 targeted recordsTiered calculation
Multiple filters (AND)Each filter independently selective, OR combined selectiveOptimizer evaluates individually
Multiple filters (OR)Each filter must be selectiveOne non-selective filter ruins the query
NOT operatorGenerally non-selectiveAvoid NOT in WHERE clauses on LDV objects
NULL checksNon-selectiveWHERE Field__c = null almost never selective

Query Optimizer Internal Flow

The Lightning Platform query optimizer acts as a pre-processor between SOQL and the underlying database. It evaluates multiple potential execution plans, assigns a cost to each, and selects the cheapest path.

Flowchart of the Lightning Platform query optimizer evaluating indexed filters, assigning cost estimates to each execution plan, and selecting the lowest-cost path.
Figure 4. The Lightning Platform query optimizer is cost-based, not rule-based. It assigns a cardinality estimate to every possible execution path and picks the cheapest one. This means a table scan can be chosen over an index when the optimizer calculates that fewer rows would be scanned that way.
Detailed walkthrough

When a SOQL query hits the optimizer, execution starts with parsing the WHERE clause to identify every filter field. The optimizer then checks whether each field has an index associated with it. Fields without any index exit immediately to a full table scan. For indexed fields, the optimizer moves to selectivity evaluation: it consults statistics on the index table to estimate what fraction of total records each filter will return.

That fraction is checked against the selectivity threshold for the index type. Standard indexes allow up to 30% of the first million records (then 15% beyond that, capped at one million total targeted rows). Custom indexes are stricter: 10% of the first million, 5% beyond, capped at 333,333 rows. If the estimated cardinality for a filter sits below its threshold, the index path gets a cost assigned. If it exceeds the threshold, the optimizer discards that index path entirely and only the table scan remains as an option.

When multiple indexed filters are present, the optimizer assigns a cost to each viable index path independently, then compares all of them against the table scan cost. The path with the lowest estimated cost wins. Note that a table scan can win. On a 10-million-record object with a Status__c field that has only four distinct values, a filter of WHERE Status__c = 'Open' might target 30-40% of rows, pushing it above the custom index threshold. Even with a custom index on Status__c, the optimizer falls back to a table scan because the index would still touch too many rows. Adding AND CreatedDate > LAST_N_DAYS:30 changes the picture if CreatedDate (a standard indexed field) is selective enough on its own: the optimizer may choose the CreatedDate index path and ignore Status__c entirely, because CreatedDate alone produces a cheaper scan than the composite approach.

At the board, the Query Plan tool is the vocabulary to use. Its output shows four fields that matter: Cost (the optimizer’s relative estimate, lower is better), Cardinality (estimated rows the plan returns), Leading Operation Type (Index or TableScan), and sObject Cardinality (total rows in the object). When a query returns TableScan as the leading operation on a 5M-record object, the diagnosis is: either the filter field is unindexed, or it is indexed but exceeds the selectivity threshold. The fix path follows from which one it is.

CTA depth signal

Mentioning that the optimizer assigns cost values and compares execution plans, rather than simply “using indexes,” shows deeper understanding of platform internals. State: “The query optimizer evaluates multiple execution paths, assigns cost estimates based on cardinality and selectivity, and selects the lowest-cost plan.”

The Query Plan Tool

The Query Plan tool (available in Developer Console) shows how the optimizer plans to execute a query.

Key fields in query plan output:

FieldMeaning
CardinalityEstimated number of records returned
FieldsFields being evaluated for index use
Leading Operation TypeIndex (good) or TableScan (bad for LDV)
CostRelative cost estimate; lower is better
sObject CardinalityTotal records in the object

CTA exam technique

When discussing LDV in your review board presentation, mention the Query Plan tool and selectivity thresholds by name. State specific numbers: “The custom index on External_ID__c will be selective because it targets fewer than 10% of the first million records.” This demonstrates practical depth.


Data Skew

Data skew occurs when data distribution is uneven, causing record locking, sharing recalculation delays, and query performance problems. Three types matter for CTA.

Diagram illustrating account skew, ownership skew, and lookup skew with real-world record distribution examples and the performance problems each type causes.
Figure 5. All three skew types stem from uneven data distribution. Account skew (mega-accounts with thousands of children) triggers sharing recalculation cascades. Ownership skew creates re-sharing bottlenecks when users change roles. Lookup skew causes row-level lock contention on the over-referenced parent record.
Detailed walkthrough

Each skew type has a distinct failure mode, and understanding the mechanism helps you explain the risk to a review board without sounding like you memorized a list.

Account skew activates when a single Account has more than 10,000 child records (Contacts, Opportunities, Cases). The caution threshold is 10K; the danger threshold is 100K. The failure mode is sharing recalculation: every time the Account’s sharing settings change (owner changes role, a sharing rule is added, a manual share is granted), the platform must re-evaluate access for every child record. At 50,000 children on one Account, that recalculation runs against 50,000 rows before it can finish. In org-wide configurations where sharing calc is synchronous (or where the async queue is already loaded), users see list view timeouts and related list errors while the recalculation runs. The mitigation is structural: split the mega-account into regional or divisional sub-accounts so no single parent accumulates enough children to trigger the cascade.

Ownership skew activates when a single user or queue owns more than 10,000 records (danger: 100K). The failure mode is role hierarchy propagation: when that user is moved in the role hierarchy, the platform must re-share every record they own with the new set of managers and peers above them. At 200,000 records, a single org restructuring event can queue hundreds of thousands of re-sharing operations. Integration patterns frequently cause this because batch loads often assign all records to a single system user for convenience. The fix is to distribute ownership across multiple queues or system users from the start, before volume accumulates.

Lookup skew is mechanically different. It does not involve sharing. It is a row-level lock problem. When 150,000 Opportunity records all reference the same Campaign record via a lookup field, any DML that touches those Opportunity records (a batch update, a trigger firing on save) must acquire a shared lock on the parent Campaign row. If multiple processes run DML on Opportunity records concurrently, they each try to lock the Campaign row. Lock wait timeouts follow. The mitigation is to eliminate the catch-all parent: if a “General” Campaign exists only because the field is required and teams pick the default, the field design is the root cause.

For CTA scenarios with large B2C data models, raise account skew proactively when you see an Account with millions of consumer-linked records. The board rewards candidates who identify it before being prompted.

CTA Exam Relevance

The board frequently tests data skew because candidates who design clean data models in theory often fail to account for real-world distribution patterns. Identify skew risks in the scenario’s data model proactively. The board rewards candidates who raise skew mitigation before being asked, especially for Account skew in B2C scenarios with mega-accounts.

Account Data Skew

A single Account has an extremely large number of child records (Contacts, Opportunities, Cases).

Problems:

  • Sharing recalculation takes longer (all children re-evaluate)
  • Record locking during DML on child records
  • List views and related lists time out

Thresholds:

  • Caution zone: > 10,000 child records on a single parent
  • Danger zone: > 100,000 child records

Mitigations:

  • Split mega-accounts into logical sub-accounts (by region, division)
  • Use custom lookup instead of master-detail to avoid sharing cascade
  • Implement lazy loading for related lists
  • Archive older child records

Ownership Skew

A single user or queue owns a disproportionate number of records.

Problems:

  • Sharing recalculation bottleneck when the owner changes roles
  • Role hierarchy changes trigger mass re-sharing
  • Transfer operations time out or hit limits

Thresholds:

  • Caution zone: > 10,000 records owned by one user
  • Danger zone: > 100,000 records

Mitigations:

  • Distribute ownership across multiple users or queues
  • Use a dedicated “system user” for integration-created records with appropriate sharing rules
  • Avoid assigning all records to a single queue

Lookup Skew

A lookup field on many records points to the same parent record (e.g., all Opportunities referencing a “General” Campaign).

Problems:

  • Lock contention when updating the parent record
  • Performance degradation on the parent record page

Mitigations:

  • Avoid “catch-all” parent records
  • Consider removing optional lookups when values are rarely populated
  • Use deferred sharing calculation for batch operations
Decision tree mapping each skew type to its specific mitigations: splitting mega-accounts, distributing ownership across queues, and eliminating catch-all parent records.
Figure 6. Each skew type has distinct mitigations. Account skew requires structural changes such as splitting accounts or archiving children. Ownership skew requires distributing records across users and queues. Lookup skew requires eliminating the catch-all parent pattern that concentrates lock contention.

Archival Strategies

Not all data needs to live in the primary Salesforce database. Archival moves aged or infrequently accessed data to cheaper, more appropriate storage.

Archival Decision Flowchart

Flowchart routing growing data through query optimization, retention policy checks, and volume thresholds to select between Big Objects, external storage, or hard deletion.
Figure 7. Archival decisions hinge on three questions: how often users query the data, whether compliance mandates retention, and whether the volume exceeds Big Object capacity. Data under 100M records that must stay queryable on-platform fits Big Objects; higher volumes require external storage with Salesforce Connect for access.

Archival Options

OptionBest ForLimitations
Big ObjectsAudit trails, historical data up to billionsStandard SOQL on indexed fields only (Async SOQL retired Summer 2023 — use Batch Apex or Bulk API), no triggers, no reports
External storage + ConnectMassive datasets needing Salesforce visibilityCallout limits, latency, requires OData adapter
Data CloudAnalytics on large datasets, unified profilesSeparate license, learning curve
Hard deleteData with no retention requirementIrreversible, audit implications
Export to data lakeLong-term retention, complianceNo Salesforce visibility without Connect

Batch Apex for LDV

Batch Apex is the primary tool for processing large datasets that exceed governor limits in synchronous contexts.

Key Design Patterns

  • Scope size: Default 200 records per execute. Reduce for complex processing; increase for simple operations (max 2,000)
  • Query locators: Database.QueryLocator supports up to 50 million records (vs 50,000 for standard SOQL)
  • Chaining: Chain batch jobs for multi-step processing (Job A finishes, starts Job B in finish())
  • Stateful: Use Database.Stateful to maintain state across execute calls (but increases memory usage)
  • Concurrency: Up to 5 batch jobs actively processing at once, with up to 100 batch jobs queued in the flex queue. The platform manages concurrency automatically

LDV Batch Patterns

Sequence diagram showing the Batch Apex lifecycle from scheduler through QueryLocator chunking, per-batch execute calls with DML, and finish-method chaining to external notification.
Figure 8. Batch Apex processes LDV data by splitting a QueryLocator result (up to 50M records) into configurable chunks. Each execute() call receives one chunk, keeping every operation within governor limits. The finish() method enables chaining to a subsequent batch job or triggering an external notification on completion.

Platform Cache

Platform Cache stores frequently accessed data in memory, reducing SOQL queries and improving performance.

Cache Types

TypeScopeCapacityUse Case
Org CacheShared across all usersVaries by edition; Enterprise gets 10 MB default, Unlimited/Performance get 30 MB default. Additional capacity purchasable.Configuration data, reference data, exchange rates
Session CachePer user sessionSession cache allocation is partition-based, drawn from org total allocationUser preferences, shopping cart, multi-step wizard

LDV Cache Patterns

  • Reference data: Cache picklist values, product catalogs, pricing tiers to avoid repeated queries
  • Computed results: Cache roll-up calculations or aggregated metrics
  • External callout responses: Cache API responses from external systems
  • Cache-aside pattern: Check cache first, query database on miss, populate cache

Cache invalidation

Cache entries have a TTL (time-to-live) and can be evicted under memory pressure. Never assume cached data is current. Design the solution to handle cache misses gracefully, with a fallback to SOQL.

CTA Exam Relevance

The board rarely asks about Platform Cache directly, but mentioning it as part of an LDV mitigation strategy signals architectural maturity. Weaving in “we cache reference data and computed aggregates using Platform Cache to reduce SOQL pressure on high-volume objects” shows thinking beyond the database layer.


LDV Strategy Decision Guide

Decision tree mapping LDV symptoms such as slow list views, API timeouts, record locking, and sharing timeouts to targeted remediation strategies including indexing, archival, and skew mitigation.
Figure 9. LDV symptoms map to distinct root causes. Slow list views point to query selectivity gaps. API timeouts in Bulk API jobs call for PK chunking. Record locking and sharing timeouts both trace back to data skew, with the symptom differing based on whether the skew is in lookups or in ownership distribution.

LDV Checklist for CTA Scenarios

When LDV appears in a CTA scenario, address these points systematically:

  1. Identify LDV objects: Which objects will exceed 1M records? In what timeframe?
  2. Query patterns: How will users access this data? List views, reports, searches?
  3. Index strategy: Which fields need custom indexes? Any skinny table candidates?
  4. Data skew risk: Are there mega-accounts, single owners, or catch-all lookups?
  5. Archival plan: What is the retention policy? Where will archived data go?
  6. Growth projection: What is the data growth rate? When will the next threshold hit?
  7. Batch processing: What jobs run against LDV objects? Appropriate scope sizes?
  8. Integration impact: How do integrations query LDV objects? Bulk API? PK chunking?
  9. Monitoring: How will performance degradation be detected before users notice?


Sources

Personal study notes for the Salesforce CTA exam. Content compiled from VJ's study notes, official Salesforce documentation, community sources, and online publicly available content, then organized and presented with AI assistance. Not affiliated with Salesforce. © 2025–2026 VJ Srivastava.