Post on 19-Nov-2014
Managing Large Data Volumes
Suchin Rengan, Director, Salesforce Services
@Sacrengan
Mahanthi Gangadhar, Senior Solutions Technical Architect, Salesforce Services
Safe harbor
Safe harbor statement under the Private Securities Litigation Reform Act of 1995: This presentation may contain forward-looking statements that involve risks, uncertainties, and assumptions. If any such uncertainties materialize or if any of the assumptions proves incorrect, the results of salesforce.com, inc. could differ materially from the results expressed or implied by the forward-looking statements we make. All statements other than statements of historical fact could be deemed forward-looking, including any projections of product or service availability, subscriber growth, earnings, revenues, or other financial items and any statements regarding strategies or plans of management for future operations, statements of belief, any statements concerning new, planned, or upgraded services or technology developments and customer contracts or use of our services. The risks and uncertainties referred to above include – but are not limited to – risks associated with developing and delivering new functionality for our service, new products and services, our new business model, our past operating losses, possible fluctuations in our operating results and rate of growth, interruptions or delays in our Web hosting, breach of our security measures, the outcome of any litigation, risks associated with completed and any possible mergers and acquisitions, the immature market in which we operate, our relatively limited operating history, our ability to expand, retain, and motivate our employees and manage our growth, new releases of our service and successful customer deployment, our limited history reselling non-salesforce.com products, and utilization and selling to larger enterprise customers. Further information on potential factors that could affect the financial results of salesforce.com, inc. is included in our annual report on Form 10-K for the most recent fiscal year and in our quarterly report on Form 10-Q for the most recent fiscal quarter.
These documents and others containing important disclosures are available on the SEC Filings section of the Investor Information section of our Web site. Any unreleased services or features referenced in this or other presentations, press releases or public statements are not currently available and may not be delivered on time or at all. Customers who purchase our services should make the purchase decisions based upon features that are currently available. Salesforce.com, inc. assumes no obligation and does not intend to update these forward-looking statements.
We do a lot of things with data…
Data Creation (Loads / Manual)
Data Searches / Reporting / List Views
Data Extracts
Data Archival
Data Integration (Out and In)
Am I using the platform’s features optimally?
How do we ensure we keep up with performance?
What factors do we need to consider across each topic?
How can I ensure I have a scalable process?
SFDC Cloud Computing for the Enterprise
Infrastructure Services: Network, Storage, Operating System, Database, App Server, Web Server, Data Center
Application Services: Security, Sharing, Integration, Customization, Web Services, API, Multi-Language
Operations Services: Authentication, Availability, Monitoring, Patch Mgmt, Upgrades, Backup / NOC
Innovation Development: Data Model, Business Logic, User Interface
Customer avoids:
- Tuning the OS, capacity management
- Tuning web servers, certificates, log file management
- Tuning app servers, threads, the Java stack, memory/log management
- Tuning DB servers, memory management, disk distribution
- Network management, bandwidth
Let's understand the underlying platform
[Diagram: standard objects (User, Account) and custom objects on the platform, connected to your system through the API and Apex.]
Data Loads
Suchin Rengan
Mahanthi Gangadhar
Considerations at every layer and level
Logic / Application Layer: Apex Triggers, Validation Rules, Workflow Rules, API, Visualforce, Apex
Storage Layer: File Storage, Data Storage, Data Objects, Sharing Tables, Indexes, Skinny Tables
Environments: Sandbox
SOAP/REST API
- SOAP API: real time; relatively slow for large volumes; suited to loads of up to 250K records
- Bulk API: batches per day; parallel mode; for loads larger than 250K records; rolling 24-hour clock for available batches; allow time to download results; watch batch size and time out / failure handling
Bulk API – Asynchronous Process
[Diagram: the client sends all data; it is streamed to temporary storage as data batches; processing servers dequeue batches onto parallel processing threads, insert/update the records, and save the results; the client checks status and retrieves results. The job is updated in Admin Setup, and the dataset is processed in parallel.]
Bulk API
- The “go-to” option for tens of thousands of records and up
- Up to 10,000 records in a batch file
- Asynchronous loading, tracked in Setup’s “Monitor Bulk Data Load Jobs” section
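Because each batch file is capped at 10,000 records, a large load has to be split client-side before the batches are submitted to a job. A minimal sketch of that split in plain Python (no Salesforce client library; only the batch-size limit comes from the slide):

```python
def split_into_batches(records, batch_size=10_000):
    """Split a record list into Bulk API sized batches.

    The Bulk API accepts at most 10,000 records per batch file, so a
    25,000-record load becomes three batches submitted to one job.
    """
    return [records[i:i + batch_size]
            for i in range(0, len(records), batch_size)]

batches = split_into_batches([{"Name": f"Acct {n}"} for n in range(25_000)])
# 25,000 records -> 3 batches of 10,000 + 10,000 + 5,000
```

Each element of `batches` would then be serialized (e.g. to CSV) and added to the job as one batch.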
Walkthrough time!
Example: American Insurance Co. processed 230 million records in 33 hours, 14 hours ahead of schedule.
Some tips for loading
- Difficult to extrapolate performance from a sandbox; production is at par or better
- Sharing calculations
- Indexing
- File storage
- Triggers: act judiciously
- Upserts: avoid them!
- Parent references
Sequence of Events/Logic
1. Begin transaction
2. System validation
3. Before triggers (Apex layer)
4. Custom validation
5. Record saved to the database (database layer)
6. After triggers (Apex layer)
7. Assignment rules
8. Auto-response rules
9. Workflow rules
10. Escalation rules
11. Parent rollup summary
12. End transaction (commit)
Initial Load – Incremental or Big Bang
[Diagram: two sequencing options across Obj 1, Obj 2, Obj 3. Legend: Pre-Implementation Activities, Initial Load Validation, Catch-up and Ongoing Sync, User Activation, All Objects.]
Audience Question!
Scenario: Data Load on …
Data Extraction
Mahanthi Gangadhar
Data Extraction – Bulk Query
Bulk query works just like data loads:
- Create a job (Job Id)
- Each query is a batch (Batch Id)
- Close the job and fetch results when the job is complete
Current limitations:
- The query optimizer has 100 minutes of processing time (timeout issue)
- Informatica currently does not support bulk query
- Other tools, like Data Loader, can submit only one query
- Submitting multiple queries for a given job currently requires a custom client
Data Extraction – Chunking
Auto number chunking:
- Query smaller chunks of the entire data set
- Use an auto number field and a formula field for internal indexing
- Find chunk boundaries (25K) and issue a query for each chunk
PK chunking:
- Use the primary key (Id) to chunk
- Usually better performance when the entire object is extracted
- Find chunk boundaries (250K) and issue a query for each chunk
Auto chunking (“safe harbor”):
- Provide a single query and everything happens magically!
Data Extraction – Chunking (workflow)
Find chunk boundaries -> create job -> submit each batch/query -> get results -> close job
Example: Job Id (1234) with batches Q1 -> Batch Id (123), Q2 -> Batch Id (234), ..., Qn -> Batch Id (789); close Job Id (1234). The queries Q1..Qn together cover the full range of roughly 1 to 150M records.
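The chunking steps above can be sketched as boundary math plus query generation. This is a hypothetical helper, not a Salesforce tool: `AutoNum__c` is an assumed auto-number-backed formula field used for internal indexing, and the boundaries are half-open so every record falls into exactly one chunk.

```python
def chunk_queries(soql_base, min_val, max_val, chunk_size):
    """Generate one SOQL query per chunk of an indexed numeric field.

    Splits the inclusive range [min_val, max_val] into half-open
    chunks of chunk_size, e.g. [1, 250001), [250001, 500001), ...
    Each resulting query becomes one batch in the bulk query job.
    """
    queries = []
    lo = min_val
    while lo <= max_val:
        hi = min(lo + chunk_size, max_val + 1)
        queries.append(
            f"{soql_base} WHERE AutoNum__c >= {lo} AND AutoNum__c < {hi}")
        lo = hi
    return queries

# 150M records at 250K per chunk -> 600 queries (Q1 .. Q600)
qs = chunk_queries("SELECT Id, Name FROM Account", 1, 150_000_000, 250_000)
```

PK chunking would follow the same shape, but with boundaries taken from sampled record Ids instead of a numeric sequence.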
Data Cleanup – General Considerations
- Deletion is one of the most resource-intensive data operations in Salesforce and can perform even slower than a data load in some cases (objects with complex sharing, master/detail relationships, rollup summary fields, etc.), even in bulk hard delete mode.
- Custom objects can be cleaned up by truncation. Note that truncation cannot be performed on out-of-the-box standard objects.
- Data in standard objects can be deleted only via the delete API; the faster-performing Bulk Hard Delete option is available.
- Records in the User object cannot be deleted, only deactivated.
Data Cleanup – Truncation and Hard Delete
Truncation is not possible for objects that:
- Are referenced by another non-empty object through a lookup field
- Are referenced in an analytic snapshot
- Contain a geolocation custom field
- Have a custom index or an external ID field
Hard Delete:
- The Hard Delete option is disabled by default and must be enabled by an administrator
- We observed about 4.5M records/hr deleting from the Task object, versus 18M/hr at load time, indicative of how slow a process deletion is
Data Cleanup – General Recommendations
- When removing large amounts of data from a sandbox org, consider the Refresh option first. This is the fastest way to clean up an org, and an additional benefit is that data in the User object is removed as well.
- To remove data from a custom object, use the truncate function. Truncation deletes the object's data entirely and places it into the recycle bin. Use the Erase function to physically remove the data from the system and free up storage space. Note that it might take several hours to erase data from a large custom object.
- To remove data from a standard object, use the bulk hard delete option. Its performance is quite low, so plan sufficient time to remove large amounts of data this way. A recent test on the Task object in a shadow production org demonstrated a rate of ~2M records deleted per hour; on the Account object the rate might be even lower.
- When planning large-scale tests in a production environment, consider the possibility of getting two orgs used in "round robin" mode: while a test is performed in one org, cleanup is performed in the other.
Tips
- Test early and often, with data sets as big as possible
- Split the initial load into smaller subsets (we used 10MM records); this allows greater flexibility in load monitoring and control
- Queries: for aggregate data validation queries (since the bulk query option is not available for them), consider using Workbench with the special asynchronous background-processing configuration that prevents early client-side timeouts (more info: http://wiki.developerforce.com/page/Workbench). The publicly available Workbench app at https://workbench.developerforce.com/login.php can be utilized.
ETL tool considerations:
- Timeout settings
- Bulk query support
- Handling of success and error files
- Monitoring of job status
General Guidelines – Tips / Partner Tools
Audience Question!
Scenario: Data Load on …
Other Areas
Suchin Rengan
Searching, Reporting, and List Views
- Indexes
- Filtered queries
- Skinny tables
- Data scope
- Roles and visibility
- Report folders
Data Governance
- Data model
- Data management and administration
- Security and access controls
Data Archival and Retention
- Storage
General guidelines for other areas
Initial Load – Parent References
- Avoid referencing the parent record via External Id if possible; use the SFDC Id instead. This is particularly important on large objects with multiple parent references.
- Consider preparing source data sets with the native SFDC Id for parent references. This can be achieved by querying key mapping data from the parent objects and performing joins on the client side to retrieve the SFDC Id, rather than using an External Id reference. Additional benefit: this is a good data validation step.
- Note that querying data from extra-large objects can be a challenge in itself and often takes hours. Consider alternative approaches, for example collecting mapping keys from the initial upload log files.
- Referencing the parent via the native SFDC Id will also allow the use of the FAST LOAD option when available (next release, “safe harbor”)
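The client-side join described above can be sketched as a dictionary lookup. This is a hypothetical illustration: `Parent_Ext_Id__c` and `ParentId` are assumed field names, and `keymap` stands in for the (external id -> SFDC Id) mapping queried from the parent object or collected from upload logs.

```python
def resolve_parent_ids(children, keymap,
                       ext_field="Parent_Ext_Id__c", id_field="ParentId"):
    """Replace External Id parent references with native SFDC Ids.

    Rows whose external id has no match are returned separately for
    review, which doubles as the data validation step mentioned above.
    """
    resolved, orphans = [], []
    for row in children:
        sfdc_id = keymap.get(row.get(ext_field))
        if sfdc_id is None:
            orphans.append(row)
        else:
            out = dict(row)
            out.pop(ext_field)          # drop the external-id reference
            out[id_field] = sfdc_id     # reference parent by SFDC Id
            resolved.append(out)
    return resolved, orphans

keymap = {"A-001": "001xx000003DGb1AAG"}  # hypothetical parent Id
rows = [{"Name": "c1", "Parent_Ext_Id__c": "A-001"},
        {"Name": "c2", "Parent_Ext_Id__c": "A-999"}]
resolved, orphans = resolve_parent_ids(rows, keymap)
```

`resolved` is then loaded with native SFDC Id references; `orphans` surfaces bad parent keys before the load rather than as load errors.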
Initial Load – Data Validation
The following post-load data validation checks can generally be considered:
- Attribute-to-attribute comparison
- Total counts comparison
- Aggregate values/subtotals comparison (numeric and currency values)
- Spot checking
- Negative validations
XLDV considerations:
- Data validation on XLDV is a challenging task. Queries on tables with hundreds of millions of records are slow and often time out (even bulk queries). Consider splitting (chunking) your data validation queries by filtering on indexed attributes.
- Special chunking techniques: auto number chunking, PK chunking
- Do not underestimate this step; plan enough time for post-load data validation on XLDV.
Incremental Syncs
- For SOAP APIs, use the largest batch size (200).
- During incremental syncs on objects with a large amount of processing, the biggest batches can fail due to timeouts (10 min per batch). In that case the batch size might need to be reduced.
- It is possible to “batch” standard SOAP API calls by submitting several jobs in parallel. When loading data into the same object using several parallel SOAP API batches, consider grouping child records by parent within the same batch to minimize DB locking conflicts.
- For incremental syncs of larger data sets (over 50K-100K rows), use bulk APIs.
- Consider configuring the client application to programmatically set the load mode (SOAP vs. Bulk) based on the size of the incremental data set in each load session.
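The last two points above can be sketched as two small client-side helpers. These are illustrative assumptions, not Salesforce APIs: the 100K threshold is the upper bound of the range the slide gives, and `ParentId` is an assumed field name.

```python
from itertools import chain

BULK_THRESHOLD = 100_000   # slide suggests 50K-100K; upper bound assumed
SOAP_BATCH_SIZE = 200      # largest SOAP API batch size

def choose_api(record_count, threshold=BULK_THRESHOLD):
    """Pick the load mode per session based on incremental data-set size."""
    return "bulk" if record_count > threshold else "soap"

def soap_batches_grouped_by_parent(rows, parent_field="ParentId"):
    """Build 200-record SOAP batches with each parent's children kept
    together, reducing DB locking conflicts across parallel batches."""
    by_parent = {}
    for row in rows:
        by_parent.setdefault(row[parent_field], []).append(row)
    ordered = list(chain.from_iterable(by_parent.values()))
    return [ordered[i:i + SOAP_BATCH_SIZE]
            for i in range(0, len(ordered), SOAP_BATCH_SIZE)]

rows = [{"ParentId": f"P{n % 2}", "Name": f"c{n}"} for n in range(6)]
batches = soap_batches_grouped_by_parent(rows)
```

Grouping is best-effort: a parent with more children than one batch holds will still straddle a boundary, but contention between parallel batches is minimized.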
Data Loads – Current Volumes
Object (approx. 1A / 1B # of rows): User Role 60,000; User 200,000; Account (+Contact) 186,000,000; Agent Role 155,000,000; Relationship 11,000,000; Household 96,000,000; Household Member 155,000,000; Opportunity 100,000,000; Task 104,500,000 / 120,000,000; Remark 16,500,000 / 18,000,000; Note 15,000,000; Attachment 6,000,000; Life Event 900,000; External Agreement 33,540,000; Agent Marketing Inf. 26,000,000; Policy 149,000,000; Policy Role 165,000,000; Agent Agreement Role 172,000,000; Billing Account 23,220,000; Billing Account Role 23,220,000; Total 845,413,000 / 730,880,000
- Initial volume close to 3.5B
- Worked with the customer to reduce volumes
- Split the implementation into two phases
Data Loads – Org/Environment Preparation and Tuning
For XLDV it is recommended to perform test loads in production-like environments:
- To get a true representation of production performance
- To test loads against environment-specific settings in a real production environment
- To allow more accurate planning of the data load in terms of timing and dependencies
Concerns:
- Large-scale deletion / environment cleanup after tests. Deletion is a resource-intensive data operation and the slowest (even using bulk hard delete).
- You cannot predict in advance all possible issues and consequences of multiple mass data deletions in an actual production environment.
- Users cannot be deleted from a production environment, only deactivated. Multiple user load tests can produce a “garbage pile” of inactive User records that clutters the production environment.
General best-practice recommendations:
- Request a dedicated production test environment to avoid testing in the actual production org.
- Work with SFDC Operations on plans for re-provisioning the test production environment, because erasing XLDV data from an org between tests might not be feasible.
- Coordinate with SFDC Operations on any large-scale test activities in a production environment (systems can be shut down by SFDC Operations as a suspected DDoS attack).
- Plan accordingly and factor in the time required to set up and configure test production environments when multiple production test rounds are required (a production org cannot be refreshed like a sandbox org, so it must be re-provisioned and configured from scratch).
Data Loads – Org/Environment Preparation and Tuning (contd.)
For large-volume data loads it is possible to request additional temporary changes on the SFDC side to expedite load performance:
- Request an increase of the batch API limit
- Request an increase of file storage
- Notify SFDC Operations about upcoming loads, with approximate data volumes
- Request turning off Oracle Flashback
- Use the FASTLOAD option (future option, “safe harbor”)
Note that some settings/perms can technically be tweaked only in production environments, not in sandbox environments.
Initial Load – Use Bulk APIs for Massive Data Loads
- The SFDC Bulk API is specifically designed to load large volumes of data; it is recommended for the LDV initial data load.
- Use the Salesforce Bulk API in parallel mode when loading more than a few hundred thousand records. The main performance gain of the Bulk API comes from executing batches in parallel.
- Note that the Bulk API (without batching, in sequential mode) can perform slower than the standard SOAP API, so below a certain threshold (which depends on the object being loaded, processing on the SFDC side, the number of attributes loaded, etc.) using the Bulk API can be counterproductive.
- Use the Bulk API in parallel mode with the maximum allowed batch size whenever possible to maximize the number of records loaded within a 24-hour period (based on the Bulk API batch limit; the standard limit is 2K batches per 24 hours).
- Use the standard SOAP API for incremental ongoing data syncs (data sets under 50-100K rows) and the Bulk API for larger incremental data sets (over 50K-100K rows).
- Using the standard SOAP API for incremental syncs also reduces the risk of database contention during uploads of child objects.
- Allow enough time for collecting request and result log files (it takes ~15 min for INFA to extract load logs for 10MM rows loaded).
Initial Load – Planning Bulk Jobs (Parallel vs. Single Object)
When planning bulk jobs, consider the following:
- When loading a large volume of data into a single object, submit all the batches in one job. Splitting (partitioning) the data across several parallel jobs does not improve performance when each individual job is big enough. Note that batches from all jobs are placed in the same batch pool and queued for processing independently of the jobs they belong to.
- When loading data into several objects, running several jobs for different objects in parallel is recommended when the jobs are small enough not to consume all available system resources.
- For loading extra-large volumes of data into a single object, run one object load at a time. The rationale: bulk jobs with hundreds or thousands of batches consume all available system resources, and running other jobs in parallel just makes the jobs compete for limited resources. The other consideration is the DB cache: when several objects are loaded in parallel, sharing the DB cache causes more frequent swapping, which slows overall performance at the DB layer.
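The single-job, many-parallel-batches pattern above can be sketched with a thread pool on the client side. This is a simulation, not the real Bulk API transport: `send_batch` stands in for the HTTP call that adds a batch to a job, and here it is any callable returning a per-batch result.

```python
from concurrent.futures import ThreadPoolExecutor

def submit_job(batches, send_batch, max_workers=10):
    """Submit all batches of one object's load as a single job.

    Batches are dispatched concurrently (the server processes them in
    parallel mode); results come back in batch order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send_batch, batches))

# Simulated transport: count records per batch instead of hitting the API.
results = submit_job([[1, 2], [3], [4, 5, 6]], send_batch=len)
# results == [2, 1, 3]
```

In a real client, `send_batch` would POST the batch to the job endpoint and return the batch id for later status polling.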
Initial Load – Defer Sharing Calculations
General LDV recommendations:
- Reduce sharing recalculation processing during the initial data upload.
- Disable sharing rules and/or use the defer sharing calculation perm for custom objects while loading data.
- Note that sharing calculations will still need to be performed after the data load completes, and that can take a significant amount of time, but deferral reduces loading time by allowing sharing to be calculated in “off” hours.
- Sharing tables can alternatively be uploaded directly, which avoids sharing recalculation on the initial load altogether.
XLDV considerations:
- Based on the XLDV performance tests, the general recommendation to defer sharing calculations might not apply for extra-large volumes. Post-load sharing calculations can take a very long time on bigger volumes (the process is not currently parallel; parallelization is on the roadmap for the nearest release).
- Load data in the following sequence so that the main sharing calculations are performed during the data upload:
  1. Set OWD on objects to Private where applicable
  2. Create the user role hierarchy and upload users with the correct roles assigned
  3. Create the main user groups and group members
  4. Create sharing rules
  5. Upload data with the correct owner specified
- Consider uploading sharing tables on some objects directly to avoid sharing calculations during upload and post-load processing.
Initial Load – Triggers, WF Rules, Data Validation Rules
- Disable triggers, WF rules, and data validations on the objects being uploaded when possible.
- To avoid lengthy post-load catch-up processing, consider performing data transformations/validations on the source data set prior to upload, or use batch Apex post-load processing for transformations that can’t be performed beforehand.
- For each trigger, WF rule, and validation rule, perform an individual analysis to define the best strategy.
- General best-practice rules apply.
Initial Load – Defer Indexing
- General LDV recommendation: reduce additional processing on SFDC during the initial data upload: turn off search indexes and consider creating custom indexes after the data load.
- XLDV consideration: based on initial performance results, index recalculation on extra-large volumes of data takes a long time post-load (not currently parallelized). It might be a better approach to configure all required indexes prior to an XLDV load.
Initial Load – Use the Fastest Operation Possible
- Use the fastest operation possible: insert is fastest, then update, then upsert.
- If possible, load the full set of initial data in one go using insert, with only small incremental upserts when needed (for example, reprocessing failed records).
- To avoid loading big volumes of data in upsert mode (for example, when the main load job fails in the middle and the remaining data must be added on top of existing data in an object), consider configuring the client load job to join existing records to the remaining records (on External Id) outside of SFDC, then load the resulting set in insert mode.
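The client-side join that turns a resumed upsert into a plain insert reduces to a set difference on the external id. A minimal sketch under assumed names: `Ext_Id__c` is a hypothetical external-id field, and `loaded_ext_ids` would come from the result logs of the partial load.

```python
def remaining_for_insert(source_rows, loaded_ext_ids, ext_field="Ext_Id__c"):
    """Compute the still-unloaded subset after a partial load failure,
    so the remainder can be re-loaded with fast insert instead of a
    big-volume upsert."""
    loaded = set(loaded_ext_ids)  # external ids already in the org
    return [r for r in source_rows if r[ext_field] not in loaded]

src = [{"Ext_Id__c": "X1"}, {"Ext_Id__c": "X2"}, {"Ext_Id__c": "X3"}]
todo = remaining_for_insert(src, ["X1", "X3"])
# todo == [{'Ext_Id__c': 'X2'}]
```

Only `todo` is then submitted in insert mode; records that did load are never re-sent.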
Initial Load – Use a Clean Data Set
- Use as clean a data set as possible.
- Note that errors in a batch cause single-row processing for the remainder of the batch, which slows load performance significantly.
- Perform rigorous data validation prior to upload. This allows you to:
  1. Load most of the records with the fastest operation (insert) and avoid slowdowns due to preventable errors
  2. Avoid the slowdown of record-by-record processing within a 200-record transactional batch once the number of failed records reaches a threshold
Data Load Considerations
Topic | Consideration | What do I need to do?
Data Model | Normalized / de-normalized |
Sharing Model | Sharing calculations |
Target Data | Does it really need to exist in SFDC? BP etc. |
Test Environments | Test them before you deploy! |
Timelines | Adequate timeline for testing |
User data | Cannot be deleted |
API | SOAP versus Bulk |
Batches | Limit of 2K Bulk API batches a day | SF Support can increase batches
Deletes | Oh no.. |
Extrapolation | Not really possible | It is at par or better; if Prod can be used, even better