Replicating One Billion Records with Minimal API Usage

How One Billion Salesforce Records Can Be Replicated with Minimal API Usage
Baruch Oxman, R&D Manager, Implisit
@implisithq, @baruchoxman

Description

Salesforce API access is nice, but often not enough. In the Big Data era, you are often required to replicate Salesforce data for offline processing. Implisit requires such replication for its data entry engine to automatically enter emails, events, contacts, and leads for its users. In order to do so, Implisit maintains a daily sync of over one billion Salesforce data records, while using no more than a few hundred API calls per Salesforce Org. Join us as we share the suggested architecture of such a replication mechanism, the best practices we developed over time, and the pitfalls to avoid.

Transcript of Replicating One Billion Records with Minimal API Usage

Page 1: Replicating One Billion Records with Minimal API Usage

How One Billion Salesforce Records Can Be Replicated with Minimal API Usage

Baruch Oxman

R&D Manager, Implisit

@implisithq, @baruchoxman

Page 2: Replicating One Billion Records with Minimal API Usage

Safe Harbor

Safe harbor statement under the Private Securities Litigation Reform Act of 1995:

This presentation may contain forward-looking statements that involve risks, uncertainties, and assumptions. If any such uncertainties materialize or if any of the assumptions proves incorrect, the results of salesforce.com, inc. could differ materially from the results expressed or implied by the forward-looking statements we make. All statements other than statements of historical fact could be deemed forward-looking, including any projections of product or service availability, subscriber growth, earnings, revenues, or other financial items and any statements regarding strategies or plans of management for future operations, statements of belief, any statements concerning new, planned, or upgraded services or technology developments and customer contracts or use of our services.

 

The risks and uncertainties referred to above include – but are not limited to – risks associated with developing and delivering new functionality for our service, new products and services, our new business model, our past operating losses, possible fluctuations in our operating results and rate of growth, interruptions or delays in our Web hosting, breach of our security measures, the outcome of any litigation, risks associated with completed and any possible mergers and acquisitions, the immature market in which we operate, our relatively limited operating history, our ability to expand, retain, and motivate our employees and manage our growth, new releases of our service and successful customer deployment, our limited history reselling non-salesforce.com products, and utilization and selling to larger enterprise customers. Further information on potential factors that could affect the financial results of salesforce.com, inc. is included in our annual report on Form 10-K for the most recent fiscal year and in our quarterly report on Form 10-Q for the most recent fiscal quarter. These documents and others containing important disclosures are available on the SEC Filings section of the Investor Information section of our Web site.

 

Any unreleased services or features referenced in this or other presentations, press releases or public statements are not currently available and may not be delivered on time or at all. Customers who purchase our services should make the purchase decisions based upon features that are currently available. Salesforce.com, inc. assumes no obligation and does not intend to update these forward-looking statements.

Page 3: Replicating One Billion Records with Minimal API Usage

Baruch Oxman

R&D Manager

Page 4: Replicating One Billion Records with Minimal API Usage

In this session…

• Implisit - Intro & Motivation

• Salesforce API Usage & Limits – Overview

• Efficient use of Salesforce APIs

• Scale and limitations

• Other pitfalls and tips

Page 5: Replicating One Billion Records with Minimal API Usage

Implisit – The End of CRM Data Entry

• Using Data-Mining and Machine Learning techniques:

– Automatic entry of emails and calendar events to Salesforce

– Matching to relevant Accounts, Opportunities, Contacts, Leads

– Automatic creation of missing Contacts & Leads

• Using text analysis:

– Creating meaningful business insights

– Improving sales pipeline and forecasting

• Requires Salesforce data replication for offline processing

Page 6: Replicating One Billion Records with Minimal API Usage

Data Replication Goals

• Minimize your API usage

– Don’t reach the API limit

– API limits are shared between all API-connected apps – other apps can become blocked

• Minimize sync cycle time

– Don’t make customers wait too long

Page 7: Replicating One Billion Records with Minimal API Usage

Salesforce API Limits

• Daily API limits for Salesforce Editions:

– Unlimited/Performance: # of users x 5,000, up to 1,000,000

– Enterprise/Professional: # of users x 1,000

– Developer: 15,000

– Sandbox: 5,000,000

• Concurrent API call limit (25 in production, 5 in dev orgs)

Source & more info: https://help.salesforce.com/HTViewHelpDoc?id=integrate_api_rate_limiting.htm
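The sketch below (Python, using the requests library) is one hedged way to keep an eye on these limits: authenticated REST responses include an Sforce-Limit-Info header of the form api-usage=<used>/<limit>, so any cheap call can double as a limit check. The instance URL, access token, and API version here are placeholders, not values from the talk.

```python
import requests

# Placeholders - substitute the instance URL and access token from your OAuth flow.
INSTANCE_URL = "https://yourInstance.salesforce.com"
HEADERS = {"Authorization": "Bearer <access token>"}

def api_usage():
    """Read daily API consumption from the Sforce-Limit-Info response header.

    Authenticated REST responses carry this header as "api-usage=<used>/<limit>";
    here a plain resource-listing call is used so the check itself stays cheap.
    """
    resp = requests.get(INSTANCE_URL + "/services/data/v39.0/", headers=HEADERS)
    resp.raise_for_status()
    used, limit = resp.headers["Sforce-Limit-Info"].split("=")[1].split("/")
    return int(used), int(limit)

used, limit = api_usage()
print("Daily API calls used: %d of %d" % (used, limit))
```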

Page 8: Replicating One Billion Records with Minimal API Usage

Performance Stats

• Keeping over 1 billion Salesforce records replicated in-sync

– 27 Salesforce object types are replicated

• Initial sync

– 600-1000 API calls in total

• Updates sync

– 200-400 API calls in total

– Performed every few hours

Page 9: Replicating One Billion Records with Minimal API Usage
Page 10: Replicating One Billion Records with Minimal API Usage

Salesforce API Types

• REST API

– Fast, synchronous queries

– Up to 2,000 records per request

– Simple usage

• SOAP API

– describeSObjects

– getDeleted

– Other metadata queries

• Bulk (Async) API

– Large numbers of records in a single request

– Slow, requires polling for results

– Implements internal retries
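To make the REST option concrete, here is a minimal Python sketch of a synchronous query that pages through results via nextRecordsUrl (each batch holds at most 2,000 records, and every extra page costs another API call). The instance URL, token, and API version are assumed placeholders, and error handling is omitted.

```python
import requests

# Placeholders - substitute real credentials from your OAuth flow.
INSTANCE_URL = "https://yourInstance.salesforce.com"
HEADERS = {"Authorization": "Bearer <access token>"}

def rest_query(soql):
    """Run a SOQL query over the REST API and page through all result batches.

    Each response carries up to 2,000 records; while 'done' is false,
    'nextRecordsUrl' points to the next batch (one more API call per batch).
    """
    url = INSTANCE_URL + "/services/data/v39.0/query/"
    resp = requests.get(url, headers=HEADERS, params={"q": soql}).json()
    records = resp["records"]
    while not resp["done"]:
        resp = requests.get(INSTANCE_URL + resp["nextRecordsUrl"], headers=HEADERS).json()
        records.extend(resp["records"])
    return records

accounts = rest_query("SELECT Id, Name FROM Account LIMIT 5000")
```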

Page 11: Replicating One Billion Records with Minimal API Usage

So how do I replicate?

Page 12: Replicating One Billion Records with Minimal API Usage

Replication method – Initial fetching

• Fetch all records for each relevant object type

– Lots of data

– Only non-deleted records

• Break the queried data into chunks to limit result size

• Order by CreatedDate

• On subsequent queries, use the latest CreatedDate from the previous result

• Example:

– 1st query: “…ORDER BY CreatedDate LIMIT 100000”

– Subsequent: “…WHERE CreatedDate > 2014-08-31T02:29:29Z ORDER BY CreatedDate LIMIT 100000”
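The chunking loop described above can be sketched roughly as follows (Python). Here run_query is a hypothetical helper standing in for whichever transport actually executes the SOQL (a Bulk API query or a REST query); it is not part of any Salesforce SDK.

```python
CHUNK_SIZE = 100000

def initial_fetch(object_name, fields, run_query):
    """Yield all non-deleted records of one object, CHUNK_SIZE records per query.

    `run_query` is a hypothetical callable that executes a SOQL string and
    returns a list of dicts (e.g. via the Bulk API).
    """
    base = "SELECT %s FROM %s" % (", ".join(fields), object_name)
    last_created = None                      # cursor: highest CreatedDate seen so far
    while True:
        where = " WHERE CreatedDate > %s" % last_created if last_created else ""
        batch = run_query(base + where + " ORDER BY CreatedDate LIMIT %d" % CHUNK_SIZE)
        if not batch:
            break
        for record in batch:
            yield record
        if len(batch) < CHUNK_SIZE:          # short batch - nothing left to fetch
            break
        last_created = batch[-1]["CreatedDate"]   # e.g. "2014-08-31T02:29:29Z"
```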

Page 13: Replicating One Billion Records with Minimal API Usage

Replication method – Changes fetching

• Fetch only records that changed since the previous fetch time

– Less data – only changes

– Take care of both updates and deletions

• Use SystemModstamp as the indicator of changes to a record

• Chunking logic – same as in initial fetching

• Example:

– 1st query: “…WHERE SystemModstamp > 2014-07-31T02:29:29Z ORDER BY CreatedDate LIMIT 100000”

– Subsequent: “…WHERE SystemModstamp > 2014-07-31T02:29:29Z AND CreatedDate > 2014-08-31T02:29:29Z ORDER BY CreatedDate LIMIT 100000”

• Bulk changes fetching vs. getUpdated()
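The corresponding query construction can be sketched like this (same hypothetical setup as the previous sketch): the SystemModstamp filter picks up changed records since the last sync, while the CreatedDate cursor keeps the chunking identical to the initial fetch.

```python
def changes_query(object_name, fields, last_sync, last_created=None, chunk=100000):
    """Build one chunk of the changes-sync SOQL query.

    `last_sync` is the SystemModstamp watermark of the previous sync;
    `last_created` is the CreatedDate cursor from the previous chunk
    (None for the first chunk of this sync cycle).
    """
    soql = "SELECT %s FROM %s WHERE SystemModstamp > %s" % (
        ", ".join(fields), object_name, last_sync)
    if last_created:
        soql += " AND CreatedDate > %s" % last_created
    return soql + " ORDER BY CreatedDate LIMIT %d" % chunk

# First chunk, then a follow-up chunk using the cursor from the previous result:
print(changes_query("Task", ["Id", "Subject"], "2014-07-31T02:29:29Z"))
print(changes_query("Task", ["Id", "Subject"], "2014-07-31T02:29:29Z",
                    "2014-08-31T02:29:29Z"))
```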

Page 14: Replicating One Billion Records with Minimal API Usage

Deleted items

• Motivation:

– Required to maintain consistent sync

• Two implementation options

– Use getDeleted() call in SOAP API (our choice)

– Use queryAll() query option

• Supported only in REST API queries

• Tip: some objects can become “undeleted”
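For illustration, the sketch below uses the queryAll option over REST to pull soft-deleted records; the getDeleted() SOAP call, which was our choice, serves the same purpose. The placeholders, API version, and omitted pagination match the earlier REST sketches.

```python
import requests

# Placeholders - substitute real credentials from your OAuth flow.
INSTANCE_URL = "https://yourInstance.salesforce.com"
HEADERS = {"Authorization": "Bearer <access token>"}

def fetch_deleted_ids(object_name, since):
    """Return IDs of records deleted since `since`, using the queryAll resource.

    Unlike /query/, /queryAll/ also returns soft-deleted rows, so they can be
    selected explicitly with IsDeleted = true.
    """
    soql = ("SELECT Id FROM %s WHERE IsDeleted = true AND SystemModstamp > %s"
            % (object_name, since))
    url = INSTANCE_URL + "/services/data/v39.0/queryAll/"
    resp = requests.get(url, headers=HEADERS, params={"q": soql}).json()
    return [record["Id"] for record in resp["records"]]

deleted = fetch_deleted_ids("Contact", "2014-08-31T02:29:29Z")
```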

Page 15: Replicating One Billion Records with Minimal API Usage

Getting all fields

• No “SELECT *” support

• Get all fields for the object using “describe”

– Optionally, filter the fields (skip custom fields, etc…)

• Use the field names in the query

• Limitation: query length cannot exceed 20,000 characters*

* http://www.salesforce.com/us/developer/docs/soql_sosl/Content/sforce_api_calls_soql_select.htm
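A minimal sketch of building the SELECT clause from describe over REST, with the same assumed placeholders as the earlier sketches; the custom-field filter and the length check are illustrative only.

```python
import requests

# Placeholders - substitute real credentials from your OAuth flow.
INSTANCE_URL = "https://yourInstance.salesforce.com"
HEADERS = {"Authorization": "Bearer <access token>"}

def build_select(object_name, skip_custom=False):
    """Build a SELECT listing every field of the object, since SOQL has no SELECT *.

    The describe resource returns the full field list; fields can optionally be
    filtered out here (e.g. custom fields).
    """
    url = INSTANCE_URL + "/services/data/v39.0/sobjects/%s/describe/" % object_name
    fields = requests.get(url, headers=HEADERS).json()["fields"]
    names = [f["name"] for f in fields if not (skip_custom and f["custom"])]
    soql = "SELECT %s FROM %s" % (", ".join(names), object_name)
    if len(soql) > 20000:                       # SOQL statement length limit
        raise ValueError("query too long - split the field list across queries")
    return soql
```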

Page 16: Replicating One Billion Records with Minimal API Usage

User Access Restrictions

• Full access rights are strongly encouraged

– Full view of all objects

– Limited access rights -> slower queries

• Reference Fields – special case

– Tasks / Events - WhoId, WhatId

– Attachment - ParentId

– Relations make access checks in Salesforce even slower

– Limited to 100,000 different values per query*

– Solution: query in smaller chunks

Page 17: Replicating One Billion Records with Minimal API Usage

Error handling

• Nothing is fail-safe

• Different APIs produce different errors

• Examples:

– Query too long (too many fields)

– Scale limitations

– Communication errors

– Salesforce maintenance windows

• Add support for anything you encounter

– “Rare” becomes “frequent” once you scale

• ABR (Always Be Retrying)

• Make sure to clean up upon errors

– Close open bulk jobs
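One possible shape for the “always be retrying” idea is sketched below: a retry wrapper with backoff that runs a caller-supplied cleanup hook (for example, closing open bulk jobs) after each failed attempt. This is an assumption-level sketch, not the implementation described in the talk.

```python
import time

def with_retries(fetch, cleanup=lambda: None, attempts=5, base_delay=30):
    """Run `fetch`, retrying on any error and cleaning up after each failure.

    `cleanup` is a caller-supplied hook, e.g. closing bulk jobs the failed
    attempt left open, so nothing stays half-finished between retries.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except Exception:          # query errors, communication errors, maintenance windows, ...
            cleanup()
            if attempt == attempts:
                raise              # retries exhausted - surface the error
            time.sleep(base_delay * attempt)   # back off a bit more each time
```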

Page 18: Replicating One Billion Records with Minimal API Usage

Non-supported Salesforce objects

• Some orgs do not support certain objects

– For example, Lead or Opportunity

• Check each object with describeSObjects before fetching

• Safely skip when not supported
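One way to perform this check over REST is sketched below (placeholders as before): describe each object and skip those the org does not expose, which typically come back with an error status rather than a field list.

```python
import requests

# Placeholders - substitute real credentials from your OAuth flow.
INSTANCE_URL = "https://yourInstance.salesforce.com"
HEADERS = {"Authorization": "Bearer <access token>"}

def is_supported(object_name):
    """Probe an object with describe; orgs without it return an error status."""
    url = INSTANCE_URL + "/services/data/v39.0/sobjects/%s/describe/" % object_name
    return requests.get(url, headers=HEADERS).status_code == 200

objects_to_sync = ["Account", "Contact", "Lead", "Opportunity", "Task"]
supported = [name for name in objects_to_sync if is_supported(name)]
```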

Page 19: Replicating One Billion Records with Minimal API Usage

Summary

• Implisit - Intro & Motivation

• Salesforce APIs Overview

• Efficient use of API

• Scale and limitations

• Other pitfalls and tips

Page 21: Replicating One Billion Records with Minimal API Usage
Page 22: Replicating One Billion Records with Minimal API Usage