Processing Big Data At-Scale in the App Cloud

Processing Big Data At Scale

Naren Chawla, Senior Director, Product Management
Prashant Kommireddi (@prashant1784)

Leverage platform-native Data Pipelines for ETL

Safe Harbor Safe harbor statement under the Private Securities Litigation Reform Act of 1995: This presentation may contain forward-looking statements that involve risks, uncertainties, and assumptions. If any such uncertainties materialize or if any of the assumptions proves incorrect, the results of salesforce.com, inc. could differ materially from the results expressed or implied by the forward-looking statements we make. All statements other than statements of historical fact could be deemed forward-looking, including any projections of product or service availability, subscriber growth, earnings, revenues, or other financial items and any statements regarding strategies or plans of management for future operations, statements of belief, any statements concerning new, planned, or upgraded services or technology developments and customer contracts or use of our services. The risks and uncertainties referred to above include – but are not limited to – risks associated with developing and delivering new functionality for our service, new products and services, our new business model, our past operating losses, possible fluctuations in our operating results and rate of growth, interruptions or delays in our Web hosting, breach of our security measures, the outcome of any litigation, risks associated with completed and any possible mergers and acquisitions, the immature market in which we operate, our relatively limited operating history, our ability to expand, retain, and motivate our employees and manage our growth, new releases of our service and successful customer deployment, our limited history reselling non-salesforce.com products, and utilization and selling to larger enterprise customers. Further information on potential factors that could affect the financial results of salesforce.com, inc. is included in our annual report on Form 10-K for the most recent fiscal year and in our quarterly report on Form 10-Q for the most recent fiscal quarter. 
These documents and others containing important disclosures are available on the SEC Filings section of the Investor Information section of our Web site. Any unreleased services or features referenced in this or other presentations, press releases or public statements are not currently available and may not be delivered on time or at all. Customers who purchase our services should make the purchase decisions based upon features that are currently available. Salesforce.com, inc. assumes no obligation and does not intend to update these forward-looking statements.

Topics

•  Big Data Processing Problem and Proposed Solution
•  Data Pipeline Deep-dive
•  Demo
•  Key Use-cases
•  Customer Stories
•  Summary
•  Q&A

Problem

Data sources: ERP, HCM, SCM, Logs

1. Acquire & Store Data (Data Lake / EDW)

2. Prepare Data: Cleanse, Augment, Transform, Join (Data Lake / EDW)

3. Analyze (Wave)

4. Take Action (Customer Success Platform)

[Firewall boundary shown in diagram]

•  Cost and complexity of managing external data platforms

•  Slow time-to-value, poor support for ad-hoc analysis

•  Inability to deliver high-value packaged analytic solutions

Solution

Data sources: ERP, HCM, SCM, Logs/Machine Data

1. Acquire & Store Data (BigObjects)

2. Prepare Data (Data Pipelines / Async Query)

3. Analyze (Wave)

4. Take Action (Salesforce Apps)

[Firewall boundary shown in diagram]

• Greater ease-of-use, consistent end-to-end experience

• Greater flexibility and faster time-to-value

• Packaged Analytic Solutions

Data Pipelines Overview (Currently in Pilot)

•  Data Pipelines: programmatic language based on Apache Pig, plus whitelisted UDF libraries (Piggybank, DataFu)

•  Multi-tenancy: resource management, scheduling, job monitoring and management

•  Data Sources / Data Targets: SObjects, BigObjects, Wave Data Sets, External Objects, Files, Archive Objects

•  Generates MapReduce jobs that run on Hadoop for big data processing
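To make the "generate MapReduce jobs" step more concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases that a grouped aggregation is typically broken into. It is only an illustration in plain Python, not the Data Pipelines runtime or Hadoop itself; the input rows and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical input rows: (account_id, amount). Not taken from the slides.
ROWS = [("a1", 10), ("a2", 5), ("a1", 7), ("a3", 2), ("a2", 1)]


def map_phase(rows):
    """Emit one (key, value) pair per input row."""
    for account_id, amount in rows:
        yield account_id, amount


def shuffle_phase(pairs):
    """Collect all values for the same key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Collapse each key's list of values into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


totals = reduce_phase(shuffle_phase(map_phase(ROWS)))
print(totals)
```

On a real cluster the map and reduce phases run in parallel across many machines, which is where the parallelism comes from; this sketch only shows the data flow.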

Architecture

Salesforce Data Center

Data Pipeline

BigObjects vs. SObjects

Attribute | SObjects | BigObjects
Use cases | CRM transactional data | Read-only immutable data
Data volumes | <50M rows | Billions of rows
Field types | All types | Strings, numbers, dates, JSON
Query | Real-time query response | Blend of real-time and asynchronous query response, determined by size of result set
Transactions | ACID transactions | Record-level consistency
Access management | Full sharing | User permissions and field-level security
APIs | Full support | SOQL, Async Query, Data Pipelines
Triggers | Full support | None
Reports | Full support | Limited CRTs
Search | Full support | None

DEMO

Transformations

●  JOIN

●  FILTER

●  UNION

●  MERGE

●  GROUP

●  DISTINCT

●  ORDER BY

●  RANK

●  LIMIT

●  … and many more
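As a rough illustration of what a few of these operators do, the sketch below applies FILTER, JOIN, GROUP, ORDER BY, LIMIT, and DISTINCT to small in-memory lists. It is written in plain Python rather than the Pig-based Data Pipelines language, and the sample records and field names are hypothetical, not taken from the slides.

```python
from collections import defaultdict

# Hypothetical sample data; field names are illustrative only.
accounts = [
    {"account_id": 1, "name": "Acme"},
    {"account_id": 2, "name": "Globex"},
    {"account_id": 3, "name": "Initech"},
]
cases = [
    {"case_id": 10, "account_id": 1, "status": "Open"},
    {"case_id": 11, "account_id": 1, "status": "Closed"},
    {"case_id": 12, "account_id": 2, "status": "Open"},
    {"case_id": 13, "account_id": 1, "status": "Open"},
]

# FILTER: keep only open cases.
open_cases = [c for c in cases if c["status"] == "Open"]

# JOIN: inner join open cases to accounts on account_id.
accounts_by_id = {a["account_id"]: a for a in accounts}
joined = [
    {**c, "name": accounts_by_id[c["account_id"]]["name"]}
    for c in open_cases
    if c["account_id"] in accounts_by_id
]

# GROUP: count open cases per account name.
counts = defaultdict(int)
for row in joined:
    counts[row["name"]] += 1

# ORDER BY count descending (name breaks ties), then LIMIT 1.
ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
top = ranked[:1]

# DISTINCT: unique account ids that appear in any case.
distinct_ids = sorted({c["account_id"] for c in cases})

print(ranked, top, distinct_ids)
```

In a pipeline script each of these steps would be a single declarative statement, and the platform would decide how to distribute the work.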

Key Use Cases

•  Native Big Data Processing

•  Data Prep for Descriptive Analytics

•  Data Enrichment to turn “Insight into Actions”

[Diagrams: data flows among Big Object, Ext Object, Files, sObject, and Wave]

Handling Semi-structured Data

JSON, HTML, XML and other complex semi-structured data...
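One common way to prepare semi-structured input for row-oriented processing is to flatten nested fields into columns. The sketch below does this for a JSON document in plain Python; the `flatten` helper, the dotted key naming, and the sample payload are assumptions made for illustration and are not part of Data Pipelines.

```python
import json


def flatten(value, prefix=""):
    """Flatten nested dicts and lists into one dict with dotted keys."""
    flat = {}
    if isinstance(value, dict):
        for key, child in value.items():
            flat.update(flatten(child, f"{prefix}{key}."))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            flat.update(flatten(child, f"{prefix}{index}."))
    else:
        # Leaf value: drop the trailing dot from the accumulated prefix.
        flat[prefix[:-1]] = value
    return flat


# Hypothetical event payload, not taken from the slides.
raw = '{"device": {"id": "d-1", "tags": ["a", "b"]}, "reading": 21.5}'
row = flatten(json.loads(raw))
print(row)
```

Each resulting key can then be treated as a column name, so the record can be filtered, joined, or grouped like any other row.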

Customer Stories

•  Gamification (update user levels based on experience points): Large-volume data processing (250M+ records). Trawl the rewards and update user objects. Later, would like to use analytics.

•  Computing Partner Scorecards: The scorecard determines status, which in turn determines pricing and the resources that partners have access to assist in sales. Calculated multiple times every week for Partner Accounts (70h+).

•  Asset Management: Account assignment at account/office/contact levels. Would like to run daily.

•  Analytics: Correlate game-play data with customer interaction to improve customer retention, loyalty, etc.

•  Analytics: Multi-org consolidation; white-space analysis.

Future

Roadmap Themes

1.  Resource Management/Fair Allocation
2.  Predictive Analytics
3.  Business Analyst/Salesforce Admin Interface

Summary & Next-Steps

Why Data Pipeline?
●  Massive Parallelism (10-40X performance improvement)
●  Overcome governor limits
●  Work towards Data Lake Architecture
●  Reduce complexity/cost - 100% Platform-Native

Resources
●  Implementation Guide - http://docs.releasenotes.salesforce.com/en-us/summer15/release-notes/rn_forcecom_data_pipelines.htm

Join the Pilot Program Any questions: [email protected]

Salesforce.com Confidential


FUTURE


Data Pipelines Overview

•  Data Pipelines: programmatic language based on Apache Pig, plus whitelisted UDF libraries (Piggybank, DataFu)

•  Declarative tooling for admins and analysts

•  Wave, Dev Console, Setup

•  Multi-tenancy: Hadoop, resource management, scheduling, job monitoring and management

•  Data Sources, Data Targets

•  Data Set Objects: snapshot for provenance tracking

•  Generate Data Pipelines

•  Generate MapReduce jobs

•  Data Processing


Customer Stories

Customer Name | Brief Description | Use-cases

Cloud App | CloudApps increases organisational performance by enabling, encouraging, enhancing and measuring behavioural change using gamification. | Large-volume data processing (250M+ records). Trawl the rewards and update user objects. Later, would like to use analytics.

EMC | Computing | Partner Scorecards. Business Partner scorecards help partners track whether they qualify for a particular Partner Tier status (Gold, Silver, Platinum). Tier status determines pricing and the resources that partners have access to assist in sales. Scorecards are calculated multiple times every week for Partner Accounts, which takes 70h+. While being processed, Scorecards are zeroed out and a Partner cannot see the details of why they are in a certain status. To process them in a shorter window (~10h), they have reduced the total number of Partner Accounts that qualify for the Business Partner program from 22K to 780.

Legg Mason | Asset Management | Legg Mason has built an internal process to update account assignment at account/office/contact levels. They would like to do this more frequently, but the async batch Apex process causes them to hit several limits and prevents them from running this process daily.

Activision | Video Game Developer | Activision wants to correlate game-play data with customer interaction to improve customer retention, loyalty, etc. Currently they load game-play data every 2 weeks; they would like to do that daily. They also want to use Pipeline to join game-play data with Case records and use Analytics to drive insight (for example, the impact of a service issue on gaming behaviour).

Financial Force | ERP on Platform | FF gets files in emails and has to do manual downstream processing to generate invoices, etc. based on these incoming files. They want to leverage Pipelines to scale and automate some steps.

USPS | Business Transformation | USPS wants to combine CRM data with external data (from Equifax) to marry physical address with digital identity for a user. They expect 500 million external records, and they will build transformational applications based on this data (for example, Twitter handle on envelopes, Uber for packages, etc.).

Cisco | Multiple use-cases | Multi-org consolidation; white-space analysis.

Data Pipelines Roadmap

196 (Summer ’15)
-  Metadata API
-  Simple Monitoring
-  Dev Console integration
-  Logging improvements
-  Deployment to HBase servers

198 (Winter ’16 / DF15)
-  Spark for internal customers
-  Wave connectors
-  Better error handling
-  Monitoring improvements
-  Basic limits

200 (Spring ’16)
-  Resource management
-  Scheduler
-  Performance / optimization
-  Hardening

Milestones shown on the slide: Pilot II, Pilot III, GA (stretch goal)


External SObjects

BigObjects

•  New object type optimized for extremely large row-count

•  Targeted functionality

•  Use cases: read-only data from external systems, point-of-sale data, connected product event data, clickstream data, etc.

•  Backed by HBase as a System of Record

•  Integrated into platform via External sObject framework, Phoenix, Pliny

[Diagram layers: HBase; Phoenix (SQL); Pliny (SOQL); Platform]



BigObjects vs. SObjects

Attribute | SObjects | BigObjects
Use cases | CRM transactional data | Write-once / read-only data from external systems, point-of-sale data, connected product event data, clickstream data, etc.
Data volumes | <50M rows | Billions of rows
Field types | All types | Strings, numbers, dates
Query | Real-time query response | Blend of real-time and asynchronous query response, determined by size of result set
Transactions | ACID transactions | Eventually consistent
Access management | Full sharing | Object permission based; sharing descriptors in future
APIs | Full support | REST, SOQL, Bulk
Triggers | Full support | None
Reports | Full support | Limited CRTs
Search | Full support | None
