Processing Big Data At-Scale in the App Cloud
Transcript of Processing Big Data At-Scale in the App Cloud
Processing Big Data At Scale
Naren Chawla, Senior Director, Product Management ([email protected])
Prashant Kommireddi (@prashant1784)
Leverage platform-native Data Pipelines for ETL
Safe Harbor Safe harbor statement under the Private Securities Litigation Reform Act of 1995: This presentation may contain forward-looking statements that involve risks, uncertainties, and assumptions. If any such uncertainties materialize or if any of the assumptions proves incorrect, the results of salesforce.com, inc. could differ materially from the results expressed or implied by the forward-looking statements we make. All statements other than statements of historical fact could be deemed forward-looking, including any projections of product or service availability, subscriber growth, earnings, revenues, or other financial items and any statements regarding strategies or plans of management for future operations, statements of belief, any statements concerning new, planned, or upgraded services or technology developments and customer contracts or use of our services. The risks and uncertainties referred to above include – but are not limited to – risks associated with developing and delivering new functionality for our service, new products and services, our new business model, our past operating losses, possible fluctuations in our operating results and rate of growth, interruptions or delays in our Web hosting, breach of our security measures, the outcome of any litigation, risks associated with completed and any possible mergers and acquisitions, the immature market in which we operate, our relatively limited operating history, our ability to expand, retain, and motivate our employees and manage our growth, new releases of our service and successful customer deployment, our limited history reselling non-salesforce.com products, and utilization and selling to larger enterprise customers. Further information on potential factors that could affect the financial results of salesforce.com, inc. is included in our annual report on Form 10-K for the most recent fiscal year and in our quarterly report on Form 10-Q for the most recent fiscal quarter. 
These documents and others containing important disclosures are available on the SEC Filings section of the Investor Information section of our Web site. Any unreleased services or features referenced in this or other presentations, press releases or public statements are not currently available and may not be delivered on time or at all. Customers who purchase our services should make the purchase decisions based upon features that are currently available. Salesforce.com, inc. assumes no obligation and does not intend to update these forward-looking statements.
Topics
• Big Data Processing Problem and Proposed Solution
• Data Pipeline Deep-dive
• Demo
• Key Use-cases
• Customer Stories
• Summary
• Q&A
Problem
Data sources: ERP, HCM, SCM, Logs
1. Acquire & Store Data
2. Prepare Data (Cleanse, Augment, Transform, Join)
3. Analyze: Wave
4. Take Action: Customer Success Platform
Diagram labels: Data Lake / EDW, Firewall
• Cost and complexity of managing external data platforms
• Slow time-to-value, poor support for ad-hoc analysis
• Inability to deliver high-value packaged analytic solutions
Solution
Data sources: ERP, HCM, SCM, Logs/Machine Data
1. Acquire & Store Data: BigObjects
2. Prepare Data: Data Pipelines / Async Query
3. Analyze: Wave
4. Take Action: Salesforce Apps
Diagram label: Firewall
• Greater ease-of-use, consistent end-to-end experience
• Greater flexibility and faster time-to-value
• Packaged Analytic Solutions
Data Pipelines Overview (Currently in Pilot)
• Data Pipelines: programmatic language based on Apache Pig plus whitelisted UDF libraries (Piggybank, DataFu)
• Multi-tenancy, resource management, scheduling, job monitoring and management
• Data Sources: SObjects, BigObjects, Wave Data Sets, External Objects, Files, Archive Objects
• Data Targets: SObjects, BigObjects, Wave Data Sets, External Objects, Files, Archive Objects
• Generate MapReduce jobs
• Hadoop: Big Data Processing
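The "Generate MapReduce jobs" step can be pictured with a small sketch. Apache Pig compiles a script into map, shuffle, and reduce phases; the Python below imitates those three phases in memory for a single GROUP-and-count. It is only an illustration under stated assumptions: the function names, the record layout, and the sample events are hypothetical and are not part of Data Pipelines or Hadoop.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

KeyValue = Tuple[str, int]


def map_phase(records: Iterable[dict],
              mapper: Callable[[dict], Iterable[KeyValue]]) -> Iterator[KeyValue]:
    """Apply the mapper to every record and emit (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs: Iterable[KeyValue]) -> Dict[str, List[int]]:
    """Collect all values that share a key, as the shuffle step would."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(groups: Dict[str, List[int]],
                 reducer: Callable[[List[int]], int]) -> Dict[str, int]:
    """Collapse each key's list of values into a single result."""
    return {key: reducer(values) for key, values in groups.items()}


def count_by_account(records: Iterable[dict]) -> Dict[str, int]:
    """Equivalent of GROUP BY account followed by COUNT."""
    pairs = map_phase(records, lambda r: [(r["account"], 1)])
    return reduce_phase(shuffle(pairs), sum)


# Hypothetical sample data, for illustration only.
SAMPLE_EVENTS = [
    {"account": "acme", "event": "login"},
    {"account": "globex", "event": "login"},
    {"account": "acme", "event": "purchase"},
]

if __name__ == "__main__":
    print(count_by_account(SAMPLE_EVENTS))
```

On a real cluster the map and reduce phases run in parallel across many nodes; here they run sequentially in one process.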
Architecture
Salesforce Data Center
Data Pipeline
BigObjects vs. SObjects
Feature | SObjects | BigObjects
Use cases | CRM transactional data | Read-only immutable data
Data volumes | <50M rows | Billions of rows
Field types | All types | Strings, numbers, dates, JSON
Query | Real-time query response | Blend of real-time and asynchronous query response, determined by size of result set
Transactions | ACID transactions | Record-level consistency
Access Management | Full sharing | User permissions and field-level security
APIs | Full support | SOQL, Async Query, Data Pipelines
Triggers | Full support | None
Reports | Full support | Limited CRTs
Search | Full support | None
DEMO
Transformations
● JOIN
● FILTER
● UNION
● MERGE
● GROUP
● DISTINCT
● ORDER BY
● RANK
● LIMIT
● … and many more
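To make a few of these transformations concrete, here is a minimal sketch of FILTER, JOIN, GROUP, ORDER BY, and LIMIT chained together. It is written in plain Python over in-memory rows rather than in the Pig-based pipeline language, and every name in it (the helper functions, the field names, the sample cases and accounts) is a hypothetical stand-in chosen for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]

# Hypothetical sample data, for illustration only.
CASES: List[Row] = [
    {"case_id": 1, "account_id": "A1", "status": "Open"},
    {"case_id": 2, "account_id": "A2", "status": "Closed"},
    {"case_id": 3, "account_id": "A1", "status": "Open"},
    {"case_id": 4, "account_id": "A3", "status": "Open"},
]
ACCOUNTS: List[Row] = [
    {"account_id": "A1", "name": "Acme"},
    {"account_id": "A2", "name": "Globex"},
    {"account_id": "A3", "name": "Initech"},
]


def filter_rows(rows: List[Row], predicate: Callable[[Row], bool]) -> List[Row]:
    """FILTER: keep only the rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


def join_rows(left: List[Row], right: List[Row], key: str) -> List[Row]:
    """JOIN: inner hash join of two row sets on a shared key."""
    index: Dict[object, List[Row]] = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def group_count(rows: List[Row], key: str) -> Dict[object, int]:
    """GROUP: count the rows in each group."""
    counts: Dict[object, int] = defaultdict(int)
    for row in rows:
        counts[row[key]] += 1
    return dict(counts)


def open_cases_per_account(cases: List[Row], accounts: List[Row],
                           limit: int) -> List[Tuple[object, int]]:
    """FILTER, JOIN, GROUP, ORDER BY (count descending), then LIMIT."""
    open_cases = filter_rows(cases, lambda r: r["status"] == "Open")
    joined = join_rows(open_cases, accounts, "account_id")
    counts = group_count(joined, "name")
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], str(kv[0])))
    return ranked[:limit]


if __name__ == "__main__":
    print(open_cases_per_account(CASES, ACCOUNTS, limit=2))
```

The same chain expressed in a dataflow language would let the platform distribute each step; the sketch only shows what each operator does to the rows.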
Key Use Cases
• Native Big Data Processing
• Data Prep for Descriptive Analytics
• Data Enrichment to turn “Insight into Actions”
• Handling Semi-structured Data: JSON, HTML, XML and other complex semi-structured data
Diagram labels: Big Object, Ext Object, Files, sObject, Wave
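As an illustration of what handling semi-structured data can involve, the sketch below flattens a nested JSON document into flat, dot-separated column names so that it can be treated as an ordinary row. This is a generic technique shown in plain Python, not the Data Pipelines API; the `flatten` helper and the sample payload are hypothetical.

```python
import json
from typing import Any, Dict


def flatten(value: Any, prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts and lists into one dict with dotted keys.

    List elements are keyed by their index. Empty dicts and empty lists
    contribute no columns.
    """
    flat: Dict[str, Any] = {}
    if isinstance(value, dict):
        for key, child in value.items():
            flat.update(flatten(child, f"{prefix}{key}."))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            flat.update(flatten(child, f"{prefix}{index}."))
    else:
        # Drop the trailing separator added by the caller.
        flat[prefix[:-1]] = value
    return flat


# Hypothetical device reading, for illustration only.
SAMPLE = '{"device": {"id": "d-1", "tags": ["a", "b"]}, "reading": 21.5}'

if __name__ == "__main__":
    for column, cell in flatten(json.loads(SAMPLE)).items():
        print(column, "=", cell)
```

Once flattened, the record has fixed column names and can go through the same filter, join, and group steps as structured data.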
Customer Stories
• Gamification (update user levels based on experience points): Large-volume data processing (250M+ records). Trawl the rewards and update user objects. Later, would like to use analytics.
• Computing Partner Scorecards: The scorecard determines status, which in turn determines pricing and the resources that partners can access to assist in sales. Calculated multiple times every week for Partner Accounts (70h+).
• Asset Management: Account assignment at account/office/contact levels. Would like to run daily.
• Analytics: Correlate game-play data with customer interaction to improve customer retention, loyalty, etc.
• Analytics: Multi-org consolidation; white-space analysis.
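The gamification story (trawl reward records, then update user levels from experience points) boils down to an aggregate followed by a lookup. The sketch below shows that shape in plain Python. The level thresholds, field names, and sample events are invented for illustration and do not come from the customer's implementation.

```python
from bisect import bisect_right
from collections import defaultdict
from typing import Dict, Iterable

# Hypothetical thresholds: minimum total points needed for levels 1 to 4.
LEVEL_THRESHOLDS = [0, 100, 500, 2000]


def total_points(reward_events: Iterable[dict]) -> Dict[str, int]:
    """Sum experience points per user (a GROUP BY user_id with SUM)."""
    totals: Dict[str, int] = defaultdict(int)
    for event in reward_events:
        totals[event["user_id"]] += event["points"]
    return dict(totals)


def level_for(points: int) -> int:
    """Return the highest level whose threshold the points meet."""
    return bisect_right(LEVEL_THRESHOLDS, points)


def compute_levels(reward_events: Iterable[dict]) -> Dict[str, int]:
    """Aggregate points per user, then map each total to a level."""
    return {user: level_for(points)
            for user, points in total_points(reward_events).items()}


# Hypothetical reward records, for illustration only.
SAMPLE_REWARDS = [
    {"user_id": "u1", "points": 60},
    {"user_id": "u2", "points": 40},
    {"user_id": "u1", "points": 50},
]

if __name__ == "__main__":
    print(compute_levels(SAMPLE_REWARDS))
```

At the volumes quoted in the story (250M+ records), the aggregation step is the part that benefits from running as a distributed job; the level lookup is cheap per user.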
Future
Roadmap Themes
1. Resource Management/Fair Allocation
2. Predictive Analytics
3. Business Analyst/Salesforce Admin Interface
Summary & Next-Steps
Why Data Pipeline?
● Massive Parallelism (10-40X performance improvement)
● Overcome governor limits
● Work towards Data Lake Architecture
● Reduce complexity/cost - 100% Platform-Native
Resources
● Implementation Guide - http://docs.releasenotes.salesforce.com/en-us/summer15/release-notes/rn_forcecom_data_pipelines.htm
Join the Pilot Program
Any questions: [email protected]
Salesforce.com Confidential
FUTURE
BigObjects
External SObjects
• New object type optimized for extremely large row-count
• Use cases: read-only data from external systems, point-of-sale data, connected product event data, clickstream data, etc.
• Backed by HBase as a System of Record
• Integrated into platform via External sObject framework, Phoenix, Pliny
Diagram labels: HBase, Phoenix SQL, Pliny SOQL, Platform
Data Pipelines Overview
• Data Pipelines: programmatic language based on Apache Pig plus whitelisted UDF libraries (Piggybank, DataFu)
• Declarative tooling for admins and analysts
• Wave, Dev Console, Setup
• Multi-tenancy Hadoop, resource management, scheduling, job monitoring and management
• Data Sources: SObjects, BigObjects, Wave Data Sets, External Objects, Files, Archive Objects
• Data Targets: SObjects, BigObjects, Wave Data Sets, External Objects, Files, Archive Objects
• Data Set Objects: snapshot for provenance tracking
• Generate Data Pipelines
• Generate MapReduce jobs
• Data Processing
Remove Data Sets Object Declarative Tooling - bring it later
Customer Stories
Customer Name | Brief Description | Use-cases
Cloud App | CloudApps increases organisational performance by enabling, encouraging, enhancing and measuring behavioural change using gamification. | Large-volume data processing (250M+ records). Trawl the rewards and update user objects. Later, would like to use analytics.
EMC | Computing | Partner Scorecards. Business Partner scorecards help partners track whether they qualify for a particular Partner Tier status (Gold, Silver, Platinum). Tier status determines pricing and the resources that partners can access to assist in sales. Scorecards are calculated multiple times every week for Partner Accounts, which takes 70h+. While they are being processed, scorecards are zeroed out and a partner cannot see the details of why they are in a certain status. To process them in a shorter window (~10h), EMC has reduced the total number of Partner Accounts that qualify for the Business Partner program from 22K to 780.
Legg Mason | Asset Management | Legg Mason has built an internal process to update account assignment at account/office/contact levels. They would like to do this more frequently, but the async batch Apex process hits several limits, which prevents them from running it daily.
Activision | Video Game Developer | Activision wants to correlate game-play data with customer interaction to improve customer retention, loyalty, etc. Currently they load game-play data every 2 weeks; they would like to do that daily. They also want to use Pipelines to join game-play data with Case records and use Analytics to drive insight (for example, the impact of a service issue on gaming behaviour).
Financial Force | ERP on Platform | FF receives files by email and has to do manual downstream processing to generate invoices, etc. based on these incoming files. They want to leverage Pipelines to scale and automate some steps.
USPS | Business Transformation | USPS wants to combine CRM data with external data (from Equifax) to marry physical address with digital identity for a user. They expect 500 million external records, and they will build transformational applications based on this data (for example, Twitter handle on envelopes, Uber for packages, etc.).
Cisco | Multiple use-cases | Multi-org consolidation; white-space analysis.
Data Pipelines Roadmap (WORK ON THIS SLIDE)
196 Summer '15
- Metadata API
- Simple Monitoring
- Dev Console integration
- Logging improvements
- Deployment to HBase servers
198 Winter '16 / DF15
- Spark for internal customers
- Wave connectors
- Better error handling
- Monitoring improvements
- Basic limits
200 Spring '16
- Resource management
- Scheduler
- Performance / optimization
- Hardening
Milestones: Pilot II, Pilot III, GA (stretch goal)
External SObjects
BigObjects
• New object type optimized for extremely large row-count
• Targeted functionality
• Use cases: read-only data from external systems, point-of-sale data, connected product event data, clickstream data, etc.
• Backed by HBase as a System of Record
• Integrated into platform via External sObject framework, Phoenix, Pliny
Diagram labels: HBase, Phoenix SQL, Pliny SOQL, Platform
BigObjects vs. SObjects
Feature | SObjects | BigObjects
Use cases | CRM transactional data | Write-once / read-only data from external systems, point-of-sale data, connected product event data, clickstream data, etc.
Data volumes | <50M rows | Billions of rows
Field types | All types | Strings, numbers, dates
Query | Real-time query response | Blend of real-time and asynchronous query response, determined by size of result set
Transactions | ACID transactions | Eventually consistent
Access Management | Full sharing | Object permission based; sharing descriptors in future
APIs | Full support | REST, SOQL, Bulk
Triggers | Full support | None
Reports | Full support | Limited CRTs
Search | Full support | None