Building a Scalable Data Science Platform with R on HDInsight


BR005

Microsoft Machine Learning & Data Science Summit | September 26 – 27 | Atlanta, GA

Building a Scalable Data Science Platform with R on HDInsight
Debraj GuhaThakurta, Senior Data Scientist
Data Group – Algorithms and Data Science, Redmond

Email: debraj.guhathakurta@microsoft.com | Twitter: @d_guhathakurta

Co-contributors: Mario Inchiosa, Katherine Zhao, Hang Zhang, Max Kaznadi

• R & Spark as Yin and Yang of Scalable Machine Learning in Azure HDInsight • Mon, Sept 26, 1:30 – 2:30 PM • Maxim Lukiyanov

• Big, Fast, and Data-Furious…with Spark • Tue, Sept 27, 12:30 – 1:30 PM • Maxim Lukiyanov

• Instructor-Led Lab: The Cortana Intelligence Suite - Part Two: Deep Dive • Mon, Sept 26, 10:30 AM – 5 PM • Buck Woody

• Self-Paced Lab: Microsoft R Server • Mon, Sept 26, 1 – 4 PM; Tue, Sept 27, 10:30 – 11:30 AM & 12:30 – 2:30 PM • Jeremy Reynolds

• Data Science Doesn't Just Happen, It Takes a Process. Learn about Ours… • Tue, Sept 27, 3 – 4 PM • Hang Zhang, Jacob Spoelstra, Gopi Kumar

Related talks

• Microsoft R Server: Benefits

• R Server on HDInsight (Premium, Preview): Scalable analytical platform on Azure

• How to:
  • Develop an end-to-end data science process using R Server on Spark HDInsight (Premium)
  • Adopt the process and code

Key takeaways

• R and its benefits / limitations
• Microsoft R Server: scalable, enterprise-class
• R Server on HDInsight (Premium) clusters
• Demo – developing end-to-end data science processes using R Server on HDInsight Spark clusters
• Pointers to technical content: tutorials, templates, blogs

Agenda

R – its benefits and limitations

R - introduction

Community
• 2.5M+ users
• Taught in most universities
• Thriving user groups worldwide

Language
• The most popular statistical programming & ML language
• Data visualization & reporting tool

Platform
• Open source, transparent
• Free

Ecosystem
• 9,000+ contributed packages
• Applications & integration
• Many use cases / business problems addressed

Preferred language by Analytics Professionals

Source: SAS, R or Python Survey 2016, by Burtch Works

Which do you prefer to use: SAS, R, or Python?


Unified IEEE Spectrum Ranking 2016: http://spectrum.ieee.org/computing/software/the-2016-top-programming-languages

8

Common R use cases by vertical (spanning Sales & Marketing, Finance & Risk, Customer & Channel, and Operations & Workforce):

Retail: Demand Forecasting, Loyalty Programs, Cross-sell & Upsell, Customer Acquisition, Fraud Detection, Pricing Strategy, Personalization, Lifetime Customer Value, Product Segmentation, Store Location Demographics, Supply Chain Management, Inventory Management

Financial Services: Customer Churn, Loyalty Programs, Cross-sell & Upsell, Customer Acquisition, Fraud Detection, Risk & Compliance, Loan Defaults, Personalization, Lifetime Customer Value, Call Center Optimization, Pay for Performance

Healthcare: Marketing Mix Optimization, Patient Acquisition, Fraud Detection, Bill Collection, Population Health, Patient Demographics, Operational Efficiency, Pay for Performance

Manufacturing: Demand Forecasting, Marketing Mix Optimization, Pricing Strategy, Performance Risk Management, Supply Chain Optimization, Personalization, Remote Monitoring, Predictive Maintenance, Asset Management

9

Processing limitations of open source R

• In-Memory Operation

• Lack of Parallelism

• Expensive Data Movement & Duplication

Open source R is not enterprise class
• Inadequacy of community support
• Lack of guaranteed support timeliness
• No SLAs or support models

Microsoft R Server

R from Microsoft brings:

Peace of mind, speed and scalability, efficiency, flexibility

• Support and SLA
• Works on data in memory or on disk (scale)
• Wide range of scalable and distributed R functions
• Works in several compute contexts (incl. Hadoop, Spark, SQL Server) and data sources (incl. disk, HDFS, SQL)

Portability & investment assurance

R Server portfolio

Cloud: Windows, Linux
RDBMS: SQL Server 2016 EE, SQL Server 2016 SE
Desktops & Servers: Windows, Linux
Hadoop & Spark: Hortonworks, Cloudera, MapR
EDW: SQL Server 2016, Teradata Database

R Server technology stack: R+CRAN, Microsoft R Open, DistributedR, ScaleR, ConnectR, DeployR

R Server Technology

14

Write once deploy anywhere - WODA

• On a workstation:
  • All available cores used for math operations and parallel processes
  • Hard drive capacity sets the limit for data size, not RAM
  • Works directly on XDF (external data frames) on disk

• On a cluster:
  • Parallel utilization of nodes
  • Distributed file systems like HDFS greatly expand possible data sizes

ScaleR – parallel or distributed processing

15
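As a minimal sketch of the workstation workflow above (assuming RevoScaleR, the library behind the ScaleR functions, is installed; file and column names are illustrative, matching the airline data used later):

library(RevoScaleR)   # rx* functions ship with Microsoft R Server

## Import a CSV into the on-disk XDF format once; later steps stream it from disk
airXdf <- rxImport(inData = "AirOnTime.csv", outFile = "AirOnTime.xdf", overwrite = TRUE)

## Summaries and models read the XDF in blocks, so data size is bounded by disk, not RAM
rxSummary(~ ArrDelay + DayOfWeek, data = airXdf)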

Lee Edlefsen: PEMAs applied to GLM, http://www.slideshare.net/RevolutionAnalytics/parallel-external-memory-algorithms-applied-to-generalized-linear-models

ScaleR PEMA: Parallel external memory algorithms

Stream data into RAM in blocks. “Big Data” can be any data size. Can handle Megabytes to Gigabytes to Terabytes…

ScaleR algorithms work inside multiple cores / nodes in parallel at high speed

Interim results are collected and combined analytically to produce the output on the entire data set

XDF file format is optimised to work with the ScaleR library and significantly speeds up iterative algorithm processing.

16

Lee Edlefsen: PEMAs applied to GLM, http://www.slideshare.net/RevolutionAnalytics/parallel-external-memory-algorithms-applied-to-generalized-linear-models
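As a rough code sketch of the PEMA pattern (assuming RevoScaleR and an illustrative XDF file), the number of on-disk blocks streamed into RAM per read can be controlled with blocksPerRead; interim results from each chunk are combined into the final model:

airXdf  <- RxXdfData("AirOnTime.xdf")                     # on-disk XDF data source
delayLm <- rxLinMod(ArrDelay ~ DayOfWeek + CRSDepTime,
                    data = airXdf,
                    blocksPerRead = 1)                    # stream one block of rows at a time
summary(delayLm)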

• Linear regression (rxLinMod)
• Generalized linear models (rxLogit, rxGLM)
• Decision trees (rxDTree)
• Gradient boosted decision trees (rxBTrees)
• Random forests (rxDForest)
• K-means (rxKmeans)
• Naïve Bayes (rxNaiveBayes)

Available ScaleR distributed algorithms

17
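These functions share a common formula / data-source interface; a short sketch (the data source and variables are hypothetical):

airXdf <- RxXdfData("AirOnTime.xdf")

## Decision tree and random forest on the same inputs
dTreeModel   <- rxDTree(ArrDelay15 ~ DayOfWeek + CRSDepTime, data = airXdf)
dForestModel <- rxDForest(ArrDelay15 ~ DayOfWeek + CRSDepTime, data = airXdf, nTree = 50)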

ScaleR distributed functionality

ETL
• Data import – delimited, fixed, SAS, SPSS, ODBC
• Variable creation & transformation
• Recode variables
• Factor variables
• Missing value handling
• Sort, Merge, Split
• Aggregate by category (means, sums)

Statistical Tests
• Chi Square Test
• Kendall Rank Correlation
• Fisher's Exact Test
• Student's t-Test

Descriptive Statistics
• Min / Max, Mean, Median (approx.)
• Quantiles (approx.)
• Standard Deviation, Variance
• Correlation, Covariance
• Sum of Squares (cross product matrix for set variables)
• Pairwise Cross tabs
• Risk Ratio & Odds Ratio
• Cross-Tabulation of Data (standard tables & long form)
• Marginal Summaries of Cross Tabulations

Predictive Statistics
• Sum of Squares (cross product matrix for set variables)
• Multiple Linear Regression
• Generalized Linear Models (GLM): exponential family distributions (binomial, Gaussian, inverse Gaussian, Poisson, Tweedie); standard link functions (cauchit, identity, log, logit, probit); user-defined distributions & link functions
• Covariance & Correlation Matrices
• Predictions/scoring for models
• Residuals for all models

Machine Learning
• K-Means Clustering
• Linear regression
• Logistic regression
• Decision Trees
• Decision Forests
• Gradient Boosted Decision Trees
• Naïve Bayes

Variable Selection
• Stepwise Regression

Simulation
• Simulation (e.g. Monte Carlo)
• Parallel Random Number Generation

Custom Parallelization
• rxExec
• PEMA-R API

18
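For the custom parallelization entry above, rxExec runs an arbitrary R function in parallel across the cores or nodes of the current compute context; a minimal sketch with an illustrative task:

simulateOnce <- function(n) mean(rnorm(n))                   # any user-defined function
results <- rxExec(simulateOnce, n = rxElemArg(rep(1e6, 8)))  # one parallel run per element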

• Any analysis that is more complex than simple aggregations
• Analysis with data that fit in the physical memory of a single machine
• Creating sophisticated visualizations (e.g. ggplot, lattice)
• Creating reports (use knitr and Markdown)
• Analyses that use domain-specific tools or cutting-edge algorithms (e.g. forecasting, health informatics, etc.)

Typical uses of open source R

• Working with big data
• Building models that take too long to run in R
• Working with clusters and distributed file systems (e.g. HDInsight clusters + HDFS)
• Developing portable scripts for many compute contexts

Typical uses of R Server

Big Data: open source R is in-memory bound; R Server adds hybrid memory & disk scalability and operates on bigger volumes of data.

Speed of Analysis: open source R is single threaded; R Server uses parallel threading and shrinks analysis time.

Enterprise Readiness: open source R relies on community support; R Server adds commercial support and delivers full-service production support.

Analytic Breadth & Depth: open source R offers 9,000+ innovative analytic packages; R Server leverages open source packages plus Big Data-ready packages and supercharges R with ScaleR functions.

Commercial Viability: open source R carries deployment risk; R Server's commercial license eliminates that risk.

Benefits of R Server

R Server

R Server on HDInsight (Premium)

R Server on HDInsight (Premium): managed Hadoop for advanced analytics in the cloud

Stack: R / RevoScaleR / others (e.g. SparkR) on Hadoop / Spark, over Blob Storage (HDFS) or Data Lake Storage

• Easy setup, elastic, SLA
• R Server benefits
  • Leverage R skills
  • ScaleR functions
  • ….
• Familiar & enhanced IDEs
  • Popular IDEs (RStudio, RTVS, Notebooks, etc.)

23


Provisioning HDInsight (Premium) with R Server

24

Elastic - Scaling HDInsight clusters25

R Server on HDInsight – Architecture

26

[Diagram: data scientists connect to R Server on the edge node; the cluster comprises head nodes and data/worker nodes, each running R.]

R Server on HDInsight - Connectivity

[Diagram: thin-client IDEs (ssh, R Tools for Visual Studio, Jupyter Notebooks over https) connect to the edge node; remote execution via ssh reaches the R Server master task, which launches worker tasks on the data nodes via Spark or MapReduce.]

27

R Server on HDInsight - Data processing

• Server local processing: a single R process on the edge node operates on data in distributed storage.
• Server distributed processing: a master R process on the edge node coordinates worker R processes on the data nodes through Apache YARN / Spark.

28

Write once, deploy anywhere (WODA) – switching compute contexts. Code can be deployed from a server or edge node to run in Spark/Hadoop without any functional R model re-coding.

## Statistical summary
rxSummary(~ ArrDelay + DayOfWeek, data = AirlineData, reportProgress = 1)

## Linear model and plot
hdfsXdfArrLateLinMod <- rxLinMod(ArrDelay ~ DayOfWeek + CRSDepTime, data = AirlineData)

## SET UP LOCAL ENVIRONMENT VARIABLES ##
myLocalCC <- "localpar"

## LOCAL COMPUTE CONTEXT ##
rxSetComputeContext(myLocalCC)

Local parallel processing – Linux or Windows

Compute-context R script – sets where the model will run
R script – does not need to change to run in Hadoop/Spark

29

## In Spark
mySparkCC <- RxSpark()
rxSetComputeContext(mySparkCC)

## In Hadoop (MapReduce)
myHadoopCC <- RxHadoopMR()
rxSetComputeContext(myHadoopCC)

R Script for Execution in MapReduce

Sample R Script:

rxSetComputeContext( RxHadoopMR(…) )
inData <- RxTextData("/ds/AirOnTime.csv", fileSystem = hdfsFS)
model <- rxLogit(ARR_DEL15 ~ DAY_OF_WEEK + UNIQUE_CARRIER, data = inData)

Define Compute Context

Define Data Source

Train Predictive Model

30

Easy to Switch From MapReduce to Spark

Keep other code unchanged

Sample R Script:

rxSetComputeContext( RxSpark(…) )
inData <- RxTextData("/ds/AirOnTime.csv", fileSystem = hdfsFS)
model <- rxLogit(ARR_DEL15 ~ DAY_OF_WEEK + UNIQUE_CARRIER, data = inData)

Change the Compute Context

31

Creating a data science process using R Server on Spark HDInsight

Apache Spark engine and its APIs

Denny Lee, Databricks

Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX

o Scale out, fault tolerant, distributed, in-memory processing

o Multi-language API (incl. R)

o Standard libraries: ML, statistics

33

Spark’s use cases - Diverse industries & scenarios

Source: Databricks Spark 2015 survey report, https://databricks.com/blog/2015/09/24/spark-survey-2015-results-are-now-available.html

34

Spark advanced analytics

Source: Databricks Spark 2015 survey report, https://databricks.com/blog/2015/09/24/spark-survey-2015-results-are-now-available.html

Advanced analytics is an important Spark feature

R is rapidly gaining popularity

(Available since June 2015)

35

Open-source packages for ML in Spark using R
o SparkR:
  o R package – a light-weight front-end for Apache Spark from R
  o Limited in terms of ML algorithm bindings at this time
  o Works on MLlib functions (RDDs)
o sparklyr:
  o Developed by RStudio
  o Provides R bindings to the spark.ml library

36
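For instance, a minimal SparkR sketch (assuming the Spark 2.x-style SparkR session API; the file path and columns are illustrative) that reads a CSV into a Spark DataFrame and fits a binomial GLM through the MLlib bindings:

library(SparkR)
sparkR.session()                                          # connect to the Spark cluster

## Read a CSV from distributed storage into a Spark DataFrame
air <- read.df("/ds/AirOnTime.csv", source = "csv", header = "true", inferSchema = "true")

## Logistic regression via Spark MLlib
fit <- glm(ARR_DEL15 ~ DAY_OF_WEEK + UNIQUE_CARRIER, data = air, family = "binomial")
summary(fit)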

Data science / advanced analytics process

http://aka.ms/tdsp

37

• Git-based repositories with templates providing a central archive
• Standardized project structure
• Document templates
• Utility scripts
• Independent of the execution environment, to allow scientists to use multiple cloud resources as needs dictate

Building intelligent applications using the team data science process
https://blogs.technet.microsoft.com/machinelearning/2016/09/08/building-intelligent-applications-using-the-team-data-science-process/
http://aka.ms/tdsp

Data Science Doesn’t Just Happen, It Takes a Process. Learn about Ours…

Tue, Sept 27, 3 – 4 PMHang Zhang, Jacob Spoelstra, Gopi Kumar

38

Prepare: Assemble, cleanse, profile and transform diverse data relevant to the subject

Model: Use statistical and machine learning algorithms to build classifiers and regression models

Operationalize: Make predictions and visualizations to support business applications

DS process shown in demo

Prepare → Model → Operationalize

39

E2E Demo/Example: flight arrival delay prediction
1. Provisioning clusters using PowerShell scripts
2. Prep (clean/join) – using SparkR from R Server
3. Model (train/score/evaluate) – ScaleR
4. Deployment – to Azure ML from R Server

40

End-to-end data science process example

Azure Blob Storage

HDInsight

Microsoft R Server, Azure Machine Learning

Web Application

Flow: Data Sources → Data Partition → Feature Engineering → Model Training → Predictions → Web Services → Consumption (Power BI)

KDD 2016 tutorial (Using R on Spark): tinyurl.com/KDD2016R; Azure Machine Learning: https://azure.microsoft.com/en-us/services/machine-learning/

41

• Azure Blob storage (HDFS)
• R Server on Spark HDInsight (Premium)
• Azure ML R package and Azure ML web service
• Power BI (optional)

Technologies / services used

Provisioning & deleting R Server Spark HDInsight clusters using Azure cmdlets & ARM templates

## CREATE CLUSTERS USING ARM TEMPLATES
$templatePath = "https://github.com/Azure/Azure-MachineLearning-DataScience/blob/master/Misc/KDDCup2016/Scripts/Configuration/azuredeploy.json";

$hdiparams = @{clusterType="spark"; clusterName=$clustername; clusterLoginUserName="admin"; clusterLoginPassword=$clusterpasswd; sshUserName="remoteuser"; sshPassword=$clusterpasswd; clusterWorkerNodeCount=2};

New-AzureRmResourceGroupDeployment -Name $clustername -ResourceGroupName $resourcegroup -TemplateParameterObject $hdiparams -TemplateUri $templatePath;

## DELETE CLUSTERS
Remove-AzureRmHDInsightCluster -ClusterName $clustername

43

Script-based deployment of HDInsight clusters with R Server

44

• Predict whether a flight arrival is going to be delayed by 15 minutes or not (binary classification), based on features:
  • Airline, flight, airport
    • Airline carrier
    • Type of airplane / vehicle
    • Departure and arrival airports
    • Flight distance
    • Month, week, day
  • Weather
    • Wind speed
    • Visibility
    • Humidity

Prediction task: Predict flight delays

45

• Passenger flight on-time performance data from the US Department of Transportation’s TranStats data collection

• >20 years of data
• 300+ airports
• Every carrier, every commercial flight
• http://www.transtats.bts.gov

Dataset: Airline & Weather

• Hourly land-based weather observations from NOAA (National Oceanic and Atmospheric Administration)
• >2,000 weather stations
• http://www.ncdc.noaa.gov/orders/qclcd/

Airline Weather

Connection: thin client → RStudio Server; glimpse of down-sampled data (19 million rows)

Data prep: clean and join using SparkR in R Server

48

• SparkR: R package – a light-weight front-end for Apache Spark from R
• Provides distributed operations like selection, filtering, and aggregation using Spark SQL
• Distributed machine learning using Apache Spark's MLlib (limited)
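A minimal sketch of the clean/join step (the DataFrame and column names below are illustrative, not the exact ones used in the demo):

library(SparkR)
## Assume airDF and weatherDF are SparkR DataFrames already read from distributed storage

## Basic cleaning: drop cancelled flights and keep the columns of interest
airDF <- filter(airDF, airDF$CANCELLED == 0)
airDF <- select(airDF, "ARR_DEL15", "DAY_OF_WEEK", "UNIQUE_CARRIER",
                "ORIGIN_AIRPORT_ID", "CRS_DEP_TIME")

## Join flights to weather observations at the departure airport and hour
joined <- join(airDF, weatherDF,
               airDF$ORIGIN_AIRPORT_ID == weatherDF$AIRPORT_ID &
                 airDF$CRS_DEP_TIME == weatherDF$HOUR,
               "inner")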

Modeling: train, score, and evaluate using ScaleR functions

49
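A condensed sketch of the train/score/evaluate loop (assuming the joined data has been written to XDF in HDFS; file, variable, and feature names are illustrative):

## Point ScaleR at the Spark cluster
rxSetComputeContext(RxSpark())

hdfsFS   <- RxHdfsFileSystem()
trainXdf <- RxXdfData("/ds/airWeatherTrain", fileSystem = hdfsFS)
testXdf  <- RxXdfData("/ds/airWeatherTest",  fileSystem = hdfsFS)

## Train a distributed logistic regression
logitModel <- rxLogit(ARR_DEL15 ~ DAY_OF_WEEK + UNIQUE_CARRIER + WIND_SPEED, data = trainXdf)

## Score held-out data and evaluate with an ROC curve
scored <- rxPredict(logitModel, data = testXdf,
                    outData = RxXdfData("/ds/airWeatherScored", fileSystem = hdfsFS),
                    writeModelVars = TRUE)
rxRocCurve(actualVarName = "ARR_DEL15", predVarNames = "ARR_DEL15_Pred", data = scored)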

Modeling scalability with ScaleR on Spark HDInsight: scales linearly to hundreds of nodes, billions of rows, and terabytes of data

50

[Chart: Logistic regression on the NYC Taxi dataset – elapsed time vs. number of rows, up to ~13 billion rows (2.2 TB). HDInsight (Premium) Spark cluster with 100 D12 (4-core, 28 GB) worker nodes. Credit: Mario Inchiosa]

Comparison of ScaleR with open source algorithms (Preliminary)

51

[Chart: Logistic regression, end-to-end elapsed time reading from CSV files, vs. number of rows (millions), comparing ScaleR with open source algorithms. Configuration: 7-node HDI cluster; 1 edge node (8 cores, 28 GB); 4 worker nodes (8 cores, 28 GB); dataset: duplicated Airlines data (.csv), 26 columns. Credit: Katherine Zhao]

Azure ML - Deploying web services for predictive analytics

52

Easily build ML models; easily deploy models as web services

Deployment: publish a web service from R Server to Azure ML

53

azureml-settings.json{"workspace": {"id": “<>", "authorization_token": “<>", "api_endpoint": "https://studioapi.azureml.net",

"management_endpoint":

https://management.azureml.net }}

A prediction web service in Azure ML
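A minimal sketch of the publish step using the AzureML CRAN package (the workspace credentials come from azureml-settings.json above; the scoring function, model, and schema are illustrative):

library(AzureML)

## Workspace handle built from the id / token in azureml-settings.json
ws <- workspace(id = "<>", auth = "<>", api_endpoint = "https://studioapi.azureml.net")

## Scoring function wrapping a locally converted model (see the rpart conversion in the backup slides)
scoringFn <- function(newdata) {
  library(rpart)
  predict(rpartModel, newdata = newdata)
}

## Publish the function as an Azure ML web service
svc <- publishWebService(ws, fun = scoringFn, name = "FlightDelayPrediction",
                         inputSchema = list(DAY_OF_WEEK = "integer",
                                            UNIQUE_CARRIER = "character",
                                            CRS_DEP_TIME = "numeric"))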

Adopting process and code - Resources

Tutorials - Scalable data analytics using R Server

• KDD Conference tutorial 2016
  • http://www.tinyurl.com/KDD2016R

• Public GitHub repository

56

Summary & acknowledgements

• R Server on Azure HDInsight (Premium) – a managed, distributed compute platform for data science

• Scalable end-to-end processes can be built on HDI clusters integrated with other Azure services

• Published resources (with code) are available for developing analytical workflows

Summary

• Mario Inchiosa [Principal Software Engineer]
• Katherine Zhao [Data Scientist II]
• Jeremy Reynolds [Senior Data Scientist Lead]
• Max Kaznadi [Data Scientist II]
• Hang Zhang [Senior Data Scientist Manager]

Acknowledgements

Thank you!
Debraj Guhathakurta
debraj.guhathakurta@microsoft.com

© Copyright Microsoft Corporation. All rights reserved.

Backups

Microsoft R Server components: R+CRAN, Microsoft R Open, DistributedR, ScaleR, ConnectR, DeployR, RTVS

R Server architecture

ConnectR
• High-speed & direct connectors
• Available for: high-performance XDF; SAS, SPSS, delimited & fixed-format text data files; Hadoop HDFS (text & XDF); Teradata Database; EDWs and ADWs; ODBC

ScaleR
• Ready-to-use high-performance big data analytics
• Fully-parallelized analytics
• Data prep & data distillation
• Descriptive statistics & statistical tests
• Range of predictive functions
• User tools for distributing customized R algorithms across nodes

DistributedR
• Distributed computing framework
• Delivers cross-platform portability

R+CRAN
• Open source R interpreter
• Freely-available huge range of R algorithms
• Algorithms callable by Microsoft R
• Embeddable in R scripts
• 100% compatible with existing R scripts, functions and packages

Microsoft R Open
• Based on open source R
• High-performance math library to speed up linear algebra functions
• Checkpoint package to easily share R code and replicate results using specific R package versions

DeployR
• RESTful APIs for easy integration from Java, JavaScript, .NET
• Enterprise authentication & security

R Tools for Visual Studio
• State-of-the-art R IDE for Visual Studio

Modeling: train, score, and evaluate using R Server

66

Deployment: publish a web service from R

## Convert the rxDTree model to an rpart object so it can be scored without R Server
rpartModel <- as.rpart(dTreeModel)

## Scoring function to publish as a web service
scoringFn <- function(newdata) {
  library(rpart)
  predict(rpartModel, newdata = newdata)
}

67

azureml-settings.json{"workspace":

{"id": “<>", "authorization_token": “<>", "api_endpoint":

"https://studioapi.azureml.net", "management_endpoint":

https://management.azureml.net}

}