Building a Scalable Data Science Platform with R on HDInsight


BR005

Microsoft Machine Learning & Data Science Summit | September 26 – 27 | Atlanta, GA

Building a Scalable Data Science Platform with R on HDInsight
Debraj GuhaThakurta, Senior Data Scientist
Data Group – Algorithms and Data Science, Redmond

Email: debraj.guhathakurta@microsoft.com | Twitter: @d_guhathakurta

Co-contributors: Mario Inchiosa, Katherine Zhao, Hang Zhang, Max Kaznadi

• R & Spark as Yin and Yang of Scalable Machine Learning in Azure HDInsight • Mon, Sept 26, 1:30 – 2:30 PM • Maxim Lukiyanov

• Big, Fast, and Data-Furious…with Spark • Tue, Sept 27, 12:30 – 1:30 PM • Maxim Lukiyanov

• Instructor-Led Lab: The Cortana Intelligence Suite - Part Two: Deep Dive • Mon, Sept 26, 10:30 AM – 5 PM • Buck Woody

• Self-Paced Lab: Microsoft R Server • Mon, Sept 26, 1 – 4 PM; Tue, Sept 27, 10:30 – 11:30 AM & 12:30 – 2:30 PM • Jeremy Reynolds

• Data Science Doesn't Just Happen, It Takes a Process. Learn about Ours… • Tue, Sept 27, 3 – 4 PM • Hang Zhang, Jacob Spoelstra, Gopi Kumar

Related talks

• Microsoft R Server: Benefits

• R Server on HDInsight (Premium, Preview): Scalable analytical platform on Azure

• How to:
  • Develop an end-to-end data science process using R Server on Spark HDInsight (Premium)
  • Adopt the process and code

Key takeaways

• R and its benefits / limitations
• Microsoft R Server: scalable, enterprise-class
• R Server on HDInsight (Premium) clusters
• Demo – developing end-to-end data science processes using R Server on HDInsight Spark clusters
• Pointers to technical content: tutorials, templates, blogs

Agenda

R – its benefits and limitations

R - introduction

Community
• 2.5M+ users
• Taught in most universities
• Thriving user groups worldwide

Language
• The most popular statistical programming & ML language
• Data visualization & reporting tool

Platform
• Open source, transparent
• Free

Ecosystem
• 9,000+ contributed packages
• Applications & integration
• Many use cases / business problems addressed

Preferred language by Analytics Professionals

Source: SAS, R or Python Survey 2016, by Burtch Works

Which do you prefer to use: SAS, R, or Python?


Unified IEEE Spectrum Ranking 2016: http://spectrum.ieee.org/computing/software/the-2016-top-programming-languages

8

Common R use cases by vertical (spanning Sales & Marketing, Finance & Risk, Customer & Channel, and Operations & Workforce):

Retail: Demand Forecasting, Loyalty Programs, Cross-sell & Upsell, Customer Acquisition, Fraud Detection, Pricing Strategy, Personalization, Lifetime Customer Value, Product Segmentation, Store Location Demographics, Supply Chain Management, Inventory Management

Financial Services: Customer Churn, Loyalty Programs, Cross-sell & Upsell, Customer Acquisition, Fraud Detection, Risk & Compliance, Loan Defaults, Personalization, Lifetime Customer Value, Call Center Optimization, Pay for Performance

Healthcare: Marketing Mix Optimization, Patient Acquisition, Fraud Detection, Bill Collection, Population Health, Patient Demographics, Operational Efficiency, Pay for Performance

Manufacturing: Demand Forecasting, Marketing Mix Optimization, Pricing Strategy, Performance Risk Management, Supply Chain Optimization, Personalization, Remote Monitoring, Predictive Maintenance, Asset Management

9

Processing limitations of open source R

• In-Memory Operation

• Lack of Parallelism

• Expensive Data Movement & Duplication

Open source R is not enterprise class
• Inadequacy of community support
• Lack of guaranteed support timeliness
• No SLAs or support models

Microsoft R Server

R from Microsoft brings:

Peace of mind, speed and scalability, efficiency, flexibility

• Support and SLA
• Works on data in memory or on disk (scale)
• Wide range of scalable and distributed R functions
• Works in several compute contexts (incl. Hadoop, Spark, SQL Server) and data sources (incl. disk, HDFS, SQL)

Portability & investment assurance

R Server portfolio

Cloud: Windows, Linux
RDBMS: SQL Server 2016 EE, SQL Server 2016 SE
Desktops & Servers: Windows, Linux
Hadoop & Spark: Hortonworks, Cloudera, MapR
EDW: SQL Server 2016, Teradata Database

R Server technology stack: R+CRAN, Microsoft R Open, DistributedR, ScaleR, ConnectR, DeployR

R Server Technology

14

Write once deploy anywhere - WODA

• On a workstation:
  • All available cores used for math operations and parallel processes
  • Hard drive capacity sets the limit for data size, not RAM
  • Works directly on XDF (external data frames) on disk

• On a cluster:
  • Parallel utilization of nodes
  • Distributed file systems like HDFS greatly expand possible data sizes

ScaleR – parallel or distributed processing

15
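As a minimal sketch of the workstation workflow above (assuming RevoScaleR, the library behind the ScaleR functions, is installed; file and column names are illustrative, matching the airline data used later):

library(RevoScaleR)   # rx* functions ship with Microsoft R Server

## Import a CSV into the on-disk XDF format once; later steps stream it from disk
airXdf <- rxImport(inData = "AirOnTime.csv", outFile = "AirOnTime.xdf", overwrite = TRUE)

## Summaries and models read the XDF in blocks, so data size is bounded by disk, not RAM
rxSummary(~ ArrDelay + DayOfWeek, data = airXdf)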

Lee Edlefsen: PEMAs applied to GLM, http://www.slideshare.net/RevolutionAnalytics/parallel-external-memory-algorithms-applied-to-generalized-linear-models

ScaleR PEMA: Parallel external memory algorithms

Stream data into RAM in blocks. “Big Data” can be any data size. Can handle Megabytes to Gigabytes to Terabytes…

ScaleR algorithms work inside multiple cores / nodes in parallel at high speed

Interim results are collected and combined analytically to produce the output on the entire data set

XDF file format is optimised to work with the ScaleR library and significantly speeds up iterative algorithm processing.

16

Lee Edlefsen: PEMAs applied to GLM, http://www.slideshare.net/RevolutionAnalytics/parallel-external-memory-algorithms-applied-to-generalized-linear-models
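As a rough code sketch of the PEMA pattern (assuming RevoScaleR and an illustrative XDF file), the number of on-disk blocks streamed into RAM per read can be controlled with blocksPerRead; interim results from each chunk are combined into the final model:

airXdf  <- RxXdfData("AirOnTime.xdf")                     # on-disk XDF data source
delayLm <- rxLinMod(ArrDelay ~ DayOfWeek + CRSDepTime,
                    data = airXdf,
                    blocksPerRead = 1)                    # stream one block of rows at a time
summary(delayLm)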

• Linear regression (rxLinMod)
• Generalized linear models (rxLogit, rxGLM)
• Decision trees (rxDTree)
• Gradient boosted decision trees (rxBTrees)
• Random forests (rxDForest)
• K-means (rxKmeans)
• Naïve Bayes (rxNaiveBayes)

Available ScaleR distributed algorithms

17
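These functions share a common formula / data-source interface; a short sketch (the data source and variables are hypothetical):

airXdf <- RxXdfData("AirOnTime.xdf")

## Decision tree and random forest on the same inputs
dTreeModel   <- rxDTree(ArrDelay15 ~ DayOfWeek + CRSDepTime, data = airXdf)
dForestModel <- rxDForest(ArrDelay15 ~ DayOfWeek + CRSDepTime, data = airXdf, nTree = 50)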

ScaleR distributed functionality

ETL
• Data import – delimited, fixed, SAS, SPSS, ODBC
• Variable creation & transformation
• Recode variables
• Factor variables
• Missing value handling
• Sort, Merge, Split
• Aggregate by category (means, sums)

Statistical Tests
• Chi Square Test
• Kendall Rank Correlation
• Fisher's Exact Test
• Student's t-Test

Descriptive Statistics
• Min / Max, Mean, Median (approx.)
• Quantiles (approx.)
• Standard Deviation, Variance
• Correlation, Covariance
• Sum of Squares (cross product matrix for set variables)
• Pairwise Cross tabs
• Risk Ratio & Odds Ratio
• Cross-Tabulation of Data (standard tables & long form)
• Marginal Summaries of Cross Tabulations

Predictive Statistics
• Sum of Squares (cross product matrix for set variables)
• Multiple Linear Regression
• Generalized Linear Models (GLM): exponential family distributions (binomial, Gaussian, inverse Gaussian, Poisson, Tweedie); standard link functions (cauchit, identity, log, logit, probit); user-defined distributions & link functions
• Covariance & Correlation Matrices
• Predictions/scoring for models
• Residuals for all models

Machine Learning
• K-Means Clustering
• Linear regression
• Logistic regression
• Decision Trees
• Decision Forests
• Gradient Boosted Decision Trees
• Naïve Bayes

Variable Selection
• Stepwise Regression

Simulation
• Simulation (e.g. Monte Carlo)
• Parallel Random Number Generation

Custom Parallelization
• rxExec
• PEMA-R API

18
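For the custom parallelization entry above, rxExec runs an arbitrary R function in parallel across the cores or nodes of the current compute context; a minimal sketch with an illustrative task:

simulateOnce <- function(n) mean(rnorm(n))                   # any user-defined function
results <- rxExec(simulateOnce, n = rxElemArg(rep(1e6, 8)))  # one parallel run per element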

• Any analysis that is more complex than simple aggregations
• Analysis with data that fit in the physical memory of a single machine
• Creating sophisticated visualizations (e.g. ggplot, lattice)
• Creating reports (use knitr and Markdown)
• Analyses that use domain-specific tools or cutting-edge algorithms (e.g. forecasting, health informatics, etc.)

Typical uses of open source R

• Working with big data
• Building models that take too long to run in R
• Working with clusters and distributed file systems (e.g. HDInsight clusters + HDFS)
• Developing portable scripts for many compute contexts

Typical uses of R Server

Big Data: open source R is in-memory bound; R Server adds hybrid memory & disk scalability and operates on bigger volumes of data.

Speed of Analysis: open source R is single threaded; R Server uses parallel threading and shrinks analysis time.

Enterprise Readiness: open source R relies on community support; R Server adds commercial support and delivers full-service production support.

Analytic Breadth & Depth: open source R offers 9,000+ innovative analytic packages; R Server leverages open source packages plus Big Data-ready packages and supercharges R with ScaleR functions.

Commercial Viability: open source R carries deployment risk; R Server's commercial license eliminates that risk.

Benefits of R Server

R Server

R Server on HDInsight (Premium)

R Server on HDInsight (Premium): managed Hadoop for advanced analytics in the cloud

Stack: R / RevoScaleR / others (e.g. SparkR) on Hadoop / Spark, over Blob Storage (HDFS) or Data Lake Storage

• Easy setup, elastic, SLA
• R Server benefits
  • Leverage R skills
  • ScaleR functions
  • ….
• Familiar & enhanced IDEs
  • Popular IDEs (RStudio, RTVS, Notebooks, etc.)

23


Provisioning HDInsight (Premium) with R Server

24

Elastic - Scaling HDInsight clusters25

R Server on HDInsight – Architecture

26

[Diagram: data scientists connect to R Server on the edge node; the cluster comprises head nodes and data/worker nodes, each running R.]

R Server on HDInsight - Connectivity

[Diagram: thin-client IDEs (ssh, R Tools for Visual Studio, Jupyter Notebooks over https) connect to the edge node; remote execution via ssh reaches the R Server master task, which launches worker tasks on the data nodes via Spark or MapReduce.]

27

R Server on HDInsight - Data processing

• Server local processing: a single R process on the edge node operates on data in distributed storage.
• Server distributed processing: a master R process on the edge node coordinates worker R processes on the data nodes through Apache YARN / Spark.

28

Write once, deploy anywhere (WODA) – switching compute contexts. Code can be deployed from a server or edge node to run in Spark/Hadoop without any functional R model re-coding.

## Statistical summary
rxSummary(~ ArrDelay + DayOfWeek, data = AirlineData, reportProgress = 1)

## Linear model and plot
hdfsXdfArrLateLinMod <- rxLinMod(ArrDelay ~ DayOfWeek + CRSDepTime, data = AirlineData)

## SET UP LOCAL ENVIRONMENT VARIABLES ##
myLocalCC <- "localpar"

## LOCAL COMPUTE CONTEXT ##
rxSetComputeContext(myLocalCC)

Local parallel processing – Linux or Windows

Compute-context R script – sets where the model will run
R script – does not need to change to run in Hadoop/Spark

29

## In Spark
mySparkCC <- RxSpark()
rxSetComputeContext(mySparkCC)

## In Hadoop (MapReduce)
myHadoopCC <- RxHadoopMR()
rxSetComputeContext(myHadoopCC)

R Script for Execution in MapReduce

Sample R Script:

rxSetComputeContext( RxHadoopMR(…) )
inData <- RxTextData("/ds/AirOnTime.csv", fileSystem = hdfsFS)
model <- rxLogit(ARR_DEL15 ~ DAY_OF_WEEK + UNIQUE_CARRIER, data = inData)

Define Compute Context

Define Data Source

Train Predictive Model

30

Easy to Switch From MapReduce to Spark

Keep other code unchanged

Sample R Script:

rxSetComputeContext( RxSpark(…) )
inData <- RxTextData("/ds/AirOnTime.csv", fileSystem = hdfsFS)
model <- rxLogit(ARR_DEL15 ~ DAY_OF_WEEK + UNIQUE_CARRIER, data = inData)

Change the Compute Context

31

Creating a data science process using R Server on Spark HDInsight

Apache Spark engine and its APIs

Denny Lee, Databricks

Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX

o Scale out, fault tolerant, distributed, in-memory processing

o Multi-language API (incl. R)

o Standard libraries: ML, statistics

33

Spark’s use cases - Diverse industries & scenarios

Source: Databricks Spark 2015 survey report, https://databricks.com/blog/2015/09/24/spark-survey-2015-results-are-now-available.html

34

Spark advanced analytics

Source: Databricks Spark 2015 survey report, https://databricks.com/blog/2015/09/24/spark-survey-2015-results-are-now-available.html

Advanced analytics is an important Spark feature

R is rapidly gaining popularity

(Available since June 2015)

35

Open-source packages for ML in Spark using R
o SparkR:
  o R package – a light-weight front-end for Apache Spark from R
  o Limited in terms of ML algorithm bindings at this time
  o Works on MLlib functions (RDDs)
o sparklyr:
  o Developed by RStudio
  o Provides R bindings to the spark.ml library

36
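For instance, a minimal SparkR sketch (assuming the Spark 2.x-style SparkR session API; the file path and columns are illustrative) that reads a CSV into a Spark DataFrame and fits a binomial GLM through the MLlib bindings:

library(SparkR)
sparkR.session()                                          # connect to the Spark cluster

## Read a CSV from distributed storage into a Spark DataFrame
air <- read.df("/ds/AirOnTime.csv", source = "csv", header = "true", inferSchema = "true")

## Logistic regression via Spark MLlib
fit <- glm(ARR_DEL15 ~ DAY_OF_WEEK + UNIQUE_CARRIER, data = air, family = "binomial")
summary(fit)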

Data science / advanced analytics process

http://aka.ms/tdsp

37

• Git-based repositories with templates providing a central archive
• Standardized project structure
• Document templates
• Utility scripts
• Independent of the execution environment, to allow scientists to use multiple cloud resources as needs dictate

Building intelligent applications using the team data science process
https://blogs.technet.microsoft.com/machinelearning/2016/09/08/building-intelligent-applications-using-the-team-data-science-process/
http://aka.ms/tdsp

Data Science Doesn’t Just Happen, It Takes a Process. Learn about Ours…

Tue, Sept 27, 3 – 4 PMHang Zhang, Jacob Spoelstra, Gopi Kumar

38

Prepare: Assemble, cleanse, profile and transform diverse data relevant to the subject

Model: Use statistical and machine learning algorithms to build classifiers and regression models

Operationalize: Make predictions and visualizations to support business applications

DS process shown in demo

Prepare → Model → Operationalize

39

E2E Demo/Example: flight arrival delay prediction
1. Provisioning clusters using PowerShell scripts
2. Prep (clean/join) – using SparkR from R Server
3. Model (train/score/evaluate) – ScaleR
4. Deployment – to Azure ML from R Server

40

End-to-end data science process example

Azure Blob Storage

HDInsight

Microsoft R Server, Azure Machine Learning

Web Application

Flow: Data Sources → Data Partition → Feature Engineering → Model Training → Predictions → Web Services → Consumption (Power BI)

KDD 2016 tutorial (Using R on Spark): tinyurl.com/KDD2016R; Azure Machine Learning: https://azure.microsoft.com/en-us/services/machine-learning/

41

• Azure Blob storage (HDFS)
• R Server on Spark HDInsight (Premium)
• Azure ML R package and Azure ML web service
• Power BI (optional)

Technologies / services used

Provisioning & deleting R Server Spark HDInsight clusters using Azure cmdlets & ARM templates

## CREATE CLUSTERS USING ARM TEMPLATES
$templatePath = "https://github.com/Azure/Azure-MachineLearning-DataScience/blob/master/Misc/KDDCup2016/Scripts/Configuration/azuredeploy.json";

$hdiparams = @{clusterType="spark"; clusterName=$clustername; clusterLoginUserName="admin"; clusterLoginPassword=$clusterpasswd; sshUserName="remoteuser"; sshPassword=$clusterpasswd; clusterWorkerNodeCount=2};

New-AzureRmResourceGroupDeployment -Name $clustername -ResourceGroupName $resourcegroup -TemplateParameterObject $hdiparams -TemplateUri $templatePath;

## DELETE CLUSTERS
Remove-AzureRmHDInsightCluster -ClusterName $clustername

43

Script-based deployment of HDInsight clusters with R Server

44

• Predict whether a flight arrival is going to be delayed by 15 minutes or not (binary classification), based on features:
  • Airline, flight, airport
    • Airline carrier
    • Type of airplane / vehicle
    • Departure and arrival airports
    • Flight distance
    • Month, week, day
  • Weather
    • Wind speed
    • Visibility
    • Humidity

Prediction task: Predict flight delays

45

• Passenger flight on-time performance data from the US Department of Transportation’s TranStats data collection

• >20 years of data
• 300+ airports
• Every carrier, every commercial flight
• http://www.transtats.bts.gov

Dataset: Airline & Weather

• Hourly land-based weather observations from NOAA (National Oceanic and Atmospheric Administration)
• >2,000 weather stations
• http://www.ncdc.noaa.gov/orders/qclcd/

Airline Weather

Connection: thin client → RStudio Server; glimpse of down-sampled data (19 million rows)

Data prep: clean and join using SparkR in R Server

48

• SparkR: R package – a light-weight front-end for Apache Spark from R
• Provides distributed operations like selection, filtering, and aggregation using Spark SQL
• Distributed machine learning using Apache Spark's MLlib (limited)
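A minimal sketch of the clean/join step (the DataFrame and column names below are illustrative, not the exact ones used in the demo):

library(SparkR)
## Assume airDF and weatherDF are SparkR DataFrames already read from distributed storage

## Basic cleaning: drop cancelled flights and keep the columns of interest
airDF <- filter(airDF, airDF$CANCELLED == 0)
airDF <- select(airDF, "ARR_DEL15", "DAY_OF_WEEK", "UNIQUE_CARRIER",
                "ORIGIN_AIRPORT_ID", "CRS_DEP_TIME")

## Join flights to weather observations at the departure airport and hour
joined <- join(airDF, weatherDF,
               airDF$ORIGIN_AIRPORT_ID == weatherDF$AIRPORT_ID &
                 airDF$CRS_DEP_TIME == weatherDF$HOUR,
               "inner")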

Modeling: train, score, and evaluate using ScaleR functions

49
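A condensed sketch of the train/score/evaluate loop (assuming the joined data has been written to XDF in HDFS; file, variable, and feature names are illustrative):

## Point ScaleR at the Spark cluster
rxSetComputeContext(RxSpark())

hdfsFS   <- RxHdfsFileSystem()
trainXdf <- RxXdfData("/ds/airWeatherTrain", fileSystem = hdfsFS)
testXdf  <- RxXdfData("/ds/airWeatherTest",  fileSystem = hdfsFS)

## Train a distributed logistic regression
logitModel <- rxLogit(ARR_DEL15 ~ DAY_OF_WEEK + UNIQUE_CARRIER + WIND_SPEED, data = trainXdf)

## Score held-out data and evaluate with an ROC curve
scored <- rxPredict(logitModel, data = testXdf,
                    outData = RxXdfData("/ds/airWeatherScored", fileSystem = hdfsFS),
                    writeModelVars = TRUE)
rxRocCurve(actualVarName = "ARR_DEL15", predVarNames = "ARR_DEL15_Pred", data = scored)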

Modeling scalability with ScaleR on Spark HDInsight: scales linearly to hundreds of nodes, billions of rows, and terabytes of data

50

[Chart: Logistic regression on the NYC Taxi dataset – elapsed time vs. number of rows, up to ~13 billion rows (2.2 TB). HDInsight (Premium) Spark cluster with 100 D12 (4-core, 28 GB) worker nodes. Credit: Mario Inchiosa]

Comparison of ScaleR with open source algorithms (Preliminary)

51

[Chart: Logistic regression, end-to-end elapsed time reading from CSV files, vs. number of rows (millions), comparing ScaleR with open source algorithms. Configuration: 7-node HDI cluster; 1 edge node (8 cores, 28 GB); 4 worker nodes (8 cores, 28 GB); dataset: duplicated Airlines data (.csv), 26 columns. Credit: Katherine Zhao]

Azure ML - Deploying web services for predictive analytics

52

Easily build ML models; easily deploy models as web services

Deployment: publish a web service from R Server to Azure ML

53

azureml-settings.json{"workspace": {"id": “<>", "authorization_token": “<>", "api_endpoint": "https://studioapi.azureml.net",

"management_endpoint":

https://management.azureml.net }}

A prediction web service in Azure ML
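A minimal sketch of the publish step using the AzureML CRAN package (the workspace credentials come from azureml-settings.json above; the scoring function, model, and schema are illustrative):

library(AzureML)

## Workspace handle built from the id / token in azureml-settings.json
ws <- workspace(id = "<>", auth = "<>", api_endpoint = "https://studioapi.azureml.net")

## Scoring function wrapping a locally converted model (see the rpart conversion in the backup slides)
scoringFn <- function(newdata) {
  library(rpart)
  predict(rpartModel, newdata = newdata)
}

## Publish the function as an Azure ML web service
svc <- publishWebService(ws, fun = scoringFn, name = "FlightDelayPrediction",
                         inputSchema = list(DAY_OF_WEEK = "integer",
                                            UNIQUE_CARRIER = "character",
                                            CRS_DEP_TIME = "numeric"))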

Adopting process and code - Resources

Tutorials - Scalable data analytics using R Server

• KDD Conference tutorial 2016
  • http://www.tinyurl.com/KDD2016R

• Public GitHub repository

56

Summary & acknowledgements

• R Server on Azure HDInsight (Premium) – a managed, distributed compute platform for data science

• Scalable end-to-end processes can be built on HDI clusters integrated with other Azure services

• Published resources (with code) are available for developing analytical workflows

Summary

• Mario Inchiosa [Principal Software Engineer]
• Katherine Zhao [Data Scientist II]
• Jeremy Reynolds [Senior Data Scientist Lead]
• Max Kaznadi [Data Scientist II]
• Hang Zhang [Senior Data Scientist Manager]

Acknowledgements

Thank you!
Debraj Guhathakurta
debraj.guhathakurta@microsoft.com

© Copyright Microsoft Corporation. All rights reserved.

Backups

Microsoft R Server components: R+CRAN, Microsoft R Open, DistributedR, ScaleR, ConnectR, DeployR, RTVS

R Server architecture

ConnectR
• High-speed & direct connectors
• Available for: high-performance XDF; SAS, SPSS, delimited & fixed-format text data files; Hadoop HDFS (text & XDF); Teradata Database; EDWs and ADWs; ODBC

ScaleR
• Ready-to-use high-performance big data analytics
• Fully-parallelized analytics
• Data prep & data distillation
• Descriptive statistics & statistical tests
• Range of predictive functions
• User tools for distributing customized R algorithms across nodes

DistributedR
• Distributed computing framework
• Delivers cross-platform portability

R+CRAN
• Open source R interpreter
• Freely-available huge range of R algorithms
• Algorithms callable by Microsoft R
• Embeddable in R scripts
• 100% compatible with existing R scripts, functions and packages

Microsoft R Open
• Based on open source R
• High-performance math library to speed up linear algebra functions
• Checkpoint package to easily share R code and replicate results using specific R package versions

DeployR
• RESTful APIs for easy integration from Java, JavaScript, .NET
• Enterprise authentication & security

R Tools for Visual Studio
• State-of-the-art R IDE for Visual Studio

Modeling: train, score, and evaluate using R Server

66

Deployment: publish a web service from R

## Convert the rxDTree model to an rpart object so it can be scored without R Server
rpartModel <- as.rpart(dTreeModel)

## Scoring function to publish as a web service
scoringFn <- function(newdata) {
  library(rpart)
  predict(rpartModel, newdata = newdata)
}

67

azureml-settings.json{"workspace":

{"id": “<>", "authorization_token": “<>", "api_endpoint":

"https://studioapi.azureml.net", "management_endpoint":

https://management.azureml.net}

}