Performance and Scale Options for R with Hadoop: A comparison of potential architectures

R and Hadoop: Architectural Options Bill Jacobs VP Product Marketing & Field CTO, Revolution Analytics @bill_jacobs


Page 1: Performance and Scale Options for R with Hadoop: A comparison of potential architectures

R and Hadoop: Architectural Options

Bill Jacobs

VP Product Marketing & Field CTO, Revolution Analytics

@bill_jacobs

Page 2: Performance and Scale Options for R with Hadoop: A comparison of potential architectures

Polling Question #1: Who Are You? (choose one)

– Statistician or modeler who uses R

– Other R developer

– Hadoop Expert

– Application builder

– Data guru

– Business user

– Systems vendor or reseller

– Something else…

Page 3: Performance and Scale Options for R with Hadoop: A comparison of potential architectures

Agenda

• Challenges

• Options

• Considerations

• How to Choose

Page 4: Performance and Scale Options for R with Hadoop: A comparison of potential architectures

Boundless Opportunities

• Marketing: Clickstream & Campaign Analyses

• Digital Media: Recommendation Engines

• Retail: Social Sentiment Analysis

• Insurance: Fraud, Waste and Abuse

• Healthcare Delivery: Outcome Prediction

• Manufacturing: Quality Optimization

• P&C Insurance: Risk Analysis

• Consumer Products: Warranty Optimization

• Operations: Supply Chain Optimization

• Econometrics: Market Prediction

• Marketing: Mix and Price Optimization

• Life Sciences: Pharmacogenetics

• Transportation: Asset Utilization

Page 5: Performance and Scale Options for R with Hadoop: A comparison of potential architectures

Polling Question #2: What Industry Do You Represent?

– Financial Services

– Insurance

– Healthcare, Life Sciences or Pharma

– Manufacturing

– Energy

– Retail

– Logistics and Transportation

– Education

– Government

– Marketing & Advertising

– Technology

– Other

Page 6: Performance and Scale Options for R with Hadoop: A comparison of potential architectures

In A Perfect World…

Analytical Capability

Compute

Data Scale

Users

Price

Ease

Security

Page 7: Performance and Scale Options for R with Hadoop: A comparison of potential architectures

Hadoop Analytics - Many Alternatives

R-Based Alternatives

Legacy tools updated – SAS HPA, etc.

Big Data Databases

Other Languages – Scala, Java, Julia, various GUIs

Today’s Topic:

R-Based Alternatives

– “Beside Architectures”

– “Inside Architectures”

– Open Source and Commercial

Page 8: Performance and Scale Options for R with Hadoop: A comparison of potential architectures

Reality: Tradeoffs.

Memory Limits

In-Memory vs. Shared Infrastructure

CRAN vs. Parallelization

Desktop vs. Remote

Explicit vs. Automatic Distribution

Locality vs. Movement

Real-Time vs. MapReduce

Traditional Statistics vs. Machine Learning

Page 9: Performance and Scale Options for R with Hadoop: A comparison of potential architectures

No Magic Bullet.

Page 10: Performance and Scale Options for R with Hadoop: A comparison of potential architectures

Corporate Overview & Quick Facts

Founded: 2008 (as REvolution Computing)

Office Locations: Palo Alto (HQ), Seattle (Engineering), Singapore, London

CEO: David Rich

Number of customers: 200+

Investors:

• Northbridge Venture Partners

• Intel Capital

• Platform Vendor

Web site: www.revolutionanalytics.com

Revolution R Enterprise is the leading commercial analytics platform based on the open source R statistical computing language

Page 11: Performance and Scale Options for R with Hadoop: A comparison of potential architectures

Revolution Analytics

Our Vision:

R becomes the de facto standard for enterprise predictive analytics

Our Mission:

Drive enterprise adoption of R by providing enhanced R products tailored to meet enterprise challenges

Page 12: Performance and Scale Options for R with Hadoop: A comparison of potential architectures

Revolution Analytics Builds & Delivers:

Software Products:

Stable Distributions

Broad Platform Support

Big Data Analytics in R

Application Integration

Deployment Platforms

Agile Development Tooling

Future Platform Support

Support & Services

Commercial Support Programs

Training Programs

Professional Services

Community Programs

Academic Support Programs

Contributions to Open Source R

Open Source Extensions

Sponsorship of R User Groups

Page 13: Performance and Scale Options for R with Hadoop: A comparison of potential architectures

Revolution Analytics Technical Innovations

R Options from Open Source to Enterprise

Parallelized Analytical Computation

In-Database & In-Hadoop Analytics

Big Data Scalability

Remote Execution

Production Deployment Support

Multi-Platform Deployment

Legacy Data Format Support

Multiple IDE Options

PMML Model Export

Page 14: Performance and Scale Options for R with Hadoop: A comparison of potential architectures

The Revolution R Product Suite

Revolution R Open

• Free and open source R distribution

• Enhanced and distributed by Revolution Analytics

Revolution R Plus

• Open-source distribution of R, packages, and other components

• Enhanced, supported and indemnified by Revolution Analytics

Revolution R Enterprise

• Secure, Scalable and Supported Distribution of R

• With proprietary components created by Revolution Analytics

Page 15: Performance and Scale Options for R with Hadoop: A comparison of potential architectures

Polling Question #3: State of Play: In your company you are…

– Building Our “Data Lake”

– Running R + Hadoop Data Today

– Running R inside Hadoop using Open source

– Running RRE inside Hadoop

– Deploying Business Apps. Using Analytics from Hadoop Data

– Looking at Next Steps e.g. Spark, etc.

Page 16: Performance and Scale Options for R with Hadoop: A comparison of potential architectures

Revolution Analytics: Eight Alternatives for Integrating R & Hadoop

Open Source

1. Open Source R

2. Revolution R Open

3. Open Source Parallelization on Workstations & Servers

4. rHadoop: Open Source Parallelization with rHadoop

Commercial

5. Revolution R Enterprise on Servers & Workstations

6. Revolution R Enterprise on Edge Nodes

7. Revolution R Enterprise Inside Hadoop

8. Combined Edge Node & Inside Hadoop

Page 17: Performance and Scale Options for R with Hadoop: A comparison of potential architectures

1. Open Source R Integrated With Hadoop

Traditional Open Source R “Beside” Architecture:

• Traditional Open Source

• Memory-Limited

• Data Moves

CRAN Algorithms

rHDFS, rHbase, rHive, rODBC

Page 18: Performance and Scale Options for R with Hadoop: A comparison of potential architectures

2. Revolution R Open On Workstations & Servers

Replace Open Source R “Beside” Architecture with Revolution R Open

As with Open Source R:

• Still Free.

• Still Memory Based.

• Data Still Moves.

Improvements:

• Accelerates Math with Intel MKL

• Improves R-based packages

Limitations:

• No Effect for non-R Code

CRAN Algorithms

rHDFS, rHbase, rHive, rODBC

Page 20: Performance and Scale Options for R with Hadoop: A comparison of potential architectures

3. Write Parallel Algorithms: PC, Server or Clusters

Write R Code to Explicitly Parallelize – Deploy Across Several Systems

Can Include CRAN Algorithms “Carefully”

foreach & iterators

• doParallel (PC, server)

• doMPI (cluster)

• RRE rxExec

Example Uses:

• Bootstrapping

• Simulation

• HPC

rHDFS, rHbase, rHive, rODBC

As with Previous:

• Still Free.

• Still Memory Based.

• Data Still Moves.

• Intel MKL with RRO

Improvements:

• Parallelized Execution

Limitations:

• Parallelization Difficulty

• Data Movement

• Platform Specific
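
The explicit approach on this slide, where the developer splits the work into independent tasks and hands them to a parallel backend, might be sketched as follows. This is an illustration in Python rather than R, and it is not the foreach or rxExec API. The function names `bootstrap_mean` and `parallel_bootstrap`, the sample data, and the worker count are all hypothetical. Bootstrapping is used because the slide lists it as an example use.

```python
import random
import statistics
from concurrent.futures import ThreadPoolExecutor


def bootstrap_mean(task):
    """Resample the data once, with replacement, and return the resample's mean."""
    data, seed = task
    rng = random.Random(seed)  # one independent, seeded generator per task
    resample = [rng.choice(data) for _ in data]
    return statistics.fmean(resample)


def parallel_bootstrap(data, n_replicates=200, workers=4):
    """Explicitly distribute independent bootstrap replicates across workers."""
    tasks = [(data, seed) for seed in range(n_replicates)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in task order, so fixed seeds give repeatable output
        return list(pool.map(bootstrap_mean, tasks))


# Hypothetical sample data, for illustration only
data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 10.0, 12.0]
replicates = parallel_bootstrap(data)
estimate = statistics.fmean(replicates)    # bootstrap estimate of the mean
std_error = statistics.stdev(replicates)   # bootstrap standard error
```

Threads keep the sketch portable. For CPU-bound work in CPython, a process pool (`ProcessPoolExecutor`) would likely be the better fit. The point is only that the developer, not the library, decides how the work is divided, which is the "Parallelization Difficulty" the slide lists as a limitation.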

Page 21: Performance and Scale Options for R with Hadoop: A comparison of potential architectures

4. rHadoop: Custom Parallel Execution for Hadoop

Remote Desktop

R Code

Execute R Code & CRAN Algorithms Inside Hadoop

Example Uses:

• Scoring

• Transformation

• Easily Parallelized Algorithms

Hadoop Streaming

Can Include CRAN Algorithms “Carefully”

As With Previous:

• Still Free.

• Optional Intel MKL in RRO

Improvements:

• Runs R in MapReduce

• No Data Movement

Limitations:

• Manual Parallelization

• Hadoop Specific

rHbase, rHDFS, rMapReduce
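
"Manual parallelization" here means the analyst writes the map and reduce steps by hand. A minimal local sketch of that pattern follows, in Python rather than R. It does not use the rHadoop packages or Hadoop Streaming itself. The names `mapper`, `reducer`, and `run_mapreduce` and the sample records are hypothetical, and the shuffle phase is simulated with an in-memory dictionary.

```python
from collections import defaultdict


def mapper(record):
    """Map step: emit (key, (partial sum, partial count)) for one input record."""
    region, amount = record
    yield region, (amount, 1)


def reducer(key, values):
    """Reduce step: combine every partial (sum, count) pair seen for one key."""
    total = sum(v[0] for v in values)
    count = sum(v[1] for v in values)
    return key, total / count


def run_mapreduce(records):
    """Local stand-in for the framework: group mapper output by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


# Hypothetical (region, amount) records, for illustration only
records = [("east", 10.0), ("west", 4.0), ("east", 20.0), ("west", 6.0), ("north", 7.0)]
means = run_mapreduce(records)
```

A per-key mean decomposes cleanly into sums and counts, which is why it counts as "easily parallelized". An algorithm that needs every record at once does not decompose this way, and that may be part of what the slide means by including CRAN algorithms "carefully".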

Page 22: Performance and Scale Options for R with Hadoop: A comparison of potential architectures

5. Revolution R Enterprise (RRE) PEMAs inside Hadoop

Traditional “Beside” Architecture with Optimized Algorithms

Available for Windows, Linux

As With Previous:

• Includes Intel MKL in RRO

Advantages:

• Speed: PEMAs Parallelize Across Threads, Cores & Sockets

• Scale: PEMAs “Chunk” - no Memory Limits

• All of CRAN Available

• Portability

• Fully Supported

Limitations:

• Data Movement

• Single Machine

Revolution R Enterprise:

• ScaleR PEMA Algorithms

plus

• All of CRAN (subject to memory limits)

rHDFS, rHbase, rHive, rODBC

Page 23: Performance and Scale Options for R with Hadoop: A comparison of potential architectures

Revolution R Enterprise is… the only big data big analytics platform based on open source R

High Performance, Scalable Analytics

Portable Across Enterprise Platforms

Easier to Build & Deploy Analytics

Page 24: Performance and Scale Options for R with Hadoop: A comparison of potential architectures

ScaleR Refactor Algorithms for Dramatic Performance and Capacity Improvement

Page 25: Performance and Scale Options for R with Hadoop: A comparison of potential architectures

ScaleR High Performance Algorithms for the Most Common Uses

Data Step

• Data import – Delimited, Fixed, SAS, SPSS, ODBC

• Variable creation & transformation

• Recode variables

• Factor variables

• Missing value handling

• Sort, Merge, Split

• Aggregate by category (means, sums)

Descriptive Statistics

• Min / Max, Mean, Median (approx.)

• Quantiles (approx.)

• Standard Deviation

• Variance

• Correlation

• Covariance

• Sum of Squares (cross product matrix for set variables)

• Pairwise Cross tabs

• Risk Ratio & Odds Ratio

• Cross-Tabulation of Data (standard tables & long form)

• Marginal Summaries of Cross Tabulations

Statistical Tests

• Chi Square Test

• Kendall Rank Correlation

• Fisher’s Exact Test

• Student’s t-Test

Sampling

• Subsample (observations & variables)

• Random Sampling

Predictive Models

• Sum of Squares (cross product matrix for set variables)

• Multiple Linear Regression

• Generalized Linear Models (GLM) exponential family distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. Standard link functions: cauchit, identity, log, logit, probit. User defined distributions & link functions.

• Covariance & Correlation Matrices

• Logistic Regression

• Classification & Regression Trees

• Predictions/scoring for models

• Residuals for all models

Cluster Analysis

• K-Means

Classification

• Decision Trees

• Decision Forests

• Gradient Boosted Decision Trees

Variable Selection

• Stepwise Regression

Simulation

• Simulation (e.g. Monte Carlo)

• Parallel Random Number Generation

Combination

• PEMA-R API

• rxDataStep

• rxExec

New in 7.3

Page 26: Performance and Scale Options for R with Hadoop: A comparison of potential architectures

ScaleR PEMA

What’s a PEMA? Parallel External Memory Algorithms

• Not Limited to Available Memory

• Unlimited Data Scale

• Ingests Data One Chunk At A Time.

• Adjustable Memory Footprint

• Multi-Thread Execution Performance

• Highly-Optimized Algorithms

• Algorithm Math Fully Refactored for Parallelism

• Delivered as ScaleR Library in Revolution R Enterprise

Processing Flow:

• Script Calls ScaleR Algorithm

• Master Algorithm Process: Start & Manage Processing

• Load Block At A Time (from Data)

• Analyze Each Block

• Combine Individual Results

Scripts can call CRAN Open Source Algorithms
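
The load, analyze, and combine flow on this slide can be illustrated with a small chunk-and-combine sketch. It is written in Python rather than R and is not the ScaleR implementation. The helper names `read_chunks`, `analyze_chunk`, `combine`, and `chunked_mean_variance` are hypothetical, and the pairwise merge formula for mean and variance is a standard technique chosen for the example, not one taken from the slides. The data sits in a list here; in a real external-memory setting each block would be read from disk.

```python
import math
import statistics


def read_chunks(values, chunk_size):
    """Yield the data one block at a time (a stand-in for reading from disk)."""
    for start in range(0, len(values), chunk_size):
        yield values[start:start + chunk_size]


def analyze_chunk(chunk):
    """Per-block step: return (count, mean, sum of squared deviations)."""
    n = len(chunk)
    mean = sum(chunk) / n
    m2 = sum((x - mean) ** 2 for x in chunk)
    return n, mean, m2


def combine(a, b):
    """Merge two partial results without revisiting the underlying data."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta ** 2 * n_a * n_b / n
    return n, mean, m2


def chunked_mean_variance(values, chunk_size=1000):
    """Mean and sample variance holding one block's summary at a time.

    Assumes at least two observations in total.
    """
    total = None
    for chunk in read_chunks(values, chunk_size):
        part = analyze_chunk(chunk)
        total = part if total is None else combine(total, part)
    n, mean, m2 = total
    return mean, m2 / (n - 1)


# Hypothetical data, for illustration only
values = [float(i % 17) for i in range(10_000)]
mean, variance = chunked_mean_variance(values, chunk_size=1000)

# Cross-check against the in-memory results from the standard library
matches = (math.isclose(mean, statistics.fmean(values))
           and math.isclose(variance, statistics.variance(values)))
```

Each call to `analyze_chunk` depends only on its own block, so the per-block step could be handed to separate threads or nodes. The sketch runs the blocks sequentially to keep the combine logic easy to follow.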

Page 27: Performance and Scale Options for R with Hadoop: A comparison of potential architectures

6. Run Revolution R Enterprise on Hadoop Edge Node(s)

Fast Single-Server Alternative for Modest Data Scale

Edge Node

Thin Client or Remote Desktop

Local File System (opt.)

ScaleR + CRAN Algorithms

rHDFS, rHbase, rHive, rODBC

As With Previous:

• Single Machine Execution

• PEMA Scale & Speed (Single Machine)

• Use ScaleR + CRAN

• Accelerate R with Intel MKL

Improvements:

• Easily Shared via

• No Data Movement

• Develop on Desktop, Run on Edge Node

Limitations:

• “Shorter Trip” for Data

Page 28: Performance and Scale Options for R with Hadoop: A comparison of potential architectures

7. Fast, Transparent Parallel Computation Inside Hadoop YARN/MapReduce

Fast Parallelized Analytics on Large Data Sets In Hadoop

jobtracker

ScaleR Algorithms

DeployR

Web Services

Remote Execution

Desktop & Server Tools and Applications

As With Previous:

• Speed and Scale of ScaleR PEMA Algorithms

• Use CRAN Where Appropriate

• Accelerate R Math with MKL

• Custom Parallelized Algo’s

Advantages:

• Parallel Computation

• No Data Movement

• ScaleR PEMA Parallelization

• Can Parallelize CRAN “Carefully”

• Portable Coding

Limitations:

• Hadoop Workload Profiles

Page 29: Performance and Scale Options for R with Hadoop: A comparison of potential architectures

One Client’s Experience with RRE on Hadoop

Test Cluster - 9 Nodes

Cluster Nodes: 9 Task Nodes, 2 Admin Nodes, 1 Edge Node

Node Specifications: 64GB, 24 cores each; 128GB, 24 cores each; 128GB, 24 cores each

Task: Processing Time

Importing and Filtering Datasets from HDFS

• 14 Million Observations: 82 sec.

• 227 Million Observations: 310 sec.

Modeling and Estimation

• 1.2 M Correlations: 2771 sec.

• Simple Linear Regression, 227 M Observations: 61 sec.

• Multiple Linear Regression, Three Variables, 227 M Observations: 58 sec.

• Multiple Linear Regression, Four Variables, 227 M Observations: 58 sec.

• Random Forest, 10 Predictor Variables, 227 M Observations, 10 Trees with Max Depth of 10 Splits: 2 hr. 3 min.
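
As a rough reading aid, the timings reported on this slide can be turned into observations per second. The short Python sketch below does only that arithmetic. It reads "M" and "Million" as 10**6, which is an assumption, and the dictionary keys are hypothetical labels. The figures come from one client's test cluster and should not be read as a general benchmark.

```python
# (observations, seconds) copied from the slide; "M"/"Million" is read as 10**6
timings = {
    "import_14M": (14_000_000, 82),
    "import_227M": (227_000_000, 310),
    "simple_regression_227M": (227_000_000, 61),
}

# Observations processed per second for each task
throughput = {name: obs / secs for name, (obs, secs) in timings.items()}

# How the import step scaled between the two data sizes
data_growth = 227_000_000 / 14_000_000   # roughly 16x more observations
time_growth = 310 / 82                   # roughly 3.8x more time
```

On these numbers the larger import handled about sixteen times the data in under four times the time, which fits the slide deck's remark that many small tasks do not run well in MapReduce while large ones amortize the overhead better.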

Page 30: Performance and Scale Options for R with Hadoop: A comparison of potential architectures

8. Combined Edge Node & In-Hadoop

Maximized Flexibility, Performance & Workload Handling

ScaleR Algorithms

DeployR

Web Services

Thin Client Development

Remote Execution

Desktop & Server Tools and Applications

RStudio

As With Previous:

• Speed and Scale of ScaleR PEMA Algorithms

• Use CRAN Where Appropriate

• Accelerate R Math with MKL

• Custom Parallelized Algo’s

Advantages:

• Flexibility for Blended Workloads

• Little or No Data Movement

• Maximize CRAN Capabilities by Sharing Large RAM Edge Nodes

Page 31: Performance and Scale Options for R with Hadoop: A comparison of potential architectures

Occasionally Conflicting Criteria

Infrastructure Criteria:

Big Data Platform

Vendor Choice

Data Ingest

Data Security

Data Governance

Data Science Criteria:

Performance

Self Service

Flexibility

Collaboration

Sharing

Capability

Page 32: Performance and Scale Options for R with Hadoop: A comparison of potential architectures

Key Questions:

Where is the bulk of your skills? SAS? R? Java? Python? SQL?

Where do you build models today?

Do you have the skills to parallelize algorithms?

Can models be built on a big shared server?

How will you run models?

Do you have the budget to purchase commercial solutions?

How will your needs change over time?

What is your future architecture plan?

How risk averse is your management team regarding new platforms and open source?

Page 33: Performance and Scale Options for R with Hadoop: A comparison of potential architectures

Key Questions (cont.)

What Workloads Do You Anticipate?

– How Many Users?

– What Workloads?

Workload Realities:

– Many small tasks do not run well in MapReduce

– Large data movements / duplications are costly

What Use Cases Will You Encounter?

– Traditional statistical exploration, modeling?

– Behavior Prediction?

– Outlier Detection?

– Simulation and HPC?

– Massively wide data?

– Real-Time scoring?

– Internet of Things?

Page 34: Performance and Scale Options for R with Hadoop: A comparison of potential architectures

Eight Steps to Fast, Scalable R Analytics with Hadoop

Open Source Options

1. Open Source R

2. Revolution R Open

3. Open Source Parallelization…

4. rHadoop…

Commercial Options

5. RRE on Servers & Workstations

6. RRE on Edge Nodes

7. RRE Inside Hadoop

8. RRE on Edge Node & Inside Hadoop

No Clear Winner:

Budget & use case determine optimal path

Compelling options in both open source & commercial source

RRE ScaleR uniquely provides automatic parallelization

Current Hadoop platforms are fast for large scale analytics.

Combined in-server & in-Hadoop fits majority of cases

Page 35: Performance and Scale Options for R with Hadoop: A comparison of potential architectures

2015 Challenges & Opportunities

• Evolving Hadoop Architectures

• In-Memory Analytics – Spark, YARN Containers, Caching

• Additional Algorithm Parallelization

• Cluster Management

• Cloud and Hybrid Cloud Clusters

• SQL on Hadoop “Battle-Royale”

• Addressing the Resource Reality

• Integration, Deployment Both Drain on Expensive Resources

• Leverage other skills

• Design efficient collaboration

• “Analytics for the Rest of Us”

• New Consumption Targets – Mobile

• New Participants in Design – Business Users

Page 36: Performance and Scale Options for R with Hadoop: A comparison of potential architectures
Page 37: Performance and Scale Options for R with Hadoop: A comparison of potential architectures

Recommended Resources

Revolution Analytics Products

– http://www.revolutionanalytics.com/products

– http://www.revolutionanalytics.com/big-analytics-hadoop-and-edws

Whitepaper: “Delivering Value from Big Data with Revolution R Enterprise and Hadoop”

– http://www.revolutionanalytics.com/whitepaper/delivering-value-big-data-revolution-r-enterprise-and-hadoop

Revolution Analytics on Social Media:

– http://blog.revolutionanalytics.com/

– @revolutionr on Twitter

– @bill_jacobs on Twitter

Page 38: Performance and Scale Options for R with Hadoop: A comparison of potential architectures

Thank you.

www.revolutionanalytics.com

1.855.GET.REVO

Twitter: @RevolutionR