New Advances in High Performance Analytics with R: 'Big Data' Decision Trees and Analysis of Hadoop...

Revolution Confidential. New Advances in High Performance Analytics with R: 'Big Data' Decision Trees and Analysis of Hadoop Data. Presented by: Sue Ranney, VP Product Development

description

Revolution R Enterprise 6.1 includes two important advances in high performance predictive analytics with R: (1) big data decision trees, and (2) the ability to easily extract and perform predictive analytics on data stored in the Hadoop Distributed File System (HDFS). Classification and regression trees are among the most frequently used algorithms for data analysis and data mining. The implementation provided in Revolution Analytics’ RevoScaleR package is parallelized, scalable, distributable, and designed with big data in mind. Decision trees and all of the other high performance predictive analytics functions provided with RevoScaleR (such as linear and logistic regression, generalized linear models, and k-means clustering) can now also be used to analyze data stored in HDFS. After specifying the connection parameters to HDFS, some or all of the data can be explored and analyzed directly, or quickly and efficiently extracted into a native file system.

Transcript of New Advances in High Performance Analytics with R: 'Big Data' Decision Trees and Analysis of Hadoop...

Page 1: New Advances in High Performance Analytics with R: 'Big Data' Decision Trees and Analysis of Hadoop Data

Revolution Confidential

New Advances in High Performance Analytics with R: 'Big Data' Decision Trees and Analysis of Hadoop Data

Presented by: Sue Ranney, VP Product Development

Page 2: New Advances in High Performance Analytics with R: 'Big Data' Decision Trees and Analysis of Hadoop Data

In today’s webcast:

High Performance Analytics (HPA) with Revolution R Enterprise
‘Big Data’ Decision Trees
Revolution’s HPA with Hadoop Data
Resources, Q&A

Page 3: New Advances in High Performance Analytics with R: 'Big Data' Decision Trees and Analysis of Hadoop Data

Revolution R Enterprise: What Gets Installed?

Latest stable version of Open-Source R
High performance math libraries
RevoScaleR package that adds:
  High performance ‘big data’ capabilities to R
  Access to a variety of ‘data sources’ (e.g., SAS, SPSS, text files, ODBC)
  Ability to compute in a variety of ‘compute contexts’ (e.g., Windows/Linux workstation/server, Microsoft HPC Server cluster, Azure Burst, IBM Platform LSF cluster)
  High performance computing capabilities
Integrated Development Environment based on Visual Studio technology (for Windows): the R Productivity Environment (RPE)

Page 4: New Advances in High Performance Analytics with R: 'Big Data' Decision Trees and Analysis of Hadoop Data

High Performance Analytics (HPA) in RevoScaleR

High Performance Computing + Data
Full-featured, fast, and scalable analysis functions
Same code works on small and big data, and a variety of data sources
Same code works on a variety of compute contexts: a laptop, server, cluster, or the cloud
Scales approximately linearly with the number of observations, without increasing memory requirements

Page 5: New Advances in High Performance Analytics with R: 'Big Data' Decision Trees and Analysis of Hadoop Data

RevoScaleR: HPA Algorithms

Descriptive statistics (rxSummary)
Tables and cubes (rxCube, rxCrossTabs)
Correlations/covariances (rxCovCor, rxCor, rxCov, rxSSCP)
K-means clustering (rxKmeans)
Linear regressions (rxLinMod)
Logistic regressions (rxLogit)
Generalized Linear Models (rxGlm)
Predictions (scoring) (rxPredict)
Decision Trees (rxDTree) NEW!

Page 6: New Advances in High Performance Analytics with R: 'Big Data' Decision Trees and Analysis of Hadoop Data

Decision Trees

Relatively easy-to-interpret models
Widely used in a variety of disciplines. For example:
  Predicting which patient characteristics are associated with high risk of, for example, heart attack
  Deciding whether or not to offer a loan to an individual based on individual characteristics
  Predicting the rate of return of various investment strategies
  Retail target marketing
Can handle multi-factor response easily
Useful in identifying important interactions

Page 7: New Advances in High Performance Analytics with R: 'Big Data' Decision Trees and Analysis of Hadoop Data

Decision Tree Types

Classification tree: predict what ‘class’ or ‘group’ an observation belongs in (dependent variable is a factor) for each terminal node or leaf
Regression tree: predict average value of dependent variable for each terminal node or leaf
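The difference between the two tree types comes down to what a leaf returns. The following is a minimal sketch in Python (chosen only so it runs without RevoScaleR); the function names and sample values are hypothetical and are not part of the rxDTree API.

```python
from collections import Counter
from statistics import mean


def classification_leaf(labels):
    """Predicted class for a leaf: the most frequent label among its observations."""
    return Counter(labels).most_common(1)[0][0]


def regression_leaf(values):
    """Predicted value for a leaf: the mean of the dependent variable in that leaf."""
    return mean(values)


# Hypothetical observations that ended up in one leaf of each kind of tree.
leaf_classes = ["Phone", "Phone", "Email", "Phone", "Mail"]
leaf_values = [1.0, 2.0, 6.0]

print(classification_leaf(leaf_classes))
print(regression_leaf(leaf_values))
```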

Page 8: New Advances in High Performance Analytics with R: 'Big Data' Decision Trees and Analysis of Hadoop Data

Simple Example: Marketing Response

Data set containing the following information:
  Response: Was response to a phone call, email, or mailing?
  Age
  Income
  Marital status
  Attended college?

Page 9: New Advances in High Performance Analytics with R: 'Big Data' Decision Trees and Analysis of Hadoop Data

Simple Example: Specifying the Model

treeOut <- rxDTree(response ~ age + income + college + marital, data = rdata)

where rdata is the name of the data set

Page 10: New Advances in High Performance Analytics with R: 'Big Data' Decision Trees and Analysis of Hadoop Data

Simple Example: Basic Output

Each line gives information on the split, the number of observations in the node, the number that do not match the predicted y value (the loss), the predicted y value, and the y probabilities. An asterisk marks a terminal node (leaf).

1) root 10000 4069 Email (0.33260000 0.59310000 0.07430000)
  2) college=No College 5074 2378 Phone (0.53133622 0.38943634 0.07922743)
    4) age>=39.5 2518 330 Phone (0.86894361 0.00000000 0.13105639)
      8) age< 64.5 2256 77 Phone (0.96586879 0.00000000 0.03413121) *
      9) age>=64.5 262 9 Mail (0.03435115 0.00000000 0.96564885) *
    5) age< 39.5 2556 580 Email (0.19874804 0.77308294 0.02816901)
      10) marital=Single 835 371 Phone (0.55568862 0.40958084 0.03473054)
        20) income>=29.5 472 14 Phone (0.97033898 0.00000000 0.02966102) *
        21) income< 29.5 363 21 Email (0.01652893 0.94214876 0.04132231) *
      11) marital=Married 1721 87 Email (0.02556653 0.94944800 0.02498547) *
  3) college=College 4926 971 Email (0.12789281 0.80288266 0.06922452) …
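To see how the columns of the output above fit together, here is a small Python check (Python is used only so the arithmetic can be run without RevoScaleR). It assumes the probabilities are listed in the order Phone, Email, Mail, which is inferred from the leaves in the output rather than stated on the slide. The second number on each line is the count of observations that do not match the predicted class, so it should equal n times one minus the largest probability.

```python
def node_summary(n, loss, labels, probs):
    """Relate the columns of one printed node to each other."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return {
        "predicted": labels[best],                      # class with the highest probability
        "matching": n - loss,                           # observations in the predicted class
        "implied_loss": round(n * (1 - probs[best])),   # should reproduce the printed loss
    }


# Assumed class order, inferred from nodes 8 and 9 in the output.
labels = ["Phone", "Email", "Mail"]

root = node_summary(10000, 4069, labels, [0.3326, 0.5931, 0.0743])
node9 = node_summary(262, 9, labels, [0.03435115, 0.0, 0.96564885])

print(root)
print(node9)
```

For the root node this gives 5,931 observations in the predicted class (Email) and 4,069 in the other two classes combined.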

Page 11: New Advances in High Performance Analytics with R: 'Big Data' Decision Trees and Analysis of Hadoop Data

Simple Example: Visual Representation

Root
  No College
    Age >= 40
      Age < 65: Phone
      Age >= 65: Mail
    Age < 40
      Single
        Income >= 30: Phone
        Income < 30: Email
      Married: Email
  College
    Age < 65
      Single
        Age < 40
          Income >= 30: Phone
          Income < 30: Email
        Age >= 40: Email
      Married: Email
    Age >= 65: Mail

Page 12: New Advances in High Performance Analytics with R: 'Big Data' Decision Trees and Analysis of Hadoop Data

Scaling HPA with RevoScaleR

RevoScaleR functions can read from data sets on disk in chunks, so you can increase the number of observations in the data set beyond what can be analyzed in memory all at once
RevoScaleR analysis functions process chunks of data in parallel, taking greater advantage of your computing resources (Parallel External Memory Algorithms)
  Multiple cores on a desktop/server
  Clusters/grids have the added advantage of more hard drives for storing & accessing data
    Windows HPC Server Cluster
    “Burst” computations to Azure in the cloud
    IBM Platform LSF Grid
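The idea behind an external memory algorithm can be sketched in a few lines of Python: summarize each chunk on its own, then merge the summaries, so the full data set never has to sit in memory. This is an illustration of the general technique only, not RevoScaleR's implementation; the function names and the in-memory list standing in for a file on disk are assumptions. Because the merge step is associative, per-chunk summaries could be computed by different cores or nodes and combined in any order.

```python
import math
import statistics
from functools import reduce


def chunk_stats(chunk):
    """Summarize one non-empty chunk as (count, mean, sum of squared deviations)."""
    n = len(chunk)
    mean = sum(chunk) / n
    m2 = sum((x - mean) ** 2 for x in chunk)
    return n, mean, m2


def merge(a, b):
    """Combine two chunk summaries without revisiting the raw data."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2


def read_chunks(values, chunk_size):
    """Stand-in for reading blocks of rows from a file on disk."""
    for start in range(0, len(values), chunk_size):
        yield values[start:start + chunk_size]


data = [float(x) for x in range(1, 11)]
n, mean, m2 = reduce(merge, map(chunk_stats, read_chunks(data, 3)))
variance = m2 / (n - 1)

# Sanity check against the all-in-memory answer.
assert math.isclose(mean, statistics.mean(data))
assert math.isclose(variance, statistics.variance(data))
```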

Page 13: New Advances in High Performance Analytics with R: 'Big Data' Decision Trees and Analysis of Hadoop Data

The ‘Big Data’ Decision Tree Algorithm

Classical algorithms for building a decision tree sort all continuous variables in order to decide where to split the data.
This sorting step becomes time and memory prohibitive when dealing with large data.
rxDTree bins the data rather than sorting, computing histograms to create empirical distribution functions of the data.
rxDTree partitions the data horizontally, processing different sets of observations in parallel.
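A rough Python sketch of split finding on binned data follows. It is not rxDTree's actual code: the equal-width bins, the squared-error criterion, and all function names are assumptions made for illustration. Each chunk of observations only adds to per-bin totals, so no sorting is needed, and histograms built from different chunks can simply be added together. Candidate split points are then limited to bin boundaries.

```python
def bin_index(x, lo, hi, num_bins):
    """Map a value to an equal-width bin, clamping values at the edges."""
    idx = int((x - lo) / (hi - lo) * num_bins)
    return min(max(idx, 0), num_bins - 1)


def accumulate(hist, xs, ys, lo, hi):
    """Add one chunk of (x, y) pairs to per-bin [count, sum of y, sum of y^2]."""
    for x, y in zip(xs, ys):
        cell = hist[bin_index(x, lo, hi, len(hist))]
        cell[0] += 1
        cell[1] += y
        cell[2] += y * y


def sse(n, s, ss):
    """Sum of squared errors around the mean, from sufficient statistics."""
    return ss - s * s / n if n else 0.0


def best_split(hist, lo, hi):
    """Scan bin boundaries and return (threshold, cost) with the lowest total SSE."""
    total = [sum(cell[k] for cell in hist) for k in range(3)]
    width = (hi - lo) / len(hist)
    left = [0, 0.0, 0.0]
    best = (None, float("inf"))
    for i, cell in enumerate(hist[:-1]):
        left = [a + b for a, b in zip(left, cell)]
        right = [t - a for t, a in zip(total, left)]
        if left[0] == 0 or right[0] == 0:
            continue  # a split must leave observations on both sides
        cost = sse(*left) + sse(*right)
        if cost < best[1]:
            best = (lo + (i + 1) * width, cost)
    return best


# Two hypothetical chunks, processed one after the other.
hist = [[0, 0.0, 0.0] for _ in range(10)]
accumulate(hist, [1.5, 2.5, 3.5, 4.5], [0.0, 0.0, 0.0, 0.0], lo=0.0, hi=10.0)
accumulate(hist, [6.5, 7.5, 8.5, 9.5], [10.0, 10.0, 10.0, 10.0], lo=0.0, hi=10.0)

threshold, cost = best_split(hist, lo=0.0, hi=10.0)
print(threshold, cost)
```

The work per variable depends on the number of bins rather than the number of observations once the histogram is built, which is why fewer bins mean faster split searches.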

Page 14: New Advances in High Performance Analytics with R: 'Big Data' Decision Trees and Analysis of Hadoop Data

Useful rxDTree Arguments for Big Data

cp: complexity parameter. Increasing cp will decrease the number of splits attempted.
maxDepth: the maximum depth of any tree node. The computations take much longer at greater depth, so lowering maxDepth can greatly speed up computation time.
maxNumBins: the maximum number of bins to use to cut numeric data. Decreasing maxNumBins will speed up computation time.
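One way to picture the effect of cp is the rule used by R's rpart package, where a split is not attempted unless it reduces the overall lack of fit by at least a factor of cp. The Python sketch below assumes rxDTree's cp behaves in the same rpart style, with the improvement scaled by the root node's risk; the function name is hypothetical. The numbers are the misclassification counts from the marketing example's output: 4,069 at the root, 2,378 at the "No College" node, and 330 + 580 in its two children.

```python
def split_is_worthwhile(root_risk, parent_risk, children_risk, cp):
    """rpart-style test: keep a split only if the scaled improvement reaches cp."""
    improvement = (parent_risk - children_risk) / root_risk
    return improvement >= cp


# Splitting the "No College" node on age removes 2378 - (330 + 580) = 1468 errors,
# which is about 36% of the root node's 4069 errors.
print(split_is_worthwhile(4069, 2378, 330 + 580, cp=0.01))  # kept with a small cp
print(split_is_worthwhile(4069, 2378, 330 + 580, cp=0.5))   # rejected with a large cp
```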

Page 15: New Advances in High Performance Analytics with R: 'Big Data' Decision Trees and Analysis of Hadoop Data

‘Big Data’ Example

CDC Report in Jan. 2012

Page 16: New Advances in High Performance Analytics with R: 'Big Data' Decision Trees and Analysis of Hadoop Data

The U.S. Birth Data: 1985 - 2009

Public-use data sets containing information on all births in the United States for each year from 1985 to 2009 are available to download: http://www.cdc.gov/nchs/data_access/Vitalstatsonline.htm

“These natality files are gigantic; they’re approximately 3.1 GB uncompressed. That’s a little larger than R can easily process” (Joseph Adler, R in a Nutshell)

I’ve imported key variables from each year into a single .xdf file with over 100 million observations.

Page 17: New Advances in High Performance Analytics with R: 'Big Data' Decision Trees and Analysis of Hadoop Data

Regression Tree: Multiple Births

Call:
rxDTree(formula = IsMultiple ~ DadAgeR8 + MAGER + FRACEREC + FHISP_REC + MRACEREC + MHISP_REC + DOB_YY, data = birthAllC, maxDepth = 6, cp = 1e-05, blocksPerRead = 10, verbose = 1)

File: C:\Revolution\Data\CDC\BirthUS.xdf

Number of valid observations: 100672041
Number of missing observations: 0

Page 18: New Advances in High Performance Analytics with R: 'Big Data' Decision Trees and Analysis of Hadoop Data

Leaves with Lowest Percent of Multiple Births

Mom is not black and under the age of 20: 1.3%
Mom is Asian or Pacific Islander (and not Hispanic) and is between 22 and 28 years of age; the birth is before 1997: 1.6%
Mom is black and under the age of 18: 1.7%

Page 19: New Advances in High Performance Analytics with R: 'Big Data' Decision Trees and Analysis of Hadoop Data

Leaves with Highest Percent of Multiple Births

Mom is over 47 years old and the birth is after 1996: 38.6%
Mom is white, non-Hispanic, is between 45 and 47 years old, and the birth is after 1996: 28.1%
Mom is Hispanic, is between 45 and 47 years old, and the birth is after 1996: 15.5%

Page 20: New Advances in High Performance Analytics with R: 'Big Data' Decision Trees and Analysis of Hadoop Data

Poll Question

Are you using Hadoop?

Page 21: New Advances in High Performance Analytics with R: 'Big Data' Decision Trees and Analysis of Hadoop Data

RevoScaleR with Hadoop Data Files (NEW)

The Hadoop Distributed File System (HDFS)
  is highly fault-tolerant and
  is designed to be deployed on low-cost hardware.
RevoScaleR supports accessing data in HDFS for import or for direct analysis

Page 22: New Advances in High Performance Analytics with R: 'Big Data' Decision Trees and Analysis of Hadoop Data

RevoScaleR Data Sources

Data Sources can be used for import or directly for analysis
  External: delimited text, fixed format text, SAS, SPSS, ODBC connections
  Provided with RevoScaleR: efficient .xdf file format
Data Sources contain information about their file system
  Delimited text and .xdf data sources can both be used with the HDFS file system
Data sources are used as input to HPA functions

Page 23: New Advances in High Performance Analytics with R: 'Big Data' Decision Trees and Analysis of Hadoop Data

An Example Using Hadoop Data

Hadoop cluster in our office
  Five nodes of commodity hardware
  Red Hat Enterprise Linux (RHEL) operating system
  Cloudera’s Hadoop (CDH3)
  Also has IBM Platform LSF workload management system installed (not required to use HDFS data)
My colleague, Dawn Kinsey, recorded a data analysis session
  22 comma delimited files stored in HDFS
  Contain information on U.S. flight arrivals, 1997 - 2008

Page 24: New Advances in High Performance Analytics with R: 'Big Data' Decision Trees and Analysis of Hadoop Data

Steps in Analysis

Set up a ‘file system’ object and a ‘data source’ object
Explore the HDFS airline data for the year 2000 directly
Extract variables of interest from all the files into an .xdf file in the native file system
Use R’s great plotting capabilities on summary information
Perform a big logistic regression on an .xdf file stored in HDFS

Page 25: New Advances in High Performance Analytics with R: 'Big Data' Decision Trees and Analysis of Hadoop Data

Poll Question

What features of Revolution R Enterprise 6.1 are most interesting to you?

Page 26: New Advances in High Performance Analytics with R: 'Big Data' Decision Trees and Analysis of Hadoop Data

Thank You!

Download slides, replay from today’s webinar: http://bit.ly/QJfR4A
Learn more about Revolution R Enterprise
  Overview: revolutionanalytics.com/products
  New feature videos: http://www.revolutionanalytics.com/products/new-features.php
Contact Revolution Analytics: http://bit.ly/hey-revo

November 29: Real-Time Big Data Analytics: from Deployment to Production
  David Smith, VP Marketing and Community, Revolution Analytics
  www.revolutionanalytics.com/news-events/free-webinars

Page 27: New Advances in High Performance Analytics with R: 'Big Data' Decision Trees and Analysis of Hadoop Data

The leading commercial provider of software and support for the popular open source R statistics language.

www.revolutionanalytics.com
+1 (650) 646 9545
Twitter: @RevolutionR