Using Apache Spark with IBM SPSS Modeler

Transcript of Using Apache Spark with IBM SPSS Modeler

Page 1: Using Apache Spark with IBM SPSS Modeler

Using Apache Spark with IBM SPSS Modeler

Dr. Steve R. Poulin

Page 2: Using Apache Spark with IBM SPSS Modeler

© Global Knowledge Training LLC. All rights reserved. Page 2

Dr. Steve Poulin

Principal Data Scientist & Manager of Predictive Analytics

Over 20 years of experience as an SPSS trainer and consultant

Holds a Ph.D. in Social Policy, Planning, and Policy Analysis from Columbia University

IBM Master Instructor with Global Knowledge

Worked with over 250 organizations that have used SPSS

Currently more heavily involved in consulting

Page 3: Using Apache Spark with IBM SPSS Modeler

Agenda

Intro Concepts

Enabling Apache Spark Applications

Gradient Boosted Trees with MLlib

K-Means with MLlib

Multinomial Naive Bayes with MLlib

Q&A

Follow-Ons & Additional References

Page 4: Using Apache Spark with IBM SPSS Modeler

Intro Concepts

Page 5: Using Apache Spark with IBM SPSS Modeler

What is Apache Spark?

Apache Spark [1] is an open-source cluster computing framework that uses in-memory processing to run analytic applications up to 100 times faster than other technologies on the market today.

Apache Spark works within Hadoop and is an alternative to MapReduce.

Page 6: Using Apache Spark with IBM SPSS Modeler

Hadoop

Hadoop is a collection of open-source modules that are part of the Apache Project.

o The Apache Project is managed by the volunteer-run Apache Software Foundation.

One of the major components of Hadoop is the Hadoop Distributed File System (HDFS™), which is a distributed file system providing high-throughput access to application data.

Page 7: Using Apache Spark with IBM SPSS Modeler

MapReduce

MapReduce [2] is the processing engine for Apache Hadoop:

o A parallel processing system composed of a map procedure that performs filtering and sorting (such as sorting students by first name into queues, one queue for each name) and a reduce procedure that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies).

It is designed for the analysis of large datasets.
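The student example above can be sketched in a few lines of plain Python. This is only an illustration of the map and reduce steps on a single machine, not Hadoop's API; the function names and the sample student list are hypothetical.

```python
from collections import defaultdict

def map_phase(students):
    """Map step: emit a (first_name, 1) pair for every student record."""
    return [(name.split()[0], 1) for name in students]

def shuffle(pairs):
    """Group the emitted values by key: one queue per first name."""
    queues = defaultdict(list)
    for key, value in pairs:
        queues[key].append(value)
    return queues

def reduce_phase(queues):
    """Reduce step: summarize each queue by counting its entries."""
    return {name: sum(values) for name, values in queues.items()}

# Hypothetical sample data
students = ["Ana Silva", "Ben Wu", "Ana Torres", "Cy Young", "Ben Hart", "Ana Kim"]
name_counts = reduce_phase(shuffle(map_phase(students)))
print(name_counts)
```

In a real cluster the map and reduce steps run in parallel on different nodes, and the grouping step moves data between them.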

Page 8: Using Apache Spark with IBM SPSS Modeler

MapReduce and Apache Spark

Apache Spark performs in-memory processing, whereas MapReduce moves data to and from disk [3].

As a result, Apache Spark can run programs up to 100x faster than MapReduce in memory, or 10x faster on disk.

Page 9: Using Apache Spark with IBM SPSS Modeler

Enabling Apache Spark Applications

Page 10: Using Apache Spark with IBM SPSS Modeler

IBM SPSS Modeler

Apache Spark is well-suited for running complex machine learning techniques on large datasets using its machine learning library (MLlib).

Although Apache Spark applications will run with any data source, they will only achieve these efficiencies when connected to the Analytic Server node, which enables IBM SPSS Modeler to use data from a Hadoop environment.

The following applications, which can be accessed from within IBM SPSS Modeler, will be demonstrated during this seminar:

o Gradient Boosted Trees with MLlib

o K-Means with MLlib

o Multinomial Naive Bayes with MLlib

Page 11: Using Apache Spark with IBM SPSS Modeler

IBM SPSS Analytic Server

IBM SPSS Analytic Server enables IBM SPSS Modeler to use data from Hadoop distributions.

This feature is found as a node in the Sources palette:

Although Apache Spark applications will run with data accessed from many data sources (e.g. SQL databases and text files), they will not achieve their full potential efficiency unless they are connected to a Hadoop data environment through IBM SPSS Analytic Server [4].

Page 12: Using Apache Spark with IBM SPSS Modeler

Enabling IBM SPSS Modeler to Run Apache Spark Applications

Install a copy of Python 2.7 that includes NumPy, a Python component for scientific computing.

o Anaconda is a free package manager that includes Python with the NumPy component.

o The Python 2.7 Anaconda package can be downloaded from Continuum Analytics© at: www.continuum.io/downloads

The following line of text must be added to your options.cfg file:

o eas_pyspark_python_path, "[location of python.exe file in the Python program with NumPy]"

o For example: eas_pyspark_python_path, "C:/Program Files/Anaconda2/python.exe"

The options.cfg file is located in the config folder of your IBM SPSS Modeler Program Files.

o For example: C:\Program Files\IBM\SPSS\Modeler\18.0\config
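Before editing options.cfg, it may help to confirm that the chosen Python installation can actually import NumPy. The sketch below is a hypothetical helper, not part of IBM SPSS Modeler: run it with the python.exe you intend to reference. The function names are assumptions, and the formatted line simply mirrors the example shown above.

```python
import sys

def has_module(name):
    """Return True if the running interpreter can import the named module."""
    try:
        __import__(name)
        return True
    except ImportError:
        return False

def options_cfg_line(python_exe):
    """Format the options.cfg entry in the form shown on the slide.

    Backslashes are converted to forward slashes to match the slide's example.
    """
    return 'eas_pyspark_python_path, "{}"'.format(python_exe.replace("\\", "/"))

if __name__ == "__main__":
    # Reports on the interpreter that is running this script.
    print("NumPy available:", has_module("numpy"))
    print(options_cfg_line(sys.executable))
```

If NumPy is reported as unavailable, the path points at a Python installation other than the one that was installed with NumPy.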

Page 13: Using Apache Spark with IBM SPSS Modeler

Adding Spark Applications through IBM SPSS Modeler Extension Hub

The Extension Hub automatically connects to the IBM SPSS Predictive Analytics Gallery (http://ibmpredictiveanalytics.github.io) and presents the extensions in a dialog box.

Page 14: Using Apache Spark with IBM SPSS Modeler

IBM SPSS Modeler Extension Hub Dialog Box

Demos on extensions can be obtained at: https://github.com/IBMPredictiveAnalytic

Page 15: Using Apache Spark with IBM SPSS Modeler

Gradient Boosted Trees with MLlib

Page 16: Using Apache Spark with IBM SPSS Modeler

Introduction

Like the Random Trees procedure, this procedure generates ensembles of decision trees, but it also trains the decision trees iteratively in order to minimize a “loss function” (a penalty for mispredictions) [5].

The algorithm uses the current ensemble to predict the label of each training instance and then compares the prediction with the true label.

The dataset is re-labeled to put more emphasis on training instances with poor predictions.

Thus, in the next iteration, the decision tree will help correct for previous mistakes.
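The iterative idea can be sketched in plain Python for a regression target with squared error. This is a simplified illustration, not the MLlib implementation: each "tree" is a one-split stump on a single predictor, and the learning rate, number of trees, and sample data are hypothetical choices.

```python
def fit_stump(xs, targets):
    """Find the one-split regression tree (stump) with the lowest squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((t - left_mean) ** 2 for t in left)
                 + sum((t - right_mean) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_trees=20, learning_rate=0.5):
    """Build an ensemble in which each new stump is trained on the current errors."""
    base = sum(ys) / len(ys)
    trees = []

    def predict(x):
        return base + learning_rate * sum(tree(x) for tree in trees)

    for _ in range(n_trees):
        # Re-label the data: the new targets are the current ensemble's residuals.
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        trees.append(fit_stump(xs, residuals))
    return predict

# Hypothetical sample data
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]
model = boost(xs, ys)
mse_base = sum((y - sum(ys) / len(ys)) ** 2 for y in ys) / len(ys)
mse_boosted = sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
print(mse_base, mse_boosted)
```

Records that the current ensemble predicts poorly have large residuals, so they dominate the next stump's training, which is how later trees correct earlier mistakes.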

Page 17: Using Apache Spark with IBM SPSS Modeler

Loss Functions

Loss            Task            Description

Log Loss        Classification  Twice binomial negative log likelihood

Squared Error   Regression      Also called L2 loss. Default loss for regression tasks

Absolute Error  Regression      Also called L1 loss. Can be more robust to outliers than Squared Error
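The three losses can be written as per-record Python functions. This sketch follows the formulas given in the Spark MLlib ensembles documentation (reference 5) and assumes the classification label is coded as -1 or +1 for Log Loss; the function names are hypothetical.

```python
import math

def log_loss(label, prediction):
    """Twice the binomial negative log likelihood; label is -1 or +1."""
    return 2.0 * math.log(1.0 + math.exp(-2.0 * label * prediction))

def squared_error(label, prediction):
    """L2 loss: the error is squared, so large errors are penalized heavily."""
    return (label - prediction) ** 2

def absolute_error(label, prediction):
    """L1 loss: the penalty grows only linearly with the size of the error."""
    return abs(label - prediction)

# An error of 10 costs 100 under Squared Error but only 10 under Absolute Error,
# which is why Absolute Error can be more robust to outliers.
print(squared_error(10, 0), absolute_error(10, 0))
```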

Page 18: Using Apache Spark with IBM SPSS Modeler

Gradient Boosted Trees with MLlib Dialog Boxes

Page 19: Using Apache Spark with IBM SPSS Modeler

Gradient Boosted Trees with MLlib Dialog Boxes

One of the three Loss Functions is selected here

Page 20: Using Apache Spark with IBM SPSS Modeler

Gradient Boosted Trees with MLlib Output

Confidence scores

Page 21: Using Apache Spark with IBM SPSS Modeler

Gradient Boosted Trees with MLlib Stream:

LIVE DEMO

Page 22: Using Apache Spark with IBM SPSS Modeler

K-Means with MLlib

Page 23: Using Apache Spark with IBM SPSS Modeler

Introduction

The K-Means clustering technique has long been part of IBM SPSS Modeler and IBM SPSS Statistics.

The user specifies the number of clusters (the “K” value) to test.

In the traditional method, K individual records are selected based on their distinctive profiles, although there is some randomness in which records are selected.

The remaining records are assigned to the K clusters based on which of the initial records they are most similar to, as determined by the squared Euclidean distance measure.

Records can be re-assigned to make the clusters more distinctive.
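A minimal Python sketch of this assign-and-reassign loop is shown below. It is a simplified illustration rather than the SPSS or MLlib procedure: the initial records are chosen purely at random, and the fixed iteration count, seed, and sample records are hypothetical.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two records."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(record, centers):
    """Index of the cluster center closest to the record."""
    return min(range(len(centers)), key=lambda i: squared_distance(record, centers[i]))

def k_means(records, k, iterations=10, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(records, k)  # K initial records
    for _ in range(iterations):
        # Assign every record to its most similar center.
        clusters = [[] for _ in range(k)]
        for record in records:
            clusters[nearest(record, centers)].append(record)
        # Move each center to the mean of its members; this is what lets
        # records be re-assigned on the next pass.
        for i, members in enumerate(clusters):
            if members:
                centers[i] = tuple(sum(dim) / float(len(members)) for dim in zip(*members))
    assignments = [nearest(record, centers) for record in records]
    return centers, assignments

# Hypothetical sample data
records = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centers, assignments = k_means(records, k=2)
print(assignments)
```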

Page 24: Using Apache Spark with IBM SPSS Modeler

K-Means with MLlib

The K-Means with MLlib procedure uses a machine-learning process to build the clusters [6].

The distance threshold used to decide when the clusters have converged is labeled Epsilon.

Although the user still provides the K value, the final result may be fewer than K clusters.

Page 25: Using Apache Spark with IBM SPSS Modeler

K-Means with MLlib Dialog Boxes

Page 26: Using Apache Spark with IBM SPSS Modeler

K-Means with MLlib Dialog Boxes

When a round of cluster building changes the clusters by less than this Epsilon value, the cluster-building process stops.

Lowering this value will increase processing time.
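One way to picture this stopping rule is as a check on how far the cluster centers moved in the latest round. The sketch below is an assumption about the general idea, following the Spark MLlib documentation's description of epsilon as a distance threshold for convergence; the function names are hypothetical and the default of 1e-4 is used only for illustration.

```python
import math

def center_shift(old_centers, new_centers):
    """Largest Euclidean distance that any cluster center moved in one round."""
    return max(
        math.sqrt(sum((a - b) ** 2 for a, b in zip(old, new)))
        for old, new in zip(old_centers, new_centers)
    )

def has_converged(old_centers, new_centers, epsilon=1e-4):
    """Stop building clusters once no center moved by epsilon or more."""
    return center_shift(old_centers, new_centers) < epsilon

# A small shift passes a loose threshold but fails a strict one.
old = [(0.0, 0.0), (5.0, 5.0)]
new = [(0.0, 0.01), (5.0, 5.0)]
print(has_converged(old, new, epsilon=0.1), has_converged(old, new, epsilon=0.001))
```

A smaller threshold demands more rounds before the centers settle, which is why lowering the value increases processing time.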

Page 27: Using Apache Spark with IBM SPSS Modeler

K-Means with MLlib Dialog Boxes

This only needs to be increased if there is an indication that the convergence threshold was not met.

Page 28: Using Apache Spark with IBM SPSS Modeler

K-Means with MLlib Dialog Boxes

This does not need to be changed for more recent versions of Spark.

Page 29: Using Apache Spark with IBM SPSS Modeler

K-Means with MLlib Dialog Boxes

The Initialization Mode determines how individual records are selected for the training process.

The Random option randomly selects these records.

Without the use of a Random Seed, varying distributions of random numbers will be generated that result in the selection of different records each time the procedure is run.

If this box is checked, the Random Seed value will ensure that the same initial records are selected.
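The effect of a seed can be demonstrated with Python's standard random module. This is a generic illustration of seeded random selection, not the code that the extension runs; the function name and the record list are hypothetical.

```python
import random

def pick_initial_records(records, k, seed=None):
    """Randomly choose k records as starting points; a seed fixes the choice."""
    rng = random.Random(seed)  # seed=None gives a different sequence on each run
    return rng.sample(records, k)

records = list(range(100))
first = pick_initial_records(records, 3, seed=42)
second = pick_initial_records(records, 3, seed=42)
print(first == second)  # same seed, same initial records
```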

Page 30: Using Apache Spark with IBM SPSS Modeler

K-Means with MLlib Dialog Boxes

The K-Means|| option (a parallel variant of K-Means++) in the Initialization Mode section of the dialog box provides an alternative way to select the first records for the cluster-building process.

This option builds clusters more quickly than the use of randomly selected records but may not scale up well for large datasets.

The Initialization Steps setting only applies to this option.
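The idea behind this kind of initialization can be sketched with the sequential K-Means++ seeding rule, in which each additional starting record is drawn with probability proportional to its squared distance from the nearest record already chosen. This is not MLlib's parallel implementation; the function names, seed, and sample records are hypothetical.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two records."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means_plus_plus(records, k, seed=0):
    rng = random.Random(seed)
    chosen = [rng.choice(records)]  # the first record is picked uniformly at random
    while len(chosen) < k:
        # Weight each record by its squared distance to the nearest chosen record,
        # so records far from the current starting points are more likely to be picked.
        weights = [min(squared_distance(r, c) for c in chosen) for r in records]
        total = sum(weights)
        if total == 0:
            break  # every record coincides with a chosen one
        target = rng.uniform(0, total)
        cumulative = 0.0
        for record, weight in zip(records, weights):
            cumulative += weight
            if weight > 0 and cumulative >= target:
                chosen.append(record)
                break
    return chosen

# Hypothetical sample data
records = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
starting_records = k_means_plus_plus(records, k=2)
print(starting_records)
```

Because well-separated starting records are favored, fewer rounds of reassignment are typically needed than with purely random starting records.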

Page 31: Using Apache Spark with IBM SPSS Modeler

K-Means with MLlib Output

Cluster membership values

Page 32: Using Apache Spark with IBM SPSS Modeler

K-Means with MLlib Stream:

LIVE DEMO

Page 33: Using Apache Spark with IBM SPSS Modeler

Multinomial Naive Bayes with MLlib

Page 34: Using Apache Spark with IBM SPSS Modeler

Multinomial Naive Bayes with MLlib

Naive Bayes is a classification algorithm with the assumption of independence (hence the term “naïve”) between every pair of predictors (called “features” in this procedure) [7].

As is the case for all classification procedures, it requires one target field and any number of predictors.

In a single pass over the training data, it computes the conditional probability distribution of each predictor given the target category. It then applies Bayes’ theorem (the probability of an event based on prior knowledge of conditions that might be related to the event) to compute the conditional probability distribution of the target category given an observation, and uses it for prediction.

Page 35: Using Apache Spark with IBM SPSS Modeler

Multinomial Naive Bayes with MLlib

Multinomial Naive Bayes (in contrast to other forms of Bayesian methods) uses fields representing the number of times items, such as words, have been found in a document.

This procedure is often used for document classification.

Page 36: Using Apache Spark with IBM SPSS Modeler

Multinomial Naive Bayes with MLlib Dialog Box

The Smoothing parameter addresses conditions that have a conditional probability of zero and should probably be left at its default value of 1.
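A small Python sketch may help show where the smoothing value enters a multinomial Naive Bayes calculation. It is a simplified illustration using word counts per document, not the MLlib implementation; the function names and the sample documents are hypothetical.

```python
import math
from collections import defaultdict

def train(documents, labels, smoothing=1.0):
    """documents: list of {word: count} dicts; labels: one category per document."""
    vocabulary = set(word for doc in documents for word in doc)
    class_docs = defaultdict(int)
    word_counts = defaultdict(lambda: defaultdict(float))
    for doc, label in zip(documents, labels):
        class_docs[label] += 1
        for word, count in doc.items():
            word_counts[label][word] += count
    model = {}
    for label in class_docs:
        total = sum(word_counts[label].values()) + smoothing * len(vocabulary)
        log_prior = math.log(class_docs[label] / float(len(documents)))
        # Adding the smoothing value keeps a word never seen with this category
        # from getting probability zero (log(0) would fail with smoothing=0).
        log_likelihood = dict(
            (word, math.log((word_counts[label][word] + smoothing) / total))
            for word in vocabulary
        )
        model[label] = (log_prior, log_likelihood)
    return model

def predict(model, doc):
    """Return the category with the highest posterior score for the document."""
    def score(label):
        log_prior, log_likelihood = model[label]
        return log_prior + sum(count * log_likelihood[word]
                               for word, count in doc.items() if word in log_likelihood)
    return max(model, key=score)

# Hypothetical sample data
documents = [{"goal": 2, "match": 1}, {"goal": 1, "team": 2},
             {"vote": 3, "party": 1}, {"vote": 1, "election": 2}]
labels = ["sports", "sports", "politics", "politics"]
model = train(documents, labels)
print(predict(model, {"match": 1, "team": 1}))
```

With a smoothing value of 1, a word such as "election" that never appears in a sports document still receives a small nonzero probability for that category instead of ruling it out entirely.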

Page 37: Using Apache Spark with IBM SPSS Modeler

Multinomial Naive Bayes with MLlib Output

Predicted outcomes

Page 38: Using Apache Spark with IBM SPSS Modeler

Multinomial Naive Bayes with MLlib Stream:

LIVE DEMO

Page 39: Using Apache Spark with IBM SPSS Modeler

Questions?

Steve Poulin

Still have questions? [email protected]

Page 40: Using Apache Spark with IBM SPSS Modeler

References: Further Reading

1. www.spark.apache.org

2. https://www-01.ibm.com/software/data/infosphere/hadoop/mapreduce/

3. https://www.edureka.co/blog/apache-spark-vs-hadoop-mapreduce

4. http://www-03.ibm.com/software/products/en/spss-analytic-server

5. http://spark.apache.org/docs/latest/mllib-ensembles.html#gradient-boosted-trees-gbts

6. http://spark.apache.org/docs/latest/mllib-clustering.html#k-means

7. http://spark.apache.org/docs/1.5.2/mllib-naive-bayes.html

Page 41: Using Apache Spark with IBM SPSS Modeler

Next Steps

For a deeper dive into the concepts and tactics presented here, take a look at our available training:

Introduction to IBM SPSS Modeler and Data Mining (v18)

Predictive Modeling for Categorical Targets Using IBM SPSS Modeler (v18)

Advanced Predictive Modeling Using IBM SPSS Modeler (v18)

Page 42: Using Apache Spark with IBM SPSS Modeler

For more information contact us at: www.globalknowledge.com | 1-800-COURSES

[email protected]