Overview of the SPSS Modeler Integration with IBM PureData System for Analytics Session Number 2921 Gregory Walker, Ph.D., IBM © 2013 IBM Corporation

Takeaways

•High-level understanding of Modeler and IPDA

•Integration points

•Tips/Best Practices

Agenda

Prerequisites

SPSS Modeler and Netezza Integration Points

Tips/Best Practices

Prerequisites

•Netezza Appliance

•SPSS Modeler Client

•SPSS Modeler Server

•IBM Netezza Analytics*

Netezza/SPSS Modeler Integration Highlights

•As of IBM SPSS Modeler 15:

• Tier 1 database support

• Enhanced support for SQL generation/pushback

• 11 Netezza In-Database modeling nodes

• Scoring adapter

• Database function (UDF) exposure

SQL Pushback

SQL Pushback

•Automatically converts Modeler nodes into corresponding SQL

•Purple nodes at execution time indicate SQL Pushback is occurring for those nodes

•Will attempt to include as much of the Stream as possible in SQL Pushback

•Can push back none, some, or all of a Stream’s nodes

•A node that cannot be represented in SQL will receive the result set of the previous node’s SQL Pushback statement
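As a rough illustration of the behaviour described above, the sketch below models a stream as an ordered list of nodes and splits it into the leading run that can be expressed as SQL and the remainder that runs in Modeler. The `Node` class, the `supports_sql` flag, and `split_for_pushback` are hypothetical names for this toy model. They are not part of any Modeler API, and real streams can branch.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Node:
    """Hypothetical stand-in for a stream node (not a Modeler API class)."""
    name: str
    supports_sql: bool


def split_for_pushback(stream: List[Node]) -> Tuple[List[Node], List[Node]]:
    """Split a linear stream into (pushed back to the database, run locally).

    Toy rule: pushback covers the longest leading run of SQL-capable nodes.
    The first node that cannot be represented in SQL, and everything after
    it, runs locally on the result set of the pushed-back SQL.
    """
    for i, node in enumerate(stream):
        if not node.supports_sql:
            return stream[:i], stream[i:]
    return stream, []


stream = [
    Node("Database source", True),
    Node("Select", True),
    Node("Merge by key", True),
    Node("Sample (1-in-n)", False),  # no pushback support per the sampling notes
    Node("Sort", True),
]
in_db, local = split_for_pushback(stream)
print("Pushed back:", [n.name for n in in_db])
print("Run locally:", [n.name for n in local])
```

In this toy model the Sort node runs locally even though it could be expressed in SQL, because it sits downstream of a node that broke the pushback chain.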

Nodes Supporting SQL Generation

•Sources
  • Database source only
  • Can specify a table as a source
  • Can enter a SQL statement directly
•Record Operations
•Field Operations
•Graphs
•Modeler Models
•Output
•Export
  • Database
  • Publisher (Published stream will contain generated SQL)
•Expressions

SQL Pushback – Supported Nodes

Record Node

•Select
  • Supports generation only if SQL generation for the select expression itself is supported (see expressions below).
  • If any fields have nulls, SQL generation does not give the same results for discard as are given in native IBM® SPSS® Modeler.
•Sample
  • Simple sampling supports SQL generation in certain instances.
  • Complex sampling does not support SQL generation.
•Aggregate
  • Supports generation in certain instances.
•RFM Aggregate
  • Supports generation except if saving the date of the second or third most recent transactions, or if only including recent transactions. However, including recent transactions does work if the datetime_date(YEAR,MONTH,DAY) function is pushed back.
•Sort
•Merge
  • No SQL generated for merge by order.
  • Merge by key with full or partial outer join is only supported if the database/driver supports it.
  • Non-matching input fields can be renamed by means of a Filter node, or the Filter tab of a source node.
  • For all types of merge, SQL_SP_EXISTS is not supported if inputs originate in different databases.
•Append
  • Supports generation if inputs are unsorted.
•Distinct
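The note on nulls for the Select node can be illustrated with SQL's three-valued logic. The sketch below uses SQLite from Python's standard library purely as a stand-in database. The `orders` table, its columns, and the "native-style" rule (a comparison against a null is treated as false, so the record is not discarded) are assumptions made for illustration, not documented Modeler or Netezza behaviour.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 50.0), (2, 150.0), (3, None)],
)

# "Discard records where amount > 100", written as a SQL filter.
# For the null row, NOT (NULL > 100) evaluates to NULL, so the row is dropped.
sql_kept = [row[0] for row in conn.execute(
    "SELECT id FROM orders WHERE NOT (amount > 100) ORDER BY id")]

# Assumed native-style rule: a condition on a null is false, so the record
# does not match the discard condition and is kept.
rows = conn.execute("SELECT id, amount FROM orders ORDER BY id").fetchall()
native_kept = [rid for rid, amount in rows
               if not (amount is not None and amount > 100)]

print("SQL discard keeps:  ", sql_kept)
print("Native-style keeps: ", native_kept)
```

Under these assumptions the record with the null amount survives the native-style discard but not the SQL one, which is the kind of mismatch the note warns about.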

Notes on Simple Sampling With IPDA

•First N
  • Generates SQL but prevents downstream SQL generation unless the node is cached
  • Will return error downstream if not cached
    • “A connection must be supplied as the previous nodes do not pushback”
•1-in-n
  • No SQL Pushback support
•Random Percent
  • Does generate SQL and does NOT inhibit downstream SQL pushback even w/o cache
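For readers unfamiliar with the three simple sampling modes named above, the following sketch shows what each one selects from an in-memory list. The function names are hypothetical, and 1-in-n is taken here to mean records n, 2n, 3n, and so on, which is an assumption. The code only illustrates the sampling semantics; it says nothing about how Modeler or IPDA implement them.

```python
import random
from typing import List, Optional, Sequence


def first_n(records: Sequence[int], n: int) -> List[int]:
    """Keep the first n records."""
    return list(records[:n])


def one_in_n(records: Sequence[int], n: int) -> List[int]:
    """Keep every n-th record (assumed here to be records n, 2n, 3n, ...)."""
    return list(records[n - 1::n])


def random_percent(records: Sequence[int], percent: float,
                   seed: Optional[int] = None) -> List[int]:
    """Keep each record independently with probability percent / 100."""
    rng = random.Random(seed)
    return [r for r in records if rng.random() * 100 < percent]


records = list(range(1, 11))
print(first_n(records, 3))
print(one_in_n(records, 3))
print(random_percent(records, 30, seed=42))
```

First N depends on record order and Random Percent does not, which may be part of why the two behave differently under pushback, though the slides do not say so.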

SQL Generation in the Aggregate Node

Storage Sum Mean Min Max Sdev Median Count Variance Percentile

Integer Y Y Y Y Y Y Y

Real Y Y Y Y Y Y Y

Date Y Y Y

Time Y Y Y

Timestamp Y Y Y

String Y Y Y

SQL Pushback – Supported Nodes

Field Node

•Type
  • Supports SQL generation if the Type node is instantiated and no ABORT or WARN type checking is specified.
•Filter
•Derive
  • Supports SQL generation if SQL generated for the derive expression is supported (see expressions below).
•Ensemble
  • Supports SQL generation for Continuous targets. For other targets, supports generation only if the "Highest confidence wins" ensemble method is used.
•Filler
  • Supports SQL generation if the SQL generated for the derive expression is supported (see expressions below).
•Anonymize
  • Supports SQL generation for Continuous targets, and partial SQL generation for Nominal and Flag targets.
•Reclassify
•Binning
  • Supports SQL generation if the "Tiles (equal count)" binning method is used and the "Read from Bin Values tab if available" option is selected.
•RFM Analysis
  • Supports SQL generation if the "Read from Bin Values tab if available" option is selected, but downstream nodes will not support it.
•Partition
  • Supports SQL generation to assign records to partitions.
•SetToFlag
•Restructure

Graph Node

•Graphboard
  • SQL generation is supported for the following graph types: Area, 3-D Area, Bar, 3-D Bar, Bar of Counts, Heat map, Pie, 3-D Pie, Pie of Counts. For Histograms, SQL generation is supported for categorical data only.
•Distribution
•Web
•Evaluation

SQL Pushback – Supported Nodes

Export Node

•Database
•Publisher
  • The published stream will contain generated SQL.

Output Node

•Table
  • Supports generation if SQL generation is supported for highlight expression (see expressions below).
•Matrix
  • Supports generation except if "All numerics" is selected for the Fields option.
•Analysis
  • Supports generation, depending on the options selected.
•Transform
•Statistics
  • Supports generation if the Correlate option is not used.
•Report
•Set Globals

Model Apply Node

•C&R Tree
  • Supports SQL generation for the single tree option, but not for the boosting, bagging or large dataset options.
•QUEST
•CHAID
•C5.0
•Decision List
•Linear
  • Supports SQL generation for the standard model option, but not for the boosting, bagging or large dataset options.
•Neural Net
  • Supports SQL generation for the standard model option (Multilayer Perceptron only), but not for the boosting, bagging or large dataset options.
•PCA/Factor
•Logistic
  • Supports SQL generation for Multinomial procedure but not Binomial. For Multinomial, generation is not supported when confidences are selected, unless the target type is Flag.
•Generated Rulesets

SQL Pushback - Expressions

•Operators: + - / * ><

•Relational Operators: = /= > >= < <=

•Functions: Abs, Allbutfirst, Allbutlast, And, Arccos, Arcsin, Arctan, Arctanh, Cos, Div, Exp, Fracof, Hasstartstring, Hassubstring, Integer, Intof, Isalphacode, Islowercode, Isnumbercode, Isstartstring, Issubstring, Isuppercode, Last, Length, Locchar, Log, Log10, Lowertoupper, Max, Member, Min, Negate, Not, Number, Or, Pi, Real, Rem, Round, Sign, Sin, Sqrt, String, Strmember, Subscrs, Substring, Substring_between, Uppertolower, To_string

•Aggregate Functions: Sum, Mean, Min, Max, Count, Sdev
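One way to use the function list above is as a quick pre-check on an expression before relying on pushback. The sketch below is a minimal, hypothetical helper: it pulls out anything that looks like a function call and reports names that are not on the slide. The parsing is a simple regular expression rather than a real CLEM parser, and `date_days_difference` is used only as an example of a name that does not appear in the list.

```python
import re

# Function names copied from the slide, lowercased. Nothing beyond the slide
# is assumed to push back.
PUSHBACK_FUNCTIONS = {
    "abs", "allbutfirst", "allbutlast", "and", "arccos", "arcsin", "arctan",
    "arctanh", "cos", "div", "exp", "fracof", "hasstartstring",
    "hassubstring", "integer", "intof", "isalphacode", "islowercode",
    "isnumbercode", "isstartstring", "issubstring", "isuppercode", "last",
    "length", "locchar", "log", "log10", "lowertoupper", "max", "member",
    "min", "negate", "not", "number", "or", "pi", "real", "rem", "round",
    "sign", "sin", "sqrt", "string", "strmember",
    "subscrs",  # CLEM spelling of the subscript function
    "substring", "substring_between", "uppertolower", "to_string",
    # Aggregate functions from the same slide.
    "sum", "mean", "count", "sdev",
}

# An identifier followed by an opening parenthesis is treated as a call.
CALL_PATTERN = re.compile(r"([A-Za-z_][A-Za-z0-9_]*)\s*\(")


def unsupported_functions(expression: str) -> set:
    """Return called function names that are not on the slide's list."""
    called = {name.lower() for name in CALL_PATTERN.findall(expression)}
    return called - PUSHBACK_FUNCTIONS


print(unsupported_functions("sqrt(abs(balance)) > 10"))
print(unsupported_functions("date_days_difference(start, end) > 30"))
```

An empty result only means every called function is on the list; whether a given node actually pushes back still depends on the node-level restrictions covered earlier.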

Enabling SQL Pushback

•Verify Modeler Server enablement from Modeler Client:
  • Help -> About -> Additional Details
  • Look for “Server Enablement”

Enabling SQL Pushback, Continued

•Enable Optimization Settings
  • Tools -> Stream Properties -> Options -> Optimization

How Do I Know SQL Pushback Occurs?

•Nodes will turn purple

Where SQL Pushback Can Help the Most

•Joins

• Merge by key

•Aggregation

•Selection

•Sorting

•Field Derivations

•Field Projections

•Scoring

In-Database Scoring

Scoring with SPSS Modeler

•Out of database scoring

•SQL Pushback

•Scoring Adapter

In-Database Scoring Using SQL Pushback

•Small number of Modeler Models

Model Apply Node

•C&R Tree
  • Supports SQL generation for the single tree option, but not for the boosting, bagging or large dataset options.
•QUEST
•CHAID
•C5.0
•Decision List
•Linear
  • Supports SQL generation for the standard model option, but not for the boosting, bagging or large dataset options.
•Neural Net
  • Supports SQL generation for the standard model option (Multilayer Perceptron only), but not for the boosting, bagging or large dataset options.
•PCA/Factor
•Logistic
  • Supports SQL generation for Multinomial procedure but not Binomial. For Multinomial, generation is not supported when confidences are selected, unless the target type is Flag.
•Generated Rulesets

In-Database Scoring Using SQL Pushback

•Must enable model to score via SQL Pushback within the Model Nugget

• Double-click model nugget -> Settings

In-Database Scoring Using SQL Pushback

Analytic model*: Run Regression Model (customer churn prediction)

Data: 102M rows, 18 GB of data

•IPDA 1000-12 (Compressed), out of box performance: 9 seconds

•Oracle Exadata ¼ Rack (Compressed High), out of box performance: 59 minutes

•Oracle Exadata ¼ Rack (Compressed High), after tuning: 178 seconds

•20x faster

* Created 20 Telco Churn Models using Multinomial Logistic Regression and scored a compressed table with 102M rows using SQL Pushback
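The "20x faster" figure is consistent with the ratio of two of the timings on this slide. The sketch below is plain arithmetic on the three reported times. Which system and configuration each time belongs to is read from the slide layout and should be treated as an assumption.

```python
# Timings as printed on the slide, converted to seconds.
fastest_seconds = 9               # assumed to be the IPDA run
tuned_seconds = 178               # assumed to be the comparison system after tuning
out_of_box_seconds = 59 * 60      # assumed to be the comparison system out of the box

ratio_vs_tuned = tuned_seconds / fastest_seconds
ratio_vs_out_of_box = out_of_box_seconds / fastest_seconds

print(round(ratio_vs_tuned, 1))       # 19.8
print(round(ratio_vs_out_of_box, 1))  # 393.3
```

On this reading the headline number compares against the tuned configuration, which is the more conservative of the two ratios.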

SPSS Modeler Scoring Adapter

Extension to current In-Database Capabilities allowing more SPSS Modeler models to be scored In-Database

Improve the efficiency of scoring models by minimizing data movement and leveraging database capabilities

Supported for IPDA w/ NPS version > 6.0 P8

Modeler Scoring Adapter Overview

Implementation

The IBM SPSS Modeler Server Scoring Adapter must be installed within the database that you will be using with Modeler (the adapter leverages database UDFs for processing)

Models are stored within tables and published when updated

You do not need individual installations for each model.

Benefit:

Once installed, Modeler will automatically use the adapter when a stream is executed and the stream is running against that database.

Usage:

Can be turned off at the Server level if needed, or the scoring method to use can be set at the model level

Scoring Adapter SQL Pushback * Local Scoring

C&RT, Quest, CHAID, C5.0 X X X

Decision List X X X

ALM X X

Linear Regression X X X

Logistic Regression X X X

Neural Net X X X

Discriminant X X

GenLin X X

Cox X X

SVM X X

Bayes Net X X

SLRM X X

K-Means, Kohonen, Two Step X X

Anomaly Detection X X

KNN X X

Split Models, Large Dataset, Boosting, Bagging X X

GLMM X X

PCA / Factor X X

Feature Selection X X

Time Series / Sequence X

Apriori / Carma X

Text Analytics X

Social Network Analysis X

Entity Analytics X

*Not all options supported - refer to product documentation for Limitations

Pure SQL vs Scoring Adapter for Model Scoring

Pure SQL

•Difficult to support some model scoring algorithms
•Requires a SQL mapping to be constructed for each model type
•Resulting SQL will run on many database systems
•No database extensions required
•Performance/reliability harder to predict
•Harder to generate SQL to score ensemble models

Scoring Adapter (UDFs)

•Easily supports a large class of scoring algorithms
•Reuses existing scoring component to score each model type
•Needs to be adapted for each database system requiring support
•Requires database extensions to be installed
•Performance/reliability easier to predict
•Easier to score ensemble models
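To make "a SQL mapping constructed for each model type" concrete, the sketch below renders a tiny decision tree as a nested SQL CASE expression and scores a table with it. Everything here is hypothetical: the tree, the `customers` table, and the field names are invented for illustration, SQLite stands in for the database, and the generated SQL is not what Modeler itself emits.

```python
import sqlite3

# Hypothetical churn tree: internal nodes test "field <= threshold",
# leaves hold the predicted label.
TREE = {
    "field": "tenure_months", "threshold": 12,
    "left": {
        "field": "monthly_charge", "threshold": 70,
        "left": {"label": "stay"},
        "right": {"label": "churn"},
    },
    "right": {"label": "stay"},
}


def tree_to_sql(node: dict) -> str:
    """Recursively map a tree node to a SQL CASE expression."""
    if "label" in node:
        return "'{}'".format(node["label"])
    return "CASE WHEN {} <= {} THEN {} ELSE {} END".format(
        node["field"], node["threshold"],
        tree_to_sql(node["left"]), tree_to_sql(node["right"]),
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER, tenure_months INTEGER, monthly_charge REAL)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, 6, 95.0), (2, 6, 40.0), (3, 36, 95.0)],
)

# Scoring happens inside the database: only ids and predictions come back.
query = "SELECT id, {} AS prediction FROM customers ORDER BY id".format(
    tree_to_sql(TREE))
scores = conn.execute(query).fetchall()
print(scores)
```

A mapping like this is straightforward for a single tree but has to be written separately for every model type, which is the trade-off the comparison describes. A UDF-based adapter would instead call the existing scoring component from SQL.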

Database Function Exposure

Database Function Exposure

•Exposed in downstream nodes via Expression Builder

• Derive

• Select

• Balance*

• Filler

• Analysis

• Report

• Table

• Merge

• Merge by Condition

•Includes

• Regular database functions

• UDFs

* Balance node does not push back to the database

Database Function Exposure

• Can be useful for replacing Modeler functions that do not push back

• E.g., various Modeler time arithmetic functions

IPDA In-Database Models

Netezza In-Database Models

INZA models supported within Modeler

Bayes Net

Decision Trees

Divisive Clustering

Generalized Linear

K-Means

KNN

Linear Regression

Naive Bayes

PCA

Regression Tree

Time Series

Enabling Netezza In-Database Modeling

•Tools -> Options -> Helper Applications -> IBM Netezza

Using Netezza In-Database Models

Model Scoring in IPDA or SPSS?

•Depends on dataset size

[Chart: Processing Time (sec), 0 to 2000, versus Number of Records, 10,000 to 100,000,000, for Netezza and SPSS Modeler]

Summary Netezza + SPSS Modeler Integration

Feature and Benefit

•Asymmetric massively parallel processing (AMPP) architecture
  • Answers to your sophisticated questions, across all of your data, returned in a fraction of the time it used to take
•Analytics Workbench
  • Easy to build, manage, validate and deploy analytic models
•SQL Pushback
  • In-database optimized SQL generated for common data preparation tasks including sampling
•In-Database Data Mining / Model Building
  • Ready-to-use, parallelized, in-database via Netezza Analytics: Decision Trees, K-Means, PCA, Linear Regression, Regression Trees, Bayes Network, Naïve Bayes, K Nearest Neighbors, Divisive Clustering, GLM, Time Series
•In-Database Model Scoring via SPSS Algorithms
  • C&RT, Quest, CHAID, C5.0, Decision List, Linear, Neural Net, PCA/Factor, Logistic, Generated Rulesets
•In-Database Ensemble Scoring
  • Delivers higher performance for ensemble models with larger data and more dimensions / variables
•In-Database Scoring with Scoring Adapter
  • Delivers high performance scoring for Modeler models that cannot be rendered in SQL.

Acknowledgements and Disclaimers

Availability. References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates.

The workshops, sessions and materials have been prepared by IBM or the session speakers and reflect their own views. They are provided for informational purposes only, and are neither intended to, nor shall have the effect of being, legal or other guidance or advice to any participant. While efforts were made to verify the completeness and accuracy of the information contained in this presentation, it is provided AS-IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this presentation or any other materials. Nothing contained in this presentation is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software.

All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales, revenue growth or other results.

© Copyright IBM Corporation 2013. All rights reserved.

•U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

IBM, the IBM logo, ibm.com, and IBM SPSS Modeler, IBM PureData for Analytics are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml

Other company, product, or service names may be trademarks or service marks of others.

Communities

• On-line communities, User Groups, Technical Forums, Blogs, Social networks, and more

o Find the community that interests you …

• Information Management bit.ly/InfoMgmtCommunity

• Business Analytics bit.ly/AnalyticsCommunity

• Enterprise Content Management bit.ly/ECMCommunity

• IBM Champions

o Recognizing individuals who have made the most outstanding contributions to Information Management, Business Analytics, and Enterprise Content Management communities

• ibm.com/champion

Thank You

Your feedback is important!

• Access the Conference Agenda Builder to complete your session surveys

o Any web or mobile browser at http://iod13surveys.com/surveys.html

o Any Agenda Builder kiosk onsite