Data Mining using SPSS Modeler, 2nd Session

Claire Lin, IBM Taiwan

© 2014 IBM Corporation



Agenda

Data Mining Process

Business Understanding

Data Understanding (live demo and exercise)

Data Preparation and Manipulation (live demo and exercise)


What is Data Mining?

Data mining is the analysis step of the Knowledge Discovery in Databases (KDD) process. It encompasses a number of techniques for extracting useful information from (large) data files, without necessarily having preconceived notions about what will be discovered.

The goal of data mining is to extract information from a data set and transform it into an understandable structure for further use.


Data mining process

Cross-Industry Standard Process for Data Mining (CRISP-DM)


Data mining process: what can SPSS Modeler do?

Input raw data

Data understanding

  Check missing data

  Check anomalous and outlier data

Data preparation

  Filter, Derive, and Reclassify nodes

Modeling

Output


Business Understanding

Determining business objectives, for example:

  Finding what people will buy together with zongzi (sticky rice dumplings) during the Dragon Boat Festival

  Predicting who is likely not to renew a contract for mobile phone service

Assessing the situation

Determining data mining goals

Producing a project plan


Data Understanding

Need to understand:

  What your data resources are

  What the characteristics of those resources are

Includes:

  Collecting initial data

  Describing data

  Exploring data

  Verifying data quality

    Missing data

    Anomalous data


Data Understanding - Missing Data

Blank: Contains no information. A blank appears as white space if the field is a string, and as a null (non-numeric) value if the field is numeric.

Empty string: A string field may be empty, which means that it contains nothing. (This is common in databases.)

Value blanks: Values that represent missing or invalid information.
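The kinds of missing data listed above can be told apart with a few simple checks. The following is a minimal Python sketch, not SPSS Modeler's own logic; the function name `classify_missing`, the labels it returns, and the sentinel value -99 are assumptions made for illustration.

```python
def classify_missing(value, blank_values=()):
    """Return a label for the kind of missing value, or None if the value is present.

    blank_values is a hypothetical, user-supplied collection of codes
    (for example -99) that stand for missing or invalid information.
    """
    if value is None:
        return "null"
    if isinstance(value, str):
        if value == "":
            return "empty string"
        if value.strip() == "":
            return "white space"
    if value in blank_values:
        return "value blank"
    return None


# Illustrative data only: one field containing several kinds of missing values.
records = [42, None, "", "   ", -99, "Taipei"]
labels = [classify_missing(v, blank_values=(-99,)) for v in records]
print(labels)
```

Keeping the categories separate matters because each one usually needs different treatment, for example replacing a sentinel code versus dropping a record with a null.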


Data Understanding - Anomalous Data

What is anomalous data?

Far from the center of the distribution: the center is measured by the mean or median, with the standard deviation as the measure of spread.

Far from other values: whether or not the value is close to the center of the distribution.
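The two notions of anomalous data above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names, the threshold of 3 standard deviations, and the `min_gap` parameter are assumptions, not values taken from the slides.

```python
import statistics


def far_from_center(values, threshold=3.0):
    """Values more than `threshold` standard deviations away from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        return []
    return [v for v in values if abs(v - mean) / sd > threshold]


def far_from_other_values(values, min_gap):
    """Values whose nearest neighbour is more than `min_gap` away."""
    ordered = sorted(values)
    isolated = []
    for i, v in enumerate(ordered):
        left = v - ordered[i - 1] if i > 0 else float("inf")
        right = ordered[i + 1] - v if i < len(ordered) - 1 else float("inf")
        if min(left, right) > min_gap:
            isolated.append(v)
    return isolated


# One value far from the center of an otherwise compact distribution.
skewed = list(range(20, 40)) + [200]
print(far_from_center(skewed))

# A value that sits exactly at the center but is far from all other values.
two_groups = [10, 11, 12, 50, 88, 89, 90]
print(far_from_center(two_groups))
print(far_from_other_values(two_groups, min_gap=20))
```

The second data set shows why both checks are useful: the value 50 equals the mean, so a distance-from-center test never flags it, yet it is isolated from every other value.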


Data Understanding

Anomaly detection


SPSS Modeler User Interface


Data Sources

Database: ODBC source

Var. File: free-field text file

Fixed File: fixed-field text file

Statistics File/SAS File/Excel File


Data Understanding

The Data Audit node provides a report on:

  Missing values

  Outliers and extreme values

  Each field's distribution
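To make the contents of such a report concrete, here is a rough Python sketch of a per-field audit. It is not the Data Audit node's implementation; the function name `audit_field`, the report keys, and the cut-offs of 3 and 5 standard deviations for outliers and extremes are assumptions chosen for illustration.

```python
import statistics


def audit_field(values, outlier_sd=3.0, extreme_sd=5.0):
    """Summarize one numeric field: missing count, spread, outliers, extremes."""
    present = [v for v in values if v is not None]
    mean = statistics.mean(present)
    sd = statistics.stdev(present)
    outliers = 0
    extremes = 0
    for v in present:
        z = abs(v - mean) / sd if sd > 0 else 0.0
        if z > extreme_sd:
            extremes += 1      # beyond the (assumed) extreme cut-off
        elif z > outlier_sd:
            outliers += 1      # beyond the outlier cut-off but not extreme
    return {
        "valid": len(present),
        "missing": len(values) - len(present),
        "mean": mean,
        "std_dev": sd,
        "min": min(present),
        "max": max(present),
        "outliers": outliers,
        "extremes": extremes,
    }


# Illustrative field: 100 ordinary values, one very large value, one missing value.
income = list(range(100)) + [1000, None]
report = audit_field(income)
print(report)
```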


Data Understanding

Anomaly detection models identify outliers, or unusual cases, by using cluster analysis.

Each record is assigned an anomaly index.

The anomaly index is the ratio of the record's group deviation index to the average group deviation index of the cluster that the record belongs to.

Cases with an index value greater than 2 could be good anomaly candidates.
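The ratio described above can be sketched as follows. This is a simplified illustration and not the Anomaly node's algorithm: as a stand-in for the group deviation index it uses the Euclidean distance from each record to a given cluster centroid, and the cluster assignments and centroids are supplied by hand rather than produced by a clustering step. All names are hypothetical.

```python
import math
from collections import defaultdict


def anomaly_indices(points, cluster_ids, centroids):
    """Deviation of each record divided by the average deviation in its cluster."""
    # Stand-in deviation measure (assumption): distance to the cluster centroid.
    deviations = [math.dist(p, centroids[c]) for p, c in zip(points, cluster_ids)]

    by_cluster = defaultdict(list)
    for d, c in zip(deviations, cluster_ids):
        by_cluster[c].append(d)
    averages = {c: sum(ds) / len(ds) for c, ds in by_cluster.items()}

    return [
        d / averages[c] if averages[c] > 0 else 0.0
        for d, c in zip(deviations, cluster_ids)
    ]


# Illustrative data: one cluster centred on the origin with one distant record.
points = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0), (6.0, 0.0)]
cluster_ids = [0, 0, 0, 0, 0]
centroids = {0: (0.0, 0.0)}

indices = anomaly_indices(points, cluster_ids, centroids)
candidates = [p for p, idx in zip(points, indices) if idx > 2]
print(indices)
print(candidates)
```

Dividing by the cluster average makes the index relative: a record is flagged because it deviates much more than its cluster peers do, not because its raw deviation is large.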


Data Understanding: Outlier Data Live Demo

Live demo:

  SPSS Modeler UI

  Read data into SPSS Modeler

  Check missing data

  Check anomalous and outlier data

    Data Audit node

    Anomaly node


Live Demo & Exercise I


Data Preparation and Manipulation

Objective: Construct the final dataset for modeling

Record Operations

  Select partial data from the dataset

  Sort the data

Field Operations
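The two record operations listed above, selecting part of the data and sorting it, can be sketched outside Modeler in plain Python. The field names and values below are made up for illustration.

```python
# Illustrative records; field names and values are hypothetical.
customers = [
    {"id": 1, "region": "North", "spend": 120.0},
    {"id": 2, "region": "South", "spend": 300.0},
    {"id": 3, "region": "North", "spend": 450.0},
    {"id": 4, "region": "North", "spend": 80.0},
]

# Select: keep only the records that satisfy a condition.
selected = [r for r in customers if r["region"] == "North"]

# Sort: order the selected records by a field, largest spend first.
ordered = sorted(selected, key=lambda r: r["spend"], reverse=True)

print([r["id"] for r in ordered])
```

Both operations act on whole records (rows) and leave the set of fields unchanged, which is what distinguishes them from field operations.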


Type: Specifies field metadata and properties

Type          Description

Continuous    Used to describe numeric values, such as a range of 0–100 or 0.75–1.25. A continuous value can be an integer, real number, or date/time.

Categorical   String values.

Nominal       Used to describe data with multiple distinct values, each treated as a member of a set.

Ordinal       Used to describe data with multiple distinct values that have an inherent order.

Flag          Used for data with two distinct values that indicate the presence or absence of a trait, such as true and false, Yes and No, or 0 and 1.
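As a rough illustration of how these types relate to the values in a field, the sketch below guesses a type from the data. It is a hypothetical heuristic, not how SPSS Modeler assigns types: the function name and rules are assumptions, and it never returns Ordinal because an inherent order cannot be read from the values alone and has to be declared by the analyst.

```python
def guess_measurement(values):
    """Guess a measurement type from a field's non-missing values."""
    distinct = {v for v in values if v is not None}
    if len(distinct) == 2:
        return "Flag"          # two distinct values: presence or absence of a trait
    is_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in distinct
    )
    if is_numeric:
        return "Continuous"    # numeric values on a range
    return "Nominal"           # distinct values treated as members of a set


print(guess_measurement([0.75, 1.0, 1.25]))
print(guess_measurement(["Yes", "No", "Yes", None]))
print(guess_measurement(["red", "green", "blue"]))
```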


Filter: Filters and renames fields
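The effect of filtering and renaming fields can be sketched in Python as follows. The helper `filter_fields` and the field names are hypothetical and only illustrate the idea.

```python
def filter_fields(record, drop=(), rename=None):
    """Return a copy of the record without the dropped fields, with others renamed."""
    rename = rename or {}
    return {rename.get(name, name): value
            for name, value in record.items()
            if name not in drop}


record = {"cust_id": 7, "nm": "Lin", "tmp_flag": 1}
cleaned = filter_fields(record, drop={"tmp_flag"}, rename={"nm": "name"})
print(cleaned)
```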


Derive: Modifies data values or creates new fields
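Creating a new field from existing ones can be sketched as below. The helper `derive`, the field names, and the formula are assumptions for illustration, not Modeler syntax.

```python
def derive(records, new_field, formula):
    """Return copies of the records with one extra field computed by `formula`."""
    return [{**r, new_field: formula(r)} for r in records]


# Illustrative records; field names and values are hypothetical.
visits = [
    {"id": 1, "spend": 120.0, "visits": 4},
    {"id": 2, "spend": 45.0, "visits": 3},
]

derived = derive(visits, "spend_per_visit", lambda r: r["spend"] / r["visits"])
print(derived)
```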


Reclassify
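Reclassifying means recoding one set of category values into another, often to merge several categories into fewer groups. A minimal Python sketch, with a hypothetical mapping and field values:

```python
def reclassify(value, mapping):
    """Map a category to its new group; values not in the mapping are kept as-is."""
    return mapping.get(value, value)


# Hypothetical mapping that merges detailed education levels into two groups.
education_groups = {
    "Primary": "Basic",
    "Secondary": "Basic",
    "Bachelor": "Higher",
    "Master": "Higher",
}

levels = ["Primary", "Master", "Bachelor", "Unknown"]
regrouped = [reclassify(v, education_groups) for v in levels]
print(regrouped)
```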


Live Demo & Exercise II


Thank You

English: Thank You

French: Merci

German: Danke

Italian: Grazie

Spanish: Gracias

Brazilian Portuguese: Obrigado

Gaelic: go raibh maith agat

Danish: Tak

Breton: Trugarez

Dutch: Dank u

Czech: Děkujeme Vám

Esperanto: Dankon

Swedish: Tack så mycket