Post on 17-Apr-2018

© 2014 IBM Corporation

Data Mining using SPSS Modeler 2nd Session

Claire Lin, IBM Taiwan

Agenda

Data Mining Process

Business Understanding

Data Understanding: Live Demo and Exercise

Data Preparation and Manipulation: Live Demo and Exercise

What is Data Mining?

Data mining is the analysis step of the Knowledge Discovery in Databases (KDD) process. It encompasses a number of techniques for extracting useful information from (large) data files, without necessarily having preconceived notions about what will be discovered.

The goal of data mining is to extract information from a data set and transform it into an understandable structure for further use.

Data mining process

Cross-Industry Standard Process for Data Mining (CRISP-DM)

What can SPSS Modeler do?

Input raw data

Data understanding: check missing data; check anomalous and outlier data

Data preparation: Filter, Derive, and Reclassify nodes

Modeling

Output

Business Understanding

Determining business objectives

Example: finding what people buy together with zongzi (sticky rice dumplings) during the Dragon Boat Festival

Example: predicting which customers are likely not to renew a mobile phone service contract

Assessing the situation

Determining data mining goals

Producing a project plan

Data Understanding

Need to understand:

What your data resources are

What the characteristics of those resources are

Includes:

Collecting initial data

Describing data

Exploring data

Verifying data quality

Missing Data

Anomalous Data

Data Understanding - Missing Data

Blank: contains no information. This is white space if the field is a string, and a null value (non-numeric) if the field is numeric.

Empty string: a string field may be empty, which means that it contains nothing (this is common in databases).

Value blanks: represent missing or invalid information.
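
The kinds of missing data listed above can be told apart with a few checks. The following is a minimal sketch in plain Python, not SPSS Modeler functionality. The function name `classify_missing` and the codes in `BLANK_CODES` (for example, -99 standing for "not answered") are assumptions made for illustration.

```python
import math

# Hypothetical value blanks: codes that the data owner uses to mean "missing".
BLANK_CODES = {-99, "N/A"}

def classify_missing(value):
    """Return the kind of missing value, or None if the value is usable."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return "null"
    if isinstance(value, str):
        if value == "":
            return "empty string"
        if value.strip() == "":
            return "white space"
    if value in BLANK_CODES:
        return "value blank"
    return None

values = [42, None, "", "   ", -99, "N/A", "Taipei", float("nan")]
kinds = [classify_missing(v) for v in values]
```

Here `kinds` holds one label per input value, so the missing values can be counted by kind before deciding how to treat them.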

Data Understanding - Anomalous Data

What is anomalous data?

Values far from the center of the distribution: the center is measured by the mean or median, with the standard deviation as the measure of spread

Values far from other values, whether or not they are close to the center of the distribution
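
The first notion above, distance from the center measured in standard deviations, can be sketched as follows. The data and the cutoff of 2 standard deviations are made up for illustration. In a small sample, a single extreme value inflates the standard deviation, so a modest cutoff is used here.

```python
import statistics

def z_scores(values):
    """Distance of each value from the mean, in standard-deviation units."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return [(v - mean) / sd for v in values]

# Made-up customer ages with one suspicious entry.
ages = [23, 25, 27, 24, 26, 25, 24, 26, 25, 90]
flagged = [v for v, z in zip(ages, z_scores(ages)) if abs(z) > 2]
```

Replacing the mean with `statistics.median` gives a center that is less sensitive to the anomalous value itself.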

Data Understanding

Anomaly detection

SPSS Modeler User Interface

Data Sources

Database: ODBC source

Var. File: free-field text file

Fixed File: fixed-field text file

Statistics File/SAS File/Excel File
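
To make the difference between free-field and fixed-field text concrete, here is a small sketch in plain Python (it does not use SPSS Modeler). The sample data and the column layout in `LAYOUT` are hypothetical.

```python
import csv
import io

# Free-field (delimited) text: fields are separated by a delimiter.
free_field = io.StringIO("id,name,age\n1,Ann,34\n2,Bob,41\n")
free_rows = list(csv.DictReader(free_field))

# Fixed-field text: each field occupies fixed column positions.
# Hypothetical layout: id = columns 0-2, name = columns 3-8, age = columns 9-11.
LAYOUT = {"id": (0, 3), "name": (3, 9), "age": (9, 12)}
fixed_field = "001Ann    34\n002Bob    41\n"
fixed_rows = [
    {field: line[start:end].strip() for field, (start, end) in LAYOUT.items()}
    for line in fixed_field.splitlines()
]
```

Both readers return every value as a string, so field types still have to be set afterwards.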

Data Understanding

The Data Audit node

Provides a report on:

Missing values

Outlier and extreme data

Information on a field’s distribution
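
The kind of per-field report described above can be approximated in a few lines. This is a sketch of the idea only, not the Data Audit node itself. The function name `audit_field`, the sample data, and the cutoffs of 3 and 5 standard deviations for outliers and extremes are assumptions.

```python
import statistics

def audit_field(values, outlier_sd=3.0, extreme_sd=5.0):
    """Summarise one numeric field; None stands for a missing value."""
    present = [v for v in values if v is not None]
    mean = statistics.mean(present)
    sd = statistics.stdev(present)  # sample standard deviation
    outliers, extremes = [], []
    for v in present:
        distance = abs(v - mean) / sd  # distance from the mean in SD units
        if distance > extreme_sd:
            extremes.append(v)
        elif distance > outlier_sd:
            outliers.append(v)
    return {
        "valid": len(present),
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": mean,
        "std_dev": sd,
        "outliers": outliers,
        "extremes": extremes,
    }

# Made-up field: 20 typical values, one unusually large value, two missing.
monthly_calls = [8, 9, 10, 11, 12] * 4 + [25, None, None]
report = audit_field(monthly_calls)
```

In this example the value 25 lies about 4 standard deviations from the mean, so it is reported as an outlier but not as an extreme.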

Data Understanding

Anomaly detection models identify outliers or unusual cases by using clustering analysis

Each record is assigned an anomaly index: the ratio of its group deviation index to the average of that index over the cluster the case belongs to

Cases with an index value greater than 2 could be good anomaly candidates
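
The ratio described above is simple to compute once each record has a cluster and a group deviation index. In the sketch below both are given as made-up inputs. How the clusters and deviation values are obtained is outside the scope of this example.

```python
from collections import defaultdict
from statistics import mean

# (cluster label, group deviation index) per record; all values are made up.
records = [("A", 1.0), ("A", 1.2), ("A", 0.8), ("A", 9.0),
           ("B", 2.0), ("B", 2.2), ("B", 1.8)]

# Average group deviation index within each cluster.
by_cluster = defaultdict(list)
for cluster, gdi in records:
    by_cluster[cluster].append(gdi)
cluster_avg = {cluster: mean(gdis) for cluster, gdis in by_cluster.items()}

# Anomaly index = record's deviation / average deviation of its cluster.
anomaly_index = [gdi / cluster_avg[cluster] for cluster, gdi in records]
candidates = [i for i, a in enumerate(anomaly_index) if a > 2]
```

Only the fourth record (position 3) deviates much more than its cluster peers, so it is the only anomaly candidate.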

Data Understanding - Outlier Data Live Demo

Live Demo

SPSS Modeler UI

Read data into SPSS Modeler

Check missing data

Check anomalous and outlier data

Data Audit Node

Anomaly Node

Live Demo & Exercise I

Data Preparation and Manipulation

Objective: Construct the final dataset for modeling

Record Operations:

Select partial data from the dataset

Sort the data

Field Operations
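
The two record operations named above, selecting a subset and sorting, look like this in plain Python. The customer records and the selection condition are made up for illustration.

```python
customers = [
    {"id": 1, "age": 34, "region": "North"},
    {"id": 2, "age": 19, "region": "South"},
    {"id": 3, "age": 52, "region": "North"},
]

# Select: keep only the records that satisfy a condition.
selected = [r for r in customers if r["region"] == "North"]

# Sort: order the selected records by age, oldest first.
sorted_rows = sorted(selected, key=lambda r: r["age"], reverse=True)
```

Record operations change which rows are present or their order. They leave the fields themselves untouched.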

Type: Specifies field metadata and properties

Measurement type: Description

Continuous: Used to describe numeric values, such as a range of 0–100 or 0.75–1.25. A continuous value can be an integer, real number, or date/time.

Categorical: Used for string values.

Nominal: Used to describe data with multiple distinct values, each treated as a member of a set.

Ordinal: Used to describe data with multiple distinct values that have an inherent order.

Flag: Used for data with two distinct values that indicate the presence or absence of a trait, such as true and false, Yes and No, or 0 and 1.
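
As a rough illustration of how the measurement types above relate to the values in a field, the sketch below guesses a type from sample values. The function `measurement_level` and its rules are a hypothetical heuristic, not the logic SPSS Modeler uses. Order cannot be inferred from the values alone, so the caller has to state it.

```python
def measurement_level(values, ordered=False):
    """Guess a measurement type from the values of one field (heuristic)."""
    distinct = set(values)
    if len(distinct) == 2:
        return "Flag"  # exactly two distinct values
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in distinct
    )
    if numeric:
        return "Continuous"
    # Multiple distinct non-numeric values: ordinal only if the caller says so.
    return "Ordinal" if ordered else "Nominal"

levels = [
    measurement_level([0.75, 1.0, 1.25]),
    measurement_level(["Yes", "No", "Yes"]),
    measurement_level(["red", "green", "blue"]),
    measurement_level(["low", "medium", "high"], ordered=True),
]
```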

Filter: Filters, renames fields
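
The effect of filtering and renaming fields can be sketched on a single record. The field names in `DROP` and `RENAME` are hypothetical.

```python
# Hypothetical settings: fields to remove and fields to rename.
DROP = {"internal_code"}
RENAME = {"cust_nm": "customer_name"}

def filter_fields(record):
    """Drop unwanted fields and rename the rest where a new name is given."""
    return {RENAME.get(name, name): value
            for name, value in record.items() if name not in DROP}

row = {"cust_nm": "Ann", "age": 34, "internal_code": "X9"}
filtered = filter_fields(row)
```

The values are passed through unchanged. Only the set of fields and their names differ.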

Derive: Modifies data values or creates new fields
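
Deriving new fields from existing ones might look like the sketch below. The field names, the formula, and the threshold of 80 are made up for illustration.

```python
def derive_fields(record, high_value_threshold=80.0):
    """Return a copy of the record with two derived fields added."""
    out = dict(record)  # leave the input record unchanged
    out["avg_monthly_spend"] = record["total_spend"] / record["months_active"]
    out["high_value"] = out["avg_monthly_spend"] >= high_value_threshold
    return out

customer = {"id": 7, "total_spend": 1200.0, "months_active": 12}
derived = derive_fields(customer)
```

The first derived field is a formula over existing fields. The second is a flag computed from a condition.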

Reclassify
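
Reclassifying means mapping one set of categorical values onto another, typically to collapse many categories into a few. A minimal sketch, with a made-up mapping from cities to regions and an assumed default for unmapped values:

```python
# Hypothetical mapping from original categories to new categories.
RECLASSIFY = {
    "Taipei": "North",
    "Keelung": "North",
    "Kaohsiung": "South",
    "Tainan": "South",
}

def reclassify(value, default="Other"):
    """Return the new category, or the default for values not in the mapping."""
    return RECLASSIFY.get(value, default)

cities = ["Taipei", "Tainan", "Hualien"]
regions = [reclassify(city) for city in cities]
```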

Live Demo & Exercise II

Thank You

English: Thank You

French: Merci

German: Danke

Italian: Grazie

Spanish: Gracias

Brazilian Portuguese: Obrigado

Gaelic: Go raibh maith agat

Danish: Tak

Breton: Trugarez

Dutch: Dank u

Czech: Děkujeme Vám

Esperanto: Dankon

Swedish: Tack så mycket