© 2014 IBM Corporation

Data Mining using SPSS Modeler 2nd Session

IBM Taiwan, Claire Lin

Agenda

Data Mining Process

Business Understanding

Data Understanding: Live Demo and Exercise

Data Preparation and Manipulation: Live Demo and Exercise


What is Data Mining?

Data mining is the analysis step of the Knowledge Discovery in Databases (KDD) process. It encompasses a number of techniques for extracting useful information from (large) data files, without necessarily having preconceived notions about what will be discovered.

The goal of data mining is to extract information from a data set and transform it into an understandable structure for further use.


Data mining process

Cross Industry Standard Process for Data Mining (CRISP-DM)

What can SPSS Modeler do?

Input raw data

Data understanding: check missing data; check anomalous and outlier data

Data preparation: Filter, Derive, Reclassify nodes

Modeling

Output


Business Understanding

Determining business objectives, for example:

Finding what people will buy together with zongzi (sticky rice dumplings) during the Dragon Boat Festival

Predicting who is likely not to renew a contract for mobile phone service

Assessing the situation

Determining data mining goals

Producing a project plan


Data Understanding

Need to understand:

What your data resources are

What the characteristics of those resources are

Includes:

Collecting initial data

Describing data

Exploring data

Verifying data quality

Missing Data

Anomalous Data


Data Understanding - Missing Data

Blank: Contains no information. This is white space if the field is a string, and a null value (non-numeric) if the field is numeric.

Empty string: A string field may be empty, which means that it contains nothing (this is common in databases).

Value blanks: Represent missing or invalid information.
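The kinds of missing data listed above can be told apart with a few simple checks. The following is a minimal Python sketch, not SPSS Modeler functionality. The function name `classify_missing`, the set `USER_DEFINED_BLANKS`, and the sample values are all hypothetical, and it assumes a null is represented by `None`.

```python
from collections import Counter

# Hypothetical codes that the data owner has declared to mean "missing".
USER_DEFINED_BLANKS = {-99, "N/A"}

def classify_missing(value):
    """Return the kind of missing value, or None if the value is valid."""
    if value is None:
        return "null"
    if isinstance(value, str):
        if value == "":
            return "empty string"
        if value.strip() == "":
            return "white space"
    if value in USER_DEFINED_BLANKS:
        return "value blank"
    return None

# Hypothetical sample fields: one numeric, one string.
ages = [34, None, -99, 51]
cities = ["Taipei", "", "   ", "N/A"]

age_report = Counter(classify_missing(v) for v in ages)
city_report = Counter(classify_missing(v) for v in cities)
print(dict(age_report))
print(dict(city_report))
```

Counting each kind separately matters because they usually need different handling: a null may simply be unknown, while a value blank such as -99 would distort a mean if it were left in the data.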


Data Understanding - Anomalous Data

What is anomalous data?

Data that is far from the center of the distribution, where the center is measured by the mean or median and the standard deviation is used as a measure of spread

Data that is far from other values, whether or not it is close to the center of the distribution
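The two notions of anomalous data above do not always agree. The sketch below illustrates the difference in plain Python. The function names, the thresholds (3 standard deviations, a minimum gap of 20), and the sample values are assumptions chosen for illustration.

```python
import statistics

def far_from_center(values, threshold=3.0):
    """Values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > threshold * sd]

def far_from_others(values, min_gap):
    """Values whose nearest neighbour is at least `min_gap` away."""
    isolated = []
    for i, v in enumerate(values):
        nearest = min(abs(v - w) for j, w in enumerate(values) if j != i)
        if nearest >= min_gap:
            isolated.append(v)
    return isolated

# Two groups of values plus one value sitting alone between them.
# The lone value 50 is exactly at the mean, yet far from every other value.
values = [10, 11, 12, 13, 50, 87, 88, 89, 90]
print(far_from_center(values))              # []
print(far_from_others(values, min_gap=20))  # [50]

# A single large value in an otherwise tight field is far from the center.
incomes = [30, 32, 31, 29, 33, 30, 31, 32, 29, 30, 31, 200]
print(far_from_center(incomes))             # [200]
```

A distance-from-the-mean rule alone would miss the value 50 in the first example, which is why checking distance to other values is listed separately.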


Data Understanding

Anomaly detection


SPSS Modeler User Interface


Data Sources

Database: ODBC source

Var. File: free-field text file

Fixed File: fixed-field text file

Statistics File/SAS File/Excel File
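The difference between a free-field and a fixed-field text file can be shown outside SPSS Modeler with a short Python sketch. The sample data and the column layout in `LAYOUT` are hypothetical. In a free-field file a delimiter separates the fields, while in a fixed-field file each field occupies a fixed range of character positions.

```python
import csv
import io

# Free-field (delimited) text: fields are separated by a delimiter.
free_field_text = "id,name,age\n1,Ann,34\n2,Bob,51\n"
free_rows = list(csv.DictReader(io.StringIO(free_field_text)))

# Fixed-field text: each field occupies a fixed range of columns.
# Hypothetical layout: id in positions 0-2, name in 3-8, age in 9-11.
LAYOUT = {"id": (0, 3), "name": (3, 9), "age": (9, 12)}
fixed_field_text = "  1Ann    34\n  2Bob    51\n"

fixed_rows = [
    {field: line[start:end].strip() for field, (start, end) in LAYOUT.items()}
    for line in fixed_field_text.splitlines()
]

# Both sources describe the same records once parsed.
print(free_rows == fixed_rows)  # True
```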


Data Understanding

The Data Audit node

Provides a report on:

Missing values

Outlier and extreme data

Information on a field’s distribution
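To make the contents of such a report concrete, here is a rough Python sketch of a per-field audit. It is not the Data Audit node itself. The function `audit_field`, the use of `None` for missing values, and the cut-offs of 3 and 5 standard deviations for outliers and extremes are assumptions made for this example.

```python
import statistics

def audit_field(values, outlier_sd=3.0, extreme_sd=5.0):
    """Summarise one numeric field; None marks a missing value."""
    valid = [v for v in values if v is not None]
    mean = statistics.mean(valid)
    sd = statistics.stdev(valid)
    # Distance of each valid value from the mean, in standard deviations.
    z = [abs(v - mean) / sd for v in valid] if sd > 0 else [0.0] * len(valid)
    return {
        "records": len(values),
        "missing": len(values) - len(valid),
        "min": min(valid),
        "max": max(valid),
        "mean": mean,
        "std_dev": sd,
        "outliers": sum(outlier_sd < s <= extreme_sd for s in z),
        "extremes": sum(s > extreme_sd for s in z),
    }

# Hypothetical field: 40 typical values, one very large value, two missing.
field = [48, 49, 50, 51, 52] * 8 + [100, None, None]
report = audit_field(field)
for name, value in report.items():
    print(f"{name}: {value}")
```

Here the value 100 lies more than 5 standard deviations from the mean, so it is counted as an extreme rather than an outlier.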


Data Understanding

Anomaly detection models identify outliers or unusual cases by using clustering analysis

Each record is assigned an anomaly index.

The index is the ratio of a case's group deviation index to the average of that index over the cluster the case belongs to.

Cases with an index value greater than 2 could be good anomaly candidates.
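The ratio described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic. It takes the per-record deviation scores and cluster labels as given (both hypothetical here), does not show how SPSS Modeler computes the group deviation index, and assumes every cluster has a positive average deviation.

```python
from collections import defaultdict

def anomaly_indices(deviations, clusters):
    """Divide each record's deviation by the mean deviation of its cluster."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for dev, cluster in zip(deviations, clusters):
        totals[cluster] += dev
        counts[cluster] += 1
    averages = {c: totals[c] / counts[c] for c in totals}
    return [dev / averages[c] for dev, c in zip(deviations, clusters)]

# Hypothetical per-record deviation scores and cluster labels.
deviations = [1.0, 1.2, 0.8, 5.0, 2.0, 2.2, 1.8]
clusters = ["A", "A", "A", "A", "B", "B", "B"]

indices = anomaly_indices(deviations, clusters)

# Records whose index exceeds 2 are flagged as anomaly candidates.
candidates = [i for i, idx in enumerate(indices) if idx > 2]
print(candidates)  # [3]
```

Because the index is relative to the cluster average, a deviation of 2.2 in cluster B is unremarkable, while a deviation of 5.0 in cluster A is 2.5 times the cluster average and is flagged.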


Data Understanding – Outlier Data Live Demo

Live Demo:

SPSS Modeler UI

Read data into SPSS Modeler

Check missing data

Check anomalous and outlier data

Data Audit Node

Anomaly Node


Live Demo & Exercise I


Data Preparation and Manipulation

Objective: Construct the final dataset for modeling

Record Operations: Select a subset of records from the dataset

Sort the data

Field Operations
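The record operations listed above, selecting a subset of records and sorting them, act on rows rather than on fields. A minimal Python sketch, using hypothetical customer records and a hypothetical selection condition, might look like this:

```python
# Hypothetical dataset: one dictionary per record.
records = [
    {"customer": "C001", "age": 34, "plan": "basic"},
    {"customer": "C002", "age": 27, "plan": "premium"},
    {"customer": "C003", "age": 51, "plan": "basic"},
    {"customer": "C004", "age": 45, "plan": "premium"},
]

# Select: keep only the records that satisfy a condition.
selected = [r for r in records if r["age"] >= 30]

# Sort: order the selected records by a field, oldest first.
ordered = sorted(selected, key=lambda r: r["age"], reverse=True)

print([r["customer"] for r in ordered])  # ['C003', 'C004', 'C001']
```

Neither step changes which fields a record has; only the set of records and their order change.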


Type: Specifies field metadata and properties

Type and description

Continuous: Used to describe numeric values, such as a range of 0–100 or 0.75–1.25. A continuous value can be an integer, real number, or date/time.

Categorical: Used for string values.

Nominal: Used to describe data with multiple distinct values, each treated as a member of a set.

Ordinal: Used to describe data with multiple distinct values that have an inherent order.

Flag: Used for data with two distinct values that indicate the presence or absence of a trait, such as true and false, Yes and No, or 0 and 1.
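As a rough illustration of how these types differ, the Python sketch below guesses a type from a field's values. It is a simplified heuristic, not the rule SPSS Modeler applies. The function `infer_measurement` and its `ordered_levels` parameter are hypothetical, and it treats any field with exactly two distinct values as a flag.

```python
def infer_measurement(values, ordered_levels=None):
    """Guess a measurement type from the distinct values of a field."""
    distinct = set(values)
    if len(distinct) == 2:
        return "flag"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool)
           for v in distinct):
        return "continuous"
    # Order cannot be detected from the data, so the caller must supply it.
    if ordered_levels is not None and distinct <= set(ordered_levels):
        return "ordinal"
    return "nominal"

print(infer_measurement(["Yes", "No", "Yes"]))            # flag
print(infer_measurement([34, 51.5, 28]))                  # continuous
print(infer_measurement(["low", "high", "medium"],
                        ordered_levels=["low", "medium", "high"]))  # ordinal
print(infer_measurement(["Taipei", "Tainan", "Hsinchu"]))  # nominal
```

The ordinal case shows why type information is metadata rather than something fully derivable from the values: the order of low, medium, and high has to be declared.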


Filter: Filters and renames fields


Derive: Modifies data values or creates new fields


Reclassify
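The three field operations above (Filter, Derive, Reclassify) can be pictured as small functions applied to every record. The Python sketch below is an analogy only, not SPSS Modeler syntax. The field names, the age cut-off of 40, and the region mapping in `REGION_MAP` are hypothetical.

```python
# Hypothetical dataset: one dictionary per record.
records = [
    {"cust_id": "C001", "age": 34, "region": "TPE", "notes": "n/a"},
    {"cust_id": "C002", "age": 58, "region": "KHH", "notes": ""},
    {"cust_id": "C003", "age": 41, "region": "HUN", "notes": "vip"},
]

def filter_fields(record, drop=(), rename=None):
    """Filter: drop unwanted fields and rename others."""
    rename = rename or {}
    return {rename.get(k, k): v for k, v in record.items() if k not in drop}

def derive_age_band(record):
    """Derive: create a new field from an existing one."""
    out = dict(record)
    out["age_band"] = "under 40" if record["age"] < 40 else "40 and over"
    return out

# Reclassify: map existing categories onto a new, coarser set.
REGION_MAP = {"TPE": "North", "HSZ": "North", "KHH": "South"}

def reclassify_region(record):
    out = dict(record)
    out["region"] = REGION_MAP.get(record["region"], "Other")
    return out

prepared = [
    reclassify_region(
        derive_age_band(
            filter_fields(r, drop={"notes"}, rename={"cust_id": "customer"})
        )
    )
    for r in records
]
for row in prepared:
    print(row)
```

Each function returns a new record instead of changing its input, which mirrors how data flows from one step to the next without altering the source.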


Live Demo & Exercise II


Thank You (English), Merci (French), Danke (German), Grazie (Italian), Gracias (Spanish), Obrigado (Brazilian Portuguese), go raibh maith agat (Gaelic), Tak (Danish), Trugarez (Breton), Dank u (Dutch), Dekujeme Vam (Czech), Dankon (Esperanto), Tack så mycket (Swedish)
