© 2014 IBM Corporation
Data Mining using SPSS Modeler 2nd Session
IBM Taiwan, Claire Lin
Agenda
Data Mining Process
Business Understanding
Data Understanding: Live Demo and Exercise
Data Preparation and Manipulation: Live Demo and Exercise
What is Data Mining?
Data mining is the analysis step of the Knowledge Discovery in Databases (KDD) process. It encompasses a number of techniques for extracting useful information from (large) data files, without necessarily having preconceived notions about what will be discovered.
The goal of data mining is to extract information from a data set and transform it into an understandable structure for further use.
Data mining process
Cross-Industry Standard Process for Data Mining (CRISP-DM)
Data mining process
Cross-Industry Standard Process for Data Mining (CRISP-DM)
What can SPSS Modeler do?
  Input raw data
  Data understanding
    Check missing data
    Check anomalous and outlier data
  Data preparation
    Filter, Derive, Reclassify nodes
  Modeling
  Output
Business Understanding
Determining business objectives
  Finding what people buy together with zongzi (sticky rice dumplings) during the Dragon Boat Festival
  Predicting who is likely not to renew a contract for mobile phone service
Assessing the situation
Determining data mining goals
Producing a project plan
Data Understanding
Need to understand
  What your data resources are
  What the characteristics of those resources are
Includes
  Collecting initial data
  Describing data
  Exploring data
  Verifying data quality
    Missing data
    Anomalous data
Data Understanding - Missing Data
Blank: contains no information
  White space if the field is a string; a null (non-numeric) value if the field is numeric
Empty string: a string field that contains nothing (this is common in databases)
Value blanks: values that represent missing or invalid information
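The categories above can be told apart mechanically. The following is a minimal Python sketch, not SPSS Modeler code: the function name `classify_missing` and the user-declared blank codes (`-1`, `99`, `"unknown"`) are assumptions made up for the illustration.

```python
import math

# Hypothetical codes that a user has declared as "value blanks".
VALUE_BLANKS = {-1, 99, "unknown"}


def classify_missing(value, value_blanks=VALUE_BLANKS):
    """Return the kind of missing value, or "valid" if the value is usable."""
    # Null: no value at all (None, or NaN in a numeric field).
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return "null"
    if isinstance(value, str):
        if value == "":
            return "empty string"
        if value.strip() == "":
            return "white space"
    # Value blank: a real value that has been declared to mean "missing".
    if value in value_blanks:
        return "value blank"
    return "valid"


ages = [34, None, 99, 51, -1]
cities = ["Taipei", "", "   ", "unknown"]
print([classify_missing(v) for v in ages])
print([classify_missing(v) for v in cities])
```

Counting the non-"valid" results per field gives a rough picture of how much of each field is unusable before any modeling starts.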
Data Understanding - Anomalous Data
What is anomalous data?
  Values far from the center of the distribution
    The center is measured by the mean or median, with the standard deviation as the measure of spread
  Values far from other values
    Whether or not they are close to the center of the distribution
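Both notions of "anomalous" can be sketched in a few lines of Python. This is an illustration only, not how SPSS Modeler computes anything: the function names, the threshold `k`, and the `min_gap` parameter are assumptions chosen for the example.

```python
from statistics import mean, stdev


def far_from_center(values, k=2.0):
    """Values more than k sample standard deviations from the mean."""
    center = mean(values)
    spread = stdev(values)
    return [v for v in values if abs(v - center) > k * spread]


def far_from_others(values, min_gap):
    """Values whose nearest neighbour is at least min_gap away."""
    flagged = []
    for i, v in enumerate(values):
        nearest = min(abs(v - w) for j, w in enumerate(values) if j != i)
        if nearest >= min_gap:
            flagged.append(v)
    return flagged


readings = [10, 11, 12, 11, 10, 12, 11, 50]
print(far_from_center(readings))              # 50 lies far from the mean

bimodal = [1, 2, 2, 3, 50, 97, 98, 98, 99]
print(far_from_center(bimodal))               # nothing is far from the mean
print(far_from_others(bimodal, min_gap=10))   # yet 50 is isolated
```

In the second data set the value 50 sits exactly at the mean, so a distance-from-center rule misses it, while the nearest-neighbour rule picks it up.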
Data Understanding
Anomaly detection
SPSS Modeler User Interface
Data Sources
Database: ODBC source
Var. File: free-field text file
Fixed File: fixed-field text file
Statistics File/SAS File/Excel File
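The difference between a free-field and a fixed-field text file can be shown outside SPSS Modeler with a short Python sketch. The sample data and the column layout in `LAYOUT` are made up for the illustration.

```python
import csv
import io

# Free-field text: fields are separated by a delimiter (a comma here).
free_field = io.StringIO("id,name,age\n1,Ann,34\n2,Bob,51\n")
free_rows = list(csv.DictReader(free_field))

# Fixed-field text: each field occupies a fixed range of columns.
# Hypothetical layout: id in columns 0-2, name in 3-8, age in 9-11.
LAYOUT = {"id": (0, 3), "name": (3, 9), "age": (9, 12)}
fixed_lines = ["001Ann    34", "002Bob    51"]


def parse_fixed(line, layout=LAYOUT):
    """Cut one fixed-width line into named fields and trim the padding."""
    return {name: line[start:end].strip() for name, (start, end) in layout.items()}


fixed_rows = [parse_fixed(line) for line in fixed_lines]
print(free_rows[0])
print(fixed_rows[0])
```

A free-field reader only needs to know the delimiter, whereas a fixed-field reader needs the start and end column of every field.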
Data Understanding
The Data Audit node
Provides a report on
  Missing values
  Outliers and extreme values
  Each field's distribution
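The kind of per-field summary such a report contains can be approximated in Python. This is a rough sketch under stated assumptions, not the Data Audit node itself: `audit_field` is a hypothetical helper, and the cut-offs of 3 and 5 standard deviations for outliers and extremes are assumed values for the example.

```python
from statistics import mean, stdev


def audit_field(values, outlier_sd=3.0, extreme_sd=5.0):
    """Summarise one numeric field: missing count, distribution, outliers."""
    valid = [v for v in values if v is not None]
    center, spread = mean(valid), stdev(valid)
    # Distance of each valid value from the mean, in standard deviations.
    distances = [abs(v - center) / spread for v in valid]
    return {
        "missing": len(values) - len(valid),
        "min": min(valid),
        "max": max(valid),
        "mean": center,
        "std_dev": spread,
        "outliers": sum(outlier_sd < d <= extreme_sd for d in distances),
        "extremes": sum(d > extreme_sd for d in distances),
    }


income = [40] * 15 + [60] * 15 + [95, None, None]
report = audit_field(income)
print(report["missing"], report["outliers"], report["extremes"])
```

The sketch assumes the field has at least two distinct valid values; a constant field has zero spread and would need separate handling.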
Data Understanding
Anomaly detection models identify outliers, or unusual cases, by using cluster analysis
Each record is assigned an anomaly index
  The index is the ratio of the record's group deviation index to the average group deviation index of the cluster the record belongs to
  Cases with an index value greater than 2 could be good anomaly candidates
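The ratio described above can be worked through in Python. The sketch takes the cluster assignment and the group deviation index of each record as given inputs with made-up values; how SPSS Modeler derives those two quantities is not reproduced here, and `anomaly_indices` is a hypothetical helper name.

```python
from collections import defaultdict


def anomaly_indices(records):
    """records: list of (cluster_id, group_deviation_index) pairs.

    Returns one anomaly index per record: the record's group deviation
    index divided by the average group deviation index of its cluster.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for cluster, gdi in records:
        totals[cluster] += gdi
        counts[cluster] += 1
    return [gdi / (totals[cluster] / counts[cluster]) for cluster, gdi in records]


# Made-up example: cluster label and group deviation index per record.
records = [("A", 1.0), ("A", 1.2), ("A", 0.8), ("A", 9.0), ("B", 2.0), ("B", 2.0)]
indices = anomaly_indices(records)

# Records with an index greater than 2 are anomaly candidates.
candidates = [rec for rec, idx in zip(records, indices) if idx > 2]
print(candidates)
```

In cluster A the average group deviation index is 3.0, so the record with 9.0 gets an index of about 3 and is flagged, while every record in cluster B has an index of 1.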
Data Understanding - Outlier Data Live Demo
Live Demo
  SPSS Modeler UI
  Read data into SPSS Modeler
  Check missing data
  Check anomalous and outlier data
    Data Audit node
    Anomaly node
Live Demo & Exercise I
Data Preparation and Manipulation
Objective: Construct the final dataset for modeling
Record Operations
  Select partial data from the dataset
  Sort the data
Field Operations
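The two record operations named above, selecting and sorting, can be sketched in plain Python as an analogy. The records and field names are made up for the illustration.

```python
records = [
    {"id": 1, "region": "North", "sales": 120},
    {"id": 2, "region": "South", "sales": 340},
    {"id": 3, "region": "North", "sales": 560},
    {"id": 4, "region": "East", "sales": 90},
]

# Select: keep only the records that satisfy a condition.
north = [r for r in records if r["region"] == "North"]

# Sort: order the records by a field, largest sales first.
by_sales = sorted(records, key=lambda r: r["sales"], reverse=True)

print([r["id"] for r in north])
print([r["id"] for r in by_sales])
```

Record operations change which rows are present or their order; field operations, covered next, change the columns.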
Type: Specifies field metadata and properties
Measurement type: Description
Continuous: Used to describe numeric values, such as a range of 0–100 or 0.75–1.25. A continuous value can be an integer, real number, or date/time.
Categorical: String values
Nominal: Used to describe data with multiple distinct values, each treated as a member of a set.
Ordinal: Used to describe data with multiple distinct values that have an inherent order.
Flag: Used for data with two distinct values that indicate the presence or absence of a trait, such as true and false, Yes and No, or 0 and 1.
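The descriptions above suggest simple rules for guessing a field's type from its values. The sketch below is a rough illustration with a hypothetical function `guess_measurement`; it does not reproduce the rules SPSS Modeler applies, and it cannot detect ordinal fields, because an inherent order is knowledge about the data rather than something visible in the values.

```python
def guess_measurement(values):
    """Guess "flag", "continuous", or "nominal" from a field's values."""
    distinct = {v for v in values if v is not None}
    # Exactly two distinct values: presence or absence of a trait.
    if len(distinct) == 2:
        return "flag"
    # All numeric (booleans excluded): treat as a continuous range.
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in distinct
    )
    if distinct and numeric:
        return "continuous"
    # Anything else: a set of distinct labels.
    return "nominal"


print(guess_measurement(["Yes", "No", "Yes"]))
print(guess_measurement([0.75, 1.0, 1.25]))
print(guess_measurement(["red", "green", "blue"]))
```

A guess like this is only a starting point; the analyst still has to confirm or override the type of each field.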
Filter: Filters out (drops) and renames fields
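As an analogy outside SPSS Modeler, dropping and renaming fields can be sketched in Python. The helper `filter_fields` and the field names are hypothetical.

```python
def filter_fields(record, drop=(), rename=None):
    """Return a copy of record without the dropped fields, with renames applied."""
    rename = rename or {}
    return {
        rename.get(name, name): value
        for name, value in record.items()
        if name not in drop
    }


row = {"cust_id": 7, "nm": "Ann", "tmp_flag": 1}
print(filter_fields(row, drop={"tmp_flag"}, rename={"nm": "name"}))
```

The values themselves are untouched; only the set of fields and their names change.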
Derive: Modifies data values or creates new fields
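Creating a new field from existing ones can be sketched in Python as follows. The helper `derive`, the field names, and the body-mass-index formula are assumptions chosen for the illustration, not part of the original slides.

```python
def derive(record, name, formula):
    """Return a copy of record with a new field computed by formula(record)."""
    new_record = dict(record)
    new_record[name] = formula(record)
    return new_record


row = {"height_cm": 170, "weight_kg": 65}
with_bmi = derive(row, "bmi", lambda r: r["weight_kg"] / (r["height_cm"] / 100) ** 2)
print(round(with_bmi["bmi"], 2))
```

The original record is left unchanged, so the derived field can be recomputed with a different formula without losing the source values.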
Reclassify: Transforms one set of categorical values into another
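Reclassifying maps one set of categorical values onto another, typically to collapse many categories into a few. A minimal Python sketch, with a made-up city-to-region mapping and a hypothetical helper `reclassify`:

```python
# Hypothetical mapping from detailed categories to broader ones.
REGION_MAP = {
    "Taipei": "North",
    "Hsinchu": "North",
    "Tainan": "South",
    "Kaohsiung": "South",
}


def reclassify(value, mapping, default="Other"):
    """Replace a category with its new class; unmapped values get the default."""
    return mapping.get(value, default)


cities = ["Taipei", "Tainan", "Hualien"]
print([reclassify(c, REGION_MAP) for c in cities])
```

Choosing what happens to unmapped values (a default class here) is a design decision; keeping the original value is an equally reasonable alternative.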
Live Demo & Exercise II
Thank You (English)
Merci (French)
Danke (German)
Grazie (Italian)
Gracias (Spanish)
Obrigado (Brazilian Portuguese)
Go raibh maith agat (Gaelic)
Tak (Danish)
Trugarez (Breton)
Dank u (Dutch)
Děkujeme Vám (Czech)
Dankon (Esperanto)
Tack så mycket (Swedish)