Data Preparation
Dr. Saed Sayad, University of Toronto
2010
http://chem-eng.utoronto.ca/~datamining/
Data Mining Steps
1 • Problem Definition
2 • Data Preparation
3 • Data Exploration
4 • Modeling
5 • Evaluation
6 • Deployment
1. Problem Definition
Understanding the project objectives and requirements from a business perspective, and then converting this knowledge into a data mining problem definition with a preliminary plan designed to achieve the objectives.
Source: http://www.crisp-dm.org/Process/index.htm
2. Data Preparation
The data preparation step covers all activities to construct the final dataset for modeling from the raw data. Tasks include database, table, record, and field selection as well as cleaning, aggregation and transformation of data.
Data Preparation
[Diagram: text data and DSN (database) data pass through ETL to produce the modeling data]
Data Sources
            Text Files    Relational Database             Multi-dimensional Database
Entities    File          Table                           Cube
Attributes  Row and Col   Record, Field, Index            Dimension, Level, Measurement
Methods     Read, Write   Select, Insert, Update, Delete  Drill down, Drill up, Drill through
Language    -             SQL                             MDX
Data Types
• Numerical
– Measurement: Ratio, Interval
– Counting
• Categorical
– Ordinal
– Nominal
Denormalization
One Row per Subject
[Diagram: Transformation maps Customer to CustomerTransformed (1 to 1) and Transaction to TransactionTransformed (1 to 1); Customer to Transaction and CustomerTransformed to TransactionTransformed are each 1 to N]
Copy and Aggregate
[Diagram: Customer and Transaction tables, linked by two operations: Copy and Aggregate]
Data Preparation - Aggregation
• Categorical: Count, Count%
• Numeric: Count, Sum, Mean, Std, Min, Max
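The aggregation functions listed above can be sketched in plain Python. This is a minimal illustration rather than the author's implementation: the function names, the use of the sample standard deviation for "Std", and the sample values are all assumptions made for the example.

```python
from collections import Counter
from statistics import mean, stdev


def aggregate_numeric(values):
    """Summarize a numeric field over one subject's records (non-empty list)."""
    return {
        "count": len(values),
        "sum": sum(values),
        "mean": mean(values),
        # Assumption: "Std" is the sample standard deviation,
        # which is undefined for fewer than two values.
        "std": stdev(values) if len(values) > 1 else None,
        "min": min(values),
        "max": max(values),
    }


def aggregate_categorical(values):
    """Return {category: (count, percent of records)} for a categorical field."""
    counts = Counter(values)
    total = len(values)
    return {cat: (n, 100.0 * n / total) for cat, n in counts.items()}


# Hypothetical sample data, for illustration only.
amounts = [100, 85, 24]
channels = ["web", "store", "web"]

numeric_summary = aggregate_numeric(amounts)
categorical_summary = aggregate_categorical(channels)
```

Each subject (for example, each customer) would get one such summary per aggregated field, which is what keeps the result at one row per subject.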
One to Many Relationship
Customers (1) to Transactions (N)
Customers:
Customer ID  Age  Married
1            25   N
2            38   Y
3            46   Y
Transactions:
Transaction ID  Customer ID  Purchased Amount
1               1            250
2               1            125
3               2            100
4               2            85
5               2            24
6               3            400
Data Preparation - Copy
Transaction ID  Customer ID  Purchased Amount  Age  Married
1               1            250               25   N
2               1            125               25   N
3               2            100               38   Y
4               2            85                38   Y
5               2            24                38   Y
6               3            400               46   Y
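The copy step shown in the table above attaches each customer's attributes to every one of that customer's transactions. A minimal sketch in plain Python follows; the data comes from the tables in this deck, while the field names and the function name are assumptions made for the example.

```python
# Customer table from the slides, keyed by Customer ID.
customers = {
    1: {"age": 25, "married": "N"},
    2: {"age": 38, "married": "Y"},
    3: {"age": 46, "married": "Y"},
}

# Transaction table from the slides.
transactions = [
    {"transaction_id": 1, "customer_id": 1, "amount": 250},
    {"transaction_id": 2, "customer_id": 1, "amount": 125},
    {"transaction_id": 3, "customer_id": 2, "amount": 100},
    {"transaction_id": 4, "customer_id": 2, "amount": 85},
    {"transaction_id": 5, "customer_id": 2, "amount": 24},
    {"transaction_id": 6, "customer_id": 3, "amount": 400},
]


def copy_customer_fields(transactions, customers):
    """Copy the parent customer's fields onto each transaction row."""
    return [{**t, **customers[t["customer_id"]]} for t in transactions]


copied = copy_customer_fields(transactions, customers)
```

The result has one row per transaction, so the customer fields are repeated wherever a customer has several transactions.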
Data Preparation - Aggregation
Customer ID  Age  Married  Purchased Count  Purchased Total
1            25   N        2                375
2            38   Y        3                209
3            46   Y        1                400
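The aggregated table above can be reproduced by counting and summing each customer's transactions. This is a minimal sketch in plain Python using the data from the slides; the field names and the function name are assumptions made for the example.

```python
from collections import defaultdict

# Customer table from the slides, keyed by Customer ID.
customers = {
    1: {"age": 25, "married": "N"},
    2: {"age": 38, "married": "Y"},
    3: {"age": 46, "married": "Y"},
}

# Transaction table from the slides: (transaction_id, customer_id, amount).
transactions = [
    (1, 1, 250),
    (2, 1, 125),
    (3, 2, 100),
    (4, 2, 85),
    (5, 2, 24),
    (6, 3, 400),
]


def aggregate_per_customer(customers, transactions):
    """Build one row per customer with purchase count and purchase total."""
    count = defaultdict(int)
    total = defaultdict(int)
    for _transaction_id, customer_id, amount in transactions:
        count[customer_id] += 1
        total[customer_id] += amount
    return [
        {
            "customer_id": cid,
            **fields,
            # Customers without transactions get a count and total of 0.
            "purchased_count": count[cid],
            "purchased_total": total[cid],
        }
        for cid, fields in customers.items()
    ]


rows = aggregate_per_customer(customers, transactions)
```

Unlike the copy step, the output here has exactly one row per customer, which is the shape a modeling dataset needs when the customer is the subject.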
Data Transformation and Cleansing
Categorical variables  Numeric variables
Missing Values         Missing Values
Invalid Values         Invalid & Outliers
Encoding               Binning
Missing Values
[Bar chart: frequency of Education by value (BLANK, 1, 2, 3, 4); the BLANK bar is annotated "Missing Value", 83%]
Invalid Values
[Bar chart: frequency of doc_type_id by value (NULL, Z, X, 1, 2, 3), with an "Invalid" annotation]
Missing and Invalid Values and Outliers
[Chart: Months in Business]
Box Plot
[Box plot with outliers marked by *]
Missing Values
• Fill in missing values manually based on our domain knowledge
• Ignore the records with missing data
• Fill them in automatically with:
– A global constant (e.g., “?”)
– The variable mean
– Inference-based methods such as Bayes’ rule, decision tree, or EM algorithm
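Three of the options above (ignoring records, filling with a global constant, and filling with the variable mean) can be sketched in plain Python. This is an illustration under stated assumptions: `None` is taken as the missing-value marker, and the function names and sample ages are hypothetical. The inference-based methods are not shown.

```python
from statistics import mean

MISSING = None  # Assumption: missing values are represented as None.


def drop_missing(values):
    """Ignore the records whose value is missing."""
    return [v for v in values if v is not MISSING]


def fill_constant(values, constant="?"):
    """Replace each missing value with a global constant."""
    return [constant if v is MISSING else v for v in values]


def fill_mean(values):
    """Replace each missing value with the mean of the observed values."""
    observed_mean = mean(drop_missing(values))
    return [observed_mean if v is MISSING else v for v in values]


# Hypothetical sample data, for illustration only.
ages = [25, None, 38, 45]

dropped = drop_missing(ages)
constant_filled = fill_constant(ages)
mean_filled = fill_mean(ages)
```

Mean filling keeps every record but shrinks the variable's variance, so it is worth checking how many values are being replaced before relying on it.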
Managing Outliers
• Data points inconsistent with the majority of data
• Types of outliers
– Valid: e.g., a CEO’s salary
– Noisy: e.g., an age of 200, or other widely deviating points
• Detection and removal methods
– Box plot
– Clustering
– Curve-fitting
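The box plot method above can be sketched as follows. The slides do not state the rule, so this example assumes the common convention of flagging points more than 1.5 times the interquartile range beyond the quartiles; the function name and the sample ages are hypothetical.

```python
from statistics import quantiles


def split_outliers(values, whisker=1.5):
    """Split values into (kept, outliers) using box plot fences.

    Assumption: fences sit at Q1 - whisker * IQR and Q3 + whisker * IQR,
    the usual box plot convention; the slides do not specify a rule.
    """
    q1, _median, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    kept = [v for v in values if low <= v <= high]
    outliers = [v for v in values if v < low or v > high]
    return kept, outliers


# Hypothetical sample data, for illustration only.
ages = [22, 25, 31, 38, 41, 46, 52, 200]

kept, outliers = split_outliers(ages)
```

A flagged point is only a candidate for removal: a valid outlier such as a CEO's salary would be flagged in the same way as a noisy one, so the decision to drop it still needs domain knowledge.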
Encoding Categorical Variables
• Encoding is the process of transforming categorical variables into numerical counterparts.
• Encoding methods:
– Binary method
– Ordinal method
– Target-based encoding
Encoding
Housing (for free, own, rent)
• Binary method:
– for free: 1, 0, 0
– own: 0, 1, 0
– rent: 0, 0, 1
• Ordinal method:
– own: 1
– for free: 3
– rent: 5
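The two encodings of the Housing variable above can be sketched in plain Python. The category order and the ordinal codes are taken from the slide; the function and constant names are assumptions made for the example.

```python
# Category order and ordinal codes as given on the slide.
HOUSING_CATEGORIES = ["for free", "own", "rent"]
HOUSING_ORDINAL_CODES = {"own": 1, "for free": 3, "rent": 5}


def binary_encode(value, categories=HOUSING_CATEGORIES):
    """Return one 0/1 indicator per category, with a 1 for the matching one."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]


def ordinal_encode(value, codes=HOUSING_ORDINAL_CODES):
    """Map a category to its numeric code."""
    return codes[value]


housing = ["own", "rent", "for free"]
binary = [binary_encode(h) for h in housing]
ordinal = [ordinal_encode(h) for h in housing]
```

The binary method adds one column per category and implies no ordering, while the ordinal method keeps a single column but asserts an order and spacing between categories that the modeler has to justify.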
Binning Numerical Variables
• Binning is the process of transforming numerical variables into categorical counterparts.
• Binning methods:
– Equal Width
– Equal Frequency
– Entropy Based
Binning
• Variable: 0, 4, 12, 16, 16, 18, 24, 26, 28
• Equi-width binning:
– Bin 1: 0, 4 [-, 10) bin
– Bin 2: 12, 16, 16, 18 [10, 20) bin
– Bin 3: 24, 26, 28 [20, +) bin
• Equi-frequency binning:
– Bin 1: 0, 4, 12 [-, 14) bin
– Bin 2: 16, 16, 18 [14, 21) bin
– Bin 3: 24, 26, 28 [21, +) bin
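The two binnings above can be reproduced with a short sketch in plain Python. Two assumptions are made to match the slide's cut points: the equal-width range is taken as 0 to 30 (the data itself only reaches 28), and the equal-frequency cut points are placed midway between neighboring bins. The function names are hypothetical.

```python
from bisect import bisect_right

values = [0, 4, 12, 16, 16, 18, 24, 26, 28]  # variable from the slide


def assign_bins(values, cut_points):
    """Group values into half-open bins [-, c1), [c1, c2), ..., [cn, +)."""
    bins = [[] for _ in range(len(cut_points) + 1)]
    for v in values:
        bins[bisect_right(cut_points, v)].append(v)
    return bins


def equal_width_cuts(low, high, k):
    """Cut points that split [low, high] into k intervals of equal width."""
    width = (high - low) / k
    return [low + width * i for i in range(1, k)]


def equal_frequency_cuts(values, k):
    """Cut points that put the same number of values in each of k bins.

    Assumes len(values) is divisible by k; each cut point is the midpoint
    between the last value of one bin and the first value of the next.
    """
    ordered = sorted(values)
    size = len(ordered) // k
    return [(ordered[i * size - 1] + ordered[i * size]) / 2 for i in range(1, k)]


width_cuts = equal_width_cuts(0, 30, 3)      # assumed range 0..30
width_bins = assign_bins(values, width_cuts)

freq_cuts = equal_frequency_cuts(values, 3)
freq_bins = assign_bins(values, freq_cuts)
```

With tied values that straddle a bin boundary, a midpoint cut can no longer separate them, so equal-frequency bins may end up with unequal counts on real data.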
Binning
[Chart: Months in Business]
Summary
• In the data preparation step the final modeling dataset is constructed from the raw data.
• One Row per Subject is the heart of the data preparation activities for building the modeling dataset.
• Tasks include database, table, record, and field selection; cleaning, aggregation, and transformation of data; and handling missing values, invalid values, and outliers.