Data Preparation - University of...

Data Preparation Dr. Saed Sayad University of Toronto 2010 [email protected] 1 http://chem-eng.utoronto.ca/~datamining/


Page 1: Data Preparation - University of Torontochem-eng.utoronto.ca/~datamining/Presentations/Data_Preparation.pdf · Summary •In the data preparation step the final modeling dataset is

Data Preparation

Dr. Saed Sayad, University of Toronto

2010

[email protected]

http://chem-eng.utoronto.ca/~datamining/


Data Mining Steps

1. Problem Definition

2. Data Preparation

3. Data Exploration

4. Modeling

5. Evaluation

6. Deployment


1. Problem Definition


Understanding the project objectives and requirements from a business perspective, and then converting this knowledge into a data mining problem definition with a preliminary plan designed to achieve the objectives.

Source: http://www.crisp-dm.org/Process/index.htm


2. Data Preparation

The data preparation step covers all activities to construct the final dataset for modeling from the raw data. Tasks include database, table, record, and field selection as well as cleaning, aggregation and transformation of data.


Data Preparation

[Diagram: Data (Text) and Data (DSN) feed an ETL step that produces the Modeling Data]


Data Sources

             Text Files       Relational Database              Multi-dimensional Database
Entities     File             Table                            Cube
Attributes   Row and Column   Record, Field, Index             Dimension, Level, Measurement
Methods      Read, Write      Select, Insert, Update, Delete   Drill down, Drill up, Drill through
Language     -                SQL                              MDX


Data Types

Data

• Numerical

– Measurement: Ratio, Interval

– Counting

• Categorical

– Ordinal

– Nominal


Denormalization


One Row per Subject


Transformation

[Diagram: Customer maps 1 to 1 to Customer Transformed; Transaction maps 1 to 1 to Transaction Transformed; each Customer table relates 1 to N to its Transaction table]


Copy and Aggregate

[Diagram: the Customer and Transaction tables, linked by two operations labelled Copy and Aggregate]


Data Preparation - Aggregation

Aggregation

• Categorical

– Count

– Count%

• Numeric

– Count, Sum

– Mean, Std

– Min, Max
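The aggregation functions listed above can be sketched in plain Python. This is a minimal illustration, not the author's implementation: the function names are hypothetical, the sample values are borrowed from the Married and Purchased Amount columns of the customer and transaction example in this deck, and "Std" is assumed here to mean the sample standard deviation.

```python
from collections import Counter
from statistics import mean, stdev


def aggregate_categorical(values):
    """Count and Count% (share of all values) for each category."""
    counts = Counter(values)
    n = len(values)
    return {
        cat: {"count": c, "count_pct": 100.0 * c / n}
        for cat, c in counts.items()
    }


def aggregate_numeric(values):
    """Count, Sum, Mean, Std, Min and Max of a numeric variable."""
    return {
        "count": len(values),
        "sum": sum(values),
        "mean": mean(values),
        "std": stdev(values),  # assumption: sample std; needs >= 2 values
        "min": min(values),
        "max": max(values),
    }


married = ["N", "Y", "Y"]
amounts = [250, 125, 100, 85, 24, 400]

married_summary = aggregate_categorical(married)
amount_summary = aggregate_numeric(amounts)
print(married_summary)
print(amount_summary)
```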


One to Many Relationship

Customers

Customer ID   Age   Married
1             25    N
2             38    Y
3             46    Y

Transactions

Transaction ID   Customer ID   Purchased Amount
1                1             250
2                1             125
3                2             100
4                2             85
5                2             24
6                3             400

Relationship: Customers (1) to Transactions (N)


Data Preparation - Copy

Transaction ID   Customer ID   Purchased Amount   Age   Married
1                1             250                25    N
2                1             125                25    N
3                2             100                38    Y
4                2             85                 38    Y
5                2             24                 38    Y
6                3             400                46    Y
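The copy step shown in this table can be sketched as a simple lookup join. The data come from the slides; the dictionary layout and the function name `copy_customer_fields` are assumptions made for illustration only.

```python
# Customer and transaction rows from the slides.
customers = {
    1: {"age": 25, "married": "N"},
    2: {"age": 38, "married": "Y"},
    3: {"age": 46, "married": "Y"},
}
transactions = [
    {"transaction_id": 1, "customer_id": 1, "amount": 250},
    {"transaction_id": 2, "customer_id": 1, "amount": 125},
    {"transaction_id": 3, "customer_id": 2, "amount": 100},
    {"transaction_id": 4, "customer_id": 2, "amount": 85},
    {"transaction_id": 5, "customer_id": 2, "amount": 24},
    {"transaction_id": 6, "customer_id": 3, "amount": 400},
]


def copy_customer_fields(transactions, customers):
    """Return one row per transaction with the customer's fields copied in."""
    return [{**t, **customers[t["customer_id"]]} for t in transactions]


copied = copy_customer_fields(transactions, customers)
for row in copied:
    print(row)
```

The result keeps one row per transaction, so customer attributes such as Age are repeated for every purchase by the same customer.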


Data Preparation - Aggregation

Customer ID   Age   Married   Purchased Count   Purchased Total
1             25    N         2                 375
2             38    Y         3                 209
3             46    Y         1                 400
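The aggregated table above (one row per customer) can be reproduced with a short sketch. The data are from the slides; the tuple layout, the output field names, and the function name `aggregate_per_customer` are hypothetical choices for this example.

```python
from collections import defaultdict

customers = {
    1: {"age": 25, "married": "N"},
    2: {"age": 38, "married": "Y"},
    3: {"age": 46, "married": "Y"},
}
# (transaction_id, customer_id, purchased_amount)
transactions = [
    (1, 1, 250),
    (2, 1, 125),
    (3, 2, 100),
    (4, 2, 85),
    (5, 2, 24),
    (6, 3, 400),
]


def aggregate_per_customer(customers, transactions):
    """Collapse the N transaction rows of each customer into one row."""
    counts = defaultdict(int)
    totals = defaultdict(int)
    for _transaction_id, customer_id, amount in transactions:
        counts[customer_id] += 1
        totals[customer_id] += amount
    return [
        {
            "customer_id": cid,
            **fields,
            "purchased_count": counts[cid],
            "purchased_total": totals[cid],
        }
        for cid, fields in customers.items()
    ]


one_row_per_subject = aggregate_per_customer(customers, transactions)
for row in one_row_per_subject:
    print(row)
```

A customer with no transactions would still appear in the output, with a count and total of 0, which preserves the one-row-per-subject shape.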


Data Transformation and Cleansing

Variable

Categorical        Numeric
Missing Values     Missing Values
Invalid Values     Invalid & Outliers
Encoding           Binning


Missing Values

[Bar chart: Frequency (0 to 2,500,000) of each Education value (BLANK, 1, 2, 3, 4); annotation: "83% Missing Value"]


Invalid Values

[Bar chart: Frequency (0 to 1,400,000) of each doc_type_id value (NULL, Z, X, 1, 2, 3); annotation: "Invalid"]


Missing and Invalid Values and Outliers

[Figure: Months in Business]


Box Plot

[Figure: box plot with outliers marked by *]


Missing Values

• Fill in missing values manually based on our domain knowledge

• Ignore the records with missing data

• Fill them in automatically:

– A global constant (e.g., “?”)

– The variable mean

– Inference-based methods such as Bayes’ rule, decision tree, or EM algorithm
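Three of the simpler options above (ignoring records, a global constant, and the variable mean) can be sketched as follows. The records are hypothetical, `None` is assumed to be the marker for a missing value, and the function names are illustrative only. Inference-based methods are not covered by this sketch.

```python
from statistics import mean

# Hypothetical records; None marks a missing value (an assumption).
records = [
    {"id": 1, "age": 20},
    {"id": 2, "age": None},
    {"id": 3, "age": 30},
    {"id": 4, "age": 40},
]


def drop_missing(records, field):
    """Ignore the records whose value for `field` is missing."""
    return [r for r in records if r[field] is not None]


def fill_constant(records, field, constant="?"):
    """Replace missing values with a global constant."""
    return [
        {**r, field: constant if r[field] is None else r[field]}
        for r in records
    ]


def fill_mean(records, field):
    """Replace missing values with the mean of the observed values."""
    observed = [r[field] for r in records if r[field] is not None]
    field_mean = mean(observed)
    return [
        {**r, field: field_mean if r[field] is None else r[field]}
        for r in records
    ]


dropped = drop_missing(records, "age")
with_constant = fill_constant(records, "age")
with_mean = fill_mean(records, "age")
print(with_mean)
```

Each function returns new rows and leaves the input untouched, so the strategies can be compared side by side on the same data.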


Managing Outliers

• Outliers are data points inconsistent with the majority of the data

• Types of outliers

– Valid: a CEO’s salary

– Noisy: a person’s age = 200; widely deviating points

• Removal methods

– Box plot

– Clustering

– Curve-fitting
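The box plot method can be sketched with the common 1.5 × IQR rule. The slides do not state which rule or quartile definition they use, so both the multiplier `k` and the default quartile method of `statistics.quantiles` are assumptions here, and the ages are hypothetical.

```python
from statistics import quantiles


def boxplot_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _median, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical ages, including the noisy value from the slide (age = 200).
ages = [22, 25, 31, 38, 40, 46, 52, 200]
flagged = boxplot_outliers(ages)
print(flagged)
```

The rule only flags candidates. Whether a flagged point is noisy (an age of 200) or valid (a CEO's salary) still has to be decided with domain knowledge before anything is removed.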


Encoding Categorical Variables

• Encoding is the process of transforming categorical variables into numerical counterparts.

• Encoding methods:

– Binary method

– Ordinal method

– Target-based encoding
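The slides name target-based encoding without defining a variant. One common form replaces each category with the mean of the target variable over the records in that category; the sketch below assumes that form. The housing values, the binary target, and the function name are all hypothetical.

```python
from collections import defaultdict


def target_encode(categories, targets):
    """Map each category to the mean target value observed for it."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    return {category: sums[category] / counts[category] for category in sums}


# Hypothetical data: a housing category and a 0/1 target per record.
housing = ["own", "own", "rent", "rent", "rent", "for free"]
target = [1, 0, 1, 1, 0, 0]

mapping = target_encode(housing, target)
encoded = [mapping[h] for h in housing]
print(mapping)
```

Because the codes are derived from the target, they are usually computed on training data only to avoid leaking the target into evaluation data.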


Encoding

Housing (for free, own, rent)

• Binary method:

– for free: 1, 0, 0

– own: 0, 1, 0

– rent: 0, 0, 1

• Ordinal method:

– own: 1

– for free: 3

– rent: 5
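The two encodings of the Housing variable can be sketched in a few lines. The category order and the ordinal codes are taken from the slide; the constant and function names are hypothetical.

```python
# Category order and ordinal codes as given on the slide.
CATEGORIES = ["for free", "own", "rent"]
ORDINAL_CODES = {"own": 1, "for free": 3, "rent": 5}


def binary_encode(value, categories=CATEGORIES):
    """One indicator per category: 1 for the matching category, else 0."""
    return [1 if value == category else 0 for category in categories]


def ordinal_encode(value, codes=ORDINAL_CODES):
    """Look up the numeric code assigned to the category."""
    return codes[value]


for housing in CATEGORIES:
    print(housing, binary_encode(housing), ordinal_encode(housing))
```

The binary method produces one new column per category, while the ordinal method keeps a single column but imposes an order and spacing on the categories.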


Binning Numerical Variables

• Binning is the process of transforming numerical variables into categorical counterparts.

• Binning methods:

– Equal Width

– Equal Frequency

– Entropy Based


Binning

• Variable: 0, 4, 12, 16, 16, 18, 24, 26, 28

• Equi-width binning:

– Bin 1 [-, 10): 0, 4

– Bin 2 [10, 20): 12, 16, 16, 18

– Bin 3 [20, +): 24, 26, 28

• Equi-frequency binning:

– Bin 1 [-, 14): 0, 4, 12

– Bin 2 [14, 21): 16, 16, 18

– Bin 3 [21, +): 24, 26, 28
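Both binnings can be reproduced with a short sketch. The values and the equal-width cut points (10 and 20) come from the slide; the function names are hypothetical, and splitting the sorted values into equally sized groups is one straightforward way to obtain equal-frequency bins, not necessarily the author's procedure.

```python
from bisect import bisect_right

values = [0, 4, 12, 16, 16, 18, 24, 26, 28]


def bin_by_edges(values, edges):
    """Assign values to half-open intervals [edge_i, edge_i+1); ends are open."""
    bins = [[] for _ in range(len(edges) + 1)]
    for v in values:
        bins[bisect_right(edges, v)].append(v)
    return bins


def equal_frequency_bins(values, n_bins):
    """Split the sorted values into n_bins groups of (nearly) equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


equal_width = bin_by_edges(values, edges=[10, 20])
equal_freq = equal_frequency_bins(values, n_bins=3)
print(equal_width)
print(equal_freq)
```

The slide's equal-frequency boundaries, 14 and 21, are the midpoints between neighbouring groups: (12 + 16) / 2 and (18 + 24) / 2. Note that this simple split can place tied values in different bins when a tie straddles a group boundary.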


Binning

[Figure: Months in Business]


Summary

• In the data preparation step the final modeling dataset is constructed from the raw data.

• One Row per Subject is the heart of the data preparation activities for building the modeling dataset.

• Tasks include database, table, record, and field selection, as well as cleaning, aggregation, and transformation of data, including the handling of missing values, invalid values, and outliers.
