Introduction to Big Data, Chapter 9 (Week 5): Data Cleansing. DCCS208(02), Korea University, 2019 Fall. Asst. Prof. Minseok Seo, mins@korea.ac.kr

Page 1

Introduction to Big Data

Chapter 9 (Week 5): Data Cleansing

DCCS208(02) Korea University 2019 Fall

Asst. Prof. Minseok Seo, mins@korea.ac.kr

Page 2

Contents

Review

Data Cleansing

1. Data aggregation

NA, Inf, NaN

Process of data cleansing

Variable Check

Raw data check

Missing value treatment

Outlier treatment

Feature Engineering

Page 3

01 Review: Data preprocessing

Page 4

Data Preprocessing (Review): Various data preprocessing methods

Aggregation

Sampling

Dimensionality reduction

Feature selection

Feature extraction

...

Data preprocessing is an important step in the data mining process. The phrase "garbage in, garbage out" is particularly applicable to data mining and machine learning projects.

Page 5

Quality Control for Data (Review): Data cleaning

Data cleansing is the process of detecting corrupt, inaccurate, incomplete, incorrect, or irrelevant parts of the data, and then replacing, modifying, or deleting this dirty or coarse data.

Page 6

Distance Measures (Review): Diverse distance metrics

L1-norm (Manhattan Distance)

L2-norm (Euclidean Distance)

Lmax-norm (Chebyshev Distance)

Edit Distance (Hamming Distance)

Pearson’s Correlation

Spearman’s Correlation
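As a quick refresher, the norm-based distances listed above can be written in a few lines. This is a minimal sketch in plain Python (the course itself uses R); the function names and the sample points are illustrative assumptions, not taken from the slides.

```python
import math

def manhattan(a, b):
    """L1-norm: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    """L2-norm: square root of the sum of squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def chebyshev(a, b):
    """Lmax-norm: largest absolute coordinate difference."""
    return max(abs(x - y) for x, y in zip(a, b))

def hamming(s, t):
    """Number of positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs equal-length inputs")
    return sum(c1 != c2 for c1, c2 in zip(s, t))

p, q = (1, 2, 3), (4, 6, 3)          # hypothetical points
print(manhattan(p, q), euclidean(p, q), chebyshev(p, q))  # 7 5.0 4
print(hamming("karolin", "kathrin"))                      # 3
```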

Page 7

02 Data cleansing: Exploratory Data Analysis (EDA)

Page 8

Data Aggregation: Data combining and tabulation (data rectangling)

[Data 1]
Name                 Job                Age
Gildong Hong         Football player    20
Samsun Kim           Football player    21
Heungmin Son         Football player    22
Cristiano Ronaldo    Robber             30
Lionel Messi         Football player    30
Kylian Mbappé        Football player    20

[Data 2]
Name                 Gender    Goal2020
Heungmin Son         Male      40
Cristiano Ronaldo    Male      0
Lionel Messi         Male      42
Kylian Mbappé        Male      28

Let's combine the two datasets.
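One way to picture the combination is a full join on Name, which keeps every player and leaves the fields missing from one table empty. The sketch below uses plain Python dictionaries with the rows of [Data 1] and [Data 2]; the helper names `column_names` and `full_join` are assumptions made for illustration.

```python
# Rows copied from [Data 1] and [Data 2], keyed by Name.
data1 = {
    "Gildong Hong": {"Job": "Football player", "Age": 20},
    "Samsun Kim": {"Job": "Football player", "Age": 21},
    "Heungmin Son": {"Job": "Football player", "Age": 22},
    "Cristiano Ronaldo": {"Job": "Robber", "Age": 30},
    "Lionel Messi": {"Job": "Football player", "Age": 30},
    "Kylian Mbappé": {"Job": "Football player", "Age": 20},
}
data2 = {
    "Heungmin Son": {"Gender": "Male", "Goal2020": 40},
    "Cristiano Ronaldo": {"Gender": "Male", "Goal2020": 0},
    "Lionel Messi": {"Gender": "Male", "Goal2020": 42},
    "Kylian Mbappé": {"Gender": "Male", "Goal2020": 28},
}

def column_names(table):
    """Column names in first-seen order."""
    names = {}
    for row in table.values():
        names.update(dict.fromkeys(row))
    return list(names)

def full_join(left, right):
    """Keep every key from both tables; fields absent on one side become None."""
    cols = column_names(left) + column_names(right)
    keys = list(left) + [k for k in right if k not in left]
    merged = {}
    for key in keys:
        row = {**left.get(key, {}), **right.get(key, {})}
        merged[key] = {c: row.get(c) for c in cols}
    return merged

combined = full_join(data1, data2)
for name, row in combined.items():
    print(name, row)
```

Gildong Hong and Samsun Kim appear only in [Data 1], so their Gender and Goal2020 come out as None, which is exactly the missing-value problem that combining data can introduce.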

Page 9

Data Aggregation: Data combining and tabulation (data rectangling)

What are the prerequisites for combining data?

What can happen when we combine different datasets?

• Wrong values assigned because of typos in the unique key

• Missing value

• Singleton

• Duplicated features

...
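Several of these problems can be caught before merging by comparing the key columns. The following is a rough sketch in plain Python: it lists keys that occur in only one dataset (singletons) and uses the standard `difflib.get_close_matches` to suggest keys that may be typos of each other. The function name `check_keys`, the 0.8 cutoff, and the misspelled name are all assumptions for illustration.

```python
import difflib

def check_keys(left_keys, right_keys, cutoff=0.8):
    """Report keys found on one side only, plus likely typo pairs."""
    left, right = set(left_keys), set(right_keys)
    report = {
        "only_left": sorted(left - right),
        "only_right": sorted(right - left),
        "possible_typos": [],
    }
    for key in report["only_right"]:
        close = difflib.get_close_matches(key, report["only_left"], n=1, cutoff=cutoff)
        if close:
            report["possible_typos"].append((key, close[0]))
    return report

left = ["Heungmin Son", "Lionel Messi", "Gildong Hong"]
right = ["Heungmin Sonn", "Lionel Messi"]      # hypothetical typo in the key
report = check_keys(left, right)
print(report)
```

A suggested pair still needs a human decision: a similar string is only a hint that two keys refer to the same entity.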

Page 10

NA, NULL, Inf & NaN: Important terms

NA (Not Available)    # Missing

NULL                  # Undefined

Inf                   # Infinite (e.g. 3/0)

NaN (Not a Number)    # Not a number (e.g. Inf/Inf)
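These four values are R terms. For comparison, here is a small sketch of the nearest equivalents in plain Python, where `None` is used as a stand-in for NA and NULL (an assumption of this example, since Python has no separate NA value).

```python
import math

missing = None            # closest plain-Python analogue of NA / NULL
inf = math.inf            # Inf
nan = inf / inf           # Inf / Inf gives NaN

print(math.isinf(inf), math.isnan(nan))   # True True
print(nan == nan)                         # False: NaN never equals itself

# Unlike R, plain Python raises an error for 3 / 0 instead of returning Inf.
try:
    3 / 0
except ZeroDivisionError:
    print("3 / 0 raises ZeroDivisionError")

# Keep only usable numbers before computing statistics.
values = [1.5, missing, inf, nan, 2.5]
clean = [v for v in values if v is not None and math.isfinite(v)]
print(clean)                              # [1.5, 2.5]
```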

Page 11

Step of data cleansing: General process

Once you have finished collecting data and have combined everything you need into a single table, move on to the next step.

Generally, the data is cleansed through the following process:

Variable Check → Raw data check → Missing value treatment → Outlier treatment → Feature Engineering

Page 12

Step of data cleansing: Variable Check

Check the type of each variable (categorical or continuous), and the data type of the variable (Date, Character, Numeric, etc.).

The results of the analysis can differ completely depending on the variable type used when fitting the model.

You should also check that the conceptual variable type matches the variable type in your program (Programmer’s privilege).
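A variable check can be partly automated by guessing each column's data type from its raw values. The sketch below is one simple way to do that in plain Python; `infer_type` and the sample columns (including the hypothetical Joined and Grade columns) are assumptions made for illustration.

```python
from datetime import date

def infer_type(values):
    """Guess a column's data type from its raw string values."""
    def all_parse(parser):
        try:
            for v in values:
                parser(v)
            return True
        except ValueError:
            return False

    if all_parse(float):
        return "Numeric"
    if all_parse(date.fromisoformat):
        return "Date"
    return "Character"

columns = {
    "Age": ["20", "21", "22"],
    "Joined": ["2019-09-02", "2019-10-01", "2019-10-07"],
    "Job": ["Football player", "Football player", "Robber"],
    "Grade": ["1", "2", "3"],   # stored as numbers, conceptually a category
}
for name, values in columns.items():
    print(name, infer_type(values))
```

Grade is reported as Numeric even though it is conceptually categorical, which is the kind of mismatch between conceptual type and program type that has to be corrected by hand.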

Page 13

Step of data cleansing: Raw data check

Single Variable (Feature) analysis

• This step checks each variable independently.

• Use a histogram or boxplot to see the distribution of each variable, along with summary statistics such as the mean, mode, and median.

Page 14

Step of data cleansing: Raw data check

Summary Statistics

• Summary statistics are numbers that summarize properties of the data.

• Most summary statistics can be calculated in a single pass through the data.

• Frequency, Mean, Variance, Max, Min, Mode, SD, and Median.
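To illustrate the single-pass idea, the sketch below computes the count, mean, variance, SD, min, and max while reading each value exactly once. It uses Welford's running update for the variance, which is a choice made for this example rather than something the slides prescribe; `summarize` is a hypothetical name and the ages are those of [Data 1].

```python
import math
import statistics

def summarize(stream):
    """One pass over the data: count, mean, sample variance, SD, min, max."""
    n, mean, m2 = 0, 0.0, 0.0
    lo, hi = math.inf, -math.inf
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)     # running sum of squared deviations
        lo, hi = min(lo, x), max(hi, x)
    variance = m2 / (n - 1) if n > 1 else float("nan")
    return {"n": n, "mean": mean, "variance": variance,
            "sd": math.sqrt(variance), "min": lo, "max": hi}

ages = [20, 21, 22, 30, 30, 20]
s = summarize(ages)
print(s)
# Cross-check against the two-pass implementations in the standard library.
print(statistics.mean(ages), statistics.variance(ages))
```

The median is the usual exception: computing it exactly generally requires the values to be sorted or stored, not just streamed.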

Page 15

Step of data cleansing: Raw data check

Frequency

• The count (or proportion) of times a specific value appears.

• Example) The frequency of the value Female in the Gender variable is 0.5 (or 50%).
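The Gender example can be reproduced with a counter. A minimal sketch in plain Python, using a hypothetical four-person sample:

```python
from collections import Counter

gender = ["Female", "Male", "Female", "Male"]   # hypothetical sample
counts = Counter(gender)                        # absolute frequency
freq = {value: n / len(gender) for value, n in counts.items()}  # relative

print(counts["Female"], freq["Female"])         # 2 0.5
```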

Page 16

Step of data cleansing: Raw data check

Mode

• The attribute value with the highest frequency in a specific variable.

• Example) ‘Kim’ is the most common surname in Korea, so ‘Kim’ is the mode of the surname variable for South Korea.
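In plain Python the mode is available from the standard `statistics` module. The surname list below is a made-up sample, not real population data; `multimode` is shown because a variable can have more than one most frequent value.

```python
import statistics

surnames = ["Kim", "Lee", "Kim", "Park", "Kim", "Lee"]   # hypothetical sample
top = statistics.mode(surnames)
print(top)                                               # Kim

# With ties, multimode returns every most-frequent value.
ties = statistics.multimode(["a", "b", "a", "b", "c"])
print(ties)                                              # ['a', 'b']
```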

Page 17

Step of data cleansing: Raw data check

Mean and Median

• Two ways to measure the central position of a continuous variable.
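The two measures react very differently to extreme values, which matters later for outlier treatment. A small sketch with made-up salary figures:

```python
import statistics

salaries = [30, 32, 35, 38, 40]          # hypothetical values
with_outlier = salaries + [400]          # one extreme observation added

print(statistics.mean(salaries), statistics.median(salaries))          # 35 35
print(statistics.mean(with_outlier), statistics.median(with_outlier))
```

After adding the extreme value the mean jumps to about 95.8, while the median only moves to 36.5.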

Page 18

Step of data cleansing: Raw data check

Visualization

• Visualization is a way of presenting data in visual form, such as graphics or figures.

• The purpose of visualization is for humans to interpret the visualized information and form an internal model of the information.

• Visualizing a large amount of data can reveal (1) general patterns or trends, and (2) outliers or abnormal patterns in the data.

Page 19

Step of data cleansing: Raw data check

Why use a Histogram?

• To summarize data from a process that has been collected over a period of time, and graphically present its frequency distribution in bar form.

Page 20

Step of data cleansing: Raw data check

What does a Histogram Do?

• Displays large amounts of data that are difficult to interpret in tabular form.

• Shows the relative frequency of occurrence of the various data values.

• Reveals the centering, variation, and shape of the data.

• Helps check the distributional assumptions of a statistical analysis.
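Underneath, a histogram is just a count of values per bin. The sketch below builds those counts in plain Python and prints them as text bars; the function name, the bin width of 10, and the height values are assumptions for illustration.

```python
def histogram(values, bin_width):
    """Count values per bin [k*w, (k+1)*w), keyed by the bin's lower edge."""
    counts = {}
    for v in values:
        start = (v // bin_width) * bin_width
        counts[start] = counts.get(start, 0) + 1
    return dict(sorted(counts.items()))

heights = [158, 161, 163, 165, 166, 168, 170, 171, 173, 174, 176, 181, 199]
bins = histogram(heights, 10)
for start, n in bins.items():
    print(f"{start}-{start + 9}: {'#' * n}")
```

Even in this text form the centering (160s to 170s), the spread, and the isolated value near 199 are visible at a glance.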

Page 21

Step of data cleansing: Raw data check

Check combinations between two variables

• This step analyzes the relationship between two variables.

• You can choose the appropriate visualization and analytical methods according to the types of the two variables.

Variable types    Visualization                    Analytical method
Cont. vs Cont.    Scatter plot, line plot, etc.    Correlation, linear regression, etc.
Cate. vs Cate.    Cumulative bar graph             Chi-square test, independence test, etc.
Cont. vs Cate.    Histogram, boxplot               Z-test, t-test, ANOVA, etc.
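For the continuous vs continuous case, the correlation in the table can be computed directly from its definition. This is a minimal sketch of Pearson's correlation in plain Python; the function name is an assumption, and the two short vectors are the Age and Goal2020 values of the four players who appear in both example tables.

```python
import math

def pearson(x, y):
    """Pearson's r: covariance divided by the product of the spreads."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

age = [22, 30, 30, 20]        # Son, Ronaldo, Messi, Mbappé
goals = [40, 0, 42, 28]
r = pearson(age, goals)
print(round(r, 3))
```

With only four observations the value says very little; the point is the mechanics, not the conclusion.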

Page 22

Step of data cleansing: Raw data check

Check relationships among three or more variables

• In general, this check is rarely performed in the QC process.

• But sometimes it is necessary to look at relationships among three or more variables.

• Let’s look at an example on the whiteboard.

Page 23

Step of data cleansing: Missing value treatment

If we create a model with missing values, the accuracy of the model is compromised because the relationships between the variables can be distorted.

Depending on whether missing values occur randomly or systematically, the way that missing values are handled varies slightly.

Page 24

Step of data cleansing: Missing value treatment

Deletion

• Delete all observations with missing values (Delete All, Listwise Deletion).

• Partial deletion: delete an observation only when it is missing a variable that is used in the downstream model.

• Deleting all such observations is easy, but it reduces the total number of observations, which makes the model less reliable.

• Partial deletion has the disadvantage of increasing management costs, because the variables used, and therefore the observations kept, vary from model to model.
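The difference between the two deletion strategies is easiest to see by counting how many rows survive. A minimal sketch in plain Python, with made-up rows and `None` standing in for a missing value:

```python
rows = [   # hypothetical observations
    {"age": 20, "height": 173, "income": None},
    {"age": None, "height": 158, "income": 3000},
    {"age": 25, "height": None, "income": 3500},
    {"age": 30, "height": 180, "income": 4000},
]

def listwise(rows):
    """Drop every observation that has any missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]

def partial(rows, used):
    """Drop an observation only if a variable used by this model is missing."""
    return [r for r in rows if all(r[c] is not None for c in used)]

print(len(listwise(rows)))                      # 1
print(len(partial(rows, ["age", "height"])))    # 2
print(len(partial(rows, ["height", "income"]))) # 2
```

Listwise deletion leaves one row for every model, while partial deletion keeps two rows per model but a different two each time, which is the bookkeeping cost the last bullet refers to.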

Page 25

Step of data cleansing: Missing value treatment

Replace with other values (mean, mode, median)

• Example) If the mean male height is 173 and the mean female height is 158, a missing height for a male observation is replaced with 173.

• This approach can distort the model, because the analyst arbitrarily chooses which variables define "similar" observations (here, gender).
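The height example amounts to filling each gap with the mean of its group. A sketch in plain Python, with individual heights invented so that the group means come out as 173 and 158; `impute_group_mean` is a hypothetical helper and assumes every group has at least one observed value.

```python
from statistics import mean

people = [   # hypothetical observations; None marks a missing height
    {"gender": "Male", "height": 170},
    {"gender": "Male", "height": 176},
    {"gender": "Male", "height": None},
    {"gender": "Female", "height": 156},
    {"gender": "Female", "height": 160},
    {"gender": "Female", "height": None},
]

def impute_group_mean(rows, group, target):
    """Fill missing target values with the mean of the row's group."""
    observed = {}
    for r in rows:
        if r[target] is not None:
            observed.setdefault(r[group], []).append(r[target])
    means = {g: mean(v) for g, v in observed.items()}
    return [dict(r, **{target: means[r[group]]}) if r[target] is None else dict(r)
            for r in rows]

filled = impute_group_mean(people, "gender", "height")
print(filled[2]["height"], filled[5]["height"])   # 173 158
```

Grouping by a different variable would give different fill-in values, which is where the analyst's choice enters.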

Page 26

Step of data cleansing: Missing value treatment

Insert predicted values

• Missing values are predicted and filled in using statistical methods (regression modeling) or machine learning methods (clustering or supervised learning).

• This is better than replacement with a summary statistic, because it removes the analyst's subjectivity.

• However, the same limitation still exists, because the filled-in value is not an actually observed value.
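As a small instance of the regression option, the sketch below fits a least-squares line on the complete rows and uses it to predict the missing value. The data are invented to lie exactly on salary = 30 + 2 × years, so the prediction is easy to check; `fit_line` is a hypothetical helper.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx

# (years of service, salary); None marks the missing salary. Hypothetical data.
rows = [(1, 32.0), (2, 34.0), (3, None), (4, 38.0), (5, 40.0)]

known = [(x, y) for x, y in rows if y is not None]
slope, intercept = fit_line([x for x, _ in known], [y for _, y in known])
filled = [(x, y if y is not None else intercept + slope * x) for x, y in rows]
print(filled[2])   # (3, 36.0)
```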

Page 27

Step of data cleansing: Outlier treatment

Outliers are observations away from the main cluster that are likely to distort the model.

Whether an observation is an outlier is a very relative concept.

Page 28

Step of data cleansing: Outlier treatment

An easy and simple way to find outliers is to visualize the distribution of the variables.

In general, use a boxplot or histogram for one variable and a scatter plot for two variables.
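A boxplot conventionally draws as separate points the values lying more than 1.5 × IQR beyond the quartiles, and the same rule can be applied without drawing anything. A sketch in plain Python; the function name and the extra value 95 are assumptions, and `statistics.quantiles` is used with its default method, so other quartile definitions can give slightly different fences.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

ages = [20, 21, 22, 30, 30, 20, 95]   # ages from [Data 1] plus a hypothetical 95
print(iqr_outliers(ages))             # [95]
```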

Page 29

Step of data cleansing: Outlier treatment

The visualization approach is intuitive, but it is also arbitrary (subjective).

That’s why it’s preferable to employ a statistical approach to find and remove outliers (e.g. Cook’s D value).
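Cook's D measures how much a fitted regression changes when one observation is left out. For a simple linear regression it can be computed from each residual and leverage, as in the sketch below. The data are invented, and the 4/n cutoff is only a common rule of thumb, not a threshold given in the slides.

```python
def cooks_distance(x, y):
    """Cook's D for each point of a simple linear regression y = a + b*x."""
    n, p = len(x), 2                     # p = number of fitted coefficients
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    resid = [b - (intercept + slope * a) for a, b in zip(x, y)]
    mse = sum(e * e for e in resid) / (n - p)
    leverage = [1 / n + (a - mx) ** 2 / sxx for a in x]
    return [e * e / (p * mse) * h / (1 - h) ** 2
            for e, h in zip(resid, leverage)]

years = [1, 2, 3, 4, 5, 6, 7, 8]
salary = [31, 33, 36, 37, 40, 42, 44, 90]   # last value is a planted outlier
d = cooks_distance(years, salary)
flagged = [i for i, v in enumerate(d) if v > 4 / len(d)]
print(flagged)                              # [7]
```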

Page 30

Step of data cleansing: Outlier treatment

The first way to handle outliers is to delete them.

• If the outlier is caused by human error (e.g. a typo or an unrealistic response), we can delete the observation.

The second way is replacement.

• Replace observations with other values (mean, etc.) instead of deleting them.

Page 31

Step of data cleansing: Outlier treatment

The third way is to turn the outliers into a variable (variablize them).

• Example) Add a variable that records whether or not the sample is engaged in a profession.

[Figure: Salary (y-axis) vs. Years of service (x-axis)]
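Variablizing means adding a column that explains why some points sit apart, instead of removing them. In the sketch below the staff records, the job titles, and the set of jobs counted as a profession are all invented for illustration.

```python
from statistics import mean

staff = [   # hypothetical records
    {"job": "Clerk", "years": 2, "salary": 30},
    {"job": "Clerk", "years": 5, "salary": 36},
    {"job": "Doctor", "years": 2, "salary": 90},
    {"job": "Clerk", "years": 8, "salary": 42},
    {"job": "Lawyer", "years": 5, "salary": 98},
]
PROFESSIONS = {"Doctor", "Lawyer"}   # assumed definition of "a profession"

# New 0/1 variable: the high salaries are now explained by a feature.
for row in staff:
    row["professional"] = 1 if row["job"] in PROFESSIONS else 0

by_flag = {
    flag: mean(r["salary"] for r in staff if r["professional"] == flag)
    for flag in (0, 1)
}
print(by_flag)   # {0: 36, 1: 94}
```

A model that includes the new variable can account for the gap between the two groups rather than treating the higher salaries as noise.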

Page 32

Step of data cleansing: Outlier treatment

The last way is sub-sampling

• If the samples belonging to a category are outliers, it is advisable to analyze them separately from the rest.

Page 33

Step of data cleansing: Feature Engineering

A series of processes that add information to data using existing variables.

• Scaling (Normalization)

• Binning (Categorization)

• Transform

• Dummy

Page 34

Step of data cleansing: Feature Engineering

A series of processes that add information to data using existing variables.

• Scaling (Normalization)

When we want to change the unit of a variable

When the distribution of a variable is skewed
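Three common rescalings are sketched below in plain Python: min-max scaling to the range 0 to 1, z-score standardization, and a log transform, which is one frequent choice for skewed variables. These particular methods are assumptions of this example, since the slide does not name specific formulas; the values are Goal2020 from [Data 2].

```python
import math
import statistics

def min_max(values):
    """Rescale to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Center on the mean and divide by the sample standard deviation."""
    mu, sd = statistics.mean(values), statistics.stdev(values)
    return [(v - mu) / sd for v in values]

def log_scale(values):
    """log(1 + v): compresses large values; needs v >= 0."""
    return [math.log1p(v) for v in values]

goals = [40, 0, 42, 28]
scaled = min_max(goals)
print(scaled)
print(z_score(goals))
print(log_scale(goals))
```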

Page 35

Step of data cleansing: Feature Engineering

A series of processes that add information to data using existing variables.

• Binning (Categorization)

Converting continuous variables into categorical variables.

Warning) Information loss!
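Binning can be sketched with a list of cut points and the standard `bisect` module. The cut points and labels below are arbitrary choices for illustration.

```python
import bisect

EDGES = [20, 30, 40]                         # hypothetical cut points
LABELS = ["<20", "20-29", "30-39", "40+"]

def bin_age(age):
    """Map a continuous age to its category label."""
    return LABELS[bisect.bisect_right(EDGES, age)]

ages = [19, 20, 21, 22, 30, 45]
binned = [bin_age(a) for a in ages]
print(binned)   # ['<20', '20-29', '20-29', '20-29', '30-39', '40+']
```

Ages 20, 21, and 22 all become "20-29", so the differences between them can no longer be recovered: this is the information loss the warning refers to.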

Page 36

Step of data cleansing: Feature Engineering

A series of processes that add information to data using existing variables.

• Dummy

Convert categorical variables to numerical variables.

Mainly used to change ordinal (categorical) variables, such as cancer classification, into numerical variables.
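Two usual encodings are sketched below in plain Python: integer codes that preserve the order of an ordinal variable, and 0/1 dummy columns for a variable without order. The stage labels, their numeric codes, and the helper name `one_hot` are assumptions for illustration.

```python
STAGE_ORDER = {"I": 1, "II": 2, "III": 3, "IV": 4}   # assumed ordinal coding

def one_hot(values):
    """One 0/1 column per level of an unordered categorical variable."""
    levels = sorted(set(values))
    return [{f"is_{lvl}": int(v == lvl) for lvl in levels} for v in values]

stages = ["II", "I", "IV", "II"]          # hypothetical cancer stages
codes = [STAGE_ORDER[s] for s in stages]
print(codes)                              # [2, 1, 4, 2]

dummies = one_hot(["Male", "Female", "Male"])
print(dummies[0])                         # {'is_Female': 0, 'is_Male': 1}
```

Integer codes imply that the steps between levels are equally spaced, which is itself a modeling assumption.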

Page 37

One picture is worth a thousand words.

Data cleansing is harder than you think.

Especially if the data is large, you will want to cry.

You can feel this by programming in R with real data.

Page 38

One picture is worth a thousand words.

Example of SAC (Split-Apply-Combine) process
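Since the figure itself is not reproduced here, the three SAC steps can be sketched in plain Python using the rows of [Data 1]: split the players by Job, apply the mean to each group's ages, and combine the results into one table. Choosing Job as the grouping variable and the mean as the function is an assumption of this example.

```python
from collections import defaultdict
from statistics import mean

players = [   # (Name, Job, Age) from [Data 1]
    ("Gildong Hong", "Football player", 20),
    ("Samsun Kim", "Football player", 21),
    ("Heungmin Son", "Football player", 22),
    ("Cristiano Ronaldo", "Robber", 30),
    ("Lionel Messi", "Football player", 30),
    ("Kylian Mbappé", "Football player", 20),
]

# Split: one list of ages per job.
groups = defaultdict(list)
for name, job, age in players:
    groups[job].append(age)

# Apply the summary to each group, then combine into one result table.
result = {job: mean(ages) for job, ages in groups.items()}
print(result)   # {'Football player': 22.6, 'Robber': 30}
```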

Pages 39 to 44

One picture is worth a thousand words.

[Figure-only slides]

Page 45

Everything is already implemented in R

There is an R package for data cleansing.

Once you understand how the data changes, a single line of R code is enough.

Page 46

End of Slide