Introduction to Big Data - Harvard University · Introduction to Big Data Chapter 7 & 8 (Week 4)...

Introduction to Big Data Chapter 7 & 8 (Week 4) Data preprocessing DCCS208(02) Korea University 2019 Fall Asst. Prof. Minseok Seo [email protected]


Introduction to Big Data

Chapter 7 & 8 (Week 4): Data preprocessing

DCCS208(02) Korea University 2019 Fall

Asst. Prof. Minseok Seo [email protected]


Contents

1. Quality Control
• Types of Errors
• Solutions for each error
• Data Cleansing

2. Distance metric
• Similarity and Dissimilarity
• Various Distance Measures

copyrightⓒ 2018 All rights reserved by Korea University

Funny situation


Additional Slide I missed last class


01 Quality Control: Data preprocessing


Quality Control for Data: Quality control

Noise

Outliers

Missing values

Duplicate data


Quality Control for Data: Noise

Noise refers to any random fluctuation in the data that hinders perception of the signal.

Class Noise vs. Attribute Noise: A Quantitative Study, Artificial Intelligence Review 22 (2004) 177-210


Quality Control for Data: Outliers

In statistics, an outlier is a data point that differs significantly from other observations.


Quality Control for Data: Missing values

In statistics, missing data, or missing values, occur when no data value is stored for the variable in an observation.

Missing data are a common occurrence and can have a significant effect on the conclusions that can be drawn from the data.


Quality Control for Data: Duplicate Data

A data set may include data objects that are duplicates, or almost duplicates, of one another.

This is a common issue when collecting data from heterogeneous sources.

• e.g., the same person appears several times because they have multiple email addresses
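As a rough illustration of the email example, the sketch below merges records that agree on a normalised key. The record layout, the field names `name`, `birth`, and `email`, and the example.com addresses are all hypothetical; real de-duplication usually needs fuzzier matching than this.

```python
def deduplicate(records, key_fields=("name", "birth")):
    """Keep one record per normalised key and collect its e-mail addresses."""
    merged = {}
    for rec in records:
        # Normalise the key so that case and stray spaces do not split a person.
        key = tuple(str(rec[f]).strip().lower() for f in key_fields)
        if key not in merged:
            merged[key] = {**rec, "emails": [rec["email"]]}
        elif rec["email"] not in merged[key]["emails"]:
            merged[key]["emails"].append(rec["email"])
    return list(merged.values())

# Hypothetical records gathered from two different sources.
records = [
    {"name": "Kim Minsu", "birth": "1990-01-01", "email": "minsu@example.com"},
    {"name": "kim minsu ", "birth": "1990-01-01", "email": "kms@example.org"},
    {"name": "Lee Jiwoo", "birth": "1988-05-12", "email": "jiwoo@example.com"},
]
people = deduplicate(records)
```

Here the first two records collapse into one person with two addresses, while the third stays separate.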


Quality Control for Data: Outlier detection and removal

There are several algorithms and statistical methods for finding outliers.

We can remove such outliers before building a specific model.
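The slide does not name a particular method, so the following is only one possible sketch: the interquartile-range (IQR) rule, which flags values far outside the middle half of the data. The multiplier `k=1.5` is the conventional default and the sample data are hypothetical.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

data = [10, 12, 11, 13, 12, 11, 95]   # hypothetical measurements
outliers = iqr_outliers(data)
# Drop the flagged values before fitting any model.
cleaned = [v for v in data if v not in outliers]
```

Whether a flagged point should really be removed depends on the application; the rule only marks candidates.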


Quality Control for Data: Imputation for missing values

One widely used way to handle missing data is imputation.

In statistics, imputation is the process of replacing missing data with substituted values.

It assigns a predicted value to each missing entry by inferring patterns from known information or observed values.
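A minimal sketch of the simplest form of imputation, assuming missing entries are encoded as `None` and that the mean of the observed values is an acceptable substitute. The slide does not prescribe this particular strategy, and the `ages` column is hypothetical.

```python
import statistics

def impute_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]

ages = [23, None, 31, 27, None]   # hypothetical column with missing values
imputed = impute_mean(ages)
```

Mean imputation keeps the column mean unchanged but shrinks its variance, which is one reason model-based imputation is often preferred.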


Quality Control for Data: Data cleaning

Data cleansing is the process of detecting corrupt, inaccurate, incomplete, or irrelevant parts of the data and then replacing, modifying, or deleting this dirty or coarse data.


Data Preprocessing: Various data preprocessing methods

Aggregation

Sampling

Dimensionality reduction

Feature selection

Feature extraction

...

Data preprocessing is an important step in the data mining process. The phrase "garbage in, garbage out" is particularly applicable to data mining and machine learning projects.


02 Metric: Distance function


Similarity and Dissimilarity: Fundamentals of all data science methods

Similarity index
• Numeric value that indicates how similar two objects are
• In general, the more alike two objects are, the higher the similarity value.

Dissimilarity index
• Numeric value that indicates how different two objects are
• In general, the higher the similarity between objects, the lower the dissimilarity.


Diverse Distance Metric: Euclidean Distance

In mathematics, the Euclidean distance or Euclidean metric is the "ordinary" straight-line distance between two points in Euclidean space.

$$\mathrm{dist} = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}$$

• n = number of dimensions (attributes)

• p_k, q_k = values of the k-th dimension (attribute) of data objects p and q
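The formula translates almost term by term into code. This is a small sketch rather than a reference implementation; the function name `euclidean` and the sample points are chosen here for illustration.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length coordinate sequences."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of dimensions")
    # Sum the squared per-dimension differences, then take the square root.
    return math.sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))

d = euclidean((0, 0), (3, 4))   # the classic 3-4-5 right triangle
```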


Diverse Distance Metric: Euclidean Distance (Cont.)

Let’s calculate the distance between each pair of points.

[Figure: points p1 to p4 plotted in the plane, x axis from 0 to 6, y axis from 0 to 3]

point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

Distance (symmetric) matrix

      p1      p2      p3      p4
p1    0       2.828   3.162   5.099
p2    2.828   0       1.414   3.162
p3    3.162   1.414   0       2
p4    5.099   3.162   2       0
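The matrix can be recomputed from the four points in a few lines. The sketch uses `math.dist` from the Python standard library (available from Python 3.8); the dictionary layout is just one convenient choice.

```python
import math

# Coordinates taken from the point table on the slide.
points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

def distance_matrix(pts):
    """Pairwise Euclidean distances as a nested dict keyed by point name."""
    names = list(pts)
    return {a: {b: math.dist(pts[a], pts[b]) for b in names} for a in names}

matrix = distance_matrix(points)
for name, row in matrix.items():
    print(name, " ".join(f"{d:.3f}" for d in row.values()))
```

Because dist(p, q) = dist(q, p), only the upper triangle actually needs to be computed; the full matrix is kept here for readability.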


Diverse Distance Metric: Euclidean Distance (Cont.)

Manhattan Distance

[Figure: paths between two points on a grid. Green: Euclidean distance. Others: Manhattan distance.]


Diverse Distance Metric: Euclidean Distance (Cont.)

L2 norm: d(x,y) = square root of the sum of the squares of the differences between x and y in each dimension.

• The most common notion of “distance.”

L1 norm: d(x,y) = sum of the absolute differences in each dimension.

• Manhattan distance = distance if you had to travel along coordinates only.


Diverse Distance Metric: Euclidean Distance (Cont.)

Let’s practice with an example.

a = (5,5)

b = (9,8)

[Figure: right triangle between a and b with legs 4 and 3 and hypotenuse 5]

L2-norm: dist(a,b) = √(4² + 3²) = 5

L1-norm: dist(a,b) = 4 + 3 = 7
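The same arithmetic can be checked in code. The variable names `diffs`, `l1`, and `l2` are illustrative only.

```python
import math

a, b = (5, 5), (9, 8)

# Per-dimension absolute differences: 4 along x and 3 along y.
diffs = [abs(x - y) for x, y in zip(a, b)]

l1 = sum(diffs)                              # Manhattan (L1) distance
l2 = math.sqrt(sum(d ** 2 for d in diffs))   # Euclidean (L2) distance
```

For any pair of points the L1 distance is at least as large as the L2 distance, because travelling along the axes can never be shorter than the straight line.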


Diverse Distance Metric: General Distance (Cont.)

Minkowski Distance

$$\mathrm{dist} = \left( \sum_{k=1}^{n} |p_k - q_k|^{r} \right)^{1/r}$$

• Where r is a parameter, n is the number of dimensions (attributes), and p_k and q_k are, respectively, the k-th attributes (components) of data objects p and q.
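A short sketch of the Minkowski formula with r as an explicit argument. The guard against r < 1 is an addition here, reflecting that the triangle inequality fails for such values; the sample points are illustrative.

```python
def minkowski(p, q, r):
    """Minkowski distance of order r (r >= 1) between sequences p and q."""
    if r < 1:
        raise ValueError("r must be at least 1 for a valid metric")
    return sum(abs(pk - qk) ** r for pk, qk in zip(p, q)) ** (1 / r)

p, q = (5, 5), (9, 8)
d1 = minkowski(p, q, 1)   # city-block case
d2 = minkowski(p, q, 2)   # Euclidean case
```

Note that r only changes how the per-dimension differences are combined; the number of dimensions n is fixed by the length of p and q.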


Diverse Distance Metric: General Distance (Cont.)

r = 1. City block (Manhattan, taxicab, L1 norm) distance.

• A common example of this is the Hamming distance, which is just the number of bits that are different between two binary vectors

r = 2. Euclidean distance

r → ∞. “Supremum” or Chebyshev (Lmax norm, L∞ norm) distance. This is the maximum difference between any component of the vectors.

Do not confuse r with n, i.e., all these distances are defined for all numbers of dimensions.
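To illustrate the r = 1 bullet, the sketch below counts differing bits between two binary vectors and compares the count with the L1 distance of the same vectors. The vectors `u` and `v` are hypothetical.

```python
def hamming_bits(u, v):
    """Number of positions where two equal-length bit vectors differ."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a != b for a, b in zip(u, v))

u = [1, 0, 1, 1, 0]
v = [1, 1, 0, 1, 0]

h = hamming_bits(u, v)
l1 = sum(abs(a - b) for a, b in zip(u, v))   # Minkowski distance with r = 1
```

For 0/1 vectors each differing position contributes exactly 1 to the L1 sum, so the two numbers agree.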

https://proofwiki.org/wiki/Chebyshev_Distance_is_Limit_of_P-Product_Metric


Diverse Distance Metric: Chebyshev Distance

Let’s practice to get familiar with the Chebyshev distance.

a = (5,5)

b = (9,8)

[Figure: right triangle between a and b with legs 4 and 3 and hypotenuse 5]

L2-norm: dist(a,b) = √(4² + 3²) = 5

L1-norm: dist(a,b) = 4 + 3 = 7

L∞-norm: dist(a,b) = max(4, 3) = 4
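A small numerical sketch of why the maximum appears: as r grows, the Minkowski distance between a and b approaches the largest per-coordinate difference. The helper names and the chosen values of r are illustrative.

```python
def chebyshev(p, q):
    """L-infinity distance: the largest per-coordinate difference."""
    return max(abs(pk - qk) for pk, qk in zip(p, q))

def minkowski(p, q, r):
    """Minkowski distance of order r."""
    return sum(abs(pk - qk) ** r for pk, qk in zip(p, q)) ** (1 / r)

a, b = (5, 5), (9, 8)

# The printed values shrink from the L1 distance towards the maximum difference.
for r in (1, 2, 10, 100):
    print(r, round(minkowski(a, b, r), 4))
print("inf", chebyshev(a, b))
```

The largest difference dominates the sum once it is raised to a high power, which is the intuition behind the limit.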


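Diverse Distance Metric: Edit Distance

The Edit distance between two strings of equal length is the number of positions at which the corresponding symbols are different.

• In other words, it measures the minimum number of substitutions required to change one string into the other.

• Strictly speaking, this substitution-only count on equal-length strings is the Hamming distance; the general edit (Levenshtein) distance also allows insertions and deletions.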

Example : The Edit distance between:
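The slide's own example strings are not preserved in this transcript, so the sketch below uses hypothetical strings instead. It shows the substitution-only count described above next to the general Levenshtein edit distance, computed with the standard dynamic-programming recurrence; both function names are chosen here.

```python
def substitution_distance(s, t):
    """Count positions where two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("strings must have equal length")
    return sum(a != b for a, b in zip(s, t))

def levenshtein(s, t):
    """General edit distance allowing insert, delete and substitute."""
    prev = list(range(len(t) + 1))   # distances from "" to prefixes of t
    for i, a in enumerate(s, start=1):
        curr = [i]                   # distance from s[:i] to ""
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,              # delete a
                            curr[j - 1] + 1,          # insert b
                            prev[j - 1] + (a != b)))  # substitute or match
        prev = curr
    return prev[-1]

same_length = substitution_distance("karolin", "kathrin")
general = levenshtein("kitten", "sitting")
```

The second function also works for strings of different lengths, which the substitution-only definition cannot handle.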


Diverse Distance Metric: Correlation Coefficients

The correlation coefficient represents the linear relationship between two continuous variables.


Notes on correlation coefficient: Correlation Coefficients

The correlation coefficient is a value, r, between –1 and 1.

• r > 0 suggests a positive (increasing) relationship

• r < 0 suggests a negative (decreasing) relationship

• The closer the value is to 0, the more scattered the data.

• The closer the value is to 1 or –1, the less scattered the data is.

$$r_{xy} = \frac{\mathrm{Cov}(x, y)}{\sigma_x \sigma_y} = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$$

S_xy ??
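The slide leaves the meaning of the S terms as a question. One common reading, assumed here, is that S_xy is the sum of products of deviations from the means, and S_xx and S_yy are the corresponding sums of squared deviations, giving r = S_xy / sqrt(S_xx · S_yy). The sketch below follows that reading; the data are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson r computed as S_xy / sqrt(S_xx * S_yy)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    s_xy = sum((a - mx) * (b - my) for a, b in zip(x, y))  # co-deviation
    s_xx = sum((a - mx) ** 2 for a in x)                   # spread of x
    s_yy = sum((b - my) ** 2 for b in y)                   # spread of y
    return s_xy / math.sqrt(s_xx * s_yy)

x = [1, 2, 3, 4, 5]
r_up = pearson(x, [2, 4, 6, 8, 10])     # perfectly increasing line
r_down = pearson(x, [10, 8, 6, 4, 2])   # perfectly decreasing line
```

The function is undefined when either variable is constant, since the denominator is then zero.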


Pearson vs Spearman Correlations: Correlation Coefficients

Pearson Correlation Coefficient

Spearman Correlation Coefficient

Here rg() denotes the rank of a variable's values.
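The formulas on this slide are not preserved in the transcript. A standard way to describe the Spearman coefficient, assumed here, is the Pearson coefficient applied to the ranks of the two variables. The sketch below follows that description, with ties given the average of their ranks; all function names and data are illustrative.

```python
import math

def ranks(values):
    """Average ranks (1-based); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation from sums of deviations."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    s_xy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    s_xx = sum((a - mx) ** 2 for a in x)
    s_yy = sum((b - my) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank variables."""
    return pearson(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]   # monotone but not linear in x
r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
```

On this data the rank-based coefficient is exactly 1 because y always increases with x, while the Pearson coefficient stays below 1 because the relationship is not a straight line.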


Similarity and Dissimilarity: Let’s go back to the class goal

Relationship??


End of Slide