Introduction to Big Data - Harvard University · Introduction to Big Data Chapter 7 & 8 (Week 4)...

Introduction to Big Data

Chapter 7 & 8 (Week 4) Data preprocessing

DCCS208(02) Korea University 2019 Fall

Asst. Prof. Minseok [email protected]

Contents

1. Quality Control

Types of Errors

Solutions for each error

Data Cleansing

2. Distance metric

Similarity and Dissimilarity

Various Distance Measures

copyrightⓒ 2018 All rights reserved by Korea University

Funny situation


Additional Slide I missed last class

01 Quality Control: Data preprocessing


Quality Control for Data: Quality control

Noise

Outliers

Missing values

Duplicate data


Quality Control for Data: Noise

Noise refers to any random fluctuation in the data that hinders perception of a signal.

Class Noise vs. Attribute Noise: A Quantitative Study, Artificial Intelligence Review 22 (2004) 177-210


Quality Control for Data: Outliers

In statistics, an outlier is a data point that differs significantly from other observations.


Quality Control for Data: Missing values

In statistics, missing data, or missing values, occur when no data value is stored for the variable in an observation.

Missing data are a common occurrence and can have a significant effect on the conclusions that can be drawn from the data.


Quality Control for Data: Duplicate Data

A data set may include data objects that are duplicates, or almost duplicates, of one another.

This is a common issue when collecting data from heterogeneous sources.

• e.g., a person who has multiple email addresses
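
A minimal sketch of collapsing such near-duplicates, assuming records are dictionaries and that a normalized name plus phone number identifies a person. The field names, the sample records, and the `merge_duplicates` helper are hypothetical and are not taken from the lecture.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(text.lower().split())


def merge_duplicates(records):
    """Group records by (name, phone) and collect every email seen for that key."""
    merged = {}
    for rec in records:
        key = (normalize(rec["name"]), rec["phone"])
        entry = merged.setdefault(
            key, {"name": rec["name"], "phone": rec["phone"], "emails": []}
        )
        if rec["email"] not in entry["emails"]:
            entry["emails"].append(rec["email"])
    return list(merged.values())


# Hypothetical records: the first two describe the same person.
records = [
    {"name": "Jane Doe", "phone": "555-0100", "email": "jane@example.com"},
    {"name": "jane  doe", "phone": "555-0100", "email": "jdoe@example.org"},
    {"name": "John Roe", "phone": "555-0199", "email": "john@example.com"},
]
people = merge_duplicates(records)
print(len(people))  # 2
```

Real data usually needs a fuzzier matching key than this; the choice of key is the main design decision.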


Quality Control for Data: Outlier detection and removal

There are several algorithms and statistical methods for finding outliers.

We can remove such outliers before building a specific model.
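
The slides do not name a particular method, so the sketch below uses the interquartile-range (IQR) rule as one common choice. The `iqr_outliers` helper, the multiplier `k = 1.5`, and the sample readings are assumptions made for illustration.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Split values into (kept, outliers) using the interquartile-range rule."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    kept = [v for v in values if low <= v <= high]
    outliers = [v for v in values if v < low or v > high]
    return kept, outliers


# Hypothetical sensor readings with one obviously extreme value.
readings = [10, 12, 11, 13, 12, 11, 95]
kept, outliers = iqr_outliers(readings)
print(outliers)  # [95]
```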


Quality Control for Data: Imputation for missing values

Imputation is a widely used technique for handling missing data.

In statistics, imputation is the process of replacing missing data with substituted values.

It assigns a predicted value to each missing entry by inferring patterns from known information or observed values.
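
A minimal sketch of the simplest form of imputation, replacing each missing entry with the mean of the observed values. It assumes missing values are encoded as `None`; the `impute_mean` helper and the sample data are hypothetical, and model-based imputation would replace the mean with a prediction.

```python
import statistics


def impute_mean(values):
    """Replace None entries with the mean of the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical column with two missing entries.
heights = [170.0, None, 165.0, 175.0, None]
print(impute_mean(heights))  # [170.0, 170.0, 165.0, 175.0, 170.0]
```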


Quality Control for Data: Data cleaning

Data cleansing is the process of detecting and correcting (or removing) corrupt or inaccurate, incomplete, incorrect, or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data.


Data Preprocessing: Various data preprocessing methods

Aggregation

Sampling

Dimensionality reduction

Feature selection

Feature extraction

...

Data preprocessing is an important step in the data mining process. The phrase "garbage in, garbage out" is particularly applicable to data mining and machine learning projects.

02 Metric: Distance function


Similarity and Dissimilarity: Fundamentals of all data science methods

Similarity index

• Numeric value that indicates how similar two objects are

• In general, the more alike two objects are, the higher the similarity.

Dissimilarity index

• Numeric value that indicates how different two objects are

• In general, the higher the similarity between two objects, the lower the dissimilarity.


Similarity and Dissimilarity: Fundamentals of all data science methods


Diverse Distance Metric: Euclidean Distance

In mathematics, the Euclidean distance or Euclidean metric is the "ordinary" straight-line distance between two points in Euclidean space.

\[ \mathrm{dist}(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2} \]

• n = number of dimensions (attributes)

• p_k, q_k = values of the k-th dimension of p and q


Diverse Distance Metric: Euclidean Distance (Cont.)

Let's calculate the distance between each pair of points.

[Figure: scatter plot of p1, p2, p3, and p4 with x-axis ticks 0 to 6 and y-axis ticks 0 to 3]

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

      p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0

Distance (Symmetric) Matrix
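
The distance matrix can be reproduced with a few lines of Python. This is a sketch using only the standard library; the `euclidean` helper name is an assumption, while the coordinates are the four points from the slide.

```python
import math

# The four points from the slide.
points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}


def euclidean(p, q):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))


names = list(points)
matrix = [[euclidean(points[a], points[b]) for b in names] for a in names]

# Print one row per point, rounded to three decimals as on the slide.
for name, row in zip(names, matrix):
    print(name, " ".join(f"{d:.3f}" for d in row))
```

Because dist(p, q) = dist(q, p), the matrix is symmetric and its diagonal is zero.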



Manhattan Distance

Green: Euclidean Dist.

Others: Manhattan Dist.



L2 norm: d(x,y) = square root of the sum of the squares of the differences between x and y in each dimension.

• The most common notion of “distance.”

L1 norm: d(x,y) = sum of the absolute differences between x and y in each dimension.

• Manhattan distance = the distance if you could travel only along the coordinate axes.



a = (5,5)

b = (9,8)

L2-norm: dist(a,b) = √(4² + 3²) = 5

L1-norm: dist(a,b) = 4 + 3 = 7

[Figure: right triangle between a and b with legs 4 and 3 and hypotenuse 5]

Let's practice with another example.


Diverse Distance Metric: General Distance (Cont.)

Minkowski Distance

\[ \mathrm{dist}(p, q) = \left( \sum_{k=1}^{n} |p_k - q_k|^r \right)^{1/r} \]

• Where r is a parameter, n is the number of dimensions (attributes), and p_k and q_k are, respectively, the k-th attributes (components) of data objects p and q.


Diverse Distance Metric: General Distance (Cont.)

r = 1. City block (Manhattan, taxicab, L1 norm) distance.

• A common example of this is the Hamming distance, which is just the number of bits that are different between two binary vectors

r = 2. Euclidean distance

r → ∞. “Supremum” or Chebyshev (Lmax norm, L∞ norm) distance. This is the maximum difference between any component of the vectors.

Do not confuse r with n, i.e., all these distances are defined for all numbers of dimensions.
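
The three special cases can be checked numerically. The sketch below uses the practice points a = (5,5) and b = (9,8) from these slides; the `minkowski` and `chebyshev` helper names are assumptions, and a large finite r stands in for the limit r → ∞.

```python
def minkowski(p, q, r):
    """Minkowski distance of order r (r >= 1) between equal-length vectors."""
    return sum(abs(pk - qk) ** r for pk, qk in zip(p, q)) ** (1 / r)


def chebyshev(p, q):
    """Limit of the Minkowski distance as r grows: the largest component gap."""
    return max(abs(pk - qk) for pk, qk in zip(p, q))


a, b = (5, 5), (9, 8)
print(minkowski(a, b, 1))   # 7.0 (Manhattan, L1)
print(minkowski(a, b, 2))   # 5.0 (Euclidean, L2)
print(chebyshev(a, b))      # 4   (L-infinity)

# A large finite r already comes very close to the Chebyshev value.
print(minkowski(a, b, 50))
```

Note that r changes how component differences are combined, while n (here 2) is fixed by the data.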

https://proofwiki.org/wiki/Chebyshev_Distance_is_Limit_of_P-Product_Metric


Diverse Distance Metric: Chebyshev Distance

Let's practice to become familiar with the Chebyshev distance.

a = (5,5)

b = (9,8)

L2-norm: dist(a,b) = √(4² + 3²) = 5

L1-norm: dist(a,b) = 4 + 3 = 7

L∞-norm: dist(a,b) = max(4, 3) = 4

[Figure: right triangle between a and b with legs 4 and 3 and hypotenuse 5]


Diverse Distance Metric: Edit Distance

The Edit distance between two strings of equal length is the number of positions at which the corresponding symbols are different. This substitution-only form is also known as the Hamming distance; the general edit (Levenshtein) distance also allows insertions and deletions.

• Put another way, it measures the minimum number of substitutions required to change one string into the other.

Example: The Edit distance between:
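
The slide's own example strings did not survive extraction, so the strings below are stand-ins chosen for illustration. The sketch implements the substitution-only distance defined above; the `hamming_distance` name is an assumption.

```python
def hamming_distance(s, t):
    """Count the positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("strings must have equal length")
    return sum(cs != ct for cs, ct in zip(s, t))


# Hypothetical example strings (not from the slide).
print(hamming_distance("karolin", "kathrin"))  # 3
print(hamming_distance("10110", "10011"))      # 2
```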


Diverse Distance Metric: Correlation Coefficients

The correlation coefficient represents the linear relationship between two continuous variables.


Notes on correlation coefficient: Correlation Coefficients

The correlation coefficient is a value, r, between –1 and 1.

• r > 0 suggests a positive (increasing) relationship

• r < 0 suggests a negative (decreasing) relationship

• The closer the value is to 0, the more scattered the data.

• The closer the value is to 1 or –1, the less scattered the data is.

\[ r_{xy} = \frac{\mathrm{Cov}(x, y)}{\sigma_x \sigma_y} = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} \]

S_xy ??


Pearson vs Spearman Correlations: Correlation Coefficients

Pearson Correlation Coefficient

Spearman Correlation Coefficient

Here rg() converts a variable's values to their ranks.
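
The formulas on this slide were images and did not survive extraction. As a hedged sketch under the standard definitions, Pearson's coefficient is computed below from the sums S_xy, S_xx, and S_yy, and Spearman's coefficient is Pearson's coefficient applied to the ranks. The helper names, the average-rank tie handling, and the sample data are assumptions.

```python
import math


def pearson(x, y):
    """Pearson r = S_xy / sqrt(S_xx * S_yy), using sums of centered products."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman coefficient: Pearson correlation of the rank variables."""
    return pearson(ranks(x), ranks(y))


# Hypothetical data: y = x**3 is monotonic but not linear.
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]
print(round(pearson(x, y), 3))  # below 1, because the relationship is curved
print(spearman(x, y))           # 1.0, because the ordering is preserved exactly
```

The contrast shows the practical difference: Pearson rewards linearity, while Spearman only asks whether one variable rises whenever the other does.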


Similarity and Dissimilarity: Let’s go back to the class goal

Relationship??


Similarity and Dissimilarity: Let’s go back to the class goal

End of Slide
