Minseok Kwon Department of Computer Science Rochester Institute of Technology [email protected]
Introduction to Big Data - Harvard University · Introduction to Big Data Chapter 7 & 8 (Week 4)...
Introduction to Big Data
Chapter 7 & 8 (Week 4): Data preprocessing
DCCS208(02) Korea University 2019 Fall
Asst. Prof. Minseok [email protected]
Contents
1. Quality Control
• Types of Errors
• Solutions for each error
• Data Cleansing
2. Distance metric
• Similarity and Dissimilarity
• Various Distance Measures
copyrightⓒ 2018 All rights reserved by Korea University 3
Funny situation
Additional Slide I missed last class
01 Quality Control: Data preprocessing
Quality Control for Data: Quality control
Noise
Outliers
Missing values
Duplicate data
Quality Control for Data: Noise
Noise refers to any random fluctuation in the data that hinders perception of a signal.
Class Noise vs. Attribute Noise: A Quantitative Study, Artificial Intelligence Review 22 (2004) 177-210
Quality Control for Data: Outliers
In statistics, an outlier is a data point that differs significantly from other observations.
Quality Control for Data: Missing values
In statistics, missing data, or missing values, occur when no data value is stored for the variable in an observation.
Missing data are a common occurrence and can have a significant effect on the conclusions that can be drawn from the data.
Quality Control for Data: Duplicate Data
A data set may include data objects that are duplicates, or almost duplicates, of one another.
This is a common issue when collecting data from heterogeneous sources.
• e.g., a person who has multiple email addresses
Quality Control for Data: Outlier detection and removal
There are several algorithms and statistical methods for finding outliers.
We can remove such outliers before building a specific model.
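The slide does not name a specific detection method. As one illustration, the sketch below assumes the interquartile-range (IQR) rule, a common statistical choice: values further than k times the IQR outside the first or third quartile are flagged. The function name `iqr_outliers`, the factor `k=1.5`, and the sample data are assumptions made for this example.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # quantiles(n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical measurements with one obviously extreme value.
data = [10, 12, 11, 13, 12, 11, 95]
outliers = iqr_outliers(data)
cleaned = [v for v in data if v not in outliers]

print(outliers)  # [95]
print(cleaned)   # [10, 12, 11, 13, 12, 11]
```

Whether a flagged point should actually be removed depends on the data and the model; the rule only marks candidates.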
Quality Control for Data: Imputation for missing values
One widely used method for handling missing data is imputation.
In statistics, imputation is the process of replacing missing data with substituted values.
It assigns a predicted value to each missing entry by inferring patterns from well-known information or observed values.
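A minimal sketch of the idea, assuming the simplest possible strategy (mean imputation): every missing entry is replaced by the mean of the observed values. Representing a missing value as `None`, the function name `impute_mean`, and the sample data are assumptions for illustration; the slide does not prescribe a particular imputation technique.

```python
import statistics

def impute_mean(values):
    """Replace None entries with the mean of the observed (non-None) entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]

# Hypothetical attribute with two missing observations.
ages = [20.0, None, 30.0, 40.0, None]
imputed = impute_mean(ages)
print(imputed)
```

Mean imputation keeps the mean of the attribute unchanged but shrinks its variance, which is one reason model-based imputation is often preferred in practice.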
Quality Control for Data: Data cleaning
Data cleansing is the process of detecting corrupt, inaccurate, incomplete, or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data.
Data Preprocessing: Various data preprocessing methods
Aggregation
Sampling
Dimensionality reduction
Feature selection
Feature extraction
...
Data preprocessing is an important step in the data mining process. The phrase "garbage in, garbage out" is particularly applicable to data mining and machine learning projects.
02 Metric: Distance function
Similarity and Dissimilarity: Fundamentals of all data science methods
Similarity index
• Numeric value that indicates how similar two objects are
• In general, the more alike two objects are, the higher the similarity.
Dissimilarity index
• Numeric value that indicates how different two objects are
• In general, the higher the similarity between objects, the lower the dissimilarity.
Diverse Distance Metric: Euclidean Distance
In mathematics, the Euclidean distance or Euclidean metric is the "ordinary" straight-line distance between two points in Euclidean space.
$$\mathrm{dist}(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}$$
• n = number of dimensions (attributes)
• p_k, q_k = value of the k-th dimension of data objects p and q
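The Euclidean distance formula translates almost symbol for symbol into code. The sketch below is one way to write it; the function name `euclidean_distance` and the sample points are assumptions for illustration.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two points with the same number of dimensions."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of dimensions")
    # Sum the squared per-dimension differences, then take the square root.
    return math.sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))

print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```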
Diverse Distance Metric: Euclidean Distance (Cont.)
Let's calculate the distance between each pair of points.
[Figure: points p1 to p4 plotted on x (0 to 6) and y (0 to 3) axes]

point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

Distance (Symmetric) Matrix
        p1      p2      p3      p4
p1      0       2.828   3.162   5.099
p2      2.828   0       1.414   3.162
p3      3.162   1.414   0       2
p4      5.099   3.162   2       0
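The distance matrix for p1 to p4 can be reproduced with a few lines of code. This is a sketch using `math.dist` from the Python standard library (available from Python 3.8); the dictionary layout and the function name `distance_matrix` are assumptions for illustration.

```python
import math

# Coordinates taken from the point table.
points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

def distance_matrix(pts):
    """Euclidean distance between every pair of named points."""
    names = list(pts)
    return {a: {b: math.dist(pts[a], pts[b]) for b in names} for a in names}

matrix = distance_matrix(points)
for name, row in matrix.items():
    print(name, [round(d, 3) for d in row.values()])
```

The matrix is symmetric with zeros on the diagonal, so only the upper (or lower) triangle actually needs to be computed.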
Diverse Distance Metric: Euclidean Distance (Cont.)
Manhattan Distance
[Figure: paths between two points on a grid. Green: Euclidean distance. Others: Manhattan distance.]
Diverse Distance Metric: Euclidean Distance (Cont.)
L2 norm: d(x,y) = square root of the sum of the squares of the differences between x and y in each dimension.
• The most common notion of "distance."
L1 norm: d(x,y) = sum of the absolute differences between x and y in each dimension.
• Manhattan distance = distance if you had to travel along coordinate axes only.
Diverse Distance Metric: Euclidean Distance (Cont.)
Let's practice with an example.
a = (5,5)
b = (9,8)
L2-norm: dist(a,b) = √(4² + 3²) = 5
L1-norm: dist(a,b) = 4 + 3 = 7
[Figure: right triangle between a and b with legs 4 and 3 and hypotenuse 5]
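The worked example for a = (5,5) and b = (9,8) can be checked directly. The sketch below computes the per-dimension differences once and derives both norms from them; the variable names are assumptions for illustration.

```python
import math

a = (5, 5)
b = (9, 8)

# Absolute difference in each dimension: [4, 3]
diffs = [abs(ak - bk) for ak, bk in zip(a, b)]

l1 = sum(diffs)                             # Manhattan: travel along the axes
l2 = math.sqrt(sum(d ** 2 for d in diffs))  # Euclidean: straight line

print(l1, l2)  # 7 5.0
```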
Diverse Distance Metric: General Distance (Cont.)
Minkowski Distance
$$\mathrm{dist}(p, q) = \left( \sum_{k=1}^{n} |p_k - q_k|^r \right)^{1/r}$$
• Where r is a parameter, n is the number of dimensions (attributes), and p_k and q_k are, respectively, the k-th attributes (components) of data objects p and q.
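A minimal sketch of the Minkowski distance as a single function with r as a parameter. The function name `minkowski_distance` and the restriction to r >= 1 (the range in which the formula defines a metric) are choices made for this example.

```python
def minkowski_distance(p, q, r):
    """Minkowski distance of order r between equal-length points p and q."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of dimensions")
    if r < 1:
        raise ValueError("r must be at least 1")
    # Raise each absolute difference to the power r, sum, then take the r-th root.
    return sum(abs(pk - qk) ** r for pk, qk in zip(p, q)) ** (1 / r)

a, b = (5, 5), (9, 8)
print(minkowski_distance(a, b, 1))  # 7.0 (Manhattan)
print(minkowski_distance(a, b, 2))  # 5.0 (Euclidean)
```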
Diverse Distance Metric: General Distance (Cont.)
r = 1. City block (Manhattan, taxicab, L1 norm) distance.
• A common example of this is the Hamming distance, which is just the number of bits that are different between two binary vectors
r = 2. Euclidean distance
r → ∞. "Supremum" or Chebyshev (Lmax norm, L∞ norm) distance. This is the maximum difference between any component of the vectors.
Do not confuse r with n, i.e., all these distances are defined for all numbers of dimensions.
https://proofwiki.org/wiki/Chebyshev_Distance_is_Limit_of_P-Product_Metric
Diverse Distance Metric: Chebyshev Distance
Let's practice to get familiar with the Chebyshev distance.
a = (5,5)
b = (9,8)
L2-norm: dist(a,b) = √(4² + 3²) = 5
L1-norm: dist(a,b) = 4 + 3 = 7
L∞-norm: dist(a,b) = max(4, 3) = 4
[Figure: right triangle between a and b with legs 4 and 3 and hypotenuse 5]
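To see numerically why the Chebyshev distance is described as the r → ∞ case, the sketch below evaluates the Minkowski distance for growing r on the same pair of points and compares it with the maximum component difference. Function names and the chosen values of r are assumptions for illustration.

```python
def chebyshev_distance(p, q):
    """Largest absolute difference over all dimensions."""
    return max(abs(pk - qk) for pk, qk in zip(p, q))

def minkowski_distance(p, q, r):
    """Minkowski distance of order r (r >= 1)."""
    return sum(abs(pk - qk) ** r for pk, qk in zip(p, q)) ** (1 / r)

a, b = (5, 5), (9, 8)

# As r grows, the largest difference (4) dominates the sum.
approximations = [minkowski_distance(a, b, r) for r in (1, 2, 4, 16, 64)]
for r, d in zip((1, 2, 4, 16, 64), approximations):
    print(r, d)

print(chebyshev_distance(a, b))  # 4
```

The printed values decrease from 7 (r = 1) through 5 (r = 2) toward 4, the Chebyshev distance.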
Diverse Distance Metric: Edit Distance
For two strings of equal length, with substitution as the only allowed edit, the edit distance is the number of positions at which the corresponding symbols differ (this restricted form is also called the Hamming distance).
• In other words, it measures the minimum number of substitutions required to change one string into the other.
Example: The edit distance between:
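The slide's own example strings are not reproduced in this transcript. As a stand-in, the sketch below counts differing positions for two illustrative pairs; the strings and the function name `substitution_distance` are assumptions, not the lecture's examples.

```python
def substitution_distance(s, t):
    """Number of positions where two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("strings must have equal length")
    return sum(cs != ct for cs, ct in zip(s, t))

print(substitution_distance("karolin", "kathrin"))  # 3
print(substitution_distance("1011101", "1001001"))  # 2
```

For strings of different lengths, a general edit distance that also allows insertions and deletions would be needed; this sketch deliberately rejects that case.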
Diverse Distance Metric: Correlation Coefficients
The correlation coefficient represents the strength and direction of the linear relationship between two continuous variables.
Notes on correlation coefficient: Correlation Coefficients
The correlation coefficient is a value, r, between –1 and 1.
• r > 0 suggests a positive (increasing) relationship
• r < 0 suggests a negative (decreasing) relationship
• The closer the value is to 0, the more scattered the data are around a straight line (weaker linear relationship).
• The closer the value is to 1 or –1, the less scattered the data are (stronger linear relationship).
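$$r_{xy} = \frac{\mathrm{Cov}(x, y)}{\sigma_x \sigma_y} = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$$
where $S_{xy} = \sum_{i} (x_i - \bar{x})(y_i - \bar{y})$, $S_{xx} = \sum_{i} (x_i - \bar{x})^2$, and $S_{yy} = \sum_{i} (y_i - \bar{y})^2$.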
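The correlation coefficient can be computed from sums of centered products. The sketch below assumes S_xy is the sum of (x_i - mean of x)(y_i - mean of y), with S_xx and S_yy defined the same way; the function name `pearson_r` and the sample data are assumptions for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_yy = sum((yi - mean_y) ** 2 for yi in y)
    # Division fails if either variable is constant (zero variance).
    return s_xy / math.sqrt(s_xx * s_yy)

x = [1, 2, 3, 4]
print(pearson_r(x, [2, 4, 6, 8]))  # 1.0  (perfect increasing line)
print(pearson_r(x, [8, 6, 4, 2]))  # -1.0 (perfect decreasing line)
```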
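Pearson vs Spearman Correlations: Correlation Coefficients
Pearson Correlation Coefficient
$$r_{xy} = \frac{\mathrm{Cov}(x, y)}{\sigma_x \sigma_y}$$
Spearman Correlation Coefficient
$$r_s = \frac{\mathrm{Cov}(\mathrm{rg}(x), \mathrm{rg}(y))}{\sigma_{\mathrm{rg}(x)} \sigma_{\mathrm{rg}(y)}}$$
, where rg() is a function that maps each value to its rank.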
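A sketch of the difference between the two coefficients: Spearman's coefficient is computed here as the Pearson coefficient of the ranks. Giving tied values their average rank is a common convention and an assumption of this example, as are the function names and the sample data.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_yy = sum((yi - mean_y) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)

def spearman_r(x, y):
    """Spearman correlation: Pearson correlation of the rank variables."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Monotonic but nonlinear relationship: y = x squared.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
print(pearson_r(x, y))   # below 1: the relationship is not a straight line
print(spearman_r(x, y))  # 1.0: the ranks agree perfectly
```

Because it only looks at ranks, Spearman's coefficient measures monotonic association, while Pearson's measures linear association.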
Similarity and Dissimilarity: Let's go back to the class goal
Relationship??
End of Slide