Transcript of CS145: INTRODUCTION TO DATA MINING - UCLA (web.cs.ucla.edu/~yzsun/classes/2017Fall_CS145/Slides/07kNN.pdf)

Page 1:

CS145: INTRODUCTION TO DATA MINING

Instructor: Yizhou Sun, [email protected]

October 22, 2017

7: Vector Data: K Nearest Neighbor

Page 2:

Methods to Learn: Last Lecture

                        | Vector Data                                               | Set Data           | Sequence Data   | Text Data
Classification          | Logistic Regression; Decision Tree; KNN; SVM; NN          |                    |                 | Naïve Bayes for Text
Clustering              | K-means; hierarchical clustering; DBSCAN; Mixture Models  |                    |                 | PLSA
Prediction              | Linear Regression; GLM*                                   |                    |                 |
Frequent Pattern Mining |                                                           | Apriori; FP growth | GSP; PrefixSpan |
Similarity Search       |                                                           |                    | DTW             |

Page 3:

Methods to Learn

                        | Vector Data                                               | Set Data           | Sequence Data   | Text Data
Classification          | Logistic Regression; Decision Tree; KNN; SVM; NN          |                    |                 | Naïve Bayes for Text
Clustering              | K-means; hierarchical clustering; DBSCAN; Mixture Models  |                    |                 | PLSA
Prediction              | Linear Regression; GLM*                                   |                    |                 |
Frequent Pattern Mining |                                                           | Apriori; FP growth | GSP; PrefixSpan |
Similarity Search       |                                                           |                    | DTW             |

Page 4:

K Nearest Neighbor

• Introduction

• kNN

• Similarity and Dissimilarity

• Summary

Page 5:

Lazy vs. Eager Learning

• Lazy vs. eager learning

• Lazy learning (e.g., instance-based learning): Simply stores training data (or only minor processing) and waits until it is given a test tuple

• Eager learning (the above discussed methods): Given a set of training tuples, constructs a classification model before receiving new (e.g., test) data to classify

• Lazy: less time in training but more time in predicting

• Accuracy

• Lazy method effectively uses a richer hypothesis space since it uses many local linear functions to form an implicit global approximation to the target function

• Eager: must commit to a single hypothesis that covers the entire instance space

Page 6:

Lazy Learner: Instance-Based Methods

• Instance-based learning:

• Store training examples and delay the processing (“lazy evaluation”) until a new instance must be classified

• Typical approaches

• k-nearest neighbor approach

• Instances represented as points in, e.g., a Euclidean space.

• Locally weighted regression

• Constructs local approximation

Page 7:

K Nearest Neighbor

• Introduction

• kNN

• Similarity and Dissimilarity

• Summary

Page 8:

The k-Nearest Neighbor Algorithm

• All instances correspond to points in the n-D space

• The nearest neighbors are defined in terms of a distance measure, dist(X1, X2)

• The target function could be discrete- or real-valued

• For discrete-valued targets, k-NN returns the most common value among the k training examples nearest to xq

• Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples

[Figures: a query point xq surrounded by + and − training examples, and the Voronoi diagram (1-NN decision surface) for a set of training points.]

Page 9:

kNN Example


Page 10:

kNN Algorithm Summary

•Choose K

• For a given new instance 𝑋𝑛𝑒𝑤, find the K closest training points w.r.t. a distance measure

• Classify 𝑋𝑛𝑒𝑤 by majority vote among the labels of the K points (see the sketch below)
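A minimal Python sketch of this procedure, assuming NumPy arrays X_train (one training point per row) and integer labels y_train; the name knn_predict is illustrative, not from the slides:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # Majority vote over their labels
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]
```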


Page 11:

Discussion on the k-NN Algorithm

• k-NN for real-valued prediction for a given unknown tuple

• Returns the mean values of the k nearest neighbors

• Distance-weighted nearest neighbor algorithm

• Weight the contribution of each of the k neighbors according to their distance to the query $x_q$

• Give greater weight to closer neighbors

• $y_q = \frac{\sum_i w_i y_i}{\sum_i w_i}$, where the $x_i$'s are $x_q$'s nearest neighbors

• Robust to noisy data by averaging k-nearest neighbors

• Curse of dimensionality: distance between neighbors could be dominated by irrelevant attributes

• To overcome it, stretch the axes or eliminate the least relevant attributes

Example weights: $w_i = \frac{1}{d(x_q, x_i)^2}$ or $w_i = \exp\left(-\frac{d(x_q, x_i)^2}{2\sigma^2}\right)$
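A hedged sketch of distance-weighted k-NN for real-valued prediction, using the Gaussian weights above; the function name knn_regress and the bandwidth parameter sigma are illustrative assumptions, not from the slides:

```python
import numpy as np

def knn_regress(X_train, y_train, x_new, k=5, sigma=1.0):
    """Predict a real value for x_new as a distance-weighted average of its k nearest neighbors."""
    dists = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(dists)[:k]
    # Gaussian weights: closer neighbors contribute more
    w = np.exp(-dists[nearest] ** 2 / (2 * sigma ** 2))
    return np.sum(w * y_train[nearest]) / np.sum(w)
```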

Page 12:

Selection of k for kNN

• The number of neighbors k

• Small k: overfitting (high variance, low bias)

• Big k: bringing in too many irrelevant points (high bias, low variance)

• More discussion: http://scott.fortmann-roe.com/docs/BiasVariance.html

Page 13:

K Nearest Neighbor

• Introduction

• kNN

• Similarity and Dissimilarity

• Summary

Page 14:

Similarity and Dissimilarity

• Similarity

• Numerical measure of how alike two data objects are

• Value is higher when objects are more alike

• Often falls in the range [0,1]

• Dissimilarity (e.g., distance)

• Numerical measure of how different two data objects are

• Lower when objects are more alike

• Minimum dissimilarity is often 0

• Upper limit varies

• Proximity refers to a similarity or dissimilarity


Page 15:

Data Matrix and Dissimilarity Matrix

• Data matrix

• n data points with p dimensions

• Two modes

• Dissimilarity matrix

• n data points, but registers only the distance

• A triangular matrix

• Single mode

Data matrix (n objects × p attributes):

$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$

Dissimilarity matrix (lower triangular):

$\begin{bmatrix} 0 & & & \\ d(2,1) & 0 & & \\ d(3,1) & d(3,2) & 0 & \\ \vdots & \vdots & \vdots & \\ d(n,1) & d(n,2) & \cdots & 0 \end{bmatrix}$

Page 16:

Proximity Measure for Nominal Attributes

• Can take 2 or more states, e.g., red, yellow, blue, green (generalization of a binary attribute)

• Method 1: Simple matching

• m: # of matches, p: total # of variables

• Method 2: Use a large number of binary attributes

• creating a new binary attribute for each of the M nominal states

Simple matching distance: $d(i, j) = \frac{p - m}{p}$

Page 17:

Proximity Measure for Binary Attributes

• A contingency table for binary data, where q = # of attributes that are 1 for both objects, r = # that are 1 for i but 0 for j, s = # that are 0 for i but 1 for j, and t = # that are 0 for both:

              Object j
                1    0
  Object i  1   q    r
            0   s    t

• Distance measure for symmetric binary variables: $d(i, j) = \frac{r + s}{q + r + s + t}$

• Distance measure for asymmetric binary variables (0/0 matches carry no information): $d(i, j) = \frac{r + s}{q + r + s}$

• Jaccard coefficient (similarity measure for asymmetric binary variables): $sim_{Jaccard}(i, j) = \frac{q}{q + r + s}$

Page 18:

Dissimilarity between Binary Variables

• Example

• Gender is a symmetric attribute

• The remaining attributes are asymmetric binary

• Let the values Y and P be 1, and the value N 0


Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4

Jack M Y N P N N N

Mary F Y N P N P N

Jim M Y P N N N N

$d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33$

$d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67$

$d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75$
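The example can be reproduced with a short sketch using the asymmetric binary distance above (variable names are illustrative):

```python
def asym_binary_dist(a, b):
    """Asymmetric binary dissimilarity d = (r + s) / (q + r + s); 0/0 matches are ignored."""
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    return (r + s) / (q + r + s)

# Asymmetric attributes only (Fever, Cough, Test-1..Test-4); Y/P -> 1, N -> 0
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]
print(round(asym_binary_dist(jack, mary), 2))  # 0.33
print(round(asym_binary_dist(jack, jim), 2))   # 0.67
print(round(asym_binary_dist(jim, mary), 2))   # 0.75
```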

Page 19:

Standardizing Numeric Data

• Z-score: $z = \frac{x - \mu}{\sigma}$

• X: raw score to be standardized, μ: mean of the population, σ: standard deviation of the population

• Measures the distance between the raw score and the population mean in units of the standard deviation

• Negative when the raw score is below the mean, “+” when above

• An alternative way: calculate the mean absolute deviation
  $s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$, where $m_f = \frac{1}{n}(x_{1f} + x_{2f} + \cdots + x_{nf})$

• Standardized measure (z-score): $z_{if} = \frac{x_{if} - m_f}{s_f}$

• Using mean absolute deviation is more robust than using standard deviation
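A small sketch of the alternative standardization described above (illustrative only; assumes a NumPy matrix X with one object per row):

```python
import numpy as np

def standardize(X):
    """Standardize each column using the mean absolute deviation instead of the standard deviation."""
    m = X.mean(axis=0)                   # column means m_f
    s = np.mean(np.abs(X - m), axis=0)   # mean absolute deviations s_f
    return (X - m) / s                   # z_if = (x_if - m_f) / s_f
```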

Page 20:

Example: Data Matrix and Dissimilarity Matrix


Data Matrix:

point  attribute1  attribute2
x1     1           2
x2     3           5
x3     2           0
x4     4           5

Dissimilarity Matrix (with Euclidean Distance):

     x1    x2    x3    x4
x1   0
x2   3.61  0
x3   2.24  5.1   0
x4   4.24  1     5.39  0
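The dissimilarity matrix above can be reproduced with a few lines (a sketch; NumPy broadcasting computes all pairwise Euclidean distances):

```python
import numpy as np

# The four example points (attribute1, attribute2)
X = np.array([[1, 2], [3, 5], [2, 0], [4, 5]], dtype=float)

# Pairwise Euclidean dissimilarity matrix
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
print(np.round(D, 2))  # e.g., D[1, 0] = 3.61, D[2, 0] = 2.24, D[3, 2] = 5.39
```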

Page 21:

Distance on Numeric Data: Minkowski Distance

• Minkowski distance: a popular distance measure
  $d(i, j) = \left(|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h\right)^{1/h}$
  where i = (x_i1, x_i2, …, x_ip) and j = (x_j1, x_j2, …, x_jp) are two p-dimensional data objects, and h is the order (the distance so defined is also called the L-h norm)

• Properties

• d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)

• d(i, j) = d(j, i) (Symmetry)

• d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)

• A distance that satisfies these properties is a metric


Page 22:

Special Cases of Minkowski Distance

• h = 1: Manhattan (city block, L1 norm) distance

• E.g., the Hamming distance: the number of bits that are different

between two binary vectors

• h = 2: (L2 norm) Euclidean distance

• h → ∞: “supremum” (Lmax norm, L∞ norm) distance

• This is the maximum difference between any component (attribute) of the vectors

Manhattan: $d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$

Euclidean: $d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$

Page 23:

Example: Minkowski Distance

Dissimilarity Matrices

point  attribute 1  attribute 2
x1     1            2
x2     3            5
x3     2            0
x4     4            5

Manhattan (L1):

     x1  x2  x3  x4
x1   0
x2   5   0
x3   3   6   0
x4   6   1   7   0

Euclidean (L2):

     x1    x2    x3    x4
x1   0
x2   3.61  0
x3   2.24  5.1   0
x4   4.24  1     5.39  0

Supremum (L∞):

     x1  x2  x3  x4
x1   0
x2   3   0
x3   2   5   0
x4   3   1   5   0
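These matrices follow from small distance helpers (a sketch; the function names are illustrative):

```python
import numpy as np

def minkowski(x, y, h):
    """L_h (Minkowski) distance; h=1 gives Manhattan, h=2 gives Euclidean."""
    return np.sum(np.abs(x - y) ** h) ** (1.0 / h)

def supremum(x, y):
    """L_max (supremum) distance: the largest per-attribute difference."""
    return np.max(np.abs(x - y))

x1, x2 = np.array([1, 2]), np.array([3, 5])
print(minkowski(x1, x2, 1))            # 5.0
print(round(minkowski(x1, x2, 2), 2))  # 3.61
print(supremum(x1, x2))                # 3
```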

Page 24:

Ordinal Variables

• Order is important, e.g., rank

• Can be treated like interval-scaled

• replace xif by their rank

• map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
  $z_{if} = \frac{r_{if} - 1}{M_f - 1}$, where $r_{if} \in \{1, \ldots, M_f\}$

• compute the dissimilarity using methods for interval-scaled variables

Page 25:

Attributes of Mixed Type

• A database may contain all attribute types: nominal, symmetric binary, asymmetric binary, numeric, ordinal

• One may use a weighted formula to combine their effects:
  $d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$

• f is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, otherwise $d_{ij}^{(f)} = 1$

• f is numeric: use the normalized distance

• f is ordinal: compute ranks $r_{if}$ and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$, then treat $z_{if}$ as interval-scaled
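A simplified sketch of such a combined distance (assumptions: no missing values, numeric attributes already normalized and ordinal attributes already rank-scaled to [0, 1]; the names mixed_dist and types are illustrative):

```python
def mixed_dist(xi, xj, types):
    """Combine per-attribute distances d_ij^(f) over mixed attribute types."""
    num, den = 0.0, 0.0
    for f, t in enumerate(types):
        delta = 1.0  # indicator that attribute f is usable for this pair
        if t == "nominal":
            d = 0.0 if xi[f] == xj[f] else 1.0
        else:  # numeric or rank-scaled ordinal, assumed in [0, 1]
            d = abs(xi[f] - xj[f])
        num += delta * d
        den += delta
    return num / den
```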

Page 26:

Cosine Similarity

• A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as keywords) or phrase in the document.

• Other vector objects: gene features in micro-arrays, …

• Applications: information retrieval, biologic taxonomy, gene feature mapping, ...

• Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency vectors), then

cos(d1, d2) = (d1 • d2) / (||d1|| × ||d2||),

where • indicates the vector dot product and ||d|| is the length (norm) of vector d


Page 27:

Example: Cosine Similarity

• cos(d1, d2) = (d1 • d2) / (||d1|| × ||d2||), where • indicates the vector dot product and ||d|| is the length (norm) of vector d

• Ex: Find the similarity between documents 1 and 2.

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)

d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

d1 • d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25

||d1|| = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481

||d2|| = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)^0.5 = (17)^0.5 = 4.12

cos(d1, d2 ) = 0.94
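The same number falls out of a few lines of NumPy (a sketch for checking the arithmetic above):

```python
import numpy as np

d1 = np.array([5, 0, 3, 0, 2, 0, 0, 2, 0, 0], dtype=float)
d2 = np.array([3, 0, 2, 0, 1, 1, 0, 1, 0, 1], dtype=float)

cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos, 2))  # 0.94
```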


Page 28:

K Nearest Neighbor

• Introduction

• kNN

• Similarity and Dissimilarity

• Summary

Page 29:

Summary

• Instance-Based Learning

• Lazy learning vs. eager learning; k-nearest neighbor algorithm; similarity / dissimilarity measures