Nearest Neighbor Rule - Robert Haralick (haralick.org/ML/nearest_neighbor.pdf)


Nearest Neighbor Rule

Robert M. Haralick

Computer Science, Graduate Center
City University of New York





.. Outline

1. The Nearest Neighbor Rule
2. Error Rate of NN Rule
3. Large Sample Size
4. Geometry of a Bounded High Dimensional Space
5. Max and Euclidean Distances
6. Projection Based Algorithms





.. Nearest Neighbor Rule

The nearest neighbor rule uses ancient common-sense wisdom.

Definition: Assign a new pattern to the class of the pattern in the training set closest to it.





.. Voronoi Tessellation

Let a set of points be given. Associate with each point the set of points that are closer to it than to any other of the given points. The resulting partition of the space is called the Voronoi tessellation.





.. Formal Statement

Let the set of classes be $C = \{c_1, \ldots, c_K\}$ and let $X$ be the set of all possible measurements. We assume that there is a metric $d$ defined on $X$.

Let the training data set be $\langle (x_1, c_1), \ldots, (x_N, c_N) \rangle$, where each $x_n$ is a measurement vector and its corresponding $c_n$ is the class label of $x_n$.

Let $x$ be the new measurement vector. The NN rule assigns $x$ to class $c_m$, where $d(x, x_m) \le d(x, x_n)$, $n = 1, \ldots, N$.





.. Nearest Neighbor Probability Distribution Assumption

Let the training data set be $\langle (x_1, c_1), \ldots, (x_N, c_N) \rangle$, where each $x_n$ is a measurement vector and its corresponding $c_n$ is the class label of $x_n$.
Let $\mathcal{N}(x_m)$ denote the nearest neighbor set associated with $x_m$.

$$\mathcal{N}(x_m) = \{x \mid d(x_m, x) \le d(x_n, x),\ n = 1, \ldots, N\}$$

Then,

$$P(c \mid x) = \begin{cases} 1 & \text{if } m \text{ is the smallest index such that } x \in \mathcal{N}(x_m) \text{ and } c_m = c \\ 0 & \text{otherwise} \end{cases}$$





.. Nearest Neighbor Probability Distribution Assumption

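Let $\{m \mid x \in \mathcal{N}(x_m)\} = \{m_1, \ldots, m_K\}$.
Let $p_1, \ldots, p_K$ satisfy

$$p_k \ge 0,\ k = 1, \ldots, K, \qquad \sum_{k=1}^{K} p_k = 1$$

Then,

$$P(c_{m_k} \mid x) = p_k, \quad k = 1, \ldots, K$$

$$P(c \mid x) = 0 \quad \text{if } c \ne c_{m_k} \text{ for every } k$$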





.. The Random Sampling Process

The training set is created by a twofold random sampling process.

First, a class is sampled in accordance with the class prior probabilities $P(c_1), \ldots, P(c_K)$.
Suppose that the randomly sampled class for the $n$th sample is $c_n$. Then the measurement $x_n$ is randomly sampled from the class conditional distribution $P(x_n \mid c_n)$.

This twofold sampling is done independently for $n = 1, \ldots, N$.
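The first stage of the sampling process can be sketched as inverse-CDF sampling over the class priors. This is only an illustration: the function name `sample_class` and the convention of passing in a uniform draw `u` are assumptions. The second stage depends on the class conditional distributions, which the slides do not specify, so it is left out.

```c
#include <stddef.h>

/* Maps a uniform draw u in [0,1) to a class index k with probability
   prior[k]. prior must be nonnegative and sum to 1; requires K >= 1. */
size_t sample_class(const double *prior, size_t K, double u)
{
    double cum = 0.0;
    for (size_t k = 0; k + 1 < K; k++) {
        cum += prior[k];
        if (u < cum)
            return k;
    }
    return K - 1;  /* last class; also absorbs rounding in the running sum */
}
```

Calling this once per training sample with independent uniform draws, and then drawing $x_n$ from the distribution of the returned class, reproduces the twofold process.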





.. The Random Sampling Process

The training set is created by a twofold random sampling process. Hence,

$$P(c_1, \ldots, c_N) = \prod_{n=1}^{N} P(c_n)$$

$$P(x_1, \ldots, x_N \mid c_1, \ldots, c_N) = \prod_{n=1}^{N} P(x_n \mid c_n)$$





.. Conditional Independence

Let a new pair $(x, c)$ be sampled in accordance with the random sampling process. However, the true class $c$ is not made available to the decision rule. Suppose that $x_m$ is the nearest neighbor to $x$. Consider $P(c, c_m \mid x, x_m)$:

$$\begin{aligned}
P(c, c_m \mid x, x_m) &= \frac{P(x, x_m \mid c, c_m)\,P(c, c_m)}{P(x, x_m)} \\
&= \frac{P(x \mid c)\,P(x_m \mid c_m)\,P(c)\,P(c_m)}{P(x, x_m)} \\
&= \frac{\dfrac{P(c \mid x)\,P(x)}{P(c)}\,\dfrac{P(c_m \mid x_m)\,P(x_m)}{P(c_m)}\,P(c)\,P(c_m)}{P(x, x_m)} \\
&= \frac{P(c \mid x)\,P(x)\,P(c_m \mid x_m)\,P(x_m)}{P(x, x_m)}
\end{aligned}$$





.. Conditional Independence

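$$\begin{aligned}
P(x, x_m) &= \sum_{i=1}^{K} \sum_{j=1}^{K} P((c^i, x), (c^j, x_m)) \\
&= \sum_{i=1}^{K} \sum_{j=1}^{K} P(x, x_m \mid c^i, c^j)\,P(c^i, c^j) \\
&= \sum_{i=1}^{K} \sum_{j=1}^{K} P(x \mid c^i)\,P(x_m \mid c^j)\,P(c^i, c^j)
\end{aligned}$$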






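$$\begin{aligned}
P(x, x_m) &= \sum_{i=1}^{K} \sum_{j=1}^{K} P(x \mid c^i)\,P(x_m \mid c^j)\,P(c^i)\,P(c^j) \\
&= \sum_{i=1}^{K} P(x \mid c^i)\,P(c^i) \sum_{j=1}^{K} P(x_m \mid c^j)\,P(c^j) \\
&= \sum_{i=1}^{K} P(x, c^i) \sum_{j=1}^{K} P(x_m, c^j) \\
&= P(x)\,P(x_m)
\end{aligned}$$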
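Substituting this factorization into the last line of the derivation of $P(c, c_m \mid x, x_m)$ gives the conditional independence that the slide title refers to:

```latex
\begin{aligned}
P(c, c_m \mid x, x_m)
  &= \frac{P(c \mid x)\,P(x)\,P(c_m \mid x_m)\,P(x_m)}{P(x)\,P(x_m)} \\
  &= P(c \mid x)\,P(c_m \mid x_m)
\end{aligned}
```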





.. Probability of NN Rule Error

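Let $P_N(e \mid x)$ be the probability of error of a NN rule based on a training set sample size of $N$. Let $x_m$ be the nearest neighbor to $x$.

$$\begin{aligned}
P_N(e \mid x) &= \frac{P_N(e, x)}{P(x)} \\
&= \int \frac{P_N(e, x_m, x)}{P(x)}\,dx_m \\
&= \int \frac{P(e \mid x, x_m)\,P_N(x, x_m)}{P(x)}\,dx_m \\
&= \int P(e \mid x, x_m)\,P_N(x_m \mid x)\,dx_m
\end{aligned}$$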





.. Large Sample Size

Given $x$, consider what happens to its nearest neighbor $x_m$ as the sample size gets large. We assume the mixture density function $P$ to be continuous and $P(x) \ne 0$. Let $S$ be a hypersphere centered at $x$ with small radius $r$. Let $P_S$ be the probability that a measurement sampled from the mixture density function falls into $S$. Then $0 < P_S < 1$.

The probability that all of the $N$ independently sampled training measurements are outside of the hypersphere is $(1 - P_S)^N$.

$$\lim_{N \to \infty} (1 - P_S)^N = 0$$





.. Large Sample Size

$$\lim_{N \to \infty} (1 - P_S)^N = 0$$

This implies that the nearest neighbor $x_m$ to $x$ converges to $x$ in probability. Thus we can write

$$\lim_{N \to \infty} P_N(x_m \mid x) = \delta(x - x_m)$$
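One step connects this limit to the error expression $1 - \sum_k P^2(c_k \mid x)$ used on the later slides. The following is a sketch of that step, assuming the posteriors $P(c_k \mid \cdot)$ are continuous at $x$ and using the conditional independence of $c$ and $c_m$ given $x$ and $x_m$. An error occurs exactly when the label of the nearest neighbor differs from the true class, so:

```latex
\begin{aligned}
P(e \mid x, x_m) &= 1 - \sum_{k=1}^{K} P(c_k \mid x)\,P(c_k \mid x_m) \\
P(e \mid x) &= \lim_{N \to \infty} \int P(e \mid x, x_m)\,P_N(x_m \mid x)\,dx_m \\
            &= \int P(e \mid x, x_m)\,\delta(x - x_m)\,dx_m \\
            &= 1 - \sum_{k=1}^{K} P^2(c_k \mid x)
\end{aligned}
```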






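Let $x$ be given and let $Q(e \mid x)$ be the error of a Bayes rule on measurement $x$. Let $c_m$ be the class the Bayes rule assigns to $x$, that is, the class with the largest posterior probability $P(c_m \mid x)$. Then

$$Q(e \mid x) = 1 - P(c_m \mid x)$$

$$\begin{aligned}
\sum_{k=1}^{K} P^2(c_k \mid x) &= P^2(c_m \mid x) + \sum_{\substack{k=1 \\ k \ne m}}^{K} P^2(c_k \mid x) \\
&= (1 - Q(e \mid x))^2 + \sum_{\substack{k=1 \\ k \ne m}}^{K} P^2(c_k \mid x)
\end{aligned}$$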





.. Inequality

Given

$$\sum_{\substack{k=1 \\ k \ne m}}^{K} P(c_k \mid x) = 1 - P(c_m \mid x) = Q(e \mid x),$$

what is the smallest value that

$$\sum_{\substack{k=1 \\ k \ne m}}^{K} P^2(c_k \mid x)$$

can take over all $P(c_k \mid x)$?





.. Inequality

Minimize

$$\sum_{j=1}^{J} z_j^2$$

under the constraint that

$$\sum_{j=1}^{J} z_j = b$$

The minimum is achieved when $z_j = b/J$, $j = 1, \ldots, J$. In this case,

$$\sum_{j=1}^{J} z_j^2 = J\,\frac{b^2}{J^2} = \frac{b^2}{J}$$
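The slide states the minimizer without proof. One short way to see it, offered here as a supplementary argument rather than the author's derivation, is to expand the squared deviations from $b/J$ and use the constraint $\sum_j z_j = b$:

```latex
\begin{aligned}
\sum_{j=1}^{J} \left(z_j - \frac{b}{J}\right)^2
  &= \sum_{j=1}^{J} z_j^2 - \frac{2b}{J}\sum_{j=1}^{J} z_j + J\,\frac{b^2}{J^2}
   = \sum_{j=1}^{J} z_j^2 - \frac{b^2}{J} \\
\Longrightarrow\quad
\sum_{j=1}^{J} z_j^2
  &= \frac{b^2}{J} + \sum_{j=1}^{J} \left(z_j - \frac{b}{J}\right)^2
   \;\ge\; \frac{b^2}{J}
\end{aligned}
```

Equality holds exactly when every $z_j$ equals $b/J$.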









.. Inequality

When $J = 2$, the minimum is $b^2/2$, attained at $z_1 = z_2 = b/2$. The minimizing point lies at distance $b/\sqrt{2}$ from the origin.






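Let $x$ be given and let $Q(e \mid x)$ be the error of a Bayes rule. Then $Q(e \mid x) = 1 - P(c_m \mid x)$, where $c_m$ is the class chosen by the Bayes rule.

$$\begin{aligned}
P(e \mid x) &= 1 - \sum_{k=1}^{K} P^2(c_k \mid x) = 1 - P^2(c_m \mid x) - \sum_{\substack{k=1 \\ k \ne m}}^{K} P^2(c_k \mid x) \\
&\le 1 - (1 - Q(e \mid x))^2 - \frac{Q^2(e \mid x)}{K - 1} \\
&= 1 - (1 - 2Q(e \mid x) + Q^2(e \mid x)) - \frac{Q^2(e \mid x)}{K - 1} \\
&= 2Q(e \mid x) - Q^2(e \mid x) - \frac{Q^2(e \mid x)}{K - 1} \\
&= 2Q(e \mid x) - Q^2(e \mid x)\,\frac{K}{K - 1} \\
&\le 2Q(e \mid x)
\end{aligned}$$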
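The bound can be checked numerically for any vector of posteriors. The sketch below is illustrative only; the function names are hypothetical, and the posteriors are assumed to be passed as an array that sums to 1 with at least two classes.

```c
#include <stddef.h>

/* Asymptotic conditional NN error: 1 - sum_k p[k]^2. */
double nn_asymptotic_error(const double *p, size_t K)
{
    double s = 0.0;
    for (size_t k = 0; k < K; k++)
        s += p[k] * p[k];
    return 1.0 - s;
}

/* Conditional Bayes error: 1 - max_k p[k]. Requires K >= 1. */
double bayes_error(const double *p, size_t K)
{
    double best = p[0];
    for (size_t k = 1; k < K; k++)
        if (p[k] > best)
            best = p[k];
    return 1.0 - best;
}

/* Upper bound 2Q - Q^2 K/(K-1) on the NN error. Requires K >= 2. */
double nn_error_bound(double q, size_t K)
{
    return 2.0 * q - q * q * (double)K / (double)(K - 1);
}
```

For posteriors (0.7, 0.15, 0.15) the NN error meets the bound, because the non-Bayes classes share the remaining probability equally, which is the minimizing case from the Inequality slides.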






Let $P_e$ be the asymptotic probability of error of the NN rule. Let $Q_e$ be the probability of error of the Bayes rule.

$$P(e \mid x) \le 2Q(e \mid x)$$

$$\begin{aligned}
P_e &= \int P(e, x)\,dx \\
&= \int P(e \mid x)\,P(x)\,dx \\
&\le 2 \int Q(e \mid x)\,P(x)\,dx \\
&= 2Q_e
\end{aligned}$$





.. The Surprising Geometry in High Dimension Spaces

Consider the volume $V(d, r)$ of a sphere of radius $r$ in a space of dimension $d$.

$$V(d, r) = \frac{\pi^{d/2}\,r^d}{\Gamma(d/2 + 1)}$$

For any fixed radius $r$, the volume tends to 0 as the dimension $d$ gets large.





.. Volume of a Sphere in a High Dimension Space

Figure: Volume of a sphere of radius 1 as a function of $d$, the dimension of the space.





.. Volume of HyperSphere to Volume of HyperBox

Figure: Ratio of the volume of a hypersphere of radius $r$ to a hyperbox of side $2r$ as a function of $d$, the dimension of the space. The ratio is below 10% when the dimension $d \ge 6$.





.. Volume of Smaller HyperSphere to Larger HyperSphere

Figure: Ratio of the volume of a hypersphere of radius 0.9 to a hypersphere of radius 1 as a function of $d$, the dimension of the space. The ratio equals $0.9^d$: it is about 12% at $d = 20$ and falls below 10% once $d \ge 22$.





.. Geometry in Bounded High Dimensional Spaces

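A hypersphere of radius $R$ in an $N$-dimensional space has volume $\beta(N) R^N$. Consider the fraction $f$ of the volume of a hypersphere of radius $r + \Delta r$ that lies in the outer shell of width $\Delta r$.

$$f(N; \Delta r) = \frac{\beta(N)(r + \Delta r)^N - \beta(N)\,r^N}{\beta(N)(r + \Delta r)^N} = 1 - \frac{r^N}{(r + \Delta r)^N}$$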






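Take $\Delta r$ to be fixed and take the limit of this shell volume fraction as $N \to \infty$.

$$\begin{aligned}
\lim_{N \to \infty} f(N; \Delta r) &= \lim_{N \to \infty} \left(1 - \frac{r^N}{(r + \Delta r)^N}\right) \\
&= 1 - \lim_{N \to \infty} \left(\frac{r}{r + \Delta r}\right)^N \\
&= 1
\end{aligned}$$

As the dimension of the space increases, a greater fraction of the volume of the hypersphere is in the shell.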






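Let $\alpha$ be a fixed small fraction, for example 0.01. Determine the shell width $\Delta r$ such that

$$f(N; \Delta r) = \frac{(r + \Delta r)^N - r^N}{(r + \Delta r)^N} = \alpha$$

$$1 - \frac{r^N}{(r + \Delta r(\alpha; N))^N} = \alpha$$

$$1 - \alpha = \left(\frac{r}{r + \Delta r(\alpha; N)}\right)^N$$

$$(1 - \alpha)^{1/N} = \frac{r}{r + \Delta r(\alpha; N)}$$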






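$$(1 - \alpha)^{1/N} = \frac{r}{r + \Delta r(\alpha; N)}$$

$$(1 - \alpha)^{1/N}\,(r + \Delta r(\alpha; N)) = r$$

$$(1 - \alpha)^{1/N}\,\Delta r(\alpha; N) = r - r\,(1 - \alpha)^{1/N}$$

$$\Delta r(\alpha; N) = r\,\frac{1 - (1 - \alpha)^{1/N}}{(1 - \alpha)^{1/N}} = r \left(\frac{1}{(1 - \alpha)^{1/N}} - 1\right)$$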






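$$\Delta r(\alpha; N) = r \left(\frac{1}{(1 - \alpha)^{1/N}} - 1\right)$$

Consider what happens as the dimension $N$ of the space increases.

$$\begin{aligned}
\lim_{N \to \infty} \Delta r(\alpha; N) &= \lim_{N \to \infty} r \left(\frac{1}{(1 - \alpha)^{1/N}} - 1\right) \\
&= r \lim_{N \to \infty} \left(\frac{1}{(1 - \alpha)^{1/N}} - 1\right) \\
&= 0
\end{aligned}$$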






$$\lim_{N \to \infty} \Delta r(\alpha; N) = 0$$

As the dimension $N$ of the space increases, the width of the shell required to keep the volume of the shell a fixed fraction of the volume of the hypersphere decreases to 0.





.. Gaussian Distribution in High Dimensional Space

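In a univariate standard Gaussian distribution, 90% of a random sample will fall in the interval $[-1.65, 1.65]$. In $d$ dimensions, the fraction of a standard Gaussian sample that falls within a sphere of radius 1.65 decreases to zero as the dimension of the space increases. By $d = 10$, the fraction is only about 1%.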





.. Gaussian Distribution in High Dimensional Space

Figure: Fraction of a random sample that will fall into a sphere of radius 1.65 as a function of $d$, the dimension of the space. The fraction is about 1% when the dimension $d = 10$.





.. Nearest Neighbors

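Let $d$ be the dimension of the space.
Let $x$ be a given point.
Let $\mathrm{Dmin}_d$ be the distance from $x$ to its nearest neighbor.
Let $\mathrm{Dmax}_d$ be the distance from $x$ to its furthest neighbor.

Suppose

$$\lim_{d \to \infty} \mathrm{var}\left(\frac{\|X_d\|}{E[\|X_d\|]}\right) = 0$$

Then

$$\frac{\mathrm{Dmax}_d - \mathrm{Dmin}_d}{\mathrm{Dmin}_d} \to_p 0$$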





.. Nearest Neighbors

$$\frac{\mathrm{Dmax}_d - \mathrm{Dmin}_d}{\mathrm{Dmin}_d} \to_p 0$$

means poor discrimination of the nearest and farthest points with respect to the query point.





.. Minkowski Distance

For $k \ge 1$

$$\rho((x_1, \ldots, x_d), (y_1, \ldots, y_d)) = \left(\sum_{i=1}^{d} |x_i - y_i|^k\right)^{1/k}$$

The norm is the $L_k$ norm.
Max Distance: $k \to \infty$
Euclidean Distance: $k = 2$
Manhattan Distance: $k = 1$





.. Comparison of Max and Euclidean Distance

$$\rho_{Max}(x, y) = \max_{n = 1, \ldots, N} |x_n - y_n|$$

$$\rho_{Euclidean}(x, y) = \sqrt{\sum_{n=1}^{N} (x_n - y_n)^2}$$






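It is always the case that $\rho_{Euclidean}(x, y) \ge \rho_{Max}(x, y)$. Suppose that $(x_m - y_m)^2 \ge (x_n - y_n)^2$, $n = 1, \ldots, N$. Then

$$\begin{aligned}
\rho_{Euclidean}(x, y) &= \sqrt{\sum_{n=1}^{N} (x_n - y_n)^2} \\
&= \sqrt{(x_m - y_m)^2 + \sum_{\substack{n=1 \\ n \ne m}}^{N} (x_n - y_n)^2} \\
&\ge \sqrt{(x_m - y_m)^2} \\
&= \rho_{Max}(x, y)
\end{aligned}$$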






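Although $\rho_{Max}(x, y) \le \rho_{Euclidean}(x, y)$ always holds, the two distances can order points differently. It is possible to have both

$$\rho_{Max}(x, y) < \rho_{Max}(x, z)$$

and

$$\rho_{Euclidean}(x, z) < \rho_{Euclidean}(x, y)$$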
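A small worked example, chosen here for illustration and not taken from the slides, shows all three inequalities holding at once in two dimensions:

```latex
x = (0, 0), \quad y = (3, 3), \quad z = (4, 0)

\rho_{Max}(x, y) = 3 \;<\; \rho_{Max}(x, z) = 4

\rho_{Euclidean}(x, z) = 4 \;<\; \rho_{Euclidean}(x, y) = 3\sqrt{2} \approx 4.243
```

Under the Max distance $y$ is the closer point to $x$, while under the Euclidean distance $z$ is closer, so the nearest neighbor can depend on which distance is used.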






In an $N$-dimensional space,

$$\rho_{Max}(x, y) \le \rho_{Euclidean}(x, y) \le \sqrt{N}\,\rho_{Max}(x, y)$$
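The lower bound was shown on an earlier slide. The upper bound follows by replacing every squared coordinate difference with the largest one:

```latex
\rho_{Euclidean}^2(x, y)
  = \sum_{n=1}^{N} (x_n - y_n)^2
  \;\le\; N \max_{n = 1, \ldots, N} (x_n - y_n)^2
  = N\,\rho_{Max}^2(x, y)
```

Taking square roots gives $\rho_{Euclidean}(x, y) \le \sqrt{N}\,\rho_{Max}(x, y)$, with equality when all coordinate differences have the same magnitude.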





.. Fractional Distance

For $0 < k < 1$

$$\rho((x_1, \ldots, x_d), (y_1, \ldots, y_d)) = \left(\sum_{i=1}^{d} |x_i - y_i|^k\right)$$

Nearest neighbor search using fractional distance measures can perform better than search using the Euclidean or Manhattan distances.





.. Geometry of a Bounded High Dimension Space

The higher the dimension, the more each point is about the same distance away from any other point.
The distance between point $x$ and point $y$ carries little information regarding the distance between $x$ and any nearest neighbor of $y$.





.. Experimental Protocol

Unit hypercube
Set $S$ of 1000 vectors, uniformly distributed
Dimension 10 - 200
Max distance
Euclidean distance
10,000 trials





.. Mean Euclidean Distance Between Points

As the dimension of the space increases:
the mean Euclidean distance between pairs of points of $S$ increases
the standard deviation of the Euclidean distance between pairs of points of $S$ is about constant
the ratio of the mean Euclidean distance to the standard deviation of the Euclidean distance increases






Dimension = 2 Mean/Std = 2.095





.. Mean Max Distance Between Points

As the dimension of the space increases:
the mean Max distance between pairs of points of $S$ increases
the standard deviation of the Max distance between pairs of points of $S$ decreases
the ratio of the mean Max distance to the standard deviation of the Max distance increases





.. Mean Max Distance to Nearest Neighbor





.. Standard Deviation of Max Distance to Nearest Neighbor





.. Random y to NN in S

Dimension = 200





.. NN Between Points of S

Dimension = 200





.. Distance to a Point Has Little Information

If the distance from a point $x$ to a point $p$ is $d$, then $p$ is the nearest neighbor to $x$ whenever the nearest neighbor to $p$ is at a distance greater than $2d$ from $p$.
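The claim follows from the triangle inequality. Let $q$ be any point of the set other than $p$. Since the nearest neighbor of $p$ is more than $2d$ away, $\rho(p, q) > 2d$, and therefore:

```latex
\rho(x, q) \;\ge\; \rho(p, q) - \rho(x, p) \;>\; 2d - d \;=\; d \;=\; \rho(x, p)
```

Every other point is thus farther from $x$ than $p$ is.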





.. Distance to a Point Has Little Information

But the typical distance $d$ of $x$ to a point $p$ is about the same as the distance of $x$ to its nearest neighbor. So knowing the distance of $x$ to $p$ provides no information about whether $p$ is a nearest neighbor of $x$.





.. NN Max Distance Algorithm

x and y are measurement vectors of dimension Ndim
dmin is the minimum distance found so far
d is the current state of the Max distance
Terminate the calculation as soon as the distance is greater than the minimum distance dmin

    d = 0.f;
    i = 0;
    /* stop early once the running maximum exceeds dmin */
    while (i < Ndim && d <= dmin) {
        d = fmaxf(d, fabsf(x[i] - y[i]));   /* fmaxf and fabsf come from <math.h> */
        i++;
    }





.. Projection Based Algorithms






Let unit vectors $w_1, \ldots, w_J$ and a distance $d$ be given. Let $x$ be the vector whose nearest neighbor needs to be found. Define, for each $j$, $j = 1, \ldots, J$,

$$S_j(d) = \{n \mid w_j' x - d \le w_j' x_n \le w_j' x + d\}$$

$$\delta = \min_j \min_{n \in S_j(d)} \rho(x, x_n)$$

Then the nearest neighbor to $x$ must be in the set

$$\bigcap_{j=1}^{J} S_j(\delta)$$






Let $w_j$, $j = 1, \ldots, N$, project onto the $N$ coordinate axes.

$$S_j(d) = \{n \mid w_j' x - d \le w_j' x_n \le w_j' x + d\}$$

Find the smallest $d$ such that

$$\bigcap_{j=1}^{J} S_j(d) \ne \emptyset$$

Then

$$\min_{n = 1, \ldots, N} \rho_{Max}(x, x_n) = d$$





.. The K-Nearest Neighbor Rule

Definition: Let $K$ be a fixed positive integer. Given a measurement $x$ to be classified, the K-NN rule finds the $K$ nearest neighbors to $x$ and assigns the class of $x$ to be the class associated with the majority of the $K$ nearest neighbors.
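A brute-force version of the K-NN rule can be sketched as follows. Several details are assumptions of this sketch, since the definition leaves them open: the Euclidean distance is used, class labels are integers in `[0, num_classes)`, the class with the most votes wins even without an absolute majority, and vote ties go to the smallest class index. The function name is hypothetical.

```c
#include <math.h>
#include <stddef.h>
#include <stdlib.h>

/* K-NN vote over n row-major training vectors of length dim.
   Requires 1 <= K <= n and labels in [0, num_classes).
   Returns the winning class, or -1 if memory allocation fails. */
int knn_classify(const double *train, const int *labels, size_t n,
                 size_t dim, const double *x, size_t K, int num_classes)
{
    double *dist = malloc(n * sizeof *dist);
    int *votes = calloc((size_t)num_classes, sizeof *votes);
    if (!dist || !votes) {
        free(dist);
        free(votes);
        return -1;
    }
    for (size_t i = 0; i < n; i++) {
        double s = 0.0;
        for (size_t j = 0; j < dim; j++) {
            double diff = train[i * dim + j] - x[j];
            s += diff * diff;
        }
        dist[i] = s;  /* squared distance preserves the ordering */
    }
    /* K selection passes: take the closest unused point, then mark it used */
    for (size_t k = 0; k < K; k++) {
        size_t best = 0;
        for (size_t i = 1; i < n; i++)
            if (dist[i] < dist[best])
                best = i;
        votes[labels[best]]++;
        dist[best] = INFINITY;
    }
    int winner = 0;
    for (int c = 1; c < num_classes; c++)
        if (votes[c] > votes[winner])
            winner = c;
    free(dist);
    free(votes);
    return winner;
}
```

With $K = 1$ this reduces to the nearest neighbor rule. The selection passes cost on the order of $K \cdot N$ comparisons, which is adequate for small $K$; a partial sort or heap would be the usual choice for larger $K$.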
