CSC 411: Lecture 05: Nearest Neighbors
Rich Zemel, Raquel Urtasun and Sanja Fidler
University of Toronto
Today
Non-parametric models
- distance
- non-linear decision boundaries
Note: We will mainly use today's method for classification, but it can also be used for regression
Classification: Oranges and Lemons
Classification: Oranges and Lemons
Can construct a simple linear decision boundary: y = sign(w_0 + w_1 x_1 + w_2 x_2)
What is the meaning of "linear" classification?
Classification is intrinsically non-linear
- It puts non-identical things in the same class, so a difference in the input vector sometimes causes zero change in the answer
Linear classification means that the part that adapts is linear (just like linear regression)
    z(x) = w^T x + w_0
with adaptive w, w_0
The adaptive part is followed by a non-linearity to make the decision
    y(x) = f(z(x))
What functions f(·) have we seen so far in class?
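To make this concrete, here is a minimal Python sketch of a linear classifier with a sign non-linearity for f; the weights and input below are made-up illustrative values, not learned parameters.

```python
import numpy as np

def linear_classify(x, w, w0):
    z = w @ x + w0          # adaptive, linear part z(x) = w^T x + w_0
    return np.sign(z)       # fixed non-linearity f turns the score into a decision

w = np.array([1.0, -2.0])   # hypothetical weights
w0 = 0.5                    # hypothetical bias
x = np.array([3.0, 1.0])
print(linear_classify(x, w, w0))  # prints +1.0 or -1.0
```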
Classification as Induction
Instance-based Learning
Non-parametric models are an alternative to parametric models
These are typically simple methods for approximating discrete-valued or real-valued target functions (they work for classification or regression problems)
Learning amounts to simply storing the training data
Test instances are classified using similar training instances
This embodies often-sensible underlying assumptions:
- Output varies smoothly with input
- Data occupies a sub-space of the high-dimensional input space
Nearest Neighbors
Training examples are points in Euclidean space: x ∈ R^d
Idea: The value of the target function for a new query is estimated from the known value(s) of the nearest training example(s)
Distance typically defined to be Euclidean:
    ||x^(a) − x^(b)||_2 = sqrt( Σ_{j=1}^d (x_j^(a) − x_j^(b))^2 )
Algorithm:
1. Find the example (x*, t*) in the stored training set closest to the test instance x. That is:
    x* = argmin_{x^(i) ∈ training set} distance(x^(i), x)
2. Output y = t*
Note: we don't really need to compute the square root. Why?
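A minimal sketch of the 1-NN rule in Python, assuming the training set is stored as a NumPy array X_train with labels t_train; using the squared distance is enough because the square root is monotonic, so the argmin is unchanged.

```python
import numpy as np

def nn_classify(x, X_train, t_train):
    sq_dists = np.sum((X_train - x) ** 2, axis=1)  # squared distances to all N examples
    nearest = np.argmin(sq_dists)                  # index of the closest training example
    return t_train[nearest]                        # output its target t*

# toy usage with made-up data
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
t_train = np.array([0, 0, 1])
print(nn_classify(np.array([4.0, 4.5]), X_train, t_train))  # -> 1
```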
Nearest Neighbors: Decision Boundaries
The nearest neighbor algorithm does not explicitly compute decision boundaries, but these can be inferred
Decision boundaries: Voronoi diagram visualization
- shows how the input space is divided into classes
- each line segment is equidistant between two points of opposite classes
Nearest Neighbors: Decision Boundaries
Example: 2D decision boundary
Nearest Neighbors: Decision Boundaries
Example: 3D decision boundary
Nearest Neighbors: Multi-modal Data
Nearest Neighbor approaches can work with multi-modal data
[Slide credit: O. Veksler]
k-Nearest Neighbors
[Pic by Olga Veksler]
Nearest neighbors sensitive to mis-labeled data (“class noise”). Solution?
Smooth by having k nearest neighbors vote
Algorithm (kNN):
1. Find the k examples {(x^(i), t^(i))} closest to the test instance x
2. Classification output is the majority class:
    y = argmax_{t^(z)} Σ_{r=1}^k δ(t^(z), t^(r))
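A minimal sketch of the kNN voting rule, reusing the same assumed X_train / t_train arrays as in the 1-NN sketch above.

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, t_train, k=3):
    sq_dists = np.sum((X_train - x) ** 2, axis=1)   # squared Euclidean distances
    nearest = np.argsort(sq_dists)[:k]              # indices of the k closest examples
    votes = Counter(t_train[nearest])               # count class labels among the k
    return votes.most_common(1)[0][0]               # majority class (ties broken arbitrarily)
```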
k-Nearest Neighbors
How do we choose k?
Larger k may lead to better performance
But if we set k too large we may end up looking at samples that are not neighbors (are far away from the query)
We can use cross-validation to find k (a sketch follows after this slide)
Rule of thumb is k < sqrt(n), where n is the number of training examples
[Slide credit: O. Veksler]
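A rough sketch of choosing k by hold-out cross-validation, reusing the knn_classify function from the previous sketch; the candidate values of k and the 80/20 split are illustrative choices, not prescribed by the slides.

```python
import numpy as np

def choose_k(X, t, candidates=(1, 3, 5, 7, 9), val_frac=0.2, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))                   # shuffle, then hold out a validation set
    n_val = int(val_frac * len(X))
    val_idx, tr_idx = idx[:n_val], idx[n_val:]
    best_k, best_acc = None, -1.0
    for k in candidates:
        preds = [knn_classify(X[i], X[tr_idx], t[tr_idx], k=k) for i in val_idx]
        acc = np.mean(np.array(preds) == t[val_idx])  # validation accuracy for this k
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k
```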
k-Nearest Neighbors: Issues & Remedies
If some attributes (coordinates of x) have larger ranges, they are treated as more important
- normalize scale (see the sketch after this list)
- Simple option: linearly scale the range of each feature to be, e.g., in the range [0, 1]
- Linearly scale each dimension to have mean 0 and variance 1 (compute the mean µ and variance σ^2 for an attribute x_j and scale: (x_j − µ)/σ)
- be careful: sometimes scale matters
Irrelevant, correlated attributes add noise to the distance measure
- eliminate some attributes
- or vary and possibly adapt the weight of attributes
Non-metric attributes (symbols)
- Hamming distance
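A minimal sketch of the two scaling options above, assuming the training data is a NumPy array X of shape (N, d); in practice the statistics are computed on the training set and reused unchanged for test points.

```python
import numpy as np

def minmax_scale(X):
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo)            # each feature mapped to [0, 1]

def standardize(X):
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma                # each feature: mean 0, variance 1
```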
k-Nearest Neighbors: Issues (Complexity) & Remedies
Expensive at test time: To find one nearest neighbor of a query point x, we must compute the distance to all N training examples. Complexity: O(kdN) for kNN
- Use a subset of dimensions
- Pre-sort training examples into fast data structures (e.g., kd-trees; see the sketch below)
- Compute only an approximate distance (e.g., LSH)
- Remove redundant data (e.g., condensing)
Storage Requirements: Must store all training data
- Remove redundant data (e.g., condensing)
- Pre-sorting often increases the storage requirements
High Dimensional Data: "Curse of Dimensionality"
- Required amount of training data increases exponentially with dimension
- Computational cost also increases
[Slide credit: David Claus]
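A minimal sketch of pre-sorting the training set into a kd-tree using SciPy's cKDTree (an assumed dependency), so that neighbor queries avoid scanning all N examples.

```python
import numpy as np
from scipy.spatial import cKDTree

X_train = np.random.default_rng(0).normal(size=(10000, 3))  # made-up training data
tree = cKDTree(X_train)                                      # built once, reused for all queries

query = np.zeros(3)
dists, idx = tree.query(query, k=5)   # distances and indices of the 5 nearest neighbors
print(idx)
```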
k-Nearest Neighbors Remedies: Remove Redundancy
If all of a sample's Voronoi neighbors have the same class, the sample is useless and can be removed (a related condensing heuristic is sketched below)
[Slide credit: O. Veksler]
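The Voronoi-based rule above is awkward to implement directly; as a rough illustration of removing redundant examples, here is a sketch of a related heuristic, Hart's condensed nearest neighbour, which keeps only the examples needed to classify the rest correctly with 1-NN. This is not the slide's exact rule, just one condensing approach.

```python
import numpy as np

def condense(X, t):
    keep = [0]                                  # start by keeping the first example
    changed = True
    while changed:                              # keep passing over the data until stable
        changed = False
        for i in range(len(X)):
            if i in keep:
                continue
            pred = t[keep][np.argmin(np.sum((X[keep] - X[i]) ** 2, axis=1))]
            if pred != t[i]:                    # misclassified by the current subset,
                keep.append(i)                  # so this example must be kept
                changed = True
    return np.array(keep)                       # indices of the condensed training set
```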
Example: Digit Classification
Decent performance when lots of data
[Slide credit: D. Claus]
Fun Example: Where on Earth is this Photo From?
Problem: Where (e.g., which country or GPS location) was this picture taken?
- Get 6M images from Flickr with GPS info (dense sampling across the world)
- Represent each image with meaningful features
- Do kNN (larger k works better; they use k = 120)!
[Paper: James Hays, Alexei A. Efros. im2gps: estimating geographic information from a single image. CVPR'08. Project page: http://graphics.cs.cmu.edu/projects/im2gps/]
K-NN Summary
Naturally forms complex decision boundaries; adapts to data density
If we have lots of samples, kNN typically works well
Problems:
- Sensitive to class noise
- Sensitive to scales of attributes
- Distances are less meaningful in high dimensions
- Scales linearly with number of examples
Inductive Bias: What kind of decision boundaries do we expect to find?