Transcript of COMP 551 - Applied Machine Learning, Lecture 9 --- Instance learning

Page 1:

COMP 551 - Applied Machine Learning
Lecture 9 --- Instance learning
William L. Hamilton (with slides and content from Joelle Pineau)
* Unless otherwise noted, all material posted for this course is copyright of the instructor and cannot be reused or reposted without the instructor's written permission.

William L. Hamilton, McGill University and Mila 1

Page 2:

MiniProject 2

§ Joint leaderboard with UdeM students: http://www-ens.iro.umontreal.ca/~georgeth/kaggle2019/

William L. Hamilton, McGill University and Mila 2

Page 3:

Variable Ranking

§ Idea: Rank features by a scoring function defined for individual features, independently of the context of the others. Choose the m' highest-ranked features.

§ Pros / cons:

§ Need to select a scoring function.

§ Must select subset size (m’): cross-validation

§ Simple and fast – just need to compute a scoring function m times and sort m scores.

William L. Hamilton, McGill University and Mila 3

Page 4:

Scoring function: Correlation Criteria

(slide by Michel Verleysen)

$$R(j) = \frac{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2 \; \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$
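To make the ranking concrete, here is a minimal sketch (not from the slides) that scores every feature with the correlation criterion above and keeps the m' highest-ranked ones; the names X, y, and m_prime are illustrative.

```python
import numpy as np

def correlation_scores(X, y):
    """Score each feature j with |R(j)|, its absolute Pearson correlation with the target."""
    Xc = X - X.mean(axis=0)            # centre each feature: x_ij - mean(x_j)
    yc = y - y.mean()                  # centre the target: y_i - mean(y)
    num = Xc.T @ yc                    # sum_i (x_ij - x_j_bar)(y_i - y_bar), one entry per feature
    den = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return np.abs(num / den)

def rank_features(X, y, m_prime):
    """Return the indices of the m' highest-ranked features."""
    scores = correlation_scores(X, y)
    return np.argsort(scores)[::-1][:m_prime]

# Example: y depends on feature 0 but not on feature 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3 * X[:, 0] + 0.1 * rng.normal(size=100)
print(rank_features(X, y, m_prime=1))   # -> [0]
```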

William L. Hamilton, McGill University and Mila 4

Page 5:

Scoring function: Mutual information

§ Think of Xj and Y as random variables.

§ Mutual information between variable Xj and target Y:

§ Empirical estimate from data (assume discretized variables):

$$I(j) = \int_{X_j} \int_{Y} p(x_j, y) \log \frac{p(x_j, y)}{p(x_j)\, p(y)} \, dx \, dy$$

$$I(j) = \sum_{x_j} \sum_{y} P(X_j = x_j, Y = y) \log \frac{P(X_j = x_j, Y = y)}{P(X_j = x_j)\, P(Y = y)}$$
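As a hedged illustration of the empirical estimate (assuming the feature has already been discretized), this sketch counts joint and marginal frequencies and plugs them into the sum; x_j and y are placeholders for one discretized feature column and the label vector. The example also previews the next slide: y = x² has essentially no linear correlation with x but clearly positive mutual information.

```python
import numpy as np

def mutual_information(x_j, y):
    """Empirical I(j) for one discretized feature x_j and a discrete target y."""
    mi = 0.0
    for v in np.unique(x_j):
        for c in np.unique(y):
            p_xy = np.mean((x_j == v) & (y == c))   # joint estimate P(Xj=v, Y=c)
            p_x = np.mean(x_j == v)                 # marginal P(Xj=v)
            p_y = np.mean(y == c)                   # marginal P(Y=c)
            if p_xy > 0:
                mi += p_xy * np.log(p_xy / (p_x * p_y))
    return mi

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=5000)
x_binned = np.digitize(x, np.linspace(-1, 1, 10))
y_binned = np.digitize(x ** 2, np.linspace(0, 1, 10))
z_binned = np.digitize(rng.uniform(-1, 1, size=5000), np.linspace(-1, 1, 10))
print(mutual_information(x_binned, y_binned))   # clearly > 0: y depends on x nonlinearly
print(mutual_information(z_binned, y_binned))   # close to 0: z and y are independent
```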

William L. Hamilton, McGill University and Mila 5

Page 6:

Nonlinear dependencies with MI

§ Mutual information identifies nonlinear relationships between variables.

§ Example:
§ x uniformly distributed over [-1, 1]
§ y = x² + noise
§ z uniformly distributed over [-1, 1]
§ z and x are independent

William L. Hamilton, McGill University and Mila 6

Page 7:

Variable Ranking

§ Idea: Rank features by a scoring function defined for individual features, independently of the context of the others. Choose the m' highest-ranked features.

§ Pros / cons:

§ Need to select a scoring function.

§ Must select subset size (m’): cross-validation

§ Simple and fast – just need to compute a scoring function m times and sort m scores.

§ Scoring function is defined for individual features (not subsets).

William L. Hamilton, McGill University and Mila 7

Page 8:

Best-Subset selection

§ Idea: Consider all possible subsets of the features, measure performance on a validation set, and keep the subset with the best performance.

§ Pros / cons?

§ We get the best model!

§ Very expensive to compute, since there is a combinatorial number of subsets.

William L. Hamilton, McGill University and Mila 8

Page 9:

Search space of subsets

n features, 2^n – 1 possible feature subsets!

Kohavi-John, 1997

William L. Hamilton, McGill University and Mila 9

Page 10:

Subset selection in practice

§ Formulate as a search problem, where the state is the feature set that is used, and the search operators involve adding or removing features from the set.

§ Constructive methods like forward/backward search (see the sketch after this list)

§ Local search methods, genetic algorithms

§ Use domain knowledge to help you group features together, to reduce size of search space

§ e.g., In NLP, group syntactic features together, semantic features, etc.
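A minimal sketch of the forward-search idea mentioned above; the score function is an assumption: it stands for training a model on the given feature subset and returning its validation performance (higher is better).

```python
def forward_selection(all_features, score, max_size):
    """Greedy forward search over feature subsets.

    `score(subset)` is assumed to train a model using only `subset`
    and return its validation performance (higher is better).
    """
    selected = []
    best_so_far = float("-inf")
    while len(selected) < max_size:
        candidates = [f for f in all_features if f not in selected]
        if not candidates:
            break
        # Try adding each remaining feature and keep the best addition.
        scored = [(score(selected + [f]), f) for f in candidates]
        best_score, best_feature = max(scored)
        if best_score <= best_so_far:   # stop when no addition helps
            break
        selected.append(best_feature)
        best_so_far = best_score
    return selected
```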

William L. Hamilton, McGill University and Mila 10

Page 11:

Regularization

§ Idea: Modify the objective function to constrain the model choice. Typically done by adding the penalty term $\big(\sum_{j=1:m} |w_j|^p\big)^{1/p}$.
§ Linear regression -> Ridge regression (p = 2), Lasso (p = 1); see the ridge sketch below.

§ Challenge: Need to adapt the optimization procedure (e.g. handle non-convex objective).

§ Regularization can be viewed as a form of feature selection.

§ This approach is often used for very large natural (non-constructed) feature sets, e.g. images, speech, text, video.
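For the p = 2 case, a small sketch of ridge regression using the standard closed form w = (XᵀX + λI)⁻¹Xᵀy (this form is textbook material, not something specific to these slides); larger λ shrinks the weights toward zero, implicitly down-weighting uninformative features.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: minimize ||Xw - y||^2 + lam * ||w||^2."""
    m = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y)

# Larger lam shrinks the weights of weakly informative features toward zero.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X[:, 0] + 0.1 * rng.normal(size=50)
print(ridge_fit(X, y, lam=0.0))
print(ridge_fit(X, y, lam=10.0))
```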

William L. Hamilton, McGill University and Mila 11

Page 12:

Steps to solving a supervised learning problem

1. Decide what the input-output pairs are.

2. Decide how to encode inputs and outputs.

• This defines the input space X and output space Y.

3. Choose a class of hypotheses / representations H.
• E.g. linear functions.

4. Choose an error function (cost function) to define the best hypothesis.
• E.g. Least-mean squares.

5. Choose an algorithm for searching through space of hypotheses.

(If doing k-fold cross-validation, re-do feature selection for each fold.)

Evaluate on validation set

Evaluate final model/hypothesis on test set

William L. Hamilton, McGill University and Mila 12

Page 13:

Parametric supervised learning

§ So far, we assumed we have a dataset of labeled examples.

§ From this, learn a parameter vector of a fixed size such that some error measure based on the training data is minimized.

§ These methods are called parametric, and their main goal is to summarize the data using the parameters.

§ Parametric methods are typically global = one set of parameters for the entire data space.

William L. Hamilton, McGill University and Mila 18

Page 14:

Non-parametric learning methods

§ Key idea: just store all training examples <xi, yi>.

§ When a query is made, compute the value of the new instance based on the values of the closest (most similar) points.

William L. Hamilton, McGill University and Mila 19

Page 15:

Non-parametric learning methods

§ Key idea: just store all training examples <xi, yi>.

§ When a query is made, compute the value of the new instance based on the values of the closest (most similar) points.

§ Requirements:
§ A distance function.
§ How many closest points (neighbors) to look at?
§ How do we compute the value of the new point based on the existing values?

William L. Hamilton, McGill University and Mila 20

Page 16:

What kind of distance metric?

§ Euclidean distance?

William L. Hamilton, McGill University and Mila 21

Page 17:

What kind of distance metric?

§ Euclidean distance?

§ Weighted Euclidean distance (with weights based on domain knowledge): d(x, x') = ∑_{j=1:m} w_j (x_j – x_j')²?

§ Maximum or minimum difference along any axis?

William L. Hamilton, McGill University and Mila 22

Page 18:

What kind of distance metric?

§ Euclidean distance?

§ Weighted Euclidean distance (with weights based on domain knowledge): d(x, x') = ∑_{j=1:m} w_j (x_j – x_j')²? (A small code sketch follows after this list.)

§ Maximum or minimum difference along any axis?

§ An arbitrary distance or similarity function d, specific for the application at hand (works best, if you have one).
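A minimal sketch of the weighted (squared) Euclidean distance above; the weight vector is assumed to come from domain knowledge or cross-validation, as discussed on the later "distance metric tricks" slides.

```python
import numpy as np

def weighted_sq_euclidean(x, x_prime, w):
    """d(x, x') = sum_j w_j * (x_j - x'_j)^2."""
    diff = np.asarray(x) - np.asarray(x_prime)
    return float(np.sum(np.asarray(w) * diff ** 2))

# Up-weighting the second feature makes differences along it count more.
print(weighted_sq_euclidean([1.0, 2.0], [1.0, 3.0], w=[1.0, 1.0]))  # 1.0
print(weighted_sq_euclidean([1.0, 2.0], [1.0, 3.0], w=[1.0, 9.0]))  # 9.0
```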

William L. Hamilton, McGill University and Mila 23

Page 19:

Simple idea: Connect the dots!

What kind of distance metric?

• Euclidean distance
• Maximum/minimum difference along any axis
• Weighted Euclidean distance (with weights based on domain knowledge):

$$d(x, x') = \sum_{j=1}^{n} u_j (x_j - x'_j)^2$$

where xj denotes the value of the jth feature in the vector / instance x
• An arbitrary distance or similarity function d, specific for the application at hand (works best, if you have one)

COMP-652, Lecture 7 - September 27, 2012 15

Simple idea: Connect the dots!

[Figure: Wisconsin data set, classification; two plots of tumor size (mm) vs. non-recurring (0) / recurring (1), with the nearest-neighbor prediction overlaid.]

COMP-652, Lecture 7 - September 27, 2012 16

William L. Hamilton, McGill University and Mila 24

Page 20:

[Same two embedded COMP-652 slides as the previous page: the distance-metric options and the Wisconsin classification "connect the dots" plot.]

William L. Hamilton, McGill University and Mila 25

Page 21:

Simple idea: Connect the dots!

[Figure: Wisconsin data set, regression; two plots of nucleus size vs. time to recurrence, with the nearest-neighbor fit overlaid.]

COMP-652, Lecture 7 - September 27, 2012 17

One-nearest neighbor

• Given: Training data {(xi, yi)}, i = 1, …, m, distance metric d on X.
• Learning: Nothing to do! (just store data)
• Prediction: for x ∈ X
  – Find nearest training sample to x: i* ∈ argmin_i d(xi, x)
  – Predict y = y_{i*}.

COMP-652, Lecture 7 - September 27, 2012 18

William L. Hamilton, McGill University and Mila 26

Page 22:

[Same two embedded COMP-652 slides as the previous page: the Wisconsin regression "connect the dots" plot and the one-nearest-neighbor definition.]

William L. Hamilton, McGill University and Mila 27

Page 23:

One-nearest neighbor

§ Given: Training data X, distance metric d on X.

§ Learning: Nothing to do! (Just store the data).

§ Prediction: For a new point xnew, find the nearest training sample xi*:
  i* = argmin_i d(xi, xnew)
  Predict ynew = yi*.
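A minimal sketch of this prediction rule using plain Euclidean distance (the function and variable names are illustrative, not from the slides):

```python
import numpy as np

def one_nn_predict(X_train, y_train, x_new):
    """Return the label of the training point closest to x_new."""
    dists = np.sum((X_train - x_new) ** 2, axis=1)   # squared Euclidean distances
    return y_train[np.argmin(dists)]                 # y_new = y_{i*}

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]])
y_train = np.array([0, 0, 1])
print(one_nn_predict(X_train, y_train, np.array([2.6, 2.9])))  # -> 1
```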

William L. Hamilton, McGill University and Mila 28

Page 24:

One-nearest neighbor

§ Given: Training data X, distance metric d on X.

§ Learning: Nothing to do! (Just store the data).

William L. Hamilton, McGill University and Mila 29

Page 25:

What does the approximator look like?

§ Nearest-neighbor does not explicitly compute decision boundaries.

§ But the effective decision boundaries are a subset of the Voronoi diagram for the training data.

§ Each decision boundary is a line segment that is equidistant between two points of opposite classes.

What does the approximator look like?

• Nearest-neighbor does not explicitly compute decision boundaries
• But the effective decision boundaries are a subset of the Voronoi diagram for the training data

Each line segment is equidistant between two points of opposite classes.

COMP-652, Lecture 7 - September 27, 2012 19

Distance metric is really important!

[Embedded slides from Andrew W. Moore, "Instance-based learning": Multivariate Distance Metrics (the relative scalings in the distance metric affect the shapes of the nearest-neighbor regions), Euclidean Distance Metric, and Other Metrics (Mahalanobis, rank-based, correlation-based).]

Left: both attributes weighted equally; Right: second attribute weighted more

COMP-652, Lecture 7 - September 27, 2012 20

William L. Hamilton, McGill University and Mila 30

Page 26:

Choice of distance metric is important!

[Same embedded COMP-652 / Andrew Moore slides as the previous page: Voronoi decision boundaries and the effect of distance-metric scaling (left: both attributes weighted equally; right: second attribute weighted more).]

William L. Hamilton, McGill University and Mila 31

Page 27:

Distance metric tricks

§ You may need to do feature preprocessing:
§ Scale the input dimensions (or normalize them).

§ Remove noisy inputs.

§ Determine weights for attributes based on cross-validation (or information-theoretic methods).

William L. Hamilton, McGill University and Mila 32

Page 28:

Distance metric tricks

§ You may need to do feature preprocessing:
§ Scale the input dimensions (or normalize them).

§ Remove noisy inputs.

§ Determine weights for attributes based on cross-validation (or information-theoretic methods).

§ Distance metric is often domain-specific.
§ E.g. string edit distance in bioinformatics.

§ E.g. trajectory distance in time series models for walking data.

§ Distance can be learned sometimes.

William L. Hamilton, McGill University and Mila 33

Page 29:

k-nearest neighbor (kNN)

§ Given: Training data X, distance metric d on X.

§ Learning: Nothing to do! (Just store the data).

§ Prediction:
§ For a new point xnew, find the k nearest training samples to xnew.
§ Let their indices be i1, i2, …, ik.
§ Predict: y = mean/median of {yi1, yi2, …, yik} for regression;
  y = majority of {yi1, yi2, …, yik} for classification, or the empirical probability of each class.
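A minimal sketch of the kNN prediction step for both regression and classification, assuming Euclidean distance; names are illustrative.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k, task="classification"):
    """Predict for x_new from the labels of its k nearest training points."""
    dists = np.sum((X_train - x_new) ** 2, axis=1)   # squared Euclidean distances
    nearest = np.argsort(dists)[:k]                  # indices i1, ..., ik
    if task == "regression":
        return float(np.mean(y_train[nearest]))      # mean of the k neighbours
    return Counter(y_train[nearest]).most_common(1)[0][0]  # majority vote

X_train = np.array([[0.0], [1.0], [2.0], [10.0]])
y_train = np.array([0, 0, 0, 1])
print(knn_predict(X_train, y_train, np.array([1.5]), k=3))                     # -> 0
print(knn_predict(X_train, y_train, np.array([9.0]), k=2, task="regression"))  # -> 0.5
```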

William L. Hamilton, McGill University and Mila 34

Page 30:

Classification, 2-nearest neighbor

Classification, 2-nearest neighbor, empirical distribution
[Plot: Wisconsin data set; tumor size (mm) vs. non-recurring (0) / recurring (1); 2-nearest-neighbor prediction.]
COMP-652, Lecture 7 - September 27, 2012 23

Classification, 3-nearest neighbor
[Plot: 3-nearest-neighbor prediction on the same data.]
COMP-652, Lecture 7 - September 27, 2012 24

William L. Hamilton, McGill University and Mila 35

Page 31:

Classification, 3-nearest neighbor

[Same embedded COMP-652 plots as the previous page (2- and 3-nearest-neighbor classification).]

William L. Hamilton, McGill University and Mila 36

Page 32:

Classification, 5-nearest neighbor

[Plot: 5-nearest-neighbor prediction; tumor size (mm) vs. non-recurring (0) / recurring (1).]
COMP-652, Lecture 7 - September 27, 2012 25

Classification, 10-nearest neighbor
[Plot: 10-nearest-neighbor prediction on the same data.]
COMP-652, Lecture 7 - September 27, 2012 26

William L. Hamilton, McGill University and Mila 37

Page 33:

Classification, 10-nearest neighbor

[Same embedded COMP-652 plots as the previous page (5- and 10-nearest-neighbor classification).]

William L. Hamilton, McGill University and Mila 38

Page 34:

Classification, 15-nearest neighbor

[Plot: 15-nearest-neighbor prediction; tumor size (mm) vs. non-recurring (0) / recurring (1).]
COMP-652, Lecture 7 - September 27, 2012 27

Classification, 20-nearest neighbor
[Plot: 20-nearest-neighbor prediction on the same data.]
COMP-652, Lecture 7 - September 27, 2012 28

William L. Hamilton, McGill University and Mila 39

Page 35:

Classification, 20-nearest neighbor

[Same embedded COMP-652 plots as the previous page (15- and 20-nearest-neighbor classification).]

William L. Hamilton, McGill University and Mila 40

Page 36:

Regression, 2-nearest neighbor

Regression, 2-nearest neighbor, mean prediction
[Plot: Wisconsin data set; nucleus size vs. time to recurrence; 2-nearest-neighbor mean prediction.]
COMP-652, Lecture 7 - September 27, 2012 29

Regression, 3-nearest neighbor
[Plot: 3-nearest-neighbor prediction on the same data.]
COMP-652, Lecture 7 - September 27, 2012 30

William L. Hamilton, McGill University and Mila 41

Page 37:

Regression, 3-nearest neighbor

[Same embedded COMP-652 plots as the previous page (2- and 3-nearest-neighbor regression).]

William L. Hamilton, McGill University and Mila 42

Page 38:

Regression, 5-nearest neighbor

[Plot: 5-nearest-neighbor prediction; nucleus size vs. time to recurrence.]
COMP-652, Lecture 7 - September 27, 2012 31

Regression, 10-nearest neighbor
[Plot: 10-nearest-neighbor prediction on the same data.]
COMP-652, Lecture 7 - September 27, 2012 32

William L. Hamilton, McGill University and Mila 43

Page 39:

Regression, 10-nearest neighbor

[Same embedded COMP-652 plots as the previous page (5- and 10-nearest-neighbor regression).]

William L. Hamilton, McGill University and Mila 44

Page 40:

Bias-variance trade-off

§ What happens if k is low?

§ What happens if k is high?

William L. Hamilton, McGill University and Mila 45

Page 41:

Bias-variance trade-off

§ What happens if k is low?
§ Very non-linear functions can be approximated, but we also capture the noise in the data. Bias is low, variance is high.

§ What happens if k is high?
§ The output is much smoother, less sensitive to data variation. High bias, low variance.

§ A validation set can be used to pick the best k.
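One common way to pick k with a validation procedure, sketched here with scikit-learn's kNN classifier and 5-fold cross-validation (this assumes scikit-learn is available and uses an arbitrary grid of k values):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Estimate validation accuracy for each candidate k and keep the best one.
candidate_ks = [1, 3, 5, 10, 20]
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in candidate_ks}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```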

William L. Hamilton, McGill University and Mila 46

Page 42:

Limitations of k-nearest neighbor (kNN)

§ A lot of discontinuities!

§ Sensitive to small variations in the input data.

§ Can we fix this but still keep it (fairly) local?

William L. Hamilton, McGill University and Mila 47

Page 43:

k-nearest neighbor (kNN)

§ Given: Training data X, distance metric d on X.

§ Learning: Nothing to do! (Just store the data).

§ Prediction:
§ For a new point xnew, find the k nearest training samples to xnew.
§ Let their indices be i1, i2, …, ik.
§ Predict: y = mean/median of {yi1, yi2, …, yik} for regression;
  y = majority of {yi1, yi2, …, yik} for classification, or the empirical probability of each class.

William L. Hamilton, McGill University and Mila 48

Page 44:

Distance-weighted (kernel-based) NN

§ Given: Training data X, distance metric d, weighting function w: R → R.

§ Learning: Nothing to do! (Just store the data).

§ Prediction:

§ Given input xnew.

§ For each xi in the training data X, compute wi = w(d(xi, xnew)).
§ Predict: y = ∑i wi yi / ∑i wi.
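A minimal sketch of this weighted prediction, using the Gaussian weighting function from the next slides, w(d) = exp(−d²/σ²); Euclidean distance and the value of sigma are illustrative choices.

```python
import numpy as np

def weighted_nn_predict(X_train, y_train, x_new, sigma=1.0):
    """Distance-weighted prediction: y = sum_i w_i y_i / sum_i w_i."""
    dists = np.sqrt(np.sum((X_train - x_new) ** 2, axis=1))  # d(x_i, x_new)
    w = np.exp(-dists ** 2 / sigma ** 2)                      # Gaussian weights
    return float(np.sum(w * y_train) / np.sum(w))

X_train = np.array([[0.0], [1.0], [2.0], [10.0]])
y_train = np.array([0.0, 0.0, 1.0, 5.0])
print(weighted_nn_predict(X_train, y_train, np.array([1.8]), sigma=0.5))   # dominated by the nearby y=1 point
print(weighted_nn_predict(X_train, y_train, np.array([1.8]), sigma=20.0))  # large sigma: every point votes, smoother output
```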

William L. Hamilton, McGill University and Mila 49

Page 45:

Distance-weighted (kernel-based) NN

§ Given: Training data X, distance metric d, weighting function w: R → R.

§ Learning: Nothing to do! (Just store the data).

§ Prediction:
§ Given input xnew.
§ For each xi in the training data X, compute wi = w(d(xi, xnew)).
§ Predict: y = ∑i wi yi / ∑i wi.

§ How should we weigh the distances?

William L. Hamilton, McGill University and Mila 50

Page 46:

Some weighting functions

Distance-weighted (kernel-based) nearest neighbor

• Inputs: Training data {(xi, yi)}, i = 1, …, m, distance metric d on X, weighting function w : R → R.
• Learning: Nothing to do!
• Prediction: On input x,
  – For each i compute wi = w(d(xi, x)).
  – Predict weighted majority or mean. For example,

$$y = \frac{\sum_i w_i y_i}{\sum_i w_i}$$

• How to weight distances?

COMP-652, Lecture 7 - September 27, 2012 35

Some weighting functions

$$\frac{1}{d(x_i, x)} \qquad \frac{1}{d(x_i, x)^2} \qquad \frac{1}{c + d(x_i, x)^2} \qquad e^{-\frac{d(x_i, x)^2}{\sigma^2}}$$

COMP-652, Lecture 7 - September 27, 2012 36

William L. Hamilton, McGill University and Mila 51

Page 47:

Gaussian weighting, small σ

Example: Gaussian weighting, small σ
[Plot: Gaussian-weighted nearest-neighbor prediction with σ = 0.25; tumor size (mm) vs. non-recurring (0) / recurring (1).]
COMP-652, Lecture 7 - September 27, 2012 37

Gaussian weighting, medium σ
[Plot: Gaussian-weighted nearest-neighbor prediction with σ = 2.]
COMP-652, Lecture 7 - September 27, 2012 38

William L. Hamilton, McGill University and Mila 52

Page 48:

Gaussian weighting, medium σ

[Same embedded COMP-652 plots as the previous page (Gaussian weighting with σ = 0.25 and σ = 2).]

William L. Hamilton, McGill University and Mila 53

Page 49:

Gaussian weighting, large σ

Gaussian weighting, large σ
[Plot: Gaussian-weighted nearest-neighbor prediction with σ = 5.]

All examples get to vote! Curve is smoother, but perhaps too smooth.

COMP-652, Lecture 7 - September 27, 2012 39

Locally-weighted linear regression

• Weighted linear regression: different weights in the error function for different points (see answer to homework 1)
• Locally weighted linear regression: weights depend on the distance to the query point
• Uses a local linear fit (rather than just an average) around the query point
• If the distance metric is well tuned, it can lead to really good results (can represent non-linear functions easily and faithfully)

COMP-652, Lecture 7 - September 27, 2012 40
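The embedded slide describes locally weighted linear regression only in words, so here is a hedged sketch under the usual formulation: solve a weighted least-squares problem around the query, with weights given by the Gaussian kernel above (the closed form (XᵀWX)⁻¹XᵀWy is the standard weighted least-squares solution, not taken from these slides).

```python
import numpy as np

def lwlr_predict(X_train, y_train, x_query, sigma=1.0):
    """Locally weighted linear regression: fit a local linear model around x_query."""
    Xb = np.hstack([np.ones((len(X_train), 1)), X_train])       # add bias column
    xq = np.concatenate([[1.0], np.atleast_1d(x_query)])
    dists = np.sqrt(np.sum((X_train - x_query) ** 2, axis=1))
    W = np.diag(np.exp(-dists ** 2 / sigma ** 2))                # Gaussian weights near the query
    w = np.linalg.solve(Xb.T @ W @ Xb, Xb.T @ W @ y_train)       # weighted least squares
    return float(xq @ w)

# A nonlinear curve is captured by many local linear fits.
x = np.linspace(0, 3, 50).reshape(-1, 1)
y = np.sin(2 * x[:, 0])
print(lwlr_predict(x, y, np.array([1.0]), sigma=0.3))  # close to sin(2.0) ~= 0.909
```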

William L. Hamilton, McGill University and Mila 54

Page 50:

Lazy vs eager learning

§ Lazy learning: Wait for query before generalization.

§ E.g. Nearest neighbour.

§ Eager learning: Generalize before seeing query.

§ E.g. Logistic regression, LDA, decision trees, neural networks.

William L. Hamilton, McGill University and Mila 55

Page 51:

Pros and cons of lazy and eager learning

§ Eager learners create a global approximation.

§ Lazy learners create many local approximations.

§ If they use the same hypothesis space, a lazy learner can represent more complex functions (e.g., consider H = linear function).

William L. Hamilton, McGill University and Mila 56

Page 52:

Pros and cons of lazy and eager learning

§ Eager learners create a global approximation.

§ Lazy learners create many local approximations.

§ If they use the same hypothesis space, a lazy learner can represent more complex functions (e.g., consider H = linear function).

§ Lazy learning has much faster training time.

§ Eager learner does the work off-line, summarizes lots of data with few parameters.

William L. Hamilton, McGill University and Mila 57

Page 53:

Pros and cons of lazy and eager learning

§ Eager learners create a global approximation.

§ Lazy learners create many local approximations.

§ If they use the same hypothesis space, a lazy learner can represent more complex functions (e.g., consider H = linear function).

§ Lazy learning has much faster training time.

§ Eager learner does the work off-line, summarizes lots of data with few parameters.

§ Lazy learner typically has slower query answering time (depends on number of instances and number of features) and requires more memory (must store all the data).

William L. Hamilton, McGill University and Mila 58

Page 54:

Scaling up

§ kNN in high-dimensional feature spaces?

William L. Hamilton, McGill University and Mila 59

Page 55:

Scaling up

§ kNN in high-dimensional feature spaces?
§ In high dim. spaces, the distance between near and far points appears similar.

§ A few points (“hubs”) show up repeatedly in the top kNN (Radovanovic et al., 2009).

William L. Hamilton, McGill University and Mila 60

Page 56:

Scaling up

§ kNN in high-dimensional feature spaces?
§ In high dim. spaces, the distance between near and far points appears similar.
§ A few points (“hubs”) show up repeatedly in the top kNN (Radovanovic et al., 2009).

§ kNN with a larger number of datapoints?
§ Can be implemented efficiently, O(log n) at retrieval time, if we use smart data structures:
§ Condensation of the dataset.
§ Hash tables in which the hashing function is based on the distance metric.
§ KD-trees (Tutorial: http://www.autonlab.org/autonweb/14665)
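A hedged sketch of the retrieval speed-up using SciPy's KD-tree (assuming SciPy is available): the tree is built once, and each query then finds the k nearest neighbours without scanning all n stored points.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100_000, 3))    # n = 100,000 stored instances
y_train = rng.integers(0, 2, size=100_000)

tree = cKDTree(X_train)                    # build once (roughly O(n log n))
dists, idx = tree.query(rng.normal(size=3), k=5)   # k nearest neighbours of one query point
prediction = np.bincount(y_train[idx]).argmax()    # majority vote among them
print(idx, prediction)
```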

William L. Hamilton, McGill University and Mila 61

Page 57:

When to use instance-based learning

§ Instances map to points in R^n, or else a distance metric is given.

§ Not too many attributes per instance (e.g. < 20), otherwise all points appear to be at a similar distance, and noise becomes a big issue.

§ Easily fooled by irrelevant attributes (for most distance metrics.)

§ Can produce confidence intervals in addition to the prediction.

§ Provides a variable resolution approximation (based on density of points).

William L. Hamilton, McGill University and Mila 62

Page 58:

Application

Hays & Efros, Scene Completion Using Millions of Photographs, CACM, 2008.
http://graphics.cs.cmu.edu/projects/scene-completion/scene_comp_cacm.pdf

William L. Hamilton, McGill University and Mila 63

Page 59:

What you should know

§ Difference between eager vs lazy learning.

§ Key idea of non-parametric learning.

§ The k-nearest neighbor algorithm for classification and regression, and its properties.

§ The distance-weighted NN algorithm and locally-weighted linear regression.

William L. Hamilton, McGill University and Mila 64