Lecture 8 Statistical Learning & Part Models

Post on 24-May-2022

1 views 0 download

Transcript of Lecture 8 Statistical Learning & Part Models

COS 429: Computer Vision

Lecture 8Statistical Learning & Part Models

COS429 : 12.10.17 : Andras Ferencz

2 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit:

Review: Object Detection Typical Components ● Hypothesis generation

● Sliding window, Segmentation, feature point detection, random, search

● Encoding of (local) image data● Colors, Edges, Corners, Histogram of Oriented

Gradients, Wavelets, Convolution Filters

● Relationship of different parts to each other● Blur or histogram, Tree/Star, Pairwise/Covariance

● Learning from (labeled) examples● Selecting representative examples (templates),

Clustering, Building a cascade ● Classifiers: Bayes, Logistic regression, SVM,

AdaBoost, Neural Network... ● Generative vs. Discriminative

● Verification - removing redundant, overlaping, incompatible examples

● Non-Max Suppression, context priors, geometry

Exemplar Summary

3 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit:

Using least squares for classification

If the right answer is 1 and the model says 1.5, it loses, so it changes the boundary to avoid being “too correct”

least squares regression

J. Hinton

4 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit:

The Problem: Loss FunctionRecall: Regression Loss Functions

Some Classification Loss Functions

Squared LossHinge LossLog Loss0/1 Loss

5 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit:

Sigmoid

Ziv-Bar Joseph

6 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit:

Log Loss

Ziv-Bar Joseph

The Logistic Loss: Log(1+ exp(-z))Where z = y*(wTx+b)

7 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit:

Logistic Result

least squares regression

Logistic Regression

8 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit:

Using Logistic Regression● Quick, simple

classifier (try it first)

● Outputs a probabilistic label confidence

● Use L2 or L1 regularization

– L1 does feature selection and is robust to irrelevant features but slower to train

9 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit:

Classifiers: Linear SVM

x x

xx

x

x

x

x

oo

o

o

o

x2

x1

• Find a linear function to separate the classes:

f(x) = sgn(w x + b)

10 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit:

Classifiers: Linear SVM

x x

xx

x

x

x

x

oo

o

o

o

x2

x1

• Find a linear function to separate the classes:

f(x) = sgn(w x + b)

11 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit:

Support vector machines: Margin• Want line that maximizes the margin.

1:1)(negative

1:1)( positive

by

by

iii

iii

wxx

wxx

MarginSupport vectors

C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998

For support, vectors, 1 bi wx

wx+

b=-1

wx+

b=0

wx+

b=1

Kristin Grauman

Hinge Loss

L(y,f(x)) = max(0,1−y f(x))⋅

12 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit:

• Datasets that are linearly separable work out great:

• But what if the dataset is just too hard?

• We can map it to a higher-dimensional space:

0 x

0 x

0 x

x2

Nonlinear SVMs

Slide credit: Andrew Moore

13 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit:

Φ: x → φ(x)

Nonlinear SVMs

• General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable:

Slide credit: Andrew Moore

14 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit:

Nonlinear SVMs• The kernel trick: instead of explicitly computing

the lifting transformation φ(x), define a kernel function K such that

K(xi , xj) = φ(xi ) · φ(xj)

(to be valid, the kernel function must satisfy Mercer’s condition)

• This gives a nonlinear decision boundary in the original feature space:

bKybyi

iiii

iii ),()()( xxxx

C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998

15 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit:

Nonlinear kernel: Example

• Consider the mapping ),()( 2xxx

22

2222

),(

),(),()()(

yxxyyxK

yxxyyyxxyx

x2

16 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit:

Classifiers: Decision Trees

x x

xx

x

x

x

x

o

oo

o

o

o

o

x2

x1

oGiven (weighted) labeled examples:● Select best single feature &

threshold that separates classes● For each branch, recurse● Stop when

● some depth is reached● Branch is (close to) single-class● too few examples left in branch

17 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit:

Ensemble Methods: Boosting

figure from Friedman et al. 2000

18 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit:

Boosted Decision Trees

Gray?

High inImage?

Many LongLines?

Yes

No

NoNo

No

Yes Yes

Yes

Very High Vanishing

Point?

High in Image?

Smooth? Green?

Blue?

Yes

No

NoNo

No

Yes Yes

Yes

Ground Vertical Sky

[Collins et al. 2002]

P(label | good segment, data)

19 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit:

Using Boosted Decision Trees● Flexible: can deal with both continuous

and categorical variables● How to control bias/variance trade-of

– Size of trees

– Number of trees

● Boosting trees often works best with a small number of well-designed features

● Boosting “stubs” can give a fast classifier

20 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit:

Review: Typical Components ● Hypothesis generation

● Sliding window, Segmentation, feature point detection, random, search

● Encoding of (local) image data● Colors, Edges, Corners, Histogram of Oriented

Gradients, Wavelets, Convolution Filters

● Relationship of different parts to each other● Blur or histogram, Tree/Star, Pairwise/Covariance

● Learning from labeled examples● Selecting representative examples (templates),

Clustering, Building a cascade ● Classifiers: Bayes, Logistic regression, SVM,

Decision Trees, AdaBoost, ... ● Generative vs. Discriminative

● Verification - removing redundant, overlaping, incompatible examples

● Non-Max Suppression, context priors, geometry

Exemplar Summary

21 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit:

Discriminative vs. Generative Classifiers

Discriminative Classification:Find the boundary between classes

1. Assume some functional form for P(Y|X)

2. Estimate parameters of P(Y|X) directly from training data

Training classifiers involves estimating f: X Y, or P(Y|X) “Y given X”

Generative Classification:Model each class & see which fits better

1.Assume some functional form for P(X|Y), P(X)

2.Estimate parameters of P(X|Y), P(X) directly from training data

3.Use Bayes rule to calculate P(Y|X= xi)

X2

X2

22 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit: Eamonn Keogh

Bayesian Classification (Generative Model)

23 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit: Eamonn Keogh

Bayesian Classification

24 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit: Eamonn Keogh

Bayesian Classification

25 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit: Eamonn Keogh

Bayesian Classification

26 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit: Eamonn Keogh

Bayesian Classification

27 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit: Eamonn Keogh

Bayesian Classification

28 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit: Eamonn Keogh

Naïve Bayes Classification

29 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit:

Naïve Bayes, Odds ratio, and Logit

Assuming independence of features d1 and d2, we can classify between 2 classes {C1, C2}, by compute the ratio:

This is called the Odds ratio.

It is often easier to take the log of this, called the Log Odds or Logit:

P (C1∣d1 , d2)P (C2∣d1 , d2)

=P (d1∣C1)P(d2∣C1)P (C1)P(d1∣C2)P(d2∣C2)P(C2)

<1

logP (C1∣d1 , d2)P (C2∣d1 , d2)

=logP(d1∣C1)P(d1∣C2)

+logP (d2∣C1)P (d2∣C2)

+logP(C1)P (C2)

<0

30 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit: Eamonn Keogh

Naïve Bayes

31 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit: Eamonn Keogh

Naive Bayes with Gaussian densities have piecewise quadratic decision boundary.

Naïve Bayes

32 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit: Eamonn Keogh

Graphical Models

Naive Bayes assumes independence of d1,d2,... conditioned on the class c

Independence assumption of Naive Bayes is the simplest assumption. Often much more complicated relationships needs to be represented.

A good way to do this is a Graphical Model (aka. Byesian Network)

33 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit: Ioannis Tsamardinos

Cough

Graphical Models

More complicated models consider the dependence between different variables.

34 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit:

Discriminative vs. Generative Classifiers

Discriminative Classification:Find the boundary between classes

1. Assume some functional form for P(Y|X)

2. Estimate parameters of P(Y|X) directly from training data

Training classifiers involves estimating f: X Y, or P(Y|X) “Y given X”

Generative Classification:Model each class & see which fits better

1.Assume some functional form for P(X|Y), P(X)

2.Estimate parameters of P(X|Y), P(X) directly from training data

3.Use Bayes rule to calculate P(Y|X= xi)

35 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit:

Discriminative vs. Generative Classifiers

Discriminative:+ Model directly what you care about+ With many examples usually more accurate+ Often faster to evaluate, can scale well to many examples & classes Generative:+ Allows more flexibility to model relationships between variables+ Can handle compositionality, missing & occluded parts (more “object oriented”)+ Often needs less labeled examples

“On Discriminative vs. Generative classifiers: A comparison of logistic regression and naïve Bayes,” A. Ng and M. Jordan, NIPS 2002.

36 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit:

Comparison

Naïve Bayes

Logistic Regression

Linear SVM

Nearest Neighbor

Kernelized SVM

Learning Objective

ii

jjiij

yP

yxP

0;log

;|logmaximize

Training

Krky

rkyx

ii

iiij

kj

1

Inference

0|0

1|0log

,0|1

1|1log where

01

0

1

01

yxP

yxP

yxP

yxP

j

jj

j

jj

TT

xθxθ

xθθx

θθx

Tii

ii

yyP

yP

exp1/1,| where

,|logmaximize Gradient ascent 0xθT

0xθTLinear programmingiy i

Ti

ii

1 such that

2

1 minimize

θ

Quadratic programming

complicated to write

most similar features same label Record data

i

iii Ky 0,ˆ xx

xx ,ˆ argmin where

ii

i

Ki

y

assuming x in {0 1}

Slide credit: D. Hoiem

37 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit:

Train vs. Test Accuracy

38 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit:

Generalization● Components of generalization error

– Bias: how much the average model over all training sets differ from the true model?

● Error due to inaccurate assumptions/simplifications made by the model

– Variance: how much models estimated from different training sets differ from each other

● Underfitting: model is too “simple” to represent all the relevant class characteristics– High bias and low variance– High training error and high test error

● Overfitting: model is too “complex” and fits irrelevant characteristics (noise) in the data– Low bias and high variance– Low training error and high test error

Slide credit: L. Lazebnik

39 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit:

Bias-Variance Trade-of

● Models with too few parameters are inaccurate because of a large bias (not enough flexibility).

● Models with too many parameters are inaccurate because of a large variance (too much sensitivity to the sample).

Slide credit: D. Hoiem

40 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit:

Bias-Variance Trade-of

E(MSE) = noise2 + bias2 + variance

See the following for explanations of bias-variance (also Bishop’s “Neural Networks” book): •http://www.inf.ed.ac.uk/teaching/courses/mlsc/Notes/Lecture4/BiasVariance.pdf

Unavoidable error

Error due to incorrect

assumptions

Error due to variance of training

samples

Slide credit: D. Hoiem

41 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit:

Try out what hyperparameters work best on test set.

Cross Validation

42 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit:

42

Trying out what hyperparameters work best on test set:Very bad idea. The test set is a proxy for the generalization performance!Use only VERY SPARINGLY, at the end.

Cross Validation

43 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit:

Validation datause to tune hyperparameters

Validation

44 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit:

Cross-validationcycle through the choice of which fold is the validation fold, average results.

Cross Validation

45 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit:

So…● No classifier is inherently

better than any other: you need to make assumptions to generalize

● Three kinds of error– Inherent: unavoidable– Bias: due to over-

simplifications– Variance: due to inability

to perfectly estimate parameters from limited data

Slide credit: D. HoiemSlide credit: D. Hoiem

46 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit:

What to remember about classifiers

● Machine learning algorithms are tools, not dogmas

● Try simple classifiers first

● Better to have smart features and simple classifiers than simple features and smart classifiers

● Use increasingly powerful classifiers with more training data (bias-variance tradeof)

Slide credit: D. Hoiem

47 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit:

How to reduce variance?

● Choose a simpler classifier

● Regularize the parameters

● Get more training data

Slide credit: D. Hoiem

48 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit:

Review: Typical Components ● Hypothesis generation

● Sliding window, Segmentation, feature point detection, random, search

● Encoding of (local) image data● Colors, Edges, Corners, Histogram of Oriented

Gradients, Wavelets, Convolution Filters

● Relationships of different parts to each other● Blur or histogram, Tree/Star, Pairwise/Covariance

● Learning from labeled examples● Selecting representative examples (templates),

Clustering, Building a cascade ● Classifiers: Bayes, Logistic regression, SVM,

Decision Trees, AdaBoost, ... ● Generative vs. Discriminative

● Verification - removing redundant, overlaping, incompatible examples

● Non-Max Suppression, context priors, geometry

Exemplar Summary

49 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit:

Implicit shape models

• Visual codebook is used to index votes for object position

B. Leibe, A. Leonardis, and B. Schiele, Combined Object Categorization and Segmentation with an Implicit Shape Model, ECCV Workshop on Statistical Learning in Computer Vision 2004

training image annotated with object localization info

visual codeword withdisplacement vectors

Slides by Lana Lazebnik, some adapted from Fei-Fei Li, Rob Fergus, and Antonio Torralba

50 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit:

Implicit shape models

• Visual codebook is used to index votes for object position

B. Leibe, A. Leonardis, and B. Schiele, Combined Object Categorization and Segmentation with an Implicit Shape Model, ECCV Workshop on Statistical Learning in Computer Vision 2004

training image annotated with object localization info

visual codeword withdisplacement vectors

51 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit:

Implicit shape models: Training

1. Build codebook of patches around extracted interest points using clustering

52 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit:

Implicit shape models: Training

1. Build codebook of patches around extracted interest points using clustering

53 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit:

Implicit shape models: Training

1. Build codebook of patches around extracted interest points using clustering

54 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit:

Implicit shape models: Testing1. Given test image, extract patches, match to

codebook entry

2. Cast votes for possible positions of object center

3. Search for maxima in voting space

4. Extract weighted segmentation mask based on stored masks for the codebook occurrences

55 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit:Source: B. Leibe

Original image

Example: Results on Cows

56 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit:

Interest points

Example: Results on Cows

Source: B. Leibe

57 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit:

Example: Results on Cows

Matched patchesSource: B. Leibe

58 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit:

Example: Results on Cows

Probabilistic votesSource: B. Leibe

59 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit:

Example: Results on Cows

Hypothesis 1Source: B. Leibe

60 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit:

Example: Results on Cows

Hypothesis 1Source: B. Leibe

61 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit:

Example: Results on Cows

Hypothesis 3Source: B. Leibe

62 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit:

Additional examples

B. Leibe, A. Leonardis, and B. Schiele, Robust Object Detection with Interleaved Categorization and Segmentation, IJCV 77 (1-3), pp. 259-289, 2008.

63 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit:

Generative part-based models

R. Fergus, P. Perona and A. Zisserman, Object Class Recognition by Unsupervised Scale-Invariant Learning, CVPR 2003

R. Fergus

64 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit:

Gaussian shape pdf

Poission pdf on # detections

Uniform shape pdf

Prob. of detection

Gaussian part appearance pdf

Generative probabilistic model

Clutter model

Gaussian relative scale pdf

Log(scale)

Gaussian appearance pdf

0.8 0.75 0.9

Uniformrelative scale pdf

Log(scale)

Foreground model

R. Fergus

65 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit:

)|(),|(),|(max

)|,()|(

objecthpobjecthshapepobjecthappearanceP

objectshapeappearancePobjectimageP

h

Probabilistic model

h: assignment of features to partsPartdescriptors

Partlocations

Candidate parts

R. Fergus

66 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit:

Probabilistic model

High-dimensional appearance space

Distribution over patchdescriptors

)|(),|(),|(max

)|,()|(

objecthpobjecthshapepobjecthappearanceP

objectshapeappearancePobjectimageP

h

R. Fergus

67 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit:

Probabilistic model

h: assignment of features to parts

Part 2

Part 3

Part 1

)|(),|(),|(max

)|,()|(

objecthpobjecthshapepobjecthappearanceP

objectshapeappearancePobjectimageP

h

R. Fergus

68 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit:

Probabilistic model

2D image space

Distribution over jointpart positions

)|(),|(),|(max

)|,()|(

objecthpobjecthshapepobjecthappearanceP

objectshapeappearancePobjectimageP

h

R. Fergus

69 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit:

Results: Faces

Faceshapemodel

Patchappearancemodel

Recognitionresults

R. Fergus

70 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit:

Results: Motorbikes

71 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit:

Motorbikes

72 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit:

Deformable Parts Model

Detections

Template Visualization

Felzenszwalb

P. Felzenszwalb, R. Girshick, D. McAllester, D. Ramanan“Object Detection with Discriminatively Trained Part Based Models”

73 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit:

74 : COS429 : L9 : 12.10.17 : Andras Ferencz Slide Credit:

So many options... How to choose? ● Hypothesis generation

● Sliding window, Segmentation, feature point detection, random, search

● Encoding of (local) image data● Colors, Edges, Corners, Histogram of Oriented

Gradients, Wavelets, Convolution Filters

● Relationship of different parts to each other● Blur or histogram, Tree/Star, Pairwise/Covariance

● Learning from labeled examples● Selecting representative examples (templates),

Clustering, Building a cascade ● Classifiers: Bayes, Logistic regression, SVM,

Decision Trees, AdaBoost, ... ● Generative vs. Discriminative

● Verification - removing redundant, overlaping, incompatible examples

● Non-Max Suppression, context priors, geometry

Exemplar Summary