
CSE 291 - Linear Models for Classification

Henrik I. Christensen

Computer Science and Engineering, University of California, San Diego
http://www.hichristensen.net


Outline

1 Introduction

2 Linear Recognition

3 Linear Discriminant Functions

4 LSQ for Classification

5 Fisher’s Discriminant Method

6 Perceptrons

7 Summary


Introduction

Last time: Bayesian Decision Theory

Today: linear classification of data

Basic pattern recognition
Separation of data: buy/sell
Segmentation of line data, ...


Homework

We will use three data-sets

Recognition of handwritten characters
Face recognition
Recognition of trees in a forest


Recognition of handwritten characters

43 people contributed data

30 people in the training set and 13 in the test set

Each data point is a 32×32 bitmap

Raw and pre-processed data available from
http://archive.ics.uci.edu/ml/machine-learning-databases/optdigits


Wine Recommendation

Characterization of wine based on 13 features

alcohol, acid, ash, ...

Relatively simple dataset; small enough for quick experiments

Widely used for “simple” benchmarking

Available from https://archive.ics.uci.edu/ml/datasets/Wine


Detection of palms

Detection of palms in drone images

Study of images from Belize

Based on recognition of specific trees we can determine the state of the environment.



Simple Example - Bolts or Needles


Classification

Given

An input vector: x
A set of classes: ci ∈ C, i = 1, ..., k

Mapping m : X → C
Separation of space into decision regions

Boundaries are termed decision boundaries/surfaces


Basis Formulation

It is a 1-of-K coding problem

Target vector: t = (0, ..., 1, ..., 0); e.g., with K = 4 classes, class 2 is coded t = (0, 1, 0, 0)

Consideration of 3 different approaches:
1 Optimization of a discriminant function
2 Bayesian formulation: p(ci | x)
3 Learning & decision fusion



Discriminant Functions

Objective: input vector x assigned to a class ci

Simple formulation: y(x) = wᵀx + w0

w is termed a weight vector

w0 is termed a bias

Two-class example: decide c1 if y(x) ≥ 0, otherwise c2
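To make the rule concrete, here is a minimal Python sketch; the weight values and test points are made up for illustration:

```python
import numpy as np

# Hypothetical weights for a two-class linear discriminant y(x) = w^T x + w0.
w = np.array([1.0, -2.0])   # weight vector (assumed, for illustration)
w0 = 0.5                    # bias

def classify(x):
    """Decide c1 if y(x) >= 0, otherwise c2."""
    y = w @ x + w0
    return "c1" if y >= 0 else "c2"

print(classify(np.array([3.0, 1.0])))   # c1: y = 3 - 2 + 0.5 = 1.5
print(classify(np.array([0.0, 1.0])))   # c2: y = -2 + 0.5 = -1.5
```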


Basic Design

Two points on decision surface xa and xb

y(xa) = y(xb) = 0 ⇒ wᵀ(xa − xb) = 0

w is perpendicular to the decision surface

The normal distance from the origin to the decision surface is

wᵀx / ||w|| = −w0 / ||w||

Define augmented vectors w̃ = (w0, w) and x̃ = (1, x) so that:

y(x) = w̃ᵀx̃


Linear discriminant function

(Figure: geometry of a linear discriminant in 2D. w is normal to the decision surface y = 0; y(x)/||w|| is the signed distance of a point x from the surface; −w0/||w|| is the offset of the surface from the origin; x⊥ is the orthogonal projection of x onto the surface; region R1 has y > 0 and region R2 has y < 0.)


Multi Class Discrimination

Generation of multiple decision functions

y_k(x) = w_kᵀx + w_k0

Decision strategy: assign x to class j = argmax_{i∈1..k} y_i(x)
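A sketch of the multi-class rule in Python, stacking the k weight vectors into a matrix; all values are illustrative:

```python
import numpy as np

# Hypothetical weights for k=3 linear discriminants y_k(x) = w_k^T x + w_k0.
W = np.array([[ 1.0,  0.0],
              [ 0.0,  1.0],
              [-1.0, -1.0]])          # one weight vector per row (assumed values)
w0 = np.array([0.0, 0.0, 1.0])        # biases

def classify(x):
    """Assign x to the class whose discriminant is largest."""
    scores = W @ x + w0               # y_k(x) for all k at once
    return int(np.argmax(scores))

print(classify(np.array([2.0, 0.5])))   # class 0: scores = [2.0, 0.5, -1.5]
```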


Multi-Class Decision Regions

(Figure: decision regions Ri, Rj, Rk of a multi-class linear classifier. For two points xA and xB in the same region, every point x on the line between them also lies in that region: the regions are singly connected and convex.)


Example - Bolts or Needles


Minimum distance classification

Suppose we have computed the mean value for each of the classes

m_needle = [0.86, 2.34]ᵀ and m_bolt = [5.74, 5.85]ᵀ

We can then compute the minimum distance

d_j(x) = ||x − m_j||

argmin_i d_i(x) gives the best fit

Decision functions can be derived


Bolts / Needle Decision Functions

Needle: d_needle(x) = 0.86 x1 + 2.34 x2 − 3.10

Bolt: d_bolt(x) = 5.74 x1 + 5.85 x2 − 33.59

Decision boundary: d_i(x) − d_j(x) = 0

d_needle/bolt(x) = −4.88 x1 − 3.51 x2 + 30.49
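These functions follow from the minimum-distance rule: expanding −||x − m_j||² and dropping the ||x||² term common to all classes gives d_j(x) = m_jᵀx − (1/2) m_jᵀm_j. A short numpy sketch (the classify helper and test point are ours) reproduces the slide's coefficients up to rounding:

```python
import numpy as np

# Class means from the slides.
m_needle = np.array([0.86, 2.34])
m_bolt   = np.array([5.74, 5.85])

def decision_function(m):
    """Expand -||x - m||^2 and drop the shared ||x||^2 term:
    d(x) = m^T x - 0.5 * m^T m (larger means closer)."""
    return m, -0.5 * m @ m

w_n, b_n = decision_function(m_needle)   # (0.86, 2.34), -3.11
w_b, b_b = decision_function(m_bolt)     # (5.74, 5.85), -33.59

# Boundary d_needle - d_bolt = 0 reproduces the slide's coefficients.
print(w_n - w_b, b_n - b_b)              # [-4.88 -3.51]  ~30.48 (slides round to 30.49)

def classify(x):
    return "needle" if (w_n - w_b) @ x + (b_n - b_b) >= 0 else "bolt"

print(classify(np.array([1.0, 2.0])))    # needle
```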


Example decision surface



Maximum Likelihood & Least Squares

Assume the observations come from a deterministic function contaminated by Gaussian noise

t = y(x, w) + ε,   p(ε | β) = N(ε | 0, β⁻¹)

The problem at hand is then

p(t | x, w, β) = N(t | y(x, w), β⁻¹)

From a series of observations we have the likelihood

p(t | X, w, β) = ∏_{i=1..N} N(ti | wᵀφ(xi), β⁻¹)


Maximum Likelihood & Least Squares (2)

This results in

ln p(t | w, β) = (N/2) ln β − (N/2) ln(2π) − β E_D(w)

where

E_D(w) = (1/2) Σ_{i=1..N} {ti − wᵀφ(xi)}²

is the sum of squared errors


Maximum Likelihood & Least Squares (3)

Computing the extrema yields:

w_ML = (ΦᵀΦ)⁻¹ Φᵀ t

where

Φ = | φ0(x1)  φ1(x1)  · · ·  φM−1(x1) |
    | φ0(x2)  φ1(x2)  · · ·  φM−1(x2) |
    |   ...     ...    . . .    ...   |
    | φ0(xN)  φ1(xN)  · · ·  φM−1(xN) |
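A minimal numpy sketch of this closed-form solution, using a polynomial basis and synthetic data (both our choices for illustration); np.linalg.lstsq is used instead of forming the inverse explicitly:

```python
import numpy as np

# Fit w_ML = (Phi^T Phi)^-1 Phi^T t with a polynomial basis.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, x.shape)  # noisy targets (synthetic)

M = 4                                     # number of basis functions
Phi = np.vander(x, M, increasing=True)    # columns: x^0, x^1, ..., x^(M-1)

# lstsq is numerically safer than forming (Phi^T Phi)^-1 explicitly.
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
print(w_ml)                               # maximum-likelihood weights
```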


Line Estimation

Least squares minimization:

Line equation: y = ax + b
Error in fit: Σ_i (yi − a xi − b)²

Solution (the normal equations, with avg(·) denoting sample means):

avg(xy) = a · avg(x²) + b · avg(x)
avg(y) = a · avg(x) + b

Why might this be a non-robust solution?
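A small numpy sketch of these normal equations on synthetic, nearly linear data (values assumed):

```python
import numpy as np

# Data near y = 2x + 1 with small perturbations (synthetic).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0 + np.array([0.05, -0.02, 0.03, -0.04, 0.01])

A = np.array([[np.mean(x**2), np.mean(x)],
              [np.mean(x),    1.0      ]])
rhs = np.array([np.mean(x * y), np.mean(y)])
a, b = np.linalg.solve(A, rhs)
print(a, b)   # close to 2 and 1

# A single gross outlier can drag the fit badly, which is why
# the robustness question above matters.
```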


LSQ on Lasers

Line model: ri cos(φi − θ) = ρ

Error model: di = ri cos(φi − θ) − ρ

Optimize: argmin_{(ρ,θ)} Σ_i (ri cos(φi − θ) − ρ)²

Error model derived in Deriche et al. (1992)

Well suited for “clean-up” of Hough lines
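A numerical sketch of this optimization using scipy.optimize.least_squares on a synthetic scan (the line parameters and noise level are assumed); the analytic treatment is in the cited reference:

```python
import numpy as np
from scipy.optimize import least_squares

# Synthetic laser scan of the line r*cos(phi - pi/4) = 2 (assumed for illustration).
phi = np.linspace(0.0, np.pi / 2, 30)
r = 2.0 / np.cos(phi - np.pi / 4) + np.random.default_rng(1).normal(0, 0.01, phi.shape)

def residuals(params):
    rho, theta = params
    return r * np.cos(phi - theta) - rho   # d_i from the slide

fit = least_squares(residuals, x0=[1.0, 0.0])
print(fit.x)   # close to [2.0, pi/4]
```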


Total Least Squares

Line equation: ax + by + c = 0

Error in fit: Σ_i (a xi + b yi + c)², subject to a² + b² = 1.

Solution:

| avg(x²) − avg(x)²        avg(xy) − avg(x)·avg(y) | |a|     |a|
| avg(xy) − avg(x)·avg(y)  avg(y²) − avg(y)²       | |b| = μ |b|

where μ is a scale factor (an eigenvalue); the eigenvector with the smallest eigenvalue minimizes the error.

c = −a · avg(x) − b · avg(y)
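A sketch of the eigenvector solution in numpy, on synthetic near-linear data (assumed):

```python
import numpy as np

# Total least squares via the smallest eigenvector of the scatter matrix.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.02, 2.98, 5.01, 7.00, 8.97])   # roughly y = 2x + 1

S = np.cov(np.stack([x, y]), bias=True)        # the 2x2 matrix above
evals, evecs = np.linalg.eigh(S)               # eigenvalues in ascending order
a, b = evecs[:, 0]                             # eigenvector of the smallest eigenvalue
c = -a * x.mean() - b * y.mean()
print(a, b, c)                                 # line normal and offset: ax + by + c = 0
```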


Line Representations

The line representation is crucial
Often a redundant model is adopted
Line parameters vs end-points
Important for fusion of segments
End-points are less stable


Sequential Adaptation

In some cases one-at-a-time (sequential) estimation is more suitable

Also known as (stochastic) gradient descent

w^(τ+1) = w^(τ) − η ∇En = w^(τ) + η (tn − w^(τ)ᵀφ(xn)) φ(xn)

(the plus sign follows since ∇En = −(tn − wᵀφ(xn)) φ(xn))

Known as the least-mean-squares (LMS) algorithm. An open issue is how to choose the learning rate η.
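A minimal LMS sketch in Python, with identity features φ(x) = (1, x) and a hand-picked η (both assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
xs = rng.uniform(-1, 1, 200)
ts = 3.0 * xs + 0.5 + rng.normal(0, 0.1, xs.shape)   # targets near t = 3x + 0.5

w = np.zeros(2)
eta = 0.1                                            # learning rate (hand-tuned)
for x, t in zip(xs, ts):
    phi = np.array([1.0, x])
    w += eta * (t - w @ phi) * phi                   # w <- w + eta*(t - w^T phi)*phi
print(w)                                             # close to [0.5, 3.0]
```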


Regularized Least Squares

As seen in lecture 2, some control of the parameters might be useful.

Consider the error function:

E_D(w) + λ E_W(w)

which generates

(1/2) Σ_{i=1..N} {ti − wᵀφ(xi)}² + (λ/2) wᵀw

which is minimized by

w = (λI + ΦᵀΦ)⁻¹ Φᵀ t
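A sketch of the regularized solution in numpy; the basis, data, and λ are illustrative choices:

```python
import numpy as np

# Regularized least squares (ridge) with a polynomial basis.
x = np.linspace(0, 1, 20)
t = np.sin(2 * np.pi * x) + np.random.default_rng(3).normal(0, 0.1, x.shape)

Phi = np.vander(x, 8, increasing=True)   # deliberately flexible basis
lam = 1e-3                               # regularization strength (assumed)

M = Phi.shape[1]
w = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
print(w)   # shrunk weights; lam -> 0 recovers plain least squares
```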


Least Squares for Classification

We used LSQ for regression; we can similarly approximate the 1-of-K classification targets. Consider:

y_k(x) = w_kᵀx + w_k0

Rewrite, with augmented vectors, to y(x) = Wᵀx

Assume the 1-of-K target vectors are stacked as rows of a matrix T


Least Squares for Classification

The error is then:

E_D(W) = (1/2) Tr{ (XW − T)ᵀ(XW − T) }

The solution is then

W = (XᵀX)⁻¹ Xᵀ T
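A sketch in numpy on synthetic two-class data: augmented inputs, one-hot target rows, and the pseudo-inverse solution:

```python
import numpy as np

rng = np.random.default_rng(4)
X1 = rng.normal([0, 0], 0.5, (50, 2))
X2 = rng.normal([2, 2], 0.5, (50, 2))
X = np.hstack([np.ones((100, 1)), np.vstack([X1, X2])])   # augmented inputs (1, x)

T = np.zeros((100, 2))
T[:50, 0] = 1.0          # class 1 one-hot rows
T[50:, 1] = 1.0          # class 2

W, *_ = np.linalg.lstsq(X, T, rcond=None)   # solves W = (X^T X)^-1 X^T T
pred = np.argmax(X @ W, axis=1)
print((pred == np.r_[np.zeros(50), np.ones(50)]).mean())   # training accuracy
```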


LSQ and Outliers

(Figure: two scatter plots of the same two-class data, one with extra outlier points; axes x from −4 to 8 and y from −8 to 4, illustrating how outliers pull the least-squares decision boundary.)



Fisher’s linear discriminant

Selection of a decision function that maximizes distance between classes

Assume for a start the projection

y = wᵀx

Compute the class means m1 and m2:

m1 = (1/N1) Σ_{i∈C1} xi,   m2 = (1/N2) Σ_{j∈C2} xj

Distance between the projected means:

m̃2 − m̃1 = wᵀ(m2 − m1)

where m̃i = wᵀmi


The suboptimal solution

(Figure: two-class data projected onto the line joining the class means; the projected classes overlap. Axis ticks: x from −2 to 6, y from −2 to 4.)


The Fisher criterion

Consider the expression

J(w) = (wᵀ S_B w) / (wᵀ S_W w)

where S_B is the between-class covariance and S_W is the within-class covariance, i.e.

S_B = (m2 − m1)(m2 − m1)ᵀ

and

S_W = Σ_{i∈C1} (xi − m1)(xi − m1)ᵀ + Σ_{i∈C2} (xi − m2)(xi − m2)ᵀ

Optimized when

(wᵀ S_B w) S_W w = (wᵀ S_W w) S_B w

or

w ∝ S_W⁻¹ (m2 − m1)
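A numpy sketch of the two-class Fisher direction on synthetic data (class means and spreads assumed):

```python
import numpy as np

# Fisher's discriminant direction: w proportional to S_W^-1 (m2 - m1).
rng = np.random.default_rng(5)
X1 = rng.normal([0, 0], [1.0, 0.3], (100, 2))   # class 1 (synthetic)
X2 = rng.normal([2, 1], [1.0, 0.3], (100, 2))   # class 2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)   # within-class scatter

w = np.linalg.solve(S_W, m2 - m1)   # Fisher direction (up to scale)
w /= np.linalg.norm(w)
print(w)
```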


The Fisher result

(Figure: the same data projected onto the Fisher direction; the class separation is much improved. Axis ticks: x from −2 to 6, y from −2 to 4.)


Generalization to K > 2 classes

Define a stacked weight matrix so that

y = Wᵀx

The within-class covariance generalizes to

S_W = Σ_{k=1..K} S_k

The between-class covariance is

S_B = Σ_{k=1..K} Nk (mk − m)(mk − m)ᵀ

It can be shown that J(W) is maximized by the leading eigenvectors of

S_W⁻¹ S_B
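A numpy sketch for three synthetic classes; with K classes at most K − 1 directions carry discriminative information:

```python
import numpy as np

# Multi-class Fisher: leading eigenvectors of S_W^-1 S_B (3 synthetic classes in 3D).
rng = np.random.default_rng(6)
classes = [rng.normal(mu, 0.5, (50, 3)) for mu in ([0, 0, 0], [2, 0, 1], [0, 2, 1])]

m = np.vstack(classes).mean(axis=0)                       # overall mean
S_W = sum((X - X.mean(0)).T @ (X - X.mean(0)) for X in classes)
S_B = sum(len(X) * np.outer(X.mean(0) - m, X.mean(0) - m) for X in classes)

evals, evecs = np.linalg.eig(np.linalg.solve(S_W, S_B))   # eigen-analysis of S_W^-1 S_B
order = np.argsort(evals.real)[::-1]
W = evecs[:, order[:2]].real     # at most K - 1 = 2 useful directions
print(W)
```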



Perceptron Algorithm

Developed by Rosenblatt (1962)

Formed an important basis for neural networks

Use a non-linear transformation φ(x)

Construct a decision function

y(x) = f( wᵀφ(x) )

where

f(a) = +1 if a ≥ 0, −1 if a < 0


The perceptron criterion

Normally we want wᵀφ(xn) tn > 0 for all n

Given the ±1 target coding, the perceptron criterion is

E_P(w) = − Σ_{n∈M} wᵀφn tn

where M is the set of misclassified samples

We can use gradient descent:

w^(τ+1) = w^(τ) − η ∇E_P(w) = w^(τ) + η φn tn   (one misclassified sample at a time)
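A minimal perceptron sketch in Python on synthetic separable data (means and spread assumed):

```python
import numpy as np

# Perceptron with phi(x) = (1, x1, x2) and targets in {+1, -1}.
rng = np.random.default_rng(7)
X1 = rng.normal([0, 0], 0.4, (30, 2))
X2 = rng.normal([2, 2], 0.4, (30, 2))
Phi = np.hstack([np.ones((60, 1)), np.vstack([X1, X2])])
t = np.r_[-np.ones(30), np.ones(30)]

w = np.zeros(3)
eta = 1.0
for _ in range(100):                       # epochs
    errors = 0
    for phi_n, t_n in zip(Phi, t):
        if (w @ phi_n) * t_n <= 0:         # misclassified
            w += eta * phi_n * t_n         # perceptron update
            errors += 1
    if errors == 0:                        # converged: data separated
        break
print(w, errors)
```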


Perceptron learning example

(Figure: four snapshots of perceptron learning on 2D data, axes from −1 to 1; the decision boundary moves after each update on a misclassified point until all points are correctly classified.)



Summary

Basics for discrimination / classification

Obviously not all problems are linear

Optimization of the distance/overlap between classes

Minimizing the probability of misclassification

Basic formulation as an optimization problem

How to optimize the between-class distance? Weight it by the within-class covariance (Fisher)

Basic sequential (recursive) formulation

Could we make it more robust?

Thursday: Sub-space methods


Deriche, R., Vaillant, R., & Faugeras, O. (1992). From Noisy Edge Points to 3D Reconstruction of a Scene: A Robust Approach and Its Uncertainty Analysis. In Series in Machine Perception and Artificial Intelligence, Vol. 2, pp. 71-79. World Scientific.
