Lecture 7: Kernels for Classification and Regression
CS 194-10, Fall 2011

Laurent El Ghaoui

EECS Department, UC Berkeley

September 15, 2011


Outline

Motivations

Linear classification and regression
  - Examples
  - Generic form

The kernel trick
  - Linear case
  - Nonlinear case

Examples
  - Polynomial kernels
  - Other kernels
  - Kernels in practice




A linear regression problem

Linear auto-regressive model for a time series: $y_t$ is a linear function of $y_{t-1}, y_{t-2}$:

$y_t = w_1 + w_2 y_{t-1} + w_3 y_{t-2}, \quad t = 1, \ldots, T.$

This writes $y_t = w^T x_t$, with $x_t$ the "feature vectors"

$x_t := (1,\; y_{t-1},\; y_{t-2}), \quad t = 1, \ldots, T.$

Model fitting via least-squares:

$\min_w \; \|X^T w - y\|_2^2$

Prediction rule: $\hat y_{T+1} = w_1 + w_2 y_T + w_3 y_{T-1} = w^T x_{T+1}$.
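A minimal NumPy sketch of this fit (not from the slides): the series, its length, and the seed are made up for illustration, and the design matrix below stacks the feature vectors $x_t$ as rows, i.e. it plays the role of $X^T$ in the slide's notation.

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.cumsum(rng.standard_normal(200))      # a toy time series y_0, ..., y_{T-1}

# Rows x_t = (1, y_{t-1}, y_{t-2}); targets y_t
A = np.column_stack([np.ones(len(y) - 2), y[1:-1], y[:-2]])   # plays the role of X^T
t = y[2:]

w, *_ = np.linalg.lstsq(A, t, rcond=None)    # min_w ||A w - t||_2^2

# One-step-ahead prediction: y_{T+1} = w^T (1, y_T, y_{T-1})
y_next = w @ np.array([1.0, y[-1], y[-2]])
print(w, y_next)
```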


Nonlinear regression

Nonlinear auto-regressive model for a time series: $y_t$ is a quadratic function of $y_{t-1}, y_{t-2}$:

$y_t = w_1 + w_2 y_{t-1} + w_3 y_{t-2} + w_4 y_{t-1}^2 + w_5 y_{t-1} y_{t-2} + w_6 y_{t-2}^2.$

This writes $y_t = w^T \phi(x_t)$, with $\phi(x_t)$ the augmented feature vectors

$\phi(x_t) := \left(1,\; y_{t-1},\; y_{t-2},\; y_{t-1}^2,\; y_{t-1} y_{t-2},\; y_{t-2}^2\right).$

Everything the same as before, with x replaced by φ(x).
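Continuing the same made-up example as in the previous sketch, the only change is that each $x_t$ is replaced by $\phi(x_t)$:

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.cumsum(rng.standard_normal(200))      # same toy series as before

def phi(y1, y2):
    """Rows phi(x_t) = (1, y_{t-1}, y_{t-2}, y_{t-1}^2, y_{t-1} y_{t-2}, y_{t-2}^2)."""
    return np.column_stack([np.ones_like(y1), y1, y2, y1**2, y1 * y2, y2**2])

Phi = phi(y[1:-1], y[:-2])                   # one row per time step t = 2, ..., T-1
w, *_ = np.linalg.lstsq(Phi, y[2:], rcond=None)

# One-step-ahead prediction: y_{T+1} = w^T phi(x_{T+1})
y_next = w @ phi(np.array([y[-1]]), np.array([y[-2]]))[0]
```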


Nonlinear classification

Non-linear (e.g., quadratic) decision boundary

$w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_1 x_2 + w_5 x_2^2 + b = 0.$

This writes as $w^T \phi(x) + b = 0$, with $\phi(x) := (x_1, x_2, x_1^2, x_1 x_2, x_2^2)$.


Challenges

In principle, it seems we can always augment the dimension of the feature space to make the data linearly separable. (See the video at http://www.youtube.com/watch?v=3liCbRZPrZA.)

How do we do it in a computationally efficient manner?




Linear least-squares

$\min_w \; \|X^T w - y\|_2^2 + \lambda \|w\|_2^2$

where
- $X = [x_1, \ldots, x_m]$ is the $n \times m$ matrix of data points;
- $y \in \mathbb{R}^m$ is the "response" vector;
- $w$ contains the regression coefficients;
- $\lambda \ge 0$ is a regularization parameter.

Prediction rule: $\hat y = w^T x$, where $x \in \mathbb{R}^n$ is a new data point.
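A small sketch of this problem in NumPy, using the optimality condition $(XX^T + \lambda I)\,w = Xy$ obtained by setting the gradient to zero; the data and the value of $\lambda$ are synthetic placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, lam = 5, 40, 0.1
X = rng.standard_normal((n, m))              # n x m, columns are the data points
y = rng.standard_normal(m)                   # response vector

# Gradient of ||X^T w - y||_2^2 + lam ||w||_2^2 is zero iff (X X^T + lam I) w = X y
w = np.linalg.solve(X @ X.T + lam * np.eye(n), X @ y)

x_new = rng.standard_normal(n)
y_hat = w @ x_new                            # prediction rule: w^T x
```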


Support vector machine (SVM)

$\min_{w,\,b} \; \sum_{i=1}^{m} \max\!\left(0,\, 1 - y_i (w^T x_i + b)\right) + \lambda \|w\|_2^2$

where
- $X = [x_1, \ldots, x_m]$ is the $n \times m$ matrix of data points in $\mathbb{R}^n$;
- $y \in \{-1, 1\}^m$ is the label vector;
- $w, b$ contain the classifier coefficients;
- $\lambda \ge 0$ is a regularization parameter.

In the sequel, we’ll ignore the bias term (for simplicity only).

Classification rule: $\hat y = \operatorname{sign}(w^T x + b)$, where $x \in \mathbb{R}^n$ is a new data point.
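For illustration only, here is a rough subgradient-descent sketch of this objective with the bias dropped (as announced above); the data, step sizes, and iteration count are arbitrary, and this is not a training method prescribed by the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, lam = 5, 100, 0.1
X = rng.standard_normal((n, m))                      # columns are data points
y = np.sign(X[0] + 0.3 * rng.standard_normal(m))     # toy labels in {-1, +1}

w = np.zeros(n)
for it in range(500):
    margins = y * (X.T @ w)                          # y_i * w^T x_i
    active = margins < 1                             # points with nonzero hinge loss
    subgrad = -(X[:, active] @ y[active]) + 2 * lam * w
    w -= 0.5 / (it + 1) * subgrad                    # diminishing step size

y_pred = np.sign(X.T @ w)                            # classification rule sign(w^T x)
print("training accuracy:", np.mean(y_pred == y))
```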


Generic form of problem

Many classification and regression problems can be written

$\min_w \; L(X^T w, y) + \lambda \|w\|_2^2$

where
- $X = [x_1, \ldots, x_m]$ is the $n \times m$ matrix of data points;
- $y \in \mathbb{R}^m$ contains the response vector (or the labels);
- $w$ contains the classifier coefficients;
- $L$ is a "loss" function that depends on the problem considered;
- $\lambda \ge 0$ is a regularization parameter.

Prediction/classification rule: it depends only on $w^T x$, where $x \in \mathbb{R}^n$ is a new data point.


Loss functions

- Squared loss (for linear least-squares regression):
  $L(z, y) = \|z - y\|_2^2.$
- Hinge loss (for SVMs):
  $L(z, y) = \sum_{i=1}^{m} \max(0,\, 1 - y_i z_i).$
- Logistic loss (for logistic regression):
  $L(z, y) = \sum_{i=1}^{m} \log\!\left(1 + e^{-y_i z_i}\right).$
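The three losses above, written as small NumPy functions of $z = X^T w$ and $y$ (the use of `logaddexp` for numerical stability is my own choice, not from the slides):

```python
import numpy as np

def squared_loss(z, y):
    return np.sum((z - y) ** 2)                  # ||z - y||_2^2

def hinge_loss(z, y):
    return np.sum(np.maximum(0.0, 1.0 - y * z))  # sum_i max(0, 1 - y_i z_i)

def logistic_loss(z, y):
    # sum_i log(1 + exp(-y_i z_i)); logaddexp(0, t) = log(1 + e^t)
    return np.sum(np.logaddexp(0.0, -y * z))
```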




Key result

For the generic problem:

$\min_w \; L(X^T w) + \lambda \|w\|_2^2,$

the optimal $w$ lies in the span of the data points $x_1, \ldots, x_m$:

$w = Xv$

for some vector $v \in \mathbb{R}^m$.
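A quick numerical check of this result for the squared loss, on synthetic data: the ridge solution is computed in the primal and then verified, up to round-off, to lie in the span of the columns of $X$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, lam = 50, 10, 0.1                        # n >> m: w lives in R^50, data span at most 10 dims
X = rng.standard_normal((n, m))
y = rng.standard_normal(m)

w = np.linalg.solve(X @ X.T + lam * np.eye(n), X @ y)   # primal ridge solution

# Best coefficients v of w in the columns of X, and the leftover residual
v, *_ = np.linalg.lstsq(X, w, rcond=None)
print(np.linalg.norm(w - X @ v))               # ~1e-15: w is (numerically) X v
```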


Proof

Any $w \in \mathbb{R}^n$ can be written as the sum of two orthogonal vectors:

$w = Xv + r,$

where $X^T r = 0$ (that is, $r$ lies in the nullspace $\mathcal{N}(X^T)$).

Background (the fundamental theorem of linear algebra): since $\mathcal{N}(A) = \mathcal{R}(A^T)^\perp$ and, for any subspace $\mathcal{S}$, $(\mathcal{S}^\perp)^\perp = \mathcal{S}$, taking orthogonal complements gives $\mathcal{N}(A)^\perp = \mathcal{R}(A^T)$, hence

$\mathbb{R}^n = \mathcal{N}(A) \oplus \mathcal{N}(A)^\perp = \mathcal{N}(A) \oplus \mathcal{R}(A^T).$

That is, the input space $\mathbb{R}^n$ decomposes as the direct sum of the two orthogonal subspaces $\mathcal{N}(A)$ and $\mathcal{R}(A^T)$. Since $\dim \mathcal{R}(A^T) = \dim \mathcal{R}(A) = \operatorname{rank}(A)$ and $\dim \mathbb{R}^n = n$, we also obtain

$\dim \mathcal{N}(A) + \operatorname{rank}(A) = n.$

By a similar reasoning,

$\mathcal{R}(A)^\perp = \{ y \in \mathbb{R}^m : y^T z = 0 \ \ \forall z \in \mathcal{R}(A) \} = \{ y \in \mathbb{R}^m : y^T A x = 0 \ \ \forall x \in \mathbb{R}^n \} = \mathcal{N}(A^T),$

hence, taking orthogonal complements of both sides, $\mathcal{R}(A) = \mathcal{N}(A^T)^\perp$. Therefore the output space $\mathbb{R}^m$ decomposes as

$\mathbb{R}^m = \mathcal{R}(A) \oplus \mathcal{R}(A)^\perp = \mathcal{R}(A) \oplus \mathcal{N}(A^T),$

and $\dim \mathcal{N}(A^T) + \operatorname{rank}(A) = m$.

[Figure omitted: illustration of the fundamental theorem of linear algebra in $\mathbb{R}^3$; it shows the case $X = A = (a_1, a_2)$.]

Applying the decomposition with $A = X^T$ justifies writing $w = Xv + r$ as above. Then $X^T w = X^T X v$, so the loss term does not depend on $r$, while by orthogonality $\|w\|_2^2 = \|Xv\|_2^2 + \|r\|_2^2$. Setting $r = 0$ can therefore only decrease the objective, so an optimal $w$ satisfies $w = Xv$.


Consequence of key result

For the generic problem:

$\min_w \; L(X^T w) + \lambda \|w\|_2^2,$

the optimal $w$ can be written as $w = Xv$ for some vector $v \in \mathbb{R}^m$.

Hence the training problem depends only on $K := X^T X$:

$\min_v \; L(Kv) + \lambda\, v^T K v.$


Kernel matrix

The training problem depends only on the "kernel matrix" $K = X^T X$:

$K_{ij} = x_i^T x_j.$

$K$ contains the scalar products between all data point pairs.

The prediction/classification rule depends on the scalar products between the new point $x$ and the data points $x_1, \ldots, x_m$:

$w^T x = v^T X^T x = v^T k, \quad k := X^T x = (x^T x_1, \ldots, x^T x_m).$
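Putting the last two slides together for the squared loss: a sketch of kernel ridge regression that touches the data only through $K$ and $k$. The closed form $v = (K + \lambda I)^{-1} y$ for this particular loss is a standard fact assumed here, not derived on the slides; the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, lam = 200, 30, 0.1
X = rng.standard_normal((n, m))                # columns are data points
y = rng.standard_normal(m)

K = X.T @ X                                    # kernel matrix, K_ij = x_i^T x_j
v = np.linalg.solve(K + lam * np.eye(m), y)    # only m = 30 variables, even though n = 200

x_new = rng.standard_normal(n)
k = X.T @ x_new                                # k = (x^T x_1, ..., x^T x_m)
y_hat = v @ k                                  # prediction w^T x = v^T k

# Agrees with the primal solution w = (X X^T + lam I)^{-1} X y
w = np.linalg.solve(X @ X.T + lam * np.eye(n), X @ y)
print(np.isclose(y_hat, w @ x_new))            # True
```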


Computational advantages

Once $K$ is formed (this takes $O(n)$ per entry, $O(m^2 n)$ in total), the training problem has only $m$ variables.

When $n \gg m$, this leads to a dramatic reduction in problem size.


How about the nonlinear case?

In the nonlinear case, we simply replace the feature vectors $x_i$ by some "augmented" feature vectors $\phi(x_i)$, with $\phi$ a nonlinear mapping.

Example: in classification with a quadratic decision boundary, we use

$\phi(x) := (x_1, x_2, x_1^2, x_1 x_2, x_2^2).$

This leads to the modified kernel matrix

$K_{ij} = \phi(x_i)^T \phi(x_j), \quad 1 \le i, j \le m.$


The kernel function

The kernel function associated with the mapping $\phi$ is

$k(x, z) = \phi(x)^T \phi(z).$

It provides information about the metric in the feature space, e.g.:

$\|\phi(x) - \phi(z)\|_2^2 = k(x, x) - 2\,k(x, z) + k(z, z).$

The computational effort involved in
- solving the training problem, and
- making a prediction

depends only on our ability to quickly evaluate such scalar products.

We can’t choose k arbitrarily; it has to satisfy the above for some φ.
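A tiny check of the metric identity above, using the quadratic feature map from the earlier classification example:

```python
import numpy as np

phi = lambda x: np.array([x[0], x[1], x[0]**2, x[0]*x[1], x[1]**2])
k = lambda x, z: phi(x) @ phi(z)

x, z = np.array([1.0, -2.0]), np.array([0.5, 3.0])
lhs = np.sum((phi(x) - phi(z)) ** 2)           # ||phi(x) - phi(z)||_2^2
rhs = k(x, x) - 2 * k(x, z) + k(z, z)
print(np.isclose(lhs, rhs))                    # True
```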




Quadratic kernels

Classification with quadratic boundaries involves feature vectors

$\phi(x) = (1,\; x_1,\; x_2,\; x_1^2,\; x_1 x_2,\; x_2^2).$

Fact: given two vectors $x, z \in \mathbb{R}^2$, and rescaling the map to $\phi(x) = (1,\; \sqrt{2}\,x_1,\; \sqrt{2}\,x_2,\; x_1^2,\; \sqrt{2}\,x_1 x_2,\; x_2^2)$ (which yields the same family of quadratic boundaries), we have

$\phi(x)^T \phi(z) = (1 + x^T z)^2.$
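A numerical spot-check of this fact, using the rescaled map written above:

```python
import numpy as np

def phi(x):
    s = np.sqrt(2.0)
    return np.array([1.0, s * x[0], s * x[1], x[0]**2, s * x[0] * x[1], x[1]**2])

rng = np.random.default_rng(0)
x, z = rng.standard_normal(2), rng.standard_normal(2)
print(np.isclose(phi(x) @ phi(z), (1.0 + x @ z) ** 2))   # True
```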


Polynomial kernels

More generally, when $\phi(x)$ is the vector formed with all the products between the components of $x \in \mathbb{R}^n$ up to degree $d$ (again with suitable constant scalings of its entries), then for any two vectors $x, z \in \mathbb{R}^n$,

$\phi(x)^T \phi(z) = (1 + x^T z)^d.$

Computational effort grows linearly in $n$.

This is a dramatic reduction in computational effort compared with the "brute force" approach:
- form $\phi(x)$, $\phi(z)$;
- evaluate $\phi(x)^T \phi(z)$,

whose computational effort grows as $n^d$.


Other kernels

Gaussian kernel function:

$k(x, z) = \exp\!\left(-\frac{\|x - z\|_2^2}{2\sigma^2}\right),$

where $\sigma > 0$ is a scale parameter. It effectively allows points that are too far apart to be ignored, and corresponds to a nonlinear mapping $\phi$ into an infinite-dimensional feature space.

There is a large variety (a zoo?) of other kernels, some adapted to the structure of the data (text, images, etc.).
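A short NumPy sketch of building the Gaussian kernel matrix on synthetic data (the value of $\sigma$ is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, sigma = 3, 50, 1.5
X = rng.standard_normal((n, m))                # columns are data points

# Pairwise squared distances via ||x_i - x_j||^2 = x_i^T x_i - 2 x_i^T x_j + x_j^T x_j
sq_norms = np.sum(X**2, axis=0)
sq_dists = sq_norms[:, None] - 2 * X.T @ X + sq_norms[None, :]
K = np.exp(-sq_dists / (2 * sigma**2))         # K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
```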


In practice

- Kernels need to be chosen by the user.
- The choice is not always obvious; Gaussian or polynomial kernels are popular.
- Control over-fitting via cross-validation (with respect to, say, the scale parameter of the Gaussian kernel, or the degree of the polynomial kernel); a sketch follows below.
- Kernel methods are not well adapted to $\ell_1$-norm regularization.
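A sketch of the cross-validation step, assuming scikit-learn is available; the parameter grid and the synthetic data are placeholders, and scikit-learn's `gamma` corresponds to $1/(2\sigma^2)$ in the notation of the Gaussian kernel slide.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))              # scikit-learn expects one row per data point
y = np.sign(X[:, 0] + 0.3 * rng.standard_normal(200))

grid = {"C": [0.1, 1.0, 10.0], "gamma": [0.01, 0.1, 1.0]}
search = GridSearchCV(SVC(kernel="rbf"), grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```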