Lecture 7: Kernels for Classification and Regression



Lecture 7: Kernels for Classification and Regression
CS 194-10, Fall 2011

Laurent El Ghaoui

EECS Department, UC Berkeley

September 15, 2011


Outline

Motivations

Linear classification and regression: examples; generic form

The kernel trick: linear case; nonlinear case

Examples: polynomial kernels; other kernels; kernels in practice




A linear regression problem

Linear auto-regressive model for a time series: $y_t$ is a linear function of $y_{t-1}$, $y_{t-2}$:

$y_t = w_1 + w_2 y_{t-1} + w_3 y_{t-2}, \quad t = 1, \ldots, T.$

This writes $y_t = w^T x_t$, with $x_t$ the “feature vectors”

$x_t := (1, y_{t-1}, y_{t-2}), \quad t = 1, \ldots, T.$

Model fitting via least-squares:

$\min_w \; \|X^T w - y\|_2^2$

Prediction rule: $y_{T+1} = w_1 + w_2 y_T + w_3 y_{T-1} = w^T x_{T+1}$.
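As a quick illustration (not from the original slides), here is a minimal numpy sketch of fitting this AR(2) model by least squares; the toy series and variable names are made up, and the matrix below stacks the feature vectors $x_t$ as rows, i.e., it plays the role of $X^T$ in the slide's notation.

```python
import numpy as np

# Hypothetical toy time series, for illustration only.
y = np.array([1.0, 1.2, 0.9, 1.1, 1.3, 1.0, 0.8, 1.05, 1.15, 0.95])

# Feature vectors x_t = (1, y_{t-1}, y_{t-2}), stacked as rows (this is X^T).
A = np.column_stack([
    np.ones(len(y) - 2),  # constant term (w_1)
    y[1:-1],              # y_{t-1}
    y[:-2],               # y_{t-2}
])
targets = y[2:]

# Least-squares fit: min_w ||A w - targets||_2^2
w, *_ = np.linalg.lstsq(A, targets, rcond=None)

# One-step-ahead prediction: y_{T+1} = w^T (1, y_T, y_{T-1})
y_next = w @ np.array([1.0, y[-1], y[-2]])
print(w, y_next)
```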


Nonlinear regression

Nonlinear auto-regressive model for a time series: $y_t$ is a quadratic function of $y_{t-1}$, $y_{t-2}$:

$y_t = w_1 + w_2 y_{t-1} + w_3 y_{t-2} + w_4 y_{t-1}^2 + w_5 y_{t-1} y_{t-2} + w_6 y_{t-2}^2.$

This writes $y_t = w^T \phi(x_t)$, with $\phi(x_t)$ the augmented feature vectors

$\phi(x_t) := (1, y_{t-1}, y_{t-2}, y_{t-1}^2, y_{t-1} y_{t-2}, y_{t-2}^2).$

Everything is the same as before, with $x$ replaced by $\phi(x)$.
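Continuing the sketch above (again illustrative, with the same hypothetical data), the only change is the feature map:

```python
import numpy as np

def phi(y_prev1, y_prev2):
    """Quadratic feature map phi(x_t) = (1, y_{t-1}, y_{t-2}, y_{t-1}^2, y_{t-1}*y_{t-2}, y_{t-2}^2)."""
    return np.column_stack([
        np.ones_like(y_prev1),
        y_prev1,
        y_prev2,
        y_prev1 ** 2,
        y_prev1 * y_prev2,
        y_prev2 ** 2,
    ])

y = np.array([1.0, 1.2, 0.9, 1.1, 1.3, 1.0, 0.8, 1.05, 1.15, 0.95])
Phi = phi(y[1:-1], y[:-2])                       # augmented features, one row per sample
w, *_ = np.linalg.lstsq(Phi, y[2:], rcond=None)  # same least-squares fit as before
```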


Nonlinear classification

Non-linear (e.g., quadratic) decision boundary:

$w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_1 x_2 + w_5 x_2^2 + b = 0.$

This writes $w^T \phi(x) + b = 0$, with $\phi(x) := (x_1, x_2, x_1^2, x_1 x_2, x_2^2)$.


Challenges

In principle, it seems we can always augment the dimension of the feature space to make the data linearly separable. (See the video at http://www.youtube.com/watch?v=3liCbRZPrZA)

How do we do it in a computationally efficient manner?




Linear least-squares

$\min_w \; \|X^T w - y\|_2^2 + \lambda \|w\|_2^2$

where
- $X = [x_1, \ldots, x_m]$ is the $n \times m$ matrix of data points,
- $y \in \mathbb{R}^m$ is the “response” vector,
- $w$ contains the regression coefficients,
- $\lambda \ge 0$ is a regularization parameter.

Prediction rule: $y = w^T x$, where $x \in \mathbb{R}^n$ is a new data point.
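For the squared loss this regularized problem has a closed-form solution, $w = (XX^T + \lambda I)^{-1} X y$ (set the gradient to zero). A minimal numpy sketch, with hypothetical data and the columns-as-data-points convention above:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Solve min_w ||X^T w - y||_2^2 + lam * ||w||_2^2.

    X is n x m with data points as columns; setting the gradient to zero
    gives the normal equations (X X^T + lam * I) w = X y.
    """
    n = X.shape[0]
    return np.linalg.solve(X @ X.T + lam * np.eye(n), X @ y)

# Hypothetical example: n = 3 features, m = 50 data points.
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 50))
w_true = np.array([1.0, -2.0, 0.5])
y = X.T @ w_true + 0.1 * rng.standard_normal(50)
print(ridge_fit(X, y, lam=0.1))
```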


Support vector machine (SVM)

$\min_{w,b} \; \sum_{i=1}^m \max\big(0,\, 1 - y_i (w^T x_i + b)\big) + \lambda \|w\|_2^2$

where
- $X = [x_1, \ldots, x_m]$ is the $n \times m$ matrix of data points in $\mathbb{R}^n$,
- $y \in \{-1, 1\}^m$ is the label vector,
- $w$, $b$ contain the classifier coefficients,
- $\lambda \ge 0$ is a regularization parameter.

In the sequel, we’ll ignore the bias term (for simplicity only).

Classification rule: $y = \operatorname{sign}(w^T x + b)$, where $x \in \mathbb{R}^n$ is a new data point.
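As a teaching sketch (not the original course code), the hinge-loss objective above can be minimized by plain subgradient descent; the step size, iteration count, and data layout below are illustrative choices.

```python
import numpy as np

def svm_subgradient_fit(X, y, lam=0.1, lr=0.01, n_iters=2000):
    """Minimize sum_i max(0, 1 - y_i (w^T x_i + b)) + lam * ||w||_2^2
    by subgradient descent.  X is n x m (data points as columns),
    y has entries in {-1, +1}.
    """
    n, m = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(n_iters):
        margins = y * (X.T @ w + b)
        viol = margins < 1                          # points with nonzero hinge loss
        grad_w = -(X[:, viol] @ y[viol]) + 2.0 * lam * w
        grad_b = -np.sum(y[viol])
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Classification rule for a new point x: np.sign(w @ x + b)
```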


Generic form of problem

Many classification and regression problems can be written

$\min_w \; L(X^T w, y) + \lambda \|w\|_2^2$

where
- $X = [x_1, \ldots, x_m]$ is the $n \times m$ matrix of data points,
- $y \in \mathbb{R}^m$ contains the response vector (or the labels),
- $w$ contains the classifier coefficients,
- $L$ is a “loss” function that depends on the problem considered,
- $\lambda \ge 0$ is a regularization parameter.

The prediction/classification rule depends only on $w^T x$, where $x \in \mathbb{R}^n$ is a new data point.

CS 194-10, F’11Lect. 7

Motivations

Linear classificationand regressionExamples

Generic form

The kernel trickLinear case

Nonlinear case

ExamplesPolynomial kernels

Other kernels

Kernels in practice

Loss functions

- Squared loss (for linear least-squares regression):

  $L(z, y) = \|z - y\|_2^2.$

- Hinge loss (for SVMs):

  $L(z, y) = \sum_{i=1}^m \max(0,\, 1 - y_i z_i).$

- Logistic loss (for logistic regression):

  $L(z, y) = \sum_{i=1}^m \log\big(1 + e^{-y_i z_i}\big).$
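A small numpy sketch of these three losses, written as functions of $z = X^T w$ and the targets/labels $y$ (illustrative, not from the slides):

```python
import numpy as np

def squared_loss(z, y):
    """Squared loss ||z - y||_2^2 for regression."""
    return np.sum((z - y) ** 2)

def hinge_loss(z, y):
    """Hinge loss sum_i max(0, 1 - y_i z_i) for SVM classification, y_i in {-1, +1}."""
    return np.sum(np.maximum(0.0, 1.0 - y * z))

def logistic_loss(z, y):
    """Logistic loss sum_i log(1 + exp(-y_i z_i)); logaddexp keeps it numerically stable."""
    return np.sum(np.logaddexp(0.0, -y * z))
```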




Key result

For the generic problem:

$\min_w \; L(X^T w) + \lambda \|w\|_2^2,$

the optimal $w$ lies in the span of the data points $x_1, \ldots, x_m$:

$w = Xv$

for some vector $v \in \mathbb{R}^m$.


Proof

Any $w \in \mathbb{R}^n$ can be written as the sum of two orthogonal vectors:

$w = Xv + r,$

where $X^T r = 0$ (that is, $r$ is in the nullspace $\mathcal{N}(X^T)$). Since $X^T r = 0$, the loss term $L(X^T w) = L(X^T X v)$ does not depend on $r$, while by orthogonality $\|w\|_2^2 = \|Xv\|_2^2 + \|r\|_2^2$. Setting $r = 0$ can therefore only decrease the objective, so some optimal $w$ has the form $w = Xv$.

(The decomposition exists by the fundamental theorem of linear algebra: for any matrix $A$, $\mathbb{R}^n = \mathcal{N}(A) \oplus \mathcal{R}(A^T)$ and $\mathbb{R}^m = \mathcal{R}(A) \oplus \mathcal{N}(A^T)$, with $\dim \mathcal{N}(A) + \operatorname{rank}(A) = n$ and $\dim \mathcal{N}(A^T) + \operatorname{rank}(A) = m$.)

[Figure: illustration of the fundamental theorem of linear algebra in $\mathbb{R}^3$, for the case $X = A = [a_1\ a_2]$.]


Consequence of key result

For the generic problem:

$\min_w \; L(X^T w) + \lambda \|w\|_2^2,$

the optimal $w$ can be written as $w = Xv$ for some vector $v \in \mathbb{R}^m$.

Hence the training problem depends only on $K := X^T X$:

$\min_v \; L(Kv) + \lambda\, v^T K v.$


Kernel matrix

The training problem depends only on the “kernel matrix” $K = X^T X$:

$K_{ij} = x_i^T x_j.$

$K$ contains the scalar products between all pairs of data points.

The prediction/classification rule depends on the scalar products between the new point $x$ and the data points $x_1, \ldots, x_m$:

$w^T x = v^T X^T x = v^T k, \quad k := X^T x = (x^T x_1, \ldots, x^T x_m).$
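For the squared loss, for example, the kernelized problem $\min_v \|Kv - y\|_2^2 + \lambda\, v^T K v$ is solved by any $v$ with $(K + \lambda I)v = y$. A minimal numpy sketch with illustrative helper names (not from the slides):

```python
import numpy as np

def linear_kernel_matrix(X):
    """K = X^T X, where X is n x m with data points as columns."""
    return X.T @ X

def kernel_ridge_fit(K, y, lam):
    """For the squared loss, min_v ||K v - y||_2^2 + lam * v^T K v
    is solved by any v satisfying (K + lam * I) v = y (take lam > 0)."""
    return np.linalg.solve(K + lam * np.eye(K.shape[0]), y)

def kernel_predict(v, X, x_new):
    """w^T x_new = v^T k, with k_i = x_i^T x_new."""
    return v @ (X.T @ x_new)

# Hypothetical example: n = 5 features, m = 30 data points.
rng = np.random.default_rng(1)
X = rng.standard_normal((5, 30))
y = X.T @ rng.standard_normal(5)
v = kernel_ridge_fit(linear_kernel_matrix(X), y, lam=0.1)
print(kernel_predict(v, X, rng.standard_normal(5)))
```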


Computational advantages

Once $K$ is formed (each entry $K_{ij} = x_i^T x_j$ costs $O(n)$, so $O(m^2 n)$ in total), the training problem has only $m$ variables.

When $n \gg m$, this leads to a dramatic reduction in problem size.


How about the nonlinear case?

In the nonlinear case, we simply replace the feature vectors $x_i$ by some “augmented” feature vectors $\phi(x_i)$, with $\phi$ a non-linear mapping.

Example: in classification with a quadratic decision boundary, we use

$\phi(x) := (x_1, x_2, x_1^2, x_1 x_2, x_2^2).$

This leads to the modified kernel matrix

$K_{ij} = \phi(x_i)^T \phi(x_j), \quad 1 \le i, j \le m.$


The kernel function

The kernel function associated with the mapping $\phi$ is

$k(x, z) = \phi(x)^T \phi(z).$

It provides information about the metric in the feature space, e.g.:

$\|\phi(x) - \phi(z)\|_2^2 = k(x, x) - 2\,k(x, z) + k(z, z).$

The computational effort involved in
- solving the training problem,
- making a prediction,

depends only on our ability to quickly evaluate such scalar products.

We can't choose $k$ arbitrarily; it has to satisfy the above for some $\phi$.




Quadratic kernels

Classification with quadratic boundaries involves feature vectors

$\phi(x) = (1, x_1, x_2, x_1^2, x_1 x_2, x_2^2).$

Fact: given two vectors $x, z \in \mathbb{R}^2$, we have

$\phi(x)^T \phi(z) = (1 + x^T z)^2,$

provided the components of $\phi$ are suitably weighted; exactly, the identity holds for $\phi(x) = (1, \sqrt{2}\,x_1, \sqrt{2}\,x_2, x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$, which describes the same set of quadratic boundaries.
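A quick numeric check of this identity (illustrative; it uses the weighted map mentioned above):

```python
import numpy as np

def phi_quad(x):
    """Weighted quadratic feature map on R^2 for which phi(x)^T phi(z) = (1 + x^T z)^2."""
    s = np.sqrt(2.0)
    return np.array([1.0,
                     s * x[0], s * x[1],
                     x[0] ** 2, s * x[0] * x[1], x[1] ** 2])

rng = np.random.default_rng(2)
x, z = rng.standard_normal(2), rng.standard_normal(2)
lhs = phi_quad(x) @ phi_quad(z)     # explicit feature-space inner product
rhs = (1.0 + x @ z) ** 2            # kernel evaluation
assert np.isclose(lhs, rhs)
```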


Polynomial kernels

More generally, when $\phi(x)$ is the vector formed with all the (suitably weighted) products between the components of $x \in \mathbb{R}^n$, up to degree $d$, then for any two vectors $x, z \in \mathbb{R}^n$,

$\phi(x)^T \phi(z) = (1 + x^T z)^d.$

Evaluating the kernel this way takes computational effort that grows only linearly in $n$.

This represents a dramatic reduction in effort compared with the “brute force” approach:

- form $\phi(x)$, $\phi(z)$;
- evaluate $\phi(x)^T \phi(z)$,

whose computational effort grows as $n^d$.
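A sketch of the $O(n)$ kernel evaluation (illustrative; the dimensions are made up):

```python
import numpy as np

def poly_kernel(x, z, d):
    """Polynomial kernel k(x, z) = (1 + x^T z)^d, evaluated in O(n) time."""
    return (1.0 + x @ z) ** d

# Even for large n and moderate d the evaluation stays cheap, whereas the
# explicit feature map phi would have on the order of n^d components.
rng = np.random.default_rng(3)
x, z = rng.standard_normal(1000), rng.standard_normal(1000)
print(poly_kernel(x, z, d=5))
```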


Other kernels

Gaussian kernel function:

$k(x, z) = \exp\!\left(-\frac{\|x - z\|_2^2}{2\sigma^2}\right),$

where $\sigma > 0$ is a scale parameter. It allows one to ignore points that are too far apart, and corresponds to a non-linear mapping $\phi$ into an infinite-dimensional feature space.

There is a large variety (a zoo?) of other kernels, some adapted to the structure of the data (text, images, etc.).
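A vectorized numpy sketch of the Gaussian kernel matrix (an illustrative helper, following the columns-as-data-points convention):

```python
import numpy as np

def gaussian_kernel_matrix(X, Z, sigma=1.0):
    """Gaussian (RBF) kernel matrix K_ij = exp(-||x_i - z_j||_2^2 / (2 sigma^2)).

    X is n x m1 and Z is n x m2, with data points as columns.
    """
    sq_x = np.sum(X ** 2, axis=0)[:, None]      # m1 x 1 column of ||x_i||^2
    sq_z = np.sum(Z ** 2, axis=0)[None, :]      # 1 x m2 row of ||z_j||^2
    sq_dists = sq_x + sq_z - 2.0 * (X.T @ Z)    # pairwise squared distances
    return np.exp(-sq_dists / (2.0 * sigma ** 2))
```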


In practice

- Kernels need to be chosen by the user.
- The choice is not always obvious; Gaussian and polynomial kernels are popular.
- Control over-fitting via cross-validation (with respect to, say, the scale parameter of the Gaussian kernel, or the degree of the polynomial kernel); a sketch follows below.
- Kernel methods are not well adapted to $\ell_1$-norm regularization.
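One common way to choose kernel hyperparameters by cross-validation, sketched here with scikit-learn (an illustration, not part of the lecture; the toy data and parameter grid are made up):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy nonlinearly separable problem (rows are data points, sklearn's convention).
rng = np.random.default_rng(4)
X = rng.standard_normal((200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)

# Grid-search the RBF scale (gamma = 1 / (2 sigma^2)) and regularization C by 5-fold CV.
param_grid = {"C": [0.1, 1.0, 10.0], "gamma": [0.1, 1.0, 10.0]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```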