
Introduction to Machine Learning
Lecture 15

Mehryar Mohri
Courant Institute and Google Research

[email protected]


Regression


Regression Problem

Training data: sample $S$ drawn i.i.d. from a set $X$ according to some distribution $D$,
$$S = ((x_1, y_1), \ldots, (x_m, y_m)) \in X \times Y,$$
with $Y \subseteq \mathbb{R}$ a measurable subset.

Loss function: $L \colon Y \times Y \to \mathbb{R}_+$, a measure of closeness, typically $L(y, y') = (y' - y)^2$ or $L(y, y') = |y' - y|^p$ for some $p \geq 1$.

Problem: find a hypothesis $h \colon X \to \mathbb{R}$ in $H$ with small generalization error with respect to the target $f$,
$$R_D(h) = \operatorname*{E}_{x \sim D}\big[L\big(h(x), f(x)\big)\big].$$


Notes

Empirical error:
$$\widehat{R}_D(h) = \frac{1}{m}\sum_{i=1}^{m} L\big(h(x_i), y_i\big).$$

In much of what follows:

• $Y = \mathbb{R}$ or $Y = [-M, M]$ for some $M > 0$.

• $L(y, y') = (y' - y)^2$: the mean squared error.
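To make the empirical error concrete, here is a minimal NumPy sketch (my own, not part of the lecture) with the squared loss; the hypothesis h and the toy data are hypothetical placeholders.

import numpy as np

def empirical_mse(h, xs, ys):
    """Empirical error (1/m) * sum_i L(h(x_i), y_i) with the squared loss."""
    preds = np.array([h(x) for x in xs])
    return np.mean((preds - np.asarray(ys)) ** 2)

# Toy usage: a linear hypothesis on scalar inputs.
h = lambda x: 2.0 * x + 1.0
xs = np.array([0.0, 1.0, 2.0])
ys = np.array([1.1, 2.9, 5.2])
print(empirical_mse(h, xs, ys))   # 0.02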


Generalization Bound - Finite H

Theorem: let $H$ be a finite hypothesis set, and assume that $L$ is bounded by $M$. Then, for any $\delta > 0$, with probability at least $1 - \delta$,
$$\forall h \in H, \quad R(h) \leq \widehat{R}(h) + M\sqrt{\frac{\log|H| + \log\frac{2}{\delta}}{2m}}.$$

Proof: by the union bound,
$$\Pr\Big[\exists h \in H \colon \big|R(h) - \widehat{R}(h)\big| > \epsilon\Big] \leq \sum_{h \in H} \Pr\Big[\big|R(h) - \widehat{R}(h)\big| > \epsilon\Big].$$

By Hoeffding's inequality, since $L\big(h(x), f(x)\big) \in [0, M]$,
$$\Pr\Big[\big|R(h) - \widehat{R}(h)\big| > \epsilon\Big] \leq 2e^{-\frac{2m\epsilon^2}{M^2}}.$$
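As a quick numeric illustration (my own addition, not from the slides), the complexity term of the bound can be evaluated directly; the values of |H|, M, m, and δ below are arbitrary examples.

import math

def finite_H_bound_term(card_H, M, m, delta):
    """Complexity term M * sqrt((log|H| + log(2/delta)) / (2m)) of the bound."""
    return M * math.sqrt((math.log(card_H) + math.log(2.0 / delta)) / (2.0 * m))

# Example: |H| = 1000 hypotheses, loss bounded by M = 1, m = 10000 points, delta = 0.05.
print(finite_H_bound_term(card_H=1000, M=1.0, m=10000, delta=0.05))   # about 0.023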


Linear Regression

Feature mapping $\Phi \colon X \to \mathbb{R}^N$.

Hypothesis set: linear functions $\{x \mapsto w \cdot \Phi(x) + b : w \in \mathbb{R}^N, b \in \mathbb{R}\}$.

(Figure: data points in the $(\Phi(x), y)$ plane with the fitted regression line.)

Optimization problem: empirical risk minimization,
$$\min_{w, b} \; F(w, b) = \frac{1}{m}\sum_{i=1}^{m}\big(w \cdot \Phi(x_i) + b - y_i\big)^2.$$


Linear Regression - Solution

Rewrite the objective function as
$$F(W) = \frac{1}{m}\big\|X^\top W - Y\big\|^2,$$
with
$$W = \begin{bmatrix} w_1 \\ \vdots \\ w_N \\ b \end{bmatrix}, \quad Y = \begin{bmatrix} y_1 \\ \vdots \\ y_m \end{bmatrix}, \quad X = \begin{bmatrix} \Phi(x_1) \;\; \cdots \;\; \Phi(x_m) \\ 1 \;\;\; \cdots \;\;\; 1 \end{bmatrix}, \quad X^\top = \begin{bmatrix} \Phi(x_1)^\top \;\; 1 \\ \vdots \\ \Phi(x_m)^\top \;\; 1 \end{bmatrix}.$$

Convex and differentiable function:
$$\nabla F(W) = \frac{2}{m}\, X\big(X^\top W - Y\big),$$
$$\nabla F(W) = 0 \iff X\big(X^\top W - Y\big) = 0 \iff X X^\top W = X Y.$$


Linear Regression - Solution

Solution:
$$W = (X X^\top)^{-1} X Y \;\; \text{if } X X^\top \text{ is invertible}, \qquad W = (X X^\top)^{\dagger} X Y \;\; \text{in general}.$$

• Computational complexity: $O(mN^2 + N^3)$, assuming matrix inversion in $O(N^3)$.

• Poor guarantees, no regularization.

• For output labels in $\mathbb{R}^p$, $p > 1$, solve $p$ distinct linear regression problems.
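A minimal NumPy sketch (my own, not from the slides) of this closed-form solution, using the pseudo-inverse to cover the general case; the toy data and the identity feature map are assumptions for illustration.

import numpy as np

def linear_regression_fit(Phi, y):
    """Solve X X^T W = X y, where X has columns (Phi(x_i), 1) as on the slide."""
    m = Phi.shape[0]
    X = np.vstack([Phi.T, np.ones((1, m))])    # (N+1) x m
    W = np.linalg.pinv(X @ X.T) @ (X @ y)      # pseudo-inverse handles the general case
    return W[:-1], W[-1]                        # (w, b)

# Toy usage with the identity feature map Phi(x) = x.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(100, 3))
y = Phi @ np.array([1.0, -2.0, 0.5]) + 0.3 + 0.01 * rng.normal(size=100)
w, b = linear_regression_fit(Phi, y)
print(w, b)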


Ridge Regression (Hoerl and Kennard, 1970)

Optimization problem:
$$\min_{w, b} \; F(w, b) = \lambda\|w\|^2 + \sum_{i=1}^{m}\big(w \cdot \Phi(x_i) + b - y_i\big)^2,$$
where $\lambda \geq 0$ is a (regularization) parameter.

• directly based on the generalization bound.
• generalization of linear regression.
• closed-form solution.
• can be used with kernels.


Ridge Regression - Solution

Assume $b = 0$: often a constant feature is used instead (but this is not equivalent to the use of the original offset!).

Rewrite the objective function as
$$F(W) = \lambda\|W\|^2 + \big\|X^\top W - Y\big\|^2.$$

Convex and differentiable function:
$$\nabla F(W) = 2\lambda W + 2X\big(X^\top W - Y\big),$$
$$\nabla F(W) = 0 \iff \big(X X^\top + \lambda I\big)W = X Y.$$

Solution:
$$W = \big(\underbrace{X X^\top + \lambda I}_{\text{always invertible}}\big)^{-1} X Y.$$
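A corresponding sketch (again mine, not from the lecture) of this primal solution with $b = 0$ as assumed above; the toy data are placeholders.

import numpy as np

def ridge_fit_primal(Phi, y, lam):
    """Primal ridge regression with b = 0: W = (X X^T + lam I)^{-1} X y, X = Phi^T."""
    X = Phi.T                                   # N x m, columns are Phi(x_i)
    N = X.shape[0]
    return np.linalg.solve(X @ X.T + lam * np.eye(N), X @ y)

# Toy usage.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(50, 5))
y = Phi @ rng.normal(size=5) + 0.1 * rng.normal(size=50)
print(ridge_fit_primal(Phi, y, lam=1.0))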


Ridge Regression - Equivalent Formulations

Optimization problem:
$$\min_{w, b} \; \sum_{i=1}^{m}\big(w \cdot \Phi(x_i) + b - y_i\big)^2 \quad \text{subject to: } \|w\|^2 \leq \Lambda^2.$$

Optimization problem:
$$\min_{w, b} \; \sum_{i=1}^{m}\xi_i^2 \quad \text{subject to: } \xi_i = w \cdot \Phi(x_i) + b - y_i, \;\; \|w\|^2 \leq \Lambda^2.$$


Ridge Regression Equations

Lagrangian: assume $b = 0$. For all $\xi$, $w$, $\alpha'$, and $\lambda \geq 0$,
$$\mathcal{L}(\xi, w, \alpha', \lambda) = \sum_{i=1}^{m}\xi_i^2 + \sum_{i=1}^{m}\alpha'_i\big(y_i - \xi_i - w \cdot \Phi(x_i)\big) + \lambda\big(\|w\|^2 - \Lambda^2\big).$$

KKT conditions:
$$\nabla_w \mathcal{L} = -\sum_{i=1}^{m}\alpha'_i\Phi(x_i) + 2\lambda w = 0 \iff w = \frac{1}{2\lambda}\sum_{i=1}^{m}\alpha'_i\Phi(x_i),$$
$$\nabla_{\xi_i}\mathcal{L} = 2\xi_i - \alpha'_i = 0 \iff \xi_i = \alpha'_i/2,$$
$$\forall i \in [1, m], \;\; \alpha'_i\big(y_i - \xi_i - w \cdot \Phi(x_i)\big) = 0,$$
$$\lambda\big(\|w\|^2 - \Lambda^2\big) = 0.$$


Moving to The Dual

Plugging in the expressions of $w$ and the $\xi_i$'s gives
$$\mathcal{L} = \sum_{i=1}^{m}\frac{\alpha_i'^2}{4} + \sum_{i=1}^{m}\alpha'_i y_i - \sum_{i=1}^{m}\frac{\alpha_i'^2}{2} - \frac{1}{2\lambda}\sum_{i,j=1}^{m}\alpha'_i\alpha'_j\,\Phi(x_i)^\top\Phi(x_j) + \lambda\Big(\frac{1}{4\lambda^2}\Big\|\sum_{i=1}^{m}\alpha'_i\Phi(x_i)\Big\|^2 - \Lambda^2\Big).$$

Thus,
$$\mathcal{L} = -\frac{1}{4}\sum_{i=1}^{m}\alpha_i'^2 + \sum_{i=1}^{m}\alpha'_i y_i - \frac{1}{4\lambda}\sum_{i,j=1}^{m}\alpha'_i\alpha'_j\,\Phi(x_i)^\top\Phi(x_j) - \lambda\Lambda^2$$
$$= -\lambda\sum_{i=1}^{m}\alpha_i^2 + 2\sum_{i=1}^{m}\alpha_i y_i - \sum_{i,j=1}^{m}\alpha_i\alpha_j\,\Phi(x_i)^\top\Phi(x_j) - \lambda\Lambda^2,$$
with $\alpha'_i = 2\lambda\alpha_i$.


RR - Dual Optimization Problem

Optimization problem:
$$\max_{\alpha \in \mathbb{R}^m} \; -\lambda\alpha^\top\alpha + 2\alpha^\top y - \alpha^\top\big(X^\top X\big)\alpha
\quad \text{or} \quad
\max_{\alpha \in \mathbb{R}^m} \; -\alpha^\top\big(X^\top X + \lambda I\big)\alpha + 2\alpha^\top y.$$

Solution:
$$h(x) = \sum_{i=1}^{m}\alpha_i\,\Phi(x_i)\cdot\Phi(x), \quad \text{with } \alpha = \big(X^\top X + \lambda I\big)^{-1}y.$$


Direct Dual Solution

Lemma: the following matrix identity always holds:
$$\big(X X^\top + \lambda I\big)^{-1}X = X\big(X^\top X + \lambda I\big)^{-1}.$$

Proof: observe that $\big(X X^\top + \lambda I\big)X = X\big(X^\top X + \lambda I\big)$. Left-multiplying by $\big(X X^\top + \lambda I\big)^{-1}$ and right-multiplying by $\big(X^\top X + \lambda I\big)^{-1}$ yields the statement.

Dual solution: $\alpha$ such that
$$W = \sum_{i=1}^{m}\alpha_i K(x_i, \cdot) = \sum_{i=1}^{m}\alpha_i\Phi(x_i) = X\alpha.$$
By the lemma, $W = \big(X X^\top + \lambda I\big)^{-1}XY = X\big(X^\top X + \lambda I\big)^{-1}Y$. This gives $\alpha = \big(X^\top X + \lambda I\big)^{-1}Y$.
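Both the lemma and the resulting agreement between the primal and dual solutions are easy to verify numerically; a small NumPy check (my own, on random data):

import numpy as np

rng = np.random.default_rng(0)
N, m, lam = 4, 7, 0.5
X = rng.normal(size=(N, m))                          # columns are feature vectors Phi(x_i)
Y = rng.normal(size=m)

lhs = np.linalg.inv(X @ X.T + lam * np.eye(N)) @ X   # (X X^T + lam I)^{-1} X
rhs = X @ np.linalg.inv(X.T @ X + lam * np.eye(m))   # X (X^T X + lam I)^{-1}
print(np.allclose(lhs, rhs))                          # True: the matrix identity

W_primal = np.linalg.inv(X @ X.T + lam * np.eye(N)) @ (X @ Y)
alpha = np.linalg.inv(X.T @ X + lam * np.eye(m)) @ Y  # dual solution
print(np.allclose(W_primal, X @ alpha))               # True: W = X alpha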


Computational Complexity

            Solution            Prediction
Primal      O(mN^2 + N^3)       O(N)
Dual        O(κm^2 + m^3)       O(κm)

(Here κ denotes the cost of computing one kernel value.)


Kernel Ridge Regression (Saunders et al., 1998)

Optimization problem:
$$\max_{\alpha \in \mathbb{R}^m} \; -\lambda\alpha^\top\alpha + 2\alpha^\top y - \alpha^\top K\alpha
\quad \text{or} \quad
\max_{\alpha \in \mathbb{R}^m} \; -\alpha^\top\big(K + \lambda I\big)\alpha + 2\alpha^\top y.$$

Solution:
$$h(x) = \sum_{i=1}^{m}\alpha_i K(x_i, x), \quad \text{with } \alpha = \big(K + \lambda I\big)^{-1}y.$$
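A minimal kernel ridge regression sketch (mine, not from the lecture), using a Gaussian kernel as an assumed example kernel and implementing α = (K + λI)^{-1} y and h(x) = Σ_i α_i K(x_i, x) directly:

import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def krr_fit(X_train, y, lam, sigma=1.0):
    K = gaussian_kernel(X_train, X_train, sigma)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)     # alpha = (K + lam I)^{-1} y

def krr_predict(alpha, X_train, X_test, sigma=1.0):
    return gaussian_kernel(X_test, X_train, sigma) @ alpha   # h(x) = sum_i alpha_i K(x_i, x)

# Toy usage: fit a noisy sine curve.
rng = np.random.default_rng(0)
X_train = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X_train[:, 0]) + 0.1 * rng.normal(size=40)
alpha = krr_fit(X_train, y, lam=0.1)
print(krr_predict(alpha, X_train, np.array([[0.0], [1.5]])))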


Notes

Advantages:
• strong theoretical guarantees.
• generalization to outputs in $\mathbb{R}^p$: single matrix inversion (Cortes et al., 2007).
• use of kernels.

Disadvantages:
• solution not sparse.
• training time for large matrices: use low-rank approximations of the kernel matrix, e.g., the Nyström approximation or a partial Cholesky decomposition (see the sketch after this list).
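As a rough illustration of the low-rank idea in the last bullet, here is a sketch (my own, with an arbitrary uniform sampling scheme and rank) of a Nyström approximation: sample r columns of K, then approximate K by C W^† C^T.

import numpy as np

def nystrom_approximation(K, r, rng):
    """Rank-r Nystrom approximation K ~ C W^+ C^T built from r sampled columns."""
    m = K.shape[0]
    idx = rng.choice(m, size=r, replace=False)   # sampled landmark indices
    C = K[:, idx]                                # m x r block of sampled columns
    W = K[np.ix_(idx, idx)]                      # r x r block
    return C @ np.linalg.pinv(W) @ C.T

# Toy usage on a random low-rank PSD matrix.
rng = np.random.default_rng(0)
A = rng.normal(size=(200, 20))
K = A @ A.T                                      # rank-20 PSD matrix
K_approx = nystrom_approximation(K, r=30, rng=rng)
print(np.linalg.norm(K - K_approx) / np.linalg.norm(K))   # small relative error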


Support Vector Regression (Vapnik, 1995)

Hypothesis set: $\{x \mapsto w \cdot \Phi(x) + b : w \in \mathbb{R}^N, b \in \mathbb{R}\}$.

Loss function: $\epsilon$-insensitive loss,
$$L(y, y') = |y' - y|_\epsilon = \max\big(0, |y' - y| - \epsilon\big).$$

Fit a 'tube' of width $\epsilon$ to the data.

(Figure: points in the $(\Phi(x), y)$ plane, the regression line $w \cdot \Phi(x) + b$, and the $\epsilon$-tube around it.)
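The ε-insensitive loss is a one-liner; a tiny NumPy sketch (mine) for reference:

import numpy as np

def eps_insensitive_loss(y, y_pred, eps):
    """L(y, y') = max(0, |y' - y| - eps): zero inside the eps-tube."""
    return np.maximum(0.0, np.abs(y_pred - y) - eps)

print(eps_insensitive_loss(np.array([1.0, 1.0, 1.0]), np.array([1.05, 1.3, 0.2]), eps=0.1))
# [0.  0.2 0.7]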


Support Vector Regression (SVR) (Vapnik, 1995)

Optimization problem: similar to that of SVM,
$$\min_{w, b} \; \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\big|y_i - (w \cdot \Phi(x_i) + b)\big|_\epsilon.$$

Equivalent formulation:
$$\min_{w, b, \xi, \xi'} \; \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}(\xi_i + \xi'_i)$$
$$\text{subject to: } \; (w \cdot \Phi(x_i) + b) - y_i \leq \epsilon + \xi_i, \quad y_i - (w \cdot \Phi(x_i) + b) \leq \epsilon + \xi'_i, \quad \xi_i \geq 0, \; \xi'_i \geq 0.$$


SVR - Dual Optimization Problem

Optimization problem:
$$\max_{\alpha, \alpha'} \; -\epsilon(\alpha' + \alpha)^\top \mathbf{1} + (\alpha' - \alpha)^\top y - \frac{1}{2}(\alpha' - \alpha)^\top K (\alpha' - \alpha)$$
$$\text{subject to: } (0 \leq \alpha \leq C) \wedge (0 \leq \alpha' \leq C) \wedge \big((\alpha' - \alpha)^\top \mathbf{1} = 0\big).$$

Solution:
$$h(x) = \sum_{i=1}^{m}(\alpha'_i - \alpha_i)K(x_i, x) + b,$$
with
$$b = -\sum_{j=1}^{m}(\alpha'_j - \alpha_j)K(x_j, x_i) + y_i + \epsilon \quad \text{when } \alpha_i \in (0, C),$$
$$b = -\sum_{j=1}^{m}(\alpha'_j - \alpha_j)K(x_j, x_i) + y_i - \epsilon \quad \text{when } \alpha'_i \in (0, C).$$

Support vectors: points strictly outside the tube.
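In practice this QP is usually handed to an off-the-shelf solver; a hedged usage sketch with scikit-learn's SVR, assuming scikit-learn is installed (its C and epsilon parameters play the roles of C and the tube width on this slide):

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

model = SVR(kernel="rbf", C=1.0, epsilon=0.1)
model.fit(X, y)
print(model.predict(np.array([[0.0], [1.5]])))
print(len(model.support_), "support vectors out of", len(X))   # sparsity of the solution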


Notes

Advantages:
• strong theoretical guarantees.
• sparser solution.
• use of kernels.

Disadvantages:
• selection of two parameters: $C$ and $\epsilon$. Heuristics: search $C$ near the maximum value of $y$, $\epsilon$ near the average difference of the $y$'s; $\epsilon$ is a measure of the number of support vectors.
• large matrices: low-rank approximations of the kernel matrix.


Alternative Loss Functions

(Figure: plots of three loss functions of the error $x = y' - y$.)

• $\epsilon$-insensitive loss: $x \mapsto \max(0, |x| - \epsilon)$.
• Quadratic $\epsilon$-insensitive loss: $x \mapsto \max(0, |x| - \epsilon)^2$.
• Huber loss: $x \mapsto x^2$ if $|x| \leq c$, and $x \mapsto 2c|x| - c^2$ otherwise.
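A small sketch (mine) of the two other losses on this slide, complementing the ε-insensitive loss defined earlier; eps and c are free parameters.

import numpy as np

def quadratic_eps_insensitive(x, eps):
    """x -> max(0, |x| - eps)^2."""
    return np.maximum(0.0, np.abs(x) - eps) ** 2

def huber(x, c):
    """x -> x^2 on [-c, c], and 2c|x| - c^2 otherwise (quadratic near 0, linear in the tails)."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) > c, 2 * c * np.abs(x) - c ** 2, x ** 2)

xs = np.linspace(-2, 2, 5)
print(quadratic_eps_insensitive(xs, eps=0.5))   # [2.25 0.25 0.   0.25 2.25]
print(huber(xs, c=1.0))                         # [3. 1. 0. 1. 3.]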


SVR - Quadratic Loss

Optimization problem:
$$\max_{\alpha, \alpha'} \; -\epsilon(\alpha' + \alpha)^\top \mathbf{1} + (\alpha' - \alpha)^\top y - \frac{1}{2}(\alpha' - \alpha)^\top \Big(K + \frac{1}{C}I\Big)(\alpha' - \alpha)$$
$$\text{subject to: } (\alpha \geq 0) \wedge (\alpha' \geq 0) \wedge \big((\alpha' - \alpha)^\top \mathbf{1} = 0\big).$$

Solution:
$$h(x) = \sum_{i=1}^{m}(\alpha'_i - \alpha_i)K(x_i, x) + b,$$
with
$$b = -\sum_{j=1}^{m}(\alpha'_j - \alpha_j)K(x_j, x_i) + y_i + \epsilon \quad \text{when } \alpha_i > 0 \wedge \xi_i = 0,$$
$$b = -\sum_{j=1}^{m}(\alpha'_j - \alpha_j)K(x_j, x_i) + y_i - \epsilon \quad \text{when } \alpha'_i > 0 \wedge \xi'_i = 0.$$

Support vectors: points strictly outside the tube. For $\epsilon = 0$, coincides with KRR.


On-line Regression

On-line version of batch algorithms:
• stochastic gradient descent.
• primal or dual.

Examples:
• Mean squared error function: Widrow-Hoff (or LMS) algorithm (Widrow and Hoff, 1988).
• SVR $\epsilon$-insensitive (dual) linear or quadratic function: on-line SVR.


Widrow-Hoff (Widrow and Hoff, 1988)

WidrowHoff(w_0)
    w_1 ← w_0                                      ▷ typically w_0 = 0
    for t ← 1 to T do
        Receive(x_t)
        ŷ_t ← w_t · x_t
        Receive(y_t)
        w_{t+1} ← w_t − 2η(w_t · x_t − y_t) x_t    ▷ η > 0
    return w_{T+1}
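A direct Python translation of the pseudocode (my sketch; the on-line stream is simulated by iterating over a fixed array):

import numpy as np

def widrow_hoff(xs, ys, eta=0.01, w0=None):
    """LMS / Widrow-Hoff: one stochastic gradient step per example on the squared loss."""
    w = np.zeros(xs.shape[1]) if w0 is None else w0.copy()
    for x_t, y_t in zip(xs, ys):
        y_hat = w @ x_t                        # prediction
        w = w - 2 * eta * (y_hat - y_t) * x_t  # gradient step on (w . x - y)^2
    return w

# Toy usage: recover a linear target from a stream of examples.
rng = np.random.default_rng(0)
xs = rng.normal(size=(5000, 3))
ys = xs @ np.array([1.0, -2.0, 0.5]) + 0.05 * rng.normal(size=5000)
print(widrow_hoff(xs, ys, eta=0.01))   # close to [1, -2, 0.5]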


Dual On-Line SVR ($b = 0$) (Vijayakumar and Wu, 1999)

DualSVR()
    α ← 0
    α' ← 0
    for t ← 1 to T do
        Receive(x_t)
        ŷ_t ← Σ_{s=1}^{T} (α'_s − α_s) K(x_s, x_t)
        Receive(y_t)
        α'_{t+1} ← α'_t + min(max(η(y_t − ŷ_t − ε), −α'_t), C − α'_t)
        α_{t+1} ← α_t + min(max(η(ŷ_t − y_t − ε), −α_t), C − α_t)
    return Σ_{t=1}^{T} (α'_t − α_t) K(x_t, ·)
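A compact Python version of DualSVR (my sketch, with a Gaussian kernel as an assumed choice; the returned closure plays the role of the hypothesis Σ_t (α'_t − α_t) K(x_t, ·)):

import numpy as np

def dual_online_svr(xs, ys, K, eta=0.1, eps=0.1, C=1.0):
    """On-line SVR in the dual (b = 0), following the pseudocode above."""
    T = len(xs)
    alpha, alpha_p = np.zeros(T), np.zeros(T)
    for t in range(T):
        k_t = np.array([K(xs[s], xs[t]) for s in range(T)])
        y_hat = (alpha_p - alpha) @ k_t
        # Clipped additive updates keep each coefficient in [0, C].
        alpha_p[t] += np.clip(eta * (ys[t] - y_hat - eps), -alpha_p[t], C - alpha_p[t])
        alpha[t] += np.clip(eta * (y_hat - ys[t] - eps), -alpha[t], C - alpha[t])
    return lambda x: sum((alpha_p[t] - alpha[t]) * K(xs[t], x) for t in range(T))

# Toy usage with a Gaussian kernel.
gauss = lambda a, b: np.exp(-np.sum((a - b) ** 2) / 2.0)
rng = np.random.default_rng(0)
xs = rng.uniform(-3, 3, size=(60, 1))
ys = np.sin(xs[:, 0]) + 0.05 * rng.normal(size=60)
h = dual_online_svr(xs, ys, gauss, eta=0.5, eps=0.05, C=10.0)
print(h(np.array([0.0])), h(np.array([1.5])))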


LASSO (Tibshirani, 1996)

Optimization problem: 'least absolute shrinkage and selection operator',
$$\min_{w, b} \; F(w, b) = \lambda\|w\|_1 + \sum_{i=1}^{m}\big(w \cdot x_i + b - y_i\big)^2,$$
where $\lambda \geq 0$ is a (regularization) parameter.

Solution: convex quadratic program (QP).
• general: standard QP solvers.
• specific algorithm: LARS (least angle regression procedure), entire path of solutions.
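A minimal sketch (not from the lecture) of one standard way to solve this objective numerically: proximal gradient descent (ISTA) with soft-thresholding, ignoring the offset b; the step size, λ, and toy data are arbitrary choices.

import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Minimize lam * ||w||_1 + ||X w - y||^2 by proximal gradient descent (b omitted)."""
    step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)   # 1 / Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ w - y)
        w = soft_threshold(w - step * grad, step * lam)
    return w

# Toy usage: sparse ground truth; the L1 penalty drives most coordinates to exactly 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.0, 0.5]
y = X @ w_true + 0.05 * rng.normal(size=100)
print(lasso_ista(X, y, lam=5.0).round(2))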


Sparsity of L1 Regularization

(Figure: comparison of the L1 regularization and L2 regularization constraint regions; the corners of the L1 ball make sparse solutions more likely.)


Notes

Advantages:

• theoretical guarantees.

• sparse solution.

• feature selection.

Drawbacks:

• no natural use of kernels.

• no closed-form solution (not necessary, but can be nice for theoretical analysis).



Regression

Many other families of algorithms, including:

• neural networks.

• decision trees.

• boosting trees for regression.


References

• Corinna Cortes, Mehryar Mohri, and Jason Weston. A general regression framework for learning string-to-string mappings. In Predicting Structured Data. MIT Press, 2007.

• Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least angle regression. Annals of Statistics, 32(2):407-499, 2004.

• Arthur Hoerl and Robert Kennard. Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 12:55-67, 1970.

• David Pollard. Convergence of Stochastic Processes. Springer, New York, 1984.

• David Pollard. Empirical Processes: Theory and Applications. Institute of Mathematical Statistics, 1990.

• C. Saunders, A. Gammerman, and V. Vovk. Ridge regression learning algorithm in dual variables. In Proceedings of ICML 1998, pages 515-521, 1998.

• Robert Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58:267-288, 1996.

• Sethu Vijayakumar and Si Wu. Sequential support vector classifiers and regression. In Proceedings of the International Conference on Soft Computing (SOCO'99), 1999.

• Vladimir N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer, Berlin, 1982.

• Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.

• Vladimir N. Vapnik. Statistical Learning Theory. Wiley-Interscience, New York, 1998.

• Bernard Widrow and Ted Hoff. Adaptive switching circuits. In Neurocomputing: Foundations of Research, pages 123-134. MIT Press, 1988.