Introduction to Machine Learning
Lecture 15

Mehryar Mohri
Courant Institute and Google Research
Mehryar Mohri - Introduction to Machine Learning

Regression
Regression Problem

Training data: a sample drawn i.i.d. from $X$ according to some distribution $D$,
$$S = ((x_1, y_1), \ldots, (x_m, y_m)) \in X \times Y,$$
with $Y \subseteq \mathbb{R}$ a measurable subset.

Loss function: $L : Y \times Y \to \mathbb{R}_+$, a measure of closeness, typically $L(y, y') = (y' - y)^2$ or $L(y, y') = |y' - y|^p$ for some $p \ge 1$.

Problem: find a hypothesis $h : X \to \mathbb{R}$ in $H$ with small generalization error with respect to the target $f$:
$$R_D(h) = \mathop{\mathrm{E}}_{x \sim D}\big[L\big(h(x), f(x)\big)\big].$$
Notes

Empirical error:
$$\widehat{R}_D(h) = \frac{1}{m} \sum_{i=1}^m L\big(h(x_i), y_i\big).$$

In much of what follows:
• $Y = \mathbb{R}$ or $Y = [-M, M]$ for some $M > 0$.
• $L(y, y') = (y' - y)^2$: mean squared error.
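As a quick numerical sketch, the empirical squared error can be computed directly; the sample and the hypothesis below are hypothetical, chosen only for illustration:

```python
import numpy as np

# Hypothetical sample of m = 4 labeled points.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 0.9, 2.2, 2.8])

def h(x):
    # a fixed illustrative hypothesis h(x) = 0.95 x + 0.05
    return 0.95 * x + 0.05

# Empirical error with the squared loss L(y, y') = (y' - y)^2.
R_hat = np.mean((h(x) - y) ** 2)
print(R_hat)  # 0.02125
```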
Generalization Bound - Finite H

Theorem: let $H$ be a finite hypothesis set, and assume that $L$ is bounded by $M$. Then, for any $\delta > 0$, with probability at least $1 - \delta$,
$$\forall h \in H, \quad R(h) \le \widehat{R}(h) + M \sqrt{\frac{\log |H| + \log \frac{2}{\delta}}{2m}}.$$

Proof: by the union bound,
$$\Pr\big[\exists h \in H : |R(h) - \widehat{R}(h)| > \epsilon\big] \le \sum_{h \in H} \Pr\big[|R(h) - \widehat{R}(h)| > \epsilon\big].$$
By Hoeffding's inequality, since $L\big(h(x), f(x)\big) \in [0, M]$,
$$\Pr\big[|R(h) - \widehat{R}(h)| > \epsilon\big] \le 2 e^{-\frac{2m\epsilon^2}{M^2}}.$$
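To finish the argument: setting the union-bound estimate $2|H| e^{-2m\epsilon^2/M^2}$ equal to $\delta$ and solving for $\epsilon$ gives

$$\delta = 2|H|\, e^{-\frac{2m\epsilon^2}{M^2}} \iff \epsilon = M \sqrt{\frac{\log |H| + \log \frac{2}{\delta}}{2m}},$$

which is the deviation appearing in the statement of the theorem.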
Linear Regression

Feature mapping $\Phi : X \to \mathbb{R}^N$.

Hypothesis set: linear functions,
$$\{x \mapsto w \cdot \Phi(x) + b : w \in \mathbb{R}^N, b \in \mathbb{R}\}.$$

Optimization problem: empirical risk minimization,
$$\min_{w, b} F(w, b) = \frac{1}{m} \sum_{i=1}^m \big(w \cdot \Phi(x_i) + b - y_i\big)^2.$$

[Figure: data points $(\Phi(x), y)$ with a fitted regression line.]
Linear Regression - Solution

Rewrite the objective function as
$$F(W) = \frac{1}{m} \|X^\top W - Y\|^2,$$
with
$$W = \begin{bmatrix} w_1 \\ \vdots \\ w_N \\ b \end{bmatrix}, \quad Y = \begin{bmatrix} y_1 \\ \vdots \\ y_m \end{bmatrix}, \quad X = \begin{bmatrix} \Phi(x_1) & \cdots & \Phi(x_m) \\ 1 & \cdots & 1 \end{bmatrix}, \quad X^\top = \begin{bmatrix} \Phi(x_1)^\top & 1 \\ \vdots & \vdots \\ \Phi(x_m)^\top & 1 \end{bmatrix}.$$

Convex and differentiable function:
$$\nabla F(W) = \frac{2}{m} X(X^\top W - Y).$$
$$\nabla F(W) = 0 \iff X(X^\top W - Y) = 0 \iff X X^\top W = X Y.$$
Linear Regression - Solution

Solution:
$$W = \begin{cases} (X X^\top)^{-1} X Y & \text{if } X X^\top \text{ invertible,} \\ (X X^\top)^\dagger X Y & \text{in general.} \end{cases}$$

• Computational complexity: $O(mN^2 + N^3)$ if matrix inversion in $O(N^3)$.
• Poor guarantees, no regularization.
• For output labels in $\mathbb{R}^p$, $p > 1$, solve $p$ distinct linear regression problems.
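A minimal numerical sketch of the closed-form solution, with hypothetical data and the pseudoinverse used so that the non-invertible case is also covered:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: m = 50 points, N = 2 features, labels from a noisy linear model.
m, N = 50, 2
Phi = rng.normal(size=(N, m))             # columns are Phi(x_i)
X = np.vstack([Phi, np.ones((1, m))])     # constant feature appended for the offset b
w_true = np.array([2.0, -1.0, 0.5])       # last entry plays the role of b
Y = X.T @ w_true + 0.01 * rng.normal(size=m)

# W = (X X^T)^dagger X Y ; pinv covers the case where X X^T is not invertible.
W = np.linalg.pinv(X @ X.T) @ X @ Y
print(W)  # close to w_true
```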
Ridge Regression (Hoerl and Kennard, 1970)

Optimization problem:
$$\min_{w, b} F(w, b) = \lambda \|w\|^2 + \sum_{i=1}^m \big(w \cdot \Phi(x_i) + b - y_i\big)^2,$$
where $\lambda \ge 0$ is a (regularization) parameter.

• directly based on the generalization bound.
• generalization of linear regression.
• closed-form solution.
• can be used with kernels.
Ridge Regression - Solution

Assume $b = 0$: a constant feature is often used instead (but this is not equivalent to the use of an original offset!).

Rewrite the objective function as
$$F(W) = \lambda \|W\|^2 + \|X^\top W - Y\|^2.$$

Convex and differentiable function:
$$\nabla F(W) = 2\lambda W + 2X(X^\top W - Y).$$
$$\nabla F(W) = 0 \iff (X X^\top + \lambda I) W = X Y.$$

Solution:
$$W = \underbrace{(X X^\top + \lambda I)^{-1}}_{\text{always invertible}} X Y.$$
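The same computation with the ridge solution; the data are again hypothetical, and `np.linalg.solve` is used since $X X^\top + \lambda I$ is positive definite (hence invertible) for $\lambda > 0$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data (b = 0 case, as on the slide).
m, N = 40, 3
X = rng.normal(size=(N, m))               # columns are Phi(x_i)
w_true = np.array([1.0, 0.0, -2.0])
Y = X.T @ w_true + 0.05 * rng.normal(size=m)

lam = 0.1
# W = (X X^T + lam I)^{-1} X Y ; solve a linear system rather than invert explicitly.
W = np.linalg.solve(X @ X.T + lam * np.eye(N), X @ Y)
print(W)  # close to w_true, slightly shrunk toward 0
```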
Ridge Regression - Equivalent Formulations

Optimization problem:
$$\min_{w, b} \sum_{i=1}^m \big(w \cdot \Phi(x_i) + b - y_i\big)^2 \quad \text{subject to: } \|w\|^2 \le \Lambda^2.$$

Optimization problem:
$$\min_{w, b} \sum_{i=1}^m \xi_i^2 \quad \text{subject to: } \xi_i = y_i - (w \cdot \Phi(x_i) + b), \; \|w\|^2 \le \Lambda^2.$$
Ridge Regression Equations

Lagrangian: assume $b = 0$. For all $\xi, w, \alpha', \lambda \ge 0$,
$$\mathcal{L}(\xi, w, \alpha', \lambda) = \sum_{i=1}^m \xi_i^2 + \sum_{i=1}^m \alpha'_i \big(y_i - \xi_i - w \cdot \Phi(x_i)\big) + \lambda \big(\|w\|^2 - \Lambda^2\big).$$

KKT conditions:
$$\nabla_w \mathcal{L} = -\sum_{i=1}^m \alpha'_i \Phi(x_i) + 2\lambda w = 0 \iff w = \frac{1}{2\lambda} \sum_{i=1}^m \alpha'_i \Phi(x_i).$$
$$\nabla_{\xi_i} \mathcal{L} = 2\xi_i - \alpha'_i = 0 \iff \xi_i = \alpha'_i / 2.$$
$$\forall i \in [1, m], \; \alpha'_i \big(y_i - \xi_i - w \cdot \Phi(x_i)\big) = 0,$$
$$\lambda \big(\|w\|^2 - \Lambda^2\big) = 0.$$
Moving to The Dual

Plugging the expressions of $w$ and the $\xi_i$s into the Lagrangian gives
$$\mathcal{L} = \sum_{i=1}^m \frac{\alpha_i'^2}{4} + \sum_{i=1}^m \alpha'_i y_i - \sum_{i=1}^m \frac{\alpha_i'^2}{2} - \frac{1}{2\lambda} \sum_{i,j=1}^m \alpha'_i \alpha'_j \, \Phi(x_i)^\top \Phi(x_j) + \lambda \Big( \frac{1}{4\lambda^2} \Big\| \sum_{i=1}^m \alpha'_i \Phi(x_i) \Big\|^2 - \Lambda^2 \Big).$$

Thus,
$$\mathcal{L} = -\frac{1}{4} \sum_{i=1}^m \alpha_i'^2 + \sum_{i=1}^m \alpha'_i y_i - \frac{1}{4\lambda} \sum_{i,j=1}^m \alpha'_i \alpha'_j \, \Phi(x_i)^\top \Phi(x_j) - \lambda \Lambda^2$$
$$= \lambda \Big( -\lambda \sum_{i=1}^m \alpha_i^2 + 2 \sum_{i=1}^m \alpha_i y_i - \sum_{i,j=1}^m \alpha_i \alpha_j \, \Phi(x_i)^\top \Phi(x_j) - \Lambda^2 \Big),$$
with $\alpha'_i = 2\lambda \alpha_i$.
RR - Dual Optimization Problem

Optimization problem:
$$\max_{\alpha \in \mathbb{R}^m} -\lambda \alpha^\top \alpha + 2 \alpha^\top y - \alpha^\top (X^\top X) \alpha \quad \text{or} \quad \max_{\alpha \in \mathbb{R}^m} -\alpha^\top (X^\top X + \lambda I) \alpha + 2 \alpha^\top y.$$

Solution:
$$h(x) = \sum_{i=1}^m \alpha_i \, \Phi(x_i) \cdot \Phi(x), \quad \text{with } \alpha = (X^\top X + \lambda I)^{-1} y.$$
Direct Dual Solution

Lemma: the following matrix identity always holds:
$$(X X^\top + \lambda I)^{-1} X = X (X^\top X + \lambda I)^{-1}.$$

Proof: observe that $(X X^\top + \lambda I) X = X (X^\top X + \lambda I)$. Left-multiplying by $(X X^\top + \lambda I)^{-1}$ and right-multiplying by $(X^\top X + \lambda I)^{-1}$ yields the statement.

Dual solution: $\alpha$ such that
$$W = \sum_{i=1}^m \alpha_i \Phi(x_i) = X \alpha, \quad \text{i.e., } h = \sum_{i=1}^m \alpha_i K(x_i, \cdot).$$
By the lemma, $W = (X X^\top + \lambda I)^{-1} X Y = X (X^\top X + \lambda I)^{-1} Y$. This gives $\alpha = (X^\top X + \lambda I)^{-1} Y$.
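The lemma and the resulting primal/dual agreement are easy to check numerically on random matrices (a sanity-check sketch, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(2)
N, m, lam = 3, 8, 0.5
X = rng.normal(size=(N, m))
Y = rng.normal(size=m)
I_N, I_m = np.eye(N), np.eye(m)

# Matrix identity: (X X^T + lam I)^{-1} X == X (X^T X + lam I)^{-1}
lhs = np.linalg.solve(X @ X.T + lam * I_N, X)
rhs = X @ np.linalg.inv(X.T @ X + lam * I_m)
assert np.allclose(lhs, rhs)

# Primal and dual solutions agree: W = X alpha
W = np.linalg.solve(X @ X.T + lam * I_N, X @ Y)
alpha = np.linalg.solve(X.T @ X + lam * I_m, Y)
assert np.allclose(W, X @ alpha)
print("identity verified")
```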
Computational Complexity

          Solution                 Prediction
Primal    $O(mN^2 + N^3)$          $O(N)$
Dual      $O(\kappa m^2 + m^3)$    $O(\kappa m)$

where $\kappa$ denotes the cost of one kernel evaluation.
Kernel Ridge Regression (Saunders et al., 1998)

Optimization problem:
$$\max_{\alpha \in \mathbb{R}^m} -\lambda \alpha^\top \alpha + 2 \alpha^\top y - \alpha^\top K \alpha \quad \text{or} \quad \max_{\alpha \in \mathbb{R}^m} -\alpha^\top (K + \lambda I) \alpha + 2 \alpha^\top y.$$

Solution:
$$h(x) = \sum_{i=1}^m \alpha_i K(x_i, x), \quad \text{with } \alpha = (K + \lambda I)^{-1} y.$$
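A compact kernel ridge regression sketch; the Gaussian kernel and all data below are illustrative choices, not prescribed by the slide:

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # A: (m, d), B: (n, d) -> (m, n) Gram matrix
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def krr_fit(X, y, lam=0.1, sigma=1.0):
    # alpha = (K + lam I)^{-1} y
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def krr_predict(X_train, alpha, X_test, sigma=1.0):
    # h(x) = sum_i alpha_i K(x_i, x)
    return gaussian_kernel(X_test, X_train, sigma) @ alpha

rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=60)

alpha = krr_fit(X, y, lam=0.1, sigma=1.0)
pred = krr_predict(X, alpha, X)
print(np.mean((pred - y) ** 2))  # small training error
```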
Notes

Advantages:
• strong theoretical guarantees.
• generalization to outputs in $\mathbb{R}^p$: single matrix inversion (Cortes et al., 2007).
• use of kernels.

Disadvantages:
• solution not sparse.
• training time for large matrices: use low-rank approximations of the kernel matrix, e.g., Nyström approximation or partial Cholesky decomposition.
Support Vector Regression (Vapnik, 1995)

Hypothesis set:
$$\{x \mapsto w \cdot \Phi(x) + b : w \in \mathbb{R}^N, b \in \mathbb{R}\}.$$

Loss function: $\epsilon$-insensitive loss,
$$L(y, y') = |y' - y|_\epsilon = \max(0, |y' - y| - \epsilon).$$

Fit a 'tube' of width $\epsilon$ to the data.

[Figure: data points $(\Phi(x), y)$ with the regression line $w \cdot \Phi(x) + b$ and an $\epsilon$-tube around it.]
Support Vector Regression (SVR)

Optimization problem: similar to that of SVM,
$$\min_{w, b} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^m \big| y_i - (w \cdot \Phi(x_i) + b) \big|_\epsilon.$$

Equivalent formulation:
$$\min_{w, b, \xi, \xi'} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^m (\xi_i + \xi'_i)$$
$$\text{subject to: } (w \cdot \Phi(x_i) + b) - y_i \le \epsilon + \xi_i,$$
$$y_i - (w \cdot \Phi(x_i) + b) \le \epsilon + \xi'_i,$$
$$\xi_i \ge 0, \; \xi'_i \ge 0.$$
SVR - Dual Optimization Problem

Optimization problem:
$$\max_{\alpha, \alpha'} \; -\epsilon (\alpha' + \alpha)^\top \mathbf{1} + (\alpha' - \alpha)^\top y - \frac{1}{2} (\alpha' - \alpha)^\top K (\alpha' - \alpha)$$
$$\text{subject to: } (0 \le \alpha \le C) \wedge (0 \le \alpha' \le C) \wedge \big((\alpha' - \alpha)^\top \mathbf{1} = 0\big).$$

Solution:
$$h(x) = \sum_{i=1}^m (\alpha'_i - \alpha_i) K(x_i, x) + b,$$
with
$$b = \begin{cases} -\sum_{j=1}^m (\alpha'_j - \alpha_j) K(x_j, x_i) + y_i + \epsilon & \text{when } 0 < \alpha_i < C, \\ -\sum_{j=1}^m (\alpha'_j - \alpha_j) K(x_j, x_i) + y_i - \epsilon & \text{when } 0 < \alpha'_i < C. \end{cases}$$

Support vectors: points strictly outside the tube.
Notes

Advantages:
• strong theoretical guarantees.
• sparser solution.
• use of kernels.

Disadvantages:
• selection of two parameters: $C$ and $\epsilon$. Heuristics: search $C$ near the maximum value of $y$, $\epsilon$ near the average difference of the $y$s, and use a measure of the number of support vectors.
• large matrices: low-rank approximations of the kernel matrix.
Alternative Loss Functions

Quadratic $\epsilon$-insensitive loss: $x \mapsto \max(0, |x| - \epsilon)^2$.

Huber loss: $x \mapsto \begin{cases} x^2 & \text{if } |x| \le c, \\ 2c|x| - c^2 & \text{otherwise.} \end{cases}$

$\epsilon$-insensitive loss: $x \mapsto \max(0, |x| - \epsilon)$.
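The three losses above, implemented directly ($\epsilon$ and $c$ values below are arbitrary illustrations):

```python
import numpy as np

def eps_insensitive(x, eps=0.5):
    # x -> max(0, |x| - eps)
    return np.maximum(0.0, np.abs(x) - eps)

def quad_eps_insensitive(x, eps=0.5):
    # x -> max(0, |x| - eps)^2
    return np.maximum(0.0, np.abs(x) - eps) ** 2

def huber(x, c=1.0):
    # x -> x^2 if |x| <= c, else 2c|x| - c^2
    return np.where(np.abs(x) <= c, x ** 2, 2 * c * np.abs(x) - c ** 2)

x = np.array([-2.0, -0.3, 0.0, 0.4, 1.5])
print(eps_insensitive(x))       # [1.5  0.   0.   0.   1. ]
print(quad_eps_insensitive(x))  # [2.25 0.   0.   0.   1. ]
print(huber(x))                 # [3.   0.09 0.   0.16 2. ]
```

Note that all three are flat (or flatter) near zero, which is what tolerates small residuals, and grow at most quadratically, which limits the influence of outliers for the Huber loss.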
SVR - Quadratic Loss

Optimization problem:
$$\max_{\alpha, \alpha'} \; -\epsilon (\alpha' + \alpha)^\top \mathbf{1} + (\alpha' - \alpha)^\top y - \frac{1}{2} (\alpha' - \alpha)^\top \Big( K + \frac{1}{C} I \Big) (\alpha' - \alpha)$$
$$\text{subject to: } (\alpha \ge 0) \wedge (\alpha' \ge 0) \wedge \big((\alpha' - \alpha)^\top \mathbf{1} = 0\big).$$

Solution:
$$h(x) = \sum_{i=1}^m (\alpha'_i - \alpha_i) K(x_i, x) + b,$$
with
$$b = \begin{cases} -\sum_{j=1}^m (\alpha'_j - \alpha_j) K(x_j, x_i) + y_i + \epsilon & \text{when } \alpha_i > 0 \wedge \xi_i = 0, \\ -\sum_{j=1}^m (\alpha'_j - \alpha_j) K(x_j, x_i) + y_i - \epsilon & \text{when } \alpha'_i > 0 \wedge \xi'_i = 0. \end{cases}$$

Support vectors: points strictly outside the tube. For $\epsilon = 0$, coincides with KRR.
On-line Regression

On-line version of batch algorithms:
• stochastic gradient descent.
• primal or dual.

Examples:
• mean squared error function: Widrow-Hoff (or LMS) algorithm (Widrow and Hoff, 1988).
• SVR $\epsilon$-insensitive (dual) linear or quadratic function: on-line SVR.
Widrow-Hoff (Widrow and Hoff, 1988)

WidrowHoff(w0)
 1  w1 ← w0                          ▷ typically w0 = 0
 2  for t ← 1 to T do
 3      Receive(xt)
 4      ŷt ← wt · xt
 5      Receive(yt)
 6      wt+1 ← wt + 2η(yt − wt · xt)xt   ▷ η > 0
 7  return wT+1
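A direct Python transcription of the pseudocode; the data stream below is synthetic and noiseless, chosen only so that convergence is easy to see:

```python
import numpy as np

def widrow_hoff(stream, w0, eta=0.01):
    # stream: iterable of (x_t, y_t) pairs; SGD on the squared loss (w.x - y)^2
    w = np.array(w0, dtype=float)
    for x, y in stream:
        y_hat = w @ x                        # predict
        w = w + 2 * eta * (y - y_hat) * x    # Widrow-Hoff (LMS) update
    return w

rng = np.random.default_rng(4)
w_true = np.array([1.0, -0.5])
stream = []
for _ in range(2000):
    x = rng.normal(size=2)
    stream.append((x, w_true @ x))           # noiseless labels

w = widrow_hoff(stream, w0=np.zeros(2), eta=0.01)
print(w)  # close to w_true
```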
Dual On-Line SVR (b = 0) (Vijayakumar and Wu, 1999)

DualSVR()
 1  α ← 0
 2  α′ ← 0
 3  for t ← 1 to T do
 4      Receive(xt)
 5      ŷt ← Σ_{s=1}^{T} (α′s − αs) K(xs, xt)
 6      Receive(yt)
 7      α′t+1 ← α′t + min(max(η(yt − ŷt − ε), −α′t), C − α′t)
 8      αt+1 ← αt + min(max(η(ŷt − yt − ε), −αt), C − αt)
 9  return Σ_{t=1}^{T} (α′t − αt) K(xt, ·)
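A literal transcription of DualSVR under the slide's $b = 0$ assumption; the Gaussian kernel and all parameter values are illustrative choices:

```python
import numpy as np

def gaussian_k(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

def dual_svr_online(xs, ys, eta=0.1, eps=0.1, C=10.0, sigma=1.0):
    # Follows the DualSVR pseudocode: the clipped updates keep each
    # coefficient inside the box [0, C].
    T = len(xs)
    alpha = np.zeros(T)
    alpha_p = np.zeros(T)
    for t in range(T):
        y_hat = sum((alpha_p[s] - alpha[s]) * gaussian_k(xs[s], xs[t], sigma)
                    for s in range(T))
        alpha_p[t] += min(max(eta * (ys[t] - y_hat - eps), -alpha_p[t]),
                          C - alpha_p[t])
        alpha[t] += min(max(eta * (y_hat - ys[t] - eps), -alpha[t]),
                        C - alpha[t])
    return alpha, alpha_p

xs = np.linspace(-1.0, 1.0, 20)
ys = xs ** 2
alpha, alpha_p = dual_svr_online(xs, ys, eta=0.1, eps=0.05, C=10.0, sigma=0.5)
print(alpha.min(), alpha_p.max())  # all coefficients stay in [0, C]
```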
LASSO (Tibshirani, 1996)

Optimization problem: 'least absolute shrinkage and selection operator',
$$\min_{w, b} F(w, b) = \lambda \|w\|_1 + \sum_{i=1}^m (w \cdot x_i + b - y_i)^2,$$
where $\lambda \ge 0$ is a (regularization) parameter.

Solution: convex quadratic program (QP).
• general: standard QP solvers.
• specific algorithm: LARS (least angle regression procedure), which yields the entire path of solutions.
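The LASSO objective has no closed form, but a simple proximal-gradient (ISTA) solver, not the LARS procedure the slide mentions, suffices to illustrate the sparsity it induces; the data below are hypothetical:

```python
import numpy as np

def soft_threshold(v, tau):
    # proximal operator of tau * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_ista(X, y, lam=1.0, n_iter=5000):
    # minimize lam * ||w||_1 + ||X w - y||^2  by proximal gradient (ISTA)
    m, d = X.shape
    L = 2 * np.linalg.norm(X, 2) ** 2        # Lipschitz constant of the smooth part
    w = np.zeros(d)
    for _ in range(n_iter):
        grad = 2 * X.T @ (X @ w - y)
        w = soft_threshold(w - grad / L, lam / L)
    return w

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 10))
w_true = np.zeros(10)
w_true[0], w_true[3] = 3.0, -2.0             # only two informative features
y = X @ w_true + 0.1 * rng.normal(size=100)

w = lasso_ista(X, y, lam=20.0)
print(np.nonzero(np.abs(w) > 1e-6)[0])  # typically just the informative features
```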
Sparsity of L1 Regularization

[Figure: level sets of the empirical error against the constraint region, L1 regularization (left) vs. L2 regularization (right); the corners of the L1 ball favor sparse solutions.]
Notes

Advantages:
• theoretical guarantees.
• sparse solution.
• feature selection.

Drawbacks:
• no natural use of kernels.
• no closed-form solution (not necessary, but convenient for theoretical analysis).
Regression

Many other families of algorithms, including:
• neural networks.
• decision trees.
• boosting trees for regression.
Mehryar Mohri - Foundations of Machine Learning

References

• Corinna Cortes, Mehryar Mohri, and Jason Weston. A General Regression Framework for Learning String-to-String Mappings. In Predicting Structured Data. The MIT Press, 2007.
• Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least Angle Regression. Annals of Statistics, 32(2):407-499, 2004.
• Arthur Hoerl and Robert Kennard. Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics, 12:55-67, 1970.
• David Pollard. Convergence of Stochastic Processes. Springer, New York, 1984.
• David Pollard. Empirical Processes: Theory and Applications. Institute of Mathematical Statistics, 1990.
• Craig Saunders, Alexander Gammerman, and Volodya Vovk. Ridge Regression Learning Algorithm in Dual Variables. In ICML '98, pages 515-521, 1998.
• Robert Tibshirani. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58:267-288, 1996.
• Sethu Vijayakumar and Si Wu. Sequential Support Vector Classifiers and Regression. In Proceedings of the International Conference on Soft Computing (SOCO '99), 1999.
• Vladimir N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer, Berlin, 1982.
• Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.
• Vladimir N. Vapnik. Statistical Learning Theory. Wiley-Interscience, New York, 1998.
• Bernard Widrow and Ted Hoff. Adaptive Switching Circuits. In Neurocomputing: Foundations of Research, pages 123-134. MIT Press, 1988.