Linear regression
CS 446
1. Overview
todo

- check some continuity bugs
- make sure nothing is missing from old lectures (both mine and daniel's)
- fix some of those bugs, like b replacing y
- delete the excess material from the end
- add a proper summary slide which boils down concepts and reduces student worry
Lecture 1: supervised learning

Training data: labeled examples

(x1, y1), (x2, y2), . . . , (xn, yn)

where

- each input xi is a machine-readable description of an instance (e.g., image, sentence), and
- each corresponding label yi is an annotation relevant to the task, typically not easy to obtain automatically.

Goal: learn a function f from labeled examples that accurately "predicts" the labels of new (previously unseen) inputs.

[diagram: past labeled examples → learning algorithm → learned predictor; new (unlabeled) example → predicted label]
Lecture 2: nearest neighbors and decision trees

[figure: scatter plot of labeled points in the (x1, x2) plane]

Nearest neighbors.
Training/fitting: memorize the data.
Testing/predicting: find the k closest memorized points, return the plurality label.
Overfitting? Vary k.

Decision trees.
Training/fitting: greedily partition space, reducing "uncertainty".
Testing/predicting: traverse the tree, output the leaf label.
Overfitting? Limit or prune the tree.
Lectures 3-4: linear regression

[figure: scatter plot of eruption duration vs. delay until next eruption]

Linear regression / least squares.

Our first (of many!) linear prediction methods.

Today:

- Example.
- How to solve it; ERM, and SVD.
- Features.

Next lecture: advanced topics, including overfitting.
2. Example: Old Faithful
Prediction problem: Old Faithful geyser (Yellowstone)
Task: Predict time of next eruption.
Time between eruptions

Historical records of eruptions:

[timeline figure: eruption start/end times a0, b0, a1, b1, a2, b2, a3, b3, . . . with waiting times Y1, Y2, Y3]

Time until next eruption: Yi := ai − bi−1.

Prediction task:
At a later time t (when an eruption ends), predict the time of the next eruption, t + Y.

On "Old Faithful" data:

- Using 136 past observations, we form the mean estimate µ = 70.7941.
- Can we do better?
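The mean estimate above is itself a least squares predictor: among all constants, the sample mean minimizes the average squared error. A minimal numpy sketch with made-up waiting times (the real dataset has 136 observations; the numbers here are synthetic):

```python
import numpy as np

# Hypothetical waiting times (minutes); synthetic stand-in for the 136 records.
rng = np.random.default_rng(0)
y = rng.normal(70.8, 10.0, size=136)

# The constant c minimizing (1/n) * sum((y_i - c)^2) is the sample mean:
# check by comparing against nearby constants.
mu = y.mean()
risk = lambda c: np.mean((y - c) ** 2)
assert risk(mu) <= risk(mu + 1.0) and risk(mu) <= risk(mu - 1.0)
print(f"mean estimate: {mu:.4f}")
```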
Looking at the data

Naturalist Harry Woodward observed that the time until the next eruption seems to be related to the duration of the last eruption.

[figure: scatter plot of duration of last eruption vs. time until next eruption]
Using side-information

At prediction time t, the duration of the last eruption is available as side-information.

[timeline figure: past eruptions with labeled pairs (Xn, Yn), . . . , current time t with observed duration X]

IID model for supervised learning:
(X1, Y1), . . . , (Xn, Yn), (X, Y) are iid random pairs (i.e., labeled examples).
X takes values in X (e.g., X = R), Y takes values in R.

1. We observe (X1, Y1), . . . , (Xn, Yn), and then choose a prediction function (a.k.a. predictor)

f : X → R.

This is called "learning" or "training".

2. At prediction time, observe X, and form the prediction f(X).

How should we choose f based on data? Recall:

- The model is our choice.
- We must contend with overfitting, bad fitting algorithms, . . .
3. Least squares and linear regression
Which line?

[figure: scatter plot of duration vs. delay]

Let's predict with a linear regressor:

ŷ := wᵀ(x, 1),

where w ∈ R² is learned from data.

Remark: appending 1 makes this an affine function x ↦ w1 x + w2. (More on this later. . . )

If the data lies along a line, we should output that line. But what if not?
ERM setup for least squares.

- Predictors/model: f(x) = wᵀx; a linear predictor/regressor.
  (For linear classification: x ↦ sgn(wᵀx).)
- Loss/penalty: the least squares loss

  ℓls(ŷ, y) = (ŷ − y)².

  (Some conventions scale this by 1/2.)
- Goal: minimize the least squares empirical risk

  R̂ls(f) = (1/n) ∑_{i=1}^n ℓls(yi, f(xi)) = (1/n) ∑_{i=1}^n (yi − f(xi))².

- Specifically, we choose w ∈ Rᵈ according to

  arg min_{w ∈ Rᵈ} R̂ls(x ↦ wᵀx) = arg min_{w ∈ Rᵈ} (1/n) ∑_{i=1}^n (yi − wᵀxi)².

- More generally, this is the ERM approach: pick a model and minimize empirical risk over the model parameters.
ERM in general

- Pick a family of models/predictors F.
  (For today, we use linear predictors.)
- Pick a loss function ℓ.
  (For today, we chose the squared loss.)
- Minimize the empirical risk over the model parameters.

We haven't discussed: true risk and overfitting; how to minimize; why this is a good idea.

Remark: ERM is convenient in pytorch: just pick a model, a loss, and an optimizer, and tell it to minimize.
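The same three-step recipe can be sketched without pytorch. Below is a plain-numpy version of ERM for least squares, minimizing the empirical risk by gradient descent on synthetic data (all names and constants are illustrative); pytorch's model/loss/optimizer objects automate exactly these steps.

```python
import numpy as np

# Synthetic data from a known linear model (illustrative only).
rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=n)

# ERM recipe: model = linear, loss = squared, optimizer = gradient descent.
w = np.zeros(d)
lr = 0.1
for _ in range(500):
    grad = 2.0 / n * X.T @ (X @ w - y)   # gradient of the empirical risk
    w -= lr * grad

assert np.allclose(w, w_true, atol=0.05)  # recovered up to noise
```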
Least squares ERM in pictures

Red dots: data points.

Affine hyperplane: our predictions (via the affine expansion (x1, x2) ↦ (1, x1, x2)).

ERM: minimize the sum of squared vertical lengths from the hyperplane to the points.
Empirical risk minimization in matrix notation

Define the n × d matrix A and the n × 1 column vector b by

A := (1/√n) (x1ᵀ; . . . ; xnᵀ),  b := (1/√n) (y1, . . . , yn),

i.e., the rows of A are the xiᵀ/√n, and the entries of b are the yi/√n.

We can write the empirical risk as

R̂(w) = (1/n) ∑_{i=1}^n (yi − xiᵀw)² = ‖Aw − b‖₂².

Necessary condition for w to be a minimizer of R̂:

∇R̂(w) = 0, i.e., w is a critical point of R̂.

This translates to

(AᵀA)w = Aᵀb,

a system of linear equations called the normal equations.

In an upcoming lecture we'll prove every critical point of R̂ is a minimizer of R̂.
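A quick numerical check of the identity R̂(w) = ‖Aw − b‖₂², on random synthetic data (the identity holds for any w, not just the minimizer):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 4
xs = rng.normal(size=(n, d))
ys = rng.normal(size=n)

A = xs / np.sqrt(n)          # rows are x_i^T / sqrt(n)
b = ys / np.sqrt(n)          # entries are y_i / sqrt(n)

w = rng.normal(size=d)       # an arbitrary w: the identity holds pointwise
risk = np.mean((ys - xs @ w) ** 2)
assert np.isclose(risk, np.sum((A @ w - b) ** 2))
```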
Summary on ERM and linear regression

Procedure:

- Form matrix A and vector b with the data (resp. xi, yi) as rows.
  (The scaling factor 1/√n is not standard; it doesn't change the solution.)
- Find w satisfying the normal equations AᵀAw = Aᵀb.
  (E.g., via Gaussian elimination, taking time O(nd²).)
- In general, solutions are not unique. (Why not?)
- If AᵀA is invertible, we can choose the (unique) (AᵀA)⁻¹Aᵀb.
- Recall our original conundrum: we want to fit some line. We chose least squares; it gives one (family of) choice(s). Next lecture, with logistic regression, we get another.
- Note: if Aw = b for some w, then the data lies along a line, and we might as well not worry about picking a loss function.
- Note: Aw − b = 0 may not have solutions, but the least squares setting means we instead work with Aᵀ(Aw − b) = 0, which does have solutions. . .
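A sketch of the procedure in numpy, on synthetic data where AᵀA happens to be invertible: solve the normal equations directly and compare with numpy's least squares solver, which minimizes ‖Aw − b‖² without forming AᵀA.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 3
A = rng.normal(size=(n, d)) / np.sqrt(n)
b = rng.normal(size=n) / np.sqrt(n)

# A^T A is invertible here (generic Gaussian A with n > d), so the normal
# equations have the unique solution (A^T A)^{-1} A^T b.
w = np.linalg.solve(A.T @ A, A.T @ b)

# np.linalg.lstsq minimizes ||Aw - b||^2 directly and should agree.
w_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(w, w_lstsq)
```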
4. SVD and least squares
SVD

Recall the Singular Value Decomposition (SVD) M = USVᵀ ∈ R^{m×n}, where

- U ∈ R^{m×r} is orthonormal, S = diag(s1, . . . , sr) ∈ R^{r×r} with s1 ≥ s2 ≥ · · · ≥ sr > 0, and V ∈ R^{n×r} is orthonormal, with r := rank(M). (If r = 0, use the convention S = 0 ∈ R^{1×1}.)
- This convention is sometimes called the thin SVD.
- Another notation is to write M = ∑_{i=1}^r si ui viᵀ. This avoids the issue with 0 (an empty sum is 0). Moreover, this notation makes it clear that (ui)_{i=1}^r span the column space and (vi)_{i=1}^r span the row space of M.
- The full SVD will not be used in this class; it fills out U and V to be full rank and orthonormal, and pads S with zeros. It agrees with the eigendecompositions of MᵀM and MMᵀ.
- Note: numpy and pytorch have SVD (their interfaces differ slightly). Determining r runs into numerical issues.
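A short numpy illustration of the thin SVD and the numerical-rank issue just mentioned (the tolerance 1e-10 below is an arbitrary choice for this example, not a universal rule):

```python
import numpy as np

rng = np.random.default_rng(3)
# A rank-2 matrix in R^{5x4}: product of 5x2 and 2x4 Gaussian factors.
M = rng.normal(size=(5, 2)) @ rng.normal(size=(2, 4))

U, s, Vt = np.linalg.svd(M, full_matrices=False)   # "thin" SVD
r = int(np.sum(s > 1e-10 * s[0]))                  # numerical rank
assert r == 2

# Keeping only the top-r singular triples reconstructs M (up to rounding).
M_r = (U[:, :r] * s[:r]) @ Vt[:r]
assert np.allclose(M, M_r)
```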
Pseudoinverse

Let the SVD M = ∑_{i=1}^r si ui viᵀ be given.

- Define the pseudoinverse M⁺ = ∑_{i=1}^r (1/si) vi uiᵀ.
  (If 0 = M ∈ R^{m×n}, then 0 = M⁺ ∈ R^{n×m}.)
- Alternatively, define the pseudoinverse S⁺ of a diagonal matrix to be Sᵀ but with reciprocals of the non-zero elements; then M⁺ = VS⁺Uᵀ.
- Also called the Moore-Penrose pseudoinverse; it is unique, even though the SVD is not unique (why not?).
- If M⁻¹ exists, then M⁻¹ = M⁺ (why?).
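The definition can be checked against numpy's built-in pseudoinverse; a sketch on a synthetic rank-deficient matrix (the rcond threshold is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(4)
M = rng.normal(size=(5, 2)) @ rng.normal(size=(2, 4))   # rank 2

U, s, Vt = np.linalg.svd(M, full_matrices=False)
r = int(np.sum(s > 1e-10 * s[0]))

# M^+ = sum_{i<=r} (1/s_i) v_i u_i^T, dropping the zero singular values.
M_pinv = (Vt[:r].T / s[:r]) @ U[:, :r].T
assert np.allclose(M_pinv, np.linalg.pinv(M, rcond=1e-10))
```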
SVD and least squares

Recall: we'd like to find w such that

AᵀAw = Aᵀb.

If w = A⁺b, then

AᵀAw = (∑_{i=1}^r si vi uiᵀ) (∑_{i=1}^r si ui viᵀ) (∑_{i=1}^r (1/si) vi uiᵀ) b = (∑_{i=1}^r si vi uiᵀ) (∑_{i=1}^r ui uiᵀ) b = Aᵀb.

Henceforth, define wols = A⁺b as the OLS solution.
(OLS = "ordinary least squares".)

Note: in general, AA⁺ = ∑_{i=1}^r ui uiᵀ ≠ I.
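A numerical sanity check of both claims, on a synthetic rank-deficient A: wols = A⁺b satisfies the normal equations even though AᵀA is singular, and AA⁺ is a projection rather than the identity.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 30, 5
A = rng.normal(size=(n, 3)) @ rng.normal(size=(3, d))   # rank 3 < d
b = rng.normal(size=n)

w_ols = np.linalg.pinv(A, rcond=1e-10) @ b              # w_ols = A^+ b

# The normal equations hold despite A^T A being singular here.
assert np.allclose(A.T @ A @ w_ols, A.T @ b)

# A A^+ is an orthogonal projection onto range(A), not the identity.
P = A @ np.linalg.pinv(A, rcond=1e-10)
assert np.allclose(P @ P, P) and not np.allclose(P, np.eye(n))
```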
5. Summary of linear regression so far

Main points

- Model/function/predictor class of linear regressors x ↦ wᵀx.
- ERM principle: we choose a loss (least squares) and find a good predictor by minimizing the empirical risk.
- ERM solution for least squares: pick w satisfying AᵀAw = Aᵀb, which is not unique; one unique choice is the ordinary least squares solution A⁺b.
Part 2 of linear regression lecture. . .

Recap on SVD. (A messy slide, I'm sorry.)

Suppose 0 ≠ M ∈ R^{n×d}, thus r := rank(M) > 0.

- "Decomposition form" thin SVD: M = ∑_{i=1}^r si ui viᵀ with s1 ≥ · · · ≥ sr > 0, and M⁺ = ∑_{i=1}^r (1/si) vi uiᵀ; in general M⁺M = ∑_{i=1}^r vi viᵀ ≠ I.
- "Factorization form" thin SVD: M = USVᵀ, with U ∈ R^{n×r} and V ∈ R^{d×r} having orthonormal columns (so UᵀU = VᵀV = I_r, though UUᵀ and VVᵀ are not identity matrices in general), and S = diag(s1, . . . , sr) ∈ R^{r×r} with s1 ≥ · · · ≥ sr > 0; pseudoinverse M⁺ = VS⁻¹Uᵀ, and in general M⁺M ≠ MM⁺ ≠ I.
- Full SVD: M = Uf Sf Vfᵀ, with Uf ∈ R^{n×n} and Vf ∈ R^{d×d} orthonormal and full rank, so Ufᵀ Uf and Vfᵀ Vf are identity matrices, and Sf ∈ R^{n×d} is zero everywhere except the first r diagonal entries, which are s1 ≥ · · · ≥ sr > 0; pseudoinverse M⁺ = Vf Sf⁺ Ufᵀ, where Sf⁺ is obtained by transposing Sf and then taking reciprocals of the nonzero entries, and in general M⁺M ≠ MM⁺ ≠ I. Additional property: agreement with the eigendecompositions of MMᵀ and MᵀM.

The "full SVD" adds columns to U and V which hit zeros of S and therefore don't matter (as a sanity check, verify for yourself that all these SVDs are equal).
Recap on SVD, zero matrix case

Suppose 0 = M ∈ R^{n×d}, thus r := rank(M) = 0.

- In all types of SVD, M⁺ is Mᵀ (another zero matrix).
- Technically speaking, s is a singular value of M iff there exist nonzero vectors (u, v) with Mv = su and Mᵀu = sv; the zero matrix therefore has no singular values (or left/right singular vectors).
- "Factorization form thin SVD" becomes a little messy.
6. More on the normal equations
Recall our matrix notation

Let labeled examples ((xi, yi))_{i=1}^n be given.

Define the n × d matrix A and the n × 1 column vector b by

A := (1/√n) (x1ᵀ; . . . ; xnᵀ),  b := (1/√n) (y1, . . . , yn).

We can write the empirical risk as

R̂(w) = (1/n) ∑_{i=1}^n (yi − xiᵀw)² = ‖Aw − b‖₂².

Necessary condition for w to be a minimizer of R̂:

∇R̂(w) = 0, i.e., w is a critical point of R̂.

This translates to

(AᵀA)w = Aᵀb,

a system of linear equations called the normal equations.

We'll now finally show that the normal equations imply optimality.
Normal equations imply optimality

Consider w with AᵀAw = Aᵀb, and any w′; then

‖Aw′ − b‖² = ‖Aw′ − Aw + Aw − b‖² = ‖Aw′ − Aw‖² + 2(Aw′ − Aw)ᵀ(Aw − b) + ‖Aw − b‖².

Since

(Aw′ − Aw)ᵀ(Aw − b) = (w′ − w)ᵀ(AᵀAw − Aᵀb) = 0,

then ‖Aw′ − b‖² = ‖Aw′ − Aw‖² + ‖Aw − b‖² ≥ ‖Aw − b‖². This means w is optimal.

Moreover, writing A = ∑_{i=1}^r si ui viᵀ,

‖Aw′ − Aw‖² = (w′ − w)ᵀ(AᵀA)(w′ − w) = (w′ − w)ᵀ (∑_{i=1}^r si² vi viᵀ) (w′ − w),

so w′ is also optimal iff w′ − w is in the (right) nullspace of A.

(We'll revisit all this with convexity later.)
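The nullspace characterization can be seen numerically: on a synthetic rank-deficient A, shifting an optimal w by a nullspace direction leaves the risk unchanged.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 40, 4
A = rng.normal(size=(n, 2)) @ rng.normal(size=(2, d))   # rank 2 < d
b = rng.normal(size=n)

w = np.linalg.pinv(A, rcond=1e-10) @ b   # one optimal solution

# A right singular vector with zero singular value lies in the nullspace.
_, _, Vt = np.linalg.svd(A)
z = Vt[-1]
assert np.allclose(A @ z, 0, atol=1e-8)

# Shifting along the nullspace does not change the squared error.
w2 = w + 3.0 * z
assert np.isclose(np.sum((A @ w - b) ** 2), np.sum((A @ w2 - b) ** 2))
```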
Geometric interpretation of least squares ERM

Let aj ∈ Rⁿ be the j-th column of the matrix A ∈ R^{n×d}, so A = (a1 · · · ad).

Minimizing ‖Aw − b‖₂² is the same as finding the vector b̂ ∈ range(A) closest to b.

The solution b̂ is the orthogonal projection of b onto range(A) = {Aw : w ∈ Rᵈ}.

[figure: b and its orthogonal projection b̂ onto the plane spanned by a1 and a2]

- b̂ is uniquely determined; indeed, b̂ = AA⁺b = ∑_{i=1}^r ui uiᵀ b.
- If r = rank(A) < d, then there is more than one way to write b̂ as a linear combination of a1, . . . , ad; in that case the ERM solution is not unique.

To get w from b̂: solve the system of linear equations Aw = b̂.
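A sketch of this picture in numpy: the residual b − b̂ is orthogonal to every column of A, and w is recovered by solving Aw = b̂ (here via least squares).

```python
import numpy as np

rng = np.random.default_rng(7)
n, d = 20, 3
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

b_hat = A @ np.linalg.pinv(A) @ b   # orthogonal projection onto range(A)

# The residual is orthogonal to every column of A.
assert np.allclose(A.T @ (b - b_hat), 0, atol=1e-8)

# Recover a weight vector from b_hat by solving Aw = b_hat.
w, *_ = np.linalg.lstsq(A, b_hat, rcond=None)
assert np.allclose(A @ w, b_hat)
```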
7. Features
Enhancing linear regression models with features

Linear functions alone are restrictive, but become powerful with creative side-information, or features.

Idea: Predict with x ↦ wᵀφ(x), where φ is a feature mapping.

Examples:

1. Non-linear transformations of existing variables: for x ∈ R,

φ(x) = ln(1 + x).

2. Logical formula of binary variables: for x = (x1, . . . , xd) ∈ {0, 1}ᵈ,

φ(x) = (x1 ∧ x5 ∧ ¬x10) ∨ (¬x2 ∧ x7).

3. Trigonometric expansion: for x ∈ R,

φ(x) = (1, sin(x), cos(x), sin(2x), cos(2x), . . . ).

4. Polynomial expansion: for x = (x1, . . . , xd) ∈ Rᵈ,

φ(x) = (1, x1, . . . , xd, x1², . . . , xd², x1x2, . . . , x1xd, . . . , xd−1xd).
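For instance, with a trigonometric expansion, a target that is non-linear in x becomes exactly linear in φ(x); a synthetic sketch (the target coefficients below are made up):

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.uniform(0, 2 * np.pi, size=200)
y = 2.0 + 0.5 * np.sin(x) - 1.5 * np.cos(x)   # non-linear in x

# Least squares over phi(x) = (1, sin x, cos x) recovers the coefficients.
Phi = np.column_stack([np.ones_like(x), np.sin(x), np.cos(x)])
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
assert np.allclose(w, [2.0, 0.5, -1.5])
```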
Example: Taking advantage of linearity

Suppose you are trying to predict some health outcome.

- A physician suggests that body temperature is relevant, specifically the (squared) deviation from normal body temperature:

φ(x) = (xtemp − 98.6)².

- What if you didn't know about this magic constant 98.6?
- Instead, use

φ(x) = (1, xtemp, xtemp²).

We can learn coefficients w such that

wᵀφ(x) = (xtemp − 98.6)²,

or any other quadratic polynomial in xtemp (which may be better!).
Quadratic expansion

A quadratic function f : R → R,

f(x) = ax² + bx + c, x ∈ R,

for a, b, c ∈ R, can be written as a linear function of φ(x), where

φ(x) := (1, x, x²),

since f(x) = wᵀφ(x) where w = (c, b, a).

For a multivariate quadratic function f : Rᵈ → R, use

φ(x) := (1, x1, . . . , xd [linear terms], x1², . . . , xd² [squared terms], x1x2, . . . , x1xd, . . . , xd−1xd [cross terms]).
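A quick check that least squares over φ(x) = (1, x, x²) recovers a known quadratic, in the spirit of the body-temperature example (synthetic, noiseless data; the quadratic here is made up):

```python
import numpy as np

# Data from a known quadratic f(x) = (x - 0.5)^2 = 1*x^2 - 1*x + 0.25.
x = np.linspace(-2.0, 3.0, 50)
y = (x - 0.5) ** 2

# Least squares over the expansion phi(x) = (1, x, x^2).
Phi = np.column_stack([np.ones_like(x), x, x ** 2])
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# w = (c, b, a) for f(x) = a x^2 + b x + c.
assert np.allclose(w, [0.25, -1.0, 1.0])
```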
Affine expansion and "Old Faithful"

Woodward needed an affine expansion for the "Old Faithful" data:

φ(x) := (1, x).

[figure: scatter plot of duration of last eruption vs. time until next eruption, with the fitted affine function overlaid]

An affine function fa,b : R → R for a, b ∈ R,

fa,b(x) = a + bx,

is a linear function fw of φ(x) for w = (a, b).

(This easily generalizes to multivariate affine functions.)
Final remarks on features

- "Feature engineering" can drastically change the power of a model.
- Some people consider it messy, unprincipled, pure "trial-and-error".
- Deep learning is somewhat touted as removing the need for some of this, but it doesn't do so completely (e.g., it took a lot of work to come up with the "convolutional neural network"; side question: who came up with that?).
8. Statistical view of least squares; maximum likelihood
Maximum likelihood estimation (MLE) refresher

Parametric statistical model:
P = {Pθ : θ ∈ Θ}, a collection of probability distributions for the observed data.

- Θ: parameter space.
- θ ∈ Θ: a particular parameter (or parameter vector).
- Pθ: a particular probability distribution for the observed data.

Likelihood of θ ∈ Θ given observed data x:
For discrete X ∼ Pθ with probability mass function pθ,

L(θ) := pθ(x).

For continuous X ∼ Pθ with probability density function fθ,

L(θ) := fθ(x).

Maximum likelihood estimator (MLE):
Let θ̂ be the θ ∈ Θ of highest likelihood given the observed data.
Distributions over labeled examples

X : space of possible side-information (feature space).
Y : space of possible outcomes (label space or output space).

The distribution P of a random pair (X, Y ) taking values in X × Y can be thought of in two parts:

1. Marginal distribution PX of X: PX is a probability distribution on X .
2. Conditional distribution PY|X=x of Y given X = x, for each x ∈ X : PY|X=x is a probability distribution on Y.

This lecture: Y = R (regression problems).

30 / 94
Optimal predictor

What function f : X → R has smallest (squared loss) risk

R(f) := E[(f(X) − Y)²]?

Note: earlier we discussed the empirical risk; this is the risk under the true distribution.

I Conditional on X = x, the minimizer of the conditional risk y ↦ E[(y − Y)² | X = x] is the conditional mean E[Y | X = x].
I Therefore, the function f⋆ : X → R with f⋆(x) = E[Y | X = x], x ∈ X , has the smallest risk.
I f⋆ is called the regression function or conditional mean function.

31 / 94
Linear regression models

When side-information is encoded as vectors of real numbers x = (x1, . . . , xd) (called features or variables), it is common to use a linear regression model, such as the following:

Y | X = x ∼ N(xᵀw, σ²), x ∈ Rᵈ.

I Parameters: w = (w1, . . . , wd) ∈ Rᵈ and σ² > 0.
I X = (X1, . . . , Xd) is a random vector (i.e., a vector of random variables).
I The conditional distribution of Y given X is normal.
I The marginal distribution of X is not specified.

In this model, the regression function f⋆ is a linear function fw : Rᵈ → R,

fw(x) = xᵀw = ∑_{i=1}^d xᵢwᵢ, x ∈ Rᵈ.

(We’ll often refer to fw just by w.)

[Figure: data and the regression function f⋆ for a one-dimensional example.]

32 / 94
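The model above can be sketched in a few lines. The particular w, σ, and Gaussian marginal for X below are illustrative assumptions; the model itself leaves the marginal of X unspecified.

```python
import numpy as np

# Illustrative sketch of the model Y | X = x ~ N(x^T w, sigma^2).
# The specific w, sigma, and the Gaussian marginal for X are assumptions
# for this demo; the model leaves the marginal of X unspecified.
rng = np.random.default_rng(0)
d, n = 2, 5
w = np.array([2.0, -1.0])
sigma = 0.5

x = rng.normal(size=(n, d))              # some inputs (rows are x_i)
f_star = x @ w                           # regression function E[Y | X = x] = x^T w
y = f_star + sigma * rng.normal(size=n)  # labels drawn from N(x^T w, sigma^2)
print(y.shape, f_star.shape)
```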
Maximum likelihood estimation for linear regression

Linear regression model with Gaussian noise: (X1, Y1), . . . , (Xn, Yn), (X, Y) are iid, with

Y | X = x ∼ N(xᵀw, σ²), x ∈ Rᵈ.

(It is traditional to study linear regression in the context of this model.)

Log-likelihood of (w, σ²), given data (Xᵢ, Yᵢ) = (xᵢ, yᵢ) for i = 1, . . . , n:

∑_{i=1}^n [ −(1/(2σ²)) (xᵢᵀw − yᵢ)² + (1/2) ln(1/(2πσ²)) ] + { terms not involving (w, σ²) }.

The w that maximizes the log-likelihood is also the w that minimizes

(1/n) ∑_{i=1}^n (xᵢᵀw − yᵢ)².

This coincides with another approach, called empirical risk minimization, which is studied beyond the context of the linear regression model . . .

33 / 94
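A numerical sanity check of this claim: since the MLE for w minimizes the average squared error, it solves the least squares problem, so the gradient of the empirical risk vanishes at the least squares solution. The synthetic data below is an illustrative assumption.

```python
import numpy as np

# Under the Gaussian model, the MLE for w minimizes the average squared
# error, so it is the least squares solution. Synthetic data for the demo.
rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # least squares solution

# At the minimizer, the gradient of (1/n) sum_i (x_i^T w - y_i)^2 vanishes.
grad = (2.0 / n) * X.T @ (X @ w_hat - y)
print(np.allclose(grad, 0.0, atol=1e-8))
```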
Empirical distribution and empirical risk

The empirical distribution Pₙ on (x1, y1), . . . , (xn, yn) has probability mass function pₙ given by

pₙ((x, y)) := (1/n) ∑_{i=1}^n 1{(x, y) = (xᵢ, yᵢ)}, (x, y) ∈ Rᵈ × R.

Plug-in principle: The goal is to find a function f that minimizes the (squared loss) risk

R(f) = E[(f(X) − Y)²],

but we don’t know the distribution P of (X, Y). Replacing P with Pₙ gives the empirical (squared loss) risk R̂(f):

R̂(f) := (1/n) ∑_{i=1}^n (f(xᵢ) − yᵢ)².

(The “plug-in principle” is used throughout statistics in this same way.)

34 / 94
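The empirical risk is a direct transcription of the formula above: the average squared loss of f over the n observed pairs. The small data set below is made up for illustration.

```python
import numpy as np

# Empirical risk: average squared loss over the observed (x_i, y_i) pairs.
def empirical_risk(f, xs, ys):
    return np.mean((f(xs) - ys) ** 2)

# Illustrative data (three examples in R^2 with real labels).
xs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
ys = np.array([1.0, 2.0, 3.0])

w = np.array([1.0, 2.0])                        # a linear predictor f_w(x) = x^T w
print(empirical_risk(lambda x: x @ w, xs, ys))  # this w fits the data exactly
```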
Empirical risk minimization

Empirical risk minimization (ERM) is the learning method that returns a function (from a specified function class) that minimizes the empirical risk.

For linear functions and squared loss, ERM returns

ŵ ∈ arg min_{w ∈ Rᵈ} R̂(w),

which coincides with the MLE under the basic linear regression model.

In general:

I MLE makes sense in the context of the statistical model for which it is derived.
I ERM makes sense in the context of the general iid model for supervised learning.

Further remarks.

I In MLE, we assume a model, and we not only maximize likelihood but can try to argue that we “recover” a “true” parameter.
I In ERM, by default there is no assumption of a “true” parameter to recover.

Useful examples: medical testing, gene expression, . . .

35 / 94
Old Faithful data under this least squares statistical model

Recall our data, consisting of historical records of eruptions:

[Timeline: eruption end times b0, b1, b2, . . . and start times a1, a2, a3, . . ., with waiting times Y1, Y2, Y3, . . . between them.]

Statistical model (not just IID!): Y1, . . . , Yn, Y ∼iid N(µ, σ²).

I Data: Yᵢ := aᵢ − bᵢ₋₁, i = 1, . . . , n.

(Admittedly not a great model, since durations are non-negative.)

Task: At a later time t (when an eruption ends), predict the time of the next eruption, t + Y.

For the linear regression model, we’ll assume

Y | X = x ∼ N(xᵀw, σ²), x ∈ Rᵈ.

(This extends the model above if we add the “1” feature.)

36 / 94
9. Regularization and ridge regression
Inductive bias

Suppose the ERM solution is not unique. What should we do?

One possible answer: pick the w of shortest length.

I Fact: The shortest solution w to (AᵀA)w = Aᵀb is always unique.
I Obtain it via w = A⁺b, where A⁺ is the (Moore–Penrose) pseudoinverse of A.

Why should this be a good idea?

I The data do not give a reason to choose a shorter w over a longer w.
I The preference for shorter w is an inductive bias: it will work well for some problems (e.g., when the “true” w⋆ is short), and not for others.

All learning algorithms encode some kind of inductive bias.

37 / 94
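A sketch of the shortest-solution fact above: with more features than examples, Aw = b has many exact solutions, and A⁺b is the one of least Euclidean norm. The random A and b below are illustrative.

```python
import numpy as np

# With n = 3 examples and d = 5 features, A w = b is underdetermined:
# many w fit exactly, and pinv(A) @ b is the minimum-norm one.
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 5))
b = rng.normal(size=3)

w_min = np.linalg.pinv(A) @ b  # minimum-norm solution

# Build another exact solution by adding a null-space direction of A.
_, _, Vt = np.linalg.svd(A)
w_other = w_min + Vt[-1]       # A @ Vt[-1] is (numerically) zero

print(np.allclose(A @ w_min, b), np.allclose(A @ w_other, b))
print(np.linalg.norm(w_min) < np.linalg.norm(w_other))
```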
Example

ERM with a scaled trigonometric feature expansion:

φ(x) = (1, sin(x), cos(x), (1/2) sin(2x), (1/2) cos(2x), (1/3) sin(3x), (1/3) cos(3x), . . . ).

[Figures: the training data; an arbitrary ERM fit; the least ℓ2 norm ERM fit.]

It is not a given that the least norm ERM is better than the other ERM!

38 / 94
Regularized ERM

Combine the two concerns: for a given λ ≥ 0, find the minimizer of

R̂(w) + λ‖w‖₂²

over w ∈ Rᵈ.

Fact: If λ > 0, then the solution is always unique (even if n < d)!

I This is called ridge regression. (λ = 0 is ERM / ordinary least squares.)
I Explicit solution: ŵ = (AᵀA + λI)⁻¹Aᵀb.
I The parameter λ controls how much attention is paid to the regularizer ‖w‖₂² relative to the data-fitting term R̂(w).
I Choose λ using cross-validation.

Note: in deep networks, this regularization is called “weight decay”. (Why?)
Note: another popular regularizer for linear regression is ℓ1.

39 / 94
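A sketch of the explicit ridge solution (AᵀA + λI)⁻¹Aᵀb. Even with n < d, the regularized system is positive definite and hence uniquely solvable; as λ → 0, the solution approaches the minimum-norm (pseudoinverse) solution. The data and λ values below are illustrative assumptions.

```python
import numpy as np

# Ridge regression in closed form: solve (A^T A + lam I) w = A^T b.
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 6))   # n = 4 < d = 6: plain least squares is not unique
b = rng.normal(size=4)
d = A.shape[1]

def ridge(lam):
    # For lam > 0 the matrix A^T A + lam I is positive definite.
    return np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)

w_ridge = ridge(0.1)
print(w_ridge.shape)

# As lam -> 0, ridge approaches the minimum-norm solution pinv(A) @ b.
w_small = ridge(1e-8)
print(np.linalg.norm(w_small - np.linalg.pinv(A) @ b))
```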
10. True risk and overfitting
Statistical interpretation of ERM

Let (X, Y) ∼ P, where P is some distribution on Rᵈ × R. Which w have smallest risk R(w) = E[(Xᵀw − Y)²]?

A necessary condition for w to be a minimizer of R:

∇R(w) = 0, i.e., w is a critical point of R.

This translates to

E[XXᵀ]w = E[YX],

a system of linear equations called the population normal equations. It can be proved that every critical point of R is a minimizer of R.

Look familiar? If (X1, Y1), . . . , (Xn, Yn), (X, Y) are iid, then

(1/n) E[AᵀA] = E[XXᵀ] and (1/n) E[Aᵀb] = E[YX],

so ERM can be regarded as a plug-in estimator for a minimizer of R.

40 / 94
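A Monte Carlo sketch of this plug-in view: the empirical moments (1/n)AᵀA and (1/n)Aᵀb approach E[XXᵀ] and E[YX] as n grows. The standard-normal X (so that E[XXᵀ] = I) and the particular w⋆ are illustrative assumptions.

```python
import numpy as np

# Empirical moments converge to the population moments in the normal
# equations. Standard-normal X gives E[X X^T] = I and E[Y X] = w*.
rng = np.random.default_rng(0)
d, n = 3, 200_000
w_star = np.array([1.0, 2.0, -1.0])

A = rng.normal(size=(n, d))          # rows are iid X_i
b = A @ w_star + rng.normal(size=n)  # Y_i = X_i^T w* + noise

print(np.allclose(A.T @ A / n, np.eye(d), atol=0.02))  # ~ E[X X^T]
print(np.allclose(A.T @ b / n, w_star, atol=0.02))     # ~ E[Y X] = E[X X^T] w*
```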
Risk of ERM

IID model: (X1, Y1), . . . , (Xn, Yn), (X, Y) are iid, taking values in Rᵈ × R.

Let w⋆ be a minimizer of R over all w ∈ Rᵈ, i.e., w⋆ satisfies the population normal equations

E[XXᵀ]w⋆ = E[YX].

I If the ERM solution ŵ is not unique (e.g., if n < d), then R(ŵ) can be arbitrarily worse than R(w⋆).
I What about when the ERM solution is unique?

Theorem. Under mild assumptions on the distribution of X,

R(ŵ) − R(w⋆) = O( tr(cov(εW)) / n )

“asymptotically”, where W := E[XXᵀ]^(−1/2) X and ε := Y − Xᵀw⋆.

41 / 94
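A Monte Carlo sketch of the theorem's 1/n rate. With X ∼ N(0, I) we have E[XXᵀ] = I, so the excess risk R(ŵ) − R(w⋆) equals ‖ŵ − w⋆‖₂². The Gaussian design, noise level, and w⋆ below are illustrative assumptions.

```python
import numpy as np

# Estimate the average excess risk of ERM as a function of n.
# With E[X X^T] = I, excess risk = ||w_hat - w*||^2.
rng = np.random.default_rng(0)
d, sigma = 5, 1.0
w_star = np.ones(d)

def mean_excess_risk(n, trials=200):
    total = 0.0
    for _ in range(trials):
        A = rng.normal(size=(n, d))
        b = A @ w_star + sigma * rng.normal(size=n)
        w_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
        total += np.sum((w_hat - w_star) ** 2)
    return total / trials

r100, r1000 = mean_excess_risk(100), mean_excess_risk(1000)
print(r100 > r1000)  # larger n, smaller excess risk
```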
Risk of ERM analysis (rough sketch)

Let εi := Yi − Xi^T w⋆ for each i = 1, . . . , n, so
E[εi Xi] = E[Yi Xi] − E[Xi Xi^T] w⋆ = 0
and
√n (ŵ − w⋆) = ( (1/n) Σ_{i=1}^n Xi Xi^T )^{−1} (1/√n) Σ_{i=1}^n εi Xi.

1. By LLN: (1/n) Σ_{i=1}^n Xi Xi^T →p E[X X^T]

2. By CLT: (1/√n) Σ_{i=1}^n εi Xi →d cov(εX)^{1/2} Z, where Z ∼ N(0, I).

Therefore, the asymptotic distribution of √n (ŵ − w⋆) is
√n (ŵ − w⋆) →d E[X X^T]^{−1} cov(εX)^{1/2} Z.

A few more steps gives
n ( E[(X^T ŵ − Y)^2] − E[(X^T w⋆ − Y)^2] ) →d ‖E[X X^T]^{−1/2} cov(εX)^{1/2} Z‖2^2.

The random variable on the RHS is "concentrated" around its mean tr(cov(εW)).

42 / 94
Risk of ERM: postscript

I Analysis does not assume that the linear regression model is "correct"; the data distribution need not be from the normal linear regression model.

I Only assumptions are those needed for LLN and CLT to hold.

I However, if the normal linear regression model holds, i.e.,
Y | X = x ∼ N(x^T w⋆, σ^2),
then the bound from the theorem becomes
R(ŵ) − R(w⋆) = O(σ^2 d / n),
which is familiar to those who have taken introductory statistics.

I With more work, can also prove a non-asymptotic risk bound of similar form.

I In homework/reading, we look at a simpler setting for studying ERM for linear regression, called "fixed design".

43 / 94
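The σ²d/n rate can be eyeballed by Monte Carlo. The sketch below (a hypothetical well-specified model, assumed for illustration) multiplies n by 10 and checks that the average excess risk shrinks by roughly 10×:

```python
import numpy as np

# Monte Carlo sketch of the sigma^2 * d / n excess-risk rate under a
# hypothetical normal linear regression model with X ~ N(0, I).
rng = np.random.default_rng(8)
d, sigma = 5, 1.0
w_star = np.ones(d)

def mean_excess_risk(n, trials=500):
    vals = []
    for _ in range(trials):
        A = rng.normal(size=(n, d))
        b = A @ w_star + sigma * rng.normal(size=n)
        w_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
        # For X ~ N(0, I): R(w_hat) - R(w_star) = ||w_hat - w_star||^2.
        vals.append(np.sum((w_hat - w_star) ** 2))
    return np.mean(vals)

e_small, e_large = mean_excess_risk(50), mean_excess_risk(500)
print(e_small, e_large, e_small / e_large)
```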
Risk vs empirical risk

Let ŵ be the ERM solution.

1. Empirical risk of ERM: R̂(ŵ)

2. True risk of ERM: R(ŵ)

Theorem. E[R̂(ŵ)] ≤ E[R(ŵ)].

(Empirical risk can sometimes be larger than true risk, but not on average.)

Overfitting: empirical risk is "small", but true risk is "much higher".

44 / 94
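A small simulation (hypothetical setup, assumed for illustration) of the theorem: averaged over many training sets, the empirical risk of ERM sits below its true risk.

```python
import numpy as np

# Hypothetical check that E[ empirical risk of ERM ] <= E[ true risk of ERM ].
# Data: X ~ N(0, I), Y = X^T w_star + noise with unit variance, so the true
# risk of any w is ||w - w_star||^2 + 1.
rng = np.random.default_rng(1)
d, n, trials = 5, 20, 2000
w_star = np.ones(d)

train_risks, true_risks = [], []
for _ in range(trials):
    A = rng.normal(size=(n, d))
    b = A @ w_star + rng.normal(size=n)
    w_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
    train_risks.append(np.mean((A @ w_hat - b) ** 2))
    true_risks.append(np.sum((w_hat - w_star) ** 2) + 1.0)

print(np.mean(train_risks), np.mean(true_risks))
```

Here the noise variance is 1, so the gap between the two averages straddling 1.0 is the overfitting gap of ERM at this small n.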
Overfitting example

(X1, Y1), . . . , (Xn, Yn), (X, Y) are iid; X is a continuous random variable in R.

Suppose we use the degree-k polynomial expansion
φ(x) = (1, x, x^2, . . . , x^k), x ∈ R,
so the dimension is d = k + 1.

Fact: Any function on ≤ k + 1 points can be interpolated by a polynomial of degree at most k.

[Figure: noisy data points on [0, 1], fit exactly by an interpolating polynomial.]

Conclusion: If n ≤ k + 1 = d, the ERM solution ŵ with this feature expansion has R̂(ŵ) = 0 always, regardless of its true risk (which can be ≫ 0).

45 / 94
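The interpolation claim is easy to demonstrate: with n ≤ k + 1 points and a degree-k expansion, least squares drives the empirical risk to exactly 0 (up to floating point), no matter what the labels are. A minimal sketch:

```python
import numpy as np

# Degree-k polynomial expansion with n <= k + 1 points: ERM interpolates,
# so empirical risk is 0 even on pure-noise labels.
rng = np.random.default_rng(2)
n, k = 6, 8                                  # n <= k + 1, interpolation possible
x = rng.uniform(0, 1, size=n)
y = rng.normal(size=n)                       # labels are pure noise

Phi = np.vander(x, k + 1, increasing=True)   # features (1, x, x^2, ..., x^k)
w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
empirical_risk = np.mean((Phi @ w_hat - y) ** 2)
print(empirical_risk)
```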
Estimating risk

IID model: (X1, Y1), . . . , (Xn, Yn), (X̃1, Ỹ1), . . . , (X̃m, Ỹm) ∼iid P.

I Training data (X1, Y1), . . . , (Xn, Yn) used to learn f̂.

I Test data (X̃1, Ỹ1), . . . , (X̃m, Ỹm) used to estimate risk, via the test risk
R̂test(f̂) := (1/m) Σ_{i=1}^m (f̂(X̃i) − Ỹi)^2.

I Training data is independent of test data, so f̂ is independent of test data.

I Let Li := (f̂(X̃i) − Ỹi)^2 for each i = 1, . . . , m, so
E[R̂test(f̂) | f̂] = (1/m) Σ_{i=1}^m E[Li | f̂] = R(f̂).

I Moreover, L1, . . . , Lm are conditionally iid given f̂, and hence by the Law of Large Numbers,
R̂test(f̂) →p R(f̂) as m → ∞.

I By CLT, the rate of convergence is m^{−1/2}.

46 / 94
Rates for risk minimization vs. rates for risk estimation

One may think that ERM "works" because, somehow, training risk is a good "plug-in" estimate of true risk.

I Sometimes this is partially true—we'll revisit this when we discuss generalization theory.

Roughly speaking, under some assumptions, can expect that
|R̂(w) − R(w)| ≤ O(√(d/n)) for all w ∈ R^d.

However . . .

I By CLT, we know the following holds for a fixed w:
R̂(w) →p R(w) at n^{−1/2} rate.
(Here, we ignore the dependence on d.)

I Yet, for the ERM solution ŵ,
R(ŵ) →p R(w⋆) at n^{−1} rate.
(Also ignoring dependence on d.)

Implication: Selecting a good predictor can be "easier" than estimating how good predictors are!

47 / 94
Old Faithful example

I Linear regression model + affine expansion on "duration of last eruption".

I Learn ŵ = (35.0929, 10.3258) from 136 past observations.

I Mean squared loss of ŵ on next 136 observations is 35.9404.
(Recall: mean squared loss of µ = 70.7941 was 187.1894.)

[Figure: time until next eruption vs. duration of last eruption, with the linear model and the constant prediction overlaid.]

(Unfortunately, √35.9 > mean duration ≈ 3.5.)

48 / 94
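The affine fit itself is one least-squares call. Below is a sketch on synthetic geyser-like data (the real Old Faithful measurements are not reproduced here, so the recovered coefficients only roughly echo the slide's (35.09, 10.33)):

```python
import numpy as np

# Hypothetical synthetic stand-in for the Old Faithful data: delay is roughly
# affine in the duration of the last eruption, plus noise.
rng = np.random.default_rng(9)
duration = rng.uniform(1.5, 5.0, size=136)
delay = 35.0 + 10.0 * duration + 6.0 * rng.normal(size=136)  # made-up ground truth

A = np.column_stack([np.ones_like(duration), duration])      # affine expansion (1, x)
w_hat, *_ = np.linalg.lstsq(A, delay, rcond=None)
print(np.round(w_hat, 1))                                    # (intercept, slope)
```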
11. ℓ1 regularization: the LASSO

Regularization with a different norm

Lasso: For a given λ ≥ 0, find minimizer of
R̂(w) + λ‖w‖1
over w ∈ R^d. Here, ‖v‖1 = Σ_{i=1}^d |vi| is the ℓ1-norm.

I Prefers shorter w, but using a different notion of length than ridge.

I Tends to produce w that are sparse—i.e., have few non-zero coordinates—or at least well-approximated by sparse vectors.

Fact: Vectors with small ℓ1-norm are well-approximated by sparse vectors.
If w̃ contains just the 1/ε²-largest coefficients (by magnitude) of w, then
‖w − w̃‖2 ≤ ε‖w‖1.

49 / 94
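A minimal proximal-gradient (ISTA) sketch of the lasso objective above, on hypothetical data; a library solver (e.g., scikit-learn's `Lasso`) would do the same job, but the soft-thresholding step below shows where the exact zeros come from.

```python
import numpy as np

# ISTA for R_hat(w) + lam * ||w||_1, with R_hat(w) = (1/n) ||Aw - b||^2.
# Hypothetical data: only the first 3 of 20 coordinates of w_star are non-zero.
rng = np.random.default_rng(4)
n, d = 200, 20
w_star = np.zeros(d)
w_star[:3] = [3.0, -2.0, 1.5]

A = rng.normal(size=(n, d))
b = A @ w_star + 0.1 * rng.normal(size=n)

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1: shrinks toward 0, exact zeros inside [-t, t].
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

lam = 0.1
step = 1.0 / (2 * np.linalg.eigvalsh(A.T @ A / n).max())  # 1/L for the smooth part
w = np.zeros(d)
for _ in range(2000):
    grad = 2.0 / n * A.T @ (A @ w - b)
    w = soft_threshold(w - step * grad, step * lam)

print(np.round(w, 2))  # most coordinates are driven exactly to 0
```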
Sparse approximations

Claim: If w̃ contains just the T-largest coefficients (by magnitude) of w, then
‖w − w̃‖2 ≤ ‖w‖1 / √(T + 1).

WLOG |w1| ≥ |w2| ≥ · · · , so w̃ = (w1, . . . , wT, 0, . . . , 0).

‖w − w̃‖2^2 = Σ_{i ≥ T+1} wi^2
≤ Σ_{i ≥ T+1} |wi| · |wT+1|
≤ ‖w‖1 · |wT+1|
≤ ‖w‖1 · ‖w‖1 / (T + 1).

This is a consequence of the "mismatch" between ℓ1- and ℓ2-norms. Can get similar results for other ℓp norms.

50 / 94
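The claim above is easy to spot-check numerically; the sketch below keeps the T largest-magnitude coefficients of a heavy-tailed vector and verifies the bound:

```python
import numpy as np

# Numeric spot-check of: keeping the T largest-magnitude coefficients of w
# gives ||w - w_tilde||_2 <= ||w||_1 / sqrt(T + 1).
rng = np.random.default_rng(5)
w = rng.standard_cauchy(size=1000)   # heavy-tailed: few entries dominate ||w||_1

def top_T(w, T):
    # Zero out all but the T largest-magnitude coefficients.
    w_tilde = np.zeros_like(w)
    idx = np.argsort(-np.abs(w))[:T]
    w_tilde[idx] = w[idx]
    return w_tilde

for T in (1, 10, 100):
    lhs = np.linalg.norm(w - top_T(w, T))
    rhs = np.linalg.norm(w, 1) / np.sqrt(T + 1)
    print(T, lhs <= rhs)
```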
Example: Coefficient profile (ℓ2 vs. ℓ1)

Y = levels of prostate cancer antigen, X = clinical measurements.

Horizontal axis: varying λ (large λ to left, small λ to right).
Vertical axis: coefficient value in ℓ2-regularized ERM and ℓ1-regularized ERM, for eight different variables.

51 / 94
Other approaches to sparse regression

I Subset selection:
Find w that minimizes empirical risk among all vectors with at most k non-zero entries.
Unfortunately, this seems to require time exponential in k.

I Greedy algorithms:
Repeatedly choose a new variable to "include" in the support of w until k variables are included.
Forward stepwise regression / Orthogonal matching pursuit.
Often works as well as ℓ1-regularized ERM.

Why do we care about sparsity?

52 / 94
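A rough sketch of orthogonal matching pursuit, one of the greedy approaches mentioned above (hypothetical data; the variable names are illustrative): repeatedly add the column most correlated with the current residual, then refit by least squares on the selected support.

```python
import numpy as np

def omp(A, b, k):
    # Orthogonal matching pursuit: greedily grow a support of size <= k.
    n, d = A.shape
    support, residual = [], b.copy()
    for _ in range(k):
        j = int(np.argmax(np.abs(A.T @ residual)))  # most correlated column
        if j not in support:
            support.append(j)
        # Refit by least squares restricted to the current support.
        coef, *_ = np.linalg.lstsq(A[:, support], b, rcond=None)
        residual = b - A[:, support] @ coef
    w = np.zeros(d)
    w[support] = coef
    return w

# Hypothetical 3-sparse recovery problem.
rng = np.random.default_rng(6)
n, d, k = 100, 50, 3
w_star = np.zeros(d)
w_star[[4, 17, 30]] = [2.0, -3.0, 1.0]
A = rng.normal(size=(n, d))
b = A @ w_star + 0.01 * rng.normal(size=n)

w_hat = omp(A, b, k)
print(np.nonzero(w_hat)[0])
```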
12. Summary

Summary

I ERM for OLS
I ERM in general
I Normal equations
I Pseudoinverse solution
I Ridge regression
I Statistical view of ERM

53 / 94
Inductive bias

Suppose the ERM solution is not unique. What should we do?

One possible answer: Pick the w of shortest length.

I Fact: The shortest solution w to (A^T A) w = A^T b is always unique.

I Obtain w via
w = A^+ b
where A^+ is the (Moore-Penrose) pseudoinverse of A.

Why should this be a good idea?

I Data does not give reason to choose a shorter w over a longer w.

I The preference for shorter w is an inductive bias: it will work well for some problems (e.g., when the "true" w⋆ is short), not for others.

All learning algorithms encode some kind of inductive bias.

54 / 94
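The least-norm choice is concrete in numpy. The sketch below (hypothetical data) builds a wide system with many solutions, takes the pseudoinverse solution, and checks that any other solution (obtained by adding a null-space direction) fits the data equally well but is strictly longer:

```python
import numpy as np

# When A is wide (n < d), the normal equations have many solutions;
# the pseudoinverse A^+ b picks the unique shortest one.
rng = np.random.default_rng(7)
n, d = 5, 10
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

w_min = np.linalg.pinv(A) @ b        # minimum-norm solution A^+ b

# Any other solution: add a direction from the null space of A
# (a right singular vector with zero singular value).
null_dir = np.linalg.svd(A)[2][-1]
w_other = w_min + 2.0 * null_dir

print(np.linalg.norm(A @ w_min - b), np.linalg.norm(A @ w_other - b))
print(np.linalg.norm(w_min), np.linalg.norm(w_other))
```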
Example
ERM with a scaled trigonometric feature expansion:
φ(x) = (1, sin(x), cos(x), (1/2) sin(2x), (1/2) cos(2x), (1/3) sin(3x), (1/3) cos(3x), . . . ).
Training data and least ℓ2 norm ERM:
[Figure: training points with an arbitrary ERM fit and the least ℓ2 norm ERM fit; x ranges over [0, 6], f(x) over [−2.5, 2.5].]
It is not a given that the least norm ERM is better than the other ERM!
55 / 94
Regularized ERM
Combine the two concerns: for a given λ ≥ 0, find the minimizer of
R(w) + λ‖w‖₂²
over w ∈ R^d.
Fact: if λ > 0, then the solution is always unique (even if n < d)!
I This is called ridge regression. (λ = 0 is ERM / ordinary least squares.)
I The parameter λ controls how much attention is paid to the regularizer ‖w‖₂² relative to the data-fitting term R(w).
I Choose λ using cross-validation.
56 / 94
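A minimal NumPy sketch of ridge regression via its closed-form normal equations (the data here are synthetic, and the 1/n scaling matches the empirical risk R(w) above); note that the solve succeeds even though n < d:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 10          # fewer points than features: plain ERM is not unique
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
lam = 0.1

# Ridge solution: unique minimizer of (1/n)||Xw - y||^2 + lam * ||w||^2,
# given in closed form by ((1/n) X^T X + lam * I) w = (1/n) X^T y.
w_ridge = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

# Check the first-order condition: the gradient vanishes at w_ridge.
assert np.allclose(X.T @ (X @ w_ridge - y) / n + lam * w_ridge, 0.0)
```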
Another interpretation of ridge regression
Define the (n+d) × d matrix A and the (n+d) × 1 column vector b by
A := (1/√n) [ x₁ᵀ ; … ; xₙᵀ ; √(nλ)·I_d ],    b := (1/√n) (y₁, …, yₙ, 0, …, 0)ᵀ.
Then
‖Aw − b‖₂² = R(w) + λ‖w‖₂².
Interpretation:
I The d “fake” data points ensure that the augmented data matrix A has rank d.
I The squared length of each “fake” feature vector is nλ. All corresponding labels are 0.
I The prediction of w on the i-th fake feature vector is √(nλ)·wᵢ.
57 / 94
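The equivalence above can be checked numerically; this NumPy sketch builds the augmented A and b from synthetic data and compares the two objectives at a random w:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 6, 3, 0.5
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Augmented system: stack d "fake" rows sqrt(n*lam) * I under the data,
# with labels 0, then scale everything by 1/sqrt(n).
A = np.vstack([X, np.sqrt(n * lam) * np.eye(d)]) / np.sqrt(n)
b = np.concatenate([y, np.zeros(d)]) / np.sqrt(n)

w = rng.normal(size=d)
lhs = np.sum((A @ w - b) ** 2)                          # ||Aw - b||^2
rhs = np.mean((X @ w - y) ** 2) + lam * np.sum(w ** 2)  # R(w) + lam*||w||^2
assert np.isclose(lhs, rhs)
```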
Enhancing linear regression models
Linear functions might sound rather restricted, but they can actually be quite powerful if you are creative about side-information.
Examples:
1. Non-linear transformations of existing variables: for x ∈ R,
φ(x) = ln(1 + x).
2. Logical formulas of binary variables: for x = (x₁, . . . , x_d) ∈ {0, 1}^d,
φ(x) = (x₁ ∧ x₅ ∧ ¬x₁₀) ∨ (¬x₂ ∧ x₇).
3. Trigonometric expansion: for x ∈ R,
φ(x) = (1, sin(x), cos(x), sin(2x), cos(2x), . . . ).
4. Polynomial expansion: for x = (x₁, . . . , x_d) ∈ R^d,
φ(x) = (1, x₁, . . . , x_d, x₁², . . . , x_d², x₁x₂, . . . , x₁x_d, . . . , x_{d−1}x_d).
59 / 94
Example: Taking advantage of linearity
Suppose you are trying to predict some health outcome.
I A physician suggests that body temperature is relevant, specifically the (squared) deviation from normal body temperature:
φ(x) = (x_temp − 98.6)².
I What if you didn’t know about this magic constant 98.6?
I Instead, use
φ(x) = (1, x_temp, x_temp²).
Can learn coefficients w such that
wᵀφ(x) = (x_temp − 98.6)²,
or any other quadratic polynomial in x_temp (which may be better!).
60 / 94
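A sketch of this idea in NumPy, using synthetic data generated from the (x_temp − 98.6)² rule; the learned quadratic fits the data and recovers the “magic constant” as the location of its minimum:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: the outcome is exactly the squared deviation from 98.6.
temps = rng.uniform(96.0, 102.0, size=50)
outcome = (temps - 98.6) ** 2

# Feature map phi(x) = (1, x, x^2): the constant 98.6 need not be known.
Phi = np.column_stack([np.ones_like(temps), temps, temps ** 2])
w, *_ = np.linalg.lstsq(Phi, outcome, rcond=None)

# The learned quadratic fits the data, and its minimum sits at 98.6.
assert np.allclose(Phi @ w, outcome, atol=1e-6)
assert np.isclose(-w[1] / (2 * w[2]), 98.6)
```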
Quadratic expansion
A quadratic function f : R → R,
f(x) = ax² + bx + c,  x ∈ R,
for a, b, c ∈ R, can be written as a linear function of φ(x), where
φ(x) := (1, x, x²),
since f(x) = wᵀφ(x) where w = (c, b, a).
For a multivariate quadratic function f : R^d → R, use
φ(x) := (1, x₁, . . . , x_d [linear terms], x₁², . . . , x_d² [squared terms], x₁x₂, . . . , x₁x_d, . . . , x_{d−1}x_d [cross terms]).
61 / 94
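The multivariate expansion can be written as a short helper; in this NumPy sketch the function name `quadratic_expansion` is ours, and it builds the 1 + d + d + d(d−1)/2 features in the order listed above:

```python
import numpy as np
from itertools import combinations

def quadratic_expansion(x):
    """phi(x) = (1, linear terms, squared terms, cross terms)."""
    x = np.asarray(x, dtype=float)
    cross = [x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return np.concatenate([[1.0], x, x ** 2, cross])

phi = quadratic_expansion([2.0, 3.0])
# (1, x1, x2, x1^2, x2^2, x1*x2) = (1, 2, 3, 4, 9, 6)
assert np.allclose(phi, [1.0, 2.0, 3.0, 4.0, 9.0, 6.0])

# Dimension: 1 + d (linear) + d (squared) + d*(d-1)/2 (cross)
d = 5
assert len(quadratic_expansion(np.ones(d))) == 1 + d + d + d * (d - 1) // 2
```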
Affine expansion and “Old Faithful”
Woodward needed an affine expansion for the “Old Faithful” data:
φ(x) := (1, x).
[Figure: duration of last eruption (0–6) vs. time until next eruption (0–100), with the fitted affine function.]
The affine function f_{a,b} : R → R for a, b ∈ R,
f_{a,b}(x) = a + bx,
is a linear function f_w of φ(x) for w = (a, b).
(This easily generalizes to multivariate affine functions.)
62 / 94
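A sketch of the affine fit in NumPy; the eruption numbers here are made up for illustration, not Woodward's actual data:

```python
import numpy as np

# Hypothetical eruption data: duration of last eruption (minutes)
# and time until the next eruption (minutes).
duration = np.array([1.8, 2.0, 3.3, 3.6, 4.2, 4.6])
wait = np.array([54.0, 55.0, 72.0, 75.0, 83.0, 87.0])

# Affine expansion phi(x) = (1, x) turns the affine fit into a linear one.
Phi = np.column_stack([np.ones_like(duration), duration])
(a, b_slope), *_ = np.linalg.lstsq(Phi, wait, rcond=None)

pred = a + b_slope * duration
assert pred.shape == wait.shape
assert b_slope > 0  # longer eruptions precede longer waits in this data
```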
Why linear regression models?
1. Linear regression models benefit from good choice of features.
2. Structure of linear functions is very well-understood.
3. Many well-understood and efficient algorithms for learning linear functions from data, even when n and d are large.
63 / 94
13. From data to prediction functions
Maximum likelihood estimation for linear regression
Linear regression model with Gaussian noise: (X₁, Y₁), . . . , (Xₙ, Yₙ), (X, Y) are iid, with
Y | X = x ∼ N(xᵀw, σ²),  x ∈ R^d.
(It is traditional to study linear regression in the context of this model.)
Log-likelihood of (w, σ²), given data (Xᵢ, Yᵢ) = (xᵢ, yᵢ) for i = 1, . . . , n:
Σᵢ₌₁ⁿ { −(1/(2σ²)) (xᵢᵀw − yᵢ)² + (1/2) ln(1/(2πσ²)) } + { terms not involving (w, σ²) }.
The w that maximizes the log-likelihood is also the w that minimizes
(1/n) Σᵢ₌₁ⁿ (xᵢᵀw − yᵢ)².
This coincides with another approach, called empirical risk minimization, which is studied beyond the context of the linear regression model . . .
64 / 94
Empirical distribution and empirical risk
The empirical distribution Pₙ on (x₁, y₁), . . . , (xₙ, yₙ) has probability mass function pₙ given by
pₙ((x, y)) := (1/n) Σᵢ₌₁ⁿ 1{(x, y) = (xᵢ, yᵢ)},  (x, y) ∈ R^d × R.
Plug-in principle: the goal is to find a function f that minimizes the (squared loss) risk
R(f) = E[(f(X) − Y)²].
But we don’t know the distribution P of (X, Y).
Replace P with Pₙ → the empirical (squared loss) risk R̂(f):
R̂(f) := (1/n) Σᵢ₌₁ⁿ (f(xᵢ) − yᵢ)².
65 / 94
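The empirical risk is straightforward to compute; a small NumPy sketch (the helper name `empirical_risk` is ours) on a tiny made-up sample:

```python
import numpy as np

def empirical_risk(f, xs, ys):
    """(1/n) * sum of squared losses of predictor f on the sample."""
    preds = np.array([f(x) for x in xs])
    return np.mean((preds - ys) ** 2)

xs = np.array([0.0, 1.0, 2.0])
ys = np.array([0.0, 1.0, 3.0])

# Predictor f(x) = x has residuals (0, 0, -1), so the risk is 1/3.
assert np.isclose(empirical_risk(lambda x: x, xs, ys), 1.0 / 3.0)
```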
Empirical risk minimization
Empirical risk minimization (ERM) is the learning method that returns a function (from a specified function class) that minimizes the empirical risk.
For linear functions and squared loss, ERM returns
ŵ ∈ arg min_{w ∈ R^d} R(w),
which coincides with the MLE under the basic linear regression model.
In general:
I MLE makes sense in the context of the statistical model for which it is derived.
I ERM makes sense in the context of the general iid model for supervised learning.
66 / 94
Empirical risk minimization in pictures
Red dots: data points.
Affine hyperplane: linear function w (via the affine expansion (x₁, x₂) ↦ (1, x₁, x₂)).
ERM: minimize the sum of squared vertical lengths from the hyperplane to the points.
67 / 94
Empirical risk minimization in matrix notation
Define the n × d matrix A and the n × 1 column vector b by
A := (1/√n) [ x₁ᵀ ; … ; xₙᵀ ],    b := (1/√n) (y₁, …, yₙ)ᵀ.
Can write the empirical risk as
R(w) = ‖Aw − b‖₂².
A necessary condition for w to be a minimizer of R:
∇R(w) = 0, i.e., w is a critical point of R.
This translates to
(AᵀA)w = Aᵀb,
a system of linear equations called the normal equations.
It can be proved that every critical point of R is a minimizer of R:
68 / 94
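Solving the normal equations numerically on synthetic data; the assertion checks the defining property that the residual Aw − b is orthogonal to every column of A:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 20, 4
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

A = X / np.sqrt(n)
b = y / np.sqrt(n)

# Normal equations: (A^T A) w = A^T b.
w = np.linalg.solve(A.T @ A, A.T @ b)

# The residual Aw - b is orthogonal to the columns of A,
# so w is a critical point (hence a minimizer) of ||Aw - b||^2.
assert np.allclose(A.T @ (A @ w - b), 0.0)
```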
Aside: Convexity
Let f : Rd → R be a differentiable function.
Suppose we find x ∈ Rd such that ∇f(x) = 0. Is x a minimizer of f?
Yes, if f is a convex function:
f((1− t)x+ tx′) ≤ (1− t)f(x) + tf(x′),
for any 0 ≤ t ≤ 1 and any x,x′ ∈ Rd.
69 / 94
Convexity of empirical risk
Checking convexity of g(x) = ‖Ax − b‖₂²:
g((1 − t)x + tx′)
= ‖(1 − t)(Ax − b) + t(Ax′ − b)‖₂²
= (1 − t)²‖Ax − b‖₂² + t²‖Ax′ − b‖₂² + 2(1 − t)t (Ax − b)ᵀ(Ax′ − b)
= (1 − t)‖Ax − b‖₂² + t‖Ax′ − b‖₂² − (1 − t)t [‖Ax − b‖₂² + ‖Ax′ − b‖₂²] + 2(1 − t)t (Ax − b)ᵀ(Ax′ − b)
≤ (1 − t)‖Ax − b‖₂² + t‖Ax′ − b‖₂²,
where the last step uses the Cauchy–Schwarz inequality and the arithmetic mean / geometric mean (AM/GM) inequality.
70 / 94
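The convexity inequality can be spot-checked numerically; a NumPy sketch on random points and mixing weights (synthetic A and b):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(8, 3))
b = rng.normal(size=8)

def g(x):
    return np.sum((A @ x - b) ** 2)

# Spot-check g((1-t)x + t x') <= (1-t) g(x) + t g(x')
# on random points x, x' and mixing weights t.
for _ in range(100):
    x, xp = rng.normal(size=3), rng.normal(size=3)
    t = rng.uniform()
    assert g((1 - t) * x + t * xp) <= (1 - t) * g(x) + t * g(xp) + 1e-9
```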
Convexity of empirical risk, another way
Preview of convex analysis
Recall R(w) = (1/n) Σᵢ₌₁ⁿ (xᵢᵀw − yᵢ)².
I The scalar function g(z) = cz² is convex for any c ≥ 0.
I The composition (g ∘ a) : R^d → R of any convex function g : R → R and any affine function a : R^d → R is convex.
I Therefore, the function w ↦ (1/n)(xᵢᵀw − yᵢ)² is convex.
I A sum of convex functions is convex.
I Therefore R is convex.
Convexity is a useful mathematical property to understand! (We’ll study more convex analysis in a few weeks.)
71 / 94
Algorithm for ERM
Algorithm for ERM with linear functions and squared loss†
input: Data (x₁, y₁), . . . , (xₙ, yₙ) from R^d × R.
output: Linear function w ∈ R^d.
1: Find a solution w to the normal equations defined by the data (using, e.g., Gaussian elimination).
2: return w.
†Also called “ordinary least squares” in this context.
Running time (dominated by Gaussian elimination): O(nd²).
Note: there are many approximate solvers that run in nearly linear time!
72 / 94
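A sketch of the algorithm using a library least-squares solver (`np.linalg.lstsq`, which is SVD-based) instead of hand-rolled Gaussian elimination; the data are synthetic and noiseless, so ERM recovers the true w exactly:

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 100, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true  # noiseless labels, so ERM recovers w_true

# Ordinary least squares via a library solver.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(w_hat, w_true)
```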
Geometric interpretation of least squares ERM
Let aj ∈ Rn be the j-th column of matrix A ∈ Rn×d, so
A =
↑ ↑a1 · · · ad↓ ↓
.
Minimizing ‖Aw − b‖22 is the same as finding vector b ∈ range(A) closest to b.
Solution b is orthogonal projection of b onto range(A) = {Aw : w ∈ Rd}.
b
b
a1
a2
I b is uniquely determined.
I If rank(A) < d, then >1 way to writeb as linear combination of a1, . . . ,ad.
If rank(A) < d, then ERM solution is notunique.
To get w from b:solve system of linear equations Aw = b.
73 / 94
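The projection picture above can be verified directly: with full column rank, b̂ = A(AᵀA)⁻¹Aᵀb, and the residual b − b̂ is orthogonal to every column of A. A minimal sketch with synthetic data:

```python
import numpy as np

# Orthogonal-projection view of least squares: b_hat = A w is the projection
# of b onto range(A), so the residual b - b_hat is orthogonal to range(A).
rng = np.random.default_rng(2)
n, d = 20, 4
A = rng.normal(size=(n, d))   # full column rank with probability 1
b = rng.normal(size=n)

w = np.linalg.solve(A.T @ A, A.T @ b)   # ERM solution
b_hat = A @ w                           # projection of b onto range(A)

# Orthogonality check: A^T (b - b_hat) = 0.
assert np.allclose(A.T @ (b - b_hat), 0.0, atol=1e-8)
```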
Statistical interpretation of ERM

Let (X, Y) ∼ P, where P is some distribution on ℝᵈ × ℝ.
Which w has the smallest risk R(w) = E[(Xᵀw − Y)²]?

Necessary condition for w to be a minimizer of R:
∇R(w) = 0, i.e., w is a critical point of R.
This translates to
E[XXᵀ]w = E[Y X],
a system of linear equations called the population normal equations.
It can be proved that every critical point of R is a minimizer of R.

Look familiar? If (X₁, Y₁), . . . , (Xₙ, Yₙ), (X, Y) are iid, then (with A and b the data matrix and label vector built from the sample)
E[(1/n)AᵀA] = E[XXᵀ] and E[(1/n)Aᵀb] = E[Y X],
so ERM can be regarded as a plug-in estimator for a minimizer of R.
74 / 94
14. Risk, empirical risk, and estimating risk

Risk of ERM

IID model: (X₁, Y₁), . . . , (Xₙ, Yₙ), (X, Y) are iid, taking values in ℝᵈ × ℝ.

Let w⋆ be a minimizer of R over all w ∈ ℝᵈ, i.e., w⋆ satisfies the population normal equations
E[XXᵀ]w⋆ = E[Y X].

I If the ERM solution ŵ is not unique (e.g., if n < d), then R(ŵ) can be arbitrarily worse than R(w⋆).
I What about when the ERM solution is unique?

Theorem. Under mild assumptions on the distribution of X,
R(ŵ) − R(w⋆) = O( tr(cov(εW)) / n )
"asymptotically", where W := E[XXᵀ]^{−1/2} X and ε := Y − Xᵀw⋆.
75 / 94
Risk of ERM analysis (rough sketch)

Let εᵢ := Yᵢ − Xᵢᵀw⋆ for each i = 1, . . . , n, so
E[εᵢXᵢ] = E[YᵢXᵢ] − E[XᵢXᵢᵀ]w⋆ = 0
and
√n(ŵ − w⋆) = ( (1/n) ∑ᵢ₌₁ⁿ XᵢXᵢᵀ )⁻¹ · (1/√n) ∑ᵢ₌₁ⁿ εᵢXᵢ.

1. By LLN: (1/n) ∑ᵢ₌₁ⁿ XᵢXᵢᵀ →ᵖ E[XXᵀ].
2. By CLT: (1/√n) ∑ᵢ₌₁ⁿ εᵢXᵢ →ᵈ cov(εX)^{1/2} Z, where Z ∼ N(0, I).

Therefore, the asymptotic distribution of √n(ŵ − w⋆) is
√n(ŵ − w⋆) →ᵈ E[XXᵀ]⁻¹ cov(εX)^{1/2} Z.

A few more steps give
n( E[(Xᵀŵ − Y)²] − E[(Xᵀw⋆ − Y)²] ) →ᵈ ‖E[XXᵀ]^{−1/2} cov(εX)^{1/2} Z‖₂².

The random variable on the RHS is "concentrated" around its mean tr(cov(εW)).
76 / 94
Risk of ERM: postscript

I The analysis does not assume that the linear regression model is "correct"; the data distribution need not come from the normal linear regression model.
I The only assumptions are those needed for the LLN and CLT to hold.
I However, if the normal linear regression model holds, i.e.,
Y | X = x ∼ N(xᵀw⋆, σ²),
then the bound from the theorem becomes
R(ŵ) − R(w⋆) = O( σ²d / n ),
which is familiar to those who have taken introductory statistics.
I With more work, one can also prove a non-asymptotic risk bound of similar form.
I In homework/reading, we look at a simpler setting for studying ERM for linear regression, called "fixed design".
77 / 94
Risk vs empirical risk

Let ŵ be the ERM solution.
1. Empirical risk of ERM: R̂(ŵ)
2. True risk of ERM: R(ŵ)

Theorem. E[R̂(ŵ)] ≤ E[R(ŵ)].
(Empirical risk can sometimes be larger than true risk, but not on average.)

Overfitting: empirical risk is "small", but true risk is "much higher".
78 / 94
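The theorem above is easy to see empirically: average the ERM solution's training risk and (an estimate of) its true risk over many independent draws of a small training set. A Monte Carlo sketch; the distribution and sample sizes are arbitrary choices for illustration:

```python
import numpy as np

# On average, the empirical risk of ERM is no larger than its true risk.
# Here Y = X^T w_star + noise; true risk is estimated with a fresh sample.
rng = np.random.default_rng(3)
d, n, trials = 3, 10, 500
w_star = np.ones(d)

emp_risks, true_risks = [], []
for _ in range(trials):
    X = rng.normal(size=(n, d))
    y = X @ w_star + rng.normal(size=n)
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    emp_risks.append(np.mean((X @ w_hat - y) ** 2))
    # Estimate the true risk with a large independent sample.
    Xt = rng.normal(size=(2000, d))
    yt = Xt @ w_star + rng.normal(size=2000)
    true_risks.append(np.mean((Xt @ w_hat - yt) ** 2))

assert np.mean(emp_risks) <= np.mean(true_risks)
```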
Overfitting example

(X₁, Y₁), . . . , (Xₙ, Yₙ), (X, Y) are iid; X is a continuous random variable in ℝ.

Suppose we use the degree-k polynomial expansion
φ(x) = (1, x, x², . . . , xᵏ), x ∈ ℝ,
so the dimension is d = k + 1.

Fact: Any function on ≤ k + 1 points can be interpolated by a polynomial of degree at most k.

[Figure: a degree-k polynomial curve passing exactly through every training point.]

Conclusion: If n ≤ k + 1 = d, the ERM solution ŵ with this feature expansion has R̂(ŵ) = 0 always, regardless of its true risk (which can be ≫ 0).
79 / 94
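The conclusion can be demonstrated in a few lines: fit a degree-(n−1) polynomial to n points of pure noise and observe zero training error even though there is nothing to learn. A minimal sketch with synthetic data:

```python
import numpy as np

# With degree k = n - 1 and n points, ERM interpolates the data exactly:
# empirical risk is 0, regardless of the (possibly large) true risk.
rng = np.random.default_rng(4)
n = 5
k = n - 1                    # degree n - 1 suffices to interpolate n points
x = rng.uniform(size=n)
y = rng.normal(size=n)       # pure noise: there is nothing real to learn

coeffs = np.polyfit(x, y, deg=k)       # least squares = interpolation here
y_hat = np.polyval(coeffs, x)
train_risk = np.mean((y_hat - y) ** 2)

assert train_risk < 1e-8
```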
Estimating risk

IID model: (X₁, Y₁), . . . , (Xₙ, Yₙ), (X′₁, Y′₁), . . . , (X′ₘ, Y′ₘ) ∼iid P.

I Training data (X₁, Y₁), . . . , (Xₙ, Yₙ) is used to learn f̂.
I Test data (X′₁, Y′₁), . . . , (X′ₘ, Y′ₘ) is used to estimate risk, via the test risk
R̂test(f̂) := (1/m) ∑ᵢ₌₁ᵐ (f̂(X′ᵢ) − Y′ᵢ)².
I Training data is independent of test data, so f̂ is independent of test data.
I Let Lᵢ := (f̂(X′ᵢ) − Y′ᵢ)² for each i = 1, . . . , m, so
E[R̂test(f̂) | f̂] = (1/m) ∑ᵢ₌₁ᵐ E[Lᵢ | f̂] = R(f̂).
I Moreover, L₁, . . . , Lₘ are conditionally iid given f̂, and hence by the Law of Large Numbers,
R̂test(f̂) →ᵖ R(f̂) as m → ∞.
I By the CLT, the rate of convergence is m^{−1/2}.
80 / 94
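Held-out risk estimation looks like this in code. A sketch with synthetic Gaussian data, where the true risk of a linear predictor has a closed form (‖ŵ − w⋆‖² plus the noise variance) that the test risk should approach as m grows:

```python
import numpy as np

# Fit f_hat on training data; estimate its risk with average squared loss
# on an independent test set of size m.
rng = np.random.default_rng(5)
d, n, m = 3, 200, 100000
w_star = np.array([1.0, -2.0, 0.5])

X = rng.normal(size=(n, d))
y = X @ w_star + rng.normal(size=n)
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # learn f_hat on training data

X_test = rng.normal(size=(m, d))
y_test = X_test @ w_star + rng.normal(size=m)
risk_test = np.mean((X_test @ w_hat - y_test) ** 2)

# For X ~ N(0, I) and unit noise variance, the exact risk is:
risk_true = np.sum((w_hat - w_star) ** 2) + 1.0
assert abs(risk_test - risk_true) < 0.05        # close at rate m^{-1/2}
```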
Rates for risk minimization vs. rates for risk estimation

One may think that ERM "works" because, somehow, training risk is a good "plug-in" estimate of true risk.
I Sometimes this is partially true; we'll revisit this when we discuss generalization theory.
Roughly speaking, under some assumptions, one can expect that
|R̂(w) − R(w)| ≤ O(√(d/n)) for all w ∈ ℝᵈ.

However . . .
I By the CLT, we know the following holds for a fixed w:
R̂(w) →ᵖ R(w) at the n^{−1/2} rate.
(Here, we ignore the dependence on d.)
I Yet, for the ERM solution ŵ,
R(ŵ) →ᵖ R(w⋆) at the n^{−1} rate.
(Also ignoring dependence on d.)

Implication: Selecting a good predictor can be "easier" than estimating how good predictors are!
81 / 94
Old Faithful example

I Linear regression model + affine expansion on "duration of last eruption".
I Learn w = (35.0929, 10.3258) from 136 past observations.
I Mean squared loss of w on the next 136 observations is 35.9404.
(Recall: mean squared loss of µ = 70.7941 was 187.1894.)

[Figure: time until next eruption vs. duration of last eruption, with the linear model and the constant prediction overlaid.]

(Unfortunately, √35.9 ≈ 6.0 > mean duration ≈ 3.5.)
82 / 94
15. Regularization

Inductive bias

Suppose the ERM solution is not unique. What should we do?
One possible answer: Pick the w of shortest length.
I Fact: The shortest solution w to (AᵀA)w = Aᵀb is always unique.
I Fact: The pseudoinverse solution A⁺b is this least-norm solution.
Why should this be a good idea?
I The data does not give a reason to choose a shorter w over a longer w.
I The preference for shorter w is an inductive bias: it will work well for some problems (e.g., when the "true" w⋆ is short), and not for others.
All learning algorithms encode some kind of inductive bias.
83 / 94
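The two facts above can be checked in NumPy: in an underdetermined problem, the pseudoinverse picks out the least-norm solution, and adding any null-space direction gives another solution with the same fit but larger norm. A minimal sketch with synthetic data:

```python
import numpy as np

# Underdetermined least squares (n < d): many exact solutions exist;
# A^+ b is the unique one of minimum norm.
rng = np.random.default_rng(6)
n, d = 5, 10                 # fewer equations than unknowns
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

w_min = np.linalg.pinv(A) @ b          # least-norm solution A^+ b

# Build another exact solution by adding a null-space direction of A.
e0 = np.eye(d)[:, 0]
null_dir = e0 - np.linalg.pinv(A) @ (A @ e0)   # component of e0 in null(A)
w_other = w_min + null_dir

assert np.allclose(A @ w_other, A @ w_min, atol=1e-8)        # same fit
assert np.linalg.norm(w_min) <= np.linalg.norm(w_other)      # but shorter
```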
Example

ERM with the scaled trigonometric feature expansion:
φ(x) = (1, sin(x), cos(x), (1/2) sin(2x), (1/2) cos(2x), (1/3) sin(3x), (1/3) cos(3x), . . . ).

[Figures: the training data; some arbitrary ERM fit; the least ℓ₂-norm ERM fit.]

It is not a given that the least norm ERM is better than the other ERM!
84 / 94
Regularized ERM
Combine the two concerns: for a given λ ≥ 0, find the minimizer of
R(w) + λ‖w‖₂²
over w ∈ Rᵈ.
Fact: if λ > 0, then the solution is always unique (even if n < d)!
I This is called ridge regression.
(λ = 0 is ERM / Ordinary Least Squares.)
I The parameter λ controls how much attention is paid to the regularizer ‖w‖₂² relative to the data-fitting term R(w).
I Choose λ using cross-validation.
85 / 94
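Assuming the averaged empirical risk R(w) = (1/n)‖Xw − y‖₂² (consistent with the 1/√n scaling used in the augmented-data interpretation), setting the gradient of R(w) + λ‖w‖₂² to zero gives (XᵀX + nλI)w = Xᵀy. A minimal NumPy sketch with made-up data, deliberately taking n < d:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 3, 5, 0.1          # fewer examples than dimensions
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Ridge solution: (X^T X + n*lam*I) w = X^T y.
# X^T X alone is singular (rank <= n < d), but adding n*lam*I makes it
# positive definite, so the solution is unique.
M = X.T @ X + n * lam * np.eye(d)
w = np.linalg.solve(M, X.T @ y)

# The gradient of (1/n)||Xw - y||^2 + lam*||w||^2 vanishes at the solution.
grad = (2.0 / n) * X.T @ (X @ w - y) + 2.0 * lam * w
assert np.allclose(grad, 0.0)

# Every eigenvalue of M is at least n*lam > 0: uniqueness even when n < d.
assert np.linalg.eigvalsh(M).min() >= n * lam - 1e-9
```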
Another interpretation of ridge regression
Define the (n + d) × d matrix A and (n + d) × 1 column vector b by
A := (1/√n) · [ x₁ᵀ ; . . . ; xₙᵀ ; √(nλ) I_d ]   (rows stacked),
b := (1/√n) · (y₁, . . . , yₙ, 0, . . . , 0)ᵀ.
Then
‖Aw − b‖₂² = R(w) + λ‖w‖₂².
Interpretation:
I d "fake" data points ensure that the augmented data matrix A has rank d.
I The squared length of each "fake" feature vector is nλ.
All corresponding labels are 0.
I The prediction of w on the i-th fake feature vector is √(nλ) wᵢ.
86 / 94
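The identity ‖Aw − b‖₂² = R(w) + λ‖w‖₂² means that ordinary least squares on the augmented pair (A, b) solves ridge regression. A small NumPy check with made-up data:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 20, 4, 0.5
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Augmented system: d "fake" rows sqrt(n*lam)*e_i with label 0,
# everything scaled by 1/sqrt(n).
A = np.vstack([X, np.sqrt(n * lam) * np.eye(d)]) / np.sqrt(n)
b = np.concatenate([y, np.zeros(d)]) / np.sqrt(n)

# OLS on the augmented system ...
w_aug = np.linalg.lstsq(A, b, rcond=None)[0]
# ... matches the ridge closed form (X^T X + n*lam*I)^{-1} X^T y.
w_ridge = np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

assert np.allclose(w_aug, w_ridge)
```

Since the fake rows give A full column rank, lstsq returns the unique minimizer, so the two computations must agree.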
Regularization with a different norm
Lasso: for a given λ ≥ 0, find the minimizer of
R(w) + λ‖w‖₁
over w ∈ Rᵈ. Here, ‖v‖₁ = ∑ᵢ₌₁ᵈ |vᵢ| is the ℓ1-norm.
I Prefers shorter w, but using a different notion of length than ridge.
I Tends to produce w that are sparse (i.e., have few non-zero coordinates), or at least well-approximated by sparse vectors.
Fact: vectors with small ℓ1-norm are well-approximated by sparse vectors.
If w̃ keeps just the 1/ε² largest coefficients (by magnitude) of w, zeroing the rest, then
‖w − w̃‖₂ ≤ ε‖w‖₁.
87 / 94
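Lasso has no closed form, but one simple way to see the sparsity it induces is proximal gradient descent (ISTA): each gradient step on R(w) is followed by soft-thresholding, which produces exact zeros. This is only a sketch, not the course's prescribed solver; the step size and data below are made up:

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t*||.||_1: shrink toward 0, exact zeros on [-t, t]."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

rng = np.random.default_rng(0)
n, d, lam = 50, 10, 0.5
X = rng.standard_normal((n, d))
w_true = np.zeros(d)
w_true[[0, 3, 7]] = [3.0, -2.0, 1.5]       # sparse ground truth
y = X @ w_true

# ISTA on (1/n)||Xw - y||^2 + lam*||w||_1.
eta = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2 / n)  # 1/L for the smooth part
w = np.zeros(d)
for _ in range(2000):
    grad = (2.0 / n) * X.T @ (X @ w - y)
    w = soft_threshold(w - eta * grad, eta * lam)

# Unlike ridge, the iterate has exact zero coordinates.
assert np.count_nonzero(w) < d
```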
Sparse approximations
Claim: if w̃ keeps just the T largest coefficients (by magnitude) of w, zeroing the rest, then
‖w − w̃‖₂ ≤ ‖w‖₁ / √(T + 1).
WLOG |w₁| ≥ |w₂| ≥ · · · , so w̃ = (w₁, . . . , w_T, 0, . . . , 0). Then
‖w − w̃‖₂² = ∑_{i ≥ T+1} wᵢ²
≤ ∑_{i ≥ T+1} |wᵢ| · |w_{T+1}|
≤ ‖w‖₁ · |w_{T+1}|
≤ ‖w‖₁ · ‖w‖₁ / (T + 1),
where the last step holds because the T + 1 largest coefficients each have magnitude at least |w_{T+1}|, so (T + 1)|w_{T+1}| ≤ ‖w‖₁.
This is a consequence of the "mismatch" between the ℓ1- and ℓ2-norms. One can get similar results for other ℓp norms.
88 / 94
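The claim is easy to sanity-check numerically; first on a small made-up vector, then on random ones:

```python
import numpy as np

w = np.array([4.0, -3.0, 2.0, 1.0, 0.5])   # already sorted by magnitude
T = 2

# Keep the T largest-magnitude coefficients, zero the rest.
w_tilde = w.copy()
w_tilde[T:] = 0.0                           # -> [4, -3, 0, 0, 0]

err = np.linalg.norm(w - w_tilde)           # sqrt(2^2 + 1^2 + 0.5^2)
bound = np.sum(np.abs(w)) / np.sqrt(T + 1)  # ||w||_1 / sqrt(T+1)
assert err <= bound

# The bound holds for every vector and every T.
rng = np.random.default_rng(0)
for _ in range(100):
    v = rng.standard_normal(20)
    order = np.argsort(-np.abs(v))          # indices, largest magnitude first
    for t in range(1, 20):
        vt = np.zeros_like(v)
        vt[order[:t]] = v[order[:t]]
        assert np.linalg.norm(v - vt) <= np.sum(np.abs(v)) / np.sqrt(t + 1)
```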
Example: Coefficient profile (ℓ2 vs. ℓ1)
Y = levels of prostate cancer antigen; X = clinical measurements.
Horizontal axis: varying λ (large λ to the left, small λ to the right).
Vertical axis: coefficient value in ℓ2-regularized ERM and ℓ1-regularized ERM, for eight different variables.
89 / 94
Other approaches to sparse regression
I Subset selection:
Find the w that minimizes empirical risk among all vectors with at most k non-zero entries.
Unfortunately, this seems to require time exponential in k.
I Greedy algorithms:
Repeatedly choose a new variable to "include" in the support of w until k variables are included.
Forward stepwise regression / orthogonal matching pursuit.
Often works as well as ℓ1-regularized ERM.
Why do we care about sparsity?
90 / 94
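A minimal sketch of the greedy idea (orthogonal matching pursuit): repeatedly add the column most correlated with the current residual, then refit least squares on the selected columns. It ignores edge cases such as re-selecting a column, and the tiny orthonormal design below is a contrived example where the support is recovered exactly:

```python
import numpy as np

def omp(X, y, k):
    """Greedy forward selection: grow a support of size k, refitting each step."""
    support = []
    w = np.zeros(X.shape[1])
    residual = y.copy()
    for _ in range(k):
        # Column most correlated with what is still unexplained.
        j = int(np.argmax(np.abs(X.T @ residual)))
        support.append(j)
        # Refit least squares on the selected columns only.
        coef, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        w = np.zeros(X.shape[1])
        w[support] = coef
        residual = y - X @ w
    return w, support

X = np.eye(4)                       # orthonormal columns: picks are unambiguous
y = np.array([3.0, 0.0, -2.0, 0.0])
w, support = omp(X, y, k=2)

assert sorted(support) == [0, 2]    # true support recovered
assert np.allclose(X @ w, y)        # noiseless data fit exactly
```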
Key takeaways
1. IID model for supervised learning.
2. Optimal predictors, linear regression models, and optimal linear predictors.
3. Empirical risk minimization for linear predictors.
4. Risk of ERM; training risk vs. test risk; risk minimization vs. risk estimation.
5. Inductive bias, `1- and `2-regularization, sparsity.
Make sure you do the assigned reading, especially from the handouts!
91 / 94
misc
svd
pytorch/numpy; gpu; gpu errors. maybe even sgd. they'll use it in homework.
talk about regression and classification somewhere early on. can mention how to do it for dt and knn too i guess, though it's a little gross in this lecture?
before MLE slide, give a quick one-slide refresher/primer on MLE.
ridge and soln existence. for homework maybe prove λ → 0 gives svd?
daniel's 1/n. talk about loss functions
look at my old lec
svd topics: not unique; pseudoinverse equals inverse when inverse exists; pseudoinverse always unique(?) or at least when inverse exists? talk about things it satisfies like XX⁺X = X etc; "meaning" of the U, V matrices in svd; introduce svd via eigendecomposition
92 / 94
misc
logistic regression: optimize w ↦ (1/n) ∑ᵢ₌₁ⁿ ln(1 + exp(−yᵢ wᵀxᵢ)).
SVD solution for OLS:
- write ‖Xw − y‖₂².
- normal equations (differentiate and set to zero): XᵀXw = Xᵀy.
- writing X = USVᵀ, have VS²Vᵀw = VSUᵀy.
- thus the pseudoinverse solution X⁺y = VS⁺Uᵀy satisfies the normal equations.
for homework maybe also suggest experimenting with ridge regression (adding λ‖w‖²/2).
for the pytorch solver, can have them manually do the gradient, and also use pytorch's .backward; see the sample code for lecture 1 (in the repository, not in the slides).
features: replace xᵢ with φ(xᵢ) where φ is some function. E.g., φ(x) = (1, x₁, . . . , x_d, x₁x₁, x₁x₂, . . . , x₁x_d, . . . , x_d x_d) means wᵀφ(x) is a quadratic (and now we can search over all possible quadratics with our optimization).
93 / 94
16. Summary of linear regression so far
Main points
I Model/function/predictor class of linear regressors x ↦ wᵀx.
I ERM principle: we choose a loss (least squares) and find a good predictor by minimizing empirical risk.
I ERM solution for least squares: pick w satisfying AᵀAw = Aᵀb, which is not necessarily unique; one canonical choice is the ordinary least squares solution A⁺b.
I We also discussed feature expansion; affine and polynomial expansions are good to keep in mind!
94 / 94