ECE595 / STAT598: Machine Learning I
Lecture 01: Linear Regression
Spring 2020

Stanley Chan
School of Electrical and Computer Engineering, Purdue University

© Stanley Chan 2020. All Rights Reserved.
Outline

Mathematical Background
  Lecture 1: Linear Regression: A basic data analytic tool
  Lecture 2: Regularization: Constraining the solution
  Lecture 3: Kernel Method: Enabling nonlinearity

Lecture 1: Linear Regression
  Linear Regression
    Notation
    Loss Function
    Solving the Regression Problem
  Geometry
    Projection
    Minimum-Norm Solution
    Pseudo-Inverse
Basic Notation

Scalars: $a, b, c \in \mathbb{R}$. Vectors: $\mathbf{a}, \mathbf{b}, \mathbf{c} \in \mathbb{R}^d$.
Matrices: $A, B, C \in \mathbb{R}^{N \times d}$; entries are $a_{ij}$ or $[A]_{ij}$.

Rows and columns:
$$A = \begin{bmatrix} \mathbf{a}_1 & \mathbf{a}_2 & \cdots & \mathbf{a}_d \end{bmatrix}, \quad \text{and} \quad A = \begin{bmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_N^T \end{bmatrix}.$$
$\{\mathbf{a}_j\}$: the $j$-th feature (a column). $\{\mathbf{x}_n\}$: the $n$-th sample (a row).

Identity matrix $I$; all-one vector $\mathbf{1}$ and all-zero vector $\mathbf{0}$; standard basis vectors $\mathbf{e}_i$.
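As a concrete anchor for this notation, here is a minimal NumPy sketch; the sizes $N$ and $d$ and the array contents are arbitrary choices for illustration:

```python
import numpy as np

N, d = 5, 3                                      # N samples, d features (arbitrary)
A = np.arange(N * d, dtype=float).reshape(N, d)  # A in R^{N x d}

x_n = A[1, :]        # the n-th sample x_n: a row of A (here n = 2)
a_j = A[:, 0]        # the j-th feature a_j: a column of A (here j = 1)

I = np.eye(d)           # identity matrix
ones = np.ones(d)       # all-one vector
e_1 = np.eye(d)[:, 0]   # standard basis vector e_1
```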
Line Fitting

[Figure: a scatter plot of data points, with x ranging from 0 to 1 and y from 0.8 to 1.7, illustrating fitting a line to data.]
Linear Regression

The problem of regression can be summarized as:
- Given measurements: $y_n$, where $n = 1, \ldots, N$
- Given inputs: $\mathbf{x}_n$
- Given a model: $g_{\boldsymbol{\theta}}(\mathbf{x}_n)$, parameterized by $\boldsymbol{\theta}$
- Determine $\boldsymbol{\theta}$ such that $y_n \approx g_{\boldsymbol{\theta}}(\mathbf{x}_n)$.

Linear regression is one type of regression:
- Restrict $g_{\boldsymbol{\theta}}(\cdot)$ to a line: $g_{\boldsymbol{\theta}}(\mathbf{x}) = \mathbf{x}^T \boldsymbol{\theta}$
- The inputs $\mathbf{x}$ and the parameters $\boldsymbol{\theta}$ are
  $$\mathbf{x} = [x_1, \ldots, x_d]^T \quad \text{and} \quad \boldsymbol{\theta} = [\theta_1, \ldots, \theta_d]^T$$
- This is equivalent to
  $$g_{\boldsymbol{\theta}}(\mathbf{x}) = \mathbf{x}^T \boldsymbol{\theta} = \sum_{j=1}^{d} x_j \theta_j.$$
Solving the Regression Problem

The (general) regression problem can be solved via the following logic:
- Define a (squared-error) loss function
  $$J(\boldsymbol{\theta}) = \sum_{n=1}^{N} L\big(g_{\boldsymbol{\theta}}(\mathbf{x}_n), y_n\big) = \sum_{n=1}^{N} \big(g_{\boldsymbol{\theta}}(\mathbf{x}_n) - y_n\big)^2, \quad \text{e.g., } L(\clubsuit, \spadesuit) \stackrel{\text{def}}{=} (\clubsuit - \spadesuit)^2.$$
- Other loss functions can be used, e.g., $L(\clubsuit, \spadesuit) = |\clubsuit - \spadesuit|$.
- The goal is to solve the optimization problem
  $$\widehat{\boldsymbol{\theta}} = \operatorname*{argmin}_{\boldsymbol{\theta}} \ J(\boldsymbol{\theta}).$$
- The prediction for a new input $\mathbf{x}_{\text{new}}$ is $\widehat{y}_{\text{new}} = g_{\widehat{\boldsymbol{\theta}}}(\mathbf{x}_{\text{new}}) = \widehat{\boldsymbol{\theta}}^T \mathbf{x}_{\text{new}}$.
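As a concrete illustration of this logic, here is a minimal NumPy/SciPy sketch that minimizes $J(\boldsymbol{\theta})$ numerically for both the squared-error and absolute losses; the synthetic data and the choice of scipy.optimize.minimize are assumptions for demonstration, not part of the lecture:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                    # rows are the inputs x_n
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

def J(theta, loss):
    """Sum of per-sample losses L(g_theta(x_n), y_n) for the linear model."""
    return np.sum(loss(X @ theta - y))

theta_sq = minimize(J, np.zeros(3), args=(lambda r: r**2,)).x
theta_abs = minimize(J, np.zeros(3), args=(lambda r: np.abs(r),),
                     method="Nelder-Mead").x     # non-smooth loss: derivative-free method
```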
Linear Regression Solution

The linear regression problem is a special case that we can solve analytically.
- Restrict $g_{\boldsymbol{\theta}}(\cdot)$ to a line:
  $$g_{\boldsymbol{\theta}}(\mathbf{x}_n) = \boldsymbol{\theta}^T \mathbf{x}_n.$$
- Then the loss function becomes
  $$J(\boldsymbol{\theta}) = \sum_{n=1}^{N} (\boldsymbol{\theta}^T \mathbf{x}_n - y_n)^2 = \|A\boldsymbol{\theta} - \mathbf{y}\|^2.$$
- The matrix and vectors are defined as
  $$A = \begin{bmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_N^T \end{bmatrix}, \quad \boldsymbol{\theta} = \begin{bmatrix} \theta_1 \\ \vdots \\ \theta_d \end{bmatrix}, \quad \text{and} \quad \mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}.$$
- $\|\cdot\|^2$ denotes the squared $\ell_2$-norm. See the Tutorial on Linear Algebra.
Linear Regression Solution

Theorem. For the linear regression problem
$$\widehat{\boldsymbol{\theta}} = \operatorname*{argmin}_{\boldsymbol{\theta}} \ J(\boldsymbol{\theta}) \stackrel{\text{def}}{=} \|A\boldsymbol{\theta} - \mathbf{y}\|^2,$$
the minimizer is $\widehat{\boldsymbol{\theta}} = (A^T A)^{-1} A^T \mathbf{y}$.

Proof sketch: take the derivative and set it to zero (see the Tutorial on Linear Algebra):
$$\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) = \nabla_{\boldsymbol{\theta}} \big\{ \|A\boldsymbol{\theta} - \mathbf{y}\|^2 \big\} = 2A^T(A\boldsymbol{\theta} - \mathbf{y}) = \mathbf{0}.$$
So the solution is $\widehat{\boldsymbol{\theta}} = (A^T A)^{-1} A^T \mathbf{y}$, assuming $A^T A$ is invertible.
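A minimal NumPy sketch of this closed-form solution, on synthetic data (assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3))         # design matrix: rows are x_n^T
y = A @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

# Normal-equation solution; assumes A^T A is invertible
theta_hat = np.linalg.solve(A.T @ A, A.T @ y)

# Numerically preferred equivalent: least squares via an orthogonal factorization
theta_ls, *_ = np.linalg.lstsq(A, y, rcond=None)
assert np.allclose(theta_hat, theta_ls)
```

In practice, np.linalg.lstsq is preferred over forming $A^T A$ explicitly, since the latter squares the condition number of the problem.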
Examples

Example 1: Second-order polynomial fitting, $y_n = a x_n^2 + b x_n + c$:
$$A = \begin{bmatrix} x_1^2 & x_1 & 1 \\ \vdots & \vdots & \vdots \\ x_N^2 & x_N & 1 \end{bmatrix}, \quad \mathbf{y} = \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix}, \quad \boldsymbol{\theta} = \begin{bmatrix} a \\ b \\ c \end{bmatrix}.$$

Example 2: Auto-regression, $y_n = a y_{n-1} + b y_{n-2}$:
$$A = \begin{bmatrix} y_2 & y_1 \\ y_3 & y_2 \\ \vdots & \vdots \\ y_{N-1} & y_{N-2} \end{bmatrix}, \quad \mathbf{y} = \begin{bmatrix} y_3 \\ y_4 \\ \vdots \\ y_N \end{bmatrix}, \quad \boldsymbol{\theta} = \begin{bmatrix} a \\ b \end{bmatrix}.$$
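Both examples reduce to assembling the design matrix $A$ and calling a least-squares solver. A sketch with toy signals (the data below are assumptions for illustration):

```python
import numpy as np

# Example 1: second-order polynomial fit, y_n = a x_n^2 + b x_n + c
x = np.linspace(0, 1, 50)
y = 2.0 * x**2 - 1.0 * x + 0.5                        # noiseless toy data
A_poly = np.column_stack([x**2, x, np.ones_like(x)])
a, b, c = np.linalg.lstsq(A_poly, y, rcond=None)[0]   # recovers 2.0, -1.0, 0.5

# Example 2: auto-regression, y_n = a y_{n-1} + b y_{n-2}
s = np.sin(0.3 * np.arange(100))                # toy time series y_1, ..., y_N
A_ar = np.column_stack([s[1:-1], s[:-2]])       # rows are [y_{n-1}, y_{n-2}]
y_ar = s[2:]                                    # targets y_3, ..., y_N
theta_ar = np.linalg.lstsq(A_ar, y_ar, rcond=None)[0]
```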
Generalized Linear Regression

[Figure: five stacked plots over 0 ≤ t ≤ 10, showing sinusoidal basis functions and their weighted combination, illustrating the Fourier-series example below.]
Example 1: Fourier series
$$\mathbf{x}_n = \begin{bmatrix} x_{n1} \\ x_{n2} \\ \vdots \\ x_{nd} \end{bmatrix} = \begin{bmatrix} \sin(\omega_0 t_n) \\ \sin(2\omega_0 t_n) \\ \vdots \\ \sin(K\omega_0 t_n) \end{bmatrix}, \qquad y_n = \boldsymbol{\theta}^T \mathbf{x}_n = \sum_{k=1}^{d} \theta_k \sin(k\omega_0 t_n).$$
- $\theta_k$: the $k$-th Fourier coefficient
- $\sin(k\omega_0 t_n)$: the $k$-th Fourier basis function at time $t_n$
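A sketch of this Fourier-series regression; the values $\omega_0 = 2\pi$, $K = 5$, and the toy signal are assumptions for illustration:

```python
import numpy as np

K = 5                         # number of basis functions (assumed)
omega0 = 2 * np.pi            # fundamental frequency (assumed)
t = np.linspace(0, 1, 200)    # sample times t_n over one period

# Column k of A is the k-th Fourier basis sin(k * omega0 * t)
A = np.column_stack([np.sin(k * omega0 * t) for k in range(1, K + 1)])

y = np.sin(omega0 * t) + 0.5 * np.sin(3 * omega0 * t)   # toy signal
theta = np.linalg.lstsq(A, y, rcond=None)[0]
# theta is approximately [1, 0, 0.5, 0, 0]
```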
Linear Span

Given a set of vectors $\{\mathbf{a}_1, \ldots, \mathbf{a}_d\}$, the span is the set of all possible linear combinations of these vectors:
$$\operatorname{span}\{\mathbf{a}_1, \ldots, \mathbf{a}_d\} = \Big\{ \mathbf{z} \ \Big|\ \mathbf{z} = \sum_{j=1}^{d} \alpha_j \mathbf{a}_j \Big\}. \tag{1}$$
Which of the following sets of vectors can span $\mathbb{R}^3$?
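The candidate sets referred to here appear in a figure not reproduced in this transcript. As a rank-based test on illustrative sets (assumed, not from the slide): a set of vectors spans $\mathbb{R}^3$ exactly when the matrix having them as columns has rank 3.

```python
import numpy as np

S1 = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float).T  # standard basis
S2 = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 0]], dtype=float).T  # third = sum of first two
S3 = np.array([[1, 0, 0], [0, 1, 0]], dtype=float).T             # only two vectors

for S in (S1, S2, S3):
    print(np.linalg.matrix_rank(S) == 3)   # True, False, False
```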
Geometry of Linear Regression

Given $\boldsymbol{\theta}$, the product $A\boldsymbol{\theta}$ can be viewed as
$$A\boldsymbol{\theta} = \begin{bmatrix} \mathbf{a}_1 & \mathbf{a}_2 & \cdots & \mathbf{a}_d \end{bmatrix} \begin{bmatrix} \theta_1 \\ \vdots \\ \theta_d \end{bmatrix} = \sum_{j=1}^{d} \theta_j \mathbf{a}_j.$$
So the set of all possible $A\boldsymbol{\theta}$'s is equivalent to $\operatorname{span}\{\mathbf{a}_1, \ldots, \mathbf{a}_d\}$. Define the range of $A$ as $\mathcal{R}(A) = \{\mathbf{y} \mid \mathbf{y} = A\boldsymbol{\theta}\}$. Note that, in general, the measurement $\mathbf{y} \notin \mathcal{R}(A)$.
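One numerical way to test whether a given $\mathbf{y}$ lies in $\mathcal{R}(A)$ is a rank comparison; a sketch with random data (assumed), where a generic $\mathbf{y}$ falls outside the range:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 2))    # R(A) is a 2-D subspace of R^5
y = rng.normal(size=5)

# y is in R(A) iff appending y as a column does not increase the rank
in_range = (np.linalg.matrix_rank(np.column_stack([A, y]))
            == np.linalg.matrix_rank(A))
print(in_range)                                    # False: a random y is not in R(A)
print(np.linalg.matrix_rank(np.column_stack([A, A @ np.ones(2)]))
      == np.linalg.matrix_rank(A))                 # True: A @ theta is always in R(A)
```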
Orthogonality Principle

Consider the error $\mathbf{e} = \mathbf{y} - A\boldsymbol{\theta}$.
- For the error to be minimized, it must be orthogonal to $\mathcal{R}(A)$, which is the span of the columns of $A$.
- This orthogonality principle means that $\mathbf{a}_j^T \mathbf{e} = 0$ for all $j = 1, \ldots, d$, which implies $A^T \mathbf{e} = \mathbf{0}$.
Normal Equation

The orthogonality principle, $A^T \mathbf{e} = \mathbf{0}$, implies that $A^T(\mathbf{y} - A\boldsymbol{\theta}) = \mathbf{0}$ by substituting $\mathbf{e} = \mathbf{y} - A\boldsymbol{\theta}$.
- This is called the normal equation:
  $$A^T A \boldsymbol{\theta} = A^T \mathbf{y}. \tag{2}$$
- The predicted value is
  $$\widehat{\mathbf{y}} = A\widehat{\boldsymbol{\theta}} = A(A^T A)^{-1} A^T \mathbf{y}.$$
- The matrix $P \stackrel{\text{def}}{=} A(A^T A)^{-1} A^T$ is a projection onto $\operatorname{span}\{\mathbf{a}_1, \ldots, \mathbf{a}_d\}$, i.e., the range of $A$.
- $P$ is called the projection matrix. It satisfies $PP = P$.
- The error $\mathbf{e} = \mathbf{y} - \widehat{\mathbf{y}}$ is
  $$\mathbf{e} = \mathbf{y} - A(A^T A)^{-1} A^T \mathbf{y} = (I - P)\mathbf{y}.$$
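A quick numerical check of these claims, assuming a random tall $A$ with full column rank:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(10, 3))           # tall; full column rank almost surely
y = rng.normal(size=10)

P = A @ np.linalg.inv(A.T @ A) @ A.T   # projection onto range(A)
y_hat = P @ y                          # predicted value
e = y - y_hat                          # error (I - P) y

print(np.allclose(P @ P, P))           # idempotent: PP = P
print(np.allclose(A.T @ e, 0))         # orthogonality principle: A^T e = 0
```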
Over-determined and Under-determined Systems

Assume $A$ has full rank (full column rank when tall, full row rank when fat).
- Over-determined $A$ (tall and skinny, $N > d$): $\widehat{\boldsymbol{\theta}} = (A^T A)^{-1} A^T \mathbf{y}$.
- Under-determined $A$ (fat and short, $N < d$): $\widehat{\boldsymbol{\theta}} = A^T (A A^T)^{-1} \mathbf{y}$.
- If $A$ is under-determined, then there exists a non-trivial null space $\mathcal{N}(A) = \{\boldsymbol{\theta} \mid A\boldsymbol{\theta} = \mathbf{0}\}$.
- This implies that if $\widehat{\boldsymbol{\theta}}$ is a solution, then $\widehat{\boldsymbol{\theta}} + \boldsymbol{\theta}_0$ is also a solution as long as $\boldsymbol{\theta}_0 \in \mathcal{N}(A)$. (Why? Because $A(\widehat{\boldsymbol{\theta}} + \boldsymbol{\theta}_0) = A\widehat{\boldsymbol{\theta}} + A\boldsymbol{\theta}_0 = A\widehat{\boldsymbol{\theta}}$; see the sketch after this list.)
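A sketch illustrating the null-space point, assuming a random fat $A$; the trailing rows of $V^T$ from the SVD span $\mathcal{N}(A)$:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 5))                 # fat: full row rank almost surely
y = rng.normal(size=3)
theta = A.T @ np.linalg.solve(A @ A.T, y)   # one particular solution

_, _, Vt = np.linalg.svd(A)                 # rows 3 and 4 of Vt span N(A)
theta0 = Vt[3]
print(np.allclose(A @ theta0, 0))                    # theta0 is in N(A)
print(np.allclose(A @ (theta + theta0), A @ theta))  # theta + theta0 is still a solution
```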
Minimum-Norm Solution

Assume $A$ is fat and has full row rank.
- Since $A$ is fat, there exist infinitely many $\boldsymbol{\theta}$ such that $A\boldsymbol{\theta} = \mathbf{y}$.
- So we need to pick one in order to have a unique answer.
- It turns out that $\widehat{\boldsymbol{\theta}} = A^T (A A^T)^{-1} \mathbf{y}$ is the solution to
  $$\widehat{\boldsymbol{\theta}} = \operatorname*{argmin}_{\boldsymbol{\theta}} \ \|\boldsymbol{\theta}\|^2 \quad \text{subject to} \quad A\boldsymbol{\theta} = \mathbf{y}. \tag{3}$$
- (You can solve this problem using a Lagrange multiplier. See the Appendix.)
- This is called the minimum-norm solution.
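A quick numerical confirmation, assuming a random fat $A$ with full row rank; the minimum-norm formula agrees with NumPy's pseudo-inverse:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 10))        # fat, full row rank almost surely
y = rng.normal(size=3)

theta_mn = A.T @ np.linalg.solve(A @ A.T, y)         # theta = A^T (A A^T)^{-1} y

print(np.allclose(A @ theta_mn, y))                  # feasible: A theta = y
print(np.allclose(theta_mn, np.linalg.pinv(A) @ y))  # equals the pseudo-inverse solution
```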
What if A is Rank-Deficient?

If $A$ is rank-deficient, then $A^T A$ is not invertible.
- Approach 1: Regularization. See Lecture 2.
- Approach 2: Pseudo-inverse. Decompose $A = U S V^T$ (the SVD), where $U \in \mathbb{R}^{N \times N}$ with $U^T U = I$, and $V \in \mathbb{R}^{d \times d}$ with $V^T V = I$. The diagonal block of $S \in \mathbb{R}^{N \times d}$ is $\operatorname{diag}\{s_1, \ldots, s_r, 0, \ldots, 0\}$, where $r$ is the rank of $A$.
  The resulting solution is called the pseudo-inverse solution:
  $$\widehat{\boldsymbol{\theta}} = V S^+ U^T \mathbf{y}, \tag{4}$$
  where $S^+ = \operatorname{diag}\{1/s_1, \ldots, 1/s_r, 0, \ldots, 0\}$.
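A sketch of Equation (4) on a deliberately rank-deficient $A$; the construction of the data is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 2)) @ rng.normal(size=(2, 3))   # 8x3 with rank 2
y = rng.normal(size=8)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
tol = max(A.shape) * np.finfo(float).eps * s[0]          # threshold tiny singular values
s_plus = np.zeros_like(s)
s_plus[s > tol] = 1.0 / s[s > tol]                       # S^+ inverts only s_1, ..., s_r

theta = Vt.T @ (s_plus * (U.T @ y))                      # theta = V S^+ U^T y
print(np.allclose(theta, np.linalg.pinv(A) @ y))         # matches np.linalg.pinv
```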
Reading List

Linear Algebra:
- Gilbert Strang, Linear Algebra and Its Applications, 5th Edition.
- Carl Meyer, Matrix Analysis and Applied Linear Algebra, SIAM, 2000.
- Univ. Waterloo Matrix Cookbook: https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf

Linear Regression:
- Stanford CS 229, Note on Linear Algebra: http://cs229.stanford.edu/section/cs229-linalg.pdf
- Elements of Statistical Learning, Chapter 3.2: https://web.stanford.edu/~hastie/ElemStatLearn/
- Learning from Data, Chapter 3.2: https://work.caltech.edu/telecourse
Appendix
Solving the Minimum-Norm Problem

Consider the problem
$$\widehat{\boldsymbol{\theta}} = \operatorname*{argmin}_{\boldsymbol{\theta}} \ \|\boldsymbol{\theta}\|^2 \quad \text{subject to} \quad A\boldsymbol{\theta} = \mathbf{y}. \tag{5}$$
- The Lagrangian is
  $$\mathcal{L}(\boldsymbol{\theta}, \boldsymbol{\lambda}) = \|\boldsymbol{\theta}\|^2 + \boldsymbol{\lambda}^T (A\boldsymbol{\theta} - \mathbf{y}).$$
- Taking derivatives with respect to $\boldsymbol{\theta}$ and $\boldsymbol{\lambda}$ yields
  $$\nabla_{\boldsymbol{\theta}} \mathcal{L} = 2\boldsymbol{\theta} + A^T \boldsymbol{\lambda} = \mathbf{0}, \qquad \nabla_{\boldsymbol{\lambda}} \mathcal{L} = A\boldsymbol{\theta} - \mathbf{y} = \mathbf{0}.$$
- The first equation gives $\boldsymbol{\theta} = -A^T \boldsymbol{\lambda}/2$.
- Substituting into the second equation yields $\boldsymbol{\lambda} = -2(A A^T)^{-1} \mathbf{y}$.
- Therefore, $\widehat{\boldsymbol{\theta}} = A^T (A A^T)^{-1} \mathbf{y}$. (Check: $A\widehat{\boldsymbol{\theta}} = A A^T (A A^T)^{-1} \mathbf{y} = \mathbf{y}$, so the constraint is satisfied.)