ECE595 / STAT598: Machine Learning I
Lecture 01: Linear Regression
Spring 2020

Stanley Chan
School of Electrical and Computer Engineering, Purdue University

© Stanley Chan 2020. All Rights Reserved.
Outline

Mathematical Background
  Lecture 1: Linear Regression: A basic data analytic tool
  Lecture 2: Regularization: Constraining the solution
  Lecture 3: Kernel Method: Enabling nonlinearity

Lecture 1: Linear Regression
  Linear Regression
    Notation
    Loss Function
    Solving the Regression Problem
  Geometry
    Projection
    Minimum-Norm Solution
    Pseudo-Inverse
Basic Notation

Scalars: $a, b, c \in \mathbb{R}$. Vectors: $\mathbf{a}, \mathbf{b}, \mathbf{c} \in \mathbb{R}^d$.
Matrices: $A, B, C \in \mathbb{R}^{N \times d}$; entries are $a_{ij}$ or $[A]_{ij}$.

Rows and columns:
$$A = \begin{bmatrix} \mathbf{a}_1 & \mathbf{a}_2 & \cdots & \mathbf{a}_d \end{bmatrix}, \quad \text{and} \quad A = \begin{bmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_N^T \end{bmatrix}.$$
$\{\mathbf{a}_j\}$: the $j$-th feature (a column). $\{\mathbf{x}_n\}$: the $n$-th sample (a row).

Identity matrix $I$; all-one vector $\mathbf{1}$ and all-zero vector $\mathbf{0}$; standard basis vectors $\mathbf{e}_i$.
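As a concrete anchor for this notation, here is a minimal NumPy sketch; the sizes $N$ and $d$ and the array contents are arbitrary choices for illustration:

```python
import numpy as np

N, d = 5, 3                                      # N samples, d features (arbitrary)
A = np.arange(N * d, dtype=float).reshape(N, d)  # A in R^{N x d}

x_n = A[1, :]        # the n-th sample x_n: a row of A (here n = 2)
a_j = A[:, 0]        # the j-th feature a_j: a column of A (here j = 1)

I = np.eye(d)           # identity matrix
ones = np.ones(d)       # all-one vector
e_1 = np.eye(d)[:, 0]   # standard basis vector e_1
```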
Line Fitting

[Figure: a scatter plot of data points, with x ranging from 0 to 1 and y from 0.8 to 1.7, illustrating fitting a line to data.]
Linear Regression

The problem of regression can be summarized as:
- Given measurements: $y_n$, where $n = 1, \ldots, N$
- Given inputs: $\mathbf{x}_n$
- Given a model: $g_{\boldsymbol{\theta}}(\mathbf{x}_n)$, parameterized by $\boldsymbol{\theta}$
- Determine $\boldsymbol{\theta}$ such that $y_n \approx g_{\boldsymbol{\theta}}(\mathbf{x}_n)$.

Linear regression is one type of regression:
- Restrict $g_{\boldsymbol{\theta}}(\cdot)$ to a line: $g_{\boldsymbol{\theta}}(\mathbf{x}) = \mathbf{x}^T \boldsymbol{\theta}$
- The inputs $\mathbf{x}$ and the parameters $\boldsymbol{\theta}$ are
  $$\mathbf{x} = [x_1, \ldots, x_d]^T \quad \text{and} \quad \boldsymbol{\theta} = [\theta_1, \ldots, \theta_d]^T$$
- This is equivalent to
  $$g_{\boldsymbol{\theta}}(\mathbf{x}) = \mathbf{x}^T \boldsymbol{\theta} = \sum_{j=1}^{d} x_j \theta_j.$$
Solving the Regression Problem

The (general) regression problem can be solved via the following logic:
- Define a (squared-error) loss function
  $$J(\boldsymbol{\theta}) = \sum_{n=1}^{N} L\big(g_{\boldsymbol{\theta}}(\mathbf{x}_n), y_n\big) = \sum_{n=1}^{N} \big(g_{\boldsymbol{\theta}}(\mathbf{x}_n) - y_n\big)^2, \quad \text{e.g., } L(\clubsuit, \spadesuit) \stackrel{\text{def}}{=} (\clubsuit - \spadesuit)^2.$$
- Other loss functions can be used, e.g., $L(\clubsuit, \spadesuit) = |\clubsuit - \spadesuit|$.
- The goal is to solve the optimization problem
  $$\widehat{\boldsymbol{\theta}} = \operatorname*{argmin}_{\boldsymbol{\theta}} \ J(\boldsymbol{\theta}).$$
- The prediction for a new input $\mathbf{x}_{\text{new}}$ is $\widehat{y}_{\text{new}} = g_{\widehat{\boldsymbol{\theta}}}(\mathbf{x}_{\text{new}}) = \widehat{\boldsymbol{\theta}}^T \mathbf{x}_{\text{new}}$.
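As a concrete illustration of this logic, here is a minimal NumPy/SciPy sketch that minimizes $J(\boldsymbol{\theta})$ numerically for both the squared-error and absolute losses; the synthetic data and the choice of scipy.optimize.minimize are assumptions for demonstration, not part of the lecture:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                    # rows are the inputs x_n
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

def J(theta, loss):
    """Sum of per-sample losses L(g_theta(x_n), y_n) for the linear model."""
    return np.sum(loss(X @ theta - y))

theta_sq = minimize(J, np.zeros(3), args=(lambda r: r**2,)).x
theta_abs = minimize(J, np.zeros(3), args=(lambda r: np.abs(r),),
                     method="Nelder-Mead").x     # non-smooth loss: derivative-free method
```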
Linear Regression Solution

The linear regression problem is a special case that we can solve analytically.
- Restrict $g_{\boldsymbol{\theta}}(\cdot)$ to a line:
  $$g_{\boldsymbol{\theta}}(\mathbf{x}_n) = \boldsymbol{\theta}^T \mathbf{x}_n.$$
- Then the loss function becomes
  $$J(\boldsymbol{\theta}) = \sum_{n=1}^{N} (\boldsymbol{\theta}^T \mathbf{x}_n - y_n)^2 = \|A\boldsymbol{\theta} - \mathbf{y}\|^2.$$
- The matrix and vectors are defined as
  $$A = \begin{bmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_N^T \end{bmatrix}, \quad \boldsymbol{\theta} = \begin{bmatrix} \theta_1 \\ \vdots \\ \theta_d \end{bmatrix}, \quad \text{and} \quad \mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}.$$
- $\|\cdot\|^2$ denotes the squared $\ell_2$-norm. See the Tutorial on Linear Algebra.
Linear Regression Solution

Theorem. For the linear regression problem
$$\widehat{\boldsymbol{\theta}} = \operatorname*{argmin}_{\boldsymbol{\theta}} \ J(\boldsymbol{\theta}) \stackrel{\text{def}}{=} \|A\boldsymbol{\theta} - \mathbf{y}\|^2,$$
the minimizer is $\widehat{\boldsymbol{\theta}} = (A^T A)^{-1} A^T \mathbf{y}$.

Proof sketch: take the derivative and set it to zero (see the Tutorial on Linear Algebra):
$$\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) = \nabla_{\boldsymbol{\theta}} \big\{ \|A\boldsymbol{\theta} - \mathbf{y}\|^2 \big\} = 2A^T(A\boldsymbol{\theta} - \mathbf{y}) = \mathbf{0}.$$
So the solution is $\widehat{\boldsymbol{\theta}} = (A^T A)^{-1} A^T \mathbf{y}$, assuming $A^T A$ is invertible.
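A minimal NumPy sketch of this closed-form solution, on synthetic data (assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3))         # design matrix: rows are x_n^T
y = A @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

# Normal-equation solution; assumes A^T A is invertible
theta_hat = np.linalg.solve(A.T @ A, A.T @ y)

# Numerically preferred equivalent: least squares via an orthogonal factorization
theta_ls, *_ = np.linalg.lstsq(A, y, rcond=None)
assert np.allclose(theta_hat, theta_ls)
```

In practice, np.linalg.lstsq is preferred over forming $A^T A$ explicitly, since the latter squares the condition number of the problem.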
Examples

Example 1: Second-order polynomial fitting, $y_n = a x_n^2 + b x_n + c$:
$$A = \begin{bmatrix} x_1^2 & x_1 & 1 \\ \vdots & \vdots & \vdots \\ x_N^2 & x_N & 1 \end{bmatrix}, \quad \mathbf{y} = \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix}, \quad \boldsymbol{\theta} = \begin{bmatrix} a \\ b \\ c \end{bmatrix}.$$

Example 2: Auto-regression, $y_n = a y_{n-1} + b y_{n-2}$:
$$A = \begin{bmatrix} y_2 & y_1 \\ y_3 & y_2 \\ \vdots & \vdots \\ y_{N-1} & y_{N-2} \end{bmatrix}, \quad \mathbf{y} = \begin{bmatrix} y_3 \\ y_4 \\ \vdots \\ y_N \end{bmatrix}, \quad \boldsymbol{\theta} = \begin{bmatrix} a \\ b \end{bmatrix}.$$
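Both examples reduce to assembling the design matrix $A$ and calling a least-squares solver. A sketch with toy signals (the data below are assumptions for illustration):

```python
import numpy as np

# Example 1: second-order polynomial fit, y_n = a x_n^2 + b x_n + c
x = np.linspace(0, 1, 50)
y = 2.0 * x**2 - 1.0 * x + 0.5                        # noiseless toy data
A_poly = np.column_stack([x**2, x, np.ones_like(x)])
a, b, c = np.linalg.lstsq(A_poly, y, rcond=None)[0]   # recovers 2.0, -1.0, 0.5

# Example 2: auto-regression, y_n = a y_{n-1} + b y_{n-2}
s = np.sin(0.3 * np.arange(100))                # toy time series y_1, ..., y_N
A_ar = np.column_stack([s[1:-1], s[:-2]])       # rows are [y_{n-1}, y_{n-2}]
y_ar = s[2:]                                    # targets y_3, ..., y_N
theta_ar = np.linalg.lstsq(A_ar, y_ar, rcond=None)[0]
```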
Generalized Linear Regression

[Figure: five stacked plots over 0 ≤ t ≤ 10, showing sinusoidal basis functions and their weighted combination, illustrating the Fourier-series example below.]
Example 1: Fourier series
$$\mathbf{x}_n = \begin{bmatrix} x_{n1} \\ x_{n2} \\ \vdots \\ x_{nd} \end{bmatrix} = \begin{bmatrix} \sin(\omega_0 t_n) \\ \sin(2\omega_0 t_n) \\ \vdots \\ \sin(K\omega_0 t_n) \end{bmatrix}, \qquad y_n = \boldsymbol{\theta}^T \mathbf{x}_n = \sum_{k=1}^{d} \theta_k \sin(k\omega_0 t_n).$$
- $\theta_k$: the $k$-th Fourier coefficient
- $\sin(k\omega_0 t_n)$: the $k$-th Fourier basis function at time $t_n$
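A sketch of this Fourier-series regression; the values $\omega_0 = 2\pi$, $K = 5$, and the toy signal are assumptions for illustration:

```python
import numpy as np

K = 5                         # number of basis functions (assumed)
omega0 = 2 * np.pi            # fundamental frequency (assumed)
t = np.linspace(0, 1, 200)    # sample times t_n over one period

# Column k of A is the k-th Fourier basis sin(k * omega0 * t)
A = np.column_stack([np.sin(k * omega0 * t) for k in range(1, K + 1)])

y = np.sin(omega0 * t) + 0.5 * np.sin(3 * omega0 * t)   # toy signal
theta = np.linalg.lstsq(A, y, rcond=None)[0]
# theta is approximately [1, 0, 0.5, 0, 0]
```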
Linear Span

Given a set of vectors $\{\mathbf{a}_1, \ldots, \mathbf{a}_d\}$, the span is the set of all possible linear combinations of these vectors:
$$\operatorname{span}\{\mathbf{a}_1, \ldots, \mathbf{a}_d\} = \Big\{ \mathbf{z} \ \Big|\ \mathbf{z} = \sum_{j=1}^{d} \alpha_j \mathbf{a}_j \Big\}. \tag{1}$$
Which of the following sets of vectors can span $\mathbb{R}^3$?
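The candidate sets referred to here appear in a figure not reproduced in this transcript. As a rank-based test on illustrative sets (assumed, not from the slide): a set of vectors spans $\mathbb{R}^3$ exactly when the matrix having them as columns has rank 3.

```python
import numpy as np

S1 = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float).T  # standard basis
S2 = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 0]], dtype=float).T  # third = sum of first two
S3 = np.array([[1, 0, 0], [0, 1, 0]], dtype=float).T             # only two vectors

for S in (S1, S2, S3):
    print(np.linalg.matrix_rank(S) == 3)   # True, False, False
```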
Geometry of Linear Regression

Given $\boldsymbol{\theta}$, the product $A\boldsymbol{\theta}$ can be viewed as
$$A\boldsymbol{\theta} = \begin{bmatrix} \mathbf{a}_1 & \mathbf{a}_2 & \cdots & \mathbf{a}_d \end{bmatrix} \begin{bmatrix} \theta_1 \\ \vdots \\ \theta_d \end{bmatrix} = \sum_{j=1}^{d} \theta_j \mathbf{a}_j.$$
So the set of all possible $A\boldsymbol{\theta}$'s is equivalent to $\operatorname{span}\{\mathbf{a}_1, \ldots, \mathbf{a}_d\}$. Define the range of $A$ as $\mathcal{R}(A) = \{\mathbf{y} \mid \mathbf{y} = A\boldsymbol{\theta}\}$. Note that, in general, the measurement $\mathbf{y} \notin \mathcal{R}(A)$.
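One numerical way to test whether a given $\mathbf{y}$ lies in $\mathcal{R}(A)$ is a rank comparison; a sketch with random data (assumed), where a generic $\mathbf{y}$ falls outside the range:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 2))    # R(A) is a 2-D subspace of R^5
y = rng.normal(size=5)

# y is in R(A) iff appending y as a column does not increase the rank
in_range = (np.linalg.matrix_rank(np.column_stack([A, y]))
            == np.linalg.matrix_rank(A))
print(in_range)                                    # False: a random y is not in R(A)
print(np.linalg.matrix_rank(np.column_stack([A, A @ np.ones(2)]))
      == np.linalg.matrix_rank(A))                 # True: A @ theta is always in R(A)
```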
Orthogonality Principle

Consider the error $\mathbf{e} = \mathbf{y} - A\boldsymbol{\theta}$.
- For the error to be minimized, it must be orthogonal to $\mathcal{R}(A)$, which is the span of the columns of $A$.
- This orthogonality principle means that $\mathbf{a}_j^T \mathbf{e} = 0$ for all $j = 1, \ldots, d$, which implies $A^T \mathbf{e} = \mathbf{0}$.
Normal Equation

The orthogonality principle, $A^T \mathbf{e} = \mathbf{0}$, implies that $A^T(\mathbf{y} - A\boldsymbol{\theta}) = \mathbf{0}$ by substituting $\mathbf{e} = \mathbf{y} - A\boldsymbol{\theta}$.
- This is called the normal equation:
  $$A^T A \boldsymbol{\theta} = A^T \mathbf{y}. \tag{2}$$
- The predicted value is
  $$\widehat{\mathbf{y}} = A\widehat{\boldsymbol{\theta}} = A(A^T A)^{-1} A^T \mathbf{y}.$$
- The matrix $P \stackrel{\text{def}}{=} A(A^T A)^{-1} A^T$ is a projection onto $\operatorname{span}\{\mathbf{a}_1, \ldots, \mathbf{a}_d\}$, i.e., the range of $A$.
- $P$ is called the projection matrix. It satisfies $PP = P$.
- The error $\mathbf{e} = \mathbf{y} - \widehat{\mathbf{y}}$ is
  $$\mathbf{e} = \mathbf{y} - A(A^T A)^{-1} A^T \mathbf{y} = (I - P)\mathbf{y}.$$
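A quick numerical check of these claims, assuming a random tall $A$ with full column rank:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(10, 3))           # tall; full column rank almost surely
y = rng.normal(size=10)

P = A @ np.linalg.inv(A.T @ A) @ A.T   # projection onto range(A)
y_hat = P @ y                          # predicted value
e = y - y_hat                          # error (I - P) y

print(np.allclose(P @ P, P))           # idempotent: PP = P
print(np.allclose(A.T @ e, 0))         # orthogonality principle: A^T e = 0
```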
Over-determined and Under-determined Systems

Assume $A$ has full rank (full column rank when tall, full row rank when fat).
- Over-determined $A$ (tall and skinny, $N > d$): $\widehat{\boldsymbol{\theta}} = (A^T A)^{-1} A^T \mathbf{y}$.
- Under-determined $A$ (fat and short, $N < d$): $\widehat{\boldsymbol{\theta}} = A^T (A A^T)^{-1} \mathbf{y}$.
- If $A$ is under-determined, then there exists a non-trivial null space $\mathcal{N}(A) = \{\boldsymbol{\theta} \mid A\boldsymbol{\theta} = \mathbf{0}\}$.
- This implies that if $\widehat{\boldsymbol{\theta}}$ is a solution, then $\widehat{\boldsymbol{\theta}} + \boldsymbol{\theta}_0$ is also a solution as long as $\boldsymbol{\theta}_0 \in \mathcal{N}(A)$. (Why? Because $A(\widehat{\boldsymbol{\theta}} + \boldsymbol{\theta}_0) = A\widehat{\boldsymbol{\theta}} + A\boldsymbol{\theta}_0 = A\widehat{\boldsymbol{\theta}}$; see the sketch after this list.)
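A sketch illustrating the null-space point, assuming a random fat $A$; the trailing rows of $V^T$ from the SVD span $\mathcal{N}(A)$:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 5))                 # fat: full row rank almost surely
y = rng.normal(size=3)
theta = A.T @ np.linalg.solve(A @ A.T, y)   # one particular solution

_, _, Vt = np.linalg.svd(A)                 # rows 3 and 4 of Vt span N(A)
theta0 = Vt[3]
print(np.allclose(A @ theta0, 0))                    # theta0 is in N(A)
print(np.allclose(A @ (theta + theta0), A @ theta))  # theta + theta0 is still a solution
```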
Minimum-Norm Solution

Assume $A$ is fat and has full row rank.
- Since $A$ is fat, there exist infinitely many $\boldsymbol{\theta}$ such that $A\boldsymbol{\theta} = \mathbf{y}$.
- So we need to pick one in order to have a unique answer.
- It turns out that $\widehat{\boldsymbol{\theta}} = A^T (A A^T)^{-1} \mathbf{y}$ is the solution to
  $$\widehat{\boldsymbol{\theta}} = \operatorname*{argmin}_{\boldsymbol{\theta}} \ \|\boldsymbol{\theta}\|^2 \quad \text{subject to} \quad A\boldsymbol{\theta} = \mathbf{y}. \tag{3}$$
- (You can solve this problem using a Lagrange multiplier. See the Appendix.)
- This is called the minimum-norm solution.
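A quick numerical confirmation, assuming a random fat $A$ with full row rank; the minimum-norm formula agrees with NumPy's pseudo-inverse:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 10))        # fat, full row rank almost surely
y = rng.normal(size=3)

theta_mn = A.T @ np.linalg.solve(A @ A.T, y)         # theta = A^T (A A^T)^{-1} y

print(np.allclose(A @ theta_mn, y))                  # feasible: A theta = y
print(np.allclose(theta_mn, np.linalg.pinv(A) @ y))  # equals the pseudo-inverse solution
```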
What if A is Rank-Deficient?

If $A$ is rank-deficient, then $A^T A$ is not invertible.
- Approach 1: Regularization. See Lecture 2.
- Approach 2: Pseudo-inverse. Decompose $A = U S V^T$ (the SVD), where $U \in \mathbb{R}^{N \times N}$ with $U^T U = I$, and $V \in \mathbb{R}^{d \times d}$ with $V^T V = I$. The diagonal block of $S \in \mathbb{R}^{N \times d}$ is $\operatorname{diag}\{s_1, \ldots, s_r, 0, \ldots, 0\}$, where $r$ is the rank of $A$.
  The resulting solution is called the pseudo-inverse solution:
  $$\widehat{\boldsymbol{\theta}} = V S^+ U^T \mathbf{y}, \tag{4}$$
  where $S^+ = \operatorname{diag}\{1/s_1, \ldots, 1/s_r, 0, \ldots, 0\}$.
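A sketch of Equation (4) on a deliberately rank-deficient $A$; the construction of the data is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 2)) @ rng.normal(size=(2, 3))   # 8x3 with rank 2
y = rng.normal(size=8)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
tol = max(A.shape) * np.finfo(float).eps * s[0]          # threshold tiny singular values
s_plus = np.zeros_like(s)
s_plus[s > tol] = 1.0 / s[s > tol]                       # S^+ inverts only s_1, ..., s_r

theta = Vt.T @ (s_plus * (U.T @ y))                      # theta = V S^+ U^T y
print(np.allclose(theta, np.linalg.pinv(A) @ y))         # matches np.linalg.pinv
```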
Reading List

Linear Algebra:
- Gilbert Strang, Linear Algebra and Its Applications, 5th Edition.
- Carl Meyer, Matrix Analysis and Applied Linear Algebra, SIAM, 2000.
- Univ. Waterloo Matrix Cookbook: https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf

Linear Regression:
- Stanford CS 229, Note on Linear Algebra: http://cs229.stanford.edu/section/cs229-linalg.pdf
- Elements of Statistical Learning, Chapter 3.2: https://web.stanford.edu/~hastie/ElemStatLearn/
- Learning from Data, Chapter 3.2: https://work.caltech.edu/telecourse
Appendix
Solving the Minimum-Norm Problem

Consider the problem
$$\widehat{\boldsymbol{\theta}} = \operatorname*{argmin}_{\boldsymbol{\theta}} \ \|\boldsymbol{\theta}\|^2 \quad \text{subject to} \quad A\boldsymbol{\theta} = \mathbf{y}. \tag{5}$$
- The Lagrangian is
  $$\mathcal{L}(\boldsymbol{\theta}, \boldsymbol{\lambda}) = \|\boldsymbol{\theta}\|^2 + \boldsymbol{\lambda}^T (A\boldsymbol{\theta} - \mathbf{y}).$$
- Taking derivatives with respect to $\boldsymbol{\theta}$ and $\boldsymbol{\lambda}$ yields
  $$\nabla_{\boldsymbol{\theta}} \mathcal{L} = 2\boldsymbol{\theta} + A^T \boldsymbol{\lambda} = \mathbf{0}, \qquad \nabla_{\boldsymbol{\lambda}} \mathcal{L} = A\boldsymbol{\theta} - \mathbf{y} = \mathbf{0}.$$
- The first equation gives $\boldsymbol{\theta} = -A^T \boldsymbol{\lambda}/2$.
- Substituting into the second equation yields $\boldsymbol{\lambda} = -2(A A^T)^{-1} \mathbf{y}$.
- Therefore, $\widehat{\boldsymbol{\theta}} = A^T (A A^T)^{-1} \mathbf{y}$. (Check: $A\widehat{\boldsymbol{\theta}} = A A^T (A A^T)^{-1} \mathbf{y} = \mathbf{y}$, so the constraint is satisfied.)