1 - Linear Regression

Linear Regression – Machine Learning Seminar Series ’11, Nikita Zhiltsov, 11 March 2011

Description

The first report of the Machine Learning Seminar organized by the Computational Linguistics Laboratory at Kazan Federal University. See http://cll.niimm.ksu.ru/cms/lang/en_US/main/seminars/mlseminar

Transcript of 1 - Linear Regression

Page 1: 1 - Linear Regression

Linear Regression – Machine Learning Seminar Series ’11

Nikita Zhiltsov

11 March 2011


Page 2: 1 - Linear Regression

Motivating example: Prices of houses in Portland

Living area (ft²)    #bedrooms    Price ($1000s)
2104                 3            400
1600                 3            330
2400                 3            369
1416                 2            232
3000                 4            540
...


Page 3: 1 - Linear Regression

Motivating example: Plot

How can we predict the prices of other houses as a function of the size of their living areas?


Page 4: 1 - Linear Regression

Terminology and notation

x ∈ X – input variables (“features”)

t ∈ T – a target variable

{x_n}, n = 1, . . . , N – given N observations of input variables

(x_n, t_n) – a training example

(x_1, t_1), . . . , (x_N, t_N) – a training set

Goal: find a function y(x) : X → T (“hypothesis”) to predict the value of t for a new value of x


Page 5: 1 - Linear Regression

Terminology and notation

When the target variable t is continuous

⇒ a regression problem

In the case of discrete values

⇒ a classification problem

Page 6: 1 - Linear Regression

Terminology and notation: Loss function

L(t, y(x)) – loss function (or cost function)

In the case of regression problems, the expected loss is given by:

E[L] = ∫_ℝ ∫_X L(t, y(x)) p(x, t) dx dt

Example (squared loss):

L(t, y(x)) = (1/2) (y(x) − t)²

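To make the expected-loss definition concrete, here is a minimal Python/NumPy sketch that approximates E[L] for the squared loss by a sample average; the joint distribution p(x, t) and the hypothesis y(x) below are invented assumptions for illustration, not part of the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

def squared_loss(t, y):
    """Squared loss from the slide: L(t, y(x)) = (1/2) (y(x) - t)^2."""
    return 0.5 * (y - t) ** 2

# Assumed joint p(x, t): x uniform on [0, 1], t = 2x plus Gaussian noise.
x = rng.uniform(0.0, 1.0, size=10_000)
t = 2.0 * x + rng.normal(scale=0.1, size=x.shape)

# An assumed hypothesis y(x); the sample average of the loss approximates E[L].
y = 2.0 * x
print(squared_loss(t, y).mean())   # roughly (1/2) * 0.1**2 = 0.005 here
```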

Page 7: 1 - Linear Regression

Linear basis function models: Linear regression

y(x, w) = w_0 + w_1 x_1 + · · · + w_D x_D,

where x = (x_1, . . . , x_D)

In our example,

y(x, w) = w_0 + w_1 x_1 + w_2 x_2

where x_1 is the living area and x_2 is the number of bedrooms


Page 8: 1 - Linear Regression

Linear basis function models: Basis functions

Generally,

y(x, w) = Σ_{j=0}^{M−1} w_j φ_j(x) = w^T φ(x)

where φ_j(x) are known as basis functions. Typically, φ_0(x) = 1, so that w_0 acts as a bias. In the simplest case, we use linear basis functions: φ_d(x) = x_d.

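As a small illustration of y(x, w) = w^T φ(x), the following Python/NumPy sketch builds the design matrix Φ for a single input feature with φ_0(x) = 1 and φ_1(x) = x. The living areas come from the motivating example and the weights from the later example-calculation slide; the helper name design_matrix is just for this sketch.

```python
import numpy as np

def design_matrix(X, basis_fns):
    """Phi[n, j] = phi_j(x_n); phi_0(x) = 1 supplies the bias term w_0."""
    return np.column_stack([phi(X) for phi in basis_fns])

# Linear basis for one input feature, plus the constant bias basis function.
basis = [lambda x: np.ones_like(x),   # phi_0(x) = 1
         lambda x: x]                 # phi_1(x) = x

X = np.array([2104.0, 1600.0, 2400.0, 1416.0, 3000.0])  # living areas (ft^2)
w = np.array([71.27, 0.1345])                           # [w_0, w_1] from the later example slide

y = design_matrix(X, basis) @ w   # y(x, w) = w^T phi(x) for every house
print(y)                          # predicted prices in $1000s
```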

Page 9: 1 - Linear Regression

Linear basis function models: Polynomial basis functions

Polynomial basis functions:

φ_j(x) = x^j

These are global; a small change in x affects all basis functions.

[Figure: polynomial basis functions plotted on the interval (−1, 1)]


Page 10: 1 - Linear Regression

Linear basis function models: Gaussian basis functions

Gaussian basis functions:

φ_j(x) = exp( −(x − µ_j)² / (2s²) )

These are local; a small change in x only affects nearby basis functions. µ_j and s control location and scale (width).

[Figure: Gaussian basis functions plotted on the interval (−1, 1)]


Page 11: 1 - Linear Regression

Linear basis function models: Sigmoidal basis functions

Sigmoidal basis functions:

φ_j(x) = σ( (x − µ_j) / s ),

where

σ(a) = 1 / (1 + exp(−a)).

These are also local; a small change in x only affects nearby basis functions. µ_j and s control location and scale (slope).

[Figure: sigmoidal basis functions plotted on the interval (−1, 1)]

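The three families of basis functions above are easy to write down directly. Here is a Python/NumPy sketch with one function per family; the centres µ_j and scales s used in the example calls are arbitrary choices, roughly matching the figures.

```python
import numpy as np

def polynomial_basis(x, j):
    """phi_j(x) = x**j; global: a small change in x affects all basis functions."""
    return x ** j

def gaussian_basis(x, mu, s):
    """phi_j(x) = exp(-(x - mu_j)**2 / (2 s**2)); local, centred at mu_j, width s."""
    return np.exp(-((x - mu) ** 2) / (2.0 * s ** 2))

def sigmoidal_basis(x, mu, s):
    """phi_j(x) = sigma((x - mu_j) / s) with sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-(x - mu) / s))

# Evaluate each family on (-1, 1) with assumed centres and scales.
x = np.linspace(-1.0, 1.0, 5)
print(polynomial_basis(x, 3))
print(gaussian_basis(x, mu=0.0, s=0.2))
print(sigmoidal_basis(x, mu=0.0, s=0.1))
```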

Page 12: 1 - Linear Regression

Probabilistic interpretation

Assume observations from a deterministic function with added Gaussian noise:

t = y(x, w) + ε, where p(ε | β) = N(ε | 0, β^{-1}),

which is the same as saying

p(t | x, w, β) = N(t | y(x, w), β^{-1}).

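A minimal Python/NumPy sketch of this generative assumption; the "true" weights, the noise precision β, and the input distribution below are invented for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

beta_true = 25.0                  # assumed noise precision; noise variance = 1 / beta
w_true = np.array([0.5, -0.3])    # assumed [w_0, w_1] of the deterministic part

N = 50
x = rng.uniform(-1.0, 1.0, size=N)
y_det = w_true[0] + w_true[1] * x                          # y(x, w)
t = y_det + rng.normal(scale=beta_true ** -0.5, size=N)    # t = y(x, w) + eps, eps ~ N(0, 1/beta)
print(t[:5])
```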

Page 13: 1 - Linear Regression

Probabilistic interpretation: Optimal prediction for squared loss

Expected loss:

E[L] = ∫∫ (y(x) − t)² p(x, t) dx dt,

which is minimized by the conditional mean

y(x) = E_t[t | x]

In our case of a Gaussian conditional distribution, it is

E[t | x] = ∫ t p(t | x) dt = y(x, w)


Page 14: 1 - Linear Regression

Probabilistic interpretation: Optimal prediction for squared loss

[Figure: the regression function y(x), with y(x_0) shown as the mean of the conditional distribution p(t | x_0)]


Page 15: 1 - Linear Regression

Maximum likelihood and least squares

Given observed inputs X = {x_1, . . . , x_N} and targets t = [t_1, . . . , t_N]^T, we obtain the likelihood function

p(t | X, w, β) = ∏_{n=1}^{N} N(t_n | w^T φ(x_n), β^{-1}).

Taking the logarithm, we get

ln p(t | w, β) = Σ_{n=1}^{N} ln N(t_n | w^T φ(x_n), β^{-1})
               = (N/2) ln β − (N/2) ln(2π) − β E_D(w),

where

E_D(w) = (1/2) Σ_{n=1}^{N} (t_n − w^T φ(x_n))²

is the sum-of-squares error.

Page 16: 1 - Linear Regression

Maximum likelihood and least squares

Computing the gradient and setting it to zero yields

∇_w ln p(t | w, β) = β Σ_{n=1}^{N} (t_n − w^T φ(x_n)) φ(x_n)^T = 0.

Solving for w, we get

w_ML = (Φ^T Φ)^{-1} Φ^T t,

where (Φ^T Φ)^{-1} Φ^T is the Moore-Penrose pseudo-inverse of Φ and

Φ = [ φ_0(x_1)  φ_1(x_1)  . . .  φ_{M−1}(x_1) ]
    [ φ_0(x_2)  φ_1(x_2)  . . .  φ_{M−1}(x_2) ]
    [    ⋮          ⋮       ⋱         ⋮       ]
    [ φ_0(x_N)  φ_1(x_N)  . . .  φ_{M−1}(x_N) ]

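A small Python/NumPy sketch of this closed-form solution, on synthetic data generated under assumed weights and noise level (not from the slides); np.linalg.lstsq effectively applies the pseudo-inverse and is numerically safer than forming (Φ^T Φ)^{-1} explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed synthetic data: t = 0.5 - 0.3 x + Gaussian noise.
N = 100
x = rng.uniform(-1.0, 1.0, size=N)
t = 0.5 - 0.3 * x + rng.normal(scale=0.2, size=N)

# Design matrix with phi_0(x) = 1 and phi_1(x) = x.
Phi = np.column_stack([np.ones(N), x])

# w_ML = (Phi^T Phi)^{-1} Phi^T t, computed via least squares.
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
print(w_ml)   # close to the assumed true weights [0.5, -0.3]
```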

Page 17: 1 - Linear Regression

Maximum likelihood and least squares: Bias parameter

Rewritten error function:

E_D(w) = (1/2) Σ_{n=1}^{N} (t_n − w_0 − Σ_{j=1}^{M−1} w_j φ_j(x_n))²

Setting the derivative w.r.t. w_0 equal to zero, we obtain

w_0 = t̄ − Σ_{j=1}^{M−1} w_j φ̄_j,

where

t̄ = (1/N) Σ_{n=1}^{N} t_n,    φ̄_j = (1/N) Σ_{n=1}^{N} φ_j(x_n).

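As a quick numerical check of this identity, the Python/NumPy sketch below (using the same kind of assumed synthetic data as the previous sketch) fits a least-squares model and recovers the bias from the target and basis-function means.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy data with a single non-constant basis function phi_1(x) = x.
N = 100
x = rng.uniform(-1.0, 1.0, size=N)
t = 0.5 - 0.3 * x + rng.normal(scale=0.2, size=N)
Phi = np.column_stack([np.ones(N), x])

w = np.linalg.lstsq(Phi, t, rcond=None)[0]

# Bias from the slide's identity: w_0 = t_bar - sum_j w_j * phi_bar_j.
t_bar = t.mean()
phi_bar = Phi[:, 1:].mean(axis=0)        # means of the non-constant basis functions
w0_from_means = t_bar - w[1:] @ phi_bar
print(w[0], w0_from_means)               # the two values coincide
```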

Page 18: 1 - Linear Regression

Maximum likelihood and least squares

ln p(t | w_ML, β) = (N/2) ln β − (N/2) ln(2π) − β E_D(w_ML)

Maximizing the log likelihood function w.r.t. the noise precision parameter β, we obtain

1/β_ML = (1/N) Σ_{n=1}^{N} (t_n − w_ML^T φ(x_n))²

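In code, 1/β_ML is simply the mean squared residual under the ML weights. The sketch below uses assumed synthetic data with true noise standard deviation 0.2, so the estimate should be near 0.04.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed synthetic data with noise standard deviation 0.2 (variance 0.04).
N = 200
x = rng.uniform(-1.0, 1.0, size=N)
t = 0.5 - 0.3 * x + rng.normal(scale=0.2, size=N)
Phi = np.column_stack([np.ones(N), x])

w_ml = np.linalg.lstsq(Phi, t, rcond=None)[0]

# 1 / beta_ML = mean squared residual under w_ML.
inv_beta_ml = np.mean((t - Phi @ w_ml) ** 2)
print(inv_beta_ml)   # close to the true noise variance 0.04
```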

Page 19: 1 - Linear Regression

Geometry of least squares

Consider

y = Φ w_ML = [ϕ_1, . . . , ϕ_M] w_ML

y ∈ S ⊆ T, t ∈ T

S is spanned by ϕ_1, . . . , ϕ_M. w_ML minimizes the distance between t and its orthogonal projection onto S, i.e. y.

[Figure: t and its orthogonal projection y onto the subspace S spanned by ϕ_1 and ϕ_2]

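The projection picture can be verified numerically: the residual t − Φ w_ML is orthogonal to every column of Φ. A short Python/NumPy check on an arbitrary random Φ (an assumption of this sketch, not data from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

N, M = 20, 3
Phi = rng.normal(size=(N, M))   # columns play the role of the basis vectors
t = rng.normal(size=N)

w_ml = np.linalg.lstsq(Phi, t, rcond=None)[0]
y = Phi @ w_ml                  # orthogonal projection of t onto S = span(columns of Phi)

print(Phi.T @ (t - y))          # ~ 0: the residual is orthogonal to S
```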

Page 20: 1 - Linear Regression

Batch learning: Batch gradient descent

Consider the gradient descent algorithm, which starts with some initial w^(0):

w^(τ+1) = w^(τ) − η ∇E_D
        = w^(τ) + η Σ_{n=1}^{N} (t_n − (w^(τ))^T φ(x_n)) φ(x_n).

This is known as the least-mean-squares (LMS) algorithm.

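A minimal Python/NumPy implementation of the batch update, again on assumed synthetic data; the learning rate η and iteration count are arbitrary choices that happen to converge for this problem.

```python
import numpy as np

def batch_gradient_descent(Phi, t, eta=0.005, n_iters=5000):
    """Batch LMS: w <- w + eta * sum_n (t_n - w^T phi(x_n)) phi(x_n)."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iters):
        residuals = t - Phi @ w            # t_n - w^T phi(x_n) for all n
        w = w + eta * (Phi.T @ residuals)  # gradient step over the whole training set
    return w

# Assumed toy data; the result approaches the closed-form least-squares solution.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100)
t = 0.5 - 0.3 * x + rng.normal(scale=0.2, size=100)
Phi = np.column_stack([np.ones_like(x), x])
print(batch_gradient_descent(Phi, t))
```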

Page 21: 1 - Linear Regression

Batch gradient descent: Example calculation

In the case of ordinary least squares with only the living-area feature, we start from w_0^(0) = 48, w_1^(0) = 30 ...


Page 22: 1 - Linear Regression

Batch gradient descent: Results of the example calculation

... and obtain the result w_0 = 71.27, w_1 = 0.1345.


Page 23: 1 - Linear Regression

Sequential learning

Data items considered one at a time (a.k.a. online learning); use stochastic (sequential) gradient descent:

w^(τ+1) = w^(τ) − η ∇E_n
        = w^(τ) + η (t_n − (w^(τ))^T φ(x_n)) φ(x_n).

Issue: how to choose η?

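A sequential (stochastic) version of the same update in Python/NumPy, on the same kind of assumed synthetic data. A small fixed η is used here; slowly decaying η over time is one common answer to the slide's "how to choose η?" question.

```python
import numpy as np

def sequential_lms(Phi, t, eta=0.05, n_epochs=50, seed=0):
    """Sequential LMS: w <- w + eta * (t_n - w^T phi(x_n)) phi(x_n), one example at a time."""
    rng = np.random.default_rng(seed)
    w = np.zeros(Phi.shape[1])
    for _ in range(n_epochs):
        for n in rng.permutation(len(t)):      # visit training examples in random order
            w = w + eta * (t[n] - w @ Phi[n]) * Phi[n]
    return w

# Assumed toy data, as in the earlier sketches.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100)
t = 0.5 - 0.3 * x + rng.normal(scale=0.2, size=100)
Phi = np.column_stack([np.ones_like(x), x])
print(sequential_lms(Phi, t))
```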

Page 24: 1 - Linear Regression

Underfitting and overfitting


Page 25: 1 - Linear Regression

Regularization: Outlier


Page 26: 1 - Linear Regression

Regularized least squares

Consider the error function:

E_D(w) + λ E_W(w)

Data term + Regularization term

λ is called the regularization coefficient. With the sum-of-squares error function and a quadratic regularizer, we get

(1/2) Σ_{n=1}^{N} (t_n − w^T φ(x_n))² + (λ/2) w^T w,

which is minimized by

w = (λI + Φ^T Φ)^{-1} Φ^T t.

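The regularized (ridge) solution is a one-line change from the unregularized one; a Python/NumPy sketch on assumed synthetic data:

```python
import numpy as np

def ridge_solution(Phi, t, lam):
    """w = (lambda * I + Phi^T Phi)^{-1} Phi^T t for the quadratic regularizer."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

# Assumed toy data; lambda = 0 recovers ordinary least squares.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100)
t = 0.5 - 0.3 * x + rng.normal(scale=0.2, size=100)
Phi = np.column_stack([np.ones_like(x), x])

print(ridge_solution(Phi, t, lam=1.0))
print(ridge_solution(Phi, t, lam=0.0))   # matches np.linalg.lstsq(Phi, t, rcond=None)[0]
```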

Page 27: 1 - Linear Regression

Regularized least squares

With a more general regularizer, we have

(1/2) Σ_{n=1}^{N} (t_n − w^T φ(x_n))² + (λ/2) Σ_{j=1}^{M} |w_j|^q

[Figure: contours of the regularizer for q = 0.5, q = 1 (lasso), q = 2 (quadratic), and q = 4]

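The general regularized error is straightforward to evaluate for any q. The Python/NumPy sketch below (toy Φ, t, w, and λ are invented for illustration) only computes the objective, it does not minimize it; for q ≠ 2 there is no closed-form minimizer.

```python
import numpy as np

def regularized_error(w, Phi, t, lam, q):
    """E(w) = (1/2) sum_n (t_n - w^T phi(x_n))^2 + (lambda/2) sum_j |w_j|^q."""
    data_term = 0.5 * np.sum((t - Phi @ w) ** 2)
    reg_term = 0.5 * lam * np.sum(np.abs(w) ** q)
    return data_term + reg_term

# Assumed toy values, just to show how q changes the penalty on the same weights.
Phi = np.array([[1.0, 0.5], [1.0, -0.2], [1.0, 0.9]])
t = np.array([0.7, 0.4, 0.9])
w = np.array([0.5, 0.3])
for q in (0.5, 1, 2, 4):
    print(q, regularized_error(w, Phi, t, lam=1.0, q=q))
```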

Page 28: 1 - Linear Regression

Regularized least squares

Lasso tends to generate sparser solutions than a quadratic regularizer

[Figure: error contours and regularization constraint regions in the (w_1, w_2) plane, with the optimum w★ shown for the quadratic and lasso regularizers]


Page 29: 1 - Linear Regression

Multiple outputs

Analogously to the single-output case, we have:

p(t | x, W, β) = N(t | y(W, x), β^{-1} I)
              = N(t | W^T φ(x), β^{-1} I).

Given observed inputs X = {x_1, . . . , x_N} and targets T = [t_1, . . . , t_N]^T, we obtain the log likelihood function

ln p(T | X, W, β) = Σ_{n=1}^{N} ln N(t_n | W^T φ(x_n), β^{-1} I)
                  = (NK/2) ln( β / (2π) ) − (β/2) Σ_{n=1}^{N} ||t_n − W^T φ(x_n)||².


Page 30: 1 - Linear Regression

Multiple outputs

Maximizing w.r.t. W, we obtain

W_ML = (Φ^T Φ)^{-1} Φ^T T

If we consider a single target variable t_k, we see that

w_k = (Φ^T Φ)^{-1} Φ^T t_k

where t_k = [t_{1k}, . . . , t_{Nk}]^T, which is identical with the single output case.

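A short Python/NumPy sketch of the multiple-output solution on assumed synthetic data with K = 2 targets, confirming that each column of W_ML equals the corresponding single-output solution:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy data: K = 2 targets sharing one design matrix Phi (N x M).
N, K = 100, 2
x = rng.uniform(-1.0, 1.0, size=N)
Phi = np.column_stack([np.ones(N), x])
W_true = np.array([[0.5, -1.0],
                   [-0.3, 0.8]])                        # M x K assumed weights
T = Phi @ W_true + rng.normal(scale=0.1, size=(N, K))   # N x K target matrix

# W_ML = (Phi^T Phi)^{-1} Phi^T T; column k is the single-output solution for t_k.
W_ml = np.linalg.lstsq(Phi, T, rcond=None)[0]
w_first = np.linalg.lstsq(Phi, T[:, 0], rcond=None)[0]
print(W_ml[:, 0], w_first)   # identical, as stated on the slide
```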

Page 31: 1 - Linear Regression

Resources

• Stanford Engineering Everywhere CS229 – Machine Learning
  http://videolectures.net/stanfordcs229f07_machine_learning/

• Bishop C.M. Pattern Recognition and Machine Learning. Springer, 2006.
  http://research.microsoft.com/en-us/um/people/cmbishop/prml/
