Linear Regression
Machine Learning Seminar Series ’11
Nikita Zhiltsov
11 March 2011
Motivating example
Prices of houses in Portland

Living area (feet²)   # bedrooms   Price ($1000s)
2104                  3            400
1600                  3            330
2400                  3            369
1416                  2            232
3000                  4            540
...                   ...          ...
Motivating example
Plot

How can we predict the prices of other houses as a function of the size of their living areas?
Terminology and notation
$\mathbf{x} \in \mathcal{X}$ – input variables (“features”)
$t \in \mathcal{T}$ – a target variable
$\{\mathbf{x}_n\}$, $n = 1, \dots, N$ – given $N$ observations of input variables
$(\mathbf{x}_n, t_n)$ – a training example
$(\mathbf{x}_1, t_1), \dots, (\mathbf{x}_N, t_N)$ – a training set

Goal
Find a function $y(\mathbf{x}) : \mathcal{X} \to \mathcal{T}$ (a “hypothesis”) to predict the value of $t$ for a new value of $\mathbf{x}$.
Terminology and notation
When the target variable t is continuous
⇒ a regression problem
In the case of discrete values
⇒ a classification problem
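Terminology and notation
Loss function

$L(t, y(\mathbf{x}))$ – loss function or cost function
In the case of regression problems, the expected loss is given by:

$$\mathbb{E}[L] = \int_{\mathbb{R}} \int_{\mathcal{X}} L(t, y(\mathbf{x}))\, p(\mathbf{x}, t)\, d\mathbf{x}\, dt$$

Example
Squared loss:

$$L(t, y(\mathbf{x})) = \frac{1}{2} (y(\mathbf{x}) - t)^2$$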
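Linear basis function models
Linear regression

$$y(\mathbf{x}, \mathbf{w}) = w_0 + w_1 x_1 + \dots + w_D x_D,$$

where $\mathbf{x} = (x_1, \dots, x_D)$.

In our example,

$$y(\mathbf{x}, \mathbf{w}) = w_0 + w_1 x_1 + w_2 x_2,$$

where $x_1$ is the living area and $x_2$ is the number of bedrooms.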
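Linear basis function models
Basis functions

Generally,

$$y(\mathbf{x}, \mathbf{w}) = \sum_{j=0}^{M-1} w_j \phi_j(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}),$$

where $\phi_j(\mathbf{x})$ are known as basis functions.
Typically, $\phi_0(\mathbf{x}) = 1$, so that $w_0$ acts as a bias.
In the simplest case, we use linear basis functions: $\phi_d(\mathbf{x}) = x_d$.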
Linear basis function models
Polynomial basis functions

Polynomial basis functions:

$$\phi_j(x) = x^j.$$

These are global; a small change in $x$ affects all basis functions.

[Figure: polynomial basis functions plotted for x between −1 and 1]
Linear basis function models
Gaussian basis functions

Gaussian basis functions:

$$\phi_j(x) = \exp\left( -\frac{(x - \mu_j)^2}{2 s^2} \right)$$

These are local; a small change in $x$ only affects nearby basis functions. $\mu_j$ and $s$ control location and scale (width).

[Figure: Gaussian basis functions plotted for x between −1 and 1]
Linear basis function models
Sigmoidal basis functions

Sigmoidal basis functions:

$$\phi_j(x) = \sigma\left( \frac{x - \mu_j}{s} \right), \quad \text{where} \quad \sigma(a) = \frac{1}{1 + \exp(-a)}.$$

These are also local; a small change in $x$ only affects nearby basis functions. $\mu_j$ and $s$ control location and scale (slope).

[Figure: sigmoidal basis functions plotted for x between −1 and 1]
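The three basis-function families above can be written down directly. The following is a minimal sketch in plain Python; the function names, the centres, the width and the weights are all illustrative assumptions, not values from the slides.

```python
import math

def polynomial_basis(x, j):
    """phi_j(x) = x**j (global: changing x changes every basis function)."""
    return x ** j

def gaussian_basis(x, mu, s):
    """phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)); mu sets location, s sets width."""
    return math.exp(-((x - mu) ** 2) / (2.0 * s ** 2))

def sigmoid(a):
    """Logistic sigmoid sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))

def sigmoidal_basis(x, mu, s):
    """phi_j(x) = sigma((x - mu_j) / s); mu sets location, s sets slope."""
    return sigmoid((x - mu) / s)

def predict(x, w, basis):
    """y(x, w) = sum_j w_j * phi_j(x), i.e. w^T phi(x)."""
    return sum(w_j * phi_j(x) for w_j, phi_j in zip(w, basis))

# Hypothetical model: a bias phi_0(x) = 1 plus two Gaussian bumps on [-1, 1].
centres = [-0.5, 0.5]
width = 0.25
basis = [lambda x: 1.0] + [
    lambda x, mu=mu: gaussian_basis(x, mu, width) for mu in centres
]
weights = [0.1, 1.0, -1.0]  # illustrative values for w_0, w_1, w_2

y = predict(0.5, weights, basis)
```

The model stays linear in the weights whichever family is used; only the features fed into the weighted sum change.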
Probabilistic interpretation
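Assume observations from a deterministic function with added Gaussian noise:

$$t = y(\mathbf{x}, \mathbf{w}) + \epsilon, \quad \text{where} \quad p(\epsilon \mid \beta) = \mathcal{N}(\epsilon \mid 0, \beta^{-1}),$$

which is the same as saying

$$p(t \mid \mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}(t \mid y(\mathbf{x}, \mathbf{w}), \beta^{-1}).$$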
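Probabilistic interpretation
Optimal prediction for a squared loss

Expected loss:

$$\mathbb{E}[L] = \iint (y(\mathbf{x}) - t)^2\, p(\mathbf{x}, t)\, d\mathbf{x}\, dt,$$

which is minimized by the conditional mean

$$y(\mathbf{x}) = \mathbb{E}_t[t \mid \mathbf{x}].$$

In our case of a Gaussian conditional distribution, it is

$$\mathbb{E}[t \mid \mathbf{x}] = \int t\, p(t \mid \mathbf{x})\, dt = y(\mathbf{x}, \mathbf{w}).$$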
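The claim that the conditional mean minimizes the expected squared loss can be checked in one step. The decomposition below is the standard argument written in the notation of this slide; it is a sketch added for illustration and does not appear on the slides. Add and subtract $\mathbb{E}[t \mid \mathbf{x}]$ inside the square and integrate over $t$:

```latex
\begin{aligned}
\mathbb{E}[L]
  &= \iint \bigl( y(\mathbf{x}) - \mathbb{E}[t \mid \mathbf{x}]
       + \mathbb{E}[t \mid \mathbf{x}] - t \bigr)^2 p(\mathbf{x}, t)\, d\mathbf{x}\, dt \\
  &= \int \bigl( y(\mathbf{x}) - \mathbb{E}[t \mid \mathbf{x}] \bigr)^2 p(\mathbf{x})\, d\mathbf{x}
   + \int \operatorname{var}[t \mid \mathbf{x}]\, p(\mathbf{x})\, d\mathbf{x}.
\end{aligned}
```

The cross term vanishes because $\int (\mathbb{E}[t \mid \mathbf{x}] - t)\, p(t \mid \mathbf{x})\, dt = 0$. The second term does not depend on $y$, and the first is non-negative and equals zero exactly when $y(\mathbf{x}) = \mathbb{E}[t \mid \mathbf{x}]$.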
[Figure: the regression function y(x) plotted against x, with t on the vertical axis; at a point x0 the conditional density p(t|x0) and the prediction y(x0) are marked]
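Maximum likelihood and least squares

Given observed inputs $\mathbf{X} = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}$ and targets $\mathbf{t} = [t_1, \dots, t_N]^T$, we obtain the likelihood function

$$p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n), \beta^{-1}).$$

Taking the logarithm, we get

$$\ln p(\mathbf{t} \mid \mathbf{w}, \beta) = \sum_{n=1}^{N} \ln \mathcal{N}(t_n \mid \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n), \beta^{-1}) = \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi) - \beta E_D(\mathbf{w}),$$

where

$$E_D(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} (t_n - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n))^2$$

is the sum-of-squares error.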
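The step from the product of Gaussians to the closed form may be easier to follow for a single term. Writing out the log of the univariate Gaussian density with mean $\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n)$ and variance $\beta^{-1}$ (a worked step added here, using only the definitions above):

```latex
\ln \mathcal{N}\bigl(t_n \mid \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n), \beta^{-1}\bigr)
  = \frac{1}{2} \ln \beta - \frac{1}{2} \ln(2\pi)
    - \frac{\beta}{2} \bigl( t_n - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n) \bigr)^2
```

Summing over $n = 1, \dots, N$ gives $\frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi)$ from the first two terms, and the last term sums to $\beta E_D(\mathbf{w})$. Maximizing the likelihood over $\mathbf{w}$ is therefore the same as minimizing the sum-of-squares error.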
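Maximum likelihood and least squares

Computing the gradient and setting it to zero yields

$$\nabla_{\mathbf{w}} \ln p(\mathbf{t} \mid \mathbf{w}, \beta) = \beta \sum_{n=1}^{N} (t_n - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n))\, \boldsymbol{\phi}(\mathbf{x}_n)^T = 0.$$

Solving for $\mathbf{w}$, we get

$$\mathbf{w}_{ML} = \overbrace{(\boldsymbol{\Phi}^T \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^T}^{\text{Moore-Penrose pseudo-inverse}} \mathbf{t},$$

where

$$\boldsymbol{\Phi} = \begin{pmatrix}
\phi_0(\mathbf{x}_1) & \phi_1(\mathbf{x}_1) & \cdots & \phi_{M-1}(\mathbf{x}_1) \\
\phi_0(\mathbf{x}_2) & \phi_1(\mathbf{x}_2) & \cdots & \phi_{M-1}(\mathbf{x}_2) \\
\vdots & \vdots & \ddots & \vdots \\
\phi_0(\mathbf{x}_N) & \phi_1(\mathbf{x}_N) & \cdots & \phi_{M-1}(\mathbf{x}_N)
\end{pmatrix}.$$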
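As an illustration, the closed-form solution can be computed with NumPy. This is a minimal sketch: it uses only the five rows shown on the motivating-example slide (the full data set is not reproduced in this transcript), and the variable names are assumptions made here.

```python
import numpy as np

# The five rows shown on the motivating-example slide.
area = np.array([2104.0, 1600.0, 2400.0, 1416.0, 3000.0])  # living area, feet^2
bedrooms = np.array([3.0, 3.0, 3.0, 2.0, 4.0])
t = np.array([400.0, 330.0, 369.0, 232.0, 540.0])          # price, $1000s

# Design matrix Phi: phi_0 = 1 (bias), phi_1 = living area, phi_2 = bedrooms.
Phi = np.column_stack([np.ones_like(area), area, bedrooms])

# w_ML = pinv(Phi) @ t, the pseudo-inverse form from the slide.
w_pinv = np.linalg.pinv(Phi) @ t

# lstsq solves the same least-squares problem without forming an inverse.
w_lstsq, *_ = np.linalg.lstsq(Phi, t, rcond=None)

predictions = Phi @ w_pinv
```

In practice a least-squares solver such as `np.linalg.lstsq` is usually preferred to inverting $\boldsymbol{\Phi}^T \boldsymbol{\Phi}$ explicitly, because the features here differ in scale by several orders of magnitude.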
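Maximum likelihood and least squares
Bias parameter

Rewritten error function:

$$E_D(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \Bigl( t_n - w_0 - \sum_{j=1}^{M-1} w_j \phi_j(\mathbf{x}_n) \Bigr)^2$$

Setting the derivative w.r.t. $w_0$ equal to zero, we obtain

$$w_0 = \bar{t} - \sum_{j=1}^{M-1} w_j \bar{\phi}_j,$$

where

$$\bar{t} = \frac{1}{N} \sum_{n=1}^{N} t_n, \qquad \bar{\phi}_j = \frac{1}{N} \sum_{n=1}^{N} \phi_j(\mathbf{x}_n).$$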
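The bias identity can be checked numerically. The sketch below fits the five rows shown on the motivating-example slide and compares the fitted $w_0$ with $\bar{t} - \sum_j w_j \bar{\phi}_j$; the variable names are assumptions made for this example.

```python
import numpy as np

# The five rows shown on the motivating-example slide.
area = np.array([2104.0, 1600.0, 2400.0, 1416.0, 3000.0])
bedrooms = np.array([3.0, 3.0, 3.0, 2.0, 4.0])
t = np.array([400.0, 330.0, 369.0, 232.0, 540.0])

Phi = np.column_stack([np.ones_like(area), area, bedrooms])
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)

t_bar = t.mean()                      # average target
phi_bar = Phi[:, 1:].mean(axis=0)     # average of each non-bias basis function

# Bias recovered from the averages: w_0 = t_bar - sum_j w_j * phi_bar_j.
w0_from_means = t_bar - w[1:] @ phi_bar
```

Up to rounding error, `w0_from_means` agrees with `w[0]`: the bias makes up the difference between the average target and the weighted average of the basis-function values.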
Maximum likelihood and least squares
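$$\ln p(\mathbf{t} \mid \mathbf{w}_{ML}, \beta) = \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi) - \beta E_D(\mathbf{w}_{ML})$$

Maximizing the log likelihood function w.r.t. the noise precision parameter $\beta$, we obtain

$$\frac{1}{\beta_{ML}} = \frac{1}{N} \sum_{n=1}^{N} (t_n - \mathbf{w}_{ML}^T \boldsymbol{\phi}(\mathbf{x}_n))^2$$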
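In other words, the maximum-likelihood noise variance is the mean squared residual of the fitted model. A minimal sketch, assuming the five rows shown on the motivating-example slide with living area as the only feature (all names are illustrative):

```python
import numpy as np

# The five rows shown on the motivating-example slide; one feature only.
area = np.array([2104.0, 1600.0, 2400.0, 1416.0, 3000.0])
t = np.array([400.0, 330.0, 369.0, 232.0, 540.0])

Phi = np.column_stack([np.ones_like(area), area])
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)

residual = t - Phi @ w_ml
N = len(t)
E_D = 0.5 * np.sum(residual ** 2)     # sum-of-squares error at w_ML

# Setting d/d(beta) of the log likelihood to zero: N / (2 beta) - E_D = 0.
beta_ml = N / (2.0 * E_D)

def log_likelihood(beta):
    """ln p(t | w_ML, beta) in the form given on the slide."""
    return 0.5 * N * np.log(beta) - 0.5 * N * np.log(2.0 * np.pi) - beta * E_D
```

Evaluating `log_likelihood` at values on either side of `beta_ml` gives smaller numbers, which is a quick sanity check on the stationary point.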
Geometry of least squares
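Consider

$$\mathbf{y} = \boldsymbol{\Phi} \mathbf{w}_{ML} = [\boldsymbol{\varphi}_1, \dots, \boldsymbol{\varphi}_M]\, \mathbf{w}_{ML},$$

$$\mathbf{y} \in \mathcal{S} \subseteq \mathcal{T}, \quad \mathbf{t} \in \mathcal{T}.$$

$\mathcal{S}$ is spanned by $\boldsymbol{\varphi}_1, \dots, \boldsymbol{\varphi}_M$, the columns of $\boldsymbol{\Phi}$. $\mathbf{w}_{ML}$ minimizes the distance between $\mathbf{t}$ and its orthogonal projection on $\mathcal{S}$, i.e. $\mathbf{y}$.

[Figure: the target vector t, the subspace S spanned by φ1 and φ2, and the projection y of t onto S]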
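The projection picture can be reproduced on a toy problem with $N = 3$ and $M = 2$, so that $\mathcal{S}$ is a plane in three-dimensional space. The data below are hypothetical and chosen so the result is easy to verify by hand.

```python
import numpy as np

# Hypothetical toy problem: N = 3 observations, M = 2 basis functions.
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [1.0, 2.0]])          # columns are the vectors spanning S
t = np.array([1.0, 2.0, 2.0])

w_ml = np.linalg.pinv(Phi) @ t        # [7/6, 1/2] for this data
y = Phi @ w_ml                        # orthogonal projection of t onto S

residual = t - y                      # component of t orthogonal to S
inner_products = Phi.T @ residual     # one inner product per column of Phi
```

Here `y` is `[7/6, 5/3, 13/6]` and every entry of `inner_products` is zero up to rounding, which is the orthogonality condition expressed by the normal equations.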
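Batch learning
Batch gradient descent

Consider the gradient descent algorithm, which starts with some initial $\mathbf{w}^{(0)}$:

$$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta \nabla E_D = \mathbf{w}^{(\tau)} + \eta \sum_{n=1}^{N} (t_n - \mathbf{w}^{(\tau)T} \boldsymbol{\phi}(\mathbf{x}_n))\, \boldsymbol{\phi}(\mathbf{x}_n).$$

This is known as the least-mean-squares (LMS) algorithm.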
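The update rule translates almost line for line into code. In the sketch below, the function name, the step size `eta`, the iteration count and the zero initial weights are assumptions; so is the standardization of the feature, which is not on the slides and is used here only so that a single step size works for both weights.

```python
import numpy as np

def batch_gradient_descent(Phi, t, w_init, eta, n_iters):
    """Repeat w <- w + eta * sum_n (t_n - w^T phi(x_n)) phi(x_n)."""
    w = np.asarray(w_init, dtype=float).copy()
    for _ in range(n_iters):
        error = t - Phi @ w               # residual for every training example
        w = w + eta * (Phi.T @ error)     # same as w - eta * grad E_D
    return w

# The five rows shown on the motivating-example slide; living area only.
area = np.array([2104.0, 1600.0, 2400.0, 1416.0, 3000.0])
t = np.array([400.0, 330.0, 369.0, 232.0, 540.0])

# Assumption: standardize the feature (zero mean, unit variance).
z = (area - area.mean()) / area.std()
Phi = np.column_stack([np.ones_like(z), z])

w_gd = batch_gradient_descent(Phi, t, w_init=[0.0, 0.0], eta=0.1, n_iters=200)
```

Because the feature is standardized, the weights returned here refer to `z`, not to the raw living area, so they are not directly comparable with weights fitted on unscaled data.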
Batch gradient descent
Example calculation

In the case of ordinary least squares with living area as the only feature, we start from $w_0^{(0)} = 48$, $w_1^{(0)} = 30$ ...
Batch gradient descent
Results of the example calculation

... and obtain the result $w_0 = 71.27$, $w_1 = 0.1345$.
Sequential learning
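Data items are considered one at a time (a.k.a. online learning); use stochastic (sequential) gradient descent:

$$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta \nabla E_n = \mathbf{w}^{(\tau)} + \eta\, (t_n - \mathbf{w}^{(\tau)T} \boldsymbol{\phi}(\mathbf{x}_n))\, \boldsymbol{\phi}(\mathbf{x}_n).$$

Issue: how to choose $\eta$?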
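A sketch of the sequential update on hypothetical noise-free data, where the targets are generated exactly by $t = 1 + 2x$. The data, the step size `eta = 0.1` and the number of passes are illustrative assumptions, chosen so that the iteration visibly converges.

```python
import numpy as np

def sgd_epoch(Phi, t, w, eta):
    """One pass over the data, updating w after every single example."""
    for phi_n, t_n in zip(Phi, t):
        w = w + eta * (t_n - w @ phi_n) * phi_n
    return w

# Hypothetical noise-free data generated by t = 1 + 2 x.
x = np.array([-1.0, 0.0, 1.0])
Phi = np.column_stack([np.ones_like(x), x])
t = 1.0 + 2.0 * x

w = np.zeros(2)
for _ in range(200):                  # 200 passes over the three examples
    w = sgd_epoch(Phi, t, w, eta=0.1)
```

On noisy data a fixed step size generally leaves the weights fluctuating around the minimum rather than settling on it, which is one reason the choice of $\eta$ is flagged as an issue; too large a value makes the updates diverge.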
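Regularized least squares

Consider the error function:

$$E_D(\mathbf{w}) + \lambda E_W(\mathbf{w})$$

(data term + regularization term)

$\lambda$ is called the regularization coefficient.
With the sum-of-squares error function and a quadratic regularizer, we get

$$\frac{1}{2} \sum_{n=1}^{N} (t_n - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n))^2 + \frac{\lambda}{2} \mathbf{w}^T \mathbf{w},$$

which is minimized by

$$\mathbf{w} = (\lambda \mathbf{I} + \boldsymbol{\Phi}^T \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^T \mathbf{t}.$$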
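The regularized solution is a one-line change to the unregularized one. The sketch below uses a hypothetical three-point data set small enough to check by hand; `ridge_fit` is a name chosen for this example. Following the formula as written, the penalty applies to every weight, including the bias $w_0$.

```python
import numpy as np

def ridge_fit(Phi, t, lam):
    """Solve (lam * I + Phi^T Phi) w = Phi^T t for w."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

# Hypothetical toy data: a bias column and one feature.
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [1.0, 2.0]])
t = np.array([1.0, 2.0, 2.0])

w_unregularized = ridge_fit(Phi, t, lam=0.0)   # [7/6, 1/2]
w_regularized = ridge_fit(Phi, t, lam=1.0)     # [0.8, 0.6]
```

For this data the weight vector has norm about 1.27 without regularization and exactly 1.0 with $\lambda = 1$, showing the shrinkage that the quadratic penalty produces.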
Regularized least squares
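With a more general regularizer, we have

$$\frac{1}{2} \sum_{n=1}^{N} (t_n - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n))^2 + \frac{\lambda}{2} \sum_{j=1}^{M} |w_j|^q$$

[Figure: contours of the regularization term for q = 0.5, q = 1 (lasso), q = 2 (quadratic) and q = 4]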
Regularized least squares
Lasso tends to generate sparser solutions than a quadratic regularizer.

[Figure: two panels with axes w1 and w2, each marking the optimum w*, comparing the two regularizers]
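The sparsity effect is easiest to see in a special case that is assumed here and not discussed on the slides: an orthonormal design, $\boldsymbol{\Phi}^T \boldsymbol{\Phi} = \mathbf{I}$. The problem then separates per weight. With the $\lambda/2$ scaling used above, the lasso ($q = 1$) solution soft-thresholds each least-squares weight at $\lambda/2$, while the quadratic ($q = 2$) solution divides each weight by $1 + \lambda$. The numbers below are hypothetical.

```python
def soft_threshold(a, threshold):
    """Minimizer of 0.5 * (w - a)**2 + threshold * abs(w) over w."""
    if a > threshold:
        return a - threshold
    if a < -threshold:
        return a + threshold
    return 0.0

# Hypothetical least-squares weights for an orthonormal design (Phi^T Phi = I).
w_ls = [3.0, 0.4, -0.2]
lam = 1.0

# q = 1 (lasso): the penalty (lam / 2) * |w_j| gives a threshold of lam / 2.
w_lasso = [soft_threshold(a, lam / 2.0) for a in w_ls]

# q = 2 (quadratic): each weight is scaled by 1 / (1 + lam).
w_quadratic = [a / (1.0 + lam) for a in w_ls]
```

In this example the lasso sets the two small weights exactly to zero, whereas the quadratic regularizer only scales them down.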
Multiple outputs
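Analogously to the single output case, we have:

$$p(\mathbf{t} \mid \mathbf{x}, \mathbf{W}, \beta) = \mathcal{N}(\mathbf{t} \mid \mathbf{y}(\mathbf{W}, \mathbf{x}), \beta^{-1} \mathbf{I}) = \mathcal{N}(\mathbf{t} \mid \mathbf{W}^T \boldsymbol{\phi}(\mathbf{x}), \beta^{-1} \mathbf{I}).$$

Given observed inputs $\mathbf{X} = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}$ and targets $\mathbf{T} = [\mathbf{t}_1, \dots, \mathbf{t}_N]^T$, we obtain the log likelihood function

$$\ln p(\mathbf{T} \mid \mathbf{X}, \mathbf{W}, \beta) = \sum_{n=1}^{N} \ln \mathcal{N}(\mathbf{t}_n \mid \mathbf{W}^T \boldsymbol{\phi}(\mathbf{x}_n), \beta^{-1} \mathbf{I}) = \frac{NK}{2} \ln\left( \frac{\beta}{2\pi} \right) - \frac{\beta}{2} \sum_{n=1}^{N} \|\mathbf{t}_n - \mathbf{W}^T \boldsymbol{\phi}(\mathbf{x}_n)\|^2,$$

where $K$ is the number of target variables.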
Multiple outputs
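Maximizing w.r.t. $\mathbf{W}$, we obtain

$$\mathbf{W}_{ML} = (\boldsymbol{\Phi}^T \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^T \mathbf{T}.$$

If we consider a single target variable $t_k$, we see that

$$\mathbf{w}_k = (\boldsymbol{\Phi}^T \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^T \mathbf{t}_k,$$

where $\mathbf{t}_k = [t_{1k}, \dots, t_{Nk}]^T$, which is identical to the single output case.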
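The decoupling across outputs can be seen on a small hypothetical example with $N = 3$ observations and $K = 2$ target variables; the data and variable names are assumptions made for this sketch.

```python
import numpy as np

# Hypothetical toy data: N = 3 observations, M = 2 basis functions, K = 2 targets.
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [1.0, 2.0]])
T = np.array([[1.0, 0.0],
              [2.0, 1.0],
              [2.0, 4.0]])            # row n holds the target vector t_n

Phi_pinv = np.linalg.pinv(Phi)        # shared by every output

W_ml = Phi_pinv @ T                   # M x K matrix, all outputs at once

# Fitting each target column separately gives the matching column of W_ml.
w_per_output = [Phi_pinv @ T[:, k] for k in range(T.shape[1])]
```

Since the pseudo-inverse depends only on the inputs, it needs to be computed once however many target variables there are.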
Resources

• Stanford Engineering Everywhere CS229 – Machine Learning
  http://videolectures.net/stanfordcs229f07_machine_learning/
• Bishop C.M. Pattern Recognition and Machine Learning. Springer, 2006.
  http://research.microsoft.com/en-us/um/people/cmbishop/prml/