Transcript of CS-E3210 Machine Learning: Basic Principles, Lecture 3: Regression I

Page 1

CS-E3210 Machine Learning: Basic Principles
Lecture 3: Regression I

slides by Markus Heinonen

Department of Computer Science, Aalto University, School of Science

Autumn (Period I) 2017

Page 2

In a nutshell

today and Friday we consider regression problems

data points $x^{(i)} \in \mathbb{R}^d$ and continuous targets $y^{(i)} \in \mathbb{R}$

we want to learn a function $h(x^{(i)}) \approx y^{(i)}$

the prediction $h(x)$ is continuous

in classification both the target $y$ and $h(x)$ are binary

a function $h(\cdot)$ is represented by parameters $w$

the parameters $w$ need to fit the data $\mathbb{X} = (x^{(1)}, y^{(1)}), \dots, (x^{(N)}, y^{(N)})$

Page 3

Can we predict apartment rent?

[Figure "Rent prediction": output y (rent) plotted against input x (house size, sqm)]

we observe rents $y^{(i)}$ for $i = 1, \dots, 11$ houses $x^{(i)}$

learn from this data to predict the rent $h(x) \in \mathbb{R}$ given $d$ house properties $x \in \mathbb{R}^d$

(designing a good $h(x)$ by hand is not machine learning)

Page 4

Which features do we have access to?

[Figure: left panel "Rent prediction", rent (y) against house size (x, sqm); right panel "Rent prediction, output y: rent", house age (x2, 1900-2000) against house size (x1, sqm), colored by rent 600-1600]

house size $x_{\text{size}}$ can predict a linear trend in rent $y$

house age $x_{\text{age}}$ gives non-linear information about $y$

new and old houses seem expensive, with little effect from the 1940s to the 1990s

informative features add accuracy (e.g. location, condition)

non-informative features add noise (e.g. house color)

Page 5

Alternative hypotheses h(x), which to choose?

[Figure: two "Rent prediction" panels, rent (y) against house size (x, sqm); left: the linear fit h(x) = 8.5x + 400; right: a complex non-linear h(x)]

linear functions are surprisingly powerful

⇒ Linear regression

non-linear functions can achieve low error on the data, but can still err badly

a model should learn the underlying pattern and generalise to future data ⇒ Lectures 7 & 8

Page 6

Alternative hypotheses h(x), which to choose?

[Figure: two contour panels "Rent prediction, output y: rent" over house size (x1, sqm) and house age (x2, 1900-2000), contour levels 500-1600; left: a linear model; right: a non-linear model]

a linear function cannot explain the bimodal behavior of $x_{\text{age}}$

⇒ basis functions

Page 7

Outline

1 Linear regression

2 Basis functions
    Polynomial basis
    Gaussian basis

Page 8

A regression problem

inputs $x^{(i)} = (x^{(i)}_1, \dots, x^{(i)}_d)^T \in \mathbb{R}^d$ with $d$ features/properties/dimensions/covariates

a scalar target/response/output/label $y^{(i)} \in \mathbb{R}$

a dataset of $N$ data points $\mathbb{X} = \{(x^{(1)}, y^{(1)}), \dots, (x^{(N)}, y^{(N)})\} = \{x^{(i)}, y^{(i)}\}_{i=1}^{N}$

in matrix form the dataset is

$$X = \begin{pmatrix} x^{(1)}_1 & \cdots & x^{(1)}_d \\ \vdots & \ddots & \vdots \\ x^{(N)}_1 & \cdots & x^{(N)}_d \end{pmatrix} = \begin{pmatrix} x^{(1)T} \\ \vdots \\ x^{(N)T} \end{pmatrix} \in \mathbb{R}^{N \times d}, \qquad y = \begin{pmatrix} y^{(1)} \\ \vdots \\ y^{(N)} \end{pmatrix} \in \mathbb{R}^N$$

learn a function $h(\cdot): \mathbb{R}^d \to \mathbb{R}$ with $y^{(i)} \approx h(x^{(i)})$

(1) which function family h(x) to choose?

(2) how to measure “h(x) ≈ y”?

Page 9

Linear regression

linear regression for multivariate inputs $x \in \mathbb{R}^d$ defines

$$h_w(x) = \sum_{j=0}^{d} w_j x_j = w^T x$$

where $w \in \mathbb{R}^{d+1}$ are the linear weight parameters (one weight per feature plus the intercept)

encode $x_0 = 1$; then $w_0$ encodes the intercept

the hypothesis class is $\{h_w : w \in \mathbb{R}^d\}$

all predictions in matrix notation:

$$\begin{pmatrix} h(x^{(1)}) \\ \vdots \\ h(x^{(N)}) \end{pmatrix} = \begin{pmatrix} w^T x^{(1)} \\ \vdots \\ w^T x^{(N)} \end{pmatrix} = Xw$$

measure the prediction error by the squared error loss

$$L((x^{(i)}, y^{(i)}), h(\cdot)) = (y^{(i)} - h(x^{(i)}))^2$$
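As a concrete illustration (a sketch of mine, not from the slides), the matrix form $Xw$ and the per-point squared losses can be computed in a few lines of NumPy; the data values are rent examples from later slides, with the bias trick $x_0 = 1$:

```python
import numpy as np

# three (house size, rent) pairs from the rent slides, with bias column x_0 = 1
X = np.array([[1.0,  31.0],
              [1.0,  49.0],
              [1.0, 101.0]])          # shape (N, 2)
y = np.array([705.0, 840.0, 1200.0])

w = np.array([400.0, 9.0])            # (w_0, w_1): intercept 400, slope 9

predictions = X @ w                   # h(x^(i)) = w^T x^(i) for all i at once
losses = (y - predictions) ** 2       # squared error loss per data point
print(predictions)                    # [ 679.  841. 1309.], matching the h(x) = 9x + 400 table
print(losses)
```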

Page 10

Can we predict apartment rent?

[Figure "Rent prediction": rent (y) against house size (x, sqm)]

input x^(i) and output y^(i):

x^(1) = 31     y^(1) = 705
x^(2) = 33     y^(2) = 540
x^(3) = 31     y^(3) = 650
x^(4) = 49     y^(4) = 840
x^(5) = 53     y^(5) = 890
x^(6) = 69     y^(6) = 850
x^(7) = 101    y^(7) = 1200
x^(8) = 99     y^(8) = 1150
x^(9) = 143    y^(9) = 1700
x^(10) = 132   y^(10) = 900
x^(11) = 109   y^(11) = 1550

we observe data $\mathbb{X} = (x^{(1)}, y^{(1)}), \dots, (x^{(N)}, y^{(N)})$ with $N = 11$

we assume $y^{(i)} \approx f(x^{(i)})$ where $f(\cdot)$ is the "true" function

Page 11

Can we predict apartment rent?

[Figure "Rent prediction": the data with the fitted line h(x) = 9x + 400]

input x^(i), output y^(i), and prediction with h(x) = 9x + 400:

x^(1) = 31     y^(1) = 705     h(x^(1)) = 679
x^(2) = 33     y^(2) = 540     h(x^(2)) = 697
x^(3) = 31     y^(3) = 650     h(x^(3)) = 679
x^(4) = 49     y^(4) = 840     h(x^(4)) = 841
x^(5) = 53     y^(5) = 890     h(x^(5)) = 877
x^(6) = 69     y^(6) = 850     h(x^(6)) = 1021
x^(7) = 101    y^(7) = 1200    h(x^(7)) = 1309
x^(8) = 99     y^(8) = 1150    h(x^(8)) = 1291
x^(9) = 143    y^(9) = 1700    h(x^(9)) = 1687
x^(10) = 132   y^(10) = 900    h(x^(10)) = 1588
x^(11) = 109   y^(11) = 1550   h(x^(11)) = 1381

linear hypothesis class $h_w(x) = w_1 x + w_0 = w^T x$

encode $x = (x, 1)^T$ with $w = (w_1, w_0)^T$

compute the losses $(y^{(i)} - h(x^{(i)}))^2$

Page 12

Which parameters to choose?

[Figure "Rent prediction": rent (y) against house size (x, sqm)]

choose parameters to minimize empirical risk (mean loss)

$$w = \operatorname*{argmin}_{w} \left\{ \mathcal{E}(h(\cdot)\,|\,\mathbb{X}) = \frac{1}{N} \sum_{i=1}^{N} \big(y^{(i)} - h(x^{(i)})\big)^2 = \frac{1}{N} \|y - Xw\|^2 \right\}$$

Page 13

Empirical risk

[Figure: left, "Rent prediction (b=0)": the data and h(x) = 5x; right, "Empirical risk" (scale 10^5) as a function of w, with w = 5 marked]

empirical risk quantifies how well the function fits data

$h(x) = w_1 x + 0$, with $w_1 = 5$

Page 14

Empirical risk

[Figure: left, "Rent prediction (b=0)": the data with h(x) = 5x and h(x) = 11.7x; right, "Empirical risk" (scale 10^5) as a function of w, with w = 5 and w = 11.7 marked]

empirical risk quantifies how well the function fits data

$h(x) = w_1 x + 0$, with $w_1 = 11.7$

Page 15

Empirical risk

[Figure: left, "Rent prediction (b=0)": the data with h(x) = 5x, h(x) = 11.7x and h(x) = 15x; right, "Empirical risk" (scale 10^5) as a function of w, with w = 5, 11.7 and 15 marked]

empirical risk quantifies how well the function fits the data

$h(x) = w_1 x + 0$, with $w_1 = 15$

the best hypothesis was $w_1 = 11.7$ when $w_0 = 0$ (only on this data $\mathbb{X}$!)
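The value $w_1 = 11.7$ can be reproduced with a short sketch (mine, using the 11 data points from the rent table); for this one-parameter model, setting the derivative of the risk to zero gives $w_1 = \sum_i x^{(i)} y^{(i)} / \sum_i (x^{(i)})^2$:

```python
import numpy as np

# the 11 (size, rent) pairs from the rent-prediction table
x = np.array([31, 33, 31, 49, 53, 69, 101, 99, 143, 132, 109], dtype=float)
y = np.array([705, 540, 650, 840, 890, 850, 1200, 1150, 1700, 900, 1550], dtype=float)

def risk(w1):
    """Empirical risk of h(x) = w1 * x (intercept w0 fixed to 0)."""
    return np.mean((y - w1 * x) ** 2)

w1_best = np.sum(x * y) / np.sum(x ** 2)     # zero-derivative condition
print(w1_best)                               # ≈ 11.76
print(risk(5.0), risk(w1_best), risk(15.0))  # the minimum is at w1_best
```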

Page 16

Empirical risk

[Figure: 2D empirical risk surface over $(w_0, w_1)$]

Page 17

Derivatives

let's minimize the empirical risk

minimization of functions is based on derivatives

$$\frac{d f(x)}{d x} = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$$

the negative derivative gives the direction of steepest descent

Page 18

Derivatives

the derivative of the empirical error with respect to $w$ is (for a 1D problem)

$$\begin{aligned}
\frac{\partial \mathcal{E}(h(\cdot)\,|\,\mathbb{X})}{\partial w}
&= \frac{\partial \frac{1}{N}\sum_{i=1}^{N} (y^{(i)} - w x^{(i)})^2}{\partial w} \\
&= \frac{1}{N} \sum_{i=1}^{N} \frac{\partial (y^{(i)} - w x^{(i)})^2}{\partial w} \\
&= \frac{2}{N} \sum_{i=1}^{N} \frac{\partial (y^{(i)} - w x^{(i)})}{\partial w} \, (y^{(i)} - w x^{(i)}) \\
&= -\frac{2}{N} \sum_{i=1}^{N} x^{(i)} \underbrace{(y^{(i)} - w x^{(i)})}_{i\text{'th data error}}
\end{aligned}$$

the gradient of the empirical error with respect to $w = (w_1, \dots, w_d)^T$ is

$$\nabla_w \mathcal{E}(h_w(\cdot)\,|\,\mathbb{X}) = \left( \frac{\partial \mathcal{E}(h_{w_1,\dots,w_d}(\cdot)\,|\,\mathbb{X})}{\partial w_1}, \;\dots,\; \frac{\partial \mathcal{E}(h_{w_1,\dots,w_d}(\cdot)\,|\,\mathbb{X})}{\partial w_d} \right)^T$$

Page 19

Iterative gradient descent

choose an initial parameter $w^{(0)}$ (e.g. all 0's) and a stepsize $\alpha$

iterative gradient descent (GD): for $k = 1, \dots, K$, update

$$w^{(k+1)} = w^{(k)} - \alpha \underbrace{\nabla_w \mathcal{E}(h(\cdot)\,|\,\mathbb{X})}_{\text{gradient}} = w^{(k)} + \frac{2\alpha}{N} \sum_{i=1}^{N} x^{(i)} \underbrace{(y^{(i)} - w^{(k)T} x^{(i)})}_{i\text{'th data point error}}$$

output: the final $K$'th regression weight vector $w^{(K)}$

the choice of step size or learning rate $\alpha$ is crucial!
if $\alpha$ is too large: iterations may not converge
if $\alpha$ is too small: very slow convergence
$\alpha$ is usually chosen by trial-and-error

the gradient $\nabla_w \mathcal{E}(h(\cdot)\,|\,\mathbb{X})$ points in the direction of the maximal rate of increase of $\mathcal{E}(h(\cdot)\,|\,\mathbb{X})$ at the current value $w$

subtract the gradient from $w^{(k)}$ to maximally decrease $\mathcal{E}(h(\cdot)\,|\,\mathbb{X})$

computational complexity $O(K \cdot N d)$ ($K$ iterations, each touching all $N$ points with $d$ features)
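A minimal sketch of the GD loop above (my own NumPy code; the stepsize and iteration count are illustrative placeholders):

```python
import numpy as np

def gradient_descent(X, y, alpha=1e-5, K=1000):
    """Minimize (1/N)||y - Xw||^2 with the update rule above."""
    N, d = X.shape
    w = np.zeros(d)                               # w^(0): all zeros
    for _ in range(K):
        errors = y - X @ w                        # i'th data point errors
        w = w + (2 * alpha / N) * (X.T @ errors)  # X.T @ errors = sum_i x^(i) * error_i
    return w
```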

Page 20

Gradient minimization

we use the update equation

$$w^{(k+1)} = w^{(k)} + \frac{2\alpha}{N} \sum_{i=1}^{N} x^{(i)} (y^{(i)} - w^{(k)T} x^{(i)})$$

the stepsize $\alpha$ is good

Page 21

Gradient minimization

we use the update equation

$$w^{(k+1)} = w^{(k)} + \frac{2\alpha}{N} \sum_{i=1}^{N} x^{(i)} (y^{(i)} - w^{(k)T} x^{(i)})$$

with too large an $\alpha$, we are not converging

Page 22

Stochastic gradient descent

in gradient descent each data point "pulls" the parameters

$$w^{(k+1)} = w^{(k)} - \alpha \nabla_w \mathcal{E}(h(\cdot)\,|\,\mathbb{X}) = w^{(k)} + \frac{2\alpha}{N} \sum_{i=1}^{N} x^{(i)} \underbrace{(y^{(i)} - w^{(k)T} x^{(i)})}_{i\text{'th data point error}}$$

in stochastic gradient descent (SGD) we compute the gradient over random minibatches $I \subset \{1, \dots, N\}$ of data of size $M < N$

$$w^{(k+1)} = w^{(k)} - \alpha \nabla_w \mathcal{E}(h(\cdot)\,|\,\mathbb{X}_I) = w^{(k)} + \frac{2\alpha}{M} \sum_{i \in I} x^{(i)} (y^{(i)} - w^{(k)T} x^{(i)})$$

computational complexity $O(K \cdot M d)$

SGD is one of the most powerful optimizers for large models
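A sketch of the SGD variant (mine; the minibatch size, stepsize and random seed are illustrative, and the update averages over the minibatch of size M as above):

```python
import numpy as np

def sgd(X, y, alpha=1e-5, K=1000, M=4, seed=0):
    """Same update as GD, but averaged over a random minibatch I of size M."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(K):
        I = rng.choice(N, size=M, replace=False)     # random minibatch indices
        errors = y[I] - X[I] @ w
        w = w + (2 * alpha / M) * (X[I].T @ errors)  # minibatch gradient step
    return w
```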

Page 23

Analytical solution for linear regression

to minimize $\mathcal{E}(h(\cdot)\,|\,\mathbb{X})$ we can directly solve for where its gradient is 0:

$$\nabla_w \mathcal{E}(h(\cdot)\,|\,\mathbb{X}) = 0$$

with solution (DL book, Section 5.1.4)

$$w = (X^T X)^{-1} X^T y$$

we get the global optimum since the empirical risk (of linear regression) is convex

$X^T X$ needs to be invertible

⇒ Regression Home Assignment

matrix inversion is an $O(d^3)$ operation
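In code, the analytical solution is a one-liner (a sketch of mine; solving the linear system $X^T X w = X^T y$ is numerically preferable to forming the inverse explicitly):

```python
import numpy as np

def fit_linear_regression(X, y):
    """w = (X^T X)^{-1} X^T y, computed by solving (X^T X) w = X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# if X^T X is singular (not invertible), np.linalg.lstsq(X, y, rcond=None)[0]
# returns a least-squares solution instead
```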

Page 24

ID card of linear regression

input/feature space $\mathcal{X} = \mathbb{R}^d$

target space $\mathcal{Y} = \mathbb{R}$

function family $h(x) = w^T x = \sum_{j=0}^{d} w_j x_j$

bias trick: $x_0 = 1$ and $j$ starts from 0

loss function $L((x, y), h(\cdot)) = (h(x) - y)^2$

empirical risk $\mathcal{E}(h(\cdot)\,|\,\mathbb{X}) = \frac{1}{N} \|Xw - y\|_2^2$

empirical risk minimization leads to parameters

$$w = (X^T X)^{-1} X^T y \qquad \text{or} \qquad w^{(k+1)} = w^{(k)} + \frac{2\alpha}{N} \sum_{i=1}^{N} x^{(i)} (y^{(i)} - w^{(k)T} x^{(i)})$$

DL book: covered in chapter 5.1

Page 25

Case study: predict red wine quality with linear regression?

one wants to understand what makes a wine taste good

we have measured chemical composition of many wines (x)

tasting evaluations to rate the wines (y)

task: predict wine quality h(x) given its composition x

Page 26

Wine measurement data

we construct a dataset X of N = 1599 wine measurements x

we manually obtain a rating $y \in [0, 10]$ for each wine from subjective tastings

$X$ (columns $x_1, \dots, x_{11}$: fixed acid, volatile acid, citric acid, sugar, chlorides, free sulfur, total sulfur, density, pH, sulphates, alcohol):

x^(1):     7.4   0.70  0.00  1.9  0.076  11  34  0.998  3.51  0.56   9.4
x^(2):     7.8   0.88  0.00  2.6  0.098  25  67  0.997  3.20  0.68   9.8
x^(3):     7.8   0.76  0.04  2.3  0.092  15  54  0.997  3.26  0.65   9.8
x^(4):    11.2   0.28  0.56  1.9  0.075  17  60  0.998  3.16  0.58   9.8
x^(5):     7.4   0.70  0.00  1.9  0.076  11  34  0.998  3.51  0.56   9.4
...
x^(1599):  6.0   0.31  0.47  3.6  0.067  18  42  0.995  3.39  0.66  11.0

$y$ (quality): $y^{(1)} = 5$, $y^{(2)} = 5$, $y^{(3)} = 5$, $y^{(4)} = 6$, $y^{(5)} = 5$, ..., $y^{(1599)} = 6$

*P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, Elsevier, 47(4):547-553, 2009

Page 27

Linear Regression on wine

linear hypothesis space $\mathcal{H} = \{h_w(x) = w^T x : w \in \mathbb{R}^{11}\}$

empirical risk minimizer (fits the 1599 wines best):

$$w = \operatorname*{argmin}_{w} \frac{1}{N} \sum_{i=1}^{N} (y^{(i)} - w^T x^{(i)})^2 = (X^T X)^{-1} X^T y$$

$$w^T = (0.004,\ -1.09,\ -0.18,\ 0.007,\ -1.91,\ 0.005,\ -0.003,\ 4.53,\ -0.52,\ 0.88,\ 0.29)$$

(each weight above aligns with the corresponding feature column of the data matrix $X$ and target vector $y$ shown on the previous slide)

Page 28

Linear regression predictions

$$h(x^{(i)}) = \sum_j w_j x^{(i)}_j = w^T x^{(i)}$$

$$w^T = (0.004,\ -1.09,\ -0.18,\ 0.007,\ -1.91,\ 0.005,\ -0.003,\ 4.53,\ -0.52,\ 0.88,\ 0.29)$$

applied to each row of the wine data matrix gives the predictions

$$Xw = (5.039,\ 5.142,\ 5.217,\ 5.677,\ 5.039,\ \dots,\ 6.026)^T$$

$h(x^{(1)}) = 0.004 \cdot 7.4 + (-1.09) \cdot 0.70 + \dots + 0.29 \cdot 9.4 = 5.039$

$h(x^{(2)}) = 0.004 \cdot 7.8 + (-1.09) \cdot 0.88 + \dots + 0.29 \cdot 9.8 = 5.142$
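The same dot product in code (a check of mine; the printed weights are rounded to a few digits, so the result only approximately matches the slide's 5.039):

```python
import numpy as np

w = np.array([0.004, -1.09, -0.18, 0.007, -1.91, 0.005,
              -0.003, 4.53, -0.52, 0.88, 0.29])        # rounded weights
x1 = np.array([7.4, 0.70, 0.00, 1.9, 0.076, 11, 34,
               0.998, 3.51, 0.56, 9.4])                # wine x^(1)
print(w @ x1)  # ≈ 5.00; the slide's 5.039 uses the unrounded weights
```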

Page 29

Linear regression result on wine

We achieve the empirical risk (mean squared error)

$$\mathcal{E}(h(\cdot)\,|\,\mathbb{X}) = \frac{1}{N} \sum_{i=1}^{N} (h(x^{(i)}) - y^{(i)})^2 = 0.4253$$

$$y = \begin{pmatrix} 5 \\ 6 \\ 5 \\ 3 \\ \vdots \end{pmatrix}, \qquad Xw = \begin{pmatrix} 5.039 \\ 5.624 \\ 5.217 \\ 3.294 \\ \vdots \end{pmatrix}$$

Page 30

Outline

1 Linear regression

2 Basis functions
    Polynomial basis
    Gaussian basis

Page 31

Non-linearity

so far we have analysed linear models where each feature's contribution towards the output is summed independently

most machine learning problems are non-linear

non-linear effects, e.g. $\log(x_{\text{alcohol}})$

combined effects, e.g. $x_{\text{sugar}} \cdot x_{\text{alcohol}}$

let’s expand the feature space by considering n basis functions

$$h(x) = \sum_{j=0}^{n} w_j \phi_j(x) = w^T \phi(x)$$

where $\phi(x): \mathbb{R}^d \to \mathbb{R}^n$, usually with $n > d$ and $\phi_0(x) = 1$ (the bias)

the dataset is then $\Phi = (\phi(x^{(1)}), \dots, \phi(x^{(N)}))^T \in \mathbb{R}^{N \times n}$

risk: $\frac{1}{N}\|\Phi w - y\|_2^2$, solution: $w = (\Phi^T \Phi)^{-1} \Phi^T y$
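A generic sketch (mine) of basis-expanded regression: precompute $\Phi$, solve the normal equations, and predict with $h(x) = w^T \phi(x)$; the example basis with a log and a product feature is only illustrative:

```python
import numpy as np

def fit_basis_regression(X, y, basis):
    """basis maps one input x in R^d to a feature vector phi(x) in R^n."""
    Phi = np.array([basis(x) for x in X])        # N x n feature matrix
    w = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)  # w = (Phi^T Phi)^{-1} Phi^T y
    return w, (lambda x: w @ basis(x))           # weights and h(x) = w^T phi(x)

# illustrative basis: bias, a raw feature, a log effect, a combined effect
basis = lambda x: np.array([1.0, x[0], np.log(x[1]), x[0] * x[1]])
```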

Page 32

Outline

1 Linear regression

2 Basis functions
    Polynomial basis
    Gaussian basis

Page 33

Polynomial expansion

map $\phi: (x_1, x_2) \mapsto (x_1, x_2, x_1^2 x_2^2)$

the product $x_1^2 x_2^2$ solves the problem (feature expansion)

a trivial solution is now $w_3 = 1$

Page 34

Polynomial basis functions

let's consider non-additive effects via $M$'th order polynomial basis functions:

$$\phi^{(M)}(x) = \{x_{j_1} x_{j_2} \cdots x_{j_M} : j_1, \dots, j_M \in \{1, \dots, d\}\}$$

where

$$\phi^{(0)}(x) = 1, \qquad \phi^{(1)}(x) = (x_1, x_2, \dots, x_d)^T, \qquad \phi^{(2)}(x) = (x_1^2,\ x_1 x_2,\ \dots,\ x_{d-1} x_d,\ x_d^2)^T$$

$d = 11$ features gives 55 pairwise terms, 165 triplets, etc.

basis expansion dramatically increases the hypothesis space

the bases are precomputed to produce the $\Phi$ matrix

basis functions result in non-linear predictions; see the sketch below
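The term counts can be verified quickly (a check of mine): distinct pairwise products $x_{j_1} x_{j_2}$ with $j_1 < j_2$ and distinct triplets for $d = 11$:

```python
from itertools import combinations

d = 11
pairs    = list(combinations(range(d), 2))  # products of two distinct features
triplets = list(combinations(range(d), 3))  # products of three distinct features
print(len(pairs), len(triplets))            # 55 165
```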

Page 35

Polynomial basis example

sample 100 points where $x^{(i)} \in [-1, 1]$ and $y^{(i)} = \sin(\pi x^{(i)}) + \varepsilon$

black dots: 7 data points

red dots: more samples

linear function $h(x) = 1.37x$

[Figure: the data points, the curve $\sin(\pi X)$, and the degree-1 polynomial fit]

Page 36

Polynomial regressor, M = 0

[Figure: the data points, $\sin(\pi X)$, and the degree-0 polynomial fit]

$h(x) = w_0$

Page 37

Polynomial regressor, M = 1

[Figure: the data points, $\sin(\pi X)$, and the degree-1 polynomial fit]

$h(x) = w_0 + w_1 x$

Page 38

Polynomial regressor, M = 2

[Figure: the data points, $\sin(\pi X)$, and the degree-2 polynomial fit]

$h(x) = w^T \phi(x) = w_0 + w_1 x + w_2 x^2$

Page 39

Polynomial regressor, M = 3

[Figure: the data points, $\sin(\pi X)$, and the degree-3 polynomial fit]

$h(x) = w^T \phi(x) = w_0 + w_1 x + w_2 x^2 + w_3 x^3$

Page 40

Polynomial regressors, M = 5

[Figure: the data points, $\sin(\pi X)$, and the degree-5 polynomial fit]

$h(x) = w^T \phi(x) = w_0 + w_1 x + w_2 x^2 + w_3 x^3 + w_4 x^4 + w_5 x^5$
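The whole sequence of fits on pages 35-41 can be reproduced with a short script (mine; the noise level, seed and number of points are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=7)                    # 7 training points
y = np.sin(np.pi * x) + 0.1 * rng.normal(size=7)  # noisy sin(pi x) targets

for M in [0, 1, 2, 3, 5]:
    coeffs = np.polyfit(x, y, deg=M)   # least-squares polynomial fit of degree M
    h = np.poly1d(coeffs)
    print(M, h(0.5))                   # each model's prediction at x = 0.5
```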

Page 41

Polynomial regressors, M = 5 with enough data

[Figure: the enlarged dataset, $\sin(\pi X)$, and the degree-5 polynomial fit]

Page 42

Outline

1 Linear regression

2 Basis functions
    Polynomial basis
    Gaussian basis

Page 43

Kernel basis functions

a kernel function $K(x, x') \in \mathbb{R}$ measures the similarity of two vectors $x, x' \in \mathbb{R}^d$

the opposite concept to a distance function $D(x, x')$

a common kernel is the Gaussian kernel

$$K(x, x') = \exp\left(-\frac{1}{2} \frac{\|x - x'\|^2}{\sigma^2}\right)$$

a kernel basis function encodes the feature $\phi_i(x)$ as similarity to another point $m^{(i)}$:

$$\phi_i(x) = K(x, m^{(i)})$$

how to choose the basis points $m^{(i)}$?

Page 44

Feature mapping with 3 Gaussian bases

3 features $\phi_j(x) = e^{-\frac{(x - m^{(j)})^2}{2\sigma^2}}$ at $m^{(j)} = 50, 100, 150$

feature mapping $\phi: x \mapsto (\phi_1(x), \phi_2(x), \phi_3(x))$

e.g. $x = 31$ becomes $\phi(31) = (0.74, 0.02, 0.00)$
e.g. $x = 69$ becomes $\phi(69) = (0.74, 0.46, 0.00)$
e.g. $x = 143$ becomes $\phi(143) = (0.00, 0.22, 0.96)$
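The slide does not state $\sigma$; with $\sigma = 25$ (my estimate) a small sketch reproduces the printed values up to rounding:

```python
import numpy as np

m = np.array([50.0, 100.0, 150.0])  # Gaussian basis point locations
sigma = 25.0                        # assumed width; not stated on the slide

def phi(x):
    """Feature map phi(x) = (phi_1(x), phi_2(x), phi_3(x))."""
    return np.exp(-(x - m) ** 2 / (2 * sigma ** 2))

for x in [31.0, 69.0, 143.0]:
    print(x, np.round(phi(x), 2))
# 31  -> [0.75 0.02 0.  ]   (slide: 0.74, 0.02, 0.00)
# 69  -> [0.75 0.46 0.01]   (slide: 0.74, 0.46, 0.00)
# 143 -> [0.   0.23 0.96]   (slide: 0.00, 0.22, 0.96)
```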

Page 45

3 Gaussian bases in 1D

three Gaussian features $\phi_j(x) = e^{-\frac{(x - m^{(j)})^2}{2\sigma^2}}$ with $(m^{(1)}, m^{(2)}, m^{(3)}) = (50, 100, 150)$

the hypothesis is a sum of weighted Gaussian features

$$h(x) = \sum_{j=1}^{3} w_j \phi_j(x)$$

Page 46

ID card of linear basis regression

input space $\mathcal{X} = \mathbb{R}^d$

feature space $\mathcal{F} = \mathbb{R}^n$ via the basis function $\phi(x) \in \mathbb{R}^n$

the dataset is then $\Phi = (\phi(x^{(1)}), \dots, \phi(x^{(N)}))^T \in \mathbb{R}^{N \times n}$

target space $\mathcal{Y} = \mathbb{R}$

function family $h(x) = w^T \phi(x)$

loss function $L((x, y), h(\cdot)) = (h(x) - y)^2$

empirical risk $\mathcal{E}(h(\cdot)\,|\,\mathbb{X}) = \frac{1}{N} \|\Phi w - y\|_2^2$

empirical risk minimization leads to parameters

$$w = (\Phi^T \Phi)^{-1} \Phi^T y$$

Page 47

Basis function summary

basis functions $\phi: \mathbb{R}^d \to \mathbb{R}^n$ project the data into a higher-dimensional space (if $n > d$)

linear regression with the high-dimensional data points $\phi(x)$ leads to a hypothesis $h(x) = w^T \phi(x)$ that is non-linear in $x$

selection of informative basis functions is a difficult task

polynomial bases take combinations (products) of existing features

Gaussian bases generate a new feature mapping

Page 48

Next steps

next lecture: Regression II with kernel methods and Bayesian regression, on Friday 22.9.2017 at 10:15

DL book: read chapters 5.1 and 5.2 on linear regression

more information about basis functions

Hastie's book¹: chapters 3.2 & 5
Bishop's book²: chapter 3.1

fill out the post-lecture questionnaire in MyCourses!

we read and appreciate all feedback

¹ Elements of Statistical Learning, Springer. Download at https://web.stanford.edu/~hastie/ElemStatLearn
² Pattern Recognition and Machine Learning, Springer, 2006