Advanced Lecture on Neural Information Processing Systems ...takeuchi/T/NIPm/NipM03_web.pdfAdvanced...

Advanced Lecture on

Neural Information Processing Systems

(Lecture 03)

Ichiro Takeuchi

Nagoya Institute of Technology


Nonlinear modeling

Consider training a model for the relationship between the elapsed time after a collision (x) and the passenger's head acceleration (y)


Nonlinear modeling


Linear modeling is not helpful here


We want something like this

[Figure: acceleration [G] versus time [ms]]


Which nonlinear model should we use?

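Consider the single-input case $x \in \mathbb{R}$, $y \in \mathbb{R}$:

▶ $y = w_1 \log x$

▶ $y = w_1 \sqrt{x} + w_2 \exp(-x^2)$

▶ $y = w_1 \cos 2\pi x + w_2 \sin 2\pi x^2 + w_3 \frac{1}{x}$

▶ $y = \log(w_1 + w_2 x)$

▶ $y = \frac{w_1 + x}{\exp(-w_2 x^2)}$

▶ $y = \sin 2\pi(w_1 + w_2 x) + \cos 2\pi(w_3 + w_4 x)$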

What is the difference between the first three and the last three models?


Basis function approach

▶ For the single-input case, i.e., when $x \in \mathbb{R}$, a basis function model is written as

$$y = f(x) = w_0 + w_1 h_1(x) + w_2 h_2(x) + \dots + w_q h_q(x),$$

where $h_k$, $k = 1, \dots, q$, is a basis function.

▶ How can we estimate the parameters $w_0, w_1, \dots, w_q$ by the least squares method?

$$\min_{w_0 \in \mathbb{R},\, w \in \mathbb{R}^q} \sum_{i=1}^{n} \Bigl( y_i - \bigl( w_0 + \sum_{j=1}^{q} w_j h_j(x_i) \bigr) \Bigr)^2$$


Basis function approach as linear models

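▶ Original training set

$$\underset{n \times 1}{X} = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}, \quad y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}$$

▶ Expanded training set

$$\underset{n \times (q+1)}{X} = \begin{pmatrix} 1 & h_1(x_1) & h_2(x_1) & \cdots & h_q(x_1) \\ 1 & h_1(x_2) & h_2(x_2) & \cdots & h_q(x_2) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & h_1(x_n) & h_2(x_n) & \cdots & h_q(x_n) \end{pmatrix}, \quad y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}$$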


Basis function approach and linear model

▶ Basis function approach

$$y = f(x) = w_0 \cdot 1 + w_1 h_1(x) + w_2 h_2(x) + \dots + w_q h_q(x)$$

▶ Linear regression with multiple inputs

$$y = f(x) = w_0 \cdot 1 + w_1 x_1 + w_2 x_2 + \dots + w_q x_q$$
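
Because the basis function model is linear in the weights, it can be fitted with an ordinary least squares solver applied to the expanded inputs. The following is a minimal sketch, assuming NumPy is available; the helper names `design_matrix` and `fit_least_squares`, the two basis functions, and the weights are hypothetical choices made only for illustration.

```python
import numpy as np

def design_matrix(x, basis_functions):
    """Stack a column of ones and h_1(x), ..., h_q(x) into an n x (q+1) matrix."""
    x = np.asarray(x, dtype=float)
    columns = [np.ones_like(x)] + [h(x) for h in basis_functions]
    return np.column_stack(columns)

def fit_least_squares(x, y, basis_functions):
    """Return (w_0, w_1, ..., w_q) minimizing the sum of squared residuals."""
    H = design_matrix(x, basis_functions)
    w, *_ = np.linalg.lstsq(H, np.asarray(y, dtype=float), rcond=None)
    return w

# Hypothetical example: basis functions and weights are made up for illustration.
basis = [np.sqrt, lambda x: np.exp(-x ** 2)]
x_train = np.linspace(0.1, 3.0, 30)
w_true = np.array([0.5, 2.0, -1.0])
y_train = design_matrix(x_train, basis) @ w_true   # noise-free targets
w_hat = fit_least_squares(x_train, y_train, basis)
print(w_hat)
```

With noise-free targets the fitted weights should agree with `w_true` up to numerical precision, which is a quick way to check that the expanded matrix was built as intended.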


Which basis functions should we use?

▶ Radial basis function

$$h_k(x) = \exp\left( -\frac{(x - c_k)^2}{2\sigma^2} \right)$$

[Figure: basis function values $h_q(x)$ versus input $x$]
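
The radial basis function above can be evaluated for all inputs and all centers at once. This is a sketch under stated assumptions: the function name `rbf_features`, the six evenly spaced centers, and the width `sigma=10.0` are hypothetical and only loosely follow the 0 to 100 input range of the plot.

```python
import numpy as np

def rbf_features(x, centers, sigma):
    """Return the n x q matrix with entries exp(-(x_i - c_k)^2 / (2 sigma^2))."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)         # shape (n, 1)
    c = np.asarray(centers, dtype=float).reshape(1, -1)   # shape (1, q)
    return np.exp(-((x - c) ** 2) / (2.0 * sigma ** 2))   # broadcasts to (n, q)

# Hypothetical centers and width, chosen only for illustration.
centers = np.linspace(0.0, 100.0, 6)    # 0, 20, 40, 60, 80, 100
H = rbf_features([0.0, 10.0, 50.0], centers, sigma=10.0)
print(H.shape)
```

Each column of `H` is one basis function evaluated at every input, so prepending a column of ones gives the expanded training set used for least squares.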


How to determine $q$, $\{c_k\}_{k=1}^{q}$, $\sigma^2$ in RBF

▶ Approach 1

  ▶ $q \leftarrow n$

  ▶ $c_k \leftarrow x_k$, $k = 1, \dots, q$

  ▶ $\sigma^2 \leftarrow$ cross validation (explained later)

[Figure: basis function values $h_q(x)$ versus input $x$]


How to determine $q$, $\{c_k\}_{k=1}^{q}$, $\sigma^2$ in RBF

▶ Approach 2

  ▶ $q \leftarrow$ cross validation

  ▶ $c_k \leftarrow \left(\frac{k}{n}\right)$th quantile of $\{x_i\}_{i=1}^{n}$

  ▶ $\sigma^2 \leftarrow$ cross validation

[Figure: basis function values $h_q(x)$ versus input $x$]
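
The two ways of placing the centers can be sketched as follows. This is an illustration only: the function names and the input values are hypothetical, and the quantile levels are passed in explicitly; the evenly spaced levels $k/(q+1)$ in the example are an assumption of this sketch, not the rule stated on the slide.

```python
import numpy as np

def centers_at_inputs(x):
    """Approach 1: one center per training input, so q = n."""
    return np.asarray(x, dtype=float).copy()

def centers_at_quantiles(x, levels):
    """Approach 2: centers at the given quantile levels (each in [0, 1]) of the inputs."""
    return np.quantile(np.asarray(x, dtype=float), levels)

# Made-up inputs for illustration.
x_train = np.array([3.0, 8.0, 15.0, 21.0, 40.0, 55.0, 70.0, 90.0])
q = 4
levels = np.arange(1, q + 1) / (q + 1)   # assumed levels: 0.2, 0.4, 0.6, 0.8
c1 = centers_at_inputs(x_train)
c2 = centers_at_quantiles(x_train, levels)
print(len(c1), c2)
```

Quantile-based centers follow the density of the inputs, so regions with many observations receive more basis functions than sparse regions.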


RBF Approach for Collision Data

▶ If we select good hyper-parameters $(q, \{c_k\}_{k=1}^{q}, \sigma^2)$

[Figure: acceleration [G] versus time [ms]]


Overfitting

▶ If we do not select good hyper-parameters $(q, \{c_k\}_{k=1}^{q}, \sigma^2)$


Simulation Example for RBF

[Figure: four panels (q = 1, q = 10, q = 20, q = 50), each plotting output y against input x with the truth and the estimated curve]


Training Error and True Error

[Figure: training error and true error versus the number of basis functions q (q = 1, 2, 4, 5, 10, 20, 40, 50)]


High dimensional problem

E.g. Gene expression microarray

▶ $x_{ij}$: activity of the $j$th gene for the $i$th patient

▶ $y_i$: effectiveness of a medicine

$$y_i = f(x_i) = w_0 + w_1 x_{i1} + \dots + w_{10000} x_{i,10000}$$


How to avoid overfitting: Regularization

$$\min_{w \in \mathbb{R}^d} \sum_{i=1}^{n} (y_i - w^\top x_i)^2 \quad \text{subject to} \quad \sum_{j=1}^{d} w_j^2 \le s$$


Ridge regression

$$w^*_\lambda = \arg\min_{w \in \mathbb{R}^d} \sum_{i=1}^{n} (y_i - w^\top x_i)^2 + \lambda \sum_{j=1}^{d} w_j^2,$$

where $\lambda > 0$ is the regularization parameter.


Simulation Example for Ridge Regression

[Figure: four panels (λ = 0, λ = 1.0, λ = 10, λ = 100, all with q = 50), each plotting output y against input x with the truth and the estimated curve]


Solving Ridge regression

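▶ Training data

$$\underset{n \times d}{X} := \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nd} \end{pmatrix} = \begin{pmatrix} x_1^\top \\ x_2^\top \\ \vdots \\ x_n^\top \end{pmatrix}, \quad \underset{n \times 1}{y} := \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}$$

▶ Solution

$$w^*_\lambda = (X^\top X + \lambda I)^{-1} X^\top y$$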
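
The closed-form solution can be computed by solving the linear system $(X^\top X + \lambda I) w = X^\top y$. The following is a minimal sketch, assuming NumPy; the function name `ridge_fit` and the random data are hypothetical. The last lines evaluate the gradient of the ridge objective, $-2 X^\top (y - Xw) + 2\lambda w$, which should vanish at the minimizer.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Solve (X^T X + lam * I) w = X^T y for w."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    d = X.shape[1]
    A = X.T @ X + lam * np.eye(d)
    return np.linalg.solve(A, X.T @ y)   # avoids forming the explicit inverse

# Random data, made up for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = rng.normal(size=20)
lam = 1.0
w = ridge_fit(X, y, lam)

# Gradient of the ridge objective at w; it should be (numerically) zero.
grad = -2.0 * X.T @ (y - X @ w) + 2.0 * lam * w
print(np.max(np.abs(grad)))
```

Using `np.linalg.solve` rather than an explicit matrix inverse is a common numerical choice; for $\lambda > 0$ the matrix $X^\top X + \lambda I$ is positive definite, so the system has a unique solution.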


Model selection

▶ Example: how to select the regularization parameter λ

▶ Training error cannot be used for model selection because it cannot detect over-training (as we will see).


Training and validation data

[Figure: output y versus input x, with each point marked as training data or validation data]


Training and validation data

[Figure: training error, true error, and validation error versus the number of basis functions q (q = 1, 2, 4, 5, 10, 20, 40, 50)]

▶ Training error decreases monotonically as q increases

▶ Validation error can be used as a proxy for the true error


Cross-validation

[Figure: five rounds R1 to R5, each splitting the data into training data and validation data]

▶ The model hyper-parameters (q, λ etc.) are selected based on the average validation error.
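
Selecting a hyper-parameter by the average validation error can be sketched as below for the ridge parameter λ. This is an illustration under stated assumptions, not the lecture's implementation: the function names `ridge_fit`, `cv_error`, and `select_lambda` are hypothetical, the data are synthetic, the candidate values are made up, and the validation error of each round is taken to be the mean squared error.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution of (X^T X + lam * I) w = X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def cv_error(X, y, lam, n_folds=5, seed=0):
    """Average validation mean squared error of ridge regression over the folds."""
    n = X.shape[0]
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, n_folds)
    errors = []
    for val_idx in folds:
        train_idx = np.setdiff1d(idx, val_idx)   # everything not held out
        w = ridge_fit(X[train_idx], y[train_idx], lam)
        residual = y[val_idx] - X[val_idx] @ w
        errors.append(np.mean(residual ** 2))
    return float(np.mean(errors))

def select_lambda(X, y, candidates, n_folds=5):
    """Return the candidate with the smallest average validation error."""
    scores = [cv_error(X, y, lam, n_folds) for lam in candidates]
    return candidates[int(np.argmin(scores))], scores

# Synthetic data, made up for illustration only.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 10))
y = X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=40)
candidates = [0.01, 0.1, 1.0, 10.0, 100.0]
best_lam, scores = select_lambda(X, y, candidates)
print(best_lam)
```

The same fixed seed is used for every candidate so that all values of λ are compared on identical splits. Setting `n_folds` equal to the number of samples makes every fold a single point, which corresponds to leave-one-out cross-validation.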


Cross-validation example

[Figure: five panels (Round 1 to Round 5), each plotting output y against input x with training data and validation data marked]


Leave-one-out cross-validation (LOOCV)

[Figure: rounds R1, R2, ..., R n-1, R n, each splitting the data into training data and validation data]


Final exercise I

Given the data $\{(x_i, y_i)\}_{i=1}^{n}$, consider a constant model that does not use the input $x$ (not useful in practice):

$$f(x) = w_0.$$

The parameter $w_0$ is estimated by solving the following minimization problem:

$$\arg\min_{w_0 \in \mathbb{R}} \sum_{i=1}^{n} (y_i - f(x_i))^2 = \arg\min_{w_0 \in \mathbb{R}} \sum_{i=1}^{n} (y_i - w_0)^2$$

▶ First, show that the optimal solution of the above problem is the sample mean, i.e.,

$$\arg\min_{w_0 \in \mathbb{R}} \sum_{i=1}^{n} (y_i - w_0)^2 = \frac{1}{n} \sum_{i=1}^{n} y_i$$


Final exercise II

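▶ Next, confirm that the training error and the LOOCV error of the constant model are respectively written as

$$\mathrm{TrainEr} := \sum_{i=1}^{n} \Bigl( y_i - \arg\min_{w_0 \in \mathbb{R}} \sum_{j=1}^{n} (y_j - w_0)^2 \Bigr)^2 = \sum_{i=1}^{n} (y_i - \bar{y})^2,$$

$$\mathrm{LoocvEr} := \sum_{i=1}^{n} \Bigl( y_i - \arg\min_{w_0 \in \mathbb{R}} \sum_{j \ne i} (y_j - w_0)^2 \Bigr)^2 = \sum_{i=1}^{n} \Bigl( y_i - \frac{1}{n-1} \sum_{j \ne i} y_j \Bigr)^2,$$

where $\bar{y} := \frac{1}{n} \sum_{i=1}^{n} y_i$ denotes the sample mean.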


Final exercise III

▶ Finally, show that the relation between these two errors is written as

$$\mathrm{LoocvEr} = \left( \frac{n}{n-1} \right)^2 \mathrm{TrainEr}.$$
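
Before attempting the proof, the relation can be checked numerically. The sketch below is only a sanity check on made-up data, not a derivation; the function names `train_error` and `loocv_error` and the five data values are hypothetical.

```python
import numpy as np

def train_error(y):
    """Sum of squared residuals of the constant model fitted on all n points."""
    return float(np.sum((y - y.mean()) ** 2))

def loocv_error(y):
    """Sum of squared residuals when each y_i is predicted by the mean of the others."""
    n = len(y)
    loo_means = (y.sum() - y) / (n - 1)   # mean of all y_j with j != i
    return float(np.sum((y - loo_means) ** 2))

# Made-up data for illustration.
y = np.array([1.0, 4.0, 2.0, 7.0, 6.0])
n = len(y)
ratio = loocv_error(y) / train_error(y)
print(ratio, (n / (n - 1)) ** 2)
```

For these five values the training error is 26 and the LOOCV error is 40.625, so the ratio is 1.5625, which equals $(5/4)^2$.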

