Advanced Lecture on
Neural Information Processing Systems
(Lecture 03)
Ichiro Takeuchi
Nagoya Institute of Technology
Ichiro Takeuchi, Nagoya Institute of Technology 1/1
Nonlinear modeling
Consider training a model for the relationship between the elapsed time after a collision ($x$) and the passenger's head acceleration ($y$).
Nonlinear modeling
Linear modeling is not helpful here
We want something like this
[Figure: Acceleration [G] (vertical axis, -150 to 100) plotted against Time [ms] (horizontal axis, 0 to 60).]
Which nonlinear model should we use?
Consider the single-input case $x \in \mathbb{R}$, $y \in \mathbb{R}$:
▶ $y = w_1 \log x$
▶ $y = w_1 \sqrt{x} + w_2 \exp(-x^2)$
▶ $y = w_1 \cos 2\pi x + w_2 \sin 2\pi x^2 + w_3 \frac{1}{x}$
▶ $y = \log(w_1 + w_2 x)$
▶ $y = w_1 + x \exp(-w_2 x^2)$
▶ $y = \sin 2\pi (w_1 + w_2 x) + \cos 2\pi (w_3 + w_4 x)$
What is the difference between the first three and the last three models?
Basis function approach
▶ For the single-input case, i.e., when $x \in \mathbb{R}$, the basis function model is written as
$$y = f(x) = w_0 + w_1 h_1(x) + w_2 h_2(x) + \dots + w_q h_q(x),$$
where $h_k$, $k = 1, \dots, q$, is a basis function.
▶ How can we estimate the parameters $w_0, w_1, \dots, w_q$ by the least squares method?
$$\min_{w_0 \in \mathbb{R},\, \mathbf{w} \in \mathbb{R}^q} \sum_{i=1}^{n} \Bigl( y_i - \Bigl( w_0 + \sum_{j=1}^{q} w_j h_j(x_i) \Bigr) \Bigr)^2$$
Basis function approach as linear models
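▶ Original training set
$$\underset{n \times 1}{X} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}, \quad \mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$$
▶ Expanded training set
$$\underset{n \times (q+1)}{X} = \begin{bmatrix} 1 & h_1(x_1) & h_2(x_1) & \cdots & h_q(x_1) \\ 1 & h_1(x_2) & h_2(x_2) & \cdots & h_q(x_2) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & h_1(x_n) & h_2(x_n) & \cdots & h_q(x_n) \end{bmatrix}, \quad \mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$$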
Basis function approach and linear model
▶ Basis function approach
$$y = f(x) = w_0 \cdot 1 + w_1 h_1(x) + w_2 h_2(x) + \dots + w_q h_q(x)$$
▶ Linear regression with multiple inputs
$$y = f(\mathbf{x}) = w_0 \cdot 1 + w_1 x_1 + w_2 x_2 + \dots + w_q x_q$$
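Because the basis function model is linear in its weights, an ordinary least squares solver can fit it once the inputs are expanded. The following is a minimal sketch, not the lecture's own code: the helper names `design_matrix` and `fit_least_squares`, the choice of basis ($\sqrt{x}$ and $\exp(-x^2)$), and the noise-free synthetic data are all assumptions made for illustration.

```python
import numpy as np

def design_matrix(x, basis_functions):
    """Stack a column of ones and h_1(x), ..., h_q(x) into an n x (q+1) matrix."""
    columns = [np.ones_like(x)] + [h(x) for h in basis_functions]
    return np.column_stack(columns)

def fit_least_squares(x, y, basis_functions):
    """Return (w_0, w_1, ..., w_q) minimizing the sum of squared residuals."""
    H = design_matrix(x, basis_functions)
    w, *_ = np.linalg.lstsq(H, y, rcond=None)
    return w

# Illustrative basis (an assumption): h_1(x) = sqrt(x), h_2(x) = exp(-x^2)
basis = [np.sqrt, lambda x: np.exp(-x**2)]

# Noise-free synthetic data generated from known weights (1.0, 2.0, -0.5)
x = np.linspace(0.1, 3.0, 30)
y = 1.0 + 2.0 * np.sqrt(x) - 0.5 * np.exp(-x**2)

w_hat = fit_least_squares(x, y, basis)
```

With noise-free data the fitted weights should match the generating weights up to numerical precision, which is a quick way to check that the expansion was built correctly.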
Which basis functions should we use?
▶ Radial basis function
$$h_k(x) = \exp\left( -\frac{(x - c_k)^2}{2\sigma^2} \right)$$
[Figure: basis function values $h_q(x)$ (vertical axis, 0 to 1) versus input $x$ (horizontal axis, 0 to 100).]
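The radial basis function can be evaluated for all inputs and all centers at once. This is a small sketch under assumed names: `rbf_design_matrix`, the three centers, and the width `sigma=10.0` are illustrative choices, not values from the lecture.

```python
import numpy as np

def rbf_design_matrix(x, centers, sigma):
    """Return the n x (q+1) matrix [1, h_1(x), ..., h_q(x)] with Gaussian RBFs."""
    x = np.asarray(x, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Pairwise differences x_i - c_k via broadcasting, shape (n, q)
    diff = x[:, None] - centers[None, :]
    H = np.exp(-diff**2 / (2.0 * sigma**2))
    # Prepend the constant column that multiplies w_0
    return np.column_stack([np.ones(len(x)), H])

# Illustrative inputs and centers on the 0..100 range used in the figure
x = np.array([0.0, 50.0, 100.0])
centers = np.array([0.0, 50.0, 100.0])
Phi = rbf_design_matrix(x, centers, sigma=10.0)
```

Each basis function equals 1 at its own center and decays with distance, so a small `sigma` gives narrow bumps and a large `sigma` gives broad, overlapping ones.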
How to determine $q$, $\{c_k\}_{k=1}^{q}$, $\sigma^2$ in RBF
▶ Approach 1
  ▶ $q \leftarrow n$
  ▶ $c_k \leftarrow x_k$, $k = 1, \dots, q$
  ▶ $\sigma^2 \leftarrow$ cross-validation (explained later)
[Figure: basis function values $h_q(x)$ (vertical axis, 0 to 1) versus input $x$ (horizontal axis, 0 to 100).]
How to determine $q$, $\{c_k\}_{k=1}^{q}$, $\sigma^2$ in RBF
▶ Approach 2
  ▶ $q \leftarrow$ cross-validation
  ▶ $c_k \leftarrow$ the $(k/n)$th quantile of $\{x_i\}_{i=1}^{n}$
  ▶ $\sigma^2 \leftarrow$ cross-validation
[Figure: basis function values $h_q(x)$ (vertical axis, 0 to 1) versus input $x$ (horizontal axis, 0 to 100).]
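The two ways of placing the centers can be sketched as follows. The function names are hypothetical. For Approach 2, the slide places $c_k$ at the $(k/n)$th quantile; the helper below instead assumes evenly spaced quantile levels $k/(q+1)$, which is an illustrative choice and not taken from the lecture.

```python
import numpy as np

def centers_all_points(x):
    """Approach 1: one center per training input, so q = n."""
    return np.asarray(x, dtype=float).copy()

def centers_from_quantiles(x, q):
    """Approach 2 (assumed variant): q centers at evenly spaced quantile levels."""
    # Assumption: levels k / (q + 1) for k = 1, ..., q
    levels = np.arange(1, q + 1) / (q + 1)
    return np.quantile(x, levels)

# Illustrative inputs 0, 1, ..., 100
x = np.arange(0.0, 101.0)
c_all = centers_all_points(x)
c_quant = centers_from_quantiles(x, q=3)
```

Quantile-based centers follow the density of the inputs, so regions with many observations receive more basis functions than sparse regions.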
RBF Approach for Collision Data
▶ If we select good hyper-parameters $(q, \{c_k\}_{k=1}^{q}, \sigma^2)$
[Figure: Acceleration [G] (vertical axis, -150 to 100) plotted against Time [ms] (horizontal axis, 0 to 60).]
Overfitting
▶ If we do not select good hyper-parameters $(q, \{c_k\}_{k=1}^{q}, \sigma^2)$
Simulation Example for RBF
[Figure: four panels for $q = 1$, $q = 10$, $q = 20$, and $q = 50$, each plotting output $y$ (-2 to 2) against input $x$ (-1 to 1) with two curves, Truth and Estimated.]
Training Error and True Error
[Figure: error (vertical axis, 0 to 0.7) versus the number of basis functions $q$ (1, 2, 4, 5, 10, 20, 40, 50), with two curves, Training Error and True Error.]
High dimensional problem
E.g. Gene expression microarray
▶ $x_{ij}$: activity of the $j$th gene for the $i$th patient
▶ $y_i$: effectiveness of a medicine
$$y_i = f(\mathbf{x}_i) = w_0 + w_1 x_{i1} + \dots + w_{10000} x_{i,10000}$$
How to avoid overfitting: Regularization
$$\min_{\mathbf{w} \in \mathbb{R}^d} \sum_{i=1}^{n} (y_i - \mathbf{w}^\top \mathbf{x}_i)^2 \quad \text{subject to} \quad \sum_{j=1}^{d} w_j^2 \le s$$
Ridge regression
$$\mathbf{w}^*_\lambda = \arg\min_{\mathbf{w} \in \mathbb{R}^d} \sum_{i=1}^{n} (y_i - \mathbf{w}^\top \mathbf{x}_i)^2 + \lambda \sum_{j=1}^{d} w_j^2,$$
where $\lambda > 0$ is the regularization parameter.
Simulation Example for Ridge Regression
[Figure: four panels for $\lambda = 0$, $\lambda = 1.0$, $\lambda = 10$, and $\lambda = 100$ (all with $q = 50$), each plotting output $y$ (-2 to 2) against input $x$ (-1 to 1) with two curves, Truth and Estimated.]
Solving Ridge regression
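▶ Training data
$$\underset{n \times d}{X} := \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nd} \end{bmatrix} = \begin{bmatrix} \mathbf{x}_1^\top \\ \mathbf{x}_2^\top \\ \vdots \\ \mathbf{x}_n^\top \end{bmatrix}, \quad \underset{n \times 1}{\mathbf{y}} := \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$$
▶ Solution
$$\mathbf{w}^*_\lambda = (X^\top X + \lambda I)^{-1} X^\top \mathbf{y}$$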
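The closed-form ridge solution translates directly into a few lines of NumPy. This is a minimal sketch: the name `ridge_fit` and the synthetic data are assumptions, and the linear system is solved with `np.linalg.solve` instead of forming the inverse explicitly, which is a common numerical choice and not something the slide prescribes.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Solve (X^T X + lam * I) w = X^T y for w."""
    d = X.shape[1]
    A = X.T @ X + lam * np.eye(d)
    return np.linalg.solve(A, X.T @ y)

# Noise-free synthetic data from known weights (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, -2.0, 0.5])

w_ols = ridge_fit(X, y, lam=0.0)      # no regularization
w_ridge = ridge_fit(X, y, lam=100.0)  # strong regularization
```

Increasing `lam` shrinks the weight vector toward zero, which matches the smoother estimates in the simulation panels for larger $\lambda$.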
Model selection
▶ Example: how to select the regularization parameter λ
▶ Training error cannot be used for model selection because it cannot detect over-training (as we will see).
Training and validation data
[Figure: output $y$ (-2 to 2) versus input $x$ (-1 to 1), a scatter plot with two kinds of markers, training data and validation data.]
Training and validation data
[Figure: error (vertical axis, 0 to 0.7) versus the number of basis functions $q$ (1, 2, 4, 5, 10, 20, 40, 50), with three curves, Training Error, True Error, and Validation Error.]
▶ Training error monotonically decreases
▶ Validation error can be used as a proxy of the true error
Cross-validation
[Figure: five cross-validation rounds, R1 to R5; in each round the data are split into training data and validation data.]
▶ The model hyper-parameters ($q$, $\lambda$, etc.) are selected based on the average validation error.
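Selecting $\lambda$ by the average validation error can be sketched as below. Everything here is an assumption made for illustration: the function names, the random fold assignment, the use of the mean squared error per fold, the candidate grid, and the synthetic data. Ridge regression stands in for whichever model is being tuned.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def cv_error(X, y, lam, n_folds=5, seed=0):
    """Average validation mean squared error over n_folds rounds."""
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), n_folds)
    errors = []
    for val_idx in folds:
        # Train on everything except the current validation fold
        train_mask = np.ones(n, dtype=bool)
        train_mask[val_idx] = False
        w = ridge_fit(X[train_mask], y[train_mask], lam)
        residual = y[val_idx] - X[val_idx] @ w
        errors.append(np.mean(residual**2))
    return float(np.mean(errors))

def select_lambda(X, y, candidates, n_folds=5):
    """Return the candidate with the smallest average validation error."""
    scores = [cv_error(X, y, lam, n_folds) for lam in candidates]
    return candidates[int(np.argmin(scores))], scores

# Synthetic data (illustrative only)
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=50)

candidates = [0.01, 1.0, 100.0, 10000.0]
best_lam, scores = select_lambda(X, y, candidates)
```

The same fold assignment is reused for every candidate (fixed `seed`), so the candidates are compared on identical splits.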
Cross-validation example
[Figure: five panels, Round 1 to Round 5, each plotting output $y$ (-2 to 2) against input $x$ (-1 to 1) with training data and validation data marked.]
Leave-one-out cross-validation (LOOCV)
[Figure: leave-one-out rounds R1, R2, ..., R n-1, R n, each marking training data and validation data.]
Final exercise I
Given the data $\{(x_i, y_i)\}_{i=1}^{n}$, consider a constant model that does not use the input $x$ (not useful in practice):
$$f(x) = w_0.$$
The parameter $w_0$ is estimated by solving the following minimization problem:
$$\arg\min_{w_0 \in \mathbb{R}} \sum_{i=1}^{n} (y_i - f(x_i))^2 = \arg\min_{w_0 \in \mathbb{R}} \sum_{i=1}^{n} (y_i - w_0)^2$$
▶ First, show that the optimal solution of the above problem is the sample mean, i.e.,
$$\arg\min_{w_0 \in \mathbb{R}} \sum_{i=1}^{n} (y_i - w_0)^2 = \frac{1}{n} \sum_{i=1}^{n} y_i$$
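One possible route for this first step, offered as a hint and not as the lecture's official solution: write the objective as a function $J(w_0)$ (the name $J$ is introduced here for convenience), set its derivative to zero, and confirm convexity from the second derivative.

```latex
\begin{align*}
J(w_0) &= \sum_{i=1}^{n} (y_i - w_0)^2, \\
\frac{dJ}{dw_0} &= -2 \sum_{i=1}^{n} (y_i - w_0)
  = -2 \Bigl( \sum_{i=1}^{n} y_i - n w_0 \Bigr) = 0
  \;\Longrightarrow\; w_0 = \frac{1}{n} \sum_{i=1}^{n} y_i, \\
\frac{d^2 J}{dw_0^2} &= 2n > 0 .
\end{align*}
```

Since the second derivative is positive, the stationary point is the unique minimizer.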
Final exercise II
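▶ Next, confirm that the training error and the LOOCV error of the constant model are respectively written as
$$\mathrm{TrainEr} := \sum_{i=1}^{n} \Bigl( y_i - \arg\min_{w_0 \in \mathbb{R}} \sum_{j=1}^{n} (y_j - w_0)^2 \Bigr)^2 = \sum_{i=1}^{n} (y_i - \bar{y})^2,$$
$$\mathrm{LoocvEr} := \sum_{i=1}^{n} \Bigl( y_i - \arg\min_{w_0 \in \mathbb{R}} \sum_{j \neq i} (y_j - w_0)^2 \Bigr)^2 = \sum_{i=1}^{n} \Bigl( y_i - \frac{1}{n-1} \sum_{j \neq i} y_j \Bigr)^2,$$
where $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$ is the sample mean.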
Final exercise III
▶ Finally, show that the relation between these two errors is written as
$$\mathrm{LoocvEr} = \left( \frac{n}{n-1} \right)^2 \mathrm{TrainEr}.$$
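Before attempting the proof, the claimed relation can be checked numerically on a small sample. This is only a sanity check under assumed names (`train_error`, `loocv_error`) and an arbitrary five-point data set; it does not replace the derivation.

```python
import numpy as np

def train_error(y):
    """Sum of squared deviations from the sample mean."""
    return float(np.sum((y - y.mean())**2))

def loocv_error(y):
    """Sum of squared errors when each y_i is predicted by the mean of the others."""
    n = len(y)
    # Mean of all samples except the i-th, computed for every i at once
    loo_means = (y.sum() - y) / (n - 1)
    return float(np.sum((y - loo_means)**2))

# Arbitrary illustrative data
y = np.array([1.0, 4.0, 2.0, 8.0, 5.0])
n = len(y)
ratio = loocv_error(y) / train_error(y)
expected = (n / (n - 1))**2
```

Here `ratio` and `expected` agree up to floating-point error, consistent with the identity each leave-one-out residual being $\frac{n}{n-1}$ times the corresponding training residual.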