NONLINEAR LOW-DIMENSIONAL REGRESSION
USING AUXILIARY COORDINATES
Weiran Wang and Miguel A. Carreira-Perpiñán
EECS, University of California, Merced
1 Abstract
When doing regression with inputs and outputs that are high-dimensional, it often makes sense to reduce the dimensionality of the inputs before mapping to the outputs. We propose a method where both the dimensionality reduction and the regression mapping can be nonlinear and are estimated jointly. Our key idea is to define an objective function where the low-dimensional coordinates are free parameters, in addition to the dimensionality reduction and the regression mapping. This has the effect of decoupling many groups of parameters from each other, affording a more effective optimization and the use of good initializations from other methods.
Work funded by NSF CAREER award IIS–0754089.
2 Low-dimensional regression using auxiliary coordinates
Given a training set X (Dx × N) and Y (Dy × N), instead of directly optimizing

  E1(F, g) = Σ_{n=1}^N ‖y_n − g(F(x_n))‖² + λ_g R(g) + λ_F R(F)

with λ_F, λ_g ≥ 0 for the dimension reduction mapping F and the regression mapping g, we let the low-dimensional coordinates Z (Dz × N) = (z_1, …, z_N) be independent, auxiliary parameters to be optimized over, and unfold the squared error into two terms that decouple given Z:

  E2(F, g, Z) = Σ_{n=1}^N ‖y_n − g(z_n)‖² + λ_g R(g) + Σ_{n=1}^N ‖z_n − F(x_n)‖² + λ_F R(F).
Now every squared error involves only a shallow mapping, compared to the deeper nesting in the function g ∘ F, which leads to ill-conditioning. We apply the following alternating optimization procedure to solve the problem.
Given X (Dx × N), Y (Dy × N), and an initialization Z (Dz × N):
repeat
  1. Optimize over g:  min_g Σ_{n=1}^N ‖y_n − g(z_n)‖² + λ_g R(g)
  2. Optimize over F:  min_F Σ_{n=1}^N ‖z_n − F(x_n)‖² + λ_F R(F)
  3. Optimize over Z:  min_Z Σ_{n=1}^N ‖y_n − g(z_n)‖² + Σ_{n=1}^N ‖z_n − F(x_n)‖²
until stop
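The alternating scheme above can be sketched for the simplest case where both F and g are linear maps (a hypothetical simplification for illustration; the poster also uses RBF mappings, and then each step is solved as described in the following sections):

```python
import numpy as np

def fit_aux_coords(X, Y, Z, lam_F=1e-3, lam_g=1e-3, n_iters=50):
    """Alternating optimization of E2 with linear F and g.

    X: Dx x N inputs, Y: Dy x N outputs, Z: Dz x N initial coordinates.
    Returns the linear maps Wf (F(x) = Wf x), Wg (g(z) = Wg z), and Z.
    """
    Dz = Z.shape[0]
    for _ in range(n_iters):
        # Step 1: ridge regression of Y on Z (solves the g-step exactly).
        Wg = Y @ Z.T @ np.linalg.inv(Z @ Z.T + lam_g * np.eye(Dz))
        # Step 2: ridge regression of Z on X (solves the F-step exactly).
        Wf = Z @ X.T @ np.linalg.inv(X @ X.T + lam_F * np.eye(X.shape[0]))
        # Step 3: with linear g each z_n has the closed form
        # (Wg^T Wg + I) z_n = Wg^T y_n + F(x_n); solve for all N at once.
        Z = np.linalg.solve(Wg.T @ Wg + np.eye(Dz), Wg.T @ Y + Wf @ X)
    return Wf, Wg, Z
```

Since each of the three steps exactly minimizes E2 over its own block of parameters, the objective is nonincreasing across iterations.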
3 Optimization over F and g
We use shallow mappings (linear or RBFs) for F and g.
• Linear g: a direct regression acting on the lower input dimension Dz; reduces to a least-squares problem.
• RBF g: g(z) = WΦ(z) with M ≤ N Gaussian RBFs φ_m(z) = exp(−‖z − µ_m‖² / (2σ²)), and R(g) = ‖W‖² is a quadratic regularizer on the weights.
– Centers µ_m are chosen by k-means on Z (once every few iterations, initialized at the previous centers).
– Weights W have a unique solution given by a linear system.
– Time complexity: O(NM(M + Dz)), linear in training set size.
– Space complexity: O(M(Dy +Dz)).
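The unique RBF weight solution can be sketched as follows (a minimal sketch; the k-means center placement is replaced here by a given, fixed set of centers):

```python
import numpy as np

def rbf_design(Z, centers, sigma):
    """Phi[m, n] = exp(-||z_n - mu_m||^2 / (2 sigma^2)); Z is Dz x N,
    centers is Dz x M, so Phi is M x N."""
    d2 = np.sum((centers.T[:, None, :] - Z.T[None, :, :]) ** 2, axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def fit_rbf_weights(Y, Z, centers, sigma, lam=1e-2):
    """Solve min_W ||Y - W Phi||^2 + lam ||W||^2 via its normal
    equations (Phi Phi^T + lam I) W^T = Phi Y^T, a size-M linear system."""
    Phi = rbf_design(Z, centers, sigma)
    M = Phi.shape[0]
    W = np.linalg.solve(Phi @ Phi.T + lam * np.eye(M), Phi @ Y.T).T
    return W  # Dy x M
```

Forming Phi Phi^T costs O(NM²), which dominates and gives the O(NM(M + Dz)) complexity quoted above.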
4 Optimization over Z
• For fixed g and F, optimization of the objective function decouples over each z_n ∈ R^Dz.
• We have N independent nonlinear minimizations, each over Dz parameters, of the form

  min_{z ∈ R^Dz} E(z) = ‖y − g(z)‖² + ‖z − F(x)‖².
• If g is linear, then z can be found in closed form by solving a linear system of size Dz.
• If g is nonlinear, we use the Gauss-Newton method with line search.
• Cost over all of Z: O(N Dz² Dy), linear in training set size.
• The distribution of the coordinates Z changes dramatically in the first few iterations, while the error decreases quickly; after that, Z changes little.
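A Gauss-Newton z-step with backtracking line search might look like the sketch below (the mapping `g` and its Jacobian `Jg` are assumed given, e.g. an RBF expansion; these names are illustrative):

```python
import numpy as np

def z_step(z, y, Fx, g, Jg, n_steps=20, tol=1e-10):
    """Minimize E(z) = ||y - g(z)||^2 + ||z - F(x)||^2 over one z_n
    by Gauss-Newton with backtracking line search."""
    E = lambda zz: np.sum((y - g(zz)) ** 2) + np.sum((zz - Fx) ** 2)
    for _ in range(n_steps):
        r = np.concatenate([y - g(z), Fx - z])    # stacked residuals
        J = np.vstack([-Jg(z), -np.eye(z.size)])  # their Jacobian w.r.t. z
        # Gauss-Newton direction; J^T J = Jg^T Jg + I is positive definite.
        p = np.linalg.solve(J.T @ J, -J.T @ r)
        if np.linalg.norm(p) < tol:
            break
        t, E0 = 1.0, E(z)
        while E(z + t * p) > E0 and t > 1e-12:    # backtrack until decrease
            t *= 0.5
        z = z + t * p
    return z
```

For a linear g the first Gauss-Newton step already lands on the closed-form solution of the size-Dz linear system mentioned above.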
5 Initialization and Validation
Initialization for Z
•Unsupervised dimensionality reduction on X only.
•Supervised dimension reduction: Reduced-Rank Regression, Kernel Sliced Inverse Regression, Kernel Dimension Reduction, etc.
•Spectral methods run on (X,Y) jointly.
Validation of hyper-parameters
•Parameters for the functions F and g (#RBFs, Gaussian kernel width).
•Regularization coefficients λg and λF.
•Dimensionality of z.
These can be determined by evaluating the performance of g ∘ F on a validation set.
6 Advantages of low-dimensional regression
•We jointly optimize over both mappings F and g, unlike one-shot methods.
•Our optimization is much more efficient than using a deep network with nested mappings (a pretty good model, pretty fast).
•The low-dimensional regressor has fewer parameters when Dz or the number of RBFs is small.
•The smooth functions F and g impose regularization on the regressor and may result in better generalization performance.
7 Experimental evaluation
•We use g ∘ F as our regression function at test time, which is the natural “out-of-sample” extension for the above optimization.
•Criteria: test error and the quality of the dimension reduction.
•Early stopping for training usually happens within 100 iterations.
Rotated MNIST digits ‘7’
Input: images of the digit 7, each with 60 rotated versions.
Output: a skeletonized version of the digit.
[Figure: test images and ground truth compared with outputs of direct linear, direct RBF, and GP regression, and of our method with RBF F and linear or RBF g.]
Sample outputs of different algorithms.
Method                                      SSE
direct linear                               51 710
direct RBFs (1200, 10^-2)                   32 495
direct RBFs (2000, 10^-2)                   29 525
Gaussian process                            29 208
KPCA (3) + RBFs (2000, 1)                   49 782
KSIR (60, 26) + RBFs (20, 10^-5)            39 421
F RBFs (2000, 10^-2) + g linear (10^-3)     29 612
F RBFs (1200, 10^-2) + g RBFs (75, 10^-2)   27 346

Sum of squared errors (test set), with optimal parameters coded as RBFs (M, λ), KPCA (Gaussian kernel width), KSIR (number of slices, Gaussian kernel width). Number of iterations for our method: 16 (linear g), 7 (RBF g).
[Figure: 2D embeddings (z1 vs. z2) from KPCA, KSIR (60 slices), Isomap on (X, Y), and our F RBFs + g linear.]
2-dimensional embeddings obtained by different algorithms.
Serpentine robot forward kinematics
The forward kinematics mapping goes from 12 to 24 dimensions through 4 dimensions.
[Figure: serpentine robot setup with links, end-effector, and camera; angles θ1, θ2, φ and radius ρ in (x, y, z) coordinates.]
Method                                      RMSE
direct regression, linear (0)               2.2827
direct regression, RBF (2000, 10^-6)        0.3356
direct regression, Gaussian process         0.7082
KPCA (2.5) + RBF (400, 10^-10)              3.7455
KSIR (400, 100) + RBF (1000, 10^-8)         3.5533
F RBF (2000, 10^-6) + g RBF (100, 10^-9)    0.1006

Test error obtained by different algorithms.
[Figure: ground-truth coordinates (t1, t2) and (t3, t4) plotted against our auxiliary coordinates (z1, z2) and (z3, z4).]
Correspondence between our 4-dimensional auxiliary coordinates Z and the ideal ones that generate the data.
[Figure, left: RMSE as a function of the dimension Dz (1 to 12). Right: training and test root-mean-squared regression error vs. run time (seconds) for the nested objective and for auxiliary coordinates, marking the point where we switch from auxiliary coordinates to the nested objective.]
Validation of Dz by our algorithm. Comparison of the run time of our approach and of optimizing the nested objective function.