Data world and the evolution of science Engineering Kernel … · 2006-11-29 · Multilayer...
Engineering Kernel Machines
Johan Suykens
K.U. Leuven, ESAT-SCD/SISTA
Kasteelpark Arenberg 10
B-3001 Leuven (Heverlee), Belgium
Tel: 32/16/32 18 02 - Fax: 32/16/32 19 70
Email: [email protected]
http://www.esat.kuleuven.be/scd/
Current Challenges in Kernel Methods (CCKM 2006)
Brussels, Belgium, Nov. 2006
CCKM 2006 ⋄ Johan Suykens
Data world and the evolution of science
[Figure labels: technologies, serving society, applications, statistics, optimization theory, mathematics, systems theory, machine learning, arranged along a time axis ending in "?"]
“The Paradox of the Sciences”: Science is full of fragmentation while tackling important open problems requires interdisciplinary approaches.
Kernel based learning: interdisciplinary challenges
[Figure: SVM & kernel methods surrounded by linear algebra, mathematics, statistics, systems and control theory, signal processing, optimization, machine learning, pattern recognition, data mining, neural networks]
• Understanding the essential concepts and different facets of problems
• Providing systematic approaches, engineering kernel machines
• Integrative design, bridging the gaps between theory and practice
Contents
• Motivation
• Support vector machines, use of the “kernel trick”
• Core problems: least squares support vector machines
• Primal and dual representations
• Robustness
• Kernel PCA, spectral clustering, kernel CCA
• Sparseness, feature selection, relevance determination
• Prior knowledge incorporation, convex optimization
Classical MLPs
[Figure: neuron with inputs x1, x2, x3, ..., xn, weights w1, w2, w3, ..., wn, bias b, activation function h(·) and output y]
Multilayer Perceptron (MLP) properties:
• Universal approximation of continuous nonlinear functions
• Learning from input-output patterns: off-line/on-line
• Parallel network architecture, multiple inputs and outputs
+ Flexible and widely applicable:
Feedforward/recurrent networks, supervised/unsupervised learning
- Many local minima, trial and error for determining number of neurons
Support Vector Machines
[Figure: cost function versus weights, one panel for the MLP and one for the SVM]
• Nonlinear classification and function estimation by convex optimization with a unique solution and primal-dual interpretations.
• Number of neurons automatically follows from a convex program.
• Learning and generalization in high dimensional input spaces (coping with the curse of dimensionality).
• Use of kernels (e.g. linear, polynomial, RBF, MLP, splines, kernels from graphical models, ...), application-specific kernels (e.g. bioinformatics)
Classifier with maximal margin
• Training set $\{(x_i, y_i)\}_{i=1}^N$: inputs $x_i \in \mathbb{R}^n$; class labels $y_i \in \{-1, +1\}$
• Classifier: $y(x) = \mathrm{sign}[w^T \varphi(x) + b]$
with $\varphi(\cdot): \mathbb{R}^n \to \mathbb{R}^{n_h}$ the mapping to a high dimensional feature space (which can be infinite dimensional!)
• Maximize the margin for good generalization ability (margin $= 2/\|w\|_2$)
(VC theory: the linear SVM classifier dates back to the sixties)
[Figure: two panels showing the same two classes (x and o), joined by an arrow, illustrating the choice of the separating line with maximal margin]
SVM classifier: primal and dual problem
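• Primal problem: [Vapnik, 1995]
$$\min_{w,b,\xi} \mathcal{J}(w,\xi) = \frac{1}{2} w^T w + c \sum_{i=1}^N \xi_i \quad \text{s.t.} \quad \begin{cases} y_i[w^T\varphi(x_i) + b] \ge 1 - \xi_i \\ \xi_i \ge 0, \quad i = 1, ..., N \end{cases}$$
Trade-off between margin maximization and tolerating misclassifications
• Conditions for optimality from Lagrangian. Express the solution in the Lagrange multipliers.
• Dual problem: QP problem (convex problem)
$$\max_{\alpha} Q(\alpha) = -\frac{1}{2}\sum_{i,j=1}^N y_i y_j K(x_i,x_j)\,\alpha_i\alpha_j + \sum_{j=1}^N \alpha_j \quad \text{s.t.} \quad \begin{cases} \sum_{i=1}^N \alpha_i y_i = 0 \\ 0 \le \alpha_i \le c, \ \forall i \end{cases}$$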
Obtaining solution via Lagrangian
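• Lagrangian:
$$\mathcal{L}(w, b, \xi; \alpha, \nu) = \mathcal{J}(w, \xi) - \sum_{i=1}^N \alpha_i \{ y_i[w^T\varphi(x_i) + b] - 1 + \xi_i \} - \sum_{i=1}^N \nu_i \xi_i$$
• Find saddle point: $\max_{\alpha,\nu} \min_{w,b,\xi} \mathcal{L}(w, b, \xi; \alpha, \nu)$, one obtains
$$\begin{cases} \dfrac{\partial \mathcal{L}}{\partial w} = 0 \to w = \sum_{i=1}^N \alpha_i y_i \varphi(x_i) \\ \dfrac{\partial \mathcal{L}}{\partial b} = 0 \to \sum_{i=1}^N \alpha_i y_i = 0 \\ \dfrac{\partial \mathcal{L}}{\partial \xi_i} = 0 \to 0 \le \alpha_i \le c, \quad i = 1, ..., N \end{cases}$$
Finally, write the solution in terms of α (Lagrange multipliers).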
SVM classifier model representations
• Classifier: primal representation
$$y(x) = \mathrm{sign}[w^T\varphi(x) + b]$$
Kernel trick (Mercer Theorem): $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$
• Dual representation:
$$y(x) = \mathrm{sign}\Big[\sum_i \alpha_i y_i K(x, x_i) + b\Big]$$
Some possible kernels $K(\cdot,\cdot)$:
$K(x, x_i) = x_i^T x$ (linear)
$K(x, x_i) = (x_i^T x + \tau)^d$ (polynomial)
$K(x, x_i) = \exp(-\|x - x_i\|_2^2/\sigma^2)$ (RBF)
$K(x, x_i) = \tanh(\kappa\, x_i^T x + \theta)$ (MLP)
• Sparseness property (many $\alpha_i = 0$)
[Figure: two-class example in the plane, axes x(1) and x(2)]
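The four kernels listed on this slide can be written down directly. The following is a minimal NumPy sketch, not code from the talk: the function names, default parameter values and the random test data are assumptions made for illustration only.

```python
import numpy as np

def linear_kernel(X, Z):
    # K(x, z) = z^T x for every pair of rows of X and Z
    return X @ Z.T

def polynomial_kernel(X, Z, tau=1.0, d=3):
    # K(x, z) = (z^T x + tau)^d
    return (X @ Z.T + tau) ** d

def rbf_kernel(X, Z, sigma2=1.0):
    # K(x, z) = exp(-||x - z||_2^2 / sigma^2), using the slide's scaling
    sq = (X**2).sum(1)[:, None] + (Z**2).sum(1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-np.maximum(sq, 0.0) / sigma2)

def mlp_kernel(X, Z, kappa=1.0, theta=0.0):
    # K(x, z) = tanh(kappa z^T x + theta); positive semidefinite only for
    # some choices of (kappa, theta)
    return np.tanh(kappa * X @ Z.T + theta)

# Illustrative data (hypothetical): 5 points in the plane
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
K = rbf_kernel(X, X, sigma2=0.5)
```

Each function returns the full kernel matrix between the rows of `X` and the rows of `Z`, which is the object that enters the dual problem.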
SVMs: living in two worlds ...
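[Figure: classes x and o in the input space, mapped by $\varphi(x)$ to the feature space]
Primal space (→ large data sets): parametric, estimate $w \in \mathbb{R}^{n_h}$
$$y(x) = \mathrm{sign}[w^T\varphi(x) + b]$$
[Network diagram: x → $\varphi_1(x), ..., \varphi_{n_h}(x)$ → weights $w_1, ..., w_{n_h}$ → y(x)]
Dual space (→ high dimensional inputs): non-parametric, estimate $\alpha \in \mathbb{R}^N$
$$y(x) = \mathrm{sign}\Big[\sum_{i=1}^{\#sv} \alpha_i y_i K(x, x_i) + b\Big]$$
[Network diagram: x → $K(x, x_1), ..., K(x, x_{\#sv})$ → weights $\alpha_1, ..., \alpha_{\#sv}$ → y(x)]
$K(x_i, x_j) = \varphi(x_i)^T\varphi(x_j)$ (“Kernel trick”)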
Reproducing Kernel Hilbert Space (RKHS) view
• Variational problem: [Wahba, 1990; Poggio & Girosi, 1990; Evgeniou et al., 2000]
find function f such that
$$\min_{f \in \mathcal{H}} \frac{1}{N}\sum_{i=1}^N L(y_i, f(x_i)) + \lambda \|f\|_K^2$$
with L(·, ·) the loss function. $\|f\|_K$ is the norm in the RKHS $\mathcal{H}$ defined by K.
• Representer theorem: for any convex loss function the solution is of the form
$$f(x) = \sum_{i=1}^N \alpha_i K(x, x_i)$$
• Some special cases:
$L(y, f(x)) = (y - f(x))^2$: regularization network
$L(y, f(x)) = |y - f(x)|_\epsilon$: SVM regression with ε-insensitive loss function
[Figure: ε-insensitive loss, zero between −ε and +ε]
Different views on kernel machines
[Diagram: related views on kernel machines: SVM, LS-SVM, Kriging, Gaussian Processes, RKHS]
Some early history on RKHS:
1910-1920: Moore
1940: Aronszajn
1951: Krige
1970: Parzen
1971: Kimeldorf & Wahba
Obtaining complementary insights from different perspectives: kernels are used in different settings (try to get the big picture)
• Support vector machines (SVM): optimization approach (primal/dual)
• Reproducing kernel Hilbert spaces (RKHS): variational problem, functional analysis
• Gaussian processes (GP): probabilistic/Bayesian approach
Wider use of the kernel trick
• Angle between vectors: (e.g. correlation analysis)
Input space:
$$\cos\theta_{xz} = \frac{x^T z}{\|x\|_2 \|z\|_2}$$
Feature space:
$$\cos\theta_{\varphi(x),\varphi(z)} = \frac{\varphi(x)^T\varphi(z)}{\|\varphi(x)\|_2\|\varphi(z)\|_2} = \frac{K(x,z)}{\sqrt{K(x,x)}\sqrt{K(z,z)}}$$
• Distance between vectors: (e.g. for “kernelized” clustering methods)
Input space:
$$\|x - z\|_2^2 = (x - z)^T(x - z) = x^Tx + z^Tz - 2x^Tz$$
Feature space:
$$\|\varphi(x) - \varphi(z)\|_2^2 = K(x,x) + K(z,z) - 2K(x,z)$$
[Figure: vectors x and z separated by the angle θ]
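The two feature-space formulas on this slide need only kernel evaluations, never $\varphi$ itself. A small sketch, assuming hypothetical helper names (`feature_cosine`, `feature_sq_distance`) and two hand-picked vectors:

```python
import numpy as np

def linear(x, z):
    # Linear kernel: feature space equals input space
    return float(x @ z)

def rbf(x, z, sigma2=1.0):
    # RBF kernel with the scaling exp(-||x - z||^2 / sigma^2)
    return float(np.exp(-np.sum((x - z) ** 2) / sigma2))

def feature_cosine(k, x, z):
    # cos(theta) = K(x, z) / (sqrt(K(x, x)) sqrt(K(z, z)))
    return k(x, z) / np.sqrt(k(x, x) * k(z, z))

def feature_sq_distance(k, x, z):
    # ||phi(x) - phi(z)||^2 = K(x, x) + K(z, z) - 2 K(x, z)
    return k(x, x) + k(z, z) - 2.0 * k(x, z)

x = np.array([1.0, 0.0])
z = np.array([1.0, 1.0])

cos_lin = feature_cosine(linear, x, z)      # angle in input space (45 degrees)
d2_lin = feature_sq_distance(linear, x, z)  # ordinary squared distance
d2_rbf = feature_sq_distance(rbf, x, z)     # squared distance in RBF feature space
```

With the linear kernel the functions reproduce the input-space quantities, which is a quick sanity check before switching to a nonlinear kernel.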
Least Squares Support Vector Machines: “core problems”
• Regression (RR)
$$\min_{w,b,e} w^Tw + \gamma\sum_i e_i^2 \quad \text{s.t.} \quad y_i = w^T\varphi(x_i) + b + e_i, \ \forall i$$
• Classification (FDA)
$$\min_{w,b,e} w^Tw + \gamma\sum_i e_i^2 \quad \text{s.t.} \quad y_i(w^T\varphi(x_i) + b) = 1 - e_i, \ \forall i$$
• Principal component analysis (PCA)
$$\min_{w,b,e} -w^Tw + \gamma\sum_i e_i^2 \quad \text{s.t.} \quad e_i = w^T\varphi(x_i) + b, \ \forall i$$
• Canonical correlation analysis/partial least squares (CCA/PLS)
$$\min_{w,v,b,d,e,r} w^Tw + v^Tv + \nu_1\sum_i e_i^2 + \nu_2\sum_i r_i^2 - \gamma\sum_i e_i r_i \quad \text{s.t.} \quad \begin{cases} e_i = w^T\varphi_1(x_i) + b \\ r_i = v^T\varphi_2(y_i) + d \end{cases}$$
• partially linear models, spectral clustering, subspace algorithms, ...
LS-SVM classifier
• Preserve support vector machine methodology, but simplify via least squares and equality constraints [Suykens, 1999]
• Primal problem:
$$\min_{w,b,e} \mathcal{J}(w,e) = \frac{1}{2}w^Tw + \gamma\frac{1}{2}\sum_{i=1}^N e_i^2 \quad \text{s.t.} \quad y_i[w^T\varphi(x_i) + b] = 1 - e_i, \ \forall i$$
• Dual problem:
$$\begin{bmatrix} 0 & y^T \\ y & \Omega + I/\gamma \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ 1_N \end{bmatrix}$$
where $\Omega_{ij} = y_i y_j \varphi(x_i)^T\varphi(x_j) = y_i y_j K(x_i, x_j)$ and $y = [y_1; ...; y_N]$.
• LS-SVM classifiers perform very well on 20 UCI data sets [Van Gestel et al., ML 2004]
Winning results in competition WCCI 2006 by [Cawley, 2006]
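Because the LS-SVM dual is a linear system rather than a QP, training reduces to one call to a linear solver. The sketch below assumes an RBF kernel, hypothetical function names (`lssvm_train`, `lssvm_predict`) and an XOR-like toy data set; it is an illustration of the system above, not the implementation used in the cited benchmarks.

```python
import numpy as np

def rbf_kernel(X, Z, sigma2):
    sq = (X**2).sum(1)[:, None] + (Z**2).sum(1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-np.maximum(sq, 0.0) / sigma2)

def lssvm_train(X, y, gamma, sigma2):
    """Solve [0 y^T; y Omega + I/gamma] [b; alpha] = [0; 1_N]."""
    N = len(y)
    Omega = np.outer(y, y) * rbf_kernel(X, X, sigma2)
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = Omega + np.eye(N) / gamma
    rhs = np.concatenate(([0.0], np.ones(N)))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]  # b, alpha

def lssvm_predict(Xtest, X, y, b, alpha, sigma2):
    # Dual representation: sign(sum_i alpha_i y_i K(x, x_i) + b)
    return np.sign(rbf_kernel(Xtest, X, sigma2) @ (alpha * y) + b)

# Hypothetical toy problem: label is the sign of x1 * x2
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(80, 2))
y = np.sign(X[:, 0] * X[:, 1])
gamma, sigma2 = 10.0, 0.5
b, alpha = lssvm_train(X, y, gamma, sigma2)
y_train_pred = lssvm_predict(X, X, y, b, alpha, sigma2)
```

Note that, unlike the SVM, every $\alpha_i$ is generally nonzero here, since $\alpha_i = \gamma e_i$; sparseness has to be imposed separately.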
Obtaining solution from Lagrangian
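• Lagrangian:
$$\mathcal{L}(w, b, e; \alpha) = \mathcal{J}(w, e) - \sum_{i=1}^N \alpha_i \{ y_i[w^T\varphi(x_i) + b] - 1 + e_i \}$$
with Lagrange multipliers $\alpha_i$ (support values).
• Conditions for optimality:
$$\begin{cases} \dfrac{\partial \mathcal{L}}{\partial w} = 0 \to w = \sum_{i=1}^N \alpha_i y_i \varphi(x_i) \\ \dfrac{\partial \mathcal{L}}{\partial b} = 0 \to \sum_{i=1}^N \alpha_i y_i = 0 \\ \dfrac{\partial \mathcal{L}}{\partial e_i} = 0 \to \alpha_i = \gamma e_i, \quad i = 1, ..., N \\ \dfrac{\partial \mathcal{L}}{\partial \alpha_i} = 0 \to y_i[w^T\varphi(x_i) + b] - 1 + e_i = 0, \quad i = 1, ..., N \end{cases}$$
Eliminate w, e and write solution in α, b.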
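The elimination step can be spelled out in a few lines. The derivation below only substitutes the first and third optimality conditions into the fourth, using $\Omega_{ij} = y_i y_j K(x_i, x_j)$ as defined for the LS-SVM classifier:

```latex
\begin{aligned}
& w = \sum_{j=1}^N \alpha_j y_j \varphi(x_j), \qquad e_i = \alpha_i / \gamma \\
& y_i \Big[ \sum_{j=1}^N \alpha_j y_j \varphi(x_j)^T \varphi(x_i) + b \Big] - 1 + \frac{\alpha_i}{\gamma} = 0 \\
& \Rightarrow \ \sum_{j=1}^N \Omega_{ij} \alpha_j + \frac{\alpha_i}{\gamma} + y_i b = 1, \qquad i = 1, ..., N \\
& \text{together with } \sum_{i=1}^N \alpha_i y_i = 0: \quad
\begin{bmatrix} 0 & y^T \\ y & \Omega + I/\gamma \end{bmatrix}
\begin{bmatrix} b \\ \alpha \end{bmatrix}
=
\begin{bmatrix} 0 \\ 1_N \end{bmatrix}
\end{aligned}
```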
Microarray data analysis
• FDA
• LS-SVM classifier (linear, RBF)
• Kernel PCA + FDA (unsupervised selection of PCs; supervised selection of PCs)
Use regularization for linear classifiers
Systematic benchmarking study in [Pochet et al., Bioinformatics 2004]
Webservice: http://www.esat.kuleuven.ac.be/MACBETH/
Efficient computational methods for feature selection by rank-one updates [Ojeda et al., 2006]
Weighted versions and robustness
[Diagram: SVM: convex cost function → (convex optimization) → SVM solution. Weighted LS-SVM: LS-SVM solution → (robust statistics) → weighted version with modified cost function]
• Weighted LS-SVM:
$$\min_{w,b,e} \frac{1}{2}w^Tw + \gamma\frac{1}{2}\sum_{i=1}^N v_i e_i^2 \quad \text{s.t.} \quad y_i = w^T\varphi(x_i) + b + e_i, \ \forall i$$
with $v_i$ determined from $\{e_i\}_{i=1}^N$ of unweighted LS-SVM [Suykens et al., 2002].
Robustness and stability of reweighted kernel based regression [Debruyne et al., 2006].
• SVM solution by applying iteratively weighted LS [Perez-Cruz et al., 2005]
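A two-pass sketch of the weighted scheme: fit an unweighted LS-SVM regressor, turn its residuals $e_i = \alpha_i/\gamma$ into weights $v_i$, then refit. The slide does not give the weighting rule; the piecewise-linear rule with a MAD-based scale and constants `c1 = 2.5`, `c2 = 3` used here is an assumption, as are the function names, the toy data and the injected outliers.

```python
import numpy as np

def rbf_kernel(X, Z, sigma2):
    sq = (X**2).sum(1)[:, None] + (Z**2).sum(1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-np.maximum(sq, 0.0) / sigma2)

def lssvm_regression(K, y, gamma, v=None):
    """Solve [0 1^T; 1 K + diag(1/(gamma v))] [b; alpha] = [0; y]."""
    N = len(y)
    v = np.ones(N) if v is None else v
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = K + np.diag(1.0 / (gamma * v))
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    return sol[0], sol[1:]  # b, alpha

def robust_weights(e, c1=2.5, c2=3.0, floor=1e-4):
    # Assumed rule: v = 1 for small standardized residuals, linearly
    # decreasing between c1 and c2, and a small floor beyond c2
    s = 1.483 * np.median(np.abs(e - np.median(e)))  # robust scale (MAD)
    r = np.abs(e) / s
    return np.clip((c2 - r) / (c2 - c1), floor, 1.0)

# Hypothetical data: noisy sin(x)/x with three gross outliers
rng = np.random.default_rng(0)
x = np.linspace(-5.0, 5.0, 60)
y = np.sinc(x / np.pi) + 0.05 * rng.normal(size=60)
y[[10, 30, 50]] += 3.0

K = rbf_kernel(x[:, None], x[:, None], sigma2=1.0)
gamma = 10.0
b0, alpha0 = lssvm_regression(K, y, gamma)     # unweighted fit
v = robust_weights(alpha0 / gamma)             # residuals e_i = alpha_i / gamma
b1, alpha1 = lssvm_regression(K, y, gamma, v)  # weighted refit
y_fit = K @ alpha1 + b1
```

Points with large residuals in the first pass receive a small $v_i$, so their error term is barely penalized in the second pass and they pull the fit less.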
Kernel principal component analysis (KPCA)
[Figure: two panels on the same two-dimensional data: linear PCA and kernel PCA (RBF kernel)]
Kernel PCA [Scholkopf et al., 1998]: take eigenvalue decomposition of the kernel matrix
$$\begin{bmatrix} K(x_1, x_1) & \cdots & K(x_1, x_N) \\ \vdots & & \vdots \\ K(x_N, x_1) & \cdots & K(x_N, x_N) \end{bmatrix}$$
(applications in dimensionality reduction and denoising)
Where is the regularization?
Kernel PCA: primal and dual problem
• Underlying primal problem with regularization term [Suykens et al., 2003]
• Primal problem:
$$\min_{w,b,e} -\frac{1}{2}w^Tw + \frac{1}{2}\gamma\sum_{i=1}^N e_i^2 \quad \text{s.t.} \quad e_i = w^T\varphi(x_i) + b, \ i = 1, ..., N.$$
(or alternatively $\min \frac{1}{2}w^Tw - \frac{1}{2}\gamma\sum_{i=1}^N e_i^2$)
• Dual problem = kernel PCA:
$$\Omega_c \alpha = \lambda\alpha \quad \text{with} \quad \lambda = 1/\gamma$$
with $\Omega_{c,ij} = (\varphi(x_i) - \mu_\varphi)^T(\varphi(x_j) - \mu_\varphi)$ the centered kernel matrix.
• Score variables (allowing also out-of-sample extensions):
$$z(x) = w^T(\varphi(x) - \mu_\varphi) = \sum_j \alpha_j \Big( K(x_j, x) - \frac{1}{N}\sum_r K(x_r, x) - \frac{1}{N}\sum_r K(x_r, x_j) + \frac{1}{N^2}\sum_r\sum_s K(x_r, x_s) \Big)$$
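The dual eigenvalue problem and the out-of-sample score formula translate almost line by line into NumPy. In this sketch the function names (`kpca_fit`, `kpca_scores`), the RBF kernel, the random data and the unit-norm normalization of the eigenvectors are assumptions; the slide does not fix a normalization.

```python
import numpy as np

def rbf_kernel(X, Z, sigma2):
    sq = (X**2).sum(1)[:, None] + (Z**2).sum(1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-np.maximum(sq, 0.0) / sigma2)

def kpca_fit(K, n_components):
    """Eigendecomposition of the centered kernel matrix Omega_c."""
    N = K.shape[0]
    M = np.eye(N) - np.ones((N, N)) / N  # centering matrix
    lam, A = np.linalg.eigh(M @ K @ M)   # eigenvalues in ascending order
    idx = np.argsort(lam)[::-1][:n_components]
    return lam[idx], A[:, idx]

def kpca_scores(Kx, K, alpha):
    """Score variables z(x) for new points.

    Kx has shape (m, N) with Kx[a, j] = K(x_a, x_j) for training points x_j.
    """
    Kx_c = (Kx
            - Kx.mean(axis=1, keepdims=True)  # (1/N) sum_r K(x_r, x)
            - K.mean(axis=0)[None, :]         # (1/N) sum_r K(x_r, x_j)
            + K.mean())                       # (1/N^2) sum_r sum_s K(x_r, x_s)
    return Kx_c @ alpha

# Hypothetical data
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
K = rbf_kernel(X, X, sigma2=1.0)
lam, alpha = kpca_fit(K, n_components=2)

z_train = kpca_scores(K, K, alpha)  # scores of the training points
x_new = np.array([[0.0, 0.0]])
z_new = kpca_scores(rbf_kernel(x_new, X, 1.0), K, alpha)
```

On the training points the scores equal $\Omega_c\alpha = \lambda\alpha$, which gives a simple consistency check for the centering terms.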
Generalizations to Kernel PCA: other loss functions
• Consider general loss function L:
$$\min_{w,b,e} -\frac{1}{2}w^Tw + \frac{1}{2}\gamma\sum_{i=1}^N L(e_i) \quad \text{s.t.} \quad e_i = w^T\varphi(x_i) + b, \ i = 1, ..., N.$$
Generalizations of KPCA that lead to robustness and sparseness, e.g. Vapnik ε-insensitive loss, Huber loss function [Alzate & Suykens, 2006].
• Weighted least squares versions and incorporation of constraints:
$$\min_{w,b,e} -\frac{1}{2}w^Tw + \frac{1}{2}\gamma\sum_{i=1}^N v_i e_i^2 \quad \text{s.t.} \quad \begin{cases} e_i = w^T\varphi(x_i) + b, \quad i = 1, ..., N \\ \sum_{i=1}^N e_i e_i^{(1)} = 0 \\ \quad \vdots \\ \sum_{i=1}^N e_i e_i^{(i-1)} = 0 \end{cases}$$
Find i-th PC w.r.t. i − 1 orthogonality constraints (previous PC $e_i^{(j)}$).
The solution is given by a generalized eigenvalue problem.
Generalizations to Kernel PCA: robust denoising
[Figure: denoised images, classical kernel PCA versus robust method]
First rows: test set images, corrupted with Gaussian noise and outliers
Next rows: application of different pre-image algorithms
Robust method: fewer components needed and improved results
Generalizations to Kernel PCA: sparseness
[Figure: four panels in the (x1, x2) plane: one top panel and three bottom panels labelled PC1, PC2, PC3]
Sparse kernel PCA using ε-insensitive loss [Alzate & Suykens, 2006] (top figure: denoising; bottom figures: different support vectors (in black) per principal component vector)
Spectral clustering: weighted KPCA
• Spectral graph clustering [Chung, 1997; Shi & Malik, 2000; Ng et al., 2002]
• Normalized cut problem
$$Lq = \lambda Dq$$
with L = D − W the Laplacian of the graph. Cluster membership indicators are given by q.
• KPCA weighted LS-SVM formulation to normalized cut:
$$\min_{w,b,e} -\frac{1}{2}w^Tw + \gamma\frac{1}{2}e^TVe \quad \text{such that} \quad e_i = w^T\varphi(x_i) + b, \ \forall i = 1, ..., N$$
with $V = D^{-1}$ the inverse degree matrix [Alzate & Suykens, 2006]. Allows for out-of-sample extensions on test data.
[Figure: graph with nodes 1 to 6, showing a cut of size 2 and a cut of size 1 (minimal cut)]
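The relaxed normalized cut problem $Lq = \lambda Dq$ can be checked on a small graph. The graph below (two triangles joined by a single edge) is a hypothetical stand-in, since the edges of the slide's six-node graph are not recoverable; the sketch solves the generalized eigenproblem through the equivalent symmetric form $D^{-1/2} L D^{-1/2}$ and does not implement the weighted LS-SVM formulation itself.

```python
import numpy as np

# Hypothetical graph: triangles {0, 1, 2} and {3, 4, 5} joined by edge (2, 3)
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
W = np.zeros((6, 6))
for i, j in edges:
    W[i, j] = W[j, i] = 1.0

d = W.sum(axis=1)    # node degrees
L = np.diag(d) - W   # graph Laplacian L = D - W

# L q = lam D q  is equivalent to  (D^-1/2 L D^-1/2) u = lam u  with q = D^-1/2 u
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
lam, U = np.linalg.eigh(D_inv_sqrt @ L @ D_inv_sqrt)

q = D_inv_sqrt @ U[:, 1]        # eigenvector of the second smallest eigenvalue
labels = (q > 0).astype(int)    # sign pattern gives the two clusters
```

The smallest eigenvalue is zero with a constant eigenvector; the sign pattern of the next eigenvector separates the two triangles, cutting only the single bridging edge.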
Application to image segmentation
[Figure: given image (240 × 160) and the resulting image segmentation]
Large scale image: out-of-sample extension [Alzate & Suykens, 2006]
Primal versus dual problems
Example 1: microarray data (10,000 genes & 50 training data)
Classifier model:
$\mathrm{sign}(w^Tx + b)$ (primal)
$\mathrm{sign}(\sum_i \alpha_i y_i x_i^T x + b)$ (dual)
primal: $w \in \mathbb{R}^{10,000}$ (only 50 training data!)
dual: $\alpha \in \mathbb{R}^{50}$
Example 2: datamining problem (1,000,000 training data & 20 inputs)
primal: $w \in \mathbb{R}^{20}$
dual: $\alpha \in \mathbb{R}^{1,000,000}$ (kernel matrix: 1,000,000 × 1,000,000!)
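A back-of-the-envelope calculation makes the contrast between the two examples concrete. The helper name and the assumption of a dense matrix stored in 8-byte floating point are illustrative choices, not figures from the talk.

```python
def dense_kernel_bytes(n_train, bytes_per_entry=8):
    # Memory for a dense n_train x n_train kernel matrix (assumed float64)
    return n_train * n_train * bytes_per_entry

microarray_bytes = dense_kernel_bytes(50)         # 50 x 50 kernel matrix
datamining_bytes = dense_kernel_bytes(1_000_000)  # 1,000,000 x 1,000,000 kernel matrix

microarray_kb = microarray_bytes / 1e3   # 20 kB
datamining_tb = datamining_bytes / 1e12  # 8 TB
```

Under these assumptions the dual is trivially small for the microarray case, while for the datamining case the kernel matrix alone would need terabytes, which is why estimation in the primal is preferred there.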
Fixed-size LS-SVM: primal-dual kernel machines
[Diagram: primal space and dual space linked through the Nystrom method, kernel PCA, density estimate, entropy criteria, eigenfunctions, SV selection and regression]
Link Nystrom approximation (GP) - kernel PCA - density estimation [Girolami, 2002; Williams & Seeger, 2001]
Modelling in view of primal-dual representations [Suykens et al., 2002]: primal space estimation, sparse, large scale
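The idea of estimating in the primal with an approximate feature map can be sketched as follows: take a subset of M points, eigendecompose its M × M kernel matrix, build Nystrom features for all N points, then solve a ridge regression in M dimensions. Two simplifications are assumed here and are not the method of the talk: the subset is drawn at random rather than by an entropy criterion, and the feature scaling $K_{NM} U \Lambda^{-1/2}$ is one common convention. Function names and data are hypothetical.

```python
import numpy as np

def rbf_kernel(X, Z, sigma2):
    sq = (X**2).sum(1)[:, None] + (Z**2).sum(1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-np.maximum(sq, 0.0) / sigma2)

def nystrom_features(X, Xsub, sigma2, tol=1e-10):
    """Approximate feature map from the eigendecomposition on the subset."""
    lam, U = np.linalg.eigh(rbf_kernel(Xsub, Xsub, sigma2))
    keep = lam > tol * lam.max()  # drop numerically zero directions
    return rbf_kernel(X, Xsub, sigma2) @ U[:, keep] / np.sqrt(lam[keep])

def primal_ridge(Phi, y, gamma):
    """min 1/2 w^T w + gamma/2 sum_i (y_i - w^T phi_i - b)^2, bias unpenalized."""
    N, m = Phi.shape
    P = np.hstack([Phi, np.ones((N, 1))])
    R = np.eye(m + 1) / gamma
    R[-1, -1] = 0.0
    wb = np.linalg.solve(P.T @ P + R, P.T @ y)
    return wb[:-1], wb[-1]

# Hypothetical data: N = 2000 noisy samples of sin(x)/x, subset of M = 20 points
rng = np.random.default_rng(0)
X = rng.uniform(-10.0, 10.0, size=(2000, 1))
y = np.sinc(X[:, 0] / np.pi) + 0.1 * rng.normal(size=2000)
Xsub = X[rng.choice(2000, size=20, replace=False)]

Phi = nystrom_features(X, Xsub, sigma2=4.0)
w, b = primal_ridge(Phi, y, gamma=100.0)
y_hat = Phi @ w + b
```

The linear system solved has size at most (M + 1) × (M + 1) regardless of N, and the model is sparse in the sense that only the M subset points are needed at prediction time.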
Fixed-size LS-SVM: toy examples
[Figure: toy examples in four panels: two regression plots of y versus x and two plots in the plane]
Sparse representations with estimation in primal space
Large scale problems: Fixed-size LS-SVM
Estimate in primal (approximate feature map from KPCA on subset)
Santa Fe laser data
[Figure: $y_k$ versus discrete time k, and $y_k$ together with $\hat{y}_k$ versus discrete time k]
Training: $\hat{y}_{k+1} = f(y_k, y_{k-1}, ..., y_{k-p})$
Iterative prediction: $\hat{y}_{k+1} = f(\hat{y}_k, \hat{y}_{k-1}, ..., \hat{y}_{k-p})$
(works well for p large, e.g. p = 50) [Espinoza et al., 2003]
Partially linear models: nonlinear system identification
[Figure: input signal and output signal (test, training and validation segments) versus discrete time index; residuals versus discrete time index for two models]
Silver box benchmark study (physical system with cubic nonlinearity): (top-right) full black-box, (bottom-right) partially linear
Related application: power load forecasting [Espinoza et al., 2005]
Kernel Canonical Correlation Analysis
[Figure: space X and space Y are mapped by $\varphi_1(\cdot)$ and $\varphi_2(\cdot)$ to the feature space on X and the feature space on Y, and from there to target spaces with $z_x = w^T\varphi_1(x)$, $z_y = v^T\varphi_2(y)$]
Correlation: $\min_{w,v} \sum_i \|z_{x_i} - z_{y_i}\|_2^2$
Applications of kernel CCA [Suykens et al., 2002, Bach & Jordan, 2002] e.g. in:
- bioinformatics (correlation gene network - gene expression profiles) [Vert et al., 2003]
- information retrieval, fMRI [Shawe-Taylor et al., 2004]
- state estimation of dynamical systems, subspace algorithms [Goethals et al., 2005]
LS-SVM formulation to Kernel CCA
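• Score variables: $z_x = w^T(\varphi_1(x) - \mu_{\varphi_1})$, $z_y = v^T(\varphi_2(y) - \mu_{\varphi_2})$
Feature maps $\varphi_1, \varphi_2$, kernels $K_1(x_i, x_j) = \varphi_1(x_i)^T\varphi_1(x_j)$, $K_2(y_i, y_j) = \varphi_2(y_i)^T\varphi_2(y_j)$
• Primal problem: (Kernel PLS case: $\nu_1 = 0$, $\nu_2 = 0$ [Hoegaerts et al., 2004])
$$\max_{w,v,e,r} \gamma\sum_{i=1}^N e_i r_i - \nu_1\frac{1}{2}\sum_{i=1}^N e_i^2 - \nu_2\frac{1}{2}\sum_{i=1}^N r_i^2 - \frac{1}{2}w^Tw - \frac{1}{2}v^Tv$$
such that $e_i = w^T(\varphi_1(x_i) - \mu_{\varphi_1})$, $r_i = v^T(\varphi_2(y_i) - \mu_{\varphi_2})$, $\forall i$
with $\mu_{\varphi_1} = (1/N)\sum_{i=1}^N \varphi_1(x_i)$, $\mu_{\varphi_2} = (1/N)\sum_{i=1}^N \varphi_2(y_i)$.
• Dual problem: generalized eigenvalue problem [Suykens et al., 2002]
$$\begin{bmatrix} 0 & \Omega_{c,2} \\ \Omega_{c,1} & 0 \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix} = \lambda \begin{bmatrix} \nu_1\Omega_{c,1} + I & 0 \\ 0 & \nu_2\Omega_{c,2} + I \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix}, \quad \lambda = 1/\gamma$$
with $\Omega_{c,1,ij} = (\varphi_1(x_i) - \mu_{\varphi_1})^T(\varphi_1(x_j) - \mu_{\varphi_1})$, $\Omega_{c,2,ij} = (\varphi_2(y_i) - \mu_{\varphi_2})^T(\varphi_2(y_j) - \mu_{\varphi_2})$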
Obtaining solution from Lagrangian
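• Lagrangian
$$\mathcal{L}(w, v, e, r; \alpha, \beta) = \gamma\sum_{i=1}^N e_i r_i - \nu_1\frac{1}{2}\sum_{i=1}^N e_i^2 - \nu_2\frac{1}{2}\sum_{i=1}^N r_i^2 - \frac{1}{2}w^Tw - \frac{1}{2}v^Tv - \sum_{i=1}^N \alpha_i[e_i - w^T(\varphi_1(x_i) - \mu_{\varphi_1})] - \sum_{i=1}^N \beta_i[r_i - v^T(\varphi_2(y_i) - \mu_{\varphi_2})]$$
• Conditions for optimality (eliminate w, v, e, r)
$$\begin{cases} \dfrac{\partial \mathcal{L}}{\partial w} = 0 \to w = \sum_{i=1}^N \alpha_i(\varphi_1(x_i) - \mu_{\varphi_1}) \\ \dfrac{\partial \mathcal{L}}{\partial v} = 0 \to v = \sum_{i=1}^N \beta_i(\varphi_2(y_i) - \mu_{\varphi_2}) \\ \dfrac{\partial \mathcal{L}}{\partial e_i} = 0 \to \gamma v^T(\varphi_2(y_i) - \mu_{\varphi_2}) = \nu_1 w^T(\varphi_1(x_i) - \mu_{\varphi_1}) + \alpha_i, \quad i = 1, ..., N \\ \dfrac{\partial \mathcal{L}}{\partial r_i} = 0 \to \gamma w^T(\varphi_1(x_i) - \mu_{\varphi_1}) = \nu_2 v^T(\varphi_2(y_i) - \mu_{\varphi_2}) + \beta_i, \quad i = 1, ..., N \\ \dfrac{\partial \mathcal{L}}{\partial \alpha_i} = 0 \to e_i = w^T(\varphi_1(x_i) - \mu_{\varphi_1}), \quad i = 1, ..., N \\ \dfrac{\partial \mathcal{L}}{\partial \beta_i} = 0 \to r_i = v^T(\varphi_2(y_i) - \mu_{\varphi_2}), \quad i = 1, ..., N \end{cases}$$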
Bayesian inference
[Diagram: three levels of inference, each applying Posterior = (Likelihood × Prior) / Evidence and maximizing the posterior:
Level 1: parameters
Level 2: hyperparameters
Level 3: model comparison]
Automatic relevance determination (ARD) [MacKay, 1998]: infer elements of diagonal matrix S in $K(x_i, x_j) = \exp(-(x_i - x_j)^T S (x_i - x_j))$ which indicate how relevant input variables are (but: many local minima, computationally expensive).
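The ARD kernel itself is easy to state in code; the expensive part mentioned on the slide is inferring S, which this sketch does not attempt. The function name `ard_kernel`, the vector `s` holding the diagonal of S, and the data are assumptions for illustration.

```python
import numpy as np

def ard_kernel(X, Z, s):
    """K(x, z) = exp(-(x - z)^T S (x - z)) with S = diag(s), s >= 0."""
    diff = X[:, None, :] - Z[None, :, :]  # pairwise differences
    return np.exp(-np.einsum("ijk,k,ijk->ij", diff, s, diff))

# Hypothetical example: the third input has relevance 0
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
s = np.array([1.0, 0.5, 0.0])
K = ard_kernel(X, X, s)

# Replacing the third input by unrelated values leaves the kernel unchanged
X_changed = X.copy()
X_changed[:, 2] = rng.normal(size=4)
K_changed = ard_kernel(X_changed, X_changed, s)
```

A diagonal entry of S close to zero means the kernel ignores that input, which is how the inferred values indicate relevance.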
Classification of brain tumors using ARD
Bayesian learning (automatic relevance determination) of most relevant frequencies [Lu, 2005]
Hierarchical Kernel Machines
[Diagram: conceptually, a hierarchical kernel machine with Levels 1, 2 and 3 (labels: LS-SVM substrate, model selection, sparseness, structure detection); computationally, a single convex optimization problem]
Hierarchical modelling approach leading to convex optimization problem
Computationally fusing training, hyperparameter and model selection
Optimization modelling: sparseness, input/structure selection, stability ...
[Pelckmans et al., ML 2006]
Additive regularization trade-off
• Traditional Tikhonov regularization scheme:
$$\min_{w,e} w^Tw + \gamma\sum_i e_i^2 \quad \text{s.t.} \quad e_i = y_i - w^T\varphi(x_i), \ \forall i = 1, ..., N$$
Training solution for fixed value of γ:
$$(K + I/\gamma)\alpha = y$$
→ Selection of γ via validation set: non-convex problem
• Additive regularization trade-off [Pelckmans et al., 2005]:
$$\min_{w,e} w^Tw + \sum_i (e_i - c_i)^2 \quad \text{s.t.} \quad e_i = y_i - w^T\varphi(x_i), \ \forall i = 1, ..., N$$
Training solution for fixed value of $c = [c_1; ...; c_N]$:
$$(K + I)\alpha = y - c$$
→ Selection of c via validation set: can be convex problem
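Both training solutions are single linear solves, and the two schemes can be related by simple algebra: if $(K + I/\gamma)\alpha = y$, then $(K + I)\alpha = y - c$ holds for $c = (1/\gamma - 1)\alpha$. The sketch below checks this numerically; the function names, RBF kernel and data are assumptions, and the particular choice of c is a worked consequence of the two equations rather than a statement from the slide.

```python
import numpy as np

def rbf_kernel(X, Z, sigma2):
    sq = (X**2).sum(1)[:, None] + (Z**2).sum(1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-np.maximum(sq, 0.0) / sigma2)

def tikhonov_fit(K, y, gamma):
    # (K + I/gamma) alpha = y
    return np.linalg.solve(K + np.eye(len(y)) / gamma, y)

def additive_reg_fit(K, y, c):
    # (K + I) alpha = y - c
    return np.linalg.solve(K + np.eye(len(y)), y - c)

# Hypothetical data
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)
K = rbf_kernel(X, X, sigma2=1.0)

gamma = 5.0
alpha_tik = tikhonov_fit(K, y, gamma)

# The c that reproduces the Tikhonov solution for this gamma
c = (1.0 / gamma - 1.0) * alpha_tik
alpha_add = additive_reg_fit(K, y, c)
```

The point of the additive scheme is that α depends linearly on c through a fixed matrix $(K + I)^{-1}$, whereas it depends on γ through a matrix inverse that changes with γ.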
Sparse models
• SVM classically: sparse solution from QP problem at training level
• Hierarchical kernel machine: fused problem with sparseness obtained at the validation level [Pelckmans et al., 2005]
[Figure: LS-SVM with RBF kernel (γ = 5.3667, σ² = 0.90784) on a problem with 2 different classes (class 1, class 2), axes X1 and X2]
Additive models and structure detection
• Additive models: $y(x) = \sum_{p=1}^P w^{(p)T}\varphi^{(p)}(x^{(p)})$ with $x^{(p)}$ the p-th input variable and a feature map $\varphi^{(p)}$ for each variable. This leads to the kernel $K(x_i, x_j) = \sum_{p=1}^P K^{(p)}(x_i^{(p)}, x_j^{(p)})$.
• Structure detection [Pelckmans et al., 2005]:
$$\min_{w,e,t} \rho\sum_{p=1}^P t_p + \sum_{p=1}^P w^{(p)T}w^{(p)} + \gamma\sum_{i=1}^N e_i^2 \quad \text{s.t.} \quad \begin{cases} y_i = \sum_{p=1}^P w^{(p)T}\varphi^{(p)}(x_i^{(p)}) + e_i, \quad \forall i = 1, ..., N \\ -t_p \le w^{(p)T}\varphi^{(p)}(x_i^{(p)}) \le t_p, \quad \forall i = 1, ..., N, \ \forall p = 1, ..., P \end{cases}$$
Study how the solution with maximal variation varies for different values of ρ
[Figure: maximal variation versus ρ (logarithmic axis from 10^0 to 10^4), with curves for 4 relevant input variables and 21 irrelevant input variables]
Incorporation of prior knowledge
• Example: LS-SVM regression with monotonicity constraint
$$\min_{w,b,e} \frac{1}{2}w^Tw + \gamma\frac{1}{2}\sum_{i=1}^N e_i^2 \quad \text{s.t.} \quad \begin{cases} y_i = w^T\varphi(x_i) + b + e_i, \quad \forall i = 1, ..., N \\ w^T\varphi(x_i) \le w^T\varphi(x_{i+1}), \quad \forall i = 1, ..., N - 1 \end{cases}$$
• Application: estimation of cdf [Pelckmans et al., 2005]
[Figure: two panels of P(X) versus X; the first shows the empirical cdf and the true cdf, the second has legend entries ecdf, cdf, Chebychev, mkr, mLS-SVM]
Equivalent kernels from constraints
Regression with autocorrelated errors:
$$\min_{w,b,r,e} w^Tw + \gamma\sum_i r_i^2 \quad \text{s.t.} \quad \begin{cases} y_i = w^T\varphi(x_i) + b + e_i \quad (i = 1, ..., N) \\ e_i = \rho e_{i-1} + r_i \quad (i = 2, ..., N) \end{cases}$$
leads to
$$f(x) = \sum_{j=2}^N \alpha_{j-1} K_{eq}(x_j, x) + b$$
with “equivalent kernel”
$$K_{eq}(x_j, x_i) = K(x_j, x_i) - \rho K(x_{j-1}, x_i)$$
where $K(x_j, x_i) = \varphi(x_j)^T\varphi(x_i)$ [Espinoza et al., 2006].
[Diagram: modular definition of the model structure: LS-SVM regression combined with partially linear structure, autocorrelated residuals, imposing symmetry]
Nonlinear dynamical systems control
[Figure: controlled system with variables u, $x_p$ and θ, and a plot of the result]
[Diagram: minimize control objective + LS-SVM objective, subject to the system dynamics (time k = 1, 2, ..., N) and the LS-SVM controller (time k = 1, 2, ..., N)]
Merging optimal control and support vector machine optimization problems:
Approximate solutions to optimal control problems [Suykens et al., NN 2001]
Conclusions and future challenges
• Integrative understanding and systematic design for supervised, semi-supervised, unsupervised learning and beyond
• Kernel methods: complementary views (LS-)SVM, RKHS, GP
• Towards a global understanding of science as a whole: kernel methods as a role model?
• Bridging gaps between fundamental theory, algorithms and applications
• Reliable methods: numerically, computationally, statistically
• Realizing the ultimate software tool: optimal model representations and solutions by symbolic computation?
Acknowledgements
• Current and former colleagues:
C. Alzate, L. Ameye, T. De Bie, J. De Brabanter, B. De Moor, A. Devos, M. Espinoza, F. De Smet, I. Goethals, B. Hamers, F. Janssens, L. Hoegaerts, P. Karsmakers, G. Lanckriet, C. Lu, L. Lukas, J. Luts, K. Marchal, Y. Moreau, F. Ojeda, K. Pelckmans, N. Pochet, B. Van Calster, V. Van Belle, R. Van de Plas, T. Van Gestel, L. Vanhamme, S. Van Huffel, J. Vandewalle
• Many people for joint work, discussions, invitations, joint organization of meetings.
• Support from K.U. Leuven, GOA-Ambiorics, COE Optimization in Engineering, IAP V, FWO projects, IWT.
Books
• Boyd S., Vandenberghe L., Convex Optimization, Cambridge University Press, 2004.
• Chapelle O., Scholkopf B., Zien A. (Eds.), Semi-Supervised Learning, MIT Press, Cambridge, MA, (in press) 2006.
• Cristianini N., Shawe-Taylor J., An Introduction to Support Vector Machines, Cambridge University Press, 2000.
• Rasmussen C.E., Williams C.K.I., Gaussian Processes for Machine Learning, MIT Press, Cambridge, MA, 2006.
• Scholkopf B., Smola A., Learning with Kernels, MIT Press, Cambridge, MA, 2002.
• Scholkopf B., Tsuda K., Vert J.P. (Eds.), Kernel Methods in Computational Biology, MIT Press, Cambridge, MA, 2004.
• Shawe-Taylor J., Cristianini N., Kernel Methods for Pattern Analysis, Cambridge University Press, 2004.
• Suykens J.A.K., Van Gestel T., De Brabanter J., De Moor B., Vandewalle J., Least Squares Support Vector Machines, World Scientific, Singapore, 2002.
• Suykens J.A.K., Horvath G., Basu S., Micchelli C., Vandewalle J. (Eds.), Advances in Learning Theory: Methods, Models and Applications, vol. 190 of NATO-ASI Series III: Computer and Systems Sciences, IOS Press, Amsterdam, The Netherlands, 2003.
• Vapnik V., Statistical Learning Theory, John Wiley & Sons, New York, 1998.
• Wahba G., Spline Models for Observational Data, Series in Applied Mathematics, 59, SIAM, Philadelphia, 1990.