Data world and the evolution of science Engineering Kernel … · 2006-11-29 · Multilayer...

Engineering Kernel Machines

Johan Suykens

K.U. Leuven, ESAT-SCD/SISTA
Kasteelpark Arenberg 10

B-3001 Leuven (Heverlee), Belgium

Tel: 32/16/32 18 02 - Fax: 32/16/32 19 70
Email: [email protected]

http://www.esat.kuleuven.be/scd/

Current Challenges in Kernel Methods (CCKM 2006)
Brussels, Belgium, Nov. 2006

CCKM 2006 ⋄ Johan Suykens

Data world and the evolution of science

[Figure: diagram along a time axis with the labels mathematics, statistics, optimization theory, systems theory, machine learning, applications, technologies serving society, and "?"]

“The Paradox of the Sciences”: Science is full of fragmentation, while tackling important open problems requires interdisciplinary approaches.


Kernel based learning: interdisciplinary challenges

SVM & kernel methods

linear algebra

mathematics

statistics

systems and control theory

signal processing

optimization

machine learning

pattern recognition

data mining

neural networks

• Understanding the essential concepts and different facets of problems

• Providing systematic approaches, engineering kernel machines

• Integrative design, bridging the gaps between theory and practice


Contents

• Motivation

• Support vector machines, use of the “kernel trick”

• Core problems: least squares support vector machines

• Primal and dual representations

• Robustness

• Kernel PCA, spectral clustering, kernel CCA

• Sparseness, feature selection, relevance determination

• Prior knowledge incorporation, convex optimization


Classical MLPs

[Figure: neuron with inputs x1, x2, x3, ..., xn, weights w1, w2, w3, ..., wn, bias b, activation function h(·) and output y]

Multilayer Perceptron (MLP) properties:

• Universal approximation of continuous nonlinear functions

• Learning from input-output patterns: off-line/on-line

• Parallel network architecture, multiple inputs and outputs

+ Flexible and widely applicable:

Feedforward/recurrent networks, supervised/unsupervised learning

- Many local minima, trial and error for determining number of neurons


Support Vector Machines

[Figure: cost function versus weights, for an MLP (left) and an SVM (right)]

• Nonlinear classification and function estimation by convex optimization with a unique solution and primal-dual interpretations.

• Number of neurons automatically follows from a convex program.

• Learning and generalization in high dimensional input spaces (coping with the curse of dimensionality).

• Use of kernels (e.g. linear, polynomial, RBF, MLP, splines, kernels from graphical models, ...), application-specific kernels (e.g. bioinformatics)


Classifier with maximal margin

• Training set $\{(x_i, y_i)\}_{i=1}^N$: inputs $x_i \in \mathbb{R}^n$; class labels $y_i \in \{-1, +1\}$

• Classifier: $y(x) = \mathrm{sign}[w^T \varphi(x) + b]$

with $\varphi(\cdot): \mathbb{R}^n \to \mathbb{R}^{n_h}$ the mapping to a high dimensional feature space (which can be infinite dimensional!)

• Maximize the margin for good generalization ability (margin = $2/\|w\|_2$)

(VC theory: the linear SVM classifier dates back to the sixties)

[Figure: two scatter plots of the same two classes (x and o), left → right]


SVM classifier: primal and dual problem

• Primal problem: [Vapnik, 1995]

$\min_{w,b,\xi} J(w, \xi) = \frac{1}{2} w^T w + c \sum_{i=1}^N \xi_i$ s.t. $y_i[w^T \varphi(x_i) + b] \ge 1 - \xi_i$, $\xi_i \ge 0$, $i = 1, ..., N$

Trade-off between margin maximization and tolerating misclassifications

• Conditions for optimality from Lagrangian. Express the solution in the Lagrange multipliers.

• Dual problem: QP problem (convex problem)

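$\max_{\alpha} Q(\alpha) = -\frac{1}{2} \sum_{i,j=1}^N y_i y_j K(x_i, x_j) \alpha_i \alpha_j + \sum_{j=1}^N \alpha_j$ s.t. $\sum_{i=1}^N \alpha_i y_i = 0$, $0 \le \alpha_i \le c$, $\forall i$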


Obtaining solution via Lagrangian

• Lagrangian:

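$L(w, b, \xi; \alpha, \nu) = J(w, \xi) - \sum_{i=1}^N \alpha_i \{ y_i[w^T \varphi(x_i) + b] - 1 + \xi_i \} - \sum_{i=1}^N \nu_i \xi_i$

• Find saddle point: $\max_{\alpha,\nu} \min_{w,b,\xi} L(w, b, \xi; \alpha, \nu)$, one obtains

$\partial L / \partial w = 0 \to w = \sum_{i=1}^N \alpha_i y_i \varphi(x_i)$

$\partial L / \partial b = 0 \to \sum_{i=1}^N \alpha_i y_i = 0$

$\partial L / \partial \xi_i = 0 \to 0 \le \alpha_i \le c$, $i = 1, ..., N$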

Finally, write the solution in terms of α (Lagrange multipliers).
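As a worked step (not on the original slide), the dual cost can be recovered by substituting the optimality conditions back into the Lagrangian. The stationarity condition with respect to $\xi_i$ is used here in its explicit form $c - \alpha_i - \nu_i = 0$, which together with $\alpha_i, \nu_i \ge 0$ gives the box constraint $0 \le \alpha_i \le c$.

```latex
\begin{aligned}
L &= \tfrac{1}{2} w^T w + c \sum_i \xi_i
     - \sum_i \alpha_i y_i w^T \varphi(x_i) - b \sum_i \alpha_i y_i
     + \sum_i \alpha_i - \sum_i (\alpha_i + \nu_i)\, \xi_i \\
  &= \tfrac{1}{2} w^T w - w^T w + \sum_i \alpha_i
     + \sum_i (c - \alpha_i - \nu_i)\, \xi_i
     \qquad \text{using } w = \sum_i \alpha_i y_i \varphi(x_i),\ \sum_i \alpha_i y_i = 0 \\
  &= -\tfrac{1}{2} \sum_{i,j=1}^{N} y_i y_j K(x_i, x_j)\, \alpha_i \alpha_j
     + \sum_{j=1}^{N} \alpha_j = Q(\alpha)
     \qquad \text{using } c - \alpha_i - \nu_i = 0,\ K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)
\end{aligned}
```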


SVM classifier model representations

• Classifier: primal representation

$y(x) = \mathrm{sign}[w^T \varphi(x) + b]$

Kernel trick (Mercer Theorem): $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$

• Dual representation:

$y(x) = \mathrm{sign}[\sum_i \alpha_i y_i K(x, x_i) + b]$

Some possible kernels K(·, ·):

$K(x, x_i) = x_i^T x$ (linear)

$K(x, x_i) = (x_i^T x + \tau)^d$ (polynomial)

$K(x, x_i) = \exp(-\|x - x_i\|_2^2 / \sigma^2)$ (RBF)

$K(x, x_i) = \tanh(\kappa\, x_i^T x + \theta)$ (MLP)

• Sparseness property (many $\alpha_i = 0$)

[Figure: two-dimensional classification example, axes x(1) and x(2)]
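The four kernels listed on this slide can be written down directly. The following is a minimal NumPy sketch rather than code from the talk; the function names and the default values for τ, d, σ², κ and θ are illustrative assumptions.

```python
import numpy as np

def linear_kernel(X, Z):
    """K(x, z) = z^T x for all pairs of rows of X and Z."""
    return X @ Z.T

def polynomial_kernel(X, Z, tau=1.0, d=2):
    """K(x, z) = (z^T x + tau)^d."""
    return (X @ Z.T + tau) ** d

def rbf_kernel(X, Z, sigma2=1.0):
    """K(x, z) = exp(-||x - z||_2^2 / sigma^2)."""
    sq = (X**2).sum(1)[:, None] + (Z**2).sum(1)[None, :] - 2 * X @ Z.T
    return np.exp(-np.maximum(sq, 0.0) / sigma2)

def mlp_kernel(X, Z, kappa=1.0, theta=0.0):
    """K(x, z) = tanh(kappa z^T x + theta)."""
    return np.tanh(kappa * (X @ Z.T) + theta)

# Three toy points in R^2 (hypothetical data).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K_lin = linear_kernel(X, X)
K_poly = polynomial_kernel(X, X, tau=1.0, d=2)
K_rbf = rbf_kernel(X, X, sigma2=1.0)
K_mlp = mlp_kernel(X, X)
```

Each function returns the full kernel (Gram) matrix between the rows of `X` and `Z`, which is the object that enters the dual problem.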


SVMs: living in two worlds ...

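[Figure: input space and feature space, linked by the mapping ϕ(x); network diagrams of the primal model (hidden units ϕ1(x), ..., ϕnh(x), weights w1, ..., wnh) and of the dual model (hidden units K(x, x1), ..., K(x, x#sv), weights α1, ..., α#sv)]

Primal space (→ large data sets)

Parametric: estimate $w \in \mathbb{R}^{n_h}$

$y(x) = \mathrm{sign}[w^T \varphi(x) + b]$

Dual space (→ high dimensional inputs)

Non-parametric: estimate $\alpha \in \mathbb{R}^N$

$y(x) = \mathrm{sign}[\sum_{i=1}^{\#sv} \alpha_i y_i K(x, x_i) + b]$

$K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$ (“Kernel trick”)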


Reproducing Kernel Hilbert Space (RKHS) view

• Variational problem: [Wahba, 1990; Poggio & Girosi, 1990; Evgeniou et al., 2000]

find function f such that

$\min_{f \in H} \frac{1}{N} \sum_{i=1}^N L(y_i, f(x_i)) + \lambda \|f\|_K^2$

with L(·, ·) the loss function. $\|f\|_K$ is the norm in the RKHS H defined by K.

• Representer theorem: for any convex loss function the solution is of the form

$f(x) = \sum_{i=1}^N \alpha_i K(x, x_i)$

• Some special cases:

$L(y, f(x)) = (y - f(x))^2$: regularization network

$L(y, f(x)) = |y - f(x)|_\epsilon$: SVM regression with ε-insensitive loss function

[Figure: ε-insensitive loss function, with the points −ε, 0 and +ε marked on the axis]
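As a worked example of the representer theorem (added here, not on the original slide), consider the squared loss. Writing $K$ for the $N \times N$ kernel matrix with entries $K(x_i, x_j)$, assumed symmetric, and $y = [y_1; ...; y_N]$, the variational problem reduces to a finite-dimensional one in $\alpha$:

```latex
\begin{aligned}
&\min_{\alpha}\ \frac{1}{N} \| y - K\alpha \|_2^2 + \lambda\, \alpha^T K \alpha
  \qquad \text{since } f(x_i) = (K\alpha)_i,\ \|f\|_K^2 = \alpha^T K \alpha \\
&\nabla_\alpha = -\frac{2}{N} K (y - K\alpha) + 2\lambda K \alpha = 0
  \;\Longleftrightarrow\; K \left[ (K + N\lambda I)\alpha - y \right] = 0 \\
&\text{which is satisfied by}\quad (K + N\lambda I)\, \alpha = y
\end{aligned}
```

So for the regularization network the coefficients follow from one linear system, the same structure that reappears in the LS-SVM formulations later in the talk.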


Different views on kernel machines

[Figure: related views on kernel machines: SVM, LS-SVM, Kriging, Gaussian Processes, RKHS]

Some early history on RKHS:

1910-1920: Moore

1940: Aronszajn

1951: Krige

1970: Parzen

1971: Kimeldorf & Wahba

Obtaining complementary insights from different perspectives: kernels are used in different settings (try to get the big picture)

Support vector machines (SVM): optimization approach (primal/dual)
Reproducing kernel Hilbert spaces (RKHS): variational problem, functional analysis
Gaussian processes (GP): probabilistic/Bayesian approach


Wider use of the kernel trick

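• Angle between vectors: (e.g. correlation analysis)

Input space:

$\cos \theta_{xz} = \frac{x^T z}{\|x\|_2 \|z\|_2}$

Feature space:

$\cos \theta_{\varphi(x),\varphi(z)} = \frac{\varphi(x)^T \varphi(z)}{\|\varphi(x)\|_2 \|\varphi(z)\|_2} = \frac{K(x, z)}{\sqrt{K(x, x)} \sqrt{K(z, z)}}$

• Distance between vectors: (e.g. for “kernelized” clustering methods)

Input space:

$\|x - z\|_2^2 = (x - z)^T (x - z) = x^T x + z^T z - 2 x^T z$

Feature space:

$\|\varphi(x) - \varphi(z)\|_2^2 = K(x, x) + K(z, z) - 2 K(x, z)$

[Figure: vectors x and z with the angle θ between them]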
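The two identities on this slide can be checked numerically for a kernel whose feature map is known explicitly. The sketch below is an illustration, not from the talk: it assumes the homogeneous degree-2 polynomial kernel $K(x, z) = (x^T z)^2$ on 2-D inputs, whose feature map is $\varphi(x) = [x_1^2, \sqrt{2} x_1 x_2, x_2^2]$; all function names are hypothetical.

```python
import numpy as np

def poly2_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: K(x, z) = (x^T z)^2."""
    return float(x @ z) ** 2

def poly2_feature_map(x):
    """Explicit feature map for 2-D inputs with phi(x)^T phi(z) = (x^T z)^2."""
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

def kernel_cosine(k, x, z):
    """Cosine of the angle between phi(x) and phi(z), using kernel values only."""
    return k(x, z) / np.sqrt(k(x, x) * k(z, z))

def kernel_sq_distance(k, x, z):
    """Squared distance between phi(x) and phi(z), using kernel values only."""
    return k(x, x) + k(z, z) - 2.0 * k(x, z)

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

# Reference values computed with the explicit feature map.
phi_x, phi_z = poly2_feature_map(x), poly2_feature_map(z)
cos_explicit = phi_x @ phi_z / (np.linalg.norm(phi_x) * np.linalg.norm(phi_z))
dist_explicit = np.sum((phi_x - phi_z) ** 2)

# Same quantities obtained through the kernel trick.
cos_kernel = kernel_cosine(poly2_kernel, x, z)
dist_kernel = kernel_sq_distance(poly2_kernel, x, z)
```

For these two points $x^T z = 1$, $K(x, x) = 25$ and $K(z, z) = 100$, so both routes give a cosine of 0.02 and a squared distance of 123.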


Least Squares Support Vector Machines: “core problems”

• Regression (RR)

$\min_{w,b,e} w^T w + \gamma \sum_i e_i^2$ s.t. $y_i = w^T \varphi(x_i) + b + e_i$, $\forall i$

• Classification (FDA)

$\min_{w,b,e} w^T w + \gamma \sum_i e_i^2$ s.t. $y_i (w^T \varphi(x_i) + b) = 1 - e_i$, $\forall i$

• Principal component analysis (PCA)

$\min_{w,b,e} -w^T w + \gamma \sum_i e_i^2$ s.t. $e_i = w^T \varphi(x_i) + b$, $\forall i$

• Canonical correlation analysis/partial least squares (CCA/PLS)

$\min_{w,v,b,d,e,r} w^T w + v^T v + \nu_1 \sum_i e_i^2 + \nu_2 \sum_i r_i^2 - \gamma \sum_i e_i r_i$ s.t. $e_i = w^T \varphi_1(x_i) + b$, $r_i = v^T \varphi_2(y_i) + d$

• partially linear models, spectral clustering, subspace algorithms, ...


LS-SVM classifier

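• Preserve support vector machine methodology, but simplify via least squares and equality constraints [Suykens, 1999]

• Primal problem:

$\min_{w,b,e} J(w, e) = \frac{1}{2} w^T w + \gamma \frac{1}{2} \sum_{i=1}^N e_i^2$ s.t. $y_i [w^T \varphi(x_i) + b] = 1 - e_i$, $\forall i$

• Dual problem:

$\begin{bmatrix} 0 & y^T \\ y & \Omega + I/\gamma \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ 1_N \end{bmatrix}$

where $\Omega_{ij} = y_i y_j \varphi(x_i)^T \varphi(x_j) = y_i y_j K(x_i, x_j)$ and $y = [y_1; ...; y_N]$.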

• LS-SVM classifiers perform very well on 20 UCI data sets [Van Gestel et al., ML 2004]

Winning results in competition WCCI 2006 by [Cawley, 2006]
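The dual problem of the LS-SVM classifier is a single linear system, so training fits in a few lines. The following is a minimal NumPy sketch under stated assumptions: an RBF kernel, hand-picked values γ = 10 and σ² = 1, a six-point toy data set, and hypothetical function names. It is not the implementation used in the cited benchmark studies.

```python
import numpy as np

def rbf_kernel(X, Z, sigma2=1.0):
    """K(x, z) = exp(-||x - z||_2^2 / sigma^2)."""
    sq = (X**2).sum(1)[:, None] + (Z**2).sum(1)[None, :] - 2 * X @ Z.T
    return np.exp(-np.maximum(sq, 0.0) / sigma2)

def lssvm_train(X, y, gamma=10.0, sigma2=1.0):
    """Solve [[0, y^T], [y, Omega + I/gamma]] [b; alpha] = [0; 1_N]."""
    N = len(y)
    Omega = np.outer(y, y) * rbf_kernel(X, X, sigma2)
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = Omega + np.eye(N) / gamma
    rhs = np.concatenate(([0.0], np.ones(N)))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]  # b, alpha

def lssvm_predict(X_new, X, y, b, alpha, sigma2=1.0):
    """Dual representation: sign(sum_i alpha_i y_i K(x, x_i) + b)."""
    return np.sign(rbf_kernel(X_new, X, sigma2) @ (alpha * y) + b)

# Two well-separated toy clusters (hypothetical data).
X = np.array([[-2.0, -2.0], [-1.5, -2.5], [-2.5, -1.0],
              [2.0, 2.0], [1.5, 2.5], [2.5, 1.0]])
y = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])

b, alpha = lssvm_train(X, y)
y_hat = lssvm_predict(X, X, y, b, alpha)
```

Unlike the SVM dual, no QP solver is needed; the price is that the support values $\alpha_i$ are in general all nonzero, which is why sparseness is treated separately later in the talk.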


Obtaining solution from Lagrangian

• Lagrangian:

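$L(w, b, e; \alpha) = J(w, e) - \sum_{i=1}^N \alpha_i \{ y_i[w^T \varphi(x_i) + b] - 1 + e_i \}$

with Lagrange multipliers $\alpha_i$ (support values).

• Conditions for optimality:

$\partial L / \partial w = 0 \to w = \sum_{i=1}^N \alpha_i y_i \varphi(x_i)$

$\partial L / \partial b = 0 \to \sum_{i=1}^N \alpha_i y_i = 0$

$\partial L / \partial e_i = 0 \to \alpha_i = \gamma e_i$, $i = 1, ..., N$

$\partial L / \partial \alpha_i = 0 \to y_i[w^T \varphi(x_i) + b] - 1 + e_i = 0$, $i = 1, ..., N$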

Eliminate w, e and write solution in α, b.
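The elimination step can be spelled out as follows (a worked step added for clarity). Substituting $w = \sum_j \alpha_j y_j \varphi(x_j)$ and $e_i = \alpha_i / \gamma$ into the last optimality condition gives one equation per training point, which together with $\sum_i \alpha_i y_i = 0$ is the linear system of the dual problem:

```latex
\begin{aligned}
& y_i \Big[ \sum_{j=1}^{N} \alpha_j y_j\, \varphi(x_j)^T \varphi(x_i) + b \Big] - 1 + \frac{\alpha_i}{\gamma} = 0 \\
\Longleftrightarrow\quad
& \sum_{j=1}^{N} \underbrace{y_i y_j K(x_i, x_j)}_{\Omega_{ij}}\, \alpha_j + y_i b + \frac{\alpha_i}{\gamma} = 1,
  \qquad i = 1, ..., N \\
\Longleftrightarrow\quad
& y\, b + (\Omega + I/\gamma)\, \alpha = 1_N, \qquad y^T \alpha = 0
\end{aligned}
```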


Microarray data analysis

FDA
LS-SVM classifier (linear, RBF)
Kernel PCA + FDA (unsupervised selection of PCs) (supervised selection of PCs)

Use regularization for linear classifiers

Systematic benchmarking study in [Pochet et al., Bioinformatics 2004]

Webservice: http://www.esat.kuleuven.ac.be/MACBETH/

Efficient computational methods for feature selection by rank-one updates

[Ojeda et al., 2006]


Weighted versions and robustness

[Figure: two routes. SVM: convex cost function → convex optimization → SVM solution. Weighted LS-SVM: LS-SVM solution → robust statistics → weighted version with modified cost function.]

• Weighted LS-SVM: $\min_{w,b,e} \frac{1}{2} w^T w + \gamma \frac{1}{2} \sum_{i=1}^N v_i e_i^2$ s.t. $y_i = w^T \varphi(x_i) + b + e_i$, $\forall i$

with $v_i$ determined from $\{e_i\}_{i=1}^N$ of unweighted LS-SVM [Suykens et al., 2002].

Robustness and stability of reweighted kernel based regression [Debruyne et al., 2006].

• SVM solution by applying iteratively weighted LS [Perez-Cruz et al., 2005]
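A two-stage sketch of the weighted LS-SVM for regression is given below. The linear system follows from the weighted primal problem above ($e_i = \alpha_i / (\gamma v_i)$, $\sum_i \alpha_i = 0$). The specific weighting rule is an assumption of this sketch: a piecewise-linear rule with thresholds c1 = 2.5 and c2 = 3 applied to residuals scaled by 1.483 times the median absolute deviation. The slide itself only states that $v_i$ is determined from the unweighted residuals. Data, kernel parameters and function names are hypothetical.

```python
import numpy as np

def rbf_kernel(X, Z, sigma2=1.0):
    sq = (X**2).sum(1)[:, None] + (Z**2).sum(1)[None, :] - 2 * X @ Z.T
    return np.exp(-np.maximum(sq, 0.0) / sigma2)

def solve_lssvm_regression(K, y, gamma, v=None):
    """Solve [[0, 1^T], [1, K + diag(1/(gamma v))]] [b; alpha] = [0; y]."""
    N = len(y)
    v = np.ones(N) if v is None else v
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = K + np.diag(1.0 / (gamma * v))
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    return sol[0], sol[1:]  # b, alpha

def robust_weights(e, c1=2.5, c2=3.0, floor=1e-4):
    """Assumed weighting rule: 1 for small, linear decay for medium, floor for large residuals."""
    s = 1.483 * np.median(np.abs(e - np.median(e)))  # robust scale estimate (MAD)
    r = np.abs(e / s)
    v = np.ones_like(e)
    mid = (r > c1) & (r <= c2)
    v[mid] = (c2 - r[mid]) / (c2 - c1)
    v[r > c2] = floor
    return np.maximum(v, floor)

# Smooth toy signal with a single outlier at x = 0 (index 20).
x = np.linspace(-3.0, 3.0, 41)[:, None]
y = np.sin(x[:, 0])
y[20] += 5.0

K = rbf_kernel(x, x, sigma2=1.0)
gamma = 10.0

# Stage 1: unweighted LS-SVM, residuals e_i = alpha_i / gamma.
b0, alpha0 = solve_lssvm_regression(K, y, gamma)
e0 = alpha0 / gamma
fit0 = K @ alpha0 + b0

# Stage 2: reweight and solve again.
v = robust_weights(e0)
b1, alpha1 = solve_lssvm_regression(K, y, gamma, v)
fit1 = K @ alpha1 + b1
```

The outlier produces the largest unweighted residual, receives a small weight, and the second fit is pulled much less towards it.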


Kernel principal component analysis (KPCA)

[Figure: two panels on the same two-dimensional data set: linear PCA (left) and kernel PCA with RBF kernel (right)]

Kernel PCA [Schölkopf et al., 1998]: take eigenvalue decomposition of the kernel matrix

$\begin{bmatrix} K(x_1, x_1) & \dots & K(x_1, x_N) \\ \vdots & & \vdots \\ K(x_N, x_1) & \dots & K(x_N, x_N) \end{bmatrix}$

(applications in dimensionality reduction and denoising)

Where is the regularization?


Kernel PCA: primal and dual problem

• Underlying primal problem with regularization term [Suykens et al., 2003]

• Primal problem:

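$\min_{w,b,e} -\frac{1}{2} w^T w + \frac{1}{2} \gamma \sum_{i=1}^N e_i^2$ s.t. $e_i = w^T \varphi(x_i) + b$, $i = 1, ..., N$.

(or alternatively $\min \frac{1}{2} w^T w - \frac{1}{2} \gamma \sum_{i=1}^N e_i^2$)

• Dual problem = kernel PCA:

$\Omega_c \alpha = \lambda \alpha$ with $\lambda = 1/\gamma$

with $\Omega_{c,ij} = (\varphi(x_i) - \mu_\varphi)^T (\varphi(x_j) - \mu_\varphi)$ the centered kernel matrix.

• Score variables (allowing also out-of-sample extensions):

$z(x) = w^T (\varphi(x) - \mu_\varphi) = \sum_j \alpha_j \left( K(x_j, x) - \frac{1}{N} \sum_r K(x_r, x) - \frac{1}{N} \sum_r K(x_r, x_j) + \frac{1}{N^2} \sum_r \sum_s K(x_r, x_s) \right)$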
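The dual eigenvalue problem and the out-of-sample score formula can be sketched in NumPy as follows. This is an illustration under assumptions: an RBF kernel with σ² = 1, random toy data, unit-norm eigenvectors as returned by `numpy.linalg.eigh` (the slide does not fix the normalization of α), and hypothetical function names.

```python
import numpy as np

def rbf_kernel(X, Z, sigma2=1.0):
    sq = (X**2).sum(1)[:, None] + (Z**2).sum(1)[None, :] - 2 * X @ Z.T
    return np.exp(-np.maximum(sq, 0.0) / sigma2)

def center_kernel(K):
    """Omega_c = M K M with M = I - (1/N) 1 1^T."""
    N = K.shape[0]
    M = np.eye(N) - np.ones((N, N)) / N
    return M @ K @ M

def kpca_fit(K, n_components):
    """Leading eigenpairs of the centered kernel matrix: Omega_c alpha = lambda alpha."""
    lam, A = np.linalg.eigh(center_kernel(K))
    order = np.argsort(lam)[::-1][:n_components]
    return lam[order], A[:, order]

def kpca_score(K_new, K, alpha):
    """Score variables z(x) for new points; K_new[m, j] = K(x_new_m, x_j)."""
    row_mean = K_new.mean(axis=1, keepdims=True)  # (1/N) sum_r K(x_r, x)
    col_mean = K.mean(axis=0)[None, :]            # (1/N) sum_r K(x_r, x_j)
    total_mean = K.mean()                         # (1/N^2) sum_r sum_s K(x_r, x_s)
    return (K_new - row_mean - col_mean + total_mean) @ alpha

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 2))       # training points (toy data)
X_new = rng.standard_normal((5, 2))    # out-of-sample points

K = rbf_kernel(X, X)
lam, alpha = kpca_fit(K, n_components=2)

Z_train = kpca_score(K, K, alpha)                  # scores of the training points
Z_new = kpca_score(rbf_kernel(X_new, X), K, alpha) # out-of-sample extension
```

On the training points the score formula reproduces $\Omega_c \alpha = \lambda \alpha$, which gives a simple consistency check for an implementation.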


Generalizations to Kernel PCA: other loss functions

• Consider general loss function L:

$\min_{w,b,e} -\frac{1}{2} w^T w + \frac{1}{2} \gamma \sum_{i=1}^N L(e_i)$ s.t. $e_i = w^T \varphi(x_i) + b$, $i = 1, ..., N$.

Generalizations of KPCA that lead to robustness and sparseness, e.g. Vapnik ε-insensitive loss, Huber loss function [Alzate & Suykens, 2006].

• Weighted least squares versions and incorporation of constraints:

$\min_{w,b,e} -\frac{1}{2} w^T w + \frac{1}{2} \gamma \sum_{i=1}^N v_i e_i^2$ s.t. $e_i = w^T \varphi(x_i) + b$, $i = 1, ..., N$, $\sum_{i=1}^N e_i e_i^{(1)} = 0$, ..., $\sum_{i=1}^N e_i e_i^{(i-1)} = 0$

Find i-th PC w.r.t. i − 1 orthogonality constraints (previous PC $e_i^{(j)}$).

The solution is given by a generalized eigenvalue problem.


Generalizations to Kernel PCA: robust denoising

[Figure: denoised images, classical kernel PCA (left) versus robust method (right)]

First rows: test set images, corrupted with Gaussian noise and outliers
Next rows: application of different pre-image algorithms
Robust method: fewer components needed and improved results


Generalizations to Kernel PCA: sparseness

[Figure: four panels with axes x1 and x2: denoising result (top) and the support vectors for PC1, PC2 and PC3 (bottom)]

Sparse kernel PCA using ε-insensitive loss [Alzate & Suykens, 2006] (top figure: denoising; bottom figures: different support vectors (in black) per principal component vector)


Spectral clustering: weighted KPCA

• Spectral graph clustering [Chung, 1997; Shi & Malik, 2000; Ng et al., 2002]

• Normalized cut problem

$L q = \lambda D q$

with $L = D - W$ the Laplacian of the graph. Cluster membership indicators are given by q.

• KPCA Weighted LS-SVM formulation to normalized cut:

$\min_{w,b,e} -\frac{1}{2} w^T w + \gamma \frac{1}{2} e^T V e$ such that $e_i = w^T \varphi(x_i) + b$, $\forall i = 1, ..., N$

with $V = D^{-1}$ the inverse degree matrix [Alzate & Suykens, 2006]. Allows for out-of-sample extensions on test data.

[Figure: graph with six nodes (1 to 6), showing a cut of size 2 and a cut of size 1 (minimal cut)]
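The generalized eigenvalue problem $Lq = \lambda Dq$ can be solved with a symmetric eigensolver after the substitution $q = D^{-1/2} u$. The sketch below uses a hypothetical six-node graph (two triangles joined by a single edge), chosen for illustration; it is not necessarily the graph drawn on the slide, and it covers only the classical normalized cut relaxation, not the weighted kernel PCA formulation with out-of-sample extension.

```python
import numpy as np

def normalized_cut_indicator(W):
    """Solve L q = lambda D q and return all eigenvalues and the second eigenvector q."""
    d = W.sum(axis=1)
    L = np.diag(d) - W                      # graph Laplacian L = D - W
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    # L q = lambda D q  <=>  (D^{-1/2} L D^{-1/2}) u = lambda u  with  q = D^{-1/2} u
    lam, U = np.linalg.eigh(D_inv_sqrt @ L @ D_inv_sqrt)
    q = D_inv_sqrt @ U[:, 1]                # eigenvector of the second smallest eigenvalue
    return lam, q

# Nodes 0-2 and 3-5 form two triangles; the edge (2, 3) is the minimal cut.
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[i, j] = W[j, i] = 1.0

lam, q = normalized_cut_indicator(W)
labels = (q > 0).astype(int)  # the sign of q gives the cluster membership
```

The smallest eigenvalue is zero (constant eigenvector); the sign pattern of the next eigenvector separates the two triangles. Which triangle gets label 1 depends on the arbitrary sign of the eigenvector.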


Application to image segmentation

[Figure: given image (240 × 160) and its image segmentation]

Large scale image: out-of-sample extension [Alzate & Suykens, 2006]


Primal versus dual problems

Example 1: microarray data (10.000 genes & 50 training data)

Classifier model:
$\mathrm{sign}(w^T x + b)$ (primal)
$\mathrm{sign}(\sum_i \alpha_i y_i x_i^T x + b)$ (dual)

primal: $w \in \mathbb{R}^{10.000}$ (only 50 training data!)

dual: $\alpha \in \mathbb{R}^{50}$

Example 2: datamining problem (1.000.000 training data & 20 inputs)

primal: $w \in \mathbb{R}^{20}$

dual: $\alpha \in \mathbb{R}^{1.000.000}$ (kernel matrix: 1.000.000 × 1.000.000 !)
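The trade-off between the two representations can be made concrete with a linear least squares model. The sketch below is an illustration under assumptions: it drops the bias term b for brevity, uses ridge-type least squares on the labels rather than an SVM, and scales Example 1 down to 1000 inputs and 50 training points with random data. The primal system has size d × d and the dual system size N × N, yet both give the same model.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, gamma = 50, 1000, 1.0           # few data points, many inputs (scaled-down Example 1)
X = rng.standard_normal((N, d))
y = np.sign(rng.standard_normal(N))   # random labels in {-1, +1}, for illustration only

# Primal: unknown w in R^d, linear system of size d x d.
A_primal = X.T @ X + np.eye(d) / gamma
w_primal = np.linalg.solve(A_primal, X.T @ y)

# Dual: unknown alpha in R^N, linear system of size N x N (linear kernel matrix X X^T).
A_dual = X @ X.T + np.eye(N) / gamma
alpha = np.linalg.solve(A_dual, y)
w_dual = X.T @ alpha                  # w = sum_i alpha_i x_i

# Both representations give the same decision values on new points.
X_new = rng.standard_normal((5, d))
f_primal = X_new @ w_primal
f_dual = (X_new @ X.T) @ alpha
```

With d much larger than N the dual system is the cheaper one to solve; in the situation of Example 2 (N much larger than d) the roles are reversed.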


Fixed-size LS-SVM: primal-dual kernel machines

[Figure: primal space and dual space linked through the Nyström method, kernel PCA, density estimate, entropy criteria, eigenfunctions, SV selection and regression]

Link Nyström approximation (GP) - kernel PCA - density estimation [Girolami, 2002; Williams & Seeger, 2001]

Modelling in view of primal-dual representations [Suykens et al., 2002]: primal space estimation, sparse, large scale
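The idea of estimating in the primal space with an approximate feature map can be sketched as follows. This is a simplified illustration, not the fixed-size LS-SVM algorithm itself: the subset of M points is taken on an even grid instead of being selected with an entropy criterion, the data are a noise-free toy signal, and all function names and parameter values are assumptions.

```python
import numpy as np

def rbf_kernel(X, Z, sigma2=1.0):
    sq = (X**2).sum(1)[:, None] + (Z**2).sum(1)[None, :] - 2 * X @ Z.T
    return np.exp(-np.maximum(sq, 0.0) / sigma2)

def nystrom_feature_map(X_sub, sigma2=1.0, tol=1e-10):
    """Approximate feature map from the eigendecomposition of the subset kernel matrix."""
    lam, U = np.linalg.eigh(rbf_kernel(X_sub, X_sub, sigma2))
    keep = lam > tol                      # drop numerically zero eigenvalues
    lam, U = lam[keep], U[:, keep]
    def phi(X):
        # phi(x) = Lambda^{-1/2} U^T [K(x, x_1), ..., K(x, x_M)]
        return rbf_kernel(X, X_sub, sigma2) @ U / np.sqrt(lam)
    return phi

def primal_ridge(Phi, y, gamma):
    """min 1/2 w^T w + gamma/2 sum_i e_i^2  s.t.  y_i = w^T phi(x_i) + b + e_i."""
    N, m = Phi.shape
    A = np.zeros((m + 1, m + 1))
    A[:m, :m] = Phi.T @ Phi + np.eye(m) / gamma
    A[:m, m] = Phi.sum(axis=0)
    A[m, :m] = Phi.sum(axis=0)
    A[m, m] = N
    rhs = np.concatenate((Phi.T @ y, [y.sum()]))
    sol = np.linalg.solve(A, rhs)
    return sol[:m], sol[m]  # w, b

# Toy regression problem: sin(x)/x on [-10, 10].
N, M = 200, 20
x = np.linspace(-10.0, 10.0, N)[:, None]
y = np.sinc(x[:, 0] / np.pi)

idx = np.linspace(0, N - 1, M).astype(int)  # subset of M points (stand-in for SV selection)
phi = nystrom_feature_map(x[idx])
Phi = phi(x)                                # N x M feature matrix

w, b = primal_ridge(Phi, y, gamma=100.0)
y_hat = Phi @ w + b
rmse = np.sqrt(np.mean((y - y_hat) ** 2))
K_sub = rbf_kernel(x[idx], x[idx])
```

The linear system has size (M + 1) × (M + 1) regardless of N, which is what makes the primal route attractive for large data sets. On the subset points the approximate feature map reproduces the kernel exactly.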


Fixed-size LS-SVM: toy examples

[Figure: four panels: two regression examples (y versus x) and two examples in a two-dimensional input space]

Sparse representations with estimation in primal space


Large scale problems: Fixed-size LS-SVM

Estimate in primal (approximate feature map from KPCA on subset)

Santa Fe laser data

[Figure: Santa Fe laser time series $y_k$ versus discrete time k (top) and prediction $\hat{y}_k$ compared with $y_k$ versus discrete time k (bottom)]

Training: $\hat{y}_{k+1} = f(y_k, y_{k-1}, ..., y_{k-p})$
Iterative prediction: $\hat{y}_{k+1} = f(\hat{y}_k, \hat{y}_{k-1}, ..., \hat{y}_{k-p})$
(works well for p large, e.g. p = 50) [Espinoza et al., 2003]
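The difference between one-step-ahead training and iterative prediction is that, in the latter, the model's own outputs are fed back as regressors. The sketch below illustrates only this mechanism. As a stand-in for the kernel model it assumes a linear least squares predictor, and as a stand-in for the laser data a sinusoid, for which p = 1 is sufficient; all names are hypothetical.

```python
import numpy as np

def make_regressors(y, p):
    """Rows [y_k, y_{k-1}, ..., y_{k-p}] with targets y_{k+1}."""
    rows = [y[k - p:k + 1][::-1] for k in range(p, len(y) - 1)]
    return np.array(rows), y[p + 1:]

def iterate(f, history, n_steps, p):
    """Iterative prediction: feed each prediction back as the newest regressor."""
    buf = list(history[-(p + 1):])           # oldest ... newest
    out = []
    for _ in range(n_steps):
        y_next = f(np.array(buf[::-1]))      # regressor ordered newest first
        out.append(y_next)
        buf = buf[1:] + [y_next]
    return np.array(out)

# Toy series: a sinusoid obeys y_{k+1} = 2 cos(0.2) y_k - y_{k-1}.
k = np.arange(300)
y = np.sin(0.2 * k)
p = 1

# Training on measured data (one-step-ahead).
R, t = make_regressors(y[:200], p)
theta, *_ = np.linalg.lstsq(R, t, rcond=None)
f = lambda r: float(r @ theta)               # stand-in for the estimated model f

# Iterative prediction of the next 100 samples, started from the last training samples.
y_hat = iterate(f, y[:200], n_steps=100, p=p)
```

With a nonlinear kernel model in place of the linear stand-in, only `f` changes; the regressor construction and the feedback loop stay the same.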


Partially linear models: nonlinear system identification

[Figure: input signal and output signal (with test, training and validation parts marked) versus discrete time index; residuals versus discrete time index for the two models]

Silver box benchmark study (physical system with cubic nonlinearity): (top-right) full black-box, (bottom-right) partially linear

Related application: power load forecasting [Espinoza et al., 2005]


Kernel Canonical Correlation Analysis

[Figure: data in space X and space Y, mapped by $\varphi_1(\cdot)$ and $\varphi_2(\cdot)$ to a feature space on X and a feature space on Y, and then to target spaces]

Correlation: $\min_{w,v} \sum_i \|z_{x_i} - z_{y_i}\|_2^2$

$z_x = w^T \varphi_1(x)$, $z_y = v^T \varphi_2(y)$

Applications of kernel CCA [Suykens et al., 2002, Bach & Jordan, 2002] e.g. in:

- bioinformatics (correlation gene network - gene expression profiles) [Vert et al., 2003]

- information retrieval, fMRI [Shawe-Taylor et al., 2004]

- state estimation of dynamical systems, subspace algorithms [Goethals et al., 2005]


LS-SVM formulation to Kernel CCA

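• Score variables: $z_x = w^T (\varphi_1(x) - \mu_{\varphi_1})$, $z_y = v^T (\varphi_2(y) - \mu_{\varphi_2})$

Feature maps $\varphi_1$, $\varphi_2$, kernels $K_1(x_i, x_j) = \varphi_1(x_i)^T \varphi_1(x_j)$, $K_2(y_i, y_j) = \varphi_2(y_i)^T \varphi_2(y_j)$

• Primal problem: (Kernel PLS case: $\nu_1 = 0$, $\nu_2 = 0$ [Hoegaerts et al., 2004])

$\max_{w,v,e,r} \gamma \sum_{i=1}^N e_i r_i - \nu_1 \frac{1}{2} \sum_{i=1}^N e_i^2 - \nu_2 \frac{1}{2} \sum_{i=1}^N r_i^2 - \frac{1}{2} w^T w - \frac{1}{2} v^T v$

such that $e_i = w^T (\varphi_1(x_i) - \mu_{\varphi_1})$, $r_i = v^T (\varphi_2(y_i) - \mu_{\varphi_2})$, $\forall i$

with $\mu_{\varphi_1} = (1/N) \sum_{i=1}^N \varphi_1(x_i)$, $\mu_{\varphi_2} = (1/N) \sum_{i=1}^N \varphi_2(y_i)$.

• Dual problem: generalized eigenvalue problem [Suykens et al., 2002]

$\begin{bmatrix} 0 & \Omega_{c,2} \\ \Omega_{c,1} & 0 \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix} = \lambda \begin{bmatrix} \nu_1 \Omega_{c,1} + I & 0 \\ 0 & \nu_2 \Omega_{c,2} + I \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix}$, $\lambda = 1/\gamma$

with $(\Omega_{c,1})_{ij} = (\varphi_1(x_i) - \mu_{\varphi_1})^T (\varphi_1(x_j) - \mu_{\varphi_1})$, $(\Omega_{c,2})_{ij} = (\varphi_2(y_i) - \mu_{\varphi_2})^T (\varphi_2(y_j) - \mu_{\varphi_2})$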


Obtaining solution from Lagrangian

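• Lagrangian $L(w, v, e, r; \alpha, \beta) = \gamma \sum_{i=1}^N e_i r_i - \nu_1 \frac{1}{2} \sum_{i=1}^N e_i^2 - \nu_2 \frac{1}{2} \sum_{i=1}^N r_i^2 - \frac{1}{2} w^T w - \frac{1}{2} v^T v - \sum_{i=1}^N \alpha_i [e_i - w^T (\varphi_1(x_i) - \mu_{\varphi_1})] - \sum_{i=1}^N \beta_i [r_i - v^T (\varphi_2(y_i) - \mu_{\varphi_2})]$

• Conditions for optimality (eliminate w, v, e, r)

$\partial L / \partial w = 0 \to w = \sum_{i=1}^N \alpha_i (\varphi_1(x_i) - \mu_{\varphi_1})$

$\partial L / \partial v = 0 \to v = \sum_{i=1}^N \beta_i (\varphi_2(y_i) - \mu_{\varphi_2})$

$\partial L / \partial e_i = 0 \to \gamma v^T (\varphi_2(y_i) - \mu_{\varphi_2}) = \nu_1 w^T (\varphi_1(x_i) - \mu_{\varphi_1}) + \alpha_i$, $i = 1, ..., N$

$\partial L / \partial r_i = 0 \to \gamma w^T (\varphi_1(x_i) - \mu_{\varphi_1}) = \nu_2 v^T (\varphi_2(y_i) - \mu_{\varphi_2}) + \beta_i$, $i = 1, ..., N$

$\partial L / \partial \alpha_i = 0 \to e_i = w^T (\varphi_1(x_i) - \mu_{\varphi_1})$, $i = 1, ..., N$

$\partial L / \partial \beta_i = 0 \to r_i = v^T (\varphi_2(y_i) - \mu_{\varphi_2})$, $i = 1, ..., N$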


Bayesian inference

[Figure: three levels of Bayesian inference, each of the form Posterior = Likelihood × Prior / Evidence and each with a maximization step: Level 1 (parameters), Level 2 (hyperparameters), Level 3 (model comparison)]

Automatic relevance determination (ARD) [MacKay, 1998]: infer elements of diagonal matrix S in $K(x_i, x_j) = \exp(-(x_i - x_j)^T S (x_i - x_j))$ which indicate how relevant input variables are (but: many local minima, computationally expensive).
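The role of the diagonal matrix S can be seen from the kernel alone: an input variable whose diagonal entry is zero has no influence on the kernel value. The sketch below shows only this kernel with hand-picked diagonal entries; inferring S within the Bayesian framework is not attempted here. The function name and the data are hypothetical.

```python
import numpy as np

def ard_rbf_kernel(X, Z, s):
    """K(x, z) = exp(-(x - z)^T S (x - z)) with S = diag(s), s >= 0."""
    diff = X[:, None, :] - Z[None, :, :]   # pairwise differences, shape (n, m, dim)
    return np.exp(-np.einsum("ijk,k,ijk->ij", diff, s, diff))

X = np.array([[0.0, 0.0], [1.0, 5.0]])
X_changed = np.array([[0.0, 0.0], [1.0, -7.0]])  # only the second input differs

s_first_only = np.array([1.0, 0.0])   # second input switched off
s_both = np.array([1.0, 1.0])

K_a = ard_rbf_kernel(X, X, s_first_only)
K_b = ard_rbf_kernel(X_changed, X_changed, s_first_only)
K_full = ard_rbf_kernel(X, X, s_both)
```

With `s_first_only` the two kernel matrices coincide although the second input changed, whereas with `s_both` the large difference in the second input drives the off-diagonal entry towards zero.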


Classification of brain tumors using ARD

Bayesian learning (automatic relevance determination) of most relevant frequencies [Lu, 2005]


Hierarchical Kernel Machines

[Figure: hierarchical kernel machine with Levels 1, 2 and 3; labels: LS-SVM substrate, sparseness, structure detection, model selection. Conceptually: hierarchical kernel machine. Computationally: convex optimization.]

Hierarchical modelling approach leading to convex optimization problem
Computationally fusing training, hyperparameter and model selection
Optimization modelling: sparseness, input/structure selection, stability ...

[Pelckmans et al., ML 2006]


Additive regularization trade-off

• Traditional Tikhonov regularization scheme:

$\min_{w,e} w^T w + \gamma \sum_i e_i^2$ s.t. $e_i = y_i - w^T \varphi(x_i)$, $\forall i = 1, ..., N$

Training solution for fixed value of γ:

(K + I/γ)α = y

→ Selection of γ via validation set: non-convex problem

• Additive regularization trade-off [Pelckmans et al., 2005]:

$\min_{w,e} w^T w + \sum_i (e_i - c_i)^2$ s.t. $e_i = y_i - w^T \varphi(x_i)$, $\forall i = 1, ..., N$

Training solution for fixed value of c = [c1; ...; cN ]:

(K + I)α = y − c

→ Selection of c via validation set: can be convex problem
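One way to see why the additive scheme is easier to tune: for a fixed kernel matrix, the training solution $\alpha = (K + I)^{-1}(y - c)$ is an affine function of c, whereas $(K + I/\gamma)^{-1} y$ depends nonlinearly on γ. The sketch below checks this numerically on toy data; the RBF kernel, the data and the function names are assumptions, and the validation-level optimization itself is not implemented.

```python
import numpy as np

def rbf_kernel(X, Z, sigma2=1.0):
    sq = (X**2).sum(1)[:, None] + (Z**2).sum(1)[None, :] - 2 * X @ Z.T
    return np.exp(-np.maximum(sq, 0.0) / sigma2)

def tikhonov_alpha(K, y, gamma):
    """Training solution of the Tikhonov scheme: (K + I/gamma) alpha = y."""
    return np.linalg.solve(K + np.eye(len(y)) / gamma, y)

def areg_alpha(K, y, c):
    """Training solution of the additive scheme: (K + I) alpha = y - c."""
    return np.linalg.solve(K + np.eye(len(y)), y - c)

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(-3.0, 3.0, 25))[:, None]
y = np.sin(x[:, 0]) + 0.1 * rng.standard_normal(25)
K = rbf_kernel(x, x)

c1 = rng.standard_normal(25)
c2 = rng.standard_normal(25)
a0 = areg_alpha(K, y, np.zeros(25))

# c = 0 recovers the Tikhonov solution with gamma = 1.
same_as_tikhonov = np.allclose(a0, tikhonov_alpha(K, y, 1.0))

# Affine dependence on c: increments add up.
lhs = areg_alpha(K, y, c1 + c2) - a0
rhs = (areg_alpha(K, y, c1) - a0) + (areg_alpha(K, y, c2) - a0)
```

Because predictions on a validation set are linear in α, they are affine in c as well, so a squared validation error is a convex quadratic function of c.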


Sparse models

• SVM classically: sparse solution from QP problem at training level

• Hierarchical kernel machine: fused problem with sparseness obtained at the validation level [Pelckmans et al., 2005]

[Figure: LS-SVM classifier with RBF kernel (γ = 5.3667, σ² = 0.90784) on a data set with 2 different classes (class 1, class 2), axes X1 and X2]


Additive models and structure detection

• Additive models: $y(x) = \sum_{p=1}^P w^{(p)T} \varphi^{(p)}(x^{(p)})$ with $x^{(p)}$ the p-th input variable and a feature map $\varphi^{(p)}$ for each variable. This leads to the kernel $K(x_i, x_j) = \sum_{p=1}^P K^{(p)}(x_i^{(p)}, x_j^{(p)})$.

• Structure detection [Pelckmans et al., 2005]:

$\min_{w,e,t} \rho \sum_{p=1}^P t_p + \sum_{p=1}^P w^{(p)T} w^{(p)} + \gamma \sum_{i=1}^N e_i^2$

s.t. $y_i = \sum_{p=1}^P w^{(p)T} \varphi^{(p)}(x_i^{(p)}) + e_i$, $\forall i = 1, ..., N$

$-t_p \le w^{(p)T} \varphi^{(p)}(x_i^{(p)}) \le t_p$, $\forall i = 1, ..., N$, $\forall p = 1, ..., P$

Study how the solution with maximal variation varies for different values of ρ

[Figure: maximal variation versus ρ (logarithmic axis from 10^0 to 10^4) for 4 relevant input variables and 21 irrelevant input variables]
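The additive kernel is a sum of one-dimensional kernels, one per input variable. A minimal sketch follows, assuming an RBF kernel with a common σ² for every component (the slide leaves the component kernels $K^{(p)}$ unspecified); the function name is hypothetical. The structure detection problem itself requires a constrained solver and is not shown.

```python
import numpy as np

def additive_rbf_kernel(X, Z, sigma2=1.0):
    """K(x, z) = sum_p K_p(x^(p), z^(p)) with K_p a one-dimensional RBF kernel."""
    K = np.zeros((X.shape[0], Z.shape[0]))
    for p in range(X.shape[1]):
        d = X[:, p][:, None] - Z[:, p][None, :]   # differences in the p-th input only
        K += np.exp(-d**2 / sigma2)
    return K

X = np.array([[0.0, 0.0], [1.0, 2.0]])
K_add = additive_rbf_kernel(X, X)
```

Each component contributes 1 on the diagonal, so the diagonal entries equal the number of inputs P, and the off-diagonal entry here is $e^{-1} + e^{-4}$.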


Incorporation of prior knowledge

• Example: LS-SVM regression with monotonicity constraint

$\min_{w,b,e} \frac{1}{2} w^T w + \gamma \frac{1}{2} \sum_{i=1}^N e_i^2$ s.t. $y_i = w^T \varphi(x_i) + b + e_i$, $\forall i = 1, ..., N$

$w^T \varphi(x_i) \le w^T \varphi(x_{i+1})$, $\forall i = 1, ..., N - 1$

• Application: estimation of cdf [Pelckmans et al., 2005]

[Figure: two panels of P(X) versus X. Left: empirical cdf and true cdf, with Y1 and Y2 marked. Right: ecdf, cdf, Chebychev, mkr, mLS-SVM]


Equivalent kernels from constraints

Regression with autocorrelated errors:

$\min_{w,b,r,e} w^T w + \gamma \sum_i r_i^2$ s.t. $y_i = w^T \varphi(x_i) + b + e_i$ $(i = 1, ..., N)$, $e_i = \rho e_{i-1} + r_i$ $(i = 2, ..., N)$

leads to

$f(x) = \sum_{j=2}^N \alpha_{j-1} K_{eq}(x_j, x) + b$

with “equivalent kernel”

$K_{eq}(x_j, x_i) = K(x_j, x_i) - \rho K(x_{j-1}, x_i)$

where K(xj, xi) = ϕ(xj)Tϕ(xi) [Espinoza et al., 2006].
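Given an ordinary kernel matrix between the time-ordered training points and some evaluation points, the equivalent kernel is a difference of consecutive rows. The sketch below shows this construction and the resulting predictor; the values of α and b are placeholders chosen for illustration (estimating them requires the full dual system, which the slide does not give), and the function names are hypothetical.

```python
import numpy as np

def equivalent_kernel(K, rho):
    """Rows j = 2..N of K_eq(x_j, .) = K(x_j, .) - rho K(x_{j-1}, .).

    K[j, i] holds K(x_j, x_i), with rows over the training points in time order.
    """
    return K[1:, :] - rho * K[:-1, :]

def predict(K_eq, alpha, b):
    """f(x) = sum_{j=2}^N alpha_{j-1} K_eq(x_j, x) + b for each evaluation point."""
    return alpha @ K_eq + b

# Toy example with a linear kernel on three scalar training points.
x = np.array([[1.0], [2.0], [3.0]])
K = x @ x.T

K_eq = equivalent_kernel(K, rho=0.5)
f = predict(K_eq, alpha=np.array([1.0, -1.0]), b=0.5)  # placeholder alpha and b
```

For ρ = 0 the construction reduces to the original kernel rows, which corresponds to the case of uncorrelated errors.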

[Figure: modular definition of the model structure: LS-SVM regression combined with partially linear structure, autocorrelated residuals and imposing symmetry]


Nonlinear dynamical systems control

[Figure: controlled system with input u and variables xp and θ]

min (control objective + LS-SVM objective)

subject to system dynamics (time k = 1, 2, ..., N) and LS-SVM controller (time k = 1, 2, ..., N)

Merging optimal control and support vector machine optimization problems:

Approximate solutions to optimal control problems [Suykens et al., NN 2001]


Conclusions and future challenges

• Integrative understanding and systematic design for supervised, semi-supervised, unsupervised learning and beyond

• Kernel methods: complementary views (LS-)SVM, RKHS, GP

• Towards a global understanding of science as a whole: kernel methods as a role model?

• Bridging gaps between fundamental theory, algorithms and applications

• Reliable methods: numerically, computationally, statistically

• Realizing the ultimate software tool: optimal model representations and solutions by symbolic computation?


Acknowledgements

• Current and former colleagues:

C. Alzate, L. Ameye, T. De Bie, J. De Brabanter, B. De Moor, A. Devos, M. Espinoza, F. De Smet, I. Goethals, B. Hamers, F. Janssens, L. Hoegaerts, P. Karsmakers, G. Lanckriet, C. Lu, L. Lukas, J. Luts, K. Marchal, Y. Moreau, F. Ojeda, K. Pelckmans, N. Pochet, B. Van Calster, V. Van Belle, R. Van de Plas, T. Van Gestel, L. Vanhamme, S. Van Huffel, J. Vandewalle

• Many people for joint work, discussions, invitations, joint organization of meetings.

• Support from K.U. Leuven, GOA-Ambiorics, COE Optimization in Engineering, IAP V, FWO projects, IWT.


Books

• Boyd S., Vandenberghe L., Convex Optimization, Cambridge University Press, 2004.

• Chapelle O., Schölkopf B., Zien A. (Eds.), Semi-Supervised Learning, MIT Press, Cambridge, MA, (in press) 2006.

• Cristianini N., Shawe-Taylor J., An Introduction to Support Vector Machines, Cambridge University Press, 2000.

• Rasmussen C.E., Williams C.K.I., Gaussian Processes for Machine Learning, MIT Press, Cambridge, MA, 2006.

• Schölkopf B., Smola A., Learning with Kernels, MIT Press, Cambridge, MA, 2002.

• Schölkopf B., Tsuda K., Vert J.P. (Eds.), Kernel Methods in Computational Biology 400, MIT Press, Cambridge, MA, 2004.

• Shawe-Taylor J., Cristianini N., Kernel Methods for Pattern Analysis, Cambridge University Press, 2004.

• Suykens J.A.K., Van Gestel T., De Brabanter J., De Moor B., Vandewalle J., Least Squares Support Vector Machines, World Scientific, Singapore, 2002.

• Suykens J.A.K., Horvath G., Basu S., Micchelli C., Vandewalle J. (Eds.), Advances in Learning Theory: Methods, Models and Applications, vol. 190 of NATO-ASI Series III: Computer and Systems Sciences, IOS Press, Amsterdam, The Netherlands, 2003.

• Vapnik V., Statistical Learning Theory, John Wiley & Sons, New York, 1998.

• Wahba G., Spline Models for Observational Data, Series in Applied Mathematics, 59, SIAM, Philadelphia, 1990.

