
Engineering Kernel Machines

Johan Suykens

K.U. Leuven, ESAT-SCD/SISTA, Kasteelpark Arenberg 10

B-3001 Leuven (Heverlee), Belgium

Tel: 32/16/32 18 02 - Fax: 32/16/32 19 70, Email: [email protected]

http://www.esat.kuleuven.be/scd/

Current Challenges in Kernel Methods (CCKM 2006), Brussels, Belgium, Nov. 2006

CCKM 2006 ⋄ Johan Suykens

Data world and the evolution of science

[Diagram: evolution of science over time — Mathematics, Statistics, Optimization theory, Systems theory, Machine learning, ... ? — with technologies serving society and applications]

“The Paradox of the Sciences”: Science is full of fragmentation while tackling important open problems requires interdisciplinary approaches.

CCKM 2006 ⋄ Johan Suykens 1

Kernel based learning: interdisciplinary challenges

[Diagram: SVM & kernel methods at the crossroads of linear algebra, mathematics, statistics, systems and control theory, signal processing, optimization, machine learning, pattern recognition, data mining and neural networks]

• Understanding the essential concepts and different facets of problems

• Providing systematic approaches, engineering kernel machines

• Integrative design, bridging the gaps between theory and practice

CCKM 2006 ⋄ Johan Suykens 2

Contents

• Motivation

• Support vector machines, use of the “kernel trick”

• Core problems: least squares support vector machines

• Primal and dual representations

• Robustness

• Kernel PCA, spectral clustering, kernel CCA

• Sparseness, feature selection, relevance determination

• Prior knowledge incorporation, convex optimization

CCKM 2006 ⋄ Johan Suykens 3


Classical MLPs

[Diagram: multilayer perceptron with inputs x1, x2, x3, ..., xn, constant input 1, weights w1, w2, w3, ..., wn, bias b, hidden activations h(·) and output y]

Multilayer Perceptron (MLP) properties:

• Universal approximation of continuous nonlinear functions

• Learning from input-output patterns: off-line/on-line

• Parallel network architecture, multiple inputs and outputs

+ Flexible and widely applicable:

Feedforward/recurrent networks, supervised/unsupervised learning

- Many local minima, trial and error for determining number of neurons

CCKM 2006 ⋄ Johan Suykens 4

Support Vector Machines

[Figure: cost function versus weights — MLP: many local minima; SVM: convex cost with a unique minimum]

• Nonlinear classification and function estimation by convex optimization with a unique solution and primal-dual interpretations.

• Number of neurons automatically follows from a convex program.

• Learning and generalization in high dimensional input spaces (coping with the curse of dimensionality).

• Use of kernels (e.g. linear, polynomial, RBF, MLP, splines, kernels from graphical models, ...), application-specific kernels (e.g. bioinformatics)

CCKM 2006 ⋄ Johan Suykens 5

Classifier with maximal margin

• Training set $\{(x_i, y_i)\}_{i=1}^N$: inputs $x_i \in \mathbb{R}^n$; class labels $y_i \in \{-1, +1\}$

• Classifier: $y(x) = \mathrm{sign}[w^T \varphi(x) + b]$

with $\varphi(\cdot): \mathbb{R}^n \to \mathbb{R}^{n_h}$ the mapping to a high dimensional feature space (which can be infinite dimensional!)

• Maximize the margin for good generalization ability (margin $= 2/\|w\|_2$)

(VC theory: the linear SVM classifier dates back to the sixties)

[Figure: two clouds of points (x, o) illustrating the maximal margin classifier]

CCKM 2006 ⋄ Johan Suykens 6

SVM classifier: primal and dual problem

• Primal problem: [Vapnik, 1995]

$$\min_{w,b,\xi}\; \mathcal{J}(w,\xi) = \frac{1}{2} w^T w + c \sum_{i=1}^{N} \xi_i \quad \text{s.t.} \quad \begin{cases} y_i [w^T \varphi(x_i) + b] \ge 1 - \xi_i \\ \xi_i \ge 0, \; i = 1, \dots, N \end{cases}$$

Trade-off between margin maximization and tolerating misclassifications

• Conditions for optimality from the Lagrangian. Express the solution in the Lagrange multipliers.

• Dual problem: QP problem (convex problem)

$$\max_{\alpha}\; Q(\alpha) = -\frac{1}{2} \sum_{i,j=1}^{N} y_i y_j K(x_i, x_j)\, \alpha_i \alpha_j + \sum_{j=1}^{N} \alpha_j \quad \text{s.t.} \quad \begin{cases} \sum_{i=1}^{N} \alpha_i y_i = 0 \\ 0 \le \alpha_i \le c, \; \forall i \end{cases}$$
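As an illustration of how this dual QP can be solved in practice, here is a minimal sketch (not part of the original slides) using numpy and the cvxopt QP solver; the RBF kernel, the value of c and all function names are assumptions for illustration only.

```python
# Sketch: train an SVM classifier by solving the dual QP with a generic QP solver.
import numpy as np
from cvxopt import matrix, solvers

def rbf_kernel(X1, X2, sigma=1.0):
    # K(x, z) = exp(-||x - z||^2 / sigma^2)
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2.0 * X1 @ X2.T
    return np.exp(-d2 / sigma**2)

def svm_dual_train(X, y, c=1.0, sigma=1.0):
    """max Q(a) = -1/2 sum_ij y_i y_j K(x_i, x_j) a_i a_j + sum_j a_j
       s.t. sum_i a_i y_i = 0 and 0 <= a_i <= c, rewritten as a cvxopt minimization."""
    y = y.astype(float)
    N = X.shape[0]
    K = rbf_kernel(X, X, sigma)
    P = matrix(np.outer(y, y) * K)                      # quadratic term
    q = matrix(-np.ones(N))                             # linear term (sign flipped for min)
    G = matrix(np.vstack([-np.eye(N), np.eye(N)]))      # box constraints 0 <= a_i <= c
    h = matrix(np.hstack([np.zeros(N), c * np.ones(N)]))
    A = matrix(y.reshape(1, -1))                        # equality constraint sum_i a_i y_i = 0
    b = matrix(0.0)
    alpha = np.array(solvers.qp(P, q, G, h, A, b)['x']).ravel()
    sv = (alpha > 1e-6) & (alpha < c - 1e-6)            # unbounded support vectors
    bias = np.mean(y[sv] - (alpha * y) @ K[:, sv])      # bias from the KKT conditions
    return alpha, bias

def svm_predict(X, y, alpha, bias, Xtest, sigma=1.0):
    # y(x) = sign[ sum_i alpha_i y_i K(x, x_i) + bias ]
    return np.sign(rbf_kernel(Xtest, X, sigma) @ (alpha * y.astype(float)) + bias)
```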

CCKM 2006 ⋄ Johan Suykens 7


Obtaining solution via Lagrangian

• Lagrangian:

$$\mathcal{L}(w, b, \xi; \alpha, \nu) = \mathcal{J}(w, \xi) - \sum_{i=1}^{N} \alpha_i \left\{ y_i [w^T \varphi(x_i) + b] - 1 + \xi_i \right\} - \sum_{i=1}^{N} \nu_i \xi_i$$

• Find the saddle point: $\max_{\alpha,\nu} \min_{w,b,\xi} \mathcal{L}(w, b, \xi; \alpha, \nu)$; one obtains

$$\frac{\partial \mathcal{L}}{\partial w} = 0 \;\rightarrow\; w = \sum_{i=1}^{N} \alpha_i y_i \varphi(x_i), \qquad \frac{\partial \mathcal{L}}{\partial b} = 0 \;\rightarrow\; \sum_{i=1}^{N} \alpha_i y_i = 0, \qquad \frac{\partial \mathcal{L}}{\partial \xi_i} = 0 \;\rightarrow\; 0 \le \alpha_i \le c, \; i = 1, \dots, N$$

Finally, write the solution in terms of the Lagrange multipliers $\alpha$.

CCKM 2006 ⋄ Johan Suykens 8

SVM classifier model representations

• Classifier: primal representation

$y(x) = \mathrm{sign}[w^T \varphi(x) + b]$

Kernel trick (Mercer theorem): $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$

• Dual representation:

$y(x) = \mathrm{sign}\left[\sum_i \alpha_i y_i K(x, x_i) + b\right]$

Some possible kernels $K(\cdot, \cdot)$:

$K(x, x_i) = x_i^T x$ (linear)
$K(x, x_i) = (x_i^T x + \tau)^d$ (polynomial)
$K(x, x_i) = \exp(-\|x - x_i\|_2^2 / \sigma^2)$ (RBF)
$K(x, x_i) = \tanh(\kappa\, x_i^T x + \theta)$ (MLP)

• Sparseness property (many $\alpha_i = 0$)

[Figure: two-class toy problem in the $(x_{(1)}, x_{(2)})$ plane illustrating the sparse set of support vectors]

CCKM 2006 ⋄ Johan Suykens 9

SVMs: living in two worlds ...

[Figure: two-class data (x, o) in the input space mapped by $\varphi(x)$ to the feature space]

Primal space ($\rightarrow$ large data sets): parametric, estimate $w \in \mathbb{R}^{n_h}$

$y(x) = \mathrm{sign}[w^T \varphi(x) + b]$

Dual space ($\rightarrow$ high dimensional inputs): non-parametric, estimate $\alpha \in \mathbb{R}^N$

$y(x) = \mathrm{sign}\left[\sum_{i=1}^{\#sv} \alpha_i y_i K(x, x_i) + b\right]$

$K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$ (“kernel trick”)

[Diagram: primal network with features $\varphi_1(x), \dots, \varphi_{n_h}(x)$ and weights $w_1, \dots, w_{n_h}$; dual network with kernels $K(x, x_1), \dots, K(x, x_{\#sv})$ and weights $\alpha_1, \dots, \alpha_{\#sv}$, both producing $y(x)$]

CCKM 2006 ⋄ Johan Suykens 10

Reproducing Kernel Hilbert Space (RKHS) view

• Variational problem: [Wahba, 1990; Poggio & Girosi, 1990; Evgeniou et al., 2000]

find the function $f$ such that

$$\min_{f \in \mathcal{H}} \; \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i)) + \lambda \|f\|_K^2$$

with $L(\cdot,\cdot)$ the loss function and $\|f\|_K$ the norm in the RKHS $\mathcal{H}$ defined by $K$.

• Representer theorem: for any convex loss function the solution is of the form

$$f(x) = \sum_{i=1}^{N} \alpha_i K(x, x_i)$$

• Some special cases:

$L(y, f(x)) = (y - f(x))^2$: regularization network
$L(y, f(x)) = |y - f(x)|_\epsilon$: SVM regression with $\epsilon$-insensitive loss function

[Figure: $\epsilon$-insensitive loss, flat on $[-\epsilon, +\epsilon]$]

CCKM 2006 ⋄ Johan Suykens 11


Different views on kernel machines

[Diagram: SVM, LS-SVM, Gaussian processes, RKHS and kriging as overlapping views on kernel machines]

Some early history on RKHS:

1910-1920: Moore

1940: Aronszajn

1951: Krige

1970: Parzen

1971: Kimeldorf & Wahba

Obtaining complementary insights from different perspectives: kernels are used in different settings (try to get the big picture)

Support vector machines (SVM): optimization approach (primal/dual)
Reproducing kernel Hilbert spaces (RKHS): variational problem, functional analysis
Gaussian processes (GP): probabilistic/Bayesian approach

CCKM 2006 ⋄ Johan Suykens 12

Wider use of the kernel trick

• Angle between vectors (e.g. correlation analysis):

Input space: $\cos\theta_{xz} = \dfrac{x^T z}{\|x\|_2 \|z\|_2}$

Feature space: $\cos\theta_{\varphi(x),\varphi(z)} = \dfrac{\varphi(x)^T \varphi(z)}{\|\varphi(x)\|_2 \|\varphi(z)\|_2} = \dfrac{K(x, z)}{\sqrt{K(x, x)}\sqrt{K(z, z)}}$

• Distance between vectors (e.g. for “kernelized” clustering methods):

Input space: $\|x - z\|_2^2 = (x - z)^T (x - z) = x^T x + z^T z - 2 x^T z$

Feature space: $\|\varphi(x) - \varphi(z)\|_2^2 = K(x, x) + K(z, z) - 2 K(x, z)$
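A small numerical illustration of these two identities is sketched below (not from the slides; numpy and an RBF kernel are assumed, names are illustrative).

```python
# Sketch: feature-space angle and distance evaluated through the kernel only.
import numpy as np

def rbf(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z)**2) / sigma**2)

def feature_space_cosine(x, z, k=rbf):
    # cos(theta) = K(x, z) / (sqrt(K(x, x)) * sqrt(K(z, z)))
    return k(x, z) / (np.sqrt(k(x, x)) * np.sqrt(k(z, z)))

def feature_space_sq_distance(x, z, k=rbf):
    # ||phi(x) - phi(z)||^2 = K(x, x) + K(z, z) - 2 K(x, z)
    return k(x, x) + k(z, z) - 2.0 * k(x, z)

x, z = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(feature_space_cosine(x, z), feature_space_sq_distance(x, z))
```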

CCKM 2006 ⋄ Johan Suykens 13

Least Squares Support Vector Machines: “core problems”

• Regression (RR)

$$\min_{w,b,e}\; w^T w + \gamma \sum_i e_i^2 \quad \text{s.t.} \quad y_i = w^T \varphi(x_i) + b + e_i, \;\forall i$$

• Classification (FDA)

$$\min_{w,b,e}\; w^T w + \gamma \sum_i e_i^2 \quad \text{s.t.} \quad y_i (w^T \varphi(x_i) + b) = 1 - e_i, \;\forall i$$

• Principal component analysis (PCA)

$$\min_{w,b,e}\; -w^T w + \gamma \sum_i e_i^2 \quad \text{s.t.} \quad e_i = w^T \varphi(x_i) + b, \;\forall i$$

• Canonical correlation analysis/partial least squares (CCA/PLS)

$$\min_{w,v,b,d,e,r}\; w^T w + v^T v + \nu_1 \sum_i e_i^2 + \nu_2 \sum_i r_i^2 - \gamma \sum_i e_i r_i \quad \text{s.t.} \quad \begin{cases} e_i = w^T \varphi_1(x_i) + b \\ r_i = v^T \varphi_2(y_i) + d \end{cases}$$

• partially linear models, spectral clustering, subspace algorithms, ...

CCKM 2006 ⋄ Johan Suykens 14

LS-SVM classifier

• Preserve support vector machine methodology, but simplify via least squares and equality constraints [Suykens, 1999]

• Primal problem:

$$\min_{w,b,e}\; \mathcal{J}(w, e) = \frac{1}{2} w^T w + \gamma \frac{1}{2} \sum_{i=1}^{N} e_i^2 \quad \text{s.t.} \quad y_i [w^T \varphi(x_i) + b] = 1 - e_i, \;\forall i$$

• Dual problem:

$$\begin{bmatrix} 0 & y^T \\ y & \Omega + I/\gamma \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ 1_N \end{bmatrix}$$

where $\Omega_{ij} = y_i y_j\, \varphi(x_i)^T \varphi(x_j) = y_i y_j\, K(x_i, x_j)$ and $y = [y_1; \dots; y_N]$.
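Because the dual is a single square linear system in $(b, \alpha)$, training reduces to one call to a linear solver. The sketch below (not from the slides; numpy and an RBF kernel assumed, names illustrative) builds and solves the system above.

```python
# Sketch: LS-SVM classifier training = solving one linear system in (b, alpha).
import numpy as np

def rbf_kernel(X1, X2, sigma=1.0):
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2.0 * X1 @ X2.T
    return np.exp(-d2 / sigma**2)

def lssvm_train(X, y, gamma=1.0, sigma=1.0):
    y = y.astype(float)
    N = X.shape[0]
    Omega = np.outer(y, y) * rbf_kernel(X, X, sigma)
    # [[0, y^T], [y, Omega + I/gamma]] [b; alpha] = [0; 1_N]
    A = np.zeros((N + 1, N + 1))
    A[0, 1:], A[1:, 0] = y, y
    A[1:, 1:] = Omega + np.eye(N) / gamma
    rhs = np.concatenate(([0.0], np.ones(N)))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]                       # b, alpha

def lssvm_predict(X, y, b, alpha, Xtest, sigma=1.0):
    # y(x) = sign[ sum_i alpha_i y_i K(x, x_i) + b ]
    return np.sign(rbf_kernel(Xtest, X, sigma) @ (alpha * y.astype(float)) + b)
```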

• LS-SVM classifiers perform very well on 20 UCI data sets [Van Gestel et al., ML 2004]

Winning results in the WCCI 2006 competition by [Cawley, 2006]

CCKM 2006 ⋄ Johan Suykens 15


Obtaining solution from Lagrangian

• Lagrangian:

$$\mathcal{L}(w, b, e; \alpha) = \mathcal{J}(w, e) - \sum_{i=1}^{N} \alpha_i \left\{ y_i [w^T \varphi(x_i) + b] - 1 + e_i \right\}$$

with Lagrange multipliers $\alpha_i$ (support values).

• Conditions for optimality:

$$\begin{cases} \dfrac{\partial \mathcal{L}}{\partial w} = 0 \;\rightarrow\; w = \sum_{i=1}^{N} \alpha_i y_i \varphi(x_i) \\[1ex] \dfrac{\partial \mathcal{L}}{\partial b} = 0 \;\rightarrow\; \sum_{i=1}^{N} \alpha_i y_i = 0 \\[1ex] \dfrac{\partial \mathcal{L}}{\partial e_i} = 0 \;\rightarrow\; \alpha_i = \gamma e_i, \; i = 1, \dots, N \\[1ex] \dfrac{\partial \mathcal{L}}{\partial \alpha_i} = 0 \;\rightarrow\; y_i [w^T \varphi(x_i) + b] - 1 + e_i = 0, \; i = 1, \dots, N \end{cases}$$

Eliminate $w, e$ and write the solution in $\alpha, b$.

CCKM 2006 ⋄ Johan Suykens 16

Microarray data analysis

[Diagram: compared approaches — FDA; LS-SVM classifier (linear, RBF); kernel PCA + FDA with unsupervised or supervised selection of principal components]

Use regularization for linear classifiers

Systematic benchmarking study in [Pochet et al., Bioinformatics 2004]

Webservice: http://www.esat.kuleuven.ac.be/MACBETH/

Efficient computational methods for feature selection by rank-one updates

[Ojeda et al., 2006]

CCKM 2006 ⋄ Johan Suykens 17

Weighted versions and robustness

[Diagram: SVM — convex cost function, solved by convex optimization; weighted LS-SVM — weighted version with modified cost function, weights from robust statistics]

• Weighted LS-SVM:

$$\min_{w,b,e}\; \frac{1}{2} w^T w + \gamma \frac{1}{2} \sum_{i=1}^{N} v_i e_i^2 \quad \text{s.t.} \quad y_i = w^T \varphi(x_i) + b + e_i, \;\forall i$$

with $v_i$ determined from $\{e_i\}_{i=1}^{N}$ of the unweighted LS-SVM [Suykens et al., 2002].

Robustness and stability of reweighted kernel based regression [Debruyne et al., 2006].

• SVM solution by applying iteratively weighted LS [Perez-Cruz et al., 2005]

CCKM 2006 ⋄ Johan Suykens 18

Kernel principal component analysis (KPCA)

[Figure: toy data set — linear PCA (left) versus kernel PCA with RBF kernel (right)]

Kernel PCA [Scholkopf et al., 1998]: take the eigenvalue decomposition of the kernel matrix

$$\begin{bmatrix} K(x_1, x_1) & \cdots & K(x_1, x_N) \\ \vdots & & \vdots \\ K(x_N, x_1) & \cdots & K(x_N, x_N) \end{bmatrix}$$

(applications in dimensionality reduction and denoising)

Where is the regularization ?

CCKM 2006 ⋄ Johan Suykens 19


Kernel PCA: primal and dual problem

• Underlying primal problem with regularization term [Suykens et al., 2003]

• Primal problem:

$$\min_{w,b,e}\; -\frac{1}{2} w^T w + \gamma \frac{1}{2} \sum_{i=1}^{N} e_i^2 \quad \text{s.t.} \quad e_i = w^T \varphi(x_i) + b, \; i = 1, \dots, N$$

(or alternatively $\min \frac{1}{2} w^T w - \frac{1}{2\gamma} \sum_{i=1}^{N} e_i^2$)

• Dual problem = kernel PCA:

$$\Omega_c\, \alpha = \lambda \alpha \quad \text{with} \quad \lambda = 1/\gamma$$

with $\Omega_{c,ij} = (\varphi(x_i) - \mu_\varphi)^T (\varphi(x_j) - \mu_\varphi)$ the centered kernel matrix.

• Score variables (allowing also out-of-sample extensions):

$$z(x) = w^T (\varphi(x) - \mu_\varphi) = \sum_j \alpha_j \Big( K(x_j, x) - \tfrac{1}{N} \textstyle\sum_r K(x_r, x) - \tfrac{1}{N} \sum_r K(x_r, x_j) + \tfrac{1}{N^2} \sum_{r,s} K(x_r, x_s) \Big)$$
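In matrix form the centered kernel matrix follows from double centering of the Gram matrix, and the eigenvalue problem above can then be written down directly; a minimal sketch (not from the slides; numpy and an RBF kernel assumed, normalization of $\alpha$ omitted):

```python
# Sketch: kernel PCA as an eigenvalue problem on the centered kernel matrix.
import numpy as np

def rbf_kernel(X1, X2, sigma=1.0):
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2.0 * X1 @ X2.T
    return np.exp(-d2 / sigma**2)

def kernel_pca(X, n_components=2, sigma=1.0):
    N = X.shape[0]
    K = rbf_kernel(X, X, sigma)
    C = np.eye(N) - np.ones((N, N)) / N               # centering matrix
    Omega_c = C @ K @ C                                # centered kernel matrix
    lam, alpha = np.linalg.eigh(Omega_c)               # Omega_c alpha = lambda alpha
    order = np.argsort(lam)[::-1][:n_components]
    return lam[order], alpha[:, order]

def kpca_scores(X, Xtest, alpha, sigma=1.0):
    # out-of-sample score variables z(x) via the double-centered kernel evaluations
    K = rbf_kernel(X, X, sigma)
    Kt = rbf_kernel(Xtest, X, sigma)
    Kt_c = Kt - Kt.mean(1, keepdims=True) - K.mean(0)[None, :] + K.mean()
    return Kt_c @ alpha
```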

CCKM 2006 ⋄ Johan Suykens 20

Generalizations to Kernel PCA: other loss functions

• Consider a general loss function $L$:

$$\min_{w,b,e}\; -\frac{1}{2} w^T w + \gamma \frac{1}{2} \sum_{i=1}^{N} L(e_i) \quad \text{s.t.} \quad e_i = w^T \varphi(x_i) + b, \; i = 1, \dots, N$$

Generalizations of KPCA that lead to robustness and sparseness, e.g. the Vapnik $\epsilon$-insensitive loss and the Huber loss function [Alzate & Suykens, 2006].

• Weighted least squares versions and incorporation of constraints:

$$\min_{w,b,e}\; -\frac{1}{2} w^T w + \gamma \frac{1}{2} \sum_{i=1}^{N} v_i e_i^2 \quad \text{s.t.} \quad \begin{cases} e_i = w^T \varphi(x_i) + b, \; i = 1, \dots, N \\ \sum_{i=1}^{N} e_i e_i^{(1)} = 0 \\ \quad\vdots \\ \sum_{i=1}^{N} e_i e_i^{(i-1)} = 0 \end{cases}$$

Find the $i$-th PC w.r.t. $i-1$ orthogonality constraints (previous PCs $e_i^{(j)}$).

The solution is given by a generalized eigenvalue problem.

CCKM 2006 ⋄ Johan Suykens 21

Generalizations to Kernel PCA: robust denoising

[Figure: classical kernel PCA versus robust method]

First rows: test set images, corrupted with Gaussian noise and outliers.
Next rows: application of different pre-image algorithms.
Robust method: fewer components needed and improved results.

CCKM 2006 ⋄ Johan Suykens 22

Generalizations to Kernel PCA: sparseness


Sparse kernel PCA using the $\epsilon$-insensitive loss [Alzate & Suykens, 2006] (top figure: denoising; bottom figures: different support vectors (in black) per principal component vector)

CCKM 2006 ⋄ Johan Suykens 23


Spectral clustering: weighted KPCA

• Spectral graph clustering [Chung, 1997; Shi & Malik, 2000; Ng et al., 2002]

• Normalized cut problem: $Lq = \lambda D q$ with $L = D - W$ the Laplacian of the graph. Cluster membership indicators are given by $q$.

• KPCA / weighted LS-SVM formulation of the normalized cut:

$$\min_{w,b,e}\; -\frac{1}{2} w^T w + \gamma \frac{1}{2} e^T V e \quad \text{such that} \quad e_i = w^T \varphi(x_i) + b, \;\forall i = 1, \dots, N$$

with $V = D^{-1}$ the inverse degree matrix [Alzate & Suykens, 2006]. Allows for out-of-sample extensions on test data.

[Figure: small graph with nodes 1-6, comparing a cut of size 2 with the minimal cut of size 1]
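For the plain normalized cut, the generalized eigenvalue problem $Lq = \lambda D q$ can be solved directly once an affinity matrix W is given; the sketch below (not from the slides; scipy assumed, bipartitioning by the sign of the second eigenvector) is one common way to do this. The weighted-KPCA formulation above plays the same role but additionally provides out-of-sample extensions through its primal model.

```python
# Sketch: normalized cut bipartitioning via the generalized eigenvalue problem L q = lambda D q.
import numpy as np
from scipy.linalg import eigh

def normalized_cut_bipartition(W):
    """W: symmetric affinity matrix of the graph (positive node degrees assumed)."""
    d = W.sum(axis=1)
    D = np.diag(d)
    L = D - W                            # graph Laplacian
    lam, Q = eigh(L, D)                  # generalized eigenvectors, ascending eigenvalues
    q = Q[:, 1]                          # second smallest eigenvector: cluster indicator
    return (q > 0).astype(int), q
```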

CCKM 2006 ⋄ Johan Suykens 24

Application to image segmentation

[Figure: given image (240 × 160) and the resulting image segmentation]

Large scale image: out-of-sample extension [Alzate & Suykens, 2006]

CCKM 2006 ⋄ Johan Suykens 25

Primal versus dual problems

Example 1: microarray data (10,000 genes & 50 training data)

Classifier model: $\mathrm{sign}(w^T x + b)$ (primal) or $\mathrm{sign}(\sum_i \alpha_i y_i x_i^T x + b)$ (dual)

primal: $w \in \mathbb{R}^{10{,}000}$ (only 50 training data!)

dual: $\alpha \in \mathbb{R}^{50}$

Example 2: data mining problem (1,000,000 training data & 20 inputs)

primal: $w \in \mathbb{R}^{20}$

dual: $\alpha \in \mathbb{R}^{1{,}000{,}000}$ (kernel matrix: 1,000,000 × 1,000,000 !)

CCKM 2006 ⋄ Johan Suykens 26

Fixed-size LS-SVM: primal-dual kernel machines

[Diagram: fixed-size LS-SVM — primal and dual space linked via the Nyström method, kernel PCA, density estimates, entropy criteria and eigenfunctions; SV selection and regression in the primal]

Link Nyström approximation (GP) - kernel PCA - density estimation [Girolami, 2002; Williams & Seeger, 2001]

Modelling in view of primal-dual representations [Suykens et al., 2002]: primal space estimation, sparse, large scale
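One way to realize such a primal-space estimation is sketched below (not from the slides): a Nyström-style approximate feature map built from kernel PCA on a selected subset, followed by regularized least squares in the primal. The subset selection (e.g. by an entropy criterion) is assumed to be done elsewhere, and for brevity the bias term is regularized together with the weights.

```python
# Sketch: Nystrom approximate feature map on a subset, then estimation in the primal.
import numpy as np

def rbf_kernel(X1, X2, sigma=1.0):
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2.0 * X1 @ X2.T
    return np.exp(-d2 / sigma**2)

def nystrom_feature_map(X_sub, sigma=1.0, eps=1e-10):
    lam, U = np.linalg.eigh(rbf_kernel(X_sub, X_sub, sigma))
    keep = lam > eps
    lam, U = lam[keep], U[:, keep]
    def phi(X):
        # phi_hat(x) = diag(1/sqrt(lam)) U^T [K(x_1, x), ..., K(x_M, x)]^T
        return rbf_kernel(X, X_sub, sigma) @ U / np.sqrt(lam)
    return phi

def fixed_size_regression(X, y, X_sub, gamma=1.0, sigma=1.0):
    phi = nystrom_feature_map(X_sub, sigma)
    P = np.hstack([phi(X), np.ones((X.shape[0], 1))])     # features plus bias column
    theta = np.linalg.solve(P.T @ P + np.eye(P.shape[1]) / gamma, P.T @ y)
    return lambda Xt: np.hstack([phi(Xt), np.ones((Xt.shape[0], 1))]) @ theta
```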

CCKM 2006 ⋄ Johan Suykens 27


Fixed-size LS-SVM: toy examples


Sparse representations with estimation in primal space

CCKM 2006 ⋄ Johan Suykens 28

Large scale problems: Fixed-size LS-SVM

Estimate in primal (approximate feature map from KPCA on subset). Santa Fe laser data:

[Figure: Santa Fe laser data $y_k$ versus discrete time $k$ (training series) and iterative prediction versus the true continuation]

Training: $y_{k+1} = f(y_k, y_{k-1}, \dots, y_{k-p})$. Iterative prediction: $\hat{y}_{k+1} = f(\hat{y}_k, \hat{y}_{k-1}, \dots, \hat{y}_{k-p})$ (works well for large $p$, e.g. $p = 50$) [Espinoza et al., 2003]

CCKM 2006 ⋄ Johan Suykens 29

Partially linear models: nonlinear system identification

[Figure: input signal; output signal with test / training / validation parts; residuals versus discrete time index for the two models]

Silver box benchmark study (physical system with cubic nonlinearity): (top right) full black-box model, (bottom right) partially linear model

Related application: power load forecasting [Espinoza et al., 2005]

CCKM 2006 ⋄ Johan Suykens 30

Kernel Canonical Correlation Analysis

[Diagram: data in space X and space Y mapped by feature maps $\varphi_1(\cdot)$ and $\varphi_2(\cdot)$ to feature spaces and projected onto target spaces]

Score variables: $z_x = w^T \varphi_1(x)$, $z_y = v^T \varphi_2(y)$

Correlation: $\min_{w,v} \sum_i \|z_{x_i} - z_{y_i}\|_2^2$

Applications of kernel CCA [Suykens et al., 2002, Bach & Jordan, 2002] e.g. in:

- bioinformatics (correlation gene network - gene expression profiles) [Vert et al., 2003]

- information retrieval, fMRI [Shawe-Taylor et al., 2004]

- state estimation of dynamical systems, subspace algorithms [Goethals et al., 2005]

CCKM 2006 ⋄ Johan Suykens 31


LS-SVM formulation to Kernel CCA

• Score variables: $z_x = w^T (\varphi_1(x) - \mu_{\varphi_1})$, $z_y = v^T (\varphi_2(y) - \mu_{\varphi_2})$

Feature maps $\varphi_1, \varphi_2$; kernels $K_1(x_i, x_j) = \varphi_1(x_i)^T \varphi_1(x_j)$, $K_2(y_i, y_j) = \varphi_2(y_i)^T \varphi_2(y_j)$

• Primal problem (kernel PLS case: $\nu_1 = 0$, $\nu_2 = 0$ [Hoegaerts et al., 2004]):

$$\max_{w,v,e,r}\; \gamma \sum_{i=1}^{N} e_i r_i - \nu_1 \frac{1}{2} \sum_{i=1}^{N} e_i^2 - \nu_2 \frac{1}{2} \sum_{i=1}^{N} r_i^2 - \frac{1}{2} w^T w - \frac{1}{2} v^T v$$

such that $e_i = w^T (\varphi_1(x_i) - \mu_{\varphi_1})$, $r_i = v^T (\varphi_2(y_i) - \mu_{\varphi_2})$, $\forall i$

with $\mu_{\varphi_1} = (1/N) \sum_{i=1}^{N} \varphi_1(x_i)$, $\mu_{\varphi_2} = (1/N) \sum_{i=1}^{N} \varphi_2(y_i)$.

• Dual problem: generalized eigenvalue problem [Suykens et al., 2002]

$$\begin{bmatrix} 0 & \Omega_{c,2} \\ \Omega_{c,1} & 0 \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix} = \lambda \begin{bmatrix} \nu_1 \Omega_{c,1} + I & 0 \\ 0 & \nu_2 \Omega_{c,2} + I \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix}, \quad \lambda = 1/\gamma$$

with $\Omega_{c,1,ij} = (\varphi_1(x_i) - \mu_{\varphi_1})^T (\varphi_1(x_j) - \mu_{\varphi_1})$ and $\Omega_{c,2,ij} = (\varphi_2(y_i) - \mu_{\varphi_2})^T (\varphi_2(y_j) - \mu_{\varphi_2})$.
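Numerically this is one generalized eigenvalue problem on the two centered Gram matrices; a minimal sketch (not from the slides; numpy/scipy assumed, with the kernel matrices of the two views given):

```python
# Sketch: LS-SVM kernel CCA dual as a generalized eigenvalue problem.
import numpy as np
from scipy.linalg import eig

def centered_gram(K):
    N = K.shape[0]
    C = np.eye(N) - np.ones((N, N)) / N
    return C @ K @ C

def lssvm_kcca(K1, K2, nu1=1.0, nu2=1.0):
    """K1, K2: Gram matrices of the two views; returns eigenvalues and dual vectors."""
    N = K1.shape[0]
    Oc1, Oc2 = centered_gram(K1), centered_gram(K2)
    Z = np.zeros((N, N))
    A = np.block([[Z, Oc2], [Oc1, Z]])
    B = np.block([[nu1 * Oc1 + np.eye(N), Z], [Z, nu2 * Oc2 + np.eye(N)]])
    lam, V = eig(A, B)                       # A [alpha; beta] = lambda B [alpha; beta]
    order = np.argsort(-lam.real)
    return lam.real[order], V[:N, order].real, V[N:, order].real   # lambda, alpha, beta
```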

CCKM 2006 ⋄ Johan Suykens 32

Obtaining solution from Lagrangian

• Lagrangian:

$$\mathcal{L}(w, v, e, r; \alpha, \beta) = \gamma \sum_{i=1}^{N} e_i r_i - \nu_1 \frac{1}{2} \sum_{i=1}^{N} e_i^2 - \nu_2 \frac{1}{2} \sum_{i=1}^{N} r_i^2 - \frac{1}{2} w^T w - \frac{1}{2} v^T v - \sum_{i=1}^{N} \alpha_i [e_i - w^T (\varphi_1(x_i) - \mu_{\varphi_1})] - \sum_{i=1}^{N} \beta_i [r_i - v^T (\varphi_2(y_i) - \mu_{\varphi_2})]$$

• Conditions for optimality (eliminate $w, v, e, r$):

$$\begin{cases} \dfrac{\partial \mathcal{L}}{\partial w} = 0 \;\rightarrow\; w = \sum_{i=1}^{N} \alpha_i (\varphi_1(x_i) - \mu_{\varphi_1}) \\[1ex] \dfrac{\partial \mathcal{L}}{\partial v} = 0 \;\rightarrow\; v = \sum_{i=1}^{N} \beta_i (\varphi_2(y_i) - \mu_{\varphi_2}) \\[1ex] \dfrac{\partial \mathcal{L}}{\partial e_i} = 0 \;\rightarrow\; \gamma\, v^T (\varphi_2(y_i) - \mu_{\varphi_2}) = \nu_1 w^T (\varphi_1(x_i) - \mu_{\varphi_1}) + \alpha_i, \; i = 1, \dots, N \\[1ex] \dfrac{\partial \mathcal{L}}{\partial r_i} = 0 \;\rightarrow\; \gamma\, w^T (\varphi_1(x_i) - \mu_{\varphi_1}) = \nu_2 v^T (\varphi_2(y_i) - \mu_{\varphi_2}) + \beta_i, \; i = 1, \dots, N \\[1ex] \dfrac{\partial \mathcal{L}}{\partial \alpha_i} = 0 \;\rightarrow\; e_i = w^T (\varphi_1(x_i) - \mu_{\varphi_1}), \; i = 1, \dots, N \\[1ex] \dfrac{\partial \mathcal{L}}{\partial \beta_i} = 0 \;\rightarrow\; r_i = v^T (\varphi_2(y_i) - \mu_{\varphi_2}), \; i = 1, \dots, N \end{cases}$$

CCKM 2006 ⋄ Johan Suykens 33

Bayesian inference

[Diagram: three levels of Bayesian inference — Level 1: parameters, Level 2: hyperparameters, Level 3: model comparison; at each level Posterior = (Likelihood × Prior) / Evidence is maximized]

Automatic relevance determination (ARD) [MacKay, 1998]: infer the elements of the diagonal matrix $S$ in $K(x_i, x_j) = \exp(-(x_i - x_j)^T S (x_i - x_j))$, which indicate how relevant the input variables are (but: many local minima, computationally expensive).
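The ARD kernel itself is easy to write down; a small sketch (not from the slides; numpy assumed, with one relevance parameter per input variable):

```python
# Sketch: ARD-style RBF kernel with one relevance weight per input variable.
import numpy as np

def ard_kernel(X1, X2, s):
    """K(x, z) = exp(-(x - z)^T S (x - z)) with S = diag(s);
       a large s_k means input variable k strongly influences the kernel."""
    diff = X1[:, None, :] - X2[None, :, :]                   # shape (N1, N2, d)
    return np.exp(-np.einsum('ijk,k,ijk->ij', diff, s, diff))
```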

CCKM 2006 ⋄ Johan Suykens 34

Classification of brain tumors using ARD

Bayesian learning (automatic relevance determination) of most relevant frequencies [Lu, 2005]

CCKM 2006 ⋄ Johan Suykens 35


Hierarchical Kernel Machines

[Diagram: hierarchical kernel machine built on an LS-SVM substrate, with levels for model selection, sparseness and structure detection — conceptually a hierarchical kernel machine, computationally one convex optimization problem]

Hierarchical modelling approach leading to a convex optimization problem.
Computationally fusing training, hyperparameter and model selection.
Optimization modelling: sparseness, input/structure selection, stability, ...

[Pelckmans et al., ML 2006]

CCKM 2006 ⋄ Johan Suykens 36

Additive regularization trade-off

• Traditional Tikhonov regularization scheme:

$$\min_{w,e}\; w^T w + \gamma \sum_i e_i^2 \quad \text{s.t.} \quad e_i = y_i - w^T \varphi(x_i), \;\forall i = 1, \dots, N$$

Training solution for a fixed value of $\gamma$:

$$(K + I/\gamma)\, \alpha = y$$

→ Selection of $\gamma$ via a validation set: non-convex problem

• Additive regularization trade-off [Pelckmans et al., 2005]:

$$\min_{w,e}\; w^T w + \sum_i (e_i - c_i)^2 \quad \text{s.t.} \quad e_i = y_i - w^T \varphi(x_i), \;\forall i = 1, \dots, N$$

Training solution for a fixed value of $c = [c_1; \dots; c_N]$:

$$(K + I)\, \alpha = y - c$$

→ Selection of c via validation set: can be convex problem

CCKM 2006 ⋄ Johan Suykens 37

Sparse models

• SVM classically: sparse solution from QP problem at training level

• Hierarchical kernel machine: fused problem with sparseness obtained at the validation level [Pelckmans et al., 2005]

[Figure: LS-SVM classifier with RBF kernel ($\gamma = 5.3667$, $\sigma^2 = 0.90784$) on data with 2 different classes (class 1, class 2), showing a sparse set of support vectors]

CCKM 2006 ⋄ Johan Suykens 38

Additive models and structure detection

• Additive models: $y(x) = \sum_{p=1}^{P} w^{(p)T} \varphi^{(p)}(x^{(p)})$ with $x^{(p)}$ the $p$-th input variable and a feature map $\varphi^{(p)}$ for each variable. This leads to the kernel $K(x_i, x_j) = \sum_{p=1}^{P} K^{(p)}(x_i^{(p)}, x_j^{(p)})$.

• Structure detection [Pelckmans et al., 2005]:

$$\min_{w,e,t}\; \rho \sum_{p=1}^{P} t_p + \sum_{p=1}^{P} w^{(p)T} w^{(p)} + \gamma \sum_{i=1}^{N} e_i^2 \quad \text{s.t.} \quad \begin{cases} y_i = \sum_{p=1}^{P} w^{(p)T} \varphi^{(p)}(x_i^{(p)}) + e_i, \;\forall i = 1, \dots, N \\ -t_p \le w^{(p)T} \varphi^{(p)}(x_i^{(p)}) \le t_p, \;\forall i = 1, \dots, N, \;\forall p = 1, \dots, P \end{cases}$$

Study how the solution with maximal variation varies for different values of $\rho$.

[Figure: maximal variation per input variable as a function of $\rho$ — the 4 relevant input variables separate from the 21 irrelevant input variables]

CCKM 2006 ⋄ Johan Suykens 39


Incorporation of prior knowledge

• Example: LS-SVM regression with monotonicity constraint

$$\min_{w,b,e}\; \frac{1}{2} w^T w + \gamma \frac{1}{2} \sum_{i=1}^{N} e_i^2 \quad \text{s.t.} \quad \begin{cases} y_i = w^T \varphi(x_i) + b + e_i, \;\forall i = 1, \dots, N \\ w^T \varphi(x_i) \le w^T \varphi(x_{i+1}), \;\forall i = 1, \dots, N-1 \end{cases}$$

• Application: estimation of cdf [Pelckmans et al., 2005]

[Figure: estimation of a cdf — empirical cdf versus true cdf (left); ecdf, true cdf, Chebychev, mkrm and LS-SVM estimates (right)]

CCKM 2006 ⋄ Johan Suykens 40

Equivalent kernels from constraints

Regression with autocorrelated errors:

$$\min_{w,b,r,e}\; w^T w + \gamma \sum_i r_i^2 \quad \text{s.t.} \quad \begin{cases} y_i = w^T \varphi(x_i) + b + e_i, \; i = 1, \dots, N \\ e_i = \rho\, e_{i-1} + r_i, \; i = 2, \dots, N \end{cases}$$

leads to

$$f(x) = \sum_{j=2}^{N} \alpha_{j-1} K_{eq}(x_j, x) + b$$

with “equivalent kernel”

$$K_{eq}(x_j, x_i) = K(x_j, x_i) - \rho\, K(x_{j-1}, x_i)$$

where $K(x_j, x_i) = \varphi(x_j)^T \varphi(x_i)$ [Espinoza et al., 2006].
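Given a time-ordered training sequence and its Gram matrix, the equivalent kernel above is a one-line modification; a sketch (not from the slides; numpy assumed):

```python
# Sketch: equivalent kernel for regression with AR(1) autocorrelated errors.
import numpy as np

def equivalent_kernel(K, rho):
    """K: N x N Gram matrix on time-ordered samples.
       Returns the (N-1) x N matrix K_eq(x_j, x_i) = K(x_j, x_i) - rho * K(x_{j-1}, x_i), j = 2..N."""
    return K[1:, :] - rho * K[:-1, :]
```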

[Diagram: modular definition of the model structure — partially linear structure, LS-SVM regression, autocorrelated residuals, imposing symmetry]

CCKM 2006 ⋄ Johan Suykens 41

Nonlinear dynamical systems control

[Diagram: inverted pendulum on a cart (input $u$, cart position $x_p$, angle $\theta$); merged formulation: minimize the control objective + LS-SVM objective subject to the system dynamics and the LS-SVM controller for time $k = 1, 2, \dots, N$]

Merging optimal control and support vector machine optimization problems:

Approximate solutions to optimal control problems [Suykens et al., NN 2001]

CCKM 2006 ⋄ Johan Suykens 42

Conclusions and future challenges

• Integrative understanding and systematic design for supervised, semi-supervised, unsupervised learning and beyond

• Kernel methods: complementary views (LS-)SVM, RKHS, GP

• Towards a global understanding of science as a whole: kernel methods as a role model?

• Bridging gaps between fundamental theory, algorithms and applications

• Reliable methods: numerically, computationally, statistically

• Realizing the ultimate software tool: optimal model representations and solutions by symbolic computation?

CCKM 2006 ⋄ Johan Suykens 43


Acknowledgements

• Current and former colleagues:

C. Alzate, L. Ameye, T. De Bie, J. De Brabanter, B. De Moor, A. Devos, M. Espinoza, F. De Smet, I. Goethals, B. Hamers, F. Janssens, L. Hoegaerts, P. Karsmakers, G. Lanckriet, C. Lu, L. Lukas, J. Luts, K. Marchal, Y. Moreau, F. Ojeda, K. Pelckmans, N. Pochet, B. Van Calster, V. Van Belle, R. Van de Plas, T. Van Gestel, L. Vanhamme, S. Van Huffel, J. Vandewalle

• Many people for joint work, discussions, invitations, joint organization ofmeetings.

• Support from K.U. Leuven, GOA-Ambiorics, COE Optimization in Engineering, IAP V, FWO projects, IWT.

CCKM 2006 ⋄ Johan Suykens 44

Books

• Boyd S., Vandenberghe L., Convex Optimization, Cambridge University Press, 2004.

• Chapelle O., Scholkopf B., Zien A. (Eds.), Semi-Supervised Learning, MIT Press, Cambridge, MA, (in press) 2006.

• Cristianini N., Shawe-Taylor J., An Introduction to Support Vector Machines, Cambridge University Press, 2000.

• Rasmussen C.E., Williams C.K.I., Gaussian Processes for Machine Learning, MIT Press, Cambridge, MA, 2006.

• Scholkopf B., Smola A., Learning with Kernels, MIT Press, Cambridge, MA, 2002.

• Scholkopf B., Tsuda K., Vert J.-P. (Eds.), Kernel Methods in Computational Biology, MIT Press, Cambridge, MA, 2004.

• Shawe-Taylor J., Cristianini N., Kernel Methods for Pattern Analysis, Cambridge University Press, 2004.

• Suykens J.A.K., Van Gestel T., De Brabanter J., De Moor B., Vandewalle J., Least Squares Support Vector Machines, World Scientific, Singapore, 2002.

• Suykens J.A.K., Horvath G., Basu S., Micchelli C., Vandewalle J. (Eds.), Advances in Learning Theory: Methods, Models and Applications, vol. 190 of NATO-ASI Series III: Computer and Systems Sciences, IOS Press, Amsterdam, The Netherlands, 2003.

• Vapnik V., Statistical Learning Theory, John Wiley & Sons, New York, 1998.

• Wahba G., Spline Models for Observational Data, Series in Applied Mathematics, 59, SIAM, Philadelphia, 1990.

CCKM 2006 ⋄ Johan Suykens 45