Kernel-Induced Feature Spaces

Chapter 3

March 6, 2003

T.P. Runarsson ([email protected]) and S. Sigurdsson ([email protected])

March 6, 2003

Kernel representation

• Complex real-world applications require more expressive

hypothesis spaces than linear functions (i.e. the target cannot be

expressed as a simple linear combination of the input attributes).

• This limitation was pointed out by Minsky and Papert in the 1960s and led to the proposal of multi-layered neural networks.

• Kernel representations offer an alternative solution by

projecting the input attributes into a high dimensional feature

space, thus increasing the computational power of the linear

learning machines discussed so far.

• The kernel method allows the decoupling of the learning algorithm and theory from the design of an appropriate kernel

for a specific application area.

1

March 6, 2003

Linear machines in the dual representation

In the dual representation the training examples never appear

isolated but always in the form of inner products between pairs of

examples.

Perceptron-based learning in dual form:

function [alpha, R] = perceptron_dual(X, y)
%PERCEPTRON_DUAL the perceptron algorithm in dual form
n   = size(X,2);            % length of the input vector
ell = size(X,1);            % the number of training samples
b    = 0;                   % initial bias, zero
flag = 1;                   % convergence flag
K = X*X';                   % compute the Gram matrix (the inner products!)
R = max(sqrt(diag(K)));     % the parameter R
alpha = zeros(ell,1);       % the dual variables alpha
while (1 == flag),          % start repeat (while) loop
  flag = 0;
  for i = 1:ell,
    if (y(i)*((y.*alpha)'*K(:,i) + b) <= 0),
      alpha(i) = alpha(i) + 1;
      b = b + y(i)*R^2;
      flag = 1;
    end
  end
end

2

March 6, 2003

Learning in feature space

• The complexity of the target function to be learned depends on the way it is represented,

• ideally a representation that matches the specific learning problem should be chosen,

• and so it is not uncommon in machine learning to change the representation of the data:

x = (x_1, \ldots, x_n) \mapsto φ(x) = (φ_1(x), \ldots, φ_N(x))

where n is the dimension of the inputs (attributes) and N is

the dimension of the features.

Attributes are the original quantities x ∈ X.

Features are the quantities introduced to describe the data.

Feature selection is the task of choosing the most suitable

representation.

Feature space is the new space F = {φ(x)|x ∈ X}.

3

March 6, 2003

Feature selection

Motivation:

• to improve the generalization error,
• for explanatory purposes, to determine the relevant features,
• and to reduce the dimensionality of the input space (for real-time applications).

Good feature selection requires domain knowledge (“insight”), for

example what would be a good feature for say:

• a pixel image,
• a text document,
• a living creature,
• the stock market,
• some physical phenomenon (Newton's law of gravity),
• ... ?

4

March 6, 2003

The XOR problem in feature space

In this example we change the representation of the XOR problem

using all degree 2 monomial features. The mapping is given by:

Φ : R^2 → R^3, x = (x_1, x_2) \mapsto φ(x) = (φ_1, φ_2, φ_3) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)

The non-linear mapping linearizes the problem i.e. in feature

space the problem becomes linear.

>> X = [0 0;0 1;1 0;1 1]; % input

X =

0 0

0 1

1 0

1 1

>> y = 2*xor(X(:,1),X(:,2))-1 % output

y =

-1

1

1

-1

% construct the monomial features

>> Z = [X(:,1).^2 sqrt(2)*X(:,1).*X(:,2) X(:,2).^2]

Z =

0 0 0

0 0 1.0000

1.0000 0 0

1.0000 1.4142 1.0000

5

March 6, 2003

% let’s just use the above perceptron algorithm

>> [alpha, R] = perceptron_dual(Z,y)

alpha =

17

11

11

6

R =

2

>> b = alpha’*(y*R^2)

b =

-4

>> w = (alpha.*y)’*Z

w =

5.0000 -8.4853 5.0000

% note that the problem is 3-dimensional

>> w*Z’+b

ans =

-4.0000 1.0000 1.0000 -6.0000

>> sign(w*Z’+b)

ans =

-1 1 1 -1

6

March 6, 2003

The separating hyperplane in feature space

[Figure: 3-D plot of the four mapped XOR points in the feature space spanned by φ_1, φ_2, φ_3.]

The plane (w, b) = ((5, −8.4853, 5), −2) separates the XOR task linearly in feature space.

7

March 6, 2003

The implicit mapping into feature space

Consider the monomial features mapping:

Φ : R^n → R^N

For monomials of degree d the dimension of the feature space is

\binom{d+n-1}{d} = \frac{(d+n-1)!}{d!\,(n-1)!}

and so it very quickly becomes intractable to work directly in

feature space.

There exists, however, a way of computing dot products in these

high-dimensional feature spaces without explicitly mapping into

the spaces. This is done by means of kernels nonlinear in the

input space X.

A kernel is a function K such that for all x, z ∈ X

K(x, z) = ⟨φ(x) · φ(z)⟩

where φ is a mapping from X to an (inner product) feature space F.

8

March 6, 2003

Implicit mapping example: the polynomial kernel

The degree-2 monomial mapping from R^2 to R^3 is given by:

Φ : R^2 → R^3, (x_1, x_2) \mapsto (φ_1, φ_2, φ_3) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)

The feature-space inner product can also be computed directly in input space; for x = (x_1, x_2) and z = (z_1, z_2):

⟨x · z⟩^2 = \Big(\sum_{i=1}^{2} x_i z_i\Big)^2 = (x_1 z_1 + x_2 z_2)^2
          = x_1^2 z_1^2 + 2 x_1 z_1 x_2 z_2 + x_2^2 z_2^2
          = ⟨(x_1^2, \sqrt{2}\,x_1 x_2, x_2^2) · (z_1^2, \sqrt{2}\,z_1 z_2, z_2^2)⟩
          = ⟨φ(x) · φ(z)⟩

and so the implicit mapping for the polynomial kernel is:

K(x, z) = ⟨φ(x) · φ(z)⟩ = ⟨x · z⟩^2

that is, we never need to compute the features φ.

Notice that the number of features is

\frac{(2+2-1)!}{2!\,(2-1)!} = \frac{3!}{2!} = 3
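
As a quick numerical sanity check (our own illustration, not part of the original slides), the identity ⟨x · z⟩^2 = ⟨φ(x) · φ(z)⟩ can be verified in MATLAB for an arbitrary pair of input vectors:

>> x = [3 7]; z = [4 6];                        % any two input vectors
>> phix = [x(1)^2 sqrt(2)*x(1)*x(2) x(2)^2];    % explicit degree-2 features phi(x)
>> phiz = [z(1)^2 sqrt(2)*z(1)*z(2) z(2)^2];    % explicit degree-2 features phi(z)
>> phix*phiz'                                   % inner product in feature space
>> (x*z')^2                                     % the same value computed in input space

Both commands return 2916, i.e. the kernel gives the feature-space inner product without ever forming φ.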

9

March 6, 2003

The linear learning machine in terms of kernels

The non-linear machine is built in two steps: use a fixed non-

linear mapping to transform the data into a feature space, and

then use a linear machine to classify them in the feature space.

The linear machine in feature space:

f(x) = \sum_{i=1}^{N} w_i φ_i(x) + b

We don't need to construct the weights w explicitly; in dual form

f(x) = \sum_{i=1}^{ℓ} α_i y_i ⟨φ(x_i) · φ(x)⟩ + b

and we don't need to construct the features φ explicitly either:

f(x) = \sum_{i=1}^{ℓ} α_i y_i K(x_i, x) + b

using the kernel representation.

We could also ignore all examples i ∉ SV, i.e. those for which α_i = 0:

f(x) = \sum_{i ∈ SV} α_i y_i K(x_i, x) + b

These vectors are known as support vectors.
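
As a small sketch (ours, with assumed variables), the decision function can be evaluated at a new input x from the support vectors alone, given a trained dual machine (X, y, alpha, b) and a polynomial kernel of degree d:

sv = find(alpha > 0);               % indices of the support vectors
Kx = (X(sv,:)*x').^d;               % kernel values K(x_i, x) for i in SV
fx = (alpha(sv).*y(sv))'*Kx + b;    % f(x) = sum_{i in SV} alpha_i y_i K(x_i, x) + b
label = sign(fx);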

10

March 6, 2003

The perceptron using a kernel?

Our perceptron algorithm in dual form with only one small change:

function [alpha, R] = perceptron_dual_poly(X, y, d)
%PERCEPTRON_DUAL_POLY perceptron algorithm in dual form + polynomial kernel
n   = size(X,2);            % length of the input vector
ell = size(X,1);            % the number of training samples
b    = 0;                   % initial bias, zero
flag = 1;                   % convergence flag
K = (X*X').^d;              % compute the *kernel* matrix   <----
R = max(sqrt(diag(K)));     % the parameter R
alpha = zeros(ell,1);       % the dual variables alpha
while (1 == flag),          % start repeat (while) loop
  flag = 0;
  for i = 1:ell,
    if (y(i)*((y.*alpha)'*K(:,i) + b) <= 0),
      alpha(i) = alpha(i) + 1;
      b = b + y(i)*R^2;
      flag = 1;
    end
  end
end

11

March 6, 2003

The XOR problem using a kernel classifier

>> [alpha, R] = perceptron_dual_poly(X,y,2)

alpha =

17

11

11

6

R =

2

% the bias can be computed by:

>> b=alpha’*(y*R^2)

b =

-4

% now let's say we want to know the output for some input vectors:

>> V = [0 0;0 1;1 0;1 1]

V =

0 0

0 1

1 0

1 1

% then the first thing we must do is reconstruct our kernel

% in terms of this new input data, ie.

% (note that in this example X = V, but in principle V can be any input)

>> K=(X*V’).^2

K =

0 0 0 0

0 0 1 1

0 1 0 1

0 1 1 4

% next we determine the output for each example input:

>> (y.*alpha)’*K+b

ans =

-4 1 1 -6

% or

>> sign((y.*alpha)’*K+b)

ans =

-1 1 1 -1

12

March 6, 2003

Optical character recognition demo

We experiment with the US Postal Service OCR database which

contains 7291 examples for the purpose of training and 2007

examples for testing. The object is to learn to recognize 16× 16

pixel images of handwritten digits from 0 to 9. For example, here

are the characters 6, 2, 6, 5, 5, 8 and 2:

[Figure: a grid of sample 16×16 handwritten digit images from the USPS database.]

The kernel based dual perceptron algorithm was tested on this

data in a similar way to the XOR problem. However, this

is a multi-classification problem and so 10 one-against-the-rest

perceptrons are trained.

To run this demo you will need the data files usps_digits_training.mat and usps_digits_testing.mat, the MATLAB script ocrdemo.m, and the MATLAB functions perceptron_dual_kernel.m and the kernel function kernel.m.

Because of memory limitations in MATLAB only the first ≈ 2000 examples of the 7291 are used for training. In this case we get a classification error of 8.37%.

13

March 6, 2003

The OCR demo using pairwise classification

In pairwise classification, we train a classifier for each possible pair of classes. For m classes, this results in (m − 1)m/2 binary classifiers. This is more classifiers than in the one-against-the-rest approach: 10 in this demo, versus 45 using pairwise classification.

At first this may seem to be a slower approach, but the training sets are smaller and so it is actually possible to save time! Furthermore, because the training sets are smaller it was now possible to use the entire training set in MATLAB.

We ran 45 binary classifiers on the whole data set and used a voting scheme for classification, i.e. the class which gets the highest number of votes wins! In this case a classification error of 6.38% was achieved (probably because we used the whole training set).

The problem with the one-against-the-rest approach is that the real-valued outputs are on different scales! However, this is the most common approach today, and some efforts are being made in transforming the real-valued outputs into class probabilities.

14

March 6, 2003

Features for strings and texts

Let Σ* denote the set of all strings of finite length made up from characters in the set Σ. If e.g. Σ = {a, b, c} then Σ* = {ε, a, b, c, aa, ab, ba, bb, aaa, aab, ...}, where ε denotes the empty string.

If s ∈ Σ* then |s| denotes the length of s, i.e. the number of characters in s; |ε| = 0.

If u ∈ Σ* and s ∈ Σ* we say that u is a substring of s if there exist indices i = (i_1, i_2, \ldots, i_{|u|}) with 1 ≤ i_1 < i_2 < \ldots < i_{|u|} ≤ |s| such that u_j = s_{i_j} for j = 1, \ldots, |u|. Then we also write u = s(i). s[i : j] denotes the contiguous substring s_i s_{i+1} \ldots s_{j-1} s_j of length j − i + 1.

Let m = |Σ| be the number of (distinct) characters in Σ. Then for a given n ≥ 1 we can define a feature map φ_n : Σ* → R^{m^n} that associates with each string s ∈ Σ* a real-valued vector of length m^n that contains, for every possible distinct substring of length n, how many such substrings there are in s.

Below we show such feature vectors for the strings s = abcaaab

and t = bcaac when n = 1, 2, 3 and Σ = {a,b,c}.

15

March 6, 2003

The possible substrings have been ordered in so-called lexicographic

order:

φ1(s) φ1(t) φ1(s)φ1(t)

a 4 2 8

b 2 1 2

c 1 2 2

K1(s, t) = 12

φ2(s) φ2(t) φ2(s)φ2(t)

aa 6 1 6

ab 5 0 0

ac 1 2 2

ba 3 2 6

bb 1 0 0

bc 1 2 2

ca 3 2 6

cb 1 0 0

cc 0 1 0

K2(s, t) = 22

φ3(s) φ3(t) φ3(s)φ3(t)

aaa 4 0 0

aab 3 0 0

aac 0 1 0

aba 3 0 0

abb 1 0 0

abc 1 0 0

aca 3 0 0

acb 1 0 0

16

March 6, 2003

acc 0 0 0

baa 3 1 3

bab 3 0 0

bac 0 2 0

bba 0 0 0

bbb 0 0 0

bbc 0 0 0

bca 3 2 6

bcb 1 0 0

bcc 0 1 0

caa 3 1 3

cab 3 0 0

cac 0 2 0

cba 0 0 0

cbb 0 0 0

cbc 0 0 0

cca 0 0 0

ccb 0 0 0

ccc 0 0 0

K3(s, t) = 12

17

March 6, 2003

The kernel from string features

Let K_n(s, t) denote the inner product ⟨φ_n(s) · φ_n(t)⟩. We now observe that K_n(s, t) can be evaluated recursively, without first explicitly constructing φ_n(s) and φ_n(t), by making use of the following relations:

• K_0(s, t) = 1 for all s, t,

• K_i(s, t) = 0 if min(|s|, |t|) < i, for i = 1, 2, \ldots, and

• K_i(sx, t) = K_i(s, t) + \sum_{j : t_j = x} K_{i-1}(s, t[1 : j-1]), for x ∈ Σ, i = 1, 2, \ldots

The last sum is the sum of all K_{i-1}(s, t[1 : j-1]) values for which t_j is the same character as x.
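
These relations translate directly into a (slow, purely recursive) MATLAB function; this is our own illustration, while the course's non-recursive implementation follows on the next slide.

function k = knaive(s, t, i)
%KNAIVE direct recursive evaluation of K_i(s,t) from the relations above
if i == 0,
  k = 1;                                  % K_0(s,t) = 1 for all s, t
elseif min(length(s), length(t)) < i,
  k = 0;                                  % strings too short to contain a length-i substring
else
  x  = s(end);                            % write s = s'x with last character x
  sp = s(1:end-1);
  k  = knaive(sp, t, i);                  % K_i(s',t)
  for j = 1:length(t),
    if t(j) == x,                         % sum over the positions j with t_j = x
      k = k + knaive(sp, t(1:j-1), i-1);  % K_{i-1}(s', t[1:j-1])
    end
  end
end

For example, knaive('abcaaab','bcaac',2) reproduces the value K_2(s, t) = 22 from the tables above.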

18

March 6, 2003

MATLAB function for the string kernel

The following MATLAB function calculates for given input strings

Kn(s, t) in a non-recursive manner by successively calculating

K0(s, t), K1(s, t), . . . , Kn(s, t).

For each fixed i we calculate successively K_i(s[1:j], t[1:k]) for k = 0, 1, \ldots, |t| and j = 0, 1, \ldots, |s|, making use of the relations above. Since indices in MATLAB start with 1 (not 0), K_i(s[1:j], t[1:k]) is stored as K(j+1, k+1, i+1). Also note that strings in MATLAB are effectively handled as vectors of unicode values corresponding to the characters in the string.

The MATLAB function is in fact more general in that, for an input weight vector w of length n, it evaluates the weighted kernel \sum_{i=1}^{n} w_i K_i(s, t).

Also note that the complexity of the algorithm is O(n|s||t|). By comparison, the number of elements in the feature vector is m^n, leaving aside the number of steps required to calculate each element.

19

March 6, 2003

function [r, K] = stringkernel(s,t,w)
%STRINGKERNEL weighted string kernel r = sum_i w(i)*K_i(s,t)
ns = length(s);
nt = length(t);
n  = length(w);
r  = 0;
K(:,:,1) = ones(ns+1,nt+1);            % K_0 = 1 for all prefix pairs
for i = 1:n,
  K(:,:,i+1) = zeros(ns+1,nt+1);
  for j = 1:ns,
    sum = 0;
    for k = 1:nt,
      if t(k) == s(j),
        sum = sum + K(j,k,i);          % accumulate K_{i-1}(s[1:j-1],t[1:k-1]) where t_k = s_j
      end
      K(j+1,k+1,i+1) = K(j,k+1,i+1) + sum;
    end
  end
  r = r + w(i)*K(ns+1,nt+1,i+1);
end

If you have the MATLAB compiler you can increase the speed of this function, or most probably of the function calling it, by compiling to machine code:

>> mcc -x stringkernel

20

March 6, 2003

Example using the MATLAB function

Below we show the matrices Ki(s, t) when s = abcaaab,t = bcaac and i = 0, 1, 2, 3.

>> r = stringkernel(’abcaaab’,’bcaac’,[1 1 1])

r =

46

>> [r,K]=stringkernel(’abcaaab’,’bcaac’,0.9.^(2*[1:3]))

r =

30.5315

K(:,:,1) =

empty b bc bca bcaa bcaac

empty 1 1 1 1 1 1

a 1 1 1 1 1 1

ab 1 1 1 1 1 1

abc 1 1 1 1 1 1

abca 1 1 1 1 1 1

abcaa 1 1 1 1 1 1

abcaaa 1 1 1 1 1 1

abcaaab 1 1 1 1 1 1

K(:,:,2) =

0 0 0 0 0 0

0 0 0 1 2 2

0 1 1 2 3 3

0 1 2 3 4 5

0 1 2 4 6 7

0 1 2 5 8 9

0 1 2 6 10 11

0 2 3 7 11 12

21

March 6, 2003

K(:,:,3) =

0 0 0 0 0 0

0 0 0 0 0 0

0 0 0 0 0 0

0 0 1 1 1 4

0 0 1 3 6 9

0 0 1 5 12 15

0 0 1 7 19 22

0 0 1 7 19 22

K(:,:,4) =

0 0 0 0 0 0

0 0 0 0 0 0

0 0 0 0 0 0

0 0 0 0 0 0

0 0 0 1 2 2

0 0 0 2 6 6

0 0 0 3 12 12

0 0 0 3 12 12

On pages 44–45 in the book it is suggested that in the feature vector φ_n(s) all values should be multiplied by a factor λ^n, where 0 < λ ≤ 1. This implies that the kernel value K_n(s, t) should be multiplied by λ^{2n}. According to this we might choose the weights in the general kernel such that w_i = λ^{2i}, i = 1, 2, \ldots, n.

We do not, however, introduce the auxiliary kernel K'_i(s, t) as done in the book. If we associate the kernel value K_i(s, t) with the choice λ = 1, as is done above, then in fact K'_i(s, t) = K_i(s, t), and we can include the effect of choosing λ < 1 simply by multiplying K_i(s, t) by λ^{2i} at the end of our calculations.

22

March 6, 2003

Text versus string kernels

We can construct the same form of kernel working with texts

rather than strings by treating the words of the text as characters

and the text as strings of words. Here it may seem to complicate

matters that we do not usually have a priori information about

all possible characters, i.e. words, that may be included in the

text. While this would cause complications when dealing directly

with feature vectors it does not do so when calculating the kernels

directly with the algorithm above.

23

March 6, 2003

Questions

1. We might be interested in restricting the feature vectors φn(s)

in such a way that we only consider occurrences of substrings

of the form u = s(i : j) where 1 ≤ i ≤ j ≤ |s| and j − i + 1 = n. In the example above we have e.g. that the

ab-coordinate of φ2(abcaaab) has the value 2. How would

you change the algorithm above to calculate the corresponding

kernels?

2. We might be interested in changing the features vectors φn(s)

in such a way that we consider occurrences of substrings

s(i : j), where j − i + 1 = n that differ in at most m

places from a given string u. How would you now calculate

the corresponding kernels? Note: Such kernels (so called

mismatch string kernels) have been used eg. in classification

of proteins into functional and structural classes based on

homology (evolutionary similarity) of protein sequence data.

In the example above we have e.g. that if m = 1 then the

aba-coordinate of φ3(abcaaab) has the value 2.

24

March 6, 2003

Levenshtein distance (edit distance)

Levenshtein distance¹ is a measure of the similarity between two

strings, which we will refer to as the source string (s) and

the target string (t). The distance is the number of deletions,

insertions, or substitutions required to transform s into t. For

example,

• s is "test" and t is "test", then d(s, t) = 0, because the

strings are identical and no transformation is needed.

• s is “test” and t is “best”, then d(s, t) = 1, because one

substitution (change “t” to “b”) is sufficient to transform s

into t.

The greater the Levenshtein distance, the more different the

strings are.

The Levenshtein distance algorithm has been used in:

• spell checking,
• speech recognition,
• DNA analysis,
• plagiarism detection.

¹ Levenshtein distance is named after the Russian scientist Vladimir Levenshtein, who devised the algorithm in 1965.

25

March 6, 2003

Edit distance algorithm

The edit distance (dynamic programming) algorithm is as follows:

1. let ns and nt be the length of the strings s and t respectively,

and construct a zero matrix C of size (ns + 1)× (nt + 1),

2. initialize the first column with Ci+1,1 = Ci,1 + deletion cost,

i = 1, . . . , ns (the deletion cost is by default 1)

3. initialize the first row with C1,j+1 = C1,j + insertion cost,

j = 1, . . . , nt (the insertion cost is by default 1)

4. if s[i] is equal to t[j] the substitution cost is 0, otherwise it takes some default value (usually 1),

5. each remaining element C_{i+1,j+1} (i = 1, \ldots, ns, j = 1, \ldots, nt) is set equal to the minimum of:
   • the element immediately above plus the deletion cost,
   • the element immediately to the left plus the insertion cost,
   • the element diagonally above and to the left plus the substitution cost,

6. the edit distance is given by d(s, t) = C_{ns+1, nt+1}.
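
A minimal MATLAB sketch of this dynamic programming algorithm (our own illustration; the function name and the optional cost arguments are assumptions):

function d = editdistance(s, t, delcost, inscost, subcost)
%EDITDISTANCE Levenshtein (edit) distance between the strings s and t
if nargin < 3, delcost = 1; end                      % default costs as in the text
if nargin < 4, inscost = 1; end
if nargin < 5, subcost = 1; end
ns = length(s); nt = length(t);
C = zeros(ns+1, nt+1);                               % step 1
for i = 1:ns, C(i+1,1) = C(i,1) + delcost; end       % step 2: first column
for j = 1:nt, C(1,j+1) = C(1,j) + inscost; end       % step 3: first row
for i = 1:ns,
  for j = 1:nt,
    if s(i) == t(j), sc = 0; else sc = subcost; end  % step 4
    C(i+1,j+1) = min([C(i,j+1) + delcost, ...        % step 5: deletion
                      C(i+1,j) + inscost, ...        %         insertion
                      C(i,j)   + sc]);               %         substitution
  end
end
d = C(ns+1,nt+1);                                    % step 6

With unit costs, editdistance('water','wine') returns 3, in agreement with the first cost matrix on the next slide.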

26

March 6, 2003

Edit distance example

Here are examples of cost matrices for the source word “water”

and target word “wine”.

In this example the cost is 1 for all operations.

empty w i n e

empty 0 1 2 3 4

w 1 0 1 2 3

a 2 1 1 2 3

t 3 2 2 2 3

e 4 3 3 3 2

r 5 4 4 4 3 <- answer

^

In this example the cost of deleting a letter is 2:

empty w i n e

empty 0 1 2 3 4

w 2 0 1 2 3

a 4 2 1 2 3

t 6 4 3 2 3

e 8 6 5 4 2

r 10 8 7 6 4 <- answer

^

27

March 6, 2003

Radial basis function

We now introduce a new kernel based on the string edit distance.

The most general form of a radial basis function (RBF) is

K(x_c, x) = φ\big( (x − x_c)' D^{-1} (x − x_c) \big)

where φ is the function used, x_c is the center and D is a metric. The term (x − x_c)' D^{-1} (x − x_c) is the distance between the input x and the center x_c in the metric defined by D. We think of the "training vectors" as being the centers.

In the Euclidean case we have D = r²I for some scalar radius r, and the above simplifies to:

K(x_c, x) = φ\Big( \frac{(x − x_c)'(x − x_c)}{r^2} \Big)

The most common radial basis functions are:

• the Gaussian, φ(z) = exp(−z),
• the inverse multiquadric, φ(z) = (1 + z)^{-1/2},
• and the Cauchy, φ(z) = (1 + z)^{-1}.

28

March 6, 2003

Radial basis function using string edit distance

Since any metric can be used with a Radial basis function we can

design a kernel for strings using the edit-distance, d(s, t), for

example:

K(s, t) = exp(−d(s, t)/σ), where σ > 0.

See also page 43 in the book, Corollary 3.13, part 3.
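
A short sketch (ours) of computing this kernel in MATLAB, using the hypothetical editdistance function sketched earlier:

sigma = 2;                              % user-defined width, illustrative value
K_st  = exp(-editdistance(s,t)/sigma);  % K(s,t) = exp(-d(s,t)/sigma)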

29

March 6, 2003

Chromosomal data classification demo

The following demo is the first sequence recognition task used for the WCCI2002 competition. The task is a chromosomal binary classification problem. The input strings are of variable length and look something like this:

A==B=f==C===f==A===f=A===f====B===f=A==f=A====f==A==f=A===b===Af=f

A======D===f===A==f===A=f====B===f=A===d=====A=====d=======f

BA=f====C=====f==A=f==A==e===B===f===B=====d====A==e===A=f=e

A==A=f===D===f=A==f==A====f=====B==f==A==f=A===e==A======d=A=e=f

The first two belong to one class while the last two belong to the

other class. Do you see any common patterns?

See the course webpage for the demo.

30

March 6, 2003

Differences between the string and OCR demo

In the OCR example we can form a kernel K(x, z) = ⟨x · z⟩ in the input space as an inner product of the data vectors x and z. We choose, however, to change this kernel to a new kernel

K_d(x, z) = ⟨x · z⟩^d   or   K_{d1}(x, z) = \big(1 + ⟨x · z⟩\big)^d

where d ∈ {2, 3, \ldots}.

Remark: The kernel K_d can be interpreted as choosing a feature map φ_d(x) with \binom{n+d-1}{d} coordinates, one for each distinct possible monomial of degree d made from the original coordinates. The kernel K_{d1} can be interpreted as choosing a feature map φ_{d1}(x); in this second case we replace the input vector x with the feature vector φ_{d1}(x) with \binom{n+d}{d} coordinates, one for each possible distinct monomial of degree 0 up to and including d.

For example, when n = 2 and d = 3, then

φ_d(x_1, x_2) = \big( x_1^3, \sqrt{3}\,x_1^2 x_2, \sqrt{3}\,x_1 x_2^2, x_2^3 \big)

and

φ_{d1}(x_1, x_2) = \big( 1, \sqrt{3}\,x_1, \sqrt{3}\,x_2, \sqrt{3}\,x_1^2, \sqrt{6}\,x_1 x_2, \sqrt{3}\,x_2^2, x_1^3, \sqrt{3}\,x_1^2 x_2, \sqrt{3}\,x_1 x_2^2, x_2^3 \big)

31

March 6, 2003

In the string case we cannot form inner products in the input

space since it consists of strings rather than real vectors. We

map the string s, however, into a feature vector φ(s), which is a real vector, and this in turn allows us to define kernels K(s, t) = ⟨φ(s) · φ(t)⟩ as inner products of these feature vectors. Thus we are now constructing new kernels from features.

Although we are constructing the string kernels from features, we never actually construct the features; we are in fact constructing new kernels directly from old kernels. That is, again we do not construct the feature vectors explicitly, since it turns out to be computationally more advantageous to do this implicitly.

When constructing a kernel in the string case using string edit

distance, we use neither features nor other kernels to construct

the kernel. Here we obviously have to take care that the kernel

satisfies properties that are implicitly taken care of when the

kernel is defined from inner products. We return to this at the

end of this chapter.

Remark: In connection with the kernels based on distances it is interesting to note that if we have a kernel K(x, z) based on the inner product ⟨x · z⟩, then one may define the distance between x and z as:

\big( K(x, x) + K(z, z) − 2K(x, z) \big)^{1/2}

32

March 6, 2003

Kernel regression

We now show how kernels can be introduced to regression, or function learning. We write our regression model in terms of features:

y_i ≈ ⟨w · φ(x_i)⟩ + b

where the aim is to find a weight vector w that minimizes the discrepancies between the left and right hand sides for a given dataset (x_i, y_i), i = 1, \ldots, ℓ. The basic idea is to express w in terms of a new vector α s.t. w = [φ(x_1), \ldots, φ(x_ℓ)]\,α; then the model becomes:

y_i ≈ \sum_{j=1}^{ℓ} α_j ⟨φ(x_j) · φ(x_i)⟩ + b

or

y_i ≈ \sum_{j=1}^{ℓ} α_j K(x_j, x_i) + b

using the kernel representation.

If we want to use this model to estimate y for a given x outside the original dataset we have that:

y = \sum_{j=1}^{ℓ} α_j K(x_j, x) + b

33

March 6, 2003

One of the advantages of this representation is that the kernels K(x_i, x_j) can be evaluated indirectly, e.g. from other kernels or from distances between x_i and x_j, as already shown. We let as before G denote the Gram matrix whose (i, j)-th element is K(x_i, x_j). In terms of it the model becomes

y ≈ Gα + b \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}

When the offset is included in the data (by augmenting each input with a constant 1) we denote the Gram matrix by \hat{G} and the α vector by \hat{α}.

The only problem here is that G is singular when the dimension of the feature space N < ℓ, so the system cannot be solved using the normal equations. But we can reformulate the original problem using ridge regression.

34

March 6, 2003

Ridge regression revisited

We have already derived in chapter 2 the kernel form of the normal equations for ridge regression:

(λ I_ℓ + G)\,α = y

Here the Gram matrix G is simply the linear kernel and can be replaced with any of a variety of other kernels.

Let's say we would like to apply the monomial kernel of order d = 6; then

K(\hat{x}_i, \hat{x}_j) = ⟨\hat{x}_i · \hat{x}_j⟩^6, where \hat{x}' = [x' 1]

Note that when we transform our Gram matrix (the linear kernel) into a polynomial kernel we have changed our regression formulation. The feature space is then of higher dimension than the input space, and we would therefore expect the error to become smaller (in fact zero if the dimension of the feature space N ≥ ℓ).

For example, recall the example given in the lecture notes for chapter 2:

>> X = [3 7;4 6;5 6;7 7;8 5;4 5;5 5;6 3;7 4;9 4] ;

>> y = [6 5 5 7 5 4 4 2 3 4]’ ;

35

March 6, 2003

If we set say λ = 0.1 then we can compute α as before:

>> d = 6 ;

>> lambda = 0.1 ;

>> K = (Xhat*Xhat’).^d ;

>> alpha= (K + eye(ell)*lambda)\y ;

and we can check the error as follows:

>> error = (y-K*alpha)’

error =

1.0e-006 *

0.0124 -0.1324 -0.0840 0.0366 -0.0684 0.5842 -0.2903 -0.1841 0.2326 -0.0020

The error is very small, but what about other values for x ?

We would like to look at other possible inputs. Let Z be all possible combinations of integers from 1 to 10, i.e. a 100 × 2 matrix. We can compute the output of the learned function as follows:

>> Z = [ ]; for i=1:10, for j=1:10, Z = [Z;i j]; end, end

>> Zhat = [Z ones(size(Z,1),1)];

>> Kt = (Xhat*Zhat’).^6;

>> yest = Kt’*alpha;

36

March 6, 2003

We can now plot this plus the training target points y:

[Figure: 3-D plot of the estimated outputs over the grid Z together with the training targets y; axes x1, x2 and y.]

Is this learned function any better than a simple plane?

37

March 6, 2003

Further experiments using the monomial kernel:

>> lambda = 0.001

lambda =

0.0010

>> alpha = (lambda*eye(ell) + G)\y;

>> G*alpha

ans =

5.9358

4.9943

5.2535

6.9723

4.8303

3.7938

4.0529

1.9109

3.3706

3.8889

% Same result as with the iteration (see NEXT section).

% Now see how the error changes if we introduce polynomial kernels

>> y-G*alpha

ans =

0.0642

0.0057

-0.2535

0.0277

0.1697

0.2062

-0.0529

0.0891

-0.3706

0.1111

>> G2 = G.^2;

>> alpha= (lambda*eye(ell) + G2)\y;

>> y-G2*alpha

ans =

0.0153

0.0209

38

March 6, 2003

-0.0420

-0.0241

0.1429

-0.0253

0.0233

0.0892

-0.1594

-0.0366

>> G3 = G.^3;

>> alpha = (lambda*eye(ell) + G3)\y;

>> y-G3*alpha

ans =

-0.0010

0.0119

-0.0105

-0.0002

0.0028

-0.0143

0.0158

0.0041

-0.0090

0.0005

>> G4 = G.^4;

>> alpha = (lambda*eye(ell) + G4)\y;

>> y-G4*alpha

ans =

1.0e-003 *

-0.0128

0.1469

-0.1505

0.0126

-0.0067

-0.1437

0.1645

0.0046

-0.0206

0.0049

39

March 6, 2003

Steepest descent for regression - kernel version

Consider the basic iteration step:

w_{new} = w_{old} + η X'(y − X w_{old})

From this we have, writing w = X'α, that:

X'α_{new} = X'α_{old} + η X'(y − X X' α_{old})

and thus, if G = XX', that

Gα_{new} = Gα_{old} + η G(y − Gα_{old})

or, if we assume that G is invertible, that:

α_{new} = α_{old} + η (y − Gα_{old})

This is the iteration step of the kernel version.

40

March 6, 2003

For ridge regression, where we minimize

λ ⟨w · w⟩ + ⟨(y − Xw) · (y − Xw)⟩

the basic step in the original version becomes

w_{new} = w_{old} + η\big( X'(y − X w_{old}) − λ w_{old} \big)

and in the kernel version

α_{new} = α_{old} + η\big( y − (G + λ I_ℓ)\,α_{old} \big)

41

March 6, 2003

% We try the kernel method on our regression dataset:

>> Xhat=[X ones(ell,1)]

Xhat =

3 7 1

4 6 1

5 6 1

7 7 1

8 5 1

4 5 1

5 5 1

6 3 1

7 4 1

9 4 1

>> G = Xhat*Xhat’

G =

59 55 58 71 60 48 51 40 50 56

55 53 57 71 63 47 51 43 53 61

58 57 62 78 71 51 56 49 60 70

71 71 78 99 92 64 71 64 78 92

60 63 71 92 90 58 66 64 77 93

48 47 51 64 58 42 46 40 49 57

51 51 56 71 66 46 51 46 56 66

40 43 49 64 64 40 46 46 55 67

50 53 60 78 77 49 56 55 66 80

56 61 70 92 93 57 66 67 80 98

>> y = [6 5 5 7 5 4 4 2 3 4]’ ;

>> alpha = zeros(10,1)

alpha =

0

0

0

0

0

0

0

0

0

0

42

March 6, 2003

>> eta = 0.01

eta =

0.0100

>> for i=1:100, alpha = alpha + eta*(y - G*alpha); end

>> G*alpha

ans =

1.0e+073 *

-1.4772

-1.4986

-1.6586

-2.1173

-2.0000

-1.3599

-1.5199

-1.4026

-1.7013

-2.0213

% eta clearly too large so try again with smaller eta

>> alpha = zeros(10,1);

>> eta = 0.001;

>> for i=1:100, alpha = alpha + eta*(y - G*alpha); end

>> G*alpha

ans =

5.9061

5.1328

5.1792

6.0914

4.4986

4.3133

4.3596

2.7668

3.6327

3.7253

>> for i=1:10000, alpha = alpha + eta*(y - G*alpha); end

43

March 6, 2003

>> G*alpha

ans =

5.9406

5.0193

5.2463

6.8484

4.7790

3.8711

4.0981

2.0286

3.4038

3.8577

>> for i=1:10000, alpha = alpha + eta*(y - G*alpha); end

>> G*alpha

ans =

5.9364

4.9973

5.2527

6.9580

4.8244

3.8027

4.0581

1.9245

3.3744

3.8852

>> for i=1:10000, alpha = alpha + eta*(y - G*alpha); end

>> G*alpha

ans =

5.9358

4.9940

5.2536

6.9741

4.8310

3.7927

4.0523

1.9092

3.3701

3.8893

44

March 6, 2003

>> for i=1:10000, alpha = alpha + eta*(y - G*alpha); end

>> G*alpha

ans =

5.9357

4.9935

5.2537

6.9765

4.8320

3.7912

4.0514

1.9069

3.3695

3.8899

>> for i=1:10000, alpha = alpha + eta*(y - G*alpha); end

>> alpha = alpha + eta*(y - G*alpha);

>> G*alpha

ans =

5.9357

4.9935

5.2538

6.9769

4.8322

3.7910

4.0513

1.9066

3.3694

3.8900

% We seem to have reached convergence after 40000 iterations!

45

March 6, 2003

1-norm regression

In chapter 2 we derived the following formulation:

\min \sum_{i=1}^{ℓ} \big( ξ_i^+ + ξ_i^- \big)

subject to Xw + ξ^+ − ξ^- = y and ξ^+, ξ^- ≥ 0.

The kernel version of the formulation is exactly the same except that the constraints become:

Gα + ξ^+ − ξ^- = y

When re-solving the problem from chapter 2 using a polynomial kernel, i.e. with the (i, j)-th element of G, K(x_i, x_j), replaced by K(x_i, x_j)^d or (1 + K(x_i, x_j))^d, we would again expect the error vectors ξ^+ and ξ^- to become smaller.
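
Assuming the Optimization Toolbox is available, the kernel version can be posed as a linear program roughly as follows (a sketch, ours; any offset term from the chapter 2 formulation is omitted here):

% decision variables ordered as z = [alpha; xi_plus; xi_minus], each of length ell
ell = length(y);
f   = [zeros(ell,1); ones(ell,1); ones(ell,1)];        % minimize sum(xi+ + xi-)
Aeq = [G, eye(ell), -eye(ell)];                        % G*alpha + xi+ - xi- = y
beq = y;
lb  = [-inf*ones(ell,1); zeros(ell,1); zeros(ell,1)];  % alpha free, slacks >= 0
z   = linprog(f, [], [], Aeq, beq, lb, []);
alpha = z(1:ell);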

46

March 6, 2003

An incomplete list of kernels

• The simple dot product:

  K(x, z) = ⟨x · z⟩

• The simple polynomial kernel:

  K(x, z) = (1 + ⟨x · z⟩)^d

  where d is user defined.

• Vovk's real polynomial:

  K(x, z) = \frac{1 − ⟨x · z⟩^d}{1 − ⟨x · z⟩}

  where d is user defined and where −1 < ⟨x · z⟩ < 1.

• Vovk's real infinite polynomial:

  K(x, z) = \frac{1}{1 − ⟨x · z⟩}

  where −1 < ⟨x · z⟩ < 1.

• Radial basis function (RBF):

  K(x, z) = exp(−γ ‖x − z‖^2)

  where γ is user defined.

47

March 6, 2003

• Two-layer neural network (the sigmoid kernel):

  K(x, z) = \frac{1}{1 + \exp(v ⟨x · z⟩) − c}

  where v and c must satisfy Mercer's condition (for example, if |x| = 1 and |z| = 1 then c ≥ v).

• Linear splines with an infinite number of points. For the one-dimensional case:

  K_k(x, z) = 1 + xz + xz\,\min(x, z) − \frac{x + z}{2}(\min(x, z))^2 + \frac{(\min(x, z))^3}{3}

  and for the multi-dimensional case

  K(x, z) = \prod_{k=1}^{n} K_k(x_k, z_k)

• Full polynomial kernel:

  K(x, z) = \Big( \frac{⟨x · z⟩}{a} + b \Big)^d

  where a, b and d are user defined.

• Regularized Fourier (weaker mode regularization):

48

March 6, 2003

  For the one-dimensional case:

  K(x, z) = \frac{π \cosh\frac{π − |x − z|}{γ}}{\sinh\frac{π}{γ}}

  where 0 ≤ |x − z| ≤ 2π and γ is user defined. For the multi-dimensional case

  K(x, z) = \prod_{k=1}^{n} K_k(x_k, z_k)

• Semi-local kernel:

  K(x, z) = \big[ ⟨x · z⟩ + 1 \big]^d \exp\big( −‖x − z‖^2 σ^2 \big)

  where d and σ are user defined and weight between global and local approximation.

• Regularized Fourier (stronger mode regularization):

  K(x, z) = \frac{1 − γ^2}{2(1 − 2γ\cos(x − z) + γ^2)}

  where 0 ≤ |x − z| ≤ 2π and γ is user defined. For the multi-dimensional case

  K(x, z) = \prod_{k=1}^{n} K_k(x_k, z_k)

49

March 6, 2003

• Anova 1:

  K(x, z) = \Big( \sum_{k=1}^{n} \exp(−γ (x_k − z_k)^2) \Big)^d

  where the degree d and γ are user defined.
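
As an illustration (ours, not necessarily the kernel.m used in the OCR demo), a few of the kernels above can be collected into a single MATLAB function that builds the Gram matrix from the rows of two data matrices X and Z:

function K = kernelmatrix(type, X, Z, p1)
%KERNELMATRIX Gram matrix K(i,j) = K(X(i,:), Z(j,:)) for a few of the kernels above
switch lower(type)
  case 'linear'                           % the simple dot product
    K = X*Z';
  case 'poly'                             % the simple polynomial kernel, p1 = degree d
    K = (1 + X*Z').^p1;
  case 'rbf'                              % radial basis function, p1 = gamma
    D = sum(X.^2,2)*ones(1,size(Z,1)) + ones(size(X,1),1)*sum(Z.^2,2)' - 2*X*Z';
    K = exp(-p1*D);                       % exp(-gamma*||x - z||^2)
  otherwise
    error('unknown kernel type');
end

For example, kernelmatrix('poly', X, X, 2) reproduces (1 + X*X').^2.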

50

March 6, 2003

Two layered neural network

The two-layer neural network, or sigmoid kernel, deserves some further remarks. Usually the two-layer neural network architecture is drawn as follows:

[missing figure drawn in class]

• The number of support vectors corresponds to the number of hidden nodes,
• the weights of the hidden layer are the support vectors (x),
• the weights of the output layer are given by α.

51

March 6, 2003

Properties of kernels

Assume that a kernel K(x, z) is defined as an inner product ⟨x · z⟩ where x, z ∈ R^n. If we construct from such a kernel, and from a set of ℓ data vectors x_i, i = 1, \ldots, ℓ, the ℓ × ℓ Gram matrix G, where G_{ij} = K(x_i, x_j), then this matrix has the following properties:

1. G is symmetric, i.e. G = G'. This follows from the fact that ⟨x · z⟩ = ⟨z · x⟩.

2. G is positive (semi) definite, i.e. for any vector α ∈ R^ℓ it holds that α'Gα ≥ 0, or equivalently that \sum_{i=1}^{ℓ}\sum_{j=1}^{ℓ} G_{ij} α_i α_j ≥ 0. This follows from the fact that G = XX' where X = [x_1 \ldots x_ℓ]'. If we set w = X'α then α'Gα = α'XX'α = w'w = ‖w‖_2^2 ≥ 0. (If α'Gα > 0 for any non-zero α ∈ R^ℓ we say that G is positive definite. Here, as in the book, we shall say that G is positive definite even if α'Gα = 0 for some non-zero α ∈ R^ℓ.)

52

March 6, 2003

When a kernel K(x, z) is not defined explicitly as an inner product, the two properties above must be preserved when we construct the Gram matrix from the kernel. Note that in this case the space of data vectors, D, is not necessarily R^n for any n.

The first property is preserved if K(x, z) = K(z, x) for all x, z ∈ D. In this case we say that the kernel K is symmetric.

The second property is preserved if

\int_D \int_D K(x, z) f(x) f(z)\, dx\, dz ≥ 0

where f is any function defined on the data space with properties that make the integral mathematically meaningful. In this case we say that the kernel K is (semi) positive (cf. Theorem 3.6, p. 35).

A kernel satisfying these two conditions is called a Mercer kernel.

Note that if we choose f as a weighted sum of delta functions at x_1, x_2, \ldots, x_ℓ ∈ D with weights α_1, α_2, \ldots, α_ℓ, then the double integral reduces to the double sum:

\sum_{i=1}^{ℓ} \sum_{j=1}^{ℓ} K(x_i, x_j)\, α_i α_j

53

March 6, 2003

It is less straightforward to show that a kernel is positive than that it is symmetric. We draw attention to some necessary conditions that follow from the requirement that any Gram matrix G s.t. G_{ij} = K(x_i, x_j), i, j = 1, \ldots, ℓ, must be positive definite, i.e. that α'Gα ≥ 0 for any vector α ∈ R^ℓ.

1. K(x_i, x_i) ≥ 0 for any data vector x_i ∈ D. This follows by choosing α s.t.
   α_k = 1 if k = i, and 0 otherwise, for k = 1, \ldots, ℓ.

2. \tfrac{1}{2}\big( K(x_i, x_i) + K(x_j, x_j) \big) ≥ |K(x_i, x_j)| for any x_i, x_j ∈ D. This follows by choosing α s.t.
   α_k = 1 if k = i, −1 if k = j, and 0 otherwise, for k = 1, \ldots, ℓ
   (and, for the other sign of K(x_i, x_j), choosing α_j = +1 instead).

3. If λ is an eigenvalue of G then λ ≥ 0. Note that if λ is an eigenvalue of G and v is the corresponding eigenvector then Gv = λv. Furthermore, since G is symmetric, λ and v will be real valued. The result follows by choosing α = v, because then v'Gv = v'λv = λ‖v‖_2^2 ≥ 0, so λ ≥ 0 since ‖v‖_2 > 0.
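
These conditions are easy to verify numerically for a given Gram matrix; a small MATLAB check (ours):

issym  = (norm(G - G','fro') < 1e-12);   % property 1: G is symmetric
lambda = eig((G + G')/2);                % eigenvalues of the (symmetrized) Gram matrix
ispsd  = all(lambda > -1e-10);           % all eigenvalues should be (numerically) >= 0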

54

March 6, 2003

4. \big( K(x_i, x_i)\, K(x_j, x_j) \big)^{1/2} ≥ |K(x_i, x_j)| for any x_i, x_j ∈ D. This follows from the fact that if α'Gα ≥ 0 for all α ∈ R^ℓ then in particular

   [α_i\ α_j] \begin{bmatrix} K(x_i, x_i) & K(x_i, x_j) \\ K(x_i, x_j) & K(x_j, x_j) \end{bmatrix} \begin{bmatrix} α_i \\ α_j \end{bmatrix} ≥ 0

   for all [α_i\ α_j]' ∈ R^2. Thus, by result 3, both eigenvalues of the 2 × 2 matrix must be ≥ 0, and hence so must its determinant K(x_i, x_i) K(x_j, x_j) − K(x_i, x_j)^2, since it is the product of the eigenvalues.

Remark: Note that on p. 32 we defined the distance between two data vectors x and z as

\big( K(x, x) + K(z, z) − 2K(x, z) \big)^{1/2}

Property 2 above guarantees that the number in the brackets will be ≥ 0 and the distance is thus well defined.

55

March 6, 2003

Features from kernels

Suppose we have constructed an ℓ × ℓ Gram matrix G from a symmetric and positive kernel K(x, z), s.t. G_{ij} = K(x_i, x_j), i, j = 1, \ldots, ℓ, and we are now interested in defining feature vectors φ(x_i), i = 1, \ldots, ℓ, such that

K(x_i, x_j) = ⟨φ(x_i) · φ(x_j)⟩

Since G is symmetric and positive definite it has ℓ real non-negative eigenvalues λ_1, λ_2, \ldots, λ_ℓ and corresponding eigenvectors v_1, \ldots, v_ℓ. Moreover, these eigenvectors are mutually orthogonal, i.e. ⟨v_i · v_j⟩ = 0 if i ≠ j, and they can be normalized s.t. ⟨v_i · v_i⟩ = 1 for i = 1, \ldots, ℓ.

It follows that if V is the ℓ × ℓ matrix [v_1\ v_2\ \ldots\ v_ℓ] then

GV = V \,\mathrm{diag}(λ_1, \ldots, λ_ℓ) = VΛ

and hence G = VΛV'. Hence, if we set \hat{V} = V\sqrt{Λ}, i.e. multiply the first column of V by \sqrt{λ_1}, the second by \sqrt{λ_2} and so on, then

G = \hat{V}\hat{V}'

Thus we can define φ(x_i)' to be the i-th row of \hat{V}, i = 1, \ldots, ℓ.

56

March 6, 2003

Note that the MATLAB function

[V, Lambda] = eig(G);

gives us the matrix of eigenvectors, V , and the diagonal matrix

of eigenvalues, Λ. Thus the matrix whose rows are feature

vectors corresponding to G will be

Vhat = V*sqrt(Lambda);
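
A quick numerical check (ours) that the construction reproduces the Gram matrix:

norm(G - Vhat*Vhat','fro')   % should be (numerically) zero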

57

March 6, 2003

Making kernels from kernels

Let K_1 and K_2 be kernels over X × X, X ⊆ R^n, a ∈ R^+, f(·) a real-valued function on X, φ : X → R^m, K_3 a kernel over R^m × R^m, and B a symmetric positive (semi) definite n × n matrix. Then the following functions are kernels:

• K(x, z) = K_1(x, z) + K_2(x, z)
• K(x, z) = a + K_1(x, z)
• K(x, z) = a K_1(x, z)
• K(x, z) = K_1(x, z) K_2(x, z)
• K(x, z) = f(x) f(z)
• K(x, z) = exp(K_1(x, z))
• K(x, z) = p(K_1(x, z))
• K(x, z) = K_3(φ(x), φ(z))
• K(x, z) = x'Bz

where p(x) is a polynomial with positive coefficients.
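
At the level of Gram matrices these closure rules become simple elementwise operations; a sketch (ours), where G1 and G2 are Gram matrices computed from K1 and K2 on the same data set X (one example per row):

Gsum  = G1 + G2;       % K1 + K2
Gshft = a + G1;        % a + K1
Gscl  = a*G1;          % a*K1
Gprod = G1.*G2;        % K1*K2 (elementwise/Schur product)
Gexp  = exp(G1);       % exp(K1) (elementwise)
Gquad = X*B*X';        % the kernel x'Bz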

58

March 6, 2003

Feature selection revisited

The task of feature extraction is highly problem-dependent and

for different domains the features may be completely different.

In many real world problems feature extraction can only be

partially automated, and the knowledge of human experts

remains indispensable.

In real world applications the number of features is often large.

The goal of feature selection is to reduce this dimensionality by

selecting only a subset of features.

There are in principle two approaches to finding a good feature subset:

• search the feature space using the learning algorithm's estimated accuracy as the search criterion (e.g. cross validation), or

• use a criterion which is "independent" of the learning algorithm (e.g. compute the correlation between each feature and the target).

59

March 6, 2003

Extracting regularities from attributes

If some regularities in the input attributes are known, then they

can be exploited by creating so called locality-improved kernels.

For example:

• In real-world images, correlations over short distances are much more reliable features than long-range correlations.

• Genomic sequences contain untranslated regions and CDS regions which encode proteins. An important task is therefore to recognize translation initiation sites (TIS). We know that a TIS usually starts with an ATG codon, and so a local correlation using a window around ATG has proven very successful when designing locality-improved kernels for the DNA start codon recognition task.

In general the locality-improved kernel may be written as follows:

K(x, z) = \Big( \sum_{\text{localities}} \Big( \sum_{i ∈ \text{locality}} f(x_i, z_i) \Big)^{d_1} \Big)^{d_2}

using a monomial-type kernel. Other variations are also possible. The function f is a simple multiplication for the image classification problem and a matching operation for the TIS recognition problem.
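
A rough MATLAB sketch (ours) of this construction for two equal-length sequences, taking f to be the matching operation and one locality per sliding window of length win (all parameter names are our own):

function k = locality_kernel(x, z, win, d1, d2)
%LOCALITY_KERNEL locality-improved kernel for two equal-length sequences
n = length(x);
k = 0;
for p = 1:(n - win + 1),                      % one locality per window position
  match = sum(x(p:p+win-1) == z(p:p+win-1));  % f summed over the window (matching operation)
  k = k + match^d1;                           % inner (local) monomial of degree d1
end
k = k^d2;                                     % outer monomial of degree d2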

60

March 6, 2003

Reproducing Kernel Hilbert Spaces

Let, as before, X ⊆ R^n denote an input space of possible n-dimensional input vectors x = (x_1, \ldots, x_n), let φ denote the mapping that maps each x ∈ X into an N-dimensional feature vector φ(x) = (φ_1(x), \ldots, φ_N(x)), and let K(x, y) denote a (Mercer) kernel function defined on X × X such that K(x, y) = ⟨φ(x) · φ(y)⟩, the inner product of the feature vectors φ(x) and φ(y) in the feature space F = {φ(x) | x ∈ X} ⊆ R^N.

Rather than associating with a given input vector x ∈ X its feature vector φ(x), it is sometimes instructive to associate with it the function \sum_{i=1}^{N} φ_i(x) φ_i(·) = K(x, ·) defined on X (note that here x is a fixed vector while the second argument is the variable), which contains information on the "relationship" between x and all other input vectors in X.

61

March 6, 2003

As an example we show in the figure below such functions when

n = 1, X = [−1, 1], x = 0.2 and

1. K_1(0.2, x) = 0.2·x (ordinary inner product)
2. K_2(0.2, x) = 1 + K_1(0.2, x)
3. K_3(0.2, x) = (1 + K_1(0.2, x))^2
4. K_4(0.2, x) = exp(−(x − 0.2)^2)
5. K_5(0.2, x) = exp(−(x − 0.2)^2/0.2^2)

[Figure: the five kernel functions K_1(0.2, x), ..., K_5(0.2, x) plotted as functions of x over [−1, 1]; legend: 0.2*x, 1+0.2*x, (1+0.2*x)^2, exp(−(0.2−x)^2), exp(−(0.2−x)^2/0.2^2).]
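
The figure can be reproduced with a few lines of MATLAB (our sketch):

x  = -1:0.01:1;                   % the input space X = [-1,1]
K1 = 0.2*x;                       % ordinary inner product
K2 = 1 + K1;
K3 = (1 + K1).^2;
K4 = exp(-(x - 0.2).^2);
K5 = exp(-(x - 0.2).^2/0.2^2);
plot(x, K1, x, K2, x, K3, x, K4, x, K5);
xlabel('x'); ylabel('kernels');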

62

March 6, 2003

For ℓ training vectors x_1, x_2, \ldots, x_ℓ ∈ X the binary classification problem becomes, in this context, the problem of finding ℓ real numbers α_1, α_2, \ldots, α_ℓ and b such that if we form the function

f(x) = \sum_{i=1}^{ℓ} α_i K(x_i, x)

then f(x_i) + b > 0 if x_i is in the + class and f(x_i) + b < 0 if x_i is in the − class, where b is a possible offset value. This follows directly from the dual formulation of this problem.

Example: It becomes clear in this context that if we take the kernel to be K(x, y) = exp(−‖x − y‖^2/σ^2) and choose σ sufficiently small, we can solve the binary classification problem for the training set simply by setting α_i = 1 if x_i is in the + class, α_i = −1 if x_i is in the − class, and b = 0.
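
A tiny numerical illustration of this observation (ours), using the data set S from the next slide:

X = [-1 -0.4 0.2 0.6 1]';  y = [-1 -1 1 1 1]';        % the training set S
sigma = 0.1;  alpha = y;  b = 0;                      % small sigma, alpha_i = y_i
G = exp(-(X*ones(1,5) - ones(5,1)*X').^2/sigma^2);    % Gram matrix K(x_i, x_j)
sign(G*alpha + b)                                     % reproduces the labels y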

63

March 6, 2003

Example: binary classification

In this example:

S = ((−1, −1), (−0.4, −1), (0.2, +1), (0.6, +1), (1, +1))

The different classifiers were found using the perceptron algorithm² and it turns out that b = 0 for all of them.

The different kernels are defined on page 62 and the line styles used for plotting are the same as before.

[Figure: the classifier outputs \sum_{i=1}^{ℓ} y_i α_i K(x_i, x) plotted as functions of x over [−1, 1] for each of the five kernels.]

² The classification problem was solved using the dual perceptron with the kernel representation; the data was presented to the learner in the order shown.

64

March 6, 2003

Example: regression

Here the same kernels are used for a regression problem, which is

solved using dual ridge-regression.

[Figure: the fitted regression functions \sum_{i=1}^{ℓ} α_i K(x_i, x) plotted as functions of x over [−1, 1] for each of the five kernels.]

These examples illustrate how kernels:

• are a similarity measure for the data,
• a function space for learning, and
• a covariance function for correlated observations.

65

March 6, 2003

Similarly, the regression problem becomes that of finding ℓ real numbers α_1, α_2, \ldots, α_ℓ such that f(x_i) ≈ y_i for i = 1, 2, \ldots, ℓ, where y_i, i = 1, 2, \ldots, ℓ, are the given output values.

In this context we also consider the space of all functions

a(x) = \sum_{i=1}^{N} a_i φ_i(x)

where a_i ∈ R, and denote it by H. For any two functions a(x) = \sum_{i=1}^{N} a_i φ_i(x) and b(x) = \sum_{i=1}^{N} b_i φ_i(x) we can define their inner product as the value

\sum_{i=1}^{N} a_i b_i

Thus H becomes an inner product space, and in fact a Hilbert space (since it will also be complete). We denote this inner product by ⟨a(x) · b(x)⟩_H and define the corresponding norm:

‖a(x)‖_H = ⟨a(x) · a(x)⟩_H^{1/2} = \Big( \sum_{i=1}^{N} a_i^2 \Big)^{1/2}

66

March 6, 2003

Note that according to these definitions we have:

1. ⟨K(x_i, x) · K(x_j, x)⟩_H = \big⟨ \sum_{k=1}^{N} φ_k(x_i)φ_k(x) · \sum_{k=1}^{N} φ_k(x_j)φ_k(x) \big⟩_H = \sum_{k=1}^{N} φ_k(x_i)φ_k(x_j) = K(x_i, x_j)

2. If f(x) = \sum_{i=1}^{ℓ} α_i K(x_i, x) and g(x) = \sum_{i=1}^{ℓ} β_i K(x_i, x) then

   ⟨f(x) · g(x)⟩_H = \sum_{i=1}^{ℓ}\sum_{j=1}^{ℓ} α_i β_j K(x_i, x_j) = α'Gβ

   where α = (α_1, \ldots, α_ℓ)', β = (β_1, \ldots, β_ℓ)' and G is an ℓ × ℓ Gram (kernel) matrix whose (i, j)-th element is the kernel K(x_i, x_j). In particular ‖f(x)‖_H^2 = α'Gα.

3. ⟨a(x) · K(y, x)⟩_H = \big⟨ \sum_{i=1}^{N} a_i φ_i(x) · \sum_{i=1}^{N} φ_i(y)φ_i(x) \big⟩_H = \sum_{i=1}^{N} a_i φ_i(y) = a(y)

A kernel with this last property is called a reproducing kernel, and H is called a reproducing kernel Hilbert space (RKHS).

The above definitions agree with those given on pp. 38–40 in the book, except that in the book N = ∞ and the book further considers a more general case where positive weight factors μ_i are introduced, which above simply take the value 1.

67

March 6, 2003

Principal component analysis (PCA)

Let Φ denote the data matrix in feature space for ℓ data vectors x_1, x_2, \ldots, x_ℓ, i.e. the i-th row of Φ is φ(x_i)'.

Let \bar{Φ} denote the corresponding matrix of centered data, where we have subtracted from the j-th component of φ(x_i), i.e. φ_j(x_i), the average value of that component over the dataset, so that the (i, j)-th element of \bar{Φ} is

φ_j(x_i) − \frac{1}{ℓ}\sum_{k=1}^{ℓ} φ_j(x_k)

In matrix notation we have that

\bar{Φ} = Φ − \frac{1}{ℓ} e e' Φ

where e denotes a column vector of ℓ ones. This implies that

\bar{Φ}\bar{Φ}' = ΦΦ' − \frac{1}{ℓ} e e' ΦΦ' − ΦΦ' \frac{1}{ℓ} e e' + \frac{1}{ℓ} e e' ΦΦ' \frac{1}{ℓ} e e'

which shows how we can obtain the Gram matrix for the features in centered form from the Gram matrix with the data in its original form.

68

March 6, 2003

We are now interested in extracting from the centered feature vectors a new feature which is a linear combination of the original features, with the additional property that the variance of the new feature values is as large as possible. In this way the new feature incorporates as much information as possible from the old features. We refer to it as a principal component of the data. Thus we are interested in finding a weight vector w, s.t. ‖w‖_2 = 1, such that

\frac{1}{ℓ}⟨\bar{Φ}w · \bar{Φ}w⟩ = w'\,\tfrac{1}{ℓ}\bar{Φ}'\bar{Φ}\,w

is maximized³. Note that \tfrac{1}{ℓ}\bar{Φ}'\bar{Φ} is the covariance matrix of the original (centered) features.

The solution to the above problem is well known: we choose w as the eigenvector of the covariance matrix that corresponds to the largest eigenvalue λ_max and normalize it so that ‖w‖_2 = 1. Then w'\,\tfrac{1}{ℓ}\bar{Φ}'\bar{Φ}\,w = w'λ_max w = λ_max, i.e. the variance of the new feature will be λ_max, and this is the largest possible variance.

³ Compare this with the classification problem, where we are interested in finding a weight vector s.t. ‖w‖_2 = 1 and such that ⟨x_i · w⟩ > 0 for all data vectors x_i in the + class and ⟨x_i · w⟩ < 0 for all data vectors x_i in the − class. In that problem we also included an offset b, but if we work with centered data the offset value will in many cases become 0. Also compare it with the regression problem, where we are interested in finding the weight vector s.t. ⟨x_i · w⟩ ≈ y_i for some output values y_i, i = 1, \ldots, ℓ.

69

March 6, 2003

Our main problem is, however, that in many cases we do not know the new feature vectors explicitly and thus cannot compute their average values nor the covariance matrix. We do, however, know the Gram matrix ΦΦ', whose (i, j)-th element is the kernel K(x_i, x_j), and can therefore also calculate the Gram matrix of the centered feature values as shown above.

We also have the following result:

• If λ is a non-zero eigenvalue of the matrix \bar{Φ}\bar{Φ}' with eigenvector α, then λ is also a non-zero eigenvalue of the matrix \bar{Φ}'\bar{Φ} with eigenvector w = \bar{Φ}'α, since

  \bar{Φ}\bar{Φ}'α = λα ⇒ \bar{Φ}'\bar{Φ}(\bar{Φ}'α) = λ(\bar{Φ}'α).

• Conversely, if λ is a non-zero eigenvalue of the matrix \bar{Φ}'\bar{Φ} with eigenvector w, then λ will also be a non-zero eigenvalue of the matrix \bar{Φ}\bar{Φ}' with eigenvector α = \bar{Φ}w.

• Note that the matrices \bar{Φ}'\bar{Φ} and \bar{Φ}\bar{Φ}' will in general be of different size, so that one will have more eigenvalues than the other. The result above implies that all such additional eigenvalues will be 0.

eigenvalues will be 0.

70

March 6, 2003

We can now calculate the largest eigenvalue λ_max of the covariance matrix \tfrac{1}{ℓ}\bar{Φ}'\bar{Φ} by calculating the eigenvalues of \tfrac{1}{ℓ}\bar{G}, where \bar{G} denotes the centered Gram matrix \bar{Φ}\bar{Φ}'. Let α_max denote the corresponding eigenvector. Since we want to normalize w s.t. ⟨w · w⟩ = 1 and w = \bar{Φ}'α, this means that we want to normalize α s.t.

α'\bar{Φ}\bar{Φ}'α = α'\,ℓλ_max\,α = ℓλ_max ⟨α · α⟩ = 1

i.e. such that ‖α‖_2 = 1/\sqrt{ℓλ_max}.

We can, however, in general not calculate w from the expression w = \bar{Φ}'α, since we do not know the feature vectors explicitly. This is analogous to the situation in kernel classification and regression, and we can deal with it in the same way, i.e. for a new x we calculate the corresponding feature value as

\sum_{i=1}^{ℓ} \bar{K}(x_i, x)\,α_i

where α_i is the i-th component of the eigenvector α after it has been normalized (as shown above) and \bar{K}(x_i, x) is the kernel value corresponding to centered feature values.

In particular, the vector of principal component feature values

71

March 6, 2003

of our data will be:

\bar{G}α = ℓλ_max\,α

(since α is an eigenvector of \bar{G} and ℓλ_max is the corresponding eigenvalue), i.e. the eigenvector normalized so that its length is

ℓλ_max \cdot \frac{1}{\sqrt{ℓλ_max}} = \sqrt{ℓλ_max}

Note that this corresponds to the results on page 56.

It is, however, not clear how to proceed if we want to calculate the principal component for a new data vector. Also note that the eigenvectors of the Gram matrix G will in general not be sparse, i.e. most of the α_i values will be non-zero. To obtain a sparse α-vector we may have to modify our formulation in some way. Such modified principal components are called sparse principal components.

We do not necessarily restrict our attention only to the extracted feature corresponding to the largest variance. Another feature may be extracted from the eigenvector of \tfrac{1}{ℓ}\bar{G} corresponding to the second largest eigenvalue. This second feature is uncorrelated with the first feature and so incorporates information ignored by the first feature. We call this second feature a second principal component. Similarly, we can extract a third principal component and so on.

72

March 6, 2003

Note that the total variance of the original data, after it has been transformed into feature space, is the sum of the diagonal entries of the covariance matrix \tfrac{1}{ℓ}\bar{Φ}'\bar{Φ}. We cannot, however, calculate this quantity directly if the features are not explicitly known. But it can be shown that this sum is the same as the sum of the diagonal entries of the matrix \tfrac{1}{ℓ}\bar{G}, which can be calculated from the kernels and is also equal to the sum of the eigenvalues of this matrix. Thus, by comparing the sum of, say, the largest p eigenvalues with the total variance, we can see how much of the information in the original data is incorporated into the first p principal components in feature space.

The following MATLAB function performs the calculations described so far:

function [P,Pvar,totvar] = principalfeatures(G,p)
%PRINCIPALFEATURES kernel PCA from a Gram matrix
% usage: [P,Pvar,totvar] = principalfeatures(G,p);
%   P      the principal components (one column per component)
%   Pvar   the corresponding eigenvalues and totvar the total variance
%   G      the Gram matrix and p the number of principal components
ell  = size(G,1);
% center the Gram matrix: Gbar = G - (1/ell)ee'G - G(1/ell)ee' + (1/ell)ee'G(1/ell)ee'
Gbar = ((1/ell)*ones(ell,1))*(ones(1,ell)*G);
Gbar = G - Gbar - ((G - Gbar)*ones(ell,1))*((1/ell)*ones(1,ell));
[V,Lambda] = eigs((1/ell)*Gbar,p);
P      = V*sqrt(ell*Lambda);
Pvar   = diag(Lambda);
totvar = (1/ell)*sum(diag(Gbar));
% make sure that the largest eigenvalues come first!
[dummy,I] = sort(-Pvar);
Pvar = Pvar(I);
P    = P(:,I);

73

March 6, 2003

DEMO - PCA for USPS digits

We now illustrate the technique using the USPS handwritten digits from the previous demo.

>> load usps_digits_testing

>> whos

Name Size Bytes Class

Xt 2007x256 4110336 double array

yt 2007x1 16056 double array

Grand total is 515799 elements using 4126392 bytes

>> G = (Xt*Xt’).^2;

% Do the PCA

>> [P,Pvar,totvar] = principalfeatures(G,15);

% find index to images with largest feature value for feature 1

>> [dummy,I] = sort(-P(:,1));

% plot the first 50 of them

>> plotdigits(Xt(I(1:50),:));

[Figure: the 50 digit images with the largest value of the first principal-component feature.]

74

March 6, 2003

What if we instead take every 40th image, and so show a sample of all images sorted by the first principal-component feature value:

[Figure: every 40th image, in decreasing order of the first principal-component feature value.]

Notice that at the other end of the scale we have only zeros! Let's try another kernel (monomial order 5):

>> G = (Xt*Xt’).^5;

>> [P,Pvar,totvar] = principalfeatures(G,15);

>> [dummy,I] = sort(-P(:,1));

>> plotdigits(Xt(I(1:40:end),:));

This time the zeros are first:

[Figure: every 40th image, sorted by the first principal-component feature value for the order-5 kernel.]

75

March 6, 2003

Finally, let us plot the images with the most extreme feature values for all 15 principal components extracted using an order-2 monomial kernel:

>> G = (Xt*Xt’).^2;

>> [P,Pvar,totvar] = principalfeatures(G,15);

>> Pvar

Pvar =

1.0e+03 *

4.8461

2.4100

1.6495

1.2537

1.1182

1.0137

0.8469

0.7460

0.6618

0.6380

0.5479

0.4504

0.4491

0.4129

0.3808

>> for i=1:15, [dummy,I] = sort(P(:,i)); Iext(i,:) = [I(1) I(end)]; end

>> figure(1); plotdigits(Xt(Iext(:,1),:),yt(Iext(:,1)),(1:15)’);

>> figure(2); plotdigits(Xt(Iext(:,2),:),yt(Iext(:,2)),-(1:15)’);

[Figure 1: for each of the 15 principal components, the image at one extreme of the feature value; labels 0(1) 9(2) 0(3) 8(4) 3(5) 0(6) 4(7) 2(8) 6(9) 5(10) 1(11) 2(12) 3(13) 0(14) 3(15).]

[Figure 2: the image at the opposite extreme of each component; labels 1(−1) 6(−2) 0(−3) 2(−4) 6(−5) 2(−6) 7(−7) 9(−8) 9(−9) 5(−10) 6(−11) 3(−12) 7(−13) 4(−14) 5(−15).]

Notice that almost all digits are represented among the first 15 principal components.

76

March 6, 2003

PCA - unsupervised learning

In general PCA may be regarded as an unsupervised learning

technique.

Classroom discussion on Kernel PCA in kernel design.

77

March 6, 2003

Designing kernels

Designing a kernel corresponds to choosing a:

• similarity measure for the data,
• linear representation of the data,
• function space for learning,
• covariance function for correlated observations.

In general, the choice of kernel reflects the designer’s knowledge

about the problem and its solution.

There is no free lunch in kernel choice!

78