Training Multi-class
Support Vector Machines
Dissertation
submitted in fulfilment of the requirements
for the degree of Doktor-Ingenieur
of the Faculty of Electrical Engineering
and Information Technology
of Ruhr-Universität Bochum

submitted by
Urun Dogan
Institut für Neuroinformatik
Ruhr-Universität Bochum
Bochum
January 2011
ACKNOWLEDGEMENT
This dissertation would not have been possible without the guidance and help of
several individuals who, in one way or another, contributed and extended their
valuable assistance to me in the preparation and completion of this study.
I, therefore, extend thanks to the following:
First and foremost, my utmost gratitude to Prof. Dr. Christian Igel whose sin-
cerity and encouragement I will never forget. Prof. Dr. Igel has been my inspiration
as I hurdled the obstacles that presented themselves during the completion of this
research work.
Dr. Tobias Glasmachers, with whom I discussed mathematical issues and who
introduced and explained to me many mathematical and statistical concepts. With-
out Dr. Glasmachers' unlimited patience I would not have understood abstract
concepts of mathematics and statistics.
Prof. Dr. Ioannis Iossifidis, who encouraged me during my PhD studies whenever
I was struggling with a problem. As an exemplary group leader, Prof. Dr. Ios-
sifidis solved every administrative problem and created a great research
environment for me.
David William Eric Clark reviewed my thesis. He not only improved the lan-
guage of my previous manuscripts but also restructured many parts of my thesis.
David William Eric Clark spent all his summer holidays with me in order to im-
prove my manuscript. Using his extensive engineering knowledge and experience,
he helped me to explain abstract concepts of machine learning in plain English.
Further still, the discussions with him about abstract machine learning techniques
deepened my understanding of machine learning.
Mathias Tuma also reviewed my thesis. As a machine learning expert, he
gave me many valuable comments and also helped me restructure my thesis.
I learned a great deal from him, and it was a great opportunity to work with him.
Verena Heidrich-Meisner, my former office mate, discussed many scientific ideas
with me. She always shed light on the issues we were discussing and showed me
different perspectives on them.
Last but not least, my family and M. Kemal Atatürk, the founder of modern
Turkey. Without M. Kemal Atatürk's achievements and without the continuous
support and encouragement of my family I would not have reached this point in my
educational journey, which has, to date, taken 25 years.
Contents

1 Introduction
1.1 Contributions
1.2 Thesis Structure
2 Statistical Learning Theory
2.1 Supervised Learning
2.2 Consistent Classifier
2.3 Classifiers
2.3.1 Nearest Neighbour Classifier
2.3.2 Perceptron
2.3.3 Support Vector Machines (SVMs)
3 Multi-class Support Vector Machines
3.1 Sequential Multi-Class SVMs
3.1.1 Multi-Class Classification with Maximum Margin Regression (MC-MMR)
3.1.2 One versus All (OVA)
3.2 All-in-One Multi-class Machines
3.2.1 The Weston and Watkins Method (WW)
3.2.2 The Crammer and Singer Method
3.2.3 Lee, Lin, & Wahba SVM
4 Unified View to All-in-One Multi-class Machines
4.1 Novel Approach to Multi-Class SVM Classification
5 Solvers
5.1 Related Work
5.1.1 Interior Point Methods
5.1.2 Direct Optimization of Primal Problem
5.1.3 On-line Methods
5.1.4 Cutting Plane Approaches
5.1.5 Stochastic Gradient Descent
5.1.6 Decomposition Algorithms
5.1.7 General Comments on SVM Solvers
5.2 Decomposition Algorithms for Multi-Class SVMs
5.2.1 Dropping the Bias Parameters
5.2.2 Working Set Sizes for Decomposition Algorithms
5.2.3 On Working Variable Selection
5.2.4 Maximum Violating Pair Method
5.2.5 Second Order Working Variable Selection for SMO
5.2.6 Second Order Working Pair Selection for S2DO
5.2.7 Solving the Crammer and Singer Multi-class SVM Using SMO
5.2.8 Efficient Caching for All-in-one Machines
6 Conceptual and Theoretical Analysis of Multi-class SVMs
6.1 Margins in Multi-Class SVMs
6.2 Margins in Multi-Class Maximum Margin Regression
6.3 Margins in the One Versus All Classifier
6.4 Margin Normalization for Multi-Class Machines
6.5 Generalization Analysis
6.6 Universal Consistency of Multi-Class SVM
6.7 Training Complexity
7 Empirical Comparison and Applications
7.1 Preliminaries for Empirical Evaluation
7.1.1 Model selection
7.1.2 Stopping Conditions
7.1.3 Statistical Evaluation
7.2 Multi-class Benchmark Problems
7.2.1 Summary of Results
7.3 Traffic Sign Recognition
7.3.1 Related work
7.3.2 Features
7.3.3 Benchmark Data
7.3.4 Experiments and Results
7.3.5 Setup
7.3.6 Results
7.3.7 Summary of Results
7.4 Multi-class Problems in Bioinformatics
7.4.1 Cancer Classification and Diagnosis with Microarray Gene Expression
7.4.2 Summary of Results
7.4.3 Protein Secondary Structure Prediction
7.4.4 Summary of Results
8 Conclusions
Chapter 1
Introduction
Making observations and collecting data about natural or man-made phenomena
lies at the heart of all science and knowledge generation. As the amount of data
collected and the complexity of the generating processes grow, so does the need
for sophisticated data analysis techniques. Today, a wealth of data exploration
methods is at the modern scientist's disposal. Over recent decades, adaptive
learning systems [17, 158], or machine learning methods, have emerged. Rather
than being fixed or rigid in the sense of acting in exactly the same, predefined
manner on different data, they adapt to properties of the data they encounter.
One special domain of machine learning is the supervised learning setting. Here the
learner’s task is to derive as hypothesis a map from an input space to an output space
while relying on exemplary observations, i.e., sample pairs of inputs and outputs.
This hypothesis should, to the best extent possible, also hold when evaluated on
additional input-output pairs stemming from the same underlying distribution as
the training data set.
Supervised learning problems for which the output space is a finite set are re-
ferred to as classification tasks. If the set has cardinality two, one speaks of binary
classification, and of multi-class classification otherwise. In real-world applications,
many classification tasks naturally are multi-class problems, such as object recog-
nition, traffic sign recognition, and protein secondary structure prediction. At the
same time, several pattern recognition algorithms, like e.g. support vector machines,
have originally been designed for binary problems and are less easily applied in the
multi-class setting. This thesis focuses on multi-class classification using support
vector machine classifiers.
Support vector machines (SVMs, [22, 38]) are the state of the art for binary clas-
sification. They are founded on the intuitive geometric concept of large margin
separation. They are also well understood due to their roots in theories of repro-
ducing kernel Hilbert spaces, regularized risk minimization, and statistical learning
theory. In addition, SVMs exhibit excellent performance over a wide range of ap-
plications.
Both from a geometric as well as from a learning theoretical point of view there
is no unique canonical extension of SVMs to multiple classes. Instead, several dif-
ferent formulations relying on slightly different notions of margin and margin-based
loss have evolved, most of which reduce to the standard machine in the special case
of a binary problem. Two general strategies to extend SVMs to multi-category clas-
sification can be distinguished. The first approach is to combine separately trained
SVM classifiers. The well-known one-versus-all and one-versus-one methods are
examples for this strategy [139]. Both methods are discussed in this thesis, but
only one-versus-all is explained in detail. In the second family of algorithms, a
single optimization problem considering all classes is derived and solved at once.
These all-in-one methods are usually computationally more demanding [87], but –
at least from my point of view – theoretically more elegant, and this elegance
promises better classification results.
Although no significant differences in classification accuracy between these two
approaches have been observed in some studies, other studies (e.g., [48]) show sig-
nificantly better classification performance of all-in-one SVMs compared to the one-
vs-all approach in practice. The first of these all-in-one formulations (referred to
as WW) was independently proposed by Weston & Watkins [163], Vapnik [158],
and Bredensteiner & Bennett [32]. This machine has been modified by Cram-
mer & Singer (CS) [42]. Their approach is frequently used, in particular when
dealing with structured output. In addition, I consider the conceptually different
approach by Lee, Lin, & Wahba (LLW) [112], relying on a classification calibrated
loss function [154, 114], and multi-class maximum margin regression (MMR) [152]
proposed by Szedmak et al. The latter is equivalent to a multi-class SVM suggested
by Zou et al. [168].
From a geometric point of view all of these extensions offer plausible concepts
of margin and margin violation (or, to be more precise, margin-based loss). From a
learning-theoretical perspective the situation is less clear. There have been different
attempts to generalize learning bounds for binary SVMs to the multi-class case, see
for example [81]. But I am not aware of any bounds that could demonstrate relative
advantages or disadvantages of different multi-class SVM formulations. Still, there
are a few hints from theoretical analysis, in terms of so-called classification calibra-
tion and universal consistency. The LLW machine is the only machine known to
rely on a classification calibrated loss function [154, 114] (which implies Fisher con-
sistency1), while very recently the CS machine has been shown to be a universally
consistent classifier [76].
From this background and inspired by work by Liu [114], I derive a refined unified
formulation of popular all-in-one SVMs revealing their similarities and differences.
A novel multi-class SVM is developed by using the proposed unified view. The
new classifier canonically combines the margin concept of the LLW machine with
the margin-based loss used in the CS approach. This combination, which has been
overlooked so far, is revealed by the novel multi-class method.
1 Fisher consistency does not imply universal consistency. I regard universal consistency as the more fundamental statistical property.
Another important issue is the required training time of multi-class SVMs. Long
training times limit their applicability.
In particular, the canonical extension of binary SVMs to multiple classes (referred
to as WW, [163, 32, 158]) as well as the SVM proposed by Lee, Lin, & Wahba
(LLW, [112]) are rarely used. These approaches are theoretically sound and experi-
ments indicate that they lead to well-generalizing hypotheses, but efficient training
algorithms are not available. Crammer & Singer (CS, [42]) proposed their arguably
most popular modification of the WW SVM mainly to speed-up the training pro-
cess. Still, the fast one-vs-all method [158, 133] is most frequently used when SVMs
are applied to multi-class problems for training time reasons.
Against this background, I consider batch training of multi-class SVMs with
universal (i.e., non-linear) kernels and ask the questions: Is it possible to increase the
learning speed of multi-class SVMs by using a more efficient quadratic programming
method? Do statistical properties have a practically measurable or even significant
impact on classification performance? Can an instructive generalization bound be
developed for explaining the empirical results? This thesis gives positive answers
to these questions. Efficient training algorithms for all-in-one methods are provided.
These make training of LLW machines practical and allow WW SVMs to be trained
as fast as CS's variant. Extensive experiments demonstrate the superior generalization
performance of the LLW machine. A simple generalization bound, which matches the
empirical results, is developed, indicating why the WW machine outperforms the
CS SVM.
1.1 Contributions
Before explaining technical details, the contributions of this thesis shall be stated:
• Development of a unified view on all-in-one multi-class machines (see Chap-
ter 4)
• Design of a new all-in-one multi-class machine (see Section 4.1)
• Development of a new solver (i.e., training algorithm) for multi-class machines
(see Chapter 5)
• Conceptual and geometrical analysis of different margin concepts used in
multi-class machines (see Chapter 6)
• Proof of an instructive generalization bound (see Chapter 6)
• Extensive experimental comparison of six multi-class machines (see Chapter 7)
1.2 Thesis Structure
The thesis is organized as follows. In Chapter 2, the preliminaries of statistical learn-
ing theory [158] and several classifiers including the SVM for binary classification
are explained. In Chapter 3, an overview of multi-class classification with SVMs is
given and the five previously proposed multi-class SVMs are formulated. Some of
these are then reformulated in order to develop similar solvers. The proposed unified
view on all-in-one multi-class SVMs and the new multi-class method are discussed
in Chapter 4. In Chapter 5, different methods for solving the SVM problems are
summarized and the new solver for multi-class SVMs is defined. The geometrical
and conceptual differences between the multi-class SVMs, arising from their different
margin concepts, are discussed in Chapter 6. Also, a new generalization bound
is developed in Chapter 6. Finally, I supply a detailed experimental evaluation of
all six different multi-class machines in Chapter 7.
Chapter 2
Statistical Learning Theory
This chapter gives a brief summary of the preliminaries of statistical learning theory
[158]. Further, it summarizes the basic idea of support vector machines (SVMs)
[158].
2.1 Supervised Learning
In science, it is common practice to observe a phenomenon and then deduce from
the observations a model representing the rules of nature. In supervised learning,
one similarly wants to derive a model of the relations between the inputs and the
outputs of a system of interest. In the following, some notation used throughout
this thesis will be defined.
Given any measurable space X, a set of vectors Dℓ = {x1, . . . , xℓ} such that
xi ∈ X for all i = 1, . . . , ℓ, where all xi are independently generated by the same
probability distribution, is called a set of inputs (or input vectors), and X is called
the input space. For SVMs the input space X can be any measurable set; in other
words, there is no restriction on X such as being a vector space. The values y ∈ Y
are the responses of the supervisor to the inputs and are called labels. The set Y
represents all possible responses the supervisor can give and is called the output
space. A set of pairs Sℓ = ((x1, y1), . . . , (xℓ, yℓ)) ∈ (X × Y )ℓ containing input
vectors and the corresponding responses of the supervisor is called a training set or
training data, and the cardinality of the training set (i.e., the number of training
examples) is denoted by ℓ. For simplicity and in accordance with the literature, it
will be assumed throughout this thesis that there is an underlying unknown
probability distribution Υ(x, y) on X × Y and that the training data are generated
by sampling from Υ(x, y). A single random realisation of Υ(x, y) will be denoted
by (x, y). Given the training set Sℓ, finding a function f : X → Y is called
supervised learning. If Y = R, the supervised problem is called a regression problem;
if Y is a finite discrete set, it is called a classification problem, and each distinct
element of Y is said to correspond to one class. Given the training set Sℓ,
estimating the underlying probability distribution from the data at hand is called
density estimation.
If in a classification task |Y | = 2, the problem is called a binary classification
problem. One of the classes is called the positive class and the other the negative
class. If |Y | > 2, the problem is called a multi-class classification problem and
the number of classes is denoted by d = |Y |. If the number of training pairs for
each class is equal in the training set Sℓ, then the training data set is called balanced.
Regression problems are considered to be harder than classification problems, and
among classification problems, multi-class problems are harder than binary ones.
To see this, assume the data set is balanced and consider a classifier that assigns
random labels to inputs: the probability of correct classification is 1/2 in the binary
case, but only 1/d in the balanced multi-class case.
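This baseline can be checked with a quick simulation (an illustrative sketch, not part of the thesis; the function name is my own):

```python
import random

def random_guess_accuracy(d, trials=100_000, seed=0):
    """Fraction of correct predictions made by a classifier that guesses
    uniformly at random on a balanced d-class problem."""
    rng = random.Random(seed)
    hits = sum(rng.randrange(d) == rng.randrange(d) for _ in range(trials))
    return hits / trials

# The accuracy is close to 1/2 for d = 2 and close to 1/10 for d = 10.
print(random_guess_accuracy(2), random_guess_accuracy(10))
```

With 100,000 trials the estimates differ from 1/d only by the usual Monte Carlo error.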
Each explanation of the empirical data is called a hypothesis. It should be noted
that one may find more than one hypothesis for the data set at hand. However,
this does not mean that all these hypotheses explain the real system, from which
the data were generated or collected, equally well. A procedure that automatically
generates hypotheses from empirical data is called a learning machine.
Besides specific proposals for learning machines that provide a hypothesis
f , there are general aspects worth considering [5]:
• Approximation: Assume infinitely many data points are available for training
and pick a function class ℑ such that f ∈ ℑ. The first question is whether ℑ
is inclusive enough to approximate the true relation between inputs and labels.
This question can be approached from the point of view of approximation
theory, because there are strong synergies between supervised learning and
approximation theory [44, 45]. However, this thesis will stick to statistical
learning theory [158].
• Estimation: Since the information about the problem is limited, the true
relationship between inputs and labels is not known. The second question
therefore is: how much data does one need to model the unknown relationship?
Again, statistical learning theory [158] will provide an answer.
• Computational efficiency: How can the training data be used to choose a
model that is accurate enough whilst using as few computational resources
as possible? One might argue that the computational power of computers is
constantly increasing; however, the available data to be analysed is increasing
at an even faster rate. Even more importantly, a supervised learning task may
become harder as the number of training samples increases [21].
Unfortunately, giving answers to these questions is not trivial [158, 5]. Within
the scope of this thesis, multi-class SVMs will be analysed from the approximation
and the computational efficiency points of view, and some methods for efficient
training of multi-class SVMs in the case of limited computational power will be
developed. In supervised learning, one wants to obtain a function with a low gen-
eralization error. In addition, during training one needs to identify which member
of the function class ℑ best explains the unknown relation between inputs and
labels. To succeed in both tasks, an auxiliary function is needed that measures
the performance of the hypothesis at a given training point. This function is
called the loss function. Its mathematical definition is:
Definition 1 Any function L satisfying the following conditions

1. L : Y × Y → [0,∞)

2. L is monotonically non-decreasing

3. L(f(xi), yi) = 0 if and only if f(xi) = yi

is called a loss function.
There are several popular loss functions:

• The 0–1 loss is defined as

L(f(xi), yi) = 1 if f(xi) ≠ yi, and 0 otherwise . (2.1)

• The least squares loss is defined as

L(f(xi), yi) = (f(xi) − yi)² . (2.2)

• The hinge loss is defined as

L(f(xi), yi) = max(0, 1 − yif(xi)) . (2.3)

• The exponential loss is defined as

L(f(xi), yi) = exp(1 − yif(xi)) . (2.4)

• The logistic loss is defined as

L(f(xi), yi) = log(1 + exp(−yif(xi))) . (2.5)
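As a quick illustration, these losses can be written down directly for a real-valued score f(x) and a label y ∈ {−1, +1} (a sketch of my own; for the 0–1 loss the sign of f(x) is taken as the prediction, and the logistic loss is used in its standard form log(1 + exp(−yf(x)))):

```python
import math

def zero_one(f, y):       # 0-1 loss, with sign(f) as the predicted label
    return float(y * f <= 0)

def least_squares(f, y):  # least squares loss
    return (f - y) ** 2

def hinge(f, y):          # hinge loss
    return max(0.0, 1.0 - y * f)

def exponential(f, y):    # exponential loss
    return math.exp(1.0 - y * f)

def logistic(f, y):       # logistic loss
    return math.log(1.0 + math.exp(-y * f))

# A point classified correctly with margin at least 1 incurs no hinge loss,
# while the smooth losses still assign it a small positive value.
print(hinge(2.0, +1))   # 0.0
print(logistic(2.0, +1))
```

Note how the hinge loss is exactly zero beyond margin 1, whereas the exponential and logistic losses keep rewarding larger margins.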
The selection of the loss is an important issue in supervised learning. The differences
between the losses are illustrated in Fig. 2.1. Within the scope of this thesis, unless
otherwise mentioned, the hinge loss given in eq. (2.3) will be used. The performance
of the hypothesis at a single training point is not the only quantity of interest;
generally, the overall performance of the hypothesis over the entire distribution Υ(x, y)
Figure 2.1: The outputs of the popular loss functions (0–1, least squares, hinge, exponential, and logistic). The horizontal axis represents the value of the argument of the loss function and the vertical axis shows the output of the loss function.
is of greater interest. The functional defined for this purpose is called the expected
risk and is defined as

I[f] = ∫_{X×Y} L(f(x), y) Υ(x, y) dx dy . (2.6)

The minimizer of eq. (2.6) is written as

f∗ = arg min_{f∈ℑ} I[f] . (2.7)
Generally one cannot find f∗ via eq. (2.7) because Υ(x, y) is unknown.
Density estimation techniques could be used to estimate Υ(x, y), and the result
of this estimation could be used in eq. (2.6) to find f∗. Although this is a valid
approach, density estimation is a harder problem than classification and regression
[158]. Empirical Risk Minimization (ERM) [62] can be used to overcome these issues.
The main idea of ERM is to minimize, instead of the expected risk given in
eq. (2.6), the empirical risk of the training set at hand, defined as

Iemp[f] = (1/ℓ) ∑_{i=1}^{ℓ} L(f(xi), yi) . (2.8)

Here it should be noted that Iemp is a random variable because it depends on the
training set at hand.
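A minimal sketch of eq. (2.8): the empirical risk is just the average loss over the training set (the function and variable names below are my own):

```python
def empirical_risk(f, samples, loss):
    """Iemp[f]: average loss of hypothesis f over samples = [(x1, y1), ...]."""
    return sum(loss(f(x), y) for x, y in samples) / len(samples)

# Toy example: a threshold classifier on the real line under the 0-1 loss.
f = lambda x: 1 if x >= 0 else -1
zero_one = lambda fx, y: float(fx != y)
data = [(-2.0, -1), (-1.0, -1), (0.5, 1), (1.0, 1), (-0.5, 1)]
print(empirical_risk(f, data, zero_one))  # 0.2: one of five points is misclassified
```

Evaluating the same hypothesis on a different sample would generally yield a different value, which is exactly why Iemp is a random variable.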
In the ERM framework, Υ(x, y) is exchanged for the finite training data at a
cost: the question is how close the minimizer of Iemp[f] is to that of I[f] in the
case of finite data. Statistical learning theory gives answers to this question under
different conditions.
Because ERM uses finite training data, problems beyond the approximation
aspects may arise in practice, namely overfitting and/or underfitting. To illustrate
these problems, a regression problem is considered with sin(t) as the target
function. The data set contains 20 points sampled equidistantly from the target
function, to which univariate Gaussian noise with standard deviation 0.6 was
added. In other words, the samples from the target function are corrupted. The
task is now to learn the target function from these data. The target function and
the corrupted data are illustrated in Figure 2.2-a) and -b). First, a first-order
polynomial is fitted to the data by least squares; then a fourth-order and a
twentieth-order polynomial are fitted, shown in Figure 2.2-c), d) and e),
respectively. Underfitting occurs when a too simple model is used, e.g. the
first-order polynomial for this problem. Overfitting occurs when a too complex
model is used, e.g. the twentieth-order polynomial. Intuitively, it is easy to see
that the fourth-order polynomial is better suited as a hypothesis than the other
two. In this example, a model that neither over- nor underfits could be selected
by inspection, because the target function and the problem are low dimensional.
But how can such a trade-off be formalized in the general setting? Developing an
appropriate solution to this problem is an important issue. A solution is given by
the concept of Structural Risk Minimization (SRM) introduced by Vapnik
[158, 160] for supervised learning. SRM balances two contradictory goals: the
first is selecting a hypothesis with small empirical risk (ER); the second is
selecting a hypothesis of small complexity, as measured by some suitable function
on ℑ [158, 107, 28]. Basically, SRM consists of four steps. First, using prior
knowledge, one chooses a class of functions ℑ. In the second step, the chosen
class is divided into nested subclasses of increasing complexity. In the third step,
ERM is applied to the problem at hand. In the final step, the model with the
minimum weighted sum of the empirical risk and the complexity of the function
class is selected.
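The polynomial example above is easy to reproduce. The sketch below is my own construction (NumPy least-squares fits; degree 19 is used for the most complex model, since a degree-19 polynomial already interpolates 20 points) and reports both the training error and the error against the noise-free target:

```python
import numpy as np
from numpy.polynomial import Polynomial

def fit_errors(degree, n=20, noise=0.6, seed=0):
    """Fit a polynomial of the given degree to n noisy samples of sin(t) on
    [0, 2*pi]; return (training MSE, MSE against the noise-free target)."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 2.0 * np.pi, n)
    y = np.sin(t) + rng.normal(0.0, noise, size=n)
    p = Polynomial.fit(t, y, degree)  # least-squares fit on a rescaled domain
    grid = np.linspace(0.0, 2.0 * np.pi, 200)
    return (float(np.mean((p(t) - y) ** 2)),
            float(np.mean((p(grid) - np.sin(grid)) ** 2)))

for degree in (1, 4, 19):  # underfit, reasonable fit, (near-)interpolating overfit
    train_mse, target_mse = fit_errors(degree)
    print(degree, round(train_mse, 4), round(target_mse, 4))
```

The training error decreases monotonically with the degree, while the error against the true target is typically smallest for the moderate-degree model: exactly the trade-off SRM is designed to formalize.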
Until now, the complexity of a function class has been mentioned without any
technical details. In SRM, the Vapnik-Chervonenkis (VC) dimension is generally
used as the measure of complexity. In order to give the technical definition of the
VC dimension, a similar line of argument as in [107, 51] will be followed. Any
binary classifier can be identified with the subset of X to which it assigns the
positive class, so any classifier can be regarded as a subset of X. In the following,
the subset corresponding to f ∈ ℑ is also denoted by f.
Given a set of points Dℓ, define Λℑ(Dℓ) as the number of distinct subsets of Dℓ
that can be obtained as intersections Dℓ ∩ f with some f ∈ ℑ. The ℓth shattering
coefficient is defined as

S(ℑ, ℓ) = max_{Dℓ} Λℑ(Dℓ) .

If S(ℑ, ℓ) = 2^ℓ, in other words if every subset of Dℓ can be obtained as an
intersection with some f ∈ ℑ, then Dℓ is said to be shattered by ℑ. The
Vapnik-Chervonenkis (VC) dimension of the function class ℑ, denoted by V Cℑ,
is defined as the largest integer h such that there exists a set of cardinality h
that is shattered by
-1.5
-1
-0.5
0
0.5
1
1.5
0 1 2 3 4 5 6 7
Data Generation Function( sin(t) )
a)
-2
-1.5
-1
-0.5
0
0.5
1
1.5
0 1 2 3 4 5 6 7
Data With Noise
b)-2
-1.5
-1
-0.5
0
0.5
1
1.5
0 1 2 3 4 5 6 7
Data With Noise
First order Polynomial
c)
-2
-1.5
-1
-0.5
0
0.5
1
1.5
0 1 2 3 4 5 6 7
Data With Noise
Fourth order Polynomial
d)-2
-1.5
-1
-0.5
0
0.5
1
1.5
0 1 2 3 4 5 6 7
Data With Noise
Twentieth order Polynomial
e)
Figure 2.2: Overfitting and undefitting problems are illustrated in the figure. Theoriginal sin(t) is shown in a), the corrupted training data points is shown in b). Theoutputs of the first, fourth and twentieth order polynomials are shown in c), d) ande) respectively. Underfitting problem is shown in c) and the overfitting problem isshown in e).
ℑ. It is important to note that if the VC dimension of ℑ is h, there is at least one
set of h points which can be shattered by ℑ. However, this does not mean that
every set of h points will be shattered by ℑ. Now one of the main results of
statistical learning theory, namely a bound on the expected risk [159, 158], will be
stated:
Theorem 1 Choose some η such that 0 ≤ η ≤ 1. Then, for losses taking the
values 0 or 1, the following bound holds with probability 1 − η:

I[f] ≤ Iemp[f] + √( ( h(log(2ℓ/h) + 1) − log(η/4) ) / ℓ ) (2.9)

where h represents the VC dimension of ℑ. In the following, the term Iemp[f] is
denoted by T1, the term h(log(2ℓ/h) + 1) by T2, and the term log(η/4) by T3.
Several facts related to this bound should be explained. The term T1 is the
empirical risk. T2 involves the number of training examples, ℓ, as well as the VC
dimension of the function class, h, and is essentially dominated by h. Hence, if one
wants to minimize T2, one needs to select a function class with a small value of h.
The term T3 depends on the degree of confidence, η, of the bound. Generally,
one wants to be as sure as possible, and therefore the value of η is usually small.
Analysing the term under the square root, containing T2 and T3, one sees that
it is inversely proportional to the number of training examples, ℓ. In other words,
when the number of training examples increases, the second term on the right-hand
side approaches zero. However, in practical problems ℓ is fixed. Using the bound
(2.9) on the expected risk, the SRM framework selects a learning machine for the
given data. In summary, given a family of hypotheses, Theorem 1 implies that one
should search for a hypothesis which minimizes the sum of the empirical risk and
the complexity term.
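The behaviour of the confidence term in (2.9) is easy to inspect numerically (a sketch; the function name is mine):

```python
import math

def vc_confidence(h, l, eta=0.05):
    """Second summand on the right-hand side of the bound (2.9):
    sqrt((h * (log(2l/h) + 1) - log(eta/4)) / l)."""
    return math.sqrt((h * (math.log(2.0 * l / h) + 1.0)
                      - math.log(eta / 4.0)) / l)

# The term shrinks as the sample size l grows and grows with the VC dimension h.
for l in (100, 1000, 10000):
    print(l, round(vc_confidence(h=10, l=l), 3))
```

This makes the qualitative discussion above concrete: for fixed h and η, the confidence term decays roughly like √(log ℓ / ℓ), while richer function classes (larger h) pay a larger penalty.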
2.2 Consistent Classifier
If one restricts oneself to the classification task, one can ask: what is the best
performance, in the sense of accuracy, that is reachable by a classifier? The
answer to this question is known if the prior probabilities and class conditional
distributions are known. In this case the optimal classifier, in the sense of
minimum probability of error or misclassification rate, is the Bayes Decision Rule
[58, 107, 112]. It is denoted by fB(x), and the loss corresponding to the Bayes
Decision Rule is denoted by LB. Given (x, y), pj(x) is defined as P(Y = j | x) for
j = 1, . . . , k. The Bayes Decision Rule minimizing the expected misclassification
rate is

fB(x) = arg min_{j=1,...,k} [1 − pj(x)] = arg max_{j=1,...,k} pj(x) .
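Given the posteriors pj(x), the rule is a one-liner (a sketch with hypothetical names; classes are numbered 1, . . . , k as above):

```python
def bayes_decision(posteriors):
    """Bayes Decision Rule: return the class j in {1, ..., k} maximizing
    the posterior P(Y = j | x), given as the list posteriors[j - 1]."""
    return max(range(1, len(posteriors) + 1), key=lambda j: posteriors[j - 1])

# At a point x with posteriors (0.2, 0.5, 0.3) the rule predicts class 2;
# its conditional probability of error at x is 1 - 0.5 = 0.5.
print(bayes_decision([0.2, 0.5, 0.3]))  # 2
```

Of course, this presumes the posteriors are known, which is exactly what fails in practice, as discussed next.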
However, in most real-life problems these probabilities and distributions are not
known, so computing the outcome of the Bayes Decision Rule is not possible.
Even though the training data is finite, it is reasonable to expect a classifier to
yield an error that approaches the error rate of the Bayes Decision Rule as ℓ → ∞.
A classifier whose outcomes converge in this sense to those of the Bayes Decision
Rule is called a consistent classifier.
There are three important types of consistency in statistics and machine learning.
First, if E[L(fℓ)] → LB as ℓ → ∞, then fℓ is consistent. Second, if
L(fℓ) → LB as ℓ → ∞, then fℓ is strongly consistent. Finally, if
L(fℓ) → LB as ℓ → ∞ for any Υ(x, y), then fℓ is universally consistent.
It is natural to strive for a universally consistent classifier, but it is hard to make
assumptions that hold independently of Υ(x, y). However, in a seminal paper Stone
[151] showed the existence of universally consistent classifiers. This supplies
practitioners with an important guarantee: if sufficient data is supplied to a
universally consistent classifier, then it will perform on test data as accurately as
the Bayes Decision Rule. Unfortunately, having a universally consistent classifier
does not guarantee that the classifier will perform as accurately as the Bayes
Decision Rule on the test data at hand, because the technical meaning of sufficient
data with respect to the problem at hand is unclear. Indeed, if the training data
set is small, any universally consistent classifier can perform arbitrarily badly.
This may be due to the slow convergence of L(fℓ) to LB [41, 50]. The design of a
good classifier is hard and non-trivial. In the next sections, common classifiers,
namely the nearest neighbour classifier, the Perceptron and SVMs, will be presented.
2.3 Classifiers
2.3.1 Nearest Neighbour Classifier
The nearest neighbour classifier (NN) [68, 69, 40] is probably the simplest non-
parametric algorithm for classification tasks. In the original definition [68, 69], the
1-nearest neighbour (1-NN) algorithm assigns to a test example the class of its
nearest neighbour in the training set. As with most methods of classification, the
performance of 1-NN depends on the chosen metric. A generalised version of
1-NN is the d-nearest neighbour classifier (d-NN): for a given test example, it
assigns the label of the majority of the d nearest neighbours in the training set.
Although the nearest neighbour classifier is a simple algorithm, it is
consistent when d → ∞ and d/ℓ → 0 as ℓ → ∞. Moreover, it is competitive with
other state-of-the-art methods [107].
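A minimal sketch of the d-NN rule described above, in pure Python with a Euclidean metric; the toy data and names are our own choices.

```python
import math
from collections import Counter

def d_nn_predict(train, x, d=3):
    """Classify x by majority vote among its d nearest training examples
    (Euclidean metric; train is a list of (point, label) pairs)."""
    neighbours = sorted(train, key=lambda p: math.dist(p[0], x))[:d]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

train = [((0.0, 0.0), 'a'), ((0.1, 0.2), 'a'), ((0.2, 0.1), 'a'),
         ((1.0, 1.0), 'b'), ((0.9, 1.1), 'b'), ((1.1, 0.9), 'b')]
print(d_nn_predict(train, (0.15, 0.15)))   # nearest cluster is class 'a'
```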
2.3.2 Perceptron
Rosenblatt [134] proposed the well-known Perceptron algorithm in the early 1960s.
In its original construction, the Perceptron algorithm handles only binary
classification and assumes an unlimited supply of training samples. The most
important restriction of the Perceptron algorithm is that it is only applicable to
problems that are linearly separable, i.e. where there is at least one hyperplane
between the classes that separates them without error. The Perceptron algorithm
starts from an arbitrary point and constructs such a separating hyperplane
iteratively. For finite data, the algorithm cycles through the data and updates
the parameters whenever a training example is misclassified.
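The cyclic update scheme just described can be sketched as follows; the stopping rule and toy data are our own choices for illustration, not taken from [134].

```python
def perceptron(data, epochs=100):
    """Cyclic Perceptron on (x, y) pairs with y in {-1, +1}.
    Returns (w, b) of a separating hyperplane if one is found."""
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in data:
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                # update only on a misclassified example
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:   # a full error-free pass: a separating hyperplane found
            break
    return w, b

data = [((2.0, 1.0), 1), ((1.5, 2.0), 1), ((-1.0, -1.5), -1), ((-2.0, -0.5), -1)]
w, b = perceptron(data)
print(all(y * (sum(wi * xi for wi, xi in zip(w, x)) + b) > 0 for x, y in data))
```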
Before discussing the Perceptron algorithm further, an important concept, namely
the 'margin', should be defined.

Definition 2 Given training data Sℓ and a hyperplane of the form f(x) = 〈w, x〉 + b,
define the margin ν(Sℓ, w) as

ν(Sℓ, w) = min_{(xi, yi) ∈ Sℓ}  yi f(xi) / ‖w‖ .    (2.10)
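Definition 2 translates directly into code; the hyperplane and data below are invented for illustration.

```python
import math

def margin(data, w, b):
    """nu(S, w) = min over (x_i, y_i) of y_i * f(x_i) / ||w||,
    with f(x) = <w, x> + b (Definition 2)."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
               for x, y in data)

data = [((2.0, 2.0), 1), ((0.5, 1.0), 1), ((-2.0, -2.0), -1)]
print(margin(data, w=(1.0, 1.0), b=0.0))   # the point (0.5, 1.0) attains the minimum
```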
Here it should be noted that the definition is still valid if Sℓ contains only one
point. The margin is a function of the data set and the hyperplane; for simplicity
it will be denoted by ν. It should be underlined that the margin is defined for
binary problems. If the data set at hand is linearly separable, Novikoff's bound
[158] for the Perceptron algorithm states that the number of errors the Perceptron
algorithm makes during training is inversely proportional to the squared margin.
As Figure 2.3 illustrates, any data set that is linearly separable with positive
margin, i.e. ν > 0 for some hyperplane, can be separated by an infinite number of
hyperplanes. However, the Perceptron algorithm does not include a mechanism to
select any particular hyperplane; instead, it provides a solution that depends on
the initial parameters. This led many researchers to note that the accuracy of the
Perceptron algorithm for a given problem depends on the quality of the separating
hyperplane [5, 158]. In general, the assumption necessary for the Perceptron to be
applicable, namely that the classes are linearly separable, is not warranted for most
real-world problems. This severely limits its usefulness in practice.
2.3.3 Support Vector Machines (SVMs)
Motivated by the intrinsic shortcomings of the Perceptron algorithm, namely its
failure to actively select among the possible hyperplanes and to converge in cases
that are not linearly separable, more research on the role of the margin and of
margin violations followed. In the following, some general theorems on properties
of separating hyperplanes will be given. First, the 'canonical hyperplane' will be
introduced, because it is used in these theorems; then two important theorems of
statistical learning theory will be stated.
Definition 3 Given a set of linearly separable points Dℓ = {x1, . . . , xℓ}, consider
a hyperplane of the form f(x) = 〈w, x〉 + b. If the hyperplane has the property

min_{n=1,...,ℓ} |f(xn)| = 1    (2.11)

it is called a canonical hyperplane.
Vapnik [139, 158] related the VC bound and the margin as follows:

Theorem 2 Given a set of points Dℓ = {x1, . . . , xℓ} separated by a canonical
hyperplane: select a class of functions ℑ of the form fw(x) = sgn(〈x, w〉), where sgn
is the sign/signum function, and the norm of the hyperplane w is bounded from above
by a positive constant, i.e. ‖w‖ ≤ Ω. If R is the radius of the smallest ball centred
at the origin containing Dℓ, then the VC dimension h of ℑ is bounded by

h ≤ R²Ω² .    (2.12)
The main interpretation of this theorem is that the VC dimension of the function
class can be bounded using the norm of w. Please note that a value for Ω should
be chosen beforehand. Although this bound clearly shows the relation between the
norm of the weight vector and the VC dimension, it does not give the relation
between the margin and the VC dimension. Therefore another relationship is needed
for this purpose. Vapnik [139, 158] stated the relation between the margin and the
VC dimension as follows:
Figure 2.3: Points of two different classes that are linearly separable are shown.
Among the infinitely many possible separating hyperplanes between the two classes,
three are illustrated (P1, P2 and P3). The Perceptron algorithm constructs a
separating hyperplane between the classes depending on the initial parameter values.
Note that P3 has a margin smaller than P1, which in turn has a smaller margin than
P2. It is clear that, among the three, P2 is more robust against noise in further
test data than P1 and P3. The Perceptron algorithm does not supply a solution with
the maximum margin even for linearly separable data.
Theorem 3 Given a set of points Dℓ = {x1, . . . , xℓ} such that ‖x‖ ≤ R for all
x ∈ Dℓ, with R > 0. Let ℑ be the class of functions of the form fw(x) = sgn(〈x, w〉)
with the norm of w bounded from above by a positive constant, i.e. ‖w‖ ≤ Ω. Define
µ as the fraction of points that have a margin smaller than ν, where ν > 0. Then
for all distributions Υ(x, y) from which the data is generated, with probability of at
least 1 − δ and for any ν > 0 and δ ∈ (0, 1), the probability p(fw(xi) ≠ yi) that a
test pattern xi drawn from Υ(x, y) will be misclassified is bounded by

p(fw(xi) ≠ yi) ≤ µ + √[ (σ/ℓ) ( (R²Ω²/ν²) ln²ℓ + ln(1/δ) ) ] ,    (2.13)

where σ is an unknown universal constant.
This theorem bounds the probability of misclassifying unseen test examples.
The bound given in (2.13) has two components. The first component is basically the
fraction of points whose margin is smaller than ν/‖w‖. The second component
represents the complexity of the learning machine and requires closer examination.
The complexity component is proportional to R and Ω and inversely proportional to
ν. To minimize the complexity of the learning machine, one needs to make R and Ω
small, and one also needs to make ν as large as possible. Since ℓ, R and Ω are fixed
beforehand, ν is the parameter that drives the complexity of the learning machine.
On the one hand, a large ν makes the complexity of the learning machine smaller;
however, this will likely increase µ and hence also have a detrimental effect on the
bound. On the other hand, if ν goes to zero, µ also goes to zero, but then the
complexity of the learning machine tends to infinity. It is clear that maximizing ν
and minimizing µ are contradictory goals; in other words, there exists an intricate
trade-off between them, and how to settle this trade-off still has to be determined.
An algorithm which provides a consistent framework for such trade-offs is the
Support Vector Machine, which will be introduced in the next section.
2.3.3.1 Hard Margin Support Vector Machines
In order to clearly explain SVMs, two closely related margin concepts, namely the
functional and the geometrical margin, should be defined and the difference between
them stated. Given training data Sℓ and a hyperplane f(x) of the form
〈w, x〉 + b, the functional margin of the nth training example with respect to f(x)
is defined as

yn(〈w, xn〉 + b) .

The functional margin is not scale invariant, which means the functional margin
can be increased arbitrarily by multiplying w and b by a positive scalar value.
To resolve this issue a different margin concept, namely the geometrical margin, is
used in SVMs. The geometrical margin of the nth training example with respect to
w is defined as

yn(〈w, xn〉 + b) / ‖w‖ .

If not stated otherwise, the geometrical margin is referred to as the margin in this
thesis. Basically, SVMs construct the optimal hyperplane by finding the hyperplane
that maximises the margin ν (defined in eq. (2.10); please note the relation
between it and the geometrical margin of a single training example) and minimizes
µ. To do that, SVMs fix the functional margin to 1 (which is the case for canonical
hyperplanes) and search for the hyperplane w that has the smallest norm possible
and the smallest fraction µ of points that have a margin smaller than 1/‖w‖. Before
explaining the technical details of SVMs, it should be noted that SVMs were
initially developed for binary classification problems and regression problems. For
the remainder of this section, SVMs for binary problems will be briefly explained,
and multi-class SVMs will be discussed in Chapter 3.
In this section Hard Margin Support Vector Machines, which, just like the
Perceptron, are only applicable to linearly separable data, will be defined. However,
they actively select a specific hyperplane among all possible ones. Motivated by
Theorem 2 and Theorem 3, the criterion for selecting the hyperplane is the largest
possible margin, as shown in Figure 2.4. This can be formulated as follows: given
that the training data is linearly separable and that SVMs pick a function from the
space ℑ consisting of canonical hyperplanes, all training data satisfy the following
constraints:
〈xn, w〉 + b ≥ 1    if yn = 1    (2.14)

〈xn, w〉 + b ≤ −1   if yn = −1   (2.15)

These constraints can be written more compactly as:

∀n ∈ {1, . . . , ℓ} : yn(〈xn, w〉 + b) ≥ 1    (2.16)
As a result of Theorem 3, the goal also is to minimize the norm of w. Together this
yields the optimisation problem:

min_{w,b}  (1/2) 〈w, w〉    (2.17)

s.t. ∀n ∈ {1, . . . , ℓ} : yn(〈xn, w〉 + b) ≥ 1    (2.18)
The optimization problem (2.17) is an example of a constrained convex optimization
problem, referred to as the primal problem. This class of optimization problems is
well established in optimization theory, and standard approaches to solve them
exist. One well-known approach is the method of Lagrange multipliers [15], which
will be used in this thesis. There are two reasons for using the Lagrange
multiplier method: first, the constraints are replaced by Lagrange variables that
are easy to handle; second, the optimization problem can be rewritten in such a way
that the training data is only used in inner products. The Lagrangian of (2.17) is:
L = (1/2) 〈w, w〉 − ∑_{n=1}^{ℓ} αn (yn(〈xn, w〉 + b) − 1)    (2.19)
An optimization problem (the so-called dual optimization problem) which is
equivalent to eq. (2.17) can be formulated by using eq. (2.19). To do this, the
Lagrangian should be minimized with respect to the primal variables, w and b, and
maximized with respect to the dual variables, α. To this end, the partial
derivatives of the Lagrangian with respect to the primal variables, w and b, are

∂L/∂w = w − ∑_{n=1}^{ℓ} αn yn xn    (2.20)

∂L/∂b = − ∑_{n=1}^{ℓ} yn αn .    (2.21)
To find the saddle point of the Lagrangian, one sets ∂L/∂w and ∂L/∂b to zero and
obtains

w = ∑_{n=1}^{ℓ} αn yn xn    (2.22)

0 = ∑_{n=1}^{ℓ} yn αn .    (2.23)
By substituting eq. (2.22) into eq. (2.19), the Lagrangian in terms of the dual
variables is obtained:

L = ∑_{n=1}^{ℓ} αn − (1/2) ∑_{n,m=1}^{ℓ} yn ym αn αm 〈xn, xm〉 .    (2.24)

Finally, the dual problem is written as follows:

max_α  ∑_{n=1}^{ℓ} αn − (1/2) ∑_{n,m=1}^{ℓ} yn ym αn αm 〈xn, xm〉    (2.25)

s.t.  ∑_{n=1}^{ℓ} yn αn = 0

∀n ∈ {1, . . . , ℓ} : αn ≥ 0
2.3.3.2 Soft Margin Support Vector Machines
Unfortunately, Hard Margin SVMs are only applicable to linearly separable
problems, as reflected by the constraint (2.18) of the primal. In order to apply
SVMs to inseparable problems, the constraint (2.18) should be relaxed, thus
allowing margin violations for training examples. To this end, so-called slack
variables ξn ≥ 0, n = 1, . . . , ℓ, are defined and introduced into the constraints
(2.18). An illustration of slack variables and the optimal hyperplane is given in
Figure 2.5. For a binary soft margin SVM the following primal optimization problem
should be solved:

Figure 2.4: Two linearly separable classes are shown in the figure. Hard margin
SVMs construct a hyperplane that maximises the margin.
min_{w,b,ξ}  (1/2) 〈w, w〉 + C ∑_{n=1}^{ℓ} ξn    (2.26)

s.t. ∀n ∈ {1, . . . , ℓ} :

yn(〈xn, w〉 + b) ≥ 1 − ξn    (2.27)

ξn ≥ 0
C is the regularization coefficient. As C approaches infinity, the objective
function of the Soft Margin Support Vector Machine is dominated by the sum of the
slack variables, in other words by the amount of margin violation; as C approaches
zero, the objective function is dominated by the norm of the weight vector. In
other words, in the former case high priority is given to minimizing the margin
violations on the training data set, which may lead to overfitting, and in the
latter case higher priority is given to minimizing the complexity of the
hypothesis, which may lead to underfitting. Clearly, neither overfitting nor
underfitting is desired, and the solution to both problems is to choose an
appropriate regularization parameter for the problem at hand. Unfortunately, C
cannot be identified beforehand; therefore model selection procedures, i.e.
choosing a statistical model from a set of candidate models, should be applied to
identify C. The Lagrangian of the primal problem is:
L = (1/2) 〈w, w〉 + C ∑_{n=1}^{ℓ} ξn − ∑_{n=1}^{ℓ} αn (yn(〈xn, w〉 + b) − 1 + ξn) − ∑_{n=1}^{ℓ} βn ξn    (2.28)
Following the same procedure as for the Hard Margin SVM, the partial derivatives
with respect to the primal variables, w, b and ξ, are obtained:

∂L/∂w = w − ∑_{n=1}^{ℓ} αn yn xn    (2.29)

∂L/∂b = − ∑_{n=1}^{ℓ} yn αn

∂L/∂ξn = C − αn − βn .
To find the saddle point of the Lagrangian, one sets ∂L/∂w, ∂L/∂b and ∂L/∂ξn to
zero and obtains

w = ∑_{n=1}^{ℓ} αn yn xn    (2.30)

0 = ∑_{n=1}^{ℓ} yn αn    (2.31)

αn = C − βn    (2.32)
By definition the Lagrange multipliers are greater than or equal to zero, i.e.
αn ≥ 0 and βn ≥ 0 for all n = 1, . . . , ℓ. So equation (2.32) is equivalent to
0 ≤ αn ≤ C. By substituting eq. (2.30) into eq. (2.28), the Lagrangian in terms of
the dual variables is derived:

L = ∑_{n=1}^{ℓ} αn − (1/2) ∑_{n,m=1}^{ℓ} yn ym αn αm 〈xn, xm〉 .    (2.33)
Finally, the dual problem is written as:

max_α  ∑_{n=1}^{ℓ} αn − (1/2) ∑_{n,m=1}^{ℓ} yn ym αn αm 〈xn, xm〉    (2.34)

s.t.  ∑_{n=1}^{ℓ} yn αn = 0

∀n ∈ {1, . . . , ℓ} : 0 ≤ αn ≤ C
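The box-constrained dual above admits a very simple solver if the bias term b is dropped, so that the equality constraint ∑ yn αn = 0 disappears and only 0 ≤ αn ≤ C remains; each αn can then be updated in turn by exact coordinate ascent with clipping. The sketch below (pure Python, toy data, parameter values invented) illustrates this simplified variant only; it is not the solver used in this thesis, and the full dual with bias requires pairwise updates as in SMO.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_svm_dual(data, C=1.0, sweeps=100):
    """Coordinate ascent on the soft-margin dual with the bias b omitted, so only
    the box constraints 0 <= alpha_n <= C remain.  Returns alpha and the weight
    vector w = sum_n alpha_n y_n x_n from the stationarity condition."""
    alpha = [0.0] * len(data)
    q = [dot(x, x) for x, _ in data]          # diagonal of the Gram matrix
    w = [0.0] * len(data[0][0])
    for _ in range(sweeps):
        for i, (xi, yi) in enumerate(data):
            g = 1.0 - yi * dot(w, xi)         # gradient of the dual w.r.t. alpha_i
            new = min(C, max(0.0, alpha[i] + g / q[i]))   # clip to the box [0, C]
            delta = new - alpha[i]
            if delta != 0.0:
                w = [wj + delta * yi * xj for wj, xj in zip(w, xi)]
                alpha[i] = new
    return alpha, w

data = [((2.0, 2.0), 1), ((1.0, 3.0), 1), ((-2.0, -1.0), -1), ((-1.0, -3.0), -1)]
alpha, w = train_svm_dual(data, C=10.0)
print([1 if dot(w, x) > 0 else -1 for x, _ in data])
```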
Figure 2.5: Two linearly inseparable classes are shown in the figure. ξ1, ξ2, ξ3
and ξ4 denote slack variables. Soft margin SVMs construct a hyperplane that finds
a compromise between maximizing the margin and minimizing the sum of the slack
variables.
2.3.3.3 Soft Margin Support Vector Machines for Non-linear Cases
Until now, all SVM algorithms discussed use linear functions of the training data.
In other words, so far the SVM algorithms are restricted to the class of linear
functions of the training data, which have a limited ability to supply suitable
solutions. Further, methods using linear functions can only be applied to
vector-valued data. To turn the linear SVMs presented above into non-linear
algorithms, kernel functions [6, 158] will be used. Before giving a technical
definition of the 'kernel', the 'Gram Matrix' should be introduced.
Definition 4 Given a set of points Sℓ = {x1, . . . , xℓ} and a function
k : Sℓ × Sℓ → R, the matrix K ∈ R^{ℓ×ℓ} with elements K(n, m) = k(xn, xm) is
called the Gram Matrix of k(·, ·) with respect to Sℓ.

After giving the definition of the Gram Matrix, the 'kernel' is defined
mathematically:

Definition 5 A function k : Sℓ × Sℓ → R is called a kernel if it is symmetric and
its Gram Matrix with respect to Sℓ is positive semi-definite.
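The two defining properties of a kernel, symmetry and a positive semi-definite Gram Matrix, can be spot-checked numerically. The sketch below builds the Gram Matrix of a Gaussian kernel on a few invented points and tests vᵀKv ≥ 0 for random vectors, which is a necessary (not sufficient) numerical check of positive semi-definiteness.

```python
import math, random

def rbf(x, y, gamma=0.5):
    """Gaussian (RBF) kernel, a standard Mercer kernel."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

points = [(0.0, 0.0), (1.0, 0.5), (0.3, 2.0), (2.0, 2.0)]
K = [[rbf(xn, xm) for xm in points] for xn in points]   # the Gram Matrix

# Symmetry check: K(n, m) == K(m, n).
sym = all(abs(K[n][m] - K[m][n]) < 1e-12 for n in range(4) for m in range(4))

# Spot-check v^T K v >= 0 for random vectors v (necessary condition for PSD).
random.seed(0)
psd_spot = all(
    sum(v[n] * K[n][m] * v[m] for n in range(4) for m in range(4)) >= -1e-12
    for v in ([random.uniform(-1, 1) for _ in range(4)] for _ in range(100))
)
print(sym and psd_spot)
```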
After defining the kernel function, the conversion of linear SVMs into non-linear
ones can be explained. Firstly, it should be noted that in the dual formulations
the training data appears only in the form of inner products. If each training
example xn is mapped into another space, called the feature space, by a function
φ, then all inner product terms in the dual formulations can be replaced by the
new inner products in the feature space; in other words, 〈xn, xm〉 can be replaced
with 〈φ(xn), φ(xm)〉. However, this approach can pose a computational problem: the
feature space may be very high-dimensional or even infinite-dimensional. For some
feature spaces, one can replace 〈φ(xn), φ(xm)〉 with a direct evaluation k(xn, xm).
The question is: for which feature spaces can a kernel function be used to
calculate the inner product 〈φ(xn), φ(xm)〉? The answer to this question is given
by Mercer's Theorem [158]:
Theorem 4 A continuous symmetric function k(x, y) ∈ L2(C) has an expansion

k(x, y) = 〈φ(x), φ(y)〉

if and only if for any f(x) the condition

∫ k(x, y) f(x) f(y) dx dy ≥ 0

is valid.
If the kernel used satisfies Mercer's Theorem, then from a mathematical point of
view there is no difference between mapping each training example into the feature
space and calculating the inner product there, and directly using a function
k(·, ·) that equals this inner product, i.e. k(xn, xm) = 〈φ(xn), φ(xm)〉. From
a computational point of view the direct function evaluation is preferred because,
as mentioned before, in some cases the feature space can have infinite dimensions.
In the remainder of this thesis, unless otherwise stated, all kernels satisfy
Mercer's Theorem. Now, if the function k(·, ·) is a kernel, then a non-linear SVM
is obtained. The primal of the new learning machine using kernels is as follows:
min_{w,b,ξ}  (1/2) 〈w, w〉 + C ∑_{n=1}^{ℓ} ξn    (2.35)

s.t. ∀n ∈ {1, . . . , ℓ} :

yn(〈φ(xn), w〉 + b) ≥ 1 − ξn    (2.36)

ξn ≥ 0
The Lagrangian of the primal problem is:

L = (1/2) 〈w, w〉 + C ∑_{n=1}^{ℓ} ξn − ∑_{n=1}^{ℓ} αn (yn(〈φ(xn), w〉 + b) − 1 + ξn) − ∑_{n=1}^{ℓ} βn ξn    (2.37)
Following the same procedure as for the Hard Margin SVM, the partial derivatives
with respect to the primal variables, w, b and ξ, are obtained:

∂L/∂w = w − ∑_{n=1}^{ℓ} αn yn φ(xn)    (2.38)

∂L/∂b = − ∑_{n=1}^{ℓ} αn yn

∂L/∂ξn = C − αn − βn .
To find the saddle point of the Lagrangian, one sets ∂L/∂w, ∂L/∂b and ∂L/∂ξn to
zero and obtains

w = ∑_{n=1}^{ℓ} αn yn φ(xn)    (2.39)

0 = ∑_{n=1}^{ℓ} αn yn    (2.40)

αn = C − βn .    (2.41)
Substituting eq. (2.39) into eq. (2.37) yields the Lagrangian in terms of the dual
variables:

L = ∑_{n=1}^{ℓ} αn − (1/2) ∑_{n,m=1}^{ℓ} yn ym αn αm k(xn, xm) .    (2.42)
By converting eq. (2.41) into an inequality constraint as for the Soft Margin SVM,
the dual problem can be written as

max_α  ∑_{n=1}^{ℓ} αn − (1/2) ∑_{n,m=1}^{ℓ} yn ym αn αm k(xn, xm)

s.t.  ∑_{n=1}^{ℓ} αn yn = 0

∀n ∈ {1, . . . , ℓ} : 0 ≤ αn ≤ C .
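Combining the stationarity condition (2.39) with the kernel trick gives the familiar kernel expansion of the decision function, f(x) = ∑ αn yn k(xn, x) + b, in which φ never has to be computed explicitly. The sketch below assumes the dual variables are already given; the αn values and toy data are invented, not produced by a solver.

```python
import math

def rbf(x, z, gamma=1.0):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def svm_decision(train, alpha, b, x, kernel=rbf):
    """f(x) = sum_n alpha_n y_n k(x_n, x) + b, from w = sum_n alpha_n y_n phi(x_n);
    only kernel evaluations are needed, never phi itself."""
    return sum(a * y * kernel(xn, x) for a, (xn, y) in zip(alpha, train)) + b

# Hypothetical dual solution for a toy two-point problem (alphas invented here).
train = [((0.0, 0.0), -1), ((2.0, 2.0), 1)]
alpha, b = [0.8, 0.8], 0.0
print(1 if svm_decision(train, alpha, b, (1.8, 1.9)) > 0 else -1)
```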
Chapter 3
Multi-class Support Vector
Machines
In Chapter 2, general concepts of supervised learning and statistical learning
theory were discussed, and three different classifiers, namely the nearest
neighbour classifier, the Perceptron and binary SVMs, were explained. In this
chapter, multi-class classifiers that are derived from binary SVMs will be
discussed.

Throughout the remainder of this thesis, unless otherwise noted, a training data
set belonging to d classes, i.e. ((x1, y1), . . . , (xℓ, yℓ)) ∈ (X × {1, . . . , d})^ℓ,
sampled i.i.d. from a fixed but unknown probability distribution, will be denoted
by Sℓ. All machines explained in this thesis construct a decision function of the
form

x ↦ arg max_{c∈{1,...,d}}  〈wc, φ(x)〉 + bc .    (3.1)

Here, φ : X → H is a feature map into an inner product space H, w1, . . . , wd ∈ H
are class-wise weight vectors and b1, . . . , bd ∈ R are class-wise bias/offset
values.
The previously discussed SVM algorithms are restricted to binary classification
problems. However, many problems in practice contain more than two classes,
i.e. they are multi-class problems. Different ways of solving d-class
classification problems with SVMs will be discussed. In general, one distinguishes
two different approaches for solving d-class classification problems. The first is
to cast the d-class problem into a series of binary or one-class classification
problems. The second group of approaches constructs a single optimization problem
for the entire d-class problem. In this thesis, the methods belonging to the second
group will be referred to as all-in-one methods. The first approach has been
analysed extensively in the literature [53, 4, 43] due to its relative simplicity.
For kernel methods such as SVMs the feature space H can be constructed as
the Reproducing Kernel Hilbert Space (RKHS) (see [6, 139]) of a positive definite
kernel function k : X × X → R. The corresponding feature map takes the form
φ(x) = k(x, ·), and k(x, x′) = 〈φ(x), φ(x′)〉. The decision function (3.1) is illustrated
in Figure 3.2.
Figure 3.1: The two general approaches for solving d-class problems are shown,
together with the methods developed from each approach. All these methods will be
discussed in this study.
When solving d-class problems as a series of binary problems, there are two
common methods. The first is one-versus-one (OVO), also known as one-against-one.
The second is one-versus-all (OVA), also known as one-against-all. OVO casts a
multi-class problem into a series of pairwise binary problems; more precisely, OVO
constructs d(d−1)/2 binary problems. Although during the training phase of each
binary problem a decision function of the form (3.1) is used, it is not clear how
to combine the outputs of the d(d−1)/2 binary problems for a given test sample.
Several methods have been proposed to address this deficiency, i.e. to obtain a
unified decision function [84]; however, all these methods provide a solution that
reflects the process rather than the data. Because of this deficiency, OVO will
only be briefly explained in Section 3.1. OVA, which has a similar philosophy to
OVO but does provide a unique decision function, will be discussed in more detail
in Sections 3.1 and 3.1.2.
In OVA one constructs d classifiers and operates on them using a decision function
of the form (3.1). This thesis is restricted to classifiers which have a unique
decision function of the form (3.1). Regardless of the issues related to the
decision function, both OVO and OVA cast the d-class problem into a series of
binary classification problems. In addition to these two common methods, Szedmak
et al. [152] proposed a multi-class SVM method which casts the problem into a
series of one-class classification problems. In this method, each term inside the
argmax expression in equation (3.1) can be interpreted as a linear projection onto
a one-dimensional subspace. In multi-class classification it is natural to combine
these one-dimensional subspaces into a single d-dimensional decision space; this
corresponds to filling the components of a vector with the individual inner
products. This method addresses the decision function deficiencies of OVO and will
be discussed further in Sections 3.1 and 3.1.1.
As previously noted, the alternative to sequential approaches is to address the
problem with all-in-one methods [163, 42, 112]. Several methods have been proposed
for this approach, but a unified analysis is missing. These methods will be the
focus of Section 3.2, and a novel unified view of them will be supplied in
Chapter 4.
Figure 3.2: Illustration of multi-class support vector machine decision making.
First, the input space is (implicitly) mapped to a feature Hilbert space with the
feature map φ : X → H. Then, projections πi : H → R, h ↦ 〈wi, h〉 + bi,
i ∈ {1, . . . , d}, are applied, resulting in a d-dimensional label vector. Finally,
the label vector is mapped to a label index by applying the argmax decision
function (3.1). Here, only the positive octant of the decision space is drawn. The
solid planes are the different parts of the decision boundary separating pairs of
classes.
It is not a priori clear that the subspaces should be embedded along the
(orthogonal) coordinate axes. Equally validly, one could fix a so-called label
prototype vector vc ∈ R^d per class, embed the inner products into the decision
space as

v(x) = ∑_{c=1}^{d} 〈wc, φ(x)〉 vc ,

and then make a decision according to

x ↦ arg max_{c∈{1,...,d}}  〈vc, v(x)〉 + bc .

Therefore, the decision space R^d can be referred to as the label or label
prototype space. This slight generalization of the decision function (3.1) is
considered in [152]. In the following sections, the label prototypes are
restricted to orthonormal prototype vectors vc and decision functions of the
form (3.1).
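With orthonormal prototypes, decision function (3.1) is a plain argmax over class scores; the weights and biases below are chosen by hand for illustration.

```python
def decide(ws, bs, phi_x):
    """Decision function (3.1): x -> argmax_c <w_c, phi(x)> + b_c,
    returning the class index c in {1, ..., d}."""
    scores = [sum(wc_i * f_i for wc_i, f_i in zip(wc, phi_x)) + bc
              for wc, bc in zip(ws, bs)]
    return 1 + scores.index(max(scores))

# Three classes in a 2-D feature space (weights invented for illustration).
ws = [(1.0, 0.0), (0.0, 1.0), (-1.0, -1.0)]
bs = [0.0, 0.0, 0.5]
print([decide(ws, bs, phi) for phi in [(2.0, 0.1), (0.1, 2.0), (-1.0, -1.0)]])
```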
3.1 Sequential Multi-Class SVMs
In this section, the previously proposed methods for solving multi-class problems
with SVMs sequentially will first be summarized. Generally, machine learning and
pattern recognition algorithms are initially developed for binary problems and
then extended to multi-class problems [58, 17, 72]. Although there are some
exceptions [33, 131] to this rule of thumb, SVMs likewise were initially developed
for binary problems [159, 138, 39]. One of the first approaches to extend binary
SVMs to multi-class problems was put forward in [84, 53].
A very general and widely applicable framework for extending binary classifiers to
multi-class problems is based on error correcting output codes (ECOC), which have
roots in information theory and communication theory [127, 116, 117]. Moreover,
the use of ECOC in machine learning can already be found in early machine learning
studies [59].
In the ECOC methodology, each class is assigned a unique codeword, which is a
binary string of length s. In most of the literature a binary value takes, by
convention, either one or zero; for consistency with the SVM literature it will be
assumed here that binary strings are composed of 1's and −1's. The length of the
binary string may or may not be equal to the number of classes, d. ECOC utilizes
s binary classifiers and an error correcting matrix C ∈ R^{d×s}; one binary
classifier is learned for each column of this ECOC matrix. Note that the value of
each ECOC matrix element represents the relationship between the corresponding
classifier and the corresponding class. For example, for a multi-class problem
containing 4 classes and codewords with a string length of 7, an ECOC matrix C is
given in Table 3.1. Within this table, ci represents the ith classifier and each
entry C(m, i) represents the desired output of the ith classifier with respect to
the mth class. To classify a test example x, a codeword string of length s should
be populated
Class | Code Word
      | c0  c1  c2  c3  c4  c5  c6
  0   | -1  -1  -1   1  -1   1  -1
  1   |  1  -1  -1  -1   1  -1   1
  2   | -1  -1   1  -1  -1   1  -1
  3   |  1   1  -1  -1  -1  -1   1

Table 3.1: An error correcting output code for a four-class problem
using the binary classifiers. This string is stored as a vector p; an example of p
is given in Table 3.2. In the final step, the label of the codeword nearest to p

p (Code Word)
c0  c1  c2  c3  c4  c5  c6
-1   1   1  -1  -1   1  -1

Table 3.2: The populated vector p for a test example x.

is assigned as the label of the test example x. Generally the Hamming distance
[83] is used for determining the closest codeword. The final step is illustrated
in Table 3.3: the test example x is classified as class 2 because class 2 has the
minimum Hamming distance to p.
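The decoding step of Tables 3.1 to 3.3 can be reproduced directly; the codewords are those of Table 3.1 and p is the vector used in Table 3.3.

```python
def hamming(u, v):
    return sum(1 for a, b in zip(u, v) if a != b)

# The ECOC matrix of Table 3.1 (one codeword of length s = 7 per class).
codewords = {
    0: (-1, -1, -1,  1, -1,  1, -1),
    1: ( 1, -1, -1, -1,  1, -1,  1),
    2: (-1, -1,  1, -1, -1,  1, -1),
    3: ( 1,  1, -1, -1, -1, -1,  1),
}

def ecoc_decode(p):
    """Assign the class whose codeword has minimum Hamming distance to p."""
    return min(codewords, key=lambda c: hamming(codewords[c], p))

p = (-1, 1, 1, -1, -1, 1, -1)      # outputs of the seven binary classifiers
print(ecoc_decode(p), [hamming(codewords[c], p) for c in range(4)])
```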
The main advantage of the ECOC framework is its robustness against errors if the
code matrix is defined appropriately [53, 4]. Further, some popular sequential
multi-class SVM solutions, such as OVA, fit directly into the ECOC framework.
For a similar four-class problem, the corresponding ECOC matrix of OVA is given
in Table 3.4.
Class | c1  c2  c3  c4
  0   |  1  -1  -1  -1
  1   | -1   1  -1  -1
  2   | -1  -1   1  -1
  3   | -1  -1  -1   1

Table 3.4: The ECOC matrix of OVA
Allwein et al. [4] adapted ECOC so that a single technique is compatible with all
sequential multi-class classification methods. Basically, they proposed codewords
which comprise 1's, 0's or −1's. A value of 0 indicates that the corresponding
classifier ignores the training examples of the corresponding class; a value of 1
indicates that the corresponding classifier treats the corresponding class as the
positive class of the binary classification problem; and a value of −1 indicates
that it treats the corresponding class as the negative class. With this extension
of ECOC, the ECOC matrices of OVO and MC-MMR are stated in Tables 3.5 and 3.6.

Class | c1  c2  c3  c4  c5  c6
  0   |  1   1   1   0   0   0
  1   | -1   0   0   1   1   0
  2   |  0  -1   0  -1   0   1
  3   |  0   0  -1   0  -1  -1

Table 3.5: The ECOC matrix of OVO
ECOC frameworks supply a flexible tool for using binary classifiers to solve
multi-class problems. It is important to note that the ECOC framework does not
assume any particular type of classifier. However, the design of the ECOC matrices
is problematic and is known in the scientific community as 'the Error Correcting
Code Design Problem' [56].

Class | c1  c2  c3  c4
  0   |  1   0   0   0
  1   |  0   1   0   0
  2   |  0   0   1   0
  3   |  0   0   0   1

Table 3.6: The ECOC matrix of MC-MMR

Allwein et al. [4]
developed bounds on the training errors of margin-based classifiers within the
ECOC framework. Instead of stating the exact derivation of these bounds, it
suffices to point out that they are a function of q/ρ, where q represents the
proportion of redundant bits to the total number of bits in the ECOC matrix, i.e.
(s − d)/s, and ρ is the minimum Hamming distance between the distinct codewords
constituting the ECOC matrix; this leads to the error correcting code design
problem. The problem arises because there are two conflicting goals: the first
goal is to minimize the number of redundant components, but this conflicts with
the second goal of maximizing the minimum distance between distinct codewords.
From the optimization perspective, the design of the ECOC matrix is NP-complete
[43], and designing codewords is independent of empirical risk minimization;
therefore there may be incompatibilities between the classifier and the ECOC.
Generally, good codes can be designed using genetic algorithms [155, 2, 1, 3].
Although the Hamming distance is often used, several other distance metrics [4]
are sometimes used for good code design.
This study is limited to a discussion of OVA and MC-MMR within the ECOC
framework. The basic idea of OVO is mentioned and explained above, but OVO
will not be discussed further, for the following two reasons. Firstly, several studies
have shown that one-versus-all and one-versus-one perform similarly with regard to
classification accuracy [87, 57]. Secondly, for OVO it is not clear which decision
function should be used after training [84], whereas in the case of OVA and MC-MMR
it is convenient to use the decision function (3.1).
In the following subsections, first the MC-MMR method and then the OVA
method will be explained in detail.
3.1.1 Multi-Class Classification with Maximum Margin Regression (MC-MMR)
One of the recent extensions of binary SVMs to multiple classes is Multi-Class
Classification with Maximum Margin Regression (MC-MMR) [152]. The basic idea
behind MC-MMR can be stated as follows: In binary SVMs, the normal vector of
the decision function can be interpreted as a projection operator which maps the
feature space to a one-dimensional decision space for classification. This projection
operator can be extended to multiple classes, corresponding to a higher dimensional
decision space. This line of thought is followed by MC-MMR, in which the decision
functions map inputs to vector-valued labels [152].
The primal of MC-MMR is:

    min  (1/2)‖W‖² + C ∑_{n=1}^{ℓ} ξ_n                                   (MC-MMR)
    s.t. ∀n ∈ {1, …, ℓ} :
         ⟨Wφ(x_n) + b, y_n⟩ ≥ 1 − ξ_n
         ξ_n ≥ 0 .                                                       (3.2)

Here W is a matrix whose cth row corresponds to w_c, i.e. the separating hyperplane
w.r.t. the cth class, and b ∈ R^d is the bias/offset vector whose cth entry
corresponds to b_c. The Lagrangian of the primal problem is:
    L = (1/2)‖W‖² + C ∑_{n=1}^{ℓ} ξ_n − ∑_{n=1}^{ℓ} α_n (⟨Wφ(x_n) + b, y_n⟩ − 1 + ξ_n) − ∑_{n=1}^{ℓ} β_n ξ_n    (3.3)
Following the same procedure as for the Hard Margin SVMs, the partial derivatives
with respect to the primal variables W, b and ξ are obtained:

    ∂L/∂W = W − ∑_{n=1}^{ℓ} α_n (Iφ(x_n))^T y_n        (3.4)
    [∂L/∂b]_p = −∑_{n=1}^{ℓ} α_n [y_n]_p               (3.5)
    ∂L/∂ξ_n = C − α_n − β_n                            (3.6)

where [y_n]_p is the pth entry of the label vector y_n. To find the saddle point of the
Lagrangian, one sets ∂L/∂W, ∂L/∂b and ∂L/∂ξ_n to zero and obtains

    W = ∑_{n=1}^{ℓ} α_n (Iφ(x_n))^T y_n                (3.7)
    0 = ∑_{n=1}^{ℓ} α_n [y_n]_p                        (3.8)
    α_n = C − β_n .                                    (3.9)
Substituting eq. (3.7) into eq. (3.3), the Lagrangian with respect to the dual
variables is derived:

    L = ∑_{n=1}^{ℓ} α_n − (1/2) ∑_{n,m=1}^{ℓ} α_n α_m k(x_n, x_m) y_n^T y_m .    (3.10)
For the case of orthogonal prototype labels:

    L = ∑_{n=1}^{ℓ} α_n − (1/2) ∑_{n,m=1}^{ℓ} α_n α_m k(x_n, x_m) δ_{y_n,y_m}    (3.11)
In this thesis I will use only orthogonal prototype labels, because Szedmak et al. [152]
hinted that the classification accuracy of MC-MMR does not depend on the prototype
labels as long as they are independent. By converting eq. (3.9) into an inequality
constraint as for Soft Margin SVMs, the dual problem is written as

    max_α  ∑_{n=1}^{ℓ} α_n − (1/2) ∑_{n,m=1}^{ℓ} α_n α_m k(x_n, x_m) δ_{y_n,y_m}    (3.12)
    s.t.   ∑_{n=1}^{ℓ} α_n [y_n]_p = 0
           ∀n ∈ {1, …, ℓ} : 0 ≤ α_n ≤ C    (3.13)
giving weight vectors

    w_c = ∑_{n=1}^{ℓ} α_n δ_{y_c,y_n} k(x_n, ·) .

The dual problem can be decomposed into d independent sub-problems involving
only the variables indexed by the sets S_c, c ∈ {1, …, d}.
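The decomposition claim can be checked numerically: with orthogonal prototype labels, the quadratic term of the dual (3.12) couples α_n and α_m only through δ_{y_n,y_m}, so the Hessian contains no cross-class entries. A small sketch (illustrative Python; the toy data and the Gaussian kernel are assumptions, not from the thesis experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 2))              # toy inputs
y = np.array([0, 1, 2, 0, 1, 2, 0, 1])   # toy labels, d = 3 classes

def gauss_kernel(a, b, gamma=1.0):
    return np.exp(-gamma * np.sum((a - b) ** 2))

# Hessian of the MC-MMR dual (3.12): Q[n, m] = k(x_n, x_m) * delta(y_n, y_m)
ell = len(y)
Q = np.array([[gauss_kernel(X[n], X[m]) * (y[n] == y[m])
               for m in range(ell)] for n in range(ell)])

# Variables of different classes never interact: the dual splits into d
# independent sub-problems over the index sets S_c = {n : y_n = c}.
coupled = [(n, m) for n in range(ell) for m in range(ell)
           if Q[n, m] != 0 and y[n] != y[m]]
```

An empty `coupled` list confirms that, after grouping the examples by class, the Hessian is block diagonal.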
3.1.2 One versus All (OVA)
The one-versus-all (OVA) method is a straightforward way to extend standard
SVMs for binary classification [158, 139] to d-class problems. Let S_c = {n ∈ {1, …, ℓ} |
y_n = c} denote the index set of training examples of class c. Then for each
c ∈ {1, …, d}, OVA constructs a binary classifier that tries to separate class c from
all other classes by solving the convex quadratic optimization problem
    min  (1/2)⟨w_c, w_c⟩ + C ∑_{n=1}^{ℓ} ξ_{n,c}                         (3.14)
    s.t. ∀n ∈ S_c : ⟨w_c, φ(x_n)⟩ + b_c ≥ 1 − ξ_{n,c}
         ∀n ∉ S_c : ⟨w_c, φ(x_n)⟩ + b_c ≤ −1 + ξ_{n,c}
         ∀n ∈ {1, …, ℓ} : ξ_{n,c} ≥ 0 .                                  (3.15)
For OVA the dual will be given directly, because the derivation is identical to the
binary case (see Section 2.3.3.3). In practice, the equivalent dual problem
    max_α  ∑_{n=1}^{ℓ} α_{n,c} − (1/2) ∑_{n,m=1}^{ℓ} α_{n,c} α_{m,c} v_{m,n} k(x_n, x_m)    (3.16)
    s.t.   ∑_{n=1}^{ℓ} ζ_n α_{n,c} = 0                                                      (3.17)
           ∀n ∈ {1, …, ℓ} : 0 ≤ α_{n,c} ≤ C

with

    v_{m,n} = (−1)^{|{m,n} ∩ S_c|} = +1 if m, n ∈ S_c or m, n ∉ S_c, and v_{m,n} = −1 otherwise,
    ζ_n = +1 if n ∈ S_c and ζ_n = −1 if n ∉ S_c                                             (3.18)

is solved (see Chapter 5). Finally, the weight vectors are obtained as

    w_c = ∑_{n∈S_c} α_{n,c} k(x_n, ·) − ∑_{n∉S_c} α_{n,c} k(x_n, ·) .
Each resulting vector wc is designed to separate class c from the rest by means of
the sign of (〈wc,φ(x)〉+ bc).
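Once the d binary problems have been solved, multi-class prediction only needs the per-class scores and an argmax rule as in decision function (3.1). A minimal sketch of this prediction step (illustrative Python; the toy dual variables, zero offsets and the linear kernel are assumptions, not the output of a trained machine):

```python
import numpy as np

def linear_kernel(a, b):
    return float(np.dot(a, b))

def ova_decision(x, X, y, alpha, b, d, kernel=linear_kernel):
    """OVA prediction: f_c(x) = sum_{n in S_c} alpha[n,c] k(x_n, x)
                               - sum_{n not in S_c} alpha[n,c] k(x_n, x) + b_c,
    classify by argmax_c f_c(x)."""
    scores = np.empty(d)
    for c in range(d):
        s = b[c]
        for n in range(len(y)):
            sign = 1.0 if y[n] == c else -1.0
            s += sign * alpha[n, c] * kernel(X[n], x)
        scores[c] = s
    return int(np.argmax(scores)), scores

# Toy setup: three well-separated classes on a line, one point per class.
X = np.array([[-2.0], [0.0], [2.0]])
y = np.array([0, 1, 2])
alpha = np.full((3, 3), 0.5)   # illustrative dual variables
b = np.zeros(3)

label, scores = ova_decision(np.array([2.1]), X, y, alpha, b, d=3)
```

With these toy numbers, a query near the class-2 point is assigned label 2.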
3.2 All-in-One Multi-class Machines
Instead of constructing the weight vectors wc independently by training multiple
binary classifiers, several methods have been proposed to directly obtain all vectors
from a single optimization problem taking all class relations into account at once. In
this section, three different approaches to extending SVMs to multiple classes by
solving a single optimization problem are discussed. The standard extension of the
SVM optimization problem to multiple classes, as proposed in [158, 163], is considered
because it is the canonical, fundamental all-together approach. Further, the method
proposed by Crammer and Singer [42], which can be regarded as a modification
of the previous all-together approach with the goal of increasing learning speed by
simplifying the constraints of the learning problem, is also considered. Finally,
one of the more recent multi-class extensions of SVMs, proposed by Lee et al. [112],
is discussed. Although this machine has nice theoretical properties, such as consistency
and a classification calibrated loss, until now no solver has been made available for
it. The theoretical aspects of the machine will be explained in this section, and how
to solve the corresponding optimization problems will be discussed in Chapter 5.
3.2.1 The Weston and Watkins Method (WW)
Weston and Watkins (WW) [163] and Vapnik [158], independently from each
other, proposed the first all-in-one extensions of SVMs. Their proposed methods are
identical up to a constant in the absolute value of the target margin. The corresponding
primal problem of WW is as follows:
    min  (1/2) ∑_{c=1}^{d} ⟨w_c, w_c⟩ + C ∑_{c=1}^{d} ∑_{n=1}^{ℓ} ξ_{n,c}    (3.19)
    s.t. ∀n ∈ {1, …, ℓ}, ∀c ∈ {1, …, d} \ {y_n} :
         ⟨w_{y_n} − w_c, φ(x_n)⟩ + b_{y_n} − b_c ≥ 2 − ξ_{n,c}
         ∀n ∈ {1, …, ℓ}, ∀c ∈ {1, …, d} : ξ_{n,c} ≥ 0 .                      (3.20)
If the first set of inequality constraints is replaced by

    ∀n ∈ {1, …, ℓ}, ∀c ∈ {1, …, d} \ {y_n} :
    ⟨w_{y_n} − w_c, φ(x_n)⟩ + b_{y_n} − b_c ≥ 1 − ξ_{n,c} ,

that is, if one sets the target margin to 1 instead of 2, WW equals the formulation
used by Vapnik [158]. In this study, the formulation (3.19) with a target margin of 2
is used. The objective function is the sum of the objective functions of the binary
SVM problems (see eq. (3.14)). The major difference lies in the interpretation and
handling of the slack variables ξ_{n,c}. While their number is identical, their role is
different in the sense that the entry ξ_{n,c} of the ℓ×d matrix of slack variables
corresponds to the hinge loss when separating example x_n from the decision boundary
between classes y_n and c.
The Lagrangian of the primal problem is:

    L = (1/2) ∑_{c=1}^{d} ⟨w_c, w_c⟩ + C ∑_{c=1}^{d} ∑_{n=1}^{ℓ} ξ_{n,c} − ∑_{c=1}^{d} ∑_{n=1}^{ℓ} β_{n,c} ξ_{n,c}
        − ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} (⟨w_{y_n} − w_c, φ(x_n)⟩ + b_{y_n} − b_c − 2 + ξ_{n,c})    (3.21)

Note that the values of some of the variables are known and fixed; these variables
are called dummy variables in the literature. The dummy variables are

    α_{n,y_n} = 0,  ξ_{n,y_n} = 2,  β_{n,y_n} = 0,  ∀n ∈ {1, …, ℓ} ,

and the remaining multipliers and slacks satisfy

    α_{n,c} ≥ 0,  β_{n,c} ≥ 0,  ξ_{n,c} ≥ 0,  ∀n ∈ {1, …, ℓ}, c ∈ {1, …, d} \ {y_n} .
For convenience, the Lagrangian is reorganized:

    L = (1/2) ∑_{c=1}^{d} ⟨w_c, w_c⟩ + C ∑_{c=1}^{d} ∑_{n=1}^{ℓ} ξ_{n,c} − ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} ⟨w_{y_n}, φ(x_n)⟩
        + ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} ⟨w_c, φ(x_n)⟩ + 2 ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} − ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} ξ_{n,c}
        − ∑_{c=1}^{d} ∑_{n=1}^{ℓ} β_{n,c} ξ_{n,c} − ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} b_{y_n} + ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} b_c    (3.22)
Defining for simplicity the variables

    S_n = ∑_{c=1}^{d} α_{n,c} ,    δ_{n,c} = 1 if y_n = c and 0 otherwise,

and following the same procedure as for the Hard Margin SVMs, the partial
derivatives with respect to the primal variables w_c, b_c and ξ_{n,c} are obtained:

    ∂L/∂w_c = w_c − ∑_{n=1}^{ℓ} S_n δ_{n,c} φ(x_n) + ∑_{n=1}^{ℓ} α_{n,c} φ(x_n)    (3.23)
    ∂L/∂b_c = ∑_{n=1}^{ℓ} α_{n,c} − ∑_{n=1}^{ℓ} δ_{n,c} S_n                        (3.24)
    ∂L/∂ξ_{n,c} = C − α_{n,c} − β_{n,c}                                            (3.25)
Setting the partial derivatives to zero and converting eq. (3.25) into an inequality
constraint as for the Soft Margin SVMs, one obtains

    w_c = ∑_{n=1}^{ℓ} S_n δ_{n,c} φ(x_n) − ∑_{n=1}^{ℓ} α_{n,c} φ(x_n)
        = ∑_{n=1}^{ℓ} (S_n δ_{n,c} − α_{n,c}) φ(x_n)                    (3.26)
    ∑_{n=1}^{ℓ} α_{n,c} = ∑_{n=1}^{ℓ} δ_{n,c} S_n                       (3.27)
    0 ≤ α_{n,c} ≤ C .                                                   (3.28)
By inserting eq. (3.26) into eq. (3.22), the Lagrangian is derived. The slack terms
collect the factor (C − α_{n,c} − β_{n,c}) = 0 and therefore vanish, and the bias terms
vanish by eq. (3.27), leaving

    L = (1/2) ∑_{c=1}^{d} ⟨∑_{n=1}^{ℓ} (S_n δ_{n,c} − α_{n,c}) φ(x_n), ∑_{m=1}^{ℓ} (S_m δ_{m,c} − α_{m,c}) φ(x_m)⟩
        − ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} ⟨∑_{m=1}^{ℓ} (S_m δ_{m,y_n} − α_{m,y_n}) φ(x_m), φ(x_n)⟩
        + ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} ⟨∑_{m=1}^{ℓ} (S_m δ_{m,c} − α_{m,c}) φ(x_m), φ(x_n)⟩
        + 2 ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} .

Expanding the inner products in terms of the kernel and cancelling the two terms
∑_{n,m} S_n δ_{n,c} α_{m,c} k(x_n, x_m) and ∑_{n,m} α_{n,c} S_m δ_{m,c} k(x_n, x_m), which are
equal by symmetry of the kernel, yields

    L = (1/2) ∑_{c=1}^{d} ∑_{n,m=1}^{ℓ} S_n S_m δ_{n,c} δ_{m,c} k(x_n, x_m) − (1/2) ∑_{c=1}^{d} ∑_{n,m=1}^{ℓ} α_{n,c} α_{m,c} k(x_n, x_m)
        − ∑_{c=1}^{d} ∑_{n,m=1}^{ℓ} S_m δ_{m,y_n} α_{n,c} k(x_n, x_m) + ∑_{c=1}^{d} ∑_{n,m=1}^{ℓ} α_{n,c} α_{m,y_n} k(x_n, x_m)
        + 2 ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} .    (3.29)
Now the dual of the WW primal problem can be written as

    max_α  (1/2) ∑_{c=1}^{d} ∑_{n,m=1}^{ℓ} (S_n S_m δ_{n,c} δ_{m,c} − α_{n,c} α_{m,c}
               − 2 S_m δ_{m,y_n} α_{n,c} + 2 α_{n,c} α_{m,y_n}) k(x_n, x_m)
           + 2 ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c}    (3.30)
    s.t.   ∑_{n=1}^{ℓ} α_{n,c} = ∑_{n=1}^{ℓ} δ_{n,c} S_n
           ∀n ∈ {1, …, ℓ}, c ∈ {1, …, d} : 0 ≤ α_{n,c} ≤ C .
3.2.1.1 Vectorized Weston and Watkins Formulation
In the following I will derive an alternative formulation for both the primal and
the dual of WW. The primary goal in this section is to obtain a new formulation of
WW which can be solved efficiently. The primal problem is expressed as follows:

    min  (1/2)‖W‖² + C ∑_{c=1}^{d} ∑_{n=1}^{ℓ} ξ_{n,c}    (3.31)
    s.t. ∀n ∈ {1, …, ℓ}, c ∈ {1, …, d} \ {y_n} :
         ⟨Wφ(x_n) + b, (y_n − y_c)/‖y_n − y_c‖⟩ ≥ 2 − ξ_{n,c}
         ∀n ∈ {1, …, ℓ}, c ∈ {1, …, d} : ξ_{n,c} ≥ 0

Here W is a matrix whose cth row corresponds to w_c in the original WW primal
eq. (3.19), y_c ∈ R^d is the prototype label of class c, i.e. a vector of zeros with a
single one in the cth component, and b ∈ R^d is the bias/offset vector whose cth
entry corresponds to b_c. The Lagrangian of the vectorized WW is
    L = (1/2)‖W‖² + C ∑_{c=1}^{d} ∑_{n=1}^{ℓ} ξ_{n,c}
        − ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} (⟨Wφ(x_n), (y_n − y_c)/‖y_n − y_c‖⟩ − 2 + ξ_{n,c})
        − ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} ⟨b, (y_n − y_c)/‖y_n − y_c‖⟩ − ∑_{c=1}^{d} ∑_{n=1}^{ℓ} β_{n,c} ξ_{n,c} .    (3.32)
Following the same procedure as for the Hard Margin SVMs, the partial derivatives
with respect to the primal variables W, b and ξ_{n,c} are obtained:

    ∂L/∂W = W − ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} (Iφ(x_n))^T (y_n − y_c)/‖y_n − y_c‖    (3.33)
    ∂L/∂b_c = ∑_{n=1}^{ℓ} α_{n,c} − ∑_{n=1}^{ℓ} δ_{y_n,y_c} S_n ,  with S_n = ∑_{c=1}^{d} α_{n,c}    (3.34)
    ∂L/∂ξ_{n,c} = C − α_{n,c} − β_{n,c}    (3.35)
where I is the identity matrix. Setting the partial derivatives to zero and converting
eq. (3.35) into an inequality constraint as for the Soft Margin SVMs, one arrives at

    W = ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} (Iφ(x_n))^T (y_n − y_c)/‖y_n − y_c‖    (3.36)

and

    ∑_{n=1}^{ℓ} α_{n,c} = ∑_{n=1}^{ℓ} δ_{y_n,y_c} S_n    (3.37)
    0 ≤ α_{n,c} ≤ C .    (3.38)
Substituting eq. (3.36) into eq. (3.32), the Lagrangian becomes

    L = (1/2) ∑_{c,e=1}^{d} ∑_{n,m=1}^{ℓ} α_{n,c} α_{m,e} ((y_n − y_c)/‖y_n − y_c‖)^T ((y_m − y_e)/‖y_m − y_e‖) φ(x_n)^T I^T I φ(x_m)
        − ∑_{c,e=1}^{d} ∑_{n,m=1}^{ℓ} α_{n,c} α_{m,e} ((y_n − y_c)/‖y_n − y_c‖)^T ((y_m − y_e)/‖y_m − y_e‖) φ(x_n)^T I^T I φ(x_m)
        + ∑_{c=1}^{d} ∑_{n=1}^{ℓ} (C − α_{n,c} − β_{n,c}) ξ_{n,c} + 2 ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} .

The slack terms vanish because C − α_{n,c} − β_{n,c} = 0, and using
φ(x_n)^T I^T I φ(x_m) = k(x_n, x_m) the Lagrangian reduces to

    L = −(1/2) ∑_{c,e=1}^{d} ∑_{n,m=1}^{ℓ} α_{n,c} α_{m,e} ((y_n − y_c)/‖y_n − y_c‖)^T ((y_m − y_e)/‖y_m − y_e‖) k(x_n, x_m)
        + 2 ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} .
After expressing the Lagrangian solely in terms of the dual variables, the dual
optimization problem of WW can be stated:

    max_α  −(1/2) ∑_{c,e=1}^{d} ∑_{n,m=1}^{ℓ} α_{n,c} α_{m,e} T1 · k(x_n, x_m) + 2 ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c}
    with   T1 = ((y_n − y_c)/‖y_n − y_c‖)^T ((y_m − y_e)/‖y_m − y_e‖)
    s.t.   ∑_{n=1}^{ℓ} α_{n,c} = ∑_{n=1}^{ℓ} δ_{y_n,y_c} S_n
           ∀n ∈ {1, …, ℓ}, c ∈ {1, …, d} : 0 ≤ α_{n,c} ≤ C .
If orthonormal labels are used, T1 can be expressed in terms of the Kronecker
delta δ_{a,b}, which is one for a = b and zero otherwise. The final version of the new
dual of WW is as follows:

    max_α  2 ∑_{n=1}^{ℓ} ∑_{c=1}^{d} α_{n,c} − (1/2) ∑_{n,m=1}^{ℓ} ∑_{c,e=1}^{d} α_{n,c} α_{m,e} v^{y_m,e}_{y_n,c} k(x_n, x_m)    (3.40)
    s.t.   ∑_{n=1}^{ℓ} α_{n,c} = ∑_{n=1}^{ℓ} δ_{y_n,y_c} S_n
           ∀n ∈ {1, …, ℓ}, c ∈ {1, …, d} :
           0 ≤ α_{n,c} ≤ 0 if n ∈ S_c,  0 ≤ α_{n,c} ≤ C if n ∉ S_c

    with   v^{y_m,e}_{y_n,c} = δ_{y_n,y_m} − δ_{y_n,e} − δ_{c,y_m} + δ_{c,e} .

The weight vectors are then given by

    w_c = ∑_{n∈S_c} (∑_{e=1}^{d} α_{n,e}) k(x_n, ·) − ∑_{n∉S_c} α_{n,c} k(x_n, ·) .
If the new dual formulation, given in eq. (3.40), is compared with the original one,
given in eq. (3.30), it can be seen that the new formulation is more suitable for
decomposition algorithms. Although decomposition algorithms will be discussed in
detail in Chapter 5, I want to clarify here why the new formulation is important. If
the original WW formulation is analysed, it can be seen that it contains five sum
operators (note that S_m and S_n also contain a sum operator): two sum operators
run from 1 to ℓ and the other three run from 1 to d. However, the dual problem has
only dℓ unknown variables. Furthermore, the dual problem is quadratic, which hints
that only four sum operators are needed: two running from 1 to ℓ and two running
from 1 to d. This makes clear that the original formulation contains a redundant
sum operator, which is one of the reasons why no efficient caching strategy has been
developed for this machine until now. The new dual formulation, given in eq. (3.40),
contains exactly four sums. Further, the new formulation decomposes the kernel
matrix such that one only needs to store a d² matrix and an ℓ² matrix instead of a
single (dℓ)² matrix. This kind of decomposition of the kernel matrix is memory
friendly and allows the WW method to be applied to large-scale problems. It should
be noted that the decomposition of the kernel matrix does not decrease the
computational complexity of the WW problem; it only decreases the required memory.
3.2.2 The Crammer and Singer Method
Crammer and Singer proposed an alternative multi-class SVM [42] (CS). Like WW,
they take all class relations into account at once and solve a single optimization
problem, however with fewer slack variables. The CS classifier is trained by solving
the primal problem

    min  (1/2) ∑_{c=1}^{d} ⟨w_c, w_c⟩ + C ∑_{n=1}^{ℓ} ξ_n    (3.41)
    s.t. ∀n ∈ {1, …, ℓ}, c ∈ {1, …, d} \ {y_n} :
         ⟨w_{y_n} − w_c, φ(x_n)⟩ ≥ 1 − ξ_n
         ∀n ∈ {1, …, ℓ} : ξ_n ≥ 0 .
Although the dual can be derived from eq. (3.41), the problem will first be rewritten
to obtain a more compact formulation:

    min  (1/2)(1/C) ∑_{c=1}^{d} ⟨w_c, w_c⟩ + ∑_{n=1}^{ℓ} ξ_n    (3.42)
    s.t. ∀n ∈ {1, …, ℓ}, c ∈ {1, …, d} :
         ⟨w_{y_n} − w_c, φ(x_n)⟩ + δ_{y_n,c} ≥ 1 − ξ_n

The important point in this formulation is that the inequality constraints of eq.
(3.41), including the slack constraints ξ_n ≥ 0, are combined and rewritten as the
inequality constraints of eq. (3.42): for c = y_n the constraint reduces to ξ_n ≥ 0. The
Lagrangian of the problem eq. (3.42) is

    L = (1/2)(1/C) ∑_{c=1}^{d} ⟨w_c, w_c⟩ + ∑_{n=1}^{ℓ} ξ_n − ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} ⟨w_{y_n}, φ(x_n)⟩
        + ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} ⟨w_c, φ(x_n)⟩ − ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} δ_{y_n,c} + ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c}
        − ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} ξ_n .    (3.43)
Again the partial derivatives with respect to the primal variables w_c and ξ_n are
obtained:

    ∂L/∂ξ_n = 1 − ∑_{c=1}^{d} α_{n,c} ,  ∀n = 1 … ℓ
    ∂L/∂w_c = (1/C) w_c − ∂/∂w_c (∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} ⟨w_{y_n}, φ(x_n)⟩) + ∂/∂w_c (∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} ⟨w_c, φ(x_n)⟩)

Setting the partial derivative with respect to ξ_n to zero, one obtains a constraint
on the dual variables,

    ∑_{c=1}^{d} α_{n,c} = 1 ,  ∀n = 1 … ℓ    (3.44)

and setting the partial derivative with respect to w_c to zero, one obtains

    (1/C) w_c = T1 − T2 ,
    T1 = ∂/∂w_c (∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} ⟨w_{y_n}, φ(x_n)⟩) ,
    T2 = ∂/∂w_c (∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} ⟨w_c, φ(x_n)⟩) .    (3.45)

For clarity, T1 and T2 will be evaluated separately. It should be noted that the
partial derivative ∂⟨w_{y_n}, φ(x_n)⟩/∂w_c vanishes whenever y_n ≠ c and equals φ(x_n)
whenever y_n = c. Using this fact,

    T1 = ∑_{n: y_n = c} (∑_{p=1}^{d} α_{n,p}) φ(x_n) = ∑_{n: y_n = c} φ(x_n) ,

where the inner sum equals one by eq. (3.44).
T2 will be evaluated as follows:

    T2 = ∂/∂w_c (∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} ⟨w_c, φ(x_n)⟩) = ∑_{n=1}^{ℓ} α_{n,c} φ(x_n)

Inserting these expressions for T1 and T2 back into eq. (3.45), w_c becomes

    w_c = C (∑_{n: y_n = c} φ(x_n) − ∑_{n=1}^{ℓ} α_{n,c} φ(x_n))
        = C ∑_{n=1}^{ℓ} (δ_{y_n,c} − α_{n,c}) φ(x_n)    (3.46)
Now the Lagrangian (3.43) will be rewritten step by step in terms of the dual
variables. The last two terms of the Lagrangian satisfy

    ∑_{n=1}^{ℓ} ∑_{c=1}^{d} α_{n,c} = ∑_{n=1}^{ℓ} (∑_{c=1}^{d} α_{n,c}) = ℓ    (3.47)
    ∑_{n=1}^{ℓ} ∑_{c=1}^{d} α_{n,c} ξ_n = ∑_{n=1}^{ℓ} (∑_{c=1}^{d} α_{n,c}) ξ_n = ∑_{n=1}^{ℓ} ξ_n    (3.48)

using that the inner sums equal one by eq. (3.44).
Eq. (3.47) can be ignored because it is constant, and the term (3.48) cancels against
∑_{n=1}^{ℓ} ξ_n. The Lagrangian thus becomes

    L = L1 − L2 + L3 − ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} δ_{y_n,c}    (3.49)
    with  L1 = (1/2)(1/C) ∑_{c=1}^{d} ⟨w_c, w_c⟩ ,
          L2 = ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} ⟨w_{y_n}, φ(x_n)⟩ ,
          L3 = ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} ⟨w_c, φ(x_n)⟩ .
For clarity, L1, L2 and L3 will be evaluated separately, replacing w_c by the
expression in (3.46):

    L1 = (1/2)(1/C) ∑_{c=1}^{d} ⟨C ∑_{n=1}^{ℓ} (δ_{y_n,c} − α_{n,c}) φ(x_n), C ∑_{m=1}^{ℓ} (δ_{y_m,c} − α_{m,c}) φ(x_m)⟩
       = (C/2) ∑_{n,m=1}^{ℓ} k(x_n, x_m) ∑_{c=1}^{d} (δ_{y_n,c} − α_{n,c}) (δ_{y_m,c} − α_{m,c})    (3.50)
    L2 = ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} ⟨C ∑_{m=1}^{ℓ} (δ_{y_m,y_n} − α_{m,y_n}) φ(x_m), φ(x_n)⟩
       = C ∑_{n,m=1}^{ℓ} k(x_n, x_m) (δ_{y_m,y_n} − α_{m,y_n}) (∑_{c=1}^{d} α_{n,c})
       = C ∑_{n,m=1}^{ℓ} k(x_n, x_m) ∑_{c=1}^{d} δ_{y_n,c} (δ_{y_m,c} − α_{m,c})    (3.51)

where the factor ∑_{c=1}^{d} α_{n,c} equals one by eq. (3.44), and

    L3 = ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} ⟨C ∑_{m=1}^{ℓ} (δ_{y_m,c} − α_{m,c}) φ(x_m), φ(x_n)⟩
       = C ∑_{n,m=1}^{ℓ} k(x_n, x_m) ∑_{c=1}^{d} α_{n,c} (δ_{y_m,c} − α_{m,c}) .    (3.52)
Hence L3 − L2 takes the form

    L3 − L2 = C ∑_{n,m=1}^{ℓ} k(x_n, x_m) ∑_{c=1}^{d} α_{n,c} (δ_{y_m,c} − α_{m,c})
            − C ∑_{n,m=1}^{ℓ} k(x_n, x_m) ∑_{c=1}^{d} δ_{y_n,c} (δ_{y_m,c} − α_{m,c})
            = −C ∑_{n,m=1}^{ℓ} k(x_n, x_m) ∑_{c=1}^{d} (δ_{y_n,c} − α_{n,c}) (δ_{y_m,c} − α_{m,c}) .    (3.53)
Inserting eq. (3.50) and eq. (3.53) into eq. (3.49) yields the Lagrangian:

    L = (C/2) ∑_{n,m=1}^{ℓ} k(x_n, x_m) ∑_{c=1}^{d} (δ_{y_n,c} − α_{n,c}) (δ_{y_m,c} − α_{m,c})
      − C ∑_{n,m=1}^{ℓ} k(x_n, x_m) ∑_{c=1}^{d} (δ_{y_n,c} − α_{n,c}) (δ_{y_m,c} − α_{m,c})
      − ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} δ_{y_n,c}
      = −(C/2) ∑_{n,m=1}^{ℓ} k(x_n, x_m) ∑_{c=1}^{d} (δ_{y_n,c} − α_{n,c}) (δ_{y_m,c} − α_{m,c}) − ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} δ_{y_n,c}    (3.54)
The dual problem of CS is:

    max  −(C/2) ∑_{n,m=1}^{ℓ} k(x_n, x_m) ∑_{c=1}^{d} (δ_{y_n,c} − α_{n,c}) (δ_{y_m,c} − α_{m,c}) − ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} δ_{y_n,c}    (3.55)
    s.t. ∀n ∈ {1, …, ℓ}, c ∈ {1, …, d} : α_{n,c} ≥ 0
         ∀n ∈ {1, …, ℓ} : α_n^T 1 = 1 ,

where 1 ∈ R^d is a vector of ones and α_n = (α_{n,1}, α_{n,2}, …, α_{n,d})^T is the Lagrange
multiplier vector of the nth example. This dual formulation could already be used for
solving the problem. However, in order to have the dual problems of the different
machines as similar as possible and also to have a more compact formulation, the
final version of the CS dual problem is expressed as

    max  −(1/2) ∑_{n,m=1}^{ℓ} k(x_n, x_m) τ_n^T τ_m + β ∑_{n=1}^{ℓ} τ_n^T 1_{y_n}    (3.56)
    s.t. ∀n ∈ {1, …, ℓ} : τ_n ≤ 1_{y_n} ,  τ_n^T 1 = 0 ,

where β = 2C, 1_{y_n} ∈ R^d is a vector of zeros with a single one in the y_nth
component, and τ_n ∈ R^d is an auxiliary vector defined as 1_{y_n} − α_n. The relation
a ≤ b is understood to hold for a, b ∈ R^d if a_i ≤ b_i for all i = 1, …, d. The weight
vectors are given by

    w_c = ∑_{n=1}^{ℓ} τ_{n,c} k(x_n, ·) .
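The substitution τ_n = 1_{y_n} − α_n and the constraints of (3.56) can be sanity-checked numerically: every α_n satisfying the simplex constraints α_{n,c} ≥ 0 and ∑_c α_{n,c} = 1 of eq. (3.44) yields a feasible τ_n. A small sketch (illustrative Python with random simplex points as an assumption):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4

def tau_from_alpha(alpha_n, y_n, d):
    """tau_n = 1_{y_n} - alpha_n, the auxiliary vector of the compact CS dual."""
    one_y = np.zeros(d)
    one_y[y_n] = 1.0
    return one_y - alpha_n

# Draw random alpha_n on the simplex (nonnegative, sum-to-one, as required by
# (3.44)) and check the two constraints of (3.56) for the resulting tau_n.
feasible = True
for _ in range(100):
    a = rng.random(d)
    a /= a.sum()
    y_n = int(rng.integers(d))
    one_y = np.zeros(d)
    one_y[y_n] = 1.0
    t = tau_from_alpha(a, y_n, d)
    feasible = feasible and bool(np.all(t <= one_y + 1e-12)) \
                        and abs(t.sum()) < 1e-12
```

Both constraints hold by construction: τ_{n,c} = −α_{n,c} ≤ 0 for c ≠ y_n, τ_{n,y_n} = 1 − α_{n,y_n} ≤ 1, and the components sum to 1 − 1 = 0.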
If the CS dual formulation given in (3.56) is compared with the WW dual
formulation given in (3.40), it is seen that the two dual formulations are similar to
each other. However, this similarity alone does not allow us to develop a solver for
CS that uses identical or very similar solver technology to the WW solver. As one of
the main contributions of this thesis is developing similar solvers for all machines,
I will reformulate the CS machine in the next section.
3.2.2.1 Vectorized Crammer and Singer Formulation
In the following I will derive an alternative formulation for both the primal and
the dual of CS. The primary goal in this section is to obtain a new formulation of CS
which can be solved efficiently and which is also easy to implement. The primal
problem is expressed as follows:

    min  (1/2)‖W‖² + C ∑_{n=1}^{ℓ} ξ_n    (3.57)
    s.t. ∀n ∈ {1, …, ℓ}, c ∈ {1, …, d} \ {y_n} :
         ⟨Wφ(x_n), (y_n − y_c)/‖y_n − y_c‖⟩ ≥ 1 − ξ_n
         ∀n ∈ {1, …, ℓ} : ξ_n ≥ 0

Here W is a matrix whose cth row corresponds to w_c in the original CS primal
eq. (3.42), and y_c ∈ R^d is the prototype label of class c, i.e. a vector of zeros with
a single one in the cth component. The Lagrangian of the vectorized CS is
    L = (1/2)‖W‖² + C ∑_{n=1}^{ℓ} ξ_n
        − ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} (⟨Wφ(x_n), (y_n − y_c)/‖y_n − y_c‖⟩ − 1 + ξ_n) − ∑_{n=1}^{ℓ} β_n ξ_n .    (3.58)
Following the same procedure as for the Hard Margin SVMs, the partial derivatives
with respect to the primal variables W and ξ_n are obtained:

    ∂L/∂W = W − ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} (Iφ(x_n))^T (y_n − y_c)/‖y_n − y_c‖    (3.59)
    ∂L/∂ξ_n = C − ∑_{c=1}^{d} α_{n,c} − β_n    (3.60)

where I is the identity matrix. Setting the partial derivatives to zero and converting
eq. (3.60) into an inequality constraint as for the Soft Margin SVMs, one arrives at:

    W = ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} (Iφ(x_n))^T (y_n − y_c)/‖y_n − y_c‖    (3.61)
    ∑_{c=1}^{d} α_{n,c} ≤ C    (3.62)
Substituting eq. (3.61) into eq. (3.58), the Lagrangian becomes

    L = (1/2) ∑_{c,e=1}^{d} ∑_{n,m=1}^{ℓ} α_{n,c} α_{m,e} ((y_n − y_c)/‖y_n − y_c‖)^T ((y_m − y_e)/‖y_m − y_e‖) φ(x_n)^T I^T I φ(x_m)
        − ∑_{c,e=1}^{d} ∑_{n,m=1}^{ℓ} α_{n,c} α_{m,e} ((y_n − y_c)/‖y_n − y_c‖)^T ((y_m − y_e)/‖y_m − y_e‖) φ(x_n)^T I^T I φ(x_m)
        + ∑_{n=1}^{ℓ} (C − ∑_{c=1}^{d} α_{n,c} − β_n) ξ_n + ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} .

The slack terms vanish because C − ∑_{c=1}^{d} α_{n,c} − β_n = 0, and with
φ(x_n)^T I^T I φ(x_m) = k(x_n, x_m) the Lagrangian can be stated as:

    L = −(1/2) ∑_{c,e=1}^{d} ∑_{n,m=1}^{ℓ} α_{n,c} α_{m,e} ((y_n − y_c)/‖y_n − y_c‖)^T ((y_m − y_e)/‖y_m − y_e‖) k(x_n, x_m)
        + ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} .
After expressing the Lagrangian solely in terms of the dual variables, the dual
optimization problem of CS can be stated as:

    max_α  −(1/2) ∑_{c,e=1}^{d} ∑_{n,m=1}^{ℓ} α_{n,c} α_{m,e} T1 · k(x_n, x_m) + ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c}    (3.63)
    with   T1 = ((y_n − y_c)/‖y_n − y_c‖)^T ((y_m − y_e)/‖y_m − y_e‖)
    s.t.   ∀n ∈ {1, …, ℓ} : ∑_{c=1}^{d} α_{n,c} ≤ C
           ∀n ∈ {1, …, ℓ}, c ∈ {1, …, d} : α_{n,c} ≥ 0    (3.64)
If orthonormal labels are used, T1 can be expressed in terms of the Kronecker
delta δ_{a,b}, which is one for a = b and zero otherwise. The final version of the new
dual of CS is as follows:

    max_α  ∑_{n=1}^{ℓ} ∑_{c=1}^{d} α_{n,c} − (1/2) ∑_{n,m=1}^{ℓ} ∑_{c,e=1}^{d} α_{n,c} α_{m,e} v^{y_m,e}_{y_n,c} k(x_n, x_m)    (3.65)
    s.t.   ∀n ∈ {1, …, ℓ}, c ∈ {1, …, d} : α_{n,c} ≥ 0
           ∀n ∈ {1, …, ℓ} : ∑_{c=1}^{d} α_{n,c} ≤ C ,    (3.66)

    with   v^{y_m,e}_{y_n,c} = δ_{y_n,y_m} − δ_{y_n,e} − δ_{c,y_m} + δ_{c,e} .
The new formulation given in (3.65) is very similar to the WW formulation given in
(3.40); indeed, the objective function of the new CS formulation is identical to that
of the new WW formulation. There are only two differences: the first is the equality
constraint of WW, which results from the bias/offset term, and the second is the
sum constraint of CS given in eq. (3.66). These issues do not create important
differences from the perspective of solver development. In other words, one can
develop very similar solvers for CS and WW. Thus, CS will also enjoy the new
caching technique developed for WW (see Section 3.2.1.1).
3.2.3 Lee, Lin, & Wahba SVM
Lee et al. (LLW, [112]) have proposed an alternative approach to multi-class SVM
classification which is structurally distinct from the WW machine and its
simplification, the CS machine. Before explaining the LLW machine, the notion of a
classification calibrated/Fisher consistent loss function should be defined. A loss
function L(f(x), y) is classification calibrated/Fisher consistent if and only if
argmax_{j=1,…,d} f*_j(x) = argmax_{j=1,…,d} P(Y = j | x), where f*(x) = (f*_1(x), …, f*_d(x))
is the minimizer of E[L(f(X), Y) | X = x]. The analysis of Tewari and Bartlett [154]
shows that this machine relies on a so-called classification calibrated loss function,
which guarantees Fisher consistency. Its primal problem can be stated as
problem can be stated as
minwc
1
2
d∑
c=1
〈wc, wc〉+ C
ℓ∑
n=1
d∑
c=1
ξn,c
s.t. ∀n ∈ 1, . . . , ℓ, c ∈ 1, . . . , d \ yn :
〈wc,φ(xn) + bc〉 ≤ −1
d− 1+ ξn,c
∀n ∈ 1, . . . , ℓ, c ∈ 1, . . . , d : ξn,c ≥ 0
∀h ∈ H :
d∑
c=1
(〈wc, h〉+ bc) = 0 . (3.67)
If the feature map is injective, then the sum-to-zero constraint (3.67) can be
expressed as ∑_{c=1}^{d} w_c = 0 and ∑_{c=1}^{d} b_c = 0. I will derive the dual problem
for the case of an injective feature map; for the case of a non-injective feature map,
details can be found in [112]. The Lagrangian of the LLW primal problem is

    L = (1/2) ∑_{c=1}^{d} ⟨w_c, w_c⟩ + C ∑_{c=1}^{d} ∑_{n=1}^{ℓ} ξ_{n,c} + ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} ⟨w_c, φ(x_n)⟩
        + (1/(d−1)) ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} − ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} ξ_{n,c} + ρ ∑_{c=1}^{d} w_c
        + ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} b_c + γ ∑_{c=1}^{d} b_c − ∑_{n=1}^{ℓ} ∑_{c=1}^{d} β_{n,c} ξ_{n,c} .    (3.68)
As for the previous machines, the partial derivatives with respect to the primal
variables w_c, b_c and ξ_{n,c} are obtained:

    ∂L/∂w_c = w_c + ∑_{n=1}^{ℓ} α_{n,c} φ(x_n) + ρ
    ∂L/∂b_c = ∑_{n=1}^{ℓ} α_{n,c} + γ
    ∂L/∂ξ_{n,c} = C − α_{n,c} − β_{n,c}

If one takes the partial derivative of the Lagrangian with respect to ρ, one obtains
exactly the sum-to-zero constraint; this constraint will be used in the following
steps in order to find a relation between α_{n,c} and ρ. Setting the partial derivatives
with respect to b_c and ξ_{n,c} to zero, one obtains

    −γ = ∑_{n=1}^{ℓ} α_{n,c}    (3.69)
    0 = C − α_{n,c} − β_{n,c} .    (3.70)

The constraint (3.69) ensures that all d sums ∑_{n=1}^{ℓ} α_{n,c} take the same value
−γ ∈ R; the value of −γ itself does not matter. Like in Section 2.3.3.2, (3.70) can be
expressed as an inequality constraint on α_{n,c}. Setting the partial derivative with
respect to w_c to zero, one obtains

    w_c = −(∑_{n=1}^{ℓ} α_{n,c} φ(x_n) + ρ) .    (3.71)
By substituting (3.71) into eq. (3.68), one can eliminate the dependence of the
Lagrangian on the primal variables. Writing a_c = ∑_{n=1}^{ℓ} α_{n,c} φ(x_n), so that
w_c = −(a_c + ρ), the terms involving b_c vanish by eq. (3.69) and the slack terms
vanish by eq. (3.70), leaving

    L = (1/2) ∑_{c=1}^{d} ⟨a_c + ρ, a_c + ρ⟩ − ∑_{c=1}^{d} ⟨a_c + ρ, a_c⟩ − ∑_{c=1}^{d} ⟨ρ, a_c + ρ⟩
        + (1/(d−1)) ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c}

      = −(1/2) ∑_{c=1}^{d} ∑_{n,m=1}^{ℓ} α_{n,c} α_{m,c} k(x_n, x_m) − (d/2) ⟨ρ, ρ⟩ − ∑_{c=1}^{d} ⟨ρ, a_c⟩
        + (1/(d−1)) ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} .
Utilizing the sum-to-zero constraint ∑_{c=1}^{d} w_c = 0, a relation between ρ and
α_{n,c} is derived:

    0 = ∑_{s=1}^{d} w_s = −∑_{s=1}^{d} (∑_{n=1}^{ℓ} α_{n,s} φ(x_n) + ρ)
    ρ = −(1/d) ∑_{s=1}^{d} ∑_{n=1}^{ℓ} α_{n,s} φ(x_n) .    (3.72)
With relation eq. (3.72), the Lagrangian of the LLW machine is

    L = −(1/2) ∑_{c=1}^{d} ∑_{n,m=1}^{ℓ} α_{n,c} α_{m,c} k(x_n, x_m)
        − (1/(2d²)) ∑_{c=1}^{d} ∑_{s,u=1}^{d} ∑_{n,m=1}^{ℓ} α_{n,s} α_{m,u} k(x_n, x_m)
        + (1/d) ∑_{s,c=1}^{d} ∑_{n,m=1}^{ℓ} α_{n,s} α_{m,c} k(x_n, x_m)
        + (1/(d−1)) ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c}

      = −(1/2) ∑_{c=1}^{d} ∑_{n,m=1}^{ℓ} α_{n,c} α_{m,c} k(x_n, x_m)
        + (1/(2d)) ∑_{s,c=1}^{d} ∑_{n,m=1}^{ℓ} α_{n,s} α_{m,c} k(x_n, x_m)
        + (1/(d−1)) ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} .    (3.73)
Multiplying by 2d, which does not change the maximizer, eq. (3.73) can be written
in a more compact way:

    2d·L = −d ∑_{c=1}^{d} ∑_{n,m=1}^{ℓ} α_{n,c} α_{m,c} k(x_n, x_m) + ∑_{s,c=1}^{d} ∑_{n,m=1}^{ℓ} α_{n,s} α_{m,c} k(x_n, x_m)
           + (2d/(d−1)) ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c}

         = ∑_{s=1}^{d} (∑_{c=1}^{d} ∑_{n,m=1}^{ℓ} α_{n,s} α_{m,c} k(x_n, x_m) − ∑_{c=1}^{d} ∑_{n,m=1}^{ℓ} α_{n,c} α_{m,c} k(x_n, x_m))
           + (2d/(d−1)) ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c}

         = ∑_{s=1}^{d} ∑_{c=1}^{d} ∑_{n,m=1}^{ℓ} (α_{n,s} − α_{n,c}) α_{m,c} k(x_n, x_m) + (2d/(d−1)) ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} .
The corresponding dual problem of LLW is:

    max_α  (1/(d−1)) ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} − (1/2) ∑_{c,e=1}^{d} ∑_{n,m=1}^{ℓ} (δ_{c,e} − 1/d) α_{n,c} α_{m,e} k(x_n, x_m)    (3.74)
    s.t.   ∀c ∈ {1, …, d} : ∑_{n=1}^{ℓ} α_{n,c} = −γ
           ∀n ∈ {1, …, ℓ}, ∀c ∈ {1, …, d} \ {y_n} : 0 ≤ α_{n,c} ≤ C    (3.75)
           ∀n ∈ {1, …, ℓ} : α_{n,y_n} = 0
The LLW dual problem is also similar to those of WW and CS; the main difference
is the interpretation of the dual variables. These similarities make it possible to
develop solvers for all methods that use similar algorithms, which in the end makes
it possible to draw conclusions not only about the classification accuracies but also
about the training times.
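The class coupling of the LLW dual (3.74) is governed by the d×d matrix with entries δ_{c,e} − 1/d, i.e. the centering matrix I − (1/d)11^T. This matrix is an orthogonal projection and hence positive semidefinite, so the quadratic term of (3.74) is concave in the maximization; this can be checked numerically (illustrative Python):

```python
import numpy as np

d = 5
# Class-coupling matrix of the LLW dual: M[c, e] = delta(c, e) - 1/d.
M = np.eye(d) - np.ones((d, d)) / d

# M is the centering projection: idempotent and positive semidefinite,
# with eigenvalue 0 (once, on the all-ones direction) and 1 (d - 1 times).
idempotent = np.allclose(M @ M, M)
eigenvalues = np.linalg.eigvalsh(M)
```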
Chapter 4
Unified View to All-in-One
Multi-class Machines
In this section, I will develop a unified view on all-in-one MCSVM machines. The
analysis will focus on three key design choices for MCSVMs, which so far have
only been described for each of the machines in isolation. The first concerns the
hypothesis class considered, namely the presence or absence of a bias or offset
term. The second is related to the loss function used for machine training. The
third is the margin concept used in the machine. The existing loss functions vary in
whether their margin definitions are absolute or relative and in how the penalty term
depends on different kinds of margin violations. I derive a unifying template for
the primal as well as the dual optimization problems arising when training the
different machines. From this view it will become apparent that one machine is
missing to complete the picture of all-in-one support vector machines (SVMs). I
derive this novel multi-class SVM variant, which results from bringing together
concepts of the CS and LLW machines.
All machines reviewed in the previous section can be cast into the common
primal form

    min_f  (1/2)‖f‖² + C · ∑_{n=1}^{ℓ} λ(ν(f_{y_n}(x_n), f_1(x_n)), …, ν(f_{y_n}(x_n), f_d(x_n))) .
They differ in three components. The first of these is the set of variables over
which the primal objective is minimized, which can be either f = w ∈ H or
f = (w, b) ∈ H × R. The second variation is the margin function ν : R × R → R,
which can encode a relative margin concept by ν(u, v) = u − v or an absolute
margin by ν(u, v) = −1 − v. Third, the methods differ in how the margin values are
composed into a loss by the function λ, which amounts to taking either the sum or
the maximum of its arguments.
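The two margin functions ν and the two loss compositions λ can be written down directly, and combining them reproduces the per-example losses of the four machine families. A sketch (illustrative Python; the unit target margin and hinge-style clipping are simplifying assumptions, e.g. LLW actually uses the threshold −1/(d−1)):

```python
# Margin functions of the unified primal form.
def nu_relative(u, v):      # relative margin: nu(u, v) = u - v
    return u - v

def nu_absolute(u, v):      # absolute margin: nu(u, v) = -1 - v
    return -1 - v

def loss(f, y, nu, compose, target=1.0):
    """f: per-class scores f_c(x); y: index of the true class.
    The margin violation for class c != y is max(0, target - nu(f[y], f[c]));
    compose is either sum (WW/LLW style) or max (CS/DGI style)."""
    violations = [max(0.0, target - nu(f[y], f[c]))
                  for c in range(len(f)) if c != y]
    return compose(violations)

f = [2.0, 0.5, -1.0]   # illustrative scores, true class 0
sum_rel = loss(f, 0, nu_relative, sum)   # sum-loss, relative margin
max_rel = loss(f, 0, nu_relative, max)   # max-loss, relative margin
sum_abs = loss(f, 0, nu_absolute, sum)   # sum-loss, absolute margin
max_abs = loss(f, 0, nu_absolute, max)   # max-loss, absolute margin
```

With these scores the relative-margin losses vanish (the true class wins by more than the target), while the absolute-margin losses are positive because f_2(x) = 0.5 exceeds the required negative threshold.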
These differences correspond to properties of the dual problems. The elements
common to all dual problems are

    max_α  ∑_{n=1}^{ℓ} ∑_{c=1}^{d} α_{n,c} − (1/2) ∑_{n,m=1}^{ℓ} ∑_{c,e=1}^{d} α_{n,c} α_{m,e} · M(y_n, c, y_m, e) · k(x_n, x_m)
    s.t.   0 ≤ α_{n,c} ≤ C   ∀n ∈ {1, …, ℓ}, c ∈ {1, …, d} \ {y_n}
           α_{n,y_n} = 0   ∀n ∈ {1, …, ℓ} ,

possibly augmented by the additional bias constraint

    ∑_{m=1}^{ℓ} ∑_{e=1}^{d} N(c, y_m, e) α_{m,e} = 0   ∀c ∈ {1, …, d}    (4.1)

or the max-loss constraint

    ∑_{c=1}^{d} α_{n,c} ≤ C   ∀n ∈ {1, …, ℓ} .    (4.2)

The resulting weight vectors take the form

    w_c = ∑_{m=1}^{ℓ} (∑_{e=1}^{d} N(c, y_m, e) α_{m,e}) φ(x_m) .
The WW, CS, and LLW machines differ only in the form of the coefficients M
and N, and in the presence or absence of the constraints (4.1) and (4.2). Table 4.1
summarizes these insights and connects properties of the primal and dual problems.
It distinguishes the presence of a bias term (right), the type of loss function used
(left), and the margin concept applied (top):

                                      relative margin               absolute margin
                                      ν(u,v) = u − v                ν(u,v) = −1 − v
                                      M = δ_{y_n,y_m} + δ_{c,e}     M = δ_{c,e} − 1/d
                                          − δ_{y_n,e} − δ_{y_m,c}
                                      N = δ_{c,e} − δ_{y_m,c}       N = 1/d − δ_{c,e}

    sum-loss     with bias,
                 constraint (4.1)     WW                            LLW
                 without bias         WW without bias               LLW without bias
    max-loss,    with bias,
    constraint   constraint (4.1)     –                             –
    (4.2)        without bias         CS                            ?

Table 4.1: Unified view on primal and dual problems of multi-class support vector machine classifiers.
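The table entries can be cross-checked against the duals derived earlier: the relative-margin coefficient M of Table 4.1 is exactly the coefficient v^{y_m,e}_{y_n,c} of the WW dual (3.40). A short check over all index combinations (illustrative Python):

```python
def M_relative(yn, c, ym, e):
    """Relative-margin coefficient M of Table 4.1 (left column)."""
    return (yn == ym) + (c == e) - (yn == e) - (ym == c)

def v_ww(yn, c, ym, e):
    """Coefficient v of the WW dual (3.40)."""
    return (yn == ym) - (yn == e) - (c == ym) + (c == e)

# Both expressions agree term by term for every index combination.
d = 4
match = all(M_relative(yn, c, ym, e) == v_ww(yn, c, ym, e)
            for yn in range(d) for c in range(d)
            for ym in range(d) for e in range(d))
```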
In practice the max-loss is not used in combination with a bias term. This is for
obvious reasons. Although possible in principle, the solution of the corresponding
dual problem is difficult because of the interfering constraints (4.1) and (4.2). Thus,
in practice the maximum-loss can only be applied to machines without bias.
The unified view also reveals that one machine is missing, namely the combination of max-loss (without bias) and absolute margin. From Table 4.1 it becomes obvious how the primal and dual of such a new machine should look. However, while the correctness of the table entries belonging to the already known machines can easily be verified, it has to be proven for the new machine, as done in the next section.
4.1 Novel Approach to Multi-Class SVM Classification
In this section, I derive a novel support vector machine for multi-class classification that combines the max-loss with the absolute margin concept; this machine is referred to as DGI. The corresponding primal problem is
\[
\begin{aligned}
\min_{w,b,\xi} \quad & \frac{1}{2}\sum_{c=1}^{d} \langle w_c, w_c\rangle + C \sum_{n=1}^{\ell} \xi_n \\
\text{s.t.} \quad & \langle w_c, \varphi(x_n)\rangle + b_c \le -\frac{1}{d-1} + \xi_n \quad \forall n \in \{1,\dots,\ell\},\; c \in \{1,\dots,d\}\setminus\{y_n\} \\
& \xi_n \ge 0 \quad \forall n \in \{1,\dots,\ell\} \\
& \sum_{c=1}^{d} \bigl(\langle w_c, h\rangle + b_c\bigr) = 0 \quad \forall h \in \mathcal{H}\,. \tag{4.3}
\end{aligned}
\]
As in the LLW machine, if the feature map is injective then the sum-to-zero constraint (4.3) can be expressed as \(\sum_{c=1}^{d} w_c = 0\) and \(\sum_{c=1}^{d} b_c = 0\). The Lagrangian of the primal problem of DGI is
\[
\begin{aligned}
L = {} & \frac{1}{2}\sum_{c=1}^{d}\langle w_c, w_c\rangle + C\sum_{n=1}^{\ell}\xi_n + \sum_{n=1}^{\ell}\sum_{c=1}^{d} \alpha_{n,c}\left(\langle w_c,\varphi(x_n)\rangle + b_c + \frac{1}{d-1} - \xi_n\right) \\
& - \sum_{n=1}^{\ell}\beta_n\xi_n + \Bigl\langle \rho,\, \sum_{c=1}^{d} w_c \Bigr\rangle + \gamma\sum_{c=1}^{d} b_c \tag{4.4}
\end{aligned}
\]
with \(\alpha_{n,c} \ge 0\), \(\beta_n \ge 0\), and \(\rho \in \mathcal{H}\), \(\gamma \in \mathbb{R}\) unconstrained. Following the same procedure as in the previous chapters, the partial derivatives with respect to the primal variables \(w_c\), \(b_c\), and \(\xi_n\) are obtained:
\[
\frac{\partial L}{\partial w_c} = w_c + \sum_{n=1}^{\ell} \alpha_{n,c}\varphi(x_n) + \rho \tag{4.5}
\]
\[
\frac{\partial L}{\partial b_c} = \sum_{n=1}^{\ell} \alpha_{n,c} + \gamma \tag{4.6}
\]
\[
\frac{\partial L}{\partial \xi_n} = C - \sum_{c=1}^{d}\alpha_{n,c} - \beta_n \tag{4.7}
\]
Setting the partial derivative with respect to \(\xi_n\) to zero and using \(\alpha_{n,c} \ge 0\) and \(\beta_n \ge 0\), one obtains a constraint on the dual variables:
\[
0 \le \sum_{c=1}^{d}\alpha_{n,c} \le C \tag{4.9}
\]
As in Section 2.3.3.2, (4.9) can be expressed as an inequality constraint on the dual variables; it is exactly the max-loss constraint (4.2).
Setting the partial derivatives with respect to \(w_c\) and \(b_c\) to zero one obtains
\[
-\gamma = \sum_{n=1}^{\ell} \alpha_{n,c} \tag{4.10}
\]
\[
w_c = -\left(\sum_{n=1}^{\ell} \alpha_{n,c}\varphi(x_n) + \rho\right). \tag{4.11}
\]
The constraint (4.10) ensures that all \(d\) sums \(\sum_{n=1}^{\ell}\alpha_{n,c}\) take the same value \(-\gamma \in \mathbb{R}\); the value of \(-\gamma\) itself does not matter. By substituting eq. (4.11) into eq. (4.4), the dependence of the Lagrangian on the primal variables can be eliminated:
\[
\begin{aligned}
L = {} & \frac{1}{2}\sum_{c=1}^{d} \Bigl\langle \sum_{n=1}^{\ell}\alpha_{n,c}\varphi(x_n) + \rho,\; \sum_{m=1}^{\ell}\alpha_{m,c}\varphi(x_m) + \rho \Bigr\rangle \\
& + \sum_{c=1}^{d}\sum_{n=1}^{\ell} \alpha_{n,c}\Bigl(\langle w_c, \varphi(x_n)\rangle + \frac{1}{d-1}\Bigr) + \Bigl\langle \rho,\, \sum_{c=1}^{d} w_c \Bigr\rangle\,,
\end{aligned}
\]
where the terms involving \(\xi_n\) and \(b_c\) vanish because of (4.7) and (4.6). Expanding the inner products with \(w_c\) replaced by (4.11) gives
\[
\begin{aligned}
L = {} & \frac{1}{2}\sum_{c=1}^{d}\sum_{n,m=1}^{\ell}\alpha_{n,c}\alpha_{m,c}k(x_n,x_m) + \Bigl\langle\rho,\, \sum_{c=1}^{d}\sum_{n=1}^{\ell}\alpha_{n,c}\varphi(x_n)\Bigr\rangle + \frac{d}{2}\langle\rho,\rho\rangle \\
& - \sum_{c=1}^{d}\sum_{n,m=1}^{\ell}\alpha_{n,c}\alpha_{m,c}k(x_n,x_m) - \Bigl\langle\rho,\, \sum_{c=1}^{d}\sum_{n=1}^{\ell}\alpha_{n,c}\varphi(x_n)\Bigr\rangle + \frac{1}{d-1}\sum_{c=1}^{d}\sum_{n=1}^{\ell}\alpha_{n,c} \\
& - \Bigl\langle\rho,\, \sum_{c=1}^{d}\sum_{n=1}^{\ell}\alpha_{n,c}\varphi(x_n)\Bigr\rangle - d\langle\rho,\rho\rangle \\
= {} & -\frac{1}{2}\sum_{c=1}^{d}\sum_{n,m=1}^{\ell}\alpha_{n,c}\alpha_{m,c}k(x_n,x_m) - \frac{d}{2}\langle\rho,\rho\rangle - \Bigl\langle\rho,\, \sum_{c=1}^{d}\sum_{n=1}^{\ell}\alpha_{n,c}\varphi(x_n)\Bigr\rangle \\
& + \frac{1}{d-1}\sum_{c=1}^{d}\sum_{n=1}^{\ell}\alpha_{n,c}\,.
\end{aligned}
\]
To derive the dual one needs a relation between \(\rho\) and the \(\alpha_{n,c}\). For this, the sum-to-zero constraint of the problem is used:
\[
0 = \sum_{c=1}^{d} w_c = -d\rho - \sum_{c=1}^{d}\sum_{n=1}^{\ell}\alpha_{n,c}\varphi(x_n)
= -d\rho - \sum_{n=1}^{\ell}(C-\beta_n)\varphi(x_n)\,,
\]
where the last equality uses (4.7). Hence \(\rho\) can be expressed as
\[
\rho = -\frac{1}{d}\sum_{n=1}^{\ell}(C-\beta_n)\varphi(x_n) = -\frac{1}{d}\sum_{c=1}^{d}\sum_{n=1}^{\ell}\alpha_{n,c}\varphi(x_n)\,.
\]
Substituting \(\rho\), the Lagrangian in terms of the dual variables becomes
\[
\begin{aligned}
L = {} & -\frac{1}{2}\sum_{c=1}^{d}\sum_{n,m=1}^{\ell}\alpha_{n,c}\alpha_{m,c}k(x_n,x_m) \\
& - \frac{d}{2}\Bigl\langle \frac{1}{d}\sum_{s=1}^{d}\sum_{n=1}^{\ell}\alpha_{n,s}\varphi(x_n),\; \frac{1}{d}\sum_{u=1}^{d}\sum_{m=1}^{\ell}\alpha_{m,u}\varphi(x_m)\Bigr\rangle \\
& + \frac{1}{d}\Bigl\langle \sum_{s=1}^{d}\sum_{m=1}^{\ell}\alpha_{m,s}\varphi(x_m),\; \sum_{c=1}^{d}\sum_{n=1}^{\ell}\alpha_{n,c}\varphi(x_n)\Bigr\rangle + \frac{1}{d-1}\sum_{c=1}^{d}\sum_{n=1}^{\ell}\alpha_{n,c} \\
= {} & -\frac{1}{2}\sum_{c=1}^{d}\sum_{n,m=1}^{\ell}\alpha_{n,c}\alpha_{m,c}k(x_n,x_m) - \frac{1}{2d}\sum_{s,u=1}^{d}\sum_{n,m=1}^{\ell}\alpha_{n,s}\alpha_{m,u}k(x_n,x_m) \\
& + \frac{1}{d}\sum_{s,c=1}^{d}\sum_{n,m=1}^{\ell}\alpha_{n,s}\alpha_{m,c}k(x_n,x_m) + \frac{1}{d-1}\sum_{c=1}^{d}\sum_{n=1}^{\ell}\alpha_{n,c} \\
= {} & -\frac{1}{2}\sum_{c=1}^{d}\sum_{n,m=1}^{\ell}\alpha_{n,c}\alpha_{m,c}k(x_n,x_m) + \frac{1}{2d}\sum_{s,c=1}^{d}\sum_{n,m=1}^{\ell}\alpha_{n,s}\alpha_{m,c}k(x_n,x_m) \\
& + \frac{1}{d-1}\sum_{c=1}^{d}\sum_{n=1}^{\ell}\alpha_{n,c}\,.
\end{aligned}
\]
To write this more compactly, multiply by \(2d\), which does not change the maximizer:
\[
\begin{aligned}
2d\,L = {} & -d\sum_{c=1}^{d}\sum_{n,m=1}^{\ell}\alpha_{n,c}\alpha_{m,c}k(x_n,x_m) + \sum_{s,c=1}^{d}\sum_{n,m=1}^{\ell}\alpha_{n,s}\alpha_{m,c}k(x_n,x_m) \\
& + \frac{2d}{d-1}\sum_{c=1}^{d}\sum_{n=1}^{\ell}\alpha_{n,c} \\
= {} & \sum_{s=1}^{d}\left(\sum_{c=1}^{d}\sum_{n,m=1}^{\ell}\alpha_{n,s}\alpha_{m,c}k(x_n,x_m) - \sum_{c=1}^{d}\sum_{n,m=1}^{\ell}\alpha_{n,c}\alpha_{m,c}k(x_n,x_m)\right) \\
& + \frac{2d}{d-1}\sum_{c=1}^{d}\sum_{n=1}^{\ell}\alpha_{n,c} \\
= {} & \sum_{s,c=1}^{d}\sum_{n,m=1}^{\ell}(\alpha_{n,s}-\alpha_{n,c})\,\alpha_{m,c}\,k(x_n,x_m) + \frac{2d}{d-1}\sum_{c=1}^{d}\sum_{n=1}^{\ell}\alpha_{n,c}\,.
\end{aligned}
\]
Finally, the dual optimization problem is
\[
\begin{aligned}
\max_\alpha \quad & \sum_{c,e=1}^{d}\sum_{n,m=1}^{\ell}(\alpha_{n,e}-\alpha_{n,c})\,k(x_n,x_m)\,\alpha_{m,c} + \frac{2d}{d-1}\sum_{c=1}^{d}\sum_{n=1}^{\ell}\alpha_{n,c} \tag{4.12} \\
\text{s.t.} \quad & \sum_{n=1}^{\ell}\alpha_{n,c} = -\gamma \quad \forall c \in \{1,\dots,d\} \tag{4.13} \\
& \alpha_{n,c} \ge 0 \quad \forall n \in \{1,\dots,\ell\},\; c \in \{1,\dots,d\} \\
& \sum_{c=1}^{d}\alpha_{n,c} \le C \quad \forall n \in \{1,\dots,\ell\}\,.
\end{aligned}
\]
The DGI dual problem is thus similar to those of WW, CS, and LLW; the main difference is the interpretation of the dual variables. These similarities make it possible to develop solvers for all methods based on similar algorithms, which in turn allows conclusions to be drawn not only about the classification accuracies but also about the optimization times.
Chapter 5
Solvers
In the previous sections, I have reviewed the multi-class SVM quadratic programs; the main topic of this chapter is how to solve these multi-class problems efficiently. There are several strategies for solving quadratic programs in primal and in dual form (5.1). I summarize these methods in Section 5.1, and one of the contributions of this thesis, namely a new solver for multi-class SVMs, is described in Section 5.2.
5.1 Related Work
Researchers have applied many kinds of optimization techniques to the primal or dual problems of SVMs in order to train them efficiently. In this section, I briefly summarize some of these techniques. However, such a summary must be restricted to the most important and popular approaches, because the research related to the SVM optimization problem is extensive.
5.1.1 Interior Point Methods
Interior point methods are the workhorses of non-linear optimization and have been applied successfully to many different problems in different domains [29, 166]. The main idea of interior point methods is to replace the constraints of the SVM problem with barrier functions and to solve a series of unconstrained quadratic problems. Interior point methods have been applied to SVM problems [66, 164, 165]. Their main advantages are that the required number of iterations is low, more precisely \(O(\log(\log(1/\epsilon)))\), where \(\epsilon\) is the required accuracy of the optimization problem, and that they reach very high accuracy in the optimization problem [166]. However, interior point methods have several disadvantages. First of all, their runtime requirement is \(O(s^3)\), where \(s\) denotes the number of parameters of the optimization problem. To underline the importance of this fact for multi-class SVMs, let us compare OVA with any all-in-one multi-class machine. In the case of OVA, each of the \(d\) problems has exactly \(\ell\) variables, so the total runtime requirement of an interior point algorithm is \(d \cdot O(\ell^3)\). In the case of all-in-one multi-class machines, however, the number of parameters is \(\ell \times d\) and the runtime requirement is \(O((d\ell)^3)\). The second point is that interior point methods require \(O(s^2)\) memory, which is also problematic for all-in-one multi-class machines. Finally, they are numerically sensitive [119]. It is important to note that reaching high accuracy in an optimization problem does not generally mean high classification accuracy [25]. It is clear that for large-scale problems, or even for a large number of classes with a small number of examples per class, interior point methods are not applicable to all-in-one multi-class machines.
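To make the gap concrete (illustrative numbers, not measurements from the thesis), the all-in-one problem is a factor \(d^2\) more expensive than the OVA decomposition for an interior point method:

```python
ell, d = 1000, 10                   # illustrative problem size
ova_cost = d * ell ** 3             # d binary problems with ell variables each
all_in_one_cost = (d * ell) ** 3    # one problem with d * ell variables
print(all_in_one_cost // ova_cost)  # ratio is d ** 2 = 100
```

With 10 classes, the all-in-one formulation is already two orders of magnitude more expensive in this cost model.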
5.1.2 Direct Optimization of Primal Problem
Most of the methods for solving SVM optimization problems deal with the dual of the SVM problem. However, SVM optimization problems can also be solved directly in the primal. The main idea is that one can rewrite the weight vectors of the primal using the representer theorem [104] in the form \(w_c = \sum_{n=1}^{\ell} \alpha_{n,c}\varphi(x_n)\) and directly optimize the primal in this formulation. There is a problem in the case of the hinge loss, however, because the hinge loss is non-smooth. Chapelle proposed to use smooth loss functions together with Newton's method [37].
5.1.3 On-line Methods
Generally speaking, SVMs were developed for batch training and most solvers were also developed for this case. Recently, several researchers have proposed on-line training of binary SVMs and also of CS [35, 21, 19, 20]. In these methods, first order SMO [128] is used as an internal solver. Recently, Glasmachers and Igel [78] showed that second order working set selection improves the on-line learning of SVMs. It should be underlined that on-line learning of multi-class SVMs has up to now been restricted to CS, because to obtain an on-line learning algorithm for a multi-class SVM one needs a solver for batch training of that machine. Although the CS method does not have a bias term, there is an equality constraint for each training example, and by using this constraint one can apply decomposition techniques such as SMO to this problem and develop on-line versions of solvers. For LLW and WW there were no efficient solvers, and therefore on-line versions of these algorithms have not been developed so far. In this chapter I will develop efficient batch solvers for LLW, WW, and DGI. Finally, it can be concluded that on-line learning algorithms provide considerable speed-ups for large-scale data sets [20, 78].
5.1.4 Cutting Plane Approaches
For non-differentiable optimization problems, cutting plane algorithms [86, 15] are one of the mainstream methods. Although the dual problem of SVMs is differentiable, the primal problems of SVMs are not when the hinge loss is used. Recently, several studies have applied cutting plane algorithms to the SVM optimization problem [96, 71, 97, 98, 67, 153]. These methods need a training time proportional to the dimension of the feature space, which is problematic for non-linear kernels; e.g., the feature space corresponding to Gaussian kernels is infinite dimensional. Generally, these methods are applied to SVMs with linear kernels under the assumption that the original input space is high dimensional and sparse, so that no feature space other than the input space is needed [95, 140]. To use cutting plane methods with non-linear kernels, a low-rank approximation [100, 74] of the full kernel matrix should be used, as suggested by Joachims and Yu [97]. However, using a low-rank approximation of the kernel matrix may be troublesome or even impossible if the condition number of the kernel matrix is high. The condition number of the matrix is a function of the kernel hyper-parameters. During model selection this can cause problems such as long training times or, even worse, poor model selection.
5.1.5 Stochastic Gradient Descent
Stochastic gradient descent algorithms [108, 146] have been used almost from the beginning of machine learning [24, 23, 111]. Recently, there has been a trend of applying stochastic gradient algorithms to SVMs [105, 143, 18, 141], and it is claimed that stochastic gradient descent algorithms are suitable for large-scale data sets [25, 143, 26]. All these stochastic gradient algorithms are related to each other [167], and they are applied to SVMs with linear kernels. They have disadvantages similar to those of cutting plane methods when non-linear kernels are used. A recent experimental study [142] showed that these methods are not faster than on-line learning methods or decomposition algorithms when non-linear kernels are used. On-line learning methods and decomposition methods can handle both linear and non-linear kernels.
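To give a flavor of this family of methods, the following is a minimal Pegasos-style subgradient scheme for a linear binary SVM (a sketch under simplified assumptions, not one of the cited implementations; no bias term, decaying step size \(1/(\lambda t)\)):

```python
import numpy as np

def pegasos_train(X, y, lam=0.1, epochs=20, seed=0):
    """Stochastic subgradient descent on the primal objective
    lam/2 * ||w||^2 + mean hinge loss, for a linear SVM without bias."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    t = 0
    for _ in range(epochs):
        for n in rng.permutation(len(y)):
            t += 1
            eta = 1.0 / (lam * t)      # decaying step size
            margin = y[n] * (w @ X[n])
            w *= (1.0 - eta * lam)     # gradient step on the regularizer
            if margin < 1:             # subgradient of the hinge loss is active
                w += eta * y[n] * X[n]
    return w
```

On a linearly separable toy problem the learned hyperplane quickly classifies all points correctly; with non-linear kernels, however, no such simple parametric update of w exists, which is the disadvantage mentioned above.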
5.1.6 Decomposition Algorithms
Decomposition algorithms [125, 94] are the standard approach to solving SVM problems. Let us consider quadratic programs of the canonical form
\[
\begin{aligned}
\max_\alpha \quad & f(\alpha) = v^T\alpha - \frac{1}{2}\alpha^T Q\alpha \tag{5.1} \\
\text{s.t.} \quad & L_n \le \alpha_n \le U_n \quad \forall n \in \{1,\dots,m\}
\end{aligned}
\]
for \(\alpha \in \mathbb{R}^m\). Here \(v \in \mathbb{R}^m\) is some vector, \(Q \in \mathbb{R}^{m\times m}\) is a (symmetric) positive definite matrix, and \(L_n \le U_n\) are component-wise lower and upper bounds. The gradient \(g = \nabla f(\alpha)\) of (5.1) has components
\[
g_n = \frac{\partial f(\alpha)}{\partial \alpha_n} = v_n - \sum_{i=1}^{m} \alpha_i Q_{in}\,. \tag{5.2}
\]
The most frequently used algorithms for solving SVM quadratic programs are decomposition methods [124, 128, 64, 77, 27]. These methods iteratively decompose the quadratic program into subprograms, which are restricted to a subset B of the variables, the working set. The main idea of decomposition methods is to modify only the small set of working variables in each iteration. Decomposition algorithms need O(s) time and memory per iteration. This property is a significant advantage for large-scale problems, and especially for all-in-one multi-class machines, because s, the number of variables of the optimization problem, is d × ℓ. A desirable property of state-of-the-art decomposition algorithms is that iterations are fast in the sense that for any fixed upper bound q ∈ N on the working set size each iteration requires only O(m) operations. A general decomposition scheme for (5.1) is given in Algorithm 1.
Algorithm 1: Decomposition algorithm for problem (5.1).
  Input: feasible initial point α^(0), accuracy ε ≥ 0
  compute the initial gradient g^(0) ← ∇f(α^(0)) = v − Qα^(0)
  t ← 1
  while stopping criterion not met do
      select working indices B^(t) ⊂ {1, . . . , m}
      solve the subproblem restricted to B^(t) and update α^(t) ← α^(t−1) + μ^⋆(t)
      update the gradient g^(t) ← g^(t−1) − Qμ^⋆(t)
      t ← t + 1
  end
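A minimal instance of this scheme with |B| = 1 makes the structure explicit (a sketch, not the thesis implementation; the violating-set test and the clipped Newton step anticipate the selection rules discussed below):

```python
import numpy as np

def decompose_1d(v, Q, L, U, eps=1e-8, max_iter=10000):
    """Algorithm 1 with working sets of size one (coordinate ascent)
    for max f(a) = v^T a - 0.5 a^T Q a subject to L <= a <= U."""
    a = np.clip(np.zeros_like(v), L, U)   # feasible initial point
    g = v - Q @ a                          # initial gradient of f
    for _ in range(max_iter):
        # violating set: a feasible ascent step exists for these indices
        viol = [n for n in range(len(v))
                if (a[n] > L[n] + 1e-12 and g[n] < -eps)
                or (a[n] < U[n] - 1e-12 and g[n] > eps)]
        if not viol:                       # stopping criterion: KKT satisfied
            break
        b = max(viol, key=lambda n: abs(g[n]))                    # pick a variable
        mu = np.clip(g[b] / Q[b, b], L[b] - a[b], U[b] - a[b])    # clipped Newton step
        a[b] += mu
        g -= mu * Q[:, b]                  # O(m) gradient update
    return a
```

For a diagonal Q the loop reaches the exact optimum in a few iterations; for general positive definite Q it converges because each step increases the concave objective.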
For a vector \(\alpha \in \mathbb{R}^m\) and an index set \(I \subset \{1,\dots,m\}\), let \(\alpha_I = \sum_{i\in I}\alpha_i e_i\) denote the projection to the components indexed by \(I\), where \(e_i \in \mathbb{R}^m\) is the unit vector whose \(i\)-th component is 1. If all variables except those in \(B = \{b_1,\dots,b_{|B|}\}\) are fixed, the subproblem can be written as
\[
\begin{aligned}
\max_{\alpha_B} \quad & f(\alpha_F + \alpha_B) = \bigl(v_B - (Q\alpha_F)_B\bigr)^T\alpha_B - \frac{1}{2}\alpha_B^T Q\alpha_B + \text{const} \tag{5.3} \\
\text{s.t.} \quad & L_n \le \alpha_n \le U_n \quad \forall n \in B\,.
\end{aligned}
\]
Here, the complement \(F = \{1,\dots,m\}\setminus B\) of \(B\) contains the indices of the fixed variables.
The convergence properties of the decomposition method are determined by the heuristic for selecting the working indices. Given a feasible search point, the set of indices whose variables violate the KKT optimality conditions is called the violating set [101]. The set of violating indices at a search point \(\alpha\) is denoted by
\[
B(\alpha) = \bigl\{\, n \in \{1,\dots,m\} \;\big|\; (\alpha_n > L_n \wedge g_n < 0) \text{ or } (\alpha_n < U_n \wedge g_n > 0) \,\bigr\}\,.
\]
If the working set has the minimum size for generating feasible solutions, it is called irreducible, and this approach is called sequential minimal optimization (SMO, [128]), which is the most frequently used technique for SVM training.
5.1.7 General Comments on SVM Solvers
In Section 5.1, SVM solvers were briefly reviewed. There are two approaches to optimizing SVM solvers: one is to develop solvers for special types of kernels, e.g., linear kernels, and the other is to develop solvers for arbitrary kernels. Although linear kernels can be a good choice for some kinds of problems, like text classification, it is more convenient to have solvers for arbitrary kernels. It should be noted that SVMs using linear kernels are not universally consistent [148]. In this thesis, I focus on decomposition-type solvers. In Section 5.2, I develop a solver for WW, LLW, and DGI. It should be underlined that up to now no efficient solver has been developed for LLW, or for WW, and training these machines is considered to be slow.
5.2 Decomposition Algorithms for Multi-Class SVMs
Solver techniques for SVM problems were briefly described above; in this section, new solvers for multi-class SVMs, including all-in-one multi-class SVMs, will be developed. First of all, the goal should be clearly stated. Motivation: Long training times limit the applicability of multi-class SVMs. In particular, the canonical extension of binary SVMs to multiple classes, the WW SVM [163], as well as the theoretically sound LLW SVM [112], are rarely used. Instead, alternative multi-class formulations are preferred. While these can be trained faster, they lack desirable theoretical properties and/or often lead to less accurate hypotheses.
The CS SVM [42] is arguably the most popular modification of the WW formulation, intended mainly to speed up training. For learning structured data, this all-together method is usually the SVM algorithm of choice. Lee et al. [112] modified the standard multi-class SVM formulation for theoretical reasons. In contrast to the other machines, their SVM relies on a classification calibrated loss function, which leads to consistency [154]. However, up to now no efficient solver for the LLW SVM has been derived and implemented, and thus empirical comparisons with other methods are rare.
In this thesis, I consider (batch) training of multi-class SVMs with universal (i.e., non-linear) kernels and ask the following questions: Can the learning speed of WW be increased by using a more efficient quadratic programming method? Can such a method make LLW learning practical, and do the nicer theoretical properties of the LLW machine lead to better hypotheses in practice? In this thesis, positive answers are given to these questions by applying the crucial computational trick of CS, namely removing the bias term from the hypotheses, to the WW and LLW machines. For additional acceleration, a non-standard decomposition scheme that speeds up multi-class SVM training is proposed.
As mentioned before, there are several key issues in developing fast and memory-efficient decomposition solvers. First, the working set size must be decided. Second, how variables are selected for the working set must be defined. In the following, these issues are first clarified, and then an efficient solution based on decomposition algorithms is proposed. For deriving the new training algorithms, quadratic programs of the canonical form (5.1) are considered. The dual problems of the WW and LLW machines without bias can be written directly in this canonical form. The dual problem of the CS machine introduces a large number of additional equality constraints, which are ignored for the moment and discussed in Section 5.2.7. The minimum working set size depends on the number of equality constraints; for problem (5.1) it is one. Next, the trade-offs influencing the choice of the number of elements in the working set are discussed. Then, working set selection heuristics for solving (5.1) are described, and in Section 5.2.6 the proposed solver S2DO is defined.
5.2.1 Dropping the Bias Parameters
The constraint (4.1) makes the multi-class SVM quadratic programs difficult to solve with decomposition techniques, because a feasible step requires the modification of at least d variables simultaneously. This problem concerns the WW, LLW, and DGI approaches. However, such a constraint is not present in the standard CS machine, because Crammer & Singer dropped the bias terms b_c, which are of minor importance when working with characteristic or universal kernels [129].

Instead of restricting this trick to the CS machine, I propose to apply it also to the WW, LLW, and DGI SVMs. If this trick is applied to all multi-class SVMs, the constraint (4.1) simply vanishes from the dual, while everything else remains the same. This step of removing the bias terms is crucial, because it allows us for the first time to solve the WW and LLW machines with elaborate decomposition techniques, as discussed in the next section, and it also allows us to apply the same decomposition techniques to DGI.

Dropping the bias terms in all machines is also a prerequisite for a fair empirical comparison of the approaches. First, it makes fast training, and therefore appropriate model selection and evaluation on several benchmark problems, feasible. Second, all machines then consider the same hypothesis space. Alternatively, a bias term could have been introduced into the CS method, but then the resulting dual problem becomes much more complicated, because it contains two sets of interfering equality constraints, see also [87], which renders the solution technique presented in the next section intractable.
5.2.2 Working Set Sizes for Decomposition Algorithms
The size of the working set B influences the overall performance of the decomposition algorithm in a number of ways. First, the complexity of solving subproblem (5.3) analytically grows combinatorially with |B|. This limits the working set size to small numbers unless a numerical solver is used, as done in [94]. Second, the larger |B|, the less well-founded is a heuristic for picking the actual working set from the O(m^|B|) candidates, because such a heuristic is acceptable only if its time complexity is O(m). At the same time, a large |B| offers the working set selection scheme the opportunity to provide a working set on which large progress can be made. Third, the gradient update takes O(m · |B|) operations. Thus, small working sets result in fast iterations making little progress, while larger working sets result in slower iterations making larger progress. For example, two iterations with |B| = 1 cost roughly as much as one iteration with |B| = 2. Thus, a single step on two variables must make at least as much progress as two steps on single variables, which trivially is the case if the same variables are used, since such a step can directly take into account correlations between the variables. On the other hand, the second of the two single-variable iterations can profit from the gradient update done by the first, and thus make a better decision when picking its active variable. This fast update of the gradient may be an important reason for the success of SMO.
Taking all these issues into account, one concludes that (a) working set sizes should be small in order to avoid unacceptable computation times for solving the subproblem, and (b) there is an inherent trade-off between many cheap iterations profiting from fast gradient updates and fewer slow iterations with more working sets available and larger progress per iteration. In my view, none of the above arguments forces the working set size to be minimal, as in SMO. Instead of always using the minimum working set, I propose to use working sets of size two whenever possible. This strategy will be referred to as sequential two-dimensional optimization (S2DO).
5.2.3 On Working Variable Selection
To develop a decomposition algorithm, one needs to decide how to select the variables of the working set B. For binary SVMs, several methods have been proposed for variable selection [128, 102, 90, 113, 89]. Two of them are particularly important: the first is the maximum violating pair (MVP) method [94, 102], and the other is second order working variable selection [64].
5.2.4 Maximum Violating Pair Method
There are several ways to pick the elements of the working set B. The most effective way is to maximize the gain of the restricted dual objective function (5.3). Even if |B| is fixed to 2, searching for the pair of indices that gives the maximum gain requires evaluating \(\ell \times (\ell - 1)\) possible pairs. It is clear that this method is slow when \(\ell\) is large; note that for multi-class SVMs the number of free parameters is approximately \(d \times \ell\). To overcome this problem, Keerthi et al. [102] proposed the maximum violating pair method. Notation similar to that of Keerthi et al. [102] is used in this thesis; define the following sets:
\[
\begin{aligned}
I_{\text{up}}(\alpha) &= \{\, n \in \{1,\dots,\ell\} \mid \alpha_n < U_n \,\} \\
I_{\text{down}}(\alpha) &= \{\, n \in \{1,\dots,\ell\} \mid \alpha_n > L_n \,\} \\
B(\alpha) &= \{\, (i,j) \mid i \in I_{\text{up}},\; j \in I_{\text{down}},\; i \ne j \,\}
\end{aligned}
\]
Keerthi et al. [102] use a first order approximation of (5.3). In short, the MVP method picks the pair that violates the KKT conditions most strongly [102], which corresponds to
\[
i = \operatorname*{argmax}_{n \in I_{\text{up}}} g_n\,, \qquad j = \operatorname*{argmin}_{n \in I_{\text{down}}} g_n\,.
\]
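The MVP rule translates directly into code (a sketch; g denotes the gradient of the dual objective and a small tolerance guards the bound comparisons):

```python
def mvp_pair(alpha, g, L, U, eps=1e-12):
    """Maximum violating pair: i maximizes g over I_up,
    j minimizes g over I_down."""
    I_up = [n for n in range(len(alpha)) if alpha[n] < U[n] - eps]
    I_down = [n for n in range(len(alpha)) if alpha[n] > L[n] + eps]
    i = max(I_up, key=lambda n: g[n])    # most positive feasible ascent
    j = min(I_down, key=lambda n: g[n])  # most negative feasible descent
    return i, j
```

The scan is a single O(ℓ) pass, which is what makes the heuristic affordable compared to evaluating all ℓ(ℓ − 1) pairs.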
5.2.5 Second Order Working Variable Selection for SMO
The second order working set selection introduced in [64] is adopted in this thesis.1 Let us assume the optimization problem is restricted to the single variable \(\alpha_b\), \(b \in B(\alpha)\). The update direction is thus \(e_b\). If the box constraints are ignored, the optimal step size is given by the Newton step
\[
\mu = [\nabla f(\alpha)]_b / Q_{bb}\,, \tag{5.4}
\]
yielding a gain of
\[
f(\alpha + \mu\cdot e_b) - f(\alpha) = \frac{\mu^2 Q_{bb}}{2} = \frac{[\nabla f(\alpha)]_b^2}{2Q_{bb}} = \frac{g_b^2}{2Q_{bb}}\,.
\]
This definition of gain leads to the greedy heuristic
\[
b^{(t)} = \operatorname*{argmax}\left\{ \frac{\bigl(g_n^{(t-1)}\bigr)^2}{Q_{nn}} \;\middle|\; n \in B(\alpha^{(t-1)}) \right\}
\]
for choosing the working index, where \(g_n^{(t-1)}\) is defined as
\[
g_n^{(t-1)} = \frac{\partial f}{\partial \alpha_n}(\alpha^{(t-1)})\,.
\]
To obtain a feasible new search point, the Newton step must be clipped to the feasible region by computing
\[
\mu^\star = \max\bigl\{ L_b - \alpha_b,\; \min\{ U_b - \alpha_b,\; \mu \} \bigr\}\,.
\]
The update of the variables is simply given by
\[
\alpha^{(t)} = \alpha^{(t-1)} + \mu^\star e_b\,.
\]
In each iteration the algorithm needs the \(b^{(t)}\)-th column of the large kernel-based matrix Q for the gradient update. In addition, the diagonal entries \(Q_{nn}\) needed for second order working set selection should be precomputed and stored. This requires O(m) time and memory.

1 Note that in the case of single-variable working sets, first order and second order working set selection coincide for the important special case \(\forall 1 \le i, j \le m : Q_{ii} = Q_{jj}\), e.g., for normalized kernels (\(k(x, x) = 1\) for all \(x \in X\)).
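Putting the selection rule, the Newton step, and the clipping together, one SMO iteration for problem (5.1) might read as follows (a sketch, not the thesis solver; the full Q column is used directly instead of a kernel cache):

```python
import numpy as np

def smo_step(alpha, g, Q, L, U, eps=1e-12):
    """One SMO iteration with second order working set selection:
    pick b maximizing g_b^2 / Q_bb over the violating set, then take
    the Newton step mu = g_b / Q_bb clipped to the box."""
    m = len(alpha)
    viol = [n for n in range(m)
            if (alpha[n] > L[n] + eps and g[n] < -eps)
            or (alpha[n] < U[n] - eps and g[n] > eps)]
    if not viol:
        return alpha, g, False                         # KKT conditions hold
    b = max(viol, key=lambda n: g[n] ** 2 / Q[n, n])   # second order gain
    mu = np.clip(g[b] / Q[b, b], L[b] - alpha[b], U[b] - alpha[b])
    alpha = alpha.copy()
    alpha[b] += mu
    g = g - mu * Q[:, b]                               # O(m) gradient update
    return alpha, g, True
```

Iterating `smo_step` until it reports no violating index solves the box-constrained quadratic program.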
5.2.6 Second Order Working Pair Selection for S2DO
I now derive second order working set selection for (5.1) using |B| = 2. The focus of this thesis is on multi-class SVMs, but the selection scheme is also valid for binary SVMs without bias. The first index is selected according to the maximum absolute value of the gradient component,
\[
i = \operatorname*{argmax}_{k \in B(\alpha)} |g_k|\,.
\]
The second index is then selected by maximizing the gain [64]. Let the optimization problem be restricted to
\[
\alpha_B = (\alpha_a, \alpha_b)^T\,. \tag{5.5}
\]
Let the update of the current point \(\alpha_B\) be
\[
\mu^\star_B = (\mu^\star_a, \mu^\star_b)^T\,. \tag{5.6}
\]
Denote by
\[
\tilde\alpha_B = (\tilde\alpha_a, \tilde\alpha_b)^T \tag{5.7}
\]
the unconstrained optimum of (5.3), with corresponding gain
\[
f(\tilde\alpha_B) - f(\alpha_B)\,. \tag{5.8}
\]
The second order Taylor expansion of (5.3) around \(\alpha_B\), written with \(\mu_B = \tilde\alpha_B - \alpha_B\), is
\[
f(\tilde\alpha_B) = f(\alpha_B) + \mu_B^T\nabla f(\alpha_B) - \frac{1}{2}\mu_B^T Q_B\mu_B\,, \tag{5.9}
\]
where the matrix \(Q_B \in \mathbb{R}^{2\times 2}\) is the restriction of Q to the entries corresponding to the working set indices. At the unconstrained optimum the gradient of the restricted problem vanishes (see (5.11) below), hence \(g_B = Q_B\mu_B\) and the gain is
\[
f(\tilde\alpha_B) - f(\alpha_B) = \frac{1}{2}\mu_B^T Q_B\mu_B\,. \tag{5.10}
\]
The gradient at \(\tilde\alpha_B\) is
\[
\tilde g_B = g_B - Q_B\mu_B\,. \tag{5.11}
\]
It vanishes at the extremal point, and a Newton step gives
\[
\mu_B = Q_B^{-1} g_B\,. \tag{5.12}
\]
However, this computation assumes that the matrix \(Q_B\) can be inverted. If this is indeed true (\(\det(Q_B) > 0\)), the gain can be computed directly as
\[
\frac{g_a^2 Q_{bb} - 2 g_a g_b Q_{ab} + g_b^2 Q_{aa}}{2\,(Q_{aa}Q_{bb} - Q_{ab}^2)}\,.
\]
In the case \(\det(Q_B) = 0\) the calculation of the gain is more involved. For \(Q_B = 0\) there are two cases: the objective function is constant when \(g_B = 0\), and the gain is zero; the objective function is linear when \(g_B \ne 0\), with infinite gain. The case that \(Q_B\) is a rank one matrix remains. Let \(q_B\) be an eigenvector spanning the null eigenspace. For \(g_B^T q_B \ne 0\) the gain is infinite; only if \(g_B\) and \(q_B\) are orthogonal does the problem reduce to a one-dimensional quadratic equation. In this case the (non-unique) optimum can be computed as follows. Let \(w_B\) be a nonzero vector orthogonal to \(q_B\), in other words an eigenvector corresponding to the non-null eigenspace of \(Q_B\). Then the point
\[
\tilde\alpha_B = \alpha_B + \frac{w_B^T g_B}{w_B^T Q_B w_B}\, w_B
\]
is optimal, and the corresponding gain is
\[
\text{gain} = \frac{(w_B^T g_B)^2}{2\, w_B^T Q_B w_B}\,.
\]
The vectors \(g_B\) and \(w_B\) are aligned in this case (\(g_B = \lambda w_B\) for some \(\lambda \in \mathbb{R}\)), so \(g_B\) can directly take the role of \(w_B\), resulting in
\[
\text{gain} = \frac{(g_a^2 + g_b^2)^2}{2\,(g_a^2 Q_{aa} + 2 g_a g_b Q_{ab} + g_b^2 Q_{bb})}\,.
\]
For normalized kernels, \(Q_{aa} = Q_{bb} = 1\), so the case \(Q_B = 0\) is impossible and \(\det(Q_B) = 0\) amounts to the two cases \(Q_{ab} \in \{\pm 1\}\), resulting in \(q_B = (-Q_{ab}, 1)^T\) and \(w_B = (1, Q_{ab})^T\). For
\[
g_B^T q_B = g_b - Q_{ab}\, g_a = 0 \tag{5.13}
\]
the gain is given by
\[
\text{gain} = (g_a + Q_{ab}\, g_b)^2 / 8\,. \tag{5.14}
\]
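The case distinction for the gain can be collected into a single function (a sketch; Q_B is assumed positive semi-definite, as for kernel matrices, and infinite gain is returned as math.inf):

```python
import math

def s2do_gain(ga, gb, Qaa, Qbb, Qab, tol=1e-12):
    """Gain of the unconstrained 2D Newton step, including the
    degenerate cases of a singular Q_B."""
    det = Qaa * Qbb - Qab * Qab
    if det > tol:                               # Q_B invertible
        return (ga*ga*Qbb - 2*ga*gb*Qab + gb*gb*Qaa) / (2 * det)
    if Qaa < tol and Qbb < tol:                 # Q_B = 0 (PSD => Qab = 0 too)
        return 0.0 if (abs(ga) < tol and abs(gb) < tol) else math.inf
    # rank one: q spans the null eigenspace (choice avoids the zero vector)
    q = (-Qab, Qaa) if Qaa >= Qbb else (Qbb, -Qab)
    if abs(q[0]*ga + q[1]*gb) > tol:            # g_B not orthogonal to q
        return math.inf
    denom = ga*ga*Qaa + 2*ga*gb*Qab + gb*gb*Qbb  # = g^T Q_B g, g aligned with w_B
    if denom < tol:
        return 0.0
    return (ga*ga + gb*gb) ** 2 / (2 * denom)
```

For normalized kernels with \(Q_{ab} = 1\) and \(g_a = g_b\), this reproduces the special case (5.14).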
The update of the α vector is non-trivial when |B| = 2. It has been derived in [149] for normalized kernels; in the following section it is adapted to arbitrary kernels, i.e., including non-normalised kernels.
5.2.6.1 The update for S2DO

In this section, the solution of the sub-problem (5.3) for arbitrary kernels, in other words for both normalised2 and non-normalised kernels, in the case of working set size |B| = 2, is described. Let \(B = \{i, j\}\) and let \(\mu_B\) denote the update vector for \(\alpha_B\). Consider the sub-problem:
\[
\begin{aligned}
\max_{\mu_B} \quad & f(\alpha_B + \mu_B) = \bigl(v_B - (Q\alpha_F)_B\bigr)^T(\alpha_B + \mu_B) - \frac{1}{2}(\alpha_B + \mu_B)^T Q_B(\alpha_B + \mu_B) + \text{const} \tag{5.15} \\
\text{s.t.} \quad & \alpha_B + \mu_B \in [L_i, U_i] \times [L_j, U_j]
\end{aligned}
\]
To derive the update step of the S2DO algorithm, the objective function of (5.15) is rewritten by expanding the products and absorbing all terms that do not depend on \(\mu_B\) into the constant:
\[
f(\alpha_B + \mu_B) = \bigl(v_B - (Q\alpha_F)_B - Q_B\alpha_B\bigr)^T\mu_B - \frac{1}{2}\mu_B^T Q_B\mu_B + \text{const}\,. \tag{5.16}
\]
The first factor of the first term on the right-hand side of (5.16) is exactly the gradient \(g_B\), so we have
\[
f(\alpha_B + \mu_B) = g_B^T\mu_B - \frac{1}{2}\mu_B^T Q_B\mu_B + \text{const}\,. \tag{5.17}
\]
To find the maximum of \(f(\alpha_B + \mu_B)\), its partial derivatives with respect to \(\mu_i\) and \(\mu_j\),
\[
\frac{\partial f(\alpha_B + \mu_B)}{\partial \mu_i} = g_i - (\mu_i Q_{ii} + \mu_j Q_{ij})\,, \qquad
\frac{\partial f(\alpha_B + \mu_B)}{\partial \mu_j} = g_j - (\mu_j Q_{jj} + \mu_i Q_{ij})\,,
\]
are set to zero:
\[
g_i = \mu_i Q_{ii} + \mu_j Q_{ij}\,, \qquad g_j = \mu_j Q_{jj} + \mu_i Q_{ij}\,.
\]
These equations can be written as a matrix equation,
\[
\begin{pmatrix} g_i \\ g_j \end{pmatrix}
= \begin{pmatrix} Q_{ii} & Q_{ij} \\ Q_{ij} & Q_{jj} \end{pmatrix}
\begin{pmatrix} \mu_i \\ \mu_j \end{pmatrix}\,, \tag{5.18}
\]
and hence
\[
\mu_B = Q_B^{-1} g_B\,. \tag{5.19}
\]

2 We should note that S2DO has been derived in [149] for normalized kernels.
In the sequel, a number of one-dimensional sub-problems, in which one of the variables \(\mu_i\) or \(\mu_j\) is fixed to one of its bounds, will be solved. W.l.o.g. assume that \(\alpha_i + \mu^\star_i = L_i\). Then the optimal solution is given by
\[
\mu^\star_j = \min\left\{ \max\left\{ \frac{g_j - Q_{ij}\mu^\star_i}{Q_{jj}},\; L_j - \alpha_j \right\},\; U_j - \alpha_j \right\}\,.
\]
Three different cases are distinguished according to the rank of \(Q_B\). For \(Q_B = 0\) the solution is found by following the gradient, i.e., \(\mu^\star_i = U_i - \alpha_i\) for \(g_i > 0\), \(\mu^\star_i = L_i - \alpha_i\) for \(g_i < 0\), and \(\mu^\star_i = 0\) for \(g_i = 0\), with analogous rules for \(\mu^\star_j\).

Now assume that \(Q_B\) has rank one. Then the objective function is linear on each line segment \(S_p = \{\, p + \lambda\cdot(-Q_{ij}, Q_{ii})^T \mid \lambda \in \mathbb{R} \,\} \cap [L_i,U_i]\times[L_j,U_j]\), \(p\) feasible, with derivative \(\gamma = \partial f/\partial\lambda = Q_{ii}g_j - Q_{ij}g_i\) in the parameter direction. For \(\gamma \ge 0\) the optimum is attained on one of the line segments at the maximal parameter value. These points cover either one or two adjacent edges of the parameter rectangle \([L_i, U_i] \times [L_j, U_j]\), depending on the signs of \(Q_{ii}\) and \(Q_{ij}\). For each of these edges the one-dimensional sub-problem is solved; the best solution obtained from the one-dimensional sub-problems is the optimum \(\mu^\star_B\). The case \(\gamma < 0\) is handled analogously with the opposite edge(s).

If \(Q_B\) has full rank, then the unconstrained optimum is
\[
\mu_B = Q_B^{-1} g_B = \frac{1}{\det(Q_B)}
\begin{pmatrix} Q_{jj}g_i - Q_{ij}g_j \\ Q_{ii}g_j - Q_{ij}g_i \end{pmatrix}\,.
\]
If this solution is feasible, then \(\mu^\star_B = \mu_B\). Otherwise, first assume that only one of the variables \(\mu_i\) and \(\mu_j\) is outside the bounds; w.l.o.g. assume \(\alpha_i + \mu_i > U_i\). Then, by convexity, the optimum is found on the edge \(\{U_i\} \times [L_j, U_j]\), which amounts to a one-dimensional problem. In case both variables violate the constraints, w.l.o.g. \(\alpha_i + \mu_i < L_i\) and \(\alpha_j + \mu_j > U_j\), the same convexity argument ensures that the optimum is located on one of the adjacent edges \(\{L_i\} \times [L_j, U_j]\) and \([L_i, U_i] \times \{U_j\}\). As above, the better solution of the two one-dimensional problems constitutes the optimum.
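For the common full-rank case, the clipped update can be sketched as follows. This is a simplification of the argument above: instead of using convexity to select one or two edges, all four edges of the box are checked, which yields the same optimum at slightly higher cost (the rank-deficient branches are omitted):

```python
import numpy as np

def s2do_update(ai, aj, gi, gj, Qii, Qjj, Qij, Li, Ui, Lj, Uj):
    """Solve the 2D sub-problem (5.15) for invertible Q_B: unconstrained
    Newton step first, fall back to the box edges if it is infeasible."""
    det = Qii * Qjj - Qij * Qij
    mi = (Qjj * gi - Qij * gj) / det      # unconstrained optimum mu_B
    mj = (Qii * gj - Qij * gi) / det
    if Li <= ai + mi <= Ui and Lj <= aj + mj <= Uj:
        return mi, mj

    def edge_i(val_i):                    # fix a_i + mu_i = val_i, solve for mu_j
        fi = val_i - ai
        fj = np.clip((gj - Qij * fi) / Qjj, Lj - aj, Uj - aj)
        return fi, fj

    def edge_j(val_j):                    # fix a_j + mu_j = val_j, solve for mu_i
        fj = val_j - aj
        fi = np.clip((gi - Qij * fj) / Qii, Li - ai, Ui - ai)
        return fi, fj

    def gain(m):                          # objective (5.17) at the candidate step
        return gi*m[0] + gj*m[1] - 0.5*(Qii*m[0]**2 + 2*Qij*m[0]*m[1] + Qjj*m[1]**2)

    candidates = [edge_i(Li), edge_i(Ui), edge_j(Lj), edge_j(Uj)]
    return max(candidates, key=gain)
```

Each edge candidate reuses the one-dimensional clipped solution derived above, so the whole update stays closed-form.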
5.2.7 Solving the Crammer and Singer Multi-class SVM Using SMO

To solve the dual problem with \(m = \ell \times d\) variables introduced in [42], the \(\ell\) additional equality constraints
\[
\sum_{c=1}^{d} \alpha_{n,c} = 0 \quad \forall n \in \{1,\dots,\ell\}
\]
have to be respected. For this, techniques already established in many standard SMO solvers are used: the indices maximizing the gain are selected from the set of candidates
\[
B(\alpha) = \Bigl\{\, \bigl((i,m),(i,n)\bigr) \;\Big|\; i \in \{1,\dots,\ell\},\; m,n \in \{1,\dots,d\},\; m \ne n,\;
\alpha_{i,m} < U_{i,m},\; \alpha_{i,n} > L_{i,n},\; (\nabla f(\alpha))_{i,m} - (\nabla f(\alpha))_{i,n} > 0 \,\Bigr\}\,,
\]
the optimal step size is computed as a restricted Newton step in the standard way [128, 27], and the gradient is updated according to the changes in both variables of the working set.3
The new algorithm proposed in this thesis considers a working set of size two,
in contrast to the solvers proposed so far, which operate on working sets of size d.
Because of the constraints, a working set size of two is minimal. The SMO method
proposed in this section promises an increase in learning speed, because the two-
dimensional sub-problem can be solved efficiently analytically. From binary SVMs
it is known that enlarging the working set in general decreases performance.
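The second-order pair selection over this candidate set can be sketched as follows. The snippet below is a simplified illustration for a single example: among the feasible ascent directions it picks the pair of classes maximizing the Newton gain (∇fm − ∇fn)² / (2(Qmm + Qnn − 2Qmn)). The flat argument layout and the `curvature` callback are assumptions made for illustration, not the thesis interface.

```python
def select_pair(grad, alpha, lower, upper, curvature):
    """Second-order selection of a class pair (m, n) for one example:
    among feasible directions e_m - e_n, pick the one with maximal gain
    (grad[m] - grad[n])**2 / (2 * curvature(m, n)), where curvature(m, n)
    should return Q[m][m] + Q[n][n] - 2*Q[m][n] (assumed > 0)."""
    d = len(grad)
    best, best_gain = None, 0.0
    for m in range(d):
        for n in range(d):
            if m == n:
                continue
            # feasibility: alpha[m] may increase, alpha[n] may decrease
            if alpha[m] >= upper[m] or alpha[n] <= lower[n]:
                continue
            g = grad[m] - grad[n]
            if g <= 0:            # no ascent along this direction
                continue
            gain = g * g / (2.0 * curvature(m, n))
            if gain > best_gain:
                best, best_gain = (m, n), gain
    return best, best_gain
```

The feasibility and gradient conditions mirror the definition of B(α) above; the gain formula is the standard second-order criterion for a restricted Newton step.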
5.2.8 Efficient Caching for All-in-one Machines
Key techniques for making SVM solvers efficient are caching and shrinking as in-
troduced by Joachims [94]. In this thesis, the shrinking and unshrinking heuristics
from LIBSVM [36, 64, 27] are adopted. However, for the large problems with d × ℓ
variables arising from the all-in-one approaches [163, 42] we use a special
optimization. In these cases the matrix Q can be decomposed into the smaller ℓ × ℓ
kernel matrix and a set of at most O(d^4) coefficients:

Q(m,e),(n,c) = M(ym,e),(yn,c) k(xm, xn) .
3In the case of two-variable working sets, second-order working set selection differs from first-order selection also for normalized kernels. It is usually much more efficient (see [64, 77, 79]).
In fact, the coefficients can easily be computed instead of being cached for large
values of d. They take the form
M(ym,e),(yn,c) = δyn,ym + δc,e − δyn,e − δym,c

for the methods WW and CS. Further, for LLW and DGI they take the following
form
M(ym,e),(yn,c) = δc,e − 1/d .
It is well known that the speed of SMO-type solvers can crucially depend on the
fraction of kernel cache misses (e.g., [77]). Therefore we cache only the ℓ× ℓ kernel
matrix Kij = k(xi, xj) and compute the coefficients on the fly. Such an archi-
tecture makes the implementation more challenging, because shrinking has to be
implemented both on the level of training examples (in order to reduce the cache
requirements) and on the level of variables (in order to speed up the working set
selection and gradient update loops).
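The decomposition above can be illustrated with a short sketch. The following hypothetical helpers compute an entry of Q on the fly from the cached ℓ × ℓ kernel matrix and the closed-form coefficients for WW/CS and LLW/DGI; the function names are illustrative only, not the Shark implementation.

```python
def delta(a, b):
    """Kronecker delta."""
    return 1.0 if a == b else 0.0

def coefficient(method, y_m, e, y_n, c, d):
    """Coefficient M_{(y_m,e),(y_n,c)} relating two dual variables,
    computed on the fly instead of being cached."""
    if method in ("WW", "CS"):
        return delta(y_n, y_m) + delta(c, e) - delta(y_n, e) - delta(y_m, c)
    if method in ("LLW", "DGI"):
        return delta(c, e) - 1.0 / d
    raise ValueError(method)

def q_entry(method, K, y, m, e, n, c, d):
    """Q_{(m,e),(n,c)} = M_{(y_m,e),(y_n,c)} * k(x_m, x_n), using only the
    cached l-by-l kernel matrix K[m][n] = k(x_m, x_n)."""
    return coefficient(method, y[m], e, y[n], c, d) * K[m][n]
```

Only K has to be cached; the coefficients cost a few comparisons per access, which is far cheaper than a kernel cache miss.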
Chapter 6
Conceptual and Theoretical
Analysis of Multi-class SVMs
All machines considered in this thesis have pros and cons. Before I present
an empirical evaluation in Chapter 7, I want to discuss their conceptual differences.
This provides a better understanding of their behavior and guidelines for choosing
the most promising approach for a given application. Most of my considerations
are focussed on the margin concepts used in the different machines. Without loss
of generality, the analysis in the following assumes there is no bias in any machine.
I start the discussion with the MC-MMR and OVA machines before I turn to the
all-in-one multi-class SVMs. After that, in Section 6.4, the problem of correct
margin normalization in multi-class classification in general is discussed, and then I
develop a bound on the generalization error of CS and WW. I briefly discuss the
universal consistency of multi-class SVMs. I close this chapter by contrasting the
asymptotic training complexities of the six implemented multi-class machines,
depending on the number of training examples and the number of classes in the problem.
6.1 Margins in Multi-Class SVMs
A key feature of SVMs for binary classification is the separation of data of different
classes with a large margin. For non-separable data this notion gets a little bit
fuzzy, involving a target margin and the amount of margin violations. Accordingly,
the primal SVM formulation tries to achieve two conflicting goals at the same time,
namely the minimization of the weight vector (corresponding to the maximization
of the target margin and to favouring smoother hypotheses, because the hypothesis
is Lipschitz with a constant proportional to the norm of the weight vector) and
the minimization of the empirical loss in terms of some norm of the vector of slack
variables.
While the notion of margin is quite clear for the case of binary classification, it
turns out that this concept is more complicated for the multi-class case. The decision
boundary is still of co-dimension one. However, on each side of the boundary the
classifier assigns the label of one class, such that different (linear) parts of the
decision boundary correspond to different pairs of classes. Thus, the number of
such different parts of the decision boundary, corresponding to the separation of
different pairs of classes, can grow quadratically with the number of classes in the
problem.
In feature space, SVMs typically use one linear function per class for the predic-
tion, and assign a label by maximizing over the single predictions. When looking
at the label space this amounts to a decision function of the form illustrated in
Figure 3.2.
6.2 Margins in Multi-Class Maximum Margin Re-
gression
Although the classifier proposed by Szedmak et al. [152] is clearly inspired from the
dual problem of the binary SVM classifier, it turns out that the problem solved is
different, at least for the standard case of orthogonal prototype label vectors (see
Chapter 3). Figure 6.1 illustrates this difference.
Figure 6.1: Illustration of the slack penalties in the MC-MMR machine (left) and WW or CS, which reduce to a standard binary SVM (right), for two classes with a two-dimensional decision space; the axes show 〈w1, φ(x)〉 and 〈w2, φ(x)〉. Training examples of the two classes are illustrated with black and white dots. The dotted lines correspond to the class-wise target margin, i.e., points on one side of this line do not violate the target margin, resulting in ξi = 0. The lengths of the solid lines connecting margin violators with the margin lines indicate the amount of slack penalty ξi > 0 induced by the corresponding points. Obviously, the minimization of the slack variables for the MC-MMR rule on the left has nothing to do with the decision boundary.
The fact that the MC-MMR machine does not reduce to the standard binary
SVM is questionable and may even justify the point of view that this kernel machine
should not be considered an SVM variant at all, because the decisive maximum
margin feature is missing. However, the procedure of the MC-MMR machine is
quite similar to the one-class SVM approach [137] used to identify the support of a
distribution.
The primal MC-MMR problem expresses the desire to find a low complexity
function that takes a value of at least one (or close to one) on the support of the
class distribution. This comes close to training a one-class SVM per class [137].
Then MC-MMR makes predictions in an ad-hoc manner by just taking the class for
which the support estimation function outputs the largest value.
6.3 Margins in the One Versus All Classifier
At a first glance it seems that the OVA classifier should have a reasonable margin
concept, as it is derived from a series of well-understood binary classifiers. However,
it turns out that this intuition is wrong. The reason is that the margin concepts of
OVA training and the decision function (3.1) differ. This difference does not only
quantitatively affect the amount of margin violations, but also results in qualitative
differences when it comes to linear separability.
class 1
class 2
class 3
Figure 6.2: The figure illustrates the linear separability problem faced by OVA. The three classes are pairwise linearly separable, and they are separable with a decision function of type (3.1). However, OVA tries to solve the multi-class problem by treating one of the classes as the positive class and combining all remaining classes into a single negative class. With this procedure, neither class two nor class three is linearly separable from its complement. In other words, the individual decision functions constructed by OVA use a different concept of separability than equation (3.1), which is finally used for prediction.
This problem is illustrated in Figure 6.2. The three classes in this example are
pairwise linearly separable. However, OVA tries to form hyperplanes that separate
one class from all others. In the given example, this can be achieved without error
for class one, but not for classes two or three. Thus, a soft margin OVA machine is
required for this problem. In other words, although the linearly separating decision
functions are in OVA's hypothesis space, the training scheme will in general not find
this solution.
It is clear that depending on the characteristics of the problem at hand the
differences of the margin concepts in OVA training and prediction may affect per-
formance. Nevertheless OVA may, just like MC-MMR, work well on various data
sets [133].
6.4 Margin Normalization for Multi-Class Machines
As discussed above, the decision boundary of decision function (3.1) can be split
into O(d×d) functionally different parts, corresponding to the separation of pairs of
classes. Given this, even the concept of a hard margin SVM becomes tricky, because
the maximization of multiple margins naturally is a multi-objective optimization
problem.
There are certainly different meaningful ways to merge all the margin objectives
into a single objective function, for example as a simple linear combination. The
way taken in all-in-one multi-class machines is slightly different. Instead of
maximizing the margins, which would be reflected by maximizing an objective
function of the type

∑_{c,e=1}^{d} 1/‖wc − we‖ ,
the complexity of the hypothesis measured by the sum of squared norms of the
normal vectors is penalized directly. However, the term

∑_{c=1}^{d} 〈wc, wc〉
does not have any obvious geometric interpretation in terms of (target) margins.
Note that one may, depending on the loss function, want to insert constants in front
of each of the summands in both objective functions.
The situation is even worse for soft-margin classification. Usually one wants
to penalize geometric margin violations. In case of a single margin (as found in
the binary classification case) this is equivalent to minimizing functional margin
violations, up to a multiplicative constant (which is the norm of the hyperplane
normal vector w). However, in multi-class classification there is one such multi-
plicative constant (transforming functional into geometric margin violations) per
pair of classes. Nevertheless, even the relatively sophisticated all-in-one multi-class
machine classifiers treat all functional margin violations the same. It seems that
any systematic solution to this problem that is still solvable by quadratic program-
ming requires the introduction of O(d) or even O(d×d) additional hyperparameters
scaling the different slack variables, instead of the single complexity parameter C.
This would considerably complicate the model selection problem. Interestingly, the
OVA classifier suffers from a similar normalization problem, because there is no rule
to adjust the norms of the different weight vectors to compatible ranges. On the
other hand, MC-MMR gets around this problem relatively well.
6.5 Generalization Analysis
In [157], the empirical risk of multi-class SVMs is upper bounded in terms of the
mean of the slack variables. Based on this bound it is argued that the CS SVM
has advantages compared to the WW formulation because it leads to lower values
in the bounds. It is not clear if this argument is convincing. In general, one has
to be careful when drawing conclusions just from upper bounds on performance.
Further, the empirical error may only be a weak predictor of the generalization error
(in particular for large values of C). Apart from these general arguments, one arrives
at exactly the opposite conclusion when looking at generalization bounds. These
bounds are instructive, because they indicate why it may be beneficial to sum up
all margin violations in the multi-class SVM optimization problem. As an example,
I extend a bound on the generalization error of binary SVMs by Shawe-Taylor and
Cristianini [144] to the multi-class case in order to investigate the impact of the
different loss functions on the generalization performance. Let hc(x) = 〈wc, φ(x)〉.
After dropping the bias term in the WW machine the conceptual difference between
the WW and the CS approach is the loss function used to measure margin violations.
For a given training example (xi, yi) the WW machine penalizes the sum

∑_{c=1}^{d} [1 − δyi,c − (hyi(xi) − hc(xi))]+   (6.1)

of margin violations,1 while the CS machine penalizes the maximum margin violation

max_{c∈{1,...,d}} [1 − δyi,c − (hyi(xi) − hc(xi))]+ .   (6.2)

Here again the short notation [t]+ = max{0, t} is used.
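The difference between the two loss functions can be made concrete with a small sketch (illustrative Python, following the notation above with target margin ν = 1; the function names are chosen for illustration):

```python
def margin_violations(h, y, nu=1.0):
    """Per-class violations z_c = [nu - delta_{y,c} - (h[y] - h[c])]_+
    for one example with true label y and per-class scores h."""
    return [max(0.0, nu - (1.0 if c == y else 0.0) - (h[y] - h[c]))
            for c in range(len(h))]

def ww_loss(h, y):
    return sum(margin_violations(h, y))   # WW: sum of all violations

def cs_loss(h, y):
    return max(margin_violations(h, y))   # CS: largest violation only
```

For every example the WW loss dominates the CS loss, since the sum of the non-negative violations is at least their maximum; this is the per-example version of the one-norm argument used below.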
The basic idea of the analysis proposed in this thesis is the following: There
are d − 1 mistakes one can make per example xi, namely preferring class e over
the true class yi (e ∈ {1, . . . , d} \ {yi}). Each of these possible mistakes corresponds
to one binary problem (having a decision function with normal wyi−we) indicating
the specific mistake. One of these mistakes is sufficient for wrong classification
and no “binary” mistake at all implies correct classification. A union bound over
all mistakes gives the multi-class generalization result based on known bounds for
binary classifiers.
First, a fundamental result from [144] for binary classification problems with
labels ±1 will be restated. It bounds the risk under the 0-1 loss depending on the
fat shattering dimension (e.g., see [99, 5]) of the class of real-valued decision functions.
The margin violation of training pattern (xi, yi) is measured by zi = [ν − yih(xi)]+,
collected in the vector z = (z1, . . . , zℓ)ᵀ ∈ R^ℓ. Then:
1For simplicity we use the target margin ν = 1 proposed in [158] for the analysis. This makes the target margins of the two machines directly comparable.
Theorem 5 (Corollary 6.14 from [144]) Let F be a sturdy class of functions
h : X → [a, b] ⊂ R with fat shattering dimension fatF(ν). Fix a scaling of the output
range η ∈ R. Consider a fixed but unknown probability distribution on the input
space X. Then with probability 1 − δ over randomly drawn training sets T of size ℓ,
for all 0 < ν < b − a the risk of a function h ∈ F thresholded at zero is bounded by

ǫh = (2/ℓ) ( [fatF(ν/16) + 64D²] log₂(65ℓ(1 + D)³) · log₂(9eℓ(1 + D)) + log₂(64ℓ^1.5(b − a)/(δη)) )

with D = 2(√(‖z‖₁ · (b − a)) + η)/ν, provided ℓ ≥ 2/ǫh and there is no discrete
probability on misclassified training points.
If the logarithmic terms are ignored, the bound can be simplified to

ǫh ∈ O( (fat(ν/16) + ‖z‖₁/ν²) / ℓ ).
Now a union bound is used over the d(d− 1)/2 possible pairs of classes to transfer
this result to the multi-class case. For a more elaborate treatment of fat shattering
in the multi-class case see the literature [99, 81].
The training set is decomposed into subsets Tc = {(xi, yi) ∈ T | yi = c}, c ∈
{1, . . . , d}, according to the training labels, and their sizes are denoted by ℓc = |Tc|.
The natural extension of the margin violations to the loss functions used in the WW
and CS machines is zi,c = [ν − hyi(xi) + hc(xi)]+ for c ≠ yi. These values are
collected in vectors z^(c,e) ∈ R^(ℓc+ℓe) with entries zi,e and zi,c for i ∈ Tc and i ∈ Te,
respectively, for each fixed pair (c, e) of different classes. The vector z ∈ R^(ℓ×(d−1))
collects all margin violations.
separating the problem restricted to the classes c and e can be upper bounded by
ǫh^(c,e) ∈ O( (fat(ν/16) + ‖z^(c,e)‖₁/ν²) / (ℓc + ℓe) ),
and by a simple union bound argument the total generalization error is bounded by
ǫh ≤ ∑_{1≤c<e≤d} ((ℓc + ℓe)/ℓ) · ǫh^(c,e) .
The resulting upper bound for the multi-class case is
ǫh ∈ O( ( (1/2) d(d − 1) fatF(ν/16) + ‖z‖₁/ν² ) / ℓ ) ,   (6.3)
where the complexity of the class of R^d-valued functions f used for multi-class
classification is measured by the maximal fat shattering dimension of the real-valued
differences he − hc. The exact same technique can be applied directly to the term
ǫh as it appears in Theorem 5.
The primal problems of the WW and CS machines are formulated as

min (1/2) ∑_{c=1}^{d} 〈wc, wc〉 + C · ∑_{i=1}^{ℓ} L(zi,1, . . . , zi,d) ,

where the loss L is the sum of its arguments and the maximum, respectively. Let
z(WW) and z(CS) denote the margin violations for the WW and CS machine, re-
spectively. Then it is clear from the form of the primal optimization problems that
‖z(WW)‖1 ≤ ‖z(CS)‖1, as the one-norm is directly minimized in the WW formulation.
Thus, the generalization bound (6.3) is lower for the WW machine. As
noted before, comparing the performance of different machines by only using bounds
may lead to wrong conclusions. The best way to support any kind of bound is with
empirical results. This is one of the reasons an extensive empirical comparison is
given in Chapter 7; the results of this empirical comparison are in accordance with
the generalization bound presented above.
6.6 Universal Consistency of Multi-Class SVM
As discussed in Chapter 2, universal consistency of a classifier is an important topic.
Although using a universally consistent classifier does not guarantee the best performance
on limited data, it still provides hope for large-scale data sets. Recently, Glasmachers
[76] proved the universal consistency of the CS machine. Following Glasmachers's
work and applying very similar techniques, Dogan et al. [55] proved the universal
consistency of DGI. Although Glasmachers's work [76] does not contain universal
consistency proofs for WW and LLW, it implies their universal consistency. I will not
discuss the universal consistency of multi-class SVMs further; for a detailed discussion
please see Glasmachers [76].
6.7 Training Complexity
Table 6.1 gives the asymptotic training time of the algorithms under the
assumption that solving the different quadratic programs takes O(n^q) time, where n
is the number of variables and 2 ≤ q ≤ 3 is a constant [27]. This table indicates that
training MC-MMR is faster than training OVA, which is again faster than training
CS and WW, if the assumption holds.
Table 6.1: Asymptotic runtime of the training algorithms under the assumption that solving the different n-dimensional quadratic programs takes O(n^q) time (2 ≤ q ≤ 3). The number of training patterns is denoted by ℓ, the number of examples per class c by ℓc = |Sc|, and the number of different classes by d.
    OVA         MC-MMR                                   All-in-one Multi-Class SVMs
    O(dℓ^q)     O(∑_{c=1}^{d} ℓc^q) = O(d^(1−q) ℓ^q)     O((dℓ)^q)
The superior training speed of the MC-MMR method stems from the fact that
it does not take any cross-terms relating examples of different classes into account
at all. This may be an advantage on very large problems where training time is a
concern. In particular, the separability of the training problem makes it scale well
with the number of classes in the problem. Therefore, the method is an interesting
candidate for problems with lots of classes, at least from the training complexity
point of view. Increasing the number of classes d while keeping the total number of
training examples ℓ constant can even speed up the training procedure.
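This scaling behavior can be illustrated numerically. The sketch below instantiates the cost model of Table 6.1 (constants dropped, balanced classes ℓc = ℓ/d assumed); it is a toy model for comparing growth rates, not a runtime prediction.

```python
def training_cost(method, ell, d, q=2.5):
    """Asymptotic cost model from Table 6.1 (constants dropped), assuming
    an n-variable QP costs n**q and balanced classes ell_c = ell / d."""
    if method == "OVA":
        return d * ell ** q                # d binary SVMs on the full data set
    if method == "MC-MMR":
        return d * (ell / d) ** q          # = d**(1-q) * ell**q, separable per class
    if method == "all-in-one":
        return (d * ell) ** q              # one big QP over d*ell variables
    raise ValueError(method)
```

For q > 1 the MC-MMR cost even decreases as d grows at fixed ℓ, matching the observation above, while the all-in-one cost grows polynomially in d.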
Training an OVA classifier just amounts to training d binary SVMs on the full
data set, rendering the method tractable for most applications. In contrast, training
the all-in-one multi-class machines scales considerably worse with the number of
classes, restricting the applicability of these elegant all-in-one machines to small
numbers of classes. This is because these machines take all cross-terms into account,
considering the separation of all examples from all other classes at the same time
within a single big optimization problem.
Chapter 7
Empirical Comparison and
Applications
In the previous chapters, I have considered multi-class SVMs from a conceptual and
theoretical point of view. I have also derived a unified view on multi-class SVMs and,
using this unified view, I have proposed a novel multi-class machine. I have also proposed
new algorithms for training multi-class SVMs. In this final chapter, I empirically
compare the different approaches.
7.1 Preliminaries for Empirical Evaluation
The six multi-class SVMs are empirically compared on several standard benchmark
problems, and these methods are also applied to a variety of different problems. The
empirical comparison addresses the following questions:
• Which of the six multi-class methods has the best generalization performance?
• Does the generalization performance of the methods depend on their margin
concepts and loss functions?
• Does the generalization performance of the multi-class methods depend on
the problem at hand? In other words, is there a single multi-class SVM
method that always gives the best generalization performance?
• How much does one gain with respect to the number of iterations and the
training time by using S2DO and the proposed caching technique?
First, the experimental set-up is explained without giving further details of the
problems at hand. Second, the model selection methodology is described in Section
7.1.1. Finally, the related experiments and their results are presented and discussed
in Sections 7.2, 7.3 and 7.4.
To answer the first three questions, the generalization accuracy and training
time of the six different multi-class SVMs are evaluated on well-known benchmark
problems. Twelve data sets were taken from the UCI machine learning repository
[7]. Three real-world problems, namely traffic sign recognition, protein secondary
structure prediction and cancer classification, are also considered.
To answer the last question accurately, the machines are trained using both the
SMO solvers and the S2DO solvers described in Section 5. Further, the new solution
method for the CS quadratic program and the new caching strategy for the all-in-one
machines were employed. All of the methods were implemented using the Shark open
source machine learning library [92].
7.1.1 Model selection
In all experiments, Gaussian kernels k(x1, x2) = exp(−γ‖x1 − x2‖²) were used.
The bandwidths γ of the Gaussian kernels and the regularization parameter C
were determined using nested grid search, with 5-fold cross-validation as the model
selection criterion. Candidate parameters were evaluated on 5 validation subsets of
the available training data, and the configuration yielding the best average
performance was chosen. If more than one parameter configuration for γ or C gave
equal results, we selected the smallest γ and C. If the selected model was at the
boundary of the grid, we shifted the grid such that the former boundary value was
in the middle of the new grid. For each group of data sets a different hyper-parameter
search space is used for model selection. The details of the hyper-parameter search
spaces for all groups of data sets are given in the corresponding subsections.
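The grid search with boundary re-centering described above can be sketched as follows (illustrative Python; `cv_error` stands for the 5-fold cross-validation error estimate and is an assumed callback, and the tie-breaking prefers the smallest γ and C as in the text):

```python
def grid_search(cv_error, log_gammas, log_cs, max_shifts=5):
    """Grid search sketch: evaluate every (gamma, C) candidate with a
    cross-validation error function, break ties toward the smallest
    values, and re-center the grid when the winner lies on its boundary."""
    gammas, cs = list(log_gammas), list(log_cs)
    for _ in range(max_shifts):
        # tuple comparison breaks error ties by smaller gamma, then smaller C
        _, g, c = min((cv_error(g, c), g, c) for g in gammas for c in cs)
        step_g = gammas[1] - gammas[0]
        step_c = cs[1] - cs[0]
        shifted = False
        if g in (gammas[0], gammas[-1]):
            # re-center so the former boundary value is in the middle
            gammas = [g + step_g * (i - len(gammas) // 2) for i in range(len(gammas))]
            shifted = True
        if c in (cs[0], cs[-1]):
            cs = [c + step_c * (i - len(cs) // 2) for i in range(len(cs))]
            shifted = True
        if not shifted:
            return g, c
    return g, c
```

In the actual experiments the inner evaluation is the randomized 5-fold cross-validation described in the following sections; here it is abstracted into a single error function.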
7.1.2 Stopping Conditions
For a fair comparison of the training times of different types of SVMs, it is important
to choose comparable stopping criteria for the quadratic programming. Unfortunately,
this is hardly possible in the experiments presented in this thesis, because
the quadratic programs differ. However, in the case of just two classes, WW, CS,
LLW, DGI and OVA solve the same problem. Therefore, the stopping condition is
selected such that in the binary case the criterion agrees for these five methods. The
stopping condition described in Section 5 with the common threshold of ε = 10⁻³
is used. To rule out any possible artifact of this choice, all CS, DGI and MC-MMR
experiments were repeated with ε = 10⁻⁵; however, these results are not reported
in this thesis because there was no change in the accuracy of these methods.
For all machines, the maximum number of SMO iterations was limited to 10000
times the number of dual variables. If for some parameter configuration (γ,C)
a solver did not reach the desired accuracy within this budget of iterations, the
parameter configuration was discarded from the grid search. This was necessary to
keep the grid searches computationally tractable. However, the discarded parameter
configurations always corresponded to “degenerated” machines (i.e., bad solutions),
so this did not influence the outcome of the model selection process.
7.1.3 Statistical Evaluation
Several ways to compare multiple classifiers on multiple data sets have been proposed
in the literature [52, 49, 75]. Such a statistical comparison is not straightforward,
because one has to account for multiple testing. In this thesis the recommendation
in [75] is followed and non-parametric statistical tests are used in a step-wise
procedure. For each data set, the algorithms are ranked and then the average ranks
are computed. Then, the Friedman test is applied to check whether the ranks differ
from the mean rank. If so, whether two algorithms differ is determined by
pairwise post hoc comparison (using Bergmann-Hommel's dynamic procedure). The
significance level is fixed at α = 0.01. A detailed description of the test procedure can
be found in the literature [49, 75]. The open source software supplied by García
and Herrera [75] is used for the evaluation.
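The first two steps of this procedure, average ranks over data sets and the Friedman statistic, can be sketched as follows (illustrative Python; the post hoc Bergmann-Hommel procedure is omitted):

```python
def average_ranks(errors):
    """errors[i][j]: error of algorithm j on data set i. Returns the average
    rank of each algorithm (rank 1 = best; ties receive averaged ranks)."""
    n_alg = len(errors[0])
    total = [0.0] * n_alg
    for row in errors:
        order = sorted(range(n_alg), key=lambda j: row[j])
        ranks = [0.0] * n_alg
        i = 0
        while i < n_alg:
            k = i
            while k + 1 < n_alg and row[order[k + 1]] == row[order[i]]:
                k += 1                      # extend the tie group
            avg = (i + k) / 2.0 + 1.0       # average of 1-based positions i..k
            for t in range(i, k + 1):
                ranks[order[t]] = avg
            i = k + 1
        for j in range(n_alg):
            total[j] += ranks[j]
    return [t / len(errors) for t in total]

def friedman_statistic(errors):
    """Friedman chi-square statistic over N data sets and k algorithms."""
    n, k = len(errors), len(errors[0])
    r = average_ranks(errors)
    return 12.0 * n / (k * (k + 1)) * (sum(rj * rj for rj in r) - k * (k + 1) ** 2 / 4.0)
```

The statistic is compared against a chi-square distribution with k − 1 degrees of freedom (or Iman-Davenport's F correction) before the pairwise post hoc tests are run.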
7.2 Multi-class Benchmark Problems
To evaluate the multi-class SVM methods, twelve data sets from the UCI machine
learning repository [7] are used. The descriptive statistics of these data sets are given
in Table 7.1. In all data sets, all feature values are rescaled between 0 and 1, and this
Table 7.1: The descriptive statistics of the 12 UCI data sets. The column ℓtrn shows the number of training examples and the column ℓtest shows the number of test examples for the corresponding data set.

              #-classes    ℓtrn    ℓtest
Abalone           27       3133     1044
Car                4       1209      519
Glass              6        149       65
Iris               3        105       45
Letter            26      14000     6000
Isolet            26       6238     1559
OptDigits         10       3823     1797
Page Blocks        5       3831     1642
Sat                7       4435     2000
Segment            7       1617      693
SoyBean           19        214       93
Vehicle            4        592      254
rescaling is done based on the training data only. First, γ ∈ {2^(−12+3i) | i = 0, 1, . . . , 4}
and C ∈ {2^(3i) | i = 0, 1, . . . , 4} are varied. Except for Abalone, Letter and Isolet, the
training data are randomized 10 times before the experiments are run. The cross-validation
error is stored for all grid points; that is, due to the randomization, 10 different
cross-validation errors are stored for each grid point. For each grid point the
median of the 10 values is picked as the final cross-validation error, and the best
configuration (γs, Cs) is selected from the 5 × 5 grid. Then, a second grid search is
performed over the hyperparameters γ ∈ {γs 2^i | i = −2,−1, . . . , 2} and
C ∈ {Cs 2^i | i = −2,−1, . . . , 2}, applying the same randomization procedure in the
second search. This randomization procedure
makes the model selection robust against small-sample problems and also against
any kind of artifacts related to 5-fold cross-validation.
Table 7.2: Best hyperparameters found in the model selection procedure for OVA, MC-MMR and WW.
                 OVA           MC-MMR          WW
              log γ  log C   log γ  log C   log γ  log C
Abalone          0    -12       0    -14       0     -3
Car             -2      5       0      4      -2      5
Glass           -3      2       0     -1       1     -4
Letter          -2      4       0      0      -2      1
Isolet         -10      4      -7      3      -9      3
Iris           -12     18      -1      0      -9      9
OptDigits       -5     10      -3      0      -5      1
Page Blocks     -9     20       3     -8      -4      8
Sat             -1      4       0      0      -1      2
Segment         -5     10       1      0      -4      7
SoyBean         -7      3      -3      0      -6      1
Vehicle         -8     12      -1      0      -7     10
Table 7.3: Best hyperparameters found in the model selection procedure for CS, LLW and DGI.
                 CS            LLW            DGI
              log γ  log C   log γ  log C   log γ  log C
Abalone          1     -2       0     -6       0     -5
Car             -2      5      -2      5       0      5
Glass           -3      2      -3      1       3     -6
Letter          -2      2      -3      0       3     -6
Isolet         -12      4     -10      4      -6    -14
Iris            -6      6      -4      5       0      0
OptDigits       -6      5      -6     10      -2    -14
Page Blocks     -5     11      -7     16       3     -9
Sat             -1      2      -1      2       1    -14
Segment         -9     12      -4     15       3    -10
SoyBean         -6      3      -7      4       0    -12
Vehicle         -7     10      -7     11      -1      3
The selected hyperparameters for the OVA, MC-MMR and WW methods are given in
Table 7.2, and the hyperparameters of CS, LLW and DGI are given in Table 7.3. The
classification accuracies, in percent, of the OVA, MC-MMR and WW SVMs together
with 1-NN are given in Table 7.4. The classification accuracies, in percent, of the CS,
LLW and DGI SVMs together with 1-NN are given in Table 7.5. The 1-NN
classification accuracy is supplied for each data set in order to have a baseline on
the data sets.
The S2DO solver is compared with the SMO solver for WW and LLW using the optimal
hyperparameters. This comparison is needed for two reasons. The first is to decide
whether the newly proposed S2DO solver is better than SMO. The second is for
Table 7.4: Classification accuracies of OVA, MC-MMR, WW and 1-NN. In each row, bold numbers show the best classification accuracy on the data set.

               OVA    MC-MMR     WW    1-NN
Abalone      26.72    26.72   26.05   19.35
Car          98.07    96.72   98.07   90.17
Glass        69.23    70.77   72.31   55.38
Letter       97.33    96.10   97.43   93.17
Isolet       96.41    94.55   96.54   89.03
Iris         93.33    91.11   95.56   77.78
OptDigits    97.61    96.88   97.61   96.27
Page Blocks  93.12    93.42   93.42   91.53
Sat          91.35    91.05   92.35   90.15
Segment      96.25    96.10   96.39   89.90
SoyBean      92.47    88.17   90.32   87.10
Vehicle      83.46    68.90   84.25   66.93
Table 7.5: Classification accuracies of CS, LLW, DGI and 1-NN. In each row, bold numbers show the best classification accuracy on the data set.

                CS      LLW     DGI    1-NN
Abalone      22.32    26.82   26.82   19.35
Car          98.07    98.46   96.53   90.17
Glass        70.77    72.31   70.77   55.38
Letter       97.27    96.98   95.22   93.17
Isolet       96.15    96.86   92.37   89.03
Iris         95.56    95.56   95.56   77.78
OptDigits    97.50    97.89   96.66   96.27
Page Blocks  92.45    93.30   92.94   91.53
Sat          92.40    92.30   90.50   90.15
Segment      96.39    96.83   95.96   89.90
SoyBean      92.47    92.47   86.02   87.10
Vehicle      81.50    84.25   65.75   66.93
having a fair comparison of the training times of the multi-class methods on the
benchmark problems. For data sets with a training time of less than 100 seconds,
the experiments are repeated 10 times and the median is taken as the final training
time. This procedure prevents any kind of hardware or operating system related
artifacts. The main results of this comparison are summarized in Table 7.6 and
Table 7.7. In the presented experiments, S2DO was statistically significantly better
than SMO with respect to training time and number of iterations. The time taken
by one S2DO iteration was roughly equivalent to that of two SMO iterations.1
S2DO is used for WW and LLW. Table 7.8 shows the training time requirements
of the OVA, MC-MMR and WW methods for each data set using the optimal
1The Iris data set is an exception from this rule of thumb. With 105 training examples and three classes, it is the smallest data set in the benchmark suite used in this study. The SMO algorithm performed several fast shrinking and unshrinking operations, while S2DO performed none because it solved the problem so quickly. Thus, each S2DO iteration considered the complete set of variables, whereas most SMO iterations considered only subsets. Therefore, a single SMO iteration took less time on average. However, SMO needed many more iterations.
WW
                   SMO                     S2DO
               #iter       time        #iter       time
Abalone        92705      361.084      40514      319.213
Car            15309        0.847       2973        0.727
Glass            742        0.048        372        0.037
Letter       2968967     1349.440    1564183      791.111
Isolet       4190876      652.100    1948607      340.225
Iris          146387        0.153        554        0.022
OptDigits      24102       58.799      10419       76.952
Page Blocks  1037684       29.518      93251       10.604
Sat            59495      104.640      22001       95.857
Segment       206149        9.378      17782        4.720
SoyBean         7073        0.561       1627        0.492
Vehicle      1391588       18.131     203840        6.286

LLW
                    SMO                      S2DO
                #iter        time        #iter        time
Abalone         122853      671.257      52501      492.611
Car             130199        9.859      31360        6.370
Glass            38030        1.082       5475        0.477
Letter        12581295    16652.621    6724447    10128.417
Isolet        41908763    65812.100   19486076    37462.100
Iris             21145        0.065       1697        0.049
OptDigits       529247      532.024     195362      520.406
Page Blocks  693478729    34078.329  381032269    25258.837
Sat             219895      191.136      95643      276.002
Segment       55210740     6507.105   19496762     5161.155
SoyBean         728480       66.255     214096       51.840
Vehicle       16517891      565.718    1743176      163.347
Table 7.6: Training time and number of iterations needed for solving the WW (top) and the LLW (bottom) multi-class SVMs using decomposition algorithms with working sets of size one (SMO) and two (S2DO). The training times are given in seconds along with the number of iterations of the decomposition algorithms needed by the all-in-one SVMs.
parameters, and Table 7.9 shows the training time requirements of the CS, LLW and
DGI methods for each data set using the optimal parameters.
7.2.1 Summary of Results
The classification accuracy results show that the LLW method is superior to all other
methods. In order to have a more precise comparison, the statistical evaluation
method explained in Section 7.1.3 is used in a hierarchical way. It has the
following steps:
• Given Tables 7.4 and 7.5, rank the methods.
• Ignore the column corresponding to the best method and rank the remaining
methods again.
• Stop when all the methods are ranked.
The ranking results of the hierarchical evaluation are given in Table 7.10.
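The steps above can be sketched as follows (illustrative Python; `errors[i][j]` holds the error of method j on data set i, e.g. one minus the accuracies from Tables 7.4 and 7.5, and the function name is chosen for illustration):

```python
def hierarchical_ranking(errors, names):
    """Hierarchical evaluation sketch: compute average ranks across data
    sets, record the best method, drop its column, and re-rank the rest."""
    def row_ranks(row):
        # averaged ranks with tie handling; rank 1 = smallest error
        return [1 + sum(v < x for v in row) + (sum(v == x for v in row) - 1) / 2.0
                for x in row]

    names = list(names)
    errors = [list(r) for r in errors]
    order = []
    while names:
        n_alg = len(names)
        avg = [sum(row_ranks(row)[j] for row in errors) / len(errors)
               for j in range(n_alg)]
        best = min(range(n_alg), key=lambda j: avg[j])
        order.append(names.pop(best))
        for row in errors:
            del row[best]          # ignore the column of the best method
    return order
```

Re-ranking after removing the winner matters because average ranks of the remaining methods can change once the dominating column is gone.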
                WW (SMO/S2DO)             LLW (SMO/S2DO)
              Iter Ratio  Time Ratio    Iter Ratio  Time Ratio
Abalone           2.29       1.13           2.34       1.36
Car               5.15       1.16           4.15       1.55
Glass             1.99       1.30           6.95       2.27
Letter            1.90       1.71           1.87       1.64
Isolet            2.15       1.92           2.15       1.76
Iris            264.24       6.80          12.46       1.32
OptDigits         2.31       0.76           2.71       1.02
Page Blocks      11.13       2.78           1.82       1.35
Sat               2.70       1.09           2.30       0.69
Segment          11.59       1.99           2.83       1.26
SoyBean           4.35       1.14           3.40       1.28
Vehicle           6.83       2.88           9.48       3.46
Table 7.7: The ratios of the training times and numbers of iterations needed for solving the WW (left) and the LLW (right) multi-class SVMs using decomposition algorithms with working sets of size one (SMO) and two (S2DO).
Table 7.8: Training time requirements of OVA, MC-MMR and WW for optimal parameters.
             OVA       MC-MMR    WW
Abalone       59.628     4.142   319.213
Car            0.193     0.767     0.727
Glass          0.181     0.023     0.037
Letter       332.942   114.503   791.111
Isolet       433.881    78.068   340.225
Iris           0.035     0.007     0.022
OptDigits      6.450    16.681    76.952
Page Blocks  252.218    10.271    10.604
Sat           15.265    15.158    95.857
Segment        0.982     1.201     4.720
SoyBean        0.018     0.047     0.492
Vehicle        0.849     0.170     6.286
The classification performance of the methods on the benchmark data sets has been discussed. To complete the picture, the training times of the considered methods on these data sets should also be taken into account. From Table 7.8 and Table 7.9 it is clear that LLW is the slowest machine. Training the CS machines was slower than training WW in eight of the benchmarks. Hence, a statistical comparison supports a significant difference between CS and WW in terms of training complexity. Further, the accuracy of WW was statistically significantly superior to that of CS. The OVA approach scales linearly with the number of classes d, while the all-together methods are in ω(d). The MC-MMR method is the fastest method because it is basically an assembly of d one-class machines. The asymptotic properties of these machines are given in Chapter 6. Unfortunately, all relatively fast machines, i.e. OVA and MC-MMR, yielded hypotheses with a statistically significantly worse accuracy.
Table 7.9: Training time requirements of CS, LLW and DGI for optimal parameters
              CS         LLW         DGI
Abalone         43.844     492.611    36.309
Car              2.207       6.370    53.181
Glass            0.192       0.477     0.132
Letter         972.047   10128.417   On Run
Isolet         139.237   37462.100   367.796
Iris             0.107       0.049     0.226
OptDigits       45.803     520.406    31.003
Page Blocks   4438.124   25258.837    25.081
Sat            108.612     276.002    29.699
Segment        447.566    5161.155     3.431
SoyBean          1.710      51.840     0.130
Vehicle         26.416     163.347     5.249
Table 7.10: The results of the hierarchical statistical evaluation method.
Rank   Method/Methods
1      LLW
2      WW
3      CS and OVA
4      MC-MMR and DGI
7.3 Traffic Sign Recognition
Automatic camera-based traffic sign recognition plays an important role for driver
assistance systems as it can help increase safety and comfort. From a technical
point of view, one can declare that the problem is solved to a degree that allows for
first technical applications in everyday life. Still, many research questions remain,
for example in the choice of appropriate features and classifiers and how feature
extraction and classification depend on each other. Although various approaches
to feature extraction and classification have been proposed in the domain of traffic
sign recognition, a systematic comparison is missing. New algorithms are typically
evaluated on data sets that are not publicly available and often not compared to
alternative methods.
In this study, the recognition (and not the detection) of traffic signs, which is a
multi-class classification problem, is considered. In real-world applications, there is
restricted computational time available for this task and this time has to be shared
between feature extraction and classification. Therefore, in this thesis it is argued
that it is not possible to calculate highly sophisticated features and to use a complex
classifier at the same time.
There are two conflicting hypotheses. The first is that appropriate features make the classification problem easier, so that the discriminative power of the classifier is less important. The second is that a sophisticated classifier is required when feature extraction is less elaborate (using raw image data as an extreme case). To test these hypotheses, the performance of different combinations of feature extraction and classification algorithms is evaluated.
7.3.1 Related work
Many different approaches have been proposed for detection, classification, and
tracking of traffic signs in video sequences. In the following, a short overview of recent publications will be given, focusing on the techniques used for feature extraction and classification in each case.
Miura et al. [120] present an active vision system for traffic sign recognition.
After detection, a nearest neighbor approach based on normalized cross-correlation
is employed for classification. Vicen-Bueno et al. [161] consider different techniques
for preprocessing of images and compare a nearest neighbor classifier and multi-layer
neural networks for classification.
In the approach described in [109], feature vectors are obtained from shape
information and a linear SVM is employed for classification. A similar approach
also using SVMs for classification is presented in [118]. Support vector machines
were also employed within a similar two-stage setup by Fleyeh and Dougherty [70].
A popular framework for real-time object detection based on Haar wavelet fea-
tures was proposed by Viola and Jones [162]. This method is widely adopted for
traffic sign detection. Bahlmann et al. [9] proposed to use this framework for detec-
tion, additionally considering different color channels. For classification, a Bayesian
framework was used where feature vectors are obtained as most discriminative basis
vectors found by linear discriminant analysis (LDA). The same idea is considered
in [103]. The cascaded detection was also used in [11], based on special Haar-like fea-
tures called dissociated dipoles there. For classification, an error-correcting output
code (ECOC) was employed.
Torresen et al. [156] and Moutarde et al. [121] proposed to classify single digits
for classification of speed limits after detection and appropriate segmentation.
Muhammad et al. recently published a survey and experimental study of differ-
ent approaches for traffic sign recognition [122]. For classification, they considered,
amongst others, an SVM implementation and a nearest neighbor like algorithm.
They made their data set (containing 1300 examples) publicly available. It was the
first data set that could be used for systematic comparison. Nevertheless, as they
focused on classification only, they provide preprocessed data. Therefore this data
set cannot be used to evaluate and compare different feature extraction approaches.
7.3.2 Features
In this thesis three different types of features, which are briefly introduced in this
section, are used in the related experiments.
7.3.2.1 Raw image data
As a baseline for comparison, the performance of all classifiers is evaluated on raw image data. The data set used in this study (see Section 7.3.3) contains 8-bit grayscale images scaled to a fixed size of 32×32 pixels, and these images are used here directly.
7.3.2.2 Haar wavelet features
Haar wavelet features are state-of-the-art for real-time computer vision. Their pop-
ularity is mainly based upon the efficient computation using the integral image
proposed by Viola and Jones [162]. Haar wavelet features were successfully applied
in many computer vision applications, especially for object detection, classification,
and tracking [162, 126, 135, 80]. Figure 7.1 shows examples of six basic types of
Haar Wavelet features that can be used to detect different types of edges. Their re-
sponses can be calculated with 6 to 9 look-ups in the integral image, independently
of their absolute sizes.
Figure 7.1: Basic types of Haar wavelet features.
It has been shown that, provided appropriate Haar wavelet features (e.g., found by cascaded AdaBoost or created by evolutionary optimization), simple classifiers can achieve state-of-the-art performance in different tasks under real-time constraints [162, 135, 61].
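The integral-image trick behind this constant-time evaluation can be sketched as follows. This is a minimal illustration, not code from the detector of [162]; the function names and the choice of a two-rectangle vertical edge feature are assumptions of this sketch.

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero row/column prepended, so that the
    sum over any axis-aligned rectangle needs only four look-ups."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def box_sum(ii, y0, x0, y1, x1):
    """Sum of pixels in the rectangle [y0, y1) x [x0, x1)."""
    return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]

def haar_two_rect_vertical(ii, y, x, h, w):
    """Two-rectangle 'edge' feature: left half minus right half.
    Two adjacent boxes share a border, hence six look-ups in total."""
    left = box_sum(ii, y, x, y + h, x + w // 2)
    right = box_sum(ii, y, x + w // 2, y + h, x + w)
    return left - right
```

The cost of evaluating the feature is independent of its size, which is exactly why these features are attractive under real-time constraints.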
7.3.2.3 Histograms of Oriented Gradient
Histograms of Oriented Gradient (HOG) descriptors have been proposed by Dalal
and Triggs [46] for pedestrian detection. Based on gradients of color images, dif-
ferent weighted and normalized histograms are calculated: first for small cells that
cover the whole image and then for larger blocks that integrate over cells.
Using a linear SVM classifier based on HOG features, state-of-the-art perfor-
mance can be achieved, for instance, for pedestrian classification [46, 61].
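The core of the HOG computation, gradient orientation histograms accumulated per cell, can be sketched as below. This is a deliberately simplified illustration: the block grouping, weighting, and normalization steps of Dalal and Triggs [46] are omitted, and the function name and defaults are assumptions of this sketch.

```python
import numpy as np

def hog_cells(gray, cell=8, bins=9):
    """Simplified HOG: per-cell gradient orientation histograms
    ('semi-circle' variant, orientations folded into [0, 180) degrees)."""
    gray = gray.astype(np.float64)
    gy, gx = np.gradient(gray)                   # image gradients
    mag = np.hypot(gx, gy)                       # gradient magnitude
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0 # unsigned orientation
    h, w = gray.shape
    hist = np.zeros((h // cell, w // cell, bins))
    for cy in range(h // cell):
        for cx in range(w // cell):
            sl = (slice(cy * cell, (cy + 1) * cell),
                  slice(cx * cell, (cx + 1) * cell))
            # bin index per pixel, magnitude-weighted accumulation
            idx = np.minimum((ang[sl] / (180.0 / bins)).astype(int), bins - 1)
            np.add.at(hist[cy, cx], idx.ravel(), mag[sl].ravel())
    return hist.ravel()
```

In the full descriptor, neighbouring cells would additionally be grouped into overlapping blocks and each block normalized, which is what makes HOG robust to local illumination changes.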
7.3.3 Benchmark Data
For data collection, a Prosilica GC750C camera was used together with a Pentax C30811KPC815 objective and a spacer ring, resulting in an opening angle of 60°. Automatic exposure control was used; therefore the frame rate changed dynamically, mostly being 30 fps or more. The camera images had a size of 752×480 pixels and were stored in raw Bayer-pattern format. Recording was performed while driving in different urban regions during daytime in good weather.
The sequences were labelled semi-automatically: the first occurrence of a traffic sign was marked manually. Then a simple tracking algorithm based on normalized cross-correlation was employed until the sign disappeared from the camera's field of view. This semi-automatic procedure was chosen intentionally in order to generate variability (translation, change of relative size and position, partial misses) typical of real-world systems.
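The normalized cross-correlation at the heart of such a simple tracker can be sketched as follows. This is an illustrative reimplementation, not the labeling tool actually used for the data set; the function names and the exhaustive local search are assumptions of this sketch.

```python
import numpy as np

def ncc(patch, template):
    """Zero-mean normalized cross-correlation of two equally sized
    patches; returns a score in [-1, 1]."""
    a = patch - patch.mean()
    b = template - template.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0

def track_step(frame, template, y, x, radius=5):
    """One semi-automatic tracking step: search the neighbourhood of the
    previous position (y, x) for the best NCC match of the template."""
    th, tw = template.shape
    best_score, best_pos = -2.0, (y, x)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            yy, xx = y + dy, x + dx
            if yy < 0 or xx < 0 or yy + th > frame.shape[0] or xx + tw > frame.shape[1]:
                continue  # candidate window would leave the frame
            score = ncc(frame[yy:yy + th, xx:xx + tw], template)
            if score > best_score:
                best_score, best_pos = score, (yy, xx)
    return best_pos, best_score
```

Because NCC is invariant to affine brightness changes, such a tracker tolerates the illumination variation that naturally occurs between consecutive frames.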
Examples of non-traffic signs were not labelled, because one may argue that they should not be chosen independently of the detection algorithm considered, which is not within the scope of this study.
Labelled traffic sign examples from the sequences were converted both to 8-bit grayscale and to RGB, scaled to 32×32 pixels, and stored in PGM and PPM file format, respectively. To simplify matters, a nested directory structure is used to sort examples according to their classes and to pool examples derived from the same instance.
In total, 60 instances of traffic signs from 7 different classes have been labeled,
resulting in a total number of 3977 examples. The smallest images are 22×22, the
largest 87×87. A human viewer can classify all example images without doubt.
Some randomly chosen examples, which are from the data set at hand, are shown
in Table 7.11.
Table 7.11: Example images from the used traffic sign database.
A very important issue is how to split the examples into training and test sets. If images derived from the same traffic sign instance in the same driving situation (but from different frames) occurred in both sets, classification of unseen examples would become notably easier. Although this statement may sound trivial, not all data sets used for the evaluation of computer vision algorithms follow this principle.
All data was grouped based on instances; one half of the instances belonging to each category was then selected as the training set, and the remaining instances were used for testing. The descriptive statistics of the final benchmark data set are given in Table 7.12. There are 1929 training and 2048 test examples.
7.3.4 Experiments and Results
In this section, the setup (Section 7.3.5) and results (Section 7.3.6) of the experi-
ments are described.
Table 7.12: Properties of the benchmark data set. The first column indicates the class. The second and fourth columns give the number of different instances in the training and test set, respectively. The third and fifth columns show the number of examples in the training and test set, respectively. The number of examples is larger than the number of instances, because a single traffic sign appears in several consecutive frames.
traffic   #Training   #Training   #Test       #Test
sign      instances   examples    instances   examples
1 54 1 84
4 172 3 378
8 455 8 727
8 442 8 400
5 290 4 198
3 169 2 63
3 347 2 198
7.3.5 Setup
7.3.5.1 Feature calculation.
For the HOG feature calculation, the code supplied by the original authors [46] is used. The features were computed on the RGB images scaled to 24×24. This smaller size was favored in order to have more options for evenly dividing the image into cells. Different parameters were tried for calculating the HOG descriptors, but the reported results belong only to the two best-performing settings, referred to as HOGA and HOGB, respectively. See Table 7.13 for a detailed list of the parameters used.
For Haar wavelet features, several sets containing variable numbers of differently parametrized features (basic types and sizes) were tested. For the final experiments, two sets (see Table 7.13) containing roughly 100 and 1000 features were selected that showed typical effects. We refer to these sets as HaarA and HaarB. All images are scaled to 24×24 for the calculation of HOG features; the same size is used when calculating Haar features. However, grayscale images were used here instead of color images, because basic Haar features do not consider color information.
7.3.5.2 SVM model selection.
The model selection procedure is explained in Section 7.1.1. The initial search space for the hyperparameters was γ ∈ {2^(−18+3i) | i = 0, 1, . . . , 4} and C ∈ {2^(3i) | i = 0, 1, . . . , 4}.
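The initial grid of (γ, C) candidates can be written out directly. The refinement helper below is an assumption about how a nested grid search such as that of Section 7.1.1 might proceed, not its actual implementation.

```python
import itertools

# Initial grid: gamma in {2^(-18+3i) | i = 0..4}, C in {2^(3i) | i = 0..4}.
gammas = [2.0 ** (-18 + 3 * i) for i in range(5)]
cs = [2.0 ** (3 * i) for i in range(5)]
grid = list(itertools.product(gammas, cs))  # all 25 (gamma, C) pairs

def refine(best_log2_gamma, best_log2_c, step):
    """One refinement step of a nested grid search: a 3x3 sub-grid in
    log2 space around the best point found so far (an assumed scheme)."""
    return [(2.0 ** (best_log2_gamma + dg), 2.0 ** (best_log2_c + dc))
            for dg in (-step, 0.0, step) for dc in (-step, 0.0, step)]
```

Searching in powers of two keeps the grid coarse on an absolute scale while covering several orders of magnitude of both hyperparameters.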
HOGA   cell size 8×8 pixels, block size 16×16 pixels, block stride 8×8,
       9 orientation bins, semi-circle                                   144
HOGB   cell size 6×6 pixels, block size 12×12 pixels, block stride 6×6,
       9 orientation bins, full circle                                   324
HaarA  Features used: horizontal 8×8 pixels, vertical 8×8 pixels.
       Features were calculated at every second pixel in both
       dimensions.                                                        98
HaarB  Features used: horizontal 4×4 pixels, vertical 4×4 pixels,
       horizontal 8×8 pixels, vertical 8×8 pixels, horizontal bar 6×6,
       vertical bar 6×6, diagonal 6×6, and 6×6 center-surround.
       Features of size 4×4 were calculated at every pixel, the other
       features at every second pixel in both dimensions.               1002
Table 7.13: Parameters used for feature extraction. The last column gives the dimension of the resulting feature vectors.
7.3.6 Results
The results of the experiments are shown in Table 7.14, which summarizes the test
errors of the different combinations of feature extraction and classification methods.
In addition to six different SVMs and 1-NN, linear discriminant analysis (LDA) [85] is also tested.
         Raw     HOGA    HOGB    HaarA   HaarB
OVA      55.178  90.283  91.748  80.908  80.518
MC-MMR 55.277 72.631 77.683 75.637 75.599
CS 72.836 88.525 92.334 84.033 83.936
WW 71.986 88.916 92.285 85.986 82.129
LLW 74.318 89.871 93.814 89.320 86.114
DGI 60.829 80.318 81.194 80.965 70.782
1-NN 55.978 73.486 85.840 90.234 82.031
LDA 45.623 90.332 87.744 80.859 80.566
Table 7.14: The first column lists the different types of classifiers evaluated in this study and the first row lists the feature sets. The remaining cells show the accuracy of each classifier-feature pair.
Regarding features, HOG features were more suitable for the traffic sign recognition problem, as they gave the best results for nearly all classifiers considered in this study. Haar wavelet features allowed for similar performance only when used in conjunction with a nearest neighbour classifier.
Regarding classifiers, in most cases the SVMs showed the best accuracies independent of the features used. The all-in-one SVMs (CS and WW) performed on par, but outperformed OVA. The simple classifiers LDA and NN yielded performances similar to those of the SVMs when applied to the right features. The LDA worked
well in conjunction with HOG descriptors and NN in conjunction with the smaller
set of Haar features (HaarA). The larger set of Haar features (HaarB) did not allow
for good performance independently of the classifier used. This is probably due to
over-fitting as the number of features is relatively high compared to the number of
training examples.
7.3.7 Summary of Results
Recognition of traffic signs is an important real-world application in the context of
driver assistance systems and at the same time an interesting academic problem.
As there are almost no survey studies, very few systematic comparisons, and only a single publicly available data set, there is a lack of understanding of the characteristics of different solutions.
In the work presented in this thesis, the classification of traffic signs was the
main topic. This multi-class classification task typically has to be solved under
strict time constraints. Therefore, choosing a trade-off between highly sophisticated
feature calculation and complex classification methods becomes necessary.
In the experiments, the performance of different combinations of feature extrac-
tion and classification techniques were compared. In particular, feature extraction
approaches typically used for real-time applications such as Histograms of Oriented
Gradients (HOG) and Haar wavelet features were considered. For classification,
LDA, 1-NN, and different types of multi-class SVMs were used.
Results showed that the most sophisticated classifiers considered in this study, the SVMs, always yielded the highest classification performance independently of the
type of features used. The “true” (all-in-one) multi-class SVMs outperformed the
one-versus-all approach and achieved classification accuracies larger than 70% even
on raw image data and exceeding 93% when applied to HOG features.
However, combinations of simpler classifiers and appropriate features achieved high performances that may be sufficient for real-world applications. In particular, the
fast (e.g., compared to non-linear SVMs) discriminant analysis (LDA) applied to
HOG features achieved an accuracy of more than 90%, while performing consid-
erably worse on Haar features. In contrast, nearest neighbor was the best overall
choice for Haar features, but performed considerably worse when using HOG. This
underlines the complex interplay between features and classifiers.
7.4 Multi-class Problems in Bioinformatics
Biological data is produced astonishingly fast [132]. For instance, the GenBank repository of nucleic acid sequences contained 8,214,000 entries in 1999 [13], while in 2007 the same repository contained 108,431,692 entries [12]. That is, in 2007 the repository was approximately 13 times larger than in 1999. This rate of increase forced the cooperation of different fields of science. One of the resulting fields is called bioinformatics, which can be defined as making sense of biological data with the help of computational tools and statistics.
Bioinformatics is a huge field of science with numerous applications and unsolved problems [63]. In this thesis, I consider multi-class classification problems in bioinformatics and pick two relevant problems, namely cancer classification, explained in Section 7.4.1, and protein fold recognition, explained in Section 7.4.3.
7.4.1 Cancer Classification and Diagnosis with Microarray
Gene Expression
Cancer is a family of diseases in which a group of cells shows uncontrolled behaviour: uncontrolled growth (replicating beyond the usual limits), invasion and destruction of nearby tissues, and even spreading to other locations in the body. According to the World Health Organization's World Cancer Report [30], each year approximately 13% of all deaths worldwide are caused by cancer. In 2002 approximately 7.6 million people died of cancer, and it is estimated that approximately 25 million people will die of cancer by 2030 [30]. It is clear from past records and from projections of cancer rates that cancer treatment was, is, and will remain one of the major challenges of modern biology, medicine and bioinformatics.
To deal with cancer, researchers are continuously developing new tools and methods. One of the promising tools is the use of DNA microarrays and their statistical analysis for cancer research [65, 123]. A replica of a DNA microarray is illustrated in Figure 7.2. In the following, a brief explanation of DNA microarrays, the definition of cancer classification with microarray data, and the performance of the methods considered in this thesis will be given.
Figure 7.2: Illustration of a microarray sample. Note that this illustration is not derived from real data; it is just a cartoon.
A DNA microarray is a chip [136] that contains an arrayed series of microscopic spots, each containing a specific DNA sequence (probe). DNA microarrays inform us how similar these DNA sequences are to a known DNA sequence of interest (target). When the target is added to the probe, complementary nucleic acid sequences pair with each other, forming hydrogen bonds. If the probe is complementary to the target, more chemical bonds will form. This process is called probe-target hybridization, and the degree of hybridization is measured by fluorescence or other imaging techniques [16].
The analysis of microarray data is a challenging task in bioinformatics and requires sophisticated statistical methods [34, 150, 16] for several reasons:
• Microarray data is high dimensional (modern microarray data contain more than several thousand dimensions).
• The number of examples in microarray experiments is generally low, i.e., fewer than 250 examples.
• Microarray data contain noise due to measurement errors.
• The relationship between the level of hybridization and the quantization method is unknown and non-linear.
Despite the mentioned problems in the analysis of DNA microarray data, its usage in cancer research is very promising [110, 115, 88]. However, the use of microarray data in clinical practice, outside scientific research, is rare (I am not aware of any application of microarray data other than for research purposes). Nevertheless, it can be assumed that in the near future microarray data will be used in ordinary clinics. To reach this level of applicability, two things need to be understood: first, the relation between cancer types and individual genes; second, the advantages and disadvantages of classifiers on these microarray data sets.
In this thesis, the relation between cancer types and individual genes will not be analysed. Such relations are analysed in machine learning under the topic of feature/variable selection, and the interested reader is referred to Guyon et al. [82]. The focus of this section is the performance of multi-class SVMs on microarray cancer data. Data sets also used in a well-known comparison paper of multi-category classifiers on cancer microarray data [147] will be used. The data sets are given in Table 7.15. The main differences between this comparison and that of Statnikov et al. [147] should be clarified.
Dataset Name    ℓ     ℓtst   d    #features   ℓ/d
9 Tumors        42    18     11    5726        3.82
14 Tumors       216   92     9    15009       24.00
Brain Tumor 1   63    27     5     5920       12.60
Brain Tumor 2   35    15     4    10367        8.75
Leukemia1       50    22     3     5327       16.67
Leukemia2       50    22     3    11225       16.67
Lung Cancer     142   61     5    12600       28.40
SRBCT           58    25     4     2308       14.50
Table 7.15: Description of the cancer microarray data sets used in this study. The column ℓ shows the number of training examples, ℓtst the number of test examples, d the number of classes, #features the dimension of the input space, and ℓ/d the average number of training examples per class.
The first difference is that Statnikov et al. [147] used many different types of classifiers, such as neural networks, decision trees, and even CS and WW. However,
Statnikov et al. [147] clearly stated that the theoretically elegant LLW method was not tested on the data sets because there was no efficient solver for LLW, and they added that it would be fruitful to include LLW in such a comparison. The second difference is that it is not clear whether WW and CS were compared fairly, because of the bias term. The last important difference is model selection. Statnikov et al. [147] used two different model selection procedures: the first basically relies on 10-fold cross-validation with grid search, and the second is basically a leave-one-out (loo) strategy. It is a well-known fact that the loo strategy is approximately unbiased for the prediction error. However, it has a high variance, since the ℓ training data sets used for model selection are highly similar [85]. Recently, Klement et al. [106] showed that for hard-margin SVMs in which the dimension of the feature space is much higher than the number of training examples, the estimated loo error is 1. Although their results do not directly imply that the loo strategy is unsuitable for soft-margin SVMs, the model selection procedure should be chosen with these flaws of the loo strategy in mind. However, even the cross-validation strategy may have some flaws. The last column of Table 7.15 indicates that in one third of the data sets the average number of training examples per class is less than the number of folds. Moreover, in seven data sets the number of training examples per class is less than twice the number of folds. These facts mean that 10-fold cross-validation may have a high variance at the grid points. The last difference is that they used polynomial kernels with a fixed degree, and their grid for C contained only four values.
After considering all these issues related to model selection and the small sample problem, I believe that using 5-fold cross-validation for model selection is a suitable strategy. To prevent artifacts of small sample problems, the training data is randomized 10 times and the machines are trained on these 10 randomized training sets. For each grid point, the median performance over the 10 repetitions is taken as the performance of the method (smoothing of the cross-validation error surface). This is an important difference, because the model selection strategy directly affects the performance of the learning machine. The number of training samples in microarray cancer data sets is small, a setting known as the small sample problem in statistics and machine learning [93]. Besides, most of the microarray cancer classification problems are multi-class, and the average number of training examples per class can be much smaller than in the binary case. My personal experience also showed that, without smoothing the cross-validation error surface, the model selection procedure can contain artifacts of the small sample problem [73, 31, 93].
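The described procedure, 5-fold cross-validation repeated on 10 randomized copies of the training data with the median taken per grid point, can be sketched as follows. The classifier is passed in as a callback; all names here are assumptions of this sketch, and the actual implementation used in this thesis is in C++.

```python
import numpy as np

def median_cv_error(X, y, fit_predict, folds=5, repeats=10, seed=0):
    """Median k-fold cross-validation error over several random
    permutations of the training data (smoothed CV error surface).

    fit_predict(X_tr, y_tr, X_va) must return predicted labels for X_va.
    """
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(repeats):
        perm = rng.permutation(len(y))          # randomize the training data
        fold_errs = []
        for f in range(folds):
            va = perm[f::folds]                 # every folds-th index -> fold f
            tr = np.setdiff1d(perm, va)         # remaining indices for training
            pred = fit_predict(X[tr], y[tr], X[va])
            fold_errs.append(np.mean(pred != y[va]))
        errors.append(np.mean(fold_errs))
    return float(np.median(errors))             # median smooths over repetitions
```

Taking the median over the repetitions makes the estimate at each grid point robust to a single unlucky fold assignment, which matters precisely when there are only a few examples per class.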
In all data sets, all feature values are rescaled to between 0 and 1, and this rescaling is based on the training data. The selected hyperparameters for the OVA, MC-MMR and WW methods are given in Table 7.16 and the hyperparameters of CS, LLW and DGI are given in Table 7.17.
The classification accuracies (in percent) of the six multi-class SVMs are given in Table 7.18 and in Table 7.19. Additionally, I selected the best classification accuracy
Table 7.16: Best hyperparameters found in the model selection procedure for OVA,MC-MMR and WW.
               OVA            MC-MMR         WW
               log γ   log C  log γ   log C  log γ   log C
11 Tumors      -19     7      -27     0      -26     12
14 Tumors      -26     14     -8      -13    -22     12
9 Tumors       -25     11     -8      0      -18     4
Brain Tumor1   -16     4      -8      -2     -15     5
Brain Tumor2   -18     6      -12     0      -16     4
Leukemia1      -27     15     -12     0      -21     9
Leukemia2      -24     12     -13     0      -28     0
SRBCT          -26     15     -10     -8     -25     13
Table 7.17: Best hyperparameters found in the model selection procedure for CS,LLW and DGI.
               CS             LLW            DGI
               log γ   log C  log γ   log C  log γ   log C
11 Tumors      -26     13     -20     11     -9      -8
14 Tumors      -23     11     -20     9      -8      -2
9 Tumors       -19     6      -19     6      -7      -11
Brain Tumor1   -14     -2     -25     15     -38     -10
Brain Tumor2   -18     -4     -13     1      -11     -8
Leukemia1      -18     -6     -24     15     -9      -8
Leukemia2      -21     -12    -42     7      -8      -8
SRBCT          -20     -8     -25     13     -5      -5
for each data set from the results of Statnikov et al. [147] and these accuracies are
reported in the last columns of Table 7.18 and Table 7.19. Also 1-NN classification
accuracies are shown in the results.
Table 7.18: Classification accuracies of OVA, MC-MMR, WW and 1-NN. The last column shows the best classification accuracies from Statnikov et al. [147]. In each row, bold numbers show the best classification accuracy of the multi-class SVMs on the data set.
              OVA      MC-MMR   WW       1-NN     [147]
9 Tumors       83.33    38.89    72.22    44.44    78.67
14 Tumors      75.27    60.22    73.12    60.22    90.96
Brain Tumor1   92.86    85.71    89.29    85.71    82.31
Brain Tumor2   80.00    73.33    80.00    73.33    80.00
Leukemia1     100.00    77.27   100.00    81.82    93.90
Leukemia2     100.00    86.36   100.00    77.27    94.42
SRBCT          88.00    88.00    88.00    84.00   100.00
7.4.2 Summary of Results
Comparing my results with the results from the literature, it can be seen that in four of the seven data sets the results were improved, in two data sets the results
Table 7.19: Classification accuracies of CS, LLW, DGI and 1-NN. The last column shows the best classification accuracies from Statnikov et al. [147]. In each row, bold numbers show the best classification accuracy of the multi-class SVMs on the data set.
              CS       LLW      DGI      1-NN     [147]
9 Tumors       83.33    83.33    38.89    44.44    78.67
14 Tumors      74.19    75.27    60.22    60.22    90.96
Brain Tumor1   92.86    89.29    85.71    85.71    82.31
Brain Tumor2   80.00    80.00    73.33    73.33    80.00
Leukemia1     100.00   100.00    90.91    81.82    93.90
Leukemia2     100.00   100.00    77.27    77.27    94.42
SRBCT          88.00    88.00    84.00    84.00   100.00
are worse, and in one data set the results are equal. For a set of seven different classification problems, these differences are high. They could be caused either by the type of kernel used or by the model selection procedure. In Statnikov et al.'s [147] study, MATLAB (2003, The MathWorks) was used, and they also claimed that the training of SVMs is computationally expensive. In this thesis, C++ is used to implement the solvers; the new solvers allow better model selection procedures to be applied, and this in turn makes better classification accuracies possible.
7.4.3 Protein Secondary Structure Prediction
Proteins are among the main building blocks of cells, and so it is essential to understand their biological function. It is known that a protein's biological function is closely related to its 3D structure [10]. Biologists have developed several experimental methods to determine the 3D structure of a protein, such as protein nuclear magnetic resonance (NMR) or X-ray based techniques. However, these experimental techniques are generally time consuming, slow and very expensive [91, 47]. In the year 2000, the protein data bank (PDB) contained approximately 12,000 experimentally identified protein structures [14], and in the year 2009 it contained approximately 30,000 [60]. However, the non-redundant National Center for Biotechnology Information (NCBI) reference sequence (RefSeq) database [130] contained approximately 780,000 non-identified protein sequences in the year 2004 and approximately 1,100,000 in the year 2005. Although the exact numbers for 2009 are not known, it can be assumed that RefSeq contains at least two million non-identified protein sequences. From these statistics, it is clear that the experimental methods are not fast enough to identify the protein sequences.
Identification of protein sequences is not only important for improving the biological understanding of the cell, it is also important for drug discovery and for developing treatment schemes against diseases. Given all these facts, bioinformatics researchers have applied several statistical techniques to overcome these disadvantages of the experimental methods. One of the basic ideas is to determine a protein's structural class by using only the primary sequence of amino acids, which can be obtained easily and quickly from the protein at hand. In this part of the thesis I will consider this problem, namely protein secondary structure prediction.
There are two mainstream approaches to the protein secondary structure prediction problem. The first is to use supervised classification, and the second comprises ab-initio techniques, which optimize a predefined energy function without using any supervised information [10]. However, the performance of ab-initio techniques is far below that of supervised classification [54].
Unfortunately, although the number of proteins is large, the number of training examples per protein is small. Therefore, one needs to identify the best classifier in order to use this relatively small amount of information efficiently. In this thesis, I compared six of the multi-class SVMs and the 1-NN classifier on a multi-class protein data set, which is a baseline data set for protein structure prediction [54]. Further, the results of this thesis will be compared with the best results published in the literature [47].
The data set contains 27 proteins and there are 12 different feature vectors derived
from these proteins. Each feature vector is regarded as a different data set. Each
data set contains 311 training examples and 383 test examples.
In all data sets, the feature values are rescaled to the range [0, 1], with the
rescaling parameters computed from the training data only. As before, a nested
grid search with 5-fold cross-validation is used to determine the hyperparameters.
The model selection procedure is identical to the nested grid search with
randomization of training data explained in Section 7.4.1.
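As an illustration, this preprocessing step can be sketched as follows. The code below is a minimal sketch of per-feature min-max rescaling and is not the actual experimental code; all names are my own:

```python
import numpy as np

def fit_minmax(X_train):
    """Per-feature minimum and range, estimated from the training data only."""
    lo = X_train.min(axis=0)
    span = X_train.max(axis=0) - lo
    span[span == 0] = 1.0  # guard against constant features
    return lo, span

def apply_minmax(X, lo, span):
    """Map features into [0, 1] using the training-set statistics."""
    return (X - lo) / span

# toy data: 4 training and 2 test examples with 3 features
X_train = np.array([[0.0, 10.0, 5.0],
                    [2.0, 20.0, 5.0],
                    [4.0, 30.0, 5.0],
                    [1.0, 25.0, 5.0]])
X_test = np.array([[3.0, 15.0, 5.0],
                   [5.0, 40.0, 5.0]])

lo, span = fit_minmax(X_train)
X_train_s = apply_minmax(X_train, lo, span)
X_test_s = apply_minmax(X_test, lo, span)  # test values may leave [0, 1]
```

Because the statistics come from the training set alone, no information about the test set leaks into preprocessing; as a consequence, rescaled test features may fall slightly outside [0, 1].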
The selected hyperparameters for the OVA, MC-MMR, and WW methods are given
in Table 7.20, and the hyperparameters of CS, LLW, and DGI are given in Table
7.21.
Table 7.20: Best hyperparameters found in the model selection procedure for OVA, MC-MMR, and WW.

                     OVA            MC-MMR           WW
                 log γ  log C   log γ  log C   log γ  log C
Composition        -4      1      -3      0      -4      0
Hydrophobicity     -5     -2      -2      0      -8      4
L14                -7      1      -5      0      -8     -1
L1                 -5      1      -3      0      -7      2
L30                -6      0      -5      0      -6      4
L4                 -6      2      -3      0      -4     -1
Polarity           -5     -1      -2      0      -5     -2
Polarizability     -5     -5      -3     -1      -9      6
Secondary          -1     -1       0      0      -1     -2
Swblosum62        -18     11      -5    -14     -17      7
Swpam50           -27     19      -5     -4     -11      4
Volume             -4     10      -2      0      -5      1
The classification accuracies of the six multi-class SVMs are given in Table 7.22
and Table 7.23. Additionally, the 1-NN classification accuracies are shown in
the last column of each table.
Table 7.21: Best hyperparameters found in the model selection procedure for CS, LLW, and DGI.

                     CS             LLW              DGI
                 log γ  log C   log γ  log C   log γ  log C
Composition        -4      1      -4      0       0     -3
Hydrophobicity     -8     -2      -5      1       0     -9
L14                -6      2      -8      5       0     -9
L1                 -8      4      -7      1       0    -11
L30                -6      1      -6      1      -2     -8
L4                 -5     -2      -5      2      -2     -7
Polarity           -5      1      -5      2      -1     -3
Polarizability     -8      0      -5      2      -1     -3
Secondary          -4      3      -2      1      -2     -7
Swblosum62        -18     -9     -15      7      -5     -4
Swpam50           -15      9     -11      7      -6     -6
Volume             -6      6      -3      1      -1     -2
Table 7.22: Classification accuracies of OVA, MC-MMR, and WW. The last column shows the classification accuracies of 1-NN.

                  OVA     MC-MMR     WW      1-NN
Composition     0.5352   0.4909   0.5379   0.4595
Hydrophobicity  0.3786   0.3864   0.3420   0.3446
L14             0.4543   0.3708   0.3577   0.3394
L1              0.4648   0.4334   0.3760   0.3734
L30             0.3969   0.3159   0.3681   0.3133
L4              0.4465   0.4334   0.4491   0.3838
Polarity        0.3708   0.4021   0.3420   0.3368
Polarizability  0.2898   0.3081   0.2950   0.3159
Secondary       0.3943   0.3786   0.3838   0.3577
Swblosum62      0.6240   0.5352   0.6292   0.4439
Swpam50         0.6371   0.5535   0.6397   0.4595
Volume          0.3603   0.3760   0.3473   0.3446
7.4.4 Summary of Results
Damoulas and Girolami [47] reported the best accuracy on this data set,
59.8% ± 1.9%. In their study, they considered four state-of-the-art string
kernels and took the best classification accuracy for each feature set. In this
study, I used a Gaussian kernel and applied the six multi-class methods to the
feature sets. The best result, 64.23%, is obtained by LLW. This increases the
classification accuracy by approximately 4.5 percentage points, which is clearly
significant. However, each method achieved the best accuracy on exactly two
feature sets, which means there is no winner among the different multi-class
SVMs with respect to this problem.
It should be noted that Damoulas and Girolami [47] also applied multiple
kernel learning (MKL) [8, 145] to this problem and reported a classification
accuracy of 68.1%. Although this result is better than the results reported in
this study, it encourages applying MKL to LLW and WW because
Table 7.23: Classification accuracies of CS, LLW, and DGI. The last column shows the classification accuracies of 1-NN.

                   CS      LLW      DGI     1-NN
Composition     0.5222   0.4909   0.4543   0.4595
Hydrophobicity  0.3368   0.3551   0.3629   0.3446
L14             0.4256   0.4360   0.3525   0.3394
L1              0.3812   0.2872   0.2881   0.3734
L30             0.3812   0.3081   0.3394   0.3133
L4              0.4648   0.4439   0.3812   0.3838
Polarity        0.3708   0.3499   0.3577   0.3368
Polarizability  0.2846   0.3394   0.3473   0.3159
Secondary       0.3916   0.3916   0.4021   0.3577
Swblosum62      0.6266   0.6162   0.5405   0.4439
Swpam50         0.6371   0.6423   0.4909   0.4595
Volume          0.3342   0.3943   0.3655   0.3446
of their superior performance on single feature sets.
Chapter 8
Conclusions
I have provided a novel unified view of the seemingly diverse field of all-in-one
multi-class SVMs. Although all popular all-in-one approaches reduce to the
standard SVM for binary classification problems, they differ along three
dimensions when applied to more than two classes: the presence or absence of a
bias term in the classification functions, the use of a relative or absolute
margin concept, and the way margin violations are combined in their loss
functions.
The unified scheme pointed to a canonical combination of these features that
had not yet been investigated. The missing machine, which can be viewed as
marrying the approaches of Crammer & Singer (CS, [42]) and Lee, Lin, & Wahba
(LLW, [112]), has been derived and evaluated. The new SVM, named the DGI SVM,
considers the maximum over the margin violations per variable in its loss
function, together with an absolute margin concept as proposed by LLW.
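The difference between the sum- and max-loss dimensions of this scheme can be illustrated with a small sketch. The snippet below is illustrative only; it uses a relative margin for concreteness, and the function names are my own rather than the thesis' notation:

```python
import numpy as np

def margin_violations(scores, y, margin=1.0):
    """Hinge violations of the target margin of the true class y against
    every other class (relative margin concept, for illustration only)."""
    viol = np.maximum(0.0, margin - (scores[y] - scores))
    viol[y] = 0.0  # the true class competes only against the others
    return viol

def sum_loss(scores, y):
    """WW/LLW style: all margin violations are summed."""
    return margin_violations(scores, y).sum()

def max_loss(scores, y):
    """CS/DGI style: only the largest violation enters the loss."""
    return margin_violations(scores, y).max()

scores = np.array([2.0, 1.5, 0.0, 1.8])  # decision values for 4 classes
y = 0                                     # true class
# violations against classes 1..3 are [0.5, 0.0, 0.8]
# sum-loss = 1.3, max-loss = 0.8
```

The sum-loss penalizes every competing class that intrudes into the margin, while the max-loss only tracks the strongest competitor; this is exactly the axis along which WW/LLW differ from CS/DGI.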
A fast training algorithm for the WW, LLW, and DGI SVMs has been presented.
By dropping the bias term, as done in the CS approach, the equality constraints
in the dual problems of all machines vanish. This makes decomposition methods
easily applicable. A second-order working set selection algorithm using working
sets of size two has been proposed for these problems. Instead of choosing the
smallest, irreducible working set size, it is proposed to use a working set of
size two whenever possible. This still allows for a tractable analytic solution
of the sub-problem and, as shown empirically, yields a significantly better
trade-off between iteration complexity (as determined, e.g., by the working set
selection heuristic and the gradient update) and progress. That is, sequential
two-dimensional optimization (S2DO) should be favored over the strict SMO
heuristic. This is also supported by the findings in [149] for binary SVMs. The
S2DO heuristic is not restricted to the SVMs considered in this study; it can be
applied in general to machines involving quadratic programs without equality
constraints.
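Without an equality constraint, the two-variable sub-problem is a box-constrained quadratic program that can be solved analytically by a small case distinction. The sketch below is a simplified illustration of that idea (the maximization form and all names are my own, not the thesis' notation), assuming a positive definite 2x2 matrix Q:

```python
import numpy as np

def solve_2d_subproblem(g, Q, lo, hi):
    """Maximize g.d - 0.5*d.Q.d subject to lo <= d <= hi (elementwise).

    Case distinction: if the unconstrained Newton point lies in the box it
    is optimal; otherwise the optimum lies on the boundary, and each of the
    four edges reduces to a 1-D quadratic solved in closed form and clipped.
    """
    def value(d):
        return g @ d - 0.5 * d @ Q @ d

    d = np.linalg.solve(Q, g)          # unconstrained optimum
    if np.all(d >= lo) and np.all(d <= hi):
        return d

    candidates = []
    for i in (0, 1):                   # pin variable i at one of its bounds
        j = 1 - i
        for b in (lo[i], hi[i]):
            dj = (g[j] - Q[i, j] * b) / Q[j, j]   # 1-D optimum on that edge
            dj = min(max(dj, lo[j]), hi[j])       # clip into the box
            cand = np.empty(2)
            cand[i], cand[j] = b, dj
            candidates.append(cand)
    return max(candidates, key=value)  # best boundary candidate

g = np.array([2.0, 1.0])
Q = np.array([[2.0, 0.0], [0.0, 2.0]])
d_free = solve_2d_subproblem(g, Q, np.array([-10.0, -10.0]), np.array([10.0, 10.0]))
d_box = solve_2d_subproblem(g, Q, np.array([-10.0, -10.0]), np.array([0.5, 10.0]))
```

Since this computation is repeated for every selected working set, the closed form (rather than an inner iterative solve) is what keeps the per-iteration cost low.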
The developed solver has been applied to all of the multi-class SVMs, which made
the empirical comparison fair enough to draw conclusions about the required
training times. Another novel contribution of this thesis regarding SVM solvers
is a new caching technique for all-in-one multi-class machines, which needs to
store only an O(d²) matrix and an O(ℓ²) matrix instead of an O(s²) matrix, where
s = d × ℓ. This caching technique made it possible to use WW and LLW on all data
sets. To my knowledge, S2DO is the only existing solver for LLW that uses a
decomposition algorithm. As a result, the LLW method can now be used for much
larger data sets.1
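The caching idea exploits the fact that, for these machines, the quadratic-program matrix factors into a data-kernel part and a small class-coefficient part, so individual entries can be recomputed on demand from two small matrices. The following sketch assumes, purely for illustration, a Gaussian data kernel and an identity class matrix; the actual class-coefficient matrix depends on the machine's loss formulation:

```python
import numpy as np

ell, d = 5, 3                         # training examples, classes
rng = np.random.default_rng(0)
X = rng.normal(size=(ell, 4))

# data kernel matrix: O(ell^2) storage
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-0.5 * sq)

# class-coefficient matrix: O(d^2) storage (identity only for illustration)
M = np.eye(d)

def q_entry(a, b):
    """Entry (a, b) of the full (ell*d) x (ell*d) QP matrix, recomputed on
    demand from the two small factors instead of being stored."""
    i, r = divmod(a, d)               # (training index, class index)
    j, s = divmod(b, d)
    return K[i, j] * M[r, s]

# the full matrix, materialized here only to check the factorization
Q_full = np.kron(K, M)
```

With s = d × ℓ variables, storing K and M costs O(ℓ²) + O(d²) memory instead of the O(s²) needed for the full matrix, which is what makes the larger data sets feasible.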
An extensive empirical study has been carried out. The new solver makes it
possible to apply better model selection procedures to all multi-class machines.
This is, for two reasons, a particularly important contribution of this study.
First, until now the WW and LLW methods were often either ignored or not
carefully analysed in empirical studies due to the lack of efficient solvers
[133, 54]. Second, researchers did not perform suitable model selection because
of the computational requirements of these methods [54, 147]. For these reasons,
and also because of their high training speeds, one-vs-all (OVA) and CS have
been considered the best machines for multi-class problems. However, the
empirical analysis presented in this thesis shows that this common belief is at
least not completely true. The analysis revealed two important facts. First, LLW
is better than all other methods in terms of classification accuracy, and the
second-best method is WW (see Section 7.2). Second, WW is not slower than the CS
method (see Section 7.2). Furthermore, if one focuses only on all-in-one
machines, the superior results of LLW and WW imply that sum-loss machines are in
general better than max-loss machines (i.e., CS and DGI).
The results of the six multi-class methods on the bioinformatics data sets imply
that model selection is very important when the data set at hand is small, i.e.,
when we face a small-sample problem (see Section 7.4.1 and Section 7.4.3).
Finally, the results on the traffic sign recognition problem imply that neither
good features alone nor a good classifier alone gives the best result. To solve
real-world problems, both issues must be taken into account (see Section 7.3).
The extensive experimental comparison showed that the WW approach generates
hypotheses with higher classification accuracy than the CS machine. Both
approaches outperformed the one-versus-all method in this respect. Using S2DO,
the original WW multi-class SVM becomes at least as fast as the CS method
trained with tailored, state-of-the-art second-order working set selection. This
indicates that the faster training times previously observed for the CS SVM
compared to the WW formulation were not achieved by reducing the number of slack
variables, but rather by dropping the bias term from the hypotheses (this is in
accordance with the findings in [87], where training times increased drastically
when bias parameters were added to the CS machine). The better generalization
results are in accordance with newly derived risk bounds, which follow from a
union bound on results for binary machines and are lower for the WW SVM than for
the CS machine. Given the empirical and theoretical results, there is no longer
any reason to prefer the CS SVM a priori over the original (WW) method. We hope
that the results of
1 The original solver proposed by Lee et al. [112] is based on an interior point method and has a time complexity of O(s³) and a memory requirement of O(s²).
this thesis make the WW method more popular among practitioners, because it
offers improved accuracy without additional cost in training time compared to CS.
From a theoretical point of view, the decisive property of the LLW multi-class
SVM is the classification calibration of its loss function [154]. The efficient solver
proposed in this thesis makes LLW training practical and thereby allowed for the
first extensive empirical comparison of LLW with alternative multi-class SVMs. The
LLW method is the only classification-calibrated machine in this comparison [154]
and showed the best generalization performance. This improved accuracy required
considerably more training time. However, if training time does not matter, the
LLW machine is the multi-class SVM of choice. This experimental result corrobo-
rates the theoretical advantages of the LLW machine.
In this study, I considered batch learning of multi-class SVMs. For binary classi-
fication, it has been shown that improved second-order working set selection derived
for batch learning is even more advantageous when applied to on-line learning in
LASVM [79]. Therefore, I am confident that the results in this study also carry
over to the popular LaRank online multi-class SVM [19].
Bibliography
[1] E. Alba and J. Chicano. Solving the error correcting code problem with
parallel hybrid heuristics. In Proceedings of the 2004 ACM symposium on
Applied computing, page 989. ACM, 2004.
[2] E. Alba, C. Cotta, F. Chicano, and AJ Nebro. Parallel evolutionary algorithms
in telecommunications: Two case studies. network, 8(13):14–19, 2002.
[3] E. Alba and S. Khuri. Sequential and distributed evolutionary algorithms for
combinatorial optimization problems. Studies In Fuzziness And Soft Comput-
ing, pages 211–233, 2003.
[4] E.L. Allwein, R.E. Schapire, and Y. Singer. Reducing multiclass to binary:
A unifying approach for margin classifiers. The Journal of Machine Learning
Research, 1:113–141, 2001.
[5] M. Anthony and P.L. Bartlett. Neural network learning: Theoretical founda-
tions. Cambridge Univ Pr, 1999.
[6] N. Aronszajn. Theory of Reproducing Kernels. Transactions of the American
Mathematical Society, 68(3):337–404, 1950.
[7] A. Asuncion and D. J. Newman. UCI machine learning repository, 2007.
[8] F.R. Bach, G.R.G. Lanckriet, and M.I. Jordan. Multiple kernel learning,
conic duality, and the SMO algorithm. In Proceedings of the twenty-first
international conference on Machine learning, page 6. ACM, 2004.
[9] C. Bahlmann, Y. Zhu, V. Ramesh, M. Pellkofer, and T. Koehler. A system for
traffic sign detection, tracking, and recognition using color, shape, and mo-
tion information. In Proceedings of the IEEE Intelligent Vehicles Symposium,
pages 255–260, 2005.
[10] D. Baker and A. Sali. Protein structure prediction and structural genomics.
Science’s STKE, 294(5540):93, 2001.
[11] X. Baro, S. Escalera, J. Vitria, Oriol Pujol, and Petia Radeva. Traffic sign
recognition using evolutionary adaboost detection and forest-ECOC classifi-
cation. IEEE Transactions on Intelligent Transportation Systems, 10(1):113–
126, 2009.
[12] D. Benson, I. Karsch-Mizrachi, D. Lipman, J. Ostell, and D. Wheeler. Gen-
Bank. Nucleic Acids Research, 35, 2007.
[13] D.A. Benson, M.S. Boguski, D.J. Lipman, J. Ostell, BF Ouellette, B.A. Rapp,
and D.L. Wheeler. GenBank. Nucleic acids research, 27(1):12, 1999.
[14] H.M. Berman, T. Battistuz, TN Bhat, W.F. Bluhm, P.E. Bourne,
K. Burkhardt, Z. Feng, G.L. Gilliland, L. Iype, S. Jain, et al. The pro-
tein data bank. Acta Crystallographica Section D: Biological Crystallography,
58(6):899–907, 2002.
[15] D.P. Bertsekas, M.L. Homer, D.A. Logan, and S.D. Patek. Nonlinear pro-
gramming. Athena scientific, 1995.
[16] P.J. Bickel, J.B. Brown, H. Huang, and Q. Li. An overview of recent de-
velopments in genomics and associated statistical methods. Philosophical
Transactions of the Royal Society A: Mathematical, Physical and Engineering
Sciences, 367(1906):4313, 2009.
[17] C.M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[18] A. Bordes, L. Bottou, and P. Gallinari. Sgd-qn: Careful quasi-newton stochas-
tic gradient descent. The Journal of Machine Learning Research, 10:1737–
1754, 2009.
[19] A. Bordes, L. Bottou, P. Gallinari, and J. Weston. Solving multiclass support
vector machines with LaRank. In Zoubin Ghahramani, editor, Proceedings of
the 24th International Machine Learning Conference (ICML), pages 89–96.
OmniPress, 2007.
[20] A. Bordes, L. Bottou, P. Gallinari, and J. Weston. Solving multiclass sup-
port vector machines with LaRank. In Proceedings of the 24th international
conference on Machine learning, page 96. ACM, 2007.
[21] A. Bordes, S. Ertekin, J. Weston, and L. Bottou. Fast kernel classifiers with
online and active learning. The Journal of Machine Learning Research, 6:1619,
2005.
[22] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for op-
timal margin classifiers. In Proceedings of the Fifth Annual Workshop on
Computational Learning Theory (COLT 1992), pages 144–152. ACM, 1992.
[23] L. Bottou. Online algorithms and stochastic approximations. Online Learning
and Neural Networks. Cambridge University Press, Cambridge, UK, 5, 1998.
[24] L. Bottou. Stochastic learning. In O. Bousquet and U. von Luxburg, editors,
Advanced Lectures on Machine Learning, Lecture Notes in Artificial Intelli-
gence, LNAI 3176, pages 146–168. Springer Verlag, Berlin, 2004.
[25] L. Bottou and O. Bousquet. The tradeoffs of large scale learning. Advances
in neural information processing systems, 20:161–168, 2008.
[26] L. Bottou and Y. LeCun. Large scale online learning. In Sebastian Thrun,
Lawrence Saul, and Bernhard Scholkopf, editors, Advances in Neural Infor-
mation Processing Systems 16. MIT Press, Cambridge, MA, 2004.
[27] L. Bottou and C.J. Lin. Support vector machine solvers. In L. Bottou,
O. Chapelle, D. DeCoste, and J. Weston, editors, Large Scale Kernel Ma-
chines, pages 1–28. MIT Press, 2007.
[28] O. Bousquet, S. Boucheron, and G. Lugosi. Introduction to statistical learning
theory. Advanced Lectures on Machine Learning, pages 169–207, 2004.
[29] S.P. Boyd and L. Vandenberghe. Convex optimization. Cambridge Univ Pr,
2004.
[30] P. Boyle and B. Levin. World cancer report 2008. IARC Press Lyon, France,
2008.
[31] U.M. Braga-Neto and E.R. Dougherty. Is cross-validation valid for small-
sample microarray classification? Bioinformatics, 20(3):374, 2004.
[32] E. J. Bredensteiner and K. P. Bennett. Multicategory classification by support
vector machines. Computational Optimization and Applications, 12(1):53–79,
1999.
[33] L. Breiman. Classification and regression trees. Chapman & Hall/CRC, 1984.
[34] H.C. Causton, J. Quackenbush, and A. Brazma. Microarray gene expression
data analysis: a beginner’s guide. Wiley-Blackwell, 2003.
[35] G. Cauwenberghs and T. Poggio. Incremental and decremental support vector
machine learning. In Advances in neural information processing systems 13:
proceedings of the 2000 conference, page 409. The MIT Press, 2001.
[36] C.C. Chang and C.J. Lin. LIBSVM: a library for support vector machines,
2001.
[37] O. Chapelle. Training a support vector machine in the primal. Neural Com-
putation, 19(5):1155–1178, 2007.
[38] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning,
20(3):273–297, 1995.
[39] C. Cortes and V. Vapnik. Support-vector networks. Machine learning,
20(3):273–297, 1995.
[40] T. Cover and P. Hart. Nearest neighbor pattern classification. IEEE Trans-
actions on Information Theory, 13(1):21–27, 1967.
[41] T.M. Cover. Capacity problems for linear machines. Pattern recognition,
pages 283–289, 1968.
[42] K. Crammer and Y. Singer. On the algorithmic implementation of multiclass
kernel-based vector machines. Journal of Machine Learning Research, 2:265–
292, 2002.
[43] K. Crammer and Y. Singer. On the learnability and design of output codes
for multiclass problems. Machine Learning, 47(2):201–233, 2002.
[44] F. Cucker and S. Smale. On the mathematical foundations of learning. Bul-
letin of American Mathematical Society, 39(1):1–50, 2002.
[45] F. Cucker and D.X. Zhou. Learning theory: an approximation theory view-
point. Cambridge Univ Pr, 2007.
[46] N. Dalal and B. Triggs. Histograms of oriented gradients for human detec-
tion. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 886–893, 2005.
[47] T. Damoulas and M.A. Girolami. Probabilistic multi-class multi-kernel learn-
ing: on protein fold recognition and remote homology detection. Bioinfor-
matics, 24(10):1264, 2008.
[48] C. Demirkesen and H. Cherifi. A comparison of multiclass SVM methods for
real world natural scenes. In J. Blanc-Talon, S. Bourennane, W. Philips,
D. Popescu, and P. Scheunders, editors, Advanced Concepts for Intelligent
Vision Systems (ACIVS 2008), volume 5259 of LNCS, pages 763–763.
Springer, 2008.
[49] J. Demsar. Statistical comparisons of classifiers over multiple data sets. Jour-
nal of Machine Learning Research, 7:1–30, 2006.
[50] L. Devroye. Any Discrimination Rule Can Have an Arbitrarily Bad Probabil-
ity of Error for Finite Sample Size. IEEE Transactions on Pattern Analysis
And Machine Intelligence, 4(2):154–156, 1982.
[51] L. Devroye, L. Gyorfi, and G. Lugosi. A probabilistic theory of pattern recog-
nition. Springer Verlag, 1996.
[52] T.G. Dietterich. Approximate statistical tests for comparing supervised clas-
sification learning algorithms. Neural Computation, 10(7):1895–1923, 1998.
[53] T.G. Dietterich and G. Bakiri. Solving multiclass learning problems via error-
correcting output codes. Arxiv preprint cs/9501101, 1995.
[54] C.H.Q. Ding and I. Dubchak. Multi-class protein fold recognition using sup-
port vector machines and neural networks. Bioinformatics, 17(4):349, 2001.
[55] U. Dogan, T. Glasmachers, and C. Igel. A novel approach to consistent multi-
category support vector classification. Submitted, 2011.
[56] K. Dontas and K. De Jong. Discovery of maximal distance codes using genetic
algorithms. In Tools for Artificial Intelligence, 1990., Proceedings of the 2nd
International IEEE Conference on, pages 805–811, 1990.
[57] K. Duan and S. S. Keerthi. Which is the best multiclass SVM method? An
empirical study. In N. C. Oza, R. Polikar, J. Kittler, and F. Roli, editors, Pro-
ceedings of the Sixth International Workshop on Multiple Classifier Systems
(MCS 2005), volume 3541 of LNCS, pages 278–285, 2005.
[58] R.O. Duda, P.E. Hart, and D.G. Stork. Pattern classification. Citeseer, 2001.
[59] R.O. Duda, J.W. Machanik, and R.C. Singleton. Function Modeling Experi-
ments, 1963.
[60] S. Dutta, K. Burkhardt, J. Young, G.J. Swaminathan, T. Matsuura, K. Hen-
rick, H. Nakamura, and H.M. Berman. Data deposition and annotation at
the worldwide protein data bank. Molecular biotechnology, 42(1):1–13, 2009.
[61] M. Enzweiler and D. M. Gavrila. Monocular pedestrian detection: Survey
and experiments. IEEE Transactions on Pattern Analysis and Machine In-
telligence, 31(12):2179–2195, 2009.
[62] T. Evgeniou, M. Pontil, and T. Poggio. Regularization networks and support
vector machines. Advances in Computational Mathematics, 13(1):1–50, 2000.
[63] W.J. Ewens and G.R. Grant. Statistical methods in bioinformatics: an intro-
duction. Springer Verlag, 2005.
[64] R.E. Fan, P.H. Chen, and C.J. Lin. Working set selection using second or-
der information for training support vector machines. Journal of Machine
Learning Research, 6:1889–1918, 2005.
[65] X. Fan, L. Shi, H. Fang, Y. Cheng, R. Perkins, and W. Tong. DNA Microar-
rays Are Predictive of Cancer Prognosis: A Re-evaluation. Clinical Cancer
Research, 16(2):629, 2010.
[66] M.C. Ferris and T.S. Munson. Interior-point methods for massive support
vector machines. SIAM Journal on Optimization, 13(3):783–804, 2003.
[67] T. Finley and T. Joachims. Training structural SVMs when exact inference is
intractable. In Proceedings of the 25th international conference on Machine
learning, pages 304–311. ACM, 2008.
[68] E. Fix and J. Hodges. Discriminatory Analysis-Nonparametric Discrimina-
tion: Consistency Properties, 1951.
[69] E. Fix and J.L. Hodges Jr. Discriminatory Analysis-Nonparametric Discrim-
ination: Small Sample Performance, 1952.
[70] H. Fleyeh and M. Dougherty. Traffic sign classification using invariant features
and support vector machines. In Proceedings of the IEEE Intelligent Vehicles
Symposium, pages 530–535, 2008.
[71] V. Franc and S. Sonnenburg. Optimized cutting plane algorithm for sup-
port vector machines. In Proceedings of the 25th international conference on
Machine learning, pages 320–327. ACM, 2008.
[72] Y. Freund and R. Schapire. A desicion-theoretic generalization of on-line
learning and an application to boosting. In Computational Learning Theory,
pages 23–37. Springer, 1995.
[73] W.J. Fu, R.J. Carroll, and S. Wang. Estimating misclassification error with
small samples via bootstrap cross-validation. Bioinformatics, 21(9):1979,
2005.
[74] K. Fukumizu, F.R. Bach, and M.I. Jordan. Dimensionality reduction for
supervised learning with reproducing kernel Hilbert spaces. The Journal of
Machine Learning Research, 5:73–99, 2004.
[75] S. García and F. Herrera. An extension on statistical "comparisons of classi-
fiers over multiple data sets" for all pairwise comparisons. Journal of Machine
Learning Research, 9:2677–2694, 2008.
[76] T. Glasmachers. Universal Consistency of Multi-Class Support Vector Classi-
fication. In Advances in Neural Information Processing Systems (NIPS), 2010.
[77] T. Glasmachers and C. Igel. Maximum-gain working set selection for SVMs.
Journal of Machine Learning Research, 7:1437–1466, 2006.
[78] T. Glasmachers and C. Igel. Second-order smo improves svm online and active
learning. Neural computation, 20(2):374–382, 2008.
[79] T. Glasmachers and C. Igel. Second order SMO improves SVM online and
active learning. Neural Computation, 20(2):374–382, 2008.
[80] H. Grabner and H. Bischof. On-line boosting and vision. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition, pages
260–267, 2006.
[81] Y. Guermeur. VC theory for large margin multi-category classifiers. Journal
of Machine Learning Research, 8:2551–2594, 2007.
[82] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer
classification using support vector machines. Machine learning, 46(1):389–
422, 2002.
[83] R.W. Hamming. Error detecting and error correcting codes. Bell System
Technical Journal, 29(2):147–160, 1950.
[84] T. Hastie and R. Tibshirani. Classification by pairwise coupling. Annals of
Statistics, 26(2):451–471, 1998.
[85] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learn-
ing: Data Mining, Inference, and Prediction. Springer-Verlag, 2001.
[86] J.B. Hiriart-Urruty and C. Lemarechal. Convex Analysis and Minimization
Algorithms: Fundamentals. Springer, 1993.
[87] C.W. Hsu and C.J. Lin. A comparison of methods for multiclass support vec-
tor machines. IEEE Transactions on Neural Networks, 13(2):415–425, 2002.
[88] P. Hu, G. Bader, D.A. Wigle, and A. Emili. Computational prediction of
cancer-gene function. Nature Reviews Cancer, 7(1):23–34, 2006.
[89] D. Hush, P. Kelly, C. Scovel, and I. Steinwart. QP algorithms with guaranteed
accuracy and run time for support vector machines. The Journal of Machine
Learning Research, 7:769, 2006.
[90] D. Hush and C. Scovel. Polynomial-time decomposition algorithms for support
vector machines. Machine Learning, 51(1):51–71, 2003.
[91] E. Ie, J. Weston, W.S. Noble, and C. Leslie. Multi-class protein fold recogni-
tion using adaptive codes. In Proceedings of the 22nd international conference
on Machine learning, pages 329–336. ACM, 2005.
[92] C. Igel, T. Glasmachers, and V. Heidrich-Meisner. Shark. Journal of Machine
Learning Research, 9:993–996, 2008.
[93] A. Jain and D. Zongker. Feature selection: Evaluation, application, and small
sample performance. IEEE transactions on pattern analysis and machine
intelligence, 19(2):153–158, 1997.
[94] T. Joachims. Making large-scale SVM learning practical. In B. Scholkopf,
C. Burges, and A. Smola, editors, Advances in Kernel Methods – Support
Vector Learning, chapter 11, pages 169–184. MIT Press, 1998.
[95] T. Joachims. Text categorization with support vector machines: Learning
with many relevant features. Machine Learning: ECML-98, pages 137–142,
1998.
[96] T. Joachims. Training linear SVMs in linear time. In Proceedings of the 12th
ACM SIGKDD international conference on Knowledge discovery and data
mining, page 226. ACM, 2006.
[97] T. Joachims, T. Finley, and C.N.J. Yu. Cutting-plane training of structural
SVMs. Machine Learning, 77(1):27–59, 2009.
[98] T. Joachims and C.N.J. Yu. Sparse kernel SVMs via cutting-plane training.
Machine Learning, 76(2):179–193, 2009.
[99] M. J. Kearns and R. E. Shapire. Efficient distribution-free learning of prob-
abilistic concepts. Journal of Computer and System Sciences, 48(3):464–497,
1994.
[100] S.S. Keerthi, O. Chapelle, and D. DeCoste. Building support vector ma-
chines with reduced classifier complexity. The Journal of Machine Learning
Research, 7:1515, 2006.
[101] S.S. Keerthi and E.G. Gilbert. Convergence of a generalized SMO algorithm
for SVM classifier design. Machine Learning, 46(1):351–360, 2002.
[102] SS Keerthi, SK Shevade, C. Bhattacharyya, and KRK Murthy. Improvements
to Platt’s SMO algorithm for SVM classifier design. Neural Computation,
13(3):637–649, 2001.
[103] C. G. Keller, C. Sprunk, C. Bahlmann, J. Giebel, and G. Baratoff. Real-
time recognition of U.S. speed signs. In Proceedings of the IEEE Intelligent
Vehicles Symposium, pages 518–523, 2008.
[104] G. Kimeldorf and G. Wahba. Some results on Tchebycheffian spline functions*
1. Journal of Mathematical Analysis and Applications, 33(1):82–95, 1971.
[105] J. Kivinen, A.J. Smola, and R.C. Williamson. Online learning with kernels.
IEEE Transactions on Signal Processing, 52(8):2165–2176, 2004.
[106] S. Klement, A. Madany Mamlouk, and T. Martinetz. Reliability of cross-
validation for SVMs in high-dimensional, low sample size scenarios. Artificial
Neural Networks-ICANN 2008, pages 41–50, 2008.
[107] S.R. Kulkarni, G. Lugosi, and S.S. Venkatesh. Learning pattern classification-
a survey. IEEE Transactions on Information Theory, 44(6):2178–2206, 1998.
[108] H.J. Kushner and G. Yin. Stochastic approximation and recursive algorithms
and applications. Springer Verlag, 2003.
[109] S. Lafuente-Arroyo, P. García-Díaz, F.J. Acevedo-Rodríguez, P. Gil-Jiménez,
and S. Maldonado-Bascón. Traffic sign classification invariant to rotations
using support vector machines. In Proceedings of the Conference on Advanced
Concepts for Intelligent Vision Systems, pages 37–42, 2004.
[110] S.R. Lakhani and A. Ashworth. Microarray and histopathological analysis of
tumours: the future and the past? Nature Reviews Cancer, 1(2):151–157,
2001.
[111] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning
applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324,
1998.
[112] Y. Lee, Y. Lin, and G. Wahba. Multicategory Support Vector Machines:
Theory and Application to the Classification of Microarray Data and Satellite
Radiance Data. Journal of the American Statistical Association, 99(465):67–
82, 2004.
[113] N. List and H.U. Simon. A general convergence theorem for the decomposition
method. Learning Theory, pages 363–377, 2004.
[114] Y. Liu. Fisher consistency of multicategory support vector machines. In
M. Meila and X. Shen, editors, Eleventh International Conference on Artificial
Intelligence and Statistics (AISTATS), pages 289–296, 2007.
[115] J.A. Ludwig and J.N. Weinstein. Biomarkers in cancer staging, prognosis and
treatment selection. Nature Reviews Cancer, 5(11):845–856, 2005.
[116] D.J.C. MacKay. Information theory, inference, and learning algorithms. Cam-
bridge Univ Pr, 2003.
[117] D.J.C. MacKay and R.M. Neal. Near Shannon limit performance of low
density parity check codes. Electronics letters, 33(6):457–458, 1997.
[118] S. Maldonado-Bascon, S. Lafuente-Arroyo, P. Gil-Jimenez, Hilario Gomez-
Moreno, and Francisco Lopez-Ferreras. Road-sign detection and recognition
based on support vector machines. IEEE Transactions on Intelligent Trans-
portation Systems, 8(2):264–278, 2007.
[119] S. Mehrotra. On the implementation of a primal-dual interior point method.
SIAM Journal on Optimization, 2:575, 1992.
[120] J. Miura, T. Kanda, and Y. Shirai. An active vision system for real-time
traffic sign recognition. In Proceedings of IEEE International Conference on
Intelligent Transportation Systems, pages 52–57, 2000.
[121] F. Moutarde, A. Bargeton, A. Herbin, and L. Chanussot. Robust on-vehicle
real-time visual detection of American and European speed limit signs, with a
modular Traffic Signs Recognition system. In Intelligent Vehicles Symposium,
2007 IEEE, pages 1122–1126. IEEE, 2007.
[122] A. S. Muhammad, N. Lavesson, P. Davidsson, and M. Nilsson. Analysis of
speed sign classification algorithms using shape based segmentation of binary
images. In Proceedings of the International Conference on Computer Analysis
of Images and Patterns, pages 1220–1227, 2009.
[123] E.E. Ntzani and J. Ioannidis. Predictive ability of DNA microarrays for
cancer outcomes and correlates: an empirical assessment. The Lancet,
362(9394):1439–1444, 2003.
[124] E. Osuna, R. Freund, and F. Girosi. Improved Training Algorithm for Support
Vector Machines. In J. Principe, L. Giles, N. Morgan, and E. Wilson, editors,
Neural Networks for Signal Processing VII, pages 276–285. IEEE Press, 1997.
[125] E. Osuna, R. Freund, and F. Girosit. Training support vector machines: an
application to face detection. In 1997 IEEE Computer Society Conference on
Computer Vision and Pattern Recognition, 1997. Proceedings., pages 130–136,
1997.
[126] C. Papageorgiou and T. Poggio. A trainable system for object detection.
International of Journal Computer Vision, 38(1):15–33, 2000.
[127] W.W. Peterson and E.J. Weldon. Error-Correcting Codes. The MIT Press, 1972.
[128] J.C. Platt. Fast training of support vector machines using sequential minimal
optimization. In B. Scholkopf, C. Burges, and A. Smola, editors, Advances in
Kernel Methods – Support Vector Learning, chapter 11, pages 185–208. MIT
Press, 1998.
[129] T. Poggio, S. Mukherjee, R. Rifkin, A. Rakhlin, and A. Verri. b. In J. Win-
kler and M. Niranjan, editors, Uncertainty in Geometric Computations, chap-
ter 11, pages 131–141. Kluwer Academic Publishers, 2002.
[130] K.D. Pruitt, T. Tatusova, and D.R. Maglott. NCBI reference sequences (Ref-
Seq): a curated non-redundant sequence database of genomes, transcripts and
proteins. Nucleic Acids Research, 2006.
[131] J.R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann,
1993.
[132] T. Reichhardt. It’s sink or swim as a tidal wave of data approaches. Nature,
399(6736):517–520, 1999.
[133] R. Rifkin and A. Klautau. In defense of one-vs-all classification. Journal of
Machine Learning Research, 5:101–141, 2004.
[134] F. Rosenblatt. Principles of Neurodynamics: Perceptrons and the Theory of
Brain Mechanisms. Spartan Books, 1962.
[135] J. Salmen, T. Suttorp, J. Edelbrunner, and C. Igel. Evolutionary optimization
of wavelet feature sets for real-time pedestrian classification. In Proceedings
of the IEEE Conference on Hybrid Intelligent Systems, pages 222–227, 2007.
[136] M. Schena, D. Shalon, R.W. Davis, and P.O. Brown. Quantitative monitoring
of gene expression patterns with a complementary DNA microarray. Science,
270(5235):467, 1995.
[137] B. Scholkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson.
Estimating the support of a high-dimensional distribution. Neural Computa-
tion, 13(7):1443–1471, 2001.
[138] B. Scholkopf and A.J. Smola. Learning with Kernels: Support Vector Machines,
Regularization, Optimization, and Beyond. MIT Press, 2002.
[139] B. Scholkopf and A.J. Smola. Learning with Kernels: Support Vector Ma-
chines, Regularization, Optimization, and Beyond. MIT Press, 2002.
[140] F. Sebastiani. Machine learning in automated text categorization. ACM
Computing Surveys, 34(1):1–47, 2002.
[141] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal estimated sub-
gradient solver for SVM. In Proceedings of the 24th International Conference
on Machine Learning, pages 807–814. ACM, 2007.
[142] S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter. Pegasos: Primal
estimated sub-gradient solver for SVM. Mathematical Programming,
127(1):3–30, 2011.
[143] S. Shalev-Shwartz and N. Srebro. SVM optimization: inverse dependence
on training set size. In Proceedings of the 25th International Conference on
Machine Learning, pages 928–935. ACM, 2008.
[144] J. Shawe-Taylor and N. Cristianini. Robust bounds on generalization from the
margin distribution. In 4th European Conference on Computational Learning
Theory. Citeseer, 1999.
[145] S. Sonnenburg, G. Ratsch, C. Schafer, and B. Scholkopf. Large scale multiple
kernel learning. The Journal of Machine Learning Research, 7:1531–1565,
2006.
[146] J.C. Spall. Introduction to stochastic search and optimization: estimation,
simulation, and control. John Wiley and Sons, 2003.
[147] A. Statnikov, C.F. Aliferis, I. Tsamardinos, D. Hardin, and S. Levy. A com-
prehensive evaluation of multicategory classification methods for microarray
gene expression cancer diagnosis. Bioinformatics, 21(5):631, 2005.
[148] I. Steinwart. On the influence of the kernel on the consistency of support
vector machines. The Journal of Machine Learning Research, 2:67–93, 2002.
[149] I. Steinwart, D. Hush, and C. Scovel. Training SVMs without offset. Technical
Report LA-UR-09-00638, Los Alamos National Laboratory (LANL), 2009.
[150] D. Stekel. Microarray Bioinformatics. Cambridge University Press, 2003.
[151] C.J. Stone. Consistent nonparametric regression. The Annals of Statistics,
5(4):595–620, 1977.
[152] S. Szedmak, J. Shawe-Taylor, and E. Parrado-Hernandez. Learning via lin-
ear operators: Maximum margin regression. Technical report, PASCAL,
Southampton, UK, 2006.
[153] C.H. Teo, S.V.N. Vishwanathan, A.J. Smola, and Q.V. Le. Bundle methods
for regularized risk minimization. Journal of Machine Learning Research,
11:311–365, 2010.
[154] A. Tewari and P. L. Bartlett. On the Consistency of Multiclass Classification
Methods. Journal of Machine Learning Research, 8:1007–1025, 2007.
[155] E. Torres and S. Khuri. Applying evolutionary algorithms to combinatorial
optimization problems. Computational Science-ICCS 2001, pages 689–698,
2001.
[156] J. Torresen, J.W. Bakke, and L. Sekanina. Efficient recognition of speed limit
signs. In Proceedings of the IEEE International Conference on Intelligent
Transportation Systems, pages 652–656, 2004.
[157] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Support vector
machine learning for interdependent and structured output spaces. Journal
of Machine Learning Research, 6:1453–1484, 2005.
[158] V. Vapnik. Statistical Learning Theory. Wiley, 1998.
[159] V. Vapnik. The Nature of Statistical Learning Theory. Springer, 2000.
[160] V. Vapnik and A. Chervonenkis. Theory of Pattern Recognition. Nauka, 1974.
[161] R. Vicen-Bueno, A. Garcia-Gonzalez, E. Torijano-Gordo, R. Gil-Pita, and
M. Rosa-Zurera. Traffic sign classification by image preprocessing and neural
networks. In Proceedings of the Work-Conference on Artificial Neural Net-
works, pages 741–748, 2007.
[162] P. Viola and M. Jones. Robust real-time object detection. International
Journal of Computer Vision, 57(2):137–154, 2004.
[163] J. Weston and C. Watkins. Support vector machines for multi-class pattern
recognition. In M. Verleysen, editor, Proceedings of the Seventh European
Symposium On Artificial Neural Networks (ESANN), pages 219–224, 1999.
[164] K. Woodsend. Using Interior Point Methods for Large-scale Support Vector
Machine Training. PhD thesis, University of Edinburgh, 2009.
[165] K. Woodsend and J. Gondzio. Exploiting separability in large-scale linear sup-
port vector machine training. Computational Optimization and Applications,
pages 1–29, 2009.
[166] S.J. Wright. Primal-dual interior-point methods. Society for Industrial Math-
ematics, 1997.
[167] T. Zhang. Solving large scale linear prediction problems using stochastic
gradient descent algorithms. In Proceedings of the twenty-first international
conference on Machine learning, page 116. ACM, 2004.
[168] H. Zou, J. Zhu, and T. Hastie. The margin vector, admissible loss and multi-
class margin-based classifiers. Annals of Applied Statistics, 2:1290–1306, 2008.
Resume
Personal Data
Name Urun Dogan
Date of birth 4 April 1979
Place of birth Eskisehir
E-Mail [email protected]
Education and Work Experience
1994 - 1996 Eskisehir Science High School
1996 - 1997 Eskisehir Ataturk High School
1997 - 2001 B.Sc. in Mechanical Engineering, Istanbul Technical University
2001 - 2004 M.Sc. in System Dynamics and Control, Istanbul Technical University
2005 - now Research Fellow, Institut fur Neuroinformatik, Ruhr-Universitat Bochum