Training Multi-class
Support Vector Machines
Dissertation
submitted in fulfilment of the requirements
for the degree of Doktor-Ingenieur
of the Faculty of Electrical Engineering
and Information Technology
of Ruhr-Universität Bochum

submitted by
Urun Dogan
Institut für Neuroinformatik
Ruhr-Universität Bochum
Bochum
January 2011
ACKNOWLEDGEMENT
This dissertation would not have been possible without the guidance and help of
several individuals who, in one way or another, contributed and extended their
valuable assistance to me in the preparation and completion of this study.
I, therefore, extend thanks to the following:
First and foremost, my utmost gratitude to Prof. Dr. Christian Igel whose sin-
cerity and encouragement I will never forget. Prof. Dr. Igel has been my inspiration
as I hurdled the obstacles that presented themselves during the completion of this
research work.
Dr. Tobias Glasmachers, with whom I discussed mathematical issues and who
introduced and explained to me many mathematical and statistical concepts. With-
out Dr. Glasmachers' unlimited patience I would not have understood abstract
concepts of mathematics and statistics.
Prof. Dr. Ioannis Iossifidis, who encouraged me during my PhD studies whenever
I was struggling with a problem. As an exemplary group leader, Prof. Dr. Ios-
sifidis solved every administrative problem and created a great research
environment for me.
David William Eric Clark reviewed my thesis. He not only improved the lan-
guage of my previous manuscripts but also restructured many parts of my thesis.
David William Eric Clark spent all his summer holidays with me in order to im-
prove my manuscript. Using his extensive engineering knowledge and experience,
he helped me to explain abstract concepts of machine learning in plain English.
Further still, the discussions with him about abstract machine learning techniques
deepened my understanding of machine learning.
Mathias Tuma also reviewed my thesis. As a machine learning expert, he
gave me many valuable comments and also helped me restructure my thesis.
I learned a great deal from him, and it was a great opportunity to work with him.
Verena Heidrich-Meisner, my former office mate, discussed many scientific ideas
with me. She always shed light on the issues we were discussing and showed me
different perspectives on them.
Last but not least, my family and M. Kemal Atatürk, the founder of modern
Turkey. Without M. Kemal Atatürk's achievements and without the continuous
support and encouragement of my family I would not have reached this point in my
educational journey, which has, to date, taken 25 years.
Contents

1 Introduction
1.1 Contributions
1.2 Thesis Structure
2 Statistical Learning Theory
2.1 Supervised Learning
2.2 Consistent Classifier
2.3 Classifiers
2.3.1 Nearest Neighbour Classifier
2.3.2 Perceptron
2.3.3 Support Vector Machines (SVMs)
3 Multi-class Support Vector Machines
3.1 Sequential Multi-Class SVMs
3.1.1 Multi-Class Classification with Maximum Margin Regression (MC-MMR)
3.1.2 One versus All (OVA)
3.2 All-in-One Multi-class Machines
3.2.1 The Weston and Watkins Method (WW)
3.2.2 The Crammer and Singer Method
3.2.3 Lee, Lin, & Wahba SVM
4 Unified View to All-in-One Multi-class Machines
4.1 Novel Approach to Multi-Class SVM Classification
5 Solvers
5.1 Related Work
5.1.1 Interior Point Methods
5.1.2 Direct Optimization of Primal Problem
5.1.3 On-line Methods
5.1.4 Cutting Plane Approaches
5.1.5 Stochastic Gradient Descent
5.1.6 Decomposition Algorithms
5.1.7 General Comments on SVM Solvers
5.2 Decomposition Algorithms for Multi-Class SVMs
5.2.1 Dropping the Bias Parameters
5.2.2 Working Set Sizes for Decomposition Algorithms
5.2.3 On Working Variable Selection
5.2.4 Maximum Violating Pair Method
5.2.5 Second Order Working Variable Selection for SMO
5.2.6 Second Order Working Pair Selection for S2DO
5.2.7 Solving the Crammer and Singer Multi-class SVM Using SMO
5.2.8 Efficient Caching for All-in-one Machines
6 Conceptual and Theoretical Analysis of Multi-class SVMs
6.1 Margins in Multi-Class SVMs
6.2 Margins in Multi-Class Maximum Margin Regression
6.3 Margins in the One Versus All Classifier
6.4 Margin Normalization for Multi-Class Machines
6.5 Generalization Analysis
6.6 Universal Consistency of Multi-Class SVM
6.7 Training Complexity
7 Empirical Comparison and Applications
7.1 Preliminaries for Empirical Evaluation
7.1.1 Model selection
7.1.2 Stopping Conditions
7.1.3 Statistical Evaluation
7.2 Multi-class Benchmark Problems
7.2.1 Summary of Results
7.3 Traffic Sign Recognition
7.3.1 Related work
7.3.2 Features
7.3.3 Benchmark Data
7.3.4 Experiments and Results
7.3.5 Setup
7.3.6 Results
7.3.7 Summary of Results
7.4 Multi-class Problems in Bioinformatics
7.4.1 Cancer Classification and Diagnosis with Microarray Gene Expression
7.4.2 Summary of Results
7.4.3 Protein Secondary Structure Prediction
7.4.4 Summary of Results
8 Conclusions
Chapter 1
Introduction
Making observations and collecting data about natural or man-made phenomena
lies at the heart of all science and knowledge generation. As the amount of data
collected and the complexity of the generating processes grow, so does the need
for sophisticated data analysis techniques. Today, a wealth of data exploration
methods is at the modern scientist's disposal. Over recent decades, adaptive
learning systems [17, 158], or machine learning methods, have emerged. Rather
than being fixed or rigid in the sense of acting in exactly the same, predefined
manner on different data, they adapt to properties of the data they encounter.
One special domain of machine learning is the supervised learning setting. Here the
learner’s task is to derive as hypothesis a map from an input space to an output space
while relying on exemplary observations, i.e., sample pairs of inputs and outputs.
This hypothesis should, to the best extent possible, also hold when evaluated on
additional input-output pairs stemming from the same underlying distribution as
the training data set.
Supervised learning problems for which the output space is a finite set are re-
ferred to as classification tasks. If the set has cardinality two, one speaks of binary
classification, and of multi-class classification otherwise. In real-world applications,
many classification tasks naturally are multi-class problems, such as object recog-
nition, traffic sign recognition, and protein secondary structure prediction. At the
same time, several pattern recognition algorithms, like e.g. support vector machines,
have originally been designed for binary problems and are less easily applied in the
multi-class setting. This thesis focuses on multi-class classification using support
vector machine classifiers.
Support vector machines (SVMs, [22, 38]) are the state of the art for binary clas-
sification. They are founded on the intuitive geometric concept of large margin
separation. They are also well understood due to their roots in theories of repro-
ducing kernel Hilbert spaces, regularized risk minimization, and statistical learning
theory. In addition, SVMs exhibit excellent performance over a wide range of ap-
plications.
Both from a geometric as well as from a learning theoretical point of view there
is no unique canonical extension of SVMs to multiple classes. Instead, several dif-
ferent formulations relying on slightly different notions of margin and margin-based
loss have evolved, most of which reduce to the standard machine in the special case
of a binary problem. Two general strategies to extend SVMs to multi-category clas-
sification can be distinguished. The first approach is to combine separately trained
SVM classifiers. The well-known one-versus-all and one-versus-one methods are
examples for this strategy [139]. Both methods are discussed in this thesis, but
only one-versus-all is explained in detail. In the second family of algorithms, a
single optimization problem considering all classes is derived and solved at once.
These all-in-one methods are usually computationally more demanding [87], but –
at least from my point of view – theoretically more elegant, and this elegance
promises better classification results.
Although no significant differences in classification accuracy between these two
approaches have been observed in some studies, other studies (e.g., [48]) show sig-
nificantly better classification performance of all-in-one SVMs compared to the one-
vs-all approach in practice. The first of these all-in-one formulations (referred to
as WW) was independently proposed by Weston & Watkins [163], Vapnik [158],
and Bredensteiner & Bennett [32]. This machine has been modified by Cram-
mer & Singer (CS) [42]. Their approach is frequently used, in particular when
dealing with structured output. In addition, I consider the conceptually different
approach by Lee, Lin, & Wahba (LLW) [112], relying on a classification calibrated
loss function [154, 114], and multi-class maximum margin regression (MMR) [152]
proposed by Szedmak et al. The latter is equivalent to a multi-class SVM suggested
by Zou et al. [168].
From a geometric point of view all of these extensions offer plausible concepts
of margin and margin violation (or, to be more precise, margin-based loss). From a
learning-theoretical perspective the situation is less clear. There have been different
attempts to generalize learning bounds for binary SVMs to the multi-class case, see
for example [81]. But I am not aware of any bounds that could demonstrate relative
advantages or disadvantages of different multi-class SVM formulations. Still, there
are a few hints from theoretical analysis, in terms of so-called classification calibra-
tion and universal consistency. The LLW machine is the only machine known to
rely on a classification calibrated loss function [154, 114] (which implies Fisher con-
sistency1), while very recently the CS machine has been shown to be a universally
consistent classifier [76].
From this background and inspired by work by Liu [114], I derive a refined unified
formulation of popular all-in-one SVMs revealing their similarities and differences.
A novel multi-class SVM is developed by using the proposed unified view. The
new classifier canonically combines the margin concept of the LLW machine with
the margin-based loss used in the CS approach. This combination, which has been
overlooked so far, is revealed by the novel multi-class method.
1 Fisher consistency does not imply universal consistency. I regard universal consistency as the more fundamental statistical property.
Another important issue is the required training time of multi-class SVMs. Long
training times limit their applicability.
In particular, the canonical extension of binary SVMs to multiple classes (referred
to as WW, [163, 32, 158]) as well as the SVM proposed by Lee, Lin, & Wahba
(LLW, [112]) are rarely used. These approaches are theoretically sound and experi-
ments indicate that they lead to well-generalizing hypotheses, but efficient training
algorithms are not available. Crammer & Singer (CS, [42]) proposed their arguably
most popular modification of the WW SVM mainly to speed-up the training pro-
cess. Still, the fast one-vs-all method [158, 133] is most frequently used when SVMs
are applied to multi-class problems for training time reasons.
Against this background, I consider batch training of multi-class SVMs with
universal (i.e., non-linear) kernels and ask the questions: Is it possible to increase the
learning speed of multi-class SVMs by using a more efficient quadratic programming
method? Do statistical properties have a practically measurable or even significant
impact on classification performance? Can an instructive generalization bound be
developed for explaining the empirical results? This thesis gives positive answers
to these questions. Efficient training algorithms for all-in-one methods are provided.
These make training of LLW machines practical and allow WW SVMs to be trained
as fast as CS's variant. Extensive experiments demonstrate the superior generalization
performance of the LLW machine. A simple generalization bound, which matches the
empirical results, is developed, indicating why the WW machine outperforms the
CS SVM.
1.1 Contributions
Before explaining technical details, the contributions of this thesis shall be stated:
• Development of a unified view on all-in-one multi-class machines (see Chap-
ter 4)
• Design of a new all-in-one multi-class machine (see Section 4.1)
• Development of a new solver (i.e., training algorithm) for multi-class machines
(see Chapter 5)
• Conceptual and geometrical analysis of different margin concepts used in
multi-class machines (see Chapter 6)
• Proof of an instructive generalization bound (see Chapter 6)
• Extensive experimental comparison of six multi-class machines (see Chapter 7)
1.2 Thesis Structure
The thesis is organized as follows. In Chapter 2, the preliminaries of statistical learn-
ing theory [158] and several classifiers including the SVM for binary classification
are explained. In Chapter 3, an overview of multi-class classification with SVMs is
given and the five previously proposed multi-class SVMs are formulated. Some of
these are then reformulated in order to develop similar solvers. The proposed unified
view on all-in-one multi-class SVMs and the new multi-class method are discussed
in Chapter 4. In Chapter 5, different methods for solving the SVM problems are
summarized and the new solver for multi-class SVMs is defined. The geometrical
and conceptual differences between the multi-class SVMs, arising from their different
margin concepts, are discussed in Chapter 6. Also, a new generalization bound
is developed in Chapter 6. Finally, I supply a detailed experimental evaluation of
all six different multi-class machines in Chapter 7.
Chapter 2
Statistical Learning Theory
This chapter gives a brief summary of the preliminaries of statistical learning theory
[158]. Further, it summarizes the basic idea of support vector machines (SVMs)
[158].
2.1 Supervised Learning
In science, it is common practice to observe a phenomenon and then deduce from
the observations a model representing the rules of nature. In supervised learning,
one similarly wants to derive a model of the relations between the inputs and the
outputs of a system of interest. In the following, some notation used throughout
this thesis will be defined.
Given any measurable space X, a set of vectors Dℓ = {x1, . . . , xℓ} such that
xi ∈ X for all i = 1, . . . , ℓ, where all xi are independently generated by the same
probability distribution, is called a set of inputs (or input vectors), and X is called
the input space. For SVMs the input space X can be any measurable set; in other
words, there is no restriction on X such as being a vector space. The values y ∈ Y
are the responses of the supervisor to the inputs and are called labels. The set Y
represents all possible responses the supervisor can give and is called the output
space. A set of pairs Sℓ = ((x1, y1), . . . , (xℓ, yℓ)) ∈ (X × Y )ℓ containing input
vectors and the corresponding responses of the supervisor is called a training set or
training data, and the cardinality of the training set (i.e., the number of training
examples) is denoted by ℓ. For simplicity and in accordance with the literature, it
will be assumed throughout this thesis that there is an underlying unknown
probability distribution Υ(x, y) on X × Y and that the training data are generated
by sampling from Υ(x, y). A single random realisation of Υ(x, y) will be denoted
by (x, y). Given the training set Sℓ, finding a function f : X → Y is called
supervised learning. If Y = R, the supervised problem is called a regression problem;
if Y is a finite discrete set, it is called a classification problem, and each distinct
element of Y is said to correspond to one class. Given the training set Sℓ,
estimating the underlying probability distribution from the data at hand is called
density estimation.
If in a classification task |Y | = 2, the problem is called a binary classification
problem. One of the classes is called the positive class and the other the negative
class. If |Y | > 2, the problem is called a multi-class classification problem and
the number of classes is denoted by d = |Y |. If the number of training pairs for
each class is equal in the training set Sℓ, then the training data set is called balanced.
Regression problems are considered to be harder than classification problems, and
among classification problems, multi-class problems are harder than binary ones.
To see this, assume the data set is balanced and consider a classifier that assigns
random labels to inputs: the probability of correct classification is 1/2 in the binary
case, but only 1/d in the balanced multi-class case.
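This baseline can be checked with a quick simulation (an illustrative sketch, not part of the thesis; the function name is my own):

```python
import random

def random_guess_accuracy(d, trials=100_000, seed=0):
    """Fraction of correct predictions made by a classifier that guesses
    uniformly at random on a balanced d-class problem."""
    rng = random.Random(seed)
    hits = sum(rng.randrange(d) == rng.randrange(d) for _ in range(trials))
    return hits / trials

# The accuracy is close to 1/2 for d = 2 and close to 1/10 for d = 10.
print(random_guess_accuracy(2), random_guess_accuracy(10))
```

With 100,000 trials the estimates differ from 1/d only by the usual Monte Carlo error.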
Each explanation of the empirical data is called a hypothesis. It should be noted
that one may find more than one hypothesis for the data set at hand. However,
this does not mean that all these hypotheses explain the real system, from which
the data were generated or collected, equally well. A procedure that automatically
generates hypotheses from empirical data is called a learning machine.
Besides specific proposals for learning machines that provide a hypothesis
f , there are general aspects worth considering [5]:
• Approximation: Assume infinitely many data points are available for training
and pick a function class ℑ such that f ∈ ℑ. The first question is whether ℑ
is inclusive enough to approximate the true relation between inputs and labels.
This question can be approached from the point of view of approximation
theory, because there are strong synergies between supervised learning and
approximation theory [44, 45]. However, this thesis will stick to statistical
learning theory [158].
• Estimation: Since the information about the problem is limited, the true
relationship between inputs and labels is not known. The second question
therefore is: how much data does one need to model the unknown relationship?
Again, statistical learning theory [158] will provide an answer.
• Computational efficiency: How can the training data be used to choose a
model that is accurate enough whilst using as few computational resources
as possible? One might argue that the computational power of computers is
constantly increasing; however, the available data to be analysed is increasing
at an even faster rate. Even more importantly, a supervised learning task may
become harder as the number of training samples increases [21].
Unfortunately, giving answers to these questions is not trivial [158, 5]. Within
the scope of this thesis, multi-class SVMs will be analysed from the approximation
and the computational efficiency points of view, and some methods for efficient
training of multi-class SVMs in the case of limited computational power will be
developed. In supervised learning, one wants to obtain a function with a low gen-
eralization error. In addition, during training one needs to identify which member
of the function class ℑ best explains the unknown relation between inputs and
labels. To succeed in both tasks, an auxiliary function is needed that measures
the performance of the hypothesis at a given training point. This function is
called the loss function. Its mathematical definition is:
Definition 1 Any function L satisfying the following conditions

1. L : Y × Y → [0,∞)

2. L is monotonically non-decreasing

3. L(f(xi), yi) = 0 if and only if f(xi) = yi

is called a loss function.
There are several popular loss functions:

• The 0–1 loss is defined as

L(f(xi), yi) = 1 if f(xi) ≠ yi, and 0 otherwise . (2.1)

• The least squares loss is defined as

L(f(xi), yi) = (f(xi) − yi)² . (2.2)

• The hinge loss is defined as

L(f(xi), yi) = max(0, 1 − yif(xi)) . (2.3)

• The exponential loss is defined as

L(f(xi), yi) = exp(1 − yif(xi)) . (2.4)

• The logistic loss is defined as

L(f(xi), yi) = log(1 + exp(−yif(xi))) . (2.5)
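As a quick illustration, these losses can be written down directly for a real-valued score f(x) and a label y ∈ {−1, +1} (a sketch of my own; for the 0–1 loss the sign of f(x) is taken as the prediction, and the logistic loss is used in its standard form log(1 + exp(−yf(x)))):

```python
import math

def zero_one(f, y):       # 0-1 loss, with sign(f) as the predicted label
    return float(y * f <= 0)

def least_squares(f, y):  # least squares loss
    return (f - y) ** 2

def hinge(f, y):          # hinge loss
    return max(0.0, 1.0 - y * f)

def exponential(f, y):    # exponential loss
    return math.exp(1.0 - y * f)

def logistic(f, y):       # logistic loss
    return math.log(1.0 + math.exp(-y * f))

# A point classified correctly with margin at least 1 incurs no hinge loss,
# while the smooth losses still assign it a small positive value.
print(hinge(2.0, +1))   # 0.0
print(logistic(2.0, +1))
```

Note how the hinge loss is exactly zero beyond margin 1, whereas the exponential and logistic losses keep rewarding larger margins.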
The selection of the loss is an important issue in supervised learning. The differences
between the losses are illustrated in Fig. 2.1. Within the scope of this thesis, unless
otherwise mentioned, the hinge loss given in eq. (2.3) will be used. The performance
of the hypothesis at a single training point is not the only quantity of interest;
generally, the overall performance of the hypothesis over the entire distribution Υ(x, y)
Figure 2.1: The outputs of the popular loss functions (0–1, least squares, hinge, exponential, and logistic). The horizontal axis represents the value of the argument of the loss function and the vertical axis shows the output of the loss function.
is of greater interest. The functional defined for this purpose is called the expected
risk and is defined as

I[f] = ∫_{X×Y} L(f(x), y) Υ(x, y) dx dy . (2.6)

The minimizer of eq. (2.6) is written as

f∗ = arg min_{f∈ℑ} I[f] . (2.7)
Generally one cannot find f∗ via eq. (2.7) because Υ(x, y) is unknown.
Density estimation techniques could be used to estimate Υ(x, y), and the result
of this estimation could be used in eq. (2.6) to find f∗. Although this is a valid
approach, density estimation is a harder problem than classification and regression
[158]. Empirical Risk Minimization (ERM) [62] can be used to overcome these issues.
The main idea of ERM is to minimize, instead of the expected risk given in
eq. (2.6), the empirical risk of the training set at hand, defined as

Iemp[f] = (1/ℓ) ∑_{i=1}^{ℓ} L(f(xi), yi) . (2.8)

Here it should be noted that Iemp is a random variable because it depends on the
training set at hand.
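A minimal sketch of eq. (2.8): the empirical risk is just the average loss over the training set (the function and variable names below are my own):

```python
def empirical_risk(f, samples, loss):
    """Iemp[f]: average loss of hypothesis f over samples = [(x1, y1), ...]."""
    return sum(loss(f(x), y) for x, y in samples) / len(samples)

# Toy example: a threshold classifier on the real line under the 0-1 loss.
f = lambda x: 1 if x >= 0 else -1
zero_one = lambda fx, y: float(fx != y)
data = [(-2.0, -1), (-1.0, -1), (0.5, 1), (1.0, 1), (-0.5, 1)]
print(empirical_risk(f, data, zero_one))  # 0.2: one of five points is misclassified
```

Evaluating the same hypothesis on a different sample would generally yield a different value, which is exactly why Iemp is a random variable.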
In the ERM framework, Υ(x, y) is exchanged for the finite training data at a
cost: the question is how close the minimizer of Iemp[f] is to that of I[f] in the
case of finite data. Statistical learning theory gives answers to this question under
different conditions.
Because ERM uses finite training data, problems beyond the approximation
aspects may arise in practice, namely overfitting and/or underfitting. To illustrate
these problems, a regression problem is considered with sin(t) as the target
function. The data set contains 20 points sampled equidistantly from the target
function, to which univariate Gaussian noise with standard deviation 0.6 was
added. In other words, the samples from the target function are corrupted. The
task is now to learn the target function from these data. The target function and
the corrupted data are illustrated in Figure 2.2-a) and -b). First, a first-order
polynomial is fitted to the data by least squares; then a fourth-order and a
twentieth-order polynomial are fitted, shown in Figure 2.2-c), d) and e),
respectively. Underfitting occurs when a too simple model is used, e.g. the
first-order polynomial for this problem. Overfitting occurs when a too complex
model is used, e.g. the twentieth-order polynomial. Intuitively, it is easy to see
that the fourth-order polynomial is better suited as a hypothesis than the other
two. In this example, a model that neither over- nor underfits could be selected
by inspection, because the target function and the problem are low dimensional.
But how can such a trade-off be formalized in the general setting? Developing an
appropriate solution to this problem is an important issue. A solution is given by
the concept of Structural Risk Minimization (SRM) introduced by Vapnik
[158, 160] for supervised learning. SRM balances two contradictory goals: the
first is selecting a hypothesis with small empirical risk (ER); the second is
selecting a hypothesis of small complexity, as measured by some suitable function
on ℑ [158, 107, 28]. Basically, SRM consists of four steps. First, using prior
knowledge, one chooses a class of functions ℑ. In the second step, the chosen
class is divided into nested subclasses of increasing complexity. In the third step,
ERM is applied to the problem at hand. In the final step, the model with the
minimum weighted sum of the empirical risk and the complexity of the function
class is selected.
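The polynomial example above is easy to reproduce. The sketch below is my own construction (NumPy least-squares fits; degree 19 is used for the most complex model, since a degree-19 polynomial already interpolates 20 points) and reports both the training error and the error against the noise-free target:

```python
import numpy as np
from numpy.polynomial import Polynomial

def fit_errors(degree, n=20, noise=0.6, seed=0):
    """Fit a polynomial of the given degree to n noisy samples of sin(t) on
    [0, 2*pi]; return (training MSE, MSE against the noise-free target)."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 2.0 * np.pi, n)
    y = np.sin(t) + rng.normal(0.0, noise, size=n)
    p = Polynomial.fit(t, y, degree)  # least-squares fit on a rescaled domain
    grid = np.linspace(0.0, 2.0 * np.pi, 200)
    return (float(np.mean((p(t) - y) ** 2)),
            float(np.mean((p(grid) - np.sin(grid)) ** 2)))

for degree in (1, 4, 19):  # underfit, reasonable fit, (near-)interpolating overfit
    train_mse, target_mse = fit_errors(degree)
    print(degree, round(train_mse, 4), round(target_mse, 4))
```

The training error decreases monotonically with the degree, while the error against the true target is typically smallest for the moderate-degree model: exactly the trade-off SRM is designed to formalize.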
Until now, the complexity of a function class has been mentioned without any
technical details. In SRM, the Vapnik-Chervonenkis (VC) dimension is generally
used as the measure of complexity. In order to give the technical definition of the
VC dimension, a similar line of argument as in [107, 51] will be followed. Any
binary classifier can be identified with the subset of X to which it assigns the
positive class, so any classifier can be regarded as a subset of X. In the following,
the subset corresponding to f ∈ ℑ is also denoted by f.
Given a set of points Dℓ, define Λℑ(Dℓ) as the number of distinct subsets of Dℓ
that can be obtained as intersections Dℓ ∩ f with some f ∈ ℑ. The ℓth shattering
coefficient is defined as

S(ℑ, ℓ) = max_{Dℓ} Λℑ(Dℓ) .

If S(ℑ, ℓ) = 2^ℓ, in other words if every subset of Dℓ can be obtained as an
intersection with some f ∈ ℑ, then Dℓ is said to be shattered by ℑ. The
Vapnik-Chervonenkis (VC) dimension of the function class ℑ, denoted by V Cℑ,
is defined as the largest integer h such that there exists a set of cardinality h
that is shattered by
-1.5
-1
-0.5
0
0.5
1
1.5
0 1 2 3 4 5 6 7
Data Generation Function( sin(t) )
a)
-2
-1.5
-1
-0.5
0
0.5
1
1.5
0 1 2 3 4 5 6 7
Data With Noise
b)-2
-1.5
-1
-0.5
0
0.5
1
1.5
0 1 2 3 4 5 6 7
Data With Noise
First order Polynomial
c)
-2
-1.5
-1
-0.5
0
0.5
1
1.5
0 1 2 3 4 5 6 7
Data With Noise
Fourth order Polynomial
d)-2
-1.5
-1
-0.5
0
0.5
1
1.5
0 1 2 3 4 5 6 7
Data With Noise
Twentieth order Polynomial
e)
Figure 2.2: Overfitting and undefitting problems are illustrated in the figure. Theoriginal sin(t) is shown in a), the corrupted training data points is shown in b). Theoutputs of the first, fourth and twentieth order polynomials are shown in c), d) ande) respectively. Underfitting problem is shown in c) and the overfitting problem isshown in e).
ℑ. It is important to note that if the VC dimension of ℑ is h, there is at least one
set of h points which can be shattered by ℑ. However, this does not mean that
every set of h points will be shattered by ℑ. Now one of the main results of
statistical learning theory, namely a bound on the expected risk [159, 158], will be
stated:
Theorem 1 Choose some η such that 0 ≤ η ≤ 1. Then, for losses taking the
values 0 or 1, the following bound holds with probability 1 − η:

I[f] ≤ Iemp[f] + √( ( h(log(2ℓ/h) + 1) − log(η/4) ) / ℓ ) (2.9)

where h represents the VC dimension of ℑ. In the following, the term Iemp[f] is
denoted by T1, the term h(log(2ℓ/h) + 1) by T2, and the term log(η/4) by T3.
Several facts related to this bound should be explained. The term T1 is the
empirical risk. T2 involves the number of training examples, ℓ, as well as the VC
dimension of the function class, h, and is essentially dominated by h. Hence, if one
wants to minimize T2, one needs to select a function class with a small value of h.
The term T3 depends on the degree of confidence, η, of the bound. Generally,
one wants to be as sure as possible, and therefore the value of η is usually small.
Analysing the term under the square root, containing T2 and T3, one sees that
it is inversely proportional to the number of training examples, ℓ. In other words,
when the number of training examples increases, the second term on the right-hand
side approaches zero. However, in practical problems ℓ is fixed. Using the bound
(2.9) on the expected risk, the SRM framework selects a learning machine for the
given data. In summary, given a family of hypotheses, Theorem 1 implies that one
should search for a hypothesis which minimizes the sum of the empirical risk and
the complexity term.
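The behaviour of the confidence term in (2.9) is easy to inspect numerically (a sketch; the function name is mine):

```python
import math

def vc_confidence(h, l, eta=0.05):
    """Second summand on the right-hand side of the bound (2.9):
    sqrt((h * (log(2l/h) + 1) - log(eta/4)) / l)."""
    return math.sqrt((h * (math.log(2.0 * l / h) + 1.0)
                      - math.log(eta / 4.0)) / l)

# The term shrinks as the sample size l grows and grows with the VC dimension h.
for l in (100, 1000, 10000):
    print(l, round(vc_confidence(h=10, l=l), 3))
```

This makes the qualitative discussion above concrete: for fixed h and η, the confidence term decays roughly like √(log ℓ / ℓ), while richer function classes (larger h) pay a larger penalty.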
2.2 Consistent Classifier
If one restricts oneself to the classification task, one can ask: what is the best
performance, in the sense of accuracy, that is reachable by a classifier? The
answer to this question is known if the prior probabilities and class conditional
distributions are known. In this case the optimal classifier, in the sense of
minimum probability of error or misclassification rate, is the Bayes Decision Rule
[58, 107, 112]. It is denoted by fB(x), and the loss corresponding to the Bayes
Decision Rule is denoted by LB. Given (x, y), pj(x) is defined as P(Y = j | x) for
j = 1, . . . , k. The Bayes Decision Rule minimizing the expected misclassification
rate is

fB(x) = arg min_{j=1,...,k} [1 − pj(x)] = arg max_{j=1,...,k} pj(x) .
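Given the posteriors pj(x), the rule is a one-liner (a sketch with hypothetical names; classes are numbered 1, . . . , k as above):

```python
def bayes_decision(posteriors):
    """Bayes Decision Rule: return the class j in {1, ..., k} maximizing
    the posterior P(Y = j | x), given as the list posteriors[j - 1]."""
    return max(range(1, len(posteriors) + 1), key=lambda j: posteriors[j - 1])

# At a point x with posteriors (0.2, 0.5, 0.3) the rule predicts class 2;
# its conditional probability of error at x is 1 - 0.5 = 0.5.
print(bayes_decision([0.2, 0.5, 0.3]))  # 2
```

Of course, this presumes the posteriors are known, which is exactly what fails in practice, as discussed next.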
However, in most real-life problems these probabilities and distributions are not
known, so computing the outcome of the Bayes Decision Rule is not possible.
Even though the training data is finite, it is reasonable to expect a classifier to
yield an error that approaches the error rate of the Bayes Decision Rule as ℓ → ∞.
A classifier whose outcomes converge in this sense to those of the Bayes Decision
Rule is called a consistent classifier.
There are three important types of consistency in statistics and machine learning.
First, if E[L(fℓ)] → LB as ℓ → ∞, then fℓ is consistent. Second, if
L(fℓ) → LB as ℓ → ∞, then fℓ is strongly consistent. Finally, if
L(fℓ) → LB as ℓ → ∞ for any Υ(x, y), then fℓ is universally consistent.
It is natural to strive for a universally consistent classifier, but it is hard to make
assumptions that hold independently of Υ(x, y). However, in a seminal paper Stone
[151] showed the existence of universally consistent classifiers. This supplies
practitioners with an important guarantee: if sufficient data is supplied to a
universally consistent classifier, then it will perform on test data as accurately as
the Bayes Decision Rule. Unfortunately, having a universally consistent classifier
does not guarantee that the classifier will perform as accurately as the Bayes
Decision Rule on the test data at hand, because the technical meaning of sufficient
data with respect to the problem at hand is unclear. Indeed, if the training data
set is small, any universally consistent classifier can perform arbitrarily badly.
This may be due to the slow convergence of L(fℓ) to LB [41, 50]. The design of a
good classifier is hard and non-trivial. In the next sections, common classifiers,
namely the nearest neighbour classifier, the Perceptron and SVMs, will be presented.
2.3 Classifiers
2.3.1 Nearest Neighbour Classifier
The nearest neighbour classifier (NN) [68, 69, 40] is probably the simplest non-
parametric algorithm for classification tasks. In the original definition [68, 69], the
1-nearest neighbour (1-NN) algorithm assigns to a test example the class of its
nearest neighbour in the training set. As with most methods of classification, the
performance of 1-NN depends on the chosen metric. A generalised version of
1-NN is the d-nearest neighbour classifier (d-NN): for a given test example, it
assigns the label of the majority of the d nearest neighbours in the training set.
Although the nearest neighbour classifier is a simple algorithm, it is
consistent when d → ∞ and d/ℓ → 0 as ℓ → ∞. Moreover, it is competitive with
other state-of-the-art methods [107].
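A minimal sketch of the d-NN rule described above, in pure Python with a Euclidean metric; the toy data and names are our own choices.

```python
import math
from collections import Counter

def d_nn_predict(train, x, d=3):
    """Classify x by majority vote among its d nearest training examples
    (Euclidean metric; train is a list of (point, label) pairs)."""
    neighbours = sorted(train, key=lambda p: math.dist(p[0], x))[:d]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

train = [((0.0, 0.0), 'a'), ((0.1, 0.2), 'a'), ((0.2, 0.1), 'a'),
         ((1.0, 1.0), 'b'), ((0.9, 1.1), 'b'), ((1.1, 0.9), 'b')]
print(d_nn_predict(train, (0.15, 0.15)))   # nearest cluster is class 'a'
```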
2.3.2 Perceptron
Rosenblatt [134] proposed the well-known Perceptron algorithm in the early 1960s.
In its original construction, the Perceptron algorithm handles only binary
classification and assumes an unlimited supply of training samples. The most
important restriction of the Perceptron algorithm is that it is only applicable to
problems that are linearly separable, i.e. where there is at least one hyperplane
between the classes that separates them without error. The Perceptron algorithm
starts from an arbitrary point and constructs such a separating hyperplane
iteratively. For finite data, the algorithm cycles through the data and updates
the parameters whenever a training example is misclassified.
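The cyclic update scheme just described can be sketched as follows; the stopping rule and toy data are our own choices for illustration, not taken from [134].

```python
def perceptron(data, epochs=100):
    """Cyclic Perceptron on (x, y) pairs with y in {-1, +1}.
    Returns (w, b) of a separating hyperplane if one is found."""
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in data:
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                # update only on a misclassified example
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:   # a full error-free pass: a separating hyperplane found
            break
    return w, b

data = [((2.0, 1.0), 1), ((1.5, 2.0), 1), ((-1.0, -1.5), -1), ((-2.0, -0.5), -1)]
w, b = perceptron(data)
print(all(y * (sum(wi * xi for wi, xi in zip(w, x)) + b) > 0 for x, y in data))
```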
Before discussing the Perceptron algorithm further, an important concept, namely
the 'margin', should be defined.

Definition 2 Given training data Sℓ and a hyperplane of the form f(x) = 〈w, x〉 + b,
define the margin ν(Sℓ, w) as

ν(Sℓ, w) = min_{(xi, yi) ∈ Sℓ}  yi f(xi) / ‖w‖ .    (2.10)
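Definition 2 translates directly into code; the hyperplane and data below are invented for illustration.

```python
import math

def margin(data, w, b):
    """nu(S, w) = min over (x_i, y_i) of y_i * f(x_i) / ||w||,
    with f(x) = <w, x> + b (Definition 2)."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
               for x, y in data)

data = [((2.0, 2.0), 1), ((0.5, 1.0), 1), ((-2.0, -2.0), -1)]
print(margin(data, w=(1.0, 1.0), b=0.0))   # the point (0.5, 1.0) attains the minimum
```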
Here it should be noted that the definition is still valid if Sℓ contains only one
point. The margin is a function of the data set and the hyperplane; for simplicity
it will be denoted by ν. It should be underlined that the margin is defined for
binary problems. If the data set at hand is linearly separable, Novikoff's bound
[158] for the Perceptron algorithm states that the number of errors the Perceptron
algorithm makes during training is inversely proportional to the squared margin.
As Figure 2.3 illustrates, any data set that is linearly separable with positive
margin, i.e. ν > 0 for some hyperplane, can be separated by an infinite number of
hyperplanes. However, the Perceptron algorithm does not include a mechanism to
select any particular hyperplane; instead, it provides a solution that depends on
the initial parameters. This led many researchers to note that the accuracy of the
Perceptron algorithm for a given problem depends on the quality of the separating
hyperplane [5, 158]. In general, the assumption necessary for the Perceptron to be
applicable, namely that the classes are linearly separable, is not warranted for most
real-world problems. This severely limits its usefulness in practice.
2.3.3 Support Vector Machines (SVMs)
Motivated by the intrinsic shortcomings of the Perceptron algorithm, namely its
failure to actively select among the possible hyperplanes and to converge in cases
that are not linearly separable, more research on the role of the margin and of
margin violations followed. In the following, some general theorems on properties
of separating hyperplanes will be given. First, the 'canonical hyperplane' will be
introduced, because it is used in these theorems; then two important theorems of
statistical learning theory will be stated.
Definition 3 Given a set of linearly separable points Dℓ = {x1, . . . , xℓ}, consider
a hyperplane of the form f(x) = 〈w, x〉 + b. If the hyperplane has the property

min_{n=1,...,ℓ} |f(xn)| = 1    (2.11)

it is called a canonical hyperplane.
Vapnik [139, 158] related the VC bound and the margin as follows:

Theorem 2 Given a set of points Dℓ = {x1, . . . , xℓ} separated by a canonical
hyperplane: select a class of functions ℑ of the form fw(x) = sgn(〈x, w〉), where sgn
is the sign/signum function, and the norm of the hyperplane w is bounded from above
by a positive constant, i.e. ‖w‖ ≤ Ω. If R is the radius of the smallest ball centred
at the origin containing Dℓ, then the VC dimension h of ℑ is bounded by

h ≤ R²Ω² .    (2.12)
The main interpretation of this theorem is that the VC dimension of the function
class can be bounded using the norm of w. Please note that a value for Ω should
be chosen beforehand. Although this bound clearly shows the relation between the
norm of the weight vector and the VC dimension, it does not give the relation
between the margin and the VC dimension. Therefore another relationship is needed
for this purpose. Vapnik [139, 158] stated the relation between the margin and the
VC dimension as follows:
Figure 2.3: Points of two different classes that are linearly separable are shown.
Among the infinitely many possible separating hyperplanes between the two classes,
three are illustrated (P1, P2 and P3). The Perceptron algorithm constructs a
separating hyperplane between the classes depending on the initial parameter values.
Note that P3 has a margin smaller than P1, which in turn has a smaller margin than
P2. It is clear that, among the three, P2 is more robust against noise in further
test data than P1 and P3. The Perceptron algorithm does not supply a solution with
the maximum margin even for linearly separable data.
Theorem 3 Given a set of points Dℓ = {x1, . . . , xℓ} such that ‖x‖ ≤ R for all
x ∈ Dℓ, with R > 0. Let ℑ be the class of functions of the form fw(x) = sgn(〈x, w〉)
with the norm of w bounded from above by a positive constant, i.e. ‖w‖ ≤ Ω. Define
µ as the fraction of points that have a margin smaller than ν, where ν > 0. Then
for all distributions Υ(x, y) from which the data is generated, with probability of at
least 1 − δ and for any ν > 0 and δ ∈ (0, 1), the probability p(fw(xi) ≠ yi) that a
test pattern xi drawn from Υ(x, y) will be misclassified is bounded by

p(fw(xi) ≠ yi) ≤ µ + √[ (σ/ℓ) ( (R²Ω²/ν²) ln²ℓ + ln(1/δ) ) ] ,    (2.13)

where σ is an unknown universal constant.
This theorem bounds the probability of misclassifying unseen test examples.
The bound given in (2.13) has two components. The first component is basically the
fraction of points whose margin is smaller than ν/‖w‖. The second component
represents the complexity of the learning machine and requires closer examination.
The complexity component is proportional to R and Ω and inversely proportional to
ν. To minimize the complexity of the learning machine, one needs to make R and Ω
small, and one also needs to make ν as large as possible. Since ℓ, R and Ω are fixed
beforehand, ν is the parameter that drives the complexity of the learning machine.
On the one hand, a large ν makes the complexity of the learning machine smaller;
however, this will likely increase µ and hence also have a detrimental effect on the
bound. On the other hand, if ν goes to zero, µ also goes to zero, but then the
complexity of the learning machine tends to infinity. It is clear that maximizing ν
and minimizing µ are contradictory goals; in other words, there exists an intricate
trade-off between them, and how to settle this trade-off still has to be determined.
An algorithm which provides a consistent framework for such trade-offs is the
Support Vector Machine, which will be introduced in the next section.
2.3.3.1 Hard Margin Support Vector Machines
In order to clearly explain SVMs, two closely related margin concepts, namely the
functional and the geometrical margin, should be defined and the difference between
them stated. Given training data Sℓ and a hyperplane f(x) of the form
〈w, x〉 + b, the functional margin of the nth training example with respect to f(x)
is defined as

yn(〈w, xn〉 + b) .

The functional margin is not scale invariant, which means the functional margin
can be increased arbitrarily by multiplying w and b by a positive scalar value.
To resolve this issue a different margin concept, namely the geometrical margin, is
used in SVMs. The geometrical margin of the nth training example with respect to
w is defined as

yn(〈w, xn〉 + b) / ‖w‖ .

If not stated otherwise, the geometrical margin is referred to as the margin in this
thesis. Basically, SVMs construct the optimal hyperplane by finding the hyperplane
that maximises the margin ν (defined in eq. (2.10); please note the relation
between it and the geometrical margin of a single training example) and minimizes
µ. To do that, SVMs fix the functional margin to 1 (which is the case for canonical
hyperplanes) and search for the hyperplane w that has the smallest norm possible
and the smallest fraction µ of points that have a margin smaller than 1/‖w‖. Before
explaining the technical details of SVMs, it should be noted that SVMs were
initially developed for binary classification problems and regression problems. For
the remainder of this section, SVMs for binary problems will be briefly explained,
and multi-class SVMs will be discussed in Chapter 3.
In this section Hard Margin Support Vector Machines, which, just like the
Perceptron, are only applicable to linearly separable data, will be defined. However,
they actively select a specific hyperplane among all possible ones. Motivated by
Theorem 2 and Theorem 3, the criterion for selecting the hyperplane is the largest
possible margin, as shown in Figure 2.4. This can be formulated as follows: given
that the training data is linearly separable and that SVMs pick a function from the
space ℑ consisting of canonical hyperplanes, all training data satisfy the following
constraints:
〈xn, w〉 + b ≥ 1    if yn = 1    (2.14)

〈xn, w〉 + b ≤ −1   if yn = −1   (2.15)

These constraints can be written more compactly as:

∀n ∈ {1, . . . , ℓ} : yn(〈xn, w〉 + b) ≥ 1    (2.16)
As a result of Theorem 3, the goal also is to minimize the norm of w. Together this
yields the optimisation problem:

min_{w,b}  (1/2) 〈w, w〉    (2.17)

s.t. ∀n ∈ {1, . . . , ℓ} : yn(〈xn, w〉 + b) ≥ 1    (2.18)
The optimization problem (2.17) is an example of a constrained convex optimization
problem, referred to as the primal problem. This class of optimization problems is
well established in optimization theory, and standard approaches to solve them
exist. One well-known approach is the method of Lagrange multipliers [15], which
will be used in this thesis. There are two reasons for using the Lagrange
multiplier method: first, the constraints are replaced by Lagrange variables that
are easy to handle; second, the optimization problem can be rewritten in such a way
that the training data is only used in inner products. The Lagrangian of (2.17) is:
L = (1/2) 〈w, w〉 − ∑_{n=1}^{ℓ} αn (yn(〈xn, w〉 + b) − 1)    (2.19)
An optimization problem (the so-called dual optimization problem) which is
equivalent to eq. (2.17) can be formulated by using eq. (2.19). To do this, the
Lagrangian should be minimized with respect to the primal variables, w and b, and
maximized with respect to the dual variables, α. To this end, the partial
derivatives of the Lagrangian with respect to the primal variables, w and b, are

∂L/∂w = w − ∑_{n=1}^{ℓ} αn yn xn    (2.20)

∂L/∂b = − ∑_{n=1}^{ℓ} yn αn .    (2.21)
To find the saddle point of the Lagrangian, one sets ∂L/∂w and ∂L/∂b to zero and
obtains

w = ∑_{n=1}^{ℓ} αn yn xn    (2.22)

0 = ∑_{n=1}^{ℓ} yn αn .    (2.23)
By substituting eq. (2.22) into eq. (2.19), the Lagrangian in terms of the dual
variables is obtained:

L = ∑_{n=1}^{ℓ} αn − (1/2) ∑_{n,m=1}^{ℓ} yn ym αn αm 〈xn, xm〉 .    (2.24)

Finally, the dual problem is written as follows:

max_α  ∑_{n=1}^{ℓ} αn − (1/2) ∑_{n,m=1}^{ℓ} yn ym αn αm 〈xn, xm〉    (2.25)

s.t.  ∑_{n=1}^{ℓ} yn αn = 0

∀n ∈ {1, . . . , ℓ} : αn ≥ 0
2.3.3.2 Soft Margin Support Vector Machines
Unfortunately, Hard Margin SVMs are only applicable to linearly separable
problems, as reflected by the constraint (2.18) of the primal. In order to apply
SVMs to inseparable problems, the constraint (2.18) should be relaxed, thus
allowing margin violations for training examples. To this end, so-called slack
variables ξn ≥ 0, n = 1, . . . , ℓ, are defined and introduced into the constraints
(2.18). An illustration of slack variables and the optimal hyperplane is given in
Figure 2.5. For a binary soft margin SVM the following primal optimization problem
should be solved:

Figure 2.4: Two linearly separable classes are shown in the figure. Hard margin
SVMs construct a hyperplane that maximises the margin.
min_{w,b,ξ}  (1/2) 〈w, w〉 + C ∑_{n=1}^{ℓ} ξn    (2.26)

s.t. ∀n ∈ {1, . . . , ℓ} :

yn(〈xn, w〉 + b) ≥ 1 − ξn    (2.27)

ξn ≥ 0
C is the regularization coefficient. As C approaches infinity, the objective
function of the Soft Margin Support Vector Machine is dominated by the sum of the
slack variables, in other words by the amount of margin violation; as C approaches
zero, the objective function is dominated by the norm of the weight vector. In
other words, in the former case high priority is given to minimizing the margin
violations on the training data set, which may lead to overfitting, and in the
latter case higher priority is given to minimizing the complexity of the
hypothesis, which may lead to underfitting. Clearly, neither overfitting nor
underfitting is desired, and the solution to both problems is to choose an
appropriate regularization parameter for the problem at hand. Unfortunately, C
cannot be identified beforehand; therefore model selection procedures, i.e.
choosing a statistical model from a set of candidate models, should be applied to
identify C. The Lagrangian of the primal problem is:
L = (1/2) 〈w, w〉 + C ∑_{n=1}^{ℓ} ξn − ∑_{n=1}^{ℓ} αn (yn(〈xn, w〉 + b) − 1 + ξn) − ∑_{n=1}^{ℓ} βn ξn    (2.28)
Following the same procedure as for the Hard Margin SVM, the partial derivatives
with respect to the primal variables, w, b and ξ, are obtained:

∂L/∂w = w − ∑_{n=1}^{ℓ} αn yn xn    (2.29)

∂L/∂b = − ∑_{n=1}^{ℓ} yn αn

∂L/∂ξn = C − αn − βn .
To find the saddle point of the Lagrangian, one sets ∂L/∂w, ∂L/∂b and ∂L/∂ξn to
zero and obtains

w = ∑_{n=1}^{ℓ} αn yn xn    (2.30)

0 = ∑_{n=1}^{ℓ} yn αn    (2.31)

αn = C − βn    (2.32)
By definition the Lagrange multipliers are greater than or equal to zero, i.e.
αn ≥ 0 and βn ≥ 0 for all n = 1, . . . , ℓ. So equation (2.32) is equivalent to
0 ≤ αn ≤ C. By substituting eq. (2.30) into eq. (2.28), the Lagrangian in terms of
the dual variables is derived:

L = ∑_{n=1}^{ℓ} αn − (1/2) ∑_{n,m=1}^{ℓ} yn ym αn αm 〈xn, xm〉 .    (2.33)
Finally, the dual problem is written as:

max_α  ∑_{n=1}^{ℓ} αn − (1/2) ∑_{n,m=1}^{ℓ} yn ym αn αm 〈xn, xm〉    (2.34)

s.t.  ∑_{n=1}^{ℓ} yn αn = 0

∀n ∈ {1, . . . , ℓ} : 0 ≤ αn ≤ C
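The box-constrained dual above admits a very simple solver if the bias term b is dropped, so that the equality constraint ∑ yn αn = 0 disappears and only 0 ≤ αn ≤ C remains; each αn can then be updated in turn by exact coordinate ascent with clipping. The sketch below (pure Python, toy data, parameter values invented) illustrates this simplified variant only; it is not the solver used in this thesis, and the full dual with bias requires pairwise updates as in SMO.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_svm_dual(data, C=1.0, sweeps=100):
    """Coordinate ascent on the soft-margin dual with the bias b omitted, so only
    the box constraints 0 <= alpha_n <= C remain.  Returns alpha and the weight
    vector w = sum_n alpha_n y_n x_n from the stationarity condition."""
    alpha = [0.0] * len(data)
    q = [dot(x, x) for x, _ in data]          # diagonal of the Gram matrix
    w = [0.0] * len(data[0][0])
    for _ in range(sweeps):
        for i, (xi, yi) in enumerate(data):
            g = 1.0 - yi * dot(w, xi)         # gradient of the dual w.r.t. alpha_i
            new = min(C, max(0.0, alpha[i] + g / q[i]))   # clip to the box [0, C]
            delta = new - alpha[i]
            if delta != 0.0:
                w = [wj + delta * yi * xj for wj, xj in zip(w, xi)]
                alpha[i] = new
    return alpha, w

data = [((2.0, 2.0), 1), ((1.0, 3.0), 1), ((-2.0, -1.0), -1), ((-1.0, -3.0), -1)]
alpha, w = train_svm_dual(data, C=10.0)
print([1 if dot(w, x) > 0 else -1 for x, _ in data])
```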
Figure 2.5: Two linearly inseparable classes are shown in the figure. ξ1, ξ2, ξ3
and ξ4 denote slack variables. Soft margin SVMs construct a hyperplane that finds
a compromise between maximizing the margin and minimizing the sum of the slack
variables.
2.3.3.3 Soft Margin Support Vector Machines for Non-linear Cases
Until now, all SVM algorithms discussed use linear functions of the training data.
In other words, so far the SVM algorithms are restricted to the class of linear
functions of the training data, which have a limited ability to supply suitable
solutions. Further, methods using linear functions can only be applied to
vector-valued data. To turn the linear SVMs presented above into non-linear
algorithms, kernel functions [6, 158] will be used. Before giving a technical
definition of the 'kernel', the 'Gram Matrix' should be introduced.
Definition 4 Given a set of points Sℓ = {x1, . . . , xℓ} and a function
k : Sℓ × Sℓ → R, the matrix K ∈ R^{ℓ×ℓ} with elements K(n, m) = k(xn, xm) is
called the Gram Matrix of k(·, ·) with respect to Sℓ.

After giving the definition of the Gram Matrix, the 'kernel' is defined
mathematically:

Definition 5 A function k : Sℓ × Sℓ → R is called a kernel if it is symmetric and
its Gram Matrix with respect to Sℓ is positive semi-definite.
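The two defining properties of a kernel, symmetry and a positive semi-definite Gram Matrix, can be spot-checked numerically. The sketch below builds the Gram Matrix of a Gaussian kernel on a few invented points and tests vᵀKv ≥ 0 for random vectors, which is a necessary (not sufficient) numerical check of positive semi-definiteness.

```python
import math, random

def rbf(x, y, gamma=0.5):
    """Gaussian (RBF) kernel, a standard Mercer kernel."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

points = [(0.0, 0.0), (1.0, 0.5), (0.3, 2.0), (2.0, 2.0)]
K = [[rbf(xn, xm) for xm in points] for xn in points]   # the Gram Matrix

# Symmetry check: K(n, m) == K(m, n).
sym = all(abs(K[n][m] - K[m][n]) < 1e-12 for n in range(4) for m in range(4))

# Spot-check v^T K v >= 0 for random vectors v (necessary condition for PSD).
random.seed(0)
psd_spot = all(
    sum(v[n] * K[n][m] * v[m] for n in range(4) for m in range(4)) >= -1e-12
    for v in ([random.uniform(-1, 1) for _ in range(4)] for _ in range(100))
)
print(sym and psd_spot)
```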
After defining the kernel function, the conversion of linear SVMs into non-linear
ones can be explained. Firstly, it should be noted that in the dual formulations
the training data appears only in the form of inner products. If each training
example xn is mapped into another space, called the feature space, by a function
φ, then all inner product terms in the dual formulations can be replaced by the
new inner products in the feature space; in other words, 〈xn, xm〉 can be replaced
with 〈φ(xn), φ(xm)〉. However, this approach can pose a computational problem: the
feature space may be very high-dimensional or even infinite-dimensional. For some
feature spaces, one can replace 〈φ(xn), φ(xm)〉 with a direct evaluation k(xn, xm).
The question is: for which feature spaces can a kernel function be used to
calculate the inner product 〈φ(xn), φ(xm)〉? The answer to this question is given
by Mercer's Theorem [158]:
Theorem 4 A continuous symmetric function k(x, y) ∈ L2(C) has an expansion

k(x, y) = 〈φ(x), φ(y)〉

if and only if for any f(x) the condition

∫ k(x, y) f(x) f(y) dx dy ≥ 0

is valid.
If the kernel used satisfies Mercer's Theorem, then from a mathematical point of
view there is no difference between mapping each training example into the feature
space and calculating the inner product there, and directly using a function
k(·, ·) that equals this inner product, i.e. k(xn, xm) = 〈φ(xn), φ(xm)〉. From
a computational point of view the direct function evaluation is preferred because,
as mentioned before, in some cases the feature space can have infinite dimensions.
In the remainder of this thesis, unless otherwise stated, all kernels satisfy
Mercer's Theorem. Now, if the function k(·, ·) is a kernel, then a non-linear SVM
is obtained. The primal of the new learning machine using kernels is as follows:
min_{w,b,ξ}  (1/2) 〈w, w〉 + C ∑_{n=1}^{ℓ} ξn    (2.35)

s.t. ∀n ∈ {1, . . . , ℓ} :

yn(〈φ(xn), w〉 + b) ≥ 1 − ξn    (2.36)

ξn ≥ 0
The Lagrangian of the primal problem is:

L = (1/2) 〈w, w〉 + C ∑_{n=1}^{ℓ} ξn − ∑_{n=1}^{ℓ} αn (yn(〈φ(xn), w〉 + b) − 1 + ξn) − ∑_{n=1}^{ℓ} βn ξn    (2.37)
Following the same procedure as for the Hard Margin SVM, the partial derivatives
with respect to the primal variables, w, b and ξ, are obtained:

∂L/∂w = w − ∑_{n=1}^{ℓ} αn yn φ(xn)    (2.38)

∂L/∂b = − ∑_{n=1}^{ℓ} αn yn

∂L/∂ξn = C − αn − βn .
To find the saddle point of the Lagrangian, one sets ∂L/∂w, ∂L/∂b and ∂L/∂ξn to
zero and obtains

w = ∑_{n=1}^{ℓ} αn yn φ(xn)    (2.39)

0 = ∑_{n=1}^{ℓ} αn yn    (2.40)

αn = C − βn .    (2.41)
Substituting eq. (2.39) into eq. (2.37) yields the Lagrangian in terms of the dual
variables:

L = ∑_{n=1}^{ℓ} αn − (1/2) ∑_{n,m=1}^{ℓ} yn ym αn αm k(xn, xm) .    (2.42)
By converting eq. (2.41) into an inequality constraint as for the Soft Margin SVM,
the dual problem can be written as

max_α  ∑_{n=1}^{ℓ} αn − (1/2) ∑_{n,m=1}^{ℓ} yn ym αn αm k(xn, xm)

s.t.  ∑_{n=1}^{ℓ} αn yn = 0

∀n ∈ {1, . . . , ℓ} : 0 ≤ αn ≤ C .
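Combining the stationarity condition (2.39) with the kernel trick gives the familiar kernel expansion of the decision function, f(x) = ∑ αn yn k(xn, x) + b, in which φ never has to be computed explicitly. The sketch below assumes the dual variables are already given; the αn values and toy data are invented, not produced by a solver.

```python
import math

def rbf(x, z, gamma=1.0):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def svm_decision(train, alpha, b, x, kernel=rbf):
    """f(x) = sum_n alpha_n y_n k(x_n, x) + b, from w = sum_n alpha_n y_n phi(x_n);
    only kernel evaluations are needed, never phi itself."""
    return sum(a * y * kernel(xn, x) for a, (xn, y) in zip(alpha, train)) + b

# Hypothetical dual solution for a toy two-point problem (alphas invented here).
train = [((0.0, 0.0), -1), ((2.0, 2.0), 1)]
alpha, b = [0.8, 0.8], 0.0
print(1 if svm_decision(train, alpha, b, (1.8, 1.9)) > 0 else -1)
```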
Chapter 3
Multi-class Support Vector
Machines
In Chapter 2, general concepts of supervised learning and statistical learning
theory were discussed, and three different classifiers, namely the nearest
neighbour classifier, the Perceptron and binary SVMs, were explained. In this
chapter, multi-class classifiers that are derived from binary SVMs will be
discussed.

Throughout the remainder of this thesis, unless otherwise noted, a training data
set belonging to d classes, i.e. ((x1, y1), . . . , (xℓ, yℓ)) ∈ (X × {1, . . . , d})^ℓ,
sampled i.i.d. from a fixed but unknown probability distribution, will be denoted
by Sℓ. All machines explained in this thesis construct a decision function of the
form

x ↦ arg max_{c∈{1,...,d}}  〈wc, φ(x)〉 + bc .    (3.1)

Here, φ : X → H is a feature map into an inner product space H, w1, . . . , wd ∈ H
are class-wise weight vectors and b1, . . . , bd ∈ R are class-wise bias/offset
values.
The previously discussed SVM algorithms are restricted to binary classification
problems. However, many problems in practice contain more than two classes,
i.e. they are multi-class problems. Different ways of solving d-class
classification problems with SVMs will be discussed. In general, one distinguishes
two different approaches for solving d-class classification problems. The first is
to cast the d-class problem into a series of binary or one-class classification
problems. The second group of approaches constructs a single optimization problem
for the entire d-class problem. In this thesis, the methods belonging to the second
group will be referred to as all-in-one methods. The first approach has been
analysed extensively in the literature [53, 4, 43] due to its relative simplicity.
For kernel methods such as SVMs the feature space H can be constructed as
the Reproducing Kernel Hilbert Space (RKHS) (see [6, 139]) of a positive definite
kernel function k : X × X → R. The corresponding feature map takes the form
φ(x) = k(x, ·), and k(x, x′) = 〈φ(x), φ(x′)〉. The decision function (3.1) is illustrated
in Figure 3.2.
Figure 3.1: The two general approaches for solving d-class problems are shown,
together with the methods developed from each approach. All these methods will be
discussed in this study.
When solving d-class problems as a series of binary problems, there are two
common methods. The first is one-versus-one (OVO), also known as one-against-one.
The second is one-versus-all (OVA), also known as one-against-all. OVO casts a
multi-class problem into a series of pairwise binary problems; more precisely, OVO
constructs d(d−1)/2 binary problems. Although during the training phase of each
binary problem a decision function of the form (3.1) is used, it is not clear how
to combine the outputs of the d(d−1)/2 binary problems for a given test sample.
Several methods have been proposed to address this deficiency, i.e. to obtain a
unified decision function [84]; however, all these methods provide a solution that
reflects the process rather than the data. Because of this deficiency, OVO will
only be briefly explained in Section 3.1. OVA, which has a similar philosophy to
OVO but does provide a unique decision function, will be discussed in more detail
in Sections 3.1 and 3.1.2.
In OVA one constructs d classifiers and operates on them using a decision function
of the form (3.1). This thesis is restricted to classifiers which have a unique
decision function of the form (3.1). Regardless of the issues related to the
decision function, both OVO and OVA cast the d-class problem into a series of
binary classification problems. In addition to these two common methods, Szedmak
et al. [152] proposed a multi-class SVM method which casts the problem into a
series of one-class classification problems. In this method, each term inside the
argmax expression in equation (3.1) can be interpreted as a linear projection onto
a one-dimensional subspace. In multi-class classification it is natural to combine
these one-dimensional subspaces into a single d-dimensional decision space; this
corresponds to filling the components of a vector with the individual inner
products. This method addresses the decision function deficiencies of OVO and will
be discussed further in Sections 3.1 and 3.1.1.
As previously noted, the alternative to sequential approaches is to address the
problem with all-in-one methods [163, 42, 112]. Several methods have been proposed
for this approach, but a unified analysis is missing. These methods will be the
focus of Section 3.2, and a novel unified view of them will be supplied in
Chapter 4.
Figure 3.2: Illustration of multi-class support vector machine decision making.
First, the input space is (implicitly) mapped to a feature Hilbert space with the
feature map φ : X → H. Then, projections πi : H → R, h ↦ 〈wi, h〉 + bi,
i ∈ {1, . . . , d}, are applied, resulting in a d-dimensional label vector. Finally,
the label vector is mapped to a label index by applying the argmax decision
function (3.1). Here, only the positive octant of the decision space is drawn. The
solid planes are the different parts of the decision boundary separating pairs of
classes.
It is not a priori clear that the subspaces should be embedded along the
(orthogonal) coordinate axes. Equally validly, one could fix a so-called label
prototype vector vc ∈ R^d per class, embed the inner products into the decision
space as

v(x) = ∑_{c=1}^{d} 〈wc, φ(x)〉 vc ,

and then make a decision according to

x ↦ arg max_{c∈{1,...,d}}  〈vc, v(x)〉 + bc .

Therefore, the decision space R^d can be referred to as the label or label
prototype space. This slight generalization of the decision function (3.1) is
considered in [152]. In the following sections, the label prototypes are
restricted to orthonormal prototype vectors vc and decision functions of the
form (3.1).
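With orthonormal prototypes, decision function (3.1) is a plain argmax over class scores; the weights and biases below are chosen by hand for illustration.

```python
def decide(ws, bs, phi_x):
    """Decision function (3.1): x -> argmax_c <w_c, phi(x)> + b_c,
    returning the class index c in {1, ..., d}."""
    scores = [sum(wc_i * f_i for wc_i, f_i in zip(wc, phi_x)) + bc
              for wc, bc in zip(ws, bs)]
    return 1 + scores.index(max(scores))

# Three classes in a 2-D feature space (weights invented for illustration).
ws = [(1.0, 0.0), (0.0, 1.0), (-1.0, -1.0)]
bs = [0.0, 0.0, 0.5]
print([decide(ws, bs, phi) for phi in [(2.0, 0.1), (0.1, 2.0), (-1.0, -1.0)]])
```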
3.1 Sequential Multi-Class SVMs
In this section, the previously proposed methods for solving multi-class problems
with SVMs sequentially will first be summarized. Generally, machine learning and
pattern recognition algorithms are initially developed for binary problems and
then extended to multi-class problems [58, 17, 72]. Although there are some
exceptions [33, 131] to this rule of thumb, SVMs likewise were initially developed
for binary problems [159, 138, 39]. One of the first approaches to extend binary
SVMs to multi-class problems was put forward in [84, 53].
A very general and widely applicable framework for extending binary classifiers to
multi-class problems is based on error correcting output codes (ECOC), which have
roots in information theory and communication theory [127, 116, 117]. Moreover,
the use of ECOC in machine learning can already be found in early machine learning
studies [59].
In the ECOC methodology, each class is assigned a unique codeword, which is a
binary string of length s. In most of the literature a binary value takes, by
convention, either one or zero; for consistency with the SVM literature it will be
assumed here that binary strings are composed of 1's and −1's. The length of the
binary string may or may not be equal to the number of classes, d. ECOC utilizes
s binary classifiers and an error correcting matrix C ∈ R^{d×s}; one binary
classifier is learned for each column of this ECOC matrix. Note that the value of
each ECOC matrix element represents the relationship between the corresponding
classifier and the corresponding class. For example, for a multi-class problem
containing 4 classes and codewords with a string length of 7, an ECOC matrix C is
given in Table 3.1. Within this table, ci represents the ith classifier and each
entry C(m, i) represents the desired output of the ith classifier with respect to
the mth class. To classify a test example x, a codeword string of length s should
be populated
Class | Code Word
      | c0  c1  c2  c3  c4  c5  c6
  0   | -1  -1  -1   1  -1   1  -1
  1   |  1  -1  -1  -1   1  -1   1
  2   | -1  -1   1  -1  -1   1  -1
  3   |  1   1  -1  -1  -1  -1   1

Table 3.1: An error correcting output code for a four-class problem
using the binary classifiers. This string is stored as a vector p; an example of p
is given in Table 3.2. In the final step, the label of the codeword nearest to p

p (Code Word)
c0  c1  c2  c3  c4  c5  c6
-1   1   1  -1  -1   1  -1

Table 3.2: The populated vector p for a test example x.

is assigned as the label of the test example x. Generally the Hamming distance
[83] is used for determining the closest codeword. The final step is illustrated
in Table 3.3: the test example x is classified as class 2 because class 2 has the
minimum Hamming distance to p.
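The decoding step of Tables 3.1 to 3.3 can be reproduced directly; the codewords are those of Table 3.1 and p is the vector used in Table 3.3.

```python
def hamming(u, v):
    return sum(1 for a, b in zip(u, v) if a != b)

# The ECOC matrix of Table 3.1 (one codeword of length s = 7 per class).
codewords = {
    0: (-1, -1, -1,  1, -1,  1, -1),
    1: ( 1, -1, -1, -1,  1, -1,  1),
    2: (-1, -1,  1, -1, -1,  1, -1),
    3: ( 1,  1, -1, -1, -1, -1,  1),
}

def ecoc_decode(p):
    """Assign the class whose codeword has minimum Hamming distance to p."""
    return min(codewords, key=lambda c: hamming(codewords[c], p))

p = (-1, 1, 1, -1, -1, 1, -1)      # outputs of the seven binary classifiers
print(ecoc_decode(p), [hamming(codewords[c], p) for c in range(4)])
```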
The main advantage of the ECOC framework is its robustness against errors if the
code matrix is defined appropriately [53, 4]. Further, some popular sequential
multi-class SVM solutions, such as OVA, fit directly into the ECOC framework.
For a similar four-class problem, the corresponding ECOC matrix of OVA is given
in Table 3.4.
Class | c1  c2  c3  c4
  0   |  1  -1  -1  -1
  1   | -1   1  -1  -1
  2   | -1  -1   1  -1
  3   | -1  -1  -1   1

Table 3.4: The ECOC matrix of OVA
Allwein et al. [4] adapted ECOC so that a single technique is compatible with all
sequential multi-class classification methods. Basically, they proposed codewords
which comprise 1's, 0's or −1's. A value of 0 indicates that the corresponding
classifier ignores the training examples of the corresponding class; a value of 1
indicates that the corresponding classifier treats the corresponding class as the
positive class of the binary classification problem; and a value of −1 indicates
that it treats the corresponding class as the negative class. With this extension
of ECOC, the ECOC matrices of OVO and MC-MMR are stated in Tables 3.5 and 3.6.

Class | c1  c2  c3  c4  c5  c6
  0   |  1   1   1   0   0   0
  1   | -1   0   0   1   1   0
  2   |  0  -1   0  -1   0   1
  3   |  0   0  -1   0  -1  -1

Table 3.5: The ECOC matrix of OVO
ECOC frameworks supply a flexible tool for using binary classifiers to solve
multi-class problems. It is important to note that the ECOC framework does not
assume any particular type of classifier. However, the design of the ECOC matrices
is problematic and is known in the scientific community as 'the Error Correcting
Code Design Problem' [56].

Class | c1  c2  c3  c4
  0   |  1   0   0   0
  1   |  0   1   0   0
  2   |  0   0   1   0
  3   |  0   0   0   1

Table 3.6: The ECOC matrix of MC-MMR

Allwein et al. [4]
developed bounds on the training errors of margin-based classifiers within the
ECOC framework. Instead of stating the exact derivation of these bounds, it
suffices to point out that they are a function of q/ρ, where q represents the
proportion of redundant bits to the total number of bits in the ECOC matrix, i.e.
(s − d)/s, and ρ is the minimum Hamming distance between the distinct codewords
constituting the ECOC matrix; this leads to the error correcting code design
problem. The problem arises because there are two conflicting goals: the first
goal is to minimize the number of redundant components, but this conflicts with
the second goal of maximizing the minimum distance between distinct codewords.
From the optimization perspective, the design of the ECOC matrix is NP-complete
[43], and designing codewords is independent of empirical risk minimization;
therefore there may be incompatibilities between the classifier and the ECOC.
Generally, good codes can be designed using genetic algorithms [155, 2, 1, 3].
Although the Hamming distance is often used, several other distance metrics [4]
are sometimes used for good code design.
This study is limited to a discussion of OVA and MC-MMR within the ECOC
framework. The basic idea of OVO is mentioned and explained above, but OVO
will not be discussed further, for the following two reasons. Firstly, several studies
have shown that one-versus-all and one-versus-one perform similarly with regard to
classification accuracy [87, 57]. Secondly, for OVO it is not clear which decision
function should be used after training [84], whereas in the case of OVA and MC-MMR
it is convenient to use the decision function (3.1).
In the following subsections, first the MC-MMR method and then the OVA
method will be explained in detail.
3.1.1 Multi-Class Classification with Maximum Margin Regression (MC-MMR)
One of the recent extensions of binary SVMs to multiple classes is Multi-Class
Classification with Maximum Margin Regression (MC-MMR) [152]. The basic idea
behind MC-MMR can be stated as follows: In binary SVMs, the normal vector of
the decision function can be interpreted as a projection operator which maps the
feature space to a one-dimensional decision space for classification. This projection
operator can be extended to multiple classes, corresponding to a higher dimensional
decision space. This line of thought is followed by MC-MMR, in which the decision
functions map inputs to vector-valued labels [152].
The primal of MC-MMR is:

    min  (1/2)‖W‖² + C ∑_{n=1}^{ℓ} ξ_n                                   (MC-MMR)
    s.t. ∀n ∈ {1, …, ℓ} :
         ⟨Wφ(x_n) + b, y_n⟩ ≥ 1 − ξ_n
         ξ_n ≥ 0 .                                                       (3.2)

Here W is a matrix whose cth row corresponds to w_c, i.e. the separating hyperplane
w.r.t. the cth class, and b ∈ R^d is the bias/offset vector whose cth entry
corresponds to b_c. The Lagrangian of the primal problem is:
    L = (1/2)‖W‖² + C ∑_{n=1}^{ℓ} ξ_n − ∑_{n=1}^{ℓ} α_n (⟨Wφ(x_n) + b, y_n⟩ − 1 + ξ_n) − ∑_{n=1}^{ℓ} β_n ξ_n    (3.3)
Following the same procedure as for the Hard Margin SVMs, the partial derivatives
with respect to the primal variables W, b and ξ are obtained:

    ∂L/∂W = W − ∑_{n=1}^{ℓ} α_n (Iφ(x_n))^T y_n        (3.4)
    [∂L/∂b]_p = −∑_{n=1}^{ℓ} α_n [y_n]_p               (3.5)
    ∂L/∂ξ_n = C − α_n − β_n                            (3.6)

where [y_n]_p is the pth entry of the label vector y_n. To find the saddle point of the
Lagrangian, one sets ∂L/∂W, ∂L/∂b and ∂L/∂ξ_n to zero and obtains

    W = ∑_{n=1}^{ℓ} α_n (Iφ(x_n))^T y_n                (3.7)
    0 = ∑_{n=1}^{ℓ} α_n [y_n]_p                        (3.8)
    α_n = C − β_n .                                    (3.9)
Substituting eq. (3.7) into eq. (3.3), the Lagrangian with respect to the dual
variables is derived:

    L = ∑_{n=1}^{ℓ} α_n − (1/2) ∑_{n,m=1}^{ℓ} α_n α_m k(x_n, x_m) y_n^T y_m .    (3.10)
For the case of orthogonal prototype labels:

    L = ∑_{n=1}^{ℓ} α_n − (1/2) ∑_{n,m=1}^{ℓ} α_n α_m k(x_n, x_m) δ_{y_n,y_m}    (3.11)
In this thesis I will use only orthogonal prototype labels, because Szedmak et al. [152]
hinted that the classification accuracy of MC-MMR does not depend on the prototype
labels as long as they are independent. By converting eq. (3.9) into an inequality
constraint as for Soft Margin SVMs, the dual problem is written as

    max_α  ∑_{n=1}^{ℓ} α_n − (1/2) ∑_{n,m=1}^{ℓ} α_n α_m k(x_n, x_m) δ_{y_n,y_m}    (3.12)
    s.t.   ∑_{n=1}^{ℓ} α_n [y_n]_p = 0
           ∀n ∈ {1, …, ℓ} : 0 ≤ α_n ≤ C    (3.13)
giving weight vectors

    w_c = ∑_{n=1}^{ℓ} α_n δ_{y_c,y_n} k(x_n, ·) .

The dual problem can be decomposed into d independent sub-problems involving
only the variables indexed by the sets S_c, c ∈ {1, …, d}.
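The decomposition claim can be checked numerically: with orthogonal prototype labels, the quadratic term of the dual (3.12) couples α_n and α_m only through δ_{y_n,y_m}, so the Hessian contains no cross-class entries. A small sketch (illustrative Python; the toy data and the Gaussian kernel are assumptions, not from the thesis experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 2))              # toy inputs
y = np.array([0, 1, 2, 0, 1, 2, 0, 1])   # toy labels, d = 3 classes

def gauss_kernel(a, b, gamma=1.0):
    return np.exp(-gamma * np.sum((a - b) ** 2))

# Hessian of the MC-MMR dual (3.12): Q[n, m] = k(x_n, x_m) * delta(y_n, y_m)
ell = len(y)
Q = np.array([[gauss_kernel(X[n], X[m]) * (y[n] == y[m])
               for m in range(ell)] for n in range(ell)])

# Variables of different classes never interact: the dual splits into d
# independent sub-problems over the index sets S_c = {n : y_n = c}.
coupled = [(n, m) for n in range(ell) for m in range(ell)
           if Q[n, m] != 0 and y[n] != y[m]]
```

An empty `coupled` list confirms that, after grouping the examples by class, the Hessian is block diagonal.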
3.1.2 One versus All (OVA)
The one-versus-all (OVA) method is a straightforward way to extend standard
SVMs for binary classification [158, 139] to d-class problems. Let S_c = {n ∈ {1, …, ℓ} |
y_n = c} denote the index set of training examples of class c. Then for each
c ∈ {1, …, d}, OVA constructs a binary classifier that tries to separate class c from
all other classes by solving the convex quadratic optimization problem
    min  (1/2)⟨w_c, w_c⟩ + C ∑_{n=1}^{ℓ} ξ_{n,c}                         (3.14)
    s.t. ∀n ∈ S_c : ⟨w_c, φ(x_n)⟩ + b_c ≥ 1 − ξ_{n,c}
         ∀n ∉ S_c : ⟨w_c, φ(x_n)⟩ + b_c ≤ −1 + ξ_{n,c}
         ∀n ∈ {1, …, ℓ} : ξ_{n,c} ≥ 0 .                                  (3.15)
For OVA the dual will be given directly, because the derivation is identical to the
binary case (see Section 2.3.3.3). In practice, the equivalent dual problem
    max_α  ∑_{n=1}^{ℓ} α_{n,c} − (1/2) ∑_{n,m=1}^{ℓ} α_{n,c} α_{m,c} v_{m,n} k(x_n, x_m)    (3.16)
    s.t.   ∑_{n=1}^{ℓ} ζ_n α_{n,c} = 0                                                      (3.17)
           ∀n ∈ {1, …, ℓ} : 0 ≤ α_{n,c} ≤ C

with

    v_{m,n} = (−1)^{|{m,n} ∩ S_c|} = +1 if m, n ∈ S_c or m, n ∉ S_c, and v_{m,n} = −1 otherwise,
    ζ_n = +1 if n ∈ S_c and ζ_n = −1 if n ∉ S_c                                             (3.18)

is solved (see Chapter 5). Finally, the weight vectors are obtained as

    w_c = ∑_{n∈S_c} α_{n,c} k(x_n, ·) − ∑_{n∉S_c} α_{n,c} k(x_n, ·) .
Each resulting vector wc is designed to separate class c from the rest by means of
the sign of (〈wc,φ(x)〉+ bc).
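Once the d binary problems have been solved, multi-class prediction only needs the per-class scores and an argmax rule as in decision function (3.1). A minimal sketch of this prediction step (illustrative Python; the toy dual variables, zero offsets and the linear kernel are assumptions, not the output of a trained machine):

```python
import numpy as np

def linear_kernel(a, b):
    return float(np.dot(a, b))

def ova_decision(x, X, y, alpha, b, d, kernel=linear_kernel):
    """OVA prediction: f_c(x) = sum_{n in S_c} alpha[n,c] k(x_n, x)
                               - sum_{n not in S_c} alpha[n,c] k(x_n, x) + b_c,
    classify by argmax_c f_c(x)."""
    scores = np.empty(d)
    for c in range(d):
        s = b[c]
        for n in range(len(y)):
            sign = 1.0 if y[n] == c else -1.0
            s += sign * alpha[n, c] * kernel(X[n], x)
        scores[c] = s
    return int(np.argmax(scores)), scores

# Toy setup: three well-separated classes on a line, one point per class.
X = np.array([[-2.0], [0.0], [2.0]])
y = np.array([0, 1, 2])
alpha = np.full((3, 3), 0.5)   # illustrative dual variables
b = np.zeros(3)

label, scores = ova_decision(np.array([2.1]), X, y, alpha, b, d=3)
```

With these toy numbers, a query near the class-2 point is assigned label 2.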
3.2 All-in-One Multi-class Machines
Instead of constructing the weight vectors wc independently by training multiple
binary classifiers, several methods have been proposed to directly obtain all vectors
from a single optimization problem taking all class relations into account at once. In
this section, three different approaches to extending SVMs to multiple classes by
solving a single optimization problem are discussed. The standard extension of the
SVM optimization problem to multiple classes, as proposed in [158, 163], is considered
because it is the canonical, fundamental all-together approach. Further, the method
proposed by Crammer and Singer [42], which can be regarded as a modification
of the previous all-together approach with the goal of increasing learning speed by
simplifying the constraints of the learning problem, is also considered. Finally,
one of the more recent multi-class extensions of SVMs, proposed by Lee et al. [112],
is discussed. Although this machine has nice theoretical properties, such as consistency
and a classification calibrated loss, until now no solver has been made available for
it. The theoretical aspects of the machine will be explained in this section, and how
to solve the corresponding optimization problems will be discussed in Chapter 5.
3.2.1 The Weston and Watkins Method (WW)
Weston and Watkins (WW) [163] and Vapnik [158], independently from each
other, proposed the first all-in-one extensions of SVMs. Their proposed methods are
identical up to a constant in the absolute value of the target margin. The corresponding
primal problem of WW is as follows:
    min  (1/2) ∑_{c=1}^{d} ⟨w_c, w_c⟩ + C ∑_{c=1}^{d} ∑_{n=1}^{ℓ} ξ_{n,c}    (3.19)
    s.t. ∀n ∈ {1, …, ℓ}, ∀c ∈ {1, …, d} \ {y_n} :
         ⟨w_{y_n} − w_c, φ(x_n)⟩ + b_{y_n} − b_c ≥ 2 − ξ_{n,c}
         ∀n ∈ {1, …, ℓ}, ∀c ∈ {1, …, d} : ξ_{n,c} ≥ 0 .                      (3.20)
If the first set of inequality constraints is replaced by

    ∀n ∈ {1, …, ℓ}, ∀c ∈ {1, …, d} \ {y_n} :
    ⟨w_{y_n} − w_c, φ(x_n)⟩ + b_{y_n} − b_c ≥ 1 − ξ_{n,c} ,

that is, if one sets the target margin to 1 instead of 2, WW equals the formulation
used by Vapnik [158]. In this study, the formulation (3.19) with a target margin of 2
is used. The objective function is the sum of the objective functions of the binary
SVM problems (see eq. (3.14)). The major difference lies in the interpretation and
handling of the slack variables ξ_{n,c}. While their number is identical, their role is
different in the sense that the entry ξ_{n,c} of the ℓ×d matrix of slack variables
corresponds to the hinge loss when separating example x_n from the decision boundary
between classes y_n and c.
The Lagrangian of the primal problem is:

    L = (1/2) ∑_{c=1}^{d} ⟨w_c, w_c⟩ + C ∑_{c=1}^{d} ∑_{n=1}^{ℓ} ξ_{n,c} − ∑_{c=1}^{d} ∑_{n=1}^{ℓ} β_{n,c} ξ_{n,c}
        − ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} (⟨w_{y_n} − w_c, φ(x_n)⟩ + b_{y_n} − b_c − 2 + ξ_{n,c})    (3.21)

Note that the values of some of the variables are known and fixed; these variables
are called dummy variables in the literature. The dummy variables are

    α_{n,y_n} = 0,  ξ_{n,y_n} = 2,  β_{n,y_n} = 0,  ∀n ∈ {1, …, ℓ} ,

and the remaining multipliers and slacks satisfy

    α_{n,c} ≥ 0,  β_{n,c} ≥ 0,  ξ_{n,c} ≥ 0,  ∀n ∈ {1, …, ℓ}, c ∈ {1, …, d} \ {y_n} .
For convenience, the Lagrangian is reorganized:

    L = (1/2) ∑_{c=1}^{d} ⟨w_c, w_c⟩ + C ∑_{c=1}^{d} ∑_{n=1}^{ℓ} ξ_{n,c} − ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} ⟨w_{y_n}, φ(x_n)⟩
        + ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} ⟨w_c, φ(x_n)⟩ + 2 ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} − ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} ξ_{n,c}
        − ∑_{c=1}^{d} ∑_{n=1}^{ℓ} β_{n,c} ξ_{n,c} − ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} b_{y_n} + ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} b_c    (3.22)
Defining for simplicity the variables

    S_n = ∑_{c=1}^{d} α_{n,c} ,    δ_{n,c} = 1 if y_n = c and 0 otherwise,

and following the same procedure as for the Hard Margin SVMs, the partial
derivatives with respect to the primal variables w_c, b_c and ξ_{n,c} are obtained:

    ∂L/∂w_c = w_c − ∑_{n=1}^{ℓ} S_n δ_{n,c} φ(x_n) + ∑_{n=1}^{ℓ} α_{n,c} φ(x_n)    (3.23)
    ∂L/∂b_c = ∑_{n=1}^{ℓ} α_{n,c} − ∑_{n=1}^{ℓ} δ_{n,c} S_n                        (3.24)
    ∂L/∂ξ_{n,c} = C − α_{n,c} − β_{n,c}                                            (3.25)
Setting the partial derivatives to zero and converting eq. (3.25) into an inequality
constraint as for the Soft Margin SVMs, one obtains

    w_c = ∑_{n=1}^{ℓ} S_n δ_{n,c} φ(x_n) − ∑_{n=1}^{ℓ} α_{n,c} φ(x_n)
        = ∑_{n=1}^{ℓ} (S_n δ_{n,c} − α_{n,c}) φ(x_n)                    (3.26)
    ∑_{n=1}^{ℓ} α_{n,c} = ∑_{n=1}^{ℓ} δ_{n,c} S_n                       (3.27)
    0 ≤ α_{n,c} ≤ C .                                                   (3.28)
By inserting eq. (3.26) into eq. (3.22), the Lagrangian is derived. The slack terms
collect the factor (C − α_{n,c} − β_{n,c}) = 0 and therefore vanish, and the bias terms
vanish by eq. (3.27), leaving

    L = (1/2) ∑_{c=1}^{d} ⟨∑_{n=1}^{ℓ} (S_n δ_{n,c} − α_{n,c}) φ(x_n), ∑_{m=1}^{ℓ} (S_m δ_{m,c} − α_{m,c}) φ(x_m)⟩
        − ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} ⟨∑_{m=1}^{ℓ} (S_m δ_{m,y_n} − α_{m,y_n}) φ(x_m), φ(x_n)⟩
        + ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} ⟨∑_{m=1}^{ℓ} (S_m δ_{m,c} − α_{m,c}) φ(x_m), φ(x_n)⟩
        + 2 ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} .

Expanding the inner products in terms of the kernel and cancelling the two terms
∑_{n,m} S_n δ_{n,c} α_{m,c} k(x_n, x_m) and ∑_{n,m} α_{n,c} S_m δ_{m,c} k(x_n, x_m), which are
equal by symmetry of the kernel, yields

    L = (1/2) ∑_{c=1}^{d} ∑_{n,m=1}^{ℓ} S_n S_m δ_{n,c} δ_{m,c} k(x_n, x_m) − (1/2) ∑_{c=1}^{d} ∑_{n,m=1}^{ℓ} α_{n,c} α_{m,c} k(x_n, x_m)
        − ∑_{c=1}^{d} ∑_{n,m=1}^{ℓ} S_m δ_{m,y_n} α_{n,c} k(x_n, x_m) + ∑_{c=1}^{d} ∑_{n,m=1}^{ℓ} α_{n,c} α_{m,y_n} k(x_n, x_m)
        + 2 ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} .    (3.29)
Now the dual of the WW primal problem can be written as

    max_α  (1/2) ∑_{c=1}^{d} ∑_{n,m=1}^{ℓ} (S_n S_m δ_{n,c} δ_{m,c} − α_{n,c} α_{m,c}
               − 2 S_m δ_{m,y_n} α_{n,c} + 2 α_{n,c} α_{m,y_n}) k(x_n, x_m)
           + 2 ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c}    (3.30)
    s.t.   ∑_{n=1}^{ℓ} α_{n,c} = ∑_{n=1}^{ℓ} δ_{n,c} S_n
           ∀n ∈ {1, …, ℓ}, c ∈ {1, …, d} : 0 ≤ α_{n,c} ≤ C .
3.2.1.1 Vectorized Weston and Watkins Formulation
In the following I will derive an alternative formulation for both the primal and
the dual of WW. The primary goal in this section is to obtain a new formulation of
WW which can be solved efficiently. The primal problem is expressed as follows:

    min  (1/2)‖W‖² + C ∑_{c=1}^{d} ∑_{n=1}^{ℓ} ξ_{n,c}    (3.31)
    s.t. ∀n ∈ {1, …, ℓ}, c ∈ {1, …, d} \ {y_n} :
         ⟨Wφ(x_n) + b, (y_n − y_c)/‖y_n − y_c‖⟩ ≥ 2 − ξ_{n,c}
         ∀n ∈ {1, …, ℓ}, c ∈ {1, …, d} : ξ_{n,c} ≥ 0

Here W is a matrix whose cth row corresponds to w_c in the original WW primal
eq. (3.19), y_c ∈ R^d is the prototype label of class c, i.e. a vector of zeros with a
single one in the cth component, and b ∈ R^d is the bias/offset vector whose cth
entry corresponds to b_c. The Lagrangian of the vectorized WW is
    L = (1/2)‖W‖² + C ∑_{c=1}^{d} ∑_{n=1}^{ℓ} ξ_{n,c}
        − ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} (⟨Wφ(x_n), (y_n − y_c)/‖y_n − y_c‖⟩ − 2 + ξ_{n,c})
        − ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} ⟨b, (y_n − y_c)/‖y_n − y_c‖⟩ − ∑_{c=1}^{d} ∑_{n=1}^{ℓ} β_{n,c} ξ_{n,c} .    (3.32)
Following the same procedure as for the Hard Margin SVMs, the partial derivatives
with respect to the primal variables W, b and ξ_{n,c} are obtained:

    ∂L/∂W = W − ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} (Iφ(x_n))^T (y_n − y_c)/‖y_n − y_c‖    (3.33)
    ∂L/∂b_c = ∑_{n=1}^{ℓ} α_{n,c} − ∑_{n=1}^{ℓ} δ_{y_n,y_c} S_n ,  with S_n = ∑_{c=1}^{d} α_{n,c}    (3.34)
    ∂L/∂ξ_{n,c} = C − α_{n,c} − β_{n,c}    (3.35)
where I is the identity matrix. Setting the partial derivatives to zero and converting
eq. (3.35) into an inequality constraint as for the Soft Margin SVMs, one arrives at

    W = ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} (Iφ(x_n))^T (y_n − y_c)/‖y_n − y_c‖    (3.36)

and

    ∑_{n=1}^{ℓ} α_{n,c} = ∑_{n=1}^{ℓ} δ_{y_n,y_c} S_n    (3.37)
    0 ≤ α_{n,c} ≤ C .    (3.38)
Substituting eq. (3.36) into eq. (3.32), the Lagrangian becomes

    L = (1/2) ∑_{c,e=1}^{d} ∑_{n,m=1}^{ℓ} α_{n,c} α_{m,e} ((y_n − y_c)/‖y_n − y_c‖)^T ((y_m − y_e)/‖y_m − y_e‖) φ(x_n)^T I^T I φ(x_m)
        − ∑_{c,e=1}^{d} ∑_{n,m=1}^{ℓ} α_{n,c} α_{m,e} ((y_n − y_c)/‖y_n − y_c‖)^T ((y_m − y_e)/‖y_m − y_e‖) φ(x_n)^T I^T I φ(x_m)
        + ∑_{c=1}^{d} ∑_{n=1}^{ℓ} (C − α_{n,c} − β_{n,c}) ξ_{n,c} + 2 ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} .

The slack terms vanish because C − α_{n,c} − β_{n,c} = 0, and using
φ(x_n)^T I^T I φ(x_m) = k(x_n, x_m) the Lagrangian reduces to

    L = −(1/2) ∑_{c,e=1}^{d} ∑_{n,m=1}^{ℓ} α_{n,c} α_{m,e} ((y_n − y_c)/‖y_n − y_c‖)^T ((y_m − y_e)/‖y_m − y_e‖) k(x_n, x_m)
        + 2 ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} .
After expressing the Lagrangian solely in terms of the dual variables, the dual
optimization problem of WW can be stated:

    max_α  −(1/2) ∑_{c,e=1}^{d} ∑_{n,m=1}^{ℓ} α_{n,c} α_{m,e} T1 · k(x_n, x_m) + 2 ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c}
    with   T1 = ((y_n − y_c)/‖y_n − y_c‖)^T ((y_m − y_e)/‖y_m − y_e‖)
    s.t.   ∑_{n=1}^{ℓ} α_{n,c} = ∑_{n=1}^{ℓ} δ_{y_n,y_c} S_n
           ∀n ∈ {1, …, ℓ}, c ∈ {1, …, d} : 0 ≤ α_{n,c} ≤ C .
If orthonormal labels are used, T1 can be expressed in terms of the Kronecker
delta δ_{a,b}, which is one for a = b and zero otherwise. The final version of the new
dual of WW is as follows:

    max_α  2 ∑_{n=1}^{ℓ} ∑_{c=1}^{d} α_{n,c} − (1/2) ∑_{n,m=1}^{ℓ} ∑_{c,e=1}^{d} α_{n,c} α_{m,e} v^{y_m,e}_{y_n,c} k(x_n, x_m)    (3.40)
    s.t.   ∑_{n=1}^{ℓ} α_{n,c} = ∑_{n=1}^{ℓ} δ_{y_n,y_c} S_n
           ∀n ∈ {1, …, ℓ}, c ∈ {1, …, d} :
           0 ≤ α_{n,c} ≤ 0 if n ∈ S_c,  0 ≤ α_{n,c} ≤ C if n ∉ S_c

    with   v^{y_m,e}_{y_n,c} = δ_{y_n,y_m} − δ_{y_n,e} − δ_{c,y_m} + δ_{c,e} .

The weight vectors are then given by

    w_c = ∑_{n∈S_c} (∑_{e=1}^{d} α_{n,e}) k(x_n, ·) − ∑_{n∉S_c} α_{n,c} k(x_n, ·) .
If the new dual formulation, given in eq. (3.40), is compared with the original one,
given in eq. (3.30), it can be seen that the new formulation is more suitable for
decomposition algorithms. Although decomposition algorithms will be discussed in
detail in Chapter 5, I want to clarify here why the new formulation is important. If
the original WW formulation is analysed, it can be seen that it contains five sum
operators (note that S_m and S_n also contain a sum operator): two sum operators
run from 1 to ℓ and the other three run from 1 to d. However, the dual problem has
only dℓ unknown variables. Furthermore, the dual problem is quadratic, which hints
that only four sum operators are needed: two running from 1 to ℓ and two running
from 1 to d. This makes clear that the original formulation contains a redundant
sum operator, which is one of the reasons why no efficient caching strategy has been
developed for this machine until now. The new dual formulation, given in eq. (3.40),
contains exactly four sums. Further, the new formulation decomposes the kernel
matrix such that one only needs to store a d² matrix and an ℓ² matrix instead of a
single (dℓ)² matrix. This kind of decomposition of the kernel matrix is memory
friendly and allows the WW method to be applied to large-scale problems. It should
be noted that the decomposition of the kernel matrix does not decrease the
computational complexity of the WW problem; it only decreases the required memory.
3.2.2 The Crammer and Singer Method
Crammer and Singer proposed an alternative multi-class SVM [42] (CS). Like WW,
they take all class relations into account at once and solve a single optimization
problem, however with fewer slack variables. The CS classifier is trained by solving
the primal problem

    min  (1/2) ∑_{c=1}^{d} ⟨w_c, w_c⟩ + C ∑_{n=1}^{ℓ} ξ_n    (3.41)
    s.t. ∀n ∈ {1, …, ℓ}, c ∈ {1, …, d} \ {y_n} :
         ⟨w_{y_n} − w_c, φ(x_n)⟩ ≥ 1 − ξ_n
         ∀n ∈ {1, …, ℓ} : ξ_n ≥ 0 .
Although the dual can be derived from eq. (3.41), the problem will first be rewritten
to obtain a more compact formulation:

    min  (1/2)(1/C) ∑_{c=1}^{d} ⟨w_c, w_c⟩ + ∑_{n=1}^{ℓ} ξ_n    (3.42)
    s.t. ∀n ∈ {1, …, ℓ}, c ∈ {1, …, d} :
         ⟨w_{y_n} − w_c, φ(x_n)⟩ + δ_{y_n,c} ≥ 1 − ξ_n

The important point in this formulation is that the inequality constraints of eq.
(3.41), including the slack constraints ξ_n ≥ 0, are combined and rewritten as the
inequality constraints of eq. (3.42): for c = y_n the constraint reduces to ξ_n ≥ 0. The
Lagrangian of the problem eq. (3.42) is

    L = (1/2)(1/C) ∑_{c=1}^{d} ⟨w_c, w_c⟩ + ∑_{n=1}^{ℓ} ξ_n − ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} ⟨w_{y_n}, φ(x_n)⟩
        + ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} ⟨w_c, φ(x_n)⟩ − ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} δ_{y_n,c} + ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c}
        − ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} ξ_n .    (3.43)
Again the partial derivatives with respect to the primal variables w_c and ξ_n are
obtained:

    ∂L/∂ξ_n = 1 − ∑_{c=1}^{d} α_{n,c} ,  ∀n = 1 … ℓ
    ∂L/∂w_c = (1/C) w_c − ∂/∂w_c (∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} ⟨w_{y_n}, φ(x_n)⟩) + ∂/∂w_c (∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} ⟨w_c, φ(x_n)⟩)

Setting the partial derivative with respect to ξ_n to zero, one obtains a constraint
on the dual variables,

    ∑_{c=1}^{d} α_{n,c} = 1 ,  ∀n = 1 … ℓ    (3.44)

and setting the partial derivative with respect to w_c to zero, one obtains

    (1/C) w_c = T1 − T2 ,
    T1 = ∂/∂w_c (∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} ⟨w_{y_n}, φ(x_n)⟩) ,
    T2 = ∂/∂w_c (∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} ⟨w_c, φ(x_n)⟩) .    (3.45)

For clarity, T1 and T2 will be evaluated separately. It should be noted that the
partial derivative ∂⟨w_{y_n}, φ(x_n)⟩/∂w_c vanishes whenever y_n ≠ c and equals φ(x_n)
whenever y_n = c. Using this fact,

    T1 = ∑_{n: y_n = c} (∑_{p=1}^{d} α_{n,p}) φ(x_n) = ∑_{n: y_n = c} φ(x_n) ,

where the inner sum equals one by eq. (3.44).
T2 will be evaluated as follows:

    T2 = ∂/∂w_c (∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} ⟨w_c, φ(x_n)⟩) = ∑_{n=1}^{ℓ} α_{n,c} φ(x_n)

Inserting these expressions for T1 and T2 back into eq. (3.45), w_c becomes

    w_c = C (∑_{n: y_n = c} φ(x_n) − ∑_{n=1}^{ℓ} α_{n,c} φ(x_n))
        = C ∑_{n=1}^{ℓ} (δ_{y_n,c} − α_{n,c}) φ(x_n)    (3.46)
Now the Lagrangian (3.43) will be rewritten step by step in terms of the dual
variables. The last two terms of the Lagrangian satisfy

    ∑_{n=1}^{ℓ} ∑_{c=1}^{d} α_{n,c} = ∑_{n=1}^{ℓ} (∑_{c=1}^{d} α_{n,c}) = ℓ    (3.47)
    ∑_{n=1}^{ℓ} ∑_{c=1}^{d} α_{n,c} ξ_n = ∑_{n=1}^{ℓ} (∑_{c=1}^{d} α_{n,c}) ξ_n = ∑_{n=1}^{ℓ} ξ_n    (3.48)

using that the inner sums equal one by eq. (3.44).
Eq. (3.47) can be ignored because it is constant, and the term (3.48) cancels against
∑_{n=1}^{ℓ} ξ_n. The Lagrangian thus becomes

    L = L1 − L2 + L3 − ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} δ_{y_n,c}    (3.49)
    with  L1 = (1/2)(1/C) ∑_{c=1}^{d} ⟨w_c, w_c⟩ ,
          L2 = ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} ⟨w_{y_n}, φ(x_n)⟩ ,
          L3 = ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} ⟨w_c, φ(x_n)⟩ .
For clarity, L1, L2 and L3 will be evaluated separately, replacing w_c by the
expression in (3.46):

    L1 = (1/2)(1/C) ∑_{c=1}^{d} ⟨C ∑_{n=1}^{ℓ} (δ_{y_n,c} − α_{n,c}) φ(x_n), C ∑_{m=1}^{ℓ} (δ_{y_m,c} − α_{m,c}) φ(x_m)⟩
       = (C/2) ∑_{n,m=1}^{ℓ} k(x_n, x_m) ∑_{c=1}^{d} (δ_{y_n,c} − α_{n,c}) (δ_{y_m,c} − α_{m,c})    (3.50)
    L2 = ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} ⟨C ∑_{m=1}^{ℓ} (δ_{y_m,y_n} − α_{m,y_n}) φ(x_m), φ(x_n)⟩
       = C ∑_{n,m=1}^{ℓ} k(x_n, x_m) (δ_{y_m,y_n} − α_{m,y_n}) (∑_{c=1}^{d} α_{n,c})
       = C ∑_{n,m=1}^{ℓ} k(x_n, x_m) ∑_{c=1}^{d} δ_{y_n,c} (δ_{y_m,c} − α_{m,c})    (3.51)

where the factor ∑_{c=1}^{d} α_{n,c} equals one by eq. (3.44), and

    L3 = ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} ⟨C ∑_{m=1}^{ℓ} (δ_{y_m,c} − α_{m,c}) φ(x_m), φ(x_n)⟩
       = C ∑_{n,m=1}^{ℓ} k(x_n, x_m) ∑_{c=1}^{d} α_{n,c} (δ_{y_m,c} − α_{m,c}) .    (3.52)
Hence L3 − L2 takes the form

    L3 − L2 = C ∑_{n,m=1}^{ℓ} k(x_n, x_m) ∑_{c=1}^{d} α_{n,c} (δ_{y_m,c} − α_{m,c})
            − C ∑_{n,m=1}^{ℓ} k(x_n, x_m) ∑_{c=1}^{d} δ_{y_n,c} (δ_{y_m,c} − α_{m,c})
            = −C ∑_{n,m=1}^{ℓ} k(x_n, x_m) ∑_{c=1}^{d} (δ_{y_n,c} − α_{n,c}) (δ_{y_m,c} − α_{m,c}) .    (3.53)
Inserting eq. (3.50) and eq. (3.53) into eq. (3.49) yields the Lagrangian:

    L = (C/2) ∑_{n,m=1}^{ℓ} k(x_n, x_m) ∑_{c=1}^{d} (δ_{y_n,c} − α_{n,c}) (δ_{y_m,c} − α_{m,c})
      − C ∑_{n,m=1}^{ℓ} k(x_n, x_m) ∑_{c=1}^{d} (δ_{y_n,c} − α_{n,c}) (δ_{y_m,c} − α_{m,c})
      − ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} δ_{y_n,c}
      = −(C/2) ∑_{n,m=1}^{ℓ} k(x_n, x_m) ∑_{c=1}^{d} (δ_{y_n,c} − α_{n,c}) (δ_{y_m,c} − α_{m,c}) − ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} δ_{y_n,c}    (3.54)
The dual problem of CS is:

    max  −(C/2) ∑_{n,m=1}^{ℓ} k(x_n, x_m) ∑_{c=1}^{d} (δ_{y_n,c} − α_{n,c}) (δ_{y_m,c} − α_{m,c}) − ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} δ_{y_n,c}    (3.55)
    s.t. ∀n ∈ {1, …, ℓ}, c ∈ {1, …, d} : α_{n,c} ≥ 0
         ∀n ∈ {1, …, ℓ} : α_n^T 1 = 1 ,

where 1 ∈ R^d is a vector of ones and α_n = (α_{n,1}, α_{n,2}, …, α_{n,d})^T is the Lagrange
multiplier vector of the nth example. This dual formulation could already be used for
solving the problem. However, in order to have the dual problems of the different
machines as similar as possible and also to have a more compact formulation, the
final version of the CS dual problem is expressed as

    max  −(1/2) ∑_{n,m=1}^{ℓ} k(x_n, x_m) τ_n^T τ_m + β ∑_{n=1}^{ℓ} τ_n^T 1_{y_n}    (3.56)
    s.t. ∀n ∈ {1, …, ℓ} : τ_n ≤ 1_{y_n} ,  τ_n^T 1 = 0 ,

where β = 2C, 1_{y_n} ∈ R^d is a vector of zeros with a single one in the y_nth
component, and τ_n ∈ R^d is an auxiliary vector defined as 1_{y_n} − α_n. The relation
a ≤ b is understood to hold for a, b ∈ R^d if a_i ≤ b_i for all i = 1, …, d. The weight
vectors are given by

    w_c = ∑_{n=1}^{ℓ} τ_{n,c} k(x_n, ·) .
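The substitution τ_n = 1_{y_n} − α_n and the constraints of (3.56) can be sanity-checked numerically: every α_n satisfying the simplex constraints α_{n,c} ≥ 0 and ∑_c α_{n,c} = 1 of eq. (3.44) yields a feasible τ_n. A small sketch (illustrative Python with random simplex points as an assumption):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4

def tau_from_alpha(alpha_n, y_n, d):
    """tau_n = 1_{y_n} - alpha_n, the auxiliary vector of the compact CS dual."""
    one_y = np.zeros(d)
    one_y[y_n] = 1.0
    return one_y - alpha_n

# Draw random alpha_n on the simplex (nonnegative, sum-to-one, as required by
# (3.44)) and check the two constraints of (3.56) for the resulting tau_n.
feasible = True
for _ in range(100):
    a = rng.random(d)
    a /= a.sum()
    y_n = int(rng.integers(d))
    one_y = np.zeros(d)
    one_y[y_n] = 1.0
    t = tau_from_alpha(a, y_n, d)
    feasible = feasible and bool(np.all(t <= one_y + 1e-12)) \
                        and abs(t.sum()) < 1e-12
```

Both constraints hold by construction: τ_{n,c} = −α_{n,c} ≤ 0 for c ≠ y_n, τ_{n,y_n} = 1 − α_{n,y_n} ≤ 1, and the components sum to 1 − 1 = 0.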
If the CS dual formulation given in (3.56) is compared with the WW dual
formulation given in (3.40), it is seen that the two dual formulations are similar to
each other. However, this similarity alone does not allow us to develop a solver for
CS that uses identical or very similar solver technology to the WW solver. As one of
the main contributions of this thesis is developing similar solvers for all machines,
I will reformulate the CS machine in the next section.
3.2.2.1 Vectorized Crammer and Singer Formulation
In the following I will derive an alternative formulation for both the primal and
the dual of CS. The primary goal in this section is to obtain a new formulation of CS
which can be solved efficiently and which is also easy to implement. The primal
problem is expressed as follows:

    min  (1/2)‖W‖² + C ∑_{n=1}^{ℓ} ξ_n    (3.57)
    s.t. ∀n ∈ {1, …, ℓ}, c ∈ {1, …, d} \ {y_n} :
         ⟨Wφ(x_n), (y_n − y_c)/‖y_n − y_c‖⟩ ≥ 1 − ξ_n
         ∀n ∈ {1, …, ℓ} : ξ_n ≥ 0

Here W is a matrix whose cth row corresponds to w_c in the original CS primal
eq. (3.42), and y_c ∈ R^d is the prototype label of class c, i.e. a vector of zeros with
a single one in the cth component. The Lagrangian of the vectorized CS is
    L = (1/2)‖W‖² + C ∑_{n=1}^{ℓ} ξ_n
        − ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} (⟨Wφ(x_n), (y_n − y_c)/‖y_n − y_c‖⟩ − 1 + ξ_n) − ∑_{n=1}^{ℓ} β_n ξ_n .    (3.58)
Following the same procedure as for the Hard Margin SVMs, the partial derivatives
with respect to the primal variables W and ξ_n are obtained:

    ∂L/∂W = W − ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} (Iφ(x_n))^T (y_n − y_c)/‖y_n − y_c‖    (3.59)
    ∂L/∂ξ_n = C − ∑_{c=1}^{d} α_{n,c} − β_n    (3.60)

where I is the identity matrix. Setting the partial derivatives to zero and converting
eq. (3.60) into an inequality constraint as for the Soft Margin SVMs, one arrives at:

    W = ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} (Iφ(x_n))^T (y_n − y_c)/‖y_n − y_c‖    (3.61)
    ∑_{c=1}^{d} α_{n,c} ≤ C    (3.62)
Substituting eq. (3.61) into eq. (3.58), the Lagrangian becomes

    L = (1/2) ∑_{c,e=1}^{d} ∑_{n,m=1}^{ℓ} α_{n,c} α_{m,e} ((y_n − y_c)/‖y_n − y_c‖)^T ((y_m − y_e)/‖y_m − y_e‖) φ(x_n)^T I^T I φ(x_m)
        − ∑_{c,e=1}^{d} ∑_{n,m=1}^{ℓ} α_{n,c} α_{m,e} ((y_n − y_c)/‖y_n − y_c‖)^T ((y_m − y_e)/‖y_m − y_e‖) φ(x_n)^T I^T I φ(x_m)
        + ∑_{n=1}^{ℓ} (C − ∑_{c=1}^{d} α_{n,c} − β_n) ξ_n + ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} .

The slack terms vanish because C − ∑_{c=1}^{d} α_{n,c} − β_n = 0, and with
φ(x_n)^T I^T I φ(x_m) = k(x_n, x_m) the Lagrangian can be stated as:

    L = −(1/2) ∑_{c,e=1}^{d} ∑_{n,m=1}^{ℓ} α_{n,c} α_{m,e} ((y_n − y_c)/‖y_n − y_c‖)^T ((y_m − y_e)/‖y_m − y_e‖) k(x_n, x_m)
        + ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} .
After expressing the Lagrangian solely in terms of the dual variables, the dual
optimization problem of CS can be stated as:

    max_α  −(1/2) ∑_{c,e=1}^{d} ∑_{n,m=1}^{ℓ} α_{n,c} α_{m,e} T1 · k(x_n, x_m) + ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c}    (3.63)
    with   T1 = ((y_n − y_c)/‖y_n − y_c‖)^T ((y_m − y_e)/‖y_m − y_e‖)
    s.t.   ∀n ∈ {1, …, ℓ} : ∑_{c=1}^{d} α_{n,c} ≤ C
           ∀n ∈ {1, …, ℓ}, c ∈ {1, …, d} : α_{n,c} ≥ 0    (3.64)
If orthonormal labels are used, T1 can be expressed in terms of the Kronecker
delta δ_{a,b}, which is one for a = b and zero otherwise. The final version of the new
dual of CS is as follows:

    max_α  ∑_{n=1}^{ℓ} ∑_{c=1}^{d} α_{n,c} − (1/2) ∑_{n,m=1}^{ℓ} ∑_{c,e=1}^{d} α_{n,c} α_{m,e} v^{y_m,e}_{y_n,c} k(x_n, x_m)    (3.65)
    s.t.   ∀n ∈ {1, …, ℓ}, c ∈ {1, …, d} : α_{n,c} ≥ 0
           ∀n ∈ {1, …, ℓ} : ∑_{c=1}^{d} α_{n,c} ≤ C ,    (3.66)

    with   v^{y_m,e}_{y_n,c} = δ_{y_n,y_m} − δ_{y_n,e} − δ_{c,y_m} + δ_{c,e} .
The new formulation given in (3.65) is very similar to the WW formulation given in
(3.40); indeed, the objective function of the new CS formulation is identical to that
of the new WW formulation. There are only two differences: the first is the equality
constraint of WW, which results from the bias/offset term, and the second is the
sum constraint of CS given in eq. (3.66). These issues do not create important
differences from the perspective of solver development. In other words, one can
develop very similar solvers for CS and WW. Thus, CS will also enjoy the new
caching technique developed for WW (see Section 3.2.1.1).
3.2.3 Lee, Lin, & Wahba SVM
Lee et al. (LLW, [112]) have proposed an alternative approach to multi-class SVM
classification which is structurally distinct from the WW machine and its
simplification, the CS machine. Before explaining the LLW machine, the notion of a
classification calibrated/Fisher consistent loss function should be defined. A loss
function L(f(x), y) is classification calibrated/Fisher consistent if and only if
argmax_{j=1,…,d} f*_j(x) = argmax_{j=1,…,d} P(Y = j | x), where f*(x) = (f*_1(x), …, f*_d(x))
is the minimizer of E[L(f(X), Y) | X = x]. The analysis of Tewari and Bartlett [154]
shows that this machine relies on a so-called classification calibrated loss function,
which guarantees Fisher consistency. Its primal problem can be stated as
problem can be stated as
minwc
1
2
d∑
c=1
〈wc, wc〉+ C
ℓ∑
n=1
d∑
c=1
ξn,c
s.t. ∀n ∈ 1, . . . , ℓ, c ∈ 1, . . . , d \ yn :
〈wc,φ(xn) + bc〉 ≤ −1
d− 1+ ξn,c
∀n ∈ 1, . . . , ℓ, c ∈ 1, . . . , d : ξn,c ≥ 0
∀h ∈ H :
d∑
c=1
(〈wc, h〉+ bc) = 0 . (3.67)
If the feature map is injective, then the sum-to-zero constraint (3.67) can be
expressed as ∑_{c=1}^{d} w_c = 0 and ∑_{c=1}^{d} b_c = 0. I will derive the dual problem
for the case of an injective feature map; for the case of a non-injective feature map,
details can be found in [112]. The Lagrangian of the LLW primal problem is

    L = (1/2) ∑_{c=1}^{d} ⟨w_c, w_c⟩ + C ∑_{c=1}^{d} ∑_{n=1}^{ℓ} ξ_{n,c} + ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} ⟨w_c, φ(x_n)⟩
        + (1/(d−1)) ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} − ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} ξ_{n,c} + ρ ∑_{c=1}^{d} w_c
        + ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} b_c + γ ∑_{c=1}^{d} b_c − ∑_{n=1}^{ℓ} ∑_{c=1}^{d} β_{n,c} ξ_{n,c} .    (3.68)
As for the previous machines, the partial derivatives with respect to the primal
variables w_c, b_c and ξ_{n,c} are obtained:

    ∂L/∂w_c = w_c + ∑_{n=1}^{ℓ} α_{n,c} φ(x_n) + ρ
    ∂L/∂b_c = ∑_{n=1}^{ℓ} α_{n,c} + γ
    ∂L/∂ξ_{n,c} = C − α_{n,c} − β_{n,c}

If one takes the partial derivative of the Lagrangian with respect to ρ, one obtains
exactly the sum-to-zero constraint; this constraint will be used in the following
steps in order to find a relation between α_{n,c} and ρ. Setting the partial derivatives
with respect to b_c and ξ_{n,c} to zero, one obtains

    −γ = ∑_{n=1}^{ℓ} α_{n,c}    (3.69)
    0 = C − α_{n,c} − β_{n,c} .    (3.70)

The constraint (3.69) ensures that all d sums ∑_{n=1}^{ℓ} α_{n,c} take the same value
−γ ∈ R; the value of −γ itself does not matter. Like in Section 2.3.3.2, (3.70) can be
expressed as an inequality constraint on α_{n,c}. Setting the partial derivative with
respect to w_c to zero, one obtains

    w_c = −(∑_{n=1}^{ℓ} α_{n,c} φ(x_n) + ρ) .    (3.71)
By substituting (3.71) into eq. (3.68), one can eliminate the dependence of the
Lagrangian on the primal variables. Writing a_c = ∑_{n=1}^{ℓ} α_{n,c} φ(x_n), so that
w_c = −(a_c + ρ), the terms involving b_c vanish by eq. (3.69) and the slack terms
vanish by eq. (3.70), leaving

    L = (1/2) ∑_{c=1}^{d} ⟨a_c + ρ, a_c + ρ⟩ − ∑_{c=1}^{d} ⟨a_c + ρ, a_c⟩ − ∑_{c=1}^{d} ⟨ρ, a_c + ρ⟩
        + (1/(d−1)) ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c}

      = −(1/2) ∑_{c=1}^{d} ∑_{n,m=1}^{ℓ} α_{n,c} α_{m,c} k(x_n, x_m) − (d/2) ⟨ρ, ρ⟩ − ∑_{c=1}^{d} ⟨ρ, a_c⟩
        + (1/(d−1)) ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} .
Utilizing the sum-to-zero constraint ∑_{c=1}^{d} w_c = 0, a relation between ρ and
α_{n,c} is derived:

    0 = ∑_{s=1}^{d} w_s = −∑_{s=1}^{d} (∑_{n=1}^{ℓ} α_{n,s} φ(x_n) + ρ)
    ρ = −(1/d) ∑_{s=1}^{d} ∑_{n=1}^{ℓ} α_{n,s} φ(x_n) .    (3.72)
With relation eq. (3.72), the Lagrangian of the LLW machine is

    L = −(1/2) ∑_{c=1}^{d} ∑_{n,m=1}^{ℓ} α_{n,c} α_{m,c} k(x_n, x_m)
        − (1/(2d²)) ∑_{c=1}^{d} ∑_{s,u=1}^{d} ∑_{n,m=1}^{ℓ} α_{n,s} α_{m,u} k(x_n, x_m)
        + (1/d) ∑_{s,c=1}^{d} ∑_{n,m=1}^{ℓ} α_{n,s} α_{m,c} k(x_n, x_m)
        + (1/(d−1)) ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c}

      = −(1/2) ∑_{c=1}^{d} ∑_{n,m=1}^{ℓ} α_{n,c} α_{m,c} k(x_n, x_m)
        + (1/(2d)) ∑_{s,c=1}^{d} ∑_{n,m=1}^{ℓ} α_{n,s} α_{m,c} k(x_n, x_m)
        + (1/(d−1)) ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} .    (3.73)
Multiplying by 2d, which does not change the maximizer, eq. (3.73) can be written
in a more compact way:

    2d·L = −d ∑_{c=1}^{d} ∑_{n,m=1}^{ℓ} α_{n,c} α_{m,c} k(x_n, x_m) + ∑_{s,c=1}^{d} ∑_{n,m=1}^{ℓ} α_{n,s} α_{m,c} k(x_n, x_m)
           + (2d/(d−1)) ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c}

         = ∑_{s=1}^{d} (∑_{c=1}^{d} ∑_{n,m=1}^{ℓ} α_{n,s} α_{m,c} k(x_n, x_m) − ∑_{c=1}^{d} ∑_{n,m=1}^{ℓ} α_{n,c} α_{m,c} k(x_n, x_m))
           + (2d/(d−1)) ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c}

         = ∑_{s=1}^{d} ∑_{c=1}^{d} ∑_{n,m=1}^{ℓ} (α_{n,s} − α_{n,c}) α_{m,c} k(x_n, x_m) + (2d/(d−1)) ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} .
The corresponding dual problem of LLW is:

    max_α  (1/(d−1)) ∑_{c=1}^{d} ∑_{n=1}^{ℓ} α_{n,c} − (1/2) ∑_{c,e=1}^{d} ∑_{n,m=1}^{ℓ} (δ_{c,e} − 1/d) α_{n,c} α_{m,e} k(x_n, x_m)    (3.74)
    s.t.   ∀c ∈ {1, …, d} : ∑_{n=1}^{ℓ} α_{n,c} = −γ
           ∀n ∈ {1, …, ℓ}, ∀c ∈ {1, …, d} \ {y_n} : 0 ≤ α_{n,c} ≤ C    (3.75)
           ∀n ∈ {1, …, ℓ} : α_{n,y_n} = 0
The LLW dual problem is also similar to those of WW and CS; the main difference
is the interpretation of the dual variables. These similarities make it possible to
develop solvers for all methods that use similar algorithms, which in the end makes
it possible to draw conclusions not only about the classification accuracies but also
about the training times.
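The class coupling of the LLW dual (3.74) is governed by the d×d matrix with entries δ_{c,e} − 1/d, i.e. the centering matrix I − (1/d)11^T. This matrix is an orthogonal projection and hence positive semidefinite, so the quadratic term of (3.74) is concave in the maximization; this can be checked numerically (illustrative Python):

```python
import numpy as np

d = 5
# Class-coupling matrix of the LLW dual: M[c, e] = delta(c, e) - 1/d.
M = np.eye(d) - np.ones((d, d)) / d

# M is the centering projection: idempotent and positive semidefinite,
# with eigenvalue 0 (once, on the all-ones direction) and 1 (d - 1 times).
idempotent = np.allclose(M @ M, M)
eigenvalues = np.linalg.eigvalsh(M)
```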
Chapter 4
Unified View to All-in-One
Multi-class Machines
In this section, I will develop a unified view on all-in-one MCSVM machines. The
analysis will focus on three key design choices for MCSVMs, which so far have
only been described for each of the machines in isolation. The first concerns the
hypothesis class considered, namely the presence or absence of a bias or offset
term. The second is related to the loss function used for machine training. The
third is the margin concept used in the machine. The existing loss functions vary in
whether their margin definitions are absolute or relative and in how the penalty term
depends on different kinds of margin violations. I derive a unifying template for
the primal as well as the dual optimization problems arising when training the
different machines. From this view it will become apparent that one machine is
missing to complete the picture of all-in-one support vector machines (SVMs). I
derive this novel multi-class SVM variant, which results from bringing together
concepts of the CS and LLW machines.
All machines reviewed in the previous section can be cast into the common
primal form

    min_f  (1/2)‖f‖² + C · ∑_{n=1}^{ℓ} λ(ν(f_{y_n}(x_n), f_1(x_n)), …, ν(f_{y_n}(x_n), f_d(x_n))) .
They differ in three components. The first of these is the set of variables over
which the primal objective is minimized, which can be either f = w ∈ H or
f = (w, b) ∈ H × R. The second variation is the margin function ν : R × R → R,
which can encode a relative margin concept by ν(u, v) = u − v or an absolute
margin by ν(u, v) = −1 − v. Third, the methods differ in how the margin values are
composed into a loss by the function λ, which amounts to taking either the sum or
the maximum of its arguments.
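The two margin functions ν and the two loss compositions λ can be written down directly, and combining them reproduces the per-example losses of the four machine families. A sketch (illustrative Python; the unit target margin and hinge-style clipping are simplifying assumptions, e.g. LLW actually uses the threshold −1/(d−1)):

```python
# Margin functions of the unified primal form.
def nu_relative(u, v):      # relative margin: nu(u, v) = u - v
    return u - v

def nu_absolute(u, v):      # absolute margin: nu(u, v) = -1 - v
    return -1 - v

def loss(f, y, nu, compose, target=1.0):
    """f: per-class scores f_c(x); y: index of the true class.
    The margin violation for class c != y is max(0, target - nu(f[y], f[c]));
    compose is either sum (WW/LLW style) or max (CS/DGI style)."""
    violations = [max(0.0, target - nu(f[y], f[c]))
                  for c in range(len(f)) if c != y]
    return compose(violations)

f = [2.0, 0.5, -1.0]   # illustrative scores, true class 0
sum_rel = loss(f, 0, nu_relative, sum)   # sum-loss, relative margin
max_rel = loss(f, 0, nu_relative, max)   # max-loss, relative margin
sum_abs = loss(f, 0, nu_absolute, sum)   # sum-loss, absolute margin
max_abs = loss(f, 0, nu_absolute, max)   # max-loss, absolute margin
```

With these scores the relative-margin losses vanish (the true class wins by more than the target), while the absolute-margin losses are positive because f_2(x) = 0.5 exceeds the required negative threshold.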
These differences correspond to properties of the dual problems. The elements
common to all dual problems are

    max_α  ∑_{n=1}^{ℓ} ∑_{c=1}^{d} α_{n,c} − (1/2) ∑_{n,m=1}^{ℓ} ∑_{c,e=1}^{d} α_{n,c} α_{m,e} · M(y_n, c, y_m, e) · k(x_n, x_m)
    s.t.   0 ≤ α_{n,c} ≤ C   ∀n ∈ {1, …, ℓ}, c ∈ {1, …, d} \ {y_n}
           α_{n,y_n} = 0   ∀n ∈ {1, …, ℓ} ,

possibly augmented by the additional bias constraint

    ∑_{m=1}^{ℓ} ∑_{e=1}^{d} N(c, y_m, e) α_{m,e} = 0   ∀c ∈ {1, …, d}    (4.1)

or the max-loss constraint

    ∑_{c=1}^{d} α_{n,c} ≤ C   ∀n ∈ {1, …, ℓ} .    (4.2)

The resulting weight vectors take the form

    w_c = ∑_{m=1}^{ℓ} (∑_{e=1}^{d} N(c, y_m, e) α_{m,e}) φ(x_m) .
The WW, CS, and LLW machines differ only in the form of the coefficients M
and N, and in the presence or absence of the constraints (4.1) and (4.2). Table 4.1
summarizes these insights and connects properties of the primal and dual problems.
It distinguishes the presence of a bias term (right), the type of loss function used
(left), and the margin concept applied (top):

                                      relative margin               absolute margin
                                      ν(u,v) = u − v                ν(u,v) = −1 − v
                                      M = δ_{y_n,y_m} + δ_{c,e}     M = δ_{c,e} − 1/d
                                          − δ_{y_n,e} − δ_{y_m,c}
                                      N = δ_{c,e} − δ_{y_m,c}       N = 1/d − δ_{c,e}

    sum-loss     with bias,
                 constraint (4.1)     WW                            LLW
                 without bias         WW without bias               LLW without bias
    max-loss,    with bias,
    constraint   constraint (4.1)     –                             –
    (4.2)        without bias         CS                            ?

Table 4.1: Unified view on primal and dual problems of multi-class support vector machine classifiers.
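The table entries can be cross-checked against the duals derived earlier: the relative-margin coefficient M of Table 4.1 is exactly the coefficient v^{y_m,e}_{y_n,c} of the WW dual (3.40). A short check over all index combinations (illustrative Python):

```python
def M_relative(yn, c, ym, e):
    """Relative-margin coefficient M of Table 4.1 (left column)."""
    return (yn == ym) + (c == e) - (yn == e) - (ym == c)

def v_ww(yn, c, ym, e):
    """Coefficient v of the WW dual (3.40)."""
    return (yn == ym) - (yn == e) - (c == ym) + (c == e)

# Both expressions agree term by term for every index combination.
d = 4
match = all(M_relative(yn, c, ym, e) == v_ww(yn, c, ym, e)
            for yn in range(d) for c in range(d)
            for ym in range(d) for e in range(d))
```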
In practice the max-loss is not used in combination with a bias term. This is for
obvious reasons. Although possible in principle, the solution of the corresponding
dual problem is difficult because of the interfering constraints (4.1) and (4.2). Thus,
in practice the maximum-loss can only be applied to machines without bias.
The unified view also reveals that one machine is missing, namely the combination of max-loss (without bias) and absolute margin. From Table 4.1 it becomes obvious how the primal and dual of such a new machine should look. However, while the correctness of the table entries belonging to the already known machines can easily be verified, it has to be proven for the new machine, as done in the next section.
4.1 Novel Approach to Multi-Class SVM Classification
In this section, I derive a novel support vector machine for multi-class classification that combines the max-loss with the absolute margin concept; this machine is referred to as DGI. The corresponding primal problem is
\[
\begin{aligned}
\min_{w,b,\xi} \quad & \frac{1}{2}\sum_{c=1}^{d} \langle w_c, w_c\rangle + C \sum_{n=1}^{\ell} \xi_n \\
\text{s.t.} \quad & \langle w_c, \varphi(x_n)\rangle + b_c \le -\frac{1}{d-1} + \xi_n \quad \forall n \in \{1,\dots,\ell\},\; c \in \{1,\dots,d\}\setminus\{y_n\} \\
& \xi_n \ge 0 \quad \forall n \in \{1,\dots,\ell\} \\
& \sum_{c=1}^{d} \bigl(\langle w_c, h\rangle + b_c\bigr) = 0 \quad \forall h \in \mathcal{H}\,. \tag{4.3}
\end{aligned}
\]
As in the LLW machine, if the feature map is injective then the sum-to-zero constraint (4.3) can be expressed as \(\sum_{c=1}^{d} w_c = 0\) and \(\sum_{c=1}^{d} b_c = 0\). The Lagrangian of the primal problem of DGI is
\[
\begin{aligned}
L = {} & \frac{1}{2}\sum_{c=1}^{d}\langle w_c, w_c\rangle + C\sum_{n=1}^{\ell}\xi_n + \sum_{n=1}^{\ell}\sum_{c=1}^{d} \alpha_{n,c}\left(\langle w_c,\varphi(x_n)\rangle + b_c + \frac{1}{d-1} - \xi_n\right) \\
& - \sum_{n=1}^{\ell}\beta_n\xi_n + \Bigl\langle \rho,\, \sum_{c=1}^{d} w_c \Bigr\rangle + \gamma\sum_{c=1}^{d} b_c \tag{4.4}
\end{aligned}
\]
with \(\alpha_{n,c} \ge 0\), \(\beta_n \ge 0\), and \(\rho \in \mathcal{H}\), \(\gamma \in \mathbb{R}\) unconstrained. Following the same procedure as in the previous chapters, the partial derivatives with respect to the primal variables \(w_c\), \(b_c\), and \(\xi_n\) are obtained:
\[
\frac{\partial L}{\partial w_c} = w_c + \sum_{n=1}^{\ell} \alpha_{n,c}\varphi(x_n) + \rho \tag{4.5}
\]
\[
\frac{\partial L}{\partial b_c} = \sum_{n=1}^{\ell} \alpha_{n,c} + \gamma \tag{4.6}
\]
\[
\frac{\partial L}{\partial \xi_n} = C - \sum_{c=1}^{d}\alpha_{n,c} - \beta_n \tag{4.7}
\]
Setting the partial derivative with respect to \(\xi_n\) to zero and using \(\alpha_{n,c} \ge 0\) and \(\beta_n \ge 0\), one obtains a constraint on the dual variables:
\[
0 \le \sum_{c=1}^{d}\alpha_{n,c} \le C \tag{4.9}
\]
As in Section 2.3.3.2, (4.9) can be expressed as an inequality constraint on the dual variables; it is exactly the max-loss constraint (4.2).
Setting the partial derivatives with respect to \(w_c\) and \(b_c\) to zero one obtains
\[
-\gamma = \sum_{n=1}^{\ell} \alpha_{n,c} \tag{4.10}
\]
\[
w_c = -\left(\sum_{n=1}^{\ell} \alpha_{n,c}\varphi(x_n) + \rho\right). \tag{4.11}
\]
The constraint (4.10) ensures that all \(d\) sums \(\sum_{n=1}^{\ell}\alpha_{n,c}\) take the same value \(-\gamma \in \mathbb{R}\); the value of \(-\gamma\) itself does not matter. By substituting eq. (4.11) into eq. (4.4), the dependence of the Lagrangian on the primal variables can be eliminated:
\[
\begin{aligned}
L = {} & \frac{1}{2}\sum_{c=1}^{d} \Bigl\langle \sum_{n=1}^{\ell}\alpha_{n,c}\varphi(x_n) + \rho,\; \sum_{m=1}^{\ell}\alpha_{m,c}\varphi(x_m) + \rho \Bigr\rangle \\
& + \sum_{c=1}^{d}\sum_{n=1}^{\ell} \alpha_{n,c}\Bigl(\langle w_c, \varphi(x_n)\rangle + \frac{1}{d-1}\Bigr) + \Bigl\langle \rho,\, \sum_{c=1}^{d} w_c \Bigr\rangle\,,
\end{aligned}
\]
where the terms involving \(\xi_n\) and \(b_c\) vanish because of (4.7) and (4.6). Expanding the inner products with \(w_c\) replaced by (4.11) gives
\[
\begin{aligned}
L = {} & \frac{1}{2}\sum_{c=1}^{d}\sum_{n,m=1}^{\ell}\alpha_{n,c}\alpha_{m,c}k(x_n,x_m) + \Bigl\langle\rho,\, \sum_{c=1}^{d}\sum_{n=1}^{\ell}\alpha_{n,c}\varphi(x_n)\Bigr\rangle + \frac{d}{2}\langle\rho,\rho\rangle \\
& - \sum_{c=1}^{d}\sum_{n,m=1}^{\ell}\alpha_{n,c}\alpha_{m,c}k(x_n,x_m) - \Bigl\langle\rho,\, \sum_{c=1}^{d}\sum_{n=1}^{\ell}\alpha_{n,c}\varphi(x_n)\Bigr\rangle + \frac{1}{d-1}\sum_{c=1}^{d}\sum_{n=1}^{\ell}\alpha_{n,c} \\
& - \Bigl\langle\rho,\, \sum_{c=1}^{d}\sum_{n=1}^{\ell}\alpha_{n,c}\varphi(x_n)\Bigr\rangle - d\langle\rho,\rho\rangle \\
= {} & -\frac{1}{2}\sum_{c=1}^{d}\sum_{n,m=1}^{\ell}\alpha_{n,c}\alpha_{m,c}k(x_n,x_m) - \frac{d}{2}\langle\rho,\rho\rangle - \Bigl\langle\rho,\, \sum_{c=1}^{d}\sum_{n=1}^{\ell}\alpha_{n,c}\varphi(x_n)\Bigr\rangle \\
& + \frac{1}{d-1}\sum_{c=1}^{d}\sum_{n=1}^{\ell}\alpha_{n,c}\,.
\end{aligned}
\]
To derive the dual one needs a relation between \(\rho\) and the \(\alpha_{n,c}\). For this, the sum-to-zero constraint of the problem is used:
\[
0 = \sum_{c=1}^{d} w_c = -d\rho - \sum_{c=1}^{d}\sum_{n=1}^{\ell}\alpha_{n,c}\varphi(x_n)
= -d\rho - \sum_{n=1}^{\ell}(C-\beta_n)\varphi(x_n)\,,
\]
where the last equality uses (4.7). Hence \(\rho\) can be expressed as
\[
\rho = -\frac{1}{d}\sum_{n=1}^{\ell}(C-\beta_n)\varphi(x_n) = -\frac{1}{d}\sum_{c=1}^{d}\sum_{n=1}^{\ell}\alpha_{n,c}\varphi(x_n)\,.
\]
Substituting \(\rho\), the Lagrangian in terms of the dual variables becomes
\[
\begin{aligned}
L = {} & -\frac{1}{2}\sum_{c=1}^{d}\sum_{n,m=1}^{\ell}\alpha_{n,c}\alpha_{m,c}k(x_n,x_m) \\
& - \frac{d}{2}\Bigl\langle \frac{1}{d}\sum_{s=1}^{d}\sum_{n=1}^{\ell}\alpha_{n,s}\varphi(x_n),\; \frac{1}{d}\sum_{u=1}^{d}\sum_{m=1}^{\ell}\alpha_{m,u}\varphi(x_m)\Bigr\rangle \\
& + \frac{1}{d}\Bigl\langle \sum_{s=1}^{d}\sum_{m=1}^{\ell}\alpha_{m,s}\varphi(x_m),\; \sum_{c=1}^{d}\sum_{n=1}^{\ell}\alpha_{n,c}\varphi(x_n)\Bigr\rangle + \frac{1}{d-1}\sum_{c=1}^{d}\sum_{n=1}^{\ell}\alpha_{n,c} \\
= {} & -\frac{1}{2}\sum_{c=1}^{d}\sum_{n,m=1}^{\ell}\alpha_{n,c}\alpha_{m,c}k(x_n,x_m) - \frac{1}{2d}\sum_{s,u=1}^{d}\sum_{n,m=1}^{\ell}\alpha_{n,s}\alpha_{m,u}k(x_n,x_m) \\
& + \frac{1}{d}\sum_{s,c=1}^{d}\sum_{n,m=1}^{\ell}\alpha_{n,s}\alpha_{m,c}k(x_n,x_m) + \frac{1}{d-1}\sum_{c=1}^{d}\sum_{n=1}^{\ell}\alpha_{n,c} \\
= {} & -\frac{1}{2}\sum_{c=1}^{d}\sum_{n,m=1}^{\ell}\alpha_{n,c}\alpha_{m,c}k(x_n,x_m) + \frac{1}{2d}\sum_{s,c=1}^{d}\sum_{n,m=1}^{\ell}\alpha_{n,s}\alpha_{m,c}k(x_n,x_m) \\
& + \frac{1}{d-1}\sum_{c=1}^{d}\sum_{n=1}^{\ell}\alpha_{n,c}\,.
\end{aligned}
\]
To write this more compactly, multiply by \(2d\), which does not change the maximizer:
\[
\begin{aligned}
2d\,L = {} & -d\sum_{c=1}^{d}\sum_{n,m=1}^{\ell}\alpha_{n,c}\alpha_{m,c}k(x_n,x_m) + \sum_{s,c=1}^{d}\sum_{n,m=1}^{\ell}\alpha_{n,s}\alpha_{m,c}k(x_n,x_m) \\
& + \frac{2d}{d-1}\sum_{c=1}^{d}\sum_{n=1}^{\ell}\alpha_{n,c} \\
= {} & \sum_{s=1}^{d}\left(\sum_{c=1}^{d}\sum_{n,m=1}^{\ell}\alpha_{n,s}\alpha_{m,c}k(x_n,x_m) - \sum_{c=1}^{d}\sum_{n,m=1}^{\ell}\alpha_{n,c}\alpha_{m,c}k(x_n,x_m)\right) \\
& + \frac{2d}{d-1}\sum_{c=1}^{d}\sum_{n=1}^{\ell}\alpha_{n,c} \\
= {} & \sum_{s,c=1}^{d}\sum_{n,m=1}^{\ell}(\alpha_{n,s}-\alpha_{n,c})\,\alpha_{m,c}\,k(x_n,x_m) + \frac{2d}{d-1}\sum_{c=1}^{d}\sum_{n=1}^{\ell}\alpha_{n,c}\,.
\end{aligned}
\]
Finally, the dual optimization problem is
\[
\begin{aligned}
\max_\alpha \quad & \sum_{c,e=1}^{d}\sum_{n,m=1}^{\ell}(\alpha_{n,e}-\alpha_{n,c})\,k(x_n,x_m)\,\alpha_{m,c} + \frac{2d}{d-1}\sum_{c=1}^{d}\sum_{n=1}^{\ell}\alpha_{n,c} \tag{4.12} \\
\text{s.t.} \quad & \sum_{n=1}^{\ell}\alpha_{n,c} = -\gamma \quad \forall c \in \{1,\dots,d\} \tag{4.13} \\
& \alpha_{n,c} \ge 0 \quad \forall n \in \{1,\dots,\ell\},\; c \in \{1,\dots,d\} \\
& \sum_{c=1}^{d}\alpha_{n,c} \le C \quad \forall n \in \{1,\dots,\ell\}\,.
\end{aligned}
\]
The DGI dual problem is thus similar to those of WW, CS, and LLW; the main difference is the interpretation of the dual variables. These similarities make it possible to develop solvers for all methods based on similar algorithms, which in turn allows conclusions to be drawn not only about the classification accuracies but also about the optimization times.
Chapter 5
Solvers
In the previous sections, I have reviewed the multi-class SVM quadratic programs; the main topic of this chapter is how to solve these multi-class problems efficiently. There are several strategies for solving quadratic programs in primal and in dual form (5.1). I summarize these methods in Section 5.1, and one of the contributions of this thesis, namely a new solver for multi-class SVMs, is described in Section 5.2.
5.1 Related Work
Researchers have applied many kinds of optimization techniques to the primal or dual problems of SVMs in order to train them efficiently. In this section, I briefly summarize some of these techniques. However, such a summary must be restricted to the most important and popular approaches, because the research related to the SVM optimization problem is extensive.
5.1.1 Interior Point Methods
Interior point methods are the workhorses of non-linear optimization and have been applied successfully to many different problems in different domains [29, 166]. The main idea of interior point methods is to replace the constraints of the SVM problem with barrier functions and to solve a series of unconstrained quadratic problems. Interior point methods have been applied to SVM problems [66, 164, 165]. Their main advantages are that the required number of iterations is low, more precisely \(O(\log(\log(1/\epsilon)))\), where \(\epsilon\) is the required accuracy of the optimization problem, and that they reach very high accuracy in the optimization problem [166]. However, interior point methods have several disadvantages. First of all, their runtime requirement is \(O(s^3)\), where \(s\) denotes the number of parameters of the optimization problem. To underline the importance of this fact for multi-class SVMs, let us compare OVA with any all-in-one multi-class machine. In the case of OVA, each of the \(d\) problems has exactly \(\ell\) variables, so the total runtime requirement of an interior point algorithm is \(d \cdot O(\ell^3)\). In the case of all-in-one multi-class machines, however, the number of parameters is \(\ell \times d\) and the runtime requirement is \(O((d\ell)^3)\). The second point is that interior point methods require \(O(s^2)\) memory, which is also problematic for all-in-one multi-class machines. Finally, they are numerically sensitive [119]. It is important to note that reaching high accuracy in an optimization problem does not generally mean high classification accuracy [25]. It is clear that for large-scale problems, or even for a large number of classes with a small number of examples per class, interior point methods are not applicable to all-in-one multi-class machines.
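To make the gap concrete (illustrative numbers, not measurements from the thesis), the all-in-one problem is a factor \(d^2\) more expensive than the OVA decomposition for an interior point method:

```python
ell, d = 1000, 10                   # illustrative problem size
ova_cost = d * ell ** 3             # d binary problems with ell variables each
all_in_one_cost = (d * ell) ** 3    # one problem with d * ell variables
print(all_in_one_cost // ova_cost)  # ratio is d ** 2 = 100
```

With 10 classes, the all-in-one formulation is already two orders of magnitude more expensive in this cost model.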
5.1.2 Direct Optimization of Primal Problem
Most of the methods for solving SVM optimization problems deal with the dual of the SVM problem. However, SVM optimization problems can also be solved directly in the primal. The main idea is that one can rewrite the weight vectors of the primal using the representer theorem [104] in the form \(w_c = \sum_{n=1}^{\ell} \alpha_{n,c}\varphi(x_n)\) and directly optimize the primal in this formulation. There is a problem in the case of the hinge loss, however, because the hinge loss is non-smooth. Chapelle proposed to use smooth loss functions together with Newton's method [37].
5.1.3 On-line Methods
Generally speaking, SVMs were developed for batch training and most solvers were also developed for this case. Recently, several researchers have proposed on-line training of binary SVMs and also of CS [35, 21, 19, 20]. In these methods, first order SMO [128] is used as an internal solver. Recently, Glasmachers and Igel [78] showed that second order working set selection improves the on-line learning of SVMs. It should be underlined that on-line learning of multi-class SVMs has up to now been restricted to CS, because to obtain an on-line learning algorithm for a multi-class SVM one needs a solver for batch training of that machine. Although the CS method does not have a bias term, there is an equality constraint for each training example, and by using this constraint one can apply decomposition techniques such as SMO to this problem and develop on-line versions of solvers. For LLW and WW there were no efficient solvers, and therefore on-line versions of these algorithms have not been developed so far. In this chapter I will develop efficient batch solvers for LLW, WW, and DGI. Finally, it can be concluded that on-line learning algorithms provide considerable speed-ups for large-scale data sets [20, 78].
5.1.4 Cutting Plane Approaches
For non-differentiable optimization problems, cutting plane algorithms [86, 15] are one of the mainstream methods. Although the dual problem of SVMs is differentiable, the primal problems of SVMs are not when the hinge loss is used. Recently, several studies have applied cutting plane algorithms to the SVM optimization problem [96, 71, 97, 98, 67, 153]. These methods need a training time proportional to the dimension of the feature space, which is problematic for non-linear kernels; e.g., the feature space corresponding to Gaussian kernels is infinite dimensional. Generally, these methods are applied to SVMs with linear kernels under the assumption that the original input space is high dimensional and sparse, so that no feature space other than the input space is needed [95, 140]. To use cutting plane methods with non-linear kernels, a low-rank approximation [100, 74] of the full kernel matrix should be used, as suggested by Joachims and Yu [97]. However, using a low-rank approximation of the kernel matrix may be troublesome or even impossible if the condition number of the kernel matrix is high. The condition number of the matrix is a function of the kernel hyper-parameters. During model selection this can cause problems such as long training times or, even worse, poor model selection.
5.1.5 Stochastic Gradient Descent
Stochastic gradient descent algorithms [108, 146] have been used almost from the beginning of machine learning [24, 23, 111]. Recently, there has been a trend of applying stochastic gradient algorithms to SVMs [105, 143, 18, 141], and it is claimed that stochastic gradient descent algorithms are suitable for large-scale data sets [25, 143, 26]. All these stochastic gradient algorithms are related to each other [167], and they are applied to SVMs with linear kernels. They have disadvantages similar to those of cutting plane methods when non-linear kernels are used. A recent experimental study [142] showed that these methods are not faster than on-line learning methods or decomposition algorithms when non-linear kernels are used. On-line learning methods and decomposition methods can handle both linear and non-linear kernels.
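To give a flavor of this family of methods, the following is a minimal Pegasos-style subgradient scheme for a linear binary SVM (a sketch under simplified assumptions, not one of the cited implementations; no bias term, decaying step size \(1/(\lambda t)\)):

```python
import numpy as np

def pegasos_train(X, y, lam=0.1, epochs=20, seed=0):
    """Stochastic subgradient descent on the primal objective
    lam/2 * ||w||^2 + mean hinge loss, for a linear SVM without bias."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    t = 0
    for _ in range(epochs):
        for n in rng.permutation(len(y)):
            t += 1
            eta = 1.0 / (lam * t)      # decaying step size
            margin = y[n] * (w @ X[n])
            w *= (1.0 - eta * lam)     # gradient step on the regularizer
            if margin < 1:             # subgradient of the hinge loss is active
                w += eta * y[n] * X[n]
    return w
```

On a linearly separable toy problem the learned hyperplane quickly classifies all points correctly; with non-linear kernels, however, no such simple parametric update of w exists, which is the disadvantage mentioned above.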
5.1.6 Decomposition Algorithms
Decomposition algorithms [125, 94] are the standard approach to solving SVM problems. Let us consider quadratic programs of the canonical form
\[
\begin{aligned}
\max_\alpha \quad & f(\alpha) = v^T\alpha - \frac{1}{2}\alpha^T Q\alpha \tag{5.1} \\
\text{s.t.} \quad & L_n \le \alpha_n \le U_n \quad \forall n \in \{1,\dots,m\}
\end{aligned}
\]
for \(\alpha \in \mathbb{R}^m\). Here \(v \in \mathbb{R}^m\) is some vector, \(Q \in \mathbb{R}^{m\times m}\) is a (symmetric) positive definite matrix, and \(L_n \le U_n\) are component-wise lower and upper bounds. The gradient \(g = \nabla f(\alpha)\) of (5.1) has components
\[
g_n = \frac{\partial f(\alpha)}{\partial \alpha_n} = v_n - \sum_{i=1}^{m} \alpha_i Q_{in}\,. \tag{5.2}
\]
The most frequently used algorithms for solving SVM quadratic programs are decomposition methods [124, 128, 64, 77, 27]. These methods iteratively decompose the quadratic program into subprograms, which are restricted to a subset B of the variables, the working set. The main idea of decomposition methods is to modify only the small set of working variables in each iteration. Decomposition algorithms need O(s) time and memory per iteration. This property is a significant advantage for large-scale problems, and especially for all-in-one multi-class machines, because s, the number of variables of the optimization problem, is d × ℓ. A desirable property of state-of-the-art decomposition algorithms is that iterations are fast in the sense that for any fixed upper bound q ∈ N on the working set size each iteration requires only O(m) operations. A general decomposition scheme for (5.1) is given in Algorithm 1.
Algorithm 1: Decomposition algorithm for problem (5.1).
  Input: feasible initial point α^(0), accuracy ε ≥ 0
  compute the initial gradient g^(0) ← ∇f(α^(0)) = v − Qα^(0)
  t ← 1
  while stopping criterion not met do
      select working indices B^(t) ⊂ {1, . . . , m}
      solve the subproblem restricted to B^(t) and update α^(t) ← α^(t−1) + μ^⋆(t)
      update the gradient g^(t) ← g^(t−1) − Qμ^⋆(t)
      t ← t + 1
  end
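A minimal instance of this scheme with |B| = 1 makes the structure explicit (a sketch, not the thesis implementation; the violating-set test and the clipped Newton step anticipate the selection rules discussed below):

```python
import numpy as np

def decompose_1d(v, Q, L, U, eps=1e-8, max_iter=10000):
    """Algorithm 1 with working sets of size one (coordinate ascent)
    for max f(a) = v^T a - 0.5 a^T Q a subject to L <= a <= U."""
    a = np.clip(np.zeros_like(v), L, U)   # feasible initial point
    g = v - Q @ a                          # initial gradient of f
    for _ in range(max_iter):
        # violating set: a feasible ascent step exists for these indices
        viol = [n for n in range(len(v))
                if (a[n] > L[n] + 1e-12 and g[n] < -eps)
                or (a[n] < U[n] - 1e-12 and g[n] > eps)]
        if not viol:                       # stopping criterion: KKT satisfied
            break
        b = max(viol, key=lambda n: abs(g[n]))                    # pick a variable
        mu = np.clip(g[b] / Q[b, b], L[b] - a[b], U[b] - a[b])    # clipped Newton step
        a[b] += mu
        g -= mu * Q[:, b]                  # O(m) gradient update
    return a
```

For a diagonal Q the loop reaches the exact optimum in a few iterations; for general positive definite Q it converges because each step increases the concave objective.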
For a vector \(\alpha \in \mathbb{R}^m\) and an index set \(I \subset \{1,\dots,m\}\), let \(\alpha_I = \sum_{i\in I}\alpha_i e_i\) denote the projection to the components indexed by \(I\), where \(e_i \in \mathbb{R}^m\) is the unit vector whose \(i\)-th component is 1. If all variables except those in \(B = \{b_1,\dots,b_{|B|}\}\) are fixed, the subproblem can be written as
\[
\begin{aligned}
\max_{\alpha_B} \quad & f(\alpha_F + \alpha_B) = \bigl(v_B - (Q\alpha_F)_B\bigr)^T\alpha_B - \frac{1}{2}\alpha_B^T Q\alpha_B + \text{const} \tag{5.3} \\
\text{s.t.} \quad & L_n \le \alpha_n \le U_n \quad \forall n \in B\,.
\end{aligned}
\]
Here, the complement \(F = \{1,\dots,m\}\setminus B\) of \(B\) contains the indices of the fixed variables.
The convergence properties of the decomposition method are determined by the heuristic for selecting the working indices. Given a feasible search point, the set of indices whose variables violate the KKT optimality conditions is called the violating set [101]. The set of violating indices at a search point \(\alpha\) is denoted by
\[
B(\alpha) = \bigl\{\, n \in \{1,\dots,m\} \;\big|\; (\alpha_n > L_n \wedge g_n < 0) \text{ or } (\alpha_n < U_n \wedge g_n > 0) \,\bigr\}\,.
\]
If the working set has the minimum size for generating feasible solutions, it is called irreducible, and this approach is called sequential minimal optimization (SMO, [128]), which is the most frequently used technique for SVM training.
5.1.7 General Comments on SVM Solvers
In Section 5.1, SVM solvers were briefly reviewed. There are two approaches to optimizing SVM solvers: one is to develop solvers for special types of kernels, e.g., linear kernels, and the other is to develop solvers for arbitrary kernels. Although linear kernels can be a good choice for some kinds of problems, like text classification, it is more convenient to have solvers for arbitrary kernels. It should be noted that SVMs using linear kernels are not universally consistent [148]. In this thesis, I focus on decomposition-type solvers. In Section 5.2, I develop a solver for WW, LLW, and DGI. It should be underlined that up to now no efficient solver has been developed for LLW, or for WW, and training these machines is considered to be slow.
5.2 Decomposition Algorithms for Multi-Class SVMs
Solver techniques for SVM problems were briefly described above; in this section, new solvers for multi-class SVMs, including all-in-one multi-class SVMs, will be developed. First of all, the goal should be clearly stated. Motivation: Long training times limit the applicability of multi-class SVMs. In particular, the canonical extension of binary SVMs to multiple classes, the WW SVM [163], as well as the theoretically sound LLW SVM [112], are rarely used. Instead, alternative multi-class formulations are preferred. While these can be trained faster, they lack desirable theoretical properties and/or often lead to less accurate hypotheses.
The CS SVM [42] is arguably the most popular modification of the WW formulation, intended mainly to speed up training. For learning structured data, this all-together method is usually the SVM algorithm of choice. Lee et al. [112] modified the standard multi-class SVM formulation for theoretical reasons. In contrast to the other machines, their SVM relies on a classification calibrated loss function, which leads to consistency [154]. However, up to now no efficient solver for the LLW SVM has been derived and implemented, and thus empirical comparisons with other methods are rare.
In this thesis, I consider (batch) training of multi-class SVMs with universal (i.e., non-linear) kernels and ask the following questions: Can the learning speed of WW be increased by using a more efficient quadratic programming method? Can such a method make LLW learning practical, and do the nicer theoretical properties of the LLW machine lead to better hypotheses in practice? In this thesis, positive answers are given to these questions by applying the crucial computational trick of CS, namely removing the bias term from the hypotheses, to the WW and LLW machines. For additional acceleration, a non-standard decomposition scheme that speeds up multi-class SVM training is proposed.
As mentioned before, there are several key issues in developing fast and memory-efficient decomposition solvers. First, the working set size must be decided. Second, how variables are selected for the working set must be defined. In the following, these issues are first clarified, and then an efficient solution based on decomposition algorithms is proposed. For deriving the new training algorithms, quadratic programs of the canonical form (5.1) are considered. The dual problems of the WW and LLW machines without bias can be written directly in this canonical form. The dual problem of the CS machine introduces a large number of additional equality constraints, which are ignored for the moment and discussed in Section 5.2.7. The minimum working set size depends on the number of equality constraints; for problem (5.1) it is one. Next, the trade-offs influencing the choice of the number of elements in the working set are discussed. Then, working set selection heuristics for solving (5.1) are described, and in Section 5.2.6 the proposed solver S2DO is defined.
5.2.1 Dropping the Bias Parameters
The constraint (4.1) makes the multi-class SVM quadratic programs difficult to solve with decomposition techniques, because a feasible step requires the modification of at least d variables simultaneously. This problem concerns the WW, LLW, and DGI approaches. However, such a constraint is not present in the standard CS machine, because Crammer & Singer dropped the bias terms b_c, which are of minor importance when working with characteristic or universal kernels [129].

Instead of restricting this trick to the CS machine, I propose to apply it also to the WW, LLW, and DGI SVMs. If this trick is applied to all multi-class SVMs, the constraint (4.1) simply vanishes from the dual, while everything else remains the same. This step of removing the bias terms is crucial, because it allows us for the first time to solve the WW and LLW machines with elaborate decomposition techniques, as discussed in the next section, and it also allows us to apply the same decomposition techniques to DGI.

Dropping the bias terms in all machines is also a prerequisite for a fair empirical comparison of the approaches. First, it makes fast training, and therefore appropriate model selection and evaluation on several benchmark problems, feasible. Second, all machines then consider the same hypothesis space. Alternatively, a bias term could have been introduced into the CS method, but then the resulting dual problem becomes much more complicated, because it contains two sets of interfering equality constraints, see also [87], which renders the solution technique presented in the next section intractable.
5.2.2 Working Set Sizes for Decomposition Algorithms
The size of the working set B influences the overall performance of the decomposition algorithm in a number of ways. First, the complexity of solving subproblem (5.3) analytically grows combinatorially with |B|. This limits the working set size to small numbers unless a numerical solver is used, as done in [94]. Second, the larger |B|, the less well-founded is a heuristic for picking the actual working set from the O(m^|B|) candidates, because such a heuristic is acceptable only if its time complexity is O(m). At the same time, a large |B| offers the working set selection scheme the opportunity to provide a working set on which large progress can be made. Third, the gradient update takes O(m · |B|) operations. Thus, small working sets result in fast iterations making little progress, while larger working sets result in slower iterations making larger progress. For example, two iterations with |B| = 1 cost roughly as much as one iteration with |B| = 2. Thus, a single step on two variables must make at least as much progress as two steps on single variables, which trivially is the case if the same variables are used, since such a step can directly take into account correlations between the variables. On the other hand, the second of the two single-variable iterations can profit from the gradient update done by the first, and thus make a better decision when picking its active variable. This fast update of the gradient may be an important reason for the success of SMO.
Taking all these issues into account, one concludes that (a) working set sizes should be small in order to avoid unacceptable computation times for solving the subproblem, and (b) there is an inherent trade-off between many cheap iterations profiting from fast gradient updates and fewer slow iterations with more working sets available and larger progress per iteration. In my view, none of the above arguments forces the working set size to be minimal, as in SMO. Instead of always using the minimum working set, I propose to use working sets of size two whenever possible. This strategy will be referred to as sequential two-dimensional optimization (S2DO).
5.2.3 On Working Variable Selection
To develop a decomposition algorithm, one needs to decide how to select the variables of the working set B. For binary SVMs, several methods have been proposed for variable selection [128, 102, 90, 113, 89]. Two of them are particularly important: the first is the maximum violating pair (MVP) method [94, 102], and the other is second order working variable selection [64].
5.2.4 Maximum Violating Pair Method
There are several ways to pick the elements of the working set B. The most effective way is to maximize the gain of the restricted dual objective function (5.3). Even if |B| is fixed to 2, searching for the pair of indices that gives the maximum gain requires evaluating \(\ell \times (\ell - 1)\) possible pairs. It is clear that this method is slow when \(\ell\) is large; note that for multi-class SVMs the number of free parameters is approximately \(d \times \ell\). To overcome this problem, Keerthi et al. [102] proposed the maximum violating pair method. Notation similar to that of Keerthi et al. [102] is used in this thesis; define the following sets:
\[
\begin{aligned}
I_{\text{up}}(\alpha) &= \{\, n \in \{1,\dots,\ell\} \mid \alpha_n < U_n \,\} \\
I_{\text{down}}(\alpha) &= \{\, n \in \{1,\dots,\ell\} \mid \alpha_n > L_n \,\} \\
B(\alpha) &= \{\, (i,j) \mid i \in I_{\text{up}},\; j \in I_{\text{down}},\; i \ne j \,\}
\end{aligned}
\]
Keerthi et al. [102] use a first order approximation of (5.3). In short, the MVP method picks the pair that violates the KKT conditions most strongly [102], which corresponds to
\[
i = \operatorname*{argmax}_{n \in I_{\text{up}}} g_n\,, \qquad j = \operatorname*{argmin}_{n \in I_{\text{down}}} g_n\,.
\]
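The MVP rule translates directly into code (a sketch; g denotes the gradient of the dual objective and a small tolerance guards the bound comparisons):

```python
def mvp_pair(alpha, g, L, U, eps=1e-12):
    """Maximum violating pair: i maximizes g over I_up,
    j minimizes g over I_down."""
    I_up = [n for n in range(len(alpha)) if alpha[n] < U[n] - eps]
    I_down = [n for n in range(len(alpha)) if alpha[n] > L[n] + eps]
    i = max(I_up, key=lambda n: g[n])    # most positive feasible ascent
    j = min(I_down, key=lambda n: g[n])  # most negative feasible descent
    return i, j
```

The scan is a single O(ℓ) pass, which is what makes the heuristic affordable compared to evaluating all ℓ(ℓ − 1) pairs.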
5.2.5 Second Order Working Variable Selection for SMO
The second order working set selection introduced in [64] is adopted in this thesis.1 Let us assume the optimization problem is restricted to the single variable \(\alpha_b\), \(b \in B(\alpha)\). The update direction is thus \(e_b\). If the box constraints are ignored, the optimal step size is given by the Newton step
\[
\mu = [\nabla f(\alpha)]_b / Q_{bb}\,, \tag{5.4}
\]
yielding a gain of
\[
f(\alpha + \mu\cdot e_b) - f(\alpha) = \frac{\mu^2 Q_{bb}}{2} = \frac{[\nabla f(\alpha)]_b^2}{2Q_{bb}} = \frac{g_b^2}{2Q_{bb}}\,.
\]
This definition of gain leads to the greedy heuristic
\[
b^{(t)} = \operatorname*{argmax}\left\{ \frac{\bigl(g_n^{(t-1)}\bigr)^2}{Q_{nn}} \;\middle|\; n \in B(\alpha^{(t-1)}) \right\}
\]
for choosing the working index, where \(g_n^{(t-1)}\) is defined as
\[
g_n^{(t-1)} = \frac{\partial f}{\partial \alpha_n}(\alpha^{(t-1)})\,.
\]
To obtain a feasible new search point, the Newton step must be clipped to the feasible region by computing
\[
\mu^\star = \max\bigl\{ L_b - \alpha_b,\; \min\{ U_b - \alpha_b,\; \mu \} \bigr\}\,.
\]
The update of the variables is simply given by
\[
\alpha^{(t)} = \alpha^{(t-1)} + \mu^\star e_b\,.
\]
In each iteration the algorithm needs the \(b^{(t)}\)-th column of the large kernel-based matrix Q for the gradient update. In addition, the diagonal entries \(Q_{nn}\) needed for second order working set selection should be precomputed and stored. This requires O(m) time and memory.

1 Note that in the case of single-variable working sets, first order and second order working set selection coincide for the important special case \(\forall 1 \le i, j \le m : Q_{ii} = Q_{jj}\), e.g., for normalized kernels (\(k(x, x) = 1\) for all \(x \in X\)).
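Putting the selection rule, the Newton step, and the clipping together, one SMO iteration for problem (5.1) might read as follows (a sketch, not the thesis solver; the full Q column is used directly instead of a kernel cache):

```python
import numpy as np

def smo_step(alpha, g, Q, L, U, eps=1e-12):
    """One SMO iteration with second order working set selection:
    pick b maximizing g_b^2 / Q_bb over the violating set, then take
    the Newton step mu = g_b / Q_bb clipped to the box."""
    m = len(alpha)
    viol = [n for n in range(m)
            if (alpha[n] > L[n] + eps and g[n] < -eps)
            or (alpha[n] < U[n] - eps and g[n] > eps)]
    if not viol:
        return alpha, g, False                         # KKT conditions hold
    b = max(viol, key=lambda n: g[n] ** 2 / Q[n, n])   # second order gain
    mu = np.clip(g[b] / Q[b, b], L[b] - alpha[b], U[b] - alpha[b])
    alpha = alpha.copy()
    alpha[b] += mu
    g = g - mu * Q[:, b]                               # O(m) gradient update
    return alpha, g, True
```

Iterating `smo_step` until it reports no violating index solves the box-constrained quadratic program.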
5.2.6 Second Order Working Pair Selection for S2DO
I now derive second order working set selection for (5.1) using |B| = 2. The focus of this thesis is on multi-class SVMs, but the selection scheme is also valid for binary SVMs without bias. The first index is selected according to the maximum absolute value of the gradient component,
\[
i = \operatorname*{argmax}_{k \in B(\alpha)} |g_k|\,.
\]
The second index is then selected by maximizing the gain [64]. Let the optimization problem be restricted to
\[
\alpha_B = (\alpha_a, \alpha_b)^T\,. \tag{5.5}
\]
Let the update of the current point \(\alpha_B\) be
\[
\mu^\star_B = (\mu^\star_a, \mu^\star_b)^T\,. \tag{5.6}
\]
Denote by
\[
\tilde\alpha_B = (\tilde\alpha_a, \tilde\alpha_b)^T \tag{5.7}
\]
the unconstrained optimum of (5.3), with corresponding gain
\[
f(\tilde\alpha_B) - f(\alpha_B)\,. \tag{5.8}
\]
The second order Taylor expansion of (5.3) around \(\alpha_B\), written with \(\mu_B = \tilde\alpha_B - \alpha_B\), is
\[
f(\tilde\alpha_B) = f(\alpha_B) + \mu_B^T\nabla f(\alpha_B) - \frac{1}{2}\mu_B^T Q_B\mu_B\,, \tag{5.9}
\]
where the matrix \(Q_B \in \mathbb{R}^{2\times 2}\) is the restriction of Q to the entries corresponding to the working set indices. At the unconstrained optimum the gradient of the restricted problem vanishes (see (5.11) below), hence \(g_B = Q_B\mu_B\) and the gain is
\[
f(\tilde\alpha_B) - f(\alpha_B) = \frac{1}{2}\mu_B^T Q_B\mu_B\,. \tag{5.10}
\]
The gradient at \(\tilde\alpha_B\) is
\[
\tilde g_B = g_B - Q_B\mu_B\,. \tag{5.11}
\]
It vanishes at the extremal point, and a Newton step gives
\[
\mu_B = Q_B^{-1} g_B\,. \tag{5.12}
\]
However, this computation assumes that the matrix \(Q_B\) can be inverted. If this is indeed true (\(\det(Q_B) > 0\)), the gain can be computed directly as
\[
\frac{g_a^2 Q_{bb} - 2 g_a g_b Q_{ab} + g_b^2 Q_{aa}}{2\,(Q_{aa}Q_{bb} - Q_{ab}^2)}\,.
\]
In the case \(\det(Q_B) = 0\) the calculation of the gain is more involved. For \(Q_B = 0\) there are two cases: the objective function is constant when \(g_B = 0\), and the gain is zero; the objective function is linear when \(g_B \ne 0\), with infinite gain. The case that \(Q_B\) is a rank one matrix remains. Let \(q_B\) be an eigenvector spanning the null eigenspace. For \(g_B^T q_B \ne 0\) the gain is infinite; only if \(g_B\) and \(q_B\) are orthogonal does the problem reduce to a one-dimensional quadratic equation. In this case the (non-unique) optimum can be computed as follows. Let \(w_B\) be a nonzero vector orthogonal to \(q_B\), in other words an eigenvector corresponding to the non-null eigenspace of \(Q_B\). Then the point
\[
\tilde\alpha_B = \alpha_B + \frac{w_B^T g_B}{w_B^T Q_B w_B}\, w_B
\]
is optimal, and the corresponding gain is
\[
\text{gain} = \frac{(w_B^T g_B)^2}{2\, w_B^T Q_B w_B}\,.
\]
The vectors \(g_B\) and \(w_B\) are aligned in this case (\(g_B = \lambda w_B\) for some \(\lambda \in \mathbb{R}\)), so \(g_B\) can directly take the role of \(w_B\), resulting in
\[
\text{gain} = \frac{(g_a^2 + g_b^2)^2}{2\,(g_a^2 Q_{aa} + 2 g_a g_b Q_{ab} + g_b^2 Q_{bb})}\,.
\]
For normalized kernels, \(Q_{aa} = Q_{bb} = 1\), so the case \(Q_B = 0\) is impossible and \(\det(Q_B) = 0\) amounts to the two cases \(Q_{ab} \in \{\pm 1\}\), resulting in \(q_B = (-Q_{ab}, 1)^T\) and \(w_B = (1, Q_{ab})^T\). For
\[
g_B^T q_B = g_b - Q_{ab}\, g_a = 0 \tag{5.13}
\]
the gain is given by
\[
\text{gain} = (g_a + Q_{ab}\, g_b)^2 / 8\,. \tag{5.14}
\]
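The case distinction for the gain can be collected into a single function (a sketch; Q_B is assumed positive semi-definite, as for kernel matrices, and infinite gain is returned as math.inf):

```python
import math

def s2do_gain(ga, gb, Qaa, Qbb, Qab, tol=1e-12):
    """Gain of the unconstrained 2D Newton step, including the
    degenerate cases of a singular Q_B."""
    det = Qaa * Qbb - Qab * Qab
    if det > tol:                               # Q_B invertible
        return (ga*ga*Qbb - 2*ga*gb*Qab + gb*gb*Qaa) / (2 * det)
    if Qaa < tol and Qbb < tol:                 # Q_B = 0 (PSD => Qab = 0 too)
        return 0.0 if (abs(ga) < tol and abs(gb) < tol) else math.inf
    # rank one: q spans the null eigenspace (choice avoids the zero vector)
    q = (-Qab, Qaa) if Qaa >= Qbb else (Qbb, -Qab)
    if abs(q[0]*ga + q[1]*gb) > tol:            # g_B not orthogonal to q
        return math.inf
    denom = ga*ga*Qaa + 2*ga*gb*Qab + gb*gb*Qbb  # = g^T Q_B g, g aligned with w_B
    if denom < tol:
        return 0.0
    return (ga*ga + gb*gb) ** 2 / (2 * denom)
```

For normalized kernels with \(Q_{ab} = 1\) and \(g_a = g_b\), this reproduces the special case (5.14).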
The update of the α vector is non-trivial when |B| = 2. It has been derived in [149] for normalized kernels; in the following section it is adapted to arbitrary kernels, i.e., including non-normalised kernels.
5.2.6.1 The update for S2DO

In this section, the solution of the sub-problem (5.3) for arbitrary kernels, in other words for both normalised2 and non-normalised kernels, in the case of working set size |B| = 2, is described. Let \(B = \{i, j\}\) and let \(\mu_B\) denote the update vector for \(\alpha_B\). Consider the sub-problem:
\[
\begin{aligned}
\max_{\mu_B} \quad & f(\alpha_B + \mu_B) = \bigl(v_B - (Q\alpha_F)_B\bigr)^T(\alpha_B + \mu_B) - \frac{1}{2}(\alpha_B + \mu_B)^T Q_B(\alpha_B + \mu_B) + \text{const} \tag{5.15} \\
\text{s.t.} \quad & \alpha_B + \mu_B \in [L_i, U_i] \times [L_j, U_j]
\end{aligned}
\]
To derive the update step of the S2DO algorithm, the objective function of (5.15) is rewritten by expanding the products and absorbing all terms that do not depend on \(\mu_B\) into the constant:
\[
f(\alpha_B + \mu_B) = \bigl(v_B - (Q\alpha_F)_B - Q_B\alpha_B\bigr)^T\mu_B - \frac{1}{2}\mu_B^T Q_B\mu_B + \text{const}\,. \tag{5.16}
\]
The first factor of the first term on the right-hand side of (5.16) is exactly the gradient \(g_B\), so we have
\[
f(\alpha_B + \mu_B) = g_B^T\mu_B - \frac{1}{2}\mu_B^T Q_B\mu_B + \text{const}\,. \tag{5.17}
\]
To find the maximum of \(f(\alpha_B + \mu_B)\), its partial derivatives with respect to \(\mu_i\) and \(\mu_j\),
\[
\frac{\partial f(\alpha_B + \mu_B)}{\partial \mu_i} = g_i - (\mu_i Q_{ii} + \mu_j Q_{ij})\,, \qquad
\frac{\partial f(\alpha_B + \mu_B)}{\partial \mu_j} = g_j - (\mu_j Q_{jj} + \mu_i Q_{ij})\,,
\]
are set to zero:
\[
g_i = \mu_i Q_{ii} + \mu_j Q_{ij}\,, \qquad g_j = \mu_j Q_{jj} + \mu_i Q_{ij}\,.
\]
These equations can be written as a matrix equation,
\[
\begin{pmatrix} g_i \\ g_j \end{pmatrix}
= \begin{pmatrix} Q_{ii} & Q_{ij} \\ Q_{ij} & Q_{jj} \end{pmatrix}
\begin{pmatrix} \mu_i \\ \mu_j \end{pmatrix}\,, \tag{5.18}
\]
and hence
\[
\mu_B = Q_B^{-1} g_B\,. \tag{5.19}
\]

2 We should note that S2DO has been derived in [149] for normalized kernels.
In the sequel, a number of one-dimensional sub-problems, in which one of the variables \(\mu_i\) or \(\mu_j\) is fixed to one of its bounds, will be solved. W.l.o.g. assume that \(\alpha_i + \mu^\star_i = L_i\). Then the optimal solution is given by
\[
\mu^\star_j = \min\left\{ \max\left\{ \frac{g_j - Q_{ij}\mu^\star_i}{Q_{jj}},\; L_j - \alpha_j \right\},\; U_j - \alpha_j \right\}\,.
\]
Three different cases are distinguished according to the rank of \(Q_B\). For \(Q_B = 0\) the solution is found by following the gradient, i.e., \(\mu^\star_i = U_i - \alpha_i\) for \(g_i > 0\), \(\mu^\star_i = L_i - \alpha_i\) for \(g_i < 0\), and \(\mu^\star_i = 0\) for \(g_i = 0\), with analogous rules for \(\mu^\star_j\).

Now assume that \(Q_B\) has rank one. Then the objective function is linear on each line segment \(S_p = \{\, p + \lambda\cdot(-Q_{ij}, Q_{ii})^T \mid \lambda \in \mathbb{R} \,\} \cap [L_i,U_i]\times[L_j,U_j]\), \(p\) feasible, with derivative \(\gamma = \partial f/\partial\lambda = Q_{ii}g_j - Q_{ij}g_i\) in the parameter direction. For \(\gamma \ge 0\) the optimum is attained on one of the line segments at the maximal parameter value. These points cover either one or two adjacent edges of the parameter rectangle \([L_i, U_i] \times [L_j, U_j]\), depending on the signs of \(Q_{ii}\) and \(Q_{ij}\). For each of these edges the one-dimensional sub-problem is solved; the best solution obtained from the one-dimensional sub-problems is the optimum \(\mu^\star_B\). The case \(\gamma < 0\) is handled analogously with the opposite edge(s).

If \(Q_B\) has full rank, then the unconstrained optimum is
\[
\mu_B = Q_B^{-1} g_B = \frac{1}{\det(Q_B)}
\begin{pmatrix} Q_{jj}g_i - Q_{ij}g_j \\ Q_{ii}g_j - Q_{ij}g_i \end{pmatrix}\,.
\]
If this solution is feasible, then \(\mu^\star_B = \mu_B\). Otherwise, first assume that only one of the variables \(\mu_i\) and \(\mu_j\) is outside the bounds; w.l.o.g. assume \(\alpha_i + \mu_i > U_i\). Then, by convexity, the optimum is found on the edge \(\{U_i\} \times [L_j, U_j]\), which amounts to a one-dimensional problem. In case both variables violate the constraints, w.l.o.g. \(\alpha_i + \mu_i < L_i\) and \(\alpha_j + \mu_j > U_j\), the same convexity argument ensures that the optimum is located on one of the adjacent edges \(\{L_i\} \times [L_j, U_j]\) and \([L_i, U_i] \times \{U_j\}\). As above, the better solution of the two one-dimensional problems constitutes the optimum.
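For the common full-rank case, the clipped update can be sketched as follows. This is a simplification of the argument above: instead of using convexity to select one or two edges, all four edges of the box are checked, which yields the same optimum at slightly higher cost (the rank-deficient branches are omitted):

```python
import numpy as np

def s2do_update(ai, aj, gi, gj, Qii, Qjj, Qij, Li, Ui, Lj, Uj):
    """Solve the 2D sub-problem (5.15) for invertible Q_B: unconstrained
    Newton step first, fall back to the box edges if it is infeasible."""
    det = Qii * Qjj - Qij * Qij
    mi = (Qjj * gi - Qij * gj) / det      # unconstrained optimum mu_B
    mj = (Qii * gj - Qij * gi) / det
    if Li <= ai + mi <= Ui and Lj <= aj + mj <= Uj:
        return mi, mj

    def edge_i(val_i):                    # fix a_i + mu_i = val_i, solve for mu_j
        fi = val_i - ai
        fj = np.clip((gj - Qij * fi) / Qjj, Lj - aj, Uj - aj)
        return fi, fj

    def edge_j(val_j):                    # fix a_j + mu_j = val_j, solve for mu_i
        fj = val_j - aj
        fi = np.clip((gi - Qij * fj) / Qii, Li - ai, Ui - ai)
        return fi, fj

    def gain(m):                          # objective (5.17) at the candidate step
        return gi*m[0] + gj*m[1] - 0.5*(Qii*m[0]**2 + 2*Qij*m[0]*m[1] + Qjj*m[1]**2)

    candidates = [edge_i(Li), edge_i(Ui), edge_j(Lj), edge_j(Uj)]
    return max(candidates, key=gain)
```

Each edge candidate reuses the one-dimensional clipped solution derived above, so the whole update stays closed-form.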
5.2.7 Solving the Crammer and Singer Multi-class SVM Using SMO

To solve the dual problem with \(m = \ell \times d\) variables introduced in [42], the \(\ell\) additional equality constraints
\[
\sum_{c=1}^{d} \alpha_{n,c} = 0 \quad \forall n \in \{1,\dots,\ell\}
\]
have to be respected. For this, techniques already established in many standard SMO solvers are used: the indices maximizing the gain are selected from the set of candidates
\[
B(\alpha) = \Bigl\{\, \bigl((i,m),(i,n)\bigr) \;\Big|\; i \in \{1,\dots,\ell\},\; m,n \in \{1,\dots,d\},\; m \ne n,\;
\alpha_{i,m} < U_{i,m},\; \alpha_{i,n} > L_{i,n},\; (\nabla f(\alpha))_{i,m} - (\nabla f(\alpha))_{i,n} > 0 \,\Bigr\}\,,
\]
the optimal step size is computed as a restricted Newton step in the standard way [128, 27], and the gradient is updated according to the changes in both variables of the working set.3
The new algorithm proposed in this thesis considers a working set of size two,
in contrast to the solvers proposed so far, which operate on working sets of size d.
Because of the constraints, a working set size of two is minimal. The SMO method
proposed in this section promises an increase in learning speed, because the two-
dimensional sub-problem can be solved efficiently analytically. From binary SVMs
it is known that enlarging the working set in general decreases performance.
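The second-order pair selection over this candidate set can be sketched as follows. The snippet below is a simplified illustration for a single example: among the feasible ascent directions it picks the pair of classes maximizing the Newton gain (∇fm − ∇fn)² / (2(Qmm + Qnn − 2Qmn)). The flat argument layout and the `curvature` callback are assumptions made for illustration, not the thesis interface.

```python
def select_pair(grad, alpha, lower, upper, curvature):
    """Second-order selection of a class pair (m, n) for one example:
    among feasible directions e_m - e_n, pick the one with maximal gain
    (grad[m] - grad[n])**2 / (2 * curvature(m, n)), where curvature(m, n)
    should return Q[m][m] + Q[n][n] - 2*Q[m][n] (assumed > 0)."""
    d = len(grad)
    best, best_gain = None, 0.0
    for m in range(d):
        for n in range(d):
            if m == n:
                continue
            # feasibility: alpha[m] may increase, alpha[n] may decrease
            if alpha[m] >= upper[m] or alpha[n] <= lower[n]:
                continue
            g = grad[m] - grad[n]
            if g <= 0:            # no ascent along this direction
                continue
            gain = g * g / (2.0 * curvature(m, n))
            if gain > best_gain:
                best, best_gain = (m, n), gain
    return best, best_gain
```

The feasibility and gradient conditions mirror the definition of B(α) above; the gain formula is the standard second-order criterion for a restricted Newton step.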
5.2.8 Efficient Caching for All-in-one Machines
Key techniques for making SVM solvers efficient are caching and shrinking as in-
troduced by Joachims [94]. In this thesis, the shrinking and unshrinking heuristics
from LIBSVM [36, 64, 27] are adopted. However, for the large problems with d × ℓ
variables arising from the all-in-one approaches [163, 42] we use a special
optimization. In these cases the matrix Q can be decomposed into the smaller ℓ × ℓ
kernel matrix and a set of at most O(d^4) coefficients:

Q(m,e),(n,c) = M(ym,e),(yn,c) k(xm, xn) .
3In the case of two-variable working sets, second-order working set selection differs from first-order selection also for normalized kernels. It is usually much more efficient (see [64, 77, 79]).
In fact, the coefficients can easily be computed instead of being cached for large
values of d. They take the form
M(ym,e),(yn,c) = δyn,ym + δc,e − δyn,e − δym,c

for the methods WW and CS. Further, for LLW and DGI they take the following
form
M(ym,e),(yn,c) = δc,e − 1/d .
It is well known that the speed of SMO-type solvers can crucially depend on the
fraction of kernel cache misses (e.g., [77]). Therefore we cache only the ℓ× ℓ kernel
matrix Kij = k(xi, xj) and compute the coefficients on the fly. Such an archi-
tecture makes the implementation more challenging, because shrinking has to be
implemented both on the level of training examples (in order to reduce the cache
requirements) and on the level of variables (in order to speed up the working set
selection and gradient update loops).
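The decomposition above can be illustrated with a short sketch. The following hypothetical helpers compute an entry of Q on the fly from the cached ℓ × ℓ kernel matrix and the closed-form coefficients for WW/CS and LLW/DGI; the function names are illustrative only, not the Shark implementation.

```python
def delta(a, b):
    """Kronecker delta."""
    return 1.0 if a == b else 0.0

def coefficient(method, y_m, e, y_n, c, d):
    """Coefficient M_{(y_m,e),(y_n,c)} relating two dual variables,
    computed on the fly instead of being cached."""
    if method in ("WW", "CS"):
        return delta(y_n, y_m) + delta(c, e) - delta(y_n, e) - delta(y_m, c)
    if method in ("LLW", "DGI"):
        return delta(c, e) - 1.0 / d
    raise ValueError(method)

def q_entry(method, K, y, m, e, n, c, d):
    """Q_{(m,e),(n,c)} = M_{(y_m,e),(y_n,c)} * k(x_m, x_n), using only the
    cached l-by-l kernel matrix K[m][n] = k(x_m, x_n)."""
    return coefficient(method, y[m], e, y[n], c, d) * K[m][n]
```

Only K has to be cached; the coefficients cost a few comparisons per access, which is far cheaper than a kernel cache miss.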
Chapter 6
Conceptual and Theoretical
Analysis of Multi-class SVMs
All machines considered in this thesis have pros and cons. Before I present
an empirical evaluation in Chapter 7, I want to discuss their conceptual differences.
This provides a better understanding of their behavior and guidelines for choosing
the most promising approach for a given application. Most of my considerations
are focussed on the margin concepts used in the different machines. Without loss
of generality, the analysis in the following assumes there is no bias in any machine.
I start the discussion with the MC-MMR and OVA machines before I turn to the
all-in-one multi-class SVMs. After that, in Section 6.4, the problem of correct
margin normalization in multi-class classification in general is discussed, and then I
develop a bound on the generalization error of CS and WW. I briefly discuss the
universal consistency of multi-class SVMs. I close this chapter by contrasting the
asymptotic training complexities of the six implemented multi-class machines,
depending on the number of training examples and the number of classes in the problem.
6.1 Margins in Multi-Class SVMs
A key feature of SVMs for binary classification is the separation of data of different
classes with a large margin. For non-separable data this notion gets a little bit
fuzzy, involving a target margin and the amount of margin violations. Accordingly,
the primal SVM formulation tries to achieve two conflicting goals at the same time,
namely the minimization of the weight vector (corresponding to the maximization
of the target margin and to favouring smoother hypotheses, because the hypothesis
is Lipschitz with a constant proportional to the norm of the weight vector) and
the minimization of the empirical loss in terms of some norm of the vector of slack
variables.
While the notion of margin is quite clear for the case of binary classification, it
turns out that this concept is more complicated for the multi-class case. The decision
boundary is still of co-dimension one. However, on each side of the boundary the
classifier assigns the label of one class, such that different (linear) parts of the
decision boundary correspond to different pairs of classes. Thus, the number of
such different parts of the decision boundary, corresponding to the separation of
different pairs of classes, can grow quadratically with the number of classes in the
problem.
In feature space, SVMs typically use one linear function per class for the predic-
tion, and assign a label by maximizing over the single predictions. When looking
at the label space this amounts to a decision function of the form illustrated in
Figure 3.2.
6.2 Margins in Multi-Class Maximum Margin Re-
gression
Although the classifier proposed by Szedmak et al. [152] is clearly inspired from the
dual problem of the binary SVM classifier, it turns out that the problem solved is
different, at least for the standard case of orthogonal prototype label vectors (see
Chapter 3). Figure 6.1 illustrates this difference.
Figure 6.1: Illustration of the slack penalties in the MC-MMR machine (left) and WW or CS, which reduce to a standard binary SVM (right), for two classes with a two-dimensional decision space; the axes show 〈w1, φ(x)〉 and 〈w2, φ(x)〉. Training examples of the two classes are illustrated with black and white dots. The dotted lines correspond to the class-wise target margin, i.e., points on one side of this line do not violate the target margin, resulting in ξi = 0. The lengths of the solid lines connecting margin violators with the margin lines indicate the amount of slack penalty ξi > 0 induced by the corresponding points. Obviously, the minimization of the slack variables for the MC-MMR rule on the left has nothing to do with the decision boundary.
The fact that the MC-MMR machine does not reduce to the standard binary
SVM is questionable and may even justify the point of view that this kernel machine
should not be considered an SVM variant at all, because the decisive maximum
margin feature is missing. However, the procedure of the MC-MMR machine is
quite similar to the one-class SVM approach [137] used to identify the support of a
distribution.
The primal MC-MMR problem expresses the desire to find a low complexity
function that takes a value of at least one (or close to one) on the support of the
class distribution. This comes close to training a one-class SVM per class [137].
Then MC-MMR makes predictions in an ad-hoc manner by just taking the class for
which the support estimation function outputs the largest value.
6.3 Margins in the One Versus All Classifier
At a first glance it seems that the OVA classifier should have a reasonable margin
concept, as it is derived from a series of well-understood binary classifiers. However,
it turns out that this intuition is wrong. The reason is that the margin concepts of
OVA training and the decision function (3.1) differ. This difference does not only
quantitatively affect the amount of margin violations, but also results in qualitative
differences when it comes to linear separability.
class 1
class 2
class 3
Figure 6.2: The figure illustrates the linear separability problem faced by OVA. The three classes are pairwise linearly separable, and they are separable with a decision function of type (3.1). However, OVA tries to solve the multi-class problem by treating one of the classes as the positive class and combining all remaining classes into a single negative class. With this procedure, neither class two nor class three is linearly separable from its complement. In other words, the individual decision functions constructed by OVA use a different concept of separability than equation (3.1), which is finally used for prediction.
This problem is illustrated in Figure 6.2. The three classes in this example are
pairwise linearly separable. However, OVA tries to form hyperplanes that separate
one class from all others. In the given example, this can be achieved without error
for class one, but not for classes two or three. Thus, a soft margin OVA machine is
required for this problem. In other words, although the linearly separating decision
functions are in OVA's hypothesis space, the training scheme will in general not find
this solution.
It is clear that depending on the characteristics of the problem at hand the
differences of the margin concepts in OVA training and prediction may affect per-
formance. Nevertheless OVA may, just like MC-MMR, work well on various data
sets [133].
6.4 Margin Normalization for Multi-Class Machines
As discussed above, the decision boundary of decision function (3.1) can be split
into O(d×d) functionally different parts, corresponding to the separation of pairs of
classes. Given this, even the concept of a hard margin SVM becomes tricky, because
the maximization of multiple margins naturally is a multi-objective optimization
problem.
There are certainly different meaningful ways to merge all the margin objectives
into a single objective function, for example as a simple linear combination. The
way taken in all-in-one multi-class machines is slightly different. Instead of
maximizing the margins, which would be reflected by maximizing an objective
function of the type

∑_{c,e=1}^{d} 1/‖wc − we‖ ,
the complexity of the hypothesis measured by the sum of squared norms of the
normal vectors is penalized directly. However, the term

∑_{c=1}^{d} 〈wc, wc〉
does not have any obvious geometric interpretation in terms of (target) margins.
Note that one may, depending on the loss function, want to insert constants in front
of each of the summands in both objective functions.
The situation is even worse for soft-margin classification. Usually one wants
to penalize geometric margin violations. In case of a single margin (as found in
the binary classification case) this is equivalent to minimizing functional margin
violations, up to a multiplicative constant (which is the norm of the hyperplane
normal vector w). However, in multi-class classification there is one such multi-
plicative constant (transforming functional into geometric margin violations) per
pair of classes. Nevertheless, even the relatively sophisticated all-in-one multi-class
machine classifiers treat all functional margin violations the same. It seems that
any systematic solution to this problem that is still solvable by quadratic program-
ming requires the introduction of O(d) or even O(d×d) additional hyperparameters
scaling the different slack variables, instead of the single complexity parameter C.
This would considerably complicate the model selection problem. Interestingly, the
OVA classifier suffers from a similar normalization problem, because there is no rule
to adjust the norms of the different weight vectors to compatible ranges. On the
other hand, MC-MMR gets around this problem relatively well.
6.5 Generalization Analysis
In [157], the empirical risk of multi-class SVMs is upper bounded in terms of the
mean of the slack variables. Based on this bound it is argued that the CS SVM
has advantages compared to the WW formulation because it leads to lower values
in the bounds. It is not clear if this argument is convincing. In general, one has
to be careful when drawing conclusions just from upper bounds on performance.
Further, the empirical error may only be a weak predictor of the generalization error
(in particular for large values of C). Apart from these general arguments, one arrives
at exactly the opposite conclusion when looking at generalization bounds. These
bounds are instructive, because they indicate why it may be beneficial to sum up
all margin violations in the multi-class SVM optimization problem. As an example,
I extend a bound on the generalization error of binary SVMs by Shawe-Taylor and
Cristianini [144] to the multi-class case in order to investigate the impact of the
different loss functions on the generalization performance. Let hc(x) = 〈wc, φ(x)〉.
After dropping the bias term in the WW machine the conceptual difference between
the WW and the CS approach is the loss function used to measure margin violations.
For a given training example (xi, yi) the WW machine penalizes the sum

∑_{c=1}^{d} [1 − δyi,c − (hyi(xi) − hc(xi))]+   (6.1)

of margin violations,1 while the CS machine penalizes the maximum margin violation

max_{c∈{1,...,d}} [1 − δyi,c − (hyi(xi) − hc(xi))]+ .   (6.2)

Here again the short notation [t]+ = max{0, t} is used.
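The difference between the two loss functions can be made concrete with a small sketch (illustrative Python, following the notation above with target margin ν = 1; the function names are chosen for illustration):

```python
def margin_violations(h, y, nu=1.0):
    """Per-class violations z_c = [nu - delta_{y,c} - (h[y] - h[c])]_+
    for one example with true label y and per-class scores h."""
    return [max(0.0, nu - (1.0 if c == y else 0.0) - (h[y] - h[c]))
            for c in range(len(h))]

def ww_loss(h, y):
    return sum(margin_violations(h, y))   # WW: sum of all violations

def cs_loss(h, y):
    return max(margin_violations(h, y))   # CS: largest violation only
```

For every example the WW loss dominates the CS loss, since the sum of the non-negative violations is at least their maximum; this is the per-example version of the one-norm argument used below.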
The basic idea of the analysis proposed in this thesis is the following: There
are d − 1 mistakes one can make per example xi, namely preferring class e over
the true class yi (e ∈ {1, . . . , d} \ {yi}). Each of these possible mistakes corresponds
to one binary problem (having a decision function with normal wyi−we) indicating
the specific mistake. One of these mistakes is sufficient for wrong classification
and no “binary” mistake at all implies correct classification. A union bound over
all mistakes gives the multi-class generalization result based on known bounds for
binary classifiers.
First, a fundamental result from [144] for binary classification problems with
labels ±1 will be restated. It bounds the risk under the 0-1 loss depending on the
fat shattering dimension (e.g., see [99, 5]) of the class of real-valued decision functions.
The margin violation of training pattern (xi, yi) is measured by zi = [ν − yih(xi)]+,
collected in the vector z = (z1, . . . , zℓ)ᵀ ∈ R^ℓ. Then:
1For simplicity we use the target margin ν = 1 proposed in [158] for the analysis. This makes the target margins of the two machines directly comparable.
Theorem 5 (Corollary 6.14 from [144]) Let F be a sturdy class of functions
h : X → [a, b] ⊂ R with fat shattering dimension fatF(ν). Fix a scaling of the output
range η ∈ R. Consider a fixed but unknown probability distribution on the input
space X. Then with probability 1 − δ over randomly drawn training sets T of size ℓ,
for all 0 < ν < b − a the risk of a function h ∈ F thresholded at zero is bounded by

ǫh = (2/ℓ) ( [fatF(ν/16) + 64D²] log₂(65ℓ(1 + D)³) · log₂(9eℓ(1 + D)) + log₂(64ℓ^1.5(b − a)/(δη)) )

with D = 2(√(‖z‖₁ · (b − a)) + η)/ν, provided ℓ ≥ 2/ǫh and there is no discrete
probability on misclassified training points.
If the logarithmic terms are ignored, the bound can be simplified to

ǫh ∈ O( (fat(ν/16) + ‖z‖₁/ν²) / ℓ ).
Now a union bound is used over the d(d− 1)/2 possible pairs of classes to transfer
this result to the multi-class case. For a more elaborate treatment of fat shattering
in the multi-class case see the literature [99, 81].
The training set is decomposed into subsets Tc = {(xi, yi) ∈ T | yi = c}, c ∈
{1, . . . , d}, according to the training labels, and their sizes are denoted by ℓc = |Tc|.
The natural extension of the margin violations to the loss functions used in the WW
and CS machines is zi,c = [ν − hyi(xi) + hc(xi)]+ for c ≠ yi. These values are
collected in vectors z^(c,e) ∈ R^(ℓc+ℓe) with entries zi,e and zi,c for i ∈ Tc and i ∈ Te,
respectively, for each fixed pair (c, e) of different classes. The vector z ∈ R^(ℓ×(d−1))
collects all margin violations.
separating the problem restricted to the classes c and e can be upper bounded by
ǫh^(c,e) ∈ O( (fat(ν/16) + ‖z^(c,e)‖₁/ν²) / (ℓc + ℓe) ),
and by a simple union bound argument the total generalization error is bounded by
ǫh ≤ ∑_{1≤c<e≤d} ((ℓc + ℓe)/ℓ) · ǫh^(c,e) .
The resulting upper bound for the multi-class case is
ǫh ∈ O( ( (1/2) d(d − 1) fatF(ν/16) + ‖z‖₁/ν² ) / ℓ ) ,   (6.3)
where the complexity of the class of R^d-valued functions f used for multi-class
classification is measured by the maximal fat shattering dimension of the real-valued
differences he − hc. The exact same technique can be applied directly to the term
ǫh as it appears in Theorem 5.
The primal problems of the WW and CS machines are formulated as

min (1/2) ∑_{c=1}^{d} 〈wc, wc〉 + C · ∑_{i=1}^{ℓ} L(zi,1, . . . , zi,d) ,

where the loss L is the sum of its arguments and the maximum, respectively. Let
z(WW) and z(CS) denote the margin violations for the WW and CS machine, re-
spectively. Then it is clear from the form of the primal optimization problems that
‖z(WW)‖1 ≤ ‖z(CS)‖1, as the one-norm is directly minimized in the WW formulation.
Thus, the generalization bound (6.3) is lower for the WW machine. As
noted before, comparing the performance of different machines by only using bounds
may lead to wrong conclusions. The best way to support any kind of bound is with
empirical results. This is one of the reasons an extensive empirical comparison is
given in Chapter 7; the results of this empirical comparison are in accordance with
the generalization bound presented above.
6.6 Universal Consistency of Multi-Class SVM
As discussed in Chapter 2, universal consistency of a classifier is an important topic.
Although using a universally consistent classifier does not guarantee the best performance
on limited data, it still provides hope for large-scale data sets. Recently, Glasmachers
[76] proved the universal consistency of the CS machine. Following Glasmachers's
work and applying very similar techniques, Dogan et al. [55] proved the universal
consistency of DGI. Although Glasmachers's work [76] does not contain universal
consistency proofs for WW and LLW, it implies their universal consistency. I will not
discuss the universal consistency of multi-class SVMs further; for a detailed discussion
please see Glasmachers [76].
6.7 Training Complexity
Table 6.1 gives the asymptotic training time of the algorithms under the
assumption that solving the different quadratic programs takes O(n^q) time, where n
is the number of variables and 2 ≤ q ≤ 3 is a constant [27]. This table indicates that
training MC-MMR is faster than training OVA, which is again faster than training
CS and WW, if the assumption holds.
Table 6.1: Asymptotic runtime of the training algorithms under the assumption that solving the different n-dimensional quadratic programs takes O(n^q) time (2 ≤ q ≤ 3). The number of training patterns is denoted by ℓ, the number of examples per class c by ℓc = |Sc|, and the number of different classes by d.
    OVA         MC-MMR                                   All-in-one Multi-Class SVMs
    O(dℓ^q)     O(∑_{c=1}^{d} ℓc^q) = O(d^(1−q) ℓ^q)     O((dℓ)^q)
The superior training speed of the MC-MMR method stems from the fact that
it does not take any cross-terms relating examples of different classes into account
at all. This may be an advantage on very large problems where training time is a
concern. In particular, the separability of the training problem makes it scale well
with the number of classes in the problem. Therefore, the method is an interesting
candidate for problems with lots of classes, at least from the training complexity
point of view. Increasing the number of classes d while keeping the total number of
training examples ℓ constant can even speed up the training procedure.
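This scaling behavior can be illustrated numerically. The sketch below instantiates the cost model of Table 6.1 (constants dropped, balanced classes ℓc = ℓ/d assumed); it is a toy model for comparing growth rates, not a runtime prediction.

```python
def training_cost(method, ell, d, q=2.5):
    """Asymptotic cost model from Table 6.1 (constants dropped), assuming
    an n-variable QP costs n**q and balanced classes ell_c = ell / d."""
    if method == "OVA":
        return d * ell ** q                # d binary SVMs on the full data set
    if method == "MC-MMR":
        return d * (ell / d) ** q          # = d**(1-q) * ell**q, separable per class
    if method == "all-in-one":
        return (d * ell) ** q              # one big QP over d*ell variables
    raise ValueError(method)
```

For q > 1 the MC-MMR cost even decreases as d grows at fixed ℓ, matching the observation above, while the all-in-one cost grows polynomially in d.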
Training an OVA classifier just amounts to training d binary SVMs on the full
data set, rendering the method tractable for most applications. In contrast, training
the all-in-one multi-class machines scales considerably worse with the number of
classes, restricting the applicability of these elegant all-in-one machines to small
numbers of classes. This is because these machines take all cross-terms into account,
considering the separation of all examples from all other classes at the same time
within a single big optimization problem.
Chapter 7
Empirical Comparison and
Applications
In the previous chapters, I have considered multi-class SVMs from a conceptual and
theoretical point of view. I have also derived a unified view on multi-class SVMs and,
using this unified view, I have proposed a novel multi-class machine. I have also proposed
new algorithms for training multi-class SVMs. In this final chapter, I empirically
compare the different approaches.
7.1 Preliminaries for Empirical Evaluation
The six multi-class SVMs are empirically compared on several standard benchmark
problems, and these methods are also applied to a variety of different problems. The
empirical comparison addresses the following questions:
• Which of the six multi-class methods has the best generalization performance?
• Does the generalization performance of the methods depend on their margin
concepts and loss functions?
• Does the generalization performance of the multi-class methods depend on
the problem at hand? In other words, is there a single multi-class SVM
method that always gives the best generalization performance?
• How much does one gain with respect to the number of iterations and the
training time by using S2DO and the proposed caching technique?
First, the experimental set-up is explained without giving further details of the
problems at hand. Second, the model selection methodology is described in Section
7.1.1. Finally, the related experiments and their results are presented and discussed
in Sections 7.2, 7.3 and 7.4.
To answer the first three questions, the generalization accuracy and training
time of the six different multi-class SVMs are evaluated on well-known benchmark
problems. Twelve data sets were taken from the UCI machine learning repository
[7]. Three real-world problems, namely traffic sign recognition, protein secondary
structure prediction and cancer classification, are also considered.
To answer the last question accurately, the machines are trained using both the
SMO solvers and the S2DO solvers described in Section 5. Further, the new solution
method for the CS quadratic program and the new caching strategy for the all-in-one
machines were employed. All of the methods were implemented using the Shark open
source machine learning library [92].
7.1.1 Model selection
In all experiments, Gaussian kernels k(x1, x2) = exp(−γ‖x1 − x2‖²) were used.
The bandwidths γ of the Gaussian kernels and the regularization parameter C
were determined using nested grid search, with 5-fold cross-validation as the model
selection criterion. Candidate parameters were evaluated on 5 validation subsets of
the available training data, and the configuration yielding the best average
performance was chosen. If more than one parameter configuration for γ or C gave
equal results, we selected the smallest γ and C. If the selected model was at the
boundary of the grid, we shifted the grid such that the former boundary value was
in the middle of the new grid. For each group of data sets a different hyper-parameter
search space is used for model selection. The details of the hyper-parameter search
spaces for all groups of data sets are given in the corresponding subsections.
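The grid search with boundary re-centering described above can be sketched as follows (illustrative Python; `cv_error` stands for the 5-fold cross-validation error estimate and is an assumed callback, and the tie-breaking prefers the smallest γ and C as in the text):

```python
def grid_search(cv_error, log_gammas, log_cs, max_shifts=5):
    """Grid search sketch: evaluate every (gamma, C) candidate with a
    cross-validation error function, break ties toward the smallest
    values, and re-center the grid when the winner lies on its boundary."""
    gammas, cs = list(log_gammas), list(log_cs)
    for _ in range(max_shifts):
        # tuple comparison breaks error ties by smaller gamma, then smaller C
        _, g, c = min((cv_error(g, c), g, c) for g in gammas for c in cs)
        step_g = gammas[1] - gammas[0]
        step_c = cs[1] - cs[0]
        shifted = False
        if g in (gammas[0], gammas[-1]):
            # re-center so the former boundary value is in the middle
            gammas = [g + step_g * (i - len(gammas) // 2) for i in range(len(gammas))]
            shifted = True
        if c in (cs[0], cs[-1]):
            cs = [c + step_c * (i - len(cs) // 2) for i in range(len(cs))]
            shifted = True
        if not shifted:
            return g, c
    return g, c
```

In the actual experiments the inner evaluation is the randomized 5-fold cross-validation described in the following sections; here it is abstracted into a single error function.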
7.1.2 Stopping Conditions
For a fair comparison of the training times of different types of SVMs, it is important
to choose comparable stopping criteria for the quadratic programming. Unfortunately,
this is hardly possible in the experiments presented in this thesis, because
the quadratic programs differ. However, in the case of just two classes, WW, CS,
LLW, DGI and OVA solve the same problem. Therefore, the stopping condition is
selected such that in the binary case the criterion agrees for these five methods. The
stopping condition described in Section 5 with the common threshold of ε = 10⁻³
is used. To rule out any possible artifact of this choice, all CS, DGI and MC-MMR
experiments were repeated with ε = 10⁻⁵; however, these results are not reported
in this thesis because there was no change in the accuracy of these methods.
For all machines, the maximum number of SMO iterations was limited to 10000
times the number of dual variables. If for some parameter configuration (γ,C)
a solver did not reach the desired accuracy within this budget of iterations, the
parameter configuration was discarded from the grid search. This was necessary to
keep the grid searches computationally tractable. However, the discarded parameter
configurations always corresponded to “degenerated” machines (i.e., bad solutions),
so this did not influence the outcome of the model selection process.
7.1.3 Statistical Evaluation
Several ways to compare multiple classifiers on multiple data sets have been proposed
in the literature [52, 49, 75]. Such a statistical comparison is not straightforward,
because one has to account for multiple testing. In this thesis the recommendation
in [75] is followed and non-parametric statistical tests are used in a step-wise
procedure. For each data set, the algorithms are ranked and then the average ranks
are computed. Then, the Friedman test is applied to check whether the ranks differ
from the mean rank. If so, whether two algorithms differ is determined by
pairwise post hoc comparison (using Bergmann-Hommel's dynamic procedure). The
significance level is fixed at α = 0.01. A detailed description of the test procedure can
be found in the literature [49, 75]. The open source software supplied by García
and Herrera [75] is used for the evaluation.
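The first two steps of this procedure, average ranks over data sets and the Friedman statistic, can be sketched as follows (illustrative Python; the post hoc Bergmann-Hommel procedure is omitted):

```python
def average_ranks(errors):
    """errors[i][j]: error of algorithm j on data set i. Returns the average
    rank of each algorithm (rank 1 = best; ties receive averaged ranks)."""
    n_alg = len(errors[0])
    total = [0.0] * n_alg
    for row in errors:
        order = sorted(range(n_alg), key=lambda j: row[j])
        ranks = [0.0] * n_alg
        i = 0
        while i < n_alg:
            k = i
            while k + 1 < n_alg and row[order[k + 1]] == row[order[i]]:
                k += 1                      # extend the tie group
            avg = (i + k) / 2.0 + 1.0       # average of 1-based positions i..k
            for t in range(i, k + 1):
                ranks[order[t]] = avg
            i = k + 1
        for j in range(n_alg):
            total[j] += ranks[j]
    return [t / len(errors) for t in total]

def friedman_statistic(errors):
    """Friedman chi-square statistic over N data sets and k algorithms."""
    n, k = len(errors), len(errors[0])
    r = average_ranks(errors)
    return 12.0 * n / (k * (k + 1)) * (sum(rj * rj for rj in r) - k * (k + 1) ** 2 / 4.0)
```

The statistic is compared against a chi-square distribution with k − 1 degrees of freedom (or Iman-Davenport's F correction) before the pairwise post hoc tests are run.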
7.2 Multi-class Benchmark Problems
To evaluate the multi-class SVM methods, twelve data sets from the UCI machine
learning repository [7] are used. The descriptive statistics of these data sets are given
in Table 7.1. In all data sets, all feature values are rescaled between 0 and 1, and this
Table 7.1: The descriptive statistics of the 12 UCI data sets. The column ℓtrn shows the number of training examples and the column ℓtest shows the number of test examples for the corresponding data set.

              #-classes    ℓtrn    ℓtest
Abalone           27       3133     1044
Car                4       1209      519
Glass              6        149       65
Iris               3        105       45
Letter            26      14000     6000
Isolet            26       6238     1559
OptDigits         10       3823     1797
Page Blocks        5       3831     1642
Sat                7       4435     2000
Segment            7       1617      693
SoyBean           19        214       93
Vehicle            4        592      254
rescaling is done based on the training data only. First, γ ∈ {2^(−12+3i) | i = 0, 1, . . . , 4}
and C ∈ {2^(3i) | i = 0, 1, . . . , 4} are varied. Except for Abalone, Letter and Isolet, the
training data are randomized 10 times before the experiments are run. The cross-validation
error is stored for all grid points; that is, due to the randomization, 10 different
cross-validation errors are stored for each grid point. For each grid point the
median of the 10 values is picked as the final cross-validation error, and the best
configuration (γs, Cs) is selected from the 5 × 5 grid. Then, a second grid search is
performed over the hyperparameters γ ∈ {γs 2^i | i = −2,−1, . . . , 2} and
C ∈ {Cs 2^i | i = −2,−1, . . . , 2}, applying the same randomization procedure in the
second search. This randomization procedure
makes the model selection robust against small-sample problems and also against
any kind of artifacts related to 5-fold cross-validation.
Table 7.2: Best hyperparameters found in the model selection procedure for OVA, MC-MMR and WW.
                 OVA           MC-MMR          WW
              log γ  log C   log γ  log C   log γ  log C
Abalone          0    -12       0    -14       0     -3
Car             -2      5       0      4      -2      5
Glass           -3      2       0     -1       1     -4
Letter          -2      4       0      0      -2      1
Isolet         -10      4      -7      3      -9      3
Iris           -12     18      -1      0      -9      9
OptDigits       -5     10      -3      0      -5      1
Page Blocks     -9     20       3     -8      -4      8
Sat             -1      4       0      0      -1      2
Segment         -5     10       1      0      -4      7
SoyBean         -7      3      -3      0      -6      1
Vehicle         -8     12      -1      0      -7     10
Table 7.3: Best hyperparameters found in the model selection procedure for CS, LLW and DGI.
                 CS            LLW            DGI
              log γ  log C   log γ  log C   log γ  log C
Abalone          1     -2       0     -6       0     -5
Car             -2      5      -2      5       0      5
Glass           -3      2      -3      1       3     -6
Letter          -2      2      -3      0       3     -6
Isolet         -12      4     -10      4      -6    -14
Iris            -6      6      -4      5       0      0
OptDigits       -6      5      -6     10      -2    -14
Page Blocks     -5     11      -7     16       3     -9
Sat             -1      2      -1      2       1    -14
Segment         -9     12      -4     15       3    -10
SoyBean         -6      3      -7      4       0    -12
Vehicle         -7     10      -7     11      -1      3
The selected hyperparameters for the OVA, MC-MMR and WW methods are given in
Table 7.2, and the hyperparameters of CS, LLW and DGI are given in Table 7.3. The
classification accuracies, in percent, of the OVA, MC-MMR and WW SVMs together
with 1-NN are given in Table 7.4. The classification accuracies, in percent, of the CS,
LLW and DGI SVMs together with 1-NN are given in Table 7.5. The 1-NN
classification accuracy is supplied for each data set in order to have a baseline on
the data sets.
The S2DO solver is compared with the SMO solver for WW and LLW using the optimal
hyperparameters. This comparison is needed for two reasons. The first is to decide
whether the newly proposed S2DO solver is better than SMO. The second is for
Table 7.4: Classification accuracies of OVA, MC-MMR, WW and 1-NN. In each row, bold numbers show the best classification accuracy on the data set.

               OVA    MC-MMR     WW    1-NN
Abalone      26.72    26.72   26.05   19.35
Car          98.07    96.72   98.07   90.17
Glass        69.23    70.77   72.31   55.38
Letter       97.33    96.10   97.43   93.17
Isolet       96.41    94.55   96.54   89.03
Iris         93.33    91.11   95.56   77.78
OptDigits    97.61    96.88   97.61   96.27
Page Blocks  93.12    93.42   93.42   91.53
Sat          91.35    91.05   92.35   90.15
Segment      96.25    96.10   96.39   89.90
SoyBean      92.47    88.17   90.32   87.10
Vehicle      83.46    68.90   84.25   66.93
Table 7.5: Classification accuracies of CS, LLW, DGI and 1-NN. In each row, bold numbers show the best classification accuracy on the data set.

                CS      LLW     DGI    1-NN
Abalone      22.32    26.82   26.82   19.35
Car          98.07    98.46   96.53   90.17
Glass        70.77    72.31   70.77   55.38
Letter       97.27    96.98   95.22   93.17
Isolet       96.15    96.86   92.37   89.03
Iris         95.56    95.56   95.56   77.78
OptDigits    97.50    97.89   96.66   96.27
Page Blocks  92.45    93.30   92.94   91.53
Sat          92.40    92.30   90.50   90.15
Segment      96.39    96.83   95.96   89.90
SoyBean      92.47    92.47   86.02   87.10
Vehicle      81.50    84.25   65.75   66.93
having a fair comparison of the training times of the multi-class methods on the
benchmark problems. For data sets with a training time of less than 100 seconds,
the experiments are repeated 10 times and the median is taken as the final training
time. This procedure prevents any kind of hardware or operating system related
artifacts. The main results of this comparison are summarized in Table 7.6 and
Table 7.7. In the presented experiments, S2DO was statistically significantly better
than SMO with respect to training time and number of iterations. The time taken
by one S2DO iteration was roughly equivalent to that of two SMO iterations.1
S2DO is used for WW and LLW. Table 7.8 shows the training time requirements
of the OVA, MC-MMR and WW methods for each data set using the optimal
1The Iris data set is an exception from this rule of thumb. With 105 training examples and three classes, it is the smallest data set in the benchmark suite used in this study. The SMO algorithm performed several fast shrinking and unshrinking operations, while S2DO performed none because it solved the problem so quickly. Thus, each S2DO iteration considered the complete set of variables, whereas most SMO iterations considered only subsets. Therefore, a single SMO iteration took less time on average. However, SMO needed many more iterations.
WW
                   SMO                     S2DO
               #iter       time        #iter       time
Abalone        92705      361.084      40514      319.213
Car            15309        0.847       2973        0.727
Glass            742        0.048        372        0.037
Letter       2968967     1349.440    1564183      791.111
Isolet       4190876      652.100    1948607      340.225
Iris          146387        0.153        554        0.022
OptDigits      24102       58.799      10419       76.952
Page Blocks  1037684       29.518      93251       10.604
Sat            59495      104.640      22001       95.857
Segment       206149        9.378      17782        4.720
SoyBean         7073        0.561       1627        0.492
Vehicle      1391588       18.131     203840        6.286

LLW
                    SMO                      S2DO
                #iter        time        #iter        time
Abalone         122853      671.257      52501      492.611
Car             130199        9.859      31360        6.370
Glass            38030        1.082       5475        0.477
Letter        12581295    16652.621    6724447    10128.417
Isolet        41908763    65812.100   19486076    37462.100
Iris             21145        0.065       1697        0.049
OptDigits       529247      532.024     195362      520.406
Page Blocks  693478729    34078.329  381032269    25258.837
Sat             219895      191.136      95643      276.002
Segment       55210740     6507.105   19496762     5161.155
SoyBean         728480       66.255     214096       51.840
Vehicle       16517891      565.718    1743176      163.347
Table 7.6: Training time and number of iterations needed for solving the WW (top) and the LLW (bottom) multi-class SVMs using decomposition algorithms with working sets of size one (SMO) and two (S2DO). The training times are given in seconds along with the number of iterations of the decomposition algorithms needed by the all-in-one SVMs.
parameters, and Table 7.9 shows the training time requirements of the CS, LLW and
DGI methods for each data set using the optimal parameters.
7.2.1 Summary of Results
The classification accuracy results show that the LLW method is superior to all other
methods. In order to have a more precise comparison, the statistical evaluation
method explained in Section 7.1.3 is used in a hierarchical way. It has the
following steps:
• Given Tables 7.4 and 7.5, rank the methods.
• Ignore the column corresponding to the best method and rank the remaining
methods again.
• Stop when all the methods are ranked.
The ranking results of the hierarchical evaluation are given in Table 7.10.
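The steps above can be sketched as follows (illustrative Python; `errors[i][j]` holds the error of method j on data set i, e.g. one minus the accuracies from Tables 7.4 and 7.5, and the function name is chosen for illustration):

```python
def hierarchical_ranking(errors, names):
    """Hierarchical evaluation sketch: compute average ranks across data
    sets, record the best method, drop its column, and re-rank the rest."""
    def row_ranks(row):
        # averaged ranks with tie handling; rank 1 = smallest error
        return [1 + sum(v < x for v in row) + (sum(v == x for v in row) - 1) / 2.0
                for x in row]

    names = list(names)
    errors = [list(r) for r in errors]
    order = []
    while names:
        n_alg = len(names)
        avg = [sum(row_ranks(row)[j] for row in errors) / len(errors)
               for j in range(n_alg)]
        best = min(range(n_alg), key=lambda j: avg[j])
        order.append(names.pop(best))
        for row in errors:
            del row[best]          # ignore the column of the best method
    return order
```

Re-ranking after removing the winner matters because average ranks of the remaining methods can change once the dominating column is gone.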
                WW (SMO/S2DO)             LLW (SMO/S2DO)
              Iter Ratio  Time Ratio    Iter Ratio  Time Ratio
Abalone           2.29       1.13           2.34       1.36
Car               5.15       1.16           4.15       1.55
Glass             1.99       1.30           6.95       2.27
Letter            1.90       1.71           1.87       1.64
Isolet            2.15       1.92           2.15       1.76
Iris            264.24       6.80          12.46       1.32
OptDigits         2.31       0.76           2.71       1.02
Page Blocks      11.13       2.78           1.82       1.35
Sat               2.70       1.09           2.30       0.69
Segment          11.59       1.99           2.83       1.26
SoyBean           4.35       1.14           3.40       1.28
Vehicle           6.83       2.88           9.48       3.46
Table 7.7: The ratios of the training times and numbers of iterations needed for solving the WW (left) and the LLW (right) multi-class SVMs using decomposition algorithms with working sets of size one (SMO) and two (S2DO).
Table 7.8: Training time requirements of OVA, MC-MMR and WW for optimal parameters.
             OVA       MC-MMR    WW
Abalone       59.628     4.142   319.213
Car            0.193     0.767     0.727
Glass          0.181     0.023     0.037
Letter       332.942   114.503   791.111
Isolet       433.881    78.068   340.225
Iris           0.035     0.007     0.022
OptDigits      6.450    16.681    76.952
Page Blocks  252.218    10.271    10.604
Sat           15.265    15.158    95.857
Segment        0.982     1.201     4.720
SoyBean        0.018     0.047     0.492
Vehicle        0.849     0.170     6.286
The classification performance of the methods on the benchmark data sets has been discussed. To complete the picture, the training times of the considered methods on these data sets should also be taken into account. From Table 7.8 and Table 7.9 it is clear that LLW is the slowest machine. Training the CS machines was slower than training WW in eight of the benchmarks. Hence, a statistical comparison supports a significant difference between CS and WW in terms of training complexity. Further, the accuracy of WW was statistically significantly superior to that of CS. The OVA approach scales linearly with the number of classes d, while the all-together methods are in ω(d). The MC-MMR method is the fastest method because it is basically an assembly of d one-class machines. The asymptotic properties of these machines are given in Chapter 6. Unfortunately, all relatively fast machines, i.e. OVA and MC-MMR, yielded hypotheses with a statistically significantly worse accuracy.
Table 7.9: Training time requirements of CS, LLW and DGI for optimal parameters
              CS         LLW         DGI
Abalone         43.844     492.611    36.309
Car              2.207       6.370    53.181
Glass            0.192       0.477     0.132
Letter         972.047   10128.417   On Run
Isolet         139.237   37462.100   367.796
Iris             0.107       0.049     0.226
OptDigits       45.803     520.406    31.003
Page Blocks   4438.124   25258.837    25.081
Sat            108.612     276.002    29.699
Segment        447.566    5161.155     3.431
SoyBean          1.710      51.840     0.130
Vehicle         26.416     163.347     5.249
Table 7.10: The results of the hierarchical statistical evaluation method.
Rank   Method/Methods
1      LLW
2      WW
3      CS and OVA
4      MC-MMR and DGI
7.3 Traffic Sign Recognition
Automatic camera-based traffic sign recognition plays an important role for driver
assistance systems as it can help increase safety and comfort. From a technical
point of view, one can declare that the problem is solved to a degree that allows for
first technical applications in everyday life. Still, many research questions remain,
for example in the choice of appropriate features and classifiers and how feature
extraction and classification depend on each other. Although various approaches
to feature extraction and classification have been proposed in the domain of traffic
sign recognition, a systematic comparison is missing. New algorithms are typically
evaluated on data sets that are not publicly available and often not compared to
alternative methods.
In this study, the recognition (and not the detection) of traffic signs, which is a
multi-class classification problem, is considered. In real-world applications, there is
restricted computational time available for this task and this time has to be shared
between feature extraction and classification. Therefore, in this thesis it is argued
that it is not possible to calculate highly sophisticated features and to use a complex
classifier at the same time.
There are two conflicting hypotheses. The first is that appropriate features make the classification problem easier, so that the discriminative power of the classifier is less important. The second is that a sophisticated classifier is required when feature extraction is less elaborate (using raw image data as an extreme case). To test these hypotheses, the performance of different combinations of feature extraction and classification algorithms is evaluated.
7.3.1 Related work
Many different approaches have been proposed for detection, classification, and
tracking of traffic signs in video sequences. In the following, a short overview of recent publications will be given, focusing on the techniques used for feature extraction and classification in each case.
Miura et al. [120] present an active vision system for traffic sign recognition.
After detection, a nearest neighbor approach based on normalized cross-correlation
is employed for classification. Vicen-Bueno et al. [161] consider different techniques
for preprocessing of images and compare a nearest neighbor classifier and multi-layer
neural networks for classification.
In the approach described in [109], feature vectors are obtained from shape
information and a linear SVM is employed for classification. A similar approach
also using SVMs for classification is presented in [118]. Support vector machines
were also employed within a similar two-stage setup by Fleyeh and Dougherty [70].
A popular framework for real-time object detection based on Haar wavelet fea-
tures was proposed by Viola and Jones [162]. This method is widely adopted for
traffic sign detection. Bahlmann et al. [9] proposed to use this framework for detec-
tion, additionally considering different color channels. For classification, a Bayesian
framework was used where feature vectors are obtained as most discriminative basis
vectors found by linear discriminant analysis (LDA). The same idea is considered
in [103]. The cascaded detection was also used in [11], based on special Haar-like fea-
tures called dissociated dipoles there. For classification, an error-correcting output
code (ECOC) was employed.
Torresen et al. [156] and Moutarde et al. [121] proposed to classify single digits
for classification of speed limits after detection and appropriate segmentation.
Muhammad et al. recently published a survey and experimental study of differ-
ent approaches for traffic sign recognition [122]. For classification, they considered,
amongst others, an SVM implementation and a nearest neighbor like algorithm.
They made their data set (containing 1300 examples) publicly available. It was the
first data set that could be used for systematic comparison. Nevertheless, as they
focused on classification only, they provide preprocessed data. Therefore this data
set cannot be used to evaluate and compare different feature extraction approaches.
7.3.2 Features
In this thesis three different types of features, which are briefly introduced in this
section, are used in the related experiments.
7.3.2.1 Raw image data
As a baseline for comparison, the performance of all classifiers is evaluated on raw image data. The data set used in this study (see Section 7.3.3) contains 8-bit grayscale images scaled to a fixed size of 32×32 pixels, and these images are used here directly.
7.3.2.2 Haar wavelet features
Haar wavelet features are state-of-the-art for real-time computer vision. Their pop-
ularity is mainly based upon the efficient computation using the integral image
proposed by Viola and Jones [162]. Haar wavelet features were successfully applied
in many computer vision applications, especially for object detection, classification,
and tracking [162, 126, 135, 80]. Figure 7.1 shows examples of six basic types of
Haar Wavelet features that can be used to detect different types of edges. Their re-
sponses can be calculated with 6 to 9 look-ups in the integral image, independently
of their absolute sizes.
Figure 7.1: Basic types of Haar wavelet features.
It has been shown that, provided appropriate Haar wavelet features (e.g., found by cascaded AdaBoost or created by evolutionary optimization), simple classifiers can achieve state-of-the-art performance in different tasks under real-time constraints [162, 135, 61].
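The integral-image trick behind this constant-time evaluation can be sketched as follows. This is a minimal illustration, not code from the detector of [162]; the function names and the choice of a two-rectangle vertical edge feature are assumptions of this sketch.

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero row/column prepended, so that the
    sum over any axis-aligned rectangle needs only four look-ups."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def box_sum(ii, y0, x0, y1, x1):
    """Sum of pixels in the rectangle [y0, y1) x [x0, x1)."""
    return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]

def haar_two_rect_vertical(ii, y, x, h, w):
    """Two-rectangle 'edge' feature: left half minus right half.
    Two adjacent boxes share a border, hence six look-ups in total."""
    left = box_sum(ii, y, x, y + h, x + w // 2)
    right = box_sum(ii, y, x + w // 2, y + h, x + w)
    return left - right
```

The cost of evaluating the feature is independent of its size, which is exactly why these features are attractive under real-time constraints.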
7.3.2.3 Histograms of Oriented Gradient
Histograms of Oriented Gradient (HOG) descriptors have been proposed by Dalal
and Triggs [46] for pedestrian detection. Based on gradients of color images, dif-
ferent weighted and normalized histograms are calculated: first for small cells that
cover the whole image and then for larger blocks that integrate over cells.
Using a linear SVM classifier based on HOG features, state-of-the-art perfor-
mance can be achieved, for instance, for pedestrian classification [46, 61].
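The core of the HOG computation, gradient orientation histograms accumulated per cell, can be sketched as below. This is a deliberately simplified illustration: the block grouping, weighting, and normalization steps of Dalal and Triggs [46] are omitted, and the function name and defaults are assumptions of this sketch.

```python
import numpy as np

def hog_cells(gray, cell=8, bins=9):
    """Simplified HOG: per-cell gradient orientation histograms
    ('semi-circle' variant, orientations folded into [0, 180) degrees)."""
    gray = gray.astype(np.float64)
    gy, gx = np.gradient(gray)                   # image gradients
    mag = np.hypot(gx, gy)                       # gradient magnitude
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0 # unsigned orientation
    h, w = gray.shape
    hist = np.zeros((h // cell, w // cell, bins))
    for cy in range(h // cell):
        for cx in range(w // cell):
            sl = (slice(cy * cell, (cy + 1) * cell),
                  slice(cx * cell, (cx + 1) * cell))
            # bin index per pixel, magnitude-weighted accumulation
            idx = np.minimum((ang[sl] / (180.0 / bins)).astype(int), bins - 1)
            np.add.at(hist[cy, cx], idx.ravel(), mag[sl].ravel())
    return hist.ravel()
```

In the full descriptor, neighbouring cells would additionally be grouped into overlapping blocks and each block normalized, which is what makes HOG robust to local illumination changes.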
7.3.3 Benchmark Data
For data collection, a Prosilica GC750C camera was used together with a Pentax C30811KPC815 objective and a spacer ring, resulting in an opening angle of 60°. Automatic exposure control was used; therefore the frame rate changed dynamically, mostly being 30 fps or more. The camera images had a size of 752×480 pixels and were stored in raw Bayer-pattern format. Recording was performed while driving in different urban regions during daytime in good weather.
The sequences were labelled semi-automatically: the first occurrence of a traffic sign was marked manually. Then a simple tracking algorithm based on normalized cross-correlation was employed until the sign disappeared from the camera's field of view. This semi-automatic procedure was chosen intentionally in order to generate variability (translation, change of relative size and position, partial misses) typical of real-world systems.
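The normalized cross-correlation at the heart of such a simple tracker can be sketched as follows. This is an illustrative reimplementation, not the labeling tool actually used for the data set; the function names and the exhaustive local search are assumptions of this sketch.

```python
import numpy as np

def ncc(patch, template):
    """Zero-mean normalized cross-correlation of two equally sized
    patches; returns a score in [-1, 1]."""
    a = patch - patch.mean()
    b = template - template.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0

def track_step(frame, template, y, x, radius=5):
    """One semi-automatic tracking step: search the neighbourhood of the
    previous position (y, x) for the best NCC match of the template."""
    th, tw = template.shape
    best_score, best_pos = -2.0, (y, x)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            yy, xx = y + dy, x + dx
            if yy < 0 or xx < 0 or yy + th > frame.shape[0] or xx + tw > frame.shape[1]:
                continue  # candidate window would leave the frame
            score = ncc(frame[yy:yy + th, xx:xx + tw], template)
            if score > best_score:
                best_score, best_pos = score, (yy, xx)
    return best_pos, best_score
```

Because NCC is invariant to affine brightness changes, such a tracker tolerates the illumination variation that naturally occurs between consecutive frames.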
Examples of non-traffic signs were not labelled, because one may argue that they should not be chosen independently of the detection algorithm considered, which is not within the scope of this study.
Labelled traffic sign examples from the sequences were converted both to 8-bit grayscale and to RGB, scaled to 32×32 pixels, and stored in PGM and PPM file format, respectively. To simplify matters, a nested directory structure is used to sort examples according to their classes and to pool examples derived from the same instance.
In total, 60 instances of traffic signs from 7 different classes have been labeled,
resulting in a total number of 3977 examples. The smallest images are 22×22, the
largest 87×87. A human viewer can classify all example images without doubt.
Some randomly chosen examples, which are from the data set at hand, are shown
in Table 7.11.
Table 7.11: Example images from the used traffic sign database.
A very important issue is how to split the examples into training and test sets. If images derived from the same traffic sign instance in the same driving situation (but from different frames) occurred in both sets, classification of unseen examples would become notably easier. Although this statement may sound trivial, not all data sets used for the evaluation of computer vision algorithms follow this principle.
All data was grouped based on instances; one half of the instances belonging to each category was then selected as the training set, and the remaining instances were used for testing. The descriptive statistics of the final benchmark data set are given in Table 7.12. There are 1929 training and 2048 test examples.
7.3.4 Experiments and Results
In this section, the setup (Section 7.3.5) and results (Section 7.3.6) of the experi-
ments are described.
Table 7.12: Properties of the benchmark data set. The first column indicates the class. The second and fourth columns give the number of different instances in the training and test set, respectively. The third and fifth columns show the number of examples in the training and test set, respectively. The number of examples is larger than the number of instances, because a single traffic sign appears in several consecutive frames.
traffic   #Training   #Training   #Test       #Test
sign      instances   examples    instances   examples
1 54 1 84
4 172 3 378
8 455 8 727
8 442 8 400
5 290 4 198
3 169 2 63
3 347 2 198
7.3.5 Setup
7.3.5.1 Feature calculation.
For the HOG feature calculation, the code supplied by the original authors [46] is used. The features were computed on the RGB images scaled to 24×24. This smaller size was favored in order to have more options for evenly dividing the image into cells. Different parameters were tried for calculating the HOG descriptors, but the reported results belong only to the two best-performing settings, referred to as HOGA and HOGB, respectively. See Table 7.13 for a detailed list of the parameters used.
For Haar wavelet features, several sets containing variable numbers of differently parametrized features (basic types and sizes) were tested. For the final experiments, two sets (see Table 7.13) containing roughly 100 and 1000 features were selected that showed typical effects. We refer to these sets as HaarA and HaarB. All images are scaled to 24×24 for the calculation of HOG features; the same size is used when calculating Haar features. However, grayscale images were used here instead of color images, because basic Haar features do not consider color information.
7.3.5.2 SVM model selection.
The model selection procedure is explained in Section 7.1.1. The initial search space for the hyperparameters was γ ∈ {2^(−18+3i) | i = 0, 1, . . . , 4} and C ∈ {2^(3i) | i = 0, 1, . . . , 4}.
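The initial grid of (γ, C) candidates can be written out directly. The refinement helper below is an assumption about how a nested grid search such as that of Section 7.1.1 might proceed, not its actual implementation.

```python
import itertools

# Initial grid: gamma in {2^(-18+3i) | i = 0..4}, C in {2^(3i) | i = 0..4}.
gammas = [2.0 ** (-18 + 3 * i) for i in range(5)]
cs = [2.0 ** (3 * i) for i in range(5)]
grid = list(itertools.product(gammas, cs))  # all 25 (gamma, C) pairs

def refine(best_log2_gamma, best_log2_c, step):
    """One refinement step of a nested grid search: a 3x3 sub-grid in
    log2 space around the best point found so far (an assumed scheme)."""
    return [(2.0 ** (best_log2_gamma + dg), 2.0 ** (best_log2_c + dc))
            for dg in (-step, 0.0, step) for dc in (-step, 0.0, step)]
```

Searching in powers of two keeps the grid coarse on an absolute scale while covering several orders of magnitude of both hyperparameters.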
HOGA   cell size 8×8 pixels, block size 16×16 pixels, block stride 8×8,
       9 orientation bins, semi-circle                                   144
HOGB   cell size 6×6 pixels, block size 12×12 pixels, block stride 6×6,
       9 orientation bins, full circle                                   324
HaarA  Features used: horizontal 8×8 pixels, vertical 8×8 pixels.
       Features were calculated at every second pixel in both
       dimensions.                                                        98
HaarB  Features used: horizontal 4×4 pixels, vertical 4×4 pixels,
       horizontal 8×8 pixels, vertical 8×8 pixels, horizontal bar 6×6,
       vertical bar 6×6, diagonal 6×6, and 6×6 center-surround.
       Features of size 4×4 were calculated at every pixel, the other
       features at every second pixel in both dimensions.               1002
Table 7.13: Parameters used for feature extraction. The last column gives the dimension of the resulting feature vectors.
7.3.6 Results
The results of the experiments are shown in Table 7.14, which summarizes the test
errors of the different combinations of feature extraction and classification methods.
In addition to six different SVMs and 1-NN, linear discriminant analysis (LDA) [85] is also tested.
         Raw     HOGA    HOGB    HaarA   HaarB
OVA      55.178  90.283  91.748  80.908  80.518
MC-MMR 55.277 72.631 77.683 75.637 75.599
CS 72.836 88.525 92.334 84.033 83.936
WW 71.986 88.916 92.285 85.986 82.129
LLW 74.318 89.871 93.814 89.320 86.114
DGI 60.829 80.318 81.194 80.965 70.782
1-NN 55.978 73.486 85.840 90.234 82.031
LDA 45.623 90.332 87.744 80.859 80.566
Table 7.14: The first column lists the different types of classifiers evaluated in this study and the first row lists the feature sets. The remaining cells show the accuracy of each classifier-feature pair.
Regarding features, HOG features were more suitable for the traffic sign recognition problem, as they gave the best results for nearly all classifiers considered in this study. Haar wavelet features allowed for similar performance only when used in conjunction with a nearest neighbour classifier.
Regarding classifiers, in most cases the SVMs showed the best accuracies independent of the features used. The all-in-one SVMs (CS and WW) performed on par, but outperformed OVA. The simple classifiers LDA and NN yielded performances similar to those of the SVMs when applied to the right features. The LDA worked
well in conjunction with HOG descriptors and NN in conjunction with the smaller
set of Haar features (HaarA). The larger set of Haar features (HaarB) did not allow
for good performance independently of the classifier used. This is probably due to
over-fitting as the number of features is relatively high compared to the number of
training examples.
7.3.7 Summary of Results
Recognition of traffic signs is an important real-world application in the context of
driver assistance systems and at the same time an interesting academic problem.
As there are almost no survey studies, very few systematic comparisons, and only a single publicly available data set, there is a lack of understanding of the characteristics of different solutions.
In the work presented in this thesis, the classification of traffic signs was the
main topic. This multi-class classification task typically has to be solved under
strict time constraints. Therefore, choosing a trade-off between highly sophisticated
feature calculation and complex classification methods becomes necessary.
In the experiments, the performance of different combinations of feature extrac-
tion and classification techniques were compared. In particular, feature extraction
approaches typically used for real-time applications such as Histograms of Oriented
Gradients (HOG) and Haar wavelet features were considered. For classification,
LDA, 1-NN, and different types of multi-class SVMs were used.
Results showed that the most sophisticated classifiers considered in this study, the SVMs, always yielded the highest classification performance independently of the
type of features used. The “true” (all-in-one) multi-class SVMs outperformed the
one-versus-all approach and achieved classification accuracies larger than 70% even
on raw image data and exceeding 93% when applied to HOG features.
However, combinations of simpler classifiers and appropriate features achieved high performances that may be sufficient for real-world applications. In particular, the
fast (e.g., compared to non-linear SVMs) discriminant analysis (LDA) applied to
HOG features achieved an accuracy of more than 90%, while performing consid-
erably worse on Haar features. In contrast, nearest neighbor was the best overall
choice for Haar features, but performed considerably worse when using HOG. This
underlines the complex interplay between features and classifiers.
7.4 Multi-class Problems in Bioinformatics
Biological data is produced astonishingly fast [132]. For instance, the GenBank repository of nucleic acid sequences contained 8,214,000 entries in 1999 [13], while in 2007 the same repository contained 108,431,692 entries [12]. That is, in 2007 the repository was approximately 13 times larger than in 1999. This rate of increase forced the cooperation of different fields of science. One of the resulting fields is called bioinformatics, which can be defined as making sense of biological data with the help of computational tools and statistics.
Bioinformatics is a huge field of science with numerous applications and unsolved problems [63]. In this thesis, I consider multi-class classification problems in bioinformatics and pick two relevant problems, namely cancer classification, explained in Section 7.4.1, and protein fold recognition, explained in Section 7.4.3.
7.4.1 Cancer Classification and Diagnosis with Microarray
Gene Expression
Cancer is a family of diseases in which a group of cells shows uncontrolled behaviour: uncontrolled growth (replicating beyond the usual limits), invasion and destruction of nearby tissues, and even spreading to other locations in the body. According to the World Health Organization's World Cancer Report [30], each year approximately 13% of all deaths worldwide are caused by cancer. In 2002 approximately 7.6 million people died of cancer, and it is estimated that approximately 25 million people will die of cancer by 2030 [30]. It is clear from past records and from projections of cancer rates that cancer treatment was, is, and will remain one of the major challenges of modern biology, medicine and bioinformatics.
To deal with cancer, researchers are continuously developing new tools and methods. One of the promising tools is the use of DNA microarrays and their statistical analysis for cancer research [65, 123]. A replica of a DNA microarray is illustrated in Figure 7.2. In the following, a brief explanation of DNA microarrays, the definition of cancer classification with microarray data, and the performance of the methods considered in this thesis will be given.
Figure 7.2: Illustration of a microarray sample. Note that this illustration is not derived from real data; it is just a cartoon.
A DNA microarray is a chip [136] that contains an arrayed series of microscopic spots, each containing a specific DNA sequence (probe). DNA microarrays inform us how similar these DNA sequences are to a known DNA sequence of interest (target). When the target is added to the probe, complementary nucleic acid sequences pair with each other, forming hydrogen bonds. If the probe is complementary to the target, more chemical bonds will form. This process is called probe-target hybridization, and the degree of hybridization is measured by fluorescence or other imaging techniques [16].
The analysis of microarray data is a challenging task in bioinformatics and requires sophisticated statistical methods [34, 150, 16] for several reasons:
• Microarray data is high dimensional (modern microarray data contain more than several thousand dimensions).
• The number of examples in microarray experiments is generally low, i.e., fewer than 250 examples.
• Microarray data contain noise due to measurement errors.
• The relationship between the level of hybridization and the quantization method is unknown and non-linear.
Despite the mentioned problems in the analysis of DNA microarray data, its usage in cancer research is very promising [110, 115, 88]. However, the use of microarray data in clinical practice, outside scientific research, is rare (I am not aware of any application of microarray data other than for research purposes). Nevertheless, it can be assumed that in the near future microarray data will be used in ordinary clinics. To reach this level of applicability, two things need to be understood: first, the relation between cancer types and individual genes; second, the advantages and disadvantages of classifiers on these microarray data sets.
In this thesis, the relation between cancer types and individual genes will not be analysed. Such relations are analysed in machine learning under the topic of feature/variable selection, and the interested reader is referred to Guyon et al. [82]. The focus of this section is the performance of multi-class SVMs on microarray cancer data. Data sets also used in a well-known comparison paper of multi-category classifiers on cancer microarray data [147] will be used. The data sets are given in Table 7.15. The main differences between this comparison and that of Statnikov et al. [147] should be clarified.
Dataset Name    ℓ     ℓtst   d    #features   ℓ/d
9 Tumors        42    18     11    5726        3.82
14 Tumors       216   92     9    15009       24.00
Brain Tumor 1   63    27     5     5920       12.60
Brain Tumor 2   35    15     4    10367        8.75
Leukemia1       50    22     3     5327       16.67
Leukemia2       50    22     3    11225       16.67
Lung Cancer     142   61     5    12600       28.40
SRBCT           58    25     4     2308       14.50
Table 7.15: Description of the cancer microarray data sets used in this study. The column ℓ shows the number of training examples, ℓtst the number of test examples, d the number of classes, #features the dimension of the input space, and ℓ/d the average number of training examples per class.
The first difference is that Statnikov et al. [147] used many different types of classifiers, such as neural networks, decision trees, and even CS and WW. However,
Statnikov et al. [147] clearly stated that the theoretically elegant LLW method was not tested on the data sets because there was no efficient solver for LLW, and they added that it would be fruitful to include LLW in such a comparison. The second difference is that it is not clear whether WW and CS were compared fairly, because of the bias term. The last important difference is model selection. Statnikov et al. [147] used two different model selection procedures: the first basically relies on 10-fold cross-validation with grid search, and the second is basically a leave-one-out (loo) strategy. It is a well-known fact that the loo strategy is approximately unbiased for the prediction error. However, it has a high variance, since the ℓ training data sets used for model selection are highly similar [85]. Recently, Klement et al. [106] showed that for hard-margin SVMs in which the dimension of the feature space is much higher than the number of training examples, the estimated loo error is 1. Although their results do not directly imply that the loo strategy is unsuitable for soft-margin SVMs, the model selection procedure should be chosen with these flaws of the loo strategy in mind. However, even the cross-validation strategy may have some flaws. The last column of Table 7.15 indicates that in one third of the data sets the average number of training examples per class is less than the number of folds. Moreover, in seven data sets the number of training examples per class is less than twice the number of folds. These facts mean that 10-fold cross-validation may have a high variance at the grid points. The last difference is that they used polynomial kernels with a fixed degree, and their grid for C contained only four values.
After considering all these issues related to model selection and the small sample problem, I believe that using 5-fold cross-validation for model selection is a suitable strategy. To prevent artifacts of small sample problems, the training data is randomized 10 times and the machines are trained on these 10 randomized training sets. For each grid point, the median performance over the 10 repetitions is taken as the performance of the method (smoothing of the cross-validation error surface). This is an important difference, because the model selection strategy directly affects the performance of the learning machine. The number of training samples in microarray cancer data sets is small, a setting known as the small sample problem in statistics and machine learning [93]. Besides, most of the microarray cancer classification problems are multi-class, and the average number of training examples per class can be much smaller than in the binary case. My personal experience also showed that, without smoothing the cross-validation error surface, the model selection procedure can contain artifacts of the small sample problem [73, 31, 93].
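The described procedure, 5-fold cross-validation repeated on 10 randomized copies of the training data with the median taken per grid point, can be sketched as follows. The classifier is passed in as a callback; all names here are assumptions of this sketch, and the actual implementation used in this thesis is in C++.

```python
import numpy as np

def median_cv_error(X, y, fit_predict, folds=5, repeats=10, seed=0):
    """Median k-fold cross-validation error over several random
    permutations of the training data (smoothed CV error surface).

    fit_predict(X_tr, y_tr, X_va) must return predicted labels for X_va.
    """
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(repeats):
        perm = rng.permutation(len(y))          # randomize the training data
        fold_errs = []
        for f in range(folds):
            va = perm[f::folds]                 # every folds-th index -> fold f
            tr = np.setdiff1d(perm, va)         # remaining indices for training
            pred = fit_predict(X[tr], y[tr], X[va])
            fold_errs.append(np.mean(pred != y[va]))
        errors.append(np.mean(fold_errs))
    return float(np.median(errors))             # median smooths over repetitions
```

Taking the median over the repetitions makes the estimate at each grid point robust to a single unlucky fold assignment, which matters precisely when there are only a few examples per class.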
In all data sets, all feature values are rescaled to between 0 and 1, and this rescaling is based on the training data. The selected hyperparameters for the OVA, MC-MMR and WW methods are given in Table 7.16 and the hyperparameters of CS, LLW and DGI are given in Table 7.17.
The classification accuracies (in percent) of the six multi-class SVMs are given in Table 7.18 and in Table 7.19. Additionally, I selected the best classification accuracy
Table 7.16: Best hyperparameters found in the model selection procedure for OVA,MC-MMR and WW.
               OVA            MC-MMR         WW
               log γ   log C  log γ   log C  log γ   log C
11 Tumors      -19     7      -27     0      -26     12
14 Tumors      -26     14     -8      -13    -22     12
9 Tumors       -25     11     -8      0      -18     4
Brain Tumor1   -16     4      -8      -2     -15     5
Brain Tumor2   -18     6      -12     0      -16     4
Leukemia1      -27     15     -12     0      -21     9
Leukemia2      -24     12     -13     0      -28     0
SRBCT          -26     15     -10     -8     -25     13
Table 7.17: Best hyperparameters found in the model selection procedure for CS,LLW and DGI.
               CS             LLW            DGI
               log γ   log C  log γ   log C  log γ   log C
11 Tumors      -26     13     -20     11     -9      -8
14 Tumors      -23     11     -20     9      -8      -2
9 Tumors       -19     6      -19     6      -7      -11
Brain Tumor1   -14     -2     -25     15     -38     -10
Brain Tumor2   -18     -4     -13     1      -11     -8
Leukemia1      -18     -6     -24     15     -9      -8
Leukemia2      -21     -12    -42     7      -8      -8
SRBCT          -20     -8     -25     13     -5      -5
for each data set from the results of Statnikov et al. [147] and these accuracies are
reported in the last columns of Table 7.18 and Table 7.19. Also 1-NN classification
accuracies are shown in the results.
Table 7.18: Classification accuracies of OVA, MC-MMR, WW and 1-NN. The last column shows the best classification accuracies from Statnikov et al. [147]. In each row, bold numbers show the best classification accuracy of the multi-class SVMs on the data set.
              OVA      MC-MMR   WW       1-NN     [147]
9 Tumors       83.33    38.89    72.22    44.44    78.67
14 Tumors      75.27    60.22    73.12    60.22    90.96
Brain Tumor1   92.86    85.71    89.29    85.71    82.31
Brain Tumor2   80.00    73.33    80.00    73.33    80.00
Leukemia1     100.00    77.27   100.00    81.82    93.90
Leukemia2     100.00    86.36   100.00    77.27    94.42
SRBCT          88.00    88.00    88.00    84.00   100.00
7.4.2 Summary of Results
Comparing my results with the results from the literature, it can be seen that in four of the seven data sets the results were improved, in two data sets the results
Table 7.19: Classification accuracies of CS, LLW, DGI and 1-NN. The last column shows the best classification accuracies from Statnikov et al. [147]. In each row, bold numbers show the best classification accuracy of the multi-class SVMs on the data set.
              CS       LLW      DGI      1-NN     [147]
9 Tumors       83.33    83.33    38.89    44.44    78.67
14 Tumors      74.19    75.27    60.22    60.22    90.96
Brain Tumor1   92.86    89.29    85.71    85.71    82.31
Brain Tumor2   80.00    80.00    73.33    73.33    80.00
Leukemia1     100.00   100.00    90.91    81.82    93.90
Leukemia2     100.00   100.00    77.27    77.27    94.42
SRBCT          88.00    88.00    84.00    84.00   100.00
are worse, and in one data set the results are equal. For a set of seven different classification problems, these differences are high. They could be caused either by the type of kernel used or by the model selection procedure. In Statnikov et al.'s [147] study, MATLAB (2003, The MathWorks) was used, and they also claimed that the training of SVMs is computationally expensive. In this thesis, C++ is used to implement the solvers; the new solvers allow better model selection procedures to be applied, and this in turn makes better classification accuracies possible.
7.4.3 Protein Secondary Structure Prediction
Proteins are among the main building blocks of cells, and so it is essential to understand their biological function. It is known that a protein's biological function is closely related to its 3D structure [10]. Biologists have developed several experimental methods to determine the 3D structure of a protein, such as protein nuclear magnetic resonance (NMR) or X-ray based techniques. However, these experimental techniques are generally time consuming, slow and very expensive [91, 47]. In the year 2000, the protein data bank (PDB) contained approximately 12,000 experimentally identified protein structures [14], and in the year 2009 it contained approximately 30,000 [60]. However, the non-redundant National Center for Biotechnology Information (NCBI) reference sequence (RefSeq) database [130] contained approximately 780,000 non-identified protein sequences in the year 2004 and approximately 1,100,000 in the year 2005. Although the exact numbers for 2009 are not known, it can be assumed that RefSeq contains at least two million non-identified protein sequences. From these statistics, it is clear that the experimental methods are not fast enough to identify the protein sequences.
Identification of protein sequences is not only important for improving the biological understanding of the cell, it is also important for drug discovery and for developing treatment schemes against diseases. Given all these facts, bioinformatics researchers have applied several statistical techniques to overcome these disadvantages of the experimental methods. One of the basic ideas is to determine a protein's structural class by using only the primary sequence of amino acids, which can be obtained easily and quickly from the protein at hand. In this part of the thesis I will consider this problem, namely protein secondary structure prediction.
There are two mainstream approaches to the protein secondary structure prediction problem. The first is to use supervised classification, and the second comprises ab-initio techniques, which optimize a predefined energy function without using any supervised information [10]. However, the performance of ab-initio techniques is far below that of supervised classification [54].
Unfortunately, although the number of proteins is large, the number of training examples per protein is small. Therefore, one needs to identify the best classifier in order to use this relatively small amount of information efficiently. In this thesis, I compared six of the multi-class SVMs and the 1-NN classifier on a multi-class protein data set, which is a baseline data set for protein structure prediction [54]. Further, the results of this thesis will be compared with the best results published in the literature [47].
The data set contains 27 proteins and there are 12 different feature vectors derived
from these proteins. Each feature vector is regarded as a different data set. Each
data set contains 311 training examples and 383 test examples.
In all data sets, the feature values are rescaled to the range [0, 1], with the
rescaling parameters computed from the training data only. As before, a nested
grid search with 5-fold cross-validation is used to determine the hyperparameters.
The model selection procedure is identical to the nested grid search with
randomization of training data explained in Section 7.4.1.
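As an illustration, this preprocessing step can be sketched as follows. The code below is a minimal sketch of per-feature min-max rescaling and is not the actual experimental code; all names are my own:

```python
import numpy as np

def fit_minmax(X_train):
    """Per-feature minimum and range, estimated from the training data only."""
    lo = X_train.min(axis=0)
    span = X_train.max(axis=0) - lo
    span[span == 0] = 1.0  # guard against constant features
    return lo, span

def apply_minmax(X, lo, span):
    """Map features into [0, 1] using the training-set statistics."""
    return (X - lo) / span

# toy data: 4 training and 2 test examples with 3 features
X_train = np.array([[0.0, 10.0, 5.0],
                    [2.0, 20.0, 5.0],
                    [4.0, 30.0, 5.0],
                    [1.0, 25.0, 5.0]])
X_test = np.array([[3.0, 15.0, 5.0],
                   [5.0, 40.0, 5.0]])

lo, span = fit_minmax(X_train)
X_train_s = apply_minmax(X_train, lo, span)
X_test_s = apply_minmax(X_test, lo, span)  # test values may leave [0, 1]
```

Because the statistics come from the training set alone, no information about the test set leaks into preprocessing; as a consequence, rescaled test features may fall slightly outside [0, 1].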
The selected hyperparameters for the OVA, MC-MMR, and WW methods are given
in Table 7.20, and the hyperparameters of CS, LLW, and DGI are given in Table
7.21.
Table 7.20: Best hyperparameters found in the model selection procedure for OVA, MC-MMR, and WW.

                     OVA            MC-MMR           WW
                 log γ  log C   log γ  log C   log γ  log C
Composition        -4      1      -3      0      -4      0
Hydrophobicity     -5     -2      -2      0      -8      4
L14                -7      1      -5      0      -8     -1
L1                 -5      1      -3      0      -7      2
L30                -6      0      -5      0      -6      4
L4                 -6      2      -3      0      -4     -1
Polarity           -5     -1      -2      0      -5     -2
Polarizability     -5     -5      -3     -1      -9      6
Secondary          -1     -1       0      0      -1     -2
Swblosum62        -18     11      -5    -14     -17      7
Swpam50           -27     19      -5     -4     -11      4
Volume             -4     10      -2      0      -5      1
The classification accuracies of the six multi-class SVMs are given in Table 7.22
and Table 7.23. Additionally, the 1-NN classification accuracies are shown in
the last column of each table.
Table 7.21: Best hyperparameters found in the model selection procedure for CS, LLW, and DGI.

                     CS             LLW              DGI
                 log γ  log C   log γ  log C   log γ  log C
Composition        -4      1      -4      0       0     -3
Hydrophobicity     -8     -2      -5      1       0     -9
L14                -6      2      -8      5       0     -9
L1                 -8      4      -7      1       0    -11
L30                -6      1      -6      1      -2     -8
L4                 -5     -2      -5      2      -2     -7
Polarity           -5      1      -5      2      -1     -3
Polarizability     -8      0      -5      2      -1     -3
Secondary          -4      3      -2      1      -2     -7
Swblosum62        -18     -9     -15      7      -5     -4
Swpam50           -15      9     -11      7      -6     -6
Volume             -6      6      -3      1      -1     -2
Table 7.22: Classification accuracies of OVA, MC-MMR, and WW. The last column shows the classification accuracies of 1-NN.

                  OVA     MC-MMR     WW      1-NN
Composition     0.5352   0.4909   0.5379   0.4595
Hydrophobicity  0.3786   0.3864   0.3420   0.3446
L14             0.4543   0.3708   0.3577   0.3394
L1              0.4648   0.4334   0.3760   0.3734
L30             0.3969   0.3159   0.3681   0.3133
L4              0.4465   0.4334   0.4491   0.3838
Polarity        0.3708   0.4021   0.3420   0.3368
Polarizability  0.2898   0.3081   0.2950   0.3159
Secondary       0.3943   0.3786   0.3838   0.3577
Swblosum62      0.6240   0.5352   0.6292   0.4439
Swpam50         0.6371   0.5535   0.6397   0.4595
Volume          0.3603   0.3760   0.3473   0.3446
7.4.4 Summary of Results
Damoulas and Girolami [47] reported the best accuracy on this data set,
59.8% ± 1.9%. In their study, they considered four state-of-the-art string
kernels and took the best classification accuracy for each feature set. In this
study, I used a Gaussian kernel and applied the six multi-class methods to the
feature sets. The best result, 64.23%, is obtained by LLW. This increases the
classification accuracy by approximately 4.5 percentage points, which is clearly
significant. However, each method achieved the best accuracy on exactly two
feature sets, which means there is no winner among the different multi-class
SVMs with respect to this problem.
It should be noted that Damoulas and Girolami [47] also applied multiple
kernel learning (MKL) [8, 145] to this problem and reported a classification
accuracy of 68.1%. Although this result is better than the results reported in
this study, it encourages applying MKL to LLW and WW because
Table 7.23: Classification accuracies of CS, LLW, and DGI. The last column shows the classification accuracies of 1-NN.

                   CS      LLW      DGI     1-NN
Composition     0.5222   0.4909   0.4543   0.4595
Hydrophobicity  0.3368   0.3551   0.3629   0.3446
L14             0.4256   0.4360   0.3525   0.3394
L1              0.3812   0.2872   0.2881   0.3734
L30             0.3812   0.3081   0.3394   0.3133
L4              0.4648   0.4439   0.3812   0.3838
Polarity        0.3708   0.3499   0.3577   0.3368
Polarizability  0.2846   0.3394   0.3473   0.3159
Secondary       0.3916   0.3916   0.4021   0.3577
Swblosum62      0.6266   0.6162   0.5405   0.4439
Swpam50         0.6371   0.6423   0.4909   0.4595
Volume          0.3342   0.3943   0.3655   0.3446
of their superior performance on single feature sets.
Chapter 8
Conclusions
I have provided a novel unified view of the seemingly diverse field of all-in-one
multi-class SVMs. Although all popular all-in-one approaches reduce to the
standard SVM for binary classification problems, they differ along three
dimensions when applied to more than two classes: the presence or absence of a
bias term in the classification functions, the use of a relative or absolute
margin concept, and the way margin violations are combined in their loss
functions.
The unified scheme pointed to a canonical combination of these features that
had not yet been investigated. The missing machine, which can be viewed as
marrying the approaches of Crammer & Singer (CS, [42]) and Lee, Lin, & Wahba
(LLW, [112]), has been derived and evaluated. The new SVM, named the DGI SVM,
considers the maximum over the margin violations per variable in its loss
function, together with an absolute margin concept as proposed by LLW.
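The difference between the sum- and max-loss dimensions of this scheme can be illustrated with a small sketch. The snippet below is illustrative only; it uses a relative margin for concreteness, and the function names are my own rather than the thesis' notation:

```python
import numpy as np

def margin_violations(scores, y, margin=1.0):
    """Hinge violations of the target margin of the true class y against
    every other class (relative margin concept, for illustration only)."""
    viol = np.maximum(0.0, margin - (scores[y] - scores))
    viol[y] = 0.0  # the true class competes only against the others
    return viol

def sum_loss(scores, y):
    """WW/LLW style: all margin violations are summed."""
    return margin_violations(scores, y).sum()

def max_loss(scores, y):
    """CS/DGI style: only the largest violation enters the loss."""
    return margin_violations(scores, y).max()

scores = np.array([2.0, 1.5, 0.0, 1.8])  # decision values for 4 classes
y = 0                                     # true class
# violations against classes 1..3 are [0.5, 0.0, 0.8]
# sum-loss = 1.3, max-loss = 0.8
```

The sum-loss penalizes every competing class that intrudes into the margin, while the max-loss only tracks the strongest competitor; this is exactly the axis along which WW/LLW differ from CS/DGI.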
A fast training algorithm for the WW, LLW, and DGI SVMs has been presented.
By dropping the bias term, as done in the CS approach, the equality constraints
in the dual problems of all machines vanish. This makes decomposition methods
easily applicable. A second-order working set selection algorithm using working
sets of size two has been proposed for these problems. Instead of choosing the
smallest, irreducible working set size, it is proposed to use a working set of
size two whenever possible. This still allows for a tractable analytic solution
of the sub-problem and, as shown empirically, yields a significantly better
trade-off between iteration complexity (as determined, e.g., by the working set
selection heuristic and the gradient update) and progress. That is, sequential
two-dimensional optimization (S2DO) should be favored over the strict SMO
heuristic. This is also supported by the findings in [149] for binary SVMs. The
S2DO heuristic is not restricted to the SVMs considered in this study; it can be
applied in general to machines involving quadratic programs without equality
constraints.
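Without an equality constraint, the two-variable sub-problem is a box-constrained quadratic program that can be solved analytically by a small case distinction. The sketch below is a simplified illustration of that idea (the maximization form and all names are my own, not the thesis' notation), assuming a positive definite 2x2 matrix Q:

```python
import numpy as np

def solve_2d_subproblem(g, Q, lo, hi):
    """Maximize g.d - 0.5*d.Q.d subject to lo <= d <= hi (elementwise).

    Case distinction: if the unconstrained Newton point lies in the box it
    is optimal; otherwise the optimum lies on the boundary, and each of the
    four edges reduces to a 1-D quadratic solved in closed form and clipped.
    """
    def value(d):
        return g @ d - 0.5 * d @ Q @ d

    d = np.linalg.solve(Q, g)          # unconstrained optimum
    if np.all(d >= lo) and np.all(d <= hi):
        return d

    candidates = []
    for i in (0, 1):                   # pin variable i at one of its bounds
        j = 1 - i
        for b in (lo[i], hi[i]):
            dj = (g[j] - Q[i, j] * b) / Q[j, j]   # 1-D optimum on that edge
            dj = min(max(dj, lo[j]), hi[j])       # clip into the box
            cand = np.empty(2)
            cand[i], cand[j] = b, dj
            candidates.append(cand)
    return max(candidates, key=value)  # best boundary candidate

g = np.array([2.0, 1.0])
Q = np.array([[2.0, 0.0], [0.0, 2.0]])
d_free = solve_2d_subproblem(g, Q, np.array([-10.0, -10.0]), np.array([10.0, 10.0]))
d_box = solve_2d_subproblem(g, Q, np.array([-10.0, -10.0]), np.array([0.5, 10.0]))
```

Since this computation is repeated for every selected working set, the closed form (rather than an inner iterative solve) is what keeps the per-iteration cost low.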
The developed solver has been applied to all of the multi-class SVMs, which made
the empirical comparison fair enough to draw conclusions about the required
training times. Another novel contribution of this thesis regarding SVM solvers
is a new caching technique for all-in-one multi-class machines, which needs to
store only an O(d²) matrix and an O(ℓ²) matrix instead of an O(s²) matrix, where
s = d × ℓ. This caching technique made it possible to use WW and LLW on all data
sets. To my knowledge, S2DO is the only existing solver for LLW that uses a
decomposition algorithm. As a result, the LLW method can now be used for much
larger data sets.1
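The caching idea exploits the fact that, for these machines, the quadratic-program matrix factors into a data-kernel part and a small class-coefficient part, so individual entries can be recomputed on demand from two small matrices. The following sketch assumes, purely for illustration, a Gaussian data kernel and an identity class matrix; the actual class-coefficient matrix depends on the machine's loss formulation:

```python
import numpy as np

ell, d = 5, 3                         # training examples, classes
rng = np.random.default_rng(0)
X = rng.normal(size=(ell, 4))

# data kernel matrix: O(ell^2) storage
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-0.5 * sq)

# class-coefficient matrix: O(d^2) storage (identity only for illustration)
M = np.eye(d)

def q_entry(a, b):
    """Entry (a, b) of the full (ell*d) x (ell*d) QP matrix, recomputed on
    demand from the two small factors instead of being stored."""
    i, r = divmod(a, d)               # (training index, class index)
    j, s = divmod(b, d)
    return K[i, j] * M[r, s]

# the full matrix, materialized here only to check the factorization
Q_full = np.kron(K, M)
```

With s = d × ℓ variables, storing K and M costs O(ℓ²) + O(d²) memory instead of the O(s²) needed for the full matrix, which is what makes the larger data sets feasible.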
An extensive empirical study has been carried out. The new solver makes it
possible to apply better model selection procedures to all multi-class machines.
This is, for two reasons, a particularly important contribution of this study.
First, until now the WW and LLW methods were often either ignored or not
carefully analysed in empirical studies due to the lack of efficient solvers
[133, 54]. Second, researchers did not perform suitable model selection because
of the computational requirements of these methods [54, 147]. For these reasons,
and also because of their high training speeds, one-vs-all (OVA) and CS have
been considered the best machines for multi-class problems. However, the
empirical analysis presented in this thesis shows that this common belief is at
least not completely true. The analysis revealed two important facts. First, LLW
is better than all other methods in terms of classification accuracy, and the
second-best method is WW (see Section 7.2). Second, WW is not slower than the CS
method (see Section 7.2). Furthermore, if one focuses only on all-in-one
machines, the superior results of LLW and WW imply that sum-loss machines are in
general better than max-loss machines (i.e., CS and DGI).
The results of the six multi-class methods on the bioinformatics data sets imply
that model selection is very important when the data set at hand is small, i.e.,
when we face a small-sample problem (see Section 7.4.1 and Section 7.4.3).
Finally, the results on the traffic sign recognition problem imply that neither
good features alone nor a good classifier alone gives the best result. To solve
real-world problems, both issues must be taken into account (see Section 7.3).
The extensive experimental comparison showed that the WW approach generates
hypotheses with higher classification accuracy than the CS machine. Both
approaches outperformed the one-versus-all method in this respect. Using S2DO,
the original WW multi-class SVM becomes at least as fast as the CS method
trained with tailored, state-of-the-art second-order working set selection. This
indicates that the faster training times previously observed for the CS SVM
compared to the WW formulation were not achieved by reducing the number of slack
variables, but rather by dropping the bias term from the hypotheses (this is in
accordance with the findings in [87], where training times increased drastically
when bias parameters were added to the CS machine). The better generalization
results are in accordance with newly derived risk bounds, which follow from a
union bound on results for binary machines and are lower for the WW SVM than for
the CS machine. Given the empirical and theoretical results, there is no longer
any reason to prefer the CS SVM a priori over the original (WW) method. We hope
that the results of
1 The original solver proposed by Lee et al. [112] is based on an interior point method and has a time complexity of O(s³) and a memory requirement of O(s²).
this thesis make the WW method more popular among practitioners, because it
offers improved accuracy without additional cost in training time compared to CS.
From a theoretical point of view, the decisive property of the LLW multi-class
SVM is the classification calibration of its loss function [154]. The efficient solver
proposed in this thesis makes LLW training practical and thereby allowed for the
first extensive empirical comparison of LLW with alternative multi-class SVMs. The
LLW method is the only classification-calibrated machine in this comparison [154]
and showed the best generalization performance. This improved accuracy required
considerably more training time. However, if training time does not matter, the
LLW machine is the multi-class SVM of choice. This experimental result corrobo-
rates the theoretical advantages of the LLW machine.
In this study, I considered batch learning of multi-class SVMs. For binary classi-
fication, it has been shown that improved second-order working set selection derived
for batch learning is even more advantageous when applied to on-line learning in
LASVM [79]. Therefore, I am confident that the results in this study also carry
over to the popular LaRank online multi-class SVM [19].
Bibliography
[1] E. Alba and J. Chicano. Solving the error correcting code problem with
parallel hybrid heuristics. In Proceedings of the 2004 ACM symposium on
Applied computing, page 989. ACM, 2004.
[2] E. Alba, C. Cotta, F. Chicano, and AJ Nebro. Parallel evolutionary algorithms
in telecommunications: Two case studies. network, 8(13):14–19, 2002.
[3] E. Alba and S. Khuri. Sequential and distributed evolutionary algorithms for
combinatorial optimization problems. Studies In Fuzziness And Soft Comput-
ing, pages 211–233, 2003.
[4] E.L. Allwein, R.E. Schapire, and Y. Singer. Reducing multiclass to binary:
A unifying approach for margin classifiers. The Journal of Machine Learning
Research, 1:113–141, 2001.
[5] M. Anthony and P.L. Bartlett. Neural network learning: Theoretical founda-
tions. Cambridge Univ Pr, 1999.
[6] N. Aronszajn. Theory of Reproducing Kernels. Transactions of the American
Mathematical Society, 68(3):337–404, 1950.
[7] A. Asuncion and D. J. Newman. UCI machine learning repository, 2007.
[8] F.R. Bach, G.R.G. Lanckriet, and M.I. Jordan. Multiple kernel learning,
conic duality, and the SMO algorithm. In Proceedings of the twenty-first
international conference on Machine learning, page 6. ACM, 2004.
[9] C. Bahlmann, Y. Zhu, V. Ramesh, M. Pellkofer, and T. Koehler. A system for
traffic sign detection, tracking, and recognition using color, shape, and mo-
tion information. In Proceedings of the IEEE Intelligent Vehicles Symposium,
pages 255–260, 2005.
[10] D. Baker and A. Sali. Protein structure prediction and structural genomics.
Science’s STKE, 294(5540):93, 2001.
[11] X. Baro, S. Escalera, J. Vitria, Oriol Pujol, and Petia Radeva. Traffic sign
recognition using evolutionary adaboost detection and forest-ECOC classifi-
cation. IEEE Transactions on Intelligent Transportation Systems, 10(1):113–
126, 2009.
[12] D. Benson, I. Karsch-Mizrachi, D. Lipman, J. Ostell, and D. Wheeler. Gen-
Bank. Nucleic Acids Research, 35, 2007.
[13] D.A. Benson, M.S. Boguski, D.J. Lipman, J. Ostell, BF Ouellette, B.A. Rapp,
and D.L. Wheeler. GenBank. Nucleic acids research, 27(1):12, 1999.
[14] H.M. Berman, T. Battistuz, TN Bhat, W.F. Bluhm, P.E. Bourne,
K. Burkhardt, Z. Feng, G.L. Gilliland, L. Iype, S. Jain, et al. The pro-
tein data bank. Acta Crystallographica Section D: Biological Crystallography,
58(6):899–907, 2002.
[15] D.P. Bertsekas, M.L. Homer, D.A. Logan, and S.D. Patek. Nonlinear pro-
gramming. Athena scientific, 1995.
[16] P.J. Bickel, J.B. Brown, H. Huang, and Q. Li. An overview of recent de-
velopments in genomics and associated statistical methods. Philosophical
Transactions of the Royal Society A: Mathematical, Physical and Engineering
Sciences, 367(1906):4313, 2009.
[17] C.M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[18] A. Bordes, L. Bottou, and P. Gallinari. Sgd-qn: Careful quasi-newton stochas-
tic gradient descent. The Journal of Machine Learning Research, 10:1737–
1754, 2009.
[19] A. Bordes, L. Bottou, P. Gallinari, and J. Weston. Solving multiclass support
vector machines with LaRank. In Zoubin Ghahramani, editor, Proceedings of
the 24th International Machine Learning Conference (ICML), pages 89–96.
OmniPress, 2007.
[20] A. Bordes, L. Bottou, P. Gallinari, and J. Weston. Solving multiclass sup-
port vector machines with LaRank. In Proceedings of the 24th international
conference on Machine learning, page 96. ACM, 2007.
[21] A. Bordes, S. Ertekin, J. Weston, and L. Bottou. Fast kernel classifiers with
online and active learning. The Journal of Machine Learning Research, 6:1619,
2005.
[22] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for op-
timal margin classifiers. In Proceedings of the Fifth Annual Workshop on
Computational Learning Theory (COLT 1992), pages 144–152. ACM, 1992.
[23] L. Bottou. Online algorithms and stochastic approximations. Online Learning
and Neural Networks. Cambridge University Press, Cambridge, UK, 5, 1998.
[24] L. Bottou. Stochastic learning. In O. Bousquet and U. von Luxburg, editors,
Advanced Lectures on Machine Learning, Lecture Notes in Artificial Intelli-
gence, LNAI 3176, pages 146–168. Springer Verlag, Berlin, 2004.
[25] L. Bottou and O. Bousquet. The tradeoffs of large scale learning. Advances
in neural information processing systems, 20:161–168, 2008.
[26] L. Bottou and Y. LeCun. Large scale online learning. In Sebastian Thrun,
Lawrence Saul, and Bernhard Scholkopf, editors, Advances in Neural Infor-
mation Processing Systems 16. MIT Press, Cambridge, MA, 2004.
[27] L. Bottou and C.J. Lin. Support vector machine solvers. In L. Bottou,
O. Chapelle, D. DeCoste, and J. Weston, editors, Large Scale Kernel Ma-
chines, pages 1–28. MIT Press, 2007.
[28] O. Bousquet, S. Boucheron, and G. Lugosi. Introduction to statistical learning
theory. Advanced Lectures on Machine Learning, pages 169–207, 2004.
[29] S.P. Boyd and L. Vandenberghe. Convex optimization. Cambridge Univ Pr,
2004.
[30] P. Boyle and B. Levin. World cancer report 2008. IARC Press Lyon, France,
2008.
[31] U.M. Braga-Neto and E.R. Dougherty. Is cross-validation valid for small-
sample microarray classification? Bioinformatics, 20(3):374, 2004.
[32] E. J. Bredensteiner and K. P. Bennett. Multicategory classification by support
vector machines. Computational Optimization and Applications, 12(1):53–79,
1999.
[33] L. Breiman. Classification and regression trees. Chapman & Hall/CRC, 1984.
[34] H.C. Causton, J. Quackenbush, and A. Brazma. Microarray gene expression
data analysis: a beginner’s guide. Wiley-Blackwell, 2003.
[35] G. Cauwenberghs and T. Poggio. Incremental and decremental support vector
machine learning. In Advances in neural information processing systems 13:
proceedings of the 2000 conference, page 409. The MIT Press, 2001.
[36] C.C. Chang and C.J. Lin. LIBSVM: a library for support vector machines,
2001.
[37] O. Chapelle. Training a support vector machine in the primal. Neural Com-
putation, 19(5):1155–1178, 2007.
[38] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning,
20(3):273–297, 1995.
[39] C. Cortes and V. Vapnik. Support-vector networks. Machine learning,
20(3):273–297, 1995.
[40] T. Cover and P. Hart. Nearest neighbor pattern classification. IEEE Trans-
actions on Information Theory, 13(1):21–27, 1967.
[41] T.M. Cover. Capacity problems for linear machines. Pattern recognition,
pages 283–289, 1968.
[42] K. Crammer and Y. Singer. On the algorithmic implementation of multiclass
kernel-based vector machines. Journal of Machine Learning Research, 2:265–
292, 2002.
[43] K. Crammer and Y. Singer. On the learnability and design of output codes
for multiclass problems. Machine Learning, 47(2):201–233, 2002.
[44] F. Cucker and S. Smale. On the mathematical foundations of learning. Bul-
letin of American Mathematical Society, 39(1):1–50, 2002.
[45] F. Cucker and D.X. Zhou. Learning theory: an approximation theory view-
point. Cambridge Univ Pr, 2007.
[46] N. Dalal and B. Triggs. Histograms of oriented gradients for human detec-
tion. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 886–893, 2005.
[47] T. Damoulas and M.A. Girolami. Probabilistic multi-class multi-kernel learn-
ing: on protein fold recognition and remote homology detection. Bioinfor-
matics, 24(10):1264, 2008.
[48] C. Demirkesen and H. Cherifi. A comparison of multiclass SVM methods for
real world natural scenes. In J. Blanc-Talon, S. Bourennane, W. Philips,
D. Popescu, and P. Scheunders, editors, Advanced Concepts for Intelligent
Vision Systems (ACIVS 2008), volume 5259 of LNCS, pages 763–763.
Springer, 2008.
[49] J. Demsar. Statistical comparisons of classifiers over multiple data sets. Jour-
nal of Machine Learning Research, 7:1–30, 2006.
[50] L. Devroye. Any Discrimination Rule Can Have an Arbitrarily Bad Probabil-
ity of Error for Finite Sample Size. IEEE Transactions on Pattern Analysis
And Machine Intelligence, 4(2):154–156, 1982.
[51] L. Devroye, L. Gyorfi, and G. Lugosi. A probabilistic theory of pattern recog-
nition. Springer Verlag, 1996.
[52] T.G. Dietterich. Approximate statistical tests for comparing supervised clas-
sification learning algorithms. Neural Computation, 10(7):1895–1923, 1998.
[53] T.G. Dietterich and G. Bakiri. Solving multiclass learning problems via error-
correcting output codes. Arxiv preprint cs/9501101, 1995.
[54] C.H.Q. Ding and I. Dubchak. Multi-class protein fold recognition using sup-
port vector machines and neural networks. Bioinformatics, 17(4):349, 2001.
[55] U. Dogan, T. Glasmachers, and C. Igel. A novel approach to consistent multi-
category support vector classification. Submitted, 2011.
[56] K. Dontas and K. De Jong. Discovery of maximal distance codes using genetic
algorithms. In Tools for Artificial Intelligence, 1990., Proceedings of the 2nd
International IEEE Conference on, pages 805–811, 1990.
[57] K. Duan and S. S. Keerthi. Which is the best multiclass SVM method? An
empirical study. In N. C. Oza, R. Polikar, J. Kittler, and F. Roli, editors, Pro-
ceedings of the Sixth International Workshop on Multiple Classifier Systems
(MCS 2005), volume 3541 of LNCS, pages 278–285, 2005.
[58] R.O. Duda, P.E. Hart, and D.G. Stork. Pattern classification. Citeseer, 2001.
[59] R.O. Duda, J.W. Machanik, and R.C. Singleton. Function Modeling Experi-
ments, 1963.
[60] S. Dutta, K. Burkhardt, J. Young, G.J. Swaminathan, T. Matsuura, K. Hen-
rick, H. Nakamura, and H.M. Berman. Data deposition and annotation at
the worldwide protein data bank. Molecular biotechnology, 42(1):1–13, 2009.
[61] M. Enzweiler and D. M. Gavrila. Monocular pedestrian detection: Survey
and experiments. IEEE Transactions on Pattern Analysis and Machine In-
telligence, 31(12):2179–2195, 2009.
[62] T. Evgeniou, M. Pontil, and T. Poggio. Regularization networks and support
vector machines. Advances in Computational Mathematics, 13(1):1–50, 2000.
[63] W.J. Ewens and G.R. Grant. Statistical methods in bioinformatics: an intro-
duction. Springer Verlag, 2005.
[64] R.E. Fan, P.H. Chen, and C.J. Lin. Working set selection using second or-
der information for training support vector machines. Journal of Machine
Learning Research, 6:1889–1918, 2005.
[65] X. Fan, L. Shi, H. Fang, Y. Cheng, R. Perkins, and W. Tong. DNA Microar-
rays Are Predictive of Cancer Prognosis: A Re-evaluation. Clinical Cancer
Research, 16(2):629, 2010.
[66] M.C. Ferris and T.S. Munson. Interior-point methods for massive support
vector machines. SIAM Journal on Optimization, 13(3):783–804, 2003.
[67] T. Finley and T. Joachims. Training structural SVMs when exact inference is
intractable. In Proceedings of the 25th international conference on Machine
learning, pages 304–311. ACM, 2008.
[68] E. Fix and J. Hodges. Discriminatory Analysis-Nonparametric Discrimina-
tion: Consistency Properties, 1951.
[69] E. Fix and J.L. Hodges Jr. Discriminatory Analysis-Nonparametric Discrim-
ination: Small Sample Performance, 1952.
[70] H. Fleyeh and M. Dougherty. Traffic sign classification using invariant features
and support vector machines. In Proceedings of the IEEE Intelligent Vehicles
Symposium, pages 530–535, 2008.
[71] V. Franc and S. Sonnenburg. Optimized cutting plane algorithm for sup-
port vector machines. In Proceedings of the 25th international conference on
Machine learning, pages 320–327. ACM, 2008.
[72] Y. Freund and R. Schapire. A desicion-theoretic generalization of on-line
learning and an application to boosting. In Computational Learning Theory,
pages 23–37. Springer, 1995.
[73] W.J. Fu, R.J. Carroll, and S. Wang. Estimating misclassification error with
small samples via bootstrap cross-validation. Bioinformatics, 21(9):1979,
2005.
[74] K. Fukumizu, F.R. Bach, and M.I. Jordan. Dimensionality reduction for
supervised learning with reproducing kernel Hilbert spaces. The Journal of
Machine Learning Research, 5:73–99, 2004.
[75] S. García and F. Herrera. An extension on statistical "comparisons of classi-
fiers over multiple data sets" for all pairwise comparisons. Journal of Machine
Learning Research, 9:2677–2694, 2008.
[76] T. Glasmachers. Universal Consistency of Multi-Class Support Vector Classi-
fication. In Advances in Neural Information Processing Systems (NIPS), 2010.
[77] T. Glasmachers and C. Igel. Maximum-gain working set selection for SVMs.
Journal of Machine Learning Research, 7:1437–1466, 2006.
[78] T. Glasmachers and C. Igel. Second-order smo improves svm online and active
learning. Neural computation, 20(2):374–382, 2008.
[79] T. Glasmachers and C. Igel. Second order SMO improves SVM online and
active learning. Neural Computation, 20(2):374–382, 2008.
[80] H. Grabner and H. Bischof. On-line boosting and vision. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition, pages
260–267, 2006.
[81] Y. Guermeur. VC theory for large margin multi-category classifiers. Journal
of Machine Learning Research, 8:2551–2594, 2007.
[82] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer
classification using support vector machines. Machine learning, 46(1):389–
422, 2002.
[83] R.W. Hamming. Error detecting and error correcting codes. Bell System
Technical Journal, 29(2):147–160, 1950.
[84] T. Hastie and R. Tibshirani. Classification by pairwise coupling. Annals of
Statistics, 26(2):451–471, 1998.
[85] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learn-
ing: Data Mining, Inference, and Prediction. Springer-Verlag, 2001.
[86] J.B. Hiriart-Urruty and C. Lemarechal. Convex Analysis and Minimization
Algorithms: Fundamentals. Springer, 1993.
[87] C.W. Hsu and C.J. Lin. A comparison of methods for multiclass support vec-
tor machines. IEEE Transactions on Neural Networks, 13(2):415–425, 2002.
[88] P. Hu, G. Bader, D.A. Wigle, and A. Emili. Computational prediction of
cancer-gene function. Nature Reviews Cancer, 7(1):23–34, 2006.
[89] D. Hush, P. Kelly, C. Scovel, and I. Steinwart. QP algorithms with guaranteed
accuracy and run time for support vector machines. The Journal of Machine
Learning Research, 7:769, 2006.
[90] D. Hush and C. Scovel. Polynomial-time decomposition algorithms for support
vector machines. Machine Learning, 51(1):51–71, 2003.
[91] E. Ie, J. Weston, W.S. Noble, and C. Leslie. Multi-class protein fold recogni-
tion using adaptive codes. In Proceedings of the 22nd international conference
on Machine learning, pages 329–336. ACM, 2005.
[92] C. Igel, T. Glasmachers, and V. Heidrich-Meisner. Shark. Journal of Machine
Learning Research, 9:993–996, 2008.
[93] A. Jain and D. Zongker. Feature selection: Evaluation, application, and small
sample performance. IEEE transactions on pattern analysis and machine
intelligence, 19(2):153–158, 1997.
[94] T. Joachims. Making large-scale SVM learning practical. In B. Scholkopf,
C. Burges, and A. Smola, editors, Advances in Kernel Methods – Support
Vector Learning, chapter 11, pages 169–184. MIT Press, 1998.
[95] T. Joachims. Text categorization with support vector machines: Learning
with many relevant features. Machine Learning: ECML-98, pages 137–142,
1998.
[96] T. Joachims. Training linear SVMs in linear time. In Proceedings of the 12th
ACM SIGKDD international conference on Knowledge discovery and data
mining, page 226. ACM, 2006.
[97] T. Joachims, T. Finley, and C.N.J. Yu. Cutting-plane training of structural
SVMs. Machine Learning, 77(1):27–59, 2009.
[98] T. Joachims and C.N.J. Yu. Sparse kernel SVMs via cutting-plane training.
Machine Learning, 76(2):179–193, 2009.
[99] M. J. Kearns and R. E. Shapire. Efficient distribution-free learning of prob-
abilistic concepts. Journal of Computer and System Sciences, 48(3):464–497,
1994.
[100] S.S. Keerthi, O. Chapelle, and D. DeCoste. Building support vector ma-
chines with reduced classifier complexity. The Journal of Machine Learning
Research, 7:1515, 2006.
[101] S.S. Keerthi and E.G. Gilbert. Convergence of a generalized SMO algorithm
for SVM classifier design. Machine Learning, 46(1):351–360, 2002.
[102] SS Keerthi, SK Shevade, C. Bhattacharyya, and KRK Murthy. Improvements
to Platt’s SMO algorithm for SVM classifier design. Neural Computation,
13(3):637–649, 2001.
[103] C. G. Keller, C. Sprunk, C. Bahlmann, J. Giebel, and G. Baratoff. Real-
time recognition of U.S. speed signs. In Proceedings of the IEEE Intelligent
Vehicles Symposium, pages 518–523, 2008.
[104] G. Kimeldorf and G. Wahba. Some results on Tchebycheffian spline functions*
1. Journal of Mathematical Analysis and Applications, 33(1):82–95, 1971.
[105] J. Kivinen, A.J. Smola, and R.C. Williamson. Online learning with kernels.
IEEE Transactions on Signal Processing, 52(8):2165–2176, 2004.
[106] S. Klement, A. Madany Mamlouk, and T. Martinetz. Reliability of cross-
validation for SVMs in high-dimensional, low sample size scenarios. Artificial
Neural Networks-ICANN 2008, pages 41–50, 2008.
[107] S.R. Kulkarni, G. Lugosi, and S.S. Venkatesh. Learning pattern classification-
a survey. IEEE Transactions on Information Theory, 44(6):2178–2206, 1998.
[108] H.J. Kushner and G. Yin. Stochastic approximation and recursive algorithms
and applications. Springer Verlag, 2003.
[109] S. Lafuente-Arroyo, P. García-Díaz, F.J. Acevedo-Rodríguez, P. Gil-Jiménez,
and S. Maldonado-Bascón. Traffic sign classification invariant to rotations
using support vector machines. In Proceedings of the Conference on Advanced
Concepts for Intelligent Vision Systems, pages 37–42, 2004.
[110] S.R. Lakhani and A. Ashworth. Microarray and histopathological analysis of
tumours: the future and the past? Nature Reviews Cancer, 1(2):151–157,
2001.
[111] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning
applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324,
1998.
[112] Y. Lee, Y. Lin, and G. Wahba. Multicategory Support Vector Machines:
Theory and Application to the Classification of Microarray Data and Satellite
Radiance Data. Journal of the American Statistical Association, 99(465):67–
82, 2004.
[113] N. List and H.U. Simon. A general convergence theorem for the decomposition
method. Learning Theory, pages 363–377, 2004.
[114] Y. Liu. Fisher consistency of multicategory support vector machines. In
M. Meila and X. Shen, editors, Eleventh International Conference on Artificial
Intelligence and Statistics (AISTATS), pages 289–296, 2007.
[115] J.A. Ludwig and J.N. Weinstein. Biomarkers in cancer staging, prognosis and
treatment selection. Nature Reviews Cancer, 5(11):845–856, 2005.
[116] D.J.C. MacKay. Information theory, inference, and learning algorithms. Cam-
bridge Univ Pr, 2003.
[117] D.J.C. MacKay and R.M. Neal. Near Shannon limit performance of low
density parity check codes. Electronics letters, 33(6):457–458, 1997.
[118] S. Maldonado-Bascon, S. Lafuente-Arroyo, P. Gil-Jimenez, Hilario Gomez-
Moreno, and Francisco Lopez-Ferreras. Road-sign detection and recognition
based on support vector machines. IEEE Transactions on Intelligent Trans-
portation Systems, 8(2):264–278, 2007.
[119] S. Mehrotra. On the implementation of a primal-dual interior point method.
SIAM Journal on Optimization, 2:575, 1992.
[120] J. Miura, T. Kanda, and Y. Shirai. An active vision system for real-time
traffic sign recognition. In Proceedings of IEEE International Conference on
Intelligent Transportation Systems, pages 52–57, 2000.
[121] F. Moutarde, A. Bargeton, A. Herbin, and L. Chanussot. Robust on-vehicle
real-time visual detection of American and European speed limit signs, with a
modular Traffic Signs Recognition system. In Intelligent Vehicles Symposium,
2007 IEEE, pages 1122–1126. IEEE, 2007.
[122] A. S. Muhammad, N. Lavesson, P. Davidsson, and M. Nilsson. Analysis of
speed sign classification algorithms using shape based segmentation of binary
images. In Proceedings of the International Conference on Computer Analysis
of Images and Patterns, pages 1220–1227, 2009.
[123] E.E. Ntzani and J. Ioannidis. Predictive ability of DNA microarrays for
cancer outcomes and correlates: an empirical assessment. The Lancet,
362(9394):1439–1444, 2003.
[124] E. Osuna, R. Freund, and F. Girosi. Improved Training Algorithm for Support
Vector Machines. In J. Principe, L. Giles, N. Morgan, and E. Wilson, editors,
Neural Networks for Signal Processing VII, pages 276–285. IEEE Press, 1997.
[125] E. Osuna, R. Freund, and F. Girosit. Training support vector machines: an
application to face detection. In 1997 IEEE Computer Society Conference on
Computer Vision and Pattern Recognition, 1997. Proceedings., pages 130–136,
1997.
[126] C. Papageorgiou and T. Poggio. A trainable system for object detection.
International of Journal Computer Vision, 38(1):15–33, 2000.
[127] W.W. Peterson and E.J. Weldon. Error-Correcting Codes. The MIT Press, 1972.
[128] J.C. Platt. Fast training of support vector machines using sequential minimal
optimization. In B. Scholkopf, C. Burges, and A. Smola, editors, Advances in
Kernel Methods – Support Vector Learning, chapter 11, pages 185–208. MIT
Press, 1998.
[129] T. Poggio, S. Mukherjee, R. Rifkin, A. Rakhlin, and A. Verri. b. In J. Win-
kler and M. Niranjan, editors, Uncertainty in Geometric Computations, chap-
ter 11, pages 131–141. Kluwer Academic Publishers, 2002.
[130] K.D. Pruitt, T. Tatusova, and D.R. Maglott. NCBI reference sequences (Ref-
Seq): a curated non-redundant sequence database of genomes, transcripts and
proteins. Nucleic Acids Research, 2006.
[131] J.R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann,
1993.
[132] T. Reichhardt. It’s sink or swim as a tidal wave of data approaches. Nature,
399(6736):517–520, 1999.
[133] R. Rifkin and A. Klautau. In defense of one-vs-all classification. Journal of
Machine Learning Research, 5:101–141, 2004.
[134] F. Rosenblatt. Principles of Neurodynamics: Perceptrons and the Theory of
Brain Mechanisms. Spartan Books, 1962.
[135] J. Salmen, T. Suttorp, J. Edelbrunner, and C. Igel. Evolutionary optimization
of wavelet feature sets for real-time pedestrian classification. In Proceedings
of the IEEE Conference on Hybrid Intelligent Systems, pages 222–227, 2007.
[136] M. Schena, D. Shalon, R.W. Davis, and P.O. Brown. Quantitative monitoring
of gene expression patterns with a complementary DNA microarray. Science,
270(5235):467, 1995.
[137] B. Scholkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson.
Estimating the support of a high-dimensional distribution. Neural Computa-
tion, 13(7):1443–1471, 2001.
[138] B. Scholkopf and A.J. Smola. Learning with Kernels: Support Vector Machines,
Regularization, Optimization, and Beyond. MIT Press, 2002.
[139] B. Scholkopf and A.J. Smola. Learning with Kernels: Support Vector Ma-
chines, Regularization, Optimization, and Beyond. MIT Press, 2002.
[140] F. Sebastiani. Machine learning in automated text categorization. ACM
Computing Surveys, 34(1):1–47, 2002.
[141] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal estimated sub-
gradient solver for SVM. In Proceedings of the 24th International Conference
on Machine Learning, pages 807–814. ACM, 2007.
[142] S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter. Pegasos: Primal
estimated sub-gradient solver for SVM. Mathematical Programming,
127(1):3–30, 2011.
[143] S. Shalev-Shwartz and N. Srebro. SVM optimization: inverse dependence
on training set size. In Proceedings of the 25th International Conference on
Machine Learning, pages 928–935. ACM, 2008.
[144] J. Shawe-Taylor and N. Cristianini. Robust bounds on generalization from the
margin distribution. In 4th European Conference on Computational Learning
Theory. Citeseer, 1999.
[145] S. Sonnenburg, G. Ratsch, C. Schafer, and B. Scholkopf. Large scale multiple
kernel learning. The Journal of Machine Learning Research, 7:1531–1565,
2006.
[146] J.C. Spall. Introduction to stochastic search and optimization: estimation,
simulation, and control. John Wiley and Sons, 2003.
[147] A. Statnikov, C.F. Aliferis, I. Tsamardinos, D. Hardin, and S. Levy. A com-
prehensive evaluation of multicategory classification methods for microarray
gene expression cancer diagnosis. Bioinformatics, 21(5):631, 2005.
[148] I. Steinwart. On the influence of the kernel on the consistency of support
vector machines. The Journal of Machine Learning Research, 2:67–93, 2002.
[149] I. Steinwart, D. Hush, and C. Scovel. Training SVMs without offset. Technical
Report LA-UR-09-00638, Los Alamos National Laboratory (LANL), 2009.
[150] D. Stekel. Microarray Bioinformatics. Cambridge University Press, 2003.
[151] C.J. Stone. Consistent nonparametric regression. The Annals of Statistics,
5(4):595–620, 1977.
[152] S. Szedmak, J. Shawe-Taylor, and E. Parrado-Hernandez. Learning via lin-
ear operators: Maximum margin regression. Technical report, PASCAL,
Southampton, UK, 2006.
[153] C.H. Teo, S.V.N. Vishwanathan, A.J. Smola, and Q.V. Le. Bundle methods
for regularized risk minimization. Journal of Machine Learning Research,
11:311–365, 2010.
[154] A. Tewari and P. L. Bartlett. On the Consistency of Multiclass Classification
Methods. Journal of Machine Learning Research, 8:1007–1025, 2007.
[155] E. Torres and S. Khuri. Applying evolutionary algorithms to combinatorial
optimization problems. Computational Science-ICCS 2001, pages 689–698,
2001.
[156] J. Torresen, J.W. Bakke, and L. Sekanina. Efficient recognition of speed limit
signs. In Proceedings of the IEEE International Conference on Intelligent
Transportation Systems, pages 652–656, 2004.
[157] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Support vector
machine learning for interdependent and structured output spaces. Journal
of Machine Learning Research, 6:1453–1484, 2005.
[158] V. Vapnik. Statistical Learning Theory. Wiley, 1998.
[159] V. Vapnik. The Nature of Statistical Learning Theory. Springer, 2000.
[160] V. Vapnik and A. Chervonenkis. Theory of Pattern Recognition. Nauka, 1974.
[161] R. Vicen-Bueno, A. Garcia-Gonzalez, E. Torijano-Gordo, R. Gil-Pita, and
M. Rosa-Zurera. Traffic sign classification by image preprocessing and neural
networks. In Proceedings of the Work-Conference on Artificial Neural Net-
works, pages 741–748, 2007.
[162] P. Viola and M. Jones. Robust real-time object detection. International
Journal of Computer Vision, 57(2):137–154, 2004.
[163] J. Weston and C. Watkins. Support vector machines for multi-class pattern
recognition. In M. Verleysen, editor, Proceedings of the Seventh European
Symposium On Artificial Neural Networks (ESANN), pages 219–224, 1999.
[164] K. Woodsend. Using Interior Point Methods for Large-scale Support Vector
Machine Training. PhD thesis, University of Edinburgh, 2009.
[165] K. Woodsend and J. Gondzio. Exploiting separability in large-scale linear sup-
port vector machine training. Computational Optimization and Applications,
pages 1–29, 2009.
[166] S.J. Wright. Primal-dual interior-point methods. Society for Industrial Math-
ematics, 1997.
[167] T. Zhang. Solving large scale linear prediction problems using stochastic
gradient descent algorithms. In Proceedings of the twenty-first international
conference on Machine learning, page 116. ACM, 2004.
[168] H. Zou, J. Zhu, and T. Hastie. The margin vector, admissible loss and multi-
class margin-based classifiers. Annals of Applied Statistics, 2:1290–1306, 2008.
Resume
Personal Data
Name Urun Dogan
Date of birth 4 April 1979
Place of birth Eskisehir
E-Mail [email protected]
Education and Work Experience
1994 - 1996 Eskisehir Science High School
1996 - 1997 Eskisehir Ataturk High School
1997 - 2001 B.Sc. in Mechanical Engineering, Istanbul Technical University
2001 - 2004 M.Sc. in System Dynamics and Control, Istanbul Technical University
2005 - now Research Fellow, Institut fur Neuroinformatik, Ruhr-Universitat Bochum