Nick Ruiz - Bachelors Thesis


Description

This is my bachelor's thesis on alternative Support Vector Machine training methods for reducing the training time of image classification. I lost my LaTeX file when my hard drive crashed, so as of right now the only digital version I have is this photocopied version.

Transcript of Nick Ruiz - Bachelors Thesis


Support Vector Machines for Optimizing Multiclass Image Classification Training Time

by

Nicholas Ruiz

Submitted in partial fulfillment of the requirements for Major Honors in Computer Science

Houghton College, Houghton, New York
May 3, 2006


Acknowledgments

I would like to thank Dr. Wei Hu for his assistance as an advisor during and before the development of this thesis. Dr. Hu has motivated me to push myself beyond average scholarship: he has inspired me to be diligent and faithful in all of my endeavors. I would also like to thank the other members of the Honors Committee: Dr. Richard Jacobson, Dr. Paul Young, and Dr. Mark Yuly, in addition to the other faculty and staff at Houghton College, for providing a challenging academic atmosphere while maintaining genuine dedication to the development of each student.

I would also like to thank my parents, Brian and Sandra, and my sibling James, for their encouragement and continued support as I have grown into the person I am today. Their involvement in my life has motivated me to work hard and has held me accountable to my work. Finally, I earnestly thank God for his involvement in my life: the time and effort I have spent in every activity would be worthless if it were not for the opportunity to honor Him. I am eternally thankful for his love, concern, and the sacrifice of his Son to provide meaning to my life.


Contents

1 Introduction
2 Support Vector Machines
    2.1 Linearly separable Support Vector Machines
    2.2 Nonlinear Support Vector Machines
3 Multiclass classification
    3.1 One-versus-rest model
    3.2 Decision Directed Acyclic Graphs
4 Alternative SVM training methods
    4.1 Gradient ascent
    4.2 Kernel Adatron
        4.2.1 Without bias
        4.2.2 With bias
    4.3 Friess Adatron
    4.4 Kernel Minover
5 Experimentation procedures
    5.1 Input representation
    5.2 Choice of the kernel
    5.3 SVM training methods used
    5.4 Leave-one-out cross validation
6 Experimentation results
7 Remarks
8 Conclusion


1 Introduction

Support Vector Machines (SVMs) have proven to successfully solve real-world problems; one example of their power is exhibited through the classification of images. For simple binary classification tasks, SVMs use a kernel function - a "computational shortcut which makes it possible to represent linear patterns efficiently in high-dimensional spaces" [1] - in which a separating hyperplane can be found that maximizes the distance between the two classes of labeled points [2].

While Chapelle has shown that SVMs have a remarkable recognition rate for color- and luminance-based image classification [3], the requirement of solving a complex quadratic programming problem may result in a slow training time for the classification model. The purpose of this paper is to compare the training time efficiency of the quadratic programming SVM to several alternative SVMs (Kernel Adatron, Kernel Minover, and Friess Adatron), using several experimental procedures outlined by Chapelle, on a subset of the Corel image database for multiclass image classification. In addition, we shall compare the results of each of the above SVMs to their implementations using the Directed Acyclic Graph for multiclass classification [4].

This paper is organized as follows. In Section 2 we present an overview of Support Vector Machines. Section 3 discusses techniques of multiclass classification. Section 4 introduces four alternative algorithms for training the SVM. Section 5 discusses the experimentation procedures for testing the SVM models on sample images from the Corel database; results of the experimentation are illustrated in Section 6.

2 Support Vector Machines

In the binary classification model, Support Vector Machines typically use a nonlinear function φ (if the data is not linearly separable) to map training points to a high-dimensional, linearly separable feature space. Once the training points are mapped to such a feature space, a separating hyperplane can be found that separates the training points into distinct classes. The hyperplane is chosen to maximize the distance between the nearest training points of differing classes (this distance is called the margin). The separating hyperplane is represented by a linear combination of the training points [1].

2.1 Linearly separable Support Vector Machines

Let S = {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)} be a sample of points x_i ∈ X (the entire set of training data; in image classification, for example, x_i represents the training data for one image) with targets y_i ∈ {−1, +1}. Consider a hyperplane defined by (w, θ), where w is a weight vector and θ is a bias. The goal is to find a hyperplane which divides the set of examples such that all training points with the same target labels are on the same side of the hyperplane. Thus, the goal is to find w and θ such that

y_i (w \cdot x_i + \theta) > 0,    (1)

where i = {1, ..., m} and m is the number of training examples. If there exists a hyperplane that satisfies (1), then the sample set S is said to be linearly separable. In this case, w and θ may be rescaled such that

\min_{1 \le i \le m} y_i (w \cdot x_i + \theta) \ge 1.

The purpose of this rescaling is that the closest point to the hyperplane then has a distance of 1/||w||. In this case, (1) becomes

y_i (w \cdot x_i + \theta) \ge 1.    (2)

By maximizing the margin subject to (2), we obtain the optimal separating hyperplane. This can be understood by considering that the closest point to any hyperplane satisfying (2) has a distance of 1/||w||. Thus, finding the optimal separating hyperplane amounts to minimizing γ, where

\gamma = \frac{1}{2} \langle w, w \rangle,    (3)

under constraints (2). It should be noted that, in this case, the margin is 2/||w||. The optimal separating hyperplane is understood as the hyperplane which maximizes the margin.

Since ||w||^2 is convex, minimizing (3) under constraints (2) may be achieved using Lagrange multipliers [1]. Thus, the maximal margin can be found by minimizing the Lagrangian

L(w, \theta, \alpha) = \frac{1}{2} \langle w, w \rangle - \sum_{i=1}^{m} \alpha_i \left( y_i (w \cdot x_i + \theta) - 1 \right).    (4)

These α_i are Lagrange multipliers, one for each training point. The partial derivatives of (4) with respect to θ and w are

\frac{\partial L(w, \theta, \alpha)}{\partial \theta} = -\sum_{i=1}^{m} \alpha_i y_i = 0,    (5)

\frac{\partial L(w, \theta, \alpha)}{\partial w} = w - \sum_{i=1}^{m} \alpha_i y_i x_i = 0.    (6)

Substituting (6) into (4) gives the dual representation of the Lagrangian:

L(w, \theta, \alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \, (x_i \cdot x_j),    (7)

which must be maximized with respect to each α_i, subject to the constraint from (5),

\sum_{i=1}^{m} \alpha_i y_i = 0,

and α_i ≥ 0.

When the optimal separating hyperplane and margin are found, only the training points that lie close to the hyperplane have α_i > 0 and contribute to the classification model. These training points are called support vectors (SV). All other training points have associated α_i values of zero; the training points with nonzero α_i values provide the most informative patterns in the data. The resulting decision function can be written as

f(x) = \mathrm{sign}\left( \sum_{i \in SV} y_i \alpha_i^0 \, (x \cdot x_i) + \theta \right),    (8)

where α_i^0 is the solution to the maximization problem under the constraints listed above and SV represents the indices of the support vectors. The sign function maps the real-valued result to either −1 or +1.
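For concreteness, the following is a minimal MATLAB sketch of evaluating the decision function (8) once the multipliers and bias have been found; the variable names (X, y, alpha, theta) are illustrative rather than taken from any particular implementation.

    % Evaluate the linear SVM decision function (8) for a single test point x
    % (a 1 x d row vector). X is the m x d matrix of training points, y the
    % m x 1 vector of labels in {-1,+1}, alpha the m x 1 multipliers, theta the bias.
    function label = svm_decide(x, X, y, alpha, theta)
        sv = find(alpha > 0);                            % support vector indices
        f  = sum(alpha(sv) .* y(sv) .* (X(sv,:) * x')) + theta;
        label = sign(f);
    end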

2.2 Nonlinear Support Vector Machines

In the case that the training points are not linearly separable, SVMs support mapping the input data to a high-dimensional feature space through a nonlinear mapping function φ. Campbell and Cristianini show that SVMs may use high-dimensional spaces without overfitting the data. Thus, we may substitute φ(x) for x in (8) [1]:

f(x) = \mathrm{sign}\left( \sum_{i \in SV} y_i \alpha_i^0 \, (\phi(x) \cdot \phi(x_i)) + \theta \right).    (9)

Interestingly, the mapping φ does not need to be explicitly defined if we use a kernel function, since the only operation that requires φ is the dot product between two mappings. A kernel is a symmetric function K that can be described for all x, x' ∈ X as

K(x, x') = \phi(x) \cdot \phi(x')

for a specific mapping φ. Appropriate kernels that describe such a mapping must satisfy Mercer's condition [5], which states that for any g(x) for which

\int g(x)^2 \, dx < \infty,

we must have

\int\!\!\int K(x, x') \, g(x) \, g(x') \, dx \, dx' \ge 0.

Common choices for kernels include polynomial kernels of the form

K(x, x') = (x \cdot x' + 1)^d,    (10)

where d is the degree of the polynomial, and Gaussian Radial Basis Function (RBF) kernels of the form

K(x, x') = e^{-\|x - x'\|^2 / 2\sigma^2},    (11)

where σ is the standard deviation of the Gaussian curve. As with the mapping function φ(x), we may replace x · x_i in (8) with the kernel, giving

f(x) = \mathrm{sign}\left( \sum_{i \in SV} y_i \alpha_i^0 \, K(x, x_i) + \theta \right).    (12)

By satisfying Mercer's condition, a kernel may be used in place of the φ mapping; thus, φ need not be explicitly defined.
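As a brief illustration, the two kernels above can be written directly in MATLAB; the degree d and width sigma are free parameters, and the example vectors below are placeholders.

    % Polynomial kernel (10) and Gaussian RBF kernel (11) for row-vector inputs.
    poly_kernel = @(x, xp, d)     (x * xp' + 1)^d;
    rbf_kernel  = @(x, xp, sigma) exp(-norm(x - xp)^2 / (2*sigma^2));

    % Example evaluation on two placeholder patterns.
    x1 = [0.2 0.5 0.3];
    x2 = [0.1 0.6 0.3];
    k1 = poly_kernel(x1, x2, 2);     % degree-2 polynomial kernel value
    k2 = rbf_kernel(x1, x2, 2);      % RBF kernel value with sigma = 2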

3 Multiclass classification

3.1 One-versus-rest model

In the previous section, we described the use of the SVM for binary classification. The standard method of multiclass classification is to construct N binary classifiers. For example, let i ∈ {1, 2, ..., N}. The binary classifier for the i-th class separates the training points of class i from the remaining N − 1 classes. This is also known as one-versus-rest (or 1-v-r) classification. Empirically, SVM training is observed to scale super-linearly with the training set size m [6], according to a power law

T_{\mathrm{single}} = c\, m^{\gamma},    (13)

where c is a proportionality constant, m is the number of training examples, and γ reflects the time complexity of the classification model. T_single represents the time associated with training a single classifier. For the standard 1-v-r multiclass SVM training algorithm, the entire training set is used to create all N classifiers; thus the training time is

T_{1\text{-v-r}} = c\, N m^{\gamma}.    (14)

3.2 Decision Directed Acyclic Graphs

An alternative method of multiclass classification was proposed by Platt, Cristianini, and Shawe-Taylor [4], which utilizes a Directed Acyclic Graph (DAG). A Decision DAG (DDAG) is composed of N(N − 1)/2 nodes, each containing a one-versus-one (1-v-1) classifier, classifying two distinct classes from one another. Each node on the graph eliminates one class from the list of candidate results. Once the 1-v-1 classification is performed on a leaf node, the graph has eliminated N − 1 classes from the list. The remaining class is the result of the multiclass classification for a data point. Figure 1 is an illustration of the DAG structure.

Figure 1: (a) The decision DAG for finding the best class out of four classes. The equivalent list state for each node is shown next to that node. (b) A diagram of the input space of a four-class problem. A 1-v-1 SVM can only exclude one class from consideration [4].

Instead of requiring all m data points for training, each 1-v-1 classifier only requires the data points that are expected to classify as one of its two candidate classes. Thus, if each class contains the same number of examples, each 1-v-1 classifier requires only 2m/N training examples. Recalling (13), training the DDAG would require

T_{1\text{-v-}1} = c \, \frac{N(N-1)}{2} \left( \frac{2m}{N} \right)^{\gamma}.    (15)

Comparing (14) to (15), the DDAG trains more quickly than the standard 1-v-r method.
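The class-elimination step of the DDAG is easy to sketch in MATLAB. The function below assumes a helper classify_pair(x, i, j) that returns either i or j for a test point x using the trained 1-v-1 classifier for that pair; both the helper and the names are hypothetical.

    % Classify a test point x by walking the DDAG: each 1-v-1 decision removes
    % one class from the candidate list until a single class remains.
    function winner = ddag_classify(x, N, classify_pair)
        classes = 1:N;                         % candidate class list
        while numel(classes) > 1
            i = classes(1);
            j = classes(end);
            if classify_pair(x, i, j) == i
                classes(end) = [];             % eliminate class j
            else
                classes(1) = [];               % eliminate class i
            end
        end
        winner = classes(1);
    end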

4 Alternative SVM training methods

Several alternative SVM training methods exist that are much simpler to implement than the quadratic programming SVM. The Kernel Adatron (KA) has been suggested as a simple implementation of the SVM that is comparable to the original SVM in both accuracy and time complexity during classification [1]. In the following subsection, we review the explanation of gradient ascent given in [1], which is necessary to understand the Kernel Adatron. In the subsequent subsections, we discuss the Kernel Adatron and several variations on the Adatron theme.

4.1 Gradient ascent

A simple alternative method for maximizing a concave Lagrangian under linear constraints is to use gradient ascent. The Lagrangian to be maximized is

L(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j K(x_i, x_j) - \lambda \sum_{i=1}^{m} \alpha_i y_i,    (16)

where the final term implements the constraint condition in (5). Using stochastic gradient ascent based on the derivative of the Lagrangian [1], each multiplier is updated as

\alpha_i \rightarrow \alpha_i + \delta\alpha_i, \qquad \delta\alpha_i = \eta \left( 1 - y_i \sum_{j=1}^{m} \alpha_j y_j K(x_i, x_j) - \lambda y_i \right),    (17)

where η controls the growth rate. In addition, the constraint α_i ≥ 0 is enforced by setting α_i → 0 whenever α_i < 0. The Lagrangian changes as follows during an update α_k → α_k + δα_k for a particular pattern k [1]:

\delta L = L(\alpha_k + \delta\alpha_k) - L(\alpha_k)    (18)
         = \delta\alpha_k \left( 1 - y_k \sum_{j=1}^{m} \alpha_j y_j K(x_k, x_j) - \lambda y_k \right) - \frac{1}{2} (\delta\alpha_k)^2 K(x_k, x_k)    (19)
         = \left[ \frac{1}{\eta} - \frac{K(x_k, x_k)}{2} \right] (\delta\alpha_k)^2.    (20)

Thus, δL > 0, given [1] that

2 > \eta K(x_k, x_k) > 0.

For a Gaussian RBF kernel of the form (11), K(x_k, x_k) = 1, which implies that the gradient ascent algorithm will converge to the maximal Lagrangian provided that

2 > \eta > 0.

For a polynomial kernel of the form (10), the upper bound for η is instead determined by the 2-norm of each pattern [1]: since K(x_k, x_k) = (\|x_k\|^2 + 1)^d, the learning rate must satisfy 2 > \eta \, (\|x_k\|^2 + 1)^d > 0 for every pattern x_k.
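This bound translates directly into a short computation. The MATLAB sketch below uses hypothetical data: for the RBF kernel K(x_k, x_k) = 1, so any η below 2 suffices, while for a polynomial kernel the bound depends on the pattern norms.

    % Upper bound on the learning rate eta implied by 2 > eta * K(x_k, x_k).
    X = rand(10, 16);                                 % placeholder training patterns
    d = 2;                                            % polynomial degree
    eta_max_rbf  = 2;                                 % RBF kernel: K(x,x) = 1
    eta_max_poly = 2 / max((sum(X.^2, 2) + 1).^d);    % polynomial: K(x,x) = (||x||^2 + 1)^d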

4.2 Kernel Adatron

4.2.1 Without bias

According to Campbell and Cristianini [1], dropping the condition Σ_{i=1}^{m} α_i y_i = 0 is equivalent to forcing the hyperplane to pass through the origin of the feature space. Since the feature space is high-dimensional, this restriction is not an active constraint for many problems and thus will not affect the overall generalization of the model significantly. Essentially, this means that we may drop the λ term in the gradient ascent model without a significant decrease in generalization performance.

We shall first outline the Kernel Adatron algorithm without bias in Table 1. The algorithm was developed by Thilo Friess and the authors of [1].


1. Initialize α_i = 0 for all i.
2. For i = 1, 2, ..., m, execute steps 3 and 4.
3. For the labeled point (x_i, y_i), calculate z_i = Σ_{j=1}^{m} α_j y_j K(x_i, x_j).
4. Calculate δα_i = η(1 − z_i y_i):
    4.1. If (α_i + δα_i) ≤ 0, then α_i ← 0.
    4.2. If (α_i + δα_i) > 0, then α_i ← α_i + δα_i.
5. If a maximum number of iterations has been exceeded, or the margin
    γ = ½ (min_{i : y_i = +1} z_i − max_{i : y_i = −1} z_i)
    is approximately 1, then stop; otherwise, return to step 2 for the next epoch t.

Table 1: Kernel Adatron algorithm without bias.
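The steps in Table 1 can be sketched in a few lines of MATLAB. The version below assumes a precomputed kernel matrix K (m x m), labels y in {−1, +1}, a learning rate eta, a maximum number of epochs, and a small tolerance for the stopping test; it follows the table rather than any particular toolbox.

    % Kernel Adatron without bias (Table 1).
    function alpha = kernel_adatron(K, y, eta, max_epochs, tol)
        m = numel(y);
        alpha = zeros(m, 1);                           % step 1
        for t = 1:max_epochs                           % epochs
            for i = 1:m                                % step 2
                z_i = sum(alpha .* y .* K(:, i));      % step 3
                da  = eta * (1 - z_i * y(i));          % step 4
                alpha(i) = max(alpha(i) + da, 0);      % steps 4.1 and 4.2
            end
            z = K * (alpha .* y);                      % margins for the stopping test
            gamma = 0.5 * (min(z(y == 1)) - max(z(y == -1)));
            if abs(gamma - 1) < tol                    % step 5: margin approaches 1
                break;
            end
        end
    end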

4.2.2 With bias

In the KA model with a bias term, we shall once again consider the condition Σ_{i=1}^{m} α_i y_i = 0. In the process of gradient ascent, [1] notes that L increases irrespective of λ, except for iterations where δα_i = 0. In addition, the final λ of the optimization problem is the bias, since from the stationary condition δα_k = 0 (when the maximum has been found),

1 - y_k \sum_{j} \alpha_j y_j K(x_k, x_j) - \lambda y_k = y_k \left( y_k - \sum_{j} \alpha_j y_j K(x_k, x_j) - \lambda \right) = 0,

recalling that for each k, y_k ∈ {−1, +1}. Thus, the only additional change from the KA without bias is to keep track of the λ-values at each epoch t. It is noted that the bias "can be found by a subprocess involving iterative adjustment of the λ based on the gradient of L with respect to λ" [1]. Thus, λ is updated by

\lambda^{t} = \lambda^{t-1} - \nu \, w^{t-1},    (21)

where ν is a learning parameter, which may be derived from the secant method. Likewise, ν is defined as

\nu = \frac{\lambda^{t-1} - \lambda^{t-2}}{w^{t-1} - w^{t-2}},

where w^t = Σ_i α_i^t y_i [1]. The KA (with bias) algorithm is outlined in Table 2 (where t_max is the maximum number of epochs, and μ is an arbitrary value used to initialize λ).

1. Initialize α_i = 0 for all i.
2. For t = 1, ..., t_max, execute steps 3 through 8.
3. If t = 1, then λ^t = μ; else if t = 2, then λ^t = −μ; else
    λ^t = λ^{t−1} − w^{t−1} (λ^{t−1} − λ^{t−2}) / (w^{t−1} − w^{t−2}).
4. For i = 1, ..., m, execute steps 5 and 6.
5. For the labeled point (x_i, y_i), calculate z_i = Σ_{j=1}^{m} α_j y_j K(x_i, x_j).
6. Calculate δα_i = η(1 − z_i y_i − λ^t y_i):
    6.1. If (α_i + δα_i) ≤ 0, then α_i ← 0.
    6.2. If (α_i + δα_i) > 0, then α_i ← α_i + δα_i.
7. Calculate w^t = Σ_j α_j y_j.
8. If a maximum number of iterations has been exceeded, or the margin
    γ = ½ (min_{i : y_i = +1} z_i − max_{i : y_i = −1} z_i)
    is approximately 1, then stop; otherwise, return to step 2 for the next epoch t.

Table 2: Kernel Adatron algorithm with bias.

4.3 Friess Adatron

The Friess KA algorithm is a modification of the KA (without bias) algorithm in three ways. First, a bias term b²/2 is added to the objective function. Thus, z_i is calculated as

z_i = \sum_{j=1}^{m} \alpha_j y_j K(x_i, x_j) + b.    (22)

This implies an equality constraint of

b = \sum_{i=1}^{m} \alpha_i y_i.    (23)

Second, the bias term is updated each time through the learning loop, whenever the value of a Lagrange multiplier is adjusted, using the rule b ← b + y_i δα_i.

The third change to the KA (without bias) algorithm is that the criterion that prevents α_i from going negative is discarded during the learning loop, in order to maintain (23).
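Read this way, the inner loop of the Friess Adatron differs from Table 1 in only a few lines; the sketch below (with K, y, eta, alpha, b, and m as in the previous sketch) reflects one reading of the three changes described above.

    % One Friess Adatron sweep: z_i includes the bias (22), alpha_i is not
    % clipped at zero, and b is nudged with every multiplier update so that
    % the relation (23) between b and the multipliers is preserved.
    for i = 1:m
        z_i = sum(alpha .* y .* K(:, i)) + b;      % equation (22)
        da  = eta * (1 - z_i * y(i));
        alpha(i) = alpha(i) + da;                  % no non-negativity clipping
        b = b + y(i) * da;                         % b <- b + y_i * delta_alpha_i
    end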

4.4 Kernel Minover

The Kernel Minover (KM) algorithm differs from the Friess KA algorithm only in employing the minover update rule instead of the Adatron update rule. Like the Friess KA algorithm, the KM algorithm introduces a bias term b. However, the loop in Table 1 is reduced to a single operation [7]:

y_i z_i = y_i \left( \sum_{j=1}^{m} \alpha_j y_j K(x_i, x_j) + b \right).    (24)

By minimizing y_i z_i in (24), the pattern x_{i*} associated with the minimum value is selected to be updated; subsequently, only δα_{i*} needs to be calculated, where

\delta\alpha_{i*} = \eta \, (1 - y_{i*} z_{i*}).

δα_{i*} is used to update α_{i*} and b. Thus, only one Lagrange multiplier is updated at each epoch t.
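A single Kernel Minover epoch can be sketched as follows, reusing the K, y, alpha, b, and eta conventions from the earlier sketches; only the worst-margin pattern is touched.

    % One Kernel Minover epoch: update only the pattern with the smallest y_i * z_i.
    z = K * (alpha .* y) + b;                 % z_i as in (24)
    [min_margin, istar] = min(y .* z);        % pattern with the minimum y_i * z_i
    da = eta * (1 - y(istar) * z(istar));
    alpha(istar) = alpha(istar) + da;         % only one multiplier changes
    b = b + y(istar) * da;                    % bias updated with the same increment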

5 Experimentation procedures

Following Chapelle's methodology for experimentation, we used sample images from the Corel database. Images were collected from 10 categories: people, beaches, buildings, buses, reptiles, elephants, flowers, horses, mountains, and food. In this experiment, we trained each classification model to classify the images into their respective categories.

5.1 Input representation

The simplest way to represent an image is by its bitmap representation. Each pixel of the image contains numeric values which represent its red, green, and blue levels. Assuming the size of the images is h × w (where h and w are the height and width of the image in pixels, respectively), the input data for the SVM are vectors of size h × w for grayscale images and 3 × h × w for color images. To simplify the learning procedure, we convert each pixel to grayscale in our experiment, using the following formula:

L = 0.30R + 0.59G + 0.11B,

where R, G, and B represent the red, green, and blue levels of a specific pixel (each an integer between 0 and 255). This formula is a widely accepted conversion from RGB to grayscale.

While it is simple to construct the bitmap representation of each image, the representation lacks invariance with respect to translations. To alleviate this problem, we consider the luminance histogram of each image's bitmap representation. Chapelle states that constructing a histogram with 16 bins per color component yields the best results in image classification [3].

Another advantage of representing each image as a luminance histogram is the reduction of the vector size for each data point, which reduces the computational burden. Rather than input vectors of size h × w, each histogram has a vector size equal to the number of bins (16). Moreover, the input vectors may be normalized, which makes the representation invariant to image size.
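Putting the two steps above together, a minimal MATLAB sketch of this input representation might look as follows; img is assumed to be an h x w x 3 array with values in 0-255, and the 16-bin layout follows the description above.

    % Convert an RGB image to luminance values L = 0.30R + 0.59G + 0.11B,
    % build a 16-bin histogram, and normalize it so the vector is invariant
    % to the image size.
    function hvec = luminance_histogram(img)
        img = double(img);
        L = 0.30*img(:,:,1) + 0.59*img(:,:,2) + 0.11*img(:,:,3);
        edges = linspace(0, 256, 17);          % 16 equal-width bins over 0..255
        hvec  = histc(L(:), edges);            % bin counts (17th entry counts L == 256)
        hvec  = hvec(1:16);                    % drop the overflow bin
        hvec  = hvec / sum(hvec);              % normalize to unit sum
    end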

5.2 Choice of the kernel

While there are many kernel functions that may be used to map each data point to a higher-dimensional feature space, Chapelle suggests the use of the Gaussian Radial Basis Function (RBF) kernel

K(x, x') = e^{-\|x - x'\|^2 / 2\sigma^2},

where σ determines the spread of the RBF function. Smaller σ-values suggest a slower learning rate for the training procedure.

5.3 SVM training methods used

In this experiment, we used MATLAB R14 to implement each SVM. We compared the performance of a quadratic programming SVM toolbox written by Steve Gunn [8] to the following alternative SVMs (following the algorithms listed in Section 4):

• Adatron (Without Bias)

• Adatron (With Bias)

• Kernel Minover

• Friess Adatron

In addition, we implemented and benchmarked a DDAG version of each SVM, including Gunn's SVM. The alternative SVM learning methods (including the DDAGs) are written purely in MATLAB, while the quadratic programming procedure of Gunn's SVM is written in C and is called by MATLAB.

5.4 Leave-one-out cross validation

To evaluate both the training time and the classification rate of each SVM model, we use a modified leave-one-out cross validation (LOO-XV) with respect to the ten distinct classes in the model. The typical method of LOO-XV is k-fold cross validation where the training data is split into k = m subsets (m is the number of training points in the set). The standard method of LOO-XV uses k − 1 subsets for training and the remaining subset for testing. The cycle is repeated until each subset has been "left out" for testing. The results of each test are averaged together to find the overall results of the classification model.

For our purposes, we divided the training data into m/N subsets. Each subset contains one training point per class. Thus, in each training cycle, we leave out N data points for testing.
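The modified split can be sketched as follows, assuming labels is an m x 1 vector of class indices 1..N with the same number of examples per class; the function name and fold layout are illustrative.

    % Build m/N test folds, each holding exactly one example from every class.
    function folds = class_balanced_folds(labels, N)
        m = numel(labels);
        folds = zeros(m/N, N);            % row k = the N indices left out in cycle k
        for c = 1:N
            idx = find(labels == c);      % all examples of class c
            folds(:, c) = idx(:);         % the k-th example of class c joins fold k
        end
    end

Each row of folds is then left out in turn: the model is trained on the remaining indices, tested on the N held-out points, and the results are averaged over all m/N cycles.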

6 Experimentation results

To understand the relationships among the training time complexities, we performed the same experiment on 5, 6, 7, 8, and 9 images per class to construct a visual graph. The training times for each experiment are graphed using Vandermonde interpolation. Figure 2 shows the interpolation of training times for Gunn's SVM, the Adatron (without bias), and the Friess Adatron. According to Figure 2, Gunn's SVM begins to outperform the Adatron (without bias) at 8 training examples per class.
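For reference, Vandermonde interpolation of the measured training times amounts to fitting a polynomial exactly through the (points-per-class, seconds) pairs; the timings below are placeholders, not the thesis measurements.

    % Fit a degree-4 interpolating polynomial through five measured timings and
    % evaluate it on a fine grid, as done for the curves in Figures 2-4.
    n_per_class = [5 6 7 8 9];
    t_seconds   = [12 21 35 58 90];          % placeholder timings
    V  = vander(n_per_class);                % 5 x 5 Vandermonde matrix
    c  = V \ t_seconds';                     % interpolant coefficients (descending powers)
    xs = linspace(5, 10, 100);
    ts = polyval(c, xs);
    plot(n_per_class, t_seconds, 'o', xs, ts, '-');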

SVM method               Training time (s)   Error    η
Gunn's SVM               40.0906             0.000    -
Adatron (No Bias)        77.5535             0.160    1.0
Adatron (Bias)           2585.5              0.000    1.0
Kernel Minover           604.8047            0.000    0.5
Friess Adatron           157.2681            0.000    0.5
DAG-SVM                  9.8762              0.000    -
DAG-Adatron (No Bias)    3.5541              0.000    1.0
DAG-Adatron (Bias)       38.2240             0.010    1.0
DAG-Minover              17.6314             0.000    0.5
DAG-Friess               4.7699              0.080    0.5

Table 3: Training time (in seconds) and error comparisons for 100 training examples (10 images per class), using an RBF kernel with σ = 2.

Figure 2: Vandermonde interpolation of training times (in seconds) versus the number of training points per class, for Gunn's SVM, the Adatron (without bias), and the Friess Adatron.

Figure 3 shows the interpolation of training times for the Adatron (with bias) and the Kernel Minover. In comparison to Figure 2, the Adatron (with bias) and the Kernel Minover are computationally inefficient. The rationale behind this finding lies in the nature of each classification method. In the case of the KA (with bias) algorithm, it is necessary to calculate the w- and λ-values in the learning loop of each epoch t. The greatest computational expense lies in the arithmetic required to calculate these values. In addition, the α_i-values are only slightly updated at each t, requiring more iterations to optimize the classification model. While the KA (with bias) algorithm shows great accuracy, it requires a longer training time.

Similarly, the Kernel Minover algorithm's slowness is related to requiring more iterations to optimize the classification model. This is because the Kernel Minover algorithm only updates one α_i-value per iteration.

Figure 3: Vandermonde interpolation of training times (in seconds) versus the number of training points per class, for the Adatron (with bias) and the Kernel Minover.

Figure 4 shows the interpolation of the training times of each DDAG method. Both the DAG-equivalent methods of the Adatron (without bias) and the Friess Adatron show a significant improvement in overall training time.

It is interesting to note that although Gunn's SVM performs best in the 1-v-r problem, its DAG-equivalent method is suboptimal with respect to training time. Since the DDAG requires N(N − 1)/2 1-v-1 classification models, the large number of quadratic programming problems to be solved increases the training time required for the SVM when compared to the Adatron and minover classification models.

Figure 4: Vandermonde interpolation of training times (in seconds) versus the number of training points per class, for the DAG implementations of each classifier.

7 Remarks

Additional remarks should be made to address other potential discrepancies in the experimentation results. Firstly, it must be noted that Gunn's SVM uses the C programming language (called by MATLAB) to perform the quadratic programming arithmetic, while the other algorithms presented use only the MATLAB language. While C programs are compiled into machine code, MATLAB scripts are interpreted at runtime. By placing the most computationally challenging arithmetic in machine code, Gunn's SVM may be saving computational time, whereas the other algorithms must have their MATLAB code interpreted before the results can be computed. It would be interesting to compare an SVM written purely in MATLAB to the other algorithms. However, the mathematics involved in the computation of the quadratic programming procedure is outside the scope of this experiment.

In addition, the Adatron (with bias) deserves further attention. Programmatically storing the w- and λ-values of each epoch of the KA algorithm requires some care. We decided to store only the previous three values of these parameters and used modular arithmetic to determine which stored w- and λ-values correspond to w^{t−1}, w^{t−2}, etc. at each iteration. Modular arithmetic is itself computationally expensive. Other possibilities include storing the parameter values for every iteration (which would require a large amount of space).

8 Conclusion

In conclusion, while the quadratic programming SVM appears to require the least amount of training time in the 1-v-r multiclass model, the KA (without bias) and the Friess Adatron algorithms require significantly less time to train in the DDAG 1-v-1 model as the number of data points increases. The use of the DDAG 1-v-1 model results in a significant decrease in training time for the overall multiclass classification model while maintaining classification accuracy.


References

[1] C. Campbell and N. Cristianini, Simple Learning Algorithms for Training Support Vector Machines, http://citeseer.ist.psu.edu/campbell98simple.html, 1998.

[2] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis (Cambridge University Press, New York, 2004).

[3] O. Chapelle, P. Haffner, et al., Support Vector Machines et Classification d'Images, http://citeseer.ist.psu.edu/392011.html, 1998.

[4] J. Platt, N. Cristianini, and J. Shawe-Taylor, Large Margin DAGs for Multiclass Classification, in Advances in Neural Information Processing Systems 12, edited by S. Solla, T. Leen, and K.-R. Mueller (MIT Press, Cambridge, MA, 2000), pp. 547-553.

[5] V. Vapnik, The Nature of Statistical Learning Theory (Springer-Verlag, Berlin, 1995).

[6] J. Platt, in Advances in Kernel Methods - Support Vector Learning, edited by B. Scholkopf, C. J. C. Burges, and A. J. Smola (MIT Press, Cambridge, MA, 1999), pp. 185-208.

[7] H. D. Navone and T. Downs, Variations on a Kernel-Adatron Theme, http://citeseer.ist.psu.edu/navone01variations.html, 2001.

[8] S. Gunn, MATLAB SVM Toolbox, http://www.isis.ecs.soton.ac.uk/resources/svminfo/.