
Improving training speed of Support Vector Machines by creating exploitable trends of Lagrangian variables: an application to DNA splice site detection

Jason Li, Saman K. Halgamuge

Dynamic Systems & Control Group, DMME, The University of Melbourne, VIC 3010, [email protected]

    Abstract

Support Vector Machines are state-of-the-art machine learning algorithms that can be used for classification problems such as DNA splice site identification. However, the large number of samples in biological data sets can often lead to slow training. The training speed can be improved by removing non-support vectors prior to training. This paper proposes a method to predict non-support vectors with high accuracy through strict-constrained gradient ascent optimisation. Unlike other data pre-selection methods, the proposed gradient-based method is itself a training algorithm for SVM and is also very simple to implement. Comparative experiments are conducted on a DNA splice-site detection problem, and the results show significant speed improvements over other algorithms. The relationship between speed improvement and cache memory size is also examined. The generalisation capability of the proposed algorithm is also shown to be better than that of some other reformulated SVMs.

    1. Introduction

Support vector machines (SVMs) [1, 2] are powerful machine learning algorithms that have been reported as successful in a variety of biological data classification problems, including disease diagnosis and gene expression analysis [3, 4]. Although the performance of SVM is superior in terms of classification accuracy, its training methodology and speed still have significant room for improvement and remain the focus of much research. Such research is especially important for biomedical data sets, as their high dimensionality and large number of samples often hinder the speed of SVM training.

The SVM classifier can be described as a quadratic programming (QP) problem. Traditional methods for solving this QP problem, such as Newton or quasi-Newton methods, are incapable of handling large datasets due to their O(l²) memory requirement [1]. To tackle this, a decomposition framework has been developed to divide the large problem into smaller sub-problems [5, 6]. The well-known Sequential Minimal Optimisation (SMO) [7, 8] and kernel-AdaTron (KA) [9, 10] training algorithms were also developed to address this issue, aiming to keep the memory requirement at a minimum.

However, modern computers possess ample memory, so keeping memory usage at a minimum is inefficient and hinders computational speed. Most biomedical and pattern recognition data sets are extremely high dimensional, meaning that the computation of a kernel entry can be very expensive. To address this problem, the idea of caching has emerged [11]. Caching refers to the process of storing the values of kernel entries in a computer's physical memory to avoid repeated computation; the physical memory used for this purpose is called the cache. Caching allows practitioners to strike a balance between memory usage and the time required for training. The effect of caching on training time is demonstrated in Fig. 1. Note the tremendous time saved when the whole kernel matrix can fit into memory (100%).

Fig. 1: The training time (sec) of SMO on a splice-site detection dataset versus the cache memory available for storing kernel entries (% of the memory required for full kernel storage).
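The caching idea is straightforward to illustrate. The following Python class is a purely illustrative sketch, not the authors' implementation: it caches kernel rows under a least-recently-used policy, and the RBF kernel, the gamma value and the row-level granularity are assumptions made for the example.

```python
import numpy as np
from collections import OrderedDict

class KernelCache:
    """Minimal LRU cache for kernel rows: a row is recomputed only on a cache miss."""

    def __init__(self, X, gamma=0.5, max_rows=100):
        self.X = X                  # training samples, shape (n, d)
        self.gamma = gamma          # RBF kernel width (assumed value)
        self.max_rows = max_rows    # cache capacity, in kernel rows
        self._rows = OrderedDict()  # row index -> cached kernel row

    def row(self, i):
        """Return K[i, :] with K[i, j] = exp(-gamma * ||x_i - x_j||^2), using the cache."""
        if i in self._rows:
            self._rows.move_to_end(i)            # mark as most recently used
            return self._rows[i]
        diff = self.X - self.X[i]                # broadcast differences to all samples
        k_row = np.exp(-self.gamma * np.sum(diff * diff, axis=1))
        self._rows[i] = k_row
        if len(self._rows) > self.max_rows:
            self._rows.popitem(last=False)       # evict the least recently used row
        return k_row
```

Enlarging max_rows trades memory for fewer recomputations, which is the trade-off Fig. 1 quantifies.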



The work presented in this paper has been motivated by the need to reduce training time, especially for cases where the available cache memory can store only a fraction of the SVM kernel matrix. The proposed method integrates a specially tailored version of constrained gradient ascent (CGA) [12] with Keerthi's modified version of SMO [7] (with caching) to provide two-stage training: the proposed extended CGA serves as a fast preliminary training step to identify potential support vectors, while SMO fine-tunes the solution values. We will show that the proposed method, although involving data removal, can achieve better classification accuracy than LS-SVM [13] and RSVM [14], two of the more popular reformulated algorithms.

    2. Data reduction by CGA

2.1. The proposed constrained gradient ascent (CGA) algorithm

The CGA algorithm we propose comprises the first stage of training. In the literature, different types of constrained gradient methods have been reported and applied to a variety of optimisation problems [12, 15, 16]. In this work, we utilise its simplest form, strict-gradient ascent, further develop it to incorporate the constraints imposed by SVM, and develop a simple and fast implementation for it. More specifically, the CGA algorithm has been developed as follows:

A. The simplest case without inequality constraints. A simple form of constrained gradient method has first been considered, ignoring all inequalities of the SVM problem. This sets the framework for further derivation of our algorithm and helps to observe the computational simplicity of CGA.

B. Formulation with equalities. A mathematical model has then been developed to describe how to update the Lagrangian variables of the SVM optimisation problem.

C. Implementation. We have developed pseudo-code describing the computational procedure that efficiently implements the associated mathematical model.

D. Optimal learning rate. One inevitable parameter of CGA is the learning rate. We have developed a method to approximate the theoretically optimal learning rate, with computational time taken into consideration.

The details of these steps are available in the accompanying publication in the Journal of Biomedicine and Biotechnology; a simplified sketch of the first stage is given below.
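Since the paper only summarises these steps, the following Python sketch shows one possible reading of the pre-training stage: batch gradient ascent on the SVM dual with the box constraints enforced by clipping. It is a simplification that assumes labels in {-1, +1} and omits both the equality-constraint handling and the optimal learning-rate rule developed in steps B and D.

```python
import numpy as np

def cga_pretrain(K, y, C=1.0, lr=0.01, epochs=50):
    """Batch gradient ascent on the SVM dual objective
        W(a) = sum_i a_i - 0.5 * a^T (y y^T * K) a,
    clipping every alpha to the box [0, C] after each step.
    Returns the final alphas and their trajectory over the epochs."""
    n = len(y)
    alpha = np.zeros(n)
    Q = (y[:, None] * y[None, :]) * K        # Q_ij = y_i * y_j * K(x_i, x_j)
    history = np.zeros((epochs, n))
    for t in range(epochs):
        grad = 1.0 - Q @ alpha               # dW/da_i = 1 - y_i * sum_j a_j y_j K_ij
        alpha = np.clip(alpha + lr * grad, 0.0, C)   # ascent step + box projection
        history[t] = alpha                   # record the trend of each Lagrangian variable
    return alpha, history
```

Because all Lagrangian variables are updated in one vectorised step, each epoch is cheap; it is the per-variable trend recorded in history that the two-stage scheme of Section 2.2 exploits.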

    2.2. CGA as the pre-training step

The proposed two-stage training approach, with CGA as the pre-training step to SMO, aims to exploit the strengths of both algorithms to provide an overall faster training method.

The proposed CGA trains the data in batch mode and its training time per iteration is very short. This property allows it to quickly identify potential support vectors, serving as a preliminary training step. The fine-tuning ability of CGA, however, is low due to numerical precision and possibly ill-conditioned problems. SMO does not face the same problem in this regard, since its training rests on a completely different basis: heuristics and analytical solutions.

The disadvantage of SMO lies in its scalability to large datasets. It has time complexity O(Ln), where n is the number of training data and L is the number of candidate support vectors during training [11]. Predetermining the candidate support vectors and discarding the rest using CGA can reduce both L and n, and thus improve the training speed; for example, if both shrink by 25%, the product Ln falls to roughly 56% of its original value.

Their respective strengths and weaknesses imply that a joint effort is desirable. The methodology of the two-stage training is largely based on the behaviour of the SVM Lagrangian variables (α) under CGA training. As our results indicate, the α values follow the behaviours illustrated in Fig. 2 and Fig. 3 below, for non-support vectors and support vectors respectively. These patterns of behaviour are a result of strict-gradient ascent; they will not appear if non-strict gradient methods are used.

These graphs show that all α values initially increase, regardless of whether they will later become support vectors (i.e., α > 0) or not. This increase is an intrinsic property of the SVM objective function. However, after a period of time, the α values of non-support vectors drop back to zero.
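The initial rise can be seen from the standard SVM dual objective, which is maximised during training but not written out in the paper; at the starting point α = 0 its gradient is strictly positive in every component, so every α must increase at first:

```latex
% Standard soft-margin SVM dual (stated here for reference)
W(\alpha) = \sum_{i=1}^{n} \alpha_i
          - \tfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j \, y_i y_j \, K(x_i, x_j),
\qquad 0 \le \alpha_i \le C, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0 .

% Gradient with respect to a single Lagrangian variable:
\frac{\partial W}{\partial \alpha_i} = 1 - y_i \sum_{j=1}^{n} \alpha_j \, y_j \, K(x_i, x_j)
\quad\Longrightarrow\quad
\left.\frac{\partial W}{\partial \alpha_i}\right|_{\alpha = 0} = 1 > 0 .
```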

Fig. 2: The plot of the alpha value of a non-support vector against the training epochs in CGA.



Fig. 3: The plot of the alpha value of a support vector against the training epochs in CGA.
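To make the pre-selection step concrete, the following is a hedged Python sketch of the two-stage pipeline, not the authors' code. It reuses the cga_pretrain sketch above, treats any α that has fallen back to (near) zero as a predicted non-support vector, and hands the remaining points to scikit-learn's SVC (whose libsvm backend is an SMO-type solver) as a stand-in for the Keerthi-modified SMO used in the paper; the tolerance and kernel parameters are assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def two_stage_train(X, y, K, C=1.0, tol=1e-8):
    """Stage 1: CGA pre-training flags likely non-support vectors.
    Stage 2: a standard SMO-type solver is trained on the reduced set only.
    X: samples, y: labels in {-1, +1}, K: kernel matrix consistent with the SVC kernel."""
    alpha, _history = cga_pretrain(K, y, C=C)   # from the sketch in Section 2.1
    keep = alpha > tol                          # alphas that did not fall back to zero
    svm = SVC(C=C, kernel="rbf", gamma=0.5)     # assumed kernel settings
    svm.fit(X[keep], y[keep])                   # second-stage training on the reduced data
    return svm, keep
```

The paper additionally ties the amount of data retained to the available cache memory size, which this sketch does not model.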

Unlike the other data pre-selection techniques described previously, CGA also takes the available cache memory size into account. This allows the SMO training to be more effective.

3. Results

The proposed method possesses simplicity and an analytical foundation, two crucial characteristics for algorithmic success, as demonstrated by SMO [17]. The use of CGA as a pre-training step helps to work around a poor caching policy by allowing a large data set to be reduced according to the size of the cache memory.

A splice-site detection dataset from StatLog [18] has been used to evaluate the proposed method. For comparability, both CGA and SMO are implemented with the same settings. Results of other SVM algorithms are obtained from their respective publications.

Table 1 shows that the speed improvement is most significant when the cache memory size is 94% of the size required for storing the full kernel matrix. This indicates the point of best balance between the two stages of training. Note that 94% of the memory size corresponds to a coverage of 75% of the data points, since only half of the kernel needs to be stored in memory due to symmetry. This means that a 25% data reduction with CGA is the most effective for this splice-site detection problem. Nevertheless, there is an overall improvement in speed regardless of the cache size.

Since α might not follow the behaviours in Fig. 2 and Fig. 3 in circumstances where there are extreme kernel values and precision restrictions, it is possible for some alphas to be incorrectly removed during the first-stage training with CGA. Consequently, classification accuracy could be affected.

Table 2 shows that the classification accuracy of a CGA-reduced problem is slightly lower for donor-site detection. However, we have also compared the accuracy with Least-Squares SVM and Reduced SVM (Table 3), and the comparison shows that the proposed two-stage method does not degrade the performance as much as those reformulations do.

    4. Conclusion

Both CGA and SMO have the merit of simplicity of implementation. We propose a method that combines CGA with SMO to provide faster training for SVM classifiers. In terms of training speed, the two-stage training scheme brings a significant improvement on the splice-site data set, although the amount of improvement is not steady across different cache sizes. Experiments also indicate that the classification accuracy of the two-stage SVM is sometimes slightly worse than that of the standard SVM, because practical data sets can be ill-conditioned and practical learning rates are finite. Future work includes developing a better criterion for the transition from CGA to SMO such that real support vectors are preserved. The possibility of prefixing CGA to algorithms other than SMO will also be explored.

    5. References

[1] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines: And Other Kernel-Based Learning Methods. Cambridge, England: Cambridge University Press, 2000.

[2] V. Vapnik, Statistical Learning Theory. New York: Wiley, 1998.

[3] M. Brown, W. Grundy, D. Lin, N. Cristianini, C. Sugnet, T. Furey, M. Ares, and D. Haussler, "Knowledge-based analysis of microarray gene expression data using support vector machines," in Proc. National Academy of Sciences, vol. 97, 2000, pp. 262-267.

[4] S. Liu, Q. Song, W. Hu, and A. Cao, "Diseases classification using support vector machine (SVM)," in Proc. 9th Intl. Conf. Neural Information Processing, vol. 2, 2002, pp. 760-763.

[5] T. Joachims, "Making large-scale support vector machine learning practical," in Advances in Kernel Methods: Support Vector Machines, B. Scholkopf, C. Burges, and A. Smola, Eds. Cambridge, MA: MIT Press, 1998.

[6] C. J. Lin, "On the Convergence of the Decomposition Method for Support Vector Machines," IEEE Trans. Neural Networks, vol. 12, 2001.

[7] S. S. Keerthi, S. K. Shevade, C. Bhattacharyya, and K. R. K. Murthy, "Improvements to Platt's SMO Algorithm for SVM Classifier Design," Neural Computation, vol. 13, pp. 637-649, 2001.

    [8] J. C. Platt, "Fast training of support vector

    machines using sequential minimaloptimization," in Advances in KernelMethods: Support Vector Machines, B.Scholkopf, C. Burges, and A. Smola, Eds.

    Cambridge, MA: MIT Press, 1998.

    [9] C. Campbell and N. Cristianini, "Simple

    Learning Algorithms for training support

    vector machines," Technical Report,

    University of Bristol 1998.

    [10] T. Frie, N. Cristianini, and C. Campbell,

    "The kernel-Adatron algorithm: a fast and

    simple learning procedure for support vector

    machines," in Machine Learning: Proc. of the

    15th International Conf., J. Shavlik, Ed. SanFrancisco: Morgan Kauffman Publishers,1998.

    [11] J. X. Dong, A. Krzyzak, and C. Y. Suen, "Afast SVM training algorithm,"Intl. J. Pattern

    Recognition and Artificial Intelligence, vol.

    17, pp. 367-384, 2003.

    [12] A. A. Hasan and M. A. Hasan, "Constrained

    Gradient Descent and Line Search for Solving

    Optimization Problem with Elliptic

    Constraints," in Proc. Intl. Conf. Acoustics,

    Speech, and Signal Processing, vol. 2, 2003,pp. 763-796.

    [13] J. A. K. Suykens and J. Vandewalle, "Least

    Squares Support Vector Machine Classifiers," Neural Processing Letters, vol. 9, pp. 293-

    300, 1999.

    [14] K. M. Lin and C. J. Lin, "A Study on Reduced

    Support Vector Machines," IEEE Trans. Neural Networks, vol. 14, pp. 1449-1459,

    2003.[15] Z. Wang and E. P. Simoncelli, "Stimulus

    Synthesis for Efficient Evaluation and

    Refinement of Perceptual Image Quality

    Metrics," in Proc. Human Vision and

    Electronic Imaging IX, vol. 5292, 2004.

    [16] H. K. Zhao, B. Merriman, S. Osher, and L.

    Wang, "Capturing the Behaviour of Bubbles

    and Drops Using Variational Level Set

    Approach," J. Computational Physics, vol.

    143, pp. 495-518, 1998.[17] V. Kecman, M. Vogt, and T. M. Huang, "On

    the Equality of Kernel AdaTron and

    Sequential Minimal Optimization inClassification and Regression Tasks and Alike

    Algorithms for Kernel Machines," in Proc.

    11th European Symposium on Artificial

    Neural Networks . Bruges, Belgium, 2003.

    [18] D. Michie, D. J. Spiegelhalter, and C. C.

    Taylor, Machine Learning, Neural and

    Statistical Classification. Englewood Cliffs,

    NJ: Prentice Hall, 1994.

Table 1: Speed improvement (%) with the two-stage training approach, classifying Acceptor Site or Not and Donor Site or Not.

Kernel cache size (% of the memory required to store the full kernel matrix):

                                  100%    94%     75%     44%     0% (no cache)
Speed improvement, Acceptor Site  +86%    +453%   +367%   +297%   +153%
Speed improvement, Donor Site     +29%    +143%   +121%   +114%   +74%

Table 2: Classification accuracy on the StatLog splice-site data set, showing the effect of data reduction due to CGA.

                           Acceptor Site or Not             Donor Site or Not
                           Train set (%)   Test set (%)     Train set (%)   Test set (%)
SVM (SMO)                  100             97.302           100             96.46
Two-stage SVM (CGA+SMO)    100             97.302           99.8            95.6

Table 3: Comparison of classification accuracy with different versions of SVM. Data for LS-SVM and RSVM obtained from [14].

                                    Original SVM (SMO)   Two-stage SVM (CGA+SMO)   LS-SVM   RSVM
Accuracy on test set (average, %)   96.881               96.451                    93.086   93.002
