Improving training speed of Support Vector Machines by creating exploitable
trends of Lagrangian variables: an application to DNA splice site detection
Jason Li, Saman K. Halgamuge
Dynamic Systems & Control Group, DMME, The University of Melbourne, VIC 3010, [email protected]
Abstract
Support Vector Machines are state-of-the-art
machine learning algorithms that can be used for
classification problems such as DNA splice site
identification. However, the large number of samples
in biological data sets can often lead to slow training speed.
The training speed can be improved by
removing non-support vectors prior to training. This
paper proposes a method to predict non-support
vectors with high accuracy by the use of strict-constrained
gradient ascent optimisation. Unlike other
data pre-selection methods, the proposed gradient
based method is itself a training algorithm for SVM,
and is also very simple to implement. Experiments with
comparable results are conducted on a DNA splice-site
detection problem. Results show significant speed
improvements over other algorithms. The relationship
between speed improvement and cache memory size is
also exploited. Generalisation capability of the
proposed algorithm is also shown to be better than
some other reformulated SVMs.
1. Introduction
Support vector machines (SVMs) [1, 2] are
powerful machine learning algorithms that have been
reported as successful in a variety of biological data
classification problems including disease diagnosis and
gene expression analysis [3, 4]. Although the
performance of SVM is superior in terms of
classification accuracy, its training methodology and
speed still have significant room for improvement and
remain the focus of much research. Such
research is especially important for biomedical data
sets, as their high dimensionality and large number of
samples often hinder the speed of SVM training.
The SVM classifier can be described as a quadratic
programming (QP) problem. Traditional methods for
solving this QP problem, such as Newton or quasi-Newton
methods, are incapable of handling large
datasets due to their $O(l^2)$ memory requirement [1],
where $l$ is the number of training samples. To
tackle this, a decomposition framework has been
developed to divide the large problem into smaller sub-problems
[5, 6]. The well-known Sequential Minimal
Optimisation (SMO) [7, 8] and kernel-AdaTron (KA)
[9, 10] training algorithms were also developed to
address this issue, aiming to keep the memory
requirement at a minimum.
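To give a feel for the quadratic memory requirement, the short calculation below estimates the storage needed for a dense kernel matrix. The function name and the choice of 8-byte (double precision) entries are assumptions made for illustration; the paper does not state the numeric precision used.

```python
def full_kernel_bytes(num_samples, bytes_per_entry=8):
    """Bytes needed for a dense num_samples x num_samples kernel matrix.

    bytes_per_entry=8 assumes double-precision floats (an assumption).
    """
    return num_samples * num_samples * bytes_per_entry


# Memory grows quadratically: ten times the samples needs a hundred times the memory.
for l in (1_000, 10_000, 100_000):
    gib = full_kernel_bytes(l) / 2**30
    print(f"l = {l:>7,d}: {gib:10.3f} GiB")
```

Doubling the number of samples quadruples the storage, which is why methods that hold the full matrix stop being practical on large data sets.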
However, modern computers possess so much
memory that minimising memory usage is inefficient
and hinders computational speed. Most biomedical and
pattern recognition data sets are extremely high
dimensional, meaning that the computation of a kernel
entry can be very expensive. To address this problem,
the idea of caching has emerged [11]. Caching refers to
the process of storing the values of kernel entries in a
computer's physical memory to avoid repeated
computation. The physical memory used for this
purpose is called the cache. Caching allows
practitioners to strike a balance between memory usage
and the time required for training. The effect of caching on
training time is demonstrated in Fig. 1. Note the
tremendous time saved when the whole kernel matrix
can fit into memory (100%).
Fig. 1: Cache effect on training time. The training time (sec) of SMO on a splice-site detection dataset
versus the memory size available for storing kernel entries (cache memory, as % of the memory required for full kernel storage)
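The caching idea can be sketched as a small row cache that stores computed kernel rows and evicts the least recently used one when full. This is only an illustration: the least-recently-used policy, the RBF kernel, and the names `KernelRowCache` and `rbf_kernel` are assumptions, not the caching scheme used in the experiments.

```python
import math
from collections import OrderedDict


def rbf_kernel(x, z, gamma=0.5):
    """Gaussian (RBF) kernel between two points given as tuples."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


class KernelRowCache:
    """Keeps at most max_rows kernel rows; evicts the least recently used row."""

    def __init__(self, data, kernel, max_rows):
        self.data = data
        self.kernel = kernel
        self.max_rows = max_rows
        self.rows = OrderedDict()
        self.hits = 0
        self.misses = 0

    def row(self, i):
        """Return the kernel values between sample i and every sample."""
        if i in self.rows:
            self.hits += 1
            self.rows.move_to_end(i)  # mark as most recently used
            return self.rows[i]
        self.misses += 1
        values = [self.kernel(self.data[i], x) for x in self.data]
        if self.max_rows > 0:
            self.rows[i] = values
            if len(self.rows) > self.max_rows:
                self.rows.popitem(last=False)  # drop the oldest row
        return values


data = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
cache = KernelRowCache(data, rbf_kernel, max_rows=2)
for i in (0, 1, 0, 2, 1):
    cache.row(i)
print("hits:", cache.hits, "misses:", cache.misses)
```

With room for only two of the four rows, most requests in this access pattern are recomputed; raising `max_rows` to the number of samples removes all recomputation after the first pass, which mirrors the 100% case in Fig. 1.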
Frontiers in the Convergence of Bioscience and Information Technologies 2007
0-7695-2999-2/07 $25.00 2007 IEEE
DOI 10.1109/FBIT.2007.56
The work presented in this paper has been
motivated by the need to reduce training time,
especially for cases where available cache memory can
store only a fraction of the SVM kernel matrix. The
proposed method integrates a specially tailored version
of constrained gradient ascent (CGA) [12] with
Keerthi's modified version of SMO [7] (with caching)
to provide a two-stage training: the proposed extended
CGA serves as a fast preliminary training step to
identify potential support vectors, while SMO fine-tunes
the solution values. We will show that the
proposed method, although involving data removal,
can achieve better classification accuracy than
LS-SVM [13] and RSVM [14], two of the more popular
algorithms.
2. Data reduction by CGA
2.1. The proposed constrained gradient ascent
(CGA) algorithm
The CGA algorithm we propose comprises the first
stage of training. In the literature, different
types of constrained gradient methods have been reported, and
they have been applied to a variety of optimisation
problems [12, 15, 16]. In this work, we utilise its
simplest form, strict-gradient ascent, further develop
it to incorporate the constraints imposed by SVM, and
develop a simple and fast implementation for it. More
specifically, the CGA algorithm has been developed as
follows:
A. The simplest case without inequality constraints.
A simple form of constrained gradient method has
first been considered, ignoring all inequalities of
SVM. This sets the framework for further
derivation of our algorithm and helps to show the
computational simplicity of CGA.
B. Formulation with equalities. A mathematical
model has then been developed to describe how to
update the Lagrangian variables of the SVM
optimisation problem.
C. Implementation. We have developed pseudocode
describing the computational procedure that
efficiently implements the associated
mathematical model.
D. Optimal learning rate. One unavoidable parameter
of CGA is the learning rate. We have developed a
method to approximate the theoretical optimal
learning rate with computational time taken into
consideration.
The details of these are available in the
accompanying publication in the Journal of
Biomedicine and Biotechnology.
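As a rough illustration of batch gradient ascent on the SVM Lagrangian variables, the sketch below maximises the standard SVM dual objective and clips each variable to [0, C] after every step. It is not the authors' CGA: the equality constraint is dropped (equivalent to fixing the bias at zero), the learning rate is a fixed hand-picked value rather than an approximated optimum, and the function names and toy data are assumptions made for illustration.

```python
def dual_gradient(alpha, y, K):
    """Gradient of W(a) = sum_i a_i - 0.5 * sum_ij a_i a_j y_i y_j K_ij."""
    n = len(alpha)
    return [
        1.0 - y[i] * sum(alpha[j] * y[j] * K[i][j] for j in range(n))
        for i in range(n)
    ]


def clipped_gradient_ascent(K, y, C=10.0, lr=0.1, epochs=200):
    """Batch gradient ascent with box clipping (assumption: no bias term).

    Returns the final alphas and the alpha values after every epoch.
    """
    alpha = [0.0] * len(y)
    history = [alpha[:]]
    for _ in range(epochs):
        grad = dual_gradient(alpha, y, K)
        # All variables are updated together, then clipped to [0, C].
        alpha = [min(C, max(0.0, a + lr * g)) for a, g in zip(alpha, grad)]
        history.append(alpha[:])
    return alpha, history


# Toy 1-D problem with a linear kernel: the points at -1 and +1 lie on the margin.
x = [-2.0, -1.0, 1.0, 2.0]
y = [-1, -1, 1, 1]
K = [[a * b for b in x] for a in x]

alpha, history = clipped_gradient_ascent(K, y)
print("final alphas:", [round(a, 4) for a in alpha])
print("alpha of x=-2 over the first epochs:", [round(h[0], 2) for h in history[:8]])
```

On this toy problem the variables of the two outer points rise in the first epoch and then fall back to zero, while those of the two margin points keep growing towards their final values.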
2.2. CGA as the pre-training step
The proposed two-stage training approach, withCGA as the pre-training step to SMO, aims to exploitthe strengths of both the CGA and SMO algorithms to
provide an overall faster training method.
The proposed CGA trains data in batch and itstraining time for each iteration is very fast. This nice
property allows it to quickly identify potential support
vectors, serving as a preliminary training step. The fine
tuning ability of CGA, however, is low due to numeric
precision and possibly ill-conditioned problems. SMO
does not face the same problem in this regard, since itstraining is based on a completely different basis
heuristics and analytical solutions.
The disadvantage of SMO lies in its scalability to
large dataset. It has time complexity ( )LnO where nis the number of training data and L the number of
candidate support vectors during training [11].
Predetermining candidate support vectors anddiscarding the rest using CGA can help reduce L andn, thus improve the training speed.
Their strengths and weaknesses imply that a jointeffort is desired. The methodology of the two-stage
training is largely based on the behaviour of the SVM
Lagrangian variables () under the training of CGA.As our results indicate, follow the behavioursillustrated in Fig. 2 and Fig. 3 below, for non-support
vectors and support vectors respectively. These patterns of behaviours are a result of strict-gradient
ascent; such behaviours will not exist if non-strictgradient methods are used.
These graphs show that all values will initiallyincrease regardless of whether they will become
support vectors (i.e., > 0) later or not. This increaseis an intrinsic property of the SVM objective function.
However, after a period of time, those for non-support vectors will drop back to zero.
Fig. 2: The plot of alpha values of a non-support vector against the
training epochs in CGA
Fig. 3: The plot of alpha values of a support vector against the
training epochs in CGA
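The trends in Fig. 2 and Fig. 3 suggest a simple way to separate candidate support vectors from the rest: discard the samples whose alpha value rose and then returned to zero during pre-training. The sketch below shows that idea only; the function names, the tolerance, and the hard-coded trajectories are hypothetical, and the actual transition criterion from CGA to SMO is not specified in this paper.

```python
def rose_then_fell(trajectory, tol=1e-8):
    """True if the alpha value became positive and later returned to zero."""
    return max(trajectory) > tol and trajectory[-1] <= tol


def split_by_trend(history, tol=1e-8):
    """history[t][i] is the alpha of sample i after epoch t.

    Returns (keep, drop): indices of candidate support vectors and of
    samples predicted to be non-support vectors.
    """
    keep, drop = [], []
    for i in range(len(history[0])):
        trajectory = [step[i] for step in history]
        (drop if rose_then_fell(trajectory, tol) else keep).append(i)
    return keep, drop


# Hypothetical alpha values for two samples over six epochs.
history = [
    [0.00, 0.00],
    [0.10, 0.10],
    [0.08, 0.14],
    [0.04, 0.22],
    [0.00, 0.30],
    [0.00, 0.41],
]
keep, drop = split_by_trend(history)
print("keep:", keep, "drop:", drop)
```

Only the kept indices would be passed on to the second training stage, which is how the reduction in the number of training data and candidate support vectors would come about.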
Unlike other data extraction techniques described
previously, CGA also takes into account the available
cache memory size. This allows the SMO training to
be more effective.
3. Results
The proposed method possesses simplicity and an
analytical foundation, two crucial characteristics for
algorithmic success as demonstrated by SMO [17]. The
use of CGA as a pre-training step helps to work around
poor caching policy by allowing a large data set to be
reduced according to the size of cache memory.
A splice-site detection dataset from StatLog [18]
has been used to evaluate our proposed method. For
comparability, both CGA and SMO are
implemented in the same settings. Results of other
SVM algorithms are obtained from their respective
publications.
Table 1 shows that speed improvement is most
significant when the cache memory size is 94% of the
size required for storing the full kernel matrix. This
indicates the point of best balance between the two
stages of training. Note that 94% of memory size
means a coverage of 75% of data points, since only half
of the kernel needs to be stored in memory due to
symmetry. This means that 25% data reduction with
CGA is the most effective for this splice-site detection
problem. Nevertheless, there is an overall improvement
in speed regardless of the cache size.
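One reading that reproduces the cache sizes listed in Table 1 is that only the upper triangle of the kernel matrix is stored, row by row. Under that assumption (the storage layout is not spelled out in the paper), the rows belonging to a fraction f of the data points occupy roughly 1 - (1 - f)^2 of the triangular storage. The function names below are hypothetical.

```python
import math


def cache_fraction_for_coverage(f):
    """Share of triangular kernel storage used by the rows of a fraction f of points.

    Row i of the upper triangle holds n - i entries, so the first f*n rows
    hold about n^2 * (f - f^2 / 2) of the n^2 / 2 entries in total.
    """
    return 1.0 - (1.0 - f) ** 2


def coverage_for_cache_fraction(c):
    """Inverse of cache_fraction_for_coverage."""
    return 1.0 - math.sqrt(1.0 - c)


for f in (1.0, 0.75, 0.5, 0.25, 0.0):
    print(f"coverage {f:4.0%} -> cache {cache_fraction_for_coverage(f):7.2%}")
```

Coverages of 75%, 50% and 25% of the data points give about 94%, 75% and 44% of the memory, matching the column headings of Table 1.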
Since α might not follow the behaviours in Fig. 2
and Fig. 3 in circumstances where we have extreme
kernel values and precision restrictions, it is possible to
have some alphas incorrectly removed during the first-stage
training with CGA. Consequently, classification
accuracy could be affected.
Table 2 shows that the classification accuracy of a
CGA-reduced problem is slightly lower for donor-site
detection. However, we have also compared the
accuracy with Least-Squares SVM and Reduced SVM
(Table 3) and it shows that the proposed two-staged
method does not degrade the performance as much as
those reformulations do.
4. Conclusion
Both CGA and SMO have the merit of simplicity in
implementation. We propose a method that combines
CGA with SMO to provide faster training for SVM
classifiers. In terms of training speed, the two-stage
training scheme brings significant improvement on
the splice-site data set. However, the amount of
improvement is not steady across different cache sizes.
Experiments also indicate that the classification accuracy
of the two-stage SVM is sometimes a little worse than that of
the standard SVM, because practical data sets can be
ill-conditioned and practical learning rates are finite.
Future work includes developing a better criterion for
the transition from CGA to SMO such that real support
vectors can be preserved. The possibility of prefixing
CGA to algorithms other than SMO will also be
explored.
5. References
[1] N. Cristianini and J. Shawe-Taylor, An
introduction to support vector machines: And
other kernel-based learning methods.
Cambridge, England: Cambridge University Press, 2000.
[2] V. Vapnik, Statistical Learning Theory. NY:
Wiley, 1998.
[3] M. Brown, W. Grundy, D. Lin, N. Cristianini,
C. Sugnet, T. Furey, M. Ares, and D.
Haussler, "Knowledge-based analysis of
microarray gene expression data using support
vector machines," in Proc. National Academy
of Sciences, vol. 97, 2000, pp. 262-267.
[4] S. Liu, Q. Song, W. Hu, and A. Cao,
"Diseases classification using support vector
machine (SVM)," in Proc. 9th Intl. Conf.
Neural Information Processing, vol. 2, 2002,
pp. 760-763.
[5] T. Joachims, "Making large-scale support
vector machine learning practical," in
Advances in Kernel Methods: Support Vector
Machines, B. Schölkopf, C. Burges, and A.
Smola, Eds. Cambridge, MA: MIT Press,
1998.
[6] C. J. Lin, "On the Convergence of the
Decomposition Method for Support Vector
Machines," IEEE Trans. Neural Networks,
vol. 12, 2001.
[7] S. S. Keerthi, S. K. Shevade, C.
Bhattacharyya, and K. R. K. Murthy,
"Improvements to Platt's SMO Algorithm for
SVM Classifier Design," Neural Comp., vol.
13, pp. 637-649, 2001.
[8] J. C. Platt, "Fast training of support vector
machines using sequential minimal
optimization," in Advances in Kernel
Methods: Support Vector Machines, B.
Schölkopf, C. Burges, and A. Smola, Eds.
Cambridge, MA: MIT Press, 1998.
[9] C. Campbell and N. Cristianini, "Simple
Learning Algorithms for training support
vector machines," Technical Report,
University of Bristol, 1998.
[10] T. Friess, N. Cristianini, and C. Campbell,
"The kernel-Adatron algorithm: a fast and
simple learning procedure for support vector
machines," in Machine Learning: Proc. of the
15th International Conf., J. Shavlik, Ed. San
Francisco: Morgan Kaufmann Publishers,
1998.
[11] J. X. Dong, A. Krzyzak, and C. Y. Suen, "A
fast SVM training algorithm," Intl. J. Pattern
Recognition and Artificial Intelligence, vol.
17, pp. 367-384, 2003.
[12] A. A. Hasan and M. A. Hasan, "Constrained
Gradient Descent and Line Search for Solving
Optimization Problem with Elliptic
Constraints," in Proc. Intl. Conf. Acoustics,
Speech, and Signal Processing, vol. 2, 2003,
pp. 763-796.
[13] J. A. K. Suykens and J. Vandewalle, "Least
Squares Support Vector Machine Classifiers,"
Neural Processing Letters, vol. 9, pp. 293-300,
1999.
[14] K. M. Lin and C. J. Lin, "A Study on Reduced
Support Vector Machines," IEEE Trans.
Neural Networks, vol. 14, pp. 1449-1459,
2003.
[15] Z. Wang and E. P. Simoncelli, "Stimulus
Synthesis for Efficient Evaluation and
Refinement of Perceptual Image Quality
Metrics," in Proc. Human Vision and
Electronic Imaging IX, vol. 5292, 2004.
[16] H. K. Zhao, B. Merriman, S. Osher, and L.
Wang, "Capturing the Behaviour of Bubbles
and Drops Using Variational Level Set
Approach," J. Computational Physics, vol.
143, pp. 495-518, 1998.
[17] V. Kecman, M. Vogt, and T. M. Huang, "On
the Equality of Kernel AdaTron and
Sequential Minimal Optimization in
Classification and Regression Tasks and Alike
Algorithms for Kernel Machines," in Proc.
11th European Symposium on Artificial
Neural Networks. Bruges, Belgium, 2003.
[18] D. Michie, D. J. Spiegelhalter, and C. C.
Taylor, Machine Learning, Neural and
Statistical Classification. Englewood Cliffs,
NJ: Prentice Hall, 1994.
Table 1: Speed improvement with two-stage training approach; classifying Acceptor Site or Not and Donor Site or Not
Speed improvement (%) by kernel cache size (% of the memory required to store full kernel matrix)
                  100%     94%      75%      44%      0% (no cache)
Acceptor Site     +86%     +453%    +367%    +297%    +153%
Donor Site        +29%     +143%    +121%    +114%    +74%

Table 2: Classification accuracy on Statlog splice-site data set showing effect of data reduction due to CGA.
                            Acceptor Site or Not                  Donor Site or Not
                            Train accuracy (%)  Test accuracy (%)  Train accuracy (%)  Test accuracy (%)
SVM (SMO)                   100                 97.302             100                 96.46
Two-staged SVM (CGA+SMO)    100                 97.302             99.8                95.6

Table 3: Comparison of classification accuracy with different versions of SVM. Data for LS-SVM and RSVM obtained from [14].
                                  Original SVM (with SMO)   Two-staged SVM (CGA+SMO)   LS-SVM    RSVM
Accuracy on test set (average)    96.881                    96.451                     93.086    93.002