
Improving training speed of Support Vector Machines by creating exploitable trends of Lagrangian variables: an application to DNA splice site detection

Jason Li, Saman K. Halgamuge

Dynamic Systems & Control Group, DMME, The University of Melbourne, VIC 3010, [email protected]

    Abstract

Support Vector Machines are state-of-the-art machine learning algorithms that can be used for classification problems such as DNA splice site identification. However, the large number of samples in biological data sets can often lead to slow training. The training speed can be improved by removing non-support vectors prior to training. This paper proposes a method to predict non-support vectors with high accuracy through strict-constrained gradient ascent optimisation. Unlike other data pre-selection methods, the proposed gradient-based method is itself a training algorithm for SVM and is also very simple to implement. Comparative experiments are conducted on a DNA splice-site detection problem, and the results show significant speed improvements over other algorithms. The relationship between speed improvement and cache memory size is also examined. The generalisation capability of the proposed algorithm is also shown to be better than that of some other reformulated SVMs.

    1. Introduction

Support vector machines (SVMs) [1, 2] are powerful machine learning algorithms that have been reported as successful in a variety of biological data classification problems, including disease diagnosis and gene expression analysis [3, 4]. Although the performance of SVM is superior in terms of classification accuracy, its training methodology and speed still have significant room for improvement and remain the focus of much research. Such research is especially important for biomedical data sets, as their high dimensionality and large number of samples often hinder the speed of SVM training.

The SVM classifier can be described as a quadratic programming (QP) problem. Traditional methods for solving this QP problem, such as Newton or quasi-Newton methods, are incapable of handling large datasets due to their O(l²) memory requirement [1]. To tackle this, a decomposition framework has been developed to divide the large problem into smaller sub-problems [5, 6]. The well-known Sequential Minimal Optimisation (SMO) [7, 8] and kernel-AdaTron (KA) [9, 10] training algorithms were also developed to address this issue, aiming to keep the memory requirement at a minimum.

However, modern computers possess ample memory, so keeping memory usage at a minimum is inefficient and hinders computational speed. Most biomedical and pattern recognition data sets are extremely high dimensional, meaning that the computation of a kernel entry can be very expensive. To address this problem, the idea of caching has emerged [11]. Caching refers to the process of storing the values of kernel entries in a computer's physical memory to avoid repeated computation; the physical memory used for this purpose is called the cache. Caching allows practitioners to strike a balance between memory usage and the time required for training. The effect of caching on training time is demonstrated in Fig. 1. Note the tremendous time saved when the whole kernel matrix can fit into memory (100%).

Fig. 1: The training time (sec) of SMO on a splice-site detection dataset versus the cache memory available for storing kernel entries (% of the memory required for full kernel storage).
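The caching idea is straightforward to illustrate. The following Python class is a purely illustrative sketch, not the authors' implementation: it caches kernel rows under a least-recently-used policy, and the RBF kernel, the gamma value and the row-level granularity are assumptions made for the example.

```python
import numpy as np
from collections import OrderedDict

class KernelCache:
    """Minimal LRU cache for kernel rows: a row is recomputed only on a cache miss."""

    def __init__(self, X, gamma=0.5, max_rows=100):
        self.X = X                  # training samples, shape (n, d)
        self.gamma = gamma          # RBF kernel width (assumed value)
        self.max_rows = max_rows    # cache capacity, in kernel rows
        self._rows = OrderedDict()  # row index -> cached kernel row

    def row(self, i):
        """Return K[i, :] with K[i, j] = exp(-gamma * ||x_i - x_j||^2), using the cache."""
        if i in self._rows:
            self._rows.move_to_end(i)            # mark as most recently used
            return self._rows[i]
        diff = self.X - self.X[i]                # broadcast differences to all samples
        k_row = np.exp(-self.gamma * np.sum(diff * diff, axis=1))
        self._rows[i] = k_row
        if len(self._rows) > self.max_rows:
            self._rows.popitem(last=False)       # evict the least recently used row
        return k_row
```

Enlarging max_rows trades memory for fewer recomputations, which is the trade-off Fig. 1 quantifies.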



The work presented in this paper has been motivated by the need to reduce training time, especially for cases where the available cache memory can store only a fraction of the SVM kernel matrix. The proposed method integrates a specially tailored version of constrained gradient ascent (CGA) [12] with Keerthi's modified version of SMO [7] (with caching) to provide two-stage training: the proposed extended CGA serves as a fast preliminary training step to identify potential support vectors, while SMO fine-tunes the solution values. We will show that the proposed method, although involving data removal, can achieve better classification accuracy than LS-SVM [13] and RSVM [14], two of the more popular reformulated algorithms.

    2. Data reduction by CGA

2.1. The proposed constrained gradient ascent (CGA) algorithm

The CGA algorithm we propose comprises the first stage of training. In the literature, different types of constrained gradient methods have been reported and applied to a variety of optimisation problems [12, 15, 16]. In this work, we utilise its simplest form, strict-gradient ascent, further develop it to incorporate the constraints imposed by SVM, and develop a simple and fast implementation for it. More specifically, the CGA algorithm has been developed as follows:

A. The simplest case without inequality constraints. A simple form of constrained gradient method has first been considered, ignoring all inequalities of the SVM problem. This sets the framework for further derivation of our algorithm and helps to observe the computational simplicity of CGA.

B. Formulation with equalities. A mathematical model has then been developed to describe how to update the Lagrangian variables of the SVM optimisation problem.

C. Implementation. We have developed pseudo-code describing the computational procedure that efficiently implements the associated mathematical model.

D. Optimal learning rate. One inevitable parameter of CGA is the learning rate. We have developed a method to approximate the theoretically optimal learning rate, with computational time taken into consideration.

The details of these steps are available in the accompanying publication in the Journal of Biomedicine and Biotechnology; a simplified sketch of the first stage is given below.
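Since the paper only summarises these steps, the following Python sketch shows one possible reading of the pre-training stage: batch gradient ascent on the SVM dual with the box constraints enforced by clipping. It is a simplification that assumes labels in {-1, +1} and omits both the equality-constraint handling and the optimal learning-rate rule developed in steps B and D.

```python
import numpy as np

def cga_pretrain(K, y, C=1.0, lr=0.01, epochs=50):
    """Batch gradient ascent on the SVM dual objective
        W(a) = sum_i a_i - 0.5 * a^T (y y^T * K) a,
    clipping every alpha to the box [0, C] after each step.
    Returns the final alphas and their trajectory over the epochs."""
    n = len(y)
    alpha = np.zeros(n)
    Q = (y[:, None] * y[None, :]) * K        # Q_ij = y_i * y_j * K(x_i, x_j)
    history = np.zeros((epochs, n))
    for t in range(epochs):
        grad = 1.0 - Q @ alpha               # dW/da_i = 1 - y_i * sum_j a_j y_j K_ij
        alpha = np.clip(alpha + lr * grad, 0.0, C)   # ascent step + box projection
        history[t] = alpha                   # record the trend of each Lagrangian variable
    return alpha, history
```

Because all Lagrangian variables are updated in one vectorised step, each epoch is cheap; it is the per-variable trend recorded in history that the two-stage scheme of Section 2.2 exploits.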

    2.2. CGA as the pre-training step

The proposed two-stage training approach, with CGA as the pre-training step to SMO, aims to exploit the strengths of both algorithms to provide an overall faster training method.

The proposed CGA trains the data in batch mode and its training time per iteration is very short. This property allows it to quickly identify potential support vectors, serving as a preliminary training step. The fine-tuning ability of CGA, however, is low due to numerical precision and possibly ill-conditioned problems. SMO does not face the same problem in this regard, since its training rests on a completely different basis: heuristics and analytical solutions.

The disadvantage of SMO lies in its scalability to large datasets. It has time complexity O(Ln), where n is the number of training data and L is the number of candidate support vectors during training [11]. Predetermining the candidate support vectors and discarding the rest using CGA can reduce both L and n, and thus improve the training speed; for example, if both shrink by 25%, the product Ln falls to roughly 56% of its original value.

Their respective strengths and weaknesses imply that a joint effort is desirable. The methodology of the two-stage training is largely based on the behaviour of the SVM Lagrangian variables (α) under CGA training. As our results indicate, the α values follow the behaviours illustrated in Fig. 2 and Fig. 3 below, for non-support vectors and support vectors respectively. These patterns of behaviour are a result of strict-gradient ascent; they will not appear if non-strict gradient methods are used.

These graphs show that all α values initially increase, regardless of whether they will later become support vectors (i.e., α > 0) or not. This increase is an intrinsic property of the SVM objective function. However, after a period of time, the α values of non-support vectors drop back to zero.
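The initial rise can be seen from the standard SVM dual objective, which is maximised during training but not written out in the paper; at the starting point α = 0 its gradient is strictly positive in every component, so every α must increase at first:

```latex
% Standard soft-margin SVM dual (stated here for reference)
W(\alpha) = \sum_{i=1}^{n} \alpha_i
          - \tfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j \, y_i y_j \, K(x_i, x_j),
\qquad 0 \le \alpha_i \le C, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0 .

% Gradient with respect to a single Lagrangian variable:
\frac{\partial W}{\partial \alpha_i} = 1 - y_i \sum_{j=1}^{n} \alpha_j \, y_j \, K(x_i, x_j)
\quad\Longrightarrow\quad
\left.\frac{\partial W}{\partial \alpha_i}\right|_{\alpha = 0} = 1 > 0 .
```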

Fig. 2: The plot of the alpha value of a non-support vector against the training epochs in CGA.



Fig. 3: The plot of the alpha value of a support vector against the training epochs in CGA.
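To make the pre-selection step concrete, the following is a hedged Python sketch of the two-stage pipeline, not the authors' code. It reuses the cga_pretrain sketch above, treats any α that has fallen back to (near) zero as a predicted non-support vector, and hands the remaining points to scikit-learn's SVC (whose libsvm backend is an SMO-type solver) as a stand-in for the Keerthi-modified SMO used in the paper; the tolerance and kernel parameters are assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def two_stage_train(X, y, K, C=1.0, tol=1e-8):
    """Stage 1: CGA pre-training flags likely non-support vectors.
    Stage 2: a standard SMO-type solver is trained on the reduced set only.
    X: samples, y: labels in {-1, +1}, K: kernel matrix consistent with the SVC kernel."""
    alpha, _history = cga_pretrain(K, y, C=C)   # from the sketch in Section 2.1
    keep = alpha > tol                          # alphas that did not fall back to zero
    svm = SVC(C=C, kernel="rbf", gamma=0.5)     # assumed kernel settings
    svm.fit(X[keep], y[keep])                   # second-stage training on the reduced data
    return svm, keep
```

The paper additionally ties the amount of data retained to the available cache memory size, which this sketch does not model.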

Unlike the other data pre-selection techniques described previously, CGA also takes the available cache memory size into account. This allows the SMO training to be more effective.

3. Results

The proposed method possesses simplicity and an analytical foundation, two crucial characteristics for algorithmic success, as demonstrated by SMO [17]. The use of CGA as a pre-training step helps to work around a poor caching policy by allowing a large data set to be reduced according to the size of the cache memory.

A splice-site detection dataset from StatLog [18] has been used to evaluate the proposed method. For comparability, both CGA and SMO are implemented with the same settings. Results of other SVM algorithms are obtained from their respective publications.

Table 1 shows that the speed improvement is most significant when the cache memory size is 94% of the size required for storing the full kernel matrix. This indicates the point of best balance between the two stages of training. Note that 94% of the memory size corresponds to a coverage of 75% of the data points, since only half of the kernel needs to be stored in memory due to symmetry. This means that a 25% data reduction with CGA is the most effective for this splice-site detection problem. Nevertheless, there is an overall improvement in speed regardless of the cache size.

Since α might not follow the behaviours in Fig. 2 and Fig. 3 in circumstances where there are extreme kernel values and precision restrictions, it is possible for some alphas to be incorrectly removed during the first-stage training with CGA. Consequently, classification accuracy could be affected.

Table 2 shows that the classification accuracy of a CGA-reduced problem is slightly lower for donor-site detection. However, we have also compared the accuracy with Least-Squares SVM and Reduced SVM (Table 3), and the comparison shows that the proposed two-stage method does not degrade the performance as much as those reformulations do.

    4. Conclusion

Both CGA and SMO have the merit of simplicity of implementation. We propose a method that combines CGA with SMO to provide faster training for SVM classifiers. In terms of training speed, the two-stage training scheme brings a significant improvement on the splice-site data set, although the amount of improvement is not steady across different cache sizes. Experiments also indicate that the classification accuracy of the two-stage SVM is sometimes slightly worse than that of the standard SVM, because practical data sets can be ill-conditioned and practical learning rates are finite. Future work includes developing a better criterion for the transition from CGA to SMO such that real support vectors are preserved. The possibility of prefixing CGA to algorithms other than SMO will also be explored.

    5. References

[1] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines: And Other Kernel-Based Learning Methods. Cambridge, England: Cambridge University Press, 2000.

[2] V. Vapnik, Statistical Learning Theory. New York: Wiley, 1998.

[3] M. Brown, W. Grundy, D. Lin, N. Cristianini, C. Sugnet, T. Furey, M. Ares, and D. Haussler, "Knowledge-based analysis of microarray gene expression data using support vector machines," in Proc. National Academy of Sciences, vol. 97, 2000, pp. 262-267.

[4] S. Liu, Q. Song, W. Hu, and A. Cao, "Diseases classification using support vector machine (SVM)," in Proc. 9th Intl. Conf. Neural Information Processing, vol. 2, 2002, pp. 760-763.

[5] T. Joachims, "Making large-scale support vector machine learning practical," in Advances in Kernel Methods: Support Vector Machines, B. Scholkopf, C. Burges, and A. Smola, Eds. Cambridge, MA: MIT Press, 1998.

[6] C. J. Lin, "On the Convergence of the Decomposition Method for Support Vector Machines," IEEE Trans. Neural Networks, vol. 12, 2001.

[7] S. S. Keerthi, S. K. Shevade, C. Bhattacharyya, and K. R. K. Murthy, "Improvements to Platt's SMO Algorithm for SVM Classifier Design," Neural Computation, vol. 13, pp. 637-649, 2001.

    [8] J. C. Platt, "Fast training of support vector

    machines using sequential minimaloptimization," in Advances in KernelMethods: Support Vector Machines, B.Scholkopf, C. Burges, and A. Smola, Eds.

    Cambridge, MA: MIT Press, 1998.

    [9] C. Campbell and N. Cristianini, "Simple

    Learning Algorithms for training support

    vector machines," Technical Report,

    University of Bristol 1998.

    [10] T. Frie, N. Cristianini, and C. Campbell,

    "The kernel-Adatron algorithm: a fast and

    simple learning procedure for support vector

    machines," in Machine Learning: Proc. of the

    15th International Conf., J. Shavlik, Ed. SanFrancisco: Morgan Kauffman Publishers,1998.

    [11] J. X. Dong, A. Krzyzak, and C. Y. Suen, "Afast SVM training algorithm,"Intl. J. Pattern

    Recognition and Artificial Intelligence, vol.

    17, pp. 367-384, 2003.

    [12] A. A. Hasan and M. A. Hasan, "Constrained

    Gradient Descent and Line Search for Solving

    Optimization Problem with Elliptic

    Constraints," in Proc. Intl. Conf. Acoustics,

    Speech, and Signal Processing, vol. 2, 2003,pp. 763-796.

    [13] J. A. K. Suykens and J. Vandewalle, "Least

    Squares Support Vector Machine Classifiers," Neural Processing Letters, vol. 9, pp. 293-

    300, 1999.

    [14] K. M. Lin and C. J. Lin, "A Study on Reduced

    Support Vector Machines," IEEE Trans. Neural Networks, vol. 14, pp. 1449-1459,

    2003.[15] Z. Wang and E. P. Simoncelli, "Stimulus

    Synthesis for Efficient Evaluation and

    Refinement of Perceptual Image Quality

    Metrics," in Proc. Human Vision and

    Electronic Imaging IX, vol. 5292, 2004.

    [16] H. K. Zhao, B. Merriman, S. Osher, and L.

    Wang, "Capturing the Behaviour of Bubbles

    and Drops Using Variational Level Set

    Approach," J. Computational Physics, vol.

    143, pp. 495-518, 1998.[17] V. Kecman, M. Vogt, and T. M. Huang, "On

    the Equality of Kernel AdaTron and

    Sequential Minimal Optimization inClassification and Regression Tasks and Alike

    Algorithms for Kernel Machines," in Proc.

    11th European Symposium on Artificial

    Neural Networks . Bruges, Belgium, 2003.

    [18] D. Michie, D. J. Spiegelhalter, and C. C.

    Taylor, Machine Learning, Neural and

    Statistical Classification. Englewood Cliffs,

    NJ: Prentice Hall, 1994.

Table 1: Speed improvement (%) with the two-stage training approach, classifying Acceptor Site or Not and Donor Site or Not.

Kernel cache size (% of the memory required to store the full kernel matrix):

                                  100%    94%     75%     44%     0% (no cache)
Speed improvement, Acceptor Site  +86%    +453%   +367%   +297%   +153%
Speed improvement, Donor Site     +29%    +143%   +121%   +114%   +74%

Table 2: Classification accuracy on the StatLog splice-site data set, showing the effect of data reduction due to CGA.

                           Acceptor Site or Not             Donor Site or Not
                           Train set (%)   Test set (%)     Train set (%)   Test set (%)
SVM (SMO)                  100             97.302           100             96.46
Two-stage SVM (CGA+SMO)    100             97.302           99.8            95.6

Table 3: Comparison of classification accuracy with different versions of SVM. Data for LS-SVM and RSVM obtained from [14].

                                    Original SVM (SMO)   Two-stage SVM (CGA+SMO)   LS-SVM   RSVM
Accuracy on test set (average, %)   96.881               96.451                    93.086   93.002
