

Proceedings of the Ninth International Conference on Machine Learning and Cybernetics, Qingdao, 11-14 July 2010

Canonical Duality Solution to Support Vector Machine

Yubo Yuan, Feilong Cao*

Institute of Metrology and Computational Science, China Jiliang University, Hangzhou, Zhejiang 310018, P. R. China

E-MAIL: [email protected]

Abstract

Support vector machine (SVM) is one of the most popular machine learning methods and is deduced from a binary data classification problem. In this paper, a new duality theory named canonical duality theory is presented to solve the normal model of SVM. Several examples are illustrated to show that the exact solution can be obtained after the canonical duality problem is solved. Moreover, the support vectors can be located by the non-zero elements of the canonical dual solution.

Keywords: support vector machine; data mining; classification; smooth function; quadratic programming; BFGS method

1 Introduction

In this paper, we consider how to solve the support vector machine. SVM arises from a pattern classification problem based on a given classification of $m$ points in the $n$-dimensional space $\mathbb{R}^n$, represented by an $m \times n$ matrix $A$, given the membership of each data point $A_i$, $i = 1, 2, \ldots, m$ in the classes 1 or $-1$ as specified by a given $m \times m$ diagonal matrix $D$ with 1 or $-1$ diagonal entries.

The original support vector machine (see [1]-[5]) for this problem is given by the following model (briefly denoted by (SVM)):

$$(\mathrm{SVM}): \quad \min_{(w,\gamma)\in\mathbb{R}^{n+1}} \frac{1}{2}\left(\|w\|^2 + \gamma^2\right), \quad \text{s.t.} \quad D(Aw - e\gamma) \geq e, \qquad (1)$$

where $w$ is a vector of separator coefficients (the direction vector of the classification hyperplane), $\gamma$ is an offset (the control parameter of the distance of the hyperplane to the origin), and $e \in \mathbb{R}^m$ stands for a vector of ones.
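As a concrete illustration (a sketch of our own, not code from the paper), the problem data of model (1) can be assembled in a few lines of Python; the two toy points are the ones used later in Example 1, and the helper names (`primal_objective`, `feasible`) are hypothetical:

```python
import numpy as np

# A minimal sketch of the (SVM) problem data of model (1).
A = np.array([[1.0, 1.0],      # class +1 point
              [-1.0, -1.0]])   # class -1 point
d = np.array([1.0, -1.0])      # class labels on the diagonal of D
D = np.diag(d)                 # m x m diagonal label matrix
m, n = A.shape
e = np.ones(m)                 # vector of ones
E = np.hstack([A, -e[:, None]])  # E = (A, -e), used in Section 2

def primal_objective(w, gamma):
    # P = (1/2)(||w||^2 + gamma^2), the objective of model (1)
    return 0.5 * (w @ w + gamma ** 2)

def feasible(w, gamma):
    # Constraint of (1): D(Aw - e*gamma) >= e
    return np.all(D @ (A @ w - e * gamma) >= e - 1e-9)

print(feasible(np.array([0.5, 0.5]), 0.0))  # True for Example 1's solution
```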



Figure 1. An illustration diagram of SVM for binary classification.

The linear separating hyperplane

$$w^T y = \gamma \qquad (2)$$

has normal $w \in \mathbb{R}^n$ and distance $\gamma/\|w\|_2$ to the origin; the quantity $2/\|w\|_2$ is called the margin. In order to understand these concepts, a simple illustration diagram is given in Figure 1.

Until now, most proposed solution methods have been based on the dual method via Lagrange multipliers. More detailed discussion can be found in references [6]-[12]. In particular, in 2006 ([12]), Gonzalez et al. presented a dual unification of bi-class support vector machine formulations, which is very helpful for understanding how to use the duality problem to obtain the optimal classification parameters.

In recent years, a very powerful duality theory called canonical duality theory has been developed by Gao ([13]-[17]) and used in global optimization, nonlinear mechanics, and other non-convex and non-linear fields. Here, we use it


to solve the support vector machine.

The rest of the paper is organized as follows. In Section 2, we briefly introduce how to obtain the canonical dual problem of the support vector machine. In Section 3, the BFGS method is presented to solve the canonical dual problem. In Section 4, several examples are illustrated. We conclude the paper in Section 5.

2 Canonical Dual Problem for SVM

(SVM) can be represented in the following form:

$$(P): \quad \min_{x\in\mathbb{R}^{n+1}} P(x) = \frac{1}{2}\|x\|^2, \quad \text{s.t.} \quad e - DEx \leq 0, \qquad (3)$$

where $x = (w^T, \gamma)^T \in \mathbb{R}^{n+1}$ and $E = (A, -e)$. Let

$$\mathcal{X} = \{x \mid x \in \mathbb{R}^{n+1},\ e - DEx \leq 0\}. \qquad (4)$$

Following the standard procedure of the canonical dual transformation developed in [13], the so-called geometrical mapping is defined as

$$\xi(x) = e - DEx : \mathbb{R}^{n+1} \to \mathbb{R}^m. \qquad (5)$$

The indicator is defined by

$$I(\xi) = \begin{cases} 0 & \text{if } \xi \leq 0, \\ +\infty & \text{otherwise.} \end{cases} \qquad (6)$$

Thus, $(P)$ can be relaxed by this indicator and takes the following unconstrained form:

$$P(x) = \frac{1}{2}\|x\|^2 + I(\xi(x)). \qquad (7)$$

Note that $I(\xi)$ is convex and lower semi-continuous on $\mathbb{R}^m$; its canonical dual variable $\sigma$ satisfies the following duality relation:

$$\sigma \in \partial I(\xi) \ \Leftrightarrow\ \xi \in \partial I^*(\sigma) \ \Leftrightarrow\ I(\xi) + I^*(\sigma) = \xi^T\sigma, \qquad (8)$$

where $\partial$ denotes the sub-differential in convex analysis, and $I^*(\sigma)$ is the Fenchel sup-conjugate of $I$:

$$I^*(\sigma) = \sup_{\xi\in\mathbb{R}^m}\left\{\xi^T\sigma - I(\xi)\right\} = \begin{cases} 0 & \text{if } \sigma \geq 0, \\ +\infty & \text{otherwise.} \end{cases} \qquad (9)$$

By the canonical dual transformation developed in [13], the canonical dual function of $P(x)$ is defined by

$$P^d(\sigma) = \mathrm{sta}\left\{\frac{1}{2}\|x\|^2 + \xi(x)^T\sigma : x \in \mathbb{R}^{n+1}\right\} = e^T\sigma - \frac{1}{2}\sigma^T G\sigma, \qquad (10)$$

where

$$G = DEE^T D, \qquad (11)$$

and the notation $\mathrm{sta}\{* : x \in \mathbb{R}^{n+1}\}$ represents finding the stationary point of the statement with respect to $x \in \mathbb{R}^{n+1}$.

Thus, the problem $(P)$ can eventually be formulated as follows:

$$(P^d): \quad \max_{\sigma\in\mathbb{R}^m} P^d(\sigma), \quad \text{s.t.} \quad \sigma \geq 0. \qquad (12)$$
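In code, the dual objective of (10)-(12) and the primal-recovery map used in Theorem 1 below are one-liners. The following numpy sketch uses our own notation (the function names are hypothetical), not code from the paper:

```python
import numpy as np

# Sketch of the canonical dual function P^d in (10)-(12).
def dual_objective(sigma, D, E, e):
    G = D @ E @ E.T @ D           # G = D E E^T D as in (11)
    return e @ sigma - 0.5 * sigma @ G @ sigma

# Recover the primal point from a dual solution, via the
# stationarity condition of (10): x = E^T D sigma.
def primal_from_dual(sigma, D, E):
    return E.T @ (D @ sigma)
```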

In order to show that there is no duality gap between them, the following theorem is presented.

Theorem 1. Suppose that $D$, $E$, $e$ are the same as in the definitions in $(P)$ and that the dual feasible space

$$\mathcal{S} = \{\sigma \in \mathbb{R}^m \mid \sigma \geq 0\} \qquad (13)$$

is not empty. Then the problem $(P^d)$ is canonically (perfectly) dual to $(P)$ in the sense that if $\bar{\sigma}$ is a solution of the dual problem $(P^d)$, then

$$\bar{x} = E^T D\bar{\sigma} \qquad (14)$$

is a solution of $(P)$ and

$$P(\bar{x}) = P^d(\bar{\sigma}). \qquad (15)$$

Proof. If $\bar{\sigma}$ is a solution of the dual problem $(P^d)$ such that (14) holds, it must satisfy the KKT conditions. According to the stationarity condition, we have

$$\bar{x} - E^T D\bar{\sigma} = 0. \qquad (16)$$

According to the complementarity condition, we have

$$\bar{\sigma}^T(e - DE\bar{x}) = 0, \quad \text{i.e.,} \quad e^T\bar{\sigma} = \bar{\sigma}^T DEE^T D\bar{\sigma}. \qquad (17)$$

Thus, in terms of $\bar{x} = E^T D\bar{\sigma}$,

$$P(\bar{x}) = \frac{1}{2}\|\bar{x}\|^2 = \frac{1}{2}\bar{\sigma}^T DEE^T D\bar{\sigma} = e^T\bar{\sigma} - \frac{1}{2}\bar{\sigma}^T DEE^T D\bar{\sigma} = P^d(\bar{\sigma}),$$

which shows that there is no duality gap between $(P)$ and $(P^d)$. This completes the proof.

3 BFGS Method for the Canonical Dual Model

The BFGS method is suitable for unconstrained optimization problems in which the objective function and its gradient can be evaluated easily. It is the most widely used of the various quasi-Newton methods (see [39], [40]).


Let us list some notation used in the algorithm: $F(\sigma) = P^d(\sigma)$, $\nabla F$ is the gradient vector of $F$, $H$ is the approximation of the Hessian matrix of $F$, and $\varepsilon$ is the stopping tolerance of the algorithm.

The BFGS method for problem (12) is as follows.

BFGS Algorithm (the BFGS ([39], [40]) algorithm for $(P^d)$ (12)).

Step 0 (Input): data point set $A$, index diagonal matrix $D$, a vector of ones $e \in \mathbb{R}^m$, $B = (A, -e)$, stopping tolerance $\varepsilon$.

Step 1 (Initialization): $H^0 = I \in \mathbb{R}^{m\times m}$, $\sigma^0 \in \mathbb{R}^m$, $\alpha^0 = 1$, and set $i := 0$.

Step 2: Compute $F^i = F(\sigma^i)$ and $g^i = \nabla F(\sigma^i)$.

Step 3: If $\|g^i\|_2 \leq \varepsilon$ or $\alpha^i < 10^{-12}$, then stop and accept $\sigma^i$ as the optimal solution of (12); else calculate $d^i = -H^i g^i$.

Step 4: Perform a line search along direction $d^i$ to get a step length $\alpha^i > 0$; let

$$\sigma^{i+1} = \sigma^i + \alpha^i d^i, \qquad (18)$$

and compute $F^{i+1} = F(\sigma^{i+1})$, $g^{i+1} = \nabla F(\sigma^{i+1})$, $s^i = \sigma^{i+1} - \sigma^i$ and $y^i = g^{i+1} - g^i$.

Step 5: Update $H^i$ to get $H^{i+1}$ (the standard BFGS inverse update):

$$H^{i+1} = \left(I - \frac{s^i (y^i)^T}{(y^i)^T s^i}\right) H^i \left(I - \frac{y^i (s^i)^T}{(y^i)^T s^i}\right) + \frac{s^i (s^i)^T}{(y^i)^T s^i}.$$

Step 6: Set $i := i + 1$ and go to Step 2.

Step 7: Set $\bar{\sigma} = \sigma^i$ and modify $\bar{\sigma}$ according to the complementarity condition: for $k = 1$ to $m$,

$$\bar{\sigma}_k = \begin{cases} \bar{\sigma}_k & \text{if } \bar{\sigma}_k \geq 0, \\ 0 & \text{if } \bar{\sigma}_k < 0. \end{cases} \qquad (19)$$

Step 8: Compute $\bar{x} = B^T D\bar{\sigma}$, $w = \bar{x}_{(1:n)}$, $\gamma = \bar{x}_{n+1}$, where $\bar{x}_{(1:n)} \in \mathbb{R}^n$ is composed of the first $n$ elements of $\bar{x}$.

Step 9: Output $F(\bar{\sigma})$, $w$ and $\gamma$.

Remark 1. The proof of the convergence of the proposed algorithm can be found in [39] and [40]. Steps 2 to 6 form the main part, and Steps 7 to 9 are used to extract the optimal solution.
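The procedure above can be prototyped directly. The following Python sketch is our own hedged rendering, not the authors' code: it minimizes $F = -P^d$ so that the standard minimization-form BFGS update applies, and it substitutes a simple Armijo backtracking line search for the unspecified search of Step 4.

```python
import numpy as np

def bfgs_dual_svm(A, d, tol=1e-8, max_iter=500):
    """Sketch of the Section-3 procedure: BFGS on the canonical dual,
    the complementarity projection (19), and the recovery of step 8."""
    m, n = A.shape
    D = np.diag(d)
    e = np.ones(m)
    B = np.hstack([A, -e[:, None]])        # B = (A, -e)
    G = D @ B @ B.T @ D                    # G = D B B^T D

    F = lambda s: 0.5 * s @ G @ s - e @ s  # F = -P^d (minimized here)
    grad = lambda s: G @ s - e

    H = np.eye(m)                          # step 1: H^0 = I
    sigma = np.zeros(m)
    g = grad(sigma)
    for _ in range(max_iter):
        if np.linalg.norm(g) <= tol:       # step 3 stopping rule
            break
        p = -H @ g                         # step 3: search direction
        alpha, f0 = 1.0, F(sigma)          # step 4: Armijo backtracking
        while F(sigma + alpha * p) > f0 + 1e-4 * alpha * (g @ p) and alpha > 1e-12:
            alpha *= 0.5
        s = alpha * p
        sigma_new = sigma + s
        g_new = grad(sigma_new)
        y = g_new - g
        rho = 1.0 / (y @ s)                # curvature y's > 0 for convex F
        V = np.eye(m) - rho * np.outer(s, y)
        H = V @ H @ V.T + rho * np.outer(s, s)  # step 5: BFGS update
        sigma, g = sigma_new, g_new
    sigma = np.maximum(sigma, 0.0)         # step 7: projection (19)
    x = B.T @ (D @ sigma)                  # step 8: x = B^T D sigma
    return sigma, x[:n], x[n]              # sigma, w, gamma

# The two-point data of Example 1 below gives w = (0.5, 0.5), gamma = 0:
sigma, w, gamma = bfgs_dual_svm(np.array([[1., 1.], [-1., -1.]]),
                                np.array([1., -1.]))
print(sigma, w, gamma)
```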

4 Examples

We now list a few examples to illustrate the applications of the theory presented in this paper.

Example 1. First of all, let us consider a two-point problem. The training points are $A_1 = (1\ \ 1)^T$, $A_2 = (-1\ \ -1)^T$. We want to use these given points to get a classifier (actually, a linear separating line of the form (2)).

The corresponding (SVM) model is represented as follows:

$$\min \frac{1}{2}\left(w_1^2 + w_2^2\right) + \frac{1}{2}\gamma^2, \quad \text{s.t.} \quad w_1 + w_2 - \gamma \geq 1, \quad w_1 + w_2 + \gamma \geq 1. \qquad (20)$$

By (3), let

$$x = (w_1, w_2, \gamma)^T, \quad B = (A, -e) = \begin{pmatrix} 1 & 1 & -1 \\ -1 & -1 & -1 \end{pmatrix}.$$

The (SVM) model can then be represented as

$$(P): \quad \min_{x\in\mathbb{R}^3} P(x) = \frac{1}{2}\|x\|^2, \quad \text{s.t.} \quad e - DBx \leq 0. \qquad (21)$$

According to the canonical duality theory, the dual model is given as follows:

$$(P^d): \quad \max_{\sigma\in\mathbb{R}^2} P^d(\sigma) = -\frac{1}{2}\sigma^T \begin{pmatrix} 3 & 1 \\ 1 & 3 \end{pmatrix} \sigma + \sigma_1 + \sigma_2, \quad \text{s.t.} \quad \sigma_1 \geq 0,\ \sigma_2 \geq 0. \qquad (22)$$

Using the algorithm designed in Section 3, this dual model has the unique solution

$$\bar{\sigma}_1 = 0.25, \quad \bar{\sigma}_2 = 0.25.$$

The corresponding optimal solution of the primal problem can be obtained by

$$\bar{x} = B^T D\bar{\sigma} = \begin{pmatrix} 1 & -1 \\ 1 & -1 \\ -1 & -1 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix} \begin{pmatrix} 0.25 \\ 0.25 \end{pmatrix} = \begin{pmatrix} 0.5 \\ 0.5 \\ 0 \end{pmatrix}.$$

Finally, the solution of the primal model is

$$w_1 = 0.5, \quad w_2 = 0.5, \quad \gamma = 0.$$

The final classifier can be seen in Figure 2.

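The arithmetic of Example 1 is easy to verify numerically. The sketch below (our code, with $\bar{\sigma}$ taken from the text) checks the zero duality gap asserted by Theorem 1:

```python
import numpy as np

# Numeric check of Example 1.
B = np.array([[1., 1., -1.],
              [-1., -1., -1.]])           # B = (A, -e)
D = np.diag([1., -1.])
e = np.ones(2)
G = D @ B @ B.T @ D
sigma = np.array([0.25, 0.25])            # dual solution from the text
x = B.T @ (D @ sigma)                     # (14): x = B^T D sigma
print(x)                                  # [0.5 0.5 0. ]
primal = 0.5 * x @ x                      # P(x)   = 0.25
dual = e @ sigma - 0.5 * sigma @ G @ sigma
print(primal, dual)                       # both 0.25: no duality gap
```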

Example 2. We now consider a two-class classification problem in three dimensions with four points. The training points are $A_1 = (1\ \ 1\ \ 1)^T$, $A_2 = (-1\ \ -1\ \ -1)^T$, $A_3 = (1\ \ 2\ \ 3)^T$, $A_4 = (-1\ \ -2\ \ -3)^T$. The classification index matrix is

$$D = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & -1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & -1 \end{pmatrix}.$$

Using these four training points, the classification plane can be determined.


Figure 2. The dotted blue line is the linear classifier.

By (3), let

$$B = (A, -e) = \begin{pmatrix} 1 & 1 & 1 & -1 \\ -1 & -1 & -1 & -1 \\ 1 & 2 & 3 & -1 \\ -1 & -2 & -3 & -1 \end{pmatrix}.$$

The (SVM) model can then be represented as

$$(P): \quad \min_{x\in\mathbb{R}^4} P(x) = \frac{1}{2}\|x\|^2, \quad \text{s.t.} \quad e - DBx \leq 0. \qquad (23)$$

According to the canonical duality theory, the dual model is given as follows:

$$(P^d): \quad \max_{\sigma\in\mathbb{R}^4} P^d(\sigma) = -\frac{1}{2}\sigma^T \begin{pmatrix} 4 & 2 & 7 & 5 \\ 2 & 4 & 5 & 7 \\ 7 & 5 & 15 & 13 \\ 5 & 7 & 13 & 15 \end{pmatrix} \sigma + \sigma_1 + \sigma_2 + \sigma_3 + \sigma_4,$$

$$\text{s.t.} \quad \sigma_1 \geq 0,\ \sigma_2 \geq 0,\ \sigma_3 \geq 0,\ \sigma_4 \geq 0. \qquad (24)$$
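The quadratic-term matrix in (24) is just $G = DBB^TD$ from (11); a quick check (our code, not the paper's) reproduces it:

```python
import numpy as np

# Sanity check: the matrix in (24) equals D B B^T D.
A = np.array([[1., 1., 1.],
              [-1., -1., -1.],
              [1., 2., 3.],
              [-1., -2., -3.]])
D = np.diag([1., -1., 1., -1.])
B = np.hstack([A, -np.ones((4, 1))])
print(D @ B @ B.T @ D)
# [[ 4.  2.  7.  5.]
#  [ 2.  4.  5.  7.]
#  [ 7.  5. 15. 13.]
#  [ 5.  7. 13. 15.]]
```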

Using the algorithm designed in Section 3, this dual model has the unique solution

$$\bar{\sigma}_1 = 0.1667, \quad \bar{\sigma}_2 = 0.1667, \quad \bar{\sigma}_3 = 0, \quad \bar{\sigma}_4 = 0.$$

Figure 3. The blue plane is the linear classifier.

The corresponding optimal solution of the primal problem can be obtained by

$$\bar{x} = B^T D\bar{\sigma} = \begin{pmatrix} 1 & -1 & 1 & -1 \\ 1 & -1 & 2 & -2 \\ 1 & -1 & 3 & -3 \\ -1 & -1 & -1 & -1 \end{pmatrix} \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & -1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & -1 \end{pmatrix} \begin{pmatrix} 0.1667 \\ 0.1667 \\ 0 \\ 0 \end{pmatrix} = \begin{pmatrix} 0.3333 \\ 0.3333 \\ 0.3333 \\ 0 \end{pmatrix}.$$

Finally, the solution of the primal model is

$$w_1 = 0.3333, \quad w_2 = 0.3333, \quad w_3 = 0.3333, \quad \gamma = 0.$$

The final classifier is the plane $y_3 = -y_1 - y_2$ in three dimensions and can be seen in Figure 3.

Remark 2. $\bar{\sigma}$ is the dual solution; the support vectors can be identified from its non-zero elements. In Example 2, the non-zero elements are $\bar{\sigma}_1$ and $\bar{\sigma}_2$, so we can say that rows 1 and 2 of $A$ are support vectors. This can also be seen in Figure 3.
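Identifying support vectors from the dual solution, as Remark 2 describes, is a one-line filter. A small sketch (our code) with the Example 2 numbers:

```python
import numpy as np

# Locate support vectors via the non-zero elements of sigma.
sigma = np.array([0.1667, 0.1667, 0.0, 0.0])  # dual solution from the text
support = np.nonzero(sigma > 1e-8)[0]
print(support + 1)  # [1 2]: rows 1 and 2 of A are the support vectors
```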

5 Conclusions

Based on the canonical duality theory, the dual model of SVM is presented, and the BFGS method is used to solve it. As the above examples show, the exact solution of the primal model of SVM can be obtained. More importantly, the location of the support vectors can also be identified by the indices of the non-zero elements of the canonical duality solution.


Acknowledgements

The authors would like to offer their sincere thanks to Professor David Yang Gao at Virginia Tech for his very kind tutorial on canonical duality theory.

References

[1] C. J. C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 2 (1998), 121-167.

[2] V. Vapnik, The Nature of Statistical Learning Theory, Springer, 1995.

[3] V. Vapnik, The support vector method of function estimation, NATO ASI Series, Neural Network and Machine Learning, C. Bishop (Ed.), Springer, 1998.

[4] V. Vapnik, An overview of statistical learning theory, in Advances in Kernel Methods: Support Vector Learning, B. Scholkopf, B. Burges and A. Smola (Eds.), The MIT Press, Cambridge, Massachusetts, 1999.

[5] V. Vapnik, Three remarks on support vector function estimation, IEEE Transactions on Neural Networks, 10 (1999), 988-1000.

[6] Gary William Flake and Steve Lawrence, Efficient SVM Regression Training with SMO, Machine Learning, 46 (1-3) (2002), 271-290.

[7] Chih-Wei Hsu and Chih-Jen Lin, A Simple Decomposition Method for Support Vector Machines, Machine Learning, 46 (1-3) (2002), 291-314.

[8] Olivier Chapelle, Vladimir Vapnik, Olivier Bousquet and Sayan Mukherjee, Choosing Multiple Parameters for Support Vector Machines, Machine Learning, 46 (1-3) (2002), 131-159.

[9] Pavel Laskov, Feasible Direction Decomposition Algorithms for Training Support Vector Machines, Machine Learning, 46 (1-3) (2002), 315-349.

[10] S. S. Keerthi, K. B. Duan, S. K. Shevade and A. N. Poo, A Fast Dual Algorithm for Kernel Logistic Regression, Machine Learning, 61 (1-3) (2005), 151-165.


[11] Glenn Fung and O. L. Mangasarian, Finite Newton method for Lagrangian support vector machine classification, Neurocomputing, 55 (1-2) (2003), 39-55.

[12] L. Gonzalez, C. Angulo, F. Velasco and A. Catala, Dual unification of bi-class support vector machine formulations, Pattern Recognition, 39 (7) (2006), 1325-1332.

[13] D. Y. Gao, Canonical dual transformation method and generalized triality theory in nonsmooth global optimization, J. Global Optimization, 17 (1/4) (2000), 127-160.

[14] D. Y. Gao, Perfect duality theory and complete set of solutions to a class of global optimization, Optimization, 52 (4-5) (2003), 467-493.

[15] D. Y. Gao, Complete solutions to constrained quadratic optimization problems, Journal of Global Optimization, 29 (2004), 377-399.

[16] D. Y. Gao, Sufficient Conditions and Canonical Duality in Nonconvex Minimization with Inequality Constraints, Journal of Industrial and Management Optimization, 1 (1) (2005), 53-63.

[17] D. Y. Gao, Complete Solutions and Extremality Criteria to Polynomial Optimization Problems, Journal of Global Optimization, 35 (2006), 131-143.

[18] P. J. Bickel and K. A. Doksum, Mathematical Statistics: Basic Ideas and Selected Topics (Second Edition), Prentice-Hall, Inc., 2001.

[19] J. O. Berger, Statistical Decision Theory and Bayesian Analysis, Springer-Verlag, New York, 1985.

[20] K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, Inc., 1990.

[21] T. M. Mitchell, Machine Learning, McGraw-Hill Companies Inc., 1997.

[22] T. Mitchell, Statistical Approaches to Learning and Discovery, course notes, Machine Learning at CMU, 2003.

[23] D. Montgomery, Design and Analysis of Experiments, John Wiley & Sons, Inc., 1991.

[24] B. Schölkopf, Support Vector Learning, R. Oldenbourg Verlag, Munich, 1997.


[25] Q. He, Z. Z. Shi, L. A. Ren and E. S. Lee, A Novel Classification Method Based on Hypersurface, Mathematical and Computer Modelling, 38 (2003), 395-407.

[26] Ping-Feng Pai, System reliability forecasting by support vector machines with genetic algorithms, Mathematical and Computer Modelling, 43 (2006), 262-274.

[27] B. Chen and P. T. Harker, Smooth Approximations to Nonlinear Complementarity Problems, SIAM J. Optimization, 7 (1997), 403-420.

[28] C. Chen and O. L. Mangasarian, A Class of Smoothing Functions for Nonlinear and Mixed Complementarity Problems, Computational Optimization and Applications, 5 (2) (1996), 97-138.

[29] C. Chen and O. L. Mangasarian, Smoothing Methods for Convex Inequalities and Linear Complementarity Problems, Math. Programming, 71 (1) (1995), 51-69.

[30] X. Chen, L. Qi and D. Sun, Global and Superlinear Convergence of the Smoothing Newton Method and Its Application to General Box Constrained Variational Inequalities, Math. of Computation, 67 (1998), 519-540.

[31] X. Chen and Y. Ye, On Homotopy-Smoothing Methods for Variational Inequalities, SIAM J. Control and Optimization, 37 (1999), 589-616.

[32] Yuh-Jye Lee, Wen-Feng Hsieh and Chien-Ming Huang, ε-SSVR: A Smooth Support Vector Machine for ε-Insensitive Regression, IEEE Transactions on Knowledge and Data Engineering, 17 (5) (2005), 678-685.

[33] Yuh-Jye Lee and O. L. Mangasarian, SSVM: A smooth support vector machine for classification, Computational Optimization and Applications, 22 (1) (2001), 5-21.

[34] Y. Yuan, J. Yan and C. Xu, Polynomial Smooth Support Vector Machine (PSSVM), Chinese Journal of Computers, 28 (1) (2005), 9-17.

[35] Y. Yuan and T. Huang, A Polynomial Smooth Support Vector Machine for Classification, Lecture Notes in Artificial Intelligence, 3584 (2005), 157-164.


[36] O. L. Mangasarian and David R. Musicant, Successive overrelaxation for support vector machines, IEEE Transactions on Neural Networks, 10 (1999), 1032-1037.

[37] J. Platt, Sequential minimal optimization: A fast algorithm for training support vector machines, in Advances in Kernel Methods: Support Vector Learning, 1999, 185-208.

[38] T. Joachims, Making large-scale support vector machine learning practical, in Advances in Kernel Methods: Support Vector Learning, B. Scholkopf, B. Burges and A. Smola (Eds.), The MIT Press, Cambridge, Massachusetts, 1999.

[39] Y. Yuan and R. Byrd, Non-quasi-Newton updates for unconstrained optimization, J. Comput. Math., 13 (1995), 95-107.

[40] Y. Yuan, A modified BFGS algorithm for unconstrained optimization, IMA J. Numer. Anal., 11 (1991), 325-332.

[41] Navneet Panda and Edward Y. Chang, KDX: An Indexer for Support Vector Machines, IEEE Transactions on Knowledge and Data Engineering, 18 (6) (2006), 748-763.

[42] K. Schittkowski, Optimal parameter selection in support vector machines, Journal of Industrial and Management Optimization, 1 (2005), 465-476.

[43] K. F. C. Yiu, K. L. Mak and K. L. Teo, Airfoil design via optimal control theory, Journal of Industrial and Management Optimization, 1 (2005), 133-148.

[44] A. Ghaffari Hadigheh and T. Terlaky, Generalized support set invariancy sensitivity analysis in linear optimization, Journal of Industrial and Management Optimization, 2 (1) (2006), 1-18.

[45] Z. Y. Wu, H. W. J. Lee, F. S. Bai and L. S. Zhang, Quadratic smoothing approximation to $l_1$ exact penalty function in global optimization, Journal of Industrial and Management Optimization, 1 (2005), 533-547.

[46] Giovanni P. Crespi, Ivan Ginchev and Matteo Rocca, Two approaches toward constrained vector optimization and identity of the solutions, Journal of Industrial and Management Optimization, 1 (2005), 549-563.