Proceedings of the Ninth International Conference on Machine Learning and Cybernetics, Qingdao, 11-14 July 2010
Canonical Duality Solution to Support Vector Machine
Yubo Yuan, Feilong Cao*
Institute of Metrology and Computational Science, China Jiliang University, Hangzhou, Zhejiang 310018, P. R. China
E-MAIL: [email protected]
Abstract: Support vector machine (SVM) is one of the most popular machine learning methods and is derived from a binary data classification problem. In this paper, a new duality theory, named canonical duality theory, is presented to solve the standard model of SVM. Several examples are given to show that the exact solution can be obtained once the canonical dual problem has been solved. Moreover, the support vectors can be located by the non-zero elements of the canonical dual solution.
Keywords: support vector machine; data mining; classification; smooth function; quadratic programming; BFGS method
1 Introduction
In this paper, we consider how to solve the support vector machine (SVM). SVM arises from a pattern classification problem: given $m$ points in the $n$-dimensional space $\mathbb{R}^n$, represented by an $m \times n$ matrix $A$, the membership of each data point $A_i$, $i = 1, 2, \ldots, m$, in the class 1 or -1 is specified by a given $m \times m$ diagonal matrix $D$ with 1 or -1 on its diagonal.
The original support vector machine (see [1]-[5]) for this problem is given by the following model (briefly denoted by (SVM)):
$$(SVM):\quad \min_{(w,\gamma)\in\mathbb{R}^{n+1}} \ \tfrac{1}{2}\left(\|w\|_2^2 + \gamma^2\right) \quad \text{s.t.}\quad D(Aw - e\gamma) \ge e, \qquad (1)$$
where $w$ is the vector of separator coefficients (the direction vector of the classification hyperplane), $\gamma$ is an offset (the parameter controlling the distance of the hyperplane to the origin), and $e \in \mathbb{R}^m$ stands for a vector of ones.
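As a small illustration of model (1), the following sketch evaluates the objective and the constraint values for a candidate pair $(w, \gamma)$. The function name `svm_primal` and the choice to store $D$ as a vector of its diagonal entries are assumptions made here for brevity; the data are the two points used in Example 1.

```python
import numpy as np

def svm_primal(A, d, w, gamma):
    """Evaluate model (1) at (w, gamma).

    Returns the objective 0.5 * (||w||^2 + gamma^2) and the vector
    D (A w - e gamma); the point is feasible when every entry is >= 1.
    """
    A = np.asarray(A, dtype=float)
    d = np.asarray(d, dtype=float)   # diagonal of D, entries +1 or -1
    w = np.asarray(w, dtype=float)
    objective = 0.5 * (w @ w + gamma ** 2)
    margins = d * (A @ w - gamma)    # entry i is D_ii * (A_i w - gamma)
    return objective, margins

A = [[1.0, 1.0], [-1.0, -1.0]]       # the two training points of Example 1
d = [1.0, -1.0]
objective, margins = svm_primal(A, d, w=[0.5, 0.5], gamma=0.0)
feasible = bool(np.all(margins >= 1.0))
print(objective, margins, feasible)
```

Both constraints hold with equality at this point, which is what one expects when every training point lies on the margin.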
978-1-4244-6527-9/10/$26.00 ©2010 IEEE
Figure 1. Illustration of SVM for binary classification.
The linear separating hyperplane is
$$w^T y = \gamma, \quad y \in \mathbb{R}^n, \qquad (2)$$
with normal $w \in \mathbb{R}^n$ and distance $|\gamma| / \|w\|_2$ to the origin. The quantity $2 / \|w\|_2$ is called the margin. To illustrate these concepts, a simple diagram is given in Figure 1.
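The two geometric quantities just mentioned are easy to compute. The sketch below assumes the hyperplane is written as $w^T y = \gamma$; the helper name `hyperplane_geometry` is hypothetical.

```python
import numpy as np

def hyperplane_geometry(w, gamma):
    """Return (distance of the hyperplane w^T y = gamma to the origin, margin)."""
    norm_w = np.linalg.norm(np.asarray(w, dtype=float))
    return abs(gamma) / norm_w, 2.0 / norm_w

# w = (0.5, 0.5), gamma = 0: a line through the origin
distance, margin = hyperplane_geometry([0.5, 0.5], 0.0)
print(distance, margin)
```

A smaller $\|w\|_2$ gives a wider margin, which is why the objective of (SVM) penalizes $\|w\|_2^2$.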
Until now, most proposed solution methods have been based on the Lagrangian dual. More detailed discussions can be found in [6]-[12]. In particular, in 2006, Gonzalez et al. [12] presented a dual unification of bi-class support vector machine formulations, which is very helpful for understanding how to use the dual problem to obtain the optimal classification parameters.
In recent years, a very powerful duality theory called canonical duality theory has been developed by Gao ([13]-[17]) and used in global optimization, nonlinear mechanics and other non-convex and nonlinear fields. Here, we use it to solve the support vector machine.
The rest of the paper is organized as follows. In Section 2, we briefly introduce how to obtain the canonical dual problem of the support vector machine. In Section 3, the BFGS method is presented to solve the canonical dual problem. In Section 4, several examples are illustrated. We conclude the paper in Section 5.
2 Canonical Dual Problem for SVM
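(SVM) can be represented as follows:
$$(P):\quad \min_{x\in\mathbb{R}^{n+1}} \ P(x) = \tfrac{1}{2}\|x\|^2 \quad \text{s.t.}\quad e - DEx \le 0, \qquad (3)$$
where $x = (w^T, \gamma)^T \in \mathbb{R}^{n+1}$ and $E = (A, -e)$. Let
$$\mathcal{X} = \{x \mid x \in \mathbb{R}^{n+1},\ e - DEx \le 0\}. \qquad (4)$$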
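Following the standard procedure of the canonical dual transformation developed in [19], the so-called geometrical mapping is defined as
$$\epsilon(x) = e - DEx : \mathbb{R}^{n+1} \to \mathbb{R}^m. \qquad (5)$$
The indicator is defined by
$$I(\epsilon) = \begin{cases} 0 & \text{if } \epsilon \le 0, \\ +\infty & \text{otherwise.} \end{cases} \qquad (6)$$
Thus, (P) can be relaxed by this indicator and takes the unconstrained form
$$\min_{x\in\mathbb{R}^{n+1}} \left\{ I(\epsilon(x)) + \tfrac{1}{2}\|x\|^2 \right\}. \qquad (7)$$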
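Note that $I(\epsilon)$ is convex and lower semi-continuous on $\mathbb{R}^m$, so its canonical dual variable $\sigma$ satisfies the following duality relations
$$\sigma \in \partial^- I(\epsilon) \iff \epsilon \in \partial^- I^*(\sigma) \iff I(\epsilon) + I^*(\sigma) = \epsilon^T\sigma, \qquad (8)$$
where $\partial^-$ denotes the sub-differential in convex analysis and $I^*(\sigma)$ is the Fenchel sup-conjugate of $I$, given by
$$I^*(\sigma) = \sup_{\epsilon\in\mathbb{R}^m} \{\epsilon^T\sigma - I(\epsilon)\} = \begin{cases} 0 & \text{if } \sigma \ge 0, \\ +\infty & \text{otherwise.} \end{cases} \qquad (9)$$
By the canonical dual transformation developed in [13],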
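the canonical dual function of $P(x)$ is defined by
$$P^d(\sigma) = U^{\Lambda}(\sigma) - I^*(\sigma), \qquad (10)$$
where
$$U^{\Lambda}(\sigma) = \mathrm{sta}\left\{ \tfrac{1}{2}\|x\|^2 + \sigma^T(e - DEx) : x \in \mathbb{R}^{n+1} \right\} = e^T\sigma - \tfrac{1}{2}\sigma^T D E E^T D \sigma, \qquad (11)$$
where the notation $\mathrm{sta}\{ * : x \in \mathbb{R}^{n+1}\}$ represents finding the stationary point of the statement with respect to $x \in \mathbb{R}^{n+1}$; here the stationary point is $x = E^T D \sigma$.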
Thus, the problem (P) can eventually be formulated as follows:
$$(P^d):\quad \max_{\sigma\in\mathbb{R}^m} \ P^d(\sigma) \quad \text{s.t.}\quad \sigma \ge 0. \qquad (12)$$
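To make the dual problem concrete, here is a minimal sketch that assumes the dual objective on $\sigma \ge 0$ takes the quadratic form $e^T\sigma - \tfrac{1}{2}\sigma^T G \sigma$ with $G = DEE^TD$, which is what results from eliminating $x = E^T D\sigma$. The symbol $G$, the function names and the three-point data set are all introduced here for illustration and do not come from the paper. The sketch checks the analytic gradient $e - G\sigma$ against central finite differences.

```python
import numpy as np

def build_gram(A, d):
    """Return DE and G = (DE)(DE)^T for data A (m x n) and labels d (+1/-1)."""
    A = np.asarray(A, dtype=float)
    d = np.asarray(d, dtype=float)
    E = np.hstack([A, -np.ones((A.shape[0], 1))])  # E = (A, -e)
    DE = d[:, None] * E                             # row i of E scaled by D_ii
    return DE, DE @ DE.T

def dual_value(G, sigma):
    """Assumed closed form of the dual objective: e^T sigma - 0.5 sigma^T G sigma."""
    return sigma.sum() - 0.5 * sigma @ G @ sigma

def dual_gradient(G, sigma):
    """Gradient of the assumed dual objective: e - G sigma."""
    return 1.0 - G @ sigma

# Made-up data, used only to exercise the formulas
A = [[2.0, 0.0], [0.0, 2.0], [-1.0, -1.0]]
d = [1.0, 1.0, -1.0]
DE, G = build_gram(A, d)
sigma = np.array([0.3, 0.1, 0.2])

# Central finite differences along each coordinate direction
h = 1e-6
fd_gradient = np.array([
    (dual_value(G, sigma + h * u) - dual_value(G, sigma - h * u)) / (2 * h)
    for u in np.eye(len(sigma))
])
x = DE.T @ sigma                                    # x = E^T D sigma
print(np.max(np.abs(fd_gradient - dual_gradient(G, sigma))))
```

Because $x = E^T D \sigma$, the quadratic term equals $\tfrac{1}{2}\|x\|^2$, so the dual value can also be read as $e^T\sigma - P(x)$.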
To show that there is no duality gap between them, the following theorem is presented.
Theorem 1 Suppose that $D$, $E$, $e$ are as defined in (P) and that the dual feasible space
(13)
is not empty. Then the problem $(P^d)$ is canonically (perfectly) dual to (P) in the sense that if $\bar\sigma$ is a solution of the dual problem $(P^d)$, then
$$\bar x = E^T D \bar\sigma \qquad (14)$$
is a solution of (P) and
$$P(\bar x) = P^d(\bar\sigma). \qquad (15)$$
Proof If $\bar\sigma$ is a solution of the dual problem $(P^d)$ and $\bar x$ is given by (14), then $\bar\sigma$ must satisfy the KKT conditions. Then, according to the complementarity conditions, we have
(16)
According to the complementarity conditions, we also have
(17)
Thus, in terms of $\bar x = E^T D \bar\sigma$, we obtain
$$P(\bar x) = P^d(\bar\sigma),$$
which shows that there is no duality gap between (P) and $(P^d)$. This completes the proof.
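One way to spell out the complementarity argument is sketched below. It is offered as a reading aid rather than as the authors' exact derivation: the shorthand $G = DEE^TD$ is introduced here, and the dual objective is assumed to have the quadratic form $P^d(\sigma) = e^T\sigma - \tfrac{1}{2}\sigma^T G\sigma$ on $\sigma \ge 0$.

```latex
\begin{align*}
&\text{KKT conditions for } (P^d):
  && \bar\sigma \ge 0, \qquad e - G\bar\sigma \le 0, \qquad \bar\sigma^T (e - G\bar\sigma) = 0, \\
&\text{feasibility of } \bar x = E^T D \bar\sigma:
  && e - DE\bar x = e - G\bar\sigma \le 0, \\
&\text{complementarity}:
  && e^T\bar\sigma = \bar\sigma^T G \bar\sigma = \|\bar x\|^2, \\
&\text{equal values}:
  && P^d(\bar\sigma) = e^T\bar\sigma - \tfrac{1}{2}\bar\sigma^T G\bar\sigma
     = \|\bar x\|^2 - \tfrac{1}{2}\|\bar x\|^2 = \tfrac{1}{2}\|\bar x\|^2 = P(\bar x).
\end{align*}
```

The second line shows that $\bar x$ is feasible for (P), and the last line shows that its objective value matches the dual value, so $\bar x$ is optimal by weak duality.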
3 BFGS method for the Canonical Dual Model
The BFGS method is suitable for unconstrained optimization problems in which the objective function and its gradient can easily be evaluated. It is the most widely used of the quasi-Newton methods (see [39], [40]).
We first list the notation used in the algorithm: $F(\sigma) = P^d(\sigma)$, $\nabla F$ is the gradient vector of $F$, $H$ is the approximation of the inverse Hessian matrix of $F$, and $\varepsilon$ is the stopping tolerance of the algorithm.
The BFGS method for Problem (12) is as follows.
BFGS Algorithm (the BFGS ([39], [40]) algorithm for $(P^d)$ (12)).
step 0: (Input) Data point set $A$, index diagonal matrix $D$, a vector of ones $e \in \mathbb{R}^m$, $B = (A, -e)$, stopping tolerance $\varepsilon$.
step 1: (Initialization) $H^0 = I \in \mathbb{R}^{m\times m}$, $\sigma^0 \in \mathbb{R}^m$, $\alpha^0 = 1$ and set $i := 0$;
step 2: Compute $F^i = F(\sigma^i)$ and $g^i = \nabla F(\sigma^i)$;
step 3: If $\|g^i\|_2^2 \le \varepsilon$ or $\alpha^0 < 10^{-12}$, then stop and accept $\sigma^i$ as the optimal solution of (12); else calculate $d^i = -H^i g^i$;
step 4: Perform a line search along direction $d^i$ to get a step length $\alpha^i > 0$; let
$$\sigma^{i+1} = \sigma^i + \alpha^i d^i,$$
and compute $F^{i+1} = F(\sigma^{i+1})$, $g^{i+1} = \nabla F(\sigma^{i+1})$ and $y^i = g^{i+1} - g^i$;
step 5: Update $H^i$ to get $H^{i+1}$:
(18)
step 6: Set $i := i + 1$, go to step 2;
step 7: $\bar\sigma = \sigma^i$; modify $\bar\sigma$ according to the complementarity condition: for $k = 1$ to $m$ do
$$\bar\sigma_k = \begin{cases} \bar\sigma_k & \text{if } \bar\sigma_k \ge 0, \\ 0 & \text{if } \bar\sigma_k < 0. \end{cases} \qquad (19)$$
step 8: Compute $\bar x = B^T D \bar\sigma$, $w = \bar x_{(1:n)}$, $\gamma = \bar x_{n+1}$, where $\bar x_{(1:n)} \in \mathbb{R}^n$ is composed of the first $n$ elements of $\bar x$;
step 9: Output $F(\bar\sigma)$, $w$ and $\gamma$.
Remark 1 The proof of the convergence property of the proposed algorithm can be found in [39] and [40]. Steps 2 to 6 are the main part of the algorithm, and steps 7-9 are used to recover the optimal solution.
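The steps above can be sketched in NumPy as follows. Several choices here are assumptions rather than the authors' implementation: the sketch minimizes $f = -P^d$ (BFGS is usually stated for minimization), it assumes the quadratic dual form $e^T\sigma - \tfrac{1}{2}\sigma^T G \sigma$ with $G = DBB^TD$, it uses the textbook inverse-Hessian BFGS update in place of the update formula of [39], [40], and it uses Armijo backtracking as the line search. The function name `bfgs_dual` and its default parameters are hypothetical.

```python
import numpy as np

def bfgs_dual(A, d, tol=1e-12, max_iter=200):
    """Sketch of steps 0-9: BFGS on f = -P^d, then clipping and primal recovery."""
    A = np.asarray(A, dtype=float)
    d = np.asarray(d, dtype=float)            # diagonal of D
    m, n = A.shape
    B = np.hstack([A, -np.ones((m, 1))])      # step 0: B = (A, -e)
    DB = d[:, None] * B
    G = DB @ DB.T                             # assumed Hessian of -P^d

    def f(s):
        return 0.5 * s @ G @ s - s.sum()

    def grad(s):
        return G @ s - 1.0

    sigma, H = np.zeros(m), np.eye(m)         # step 1
    g = grad(sigma)                           # step 2
    for _ in range(max_iter):
        if g @ g <= tol:                      # step 3: stopping test
            break
        direction = -H @ g
        alpha = 1.0                           # step 4: Armijo backtracking
        while (alpha >= 1e-12 and
               f(sigma + alpha * direction) > f(sigma) + 1e-4 * alpha * (g @ direction)):
            alpha *= 0.5
        if alpha < 1e-12:
            break
        s = alpha * direction
        g_new = grad(sigma + s)
        y = g_new - g
        sy = s @ y
        if sy > 1e-12:                        # step 5: standard BFGS inverse update
            rho, eye = 1.0 / sy, np.eye(m)
            H = ((eye - rho * np.outer(s, y)) @ H @ (eye - rho * np.outer(y, s))
                 + rho * np.outer(s, s))
        sigma, g = sigma + s, g_new           # step 6
    sigma_bar = np.maximum(sigma, 0.0)        # step 7: clip negative entries
    x = DB.T @ sigma_bar                      # step 8: x = B^T D sigma
    return sigma_bar, x[:n], x[n]

# The two training points of Example 1
sigma_bar, w, gamma = bfgs_dual([[1.0, 1.0], [-1.0, -1.0]], [1.0, -1.0])
print(sigma_bar, w, gamma)
```

Note that the iteration itself ignores the constraint $\sigma \ge 0$, so the clipping in step 7 recovers a dual solution only when the unconstrained maximizer already lies in the nonnegative orthant, as it does for this two-point data set. Otherwise a bound-constrained solver would be the safer choice.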
4 Examples
We now list a few examples to illustrate the applications of the theory presented in this paper.
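Example 1. First of all, let us consider a two-point problem. The training points are $A_1 = (1, 1)^T$, $A_2 = (-1, -1)^T$. We want to use these given points to get a classifier (actually, a linear separating line of the form (2)).
The corresponding (SVM) model is
$$\min \ \tfrac{1}{2}(w_1^2 + w_2^2) + \tfrac{1}{2}\gamma^2 \quad \text{s.t.}\quad w_1 + w_2 - \gamma \ge 1,\quad w_1 + w_2 + \gamma \ge 1. \qquad (20)$$
As in (3), let
$$x = (w_1, w_2, \gamma)^T, \qquad B = (A, -e) = \begin{pmatrix} 1 & 1 & -1 \\ -1 & -1 & -1 \end{pmatrix}.$$
The (SVM) model can then be represented as
$$(P):\quad \min_{x\in\mathbb{R}^3} \ P(x) = \tfrac{1}{2}\|x\|^2 \quad \text{s.t.}\quad e - DBx \le 0. \qquad (21)$$
According to the canonical duality theory, the dual model is
$$(P^d):\quad \max \ P^d(\sigma) = -\tfrac{1}{2}\sigma^T \begin{pmatrix} 3 & 1 \\ 1 & 3 \end{pmatrix} \sigma + \sigma_1 + \sigma_2 \quad \text{s.t.}\quad \sigma_1 \ge 0,\ \sigma_2 \ge 0. \qquad (22)$$
Using the algorithm designed in Section 3, this dual model has the unique solution
$$\bar\sigma_1 = 0.25, \quad \bar\sigma_2 = 0.25.$$
The corresponding optimal solution of the primal problem is obtained by
$$\bar x = B^T D \bar\sigma = \begin{pmatrix} 1 & 1 & -1 \\ -1 & -1 & -1 \end{pmatrix}^T \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix} \bar\sigma = \begin{pmatrix} 0.5 \\ 0.5 \\ 0 \end{pmatrix}.$$
Finally, the solution of the primal model is
$$w_1 = 0.5, \quad w_2 = 0.5, \quad \gamma = 0.$$
The final classifier can be seen in Figure 2.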
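The numbers in this example can be checked by hand. The worked computation below writes $G = DBB^TD$ (shorthand introduced here) and assumes the quadratic dual form $P^d(\sigma) = e^T\sigma - \tfrac{1}{2}\sigma^T G \sigma$.

```latex
\begin{align*}
DB &= \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}
      \begin{pmatrix} 1 & 1 & -1 \\ -1 & -1 & -1 \end{pmatrix}
    = \begin{pmatrix} 1 & 1 & -1 \\ 1 & 1 & 1 \end{pmatrix}, \\
G  &= (DB)(DB)^T = \begin{pmatrix} 3 & 1 \\ 1 & 3 \end{pmatrix}, \\
\nabla P^d(\sigma) &= e - G\sigma = 0
   \;\Longrightarrow\; 3\sigma_1 + \sigma_2 = 1,\quad \sigma_1 + 3\sigma_2 = 1
   \;\Longrightarrow\; \sigma_1 = \sigma_2 = \tfrac{1}{4}, \\
\bar x &= (DB)^T \bar\sigma = \tfrac{1}{4}(1, 1, -1)^T + \tfrac{1}{4}(1, 1, 1)^T = (0.5,\ 0.5,\ 0)^T.
\end{align*}
```

Both components of the stationary point are positive, so the constraint $\sigma \ge 0$ is inactive and the unconstrained stationary point is also the constrained maximizer.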
Example 2. We now consider a two-class classification problem in three dimensions with four points. The training points are $A_1 = (1, 1, 1)^T$, $A_2 = (-1, -1, -1)^T$, $A_3 = (1, 2, 3)^T$, $A_4 = (-1, -2, -3)^T$. The classification index matrix is
$$D = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & -1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & -1 \end{pmatrix}.$$
Using these four training points, the classification plane can be determined.
Figure 2. The dotted blue line is the linear classifier.
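As in (3), let
$$B = (A, -e) = \begin{pmatrix} 1 & 1 & 1 & -1 \\ -1 & -1 & -1 & -1 \\ 1 & 2 & 3 & -1 \\ -1 & -2 & -3 & -1 \end{pmatrix}.$$
The (SVM) model can be represented as
$$(P):\quad \min_{x\in\mathbb{R}^4} \ P(x) = \tfrac{1}{2}\|x\|^2 \quad \text{s.t.}\quad e - DBx \le 0. \qquad (23)$$
According to the canonical duality theory, the dual model is
$$(P^d):\quad \max \ P^d(\sigma) = -\tfrac{1}{2}\sigma^T \begin{pmatrix} 4 & 2 & 7 & 5 \\ 2 & 4 & 5 & 7 \\ 7 & 5 & 15 & 13 \\ 5 & 7 & 13 & 15 \end{pmatrix} \sigma + \sigma_1 + \sigma_2 + \sigma_3 + \sigma_4 \quad \text{s.t.}\quad \sigma_1 \ge 0,\ \sigma_2 \ge 0,\ \sigma_3 \ge 0,\ \sigma_4 \ge 0. \qquad (24)$$
Using the algorithm designed in Section 3, this dual model has the unique solution
$$\bar\sigma_1 = 0.1667, \quad \bar\sigma_2 = 0.1667, \quad \bar\sigma_3 = 0, \quad \bar\sigma_4 = 0.$$
The corresponding optimal solution of the primal problem is obtained by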
Figure 3. The blue plane is the linear classifier.
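$$\bar x = B^T D \bar\sigma = \begin{pmatrix} 1 & 1 & 1 & -1 \\ -1 & -1 & -1 & -1 \\ 1 & 2 & 3 & -1 \\ -1 & -2 & -3 & -1 \end{pmatrix}^T \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & -1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & -1 \end{pmatrix} \bar\sigma = \begin{pmatrix} 0.3333 \\ 0.3333 \\ 0.3333 \\ 0 \end{pmatrix}.$$
Finally, the solution of the primal model is
$$w_1 = 0.3333, \quad w_2 = 0.3333, \quad w_3 = 0.3333, \quad \gamma = 0.$$
The final classifier is the plane $y_3 = -y_1 - y_2$ in three dimensions and can be seen in Figure 3.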
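The reported dual solution can be checked numerically against the optimality conditions of a bound-constrained concave quadratic. The sketch below assumes the dual objective $e^T\sigma - \tfrac{1}{2}\sigma^T G \sigma$ with $G = DBB^TD$ (shorthand introduced here) and uses the exact value $1/6$ in place of the rounded 0.1667.

```python
import numpy as np

A = np.array([[1, 1, 1], [-1, -1, -1], [1, 2, 3], [-1, -2, -3]], dtype=float)
d = np.array([1.0, -1.0, 1.0, -1.0])           # diagonal of D
B = np.hstack([A, -np.ones((4, 1))])            # B = (A, -e)
DB = d[:, None] * B
G = DB @ DB.T                                   # matrix of the quadratic term

sigma = np.array([1 / 6, 1 / 6, 0.0, 0.0])      # reported dual solution
gradient = 1.0 - G @ sigma                      # gradient of the dual objective
x = DB.T @ sigma                                # primal point x = B^T D sigma

# KKT for max over sigma >= 0: gradient <= 0 and sigma_k * gradient_k = 0
kkt_ok = bool(np.all(sigma >= 0)
              and np.all(gradient <= 1e-12)
              and abs(sigma @ gradient) <= 1e-12)
print(G)
print(gradient, x, kkt_ok)
```

The gradient vanishes on the first two coordinates and is negative on the last two, where $\sigma_3 = \sigma_4 = 0$, which is the pattern the KKT conditions require.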
Remark 2 Since $\bar\sigma$ is the dual solution, the support vectors can be identified by its non-zero elements. In Example 2, the non-zero elements are $\bar\sigma_1$ and $\bar\sigma_2$, so rows 1 and 2 of $A$ are the support vectors. This can be seen in Figure 3.
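A minimal sketch of this identification, using the rounded values reported in Example 2, is given below. The helper name `support_vector_indices` and the threshold `tol` are assumptions introduced for illustration.

```python
import numpy as np

def support_vector_indices(sigma, tol=1e-8):
    """Return the 1-based indices k for which sigma_k exceeds tol."""
    return [k + 1 for k, s in enumerate(sigma) if s > tol]

A = np.array([[1, 1, 1], [-1, -1, -1], [1, 2, 3], [-1, -2, -3]], dtype=float)
d = np.array([1.0, -1.0, 1.0, -1.0])            # diagonal of D
sigma = np.array([0.1667, 0.1667, 0.0, 0.0])    # dual solution of Example 2
w, gamma = np.full(3, 1 / 3), 0.0               # primal solution of Example 2

indices = support_vector_indices(sigma)
margins = d * (A @ w - gamma)                   # D (A w - e gamma)
print(indices, margins)
```

The constraint values equal 1 exactly for the points with non-zero multipliers and exceed 1 for the others, which agrees with the complementarity conditions.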
5 Conclusions
Based on the canonical duality theory, the dual model of SVM is presented, and the BFGS method is used to solve it. The above examples show that the exact solution of the primal model of SVM can be obtained. More importantly, the support vectors can be located by the indices of the non-zero elements of the canonical dual solution.
Acknowledgements
The author would like to offer his sincere thanks to Professor David Yang Gao at Virginia Tech for his very kind tutorial on canonical duality theory.
References
[1] C. J. C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 2 (1998), 121-167.
[2] V. Vapnik, The Nature of Statistical Learning Theory, Springer, 1995.
[3] V. Vapnik, The support vector method of function estimation, NATO ASI Series, Neural Network and Machine Learning, C. Bishop (Ed.), Springer, 1998.
[4] V. Vapnik, An overview of statistical learning theory, in Advances in Kernel Methods: Support Vector Learning, B. Schölkopf, C. Burges and A. Smola (Eds), The MIT Press, Cambridge, Massachusetts, 1999.
[5] V. Vapnik, Three remarks on support vector function estimation, IEEE Transactions on Neural Networks, 10 (1999), 988-1000.
[6] Gary William Flake and Steve Lawrence, Efficient SVM Regression Training with SMO. Machine Learning, Volume 46, Issues 1-3, Pages: 271-290, January 2002.
[7] Chih-Wei Hsu and Chih-Jen Lin, A Simple Decomposition Method for Support Vector Machines. Machine Learning, Volume 46, Issues 1-3, Pages: 291-314, January 2002.
[8] Olivier Chapelle, Vladimir Vapnik, Olivier Bousquet and Sayan Mukherjee, Choosing Multiple Parameters for Support Vector Machines. Machine Learning, Volume 46, Issues 1-3, Pages: 131-159, January 2002.
[9] Pavel Laskov, Feasible Direction Decomposition Algorithms for Training Support Vector Machines. Machine Learning, Volume 46, Issues 1-3, Pages: 315-349, January 2002.
[10] S. S. Keerthi, K. B. Duan, S. K. Shevade and A. N. Poo, A Fast Dual Algorithm for Kernel Logistic Regression. Machine Learning, Volume 61, Issues 1-3, Pages: 151-165, November 2005.
[11] Glenn Fung and O. L. Mangasarian, Finite Newton method for Lagrangian support vector machine classification. Neurocomputing, Volume 55, Issues 1-2, Pages: 39-55, September 2003.
[12] L. Gonzalez, C. Angulo, F. Velasco and A. Catala, Dual unification of bi-class support vector machine formulations. Pattern Recognition, Volume 39, Issue 7, Pages: 1325-1332, July 2006.
[13] Gao, D.Y., Canonical dual transformation method and generalized triality theory in nonsmooth global optimization, J. Global Optimization, 17(1/4) (2000), 127-160.
[14] Gao, D.Y., Perfect duality theory and complete set of solutions to a class of global optimization, Optimization, 52(4-5) (2003), 467-493.
[15] Gao, D.Y., Complete solutions to constrained quadratic optimization problems, Journal of Global Optimization, 29 (2004), 377-399.
[16] Gao, D.Y., Sufficient Conditions and Canonical Duality in Nonconvex Minimization with Inequality Constraints, Journal of Industrial and Management Optimization, 1(1) (2005), 53-63.
[17] Gao, D.Y., Complete Solutions and Extremality Criteria to Polynomial Optimization Problems, Journal of Global Optimization, 35 (2006), 131-143.
[18] P. J. Bickel and K. A. Doksum, Mathematical Statistics - Basic Ideas and Selected Topics (Second Edition), Prentice-Hall, Inc., 2001.
[19] J. O. Berger, Statistical Decision Theory and Bayesian Analysis, Springer Verlag, New York, 1985.
[20] K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, Inc., 1990.
[21] T. M. Mitchell, Machine Learning, McGraw-Hill Companies Inc., 1997.
[22] T. Mitchell, Statistical Approaches to Learning and Discovery, The course of Machine Learning at CMU, 2003.
[23] D. Montgomery, Design and Analysis of Experiments, John Wiley & Sons, Inc., 1991.
[24] B. Schölkopf, Support Vector Learning, R. Oldenbourg Verlag, Munich, 1997.
[25] Q. He, Z. Z. Shi, L. A. Ren and E. S. Lee, A Novel Classification Method Based on Hypersurface, Mathematical and Computer Modelling, 38 (2003), 395-407.
[26] Ping-Feng Pai, System reliability forecasting by support vector machines with genetic algorithms, Mathematical and Computer Modelling, 43 (2006), 262-274.
[27] B. Chen and P. T. Harker, Smooth Approximations to Nonlinear Complementarity Problems, SIAM J. Optimization, 7 (1997), 403-420.
[28] C. Chen and O. L. Mangasarian, A Class of Smoothing Functions for Nonlinear and Mixed Complementarity Problems, Computational Optimization and Applications, 5(2) (1996), 97-138.
[29] C. Chen and O. L. Mangasarian, Smoothing Methods for Convex Inequalities and Linear Complementarity Problems, Math. Programming, 71(1) (1995), 51-69.
[30] X. Chen, L. Qi and D. Sun, Global and Superlinear Convergence of the Smoothing Newton Method and Its Application to General Box Constrained Variational Inequalities, Math. of Computation, 67 (1998), 519-540.
[31] X. Chen and Y. Ye, On Homotopy-Smoothing Methods for Variational Inequalities, SIAM J. Control and Optimization, 37 (1999), 589-616.
[32] Lee Yuh-Jye, Wen-Feng Hsieh and Chien-Ming Huang, ε-SSVR: A Smooth Support Vector Machine for ε-Insensitive Regression, IEEE Transactions on Knowledge and Data Engineering, 17(5) (2005), 678-685.
[33] Lee Yuh-Jye and O. L. Mangasarian, SSVM: A smooth support vector machine for classification, Computational Optimization and Applications, 22(1) (2001), 5-21.
[34] Y. Yuan, J. Yan and C. Xu, Polynomial Smooth Support Vector Machine (PSSVM), Chinese Journal of Computers, 28(1) (2005), 9-17.
[35] Y. Yuan and T. Huang, A Polynomial Smooth Support Vector Machine for Classification, Lecture Notes in Artificial Intelligence, 3584 (2005), 157-164.
[36] O. L. Mangasarian and David R. Musicant, Successive overrelaxation for support vector machines, IEEE Transactions on Neural Networks, 10 (1999), 1032-1037.
[37] J. Platt, Sequential minimal optimization: A fast algorithm for training support vector machines, Advances in Kernel Methods - Support Vector Learning, 1999, 185-208.
[38] T. Joachims, Making large-scale support vector machine learning practical, in Advances in Kernel Methods: Support Vector Learning, B. Schölkopf, C. Burges and A. Smola (Eds), The MIT Press, Cambridge, Massachusetts, 1999.
[39] Y. Yuan and R. Byrd, Non-quasi-Newton updates for unconstrained optimization, J. Comput. Math., 13 (1995), 95-107.
[40] Y. Yuan, A modified BFGS algorithm for unconstrained optimization, IMA J. Numer. Anal., 11 (1991), 325-332.
[41] Navneet Panda and Edward Y. Chang, KDX: An Indexer for Support Vector Machines, IEEE Transactions on Knowledge and Data Engineering, 18(6) (2006), 748-763.
[42] K. Schittkowski, Optimal parameter selection in support vector machines, Journal of Industrial and Management Optimization, 1 (2005), 465-476.
[43] K. F. C. Yiu, K. L. Mak and K. L. Teo, Airfoil design via optimal control theory, Journal of Industrial and Management Optimization, 1 (2005), 133-148.
[44] A. Ghaffari Hadigheh and T. Terlaky, Generalized support set invariancy sensitivity analysis in linear optimization, Journal of Industrial and Management Optimization, 2(1) (2006), 1-18.
[45] Z. Y. Wu, H. W. J. Lee, F. S. Bai and L. S. Zhang, Quadratic smoothing approximation to $l_1$ exact penalty function in global optimization, Journal of Industrial and Management Optimization, 1 (2005), 533-547.
[46] Giovanni P. Crespi, Ivan Ginchev and Matteo Rocca, Two approaches toward constrained vector optimization and identity of the solutions, Journal of Industrial and Management Optimization, 1 (2005), 549-563.