
Pattern Recognition Letters 26 (2005) 2187–2194


Training TSVM with the proper number of positive samples

Ye Wang, Shang-Teng Huang

Computer Science and Engineering Department, Shanghai JiaoTong University, Shanghai 200030, PR China

Received 20 September 2004; received in revised form 14 March 2005

Available online 13 June 2005

Communicated by R.P.W. Duin

Abstract

The transductive support vector machine (TSVM) is the transductive inference version of the support vector machine. The TSVM utilizes the information carried by the unlabeled samples for classification and achieves better classification performance than the regular support vector machine (SVM). As effective as the TSVM is, it still has an obvious deficiency: the number of positive samples must be appointed before training and is not changed during the training phase. This deficiency is caused by the pair-wise exchanging criterion used in the TSVM. In this paper, we propose a new transductive training algorithm that substitutes an individually judging and changing criterion for the pair-wise exchanging criterion. Experimental results show that the new method removes the requirement to appoint the number of positive samples beforehand and improves the adaptability of the TSVM.

© 2005 Elsevier B.V. All rights reserved.

Keywords: Transductive support vector machine; Training algorithm

1. Introduction

Different from the traditional inductive learning method, the transductive learning method utilizes the information of the labeled samples together with that of the unlabeled samples. Nowadays, when obtaining labels is more expensive than obtaining samples, the transductive


method is of significant importance for machine learning. Based on statistical learning theory (Vapnik, 1995), the support vector machine (SVM) was proposed for pattern recognition by Vapnik (1998). The SVM performs better than other learning methods on small training sets in a nonlinear, high-dimensional feature space. The transductive support vector machine (TSVM) (Joachims, 1999) is an SVM combined with the transductive learning procedure. The iterative algorithm of the TSVM utilizes the


information of the unlabeled samples for classification and predicts the optimal labels for them. The TSVM is suitable when the distribution of the training samples differs from that of the test samples.

After reviewing the training algorithm of the TSVM, we propose in this paper that an individually judging and changing criterion, instead of the pair-wise exchanging criterion, be used to minimize the objective function. The training algorithm is changed accordingly. Experimental results show that the new method removes the requirement to appoint the number of positive samples beforehand, so it can be used in much wider ranges.

The remainder of this paper is organized as follows: the training algorithm of the TSVM and its deficiency are described in Section 2. An improved label exchanging criterion, a new training method and a theoretical proof of convergence are developed in Section 3. Experimental results comparing the SVM, the TSVM and the new method are shown in Section 4. Finally, conclusions are given in the last section.

2. Review of the TSVM

Transductive learning is completely different from inductive learning. Rather than seeking a decision hyper-plane or decision rules, transductive learning focuses on constructing a transductive mechanism that can label test samples with the information carried by both the labeled and the unlabeled samples in the training phase. The idea of transductive learning originates from a dilemma of machine learning: labeled samples are sparse and expensive while unlabeled samples are plentiful and cheap. Through the transductive learning process, the helpful information concealed in the unlabeled samples is transferred into the final classifier. This mechanism makes classification results more accurate.

2.1. Principle

The TSVM is an application of transductive learning theory to the SVM. The principle of the regular SVM is as follows. An SVM classifier tries to find a decision hyper-plane that classifies the training samples correctly (or nearly correctly), while the margin between this hyper-plane and the nearest sample should be as large as possible. In the training phase, the SVM solves the following optimization problem to find the decision hyper-plane:

$$\min \; \tfrac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_i \xi_i \qquad (2.1)$$

such that

$$y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0$$

where $b$ is a threshold, $C$ is an influencing parameter for the trade-off and $\xi$ is a slack variable. In the decision phase, the SVM labels the test samples according to the side of the hyper-plane on which they lie. The decision formula is given by

$$D(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + b \qquad (2.2)$$

According to the principle of transductive learning, the optimization problem of the TSVM (Joachims, 1999) is given by

$$\min \; \tfrac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_i \xi_i + C^*\sum_j \xi_j^* \qquad (2.3)$$

such that

$$y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0,$$
$$y_j(\mathbf{w}^T\mathbf{x}_j + b) \ge 1 - \xi_j^*, \quad \xi_j^* \ge 0$$

where $\xi$ and $\xi^*$ are the slack variables for the training and test samples, respectively, and $C$ and $C^*$ are the influencing parameters for the training and test samples, respectively.
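
To make the objective concrete, here is a minimal sketch (not from the paper) that evaluates Eq. (2.3) for a linear classifier, taking the optimal slack variables as hinge losses; all function and variable names are illustrative.

```python
import numpy as np

def tsvm_objective(w, b, X_train, y_train, X_test, y_test_pred, C, C_star):
    """Value of Eq. (2.3) for a linear classifier with current test labels.

    Labels are assumed to be in {-1, +1}; for fixed (w, b) the optimal
    slack variables of the constrained problem reduce to hinge losses.
    """
    slack_train = np.maximum(0.0, 1.0 - y_train * (X_train @ w + b))
    slack_test = np.maximum(0.0, 1.0 - y_test_pred * (X_test @ w + b))
    return 0.5 * (w @ w) + C * slack_train.sum() + C_star * slack_test.sum()
```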

2.2. Training algorithm

To acquire optimal solutions of Eq. (2.3), Joachims presented a training algorithm. Its main steps are as follows (a schematic code sketch follows the list):

(1) Specify C and C* and train an SVM with the training samples. Specify Np (the number of positive samples in the test set) according to the proportion of positive samples in the training set.


(2) Classify the test samples with the SVM trained in Step (1). Label the Np test samples with the largest decision values as positive and the others as negative. Initialize $C^*_+$ and $C^*_-$ with a small number.

(3) Retrain with all samples (including the test samples). For the test samples, if there is a pair of samples with different labels whose slack variables are both positive and sum to more than 2, exchange their labels. Iterate this step until no such pair is found. This is the inner loop.

(4) Increase $C^*_+$ and $C^*_-$ by the same proportion. If $C^*_+$ or $C^*_-$ is greater than $C^*$, output the labels of the test samples and terminate; otherwise go to Step (3). This is the outer loop.
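
Here is a schematic sketch of this loop, under a few assumptions: labels are in {−1, +1}, and `fit_svm`, `refit_all` and `find_exchange_pair` are hypothetical helpers (not APIs of SVMlight or any other toolbox), where `refit_all` returns the retrained classifier together with the slack variables of the test samples.

```python
import numpy as np

def train_tsvm(X_tr, y_tr, X_te, C, C_star, n_pos):
    clf = fit_svm(X_tr, y_tr, C)                      # Step (1)
    scores = clf.decision_function(X_te)
    y_te = -np.ones(len(X_te))
    y_te[np.argsort(scores)[-n_pos:]] = 1.0           # Step (2): top Np as positive
    c_pos = c_neg = 1e-5                              # small initial C*_+ and C*_-
    while c_pos < C_star or c_neg < C_star:           # outer loop, Step (4)
        while True:                                   # inner loop, Step (3)
            clf, slack = refit_all(X_tr, y_tr, X_te, y_te, C, c_pos, c_neg)
            pair = find_exchange_pair(y_te, slack)    # labels differ, slacks sum > 2
            if pair is None:
                break
            i, j = pair
            y_te[i], y_te[j] = y_te[j], y_te[i]       # pair-wise label exchange
        c_pos = min(2.0 * c_pos, C_star)              # grow the test-set influence
        c_neg = min(2.0 * c_neg, C_star)
    return y_te
```

Note that the pair-wise exchange keeps the number of positive test labels fixed at Np throughout, which is exactly the restriction this paper addresses.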

The exchanging criterion in Step (3) will be called the pair-wise exchanging criterion. It ensures that the objective function of the optimization problem (Eq. (2.3)) decreases after the labels are exchanged, so that an optimal solution is obtained; this is proved in Theorem 2 of (Joachims, 1999). The magnification of the influencing parameters in Step (4) gradually increases the influence of the unlabeled samples on the classification. As efficient as the TSVM is, its deficiency is also obvious: the number of positive samples, Np, must be appointed before training and is not changed during the entire training phase. It is easy to see that when the appointed Np disagrees with the real number of positive samples in the test set, the TSVM cannot obtain the best solutions. Some related works on the TSVM are introduced in the following paragraph.

2.3. Other related works

Chen et al. (2003) proposed a progressive TSVM method named PTSVM. Instead of appointing Np in advance, the training process of the PTSVM labels the most "believable" positive or negative samples in the test set, respectively and progressively, until all the test samples are labeled. The PTSVM relaxes the restriction on Np to some extent. However, another restriction emerges: $C^*_+$ must be equal to $C^*_-$. Due to this restriction, the PTSVM is not suitable for the case where the test set is imbalanced. Liu and Huang (2004) presented a fuzzy TSVM algorithm (FTSVM) to solve web classification problems. Its main idea is to multiply the slack variables of the test samples by membership factors that represent the degree of importance of the test samples in training. Experimental results indicate that the FTSVM performs well in web classification; its main deficiency is that the computation of the membership factors is very complex. In addition, transductive learning theory has been developed in some other fields, such as k-NN (Joachims, 2003), latent semantic indexing (LSI) (Zelikovitz, 2004) and kernel learning (Zhang et al., 2003).

3. New method

As mentioned above, the main deficiency of the TSVM is that the number of positive samples Np must be appointed beforehand and is not changed during the training phase. This deficiency is caused by the pair-wise exchanging criterion. In order to modify this criterion, we consider whether the labels of the test samples can be changed individually. If this is feasible, the restriction on Np can be removed and the adaptability of the TSVM improved.

3.1. Individually judging and changing criterion

In the TSVM, the strategy for minimizing the objective function of the optimization problem is to decrease the penalty terms, i.e., the products of the influencing parameters and the sums of slack variables. If there are two test samples with different labels whose slack variables are both positive and sum to more than 2, both of them may be labeled in error, so exchanging their labels decreases the objective function. After analyzing the mechanism of the pair-wise exchanging criterion, we find another way to decrease the objective function: change the labels of the test samples on the wrong side of the decision hyper-plane. We begin our derivation by discriminating the penalty terms of positive and negative samples,


$$C^*\sum_j \xi_j^* = C^*_+\sum_{j^+} \xi_{j^+}^* + C^*_-\sum_{j^-} \xi_{j^-}^* \qquad (3.1)$$

where $C^*_+$ and $C^*_-$ are the influencing parameters of the positive and negative test samples, respectively. They are related to the numbers of positive and negative samples when the test set is imbalanced (Osuna et al., 1997). Let $C^*_+$ and $C^*_-$ be in inverse proportion to the numbers of positive and negative samples, respectively, i.e., let the following equation hold:

$$C^*_+/N_n = C^*_-/N_p = a \qquad (3.2)$$

where $N_p$ and $N_n$ are the numbers of positive and negative samples and $a$ is a positive constant.
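
As a quick numeric illustration (the numbers are made up, not from the paper): with a test set currently labeled as Np = 30 positives and Nn = 70 negatives, Eq. (3.2) gives

```python
a, N_p, N_n = 1e-4, 30, 70        # hypothetical values for illustration
C_star_pos = a * N_n              # C*_+ = 7e-3: the rarer positive class
C_star_neg = a * N_p              # C*_- = 3e-3  gets the larger penalty weight
```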

Theorem 1. Let $\Delta = \sum_{j^+} \xi_{j^+}^* - \sum_{j^-} \xi_{j^-}^*$. If a test sample is labeled positive and its slack variable $\xi^*$ satisfies

$$\xi^* > \frac{\Delta}{N_n + 1} + \frac{N_p - 1}{N_n + 1}\max(2 - \xi^*,\, 0) \qquad (3.3)$$

or if a test sample is labeled negative and its slack variable $\xi^*$ satisfies

$$\xi^* > -\frac{\Delta}{N_p + 1} + \frac{N_n - 1}{N_p + 1}\max(2 - \xi^*,\, 0) \qquad (3.4)$$

then changing the label of this test sample decreases the objective function.

Proof. The penalty terms of the test samples can be rewritten as

$$C^*_+\sum_{j^+} \xi_{j^+}^* + C^*_-\sum_{j^-} \xi_{j^-}^* = aN_n\sum_{j^+} \xi_{j^+}^* + aN_p\sum_{j^-} \xi_{j^-}^* \qquad (3.5)$$

Assume that a positively labeled test sample is changed to a negatively labeled one. Then the penalty terms of the test samples become

$$C^{**}_+\Big(\sum_{j^+} \xi_{j^+}^* - \xi^*\Big) + C^{**}_-\Big(\sum_{j^-} \xi_{j^-}^* + \max(2 - \xi^*,\, 0)\Big) \qquad (3.6)$$

Here, $C^{**}_+$ and $C^{**}_-$ differ slightly from $C^*_+$ and $C^*_-$ in Eq. (3.5). According to Eq. (3.2), after a label change the influencing parameters should still be in inverse proportion to the numbers of positive and negative samples, respectively, i.e.,

$$C^{**}_+/(N_n + 1) = C^{**}_-/(N_p - 1) = a \qquad (3.7)$$

Substitute $\xi^*$ in the first term of Eq. (3.6) with Eq. (3.3) and combine Eq. (3.6) with Eqs. (3.2) and (3.7). We get

$$(N_n + 1)\Big(\sum_{j^+} \xi_{j^+}^* - \xi^*\Big) + (N_p - 1)\Big(\sum_{j^-} \xi_{j^-}^* + \max(2 - \xi^*,\, 0)\Big)$$
$$< (N_n + 1)\Big(\sum_{j^+} \xi_{j^+}^* - \frac{\Delta + (N_p - 1)\max(2 - \xi^*,\, 0)}{N_n + 1}\Big) + (N_p - 1)\Big(\sum_{j^-} \xi_{j^-}^* + \max(2 - \xi^*,\, 0)\Big)$$
$$< N_n\sum_{j^+} \xi_{j^+}^* + N_p\sum_{j^-} \xi_{j^-}^*$$

This result means that the objective function decreases after the positive label is changed. The case where a negatively labeled test sample is changed to a positive label can be proved in an analogous way. The proof is completed. □
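
As a quick numerical sanity check of the positive-to-negative case (a sketch with made-up slack values; none of these numbers come from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
a, N_p, N_n = 1e-4, 12, 25
pos = rng.uniform(0.0, 3.0, N_p)          # hypothetical slack variables xi* of
neg = rng.uniform(0.0, 3.0, N_n)          # the positive / negative test samples
delta = pos.sum() - neg.sum()
before = a * N_n * pos.sum() + a * N_p * neg.sum()     # penalty terms, Eq. (3.5)
for xi in pos:                            # try flipping each positive sample
    bound = delta / (N_n + 1) + (N_p - 1) / (N_n + 1) * max(2 - xi, 0.0)
    if xi > bound:                        # criterion (3.3) holds
        after = (a * (N_n + 1) * (pos.sum() - xi)
                 + a * (N_p - 1) * (neg.sum() + max(2 - xi, 0.0)))  # Eqs. (3.6)-(3.7)
        assert after < before             # the penalty term indeed decreases
```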

Based on Theorem 1, we may use a new criterion to minimize the objective function of the TSVM. This criterion, described by (3.3) and (3.4), judges whether a test sample lies on the wrong side, in which case its label should be changed individually. We call this the individually judging and changing criterion. It allows the number of positive samples to be updated to a proper value, which is its main advantage over the pair-wise exchanging criterion.
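
As a sketch of how the criterion might be coded (labels are assumed to be in {−1, +1}; the function name and signature are mine, not the paper's):

```python
def should_flip(xi, label, delta, n_pos, n_neg):
    """True if flipping this test sample's label decreases the objective.

    xi    -- the sample's slack variable xi*
    delta -- sum of positive slacks minus sum of negative slacks (Delta)
    """
    hinge = max(2.0 - xi, 0.0)
    if label > 0:    # Eq. (3.3): positive -> negative
        return xi > delta / (n_neg + 1) + (n_pos - 1) / (n_neg + 1) * hinge
    else:            # Eq. (3.4): negative -> positive
        return xi > -delta / (n_pos + 1) + (n_neg - 1) / (n_pos + 1) * hinge
```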

3.2. Improved training algorithm

The full version of our new algorithm is as follows (a schematic code sketch follows the list):

(1) Specify Np, C and C*. Train an SVM with the training samples and label the test samples using the trained SVM.

(2) Let $C^*_+ = aN_n$ and $C^*_- = aN_p$, where $a$ is a small positive constant such as $10^{-4}$.

(3) While $C^*_+ < C^*$ or $C^*_- < C^*$, repeat Steps (4)–(6).


(4) Retrain the SVM with the training and test samples. If there exist test samples satisfying (3.3) or (3.4), select the one with the biggest $\xi^*$ among them; else go to Step (6).

(5) Change the label of the test sample selected in Step (4). Update $C^*_+$ and $C^*_-$ so that they remain in inverse proportion to the numbers of positive and negative samples, that is, $C^*_+/C^*_- = (N_n + 1)/(N_p - 1)$ or $(N_n - 1)/(N_p + 1)$. Go to Step (4).

(6) Let $C^*_+ = \min\{2C^*_+, C^*\}$ and $C^*_- = \min\{2C^*_-, C^*\}$.

(7) Output the labels of the test samples and terminate.
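
A schematic sketch of Steps (1)–(7), reusing `should_flip` from the previous sketch and the same hypothetical `fit_svm`/`refit_all` helpers. Two details are my own assumptions, since the text leaves them open: the initial labeling takes the Np largest decision values as positive (mirroring the TSVM initialization), and after each flip the sum $C^*_+ + C^*_-$ is preserved, since the paper fixes only their ratio.

```python
import numpy as np

def train_new_tsvm(X_tr, y_tr, X_te, C, C_star, n_pos, a=1e-4):
    clf = fit_svm(X_tr, y_tr, C)                       # Step (1)
    scores = clf.decision_function(X_te)
    y_te = -np.ones(len(X_te))
    y_te[np.argsort(scores)[-n_pos:]] = 1.0
    n_neg = len(X_te) - n_pos
    c_pos, c_neg = a * n_neg, a * n_pos                # Step (2)
    while c_pos < C_star or c_neg < C_star:            # Step (3)
        while True:
            clf, slack = refit_all(X_tr, y_tr, X_te, y_te, C, c_pos, c_neg)  # Step (4)
            delta = slack[y_te > 0].sum() - slack[y_te < 0].sum()
            cand = [i for i in range(len(X_te))
                    if should_flip(slack[i], y_te[i], delta, n_pos, n_neg)]
            if not cand:
                break                                  # proceed to Step (6)
            i = max(cand, key=lambda k: slack[k])      # biggest slack variable first
            if y_te[i] > 0:                            # Step (5): flip one label
                y_te[i], n_pos, n_neg = -1.0, n_pos - 1, n_neg + 1
            else:
                y_te[i], n_pos, n_neg = 1.0, n_pos + 1, n_neg - 1
            total = c_pos + c_neg                      # restore the required ratio
            c_pos = total * n_neg / (n_pos + n_neg)
            c_neg = total * n_pos / (n_pos + n_neg)
        c_pos = min(2.0 * c_pos, C_star)               # Step (6)
        c_neg = min(2.0 * c_neg, C_star)
    return y_te                                        # Step (7)
```

Unlike the pair-wise loop, the positive count n_pos changes by one at every flip, which is how the algorithm drifts toward the proper number of positive samples.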

In this new algorithm, Np can be initialized as any integer between 1 and the number of samples in the test set. In Step (4), the sample with the biggest $\xi^*$ is selected in order to minimize the objective function. In addition to changing the label of the selected sample, Step (5) adjusts the influencing parameters according to the updated numbers of positive and negative samples: if a positive label is changed to negative, then $C^*_+/C^*_- = (N_n + 1)/(N_p - 1)$; on the other hand, if a negative label is changed to positive, then $C^*_+/C^*_- = (N_n - 1)/(N_p + 1)$.

Theorem 2 in (Joachims, 1999) indicates that the transductive algorithm terminates in finitely many steps because the objective function of the optimization problem decreases. Theorem 1 ensures that the label changes of samples satisfying (3.3) or (3.4) decrease the objective function, so the new algorithm also converges in finitely many steps.


4. Experimental results

To verify the accuracy and adaptability of the new method, experiments are conducted on toy data and real-world data.

4.1. Toy data

The toy data are two-dimensional points generated artificially, as shown in Fig. 1; the positive test samples, negative test samples and training samples are each plotted with a distinct marker. Fig. 1 shows that the distribution of the training samples is quite different from that of the test samples.

Fig. 1. Two-dimensional toy data.

In these tests, accuracy is used as the measure of performance: the proportion of samples whose class labels are predicted correctly.

The classification result of the regular SVM is shown in Fig. 2, where the solid line represents the decision hyper-plane and the samples marked with large circles are the support vectors.


Due to the distribution difference between the training and test samples, the classification accuracy of the regular SVM is only 85.5%.

Fig. 2. Regular SVM.

When the TSVM is used to classify the toy data, Np must be appointed beforehand, and this arbitrarily appointed Np may differ from the real Np. Accordingly, the classification performance is not as good as expected: the result of the TSVM with an improper Np (a little bigger than the real Np) is shown in Fig. 3, and its accuracy is 94.2%. Our new method, on the other hand, updates Np to a proper value and determines the best decision hyper-plane; as shown in Fig. 4, its accuracy is 100%. Note that the initial values of Np are identical in these two tests.

Fig. 3. TSVM with an improper Np.

Fig. 4. The new method.

4.2. Real world data

The real-world experiments are done on a subset of the "Reuters-21578" dataset (http://www.research.att.com/~lewis/reuter21578.html). In the "Reuters-21578" dataset, some samples belong to the "training set" and others to the "test set". We select six groups from "Reuters-21578" as the subset. Each group of the subset has positive samples and negative samples, which belong to different topics. The sizes and topics of these groups are listed in Table 1.

After stop-word filtering and stemming, the text samples are transformed into vectors in a very high-dimensional feature space. The word weights are computed by the tf-idf formula. A comparison of the classification accuracies of the regular SVM, the TSVM and the new method is listed in Table 2. The parameters of each test are identical: an RBF kernel is used with $C = 10^3$ and $\gamma = 0.1$. The SVMlight toolbox is used in this experiment.
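
A sketch of the preprocessing and baseline classifier described above, using scikit-learn's TfidfVectorizer and SVC as stand-ins for the stop-word filtering, tf-idf weighting and the SVMlight RBF-kernel SVM (stemming is omitted for brevity, and the function name is mine):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

def run_baseline(train_docs, train_labels, test_docs):
    vec = TfidfVectorizer(stop_words="english")   # stop-word filtering + tf-idf
    X_tr = vec.fit_transform(train_docs)
    X_te = vec.transform(test_docs)
    clf = SVC(kernel="rbf", C=1e3, gamma=0.1)     # parameters used in the paper
    clf.fit(X_tr, train_labels)
    return clf.predict(X_te)
```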

In Table 2, "PPS" abbreviates "proportion of positive samples". The fourth through seventh columns give the classification accuracies of the SVM, the TSVM and the new method. In the fifth column, Np is initialized as half the number of test samples; in the sixth column, Np is initialized as the number of test samples times the proportion of positive samples in the training set, as advised in (Joachims, 1999); and in the seventh column, Np is appointed by the same method as in the fifth column.

To find the influence of different Np on the new method, another experiment is done, in which the ratio of Np to the number of test samples (NTE) varies from 0.1 to 0.9. The results are listed in Table 3.

Table 1
Numbers and topics of the subset

No.  Positive topic  Training  Test  Negative topic  Training  Test
1    Wheat           198       84    Rice            35        26
2    Coffee          110       30    Cocoa           50        17
3    Copper          47        27    Iron–steel      40        20
4    Soybean         73        37    Soy oil         14        6
5    Fuel            13        11    Gas             31        22
6    Livestock       73        39    Dlr             96        72

Table 2
Accuracies of SVM, TSVM and the new method

No.  PPS (training)  PPS (test)  SVM (%)  TSVM (0.5) (%)  TSVM (proportion) (%)  New method (0.5) (%)
1    0.850           0.764       90.0     74.6            90.9                   98.2
2    0.688           0.638       93.8     85.7            95.9                   97.9
3    0.540           0.575       95.7     97.9            100                    100
4    0.839           0.860       88.4     65.1            90.9                   93.9
5    0.295           0.333       75.8     75.8            87.9                   90.9
6    0.432           0.351       94.6     84.7            91.9                   100

Table 3
Accuracies of the new method with different Np/NTE

No.  0.1 (%)  0.3 (%)  0.5 (%)  0.7 (%)  0.9 (%)
1    96.4     96.4     98.2     98.2     98.2
2    95.7     97.9     97.9     97.9     97.9
3    97.9     100      100      100      100
4    93.9     93.9     93.9     93.9     93.9
5    90.9     90.9     90.9     87.9     84.8
6    100      100      100      100      97.3


4.3. Discussion

Comparing the values in Table 2, we find that the performance of the TSVM depends significantly on Np. When a proper Np is appointed, i.e., Np is close to the number of positive test samples, the TSVM performs very well; otherwise, its performance is quite poor and sometimes even worse than that of the regular SVM. In most conditions, the new method outperforms both the SVM and the TSVM.

Table 3 shows that the initial Np does not seriously affect the results in most cases, which indicates that the new method is not sensitive to the initial Np. However, when the initial Np differs greatly from the real Np, the performance decreases to some extent.

5. Conclusions

Based on the analysis of the training algorithm and the label exchanging criterion of the TSVM, this paper proposes a new criterion for minimizing the objective function. The new criterion judges and changes the labels of the samples individually. A new training algorithm is presented as well.


A theoretical proof for the new criterion is given: it ensures that the objective function decreases and an optimal solution is obtained in finitely many steps. The toy data and real-world data tests show that the new algorithm is not as sensitive as the TSVM to an improper number of positive samples. When the number of positive samples is initialized randomly, the new method performs better than the regular SVM and the TSVM.

References

Chen, Y.S., Wang, G.P., Dong, S.H., 2003. A progressive transductive inference algorithm based on support vector machine. J. Software 14 (3), 451–460 (in Chinese).

Joachims, T., 1999. Transductive inference for text classification using support vector machines. In: Proc. ICML-99, 16th Internat. Conf. on Machine Learning, pp. 200–209.

Joachims, T., 2003. Transductive learning via spectral graph partitioning. In: Proc. Internat. Conf. on Machine Learning (ICML '03), pp. 290–297.

Liu, H., Huang, S.T., 2004. Fuzzy transductive support vector machines for hypertext classification. Internat. J. Uncertainty, Fuzziness Knowledge-Based Systems 12 (1), 21–36.

Osuna, E., Freund, R., Girosi, F., 1997. Training support vector machines: An application to face detection. In: Proc. CVPR '97, Puerto Rico, pp. 130–136.

Vapnik, V., 1995. The Nature of Statistical Learning Theory. Springer, New York.

Vapnik, V., 1998. Statistical Learning Theory. Wiley, New York.

Zelikovitz, S., 2004. Transductive LSI for short text classification problems. In: Proc. Seventeenth Internat. Florida Artificial Intelligence Research Symposium Conf., Miami Beach, FL, USA. AAAI Press.

Zhang, Z.J., Kwok, T., Yeung, D.Y., et al., 2003. Bayesian transductive learning of the kernel matrix using Wishart processes. Technical Report HKUST-CS03-09, Department of Computer Science, Hong Kong University of Science and Technology.