Pattern Recognition Letters 26 (2005) 2187–2194
www.elsevier.com/locate/patrec
Training TSVM with the proper number of positive samples
Ye Wang *, Shang-Teng Huang
Computer Science and Engineering Department, Shanghai JiaoTong University, Shanghai 200030, PR China
Received 20 September 2004; received in revised form 14 March 2005
Available online 13 June 2005
Communicated by R.P.W. Duin
Abstract
The transductive support vector machine (TSVM) is the transductive inference counterpart of the support vector machine. The TSVM utilizes the information carried by the unlabeled samples for classification and achieves better classification performance than the regular support vector machine (SVM). As effective as the TSVM is, it still has an obvious deficiency: the number of positive samples must be appointed before training and is not changed during the training phase. This deficiency is caused by the pair-wise exchanging criterion used in the TSVM. In this paper, we propose a new transductive training algorithm that substitutes an individually judging and changing criterion for the pair-wise exchanging criterion. Experimental results show that the new method removes the restriction of appointing the number of positive samples beforehand and improves the adaptability of the TSVM.
© 2005 Elsevier B.V. All rights reserved.
Keywords: Transductive support vector machine; Training algorithm
doi:10.1016/j.patrec.2005.03.034
* Corresponding author. Fax: +86 21 6293 4107.
E-mail addresses: [email protected], [email protected] (Y. Wang).

1. Introduction

Different from the traditional inductive learning method, the transductive learning method is a way to utilize the information of the labeled samples together with that of the unlabeled samples. Nowadays, when obtaining labels is more expensive than getting samples, the transductive method is significantly important for machine learning. Based on the theory of statistical learning (Vapnik, 1995), the support vector machine (SVM) was proposed for pattern recognition by Vapnik (1998). The SVM shows better performance than other learning methods in classifying small training sets in a nonlinear and high-dimensional feature space.

The transductive support vector machine (TSVM) (Joachims, 1999) is an SVM combined with the transductive learning procedure. The iterative algorithm of the TSVM utilizes the
information of the unlabeled samples for classification and predicts the optimal labels for them. The TSVM is suitable when the distribution of the training samples differs from that of the test samples.

After reviewing the training algorithm of the TSVM, we propose in this paper to utilize an individually judging and changing criterion, instead of the pair-wise exchanging criterion, to minimize the objective function. The training algorithm is changed accordingly. Experimental results show that the new method removes the restriction of appointing the number of positive samples beforehand, so it can be used in much wider ranges.
The remainder of this paper is organized as follows: The training algorithm of the TSVM and its deficiency are described in Section 2. An improved label exchanging criterion, a new training method and a theoretical proof of convergence are developed in Section 3. Experimental results comparing the SVM, the TSVM and the new method are shown in Section 4. Finally, conclusions are given in the last section.
2. Review of the TSVM
The transductive learning is completely different from the inductive learning. Rather than seeking a decision hyper-plane or decision rules, the transductive learning focuses on constructing a transductive mechanism that can label test samples with the information carried by both the labeled and the unlabeled samples in the training phase. The idea of transductive learning originates from a dilemma of machine learning: the labeled samples are sparse and expensive while the unlabeled samples are plentiful and cheap. Through the transductive learning process, the helpful information concealed in the unlabeled samples is transferred into the final classifier. This mechanism makes classification results more accurate.
2.1. Principle
The TSVM is an application of the transductive
learning theory to the SVM. The principle of the
regular SVM is described as follows. An SVM classifier tries to find a decision hyper-plane that classifies the training samples correctly (or basically correctly). The margin between this hyper-plane and the nearest sample should be as large as possible. In the training phase, the SVM solves the following optimization problem to find the decision hyper-plane:

\min \frac{1}{2} w^T w + C \sum_i \xi_i    (2.1)

such that

y_i (w^T x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0

where b is a threshold, C is an influencing parameter for the trade-off and \xi is a slack variable. In the decision phase, the SVM labels the test samples according to the side of the hyper-plane on which they lie. The decision formula is given by

D(x) = w^T x + b    (2.2)

According to the principle of the transductive learning, the optimization problem of the TSVM (Joachims, 1999) is given by

\min \frac{1}{2} w^T w + C \sum_i \xi_i + C^* \sum_j \xi_j^*    (2.3)

such that

y_i (w^T x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0,
y_j (w^T x_j + b) \ge 1 - \xi_j^*, \quad \xi_j^* \ge 0

where \xi and \xi^* are slack variables for the training and test samples, respectively, and C and C^* are the influencing parameters for the training and test samples, respectively.
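The slack variables above can be read off directly from any candidate hyper-plane: \xi = \max(0, 1 - y\,D(x)). The following sketch illustrates this; the hyper-plane (w, b) and the sample points are made-up illustration values, not from the paper.

```python
# Slack variables of Eq. (2.1): xi = max(0, 1 - y * D(x)), with D(x) of Eq. (2.2).
# The hyper-plane and points below are hypothetical illustration values.

def decision(w, b, x):
    """Decision value D(x) = w^T x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def slack(w, b, x, y):
    """Slack xi = max(0, 1 - y * D(x)); zero iff the sample is outside the margin."""
    return max(0.0, 1.0 - y * decision(w, b, x))

w, b = [1.0, -1.0], 0.0
samples = [([2.0, 0.0], +1),   # correctly classified, outside the margin -> slack 0
           ([0.5, 0.0], +1),   # inside the margin -> 0 < slack < 1
           ([-1.0, 0.0], +1)]  # misclassified -> slack > 1

slacks = [slack(w, b, x, y) for x, y in samples]
# slacks -> [0.0, 0.5, 2.0]
```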
2.2. Training algorithm
To acquire optimal solutions of Eq. (2.3),
Joachims presented a training algorithm. The main
steps of this algorithm are listed as follows:
(1) Specify C and C^* and train an SVM with the training samples. Specify N_p (the number of positive samples in the test set) according to the proportion of positive samples in the training set.
(2) Classify the test samples with the SVM trained in Step (1). Label the N_p test samples with the largest decision values as positive and the others as negative. Specify C_+^* and C_-^* with a small number.
(3) Retrain on all samples (including the test samples). For the test samples, if there is a pair of samples with different labels for which both slack variables are positive and their sum is greater than 2, exchange their labels. Iterate this step until no such pair of samples is found. This is the inner loop.
(4) Augment C_+^* and C_-^* by the same proportion. If C_+^* or C_-^* is greater than C^*, output the labels of the test samples and terminate; otherwise go to Step (3). This is the outer loop.
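The inner mechanics of Steps (2) and (3) can be sketched as follows. This is a structural sketch only: the SVM retraining that would solve Eq. (2.3) is abstracted away, and the function names are illustrative, not from Joachims' implementation.

```python
# Structural sketch of Steps (2)-(3) of the TSVM training algorithm.
# A real implementation retrains the SVM of Eq. (2.3) between exchanges;
# here only the labeling and pair-wise exchange logic is shown.

def initial_labels(decision_values, n_pos):
    """Step (2): the n_pos test samples with the largest decision
    values become positive, the others negative."""
    order = sorted(range(len(decision_values)),
                   key=lambda i: decision_values[i], reverse=True)
    labels = [-1] * len(decision_values)
    for i in order[:n_pos]:
        labels[i] = +1
    return labels

def pairwise_exchange(labels, slacks):
    """Step (3): exchange one pair of differently labeled samples whose
    slacks are both positive and sum to more than 2. Returns True if an
    exchange happened; callers iterate until it returns False."""
    for i in range(len(labels)):
        for j in range(len(labels)):
            if (labels[i] == +1 and labels[j] == -1
                    and slacks[i] > 0 and slacks[j] > 0
                    and slacks[i] + slacks[j] > 2):
                labels[i], labels[j] = -1, +1
                return True
    return False
```

Note that an exchange keeps the total number of positive labels fixed, which is exactly why N_p cannot change during training.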
The exchanging criterion in Step (3) of the algorithm will be called the pair-wise exchanging criterion. It ensures that the objective function of the optimization problem (Eq. (2.3)) decreases after the labels are exchanged, and thus an optimal solution is obtained. This statement is proved as Theorem 2 in (Joachims, 1999). The magnification of the influencing parameters in Step (4) gradually increases the influence of the unlabeled samples on classification. As efficient as the TSVM is, its deficiency is also obvious: the number of positive samples, N_p, must be appointed before training and is not changed during the entire training phase. It is easy to infer that when the appointed N_p disagrees with the real number of positive samples in the test set, the TSVM cannot obtain the best solutions. Some related works on the TSVM are introduced in the following subsection.
2.3. Other related works
Chen et al. (2003) proposed a progressive method of TSVM named PTSVM. Instead of appointing N_p in advance, the training process of the PTSVM labels the most "believable" positive or negative samples in the test set, respectively and progressively, until all the test samples are labeled. The PTSVM releases the restriction on N_p to some extent. However, another restriction emerges: C_+^* must be equal to C_-^*. Due to this restriction, the PTSVM is not suitable for the case where the test set is imbalanced. Liu and Huang (2004) presented a fuzzy TSVM algorithm (FTSVM) to resolve problems of web classification. Its main idea is to multiply the slack variables of the test samples by membership factors that represent the degree of importance of the test samples in training. Experimental results indicate that the FTSVM performs well in web classification. Its main deficiency is that the computation of the membership factors is very complex. In addition, the transductive learning theory has been developed in some other fields, such as k-NN (Joachims, 2003), latent semantic indexing (LSI) (Zelikovitz, 2004) and kernel learning (Zhang et al., 2003).
3. New method
As mentioned above, the main deficiency of the TSVM is that the number of positive samples N_p must be appointed beforehand and is not changed during the training phase. This deficiency is caused by the pair-wise exchanging criterion. In order to modify this criterion, we consider whether the labels of the test samples can be changed individually. If this is feasible, the restriction on N_p can be removed and the adaptability of the TSVM improved.
3.1. Individually judging and changing criterion
In the TSVM, the strategy of decreasing the penalty items, which are the products of the influencing parameters and the sums of slack variables, is adopted to minimize the objective function of the optimization problem. If there are two test samples with different labels for which both slack variables are positive and their sum is greater than 2, both of them may be labeled erroneously, so exchanging their labels will decrease the objective function. After analyzing the mechanism of the pair-wise exchanging criterion, we find another way to decrease the objective function: change the labels of the test samples that lie on the wrong side of the decision hyper-plane. We begin our inference by discriminating the penalty items of the positive and negative samples,

C^* \sum_j \xi_j^* = C_+^* \sum_{j+} \xi_+^* + C_-^* \sum_{j-} \xi_-^*    (3.1)

where C_+^* and C_-^* are the influencing parameters of the positive and negative test samples, respectively. They are related to the numbers of positive and negative samples when the test set is imbalanced (Osuna et al., 1997). Let C_+^* and C_-^* be in inverse proportion to the numbers of positive and negative samples, respectively, i.e., the following equation holds:

C_+^* / N_n = C_-^* / N_p = a    (3.2)

where N_p and N_n are the numbers of positive and negative samples and a is a positive constant.
Theorem 1. Let D = \sum_{j+} \xi_+^* - \sum_{j-} \xi_-^*. If a test sample is labeled positive and its slack variable \xi^* satisfies

\xi^* > \frac{D}{N_n + 1} + \frac{N_p - 1}{N_n + 1} \max(2 - \xi^*, 0)    (3.3)

or if a test sample is labeled negative and its slack variable \xi^* satisfies

\xi^* > -\frac{D}{N_p + 1} + \frac{N_n - 1}{N_p + 1} \max(2 - \xi^*, 0)    (3.4)

then the label change of this test sample decreases the objective function.
Proof. The penalty items of the test samples can be rewritten as

C_+^* \sum_{j+} \xi_+^* + C_-^* \sum_{j-} \xi_-^* = a N_n \sum_{j+} \xi_+^* + a N_p \sum_{j-} \xi_-^*    (3.5)

Assume that a positively labeled test sample is changed to a negatively labeled one. Then the penalty items of the test samples become

C_+^{**} \Big( \sum_{j+} \xi_+^* - \xi^* \Big) + C_-^{**} \Big( \sum_{j-} \xi_-^* + \max(2 - \xi^*, 0) \Big)    (3.6)

Here, C_+^{**} and C_-^{**} are a little different from C_+^* and C_-^* in Eq. (3.5). According to Eq. (3.2), after a label change the influencing parameters should still be in inverse proportion to the numbers of positive and negative samples, respectively, i.e.,

C_+^{**} / (N_n + 1) = C_-^{**} / (N_p - 1) = a    (3.7)

Substituting the bound (3.3) on \xi^* into the first item of Eq. (3.6) and combining Eq. (3.6) with Eqs. (3.2) and (3.7), we get

(N_n + 1) \Big( \sum_{j+} \xi_+^* - \xi^* \Big) + (N_p - 1) \Big( \sum_{j-} \xi_-^* + \max(2 - \xi^*, 0) \Big)
  < (N_n + 1) \sum_{j+} \xi_+^* - D - (N_p - 1) \max(2 - \xi^*, 0) + (N_p - 1) \Big( \sum_{j-} \xi_-^* + \max(2 - \xi^*, 0) \Big)
  = N_n \sum_{j+} \xi_+^* + N_p \sum_{j-} \xi_-^*

This result means that the objective function decreases after the positive label is changed. The case where a negatively labeled test sample is changed to a positive label can be proved analogously. The proof is completed. □
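The inequality in the proof can be spot-checked numerically: take a positive sample whose \xi^* satisfies (3.3), compute the penalty before the flip using Eq. (3.5) and after the flip using Eq. (3.6) with the rescaled parameters of Eq. (3.7), and confirm it dropped. The slack values below are illustrative, and a = 1 is taken without loss of generality.

```python
# Numeric spot-check of Theorem 1's proof: penalty before (Eq. (3.5))
# vs. after (Eq. (3.6) with the rescaling of Eq. (3.7)).
# Illustrative slack values; a = 1 without loss of generality.

a = 1.0
pos_slacks = [2.5, 0.0]   # xi* of currently positive test samples
neg_slacks = [0.0, 0.0]   # xi* of currently negative test samples
n_p, n_n = len(pos_slacks), len(neg_slacks)

# Flip the first positive sample; its slack xi* = 2.5 satisfies (3.3)
# here (the bound is D / (n_n + 1) = 2.5 / 3).
xi = pos_slacks[0]
before = a * n_n * sum(pos_slacks) + a * n_p * sum(neg_slacks)
after = (a * (n_n + 1) * (sum(pos_slacks) - xi)
         + a * (n_p - 1) * (sum(neg_slacks) + max(2.0 - xi, 0.0)))
# before -> 5.0, after -> 0.0: the penalty strictly decreases.
```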
Based on Theorem 1, we may utilize a new criterion to minimize the objective function of the TSVM. This criterion, described in (3.3) and (3.4), judges whether a test sample lies on the wrong side and whether its label should be changed individually. We call this the individually judging and changing criterion. It allows the number of positive samples to be updated to a proper value; this is its main superiority over the pair-wise exchanging criterion.
3.2. Improved training algorithm
The full version of our new algorithm is listed as follows:

(1) Specify N_p, C and C^*. Train an SVM with the training samples and label the test samples using the trained SVM.
(2) Let C_+^* = a N_n and C_-^* = a N_p, where a is a small positive constant such as 10^{-4}.
(3) While C_+^* < C^* or C_-^* < C^*, repeat Steps (4)–(6).
(4) Retrain the SVM with the training and test samples. If there exist test samples satisfying (3.3) or (3.4), select the one with the biggest \xi^* among them; otherwise, go to Step (6).
(5) Change the label of the test sample selected in Step (4). Update C_+^* and C_-^*, keeping them in inverse proportion to the numbers of positive and negative samples, that is, C_+^*/C_-^* = (N_n + 1)/(N_p - 1) or (N_n - 1)/(N_p + 1). Go to Step (4).
(6) Let C_+^* = \min\{2 C_+^*, C^*\} and C_-^* = \min\{2 C_-^*, C^*\}.
(7) Output the labels of the test samples and terminate.

Fig. 1. Two-dimensional toy data.
In this new algorithm, N_p can be initialized as any integer between 1 and the number of samples in the test set. In Step (4), the sample with the biggest \xi^* is selected in order to minimize the objective function. In addition to changing the label of the selected sample, the influencing parameters are adjusted according to the updated numbers of positive and negative samples in Step (5). If a positive label is changed to negative, then C_+^*/C_-^* = (N_n + 1)/(N_p - 1); on the other hand, if a negative label is changed to positive, then C_+^*/C_-^* = (N_n - 1)/(N_p + 1).

Theorem 2 in (Joachims, 1999) indicates that the transductive algorithm terminates in finitely many steps because the objective function of the optimization problem decreases. Theorem 1 ensures that the label changes of samples satisfying (3.3) or (3.4) decrease the objective function, so the new algorithm also converges in finitely many steps.
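The influencing-parameter bookkeeping of Steps (2), (5) and (6) can be sketched as below. The constants a and C^* used in the usage note are illustrative; the SVM retraining of Step (4) is omitted.

```python
# Influencing-parameter bookkeeping of the new algorithm.
# Step (2): initialize; Step (5): rescale after a label flip; Step (6): anneal.

def init_params(a, n_p, n_n):
    """Step (2): C*_+ = a * N_n, C*_- = a * N_p."""
    return a * n_n, a * n_p

def flip_update(a, n_p, n_n, flipped_label):
    """Step (5): after flipping one label, keep C*_+ and C*_- in inverse
    proportion to the updated class counts, as in Eq. (3.7).
    Returns (C*_+, C*_-, new n_p, new n_n)."""
    if flipped_label == +1:          # a positive sample became negative
        n_p, n_n = n_p - 1, n_n + 1
    else:                            # a negative sample became positive
        n_p, n_n = n_p + 1, n_n - 1
    return a * n_n, a * n_p, n_p, n_n

def anneal(c_pos, c_neg, c_star):
    """Step (6): double both parameters, capped at C*."""
    return min(2 * c_pos, c_star), min(2 * c_neg, c_star)
```

For instance, with a = 0.5, N_p = 10 and N_n = 30, init_params gives (15.0, 5.0); flipping a positive sample yields (15.5, 4.5) with the counts updated to 9 and 31, preserving the ratio (N_n + 1)/(N_p - 1).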
Fig. 2. Regular SVM.
4. Experimental results
To verify the superior accuracy and adaptability of the new method, experiments are done on toy data and real-world data.

4.1. Toy data

The toy data are two-dimensional points generated artificially, as shown in Fig. 1. The positive samples are represented by "×" and the negative ones by "○"; the training samples are marked by "□". Fig. 1 shows that the distribution of the training samples is quite different from that of the test samples.

In these tests, accuracy is used as the measure of performance: the probability that the predicted class of a sample is its true class.
The classification result of the regular SVM is shown in Fig. 2, where the solid line represents the decision hyper-plane and the samples marked by big circles are the support vectors. Due to the distribution difference between the training and test samples, the classification accuracy of the regular SVM is only 85.5%.

When the TSVM is used to classify the toy data, N_p must be appointed beforehand. This randomly appointed N_p might differ from the real N_p; accordingly, the classification performance is not as good as expected. The classification result of the TSVM for an improper N_p (a little bigger than the real N_p) is shown in Fig. 3; its accuracy is 94.2%. On the other hand, our new method updates N_p to a proper value and determines the best decision hyper-plane. As shown in Fig. 4, the accuracy of our method is 100%. Note that the initial values of N_p are identical in these two tests.

Fig. 3. TSVM with an improper N_p. Fig. 4. The new method.
4.2. Real world data
The experiments on real-world data are done on a subset of the "Reuters-21578"1 dataset. In the "Reuters-21578" dataset, some samples belong to the "training set" and others belong to the "test set". We select six groups from the "Reuters-21578" as the subset. Each group of the subset has positive samples and negative samples, which belong to different topics. The numbers and topics of these groups are listed in Table 1.

1 http://www.research.att.com/~lewis/reuter21578.html

After stop-word filtering and stemming, the text samples are transformed into vectors in a very high-dimensional feature space. The weight of each word is evaluated by the tf-idf formula. A comparison of the classification accuracy of the regular SVM, the TSVM and the new method is listed in Table 2. The parameters of each test are identical: the RBF kernel is used with C = 10^3 and γ = 0.1. The SVMlight toolbox is used in this experiment.

In Table 2, "PPS" abbreviates "proportion of positive samples". The fourth to seventh columns give the classification accuracies of the SVM, the TSVM and the new method. In the fifth column, N_p is initialized as half the number of test samples; in the sixth column, N_p is initialized as the product of the number of test samples and the proportion of positive samples in the training set, as advised in (Joachims, 1999); in the seventh column, N_p is appointed by the same method as in the fifth column.

To find the influence of different N_p on the new method, another experiment is done. In this experiment, the ratio of N_p to the number of test samples varies from 0.1 to 0.9. The results are listed in Table 3, where NTE denotes the number of test samples.
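The tf-idf weighting mentioned above can be sketched as follows. Note that tf-idf has many variants; this is the plain tf × log(N/df) form, which may differ in detail from the weighting actually used in the experiments, and the three toy "documents" are illustrative, not drawn from Reuters-21578.

```python
# Minimal tf-idf weighting sketch: term frequency within a document times
# the log inverse document frequency over the collection.
import math

docs = [["wheat", "price", "wheat"],
        ["rice", "price"],
        ["wheat", "export"]]

def tf_idf(term, doc, docs):
    """tf-idf weight of `term` in `doc` relative to the collection `docs`."""
    tf = doc.count(term) / len(doc)                # normalized term frequency
    df = sum(1 for d in docs if term in d)         # document frequency
    idf = math.log(len(docs) / df)                 # inverse document frequency
    return tf * idf
```

For example, "wheat" in the first document gets weight (2/3) · log(3/2) ≈ 0.270, while the rarer "export" in the third document gets (1/2) · log(3) ≈ 0.549.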
Table 2
Accuracies of SVM, TSVM and the new method

No.  PPS in the    PPS in the  SVM (%)  TSVM (0.5) (%)  TSVM (proportion) (%)  New method (0.5) (%)
     training set  test set
1    0.850         0.764       90.0     74.6            90.9                   98.2
2    0.688         0.638       93.8     85.7            95.9                   97.9
3    0.540         0.575       95.7     97.9            100                    100
4    0.839         0.860       88.4     65.1            90.9                   93.9
5    0.295         0.333       75.8     75.8            87.9                   90.9
6    0.432         0.351       94.6     84.7            91.9                   100

Table 1
Numbers and topics of the subset

No.  Positive   Number in the  Number in the  Negative    Number in the  Number in the
                training set   test set                   training set   test set
1    Wheat      198            84             Rice        35             26
2    Coffee     110            30             Cocoa       50             17
3    Copper     47             27             Iron-steel  40             20
4    Soybean    73             37             Soy oil     14             6
5    Fuel       13             11             Gas         31             22
6    Livestock  73             39             Dlr         96             72

Table 3
Accuracies of the new method with different Np/NTE

No.  0.1 (%)  0.3 (%)  0.5 (%)  0.7 (%)  0.9 (%)
1    96.4     96.4     98.2     98.2     98.2
2    95.7     97.9     97.9     97.9     97.9
3    97.9     100      100      100      100
4    93.9     93.9     93.9     93.9     93.9
5    90.9     90.9     90.9     87.9     84.8
6    100      100      100      100      97.3
4.3. Discussion
Comparing the values in Table 2, we find that the performance of the TSVM depends significantly on N_p. When a proper N_p is appointed, i.e., N_p is close to the number of positive test samples, the performance of the TSVM is very good. Otherwise, the performance of the TSVM is quite poor and sometimes even worse than that of the regular SVM. In most conditions, the new method exceeds the SVM and the TSVM in performance.

Table 3 shows that the initial N_p does not seriously affect the results in most cases. This implies that the new method is not sensitive to the initial N_p. However, when the initial N_p is very different from the real N_p, the performance decreases to some extent.
5. Conclusions
Based on the analysis of the training algorithm and the label exchanging criterion of the TSVM, we propose in this paper a new criterion to minimize the objective function. The new criterion judges and changes the labels of the samples individually. A new training algorithm is presented as well, and a theoretical proof for the new criterion is given: the criterion ensures that the objective function decreases and an optimal solution is obtained in finitely many steps. The toy data and real-world data tests show that the new algorithm is not as sensitive as the TSVM to an improper number of positive samples. When the number of positive samples is initialized randomly, the new method gives better performance than the regular SVM and the TSVM.
References
Chen, Y.S., Wang, G.P., Dong, S.H., 2003. A progressive transductive inference algorithm based on support vector machine. J. Software 14 (3), 451–460 (in Chinese).
Joachims, T., 1999. Transductive inference for text classification using support vector machines. In: Proc. ICML-99, 16th Internat. Conf. on Machine Learning, pp. 200–209.
Joachims, T., 2003. Transductive learning via spectral graph partitioning. In: Proc. Internat. Conf. on Machine Learning (ICML '03), pp. 290–297.
Liu, H., Huang, S.T., 2004. Fuzzy transductive support vector machines for hypertext classification. Internat. J. Uncertainty, Fuzziness and Knowledge-Based Systems 12 (1), 21–36.
Osuna, E., Freund, R., Girosi, F., 1997. Training support vector machines: An application to face detection. In: Proc. CVPR '97, Puerto Rico, pp. 130–136.
Vapnik, V., 1995. The Nature of Statistical Learning Theory. Springer, New York.
Vapnik, V., 1998. Statistical Learning Theory. Wiley, New York.
Zelikovitz, S., 2004. Transductive LSI for short text classification problems. In: Proc. Seventeenth Internat. Florida Artificial Intelligence Research Symposium Conf., Miami Beach, FL, USA. AAAI Press.
Zhang, Z.J., Kwok, T., Yeung, D.Y., 2003. Bayesian transductive learning of the kernel matrix using Wishart processes. Technical Report HKUST-CS03-09, Department of Computer Science, Hong Kong University of Science and Technology.