
Graph Transduction via Alternating Minimization

Jun Wang [email protected]

Department of Electrical Engineering, Columbia University

Tony Jebara [email protected]

Department of Computer Science, Columbia University

Shih-Fu Chang [email protected]

Department of Electrical Engineering, Columbia University

Abstract

Graph transduction methods label input data by learning a classification function that is regularized to exhibit smoothness along a graph over labeled and unlabeled samples. In practice, these algorithms are sensitive to the initial set of labels provided by the user. For instance, classification accuracy drops if the training set contains weak labels, if imbalances exist across label classes, or if the labeled portion of the data is not chosen at random. This paper introduces a propagation algorithm that more reliably minimizes a cost function over both a function on the graph and a binary label matrix. The cost function generalizes prior work in graph transduction and also introduces node normalization terms for resilience to label imbalances. We demonstrate that global minimization of the function is intractable but instead provide an alternating minimization scheme that incrementally adjusts the function and the labels towards a reliable local minimum. Unlike prior methods, the resulting propagation of labels does not prematurely commit to an erroneous labeling and obtains more consistent labels. Experiments are shown for synthetic and real classification tasks including digit and text recognition. A substantial improvement in accuracy compared to state of the art semi-supervised methods is achieved. The advantage is even more dramatic when labeled instances are limited.

Appearing in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s).

1. Introduction

Graph transduction refers to a family of algorithms that achieve state of the art performance in semi-supervised learning and classification. These methods trade off a classification function's accuracy on labeled examples against a regularizer term that encourages the function to remain smooth over a weighted graph connecting the data samples. The weighted graph and the minimized function ultimately propagate label information from labeled data to unlabeled data to provide the desired transductive predictions. Popular algorithms for graph transduction include the Gaussian fields and harmonic functions based method (GFHF) (Zhu et al., 2003) as well as the local and global consistency method (LGC) (Zhou et al., 2004). Other closely related methods include the manifold regularization framework proposed in (Sindhwani et al., 2005; Belkin et al., 2006), where graph Laplacian regularization terms are combined with regularized least squares (RLS) or support vector machine (SVM) function estimation criteria. These methods lead to graph-regularized variants denoted as Laplacian RLS (LapRLS) and Laplacian SVM (LapSVM), respectively. For certain synthetic and real data problems, graph transduction approaches do achieve promising performance. However, this article identifies several realistic settings and labeling situations where this performance can be compromised. An alternative algorithm which generalizes the previous techniques is proposed by defining a joint iterative optimization over the classification function and a balanced label matrix.

Even if one assumes the graph structures used in the above methods faithfully describe the data manifold, graph transduction algorithms may still be misled by problems in the label information. Figure 1 depicts several cases where the label information leads to invalid graph transduction solutions for all the aforementioned algorithms. The top row of Figure 1 shows a


separable pair of manifolds where unbalanced label information affects the propagation results. Although a clear separation region is visible between the two manifolds, the imbalance in the labels misleads the previous algorithms, which prefer assigning points to the larger class. In the bottom row of Figure 1, a non-separable problem is shown where two otherwise separable manifolds are peppered with noisy outlier samples. Here, the outliers do not belong to either class but once again interfere with the propagation of label information. In both situations, conventional transductive learning approaches such as GFHF, LGC, LapRLS, and LapSVM fail to give acceptable labeling results.

In order to handle such situations, we extend the graph transduction optimization problem by casting it as a joint optimization over the classification function and the labels. The optimization is solved iteratively and remedies the instability previous methods exhibit with respect to the initial labeling. In our novel framework, the initial labels simply act as the starting value of the label matrix variable, which is incrementally refined until convergence. The overall minimization over the continuous classification function and the binary label matrix proceeds by an alternating minimization over each term separately and converges to a local minimum. Moreover, to handle the issue of imbalanced labels, a node regularizer term is introduced to balance the label matrix among different classes. These two fundamental changes to the graph transduction problem produce significantly better performance on both artificial and real datasets.

The remainder of this paper is organized as follows. In Section 2, we revisit the graph regularization framework of (Zhou et al., 2004) and extend it into a bivariate graph optimization problem. A corresponding algorithm is provided that solves the new optimization problem by iterative alternating minimization. Section 3 provides experimental validation for the algorithm on both toy and real classification datasets, including text classification and digit recognition. Comparisons with leading semi-supervised methods are made. Concluding remarks and a discussion are then provided in Section 4.

2. Graph Transduction

Consider the dataset X = (Xl, Xu) of labeled inputs Xl = {x1, ..., xl} and unlabeled inputs Xu = {xl+1, ..., xn} along with a small portion of corresponding labels {y1, ..., yl}, where yi ∈ L = {1, ..., c}. For transductive learning, the objective is to infer the labels {yl+1, ..., yn} of the unlabeled data {xl+1, ..., xn}, where typically l ≪ n.

Graph transduction methods define an undirected graph represented by G = {X, E}, where the set of nodes or vertices is X = {xi} and the set of edges is E = {eij}. Each sample xi is treated as a node on the graph and the weight of edge eij is wij. Typically, one uses a kernel function k over pairs of points to recover the weights, in other words wij = k(xi, xj), with the RBF kernel being a popular choice. The edge weights are collected in a weight matrix denoted by W = {wij}. Similarly, the node degree matrix D = diag([d1, ..., dn]) is defined by

d_i = \sum_{j=1}^{n} w_{ij}.

The binary label matrix Y is described as Y ∈ B^{n×c} with Yij = 1 if xi has label yi = j and Yij = 0 otherwise. This article will often refer to row and column vectors of such matrices; for instance, the ith row and jth column vectors of Y are denoted Yi· and Y·j, respectively. The graph Laplacian is defined as ∆ = D − W and the normalized graph Laplacian is L = D^{-1/2} ∆ D^{-1/2} = I − D^{-1/2} W D^{-1/2}.
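For concreteness, the sketch below builds these quantities with numpy. It is an illustration of the definitions above, not the authors' code; the dense RBF-kernel graph and the single bandwidth parameter sigma are assumptions made for this example.

```python
import numpy as np

def build_graph(X, sigma=1.0):
    """Build W, D, the Laplacian Delta = D - W and the normalized
    Laplacian L = D^{-1/2} Delta D^{-1/2} from samples X of shape (n, d)."""
    sq = np.sum(X ** 2, axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # pairwise squared distances
    W = np.exp(-dist2 / (2.0 * sigma ** 2))             # w_ij = k(x_i, x_j), RBF kernel
    np.fill_diagonal(W, 0.0)                            # no self-loops
    d = W.sum(axis=1)                                   # node degrees d_i = sum_j w_ij
    D = np.diag(d)
    Delta = D - W                                       # graph Laplacian
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = D_inv_sqrt @ Delta @ D_inv_sqrt                 # normalized graph Laplacian
    return W, D, Delta, L
```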

2.1. Consistent Label Propagation

Graph based semi-supervised learning methods propagate label information from labeled nodes to unlabeled nodes by treating all samples as nodes in a graph and using edge-based affinity functions between all pairs of nodes to estimate the weight of each edge. Most methods then define a continuous classification function F ∈ R^{n×c} that is estimated on the graph to minimize a cost function. The cost function typically enforces a tradeoff between the smoothness of the function on the graph of both labeled and unlabeled data and the accuracy of the function at fitting the label information for the labeled nodes. Such is the case for a large variety of graph based semi-supervised learning techniques, ranging from the mincuts method (Blum & Chawla, 2001) to the Gaussian fields and harmonic functions (GFHF) method and the local and global consistency (LGC) method. A detailed survey of these methods is available in (Zhu, 2005).

In trading off smoothness for accuracy, both the GFHF and LGC approaches attempt to preserve consistency on the data manifold during the optimization of the classification function. The loss function for both methods involves the additive contribution of two penalty terms, the global smoothness Q_smooth and the local fitness Q_fit, as shown below:

F^{*} = \arg\min_{F} Q(F) = \arg\min_{F} \{ Q_{smooth}(F) + Q_{fit}(F) \}    (1)

In particular, recall that LGC uses an elastic regularizer framework with the following cost function (Zhou et al., 2004).


Figure 1. A demonstration with artificial data of the sensitivity graph transduction exhibits for certain initial label settings. The top row shows how imbalanced labels adversely affect even a well-separated 2D two-moon dataset. The bottom row shows 3D two-moon data where graph transduction is again easily misled by the introduction of a cloud of outliers. Large markers indicate known labels and the two-color small markers represent the predicted classification results. Columns depict the results from (a) the GFHF method (Zhu et al., 2003); (b) the LGC method (Zhou et al., 2004); (c) the LapRLS method (Belkin et al., 2006); (d) the LapSVM method (Belkin et al., 2006); and (e) our method (GTAM).

Q(F) = \sum_{i,j=1}^{n} w_{ij} \left\| \frac{F_{i\cdot}}{\sqrt{D_{ii}}} - \frac{F_{j\cdot}}{\sqrt{D_{jj}}} \right\|^{2} + \mu \sum_{i=1}^{n} \left\| F_{i\cdot} - Y_{i\cdot} \right\|^{2}    (2)

where the coefficient µ balances the global smoothness and local fitting penalty terms. If we set µ = ∞ and use a standard graph Laplacian for the smoothness term, the above framework reduces to the harmonic function formulation as shown in (Zhu et al., 2003).
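To make the two penalty terms of Eq. (2) concrete, the small numpy sketch below evaluates them directly. It is illustrative only (not the authors' code), and the all-pairs broadcast is O(n^2 c) in memory, so it is meant for reading the formula rather than for large graphs.

```python
import numpy as np

def lgc_cost(F, Y, W, mu):
    """Evaluate the LGC cost of Eq. (2).
    F: (n, c) classification function, Y: (n, c) label matrix,
    W: (n, n) affinity matrix, mu: smoothness vs. fitting tradeoff."""
    d = W.sum(axis=1)
    Fn = F / np.sqrt(d)[:, None]                              # rows F_i / sqrt(D_ii)
    diff2 = ((Fn[:, None, :] - Fn[None, :, :]) ** 2).sum(axis=2)
    smooth = (W * diff2).sum()                                # global smoothness Q_smooth
    fit = ((F - Y) ** 2).sum()                                # local fitting Q_fit
    return smooth + mu * fit
```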

While the LGC and GFHF formulations remain popular and have been empirically validated in the past, it is possible to discern some key limitations. First, the optimization can be broken up into separate parallel problems, since the cost function decomposes into terms that only depend on individual columns of the matrix F. Since each column of F indexes the labeling of a single class, such a decomposition reveals that biases may arise if the input labels are disproportionately imbalanced. In practice, both propagation algorithms tend to prefer predicting the class with the majority of labels. Second, both learning algorithms are extremely dependent on the initial labels provided in Y. This is seen in practice but can also be explained mathematically by the fact that Y starts off extremely sparse and has many unknown terms. Third, when the graph contains background noise and the class manifolds become non-separable, these graph transduction approaches fail to output reasonable classification results. These difficulties were illustrated in Figure 1 and seem to plague many graph transduction approaches. However, the proposed method, graph transduction via alternating minimization (GTAM), appears resilient.

To address these problems, we begin by making modifications to the cost function in Equation 1. The first is to explicitly show the optimization over both the classification function F and the binary label matrix Y:

(F^{*}, Y^{*}) = \arg\min_{F \in \mathbb{R}^{n \times c},\, Y \in \mathbb{B}^{n \times c}} Q(F, Y)    (3)

where B^{n×c} is the set of all binary matrices Y of size n × c that satisfy \sum_{j} Y_{ij} = 1 and, for the labeled data xi ∈ Xl, Yij = 1 if yi = j. More specifically, our loss function is:

Q(F, Y) = \mathrm{tr}\left\{ F^{T} L F + \mu (F - VY)^{T} (F - VY) \right\}    (4)

where we have introduced the matrix V, a node regularizer that balances the influence of labels from different classes. The matrix V = diag(v) is a function of the current label matrix Y:

v = \sum_{j=1}^{c} \frac{Y_{\cdot j} \odot D\vec{1}}{Y_{\cdot j}^{T} D\vec{1}}    (5)

where the symbol ⊙ denotes the Hadamard product and the column vector \vec{1} = [1 \cdots 1]^{T}. This node regularizer permits us to work with a normalized version of the label matrix Z, defined as Z = VY.

By definition, we see that the normalized label matrix satisfies \sum_{i} Z_{ij} = 1. Using the normalized label matrix Z in the graph regularization allows labeled nodes with high degree to contribute more during the graph diffusion and label propagation process. However, the total diffusion of each class is kept equal and normalized to one. Therefore, the influence of different classes is balanced even if the given class labels are imbalanced.
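A small numpy sketch may make Eq. (5) concrete. It is illustrative only; the guard for classes with no labeled examples is an assumption added here, not a detail from the paper.

```python
import numpy as np

def node_regularizer(Y, d):
    """Node regularizer v of Eq. (5): v = sum_j (Y_{.j} ⊙ D1) / (Y_{.j}^T D1).
    Y: (n, c) binary label matrix, d: (n,) node degrees (the vector D 1).
    Each column of Z = diag(v) Y then sums to one over the labeled nodes."""
    v = np.zeros(Y.shape[0])
    for j in range(Y.shape[1]):
        col_mass = Y[:, j] @ d                   # Y_{.j}^T D 1
        if col_mass > 0:                         # skip classes with no labels (assumption)
            v += (Y[:, j] * d) / col_mass        # (Y_{.j} ⊙ D 1) / (Y_{.j}^T D 1)
    return v
```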


If class proportion information is known a priori, it can be integrated by scaling the diffusion with the prior class proportions. However, because of the nature of graph transduction and the limited size of the labeled data, equal class balancing generally leads to more reliable solutions than class proportional weighting. This intuition is in line with prior work that uses class proportion information in transductive inference, such as (Chapelle et al., 2007), where class proportion is enforced as a hard constraint on the labels, or (Mann & McCallum, 2007), where such information is used as a regularizer. We next discuss the alternating minimization procedure, which is the key modification to the overall framework.

2.2. Alternating Minimization Procedure

In our proposed graph regularization framework, the cost function involves two variables to be optimized. While simultaneously recovering both solutions is intractable due to the mixed integer programming problem over the binary Y and the continuous F, we propose a greedy alternating minimization approach. The first update, of the continuous classification function F, is straightforward since the resulting cost function is convex and unconstrained, allowing us to recover the optimal F by setting the partial derivative ∂Q/∂F to zero. However, since Y ∈ B^{n×c} is a binary matrix subject to linear constraints of the form \sum_{j=1}^{c} Y_{ij} = 1, the other step in our alternating minimization requires solving a linearly constrained max-cut problem, which is NP-hard (Karp, 1972). Due to the alternating minimization outer loop, investigating guaranteed approximation schemes (Goemans & Williamson, 1995) to solve a constrained max-cut problem for Y is unjustified because the solution depends on the dynamically varying classification function F during the alternating minimization procedure. Instead, we use a greedy gradient based approach to incrementally update Y, while keeping the classification function F at the corresponding optimal setting. Moreover, because the node regularizer term V normalizes the labeled data, we also interleave updates of V based on the revised Y.

Minimization for F: The classification function F ∈ R^{n×c} is continuous and its loss terms are convex, allowing the minimum to be recovered by zeroing the partial derivative:

\frac{\partial Q}{\partial F^{*}} = 0 \;\Longrightarrow\; L F^{*} + \mu (F^{*} - VY) = 0 \;\Longrightarrow\; F^{*} = (L/\mu + I)^{-1} VY = PVY    (6)

where we denote P = (L/µ + I)^{-1} as the propagation matrix.
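In numpy, the closed-form update of Eq. (6) might look as follows. This is a sketch for clarity; forming an explicit matrix inverse is shown only to mirror the formula, and a linear solve would normally be preferable.

```python
import numpy as np

def optimal_F(L, Y, v, mu):
    """Closed-form update of Eq. (6): F* = (L/mu + I)^{-1} V Y = P V Y.
    L: (n, n) normalized Laplacian, Y: (n, c) label matrix,
    v: (n,) node regularizer, mu: tradeoff parameter."""
    n = L.shape[0]
    P = np.linalg.inv(L / mu + np.eye(n))    # propagation matrix P
    Z = v[:, None] * Y                       # V Y with V = diag(v)
    return P @ Z                             # optimal classification function F*
```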

Greedy minimization of Y: To update Y, first replace F in Eq. 4 by its optimal value F^{*} from the solution of Eq. 6:

Q(Y) = \mathrm{tr}\left\{ Y^{T} V^{T} P^{T} L P V Y + \mu (PVY - VY)^{T} (PVY - VY) \right\} = \mathrm{tr}\left( Y^{T} V^{T} \left[ P^{T} L P + \mu (P^{T} - I)(P - I) \right] V Y \right)    (7)

The optimization still involves the node regularizer V in Eq. 5, which depends on Y and normalizes the label matrix over columns. Due to the dependence on the current estimate of F and V, only an incremental step is taken greedily to reduce Eq. 7. In each iteration, we find a position (i^{*}, j^{*}) in the matrix Y and change the binary value of Y_{i^{*} j^{*}} from 0 to 1. The direction with the largest negative gradient guides our choice of binary step on Y. Therefore, we need to evaluate \nabla_{Y} Q and find the entry with the largest negative value to determine (i^{*}, j^{*}).

Note that setting Y_{i^{*} j^{*}} = 1 is equivalent to a similar operation on the normalized label matrix Z, namely setting Z_{i^{*} j^{*}} = ε with 0 < ε < 1, and Y and Z are in one-to-one correspondence. Thus, the greedy optimization of Q with respect to Y is equivalent to the greedy minimization of Q with respect to Z. More formally, \frac{\partial Q}{\partial Y} = \frac{\partial Q}{\partial Z} \frac{\partial Z}{\partial Y}, and with straightforward algebra we see that:

(i^{*}, j^{*}) = \arg\min_{i,j} \frac{\partial Q}{\partial Y} = \arg\min_{i,j} \frac{\partial Q}{\partial Z}    (8)

Then we can rewrite the loss function using the variable Z as:

Q(Z) = \mathrm{tr}\left( Z^{T} \left[ P^{T} L P + (P^{T} - I)(P - I) \right] Z \right)    (9)

and recover the gradient of the above loss function with respect to Z as:

\frac{\partial Q}{\partial Z} = \left( P^{T} L P + (P^{T} - I)(P - I) \right) Z + \left( P^{T} L P + (P^{T} - I)(P - I) \right)^{T} Z

Let A = P^{T} L P + (P^{T} - I)(P - I). We simplify the representation of the gradient as:

\nabla_{Z} Q = \left( A + A^{T} \right) Z = \left( A + A^{T} \right) VY    (10)
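As a concrete reading of Eq. (10), the following numpy sketch computes the gradient matrix that is searched for the greedy update (illustrative only; the notation matches the symbols defined above).

```python
import numpy as np

def label_gradient(P, L, v, Y):
    """Gradient of Eq. (10): ∇_Z Q = (A + A^T) V Y with
    A = P^T L P + (P^T - I)(P - I). Returns an (n, c) matrix."""
    n = P.shape[0]
    I = np.eye(n)
    A = P.T @ L @ P + (P.T - I) @ (P - I)    # matrix A defined in the text
    Z = v[:, None] * Y                       # Z = V Y
    return (A + A.T) @ Z                     # gradient searched for (i*, j*)
```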

As described earlier, we search the gradient matrix ∇_{Z} Q to find the minimal element for updating:

(i^{*}, j^{*}) = \arg\min_{x_i \in X_u,\, 1 \le j \le c} \nabla_{Z_{ij}} Q, \qquad Y^{t+1}_{i^{*} j^{*}} = 1    (11)


Because of the binary nature of Y, we simply set Y_{i^{*} j^{*}} = 1 instead of using a continuous gradient approach. In the (t+1)th iteration, after obtaining Y^{t+1}, the node regularizer v^{t+1} can be recalculated from Y^{t+1} using Eq. 5.

The update of Y is indeed greedy. Therefore, it could oscillate and backtrack from labelings predicted in previous iterations without convergence guarantees. We propose a straightforward way to guarantee convergence and avoid backtracking, inconsistency, or unstable oscillation in the greedy propagation of labels: once an unlabeled point has been labeled, its labeling can no longer be changed. Thus, we remove the most recently labeled point (i^{*}, j^{*}) from future consideration and only permit the algorithm to search for the minimal gradient entries corresponding to the remaining unlabeled examples. To avoid changing the labeling of previous predictions, the newly labeled node x_{i^{*}} is removed from Xu and added to Xl.

In the following, we summarize the updating rules from step t to t+1 in the alternating minimization scheme. Although the optimal F^{*} is computed in each iteration as shown in Eq. 6, we do not need to update it explicitly. Instead, it is implicitly used in Eq. 8 to directly update Y.

\nabla_{Z} Q^{t+1} = \left( A + A^{T} \right) \mathrm{diag}(v^{t})\, Y^{t}    (12)

(i^{*}, j^{*}) = \arg\min_{x_i \in X_u^{t},\, 1 \le j \le c} \nabla_{Z_{ij}} Q^{t+1}

Y^{t+1}_{i^{*} j^{*}} = 1

v^{t+1} = \sum_{j=1}^{c} \frac{Y^{t+1}_{\cdot j} \odot D\vec{1}}{(Y^{t+1}_{\cdot j})^{T} D\vec{1}}

X_l^{t+1} \leftarrow X_l^{t} + x_{i^{*}}, \qquad X_u^{t+1} \leftarrow X_u^{t} - x_{i^{*}}

The procedure above repeats until all points have been labeled.

2.3. Algorithm Summary and Convergence

From the above discussion, our method is unique in that it optimizes the loss function over both the continuous-valued F space and the binary-valued Y space. Starting from a few given labels, the method iteratively and greedily updates the label matrix Y, the node regularizer v, and the gradient matrix ∇Q. In each individual iteration, new labeled samples are obtained to drive a better graph propagation in the next iteration. In our approach, we directly acquire new labels instead of calculating F and then mapping it to Y, which is the regular procedure in other graph transduction methods like LGC and GFHF. This unique feature makes the proposed algorithm very efficient since we only update the gradient matrix ∇_{Z} Q in each iteration.

Algorithm 1 Graph Transduction via Alternating Minimization (GTAM)

Input: data set X = {x1, ..., xl, xl+1, ..., xn}, labeled subset Xl = {x1, ..., xl}, unlabeled subset Xu = {xl+1, ..., xn}, labels {y1, ..., yl}, where yj ∈ L = {1, ..., c}; affinity matrix W = {wij}, node degree matrix D, initial label matrix Y^{0}.
Initialization:
  iteration counter t = 0;
  normalized graph Laplacian L = D^{-1/2} ∆ D^{-1/2};
  propagation matrix P = (L/µ + I)^{-1};
  matrix A = P^{T} L P + (P^{T} - I)(P - I);
  node regularizer v^{0} = \sum_{j=1}^{c} (Y^{0}_{\cdot j} ⊙ D\vec{1}) / ((Y^{0}_{\cdot j})^{T} D\vec{1}).
repeat
  Compute the graph gradient: Z^{t} = diag(v^{t}) Y^{t}, ∇_{Z^{t}} Q = (A + A^{T}) Z^{t};
  Find the optimal element of ∇_{Z^{t}} Q: (i^{*}, j^{*}) = \arg\min_{x_i ∈ X_u, 1 ≤ j ≤ c} ∇_{Z_{ij}} Q;
  Update the label matrix to obtain Y^{t+1} by setting Y^{t+1}_{i^{*} j^{*}} = 1 and y_{i^{*}} = j^{*};
  Update the node regularizer: v^{t+1} = \sum_{j=1}^{c} (Y^{t+1}_{\cdot j} ⊙ D\vec{1}) / ((Y^{t+1}_{\cdot j})^{T} D\vec{1});
  Remove x_{i^{*}} from Xu: X_u^{t+1} ← X_u^{t} − x_{i^{*}}; add x_{i^{*}} to Xl: X_l^{t+1} ← X_l^{t} + x_{i^{*}};
  Update the iteration counter: t = t + 1;
until X_u^{t} = ∅
Output: the labels of the unlabeled samples {yl+1, ..., yn}.

Furthermore, similar to the graph superposition approach introduced in (Wang et al., 2008), the calculation of ∇_{Z} Q can be made more efficient by incremental updating with respect to the newly gained labels.

Due to the greedy assignment, the algorithm loops through the alternating minimization (equivalently, the gradient computation) at most n − l times. The update of the graph gradient, the search for the largest element in the gradient, and the matrix algebra involved can all be done efficiently because only a single entry in Y is modified per loop. Each minimization step over F and Y thus requires O(n^2) time and the total runtime of the greedy GTAM algorithm is O(n^3 − l n^2). Empirically, the value of the loss function Q decreases rapidly in the first few dozen iterations and achieves steady convergence afterward. This phenomenon indicates that the label propagation loop could be terminated early by deriving the labels from the optimized F (Eq. 6) once enough labels have been acquired. The algorithm chart above summarizes the proposed GTAM method.
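To tie the pieces together, here is a compact end-to-end sketch of the loop in Algorithm 1 in numpy. It is an illustration under the same assumptions as the earlier snippets, not the authors' implementation, and it recomputes the full gradient each iteration rather than using the incremental update mentioned above.

```python
import numpy as np

def gtam(W, Y0, labeled_mask, mu=99.0):
    """Sketch of Algorithm 1. W: (n, n) affinity matrix, Y0: (n, c) initial
    binary label matrix, labeled_mask: boolean (n,) marking the labeled set X_l.
    Returns one predicted class index per sample."""
    n, c = Y0.shape
    d = W.sum(axis=1)                                    # node degrees, D 1
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(n) - D_inv_sqrt @ W @ D_inv_sqrt          # normalized Laplacian
    P = np.linalg.inv(L / mu + np.eye(n))                # propagation matrix
    A = P.T @ L @ P + (P.T - np.eye(n)) @ (P - np.eye(n))
    S = A + A.T                                          # precomputed symmetric part
    Y = Y0.astype(float)
    labeled = labeled_mask.copy()
    while not labeled.all():
        # node regularizer v of Eq. (5): columns of diag(v) Y sum to one
        v = np.zeros(n)
        for j in range(c):
            col_mass = Y[:, j] @ d
            if col_mass > 0:
                v += (Y[:, j] * d) / col_mass
        grad = S @ (v[:, None] * Y)                      # Eq. (12): (A + A^T) diag(v) Y
        grad[labeled, :] = np.inf                        # search only the unlabeled nodes
        i_star, j_star = np.unravel_index(np.argmin(grad), grad.shape)
        Y[i_star, j_star] = 1.0                          # greedy binary step on Y
        labeled[i_star] = True                           # move x_{i*} from X_u to X_l
    return Y.argmax(axis=1)
```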


3. Experiments

In this section, we demonstrate the superiority of the proposed GTAM method in comparison to state of the art semi-supervised learning methods over both synthetic and real data. For instance, on the WebKB data, previous work shows that LapSVM and LapRLS are better than other semi-supervised approaches, such as the Transductive SVM (TSVM) (Joachims, 1999) and ∇TSVM. Therefore, we only compare our method with LapRLS, LapSVM and two related methods, LapRLSjoint and LapSVMjoint (Sindhwani et al., 2005). In all experiments, we used the same parameter settings reported in the literature. The GTAM approach only requires a single parameter µ, which controls the tradeoff between the global smoothness and local fitting terms in the cost function. Although our experiments show that GTAM is fairly robust to the setting of µ, we set µ = 99 throughout all experiments.

For all real implementations of graph-based methods, a graph must be constructed from the training data X, which involves a procedure for computing the weight of each edge via a kernel or similarity function. Typically, these methods use RBF kernels for image recognition and cosine distances for text classification (Zhou et al., 2004; Ng et al., 2001; Chapelle et al., 2003; Hein & Maier, 2006). However, finding adequate parameters for the kernel or similarity function, such as the RBF kernel size δ, is not always straightforward, particularly when labeled data is scarce. Empirical experience has shown that the propagation results depend strongly on the kernel parameter selection. Motivated by the approach reported in (Hein & Maier, 2006), we use an adaptive kernel size based on the mean distance to the k-nearest neighbors (k = 6) for the experiments on the real USPS handwritten digit data. On the WebKB data, we use the same graph construction suggested by (Sindhwani et al., 2005). For each dataset, the same constructed graph is used for all the compared transductive learning approaches.
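A possible reading of this adaptive construction is sketched below: a k-NN graph with RBF weights whose per-node bandwidth is the mean distance to the k nearest neighbors, in the spirit of (Hein & Maier, 2006). Details such as the symmetrization step are assumptions made for this illustration, not specifics given in the paper.

```python
import numpy as np

def adaptive_rbf_graph(X, k=6):
    """Adaptive-kernel k-NN affinity matrix for samples X of shape (n, d)."""
    n = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    dist = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0))
    W = np.zeros((n, n))
    for i in range(n):
        nn = np.argsort(dist[i])[1:k + 1]              # k nearest neighbors, skipping self
        sigma_i = dist[i, nn].mean()                   # adaptive kernel size for node i
        W[i, nn] = np.exp(-dist[i, nn] ** 2 / (2.0 * sigma_i ** 2))
    return np.maximum(W, W.T)                          # symmetrize the k-NN graph (assumption)
```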

3.1. Two Moon Synthetic Data

Figure 1 illustrated synthetic experiments on 2D and 3D two-moon data. Despite the near-perfect classification results reported on such datasets in the literature (Zhou et al., 2004), we show how small perturbations of the problem can have adverse effects on prior algorithms. The prior methods are overly sensitive to the locations of the initial labels, the ratio between the two classes' labels, and the level of ambient noise or outliers.

A more thorough experimental study is also possible for the two-moon data by exploring the effect of class imbalance.


Figure 2. Performance comparison of LGC, GFHF, LapRLS, LapSVM, and GTAM on noisy 3D two-moon data. Only one label is given for one class, while the other class has a varying number of labels, shown as the imbalance ratio on the horizontal axis: (a) the mean of the test error; (b) the standard deviation of the test error.

We start by fixing one class to have one observed label and select r labels from the other class. Here, r is the imbalance ratio and the range we explore is 1 ≤ r ≤ 20. These experiments use the 3D noisy two-moon data, which contain 300 positive and 300 negative sample points as well as an additional 200 background noise points. Multiple rounds (100 trials) are evaluated for each imbalance condition by calculating the average prediction accuracy on the relevant 600 samples. For a fair comparison, we use the same graph Laplacian, which is constructed using k-NN (k = 6) neighbors with RBF weights. Moreover, the parameter for LGC is set to α = 0.99. The parameters for LapRLS and LapSVM are γ_A = 1, γ_I = 1. For GTAM, we set µ = 99, which corresponds to α = 0.99 in the LGC method.

Figure 2 demonstrates the performance advantage of the proposed GTAM approach over the LGC, GFHF, LapRLS, and LapSVM methods. From the figure, we can conclude that all four strawman approaches are extremely sensitive to the initial labels and label class imbalance: none of them produces perfect accuracy, and the error rates of LGC and GFHF increase dramatically as the label classes become more imbalanced, even though more information is being provided to the algorithm. GTAM, however, is clearly superior, achieving the best accuracy regardless of the imbalance ratio and despite contamination with noisy samples. In fact, only 1 or 2 of the 100 trials for each individual setting of r were imperfect using the GTAM method.

3.2. WebKB Dataset

For validation on real data, we first evaluated our method using the WebKB dataset, which has been widely used in semi-supervised learning experiments (Joachims, 2003; Sindhwani et al., 2005).


The WebKB dataset contains two document categories, course and non-course. Each document has two types of information: the webpage text content, called the page representation, and the link or pointer representation. For a fair comparison, we applied the same feature extraction procedure as discussed in (Sindhwani et al., 2005), obtaining 1051 samples with 1840-dimensional page attributes and 3000 link attributes. The graph was built based on cosine-distance neighbors with Gaussian weights (the number of nearest neighbors is 200, as in (Sindhwani et al., 2005)). We compared our method with four of the best known approaches: LapRLS, LapSVM, and the two problem-specific methods LapRLSjoint and LapSVMjoint reported in (Sindhwani et al., 2005). All the compared approaches used the same graph construction procedure and all parameter settings were set according to (Sindhwani et al., 2005), in other words γ_A = 10^{-6}, γ_I = 0.01. We varied the number of labeled data to measure the performance gain with increasing supervision. For each fixed number of labeled samples, 100 random trials were tested. The means of the test errors are shown in Figure 3.

Figure 3. Performance comparison on text classification (WebKB dataset). The horizontal axis represents the number of randomly observed labels (guaranteeing there is at least one label for each class). The vertical axis shows the average error rate over 100 random trials.

As the figure reveals, the proposed GTAM method achieves significantly better accuracy than all the other methods, except in the extreme case when only four labeled samples are available. The performance gain grows rapidly as the number of labeled samples increases, although in some cases the error rate does not drop monotonically.

3.3. USPS digit data

We also evaluated the proposed method on an image recognition task. Specifically, we used the data in (Zhou et al., 2004) for handwritten digit classification experiments.

To evaluate the algorithms, we reveal a subset of the labels (randomly chosen and guaranteeing that at least one labeled example is available for each digit). We compared our method with LGC, GFHF, LapRLS, and LapSVM. The error rates are calculated as the average over 20 trials.

Figure 4. Performance comparison on handwritten digit classification (USPS database). The horizontal axis shows the total number of randomly observed labels (guaranteeing there is at least one label for each class). The vertical axis shows the average error rate over 20 random trials.

From Figure 4, we can conclude that GTAM significantly improves classification accuracy compared to the other approaches, especially when very few labeled samples are available. The mean error rates of GTAM remain consistently low across different numbers of labels, and the standard deviation values are also very small (on the order of 10^{-4}). This demonstrates that the GTAM method is insensitive to the number and specific locations of the initially given labels. Only 1% of the test digit images were mislabeled. These failure cases are presented in Figure 5 and are often ambiguous or extremely poorly drawn digits. Compared to the performance on the WebKB dataset shown in Figure 3, the USPS digit experiments exhibit even more promising results. One possible reason is that the USPS digit dataset has relatively more samples (3874) and a lower feature dimensionality (256) compared to the WebKB dataset (which has 1840 samples in 4800 dimensions). Therefore the graph construction procedure is more reliable and the estimation of graph gradients in our algorithm is more robust.

Figure 5. USPS handwritten digit samples which are incorrectly classified.


4. Conclusion and Discussion

Existing graph-based transductive learning methods hinge on good labeling information and can easily be misled if the labels are not distributed evenly across classes, if the choice of initial label locations is varied, or if excessive noise or outliers corrupt the underlying manifold structure. These degenerate settings seem to plague real-world problems as well, compromising the performance of state-of-the-art graph transduction methods. Our experiments over synthetic data sets (the two-moon data sets) and real data sets (USPS digits and WebKB) confirm the shortcomings of the existing tools.

This article addresses these shortcomings and proposes a novel graph transduction method, graph transduction via alternating minimization (GTAM). Therein, both the classification function and the label matrix are treated as variables in a cost function that is iteratively minimized. While the optimal classification function can be estimated exactly, greedy optimization is applied to update the label matrix. The algorithm iterates an alternating minimization between both variables and is guaranteed to converge via the greedy scheme. In each individual iteration, through the graph gradient, the unlabeled node with the largest cost reduction is labeled. We gradually update the label matrix by adding more labeled samples while keeping the classification function at its optimal setting. Furthermore, we enforce normalization of the label matrix to avoid degeneracies. This results in an algorithm that can cope with all the aforementioned degeneracies and in practice achieves significant gains in accuracy while remaining efficient and cubic in the number of samples. Future work will include out-of-sample extensions of this method such that new data points can be added to the training and test sets without requiring a full retraining procedure.

5. Acknowledgments

The authors would like to thank Mr. Yongtao Su and Mr. Shouzhen Liu for their valuable comments. We also thank Mr. Wei Liu and Mr. Deli Zhao for the collection of the artificial data.

References

Belkin, M., Niyogi, P., & Sindhwani, V. (2006). Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples. JMLR, 7, 2399–2434.

Blum, A., & Chawla, S. (2001). Learning from labeled and unlabeled data using graph mincuts. Proc. 18th ICML (pp. 19–26).

Chapelle, O., Sindhwani, V., & Keerthi, S. (2007). Branch and Bound for Semi-Supervised Support Vector Machines. Proc. of NIPS.

Chapelle, O., Weston, J., & Scholkopf, B. (2003). Cluster kernels for semi-supervised learning. Proc. NIPS, 15, 1.

Goemans, M., & Williamson, D. (1995). Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM (JACM), 42, 1115–1145.

Hein, M., & Maier, M. (2006). Manifold denoising. Proc. NIPS, 19.

Joachims, T. (1999). Transductive inference for text classification using support vector machines. Proc. of the ICML, 200–209.

Joachims, T. (2003). Transductive learning via spectral graph partitioning. Proc. of ICML, 290–297.

Karp, R. (1972). Reducibility among combinatorial problems. Complexity of Computer Computations, 43, 85–103.

Mann, G., & McCallum, A. (2007). Simple, robust, scalable semi-supervised learning via expectation regularization. Proc. of the ICML, 593–600.

Ng, A., Jordan, M., & Weiss, Y. (2001). On spectral clustering: Analysis and an algorithm. Proc. NIPS, 14, 849–856.

Sindhwani, V., Niyogi, P., & Belkin, M. (2005). Beyond the point cloud: from transductive to semi-supervised learning. Proc. of ICML.

Wang, J., Chang, S.-F., Zhou, X., & Wong, T. C. S. (2008). Active microscopic cellular image annotation by superposable graph transduction with imbalanced labels. IEEE CVPR, Alaska, USA.

Zhou, D., Bousquet, O., Lal, T., Weston, J., & Scholkopf, B. (2004). Learning with local and global consistency. Proc. NIPS (pp. 321–328).

Zhu, X. (2005). Semi-supervised learning literature survey (Technical Report 1530). Computer Sciences, University of Wisconsin-Madison.

Zhu, X., Ghahramani, Z., & Lafferty, J. (2003). Semi-supervised learning using Gaussian fields and harmonic functions. Proc. 20th ICML.