IEEE TRANSACTIONS ON KNOWLEDGE AND DATA …users.cis.fiu.edu/~taoli/pub/TKDE-clustering.pdf ·...
Transcript of IEEE TRANSACTIONS ON KNOWLEDGE AND DATA …users.cis.fiu.edu/~taoli/pub/TKDE-clustering.pdf ·...
Clustering with Local and Global RegularizationFei Wang, Changshui Zhang, Member, IEEE, and Tao Li
Abstract—Clustering is an old research topic in data mining and machine learning. Most of the traditional clustering methods can be
categorized as local or global ones. In this paper, a novel clustering method that can explore both the local and global information in the
data set is proposed. The method, Clustering with Local and Global Regularization (CLGR), aims to minimize a cost function that
properly trades off the local and global costs. We show that such an optimization problem can be solved by the eigenvalue
decomposition of a sparse symmetric matrix, which can be done efficiently using iterative methods. Finally, the experimental results on
several data sets are presented to show the effectiveness of our method.
Index Terms—Clustering, local learning, smoothness, regularization.
Ç
1 INTRODUCTION
CLUSTERING [18] is one of the most fundamental researchtopics in both data mining and machine learning
communities. It aims to divide data into groups of similarobjects, i.e., clusters. From a machine learning perspective,what clustering does is to learn the hidden patterns of thedata set in an unsupervised way, and these patterns areusually referred to as data concepts. From a practicalperspective, clustering plays a vital role in data miningapplications such as information retrieval, text mining, Webanalysis, marketing, computational biology, and manyothers [15].
Many clustering methods have been proposed till now,among which K-means [12] is one of the most popularalgorithms. K-means aims to minimize the sum of thesquared distance between the data points and theircorresponding cluster centers. However, it is well knownthat there are some problems existing in the K-meansalgorithm: 1) the predefined criterion is usually nonconvexwhich causes many local optimal solutions; 2) the iterativeprocedure (e.g., for optimizing the criterion usually makesthe final solutions heavily depend on the initializations. Inthe last decades, many methods [17], [38] have beenproposed to overcome the above problems.
Recently, another type of methods, which are based onclustering on data graphs, have aroused considerableinterest in the machine learning and data mining commu-nity. The basic idea behind these methods is to first modelthe whole data set as a weighted graph, in which the graphnodes represent the data points, and the weights on theedges correspond to the similarities between pairwisepoints. Then, the cluster assignments of the data set can
be achieved by optimizing some criteria defined on thegraph. For example, Spectral Clustering is one kind of themost representative graph-based clustering approaches,and it aims to optimize some cut value (e.g., NormalizedCut [27], Ratio Cut [7], Min-Max Cut [11]) defined on anundirected graph. After some relaxations, these criteria canusually be optimized via eigendecompositions and thesolutions are guaranteed to be global optimal. In this way,spectral clustering efficiently avoids the problems of thetraditional K-means method.
In this paper, we propose a novel clustering algorithmthat inherits the superiority of spectral clustering, i.e., thefinal cluster results can also be obtained by exploiting theeigenstructure of a symmetric matrix. However, unlikespectral clustering, which just enforces a smoothnessconstraint on the data labels over the whole data manifold[2], our method first constructs a regularized linear labelpredictor for each data point from its neighborhood, andthen combines the results of all these local label predictorswith a global label smoothness regularizer. So, we call ourmethod Clustering with Local and Global Regularization(CLGR). The idea of incorporating both local and globalinformation into label prediction is inspired by the recentworks on semisupervised learning [41], and our experi-mental evaluations on several real document data sets showthat CLGR performs better than many state-of-the-artclustering methods.
The rest of this paper is organized as follows: inSection 2, we will introduce our CLGR algorithm in detail.The experimental results on several data sets are pre-sented in Section 3, followed by the conclusions anddiscussions in Section 4.
2 RELATED WORKS
Before we go into the details of our CLGR algorithm, firstwe briefly review some works that are closely related tothis paper.
2.1 Local Learning Algorithms
The goal of learning is to choose from a given set offunctions the one which best approximates the supervisor’sresponses. In traditional supervised learning, we are given a
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 21, NO. 12, DECEMBER 2009 1665
. F. Wang and C. Zhang are with Tsinghua University, FIT 3-120,Beijing 100084, PR China.E-mail: [email protected], [email protected].
. T. Li is with the School of Computing and Information Sciences, FloridaInternational University, ECS 251, Miami, FL 33199.E-mail: [email protected].
Manuscript received 31 Jan. 2008; revised 28 Aug. 2008; accepted 14 Jan.2009; published online 22 Jan. 2009.Recommended for acceptance by V. Ganti.For information on obtaining reprints of this article, please send e-mail to:[email protected], and reference IEEECS Log Number TKDE-2008-01-0073.Digital Object Identifier no. 10.1109/TKDE.2009.40.
1041-4347/09/$25.00 � 2009 IEEE Published by the IEEE Computer Society
Authorized licensed use limited to: FLORIDA INTERNATIONAL UNIVERSITY. Downloaded on July 28,2010 at 23:04:31 UTC from IEEE Xplore. Restrictions apply.
set of training sample-label pairs fðxi; yiÞgni¼1, then acommon principle for selecting a good f is to minimizethe following Structural Risk
J ðfðxi;wÞÞ ¼Xn
i¼1Lðyi; fðxi;wÞÞ þ �kfk2
F ; ð1Þ
where Lð�; �Þ is the loss function (e.g., square loss inregularized least-square regression, hinge loss in SVM),kfkF is the induced norm of f in the functional space F(e.g., F can be a Reproducing Kernel Hilbert Space induced bysome kernel k if we construct a kernel machine).
Clearly, the above statement of the learning problemimplies that a unique function f�ðx;wÞ (which minimizes(1)) will be used for prediction over the whole input space.However, this may not be a good strategy: The functionalspace may not contain a good predictor for the full inputspace, but it would be much easier to find a function that iscapable of making good predictions on some specifiedregions of the input space. Fig. 1 shows an intuitive exampleof the necessity of local learning algorithms, in which it ishard to find a good function to discriminate the two classes,but in each local region, the data from different classes canbe easily classified.
Typically, a local learning algorithm aims to find anoptimal classifier over the local region centered at c0 byminimizing [6]
J f0¼ 1
n
Xni¼1
kðxi � c0; "0ÞJ ðf0ðxi;w0ÞÞ; ð2Þ
where J ð�Þ is some predefined loss, and kðxi � c0; "0Þ is akernel centered at c0 with width "0 which specializes thelocation and size of the local region. Fig. 2 shows us twoexamples of local kernels.
2.2 Spectral Clustering
In this section, we will give a brief review on spectralclustering [19], [7], [27], [23], [33]. Given a data setfx1;x2; . . . ;xng, the spectral clustering algorithms typicallyaim to recover the data partitions by solving the followingeigenvalue decomposition problem
SV ¼ �V; ð3Þ
where S is the n� n smoothness matrix (e.g., S can be the
combinatorial graph Laplacian as in [7] or the normalized graph
Laplacian as in [27]), V 2 IRn�C is the n� C relaxed cluster
indication matrix with C being the desired number of
clusters,1 which corresponds to the eigenvectors of S, and �
is the diagonal eigenvalue matrix of S. Traditionally, there
are two ways to get the final clustering results from V.
1. As in [23], we can treat the ith row of V as theembedding of xi in a C-dimensional space, andapply some traditional clustering methods likek-means to cluster these embeddings into C clusters.
2. Usually, the optimal V� is not unique (up to anarbitrary rotational matrix R2). Thus, we can pursuean optimal R that will rotate V� to a standard clusterindication matrix. The detailed algorithm can bereferred to in [36].
3 THE PROPOSED ALGORITHM
In this section, we will introduce our Clustering with Local
and Global Regularization (CLGR) algorithm in detail. First,
we discuss the basic principles behind our algorithm.
3.1 The Basic Principle
As its name suggests, the main idea behind our algorithm
is to construct a predictor via two steps: local regularization
and global regularization. In the local regularization step,
we borrow the idea from local learning introduced in
Section 2.1, which assumes that the the label of a data point
can be well estimated from the local region it belongs to. That
is, if we partition the input data space into M regions
fRmgMm¼1, with each region being characterized by a local
kernel kðcm; "mÞ, then we can minimize the cost function
J fm ¼1
n
Xni¼1
kðxi � cm; "mÞJ ðfmðxi;wmÞÞ; ð4Þ
to get the optimal fm for the local region Rm.One problem of pure local learning is that there might
not be enough data points in each local region for training
the local classifiers. Therefore, in the global regularization
1666 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 21, NO. 12, DECEMBER 2009
Fig. 1. An intuitive illustration of why local learning is needed. Thereare two classes of data points in the figure, which are severelyoverlapped. Hence, in this case, it is difficult to find a global classifierwhich can discriminate the two classes well. However, if we split theinput space into some local regions, then it is much easier to constructgood local classifiers.
Fig. 2. Two examples of the local kernel K. (a) A square kernel,which only selects the data points in a specific neighborhood with“diameter” "0. (b) A smooth kernel, which gives different weights forall data points according to their position with respect to c0, and itssize is controlled by "0.
1. The definition of an n�K cluster indication matrix G is that:Gij 2 f0; 1g, and Gij ¼ 1 if xi belongs to cluster j, otherwise Gij ¼ 0. In theeigenvalue decomposition problem (3), we cannot guarantee the entries inV to satisfy those constraints; therefore, we usually call V the relaxedcluster indication matrix [27].
2. Generally, the final optimization problem form of spectral clusteringcan be formulated as minimization of the trace trðVTSVÞ subject to VTV ¼I [27], [7], [38], the solution of which is according to the Ky Fan theorem [14].
Authorized licensed use limited to: FLORIDA INTERNATIONAL UNIVERSITY. Downloaded on July 28,2010 at 23:04:31 UTC from IEEE Xplore. Restrictions apply.
step, we apply a smoother to smooth the predicted datalabels with respect to the intrinsic data manifold. Mathe-matically, we want to obtain a classifier f by minimizing
J f ¼XMm¼1
J fm þ �kfk2I ; ð5Þ
where f is a “region-wise” function with fðxÞ ¼ fmðxÞ ifx 2 Rm, and kfkI measures the smoothness of f withrespect to the intrinsic data manifold, and � > 0 is a trade-off parameter.
In the following sections, we will introduce our clusteringwith local and global regularization (CLGR) algorithm in detail.
3.2 Local Regularization
In this section, we introduce a detailed algorithm to realizethe local regularization step. Let q ¼ ½q1; q2; . . . ; qn�T be theclass labels (or the cluster assignments) we want to solve.Then, in local regularization, we assume that qi can be wellestimated by the classifier constructed on the local regionthat xi belongs to. In our CLGR algorithm, we adopt thefollowing k-nearest neighbor local kernel
kðxi � cm; "mÞ
¼1; if xi is the k� nearest neighbor of cm
0; otherwise;
� ð6Þ
and we set the kernel centers fcmgMm¼1 to be the data pointsfxigni¼1. In this way, we, in fact, partition the whole inputdata space into n overlapping regions, and the ith region Ri
is just the k-nearest neighborhood of xi. The label of xi willbe estimated by fiðxiÞ. Then, according to (4), we need toconstruct a classifier fi on each Ri ð1 � i � nÞ. In thefollowing, we will first see how to construct these localclassifiers.
3.2.1 The Construction of Local Classifiers
In this paper, we consider two types of fi: Regularized LinearClassifier and Kernel Ridge Regression.
Regularized linear classifier. In this case, we assumethat fi takes the following linear form
fiðxÞ ¼ wTi ðx� xiÞ þ bi; ð7Þ
where ðwi; biÞ are the weight vector and bias of fi.3
J i ¼1
ni
Xxj2N i
kwTi ðxj � xiÞ þ bi � qjk2 þ �ikwik2; ð8Þ
where ni ¼ jN ij is the cardinality of N i, and qj is thecluster membership of xj. Define the locally centered datamatrix Xi ¼ ½xi1 � xi;xi2 � xi; . . . ;xini � xi�. Then, by tak-ing the partial derivatives of J i with respect to wi and bi,we can get
@J i
@wi¼ 2
ni
�Xi
�XTi wi þ bi1� qi
�þ ni�iwi
�; ð9Þ
@J i
@bi¼ 2
ni
�1TXT
i wi þ nibi � 1Tqi�; ð10Þ
where qi ¼ ½qi1 ; qi2 ; . . . ; qini �T with qik being the kth neighbor
of xi, and 1 ¼ ½1; 1; . . . ; 1�T 2 IRni�1 is a column vector of all
1s. Let @J i=@wi ¼ 0, then
wi ¼�XiX
Ti þ ni�iId
��1Xiðqi � bi1Þ; ð11Þ
where Id is the d� d identity matrix. Using the Woodburyformula [14], we can rewrite (11) as
wi ¼ Xi
�XTi Xi þ ni�iIni
��1ðqi � bi1Þ; ð12Þ
where Ini is the ni � ni identity matrix. Let @J i=@bi ¼ 0, then
nibi ¼ 1T�qi �XT
i wi
�: ð13Þ
Combining (12) and (13), we can get
nibi ¼ 1T�qi �XT
i Xi
�XTi Xi þ ni�iI
��1ðqi � bi1Þ�
¼)bi ¼1T � 1TXT
i Xi
�XTi Xi þ ni�iIni
�ni � 1TXT
i Xi
�XTi Xi þ �iIni
��11
qi:ð14Þ
Kernel ridge regression. The regularized linear classi-fier can only tackle the linear problems. In the following,we will apply the kernel trick [25] to extend it to thenonlinear cases. Specifically, let � : IRd ! IF be a non-linear mapping that maps the data in IRd to a high-dimensional (possibly infinite) feature space IF, such thatthe nonlinear problem in IRd can become a linear one inthe feature space IF. Then, for a testing point x 2 N i, wecan construct a local classifier fi which predicts the labelof x by
fiðxÞ ¼ ~wTi ð�ðxÞ � �ðxiÞÞ þ bi; ð15Þ
where ~wi 2 IF; bi 2 IR are the parameters of fi. Since ~wi liesin the span of f�ðxikÞg
nik¼1 [25], where xik represents the
kth neighbor of xi, then there exists a set of coefficientsf�ikg
nik¼1 such that
~w ¼Xnik¼1
�ik��xik�: ð16Þ
Combining (16) and (15), we can get
fiðxÞ ¼X
xj2N i
�ijh�ðxÞ � �ðxiÞ; �ðxjÞi þ bi; ð17Þ
where h�i is the inner product operator. Define a positive-definite kernel function K : X � X ! IR [25], then we cansolve ��i ¼ ½�i1; . . . ; �ini �
T and bi by minimizing
J i ¼1
ni
Xxj2N i
�� ~wTi ð�ðxjÞ � �ðxiÞÞ þ bi � qj
��2 þ �ik ~wik2
¼ 1
ni
Xxj2N i
Xxk2N i
�ikh�ðxjÞ � �ðxiÞ; �ðxkÞi þ bi � qj
����������
2
þ �ið��iÞTKi��i;
ð18Þ
where Ki 2 IRni�ni is the local kernel matrix for the datapoints in N i with its ði; jÞth entry Kiðk; jÞ ¼ Kðxik ;xijÞ, and
WANG ET AL.: CLUSTERING WITH LOCAL AND GLOBAL REGULARIZATION 1667
3. Note that we subtract xi from x because usually there are only afew data points in the neighborhood of xi; hence, the structural penaltyterm kwik will pull the weight vector wi toward some arbitrary origin.For isotropy reasons, we translate the origin of the input space to theneighborhood medoid xi.
Authorized licensed use limited to: FLORIDA INTERNATIONAL UNIVERSITY. Downloaded on July 28,2010 at 23:04:31 UTC from IEEE Xplore. Restrictions apply.
xij is the jth neighbor of xi, and �i > 0 is the trade-off
parameter. Defining the partial locally centered kernel
matrix �Ki 2 IRni�ni as
�Kiðj; kÞ ¼ h�ðxjÞ � �ðxiÞ; �ðxkÞi ¼ Kðxj;xkÞ �Kðxi;xkÞ;
we rewrite J l as
J i ¼1
nik �Ki��i þ bi1� qik2 þ �ið��iÞTKi��
i; ð19Þ
where qi ¼ ½qi1 ; qi2 ; . . . ; qini �T with qik being the kth neighbor
of xi, and 1 ¼ ½1; 1; . . . ; 1�T 2 IRni�1 is a column vector of all
1s. Then,
@J i
@�i¼ 2
ni
��KTi ð �Ki��i þ bi1� qiÞ þ ni�iKi��i
�; ð20Þ
@J i
@bi¼ 2
ni
�1T �Ki��i þ nibi � 1Tqi
�: ð21Þ
By setting @J i=@��i ¼ 0 and @J i=@bi ¼ 0, we can get
��i ¼�
�KTi
�Ki þ ni�iKi
��1 �KTi ðqi � bi1Þ; ð22Þ
bi ¼1T � 1T �Ki
��KTi
�Ki þ ni�iKi
��1 �KTi
ni � 1T �Ki
��KTi
�Ki þ ni�iKi
��1 �KTi 1
qi: ð23Þ
3.2.2 Combining the Local Regularized Predictors
After all the local predictors having been constructed, we
combine them together by minimizing the local prediction
loss
J l ¼Xni¼1
fiðxiÞ � qið Þ2; ð24Þ
in which we use fi to predict the label of xi in the
following ways:
. Regularized Linear Classifier. In this case, combining(7) and (14), we can get
fiðxiÞ ¼ bi ¼1T � 1TXT
i Xi
�XTi Xi þ ni�iIni
�ni � 1TXT
i Xi
�XTi Xi þ �iIni
��11
qi
¼ uTi qi;
ð25Þ
where
uTi ¼1T � 1TXT
i Xi
�XTi Xi þ ni�iIni
�ni � 1TXT
i Xi
�XTi Xi þ �iIni
��11:
. Kernel Ridge Regression. In this case, combining (17)and (23), we can get
fiðxiÞ ¼1T � 1T �Ki
��KTi
�Ki þ ni�iKi
��1 �KTi
ni � 1T �Ki
��KTi
�Ki þ ni�iKi
��1 �KTi 1
qi
¼ vTi qi;
ð26Þ
where vTi ¼1T�1T �Kið �KT
i�Kiþni�iKiÞ�1 �KT
i
ni�1T �Kið �KTi
�Kiþni�iKiÞ�1 �KTi 1
.
Therefore, we can predict the label of xi using the local
classifiers in the form of
fiðxiÞ ¼ ��Ti xi;
where ��i ¼ ui for regularized linear classifier, and ��i ¼ vifor kernel ridge regression. Then, by defining a square
matrix �� 2 IRn�n with its ði; jÞth entry
�ij ¼��iðjÞ; if xj 2 N i
0; otherwise;
�ð27Þ
where ��iðjÞ is the jth element of ��i. We can expand the local
loss in (24) as
J l ¼ k��q� qk2; ð28Þ
where q ¼ ½q1; q2; . . . ; qn�T 2 IRn�1 is the label vector of X .Till now we can write the criterion of clustering by
combining locally regularized linear label predictors J l in
an explicit mathematical form, and can minimize it directly
using some standard optimization techniques. However,
the results may not be good enough since we only exploit
the local information of the data set. In the next section, we
will introduce a global regularization criterion and combine
it with J l, which aims to find a good clustering result in a
local-global way.
3.3 Global Regularization
In data clustering, we usually require that the cluster
assignments of the data points should be sufficiently
smooth with respect to the underlying data manifold,
which implies 1) the nearby points tend to have the same
cluster assignments; 2) the points on the same structure
(e.g., submanifold or cluster) tend to have the same cluster
assignments [41].Without the loss of generality, we assume that the data
points reside (roughly) on a low-dimensional manifoldM,4
and q is the cluster assignment function defined on M.
Generally, a graph can be viewed as the discretized form of
a manifold [3]. We can model the data set as a weighted
undirected graph, as in spectral clustering [27], where the
graph nodes are just the data points, and the weights on the
edges represent the similarities between pairwise points.
Then, the following objective can be used to measure the
smoothness of q over the data graph [2], [40], [42]
J g ¼ qTLq ¼Xni¼1
ðqi � qjÞ2wij; ð29Þ
where q ¼ ½q1; q2; . . . ; qn�T with qi ¼ qðxiÞ;L is the graph
Laplacian with its ði; jÞth entry
Lij ¼di � wii; if i ¼ j�wij; if xi and xj are adjacent0; otherwise;
8<: ð30Þ
1668 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 21, NO. 12, DECEMBER 2009
4. We believe that the text data are also sampled from some low-dimensional manifold, since it is impossible for them to fill in the wholehigh-dimensional sample space. And it has been shown that the manifold-based methods can achieve good results on text classification tasks [41].
Authorized licensed use limited to: FLORIDA INTERNATIONAL UNIVERSITY. Downloaded on July 28,2010 at 23:04:31 UTC from IEEE Xplore. Restrictions apply.
where di ¼P
j wij is the degree of xi, and wij is the similarity
between xi and xj. If xi and xj are adjacent,5 wij is usually
computed in the following way:6
wij ¼ e�kxi�xjk2
2�2 ; ð31Þ
where � is a data-set-dependent parameter. It is proved that
under certain conditions, when the scale of the data set
becomes larger and larger, the graph Laplacian will
converge to the Laplace Beltrami operator on the data
manifold [4], [16].In summary, using (29) with exponential weights can
effectively measure the smoothness of the data assignments
with respect to the intrinsic data manifold. Thus, we adopt
it as a global regularizer to punish the smoothness of the
predicted data assignments.
3.4 Clustering with Local and Global Regularization
Combining the contents we have introduced in Sections 3.2
and 3.3, we can derive the clustering criterion as
minqJ ¼ J l þ �J g ¼ k��q� qk2 þ �qTLq
s:t: qi 2 f�1;þ1g;ð32Þ
where � is defined as in (27), and � is a regularization
parameter to trade off J l and J g. However, the discrete
constraint of pi makes the problem an NP-hard integer
programming problem. A natural way for making the
problem solvable is to remove the constraint and relax qito be continuous. Then, the objective that we aim to
minimize becomes
J ¼ k��q� qk2 þ �qTLq
¼ qT ð��� IÞT ð��� IÞqþ �qTLq
¼ qT ðð��� IÞT ð��� IÞ þ �LÞq;ð33Þ
and we further add a constraint qTq ¼ 1 to restrict the scale
of q. Then, our objective becomes to solve the following
optimization problem
minqJ ¼ qT ðð��� IÞT ð��� IÞ þ �LÞq
s:t: qTq ¼ 1:ð34Þ
We have the following theorem:
Theorem 1. The optimal relaxed cluster indicator can be achieved
by the eigenvector corresponding to the second smallest
eigenvalue of the matrix M ¼ ð��� IÞT ð��� IÞ þ �L.
Proof. Using the Ky Fan theorem [38], we can derive that the
optimal solution q corresponds to the smallest eigen-
vector of matrix M. However, we will prove in the
following that such an eigenvector does not contain any
discriminative information.First, for an arbitrary column vector a 2 IRn�1, we
have
aTMa ¼ aT ð��� IÞT ð��� IÞaþ �aTLa
¼ k��a� ak2 þ �Xni¼1
ai � aj� �2
wij � 0:
Therefore, M is positive semidefinite, i.e., its eigenvalues
are nonnegative. In the following, we show that the
smallest eigenvalue of M is 0, and its corresponding
eigenvector is an all-ones vector 1 ¼ ½1; 1; . . . ; 1�T 2 IRn�1:
1. L1 ¼ ðD�WÞ1 ¼ 0 � 1, where W 2 IRn�n is theweight matrix and D ¼ diagð
Pj WijÞ is the
degree matrix.2. ð��� IÞT ð��� IÞ1 ¼ 0 � 1, since according to the
definitions of �� in (27), we have the following.
a. For regularized linear classifier, assuming���i 2 IR1�n is the ith row of ��, then
��1 ¼ uTi 1ni
¼1Tni � 1TniX
Ti Xi
�XTi Xi þ ni�iIni
�ni � 1TniX
Ti Xi
�XTi Xi þ �iIni
��11ni
1ni ¼ 1:
b. For kernel ridge regression, assuming ���i 2IR1�n is the ith row of ��, then
��1 ¼ vTi 1ni
¼1Tni � 1Tni
�Kið �KTi
�Ki þ ni�iKiÞ�1 �KTi
ni � 1Tni�Kið �KT
i�Ki þ ni�iKiÞ�1 �KT
i 1ni1ni ¼ 1:
Then, ��1 ¼ 1 and ð��� IÞT ð��� IÞ1 ¼ 0.Therefore,
M1 ¼ ðð��� IÞT ð��� IÞ þ �LÞ1 ¼ 0 � 1;
where 0 is the smallest eigenvalue of M with 1 as its
corresponding eigenvector, which does not contain any
discriminative information. Therefore, we should use the
eigenvector corresponding to the second smallest eigen-
value for clustering as in [27]. tu
3.5 Extending CLGR for Multiclass Clustering
In the above, we have introduced the basic framework of
Clustering with Local and Global Regularization (CLGR) for the
two-class clustering problem, and we will extend it to
multiclass clustering in this section.First, we assume that all the data objects belong toC classes
indexed by L ¼ f1; 2; . . . ; Cg. qc is the classification function
for class c ð1 � c � CÞ, such that qcðxiÞ returns the possibility
that xi belongs to class c. Our goal is to obtain the value of
qcðxiÞ ð1 � c � C; 1 � i � nÞ, and the cluster assignment of xican be determined by fqcðxiÞgCc¼1 using some proper dis-
cretization methods that we have introduced in Section 2.2.Therefore, in this multiclass case, for each data
point xi ð1 � i � nÞ, we will construct C locally regular-
ized label predictors. The total local prediction error can
be constructed as
J l ¼XCc¼1
J cl ¼
XCc¼1
��qc � qck k2: ð35Þ
WANG ET AL.: CLUSTERING WITH LOCAL AND GLOBAL REGULARIZATION 1669
5. In this paper, we define xi and xj to be adjacent if xi 2 NðxjÞ orxj 2 NðxiÞ.
6. There are also some other ways for constructing wij, such as in [32].
Authorized licensed use limited to: FLORIDA INTERNATIONAL UNIVERSITY. Downloaded on July 28,2010 at 23:04:31 UTC from IEEE Xplore. Restrictions apply.
Similarly, we can construct the global smoothness regular-
izer in multiclass case as
J g ¼XCc¼1
Xni¼1
�qci � qcj
�2wij ¼
XCc¼1
ðqcÞTLqc: ð36Þ
Then, the criterion to be minimized for CLGR in multiclass
case becomes
J ¼ J l þ �J g
¼XCc¼1
½k��qc � qck2 þ �ðqcÞTLqc�
¼XCc¼1
½ðqcÞT ðð��� IÞT ð��� IÞ þ �LÞqc�
¼ trace½QT ðð��� IÞT ð��� IÞ þ �LÞQ�;
ð37Þ
where Q ¼ ½q1;q2; . . . ;qc� is an n� c matrix, and traceð�Þreturns the trace of a matrix. The same as in (34), we also
add the constraint that QTQ ¼ I to restrict the scale of Q.
Then, our optimization problem becomes
minQJ ¼ trace½QT ðð��� IÞT ð��� IÞ þ �LÞQ�
s:t: QTQ ¼ I:ð38Þ
From the Ky Fan theorem [38], we know the optimal
solution of the above problem is
Q� ¼�q�1;q
�2; . . . ;q�C
�R; ð39Þ
where q�k ð1 � k � CÞ is the eigenvector that corresponds to
the kth smallest eigenvalue of matrix ðP� IÞT ðP� IÞ þ �L,
and R is an arbitrary C � C matrix.Since the values of the entries in Q� are continuous, we
need to further discretize Q� to get the cluster assignments
of all the data points by making use of the methodsintroduced in Section 2.2. The detailed algorithm procedurefor CLGR is summarized in Table 1.
3.6 A Mixed Regularization Viewpoint
As we have mentioned in Section 3.1, the motivation behindour approach is that there may not be enough points in eachlocal region to train a good predictor; therefore, we apply aglobal smoothness regularizer to smooth those predictedlabels and make it compliant with the intrinsic datadistribution. However, by revisiting the expression of theloss function (37), we can find that it is, in fact, a mixed lossfunction composed of two parts. The first part
J 1 ¼ trðQT ðð��� IÞT ð��� IÞÞQÞ
contains some local information of the data set from theclassifier construction perspective, which is derived usinglocal learning algorithms. Purely minimizing J 1 withproper orthogonality constraints will result in the Local-
Learning-based Algorithm for Clustering (LLAC) like in [34].The second part
J 2 ¼ trðQTLQÞ
includes some distribution information of the data set fromthe geometry perspective. Purely minimizing J 2 withproper orthogonality constraints will result in some spectralclustering algorithms like ratio cut.
Therefore, what our clustering does with local and globalregularization algorithms is just to seek for a consistentpartitioning of the data set which has a good trade-offbetween J 1 and J 2 by �. Such an idea of mixed
regularization with different types of information haspreviously been explored in the semisupervised learningcommunity [41], [44]. In the following section, we will
1670 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 21, NO. 12, DECEMBER 2009
TABLE 1Clustering with Local and Global Regularization (CLGR)
Authorized licensed use limited to: FLORIDA INTERNATIONAL UNIVERSITY. Downloaded on July 28,2010 at 23:04:31 UTC from IEEE Xplore. Restrictions apply.
discuss the relationship between our method and sometraditional approaches.
3.7 Relationship with Some Traditional Approaches
As stated in the last section, the CLGR method proposed inthis paper can be viewed from a mixed regularization point,where the objective function is just a regularization termcomposed of a local learning regularizer and a Laplacianregularizer. Therefore, the traditional spectral clustering[27] (which just contains the Laplacian regularizer) andlocal learning clustering [34], [35] (which just contains thelocal learning regularizer7) can be viewed as special cases ofour method.
In fact, the idea of combining some “weak” regularizerto construct a “strong” regularizer has appeared in someprevious papers (e.g., [9], [1], [10]). However, what thesepapers mainly discuss is how to combine the “weak”regularizers (usually they assume that the final regularizeris a convex combination of those weak regularizers, and thefinal goal is how to efficiently obtain the combinationcoefficients), and they didn’t mention how to obtain moreinformative weak regularizers. The mixed regularizersproposed in this paper are complementary to each otherin some sense (the local regularizer makes use of the localproperties while the global regularizer consolidates thoselocal properties). Thus, it can be more effective in realworld applications.
4 EXPERIMENTS
In this section, experiments are conducted to empiricallycompare the clustering results of CLGR with some otherclustering algorithms on four data sets. First, we will brieflyintroduce the basic information of those data sets.
4.1 Data Sets
We use three categories of data sets in our experiments,which are selected to cover a wide range of properties.Specifically, those data sets include:
. UCI Data. We perform experiments on 15 UCI datasets.8 The sizes of those data sets vary from 24 to4,435, the dimensionality of the data points varyfrom 3 to 56, and the number of classes vary from 2to 22.
. Image Data. We perform experiments on six imagedata sets: ORL,9 Yale,10 YaleB [13], PIE [28], COIL[22], and USPS.11 All the images in the data setsexcept USPS are resized to 32� 32. For the PIE dataset, we only select five near-frontal poses (C05, C07,C09, C27, C29) and all the images under differentilluminations and expressions. So, there are170 images for each individual. For the YaleBdatabase, we simply use the cropped images whichcontain 38 individuals and around 64 near-frontalimages under different illuminations per individual.
For the USPS data set, we only use the images ofdigits 1-4.
. Text Data. We also perform experiments on three
text data sets: 20-newsgroup,12 WebKB,13 and
Cora [21]. For the 20-newsgroup data set, we
choose the topic rec which contains autos, motor-
cycles, baseball, and hockey from the version
20-news-18828. For the WebKB data set, we select
a subset consisting of about 6,000 Web pages from
computer science departments of four schools
(Cornell, Texas, Washington, and Wisconsin),
which can be classified into seven categories. For
the Cora data set, we select a subset containing the
research paper of subfield data structure (DS),hardware and architecture (HA), machine learning
(ML), and programming language (PL).
The basic information of those data sets are summarized in
Table 2.
4.2 Evaluation Metrics
In the experiments, we set the number of clusters equal to the
true number of classes C for all the clustering algorithms. To
evaluate their performance, we compare the clusters
generated by these algorithms with the true classes by
computing the following two performance measures.Clustering accuracy (Acc). The first performance mea-
sure is the Clustering Accuracy, which discovers the one-to-
one relationship between clusters and classes and measures
the extent to which each cluster contain data points from the
corresponding class. It sums up the whole matching degree
between all pair class-clusters. Clustering accuracy can be
computed as
Acc ¼ 1
Nmax
XCk;Lm
T ðCk;LmÞ !
; ð40Þ
where Ck denotes the kth cluster in the final results, and
Lm is the true mth class. T ðCk;LmÞ is the number of
entities belonging to class m that are assigned to cluster k.
Accuracy computes the maximum sum of T ðCk;LmÞ for
all pairs of clusters and classes, and these pairs have no
overlaps. The greater clustering accuracy means the better
clustering performance.Normalized mutual information (NMI). Another eva-
luation metric that we adopt here is the Normalized Mutual
Information [29], which is widely used for determining the
quality of clusters. For two random variables X and Y, the
NMI is defined as
NMIðX;YÞ ¼ IðX;YÞffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiHðXÞHðYÞ
p ; ð41Þ
where IðX;YÞ is the mutual information between X and Y,
while HðXÞ and HðYÞ are the entropies of X and Y,
respectively. One can see that NMIðX;XÞ ¼ 1, which is the
maximal possible value of NMI. Given a clustering result,
the NMI in (41) is estimated as
WANG ET AL.: CLUSTERING WITH LOCAL AND GLOBAL REGULARIZATION 1671
7. An issue that is worth mentioning here is that the local learningregularizer in this paper is not the same as in [34], [35].
8. http://mlearn.ics.uci.edu/MLRepository.html.9. http://www.uk.research.att.com/facedatabase.html.10. http://cvc.yale.edu/projects/yalefaces/yalefaces.html.11. Available at: http://www.kernel-machines.org/data.html.
12. http://people.csail.mit.edu/jrennie/20Newsgroups/.13. http://www.cs.cmu.edu/~WebKB/.
Authorized licensed use limited to: FLORIDA INTERNATIONAL UNIVERSITY. Downloaded on July 28,2010 at 23:04:31 UTC from IEEE Xplore. Restrictions apply.
NMI ¼PC
k¼1
PCm¼1 nk;mlog
� n�nk;mnknm
�ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi�PC
k¼1 nklognkn
��PCm¼1 nmlog
nmn
�q ; ð42Þ
where nk denotes the number of data contained in the
cluster Ck (1 � k � C), nm is the number of data belonging
to the mth class (1 � m � C), and nk;m denotes the number
of data that are in the intersection between the cluster Ckand the mth class. The value calculated in (42) is used as a
performance measure for the given clustering result. The
larger this value, the better the clustering performance.
4.3 Comparisons and Parameter Settings
We have compared the performances of our methods with
five other popular clustering approaches.
. K-Means. In our implementation, the cluster centersare randomly initialized and the performancesreported are averaged over 50 independent runs.
. Normalized Cut (Ncut) [27]. The implementation isthe same as in [36]. The similarities betweenpairwise points are computed using the standardGaussian kernel,14 and the width of the Gaussiankernel is set in an automatic way using the methodintroduced in [37].
. Local Linearly Regularized Clustering (LLRC). Clus-tering purely is based on minimizing the local lossin (24), and the local predictions are made by (25).The neighborhood size is searched from the gridf5; 10; 20; 50; 100g, all the regularization parametersf�ig of the local regularized classifiers are set tobe the same and are searched from the gridf0:01; 0:1; 1; 10; 100g, and the final cluster labels areachieved by the discretization method in [36].
. Local-Kernel-Regularized Clustering (LKRC). Clusteringis purely based on minimizing the local loss in (24),and the local predictions are made by (26). Theneighborhood size is searched from the gridf5; 10; 20; 50; 100g, all the regularization parametersf�ig of the local regularized classifiers are set to be thesame and are searched from the grid f0:01; 0:1;1; 10; 100g, the width of the Gaussian kernel issearched from the grid f4�3�0; 4
�2�0; 4�1�0; �0; 4
1�0;42�0; 4
3�0g, where�0 is the mean distance between anytwo examples in the data set, and the final clusterlabels are achieved by the discretization method in[36]. Note that this method is very similar to the one in[34] except for the centralization in the construction ofthe local classifiers, which, as introduced earlier in afootnote, is needed for isotropic reasons.
. Regularized Clustering (RC) [5]. The implementation isbased on the description in [5, Section 6.1]. Thewidth of the Gaussian similarities used for con-structing graph Laplacian is set using an automaticmethod described in [37], the Gaussian kernel isadopted in RKHS with its width tuned from the gridf4�3�0; 4
�2�0; 4�1�0; �0; 4
1�0; 42�0; 4
3�0g, where �0 isthe mean distance between any two examples in thedata set, the ratio of the regularization parameters�I=�A is tuned from f0:01; 0:1; 1; 10; 100g, and thefinal cluster labels are also achieved by the dis-cretization method in [36].
For our proposed clustering with local and globalregularization methods (LCLGR denotes the approach usingregularized linear classifiers as its local predictors, andKCLGR represents the approach using kernel ridge regres-sion as its local predictors), the local regularization para-meters (i.e., f�ig are set to be the same and are tuned fromf0:01; 0:1; 1; 10; 100g), the size of the neighborhood is tunedfrom f5; 10; 20; 50; 100g, and the pairwise data similaritiesused for constructing the global smoothness regularizer arecomputed using standard Gaussian kernels with its widthdetermined by the method in [37]. For the parameter used totrade off the local and global loss, we change J in (32) to
1672 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 21, NO. 12, DECEMBER 2009
TABLE 2Descriptions of the Data Sets
14. For all the methods given below, which are involved in thecomputation of Gaussian kernels, we adopt the Gaussian kernel withEuclidean distance on the UCI and Image data sets, and the Gaussian kernelwith inner product distance on the Text data sets [42].
Authorized licensed use limited to: FLORIDA INTERNATIONAL UNIVERSITY. Downloaded on July 28,2010 at 23:04:31 UTC from IEEE Xplore. Restrictions apply.
J ¼ ð1� �ÞJ l þ �J g to better capture the relative impor-tance of J l and J g, and � is tuned from ½0; 0:2; 0:4; 0:6; 0:8; 1�.For KCLGR, the Gaussian kernel is adopted for constructingthe local kernel predictors with its width tuned from thegrid f4�3�0; 4
�2�0; 4�1�0; �0; 4
1�0; 42�0; 4
3�0g, where �0 is themean distance between any two examples in the data set.
4.4 Algorithm Performances
The performances of those clustering algorithms on the threetypes of data sets are reported in Tables 3, 4, 5, 6, 7 and 8, in
which the best two performances for each data set arehighlighted. From those tables, we can observe the following.
. In most of the cases, k-means clustering performpoorer than other sophisticated methods, since thedistributions of those high-dimensional data sets arecommonly much more complicated than mixtures ofspherical Gaussians.
. The Ncut algorithm has similar performances asthe LLRC algorithm has on UCI data sets, as thedistribution of those data sets are complex and
WANG ET AL.: CLUSTERING WITH LOCAL AND GLOBAL REGULARIZATION 1673
TABLE 3Clustering Accuracy Results on UCI Data Sets
TABLE 4Normalized Mutual Information Results on UCI Data Sets
Authorized licensed use limited to: FLORIDA INTERNATIONAL UNIVERSITY. Downloaded on July 28,2010 at 23:04:31 UTC from IEEE Xplore. Restrictions apply.
there is no apparent mode underlying those datasets. On the image data sets, the Ncut algorithmusually outperforms the LLRC algorithm sincethere are clear nonlinear underlying manifoldsbehind those data sets, which cannot be tackledwell by the linear methods like LLRC. On theother hand, on most of the text data sets, LLRCperforms better than Ncut because the regularizedlinear classifier has been shown to be veryeffective for text data sets [39].
. The KLRC algorithm usually outperforms LLRC sincethe local distributions of the data sets are nonlinear.
. The mixed-regularization algorithms, includingRC, LCLGR, and KCLGR usually perform betterthan the algorithms based on a single regularizer
(Ncut, LLRC, and KLRC), as they make use ofmore information contained in the data sets.
. As expected, the LCLGR and KCLGR usually per-form better than LLRC and KLRC because the datalabels are smoothed by the global smoother, whichmakes the data labels more compliant with theintrinsic data distribution.
4.5 Sensitivity to the Selection of Parameters
There are mainly three parameters in our algorithm: thelocal regularization parameters f�ig, which are assumed tobe the same in all local classifiers, the size of theneighborhood k, and the global trade-off parameter �.We have also conducted a set of experiments to test thesensitivity of our method to the selection of those
1674 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 21, NO. 12, DECEMBER 2009
TABLE 5Clustering Accuracy Results on Image Data Sets
TABLE 6Normalized Mutual Information Results on Image Data Sets
TABLE 7Clustering Accuracy Results on Text Data Sets
Authorized licensed use limited to: FLORIDA INTERNATIONAL UNIVERSITY. Downloaded on July 28,2010 at 23:04:31 UTC from IEEE Xplore. Restrictions apply.
parameters, and the results are shown in Figs. 3, 4, 5, 6, 7,and 8. From these figures, we can clearly see that:
. Generally a too small or too large k will lead topoor results. This is because when k is too small,then the data in each neighborhood would be toosparse such that the resulting local classifier maynot be accurate. When k is too large, the localproperties of the data set would be hidden and theresults will also be bad. Therefore, a relativelymedium k would give better results.
. A proper trade-off between the local and globalregularizer would lead to better results, and this isjust what our method does. This also proved
experimentally the reasonability and effectivenessof combining local and global regularizers.
5 CONCLUSIONS
In this paper, we derived a new clustering algorithm called
clustering with local and global regularization. Our method
preserves the merit of local learning algorithms and spectral
clustering. Our experiments show that the proposed algo-
rithm outperforms some of the state-of-the-art algorithms
on many benchmark data sets. In the future, we will focus
on the parameter selection and acceleration issues of the
CLGR algorithm.
WANG ET AL.: CLUSTERING WITH LOCAL AND GLOBAL REGULARIZATION 1675
Fig. 3. Parameter sensitivity testing results on the UCI Satimage data set. (a) Acc versus �i plot. The parameter settings for KCLGR: � ¼ 0:2; k ¼ 20,
for LCLGR: � ¼ 0:4; k ¼ 20. (b) Acc versus k plot. The parameter settings for KCLGR: � ¼ 0:2; �i ¼ 1, for LCLGR: � ¼ 0:4; �i ¼ 1. (c) Acc versus
� plot. The parameter settings for KCLGR: �i ¼ 1; k ¼ 20, for LCLGR: �i ¼ 1; k ¼ 20.
Fig. 4. Parameter sensitivity testing results on the UCI Iris data set. (a) Acc versus �i plot. The parameter settings for KCLGR: � ¼ 0:6; k ¼ 20, for
LCLGR: � ¼ 0:8; k ¼ 20. (b) Acc versus k plot. The parameter settings for KCLGR: �i ¼ 1; � ¼ 0:6, for LCLGR: �i ¼ 0:1; � ¼ 0:8. (c) Acc versus
� plot. The parameter settings for KCLGR: �i ¼ 1; k ¼ 20, for LCLGR: �i ¼ 0:1; k ¼ 20.
TABLE 8Normalized Mutual Information Results on Text Data Sets
Authorized licensed use limited to: FLORIDA INTERNATIONAL UNIVERSITY. Downloaded on July 28,2010 at 23:04:31 UTC from IEEE Xplore. Restrictions apply.
1676 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 21, NO. 12, DECEMBER 2009
Fig. 7. Parameter sensitivity testing results on the 20-Newsgroup text data set. (a) Acc versus �i plot. The parameter settings for KCLGR:
� ¼ 0:2; k ¼ 20, for LCLGR: � ¼ 0:4; k ¼ 20. (b) Acc versus k plot. The parameter settings for KCLGR: � ¼ 0:2; �i ¼ 10, for LCLGR: � ¼ 0:4; �i ¼ 10.
(c) Acc versus � plot. The parameter settings for KCLGR: �i ¼ 10; k ¼ 20, for LCLGR: �i ¼ 10; k ¼ 20.
Fig. 8. Parameter sensitivity testing results on the WebKB-Cornell text data set. (a) Acc versus �i plot. The parameter settings for KCLGR:
� ¼ 0:4; k ¼ 10, for LCLGR: � ¼ 0:4; k ¼ 10. (b) Acc versus k plot. The parameter settings for KCLGR: � ¼ 0:2; �i ¼ 1, for LCLGR: � ¼ 0:4; �i ¼ 0:1.
(c) Acc versus � plot. The parameter settings for KCLGR: �i ¼ 1; k ¼ 10, for LCLGR: �i ¼ 0:1; k ¼ 10.
Fig. 5. Parameter sensitivity testing results on the Yale face image data set. (a) Acc versus �i plot. The parameter settings for KCLGR:
� ¼ 0:8; k ¼ 20, for LCLGR: � ¼ 0:6; k ¼ 10. (b) Acc versus k plot. The parameter settings for KCLGR: � ¼ 0:2; �i ¼ 0:1, for LCLGR: � ¼ 0:4; �i ¼ 100.
(c) Acc versus � plot. The parameter settings for KCLGR: �i ¼ 0:1; k ¼ 20, for LCLGR: �i ¼ 100; k ¼ 10.
Fig. 6. Parameter sensitivity testing results on the ORL face image data set. (a) Acc versus �i plot. The parameter settings for KCLGR:
� ¼ 0:2; k ¼ 20, for LCLGR: � ¼ 0:8; k ¼ 10. (b) Acc versus k plot. The parameter settings for KCLGR: � ¼ 0:2; �i ¼ 100, for LCLGR: � ¼ 0:8; �i ¼ 10.
(c) Acc versus � plot. The parameter settings for KCLGR: �i ¼ 100; k ¼ 20, for LCLGR: �i ¼ 10; k ¼ 10.
Authorized licensed use limited to: FLORIDA INTERNATIONAL UNIVERSITY. Downloaded on July 28,2010 at 23:04:31 UTC from IEEE Xplore. Restrictions apply.
ACKNOWLEDGMENTS
The work of Fei Wang and Changshui Zhang is supported
by the China Natural Science Foundation under Grant Nos.
60835002, 60675009. The work of Tao Li is partially
supported by the US National Science Foundation CAREER
Award IIS-0546280, DMS-0844513, and CCF-0830659.
REFERENCES
[1] A. Argyriou, M. Herbster, and M. Pontil, “Combining GraphLaplacians for Semi-Supervised Learning,” Proc. Conf. NeuralInformation Processing Systems, 2005.
[2] M. Belkin and P. Niyogi, “Laplacian Eigenmaps for Dimension-ality Reduction and Data Representation,” Neural Computation,vol. 15, no. 6, pp. 1373-1396, 2003.
[3] M. Belkin and P. Niyogi, “Semi-Supervised Learning onRiemannian Manifolds,” Machine Learning, vol. 56, pp. 209-239, 2004.
[4] M. Belkin and P. Niyogi, “Towards a Theoretical Foundation forLaplacian-Based Manifold Methods,” Proc. 18th Conf. LearningTheory (COLT), 2005.
[5] M. Belkin, P. Niyogi, and V. Sindhwani, “Manifold Regulariza-tion: A Geometric Framework for Learning from Labeled andUnlabeled Examples,” J. Machine Learning Research, vol. 7,pp. 2399-2434, 2006.
[6] L. Bottou and V. Vapnik, “Local Learning Algorithms,” NeuralComputation, vol. 4, pp. 888-900, 1992.
[7] P.K. Chan, D.F. Schlag, and J.Y. Zien, “Spectral K-Way Ratio-CutPartitioning and Clustering,” IEEE Trans. Computer-Aided Design ofIntegrated Circuits and Systems, vol. 13, no. 9, pp. 1088-1096, Sept.1994.
[8] O. Chapelle, M. Chi, and A. Zien, “A Continuation Method forSemi-Supervised SVMs,” Proc. 23rd Int’l Conf. Machine Learning,pp. 185-192, 2006.
[9] J. Chen, Z. Zhao, J. Ye, and L. Huan, “Nonlinear AdaptiveDistance Metric Learning for Clustering,” Proc. 13th ACM SpecialInterest Group Conf. Knowledge Discovery and Data Mining(SIGKDD), pp. 123-132, 2007.
[10] G. Dai and D.-Y. Yeung, “Kernel Selection for Semi-SupervisedKernel Machines,” Proc. 24th Int’l Conf. Machine Learning (ICML’07), pp. 185-192, 2007.
[11] C. Ding, X. He, H. Zha, M. Gu, and H.D. Simon, “A Min-Max CutAlgorithm for Graph Partitioning and Data Clustering,” Proc. FirstInt’l Conf. Data Mining (ICDM ’01), pp. 107-114, 2001.
[12] R.O. Duda, P.E. Hart, and D.G. Stork, Pattern Classification. JohnWiley & Sons, 2001.
[13] A.S. Georghiades, P.N. Belhumeur, and D.J. Kriegman, “FromFew to Many: Illumination Cone Models for Face Recognitionunder Variable Lighting and Pose,” IEEE Trans. PatternAnalysis and Machine Intelligence, vol. 23, no. 6, pp. 643-660,June 2001.
[14] G.H. Golub and C.F. Van Loan, Matrix Computations, third ed. TheJohns Hopkins Univ. Press, 1996.
[15] J. Han and M. Kamber, Data Mining. Morgan Kaufmann, 2001.[16] M. Hein, J.Y. Audibert, and U. von Luxburg, “From Graphs to
Manifolds—Weak and Strong Pointwise Consistency of GraphLaplacians,” Proc. 18th Conf. Learning Theory (COLT ’05), pp. 470-485, 2005.
[17] J. He, M. Lan, C.-L. Tan, S.-Y. Sung, and H.-B. Low, “Initializationof Cluster Refinement Algorithms: A Review and ComparativeStudy,” Proc. Int’l Joint Conf. Neural Networks, 2004.
[18] A. Jain and R. Dubes, Algorithms for Clustering Data. Prentice-Hall,1988.
[19] B. Kernighan and S. Lin, “An Efficient Heuristic Procedure forPartitioning Graphs,” The Bell System Technical J., vol. 49, no. 2,pp. 291-307, 1970.
[20] L. Zelnik-Manor and P. Perona, “Self-Tuning Spectral Clustering,”Proc. Conf. Neural Information Processing Systems, 2005.
[21] A. McCallum, K. Nigam, J. Rennie, and K. Seymore, “Automatingthe Contruction of Internet Portals with Machine Learning,”Information Retrieval J., vol. 3, pp. 127-163, 2000.
[22] S.A. Nene, S.K. Nayar, and J. Murase, “Columbia ObjectImage Library (COIL-20),” Technical Report CUCS-005-96,Columbia Univ., Feb. 1996.
[23] A.Y. Ng, M.I. Jordan, and Y. Weiss, “On Spectral Clustering:Analysis and an Algorithm,” Proc. Conf. Neural InformationProcessing Systems, 2002.
[24] S.T. Roweis and L.K. Saul, “Noninear Dimensionality Reductionby Locally Linear Embedding,” Science, vol. 290, pp. 2323-2326,2000.
[25] B. Scholkopf and A. Smola, Learning with Kernels. The MIT Press,2002.
[26] J. Shawe-Taylor and N. Cristianini, Kernel Methods for PatternAnalysis. Cambridge Univ. Press, 2004.
[27] J. Shi and J. Malik, “Normalized Cuts and Image Segmentation,”IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 8,pp. 888-905, Aug. 2000.
[28] T. Sim, S. Baker, and M. Bsat, “The CMU Pose, Illumination, andExpression Database,” IEEE Trans. Pattern Analysis and MachineIntelligence, vol. 25, no. 12, pp. 1615-1618, Dec. 2003.
[29] A. Strehl and J. Ghosh, “Cluster Ensembles—A Knowledge ReuseFramework for Combining Multiple Partitions,” J. MachineLearning Research, vol. 3, pp. 583-617, 2002.
[30] V.N. Vapnik, The Nature of Statistical Learning Theory. Springer-Verlag, 1995.
[31] F. Wang, C. Zhang, and T. Li, “Regularized Clustering forDocuments,” Proc. 30th Ann. Int’l ACM SIGIR Conf. Research andDevelopment in Information Retrieval (SIGIR ’07), 2007.
[32] F. Wang and C. Zhang, “Label Propagation through LinearNeighborhoods,” Proc. 23rd Int’l Conf. Machine Learning, 2006.
[33] Y. Weiss, “Segmentation Using Eigenvectors: A Unifying View,”Proc. IEEE Int’l Conf. Computer Vision, pp. 975-982, 1999.
[34] M. Wu and B. Scholkopf, “A Local Learning Approach forClustering,” Proc. Conf. Neural Information Processing Systems,2006.
[35] M. Wu and B. Scholkopf, “Transductive Classification via LocalLearning Regularization,” Proc. 11th Int’l Conf. Artificial Intelligenceand Statistics (AISTATS ’07), pp. 628-635, 2007.
[36] S.X. Yu and J. Shi, “Multiclass Spectral Clustering,” Proc. Int’l Conf.Computer Vision, 2003.
[37] L. Zelnik-Manor and P. Perona, “Self-Tuning Spectral Clustering,”Proc. Conf. Neural Information Processing Systems, 2005.
[38] H. Zha, X. He, C. Ding, M. Gu, and H. Simon, “Spectral Relaxationfor K-Means Clustering,” Proc. Conf. Neural Information ProcessingSystems, 2001.
[39] T. Zhang and J.F. Oles, “Text Categorization Based on RegularizedLinear Classification Methods,” Information Retrieval, vol. 4, pp. 5-31, 2001.
[40] D. Zhou and B. Scholkopf, “Learning from Labeled and UnlabeledData Using Random Walks,” Proc. 26th DAGM Symp. PatternRecognition, pp. 237-244, 2004.
[41] D. Zhou, O. Bousquet, T.N. Lal, J. Weston, and B. Scholkopf,“Learning with Local and Global Consistency,” Proc. Conf. NeuralInformation Processing Systems, 2004.
[42] X. Zhu, Z. Ghahramani, and J. Lafferty, “Semi-SupervisedLearning Using Gaussian Fields and Harmonic Functions,” Proc.20th Int’l Conf. Machine Learning, 2003.
[43] X. Zhu, J. Lafferty, and Z. Ghahramani, “Semi-SupervisedLearning: From Gaussian Fields to Gaussian Process,” ComputerScience Technical Report CMU-CS-03-175, Carnegie Mellon Univ.,2003.
[44] X. Zhu and A. Goldberg, “Kernel Regression with OrderPreferences,” Proc. 22nd AAAI Conf. Artificial Intelligence (AAAI),2007.
Fei Wang received the BE degree from XidianUniversity in 2003, and the ME and PhD degreesfrom Tsinghua University in 2006 and 2008,respectively. His main research interests includemachine learning, data mining, and informationretrieval.
WANG ET AL.: CLUSTERING WITH LOCAL AND GLOBAL REGULARIZATION 1677
Authorized licensed use limited to: FLORIDA INTERNATIONAL UNIVERSITY. Downloaded on July 28,2010 at 23:04:31 UTC from IEEE Xplore. Restrictions apply.
Changshui Zhang is a professor in the Depart-ment of Automation, Tsinghua University. Hismain research interests include machine learn-ing. He is a member of the IEEE.
Tao Li is currently an assistant professor in theSchool of Computing and Information Sciences,Florida International University. His main re-search interests include machine learning anddata mining.
. For more information on this or any other computing topic,please visit our Digital Library at www.computer.org/publications/dlib.
1678 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 21, NO. 12, DECEMBER 2009
Authorized licensed use limited to: FLORIDA INTERNATIONAL UNIVERSITY. Downloaded on July 28,2010 at 23:04:31 UTC from IEEE Xplore. Restrictions apply.