
Clustering with Local and Global Regularization
Fei Wang, Changshui Zhang, Member, IEEE, and Tao Li

Abstract—Clustering is an old research topic in data mining and machine learning. Most of the traditional clustering methods can be categorized as local or global ones. In this paper, a novel clustering method that can explore both the local and global information in the data set is proposed. The method, Clustering with Local and Global Regularization (CLGR), aims to minimize a cost function that properly trades off the local and global costs. We show that such an optimization problem can be solved by the eigenvalue decomposition of a sparse symmetric matrix, which can be done efficiently using iterative methods. Finally, the experimental results on several data sets are presented to show the effectiveness of our method.

Index Terms—Clustering, local learning, smoothness, regularization.


1 INTRODUCTION

CLUSTERING [18] is one of the most fundamental research topics in both the data mining and machine learning communities. It aims to divide data into groups of similar objects, i.e., clusters. From a machine learning perspective, what clustering does is to learn the hidden patterns of the data set in an unsupervised way, and these patterns are usually referred to as data concepts. From a practical perspective, clustering plays a vital role in data mining applications such as information retrieval, text mining, Web analysis, marketing, computational biology, and many others [15].

Many clustering methods have been proposed to date, among which K-means [12] is one of the most popular algorithms. K-means aims to minimize the sum of the squared distances between the data points and their corresponding cluster centers. However, it is well known that the K-means algorithm suffers from two problems: 1) the predefined criterion is usually nonconvex, which causes many local optimal solutions; 2) the iterative procedure for optimizing the criterion usually makes the final solutions heavily dependent on the initialization. In the last few decades, many methods [17], [38] have been proposed to overcome these problems.

Recently, another type of method, based on clustering over data graphs, has aroused considerable interest in the machine learning and data mining community. The basic idea behind these methods is to first model the whole data set as a weighted graph, in which the graph nodes represent the data points and the weights on the edges correspond to the similarities between pairwise points. Then, the cluster assignments of the data set can be achieved by optimizing some criterion defined on the graph. For example, spectral clustering is one of the most representative graph-based clustering approaches; it aims to optimize some cut value (e.g., Normalized Cut [27], Ratio Cut [7], Min-Max Cut [11]) defined on an undirected graph. After some relaxations, these criteria can usually be optimized via eigendecompositions, and the solutions are guaranteed to be globally optimal. In this way, spectral clustering efficiently avoids the problems of the traditional K-means method.

In this paper, we propose a novel clustering algorithm that inherits the superiority of spectral clustering, i.e., the final cluster results can also be obtained by exploiting the eigenstructure of a symmetric matrix. However, unlike spectral clustering, which just enforces a smoothness constraint on the data labels over the whole data manifold [2], our method first constructs a regularized linear label predictor for each data point from its neighborhood, and then combines the results of all these local label predictors with a global label smoothness regularizer. We therefore call our method Clustering with Local and Global Regularization (CLGR). The idea of incorporating both local and global information into label prediction is inspired by recent work on semisupervised learning [41], and our experimental evaluations on several real document data sets show that CLGR performs better than many state-of-the-art clustering methods.

The rest of this paper is organized as follows: Section 2 briefly reviews related works; Section 3 introduces our CLGR algorithm in detail; the experimental results on several data sets are presented in Section 4, followed by the conclusions and discussions in Section 5.

2 RELATED WORKS

Before we go into the details of our CLGR algorithm, we first briefly review some works that are closely related to this paper.

2.1 Local Learning Algorithms

The goal of learning is to choose, from a given set of functions, the one which best approximates the supervisor's responses.


. F. Wang and C. Zhang are with Tsinghua University, FIT 3-120, Beijing 100084, PR China. E-mail: [email protected], [email protected].
. T. Li is with the School of Computing and Information Sciences, Florida International University, ECS 251, Miami, FL 33199. E-mail: [email protected].
Manuscript received 31 Jan. 2008; revised 28 Aug. 2008; accepted 14 Jan. 2009; published online 22 Jan. 2009. Recommended for acceptance by V. Ganti. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TKDE-2008-01-0073. Digital Object Identifier no. 10.1109/TKDE.2009.40.



In traditional supervised learning, we are given a set of training sample-label pairs $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$; a common principle for selecting a good $f$ is then to minimize the following structural risk:

$$\mathcal{J}(f(\mathbf{x}; \mathbf{w})) = \sum_{i=1}^{n} L(y_i, f(\mathbf{x}_i; \mathbf{w})) + \lambda \|f\|_{\mathcal{F}}^{2}, \qquad (1)$$

where $L(\cdot, \cdot)$ is the loss function (e.g., the square loss in regularized least-squares regression, or the hinge loss in SVMs), and $\|f\|_{\mathcal{F}}$ is the induced norm of $f$ in the functional space $\mathcal{F}$ (e.g., $\mathcal{F}$ can be a Reproducing Kernel Hilbert Space induced by some kernel $k$ if we construct a kernel machine).

Clearly, the above statement of the learning problem implies that a unique function $f^{*}(\mathbf{x}; \mathbf{w})$ (which minimizes (1)) will be used for prediction over the whole input space. However, this may not be a good strategy: the functional space may not contain a good predictor for the full input space, but it may be much easier to find a function that is capable of making good predictions on some specified regions of the input space. Fig. 1 shows an intuitive example of the necessity of local learning algorithms, in which it is hard to find a good global function to discriminate the two classes, but in each local region the data from different classes can be easily classified.

Typically, a local learning algorithm aims to find an optimal classifier over the local region centered at $\mathbf{c}_0$ by minimizing [6]

$$\mathcal{J}_{f_0} = \frac{1}{n}\sum_{i=1}^{n} k(\mathbf{x}_i - \mathbf{c}_0, \varepsilon_0)\,\mathcal{J}(f_0(\mathbf{x}_i; \mathbf{w}_0)), \qquad (2)$$

where $\mathcal{J}(\cdot)$ is some predefined loss, and $k(\mathbf{x}_i - \mathbf{c}_0, \varepsilon_0)$ is a kernel centered at $\mathbf{c}_0$ with width $\varepsilon_0$, which specifies the location and size of the local region. Fig. 2 shows two examples of local kernels.
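As a small illustration (ours, not from the paper), the two kernel shapes of Fig. 2 and the weighted local loss (2) can be written down directly; the function names and the Gaussian form of the smooth kernel are illustrative choices.

```python
import numpy as np

def square_kernel(x, c0, eps0):
    """Hard local kernel: weight 1 inside a ball of 'diameter' eps0 around c0, else 0."""
    return 1.0 if np.linalg.norm(x - c0) <= eps0 / 2.0 else 0.0

def smooth_kernel(x, c0, eps0):
    """Smooth local kernel: weight decays with distance from c0, scale set by eps0."""
    return float(np.exp(-np.linalg.norm(x - c0) ** 2 / (2.0 * eps0 ** 2)))

def local_loss(X, losses, c0, eps0, kernel=smooth_kernel):
    """Weighted local empirical loss as in (2): kernel-weighted average of per-point losses."""
    w = np.array([kernel(x, c0, eps0) for x in X])
    return float(np.mean(w * losses))
```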

2.2 Spectral Clustering

In this section, we give a brief review of spectral clustering [19], [7], [27], [23], [33]. Given a data set $\{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\}$, spectral clustering algorithms typically aim to recover the data partitions by solving the following eigenvalue decomposition problem:

$$\mathbf{S}\mathbf{V} = \mathbf{V}\boldsymbol{\Lambda}, \qquad (3)$$

where $\mathbf{S}$ is the $n \times n$ smoothness matrix (e.g., $\mathbf{S}$ can be the combinatorial graph Laplacian as in [7] or the normalized graph Laplacian as in [27]), $\mathbf{V} \in \mathbb{R}^{n \times C}$ is the relaxed cluster indication matrix with $C$ being the desired number of clusters,¹ whose columns correspond to eigenvectors of $\mathbf{S}$, and $\boldsymbol{\Lambda}$ is the diagonal matrix of the corresponding eigenvalues of $\mathbf{S}$. Traditionally, there are two ways to get the final clustering results from $\mathbf{V}$:

1. As in [23], we can treat the $i$th row of $\mathbf{V}$ as the embedding of $\mathbf{x}_i$ in a $C$-dimensional space, and apply some traditional clustering method such as k-means to cluster these embeddings into $C$ clusters.
2. Usually, the optimal $\mathbf{V}^*$ is not unique (it is determined only up to an arbitrary rotation matrix $\mathbf{R}$²). Thus, we can pursue an optimal $\mathbf{R}$ that rotates $\mathbf{V}^*$ into a standard cluster indication matrix. The detailed algorithm can be found in [36].
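For concreteness, here is a minimal sketch (ours, not the implementation used in the paper) of this pipeline, assuming Gaussian similarities on a fully connected graph, the combinatorial Laplacian as the smoothness matrix $\mathbf{S}$, and option 1 above (k-means on the rows of the relaxed indicator), with scikit-learn providing the k-means step.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def spectral_clustering(X, C, sigma=1.0):
    """Cluster the rows of X into C groups via the bottom eigenvectors of the graph Laplacian."""
    W = np.exp(-cdist(X, X, "sqeuclidean") / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)                      # no self-loops
    D = np.diag(W.sum(axis=1))
    L = D - W                                     # combinatorial graph Laplacian
    # Relaxed indicator V: eigenvectors of the C smallest eigenvalues of L.
    _, V = eigh(L, subset_by_index=[0, C - 1])
    # Option 1: treat row i of V as the embedding of x_i and run k-means.
    return KMeans(n_clusters=C, n_init=10).fit_predict(V)
```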

3 THE PROPOSED ALGORITHM

In this section, we will introduce our Clustering with Local and Global Regularization (CLGR) algorithm in detail. First, we discuss the basic principles behind our algorithm.

3.1 The Basic Principle

As its name suggests, the main idea behind our algorithm is to construct a predictor via two steps: local regularization and global regularization. In the local regularization step, we borrow the idea from local learning introduced in Section 2.1, which assumes that the label of a data point can be well estimated from the local region it belongs to. That is, if we partition the input data space into $M$ regions $\{R_m\}_{m=1}^{M}$, with each region being characterized by a local kernel $k(\mathbf{c}_m, \varepsilon_m)$, then we can minimize the cost function

$$\mathcal{J}_{f_m} = \frac{1}{n}\sum_{i=1}^{n} k(\mathbf{x}_i - \mathbf{c}_m, \varepsilon_m)\,\mathcal{J}(f_m(\mathbf{x}_i; \mathbf{w}_m)) \qquad (4)$$

to get the optimal $f_m$ for the local region $R_m$. One problem of pure local learning is that there might not be enough data points in each local region for training the local classifiers. Therefore, in the global regularization


Fig. 1. An intuitive illustration of why local learning is needed. There are two classes of data points in the figure, which are severely overlapped. Hence, in this case, it is difficult to find a global classifier which can discriminate the two classes well. However, if we split the input space into some local regions, then it is much easier to construct good local classifiers.

Fig. 2. Two examples of the local kernel $k$. (a) A square kernel, which only selects the data points in a specific neighborhood with "diameter" $\varepsilon_0$. (b) A smooth kernel, which gives different weights to all data points according to their position with respect to $\mathbf{c}_0$, and whose size is controlled by $\varepsilon_0$.

1. The definition of an $n \times K$ cluster indication matrix $\mathbf{G}$ is that $G_{ij} \in \{0, 1\}$, and $G_{ij} = 1$ if $\mathbf{x}_i$ belongs to cluster $j$, otherwise $G_{ij} = 0$. In the eigenvalue decomposition problem (3), we cannot guarantee that the entries of $\mathbf{V}$ satisfy these constraints; therefore, we usually call $\mathbf{V}$ the relaxed cluster indication matrix [27].

2. Generally, the final optimization problem of spectral clustering can be formulated as the minimization of the trace $\mathrm{tr}(\mathbf{V}^T\mathbf{S}\mathbf{V})$ subject to $\mathbf{V}^T\mathbf{V} = \mathbf{I}$ [27], [7], [38], the solution of which is given by the Ky Fan theorem [14].


step, we apply a smoother to smooth the predicted data labels with respect to the intrinsic data manifold. Mathematically, we want to obtain a classifier $f$ by minimizing

$$\mathcal{J}_f = \sum_{m=1}^{M} \mathcal{J}_{f_m} + \lambda\|f\|_I^2, \qquad (5)$$

where $f$ is a "region-wise" function with $f(\mathbf{x}) = f_m(\mathbf{x})$ if $\mathbf{x} \in R_m$, $\|f\|_I$ measures the smoothness of $f$ with respect to the intrinsic data manifold, and $\lambda > 0$ is a trade-off parameter.

In the following sections, we will introduce our Clustering with Local and Global Regularization (CLGR) algorithm in detail.

3.2 Local Regularization

In this section, we introduce a detailed algorithm to realize the local regularization step. Let $\mathbf{q} = [q_1, q_2, \ldots, q_n]^T$ be the class labels (or cluster assignments) we want to solve for. In local regularization, we assume that $q_i$ can be well estimated by the classifier constructed on the local region that $\mathbf{x}_i$ belongs to. In our CLGR algorithm, we adopt the following $k$-nearest-neighbor local kernel:

$$k(\mathbf{x}_i - \mathbf{c}_m, \varepsilon_m) = \begin{cases} 1, & \text{if } \mathbf{x}_i \text{ is among the } k \text{ nearest neighbors of } \mathbf{c}_m,\\ 0, & \text{otherwise,} \end{cases} \qquad (6)$$

and we set the kernel centers $\{\mathbf{c}_m\}_{m=1}^{M}$ to be the data points $\{\mathbf{x}_i\}_{i=1}^{n}$. In this way, we in fact partition the whole input data space into $n$ overlapping regions, and the $i$th region $R_i$ is just the $k$-nearest neighborhood of $\mathbf{x}_i$. The label of $\mathbf{x}_i$ will be estimated by $f_i(\mathbf{x}_i)$. Then, according to (4), we need to construct a classifier $f_i$ on each $R_i$ $(1 \leq i \leq n)$. In the following, we will first see how to construct these local classifiers.
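As a small illustration (ours), the overlapping regions $\{R_i\}$ can be precomputed once as $k$-nearest-neighbor index sets; row $i$ of the returned array plays the role of $\mathcal{N}_i$ (here excluding $\mathbf{x}_i$ itself, one possible convention).

```python
import numpy as np
from scipy.spatial.distance import cdist

def knn_neighborhoods(X, k):
    """Return an (n, k) array; row i holds the indices of the k nearest neighbors of x_i."""
    D = cdist(X, X)                    # pairwise Euclidean distances
    np.fill_diagonal(D, np.inf)        # exclude each point from its own neighborhood
    return np.argsort(D, axis=1)[:, :k]
```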

3.2.1 The Construction of Local Classifiers

In this paper, we consider two types of $f_i$: the regularized linear classifier and kernel ridge regression.

Regularized linear classifier. In this case, we assume that $f_i$ takes the following linear form:

$$f_i(\mathbf{x}) = \mathbf{w}_i^T(\mathbf{x} - \mathbf{x}_i) + b_i, \qquad (7)$$

where $(\mathbf{w}_i, b_i)$ are the weight vector and bias of $f_i$.³ The parameters of $f_i$ can be obtained by minimizing the following regularized local loss:

$$\mathcal{J}_i = \frac{1}{n_i}\sum_{\mathbf{x}_j \in \mathcal{N}_i} \big\|\mathbf{w}_i^T(\mathbf{x}_j - \mathbf{x}_i) + b_i - q_j\big\|^2 + \lambda_i\|\mathbf{w}_i\|^2, \qquad (8)$$

where $n_i = |\mathcal{N}_i|$ is the cardinality of $\mathcal{N}_i$, and $q_j$ is the cluster membership of $\mathbf{x}_j$. Define the locally centered data matrix $\mathbf{X}_i = [\mathbf{x}_{i_1} - \mathbf{x}_i, \mathbf{x}_{i_2} - \mathbf{x}_i, \ldots, \mathbf{x}_{i_{n_i}} - \mathbf{x}_i]$. Then, by taking the partial derivatives of $\mathcal{J}_i$ with respect to $\mathbf{w}_i$ and $b_i$, we get

$$\frac{\partial\mathcal{J}_i}{\partial\mathbf{w}_i} = \frac{2}{n_i}\Big[\mathbf{X}_i\big(\mathbf{X}_i^T\mathbf{w}_i + b_i\mathbf{1} - \mathbf{q}_i\big) + n_i\lambda_i\mathbf{w}_i\Big], \qquad (9)$$

$$\frac{\partial\mathcal{J}_i}{\partial b_i} = \frac{2}{n_i}\Big[\mathbf{1}^T\mathbf{X}_i^T\mathbf{w}_i + n_i b_i - \mathbf{1}^T\mathbf{q}_i\Big], \qquad (10)$$

where $\mathbf{q}_i = [q_{i_1}, q_{i_2}, \ldots, q_{i_{n_i}}]^T$, with $\mathbf{x}_{i_k}$ denoting the $k$th neighbor of $\mathbf{x}_i$, and $\mathbf{1} = [1, 1, \ldots, 1]^T \in \mathbb{R}^{n_i \times 1}$ is a column vector of all 1s.

Setting $\partial\mathcal{J}_i/\partial\mathbf{w}_i = 0$, we get

$$\mathbf{w}_i = \big(\mathbf{X}_i\mathbf{X}_i^T + n_i\lambda_i\mathbf{I}_d\big)^{-1}\mathbf{X}_i(\mathbf{q}_i - b_i\mathbf{1}), \qquad (11)$$

where $\mathbf{I}_d$ is the $d \times d$ identity matrix. Using the Woodbury formula [14], we can rewrite (11) as

$$\mathbf{w}_i = \mathbf{X}_i\big(\mathbf{X}_i^T\mathbf{X}_i + n_i\lambda_i\mathbf{I}_{n_i}\big)^{-1}(\mathbf{q}_i - b_i\mathbf{1}), \qquad (12)$$

where $\mathbf{I}_{n_i}$ is the $n_i \times n_i$ identity matrix. Setting $\partial\mathcal{J}_i/\partial b_i = 0$ gives

$$n_i b_i = \mathbf{1}^T\big(\mathbf{q}_i - \mathbf{X}_i^T\mathbf{w}_i\big). \qquad (13)$$

Combining (12) and (13), we get

$$n_i b_i = \mathbf{1}^T\Big[\mathbf{q}_i - \mathbf{X}_i^T\mathbf{X}_i\big(\mathbf{X}_i^T\mathbf{X}_i + n_i\lambda_i\mathbf{I}_{n_i}\big)^{-1}(\mathbf{q}_i - b_i\mathbf{1})\Big]$$

$$\Longrightarrow\; b_i = \frac{\Big[\mathbf{1}^T - \mathbf{1}^T\mathbf{X}_i^T\mathbf{X}_i\big(\mathbf{X}_i^T\mathbf{X}_i + n_i\lambda_i\mathbf{I}_{n_i}\big)^{-1}\Big]\mathbf{q}_i}{n_i - \mathbf{1}^T\mathbf{X}_i^T\mathbf{X}_i\big(\mathbf{X}_i^T\mathbf{X}_i + n_i\lambda_i\mathbf{I}_{n_i}\big)^{-1}\mathbf{1}}. \qquad (14)$$

Kernel ridge regression. The regularized linear classifier can only tackle linear problems. In the following, we apply the kernel trick [25] to extend it to the nonlinear case. Specifically, let $\phi: \mathbb{R}^d \rightarrow \mathbb{F}$ be a nonlinear mapping that maps the data in $\mathbb{R}^d$ to a high-dimensional (possibly infinite-dimensional) feature space $\mathbb{F}$, such that the nonlinear problem in $\mathbb{R}^d$ becomes a linear one in the feature space $\mathbb{F}$. Then, for a testing point $\mathbf{x} \in \mathcal{N}_i$, we can construct a local classifier $f_i$ which predicts the label of $\mathbf{x}$ by

$$f_i(\mathbf{x}) = \tilde{\mathbf{w}}_i^T\big(\phi(\mathbf{x}) - \phi(\mathbf{x}_i)\big) + b_i, \qquad (15)$$

where $\tilde{\mathbf{w}}_i \in \mathbb{F}$ and $b_i \in \mathbb{R}$ are the parameters of $f_i$. Since $\tilde{\mathbf{w}}_i$ lies in the span of $\{\phi(\mathbf{x}_{i_k})\}_{k=1}^{n_i}$ [25], where $\mathbf{x}_{i_k}$ represents the $k$th neighbor of $\mathbf{x}_i$, there exists a set of coefficients $\{\alpha_{i_k}\}_{k=1}^{n_i}$ such that

$$\tilde{\mathbf{w}}_i = \sum_{k=1}^{n_i}\alpha_{i_k}\phi(\mathbf{x}_{i_k}). \qquad (16)$$

Combining (16) and (15), we get

$$f_i(\mathbf{x}) = \sum_{\mathbf{x}_j \in \mathcal{N}_i}\alpha_{ij}\big\langle\phi(\mathbf{x}) - \phi(\mathbf{x}_i), \phi(\mathbf{x}_j)\big\rangle + b_i, \qquad (17)$$

where $\langle\cdot,\cdot\rangle$ is the inner product operator. Define a positive-definite kernel function $K: \mathcal{X}\times\mathcal{X} \rightarrow \mathbb{R}$ [25]; then we can solve for $\boldsymbol{\alpha}_i = [\alpha_{i_1}, \ldots, \alpha_{i_{n_i}}]^T$ and $b_i$ by minimizing

$$\begin{aligned}\mathcal{J}_i &= \frac{1}{n_i}\sum_{\mathbf{x}_j\in\mathcal{N}_i}\Big\|\tilde{\mathbf{w}}_i^T\big(\phi(\mathbf{x}_j)-\phi(\mathbf{x}_i)\big) + b_i - q_j\Big\|^2 + \lambda_i\|\tilde{\mathbf{w}}_i\|^2\\ &= \frac{1}{n_i}\sum_{\mathbf{x}_j\in\mathcal{N}_i}\bigg\|\sum_{\mathbf{x}_k\in\mathcal{N}_i}\alpha_{i_k}\big\langle\phi(\mathbf{x}_j)-\phi(\mathbf{x}_i), \phi(\mathbf{x}_k)\big\rangle + b_i - q_j\bigg\|^2 + \lambda_i\boldsymbol{\alpha}_i^T\mathbf{K}_i\boldsymbol{\alpha}_i,\end{aligned} \qquad (18)$$

where $\mathbf{K}_i \in \mathbb{R}^{n_i\times n_i}$ is the local kernel matrix for the data points in $\mathcal{N}_i$, with its $(k, j)$th entry $\mathbf{K}_i(k, j) = K(\mathbf{x}_{i_k}, \mathbf{x}_{i_j})$, and


3. Note that we subtract $\mathbf{x}_i$ from $\mathbf{x}$ because there are usually only a few data points in the neighborhood of $\mathbf{x}_i$; hence, the structural penalty term $\|\mathbf{w}_i\|$ would pull the weight vector $\mathbf{w}_i$ toward some arbitrary origin. For isotropy reasons, we translate the origin of the input space to the neighborhood medoid $\mathbf{x}_i$.


$\mathbf{x}_{i_j}$ is the $j$th neighbor of $\mathbf{x}_i$, and $\lambda_i > 0$ is the trade-off parameter. Defining the partial locally centered kernel matrix $\bar{\mathbf{K}}_i \in \mathbb{R}^{n_i\times n_i}$ as

$$\bar{\mathbf{K}}_i(j, k) = \big\langle\phi(\mathbf{x}_j) - \phi(\mathbf{x}_i), \phi(\mathbf{x}_k)\big\rangle = K(\mathbf{x}_j, \mathbf{x}_k) - K(\mathbf{x}_i, \mathbf{x}_k),$$

we can rewrite $\mathcal{J}_i$ as

$$\mathcal{J}_i = \frac{1}{n_i}\big\|\bar{\mathbf{K}}_i\boldsymbol{\alpha}_i + b_i\mathbf{1} - \mathbf{q}_i\big\|^2 + \lambda_i\boldsymbol{\alpha}_i^T\mathbf{K}_i\boldsymbol{\alpha}_i, \qquad (19)$$

where $\mathbf{q}_i = [q_{i_1}, q_{i_2}, \ldots, q_{i_{n_i}}]^T$, with $\mathbf{x}_{i_k}$ denoting the $k$th neighbor of $\mathbf{x}_i$, and $\mathbf{1} = [1, 1, \ldots, 1]^T \in \mathbb{R}^{n_i\times 1}$ is a column vector of all 1s. Then,

$$\frac{\partial\mathcal{J}_i}{\partial\boldsymbol{\alpha}_i} = \frac{2}{n_i}\Big[\bar{\mathbf{K}}_i^T\big(\bar{\mathbf{K}}_i\boldsymbol{\alpha}_i + b_i\mathbf{1} - \mathbf{q}_i\big) + n_i\lambda_i\mathbf{K}_i\boldsymbol{\alpha}_i\Big], \qquad (20)$$

$$\frac{\partial\mathcal{J}_i}{\partial b_i} = \frac{2}{n_i}\Big[\mathbf{1}^T\bar{\mathbf{K}}_i\boldsymbol{\alpha}_i + n_i b_i - \mathbf{1}^T\mathbf{q}_i\Big]. \qquad (21)$$

By setting $\partial\mathcal{J}_i/\partial\boldsymbol{\alpha}_i = 0$ and $\partial\mathcal{J}_i/\partial b_i = 0$, we get

$$\boldsymbol{\alpha}_i = \big(\bar{\mathbf{K}}_i^T\bar{\mathbf{K}}_i + n_i\lambda_i\mathbf{K}_i\big)^{-1}\bar{\mathbf{K}}_i^T(\mathbf{q}_i - b_i\mathbf{1}), \qquad (22)$$

$$b_i = \frac{\Big[\mathbf{1}^T - \mathbf{1}^T\bar{\mathbf{K}}_i\big(\bar{\mathbf{K}}_i^T\bar{\mathbf{K}}_i + n_i\lambda_i\mathbf{K}_i\big)^{-1}\bar{\mathbf{K}}_i^T\Big]\mathbf{q}_i}{n_i - \mathbf{1}^T\bar{\mathbf{K}}_i\big(\bar{\mathbf{K}}_i^T\bar{\mathbf{K}}_i + n_i\lambda_i\mathbf{K}_i\big)^{-1}\bar{\mathbf{K}}_i^T\mathbf{1}}. \qquad (23)$$

3.2.2 Combining the Local Regularized Predictors

After all the local predictors have been constructed, we combine them by minimizing the local prediction loss

$$\mathcal{J}_l = \sum_{i=1}^{n}\big(f_i(\mathbf{x}_i) - q_i\big)^2, \qquad (24)$$

in which we use $f_i$ to predict the label of $\mathbf{x}_i$ in the following ways:

. Regularized Linear Classifier. In this case, combining (7) and (14), we get

$$f_i(\mathbf{x}_i) = b_i = \frac{\Big[\mathbf{1}^T - \mathbf{1}^T\mathbf{X}_i^T\mathbf{X}_i\big(\mathbf{X}_i^T\mathbf{X}_i + n_i\lambda_i\mathbf{I}_{n_i}\big)^{-1}\Big]\mathbf{q}_i}{n_i - \mathbf{1}^T\mathbf{X}_i^T\mathbf{X}_i\big(\mathbf{X}_i^T\mathbf{X}_i + n_i\lambda_i\mathbf{I}_{n_i}\big)^{-1}\mathbf{1}} = \mathbf{u}_i^T\mathbf{q}_i, \qquad (25)$$

where

$$\mathbf{u}_i^T = \frac{\mathbf{1}^T - \mathbf{1}^T\mathbf{X}_i^T\mathbf{X}_i\big(\mathbf{X}_i^T\mathbf{X}_i + n_i\lambda_i\mathbf{I}_{n_i}\big)^{-1}}{n_i - \mathbf{1}^T\mathbf{X}_i^T\mathbf{X}_i\big(\mathbf{X}_i^T\mathbf{X}_i + n_i\lambda_i\mathbf{I}_{n_i}\big)^{-1}\mathbf{1}}.$$

. Kernel Ridge Regression. In this case, combining (17) and (23), we get

$$f_i(\mathbf{x}_i) = b_i = \frac{\Big[\mathbf{1}^T - \mathbf{1}^T\bar{\mathbf{K}}_i\big(\bar{\mathbf{K}}_i^T\bar{\mathbf{K}}_i + n_i\lambda_i\mathbf{K}_i\big)^{-1}\bar{\mathbf{K}}_i^T\Big]\mathbf{q}_i}{n_i - \mathbf{1}^T\bar{\mathbf{K}}_i\big(\bar{\mathbf{K}}_i^T\bar{\mathbf{K}}_i + n_i\lambda_i\mathbf{K}_i\big)^{-1}\bar{\mathbf{K}}_i^T\mathbf{1}} = \mathbf{v}_i^T\mathbf{q}_i, \qquad (26)$$

where

$$\mathbf{v}_i^T = \frac{\mathbf{1}^T - \mathbf{1}^T\bar{\mathbf{K}}_i\big(\bar{\mathbf{K}}_i^T\bar{\mathbf{K}}_i + n_i\lambda_i\mathbf{K}_i\big)^{-1}\bar{\mathbf{K}}_i^T}{n_i - \mathbf{1}^T\bar{\mathbf{K}}_i\big(\bar{\mathbf{K}}_i^T\bar{\mathbf{K}}_i + n_i\lambda_i\mathbf{K}_i\big)^{-1}\bar{\mathbf{K}}_i^T\mathbf{1}}.$$

Therefore, we can predict the label of $\mathbf{x}_i$ using the local classifiers in the form

$$f_i(\mathbf{x}_i) = \boldsymbol{\beta}_i^T\mathbf{q}_i,$$

where $\boldsymbol{\beta}_i = \mathbf{u}_i$ for the regularized linear classifier and $\boldsymbol{\beta}_i = \mathbf{v}_i$ for kernel ridge regression. Then, defining a square matrix $\boldsymbol{\alpha} \in \mathbb{R}^{n\times n}$ with its $(i, j)$th entry

$$\alpha_{ij} = \begin{cases}\boldsymbol{\beta}_i(j), & \text{if } \mathbf{x}_j \in \mathcal{N}_i,\\ 0, & \text{otherwise,}\end{cases} \qquad (27)$$

where $\boldsymbol{\beta}_i(j)$ is the element of $\boldsymbol{\beta}_i$ corresponding to $\mathbf{x}_j$, we can expand the local loss in (24) as

$$\mathcal{J}_l = \|\boldsymbol{\alpha}\mathbf{q} - \mathbf{q}\|^2, \qquad (28)$$

where $\mathbf{q} = [q_1, q_2, \ldots, q_n]^T \in \mathbb{R}^{n\times 1}$ is the label vector of $\mathcal{X}$.
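Putting the pieces together, the following sketch (ours, using the regularized linear classifier, i.e., the vectors $\mathbf{u}_i$ of (25), and SciPy sparse storage for $\boldsymbol{\alpha}$) assembles $\boldsymbol{\alpha}$ as in (27) and evaluates $\mathcal{J}_l$ as in (28); the helper names are ours.

```python
import numpy as np
from scipy.sparse import lil_matrix
from scipy.spatial.distance import cdist

def build_alpha(X, k=10, lam=1.0):
    """Sparse n x n matrix alpha of (27): row i holds u_i on the k-NN of x_i (linear case)."""
    n = X.shape[0]
    D = cdist(X, X)
    np.fill_diagonal(D, np.inf)
    alpha = lil_matrix((n, n))
    for i in range(n):
        nbrs = np.argsort(D[i])[:k]                      # N_i: indices of the k nearest neighbors
        Xi = (X[nbrs] - X[i]).T                          # locally centered data matrix (d x k)
        A = Xi.T @ Xi @ np.linalg.inv(Xi.T @ Xi + k * lam * np.eye(k))
        one = np.ones(k)
        u_i = (one - one @ A) / (k - one @ A @ one)      # eq. (25)
        alpha[i, nbrs] = u_i
    return alpha.tocsr()

def local_prediction_loss(alpha, q):
    """J_l = ||alpha q - q||^2, eq. (28)."""
    r = alpha @ q - q
    return float(r @ r)
```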

At this point, we can write the criterion of clustering with the locally regularized label predictors, $\mathcal{J}_l$, in an explicit mathematical form, and we could minimize it directly using standard optimization techniques. However, the results may not be good enough, since we only exploit the local information of the data set. In the next section, we will introduce a global regularization criterion and combine it with $\mathcal{J}_l$, which aims to find a good clustering result in a local-global way.

3.3 Global Regularization

In data clustering, we usually require that the cluster assignments of the data points be sufficiently smooth with respect to the underlying data manifold, which implies that 1) nearby points tend to have the same cluster assignments, and 2) points on the same structure (e.g., submanifold or cluster) tend to have the same cluster assignments [41].

Without loss of generality, we assume that the data points reside (roughly) on a low-dimensional manifold $\mathcal{M}$,⁴ and that $q$ is the cluster assignment function defined on $\mathcal{M}$. Generally, a graph can be viewed as the discretized form of a manifold [3]. We can model the data set as a weighted undirected graph, as in spectral clustering [27], where the graph nodes are just the data points and the weights on the edges represent the similarities between pairwise points. Then, the following objective can be used to measure the smoothness of $q$ over the data graph [2], [40], [42]:

$$\mathcal{J}_g = \mathbf{q}^T\mathbf{L}\mathbf{q} = \frac{1}{2}\sum_{i,j}\big(q_i - q_j\big)^2 w_{ij}, \qquad (29)$$

where $\mathbf{q} = [q_1, q_2, \ldots, q_n]^T$ with $q_i = q(\mathbf{x}_i)$, and $\mathbf{L}$ is the graph Laplacian with its $(i, j)$th entry

$$L_{ij} = \begin{cases} d_i - w_{ii}, & \text{if } i = j,\\ -w_{ij}, & \text{if } \mathbf{x}_i \text{ and } \mathbf{x}_j \text{ are adjacent},\\ 0, & \text{otherwise,}\end{cases} \qquad (30)$$


4. We believe that text data are also sampled from some low-dimensional manifold, since it is impossible for them to fill the whole high-dimensional sample space. It has also been shown that manifold-based methods can achieve good results on text classification tasks [41].


where $d_i = \sum_j w_{ij}$ is the degree of $\mathbf{x}_i$, and $w_{ij}$ is the similarity between $\mathbf{x}_i$ and $\mathbf{x}_j$. If $\mathbf{x}_i$ and $\mathbf{x}_j$ are adjacent,⁵ $w_{ij}$ is usually computed in the following way:⁶

$$w_{ij} = e^{-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\sigma^2}}, \qquad (31)$$

where $\sigma$ is a data-set-dependent parameter. It has been proved that, under certain conditions, as the scale of the data set grows, the graph Laplacian converges to the Laplace-Beltrami operator on the data manifold [4], [16].

In summary, (29) with exponential weights effectively measures the smoothness of the data assignments with respect to the intrinsic data manifold. Thus, we adopt it as a global regularizer that penalizes non-smooth predicted data assignments.
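A brief sketch (ours) of this global regularizer: Gaussian weights (31) on a symmetric $k$-NN adjacency (the adjacency rule of footnote 5), the combinatorial Laplacian $\mathbf{L} = \mathbf{D} - \mathbf{W}$, and the smoothness value $\mathbf{q}^T\mathbf{L}\mathbf{q}$ of (29).

```python
import numpy as np
from scipy.spatial.distance import cdist

def graph_laplacian(X, k=10, sigma=1.0):
    """Combinatorial Laplacian L = D - W with Gaussian weights (31) on a symmetric k-NN graph."""
    D2 = cdist(X, X, "sqeuclidean")
    np.fill_diagonal(D2, np.inf)
    knn = np.argsort(D2, axis=1)[:, :k]
    adj = np.zeros_like(D2, dtype=bool)
    rows = np.repeat(np.arange(X.shape[0]), k)
    adj[rows, knn.ravel()] = True
    adj |= adj.T                                   # x_i ~ x_j if either is a k-NN of the other
    W = np.where(adj, np.exp(-D2 / (2.0 * sigma ** 2)), 0.0)
    return np.diag(W.sum(axis=1)) - W

def global_smoothness(L, q):
    """Smoothness of an assignment vector q over the graph: J_g = q^T L q, eq. (29)."""
    return float(q @ L @ q)
```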

3.4 Clustering with Local and Global Regularization

Combining the contents introduced in Sections 3.2 and 3.3, we can derive the clustering criterion

$$\min_{\mathbf{q}}\;\mathcal{J} = \mathcal{J}_l + \lambda\mathcal{J}_g = \|\boldsymbol{\alpha}\mathbf{q} - \mathbf{q}\|^2 + \lambda\mathbf{q}^T\mathbf{L}\mathbf{q} \quad \text{s.t. } q_i \in \{-1, +1\}, \qquad (32)$$

where $\boldsymbol{\alpha}$ is defined as in (27), and $\lambda$ is a regularization parameter that trades off $\mathcal{J}_l$ and $\mathcal{J}_g$. However, the discrete constraint on $q_i$ makes the problem an NP-hard integer programming problem. A natural way to make the problem solvable is to remove the constraint and relax $q_i$ to be continuous. The objective that we aim to minimize then becomes

$$\mathcal{J} = \|\boldsymbol{\alpha}\mathbf{q} - \mathbf{q}\|^2 + \lambda\mathbf{q}^T\mathbf{L}\mathbf{q} = \mathbf{q}^T(\boldsymbol{\alpha} - \mathbf{I})^T(\boldsymbol{\alpha} - \mathbf{I})\mathbf{q} + \lambda\mathbf{q}^T\mathbf{L}\mathbf{q} = \mathbf{q}^T\big((\boldsymbol{\alpha} - \mathbf{I})^T(\boldsymbol{\alpha} - \mathbf{I}) + \lambda\mathbf{L}\big)\mathbf{q}, \qquad (33)$$

and we further add the constraint $\mathbf{q}^T\mathbf{q} = 1$ to restrict the scale of $\mathbf{q}$. Our objective then becomes to solve the following optimization problem:

$$\min_{\mathbf{q}}\;\mathcal{J} = \mathbf{q}^T\big((\boldsymbol{\alpha} - \mathbf{I})^T(\boldsymbol{\alpha} - \mathbf{I}) + \lambda\mathbf{L}\big)\mathbf{q} \quad \text{s.t. } \mathbf{q}^T\mathbf{q} = 1. \qquad (34)$$

We have the following theorem:

Theorem 1. The optimal relaxed cluster indicator can be achieved by the eigenvector corresponding to the second smallest eigenvalue of the matrix $\mathbf{M} = (\boldsymbol{\alpha} - \mathbf{I})^T(\boldsymbol{\alpha} - \mathbf{I}) + \lambda\mathbf{L}$.

Proof. Using the Ky Fan theorem [38], we can derive that the optimal solution $\mathbf{q}$ corresponds to the eigenvector associated with the smallest eigenvalue of the matrix $\mathbf{M}$. However, we will prove in the following that this eigenvector does not contain any discriminative information.

First, for an arbitrary column vector $\mathbf{a} \in \mathbb{R}^{n\times 1}$, we have

$$\mathbf{a}^T\mathbf{M}\mathbf{a} = \mathbf{a}^T(\boldsymbol{\alpha}-\mathbf{I})^T(\boldsymbol{\alpha}-\mathbf{I})\mathbf{a} + \lambda\mathbf{a}^T\mathbf{L}\mathbf{a} = \|\boldsymbol{\alpha}\mathbf{a} - \mathbf{a}\|^2 + \frac{\lambda}{2}\sum_{i,j}(a_i - a_j)^2 w_{ij} \geq 0.$$

Therefore, $\mathbf{M}$ is positive semidefinite, i.e., its eigenvalues are nonnegative. In the following, we show that the smallest eigenvalue of $\mathbf{M}$ is 0, and that its corresponding eigenvector is the all-ones vector $\mathbf{1} = [1, 1, \ldots, 1]^T \in \mathbb{R}^{n\times 1}$:

1. $\mathbf{L}\mathbf{1} = (\mathbf{D} - \mathbf{W})\mathbf{1} = 0\cdot\mathbf{1}$, where $\mathbf{W} \in \mathbb{R}^{n\times n}$ is the weight matrix and $\mathbf{D} = \mathrm{diag}(\sum_j W_{ij})$ is the degree matrix.

2. $(\boldsymbol{\alpha}-\mathbf{I})^T(\boldsymbol{\alpha}-\mathbf{I})\mathbf{1} = 0\cdot\mathbf{1}$, since, according to the definition of $\boldsymbol{\alpha}$ in (27), we have the following.

a. For the regularized linear classifier, letting $\boldsymbol{\alpha}_i \in \mathbb{R}^{1\times n}$ denote the $i$th row of $\boldsymbol{\alpha}$,

$$\boldsymbol{\alpha}_i\mathbf{1} = \mathbf{u}_i^T\mathbf{1}_{n_i} = \frac{\Big[\mathbf{1}_{n_i}^T - \mathbf{1}_{n_i}^T\mathbf{X}_i^T\mathbf{X}_i\big(\mathbf{X}_i^T\mathbf{X}_i + n_i\lambda_i\mathbf{I}_{n_i}\big)^{-1}\Big]\mathbf{1}_{n_i}}{n_i - \mathbf{1}_{n_i}^T\mathbf{X}_i^T\mathbf{X}_i\big(\mathbf{X}_i^T\mathbf{X}_i + n_i\lambda_i\mathbf{I}_{n_i}\big)^{-1}\mathbf{1}_{n_i}} = 1.$$

b. For kernel ridge regression, letting $\boldsymbol{\alpha}_i \in \mathbb{R}^{1\times n}$ denote the $i$th row of $\boldsymbol{\alpha}$,

$$\boldsymbol{\alpha}_i\mathbf{1} = \mathbf{v}_i^T\mathbf{1}_{n_i} = \frac{\Big[\mathbf{1}_{n_i}^T - \mathbf{1}_{n_i}^T\bar{\mathbf{K}}_i\big(\bar{\mathbf{K}}_i^T\bar{\mathbf{K}}_i + n_i\lambda_i\mathbf{K}_i\big)^{-1}\bar{\mathbf{K}}_i^T\Big]\mathbf{1}_{n_i}}{n_i - \mathbf{1}_{n_i}^T\bar{\mathbf{K}}_i\big(\bar{\mathbf{K}}_i^T\bar{\mathbf{K}}_i + n_i\lambda_i\mathbf{K}_i\big)^{-1}\bar{\mathbf{K}}_i^T\mathbf{1}_{n_i}} = 1.$$

Then, $\boldsymbol{\alpha}\mathbf{1} = \mathbf{1}$ and $(\boldsymbol{\alpha}-\mathbf{I})^T(\boldsymbol{\alpha}-\mathbf{I})\mathbf{1} = \mathbf{0}$. Therefore,

$$\mathbf{M}\mathbf{1} = \big((\boldsymbol{\alpha}-\mathbf{I})^T(\boldsymbol{\alpha}-\mathbf{I}) + \lambda\mathbf{L}\big)\mathbf{1} = 0\cdot\mathbf{1},$$

so 0 is the smallest eigenvalue of $\mathbf{M}$, with $\mathbf{1}$ as its corresponding eigenvector, which does not contain any discriminative information. Therefore, we should use the eigenvector corresponding to the second smallest eigenvalue for clustering, as in [27]. □
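For the two-class case, a compact sketch (ours) of solving (34) as Theorem 1 prescribes, taking as input an $\boldsymbol{\alpha}$ and $\mathbf{L}$ such as those produced by the earlier `build_alpha` and `graph_laplacian` sketches; thresholding the second eigenvector at its median is one common choice, not prescribed by the paper.

```python
import numpy as np
from scipy.linalg import eigh

def clgr_two_class(alpha, L, lam=0.5):
    """Relaxed two-class CLGR (34): second-smallest eigenvector of M = (alpha-I)^T(alpha-I) + lam*L."""
    n = L.shape[0]
    A = (alpha.toarray() if hasattr(alpha, "toarray") else alpha) - np.eye(n)
    M = A.T @ A + lam * L
    _, vecs = eigh(M, subset_by_index=[0, 1])   # two smallest eigenpairs of the symmetric matrix M
    q = vecs[:, 1]                              # skip the uninformative all-ones eigenvector
    return (q > np.median(q)).astype(int)       # split at the median (one common thresholding choice)
```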

3.5 Extending CLGR for Multiclass Clustering

In the above, we have introduced the basic framework of Clustering with Local and Global Regularization (CLGR) for the two-class clustering problem; we now extend it to multiclass clustering.

First, we assume that all the data objects belong to $C$ classes indexed by $\mathcal{L} = \{1, 2, \ldots, C\}$. Let $q^c$ be the classification function for class $c$ $(1 \leq c \leq C)$, such that $q^c(\mathbf{x}_i)$ returns the confidence that $\mathbf{x}_i$ belongs to class $c$. Our goal is to obtain the values of $q^c(\mathbf{x}_i)$ $(1 \leq c \leq C,\; 1 \leq i \leq n)$, and the cluster assignment of $\mathbf{x}_i$ can then be determined from $\{q^c(\mathbf{x}_i)\}_{c=1}^{C}$ using the proper discretization methods introduced in Section 2.2.

Therefore, in the multiclass case, for each data point $\mathbf{x}_i$ $(1 \leq i \leq n)$, we construct $C$ locally regularized label predictors. The total local prediction error can be written as

$$\mathcal{J}_l = \sum_{c=1}^{C}\mathcal{J}_l^c = \sum_{c=1}^{C}\|\boldsymbol{\alpha}\mathbf{q}^c - \mathbf{q}^c\|^2. \qquad (35)$$


5. In this paper, we define $\mathbf{x}_i$ and $\mathbf{x}_j$ to be adjacent if $\mathbf{x}_i \in \mathcal{N}(\mathbf{x}_j)$ or $\mathbf{x}_j \in \mathcal{N}(\mathbf{x}_i)$.

6. There are also other ways to construct $w_{ij}$, such as in [32].


Similarly, we can construct the global smoothness regularizer in the multiclass case as

$$\mathcal{J}_g = \sum_{c=1}^{C}\frac{1}{2}\sum_{i,j}\big(q_i^c - q_j^c\big)^2 w_{ij} = \sum_{c=1}^{C}(\mathbf{q}^c)^T\mathbf{L}\mathbf{q}^c. \qquad (36)$$

Then, the criterion to be minimized for CLGR in the multiclass case becomes

$$\begin{aligned}\mathcal{J} &= \mathcal{J}_l + \lambda\mathcal{J}_g\\ &= \sum_{c=1}^{C}\Big[\|\boldsymbol{\alpha}\mathbf{q}^c - \mathbf{q}^c\|^2 + \lambda(\mathbf{q}^c)^T\mathbf{L}\mathbf{q}^c\Big]\\ &= \sum_{c=1}^{C}(\mathbf{q}^c)^T\big((\boldsymbol{\alpha} - \mathbf{I})^T(\boldsymbol{\alpha} - \mathbf{I}) + \lambda\mathbf{L}\big)\mathbf{q}^c\\ &= \mathrm{trace}\Big[\mathbf{Q}^T\big((\boldsymbol{\alpha} - \mathbf{I})^T(\boldsymbol{\alpha} - \mathbf{I}) + \lambda\mathbf{L}\big)\mathbf{Q}\Big],\end{aligned} \qquad (37)$$

where $\mathbf{Q} = [\mathbf{q}^1, \mathbf{q}^2, \ldots, \mathbf{q}^C]$ is an $n\times C$ matrix, and $\mathrm{trace}(\cdot)$ returns the trace of a matrix. As in (34), we also add the constraint $\mathbf{Q}^T\mathbf{Q} = \mathbf{I}$ to restrict the scale of $\mathbf{Q}$. Our optimization problem then becomes

$$\min_{\mathbf{Q}}\;\mathcal{J} = \mathrm{trace}\Big[\mathbf{Q}^T\big((\boldsymbol{\alpha} - \mathbf{I})^T(\boldsymbol{\alpha} - \mathbf{I}) + \lambda\mathbf{L}\big)\mathbf{Q}\Big] \quad \text{s.t. } \mathbf{Q}^T\mathbf{Q} = \mathbf{I}. \qquad (38)$$

From the Ky Fan theorem [38], we know the optimal solution of the above problem is

$$\mathbf{Q}^* = \big[\mathbf{q}_1^*, \mathbf{q}_2^*, \ldots, \mathbf{q}_C^*\big]\mathbf{R}, \qquad (39)$$

where $\mathbf{q}_k^*$ $(1 \leq k \leq C)$ is the eigenvector corresponding to the $k$th smallest eigenvalue of the matrix $(\boldsymbol{\alpha} - \mathbf{I})^T(\boldsymbol{\alpha} - \mathbf{I}) + \lambda\mathbf{L}$, and $\mathbf{R}$ is an arbitrary $C\times C$ orthogonal matrix.

Since the values of the entries in $\mathbf{Q}^*$ are continuous, we need to further discretize $\mathbf{Q}^*$ to get the cluster assignments of all the data points, making use of the methods introduced in Section 2.2. The detailed algorithm procedure for CLGR is summarized in Table 1.
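A minimal multiclass sketch (ours); note that it substitutes k-means on the rows of the relaxed indicator (option 1 of Section 2.2) for the rotation-based discretization of [36] that the paper uses, which is a common simplification.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def clgr_multiclass(alpha, L, C, lam=0.5):
    """Relaxed multiclass CLGR (38): C smallest eigenvectors of M, then a discretization step."""
    n = L.shape[0]
    A = (alpha.toarray() if hasattr(alpha, "toarray") else alpha) - np.eye(n)
    M = A.T @ A + lam * L                                   # (alpha - I)^T (alpha - I) + lam * L
    _, Q = eigh(M, subset_by_index=[0, C - 1])              # relaxed indicator Q*, up to rotation R
    # Discretize: k-means on the rows of Q* (option 1 of Section 2.2),
    # in place of the rotation-based scheme of [36] used in the paper.
    return KMeans(n_clusters=C, n_init=10).fit_predict(Q)
```

For example, `clgr_multiclass(build_alpha(X), graph_laplacian(X), C=3, lam=0.5)` would return one cluster index per data point under these sketches' conventions.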

3.6 A Mixed Regularization Viewpoint

As we have mentioned in Section 3.1, the motivation behind our approach is that there may not be enough points in each local region to train a good predictor; therefore, we apply a global smoothness regularizer to smooth the predicted labels and make them compliant with the intrinsic data distribution. However, by revisiting the expression of the loss function (37), we can see that it is, in fact, a mixed loss function composed of two parts. The first part,

$$\mathcal{J}_1 = \mathrm{tr}\big(\mathbf{Q}^T(\boldsymbol{\alpha} - \mathbf{I})^T(\boldsymbol{\alpha} - \mathbf{I})\mathbf{Q}\big),$$

contains local information about the data set from the classifier-construction perspective, derived using local learning algorithms. Purely minimizing $\mathcal{J}_1$ with proper orthogonality constraints results in a Local-Learning-based Algorithm for Clustering (LLAC), as in [34]. The second part,

$$\mathcal{J}_2 = \mathrm{tr}\big(\mathbf{Q}^T\mathbf{L}\mathbf{Q}\big),$$

includes distribution information about the data set from the geometry perspective. Purely minimizing $\mathcal{J}_2$ with proper orthogonality constraints results in spectral clustering algorithms such as ratio cut.

Therefore, what our clustering with local and global regularization algorithm does is to seek a consistent partitioning of the data set that achieves a good trade-off between $\mathcal{J}_1$ and $\mathcal{J}_2$ through $\lambda$. Such an idea of mixed regularization with different types of information has previously been explored in the semisupervised learning community [41], [44]. In the following section, we will


TABLE 1. Clustering with Local and Global Regularization (CLGR)


discuss the relationship between our method and some traditional approaches.

3.7 Relationship with Some Traditional Approaches

As stated in the last section, the CLGR method proposed in this paper can be viewed from a mixed regularization point of view, where the objective function is a regularization term composed of a local learning regularizer and a Laplacian regularizer. Therefore, traditional spectral clustering [27] (which contains only the Laplacian regularizer) and local learning clustering [34], [35] (which contains only the local learning regularizer⁷) can be viewed as special cases of our method.

In fact, the idea of combining several "weak" regularizers to construct a "strong" regularizer has appeared in some previous papers (e.g., [9], [1], [10]). However, what these papers mainly discuss is how to combine the "weak" regularizers (usually they assume that the final regularizer is a convex combination of the weak regularizers, and the goal is to obtain the combination coefficients efficiently); they do not address how to obtain more informative weak regularizers. The mixed regularizers proposed in this paper are complementary to each other in some sense (the local regularizer makes use of the local properties, while the global regularizer consolidates those local properties), and thus can be more effective in real-world applications.

4 EXPERIMENTS

In this section, experiments are conducted to empirically compare the clustering results of CLGR with those of several other clustering algorithms on a number of benchmark data sets. First, we briefly introduce the basic information of those data sets.

4.1 Data Sets

We use three categories of data sets in our experiments, which are selected to cover a wide range of properties. Specifically, those data sets include:

. UCI Data. We perform experiments on 15 UCI data sets.⁸ The sizes of those data sets vary from 24 to 4,435, the dimensionality of the data points varies from 3 to 56, and the number of classes varies from 2 to 22.

. Image Data. We perform experiments on six image data sets: ORL,⁹ Yale,¹⁰ YaleB [13], PIE [28], COIL [22], and USPS.¹¹ All the images in the data sets except USPS are resized to 32 × 32. For the PIE data set, we only select five near-frontal poses (C05, C07, C09, C27, C29) and all the images under different illuminations and expressions, so there are 170 images for each individual. For the YaleB database, we simply use the cropped images, which contain 38 individuals and around 64 near-frontal images under different illuminations per individual. For the USPS data set, we only use the images of digits 1-4.

. Text Data. We also perform experiments on three text data sets: 20-newsgroup,¹² WebKB,¹³ and Cora [21]. For the 20-newsgroup data set, we choose the topic rec, which contains autos, motorcycles, baseball, and hockey, from the version 20-news-18828. For the WebKB data set, we select a subset consisting of about 6,000 Web pages from computer science departments of four schools (Cornell, Texas, Washington, and Wisconsin), which can be classified into seven categories. For the Cora data set, we select a subset containing the research papers of the subfields data structure (DS), hardware and architecture (HA), machine learning (ML), and programming language (PL).

The basic information of those data sets is summarized in Table 2.

4.2 Evaluation Metrics

In the experiments, we set the number of clusters equal to the true number of classes $C$ for all the clustering algorithms. To evaluate their performance, we compare the clusters generated by these algorithms with the true classes by computing the following two performance measures.

Clustering accuracy (Acc). The first performance measure is the clustering accuracy, which discovers the one-to-one relationship between clusters and classes and measures the extent to which each cluster contains data points from the corresponding class. It sums up the matching degree over all cluster-class pairs. Clustering accuracy can be computed as

$$\mathrm{Acc} = \frac{1}{N}\max\bigg(\sum_{C_k, L_m} T(C_k, L_m)\bigg), \qquad (40)$$

where $N$ is the total number of data points, $C_k$ denotes the $k$th cluster in the final results, $L_m$ is the true $m$th class, and $T(C_k, L_m)$ is the number of entities belonging to class $m$ that are assigned to cluster $k$. Accuracy computes the maximum sum of $T(C_k, L_m)$ over all pairs of clusters and classes, where these pairs have no overlaps. Greater clustering accuracy means better clustering performance.

Normalized mutual information (NMI). The other evaluation metric that we adopt here is the normalized mutual information [29], which is widely used for determining the quality of clusters. For two random variables $\mathbf{X}$ and $\mathbf{Y}$, the NMI is defined as

$$\mathrm{NMI}(\mathbf{X}, \mathbf{Y}) = \frac{I(\mathbf{X}, \mathbf{Y})}{\sqrt{H(\mathbf{X})H(\mathbf{Y})}}, \qquad (41)$$

where $I(\mathbf{X}, \mathbf{Y})$ is the mutual information between $\mathbf{X}$ and $\mathbf{Y}$, while $H(\mathbf{X})$ and $H(\mathbf{Y})$ are the entropies of $\mathbf{X}$ and $\mathbf{Y}$, respectively. One can see that $\mathrm{NMI}(\mathbf{X}, \mathbf{X}) = 1$, which is the maximal possible value of NMI. Given a clustering result, the NMI in (41) is estimated as


7. An issue worth mentioning here is that the local learning regularizer in this paper is not the same as in [34], [35].

8. http://mlearn.ics.uci.edu/MLRepository.html.
9. http://www.uk.research.att.com/facedatabase.html.
10. http://cvc.yale.edu/projects/yalefaces/yalefaces.html.
11. Available at: http://www.kernel-machines.org/data.html.

12. http://people.csail.mit.edu/jrennie/20Newsgroups/.
13. http://www.cs.cmu.edu/~WebKB/.


$$\mathrm{NMI} = \frac{\sum_{k=1}^{C}\sum_{m=1}^{C} n_{k,m}\log\Big(\frac{n\cdot n_{k,m}}{n_k n_m}\Big)}{\sqrt{\Big(\sum_{k=1}^{C} n_k\log\frac{n_k}{n}\Big)\Big(\sum_{m=1}^{C} n_m\log\frac{n_m}{n}\Big)}}, \qquad (42)$$

where $n_k$ denotes the number of data points contained in the cluster $C_k$ $(1 \leq k \leq C)$, $n_m$ is the number of data points belonging to the $m$th class $(1 \leq m \leq C)$, and $n_{k,m}$ denotes the number of data points in the intersection of the cluster $C_k$ and the $m$th class. The value calculated in (42) is used as a performance measure for the given clustering result: the larger this value, the better the clustering performance.
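Both metrics can be computed directly; the following sketch (ours) uses SciPy's Hungarian solver to find the optimal one-to-one cluster-class matching required by (40), and the plug-in estimate (42) for NMI, with $0\log 0$ treated as 0.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def contingency(labels_true, labels_pred):
    """n_{k,m}: number of points in predicted cluster k and true class m."""
    clusters, classes = np.unique(labels_pred), np.unique(labels_true)
    T = np.zeros((len(clusters), len(classes)), dtype=float)
    for k, ck in enumerate(clusters):
        for m, lm in enumerate(classes):
            T[k, m] = np.sum((labels_pred == ck) & (labels_true == lm))
    return T

def clustering_accuracy(labels_true, labels_pred):
    """Acc of (40): best one-to-one cluster-to-class matching via the Hungarian algorithm."""
    T = contingency(labels_true, labels_pred)
    row, col = linear_sum_assignment(-T)           # maximize the matched counts
    return T[row, col].sum() / len(labels_true)

def nmi(labels_true, labels_pred):
    """NMI of (42), treating 0*log(0) as 0."""
    T = contingency(labels_true, labels_pred)
    n = T.sum()
    nk, nm = T.sum(axis=1), T.sum(axis=0)
    with np.errstate(divide="ignore", invalid="ignore"):
        num = np.nansum(T * np.log(n * T / np.outer(nk, nm)))
        hk = -np.sum(nk * np.log(nk / n))
        hm = -np.sum(nm * np.log(nm / n))
    return num / np.sqrt(hk * hm)
```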

4.3 Comparisons and Parameter Settings

We have compared the performances of our methods with five other popular clustering approaches.

. K-Means. In our implementation, the cluster centers are randomly initialized, and the performances reported are averaged over 50 independent runs.

. Normalized Cut (Ncut) [27]. The implementation is the same as in [36]. The similarities between pairwise points are computed using the standard Gaussian kernel,¹⁴ and the width of the Gaussian kernel is set automatically using the method introduced in [37].

. Local Linearly Regularized Clustering (LLRC). Clustering is purely based on minimizing the local loss in (24), and the local predictions are made by (25). The neighborhood size is searched from the grid $\{5, 10, 20, 50, 100\}$, all the regularization parameters $\{\lambda_i\}$ of the local regularized classifiers are set to be the same and are searched from the grid $\{0.01, 0.1, 1, 10, 100\}$, and the final cluster labels are obtained by the discretization method in [36].

. Local-Kernel-Regularized Clustering (LKRC). Clustering is purely based on minimizing the local loss in (24), and the local predictions are made by (26). The neighborhood size is searched from the grid $\{5, 10, 20, 50, 100\}$, all the regularization parameters $\{\lambda_i\}$ of the local regularized classifiers are set to be the same and are searched from the grid $\{0.01, 0.1, 1, 10, 100\}$, the width of the Gaussian kernel is searched from the grid $\{4^{-3}\sigma_0, 4^{-2}\sigma_0, 4^{-1}\sigma_0, \sigma_0, 4\sigma_0, 4^2\sigma_0, 4^3\sigma_0\}$, where $\sigma_0$ is the mean distance between any two examples in the data set, and the final cluster labels are obtained by the discretization method in [36]. Note that this method is very similar to the one in [34], except for the centering in the construction of the local classifiers, which, as noted earlier in a footnote, is needed for isotropy reasons.

. Regularized Clustering (RC) [5]. The implementation is based on the description in [5, Section 6.1]. The width of the Gaussian similarities used for constructing the graph Laplacian is set using the automatic method described in [37], the Gaussian kernel is adopted in the RKHS with its width tuned from the grid $\{4^{-3}\sigma_0, 4^{-2}\sigma_0, 4^{-1}\sigma_0, \sigma_0, 4\sigma_0, 4^2\sigma_0, 4^3\sigma_0\}$, where $\sigma_0$ is the mean distance between any two examples in the data set, the ratio of the regularization parameters $\gamma_I/\gamma_A$ is tuned from $\{0.01, 0.1, 1, 10, 100\}$, and the final cluster labels are also obtained by the discretization method in [36].

For our proposed clustering with local and global regularization methods (LCLGR denotes the approach using regularized linear classifiers as its local predictors, and KCLGR represents the approach using kernel ridge regression as its local predictors), the local regularization parameters $\{\lambda_i\}$ are set to be the same and are tuned from $\{0.01, 0.1, 1, 10, 100\}$, the size of the neighborhood is tuned from $\{5, 10, 20, 50, 100\}$, and the pairwise data similarities used for constructing the global smoothness regularizer are computed using standard Gaussian kernels with the width determined by the method in [37]. For the parameter used to trade off the local and global losses, we change $\mathcal{J}$ in (32) to


TABLE 2. Descriptions of the Data Sets

14. For all the methods below that involve the computation of Gaussian kernels, we adopt the Gaussian kernel with euclidean distance on the UCI and image data sets, and the Gaussian kernel with inner-product distance on the text data sets [42].


$\mathcal{J} = (1 - \lambda)\mathcal{J}_l + \lambda\mathcal{J}_g$ to better capture the relative importance of $\mathcal{J}_l$ and $\mathcal{J}_g$, and $\lambda$ is tuned from $\{0, 0.2, 0.4, 0.6, 0.8, 1\}$. For KCLGR, the Gaussian kernel is adopted for constructing the local kernel predictors, with its width tuned from the grid $\{4^{-3}\sigma_0, 4^{-2}\sigma_0, 4^{-1}\sigma_0, \sigma_0, 4\sigma_0, 4^2\sigma_0, 4^3\sigma_0\}$, where $\sigma_0$ is the mean distance between any two examples in the data set.

4.4 Algorithm Performances

The performances of those clustering algorithms on the three types of data sets are reported in Tables 3, 4, 5, 6, 7, and 8, in which the best two performances for each data set are highlighted. From those tables, we can observe the following.

. In most cases, k-means clustering performs worse than the other, more sophisticated methods, since the distributions of those high-dimensional data sets are commonly much more complicated than mixtures of spherical Gaussians.

. The Ncut algorithm performs similarly to the LLRC algorithm on the UCI data sets, as the distributions of those data sets are complex and


TABLE 3. Clustering Accuracy Results on UCI Data Sets

TABLE 4. Normalized Mutual Information Results on UCI Data Sets


there is no apparent mode underlying those data sets. On the image data sets, the Ncut algorithm usually outperforms the LLRC algorithm, since there are clear nonlinear underlying manifolds behind those data sets, which cannot be tackled well by linear methods like LLRC. On the other hand, on most of the text data sets, LLRC performs better than Ncut, because the regularized linear classifier has been shown to be very effective for text data sets [39].

. The LKRC algorithm usually outperforms LLRC, since the local distributions of the data sets are nonlinear.

. The mixed-regularization algorithms, including RC, LCLGR, and KCLGR, usually perform better than the algorithms based on a single regularizer (Ncut, LLRC, and LKRC), as they make use of more of the information contained in the data sets.

. As expected, LCLGR and KCLGR usually perform better than LLRC and LKRC, because the data labels are smoothed by the global smoother, which makes them more compliant with the intrinsic data distribution.

4.5 Sensitivity to the Selection of Parameters

There are mainly three parameters in our algorithm: the local regularization parameters $\{\lambda_i\}$, which are assumed to be the same in all local classifiers; the size of the neighborhood, $k$; and the global trade-off parameter $\lambda$. We have also conducted a set of experiments to test the sensitivity of our method to the selection of those


TABLE 5. Clustering Accuracy Results on Image Data Sets

TABLE 6. Normalized Mutual Information Results on Image Data Sets

TABLE 7. Clustering Accuracy Results on Text Data Sets


parameters, and the results are shown in Figs. 3, 4, 5, 6, 7, and 8. From these figures, we can clearly see the following:

. Generally, a $k$ that is either too small or too large leads to poor results. This is because when $k$ is too small, the data in each neighborhood are too sparse, so the resulting local classifiers may not be accurate; when $k$ is too large, the local properties of the data set are obscured, and the results also degrade. Therefore, a moderate $k$ gives better results.

. A proper trade-off between the local and global regularizers leads to better results, and this is just what our method does. This also experimentally confirms the reasonableness and effectiveness of combining local and global regularizers.

5 CONCLUSIONS

In this paper, we derived a new clustering algorithm called clustering with local and global regularization. Our method preserves the merits of local learning algorithms and spectral clustering. Our experiments show that the proposed algorithm outperforms some of the state-of-the-art algorithms on many benchmark data sets. In the future, we will focus on the parameter selection and acceleration issues of the CLGR algorithm.


Fig. 3. Parameter sensitivity testing results on the UCI Satimage data set. (a) Acc versus $\lambda_i$ plot; parameter settings for KCLGR: $\lambda = 0.2$, $k = 20$; for LCLGR: $\lambda = 0.4$, $k = 20$. (b) Acc versus $k$ plot; parameter settings for KCLGR: $\lambda = 0.2$, $\lambda_i = 1$; for LCLGR: $\lambda = 0.4$, $\lambda_i = 1$. (c) Acc versus $\lambda$ plot; parameter settings for KCLGR: $\lambda_i = 1$, $k = 20$; for LCLGR: $\lambda_i = 1$, $k = 20$.

Fig. 4. Parameter sensitivity testing results on the UCI Iris data set. (a) Acc versus $\lambda_i$ plot; parameter settings for KCLGR: $\lambda = 0.6$, $k = 20$; for LCLGR: $\lambda = 0.8$, $k = 20$. (b) Acc versus $k$ plot; parameter settings for KCLGR: $\lambda_i = 1$, $\lambda = 0.6$; for LCLGR: $\lambda_i = 0.1$, $\lambda = 0.8$. (c) Acc versus $\lambda$ plot; parameter settings for KCLGR: $\lambda_i = 1$, $k = 20$; for LCLGR: $\lambda_i = 0.1$, $k = 20$.

TABLE 8. Normalized Mutual Information Results on Text Data Sets


Fig. 7. Parameter sensitivity testing results on the 20-Newsgroup text data set. (a) Acc versus $\lambda_i$ plot; parameter settings for KCLGR: $\lambda = 0.2$, $k = 20$; for LCLGR: $\lambda = 0.4$, $k = 20$. (b) Acc versus $k$ plot; parameter settings for KCLGR: $\lambda = 0.2$, $\lambda_i = 10$; for LCLGR: $\lambda = 0.4$, $\lambda_i = 10$. (c) Acc versus $\lambda$ plot; parameter settings for KCLGR: $\lambda_i = 10$, $k = 20$; for LCLGR: $\lambda_i = 10$, $k = 20$.

Fig. 8. Parameter sensitivity testing results on the WebKB-Cornell text data set. (a) Acc versus $\lambda_i$ plot; parameter settings for KCLGR: $\lambda = 0.4$, $k = 10$; for LCLGR: $\lambda = 0.4$, $k = 10$. (b) Acc versus $k$ plot; parameter settings for KCLGR: $\lambda = 0.2$, $\lambda_i = 1$; for LCLGR: $\lambda = 0.4$, $\lambda_i = 0.1$. (c) Acc versus $\lambda$ plot; parameter settings for KCLGR: $\lambda_i = 1$, $k = 10$; for LCLGR: $\lambda_i = 0.1$, $k = 10$.

Fig. 5. Parameter sensitivity testing results on the Yale face image data set. (a) Acc versus $\lambda_i$ plot; parameter settings for KCLGR: $\lambda = 0.8$, $k = 20$; for LCLGR: $\lambda = 0.6$, $k = 10$. (b) Acc versus $k$ plot; parameter settings for KCLGR: $\lambda = 0.2$, $\lambda_i = 0.1$; for LCLGR: $\lambda = 0.4$, $\lambda_i = 100$. (c) Acc versus $\lambda$ plot; parameter settings for KCLGR: $\lambda_i = 0.1$, $k = 20$; for LCLGR: $\lambda_i = 100$, $k = 10$.

Fig. 6. Parameter sensitivity testing results on the ORL face image data set. (a) Acc versus $\lambda_i$ plot; parameter settings for KCLGR: $\lambda = 0.2$, $k = 20$; for LCLGR: $\lambda = 0.8$, $k = 10$. (b) Acc versus $k$ plot; parameter settings for KCLGR: $\lambda = 0.2$, $\lambda_i = 100$; for LCLGR: $\lambda = 0.8$, $\lambda_i = 10$. (c) Acc versus $\lambda$ plot; parameter settings for KCLGR: $\lambda_i = 100$, $k = 20$; for LCLGR: $\lambda_i = 10$, $k = 10$.


ACKNOWLEDGMENTS

The work of Fei Wang and Changshui Zhang is supported by the China Natural Science Foundation under Grant Nos. 60835002 and 60675009. The work of Tao Li is partially supported by the US National Science Foundation CAREER Award IIS-0546280, DMS-0844513, and CCF-0830659.

REFERENCES

[1] A. Argyriou, M. Herbster, and M. Pontil, "Combining Graph Laplacians for Semi-Supervised Learning," Proc. Conf. Neural Information Processing Systems, 2005.
[2] M. Belkin and P. Niyogi, "Laplacian Eigenmaps for Dimensionality Reduction and Data Representation," Neural Computation, vol. 15, no. 6, pp. 1373-1396, 2003.
[3] M. Belkin and P. Niyogi, "Semi-Supervised Learning on Riemannian Manifolds," Machine Learning, vol. 56, pp. 209-239, 2004.
[4] M. Belkin and P. Niyogi, "Towards a Theoretical Foundation for Laplacian-Based Manifold Methods," Proc. 18th Conf. Learning Theory (COLT), 2005.
[5] M. Belkin, P. Niyogi, and V. Sindhwani, "Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples," J. Machine Learning Research, vol. 7, pp. 2399-2434, 2006.
[6] L. Bottou and V. Vapnik, "Local Learning Algorithms," Neural Computation, vol. 4, pp. 888-900, 1992.
[7] P.K. Chan, D.F. Schlag, and J.Y. Zien, "Spectral K-Way Ratio-Cut Partitioning and Clustering," IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems, vol. 13, no. 9, pp. 1088-1096, Sept. 1994.
[8] O. Chapelle, M. Chi, and A. Zien, "A Continuation Method for Semi-Supervised SVMs," Proc. 23rd Int'l Conf. Machine Learning, pp. 185-192, 2006.
[9] J. Chen, Z. Zhao, J. Ye, and L. Huan, "Nonlinear Adaptive Distance Metric Learning for Clustering," Proc. 13th ACM Special Interest Group Conf. Knowledge Discovery and Data Mining (SIGKDD), pp. 123-132, 2007.
[10] G. Dai and D.-Y. Yeung, "Kernel Selection for Semi-Supervised Kernel Machines," Proc. 24th Int'l Conf. Machine Learning (ICML '07), pp. 185-192, 2007.
[11] C. Ding, X. He, H. Zha, M. Gu, and H.D. Simon, "A Min-Max Cut Algorithm for Graph Partitioning and Data Clustering," Proc. First Int'l Conf. Data Mining (ICDM '01), pp. 107-114, 2001.
[12] R.O. Duda, P.E. Hart, and D.G. Stork, Pattern Classification. John Wiley & Sons, 2001.
[13] A.S. Georghiades, P.N. Belhumeur, and D.J. Kriegman, "From Few to Many: Illumination Cone Models for Face Recognition under Variable Lighting and Pose," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23, no. 6, pp. 643-660, June 2001.
[14] G.H. Golub and C.F. Van Loan, Matrix Computations, third ed. The Johns Hopkins Univ. Press, 1996.
[15] J. Han and M. Kamber, Data Mining. Morgan Kaufmann, 2001.
[16] M. Hein, J.Y. Audibert, and U. von Luxburg, "From Graphs to Manifolds—Weak and Strong Pointwise Consistency of Graph Laplacians," Proc. 18th Conf. Learning Theory (COLT '05), pp. 470-485, 2005.
[17] J. He, M. Lan, C.-L. Tan, S.-Y. Sung, and H.-B. Low, "Initialization of Cluster Refinement Algorithms: A Review and Comparative Study," Proc. Int'l Joint Conf. Neural Networks, 2004.
[18] A. Jain and R. Dubes, Algorithms for Clustering Data. Prentice-Hall, 1988.
[19] B. Kernighan and S. Lin, "An Efficient Heuristic Procedure for Partitioning Graphs," The Bell System Technical J., vol. 49, no. 2, pp. 291-307, 1970.
[20] L. Zelnik-Manor and P. Perona, "Self-Tuning Spectral Clustering," Proc. Conf. Neural Information Processing Systems, 2005.
[21] A. McCallum, K. Nigam, J. Rennie, and K. Seymore, "Automating the Construction of Internet Portals with Machine Learning," Information Retrieval J., vol. 3, pp. 127-163, 2000.
[22] S.A. Nene, S.K. Nayar, and H. Murase, "Columbia Object Image Library (COIL-20)," Technical Report CUCS-005-96, Columbia Univ., Feb. 1996.
[23] A.Y. Ng, M.I. Jordan, and Y. Weiss, "On Spectral Clustering: Analysis and an Algorithm," Proc. Conf. Neural Information Processing Systems, 2002.
[24] S.T. Roweis and L.K. Saul, "Nonlinear Dimensionality Reduction by Locally Linear Embedding," Science, vol. 290, pp. 2323-2326, 2000.
[25] B. Scholkopf and A. Smola, Learning with Kernels. The MIT Press, 2002.
[26] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis. Cambridge Univ. Press, 2004.
[27] J. Shi and J. Malik, "Normalized Cuts and Image Segmentation," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 888-905, Aug. 2000.
[28] T. Sim, S. Baker, and M. Bsat, "The CMU Pose, Illumination, and Expression Database," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 25, no. 12, pp. 1615-1618, Dec. 2003.
[29] A. Strehl and J. Ghosh, "Cluster Ensembles—A Knowledge Reuse Framework for Combining Multiple Partitions," J. Machine Learning Research, vol. 3, pp. 583-617, 2002.
[30] V.N. Vapnik, The Nature of Statistical Learning Theory. Springer-Verlag, 1995.
[31] F. Wang, C. Zhang, and T. Li, "Regularized Clustering for Documents," Proc. 30th Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR '07), 2007.
[32] F. Wang and C. Zhang, "Label Propagation through Linear Neighborhoods," Proc. 23rd Int'l Conf. Machine Learning, 2006.
[33] Y. Weiss, "Segmentation Using Eigenvectors: A Unifying View," Proc. IEEE Int'l Conf. Computer Vision, pp. 975-982, 1999.
[34] M. Wu and B. Scholkopf, "A Local Learning Approach for Clustering," Proc. Conf. Neural Information Processing Systems, 2006.
[35] M. Wu and B. Scholkopf, "Transductive Classification via Local Learning Regularization," Proc. 11th Int'l Conf. Artificial Intelligence and Statistics (AISTATS '07), pp. 628-635, 2007.
[36] S.X. Yu and J. Shi, "Multiclass Spectral Clustering," Proc. Int'l Conf. Computer Vision, 2003.
[37] L. Zelnik-Manor and P. Perona, "Self-Tuning Spectral Clustering," Proc. Conf. Neural Information Processing Systems, 2005.
[38] H. Zha, X. He, C. Ding, M. Gu, and H. Simon, "Spectral Relaxation for K-Means Clustering," Proc. Conf. Neural Information Processing Systems, 2001.
[39] T. Zhang and J.F. Oles, "Text Categorization Based on Regularized Linear Classification Methods," Information Retrieval, vol. 4, pp. 5-31, 2001.
[40] D. Zhou and B. Scholkopf, "Learning from Labeled and Unlabeled Data Using Random Walks," Proc. 26th DAGM Symp. Pattern Recognition, pp. 237-244, 2004.
[41] D. Zhou, O. Bousquet, T.N. Lal, J. Weston, and B. Scholkopf, "Learning with Local and Global Consistency," Proc. Conf. Neural Information Processing Systems, 2004.
[42] X. Zhu, Z. Ghahramani, and J. Lafferty, "Semi-Supervised Learning Using Gaussian Fields and Harmonic Functions," Proc. 20th Int'l Conf. Machine Learning, 2003.
[43] X. Zhu, J. Lafferty, and Z. Ghahramani, "Semi-Supervised Learning: From Gaussian Fields to Gaussian Process," Computer Science Technical Report CMU-CS-03-175, Carnegie Mellon Univ., 2003.
[44] X. Zhu and A. Goldberg, "Kernel Regression with Order Preferences," Proc. 22nd AAAI Conf. Artificial Intelligence (AAAI), 2007.

Fei Wang received the BE degree from Xidian University in 2003, and the ME and PhD degrees from Tsinghua University in 2006 and 2008, respectively. His main research interests include machine learning, data mining, and information retrieval.


Changshui Zhang is a professor in the Department of Automation, Tsinghua University. His main research interests include machine learning. He is a member of the IEEE.

Tao Li is currently an assistant professor in the School of Computing and Information Sciences, Florida International University. His main research interests include machine learning and data mining.
