
2009 ISECS International Colloquium on Computing, Communication, Control, and Management


A Semi-supervised Clustering via Orthogonal Projection

Cui Peng, Harbin Engineering University, Harbin 150001, China, [email protected]

Zhang Ru-bo, Harbin Engineering University, Harbin 150001, China, [email protected]

Abstract — Because its dimensionality is very high, the image feature space is usually complex, and dimensionality reduction techniques are widely used to process it effectively. Semi-supervised clustering incorporates limited supervision into unsupervised clustering in order to improve clustering performance. However, many existing semi-supervised clustering methods cannot handle high-dimensional sparse data. To solve this problem, we propose a semi-supervised fuzzy clustering method based on constrained orthogonal projection. Experiments on several datasets show that the method achieves good clustering performance on high-dimensional data.

Keywords: dimension reduction; clustering; projection; semi-supervised learning

I. INTRODUCTION

In recent years, the rapid growth of feature information and of the volume of image data has made many tasks in multimedia processing increasingly challenging. Dimensionality reduction techniques have been proposed to uncover the underlying low-dimensional structures of the high-dimensional image space [1]. These efforts have proved very useful in image retrieval, classification and clustering. There are a number of dimensionality reduction techniques in the literature. One of the classical methods is Principal Component Analysis (PCA) [2], which minimizes the information loss in the reduction process. One disadvantage of PCA is that it is likely to distort the local structures of a dataset. Locality Preserving Projection (LPP) [3-4] encodes the local neighborhood structure into a similarity matrix and derives a linear manifold embedding as the optimal approximation to this matrix; LPP, on the other hand, may overlook global structures.

Recently, semi-supervised learning, which leverages domain knowledge represented in the form of pairwise constraints, has gained much attention [6-10]. Various reduction techniques have been developed to utilize this form of knowledge [11-12].

The constrained FLD defines the embedding based solely on must-link constraints. Semi-Supervised Dimensionality Reduction (SSDR) [13] preserves the intrinsic global covariance structure of the data while exploiting both kinds of constraints.

As many semi-supervised clustering methods are based on density or distance, they have difficulty handling high-dimensional data. Thus, dimensionality reduction must be incorporated into the semi-supervised clustering process. We propose the COPFC (Constrained Orthogonal Projection Fuzzy Clustering) method to solve this problem.

II. COPFC METHOD FRAMEWORK

Figure 1. COPFC framework 

Figure 1 shows the framework of the COPFC method. Given a set of instances and supervision in the form of must-link constraints C_ML = {(x_i, x_j)}, where x_i and x_j must reside in the same cluster, and cannot-link constraints C_CL = {(x_i, x_j)}, where x_i and x_j should be in different clusters, the COPFC method is composed of three steps. In the first step, a preprocessing method is exploited to reduce the unlabelled instances and pairwise constraints according to the transitivity property of must-link constraints (a sketch is given below). In the second step, a constraint-guided orthogonal projection method, called COPFC_proj, is used to project the original data into a low-dimensional space. Finally, we apply a semi-supervised fuzzy clustering algorithm, called COPFC_fuzzy, to produce the clustering results on the projected low-dimensional dataset.


III. COPFC_proj - A CONSTRAINED ORTHOGONAL PROJECTION METHOD

In a typical image retrieval system, each image is represented by an m-dimensional feature vector x, whose j-th value is denoted x_j. During the retrieval process, the user is allowed to mark several images that match his query interest with must-links, and to mark those that are apparently irrelevant with cannot-links. COPFC_proj is a linear method and depends on a set of l axes p_i. For a given image x, its embedding coordinates are the projections of x onto the l axes:

$$P_i^x = \sum_{j=1}^{m} x_j p_{ij}, \quad 1 \le i \le l.$$
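In matrix form this is a single product. A small illustrative sketch (the array shapes and random data are ours, chosen to match the paper's 37-dimensional features):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 37))    # N = 10 images, m = 37 features (rows)
axes = rng.normal(size=(37, 2))  # l = 2 projection axes p_i (columns)

# P_i^x = sum_j x_j * p_ij for every image x and every axis i
coords = X @ axes                # (N x l) embedding coordinates
print(coords.shape)              # (10, 2)
```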

As the images in the set ML are considered mutually similar, they should be kept compact in the new space. In other words, the distances among them should be kept small, while the irrelevant images in CL are to be mapped as far apart from those in ML as possible.

The above two criteria can be formally stated as follows:

$$\min \sum_{x \in ML} \sum_{y \in ML} \sum_{i=1}^{l} (P_i^x - P_i^y)^2 \qquad (1)$$

$$\max \sum_{x \in ML} \sum_{y \in CL} \sum_{i=1}^{l} (P_i^x - P_i^y)^2 \qquad (2)$$

Intuitively, equation (1) forces the embedding to have the image points in ML reside in a small local neighborhood in the new feature space, and equation (2) reflects our objective to prevent the points in ML and CL from lying close together after the embedding. To construct a salient embedding, COPFC_proj combines these two criteria and finds the axes one by one, each optimizing the following objective:

$$\min \sum_{x \in ML} \sum_{y \in ML} (P_i^x - P_i^y)^2 \qquad (3)$$

subject to

$$\sum_{x \in ML} \sum_{y \in CL} (P_i^x - P_i^y)^2 = 1 \qquad (4)$$

$$p_i^T p_1 = p_i^T p_2 = p_i^T p_3 = \dots = p_i^T p_{i-1} = 0 \qquad (5)$$

Here T denotes the transpose of a vector. The choice of the constant 1 on the right-hand side of equation (4) is rather arbitrary, as any other value (except 0) would not cause any substantial change in the embedding produced. The constraint in equation (5) forces all the axes to be mutually orthogonal. Equations (3) and (4) are implicit functions of the axes p_i and should be rewritten in explicit form. First, we introduce the necessary notation. For a given set X of image points, the mean of X is an m-dimensional column vector M(X), whose i-th component is

$$M_i(X) = \frac{1}{|X|} \sum_{x \in X} x_i \qquad (6)$$

and its covariance matrix C(X) is an m×m matrix whose entries are

$$C_{ij}(X) = \frac{1}{|X|} \sum_{x \in X} x_i x_j - M_i(X) M_j(X) \qquad (7)$$

For two sets X and Y, define an m×m matrix

$$M(X, Y) = (M(X) - M(Y))(M(X) - M(Y))^T.$$

Accordingly, we can rewrite equation (3) as follows:

$$\sum_{x \in ML} \sum_{y \in ML} (P_i^x - P_i^y)^2 = 2|ML|^2 \, p_i^T C(ML) \, p_i \qquad (8)$$

Similarly, we can rewrite equation (4) as follows:

$$\sum_{x \in ML} \sum_{y \in CL} (P_i^x - P_i^y)^2 = |ML| \, |CL| \, p_i^T \big( C(ML) + C(CL) + M(ML, CL) \big) p_i \qquad (9)$$

Hence, the problem to be solved is

$$\min \; p_i^T A p_i, \quad \text{subject to} \quad p_i^T B p_i = 1, \quad p_i^T p_1 = \dots = p_i^T p_{i-1} = 0,$$

where $A = 2|ML|^2 C(ML)$ and $B = |ML|\,|CL| \big( C(ML) + C(CL) + M(ML, CL) \big)$.

It is easy to see that both A and B are symmetric and positive semi-definite. The above problem can be solved using the method of Lagrange multipliers. Below we discuss the procedure to obtain the optimal axes.
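For completeness, the Lagrange-multiplier step, which the paper leaves implicit, is the standard one (our derivation, not quoted from the text):

$$\mathcal{L}(p_1, \lambda) = p_1^T A p_1 - \lambda \, (p_1^T B p_1 - 1), \qquad \frac{\partial \mathcal{L}}{\partial p_1} = 2 A p_1 - 2 \lambda B p_1 = 0 \;\Longrightarrow\; A p_1 = \lambda B p_1,$$

and since the objective value at a stationary point is $p_1^T A p_1 = \lambda \, p_1^T B p_1 = \lambda$, the minimum is attained at the eigenvector with the smallest generalized eigenvalue.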

The first projection axis p_1 is therefore the eigenvector of the generalized eigen-problem A p_1 = λ B p_1 corresponding to the smallest eigenvalue. After that, we compute the remaining axes one by one in the following fashion. Suppose we have already obtained the first (k-1) axes; define:

$$P^{(k-1)} = [p_1, p_2, \dots, p_{k-1}], \qquad Q^{(k-1)} = [P^{(k-1)}]^T B^{-1} P^{(k-1)} \qquad (10)$$

Then the k-th axis p_k is the eigenvector associated with the smallest eigenvalue of the eigen-problem:

$$\big( I - B^{-1} P^{(k-1)} [Q^{(k-1)}]^{-1} [P^{(k-1)}]^T \big) B^{-1} A \, p_k = \lambda p_k \qquad (11)$$

We adopt the above procedure to determine the l optimal orthogonal projection axes, which preserve the metric structure of the image space for the given relevance feedback information. The new coordinates of the image data points can then be derived accordingly.
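A compact sketch of the whole axis-extraction procedure, under our reading of equations (8)-(11) (the function name is ours, and the small ridge term added to B is our addition to keep the inverses well defined; it is not in the paper):

```python
import numpy as np
from scipy.linalg import eigh, inv

def copfc_proj_axes(X_ml, X_cl, l):
    """Sketch: A = 2|ML|^2 C(ML), B = |ML||CL| (C(ML)+C(CL)+M(ML,CL)),
    then extract l B-normalized, mutually orthogonal axes."""
    def cov(Z):
        mu = Z.mean(axis=0)
        return (Z - mu).T @ (Z - mu) / len(Z)   # equation (7)

    d_mu = X_ml.mean(axis=0) - X_cl.mean(axis=0)
    A = 2 * len(X_ml) ** 2 * cov(X_ml)
    B = len(X_ml) * len(X_cl) * (cov(X_ml) + cov(X_cl) + np.outer(d_mu, d_mu))
    B += 1e-8 * np.eye(B.shape[0])              # regularize: B may be singular

    # first axis: smallest generalized eigenvalue of A p = lambda B p
    _, V = eigh(A, B)                           # eigenvalues in ascending order
    axes = [V[:, 0]]

    Binv = inv(B)
    for _ in range(1, l):                       # remaining axes, one by one
        P = np.column_stack(axes)               # P^(k-1)
        Q = P.T @ Binv @ P                      # Q^(k-1), equation (10)
        M_k = (np.eye(B.shape[0]) - Binv @ P @ inv(Q) @ P.T) @ Binv @ A
        w, V = np.linalg.eig(M_k)               # M_k is not symmetric
        k = np.argmin(w.real)                   # smallest eigenvalue, eq. (11)
        axes.append(V[:, k].real)
    return np.column_stack(axes)                # m x l projection matrix
```

In practice one would feed the chunklet-reduced ML and CL point sets from Section II into this routine and then project the data as in the first equation of this section.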

IV. COPFC_fuzzy SEMI-SUPERVISED CLUSTERING

COPFC_fuzzy is a new search-based semi-supervised clustering algorithm that allows the constraints to guide the clustering process towards an appropriate partition. To this end, we define an objective function that takes into account both the feature-based similarity between data points and the pairwise constraints [14-16]. Let ML be the set of must-link constraints, i.e., (x_i, x_j) ∈ ML implies that x_i and x_j should be assigned to the same cluster, and CL the set of cannot-link constraints, i.e., (x_i, x_j) ∈ CL implies that x_i and x_j should be assigned to different clusters. We can then write the objective function that COPFC_fuzzy must minimize:

$$J(V, U) = \sum_{k=1}^{C} \sum_{i=1}^{N} u_{ik}^2 \, d^2(x_i, \mu_k) + \lambda \left( \sum_{(x_i, x_j) \in ML} \sum_{k=1}^{C} \sum_{l=1, l \neq k}^{C} u_{ik} u_{jl} + \sum_{(x_i, x_j) \in CL} \sum_{k=1}^{C} u_{ik} u_{jk} \right) - \gamma \sum_{k=1}^{C} \left( \sum_{i=1}^{N} u_{ik} \right)^2 \qquad (12)$$


The first term in equation (12) is the sum of squared distances to the prototypes, weighted by the constrained memberships (the Fuzzy C-Means objective function). This term reinforces the compactness of the clusters.

The second component in equation (12) comprises the cost of violating the pairwise must-link constraints and the cost of violating the pairwise cannot-link constraints. This term is weighted by λ, a constant factor that specifies the relative importance of the supervision.

The third component in equation (12) is the sum of the squares of the cardinalities of the clusters, which controls the competition between clusters. It is weighted by γ.

When the parameters are well chosen, the final partition will minimize the sum of intra-cluster distances, while partitioning the data set into the smallest number of clusters such that the specified constraints are respected as well as possible.
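To make the roles of the three terms concrete, here is a sketch that merely evaluates equation (12) for a given membership matrix (the paper instead derives update equations to minimize it; the names and shapes here are our own):

```python
import numpy as np

def copfc_objective(U, X, V, ml_pairs, cl_pairs, lam, gamma):
    """Evaluate equation (12). U is the N x C fuzzy membership matrix,
    X the N x m data, V the C x m prototypes; ml_pairs and cl_pairs are
    lists of (i, j) index tuples."""
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)  # N x C distances
    fcm = (U ** 2 * d2).sum()                                # first term

    # must-link violation: sum over l != k of u_ik * u_jl
    ml_cost = sum((U[i].sum() * U[j].sum()) - U[i] @ U[j] for i, j in ml_pairs)
    # cannot-link violation: sum over k of u_ik * u_jk
    cl_cost = sum(U[i] @ U[j] for i, j in cl_pairs)

    cardinality = (U.sum(axis=0) ** 2).sum()                 # third term
    return fcm + lam * (ml_cost + cl_cost) - gamma * cardinality
```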

V. EXPERIMENTAL EVALUATION

A. Dataset selection and evaluation criterion

We performed experiments on the COREL image database and two datasets from the UCI repository, as follows:

(1) We selected 1500 images from the COREL image database. They were divided into 15 sufficiently distinct classes of 100 images each. In our experiments, each image was represented by a 37-dimensional vector comprising 3 types of features extracted from the image. We compared the COPFC_proj algorithm against PCA and SSDR. The performance of each technique was evaluated under various amounts of domain knowledge and different reduced dimensionalities. In each scenario, after the dimensionality reduction, K-means was applied to cluster the test images.

(2) The Iris and Wine datasets from the UCI repository. The Iris dataset contains three classes of 50 instances each and 4 numerical attributes; the Wine dataset contains three classes, 178 instances, and 13 numerical attributes. The simplicity and low dimensionality of these datasets also allow us to display the constraints that are actually selected. To evaluate the clustering performance of COPFC_fuzzy, we compared the COPFC_fuzzy algorithm against the K-means and PCKmeans algorithms.

(3) Evaluation criterion. In this paper, we use the Corrected Rand Index (CRI) as the clustering validation measure:

$$\mathrm{CRI} = \frac{A - C}{n(n-1)/2 - C} \qquad (13)$$

where A is the number of instance pairs for which the assigned clusters agree with the actual clusters, n is the number of instances in the dataset (so n(n-1)/2 is the number of instance pairs), and C is the number of constraints.

For each dataset, we ran each experiment 20 times. To study the effect of constraints, 100 constraints were generated randomly for the test set. Each point on the learning curve is an average of the results over 20 runs.
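Under our reading of equation (13), a CRI computation might look like this (a sketch; the function name is ours):

```python
import numpy as np

def corrected_rand_index(pred, truth, n_constraints):
    """Equation (13): CRI = (A - C) / (n(n-1)/2 - C), where A counts the
    instance pairs on which the found and actual clusterings agree and
    C is the number of supplied constraints."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    n = len(pred)
    A = 0
    for i in range(n):
        for j in range(i + 1, n):
            # agreement: both clusterings put i, j together, or both apart
            if (pred[i] == pred[j]) == (truth[i] == truth[j]):
                A += 1
    return (A - n_constraints) / (n * (n - 1) / 2 - n_constraints)
```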

B. The effectiveness of COPFC

In Figure 2, we apply three dimensionality reduction methods (COPFC_proj, PCA, SSDR) to the original images, reducing the dimensionality to 15 and 20, respectively. On the reduced-dimension data we then used K-means for clustering. The curves in Figure 2 show that the clustering performance of PCA is independent of the number of constraints, while that of SSDR changes only slightly. For COPFC_proj, clustering performance improved substantially as the number of constraints increased. When only a small number of constraints is available, the clustering performance of COPFC_proj is the worst of the three methods. In general, however, COPFC_proj outperforms PCA and SSDR for dimensionality reduction.

[Plot: CRI (roughly 0.6 to 0.85) versus number of constraints (10 to 100) for COPFC_proj, SSDR and PCA; two panels (a) and (b), with panel (b) labelled Dimension = 20.]

Figure 2. Clustering performance with different numbers of constraints

Figure 3 shows the clustering performance of the three methods on the Iris and Wine datasets. On both datasets, COPFC_fuzzy obtained the best performance, and K-means the worst. Although the clustering performance of PCKmeans is noticeably improved over K-means, it is still worse than that of COPFC_fuzzy.

[Plot: CRI (roughly 0.8 to 1.0) versus number of constraints (10 to 100) for COPFC, PCKmeans and Kmeans.]

(a) Iris dataset (b) Wine dataset

Figure 3. Clustering performance on UCI datasets

VI. CONCLUSION AND FUTURE WORK

We propose a semi-supervised fuzzy clustering method via orthogonal projection to handle high-dimensional sparse data in the image feature space. The method reduces the dimensionality of images via orthogonal projection, and clusters the reduced-dimension data with a constrained fuzzy clustering algorithm.

There are several potential directions for future research. First, we are interested in automatically identifying the right reduced dimensionality from the background knowledge, rather than requiring a pre-specified value. Second, we plan to explore alternative methods of employing supervision to guide unsupervised clustering.


REFERENCES

[1] X. Yang, H. Fu and H. Zha. "Semi-Supervised Nonlinear Dimensionality Reduction". In Proc. of the 23rd Intl. Conf. on Machine Learning, 2006.

[2] C. Ding and X. He. "K-Means Clustering via Principal Component Analysis". In Proc. of the 21st Intl. Conf. on Machine Learning, 2004.

[3] D. Cai and X. F. He. "Orthogonal Locality Preserving Projection". In Proc. of the 28th Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2005.

[4] X. F. He and P. Niyogi. "Locality Preserving Projections". Neural Information Processing Systems, NIPS '03, 2003.

[5] H. Cheng, K. Hua, and K. Vu. "Semi-Supervised Dimensionality Reduction in Image Feature Space". Technical Report, University of Central Florida, 2007.

[6] K. Wagstaff and C. Cardie. "Clustering with instance-level constraints". In Proc. of the 17th Int'l Conf. on Machine Learning. San Francisco: Morgan Kaufmann Publishers, 2000.

[7] S. Basu. "Semi-supervised Clustering: Probabilistic Models, Algorithms and Experiments". Austin: The University of Texas, 2005.

[8] S. Basu, A. Banerjee and R. J. Mooney. "Semi-supervised clustering by seeding". In Proc. of the 19th Int'l Conf. on Machine Learning (ICML 2002), pp. 19-26.

[9] K. Wagstaff, C. Cardie and S. Rogers. "Constrained K-means clustering with background knowledge". In Proc. of the 18th Int'l Conf. on Machine Learning. Williamstown: Williams College, Morgan Kaufmann Publishers, 2001, pp. 577-584.

[10] D. Klein, S. D. Kamvar and C. D. Manning. "From instance-level constraints to space-level constraints: Making the most of prior knowledge in data clustering". In Proc. of the 19th Int'l Conf. on Machine Learning. University of New South Wales, Sydney: Morgan Kaufmann Publishers, 2002, pp. 307-314.

[11] T. Hertz, N. Shental and A. Bar-Hillel. "Enhancing image and video retrieval: Learning via equivalence constraints". In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition. Madison: IEEE Computer Society, 2003, pp. 668-674.

[12] T. Deselaers, D. Keysers, and H. Ney. "Features for Image Retrieval - a Quantitative Comparison". In Pattern Recognition, 26th DAGM Symposium, 2004.

[13] D. Zhang, Z. H. Zhou, and S. Chen. "Semi-Supervised Dimensionality Reduction". In Proc. of the 2007 SIAM Intl. Conf. on Data Mining, SDM '07, 2007.

[14] N. Grira, M. Crucianu, N. Boujemaa. "Semi-supervised fuzzy clustering with pairwise-constrained competitive agglomeration". In IEEE International Conference on Fuzzy Systems, 2005.

[15] H. Frigui, R. Krishnapuram. "Clustering by competitive agglomeration". Pattern Recognition 30(7), 1997, pp. 1109-1119.

[16] M. Bilenko, R. J. Mooney. "Adaptive duplicate detection using learnable string similarity measures". In International Conference on Knowledge Discovery and Data Mining, Washington, DC, 2003, pp. 39-48.
