
PhD Proposal, Nov 2003 (Also Appears as Technical Report, UT-AI-TR-03-307)

Semi-supervised Clustering: Learning with Limited User Feedback

by

Sugato Basu

Doctoral Dissertation Proposal

Presented to the Faculty of the Graduate School of

The University of Texas at Austin

in Partial Fulfillment

of the Requirements

for the Degree of

Doctor of Philosophy

The University of Texas at Austin

November 2003


Semi-supervised Clustering: Learning with Limited User Feedback

Publication No.

Sugato Basu, The University of Texas at Austin, 2003

Supervisor: Dr. Raymond J. Mooney

In many machine learning domains (e.g. text processing, bioinformatics), there is a large supply of unlabeled data but limited labeled data, which can be expensive to generate. Consequently, semi-supervised learning, learning from a combination of both labeled and unlabeled data, has become a topic of significant recent interest. In the proposed thesis, our research focus is on semi-supervised clustering, which uses a small amount of supervised data in the form of class labels or pairwise constraints on some examples to aid unsupervised clustering. Semi-supervised clustering can be either search-based, i.e., changes are made to the clustering objective to satisfy user-specified labels/constraints, or similarity-based, i.e., the clustering similarity metric is trained to satisfy the given labels/constraints. Our main goal in the proposed thesis is to study search-based semi-supervised clustering algorithms and apply them to different domains.

In our initial work, we have shown how supervision can be provided to clustering in the form of labeled data points or pairwise constraints. We have also developed an active learning framework for selecting informative constraints in the pairwise constrained semi-supervised clustering model, and proposed a method for unifying search-based and similarity-based techniques in semi-supervised clustering.

In this thesis, we want to study other aspects of semi-supervised clustering. Some of the issues we want to investigate include: (1) the effect of noisy, probabilistic or incomplete supervision in clustering; (2) model selection techniques for automatic selection of the number of clusters in semi-supervised clustering; (3) ensemble semi-supervised clustering. In our work so far, we have mainly focused on generative clustering models, e.g. KMeans and EM, and have run experiments on clustering low-dimensional UCI datasets or high-dimensional text datasets. In the future, we want to study the effect of semi-supervision on other clustering algorithms, especially in the discriminative clustering and online clustering frameworks. We also want to study the effectiveness of our semi-supervised clustering algorithms on other domains, e.g., web search engines (clustering of search results), astronomy (clustering of Mars spectral images) and bioinformatics (clustering of gene microarray data).


Contents

Abstract

Chapter 1  Introduction
  1.1  Classification
  1.2  Clustering
  1.3  Semi-supervised learning
    1.3.1  Semi-supervised classification
    1.3.2  Semi-supervised clustering
  1.4  Goal of proposed thesis

Chapter 2  Background
  2.1  Overview of clustering algorithms
    2.1.1  Hierarchical clustering
    2.1.2  Partitional clustering
  2.2  Our representative clustering algorithms
    2.2.1  KMeans and variants
    2.2.2  Hierarchical Agglomerative Clustering
  2.3  Clustering evaluation measures

Chapter 3  Completed Research
  3.1  Semi-supervised clustering using labeled data
    3.1.1  Problem definition
    3.1.2  Seeded and constrained KMeans algorithms
    3.1.3  Motivation of semi-supervised KMeans algorithms
    3.1.4  Experimental results
  3.2  Semi-supervised clustering using pairwise constraints
    3.2.1  Problem definition
    3.2.2  Pairwise constrained KMeans algorithm
    3.2.3  Motivation of the PCC framework
    3.2.4  Experimental results
  3.3  Active learning for semi-supervised clustering
    3.3.1  Problem definition
    3.3.2  Explore and Consolidate algorithms
    3.3.3  Motivation of active constraint selection algorithm
    3.3.4  Experimental results
  3.4  Unified model of semi-supervised clustering
    3.4.1  Problem definition
    3.4.2  Motivation of the unified framework
    3.4.3  MPC-KMeans algorithm for the unified framework
    3.4.4  Experimental results

Chapter 4  Proposed Research
  4.1  Short-term definite goals
    4.1.1  Soft and noisy constraints
    4.1.2  Incomplete semi-supervision and class discovery
    4.1.3  Model selection
    4.1.4  Generalizing the unified semi-supervised clustering model
    4.1.5  Other semi-supervised clustering algorithms
    4.1.6  Ensemble semi-supervised clustering
    4.1.7  Other application domains
  4.2  Long-term speculative goals
    4.2.1  Other forms of semi-supervision
    4.2.2  Study of cluster evaluation metrics
    4.2.3  Approximation algorithms for semi-supervised clustering
    4.2.4  Analysis of labeled semi-supervision with EM
  4.3  Conclusion

Bibliography


Chapter 1

Introduction

Two of the most widely-used methods in machine learning for prediction and data analysis are classification and clustering (Duda, Hart, & Stork, 2001; Mitchell, 1997). Classification is a purely supervised learning model, whereas clustering is completely unsupervised. Recently, there has been a lot of interest in the continuum between completely supervised and unsupervised learning (Muslea, 2002; Nigam, 2001; Ghani, Jones, & Rosenberg, 2003). In this chapter, we will give an overview of traditional supervised classification and unsupervised clustering, and then describe learning in the continuum between these two, where we have partially supervised data. We will then present the main goal of our proposed thesis.

1.1 Classification

Classification is a supervised task, where supervision is provided in the form of a set of labeled training data, each data point having a class label selected from a fixed set of classes (Mitchell, 1997). The goal in classification is to learn a function from the training data that gives the best prediction of the class label of unseen (test) data points. Generative models for classification learn the joint distribution of the data and class variables by assuming a particular parametric form of the underlying distribution that generated the data points in each class, and then apply Bayes Rule to obtain class conditional probabilities that are used to predict the class labels for test points drawn from the same distribution, with unknown class labels (Ng & Jordan, 2002). In the discriminative framework, the focus is on learning the discriminant function for the class boundaries or a posterior probability for the class labels directly, without learning the underlying generative densities (Jaakkola & Haussler, 1999). It can be shown that the discriminative model of classification has better generalization error than the generative model under certain assumptions (Vapnik, 1998), which has made discriminative classifiers, e.g., support vector machines (Joachims, 1999) and nearest neighbor classifiers (Devroye, Gyorfi, & Lugosi, 1996), very popular for the classification task.

1.2 Clustering

Clustering is an unsupervised learning problem, which tries to group a set of points into clusters such that points in the same cluster are more similar to each other than points in different clusters, under a particular similarity metric (Jain & Dubes, 1988). Here, the learning algorithm just observes a set of points without observing any corresponding class/category labels. Clustering problems can also be categorized as generative or discriminative. In the generative clustering model, a parametric form of data generation is assumed, and the goal in the maximum likelihood formulation is to find the parameters that maximize the probability (likelihood) of generation of the data given the model. In the most general formulation, the number of clusters $K$ is also considered to be an unknown parameter. Such a clustering formulation is called a "model selection" framework, since it has to choose the best value of $K$ under which the clustering model fits the data. We will be assuming that $K$ is known in the clustering frameworks that we will be considering, unless explicitly mentioned otherwise. In the discriminative clustering setting (e.g., graph-theoretic clustering), the clustering algorithm tries to cluster the data so as to maximize within-cluster similarity and minimize between-cluster similarity based on a particular similarity metric, where it is not necessary to consider an underlying parametric data generation model. In both the generative and discriminative models, clustering algorithms are generally posed as optimization problems and solved by iterative methods like EM (Dempster, Laird, & Rubin, 1977), approximation algorithms like KMedian (Jain & Vazirani, 2001), or heuristic methods like Metis (Karypis & Kumar, 1998).

1.3 Semi-supervised learning

In many practical learning domains (e.g. text processing, bioinformatics), there is a large supply of unlabeled data but limited labeled data, which can be expensive to generate. Consequently, semi-supervised learning, learning from a combination of both labeled and unlabeled data, has become a topic of significant recent interest. The framework of semi-supervised learning is applicable to both classification and clustering.

1.3.1 Semi-supervised classification

Supervised classification has a known, fixed set of categories, and category-labeled training data is used to induce a classification function. In this setting, the training can also exploit additional unlabeled data, frequently resulting in a more accurate classification function. Several semi-supervised classification algorithms that use unlabeled data to improve classification accuracy have become popular in the past few years, which include co-training (Blum & Mitchell, 1998), transductive support vector machines (Joachims, 1999), and using Expectation Maximization to incorporate unlabeled data into training (Ghahramani & Jordan, 1994; Nigam, McCallum, Thrun, & Mitchell, 2000). Unlabeled data have also been used to learn good metrics in the classification setting (Hastie & Tibshirani, 1996). A good review of semi-supervised classification methods is given in (Seeger, 2000).

1.3.2 Semi-supervised clustering

Semi-supervised clustering, which uses class labels or pairwise constraints on some examples to aid unsupervised clustering, has been the focus of several recent projects (Basu, Banerjee, & Mooney, 2002; Klein, Kamvar, & Manning, 2002; Wagstaff, Cardie, Rogers, & Schroedl, 2001; Xing, Ng, Jordan, & Russell, 2003). If the initial labeled data represent all the relevant categories, then both semi-supervised clustering and semi-supervised classification algorithms can be used for categorization. However, in many domains, knowledge of the relevant categories is incomplete. Unlike semi-supervised classification, semi-supervised clustering (in the model-selection framework) can group data using the categories in the initial labeled data as well as extend and modify the existing set of categories as needed to reflect other regularities in the data.

Existing methods for semi-supervised clustering fall into two general approaches that we call search-based and similarity-based methods.

Search-based methods

In search-based approaches, the clustering algorithm itself is modified so that user-provided labels or constraints are used to bias the search for an appropriate partitioning. This can be done by several methods, e.g., modifying the clustering objective function so that it includes a term for satisfying specified constraints (Demiriz, Bennett, & Embrechts, 1999), enforcing constraints to be satisfied during the cluster assignment in the clustering process (Wagstaff et al., 2001), doing clustering using side-information from conditional distributions in an auxiliary space (Sinkkonen & Kaski, 2000), and initializing clusters and inferring clustering constraints based on neighborhoods derived from labeled examples (Basu et al., 2002).

Similarity-based methods

In similarity-based approaches, an existing clustering algorithm that uses a similarity metric is employed; however, the similarity metric is first trained to satisfy the labels or constraints in the supervised data. Several similarity metrics have been used for similarity-based semi-supervised clustering, including string-edit distance trained using EM (Bilenko & Mooney, 2003), Jensen-Shannon divergence trained using gradient descent (Cohn, Caruana, & McCallum, 2000), Euclidean distance modified by a shortest-path algorithm (Klein et al., 2002), and Mahalanobis distances trained using convex optimization (Hillel, Hertz, Shental, & Weinshall, 2003; Xing et al., 2003). Several clustering algorithms using trained similarity metrics have been employed for semi-supervised clustering, including single-link (Bilenko & Mooney, 2003) and complete-link (Klein et al., 2002) agglomerative clustering, EM (Cohn et al., 2000; Hillel et al., 2003), and KMeans (Hillel et al., 2003; Xing et al., 2003).

However, similarity-based and search-based approaches to semi-supervised clustering have not been adequately compared in previous work, and so their relative strengths and weaknesses are largely unknown. In Section 3.4, we will be presenting a new semi-supervised clustering algorithm that unifies these two approaches.

1.4 Goal of proposed thesis

In the proposed thesis, the main goal is to study semi-supervised clustering algorithms, characterize some of their properties and apply them to different domains. In our completed work, we have already shown how supervision can be provided to clustering in the form of labeled data points or pairwise constraints. We have also developed an active learning framework for selecting informative constraints in the pairwise constrained semi-supervised clustering model, and proposed a method for unifying search-based and similarity-based techniques in semi-supervised clustering. Details of the completed work are given in Chapter 3.

In the future, we want to look at the following issues, details of which are given in Chapter 4:

• Investigate the effects of noisy supervision, probabilistic supervision (e.g., soft constraints) or incomplete supervision (e.g., labels not specified for all clusters) in clustering;

• Study model selection issues in semi-supervised clustering, which will help to characterize the difference between semi-supervised clustering and classification;

• Study the feasibility of semi-supervising other clustering algorithms, especially in the discriminative clustering or online clustering framework;

• Create a framework for ensemble semi-supervised clustering;

• Apply the semi-supervised clustering model on other domains apart from text, especially web search engines, astronomy and bioinformatics;

• Study the relation between different evaluation metrics used to evaluate semi-supervised clustering;

• Investigate other forms of semi-supervision, e.g., attribute-level constraints;

• Do more theoretical analysis of certain aspects of semi-supervision, especially semi-supervised clustering with labeled data and the unified semi-supervised clustering model.


Chapter 2

Background

This chapter gives a brief review of clustering algorithms on which our proposed semi-supervised clustering techniques will be applied. It also gives an overview of different popular clustering evaluation measures, and describes the measures we will be using in our experiments.

2.1 Overview of clustering algorithms

As explained in Chapter 1, clustering algorithms can be classified into two models — generative or discriminative. There are other categorizations of clustering, e.g., hierarchical or partitional (Jain, Murty, & Flynn, 1999), depending on whether the algorithm clusters the data into a hierarchical structure or gives a flat partitioning of the data.

2.1.1 Hierarchical clustering

In hierarchical clustering, the data is not partitioned into clusters in a single step. Instead, a series of partitions takes place, which may run from a single cluster containing all objects to $N$ clusters each containing a single object. This gives rise to a hierarchy of clusterings, also known as the cluster dendrogram. Hierarchical clustering can be further categorized as:

• Divisive methods: Create the cluster dendrogram in a top-down divisive fashion, starting with all the data points in one cluster and splitting clusters successively according to some measure till a convergence criterion is reached, e.g., Cobweb (Fisher, 1987), recursive cluster-splitting using a statistical transformation (Dubnov, El-Yaniv, Gdalyahu, Schneidman, Tishby, & Yona, 2002), etc.;

• Agglomerative methods: Create the cluster dendrogram in a bottom-up agglomerative fashion, starting with each data point in its own cluster and merging clusters successively according to a similarity measure till a convergence criterion is reached, e.g., hierarchical agglomerative clustering (Kaufman & Rousseeuw, 1990), Birch (Zhang, Ramakrishnan, & Livny, 1996), etc.

2.1.2 Partitional clustering

Let $X = \{x_1, \ldots, x_N\}$ be the set of $N$ data points we want to cluster, with each $x_i \in \mathbb{R}^d$. A partitional clustering algorithm divides the data into $K$ partitions ($K$ given as input to the algorithm) by grouping the associated feature vectors into $K$ clusters. Partitional algorithms can be classified as:

• Graph-theoretic: These are discriminative clustering approaches, where an undirected graph $G = (V, E)$ is constructed from the dataset, each vertex $v_i \in V$ corresponding to a data point $x_i$ and the weight of each edge $e_{ij} \in E$ corresponding to the similarity between the data points $x_i$ and $x_j$ according to a domain-specific similarity measure. The $K$-clustering problem becomes equivalent to finding the $K$-mincut in this graph, which is known to be an NP-complete problem for $K \ge 3$ (Garey & Johnson, 1979). So, most graph-based clustering algorithms try to use good heuristic methods to group nodes so as to find low-cost cuts in $G$. Several different graph-theoretic algorithms have been proposed: methods like Rock (Guha, Rastogi, & Shim, 1999) and Chameleon (Karypis, Han, & Kumar, 1999) group nodes based on the idea of defining neighborhoods using inter-connectivity of nodes in $G$, Metis (Karypis & Kumar, 1998) performs fast multi-level heuristics on $G$ at multiple resolutions to give good partitions, while Opossum (Strehl & Ghosh, 2000) uses a modified cut criterion to ensure that the resulting clusters are well-balanced according to a specified balancing criterion.

• Density-based: These methods model clusters as dense regions and use different heuristics to find arbitrary-shaped high-density regions in the data space and group points accordingly. Well-known methods include Denclue, which tries to analytically model the overall density around a point (Hinneburg & Keim, 1998), and WaveCluster, which uses the wavelet transform to find high-density regions (Sheikholeslami, Chatterjee, & Zhang, 1998). Density-based methods typically have difficulty scaling up to very high dimensional data (> 10000 dimensions), which are common in domains like text.

• Mixture-model based: In mixture-model based clustering, the underlying assumption is that each of the $N$ data points $\{x_i\}_{i=1}^N$ to be clustered is generated by one of $K$ probability distributions $\{p_h\}_{h=1}^K$, where each distribution $p_h$ represents a cluster $X_h$. The probability of observing any point $x_i$ is given by:

$$p(x_i \mid \Theta) = \sum_{h=1}^{K} \alpha_h \, p_h(x_i \mid \theta_h)$$

where $\Theta = (\alpha_1, \ldots, \alpha_K, \theta_1, \ldots, \theta_K)$ is the parameter vector, $\alpha_h$ are the prior probabilities of the clusters ($\sum_{h=1}^{K} \alpha_h = 1$), and $p_h$ is the probability distribution of cluster $X_h$ parameterized by the set of parameters $\theta_h$. The data-generation process is assumed to be as follows – first, one of the $K$ components is chosen following their prior probability distribution $\{\alpha_h\}_{h=1}^K$; then, a data point is sampled following the distribution $p_h$ of the chosen component.

Since the cluster assignments of the points are not known, we assume the existence of a random variable $Y$ that encodes the cluster assignment $y_i$ for each data point $x_i$. It takes values in $\{1, \ldots, K\}$ and is always conditioned on the data point $x_i$ under consideration. The goal of clustering in this model is to find the estimates of the parameter vector $\Theta$ and the cluster assignment variable $Y$ such that the log-likelihood of the data:

$$L(\Theta, Y) = \sum_{i=1}^{N} \log p(x_i \mid \Theta, Y)$$

is maximized. Since $Y$ is unknown, the log-likelihood cannot be maximized directly. So, traditional approaches iteratively maximize the expected log-likelihood in the Expectation Maximization (EM) framework (Dempster et al., 1977). Starting from an initial estimate of $\Theta$, the EM algorithm iteratively improves the estimates of $\Theta$ and $p(Y \mid X, \Theta)$ such that the expected value of the complete-data log-likelihood computed over the class conditional distribution $p(Y \mid X, \Theta)$ is maximized. It can be shown that the EM algorithm converges to a local maximum of the expected log-likelihood (Dempster et al., 1977), and the final estimates of the conditional distribution $p(Y \mid X, \Theta)$ are used to find the cluster assignments of the points in $X$.

Most of the work in this area has assumed that the individual mixture density components $p_h$ are Gaussian, and in this case the parameters of the individual Gaussians are estimated by the EM procedure. The popular KMeans clustering algorithm (MacQueen, 1967) can be shown to be an EM algorithm on a mixture of $K$ Gaussians under certain assumptions. Details of this derivation are shown in Section 2.2.1. (A short sketch of this EM iteration is given after this list.)
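To make the EM iteration above concrete, here is a minimal sketch of EM for a mixture of identity-covariance Gaussians with learned priors; the function and variable names (em_gaussian_mixture, resp, alpha, mu) are our own, not from the proposal, and the sketch omits convergence checks and degenerate-cluster handling.

```python
import numpy as np

def em_gaussian_mixture(X, K, n_iter=100, seed=0):
    """Minimal EM for a mixture of K identity-covariance Gaussians.

    E-step: compute the posteriors p(y_i = h | x_i, Theta) ("responsibilities").
    M-step: re-estimate the mixture priors alpha_h and the means mu_h.
    """
    rng = np.random.default_rng(seed)
    N, d = X.shape
    mu = X[rng.choice(N, size=K, replace=False)]   # initial means: K random points
    alpha = np.full(K, 1.0 / K)                    # initial uniform priors
    for _ in range(n_iter):
        # E-step: log p(x_i, y_i = h) up to a constant, for identity covariance
        sq_dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        log_resp = np.log(alpha)[None, :] - 0.5 * sq_dist
        log_resp -= log_resp.max(axis=1, keepdims=True)   # numerical stability
        resp = np.exp(log_resp)
        resp /= resp.sum(axis=1, keepdims=True)           # p(y_i = h | x_i, Theta)
        # M-step: maximize the expected complete-data log-likelihood
        Nh = resp.sum(axis=0)
        alpha = Nh / N
        mu = (resp.T @ X) / Nh[:, None]
    return alpha, mu, resp
```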

2.2 Our representative clustering algorithms

In our work, we have chosen KMeans and Hierarchical Agglomerative Clustering as two representative clustering algorithms, from the partitional and hierarchical clustering categories respectively, on which our proposed semi-supervised schemes will be applied. The following sections give brief descriptions of KMeans and Hierarchical Agglomerative Clustering.

2.2.1 KMeans and variants

KMeans is a partitional clustering algorithm. It performs iterative relocation to partition a dataset into $K$ clusters, locally minimizing the average squared distance between the data points and the cluster centers. For a set of data points $X = \{x_i\}_{i=1}^N$, $x_i \in \mathbb{R}^d$, the KMeans algorithm creates a $K$-partitioning (i.e., $K$ disjoint subsets of $X$ whose union is $X$) $\{X_h\}_{h=1}^K$ of $X$ so that if $\{\mu_h\}_{h=1}^K$ represent the $K$ partition centers, then the following objective function

$$\mathcal{J}_{\mathrm{kmeans}} = \sum_{h=1}^{K} \sum_{x_i \in X_h} \|x_i - \mu_h\|^2 \qquad (2.1)$$

is locally minimized. Note that finding the global optimum of the KMeans objective function is an NP-complete problem (Garey, Johnson, & Witsenhausen, 1982). Let $y_i$ be the cluster assignment of the point $x_i$, where $y_i \in \{1, \ldots, K\}$. An equivalent form of the KMeans clustering objective function, which we will be using interchangeably, is:

$$\mathcal{J}_{\mathrm{kmeans}} = \sum_{i=1}^{N} \|x_i - \mu_{y_i}\|^2 \qquad (2.2)$$

The pseudocode for KMeans is given in Figure 2.1.

[Figure 2.1: KMeans algorithm. Input: the set of data points $X = \{x_1, \ldots, x_N\}$, $x_i \in \mathbb{R}^d$, and the number of clusters $K$. Output: a disjoint $K$-partitioning $\{X_h\}_{h=1}^K$ of $X$ such that the KMeans objective function is locally optimized. The $K$ initial centroids $\{\mu_h\}_{h=1}^K$ are selected, and the algorithm then repeats, until convergence, an assign_cluster step that assigns each data point $x$ to the cluster $h^* = \arg\min_h \|x - \mu_h\|^2$ and an estimate_means step that recomputes each centroid as the mean of the points assigned to it.]

If we have the additional constraint that the centroids $\{\mu_h\}_{h=1}^K$ are restricted to be selected from $X$, then the resulting problem is called KMedian clustering. KMedian clustering corresponds to an integer programming problem, for which many approximation algorithms have been proposed (Jain & Vazirani, 2001; Mettu & Plaxton, 2000).
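The iterative-relocation procedure of Figure 2.1 can be sketched in a few lines of Python; this is a minimal illustration of the algorithm that locally minimizes objective (2.1)/(2.2), with our own naming and a simple convergence test, not the proposal's exact pseudocode.

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Plain KMeans: alternate cluster assignment and mean re-estimation,
    locally minimizing J_kmeans = sum_i ||x_i - mu_{y_i}||^2 (Eq. 2.2)."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]   # random initial centroids
    for _ in range(n_iter):
        # assign_cluster: each point goes to its nearest centroid
        dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        y = dist.argmin(axis=1)
        # estimate_means: centroid of every non-empty cluster
        new_mu = np.array([X[y == h].mean(axis=0) if np.any(y == h) else mu[h]
                           for h in range(K)])
        if np.allclose(new_mu, mu):   # converged: assignments no longer change means
            break
        mu = new_mu
    return y, mu
```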

In certain high dimensional data, e.g. text, Euclidean distance is not a good measure of similarity. Certain high dimensional spaces like text have good directional properties, which has made directional similarity measures like the $L_2$-normalized dot product (cosine similarity) between the vector representations of text data a popular measure of similarity in the information retrieval community (Baeza-Yates & Ribeiro-Neto, 1999). Note that other similarity measures, e.g., probabilistic document overlap (Goldszmidt & Sahami, 1998), have also been used successfully for text clustering, but we will be focusing on cosine similarity in our work.

Spherical KMeans (SPKMeans) is a version of KMeans that uses cosine similarity as its underlying similarity metric. In the SPKMeans algorithm, standard KMeans is applied to data vectors $\{x_i\}_{i=1}^N$ that have been normalized to have unit $L_2$ norm, so that the data points lie on a unit sphere (Dhillon & Modha, 2001). Note that in SPKMeans, the centroid vectors $\{\mu_h\}_{h=1}^K$ are also constrained to lie on the unit sphere. Assuming $\|x_i\| = \|\mu_h\| = 1$ in (2.1), we get $\|x_i - \mu_h\|^2 = 2 - 2\, x_i^\top \mu_h$. Then, the clustering problem can be equivalently formulated as that of maximizing the objective function:

$$\mathcal{J}_{\mathrm{spkmeans}} = \sum_{h=1}^{K} \sum_{x_i \in X_h} x_i^\top \mu_h \qquad (2.3)$$

The SPKMeans algorithm gives a local maximum of this objective function. The SPKMeans algorithm is computationally efficient for sparse high dimensional data vectors, which are very common in domains like text clustering. For this reason, we have used SPKMeans in our experiments with text data.
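A minimal sketch of the SPKMeans variant follows; relative to the KMeans sketch above, the only changes are that the data and the centroids are kept $L_2$-normalized and that assignment maximizes the dot product (cosine similarity), per objective (2.3). Names are ours and sparse-matrix handling is omitted.

```python
import numpy as np

def spkmeans(X, K, n_iter=100, seed=0):
    """Spherical KMeans: points and centroids on the unit sphere, locally
    maximizing J_spkmeans = sum_h sum_{x in X_h} x . mu_h (Eq. 2.3)."""
    rng = np.random.default_rng(seed)
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)   # unit L2 norm
    mu = Xn[rng.choice(len(Xn), size=K, replace=False)]
    for _ in range(n_iter):
        y = (Xn @ mu.T).argmax(axis=1)                  # assign by max cosine similarity
        for h in range(K):
            if np.any(y == h):
                m = Xn[y == h].sum(axis=0)
                mu[h] = m / np.linalg.norm(m)           # re-project centroid onto sphere
    return y, mu
```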

Both KMeans and SPKMeans are model-based clustering algorithms, having well-defined underlying generative models. As mentioned earlier, KMeans can be considered as fitting a mixture of Gaussians to a dataset under certain assumptions. The assumptions are that the prior distribution $\{\alpha_h\}_{h=1}^K$ of the Gaussians is uniform, i.e., $\alpha_h = \frac{1}{K}$, and that each Gaussian has identity covariance. Then, the parameter set $\Theta$ in the EM framework consists of just the $K$ means $\{\mu_h\}_{h=1}^K$. With these assumptions, one can show that (Bilmes, 1997):

$$E[\log p(X \mid \Theta, Y)] = \sum_{i=1}^{N} \sum_{h=1}^{K} \Big( \log \tfrac{1}{K} + \log\big( (2\pi)^{-d/2} e^{-\frac{1}{2}\|x_i - \mu_h\|^2} \big) \Big)\, p(y_i = h \mid x_i, \mu_h) = -\frac{1}{2} \sum_{i=1}^{N} \sum_{h=1}^{K} \|x_i - \mu_h\|^2\, p(y_i = h \mid x_i, \mu_h) + c \qquad (2.4)$$

where $c$ is a constant. Further assuming that

$$p(y_i = h \mid x_i, \mu_h) = \begin{cases} 1 & \text{if } h = \arg\min_{h'} \|x_i - \mu_{h'}\|^2 \\ 0 & \text{otherwise,} \end{cases} \qquad (2.5)$$

and replacing it in (2.4), we note that the expectation term comes out to be the negative of the well-known KMeans objective function with an additive constant. (The assumption in (2.5) can also be derived by assuming the covariance of the Gaussians to be $\sigma^2 I$ and letting $\sigma \to 0$ (Kearns, Mansour, & Ng, 1997).) Thus, the problem of maximizing the expected log-likelihood under these assumptions is the same as that of minimizing the KMeans objective function. Keeping in mind the assumption in (2.5), the KMeans objective can be written as

$$\mathcal{J}_{\mathrm{kmeans}} = \sum_{i=1}^{N} \sum_{h=1}^{K} \|x_i - \mu_h\|^2 \, p(y_i = h \mid x_i, \mu_h) \qquad (2.6)$$

In a similar fashion, SPKMeans can be considered as fitting a mixture of von Mises-Fisher distributions to a dataset under some assumptions (Banerjee, Dhillon, Ghosh, & Sra, 2003).

In this proposal, we will be comparing our proposed semi-supervised KMeans algorithms to another semi-supervised variant of KMeans, called COP-KMeans (Wagstaff et al., 2001). In COP-KMeans, initial background knowledge, provided in the form of constraints between instances in the dataset, is used in the clustering process. It uses two types of constraints, must-link (two instances have to be together in the same cluster) and cannot-link (two instances have to be in different clusters). The pseudo-code for COP-KMeans is given in Figure 2.2.

[Figure 2.2: COP-KMeans algorithm. Given the set of data points, the number of clusters $K$, and sets of must-link and cannot-link constraints, the cluster centers are initialized (with any must-link constraints that a chosen center participates in enforced), and the algorithm then repeats until convergence: assign each data point to the closest cluster such that no must-link or cannot-link constraint is violated by the assignment (if no such assignment exists, the algorithm fails); update each cluster centroid to be the average of all the points assigned to it.]
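The constraint check at the heart of COP-KMeans' assignment step can be sketched as follows; this is a paraphrase in Python under our own naming (cop_assign, assignments as a dict from point index to cluster or None), not the original pseudocode, and the surrounding initialization/update loop is as in the KMeans sketch of Section 2.2.1.

```python
def cop_assign(i, dist_row, assignments, must_link, cannot_link):
    """Return the nearest cluster for point i that violates no constraint,
    or None if every cluster is ruled out (COP-KMeans then fails).
    dist_row[h] is the distance from point i to centroid h."""
    for h in sorted(range(len(dist_row)), key=lambda c: dist_row[c]):
        ok = True
        for (a, b) in must_link:      # must-link partner already placed in another cluster?
            j = b if a == i else a if b == i else None
            if j is not None and assignments.get(j) is not None and assignments[j] != h:
                ok = False
        for (a, b) in cannot_link:    # cannot-link partner already in cluster h?
            j = b if a == i else a if b == i else None
            if j is not None and assignments.get(j) == h:
                ok = False
        if ok:
            return h
    return None
```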

2.2.2 Hierarchical Agglomerative Clustering

Hierarchical agglomerative clustering (HAC) is a bottom-up hierarchical clustering algorithm. In HAC, points are initially allocated to singleton clusters, and at each step the “closest” pair of clusters is merged, where closeness is defined according to a similarity measure between clusters. The algorithm generally terminates when the specified “convergence criterion” is reached, which in our case is when the number of current clusters becomes equal to the number of clusters desired by the user. Different cluster-level similarity measures are used to determine the closeness between clusters to be merged – single-link, complete-link, or group-average (Manning & Schutze, 1999).

Different HAC schemes have recently been shown to have well-defined underlying generative models – single-link HAC corresponds to the probabilistic model of a mixture of branching random walks, complete-link HAC corresponds to uniform equal-radius hyperspheres, whereas group-average HAC corresponds to equal-variance configurations (Kamvar, Klein, & Manning, 2002). So, the HAC algorithms can be categorized as generative clustering algorithms.

The pseudo-code for HAC is given in Figure 2.3.

[Figure 2.3: HAC algorithm. Input: the set of data points $X = \{x_1, \ldots, x_N\}$. Output: a dendrogram representing a hierarchical clustering of $X$. Initialize clusters: each data point $x_i$ is placed in its own cluster $C_i$; these clusters form the leaves of the dendrogram and the set of current clusters. Repeat until convergence: merge the two closest clusters $C_i$ and $C_j$ from the current clusters to get a new cluster $C$; remove $C_i$ and $C_j$ from the current clusters and add $C$; add parent links from $C_i$ and $C_j$ to $C$ in the cluster dendrogram.]
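A minimal sketch of the HAC loop in Figure 2.3, using group-average similarity over a precomputed similarity matrix; the function name, the choice of linkage, and the stopping rule (merge down to a requested number of clusters) are our own illustrative choices.

```python
import numpy as np

def hac(sim, n_clusters):
    """Bottom-up agglomerative clustering from an N x N similarity matrix:
    start with singleton clusters and repeatedly merge the closest pair
    (group-average similarity) until n_clusters remain."""
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > n_clusters:
        best, best_pair = -np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # group-average similarity between clusters a and b
                s = np.mean([sim[i][j] for i in clusters[a] for j in clusters[b]])
                if s > best:
                    best, best_pair = s, (a, b)
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]   # merge C_a and C_b into one cluster
        del clusters[b]
    return clusters
```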

2.3 Clustering evaluation measures

Evaluation of the quality of output of clustering algorithms is a difficult problem in general, since there is no “gold-standard” solution in clustering. The commonly used clustering validation measures can be categorized as internal or external. Internal validation measures, e.g., ratio of average inter-cluster to intra-cluster similarity (higher the better), need only the data and the clustering for their measurement. External validation measures, on the other hand, match the clustering solution to some known prior knowledge, e.g., an underlying class labeling of the data. Many datasets in supervised learning have class information. We can evaluate a clustering algorithm by applying it to such a dataset (with the class label information removed), and then using the class labels of the data as the gold standard against which we can compare the quality of the data clustering obtained.

In our experiments, we have used three metrics for cluster evaluation: normalized mutual information (NMI), pairwise F-measure, and objective function. Of these, normalized mutual information and pairwise F-measure are external clustering validation metrics that estimate the quality of the clustering with respect to a given underlying class labeling of the data.

For clustering algorithms which optimize a particular objective function, we report the value of the objective function when the algorithm converges. For KMeans and SPKMeans, the objective function values reported are $\mathcal{J}_{\mathrm{kmeans}}$ and $\mathcal{J}_{\mathrm{spkmeans}}$ respectively. For the semi-supervised versions of KMeans, we report their corresponding objective function values. Since all the semi-supervised clustering algorithms we propose are iterative methods that (locally) minimize the corresponding clustering objective functions, looking at the objective function value after convergence would give us an idea of whether the semi-supervised algorithm under consideration generated a good clustering.

Normalized mutual information (NMI) determines the amount of statistical information shared by the random variables representing the cluster assignments and the user-labeled class assignments of the data points. We compute NMI following the methodology of Strehl et al. (Strehl, Ghosh, & Mooney, 2000). NMI measures how closely the clustering algorithm could reconstruct the underlying label distribution in the data. If $Y$ is the random variable denoting the cluster assignments of the points, and $C$ is the random variable denoting the underlying class labels on the points (Banerjee et al., 2003), then the NMI measure is defined as:

$$\mathrm{NMI} = \frac{I(Y; C)}{\big(H(Y) + H(C)\big)/2}$$

where $I(Y; C) = H(Y) - H(Y \mid C)$ is the mutual information between the random variables $Y$ and $C$, $H(Y)$ is the Shannon entropy of $Y$, and $H(Y \mid C)$ is the conditional entropy of $Y$ given $C$.
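Under the definition above (mutual information divided by the average of the two entropies), NMI can be estimated directly from the two label vectors; the contingency-table route below is our own minimal sketch, using natural-log entropies, which cancel in the ratio.

```python
import numpy as np

def nmi(cluster_labels, class_labels):
    """NMI = I(Y; C) / ((H(Y) + H(C)) / 2), estimated from label counts."""
    y, c = np.asarray(cluster_labels), np.asarray(class_labels)
    ys, cs = np.unique(y), np.unique(c)
    # joint distribution over (cluster, class) pairs, plus its marginals
    p_joint = np.array([[np.mean((y == a) & (c == b)) for b in cs] for a in ys])
    p_y, p_c = p_joint.sum(axis=1), p_joint.sum(axis=0)
    h = lambda p: -np.sum(p[p > 0] * np.log(p[p > 0]))   # Shannon entropy
    mi = h(p_y) + h(p_c) - h(p_joint.ravel())            # I(Y;C) = H(Y)+H(C)-H(Y,C)
    return mi / ((h(p_y) + h(p_c)) / 2)
```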

Pairwise F-measure is defined as the harmonic mean of pairwise precision and recall, the traditional information retrieval measures adapted for evaluating clustering by considering pairs of points. For every pair of points that do not have explicit constraints between them, the decision to cluster this pair into same or different clusters is considered to be correct if it matches with the underlying class labeling available for the points (Bilenko & Mooney, 2002). Pairwise F-measure is defined as:

$$\mathrm{Precision} = \frac{\#\,\text{pairs correctly predicted in same cluster}}{\text{total}\ \#\,\text{pairs predicted in same cluster}}$$

$$\mathrm{Recall} = \frac{\#\,\text{pairs correctly predicted in same cluster}}{\text{total}\ \#\,\text{pairs in same class}}$$

$$\text{F-measure} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

Pairwise F-measure is a generalization of measures like the Rand Index (Klein et al., 2002; Wagstaff et al., 2001; Xing et al., 2003) that are frequently used in other semi-supervised clustering projects. Mutual information (Cover & Thomas, 1991) has also become a popular clustering evaluation metric (Banerjee et al., 2003; Dom, 2001; Fern & Brodley, 2003). Recently, a symmetric cluster evaluation metric based on mutual information has been proposed, which has some useful properties, e.g., it is a true metric in the space of clusterings (Meila, 2003). Interestingly, we empirically observed from some of our experimental results that NMI and pairwise F-measure are highly correlated, an observation which we want to investigate further in the future.
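A minimal sketch of the pairwise F-measure computation follows; for simplicity it scores every pair of points, ignoring the exclusion of explicitly constrained pairs mentioned above, and the function name is ours.

```python
from itertools import combinations

def pairwise_f_measure(cluster_labels, class_labels):
    """Pairwise F-measure: precision/recall over pairs predicted in the same
    cluster vs. pairs actually in the same class."""
    same_cluster = same_class = both = 0
    for i, j in combinations(range(len(cluster_labels)), 2):
        in_cluster = cluster_labels[i] == cluster_labels[j]
        in_class = class_labels[i] == class_labels[j]
        same_cluster += in_cluster
        same_class += in_class
        both += in_cluster and in_class
    precision = both / same_cluster if same_cluster else 0.0
    recall = both / same_class if same_class else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```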

Note that the external cluster validation measures we have used (e.g., pairwise F-measure and NMI) are not completely definitive, since the clustering can find a grouping of the data that is different from the underlying class structure. For example, in our initial experiments on clustering articles from the CMU 20 Newsgroups data (where the main Usenet newsgroup to which an article was posted is considered to be its class label), we found one cluster that had articles from four underlying classes — alt.atheism, soc.religion.christian, talk.politics.misc, and talk.politics.guns. On closer observation, we noticed that all the articles in the cluster were about the David Koresh episode; this is a valid cluster, albeit different from the grouping suggested by the underlying class labels.

If we had human judges to evaluate the cluster quality, we could find an alternate external cluster validation measure — we could ask the human judges to rank data categorizations generated by humans and the clustering algorithm, and the quality of a clustering output would be considered to be high if the human judges could not reliably discriminate between a human categorization of the data and the grouping generated by the clustering algorithm. Since this is a time- and resource-consuming method of evaluation in the academic setting, we have used automatic external cluster validation methods like pairwise F-measure and NMI in our experiments.


Chapter 3

Completed Research

This chapter presents the results of our research so far on semi-supervised clustering. Section 3.1 describes how supervision in the form of labeled data can be incorporated into clustering. Section 3.2 describes a framework for considering pairwise constraints in clustering, while Section 3.3 outlines a method for selecting maximally informative constraints in a query-driven framework for pairwise constrained clustering. Finally, Section 3.4 presents a new semi-supervised clustering approach that unifies search-based and similarity-based techniques (see Section 1.3.2).

3.1 Semi-supervised clustering using labeled data

In this section, we give an outline of our initial work, in which we considered a scenario where supervision is incorporated into clustering in the form of labeled data. We used the labeled data to generate seed clusters that initialize a clustering algorithm, and used constraints generated from the labeled data to guide the clustering process. The underlying intuition is that proper seeding biases clustering towards a good region of the search space, thereby reducing the chances of it getting stuck in poor local optima, while simultaneously producing a clustering similar to the user-specified labels.

The importance of good seeding in clustering is well-known. In partitional clustering algorithms like EM (Dempster et al., 1977) or K-Means (MacQueen, 1967; Selim & Ismail, 1984), some commonly used approaches for initialization include simple random selection, taking the mean of the whole data and randomly perturbing it to get initial cluster centers (Dhillon, Fan, & Guan, 2001), or running K smaller clustering problems recursively to initialize K-Means (Duda et al., 2001). Some other interesting initialization methods include the Buckshot method of doing hierarchical clustering on a sample of the data to get an initial set of cluster centers (Manning & Schutze, 1999), running repeated K-Means on multiple data samples and clustering the K-Means solutions to get initial seeds (Fayyad, Reina, & Bradley, 1998), and selecting the K densest intervals along each co-ordinate to get the K cluster centers (Bradley, Mangasarian, & Street, 1997). Our approach is different from these because we use labeled data to get good initialization for clustering.

In the following section, we present the goal of clustering in the presence of labeled data.

3.1.1 Problem definition

Given a dataset $X$, as previously mentioned, KMeans clustering of the dataset generates a $K$-partitioning $\{X_h\}_{h=1}^K$ of $X$ so that the KMeans objective is locally minimized. Let $S \subseteq X$, called the seed set, be the subset of data points on which supervision is provided as follows: for each $x_i \in S$, the user provides the cluster $X_h$ of the partition to which it belongs. We assume that corresponding to each partition $X_h$ of $X$, there is typically at least one seed point $x_i \in S$. Note that we get a disjoint $K$-partitioning $\{S_h\}_{h=1}^K$ of the seed set $S$, so that each $x_i \in S_h$ belongs to $X_h$ according to the supervision. This partitioning of the seed set $S$ forms the seed clustering. The goal is to guide the KMeans algorithm towards the desired clustering of the whole data as illustrated by the seed clustering.


3.1.2 Seeded and constrained KMeans algorithms

We propose two algorithms for semi-supervised clustering with labeled data: seeded KMeans (S-KMeans) and constrained KMeans (C-KMeans).

In S-KMeans, the seed clustering is used to initialize the KMeans algorithm. Thus, rather than initializing KMeans from $K$ random means, the centroid of the $h$th cluster is initialized with the centroid of the $h$th partition $S_h$ of the seed set. The seed clustering is only used for initialization, and the seeds are not used in the following steps of the algorithm. The algorithm is presented in detail in Fig. 3.1.

[Figure 3.1: Seeded KMeans algorithm. Input: the set of data points $X = \{x_1, \ldots, x_N\}$, $x_i \in \mathbb{R}^d$, the number of clusters $K$, and the set $S = \cup_{h=1}^K S_h$ of initial seeds. Output: a disjoint $K$-partitioning $\{X_h\}_{h=1}^K$ of $X$ such that the KMeans objective function is locally optimized. The cluster centroids are initialized from the seed partitions, $\mu_h \leftarrow \frac{1}{|S_h|}\sum_{x \in S_h} x$, and the algorithm then repeats the standard assign_cluster and estimate_means steps until convergence.]

In C-KMeans, the seed clustering is used to initialize the KMeans algorithm as described for the S-KMeans algorithm. However, in the subsequent steps, the cluster memberships of the data points in the seed set are not re-computed in the assign cluster steps of the algorithm – the cluster labels of the seed data are kept unchanged, and only the labels of the non-seed data are re-estimated. The algorithm is given in detail in Fig. 3.2.

[Figure 3.2: Constrained KMeans algorithm. Input, output, and initialization are the same as for Seeded KMeans (Figure 3.1). In the assignment step, each seed point $x \in S_h$ is kept in cluster $h$ as specified by the user, while each non-seed point is assigned to the cluster with the nearest centroid; the means are then re-estimated, and the two steps are repeated until convergence.]

C-KMeans seeds the KMeans algorithm with the user-specified labeled data and keeps that labeling unchanged throughout the algorithm. In S-KMeans, the user-specified labeling of the seed data may be changed in the course of the algorithm. C-KMeans is appropriate when the initial seed labeling is noise-free, or if the user does not want the labels on the seed data to change, whereas S-KMeans is appropriate in the presence of noisy seeds.
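To make the difference concrete, here is a minimal sketch of both variants on top of the plain KMeans loop sketched in Section 2.2.1: both initialize the centroids from the seed partitions, and C-KMeans additionally clamps the seed labels in every assignment step. The function and parameter names are ours, and the sketch assumes (as in the problem definition) that every cluster has at least one seed.

```python
import numpy as np

def semi_supervised_kmeans(X, K, seeds, clamp_seeds=False, n_iter=100):
    """seeds: dict mapping point index -> user-specified cluster label (0..K-1).
    clamp_seeds=False gives Seeded-KMeans; clamp_seeds=True gives Constrained-KMeans."""
    # initialize each centroid as the mean of its seed partition S_h
    mu = np.array([X[[i for i, h in seeds.items() if h == c]].mean(axis=0)
                   for c in range(K)])
    for _ in range(n_iter):
        dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        y = dist.argmin(axis=1)
        if clamp_seeds:                       # C-KMeans: seed labels never change
            for i, h in seeds.items():
                y[i] = h
        mu = np.array([X[y == c].mean(axis=0) if np.any(y == c) else mu[c]
                       for c in range(K)])
    return y, mu
```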


3.1.3 Motivation of semi-supervised KMeans algorithms

The two semi-supervised KMeans algorithms presented in the last section can be motivated by considering KMeans in the EM framework, as shown in Section 2.1. The only “missing data” for the KMeans problem are the conditional distributions of the cluster labels given the points and the parameters, i.e., $p(y_i = h \mid x_i, \mu_h)$. Knowledge of these distributions solves the clustering problem, but normally there is no way to compute them. In the semi-supervised clustering framework, the user provides information about some of the data points that specifies the corresponding conditional distributions. Thus, semi-supervision by providing labeled data is equivalent to providing information about the conditional distributions $p(y_i = h \mid x_i, \mu_h)$.

In standard KMeans without any initial supervision, the $K$ means are chosen randomly in the initial M-step and the data points are assigned to the nearest means in the subsequent E-step. As explained above, every point $x_i$ in the dataset has $K$ possible conditional distributions associated with it (each satisfying (2.5)) corresponding to the $K$ means to which it can belong. This assignment of data point $x_i$ to a random cluster in the first E-step is similar to picking one conditional distribution at random from the $K$ possible conditional distributions.

In S-KMeans, the initial supervision is equivalent to specifying the conditional distributions $p(y_i = h \mid x_i, \mu_h)$ for the seed points $x_i \in S$. The specified conditional distributions of the seed data are just used in the initial M-step of the algorithm, and $p(y_i = h \mid x_i, \mu_h)$ is re-estimated for all $x_i \in X$ in the following E-steps of the algorithm.

In C-KMeans, the initial M-step is the same as in S-KMeans. The difference is that for the seed data points, the initial labels, i.e., the conditional distributions $p(y_i = h \mid x_i, \mu_h)$, are kept unchanged throughout the algorithm, whereas the conditional distributions for the non-seed points are re-estimated at every E-step.

Note that in the SPKMeans framework (Sec. 2.2.1), since every point lies on the unit sphere so that $\|x_i\| = \|\mu_h\| = 1$, the expectation term in (2.4) becomes equivalent to

$$E[\log p(X \mid \Theta, Y)] = \sum_{i=1}^{N} \sum_{h=1}^{K} x_i^\top \mu_h \, p(y_i = h \mid x_i, \mu_h) + c$$

So, maximizing the SPKMeans objective function is equivalent to maximizing the expectation of the complete-data log-likelihood in the E-step of the EM algorithm.

It can also be shown that getting good seeding is very essential for centroid-based clustering algorithms like KMeans. As shown in Section 2.2.1, under certain generative model-based assumptions, one can connect the mixture of Gaussians model to the KMeans model. A direct calculation using Chernoff bounds shows that if a particular cluster (with an underlying Gaussian model) with true centroid $\mu$ is seeded with $m$ points (drawn independently at random from the corresponding Gaussian distribution) and the estimated centroid is $\hat{\mu}$, then

$$P\big(\|\hat{\mu} - \mu\| \ge \delta\big) \le 2\, e^{-m\delta^2/2} \qquad (3.1)$$

where $\delta > 0$ (Banerjee, 2001). Thus, the probability of deviation of the centroid estimates falls exponentially with the number of seeds, and hence seeding results in good initial centroids.

3.1.4 Experimental results

Our experiments demonstrated the advantages of S-KMeans and C-KMeans over standard random seeding and COP-KMeans (Wagstaff et al., 2001), a previously developed semi-supervised clustering algorithm described in Section 2.2.1. Details of the experimental results can be found in (Basu et al., 2002).

Datasets

We are showing results of our experiments on 2 high-dimensional text data sets – a subset of CMU 20 Newsgroups (2000 documents, 21631 words) and Yahoo! News K-series (2340 documents, 12229 words) datasets. Both datasets had 20 classes, and the clustering algorithms were asked to generate the same number of clusters. For each dataset, we ran 4 clustering algorithms – S-KMeans, C-KMeans, COP-KMeans, and random KMeans. In random KMeans, the $K$ means were initialized by taking the mean of the entire data and randomly perturbing it $K$ times (Fayyad et al., 1998). This technique of initialization has given good results in unsupervised KMeans in previous work (Dhillon et al., 2001). We compared the performance of these 4 methods on the 2 datasets with varying seeding and noise levels, using 10-fold cross validation. For each dataset, SPKMeans was used as the underlying KMeans algorithm for all the 4 KMeans variants.

Learning curves with cross validation

For all the algorithms, on each dataset, we generated learning curves with 10-fold cross-validation. For studying the effect of seeding, 10% of the dataset was set aside as the test set at any particular fold. The training sets at different points of the learning curve were obtained from the remaining 90% of the data by varying the seed fraction from 0.0 to 1.0 in steps of 0.1, and the results at each point on the learning curve were obtained by averaging over 10 folds. The clustering algorithm was run on the whole dataset, but the evaluation metrics (objective function and NMI measure) were calculated only on the test set. For studying the effects of noise in the seeding, learning curves were generated by keeping a fixed fraction of seeding and varying the noise fraction.
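The cross-validated learning-curve protocol just described can be sketched schematically as follows; the fold split, seed fractions, clustering of the whole dataset, and test-set-only evaluation mirror the description above, while the function names and the cluster_fn/eval_fn callbacks are placeholders of our own.

```python
import numpy as np

def learning_curve(X, labels, cluster_fn, eval_fn, n_folds=10, seed=0):
    """For each fold: hold out 10% as the test set, vary the seed fraction over the
    remaining 90%, cluster the whole dataset, and evaluate only on the test set."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    curve = {round(f, 1): [] for f in np.arange(0.0, 1.01, 0.1)}
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate(folds[:k] + folds[k + 1:])
        for frac in curve:
            n_seed = int(frac * len(train))
            seeds = {int(i): labels[i] for i in train[:n_seed]}   # labeled seed subset
            assignment = cluster_fn(X, seeds)                     # cluster the whole dataset
            curve[frac].append(eval_fn(assignment[test], labels[test]))
    return {f: float(np.mean(v)) for f, v in curve.items()}       # average over folds
```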

Results

[Figure 3.3: Comparison of NMI values on the reduced 20 Newsgroup dataset with increasing seed fraction, with noise fraction = 0. The plot shows the NMI metric (y-axis) against the seed fraction from 0 to 1 (x-axis) for Seeded-KMeans, Constrained-KMeans, COP-KMeans, and Random-KMeans.]

The semi-supervised algorithms (S-KMeans and C-KMeans) performed better than the unsupervised algorithm (random KMeans) in terms of the NMI evaluation measure, as shown in Figs. 3.3 and 3.4. For balanced datasets (e.g., reduced 20 Newsgroups), the NMI value of the S-KMeans algorithm (see Figure 3.3) seems to increase exponentially with increasing amounts of labeled data along the learning curve, a phenomenon which we will explain later. Similar results were also obtained for the objective function. Both S-KMeans and C-KMeans performed better than COP-KMeans in most cases, when the seeds have no noise.

We also performed experiments where we simulated noise in the seed data by changing the labels of a fraction of the seed examples to a random incorrect value. The results in Figure 3.5 show that S-KMeans is quite robust against noisy seeding, since it takes the benefit of initialization using the supervised data but does not enforce the noisy constraints to be satisfied during every clustering iteration (as C-KMeans and COP-KMeans do).

We also ran initial experiments with incomplete seeding, where seeds are not specified for every cluster. From Fig. 3.6, it can be seen that the NMI metric did not decrease substantially with an increase in the number of unseeded categories, showing that the semi-supervised clustering algorithms could extend the seed clusters and generate more clusters in order to fit the regularity of the data.


Figure 3.4: Comparison of NMI values on the Yahoo! News dataset with increasing seed fraction, with noise fraction = 0

Figure 3.5: Comparison of NMI values on the reduced 20 Newsgroup dataset with increasing noise fraction, with seed fraction = 0.5


Figure 3.6: Comparison of NMI values on the full 20 Newsgroups data with increasing incompleteness fraction, with seed fraction = 0.1

Further experimental results on other datasets can be found in (Basu et al., 2002).

3.2 Semi-supervised clustering using pairwise constraints

In this work, we considered a framework that has pairwise must-link and cannot-link constraints between points in a dataset (with an associated cost of violating each constraint), in addition to distances between the points. These constraints specify that two examples must be in the same cluster (must-link) or in different clusters (cannot-link) (Wagstaff et al., 2001). In real-world unsupervised learning tasks, e.g., clustering for speaker identification in a conversation (Hillel et al., 2003), visual correspondence in multi-view image processing (Boykov, Veksler, & Zabih, 1998), and clustering multi-spectral information from Mars images (Wagstaff, 2002), supervision in the form of constraints is generally more practical than providing class labels, since true labels may be unknown a priori, while it can be easier for a user to specify whether pairs of points belong to the same cluster or to different clusters. Constraints are a more general way to provide supervision in clustering than labels: given a set of labeled points, one can always infer a unique equivalent set of pairwise must-link and cannot-link constraints, but not vice versa.

We proposed a cost function for pairwise constrained clustering (PCC) that can be shown to be the energy of a configuration of a Markov random field (MRF) over the data with a well-defined potential function and noise model. The pairwise constrained clustering problem then becomes equivalent to finding the MRF configuration with the highest probability, or, in other words, minimizing its energy. We developed an iterative KMeans-type algorithm for solving this problem. Previous work in the PCC framework includes the hard-constrained COP-KMeans algorithm (Wagstaff et al., 2001) and the soft-constrained SCOP-KMeans algorithm (Wagstaff, 2002), which have heuristically motivated objective functions. Our formulation, on the other hand, has a well-defined underlying generative model. Bansal et al. (Bansal, Blum, & Chawla, 2002) also proposed a theoretical model for pairwise constrained clustering, but their model uses only pairwise constraints for clustering, whereas our formulation uses both constraints and an underlying distance metric. Pairwise clustering models have also been proposed for other non-parametric clustering algorithms (Dubnov et al., 2002).


3.2.1 Problem definition

Centroid-based partitional clustering algorithms (e.g., KMeans) find a disjoint $K$-partitioning $\{X_h\}_{h=1}^{K}$ (with each partition having a centroid $\mu_h$) of a dataset $X = \{x_i\}_{i=1}^{N}$ such that the total distance between the points and the cluster centroids is (locally) minimized. We introduce a framework for pairwise constrained clustering (PCC) that has pairwise must-link and cannot-link constraints (Wagstaff et al., 2001) between a subset of points in the dataset (with a cost of violating each constraint), in addition to distances between points. Since centroid-based clustering cannot handle pairwise constraints explicitly, we formulate the goal of clustering in the PCC framework as minimizing a combined objective function: the sum of the total distance between the points and their cluster centroids and the cost of violating the pairwise constraints.

For the PCC framework with both must-link and cannot-link constraints, let $M$ be the set of must-link pairs such that $(x_i, x_j) \in M$ implies $x_i$ and $x_j$ should be assigned to the same cluster, and let $C$ be the set of cannot-link pairs such that $(x_i, x_j) \in C$ implies $x_i$ and $x_j$ should be assigned to different clusters. Note that the tuples in $M$ and $C$ are order-independent, i.e., $(x_i, x_j) \in M \Leftrightarrow (x_j, x_i) \in M$, and similarly for $C$. Let $W = \{w_{ij}\}$ and $\bar{W} = \{\bar{w}_{ij}\}$ be two sets that give the weights corresponding to the must-link constraints in $M$ and the cannot-link constraints in $C$ respectively. Let $l_i$ be the cluster assignment of a point $x_i$, where $l_i \in \{1, \ldots, K\}$. Let $f_M$ and $f_C$ be two metrics that quantify the cost of violating must-link and cannot-link constraints (Kleinberg & Tardos, 1999). We restrict our attention to $f_M(l_i, l_j) = \mathbb{1}[l_i \neq l_j]$ and $f_C(l_i, l_j) = \mathbb{1}[l_i = l_j]$, where $\mathbb{1}$ is the indicator function ($\mathbb{1}[\text{true}] = 1$, $\mathbb{1}[\text{false}] = 0$). Using this model, the problem of PCC under must-link and cannot-link constraints is formulated as minimizing the following objective function, where point $x_i$ is assigned to the partition $X_{l_i}$ with centroid $\mu_{l_i}$:

$$\mathcal{J}_{\mathrm{pckmeans}} = \sum_{x_i \in X} \|x_i - \mu_{l_i}\|^2 \;+\; \sum_{(x_i, x_j) \in M} w_{ij}\, \mathbb{1}[l_i \neq l_j] \;+\; \sum_{(x_i, x_j) \in C} \bar{w}_{ij}\, \mathbb{1}[l_i = l_j] \qquad (3.2)$$

This objective function tries to minimize the total distance between points and their cluster centroids while violating as few of the specified constraints as possible. The goal of any clustering algorithm in the PCC framework is to minimize this combined objective function.
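As a concrete reading of (3.2), the following Python sketch computes the PCC objective for a given assignment; the data structures (constraint dictionaries keyed by index pairs) are our own illustrative choices.

import numpy as np

def pckmeans_objective(X, labels, centroids, must_links, cannot_links):
    """Objective (3.2): squared distance to centroids plus weighted constraint violations.

    must_links / cannot_links: dicts mapping a point-index pair (i, j) to its weight.
    """
    J = sum(np.sum((X[i] - centroids[labels[i]]) ** 2) for i in range(len(X)))
    # A must-link is violated when the pair lands in different clusters.
    J += sum(w for (i, j), w in must_links.items() if labels[i] != labels[j])
    # A cannot-link is violated when the pair lands in the same cluster.
    J += sum(w for (i, j), w in cannot_links.items() if labels[i] == labels[j])
    return J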

3.2.2 Pairwise constrained KMeans algorithm

Given a set of data points $X$, a set of must-link constraints $M$, a set of cannot-link constraints $C$, the weights of the constraints, and the number of clusters to form $K$, we propose an algorithm PC-KMeans that finds a disjoint $K$-partitioning $\{X_h\}_{h=1}^{K}$ of $X$ (with each partition having a centroid $\mu_h$) such that $\mathcal{J}_{\mathrm{pckmeans}}$ is (locally) minimized.

In the initialization step of PC-KMeans, assuming consistency of constraints, we take the transitive closure of the must-link constraints (Wagstaff et al., 2001) and augment the set $M$ by adding these entailed constraints. Let the number of connected components in the augmented set $M$ be $\lambda$; these components are used to create $\lambda$ neighborhood sets $\{N_p\}_{p=1}^{\lambda}$. For every pair of neighborhoods $N_p$ and $N_{p'}$ that have at least one cannot-link between them, we add cannot-link constraints between every pair of points in $N_p$ and $N_{p'}$ and augment the cannot-link set $C$ by these entailed constraints, again assuming consistency of constraints. From now on, we will refer to the augmented must-link and cannot-link sets as $M$ and $C$ respectively.

Note that the neighborhood sets $N_p$, which contain the neighborhood information inferred from the must-link constraints and are unchanged during the iterations of the algorithm, are different from the partition sets $X_h$, which contain the cluster partitioning information and get updated at each iteration of the algorithm. After this preprocessing step we get $\lambda$ neighborhood sets $\{N_p\}_{p=1}^{\lambda}$, which are used to initialize the cluster centroids.

If $\lambda \geq K$, where $K$ is the required number of clusters, we select the $K$ neighborhood sets of largest size and initialize the $K$ cluster centers with the centroids of these sets. If $\lambda < K$, we initialize $\lambda$ cluster centers with the centroids of the $\lambda$ neighborhood sets. We then look for a point $x$ that is connected by cannot-links to every neighborhood set. If such a point exists, it is used to initialize the $(\lambda + 1)$-th cluster. If there are any more cluster centroids left uninitialized, we initialize them by random perturbations of the global centroid of $X$.

The algorithm PC-KMeans alternates between the cluster assignment and centroid estimation steps (see Figure 3.7). In the cluster assignment step of PC-KMeans, every point $x$ is assigned to a cluster such that it minimizes the sum of the distance of $x$ to the cluster centroid and the cost of constraint violations incurred by that cluster assignment (i.e., the assignment satisfies as many of the given must-links and cannot-links as possible).
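A sketch of this initialization, assuming consistent constraints: must-link neighborhoods are obtained as connected components (transitive closure) via a simple union-find, and up to K of the largest neighborhoods seed the centroids. The helper names and the fallback perturbation scale are ours, and the cannot-link-based (lambda+1)-th centroid step is omitted for brevity.

import numpy as np

def infer_neighborhoods(n_points, must_links):
    """Connected components of the must-link graph (transitive closure)."""
    parent = list(range(n_points))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in must_links:
        parent[find(i)] = find(j)
    components = {}
    for i in range(n_points):
        components.setdefault(find(i), []).append(i)
    # Keep only neighborhoods actually touched by a must-link constraint.
    constrained = {i for pair in must_links for i in pair}
    return [idx for idx in components.values() if constrained.intersection(idx)]

def initialize_centroids(X, k, neighborhoods, seed=0):
    rng = np.random.default_rng(seed)
    hoods = sorted(neighborhoods, key=len, reverse=True)[:k]
    centroids = [X[idx].mean(axis=0) for idx in hoods]
    while len(centroids) < k:
        # Remaining centroids: random perturbations of the global centroid.
        centroids.append(X.mean(axis=0) + rng.normal(0, 0.01, size=X.shape[1]))
    return np.array(centroids)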


Figure 3.7: Pairwise Constrained KMeans algorithm

The centroid re-estimation step is the same as in KMeans. It is easy to show that the objective function decreases after every cluster assignment and re-estimation step until convergence, implying that the PC-KMeans algorithm will converge to a local minimum of $\mathcal{J}_{\mathrm{pckmeans}}$. The proof is given in detail in (Basu, Banerjee, & Mooney, 2003a).
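A sketch of the PC-KMeans cluster-assignment step described above, processing points greedily (and hence order-dependently); the data structures mirror the earlier objective sketch and the names are ours.

import numpy as np

def assign_clusters(X, centroids, labels, must_links, cannot_links):
    """Greedy assignment: each point picks the cluster minimizing distance plus violation cost."""
    k = len(centroids)
    for i in range(len(X)):
        costs = np.sum((centroids - X[i]) ** 2, axis=1).astype(float)
        for (a, b), w in must_links.items():
            if i in (a, b):
                other = b if i == a else a
                # Penalize clusters that would separate a must-linked pair.
                costs += w * (np.arange(k) != labels[other])
        for (a, b), w in cannot_links.items():
            if i in (a, b):
                other = b if i == a else a
                # Penalize the cluster that would merge a cannot-linked pair.
                costs += w * (np.arange(k) == labels[other])
        labels[i] = int(np.argmin(costs))
    return labels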

3.2.3 Motivation of the PCC framework

The mathematical formulation of the PCC framework was motivated by the metric labeling problem and the generalized Potts model (Boykov et al., 1998; Kleinberg & Tardos, 1999), for which Kleinberg et al. (Kleinberg & Tardos, 1999) proposed an approximation algorithm. Their formulation only considers the set $M$ of must-link constraints, which we extended to the PCC framework by adding the set $C$ of cannot-link constraints. Our proposed pairwise constrained KMeans (PC-KMeans) algorithm greedily optimizes $\mathcal{J}_{\mathrm{pckmeans}}$ using a KMeans-type iteration with a modified cluster-assignment step.

The PCC formulation can be motivated by considering a Markov Random Field (MRF) (Boykov et al., 1998) defined over $X$, such that the field $L = \{L_i\}_{i=1}^{N}$ of random variables over $X$ takes values $l_i \in \{1, \ldots, K\}$. Let a configuration $L$ denote the joint event $\{L_1 = l_1, \ldots, L_N = l_N\}$. Restricting the model to MRFs whose clique potentials involve pairwise points, the prior probability of a configuration $L$ is $\Pr(L) \propto \exp\big(-\sum_{i,j} V_{(i,j)}(l_i, l_j)\big)$, where

$$V_{(i,j)}(l_i, l_j) = \begin{cases} w_{ij}\, \mathbb{1}[l_i \neq l_j] & \text{if } (x_i, x_j) \in M \\ \bar{w}_{ij}\, \mathbb{1}[l_i = l_j] & \text{if } (x_i, x_j) \in C \\ 0 & \text{otherwise} \end{cases}$$


Assuming an identity-covariance Gaussian noise model for the observed data (a von Mises-Fisher distribution (Mardia & Jupp, 2000) was considered as the noise model for high-dimensional text data)¹ and assuming the observed data have been drawn independently, if $\{\mu_1, \ldots, \mu_K\}$ denote the true representatives corresponding to the labels $\{1, \ldots, K\}$, the conditional probability of the observation $X$ for a given configuration $L$ is $\Pr(X \mid L) \propto \exp\big(-\sum_{x_i \in X} \|x_i - \mu_{l_i}\|^2\big)$. Then, since the posterior probability of a configuration is $\Pr(L \mid X) \propto \Pr(L)\,\Pr(X \mid L)$, finding the MAP configuration boils down to minimizing the energy of the configuration, which is exactly as given by (3.2). Hence, the PCC objective function is really the energy function over the specified MRF.
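To spell out this last step, here is a short derivation sketch under the stated assumptions, with all configuration-independent constants absorbed into an additive term:

\begin{align*}
-\log \Pr(L \mid X) &= -\log \Pr(L) - \log \Pr(X \mid L) + \text{const} \\
&= \sum_{x_i \in X} \|x_i - \mu_{l_i}\|^2
  + \sum_{(x_i, x_j) \in M} w_{ij}\, \mathbb{1}[l_i \neq l_j]
  + \sum_{(x_i, x_j) \in C} \bar{w}_{ij}\, \mathbb{1}[l_i = l_j] + \text{const},
\end{align*}

which is $\mathcal{J}_{\mathrm{pckmeans}}$ of (3.2) up to an additive constant, so the MAP configuration is exactly the assignment that minimizes the PCC objective.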

3.2.4 Experimental results

The experimental results showing the improved performance of the PCC clustering framework over unsupervised clustering are presented in Section 3.3.4, where we show the benefit of the PCC framework as well as the usefulness of selecting pairwise constraints using our proposed active learning algorithm.

3.3 Active learning for semi-supervised clustering

In order to maximize the utility of the limited supervised data available in a semi-supervised setting, supervised training examples should be actively selected as maximally informative ones rather than chosen at random, if possible (McCallum & Nigam, 1998). In the PCC framework, this would imply that fewer constraints will be required to significantly improve the clustering accuracy. To this end, we developed a new method for actively selecting good pairwise constraints for semi-supervised clustering in the PCC framework.

Previous work in active learning has been mostly restricted to classification, where different principles of query selection have been studied, e.g., reduction of the version space size (Freund, Seung, Shamir, & Tishby, 1997), reduction of uncertainty in the predicted label (Lewis & Gale, 1994), maximizing the margin on training data (Abe & Mamitsuka, 1998), and finding high variance data points by density-weighted pool-based sampling (McCallum & Nigam, 1998). However, active learning techniques in classification are not applicable in the clustering framework, since the basic underlying concept of reduction of classification error and variance over the distribution of examples (Cohn, Ghahramani, & Jordan, 1996) is not well-defined for clustering. In the unsupervised setting, Hofmann et al. (Hofmann & Buhmann, 1998) consider a model of active learning which is different from ours: they have incomplete pairwise similarities between points, and their active learning goal is to select new data, using the expected value of information estimated from the existing data, such that the risk of making wrong estimates about the true underlying clustering from the existing incomplete data is minimized. In contrast, our model assumes that we have complete similarity information between all pairs of points, along with pairwise constraints whose violation cost is a component of the objective function (3.2), and the active learning goal is to select pairwise constraints that are most informative about the underlying clustering. Klein et al. (Klein et al., 2002) also consider active learning in semi-supervised clustering, but instead of making example-level queries they make cluster-level queries, i.e., they ask the user whether or not two whole clusters should be merged. Answering example-level queries rather than cluster-level queries is a much easier task for a user, making our model more practical in a real-world active learning setting.

3.3.1 Problem definition

Formally, the scheme has access to a noiseless oracle that can assign a must-link or cannot-link label to a given pair $(x_i, x_j)$, and it can pose a constant number of queries to the oracle.² Given a fixed number of queries that can be made to the oracle, the goal is to select the $M$ and $C$ sets such that the PC-KMeans clustering algorithm, using those pairwise constraints, reaches a local optimum of the objective function $\mathcal{J}_{\mathrm{pckmeans}}$ that is better than the local optimum it would reach with the same number of randomly chosen constraints.

1. The framework can be shown to hold for arbitrary exponential noise models.
2. The oracle can also give a don't-know response to a query, in which case that response is ignored (the pair is not considered as a constraint) and the query is not posed again later.


3.3.2 Explore and Consolidate algorithms

The proposed active learning scheme has two phases. The first phase explores the given data to get $K$ pairwise disjoint non-null neighborhoods, each belonging to a different cluster in the underlying clustering of the data, as fast as possible. Note that even if there is only one point per neighborhood, this neighborhood structure defines a correct skeleton of the underlying clustering. For this phase, we propose an algorithm Explore that essentially uses the farthest-first scheme (Dasgupta, 2002; Hochbaum & Shmoys, 1985) to form appropriate queries for getting the required pairwise disjoint neighborhoods. For our data model, we prove another interesting property of farthest-first traversal (see (Basu et al., 2003a) for more details) that justifies its use for active learning.

At the end of Explore, at least one point has been obtained per cluster. The remaining queries are used to consolidate this structure. The cluster skeleton obtained from Explore is used to initialize $K$ pairwise disjoint non-null neighborhoods $\{N_p\}_{p=1}^{K}$. Then, given any point $x$ not in any of the existing neighborhoods, we have to ask at most $(K-1)$ queries, pairing $x$ with a member from each of the disjoint neighborhoods $N_p$, to find out the neighborhood to which $x$ belongs. Note that it is practical to sort the neighborhoods in increasing order of the distance of their centroids from $x$, so that the correct must-link neighborhood for $x$ is encountered sooner in the querying process. This principle forms the second phase of our active learning algorithm, and we call the algorithm for this phase Consolidate. In this phase, we are able to get the correct cluster label of $x$ by asking at most $(K-1)$ queries. So, $(K-1)$ pairwise labels are equivalent to a single pointwise label in the worst case, when $K$ is known.

The details of the algorithms for performing the exploration and the consolidation phases are given in Figures 3.8 and 3.9.

Figure 3.8: Algorithm Explore

3.3.3 Motivation of active constraint selection algorithm

In Section 3.1.3, it was observed that initializing KMeans with centroids estimated from a set of labeled examples for each cluster gives significant performance improvements. Since good initial centroids are very critical for the success of greedy algorithms such as KMeans, we follow the same principle for the pairwise case: we try to get as many points (proportional to the actual cluster size) as possible per cluster, so that PC-KMeans is initialized from a very good set of centroids.

In the exploration phase, we use a very interesting property of the farthest-first traversal. Given a set of $K$ disjoint balls of unequal size in a metric space, we show that the farthest-first scheme is sure to get one point from each of the $K$ balls in a reasonably small number of attempts (see (Basu et al., 2003a) for details).


Figure 3.9: Algorithm Consolidate
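A compact Python sketch of the two phases as described above, written against a hypothetical oracle(i, j) interface that returns True for must-link and False for cannot-link; it is illustrative only and omits the query budgeting of the actual Explore procedure in Figures 3.8 and 3.9.

import numpy as np

def explore(X, k, oracle):
    """Farthest-first traversal: grow k disjoint neighborhoods, one per underlying cluster."""
    rng = np.random.default_rng(0)
    hoods = [[int(rng.integers(len(X)))]]          # first neighborhood: a random point
    while len(hoods) < k:
        traversed = [i for h in hoods for i in h]
        # Farthest-first point: maximizes its minimum distance to points already traversed.
        dists = np.array([min(np.sum((X[i] - X[j]) ** 2) for j in traversed)
                          for i in range(len(X))])
        dists[traversed] = -1.0
        x = int(np.argmax(dists))
        linked = None
        for h in hoods:                            # one query per existing neighborhood suffices
            if oracle(x, h[0]):
                linked = h
                break
        if linked is None:
            hoods.append([x])                      # cannot-linked to all: start a new neighborhood
        else:
            linked.append(x)
    return hoods

def consolidate(X, hoods, oracle, n_queries):
    """Use remaining queries to add non-neighborhood points to their correct neighborhood."""
    used = 0
    in_hood = {i for h in hoods for i in h}
    for x in range(len(X)):
        if x in in_hood or used >= n_queries:
            continue
        centroids = [X[h].mean(axis=0) for h in hoods]
        order = np.argsort([np.sum((X[x] - c) ** 2) for c in centroids])
        for p in order:                            # nearest neighborhoods are queried first
            used += 1
            if oracle(x, hoods[p][0]):
                hoods[p].append(x)
                break
    return hoods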

Both Explore and Consolidate add points to the clusters at a good rate. Using results proved in (Basu et al., 2003a), it can be shown that the Explore phase gets at least one point from each of the $K$ underlying clusters within a bounded number of queries. Once the active scheme has finished running Explore and is running Consolidate, it can also be shown, using a generalization of the coupon collector's problem (Motwani & Raghavan, 1995), that with high probability Consolidate will get one new point from each cluster in a comparatively small number of additional queries, so that it adds points to clusters at a faster rate than Explore (see (Basu et al., 2003a) for the precise bounds).

3.3.4 Experimental results

In this section, we present our experiments with active selection of constraints in the PCC framework.

Datasets

We ran experiments on both high-dimensional text datasets and low-dimensional UCI (Blake & Merz, 1998) datasets. Here we present the results on a subset of the text data Classic3 (Dhillon & Modha, 2001), containing 400 documents – 100 Cranfield documents from aeronautical system papers, 100 Medline documents from medical journals, and 200 Cisi documents from information retrieval papers. This Classic3-subset dataset was specifically designed to create clusters of unequal size, and has 400 points in 2897 dimensions after standard pre-processing steps like stop-word removal, tf-idf weighting, and removal of very high-frequency and very low-frequency words (Dhillon & Modha, 2001). From the UCI collection we selected Iris, which is a well-known dataset having 150 points in 4 dimensions. We used the active pairwise constrained version of KMeans on Iris, and SPKMeans on Classic3-subset.

Learning curves with cross validation

For all algorithms on each dataset, we generated learning curves with 10-fold cross-validation. The same methodology was used to generate the learning curves as in Section 3.1.4, except that here the x-axis represents the number of pairwise constraints given as input to the algorithms. For non-active PCKMeans the pairwise constraints were selected at random, while for active PCKMeans the pairwise constraints were selected using our active learning scheme.


Figure 3.10: Comparison of NMI values on Classic400

Figure 3.11: Comparison of pairwise F-measure values on Classic400


Figure 3.12: Comparison of NMI values on Iris

Figure 3.13: Comparison of pairwise F-measure values on Iris


Results

In the domains that we considered, e.g., text clustering, different costs for different pairwise constraints are not available in general, so for simplicity we assume all elements of $W$ and $\bar{W}$ to have the same constant value $w$ in (3.2). The parameter $w$ acts as a tuning knob, giving us a continuum between an S-KMeans-like algorithm at one extreme ($w = 0$), where there is no guarantee of constraint satisfaction in the clustering, and a C-KMeans-like algorithm at the other extreme ($w \rightarrow \infty$), where the clustering process is forced to respect all the given constraints. The parameter $w$ can be chosen either by cross-validation on a held-out set or by the user according to the degree of confidence in the constraints. We chose an intermediate value of $w$ in (3.2), so that the algorithm trades off minimizing the total distance between points and cluster centroids against the cost of violating the constraints.

Figures 3.10-3.13 show that (1) semi-supervised clustering with constraints, for both active and non-active PCKMeans, performs considerably better than unsupervised KMeans clustering for the datasets we considered, since the clustering evaluation measures (NMI and pairwise F-measure) improve with an increasing number of pairwise constraints along the learning curve; and (2) active selection of pairwise constraints in active PCKMeans, using our two-phase active learning algorithm, significantly outperforms random selection of constraints in non-active PCKMeans. Detailed discussions and results on other datasets can be found in (Basu et al., 2003a).

3.4 Unified model of semi-supervised clustering

In previous work, similarity-based and search-based approaches to semi-supervised clustering have not been adequately compared experimentally, so their relative strengths and weaknesses are largely unknown. Moreover, the two approaches are not incompatible; applying a search-based approach with a trained similarity metric is therefore an additional option that may have advantages over both existing approaches. In this work, we presented a new unified semi-supervised clustering algorithm derived from KMeans that incorporates both metric learning and the use of labeled data as seeds and/or constraints.

3.4.1 Problem definition

Following previous work (Xing et al., 2003), we can parameterize Euclidean distance with a symmetric positive-definite weight matrix $A$ as follows: $\|x_i - x_j\|_A = \sqrt{(x_i - x_j)^{\top} A\, (x_i - x_j)}$. If $A$ is restricted to be a diagonal matrix, then it scales each axis by a different weight and corresponds to feature weighting; otherwise new features are created that are linear combinations of the original features. In our clustering formulation, using the matrix $A$ is equivalent to considering a generalized version of the KMeans model described in Section 2.2.1, where all the Gaussians have a covariance matrix $A^{-1}$ (Bilmes, 1997).

It can be easily shown that maximizing the complete-data log-likelihood under this generalized KMeans model is equivalent to minimizing the objective function:

$$\mathcal{J}_{\mathrm{mkmeans}} = \sum_{x_i \in X} \left( \|x_i - \mu_{l_i}\|_A^2 - \log(\det(A)) \right) \qquad (3.3)$$

where the second term arises due to the normalizing constant of a Gaussian with covariance matrix $A^{-1}$.

where the second term arises due to the normalizing constant of a Gaussian with covariance matrix� � � .

We combine objective functions (3.2) and (3.3) to get the following objective function that attempts to mini-mize cluster dispersion under a learned metric along with minimizing the number of constraint violations:

�mpckm �

�!.� ��

� ( �� � � � � �� % '*) ���ç� � � " " " �

" !.��� !#%$��'& �,+ � � � ( � � ( + "1� � � � ��±� + �" �

" ! � � ! # $��*) � + � � � ( � � (�+ " � � � ��� � +�� (3.4)

where � � and � � are functions (defined in the next section) that specify the “seriousness” of violating a must-linkor cannot-link constraint. The weights � + and � + provide a way to specify the relative importance of the unlabeledversus labeled data while allowing individual constraint weights. This objective function

�mpckm is greedily optimized

by our proposed metric pairwise constrained KMeans (MPC-KMeans) algorithm that uses a KMeans-type iteration.
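For the diagonal case (feature weighting), the parameterized distance and objective (3.3) can be written compactly as below; the function names are ours and this is only an illustrative sketch.

import numpy as np

def dist_A(x, y, a_diag):
    """Squared distance ||x - y||_A^2 with a diagonal weight matrix A = diag(a_diag)."""
    diff = x - y
    return float(np.sum(a_diag * diff * diff))

def mkmeans_objective(X, labels, centroids, a_diag):
    """Objective (3.3): weighted dispersion minus the log-det normalizer, summed over points."""
    log_det = np.sum(np.log(a_diag))   # det of a diagonal matrix is the product of its entries
    return sum(dist_A(X[i], centroids[labels[i]], a_diag) - log_det for i in range(len(X)))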


3.4.2 Motivation of the unified framework

If we did not scale the weights $w_{ij}$ and $\bar{w}_{ij}$ by the functions $f_M$ and $f_C$ in (3.4), one problem with the resulting combined objective function would be that all constraint violations are treated equally. However, the cost of violating a must-link constraint between two close points should be higher than the cost of violating a must-link constraint between two points that are far apart. Such a cost assignment reflects the intuition that it is a "worse error" to violate a must-link constraint between similar points, and such an error should have more impact on the metric learning framework. Multiplying the weight $w_{ij}$ with the penalty function $f_M(x_i, x_j) = \max(\varphi_{\min},\ \varphi_{\max} - \|x_i - x_j\|_A^2)$ gives the overall cost of violating a must-link constraint between two points $x_i$ and $x_j$, where $\varphi_{\min}$ and $\varphi_{\max}$ are non-negative constants that correspond to minimum and maximum penalties respectively. They can be set as fractions of the square of the maximum must-link distance $\max_{(x_i, x_j) \in M} \|x_i - x_j\|^2$, thus guaranteeing that the penalty for violating a constraint is always positive. Overall, this formulation makes the penalty for violating a must-link constraint proportional to the "seriousness" of the violation.

Analogously, the cost of violating a cannot-link constraint between two distant points should be higher than the cost of violating a cannot-link constraint between points that are close, since the former is a "worse error". Multiplying the weight $\bar{w}_{ij}$ with $f_C(x_i, x_j) = \min(\varphi_{\min} + \|x_i - x_j\|_A^2,\ \varphi_{\max})$ allows us to take the "seriousness" of the constraint violation into account.
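In code, the two penalty functions as reconstructed above look as follows (reusing dist_A from the previous sketch; phi_min and phi_max are the constants discussed, and the names are ours):

def f_must(x_i, x_j, a_diag, phi_min, phi_max):
    # Violating a must-link between *close* points is penalized most, floored at phi_min.
    return max(phi_min, phi_max - dist_A(x_i, x_j, a_diag))

def f_cannot(x_i, x_j, a_diag, phi_min, phi_max):
    # Violating a cannot-link between *distant* points is penalized most, capped at phi_max.
    return min(phi_max, phi_min + dist_A(x_i, x_j, a_diag))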

3.4.3 MPC-KMeans algorithm for the unified framework

Given a set of data points $X$, a set of must-link constraints $M$, a set of cannot-link constraints $C$, corresponding weight sets $W$ and $\bar{W}$, and the number of clusters to form $K$, metric pairwise constrained KMeans (MPC-KMeans) finds a disjoint $K$-partitioning $\{X_h\}_{h=1}^{K}$ of $X$ (with each partition having a centroid $\mu_h$) such that $\mathcal{J}_{\mathrm{mpckm}}$ is (locally) minimized.

The algorithm MPC-KMeans has two components. Utilizing constraints during cluster initialization and satisfying the constraints during every cluster assignment step constitutes the search-based component of the algorithm. Learning the distance metric by re-estimating the weight matrix $A$ during each algorithm iteration, based on the current constraint violations, is the similarity-based component.

Intuitively, the search-based technique uses the pairwise constraints to generate seed clusters that initialize the clustering algorithm, and also uses the constraints to guide the clustering process through the iterations. Seeds inferred from the constraints bias the clustering towards a good region of the search space, thereby possibly reducing the chances of it getting stuck in poor local optima, while a clustering that satisfies the user-specified constraints is produced simultaneously.

The similarity-based technique distorts the metric space to minimize the costs of violated constraints, possibly removing the violations in subsequent iterations. Implicitly, the space in which the data points are embedded is transformed to respect the user-provided constraints, thus capturing the notion of similarity appropriate for the dataset from the user's perspective.

Initialization: The MPC-KMeans algorithm is initialized from neighborhoods inferred from the $M$ and $C$ sets, in the same way as the PC-KMeans algorithm (see Section 3.2.2).

E-step: MPC-KMeans alternates between cluster assignment in the E-step, and centroid estimation and metric learning in the M-step (see Figure 3.14).

In the E-step of MPC-KMeans, every point $x$ is assigned to a cluster so that the sum of the distance of $x$ to the cluster centroid and the cost of constraint violations possibly incurred by this cluster assignment is minimized. Note that this assignment step is order-dependent, since the subsets of $M$ and $C$ associated with each cluster may change with the assignment of a point. In the cluster assignment step, each point moves to a new cluster only if the component of $\mathcal{J}_{\mathrm{mpckm}}$ contributed by this point decreases. So when all points have been given their new assignments, $\mathcal{J}_{\mathrm{mpckm}}$ will decrease or remain the same.

M-step: In the M-step, the cluster centroids $\mu_h$ are first re-estimated using the points in $X_h$. As a result, the contribution of each cluster to $\mathcal{J}_{\mathrm{mpckm}}$ is minimized. The pairwise constraints do not take part in this centroid re-estimation step because the constraints are not an explicit function of the centroid. Thus, only the first term (the distance component) of $\mathcal{J}_{\mathrm{mpckm}}$ is minimized in this step. The centroid re-estimation step effectively remains the same as in KMeans.


Figure 3.14: Metric Pairwise Constrained KMeans algorithm

The second part of the M-step is metric learning, where the matrix $A$ is re-estimated to decrease the objective function $\mathcal{J}_{\mathrm{mpckm}}$. The updated matrix $A$ is obtained by taking the partial derivative $\partial \mathcal{J}_{\mathrm{mpckm}} / \partial A$ and setting it to zero, resulting in:

$$A = |X| \left( \sum_{x_i \in X} (x_i - \mu_{l_i})(x_i - \mu_{l_i})^{\top} - \sum_{(x_i, x_j) \in M'} w_{ij}\, (x_i - x_j)(x_i - x_j)^{\top}\, \mathbb{1}[l_i \neq l_j] + \sum_{(x_i, x_j) \in C'} \bar{w}_{ij}\, (x_i - x_j)(x_i - x_j)^{\top}\, \mathbb{1}[l_i = l_j] \right)^{-1}$$

where $M'$ and $C'$ are the subsets of $M$ and $C$ that exclude the constraint pairs for which the penalty functions $f_M$ and $f_C$ take the threshold values $\varphi_{\min}$ and $\varphi_{\max}$ respectively. See (Basu, Bilenko, & Mooney, 2003b) for the details of the derivation.

Since estimating a full matrix $A$ from limited training data is difficult, we limit ourselves to a diagonal $A$ for most of our experiments, which is equivalent to learning a metric via feature weighting. In that case, the $d$-th diagonal


element of $A$, $a_{dd}$, corresponds to the weight of the $d$-th feature:

$$a_{dd} = |X| \left( \sum_{x_i \in X} (x_{id} - \mu_{l_i,d})^2 - \sum_{(x_i, x_j) \in M'} w_{ij}\, (x_{id} - x_{jd})^2\, \mathbb{1}[l_i \neq l_j] + \sum_{(x_i, x_j) \in C'} \bar{w}_{ij}\, (x_{id} - x_{jd})^2\, \mathbb{1}[l_i = l_j] \right)^{-1}$$

Intuitively, the first term in the sum, $\sum_{x_i \in X} (x_{id} - \mu_{l_i,d})^2$, scales the weight of each feature in inverse proportion to the feature's contribution to the overall cluster dispersion, analogous to the scaling performed when computing Mahalanobis distance. The two terms that depend on constraint violations stretch or contract each dimension in an attempt to mend the current violations. Thus, the metric weights are adjusted at each iteration in such a way that the contribution of different attributes to distance is equalized, while constraint violations are minimized.
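A sketch of the diagonal metric update consistent with the reconstruction above (the signs follow from differentiating (3.4) with the penalty functions of Section 3.4.2; the original derivation is in (Basu, Bilenko, & Mooney, 2003b)). The small positivity guard is our own addition and is not part of the original algorithm.

import numpy as np

def update_diagonal_metric(X, labels, centroids, violated_must, violated_cannot, eps=1e-8):
    """Re-estimate a_dd for every dimension d; violated_must / violated_cannot correspond
    to the non-thresholded violated constraints in M' and C', keyed by index pairs."""
    disp = np.zeros(X.shape[1])
    for i in range(len(X)):
        disp += (X[i] - centroids[labels[i]]) ** 2
    for (i, j), w in violated_must.items():       # must-link term (negative sign)
        disp -= w * (X[i] - X[j]) ** 2
    for (i, j), w in violated_cannot.items():     # cannot-link term (positive sign)
        disp += w * (X[i] - X[j]) ** 2
    return len(X) / np.maximum(disp, eps)         # guard against non-positive denominators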

The objective function decreases after every cluster assignment, centroid re-estimation, and metric learning step until convergence, implying that the MPC-KMeans algorithm will converge to a local minimum of $\mathcal{J}_{\mathrm{mpckm}}$.

3.4.4 Experimental results

In this section, we present ablation experiments that show the improved performance of our unified semi-supervised clustering framework. For each dataset, we compared four semi-supervised clustering schemes:

- MPC-KMeans clustering, which involves both seeding and metric learning in the unified framework described in Section 3.4.3;

- M-KMeans, which is KMeans clustering with the metric learning component only, without utilizing constraints for seeding;

- PC-KMeans clustering, which utilizes constraints for seeding and cluster assignment without doing any metric learning;

- Unsupervised KMeans clustering.

Datasets

Experiments were conducted on several datasets from the UCI repository: Iris, Wine, and representative randomly sampled subsets of the Pen-Digits and Letter datasets. For Pen-Digits and Letter, we chose two sets of three classes: {I, J, L} from Letter and {3, 8, 9} from Pen-Digits, randomly sampling 20% of the data points from the original datasets. These classes were chosen from the handwriting recognition datasets since they intuitively represent difficult visual discrimination problems.

Learning curves with cross validation

We used the same learning-curve generation methodology as in Section 3.3.4. Unit constraint weights $W$ and $\bar{W}$ were used, since the datasets did not provide information for setting individual weights for the constraints. The maximum squared distance between must-linked points was used as the value for $\varphi_{\max}$, while $\varphi_{\min}$ was set to 0. All experiments were run using Euclidean KMeans.

Results

The learning curves in Figures 3.15-3.18 illustrate that providing pairwise constraints is beneficial to clustering quality. On the presented datasets, the unified approach (MPC-KMeans) outperforms the individual seeding (PC-KMeans) and metric learning (M-KMeans) approaches. When both seeding and metric learning are utilized, the unified approach benefits from the individual strengths of the two methods.


Figure 3.15: Comparison of pairwise F-measure values on Iris dataset

Figure 3.16: Comparison of pairwise F-measure values on Wine dataset


Figure 3.17: Comparison of pairwise F-measure values on Letter-IJL dataset

Figure 3.18: Comparison of pairwise F-measure values on Digits-389 dataset


Some of the metric learning curves display a characteristic "dip", where clustering accuracy decreases when initial constraints are provided, but after a certain point starts to increase and eventually outperforms the initial point of the learning curve. We conjecture that this phenomenon is due to the fact that feature weights learned from few constraints are unreliable, while increasing the number of constraints provides the metric learning mechanism enough data to estimate good metric parameters. On the other hand, seeding the clusters with a small number of pairwise constraints has an immediate positive effect on the final cluster quality, while providing more pairwise constraints has diminishing returns, i.e., the PC-KMeans learning curves rise slowly. As can be seen from the MPC-KMeans results, the unified method has the advantages of both metric learning and seeding, and outperforms each of these individual methods of semi-supervised clustering.


Chapter 4

Proposed Research

This chapter outlines the issues that we want to further investigate in this thesis.

4.1 Short-term definite goals

This section describes our short-term goals, which we will definitely investigate in our future work.

4.1.1 Soft and noisy constraints

The labeled semi-supervised clustering framework can handle both soft and noisy labels, as shown in Section 3.1. We want to extend the formulation of the pairwise constrained semi-supervised clustering framework so that it can also handle soft or noisy constraints. This would involve extending the PCC clustering framework and the active constraint selection strategy to include a noise model, and formulating a model in which violation of soft constraints can be considered. The penalized clustering objective function can handle soft or noisy constraints, but the active PCC model we have considered needs to be extended in two places to handle such constraints. First, the initialization step takes the transitive closure of must-link constraints and adds entailed constraints inferred from cannot-links; this needs to be modified, since for soft or noisy constraints such direct constraint inference will not be possible. Second, the active learning strategy also uses the fact that the constraints are hard and noise-free. Both these steps need to be modified to support soft or noisy constraints.

SCOP-KMeans (Wagstaff, 2002) supports soft-constraint violation, but it does not come with an algorithmic analysis showing convergence guarantees.

4.1.2 Incomplete semi-supervision and class discovery

In semi-supervised classification, all classes are assumed to be known a priori and labeled training data are provided for all classes. In labeled semi-supervised clustering, when we consider clustering a dataset that has an underlying class labeling, we would like to consider incomplete seeding, where labeled data are not provided for every underlying class. For such incomplete semi-supervision, we would like to see if the labels on some classes can help the clustering algorithm discover the unknown classes. An example of class discovery using incomplete seeding is provided in Figure 4.1. Given the points in Figure 4.1, if we are asked to do a 2-clustering, we can get a clustering as shown in Figure 4.1. Now, if we give as input a pair of points labeled to be in the same cluster (shown by the annular points in Figure 4.2), we will get a clustering as in Figure 4.2. In this case, even though we did not provide any supervision about the top cluster, clustering using the provided supervision helped us discover that cluster.

Initial experiments on class discovery under incomplete seeding were reported in (Basu et al., 2002), where seeds were not provided for some of the categories and the NMI measure was calculated on the whole test dataset. We want to refine these experiments, so that (1) when we remove seeds from one category in the labeled data, we still consider the same overall number of labeled data points provided to the clustering algorithm (by adding more labeled data from the other categories), and


Figure 4.1: Incomplete seeding and class discovery

(2) the NMI measure is calculated only on the test data points that belong to categories for which no labeled data have been provided, and not on the whole test set (so that we can see whether the unknown classes are discovered when supervision is provided for the other classes). We expect that in these experiments the semi-supervised clustering algorithm will be able to discover the categories for which no supervision was provided. We also want to extend the work of (Miller & Browning, 2003) to give a theoretically well-motivated model for class discovery.

4.1.3 Model selection

In all our experiments so far, we have assumed that the number of clusters, $K$, is provided as input to the algorithm. In the future, we want to select the number of clusters automatically by using a model selection criterion in the KMeans or EM clustering objective function. Several model selection criteria exist in the literature for selecting the right number of clusters. Criteria like Minimum Description Length (Rissanen, 1978), the Bayesian Information Criterion (Pelleg & Moore, 2000) or Minimum Message Length (Wallace & Lowe, 1999) encode the Occam's Razor principle in some form, penalizing models according to their complexity. We would like to use one of these models, or explore a recent model proposed in (Hamerly & Elkan, 2003), where the $K$ in KMeans is selected based on a statistical test of the hypothesis that a subset of the data follows a Gaussian distribution. In hierarchical clustering, the right number of clusters can be selected by using some criterion for stopping the cluster merging; e.g., (Fern & Brodley, 2003) used a heuristic where they stopped merging clusters when the similarity value between the two closest current clusters had a sudden drop compared to the values in previous merges.

As mentioned in Section 1.3.2, automatic model selection is the main advantage of using semi-supervised clustering rather than semi-supervised classification in scenarios where knowledge of the relevant categories is incomplete.

4.1.4 Generalizing the unified semi-supervised clustering model

In Section 3.4, we proposed a framework for unifying search-based and similarity-based semi-supervised clustering that works only with Euclidean KMeans. I am working with Misha Bilenko, who is formulating an effective metric learning algorithm in high dimensions (Bilenko, 2003), on generalizing our unified PCC framework to work with SPKMeans, so that we can apply it to domains like text.


Figure 4.2: Incomplete seeding and class discovery

4.1.5 Other semi-supervised clustering algorithms

We want to apply our semi-supervision ideas to hierarchical algorithms, e.g., HAC and Cobweb. Incorporating constraints into hierarchical algorithms will be relatively straightforward. For example, to run constrained HAC, we can change the similarity $s(x, x')$ between two points $x$ and $x'$ to $s(x, x') + w$ if there is a must-link provided between $x$ and $x'$ with weight $w$, and to $s(x, x') - \bar{w}$ if there is a cannot-link provided between $x$ and $x'$ with weight $\bar{w}$. Then, we can run the usual HAC algorithm on the data points using this modified similarity, so that at each cluster-merge step we consider the similarity between the data points as well as the cost of constraint violations incurred during the merge operation.
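A sketch of the constraint-adjusted similarity just described (names ours; the similarity matrix is assumed to be a list of lists), whose output could then be fed to any off-the-shelf HAC implementation.

def constrained_similarity(sim, must_links, cannot_links):
    """Return a copy of the similarity matrix adjusted by constraint weights."""
    adjusted = [row[:] for row in sim]
    for (i, j), w in must_links.items():          # must-linked pairs become more similar
        adjusted[i][j] += w
        adjusted[j][i] += w
    for (i, j), w in cannot_links.items():        # cannot-linked pairs become less similar
        adjusted[i][j] -= w
        adjusted[j][i] -= w
    return adjusted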

A more interesting problem arises when the initial supervision is given in the form of a hierarchy, and the clustering problem is to do hierarchical clustering "using" the initial hierarchy. We want to formalize the notion of using an initial seed hierarchy for hierarchical clustering. Such an approach would be useful for content management applications, e.g., if the requirement is to hierarchically cluster the documents of a company, and the initial seed hierarchy is a preliminary directory structure containing a subset of the documents.

So far we have mainly focussed on clustering algorithms that use a generative model. We also want to apply the pairwise constrained framework to discriminative clustering algorithms (e.g., graph partitioning), for which pairwise constraints are a natural form of supervision. Another interesting research direction would be online clustering in the semi-supervised framework.

4.1.6 Ensemble semi-supervised clustering

In our work so far, we have assumed constraints to be noise-free. We have also assumed the weights on the constraints to be uniform (PCKMeans) or changed the weights based on the "difficulty of satisfying the constraints" (unified model). An interesting problem in the PCC model would be the choice of the constraint weights in the general case of noisy constraints.

Given a set of noisy constraints, we can create an ensemble of semi-supervised clusterers, each of which puts different weights on the constraints and possibly produces a different clustering. We propose a scheme for creating an ensemble of PCC clusterers and combining their results using boosting (Freund & Schapire, 1996).

Each PCC clusterer can be considered a weak learner that takes pairs of data points as input and gives a binary output decision of "same-cluster" or "different-cluster". The must-link and cannot-link constraints can be considered the training data for each weak learner. Given a set of input constraints, the PCC clusterer initially sets all constraints to have uniform weight and performs clustering. After clustering is completed, the clusterer categorizes each pair of points as "same-cluster" or "different-cluster", based on whether the pair ended up in the same cluster or in different clusters.


Since the given constraints are noisy, some of them will be violated by the clustering. The constraints are reweighted based on the number of errors made by the weak learner, and a new clusterer is created to perform the clustering with the new weights on the constraints. We use boosting to re-weight the constraints and to combine the outputs of the clusterers in the ensemble.

The boosted ensemble will give as output an $N \times N$ symmetric matrix $S$ with 0/1 entries, where the $(i, j)$ entry represents the ensemble decision of whether points $x_i$ and $x_j$ should be in the same cluster ($S_{ij} = 1$) or not ($S_{ij} = 0$). Once we get this matrix, we can apply any discriminative clustering algorithm (e.g., graph partitioning using min-cuts) on $S$; we will have to evaluate empirically whether this gives better clustering performance than the initial PCC clusterer. Initial work along the lines of semi-supervised boosting in the product space has been proposed in (Hertz, Bar-Hillel, & Weinshall, 2003), which we want to investigate further.

4.1.7 Other application domains

We have been running our experiments on text data and UCI datasets. In the future, we also want to run experiments on other domains.

Search result clustering

One domain we are very interested in is the clustering of web search results. Such clustering is an important tool for helping search engine users navigate search results, and has become a part of some recent search engines, e.g., Vivisimo. For short or ambiguous queries, e.g., "jaguar", it is useful to the user if the retrieved documents are organized into different clusters based on their topics, e.g., Jaguar cars, jaguar cats and Apple Jaguar OS. The task we propose is to use constraints derived from the click-through data in query-logs to get a constrained clustering of search results, which would correspond better with user browsing behavior.

(Joachims, 2002) proposed a model of a user browsing through the search results returned for a query by a search engine. The model assumes that lower-ranked search results clicked on by the user in a ranked list (as on a search results webpage) are more relevant to the user than higher-ranked results that were not clicked on. As a natural corollary of this model, we consider that if a user clicks on two search results on a page, we can infer a must-link constraint between these pages, since the user considered both of them relevant to his query. For example, consider a user who gives the query "jaguar" looking for automobiles. The top 5 corresponding search results in Google are: (1) Jaguar Apple (the Mac OS website), (2) Jaguar Racing (a Grand Prix racing team sponsored by Jaguar), (3) Jaguar Cars (the official Jaguar car website), (4) Jaguar Australia (the Australian Jaguar car website), and (5) Jaguar Canada (the Canadian Jaguar car website). The user in our example, who is interested in Jaguar cars, would tend to skip the first two search results and click on some of the last 3 results, say (3) and (5). Looking at the query-logs, we can then infer a must-link constraint between (3)-(5). Note that these inferred must-link constraints will be individually noisy, but we can obtain a subset of reliable constraints by aggregating the constraints over all instances of a particular query in the query-logs and considering only those constraints above a particular frequency threshold. For example, if the threshold is 5%, we consider only those must-link constraints that occur in more than 5% of all user sessions for that query in the query-logs. At the end of this aggregation, we will have a set of inferred must-link constraints corresponding to each query.

If we want more aggressive constraint inference, we can also consider that if the user skips a higher ranked search result and clicks on a result of lower rank, then there is a cannot-link between these results. In the jaguar example, we can infer cannot-link constraints between (1)-(3), (1)-(5), (2)-(3), and (2)-(5). Note, however, that we have to be careful while inferring cannot-link constraints, since in this example (4)-(5) will also be incorrectly inferred to be a cannot-link. To take care of such cases, we have to apply a higher cutoff frequency (say 10%) for selecting cannot-links from the query-logs. For constraints above the cutoff frequency thresholds, we can weight the importance of the constraints by the fraction of times they occurred in the query-log entries for the corresponding query.
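A minimal sketch of this constraint-inference step is given below, assuming query-log sessions are available as ranked result lists with click sets; the session format, the function name infer_constraints, and the default thresholds are hypothetical choices made for illustration.

```python
from collections import Counter
from itertools import combinations

def infer_constraints(sessions, must_threshold=0.05, cannot_threshold=0.10):
    """Sketch of constraint inference from click-through data for one query.

    sessions : list of user sessions, each a dict with
               "ranked" : list of result ids in ranked order,
               "clicked": set of result ids the user clicked.
    Returns (must_links, cannot_links) as dicts mapping a result-id pair to a
    weight equal to the fraction of sessions in which the constraint occurred.
    """
    n_sessions = len(sessions)
    must_counts, cannot_counts = Counter(), Counter()

    for s in sessions:
        clicked = [r for r in s["ranked"] if r in s["clicked"]]
        # Must-link: any two results clicked in the same session.
        for a, b in combinations(sorted(clicked), 2):
            must_counts[(a, b)] += 1
        # Cannot-link: a skipped result ranked above a clicked one.
        for a, b in combinations(s["ranked"], 2):      # a is ranked above b
            if a not in s["clicked"] and b in s["clicked"]:
                cannot_counts[tuple(sorted((a, b)))] += 1

    must_links = {p: c / n_sessions for p, c in must_counts.items()
                  if c / n_sessions > must_threshold}
    cannot_links = {p: c / n_sessions for p, c in cannot_counts.items()
                    if c / n_sessions > cannot_threshold}
    return must_links, cannot_links
```

The returned fractions could serve directly as soft-constraint weights when clustering the results of that query.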

The proposed algorithms in the PCC framework could be used to improve clustering of search results. The constraints corresponding to different queries could be collected from the query-logs and used while clustering the results of the corresponding queries, to give better clustering. We could also re-use the constraints for one query to cluster other similar queries. To this effect, we can use a query-document bipartite graph, where there is a link between a query and a document if the document is clicked on a significant fraction of times in the query-log entries for that query.


We can then cluster the queries, based on the overlap of the documents linked to each query in this graph (Dhillon, 2001; Mishra, Ron, & Swaminathan, 2003). Let $q_2$ be a query for which not many constraints have been inferred from the query-logs. Queries $q_1$ and $q_2$ can be considered similar if they belong to the same cluster, and the constraints of $q_1$ inferred from the query-logs can then be used (after weighting them down suitably, e.g., multiplying them by the fractional similarity between $q_1$ and $q_2$) while clustering the search results of $q_2$.
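The sketch below illustrates this constraint re-use idea under a simplifying assumption: instead of co-clustering the query-document bipartite graph, it measures query similarity by plain Jaccard overlap of the significantly clicked document sets; the function name transfer_constraints and its inputs are hypothetical.

```python
def transfer_constraints(constraints_q1, docs_q1, docs_q2):
    """Sketch: re-use constraints inferred for query q1 when clustering results of q2.

    constraints_q1   : dict mapping a result-id pair to its weight for q1
    docs_q1, docs_q2 : sets of documents significantly clicked for q1 and q2
                       (Jaccard overlap stands in here for the fractional
                       similarity between the two queries)
    """
    overlap = len(docs_q1 & docs_q2) / max(len(docs_q1 | docs_q2), 1)
    # Weight down q1's constraints by the similarity between q1 and q2.
    return {pair: w * overlap for pair, w in constraints_q1.items()}
```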

This proposed work is part of a research proposal submitted to Google. If the proposal gets approved, we can get access to anonymized query-logs on which we can run our experiments. If a constrained clustering prototype is developed, evaluation of the clustering results can be performed by “temps” (temporary contract workers) at Google. If such clustering is not feasible in real-time, a possible initial deployment model could be the following: do offline clustering of the top $N$ most popular queries using constraints from the query-logs, cache the results, and show the clustering results to the restricted set of users issuing those queries.

Other datasets

We want to apply our algorithms to clustering astronomical datasets, e.g., Mars spectral data (Wagstaff, 2002) and galaxy data (Pelleg & Moore, 2000). The Mars data will be especially useful: the dataset consists of spectral analysis of telescopic Mars images, and has domain knowledge (spatial relations, spectrum slope characteristics, etc.) encoded in the form of soft constraints, which will be very useful for testing the soft-constrained clustering algorithms we plan to develop. We also want to investigate clustering of bioinformatics data, especially gene micro-array data and phylogenetic profiles (Marcotte, Xenarios, van der Bliek, & Eisenberg, 2000). In these cases, constraints can be derived from other knowledge sources, e.g., while clustering a large set of genes, we can use the information that some of these genes belong to the same pathway (implying must-link constraints) or to different pathways (implying cannot-link constraints) in the KEGG database (Ogata, Goto, Sato, Fujibuchi, Bono, & Kanehisa, 1999).
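As an illustration of how such constraints could be generated, the following sketch derives must-links and cannot-links from pathway annotations; the input format (a gene-to-pathway-set mapping, e.g., extracted from KEGG) and the function name are assumptions made for the example.

```python
from itertools import combinations

def pathway_constraints(gene_to_pathways):
    """Sketch: derive pairwise constraints for gene clustering from pathway annotations.

    gene_to_pathways : dict mapping a gene id to the set of pathways it belongs to
    Returns (must_links, cannot_links) as lists of gene-id pairs.
    """
    must, cannot = [], []
    for g1, g2 in combinations(sorted(gene_to_pathways), 2):
        shared = gene_to_pathways[g1] & gene_to_pathways[g2]
        if shared:
            # Genes known to share a pathway => must-link.
            must.append((g1, g2))
        elif gene_to_pathways[g1] and gene_to_pathways[g2]:
            # Both annotated, but with disjoint pathway sets => cannot-link.
            cannot.append((g1, g2))
    return must, cannot
```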

4.2 Long-term speculative goals

This section outlines our long-term goals, some of which we plan to investigate as time permits.

4.2.1 Other forms of semi-supervision

Till now, we have only considered labeled data and constraints between data points as possible methods of supervision in clustering. Other forms of supervision have also been considered in clustering, e.g., (1) cluster-size constraints (Banerjee & Ghosh, 2002), where balancing the size of the clusters is considered as an explicit soft constraint, and (2) attribute-level constraints (Dai, Lin, & Chen, 2003), where clustering is performed under constraints on the numerical attributes (e.g., the maximum difference between two points in a cluster along the “age” attribute should be 15).

We want to also explore supervision in the form of partial classification functions defined on subsets of the data space $X$. In this case, along with the unsupervised data $X$, we will have a set of classification functions, each of which classifies a subset of points from $X$ to a set of class labels that can be different for each classifier. For example, while clustering search engine results, we can use the DMOZ and the Yahoo! hierarchies as classifiers for a subset of the documents returned for a query. Note that in the general case, the label sets for the classifiers will be different, e.g., in the example above, DMOZ and Yahoo! will have different categories.

Given a classifier $g$ that gives posterior probabilities corresponding to its set of class labels $L$, and two points $x_1$ and $x_2$, we can estimate the probabilities of a must-link and a cannot-link between the two points, given the classifier, as follows:

$$\Pr(\text{must-link}(x_1, x_2) \mid g) = \sum_{l \in L} \Pr(l \mid x_1) \Pr(l \mid x_2)$$

$$\Pr(\text{cannot-link}(x_1, x_2) \mid g) = 1 - \Pr(\text{must-link}(x_1, x_2) \mid g)$$

Given $m$ classifiers $\{g_k\}_{k=1}^{m}$, the probability of constraints between the two points, as estimated from the classifiers, is:

$$\Pr(\text{must-link}(x_1, x_2)) = \sum_{k=1}^{m} \Pr(\text{must-link}(x_1, x_2) \mid g_k) \Pr(g_k)$$

$$\Pr(\text{cannot-link}(x_1, x_2)) = 1 - \Pr(\text{must-link}(x_1, x_2))$$

In the absence of any other information on the distribution over the classifiers, $\Pr(g_k)$ would be assumed to be uniform.

In a generative clustering model, the probabilistic constraints inferred from the different classifiers can be directly incorporated into the soft-constrained PCC model. If we use kernel-based discriminative clustering, we can adapt the original kernel so that the kernel similarity between two points in the adapted kernel would be higher if the points have a higher probability of being must-linked and lower if they have a higher probability of being cannot-linked, using a framework similar to (Kwok & Tsang, 2003).
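The estimation of constraint probabilities from the partial classifiers follows directly from the formulas above; in the sketch below, the representation of each classifier as a pair of posterior vectors for the two points and the uniform default prior over classifiers are the only assumptions.

```python
import numpy as np

def constraint_probabilities(posteriors_x1, posteriors_x2, classifier_priors=None):
    """Sketch: estimate must-link / cannot-link probabilities for two points
    from a set of partial classifiers.

    posteriors_x1, posteriors_x2 : lists of length m; the k-th entries are the
        posterior distributions Pr(l | x_1) and Pr(l | x_2) given by classifier
        g_k over its own label set (1-d arrays summing to 1; label sets may
        differ across classifiers).
    classifier_priors : optional length-m array Pr(g_k); uniform if None.
    """
    m = len(posteriors_x1)
    if classifier_priors is None:
        classifier_priors = np.full(m, 1.0 / m)   # uniform distribution over classifiers

    # Pr(must-link | g_k) = sum over labels of Pr(l | x_1) Pr(l | x_2).
    must_given_g = np.array([np.dot(p1, p2)
                             for p1, p2 in zip(posteriors_x1, posteriors_x2)])
    p_must = float(np.dot(must_given_g, classifier_priors))
    return p_must, 1.0 - p_must
```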

4.2.2 Study of cluster evaluation metrics

As observed in Figures 3.10-3.13 in Section 3.3.4, there seems to be a strong correlation between the NMI and pairwise F-measure clustering evaluation metrics. We want to investigate this similarity between NMI and pairwise F-measure further. Dom (2001) gives a nice formulation of evaluation metrics for hard clustering using the clustering confusion matrix, where he shows the relation between popular hard clustering evaluation metrics like the Rand index, the Jaccard index, the Fowlkes-Mallows index and the Hubert $\Gamma$ statistic. We want to see if we can extend this formulation to evaluate soft clustering.
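For reference, a small sketch of the two metrics being compared is given below; the pairwise F-measure uses the standard pairwise-decision definition, which we assume matches the variant reported in Section 3.3.4, and NMI is taken from scikit-learn.

```python
from itertools import combinations
from sklearn.metrics import normalized_mutual_info_score

def pairwise_f_measure(true_labels, pred_labels):
    """Sketch: pairwise F-measure between a predicted clustering and true classes.
    A pair counts as positive in a labeling if both points share a cluster/class."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        tp += same_true and same_pred
        fp += (not same_true) and same_pred
        fn += same_true and (not same_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Example: compare the two metrics on a single clustering.
true = [0, 0, 1, 1, 2, 2]
pred = [0, 0, 1, 2, 2, 2]
print(pairwise_f_measure(true, pred), normalized_mutual_info_score(true, pred))
```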

4.2.3 Approximation algorithms for semi-supervised clustering

Another interesting research direction is to consider how semi-supervision affects approximation algorithms for some clustering methods, e.g., KMedian. The KMedian problem, which was explained briefly in Section 2.2.1, is similar to the facility location problem. In the facility location problem, we are given a set of demand points and a set of candidate facility sites with costs of building facilities at each of them. Each demand point is then assigned to its closest facility, incurring a service cost equal to the distance to its assigned facility. The goal is to select a subset of sites where facilities should be built, so that the sum of the facility costs and the service costs for the demand points is minimized. The KMedian problem is similar to facility location, but with a few differences: in KMedian there are no facility costs, and there is a bound $k$ on the number of facilities that can be opened. The KMedian objective is to select a set of $k$ facilities so as to minimize the sum of the service costs for the demand points.

We propose a semi-supervised extension to KMedian to handle constraints on the demand points. The constrained KMedian problem would additionally be given an input set of must-link and cannot-link constraints on the demand points (i.e., two demand points should or should not be assigned to the same facility), and the goal would be to minimize an objective function that is the sum of the service costs for the demand points and the cost of violating the constraints.
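The proposed objective can be stated compactly in code; the sketch below evaluates it for a candidate facility set, under the assumptions that facilities are chosen among the demand points and that each point is served by its nearest open facility, with the penalty weights (must_penalty, cannot_penalty) as illustrative parameters.

```python
import numpy as np

def constrained_kmedian_objective(D, facilities, must_links, cannot_links,
                                  must_penalty=1.0, cannot_penalty=1.0):
    """Sketch: evaluate the constrained-KMedian objective for a candidate facility set.

    D            : (n, n) matrix of distances between demand points
    facilities   : list of k indices of opened facilities (chosen among the points)
    must_links   : list of (i, j) pairs that should share a facility
    cannot_links : list of (i, j) pairs that should not share a facility
    """
    D = np.asarray(D)
    facilities = np.asarray(facilities)
    # Service cost: each demand point is served by its closest open facility.
    assignment = facilities[np.argmin(D[:, facilities], axis=1)]
    service_cost = D[np.arange(len(D)), assignment].sum()
    # Violation cost: must-links split across facilities, cannot-links sharing one.
    violation_cost = (
        must_penalty * sum(assignment[i] != assignment[j] for i, j in must_links) +
        cannot_penalty * sum(assignment[i] == assignment[j] for i, j in cannot_links)
    )
    return service_cost + violation_cost
```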

Initial investigation has shown that if we consider only must-link constraints, we can make a simple modification to an approximation algorithm for KMedian to get a constant-factor approximation algorithm for must-link constrained KMedian. However, if we consider both must-link and cannot-link constraints, then the constrained KMedian problem becomes NP-complete, which can be shown by a reduction from graph coloring (Plaxton, 2003). An interesting research problem to look into is the constrained KMedian problem with a limited number of cannot-links (e.g., a number of cannot-links bounded by a small function of $k$, the number of clusters), and to try to get approximation guarantees in this case.

4.2.4 Analysis of labeled semi-supervision with EM

As mentioned in Section 3.1, the objective function plots for S-KMeans seem to increase exponentially as the number of labeled points is increased along the learning curve. As shown in (3.1), if we consider a Gaussian model for each cluster, the probability of deviation of the centroid estimates falls exponentially with the number of seeds (Banerjee, 2001). I am working with Arindam Banerjee to extend this observation and show that when we add more unlabeled points along with the seeds and perform the EM iteration, we get a tighter bound for the centroid deviation with high probability. Ratsaby & Venkatesh (1995) gave a PAC analysis of labeled and unlabeled sample complexity when learning a classification rule, with an underlying data generation model of a mixture of two Gaussians. We want to use their style of analysis to give similar bounds for the semi-supervised EM algorithm on a mixture of two Gaussians.

4.3 Conclusion

Our main goal in the proposed thesis is to study search-based semi-supervised clustering algorithms and apply them to different domains. As explained in Chapter 3, our initial work has shown: (1) how supervision can be provided to clustering in the form of labeled data points or pairwise constraints; (2) how informative constraints can be selected in an active learning framework for the pairwise constrained semi-supervised clustering model; and (3) how search-based and similarity-based techniques can be unified in semi-supervised clustering. In our work so far, we have mainly focussed on generative clustering models, e.g. KMeans and EM, and ran experiments on clustering low-dimensional UCI datasets or high-dimensional text datasets.

In this thesis, we want to study other aspects of semi-supervised clustering, like: (1) the effect of noisy, probabilistic or incomplete supervision in clustering; (2) model selection techniques for automatic selection of the number of clusters in semi-supervised clustering; and (3) ensemble semi-supervised clustering. In future, we want to study the effect of semi-supervision on other clustering algorithms, especially in the discriminative clustering and online clustering frameworks. We also want to study the effectiveness of our semi-supervised clustering algorithms on other domains, e.g., web search engines (clustering of search results), astronomy (clustering of Mars spectral images) and bioinformatics (clustering of gene microarray data).


Bibliography

Abe, N., & Mamitsuka, H. (1998). Query learning strategies using boosting and bagging. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML-98), pp. 1–10.

Baeza-Yates, R., & Ribeiro-Neto, B. (1999). Modern Information Retrieval. ACM Press, New York.

Banerjee, A., Dhillon, I., Ghosh, J., & Sra, S. (2003). Generative model-based clustering of directional data. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2003).

Banerjee, A. (2001). Large scale clustering of sequential patterns. Doctoral Dissertation Proposal, University of Texas at Austin.

Banerjee, A., & Ghosh, J. (2002). On scaling up balanced clustering algorithms. In Proceedings of the SIAM International Conference on Data Mining, pp. 333–349.

Bansal, N., Blum, A., & Chawla, S. (2002). Correlation clustering. In IEEE Symp. on Foundations of Comp. Sci.

Basu, S., Banerjee, A., & Mooney, R. J. (2002). Semi-supervised clustering by seeding. In Proceedings of 19th International Conference on Machine Learning (ICML-2002), pp. 19–26.

Basu, S., Banerjee, A., & Mooney, R. J. (2003a). Active semi-supervision for pairwise constrained clustering. Submitted for publication, available at http://www.cs.utexas.edu/~sugato/.

Basu, S., Bilenko, M., & Mooney, R. J. (2003b). Comparing and unifying search-based and similarity-based approaches to semi-supervised clustering. In Proceedings of the ICML-2003 Workshop on the Continuum from Labeled to Unlabeled Data in Machine Learning and Data Mining, Washington, DC.

Bilenko, M. (2003). Learnable similarity functions and their applications to record linkage and clustering. Doctoral Dissertation Proposal, University of Texas at Austin.

Bilenko, M., & Mooney, R. J. (2002). Learning to combine trained distance metrics for duplicate detection in databases. Tech. rep. AI 02-296, Artificial Intelligence Laboratory, University of Texas at Austin, Austin, TX.

Bilenko, M., & Mooney, R. J. (2003). Adaptive duplicate detection using learnable string similarity measures. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2003), pp. 39–48, Washington, DC.

Bilmes, J. (1997). A gentle tutorial on the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. Tech. rep. ICSI-TR-97-021, ICSI.

Blake, C. L., & Merz, C. J. (1998). UCI repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html.

Blum, A., & Mitchell, T. (1998). Combining labeled and unlabeled data with co-training. In Proceedings of the 11th Annual Conference on Computational Learning Theory, Madison, WI.


Boykov, Y., Veksler, O., & Zabih, R. (1998). Markov random fields with efficient approximations. In IEEE Computer Vision and Pattern Recognition Conf.

Bradley, P. S., Mangasarian, O. L., & Street, W. N. (1997). Clustering via concave minimization. In Mozer, M. C., Jordan, M. I., & Petsche, T. (Eds.), Advances in Neural Information Processing Systems 9, pp. 368–374. The MIT Press.

Cohn, D., Caruana, R., & McCallum, A. (2000). Semi-supervised clustering with user feedback. Unpublished manuscript. Available at http://www-2.cs.cmu.edu/~mccallum/.

Cohn, D. A., Ghahramani, Z., & Jordan, M. I. (1996). Active learning with statistical models. Journal of Artificial Intelligence Research, 4, 129–145.

Cover, T. M., & Thomas, J. A. (1991). Elements of Information Theory. Wiley-Interscience.

Dai, B.-R., Lin, C.-R., & Chen, M.-S. (2003). On the techniques for data clustering with numerical constraints. In Proceedings of the SIAM International Conference on Data Mining.

Dasgupta, S. (2002). Performance guarantees for hierarchical clustering. In Computational Learning Theory.

Demiriz, A., Bennett, K. P., & Embrechts, M. J. (1999). Semi-supervised clustering using genetic algorithms. In ANNIE'99 (Artificial Neural Networks in Engineering).

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39, 1–38.

Devroye, L., Gyorfi, L., & Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Springer Verlag.

Dhillon, I. S., Fan, J., & Guan, Y. (2001). Efficient clustering of very large document collections. In Data Mining for Scientific and Engineering Applications. Kluwer Academic Publishers.

Dhillon, I. S., & Modha, D. S. (2001). Concept decompositions for large sparse text data using clustering. Machine Learning, 42, 143–175.

Dhillon, I. S. (2001). Co-clustering documents and words using bipartite spectral graph partitioning. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2001).

Dom, B. E. (2001). An information-theoretic external cluster-validity measure. Research report RJ 10219, IBM.

Dubnov, S., El-Yaniv, R., Gdalyahu, Y., Schneidman, E., Tishby, N., & Yona, G. (2002). A new nonparametric pairwise clustering algorithm based on iterative estimation of distance profiles. Machine Learning, 47(1), 35–61.

Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern Classification (Second edition). Wiley, New York.

Fayyad, U. M., Reina, C., & Bradley, P. S. (1998). Initialization of iterative refinement clustering algorithms. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (KDD-98), pp. 194–198.

Fern, X., & Brodley, C. (2003). Random projection for high dimensional data clustering: A cluster ensemble approach. In Proceedings of 20th International Conference on Machine Learning (ICML-2003).

Fisher, D. H. (1987). Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2, 139–172.

Freund, Y., & Schapire, R. E. (1996). Experiments with a new boosting algorithm. In Saitta, L. (Ed.), Proceedings of the Thirteenth International Conference on Machine Learning (ICML-96). Morgan Kaufmann.

Freund, Y., Seung, H. S., Shamir, E., & Tishby, N. (1997). Selective sampling using the query by committee algorithm. Machine Learning, 28, 133–168.


Garey, M. R., Johnson, D. S., & Witsenhausen, H. S. (1982). The complexity of the generalized Lloyd-Max problem. IEEE Transactions on Information Theory, 28(2), 255–256.

Garey, M., & Johnson, D. (1979). Computers and Intractability: A Guide to the Theory of NP-Completeness. Freeman, New York, NY.

Ghahramani, Z., & Jordan, M. I. (1994). Supervised learning from incomplete data via the EM approach. In Advances in Neural Information Processing Systems 6, pp. 120–127.

Ghani, R., Jones, R., & Rosenberg, C. (Eds.). (2003). ICML Workshop on the Continuum from Labeled to Unlabeled Data in Machine Learning and Data Mining, Washington, DC.

Goldszmidt, M., & Sahami, M. (1998). A probabilistic approach to full-text document clustering. Tech. rep. ITAD-433-MS-98-044, SRI International.

Guha, S., Rastogi, R., & Shim, K. (1999). ROCK: A robust clustering algorithm for categorical attributes. In Proceedings of the Fifteenth International Conference on Data Engineering.

Hamerly, G., & Elkan, C. (2003). Learning the k in k-means. In Advances in Neural Information Processing Systems 15.

Hastie, T., & Tibshirani, R. (1996). Discriminant adaptive nearest-neighbor classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(6), 607–617.

Hertz, T., Bar-Hillel, A., & Weinshall, D. (2003). Learning distance functions with product space boosting. Tech. rep. TR 2003-35, Leibniz Center for Research in Computer Science, The Hebrew University of Jerusalem.

Hillel, A. B., Hertz, T., Shental, N., & Weinshall, D. (2003). Learning distance functions using equivalence relations. In Proceedings of 20th International Conference on Machine Learning (ICML-2003).

Hinneburg, A., & Keim, D. A. (1998). An efficient approach to clustering in large multimedia databases with noise. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (KDD-98), pp. 58–65.

Hochbaum, D., & Shmoys, D. (1985). A best possible heuristic for the k-center problem. Mathematics of Operations Research, 10(2), 180–184.

Hofmann, T., & Buhmann, J. M. (1998). Active data clustering. In Advances in Neural Information Processing Systems 10.

Jaakkola, T., & Haussler, D. (1999). Exploiting generative models in discriminative classifiers. In Advances in Neural Information Processing Systems 11, pp. 487–493.

Jain, A. K., & Dubes, R. C. (1988). Algorithms for Clustering Data. Prentice Hall, New Jersey.

Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: A review. ACM Computing Surveys, 31(3), 264–323.

Jain, K., & Vazirani, V. (2001). Approximation algorithms for metric facility location and k-median problems using the primal-dual schema and Lagrangian relaxation. Journal of the ACM, 48, 274–296.

Joachims, T. (1999). Transductive inference for text classification using support vector machines. In Proceedings of the Sixteenth International Conference on Machine Learning (ICML-99), Bled, Slovenia.

Joachims, T. (2002). Optimizing search engines using clickthrough data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2002).

Kamvar, S. D., Klein, D., & Manning, C. D. (2002). Interpreting and extending classical agglomerative clustering algorithms using a model-based approach. In Proceedings of 19th International Conference on Machine Learning (ICML-2002).


Karypis, G., Han, E. H., & Kumar, V. (1999). Chameleon: A hierarchical clustering algorithm using dynamic modeling. IEEE Computer, 32(8), 68–75.

Karypis, G., & Kumar, V. (1998). A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing, 20(1), 359–392.

Kaufman, L., & Rousseeuw, P. (1990). Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons, New York.

Kearns, M., Mansour, Y., & Ng, A. Y. (1997). An information-theoretic analysis of hard and soft assignment methods for clustering. In Proceedings of 13th Conference on Uncertainty in Artificial Intelligence (UAI-97), pp. 282–293.

Klein, D., Kamvar, S. D., & Manning, C. (2002). From instance-level constraints to space-level constraints: Making the most of prior knowledge in data clustering. In Proceedings of the Nineteenth International Conference on Machine Learning (ICML-2002), Sydney, Australia.

Kleinberg, J., & Tardos, E. (1999). Approximation algorithms for classification problems with pairwise relationships: Metric labeling and Markov random fields. In IEEE Symp. on Foundations of Comp. Sci.

Kwok, J. T., & Tsang, I. W. (2003). Learning with idealized kernels. In Proceedings of 20th International Conference on Machine Learning (ICML-2003), pp. 400–407.

Lewis, D., & Gale, W. (1994). A sequential algorithm for training text classifiers. In Proc. of 17th Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval.

MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297.

Manning, C. D., & Schutze, H. (1999). Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA.

Marcotte, E. M., Xenarios, I., van der Bliek, A., & Eisenberg, D. (2000). Localizing proteins in the cell from their phylogenetic profiles. Proceedings of the National Academy of Sciences, 97, 12115–12120.

Mardia, K. V., & Jupp, P. (2000). Directional Statistics. John Wiley and Sons Ltd., 2nd edition.

McCallum, A., & Nigam, K. (1998). Employing EM and pool-based active learning for text classification. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML-98), Madison, WI. Morgan Kaufmann.

Meila, M. (2003). Comparing clusterings. In Proceedings of the 16th Annual Conference on Computational Learning Theory.

Mettu, R., & Plaxton, C. G. (2000). The online median problem. In Proceedings of the 41st Annual IEEE Symposium on Foundations of Computer Science, pp. 339–348.

Miller, D. J., & Browning, J. (2003). A mixture model and EM algorithm for robust classification, outlier rejection, and class discovery. In IEEE International Conference on Acoustics, Speech and Signal Processing.

Mishra, N., Ron, D., & Swaminathan, R. (2003). On finding large conjunctive clusters. In Computational Learning Theory.

Mitchell, T. (1997). Machine Learning. McGraw-Hill, New York, NY.

Motwani, R., & Raghavan, P. (1995). Randomized Algorithms. Cambridge University Press.

Muslea, I. (2002). Active learning with multiple views. Ph.D. thesis, University of Southern California.


Ng, A. Y., & Jordan, M. I. (2002). On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In Advances in Neural Information Processing Systems 14, pp. 605–610.

Nigam, K., McCallum, A. K., Thrun, S., & Mitchell, T. (2000). Text classification from labeled and unlabeled documents using EM. Machine Learning, 39, 103–134.

Nigam, K. (2001). Using Unlabeled Data to Improve Text Classification. Ph.D. thesis, Carnegie Mellon University.

Ogata, H., Goto, S., Sato, K., Fujibuchi, W., Bono, H., & Kanehisa, M. (1999). KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res., 27, 29–34.

Pelleg, D., & Moore, A. (2000). X-means: Extending k-means with efficient estimation of the number of clusters. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML-2000).

Plaxton, C. G. (2003). Personal communication.

Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 465–471.

Seeger, M. (2000). Learning with labeled and unlabeled data. Tech. rep., Institute for ANC, Edinburgh, UK. See http://www.dai.ed.ac.uk/~seeger/papers.html.

Selim, S., & Ismail, M. (1984). K-means-type algorithms: A generalized convergence theorem and characterization of local optimality. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 81–87.

Sheikholeslami, G., Chatterjee, S., & Zhang, A. (1998). WaveCluster: A multi-resolution clustering approach for very large spatial databases. In Proceedings of the International Conference on Very Large Data Bases, pp. 428–439.

Sinkkonen, J., & Kaski, S. (2000). Semisupervised clustering based on conditional distributions in an auxiliary space. Tech. rep. A60, Helsinki University of Technology.

Strehl, A., & Ghosh, J. (2000). A scalable approach to balanced, high-dimensional clustering of market-baskets. In Proceedings of the Seventh International Conference on High Performance Computing (HiPC 2000).

Strehl, A., Ghosh, J., & Mooney, R. (2000). Impact of similarity measures on web-page clustering. In Workshop on Artificial Intelligence for Web Search (AAAI 2000), pp. 58–64.

Vapnik, V. N. (1998). Statistical Learning Theory. John Wiley & Sons.

Wagstaff, K. (2002). Intelligent Clustering with Instance-Level Constraints. Ph.D. thesis, Cornell University.

Wagstaff, K., Cardie, C., Rogers, S., & Schroedl, S. (2001). Constrained K-Means clustering with background knowledge. In Proceedings of 18th International Conference on Machine Learning (ICML-2001).

Wallace, C. S., & Dowe, D. L. (1999). Minimum message length and Kolmogorov complexity. In The Computer Journal.

Xing, E. P., Ng, A. Y., Jordan, M. I., & Russell, S. (2003). Distance metric learning, with application to clustering with side-information. In Advances in Neural Information Processing Systems 15. MIT Press.

Zhang, T., Ramakrishnan, R., & Livny, M. (1996). BIRCH: An efficient data clustering method for very large databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 103–114.
