
Incremental Hierarchical Clustering of Text Documents

by Nachiketa Sahoo

Adviser: Jamie Callan

May 5, 2006

Abstract

Incremental hierarchical text document clustering algorithms are important in organizing documents generated from streaming on-line sources such as Newswire and Blogs. However, this is a relatively unexplored area in the text document clustering literature. Popular incremental hierarchical clustering algorithms, namely Cobweb and Classit, have not been applied to text document data. We discuss why, in their current form, these algorithms are not suitable for text clustering and propose an alternative formulation. This includes changes to the underlying distributional assumption of the algorithm in order to conform with the empirical data. Both the original Classit algorithm and our proposed algorithm are evaluated using Reuters newswire articles and the Ohsumed dataset, and the gain from using a more appropriate distribution is demonstrated.

1 Introduction

Document clustering is an effective tool to manage information overload. By grouping similar documents together, we enable a human observer to quickly browse large document collections [6], make it possible to easily grasp the distinct topics and subtopics (concept hierarchies) in them, and allow search engines to efficiently query large document collections [16], among many other applications. Hence, it has been widely studied as a part of the broad literature of data clustering. One such survey of the existing clustering literature can be found in Jain et al. [13].

The most studied document clustering algorithms are batch clustering algorithms, which require all the documents to be present at the start of the exercise and cluster the document collection by making multiple iterations over it. But, with the advent of online publishing on the World Wide Web, the number of documents being generated every day has increased considerably. Popular sources of informational text documents such as Newswire and Blogs are continuous in nature. To organize such documents naively using existing batch clustering algorithms, one might attempt to perform clustering on the documents collected so far. But this is extremely time consuming, if not impossible, due to the sheer volume of documents. One might be tempted to convert the existing batch clustering algorithms into incremental clustering algorithms by performing batch clustering on periodically collected small batches of documents and then merging the generated clusters. However, ignoring for the moment the problem of deciding on an appropriate time window to collect documents, there will always be a wait time before a newly generated document can appear in the cluster hierarchy. This delay would be unacceptable in several important scenarios, e.g., financial services, where trading decisions depend on breaking news and quick access to appropriately classified news documents is important. A clustering algorithm in such a setting needs to process the documents as soon as they arrive. This calls for the use of an incremental clustering algorithm.

There has been some work in incremental clustering of text documents as a part of the Topic Detection and Tracking initiative ([1], [19], [10] and [7]) to detect a new event from a stream of news articles. But the clusters generated by this task are not hierarchical in nature. Although that was adequate for the purpose of new event detection, we believe this is a limitation. The benefits of using a hierarchy of clusters instead of clusters residing at the same level of granularity are twofold. First, by describing the relationship between groups of documents, one makes it possible to quickly browse to the specific topic of interest. The second reason is a technical one. Finding the right number of clusters in a set of documents is an ill-formed problem when one


does not know the information needs of the end user. But, if we present the user with a topic hierarchy populated with documents, which she can browse at her desired level of specificity, we would circumvent the problem of finding the right number of clusters while generating a solution that would satisfy users with different needs.

In spite of the potential benefits of an incremental algorithm that can cluster text documents as they arrive into an informative cluster hierarchy, this is a relatively unexplored area in the text document clustering literature. In this work we examine a well known incremental hierarchical clustering algorithm, Cobweb, that has been used in non-text domains, and its variant Classit. We discuss why they are not suitable to be directly applied to text clustering and propose a variant of these algorithms that is based on the properties of text document data. Then we evaluate both algorithms using real world data and show the gains obtained by our proposed algorithm.

1.1 Contribution of this research

In this paper we demonstrate methods to carry out incremental hierarchical clustering of text documents. Specifically, the contributions of this work are:

1. A Cobweb-based algorithm for text document clustering where word occurrence attributes follow Katz's distribution.

2. Evaluation of the existing algorithms and our proposed algorithm on large real world document datasets.

In Section 2 we briefly review the text clustering literature and describe the Cobweb and Classit algorithms. In Section 3 we describe key properties of text documents that are central to this work. In Section 4 we explain the contributions of our work. In Section 5 we describe the cluster quality metrics that we use to evaluate the results obtained. In Section 6 we explain the setup of the experiment and discuss the results. In Section 7 we conclude with the scope for future research.

2 Literature review

Clustering is a widely studied problem in the Machine Learning literature [13]. The prevalent clustering algorithms have been categorized in different ways depending on different criteria, such as hierarchical vs. non-hierarchical, partitional vs. agglomerative, deterministic vs. probabilistic, and incremental vs. batch algorithms. Hierarchical and non-hierarchical clustering algorithms are distinguished by whether they produce a cluster hierarchy or a set of clusters all belonging to the same level. Different hierarchical and non-hierarchical clustering algorithms for text documents have been discussed by Manning and Schutze [17]. Clustering algorithms can be partitional or agglomerative in nature. A partitional algorithm starts with one large cluster containing all the documents in the dataset and divides it into smaller clusters. An agglomerative clustering algorithm, on the other hand, starts with each document in its own cluster and combines the most similar clusters until the desired number of clusters is obtained. Deterministic clustering algorithms assign each document to only one cluster, while probabilistic clustering algorithms produce the probabilities of each item belonging to each cluster; the former are said to make "hard" assignments while the latter make "soft" assignments. Incremental clustering algorithms make one or very few passes over the entire dataset and decide the cluster of an item as they see it. Batch clustering algorithms, on the other hand, iterate over the entire dataset many times and gradually change the assignments of items to clusters so that a clustering criterion function is improved. One such criterion function is the average similarity among documents inside the clusters formed. Another criterion function is the average similarity between a document in a cluster and documents outside the cluster. The first criterion is called average internal similarity and the second average external similarity. In a clustering solution we would want high average internal similarity, because that would mean that our clusters are


composed of similar items. We would also want low average external similarity, because that would mean our clusters are dissimilar, i.e., they do not overlap. The final set of clusters is produced after many iterations, when no further improvement of the cluster assignment is possible.

Clustering to browse large document collections (Scatter/Gather)

Cutting et al. [6] is one of the first works to suggest a cluster aided approach, called Scatter/Gather, to browse large document collections. It describes two fast routines named Buckshot and Fractionation to find the centroids of the clusters to be formed. Then it assigns the documents in the collection to the nearest centroid and recomputes the centroids iteratively until very little or no improvement is observed. The last step is similar to Simple K-means clustering, except that in Simple K-means one initially assigns k randomly chosen items as the centroids of k clusters [17]. Note that k is a fixed, user provided number. Buckshot finds the k centers in the document dataset by drawing a sample of √(kn) documents and clustering them into k clusters using an agglomerative hierarchical clustering routine. The agglomerative hierarchical clustering algorithms have a time complexity of O(n²). By drawing a random sample of size √(kn), the time complexity is reduced to O(kn). Fractionation, on the other hand, finds k centroids in the following manner. It divides the set of documents into buckets of size m, where m > k. Then it clusters each bucket into ρm clusters, where ρ < 1 is a constant. It then repeats the process of partitioning the data and clustering them, treating each formed cluster as one data item, until k clusters are obtained. Cutting et al. have shown that Fractionation has a time complexity of O(mn). The centers of the clusters formed by the two methods are returned as the starting points for the Simple K-means clustering routine. With the help of these two routines they have proposed a cluster aided approach to browse document collections, in which the program presents the user with a set of clusters for the document dataset (Scatter) along with their descriptive labels. The user can then select the clusters that interest her and submit them to the program. The program merges the documents contained in those clusters (Gather) and clusters them again. This process is repeated until the user's information need is met or the user decides to stop the process. The recursive clustering idea proposed in Scatter/Gather can be effective in browsing large document sets, especially when one does not know enough about the documents to query a deployed search engine using keywords. This concept loosely parallels the idea of organizing documents into a hierarchy of topics and subtopics, except that the organization in this case is guided by the user and executed by a clustering routine. However, Scatter/Gather has its limitations. It is a batch clustering routine, hence it cannot be used in some important scenarios as described in the Introduction. Another limitation that Scatter/Gather shares with many other clustering algorithms is that it requires the input of k, the number of clusters to present to the user. A value of k different from the number of subtopics in the collection might lead to meaningless clusters.

Right number of clusters

Finding the right number of clusters in a non-hierarchical clustering exercise is often a difficult problem [18]. The approaches suggested in the literature can, in general, be divided into two groups [4]. The first approach is multi-fold cross validation with likelihood as the objective function, in which one fits a series of mixture models with different numbers of components to a subset of the data, called the training data, and computes the likelihood of each model given the remaining subset of the data, called the testing data. The model that results in the highest likelihood is selected. The second approach also fits a mixture model to the data and computes the likelihood of the model given the entire dataset using different numbers of clusters, but it penalizes a model with a higher number of clusters for its increased complexity. Observe that a higher number of clusters can be made to fit any dataset better than a lower number of clusters. Hence, by penalizing a clustering solution for its complexity one can achieve a trade off between the fitness, or likelihood, of the model and its complexity, which is optimized at the right number of clusters. One such work has been done by Cheeseman and Stutz in their AutoClass algorithm [5]. Other such works include the Bayesian Information Criterion and the Minimum Description Length criterion [8]. A different approach has been suggested in Liu et al. [16] for clustering text documents. It uses the stability of clustering solutions over multiple runs, at each of a set of cluster counts, to decide the right number of clusters for the document dataset.


Even when the "right" number of clusters can be determined by an algorithm based on some criterion, human observers often differ from each other about the clusters existing in the dataset and about what the right number of clusters should be. One alternative solution is to generate a hierarchy of clusters, also called a dendrogram, with all the documents belonging to a single cluster at the top of the hierarchy, each document in its own cluster at the lowest level of the hierarchy, and intermediate numbers of clusters at the levels in between. Thus, the user can look at the desired level in the hierarchy and find a number of clusters that meets her requirement ([17], [13]).

Incremental document clustering

As part of the Topic Detection and Tracking (TDT) initiative ([1], [19], [10] and [7]) some experiments have been done in incrementally clustering text documents. The TDT initiative is a DARPA sponsored project started to study and advance the state of the art in detection and tracking of new events in streams of news broadcasts and intelligence reports. The identified tasks of TDT are Story Segmentation, Retrospective Topic Detection, On-line New Event Detection, Topic Tracking and Link Detection. The Story Segmentation task involves breaking a stream of text or audio data without story delimiters into its constituent stories. Retrospective topic detection involves detecting new events in an already collected set of documents. On-line new event detection involves identifying a new event, e.g., an earthquake or a road accident, in a new document. Tracking involves keeping track of the evolution of an event by assigning incoming news stories to their corresponding events. Among these tasks, the on-line new event detection task involves incremental clustering. In this task a decision is made, after observing a new item, whether it belongs to one of the existing clusters or to a new cluster of its own.

The TDT team at Carnegie Mellon University (CMU) uses a threshold-based rule to decide whether a new document is another story of one of the detected events or belongs to a new event of its own. If the maximum similarity between the new document and any of the existing clusters is more than a threshold (tc), the new document is said to belong to the cluster to which it is most similar, and it is merged into that cluster. If the maximum similarity is less than tc but more than another threshold, tn, then the document is assumed to be an old story, but it is not merged into any cluster. If the maximum similarity is less than tn, then the document is accepted to be about a new event and a new cluster is formed. They have also investigated adding a time component to the incremental clustering. In this experiment, similarities of a new document to each of the past m documents are computed, but they are weighted down linearly depending on how old the past documents are. If the similarity scores computed in this manner are less than a preset threshold, then the new document is presumed to be about a new event. This work finds that the use of a time component improves the performance of the new event detection task.

The TDT team at the University of Massachusetts Amherst (UMASS) takes a variable thresholding approach to the on-line event detection task [1]. For each document that initiates a new cluster, the top n words are extracted and called a query vector. The similarity of the query vector to the document from which the query was extracted defines an upper bound on the threshold required to be met by a document to match the query. A time dependent component is also used in the variable threshold that makes it harder for a new document to match an older query. When a new document dj is compared to a past query qi, the threshold is computed as 0.4 + p × (sim(qi, di) − 0.4) + tp × (j − i), where 0 < p < 1 and tp, a time penalty factor, are tunable parameters. qi is the query generated from document di. Such a threshold is computed for all existing queries qi. If the similarity of the new document dj does not exceed any of the thresholds, then the document is assigned to a new cluster and a query is computed for it; otherwise it is added to the clusters assigned to the queries it triggers. The newly generated cluster is said to have detected a new news event.
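As a concrete illustration, the thresholding rule above can be written down directly. The sketch below is ours, not the UMASS system's code; the function and variable names are illustrative, and p and tp are assumed to be already-tuned constants.

    def match_threshold(sim_qi_di, p, tp, i, j):
        """Threshold that document j must exceed to match query q_i:
        0.4 + p * (sim(q_i, d_i) - 0.4) + tp * (j - i)."""
        return 0.4 + p * (sim_qi_di - 0.4) + tp * (j - i)

    def triggered_queries(sims_to_queries, sims_qi_di, p, tp, j):
        """Indices of existing queries triggered by document j; an empty list
        means d_j starts a new cluster (i.e., a new event is detected).
        sims_to_queries[i] = sim(q_i, d_j); sims_qi_di[i] = sim(q_i, d_i)."""
        return [i for i, (s_new, s_self) in enumerate(zip(sims_to_queries, sims_qi_di))
                if s_new > match_threshold(s_self, p, tp, i, j)]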

Outside the TDT initiative, Zhang and Liu in a recent study have proposed a competitive learning algorithm, which is incremental in nature and does not need to be supplied with the correct number of clusters [20]. The algorithm, called Self Splitting Competitive Learning, starts with a prototype vector that is a property of the only cluster present initially. During the execution of the algorithm the prototype vector is split and updated to approximate the centroids of


the clusters in the dataset. The update of the prototype vector is controlled, i.e., when a new data point is added to the cluster, the prototype vector is updated only if the data point is near enough to the prototype. This is determined by another property vector that starts away from the prototype and converges to it as more and more data points are added. The time for splitting the cluster associated with the prototype is determined based on a threshold condition. When there is more than one prototype, a new data point is added to the cluster of the prototype nearest to it. They have demonstrated their algorithm on text snippets returned from search engines in response to a query. However, the success of this algorithm on datasets with longer text documents is yet to be demonstrated.

Yet another on-line algorithm, called frequency sensitive competitive learning, has been proposed and evaluated on text datasets by Banerjee and Ghosh [2]; it is designed to produce clusters of items of approximately equal sizes. In this work a version of the K-means clustering algorithm called spherical K-means has been modified so that the dispersion of the distributions associated with the clusters reduces as more and more data points are added to them. This makes larger clusters less likely candidates for a new data point than the smaller clusters. Thus, the algorithm is tailored to produce clusters which are more or less equal in size.

All of these algorithms produce non-hierarchical clustering solutions, which forgoes the opportunity to use clustering as an aid to detect the topic and subtopic structure within a large document collection. Also, the TDT experiments effectively exploit the information in the time stamps available with news stories, i.e., they assume that news stories describing the same event will occur within a brief span of time. Such information may not always be available.

Incremental Hierarchical Clustering: Nominal Attributes

Methods have been proposed in the non-text domain to cluster items in an incremental manner into hierarchies. Most notable among them is the Cobweb algorithm by Fisher [9] and its derivative Classit [12]. Cobweb is an algorithm to incrementally cluster data points with nominal attributes into cluster hierarchies.

At the heart of Cobweb is a cluster quality measure called Category Utility.

Let C1, ..., CK be the child clusters of a cluster Cp. The Category Utility of C1, ..., CK is computed as

CU_p[C1, ..., CK] = ( Σ_{k=1}^{K} P(Ck) Σ_i Σ_j [ P(Ai = Vij | Ck)² − P(Ai = Vij | Cp)² ] ) / K    (1)

where,

P(Ck) = the probability that a document belonging to the parent cluster Cp belongs to the child cluster Ck,

Ai = the ith attribute of the items being clustered (say A1 ∈ {male, female}, A2 ∈ {Red, Green, Blue}; assumed to be a multinomial variable), and

Vij = the jth value of the ith attribute (say, V12 indicates "female").

P(Ai = Vij | Ck)² is the expected number of times we can correctly guess the value of the multinomial variable Ai to be Vij for an item in the cluster Ck when one follows a probability matching guessing strategy. For example, if we have a variable that takes values A, B and C with probabilities 0.3, 0.5 and 0.2, and we randomly predict that the variable takes value A 0.3 fraction of the time, B 0.5 fraction of the time and C 0.2 fraction of the time, we would be correct in predicting A 0.3 × 0.3 = 0.09 fraction of the time, B 0.25 fraction of the time and C 0.04 fraction of the time. A good cluster, in which the attributes of the items take similar values, will have high P(Ai = Vij | Ck) values, hence a high score of Σ_j P(Ai = Vij | Ck)². Cobweb maximizes the sum of P(Ai = Vij | Ck)² scores over all possible assignments of a document to children clusters. When the algorithm assigns a new item to a child node of the node p, it assigns the item in such a manner that the total gain in the expected number of correct guesses obtained by moving the item from p to its child node, Σ_i Σ_j [ P(Ai = Vij | Ck)² − P(Ai = Vij | Cp)² ], is maximized. In this manner the algorithm maximizes the utility function for each node to which a new item is added.

The Cobweb control structure is shown in Figure 1.


Algorithm Cobweb (adapted from Fisher's original work [9])

function Cobweb(item, root)
    Update the attribute value statistics at the root
    If root is a leaf node then
        Return the expanded node that accommodates the new object
    else
        Find the best child of the root to host the item and perform the
        qualifying step (if any) among the following:
            1. Create a new node for the item instead of adding it to the
               best host, if that leads to improved Category Utility.
            2. Merge nodes if it leads to improved Category Utility and
               call Cobweb(item, merged node).
            3. Split a node if it leads to improved Category Utility and
               call Cobweb(item, root).
        If none of the above steps are performed then
            Call Cobweb(item, best child of root)
        end if
    end if

Figure 1. Cobweb control structure.
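The control structure of Figure 1 maps naturally onto a short recursive routine. The sketch below is ours, not Fisher's code; the Node interface (update_statistics, two_best_children, the cu_if_* scoring helpers, merge, split, and so on) is assumed to encapsulate the Category Utility computations described above.

    def cobweb(item, node):
        """Place `item` into the hierarchy rooted at `node`, following the
        control structure of Figure 1 (structural sketch; `node` is assumed
        to implement the statistics and Category Utility helpers)."""
        node.update_statistics(item)            # incorporate the item's attribute values
        if node.is_leaf():
            return node.expand_with(item)       # leaf becomes a parent of the old leaf and the item

        best, second = node.two_best_children(item)
        scores = {
            "add":   node.cu_if_added_to(item, best),
            "new":   node.cu_if_new_child(item),
            "merge": node.cu_if_merged(best, second, item),
            "split": node.cu_if_split(best, item),
        }
        action = max(scores, key=scores.get)

        if action == "new":
            return node.add_new_child(item)
        if action == "merge":
            return cobweb(item, node.merge(best, second))
        if action == "split":
            node.split(best)
            return cobweb(item, node)
        return cobweb(item, best)               # default: descend into the best host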

An illustration of the clustering process is given in Figure 2.

[Figure 2 illustrates two steps of the clustering process:
(i) Addition of a new item (2) to a leaf node (1): node (1) is expanded into a parent node (3) with children (1) and (2).
(ii) Let (104) be a new item arriving at a node (89) with children (34) and (67), where (67) in turn has children (23) and (12). At (89): should the new item be added to (34), to (67), or to a new cluster of its own next to (34) and (67)? Category Utility comparison, as described in Figure 1, decides; let the answer be (67). Then at (67): should the new item be added to (23) or (12)?]

Figure 2. Cobweb illustrated.

Assume that there is only one attribute of interest, called t, which takes values in {A, B, C}. Also assume that we have three items a, b and c with t values A, B and C respectively. Further assume that the objects are presented in the order specified, i.e., first a, followed by b, which is followed by c.

After the first two items are presented, the following cluster configuration is arrived at without any computation of category utility (first part of Figure 2).

C3: P(C3) = 1; contains (a, b); t = A, t = B
    C1: P(C1) = 0.5; contains (a); t = A
    C2: P(C2) = 0.5; contains (b); t = B

Figure 3. After the first two items are added.

C3 is the root cluster and C1 and C2 are two child clusters, each containing one item. P(C1) is the probability that a document randomly picked from the parent cluster of C1, i.e., C3, belongs to C1. Similarly for C2.


Let's add the third item c to the root node. We can add it at the level of C1 and C2 (level 2) as another cluster C4, or we can add it inside C1 or C2, which would delegate the item c to a third (new) level. So, our options are (omitting the c within (b, c) configuration, which is analogous to the c within (a, c) configuration described below):

Configuration 1:
C3: P(C3) = 1; contains (a, b, c); t = A, t = B, t = C
    C1: P(C1) = 1/3; contains (a); t = A
    C2: P(C2) = 1/3; contains (b); t = B
    C4: P(C4) = 1/3; contains (c); t = C

Configuration 2:
C3: P(C3) = 1; contains (a, b, c); t = A, t = B, t = C
    C4: P(C4) = 2/3; contains (a, c); t = A, t = C
        C1: P(C1) = 0.5; contains (a); t = A
        C5: P(C5) = 0.5; contains (c); t = C
    C2: P(C2) = 1/3; contains (b); t = B

Figure 4. Two partitions of the root cluster.

At this point the Category Utilities of the two configurations let us decide which configuration to choose. Note that we need to compute the category utility of the two partitions of the root cluster. They can be computed using Expression (1) as described below.

For the first configuration in Figure 4 the parent cluster is C3 and the child clusters are C1, C2 and C4. The category utility of this configuration is:

CU1 = [ Σ_{k∈{1,2,4}} P(Ck) ( Σ_{t∈{A,B,C}} P(t|Ck)² − Σ_{t∈{A,B,C}} P(t|C3)² ) ] / 3
    = [ (1/3){1² − ((1/3)² + (1/3)² + (1/3)²)} + (1/3){1² − ((1/3)² + (1/3)² + (1/3)²)} + (1/3){1² − ((1/3)² + (1/3)² + (1/3)²)} ] / 3
    = 2/9

For the second configuration in Figure 4 the parent cluster is C3 and the child clusters are C4 and C2:

CU2 = [ Σ_{k∈{4,2}} P(Ck) ( Σ_{t∈{A,B,C}} P(t|Ck)² − Σ_{t∈{A,B,C}} P(t|C3)² ) ] / 2
    = [ (2/3){((1/2)² + (1/2)²) − ((1/3)² + (1/3)² + (1/3)²)} + (1/3){1² − ((1/3)² + (1/3)² + (1/3)²)} ] / 2
    = 1/6


Since CU1 > CU2, we select configuration 1 over configuration 2. Looking at Figure 4, it is intuitive to make a new cluster for the third item, because it has an attribute value not seen in any of the existing categories.

There is one more possible configuration, where c is added below C2 instead of C1, but that is symmetrical to the second configuration in Figure 4. So, the analysis would be identical to the one shown in the previous paragraph.
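The arithmetic above is easy to check mechanically. The helper below is our own (it evaluates Expression (1) for items described by a single nominal attribute) and reproduces CU1 = 2/9 and CU2 = 1/6 for the two partitions of Figure 4.

    from collections import Counter

    def category_utility(children, parent):
        """Expression (1) for items described by one nominal attribute.
        `children` is a list of child clusters, each a list of attribute values;
        `parent` is the concatenation of all children."""
        K = len(children)
        n = len(parent)
        parent_score = sum((c / n) ** 2 for c in Counter(parent).values())
        cu = 0.0
        for child in children:
            child_score = sum((c / len(child)) ** 2 for c in Counter(child).values())
            cu += (len(child) / n) * (child_score - parent_score)
        return cu / K

    items = ["A", "B", "C"]                                  # t values of a, b and c
    print(category_utility([["A"], ["B"], ["C"]], items))    # configuration 1: 2/9 = 0.2222...
    print(category_utility([["A", "C"], ["B"]], items))      # configuration 2: 1/6 = 0.1666...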

Incremental clustering algorithms, such as Cobweb, are sensitive to the order in which items are presented [9]. Cobweb makes use of split and merge operations to correct this problem. In the merge operation the two child nodes with the highest and second highest Category Utility are removed from the original node and made child nodes of a new node, which takes their place under the parent node. In the split operation the best node is removed and its child nodes are made children of the parent of the removed node. Merge and split operations are only carried out if they lead to a better Category Utility than is obtainable by either assigning the item to the existing best node or to a new cluster of its own. By using these two operators, the algorithm remains flexible in the face of changes in the properties of data items in subsequent observations.

Merge: node 1 with children 2, 3 and 4 becomes node 1 with children 2 and 5, where the new node 5 has children 3 and 4 (merging 3 and 4 into 5).

Split: node 1 with children 2 and 3, where 3 has children 4 and 5, becomes node 1 with children 2, 4 and 5 (splitting 3 into 4 and 5).

Figure 5. Merge and split operations illustrated.

Incremental Hierarchical Clustering: Numerical Attributes

We now consider an extension of Cobweb from nominal attributes to numerical attributes. Gennari et al. [12] have shown that in order to use Cobweb for data items with numeric, rather than nominal, attribute values we need to make some assumption about the distribution of attribute values. When the values of each attribute follow a normal distribution, they have shown that the Category Utility function can be written as

CU_p[C1, ..., CK] = ( Σ_k P(Ck) Σ_i ( 1/σ_ik − 1/σ_ip ) ) / K

where,

σ_ip = the standard deviation of the value of attribute i in the parent node p, and

σ_ik = the standard deviation of the value of attribute i in the child node k.

This algorithm is known as the Classit algorithm.
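A minimal sketch of this Category Utility for numeric attributes follows (our own code; the small floor on the standard deviation is our guard against division by zero for singleton clusters and is not prescribed by the formula above).

    import statistics

    def classit_category_utility(children, eps=1e-6):
        """Category Utility of a partition under the Classit formulation:
        (1/K) * sum_k P(C_k) * sum_i (1/sigma_ik - 1/sigma_ip).
        `children` is a list of clusters; each cluster is a list of items,
        and each item is a list of numeric attribute values."""
        parent = [item for child in children for item in child]
        K, n, n_attrs = len(children), len(parent), len(parent[0])
        sigma_p = [max(statistics.pstdev([item[i] for item in parent]), eps)
                   for i in range(n_attrs)]
        cu = 0.0
        for child in children:
            inner = sum(1.0 / max(statistics.pstdev([item[i] for item in child]), eps)
                        - 1.0 / sigma_p[i]
                        for i in range(n_attrs))
            cu += (len(child) / n) * inner
        return cu / K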

We have not seen any prior application of either of these algorithms to text clustering. Hence, their performance on text document data was uncertain at the time of this work. Further, word occurrence counts, the attributes commonly used to represent a text document, follow a skewed distribution, unlike the Normal distribution (Figure 6). Also, the Normal distribution assumes that the attributes are real numbers, but word occurrence counts are nonnegative integers. They cannot be treated as nominal attributes either, because the occurrence counts are not contained in a bounded set, which one would have to assume while treating them as nominal attributes. A more suitable distribution for such count data is the Negative Binomial, or Katz's distribution [14].

Our work proposes to improve upon the original Cobweb algorithm using distributional assumptions that are more appropriate for word count data.


3 Text Documents and word distributions

Text, as we commonly know it, is available in the form of unstructured documents. Before we can use such documents for classification or clustering, we need to convert them to items with attributes and values. A popular way of converting a document to such a form is to use the words1 in the document as attributes and the number of times each word occurs in the document, or some function of it, as the value of the corresponding attribute. This is called the "Bag of Words" approach. One consequence of using such a method to convert documents to an actionable form is that one forgoes the information contained in the order of the words. Despite this drawback, the bag-of-words approach is one of the most successful and widely used methods of converting text documents into an actionable form.

Several attempts have been made to characterize the distribution of words across documents. This is useful in judging the information content of a word. For instance, a word that occurs uniformly in every document of the corpus, e.g., "the", is not as informative as a word that occurs frequently in only a few, e.g., "Zipf".

The occurrence statistics of a word in a document can be used along with the information content of the word to infer the topic of the document and to cluster documents of similar topics into the same group, as is done in this work. Manning and Schutze have discussed several models to characterize the occurrence of words across different documents [17].

3.1 Models based on Poisson distribution

3.1.1 Poisson

The Poisson distribution has been used to model the number of times a word occurs in a document. The probability of a word occurring k times in a document is given by

P(k) = λ^k e^(−λ) / k!    (2)

where λ is a rate parameter. However, from empirical observations, it has been found that the Poisson distribution tends to overestimate the frequency of informative words (content words) [17].

3.1.2 Two Poisson Model

There have been attempts to characterize the occurrence of a word across documents using a mixture of Poisson distributions. One such attempt uses two Poisson distributions to model the probability of a word occurring a certain number of times in a document. One of the distributions captures the rate of word occurrence when the word occurs because it is topically relevant to the document. The second distribution captures the rate of word occurrence when the word occurs without being topically relevant to the document. This mixture of two probability distributions has the probability density function

P(k) = α λ1^k e^(−λ1) / k! + (1 − α) λ2^k e^(−λ2) / k!    (3)

where α is the probability of the word being topically relevant and 1 − α is the probability of the word being topically unrelated to the document.

It has been empirically observed that, although the two Poisson model fits the data better than a single Poisson model [3], a spurious drop is seen in the probability of a word occurring twice in a document [14]. The fitted distribution has a lower probability for a word occurring twice in a document than for it occurring three times, i.e., it predicts that there are fewer documents that contain a word twice than there are documents that contain the same word three times. But empirically it has been observed that the document count decreases monotonically for increasing numbers of occurrences of a word (see Figure 6).

1. Throughout this paper we shall use word and term interchangeably to refer to the same thing, i.e., a contiguous sequence of alphanumeric characters delimited by non-alphanumeric character(s). E.g., the first word or term in this footnote is "Throughout".


3.1.3 Negative Binomial

A proposed solution to the above problem is to use a mixture of more than two Poisson distributions to model the word occurrences. A natural extension of this idea is to use a Negative Binomial distribution, which is a gamma mixture of an infinite number of Poisson distributions [11]. The probability density function of the Negative Binomial distribution is given below:

P(k) = ( (k + r − 1)! / (k! (r − 1)!) ) p^r (1 − p)^k    (4)

where p and r are parameters of the distribution. Although the Negative Binomial distribution fits the word occurrence data very well, it can be hard to work with because it often involves computing a large number of coefficients [17]. This has been confirmed in our analysis (see Expressions (28) and (29) in Section 4.2).

3.1.4 Zero Inflated Poisson

When we observe the word occurrence counts in documents, we find that most words occur in only a few documents in the corpus. So, for most of the words, the count of documents where they occur zero times is very large (see Figure 6). Looking at the shape of the empirical probability density function, we attempt to model the occurrence counts using a Zero Inflated Poisson distribution, which assigns a large probability mass to the value 0 and distributes the remaining probability mass over the rest of the occurrence counts according to a Poisson distribution.

[Figure 6: histogram of the number of occurrences (0-7) of the word "result" across documents; the y-axis shows density (0.0-0.8).]

Figure 6. The occurrence of a typical word (“result”) across different documents in our test collection.

The probability density function of the Zero Inflated Poisson distribution is given by

P(k) = (1 − α) δ_k + α λ^k e^(−λ) / k!,   k = 0, 1, 2, ...    (5)

where

δ_k = 1 if k = 0, and 0 otherwise.

As we shall demonstrate in Section 3.3, this distribution does not fit text data as well as the Negative Binomial or Katz's distribution.

3.2 Katz's K-mixture model

This distribution, proposed by Katz [14], although simple to work with, has been shown to model the occurrences of words in documents better than many other distributions, such as the Poisson and Two Poisson, and about as well as the more complex Negative Binomial distribution [17]. Katz's distribution assigns the following probability to the event that word i occurs k times in a document2:

P(k) = (1 − α) δ_k + (α / (β + 1)) (β / (β + 1))^k    (6)

where δ_k = 1 if k = 0, and 0 otherwise. The MLE estimates of the parameters α and β are:

β = (cf − df) / df    (7)

α = (1/β) × (cf / N)    (8)

cf = collection frequency = the number of times word i occurred in the document collection, obtained by adding up the number of times the word occurred in each document. Here, a collection can be whatever we deem our universe of documents to be. It can be the entire corpus of documents or a subset of it.

df = document frequency = the number of documents in the entire collection that contain the word i.

From (6) it follows that

P(0) = 1 − α + α/(β + 1) = 1 − df/N    (9)
     = 1 − Pr(the word occurs in a document)
     = Pr(the word does not occur in a document)

Also, it follows that

P(k) = (α / (β + 1)) (β / (β + 1))^k,   k = 1, 2, ...    (10)

Substituting p for β/(β + 1), we have

P(k) = α (1 − p) p^k    (11)

Let's define a parameter p0 as

p0 = P(0)    (12)

Using (7) we find that

p = β/(β + 1) = ((cf − df)/df) / (cf/df) = (cf − df) / cf    (13)
  = Pr(the word repeats in a document) / Pr(the word occurs in a document)
  = Pr(the word repeats | the word occurs)

2. In this section we shall discuss the case of one word, the ith word. Hence, we shall drop the subscript i from the equations and expressions.


Hence, 1 − p can be interpreted as the probability of the word occurring only once. Or, it can be thought of as a scaling factor used to make (11) and (12) together a valid probability density function.

We can write Expression (6) for k = 0, using p, as

P(0) = (1 − α) + α(1 − p) = 1 − αp

Hence, α in terms of p0 and p is

p0 = 1 − αp  ⇒  αp = 1 − p0  ⇒  α = (1 − p0) / p    (14)

Expression (11) can now be written as

P(k) = (1 − p0)(1 − p) p^(k−1)    (15)

when k > 0. Using Expressions (12) and (15), we can fully specify Katz's distribution. The two parameters are p0 and p, which can be estimated as (see Expressions (9) and (13))

p0̂ = 1 − df/N    (16)

and

p̂ = (cf − df)/cf    (17)

It can be shown that if a distribution is defined by Expressions (12) and (15), then the estimates (16) and (17) are the MLEs of the parameters p0 and p (see Appendix A).
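A small sketch of these estimators and of the resulting probability mass function follows (our own helper functions; the counts in the example at the bottom are hypothetical).

    def katz_parameters(cf, df, n_docs):
        """Parameter estimates from Expressions (16) and (17):
        p0_hat = 1 - df/N  and  p_hat = (cf - df)/cf."""
        p0 = 1.0 - df / n_docs
        p = (cf - df) / cf if cf > 0 else 0.0
        return p0, p

    def katz_pmf(k, p0, p):
        """P(word occurs k times in a document), Expressions (12) and (15)."""
        if k == 0:
            return p0
        return (1.0 - p0) * (1.0 - p) * p ** (k - 1)

    # Hypothetical counts: a word with cf = 150 occurrences spread over df = 100
    # of N = 1000 documents.
    p0, p = katz_parameters(cf=150, df=100, n_docs=1000)
    print(p0, p)                                   # 0.9, 0.333...
    print([round(katz_pmf(k, p0, p), 4) for k in range(4)])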

3.3 Fitness comparison

We estimated the parameters of the Zero Inflated Poisson and Negative Binomial distributions using the method of moments, and the parameters of Katz's distribution using the Maximum Likelihood Estimation (MLE) method. The reason for using the method of moments and not MLE is that for the Negative Binomial and the Zero Inflated Poisson distributions the MLE can only be found numerically, which is computationally expensive for our task of incremental clustering. One can still use numerical methods to determine the MLEs of the parameters, which admittedly have better properties, if one is willing to pay the cost in terms of delay. In this work we shall limit ourselves to estimates that have closed form expressions and can be computed efficiently, because our goal is to carry out incremental document clustering in real time.

3.3.1 Zero Inflated Poisson

If the probability density function of a Zero Inflated Poisson distribution is given in the form of Expression (5), then the method of moments estimates of its parameters α and λ are

λ̂ = Var(X)/X̄ + X̄ − 1    (18)

and

α̂ = X̄ / λ̂    (19)

3.3.2 Negative Binomial

For the Negative Binomial distribution, the parameters p and r can be estimated as

r̂ = X̄² / (Var(X) − X̄)    (20)

p̂ = X̄ / Var(X)    (21)

For Katz's distribution we used Expressions (16) and (17) to estimate the parameters p0 and p.
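The method of moments estimates have simple closed forms, as sketched below (our own code; the sample counts are hypothetical, and the use of population moments is an assumption since the text does not specify which variant was used).

    import statistics

    def zip_moment_estimates(counts):
        """Zero Inflated Poisson, Expressions (18) and (19):
        lambda_hat = Var(X)/mean(X) + mean(X) - 1,  alpha_hat = mean(X)/lambda_hat."""
        m, v = statistics.mean(counts), statistics.pvariance(counts)
        lam = v / m + m - 1.0
        return lam, m / lam

    def nb_moment_estimates(counts):
        """Negative Binomial, Expressions (20) and (21):
        r_hat = mean^2/(Var - mean),  p_hat = mean/Var  (requires Var > mean)."""
        m, v = statistics.mean(counts), statistics.pvariance(counts)
        return m ** 2 / (v - m), m / v

    # Hypothetical, over-dispersed word counts across 100 documents.
    counts = [0] * 90 + [1] * 6 + [2, 2, 3, 5]
    print(zip_moment_estimates(counts))
    print(nb_moment_estimates(counts))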


We evaluated the fitness of these three distributions on three different datasets by computing the probabilities of the word occurrences using the estimated parameters. For each dataset we selected the top 100 terms by their cf × log(N/df) score. A distribution that has a higher likelihood than another can be considered a better fit to the data. For each term a pairwise comparison of the fitness of the different distributions was carried out in this manner. The results are shown in the form of three dominance matrices in Table 1. Each cell records the number of terms for which the distribution of the row has a 10% or higher likelihood than the distribution of the column.

dataset: classic
            NB    Katz's   ZIP
  NB         0      55      92
  Katz's    41       0      96
  ZIP        7       4       0

dataset: tr41
            NB    Katz's   ZIP
  NB         0      41      98
  Katz's    58       0      98
  ZIP        2       2       0

dataset: k1a
            NB    Katz's   ZIP
  NB         0      63      98
  Katz's    35       0      98
  ZIP        2       2       0

Table 1. Likelihood comparisons: count of terms where the likelihood of the row distribution > likelihood of the column distribution × 1.1.

It can be observed from the table that Katz's distribution is not only easier to work with, as we will see in Section 4, it also fits the data better than the Zero Inflated Poisson (ZIP) and gives fitness comparable to the Negative Binomial (NB) distribution.

4 Algorithms for text

4.1 COBWEB: when attribute values follow Katz’s distribution

4.1.1 Category utility

Using words as attributes, we can derive the Category Utility function assuming that word occurrences follow Katz's distribution. For reference, the Category Utility formula as given in Cobweb is

(1/K) Σ_k P(Ck) [ Σ_i Σ_j ( P(Ai = Vi,j | Ck)² − P(Ai = Vi,j | Cp)² ) ]

Notice that for each attribute indexed by i we need to compute

Σ_j ( P(Ai = Vi,j | Ck)² − P(Ai = Vi,j | Cp)² )    (22)

where j is an index over the values of attribute i. In this case Vi,j takes values 0, 1, 2, ... because we are working with count data.

Hence, the first part of Expression (22) can be written as

CU_{i,k} = Σ_{f=0}^{∞} P(Ai = f | Ck)²    (23)

Let's use CU_{i,k} to refer to the contribution of attribute i towards the Category Utility of cluster k.

Substituting Expressions (12) and (15) in Expression (23), we obtain

CU_{i,k} = Σ_{f=0}^{∞} P(Ai = f | Ck)² = ( 1 − 2 p0 (1 − p0) − p (1 − 2 p0) ) / (1 + p)    (24)


Substituting the estimates of p0 and p from Expressions (16) and (17) in Expression (24), and simplifying, we get

CU_{i,k} = Σ_{f=0}^{∞} P(Ai = f | Ck)² = 1 − ( 2 × df × ( N − (cf × df)/(2 × cf − df) ) ) / N²    (25)

where df, cf, and N are counted within the category k.

Expression (25) specifies how to calculate the Category Utility contribution of an attribute in a category. Hence, the Category Utility of the Classit algorithm, when the distribution of attributes follows Katz's model, is given by

CU_p = (1/K) Σ_k P(Ck) [ Σ_i CU_{i,k} − Σ_i CU_{i,p} ]    (26)

where CU_{i,k} is given by Expression (25).
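A sketch of Expressions (25) and (26) in code follows (our own representation: for each cluster we keep, per word, its document frequency df and collection frequency cf within that cluster, together with the cluster size N).

    def cu_word_katz(df, cf, n_docs):
        """Expression (25): contribution of one word to the Category Utility of a
        cluster, computed from the within-cluster df, cf and N."""
        if df == 0:                   # word absent: P(count = 0) = 1, so the sum of squares is 1
            return 1.0
        return 1.0 - 2.0 * df * (n_docs - cf * df / (2.0 * cf - df)) / n_docs ** 2

    def category_utility_katz(children, parent, n_parent, child_sizes):
        """Expression (26): (1/K) * sum_k P(C_k) [sum_i CU_{i,k} - sum_i CU_{i,p}].
        `parent` and each element of `children` map word -> (df, cf) inside that
        cluster; the sums range over the parent's vocabulary."""
        K = len(children)
        cu_parent = sum(cu_word_katz(df, cf, n_parent) for df, cf in parent.values())
        cu = 0.0
        for stats, size in zip(children, child_sizes):
            cu_child = 0.0
            for word in parent:
                df, cf = stats.get(word, (0, 0))
                cu_child += cu_word_katz(df, cf, size)
            cu += (size / n_parent) * (cu_child - cu_parent)
        return cu / K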

4.2 Cobweb: when attribute values follow Negative Binomial distribution

The probability density function of the Negative Binomial distribution is

P(x) = ( (x + r − 1)! / (x! (r − 1)!) ) p^r (1 − p)^x    (27)

where p and r are the parameters of the distribution, which are to be estimated from the data.

4.2.1 Category utility

Substituting Expression (27) in (23), we obtain the contribution of a word in a child cluster towards the Category Utility:

CU_{i,k} = Σ_{x=0}^{∞} ( ( (x + r − 1)! / (x! (r − 1)!) ) p^r (1 − p)^(x−1) )²    (28)

This expression cannot be reduced to any simpler form, although it can be written using a hypergeometric function in the following manner:

CU_{i,k} = p^(2r) 2F1(r, r; 1; (1 − p)²) / (1 − p)²    (29)

One can use a library, such as the one available with Mathematica, to numerically evaluate 2F1(r, r; 1; (1 − p)²). In our experience this computation is three orders of magnitude more resource intensive than computing (25), the equivalent expression for Katz's distribution. As we described in Section 3.3, in this work we shall limit ourselves to methods that let us carry out incremental clustering in real time, i.e., in the time available between the arrival of two documents.

For this reason and the reasons cited in Sections 3.1 and 3.3, we shall fully explore only Katz's distribution and the original Classit algorithm based on the Normal distribution in our work.
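To illustrate the cost issue, Expression (29) can be evaluated with a naive truncated series for the Gauss hypergeometric function (a sketch of ours, not the procedure used above; production code would rely on a numerical library, e.g. scipy.special.hyp2f1, and the series converges only for 0 < p < 1).

    def hyp2f1(a, b, c, z, terms=500):
        """Truncated Gauss series 2F1(a, b; c; z) = sum_n (a)_n (b)_n / ((c)_n n!) z^n,
        adequate for |z| < 1 (naive sketch, not a robust implementation)."""
        total, term = 0.0, 1.0
        for n in range(terms):
            total += term
            term *= (a + n) * (b + n) / ((c + n) * (n + 1)) * z
        return total

    def cu_word_negative_binomial(p, r):
        """Expression (29): p^(2r) * 2F1(r, r; 1; (1-p)^2) / (1-p)^2, with 0 < p < 1."""
        z = (1.0 - p) ** 2
        return p ** (2 * r) * hyp2f1(r, r, 1.0, z) / z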

5 Cluster Evaluation Methods

5.1 Evaluating the clusters

One commonly used cluster quality measure is the purity of the clustering solution. The purity of a cluster is defined as

p_k = max_c {CF_k(c)} / N_k    (30)

where,

where,

− c is the index of classes

• class is a pre-specified group of items

− k is the index of clusters

• cluster is an algorithm generated group of items


− CF_k(c) = the number of items from class c occurring in cluster k, i.e., the frequency of class c in cluster k
− N_k = the number of items in cluster k

The purity of the entire collection of clusters can be found by taking an average of the cluster purities. Here, there are two kinds of averages one might consider: weighted or unweighted. If we assign each cluster a weight proportional to its size and take the weighted average, it is called the micro average, since each document gets equal weight. If we instead want to give equal weight to each cluster, we compute the arithmetic average; this is called the macro average. The first is a document level evaluation, while the second is a cluster level evaluation. Both these purities lie between 0 and 1.

The drawback of relying only on purity to evaluate the quality of a set of clusters becomes apparent in hierarchical clustering. When we collect clusters occurring at or near the lowest level of the hierarchy, we get clusters with very few documents in them, and hence clusters with high purity scores. In the limit, at the lowest level there are N clusters, each containing only one item. Hence, max_c {CF_k(c)} is 1 for each k ∈ {1, ..., N}, resulting in a purity score of 1. We get larger clusters at higher levels in the hierarchy, which are more likely to contain documents belonging to different classes, leading to a lower purity score. This illustrates how the purity score can be misleading when the number of clusters formed is different from the number of classes in the dataset. If we make more clusters than there are classes in the dataset, we bias the purity score up; if we make fewer, we bias it down.

To correct this problem, we define another score of the clustering solution in the following manner:

r_c = max_k {CF_k(c)} / N_c

where N_c is the size of class c. The other variables are as defined for the purity score in Expression (30). Here also we can compute the micro average or the macro average to obtain the score for the entire solution.

This is a purity computation with the clustering solution treated as the true classes of the data items and the human assigned classes treated as the solution to be evaluated. Using this measure we evaluate how well the "true" classes in the dataset are represented in the clusters formed.

These metrics, p_k and r_c, have interpretations that parallel the precision and recall metrics, respectively, in the information retrieval literature. Precision is the fraction of the retrieved documents that are relevant. Our p_k has the precision interpretation when we think of a cluster as retrieving documents from the class to which the majority of its elements belong. On the other hand, recall is the fraction of all the relevant documents that are retrieved. In the framework we described for p_k, our metric r_c has the recall interpretation.

Taking a cue from the F measure commonly used in the IR literature to combine precision and recall, we compute the F score as the harmonic mean of the P and R values:

1/F = (1/2) ( 1/P + 1/R )    (31)

The F score is the metric by which we shall measure the quality of our clusters.
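The computation of P, R and F can be sketched as follows (our own helpers; the two small example groupings at the bottom are hypothetical).

    from collections import Counter

    def micro_macro_purity(groups):
        """Purity (Expression (30)) of a grouping; each group is a list of labels
        of its members.  Returns (micro average, macro average)."""
        per_group = [Counter(g).most_common(1)[0][1] / len(g) for g in groups]
        sizes = [len(g) for g in groups]
        micro = sum(p * s for p, s in zip(per_group, sizes)) / sum(sizes)
        return micro, sum(per_group) / len(per_group)

    def f_score(p, r):
        """Harmonic mean of P and R, Expression (31)."""
        return 2.0 * p * r / (p + r)

    # Clusters described by the class labels of their members ...
    clusters = [["politics", "politics", "sports"], ["sports", "sports"]]
    # ... and classes described by the cluster ids of their members (roles swapped for R).
    classes = [["k1", "k1"], ["k1", "k2", "k2"]]
    P, _ = micro_macro_purity(clusters)
    R, _ = micro_macro_purity(classes)
    print(P, R, f_score(P, R))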

5.2 Evaluating the hierarchy

Another question of interest when evaluating a hierarchical clustering algorithm is "To what extent does the generated cluster hierarchy agree with the class hierarchy present in the data?". As we shall describe in Section 6, the datasets we have used in our experiments have a hierarchy of classes and provide us a rare opportunity to evaluate our generated cluster hierarchy for correctness. As a reminder, a class is a document category that has been provided to us as part of the dataset. It is what the documents have been labeled with by an external entity, and it helps us evaluate how good our algorithm is. A cluster, on the other hand, is a grouping of documents that our algorithm generates by putting together the documents it considers similar.


Matching the generated cluster hierarchy with the existing class hierarchy is a non-trivial task. Instead, in this work we focus on measuring how often the sibling clusters in the generated hierarchy have sibling classes, i.e., how often the children clusters of a parent cluster have children classes of the class that is assigned to the parent cluster. For instance, consider the generated cluster subtree shown in Figure 7.

K0
    K1 (C1.1)    K2 (C1.2)    K3 (C2.1)    K4 (C1.1.3)

Parent class frequency:  C1: 2,  C2: 1,  C1.1: 1

After relabeling each child with its parent class:
K0 (C1)
    K1 (C1)    K2 (C1)    K3 (C2)    K4 (C1.1)

Figure 7. A sample subtree with its children nodes. Class labels of the children nodes are given in parentheses.

In this case we have already determined the classes of the child clusters3. To be able to measure whether they are filed under the correct class, we need to find the class of the parent cluster. To do this we tabulate the parent classes of the child clusters and assign the most frequent parent class to the parent cluster K0. So, in this case the parent cluster K0 gets the label C1. Then we evaluate this cluster configuration as if K0 were merely a cluster of four smaller entities, each of which has a class label equal to the parent class of its actual label. This is equivalent to saying that as long as the children clusters of K0 have children classes of the class of K0, i.e., C1 in this case, they are correct. Clusters with all other class labels that occur under that parent cluster are incorrect classifications by the algorithm; they should have been somewhere else.

So, in the above example the precision of K0 is 2/4 = 0.5. We compute this precision for all the internal nodes of the cluster tree and take their average (both micro average and macro average) to compute the overall precision of the hierarchy. This gives us a measure of how much the generated cluster hierarchy agrees with the class hierarchy present in the data. We call it the sibling precision score of the cluster hierarchy.
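The per-node computation can be sketched as follows (our own code; the dotted strings are a hypothetical encoding of the class hierarchy, with a top-level label treated as its own parent, one of the two conventions discussed below).

    from collections import Counter

    def parent_class(label):
        """Parent of a dotted class label, e.g. 'C1.1.3' -> 'C1.1'.  A top-level
        label is treated as its own parent (one of the conventions discussed below)."""
        return label.rsplit(".", 1)[0] if "." in label else label

    def sibling_precision(child_class_labels):
        """Sibling precision of one internal node: relabel each internal child
        cluster with the parent of its class, assign the most frequent such label
        to the node, and score the fraction of children that agree with it."""
        parents = [parent_class(label) for label in child_class_labels]
        _, freq = Counter(parents).most_common(1)[0]
        return freq / len(parents)

    # The example of Figure 7: children labelled C1.1, C1.2, C2.1 and C1.1.3.
    print(sibling_precision(["C1.1", "C1.2", "C2.1", "C1.1.3"]))   # 0.5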

We needed to make a few decisions while evaluating the hierarchy in this manner. For instance, we used only the internal nodes to compute the precision of any node. This is because leaf nodes often co-exist with internal nodes as children of another internal node. In this case, if we compute precision based on leaf nodes, i.e., single documents, then we mix the precision of the kind we described in Section 5.1 with the precision of the hierarchy, and it is not clear how we should interpret the resulting number. Another decision that needed to be made was: what should we do if a child cluster has the broadest class label assigned to it? Since we cannot find a parent class for these classes, we explored the possibility of

i. dropping such child clusters from our evaluation, and

ii. treating them as their own parent classes, since they are the broadest level classes.

3. At the lowest level each cluster has only one document and its class can be read from the data directly.



In our experiments the results do not change much if we take either of these strategies. So, we shall report only the results we got by treating the broadest classes as their own parent classes.

6 Experiment setup and results

We evaluate our algorithm over two text document collections, i.e., Reuters-RCV1 and Ohsumed (88-91). These datasets were picked because of the presence of human-labeled hierarchical class labels and the reasonably large number of documents in them. They are described in more detail in the following sections.

6.1 Reuters-RCV1

Incremental clustering algorithms process the data points only once and in the order in which they are presented, and the order in which data points are present in the dataset influences the clusters produced (footnote 4). Therefore, it is imperative that we test the incremental clustering algorithms with an ordering of data points that is similar to what they are expected to receive during their deployment. As we envision the two algorithms in this work being used to process streams of text documents from newswire, newsgroups, Blogs, etc., the natural ordering among the documents is determined by the time at which they are received. Therefore, we need a document dataset in which the time order of the documents is preserved. Reuters-RCV1[15] is one such dataset.

The Reuters-RCV1 dataset is a collection of over 800,000 English newswire articles collected from Reuters over a period of one year (20th Aug 1996 to 19th Aug 1997). These documents have been classified by editors at Reuters simultaneously under three category hierarchies: the "Topic" hierarchy, the "Industry" hierarchy and the "Region" hierarchy. The Topic hierarchy contains four categories at depth one of the tree, namely "Corporate/Industrial", "Economics", "Government/Social" and "Market". There are ten such categories in the Industry hierarchy, some of which are "Metals and Minerals", "Construction", etc. The Region hierarchy has geographical locations, such as country names, and economic/political groups as categories. There are no finer sub-categories in the Region hierarchy.

[Figure 8 illustrates the classification hierarchies: the Topic root has the children Corporate/Industrial, Economics, Government/Social and Market; the Region root has children such as MEX, USA, UK, INDIA, ... ]

Figure 8. Three classification hierarchies.

The classification policy, also called The Coding Policy, requires that each document have at least one Topic category and at least one Region category assigned to it. It also requires that each document be assigned to the most specific possible subcategory in a classification hierarchy. A document might be, and often is, assigned more than one category from any one of the three category hierarchies. The documents are present in the dataset in the time order in which they were collected.

4. However, the ideal incremental clustering algorithm is expected to be insensitive to the order in which it encounters the data points. Such a characteristic is partly achieved by the Cobweb algorithm through its split and merge operators.



number of documents       62935
number of unique words    93792
average document length   222
number of classes         259

Table 2. RCV1 dataset (first 30 days). Classes are the Region classes.

6.1.1 Evaluating clusters

Experiment setup For our experiments, articles from the first 30 days of the Reuters-RCV1 dataset were used. There were 62935 articles. Stop words were removed from the documents and the terms were stemmed. Then the most informative terms were selected by their cf × log(N/df) scores to represent the documents. We repeated the experiments using 100 to 800 terms at a step size of 100.
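
As an illustration of this term-selection step, the following sketch ranks stemmed terms by cf × log(N/df), where cf is the collection term frequency and df the document frequency. The function and variable names are our own, and stop-word removal and stemming are assumed to have been done already.

    import math
    from collections import Counter

    def select_terms(documents, num_terms):
        """Rank terms by cf * log(N / df) and keep the top num_terms.

        `documents` is a list of token lists (already stop-word filtered and stemmed).
        """
        n_docs = len(documents)
        cf = Counter()  # collection term frequency: total occurrences over all documents
        df = Counter()  # document frequency: number of documents containing the term
        for tokens in documents:
            cf.update(tokens)
            df.update(set(tokens))
        scores = {t: cf[t] * math.log(n_docs / df[t]) for t in cf}
        return sorted(scores, key=scores.get, reverse=True)[:num_terms]

    # e.g. vocabulary = select_terms(first_30_days_docs, 400)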

We have evaluated the clustering solutions for the correctness of assignment of documents to the clusters using the Region categories, because (i) in the Region class hierarchy all the assigned classes belong to one level, and (ii) fewer articles are assigned multiple Region class labels than are assigned multiple labels from the other hierarchies, suggesting that the Region classes in the dataset do not overlap a lot. This allows us to evaluate our algorithm on a dataset with well defined classes. There were 259 Region categories present in the selected documents. So, we have extracted 259 clusters from the dendrogram constructed by the clustering algorithms and measured their quality using the Region categories of the documents.

Results and Discussion The results of the clustering exercise are given in Table 3. We can see that the Katz's distribution based Classit algorithm dominates the Normal distribution based Classit algorithm across varying vocabulary sizes in both the micro and macro average of F-scores.

                Katz (K)            Normal (N)
   V          micro   macro       micro   macro
  100         0.46    0.83        0.31    0.60
  200         0.45    0.81        0.43    0.74
  300         0.45    0.85        0.33    0.67
  400         0.45    0.79        0.42    0.74
  500         0.45    0.84        0.36    0.69
  600         0.45    0.82        0.42    0.76
  700         0.45    0.81        0.39    0.74
  800         0.45    0.83        0.30    0.61

Table 3. Cluster quality comparison on RCV1 data (micro and macro averaged F-scores of Katz based and Normal based Classit at different vocabulary sizes V).



[Figure 9: two panels plotting F-score (0 to 1) against vocabulary size (100 to 800) for Katz and Normal.]

Figure 9. Cluster quality comparison on RCV1 data. The left panel shows the micro average of the F-score and the right panel shows the macro average of the F-score.

As we can see, the Katz based Classit algorithm consistently performs better than the Normal based Classit algorithm on this dataset. However, we are cautious in interpreting the micro-averaged F-score. Both of these algorithms produce clusters of widely different sizes, i.e., a few big clusters, a few more clusters of intermediate size and a lot of smaller clusters. The micro-averaged F-score is affected by this, because performance over a few big clusters dominates the entire performance metric. This explains the flat nature of the plot of the micro-averaged F-score with Katz based Classit: the larger clusters generated by the algorithm do not change much over different vocabulary sizes, so the micro-averaged F-score remains nearly constant. Therefore, we also compute the macro-averaged F-score, where each cluster gets equal weight, and find that Katz based Classit performs better than Normal based Classit over a wide range of vocabulary sizes.
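
To make the distinction concrete, here is a minimal sketch of one common way to compute the two averages from per-cluster F-scores, assuming micro-averaging weights each cluster by the number of documents it holds and macro-averaging weights all clusters equally; the exact convention used in the experiments may differ in detail.

    def macro_f(cluster_f_scores):
        """Unweighted mean of per-cluster F-scores: every cluster counts equally."""
        return sum(cluster_f_scores) / len(cluster_f_scores)

    def micro_f(cluster_f_scores, cluster_sizes):
        """Document-weighted mean: a few very large clusters dominate the result."""
        total = sum(cluster_sizes)
        return sum(f * n for f, n in zip(cluster_f_scores, cluster_sizes)) / total

    # Two large, well-formed clusters can mask many poor small ones:
    f_scores = [0.9, 0.9, 0.2, 0.2, 0.2]
    sizes    = [500, 400,  10,  10,  10]
    print(micro_f(f_scores, sizes))  # ~0.88, dominated by the big clusters
    print(macro_f(f_scores))         # 0.48, each cluster weighted equally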

6.1.2 Evaluating hierarchy

We evaluate the generated cluster hierarchy using the Topic hierarchy of classes (footnote 5) as our reference. There are 63 different topic codes in the documents we used, whereas in the entire Topic hierarchy there are 103 topic codes.

We pre-processed the documents using the steps described in the previous section and evaluated the accuracy of the parent/child cluster configurations as described in Section 5.2. The results are given in Table 4.

   V     Normal       Katz        Normal       Katz
         Macro avg    Macro avg   Micro avg    Micro avg
  100    0.925        0.956       0.814        0.959
  200    0.924        0.935       0.797        0.943
  300    0.926        0.874       0.825        0.871
  400    0.92         0.866       0.814        0.789
  500    0.918        0.896       0.812        0.871
  600    0.922        0.841       0.814        0.989
  700    0.929        0.836       0.846        0.653
  800    0.918        0.855       0.832        0.718

Table 4. Evaluation of the cluster hierarchy using RCV1 data.

The values in the table cells are the average sibling precision of the internal nodes of the cluster hierarchy. As we can see, there is no clear winner in this case, although both algorithms do reasonably well in assigning sibling classes under the same cluster. However, we must be careful to interpret these values as the correctness of sibling classes getting grouped together and not as recovering all of the original class hierarchy.

5. This can be obtained from [15] Appendix 2.



6.2 OHSUMED (88-91)

The Ohsumed test collection is a set of 348,566 abstracts collected from 270 medical journals over a period of 5 years. Each abstract is annotated with MeSH (Medical Subject Heading) labels by human annotators; these labels indicate the topic of the abstract. Unlike the RCV1 dataset, these documents are not in temporal order. Another property of this dataset is that, being from a specific subject area, the documents contain words from a much smaller vocabulary. Due to the presence of human assigned MeSH keywords over such a large collection, this dataset provides us with an opportunity to evaluate our algorithm over a large dataset and against real topic labels.

number of documents       196555
number of unique words    16133
average document length   167
number of classes         14138

Table 5. Ohsumed dataset (88-91).

6.2.1 Evaluating clusters

Experiment Setup We used the Ohsumed 88-91 dataset from the TREC-9 filtering track to evaluate our algorithm for the correctness of assignment of documents to the classes. We selected only those articles for which both the MeSH labels and the abstract text were present. There were 196,555 such articles. As with the RCV1 dataset, the most informative words in the dataset were selected using the cf × log(N/df) score of the words. We repeated the clustering exercise using 25 to 200 words at a step size of 25. To determine the number of different topics present in this dataset one can look at the unique MeSH labels present in it. But, as there are tens of thousands of such labels, we used fixed numbers of clusters to evaluate the algorithms (see Table 6).

Results and discussion The F-score results of the experiments are given in Table 6.

            k=5      k=10     k=20     k=40     k=80     k=160    k=320    k=640
V=25        ?        ?        +        ?        +        ×        ?        ×
   µ (K/N)  57/55    57/53    57/53    57/53    55/38    55/61    36/34    27/34
   M (K/N)  62/62    60/61    63/62    62/62    60/54    37/54    49/52    43/52
V=50        +        +        +        +        +        +        ×        ×
   µ (K/N)  70/57    70/57    69/57    69/57    69/57    69/57    48/51    47/51
   M (K/N)  74/65    75/63    75/65    76/69    76/70    76/71    60/65    59/65
V=75        ?        ?        ?        ×        ×        ×        +        +
   µ (K/N)  70/70    70/70    70/70    69/70    69/70    69/70    69/39    69/35
   M (K/N)  71/71    69/70    73/77    76/80    77/81    78/82    78/56    79/53
V=100       +        +        +        +        +        +        +        +
   µ (K/N)  70/62    69/62    69/62    69/62    68/62    68/45    69/45    69/45
   M (K/N)  72/69    71/70    75/73    78/75    78/76    79/62    80/62    80/62
V=125       +        +        +        +        +        ×        +        +
   µ (K/N)  71/61    71/61    69/61    69/61    69/61    53/61    53/47    53/46
   M (K/N)  74/68    76/68    77/71    78/72    80/74    68/74    69/61    69/60
V=150       +        +        +        +        +        +        +        ×
   µ (K/N)  72/54    72/51    59/51    59/51    55/51    54/51    54/51    48/51
   M (K/N)  72/65    77/61    72/64    74/66    71/66    71/67    71/67    66/67
V=175       +        +        +        +        +        +        +        +
   µ (K/N)  71/52    71/51    71/51    71/51    59/51    58/43    54/43    54/41
   M (K/N)  74/64    78/62    81/64    83/66    75/67    74/60    71/60    71/58
V=200       +        +        +        +        +        +        +        +
   µ (K/N)  62/52    62/50    62/50    62/50    62/50    62/50    62/50    62/50
   M (K/N)  72/63    75/62    77/65    78/65    79/66    79/67    79/67    79/67

Table 6. Cluster quality comparison on OHSUMED data at different numbers of clusters (k) and vocabulary sizes (V). The figures in the table are F-score × 100. K stands for Katz-Classit, N for the original Classit. The µ and M rows hold the micro and macro average of the F-score respectively. Cells where Katz-Classit performs better are marked with a "+", cells where Normal-Classit performs better are marked with a "×", and cells where there is no clear winner are marked with a "?".



We can see from the table that Normal-Classit is the most competitive when the vocabulary size is small and the number of clusters formed is large. For all other settings, i.e., when the size of the vocabulary used is larger or when the number of clusters formed is smaller, Katz-Classit performs better. This shows that the Katz-Classit algorithm is more robust, as it performs well across a much larger range of parameter values.

The performance of both algorithms suffers when we create a larger number of clusters, which makes sense because there are then fewer features on which to distinguish between clusters.

6.2.2 Evaluating hierarchy

The MeSH labels present in the Ohsumed collection have a hierarchical structure (footnote 6). This provides us with another opportunity to evaluate the correctness of our hierarchy. This class hierarchy is much larger than the Topic hierarchy of the RCV1 dataset: there are 42610 different MeSH labels. Each MeSH label has a code attached to it, and the class hierarchy information can be directly read from this code. For instance, the first three records of the 2005 "ASCII MeSH collection" read

Body Regions;A01
Abdomen;A01.047
Abdominal Cavity;A01.047.025
...

Figure 10. First three lines of the MeSH labels file (filename: mtrees2005.bin).

This says that the topic labeled "Abdominal Cavity" (A01.047.025) is a child topic of the label with code A01.047, which we can find from the file to be the topic "Abdomen" (A01.047), which in turn is a child topic of the topic with code A01. We can find from the file that this is the code of the label "Body Regions". These "."-separated topic codes let us easily find the parent topics by dropping the last component of the code. Not all the MeSH labels are seen in our dataset; there were only about 14138 different MeSH labels used in the document set we used for our experiments.
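
As a small illustration of how the parent relationship can be read off these codes, the sketch below parses lines of the form "label;code" and derives a parent code by dropping the last "."-separated component. The function names are ours and the file handling is a simplified assumption based only on the excerpt shown in Figure 10.

    def parse_mesh_line(line):
        """Split a line such as 'Abdominal Cavity;A01.047.025' into (label, code)."""
        label, code = line.strip().rsplit(";", 1)
        return label, code

    def parent_code(code):
        """Drop the last '.'-separated component; top-level codes have no parent."""
        return code.rsplit(".", 1)[0] if "." in code else None

    # Example using the three records shown in Figure 10:
    lines = ["Body Regions;A01", "Abdomen;A01.047", "Abdominal Cavity;A01.047.025"]
    label_of = {code: label for label, code in map(parse_mesh_line, lines)}
    print(parent_code("A01.047.025"))            # A01.047
    print(label_of[parent_code("A01.047.025")])  # Abdomen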

Documents were pre-processed as described in the previous section. The entire cluster hierarchy was generated and the correctness of the hierarchy was evaluated as described in Section 5.2. The precision values are reported in Table 7.

   V     Normal       Katz        Normal       Katz
         Macro avg    Macro avg   Micro avg    Micro avg
   25    0.786        0.795       0.626        0.749
   50    0.781        0.831       0.667        0.784
   75    0.79         0.857       0.654        0.831
  100    0.801        0.888       0.742        0.891
  125    0.828        0.939       0.788        0.976
  150    0.847        0.935       0.812        0.963
  175    0.876        0.91        0.859        0.858
  200    0.894        0.958       0.819        0.919

Table 7. Evaluation of the cluster hierarchy using Ohsumed data.

Here again both algorithms do reasonably well in grouping classes with common parents under the same cluster, with Katz-Classit appearing to have an advantage over Normal-Classit

6. The entire collection of MeSH labels can be downloaded from the web-site of the National Institutes of Health (http://www.nlm.nih.gov). We have used the 2005 MeSH label collection for our purpose.



across all vocabulary sizes. But we must be careful here not to interpret these precision values as the closeness of the entire cluster hierarchy to the existing class hierarchy. Instead, they measure the accuracy of the algorithms in classifying sibling classes under the same parent cluster.

We also tracked the sibling precision score at different depths of the generated cluster tree (Figures 11 and 12).

[Figure 11: four panels plotting sibling precision (0 to 1.2) against tree depth, for vocabulary sizes 25 and 75; the left panels show the macro average and the right panels the micro average, for Normal and Katz.]

Figure 11. Tracing the sibling precision over the height of the tree. Vocabulary sizes 25 and 75.



[Figure 12: four panels plotting sibling precision (0 to 1.2) against tree depth, for vocabulary sizes 125 and 175; the left panels show the macro average and the right panels the micro average, for Normal and Katz.]

Figure 12. Tracing the sibling precision over the height of the tree. Vocabulary sizes 125 and 175.

These plots show the general trend at different vocabulary sizes. As we can see, there is considerable variation in the sibling precision over different depths. Amidst this variation we can observe that the sibling precision is higher and more consistent when we look at the nodes occurring at the lower layers of the tree. Also, we find that on these layers Katz-Classit usually performs better than Normal-Classit.

It is interesting to observe the general consistency at the lower levels of the tree and the lack of it at the higher levels. At the lower levels we have a large number of nodes at each layer. When we average the performance of each algorithm over this large number of nodes we get a score that is robust to random mistakes. So, we get a consistent score from layer to layer and it is easier to see which algorithm does better. But this is not so at the higher levels. In the higher



levels we have only a few nodes in each layer over which to average the score, so the average is more sensitive to random mistakes. Note that both the micro average and the macro average are sensitive to these random mistakes: the wrongly placed nodes in the higher levels of the tree either get a weight equal to that of the other nodes (macro average) or get a weight proportional to the number of documents in them (micro average), and both of these weights are significant at these levels of the tree. This is why the plot of average sibling precision fluctuates a lot at these levels and we do not get a clear winner across the layers in the upper part of the tree.

7 Conclusion

This is, to our knowledge, the first attempt at incremental hierarchical clustering of text documents. We have evaluated an incremental hierarchical clustering algorithm, which is often used with non-text datasets, using text document datasets. We have also proposed a variation of the same that has more desirable properties when used for incremental hierarchical text clustering.

The variation of the Cobweb/Classit algorithm that we have demonstrated in this work uses Katz's distribution instead of the Normal distribution used in the original formulation of the Classit algorithm. Katz's distribution is more appropriate for word occurrence data, as has been shown in prior work[14] and empirically observed in our work. We have evaluated both algorithms over the Reuters-RCV1 dataset, which allows us to carry out the experiments in a scenario very similar to real life. We tested the algorithms by presenting them Newswire articles from the Reuters-RCV1 dataset in time order and have shown that our algorithm performs consistently better than the Normal based Classit algorithm, as measured by both the micro and macro average of the F-score over a range of vocabulary sizes. We have also evaluated both algorithms using the Ohsumed 88-91 dataset and have found that Katz-Classit performs better except in the narrow range of parameter values with small vocabulary sizes and large numbers of clusters, where results are likely to be unreliable. This shows that the performance of Katz-Classit is more robust across broad parameter settings.

We have also proposed a way to evaluate the quality of the hierarchy generated by hierarchical clustering algorithms, by observing how often the children clusters of a cluster get children classes of the class assigned to that cluster. We found that although both the existing algorithm and our proposed algorithm perform well on this metric, our algorithm performs marginally better on the Ohsumed dataset.

The most important contribution we think we have made in this work is the separation of the attribute distribution and its parameter estimation from the control structure of the Classit algorithm. Thus, one can use a new attribute distribution, which may be different from Normal or Katz but is more appropriate for the data at hand, inside the well established control structure of the Classit algorithm to carry out incremental hierarchical clustering of a new kind of data. For instance, if it is considered that the Negative Binomial could be a better fit for the word distribution than Katz's distribution, and one can come up with an efficient way to estimate the parameters of the distribution, it can be used in the framework of the existing Classit algorithm as demonstrated in this work. One can also experiment with a Bayesian approach to estimate the parameters of the distribution and carry out incremental hierarchical clustering in this framework, which might lead to better results due to more reliable parameter estimates for clusters with a small number of documents.

Bibliography

[1] James Allan, Ron Papka, and Victor Lavrenko. On-line new event detection and tracking. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 37-45. ACM Press, 1998.

[2] A. Banerjee and J. Ghosh. Competitive learning mechanisms for scalable, incremental and balanced clustering of streaming texts. In Proceedings of the International Joint Conference on Neural Networks, volume 4, pages 2697-2702, Jul 2003.

[3] Abraham Bookstein and Don R. Swanson. A decision theoretic foundation for indexing. Journal of the American Society for Information Science, pages 45-50, Jan-Feb 1975.



[4] Soumen Chakrabarti. Mining the Web: Discovering Knowledge from Hypertext Data. Morgan Kaufmann, 2002.

[5] P. Cheeseman and J. Stutz. Bayesian classification (AUTOCLASS): Theory and results. Advances in Knowledge Discovery and Data Mining, 1996.

[6] Douglass R. Cutting, David R. Karger, Jan O. Pedersen, and John W. Tukey. Scatter/Gather: A cluster-based approach to browsing large document collections. In Proceedings of the Fifteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 318-329, 1992.

[7] George Doddington, Jaime Carbonell, James Allan, Jonathan Yamron, Umass Amherst, and Yiming Yang. Topic detection and tracking pilot study final report, Jul 2000.

[8] M. A. T. Figueiredo and A. K. Jain. Unsupervised learning of finite mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(3):381-396, March 2002.

[9] Douglas H. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2:139-172, 1987.

[10] Martin Franz, Todd Ward, J. Scott McCarley, and Wei-Jing Zhu. Unsupervised and supervised clustering for topic tracking. In SIGIR '01: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 310-317. ACM Press, 2001.

[11] Frederick Mosteller and David L. Wallace. Applied Bayesian and Classical Inference: The Case of The Federalist Papers. Springer Series in Statistics. Springer-Verlag, 1983.

[12] J. H. Gennari, P. Langley, and D. Fisher. Models of incremental concept formation. Artificial Intelligence, 40:11-61, 1989.

[13] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Computing Surveys, 31(3):264-323, 1999.

[14] Slava M. Katz. Distribution of content words and phrases in text and language modelling. Natural Language Engineering, 2(1):15-59, 1996.

[15] David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361-397, 2004.

[16] Xiaoyong Liu and W. Bruce Croft. Cluster-based retrieval using language models. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 186-193, 2004.

[17] Christopher D. Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, Massachusetts, 2000.

[18] Padhraic Smyth. Clustering using Monte Carlo cross-validation. In Evangelos Simoudis, Jia Wei Han, and Usama Fayyad, editors, Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), page 126. AAAI Press, 1996.

[19] Yiming Yang, Tom Pierce, and Jaime Carbonell. A study of retrospective and on-line event detection. In SIGIR '98: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 28-36. ACM Press, 1998.

[20] Ya-Jun Zhang and Zhi-Qiang Liu. Refining web search engine results using incremental clustering. International Journal of Intelligent Systems, 19:191-199, 2004.

Appendix A

MLE of Katz’s distribution parameters

The Katz’s distribution is defined as:

P (0) = p0

P (k) = (1− p0)(1− p)pk−1;when k > 0 (32)

where, p0 and p are the parameters of the distribution.Let us discuss about the distribution of only one word or term. The data is the count of

occurrences of the word in each document in the text collection. So, if we have N documents inthe dataset we have N observations, each of which is a count.

MLE of Katz’s distribution parameters 25


Let us also define n_k to be the number of observations equal to k, i.e., the number of documents in which the term occurs k times. Let us assume the maximum value of k is K.

Hence,

• document frequency df = N − n_0 = ∑_{k=1}^{K} n_k, and

• collection term frequency cf = ∑_{k=1}^{K} k n_k.

The likelihood L(p, p0) of the parameters given the data is

L(p, p0) = ∏_{i=1}^{N} Pr(the word occurs x_i times in document i)

         = ∏_{i=1}^{N} [ δ(x_i) p0 + (1 − δ(x_i)) (1 − p0)(1 − p) p^(x_i − 1) ],   x_i ∈ {0, 1, …, K}

         = p0^(n_0) ∏_{k=1}^{K} (1 − p0)^(n_k) (1 − p)^(n_k) (p^(k−1))^(n_k)

where δ(·) is the indicator function that is 1 if its argument is zero and 0 otherwise. The log-likelihood is

LL(p, p0) = n_0 log(p0) + ∑_{k=1}^{K} [ n_k log(1 − p0) + n_k log(1 − p) + n_k (k − 1) log(p) ]

Taking the partial derivative of the log-likelihood with respect to p0 and equating it to 0:

∂LL/∂p0 = n_0/p̂0 − (1/(1 − p̂0)) ∑_{k=1}^{K} n_k = 0

⇒ n_0/p̂0 = (1/(1 − p̂0)) ∑_{k=1}^{K} n_k = (N − n_0)/(1 − p̂0)

⇒ (1 − p̂0)/p̂0 = (N − n_0)/n_0

⇒ 1/p̂0 − 1 = N/n_0 − 1

⇒ p̂0 = n_0/N = (N − df)/N = 1 − df/N                                            (33)

We can find the MLE of p in a similar manner.

∂LL/∂p = ∑_{k=1}^{K} [ −n_k/(1 − p̂) + n_k (k − 1)/p̂ ] = 0

⇒ 0 = (1/p̂) ∑_{k=1}^{K} n_k (k − 1) − (1/(1 − p̂)) ∑_{k=1}^{K} n_k

⇒ 0 = (1/p̂) ( ∑_{k=1}^{K} k n_k − ∑_{k=1}^{K} n_k ) − (1/(1 − p̂)) ∑_{k=1}^{K} n_k

⇒ 0 = (1/p̂)(cf − df) − (1/(1 − p̂)) df

⇒ (1 − p̂)/p̂ = df/(cf − df)

⇒ 1/p̂ = cf/(cf − df)

⇒ p̂ = (cf − df)/cf                                                              (34)

Expressions (33) and (34) are the MLEs of the parameters of Katz's distribution defined in Expression (32).
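
As a minimal illustrative sketch of these estimators, the function below computes p̂0 = 1 − df/N and p̂ = (cf − df)/cf directly from a term's per-document counts; the function name and input format are our own choices, not part of the original implementation.

    def katz_mle(counts):
        """Estimate Katz's distribution parameters (p0, p) for one term.

        `counts` holds the term's number of occurrences in each of the N documents.
        df = number of documents containing the term, cf = total occurrences.
        """
        n_docs = len(counts)
        df = sum(1 for c in counts if c > 0)
        cf = sum(counts)
        p0_hat = 1.0 - df / n_docs                  # Expression (33)
        p_hat = (cf - df) / cf if cf > 0 else 0.0   # Expression (34)
        return p0_hat, p_hat

    # Example: a term occurring in 3 of 6 documents, 7 times in total
    print(katz_mle([0, 2, 0, 1, 4, 0]))  # (0.5, 0.571...)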
