Data Mining for Text Categorization with Semi-Supervised Agglomerative Hierarchical Clustering


Antonio Gomez Skarmeta,1,* Amine Bensaid,2,† Nadia Tazi2

1Departamento de Informatica, Inteligencia Artificial y Electronica, Universidad de Murcia, Murcia, Spain
2Computer Science Graduate Division, Al Akhawayn University in Ifrane (AUI), Ifrane, Morocco

*Author to whom correspondence should be addressed. e-mail: [email protected]
†e-mail: [email protected]

INTERNATIONAL JOURNAL OF INTELLIGENT SYSTEMS, VOL. 15, 633-646 (2000). © 2000 John Wiley & Sons, Inc.

In this paper we study the use of a semi-supervised agglomerative hierarchical clustering (ssAHC) algorithm for text categorization, which consists of assigning text documents to predefined categories. ssAHC is (i) a clustering algorithm that (ii) uses a finite design set of labeled data to (iii) help agglomerative hierarchical clustering (AHC) algorithms partition a finite set of unlabeled data and then (iv) terminates without the capability to label other objects. We first describe the text representation method we use in this work; we then present a feature selection method that is used to reduce the dimensionality of the feature space. Finally, we apply the ssAHC algorithm to the Reuters database of documents and show that its performance is superior to the Bayes classifier and to the Expectation-Maximization algorithm combined with the Bayes classifier. We also show that ssAHC helps AHC techniques to improve their performance. © 2000 John Wiley & Sons, Inc.

1. INTRODUCTION

Text categorization consists of assigning predefined categories to text documents. A growing number of machine learning techniques have been applied to text categorization in recent years, including multivariate regression models,6,16 nearest neighbor classifiers,8,17 the Bayes probabilistic approach,9 decision trees,9 neural networks,7,15 and inductive learning algorithms.4,10 The use of machine learning in text categorization is difficult due to characteristics of the domain. Text does not normally come in the form of a feature vector, there are a large number of features, there are a large number of documents, and there is a large variation in the amount of information in each document. Documents are written in natural language, so they may contain many ambiguities; documents are also written by humans, so they may contain errors. As a consequence, the information relevant to classifying a document may be hidden or implicit and may depend on the representation of documents as feature vectors.

Supervised learning methods enable a user to judge examples for class membership and have a classifier formed automatically. In recent research, supervised learning techniques for text categorization outperform classifiers constructed by expert human researchers, but they require a large number of training examples. On the other hand, unsupervised learning (or clustering) methods have had less impact. One reason is that blind unsupervised learning, when applied to documents viewed as bags of words, takes a lot of real and virtual computer memory and a lot of time, and will not necessarily find the categories that humans think are important without any prior knowledge. A compromise between the disadvantages of supervised and unsupervised learning is a partial supervision approach, which lies somewhere "in-between" the two techniques and can improve learning efficiency.

In this paper, we explore the use of a partially supervised clustering approach for text categorization. The goal is to take advantage of a set of training documents (which are in general manually classified documents) and of the information contained in a set of unlabeled documents to help classify the set of unlabeled documents. We start by describing the text representation method we use in this work. We then present a feature selection method that is used to reduce the dimensionality of the feature space. Next, we describe a slight adaptation of our learning algorithm, useful when it is applied to text documents. Then we present the database we use in our experiments. Finally, we discuss the results and compare them with those of other published applications of classifiers to text categorization and with those of agglomerative hierarchical clustering techniques.

2. TEXT REPRESENTATION

The first step in text categorization is to transform documents, which are typically strings of characters, into a representation suitable for presentation to the learning algorithm. We use single words (or keywords) as the basic units to represent text. A "word" is defined as a contiguous string of characters delimited by spaces. We process each text (or document) using the following steps (a short illustrative sketch follows the list):

(1) Digits and punctuation marks are removed.
(2) All words are converted to lower case.
(3) Stop-words, i.e., words like prepositions, conjunctions, auxiliary verbs, etc., are removed. The list of stop-words we use is given in Ref. 5.
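As an illustration (not taken from the original paper), these preprocessing steps could be sketched in Python as follows; the stop-word list shown here is a tiny placeholder for the full list of Ref. 5.

```python
import re

# Tiny illustrative stop-word list; the paper uses the full list from Ref. 5.
STOP_WORDS = {"a", "an", "and", "are", "be", "by", "for", "in", "is",
              "of", "on", "or", "the", "to", "was", "with"}

def preprocess(text):
    """Turn a raw document into a list of keywords:
    (1) drop digits/punctuation, (2) lower-case, (3) remove stop-words."""
    text = re.sub(r"[^A-Za-z\s]", " ", text)           # step 1: keep letters only
    words = text.lower().split()                        # step 2: lower case, tokenize
    return [w for w in words if w not in STOP_WORDS]    # step 3: drop stop-words

print(preprocess("Reuters reported a rise in third quarter earnings."))
# -> ['reuters', 'reported', 'rise', 'third', 'quarter', 'earnings']
```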


There are different possible ways in which the words (features) can be generated. When a category of documents is being considered, the words can be selected exclusively from documents belonging to this category (local dictionary), or they can be selected both from documents belonging and from those not belonging to this category. Apte et al.2 report results indicating that local dictionary selection gives better performance. In our experiments, we have also found that the local dictionary gives the best results. Hence, we adopt the local dictionary method. We also remove rare words; i.e., words that occur in fewer than five training documents are not considered as features. This measure is quite widely used in text categorization (e.g., in Ref. 9). The basic assumption is that infrequent words are non-informative for category prediction, so improvement in text categorization is possible if rare words happen to be noise words.
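The local dictionary and rare-word filter could be prototyped as below; this is a sketch under the assumption that each training document has already been reduced to a token list, and the function name is illustrative.

```python
from collections import Counter

def local_dictionary(category_docs, min_df=5):
    """Build a local dictionary from the tokenized training documents of one
    category, dropping rare words that occur in fewer than min_df documents."""
    df = Counter()
    for tokens in category_docs:
        df.update(set(tokens))          # count each term once per document
    return sorted(term for term, count in df.items() if count >= min_df)
```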

To represent documents we use the vector space model (VSM).13 VSM was originally developed for information retrieval, but it provides support for many text categorization tasks. The VSM model is used in many text categorization studies,8,9 so an effective comparison is possible. The main concern of VSM is to represent natural language expressions as term weight vectors; each weight measures the importance of a term in a natural language expression, which can be a document or a query. In VSM each document (text) o_i is represented as an N-dimensional vector

    o_i = (o_{i1}, o_{i2}, . . . , o_{iN})                                      (1)

where o_{ij} represents the weight of word t_j (j = 1, 2, . . . , N) in document o_i and N is the number of terms. The weights o_{ij} are computed using the term frequency-inverse document frequency (TF-IDF) method.14 Based on this basic representation, it is known that scaling the dimensions of the feature vector with their IDF leads to improved categorization results:8,9

    TF-IDF_{ij} = tf_{ij} * idf_j = tf_{ij} * log_2(n / df_j)                   (2)

where tf_{ij} is the number of times term t_j occurs in document o_i, df_j is the number of documents the word t_j occurs in, and n is the total number of training documents. The inverse document frequency of a term is low if the term occurs in many documents, and it is high if the term occurs in only one. Each document feature vector is normalized to unit length using cosine normalization:

    o_{ij} = TF-IDF_{ij} / sqrt( sum_{k=1}^{N} (TF-IDF_{ik})^2 ),   1 <= i <= n,  1 <= j <= N      (3)
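A minimal NumPy sketch of Eqs. 2 and 3 is given below (our own illustration, not the authors' code); it assumes the documents have already been tokenized and that the local dictionary vocab has been fixed.

```python
import numpy as np

def tfidf_vectors(docs, vocab):
    """docs: list of token lists; vocab: ordered list of the N dictionary terms.
    Returns an n x N matrix of cosine-normalized TF-IDF weights (Eqs. 2-3)."""
    n, N = len(docs), len(vocab)
    index = {t: j for j, t in enumerate(vocab)}
    tf = np.zeros((n, N))
    for i, tokens in enumerate(docs):
        for t in tokens:
            if t in index:
                tf[i, index[t]] += 1                 # raw term frequency tf_ij
    df = np.count_nonzero(tf, axis=0)                # document frequency df_j
    idf = np.log2(n / np.maximum(df, 1))             # Eq. 2: idf_j = log2(n / df_j)
    w = tf * idf
    norms = np.linalg.norm(w, axis=1, keepdims=True)
    return w / np.maximum(norms, 1e-12)              # Eq. 3: unit-length rows
```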


3. FEATURE SELECTION

The feature space comprises one new dimension for each unique term that occurs in the text documents, which can lead to tens of thousands of dimensions for even a small-sized text collection. The dimensionality of the feature space often exceeds the number of available training documents; this is an obstacle for learning algorithms. So, it is desirable to reduce the dimensionality of the feature space without sacrificing categorization accuracy, to make the use of conventional learning methods possible.

The most popular approach to document feature selection is to choose a subset of the available features using methods like document frequency thresholding (DF),18 the χ²-test (CHI), mutual information (MI),15 or the information gain (IG)18 criterion. Yang and Pedersen18 have shown that the DF, IG, and CHI values of a term are strongly correlated. In our experiments, we use the IG criterion, which measures the number of bits of information gained for category prediction by knowing the presence or absence of a feature (term) in a document. The information gain of a given term t for a given class C_i (1 <= i <= c, where c is the number of target classes) is defined as

    G(t) = -P(C_i) log_2 P(C_i)
           + P(t) P(C_i|t) log_2 P(C_i|t)
           + P(t̄) P(C_i|t̄) log_2 P(C_i|t̄)                                      (4)

where P(C_i) (or P(C̄_i)) is the probability of having C_i (or C̄_i), P(t) (or P(t̄)) is the probability of having term t (or not having term t), P(C_i|t) is the probability of having C_i given that term t is observed, and P(C_i|t̄) is the probability of having C_i given that term t is not in the document.

All N terms are ranked according to their G(t). To select a subset of p terms, the p (p <= N) features with the highest G(t) are selected; the other features are ignored.
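A sketch of Eq. 4 and of the term ranking is shown below. It is an assumption-laden illustration: term presence and class membership are taken to be boolean arrays, and the small epsilon only guards the logarithm.

```python
import numpy as np

def information_gain(presence, labels):
    """presence: boolean n-vector, True if the term occurs in a document;
    labels: boolean n-vector, True if the document belongs to class C_i.
    Returns G(t) of Eq. 4 for this single (term, class) pair."""
    eps = 1e-12
    def plog(p):
        return p * np.log2(p + eps)
    p_c = labels.mean()
    p_t = presence.mean()
    p_c_given_t = labels[presence].mean() if presence.any() else 0.0
    p_c_given_not_t = labels[~presence].mean() if (~presence).any() else 0.0
    return (-plog(p_c)
            + p_t * plog(p_c_given_t)
            + (1 - p_t) * plog(p_c_given_not_t))

def top_p_terms(presence_matrix, labels, p):
    """Rank all terms by G(t) and keep the indices of the p highest-scoring ones."""
    gains = np.array([information_gain(presence_matrix[:, j], labels)
                      for j in range(presence_matrix.shape[1])])
    return np.argsort(gains)[::-1][:p]
```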

4. ssAHC ALGORITHM FOR TEXT CATEGORIZATION

In this section we apply the semi-supervised agglomerative hierarchical clustering (ssAHC) algorithm described in Ref. 1 to text categorization.

Let D^u = [d_{ij}] be the symmetric n_u x n_u proximity matrix representing n_u unlabeled documents O^u = {o_1^u, o_2^u, . . . , o_{n_u}^u}. We take the proximities to be dissimilarities, so d_{ij} = d(o_i^u, o_j^u) denotes the dissimilarity between o_i^u and o_j^u. An agglomerative hierarchical clustering (AHC) algorithm can be used to transform D^u into a sequence of nested partitions, represented by a dendrogram. If the target number of c clusters is known, the dendrogram is cut at the level that yields one c-partition. Assume that in addition to the unlabeled documents we are also given a set O^d of n_d labeled (training) documents, each with a known class label vector,


and we array the label vectors of all o_k^d's in the c x n_d membership matrix U^d. Each labeled document in O^d is represented by its dissimilarity to other documents (both from O^u and O^d), so we can think of the dissimilarity matrix D^u as being augmented to D = D^d ∪ D^u.

By using the training information contained in (O^d, U^d, and D^d), we seek to obtain a better c-partition of O^u than would be obtained otherwise (if no training information is available). To this end, we assume that the number n_d of training documents exceeds the number c of target classes. We first over-partition O = O^d ∪ O^u into as many clusters as there are training documents (that is, into n_d clusters). This is carried out by applying an AHC algorithm to the augmented dissimilarity matrix D, which results in n_d clusters (groups) G_j; each group G_j contains n_{G_j} objects, with sum_{j=1}^{n_d} n_{G_j} = n. At this point, all that is left is to label each of the n_d clusters using one of the c target labels; in effect, this would "merge" each cluster into one of the c target classes.

The database used in our experiments contains documents that belong to more than one class. So, we treat classification into each class as a separate binary classification problem. The result of such a binary classification is the assignment of a membership grade to each document's belonging to the class under consideration and another membership grade to the document's not belonging to this class. The ssAHC algorithm as we apply it to this binary classification can be summarized in Algorithm 1:

ALGORITHM 1: ssAHC ALGORITHM. For each class C_i:

Inputs: O^d, the subset of n_d = n_d^i + n_d^ī training documents; each document either belongs to class C_i and has label i, or does not and has label ī. O^u, a set of n_u test documents, and c = 2, the number of target classes corresponding to classes C_i and C̄_i.

Outputs: Û^u (2 x n_u), containing the labels of the n_u documents of O^u in classes C_i and C̄_i.

ssAHC1: Apply the average link algorithm to O = O^d ∪ O^u to get a membership matrix U^u (n_d x n_u).

ssAHC2: Find the representative v_j of each cluster (group) G_j (j = 1, 2, . . . , n_d) as

    v_j = ( sum_{o_k ∈ G_j} o_k ) / n_{G_j}                                     (5)


ssAHC4: Compute the final partition of O^u:

    Û^u = L U^u   (Û^u is c x n_u)                                              (7)

Several approaches can be used in ssAHC2 to assign a target class label to each cluster G_j.
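As a rough illustration of the over-partitioning and representative-finding steps (not the authors' implementation), average-link clustering from SciPy can over-partition the pooled labeled and unlabeled documents into n_d clusters and compute the cluster representatives of Eq. 5; the labeling rules of Sections 4.1 and 4.2 then operate on these representatives. Function and variable names are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def over_partition(X_labeled, X_unlabeled):
    """Sketch of ssAHC1-ssAHC2: average-link AHC on the pooled documents,
    cut into n_d clusters, then cluster representatives as in Eq. 5."""
    X = np.vstack([X_labeled, X_unlabeled])              # O = O^d union O^u
    n_d = X_labeled.shape[0]
    Z = linkage(pdist(X, metric="euclidean"), method="average")
    groups = fcluster(Z, t=n_d, criterion="maxclust")    # n_d clusters G_1..G_nd
    reps = np.array([X[groups == g].mean(axis=0)          # Eq. 5: cluster centroid
                     for g in np.unique(groups)])
    return groups, reps
```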

4.1. Using Nearest-Neighbor Rule

To determine a class label l_j, we look for the nearest training object (neighbor) for every cluster G_j (j = 1, 2, . . . , n_d); we refer to the index (either column or row number in matrix D) corresponding to this nearest neighbor as nn(j):

    l_j = U^d_{nn(j)},   j = 1, 2, . . . , n_d,   where                          (8)

    nn(j) = arg min_{1 <= s <= n_d} { δ(o_s^d, v_j) }

where δ(o_s^d, v_j) is the Euclidean distance between training object o_s^d and the representative v_j of cluster G_j.
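A sketch of this nearest-neighbor rule (Eq. 8), assuming a crisp label matrix U^d for the training documents and the representatives computed above; names are illustrative.

```python
import numpy as np

def label_clusters_nn(reps, X_labeled, U_d):
    """Eq. 8 sketch: give each cluster representative v_j the label column of
    its nearest training document (Euclidean distance).
    reps: (n_clusters, N) representatives, X_labeled: (n_train, N) training docs,
    U_d: (c, n_train) label matrix of the training docs. Returns L (c, n_clusters)."""
    cols = []
    for v in reps:
        dists = np.linalg.norm(X_labeled - v, axis=1)    # delta(o_s^d, v_j)
        cols.append(U_d[:, np.argmin(dists)])            # nn(j) = argmin_s delta
    return np.column_stack(cols)
```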

4.2. Using Ratio between Distances from Clusters to Target Classes

In this approach, L = [l_{ij}] (c x n_d) is computed as

    l_{ij} = ( 1 - d(v_j, C_i) / sum_{k=1}^{c} d(v_j, C_k) ) / (c - 1)           (9)

where C_i represents the target class labeled i, and d(v_j, C_i) is a measure of distance between the representative v_j of cluster G_j and the set (we call it O_i^d) of training objects labeled e_i. For instance, we compute it as

    d(v_j, C_i) = min_{o_k^d ∈ O_i^d} { δ(o_k^d, v_j) }                          (10)

8/12/2019 Data Mining for Text Categorization with Semi-Supervised Agglomerative Hierarchical Clustering

http://slidepdf.com/reader/full/data-mining-for-text-categorization-with-semi-supervised-agglomerative-hierarchical 7/15

DATA MINING FOR TEXT CATEGORIZATION   63

l_{ij} is then computed as

    l_{ij} = d̄_j(O_i^d) / d̄(O_i^d)                                               (12)

Equations 9 and 12, used in Algorithm 1 to label the different clusters found in step ssAHC2, guarantee that ssAHC produces nondegenerate partitions, but they require that the training data be crisply labeled. On the other hand, Eq. 8 works equally well for all types of labels but does not ensure that ssAHC will produce a nondegenerate partition.
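The following sketch illustrates the ratio-based labeling of Eqs. 9-10 and the final assignment Û^u = L U^u of Eq. 7, under the assumption that training documents carry integer class labels 0..c-1 and that U^u is the n_d x n_u cluster-membership matrix of the unlabeled documents; it is one possible reading, not the authors' code.

```python
import numpy as np

def label_clusters_ratio(reps, X_labeled, y_labeled, c):
    """Eqs. 9-10 sketch: soft label l_ij for cluster j and class i from the ratio
    of d(v_j, C_i) (minimum distance to any training doc of C_i, Eq. 10) to the
    summed distances over all c classes; each column of L sums to 1."""
    n_clusters = reps.shape[0]
    L = np.zeros((c, n_clusters))
    for j, v in enumerate(reps):
        d = np.array([np.linalg.norm(X_labeled[y_labeled == i] - v, axis=1).min()
                      for i in range(c)])                # d(v_j, C_i), Eq. 10
        L[:, j] = (1.0 - d / d.sum()) / (c - 1)          # Eq. 9
    return L

def final_partition(L, U_u):
    """Eq. 7: propagate cluster labels to the unlabeled documents, U_hat = L @ U_u."""
    return L @ U_u
```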

In general, database collections contain a large number of training documents. So we investigated a procedure for choosing a subset of documents from a large pool of training documents to speed up step ssAHC2, since this step is computationally expensive.

When applying ssAHC to a given category C_i, we decide to use all n_d^i documents from the training corpus of C_i. It is often the case that there is a large number of training documents that do not belong to C_i (we refer to them as belonging to C̄_i); from these documents, we select the n_d^ī documents which are the most similar to C_i. To this end, we form the vector sum of all the documents belonging to C_i:

    s_i = sum_{o_k^d ∈ C_i} o_k^d                                                (13)

The documents that belong to C̄_i are ranked by their dot product score with the aggregate vector s_i:

    <s_i, o_1^d>  >=  <s_i, o_2^d>  >=  . . .  >=  <s_i, o_k^d>,
    k = n_1 + n_2 + . . . + n_{i-1} + n_{i+1} + . . . + n_c                      (14)

where o_k^d ∈ C̄_i, and c is the number of target classes. The higher the dot product score, the more similar the corresponding document is to the category. There is no rule of thumb for choosing the number n_d^ī of documents belonging to C̄_i; we experiment with different values of n_d^ī (n_d^ī = 100, 200, 500, or 1000), and we retain the n_d^ī that gives the best results.
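Eqs. 13-14 amount to a dot-product ranking against an aggregate class vector; a small sketch follows (illustrative names, TF-IDF row vectors assumed).

```python
import numpy as np

def select_negative_docs(X_pos, X_neg, n_select):
    """Eqs. 13-14 sketch: form the aggregate vector s_i of the documents in C_i
    (Eq. 13), rank the documents outside C_i by their dot product with s_i
    (Eq. 14), and keep the n_select highest-scoring ones."""
    s_i = X_pos.sum(axis=0)                      # Eq. 13: vector sum of C_i docs
    scores = X_neg @ s_i                         # Eq. 14: dot-product similarity
    order = np.argsort(scores)[::-1]             # highest score = most similar
    return X_neg[order[:n_select]]
```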

5. EVALUATION MEASURES

The performance of a binary classifier can be summarized using a two-way contingency table, as shown in Table I.


Table I. A two by two contingency table.

                           Correct class is C_i    Correct class is C̄_i
Assigned class C_i                  a                        b
Assigned class C̄_i                  c                        d

Here a is the number of documents correctly assigned to class C_i, b is the number of documents incorrectly assigned to C_i, c is the number of documents incorrectly rejected from C_i, and d is the number of correctly classified documents that do not belong to C_i. R (recall), P (precision), and Acc (accuracy) are defined using Table I as

    R = a / (a + c)     (if a + c > 0; otherwise R = 1)                          (15)

    P = a / (a + b)     (if a + b > 0; otherwise P = 1)                          (16)

    Acc = (a + d) / n,   where n = a + b + c + d > 0                             (17)

Using precision or recall alone may not be sensible. For example, a trivial algorithm which assigns class C to all documents will have a perfect recall of 100%, but an unacceptably low score in precision. Conversely, a system that decides not to assign any document to C will have a perfect score in precision but a low score in recall. So a single measure, F_1, is computed to give an equal weight to precision and recall. F_1 is maximized when recall and precision are equal to 1; otherwise, F_1 is dominated by the smaller value between recall and precision. The F_1 measure is defined as10

    F_1(P, R) = 2PR / (P + R)                                                    (18)

Since P, R, Acc, and F_1 are defined only for binary classification tasks, the results of multiple binary tasks need to be averaged to get a single performance measure for multiple-class problems. For a set of c categories and n_u test documents, a total of c x n_u categorization decisions are made. Given these c x n_u decisions, we use microaveraging, which considers all c x n_u decisions as a single group and computes recall, precision, and accuracy for the group as a whole. Microaveraging reflects the per-document performance of a system; an equal weight is given to every single categorization decision.
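The measures of Eqs. 15-18 and their microaveraged versions can be computed directly from the contingency counts; the sketch below uses the a, b, c, d cell names introduced with Table I (our notation) and renames the false-negative cell c_fn in code to avoid clashing with the number of classes c.

```python
def binary_measures(a, b, c_fn, d):
    """Eqs. 15-18 from the contingency-table counts: a = correctly assigned to C_i,
    b = wrongly assigned, c_fn = wrongly rejected, d = correctly rejected."""
    R = a / (a + c_fn) if (a + c_fn) > 0 else 1.0           # recall, Eq. 15
    P = a / (a + b) if (a + b) > 0 else 1.0                  # precision, Eq. 16
    Acc = (a + d) / (a + b + c_fn + d)                       # accuracy, Eq. 17
    F1 = 2 * P * R / (P + R) if (P + R) > 0 else 0.0         # F_1, Eq. 18
    return R, P, Acc, F1

def microaverage(tables):
    """Microaveraging: sum the contingency cells over all categories and compute
    the measures once on the pooled counts."""
    a, b, c_fn, d = (sum(t[k] for t in tables) for k in range(4))
    return binary_measures(a, b, c_fn, d)

# Example: two categories' (a, b, c, d) tables pooled into one microaveraged score.
print(microaverage([(50, 10, 5, 935), (20, 30, 10, 940)]))
```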


6. THE REUTERS DATABASE

The Reuters-21578 collection consists of 21,578 newswire articles collected during 1987 from Reuters. The documents deal with financial topics and are classified into several sets of financial categories. They are marked up with standard generalized markup language (SGML); that is, they include SGML syntax. These documents vary in length and in the number of categories they belong to (each document may belong to zero, one, or more categories). There are five sets of super-categories: TOPICS, ORGANIZATIONS, EXCHANGES, PLACES, and PEOPLE. Each of these sets is made up of sub-categories. For our experiments, we decide to consider only the TOPICS super-category, which is made up of 135 sub-categories.

Several partitions into training and test subsets have been proposed for the Reuters documents. We use the "ModApte" split, which leads to 9603 training documents and 3299 test documents (the remaining 8676 documents are not used). Following the work of other researchers11 and to fulfill the requirements of our learning algorithm, we only retain the 90 sub-categories of the 135 sub-categories in TOPICS that admit at least one training document and one test document.

7. EXPERIMENTS AND RESULTS

We applied ssAHC to the Reuters-21578 database of documents. Since the documents in this database may have multiple class labels, we evaluate the result for each category separately, using two-category classification. Following other studies (e.g., Ref. 11), we use the 10 most populous classes from the 90 sub-categories in TOPICS and build a two-category ssAHC "classifier" for each class. Table II shows the 10 categories used in our experiments with the number of documents in the training and the test sets.

Table II. The 10 categories used with number of occurrences in training and test sets.

              Number of occurrences
Category      Training      Test
Earn          2877          1087
Ship          197           89


For   p 10, 20, 50, 100, 200, 400 and ‘‘all features’’

For p = 10, 20, 50, 100, 200, 400, and "all features"
  For each category C_i
    - Choose the best p terms/keywords from training documents belonging to C_i using information gain (Eq. 4).
    - Put training and test documents into feature vector form using Eq. 3.
    - Select all the n_d^i training documents from class C_i.
    - For n_d^ī = 100, 200, 500, 1000
      - Select n_d^ī training documents from C̄_i using Eq. 14.
      - Let n_d = n_d^i + n_d^ī.
      - Apply the average link (AHC-AL) algorithm to D; the dendrogram is cut to yield an n_d-partition U^u.
      - Form n_d cluster prototypes using Eq. 5.
      - Use Eqs. 8, 9, and 12 in ssAHC2 to label the n_d prototypes.
      - Compute three Û_ssAHC's using Eqs. 6 and 7 in ssAHC3 and ssAHC4, respectively.
      - Compute R, P, Acc, and F_1 measures using Eqs. 15, 16, 17, and 18, respectively.
    - End For
  End For
End For

Experiment 1 evaluates ssAHC using the three approaches for constructing matrix L. For each category we run several series of experiments; the ssAHC algorithm is tested using various numbers of features (20, 50, 100, 200, 400, and "all features") and various sizes for the training set: the training patterns from the category C_i under consideration and a fixed number n_d^ī of training documents from the remaining 89 categories, selected using Eq. 14. Table III shows ssAHC's accuracy on the Reuters test set for each of the 10 selected categories using Eqs. 8, 9, and 12.

Table III. ssAHC's accuracy on the Reuters test set with 100 features using Eqs. 8, 9, and 12.

                Classification accuracy
Class       Eq. 8       Eq. 9       Eq. 12
Acq         87.30       87.30       87.70
Corn        93.90       93.90       94.10
Crude       96.40       96.40       96.50


The results listed in Table III show that the use of Eqs. 8, 9, or 12 yields approximately the same results. The average accuracy of ssAHC on the Reuters test set for the 10 most populous categories is 92.00% using Eqs. 8, 9, or 12. ssAHC's average accuracy over the 10 categories was best when selecting only the top 100 features according to information gain (Eq. 4). Figure 1 shows the influence of the number of features on ssAHC's average accuracy on the Reuters test set.

ssAHC's results were compared to two classifiers: the Bayes classifier and the expectation-maximization algorithm combined with the Bayes classifier (EM-Bayes).11 ssAHC and EM-Bayes have some similarities; they both use the information contained in the set of labeled documents and the set of unlabeled documents to help the clustering of the set of unlabeled documents. The EM-Bayes algorithm as described in Ref. 11 first trains a classifier with the available labeled documents and probabilistically labels the unlabeled documents. It then trains a new classifier using the labels for all the documents and iterates. In Ref. 11, the Bayes and EM-Bayes algorithms were evaluated using recall, precision, and breakeven point measures. In our experiments ssAHC's performance is evaluated using recall, precision, and the F_1 measure. The breakeven point and the F_1 measure are comparable since F_1(P*, P*) = P*, where P* is the precision obtained at the breakeven point. Table IV lists recall, precision, and F_1 measures of the ssAHC algorithm on the Reuters test set.

From Table IV we notice that the use of Eqs. 8, 9, or 12 yields approximately the same results. But the use of Eqs. 9 and 12 ensures nondegenerate partitions. In Ref. 11 the average breakeven point for the Reuters 10 categories and with 200 features is 57% with the Bayes classifier and 60% with the EM-Bayes algorithm. ssAHC's results (microaveraged F_1 at 71.8%) are better than those of the Bayes classifier and the EM-Bayes algorithm. Furthermore, ssAHC uses a smaller number of features and training patterns.


Table IV. R, P, and F_1 measures for the results of ssAHC on the Reuters test set with 100 features using Eqs. 8, 9, and 12.

                    Recall                     Precision                    F_1
Class       Eq. 8   Eq. 9   Eq. 12     Eq. 8   Eq. 9   Eq. 12      Eq. 8   Eq. 9   Eq. 12
Acq         96.00   96.00   95.40      70.10   70.10   70.90       81.10   81.10   81.40
Corn        75.00   75.00   78.60      23.10   23.10   24.20       35.30   35.30   37.00
Crude       94.20   94.20   92.60      68.70   68.70   70.00       79.50   79.50   79.70
Earn        99.10   99.10   99.10      78.00   78.00   78.60       87.30   87.30   87.70
Grain       91.90   91.90   91.90      60.40   60.40   63.10       72.90   72.90   74.90
Interest    68.70   68.70   77.10      29.90   29.90   31.90       41.70   41.70   45.10
Money-fx    86.60   86.60   86.00      35.10   35.10   35.10       50.00   50.00   49.80
Ship        87.60   87.60   87.60      43.80   43.80   42.90       58.40   58.40   57.60
Trade       67.50   67.50   70.10      21.30   21.30   19.70       32.40   32.40   30.70
Wheat       78.90   78.90   78.90      63.80   63.80   65.90       73.20   73.10   71.80

Microaverage (F_1)                                                 71.80   71.80   71.80

ssAHC's microaveraged F_1 measure over the 10 categories was best when selecting only the top 100 features according to information gain (Eq. 4). Figure 2 shows the influence of the number of features on ssAHC's F_1 measure on the Reuters test set.

ssAHC was also compared to AHC algorithms. Experiment 1 was repeated using AHC-AL and taking out the steps that are specific to ssAHC. Table V reports the results of AHC-AL on the Reuters test set.

Comparison of the results listed in Table V with those in Tables III and IV shows the efficiency of ssAHC in improving the results of AHC methods when labeled objects are available. ssAHC uses the information that labeled data may carry to help the clustering of the unlabeled objects. Still, the performance of ssAHC depends on the quality of the training objects. As a result, when ssAHC is presented with poor training objects (as was sometimes the case in our experiments with a very low or a very high number of features), it performs poorly.

Table V. AHC-AL results on the Reuters test set with 100 features.

Class       Recall     Precision    F_1       Accuracy
Acq         99.90      28.70        44.60     30.00
Corn        100        2.31         4.50      7.00
Crude       100        8.70         16.00     21.80
Earn        99.90      43.40        60.50     44.30
Grain       100        6.13         11.60     10.50
Interest    100        5.30         10.10     9.00
Money-fx    100        7.20         13.50     10.00
Ship        100        3.90         7.60      15.00
Trade       100        5.00         9.60      13.30
Wheat       100        3.00         5.80      10.20

Microaverage                        20.90     17.10

8. CONCLUSIONS

In this paper, we investigated the use of a partially supervised approach for text categorization, as a compromise between the disadvantages of supervised and unsupervised clustering techniques when applied to text documents. We applied ssAHC to the Reuters database of documents; we used information gain as a feature selection method to reduce the dimensionality of the feature space. The experimental results showed that ssAHC achieved good performance on text categorization even with small amounts of training data, outperforming the Bayes classifier and an approach combining the EM algorithm with the Bayes classifier; this hybrid algorithm can be considered to be semi-supervised since it uses the information contained in the labeled and the unlabeled documents for its learning phase. ssAHC also helps the AHC algorithms improve their performance. All this makes a partial supervision approach a promising method for text categorization.

We add, however, that ssAHC is not a true classifier in the sense that new documents cannot be labeled directly with the trained design. On the other hand, it is not a truly unsupervised clustering algorithm. The prefix "semi" is appropriate, since this algorithm lies somewhere "in between" clustering and classifier designs. However, we can think that ssAHC can be extended to a true classifier. For example, we can find the cluster prototypes for the final clusters and use them to label new documents.


References

2. Apte C, Damerau F, Weiss SM. Automated learning of decision rules for text categorization. ACM Trans Inform Syst 1994;13(3):233-251.
3. Bezdek J, Reichherzer TR, Lim G, Attikiouzel Y. Multiple prototype classifier design. IEEE Trans SMC 1998;28(Pt C, No 1):67-79.
4. Cohen WW, Singer Y. Context sensitive learning methods for text categorization. Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'96), 1996. p 307-315.
5. Frakes WB, Baeza-Yates R. Information retrieval, data structures and algorithms. Englewood Cliffs, NJ: Prentice Hall; 1992.
6. Fuhr N, Hartmann S, Lustig G, Schwantner M, Tzeras K. AIR/X: a rule-based multistage indexing system for large subject fields. In: Proceedings of RIAO'91, 1991. p 606-623.
7. Ng HT, Goh WB, Low KL. Feature selection, perceptron learning, and a usability case study for text categorization. Proceedings of the ACM SIGIR Conference (SIGIR'97), 1997.
8. Joachims T. Text categorization with support vector machines: learning with many relevant features. Technical report 23, University of Dortmund, LS VIII, 1997.
9. Lewis DD, Ringuette M. Comparison of two learning algorithms for text categorization. Proceedings of the Third Annual Symposium on Document Analysis and Information Retrieval (SDAIR'94), 1994.
10. Lewis DD, Schapire RE, Callan JP, Papka R. Training algorithms for linear text classifiers. Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'96), 1996. p 298-306.
11. Nigam K, McCallum A, Thrun S, Mitchell T. Learning to classify text from labeled and unlabeled documents. 15th National Conference on Artificial Intelligence (AAAI), 1998.
12. Porter MF. An algorithm for suffix stripping. Program 1980;14(3):130-137.
13. Salton G, McGill MJ. Introduction to modern information retrieval. New York: McGraw-Hill; 1983.
14. Salton G, Buckley C. Term weighting approaches in automatic text retrieval. Inform Process Manage 1988;24(5):513-523.
15. Wiener E, Pedersen JO, Weigend AS. A neural network approach to topic spotting. Proceedings of the Fourth Annual Symposium on Document Analysis and Information Retrieval (SDAIR'95), 1995.
16. Yang Y, Chute CG. An example-based mapping method for text categorization and retrieval. ACM Trans Inform Syst (TOIS) 1994. p 253-277.
17. Yang Y. Noise reduction in a statistical approach to text categorization. Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'95), 1995. p 256-263.
18. Yang Y, Pedersen J. A comparative study on feature selection in text categorization. International Conference on Machine Learning (ICML), 1997.
