
Machine Learning in Automated Text Categorization

FABRIZIO SEBASTIANI

Consiglio Nazionale delle Ricerche, Italy

The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last 10 years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting in the manual definition of a classifier by domain experts) are a very good effectiveness, considerable savings in terms of expert labor power, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely, document representation, classifier construction, and classifier evaluation.

Categories and Subject Descriptors: H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing—Indexing methods; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Information filtering; H.3.4 [Information Storage and Retrieval]: Systems and Software—Performance evaluation (efficiency and effectiveness); I.2.6 [Artificial Intelligence]: Learning—Induction

General Terms: Algorithms, Experimentation, Theory

Additional Key Words and Phrases: Machine learning, text categorization, text classification

1. INTRODUCTION

In the last 10 years content-based document management tasks (collectively known as information retrieval—IR) have gained a prominent status in the information systems field, due to the increased availability of documents in digital form and the ensuing need to access them in flexible ways. Text categorization (TC—a.k.a. text classification, or topic spotting), the activity of labeling natural language

Author's address: Istituto di Elaborazione dell'Informazione, Consiglio Nazionale delle Ricerche, Via G. Moruzzi 1, 56124 Pisa, Italy; e-mail: [email protected].

Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. © 2002 ACM 0360-0300/02/0300-0001 $5.00

texts with thematic categories from a predefined set, is one such task. TC dates back to the early '60s, but only in the early '90s did it become a major subfield of the information systems discipline, thanks to increased applicative interest and to the availability of more powerful hardware. TC is now being applied in many contexts, ranging from document indexing based on a controlled vocabulary, to document filtering, automated metadata generation, word sense disambiguation, population of
hierarchical catalogues of Web resources, and in general any application requiring document organization or selective and adaptive document dispatching.

Until the late '80s the most popular approach to TC, at least in the "operational" (i.e., real-world applications) community, was a knowledge engineering (KE) one, consisting in manually defining a set of rules encoding expert knowledge on how to classify documents under the given categories. In the '90s this approach increasingly lost popularity (especially in the research community) in favor of the machine learning (ML) paradigm, according to which a general inductive process automatically builds an automatic text classifier by learning, from a set of preclassified documents, the characteristics of the categories of interest. The advantages of this approach are an accuracy comparable to that achieved by human experts, and a considerable savings in terms of expert labor power, since no intervention from either knowledge engineers or domain experts is needed for the construction of the classifier or for its porting to a different set of categories. It is the ML approach to TC that this paper concentrates on.

Current-day TC is thus a discipline at the crossroads of ML and IR, and as such it shares a number of characteristics with other tasks such as information/knowledge extraction from texts and text mining [Knight 1999; Pazienza 1997]. There is still considerable debate on where the exact border between these disciplines lies, and the terminology is still evolving. "Text mining" is increasingly being used to denote all the tasks that, by analyzing large quantities of text and detecting usage patterns, try to extract probably useful (although only probably correct) information. According to this view, TC is an instance of text mining. TC enjoys quite a rich literature now, but this is still fairly scattered.1 Although two international journals have devoted special issues to this topic [Joachims and Sebastiani 2002; Lewis and Hayes 1994], there are no systematic treatments of the subject: there are neither textbooks nor journals entirely devoted to TC yet, and Manning and Schutze [1999, Chapter 16] is the only chapter-length treatment of the subject.

As a note, we should warn the reader that the term "automatic text classification" has sometimes been used in the literature to mean things quite different from the ones discussed here. Aside from (i) the automatic assignment of documents to a predefined set of categories, which is the main topic of this paper, the term has also been used to mean (ii) the automatic identification of such a set of categories (e.g., Borko and Bernick [1963]), or (iii) the automatic identification of such a set of categories and the grouping of documents under them (e.g., Merkl [1998]), a task usually called text clustering, or (iv) any activity of placing text items into groups, a task that has thus both TC and text clustering as particular instances [Manning and Schutze 1999].

1 A fully searchable bibliography on TC created and maintained by this author is available at http://liinwww.ira.uka.de/bibliography/Ai/automated.text.categorization.html.

This paper is organized as follows. In Section 2 we formally define TC and its various subcases, and in Section 3 we review its most important applications. Section 4 describes the main ideas underlying the ML approach to classification. Our discussion of text classification starts in Section 5 by introducing text indexing, that is, the transformation of textual documents into a form that can be interpreted by a classifier-building algorithm and by the classifier eventually built by it. Section 6 tackles the inductive construction of a text classifier from a "training" set of preclassified documents. Section 7 discusses the evaluation of text classifiers. Section 8 concludes, discussing open issues and possible avenues of further research for TC.

2. TEXT CATEGORIZATION

2.1. A Definition of Text Categorization

Text categorization is the task of assigning a Boolean value to each pair 〈dj, ci〉 ∈ D × C, where D is a domain of documents and
C = {c1, . . . , c|C|} is a set of predefined categories. A value of T assigned to 〈dj, ci〉 indicates a decision to file dj under ci, while a value of F indicates a decision not to file dj under ci. More formally, the task is to approximate the unknown target function Φ̆ : D × C → {T, F} (that describes how documents ought to be classified) by means of a function Φ : D × C → {T, F} called the classifier (a.k.a. rule, or hypothesis, or model) such that Φ̆ and Φ "coincide as much as possible." How to precisely define and measure this coincidence (called effectiveness) will be discussed in Section 7.1. From now on we will assume that:

—The categories are just symbolic labels, and no additional knowledge (of a procedural or declarative nature) of their meaning is available.

—No exogenous knowledge (i.e., data provided for classification purposes by an external source) is available; therefore, classification must be accomplished on the basis of endogenous knowledge only (i.e., knowledge extracted from the documents). In particular, this means that metadata such as, for example, publication date, document type, publication source, etc., is not assumed to be available.

The TC methods we will discuss are thus completely general, and do not depend on the availability of special-purpose resources that might be unavailable or costly to develop. Of course, these assumptions need not be verified in operational settings, where it is legitimate to use any source of information that might be available or deemed worth developing [Díaz Esteban et al. 1998; Junker and Abecker 1997]. Relying only on endogenous knowledge means classifying a document based solely on its semantics, and given that the semantics of a document is a subjective notion, it follows that the membership of a document in a category (pretty much as the relevance of a document to an information need in IR [Saracevic 1975]) cannot be decided deterministically. This is exemplified by the

phenomenon of inter-indexer inconsistency [Cleverdon 1984]: when two human experts decide whether to classify document dj under category ci, they may disagree, and this in fact happens with relatively high frequency. A news article on Clinton attending Dizzy Gillespie's funeral could be filed under Politics, or under Jazz, or under both, or even under neither, depending on the subjective judgment of the expert.

2.2. Single-Label Versus Multilabel Text Categorization

Different constraints may be enforced on the TC task, depending on the application. For instance we might need that, for a given integer k, exactly k (or ≤ k, or ≥ k) elements of C be assigned to each dj ∈ D. The case in which exactly one category must be assigned to each dj ∈ D is often called the single-label (a.k.a. nonoverlapping categories) case, while the case in which any number of categories from 0 to |C| may be assigned to the same dj ∈ D is dubbed the multilabel (a.k.a. overlapping categories) case. A special case of single-label TC is binary TC, in which each dj ∈ D must be assigned either to category ci or to its complement c̄i.

From a theoretical point of view, the binary case (hence, the single-label case, too) is more general than the multilabel, since an algorithm for binary classification can also be used for multilabel classification: one needs only transform the problem of multilabel classification under {c1, . . . , c|C|} into |C| independent problems of binary classification under {ci, c̄i}, for i = 1, . . . , |C|. However, this requires that categories be stochastically independent of each other, that is, for any c′, c′′, the value of Φ̆(dj, c′) does not depend on the value of Φ̆(dj, c′′) and vice versa; this is usually assumed to be the case (applications in which this is not the case are discussed in Section 3.5). The converse is not true: an algorithm for multilabel classification cannot be used for either binary or single-label classification. In fact, given a document dj to classify, (i) the classifier might attribute k > 1 categories to dj, and it might not be obvious how to
choose a "most appropriate" category from them; or (ii) the classifier might attribute to dj no category at all, and it might not be obvious how to choose a "least inappropriate" category from C.

In the rest of the paper, unless explicitly mentioned, we will deal with the binary case. There are various reasons for this:

—The binary case is important in itself because important TC applications, including filtering (see Section 3.3), consist of binary classification problems (e.g., deciding whether dj is about Jazz or not). In TC, most binary classification problems feature unevenly populated categories (e.g., much fewer documents are about Jazz than are not) and unevenly characterized categories (e.g., what is about Jazz can be characterized much better than what is not).

—Solving the binary case also means solving the multilabel case, which is also representative of important TC applications, including automated indexing for Boolean systems (see Section 3.1).

—Most of the TC literature is couched in terms of the binary case.

—Most techniques for binary classification are just special cases of existing techniques for the single-label case, and are simpler to illustrate than these latter.

This ultimately means that we will view classification under C = {c1, . . . , c|C|} as consisting of |C| independent problems of classifying the documents in D under a given category ci, for i = 1, . . . , |C|. A classifier for ci is then a function Φi : D → {T, F} that approximates an unknown target function Φ̆i : D → {T, F}.
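To make this decomposition concrete, the following sketch (illustrative Python, not taken from the survey; the toy learner and all names are assumptions) turns a multilabel training set into |C| independent binary problems, trains one classifier per category, and assigns to a new document every category whose binary classifier answers T. Any of the learners discussed in Section 6 could be plugged in place of the stand-in.

    # Hypothetical sketch: multilabel TC as |C| independent binary problems.
    # `train_binary` stands in for any inductive learner from Section 6.

    def train_binary(positive_docs, negative_docs):
        # Toy stand-in learner: call a document positive if it shares more
        # words with the positive examples than with the negative ones.
        pos_vocab = set(w for d in positive_docs for w in d.split())
        neg_vocab = set(w for d in negative_docs for w in d.split())
        def classify(doc):
            words = set(doc.split())
            return len(words & pos_vocab) >= len(words & neg_vocab)
        return classify

    def train_one_per_category(corpus, categories):
        # corpus: list of (document_text, set_of_categories) pairs
        classifiers = {}
        for ci in categories:
            positives = [d for d, labels in corpus if ci in labels]
            negatives = [d for d, labels in corpus if ci not in labels]
            classifiers[ci] = train_binary(positives, negatives)
        return classifiers

    def classify_multilabel(classifiers, doc):
        # A document receives every category whose binary classifier says T.
        return {ci for ci, phi_i in classifiers.items() if phi_i(doc)}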

2.3. Category-Pivoted Versus Document-Pivoted Text Categorization

There are two different ways of using a text classifier. Given dj ∈ D, we might want to find all the ci ∈ C under which it should be filed (document-pivoted categorization—DPC); alternatively, given ci ∈ C, we might want to find all the dj ∈ D that should be filed under it (category-pivoted

categorization—CPC). This distinction is more pragmatic than conceptual, but is important since the sets C and D might not be available in their entirety right from the start. It is also relevant to the choice of the classifier-building method, as some of these methods (see Section 6.9) allow the construction of classifiers with a definite slant toward one or the other style.

DPC is thus suitable when documents become available at different moments in time, e.g., in filtering e-mail. CPC is instead suitable when (i) a new category c|C|+1 may be added to an existing set C = {c1, . . . , c|C|} after a number of documents have already been classified under C, and (ii) these documents need to be reconsidered for classification under c|C|+1 (e.g., Larkey [1999]). DPC is used more often than CPC, as the former situation is more common than the latter.

Although some specific techniques apply to one style and not to the other (e.g., the proportional thresholding method discussed in Section 6.1 applies only to CPC), this is more the exception than the rule: most of the techniques we will discuss allow the construction of classifiers capable of working in either mode.

2.4. "Hard" Categorization Versus Ranking Categorization

While a complete automation of the TC task requires a T or F decision for each pair 〈dj, ci〉, a partial automation of this process might have different requirements.

For instance, given dj ∈ D a system might simply rank the categories in C = {c1, . . . , c|C|} according to their estimated appropriateness to dj, without taking any "hard" decision on any of them. Such a ranked list would be of great help to a human expert in charge of taking the final categorization decision, since she could thus restrict the choice to the category (or categories) at the top of the list, rather than having to examine the entire set. Alternatively, given ci ∈ C a system might simply rank the documents in D according to their estimated appropriateness to ci; symmetrically, for
classification under ci a human expert would just examine the top-ranked documents instead of the entire document set. These two modalities are sometimes called category-ranking TC and document-ranking TC [Yang 1999], respectively, and are the obvious counterparts of DPC and CPC.

Semiautomated, "interactive" classification systems [Larkey and Croft 1996] are useful especially in critical applications in which the effectiveness of a fully automated system may be expected to be significantly lower than that of a human expert. This may be the case when the quality of the training data (see Section 4) is low, or when the training documents cannot be trusted to be a representative sample of the unseen documents that are to come, so that the results of a completely automatic classifier could not be trusted completely.

In the rest of the paper, unless explicitly mentioned, we will deal with "hard" classification; however, many of the algorithms we will discuss naturally lend themselves to ranking TC too (more details on this in Section 6.1).

3. APPLICATIONS OF TEXT CATEGORIZATION

TC goes back to Maron's [1961] seminal work on probabilistic text classification. Since then, it has been used for a number of different applications, of which we here briefly review the most important ones. Note that the borders between the different classes of applications listed here are fuzzy and somehow artificial, and some of these may be considered special cases of others. Other applications we do not explicitly discuss are speech categorization by means of a combination of speech recognition and TC [Myers et al. 2000; Schapire and Singer 2000], multimedia document categorization through the analysis of textual captions [Sable and Hatzivassiloglou 2000], author identification for literary texts of unknown or disputed authorship [Forsyth 1999], language identification for texts of unknown language [Cavnar and Trenkle 1994],

automated identification of text genre [Kessler et al. 1997], and automated essay grading [Larkey 1998].

3.1. Automatic Indexing for Boolean Information Retrieval Systems

The application that has spawned most of the early research in the field [Borko and Bernick 1963; Field 1975; Gray and Harley 1971; Heaps 1973; Maron 1961] is that of automatic document indexing for IR systems relying on a controlled dictionary, the most prominent example of which is Boolean systems. In these latter each document is assigned one or more key words or key phrases describing its content, where these key words and key phrases belong to a finite set called controlled dictionary, often consisting of a thematic hierarchical thesaurus (e.g., the NASA thesaurus for the aerospace discipline, or the MESH thesaurus for medicine). Usually, this assignment is done by trained human indexers, and is thus a costly activity.

If the entries in the controlled vocabulary are viewed as categories, text indexing is an instance of TC, and may thus be addressed by the automatic techniques described in this paper. Recalling Section 2.2, note that this application may typically require that k1 ≤ x ≤ k2 key words are assigned to each document, for given k1, k2. Document-pivoted TC is probably the best option, so that new documents may be classified as they become available. Various text classifiers explicitly conceived for document indexing have been described in the literature; see, for example, Fuhr and Knorz [1984], Robertson and Harding [1984], and Tzeras and Hartmann [1993].

Automatic indexing with controlled dictionaries is closely related to automated metadata generation. In digital libraries, one is usually interested in tagging documents by metadata that describes them under a variety of aspects (e.g., creation date, document type or format, availability, etc.). Some of this metadata is thematic, that is, its role is to describe the semantics of the document by means of
bibliographic codes, key words or key phrases. The generation of this metadata may thus be viewed as a problem of document indexing with controlled dictionary, and thus tackled by means of TC techniques.

3.2. Document Organization

Indexing with a controlled vocabulary is an instance of the general problem of document base organization. In general, many other issues pertaining to document organization and filing, be it for purposes of personal organization or structuring of a corporate document base, may be addressed by TC techniques. For instance, at the offices of a newspaper incoming "classified" ads must be, prior to publication, categorized under categories such as Personals, Cars for Sale, Real Estate, etc. Newspapers dealing with a high volume of classified ads would benefit from an automatic system that chooses the most suitable category for a given ad. Other possible applications are the organization of patents into categories for making their search easier [Larkey 1999], the automatic filing of newspaper articles under the appropriate sections (e.g., Politics, Home News, Lifestyles, etc.), or the automatic grouping of conference papers into sessions.

3.3. Text Filtering

Text filtering is the activity of classifying a stream of incoming documents dispatched in an asynchronous way by an information producer to an information consumer [Belkin and Croft 1992]. A typical case is a newsfeed, where the producer is a news agency and the consumer is a newspaper [Hayes et al. 1990]. In this case, the filtering system should block the delivery of the documents the consumer is likely not interested in (e.g., all news not concerning sports, in the case of a sports newspaper). Filtering can be seen as a case of single-label TC, that is, the classification of incoming documents into two disjoint categories, the relevant and the irrelevant. Additionally,

a filtering system may also further classify the documents deemed relevant to the consumer into thematic categories; in the example above, all articles about sports should be further classified according to which sport they deal with, so as to allow journalists specialized in individual sports to access only documents of prospective interest for them. Similarly, an e-mail filter might be trained to discard "junk" mail [Androutsopoulos et al. 2000; Drucker et al. 1999] and further classify nonjunk mail into topical categories of interest to the user.

A filtering system may be installed at the producer end, in which case it must route the documents to the interested consumers only, or at the consumer end, in which case it must block the delivery of documents deemed uninteresting to the consumer. In the former case, the system builds and updates a "profile" for each consumer [Liddy et al. 1994], while in the latter case (which is the more common, and to which we will refer in the rest of this section) a single profile is needed.

A profile may be initially specified by the user, thereby resembling a standing IR query, and is updated by the system by using feedback information provided (either implicitly or explicitly) by the user on the relevance or nonrelevance of the delivered messages. In the TREC community [Lewis 1995c], this is called adaptive filtering, while the case in which no user-specified profile is available is called either routing or batch filtering, depending on whether documents have to be ranked in decreasing order of estimated relevance or just accepted/rejected. Batch filtering thus coincides with single-label TC under |C| = 2 categories; since this latter is a completely general TC task, some authors [Hull 1994; Hull et al. 1996; Schapire et al. 1998; Schutze et al. 1995], somewhat confusingly, use the term "filtering" in place of the more appropriate term "categorization."

In information science, document filtering has a tradition dating back to the '60s, when, addressed by systems of various degrees of automation and dealing with the multiconsumer case discussed
above, it was called selective dissemination of information or current awareness (see Korfhage [1997, Chapter 6]). The explosion in the availability of digital information has boosted the importance of such systems, which are nowadays being used in contexts such as the creation of personalized Web newspapers, junk e-mail blocking, and Usenet news selection.

Information filtering by ML techniques is widely discussed in the literature: see Amati and Crestani [1999], Iyer et al. [2000], Kim et al. [2000], Tauritz et al. [2000], and Yu and Lam [1998].

3.4. Word Sense Disambiguation

Word sense disambiguation (WSD) is the activity of finding, given the occurrence in a text of an ambiguous (i.e., polysemous or homonymous) word, the sense of this particular word occurrence. For instance, bank may have (at least) two different senses in English, as in the Bank of England (a financial institution) or the bank of river Thames (a hydraulic engineering artifact). It is thus a WSD task to decide which of the above senses the occurrence of bank in Last week I borrowed some money from the bank has. WSD is very important for many applications, including natural language processing, and indexing documents by word senses rather than by words for IR purposes. WSD may be seen as a TC task (see Gale et al. [1993]; Escudero et al. [2000]) once we view word occurrence contexts as documents and word senses as categories. Quite obviously, this is a single-label TC case, and one in which document-pivoted TC is usually the right choice.

WSD is just an example of the more general issue of resolving natural language ambiguities, one of the most important problems in computational linguistics. Other examples, which may all be tackled by means of TC techniques along the lines discussed for WSD, are context-sensitive spelling correction, prepositional phrase attachment, part of speech tagging, and word choice selection in machine translation; see Roth [1998] for an introduction.

3.5. Hierarchical Categorization of Web Pages

TC has recently aroused a lot of interest also for its possible application to automatically classifying Web pages, or sites, under the hierarchical catalogues hosted by popular Internet portals. When Web documents are catalogued in this way, rather than issuing a query to a general-purpose Web search engine a searcher may find it easier to first navigate in the hierarchy of categories and then restrict her search to a particular category of interest.

Classifying Web pages automatically has obvious advantages, since the manual categorization of a large enough subset of the Web is infeasible. Unlike in the previous applications, it is typically the case that each category must be populated by a set of k1 ≤ x ≤ k2 documents. CPC should be chosen so as to allow new categories to be added and obsolete ones to be deleted.

With respect to previously discussed TC applications, automatic Web page categorization has two essential peculiarities:

(1) The hypertextual nature of the documents: Links are a rich source of information, as they may be understood as stating the relevance of the linked page to the linking page. Techniques exploiting this intuition in a TC context have been presented by Attardi et al. [1998], Chakrabarti et al. [1998b], Furnkranz [1999], Govert et al. [1999], and Oh et al. [2000] and experimentally compared by Yang et al. [2002].

(2) The hierarchical structure of the category set: This may be used, for example, by decomposing the classification problem into a number of smaller classification problems, each corresponding to a branching decision at an internal node (a sketch of this decomposition follows below). Techniques exploiting this intuition in a TC context have been presented by Dumais and Chen [2000], Chakrabarti et al. [1998a], Koller and Sahami [1997], McCallum et al. [1998], Ruiz and Srinivasan [1999], and Weigend et al. [1999].
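The per-node decomposition mentioned in item (2) can be sketched as follows (illustrative Python, not from the survey; the category tree, the per-node classifiers, and all names are assumptions). A document is routed from the root downward, and at each internal node a local classifier, trained on a much smaller problem, decides which child branch(es) to follow.

    # Hypothetical sketch of hierarchical TC by branching decisions.
    # `tree` maps each internal node to its children; `node_classifiers`
    # maps each internal node to a function (doc, children) -> chosen children.

    def classify_hierarchically(doc, tree, node_classifiers, root):
        leaves = set()
        frontier = [root]
        while frontier:
            node = frontier.pop()
            children = tree.get(node, [])
            if not children:              # leaf category reached
                leaves.add(node)
                continue
            # Local (smaller) classification problem at this internal node:
            chosen = node_classifiers[node](doc, children)
            frontier.extend(chosen)
        return leaves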


if ((wheat & farm) or
    (wheat & commodity) or
    (bushels & export) or
    (wheat & tonnes) or
    (wheat & winter & ¬soft))
then WHEAT
else ¬WHEAT

Fig. 1. Rule-based classifier for the WHEAT category; key words are indicated in italic, categories are indicated in SMALL CAPS (from Apte et al. [1994]).

4. THE MACHINE LEARNING APPROACH TO TEXT CATEGORIZATION

In the '80s, the most popular approach (at least in operational settings) for the creation of automatic document classifiers consisted in manually building, by means of knowledge engineering (KE) techniques, an expert system capable of taking TC decisions. Such an expert system would typically consist of a set of manually defined logical rules, one per category, of type

if 〈DNF formula〉 then 〈category〉.

A DNF ("disjunctive normal form") formula is a disjunction of conjunctive clauses; the document is classified under 〈category〉 iff it satisfies the formula, that is, iff it satisfies at least one of the clauses. The most famous example of this approach is the CONSTRUE system [Hayes et al. 1990], built by Carnegie Group for the Reuters news agency. A sample rule of the type used in CONSTRUE is illustrated in Figure 1.
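For concreteness, here is a small sketch (illustrative Python, not part of the original survey; the rule encoding and helper names are assumptions) of how a DNF rule such as the WHEAT rule of Figure 1 can be evaluated against a set-of-words representation of a document.

    # Hypothetical sketch: a DNF rule is a list of clauses; each clause is a
    # list of literals, where a literal is (term, required_presence).

    WHEAT_RULE = [
        [("wheat", True), ("farm", True)],
        [("wheat", True), ("commodity", True)],
        [("bushels", True), ("export", True)],
        [("wheat", True), ("tonnes", True)],
        [("wheat", True), ("winter", True), ("soft", False)],
    ]

    def satisfies_dnf(rule, document_text):
        words = set(document_text.lower().split())
        # The document matches iff at least one conjunctive clause is satisfied.
        return any(
            all((term in words) == required for term, required in clause)
            for clause in rule
        )

    # Example: this toy sentence satisfies the first clause, so the document
    # is filed under WHEAT.
    print(satisfies_dnf(WHEAT_RULE, "wheat prices on the farm rose sharply"))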

The drawback of this approach is the knowledge acquisition bottleneck well known from the expert systems literature. That is, the rules must be manually defined by a knowledge engineer with the aid of a domain expert (in this case, an expert in the membership of documents in the chosen set of categories): if the set of categories is updated, then these two professionals must intervene again, and if the classifier is ported to a completely different domain (i.e., set of categories), a different domain expert needs to intervene and the work has to be repeated from scratch.

On the other hand, it was originally suggested that this approach can give very good effectiveness results: Hayes et al. [1990] reported a .90 "breakeven" result (see Section 7) on a subset of the Reuters test collection, a figure that outperforms

even the best classifiers built in the late '90s by state-of-the-art ML techniques. However, no other classifier has been tested on the same dataset as CONSTRUE, and it is not clear whether this was a randomly chosen or a favorable subset of the entire Reuters collection. As argued by Yang [1999], the results above do not allow us to state that these effectiveness results may be obtained in general.

Since the early '90s, the ML approach to TC has gained popularity and has eventually become the dominant one, at least in the research community (see Mitchell [1996] for a comprehensive introduction to ML). In this approach, a general inductive process (also called the learner) automatically builds a classifier for a category ci by observing the characteristics of a set of documents manually classified under ci or c̄i by a domain expert; from these characteristics, the inductive process gleans the characteristics that a new unseen document should have in order to be classified under ci. In ML terminology, the classification problem is an activity of supervised learning, since the learning process is "supervised" by the knowledge of the categories and of the training instances that belong to them.2

The advantages of the ML approach over the KE approach are evident. The engineering effort goes toward the construction not of a classifier, but of an automatic builder of classifiers (the learner). This means that if a learner is (as it often is) available off-the-shelf, all that is needed is the inductive, automatic construction of a classifier from a set of manually classified documents. The same happens if a classifier already exists and the original set of categories is updated, or if the classifier is ported to a completely different domain.

2 Within the area of content-based document management tasks, an example of an unsupervised learning activity is document clustering (see Section 1).

In the ML approach, the preclassified documents are then the key resource. In the most favorable case, they are already available; this typically happens for organizations which have previously carried out the same categorization activity manually and decide to automate the process. The less favorable case is when no manually classified documents are available; this typically happens for organizations that start a categorization activity and opt for an automated modality straightaway. The ML approach is more convenient than the KE approach also in this latter case. In fact, it is easier to manually classify a set of documents than to build and tune a set of rules, since it is easier to characterize a concept extensionally (i.e., to select instances of it) than intensionally (i.e., to describe the concept in words, or to describe a procedure for recognizing its instances).

Classifiers built by means of ML techniques nowadays achieve impressive levels of effectiveness (see Section 7), making automatic classification a qualitatively (and not only economically) viable alternative to manual classification.

4.1. Training Set, Test Set, and Validation Set

The ML approach relies on the availability of an initial corpus Ω = {d1, . . . , d|Ω|} ⊂ D of documents preclassified under C = {c1, . . . , c|C|}. That is, the values of the total function Φ̆ : D × C → {T, F} are known for every pair 〈dj, ci〉 ∈ Ω × C. A document dj is a positive example of ci if Φ̆(dj, ci) = T, a negative example of ci if Φ̆(dj, ci) = F.

In research settings (and in most operational settings too), once a classifier Φ has been built it is desirable to evaluate its effectiveness. In this case, prior to classifier construction the initial corpus is split in two sets, not necessarily of equal size:

—a training(-and-validation) set TV = {d1, . . . , d|TV|}. The classifier Φ for categories C = {c1, . . . , c|C|} is inductively

built by observing the characteristics of these documents;

—a test set Te = {d|TV|+1, . . . , d|Ω|}, used for testing the effectiveness of the classifiers. Each dj ∈ Te is fed to the classifier, and the classifier decisions Φ(dj, ci) are compared with the expert decisions Φ̆(dj, ci). A measure of classification effectiveness is based on how often the Φ(dj, ci) values match the Φ̆(dj, ci) values.

The documents in Te cannot participate in any way in the inductive construction of the classifiers; if this condition were not satisfied, the experimental results obtained would likely be unrealistically good, and the evaluation would thus have no scientific character [Mitchell 1996, page 129]. In an operational setting, after evaluation has been performed one would typically retrain the classifier on the entire initial corpus, in order to boost effectiveness. In this case, the results of the previous evaluation would be a pessimistic estimate of the real performance, since the final classifier has been trained on more data than the classifier evaluated.

This is called the train-and-test approach. An alternative is the k-fold cross-validation approach (see Mitchell [1996], page 146), in which k different classifiers Φ1, . . . , Φk are built by partitioning the initial corpus into k disjoint sets Te1, . . . , Tek and then iteratively applying the train-and-test approach on pairs 〈TVi = Ω − Tei, Tei〉. The final effectiveness figure is obtained by individually computing the effectiveness of Φ1, . . . , Φk, and then averaging the individual results in some way.
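The two evaluation protocols can be sketched as follows (illustrative Python, not from the survey; the learner, the effectiveness measure, and all names are assumptions). The learner and the effectiveness function are passed in as parameters, so any classifier from Section 6 and any measure from Section 7 could be substituted.

    import random

    def train_and_test(corpus, learn, effectiveness, test_fraction=0.3):
        # corpus: list of (document, labels) pairs, i.e. pairs for which the
        # target function is known.
        shuffled = corpus[:]
        random.shuffle(shuffled)
        split = int(len(shuffled) * (1 - test_fraction))
        tv, te = shuffled[:split], shuffled[split:]
        classifier = learn(tv)
        return effectiveness(classifier, te)

    def k_fold_cross_validation(corpus, learn, effectiveness, k=10):
        shuffled = corpus[:]
        random.shuffle(shuffled)
        folds = [shuffled[i::k] for i in range(k)]       # k disjoint sets Te_i
        scores = []
        for i in range(k):
            te_i = folds[i]
            tv_i = [pair for j, fold in enumerate(folds) if j != i for pair in fold]
            scores.append(effectiveness(learn(tv_i), te_i))
        # One common way of averaging the k individual results:
        return sum(scores) / k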

In both approaches, it is often the case that the internal parameters of the classifiers must be tuned by testing which values of the parameters yield the best effectiveness. In order to make this optimization possible, in the train-and-test approach the set {d1, . . . , d|TV|} is further split into a training set Tr = {d1, . . . , d|Tr|}, from which the classifier is built, and a validation set Va = {d|Tr|+1, . . . , d|TV|} (sometimes called a hold-out set), on which the repeated tests of the classifier aimed
at parameter optimization are performed; the obvious variant may be used in the k-fold cross-validation case. Note that, for the same reason why we do not test a classifier on the documents it has been trained on, we do not test it on the documents it has been optimized on: test set and validation set must be kept separate.3

Given a corpus Ω, one may define the generality gΩ(ci) of a category ci as the percentage of documents that belong to ci, that is:

gΩ(ci) = |{dj ∈ Ω | Φ̆(dj, ci) = T}| / |Ω|.

The training set generality gTr(ci), validation set generality gVa(ci), and test set generality gTe(ci) of ci may be defined in the obvious way.
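As a minimal illustration (Python sketch, names assumed), the generality of a category over any labeled document set follows directly from this definition:

    def generality(labeled_docs, category):
        # labeled_docs: list of (document, set_of_categories) pairs, playing
        # the role of Omega (or of Tr, Va, Te when restricted to those sets).
        if not labeled_docs:
            return 0.0
        positives = sum(1 for _, labels in labeled_docs if category in labels)
        return positives / len(labeled_docs)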

4.2. Information Retrieval Techniques and Text Categorization

Text categorization heavily relies on the basic machinery of IR. The reason is that TC is a content-based document management task, and as such it shares many characteristics with other IR tasks such as text search.

IR techniques are used in three phases of the text classifier life cycle:

(1) IR-style indexing is always performed on the documents of the initial corpus and on those to be classified during the operational phase;

(2) IR-style techniques (such as document-request matching, query reformulation, . . .) are often used in the inductive construction of the classifiers;

(3) IR-style evaluation of the effectiveness of the classifiers is performed.

The various approaches to classification differ mostly for how they tackle (2), although in a few cases nonstandard approaches to (1) and (3) are also used. Indexing, induction, and evaluation are the themes of Sections 5, 6 and 7, respectively.

3 From now on, we will take the freedom to use the expression "test document" to denote any document not in the training set and validation set. This thus includes any document submitted to the classifier in the operational phase.

5. DOCUMENT INDEXING AND DIMENSIONALITY REDUCTION

5.1. Document Indexing

Texts cannot be directly interpreted by a classifier or by a classifier-building algorithm. Because of this, an indexing procedure that maps a text dj into a compact representation of its content needs to be uniformly applied to training, validation, and test documents. The choice of a representation for text depends on what one regards as the meaningful units of text (the problem of lexical semantics) and the meaningful natural language rules for the combination of these units (the problem of compositional semantics). Similarly to what happens in IR, in TC this latter problem is usually disregarded,4 and a text dj is usually represented as a vector of term weights d⃗j = 〈w1j, . . . , w|T|j〉, where T is the set of terms (sometimes called features) that occur at least once in at least one document of Tr, and 0 ≤ wkj ≤ 1 represents, loosely speaking, how much term tk contributes to the semantics of document dj. Differences among approaches are accounted for by

(1) different ways to understand what a term is;

(2) different ways to compute term weights.

A typical choice for (1) is to identify terms with words. This is often called either the set of words or the bag of words approach to document representation, depending on whether weights are binary or not.
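The set-of-words and bag-of-words representations can be sketched as follows (illustrative Python, not from the survey; the naive tokenizer and all names are assumptions):

    from collections import Counter

    def tokenize(text):
        # Naive tokenizer, used only for illustration.
        return text.lower().split()

    def build_term_set(training_docs):
        # T: every term occurring at least once in at least one training document.
        return sorted({t for doc in training_docs for t in tokenize(doc)})

    def set_of_words_vector(doc, terms):
        words = set(tokenize(doc))
        return [1 if t in words else 0 for t in terms]    # binary weights

    def bag_of_words_vector(doc, terms):
        counts = Counter(tokenize(doc))
        return [counts[t] for t in terms]                 # raw term counts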

In a number of experiments [Apte et al. 1994; Dumais et al. 1998; Lewis 1992a], it has been found that representations more sophisticated than this do not yield significantly better effectiveness, thereby confirming similar results from IR

4 An exception to this is represented by learning approaches based on hidden Markov models [Denoyer et al. 2001; Frasconi et al. 2002].


[Salton and Buckley 1988]. In particular, some authors have used phrases, rather than individual words, as indexing terms [Fuhr et al. 1991; Schutze et al. 1995; Tzeras and Hartmann 1993], but the experimental results found to date have not been uniformly encouraging, irrespectively of whether the notion of "phrase" is motivated

—syntactically, that is, the phrase is such according to a grammar of the language (see Lewis [1992a]); or

—statistically, that is, the phrase is not grammatically such, but is composed of a set/sequence of words whose patterns of contiguous occurrence in the collection are statistically significant (see Caropreso et al. [2001]).

Lewis [1992a] argued that the likely reason for the discouraging results is that, although indexing languages based on phrases have superior semantic qualities, they have inferior statistical qualities with respect to word-only indexing languages: a phrase-only indexing language has "more terms, more synonymous or nearly synonymous terms, lower consistency of assignment (since synonymous terms are not assigned to the same documents), and lower document frequency for terms" [Lewis 1992a, page 40]. Although his remarks are about syntactically motivated phrases, they also apply to statistically motivated ones, although perhaps to a smaller degree. A combination of the two approaches is probably the best way to go: Tzeras and Hartmann [1993] obtained significant improvements by using noun phrases obtained through a combination of syntactic and statistical criteria, where a "crude" syntactic method was complemented by a statistical filter (only those syntactic phrases that occurred at least three times in the positive examples of a category ci were retained). It is likely that the final word on the usefulness of phrase indexing in TC has still to be told, and investigations in this direction are still being actively pursued [Caropreso et al. 2001; Mladenic and Grobelnik 1998].

As for issue (2), weights usually range between 0 and 1 (an exception is Lewis et al. [1996]), and for ease of exposition we will assume they always do. As a special case, binary weights may be used (1 denoting presence and 0 absence of the term in the document); whether binary or nonbinary weights are used depends on the classifier learning algorithm used. In the case of nonbinary indexing, for determining the weight wkj of term tk in document dj any IR-style indexing technique that represents a document as a vector of weighted terms may be used. Most of the times, the standard tfidf function is used (see Salton and Buckley [1988]), defined as

tfidf(tk, dj) = #(tk, dj) · log(|Tr| / #Tr(tk)),    (1)

where #(tk, dj) denotes the number of times tk occurs in dj, and #Tr(tk) denotes the document frequency of term tk, that is, the number of documents in Tr in which tk occurs. This function embodies the intuitions that (i) the more often a term occurs in a document, the more it is representative of its content, and (ii) the more documents a term occurs in, the less discriminating it is.5 Note that this formula (as most other indexing formulae) weights the importance of a term to a document in terms of occurrence considerations only, thereby deeming of null importance the order in which the terms occur in the document and the syntactic role they play. In other words, the semantics of a document is reduced to the collective lexical semantics of the terms that occur in it, thereby disregarding the issue of compositional semantics (an exception are the representation techniques used for FOIL [Cohen 1995a] and SLEEPING EXPERTS [Cohen and Singer 1999]).

5 There exist many variants of tfidf, that differ from each other in terms of logarithms, normalization or other correction factors. Formula (1) is just one of the possible instances of this class; see Salton and Buckley [1988] and Singhal et al. [1996] for variations on this theme.

In order for the weights to fall in the [0,1] interval and for the documents to be represented by vectors of equal length, the weights resulting from tfidf are often normalized by cosine normalization, given by

wkj = tfidf(tk, dj) / √( Σs=1..|T| tfidf(ts, dj)² ).    (2)
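Formulas (1) and (2) translate directly into code. The following sketch (illustrative Python, not from the survey; the tokenizer and all names are assumptions) computes cosine-normalized tfidf weights for a document given a training corpus Tr:

    import math
    from collections import Counter

    def tokenize(text):
        return text.lower().split()          # naive tokenizer, for illustration

    def tfidf_vector(doc, training_docs):
        tr_size = len(training_docs)
        # Document frequency #Tr(tk): number of training documents containing tk.
        df = Counter(t for d in training_docs for t in set(tokenize(d)))
        counts = Counter(tokenize(doc))       # occurrence counts #(tk, dj)
        weights = {}
        for term, tf in counts.items():
            if df[term] == 0:
                continue                      # term unseen in Tr: no idf defined
            weights[term] = tf * math.log(tr_size / df[term])     # formula (1)
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:
            weights = {t: w / norm for t, w in weights.items()}   # formula (2)
        return weights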

Although normalized tfidf is the most popular one, other indexing functions have also been used, including probabilistic techniques [Govert et al. 1999] or techniques for indexing structured documents [Larkey and Croft 1996]. Functions different from tfidf are especially needed when Tr is not available in its entirety from the start and #Tr(tk) cannot thus be computed, as in adaptive filtering; in this case, approximations of tfidf are usually employed [Dagan et al. 1997, Section 4.3].

Before indexing, the removal of function words (i.e., topic-neutral words such as articles, prepositions, conjunctions, etc.) is almost always performed (exceptions include Lewis et al. [1996], Nigam et al. [2000], and Riloff [1995]).6 Concerning stemming (i.e., grouping words that share the same morphological root), its suitability to TC is controversial. Although, similarly to unsupervised term clustering (see Section 5.5.1) of which it is an instance, stemming has sometimes been reported to hurt effectiveness (e.g., Baker and McCallum [1998]), the recent tendency is to adopt it, as it reduces both the dimensionality of the term space (see Section 5.3) and the stochastic dependence between terms (see Section 6.2).
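A typical preprocessing step of this kind might look as follows (illustrative Python; the tiny stop word list and the crude suffix-stripping stemmer are stand-ins for real resources such as a full stop list or the Porter stemmer, and all names are assumptions):

    # Hypothetical preprocessing sketch: function word removal plus crude stemming.
    STOP_WORDS = {"the", "a", "an", "and", "or", "of", "in", "on", "to", "for", "is", "are"}

    def crude_stem(word):
        # Stand-in for a real stemmer (e.g., Porter's algorithm).
        for suffix in ("ing", "ed", "es", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word

    def preprocess(text):
        tokens = [w for w in text.lower().split() if w not in STOP_WORDS]
        return [crude_stem(w) for w in tokens]

    # Example: preprocess("The farmers are exporting wheat to the mills")
    # -> ['farmer', 'export', 'wheat', 'mill']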

Depending on the application, either the full text of the document or selected parts of it are indexed. While the former option is the rule, exceptions exist. For instance, in a patent categorization application Larkey [1999] indexed only the title, the abstract, the first 20 lines of the summary, and the section containing the claims of novelty of the described invention. This approach was made possible by the fact that documents describing patents are structured. Similarly, when a document title is available, one can pay extra importance to the words it contains [Apte et al. 1994; Cohen and Singer 1999; Weiss et al. 1999]. When documents are flat, identifying the most relevant part of a document is instead a nonobvious task.

6 One application of TC in which it would be inappropriate to remove function words is author identification for documents of disputed paternity. In fact, as noted in Manning and Schutze [1999], page 589, "it is often the 'little' words that give an author away (for example, the relative frequencies of words like because or though)."

5.2. The Darmstadt Indexing Approach

The AIR/X system [Fuhr et al. 1991] occupies a special place in the literature on indexing for TC. This system is the final result of the AIR project, one of the most important efforts in the history of TC: spanning a duration of more than 10 years [Knorz 1982; Tzeras and Hartmann 1993], it has produced a system operatively employed since 1985 in the classification of corpora of scientific literature of O(10^5) documents and O(10^4) categories, and has had important theoretical spin-offs in the field of probabilistic indexing [Fuhr 1989; Fuhr and Buckley 1991].7

7 The AIR/X system, its applications (including the AIR/PHYS system [Biebricher et al. 1988], an application of AIR/X to indexing physics literature), and its experiments have also been richly documented in a series of papers and doctoral theses written in German. The interested reader may consult Fuhr et al. [1991] for a detailed bibliography.

The approach to indexing taken in AIR/X is known as the Darmstadt Indexing Approach (DIA) [Fuhr 1985]. Here, "indexing" is used in the sense of Section 3.1, that is, as using terms from a controlled vocabulary, and is thus a synonym of TC (the DIA was later extended to indexing with free terms [Fuhr and Buckley 1991]). The idea that underlies the DIA is the use of a much wider set of "features" than described in Section 5.1. All other approaches mentioned in this paper view terms as the dimensions of the learning space, where terms may be single words, stems, phrases, or (see Sections 5.5.1 and 5.5.2) combinations of any of these. In contrast, the DIA considers properties (of terms, documents,
categories, or pairwise relationships among these) as basic dimensions of the learning space. Examples of these are

—properties of a term tk: e.g., the idf of tk;

—properties of the relationship between a term tk and a document dj: for example, the tf of tk in dj; or the location (e.g., in the title, or in the abstract) of tk within dj;

—properties of a document dj: for example, the length of dj;

—properties of a category ci: for example, the training set generality of ci.

For each possible document-category pair, the values of these features are collected in a so-called relevance description vector rd(dj, ci). The size of this vector is determined by the number of properties considered, and is thus independent of specific terms, categories, or documents (for multivalued features, appropriate aggregation functions are applied in order to yield a single value to be included in rd(dj, ci)); in this way an abstraction from specific terms, categories, or documents is achieved.
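A relevance description vector of this kind might be assembled as follows (illustrative Python; this is not the actual DIA implementation, and the particular properties, data layout, and names are all assumptions):

    def relevance_description(term_stats, doc, category, training_set):
        # term_stats: precomputed corpus statistics per term, e.g. its idf.
        # Returns one fixed-length feature vector per (document, category) pair;
        # its length depends only on the number of properties considered.
        words = doc["text"].lower().split()
        per_term = []
        for tk in doc["candidate_terms"]:
            tf = words.count(tk)
            idf = term_stats[tk]["idf"]
            in_title = 1.0 if tk in doc["title"].lower().split() else 0.0
            per_term.append((tf, idf, in_title))
        # Multivalued term-level properties are aggregated (here: averaged)
        # so that the vector size stays independent of the number of terms.
        aggregated = [sum(f[i] for f in per_term) / max(len(per_term), 1)
                      for i in range(3)]
        doc_length = len(words)                      # property of the document
        generality = sum(1 for _, labels in training_set if category in labels) \
                     / max(len(training_set), 1)     # property of the category
        return aggregated + [doc_length, generality]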

The main advantage of this approach is the possibility to consider additional features that can hardly be accounted for in the usual term-based approaches, for example, the location of a term within a document, or the certainty with which a phrase was identified in a document. The term-category relationship is described by estimates, derived from the training set, of the probability P(ci | tk) that a document belongs to category ci, given that it contains term tk (the DIA association factor).8 Relevance description vectors rd(dj, ci) are then the final representations that are used for the classification of document dj under category ci.

The essential ideas of the DIA—transforming the classification space by means of abstraction and using a more detailed text representation than the standard bag-of-words approach—have not been taken up by other researchers so far. For new TC applications dealing with structured documents or categorization of Web pages, these ideas may become of increasing importance.

8 Association factors are called adhesion coefficients in many early papers on TC; see Field [1975]; Robertson and Harding [1984].

5.3. Dimensionality Reduction

Unlike in text retrieval, in TC the high dimensionality of the term space (i.e., the large value of |T|) may be problematic. In fact, while typical algorithms used in text retrieval (such as cosine matching) can scale to high values of |T|, the same does not hold of many sophisticated learning algorithms used for classifier induction (e.g., the LLSF algorithm of Yang and Chute [1994]). Because of this, before classifier induction one often applies a pass of dimensionality reduction (DR), whose effect is to reduce the size of the vector space from |T| to |T′| ≪ |T|; the set T′ is called the reduced term set.

DR is also beneficial since it tends to reduce overfitting, that is, the phenomenon by which a classifier is tuned also to the contingent characteristics of the training data rather than just the constitutive characteristics of the categories. Classifiers that overfit the training data are good at reclassifying the data they have been trained on, but much worse at classifying previously unseen data. Experiments have shown that, in order to avoid overfitting, a number of training examples roughly proportional to the number of terms used is needed; Fuhr and Buckley [1991, page 235] have suggested that 50–100 training examples per term may be needed in TC tasks. This means that, if DR is performed, overfitting may be avoided even if a smaller amount of training examples is used. However, in removing terms the risk is to remove potentially useful information on the meaning of the documents. It is then clear that, in order to obtain optimal (cost-)effectiveness, the reduction process must be performed with care. Various DR methods have been proposed, either from the information theory or from the linear algebra literature, and their relative merits have been tested by experimentally evaluating the variation
in effectiveness that a given classifier undergoes after application of the function to the term space.

There are two distinct ways of viewing DR, depending on whether the task is performed locally (i.e., for each individual category) or globally:

—local DR: for each category ci, a set T′i of terms, with |T′i| ≪ |T|, is chosen for classification under ci (see Apte et al. [1994]; Lewis and Ringuette [1994]; Li and Jain [1998]; Ng et al. [1997]; Sable and Hatzivassiloglou [2000]; Schutze et al. [1995]; Wiener et al. [1995]). This means that different subsets of d⃗j are used when working with the different categories. Typical values are 10 ≤ |T′i| ≤ 50.

—global DR: a set T′ of terms, with |T′| ≪ |T|, is chosen for the classification under all categories C = {c1, . . . , c|C|} (see Caropreso et al. [2001]; Mladenic [1998]; Yang [1999]; Yang and Pedersen [1997]).

This distinction usually does not impact on the choice of DR technique, since most such techniques can be used (and have been used) for local and global DR alike (supervised DR techniques—see Section 5.5.1—are exceptions to this rule). In the rest of this section, we will assume that the global approach is used, although everything we will say also applies to the local approach.

A second, orthogonal distinction may be drawn in terms of the nature of the resulting terms:

—DR by term selection: T′ is a subset of T;

—DR by term extraction: the terms in T′ are not of the same type of the terms in T (e.g., if the terms in T are words, the terms in T′ may not be words at all), but are obtained by combinations or transformations of the original ones.

Unlike in the previous distinction, these two ways of doing DR are tackled by very different techniques; we will address them separately in the next two sections.

5.4. Dimensionality Reduction by Term Selection

Given a predetermined integer r, techniques for term selection (also called term space reduction—TSR) attempt to select, from the original set T, the set T′ of terms (with |T′| ≪ |T|) that, when used for document indexing, yields the highest effectiveness. Yang and Pedersen [1997] have shown that TSR may even result in a moderate (≤5%) increase in effectiveness, depending on the classifier, on the aggressivity |T|/|T′| of the reduction, and on the TSR technique used.

Moulinier et al. [1996] have used a so-called wrapper approach, that is, one in which T′ is identified by means of the same learning method that will be used for building the classifier [John et al. 1994]. Starting from an initial term set, a new term set is generated by either adding or removing a term. When a new term set is generated, a classifier based on it is built and then tested on a validation set. The term set that results in the best effectiveness is chosen. This approach has the advantage of being tuned to the learning algorithm being used; moreover, if local DR is performed, different numbers of terms for different categories may be chosen, depending on whether a category is or is not easily separable from the others. However, the sheer size of the space of different term sets makes its cost prohibitive for standard TC applications.

A computationally easier alternative is the filtering approach [John et al. 1994], that is, keeping the |T′| ≪ |T| terms that receive the highest score according to a function that measures the "importance" of the term for the TC task. We will explore this solution in the rest of this section.

5.4.1. Document Frequency. A simple and effective global TSR function is the document frequency #Tr(tk) of a term tk, that is, only the terms that occur in the highest number of documents are retained. In a series of experiments Yang and Pedersen [1997] have shown that with #Tr(tk) it is possible to reduce the dimensionality by a factor of 10 with no loss in effectiveness (a
reduction by a factor of 100 bringing about just a small loss).

This seems to indicate that the terms occurring most frequently in the collection are the most valuable for TC. As such, this would seem to contradict a well-known "law" of IR, according to which the terms with low-to-medium document frequency are the most informative ones [Salton and Buckley 1988]. But these two results do not contradict each other, since it is well known (see Salton et al. [1975]) that the large majority of the words occurring in a corpus have a very low document frequency; this means that by reducing the term set by a factor of 10 using document frequency, only such words are removed, while the words from low-to-medium to high document frequency are preserved. Of course, stop words need to be removed in advance, lest only topic-neutral words are retained [Mladenic 1998].

Finally, note that a slightly more empir-ical form of TSR by document frequencyis adopted by many authors, who removeall terms occurring in at most x train-ing documents (popular values for x rangefrom 1 to 3), either as the only form of DR[Maron 1961; Ittner et al. 1995] or beforeapplying another more sophisticated form[Dumais et al. 1998; Li and Jain 1998]. Avariant of this policy is removing all termsthat occur at most x times in the train-ing set (e.g., Dagan et al. [1997]; Joachims[1997]), with popular values for x rang-ing from 1 (e.g., Baker and McCallum[1998]) to 5 (e.g., Apte et al. [1994]; Cohen[1995a]).
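As a purely illustrative aside, a minimal Python sketch of TSR by document frequency might look as follows; the function and variable names (documents, r, min_df) are hypothetical and not taken from any of the systems cited above.

```python
from collections import Counter

def select_terms_by_document_frequency(documents, r, min_df=1):
    """Keep the r terms occurring in the largest number of training documents.

    documents: list of token lists (the training set Tr).
    r: desired size of the reduced term set T'.
    min_df: additionally discard terms occurring in fewer than min_df documents.
    """
    df = Counter()
    for doc in documents:
        df.update(set(doc))          # count each term at most once per document
    candidates = [(t, f) for t, f in df.items() if f >= min_df]
    candidates.sort(key=lambda tf: tf[1], reverse=True)
    return [t for t, _ in candidates[:r]]

# toy usage
train = [["stocks", "rise", "market"], ["market", "falls"], ["rain", "today"]]
print(select_terms_by_document_frequency(train, r=2))
```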

5.4.2. Other Information-Theoretic Term Selection Functions. Other more sophisticated information-theoretic functions have been used in the literature, among them the DIA association factor [Fuhr et al. 1991], chi-square [Caropreso et al. 2001; Galavotti et al. 2000; Schutze et al. 1995; Sebastiani et al. 2000; Yang and Pedersen 1997; Yang and Liu 1999], NGL coefficient [Ng et al. 1997; Ruiz and Srinivasan 1999], information gain [Caropreso et al. 2001; Larkey 1998; Lewis 1992a; Lewis and Ringuette 1994; Mladenic 1998; Moulinier and Ganascia 1996; Yang and Pedersen 1997; Yang and Liu 1999], mutual information [Dumais et al. 1998; Lam et al. 1997; Larkey and Croft 1996; Lewis and Ringuette 1994; Li and Jain 1998; Moulinier et al. 1996; Ruiz and Srinivasan 1999; Taira and Haruno 1999; Yang and Pedersen 1997], odds ratio [Caropreso et al. 2001; Mladenic 1998; Ruiz and Srinivasan 1999], relevancy score [Wiener et al. 1995], and GSS coefficient [Galavotti et al. 2000]. The mathematical definitions of these measures are summarized for convenience in Table I.9 Here, probabilities are interpreted on an event space of documents (e.g., P(t̄k, ci) denotes the probability that, for a random document x, term tk does not occur in x and x belongs to category ci), and are estimated by counting occurrences in the training set. All functions are specified “locally” to a specific category ci; in order to assess the value of a term tk in a “global,” category-independent sense, either the sum fsum(tk) = ∑_{i=1}^{|C|} f(tk, ci), or the weighted sum fwsum(tk) = ∑_{i=1}^{|C|} P(ci) f(tk, ci), or the maximum fmax(tk) = max_{i=1}^{|C|} f(tk, ci) of their category-specific values f(tk, ci) are usually computed.

These functions try to capture the intuition that the best terms for ci are the ones distributed most differently in the sets of positive and negative examples of ci. However, interpretations of this principle vary across different functions. For instance, in the experimental sciences χ2 is used to measure how the results of an observation differ (i.e., are independent) from the results expected according to an initial hypothesis (lower values indicate lower dependence). In DR we measure how independent tk and ci are.

9 For better uniformity Table I views all the TSR functions of this section in terms of subjective probability. In some cases such as χ2(tk, ci) this is slightly artificial, since this function is not usually viewed in probabilistic terms. The formulae refer to the “local” (i.e., category-specific) forms of the functions, which again is slightly artificial in some cases. Note that the NGL and GSS coefficients are here named after their authors, since they had originally been given names that might generate some confusion if used here.


Table I. Main Functions Used for Term Space Reduction Purposes. (Information gain is also known as expected mutual information, and is used under this name by Lewis [1992a, page 44] and Larkey [1998]. In the RS(tk, ci) formula, d is a constant damping factor.)

Function                 Denoted by      Mathematical form
DIA association factor   z(tk, ci)       P(ci | tk)
Information gain         IG(tk, ci)      ∑_{c ∈ {ci, c̄i}} ∑_{t ∈ {tk, t̄k}} P(t, c) · log [P(t, c) / (P(t) · P(c))]
Mutual information       MI(tk, ci)      log [P(tk, ci) / (P(tk) · P(ci))]
Chi-square               χ2(tk, ci)      |Tr| · [P(tk, ci) · P(t̄k, c̄i) − P(tk, c̄i) · P(t̄k, ci)]² / [P(tk) · P(t̄k) · P(ci) · P(c̄i)]
NGL coefficient          NGL(tk, ci)     √|Tr| · [P(tk, ci) · P(t̄k, c̄i) − P(tk, c̄i) · P(t̄k, ci)] / √[P(tk) · P(t̄k) · P(ci) · P(c̄i)]
Relevancy score          RS(tk, ci)      log [(P(tk | ci) + d) / (P(t̄k | c̄i) + d)]
Odds ratio               OR(tk, ci)      [P(tk | ci) · (1 − P(tk | c̄i))] / [(1 − P(tk | ci)) · P(tk | c̄i)]
GSS coefficient          GSS(tk, ci)     P(tk, ci) · P(t̄k, c̄i) − P(tk, c̄i) · P(t̄k, ci)

The terms tk with the lowest value for χ2(tk, ci) are thus the most independent from ci; since we are interested in the terms which are not, we select the terms for which χ2(tk, ci) is highest.
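To make the use of these definitions concrete, the following is a small, hypothetical Python sketch that estimates the cell probabilities of Table I from raw training-set counts, computes the local χ2(tk, ci), and globalizes it by taking the maximum over categories; it illustrates the definitions only and is not taken from any system in the literature.

```python
def chi_square(n_tc, n_t, n_c, n):
    """Local chi-square of term t for category c, from raw counts.

    n_tc: training documents containing t and labeled c
    n_t:  training documents containing t
    n_c:  training documents labeled c
    n:    total number of training documents |Tr|
    """
    p_tc = n_tc / n                        # P(t, c)
    p_tnc = (n_t - n_tc) / n               # P(t, not-c)
    p_ntc = (n_c - n_tc) / n               # P(not-t, c)
    p_ntnc = 1.0 - p_tc - p_tnc - p_ntc    # P(not-t, not-c)
    p_t, p_c = n_t / n, n_c / n
    denom = p_t * (1 - p_t) * p_c * (1 - p_c)
    if denom == 0.0:
        return 0.0
    return n * (p_tc * p_ntnc - p_tnc * p_ntc) ** 2 / denom

def chi_square_max(term, docs, labels, categories):
    """Globalize chi-square by taking the maximum over all categories.

    docs: list of sets of terms; labels: category label of each document.
    """
    n = len(docs)
    n_t = sum(term in d for d in docs)
    scores = []
    for c in categories:
        n_c = sum(l == c for l in labels)
        n_tc = sum(term in d and l == c for d, l in zip(docs, labels))
        scores.append(chi_square(n_tc, n_t, n_c, n))
    return max(scores)
```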

While each TSR function has its own rationale, the ultimate word on its value is the effectiveness it brings about. Various experimental comparisons of TSR functions have thus been carried out [Caropreso et al. 2001; Galavotti et al. 2000; Mladenic 1998; Yang and Pedersen 1997]. In these experiments most functions listed in Table I (with the possible exception of MI) have improved on the results of document frequency. For instance, Yang and Pedersen [1997] have shown that, with various classifiers and various initial corpora, sophisticated techniques such as IGsum(tk, ci) or χ2max(tk, ci) can reduce the dimensionality of the term space by a factor of 100 with no loss (or even with a small increase) of effectiveness. Collectively, the experiments reported in the above-mentioned papers seem to indicate that

    {ORsum, NGLsum, GSSmax} > {χ2max, IGsum} > {χ2wavg} ≫ {MImax, MIwsum},

where “>” means “performs better than.”

However, it should be noted that these results are just indicative, and that more general statements on the relative merits of these functions could be made only as a result of comparative experiments performed in thoroughly controlled conditions and on a variety of different situations (e.g., different classifiers, different initial corpora, . . . ).

5.5. Dimensionality Reduction by Term Extraction

Given a predetermined |T′| ≪ |T|, term extraction attempts to generate, from the original set T, a set T′ of “synthetic” terms that maximize effectiveness. The rationale for using synthetic (rather than naturally occurring) terms is that, due to the pervasive problems of polysemy, homonymy, and synonymy, the original terms may not be optimal dimensions for document content representation. Methods for term extraction try to solve these problems by creating artificial terms that do not suffer from them. Any term extraction method consists in (i) a method for extracting the new terms from the old ones, and (ii) a method for converting the original document representations into new representations based on the newly synthesized dimensions. Two term extraction methods have been experimented with in TC, namely term clustering and latent semantic indexing.

5.5.1. Term Clustering. Term clustering tries to group words with a high degree of pairwise semantic relatedness, so that the groups (or their centroids, or a representative of them) may be used instead of the terms as dimensions of the vector space. Term clustering is different from term selection, since the former tends to address terms synonymous (or near-synonymous) with other terms, while the latter targets noninformative terms.10

Lewis [1992a] was the first to investigate the use of term clustering in TC. The method he employed, called reciprocal nearest neighbor clustering, consists in creating clusters of two terms that are one the most similar to the other according to some measure of similarity. His results were inferior to those obtained by single-word indexing, possibly due to a disappointing performance by the clustering method: as Lewis [1992a, page 48] said, “The relationships captured in the clusters are mostly accidental, rather than the systematic relationships that were hoped for.”

Li and Jain [1998] viewed semantic relatedness between words in terms of their co-occurrence and co-absence within training documents. By using this technique in the context of a hierarchical clustering algorithm, they witnessed only a marginal effectiveness improvement; however, the small size of their experiment (see Section 6.11) hardly allows any definitive conclusion to be reached.

Both Lewis [1992a] and Li and Jain [1998] are examples of unsupervised clustering, since clustering is not affected by the category labels attached to the documents. Baker and McCallum [1998] provided instead an example of supervised clustering, as the distributional clustering method they employed clusters together those terms that tend to indicate the presence of the same category, or group of categories. Their experiments, carried out in the context of a Naïve Bayes classifier (see Section 6.2), showed only a 2% effectiveness loss with an aggressivity of 1,000, and even showed some effectiveness improvement with less aggressive levels of reduction. Later experiments by Slonim and Tishby [2001] have confirmed the potential of supervised clustering methods for term extraction.

10 Some term selection methods, such as wrapper methods, also address the problem of redundancy.

5.5.2. Latent Semantic Indexing. Latent semantic indexing (LSI—[Deerwester et al. 1990]) is a DR technique developed in IR in order to address the problems deriving from the use of synonymous, near-synonymous, and polysemous words as dimensions of document representations. This technique compresses document vectors into vectors of a lower-dimensional space whose dimensions are obtained as combinations of the original dimensions by looking at their patterns of co-occurrence. In practice, LSI infers the dependence among the original terms from a corpus and “wires” this dependence into the newly obtained, independent dimensions. The function mapping original vectors into new vectors is obtained by applying a singular value decomposition to the matrix formed by the original document vectors. In TC this technique is applied by deriving the mapping function from the training set and then applying it to training and test documents alike.
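A minimal sketch of how such an LSI mapping could be derived from the training matrix and applied to new documents, using a truncated singular value decomposition from NumPy; the variable names and the choice of k dimensions are illustrative assumptions, not the procedure of any specific paper cited here.

```python
import numpy as np

def lsi_fit(X_train, k):
    """Derive an LSI mapping from the |Tr| x |T| training matrix X_train.

    Returns a |T| x k projection matrix built from the top-k right
    singular vectors of the training matrix.
    """
    # Full SVD here; for large collections a truncated/sparse SVD would be used.
    _, _, vt = np.linalg.svd(X_train, full_matrices=False)
    return vt[:k].T                     # shape |T| x k

def lsi_transform(X, projection):
    """Map original document vectors (rows of X) into the k-dimensional LSI space."""
    return X @ projection

# toy usage: 4 documents, 5 terms, 2 latent dimensions
X_tr = np.array([[1, 1, 0, 0, 0],
                 [1, 0, 1, 0, 0],
                 [0, 0, 0, 1, 1],
                 [0, 0, 1, 1, 0]], dtype=float)
P = lsi_fit(X_tr, k=2)
print(lsi_transform(X_tr, P))
```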

One characteristic of LSI is that the newly obtained dimensions are not, unlike in term selection and term clustering, intuitively interpretable. However, they work well in bringing out the “latent” semantic structure of the vocabulary used in the corpus. For instance, Schutze et al. [1995, page 235] discussed the classification under category Demographic shifts in the U.S. with economic impact of a document that was indeed a positive test instance for the category, and that contained, among others, the quite revealing sentence The nation grew to 249.6 million people in the 1980s as more Americans left the industrial and agricultural heartlands for the South and West. The classifier decision was incorrect when local DR had been performed by χ2-based term selection retaining the top original 200 terms, but was correct when the same task was tackled by means of LSI. This well exemplifies how LSI works: the above sentence does not contain any of the 200 terms most relevant to the category selected by χ2, but quite possibly the words contained in it have concurred to produce one or more of the LSI higher-order terms that generate the document space of the category. As Schutze et al. [1995, page 230] put it, “if there is a great number of terms which all contribute a small amount of critical information, then the combination of evidence is a major problem for a term-based classifier.” A drawback of LSI, though, is that if some original term is particularly good in itself at discriminating a category, that discrimination power may be lost in the new vector space.

Wiener et al. [1995] used LSI in two alternative ways: (i) for local DR, thus creating several category-specific LSI representations, and (ii) for global DR, thus creating a single LSI representation for the entire category set. Their experiments showed the former approach to perform better than the latter, and both approaches to perform better than simple TSR based on Relevancy Score (see Table I).

Schutze et al. [1995] experimentally compared LSI-based term extraction with χ2-based TSR using three different classifier learning techniques (namely, linear discriminant analysis, logistic regression, and neural networks). Their experiments showed LSI to be far more effective than χ2 for the first two techniques, while both methods performed equally well for the neural network classifier.

For other TC works that have used LSI or similar term extraction techniques, see Hull [1994], Li and Jain [1998], Schutze [1998], Weigend et al. [1999], and Yang [1995].

6. INDUCTIVE CONSTRUCTION OF TEXT CLASSIFIERS

The inductive construction of text classifiers has been tackled in a variety of ways. Here we will deal only with the methods that have been most popular in TC, but we will also briefly mention the existence of alternative, less standard approaches.

We start by discussing the general form that a text classifier has. Let us recall from Section 2.4 that there are two alternative ways of viewing classification: “hard” (fully automated) classification and ranking (semiautomated) classification.

The inductive construction of a ranking classifier for category ci ∈ C usually consists in the definition of a function CSVi : D → [0, 1] that, given a document dj, returns a categorization status value for it, that is, a number between 0 and 1 which, roughly speaking, represents the evidence for the fact that dj ∈ ci. Documents are then ranked according to their CSVi value. This works for “document-ranking TC”; “category-ranking TC” is usually tackled by ranking, for a given document dj, its CSVi scores for the different categories in C = {c1, . . . , c|C|}.

The CSVi function takes up different meanings according to the learning method used: for instance, in the “Naïve Bayes” approach of Section 6.2 CSVi(dj) is defined in terms of a probability, whereas in the “Rocchio” approach discussed in Section 6.7 CSVi(dj) is a measure of vector closeness in |T|-dimensional space.

The construction of a “hard” classifier may follow two alternative paths. The former consists in the definition of a function CSVi : D → {T, F}. The latter consists instead in the definition of a function CSVi : D → [0, 1], analogous to the one used for ranking classification, followed by the definition of a threshold τi such that CSVi(dj) ≥ τi is interpreted as T while CSVi(dj) < τi is interpreted as F.11

The definition of thresholds will be the topic of Section 6.1. In Sections 6.2 to 6.12 we will instead concentrate on the definition of CSVi, discussing a number of approaches that have been applied in the TC literature. In general we will assume we are dealing with “hard” classification; it will be evident from the context how and whether the approaches can be adapted to ranking classification. The presentation of the algorithms will be mostly qualitative rather than quantitative, that is, will focus on the methods for classifier learning rather than on the effectiveness and efficiency of the classifiers built by means of them; this will instead be the focus of Section 7.

6.1. Determining Thresholds

There are various policies for determining the threshold τi, also depending on the constraints imposed by the application. The most important distinction is whether the threshold is derived analytically or experimentally.

The former method is possible only in the presence of a theoretical result that indicates how to compute the threshold that maximizes the expected value of the effectiveness function [Lewis 1995a]. This is typical of classifiers that output probability estimates of the membership of dj in ci (see Section 6.2) and whose effectiveness is computed by decision-theoretic measures such as utility (see Section 7.1.3); we thus defer the discussion of this policy (which is called probability thresholding in Lewis [1995a]) to Section 7.1.3.

When such a theoretical result is not known, one has to revert to the latter method, which consists in testing different values for τi on a validation set and choosing the value which maximizes effectiveness. We call this policy CSV thresholding [Cohen and Singer 1999; Schapire et al. 1998; Wiener et al. 1995]; it is also called Scut in Yang [1999]. Different τi’s are typically chosen for the different ci’s.

11 Alternative methods are possible, such as training a classifier for which some standard, predefined value such as 0 is the threshold. For ease of exposition we will not discuss them.
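For concreteness, CSV thresholding might be implemented as in the following hypothetical Python sketch, which picks, for each category, the τi that maximizes F1 on a validation set; the effectiveness measure and the candidate thresholds are arbitrary illustrative choices.

```python
def f1(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def choose_threshold(csv_scores, true_labels, candidates=None):
    """Pick the threshold tau_i maximizing F1 on a validation set.

    csv_scores:  CSV_i(d_j) for each validation document d_j
    true_labels: True if d_j actually belongs to c_i, else False
    """
    if candidates is None:
        candidates = sorted(set(csv_scores))
    best_tau, best_f1 = 0.0, -1.0
    for tau in candidates:
        tp = sum(s >= tau and y for s, y in zip(csv_scores, true_labels))
        fp = sum(s >= tau and not y for s, y in zip(csv_scores, true_labels))
        fn = sum(s < tau and y for s, y in zip(csv_scores, true_labels))
        score = f1(tp, fp, fn)
        if score > best_f1:
            best_tau, best_f1 = tau, score
    return best_tau

# toy usage
print(choose_threshold([0.9, 0.7, 0.4, 0.2], [True, True, False, False]))
```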

A second, popular experimental policy is proportional thresholding [Iwayama and Tokunaga 1995; Larkey 1998; Lewis 1992a; Lewis and Ringuette 1994; Wiener et al. 1995], also called Pcut in Yang [1999]. This policy consists in choosing the value of τi for which gVa(ci) is closest to gTr(ci), and embodies the principle that the same percentage of documents of both training and test set should be classified under ci. For obvious reasons, this policy does not lend itself to document-pivoted TC.

Sometimes, depending on the application, a fixed thresholding policy (a.k.a. “k-per-doc” thresholding [Lewis 1992a] or Rcut [Yang 1999]) is applied, whereby it is stipulated that a fixed number k of categories, equal for all dj’s, are to be assigned to each document dj. This is often used, for instance, in applications of TC to automated document indexing [Field 1975; Lam et al. 1999]. Strictly speaking, however, this is not a thresholding policy in the sense defined at the beginning of Section 6, as it might happen that d′ is classified under ci, d′′ is not, and CSVi(d′) < CSVi(d′′). Quite clearly, this policy is mostly at home with document-pivoted TC. However, it suffers from a certain coarseness, as the fact that k is equal for all documents (nor could this be otherwise) allows no fine-tuning.

In his experiments Lewis [1992a] found the proportional policy to be superior to probability thresholding when microaveraged effectiveness was tested but slightly inferior with macroaveraging (see Section 7.1.1). Yang [1999] found instead CSV thresholding to be superior to proportional thresholding (possibly due to her category-specific optimization on a validation set), and found fixed thresholding to be consistently inferior to the other two policies. The fact that these latter results have been obtained across different classifiers no doubt reinforces them.

In general, aside from the considerations above, the choice of the thresholding policy may also be influenced by the application; for instance, in applying a text classifier to document indexing for Boolean systems a fixed thresholding policy might be chosen, while a proportional or CSV thresholding method might be chosen for Web page classification under hierarchical catalogues.

6.2. Probabilistic Classifiers

Probabilistic classifiers (see Lewis [1998] for a thorough discussion) view CSVi(dj) in terms of P(ci | d⃗j), that is, the probability that a document represented by a vector d⃗j = ⟨w1j, . . . , w|T|j⟩ of (binary or weighted) terms belongs to ci, and compute this probability by an application of Bayes’ theorem, given by

    P(ci | d⃗j) = P(ci) P(d⃗j | ci) / P(d⃗j).    (3)

In (3) the event space is the space of documents: P(d⃗j) is thus the probability that a randomly picked document has vector d⃗j as its representation, and P(ci) the probability that a randomly picked document belongs to ci.

The estimation of P(d⃗j | ci) in (3) is problematic, since the number of possible vectors d⃗j is too high (the same holds for P(d⃗j), but for reasons that will be clear shortly this will not concern us). In order to alleviate this problem it is common to make the assumption that any two coordinates of the document vector are, when viewed as random variables, statistically independent of each other; this independence assumption is encoded by the equation

    P(d⃗j | ci) = ∏_{k=1}^{|T|} P(wkj | ci).    (4)

Probabilistic classifiers that use this assumption are called Naïve Bayes classifiers, and account for most of the probabilistic approaches to TC in the literature (see Joachims [1998]; Koller and Sahami [1997]; Larkey and Croft [1996]; Lewis [1992a]; Lewis and Gale [1994]; Li and Jain [1998]; Robertson and Harding [1984]). The “naïve” character of the classifier is due to the fact that usually this assumption is, quite obviously, not verified in practice.

One of the best-known Naïve Bayes approaches is the binary independence classifier [Robertson and Sparck Jones 1976], which results from using binary-valued vector representations for documents. In this case, if we write pki as short for P(wkx = 1 | ci), the P(wkj | ci) factors of (4) may be written as

    P(wkj | ci) = pki^wkj (1 − pki)^(1 − wkj) = (pki / (1 − pki))^wkj (1 − pki).    (5)

We may further observe that in TC the document space is partitioned into two categories,12 ci and its complement c̄i, such that P(c̄i | d⃗j) = 1 − P(ci | d⃗j). If we plug (4) and (5) into (3) and take logs we obtain

    log P(ci | d⃗j) = log P(ci) + ∑_{k=1}^{|T|} wkj log [pki / (1 − pki)] + ∑_{k=1}^{|T|} log (1 − pki) − log P(d⃗j),    (6)

    log (1 − P(ci | d⃗j)) = log (1 − P(ci)) + ∑_{k=1}^{|T|} wkj log [p̄ki / (1 − p̄ki)] + ∑_{k=1}^{|T|} log (1 − p̄ki) − log P(d⃗j),    (7)

12 Cooper [1995] has pointed out that in this case the full independence assumption of (4) is not actually made in the Naïve Bayes classifier; the assumption needed here is instead the weaker linked dependence assumption, which may be written as P(d⃗j | ci) / P(d⃗j | c̄i) = ∏_{k=1}^{|T|} P(wkj | ci) / P(wkj | c̄i).


where we write p̄ki as short for P(wkx = 1 | c̄i). We may convert (6) and (7) into a single equation by subtracting componentwise (7) from (6), thus obtaining

    log [P(ci | d⃗j) / (1 − P(ci | d⃗j))] = log [P(ci) / (1 − P(ci))] + ∑_{k=1}^{|T|} wkj log [pki (1 − p̄ki) / (p̄ki (1 − pki))] + ∑_{k=1}^{|T|} log [(1 − pki) / (1 − p̄ki)].    (8)

Note that P(ci | d⃗j) / (1 − P(ci | d⃗j)) is an increasing monotonic function of P(ci | d⃗j), and may thus be used directly as CSVi(dj). Note also that log [P(ci) / (1 − P(ci))] and ∑_{k=1}^{|T|} log [(1 − pki) / (1 − p̄ki)] are constant for all documents, and may thus be disregarded.13 Defining a classifier for category ci thus basically requires estimating the 2|T| parameters {p1i, p̄1i, . . . , p|T|i, p̄|T|i} from the training data, which may be done in the obvious way. Note that in general the classification of a given document does not require one to compute a sum of |T| factors, as the presence of ∑_{k=1}^{|T|} wkj log [pki (1 − p̄ki) / (p̄ki (1 − pki))] would imply; in fact, all the factors for which wkj = 0 may be disregarded, and this accounts for the vast majority of them, since document vectors are usually very sparse.

13 This is not true, however, if the “fixed thresholding” method of Section 6.1 is adopted. In fact, for a fixed document dj the first and third factor in the formula above are different for different categories, and may therefore influence the choice of the categories under which to file dj.
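The following hypothetical Python sketch illustrates the binary independence classifier just derived: the 2|T| parameters pki and p̄ki are estimated from the training set (with add-one smoothing, an assumption introduced here to avoid zero probabilities), and the CSV is the document-dependent part of the log-odds in (8), exploiting the sparsity of the document vectors.

```python
import math
from collections import Counter

def train_binary_nb(docs, labels):
    """Estimate p_ki = P(w_k=1 | c_i) and pbar_ki = P(w_k=1 | not-c_i).

    docs: list of sets of terms; labels: booleans (document in c_i or not).
    Add-one (Laplace) smoothing is used so no probability is 0 or 1.
    """
    pos = [d for d, y in zip(docs, labels) if y]
    neg = [d for d, y in zip(docs, labels) if not y]
    vocab = set().union(*docs)
    pos_df, neg_df = Counter(), Counter()
    for d in pos:
        pos_df.update(d)
    for d in neg:
        neg_df.update(d)
    p = {t: (pos_df[t] + 1) / (len(pos) + 2) for t in vocab}
    pbar = {t: (neg_df[t] + 1) / (len(neg) + 2) for t in vocab}
    return p, pbar

def csv_log_odds(doc, p, pbar):
    """Sum, over terms present in the document, of log[p(1-pbar)/(pbar(1-p))].

    The document-independent terms of (8) are dropped, as discussed above.
    """
    return sum(math.log(p[t] * (1 - pbar[t]) / (pbar[t] * (1 - p[t])))
               for t in doc if t in p)

# toy usage
docs = [{"goal", "match"}, {"match", "referee"}, {"stocks", "market"}]
labels = [True, True, False]
p, pbar = train_binary_nb(docs, labels)
print(csv_log_odds({"goal", "market"}, p, pbar))
```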

The method we have illustrated is just one of the many variants of the Naïve Bayes approach, the common denominator of which is (4). A recent paper by Lewis [1998] is an excellent roadmap on the various directions that research on Naïve Bayes classifiers has taken; among these are the ones aiming

—to relax the constraint that document vectors should be binary-valued. This looks natural, given that weighted indexing techniques (see Fuhr [1989]; Salton and Buckley [1988]) accounting for the “importance” of tk for dj play a key role in IR.

—to introduce document length normalization. The value of log [P(ci | d⃗j) / (1 − P(ci | d⃗j))] tends to be more extreme (i.e., very high or very low) for long documents (i.e., documents such that wkj = 1 for many values of k), irrespectively of their semantic relatedness to ci, thus calling for length normalization. Taking length into account is easy in nonprobabilistic approaches to classification (see Section 6.7), but is problematic in probabilistic ones (see Lewis [1998], Section 5). One possible answer is to switch from an interpretation of Naïve Bayes in which documents are events to one in which terms are events [Baker and McCallum 1998; McCallum et al. 1998; Chakrabarti et al. 1998a; Guthrie et al. 1994]. This accounts for document length naturally but, as noted by Lewis [1998], has the drawback that different occurrences of the same word within the same document are viewed as independent, an assumption even more implausible than (4).

—to relax the independence assumption. This may be the hardest route to follow, since this produces classifiers of higher computational cost and characterized by harder parameter estimation problems [Koller and Sahami 1997]. Earlier efforts in this direction within probabilistic text search (e.g., van Rijsbergen [1977]) have not shown the performance improvements that were hoped for. Recently, the fact that the binary independence assumption seldom harms effectiveness has also been given some theoretical justification [Domingos and Pazzani 1997].

The quotation of text search in the last paragraph is not coincidental. Unlike other types of classifiers, the literature on probabilistic classifiers is inextricably intertwined with that on probabilistic search systems (see Crestani et al. [1998] for a review), since these latter attempt to determine the probability that a document falls in the category denoted by the query, and since they are the only search systems that take relevance feedback, a notion essentially involving supervised learning, as central.

Fig. 2. A decision tree equivalent to the DNF rule of Figure 1. Edges are labeled by terms and leaves are labeled by categories (underlining denotes negation).

6.3. Decision Tree Classifiers

Probabilistic methods are quantitative (i.e., numeric) in nature, and as such have sometimes been criticized since, effective as they may be, they are not easily interpretable by humans. A class of algorithms that do not suffer from this problem are symbolic (i.e., nonnumeric) algorithms, among which inductive rule learners (which we will discuss in Section 6.4) and decision tree learners are the most important examples.

A decision tree (DT) text classifier (see Mitchell [1996], Chapter 3) is a tree in which internal nodes are labeled by terms, branches departing from them are labeled by tests on the weight that the term has in the test document, and leaves are labeled by categories. Such a classifier categorizes a test document dj by recursively testing for the weights that the terms labeling the internal nodes have in vector d⃗j, until a leaf node is reached; the label of this node is then assigned to dj. Most such classifiers use binary document representations, and thus consist of binary trees. An example DT is illustrated in Figure 2.

There are a number of standard packages for DT learning, and most DT approaches to TC have made use of one such package. Among the most popular ones are ID3 (used by Fuhr et al. [1991]), C4.5 (used by Cohen and Hirsh [1998], Cohen and Singer [1999], Joachims [1998], and Lewis and Catlett [1994]), and C5 (used by Li and Jain [1998]). TC efforts based on experimental DT packages include Dumais et al. [1998], Lewis and Ringuette [1994], and Weiss et al. [1999].

A possible method for learning a DT for category ci consists in a “divide and conquer” strategy of (i) checking whether all the training examples have the same label (either ci or c̄i); (ii) if not, selecting a term tk, partitioning Tr into classes of documents that have the same value for tk, and placing each such class in a separate subtree. The process is recursively repeated on the subtrees until each leaf of the tree so generated contains training examples assigned to the same category ci, which is then chosen as the label for the leaf. The key step is the choice of the term tk on which to operate the partition, a choice which is generally made according to an information gain or entropy criterion. However, such a “fully grown” tree may be prone to overfitting, as some branches may be too specific to the training data. Most DT learning methods thus include a method for growing the tree and one for pruning it, that is, for removing the overly specific branches. Variations on this basic schema for DT learning abound [Mitchell 1996, Section 3].
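A compact, purely illustrative Python sketch of this divide-and-conquer scheme on binary document representations, using information gain as the term-selection criterion and omitting the pruning phase; it is not drawn from any of the packages cited above.

```python
import math

def entropy(labels):
    pos = sum(labels)
    if pos in (0, len(labels)):
        return 0.0
    p = pos / len(labels)
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def best_term(docs, labels, terms):
    """Pick the term whose presence/absence split maximizes information gain."""
    base, best, best_gain = entropy(labels), None, -1.0
    for t in terms:
        with_t = [y for d, y in zip(docs, labels) if t in d]
        without = [y for d, y in zip(docs, labels) if t not in d]
        if not with_t or not without:
            continue
        gain = base - (len(with_t) * entropy(with_t)
                       + len(without) * entropy(without)) / len(labels)
        if gain > best_gain:
            best, best_gain = t, gain
    return best

def grow_tree(docs, labels, terms):
    """Recursively grow an (unpruned) binary decision tree.

    docs: list of sets of terms; labels: booleans; terms: set of candidate terms.
    Leaves are booleans (majority label); internal nodes are (term, yes, no) tuples.
    """
    if len(set(labels)) == 1 or not terms:
        return sum(labels) * 2 >= len(labels)
    t = best_term(docs, labels, terms)
    if t is None:
        return sum(labels) * 2 >= len(labels)
    left = [(d, y) for d, y in zip(docs, labels) if t in d]
    right = [(d, y) for d, y in zip(docs, labels) if t not in d]
    rest = terms - {t}
    return (t,
            grow_tree([d for d, _ in left], [y for _, y in left], rest),
            grow_tree([d for d, _ in right], [y for _, y in right], rest))

def classify(tree, doc):
    while isinstance(tree, tuple):
        t, yes, no = tree
        tree = yes if t in doc else no
    return tree

# toy usage
docs = [{"wheat"}, {"wheat", "farm"}, {"stocks"}, {"market", "stocks"}]
labels = [True, True, False, False]
tree = grow_tree(docs, labels, {"wheat", "farm", "stocks", "market"})
print(classify(tree, {"wheat", "price"}))
```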

DT text classifiers have been used either as the main classification tool [Fuhr et al. 1991; Lewis and Catlett 1994; Lewis and Ringuette 1994], or as baseline classifiers [Cohen and Singer 1999; Joachims 1998], or as members of classifier committees [Li and Jain 1998; Schapire and Singer 2000; Weiss et al. 1999].

6.4. Decision Rule Classifiers

A classifier for category ci built by an inductive rule learning method consists of a DNF rule, that is, of a conditional rule with a premise in disjunctive normal form (DNF), of the type illustrated in Figure 1.14 The literals (i.e., possibly negated keywords) in the premise denote the presence (nonnegated keyword) or absence (negated keyword) of the keyword in the test document dj, while the clause head denotes the decision to classify dj under ci. DNF rules are similar to DTs in that they can encode any Boolean function. However, an advantage of DNF rule learners is that they tend to generate more compact classifiers than DT learners.

Rule learning methods usually attempt to select from all the possible covering rules (i.e., rules that correctly classify all the training examples) the “best” one according to some minimality criterion. While DTs are typically built by a top-down, “divide-and-conquer” strategy, DNF rules are often built in a bottom-up fashion. Initially, every training example dj is viewed as a clause η1, . . . , ηn → γi, where η1, . . . , ηn are the terms contained in dj and γi equals ci or c̄i according to whether dj is a positive or negative example of ci. This set of clauses is already a DNF classifier for ci, but obviously scores high in terms of overfitting. The learner applies then a process of generalization in which the rule is simplified through a series of modifications (e.g., removing premises from clauses, or merging clauses) that maximize its compactness while at the same time not affecting the “covering” property of the classifier. At the end of this process, a “pruning” phase similar in spirit to that employed in DTs is applied, where the ability to correctly classify all the training examples is traded for more generality.

14 Many inductive rule learning algorithms build decision lists (i.e., arbitrarily nested if-then-else clauses) instead of DNF rules; since the former may always be rewritten as the latter, we will disregard the issue.

DNF rule learners vary widely in terms of the methods, heuristics and criteria employed for generalization and pruning. Among the DNF rule learners that have been applied to TC are CHARADE [Moulinier and Ganascia 1996], DL-ESC [Li and Yamanishi 1999], RIPPER [Cohen 1995a; Cohen and Hirsh 1998; Cohen and Singer 1999], SCAR [Moulinier et al. 1996], and SWAP-1 [Apte 1994].

While the methods above use rules of propositional logic (PL), research has also been carried out using rules of first-order logic (FOL), obtainable through the use of inductive logic programming methods. Cohen [1995a] has extensively compared PL and FOL learning in TC (for instance, comparing the PL learner RIPPER with its FOL version FLIPPER), and has found that the additional representational power of FOL brings about only modest benefits.

6.5. Regression Methods

Various TC efforts have used regression models (see Fuhr and Pfeifer [1994]; Ittner et al. [1995]; Lewis and Gale [1994]; Schutze et al. [1995]). Regression denotes the approximation of a real-valued (rather than binary, as in the case of classification) target function Φ̆ by means of a function Φ that fits the training data [Mitchell 1996, page 236]. Here we will describe one such model, the Linear Least-Squares Fit (LLSF) applied to TC by Yang and Chute [1994]. In LLSF, each document dj has two vectors associated to it: an input vector I(dj) of |T| weighted terms, and an output vector O(dj) of |C| weights representing the categories (the weights for this latter vector are binary for training documents, and are nonbinary CSVs for test documents). Classification may thus be seen as the task of determining an output vector O(dj) for test document dj, given its input vector I(dj); hence, building a classifier boils down to computing a |C| × |T| matrix M̂ such that M̂ I(dj) = O(dj).

LLSF computes the matrix from the training data by computing a linear least-squares fit that minimizes the error on the training set according to the formula M̂ = arg min_M ||MI − O||_F, where arg min_M(x) stands as usual for the M for which x is minimum, ||V||_F = √(∑_{i=1}^{|C|} ∑_{j=1}^{|T|} v_ij²) represents the so-called Frobenius norm of a |C| × |T| matrix, I is the |T| × |Tr| matrix whose columns are the input vectors of the training documents, and O is the |C| × |Tr| matrix whose columns are the output vectors of the training documents. The M̂ matrix is usually computed by performing a singular value decomposition on the training set, and its generic entry m̂ik represents the degree of association between category ci and term tk.

The experiments of Yang and Chute [1994] and Yang and Liu [1999] indicate that LLSF is one of the most effective text classifiers known to date. One of its disadvantages, though, is that the cost of computing the M̂ matrix is much higher than that of many other competitors in the TC arena.
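A minimal illustration of the LLSF fit, using NumPy’s least-squares routine rather than the SVD-based computation mentioned above; the matrix shapes follow the description in the text, and all names are hypothetical.

```python
import numpy as np

def llsf_fit(I, O):
    """Compute the matrix minimizing ||M I - O||_F.

    I: |T| x |Tr| matrix of input (term) vectors, one column per training doc.
    O: |C| x |Tr| matrix of output (category) vectors, one column per doc.
    Returns the |C| x |T| association matrix.
    """
    # Equivalent to solving I^T M^T = O^T in the least-squares sense.
    M_t, *_ = np.linalg.lstsq(I.T, O.T, rcond=None)
    return M_t.T

def llsf_classify(M, input_vector):
    """Output vector of category weights (the CSVs) for a test document."""
    return M @ input_vector

# toy usage: 4 terms, 3 training documents, 2 categories
I = np.array([[1., 0., 0.], [1., 1., 0.], [0., 1., 1.], [0., 0., 1.]])
O = np.array([[1., 1., 0.], [0., 0., 1.]])
M = llsf_fit(I, O)
print(llsf_classify(M, np.array([1., 1., 0., 0.])))
```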

6.6. On-Line Methods

A linear classifier for category ci is a vector c⃗i = ⟨w1i, . . . , w|T|i⟩ belonging to the same |T|-dimensional space in which documents are also represented, and such that CSVi(dj) corresponds to the dot product ∑_{k=1}^{|T|} wki wkj of d⃗j and c⃗i. Note that when both classifier and document weights are cosine-normalized (see (2)), the dot product between the two vectors corresponds to their cosine similarity, that is:

    S(ci, dj) = cos(α) = ∑_{k=1}^{|T|} wki · wkj / (√(∑_{k=1}^{|T|} wki²) · √(∑_{k=1}^{|T|} wkj²)),

which represents the cosine of the angle α that separates the two vectors. This is the similarity measure between query and document computed by standard vector-space IR engines, which means in turn that once a linear classifier has been built, classification can be performed by invoking such an engine. Practically all search engines have a dot product flavor to them, and can therefore be adapted to doing TC with a linear classifier.

Methods for learning linear classifiers are often partitioned in two broad classes, batch methods and on-line methods.

Batch methods build a classifier by analyzing the training set all at once. Within the TC literature, one example of a batch method is linear discriminant analysis, a model of the stochastic dependence between terms that relies on the covariance matrices of the categories [Hull 1994; Schutze et al. 1995]. However, the foremost example of a batch method is the Rocchio method; because of its importance in the TC literature, this will be discussed separately in Section 6.7. In this section we will instead concentrate on on-line methods.

On-line (a.k.a. incremental) methods build a classifier soon after examining the first training document, and incrementally refine it as they examine new ones. This may be an advantage in the applications in which Tr is not available in its entirety from the start, or in which the “meaning” of the category may change in time, as for example, in adaptive filtering. This is also apt to applications (e.g., semiautomated classification, adaptive filtering) in which we may expect the user of a classifier to provide feedback on how test documents have been classified, as in this case further training may be performed during the operating phase by exploiting user feedback.

A simple on-line method is the perceptron algorithm, first applied to TC by Schutze et al. [1995] and Wiener et al. [1995], and subsequently used by Dagan et al. [1997] and Ng et al. [1997]. In this algorithm, the classifier for ci is first initialized by setting all weights wki to the same positive value. When a training example dj (represented by a vector d⃗j of binary weights) is examined, the classifier built so far classifies it. If the result of the classification is correct, nothing is done, while if it is wrong, the weights of the classifier are modified: if dj was a positive example of ci, then the weights wki of “active terms” (i.e., the terms tk such that wkj = 1) are “promoted” by increasing them by a fixed quantity α > 0 (called learning rate), while if dj was a negative example of ci then the same weights are “demoted” by decreasing them by α. Note that when the classifier has reached a reasonable level of effectiveness, the fact that a weight wki is very low means that tk has negatively contributed to the classification process so far, and may thus be discarded from the representation. We may then see the perceptron algorithm (as all other incremental learning methods) as allowing for a sort of “on-the-fly term space reduction” [Dagan et al. 1997, Section 4.4]. The perceptron classifier has shown a good effectiveness in all the experiments quoted above.
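A hypothetical sketch of this on-line scheme for a single category ci, with binary document vectors represented as sets of active terms; the initial weight, learning rate, and decision threshold are arbitrary illustrative choices.

```python
def train_perceptron(docs, labels, vocabulary, alpha=0.5, init=1.0, threshold=0.0):
    """On-line perceptron training for one category.

    docs: list of sets of active terms (binary document vectors).
    labels: True for positive examples of c_i, False for negative ones.
    Weights of active terms are promoted (+alpha) on false negatives and
    demoted (-alpha) on false positives; correct decisions leave them unchanged.
    """
    w = {t: init for t in vocabulary}
    for doc, y in zip(docs, labels):
        score = sum(w.get(t, 0.0) for t in doc)
        predicted = score > threshold
        if predicted == y:
            continue
        delta = alpha if y else -alpha
        for t in doc:
            if t in w:
                w[t] += delta
    return w

# toy usage
vocab = {"goal", "match", "stocks", "market"}
docs = [{"goal", "match"}, {"stocks", "market"}, {"match"}, {"market"}]
labels = [True, False, True, False]
print(train_perceptron(docs, labels, vocab))
```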

The perceptron is an additive weight-updating algorithm. A multiplicative variant of it is POSITIVE WINNOW [Dagan et al. 1997], which differs from the perceptron because two different constants α1 > 1 and 0 < α2 < 1 are used for promoting and demoting weights, respectively, and because promotion and demotion are achieved by multiplying, instead of adding, by α1 and α2. BALANCED WINNOW [Dagan et al. 1997] is a further variant of POSITIVE WINNOW, in which the classifier consists of two weights w+ki and w−ki for each term tk; the final weight wki used in computing the dot product is the difference w+ki − w−ki. Following the misclassification of a positive instance, active terms have their w+ki weight promoted and their w−ki weight demoted, whereas in the case of a negative instance it is w+ki that gets demoted while w−ki gets promoted (for the rest, promotions and demotions are as in POSITIVE WINNOW). BALANCED WINNOW allows negative wki weights, while in the perceptron and in POSITIVE WINNOW the wki weights are always positive. In experiments conducted by Dagan et al. [1997], POSITIVE WINNOW showed a better effectiveness than the perceptron but was in turn outperformed by (Dagan et al.’s own version of) BALANCED WINNOW.

Other on-line methods for building text classifiers are WIDROW-HOFF, a refinement of it called EXPONENTIATED GRADIENT (both applied for the first time to TC in [Lewis et al. 1996]) and SLEEPING EXPERTS [Cohen and Singer 1999], a version of BALANCED WINNOW. While the first is an additive weight-updating algorithm, the second and third are multiplicative. Key differences with the previously described algorithms are that these three algorithms (i) update the classifier not only after misclassifying a training example, but also after classifying it correctly, and (ii) update the weights corresponding to all terms (instead of just active ones).

Linear classifiers lend themselves to both category-pivoted and document-pivoted TC. For the former the classifier c⃗i is used, in a standard search engine, as a query against the set of test documents, while for the latter the vector d⃗j representing the test document is used as a query against the set of classifiers {c⃗1, . . . , c⃗|C|}.

6.7. The Rocchio Method

Some linear classifiers consist of an explicit profile (or prototypical document) of the category. This has obvious advantages in terms of interpretability, as such a profile is more readily understandable by a human than, say, a neural network classifier. Learning a linear classifier is often preceded by local TSR; in this case, a profile of ci is a weighted list of the terms whose presence or absence is most useful for discriminating ci.

The Rocchio method is used for inducing linear, profile-style classifiers. It relies on an adaptation to TC of the well-known Rocchio’s formula for relevance feedback in the vector-space model, and it is perhaps the only TC method rooted in the IR tradition rather than in the ML one. This adaptation was first proposed by Hull [1994], and has been used by many authors since then, either as an object of research in its own right [Ittner et al. 1995; Joachims 1997; Sable and Hatzivassiloglou 2000; Schapire et al. 1998; Singhal et al. 1997], or as a baseline classifier [Cohen and Singer 1999; Galavotti et al. 2000; Joachims 1998; Lewis et al. 1996; Schapire and Singer 2000; Schutze et al. 1995], or as a member of a classifier committee [Larkey and Croft 1996] (see Section 6.11).

Rocchio’s method computes a classifier c⃗i = ⟨w1i, . . . , w|T|i⟩ for category ci by means of the formula

    wki = β · (∑_{dj ∈ POSi} wkj) / |POSi| − γ · (∑_{dj ∈ NEGi} wkj) / |NEGi|,

where wkj is the weight of tk in document dj, POSi = {dj ∈ Tr | Φ̆(dj, ci) = T}, and NEGi = {dj ∈ Tr | Φ̆(dj, ci) = F}. In this formula, β and γ are control parameters that allow setting the relative importance of positive and negative examples. For instance, if β is set to 1 and γ to 0 (as in Dumais et al. [1998]; Hull [1994]; Joachims [1998]; Schutze et al. [1995]), the profile of ci is the centroid of its positive training examples. A classifier built by means of the Rocchio method rewards the closeness of a test document to the centroid of the positive training examples, and its distance from the centroid of the negative training examples. The role of negative examples is usually deemphasized, by setting β to a high value and γ to a low one (e.g., Cohen and Singer [1999], Ittner et al. [1995], and Joachims [1997] use β = 16 and γ = 4).

This method is quite easy to implement, and is also quite efficient, since learning a classifier basically comes down to averaging weights. In terms of effectiveness, instead, a drawback is that if the documents in the category tend to occur in disjoint clusters (e.g., a set of newspaper articles labeled with the Sports category and dealing with either boxing or rock-climbing), such a classifier may miss most of them, as the centroid of these documents may fall outside all of these clusters (see Figure 3(a)). More generally, a classifier built by the Rocchio method, as all linear classifiers, has the disadvantage that it divides the space of documents linearly. This situation is graphically depicted in Figure 3(a), where documents are classified within ci if and only if they fall within the circle. Note that even most of the positive training examples would not be classified correctly by the classifier.
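For concreteness, a hypothetical sketch of the Rocchio formula above, operating on dense NumPy term-weight vectors; β and γ are the control parameters just discussed, and the default values shown are one of the settings mentioned in the text.

```python
import numpy as np

def rocchio_profile(pos_docs, neg_docs, beta=16.0, gamma=4.0):
    """Compute the profile vector of category c_i from its training examples.

    pos_docs, neg_docs: 2-D arrays whose rows are |T|-dimensional term-weight
    vectors of the positive (POS_i) and negative (NEG_i) training examples.
    """
    profile = beta * pos_docs.mean(axis=0)
    if len(neg_docs):
        profile -= gamma * neg_docs.mean(axis=0)
    return profile

def csv(profile, doc_vector):
    """Dot-product CSV of a test document against the profile."""
    return float(profile @ doc_vector)

# toy usage: 3 terms
pos = np.array([[1.0, 0.5, 0.0], [0.8, 0.7, 0.1]])
neg = np.array([[0.0, 0.1, 0.9]])
c_i = rocchio_profile(pos, neg)
print(csv(c_i, np.array([0.9, 0.6, 0.0])))
```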

6.7.1. Enhancements to the Basic Rocchio Framework. One issue in the application of the Rocchio formula to profile extraction is whether the set NEGi should be considered in its entirety, or whether a well-chosen sample of it, such as the set NPOSi of near-positives (defined as “the most positive among the negative training examples”), should be selected from it, yielding

    wki = β · (∑_{dj ∈ POSi} wkj) / |POSi| − γ · (∑_{dj ∈ NPOSi} wkj) / |NPOSi|.

The (∑_{dj ∈ NPOSi} wkj) / |NPOSi| factor is more significant than (∑_{dj ∈ NEGi} wkj) / |NEGi|, since near-positives are the most difficult documents to tell apart from the positives. Using near-positives corresponds to the query zoning method proposed for IR by Singhal et al. [1997]. This method originates from the observation that, when the original Rocchio formula is used for relevance feedback in IR, near-positives tend to be used rather than generic negatives, as the documents on which user judgments are available are among the ones that had scored highest in the previous ranking. Early applications of the Rocchio formula to TC (e.g., Hull [1994]; Ittner et al. [1995]) generally did not make a distinction between near-positives and generic negatives. In order to select the near-positives Schapire et al. [1998] issue a query, consisting of the centroid of the positive training examples, against a document base consisting of the negative training examples; the top-ranked ones are the most similar to this centroid, and are then the near-positives. Wiener et al. [1995] instead equate the near-positives of ci to the positive examples of the sibling categories of ci, as in the application they work on (TC with hierarchically organized category sets) the notion of a “sibling category of ci” is well defined. A similar policy is also adopted by Ng et al. [1997], Ruiz and Srinivasan [1999], and Weigend et al. [1999].

Fig. 3. A comparison between the TC behavior of (a) the Rocchio classifier, and (b) the k-NN classifier. Small crosses and circles denote positive and negative training instances, respectively. The big circles denote the “influence area” of the classifier. Note that, for ease of illustration, document similarities are here viewed in terms of Euclidean distance rather than, as is more common, in terms of dot product or cosine.

By using query zoning plus other enhancements (TSR, statistical phrases, and a method called dynamic feedback optimization), Schapire et al. [1998] have found that a Rocchio classifier can achieve an effectiveness comparable to that of a state-of-the-art ML method such as “boosting” (see Section 6.11.1) while being 60 times quicker to train. These recent results will no doubt bring about a renewed interest for the Rocchio classifier, previously considered an underperformer [Cohen and Singer 1999; Joachims 1998; Lewis et al. 1996; Schutze et al. 1995; Yang 1999].

6.8. Neural Networks

A neural network (NN) text classifier is a network of units, where the input units represent terms, the output unit(s) represent the category or categories of interest, and the weights on the edges connecting units represent dependence relations. For classifying a test document dj, its term weights wkj are loaded into the input units; the activation of these units is propagated forward through the network, and the value of the output unit(s) determines the categorization decision(s). A typical way of training NNs is backpropagation, whereby the term weights of a training document are loaded into the input units, and if a misclassification occurs the error is “backpropagated” so as to change the parameters of the network and eliminate or minimize the error.


The simplest type of NN classifier is the perceptron [Dagan et al. 1997; Ng et al. 1997], which is a linear classifier and as such has been extensively discussed in Section 6.6. Other types of linear NN classifiers implementing a form of logistic regression have also been proposed and tested by Schutze et al. [1995] and Wiener et al. [1995], yielding very good effectiveness.

A nonlinear NN [Lam and Lee 1999; Ruiz and Srinivasan 1999; Schutze et al. 1995; Weigend et al. 1999; Wiener et al. 1995; Yang and Liu 1999] is instead a network with one or more additional “layers” of units, which in TC usually represent higher-order interactions between terms that the network is able to learn. When comparative experiments relating nonlinear NNs to their linear counterparts have been performed, the former have yielded either no improvement [Schutze et al. 1995] or very small improvements [Wiener et al. 1995] over the latter.

6.9. Example-Based Classifiers

Example-based classifiers do not build an explicit, declarative representation of the category ci, but rely on the category labels attached to the training documents similar to the test document. These methods have thus been called lazy learners, since “they defer the decision on how to generalize beyond the training data until each new query instance is encountered” [Mitchell 1996, page 244].

The first application of example-based methods (a.k.a. memory-based reasoning methods) to TC is due to Creecy, Masand and colleagues [Creecy et al. 1992; Masand et al. 1992]; other examples include Joachims [1998], Lam et al. [1999], Larkey [1998], Larkey [1999], Li and Jain [1998], Yang and Pedersen [1997], and Yang and Liu [1999]. Our presentation of the example-based approach will be based on the k-NN (for “k nearest neighbors”) algorithm used by Yang [1994]. For deciding whether dj ∈ ci, k-NN looks at whether the k training documents most similar to dj also are in ci; if the answer is positive for a large enough proportion of them,

a positive decision is taken, and a negative decision is taken otherwise. Actually, Yang’s is a distance-weighted version of k-NN (see [Mitchell 1996, Section 8.2.1]), since the fact that a most similar document is in ci is weighted by its similarity with the test document. Classifying dj by means of k-NN thus comes down to computing

    CSVi(dj) = ∑_{dz ∈ Trk(dj)} RSV(dj, dz) · [[Φ̆(dz, ci)]],    (9)

where Trk(dj) is the set of the k documents dz which maximize RSV(dj, dz) and

    [[α]] = 1 if α = T,  0 if α = F.

The thresholding methods of Section 6.1 can then be used to convert the real-valued CSVi’s into binary categorization decisions. In (9), RSV(dj, dz) represents some measure of semantic relatedness between a test document dj and a training document dz; any matching function, be it probabilistic (as used by Larkey and Croft [1996]) or vector-based (as used by Yang [1994]), from a ranked IR system may be used for this purpose. The construction of a k-NN classifier also involves determining (experimentally, on a validation set) a threshold k that indicates how many top-ranked training documents have to be considered for computing CSVi(dj). Larkey and Croft [1996] used k = 20, while Yang [1994, 1999] has found 30 ≤ k ≤ 45 to yield the best effectiveness. Anyhow, various experiments have shown that increasing the value of k does not significantly degrade the performance.
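The distance-weighted k-NN computation of (9) might look as follows; the use of cosine similarity for RSV and the variable names are illustrative assumptions rather than the choices of any specific paper cited above.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_csv(test_doc, training_docs, labels, k=30):
    """CSV_i(d_j) of (9): sum of the similarities of the k nearest training
    documents that are positive examples of c_i.

    training_docs: list of {term: weight} dicts; labels: booleans for c_i.
    """
    sims = sorted(((cosine(test_doc, d), y) for d, y in zip(training_docs, labels)),
                  reverse=True)
    return sum(s for s, y in sims[:k] if y)

# toy usage
tr = [{"goal": 1.0, "match": 1.0}, {"stocks": 1.0}, {"match": 1.0, "referee": 1.0}]
ys = [True, False, True]
print(knn_csv({"goal": 1.0, "referee": 1.0}, tr, ys, k=2))
```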

Note that k-NN, unlike linear classifiers, does not divide the document space linearly, and hence does not suffer from the problem discussed at the end of Section 6.7. This is graphically depicted in Figure 3(b), where the more “local” character of k-NN with respect to Rocchio can be appreciated.


This method is naturally geared toward document-pivoted TC, since ranking the training documents for their similarity with the test document can be done once for all categories. For category-pivoted TC, one would need to store the document ranks for each test document, which is obviously clumsy; DPC is thus de facto the only reasonable way to use k-NN.

A number of different experiments (see Section 7.3) have shown k-NN to be quite effective. However, its most important drawback is its inefficiency at classification time: while, for example, with a linear classifier only a dot product needs to be computed to classify a test document, k-NN requires the entire training set to be ranked for similarity with the test document, which is much more expensive. This is a drawback of “lazy” learning methods, since they do not have a true training phase and thus defer all the computation to classification time.

6.9.1. Other Example-Based Techniques. Various example-based techniques have been used in the TC literature. For example, Cohen and Hirsh [1998] implemented an example-based classifier by extending standard relational DBMS technology with “similarity-based soft joins.” In their WHIRL system they used the scoring function

    CSVi(dj) = 1 − ∏_{dz ∈ Trk(dj)} (1 − RSV(dj, dz))^[[Φ̆(dz, ci)]]

as an alternative to (9), obtaining a small but statistically significant improvement over a version of WHIRL using (9). In their experiments this technique outperformed a number of other classifiers, such as a C4.5 decision tree classifier and the RIPPER CNF rule-based classifier.

A variant of the basic k-NN approach was proposed by Galavotti et al. [2000], who reinterpreted (9) by redefining [[α]] as

    [[α]] = 1 if α = T,  −1 if α = F.

The difference from the original k-NN approach is that if a training document dz similar to the test document dj does not belong to ci, this information is not discarded but weights negatively in the decision to classify dj under ci.

A combination of profile- and example-based methods was presented in Lam and Ho [1998]. In this work a k-NN system was fed generalized instances (GIs) in place of training documents. This approach may be seen as the result of

—clustering the training set, thus obtaining a set of clusters Ki = {ki1, . . . , ki|Ki|};

—building a profile G(kiz) (“generalized instance”) from the documents belonging to cluster kiz by means of some algorithm for learning linear classifiers (e.g., Rocchio, WIDROW-HOFF);

—applying k-NN with profiles in place of training documents, that is, computing

    CSVi(dj) = ∑_{kiz ∈ Ki} RSV(dj, G(kiz)) · (|{dj ∈ kiz | Φ̆(dj, ci) = T}| / |{dj ∈ kiz}|) · (|{dj ∈ kiz}| / |Tr|)
             = ∑_{kiz ∈ Ki} RSV(dj, G(kiz)) · (|{dj ∈ kiz | Φ̆(dj, ci) = T}| / |Tr|),    (10)

where |{dj ∈ kiz | Φ̆(dj, ci) = T}| / |{dj ∈ kiz}| represents the “degree” to which G(kiz) is a positive instance of ci, and |{dj ∈ kiz}| / |Tr| represents its weight within the entire process.

This exploits the superior effectiveness (see Figure 3) of k-NN over linear classifiers while at the same time avoiding the sensitivity of k-NN to the presence of “outliers” (i.e., positive instances of ci that “lie out” of the region where most other positive instances of ci are located) in the training set.

Fig. 4. Learning support vector classifiers. The small crosses and circles represent positive and negative training examples, respectively, whereas lines represent decision surfaces. Decision surface σi (indicated by the thicker line) is, among those shown, the best possible one, as it is the middle element of the widest set of parallel decision surfaces (i.e., its minimum distance to any training example is maximum). Small boxes indicate the support vectors.

6.10. Building Classifiers by Support Vector Machines

The support vector machine (SVM) method has been introduced in TC by Joachims [1998, 1999] and subsequently used by Drucker et al. [1999], Dumais et al. [1998], Dumais and Chen [2000], Klinkenberg and Joachims [2000], Taira and Haruno [1999], and Yang and Liu [1999]. In geometrical terms, it may be seen as the attempt to find, among all the surfaces σ1, σ2, . . . in |T|-dimensional space that separate the positive from the negative training examples (decision surfaces), the σi that separates the positives from the negatives by the widest possible margin, that is, such that the separation property is invariant with respect to the widest possible translation of σi.

This idea is best understood in the case in which the positives and the negatives are linearly separable, in which case the decision surfaces are (|T|−1)-hyperplanes. In the two-dimensional case of Figure 4, various lines may be chosen as decision surfaces. The SVM method chooses the middle element from the “widest” set of parallel lines, that is, from the set in which the maximum distance between two elements in the set is highest. It is noteworthy that this “best” decision surface is determined by only a small set of training examples, called the support vectors.

The method described is applicable also to the case in which the positives and the negatives are not linearly separable. Yang and Liu [1999] experimentally compared the linear case (namely, when the assumption is made that the categories are linearly separable) with the nonlinear case on a standard benchmark, and obtained slightly better results in the former case.

As argued by Joachims [1998], SVMs offer two important advantages for TC:

—term selection is often not needed, as SVMs tend to be fairly robust to overfitting and can scale up to considerable dimensionalities;

—no human and machine effort in parameter tuning on a validation set is needed, as there is a theoretically motivated, "default" choice of parameter settings, which has also been shown to provide the best effectiveness.

Dumais et al. [1998] have applied a novel algorithm for training SVMs which brings about training speeds comparable to computationally easy learners such as Rocchio.
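For illustration, a generic linear SVM text classifier can be put together in a few lines; this is only a sketch using off-the-shelf components (scikit-learn's TfidfVectorizer and LinearSVC on a made-up toy category), not the SVMLIGHT setup of Joachims [1998] or the training algorithm of Dumais et al. [1998].

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 1 means the document belongs to category c_i, 0 means it does not.
train_texts = [
    "wheat crop exports rose sharply this quarter",
    "the central bank cut interest rates again",
    "corn and wheat futures fell on weak demand",
    "new monetary policy measures were announced",
]
train_labels = [1, 0, 1, 0]

# tf-idf indexing followed by a linear (maximum-margin) SVM
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(train_texts, train_labels)
print(clf.predict(["wheat harvest estimates were revised upward"]))
```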

6.11. Classifier Committees

Classifier committees (a.k.a. ensembles) are based on the idea that, given a task that requires expert knowledge to perform, k experts may be better than one if their individual judgments are appropriately combined. In TC, the idea is to apply k different classifiers Φ_1, ..., Φ_k to the same task of deciding whether d_j ∈ c_i, and then combine their outcome appropriately. A classifier committee is then characterized by (i) a choice of k classifiers, and (ii) a choice of a combination function.

Concerning Issue (i), it is known from the ML literature that, in order to guarantee good effectiveness, the classifiers


forming the committee should be as independent as possible [Tumer and Ghosh 1996]. The classifiers may differ for the indexing approach used, or for the inductive method, or both. Within TC, the avenue which has been explored most is the latter (to our knowledge the only example of the former is Scott and Matwin [1999]).

Concerning Issue (ii), various rules have been tested. The simplest one is majority voting (MV), whereby the binary outputs of the k classifiers are pooled together, and the classification decision that reaches the majority of (k+1)/2 votes is taken (k obviously needs to be an odd number) [Li and Jain 1998; Liere and Tadepalli 1997]. This method is particularly suited to the case in which the committee includes classifiers characterized by a binary decision function CSV_i : D → {T, F}. A second rule is weighted linear combination (WLC), whereby a weighted sum of the CSV_i's produced by the k classifiers yields the final CSV_i. The weights w_j reflect the expected relative effectiveness of classifiers Φ_j, and are usually optimized on a validation set [Larkey and Croft 1996]. Another policy is dynamic classifier selection (DCS), whereby among committee {Φ_1, ..., Φ_k} the classifier Φ_t most effective on the l validation examples most similar to d_j is selected, and its judgment adopted by the committee [Li and Jain 1998]. A still different policy, somehow intermediate between WLC and DCS, is adaptive classifier combination (ACC), whereby the judgments of all the classifiers in the committee are summed together, but their individual contribution is weighted by their effectiveness on the l validation examples most similar to d_j [Li and Jain 1998].
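The first two combination rules are simple enough to be sketched directly; the snippet below is illustrative only, and assumes the k classifiers' outputs for a single ⟨d_j, c_i⟩ pair are already available as a list of binary decisions (for MV) or of CSV_i scores plus validation-tuned weights (for WLC).

```python
import numpy as np

def majority_voting(binary_votes):
    """MV: binary_votes is a list of T/F decisions (booleans) from k classifiers,
    with k odd; the committee answers T iff at least (k+1)/2 classifiers say T."""
    k = len(binary_votes)
    return sum(binary_votes) >= (k + 1) / 2

def weighted_linear_combination(csv_values, weights):
    """WLC: a weighted sum of the k classifiers' CSV_i values; the weights would
    normally be optimized on a validation set."""
    return float(np.dot(np.asarray(weights), np.asarray(csv_values)))

print(majority_voting([True, False, True]))                      # True
print(weighted_linear_combination([0.9, 0.4, 0.7], [0.5, 0.2, 0.3]))
```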

Classifier committees have had mixed results in TC so far. Larkey and Croft [1996] have used combinations of Rocchio, Naïve Bayes, and k-NN, all together or in pairwise combinations, using a WLC rule. In their experiments the combination of any two classifiers outperformed the best individual classifier (k-NN), and the combination of the three classifiers improved on all three pairwise combinations. These results would seem to give strong support to the idea that classifier committees

can somehow profit from the complementary strengths of their individual members. However, the small size of the test set used (187 documents) suggests that more experimentation is needed before conclusions can be drawn.

Li and Jain [1998] have tested a committee formed of (various combinations of) a Naïve Bayes classifier, an example-based classifier, a decision tree classifier, and a classifier built by means of their own "subspace method"; the combination rules they have worked with are MV, DCS, and ACC. Only in the case of a committee formed by Naïve Bayes and the subspace classifier combined by means of ACC has the committee outperformed, and by a narrow margin, the best individual classifier (for every attempted classifier combination ACC gave better results than MV and DCS). This seems discouraging, especially in light of the fact that the committee approach is computationally expensive (its cost trivially amounts to the sum of the costs of the individual classifiers plus the cost incurred for the computation of the combination rule). Again, it has to be remarked that the small size of their experiment (two test sets of less than 700 documents each were used) does not allow us to draw definitive conclusions on the approaches adopted.

6.11.1. Boosting. The boosting method [Schapire et al. 1998; Schapire and Singer 2000] occupies a special place in the classifier committees literature, since the k classifiers Φ_1, ..., Φ_k forming the committee are obtained by the same learning method (here called the weak learner). The key intuition of boosting is that the k classifiers should be trained not in a conceptually parallel and independent way, as in the committees described above, but sequentially. In this way, in training classifier Φ_i one may take into account how classifiers Φ_1, ..., Φ_{i−1} perform on the training examples, and concentrate on getting right those examples on which Φ_1, ..., Φ_{i−1} have performed worst.

Specifically, for learning classifier Φ_t each ⟨d_j, c_i⟩ pair is given an "importance weight" h^t_{ij} (where h^1_{ij} is set to be equal for


all ⟨d_j, c_i⟩ pairs^15), which represents how hard it was for classifiers Φ_1, ..., Φ_{t−1} to take a correct decision on this pair. These weights are exploited in learning Φ_t, which will be specially tuned to correctly solve the pairs with higher weight. Classifier Φ_t is then applied to the training documents, and as a result weights h^t_{ij} are updated to h^{t+1}_{ij}; in this update operation, pairs correctly classified by Φ_t will have their weight decreased, while pairs misclassified by Φ_t will have their weight increased. After all the k classifiers have been built, a weighted linear combination rule is applied to yield the final committee.
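The weight-update idea can be made concrete with a schematic AdaBoost-style loop; the sketch below uses a one-level decision tree ("decision stump") from scikit-learn as weak learner and the textbook exponential update, and is not the ADABOOST.MH algorithm of Schapire and Singer [2000].

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost(X, y, k=10):
    """X: document vectors (2-D array); y: +1/-1 labels for one category c_i.
    Returns the k weak hypotheses and their combination weights."""
    n = X.shape[0]
    h = np.full(n, 1.0 / n)               # importance weights, equal at the start
    hypotheses, alphas = [], []
    for t in range(k):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=h)
        pred = stump.predict(X)
        err = np.clip(np.sum(h[pred != y]) / np.sum(h), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        # decrease the weight of correctly classified pairs, increase it otherwise
        h *= np.exp(-alpha * y * pred)
        h /= h.sum()
        hypotheses.append(stump)
        alphas.append(alpha)
    return hypotheses, np.array(alphas)

def committee_csv(X, hypotheses, alphas):
    """Final committee: a weighted linear combination of the weak hypotheses."""
    return sum(a * clf.predict(X) for a, clf in zip(alphas, hypotheses))
```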

In the BOOSTEXTER system [Schapire and Singer 2000], two different boosting algorithms are tested, using a one-level decision tree weak learner. The former algorithm (ADABOOST.MH, simply called ADABOOST in Schapire et al. [1998]) is explicitly geared toward the maximization of microaveraged effectiveness, whereas the latter (ADABOOST.MR) is aimed at minimizing ranking loss (i.e., at getting a correct category ranking for each individual document). In experiments conducted over three different test collections, Schapire et al. [1998] have shown ADABOOST to outperform SLEEPING EXPERTS, a classifier that had proven quite effective in the experiments of Cohen and Singer [1999]. Further experiments by Schapire and Singer [2000] showed ADABOOST to outperform, aside from SLEEPING EXPERTS, a Naïve Bayes classifier, a standard (nonenhanced) Rocchio classifier, and Joachims' [1997] PRTFIDF classifier.

A boosting algorithm based on a "committee of classifier subcommittees" that improves on the effectiveness and (especially) the efficiency of ADABOOST.MH was presented in Sebastiani et al. [2000]. An approach similar to boosting was also employed by Weiss et al. [1999], who experimented with committees of decision trees each having an average of 16 leaves (and hence much more complex than the simple "decision stumps" used in Schapire and Singer [2000]), eventually combined by using the simple MV rule as a combination rule; similarly to boosting, a mechanism for emphasising documents that have been misclassified by previous decision trees is used. Boosting-based approaches have also been employed in Escudero et al. [2000], Iyer et al. [2000], Kim et al. [2000], Li and Jain [1998], and Myers et al. [2000].

^15 Schapire et al. [1998] also showed that a simple modification of this policy allows optimization of the classifier based on "utility" (see Section 7.1.3).

6.12. Other Methods

Although in the previous sections we have tried to give an overview as complete as possible of the learning approaches proposed in the TC literature, it is hardly possible to be exhaustive. Some of the learning approaches adopted do not fall squarely under one or the other class of algorithms, or have remained somehow isolated attempts. Among these, the most noteworthy are the ones based on Bayesian inference networks [Dumais et al. 1998; Lam et al. 1997; Tzeras and Hartmann 1993], genetic algorithms [Clack et al. 1997; Masand 1994], and maximum entropy modelling [Manning and Schutze 1999].

7. EVALUATION OF TEXT CLASSIFIERS

As for text search systems, the evaluation of document classifiers is typically conducted experimentally, rather than analytically. The reason is that, in order to evaluate a system analytically (e.g., proving that the system is correct and complete), we would need a formal specification of the problem that the system is trying to solve (e.g., with respect to what correctness and completeness are defined), and the central notion of TC (namely, that of membership of a document in a category) is, due to its subjective character, inherently nonformalizable.

The experimental evaluation of a classifier usually measures its effectiveness (rather than its efficiency), that is, its ability to take the right classification decisions.


Table II. The Contingency Table for Category c_i

                              Expert judgments
  Category c_i                YES        NO
  Classifier      YES         TP_i       FP_i
  judgments       NO          FN_i       TN_i

7.1. Measures of Text Categorization Effectiveness

7.1.1. Precision and Recall. Classification effectiveness is usually measured in terms of the classic IR notions of precision (π) and recall (ρ), adapted to the case of TC. Precision wrt c_i (π_i) is defined as the conditional probability P(Φ(d_x, c_i) = T | Φ̆(d_x, c_i) = T), that is, as the probability that if a random document d_x is classified under c_i, this decision is correct. Analogously, recall wrt c_i (ρ_i) is defined as P(Φ̆(d_x, c_i) = T | Φ(d_x, c_i) = T), that is, as the probability that, if a random document d_x ought to be classified under c_i, this decision is taken. These category-relative values may be averaged, in a way to be discussed shortly, to obtain π and ρ, that is, values global to the entire category set. Borrowing terminology from logic, π may be viewed as the "degree of soundness" of the classifier wrt C, while ρ may be viewed as its "degree of completeness" wrt C. As defined here, π_i and ρ_i are to be understood as subjective probabilities, that is, as measuring the expectation of the user that the system will behave correctly when classifying an unseen document under c_i. These probabilities may be estimated in terms of the contingency table for c_i on a given test set (see Table II). Here, FP_i (false positives wrt c_i, a.k.a. errors of commission) is the number of test documents incorrectly classified under c_i; TN_i (true negatives wrt c_i), TP_i (true positives wrt c_i), and FN_i (false negatives wrt c_i, a.k.a. errors of omission) are defined accordingly. Estimates (indicated by carets) of precision wrt c_i and recall wrt c_i may thus be obtained as

π̂_i = TP_i / (TP_i + FP_i),    ρ̂_i = TP_i / (TP_i + FN_i).

For obtaining estimates of π and ρ, two different methods may be adopted:

Table III. The Global Contingency Table

                                          Expert judgments
  Category set C = {c_1, ..., c_|C|}      YES                            NO
  Classifier      YES                     TP = Σ_{i=1..|C|} TP_i         FP = Σ_{i=1..|C|} FP_i
  judgments       NO                      FN = Σ_{i=1..|C|} FN_i         TN = Σ_{i=1..|C|} TN_i

—microaveraging: π and ρ are obtained by summing over all individual decisions:

π^µ = TP / (TP + FP) = Σ_{i=1..|C|} TP_i / Σ_{i=1..|C|} (TP_i + FP_i),

ρ^µ = TP / (TP + FN) = Σ_{i=1..|C|} TP_i / Σ_{i=1..|C|} (TP_i + FN_i),

where "µ" indicates microaveraging. The "global" contingency table (Table III) is thus obtained by summing over category-specific contingency tables;

—macroaveraging: precision and recall are first evaluated "locally" for each category, and then "globally" by averaging over the results of the different categories:

π^M = Σ_{i=1..|C|} π_i / |C|,    ρ^M = Σ_{i=1..|C|} ρ_i / |C|,

where "M" indicates macroaveraging.

These two methods may give quite different results, especially if the different categories have very different generality. For instance, the ability of a classifier to behave well also on categories with low generality (i.e., categories with few positive training instances) will be emphasized by macroaveraging and much less so by microaveraging. Whether one or the other should be used obviously depends on the application requirements. From now on, we will assume that microaveraging is used; everything we will say in the rest of Section 7 may be adapted to the case of macroaveraging in the obvious way.
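A small numeric sketch may help clarify the difference; the per-category contingency counts below are invented for illustration.

```python
counts = {               # category: (TP_i, FP_i, FN_i)
    "wheat":    (50, 10, 15),
    "acq":      (400, 60, 80),
    "rare-cat": (1, 0, 9),
}

def safe_div(a, b):
    return a / b if b else 0.0

# microaveraging: sum the contingency tables, then compute pi and rho once
TP = sum(tp for tp, fp, fn in counts.values())
FP = sum(fp for tp, fp, fn in counts.values())
FN = sum(fn for tp, fp, fn in counts.values())
pi_micro, rho_micro = safe_div(TP, TP + FP), safe_div(TP, TP + FN)

# macroaveraging: compute pi_i and rho_i per category, then average them
pi_macro = sum(safe_div(tp, tp + fp) for tp, fp, fn in counts.values()) / len(counts)
rho_macro = sum(safe_div(tp, tp + fn) for tp, fp, fn in counts.values()) / len(counts)

print(pi_micro, rho_micro)   # dominated by the high-generality categories
print(pi_macro, rho_macro)   # the low-generality category weighs as much as the others
```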

7.1.2. Other Measures of Effectiveness. Measures alternative to π and ρ and commonly used in the ML literature, such as accuracy (estimated as


Â = (TP + TN) / (TP + TN + FP + FN)) and error (estimated as Ê = (FP + FN) / (TP + TN + FP + FN) = 1 − Â), are not widely used in TC. The reason is that, as Yang [1999] pointed out, the large value that their denominator typically has in TC makes them much more insensitive to variations in the number of correct decisions (TP + TN) than π and ρ. Besides, if A is the adopted evaluation measure, in the frequent case of a very low average generality the trivial rejector (i.e., the classifier Φ such that Φ(d_j, c_i) = F for all d_j and c_i) tends to outperform all nontrivial classifiers (see also Cohen [1995a], Section 2.3). If A is adopted, parameter tuning on a validation set may thus result in parameter choices that make the classifier behave very much like the trivial rejector.

A nonstandard effectiveness measure was proposed by Sable and Hatzivassiloglou [2000, Section 7], who suggested basing π and ρ not on "absolute" values of success and failure (i.e., 1 if Φ̆(d_j, c_i) = Φ(d_j, c_i) and 0 if Φ̆(d_j, c_i) ≠ Φ(d_j, c_i)), but on values of relative success (i.e., CSV_i(d_j) if Φ(d_j, c_i) = T and 1 − CSV_i(d_j) if Φ(d_j, c_i) = F). This means that for a correct (respectively wrong) decision the classifier is rewarded (respectively penalized) proportionally to its confidence in the decision. This proposed measure does not reward the choice of a good thresholding policy, and is thus unfit for autonomous ("hard") classification systems. However, it might be appropriate for interactive ("ranking") classifiers of the type used in Larkey [1999], where the confidence that the classifier has in its own decision influences category ranking and, as a consequence, the overall usefulness of the system.
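In code, the relative-success score of a single decision is simply the following (an illustrative sketch; csv_ij stands for CSV_i(d_j) and truly_in_ci for Φ(d_j, c_i) = T).

```python
def relative_success(csv_ij, truly_in_ci):
    """CSV_i(d_j) if the document truly belongs to c_i, 1 - CSV_i(d_j) otherwise."""
    return csv_ij if truly_in_ci else 1.0 - csv_ij

print(relative_success(0.9, True))    # confident and correct: high reward
print(relative_success(0.9, False))   # confident but wrong: low reward
```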

7.1.3. Measures Alternative to Effectiveness. In general, criteria different from effectiveness are seldom used in classifier evaluation. For instance, efficiency, although very important for applicative purposes, is seldom used as the sole yardstick, due to the volatility of the parameters on which the evaluation rests. However, efficiency may be useful for choosing among

Table IV. The Utility Matrix

                                          Expert judgments
  Category set C = {c_1, ..., c_|C|}      YES        NO
  Classifier      YES                     u_TP       u_FP
  judgments       NO                      u_FN       u_TN

classifiers with similar effectiveness. An interesting evaluation has been carried out by Dumais et al. [1998], who have compared five different learning methods along three different dimensions, namely, effectiveness, training efficiency (i.e., the average time it takes to build a classifier for category c_i from a training set Tr), and classification efficiency (i.e., the average time it takes to classify a new document d_j under category c_i).

An important alternative to effectiveness is utility, a class of measures from decision theory that extend effectiveness by economic criteria such as gain or loss. Utility is based on a utility matrix such as that of Table IV, where the numeric values u_TP, u_FP, u_FN and u_TN represent the gain brought about by a true positive, false positive, false negative, and true negative, respectively; both u_TP and u_TN are greater than both u_FP and u_FN. "Standard" effectiveness is a special case of utility, i.e., the one in which u_TP = u_TN > u_FP = u_FN. Less trivial cases are those in which u_TP ≠ u_TN and/or u_FP ≠ u_FN; this is appropriate, for example, in spam filtering, where failing to discard a piece of junk mail (FP) is a less serious mistake than discarding a legitimate message (FN) [Androutsopoulos et al. 2000]. If the classifier outputs probability estimates of the membership of d_j in c_i, then decision theory provides analytical methods to determine thresholds τ_i, thus avoiding the need to determine them experimentally (as discussed in Section 6.1). Specifically, as Lewis [1995a] reminds us, the expected value of utility is maximized when

τ_i = (u_FP − u_TN) / ((u_FN − u_TP) + (u_FP − u_TN)),

which, in the case of "standard" effectiveness, is equal to 1/2.
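As a sketch, the threshold can be computed directly from an assumed utility matrix; the numeric values below are purely illustrative.

```python
def utility_threshold(u_tp, u_fp, u_fn, u_tn):
    """tau_i = (u_FP - u_TN) / ((u_FN - u_TP) + (u_FP - u_TN))."""
    return (u_fp - u_tn) / ((u_fn - u_tp) + (u_fp - u_tn))

print(utility_threshold(1, 0, 0, 1))    # "standard" effectiveness -> 0.5
print(utility_threshold(1, 0, -5, 1))   # costly false negatives -> lower (more liberal) threshold
```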


Table V. Trivial Cases in TC

                                              Precision       Recall          C-precision     C-recall
                                              TP/(TP+FP)      TP/(TP+FN)      TN/(FP+TN)      TN/(TN+FN)

  Trivial rejector            TP = FP = 0     Undefined       0/FN = 0        TN/TN = 1       TN/(TN+FN)
  Trivial acceptor            FN = TN = 0     TP/(TP+FP)      TP/TP = 1       0/FP = 0        Undefined
  Trivial "Yes" collection    FP = TN = 0     TP/TP = 1       TP/(TP+FN)      Undefined       0/FN = 0
  Trivial "No" collection     TP = FN = 0     0/FP = 0        Undefined       TN/(FP+TN)      TN/TN = 1

The use of utility in TC is discussed in detail by Lewis [1995a]. Other works where utility is employed are Amati and Crestani [1999], Cohen and Singer [1999], Hull et al. [1996], Lewis and Catlett [1994], and Schapire et al. [1998]. Utility has become popular within the text filtering community, and the TREC "filtering track" evaluations have been using it for a while [Lewis 1995c]. The values of the utility matrix are extremely application-dependent. This means that if utility is used instead of "pure" effectiveness, there is a further element of difficulty in the cross-comparison of classification systems (see Section 7.3), since for two classifiers to be experimentally comparable also the two utility matrices must be the same.

Other effectiveness measures different from the ones discussed here have occasionally been used in the literature; these include adjacent score [Larkey 1998], coverage [Schapire and Singer 2000], one-error [Schapire and Singer 2000], Pearson product-moment correlation [Larkey 1998], recall at n [Larkey and Croft 1996], top candidate [Larkey and Croft 1996], and top n [Larkey and Croft 1996]. We will not attempt to discuss them in detail. However, their use shows that, although the TC community is making consistent efforts at standardizing experimentation protocols, we are still far from universal agreement on evaluation issues and, as a consequence, from understanding precisely the relative merits of the various methods.

7.1.4. Combined Effectiveness Measures. Neither precision nor recall makes sense in isolation from each other. In fact the classifier Φ such that Φ(d_j, c_i) = T for all d_j and c_i (the trivial acceptor) has ρ = 1. When the CSV_i function has values in [0, 1], one only needs to set every threshold τ_i to 0 to obtain the trivial acceptor. In this case, π would usually be very low (more precisely, equal to the average test set generality Σ_{i=1..|C|} g_Te(c_i) / |C|).^16 Conversely, it is well known from everyday IR practice that higher levels of π may be obtained at the price of low values of ρ.

In practice, by tuning τ_i a function CSV_i : D → {T, F} is tuned to be, in the words of Riloff and Lehnert [1994], more liberal (i.e., improving ρ_i to the detriment of π_i) or more conservative (improving π_i to the detriment of ρ_i).^17 A classifier should thus be evaluated by means of a measure which combines π and ρ.^18 Various such measures have been proposed, among which the most frequent are:

^16 From this, one might be tempted to infer, by symmetry, that the trivial rejector always has π = 1. This is false, as π is undefined (the denominator is zero) for the trivial rejector (see Table V). In fact, it is clear from its definition (π = TP/(TP+FP)) that π depends only on how the positives (TP + FP) are split between true positives TP and false positives FP, and does not depend at all on the cardinality of the positives. There is a breakup of "symmetry" between π and ρ here because, from the point of view of classifier judgment (positives vs. negatives; this is the dichotomy of interest in trivial acceptor vs. trivial rejector), the "symmetric" of ρ (TP/(TP+FN)) is not π (TP/(TP+FP)) but C-precision (π_c = TN/(FP+TN)), the "contrapositive" of π. In fact, while ρ = 1 and π_c = 0 for the trivial acceptor, π_c = 1 and ρ = 0 for the trivial rejector.

(1) Eleven-point average precision: threshold τ_i is repeatedly tuned so as to allow ρ_i to take up values of 0.0, .1, ..., .9, 1.0; π_i is computed for these 11 different values of τ_i, and averaged over the 11 resulting values. This is analogous to the standard evaluation methodology for ranked IR systems, and may be used

(a) with categories in place of IR queries. This is most frequently used for document-ranking classifiers (see Schutze et al. [1995]; Yang [1994]; Yang [1999]; Yang and Pedersen [1997]);

(b) with test documents in place of IR queries and categories in place of documents. This is most frequently used for category-ranking classifiers (see Lam et al. [1999]; Larkey and Croft [1996]; Schapire and Singer [2000]; Wiener et al. [1995]). In this case, if macroaveraging is used, it needs to be redefined on a per-document, rather than per-category, basis.

This measure does not make sense for binary-valued CSV_i functions, since in this case ρ_i may not be varied at will.

(2) The breakeven point, that is, the value at which π equals ρ (e.g., Apte et al. [1994]; Cohen and Singer [1999]; Dagan et al. [1997]; Joachims [1998]; Joachims [1999]; Lewis [1992a]; Lewis and Ringuette [1994]; Moulinier and Ganascia [1996]; Ng et al. [1997]; Yang [1999]). This is obtained by a process analogous to the one used for 11-point average precision: a plot of π as a function of ρ is computed by repeatedly varying the thresholds τ_i; breakeven is the value of ρ (or π) for which the plot intersects the ρ = π line. This idea relies on the fact that, by decreasing the τ_i's from 1 to 0, ρ always increases monotonically from 0 to 1 and π usually decreases monotonically from a value near 1 to (1/|C|) Σ_{i=1..|C|} g_Te(c_i). If for no values of the τ_i's π and ρ are exactly equal, the τ_i's are set to the value for which π and ρ are closest, and an interpolated breakeven is computed as the average of the values of π and ρ.^19

^17 While ρ_i can always be increased at will by lowering τ_i, usually at the cost of decreasing π_i, π_i can usually be increased at will by raising τ_i, always at the cost of decreasing ρ_i. This kind of tuning is only possible for CSV_i functions with values in [0, 1]; for binary-valued CSV_i functions tuning is not always possible, or is anyway more difficult (see Weiss et al. [1999], page 66).

^18 An exception is single-label TC, in which π and ρ are not independent of each other: if a document d_j has been classified under a wrong category c_s (thus decreasing π_s), this also means that it has not been classified under the right category c_t (thus decreasing ρ_t). In this case either π or ρ can be used as a measure of effectiveness.

(3) The F_β function [van Rijsbergen 1979, Chapter 7], for some 0 ≤ β ≤ +∞ (e.g., Cohen [1995a]; Cohen and Singer [1999]; Lewis and Gale [1994]; Lewis [1995a]; Moulinier et al. [1996]; Ruiz and Srinivasan [1999]), where

F_β = (β² + 1) π ρ / (β² π + ρ).

Here β may be seen as the relative degree of importance attributed to π and ρ. If β = 0 then F_β coincides with π, whereas if β = +∞ then F_β coincides with ρ. Usually, a value β = 1 is used, which attributes equal importance to π and ρ. As shown in Moulinier et al. [1996] and Yang [1999], the breakeven of a classifier Φ is always less than or equal to its F_1 value.

^19 Breakeven, first proposed by Lewis [1992a, 1992b], has been recently criticized. Lewis himself (see his message of 11 Sep 1997 10:49:01 to the DDLBETA text categorization mailing list—quoted with permission of the author) has pointed out that breakeven is not a good effectiveness measure, since (i) there may be no parameter setting that yields the breakeven; in this case the final breakeven value, obtained by interpolation, is artificial; (ii) to have ρ equal π is not necessarily desirable, and it is not clear that a system that achieves high breakeven can be tuned to score high on other effectiveness measures. Yang [1999] also noted that when for no value of the parameters π and ρ are close enough, interpolated breakeven may not be a reliable indicator of effectiveness.
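For reference, the F_β measure of item (3) reduces to a one-line function; the example values below are made up.

```python
def f_beta(pi, rho, beta=1.0):
    """F_beta = (beta^2 + 1) * pi * rho / (beta^2 * pi + rho)."""
    if pi == 0.0 and rho == 0.0:
        return 0.0
    return (beta**2 + 1) * pi * rho / (beta**2 * pi + rho)

print(f_beta(0.8, 0.6))             # F_1: equal importance to precision and recall
print(f_beta(0.8, 0.6, beta=0.5))   # beta < 1 weighs precision more heavily
```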


Once an effectiveness measure is chosen, a classifier can be tuned (e.g., thresholds and other parameters can be set) so that the resulting effectiveness is the best achievable by that classifier. Tuning a parameter p (be it a threshold or other) is normally done experimentally. This means performing repeated experiments on the validation set with the values of the other parameters p_k fixed (at a default value, in the case of a yet-to-be-tuned parameter p_k, or at the chosen value, if the parameter p_k has already been tuned) and with different values for parameter p. The value that has yielded the best effectiveness is chosen for p.
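A sketch of this tuning procedure for a single parameter, the threshold τ_i, might look as follows; csv_scores and true_labels are assumed to come from a held-out validation set, and F_1 is used as the target measure.

```python
import numpy as np

def tune_threshold(csv_scores, true_labels, candidates=np.linspace(0.0, 1.0, 101)):
    """Try each candidate value of tau_i (all other parameters held fixed) and
    keep the one that yields the best validation F_1."""
    best_tau, best_f1 = None, -1.0
    for tau in candidates:
        decisions = csv_scores >= tau
        tp = np.sum(decisions & true_labels)
        fp = np.sum(decisions & ~true_labels)
        fn = np.sum(~decisions & true_labels)
        pi = tp / (tp + fp) if tp + fp else 0.0
        rho = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * pi * rho / (pi + rho) if pi + rho else 0.0
        if f1 > best_f1:
            best_tau, best_f1 = tau, f1
    return best_tau, best_f1

tau, f1 = tune_threshold(np.array([0.9, 0.2, 0.7, 0.4]),
                         np.array([True, False, True, False]))
```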

7.2. Benchmarks for Text Categorization

Standard benchmark collections that can be used as initial corpora for TC are publicly available for experimental purposes. The most widely used is the Reuters collection, consisting of a set of newswire stories classified under categories related to economics. The Reuters collection accounts for most of the experimental work in TC so far. Unfortunately, this does not always translate into reliable comparative results, in the sense that many of these experiments have been carried out in subtly different conditions.

In general, different sets of experiments may be used for cross-classifier comparison only if the experiments have been performed

(1) on exactly the same collection (i.e., same documents and same categories);

(2) with the same "split" between training set and test set;

(3) with the same evaluation measure and, whenever this measure depends on some parameters (e.g., the utility matrix chosen), with the same parameter values.

Unfortunately, a lot of experimentation, both on Reuters and on other collections, has not been performed with these caveats in mind: by testing three different classifiers on five popular versions of Reuters, Yang [1999] has shown that

a lack of compliance with these three conditions may make the experimental results hardly comparable among each other. Table VI lists the results of all experiments known to us performed on five major versions of the Reuters benchmark: Reuters-22173 "ModLewis" (#1), Reuters-22173 "ModApte" (#2), Reuters-22173 "ModWiener" (#3), Reuters-21578 "ModApte" (#4), and Reuters-21578[10] "ModApte" (#5).^20 Only experiments that have computed either a breakeven or F_1 have been listed, since other less popular effectiveness measures do not readily compare with these.

Note that only results obtained on the same version are directly comparable. In particular, Yang [1999] showed that experiments carried out on Reuters-22173 "ModLewis" (#1) are not directly comparable with those using the other three versions, since the former strangely includes a significant percentage (58%) of "unlabeled" test documents which, being negative examples of all categories, tend to depress effectiveness. Also, experiments performed on Reuters-21578[10] "ModApte" (#5) are not comparable with the others, since this collection is the restriction of Reuters-21578 "ModApte" to the 10 categories with the highest generality, and is thus an obviously "easier" collection.

Table VI. Comparative results among different classifiers obtained on five different versions of Reuters. (Unless otherwise noted, entries indicate the microaveraged breakeven point; within parentheses, "M" indicates macroaveraging and "F1" indicates use of the F_1 measure.)

                              #1         #2         #3         #4         #5
  # of documents              21,450     14,347     13,272     12,902     12,902
  # of training documents     14,704     10,667      9,610      9,603      9,603
  # of test documents          6,746      3,680      3,662      3,299      3,299
  # of categories                135         93         92         90         10

  System              Type              Results reported by                Reported value(s)
  WORD                (non-learning)    Yang [1999]                        .150  .310  .290
                      probabilistic     [Dumais et al. 1998]               .752  .815
                      probabilistic     [Joachims 1998]                    .720
                      probabilistic     [Lam et al. 1997]                  .443 (MF1)
  PROPBAYES           probabilistic     [Lewis 1992a]                      .650
  BIM                 probabilistic     [Li and Yamanishi 1999]            .747
                      probabilistic     [Li and Yamanishi 1999]            .773
  NB                  probabilistic     [Yang and Liu 1999]                .795
                      decision trees    [Dumais et al. 1998]               .884
  C4.5                decision trees    [Joachims 1998]                    .794
  IND                 decision trees    [Lewis and Ringuette 1994]         .670
  SWAP-1              decision rules    [Apte et al. 1994]                 .805
  RIPPER              decision rules    [Cohen and Singer 1999]            .683  .811  .820
  SLEEPING EXPERTS    decision rules    [Cohen and Singer 1999]            .753  .759  .827
  DL-ESC              decision rules    [Li and Yamanishi 1999]            .820
  CHARADE             decision rules    [Moulinier and Ganascia 1996]      .738
  CHARADE             decision rules    [Moulinier et al. 1996]            .783 (F1)
  LLSF                regression        [Yang 1999]                        .855  .810
  LLSF                regression        [Yang and Liu 1999]                .849
  BALANCED WINNOW     on-line linear    [Dagan et al. 1997]                .747 (M)  .833 (M)
  WIDROW-HOFF         on-line linear    [Lam and Ho 1998]                  .822
  ROCCHIO             batch linear      [Cohen and Singer 1999]            .660  .748  .776
  FINDSIM             batch linear      [Dumais et al. 1998]               .617  .646
  ROCCHIO             batch linear      [Joachims 1998]                    .799
  ROCCHIO             batch linear      [Lam and Ho 1998]                  .781
  ROCCHIO             batch linear      [Li and Yamanishi 1999]            .625
  CLASSI              neural network    [Ng et al. 1997]                   .802
  NNET                neural network    [Yang and Liu 1999]                .838
                      neural network    [Wiener et al. 1995]               .820
  GIS-W               example-based     [Lam and Ho 1998]                  .860
  k-NN                example-based     [Joachims 1998]                    .823
  k-NN                example-based     [Lam and Ho 1998]                  .820
  k-NN                example-based     [Yang 1999]                        .690  .852  .820
  k-NN                example-based     [Yang and Liu 1999]                .856
                      SVM               [Dumais et al. 1998]               .870  .920
  SVMLIGHT            SVM               [Joachims 1998]                    .864
  SVMLIGHT            SVM               [Li and Yamanishi 1999]            .841
  SVMLIGHT            SVM               [Yang and Liu 1999]                .859
  ADABOOST.MH         committee         [Schapire and Singer 2000]         .860
                      committee         [Weiss et al. 1999]                .878
                      Bayesian net      [Dumais et al. 1998]               .800  .850
                      Bayesian net      [Lam et al. 1997]                  .542 (MF1)

^20 The Reuters-21578 collection may be freely downloaded for experimentation purposes from http://www.research.att.com/~lewis/reuters21578.html. A new corpus, called Reuters Corpus Volume 1 and consisting of roughly 800,000 documents, has recently been made available by Reuters for TC experiments (see http://about.reuters.com/researchandstandards/corpus/). This will likely replace Reuters-21578 as the "standard" Reuters benchmark for TC.

Other test collections that have been frequently used are

—the OHSUMED collection, set up by Hersh et al. [1994] and used by Joachims [1998], Lam and Ho [1998], Lam et al. [1999], Lewis et al. [1996], Ruiz and Srinivasan [1999], and Yang and Pedersen [1997].^21 The documents are titles or title-plus-abstracts from medical journals (OHSUMED is actually a subset of the Medline document base); the categories are the "postable terms" of the MESH thesaurus.

^21 The OHSUMED collection may be freely downloaded for experimentation purposes from ftp://medir.ohsu.edu/pub/ohsumed.

—the 20 Newsgroups collection, set up by Lang [1995] and used by Baker and McCallum [1998], Joachims [1997], McCallum and Nigam [1998], McCallum et al. [1998], Nigam et al.


[2000], and Schapire and Singer [2000]. The documents are messages posted to Usenet newsgroups, and the categories are the newsgroups themselves.

—the AP collection, used by Cohen [1995a, 1995b], Cohen and Singer [1999], Lewis and Catlett [1994], Lewis and Gale [1994], Lewis et al. [1996], Schapire and Singer [2000], and Schapire et al. [1998].

We will not cover the experiments performed on these collections for the same reasons as those illustrated in footnote 20, that is, because in no case have a significant enough number of authors used the same collection in the same experimental conditions, thus making comparisons difficult.

7.3. Which Text Classifier Is Best?

The published experimental results, and especially those listed in Table VI, allow us to attempt some considerations on the comparative performance of the TC methods discussed. However, we have to bear in mind that comparisons are reliable only when based on experiments performed by the same author under carefully controlled conditions. They are instead more problematic when they involve different experiments performed by different authors. In this case various "background conditions," often extraneous to the learning algorithm itself, may influence the results. These may include, among others, different choices in preprocessing (stemming, etc.), indexing, dimensionality reduction, classifier parameter values, etc., but also different standards of compliance with safe scientific practice (such as tuning parameters on the test set rather than on a separate validation set), which often are not discussed in the published papers.

Two different methods may thus be applied for comparing classifiers [Yang 1999]:

—direct comparison: classifiers Φ′ and Φ″ may be compared when they have been tested on the same collection Ω, usually by the same researchers and with the same background conditions. This is the more reliable method.

—indirect comparison: classifiers Φ′ and Φ″ may be compared when

(1) they have been tested on collections Ω′ and Ω″, respectively, typically by different researchers and hence with possibly different background conditions;

(2) one or more "baseline" classifiers Φ_1, ..., Φ_m have been tested on both Ω′ and Ω″ by the direct comparison method.

Test 2 gives an indication on the relative "hardness" of Ω′ and Ω″; using this and the results from Test 1, we may obtain an indication on the relative effectiveness of Φ′ and Φ″. For the reasons discussed above, this method is less reliable.

A number of interesting conclusions can be drawn from Table VI by using these two methods. Concerning the relative "hardness" of the five collections, if by Ω′ > Ω″ we indicate that Ω′ is a harder collection than Ω″, there seems to be enough evidence that Reuters-22173 "ModLewis" ≫ Reuters-22173 "ModWiener" > Reuters-22173 "ModApte" ≈ Reuters-21578 "ModApte" > Reuters-21578[10] "ModApte." These facts are unsurprising; in particular, the first and the last inequalities are a direct consequence of the peculiar characteristics of Reuters-22173 "ModLewis" and Reuters-21578[10] "ModApte" discussed in Section 7.2.

Concerning the relative performance of the classifiers, remembering the considerations above we may attempt a few conclusions:

—Boosting-based classifier committees, support vector machines, example-based methods, and regression methods deliver top-notch performance. There seems to be no sufficient evidence to decidedly opt for either method; efficiency considerations or application-dependent issues might play a role in breaking the tie.

—Neural networks and on-line linear classifiers work very well, although slightly


worse than the previously mentioned methods.

—Batch linear classifiers (Rocchio) and probabilistic Naïve Bayes classifiers look the worst of the learning-based classifiers. For Rocchio, these results confirm earlier results by Schutze et al. [1995], who had found three classifiers based on linear discriminant analysis, linear regression, and neural networks to perform about 15% better than Rocchio. However, recent results by Schapire et al. [1998] ranked Rocchio among the best performers once near-positives are used in training.

—The data in Table VI is hardly sufficient to say anything about decision trees. However, the work by Dumais et al. [1998], in which a decision tree classifier was shown to perform nearly as well as their top performing system (an SVM classifier), will probably renew the interest in decision trees, an interest that had dwindled after the unimpressive results reported in earlier literature [Cohen and Singer 1999; Joachims 1998; Lewis and Catlett 1994; Lewis and Ringuette 1994].

—By far the lowest performance is displayed by WORD, a classifier implemented by Yang [1999] and not including any learning component.^22

Concerning WORD and no-learning classifiers, for completeness we should recall that one of the highest effectiveness values reported in the literature for the Reuters collection (a .90 breakeven) belongs to CONSTRUE, a manually constructed classifier. However, this classifier has never been tested on the standard variants of Reuters mentioned in Table VI, and it is not clear [Yang 1999] whether the (small) test set of Reuters-22173 "ModHayes" on

^22 WORD is based on the comparison between documents and category names, each treated as a vector of weighted terms in the vector space model. WORD was implemented by Yang with the only purpose of determining the difference in effectiveness that adding a learning component to a classifier brings about. WORD is actually called STR in [Yang 1994; Yang and Chute 1994]. Another no-learning classifier was proposed in Wong et al. [1996].

which the .90 breakeven value was obtained was chosen randomly, as safe scientific practice would demand. Therefore, the fact that this figure is indicative of the performance of CONSTRUE, and of the manual approach it represents, has been convincingly questioned [Yang 1999].

It is important to bear in mind that the considerations above are not absolute statements (if there may be any) on the comparative effectiveness of these TC methods. One of the reasons is that a particular applicative context may exhibit very different characteristics from the ones to be found in Reuters, and different classifiers may respond differently to these characteristics. An experimental study by Joachims [1998] involving support vector machines, k-NN, decision trees, Rocchio, and Naïve Bayes, showed all these classifiers to have similar effectiveness on categories with ≥ 300 positive training examples each. The fact that this experiment involved the methods which have scored best (support vector machines, k-NN) and worst (Rocchio and Naïve Bayes) according to Table VI shows that applicative contexts different from Reuters may well invalidate conclusions drawn on the latter.

Finally, a note about the worth of statistical significance testing. Few authors have gone to the trouble of validating their results by means of such tests. These tests are useful for verifying how strongly the experimental results support the claim that a given system Φ′ is better than another system Φ″, or for verifying how much a difference in the experimental setup affects the measured effectiveness of a system Φ. Hull [1994] and Schutze et al. [1995] have been among the first to work in this direction, validating their results by means of the ANOVA test and the Friedman test; the former is aimed at determining the significance of the difference in effectiveness between two methods in terms of the ratio between this difference and the effectiveness variability across categories, while the latter conducts a similar test by using instead the rank positions of each method within a category. Yang and Liu [1999] defined a full suite of significance


tests, some of which apply to microaveraged and some to macroaveraged effectiveness. They applied them systematically to the comparison between five different classifiers, and were thus able to infer fine-grained conclusions about their relative effectiveness. For other examples of significance testing in TC, see Cohen [1995a, 1995b]; Cohen and Hirsh [1998], Joachims [1997], Koller and Sahami [1997], Lewis et al. [1996], and Wiener et al. [1995].

8. CONCLUSION

Automated TC is now a major research area within the information systems discipline, thanks to a number of factors:

—Its domains of application are numerous and important, and given the proliferation of documents in digital form they are bound to increase dramatically in both number and importance.

—It is indispensable in many applications in which the sheer number of the documents to be classified and the short response time required by the application make the manual alternative implausible.

—It can improve the productivity of human classifiers in applications in which no classification decision can be taken without a final human judgment [Larkey and Croft 1996], by providing tools that quickly "suggest" plausible decisions.

—It has reached effectiveness levels comparable to those of trained professionals. The effectiveness of manual TC is not 100% anyway [Cleverdon 1984] and, more importantly, it is unlikely to be improved substantially by the progress of research. The levels of effectiveness of automated TC are instead growing at a steady pace, and even if they will likely reach a plateau well below the 100% level, this plateau will probably be higher than the effectiveness levels of manual TC.

One of the reasons why from the early '90s the effectiveness of text classifiers has dramatically improved is the arrival in the TC arena of ML methods that are backed by strong theoretical motivations. Examples of these are multiplicative weight updating (e.g., the WINNOW family, WIDROW-HOFF, etc.), adaptive resampling (e.g., boosting), and support vector machines, which provide a sharp contrast with relatively unsophisticated and weak methods such as Rocchio. In TC, ML researchers have found a challenging application, since datasets consisting of hundreds of thousands of documents and characterized by tens of thousands of terms are widely available. This means that TC is a good benchmark for checking whether a given learning technique can scale up to substantial sizes. In turn, this probably means that the active involvement of the ML community in TC is bound to grow.

The success story of automated TC is also going to encourage an extension of its methods and techniques to neighboring fields of application. Techniques typical of automated TC have already been extended successfully to the categorization of documents expressed in slightly different media; for instance:

—very noisy text resulting from optical character recognition [Ittner et al. 1995; Junker and Hoch 1998]. In their experiments Ittner et al. [1995] have found that, by employing noisy texts also in the training phase (i.e., texts affected by the same source of noise that is also at work in the test documents), effectiveness levels comparable to those obtainable in the case of standard text can be achieved.

—speech transcripts [Myers et al. 2000; Schapire and Singer 2000]. For instance, Schapire and Singer [2000] classified answers given to a phone operator's request "How may I help you?" so as to be able to route the call to a specialized operator according to call type.

Concerning other more radically different media, the situation is not as bright (however, see Lim [1999] for an interesting attempt at image categorization based


on a textual metaphor). The reason for this is that capturing real semantic content of nontextual media by automatic indexing is still an open problem. While there are systems that attempt to detect content, for example, in images by recognizing shapes, color distributions, and texture, the general problem of image semantics is still unsolved. The main reason is that natural language, the language of the text medium, admits far fewer variations than the "languages" employed by the other media. For instance, while the concept of a house can be "triggered" by relatively few natural language expressions such as house, houses, home, housing, inhabiting, etc., it can be triggered by far more images: the images of all the different houses that exist, of all possible colors and shapes, viewed from all possible perspectives, from all possible distances, etc. If we had solved the multimedia indexing problem in a satisfactory way, the general methodology that we have discussed in this paper for text would also apply to automated multimedia categorization, and there are reasons to believe that the effectiveness levels could be as high. This only adds to the common sentiment that more research in automated content-based indexing for multimedia documents is needed.

ACKNOWLEDGMENTS

This paper owes a lot to the suggestions and constructive criticism of Norbert Fuhr and David Lewis. Thanks also to Umberto Straccia for comments on an earlier draft, to Evgeniy Gabrilovich, Daniela Giorgetti, and Alessandro Moschitti for spotting mistakes in an earlier draft, and to Alessandro Sperduti for many fruitful discussions.

REFERENCES

AMATI, G. AND CRESTANI, F. 1999. Probabilistic learning for selective dissemination of information. Inform. Process. Man. 35, 5, 633–654.

ANDROUTSOPOULOS, I., KOUTSIAS, J., CHANDRINOS, K. V., AND SPYROPOULOS, C. D. 2000. An experimental comparison of naive Bayesian and keyword-based anti-spam filtering with personal e-mail messages. In Proceedings of SIGIR-00, 23rd ACM International Conference on Research and Development in Information Retrieval (Athens, Greece, 2000), 160–167.

APTE, C., DAMERAU, F. J., AND WEISS, S. M. 1994. Automated learning of decision rules for text categorization. ACM Trans. Inform. Syst. 12, 3, 233–251.

ATTARDI, G., DI MARCO, S., AND SALVI, D. 1998. Categorization by context. J. Univers. Comput. Sci. 4, 9, 719–736.

BAKER, L. D. AND MCCALLUM, A. K. 1998. Distributional clustering of words for text classification. In Proceedings of SIGIR-98, 21st ACM International Conference on Research and Development in Information Retrieval (Melbourne, Australia, 1998), 96–103.

BELKIN, N. J. AND CROFT, W. B. 1992. Information filtering and information retrieval: two sides of the same coin? Commun. ACM 35, 12, 29–38.

BIEBRICHER, P., FUHR, N., KNORZ, G., LUSTIG, G., AND SCHWANTNER, M. 1988. The automatic indexing system AIR/PHYS. From research to application. In Proceedings of SIGIR-88, 11th ACM International Conference on Research and Development in Information Retrieval (Grenoble, France, 1988), 333–342. Also reprinted in Sparck Jones and Willett [1997], pp. 513–517.

BORKO, H. AND BERNICK, M. 1963. Automatic document classification. J. Assoc. Comput. Mach. 10, 2, 151–161.

CAROPRESO, M. F., MATWIN, S., AND SEBASTIANI, F. 2001. A learner-independent evaluation of the usefulness of statistical phrases for automated text categorization. In Text Databases and Document Management: Theory and Practice, A. G. Chin, ed. Idea Group Publishing, Hershey, PA, 78–102.

CAVNAR, W. B. AND TRENKLE, J. M. 1994. N-gram-based text categorization. In Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval (Las Vegas, NV, 1994), 161–175.

CHAKRABARTI, S., DOM, B. E., AGRAWAL, R., AND RAGHAVAN, P. 1998a. Scalable feature selection, classification and signature generation for organizing large text databases into hierarchical topic taxonomies. J. Very Large Data Bases 7, 3, 163–178.

CHAKRABARTI, S., DOM, B. E., AND INDYK, P. 1998b. Enhanced hypertext categorization using hyperlinks. In Proceedings of SIGMOD-98, ACM International Conference on Management of Data (Seattle, WA, 1998), 307–318.

CLACK, C., FARRINGDON, J., LIDWELL, P., AND YU, T. 1997. Autonomous document classification for business. In Proceedings of the 1st International Conference on Autonomous Agents (Marina del Rey, CA, 1997), 201–208.

CLEVERDON, C. 1984. Optimizing convenient on-line access to bibliographic databases. Inform. Serv. Use 4, 1, 37–47. Also reprinted in Willett [1988], pp. 32–41.

COHEN, W. W. 1995a. Learning to classify English text with ILP methods. In Advances in Inductive


Logic Programming, L. De Raedt, ed. IOS Press, Amsterdam, The Netherlands, 124–143.

COHEN, W. W. 1995b. Text categorization and relational learning. In Proceedings of ICML-95, 12th International Conference on Machine Learning (Lake Tahoe, CA, 1995), 124–132.

COHEN, W. W. AND HIRSH, H. 1998. Joins that generalize: text classification using WHIRL. In Proceedings of KDD-98, 4th International Conference on Knowledge Discovery and Data Mining (New York, NY, 1998), 169–173.

COHEN, W. W. AND SINGER, Y. 1999. Context-sensitive learning methods for text categorization. ACM Trans. Inform. Syst. 17, 2, 141–173.

COOPER, W. S. 1995. Some inconsistencies and misnomers in probabilistic information retrieval. ACM Trans. Inform. Syst. 13, 1, 100–111.

CREECY, R. M., MASAND, B. M., SMITH, S. J., AND WALTZ, D. L. 1992. Trading MIPS and memory for knowledge engineering: classifying census returns on the Connection Machine. Commun. ACM 35, 8, 48–63.

CRESTANI, F., LALMAS, M., VAN RIJSBERGEN, C. J., AND CAMPBELL, I. 1998. "Is this document relevant? . . . probably." A survey of probabilistic models in information retrieval. ACM Comput. Surv. 30, 4, 528–552.

DAGAN, I., KAROV, Y., AND ROTH, D. 1997. Mistake-driven learning in text categorization. In Proceedings of EMNLP-97, 2nd Conference on Empirical Methods in Natural Language Processing (Providence, RI, 1997), 55–63.

DEERWESTER, S., DUMAIS, S. T., FURNAS, G. W., LANDAUER, T. K., AND HARSHMAN, R. 1990. Indexing by latent semantic analysis. J. Amer. Soc. Inform. Sci. 41, 6, 391–407.

DENOYER, L., ZARAGOZA, H., AND GALLINARI, P. 2001. HMM-based passage models for document classification and ranking. In Proceedings of ECIR-01, 23rd European Colloquium on Information Retrieval Research (Darmstadt, Germany, 2001).

DIAZ ESTEBAN, A., DE BUENAGA RODRIGUEZ, M., URENA LOPEZ, L. A., AND GARCIA VEGA, M. 1998. Integrating linguistic resources in an uniform way for text classification tasks. In Proceedings of LREC-98, 1st International Conference on Language Resources and Evaluation (Grenada, Spain, 1998), 1197–1204.

DOMINGOS, P. AND PAZZANI, M. J. 1997. On the optimality of the simple Bayesian classifier under zero-one loss. Mach. Learn. 29, 2–3, 103–130.

DRUCKER, H., VAPNIK, V., AND WU, D. 1999. Automatic text categorization and its applications to text retrieval. IEEE Trans. Neural Netw. 10, 5, 1048–1054.

DUMAIS, S. T. AND CHEN, H. 2000. Hierarchical classification of Web content. In Proceedings of SIGIR-00, 23rd ACM International Conference on Research and Development in Information Retrieval (Athens, Greece, 2000), 256–263.

DUMAIS, S. T., PLATT, J., HECKERMAN, D., AND SAHAMI, M. 1998. Inductive learning algorithms and representations for text categorization. In Proceedings of CIKM-98, 7th ACM International Conference on Information and Knowledge Management (Bethesda, MD, 1998), 148–155.

ESCUDERO, G., MARQUEZ, L., AND RIGAU, G. 2000. Boosting applied to word sense disambiguation. In Proceedings of ECML-00, 11th European Conference on Machine Learning (Barcelona, Spain, 2000), 129–141.

FIELD, B. 1975. Towards automatic indexing: automatic assignment of controlled-language indexing and classification from free indexing. J. Document. 31, 4, 246–265.

FORSYTH, R. S. 1999. New directions in text categorization. In Causal Models and Intelligent Data Management, A. Gammerman, ed. Springer, Heidelberg, Germany, 151–185.

FRASCONI, P., SODA, G., AND VULLO, A. 2002. Text categorization for multi-page documents: A hybrid naive Bayes HMM approach. J. Intell. Inform. Syst. 18, 2/3 (March–May), 195–217.

FUHR, N. 1985. A probabilistic model of dictionary-based automatic indexing. In Proceedings of RIAO-85, 1st International Conference "Recherche d'Information Assistee par Ordinateur" (Grenoble, France, 1985), 207–216.

FUHR, N. 1989. Models for retrieval with probabilistic indexing. Inform. Process. Man. 25, 1, 55–72.

FUHR, N. AND BUCKLEY, C. 1991. A probabilistic learning approach for document indexing. ACM Trans. Inform. Syst. 9, 3, 223–248.

FUHR, N., HARTMANN, S., KNORZ, G., LUSTIG, G., SCHWANTNER, M., AND TZERAS, K. 1991. AIR/X—a rule-based multistage indexing system for large subject fields. In Proceedings of RIAO-91, 3rd International Conference "Recherche d'Information Assistee par Ordinateur" (Barcelona, Spain, 1991), 606–623.

FUHR, N. AND KNORZ, G. 1984. Retrieval test evaluation of a rule-based automated indexing (AIR/PHYS). In Proceedings of SIGIR-84, 7th ACM International Conference on Research and Development in Information Retrieval (Cambridge, UK, 1984), 391–408.

FUHR, N. AND PFEIFER, U. 1994. Probabilistic information retrieval as combination of abstraction inductive learning and probabilistic assumptions. ACM Trans. Inform. Syst. 12, 1, 92–115.

FURNKRANZ, J. 1999. Exploiting structural information for text classification on the WWW. In Proceedings of IDA-99, 3rd Symposium on Intelligent Data Analysis (Amsterdam, The Netherlands, 1999), 487–497.

GALAVOTTI, L., SEBASTIANI, F., AND SIMI, M. 2000. Experiments on the use of feature selection and negative evidence in automated text categorization. In Proceedings of ECDL-00, 4th European Conference on Research and


Advanced Technology for Digital Libraries(Lisbon, Portugal, 2000), 59–68.

GALE, W. A., CHURCH, K. W., AND YAROWSKY, D. 1993. A method for disambiguating word senses in a large corpus. Comput. Human. 26, 5, 415–439.

GOVERT, N., LALMAS, M., AND FUHR, N. 1999. A probabilistic description-oriented approach for categorising Web documents. In Proceedings of CIKM-99, 8th ACM International Conference on Information and Knowledge Management (Kansas City, MO, 1999), 475–482.

GRAY, W. A. AND HARLEY, A. J. 1971. Computer-assisted indexing. Inform. Storage Retrieval 7, 4, 167–174.

GUTHRIE, L., WALKER, E., AND GUTHRIE, J. A. 1994. Document classification by machine: theory and practice. In Proceedings of COLING-94, 15th International Conference on Computational Linguistics (Kyoto, Japan, 1994), 1059–1063.

HAYES, P. J., ANDERSEN, P. M., NIRENBURG, I. B., AND SCHMANDT, L. M. 1990. TCS: a shell for content-based text categorization. In Proceedings of CAIA-90, 6th IEEE Conference on Artificial Intelligence Applications (Santa Barbara, CA, 1990), 320–326.

HEAPS, H. 1973. A theory of relevance for automatic document classification. Inform. Control 22, 3, 268–278.

HERSH, W., BUCKLEY, C., LEONE, T., AND HICKMAN, D. 1994. OHSUMED: an interactive retrieval evaluation and new large text collection for research. In Proceedings of SIGIR-94, 17th ACM International Conference on Research and Development in Information Retrieval (Dublin, Ireland, 1994), 192–201.

HULL, D. A. 1994. Improving text retrieval for the routing problem using latent semantic indexing. In Proceedings of SIGIR-94, 17th ACM International Conference on Research and Development in Information Retrieval (Dublin, Ireland, 1994), 282–289.

HULL, D. A., PEDERSEN, J. O., AND SCHUTZE, H. 1996. Method combination for document filtering. In Proceedings of SIGIR-96, 19th ACM International Conference on Research and Development in Information Retrieval (Zurich, Switzerland, 1996), 279–288.

ITTNER, D. J., LEWIS, D. D., AND AHN, D. D. 1995. Text categorization of low quality images. In Proceedings of SDAIR-95, 4th Annual Symposium on Document Analysis and Information Retrieval (Las Vegas, NV, 1995), 301–315.

IWAYAMA, M. AND TOKUNAGA, T. 1995. Cluster-based text categorization: a comparison of category search strategies. In Proceedings of SIGIR-95, 18th ACM International Conference on Research and Development in Information Retrieval (Seattle, WA, 1995), 273–281.

IYER, R. D., LEWIS, D. D., SCHAPIRE, R. E., SINGER, Y., AND SINGHAL, A. 2000. Boosting for document routing. In Proceedings of CIKM-00, 9th ACM International Conference on Information and Knowledge Management (McLean, VA, 2000), 70–77.

JOACHIMS, T. 1997. A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. In Proceedings of ICML-97, 14th International Conference on Machine Learning (Nashville, TN, 1997), 143–151.

JOACHIMS, T. 1998. Text categorization with support vector machines: learning with many relevant features. In Proceedings of ECML-98, 10th European Conference on Machine Learning (Chemnitz, Germany, 1998), 137–142.

JOACHIMS, T. 1999. Transductive inference for text classification using support vector machines. In Proceedings of ICML-99, 16th International Conference on Machine Learning (Bled, Slovenia, 1999), 200–209.

JOACHIMS, T. AND SEBASTIANI, F. 2002. Guest editors' introduction to the special issue on automated text categorization. J. Intell. Inform. Syst. 18, 2/3 (March-May), 103–105.

JOHN, G. H., KOHAVI, R., AND PFLEGER, K. 1994. Irrelevant features and the subset selection problem. In Proceedings of ICML-94, 11th International Conference on Machine Learning (New Brunswick, NJ, 1994), 121–129.

JUNKER, M. AND ABECKER, A. 1997. Exploiting thesaurus knowledge in rule induction for text classification. In Proceedings of RANLP-97, 2nd International Conference on Recent Advances in Natural Language Processing (Tzigov Chark, Bulgaria, 1997), 202–207.

JUNKER, M. AND HOCH, R. 1998. An experimental evaluation of OCR text representations for learning document classifiers. Internat. J. Document Analysis and Recognition 1, 2, 116–122.

KESSLER, B., NUNBERG, G., AND SCHUTZE, H. 1997. Automatic detection of text genre. In Proceedings of ACL-97, 35th Annual Meeting of the Association for Computational Linguistics (Madrid, Spain, 1997), 32–38.

KIM, Y.-H., HAHN, S.-Y., AND ZHANG, B.-T. 2000. Text filtering by boosting naive Bayes classifiers. In Proceedings of SIGIR-00, 23rd ACM International Conference on Research and Development in Information Retrieval (Athens, Greece, 2000), 168–175.

KLINKENBERG, R. AND JOACHIMS, T. 2000. Detecting concept drift with support vector machines. In Proceedings of ICML-00, 17th International Conference on Machine Learning (Stanford, CA, 2000), 487–494.

KNIGHT, K. 1999. Mining online text. Commun. ACM 42, 11, 58–61.

KNORZ, G. 1982. A decision theory approach to optimal automated indexing. In Proceedings of SIGIR-82, 5th ACM International Conference on Research and Development in Information Retrieval (Berlin, Germany, 1982), 174–193.

KOLLER, D. AND SAHAMI, M. 1997. Hierarchically classifying documents using very few words. In Proceedings of ICML-97, 14th International Conference on Machine Learning (Nashville, TN, 1997), 170–178.

KORFHAGE, R. R. 1997. Information Storage and Retrieval. Wiley Computer Publishing, New York, NY.

LAM, S. L. AND LEE, D. L. 1999. Feature reduction for neural network based text categorization. In Proceedings of DASFAA-99, 6th IEEE International Conference on Database Advanced Systems for Advanced Application (Hsinchu, Taiwan, 1999), 195–202.

LAM, W. AND HO, C. Y. 1998. Using a generalized instance set for automatic text categorization. In Proceedings of SIGIR-98, 21st ACM International Conference on Research and Development in Information Retrieval (Melbourne, Australia, 1998), 81–89.

LAM, W., LOW, K. F., AND HO, C. Y. 1997. Using a Bayesian network induction approach for text categorization. In Proceedings of IJCAI-97, 15th International Joint Conference on Artificial Intelligence (Nagoya, Japan, 1997), 745–750.

LAM, W., RUIZ, M. E., AND SRINIVASAN, P. 1999. Automatic text categorization and its applications to text retrieval. IEEE Trans. Knowl. Data Engin. 11, 6, 865–879.

LANG, K. 1995. NEWSWEEDER: learning to filter netnews. In Proceedings of ICML-95, 12th International Conference on Machine Learning (Lake Tahoe, CA, 1995), 331–339.

LARKEY, L. S. 1998. Automatic essay grading using text categorization techniques. In Proceedings of SIGIR-98, 21st ACM International Conference on Research and Development in Information Retrieval (Melbourne, Australia, 1998), 90–95.

LARKEY, L. S. 1999. A patent search and classification system. In Proceedings of DL-99, 4th ACM Conference on Digital Libraries (Berkeley, CA, 1999), 179–187.

LARKEY, L. S. AND CROFT, W. B. 1996. Combining classifiers in text categorization. In Proceedings of SIGIR-96, 19th ACM International Conference on Research and Development in Information Retrieval (Zurich, Switzerland, 1996), 289–297.

LEWIS, D. D. 1992a. An evaluation of phrasal and clustered representations on a text categorization task. In Proceedings of SIGIR-92, 15th ACM International Conference on Research and Development in Information Retrieval (Copenhagen, Denmark, 1992), 37–50.

LEWIS, D. D. 1992b. Representation and Learning in Information Retrieval. Ph.D. thesis, Department of Computer Science, University of Massachusetts, Amherst, MA.

LEWIS, D. D. 1995a. Evaluating and optimizing autonomous text classification systems. In Proceedings of SIGIR-95, 18th ACM International Conference on Research and Development in Information Retrieval (Seattle, WA, 1995), 246–254.

LEWIS, D. D. 1995b. A sequential algorithm for training text classifiers: corrigendum and additional data. SIGIR Forum 29, 2, 13–19.

LEWIS, D. D. 1995c. The TREC-4 filtering track: description and analysis. In Proceedings of TREC-4, 4th Text Retrieval Conference (Gaithersburg, MD, 1995), 165–180.

LEWIS, D. D. 1998. Naive (Bayes) at forty: The independence assumption in information retrieval. In Proceedings of ECML-98, 10th European Conference on Machine Learning (Chemnitz, Germany, 1998), 4–15.

LEWIS, D. D. AND CATLETT, J. 1994. Heterogeneous uncertainty sampling for supervised learning. In Proceedings of ICML-94, 11th International Conference on Machine Learning (New Brunswick, NJ, 1994), 148–156.

LEWIS, D. D. AND GALE, W. A. 1994. A sequential algorithm for training text classifiers. In Proceedings of SIGIR-94, 17th ACM International Conference on Research and Development in Information Retrieval (Dublin, Ireland, 1994), 3–12. See also Lewis [1995b].

LEWIS, D. D. AND HAYES, P. J. 1994. Guest editorial for the special issue on text categorization. ACM Trans. Inform. Syst. 12, 3, 231.

LEWIS, D. D. AND RINGUETTE, M. 1994. A comparison of two learning algorithms for text categorization. In Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval (Las Vegas, NV, 1994), 81–93.

LEWIS, D. D., SCHAPIRE, R. E., CALLAN, J. P., AND PAPKA, R. 1996. Training algorithms for linear text classifiers. In Proceedings of SIGIR-96, 19th ACM International Conference on Research and Development in Information Retrieval (Zurich, Switzerland, 1996), 298–306.

LI, H. AND YAMANISHI, K. 1999. Text classification using ESC-based stochastic decision lists. In Proceedings of CIKM-99, 8th ACM International Conference on Information and Knowledge Management (Kansas City, MO, 1999), 122–130.

LI, Y. H. AND JAIN, A. K. 1998. Classification of text documents. Comput. J. 41, 8, 537–546.

LIDDY, E. D., PAIK, W., AND YU, E. S. 1994. Text categorization for multiple users based on semantic features from a machine-readable dictionary. ACM Trans. Inform. Syst. 12, 3, 278–295.

LIERE, R. AND TADEPALLI, P. 1997. Active learning with committees for text categorization. In Proceedings of AAAI-97, 14th Conference of the American Association for Artificial Intelligence (Providence, RI, 1997), 591–596.

LIM, J. H. 1999. Learnable visual keywords for image classification. In Proceedings of DL-99, 4th ACM Conference on Digital Libraries (Berkeley, CA, 1999), 139–145.

MANNING, C. AND SCHUTZE, H. 1999. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA.

MARON, M. 1961. Automatic indexing: an experimental inquiry. J. Assoc. Comput. Mach. 8, 3, 404–417.

MASAND, B. 1994. Optimising confidence of text classification by evolution of symbolic expressions. In Advances in Genetic Programming, K. E. Kinnear, ed. MIT Press, Cambridge, MA, Chapter 21, 459–476.

MASAND, B., LINOFF, G., AND WALTZ, D. 1992. Classifying news stories using memory-based reasoning. In Proceedings of SIGIR-92, 15th ACM International Conference on Research and Development in Information Retrieval (Copenhagen, Denmark, 1992), 59–65.

MCCALLUM, A. K. AND NIGAM, K. 1998. Employing EM in pool-based active learning for text classification. In Proceedings of ICML-98, 15th International Conference on Machine Learning (Madison, WI, 1998), 350–358.

MCCALLUM, A. K., ROSENFELD, R., MITCHELL, T. M., AND NG, A. Y. 1998. Improving text classification by shrinkage in a hierarchy of classes. In Proceedings of ICML-98, 15th International Conference on Machine Learning (Madison, WI, 1998), 359–367.

MERKL, D. 1998. Text classification with self-organizing maps: Some lessons learned. Neurocomputing 21, 1/3, 61–77.

MITCHELL, T. M. 1996. Machine Learning. McGraw Hill, New York, NY.

MLADENIC, D. 1998. Feature subset selection in text learning. In Proceedings of ECML-98, 10th European Conference on Machine Learning (Chemnitz, Germany, 1998), 95–100.

MLADENIC, D. AND GROBELNIK, M. 1998. Word sequences as features in text-learning. In Proceedings of ERK-98, the Seventh Electrotechnical and Computer Science Conference (Ljubljana, Slovenia, 1998), 145–148.

MOULINIER, I. AND GANASCIA, J.-G. 1996. Applying an existing machine learning algorithm to text categorization. In Connectionist, Statistical, and Symbolic Approaches to Learning for Natural Language Processing, S. Wermter, E. Riloff, and G. Scheler, eds. Springer Verlag, Heidelberg, Germany, 343–354.

MOULINIER, I., RASKINIS, G., AND GANASCIA, J.-G. 1996. Text categorization: a symbolic approach. In Proceedings of SDAIR-96, 5th Annual Symposium on Document Analysis and Information Retrieval (Las Vegas, NV, 1996), 87–99.

MYERS, K., KEARNS, M., SINGH, S., AND WALKER, M. A. 2000. A boosting approach to topic spotting on subdialogues. In Proceedings of ICML-00, 17th International Conference on Machine Learning (Stanford, CA, 2000), 655–662.

NG, H. T., GOH, W. B., AND LOW, K. L. 1997. Feature selection, perceptron learning, and a usability case study for text categorization. In Proceedings of SIGIR-97, 20th ACM International Conference on Research and Development in Information Retrieval (Philadelphia, PA, 1997), 67–73.

NIGAM, K., MCCALLUM, A. K., THRUN, S., AND MITCHELL, T. M. 2000. Text classification from labeled and unlabeled documents using EM. Mach. Learn. 39, 2/3, 103–134.

OH, H.-J., MYAENG, S. H., AND LEE, M.-H. 2000. A practical hypertext categorization method using links and incrementally available class information. In Proceedings of SIGIR-00, 23rd ACM International Conference on Research and Development in Information Retrieval (Athens, Greece, 2000), 264–271.

PAZIENZA, M. T., ed. 1997. Information Extraction. Lecture Notes in Computer Science, Vol. 1299. Springer, Heidelberg, Germany.

RILOFF, E. 1995. Little words can make a big difference for text classification. In Proceedings of SIGIR-95, 18th ACM International Conference on Research and Development in Information Retrieval (Seattle, WA, 1995), 130–136.

RILOFF, E. AND LEHNERT, W. 1994. Information extraction as a basis for high-precision text classification. ACM Trans. Inform. Syst. 12, 3, 296–333.

ROBERTSON, S. E. AND HARDING, P. 1984. Probabilistic automatic indexing by learning from human indexers. J. Document. 40, 4, 264–270.

ROBERTSON, S. E. AND SPARCK JONES, K. 1976. Relevance weighting of search terms. J. Amer. Soc. Inform. Sci. 27, 3, 129–146. Also reprinted in Willett [1988], pp. 143–160.

ROTH, D. 1998. Learning to resolve natural language ambiguities: a unified approach. In Proceedings of AAAI-98, 15th Conference of the American Association for Artificial Intelligence (Madison, WI, 1998), 806–813.

RUIZ, M. E. AND SRINIVASAN, P. 1999. Hierarchical neural networks for text categorization. In Proceedings of SIGIR-99, 22nd ACM International Conference on Research and Development in Information Retrieval (Berkeley, CA, 1999), 281–282.

SABLE, C. L. AND HATZIVASSILOGLOU, V. 2000. Text-based approaches for non-topical image categorization. Internat. J. Dig. Libr. 3, 3, 261–275.

SALTON, G. AND BUCKLEY, C. 1988. Term-weighting approaches in automatic text retrieval. Inform. Process. Man. 24, 5, 513–523. Also reprinted in Sparck Jones and Willett [1997], pp. 323–328.

SALTON, G., WONG, A., AND YANG, C. 1975. A vector space model for automatic indexing. Commun. ACM 18, 11, 613–620. Also reprinted in Sparck Jones and Willett [1997], pp. 273–280.

SARACEVIC, T. 1975. Relevance: a review of and a framework for the thinking on the notion in information science. J. Amer. Soc. Inform. Sci. 26, 6, 321–343. Also reprinted in Sparck Jones and Willett [1997], pp. 143–165.

SCHAPIRE, R. E. AND SINGER, Y. 2000. BoosTexter: a boosting-based system for text categorization. Mach. Learn. 39, 2/3, 135–168.

SCHAPIRE, R. E., SINGER, Y., AND SINGHAL, A. 1998. Boosting and Rocchio applied to text filtering. In Proceedings of SIGIR-98, 21st ACM International Conference on Research and Development in Information Retrieval (Melbourne, Australia, 1998), 215–223.

SCHUTZE, H. 1998. Automatic word sense discrimination. Computat. Ling. 24, 1, 97–124.

SCHUTZE, H., HULL, D. A., AND PEDERSEN, J. O. 1995. A comparison of classifiers and document representations for the routing problem. In Proceedings of SIGIR-95, 18th ACM International Conference on Research and Development in Information Retrieval (Seattle, WA, 1995), 229–237.

SCOTT, S. AND MATWIN, S. 1999. Feature engineering for text classification. In Proceedings of ICML-99, 16th International Conference on Machine Learning (Bled, Slovenia, 1999), 379–388.

SEBASTIANI, F., SPERDUTI, A., AND VALDAMBRINI, N. 2000. An improved boosting algorithm and its application to automated text categorization. In Proceedings of CIKM-00, 9th ACM International Conference on Information and Knowledge Management (McLean, VA, 2000), 78–85.

SINGHAL, A., MITRA, M., AND BUCKLEY, C. 1997. Learning routing queries in a query zone. In Proceedings of SIGIR-97, 20th ACM International Conference on Research and Development in Information Retrieval (Philadelphia, PA, 1997), 25–32.

SINGHAL, A., SALTON, G., MITRA, M., AND BUCKLEY, C. 1996. Document length normalization. Inform. Process. Man. 32, 5, 619–633.

SLONIM, N. AND TISHBY, N. 2001. The power of word clusters for text classification. In Proceedings of ECIR-01, 23rd European Colloquium on Information Retrieval Research (Darmstadt, Germany, 2001).

SPARCK JONES, K. AND WILLETT, P., eds. 1997. Readings in Information Retrieval. Morgan Kaufmann, San Mateo, CA.

TAIRA, H. AND HARUNO, M. 1999. Feature selection in SVM text categorization. In Proceedings of AAAI-99, 16th Conference of the American Association for Artificial Intelligence (Orlando, FL, 1999), 480–486.

TAURITZ, D. R., KOK, J. N., AND SPRINKHUIZEN-KUYPER, I. G. 2000. Adaptive information filtering using evolutionary computation. Inform. Sci. 122, 2–4, 121–140.

TUMER, K. AND GHOSH, J. 1996. Error correlation and error reduction in ensemble classifiers. Connection Sci. 8, 3–4, 385–403.

TZERAS, K. AND HARTMANN, S. 1993. Automatic indexing based on Bayesian inference networks. In Proceedings of SIGIR-93, 16th ACM International Conference on Research and Development in Information Retrieval (Pittsburgh, PA, 1993), 22–34.

VAN RIJSBERGEN, C. J. 1977. A theoretical basis for the use of co-occurrence data in information retrieval. J. Document. 33, 2, 106–119.

VAN RIJSBERGEN, C. J. 1979. Information Retrieval, 2nd ed. Butterworths, London, UK. Available at http://www.dcs.gla.ac.uk/Keith.

WEIGEND, A. S., WIENER, E. D., AND PEDERSEN, J. O. 1999. Exploiting hierarchy in text categorization. Inform. Retr. 1, 3, 193–216.

WEISS, S. M., APTE, C., DAMERAU, F. J., JOHNSON, D. E., OLES, F. J., GOETZ, T., AND HAMPP, T. 1999. Maximizing text-mining performance. IEEE Intell. Syst. 14, 4, 63–69.

WIENER, E. D., PEDERSEN, J. O., AND WEIGEND, A. S. 1995. A neural network approach to topic spotting. In Proceedings of SDAIR-95, 4th Annual Symposium on Document Analysis and Information Retrieval (Las Vegas, NV, 1995), 317–332.

WILLETT, P., ed. 1988. Document Retrieval Systems. Taylor Graham, London, UK.

WONG, J. W., KAN, W.-K., AND YOUNG, G. H. 1996. ACTION: automatic classification for full-text documents. SIGIR Forum 30, 1, 26–41.

YANG, Y. 1994. Expert network: effective and efficient learning from human decisions in text categorisation and retrieval. In Proceedings of SIGIR-94, 17th ACM International Conference on Research and Development in Information Retrieval (Dublin, Ireland, 1994), 13–22.

YANG, Y. 1995. Noise reduction in a statistical approach to text categorization. In Proceedings of SIGIR-95, 18th ACM International Conference on Research and Development in Information Retrieval (Seattle, WA, 1995), 256–263.

YANG, Y. 1999. An evaluation of statistical approaches to text categorization. Inform. Retr. 1, 1–2, 69–90.

YANG, Y. AND CHUTE, C. G. 1994. An example-based mapping method for text categorization and retrieval. ACM Trans. Inform. Syst. 12, 3, 252–277.

YANG, Y. AND LIU, X. 1999. A re-examination of text categorization methods. In Proceedings of SIGIR-99, 22nd ACM International Conference on Research and Development in Information Retrieval (Berkeley, CA, 1999), 42–49.

YANG, Y. AND PEDERSEN, J. O. 1997. A comparative study on feature selection in text categorization. In Proceedings of ICML-97, 14th International Conference on Machine Learning (Nashville, TN, 1997), 412–420.

YANG, Y., SLATTERY, S., AND GHANI, R. 2002. A study of approaches to hypertext categorization. J. Intell. Inform. Syst. 18, 2/3 (March-May), 219–241.

YU, K. L. AND LAM, W. 1998. A new on-line learning algorithm for adaptive text filtering. In Proceedings of CIKM-98, 7th ACM International Conference on Information and Knowledge Management (Bethesda, MD, 1998), 156–160.

Received December 1999; revised February 2001; accepted July 2001
