
A Rough Set Approach to Text Classification

Alexios Chouchoulas ([email protected])

MSc (By Research) in Artificial Intelligence, School of Artificial Intelligence,
Division of Informatics, The University of Edinburgh, 1999.


The illustration on the cover and the header of each chapter depicts a relatively early attempt at classifying documents in different classes based on their content: the IBM Type 82 punched card sorter (mid-Sixties).

The Type 82 Electric Punched Card Sorting Machine groups all cards of similar classification and at the same time arranges such classifications in numerical sequence. The operator places the cards in the feed hopper, sets the sorting brush on the column to be sorted, and presses the starting key. The sorting operation is accomplished at a speed of 650 cards per minute, per column sorted. The card feeding mechanism allows refilling of the card hopper while the sorter is in operation, eliminating stopping of the machine.

The pockets into which the cards are sorted have a capacity of 800 cards and the machine stops when the limit is reached in any pocket. By means of a selecting feature, all cards punched with any hole or holes in a single column may be sorted out while the remaining cards are passed into the reject pocket without disturbing their sequence.

After the card hopper is empty, the sorter continues to run long enough to allow all cards to be sorted into the correct pockets without the necessity for pressing the start key.

Alphabetical sequence sorting can be accomplished with this model by an extra sort, in which case the machine is equipped with alphabetical switch and pocket designations.

The Type 82 Electric Punched Card Sorting Machine can be specified for operation either from alternating or direct current.

    http://www.users.nwark.com/~rcmahq/jclark/cardsrt.htm


    Declaration

I hereby declare that this thesis has been composed by me only, and reports on research work that was my own.

    Alexios Chouchoulas


    Abstract

Automated Information Filtering (IF) and Information Retrieval (IR) systems are acquiring increasing prominence. Unfortunately, most attempts to produce effective IF/IR systems are either expensive or unsuccessful. Systems fall prey to the extremely high dimensionality of the domain. Numerous attempts have been made to reduce this dimensionality. However, many involve oversimplifications and naïve assumptions about the nature of the data, or rely on linguistic aspects like semantics and word lists that cannot possibly be relied upon in multi-cultural domains like the Internet.

This thesis proposes a dimensionality reduction approach based on Rough Set Theory: it makes few assumptions about the nature of the data it processes and can cope with the subtleties of IF/IR datasets. The technique reduces the dimensionality of IF/IR data by 3.5 orders of magnitude, ensuring no useful information is lost.

Rough Sets and IF/IR techniques are described and related work is reviewed; a taxonomy of IF/IR techniques is drawn up to aid in comparing existing approaches. An experimental system applied to E-mail message classification is proposed, designed and implemented. Its results are discussed in detail by verifying the project aims; conclusions are drawn regarding the success of the system and future work is proposed.


    Acknowledgements

Many thanks are due to my grant holder and supervisor, Dr Qiang Shen, whose help and support turned this project from fiction to fact.

    I am also indebted to my parents for agreeing, at exceptionally short notice, to fund my studies.

Finally, I would like to thank Informatics secretarial and support staff and other unseen heroes for their support and understanding.


    Contents

    Declaration v

    Abstract vii

    Acknowledgements ix

    List of Figures xv

    List of Tables xvii

    List of Algorithms xix

    Notation Used xxi

1 Introduction 1
1.1 Growing Volume of Available Information 1
1.2 Solutions to the Dimensionality Bottleneck 1
1.3 Rough Set Theory 2
1.4 Aims and Objectives 3
1.5 Document Structure 3

2 Background 5
2.1 Automated Classification of Textual Documents 5
2.1.1 Text Categorisation 5
2.1.1.1 Vector-Based Techniques 7
2.1.1.2 Rule Based Techniques 8
2.1.1.3 Probabilistic Techniques 9
2.1.2 Information Retrieval 10
2.1.3 Information Filtering 11
2.1.4 A Comparison 13
2.1.5 Dimensionality Reduction for Text Categorisation 14
2.2 Rough Sets 15
2.2.1 Description of Rough Set Theory 16
2.2.2 QuickReduct 20
2.2.3 Verification of QuickReduct 21

3 Related Work 23
3.1 A Taxonomy of Automatic Document Handling Techniques 23
3.2 Work related to Automated Document Handling 27
3.2.1 Amalthaea 27


3.2.2 NewT 29
3.2.3 RIPPER 31
3.2.4 Distributional Clustering for TC 31
3.2.5 Apte-Damerau-Weiss Rule Induction for Text Categorisation 33
3.2.6 SKIPPER 34
3.3 Rough Sets 34
3.3.1 DBRough: Knowledge Discovery in Databases 35
3.3.2 Rough Set-Aided Dimensionality Reduction for Systems Monitoring 35

4 Proposed System 37
4.1 Aims of the Proposed System 37
4.2 An Example Application: E-Mail Categorisation 38
4.3 Design 39
4.4 Language Choice 40
4.5 Implementation 41
4.5.1 Splitting Folders Into Training/Testing Datasets 41
4.5.2 Acquiring and Weighting Keywords 41
4.5.3 Rough Set Attribute Reduction 42
4.5.3.1 Front-End 44
4.5.3.2 Invocation of Haig 45
4.5.3.3 Back-End 46
4.5.4 Testing the Induced Classification Means 46
4.6 Verification of Design Goals 47
4.7 An Example of E-Mail Classification 47

5 Experimental Results 51
5.1 Precision and Recall 51
5.2 Datasets Used 51
5.3 Efficiency Considerations of RSDR 52
5.4 Dimensionality Reduction 53
5.5 Rough Set Reducts' Information Content 54
5.6 Comparative Study of TC Methodologies 55
5.7 Generalisation 57
5.8 Comparison with Conventional TC Approaches 58
5.9 Summary of Conclusions 58

6 Conclusion 59
6.1 Verification of the Project Aims 59
6.2 Discussion 60
6.3 Future Work 61

Bibliography 61

A Reprint: Chouchoulas & Shen, 1999a 69
A.1 Introduction 69
A.2 Background 70
A.2.1 The Rough Set Attribute Reduction Method (RSAR) 70
A.2.2 Text Classification 70
A.2.2.1 Vector Space Model 71
A.2.2.2 Rule-Based Text Classification 71
A.3 The Proposed System 72
A.3.1 Keyword Acquisition 72
A.3.2 Rough Set Attribute Reduction 73
A.3.3 Fuzzy Document Classification 74


A.4 Experimental Results 75
A.5 Conclusion 77

B Reprint: Chouchoulas & Shen, 1999b 79
B.1 Introduction 79
B.2 Background 80
B.2.1 The Rough Set Attribute Reduction Method (RSAR) 80
B.2.2 Text Classification 81
B.2.2.1 Distance-Based Text Classification 81
B.2.2.2 Rule-Based Text Classification 82
B.3 The Proposed System 82
B.3.1 Keyword Acquisition 83
B.3.2 Rough Set Attribute Reduction 84
B.4 Experimental Results 85
B.5 Conclusion 86

C Complete Example Transcript 87
C.1 Example Corpora 87
C.1.1 Corpus 87
C.1.2 Corpus 88
C.2 Running the System 88
C.3 Pre-Reduction High Dimensionality Keyword Set 89
C.4 Pre-Reduction Attribute List/Axis Ordering File 90
C.5 Post-Reduction Attribute List/Axis Ordering File 90
C.6 Post-Reduction Dataset 90

D Tools Used 91

E Glossary 93

Index 97


    List of Figures

1.1 Internet Growth 1981-1999. 2

2.1 Training Stage of a Typical Text Classification System. 6
2.2 Converting keyword/weight sets to weight vectors. 7
2.3 Major constituents of an Information Retrieval System. 11
2.4 Components of an Information Filtering System. 13
2.5 Typical Word Frequency Histogram. 15
2.6 Approximating a circle with rough sets. 18

3.1 The Amalthaea system. 28
3.2 Block diagram of NewT. 30
3.3 Block diagram of Hu's DBROUGH. 35
3.4 Block Diagram of the Water Treatment Plant. 36

4.1 Example of an E-mail message. 38
4.2 Block diagram of the Proposed System. 40
4.3 Front-end to HAIG. 43
4.4 Back-end to HAIG. 44
4.5 Fuzzification of the keyword weights. 45
4.6 E-Mail corpus 48
4.7 E-Mail corpus 49

5.1 RSDR's Runtime Efficiency. 53
5.2 Verifying the information content of the Rough Set reduct. 54
5.3 Comparison of all combinations of keyword acquisition and inference methods. 56
5.4 The system's ability to generalise given different train/test partition ratios. 57

A.1 Block diagram of the integrated system. 73
A.2 Fuzzification of the normalised keyword weights. 74
A.3 Minimum, average and maximum dimensionality reduction. 75
A.4 Average classification accuracy for the three inference modules. 76

B.1 Data flow through the system. 83
B.2 Discernibility after dimensionality reduction. 85


    List of Tables

2.1 Rough Set example data table 16
2.2 Discernibility of Dataset Objects. 19

3.1 Comparison of Reviewed Methodologies. 27

5.1 Average Dimensionality Reduction 53

A.1 Dataset produced from (A.5) and (A.6). 74

B.1 Dataset produced from (B.12) and (B.13). 84
B.2 Average dimensionality reduction in orders of magnitude. 86


    List of Algorithms

2.1 The QUICKREDUCT algorithm. 20
3.1 The UPDATED DISTRIBUTIONAL CLUSTERING algorithm. 32


    Notation Used

    Acronyms

    [The biggest problem in computing in the Nineties will be that] there are only 17,000 three-letter acronyms.

    Paul Boutin

    TLA Meaning

    ADW Apte-Damerau-Weiss rule induction

    BEM Boolean Exact Model

    BIM Boolean Inexact Model

    CPU Central Processing Unit

    DBMS DataBase Management System

    FRM Fuzzy Relevance Metric

    GA Genetic Algorithm

    IF Information Filtering

    IR Information Retrieval

    KLD Kullback-Leibler Divergence

    LAN Local Area Network

    ML Machine Learning

    RDBMS Relational DBMS

    RFC Request For Comments (Internet standard document)

    RS Rough Sets

    RSAR Rough Set Attribute (dimensionality) Reduction (same as RSDR)

    RSDR Rough Set Dimensionality Reduction (same as RSAR)

    TA Transport Agent

    TLA Three Letter Acronym

    TP Transport Protocol

    UCA Unsolicited Commercial E-mail (also known as spam)

    UDC Updated Distributional Clustering

    VSM Vector Space Model

    Algorithmic Notation

Typographic notation used in algorithms is outlined in the next section. The mathematical notation listed above is also used in algorithms.


Symbol Meaning

A[i] Array indexing; A[i] denotes the i-th element of array A. Some arrays are associative, which means i may be symbolic rather than numerical.

{} The empty set.

← Assignment: a ← b assigns the value b to the variable a. The left-hand side may be any symbol that can hold a value (an array element, a set, a vector, etc).

    Typographical Notation

Italics are used for emphasis.

Sans-Serif is used for Internet addresses (host names, E-Mail addresses and URLs), as well as for section headings.

    Typewriter is used for keyboard input, computer output and pieces of code.

    Bibliography references are usually given in author (year) format. The bibliography list is on p. 61.

SMALL-CAPITALS denote the name of an algorithm or function. Some local variables with long names are also printed like this.


1 Introduction

    1.1 Growing Volume of Available Information

It is a well known (and often reiterated) fact that the volume of electronically stored information increases exponentially [Cohen, 1996a, Apte et al., 1994]. Even discounting machine-readable data, many of us own more documents than we can read in our respective lifetimes. To intensify this inundation further, the advent of the Internet as a mainstream buzzword did not slow its own exponential growth. As shown in figure 1.1 on page 2, 1998 saw approximately 1,000 times as many Internet hosts as in 1988 [Lottor, 1992, Internet Software Consortium, 1999]. Adding to that the equally exponential progress of digital mass storage media makes for an extremely rapid increase in the total volume of information.

Sorting through even the tiniest fraction of this information is a truly nightmarish prospect for any human being. Automated Information Filtering (IF) and Information Retrieval (IR) systems are therefore acquiring rapidly increasing prominence as automated aids in quickly sifting through information. Unfortunately, most attempts to produce effective information filters trade off between efficiency/effectiveness and cost/time/space. Digital Equipment Corporation's Altavista [1995], one of the most complete IR systems on the World Wide Web, is very effective only because it has extremely powerful distributed hardware and astronomical amounts of bandwidth with which it both serves users and indexes approximately 140 million documents every month.

As helpful as centralised, world-wide IR systems are, it is imperative to develop good IF/IR systems for the individual user. However, effective systems have high requirements in terms of computing power. In most techniques involving text categorisation (TC, part of both IF and IR systems), each document is described by a vector of extremely high dimensionality, typically one value per word or pair of words in the document [van Rijsbergen, 1979]. The vector ordinates are used as preconditions to rules or other similarity metrics which decide what category the document belongs to. Such vectors typically exist in spaces with several tens of thousands of dimensions [Joachims, 1998]. This makes effective TC an extremely hard, if not intractable, problem for even the most powerful computers, unless the task is simplified. The use of the cosine of the angle between two vectors as a common comparison metric [Moukas and Maes, 1998] further increases the number of operations needed to categorise one document.
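To make the cost concrete, the following Python sketch (an illustration only; the representation and names are assumptions, not code from the thesis) compares two documents by the cosine of the angle between their term-weight vectors. Every single comparison touches the terms the vectors share, so the work per comparison grows with the size of the keyword vocabulary.

```python
import math

def cosine_similarity(doc_a, doc_b):
    """Cosine of the angle between two sparse term-weight vectors,
    each given as a {term: weight} dictionary."""
    dot = sum(w * doc_b.get(term, 0.0) for term, w in doc_a.items())
    norm_a = math.sqrt(sum(w * w for w in doc_a.values()))
    norm_b = math.sqrt(sum(w * w for w in doc_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# With tens of thousands of keyword dimensions, every document/query
# comparison repeats this work, which is the bottleneck described above.
print(cosine_similarity({"rough": 2.0, "set": 1.0}, {"rough": 1.0, "card": 3.0}))
```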

    1.2 Solutions to the Dimensionality Bottleneck

    Three solutions to this problem can be identified:

    Wait until the state of the art is suitably powerful to allow better implementation of TC systems.


[Figure: the number of hosts on the Internet plotted against year (1982-1998), on a logarithmic scale from 1,000 to 10,000,000 hosts.]

Figure 1.1: The number of Internet hosts has increased exponentially since 1981 [Lottor, 1992, Internet Software Consortium, 1999].

At closer examination, this is far too naïve an approach. It is trivial to observe that the volume of electronically available information grows faster than the capabilities of computer hardware.

In 1981, the 213 hosts comprising the ARPANet could be listed on a few sheets of paper [Lottor, 1992], easily read by a human. Today, the list of all 56.2 million host names would have to be stored on a few CD-ROMs [Internet Software Consortium, 1999]. Even assuming that mass storage has not increased in capacity over the past years, this fact makes TC appear chimeric.

Abandon the concept altogether: clearly a non-option, as more and more facets of everyday life become dependent on electronically stored information.

Tackle the bottleneck by better design of TC systems. This is the preferred method: careful study of how TC works can help to simplify it to the point where it escapes the dimensionality problem. To date, many such attempts have been made, with varying degrees of success. This work concerns itself with a novel technique to dramatically reduce the dimensionality of TC datasets without reducing their information content.

    1.3 Rough Set Theory

Zdzisław Pawlak introduced Rough Set (RS) theory around 1982 [Pawlak, 1982]. Rough Set theory is a formal mathematical tool that can be applied, among other things, to reducing the dimensionality of datasets [Shen and Chouchoulas, 1999a, Pawlak, 1991] by providing a measure of the information content of datasets with respect to a certain classification.

Rough Sets have already been applied to a very wide variety of application domains with satisfactory results [Shen and Chouchoulas, 1999b, Martienne and Quafafou, 1998, Komorowski et al., 1997, Stepaniuk, 1997, Keen and Rajasekar, 1994, Das-Gupta, 1988]. Different facets of the theory can aid in identifying information-rich portions of datasets, as well as the equivalence relations between attributes in datasets.


Applying RS to TC might provide much needed help in locating those parts of TC datasets that provide the necessary information for the application domain, hopefully simplifying the task and reducing the amount of data to be handled by text classifiers. Given corpora of documents and a set of examples of pre-classified documents, the RS-based techniques are likely to locate a spanning set of co-ordinate keywords. As a result, the dimensionality of the keyword space can be reduced.

    1.4 Aims and Objectives

    This thesis will attempt to meet the following goals:

Investigate the applicability of RS theory to the TC application domain. Rough Set theory can theoretically be applied to easing TC systems past the dimensionality bottleneck, but this hypothesis needs to be tested.

Quantify this applicability with respect to various different TC techniques. There are numerous such techniques and Rough Set dimensionality reduction may have different effects on each.

Design a modular framework for simplifying TC tasks. Although this is a proof-of-concept investigation and not an application-oriented project, the proposed system's co-operation with other, existing systems should be as smooth as possible.

Using this framework, produce an integrated system that significantly reduces dimensionality in the TC domain.

Ensure that accuracy and TC quality do not drop. Theoretical discussions of RS show that the technique will attempt to remove no information important to the classification at hand. This will have to be assessed in the context of TC, since this application domain is much more complex than most applications of RS to date.

Be able to generalise given a minimum of training data, either because there is a shortage of such information or simply to accelerate training.

    1.5 Document Structure

This thesis attempts to be as self-contained as possible. To that end, the background of Information Filtering/Retrieval, Text Categorisation and Rough Sets is discussed in Chapter 2. This includes high level descriptions followed by more detailed descriptions of the techniques. The reader's expertise in any of these areas is not presumed.

Work related to either of this project's two facets (Text Categorisation and Dimensionality Reduction) is reviewed and commented upon in Chapter 3 and taxonomies of TC and dimensionality reduction are drawn up. This part of the text assumes knowledge of the background discussion of the previous chapter.

Based on the first two chapters, the text goes on to a detailed description of the proposed integrated system, found in Chapter 4. The discussion includes design issues and describes the implementation of the proof-of-concept system. This chapter also includes a somewhat formalised example of E-mail classification to illustrate the workings of the system.

This is in turn followed by a presentation and discussion of experiments and experimental results in Chapter 5. The conducted experiments attempt to verify the aims of the project, as outlined above.

Chapter 6 concludes the thesis, assessing the success of the overall work and tentatively proposing future enhancements to deal with this technique's weak points.


2 Background

This chapter discusses the theoretical background behind the various methods of automated classification of documents, as well as Rough Set theory and its application to reducing the dimensionality of datasets.

    2.1 Automated Classification of Textual Documents

There are two major paradigms for the automated handling of documents: Information Retrieval (discussed in section 2.1.2 on page 10) and Information Filtering (described in section 2.1.3 on page 11). Both IF and IR depend on Text Categorisation to perform their respective tasks. The following sections discuss TC, IR and IF, presenting a comparison between IR and IF (section 2.1.4 on page 13), and discussing existing dimensionality reduction approaches used in these methods (section 2.1.5 on page 14).

    2.1.1 Text Categorisation

Moulinier [1996] asserts that TC is the intersection between Machine Learning (ML) and Information Retrieval (IR), since it applies ML techniques for IR purposes.

Whatever the differences between IF and IR, both depend on TC modules to perform their duties. Text Classification is an inherently inductive process. It comprises two distinct stages, like most ML techniques:

Training. Given textual data and an application-dependent amount of additional knowledge, it massages the data into a form suitable for induction, performs induction and outputs a means of classifying data. These classification means may comprise anything that can later be used by another system to perform the classification: depending on the method used, means may include rule bases [Cohen, 1996a], document vectors [Moukas and Maes, 1998], neural networks [Kwok, 1995] or fuzzy classification rules [Koczy et al., 1999, Chouchoulas and Shen, 1999a].

Classification. Given previously unseen documents and the classification means produced at the previous stage, this stage utilises the classification means to categorise the documents.

Pure TC applications may use a number of different classes against which documents are matched. Information Filtering/Retrieval applications mostly use a simplified Boolean classification, where documents are either relevant or irrelevant.
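The two-stage structure can be made concrete with a minimal Python sketch (purely illustrative; the function names and the toy word-overlap classifier are assumptions, not the system described later in this thesis). Training consumes labelled documents and emits a classification means, which is then applied to unseen text.

```python
from typing import Callable, Iterable, Tuple

# The "classification means" may be a rule base, prototype vectors, a
# probability table, etc.; here it is simply a function from document to class.
ClassificationMeans = Callable[[str], str]

def train(corpus: Iterable[Tuple[str, str]]) -> ClassificationMeans:
    """Induce a toy classification means from (document, class) pairs:
    memorise the words seen per class, then pick the class with the
    largest word overlap at classification time."""
    vocabulary: dict = {}
    for text, label in corpus:
        vocabulary.setdefault(label, set()).update(text.lower().split())

    def classify(document: str) -> str:
        words = set(document.lower().split())
        return max(vocabulary, key=lambda label: len(words & vocabulary[label]))

    return classify

means = train([("cheap offer buy now", "irrelevant"),
               ("meeting agenda for monday", "relevant")])
print(means("agenda for the next meeting"))   # -> relevant
```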

The additional knowledge mentioned above includes a range of information on the form and structure of the textual data as well as instructions on how to handle text. This is used to extract the initial representation of the text.


[Figure: block diagram in which Textual Data and Additional Knowledge feed an Extract Initial Representation step, followed by Dimensionality Reduction to a Final Text Representation, and Induction producing the Classification Means.]

Figure 2.1: A block diagram of the training stage of a typical text classification system according to Moulinier [1996].

It almost always involves the location of keywords (terms) for the input documents, maybe augmented by other metrics (e.g. an indication of the size or age of the document). Locating the document terms is a difficult task in itself. Most techniques involve frequency-based metrics to judge the usefulness of a term, dropping those that are either too uncommon or too common. The uncommon ones are deemed uncharacteristic of the text; those terms that are too common are invariably irrelevant connective words or phrases that would be both abundant and irrelevant in any similar document.
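A minimal sketch of such frequency-based term selection follows (illustrative only; the thresholds and names are assumptions rather than values used in this project): terms occurring too rarely are dropped as uncharacteristic, and terms making up too large a fraction of the document are dropped as connectives.

```python
from collections import Counter
import re

def candidate_keywords(text, min_count=2, max_ratio=0.05):
    """Keep terms that are neither too rare nor too common.

    min_count: terms seen fewer times than this are deemed uncharacteristic.
    max_ratio: terms forming more than this fraction of the document are
               treated as connectives.  Both thresholds are illustrative.
    """
    terms = re.findall(r"[a-z']+", text.lower())
    counts = Counter(terms)
    total = len(terms)
    return {term: count for term, count in counts.items()
            if count >= min_count and count / total <= max_ratio}
```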

In most cases, this approach leads to extremely high numbers of terms for each document: typically in the tens of thousands. Inducing classification means using a large number of, for instance, document vectors in R^(10^5) may easily lead to intractability.

Thus, most TC approaches incorporate a means of dimensionality reduction. This may range from approaches such as word stemming (using only the stems of words, thus reducing the total number of keywords), to using thesauri or semantic networks to reduce the vocabulary. Language-independent approaches borrow from information theory to rank terms by their entropy. There are numerous different algorithms to do this, both in the symbolic and non-symbolic domains. A more thorough discussion is given in section 2.1.5 on page 14.

The induction step itself relies on the additional knowledge and the final text representation to produce the classification means for classifying the documents. Again, there are numerous choices of induction algorithms. Even techniques that are not normally associated with induction tasks, such as Genetic Algorithms [Sheth, 1994, Chouchoulas, 1997], may be used with success. Research in this subject has been considerable, with almost all well-known techniques applied to this task.

A block diagram of a typical TC system is shown in figure 2.1 on page 6. Three major branches of TC can be identified:


[Figure: two example keyword sets, Set A = {A:8, B:1, C:1, G:5, H:2, I:7, J:7} and Set B = {A:2, B:9, E:3, F:1, G:6, H:1}, mapped onto weight vectors under the fixed axis ordering A to J.]

Figure 2.2: Converting sets of (term, weight) pairs to weight vectors. Here, A1 is term A with a weight of 1.

    Vector-based techniques;

    Rule-based methods; and

    Probabilistic approaches.

    2.1.1.1 Vector-Based Techniques

The Vector Space Model (VSM) is by far the most popular TC technique [Moukas and Maes, 1998, van Rijsbergen, 1979, Salton et al., 1975]. It maps documents and queries onto multi-dimensional space, where they can be dealt with by means of conventional geometric manipulations.

The set of axes of this document space is the set of all keywords located for all documents dealt with by the system. As such, the dimensionality of VSM techniques is quite volatile. In most cases, it is also extremely high: tens of thousands of dimensions, usually.

Mapping a document to a vector involves obtaining a set of keywords for that document. Numerous techniques aim to provide good keyword sets for document collections. A number of frequency-based metrics are defined. The set consists of pairs of keywords and their associated weight, a certainty factor ranking their association with the document in question.

The keyword set is then converted to a vector by means of an arbitrary but fixed ordering of terms (keywords) and the universal set of keywords for this application.

Algorithms that apply to multidimensional space (such as clustering) can be applied to documents mapped onto document space. Queries are represented in the same manner, which allows for easy comparison between document and query vectors: the scalar product of the two vectors is used as a similarity metric.
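The conversion of figure 2.2 and the scalar-product comparison can be sketched as follows (a minimal illustration using the keyword sets of that figure; this is not the thesis implementation): a fixed axis ordering over the universal keyword set turns each {term: weight} set into a vector, with zeros for absent terms.

```python
def to_vector(keyword_weights, axis_ordering):
    """Map a {term: weight} set onto a fixed axis ordering; absent terms get 0."""
    return [keyword_weights.get(term, 0.0) for term in axis_ordering]

def similarity(u, v):
    """Scalar product of two weight vectors, used as the similarity metric."""
    return sum(a * b for a, b in zip(u, v))

axes = list("ABCDEFGHIJ")   # arbitrary but fixed ordering of the keyword universe
set_a = {"A": 8, "B": 1, "C": 1, "G": 5, "H": 2, "I": 7, "J": 7}
set_b = {"A": 2, "B": 9, "E": 3, "F": 1, "G": 6, "H": 1}

vector_a, vector_b = to_vector(set_a, axes), to_vector(set_b, axes)
print(similarity(vector_a, vector_b))   # 8*2 + 1*9 + 5*6 + 2*1 = 57
```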

    Despite the wide following of the VSM, a number of drawbacks can easily be identified:

Perhaps the most important one is dimensionality. It renders many algorithms unusable with current technology and tends to require very large amounts of storage space.

The technique implicitly assumes that, since the keywords are used as axes in document space, they are orthogonal. Orthogonality is used here in both the geometric and semantic sense: each keyword must, optimally, describe a concept completely different from all the other keywords. Similar keywords should not be used as separate dimensions. This avoids redundancy in the same way as having orthogonal axes avoids redundancy and correlation between axes in conventional vector spaces.

In real life, this constraint is impossible to satisfy due to the intricacies of natural language (synonyms, context, semantics, cultural bias, et cetera, as discussed by Gazdar and Mellish 1989), although there are techniques that obtain an approximation of orthogonality. However, despite violating the constraint, TC applications based on the VSM seem to work quite well [Wong et al., 1987].

    2.1.1.2 Rule Based Techniques

As expected, this paradigm uses rules rather than prototypical document vectors to control the classification task. Rule-based methods may, at a first glance, seem dated and somewhat naïve, but they are very established in the literature, offering accuracies similar to those exhibited by the more advanced approaches, without the higher computational requirements of the VSM [Cohen, 1996a, Apte et al., 1994].

Classification means for rule-based approaches comprise sets of rules and an associated classifier. Each rule provides a set of preconditions and a decision which gives the class of documents matching the preconditions.

Preconditions may be in a number of different forms. Boolean matching may be used, in which case the preconditions check for the existence of certain keywords in texts [Apte et al., 1994]. Cohen [1996a] argues that, although this method appears too simplistic, it manages almost the same accuracy as more advanced techniques, while requiring a fraction of the processing power that the VSM needs.

Other rule-based methods may use preconditions based on the frequency of keywords, with an exact or approximative similarity metric between rules and documents to be classified.

Fuzzy rule preconditions may also be used, using membership in fuzzy sets [Zadeh, 1965, Kasabov, 1996, Cox, 1994] as the criterion.

A number of methods exist to assess how a document matches a rule. A purely Boolean paradigm (also known as the Boolean Exact Model) would be to evaluate the Boolean conjunction of all the preconditions, giving a true/false answer for each rule. The first rule to evaluate to true is used to provide the classification. To resolve inconsistencies, all rules may be applied to ensure that any rules that match this document agree in classification. Disagreements cause the document to be deemed undecidable, in which case a fall-back is used (the document is assigned to a special undecided class, or presented to the user for a more educated opinion). Depending on the application, it may be important to vary the number of such undecidable documents to minimise the need for human supervision (high threshold of inconsistencies) or ensure that there are virtually no misclassifications (low inconsistency threshold).
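A sketch of the Boolean Exact Model in its agreement-checking form is given below (an illustration; the rule representation and names are assumed): a rule fires when all of its precondition keywords occur in the document, and disagreement among firing rules marks the document undecidable.

```python
def bem_classify(document_terms, rules, fallback="undecided"):
    """Boolean Exact Model (agreement variant): rules is a list of
    (precondition_keywords, decision_class) pairs; document_terms is the
    set of keywords present in the document."""
    decisions = {decision for preconditions, decision in rules
                 if preconditions <= document_terms}   # all preconditions present
    if len(decisions) == 1:
        return decisions.pop()
    return fallback   # no rule fired, or firing rules disagree

rules = [({"rough", "set"}, "theory"),
         ({"punched", "card"}, "hardware")]
print(bem_classify({"rough", "set", "approximation"}, rules))   # -> theory
```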

Another matching scheme involves scoring, where the rule with the most matched preconditions wins the contest and classifies the document. If the set of firing rules contains more than one rule with the same score, all firing rules have to agree on the classification, otherwise the classification is again undecidable. This assumes that all rules have the same number of preconditions, otherwise the scoring metric has to be revised to take this into account by, for instance, normalisation.

Fuzzy rules for text classification dictate the use of fuzzy reasoning to obtain the matching rule [Chouchoulas and Shen, 1999a]: all precondition (fuzzy set) memberships are evaluated. The fuzzy AND operator is applied (evaluating to the minimum membership value). The rule with the maximum such number (or certainty factor) wins and, if it is the only one, classifies the document. A set of firing rules is employed here, too, as in the scoring technique above. An additional constant ε may be used to specify the neighbourhood within which firing rules are considered to have the same certainty factor. Thus, the set of firing rules contains not only the best rule, but all rules within ε of the best one. Inconsistencies are detected within this firing set. The use of ε allows the user to control the classifier's sensitivity to inconsistencies. The higher the number, the fewer documents are marked undecidable, but misclassifications may increase.
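The fuzzy matching scheme just described can be sketched as follows (a hypothetical illustration; the rule format and membership values are assumed, and epsilon stands for the tolerance constant mentioned above). Each rule's certainty factor is the minimum of its precondition memberships, the firing set holds every rule within epsilon of the best certainty factor, and disagreement inside the firing set yields an undecidable document.

```python
def fuzzy_classify(memberships, rules, epsilon=0.05, fallback="undecided"):
    """memberships maps each fuzzy precondition to its membership value for
    this document; rules is a non-empty list of (precondition_keys,
    decision_class) pairs."""
    # Fuzzy AND: a rule's certainty factor is its minimum precondition membership.
    scored = [(min(memberships.get(p, 0.0) for p in preconditions), decision)
              for preconditions, decision in rules]
    best = max(certainty for certainty, _ in scored)
    # Firing set: every rule within epsilon of the best certainty factor.
    firing = {decision for certainty, decision in scored
              if best - certainty <= epsilon}
    return firing.pop() if len(firing) == 1 else fallback
```

Raising epsilon widens the firing set, so fewer documents are marked undecidable at the cost of more potential misclassifications, which is exactly the trade-off noted above.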

Finally, decision trees [Quinlan, 1993], either crisp (conventional) or fuzzy, may be induced by certain systems. These hierarchical rule bases are intuitively understood by most humans and are particularly efficient both in their creation and interpretation. Decision trees are not discussed here in great detail due to their popularity. For a more involved discussion, the interested reader should consult the work of Quinlan [1993].

Rule-based techniques have two advantages of importance in the context of TC: their classification means are, in most cases, easily readable and understood by humans, who could possibly even change them for fine-tuning or improvement; and the requirements of rule-based techniques are very low in comparison to those of more complex methods. Their disadvantages typically lie with rule induction: various such techniques exist, such as those by Lozowski et al. [1996], but a number of them, including the cited one, suffer from high computational complexity. However, Rough Set-based dimensionality reduction has been proven to greatly aid rule induction by such algorithms [Shen and Chouchoulas, 1999a]. Even combinations of multiple, simpler dimensionality reduction techniques are effective, as witnessed in the work of Apte et al. [1994] (a more detailed discussion of such techniques is in section 2.1.5 on page 14).

2.1.1.3 Probabilistic Techniques

This family of classification methods employs probability theory to perform TC (detailed discussions may be found in Kalt 1998, Riloff and Lehnert 1994, van Rijsbergen 1979). It is in many respects similar to the two other approaches mentioned. To classify a document D into a category within a set of categories {C1, C2, ..., Cn}, probabilistic classifiers evaluate the conditional probability P(Ci|D), where 1 ≤ i ≤ n. The category with the highest probability is assigned to the document.

This may, at first glance, appear to be based on guesswork, but is not: like all ML techniques, probabilistic classifiers are trained by minimising an error metric. In the case of probabilistic classifiers, this error metric is the probability of an erroneous classification. If only two classes are used, Cr (relevant) and Ci (irrelevant), then the probability of document D being misclassified is:

P(error|D) = P(Cr|D) if the classifier decides D belongs to Ci;
P(error|D) = P(Ci|D) if the classifier decides D belongs to Cr.     (2.1)

    This gives four possible outcomes:

1. Deciding Cr when Cr is the real class (correct).
2. Deciding Cr when Ci is the real class (wrong).
3. Deciding Ci when Cr is the real class (wrong).
4. Deciding Ci when Ci is the real class (correct).

To further minimise errors, each of the above conditions is assigned a cost. The higher the cost, the higher the conditional risk of an erroneous classification. Conditional risk is evaluated for each possible decision:

Deciding Cr:  Rr = c1 P(Cr|D) + c2 P(Ci|D)     (2.2)
Deciding Ci:  Ri = c3 P(Cr|D) + c4 P(Ci|D)     (2.3)

Correct classifications incur no cost, so typically c1 = c4 = 0. For simplicity and symmetry, the remaining costs are usually (but by no means necessarily) c2 = c3 = 1. For D to be relevant (i.e. belong to the Cr class), the following Bayesian condition must hold: P(Cr|D) > P(Ci|D). Thus, to minimise the risk of misclassification, the following must hold:

(c3 - c1) P(Cr|D) > (c2 - c4) P(Ci|D)     (2.4)


By using the Bayes rule for conditional probability, this is rewritten as:

P(B|A) = P(A|B) P(B) / P(A)     (2.5)

By (2.4) and (2.5): (c3 - c1) P(D|Cr) P(Cr) > (c2 - c4) P(D|Ci) P(Ci)     (2.6)

The odds ratio is then:

P(D|Cr) / P(D|Ci) > [(c2 - c4) P(Ci)] / [(c3 - c1) P(Cr)]     (2.7)

An odds ratio threshold may now be established, over which D will be considered to belong to class Cr.

However, documents are not monolithic: they comprise a number of keywords or terms. Probabilistic classification treats documents as binary vectors, each ordinate of which represents the presence or absence of a certain keyword (much like in Boolean classification above). To evaluate P(D|Cr) and P(D|Ci), the technique estimates the probabilities of individual keywords appearing in documents belonging in either Cr or Ci.
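A sketch of this classifier is shown below (an illustration that additionally assumes keyword independence and smoothed probabilities, neither of which is stated explicitly in the text; all names and probability tables are invented for the example). Per-keyword conditional probabilities give log P(D|Cr) - log P(D|Ci) for a binary keyword vector, and the decision applies the odds-ratio condition (2.7).

```python
import math

def decide_relevant(present_keywords, p_given_rel, p_given_irr,
                    prior_rel, prior_irr, c1=0.0, c2=1.0, c3=1.0, c4=0.0):
    """Apply the odds-ratio condition (2.7) to a binary keyword vector.

    present_keywords: set of keywords found in document D.
    p_given_rel / p_given_irr: P(keyword present | Cr) and P(keyword present | Ci),
    assumed smoothed away from 0 and 1.  Keywords are treated as independent.
    """
    log_odds = 0.0   # accumulates log P(D|Cr) - log P(D|Ci)
    for term in set(p_given_rel) | set(p_given_irr):
        p_rel = p_given_rel.get(term, 0.5)
        p_irr = p_given_irr.get(term, 0.5)
        if term in present_keywords:
            log_odds += math.log(p_rel) - math.log(p_irr)
        else:
            log_odds += math.log(1.0 - p_rel) - math.log(1.0 - p_irr)
    threshold = math.log(((c2 - c4) * prior_irr) / ((c3 - c1) * prior_rel))
    return log_odds > threshold
```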

A very important advantage of this technique and other, similar ones is efficiency: since calculating probabilities is, generally speaking, an easy task for a computing device, probabilistic classification is a quick method. However, it lacks an important aspect: intuitiveness. It is difficult to understand the semantic significance of large tables of conditional probabilities. Additionally it is, in some cases, difficult to convince oneself that this technique works, especially since the intuitive, popular understanding of probability involves equiprobable randomness. However, probabilistic methods are acquiring wide interest as a feasible solution to efficiency problems in ML tasks.

    2.1.2 Information Retrieval

Information Retrieval (IR) concerns itself with the indexing and querying of large document databases. It is an old practice, as any librarian will know. The problem has attracted increasing attention since the 1940s. Computer-based IR has existed for a few decades now, but it is the information explosion of the Nineties that finally put it in a very different perspective [van Rijsbergen, 1979].

Most mentions of IR in the literature deal with document databases, although any type of data may be accessed. This oversight is easily excused, since:

the need to index and query gigantic collections of textual documents is by far larger than for any other single form of data available [van Rijsbergen, 1979]; and

other types of data, for instance images or software, are easier to handle because of their structured, strict form.[1]

Hence, this dissertation refers to the object of both IF and IR techniques as a document, despite the techniques' generic nature. Real-world examples of IR systems are on-line (and older, card-based) library catalogues and search engines on the World Wide Web, such as Altavista [1995] or Lycos [1995].

In terms of the mechanics of IR (shown in figure 2.3 on page 11), the centre of the application domain is a database and its attached database manager system (DBMS). The DBMS allows the usual database operations, such as storage, indexing, retrieval and deletion of documents. Indexing the database uses the text classification module to create the aforementioned classification means. In this case, these means comprise indices of documents in some internal, system-specific format.

Querying the database uses the document indices and a suitable user interface to formulate queries. The Text Classification module may sometimes also be involved in query formulation.

[1] Or, alternatively, they are out of the reach of the combination of current techniques and current computer hardware, so there is little successful work in indexing them.


[Figure: block diagram showing the Database and Database Manager (add, delete, query, etc.), the Text Classification module and its Classification Means, and the Indexing and Querying paths linking documents and the User.]

    Figure 2.3: Major constituents of an Information Retrieval System.

The queries are submitted to the DBMS. Their results are translated to human readable form and presented to the user through the user interface. Most modern IR systems employ some means of feedback so that the user can refine or update the query. Altavista's (now discontinued) refine query feature is an example of this.

According to van Rijsbergen [1979], one somewhat purist school of thought asserts that true IR systems share one characteristic: despite their name, they do not retrieve documents, they merely provide references in response to queries. Using a library catalogue is indeed like this, but the distinction between a reference to a document and the document itself is becoming more and more blurred. On the World Wide Web, for instance, locating a document is almost entirely equivalent to obtaining it. A search engine may not serve the documents it indexes, but it is trivial to obtain such documents by following the provided hypertext links.

Success in IR systems is judged by their average response time to queries and the size of their indices relative to the original database size. Efficiency is an easily solved problem for large IR engines, through the use of parallel or distributed computing. Index size is still a problem, since naïve approaches can easily end up with indices as large as the original databases (or even larger). In such cases, dimensionality reduction can be particularly effective by limiting the number of index terms per document indexed.

    2.1.3 Information Filtering

Where IR concerns itself with the indexing of documents and the querying of document databases, Information Filtering deals with the selection and dissemination of documents (or data in general) to groups or individuals.

Philosophically speaking, IF is a function inherent to all organisms: it is essential to focus one's attention on only part of the infinite amount of information present in the physical world. The same observation can be made even when the domain is drastically specialised to only cover textual information processed by humans. The average person already possesses more information in the form of books, compact disks, et cetera than they can process in their lifetime. More rigorous filtering is required for extremely rich domains such as the Internet.

In terms of mechanics, an IF system (as shown in figure 2.4 on page 13) attaches itself to a stream of information. Information reaches the heart of the system, the Text Classification module, through a Transport Agent (TA) that implements a Transport Protocol (TP). Data can flow through the system along two major paths:

1. The training path uses sample documents and the TC module to obtain classification means for the domain. This is clearly analogous to the indexing operation on IR systems.

This function of an IF system does not usually require the presence or supervision of a user, and so it typically lacks a user interface. In fact, it is not uncommon for the classification means to be created using lengthy batch processing techniques (especially when the classification means need to cover large corpora of documents).

2. The filtering path is effectively a feedback loop. The TA is instructed to obtain documents which are then presented to the Document Selection module. This module employs the classification means produced by the TC subsystem to infer the relevance of documents to the user. Recalled documents are presented to the user through a suitable user interface.

At this stage, the user may provide feedback to the system about the relevance of the documents. Such feedback is passed on to the TC module to update the classification means so as to retrieve documents the user prefers. This process is analogous to the querying of an IR system, the difference being that the user can modify the system's classification means by incremental training.

It is notable that even IF approaches that do not advertise a feedback mechanism usually implicitly provide this feature. For example, Usenet kill files filter out undesirable newsgroup articles. Most Usenet news reader software does not allow feedback about the success of kill files, but the user can always update them manually, effectively forming the feedback loop, though in this case part of the mechanism lies with the user, rather than the software [Chouchoulas, 1997].

Apart from the above example, an everyday application of IF systems on the Internet can be witnessed in personalised search engine user profiles, such as My Yahoo! at the Yahoo! site [Yahoo!, 1994]. The user generates a profile by declaring some personal interests and hobbies. The system then filters daily news and various Internet publications and presents retrieved material to the user. On most occasions, the end user is provided with means of feedback.

As with IR systems, the usual metrics of time and space efficiency are applied to judge the success of IF applications. In terms of time and required storage, IF systems are prone to the same obesity traps as IR ones. Unfortunately, since many IF applications are user-centred, it is difficult to increase the speed by using parallel or distributed computer systems, since the average user does not have access to such hardware.[1] Furthermore, storage requirements over a medium or large Local Area Network (LAN) become problematic if many users run individual systems to store near-identical user profiles.

Community-centred information filters alleviate both disadvantages. The problem of efficiency is easily solved. There may not be a distributed computer in the strict sense of the word, but the combined processing power of all personal computers involved is far from negligible. Total space requirements and redundancies may also drop, since there is one, centrally stored set of classification means (or indices). The disadvantage is a certain drop in the level of personalisation attainable, though there are cases where this, too, is desirable.[2]

[1] Although the average user would certainly like that.
[2] For instance, in content-rating schemes or in companies wishing to control the documents accessed by their employees.


[Figure: the Information Stream entering through a Transport Protocol/Transport Agent, feeding Text Classification (training or updating the Classification Means) and Document Selection (filtering), with Selected Documents reaching the User, whose feedback returns to the classifier.]

    Figure 2.4: Components of an Information Filtering System.

    2.1.4 A Comparison

There is a fair amount of confusion inherent in the literature concerning IF and IR. The two terms are rarely defined adequately and at times even used interchangeably. However, it can be argued that both domains behave very similarly. In some cases, the distinction becomes little more than a matter of viewpoint. Belkin and Croft [1992] argue that IF and IR are two sides of the same coin. In comparing IF and IR, Belkin and Croft give three points:

IR deals with the collection and organisation of documents, while IF deals with the selection and distribution of documents to groups or individuals.

IR is concerned with a relatively static document database. IF typically needs to cope with a dynamic stream of information.

IR responds to a user's queries within a single information-seeking episode. IF works with long-term changes over a series of information-seeking episodes [Belkin and Croft, 1992, p. 32].

Certainly, as information storage departs from a centralised, single-database model to enter a paradigm of wide-area distribution and dissemination over global networks, the distinction between IF and IR will become more blurred.


    2.1.5 Dimensionality Reduction for Text Categorisation

Since raw, unprocessed keyword sets tend to render most text classifiers intractable, almost every TC system makes use of one or more methods of dimensionality reduction. There is a wide range of such techniques, including:

Word stemming [Sheth, 1994]. This involves the removal of word suffixes, leaving only the stems or roots. Since most languages have a limited number of word stems, this technique may reduce the number of possible keywords. Obviously, this is language-dependent and cannot be applied to any language. Word stemming falls prey to several pitfalls, especially where a keyword's information lies in the suffix: for instance, the words reaction and reactor would be reduced to the single stem react, not enough to distinguish between documents dealing with Chemistry and Nuclear Physics.

Using thesauri to collapse synonyms into single terms. Clearly another language-dependent method of dimensionality reduction. In addition, several thousand thesaurus lookups per document may prove computationally expensive. Coupled with the lack of electronic thesauri for all but a handful of languages, this technique is not a very popular choice among developers of TC systems. Again, the information content of seemingly synonymous terms in unusual contexts cannot possibly be ignored: the words expanded and extended are largely synonymous. In the context of older personal computer memory expansion schemes, however, they refer to two entirely different concepts.

Removing common stop words. Stop words are information-poor connectives (like articles, auxiliary verbs, et cetera) found in many languages. Again, this is a language-dependent technique based on the availability of such stop word lists for a given language. It is a very efficient method, but will, at most, remove a few hundred dimensions from any TC vector space, not enough to make a significant difference. In addition, this runs the risk of mistaking an information-rich term for a stop word and removing it.

Indexing only part of each document. Cohen [1996a] argues for the usefulness of this method by mentioning E-mailed camera-ready documents encoded in the Postscript language: these are extremely similar to plain text English documents, yet are very lengthy and may cause the indexer to extract thousands of weak keywords. Thus, only the first few hundred acquired terms are taken into account for each document. Unlike the previously mentioned techniques, this method is language independent. However, the assumption that all usable keywords lie towards the beginning of a long document is a naïve one. As a counter-example, consider the Gutenberg Project [1990]. It provides numerous classic books in machine-readable format. All those documents have a common header describing the aims of the Gutenberg Project, explaining the concept of expired copyrights and listing contacts. This common preamble is several pages long, enough for text classifiers using this technique to obtain keyword sets that are completely useless in classifying documents in this context.

Stripping least useful keywords. Apte et al. [1994] use this method to reduce the dimensionality of TC problems. After keywords have been acquired and assigned a weight measuring their relevance to a document, the keywords are sorted by weight and those with the minimal weight are removed. Term weight histograms (please see figure 2.5 on page 15 for an example) highlight the obvious fact that documents comprise disproportionately more irrelevant keywords than relevant ones. The bulk of words in a document will have an extremely low frequency, with few terms appearing more than a few times. Since most term weighting metrics rely on term frequencies, it is expected that a very large number of keywords will share the least possible weight. Removing those allows for a significant drop in dimensionality. A variation of this method filters out keywords that appear less than five times in documents. Apte et al. [1994] employ this in addition to stripping terms with minimal weight. The threshold frequency is arbitrary, but Apte claims it is a good choice (a brief sketch of this frequency cut-off is given at the end of this section). A potential problem here would arise if the term weighting metric was not well designed, giving wrong estimates for the relevance of keywords.


Figure 2.5: A typical word frequency histogram (for part of this thesis, in fact) shows more than 50% of words appearing only once. (The plot shows word frequency on a logarithmic scale against the individual words of the text, of which there are roughly 3,500.)

Information Theory techniques [Apte et al., 1994]. This group of dimensionality reduction methods uses entropy calculations to decide which keywords are relevant. These are in many respects similar to the RS-based method used herein, with varying degrees of efficiency. They tend to be slower than the previously described methods, but are used in conjunction with them to ensure that the acquired keyword sets span the document space, i.e. are information-rich with respect to the classification at hand.

There are also other, ad hoc methods that cannot usually be utilised outside the specific TC application. Many of these are devised as afterthoughts to allow the testing of otherwise intractable applications. Such methods are not reported in detail here, as they typically take the form of assumptions simplifying an application. Their utility is severely limited by their specialisation.

As evidenced by the work of Apte et al. [1994], these techniques may be combined to produce better results.
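To make the frequency-based stripping more concrete, the short Python sketch below implements a simplified version of the least-useful-keyword filter discussed above. It is an illustration only: raw term frequency stands in for whatever weighting metric a real TC system would use, and the default threshold of five occurrences follows the figure quoted from Apte et al. [1994].

from collections import Counter
import re

def extract_keywords(text, min_frequency=5, max_keywords=None):
    # Weight terms by raw frequency (a stand-in for a proper weighting
    # metric), then discard terms that fall below the frequency threshold.
    terms = re.findall(r"[a-z]+", text.lower())
    weights = Counter(terms)
    kept = {term: w for term, w in weights.items() if w >= min_frequency}
    # Rank the survivors by weight; optionally keep only the heaviest ones.
    ranked = sorted(kept.items(), key=lambda item: item[1], reverse=True)
    return ranked[:max_keywords] if max_keywords else ranked

# Toy usage: frequent terms survive, one-off terms are stripped.
sample = " ".join(["rough set keyword reduction"] * 6 + ["assorted noise words"])
print(extract_keywords(sample, min_frequency=5))

On realistic documents this reproduces the behaviour suggested by figure 2.5: the bulk of the vocabulary appears only once and is therefore discarded.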

    2.2 Rough Sets

Rough Set theory [Pawlak, 1991] is a formal mathematical tool that can be applied to reducing the dimensionality of datasets used for training neural networks or other classifiers [Chouchoulas and Shen, 1999b]. Rough Set Dimensionality Reduction (RSDR) removes redundant conditional (or input) attributes from nominal datasets, all the while making sure that no information is lost. The approach is fast and efficient, making use of standard Set Theory operations. Due to the relative novelty of this technique, a rather detailed technical discussion will follow. To aid in understanding Rough Sets, an example will be followed through.

  • 8/3/2019 Alexios Msc Thesis

    38/121

    16 2. Background

    2.2.1 Description of Rough Set Theory

Rough Sets involve the approximation of traditional sets using a pair of other sets, named the Negative and Positive Regions (or the lower and upper approximations) of the set in question.

Assume a dataset D viewed as a table where attributes are columns and objects are rows, as in table 2.1 on page 16 [Maudal, 1996].

U denotes the set of all objects in the dataset. A is the set of all attributes. C is the set of conditional (or input) attributes and D is the set of decision attributes. C and D are mutually exclusive and C ∪ D = A. In the example above (drawn from the work of Maudal [1996], Pawlak [1991]), U = {0, 1, 2, 3, 4, 5, 6, 7}, A = {a, b, c, d, e}, C = {a, b, c, d} and D = {e}.

f(x, q) denotes the value of attribute q ∈ A in object x ∈ U. f(x, q) defines an equivalence relation over U. With respect to a given q, the function partitions the universe into a set of pairwise disjoint subsets of U:

\[ R_q = \big\{\, \{\, x \in U : f(x, q) = f(x_0, q) \,\} : x_0 \in U \,\big\} \tag{2.8} \]

For instance, drawing from the discussed example:

\[
\begin{aligned}
R_a &= \big\{ \{1,7\},\ \{0,3,4\},\ \{2,5,6\} \big\} \\
R_b &= \big\{ \{0,2,4\},\ \{1,3,6,7\},\ \{5\} \big\} \\
R_c &= \big\{ \{2,3,5\},\ \{1,6,7\},\ \{0,4\} \big\} \\
R_d &= \big\{ \{4,7\},\ \{1,2,5,6\},\ \{0,3\} \big\} \\
R_e &= \big\{ \{0\},\ \{2,4,5,7\},\ \{1,3,6\} \big\}
\end{aligned}
\]

Assume a subset of the set of attributes, P ⊆ A. Two objects x and y in U are indiscernible with respect to P if and only if f(x, q) = f(y, q) ∀ q ∈ P. IND(P) denotes the indiscernibility relation for all P ⊆ A. U/IND(P) is used to denote the partition of U given IND(P). U/IND(P) is calculated as:

\[ U/\mathrm{IND}(P) = \bigotimes_{q \in P} U/\mathrm{IND}(q), \quad \text{where} \tag{2.9} \]

\[ A \otimes B = \big\{ X \cap Y : X \in A,\ Y \in B,\ X \cap Y \neq \emptyset \big\} \tag{2.10} \]

For instance, if P = {b, c}, objects 0 and 4 are indiscernible; 1, 6 and 7 likewise. The rest of the objects are not. This applies to the example dataset as follows:

\[
\begin{aligned}
U/\mathrm{IND}(P) &= U/\mathrm{IND}(b) \otimes U/\mathrm{IND}(c) = \\
&= \big\{ \{0,2,4\},\ \{1,3,6,7\},\ \{5\} \big\} \otimes \big\{ \{2,3,5\},\ \{1,6,7\},\ \{0,4\} \big\} = \\
&= \big\{ \{0,2,4\} \cap \{2,3,5\},\ \{0,2,4\} \cap \{1,6,7\},\ \{0,2,4\} \cap \{0,4\},\ \ldots,\ \{5\} \cap \{0,4\} \big\} = \\
&= \big\{ \{2\},\ \{0,4\},\ \{3\},\ \{1,6,7\},\ \{5\} \big\}
\end{aligned} \tag{2.11, 2.12}
\]

x ∈ U    a    b    c    d    e
0        1    0    2    2    0
1        0    1    1    1    2
2        2    0    0    1    1
3        1    1    0    2    2
4        1    0    2    0    1
5        2    2    0    1    1
6        2    1    1    1    2
7        0    1    1    0    1

Table 2.1: An example data table drawn from the work of Maudal [1996], Pawlak [1991].

If P = {a, b, c}, then, similarly:

\[ U/\mathrm{IND}(P) = U/\mathrm{IND}(a) \otimes U/\mathrm{IND}(b) \otimes U/\mathrm{IND}(c) \]
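These partitioning operations are straightforward to mechanise. The Python sketch below is purely illustrative (it is not part of the thesis software): it encodes the example dataset of table 2.1 and computes U/IND(P) by grouping objects on identical attribute values, which yields the same equivalence classes as intersecting the single-attribute partitions through equation 2.9.

# The example dataset of table 2.1: object -> attribute values.
DATA = {
    0: dict(a=1, b=0, c=2, d=2, e=0),
    1: dict(a=0, b=1, c=1, d=1, e=2),
    2: dict(a=2, b=0, c=0, d=1, e=1),
    3: dict(a=1, b=1, c=0, d=2, e=2),
    4: dict(a=1, b=0, c=2, d=0, e=1),
    5: dict(a=2, b=2, c=0, d=1, e=1),
    6: dict(a=2, b=1, c=1, d=1, e=2),
    7: dict(a=0, b=1, c=1, d=0, e=1),
}

def partition(attrs):
    # U/IND(attrs): objects sharing identical values on every attribute
    # in attrs end up in the same equivalence class.
    classes = {}
    for x, row in DATA.items():
        classes.setdefault(tuple(row[q] for q in attrs), set()).add(x)
    return list(classes.values())

print(partition(['b', 'c']))
# [{0, 4}, {1, 6, 7}, {2}, {3}, {5}] -- the same classes as equation 2.12.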

The lower and upper approximation of a set Y ⊆ U (given an equivalence relation IND(P)) are defined as:

\[ \underline{P}Y = \bigcup \big\{ X : X \in U/\mathrm{IND}(P),\ X \subseteq Y \big\} \tag{2.13} \]

\[ \overline{P}Y = \bigcup \big\{ X : X \in U/\mathrm{IND}(P),\ X \cap Y \neq \emptyset \big\} \tag{2.14} \]

Assuming P and Q are equivalence relations in U, the positive, negative and boundary regions (POS_P(Q), NEG_P(Q) and BN_P(Q), respectively) are defined as:

\[ \mathrm{POS}_P(Q) = \bigcup_{X \in Q} \underline{P}X \tag{2.15} \]

\[ \mathrm{NEG}_P(Q) = U - \bigcup_{X \in Q} \overline{P}X \tag{2.16} \]

\[ \mathrm{BN}_P(Q) = \bigcup_{X \in Q} \overline{P}X - \bigcup_{X \in Q} \underline{P}X \tag{2.17} \]

The positive region contains all objects in U that can be classified in attributes Q using the information in attributes P. The negative region is the set of objects that cannot be classified this way, while the boundary region consists of objects that can possibly be classified in this context (for a visual illustration of this please see figure 2.6 on page 18). For example, assuming P = {b, c} and Q = {e}:

\[
\begin{aligned}
\mathrm{POS}_{\mathrm{IND}(P)}\big(\mathrm{IND}(Q)\big) &= \bigcup \big\{ \{\},\ \{2,5\},\ \{3\} \big\} = \{2,3,5\} \\
\mathrm{NEG}_{\mathrm{IND}(P)}\big(\mathrm{IND}(Q)\big) &= U - \bigcup \big\{ \{0,4\},\ \{2,0,4,1,6,7,5\},\ \{3,1,6,7\} \big\} = \{\} \\
\mathrm{BN}_{\mathrm{IND}(P)}\big(\mathrm{IND}(Q)\big) &= U - \{2,3,5\} = \{0,1,3,4,6,7\}
\end{aligned}
\]

What this means is that, with respect to conditional attributes b and c, objects 2, 3 and 5 can definitely be classified in terms of decision attribute e. The remaining objects could, possibly, be classified, but it is not certain. This is shown more intuitively in table 2.2 on page 19.

Pawlak defines the degree of dependency of a set Q of decision attributes on a set of conditional attributes P as:

\[ \gamma_P(Q) = \frac{\left| \mathrm{POS}_P(Q) \right|}{\left| U \right|} \tag{2.18} \]


Figure 2.6: Approximating a circle with rough sets. From left to right, top to bottom: the universe of discourse (a circle on an irregular, two-dimensional grid); the positive region marked in white; the negative region marked in black; the boundary region (the actual approximation) in dark grey.

Where |·| is the cardinality of a set. The complement of γ gives a measure of the contradictions in the selected subset of the dataset. If γ = 0, there is no dependence; for 0 < γ < 1, there is a partial dependence. If γ = 1, there is complete dependence. For instance, in the example:

\[ \gamma_{\{b,c\}}(\{e\}) = \frac{\left| \mathrm{POS}_{\{b,c\}}(\{e\}) \right|}{\left| U \right|} = \frac{\left| \{2,3,5\} \right|}{\left| \{0,1,2,3,4,5,6,7\} \right|} = \frac{3}{8} = 0.375 \]

This shows that, of the eight objects, only three can be classified into the decision attribute e, given conditional attributes b and c. The other five objects (the unmarked rows in table 2.2 on page 19) represent contradictory information.
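Continuing the illustrative sketch started above (and reusing its DATA dictionary and partition() helper), the positive region and the dependency degree can be computed directly from equations 2.15 and 2.18; the output reproduces the values just derived.

def positive_region(p, q):
    # POS_P(Q): the union of every P-class that is wholly contained in
    # some Q-class (equation 2.15 via the lower approximation of 2.13).
    q_classes = partition(q)
    pos = set()
    for p_class in partition(p):
        if any(p_class <= q_class for q_class in q_classes):
            pos |= p_class
    return pos

def gamma(p, q):
    # Degree of dependency of Q on P (equation 2.18).
    return len(positive_region(p, q)) / len(DATA)

print(positive_region(['b', 'c'], ['e']))   # {2, 3, 5}
print(gamma(['b', 'c'], ['e']))             # 0.375
print(gamma(['a', 'b', 'c'], ['e']) - gamma(['b', 'c'], ['e']))
# 0.125: the significance of attribute a, cf. equation 2.19 below.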

It is now possible to define the significance of an attribute. This is done by calculating the change of dependency when removing the attribute from the set of considered conditional attributes. Given P, Q and an attribute x ∈ P:

\[ \sigma_P(Q, x) = \gamma_P(Q) - \gamma_{P - \{x\}}(Q) \tag{2.19} \]

The higher the change in dependency, the more significant x is. For example, let P = {a, b, c} and Q = {e}. Calculating some degrees of dependency in the context of our example:


x ∈ U    b    c    e
0        0    2    0
1        1    1    2
2 *      0    0    1
3 *      1    0    2
4        0    2    1
5 *      2    0    1
6        1    1    2
7        1    1    1

Table 2.2: As calculated in the text, the marked objects (2, 3 and 5) are discernible and can definitely be classified into e using the selected conditional attributes b and c. The rest of the objects cannot be classified: the information that would make them discernible in this context is missing.

\[
\begin{aligned}
\gamma_{\{a,b,c\}}(\{e\}) &= \left| \{2,3,5,6\} \right| / 8 = 4/8 \\
\gamma_{\{a,b\}}(\{e\}) &= \left| \{2,3,5,6\} \right| / 8 = 4/8 \\
\gamma_{\{b,c\}}(\{e\}) &= \left| \{2,3,5\} \right| / 8 = 3/8 \\
\gamma_{\{a,c\}}(\{e\}) &= \left| \{2,3,5,6\} \right| / 8 = 4/8
\end{aligned}
\]

Using these calculations, it is possible to evaluate the significance of the three conditional attributes a, b and c:

\[
\begin{aligned}
\sigma_P(Q, a) &= \gamma_{\{a,b,c\}}(\{e\}) - \gamma_{\{b,c\}}(\{e\}) = 1/8 \\
\sigma_P(Q, b) &= \gamma_{\{a,b,c\}}(\{e\}) - \gamma_{\{a,c\}}(\{e\}) = 0 \\
\sigma_P(Q, c) &= \gamma_{\{a,b,c\}}(\{e\}) - \gamma_{\{a,b\}}(\{e\}) = 0
\end{aligned}
\]

This shows that attribute a is not dispensable, having a significance of 0.125, while attributes b and c can be dispensed with, as they do not provide any information significant for the classification into e.

Attribute reduction involves removing attributes that have no significance to the classification at hand. It is obvious that a dataset may have more than one attribute reduct set. Remembering that D is the set of decision attributes and C is the set of conditional attributes, the set of reducts R is defined as:

\[ R = \big\{ X : X \subseteq C,\ \gamma_C(D) = \gamma_X(D) \big\} \tag{2.20} \]

It is evident from this that the RSDR will not compromise with a set of conditional attributes that has a large part of the information of the initial set C: it will always attempt to reduce the attribute set while losing no information significant to the classification at hand. To force the reduct to be the smallest possible set of conditional attributes, the minimal reduct R_min ⊆ R is specified as the set of shortest reducts:

\[ R_{\min} = \big\{ X : X \in R,\ \forall Y \in R,\ |X| \leq |Y| \big\} \tag{2.21} \]

Returning to the example and calculating the dependencies for all possible subsets of C:


QUICKREDUCT(C, D)
Input: C, the set of all conditional attributes; D, the set of decision attributes.
Output: R, the attribute reduct, R ⊆ C

(1)  R ← {}
(2)  do
(3)      T ← R
(4)      foreach x ∈ (C − R)
(5)          if γ_{R∪{x}}(D) > γ_T(D)
(6)              T ← R ∪ {x}
(7)      R ← T
(8)  until γ_R(D) = γ_C(D)
(9)  return R

Algorithm 2.1: The QUICKREDUCT algorithm yields a reduct for a dataset without forming all subsets of C, the set of conditional attributes.

\[
\begin{aligned}
\gamma_{\{a,b,c,d\}}(\{e\}) &= 8/8 &\qquad \gamma_{\{b,c\}}(\{e\}) &= 3/8 \\
\gamma_{\{a,b,c\}}(\{e\}) &= 4/8 &\qquad \gamma_{\{b,d\}}(\{e\}) &= 8/8 \\
\gamma_{\{a,b,d\}}(\{e\}) &= 8/8 &\qquad \gamma_{\{c,d\}}(\{e\}) &= 8/8 \\
\gamma_{\{a,c,d\}}(\{e\}) &= 8/8 &\qquad \gamma_{\{a\}}(\{e\}) &= 0/8 \\
\gamma_{\{b,c,d\}}(\{e\}) &= 8/8 &\qquad \gamma_{\{b\}}(\{e\}) &= 1/8 \\
\gamma_{\{a,b\}}(\{e\}) &= 4/8 &\qquad \gamma_{\{c\}}(\{e\}) &= 0/8 \\
\gamma_{\{a,c\}}(\{e\}) &= 4/8 &\qquad \gamma_{\{d\}}(\{e\}) &= 2/8 \\
\gamma_{\{a,d\}}(\{e\}) &= 3/8 &&
\end{aligned}
\]

Of the above choices of attributes, the ones that can be chosen as reducts are those with the highest dependency. Hence, the reduct and minimal reduct sets are as follows:

\[ R = \big\{ \{a,b,c,d\},\ \{a,b,d\},\ \{a,c,d\},\ \{b,c,d\},\ \{b,d\},\ \{c,d\} \big\} \]

\[ R_{\min} = \big\{ \{b,d\},\ \{c,d\} \big\} \]

    2.2.2 QuickReduct

In terms of computational complexity and memory requirements, the calculation of all possible subsets of a given set is an intractable operation. To solve this problem, the subset search space is treated as a tree traversal. Each node of the tree represents the addition of one conditional attribute to the reduct. Instead of generating the whole tree and picking the best path on it, the path is constructed incrementally using the following heuristic: the next attribute chosen to be added to the reduct is the attribute that adds the most to the reduct's dependency, i.e. the most significant attribute. The search ends when the dependency of the attribute subset equals that of the entire set of attributes. This is dubbed the QUICKREDUCT by Maudal [1996] and is very similar to the algorithm introduced by Jelonek et al. [1995] (please see algorithm 2.1 on page 20). The operation of this algorithm can be illustrated by returning to the example. QUICKREDUCT starts off with an empty attribute subset: R = {} (step 1). Now the algorithm evaluates the change in dependency caused by adding an attribute to R:


\[
\begin{aligned}
\gamma_{R \cup \{a\}}(\{e\}) &= \gamma_{\{a\}}(\{e\}) = 0/8 \\
\gamma_{R \cup \{b\}}(\{e\}) &= \gamma_{\{b\}}(\{e\}) = 1/8 \\
\gamma_{R \cup \{c\}}(\{e\}) &= \gamma_{\{c\}}(\{e\}) = 0/8 \\
\gamma_{R \cup \{d\}}(\{e\}) &= \gamma_{\{d\}}(\{e\}) = 2/8
\end{aligned}
\]

The addition of attribute d provides the highest increase in discernibility. Hence, R becomes R = {d} and the algorithm attempts to add another conditional attribute:

\[
\begin{aligned}
\gamma_{R \cup \{a\}}(\{e\}) &= \gamma_{\{a,d\}}(\{e\}) = 3/8 \\
\gamma_{R \cup \{b\}}(\{e\}) &= \gamma_{\{b,d\}}(\{e\}) = 8/8 \\
\gamma_{R \cup \{c\}}(\{e\}) &= \gamma_{\{c,d\}}(\{e\}) = 8/8
\end{aligned}
\]

Adding either b or c to {d} results in perfect discernibility of the dataset. Given a choice of two or more minimal reducts, QUICKREDUCT chooses the first solution encountered, in this case the conditional attribute subset {b, d}.
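A direct transcription of algorithm 2.1 into Python is sketched below, again reusing the DATA dictionary and the gamma() helper from the earlier sketches. This is an illustration written for this text, not the implementation used in the thesis; on the example dataset it reproduces the behaviour just described, adding d first and then b.

def quickreduct(conditionals, decisions):
    # Greedily add the attribute that most increases the dependency,
    # stopping once the reduct's dependency matches that of the full set.
    # Like algorithm 2.1, this assumes the full attribute set reaches
    # the target dependency.
    reduct = []
    target = gamma(conditionals, decisions)
    while gamma(reduct, decisions) < target:
        best = reduct
        for x in conditionals:
            if x in reduct:
                continue
            candidate = reduct + [x]
            if gamma(candidate, decisions) > gamma(best, decisions):
                best = candidate
        reduct = best
    return reduct

print(quickreduct(['a', 'b', 'c', 'd'], ['e']))   # ['d', 'b']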

    2.2.3 Verification of QuickReduct

It is easy to identify a serious problem with QUICKREDUCT: if it works like a hill climber, can it not get trapped in local minima? It can be proven that γ_R monotonically increases as R increases. In other words, there are no local minima, hence hill climbing is an easy task. The intention is to prove that:

\[ \gamma_{R \cup \{x\}}(D) \geq \gamma_R(D) \tag{2.22} \]

Where R is a set of conditional attributes, x is an arbitrary conditional attribute that belongs to the dataset and D is the set of decision attributes in question. The value of γ should never decrease when adding more conditional attributes to the reduct. Rewriting equation 2.22 using equation 2.18:

\[ (2.22) \Rightarrow \frac{\left| \mathrm{POS}(R \cup \{x\}) \right|}{\left| U \right|} \geq \frac{\left| \mathrm{POS}(R) \right|}{\left| U \right|} \Rightarrow \left| \mathrm{POS}(R \cup \{x\}) \right| \geq \left| \mathrm{POS}(R) \right| \tag{2.23} \]

    Rewriting again, this time using equation 2.15:

\[ (2.23) \Rightarrow \Big| \bigcup_{A \in Q} \underline{(R \cup \{x\})}A \Big| \geq \Big| \bigcup_{A \in Q} \underline{R}A \Big| \tag{2.24} \]

Using the definition of the lower approximation, given in equation 2.13, the above is rewritten as:

\[ (2.24) \Rightarrow \Big| \bigcup_{A \in Q} \bigcup \big\{ X : X \in U/\mathrm{IND}(R \cup \{x\}),\ X \subseteq A \big\} \Big| \geq \Big| \bigcup_{A \in Q} \bigcup \big\{ X : X \in U/\mathrm{IND}(R),\ X \subseteq A \big\} \Big| \tag{2.25} \]

    Finally, rewriting the above using equation 2.9:

\[ (2.25) \Rightarrow \Big| \bigotimes_{q \in (R \cup \{x\})} U/\mathrm{IND}(q) \Big| \geq \Big| \bigotimes_{q \in R} U/\mathrm{IND}(q) \Big| \tag{2.26} \]

  • 8/3/2019 Alexios Msc Thesis

    44/121

    22 2. Background

Equation 2.9 clearly defines the ⊗ operator as a variation on the Cartesian product. It is trivially observed that, by adding one or more elements to one of the operands of a Cartesian product, the cardinality of the product set monotonically increases. Hence, the value of γ_{R∪{x}} is always greater than or equal to the value of γ_R. Because of this monotonic nature of the reduct tree, it is possible for QUICKREDUCT, a naïve hill climber, to always locate the shortest reduct of a dataset.
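The monotonicity of equation 2.22 can also be confirmed numerically on the example dataset by brute force, reusing the gamma() helper sketched earlier. This exhaustive check is feasible here only because C contains just four attributes, and it is illustrative rather than part of the proof.

from itertools import combinations

ATTRIBUTES = ['a', 'b', 'c', 'd']
for size in range(len(ATTRIBUTES) + 1):
    for subset in combinations(ATTRIBUTES, size):
        for x in ATTRIBUTES:
            if x in subset:
                continue
            # Adding any attribute x to any subset R never lowers gamma.
            assert gamma(list(subset) + [x], ['e']) >= gamma(list(subset), ['e'])
print("Equation 2.22 holds for every subset of C in the example dataset.")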

An alternative proof by reductio ad absurdum is provided. Assume a simple dataset, whose values for conditional attributes {a, b} and decision attribute c are shown below:

a     b     c
a₁    b₁    c₁
a₂    b₂    c₂

Let (a₁, b₁) and (a₂, b₂) be discernible objects. Now, assume that a further conditional attribute x can be added to {a, b} so that:

\[ \gamma_{\{a,b,x\}}(\{c\}) < \gamma_{\{a,b\}}(\{c\}) \tag{2.27} \]

This assumption implies that the objects (a₁, b₁, x₁) and (a₂, b₂, x₂) are indiscernible, i.e. a₁ = a₂ ∧ b₁ = b₂ ∧ x₁ = x₂. However, this proposition cannot hold since, for (a₁, b₁) and (a₂, b₂) to be discernible, a₁ ≠ a₂ ∨ b₁ ≠ b₂ holds. Thus, no matter what x is, γ_{a,b,x}({c}) ≥ γ_{a,b}({c}).


3 Related Work

A taxonomic discussion of TC, IF and IR approaches begins the chapter, followed by a review of published work related to the text handling and Rough Set sides of this project. Related work is discussed within the framework established by the taxonomy.

There is an inevitable overlap between this and the previous chapter, as background material needs to involve related work and vice versa.

Due to the relative novelty of the approach described in this dissertation, the author has, to date, found little related work joining both theoretical aspects (text categorisation and rough sets). By necessity, then, work related to either facet of this project is reviewed separately.

3.1 A Taxonomy of Automatic Document Handling Techniques

A taxonomic hierarchy may be drawn up to show the variety of different methods available for automatically indexing, classifying and filtering documents. In terms of the domain the system is applied to, there is a division into:

Textual data. The domain deals with the handling of text documents. Such entities may be distinguished by the existence or lack of structure:

Unstructured textual data. Plain text, such as a report, a shopping list or Aristotle's Metaphysics. Obviously, all texts have a certain structure at various different levels. However, this structure (titles, chapters, sections, paragraphs) may not be obvious to the system, or it may lie beyond its grasp (e.g. structure of paragraphs at the semantics level requires strong natural language processing features).

Structured textual data. Such documents can usually be found in limited, concise domains such as programming languages. Since the textual data is meant to be parsed by a machine, structure is easily imposed. It may be assumed that content will be strictly structured and that data will follow the hard constraints of some well-defined grammar. Such domains are very easy to handle because TC/IF/IR systems can extract information from the document structure.

Semi-structured textual data. Certain types of documents exhibit enough physical structure to allow easier handling by automated systems, yet also include unstructured parts. The canonical example of a semi-structured document is an E-mail message: it comprises a header of field-value pairs meant to be machine-readable and following a strict grammar [Postel, 1982, Crocker, 1982, Costales and Allman, 1997], yet includes a message body which contains unstructured textual data in a natural language, or even further structured data as attachments [Borenstein and Freed, 1993, Moore, 1993] (a brief illustration of this split is given after this list).

Non-textual data. Although demand for non-textual IF/IR is still quite low, it is expected to increase as multimedia and the Internet become part of everyday life. Different types of non-textual data may be identified:

Visual data. Extremely large collections of images need to be indexed and filtered as much as any library of textual documents. This is not dealt with in this project due to the additional complexities involved (computational vision, for instance).

Aural data. Similarly, libraries of audio clips require additional consideration and costly hardware and software. The (small but set to increase) necessity for IF/IR systems classifying audio is not a concern of this project at this time, again due to added requirements in terms of Artificial Intelligence approaches.

Hybrid non-textual data. Audiovisual data falls under both of the above categories, thus incorporating the complexities and subtleties of both sub-domains.

Other. As technology progresses, the need will arise to classify and index different types of data.

Hybrid data. These comprise a mixture of textual and non-textual data. Most cases involving non-textual data actually deal with hybrid data, as many non-textual documents have textual descriptions or filenames attached to them.
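As a small illustration of the semi-structured case mentioned above, the sketch below uses Python's standard email parser on a made-up message to separate the machine-readable header fields from the unstructured body; the message content is invented purely for the example.

from email import message_from_string

# A made-up RFC 822-style message: a structured header of field-value
# pairs followed by an unstructured natural-language body.
raw_message = """From: alice@example.org
To: bob@example.org
Subject: Rough sets and text classification

The body is free natural-language text that a TC, IF or IR system must
handle without the benefit of a strict grammar.
"""

message = message_from_string(raw_message)
print(message['Subject'])       # the structured, machine-readable part
print(message.get_payload())    # the unstructured part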

    The linguistic focus of text-oriented systems may be:

Language-dependent. Such systems make additional assumptions about the nature of the texts they deal with. These are always simplifying assumptions: by focusing a system on a single natural language, it is possible to greatly ease the design bottleneck and, in some cases, reduce computational complexity. Such systems are typically highly specialised, some having trouble with even slightly altered semantics. For a more detailed description of such cases, please consult section 2.1.5 on page 14.

Language-independent. Language independence is considered a highly desirable feature for any modern TC/IF/IR system, since even texts in the same natural language will exhibit subtle (but possibly important) variations in semantics. However, generality is sometimes traded off for efficiency. Again, please refer to section 2.1.5 on page 14 for more information.

The high level function of the system may be one of the following (bearing in mind that there is a certain degree of overlap between functions). For a more detailed discussion, including a thorough comparison of the techniques, please refer to Chapter 2 on page 5.

Pure Text Categoriser. The aim of such systems is to simply distinguish between different types of documents. Such purist approaches are, as usual, relatively uncommon: most TC systems share functionality with IF or IR systems.

Information Retrieval. Such systems subsume a TC, applying it to classify documents into different classes, thus indexing large document databases. Database querying is also handled using TC approaches.

Information Filtering. These systems tap into a stream of information and continuously select relevant documents using a TC subsystem.

    In terms of topology and architecture, a TC/IF/IR system may be:


Global. Such systems are implemented in one location, on one computer or a local cluster of computers which, however, is globally accessible. Search Engines on the WWW [Altavista, 1995, Yahoo!, 1994, Lycos, 1995] and the University of Edinburgh On-line Library Catalogue are examples of such systems. The systems are remotely usable, over global or wide area networks, but are not located in their users' machines (barring certain client software needed to use these remote information servers). Such systems are usually either centralised or distributed:

Centralised. The system is built as a single (cluster of) computers, an information processing nerve centre. Altavista [1995] is an example of this, with an impressive array of information processing technology located in one physical place.

Distributed. Such systems use the divide and conquer technique to reduce the need for powerful yet expensive machinery in one place. Instead, they employ less powerful, but much cheaper and smaller machines at various locations. This redundancy is helpful in case of hardware failures.

Community-centred. A more localised version of the global information processing system, community-centred TC/IF/IR systems work within the bounds of a well-defined community to provide their services. For instance, such a system could work within one company, with individual employees updating it with their preferred documents and/or document critiques. Those would then be used to help the system yield higher recall and accuracy, as long as users' queries are in some way related (hence the need for a well-defined community). Like global systems, community-centred systems are either centralised or distributed:

Centralised. The system runs on a single computer or cluster of computers, accessible by the users. It processes all information seeking or updating transactions. It may require large amounts of storage and computing power, in a manner similar to centralised global systems.

Distributed. The system comprises numerous smaller, individually less capable segments. Each segment runs on a user's personal computer and services that user's requests only. With the help of distributed object brokers such as CORBA [Vinoski, 1993], individual segments consult with each other to exchange information that leads to higher efficiency from the user's point