EXTRACTION OF SEMANTIC HEADER FROM RTF DOCUMENTS · 2005-02-10 · Chapter 3 describes Rich Text...

EXTRACTION OF SEMANTIC HEADER FROM RTF DOCUMENTS

ABDELBASET ALI

A Mejor Report in

The Department of

Computer Science

Presented in Partial Fulnllment of the Requirements for the Degree of Master of Computer Science at

Concordia University Montreal, Quebec

August, 1999

O ABDELBASET ALI, 1999

National Library 1+1 ,,ana* Bibiiitheque nationale du Canada

Acquisitions and Acquisitions et Bibliographie Services services bibliographiques

395 Wellington Street 395. ru0 wdlnrg(oo OttawaON K 1 A W OüawaON K 1 A W Canada Canada

The author has granted a non- exclusive licence aliowing the National Library of Canada to reproduce, loan, distn'bute or sell copies of this thesis in microform, paper or eleclronic formats.

The author retains ownership of the copyright in this thesis. Neither the thesis nor substantial extracts fiom it may be printed or othenvise reproduced without the author's permission.

L'auteur a accordé une licence non exclusive permettant à la Bibliothèque nationale du Canada de reproduire, prêter, dism'buer ou vendre des copies de cette thèse sous la forme de microfiche/nlm, de reproduction sur papier ou sur format électronique.

L'auteur conserve la propriété du droit d'auteur qui protège cette thèse. Ni la thèse ni des extraits substantiels de celle-ci ne doivent être imprimés ou autrement reproduits saos son autorisation.

Abstract

Extraction of Semantic Header h m RTF Documents

Abdelbaset Ali

The problem of indexhg and retrieval of electmnic information resources becomes more

criticai as the amount of information and the numbcr of Intemet users continues to grow.

The Semantic Header. proposed by Desai [3], is a portion of each document that contains

the meta-information for each publicly accessible resource on the Intemet. The Sementic

Header for document-like Internet resources is a powtrfd means of helping users locate

documents and other types of data among large npositorïes. In environments that contain

many different types of data, content indexing requires type-specific processing to extract

information efkctively.

In this project which is a part of the ASHG system (Automatic Semantic Header

Generator). we present a mode1 for type-specific. infommtion extraction that

autornatically extracts the meta-information h m RTF (Rich Text Format) documents,

and stores it in a Sernantic Header which wiil be used as an index for the document. This

shail provide a useful tool in searching for a document based on a number of commonly

used critena. The infoxmation from the Semantic Header could be used by the search

system to help locate appropriate documents with minimum effort.

iii

Acknowledgements

I would me to ihank rny supervisor. Dr. Desai, for his guidance and encouragement

throughout this project. 1 would like to express rny dapest gratitude to him for having

made my project work a plcasant and exmmtly educational cxperience. I am forevcr

indebted to you Dr Desai, and 1 am hopefbi that we wil i m e t and work again sometime

in the funire.

To a l l my fiends and colleagues in the Department of Cornputer Science, tha& you for

your encouragement, thoughtful discussions, and help.

1 would iike to thank my fiiends for k ing there for -me in tiws of need.

Fmally, 1 would iike to thank my wife for ber patience and encouragement.

To the memory of my parents

Contents * a m List of Figures ..........e...ee....e..eee..e.e....e.~...~~~ee.....e...ee.ee~.e.. e o e e o ~

List of Tables .....m....m..e......e.e....ee..........e......e.e..ee.......e..e..e...

. 1 Introduction ...........a..~mee.ee..eoeooooomo.eœoo.oeo.wa.eeo.-e-eo.e-e~eaeeoeoe..oeoe 0.1

.......................................................................... 1.1 The Discovery Problem 1

1.2 CINDI's appmach ................................................................................... 1

.............................................................. 1.2.1 Meta-Information Indexing 2 . 1.3 Orgrnation of the report ....................................................................... 2

2 . The CINDI System.. .........O................ o.oooooe.ooem*oeoeooeeeeeme.mmœooee~eooe.*o*oo**..mmoee3

2.1 Overview of CLNDI ................................................................................ 3

...................................................... 2.1.1 The Semantic Header of CiNDI -4

............................................... 2.1.2 Semantic Header's Mark-up Schema 6

........................................ 2.1.3 Database System of the Semantic Header 7

.................................................................... 2.1.4 CINDI's Search System 8

3 . Rich Text Format ., ......................................................................... m.oooeme9

................................................... 3.1 Rich Text Format (RTF) Specincation 9

3.1.1 RTF Syntax ....................................................................................... 9 3.1.2 Contents of an RTF file .................................................................. 10

..................................................... 3.1.2.1 Document Area .................... 10

4 . Automatic Semantic Header Generator (ASHG).... ......... eemo......œœ..e.e..o13

........................................................................................... 4.1 Introduction 13

4.3 Document Type Recognition ................................................................ 14

.... .............................................................. 4.4 The Thesaurus for ASHG ... 16

.................................................................. 4.4.1 The subject Hierarchies 16

.............. 4.5 ASHG's Document Subject Headings Classification scheme 17

4.5.1 The Algorithm ................................................................................. 17

4.6 Applying ASHG's Extractors ............................................................... 18

4.6.1 RTF-extractor ............................................................................. 18

4.6.2 Generating an implicit list of keywords ......................................... 21

4.7 Semantic Header Validation ................................................................. 25

58 ASHG's Experiments o,oomwoo,oo,oooowooomomoooommooo.ooooeeooooooomoooooemomoooooomooaomoooom26

5.1 Experiments .......................................................................................... 26

5.2 Sample Results ...................................................................................... 28

6 . Conclusion and Future Work oooooomomooooooooooeoooooomoooooaooomoommaomooowooomoooooooomooomm37

6.1 Conclusion ............................................................................................ 37 6.2 Future Work .......................................................................................... 38

References e~ooowoomoomm~o~mm~.owmommsmooooomomoooooooooooomeooooooo~oo~m~mmwmmmm~ooomooomoooeo.oooo42

List of Figures

Figure 1: ASHG9s step ...................................................~....... 16 Figure 2: Subject Extraction ......................~............................ .19

Figure 3: RTE' extractor ..............................~........~.............~.... 21

Figure 4: ASHG extraction for RTF document ..... ....................... 2 4

Figure 5: CINDI' Semantic Header Graphid interface (a) .................. 33

Figure 6: CIND19 Semantic Header Graphitai intertaœ (b) .................. 34

Figure 7: CIND19 Semantic Header Graphid interface (c) ................... 35

List of Tables

Table 1: Defiaitions d Contra1 words (a) a...-------------------------------------------- Table 2= Definitions of Control words (b) ......--------.------------------------------------------------- Table 3: Summary of ASHG's RTF tat agaïnst the authots a d INSPEC's msaits......... -28

Chapter 1

1. Introduction

1.1 The Discovery Problem

At this time, a number of intorrnation sources. both public and private are available on the Internet. They include text, cornputer programs, books, electronic journals. newspapers, local and national directories of various types, and private information senrices. Rapid growth in data volume, user base and data diversity renders Intemet- accessible information increasingly difiîcult to locate efficientiy. Bmwsing a hierarchy containhg millions of directories is infeasible, particularly given the erratic organization that results when many different people critate the data. htead, there is a need for an automated search and for a systcm that allows easy " ~ h for and access to* resoufces available on the Intemet. For this auto- scarch to help users in locating relevant information requin indexing the avaiiablc information. Thus, sccondary information cailed meta-information must be extracted and used as an index to the available primary resource. In tum, building this index requins idormation extraction methods tailored to each specific environment. To ihis end, the semantic of the nies in which the primary resource is stored wili be exploited to extract and suxxmmke the relevant idormation. To do this, the prünary fie type should be identifiai and then the type specific selection and extraction methods are applied to the file.

It is envisaged that regional andlor specialized databases would be created to maintain archives of Semantic Header. These databases could be searched by the cooperation of distributeci expert systems to help users locate pertinent documents. Such a system is currently under development at Concordia University and is called Concordia Indexing and Discovery systems, or CINDL

1.2 CIND19s approach

CINDI (Concordia Indexhg and Discovery system), a system under development at Concordia University, provides a mechanism to register, search and manage the rneta- information, with the help of easy to use graphical user intcrf&ce. Thc meta-information, which is described in chapter 2, is the Semantic Header, that is stored in the CINDI system. CINDI has been designed to avoid problems caused by diffemices in semantics and representation, and incomplete or incoma data cataloging. This meta-information could either be entered by the primary resource provider or assisted in this by the Automatic Semantic Header Generator (ASHG). ASHG is a software that generates some meta-information of the submitted document, and assists the user in this process. This major report introduces ASHG for RTF documents, which aims at saving time for the

primary resource provider by automaically gentrathg and extractirhg part of the meta- information (Semantic Header) of the document aiid classifyirig the rcsource under a list of mbject headings. The provider helps in this proass by verifying and comcthg the Semantic Header entry.

1.2.1 Meta-Information Indedg

Constructed meta-information couid support queries bascd on content ps wêu as traditional queries based on items such as title, author, subjact, etc. This means that a c W sources could be s e p d h m thcir meta-iaformaîÏon, thcreby simplifying the storage of the resources. Professionai catalogers have found the need for the mentioned elements in indexing applications. The dependence on tities as the most commonly uscd scarch criterion suggests that they must indicate tbe contents of the document. Furthemore, the author or the cataloger has to add annotations, keywords, or key phrases to indicate its actual content. Including reviewers' opinions can indicate the accuracy or quaiity of a document However, such opinions are rarely accessible to the traditionai cataloger. Another feature of importance is the presence of an accurate abstract. An abstract provides a sllmmary of the material, and thus is more indicative of the contents than the titie and keywords supplieci by the author [14].

1.3 Organization of the report

This report is organized as foilows. In Chapter 2, the -1 system is introduced. Chapter 3 describes Rich Text Format (RTF). Chapter 4 d d b e s the Automatic Semantic Header Generator (ASHG). This cbapter coves the type recognition and the extractors, and a description of the Thesaurus used at the beginniag of this chapter. In chapter 5, we test and compare the classification of our generated index with those produced by cataloguers. Lastly, in chapter 6, we give our conclusion and discuss future work.

Chapter 2

2. The CINDI System

2.1 Overview of CINDI

The trend in most research institutes, universitics and business organization of intercomecting their computing facilities using a digital network has become the accepted method of sbaring murces. Such networks, in tum, are interconnecteci allowing information to be exchangeci across networks by using a common interchange protocol (TCl?/IP). Tbe number of such mtcrcomectcd networks (hternet) coatiaues to grow and with the emergence of powerfiil workstation-based servers connected to these networks, it is possible to support local as weîl as remote scarch and retrieval of information stored on any component of the interconncction .

There is a need for the development of a system, which d o w s easy search for and access to resources available on the Intemet. It has becn observeci that distributed information systems, even though under control of a single administrative unit, create multiple problems typically caused by ciifferences in semantics and representation, and incomplete and incorrect data dictionaries. The building of a standard index structure and a bibliographie system using standardized control definitions and temis can solve these problems and lead to a fast, efficient and easy access to îhc d~cumcnts. Such definitions could be built into the knowledge base of expert system based index entry and search interfaces. The purpose of indices and secondary information is to inventory the primary information and aUow easy access to it. Preparing a secondary information requins finding the primary source, idenifying it as to its author, title subject, abstract, keywords, etc. Since an index is to be used by many users, it has to be accurate, easy to use (usage via author, titie, subject, etc.) and properly classined.

Attempts to provide easy searching of relevant documents has lead to a number of systems including WAIS, and more recently a number of Spiders, Worms and other creepy crawlers 111, [a, (101, 1121, [lq, [lq, [IV, [18]. However the problem with many of these indices is that their seiectivity of documents is often poor [4]. The chances of getting correct documents and missing relevant information because of poor choice of search ternis are large. In addition, the user is rcquired to access the actuai resource, based just on the title and author information as provideci through a library catalog, and decide whether the resource meets the needs. These problems are addressed by CINDI using an appropriate index entry c d e d Semantic Header which includes title, abstract. keywords, and subject; and proMding a mechanism to register, manage and search the bibliograph y.

The overail CINDI system uses knowledge bases and expert sub-systems to help the user in the register and search process. The expert system wiii help the user in entexing

those attributes required for indexing or updating a Semantic Headet. UNDI standardizes the terms. The index generation and maintenance sub-systmi uses CINDI's thesaurus to help the provider of the rrsource select comct tams for item such as subject, sub- subject and keywords. S d a r l y , anothcr expert sub-system is used to heIp th user in the search for appropriate information nsources 131.

2.1.1 The Semantic Header of CINDI

CINDI uses a meta-data description c d e d a Semantic Header to describe an information resoutce, The Semantic Header indudes those elen~nts that arc m a t ofien used in the search for an information resource. Since the majority of searches begh with a titk. aame of authon (70%), subject and sub-subject ( S m ) 191, CIPiSDL r e q u k tbe entry for these elements in the Semantic Header. Similarly, since the abstract and annotations an relevant in deciding whether or not a murce is usefuI. thcy are also included. [14,4]. The description of the semantic header elemmts foiiows:

Tirle, A&-tWe The title ficld contains the name of the resowe tbar is given by the creator(s). The aiternate title field is used to indicate a secondary title of the rcsource.

Subject The subject and sub-subject of the CtSOurce arc indicatcd in the next field, which is a repeating group. This field contains a list of possible subject classifications of the resource.

Languoge, Churacfer Set The character set and the language are the same as those used in the resource.

Author and other responsïble agents The role of the person associaied with the documenieat, for instance, author, editor, and compiler. This inc lud~ fields such as name, postai ddrrss, telephone number, fax number, and email address.

Keyworà This field contains a Iist of keywords mentioned in the resourcc.

Identifier The identifier for the document Example of identifier are ISBN (urtemational Standard Book Number), URL (Universa Resourcc Locator) of the document. This is a multi- valued dot in case the document is availabk in many formats or is electronicaily stored at more than one site.

Dafe The date on which the document was created, caîalogucd, and the date the document will expire.

Version The version number aud the version number king superscdcd, if m y are given in these elements.

CCassification The classification of the document such as legai', 'security', or other types. For each, the nature of ~Iassification is specined.

Coverage It indicates either the target audience of the document or the cuituraI and temporal aspects of the document's content-

System Requiremenî5 Being an electronic document certain system requirements for it to k displayed or used are necessitated. The components involveà are the hardware, the software, or the network and the minimum needs for each of these.

Genre It is used to describe the physical or electronic format of the resource. It consists of a domain and the corresponding values or size of the resource.

Source and Reference The Source indicates the documents king referenced or that was rquired in its preparation. It could also be the main component for whicb the cumnt document is an addendum or attachmeat.

cost In case of a resource accessible for a fee, the cost of accessing it is given.

Abstract The abstract of the document is either provideci by the author or by ASHG.

Annotalrons Annotations put in by readers of the document.

User ID, P~ssword A provider ID of at least six characters and a password of four to eight characters. More than one semantic header by the same provider can have the same ID and password

2.1.2 Semantic Header9s Mark-up Schema

The heart of any bibliography or iridexing system is the record that is Lept for cach item that is king indexed. The syntax of the semantic headcr is th HTML marhip language. which is based on the SGML mvkup fanguage. The following entries will fom the meta- information of the primary mource, which will be stond in the Semantic Headcr database. Once this index is filleci by the provider of the remince or by the proposecl ASGH system, it wiU be registered as bibliographical information about the rcsource.

<Subject>required: a list each of which insludes fields for subject and up to two levels of sub-subject: at least one entry is requircd </Subjec~

danguage>OP'IïONAL: of the information resource4anguage>

<char-set>OPTIONAL: character set useàdchar-set>

<author> required: a list each of which includes name, organization, address, etc. of each personhstitute responsible for the information resource: at the least, the name of the organization and address is required </au-

deyworbrequired: a list of keywords </Keyword>

cVersion>OPTIONAL: version of the resource </Version>

4overageSPTIONAL: audience, spatial, temporal &overage>

<Classificatio~OPTIONAL: nature security level of the resowce</Classificatio~~)

<-A list of locations (WRL) Unique Universal Resource Lmator-Cal1 No for this resource: at Ieast one required-

<Abstract>OPTIONAL but recommendedc/AbstracO

<SysReqsOPTIONAL: Est of rrqukments in hardware and software

cHardware>OPTIONAL: list of hardware requircd4Hardwart>

<Sofeware>OPTIONAL: list of software xequired</Software>

csizexize of the resource in bytesc(size>

<Cost>OPTIONAL: cost of accessing the resource</lost>

<password>required: encoded password or digital signature of provider of &source for initial entry and subsequent update4passworcb

<signatusdigital signature of the resource for authenticationc~signature>

2.1.3 Database System of the Semantic Header

The index entry system provides a graphicai interface to facilitate the provider of a resource to register the bibliographie information about the resource. Once the information is correctly entered, the provider cm decide to registcr the Semantic Header entry in the database. The index entries registered by a provider of a resource are stored in a distributed database system (SHDDB). The underlying database may be considereù to be a monolithic system h m the point of view of the usea of the system. In reality, it wouid be distributed and repkated aifowing for diable and failure-tolerant operatioas. The interface hides the distributed and replicated nature of the database. The distribution is based on subject areas and as such the database is considered to be horizontally partitioned [2].

The Semantic Header information entered by the provider of the resource using a graphicai interface is dayed by client process from the user's workstation to the database server process at one of the nodes of the SHDDB. The node is chosen based on its proximity to the workstation or on the subject of the index record. On receipt of the information, the server venfies the comtness and authenticity of the information and, on finding everything in order. sen& an aclmowlcdgment to the client.

The server node is rrsponsible for locating the partitions of the SHI>DB whem the entqr should be stored and forwards the rcplicatcd information to appropriate nodes. For example, the semantic beadtr entry would be part of the SHDDB for subjects Cornputer Science and Library Studies.

2.1.4 CINDI9s Seareh System

The search system also uses a graphical interfe and a client pnxrss ~UOW a provider to enter s w c h requiremtllts consisting of ab-string and synonyms. The search interface providing the precise statement of a user query by ailowing cornplex pttdicate~ based on search items such as author, subject, keywords, dates, and words appearing in the abstract. Once the user has entend a search quest, the client p n r e s s commuaicates with the nearest SHDDB catalogue to determine the appropriate site of the SHDDB database. Subsequently, the client process communicatcs with this database and retrieves one or more semantic headers. The resuit of the query couid then be collectecl and sent to the user's workstation. The contents of these headers are displayed, on demancl, to the user who may decide to access one or more of the acnial rtsources. k may happen that the item in question is available fiom a number of sources. In such a case the best source is chosen based on optimum wsts. The client process would attempt to use appropriate hardwardsoftware to retrieve the selected resowces [4].

Chapter 3

3. Rich Text Format

3.1

RTF

Rich Text Format (RTF) Specification

(Rich Text Format) allows for the exchange of text files between different word '

processon and oprating systcms. For example, &c acated using Microsoft Word 97 in Windows 95 can be saved and sent as an RTF file to someone using Wordperfect 6.0 on Windows 3.1. This RTF file will have a "d" fiIe name suilîx.

Control words and symbols senring as "common denominator" fonnaiting comrnands are defined by RTF specifications usbg the PC-8, Macintosh, ANSI, and IBM character sets. An RTF writer processes nles saved in the Rich Text Format and convert's the word processor's marhp to the RTF language. The RTF reader processes and converts the RTF language so that it can be displayed on a differcnt word pmcessor. In this way, the Rich Text Format (RTF) Spacîflcations [13] serves as a bridge' that aiiows for the interchange of text and graphics between different opcrating systems. operating environments, and output devices.

The actual software that tunis a formatted file into an RTF hie is called a writer. It writes a new file containing the text and associated RTF groups. This RTF file is then translated into a formatted Ne by a reader [13].

3.1.1 RTF Syntax

An RTF file is compriseci of control words, wntrol symbols, unformatteci text, and groups [ 131. A control word takes the foilowing fom:

Note that each control word is preceded by a backslash.

The LetterSequence is made up of lowercase alphabetic charactem.

A control symbol consists of a single. non-alpbabetic character preceded by a backslash. For example, \- represents a non-brcaking spafe. Control symbols take no delhuiters [13].

A group consists of text and control words or control symbols e~ loscd in braces ({ 1). The staxt of the group is marked by an opming bnre ({) and tht end of the gmup by a ciosing brace (1). Each group specifics the turt affectcd by the grwp and the differcnt attri'butes of that text. The RTF fiie can also includc groups for fonts, styles, saecn color, pictures, fmtnotes, annotations, headcrs and fwtcrs, summary information, fields, and booharks, as weii as document-, section-, paragraph-, and character-formatting properties. If the font, file, style, screen-color, revision mark, summary-information groups and document-formattiag properties an included, tbcy must prcccdc the first plain text character in the document. These groups form the RTF file headcr [13].

Some control words have only two States. For example, bolci, which is either turad on or tumed off. When such a control word has no parameter or has a nonzero parameter, it is assumed that the control word tums on the pmperty. When such a control word has a parameter of O (zero), it is assumai that the fontml won3 turns off the proprrl. For example, \b hims on bolci, wheeas \M) turns off bold [ 131.

Within a document, the beginning of a collection of relatai text that could appear at another position, or destination, is distinguisécd by certain control words referrtd to as destinations. Braces must enclose destination control words and any text following them. The control words, control symbols, and braces constitute control information. AU other charactcn in the file are plain text. As previously mention& the backslash ((and braces (( )) have special meaning in RTE To use these charactes as text, precede them with a backslash, as in \\ \{and \) [13].

3.1.2 Contents of an RTF file

An RTF file has the folIowing syntax:

It is important to include here an explmation of the document area because RTEextractor exploits the use of mark-up fields ta extract the meta-information. It extracts the title, explicitly stated keywords, subject, language, author(s), &tes (Created, Expiry), and the abstract from the RTF document. This will be describeci in another chapter.

3.1.2.1 Document Area

The document area has the following syntax. edocumen~ Qnfo>?<docfmt>*<sectio~~+

Information group

The information group, which contains information about the document, is intmduced by the M o control word, This can inchde the title, author, keywords, comments, and other idonnation specific to the file [13]. This group has the following syntax:

<info> '{' d t b ? & <subject>? & <author,? & coperato~? & dceyworcb? & <comment>? & \ version? & <doccomm>? & \ v e m ? & & aeviizw? & <printim>? & cbuptim>? & \ d h h s ? & \ mfpriges? & \ nofiuords? & \ id? '}'

esubjecb '{ ' \ subject #PCDATA '1'

<comment> '{ ' \ comment #PCDATA ') '

#PCDATA : Text (without control words)

Some applications, such as Word, ask a user to type this information when saving the document in its native format. If the document is then saved as an RTF file or translated into RTF, the RTF writer specifies this information using the foiiowing control words. These control words are destinations and b t h the control words and the text should be enclosed in braces ((1) 1131-

oatrol word bcan€ng I

mitle of the document. Bubiect of tbe documtant. 1

l~uthor of the document. I

\ operator person who last made changes to the document \ keywords lSelected keywords for the d ~ ~ ~ l l ~ l t n t \ comment komments; text is ignorcd. I

\ versionN me version number of tbe document, dmomm komxnents displaycd in Worâ's Edit Stunxmry Info dialog box.

Table 1: Dennitions of Contm1 words (a)

The RTF writer can automatically enter otber conirol words, includiag the followhg:

Contml worà mppninn \ vernN me interna1 version number \ creatim me cmation time \ revtim The revision time \ printim The last pnht time \ baptim nie backup tirne

TabIe 2: Definitions of Control words (b)

7 dylv \ hrN \ minN \ secN \ nofpagesN

Entries without the N parameter have the \ yr \ mo \ dy \ hr \ min \ sec controls [13]. An example of an information group foilows:

)Ihe day The hou? The minute The seconds The number of pages

(\info(\title The Semantic Header and Inâexing and rartbing on the Intemet) {buthor Bipin C. Desai J (keywords Bibliographie recorâ, Content description, Indexing, Index aid, Database system, Expert system, Searching, URC, URL } )

\ nofwordsN me number of words \ n o f c h W me number of characters \ idN me intemal ID number L

Chapter 4

4. Automatic Semantic Header Generator (ASHG)

4.1 Introduction

In this chapter. the Automatic Semantic Header Generator (ASHG) of the CTNDI for RTF documents with ASHG is presented. ASHG saves thne for the document's author by generating an initial set of subject classincation and a number of components of the Semantic Header for the document. ASHG mcasures both the ficquency of occurrence and positional weight of keywords fomd in the document. By matching those keywords with the controlled terms found in the controiied term subject association a list of subject headings is then generated [7].

The selection of important terms may be based either upon position (the term's location in the document). semantic. or pragmatic (a system which would consider proper names as highly significant). Amther criterion used for selecting important temis is Statisticai term weight. Since the frequency criteria are not very diable, additional criteria should be used such as contextuai inference (the word location or the prescnce of cue words). and syntac tic coherence criteria [ 1 I ,SI.

Meta-information from HTML, Latex, Text, RTFT and Unknown documents is extracted by the ASHG and is stored in a Semantic Header. This meta-iaformation may include such fields as the document's title, absîracî, keywords. dates, author, author's information. size and version. The significance of these is measureû by ASHG using frequency occurrence and positional schemes. A base form for each word is generated by Word stemming and then matched to the controiled terms found in the controlled term subject association [7]. If a match is found the subject headings associated with the controlled te- are extracted and ranked accordingly. The following describes the major steps of ASHG.

Documeni Type Recognirian: In order to appiy the correct ASHG to a document, the type of the document has to be recognized. The system currently understands HyperText Markup Language (HTML), Latex, plain text, and Rich text format (RTF) documents.

File VaNWon: The user validates the file type recognized by ASHG.

A p p l ' g ASEG's Ertractor.. The summarizer corresponding to the type of document is applied to the input document.

ASEG9s Document Classifrcrrtion: A classification of subjoct headings is assigncd to the document. It involves:

1. Word sfemming: A stemming process is applied by the system to map the words found in the extracteci fields onto a base root word.

2. A Look-up inlo the ConbDRkd T e m Sukject rliicthmy.

Sema& Headw Volidotion: The generated Semantic Header is prcsentcd to the user to vaiidate, correct, and register using the GUI [21].

4.3 Document Type Recognition

Name conventions arc used by the system to assist in recognition of document type. If this does not work, the system wiU then examine the contents. If the document type remains unrecognized, the user is informeci by the system, and is askcd to either choose a document type or generate the Semantic Header manually. The two steps used in the document type recognition process follow:

o Naming conventions and hewistics.

O Examining file contents in detennining the file types.

Upon submitting, the document file is passed to a function d d by-. This function checks the document's name extension. If the extension of the file indicates that it is an HTML, Latex, text or RTF then the function user-ve@'& is calleci- If the naming convention fails in recognizing the document type, the function byconfeW is cded .

if (document.extension = .hW or HTML or .htrn) then { The file is an html Ne. CaIi function user-veri@.

1 else if (document.extension = .tex or .TEX) then {

The file is a latex file. Cal1 function user-verify.

1 else if (document.exteasion = .rtf or .RTF) then { The file is a rtf Ne. Cal1 function user-verify.

1 else if (document.extcnsion = .dm, or .txt

or ide or .ascii) thcn {

The file is a text file Call function user-verifjr.

1 else {

Examine the document contents by calling the function bycontent. 1

In user-vent!'. the user either confimis or rejects the rcsult. If tbe user rejects the resuit, he should chwse a type fiom a list that is displayed. If the user confirms the document's type as recognized, ASHG applies the extractor cozftspoading to the type confirmcd by the user. ûtherwise, he should chose a type and then apply ASHG.

In the function byconteat, the semantics of tbe Laîex and RTF content is exploited when attempting to recognize the file type.

If (the nle contents match the html file sernantics, such as the existence of the tag) then

the file is of html type caii function user-ve*

1 else if (the file contents match the latex semantics, such as the

existence of the \begin { ... ) tag) then

the Ne is of latex type cail function user-verify

1 else if (the file contents match the rtf semantics, such as the

existence of the ktf tag) then {

the N e is of rtf type call fûnction user-veriQ

1 else {

Unrecognized file type, The user should select a type. 1

If the fde type is not HTML, Latex, Text, or RTF, it remains unrecognked. and ASHG extracts the size of the file and the date of creation.

f USER

TYPE VALIDATION AND RECOGNITION -+ VALIDATION

\

Figure 1: ASHG's steps

4.4 The Thesaunis for ASHG

The ASHG's Thesaurus is composeci of a three level subject hierarchy and a set of conml terms associated with the subject headings found in the subject hicrarehies. The Thesaurus used by ASHG contains four object classa: b e l - 0 which represents the general subject of the subject hierarchy. Level-1 which represcnts the sub-subject of the general subject and is derived from bel -O; Level-2 which rcpresents the sub-subject of the Levef-2 subject and is derived from bel-2. and finally Control-term which contains the mot tenns that are derived h m the subject headings. A mot term is the origin of aii possible ternis that can be generated h m it by adding the &es and prefkes [7J.

4.4.1 The subject Hierarchies

Since dBerent subject headings may be used to convey the samc subject, and since different people may have different perspectives on the same single subject, controlled subject headings have been derived. The CINDI system focuses on the siandardization of subject headings. This database helps the provider of the pRmary resource in selecting the correct subjects and sub-subjects' headings for the semantic hcader entry.

CINDI's subject hierarchy is made up of three Ievels, where level-O contains the general subject heading. Cumntly we have included only two general subject headings: Cornputer Science and Electrical Engineering. LeveZ-l contains al i the subjects that fall under levez-O subjects. and similarly level-2 WU contain more precise subjects that fall under Ievel-1 subjects.

4.5 ASHG9s Docament Subject Headings Classincation scheme

An important step in constructing the smantic header is to automatically assign subject headings to the documents. Tne title, explicitly stated keywords. and abstract are not by themselves enough to convey the ideas or subjects of the document. Since the author tries to convey or to summuize his ideas in the previously mentioncd fields, ihere is a necd to use all English none n o i r words found in those fields. To assign the subject heaâïngs. ASHG uses the resdting list of significant words generated h m the prcvious section and CINDI's controlled term subject association. The subject heading classification scheme relies on passing weights h m the sieniflcant tenns to their associated subjects. and seiecting the highest weighted subject headhgs [q.

4.5.1 The Algorithm

Having the keywords, title words, abstract words and other tagged words, will help us select the most appropriate subjects for a given document. The following algorithm is used:-

Three lis& of subject headings are to be consmicted. The List of Level-O subject headings, the list of Level-l subject headings and the list of Level-2 subject headings.

For each term found in both CINDI's controkd te- and the geaerated list of words, the system traces the controkd term's attached list of subjects headings (List of level O, level I and level 2). and adds the subject headings to their corresponding list of possible subject headings.

Weights are also assjgned to the subject hierarchies. The weight for a subject is given according to where the term matching its controiled term was found. A subject heading having a term or set of te= occurring in both title and abstract, for instance, gets a weight of seven. The matched t e m . weights are passed to their subject headings.

The system extracts b e l - 2 . b e l - 1 and Levez-O subject headings having the highest weights fiom the thme lists of possible subject headings.

After building the three Lists for the three level subject headings, the system:

1. Selects the subjects using the bottom-up scheme.

2. HaMng selected the highest weighted level-2 subject headings. the system derives their level-1 parent subject headhgs.

3. An intersection is made between the derived 1evel-1 subject headings and the iist of the highest weightd level-1 subjcct headings. The common levcl-1 subjects are the document's level-1 subject hcadings.

4. The system uses the same procedure in sclccting ievel-0 subject beadings.

4.6 Applying ASHG's Extractors

Based on the document's type uncovered in the document type recognition step, ASHG applies an extraction procedure. ASHG uses its understanding of IITML, Latex, text and RTF s yntax documents to extract the document's meta-information. ASHG's RTF-exî~~tor , are applicd to BTF type documents and to unncognizcd type dawnents respec tively.

Using the document's syntax, ASHG extracts summary Ulformation, such as the title, keywords, dates of creation, author, author's informafion, abstract and s k . In tlTw Latex and RTF documents, the author might explicitly tag some of the fields 10 be extracted. In case these fields were not expliciily tagged, ASHG atternpts to extract them using some heuristics. For example, extracthg the keywords in an RTF document, The RTF-extractor extracts words that are between keywords control word braces in meyworàs**-). However, if the explicit keywords were not found in the document, then words found in the titie. abstract and other tagged words wodd be used to extract an implicit list of keywords.

After identifying the document's type, the comsponding extractor is applied to the document, thus aiming to retrieve the correspondhg meta-information to build its Semantic Header.

The RTF-exfroctlor extracts the title, explicitly stated keywords, language (English), author(s), dates (Created, Expisr), size of the file, and the abstract h m a RTF document Generating both implicit keywords and a List of subject headings for a document will be described in a Iater section, since they are a standard procedure for aii extractors.

Extructing the titlcjhm RTF document: The title could be extracted fiom what is between title control word braces in (\atle...). For example, if the RTF document contains {\titIe Automatic Semmitic Header Generator ) , RTF-extractor extracts Automatic Semantic Header Generator as the document's title. If this fails, it then extracts the fust sentence.

Extraeting the subjectfiom RTF document: The subject could be extracted fiom what is between control word bnws in m b j ect...), or the RTF-extractor uses the resulting list of signiscant words generated fiom the subsection: 4.6.2 and CINDrs controlled term subject association. The subject heading classification scheme relies

on passing weigbts h m the significant temis to their associatcd subjccts, and selecting the highest weighted abject headings. The subjcct will be seiected ody after the keywords h m the documents. And these kcywords wïil lead the system in identifying the right subject by using the Ktyword-Subject database.

RTF Parsed Document

Figure 2: Subject EKtraction

3) Extracting the hnguage fion RTF document: The language could be exîracted fiom what is after \deflang control word (default is English).

4) Extracting the autlior(s) RTF document,. The author name is extracted from what is between author control word braces in (\author...). For instance, if the RTF document contains (buthor Abdelbaset Ali), the RTF-extractor extracts Abdelbaset Ali as the author of the document Extraction cm also be made by lodung for the patterns: edited by, written by, revised by, author, when selecting the author and what follows it which wiii then be assumed to be the author. The following information could also be extracted:

.O Tne e - d : wiU be extracted by looking for the "8" symbol which is the worldwide usable symbol for an e-mail.

9 Ine phone: will k extractcd by lodung for the pattern phone or tel.

9 The fac wül be extracted by looking for the pattern fax.

5) Extrocting the 4bstrrrctJiom RTF document: The abstract muid be extracted h m what is between control word bracts in mbstm&-). If it fails, it thcn extracts the paragraph headed by the tagged word absaan If it fails to locate an abstract btading, it applies an automatic abstracting method. This method, which is similar to Luhn's automatic abstracting method, attempts to extract a section or paragraph that is headed by introciucn'on. B a d on the numbcr of signifiant mot words in the sentence, a numaicd measme is devtIoped for a sentence. The autoniatic abstracting includes the highest measured sentences in the abstnn If it fails. the RTr-eztroctor extracts the k t paragraph after removhg the RTF tags and applies the automatic abstracting method, describeci above, to this paragraph.

6) Extraca'ng expücitiy s&zted RGywords /rom RTF &cumenk The keywords an extracted from what is between keywords control word braces in ~eywords,..E For example, if the RTF document contains control word {Uceywords type , quire, object , define , act , abstract , issue), the RTLextractor extracts these as the documents's keywords. If it fails then the pattern keywords are looked for, and whaî foUows are assumed to be the keywords of the document.

7) Extraeting the created datc j h m RTF &eument: The created date could be extracted fiom what is after Lreatim control word. For example, if { kreatim\yr l999\mo8\dy 10\hr 12\min9 ) was found in the RTF document, RTF-extractor extracts 10/08/99 as the document's date If no W t i m command is found in the RTF document, RTF-extrucror uses the stat and GM-îime commiinds to extract the date of creation. stat unu cornniand contains idofmation about the fiie such as File sire in bytes, T h e of k t access, TMc of làst &ta modification and T h of lastfile status change. G M - t h e unk command converts the time to Cwrdinated Universal Time (UTC). which is what the UNIX system uses htemaüy.

8) Extracting ofher tagged words fiom RTF àucumef.. The RTF-extructor extracts a list of tagged words. For example, if the RTF document contains the meta tags \b Cornputer W, the RTF_extractor includes Cornputer in the list of other tagged words. The extracted woràs will be used in the generation of an implicit list of keywords and the generation of a iist of significant words used in the document's classincation scheme. This process of generating an implicit List of keywords and a List of significant words wiii be described in subsection 4.6.2.

9 ) Extracring the version fiom RTF doCume& The version could be extracted from what is after \version control word.

10) Exliacting îhe &e of the RTF Ctoculllent.: using the stat unix command, the size of the file can be ex-

R W Plliscd RTF Document Document

EXTRACïING GENERATOR a ZWk 4 --

h t b r 4 A u t h o r % i n f o ~ n 4 AbslmCt 4 K@Words 4 sriltfricr

ades 4 Version

Figure 3: RTF extractor

4.6.2 Generating an irnplicit list of keywords

ASHG generates an implicit list of keywords in case explicit keywords are not found in the document It denves a list of the most significant words, which is use-d in the document classification scheme.

In case keywords are not found in the document, the system derives a List of words nom the words found in the title, abstract, and other tagged fields. This list of denved words will aIso be used in classifyhg the document- However, if the keywords are explicitly stated in the document, then ASHG will generate a list of words from the words found in the title, abstract, keywords and other tagged fields. This list is used to assign a list of subject headings for the document.

Generating both lists of words relies on the stemming process that wîii rnap the words into their root words, the stemmed word fkquency of occurrence and the word location in the document. 1t uses the following algorithm in generating the list of implicit keywords, this algoritm k ing the same used for the extraction implicit keywords from EITML, Latex documents [7]. In cases where the keywords are not found in the document, and the words used in the classification scheme:-

Extract the titie, abstract and other taggcd fields.

Engiish Noise words constitute usually around 30 to 50 percent of a document. The Idormation Retrieval cornmW1ity calls h m the S~qp List [fl. Tkse words an dropped fiom the exiracted fields.

The remaining words are sent to the stemming p n x w s . This promis wiU remove the words' suffixes and prennes. Tk aim of the stemmhg procear is to gentrate a base word class, which iocludes ai l thc fonas thaî could be gtneratcd h m it.

Because the terms are not equally useful for content rcpresentation, it is important to introduce a term weighting system that assigns high weights for important tenns and low weights for the l e s important terms [19]. Thercfore, the weights constitute the importance of a word The system assigns weights to boch lis& of mot words. The weight assignment uses the following scheme:

1. If the word appears in the explicitly stated ktywords. it is assigned a weight of five. Since authors explicitly state the keywords to convey some important tcnns that their document covers, it is assigncd the highcst weight.

2. Usuaily, words iound in the abstract are the second most important words, because this is where the author aies to convey hiSmCr idea. Therefore, words found in the abstract are the second most sigaificant since they convey the idea of the article more than any words found in other Iocations [20]. If the word appcars in the abstract, it is assigned a weight of four.

3. If the word appean in the title, it is assigned a weight of three.

4. If the word appears in the other tagged words. it is assigned a weight of two.

Each numeric weight is a class by itself denning the words' location. The system has the foilowing classes:

1. A class weight of two defines the OTHER WORDS class. This class contains the terrns found in only the OTHER WORDS field.

2. A class weight of three defines the TITLE class. This class contains a l i the terms found only in the titie field.

3. The class weight four contains all the temu found only in the abstract field. Therefore a class weight of four defines the ABSTRACï class.

4. A class weight of five includes ali the tenns found in either the keywords' field or in the title i d other words fields.

5. A class weight of six iacludes a l l the te- found in both tbe abstnia field and the other words fields.

6. A class weight of seven includes ail the terms found in either the keyword and other words fields or the abstract and titk fields.

7. A class weight of eight contains all the ternis f o n d in keyword and titie fields.

8. A class weight of nine contains aii the terms found in-either the abstract, titie and other words fields, or abstract and keywords fields.

9. A class weight of ten contains al1 the terms found in the other words, title and keywords fields.

10. A class weight of eleven contains dl the tem found in the other words, abstract and keywords fields.

11. A class weight of twelve contains a i i the terms found in the titlt, abstract and keywords fields.

22. A class weight of fourteen contains all the tcrms found in the other words, title, abstract and keywords fields.

A texm appearing in other words field is Iess important tha . the one appearing in the abstract field. Furthemore, a term appearing in both titie and other words fields is less significant than the one appearing in the keywords. abstract and title field. in high class weights, we are interesteci in extracthg more ttrms than in lower class weights. Thus, we tend to extract more ternis from the high weight classes. To limit the number of extracted terms, we use the term's frequency of occurrence. Significant terms have the highest frequency of occurrence in the low weight classes. As the class weight increases. more of its terms are regarded as signïficant. To include more signifïcant te=, the domain of the ternis' frequencies is expandeci. The more the class weight, the wider is the domain frequency of the significant tem.

For each class, we set the maximum class 6requency to be the maximum frequency of occurrence of a t e m found in that class. For example, if in dass four, we had three terms baving two, four and six as fkquencies, the system would select six as the maximum class four frequency. The words' frequencies arc cornparcd with their corresponding maximum class frequency. For Low weight classes such as two and three, significant ternis have the maximum class kquencies thus Limiting the number of significant terms. However, ail ternis found in class eight and more are sigficant regardles of their frequency of occurrence.

Figure 4: ASHG extraction for RTF document

4.7 Semantic Header Validation

Once the process of extracthg the meta-information or secondary idormation is completed, the semantic header is displaycd for the source providtr to modify, add or remove some of the attn'butes. Once the provider Whts, tbe semantic heder can be stored in the CINDI database.

Chapter 5

5. ASHG's Experiments

In this chapter, we illustrate how the ASHG system extracts the meta-information h m the RTF documents, and dernonstrate ASHG's automatic subject heading chssification. For each of these document types, we apply ASHG and show the resuits. We compare the subject classification generated by ASHG with that of INSPEC for the same set of documents.

5.1 Experiments

After applying the ASHG on a set of RTF documents, the titles of thest documents can be viewed in the appendix. The generated index fields such as title, keywords, abstract and author are compared with those that are found in the document. The ASHG's automatic subject headings classification resuits are compand with the INSPEC's classification.

The experiments were conducted on twenty-five documents to test the acmacy of the generated index and the subject headuig classification results. These documents deal with computer science and ekctrical engineering subjects. ASHG was able to extract al i the explicitly stated fields such as title, abstract, keywords and author's information with a hundred percent accuracy. If the abstraa was not explicitly stated, ASHG was able to automaticdy generate an abstract that would describe the paper. However, ASHGf implicit keyword extraction generated a list of words thai included some words tbat are insignificant. These insignificant words in tum lead to the diversion in subject classification.

Based on the consultation with the papers' authors on the ASHG's subject classification results done by Haddad [7]. Their response was divided into thne categones: good. OKEVot sure and poor subject hierarchy selection. Good subject hierarchy selection implied that the authors would have chosen them as subject hierarchics for the documents. OlUVot sure subject hierarchy selection implied that the authors doubt the results and that they would not choose them. Finally, thepoor subject hierarchy seleclion implied that the selected subject hierarchies describeci anothcr different subject-

We compared the ASHG's subject classification results against the INSPEC's classification done by expert cataloguers and thesaurus. Some of the ASHG's subject classification had different words than INSPECs even though they described the same subject. That was due to the fact that our computer science subject classification was built

fkom ACM and not h m INSPEC. Skce our systcm w u ody based on the fiecluency and location of words in a document to determine the documtnt's m o r d s and subject classification, it has missed the importance of the word senses and the relationship between words in a sentence. W e have compand the resuit for HTML-extractor for HTML documents, Latex-extractor for Latex docmmets, and Text-extractor for text documents with the result for the same document in RTF format wbich extracted by RTLextrac tor .

RTF Document

Docl r

Doc2 Doc3 Doc4

DocS Doc6

L

Doc7

Doc8 r

Doc9

Number of Subject Headings gencrated by ASHG

*TF-extractor)

10

6 4 4

4 5

5

6

14

Accuracy

A= Author T: INSPEC

Author's Opinion

Good

Doc 10

Doc 1 1

Doc 12

Doc 13

I

Doc 14

4

O O 0

O 1

O

O

4

-5

4

3

3

Docl5 /'

4

OK/NotSurc

5 1 i 20% (A)

Poar

4

2 2 2

4 1

5

2

4 2 2

O 3 O

2

1

1

1

1

40% (A)

O (A) O (A) 0 (A) 25% O

O CA) 20% (A)

0 (A) 20% 0

1 3

4

2

1

O

33.33 (A)

14.28% (l)

O

1

1

2

25% O 20% (A) 20% m 25% (A)

3333 (A) 66.66 0 50% (A) 50% 0

Table 3= Summary d ASHG's RTF test resdts q p h t tht aiithom a d INSPEC's dQ

RTF Document

Docl6

-17

Docl8

Docl9

Doc2O

Doc2 1

Doc22

Dm23

Dm24

Doc25

Averages

5.2 Sample Results In this section, we will show some of the indexes generated by ASHG for RTF documents.

Numbcr of Subjcct Headings geaerated by ASHG

(RTF,exaactot)

3

7

7

4

4

3

<sernhdrB> <useridB> <useridE> <passwordB> q a s s w o d 3 ctitleB> Designing a Class Library ctitleEw <alttitieB> <alttitieE> <subjectB> <generaIB> Computer Science <generalEw <sublevell B> Programming languages <sublevel lE, <sublevel2B> Computer program language <subleve12E> <generalB> Computer Science egeneralEw csublevel 1 B> Theory of computation by abstract devices <sublevell E> csubleve12B> Complexity measurcs and classes <.ublcvcl2E> <generalB> Computer Science cgencralE> ~sublevell B> Theory of computation by abstract devices esublevel lE> <sublevel2B> Complexity measurts and classes hieriarchies <sublevelZEs <generalB> Cornputer Science egencralEw <sublevellB> Theory of computation by abstract devices esublevellE>

Author's Opinion

Good

O

-

3 I

OK/NotSipe

O

26

5

4

-

Po01

3

i

- -

<sublevel2B> Complexity mcasurcs and classes reducibility and completeness <sublevel2E> <generalB> Cornputer Science <generalE> <sublevellB> Theory of computation by absaact devices <sublevellE> <sublevel2B> Relations among complexity classes <sublevel2E> <subjectE> dangtmgeB> English danguageE> <char-sctB> <char-setE> <authorB> <aroleB> Author <aroleE> <anameB> Petet Grogono <anamcE> Cao@> c a o e <aaddrcssB> Department of Cornputer Science, Concordia University de Maisonneuve Blvd-, Montreaï, Quebec Canada H3G 1M8 <aaddressE> <aphoneB> 848-3000 <aphomE> aîfaxB> 848-2830 cafaxE> <aemailB> [email protected] <æmaifFs <authorE> ckeywordB> library , class , programmer, program, problem , prtstntation , main . lay , intcmal. environ , do , devetop, dee , complex, collect . d l . built , browse , advancage, abstract <kcywordEw tidentifierb <domain3B> E.TP domaÏn3b <value3B> cvalue3E> cidentifierE> <datesB> <creatcdB> 199916114 <creatcdE> cexpiryBw I999/6/ 14 <expiryE> <datesE> cversionB> 1 cversion5 <çpversionB> <spversionE> <classificationE> <domaidB> <domain4E> <value4B> cvalue4Ew <classificationE> <coverageB> <domains%> <domai- <valueSB> cvalueSE> <coverageE> csystem-requirementsB> <componentB> <componentE> <exiganceB> <exiganceE> <system-requirementsE> <genreB> cformB> d o r d 3 <sizeB> 4503 <si- <genreE> <source-referenceB> aelationB> aeIationE> <domain-identifitrB> cdomain-idcntiflerE> <source-referenceb <costB> <costE> <absnactB> One of the functions of a class Iibtary is to provide a collection of data structures. W e argue that if a class library is to be useful, it must do more rhan this: it must provide usefiil abstractions and it must allow choices of representation to be deferrcd until late in program developmcnt. W e have designeci a language and a development environment called Dee. The class library of Dee is built in layers and has a complex internai stmcture. A sophisticatd class browscr conceals the complexity of the class library fiam

programmers. We introduce the salient f- of Dee, describe the structure of iîs class Ii'brary. exphin the advantages of this structure, and discuss the rcmPiLUng poblcms-

<passwordB> <passwordEw <titleB> Issues in the Design of an Object Oriente <aIttitleB> <alttitleE> <subjectB> <generalB> Computer Science <generalE> <su blevel lB> Programming languages esubievel lEw <sublevel2B> Abstract data types in languagc constructs and featums < s u b l e v e l ~ <generalB> abstract <generalE> <sublevellB> thcoq of computation by abstract devices <sublevellE> <sublevel2B> abstract data types in laquage constnicts and featurcs <sublevelZE> <generalB> Computer Science <genctalE> <sublevel 1 B> Programming languages <sublevellE> <sublevel2B> Classes and objccts in language constructs and ftanues <sublevelZE> cgeneralB> Computer Science <generalE> ~sublevel lB> Theory of computation by abstract devices csublevel lE> csublevel2B> Complexity measures and classes <subltvel2E> cgeneralB> Computer Science <generalE> csublevellB> Theory of computation by abstract devices <sublevellE> csublevel2B> Complexity measurcs and classes hicrarchics c~ublevel2E> cgeneralB> Computer Science <gencralE> <subIevellB> Theory of computanon by abstract devices <sublevelfE> csublcvel2B> Complexity mcasurts and classes reducibility and completcness <sublevel2E> cgeneralB> Computer Science <generalE> csublevel lB> Theory of computation by abstract devices <sublevellE> csublevel2B> Relations among complexïty classes <sublevel2B csubjectE> danguageB> English <languagcE> cc har-setB> <c har-se= <authorB> <aroleB> Author <aroleE> <anameB> Peter Grogono canamtEw caorgFb <aorgE> caaddressBw Department of Computcr Science, Concordia University de Maisonneuve Blvd. West, Montr 4, Quebec Canada H3G lM8 d d r e s s E > caphoneB> 848-3000 <aphoneE> cafaxB> 848-2830 <afaxE> caemailB> [email protected] <aemailE> <authorE> <keywordB> type , rcquue , object , define , act , abstract , issue &ywordE> UdentifierEb cdomain3B> FTP <domain3E> cvaluc3B> cvalue3B cidentifierb

<datesB> <crcattdB w l999/6/4 ccrmedb <expiryB> l999/6/4 e x p i e cdattsE> cversionB> 1 <versionE> <spversionB> c s p v c r s i o ~ <classificationB> cdomaidB> uiornain4E> cvalueAB> cvalue4E> cclassificationB <coverageB> <domainSB> uiomainSE> cvalueSB> cvalueSE> <coverageE> aystem-requirementsB> <wmponentB> <componentE> <exiganceB> ctxiganccb esystem-requirementsEs cgenreB> <formB> <formE> aizeB> 73741 <sizeE> e g e n r e b <source-referenceB> geIationB> aclatio- <domain-identifierB> <domain-idtntifierEw <source-referenceE> <costB> <costE> <abstractB> The object dented paradigm, which advocatcs bottom-up program development, apptats at fint sight to run countcr to the classical, top-down approach of stn~~tured progmmming. The decp requiremcnt of stnictured programming, however, is that pmgramming should be based on welldefined abstractions with clear meaning rather than on incidental characteristics of computing machinery- This rcquirement can bc met by object oriented programming anâ, in fact, object oriented programs may have bettcr structure than programs obtained by functional decomposition. The definitions of the basic components of object orientcd programming, object, class, and inheritance,

are stiii sufficientiy fluid to provide many choices for the designer of an object orientcd languagc. Full support of objects in a typed languagc fcquires a numbcr of fatum, including classes, inhcritance, genencity, rcnaming, and rcdefinition, Although each of these features is simple in itself, interactions between features can becorne cornplex. For example. rtnaming and &finition may intcract in unexpectcd ways. In this paper, we show that appropriate design choices can yield a language that fiilfills the promise of object oriented prograrnming without sacrificing the rcquircments of stnictured programming- <absa- cannotationb <annotationE> <semhdrE> &OF>

The fdowing example shows CïNDI Semantic Header Grapbicai interfAce. W e have used RTF document to extract mcta-information. The gcneratd Semantic headcr is written into a nle whcre it is read and displayed into the QNDI's GUL The ASHGS resuit wîii be used as a starting point by the author. and Wshe has the oppommity to comct the errors and include fields of the Sunantic Header not given befon registcring it.

Figure 7 : CIND19 Semantic ~ Ï d e r Graphical interface (c)

The system wili perform a number of steps in order to register a Semantic Header. At the beginning. the pamr wiïl v e m the syntax of the input file and makt sure that the mandatory attributes of the Semantic Header have k n entcrcd. These attributs includc those needed to assign a Semantic Header Naxne which arc: title, f b t general subject, name of fïrst author. date of creation, and finally version; as weU as first identifier and one keyword. The non-noise words of the Semantic Headtr will be storcd in tcmporary variables and data structures for later use. The next s e p wouid be for the database module to verify the status of the user ID and the password Subsequently, it will be assued that such a Semantic Header does not exist in the database. Findy, the words and the Semantic Header are indexed into the database.

Semantie Header Delerion

A Semantic Header can bc deleted by the user who registercd the Semantic Header only if no annotation aftcr the Semantic Header was registed. The procedure of delethg a Sernantic Header is similar to that of registering it. The Semantic Header and the SHN mdintained in the word object with the value comsponding to each non-noise word WU be deleted fiom the database. If a word object happens to contain no SHNs, it wiU be deleted fiom the database as well-

A Semantic Header c m be updated by the user who registered the Semantic Header and requires the user ID and the password during the initial registraîion. However, a number of the attributes of a Semantic Header cannot be modified. These attributes are those comprishg the Semantic Header Name as weil as the annotation field added after the SH was registered It upàates those fields that are modified by the user. For each modified field, the deletion and addition of words are also managed by this method.

Chapter 6

6. Conclusion and Future Work

6.1 Conclusion

Content indexing provides a powerful means of helping users locate relevant information among large repositones- To be most effwtive, content indexhg must exploit the semantics of the dinerent types of data and di&rent environmenits in which it is uscd. In thïs major report, we prcscnted a mode1 that will automatically extract and gmerate an index or meta-information. We have integrated this mode1 which is called RTF extractor with ASHG.

ASHG exploits the file naming conventions and the data w i t h a document to detemine the document file type. ASHG exploits the semantics of the document's types in extracthg the meta-information. It also assigns weights for terms dependhg on their location in the document. Both tenn weight and occurrence frequency was used in assigning terms for a document. These extracted tem were used to classify a document using the association between CINDI's controlled term and their subject headings in the thesaurus. CXNDK has a three level subject hierarchy for Cornputer Science and Electrical Engineering. CINDI's computer science subject hierarchy was based on ACM and CINDI's electrical engineering subject hierarchy was based on INSPEC. LCSH was used to augment both subject hierarchies. We also derived control temm h m CINDI's subject headiogs. These control temis wen associaîed with th& s u b j e in CINDI's thesaurus. In addition, we presented a method of generating a Semantic Header, caiied ASHG. This scheme automaticaliy extracts and generates an index or meta-information.

The improvements made by the introduction of ASHG are that it extracts more information than other similar systems and it aiso nies to select the subject of the document by looking into our own thesaurus and the keyword-subjcct database.

Lastly, we applied ASHG to a collection of test RTF documents and compared the results to the actual assignments made by INSPEC. The resuits showed a hundred percent accuracy in extracthg the explicitly stated fields such as the titie, abstract, author and keywords. They also showed some level of accuracy in generating the abstract.

6.2 Future Work

Some of the system's rehcmcnts shouid hcludc:

Terms, which arc not significant alone, but ate significsillt if they pppepr adjaccnt to another te- should be extractcd as signifïcant tcm. For exemple, the temi wke should not be exîracted unless it is followed by another tcnîî such as wire grid. ASHG's keyword extraction pmcess should handle more than single conû0IId tnms. Future work should explore the effkct of exaoçting noun phrases and compomd cuntroiled terms-

Word senses and detennining the relationships that those wods hrve to e h 0 t h ~ should be resolved. The semantic level language processing shoyld be bandleci by ASHG,

The controlled terms and their synonyms should belong to tbe same control terni and they should be associated with the same subject heridings.

The domain of the stopword list shouid be explod. and more significmt tmns should be associateci with the subject hcadings.

Build more subject hierarchies such as Civil Engineering, and Mechanical Engineering. Extend the type of documents thai ASHG can extract meta-information fiom,

Appendix

Pupers Used in Testing ASHG for RTF-extractor

The following is the list of papers used in testing ASHG for RTF_extrafor

Docl Grogono P., Designing for Change, Department of Computcr Science, Concordia University, Montreai, Canada

Doe2 Grogono P., Designing a ciizss library, Department of Computer Science. Concordia University, Montreai, Canada

Doc3 Grogono P., A Co& Generator for Dee, Department of Computer Science, Concordia University, Montreal, Canada.

Doc4 Butler G., Grogono P., Shinghal R-, and Tjandra 1.. Retrieving Information fiom Data Flow Diagr- Department of Computer Science. Concordia University, Montreal, Canacia.

DocS Grogono P. and Santas P., Equality in Object Oriented Lrurguuges, Department of Computer Science, Concordia University, Montreal, Canada and Institute of Scientific Computation, ETH Zurich, Switzerland.

Doc6 Grogono P. and Gargul M., A Computationatl Mo& for Object Oriented Programing, Department of Computer Science, Concordia University, Montreal, Canada

hc'l Grogono P. and Gargul M., A Graph Mode1 for Object Oriented Programming, Department of Computer Science, Concordia University, Montreai, Canada.

Docû Grogono P. and Gargul M., Graph Semantics for Object Oriented P rogrumming, Department of Cornputer Science, Concordia University, Montreai, Canada.

D O C ~ Khorasani K., Adaptive Control of Nonlinear Systems Using Output Feedback, Department of Electrical and Computer Engineering, Concordia University, Montred, Canada

Docl0 Nascimento M. A. and Dunham M. H., A Proposal for Indexng Bitempoml Databases Via Couperative B+ trees, Southern Methodist University, Dallas, USA.

ml1 Grogono P., Issues in the Design of an Object Oricnted Programming Language, Department of Computer Science, Concordia University, Montreai, Canada

DoeU Grogono P., A M& for Compvturg with Objects, Depamnent of Cornputer Science, Concordia University, Montnal, Canada.

Docl3 Paknys R and Raschkowan L. R., Moinmt Method Su@Ùce Patch d Wirc Grid Accuracy in the Computation of Near Fie&, Deparcmtnt of Elë~tncd ~ n d Computer Engineering, Concordia University, Montreai. Canada

Docl4 Paknys R, On the A c c u r q of the 011) for the Scutterüag by a CyIinder, Department of Elecaical and Cornpiter Engineering, Concordia University, Montreai, Canada,

Docl5 Davis D., Paknys R., and Kubina S. J., The Basic Scattering Cde Viewer A GUI for the NEC Basic Scattering C d , Department of Ektrical and Computer Engineering, Concordia University, Montreal. Caiuda

Docl6 Grogono P. and Cheung B., A Semantic Browser for Object Oriented Progrmn Development, Department of Computer Science, Concordia University, Montreai. Canada.

Doel7 Ounis L, Pasca M., An Extended Inverted File Approcich for Infomation Retrieval, Grenoble, France.

Docl8 Kienie H. M. and Fortier P. J., Exception-HClltdling Extension for un Object- oriented DBMS, University of Stutîgart, Gczmany and University of Massachusetts Dartmouth, USA.

Docl9 Nascimento M. A. and Dunham M . 8, A Proposa1 for Zndung Bitcmporal Databases Via Cooperative B+ trees, Southem Methodist University, Dallas, USA.

D20 Ludwig A., Becker P. and Guntzer U., Inte#acing Onlinr Bibliographie Databases with 239.50, University of Tubingen, Gemany.

M l Park C. and Park S., Alternative Correctness Criteria for Multiversion c'oncun-ency Conîrol and a Locking Protocol via Freezing, Sogang University, Seoul. Korea.

Dm22 Woo S., Kim M. H. and Lee Y. J., A c c o d t i n g Logical bgging under Frury Checkpointing UI Main Memry Databases, Department of Computer Science Korea Advanced Institute of Science and Technology. Taejon, Korea

M 3 Cho E. S., Han S. Y., Kim H. I. and Thor M- Y., A New Data Abstraction Layer Required For OODBMS, Department of Computer Science and Computer Engineering, Seoui National University, Swul. Korca.

Doc24 EhüSoya S . A., A Formal Spec~flcutioon Strategy for Ekctronic Commerce, Department of Computer Science University of Manitoba, Winnipeg, Manitoba, Canada

Dae25 Revesz P. 2. and Li Y.. A Lincar Col~~ttaint Database System wirh Aggregate Operators, Dept. of Cornputer Science and Engineering, Lincoin, USA

References

[Il De Bra, P., Houben, G-J., & Kornabky, Y., SeurcA in the World- Widc Web, http://www.win.tue.nYhelddoc/&nio,ns

121 Desai, B. C., An Introduction to Datrrbcrse Systems, West. St. Pau4 MN lm.

131 Desai B. C., Coverpage aka Semannmannc Hcadcr, httb.J/www.cs .concofdiacal-facultv/bcdcsai/scmanîic-eh, July 1994, reviscd version, August 1994.

[4] Desai B. C., The Semantic H 4 r IiictcxUlg and Searcfing on the internet, Departmcnt of Computer Science, Concordia University. Montiieal, Canada, Fébmary 1995. http-J/fkr~r~.cs.concordia.ca/ faculty/bcdesai/cindi-system- 1.1 .hW

[SI Edmundson H. P. and WyUys R E., Autondc AbstroctUlg d h h i n g Survey and Recommen&tions, CommullfCati011~ ofACM, 45, pp. 226-234, May 1961.

[6] Fletcher, J. 1993., Jumpstation, htpJ/www.stir.ac.uk/isbidj

173 Haddad S., ASHG: Automatic Semantic Herrder Generator. Master's thesis, Department of Computer Science, Concordia University, Montreal, Canada, 1998.

18 ] Hardy D. R., Shwartz M. F., Customùed lnfomtion Extraction as a Busis for Resource Discovery, Departinent of Computer Science, University of Colorado. March 1994; Reviscd February 1995.

[9] Katz, W. A., Introduction to Reference Work, Vol. 1-2 McGraw-Hill, New York, NY.

[ 1 O] Koster, M., ALMrEB(Archie Like I d a i n g the WEB), htti,://web.nexor.co.uWaliweb/dOC/aliweb.html

[Il] Luhn, H. P., The automatic creation of literature abstraets* IBM Journal of Research and Developmenr, 2, pp. 159-165,1958.

[12] McBryan, Oliver A., World Wi& Web Wonn,

[13] Rich T a t Format (RTF) Specificatiun; RTF Version 1.3 documents. htt~://ni ~ht.~mate-wisc.edu:8Q/software/RTF/

1141 Shayan N., CINDI: Concordia IiVdun'ng anà DIscovery system. Master's thesis, Department of Computer Science, Concordia University, Montrtal, Canada, 1997.

[IS] Thau, R., Sireirrdex Trmrlucer, htt~://www.ai.mi t.edu/tools/site-index-ha

[19] Salton G., AIlan J. , Buckley C., and Singhai A. , Automatic Analysis, Theme Generation, and Summarization of Machine-Readabie Te-, Science, V o W , pp. 1421-1426, June 1994.

1201 Saiton G., The SMART Retrieval System, hntice-Hall Inc.,d6, 1971.

1211 Youquan Zhou. CINDE Thc virtiral Librory Grcaphical Usrr Inteflace. Master's thesis, Department of Cornputer Science. Concordia University, Montreai, Canada. 1997.

[22] UMX Programnung P d , 2.6 Edition By Lsny Wall, Tom Christiansn dt Randal L. Schwartz 2d Edition Septemkr 1996.

1231 UNiX Leaming Ped Second Edition 1997 By Randal L. Schwartz and Tom Christianse.

EXTRACTION OF SEMANTIC HEADER FROM RTF DOCUMENTS · 2005-02-10 · Chapter 3 describes Rich Text...

Documents

Transcript of EXTRACTION OF SEMANTIC HEADER FROM RTF DOCUMENTS · 2005-02-10 · Chapter 3 describes Rich Text...