
Information Processing & Management, Vol. 25, No. 2, pp. 187-203, 1989. 0306-4573/89 $3.00 + .00. Printed in Great Britain. Copyright © 1989 Pergamon Press plc.

AUTOMATED TITLE PAGE CATALOGING: A FEASIBILITY STUDY

STUART WEIBEL*, MICHAEL OSKINS and DIANE VIZINE-GOETZ
Online Computer Library Center, Office of Research,

6565 Frantz Road, Dublin, OH 43017-0702

(Received 23 February 1988; accepted in final form 3 June 1988)

Abstract-The cost of original cataloging remains a substantial expense for libraries and a hindrance to rapid availability of newly published materials. We have prototyped a rule-based system to explore the impediments to automating descriptive cataloging from title pages. Our test results suggest that it is possible to capture a substantial part of the regularity in title page layout in a small set of rules. Our system correctly identified over 80% of the bibliographic fields present on a random sample of title pages. Significant unsolved problems include the difficulty of incorporating a cataloger's general knowledge about the world in such a system, the complexity and irregularity of cataloging rules, and the lack of reliable data capture techniques. Nonetheless, the methods explored hold promise for advancing the state of the art in the automation of cataloging and document format recognition.

1. INTRODUCTION

Cataloging is a resource-intensive activity requiring skills that are typically in short supply. Resource-sharing library networks such as OCLC, RLIN, and Utlas grew out of the need to reduce the cost of original cataloging, and compact disc-based cataloging products have further contributed to cost efficiencies. Nonetheless, the cost of original cataloging remains high, and the backlog of uncataloged materials impedes timely access to many materials. In addition, there are classes of materials that are seldom effectively controlled due to work backlogs (pamphlet collections and government documents often fall in this category).

The problem is not unique to libraries. The so-called gray literature (correspondence, internal corporate memoranda, research reports) is rarely subject to effective bibliographic control and is therefore unavailable to retrieval systems. It is typically not economic for catalogers to process these materials, although bringing them under more effective bibliographic control would result in significant benefits.

Automated cataloging holds the promise of increasing the productivity of catalogers and allowing the economic building of specialized databases of document collections with minimal human cataloging effort.

1.1 State of the art

The concept of automated cataloging is not new. Nearly two decades ago Fred Kilgour, founder of OCLC, proposed extracting information from title pages to enhance the efficiency of the cataloging process [1]. Soon after, the Library of Congress also addressed the possibility of automated preparation of cataloging records from printed materials [2], and the dissertation work of Martha Fox [3] addressed whether computer modeling of some of the cognitive components of cataloging might be fruitful. Recently, scanning and optical character recognition (OCR) technology has matured sufficiently to pique renewed interest in these areas.

An early attempt to automate part of the cataloging process was carried out by Davies and James at Exeter University [4]. This study focused on the development of a computer-based tool for use by a cataloger to aid in the interpretation of the complex set of rules of the Anglo American Cataloguing Rules, 2nd edition (AACR2) [5]. It was not fully implemented but provided a starting point for further development. Davies has recently published a more extensive outline of the features a computer system would need to effectively augment a cataloger's efforts [6].

*Author to whom correspondence should be addressed.

Hjerppe at Linkoping University, Sweden, implemented a set of rules from Chapter 21 of AACR2 that deals with choice of access points [7]. The project, called ESSCAPE (Expert Systems for Simple Choice of Access Points for Entries), was implemented using EMYCIN, a shell system for building expert systems. This approach has been supplanted by a project called HYPERCAT, which is concerned with the structure and design of an enhanced catalog. A third effort, the AUTOCAT project in West Germany, seeks to generate bibliographic records of periodical literature in the physical sciences that is available in machine-readable form [8].

In a related project, Harrison reported on the retrospective conversion of cataloging cards to full MARC format records using the conversion software developed by Libpac in conjunction with the optical scanning technology of the Optiram company of Great Britain [9].

Ling-Hwey Jeng has proposed a system design that approaches the problems of title page format recognition in a manner related in many respects to our approach [10]. At this writing, no system implementation of her ideas has been forthcoming.

Helga Schwarz of the Deutsches Bibliotheksinstitut in Berlin also proposed the form an expert cataloging system might take [11]. Although no system implementation was attempted, some potential problems and issues were identified and a sample of title pages was analyzed and discussed. She voiced a strong bias in favor of a stand-alone "cataloguing robot" system as opposed to the cataloger's assistant advocated by Davies. Our sympathies lie with this approach as well, although our experience in building this prototype somewhat tempers our optimism about achieving a completely automated system. The complexity and ambiguity of AACR2 rules, and their dependency on knowledge of the real world, make an unattended cataloging robot unlikely in the foreseeable future.

Svenonius and Molto at UCLA have recently completed a study of the derivation of name access points from machine-readable title page surrogates [12]. Their approach assumes that both personal and corporate names can be identified from a scanned title page image and poses the question of whether these title page names provide adequate name access points. This study is in a sense complementary to our own work that attempts to identify the various bibliographic elements on the title page.

1.2 Statement of the problem

Monograph cataloging can be divided into two main activities: descriptive cataloging and subject cataloging. Useful cataloging systems must include both of these functions. In order to limit the scope of the problem, we have restricted our investigation to the first component of the cataloging problem: descriptive cataloging. Specifically, we have attempted to identify the elements of an AACR2 first-level cataloging description commonly found on title pages, which include:

1. Title (and other title information)
2. Statement of Responsibility
3. Publisher
4. Place of Publication
5. Date of Publication
6. Edition Statement

Title pages seldom contain all of these elements and typically do not contain sufficient information for subject cataloging. A more comprehensive approach would include interpretation of the verso of the title page (the page immediately following the title page), table of contents, and perhaps index to construct a complete cataloging record. Title page analysis represents a reasonable first step in the overall process and sheds light on the nature of the obstacles and scale of the larger problem.


The problem can be divided into three broad areas:

1. Scanning and optical character recognition (OCR),
2. Incorporation of the resulting text and its attributes into a useful data structure, and
3. Application of rules to identify relevant cataloging information on the title page.

2. METHODOLOGY AND SYSTEM DESIGN

2.1 General approach

We cast this problem in an expert system model because we feel that a knowledge-based approach is an appropriate development vehicle for a poorly understood problem in human interpretation and reasoning. An expert system is not necessarily the most appropriate delivery vehicle for such a system; the lessons learned from this prototype could be incorporated in a more conventional programming idiom as well.

No model of user interaction is embodied in this system because our goal is a system that stands alone (although many of our ideas might profitably inform the design of a cataloger's assistant as envisaged by Davies [6]). The value of the rule-based model is that it allows rapid debugging and modification of the rules; subproblems can be dissected, solved, and revised incrementally without discarding and rewriting large blocks of code. Furthermore, this model provides a structure that can be readily adapted to similar problem areas by changing the rule set.

2.2 The sample

An initial version of the system was developed from an arbitrary sample of title pages selected from books at hand. In the meantime, a collection of randomly selected volumes representing current cataloging was assembled. Specifically, 104 records used in OCLC online cataloging activity in May 1986 were selected randomly from OCLC transaction tapes and the corresponding volumes acquired through interlibrary loan. The sample was restricted to English-language monographs held by three or more institutions. We subsequently eliminated 13 items from our sample because they were actually dramatic scripts, journal article reprints, or other nonbook items. We divided our remaining sample into two parts for processing and used the first group of 45 as a training sample to refine the system. We tailored the performance of the system to encompass the variety of title page configurations represented in this sample. We used the remaining 46 title pages as a test sample to determine system performance with material it had not "seen" before.

2.3 Machine-readable title page surrogates

Preliminary experiments with commercially available OCR devices demonstrated that these devices cannot presently recognize the diverse range of display fonts typically found on title pages. The difficulties typically fall into two categories: (1) size of font and (2) abnormal typefaces. OCR devices we tested do well with typefaces in the approximate range of 6 to 30 points. It is rare for title page fonts to fall entirely within this range. The styles of title page fonts, intended as they are to attract attention more than to facilitate reading, are often problematic for OCR devices. Cursive fonts readily confound the OCR devices we tested, and other display fonts often violate common understanding of font definition. Our belief is that these problems can be resolved by incremental improvements in this evolving technology and by preprocessing the document images (scaling zones from a page, for example, such that the type size will fall within an acceptable range). Our research agenda includes such efforts, and commercial activity in this area suggests that others are pursuing such improvements as well.

In the meantime, we chose to bypass the scanning issue by using machine-readable typography files as image surrogates. To ensure that our system was extracting data from machine-readable images in a truly objective fashion, all title pages from our sample were typeset using the TeX typographic language. Our goal was to generate a machine-readable image that was faithful in style and layout to the actual title page image, and then to process that surrogate image. Graphics on the title pages were represented by empty boxes; the system recognized these areas as non-textual and subsequently disregarded them. This is consistent with OCR devices that try to recognize graphics as characters and, when they fail, simply ignore them. Our system also ignores horizontal lines used as separators. Both Jeng [10] and Schwarz [11] suggest that such lines may be useful as bibliographic field delimiters; our experience suggests that they introduce more confusion than they resolve. One problem that we have not considered is the inclusion of writing in graphic symbols (as with company logos, for example). A pattern recognition component would be a useful system feature to address this issue. Figure 1 illustrates a typical pair of original and TeX surrogate title pages.

2.4 System design

Figure 2 illustrates the overall system design. The pathway from the paper document to token files via OCR (dashed lines) represents an idealized version that we hope will evolve. The actual process used the TeXed surrogates described. TeX generates a device independent (DVI) file that can subsequently serve as input to a wide variety of display devices, from dot-matrix printers to phototypography devices. In our case, the DVI files were interpreted by our DVI parser, a program developed to extract space-delimited character strings and their attributes (page location coordinates, type style, and type size). The DVI parser writes the list of tokens and attributes to a token file that serves as the input to the main system. The main system was written in Prolog, largely because of its wide use in the construction of such systems.

2.5 Tokens and their attributes

The basic unit of data in the system is the token, a space-delimited character string. Tokens are read from the token file and incorporated into Prolog data structures. The resulting data structure reflects the position, size, style of font, and case class of each character string on the page.

Case class is a derived element in this structure. The idea is based on the traditional notion of upper and lower case but is extended to capture additional information about the token such as the presence of numeric characters, punctuation, symbols, and white space.

Fig. 1. An original title page image (left) and its TeXed surrogate (right). All our title page facsimiles were developed using the fonts of the Metafoundry (registered trade name).


Fig. 2. Functional block diagram of the automated title page cataloging system.

Table 1 illustrates the variety of case classes and the rules for determining a case class designation. Table 2 illustrates actual tokens derived from the title page of Chance and Necessity.

Following the creation of tokens, a subsidiary page statistics structure is created that describes specific characteristics of the page, including left-most and right-most margins, top baseline, bottom baseline, largest font on the page, and median vertical gap between printed elements on the page. This structure provides a reference of overall page characteristics for use in subsequent processing.
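For concreteness, the following is a minimal sketch of what such a structure might look like; the term name and argument order (apart from the largest font, which the discussion of rule conditions below places in the first argument) are our assumptions, not the authors' published code.

    % Hypothetical page statistics term (argument names assumed).
    % The largest_font condition described below compares a compound's
    % font size against the first argument of this structure.
    page_stats(LargestFont,      % largest font size on the page (points)
               LeftMargin,       % left-most margin
               RightMargin,      % right-most margin
               TopBaseline,      % top baseline
               BottomBaseline,   % bottom baseline
               MedianVGap).      % median vertical gap between elements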

2.6 Compound token formation

Once these structures are created, they are recorded in Prolog's internal database and can be used in subsequent processing without reference to external files. The second phase of processing involves building functionally meaningful units that can then be evaluated for their relevance to bibliographic fields. We designate these units as compound tokens, or simply, compounds. Compounds are constructed by adding tokens to a compound structure until any of three events is detected:

1. Font style changes,
2. Font size changes, or
3. Vertical white space separating successive compounds exceeds a specified threshold.

These simple rules result in the grouping of tokens (and their respective attributes) into compounds that have proven useful in delimiting relevant bibliographic fields. Table 3 illustrates the compound tokens derived from the tokens from the title page of Chance and Necessity.
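To make the grouping pass concrete, here is a minimal Prolog sketch under simplified assumptions: tokens are reduced to a hypothetical token(Word, Style, Size, GapBefore) term (not the authors' structure), with \vskip gaps folded into GapBefore, and the threshold value is invented for illustration.

    % Compound formation under the three break conditions: a change of
    % font style, a change of font size, or a vertical gap over threshold.
    gap_threshold(150).                  % thousandths of an inch (assumed)

    compounds([], []).
    compounds([token(W, Style, Size, _)|Ts], [[W|Ws]|Cs]) :-
        take_run(Ts, Style, Size, Ws, Rest),
        compounds(Rest, Cs).

    % Extend the current compound while style and size match and the
    % gap stays small; any break condition closes the compound.
    take_run([token(W, Style, Size, Gap)|Ts], Style, Size, [W|Ws], Rest) :-
        gap_threshold(Th), Gap < Th, !,
        take_run(Ts, Style, Size, Ws, Rest).
    take_run(Ts, _, _, [], Ts).

Applied to the tokens of Table 2, this pass groups the 30-point title tokens into one compound and the 18-point subtitle tokens into the next, as in Table 3.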

Prior to application of the rule interpreter, the compounds were searched for the occurrence of proper names indicating cities and publishing houses. This was done by comparing external files containing city names and publishing house names with each compound. Identification of publishing house names was a relatively straightforward substring matching procedure. Multiple publisher names present no problem, inasmuch as the AACR2 rules for first-level description dictate that the first be used. Identification of a place of publication was slightly more complex, in that often there were several city names from which to choose.
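Such a lookup could be sketched as follows, assuming hypothetical publisher_name/1 facts loaded from the external file:

    % Succeeds if any known publisher name occurs as a substring of the
    % compound's text (publisher_name/1 facts are assumed).
    publisher_in(CompoundText, Pub) :-
        publisher_name(Pub),
        sub_atom(CompoundText, _, _, _, Pub), !.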


Table 1. Case classes in the automated title page cataloging system and examples of their application

    U   uppercase
    L   lowercase
    N   numeric
    P   punctuation
    S   symbol
    V   vskip token (vertical whitespace)
    I   single uppercase character followed by a full stop
    Y   four-character numeric string, between 1400 and 1988

Examples:

    Edited By             UL UL
    ALFRED A. KNOPF       U I U
    July 13, 1946         UL NP Y
    McElligot's Pool      ULULPL UL
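A minimal sketch (not the authors' code) of case-class derivation along the lines of Table 1: map each character to U, L, N, P, or S, collapse runs of the same class, and apply the derived classes I and Y as refinements.

    char_class(C, 'U') :- char_type(C, upper(_)), !.
    char_class(C, 'L') :- char_type(C, lower(_)), !.
    char_class(C, 'N') :- char_type(C, digit(_)), !.
    char_class(C, 'P') :- char_type(C, punct), !.
    char_class(_, 'S').                      % anything else: symbol

    case_class(Token, Class) :-
        atom_chars(Token, Chars),
        maplist(char_class, Chars, Raw),
        collapse(Raw, Runs),
        atom_chars(Class, Runs).

    collapse([], []).
    collapse([X], [X]).
    collapse([X,X|T], R) :- !, collapse([X|T], R).
    collapse([X,Y|T], [X|R]) :- collapse([Y|T], R).

    final_class(Token, 'I') :-               % initial, e.g. "A."
        atom_length(Token, 2), case_class(Token, 'UP'), !.
    final_class(Token, 'Y') :-               % four-digit year candidate
        atom_number(Token, N), integer(N),
        N >= 1400, N =< 1988, !.
    final_class(Token, C) :- case_class(Token, C).

    % ?- final_class('McElligot''s', C).     % C = 'ULULPL'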

The identification of place of publication is made according to the following scheme:

• If a publisher name has been identified, take the first place name found in that or succeeding compounds;
• If a publisher name has been identified and no place names occur in the same or successive compounds, check the compound that immediately precedes the compound containing the publisher name;
• If no publisher is identified, take the first place name in the last compound that contains a place name.

The publisher and place of publication are written directly into their respective bibliographic fields at this time and are not subject to further conflict resolution.

Other bibliographic fields are identified by applying a series of rules to each compound. We attempt to capture in these rules some of the same reasoning that a cataloger might apply to reading a title page. The kinds of information that our rules encode fall roughly into seven classes.

• Font size: A number of our rules use information about font size for discrimination. For example, a compound having the largest typeface and appearing first on the page is likely to be a title proper. Conversely, identifying a valid publication date requires guarding against the presence of a four-character numeric token between 1400 and 1989 occurring in the title: publication dates, in our experience, are never set in the largest font.


Table 2. Partial listing of the tokens for the title page of Chance and Necessity. Numeric coordinate values are in thousandths of an inch from the page origin (upper left corner); font size is in points. For \vskip tokens, the coordinate values signify baseline of previous token, baseline of next token, and the approximate distance (in thousandths of an inch) between the token bounding boxes. Thus, there is .131 inches between the baseline of token 5 and the calculated top of the bounding box of token 7. A \vskip token inherits its font style and size values from the preceding token. Prolog strings are typically delimited by dollar signs; these have been edited out of string components in the interest of clarity.

    ID  String        Baseline  Start X  End X   Font   Font  Case
                      Coord     Coord    Coord   Style  Size  Class

    token( 1, CHANCE,       422,  337,  855, bold, 30, U)
    token( 2, \vskip,       422,  523,   90, bold, 30, V)
    token( 3, AND,          523,  337,  607, bold, 30, U)
    token( 4, \vskip,       523,  624,   90, bold, 30, V)
    token( 5, NECESSITY,    624,  337, 1023, bold, 30, U)
    token( 6, \vskip,       624,  761,  131, bold, 30, V)
    token( 7, An,           761,  337,  425, bold, 18, UL)
    token( 8, Essay,        761,  446,  620, bold, 18, UL)
    token( 9, on,           761,  641,  711, bold, 18, L)
    token(10, the,          761,  732,  826, bold, 18, L)
    token(11, Natural,      761,  847, 1083, bold, 18, UL)
    token(12, \vskip,       761,  827,   60, bold, 18, V)
    token(13, Philosophy,   827,  337,  670, bold, 18, UL)
    token(14, of,           827,  692,  756, bold, 18, L)
    token(15, Modern,       827,  778, 1010, bold, 18, UL)
    token(16, Biology,      827, 1031, 1261, bold, 18, UL)

• Case class: Case class is a weak, but sometimes useful, feature for detecting names; in particular, the occurrence of an initial in a text string is a useful indicator.

• Page location: In general, the location of a string on the title page is correlated with its bibliographic role. We have used this characteristic only in the identification of the title, but it is probably useful in other respects as well.

• Key words: Key words represent a simple form of semantic analysis that is useful in many of our rules. For example, a compound containing the phrase "edited by" is almost certainly part of the statement of responsibility.

• Name lookup: Publisher and place names are identified by table lookup; this approach could profitably be extended to personal and corporate names to facilitate identification of the statement of responsibility as well.

• Syntactic analysis: Certain rules presently identify simple constructions as compound features. For example, a compound that ends in a connective and precedes the compound with the largest font is usually part of the title.

• Absence of a feature: Some rules use the lack of a particular feature as evidence for a specific bibliographic role. For example, compounds which follow the compound with the largest font, but have no features associated with a statement of responsibility, are often elements of other title information.
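As one concrete illustration of the key-word class, a condition predicate of the following shape could test a compound's word list for an attributive phrase; the predicate names and phrase inventory are illustrative assumptions, not the authors' identifiers.

    % Succeeds when the compound's word list contains a key phrase
    % such as "edited by".
    key_phrase([edited, by]).
    key_phrase([translated, by]).
    key_phrase([illustrated, by]).
    key_phrase([by]).

    contains_attributive(Words) :-
        key_phrase(Phrase),
        append(_, Rest, Words),         % find the phrase anywhere in the list
        append(Phrase, _, Rest), !.

    % ?- contains_attributive([translated, from, the, french,
    %                          by, austryn, wainhouse]).        % true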


Table 3. Compounds derived from the title page of Chance and Necessity. The general form of the data structure is indicated at the top. Prolog strings are typically delimited by dollar signs; these have been edited out of string components in the interest of clarity.

    compound(
        ID,
        reverse sequence number,
        list of words comprising the compound string,
        list of case classes for the compound string,
        font size,
        distance to previous compound,
        distance to next compound,
        ID of the starting token in the compound,
        ID of the final token in the compound,
        list of rules which succeeded for the compound,
        combined confidence factors,
        bibliographic field assignment)

    compound( 1, 5, [CHANCE, AND, NECESSITY],
        [U,U,U], 30, 0, 131, 1, 6,
        [1,2], [ti(20)], Title Proper)

    compound( 2, 4, [An, Essay, on, the, Natural, Philosophy, of, Modern, Biology],
        [UL,UL,L,L,UL,UL,L,UL,UL], 18, 131, 201, 7, 17,
        [7,13,16], [ot(19)], Other Title Information)

    compound( 3, 3, [by, JACQUES, MONOD],
        [L,U,U], ..., 201, 89, 18, 21,
        [12], [sr(7)], Statement of Responsibility)

    compound( 4, 2, [translated, from, the, French, by, Austryn, Wainhouse],
        [L,L,L,UL,L,UL,UL], ..., 89, 226, 22, 29,
        [12], [sr(7)], Statement of Responsibility)

    compound( 5, 1, [VINTAGE, BOOKS, A, Division, of, Random, House, New, York],
        [U,U,U,UL,L,UL,UL,UL,UL], ..., 226, 0, 30, 40,
        [], [], ...)

The narrative forms of the rules are listed in Appendix A. Rules are specified as a Prolog structure having six components:

1. Rule number: identification integer unique to a rule,
2. Precedence class: grouping of rules according to order of application,
3. Explanatory text: short phrase describing the rule,
4. Confidence factor: indicator of what success of the rule implies about a compound,
5. List of conditions: conditions which must be true for the rule to succeed, and
6. Resultant action: procedure to be carried out if the rule succeeds.


The first component is an integer that serves as the rule ID. If a rule succeeds for a particular compound, the ID number for the rule is recorded in the compound as part of a list of rules that have succeeded.

The second component specifies the class of the rule, which is used by the rule interpreter to control the application of rules. Rules are grouped in precedence classes to control their order of application. Low precedence class rules are applied prior to higher classes. We presently have implemented only two levels of rule precedence; however, the simple control scheme can easily be expanded to embrace additional classes of rules as the need arises.

The third component is a phrase that describes the rule. This string is used by an explanation facility to describe the rules that succeed for each compound. This capability facilitates the inspection and modification of the reasoning performance of the system.

The fourth component is a confidence factor that has a type and a numeric value. For example, a confidence factor "ti(3)" implies that a compound is a title. The integer component of the term, (3), is a relative measure of the strength of the implication. A confidence factor of "ti(9)" also implies a title but does so more strongly. When a rule succeeds, the associated confidence factor is recorded in a list in the compound along with the confidence factors of any other rules that succeeded for the compound. These terms provide a way of capturing the weight of evidence associated with rules of different strength. We assigned numeric values to our confidence factors based on heuristic judgements of the importance of a rule and modified them when experience with our sample indicated this was necessary. Some rules are virtually perfect predictors of a given field identity. To these rules we assigned, somewhat arbitrarily, a ranking of 100. Thus, in the summation process (described below) the confidence score for these rules overwhelms any others that might have succeeded.

The fifth element specifies the condition or list of conditions that must be satisfied for the rule to succeed. Each of the terms in this list is supported by a small code fragment that either succeeds or fails. For example, largest_font is a procedure that compares the font size of the compound with the first argument of the page statistics structure and succeeds if they are equal. Multiple conditions in this list can be joined in an AND or an OR conjunction, as appropriate.

The last component is the list of actions that are to be executed if the rule succeeds. In the present implementation the action is simply to record the rule and associated confidence factor in their respective lists in the data structure for the compound in question.

The rule interpreter, based loosely on a model described by Merritt [13], is a relatively simple module that processes every compound against every rule. If the condition statement is satisfied by the particular compound, then the action statement is executed; otherwise no event occurs. In our case the action is simply to record the rule identifier and the confidence factor in their respective lists in the compound token data structure. Thus, for a given compound, a rule failure results in no action; if a rule succeeds, then the rule ID is added to the rule list structure of the compound and the confidence factor is added to the confidence factor list.
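A minimal sketch of such an interpreter loop, assuming the rule/6 shape illustrated earlier and hypothetical satisfies/2 and record_evidence/4 helpers:

    % Apply every rule to one compound, low precedence classes first.
    % Rules that fail contribute nothing; rules that succeed have their
    % ID and confidence factor recorded in the compound.
    apply_rules(Compound0, Compound) :-
        findall(Class-r(ID, CF, Conds),
                rule(ID, Class, _Text, CF, Conds, _Action),
                Keyed),
        keysort(Keyed, Sorted),              % order by precedence class
        pairs_values(Sorted, Rules),
        foldl(try_rule, Rules, Compound0, Compound).

    try_rule(r(ID, CF, Conds), C0, C) :-
        (   satisfies(C0, Conds)
        ->  record_evidence(ID, CF, C0, C)
        ;   C = C0                           % rule failed: no action
        ).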

2.8 Bibliographic field resolution

After all rules for all precedence classes are tested, each compound has a list of associated confidence factors. The first step of the field resolution process entails consolidating confidence factors of the same type. This is done by adding the values of confidence factors of the same type and replacing the substituent terms with this consolidated term. The list of confidence factors is then sorted by the "weight" of the term. Thus:

[ti(4), sr(3), sr(5)] reduces to [sr(8), ti(4)].
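This consolidation step could be sketched as follows (our code, not the authors'), using standard SWI-Prolog list predicates:

    % Sum confidence factors of the same type, then sort the
    % consolidated terms by descending weight.
    consolidate(CFs, Sorted) :-
        findall(T, (member(CF, CFs), CF =.. [T, _]), Ts0),
        sort(Ts0, Types),                    % distinct factor types
        findall(V-F,
                ( member(T, Types),
                  findall(N, (member(CF, CFs), CF =.. [T, N]), Ns),
                  sum_list(Ns, V),
                  F =.. [T, V]
                ),
                Pairs),
        sort(0, @>=, Pairs, Desc),           % heaviest first
        pairs_values(Desc, Sorted).

    % ?- consolidate([ti(4), sr(3), sr(5)], L).    % L = [sr(8), ti(4)]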

The numeric factors are intended to capture a continuum that reflects the true relations between the factors expressed in our rules. In fact, they are somewhat rule-of-thumb estimates and should perhaps be viewed more as ordinal data than interval data. As such, our arithmetic combination rule must be acknowledged as a methodological compromise.


However, we did not encounter failures resulting from nuances of rule strengths; our ad hoc approach to combining evidence worked well with relatively little modification of confidence factors.

The method for selecting final field assignments consists of identifying the highest confidence factor from the list and assigning the compound to the corresponding field. Thus, a compound with a confidence factor list of [sr(8), ti(4)] would be assigned to the statement of responsibility. Table 4 illustrates the final record derived from the title page of Chance and Necessity.
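Because the consolidated list is already sorted by descending weight, the final assignment amounts to reading the field type off the head of the list (sketch):

    % The heaviest consolidated factor names the field.
    assign_field([Best|_], Field) :-
        Best =.. [Field, _Weight].

    % ?- assign_field([sr(8), ti(4)], F).    % F = sr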

3. RESULTS

Figure 3 illustrates the overall performance of the system on the training set through several iterations of system refinement. Each successive stage represents minor modifications to the system based on errors it made with the training sample. We believe that much of the intrinsic regularity of title pages can be captured by iterative modification of this sort.

Figure 4 illustrates system performance for the training sample and the test sample, broken down by field type. Eighty-one percent of the fields in the test set, previously "unseen" by the system, are interpreted correctly by the system, compared with 89% of the fields in the training set. Fifty-nine percent of the test set pages were interpreted correctly in their entirety, whereas 69% of the training set were so recognized. Recognition of individual fields from the test sample ranged from a low of 63% for other title information to a high of 100% for the edition field.

4. DISCUSSION

Our results suggest that we have captured a significant portion (somewhat more than half) of the regularity in title page layout in a prototype rule-based system with a small number of rules. There are obvious pitfalls in assuming that this model will scale easily or comprehensively to a production system, even if one assumes that OCR technology is adequate. One problem common in expert systems is loss of control of rule application due to the proliferation of rules. Our preliminary work suggests that the number of rules will not burgeon to unmanageable numbers; the present results are achieved with 16 rules. As a partial safeguard, we have provided for modularization of rules such that they can be applied in an ordered and hierarchical fashion.

A second possible confounder is the potential for reciprocal exclusion of rules; does greater precision in one rule come at the cost of precision in another? This is a problem we expected to appear early in our study, but it did not manifest itself.

Table 4. Final version of the record created from the title page of Chance and Necessity

    Title                          CHANCE AND NECESSITY
    Other Title Information        An Essay on the Natural Philosophy of Modern Biology
    Statement of Responsibility    by JACQUES MONOD
                                   translated from the French by Austryn Wainhouse
    Publisher                      VINTAGE BOOKS
    City                           New York
    Date
    Edition


The breakdown for individual fields reflects the particular aspects of the interpretation problem that are most difficult. The most straightforward aspects of the problem are publisher, place of publication, date, and edition. Publishers and places were identified by comparison with external files for each field, and indeed, these identifications could be further enhanced by directly linking a publisher authority file with the location of that publisher.

Fig. 3. Changes in system performance by revision pass (training sample only).

Fig. 4. System performance by field for the training set and test set. The Fields column illustrates the percent of all fields present on the title pages that were captured correctly; the Records column indicates the percent of title pages captured correctly in their entirety. Column abbreviations: Ti = title; OTI = other title information; SR = statement of responsibility; Pub = publisher; Pl = place; Date = publication year; and Ed = edition statement.


For example, if a publisher authority file contained the publisher Clive Bingley linked to the location London, and Linnet Books linked to Hamden, Conn., then for a title page containing both of these publisher and place names, the correct relationship between publisher and place name could be determined using the authority file. In cases where no location information was available on the title page, a probable place of publication could be supplied from the authority file.

The fields which are most difficult to identify are title (identified in 76% of the test sample), other title information (63% success), and statement of responsibility (70%). Of the 12 test records that suffered defects in only these three fields, 6 had all relevant information but some part of one field was identified as part of one of the other two. Another 4 of these 12 records were substantially correct but had some extraneous item included in one of the fields. Thus, an index to these combined fields would provide relatively good access to the records; a bibliographer would be less than completely satisfied, but a patron searching an online catalog would suffer little. One prominent research library has proposed informally the use of a system such as ours to formulate a temporary "finding record" that would serve users until the book was requested, after which the book would be formally catalogued. In such a scenario, distinctions between title proper and other title information would become relatively inconsequential, and having extraneous information in certain fields would cause relatively few problems for patrons using these temporary records.

Identifying personal authors can probably be substantially enhanced by reference to an authority list of legal names. Currently the system identifies statements of responsibility by a series of somewhat weak rules that can introduce corruption. For example, rule 5, identification of a weak name pattern, was decisive in correctly identifying a personal author in 6 of 36 instances where the rule was triggered. It was decisive in misidentifying a personal author 4 of those 36 times (the rule was not decisive in the accumulated evidence for the remaining 26 compounds for which it succeeded). Each of the successful identifications could probably be accomplished using a proper name authority system.

4.1 Major impediments

The major impediments we have encountered in building our system can be classified in the following manner:

4.1.1 Deep understanding of the task. Figure 5 illustrates a kind of knowledge about title pages that requires much more than the rudimentary type of reasoning embodied in our system. The layout of these title pages is quite similar. Both begin with a personal name, set in the largest font size of the page, in the top position. Each has a following phrase in a smaller font. Despite the similarities of layout, the bibliographic meaning of these respective elements is different in each case. In the case of Charles Dickens: a critical introduction, the name is the title proper and the following phrase is other title information. In the other item, Larry Eigner is an author statement, not a part of the title, and should be recorded in the statement of responsibility. The title proper is Selected Poems.

Catalogers bring to the task much general knowledge about the world and the intricacies of the task itself. As Hagler and Simmons point out [14]:

When a bibliographic situation arises for which the pattern of the conventions is not decisive, one must use judgment based on broad experience of similar situations, on a sense of the bibliographic language, and on the purpose of the listing in question.

This problem is at the heart of the cataloging robot vs. cataloging assistant issue. It is difficult to imagine a stand-alone cataloging system operating successfully in the current cataloging environment. Whether this can or should be changed is beyond the scope of this discussion, but such issues should inform future discussions of cataloging rule revisions.

4.1.2 Irregularity of title page layout. Figure 6 illustrates examples of title pages that are likely to confound an automated system because of their layout complexity or departure from conventional layout. In the case of Waiting for Godot it is easy to see the great difficulty in automating certain low-level layout interpretations that present little difficulty to a human.


Fig. 5. A pair of title pages sharing similar layout characteristics that have substantially different bibliographic interpretations. Correct interpretation of such items requires sophisticated domain knowledge.

4.1.3 Scanning and graphical interpretation problems. Figure 7 illustrates a problem inherent in the common use of display fonts on title pages. Such fonts are employed for their visual impact; OCR devices are not likely to be able to recognize them in the near future, although there are good prospects for solving problems of this nature in the middle term. The font used for the title Xaipe typifies the problem; the OCR device tries to recognize each contiguous area as a character. Thus, the first three letters are likely to be interpreted as two characters each. Image preprocessing techniques could be applied prior to OCR processing to render such images more recognizable. Other particularly difficult problems arise with cursive fonts that offer few clues as to the separation of characters. We understand that the character recognition process developed by Optiram of the UK has solved this problem to a significant degree [15], but we know of no commercially available devices with this virtue.

4.1.4 Complex cataloging rules. The complexity of AACR2 cataloging rules leads inevitably to inconsistencies in their application. This is true for catalogers and will certainly present problems for automated systems as well. We hope that studies in automated cataloging will contribute to future revisions of cataloging rules.

4.2 Prospects for further progress

The difficulties alluded to in this discussion might well lead one to excessive (some might say realistic) pessimism. There are, however, many opportunities for progress.

4.2.1 Segmentation of prospective materials. A system such as we have prototyped may never be able to handle the universe of possible materials. Many classes of materials, however, do not exhibit the wide diversity of some of the present examples, and it is not difficult to identify at a glance title pages that will be captured easily. It may be useful to process these materials automatically and thereby reduce the workload of a cataloging staff.

4.2.2 Natural language processing. Natural language processing affords potential solutions to some of the problems encountered here; this is, after all, primarily a constrained language processing task. The system now does some simple checking for key word phrases: edited by, illustrated by, and the like. Addition of many more such words will sharpen the capabilities of the system.



Fig. 6. Creative title page layouts such as these are difficult to reconcile with a conceptual model of a "typical" title page format.

A more advanced implementation should incorporate syntactic analysis of text strings as well and use the resulting patterns to classify the strings. For example, a statement of responsibility linked grammatically to a title proper is treated differently than one that is not. Analysis of our system's failures suggests that relatively simple syntactic analysis represents perhaps the single most fruitful approach to enhancing system performance. Thus, building a string of parts of speech corresponding to the text string would likely provide a very useful basis for discrimination.

4.2.3 Further authority checking. The present system uses external publisher name and place name files to identify valid publishers and place names. This approach, if extended to the identification of personal and corporate author names, might be particularly helpful in the difficult task of discriminating between other title information and statements of responsibility.


Fig. 7. Current OCR devices have considerable difficulty recognizing many display fonts. Most devices would try to resolve at least eight characters in this title. Cursive fonts and large typefaces also are beyond the capability of today's devices.

4.2.4 Scanning and optical character recognition issues. The major limitation in the conduct of this research has been the necessity of preparing a typographic surrogate for each title page in our sample. This has limited our ability to encompass the broad range of title page diversity that only a large development sample could provide. We are presently working on techniques that we hope will improve the capabilities of OCR devices for this task. For example, scaling down a text block displayed in a font size beyond the range of the OCR device might bring most title page strings within reach. On versos, the problem is inverted; very small fonts might have to be enlarged.

The issue of basic OCR accuracy is one which will have to be resolved by incrementally increasing the sophistication of the OCR algorithms implemented in commercial devices. Accuracy rates of 99% are not good enough to support a cataloging robot activity (and one only achieves rates this high with very high quality initial images). Thus, for the foreseeable future one must expect to incorporate a correction cycle as part of the model for a practical automated cataloging system.

Finally, we make no claim to having exhausted the richness of title page types and possible discriminant features in our modest series of rules. There are likely many more features that could be employed to identify relevant fields. Ours is a first attempt at a conceptual architecture that might underlie the construction of automated cataloging systems.

5. SUMMARY AND CONCLUSIONS

The successes achieved in this prototype derive from three basic ideas:

1. The importance of capturing the page image in a data structure suitable for logical manipulation (our tokens and compounds),

2. Provision of a rule structure, separate from the procedural aspects of processing, that allows for maintaining rules in a clear and accessible manner (humans can read and understand them), and

3. Title pages have discernible regularity of format, a substantial part of which can be identified and captured in a limited set of rules.

We have demonstrated a potentially useful model for analysis and categorization of bibliographic elements on title pages of books. The model has helped us to identify the major problems likely to impede the development of practical systems. It has also demonstrated that a substantial amount of the regularity of layout and semantics can be captured by a relatively simple rule-based system with a small set of rules.

We are optimistic that this model will be useful for advancing the automation of cataloging. Our ideas can be extended readily to other types of documents and the automated analysis of document structure beyond the bibliographic level. Information systems of the future will depend on a high level of automated processing to ensure timely access to well-structured databases of books, journals, and other written communication.

Fred Kilgour has remarked that someday material that must be cataloged by humans will not be widely read; it simply won't be available to retrieval systems. Whether or not one accepts this radical notion of the future, it is clear that automating the processing of the written word will be a major feature of the evolution of information systems. We hope our efforts here contribute to this evolution.

Acknowledgments-The authors wish to acknowledge the substantial contribution to the development of this system made by our research assistants: Craig Henderson developed the DVI parser which linked the TeX output to our system; Kym Pocius contributed to the system code and TeXed many title pages; Amy Rhine managed the TeXing and TeXed the majority of our title pages.

Larry Olszewski, our reference librarian, managed the interlibrary loan logistics with great efficiency. We would also like to thank our colleagues in the Office of Research for many useful remarks and suggestions in the conduct of the research and preparation of the manuscript.

Finally, we wish to acknowledge the careful efforts of our reviewers, whose perceptive remarks helped us improve this article.

A preliminary version of this research was presented at the American Society for Information Science 50th Annual Meeting [16].

REFERENCES

1. Kilgour, F. Computerization: The advent of humanization in the college library. Library Trends, 18(1):29-36; 1969.
2. Avram, H.D. RECON pilot project. Washington: Library of Congress; 1972.
3. Fox, A.M.S. The amenability of a cataloging process to simulation by automatic techniques. Ph.D. Thesis; University of Illinois at Urbana-Champaign; 1972.
4. Davies, R.; James, B. Towards an expert system for cataloging. Program, 18(4):283-297; 1984.
5. Gorman, M.; Winkler, P.W., Eds. Anglo American Cataloguing Rules, second edition. Chicago: American Library Association; 1978.
6. Davies, R. Cataloguing as a domain for an expert system. In: Davies, R., editor. Intelligent information systems: Progress and prospects. Chichester: Ellis Horwood; 1986.
7. Hjerppe, R.; Olander, B.; Marklund, K. Project ESSCAPE-Expert systems for simple choice of access points for entries: Applications of artificial intelligence in cataloging. Paper presented at the IFLA 51st General Conference in Chicago, August 18-24, 1985.
8. Endres-Niggemeyer, B.; Knorz, G. AUTOCAT: Knowledge-based descriptive cataloguing of articles published in scientific journals. Second International GI Congress '87, Knowledge Based Systems. Munich, October 20-21, 1987.
9. Harrison, M. Retrospective conversion of card catalogues into full MARC format using sophisticated computer controlled visual imaging techniques. Program, 19(3):213-230; 1985.
10. Jeng, L. An expert system for determining title proper in descriptive cataloging: A conceptual model. Cataloging & Classification Quarterly, 7(2):55-70; 1986.
11. Schwarz, H. Expert systems and the future of cataloging: A possible approach. LIBER Bulletin, (26):23-50; 1986.
12. Svenonius, E.; Molto, M. Studies in automatic cataloging; in press, 1988.
13. Merritt, D. Forward chaining in Prolog. AI Expert, 30-42; November 1986.
14. Hagler, R.; Simmons, P. The bibliographic record and information technology. Chicago: American Library Association; 1982:56.
15. Dollar, C. Director, National Archives. OCLC Distinguished Seminar Series, March 11, 1986.
16. Weibel, S.L.; Oskins, M.; Vizine-Goetz, D. Automated title page cataloging. Proceedings of the 50th ASIS Annual Meeting; 1987 October 4-8; Boston, MA. Medford, NJ: Learned Information; 24:234-240; 1987.

APPENDIX A: RULES IMPLEMENTED IN THE AUTOMATED TITLE PAGE CATALOGING SYSTEM

Confidence factor codes:

    ti(n) = title
    sr(n) = statement of responsibility
    at(n) = attributive phrase
    da(n) = year of publication
    ot(n) = other title information
    ed(n) = edition statement
    af(n) = affiliation phrase

    n = relative strength of evidence; n = 100 signifies certainty


Context free rules
Rules that do not depend on previous rules for valid application.


1. First compound, largest font. [ti(10)]
2. Largest font. [ti(10)]
3. Separate attributive phrase (including by, edited by, translated by, illustrated by, conducted by, etc.). [at(100)]
4. Strong name pattern (compound contains an initial, does not contain a publisher's name). [sr(7)]
5. Weak name pattern: name case pattern, no publisher in compound. [sr(4)]
6. Ends in a connective and precedes compound with largest font. [ti(9)]
7. After largest font, begins with article. [ot(7)]
8. Affiliation phrase present, not largest font, no publisher name in compound. Affiliation phrases include Professor, College, Institute, Bureau, University (without Press), and related phrases. [af(8)]
9. Four numeric characters, not largest font. [da(100)]
10. Name suffix pattern (Ph.D., D.D., M.A., M.D., Jr., Esq., etc.). [sr(6)]
11. Contains an edition statement key word. [ed(9)]

Context sensitive rules
Rules that require application of previous rules for valid processing.

12. Attributive phrase present with other elements, not largest font. [sr(7)]
13. Follows the compound with the largest font, no evidence of the statement of responsibility in the compound. [ot(5)]
14. Follows a compound that is an attributive phrase, no publisher name in the compound. [sr(9)]
15. Precedes an affiliation phrase. [sr(9)]
16. Follows the compound with the largest font, no evidence of the statement of responsibility in the compound, but the succeeding compound may contain the statement of responsibility. [ot(7)]