On the creation of hypertext links in full-text documents: Measurement of retrieval effectiveness



David Ellis, Jonathan Furner, and Peter Willett* Department of Information Studies, University of Sheffield, Sheffield S10 2TN, United Kingdom. E-mail: p.willett@sheffield.ac.uk

An important stage in the process of retrieval of objects from a hypertext database is the creation of a set of inter-nodal links that are intended to represent the relationships existing between objects; this operation is often undertaken manually, just as index terms are often manually assigned to documents in a conventional retrieval system. In an earlier article (Ellis, D., Furner-Hines, J., & Willett, P., 1994b), the results were published of a study in which several different sets of links were inserted, each by a different person, between the paragraphs of each of a number of full-text documents. These results showed little similarity between the link-sets, a finding that was comparable with those of studies of inter-indexer consistency, which suggest that there is generally only a low level of agreement between the sets of index terms assigned to a document by different indexers. In this article, a description is provided of an investigation into the nature of the relationship existing between (i) the levels of inter-linker consistency obtaining among the group of hypertext databases used in our earlier experiments, and (ii) the levels of effectiveness of a number of searches carried out in those databases. An account is given of the implementation of the searches and of the methods used in the calculation of numerical values expressing their effectiveness. Analysis of the results of a comparison between recorded levels of consistency and those of effectiveness does not allow us to draw conclusions about the consistency-effectiveness relationship that are equivalent to those drawn in comparable studies of inter-indexer consistency.

1. Introduction

In an earlier article (Ellis et al., 1994b), we reported the results of an experiment in which we measured the degree of similarity between a number of hypertext databases that shared a common set of nodes but whose link-sets had been manually created by different people. Our purpose was to use the similarity values that we calculated as measurements of the extent to which agreement existed in the choice of hypertext links to be inserted in a full-text database, i.e., as measurements of the degree of inter-linker consistency in the work of the originators of the link-sets. Our principal conclusions were that the recorded levels of inter-linker consistency were generally low, but that they displayed marked variation.

* To whom all correspondence should be addressed.

© 1996 John Wiley & Sons, Inc.

In this article, we describe an investigation of the relationship between, on the one hand, the levels of inter-linker consistency obtained amongst a group of full-text databases in which inter-nodal links have been inserted and, on the other, the effectiveness of searches carried out in those databases. It has regularly been suggested in studies of conventional document retrieval systems that the degree of consistency in the terms assigned to documents by indexers is positively associated with retrieval effectiveness. We argue that the manual creation of sets of inter-nodal links in hypertext databases is analogous, in certain respects, to the manual creation of sets of index terms in conventional document databases, and test the hypothesis that consistency and effectiveness are related in the former context just as, it is suggested, they are in the latter. The significance of accepting such a hypothesis rests on the results of our earlier study. These indicate that levels of inter-linker consistency are low and variable. If we were also to find that consistency and effectiveness are related to some degree, then the simple conclusion would be that the levels of effectiveness that may be achieved by hypertext retrieval systems are correspondingly low and variable. Given the intensive nature of the manual labor required in the creation of sets of hypertext links, such a conclusion might have serious implications for those contemplating the addition of hypertextual characteristics to a conventional document database.

In Section 2, we explain in more detail how this study relates to work on inter-indexer consistency, and in particular on an assumption that is often made by researchers in this field but for which evidence is scant. In Section 3, we provide a brief summary of previous work on the evaluation of hypertext retrieval systems. In Section 4, we describe the methods that we employed in our experiments, firstly in the construction of our test databases, secondly in the implementation of the searches on the databases, and thirdly in the calculation of numerical values expressing the effectiveness of the searches. In Section 5, we present the results of our calculations; in Section 6, we discuss the implications of a comparison of these results with those reported in our earlier article. Finally, we draw a number of conclusions and describe the ways in which this research might be developed further.

JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE. 47(4):287-300, 1996 CCC 0002-8231/96/040287-14

Although the methods that we use, in the calculation both of similarity values and of values of retrieval effectiveness, will in their essence be familiar to those experienced in the traditional experimental activities of information retrieval research, their application in the current context is almost wholly novel. Few studies of hypertext retrieval systems make more than nominal use of traditional measures of retrieval effectiveness: The experiments described by Al-Hawamdeh, Smith, and Willett (1991) and by Savoy (1993) are conspicuous exceptions. In this article, therefore, we place particular emphasis on the description of the methodology that we have developed.

2. Inter-Indexer Consistency and Retrieval Effectiveness

The aim of work in the field of inter-indexer consistency is the measurement of the extent to which agreement exists among different indexers on the sets of index terms to be assigned to individual documents (Leonard, 1977). Such measurement is possible where two or more sets of terms are assigned to each document, each set constructed by one of two or more individual indexers. Studies of inter-indexer consistency were especially popular in the 1960s (e.g., Hooper, 1965; Zunde & Dexter, 1969); although interest has waned in subsequent decades, reports of experiments continue to appear with some regularity (for recent examples, see Reich & Biever, 1991; Sievert & Andrews, 1991; and Tonta, 1992). The principal conclusions of such studies, beyond suggestions as to how consistency might be improved, are normally that recorded levels of consistency display marked variation, and that high levels of consistency are rarely achieved.

For most writers on the subject, the degree of significance that is commonly attributed to the results of studies of inter-indexer consistency appears to derive from the assumption that they are predictive of the levels of retrieval effectiveness that may be attained by the systems studied. Leonard (1977, p. 33), for example, states that “interindexer consistency and retrieval effectiveness exhibit a tendency toward a direct, positive relationship, i.e., high interindexer consistency in assignment of terms appears to be associated with a high retrieval effectiveness of the documents indexed.” The argument posited in order to account for this relationship generally runs as follows:

(1) Studies of inter-indexer consistency provide empirical evidence that indexers asked to index the same document do so inconsistently.

(2) It is reasonable to infer from this evidence that inconsistency also arises in the indexing by different indexers of different documents; in other words, one indexer may well use a term to represent a concept in one document that is different from the term used by another indexer in another document. Studies of intra-indexer consistency indicate that the levels of consistency identifiable in the work of a single indexer on a collection of documents are often similarly low (Hooper, 1965).

(3) The larger the number of index terms that are used in a document database to represent the same concept, the more difficult it becomes for the user to represent the concept with search terms that match those used in the indexing, and the less effective their retrieval. As Leonard (1977, p. 32) puts it, “the greater the agreement among indexers regarding the terms that best describe a document’s content, the higher the probability that the index terms will also match terms used in a search for which the document is regarded as a relevant item.”

It should be noted that an argument in this form postulates a relationship simply between inter-indexer consistency and the component of retrieval effectiveness known as recall (the proportion of all the relevant documents in a database that are retrieved). Although the number of studies of consistency conducted is large, empirical data confirming the existence of such a relationship is notoriously rare. For most authors, the sole source of such data is the study reported in Leonard’s unpublished thesis (Leonard, 1975). Leonard’s test collection consisted of 18 groups of 10 documents, each group drawn from one of two databases. Within each group, five documents had been judged as relevant to a particular pre-defined query, and five as non-relevant. Each group of documents was indexed by five indexers drawn from a pool of 40, and measurements of inter-indexer consistency for each group were calculated using two separate methods. Searches were undertaken using the pre-defined queries on each of the five “versions,” indexed by a different indexer, of each of the five document groups, and measurements of recall and precision calculated by comparing the results of these searches with the original relevance judgments. Leonard reported that the rank-order correlations between the measurements of consistency and the averaged measurements of recall for each group indicated “a definite trend toward a moderate to strong positive association of variables” (Leonard, 1977, p. 32). Significantly, no such relationship was suggested by the correlations between the measurements of consistency and the averaged measurements of precision.
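As an illustration of the kind of calculation involved in such a study, the following Python sketch computes recall and precision for a single search, and a Spearman rank-order correlation between two lists of per-group values. The function names are hypothetical and the sketch makes no attempt to reproduce Leonard's data, his two consistency measures, or the particular correlation statistic that he used, none of which are specified here; Spearman's coefficient is assumed simply as one common rank-order statistic.

```python
import math
from typing import List, Sequence, Set


def recall(retrieved: Set[str], relevant: Set[str]) -> float:
    """Proportion of all relevant documents that were retrieved."""
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0


def precision(retrieved: Set[str], relevant: Set[str]) -> float:
    """Proportion of retrieved documents that are relevant."""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 (smallest); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(sxx * syy)
```

Given one consistency value and one averaged recall value per document group, `spearman(consistency_values, mean_recall_values)` would yield a coefficient between -1 and +1, with values near +1 corresponding to the positive association that Leonard describes.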


We believe that the manual insertion of hypertext links between the documents stored in a database is analogous to the manual assignment of index terms to such documents. Just as the performance of a conventional document retrieval system is strongly dependent on the terms used to index documents, the performance of a hypertext retrieval system is strongly dependent on the links inserted between the objects stored in its database, in that a poorly constructed set of links may render the database of little practical use to those who seek information from it. Moreover, the creation of links, like the creation of sets of index terms, is a time-consuming and skilled process, since it requires the person creating the links to have a clear overview of the contents of the text and of the conceptual relationships that exist between its component parts. It is, hence, of interest to investigate the hypothesis that there exists a relationship between inter-linker consistency and the effectiveness of hypertext retrieval systems similar to the one that is posited to exist between inter-indexer consistency and effectiveness. The measurement of inter-linker consistency is discussed in our earlier article: In the next section, we describe previous attempts to measure the effectiveness of hypertext systems.

3. Retrieval System Evaluation

Historically, the majority of evaluative studies of document retrieval systems have included attempts to quantify their effectiveness, using measures such as recall and precision that are based on pre-retrieval manual relevance judgments. By way of exception, evaluative studies of systems based on navigational retrieval mechanisms have tended to eschew quantitative techniques in favor of evaluation of a less formal nature, largely as a result of the historical attachment of many hypertext researchers to the field of human-computer interaction rather than to that of information retrieval. In work such as this, the essence of any hypertext system is seen to lie in the peculiarity of its user interface; as a result, any measurement of the quality of a system’s performance may generally be equated with evaluation of its usability rather than of its performance (Nielsen, 1989; 1993, pp. 149-153). Several recent experiments have focused, from a human-factors perspective, on the usability of the interface to hypertext systems, often in comparison with other media for the presentation of information such as paper (see, for example, Boyle, Teh, & Williams, 1990; Egan et al., 1989; and Rada & Murphy, 1992).

In Rada and Murphy’s (1992) design, users were asked to write essay-style answers to their questions on the basis of the information they acquired in the course of their navigation, and these answers were marked on the basis of “accuracy” (which might be construed as a measure of precision) and on that of “completeness” (which might similarly be viewed as a measure of recall). Dividing a combination of these scores by the time taken gave a measure of efficiency. However, studies of hypertext retrieval systems that make more than a passing reference to the use of traditional measures of retrieval effectiveness are few and far between. In the experiment described by Al-Hawamdeh et al. (1991), three test “databases” were used, each database containing the text of a single full-text dissertation or thesis. Each of the “documents” making up a database (of which there were 78, 122, and 316, respectively) was equivalent to a paragraph in the original printed document. A set of queries (9, 10, and 6, respectively) was created for each database, and a set of relevant documents was identified for each query. Independently of this work, a set of links was created in each database, each link connecting one document with another whose subject matter was deemed to be related. Searches were carried out in each database using a variety of methods:

(1) Navigational search, successively identifying the single document that should be next retrieved;

(2) best-match search, identifying in ranked order those documents whose representations are most similar to that of the query;

(3) simple string search, identifying all documents containing a single discriminating term from the query.

In turn, two types of navigational mechanisms were investigated (one operating wholly under user control, the other operating under the control of either of two algorithms), and three modes of access to the database provided for the navigator were compared (via its table of contents, via its abstract, and via the location of the first occurrence of a chosen string). Values of van Rijsbergen’s measure E for selected values of n (number of documents retrieved) were calculated in order to compare the retrieval effectiveness of each search. It was difficult to determine any significant pattern from the mean E values averaged over each set of queries, although tentative conclusions could be drawn as follows:

(1) The navigational searches were, on average, no more effective than the best-match searches;

(2) the simple string searches were in several cases substantially more effective than those using other mechanisms;

(3) access via contents and via string search were more effective than access via abstract.
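The measure E referred to above combines precision (P) and recall (R) into a single value, E = 1 - (1 + b^2)PR / (b^2 P + R), where lower values indicate more effective retrieval. The following Python sketch evaluates E for the first n documents retrieved in a search. The function name and the default weighting b = 1 are assumptions made for illustration; the weighting and the values of n actually chosen by Al-Hawamdeh et al. are not stated here.

```python
from typing import Sequence, Set


def e_measure(retrieved_in_order: Sequence[str],
              relevant: Set[str],
              n: int,
              beta: float = 1.0) -> float:
    """van Rijsbergen's E for the first n documents retrieved.

    E = 1 - (1 + beta^2) * P * R / (beta^2 * P + R)
    beta = 1 (an assumed default) weights precision and recall equally.
    Returns 1.0, the worst value, when no relevant document is found.
    """
    top_n = retrieved_in_order[:n]
    hits = sum(1 for doc in top_n if doc in relevant)
    if hits == 0:
        return 1.0
    p = hits / len(top_n)      # precision over the documents examined
    r = hits / len(relevant)   # recall over all relevant documents
    b2 = beta * beta
    return 1.0 - (1.0 + b2) * p * r / (b2 * p + r)
```

Averaging the values returned for every query in a set, at a fixed n, gives a mean E of the kind compared across search methods in the study described above.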

Al-Hawamdeh et al. recognized that pressure of time, by preventing them from creating a larger sample of hypertext databases and searches, had rendered interpretation of their results problematic. They felt able to suggest, however, that the navigational searches were far from consistently superior to the best-match or string searches, neither of which required the additional work involved in creating the sets of links for each database. As they point out (p. 125): “This is an important finding, since it is very time-consuming to create a hypertext document.” Link insertion requires the complete understanding of a document; “. . . this may not be too much of a problem if the document’s author does the characterization, but otherwise requires a quite inordinate amount of effort” (pp. 125-127).

In the experiments described by Savoy (1993), two well-known medium-size test databases were used (the CACM collection, consisting of 3,204 articles from the journal Communications of the ACM, and the CISI collection, consisting of 1,460 articles highly cited in the information-science literature), together with two sets of queries and corresponding relevance judgments. The databases were indexed using an automatic technique that produced a set of weighted terms for each document, and searches were made for the test queries using a variety of different retrieval mechanisms: A traditional Boolean model; a selection of hybrid Boolean models that used different methods of term-weight normalization; and a vector processing model. Values of average precision were calculated for each of these sets of searches. In the second stage of the experiment, each collection was converted into a hypertext database with the insertion of a set of links, each link representing a bibliographic reference from one document to another. A method described by Frisse (1988) was used to implement a global search mechanism that was able to take into account the information represented by these hypertext links. In a separate development, a clustering model was used to compute the similarities between documents, and another set of hypertext links was simulated by the connections within the resulting clusters of nearest-neighbors. The existence and number of hypertext links were also used in the ranking of those documents retrieved by the traditional Boolean mechanism. The effectiveness of searches using each of these methods was measured by calculating values of average precision. The best results achieved by any of the methods that did not incorporate the information contained in hypertext links were those returned by the vector processing mechanism; the only method using hypertext links which improved on these (by up to 7%) was a modified version of that developed by Frisse. Although Savoy felt able to suggest tentatively that “hypertext links can provide useful information for enhancing effectiveness” (p. 44), his results were by no means conclusive.
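For readers unfamiliar with average precision, the following Python sketch shows one widely used, non-interpolated form of the measure: the mean of the precision values observed at the rank of each relevant document, with relevant documents that are never retrieved contributing zero. This is an assumption made for illustration only; the exact variant calculated by Savoy (for example, an average over fixed recall levels) is not specified here, and the function name is hypothetical.

```python
from typing import Sequence, Set


def average_precision(ranking: Sequence[str], relevant: Set[str]) -> float:
    """Non-interpolated average precision of a ranked list for one query."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            # Precision over the first `rank` documents of the ranking.
            precision_sum += hits / rank
    # Dividing by all relevant documents penalizes those never retrieved.
    return precision_sum / len(relevant)
```

Two retrieval mechanisms can then be compared on a collection by averaging this value over all test queries for each mechanism.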

The traditional method of computing the effectiveness of retrieval in an experimental environment, by comparing what is actually retrieved with what a human judging panel determine should be retrieved, dates back to a time (the early 1950s) when all searches were undertaken in batch mode: A query would be input to a retrieval system, and a set of references deemed by the system to be relevant to the query would be output. Whether or not the query was subsequently refined in the light of the output content and re-presented to the system in slightly differing form, the boundaries of each individual search could be clearly delineated.

Technological developments over the last 30 years have allowed the nature of user interaction with retrieval systems to change, and the degree of such interaction to increase. Systems based on navigational mechanisms give users a particularly high degree of control over the retrieval process, enabling them to select at each stage of that process individual objects for retrieval. Consequently, in the context of an interactive navigational retrieval system such as the one under consideration in this article, the “retrieval” or display of a paragraph as a result of user navigation should not be taken as an indication that that paragraph is necessarily relevant to the user’s information need: It is left to the searcher to decide which of the paragraphs retrieved are relevant to their needs. In other words, the effectiveness of a search depends to a large extent on the particular navigational actions of the searcher. The validity in this case of using the traditional method of computing retrieval effectiveness would therefore need to be questioned.

Nevertheless, because the range of navigational options open to a searcher at any stage of the process is determined by the nature of the set of links inserted between the nodes of the database, it is also true that a search’s effectiveness will be limited to a certain degree by the opportunities that are provided by the system for the searcher to navigate to relevant nodes. In simple terms, if there is a particular paragraph to which a searcher is never given the chance to navigate, they will never be able to decide whether that paragraph is relevant to their needs or not.
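The limiting effect of a link-set can be made concrete by treating the database as a directed graph and asking which nodes can be reached at all from a given starting node. The following Python sketch does this with a breadth-first traversal and reports the proportion of relevant nodes that are reachable. It is offered purely as an illustration of the argument above, not as the measurement method used in our experiments; the representation of links as a dictionary and both function names are assumptions.

```python
from collections import deque
from typing import Dict, Iterable, Set


def reachable_nodes(links: Dict[int, Iterable[int]], start: int) -> Set[int]:
    """Return every node that can be reached from `start` by following links."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for target in links.get(node, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


def reachable_recall(links: Dict[int, Iterable[int]],
                     start: int,
                     relevant: Set[int]) -> float:
    """Proportion of relevant nodes that a searcher could possibly visit."""
    if not relevant:
        return 0.0
    return len(reachable_nodes(links, start) & relevant) / len(relevant)
```

A value below 1 from `reachable_recall` would mean that some relevant paragraph can never be displayed to a searcher starting at that node, whatever navigational choices they make.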

In the next section, we describe a novel method of making measurements of the degree of effectiveness achieved by searches undertaken with a hypertext retrieval system. It should be made clear, however, that the traditional measures of retrieval effectiveness are not necessarily the most appropriate for the evaluation of retrieval systems based on navigational mechanisms. As is indicated by the amount of attention afforded to hypertext systems by the HCI (human-computer interaction) community, the operation of such systems involves a high level of interaction with the user, and the particular problems of evaluating highly-interactive document retrieval systems have received much attention (Belkin & Vickery, 1985; Robertson & Hancock-Beaulieu, 1992; Su, 1992). Various methods have been proposed for measuring the efficiency, as opposed to the effectiveness, of searches; beyond simple consideration of the length of time taken by searchers, we do not discuss further how such methods might be applied in the context of our experiments.

4. Experimental Methodology

4.1. Creation of the Test Databases

The creation of the databases used in our experiments is described in full detail in our earlier article (Ellis et al., 1994b). We began with five printed full-text documents (to which references are provided in an Appendix to this article), each a thesis, journal article, or book written by a member of the Department of Information Studies at the University of Sheffield. The number of paragraphs in each document ranged from 23 to 347. Each database consisted of a set of nodes, each node representing an individual paragraph in the document from which the database was derived, and a limited set of links. These links were of two types: (i) Those that connected pairs of nodes containing paragraphs that are physically adjacent in the linear sequence of the original document; and (ii) those that connected headings in a table of contents (or “root” node) with nodes containing the paragraphs that begin the sections and subsections to which those headings refer. Such links may be considered to be “objective,” because they are derived from inter-paragraph relationships that were represented explicitly in the content of the original physical document.
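The two types of “objective” link described above can be generated mechanically from the structure of a document. The following Python sketch shows one way of doing so, under several assumptions that are ours rather than properties of the Guide databases: paragraphs are numbered 1 to n in their original order, the root (table of contents) node is given the hypothetical ID 0, and each link is stored as a (source, target) pair.

```python
from typing import Iterable, Set, Tuple

ROOT = 0  # hypothetical ID for the table-of-contents ("root") node

Link = Tuple[int, int]  # (source node ID, target node ID)


def structural_links(n_paragraphs: int,
                     section_starts: Iterable[int]) -> Set[Link]:
    """Build the "objective" links for a document of n_paragraphs paragraphs.

    section_starts lists the IDs of the paragraphs that begin a section
    or subsection named in the table of contents.
    """
    links: Set[Link] = set()
    # Type (i): each paragraph is linked to its successor in the sequence.
    for para in range(1, n_paragraphs):
        links.add((para, para + 1))
    # Type (ii): each heading in the root node is linked to the paragraph
    # that opens the corresponding section.
    for start in section_starts:
        if not 1 <= start <= n_paragraphs:
            raise ValueError(f"paragraph {start} is not in the document")
        links.add((ROOT, start))
    return links
```

For a four-paragraph document with sections beginning at paragraphs 1 and 3, `structural_links(4, [1, 3])` yields three adjacency links and two contents links.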

A machine-readable version was produced of each of the documents using the authoring system Guide (Guide User Manual, 1990); Guide was chosen for this purpose on account of its widespread availability and use within the academic community in the UK, and because of its use in a previous study of the effectiveness of retrieval from hypertext that used a similar test collection (Al-Hawamdeh et al., 1991). Five copies were made of each machine-readable document, and each of the 25 copies was allocated to a different student volunteer from the Department. The volunteers were instructed in the use of an interactive system, developed using Guide, that allowed them to create explicit representations of links between paragraphs whose contents they decided were related. Certain of these conceptual links, such as those derived from explicit cross-references and footnotes, might be classed as “objective”: Most, however, were “subjective” because they were derived from relationships that were considered by the individual link-creator to reside implicitly in the semantics of the text. Any link could be categorized according to its direction: “Backward” if its target was a paragraph that appeared earlier in the linear sequence of the original printed document than its source, and “forward” if its target was a paragraph that appeared later in that sequence. It was also useful to distinguish, on the one hand, forward links connecting a source node to a target node that was physically adjacent to it in the original linear sequence from, on the other hand, those connecting nodes that were more widely-dispersed. Links of the former, “next-node-in-sequence” type may easily be constructed by automatic means. Such links are no less conceptual than others; it seems likely, however, that the knowledge of two nodes’ physical adjacency is an important objective influence on linkers’ decisions. Links of other types should therefore be considered the more characteristic of individual linkers’ subjective work, and in the presentation of our experiment’s results (Section 5) we concentrate on the use that searchers made of those “subjective” links that were not “next-node-in-sequence” links.
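The categorization of links by direction can be expressed compactly in terms of paragraph ID numbers, which record each paragraph's position in the original linear sequence. The following Python sketch is an illustration under our own assumptions: the function names are hypothetical, and a link from a paragraph to itself is treated as an error, a case that the description above does not address.

```python
from typing import Iterable, List, Tuple

Link = Tuple[int, int]  # (source paragraph ID, target paragraph ID)


def classify_link(source: int, target: int) -> str:
    """Categorize a link by the relative positions of its two paragraphs."""
    if target < source:
        return "backward"
    if target == source + 1:
        # A forward link to the physically adjacent paragraph.
        return "next-node-in-sequence"
    if target > source:
        return "forward"
    raise ValueError("a link must connect two different paragraphs")


def characteristic_links(links: Iterable[Link]) -> List[Link]:
    """Keep the links that are not of the next-node-in-sequence type."""
    return [(s, t) for s, t in links
            if classify_link(s, t) != "next-node-in-sequence"]
```

Applied to a linker's complete link-set, `characteristic_links` would leave the backward links and the more widely dispersed forward links, which are the ones treated above as most characteristic of an individual linker's work.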

On completion of the linkers’ work, the results were five hypertext versions of each of five different documents, each sharing a common set of nodes (i.e., paragraphs) with four others, but each having a different set of links inserted amongst the nodes. Table 1 records the time spent by each linker, and the number of links they inserted. Each hypertext version of a document was subsequently considered as 1 of 25 complete and separate hypertext databases.

4.2. Search Design

A total of 40 volunteer searchers were recruited from the group of master’s and doctoral students at the Department of Information Studies. In their capacity as students, few of these volunteers had any professional experience of database searching, but all had some practical experience of using online retrieval systems, together with an understanding of their purpose and of the nature of the relationship of “relevance” that might exist between a query and a document.

Each of these volunteers was instructed in the use of an interactive computer system, again developed using Guide, that allowed them to access any of the 25 databases, to view one node of the database on the screen at a time, and to navigate successively from one individual node of the database to another, thus displaying the content of the target node on the screen. Searchers were able to navigate in this way by activating buttons of a number of types. The searchers were told that the most important buttons were those embedded in the text of a document’s paragraphs, each labeled with the ID number of a target paragraph. Each of these buttons indicated the source of a link that had been inserted by the volunteer linker responsible for the currently accessed database; the ID number of a paragraph represented its position in the linear sequence adopted in the original printed document. The location of a button was indicated to the searcher by a section of highlighted text chosen for this purpose by the linker; it was by this means that the searcher was given an indication of the potential relevance of the target of the link. As well as buttons enabling the user to change databases, to view a “Help” screen, and to exit the system, other buttons provided searchers with different navigational options, including a backtracking facility, and a facility to display the paragraphs immediately preceding and succeeding (in the original linear sequence) the currently displayed paragraph.

TABLE 1. Time spent by 25 volunteer linkers, and numbers of links inserted.

Document ID 1; No. of nodes, p = 340
Linker ID                1a     1b     1c     1d     1e     Mean
Time spent (hours)       28;    33;    464    43     21     34$
No. of links inserted    260    208    137    112    98     162

Document ID 2; No. of nodes, p = 307
Linker ID                2a     2b     2c     2d     2e     Mean
Time spent (hours)       24;    26$    9$     40     20:    24
No. of links inserted    864    517    226    969    551    625

Document ID 3; No. of nodes, p = 45
Linker ID                3a     3b     3c     3d     3e     Mean
Time spent (hours)       6      9;     9;     3      4;     6;
No. of links inserted    119    138    53     156    133    120

Document ID 4; No. of nodes, p = 23
Linker ID                4a     4b     4c     4d     4e     Mean
Time spent (hours)       4;     22 I   8;     2;     3;     4$
No. of links inserted    40     28     77     62     3      42

Document ID 5; No. of nodes, p = 347
Linker ID                5a     5b     5c     5d     5e     Mean
Time spent (hours)       24     19:    30     16     1X     21;
No. of links inserted    501    427    113    492    119    331

Each of the volunteers was allocated five databases for searching, one of each of the five versions of each of the five original documents: Thus, each database-version was searched by eight different searchers. Each searcher was given the same list of 25 queries, five for each of the five databases. These queries, whose contents are listed in the Appendix, were derived from those suggested by the authors of the original printed documents for experiments conducted previously in the Department that made use of a similar test collection (Al-Hawamdeh et al., 1991). The volunteers were instructed to carry out the following process for each query:

(1) Try to put yourself in the position of a person who has the particular information need that is expressed in that query.

(2) Use the facilities provided by the system to navigate from paragraph to paragraph in each document.

(3) Identify all and only those paragraphs that, in your opinion, would be relevant to the person with that particular information need.

It was also suggested to the volunteers that some of the queries were simple questions whose answers might be found in the text of individual paragraphs, whereas others were expressions of much vaguer needs to which many different paragraphs might be relevant. It was left to the individual searcher to decide how many paragraphs were at all relevant to each query. Volunteers were also asked to rank, in order of relevance, those paragraphs that they selected in step 3; not all volunteers followed this instruction, however, and it was decided that ranking information should not be considered in subsequent analysis.

Because each database version had a unique set of links inserted amongst its set of nodes, the navigational options available to a searcher at any stage of the process would be likely to differ from those available to someone at a similar stage of the search in another database version. It was hypothesized that at least a proportion of the variation in the degree of effectiveness observed between searches carried out by different people would be a result of certain differences exhibited in the structures of the link-sets. Given the small size of the sample, no attempts were made to control for the effects of variables such as searcher expertise or experience; however, in an attempt to discover variation occurring as a result of the provision or otherwise of a global search facility in addition to a simple navigational mechanism, the experiment was designed so that only half of the searchers were allowed to use a “keyword search” facility. This facility enabled the user to search the whole of the currently accessed database for occurrence of any single- or multi-word string (no wildcard characters, truncation, or Boolean combinations).

Each user’s transactions with the system, along with the time taken, were logged automatically, and the log files made available for subsequent analysis.

4.3. Computation of Retrieval Effectiveness

In the light of the observations made in Section 3, we propose that the effectiveness of the retrieval system used in our experiments should be measured in two different ways, using sets of relevance judgments of three types as follows:

1. Judge-relevant paragraphs. A set of relevance judgments, produced by the authors of the original printed documents, was assigned to each set of five queries before any of the searches had taken place. Each set consists of the numbers of all and only those paragraphs that the authors believed would be relevant to a person with the information need expressed in the query. Given the set of queries $Q_x$, where $x$ represents the number of the document to which the queries relate, the set of pre-search judgments for $Q_x$ is given by $\mathrm{Rel}(J)_x$. We may say that the members of $\mathrm{Rel}(J)_x$ are “judge-relevant.”

2. Searcher-relevant paragraphs. On completion of the searches, another 40 sets of relevance judgments, produced by the volunteers, were assigned to each set of five queries. Eight of these sets of judgments were produced for each of the five database versions. Each set consists of the numbers of all and only those paragraphs that the searcher believed would be relevant to a person with the information need expressed in the query. The set of judgments produced by searcher $y$ for $Q_x$ is given by $\mathrm{Rel}(S_y)_x$. We may say that the members of $\mathrm{Rel}(S_y)_x$ are “searcher-relevant.”

3. Navigator-relevant paragraphs. A set of “judgments” of a third type is that corresponding to the paragraphs “visited” or retrieved by the user in the course of their navigation. Another 40 sets of judgments could be identified in this way, eight for each of the five database versions. Each set consists of the numbers of all those paragraphs through which the searcher navigated. The set of such judgments produced by searcher $y$ for $Q_x$ is given by $\mathrm{Rel}(N_y)_x$. We may say that the members of $\mathrm{Rel}(N_y)_x$ are “navigator-relevant.” Note that if a paragraph is searcher-relevant, it is also necessarily navigator-relevant, but not vice versa; in fact, $\mathrm{Rel}(S_y)_x \subseteq \mathrm{Rel}(N_y)_x$.

We can now define a modified version of the 2 × 2 table that is often depicted in discussions of the measures of recall and precision (van Rijsbergen, 1979, p. 14X). This modified version (see Fig. 1) consists of six cells $a_1$, $a_2$, $b_1$, $b_2$, $c$, $d$ as follows:

(1) $a_1$ is the number of paragraphs that are judge-relevant and searcher-relevant, i.e., $a_1 = |\mathrm{Rel}(J)_x \cap \mathrm{Rel}(S_y)_x|$;

(2) $a_2$ is the number of paragraphs that are judge-relevant and navigator-relevant, but not searcher-relevant, i.e., $a_2 = |(\mathrm{Rel}(J)_x \cap \mathrm{Rel}(N_y)_x) \cap \overline{\mathrm{Rel}(S_y)_x}|$;

(3) $b_1$ is the number of paragraphs that are not judge-relevant but are searcher-relevant, i.e., $b_1 = |\overline{\mathrm{Rel}(J)_x} \cap \mathrm{Rel}(S_y)_x|$;

(4) $b_2$ is the number of paragraphs that are neither judge-relevant nor searcher-relevant, but are navigator-relevant, i.e., $b_2 = |(\overline{\mathrm{Rel}(J)_x} \cap \mathrm{Rel}(N_y)_x) \cap \overline{\mathrm{Rel}(S_y)_x}|$;

(5) $c$ is the number of paragraphs that are judge-relevant, but are not navigator-relevant, i.e., $c = |\mathrm{Rel}(J)_x \cap \overline{\mathrm{Rel}(N_y)_x}|$; and

(6) $d$ is the number of paragraphs that are neither judge-relevant nor navigator-relevant, i.e., $d = |\overline{\mathrm{Rel}(J)_x} \cap \overline{\mathrm{Rel}(N_y)_x}|$.

FIG. 1. (i) Venn diagram, and (ii) table, showing the relationships between sets of judge-relevant, searcher-relevant, and navigator-relevant paragraphs.
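As an illustration of how the six cells partition the paragraphs of a document, the following sketch counts them from three sets of paragraph numbers. It is a hedged example rather than the procedure actually used in the study: the function name `contingency_cells`, its argument names, and the toy data are all assumptions introduced here.

```python
def contingency_cells(all_paragraphs, judge, searcher, navigator):
    """Count the six cells a1, a2, b1, b2, c, d for one query set and one
    searcher. All arguments are iterables of paragraph numbers (hypothetical
    interface, not taken from the original experiment)."""
    universe = set(all_paragraphs)
    J, S, N = set(judge), set(searcher), set(navigator)
    if not S <= N:
        # Every searcher-relevant paragraph must have been visited.
        raise ValueError("searcher-relevant set must be a subset of navigator-relevant set")
    if not (J <= universe and N <= universe):
        raise ValueError("judgments refer to paragraphs outside the document")
    return {
        "a1": len(J & S),             # judge-relevant and searcher-relevant
        "a2": len((J & N) - S),       # judge-relevant, visited, not selected
        "b1": len(S - J),             # selected but not judge-relevant
        "b2": len((N - J) - S),       # visited only, not judge-relevant
        "c": len(J - N),              # judge-relevant but never visited
        "d": len(universe - J - N),   # neither judge-relevant nor visited
    }


if __name__ == "__main__":
    # Toy document of ten paragraphs (invented data).
    cells = contingency_cells(
        all_paragraphs=range(1, 11),
        judge={1, 2, 3, 4},
        searcher={1, 2, 5},
        navigator={1, 2, 3, 5, 6, 7},
    )
    print(cells)
```

Because the six cells are mutually exclusive and exhaustive, their counts sum to the number of paragraphs in the document, which gives a simple consistency check on any implementation.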

For each set of queries $x$ and for each searcher $y$, we may define one measure of recall, $R1_{x,y}$, as the ratio of the number of paragraphs that are both judge-relevant and searcher-relevant, to the total number of those that are judge-relevant, i.e.,

$$R1_{x,y} = \frac{a_1}{a_1 + a_2 + c}.$$

Similarly, we may define a measure of precision, $P1_{x,y}$, as the ratio of the number of paragraphs that are both judge-relevant and searcher-relevant, to the total number of those that are searcher-relevant, i.e.,

$$P1_{x,y} = \frac{a_1}{a_1 + b_1}.$$

For each set of queries $x$ and for each searcher $y$, we may also define a second measure of recall, $R2_{x,y}$, as the ratio of the number of paragraphs that are both judge-relevant and navigator-relevant, to the total number of those that are judge-relevant, i.e.,

$$R2_{x,y} = \frac{a_1 + a_2}{a_1 + a_2 + c}.$$

Similarly, we may define a second measure of precision, $P2_{x,y}$, as the ratio of the number of paragraphs that are both judge-relevant and navigator-relevant, to the total number of those that are navigator-relevant, i.e.,

$$P2_{x,y} = \frac{a_1 + a_2}{a_1 + a_2 + b_1 + b_2}.$$

TABLE 2. Mean search data calculated over each document 1-5, and over all documents.

Document ID     1         2         3         4         5         All

No key
  Time          00:47:41  00:26:10  00:19:17  00:12:10  00:32:10  00:27:29
  v_i → v_j     5         6         5         3         8         5
  Total         219       110       78        36        260       141
  R1            0.23      0.35      0.59      0.86      0.38      0.48
  P1            0.41      0.66      0.70      0.77      0.76      0.67
  R2            0.52      0.52      0.84      0.95      0.64      0.69
  P2            0.26      0.31      0.34      0.40      0.29      0.32

Key
  Time          00:58:07  00:36:56  00:20:49  00:18:01  00:28:53  00:32:33
  v_i → v_j     2         3         3         7         3         3
  Total         78        55        46        3;        70        57
  R1            0.22      0.22      0.52      0.88      0.21      0.41
  P1            0.56      0.49      0.70      0.71      0.73      0.64
  R2            0.37      0.33      0.76      0.94      0.37      0.55
  P2            0.41      0.34      0.44      0.38      0.45      0.40

All
  Time          00:52:54  00:31:33  00:20:03  00:15:05  00:30:32  00:30:01
  v_i → v_j     3         4         4         3         5         4
  Total         148       83        62        35        165       99
  R1            0.23      0.28      0.55      0.87      0.30      0.44
  P1            0.52      0.57      0.70      0.74      0.74      0.65
  R2            0.45      0.42      0.80      0.95      0.51      0.62
  P2            0.33      0.33      0.39      0.39      0.37      0.36
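The four measures defined in Section 4.3 reduce to simple ratios of the cell counts. The sketch below shows one way they might be computed; the function name `effectiveness` is hypothetical, and returning `None` when a denominator is zero is an assumption made here, since the article does not say how such cases were treated.

```python
def effectiveness(a1, a2, b1, b2, c):
    """Compute R1, P1, R2 and P2 from the cell counts of Section 4.3.
    The cell d is not needed by any of the four measures."""

    def ratio(numerator, denominator):
        # Assumption: an undefined ratio is reported as None.
        return numerator / denominator if denominator else None

    judge_relevant = a1 + a2 + c
    return {
        "R1": ratio(a1, judge_relevant),            # recall, searcher-relevant
        "P1": ratio(a1, a1 + b1),                   # precision, searcher-relevant
        "R2": ratio(a1 + a2, judge_relevant),       # recall, navigator-relevant
        "P2": ratio(a1 + a2, a1 + a2 + b1 + b2),    # precision, navigator-relevant
    }


if __name__ == "__main__":
    # Invented cell counts for a single search.
    print(effectiveness(a1=2, a2=1, b1=1, b2=2, c=1))
```

Since every searcher-relevant paragraph is also navigator-relevant, R2 can never be lower than R1 for the same search, which is consistent with the pattern of mean values in Table 2.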

5. Results and Discussion

The values of each of the measures discussed in the previous section were calculated for each set of queries $x$ and for each searcher $y$, and full details appear in a report (Ellis et al., 1994a). A summary of these results is presented in Table 2. Each column in this table consists of mean data, either (in the columns headed 1 to 5) averaged over all the searches undertaken in different versions of the same individual document, or (in the column headed All) over all searches undertaken in all documents. The figures in the rows labeled No key are averaged over all searches in which use of the keyword-search facility was not made available, those labeled Key are averaged over all searches in which use of this facility was made available, and those labeled All are averaged over all searches. The figures in each individual row have the following meanings:

Time: The time in hours:minutes:seconds taken to carry out the search;
v_i → v_j: The number of “subjective” links navigated where the two nodes connected were not simply adjacent to each other in the linear structure of the original printed document, i.e., links that were not “next-node-in-sequence” links (see Section 4.1);
Total: The total number of links navigated during the search;
R1: The proportion of judge-relevant paragraphs that are searcher-relevant;
P1: The proportion of searcher-relevant paragraphs that are judge-relevant;
R2: The proportion of judge-relevant paragraphs that are navigator-relevant;
P2: The proportion of navigator-relevant paragraphs that are judge-relevant.

5.1. Variation between Searches in Different Documents

Disregarding for the moment the distinction between No key and Key searches, and focusing on the data in the rows labeled All, we may firstly note the variation amongst documents in the recorded mean values of R1, P1, and R2, and in the mean times taken by searchers. (Mean values of P2, the proportion of navigator-relevant paragraphs that are judge-relevant, are the least variable of those of our four measures of effectiveness.) Searches in Documents 3 and 4 took markedly less time, and the mean values of recall (both R1 and R2) for those documents are markedly higher. Similarly, the mean values of P1 for Documents 3, 4, and 5 are markedly higher than those for Documents 1 and 2. These and related differences may be simply explained by reference to certain characteristics both of the documents themselves, such as length and difficulty, and of the test queries, such as specificity. Documents 3 and 4 (45 and 23 nodes, respectively) are much shorter than the others, their numbers of judge-relevant paragraphs (13 and 6, respectively) are much smaller, and their queries are generally more specific.

Although microanalytical techniques have proved useful in other studies of hypertext retrieval (see, for example, Gray, 1992), it is not our intention to examine results at the level of individual searches. Nevertheless, certain hypotheses may be posited whose acceptance would rest on the findings of such detailed analysis. For instance, searches in Document 1 took significantly longer, on average, than those in the two other documents of equivalent length. We might surmise that this is the result of a combination of factors including: the use of terms in the queries that are not used, or seldom used, in relevant paragraphs in the document itself; a table of contents that provides searchers with few obvious starting-points; and a particularly detailed treatment of its subject matter. These factors exert an influence on searchers not only directly, but in an indirect manner as a result of their effect on the work of linkers. Given the inexperience of both sets of volunteers, we might assume that a document whose structure presents difficulties to the linker will pose problems of a similar nature for the searcher.

5.2. Variation between Searches of Different Types

In consideration of the differences between searches in which keyword-searching was available and those in which it was not, it comes as no surprise that the mean values of R2 for No key searches are higher in all cases than those for Key searches. Each value of R2 represents the proportion of judge-relevant paragraphs that are merely visited by a searcher (whether they are identified by the searcher as relevant or not); in general, searchers with no access to keyword-searching navigate through many more nodes than those with such access (witness the values of Total in Table 2); it is therefore to be expected that they will visit a higher proportion of those nodes that are judged to be relevant, whether such navigation is by “accident” or by design. However, it is more interesting to find that, for four documents out of five (and overall), mean values of R1 and of P1 for No key searches are greater than or equal to those for Key searches. Moreover, the average times taken by No key searches are lower in four out of the five cases (and overall). In other words, searchers with no access to keyword-searching not only identify as relevant larger proportions of judge-relevant paragraphs than their keyword-searching counterparts, but they also do so with greater precision and in less time.

For this last finding to be accepted at a reasonable level of statistical significance would require a larger sample and greater control of variables such as searcher expertise. This does not prevent us from attempting to explain the result, and this we can do with reference to our qualitative observations of the searchers at work. In general, our volunteers tended to be distrustful of the usefulness of the links that had been inserted amongst the paragraphs of the documents they searched. By comparing the values of v_i → v_j and Total in Table 2, we can see that, out of the total number of links followed by searchers, only very small numbers were subjective links inserted by our volunteer linkers rather than objective links deriving from the physical structure of the original printed document. Our instructions to searchers specifically encouraged the use of the subjective links, but there are various possible reasons why this encouragement was largely ignored. In some cases, it is possible that searchers experimented with the use of subjective links early in their sessions, but came to the conclusion that they were not helpful. Other searchers might have been wary of losing their bearings, or might simply have decided that using links from the table of contents and the keyword-search facility (if available) were more (or sufficiently) effective methods for conducting their searches. It is not easy to draw definite conclusions from the data available: In hindsight, we recognize that the quality of the experimental design suffers from its lack of a post-search questionnaire similar to that which was distributed to linkers.

5.3. Conclusion

Our principal experimental aim is to examine the relationship between inter-linker consistency and retrieval effectiveness, and a detailed study of the factors affecting levels of retrieval effectiveness is beyond the scope of our work. Nevertheless, the results of our use of traditional measures of effectiveness in the evaluation of a navigational retrieval system, though inconclusive, are interesting enough to warrant further investigation in this area.

From our small sample of results, we noted two principal patterns:

(1) Higher levels of effectiveness were achieved by searchers who were not given access to a keyword-search facility. For those who are comfortable in the belief that navigational retrieval can offer a practical alternative to the more conventional system-mediated approach, this is an agreeable and possibly unexpected finding (albeit one that should not be overemphasized, since the differences involved, especially those between mean precision values, are small).


(2) The level of searchers’ use of subjective links was extremely low. This finding is more worrying for hypertext’s apologists, as it makes it very difficult to contend that the effort expended on the creation of large numbers of these links (see Table 1) was worthwhile for the purposes of subsequent searches.

Of course, there are several reasons why it would be wrong to accept the significance of these findings too readily. We have already remarked upon the small size of our sample. A second noteworthy shortcoming of the experimental design is that, in their capacity as students, few of our volunteers had any professional experience of hypertext construction or of database searching. (In our defence, however, lack of experience of the latter type might be regarded as a fair reflection of the characteristics of the hypothetical group of end-users for which our system might be intended.) Moreover, it could be argued from the outset that testing the effectiveness of a navigational system using a method based on queries is fundamentally inappropriate, and that an experiment based on the evaluation of browsing-based searches would have been a fairer test of the capabilities of the navigational approach.

6. Comparison of Consistency with Effectiveness

6.1. Methodology

In our earlier article (Ellis et al., 1994b), we described the use of a variety of methods, involving the application of arithmetic coefficients and topological indices, in order to measure the degree of similarity between the sets of inter-nodal links inserted in the two members of a pair of hypertext databases. Our principal research hypothesis ($H_1$) was that there is a positive association between inter-linker consistency and retrieval effectiveness. The objective of our experiment was therefore to compare the measurements of inter-linker consistency presented in our earlier article, with the measurements of retrieval effectiveness presented in Section 5. Because a single consistency value gives an indication of the similarity between two hypertext versions of the same document, each such value should be compared with a corresponding value that gives an indication of the average effectiveness that may be achieved by searching either of those two databases. It is in this way that Leonard (1975) compares levels of inter-indexer consistency with retrieval effectiveness, although he calculates his averages over groups of ten documents rather than groups of two (see Section 2).

For the purposes of this experiment, we consider each pair of hypertext databases as an object that may be characterized by a value of each of two attributes, S and M. There are fifty objects in our sample (ten database-pairs per document). Each value of attribute S represents the level of inter-linker consistency between the two members of a database-pair; each value of attribute M represents the mean level of mean effectiveness attained by searches in the two members of a database-pair.

FIG. 2. Scattergram: N4(A) against R1.

A value S representing the level of inter-linker consistency between the two members of a database-pair may be obtained by any of the wide variety of methods described in our earlier article. On the basis of the discussion presented there, we have chosen to use four sets of values of S, each corresponding to one of the sets of values of S(A), S(D), N4(A), and N4(D) recorded in Table 3; the reader is referred to the earlier article (Ellis et al., 1994b) for a full description of the formulae used in the calculation of these values.
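The exact formulae for S(A), S(D), N4(A), and N4(D) are given only in the earlier article and are not reproduced here. Purely as an illustration of the kind of pairwise similarity involved, the sketch below applies the standard set-based Dice coefficient to two link-sets over the same nodes. Treating each link as a (source, target) pair, the function name `dice`, and the convention of returning 0.0 for two empty link-sets are all assumptions, and the result should not be read as a reimplementation of any of the four measures.

```python
def dice(links_a, links_b):
    """Standard Dice coefficient 2|A ∩ B| / (|A| + |B|) between two sets of
    directed links, each link being a (source, target) pair of paragraph IDs."""
    a, b = set(links_a), set(links_b)
    if not a and not b:
        # Assumption: two empty link-sets are scored as 0.0.
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


if __name__ == "__main__":
    # Invented link-sets for two versions of the same document.
    version_a = {(1, 2), (1, 5), (3, 4)}
    version_b = {(1, 2), (3, 4), (2, 6), (4, 7)}
    print(round(dice(version_a, version_b), 3))
```

A coefficient of this form ranges from 0 (no links in common) to 1 (identical link-sets), which matches the range of the values recorded in Table 3.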

A value M representing the mean level of effectiveness for a single database may be obtained by taking the average of a set of values, each element of which represents the level of effectiveness of one of the eight individual searches conducted in that database. A value representing the level of effectiveness of a search may be calculated using any of the four different formulae defined in Section 4, and in theory it would be of interest to state four experimental hypotheses, each corresponding to a different definition of effectiveness and hence to a different set of values of M. Our one-tailed null hypothesis ($H_0$) is thus that there is no positive association between the set of values of S and any set of values of M. As Leonard’s original hypothesis concerned recall only, however, we are concerned more with values of M derived from the formula for R1, and to a lesser extent with those derived from R2, than with those derived from values of precision. Table 4 presents the mean values of mean recall for 50 database-pairs.
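The construction of M for the ten database-pairs of a document can be sketched as follows. This is an illustrative reading of the description above, not the authors' own code: the function name `pair_effectiveness`, the dictionary layout, and the toy scores (two searches per version instead of eight) are assumptions.

```python
from itertools import combinations
from statistics import mean


def pair_effectiveness(search_scores):
    """search_scores maps a version label ('a'..'e') to the list of per-search
    values of one measure (e.g. R1) for that database. Returns the mean of the
    two database means for every unordered pair of versions."""
    database_mean = {version: mean(scores) for version, scores in search_scores.items()}
    return {
        f"{u}/{v}": (database_mean[u] + database_mean[v]) / 2
        for u, v in combinations(sorted(database_mean), 2)
    }


if __name__ == "__main__":
    # Invented scores: two searches per version, for brevity.
    scores = {
        "a": [0.2, 0.4],
        "b": [0.5, 0.5],
        "c": [0.1, 0.3],
        "d": [0.6, 0.8],
        "e": [0.0, 0.2],
    }
    for pair, value in pair_effectiveness(scores).items():
        print(pair, round(value, 2))
```

Five versions yield ten unordered pairs per document, which accounts for the fifty objects in the sample.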

6.2. Results

TABLE 3. Values of the Dice coefficient for 50 database-pairs, derived (i) using matrix-element methods from adjacency data (S(A) values) and from distance data (S(D) values), and (ii) using node-index methods from adjacency data (N4(A) values) and from distance data (N4(D) values)*.

Document ID     1                         2                         3                         4                         5
Version-pair ID S(A)  S(D)  N4(A) N4(D)   S(A)  S(D)  N4(A) N4(D)   S(A)  S(D)  N4(A) N4(D)   S(A)  S(D)  N4(A) N4(D)   S(A)  S(D)  N4(A) N4(D)
a/b             0.03  0.02  0.18  0.49    0.10  0.02  0.37  0.25    0.47  0.33  0.68  0.49    0.02  0.01  0.10  0.01    0.03  0.02  0.21  0.18
a/c             0.02  0.01  0.20  0.13    0.18  0.04  0.24  0.01    0.02  0.00  0.24  0.13    0.16  0.15  0.56  0.39    0.03  0.01  0.22  0.18
a/d             0.03  0.02  0.14  0.15    0.26  0.10  0.70  0.08    0.56  0.17  0.68  0.15    0.25  0.12  0.67  0.04    0.06  0.04  0.41  0.22
a/e             0.02  0.01  0.09  0.34    0.17  0.05  0.36  0.07    0.45  0.24  0.64  0.34    0.02  0.01  0.00  0.00    0.01  0.00  0.09  0.13
b/c             0.01  0.00  0.12  0.21    0.07  0.02  0.41  0.01    0.02  0.01  0.27  0.21    0.02  0.01  0.06  0.09    0.03  0.01  0.40  0.29
b/d             0.01  0.01  0.15  0.11    0.09  0.05  0.32  0.07    0.56  0.15  0.75  0.11    0.03  0.01  0.04  0.00    0.06  0.03  0.30  0.16
b/e             0.02  0.01  0.18  0.46    0.08  0.03  0.52  0.08    0.52  0.45  0.71  0.46    0.02  0.01  0.00  0.00    0.01  0.00  0.10  0.02
c/d             0.11  0.08  0.57  0.08    0.16  0.02  0.24  0.00    0.03  0.03  0.26  0.08    0.09  0.49  0.53  0.09    0.04  0.01  0.32  0.10
c/e             0.02  0.02  0.08  0.38    0.12  0.05  0.42  0.10    0.02  0.00  0.25  0.38    0.01  0.01  0.00  0.00    0.01  0.00  0.19  0.19
d/e             0.02  0.01  0.08  0.09    0.16  0.03  0.36  0.01    0.55  0.13  0.78  0.09    0.01  0.00  0.00  0.00    0.01  0.00  0.12  0.05

* The formulae used in the calculation of these values are described in detail in Ellis et al. (1994b).

The scattergram in Figure 2 shows the set of values of N4(A) for 50 database-pairs (as recorded in Table 3) plotted against the corresponding set of values of R1 (as recorded in Table 4). This scattergram is just one representative of the eight that may be formed by plotting one of the four sets of consistency values against one of the two sets of recall values: None of the other scattergrams displays characteristics that are sufficiently dissimilar from those of our chosen example to suggest that conclusions should be drawn which differ from those made below. A number of observations should be made before the results of any statistical test of correlation between consistency and recall values are analyzed:

(1) The recorded values of S(A), S(D), N4(A), and N4(D) form distributions that are far from normal: Frequency diagrams reveal positively-skewed distributions.

(2) However, a few of the values of S(A) and S(D) are much higher than their medians. The values that are particularly high (within the range 0.50 to 0.60) are to be found amongst those recorded for Document 3, i.e., values of equivalent magnitude to these were not recorded for any of the other documents. This might suggest that each sample of values for a particular document is drawn from a different population. We do not intend to test this hypothesis here (it is doubtful, in any case, that the samples are sufficiently large to allow conclusions to be drawn at an appropriate level of significance), but it is certainly not unreasonable to imagine that those factors having a strong influence on level of consistency include some which may be expressed in terms of the characteristics of the documents under consideration, and that these characteristics (e.g., length, subject, difficulty) vary considerably, with the result that levels of consistency vary accordingly between documents. Indeed, in a more general respect, these characteristics seem to have had a significant degree of influence on the levels of recall and precision that were recorded in our experiments.

(3) Similarly, neither of the samples of values of mean recall forms a normal distribution, and inspection of the eight consistency-recall scattergrams exemplified by the representative shown in Figure 2 would indicate that, again, each sample of values for a particular document is drawn from a different population. Factors influencing the level of mean recall might include not only the same characteristics of documents that affect level of consistency, but also characteristics of the different sets of queries associated with each document (e.g., terminological usage, specificity).

TABLE 4. Mean values of mean recall for 50 database-pairs.

Document ID     1             2             3             4             5
Version-pair ID R1    R2      R1    R2      R1    R2      R1    R2      R1    R2
a/b             0.23  0.47    0.24  0.36    0.54  0.78    0.88  0.95    0.23  0.45
a/c             0.23  0.39    0.26  0.36    0.57  0.81    0.87  0.97    0.29  0.45
a/d             0.21  0.43    0.29  0.43    0.60  0.82    0.88  0.99    0.32  0.53
a/e             0.24  0.46    0.24  0.37    0.51  0.75    0.82  0.93    0.24  0.42
b/c             0.23  0.44    0.28  0.43    0.55  0.83    0.90  0.94    0.30  0.51
b/d             0.21  0.48    0.31  0.45    0.59  0.84    0.90  0.96    0.32  0.59
b/e             0.24  0.51    0.26  0.39    0.49  0.77    0.84  0.90    0.25  0.49
c/d             0.21  0.40    0.33  0.50    0.61  0.86    0.90  0.98    0.38  0.59
c/e             0.24  0.43    0.28  0.44    0.52  0.79    0.84  0.92    0.31  0.49
d/e             0.22  0.47    0.31  0.46    0.55  0.80    0.84  0.94    0.33  0.57

TABLE 5. Values of Spearman’s rank-order correlation coefficient $\rho$ indicating correlation between sets of values of inter-linker consistency and sets of values of mean effectiveness.

Document ID 1; No. of nodes, p = 340
$\rho$    R1      P1      R2      P2
S(A)      -0.237  0.085   -0.367  -0.129
S(D)      -0.279  0.099   -0.363  -0.287
N4(A)     -0.298  0.320   -0.199  0.180
N4(D)     0.749   0.061   0.342   0.580

Document ID 2; No. of nodes, p = 307
$\rho$    R1      P1      R2      P2
S(A)      0.034   0.666   -0.098  0.437
S(D)      0.006   0.459   -0.025  0.376
N4(A)     -0.232  -0.215  -0.156  0.070
N4(D)     -0.522  -0.280  -0.432  -0.123

Document ID 3; No. of nodes, p = 45
$\rho$    R1      P1      R2      P2
S(A)      0.185   0.275   0.031   -0.855
S(D)      -0.372  -0.144  -0.492  -0.760
N4(A)     -0.098  0.031   -0.116  -0.769
N4(D)     -0.784  -0.605  -0.746  -0.150

Document ID 4; No. of nodes, p = 23
$\rho$    R1      P1      R2      P2
S(A)      0.507   0.311   0.824   0.540
S(D)      0.411   0.394   0.681   0.421
N4(A)     0.630   0.449   0.900   0.516
N4(D)     0.560   0.624   0.661   0.281

Document ID 5; No. of nodes, p = 347
$\rho$    R1      P1      R2      P2
S(A)      0.383   -0.812  0.529   0.233
S(D)      0.153   -0.835  0.333   0.445
N4(A)     0.462   -0.726  0.532   -0.064
N4(D)     -0.098  -0.480  -0.129  0.067

The non-normal shape of each of the distributions under consideration precludes the use of Pearson’s product-moment correlation coefficient to test the extent of the association between variables; in any case, inspection alone is sufficient to confirm that there is no discernible association between the values of S and the values of R1 when both samples are taken as wholes. It is instructive, however, to investigate levels of association between the samples of values for a particular document, so that the potentially variable effects of document and query characteristics are controlled. In doing this, we shall consider values of an alternative, non-parametric coefficient, Spearman’s rank-order correlation coefficient, that was used by Leonard in his pioneering study of the relationship between consistency and effectiveness.

For each document, Table 5 presents the values of the rank-order correlation coefficient ρ derived from comparisons of each set of values of S with each set of values of M. For a one-tailed test where n = 10, the value of ρ must exceed 0.564 for the level of association that it represents to be considered significant at a level of probability of p ≤ 0.05. In Table 5, the values of ρ that exceed this threshold are set in bold type. We find that the values of S and R1 are positively correlated at a level that exceeds the threshold in only two cases out of twenty (Document 1, S = N4(D): ρ = 0.749, p ≤ 0.01; Document 4, S = N4(A): ρ = 0.630, p ≤ 0.05), and are in fact inversely correlated at a similar level in one case (Document 3, S = N4(D): ρ = -0.784, p ≤ 0.01). When the values of S for each document are compared with values of P1, R2, and P2, conflicting patterns emerge: The level of positive association between S and P1 is revealed to be significant in two cases (Document 2, S = S(A): ρ = 0.666, p ≤ 0.025; Document 4: ρ = 0.624, p ≤ 0.05), while the level of inverse association between the same two variables is similarly significant in four other cases. S and R2 are positively correlated to a significant degree in four cases, all relating to the same document (Document 4), but are inversely correlated in one other case; and S and P2 exhibit a significant positive association in one case and a significant inverse association in another. In short: In the overwhelming majority of cases, the recorded value of ρ does not suggest the existence of a positive association.
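The test described above can be illustrated with a short sketch. The Python code below is not the procedure or software used in the study; it is a minimal illustration, assuming that tied values receive the mean of the ranks they occupy. The function names (average_ranks, spearman_rho, exceeds_threshold) and the two ten-element samples are invented for the example. Only the critical value 0.564 (one-tailed test, n = 10, p ≤ 0.05) is taken from the text.

```python
import math
from typing import List, Sequence

# Critical value quoted in the text for a one-tailed test with n = 10, p <= 0.05.
CRITICAL_RHO_N10 = 0.564


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n; tied values share the mean of their ranks (assumption)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho, computed as the Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("samples must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one sample is constant")
    return cov / math.sqrt(var_x * var_y)


def exceeds_threshold(rho: float, critical: float = CRITICAL_RHO_N10) -> bool:
    """True when rho indicates a significant positive association (one-tailed)."""
    return rho > critical


if __name__ == "__main__":
    # Hypothetical values for ten link-sets of one document; not data from the study.
    consistency = [0.12, 0.30, 0.25, 0.08, 0.41, 0.19, 0.33, 0.27, 0.15, 0.22]
    effectiveness = [0.40, 0.55, 0.35, 0.45, 0.60, 0.30, 0.50, 0.65, 0.25, 0.20]
    rho = spearman_rho(consistency, effectiveness)
    print(f"rho = {rho:.3f}; significant positive association: {exceeds_threshold(rho)}")
```

Computing ρ as the Pearson correlation of the ranks, rather than with the shortcut formula based on squared rank differences, keeps the result correct when ties occur. A test for inverse association at the same level would compare ρ with the negative of the critical value.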

7. Conclusions

Given these results, we are unable to reject the null hypothesis H0, that there is no positive association between any set of values of S and any set of values of M. If we had found that levels of inter-linker consistency, like those of inter-indexer consistency, are predictive of levels of retrieval effectiveness, then those results would have had ominous implications for the evaluation of retrieval systems that access databases in which hypertext links have been inserted manually, especially given the intensive nature of the manual labor required in the creation of hypertext links. But we did not find this to be so, and we are therefore not in a position to draw conclusions about the consistency-effectiveness relationship that are equivalent to those drawn by Leonard.

However, our research raises many issues, suitable for further study, that concern the effectiveness of hypertext systems. It would be of interest, for example, to see whether our results were reproducible using (i) other authoring systems with different linking mechanisms to those provided by Guide, (ii) original hypertexts that are not merely conversions of existing texts, (iii) professional hypertext authors, (iv) links created automatically by means of statistical analysis of term occurrences, and (v) expert searchers, inter alia. A particularly fertile area for future research, however, would be a comparison of the differential effect on retrieval effectiveness of (i) link structure, and (ii) search strategy. If it were found that the precise shape of a link structure is relatively unimportant in terms of its ultimate influence on retrieval effectiveness, then we would yet be forced to draw a disturbing conclusion: That the presumed objective of a manual linker’s efforts (to construct link-sets on whose account the effectiveness of future searches will somehow be optimized) is largely doomed to failure.

Acknowledgments

We thank students from the Department of Information Studies for their creation of the sets of hypertext links and the searches used in our experiments; two anonymous referees for their comments; and the British Library Research and Development Department for funding this work under grant number RDD/G/142.

Appendix: Test Collection and Queries

Ellis, D. (1987). The derivation of a behavioural model for information retrieval design (Ph.D. thesis). Department of Information Studies, University of Sheffield, Sheffield, UK.

(1) The importance of citation to academic research.

(2) Knowledge structures as the basis for information retrieval.

(3) “Semi-directed” searching activity.

(4) Terminological issues in the social sciences in relation to information retrieval mechanisms.

(5) The use of relevance judgments to improve information retrieval performance.

Rasmussen, E. M. (1988). Cluster analysis on a highly parallel array processor (Ph.D. thesis). Department of Information Studies, University of Sheffield, Sheffield, UK.

(1) How useful is the Distributed Array Processor for clustering chemical structures?

(2) What types of hierarchical agglomerative clustering methods are available?

(3) How does Amdahl’s Law affect the speedup which can be achieved by a parallel computer?

(4) The use of clustering to improve the effectiveness of document retrieval systems.

(5) Information on the computational complexity of clustering methods.

Ormerod, A., Willett, P., & Bawden, D. (1989). Comparison of fragment weighting schemes for substructural analysis. Quantitative Structure-Activity Relationships, 8, 115-129.

(1) How were the ranked lists evaluated?

(2) What were the two types of predictive technique which were used?

(3) Which weighting scheme uses the binomial distribution?

(4) Which databases performed poorly in the predictive experiments?

(5) Which pairs of weights are closely related?

Loughridge, B., & Sutton, J. (1988). The careers of MA graduates: Training, education and practice. Journal of Librarianship, 20, 255-269.

(1) Which aspects of previous work experience were most frequently mentioned as valuable?

(2) How many graduates thought management had been over-emphasized in the course?

(3) How many of the former SCONUL trainees chose the special libraries option on the course?

(4) What influence, if any, did the pre-library school work-experience have on students’ choice of course options and on career expectations?

(5) How many graduates find posts in academic libraries?

Ford, N. (1991). Expert systems and artificial intelligence: An information manager’s guide. London: Library Association.

(1) Anything about a frame-based system developed at London University.

(2) What are certainty factors, and how are they used in a system produced by the OCLC?

(3) Describe the two types of control mechanisms that may be used in an inference engine.

(4) The use of expert system shells (with examples).

(5) What is heuristic reasoning?

References

Al-Hawamdeh, S., Smith, G., & Willett, P. (1991). Paragraph-based access to full-text documents using a hypertext system. Program, 25, 119-131.

Belkin, N. J., & Vickery, A. (1985). Interaction in information systems: A review of research from document-retrieval to knowledge-based systems (BLRDD Report). London: British Library Research & Development Department.

Boyle, C., Teh, S. H., & Williams, C. (1990). An empirical evaluation of hypertext interfaces. Hypermedia, 2, 235-247.

Egan, D. E., Remde, J. R., Gomez, L. M., Landauer, T. K., Eberhardt, J., & Lochbaum, C. C. (1989). Formative design-evaluation of ‘SuperBook’. ACM Transactions on Information Systems, 7, 30-57.

Ellis, D., Furner-Hines, J., & Willett, P. (1994a). The creation of hypertext linkages in full-text documents (BLRDD Report). London: British Library Research & Development Department.

Ellis, D., Furner-Hines, J., & Willett, P. (1994b). On the creation of hypertext links in full-text documents: Measurement of inter-linker consistency. Journal of Documentation, 50, 67-98.

Frisse, M. (1988). Searching for information in a hypertext medical handbook. Communications of the ACM, 31, 819-835.

Gray, H. S. (1992). Hypertext and the technology of conversation: Orderly situational choice. Westport, CT: Greenwood Press.

Guide user manual. (1990). Bellevue, WA: OWL International.

Hooper, R. S. (1965). Indexer consistency tests: Origin, measurements, results and utilization. Bethesda, MD: IBM.

Leonard, L. E. (1975). Inter-indexer consistency and retrieval effectiveness: Measurement of relationships. Unpublished doctoral dissertation, Graduate School of Library Science, University of Illinois, Urbana-Champaign, IL.

Leonard, L. E. (1977). Inter-indexer consistency studies, 1954-1975: A review of the literature and summary of study results. Occasional Paper No. 131, Graduate School of Library Science, University of Illinois, Urbana-Champaign, IL.

Nielsen, J. (1989). The matters that really matter for hypertext usability. In N. Meyrowitz (Ed.), Hypertext ’89 Proceedings (November 5-8, 1989, Pittsburgh, PA) (pp. 239-248). New York: Association for Computing Machinery.

Nielsen, J. (1993). Hypertext and hypermedia (2nd ed.). Cambridge, MA: Academic Press.

Rada, R., & Murphy, C. (1992). Searching versus browsing in hypertext. Hypermedia, 4, 1-30.

Reich, P., & Biever, E. J. (1991). Indexing consistency: The input/output function of thesauri. College and Research Libraries, 52, 336-342.

Robertson, S. E., & Hancock-Beaulieu, M. H. (1992). On the evaluation of IR systems. Information Processing & Management, 28, 457-466.

Savoy, J. (1993). Effectiveness of information retrieval systems used in a hypertext environment. Hypermedia, 5, 23-46.

Sievert, M. C., & Andrews, M. J. (1991). Indexing consistency in Information Science Abstracts. Journal of the American Society for Information Science, 42, 1-6.

Su, L. T. (1992). Evaluation measures for interactive information retrieval. Information Processing & Management, 28, 503-516.

Tonta, Y. (1992). A study of indexer consistency between Library of Congress and British Library cataloguers. Library Resources and Technical Services, 35, 177-185.

van Rijsbergen, C. J. (1979). Information retrieval (2nd ed.). London: Butterworths.

Zunde, P., & Dexter, M. E. (1969). Factors affecting indexing performance. In J. B. North (Ed.), Cooperating information societies: Proceedings of the 32nd Annual Meeting of the American Society for Information Science (October 1-4, 1969, San Francisco, CA) (pp. 313-322). Westport, CT: Greenwood.

300 JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE-April 1996