
GENERAL QUERY EXPANSION TECHNIQUES FOR SPOKEN DOCUMENT RETRIEVAL

Pierre Jourlin†, Sue E. Johnson‡, Karen Spärck Jones† and Philip C. Woodland‡

† Cambridge University Computer Laboratory, Pembroke Street, Cambridge, CB2 3QG, UK.
‡ Cambridge University Engineering Department, Trumpington Street, Cambridge, CB2 1PZ, UK.
Email: {pj207,ksj}@cl.cam.ac.uk, {sej28,pcw}@eng.cam.ac.uk

ABSTRACT

This paper presents some developments in query expansion and document representation of our Spoken Document Retrieval (SDR) system since the 1998 Text REtrieval Conference (TREC-7).

We have shown that a modification of the document representation combining several techniques for query expansion can improve Average Precision by 17% relative to a system similar to that which we presented at TREC-7 [1]. These new experiments have also confirmed that the degradation of Average Precision due to a Word Error Rate (WER) of 25% is relatively small (around 2% relative). We hope to repeat these experiments when larger document collections become available to evaluate the scalability of these techniques.

    1. INTRODUCTION

Accessing information in spoken audio encompasses a wide range of problems, in which spoken document retrieval has an important place. A set of spoken documents constitutes the file for retrieval, to which the user addresses a request expressing an information need in natural language.

This original sequence of words is transformed by the system into a set of query terms which are used to retrieve documents which may or may not meet the user's information need. A good SDR system retrieves as many relevant documents as possible whilst keeping the number of non-relevant retrieved documents to a minimum. For this work we take text-based queries and use an Automatic Speech Recognition (ASR) system to produce word-based transcriptions for the documents.

Following earlier scattered studies, the TREC-7 SDR evaluation provided further support for the claim that conventional Information Retrieval (IR) methods are applicable to automatically transcribed documents. Our retrieval system was run on 7 different sets of automatically transcribed broadcast news texts with a WER varying from 24.8% to 66.2%. The corresponding range of Average Precision was 45% to 35% [1].

This work is in part supported by an EPSRC grant on Multimedia Document Retrieval, reference GR/L49611.

The difference in Average Precision between the best ASR-based system and the manual reference transcriptions was only 5% relative for our retrieval engine. Therefore we concluded that improving IR performance would probably be more profitable than improving ASR performance. We have therefore focussed our research on general IR techniques, such as query expansion, implemented within the Probabilistic Retrieval Model (PRM) [2].

The formal framework for this research is presented in sections 2 and 3, with the experimental procedure and system description in section 4. Results are given in section 5 and conclusions are drawn in section 6.

    2. A BRIEF DESCRIPTION OF THE PRM

The PRM framework [2] is not too prescriptive on document representation. Here we address the relation between the notion of query and document index terms and the more ordinary notion of words. In section 3, we show how more complex relations can be established to enrich the document representation.

The PRM is based on the idea that documents are ranked by the retrieval engine in order of decreasing estimated probability of relevance $P_Q(R \mid D)$.¹ The relevance $R$ is taken to be a basic, binary criterion variable. $D$ is a random variable taking values in the document universe $\mathcal{D}$.

¹ To estimate $P_Q(R \mid D)$, additional information outside the document universe, such as the document length, may be used.

For a given document collection, $\mathcal{D}$ is a set of possible events, each event corresponding to the occurrence of a particular document and document representation. The query $Q$ is used in the creation of the document representation and therefore is necessary to define $\mathcal{D}$.

Suppose, for the moment, that the query terms are just plain words. By assuming all query words are independent, a document event might be represented as the set of couples $(w, wf(w, d))$ for all query words $w$, where the word frequency $wf(w, d)$ is the number of occurrences of $w$ in document $d$. By way of illustration, a small but complete retrieval example could be given by:

Query: information retrieval
1st doc.: information retrieval is no easy task
2nd doc.: speech is an information rich medium

$e_1 = \{(\text{information}, 1), (\text{retrieval}, 1)\}$
$e_2 = \{(\text{information}, 1), (\text{retrieval}, 0)\}$
$\mathcal{D} = \{e_1, e_2\}$
$R = \{\text{yes}, \text{no}\}$
$P(R = \text{yes} \mid D = e_1) > P(R = \text{yes} \mid D = e_2)$

The original query word frequencies within a document therefore provide basic predictors of relevance. However, we should bear in mind that the query words are derived from the user's original request, which in turn conveys a need which could have been expressed differently. In addition, as queries are just word sets, the same set could have been extracted from different text requests. In the next section we review various ways of modifying the document universe: some are well established, others less so, but we believe they are worthy of further study. More specifically, we introduce semantic posets as an appropriate characterisation for particular forms of modification.
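As a concrete illustration, here is a minimal Python sketch (not from the paper; the function names are ours) that builds the document events $e_1$ and $e_2$ of the example above as couples $(w, wf(w, d))$:

```python
# Minimal sketch: building document events as couples (w, wf(w, d))
# for each query word w, as in the example of section 2.

def wf(w: str, d: str) -> int:
    """Word frequency: number of occurrences of w in document d."""
    return d.lower().split().count(w)

def document_event(query: list[str], d: str) -> dict[str, int]:
    """A document event: the couples (w, wf(w, d)) for all query words w."""
    return {w: wf(w, d) for w in query}

query = ["information", "retrieval"]
doc1 = "information retrieval is no easy task"
doc2 = "speech is an information rich medium"

e1 = document_event(query, doc1)   # {'information': 1, 'retrieval': 1}
e2 = document_event(query, doc2)   # {'information': 1, 'retrieval': 0}
```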

3. MODIFICATIONS OF THE DOCUMENT UNIVERSE

    3.1. Compound Words

Context can change the meaning of words dramatically. One method of taking context into account is to treat a given sequence of words as an irreducible atomic semantic unit. The atomic terms in this approach are either individual words or multi-word sequences that are treated as explicit and undecomposable. Some proper names (e.g. New York) may be such compound words.

The compound word vocabulary can then be added to the single-word vocabulary (extracted from the document collection) to give the new atomic term vocabulary $V$.

Both the original queries and documents are segmented so that the longest possible sequence of words in $V$ is always preferred. For example, suppose the compound vocabulary consists of the sequences new-york-city and new-york; then the sentence New York City is in the state of New York but York is in the county of North Yorkshire produces new-york-city is in the state of new-york but york is in the county of north yorkshire.²
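A greedy longest-match segmenter of the kind described here could look as follows (a sketch under our own naming assumptions, not the authors' code):

```python
# Sketch: longest-match segmentation over a compound vocabulary.
# At each position the longest compound is preferred over shorter
# matches or single words.

def segment(words: list[str], compounds: set[tuple[str, ...]],
            max_len: int = 4) -> list[str]:
    """Greedily replace the longest matching word sequence by a compound term."""
    out, i = [], 0
    while i < len(words):
        for n in range(max_len, 1, -1):      # try longest sequences first
            cand = tuple(words[i:i + n])
            if cand in compounds:
                out.append("-".join(cand))   # e.g. new-york-city
                i += n
                break
        else:                                 # no compound matched here
            out.append(words[i])
            i += 1
    return out

compounds = {("new", "york", "city"), ("new", "york")}
sent = ("new york city is in the state of new york but york "
        "is in the county of north yorkshire").split()
print(" ".join(segment(sent, compounds)))
# new-york-city is in the state of new-york but york is in the county
# of north yorkshire
```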

A new document universe is now defined in a similar way as before, but with the notions of word and word frequency replaced by those of atomic term $at \in Q'$ and atomic term frequency $atf(at, d)$, where $Q'$ is the query formed from $V$.

These new atomic terms should act as better relevance predictors. For example, a document about New York is not likely to be relevant to a query about York and vice versa.

² As opposed to new york city …

Such compound words should only be used when there are no alternative ways of expressing the same concept using some or all of the constituent words. Thus information retrieval should not be a compound word, as we may have the alternative retrieval of information or simply retrieval alone.

    3.2. Removing Stop Words

Non-content words (e.g. the, at, with, do, …) are generally of no retrieval value [3]. Most IR systems define a set $S$ of these stop words and remove them from both the queries and the documents.

The new document universe is defined with a set of query atomic terms $at \in (Q'' = Q' \setminus S)$ and an atomic term frequency function:

$$atf(at, d) = \begin{cases} 0 & \forall\, at \in S \\ \text{number of occurrences of } at \text{ in } d & \forall\, at \notin S \end{cases}$$
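In code, this is a one-line modification of the earlier frequency function (a sketch; the stop list here is an illustrative subset, not the system's actual list):

```python
# Sketch: atomic term frequency atf(at, d) with stop-word removal.
S = {"the", "at", "with", "do", "is", "in", "of", "a"}  # illustrative stop list

def atf(at: str, d_terms: list[str]) -> int:
    """atf(at, d): 0 for stop words, occurrence count otherwise."""
    return 0 if at in S else d_terms.count(at)
```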

    3.3. Stemming

Stemming [4] allows the system to consider the words with a (real or assumed) common root as a unique semantic class. For an atomic term $at_i$, a corresponding set of atomic terms $st(at_i)$ exists which share the same stem (e.g. $st(\text{trains}) = \{\text{train, trainer, trained, training, trains}, \ldots\}$).

We define a term $t$ as a set of atomic terms. The term frequency $tf(t, d)$ is then defined as:

$$tf(t, d) = \sum_{at \in t} atf(at, d)$$

The corresponding events making up the document universe are therefore defined as:

$$e_i = \bigcup_{at \in Q''} \{(st(at),\; tf(st(at), d_i))\}$$

An example at this stage would look like:

Query: Trains in New York
1st doc.: There is a train in New York
2nd doc.: The trainer is training in New York

$st(\text{trains}) = \{\text{trainer, train, training}\}$
$st(\text{new-york}) = \{\text{new-york}\}$
$e_1 = \{(st(\text{trains}), 1), (st(\text{new-york}), 1)\}$
$e_2 = \{(st(\text{trains}), 2), (st(\text{new-york}), 1)\}$
$P(R = \text{yes} \mid D = e_1) < P(R = \text{yes} \mid D = e_2)$
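This example can be replayed with a short sketch (the stem classes are hard-coded for illustration; a real system would derive them with a stemmer such as Porter's [4]):

```python
# Sketch: term frequency tf(t, d) summed over a stem class t.
S = {"there", "is", "a", "in", "the"}                 # illustrative stop list

def atf(at: str, d_terms: list[str]) -> int:
    return 0 if at in S else d_terms.count(at)

def tf(t: set[str], d_terms: list[str]) -> int:
    """tf(t, d) = sum of atf(at, d) over the atomic terms at in t."""
    return sum(atf(at, d_terms) for at in t)

st_trains = {"train", "trainer", "trained", "training", "trains"}
st_new_york = {"new-york"}

d1 = "there is a train in new-york".split()
d2 = "the trainer is training in new-york".split()

assert tf(st_trains, d1) == 1 and tf(st_new_york, d1) == 1   # event e1
assert tf(st_trains, d2) == 2 and tf(st_new_york, d2) == 1   # event e2
```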

    3.4. Semantic Posets

It is also possible to use a list of equivalence classes of terms to allow more complex associations. We assume that the user's original query words refer to semantic units rather than just words. Therefore, words which share the same meaning should be considered as equivalent. A simple equivalence list can be used to process synonyms in a similar way to how the stemming procedure deals with the different forms of individual words.

In addition, we can also assume that if the user is interested in a general semantic entity, then s/he is also interested in more specific entities which are seen as part of it.

For instance, the word Europe may refer to the class containing the names of all European countries, regions and cities, whilst the word England only refers to the class containing the English county and city names.

Several attempts to deal with this particular kind of semantic structure have been described in the literature (e.g. [5, 6]). However, the corresponding experiments have shown very little improvement in IR performance.

We find it convenient to represent this behaviour of term classes by considering a semantic Partially Ordered Set (poset) [7] which contains the meaning $M(at_i)$ of each atomic term $at_i$. The equivalence relation for poset $P$, $=_P$, could be taken from a synonym thesaurus, and the strict partial ordering $<_P$ …

$$dl(d_j) = \sum_{w \in V} tf(w, d_j)$$

$$ndl(d_j) = \frac{dl(d_j) \cdot N}{\sum_{d \in \mathcal{D}} dl(d)}$$

where $V$ is the term vocabulary for the whole document collection $\mathcal{D}$; $K$ and $b$ are tuning constants and $pos(t_i)$ is the part-of-speech [12] weight of term $t_i$. A ranked list of documents is thus produced for each query by sorting the returned match scores in descending order. The POS weights were those used in the Cambridge TREC-7 SDR evaluation system:

Proper Noun            1.2
Common Noun            1.1
Adjectives & Adverbs   1.0
Verbs and the rest     0.9
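The length quantities defined above are straightforward to compute. A minimal sketch follows, showing $dl$ and $ndl$ together with the POS weight table; the combined weighting formula that uses $K$, $b$ and $pos(t_i)$ comes from the probabilistic model [2] and is not reproduced in this excerpt, so it is omitted here:

```python
# Sketch: document length dl(d) and normalised document length ndl(d),
# plus the part-of-speech weights from the table above.

POS_WEIGHT = {"proper_noun": 1.2, "common_noun": 1.1,
              "adj_adv": 1.0, "other": 0.9}

def dl(doc_tf: dict[str, int]) -> int:
    """dl(d_j) = sum over the vocabulary of tf(w, d_j)."""
    return sum(doc_tf.values())

def ndl(doc_tf: dict[str, int], collection: list[dict[str, int]]) -> float:
    """ndl(d_j) = dl(d_j) * N / sum over the collection of dl(d)."""
    n = len(collection)
    return dl(doc_tf) * n / sum(dl(d) for d in collection)
```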

4.3.2. Adding Geographic Semantic Posets (GP)

Location information is very common in requests in the broadcast news domain. Our first extension implements the expansion of geographic names occurring in the original query of the BL system into the list of their components, e.g.:

US → Arizona, …, Wisconsin, …, Atlanta, …, Washington-D.C., …

We manually built a semantic poset containing 484 names of continents, countries, states and major cities, extracted from a travel web server. The poset is represented by a semantic tree whose nodes are location names and edges are the contains relation. The process of using posets, described in section 3.4, is applied, creating a new index term for each sempos(at), with $at \in Q''$.
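A sketch of this expansion follows; the tree here is a tiny illustrative fragment (not the paper's 484-name poset), with nodes as location names and edges encoding the contains relation:

```python
# Sketch: expanding a query location into everything it contains,
# by walking a "contains" tree.

CONTAINS = {
    "US": ["Arizona", "Wisconsin", "Georgia", "Washington-D.C."],
    "Georgia": ["Atlanta"],
}

def sempos(location: str) -> list[str]:
    """All locations contained in `location`, found recursively."""
    expansion = []
    for child in CONTAINS.get(location, []):
        expansion.append(child)
        expansion.extend(sempos(child))
    return expansion

print(sempos("US"))
# ['Arizona', 'Wisconsin', 'Georgia', 'Atlanta', 'Washington-D.C.']
```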

4.3.3. Adding WordNet Hyponyms Posets (WP)

The previous approach can be generalised to every kind of term, provided that they only have one possible sense in the document file. We obtained a list of unambiguous nouns from WordNet 1.6 [13] and then, assuming that these words are actually unambiguous in the file and also in the query, generated the corresponding noun hyponym trees (the is-a relation). For instance, the query term disease is expanded into flu and malaria, but words like air (e.g. gas, or aviation, or manner) are ignored in this expansion process as they have more than one sense. In these experiments we do not consider WordNet compound words, their proper handling being much more complicated than in the geographic names domain.
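This kind of expansion can be sketched with NLTK's WordNet interface (an assumption on our part: NLTK ships WordNet 3.0 rather than the paper's WordNet 1.6, so the sense inventory and the resulting expansions will differ):

```python
# Sketch: hyponym expansion for unambiguous nouns, in the spirit of WP.

from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

def expand_noun(word: str) -> list[str]:
    """Return hyponym lemmas for `word` if it is an unambiguous noun."""
    synsets = wn.synsets(word, pos=wn.NOUN)
    if len(synsets) != 1:          # ambiguous (e.g. "air"): skip expansion
        return []
    hyponyms = synsets[0].closure(lambda s: s.hyponyms())  # is-a tree
    return sorted({lemma for s in hyponyms for lemma in s.lemma_names()
                   if "_" not in lemma})  # ignore WordNet compound words
```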

4.3.4. Adding Parallel Blind Relevance Feedback (PBRF)

We assembled a parallel corpus of 18628 documents from a manually transcribed broadcast news collection covering January to May 1997, thus not overlapping the TREC-7 collection recording time.⁴ Parallel Blind Relevance Feedback, as described in section 3.5, was then applied on this document collection.

The five terms which obtained the highest Offer Weight were appended to the query. The Offer Weight of a term $t_i$ is:

$$ow(t_i) = r \log \frac{(r + 0.5)(N - n - B + r + 0.5)}{(n - r + 0.5)(B - r + 0.5)}$$

where $B$ is the number of top documents which are assumed relevant, $r$ the number of assumed relevant documents in which at least one $at \in t_i$ occurs, $n$ the total number of documents in which at least one $at \in t_i$ occurs, and $N$ the total number of documents. For these experiments we used $B = 15$. The terms added to the query were then given a weighting equal to their Offer Weight, in a similar way to the part-of-speech weighting described in section 4.3.1.
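A direct transcription of this formula (assumption: the natural logarithm, since the paper does not state the base):

```python
# Sketch: Offer Weight ow(t) for relevance-feedback term selection.

import math

def offer_weight(r: int, n: int, big_n: int, b: int = 15) -> float:
    """ow(t) = r * log[ (r+0.5)(N-n-B+r+0.5) / ((n-r+0.5)(B-r+0.5)) ]

    r     : assumed-relevant documents containing the term
    n     : total documents containing the term
    big_n : total number of documents (N)
    b     : top documents assumed relevant (B); the paper uses B = 15
    """
    return r * math.log((r + 0.5) * (big_n - n - b + r + 0.5)
                        / ((n - r + 0.5) * (b - r + 0.5)))

# e.g. a term in 8 of the 15 assumed-relevant documents, 200 of 18628 overall:
print(offer_weight(r=8, n=200, big_n=18628))
```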

4.3.5. Adding Blind Relevance Feedback (BRF)

The BRF process was also applied to the actual TREC-7 corpus. This time, only one term was added from the top five documents that were retrieved for each PBRF-expanded query.

    5. RESULTS

The five systems described in section 4.3 were evaluated on the TREC-7 SDR test data. An illustration of the complete expansion process is given in Table 3.

what diseases are frequent in Britain ?

BL     disease frequent Britain
+GP    disease frequent {Britain, UK, …, Cambridge}
+WP    {disease, flu, …, malaria} frequent {Britain, UK, …, Cambridge}
+PBRF  + cold Blair rheumatism queen
+BRF   + endemic

Table 3: Illustration of the query expansion process

The results in terms of Average Precision, and also Precision at 5 documents retrieved, for both the manual and HTK transcriptions, are given in Table 4.

We can see that improving the IR system produced a combined relative gain of 17% in Average Precision on the automatic transcriptions (lines 1 and 5 of Table 4). Lines 6-8 of Table 4 show the results we obtain when each expansion technique in turn is omitted.

⁴ This corpus is a subset of the Primary Source Media broadcast news transcriptions used to train the language model of our speech recognition system.

                 Manual           HTK
                 AvP     P-5      AvP     P-5
1 = BL           49.11   57.39    47.30   56.52
2 = 1 + GP       51.55   60.00    49.77   58.26
3 = 2 + WP       52.33   60.00    50.75   58.26
4 = 3 + PBRF     53.59   64.35    51.73   64.35
5 = 4 + BRF      55.88   60.87    55.08   64.35
6 = 5 - PBRF     53.54   60.00    52.56   59.13
7 = 5 - WP       54.40   60.00    54.20   63.48
8 = 5 - GP       54.95   59.13    53.56   60.87

Table 4: Average Precision (AvP) and Precision at a 5 documents cut-off (P-5) on the TREC-7 test collection (results in %)

It confirms that each expansion device is necessary to reach the final performance of line 5, but that very good results can be obtained by using only Blind Relevance Feedback techniques.

It is worth noting that, for each level of IR performance, recognition errors do not affect the results by more than 4% relative, and by no more than 2% relative for the final system.

Assuming that the manual transcriptions place an upper bound on performance, these experiments suggest that the adaptation of IR to ASR (e.g. using word lattices) would be less profitable than future improvements in IR techniques.

    6. CONCLUSION

In this paper we have confirmed that, for a small document file, the degradation of Average Precision due to a WER of 25% is relatively small (2%). We have shown that several devices which modify the document representation each improve the system performance slightly, and that the new methods based on semantic posets might be successfully combined with Blind Relevance Feedback, and are therefore worthy of further study.

More generally, we have shown that pure information retrieval techniques can improve Average Precision by 17% relative on the TREC-7 SDR task. We hope to repeat these experiments when larger document collections become available, in order to evaluate the scalability of our techniques.

    7. REFERENCES

[1] S. E. Johnson, P. Jourlin, G. L. Moore, K. Spärck Jones, and P. C. Woodland, "Spoken document retrieval for TREC-7 at Cambridge University," to appear in Proc. TREC-7, Gaithersburg, MD, 1999.

[2] K. Spärck Jones, S. Walker, and S. E. Robertson, "A probabilistic model of information retrieval: development and status," TR 446, Cambridge University Computer Laboratory, September 1998.

[3] C. Fox, "Lexical analysis and stoplists," in Information Retrieval: Data Structures and Algorithms (W. B. Frakes and R. Baeza-Yates, eds.), ch. 7, pp. 102–130, Prentice Hall, 1992.

[4] M. F. Porter, "An algorithm for suffix stripping," Program, no. 14, pp. 130–137, 1980.

[5] G. Salton and M. E. Lesk, "Computer evaluation of indexing and text processing," in The SMART Retrieval System: Experiments in Automatic Document Processing (G. Salton, ed.), pp. 143–180, Prentice Hall, 1971.

[6] E. M. Voorhees, "Query expansion using lexical-semantic relations," in Proceedings of the 17th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, pp. 61–69, 1994.

[7] B. Dushnik and E. W. Miller, "Partially ordered sets," American Journal of Mathematics, no. 63, pp. 600–610, 1941.

[8] P. C. Woodland, T. Hain, S. E. Johnson, T. R. Niesler, A. Tuerk, E. W. D. Whittaker, and S. J. Young, "The 1997 HTK broadcast news transcription system," in Proc. DARPA Broadcast News Transcription and Understanding Workshop, pp. 41–48, 1998.

[9] T. Hain, S. E. Johnson, A. Tuerk, P. C. Woodland, and S. J. Young, "Segment generation and clustering in the HTK broadcast news transcription system," in Proc. 1998 DARPA Broadcast News Transcription and Understanding Workshop, pp. 133–137, 1998.

[10] S. E. Johnson and P. C. Woodland, "Speaker clustering using direct maximisation of the MLLR-adapted likelihood," in Proc. ICSLP 98, vol. 5, pp. 1775–1779, 1998.

[11] S. E. Johnson, P. Jourlin, G. L. Moore, K. Spärck Jones, and P. C. Woodland, "The Cambridge University spoken document retrieval system," in Proc. ICASSP 99, vol. 1, pp. 49–52, 1999.

[12] S. F. Knight, personal communication, 1998.

[13] C. Fellbaum, WordNet: An Electronic Lexical Database. MIT Press, 1998. ISBN 0-262-06197-X.