
Page 1: Nlp

Artificial intelligence & natural language processing

Mark Sanderson

Porto, 2000

Page 2: Nlp

Aims

• To provide an outline of the attempts made at using NLP techniques in IR

Page 3: Nlp

Objectives

• At the end of this lecture you will be able to
– Outline a range of attempts to get NLP to work with IR systems
– Idly speculate on why they failed
– Describe the successful use of NLP in a limited domain

Page 4: Nlp

Why?

• Seems an obvious area of investigation
– Why isn’t it working?

Page 5: Nlp

Use of NLP

• Syntactic
– Parsing to identify phrases
– Full syntactic structure comparison

• Semantic
– Building an understanding of a document’s content

• Discourse
– Exploiting document structure?

Page 6: Nlp

Syntactic

• Parsing to identify phrases
– The issues.
– Explain how it’s done (a bit).
– Is it worth it?

• Other possibilities
– Grammatical tagging
– Full syntactic structure comparison
• Explain how it’s done (a little bit).
• Show results.

Page 7: Nlp

Simple phrase identification

• High frequency terms could be good candidates.
– Why?

• Terms co-occurring more often than chance (a sketch follows below).
– Within a small number of words.
– Surrounding simple terms.
– Not separated by punctuation.
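As a concrete illustration of the co-occurrence idea, here is a minimal sketch, not Fagan’s actual procedure: it counts word pairs that appear within a small window, never across punctuation, and scores them against what chance would predict. The function name, window size and threshold are made up for illustration.

    from collections import Counter
    import math
    import re

    def candidate_phrases(docs, window=3, min_count=5):
        """Score word pairs that co-occur within `window` words more often
        than chance (a rough pointwise-mutual-information score)."""
        word_counts, pair_counts, total = Counter(), Counter(), 0
        for doc in docs:
            # Split on punctuation so candidate pairs never straddle it.
            for segment in re.split(r"[.,;:!?()]", doc.lower()):
                words = segment.split()
                word_counts.update(words)
                total += len(words)
                for i, w in enumerate(words):
                    for v in words[i + 1 : i + window]:
                        pair_counts[(w, v)] += 1
        scored = []
        for (w, v), c in pair_counts.items():
            if c < min_count:
                continue
            expected = word_counts[w] * word_counts[v] / total   # chance rate
            scored.append(((w, v), math.log(c / expected)))
        return sorted(scored, key=lambda item: -item[1])

Pairs that are frequent and score well become candidate statistical phrases; the bullets above correspond to the window, the restriction to simple terms, and the punctuation split.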

Page 8: Nlp

Problems

• Close words that aren’t phrases.
• “the use of computers in science & technology”

• Distant words that are phrases.
• “preparation & evaluation of abstracts and extracts”

Page 9: Nlp

Parsing for phrases

• Using parsers to identify noun phrases.

• Make a phrase out of a head and the head of its modifiers.

“automatic analysis of scientific text”

(Parse-tree diagram: automatic/ADJ analysis/NOUN of/PREP scientific/ADJ text/NOUN, with NP and PP brackets.)
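The head-plus-modifier rule can be sketched with an off-the-shelf dependency parser. This is not the parser used in the experiments cited later, just an illustration; it assumes spaCy and its small English model are installed, and the function name is invented.

    import spacy

    nlp = spacy.load("en_core_web_sm")   # assumed to be installed

    def head_modifier_phrases(text):
        """Pair each noun head with the heads of its modifiers: adjectives,
        noun compounds, and the noun inside an attached 'of' phrase."""
        phrases = []
        for token in nlp(text):
            if token.pos_ != "NOUN":
                continue
            for child in token.children:
                if child.dep_ in ("amod", "compound"):
                    phrases.append((child.lemma_, token.lemma_))
                elif child.dep_ == "prep":                    # e.g. "analysis of ..."
                    phrases.extend((token.lemma_, obj.lemma_)
                                   for obj in child.children
                                   if obj.dep_ == "pobj" and obj.pos_ == "NOUN")
        return phrases

    print(head_modifier_phrases("automatic analysis of scientific text"))
    # roughly: [('automatic', 'analysis'), ('analysis', 'text'), ('scientific', 'text')]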

Page 10: Nlp

Errors

• Not a perfect rule by any means.
– Need restrictions to eliminate bogus phrases.

“automatic analysis of these four scientific texts”

(Parse-tree diagram: automatic/ADJ analysis/NOUN of/PREP these/DET four/QUANT scientific/ADJ texts/NOUN, with NP and PP brackets.)

Page 11: Nlp

Do they work?

• Fagan compared statistical with syntactic phrases; statistics won, just.
– J. Fagan (1987) Experiments in phrase indexing for document retrieval: a comparison of syntactic & non-syntactic methods, TR 87-868, Department of Computer Science, Cornell University

• More research has been conducted since.
– T. Strzalkowski (1995) Natural language information retrieval, in Information Processing & Management, Vol. 31, No. 3, pp 397-417

Page 12: Nlp

Check out TREC

• Overview of the Seventh Text REtrieval Conference (TREC-7), E. Voorhees, D. Harman (National Institute of Standards and Technology)
– http://trec.nist.gov/
– Ad hoc track
• Fairly even between statistical phrases, syntactic phrases and no phrases.

Page 13: Nlp

Grammatical tagging?

• Tag document text with grammatical codes? (sketched below)
– R. Garside (1987). The CLAWS word tagging system, in The computational analysis of English: a corpus-based approach, R. Garside, G. Leech, G. Sampson Eds., Longman: 30-41.

• Doesn’t appear to work
– R. Sacks-Davis, P. Wallis, R. Wilkinson (1990). Using syntactic analysis in a document retrieval system that uses signature files, in Proceedings of the 13th ACM SIGIR Conference: 179-191.
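CLAWS itself is not something you can casually run here, but the idea of attaching grammatical codes to text can be shown with NLTK’s Penn Treebank tagger (a different tag set from CLAWS); this assumes NLTK is installed and its usual one-off resources have been downloaded.

    import nltk
    # One-off resource downloads:
    # nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

    tokens = nltk.word_tokenize("preparation and evaluation of abstracts and extracts")
    print(nltk.pos_tag(tokens))
    # e.g. [('preparation', 'NN'), ('and', 'CC'), ('evaluation', 'NN'), ('of', 'IN'), ...]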

Page 14: Nlp

Syntactic structure comparison

• Has been tried…
– A. F. Smeaton & P. Sheridan (1991) Using morpho-syntactic language analysis in phrase matching, in Proceedings of RIAO ‘91, Pages 414-429

• Method (a rough sketch follows below)
– Parse sentences into tree structures
– When you get a phrase match
• Look at the linking syntactic operator.
• Look at the residual tree structure that didn’t match.

• Does not appear to work
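Smeaton & Sheridan’s actual matching is more involved; the sketch below only illustrates the general idea of comparing structures rather than bags of words. It reduces each parse to head-relation-dependent triples, gives full credit to exact matches and partial credit when the same word pair is linked by a different syntactic operator. spaCy is assumed again and the weights are arbitrary.

    import spacy

    nlp = spacy.load("en_core_web_sm")

    def dep_triples(text):
        """Reduce a dependency parse to (head, relation, dependent) triples."""
        return {(t.head.lemma_, t.dep_, t.lemma_) for t in nlp(text) if t.dep_ != "ROOT"}

    def structure_score(query, sentence):
        """Exact triple matches score 1; word pairs linked by a different
        syntactic operator score 0.5 (illustrative weights only)."""
        q, s = dep_triples(query), dep_triples(sentence)
        exact = len(q & s)
        partial = len({(h, d) for h, _, d in q - s} & {(h, d) for h, _, d in s - q})
        return exact + 0.5 * partial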

Page 15: Nlp

Semantic

• Disambiguation
– Given a word appearing in a certain context, a disambiguator will tell you which sense it is being used in.

• IR system (a sketch follows below)
– Index document collections by senses rather than words
– Ask the users which senses their query words are in
– Retrieve on senses
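A minimal sketch of “index by senses rather than words”, using NLTK’s WordNet and the Lesk disambiguator; the experiments cited on the next slide used different disambiguators and collections, so this is only an illustration.

    from nltk import word_tokenize
    from nltk.wsd import lesk
    # One-off resource downloads: nltk.download("punkt"); nltk.download("wordnet")

    def sense_index_terms(text):
        """Replace each word with a WordNet synset name where Lesk finds one,
        otherwise keep the word itself; documents and queries are then
        matched on these sense identifiers instead of raw words."""
        tokens = word_tokenize(text.lower())
        terms = []
        for word in tokens:
            synset = lesk(tokens, word)       # context = the whole sentence
            terms.append(synset.name() if synset else word)
        return terms

    print(sense_index_terms("the bank approved the loan"))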

Page 16: Nlp

Disambiguation

• Does it work?
– No (well, maybe)

• M. Sanderson, Word sense disambiguation and information retrieval, in Proceedings of the 17th ACM SIGIR Conference, Pages 142-151, 1994

• M. Sanderson & C.J. van Rijsbergen, The impact on retrieval effectiveness of skewed frequency distributions, in ACM Transactions on Information Systems (TOIS) Vol. 17 No. 4, 1999, Pages 440-465.

Page 17: Nlp

Partial conclusions

• NLP has yet to prove itself in IR
– Agree
– D.D. Lewis & K. Sparck-Jones (1996) Natural language processing for information retrieval, in Communications of the ACM (CACM), Vol. 39, No. 1, 92-101
– Sort of don’t agree
– A. Smeaton (1992) Progress in the application of natural language processing to information retrieval tasks, in The Computer Journal, Vol. 35, No. 3.

Page 18: Nlp

Mark’s idle speculation

• What people always think is going on

(Diagram: boxes labelled “Keywords” and “NLP”.)

Page 19: Nlp

Mark’s idle speculation

• What’s usually actually going on

(Diagram: boxes labelled “Keywords” and “NLP”.)

Page 20: Nlp

Areas where NLP does work

• Systems with the following ingredients.
– Collection documents cover a small domain.
– Language use is limited in some manner.
– User queries cover a tight subject area.
– Documents/queries are very short
• Image captions
– LSI, pseudo-relevance feedback

– People willing to spend money getting NLP to work

Page 21: Nlp

RIME & IOTA

• From Grenoble
– Y. Chiaramella & J. Nie (1990) A retrieval model based on an extended modal logic and its application to the RIME experimental approach, in Proceedings of the 13th SIGIR conference, Pages 25-43

• Medical record retrieval system

• Some database’y parts

• Free text descriptions of cases

Page 22: Nlp

Indexing

• “an opacity affecting probably the lung and the trachea”

(Indexing tree: a {[p], SGN} node over an {[and], SGN} node joining two {[bears-on], SGN} relations, one linking {[opacity], SGN} to {[lung], LOC}, the other linking {[opacity], SGN} to {[trachea], LOC}.)

LOC - localisation
SGN - observed sign
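The tree can be written down in plain Python to make the later matching sketches concrete. This is only a hypothetical rendering of the slide’s structure, not RIME’s actual representation; each node is a (term, role, children) triple.

    # Hypothetical rendering of the indexing tree: (term, role, children).
    # Roles: "SGN" = observed sign, "LOC" = localisation; "p" = the "probably" operator.
    opacity_lung = ("bears-on", "SGN",
                    [("opacity", "SGN", []), ("lung", "LOC", [])])
    opacity_trachea = ("bears-on", "SGN",
                       [("opacity", "SGN", []), ("trachea", "LOC", [])])
    case_index = ("p", "SGN",
                  [("and", "SGN", [opacity_lung, opacity_trachea])])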

Page 23: Nlp

Retrieval

• How do we match a user’s query to these structures?
– Using transformations - a bit like logic.

(Diagram: a {[bears-on], SGN} node linking {[opacity], SGN} and {[lung], LOC} is transformed into the two separate terms {[opacity], SGN} and {[lung], LOC}, each carrying an uncertainty t.)

t - uncertainty
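The transformation idea can be mimicked on those (term, role, children) triples: breaking structure apart costs a factor t per step, so terms that sit deep in the tree contribute with less certainty. This is an illustrative sketch under that reading, not the extended modal logic of the paper; the factor 0.8 is arbitrary.

    T = 0.8   # illustrative uncertainty factor per transformation step

    def leaves_with_uncertainty(node, certainty=1.0):
        """Decompose an indexing tree into (term, role, certainty) leaves,
        paying a factor of T for every structural node that is broken apart."""
        term, role, children = node
        if not children:
            return [(term, role, certainty)]
        leaves = []
        for child in children:
            leaves.extend(leaves_with_uncertainty(child, certainty * T))
        return leaves

    def match(query_tree, doc_tree):
        """Certainty-weighted overlap between the decomposed query and document."""
        doc = {(t, r): c for t, r, c in leaves_with_uncertainty(doc_tree)}
        return sum(c * doc.get((t, r), 0.0)
                   for t, r, c in leaves_with_uncertainty(query_tree))

    # e.g. match(("bears-on", "SGN", [("opacity", "SGN", []), ("lung", "LOC", [])]), case_index)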

Page 24: Nlp

Tree transformation

(Diagram: a document tree in which {[bears-on], SGN} links {[opacity], SGN} and {[lung], LOC}, and a {[has-for-value], SGN} branch attaches {[contour], SGN} and {[blurred], LOC}, is transformed into a reduced tree rooted at {[opacity], SGN} whose {[has-for-value], SGN} branch carries the uncertainty t.)

Page 25: Nlp

Term transforms

• Basic medical terms stored in a hierarchy (a sketch follows below).
– Transformations possible, again with uncertainty added.

Level 1    Level 2    Level 3
tumour     cancer     sarcoma
           hygroma
           kyste      polykystosis
                      pseudokyst
           polyp      polyposis
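A sketch of the hierarchy transform, using the table above as a child-to-parent map; the placement of terms follows the reconstructed table and the uncertainty value is arbitrary, so treat both as assumptions.

    # Child -> parent links taken from the (reconstructed) table above.
    PARENT = {
        "cancer": "tumour", "hygroma": "tumour", "kyste": "tumour", "polyp": "tumour",
        "sarcoma": "cancer", "polykystosis": "kyste", "pseudokyst": "kyste",
        "polyposis": "polyp",
    }
    T = 0.8   # illustrative uncertainty per level of generalisation

    def generalisations(term):
        """Yield (ancestor, certainty), discounting by T per level, so a query
        for 'tumour' can still match a case indexed with 'sarcoma'."""
        certainty = 1.0
        while term in PARENT:
            term, certainty = PARENT[term], certainty * T
            yield term, certainty

    print(list(generalisations("sarcoma")))   # [('cancer', 0.8), ('tumour', 0.64...)]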

Page 26: Nlp

Isn’t this a bit slow?

• Yes

• Optimisation (a two-stage sketch follows below)
– Scan for potential documents.
– Process them intensively.

• Evaluation?
– Not in that paper.
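The optimisation amounts to the familiar two-stage pattern: a cheap keyword scan to find candidates, then the expensive structural match only on that shortlist. A minimal sketch under that reading; the document fields and helper names are made up, and `match` stands for a structural matcher like the one sketched earlier.

    def two_stage_search(query_terms, structural_query, collection, match, top_k=10):
        """Stage 1: keep only documents sharing a keyword with the query.
        Stage 2: run the expensive structural matcher on that shortlist."""
        candidates = [doc for doc in collection
                      if query_terms & set(doc["keywords"])]
        scored = [(match(structural_query, doc["tree"]), doc) for doc in candidates]
        return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]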

Page 27: Nlp

Not unique

• SCISOR
– P.S. Jacobs & L.F. Rau (1990) SCISOR: Extracting Information from On-line News, in Communications of the ACM (CACM), Vol. 33, No. 11, 88-97

Page 28: Nlp

Why do they work?

• Because of the restrictions
– Small subject domain.
– Limited vocabulary.
– Restricted type of question.

• Compare with a large-scale IR system.
– Keywords are good enough.
– Long time to set up.
– Hard to adapt to a new domain.

Page 29: Nlp

Anything else for NLP?

• Text generation
– An IR system explaining itself?

Page 30: Nlp

Conclusions

• By now, you will be able to
– Outline a range of attempts to get NLP to work with IR systems
– Idly speculate on why they failed
– Describe the successful use of NLP in a limited domain