CSM06 Information Retrieval
CSM06 Information Retrieval
Lecture 1b – IR Basics
Dr Andrew Salway [email protected]
Requirements for IR systems
When developing or evaluating an IR system the first considerations are…
– Who are the users of information retrieval systems? General public or specialist researchers?
– What kinds of information do they want to retrieve? Text, image, audio or video? General or specialist information?
Information Access Process
Most uses of an information retrieval system can be characterised by this generic process
(1) Start with an information need
(2) Select a system / collections to search
(3) Formulate a query
(4) Send query
(5) Receive results (i.e. information items)
(6) Scan, evaluate, interpret results
(7) Reformulate query and go to (4) OR stop
From Baeza-Yates and Ribeiro-Neto (1999), p. 263
NB. When doing IR on the web a user can browse away from the results returned in step 5 – this may change the process
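The seven steps above can be sketched as a loop. The toy collection, the substring matching rule, and the stopping condition below are all illustrative assumptions, not part of any real system:

```python
# Toy sketch of the generic information access process; documents,
# matching rule, and stopping test are assumptions for illustration.
docs = ["victorian urban population", "industrial revolution england", "french cuisine"]

def search(query_terms):
    # Steps (4)-(5): send the query, receive matching items
    return [d for d in docs if all(t in d for t in query_terms)]

query = ["industrial", "population"]      # (3) formulate a query
results = []
for _ in range(3):                        # bounded loop over steps (4)-(7)
    results = search(query)
    if results:                           # (6) scan / evaluate: stop once anything matches
        break
    query = query[:-1]                    # (7) reformulate: drop the last term

print(results)  # → ['industrial revolution england']
```

Note how the first query matches nothing, so step (7) reformulates it before the second attempt succeeds.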
Information Need Query
Verbal queries
• Single-word queries: a list of words
• Context queries: phrase (“ ”); proximity (NEAR)
• Boolean queries: use AND, OR, BUT
• Natural Language: from a sentence to a whole text
Information Need Query
EXERCISE
Tina is a user of an information retrieval system who is researching how the industrial revolution affected the urban population in Victorian England.
– How could her information need be expressed with the different types of query described above?
– What are the advantages / disadvantages of each query type?
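To make the Boolean query type concrete, here is a minimal sketch of AND / BUT matching over a toy collection; the documents and the helper name are assumptions for illustration:

```python
# Boolean retrieval sketch: AND over required terms, BUT (AND NOT) over excluded ones.
docs = {
    1: "the industrial revolution transformed urban life in victorian england",
    2: "rural population decline in nineteenth century france",
    3: "population growth in victorian cities during the industrial revolution",
}

def boolean_match(text, all_of=(), none_of=()):
    words = set(text.split())
    return all(t in words for t in all_of) and not any(t in words for t in none_of)

# industrial AND revolution AND population BUT france
hits = [d for d, t in docs.items()
        if boolean_match(t, all_of=("industrial", "revolution", "population"),
                         none_of=("france",))]
print(hits)  # → [3]
```

Only document 3 contains all three required terms, so the strict AND semantics return exactly one hit.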
“Ad-hoc” Retrieval Problem
The ad-hoc retrieval problem is commonly faced by IR systems, especially web search engines. It takes the form “return information items on topic t” – where t is a string of one or more terms
characterising a user’s information need.
• For large collections this needs to happen automatically.
• Note there is not a fixed list of topics!
• So, the IR system should return documents relevant to the query.
• Ideally it will rank the documents in order of relevance, so the user sees the most relevant first.
Generic Architecture of a Text IR System
• Based on Baeza-Yates and Ribeiro-Neto (1999), Modern Information Retrieval, Figure 1.3, p.10.
[Figure components: User Interface; Text Operations; Query Operations; Indexing; Searching; Ranking; INDEX; Text Database]
IR compared with data retrieval, and knowledge retrieval
• Data Retrieval, e.g. SQL query to well-structured database – if data is stored you get exactly what you want
• Information Retrieval – returns information items from an unstructured source; the user must still interpret them
• Knowledge Retrieval (see current Information Extraction technology) – answers specific questions by analysing an unstructured information source, e.g. a user could ask “What is the capital of France?” and the system would answer “Paris” by ‘reading’ a book about France
How Good is an IR System?
• We need ways to measure how good an IR system is, i.e. evaluation metrics
• Systems should return relevant information items (texts, images, etc); systems may rank the items in order of relevance
Two ways to measure the performance of an IR system:
Precision = “how many of the retrieved items are relevant?”
Recall = “how many of the items that should have been retrieved were retrieved?”
• These should be objective measures.
• Both require humans to make decisions about which documents are relevant for a given query.
Calculating Precision and Recall
R = number of documents in collection relevant to topic t
A(t) = number of documents returned by system in response to query t
C = number of ‘correct’ (relevant) documents returned, i.e. the intersection of R and A(t)
PRECISION = (C / A(t)) * 100
RECALL = (C / R) * 100
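The two formulas translate directly into code; the sample counts in the example are hypothetical:

```python
def precision(C, A):
    # Proportion of returned documents that are relevant, as a percentage
    return 100 * C / A if A else 0.0

def recall(C, R):
    # Proportion of relevant documents that were returned, as a percentage
    return 100 * C / R if R else 0.0

# Hypothetical query: 8 of the 10 returned documents are relevant,
# out of 20 relevant documents in the whole collection.
print(precision(8, 10))  # → 80.0
print(recall(8, 20))     # → 40.0
```

Note that a system can trivially achieve 100% recall by returning every document, which is why the two metrics are always considered together.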
EXERCISE
• Amanda and Alex each need to choose an information retrieval system.
Amanda works for an intelligence agency, so getting all possible information about a topic is important for the users of her system. Alex works for a newspaper, so getting some relevant information quickly is more important for the journalists using his system.
• See below for statistics for two information retrieval systems (Search4Facts and InfoULike) when they were used to retrieve documents from the same document collection in response to the same query: there were 100,000 documents in the collection, of which 50 were relevant to the given query. Which system would you advise Amanda to choose and which would you advise Alex to choose? Your decisions should be based on the evaluation metrics of precision and recall.
Search4Facts
• Number of Relevant Documents Returned = 12
• Total Number of Documents Returned = 15
InfoULike
• Number of Relevant Documents Returned = 48
• Total Number of Documents Returned = 295
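As a numerical check, the precision and recall formulas can be applied to these statistics (R = 50 relevant documents in the collection):

```python
R = 50  # relevant documents in the collection for the given query
systems = {"Search4Facts": (12, 15), "InfoULike": (48, 295)}

for name, (C, A) in systems.items():
    print(f"{name}: precision = {100 * C / A:.1f}%, recall = {100 * C / R:.1f}%")
# Search4Facts: precision = 80.0%, recall = 24.0%
# InfoULike: precision = 16.3%, recall = 96.0%
```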
Precision and Recall: refinements
• May plot graphs of P against R for single queries (see Belew 2000, Table 4.2 and Figs. 4.10 and 4.11)
• These graphs are unstable for single queries so may need to combine P/R curves for multiple queries…
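One way to obtain the P/R points for a single query is to walk down the ranked result list, recording precision at each rank where a relevant item appears. The relevance judgements and collection size below are hypothetical:

```python
# Each entry says whether the result at that rank is relevant (assumed judgements).
ranked_relevance = [True, False, True, True, False]
total_relevant = 4  # R: relevant documents in the whole collection (assumed)

hits = 0
points = []  # (recall, precision) at each rank where a relevant item appears
for rank, relevant in enumerate(ranked_relevance, start=1):
    if relevant:
        hits += 1
        points.append((hits / total_relevant, hits / rank))

print(points)  # → [(0.25, 1.0), (0.5, 0.6666666666666666), (0.75, 0.75)]
```

Plotting these (recall, precision) pairs gives one of the sawtooth curves that Belew's figures show; averaging such curves over many queries smooths out the instability noted above.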
Reference Collections: TREC
• A reference collection comprises a set of documents and a set of queries for which all relevant documents have been identified: size is important!!
• TREC = Text REtrieval Conference
• TREC Collection = 6GB of text (millions of documents – mainly news related) !!
• See http://trec.nist.gov/
Reference Collections: Cystic Fibrosis Collection
• Cystic Fibrosis collection comprises 1239 documents from the National Library of Medicine’s MEDLINE database + 100 information requests with relevant documents + four relevance scores (0-2) from four experts
• Available for download: http://www.dcc.ufmg.br/irbook/cfc.html
Cystic Fibrosis Collection: example document

PN 74001
RN 00001
AN 75051687
AU Hoiby-N. Jorgensen-B-A. Lykkegaard-E. Weeke-B.
TI Pseudomonas aeruginosa infection in cystic fibrosis.
SO Acta-Paediatr-Scand. 1974 Nov. 63(6). P 843-8.
MJ CYSTIC-FIBROSIS: co. PSEUDOMONAS-AERUGINOSA: im.
MN ADOLESCENCE. BLOOD-PROTEINS: me. CHILD. CHILD
AB The significance of Pseudomonas aeruginosa infection in the respiratory tract of 9 cystic fibrosis patients have been studied…. … The results indicate no protective value of the many precipitins on the tissue of the respiratory tract.
RF 001 BELFRAGE S ACTA MED SCAND SUPPL 173 5 963
   002 COOMBS RRA IN: GELL PGH 317 964
CT 1 HOIBY N SCAND J RESPIR DIS 56 38 975
   2 HOIBY N ACTA PATH MICROBIOL SCAND (C)83 459 975
CF Collection: example query and details of relevant documents
QN 00001
QU What are the effects of calcium on the physical properties of mucus from CF patients?
NR 00034
RD 139 1222  151 2211  166 0001  311 0001  370 1010  392 0001  439 0001  440 0011  441 2122  454 0100  461 1121  502 0002  503 1000  505 0001

139 = document number
1222 = expert 1 scored it relevance ‘1’, experts 2-4 scored it relevance ‘2’.
Further Reading
• See Belew (2000), pages 119-128
• See also Belew CD for reference corpus (and lots more!)
Basic Concepts of IR: recap
After this lecture, you should be able to explain and discuss:
• Information access process; ‘ad-hoc’ retrieval
• User information need; query; IR vs. data retrieval / knowledge retrieval; retrieval vs. browsing
• Relevance; Ranking
• Evaluation metrics – Precision and Recall
Set Reading
• To prepare for next week’s lecture, you should look at:
• Weiss et al (2005), handout – especially sections 1.4, 2.3, 2.4 and 2.5
• Belew, R. K. (2000), pages: 50-58
Further Reading
For more about the IR basics in today’s lecture see introductions in:
Belew, R. K. (2000); Baeza-Yates and Ribeiro-Neto (1999), pages 1-9; or Kowalski and Maybury (2000).
Further Reading
• To keep up-to-date with web search engine developments, see
www.searchenginewatch.com
• I will put links to some online articles about recent developments in web search technologies on the module web page