CSM06 Information Retrieval
CSM06 Information Retrieval
Lecture 1b – IR Basics
Dr Andrew Salway [email protected]
Requirements for IR systems
When developing or evaluating an IR system the first considerations are…
– Who are the users of information retrieval systems? General public or specialist researchers?
– What kinds of information do they want to retrieve? Text, image, audio or video? General or specialist information?
Information Access Process
Most uses of an information retrieval system can be characterised by this generic process
(1) Start with an information need
(2) Select a system / collections to search
(3) Formulate a query
(4) Send query
(5) Receive results (i.e. information items)
(6) Scan, evaluate, interpret results
(7) Reformulate query and go to (4) OR stop
From Baeza-Yates and Ribeiro-Neto (1999), p. 263
NB. When doing IR on the web a user can browse away from the results returned in step 5 – this may change the process
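The seven steps above can be sketched as a loop. The toy collection, the substring matching rule, and the stopping condition below are all illustrative assumptions, not part of any real system:

```python
# Toy sketch of the generic information access process; documents,
# matching rule, and stopping test are assumptions for illustration.
docs = ["victorian urban population", "industrial revolution england", "french cuisine"]

def search(query_terms):
    # Steps (4)-(5): send the query, receive matching items
    return [d for d in docs if all(t in d for t in query_terms)]

query = ["industrial", "population"]      # (3) formulate a query
results = []
for _ in range(3):                        # bounded loop over steps (4)-(7)
    results = search(query)
    if results:                           # (6) scan / evaluate: stop once anything matches
        break
    query = query[:-1]                    # (7) reformulate: drop the last term

print(results)  # → ['industrial revolution england']
```

Note how the first query matches nothing, so step (7) reformulates it before the second attempt succeeds.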
Information Need Query
Verbal queries
• Single-word queries: a list of words
• Context queries: phrase (“ ”); proximity (NEAR)
• Boolean queries: use AND, OR, BUT
• Natural Language: from a sentence to a whole text
Information Need Query
EXERCISE
Tina is a user of an information retrieval system who is researching how the industrial revolution affected the urban population in Victorian England.
– How could her information need be expressed with the different types of query described above?
– What are the advantages / disadvantages of each query type?
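To make the Boolean query type concrete, here is a minimal sketch of AND / BUT matching over a toy collection; the documents and the helper name are assumptions for illustration:

```python
# Boolean retrieval sketch: AND over required terms, BUT (AND NOT) over excluded ones.
docs = {
    1: "the industrial revolution transformed urban life in victorian england",
    2: "rural population decline in nineteenth century france",
    3: "population growth in victorian cities during the industrial revolution",
}

def boolean_match(text, all_of=(), none_of=()):
    words = set(text.split())
    return all(t in words for t in all_of) and not any(t in words for t in none_of)

# industrial AND revolution AND population BUT france
hits = [d for d, t in docs.items()
        if boolean_match(t, all_of=("industrial", "revolution", "population"),
                         none_of=("france",))]
print(hits)  # → [3]
```

Only document 3 contains all three required terms, so the strict AND semantics return exactly one hit.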
“Ad-hoc” Retrieval Problem
The ad-hoc retrieval problem is commonly faced by IR systems, especially web search engines. It takes the form “return information items on topic t” – where t is a string of one or more terms
characterising a user’s information need.
• For large collections this needs to happen automatically.
• Note there is not a fixed list of topics!
• So, the IR system should return documents relevant to the query.
• Ideally it will rank the documents in order of relevance, so the user sees the most relevant first.
Generic Architecture of a Text IR System
• Based on Baeza-Yates and Ribeiro-Neto (1999), Modern Information Retrieval, Figure 1.3, p.10.
[Figure components: User Interface; Text Operations; Query Operations; Indexing; Searching; Ranking; INDEX; Text Database]
IR compared with data retrieval, and knowledge retrieval
• Data Retrieval, e.g. SQL query to well-structured database – if data is stored you get exactly what you want
• Information Retrieval – returns information items from an unstructured source; the user must still interpret them
• Knowledge Retrieval (see current Information Extraction technology) – answers specific questions by analysing an unstructured information source, e.g. a user could ask “What is the capital of France?” and the system would answer “Paris” by ‘reading’ a book about France
How Good is an IR System?
• We need ways to measure how good an IR system is, i.e. evaluation metrics
• Systems should return relevant information items (texts, images, etc); systems may rank the items in order of relevance
Two ways to measure the performance of an IR system:
Precision = “how many of the retrieved items are relevant?”
Recall = “how many of the items that should have been retrieved were retrieved?”
• These should be objective measures.
• Both require humans to make decisions about which documents are relevant for a given query.
Calculating Precision and Recall
R = number of documents in collection relevant to topic t
A(t) = number of documents returned by system in response to query t
C = number of ‘correct’ (relevant) documents returned, i.e. the intersection of R and A(t)
PRECISION = (C / A(t)) * 100
RECALL = (C / R) * 100
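The two formulas translate directly into code; the sample counts in the example are hypothetical:

```python
def precision(C, A):
    # Proportion of returned documents that are relevant, as a percentage
    return 100 * C / A if A else 0.0

def recall(C, R):
    # Proportion of relevant documents that were returned, as a percentage
    return 100 * C / R if R else 0.0

# Hypothetical query: 8 of the 10 returned documents are relevant,
# out of 20 relevant documents in the whole collection.
print(precision(8, 10))  # → 80.0
print(recall(8, 20))     # → 40.0
```

Note that a system can trivially achieve 100% recall by returning every document, which is why the two metrics are always considered together.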
EXERCISE
• Amanda and Alex each need to choose an information retrieval system.
Amanda works for an intelligence agency, so getting all possible information about a topic is important for the users of her system. Alex works for a newspaper, so getting some relevant information quickly is more important for the journalists using his system.
• See below for statistics for two information retrieval systems (Search4Facts and InfoULike) when they were used to retrieve documents from the same document collection in response to the same query: there were 100,000 documents in the collection, of which 50 were relevant to the given query. Which system would you advise Amanda to choose and which would you advise Alex to choose? Your decisions should be based on the evaluation metrics of precision and recall.
Search4Facts
• Number of Relevant Documents Returned = 12
• Total Number of Documents Returned = 15
InfoULike
• Number of Relevant Documents Returned = 48
• Total Number of Documents Returned = 295
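As a numerical check, the precision and recall formulas can be applied to these statistics (R = 50 relevant documents in the collection):

```python
R = 50  # relevant documents in the collection for the given query
systems = {"Search4Facts": (12, 15), "InfoULike": (48, 295)}

for name, (C, A) in systems.items():
    print(f"{name}: precision = {100 * C / A:.1f}%, recall = {100 * C / R:.1f}%")
# Search4Facts: precision = 80.0%, recall = 24.0%
# InfoULike: precision = 16.3%, recall = 96.0%
```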
Precision and Recall: refinements
• May plot graphs of P against R for single queries (see Belew 2000, Table 4.2 and Figs. 4.10 and 4.11)
• These graphs are unstable for single queries so may need to combine P/R curves for multiple queries…
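One way to obtain the P/R points for a single query is to walk down the ranked result list, recording precision at each rank where a relevant item appears. The relevance judgements and collection size below are hypothetical:

```python
# Each entry says whether the result at that rank is relevant (assumed judgements).
ranked_relevance = [True, False, True, True, False]
total_relevant = 4  # R: relevant documents in the whole collection (assumed)

hits = 0
points = []  # (recall, precision) at each rank where a relevant item appears
for rank, relevant in enumerate(ranked_relevance, start=1):
    if relevant:
        hits += 1
        points.append((hits / total_relevant, hits / rank))

print(points)  # → [(0.25, 1.0), (0.5, 0.6666666666666666), (0.75, 0.75)]
```

Plotting these (recall, precision) pairs gives one of the sawtooth curves that Belew's figures show; averaging such curves over many queries smooths out the instability noted above.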
Reference Collections: TREC
• A reference collection comprises a set of documents and a set of queries for which all relevant documents have been identified: size is important!!
• TREC = Text REtrieval Conference
• TREC Collection = 6GB of text (millions of documents – mainly news related) !!
• See http://trec.nist.gov/
Reference Collections: Cystic Fibrosis Collection
• Cystic Fibrosis collection comprises 1239 documents from the National Library of Medicine’s MEDLINE database + 100 information requests with relevant documents + four relevance scores (0-2) from four experts
• Available for download: http://www.dcc.ufmg.br/irbook/cfc.html
Cystic Fibrosis Collection: example document

PN 74001
RN 00001
AN 75051687
AU Hoiby-N. Jorgensen-B-A. Lykkegaard-E. Weeke-B.
TI Pseudomonas aeruginosa infection in cystic fibrosis.
SO Acta-Paediatr-Scand. 1974 Nov. 63(6). P 843-8.
MJ CYSTIC-FIBROSIS: co. PSEUDOMONAS-AERUGINOSA: im.
MN ADOLESCENCE. BLOOD-PROTEINS: me. CHILD. CHILD
AB The significance of Pseudomonas aeruginosa infection in the respiratory tract of 9 cystic fibrosis patients have been studied…. … The results indicate no protective value of the many precipitins on the tissue of the respiratory tract.
RF 001 BELFRAGE S ACTA MED SCAND SUPPL 173 5 963
   002 COOMBS RRA IN: GELL PGH 317 964
CT 1 HOIBY N SCAND J RESPIR DIS 56 38 975
   2 HOIBY N ACTA PATH MICROBIOL SCAND (C)83 459 975
CF Collection: example query and details of relevant documents
QN 00001
QU What are the effects of calcium on the physical properties of mucus from CF patients?
NR 00034
RD 139 1222  151 2211  166 0001  311 0001  370 1010  392 0001  439 0001  440 0011  441 2122  454 0100  461 1121  502 0002  503 1000  505 0001

139 = document number
1222 = expert 1 scored it relevance ‘1’, experts 2-4 scored it relevance ‘2’.
Further Reading
• See Belew (2000), pages 119-128
• See also Belew CD for reference corpus (and lots more!)
Basic Concepts of IR: recap
After this lecture, you should be able to explain and discuss:
• Information access process; ‘ad-hoc’ retrieval
• User information need; query; IR vs. data retrieval / knowledge retrieval; retrieval vs. browsing
• Relevance; Ranking
• Evaluation metrics – Precision and Recall
Set Reading
• To prepare for next week’s lecture, you should look at:
• Weiss et al (2005), handout – especially sections 1.4, 2.3, 2.4 and 2.5
• Belew, R. K. (2000), pages: 50-58
Further Reading
For more about the IR basics in today’s lecture see introductions in:
Belew, R. K. (2000); Baeza-Yates and Ribeiro-Neto (1999), pages 1-9; or Kowalski and Maybury (2000).
Further Reading
• To keep up-to-date with web search engine developments, see
www.searchenginewatch.com
• I will put links to some online articles about recent developments in web search technologies on the module web page