Lecture 1: Overview of IR
Transcript of Lecture 1: Overview of IR
Maya Ramanath
Who hasn’t used Google?
• Why did Google return these results first?
• Can we improve on it?
• Is this a good result for the query “maya ramanath”?
• OR: How good is Google?
Lectures
• Overview (this lecture)
• Retrieval Models
• Retrieval Evaluation
• Why DB and IR?
Information Retrieval
• “An information retrieval system does not inform (i.e. change the knowledge of) the user on the subject of his inquiry. It merely informs on the existence (or non-existence) and whereabouts of documents relating to his request.”
• “Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).”
Basic Terms

| Term | Definition |
| --- | --- |
| Document | A sequence/set of terms, expressing ideas about one or more topics, usually in natural language |
| Corpus/Collection | A set of documents |
| Information need | Corresponds to an innate idea of information/knowledge that the user is currently looking for |
| Term/Keyword/Phrase | A semantic unit: a word, a phrase, or potentially the root of a word |
| Query | The expression of the information need by the user |
| Relevance | A measure of how well the retrieved documents satisfy the user’s information need |
What is a retrieval system?
Source: Hiemstra, D. (2009) Information Retrieval Models, in Information Retrieval: Searching in the 21st Century (eds A. Göker and J. Davies), John Wiley & Sons, Ltd, Chichester, UK.
Retrieval Models

Source and Further Reading: Hiemstra, D. (2009) Information Retrieval Models, in Information Retrieval: Searching in the 21st Century (eds A. Göker and J. Davies), John Wiley & Sons, Ltd, Chichester, UK.
2 kinds of models
• No Ranking
– Boolean models
– Region models
• Ranking
– Vector space model
– Probabilistic models
– Language models
Boolean Model
• Based on set theory
• Simple query language

Ex: information AND (retrieval OR management)

(Venn diagram of the three term sets: information, retrieval, management)
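The example query above can be evaluated directly with set operations over an inverted index. Below is a minimal sketch; the three-document corpus is invented for illustration.

```python
# Boolean retrieval over a toy corpus (documents are invented examples).
docs = {
    "d1": "information retrieval is fun",
    "d2": "database management systems",
    "d3": "information management in practice",
}

# Build an inverted index: term -> set of documents containing it.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

# Evaluate: information AND (retrieval OR management)
result = index["information"] & (index["retrieval"] | index["management"])
print(sorted(result))  # ['d1', 'd3']
```

AND maps to set intersection and OR to set union, which is why the Boolean model returns an unranked set: a document either satisfies the formula or it does not.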
Vector Space Model (1/2)
• Based on the notion of “similarity” between query and document
– Query is the representation of the document that you want to retrieve
– Compare similarity between query and document
• Luhn’s formulation: “The more two representations agreed in given elements and their distribution, the higher would be the probability of their representing similar information.”
Vector Space Model (2/2)

(diagram: query and document represented as vectors in term space)

We will study more in the next lecture
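The idea of comparing the two vectors can be sketched with term-frequency vectors and cosine similarity, one common instantiation of the model. The toy document and query are invented for illustration.

```python
# Cosine similarity between term-frequency vectors (toy example).
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    # Dot product over shared terms, normalised by the vector lengths.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

doc = Counter("information retrieval and information management".split())
query = Counter("information retrieval".split())
print(round(cosine(query, doc), 3))  # 0.802
```

Unlike the Boolean model, this similarity score induces a ranking: documents closer to the query vector come first. Real systems usually weight terms by tf-idf rather than raw counts.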
Probabilistic IR (1/2)
• Based on probability theory
– Specifically, we would like to estimate the probability of relevance

The Probability Ranking Principle: “If a reference retrieval system’s response to each request is a ranking of the documents in the collections in order of decreasing probability of usefulness to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data has been made available to the system for this purpose, then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data.”
Probabilistic IR (2/2)
Ranking of documents based on the odds of relevance: O(R | d) = P(R | d) / P(¬R | d)
We will study more in the next lecture
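Odds-based ranking can be sketched in the style of the binary independence model: each query term contributes a log-odds weight, and documents are sorted by the sum. All probabilities below are invented for illustration; in practice they must be estimated from relevance data.

```python
# Odds-of-relevance ranking, binary-independence style (toy numbers).
import math

p = {"information": 0.8, "retrieval": 0.6}  # assumed P(term | relevant)
q = {"information": 0.3, "retrieval": 0.1}  # assumed P(term | not relevant)

def rsv(doc_terms):
    # Retrieval status value: sum of log odds ratios for query terms
    # present in the document (absent terms contribute nothing here).
    return sum(math.log((p[t] * (1 - q[t])) / (q[t] * (1 - p[t])))
               for t in doc_terms if t in p)

docs = {"d1": {"information", "retrieval"}, "d2": {"information"}}
ranking = sorted(docs, key=lambda d: rsv(docs[d]), reverse=True)
print(ranking)  # ['d1', 'd2']
```

Sorting by this score is a monotone transform of sorting by the odds of relevance, which is what the Probability Ranking Principle asks for.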
Language Models (1/3)
• Based on generative models for documents and queries
• Documents, Query: samples of an underlying probabilistic process
• Estimate the parameters of this process
• Measure how close the distributions are (KL-divergence)
– “Closeness” gives a measure of relevance
Language Models (2/3)

(diagram: documents d1, d2 and the query q, each modelled by its own probability distribution)
Language Models (3/3)
The Maximum Likelihood Estimator (+ smoothing):

$$P(w_i \mid D) = \frac{\#(w_i)}{\sum_{w} \#(w)}$$
We will study more in the next lecture
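The maximum likelihood estimate plus smoothing can be sketched as follows. Jelinek-Mercer interpolation with the collection model is used here as one common smoothing choice; the lambda value and toy texts are invented for illustration.

```python
# MLE document language model with Jelinek-Mercer smoothing (toy data).
from collections import Counter

def mle(text):
    # Maximum likelihood estimate: P(w | text) = #(w) / sum over all #(w)
    counts = Counter(text.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

doc_model = mle("information retrieval and information management")
coll_model = mle("information retrieval management database systems and")

lam = 0.8  # assumed interpolation weight

def p_smoothed(w):
    # Mix the document model with the collection model so that words
    # unseen in the document do not get zero probability.
    return lam * doc_model.get(w, 0.0) + (1 - lam) * coll_model.get(w, 0.0)

print(p_smoothed("information"))  # seen in the document
print(p_smoothed("database"))     # unseen in the document, still > 0
```

Without smoothing, a single query word absent from a document would drive its query likelihood to zero, which is why some form of smoothing is essential in practice.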
Evaluation
(Which system is best?)
Benchmarking IR Systems (1/2)
• Why do we need to benchmark?
• To benchmark an IR system:
– Efficiency
– Quality
• Results
• Power of interface
• Ease of use, etc.
Benchmarking IR Systems (2/2)
Result Quality
• Data Collection
– Ex: Archives of the NYTimes
• Query set
– Provided by experts, identified from real search logs, etc.
• Relevance judgements
– For a given query, is the document relevant?
Precision, Recall, F-Measure
• Precision = |relevant ∩ retrieved| / |retrieved|
• Recall = |relevant ∩ retrieved| / |relevant|
• F-Measure: weighted harmonic mean of Precision and Recall
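These three measures can be computed directly from the retrieved and relevant sets. The sets below are invented for illustration; beta = 1 gives the usual F1, which weights precision and recall equally.

```python
# Set-based precision, recall, and F-measure (toy judgement data).
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d5"}

hits = len(retrieved & relevant)
precision = hits / len(retrieved)  # 2 / 4 = 0.5
recall = hits / len(relevant)      # 2 / 3

# Weighted harmonic mean; beta = 1 is the balanced F1 measure.
beta = 1.0
f = ((1 + beta**2) * precision * recall) / (beta**2 * precision + recall)
print(precision, recall, round(f, 3))  # 0.5 0.666... 0.571
```

Precision penalises returning junk, recall penalises missing relevant documents, and the harmonic mean keeps F low unless both are reasonably high.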
That’s it for today!