ISP 433/533 Week 2 IR Models. Outline IR defined IR tasks IR processes Boolean model Break Vector...
Date posted: 19-Dec-2015
![Page 1: ISP 433/533 Week 2 IR Models. Outline IR defined IR tasks IR processes Boolean model Break Vector space model Probabilistic model.](https://reader035.fdocuments.in/reader035/viewer/2022062421/56649d375503460f94a0f63d/html5/thumbnails/1.jpg)
ISP 433/533 Week 2
IR Models
![Page 2: ISP 433/533 Week 2 IR Models. Outline IR defined IR tasks IR processes Boolean model Break Vector space model Probabilistic model.](https://reader035.fdocuments.in/reader035/viewer/2022062421/56649d375503460f94a0f63d/html5/thumbnails/2.jpg)
Outline
• IR defined
• IR tasks
• IR processes
• Boolean model
• Break
• Vector space model
• Probabilistic model
![Page 3: ISP 433/533 Week 2 IR Models. Outline IR defined IR tasks IR processes Boolean model Break Vector space model Probabilistic model.](https://reader035.fdocuments.in/reader035/viewer/2022062421/56649d375503460f94a0f63d/html5/thumbnails/3.jpg)
User Information Needs
• Goal of IR
• Hard Problem
– People have different and highly varied needs for information
– People often do not know what they want, or may not be able to express it in a usable form
![Page 4: ISP 433/533 Week 2 IR Models. Outline IR defined IR tasks IR processes Boolean model Break Vector space model Probabilistic model.](https://reader035.fdocuments.in/reader035/viewer/2022062421/56649d375503460f94a0f63d/html5/thumbnails/4.jpg)
Some Definitions of IR
Salton (1989): “Information-retrieval systems process files of records and requests for information, and identify and retrieve from the files certain records in response to the information requests. The retrieval of particular records depends on the similarity between the records and the queries, which in turn is measured by comparing the values of certain attributes to records and information requests.”
Kowalski (1997): “An Information Retrieval System is a system that is capable of storage, retrieval, and maintenance of information. Information in this context can be composed of text (including numeric and date data), images, audio, video, and other multi-media objects.”
![Page 5: ISP 433/533 Week 2 IR Models. Outline IR defined IR tasks IR processes Boolean model Break Vector space model Probabilistic model.](https://reader035.fdocuments.in/reader035/viewer/2022062421/56649d375503460f94a0f63d/html5/thumbnails/5.jpg)
Examples of IR
• Conventional (library catalog). Search by keyword, title, author, etc.
• Text-based (Lexis-Nexis, Google, FAST). Search by keywords. Limited search using queries in natural language.
• Multimedia (QBIC, WebSeek, SaFe). Search by visual appearance (shapes, colors, …).
• Question answering systems (AskJeeves, NSIR, Answerbus). Search in (restricted) natural language.
![Page 6: ISP 433/533 Week 2 IR Models. Outline IR defined IR tasks IR processes Boolean model Break Vector space model Probabilistic model.](https://reader035.fdocuments.in/reader035/viewer/2022062421/56649d375503460f94a0f63d/html5/thumbnails/6.jpg)
Key Terms Used in IR
• QUERY: a representation of what the user is looking for - can be a list of words or a phrase.
• DOCUMENT: an information entity that the user wants to retrieve
• COLLECTION: a set of documents
• INDEX: a representation of information that makes querying easier
• TERM: word or concept that appears in a document or a query
• RANKING: an ordering of the documents retrieved that (hopefully) reflects the relevance of the documents to the user query
![Page 7: ISP 433/533 Week 2 IR Models. Outline IR defined IR tasks IR processes Boolean model Break Vector space model Probabilistic model.](https://reader035.fdocuments.in/reader035/viewer/2022062421/56649d375503460f94a0f63d/html5/thumbnails/7.jpg)
Basic IR Process
[Diagram labels: Information Need, query, Docs, doc, Index Terms, match, Ranking]
![Page 8: ISP 433/533 Week 2 IR Models. Outline IR defined IR tasks IR processes Boolean model Break Vector space model Probabilistic model.](https://reader035.fdocuments.in/reader035/viewer/2022062421/56649d375503460f94a0f63d/html5/thumbnails/8.jpg)
IR Task – ad hoc
[Diagram: queries Q1, Q2, Q3, Q4, Q5 posed against a relatively stable collection]
![Page 9: ISP 433/533 Week 2 IR Models. Outline IR defined IR tasks IR processes Boolean model Break Vector space model Probabilistic model.](https://reader035.fdocuments.in/reader035/viewer/2022062421/56649d375503460f94a0f63d/html5/thumbnails/9.jpg)
IR Task - filtering
[Diagram: a documents stream is matched against User 1 Profile and User 2 Profile, giving docs for User 1 and docs filtered for User 2]
![Page 10: ISP 433/533 Week 2 IR Models. Outline IR defined IR tasks IR processes Boolean model Break Vector space model Probabilistic model.](https://reader035.fdocuments.in/reader035/viewer/2022062421/56649d375503460f94a0f63d/html5/thumbnails/10.jpg)
Process of IR
[Diagram components: User Interface, Text operations, Query operations, Indexing, Searching, Ranking, DB Man., Index, Text DB]
![Page 11: ISP 433/533 Week 2 IR Models. Outline IR defined IR tasks IR processes Boolean model Break Vector space model Probabilistic model.](https://reader035.fdocuments.in/reader035/viewer/2022062421/56649d375503460f94a0f63d/html5/thumbnails/11.jpg)
Document Process Steps
![Page 12: ISP 433/533 Week 2 IR Models. Outline IR defined IR tasks IR processes Boolean model Break Vector space model Probabilistic model.](https://reader035.fdocuments.in/reader035/viewer/2022062421/56649d375503460f94a0f63d/html5/thumbnails/12.jpg)
Classic IR models
• Each document is represented by a set of representative keywords or index terms
– Not all terms are equally useful for representing the document contents: less frequent terms allow identifying a narrower set of documents
• Let
– ki be an index term
– dj be a document
– wij be a weight associated with (ki,dj)
• The weight wij quantifies the importance of the index term for describing the document contents
![Page 13: ISP 433/533 Week 2 IR Models. Outline IR defined IR tasks IR processes Boolean model Break Vector space model Probabilistic model.](https://reader035.fdocuments.in/reader035/viewer/2022062421/56649d375503460f94a0f63d/html5/thumbnails/13.jpg)
Boolean Model
• Simple model based on set theory
• Queries specified as Boolean expressions
– precise semantics
– neat formalism using Boolean logic
– E.g., Queryx = ka ∧ (kb ∨ ¬kc)
• Terms are either present or absent. Thus, wij ∈ {0,1}
![Page 14: ISP 433/533 Week 2 IR Models. Outline IR defined IR tasks IR processes Boolean model Break Vector space model Probabilistic model.](https://reader035.fdocuments.in/reader035/viewer/2022062421/56649d375503460f94a0f63d/html5/thumbnails/14.jpg)
Boolean Logic
• Named after logician/mathematician George Boole
• Logical Connectives: AND, OR, NOT
– WARNING!
• INSPIRED BY, BUT NOT THE SAME AS, USUAL ENGLISH USAGE
AND: “Each thing must satisfy ALL conditions”
OR: “Each thing must satisfy at least one condition”
NOT: “Each thing must NOT satisfy the given condition”
![Page 15: ISP 433/533 Week 2 IR Models. Outline IR defined IR tasks IR processes Boolean model Break Vector space model Probabilistic model.](https://reader035.fdocuments.in/reader035/viewer/2022062421/56649d375503460f94a0f63d/html5/thumbnails/15.jpg)
Logical AND (∧)
(Set Intersection)
A ∧ B is the set of things in common, i.e., in both sets A and B
[Venn diagram: A = Aged, B = Blind; A ∧ B = aged, blind people]
![Page 16: ISP 433/533 Week 2 IR Models. Outline IR defined IR tasks IR processes Boolean model Break Vector space model Probabilistic model.](https://reader035.fdocuments.in/reader035/viewer/2022062421/56649d375503460f94a0f63d/html5/thumbnails/16.jpg)
Logical OR (∨)
(Set Union)
A ∨ B is the set of things in either A, B or both
[Venn diagram: A = Aged, B = Blind; A ∨ B = people that are either aged or blind or both]
![Page 17: ISP 433/533 Week 2 IR Models. Outline IR defined IR tasks IR processes Boolean model Break Vector space model Probabilistic model.](https://reader035.fdocuments.in/reader035/viewer/2022062421/56649d375503460f94a0f63d/html5/thumbnails/17.jpg)
Logical NOT (¬)
(Set Complement)
¬B is the set of things outside the set B
[Venn diagram: B = Blind; ¬B = people who aren’t blind]
![Page 18: ISP 433/533 Week 2 IR Models. Outline IR defined IR tasks IR processes Boolean model Break Vector space model Probabilistic model.](https://reader035.fdocuments.in/reader035/viewer/2022062421/56649d375503460f94a0f63d/html5/thumbnails/18.jpg)
Example Combination
• A ∧ (¬B) (old people who aren’t blind)
[Venn diagram: A = Aged, B = Blind, showing ¬B and A ∧ (¬B)]
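The three connectives and the combination above map directly onto set operations. The following is only a small sketch in Python: the people in `everyone`, `aged` and `blind` are invented for illustration, and NOT is taken as the complement relative to an assumed universe `everyone`.

```python
# Hypothetical data: these names are made up purely for illustration.
everyone = {"ann", "bob", "cho", "dev", "eve"}  # assumed universe
aged = {"ann", "bob", "cho"}                    # set A
blind = {"bob", "dev"}                          # set B

aged_and_blind = aged & blind        # AND -> set intersection
aged_or_blind = aged | blind         # OR  -> set union
not_blind = everyone - blind         # NOT -> complement within `everyone`
aged_not_blind = aged & not_blind    # A AND (NOT B)

print(sorted(aged_and_blind))
print(sorted(aged_or_blind))
print(sorted(not_blind))
print(sorted(aged_not_blind))
```

Note that NOT only makes sense once a universe is fixed; in retrieval that universe is normally the whole collection.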
![Page 19: ISP 433/533 Week 2 IR Models. Outline IR defined IR tasks IR processes Boolean model Break Vector space model Probabilistic model.](https://reader035.fdocuments.in/reader035/viewer/2022062421/56649d375503460f94a0f63d/html5/thumbnails/19.jpg)
Exercise
• D1 = “computer information retrieval”
• D2 = “computer retrieval”
• D3 = “information”
• D4 = “computer information”
• Q1 = “information retrieval”
• Q2 = “information computer”
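One way to work through this exercise is sketched below in Python. The slide does not say whether the query terms are joined by AND or OR, so both readings are computed. The helper names `build_index`, `match_all` and `match_any` are illustrative and do not come from the course material.

```python
from collections import defaultdict

docs = {
    "D1": "computer information retrieval",
    "D2": "computer retrieval",
    "D3": "information",
    "D4": "computer information",
}

def build_index(docs):
    """Map each term to the set of ids of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)
    return index

def match_all(index, terms):
    """AND reading: documents that contain every query term."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()

def match_any(index, terms):
    """OR reading: documents that contain at least one query term."""
    return set().union(*(index.get(term, set()) for term in terms))

index = build_index(docs)
q1 = "information retrieval".split()
q2 = "information computer".split()

q1_and, q1_or = match_all(index, q1), match_any(index, q1)
q2_and, q2_or = match_all(index, q2), match_any(index, q2)

print("Q1 AND:", sorted(q1_and), "| Q1 OR:", sorted(q1_or))
print("Q2 AND:", sorted(q2_and), "| Q2 OR:", sorted(q2_or))
```

Either way the result is an unordered set: a document matches or it does not, with no ranking among the matches.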
![Page 20: ISP 433/533 Week 2 IR Models. Outline IR defined IR tasks IR processes Boolean model Break Vector space model Probabilistic model.](https://reader035.fdocuments.in/reader035/viewer/2022062421/56649d375503460f94a0f63d/html5/thumbnails/20.jpg)
Drawbacks of the Boolean Model
• Retrieval based on binary decision criteria with no notion of partial matching
• No ranking of the documents is provided (absence of a grading scale)
• Information need has to be translated into a Boolean expression, which most users find awkward
• The Boolean queries formulated by the users are most often too simplistic
• As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query
• BREAK
![Page 21: ISP 433/533 Week 2 IR Models. Outline IR defined IR tasks IR processes Boolean model Break Vector space model Probabilistic model.](https://reader035.fdocuments.in/reader035/viewer/2022062421/56649d375503460f94a0f63d/html5/thumbnails/21.jpg)
Vector Model
• Non-binary weights provide consideration for partial matches
• These term weights are used to compute a degree of similarity between a query and each document
• Ranked set of documents provides for better matching
![Page 22: ISP 433/533 Week 2 IR Models. Outline IR defined IR tasks IR processes Boolean model Break Vector space model Probabilistic model.](https://reader035.fdocuments.in/reader035/viewer/2022062421/56649d375503460f94a0f63d/html5/thumbnails/22.jpg)
Vector Space
• Assume terms are independent of one another and each term defines a dimension
• T-dimensional space, where T is the number of terms
• In this space, queries and documents are represented as weighted vectors
– Weight wiq >= 0 associated with the pair (ki,q)
– vec(dj) = (w1j, w2j, ..., wTj)
– vec(q) = (w1q, w2q, ..., wTq)
![Page 23: ISP 433/533 Week 2 IR Models. Outline IR defined IR tasks IR processes Boolean model Break Vector space model Probabilistic model.](https://reader035.fdocuments.in/reader035/viewer/2022062421/56649d375503460f94a0f63d/html5/thumbnails/23.jpg)
Example Vector Space using term frequency
• D1 = “computer information retrieval”
• D2 = “computer retrieval”
• Q1 = “information, retrieval”
[Diagram: 3-dimensional space with axes computer, information, retrieval; D1 = (1, 1, 1), D2 = (1, 0, 1), Q1 = (0, 1, 1)]
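The vectors in this example can be derived mechanically by counting term occurrences. The Python sketch below does that under two simplifying assumptions: the tokenizer is deliberately naive (lowercase, drop commas, split on whitespace), and the vocabulary is built from these three texts and sorted alphabetically, which happens to give the axis order used on the slide. `tokenize` and `tf_vector` are illustrative names.

```python
from collections import Counter

def tokenize(text):
    """Naive tokenizer: lowercase, drop commas, split on whitespace."""
    return text.lower().replace(",", " ").split()

def tf_vector(text, vocabulary):
    """Term-frequency vector: one raw count per vocabulary term."""
    counts = Counter(tokenize(text))
    return tuple(counts[term] for term in vocabulary)

texts = {
    "D1": "computer information retrieval",
    "D2": "computer retrieval",
    "Q1": "information, retrieval",
}

# One dimension per distinct term, in a fixed (alphabetical) order.
vocabulary = sorted({term for text in texts.values() for term in tokenize(text)})
vectors = {name: tf_vector(text, vocabulary) for name, text in texts.items()}

print(vocabulary)
for name, vector in vectors.items():
    print(name, vector)
```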
![Page 24: ISP 433/533 Week 2 IR Models. Outline IR defined IR tasks IR processes Boolean model Break Vector space model Probabilistic model.](https://reader035.fdocuments.in/reader035/viewer/2022062421/56649d375503460f94a0f63d/html5/thumbnails/24.jpg)
Similarity Measure
• Sim(q,dj) = cos(θ) = [vec(dj) · vec(q)] / (|dj| * |q|) = [Σi wij * wiq] / (|dj| * |q|)
• Since wij >= 0 and wiq >= 0, 0 <= sim(q,dj) <= 1
[Diagram: vectors dj and q, separated by the angle θ, plotted against axes i and j]
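The cosine measure can be sketched in a few lines of Python. The vectors are the ones from the term-frequency example (D1, D2, Q1). The function name `cosine_similarity` and the choice to return 0.0 for an all-zero vector are assumptions made here, not part of the slides.

```python
import math

def cosine_similarity(d, q):
    """Cosine of the angle between a document vector d and a query vector q."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # assumed convention: a zero vector matches nothing
    return dot / (norm_d * norm_q)

d1 = (1, 1, 1)  # "computer information retrieval"
d2 = (1, 0, 1)  # "computer retrieval"
q1 = (0, 1, 1)  # "information, retrieval"

scores = {"D1": cosine_similarity(d1, q1), "D2": cosine_similarity(d2, q1)}
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    print(name, round(scores[name], 4))
```

Worked by hand, sim(Q1, D1) = 2 / (√3 · √2) ≈ 0.816 and sim(Q1, D2) = 1 / (√2 · √2) = 0.5, so D1 ranks above D2.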
![Page 25: ISP 433/533 Week 2 IR Models. Outline IR defined IR tasks IR processes Boolean model Break Vector space model Probabilistic model.](https://reader035.fdocuments.in/reader035/viewer/2022062421/56649d375503460f94a0f63d/html5/thumbnails/25.jpg)
Exercise
• D1 = “computer information retrieval”
• D2 = “computer retrieval”
• Q1 = “information, retrieval”
• Given the above query, rank the relevance of the above two documents using vector model
![Page 26: ISP 433/533 Week 2 IR Models. Outline IR defined IR tasks IR processes Boolean model Break Vector space model Probabilistic model.](https://reader035.fdocuments.in/reader035/viewer/2022062421/56649d375503460f94a0f63d/html5/thumbnails/26.jpg)
Pro and Con of Vector model
• Advantages:
– term-weighting improves quality of the answer set
– partial matching allows retrieval of docs that approximate the query conditions
– cosine ranking formula sorts documents according to degree of similarity to the query
• Disadvantages:
– assumes independence of index terms (??); not clear that this is bad though
![Page 27: ISP 433/533 Week 2 IR Models. Outline IR defined IR tasks IR processes Boolean model Break Vector space model Probabilistic model.](https://reader035.fdocuments.in/reader035/viewer/2022062421/56649d375503460f94a0f63d/html5/thumbnails/27.jpg)
Probabilistic Model
• Given a user query, there is an ideal answer set
• Querying as specification of the properties of this ideal answer set (clustering)
• But, what are these properties?
• Guess at the beginning what they could be (i.e., guess initial description of ideal answer set)
• Improve by iteration
![Page 28: ISP 433/533 Week 2 IR Models. Outline IR defined IR tasks IR processes Boolean model Break Vector space model Probabilistic model.](https://reader035.fdocuments.in/reader035/viewer/2022062421/56649d375503460f94a0f63d/html5/thumbnails/28.jpg)
Probabilistic Ranking Principle
• Given a user query q and a document dj, the probabilistic model tries to estimate the probability that the user will find the document dj relevant
• sim(q, dj ) = P(dj relevant-to q) / P(dj non-relevant-to q)
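The ratio above is an odds value. A minimal Python sketch follows, assuming that P(dj non-relevant-to q) = 1 − P(dj relevant-to q) and that some estimate of the relevance probability is already available. The probabilities in `estimates` are invented for illustration, and the slides do not say how such estimates are obtained.

```python
def relevance_odds(p_relevant):
    """Odds P(relevant) / P(non-relevant), taking P(non-relevant) = 1 - P(relevant)."""
    if not 0.0 <= p_relevant < 1.0:
        raise ValueError("p_relevant must lie in [0, 1)")
    return p_relevant / (1.0 - p_relevant)

# Hypothetical relevance estimates for three documents and one query.
estimates = {"d1": 0.8, "d2": 0.5, "d3": 0.2}

scores = {doc: relevance_odds(p) for doc, p in estimates.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for doc in ranking:
    print(doc, round(scores[doc], 3))
```

Because the odds grow monotonically with the probability, ranking by this ratio gives the same order as ranking by the estimated probability of relevance itself.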
![Page 29: ISP 433/533 Week 2 IR Models. Outline IR defined IR tasks IR processes Boolean model Break Vector space model Probabilistic model.](https://reader035.fdocuments.in/reader035/viewer/2022062421/56649d375503460f94a0f63d/html5/thumbnails/29.jpg)
Performance of Probabilistic Model
• Salton and Buckley did a series of experiments that indicate that, in general, the vector model outperforms the probabilistic model with general collections
• This seems also to be the view of the research community