Information Retrieval: Introduction
Norbert Fuhr
April 28, 2014
Information Retrieval Applications
Application Examples; Facets of Search
Application Examples
Web Search
Product Search in Online Shops
Intranet Search
Searching in Digital Libraries
Multimedia Search
Facets of Search
Language
Example: Cross-lingual search in Google
Structure
Example: XML retrieval
Media
Example: Similarity search for images
Objects
Example: People search in 123people
Static/dynamic content
Example: Twitter search
Facets of Search
Language: monolingual, cross-lingual, multilingual
Structure: atomic, fields, tree structure (e.g. XML), graph (e.g. Web)
Media: text, facts, images, audio, video, 3D, ...
Objects: products, people, companies, ...
Static/dynamic content (databases/streams)
Definition of Information Retrieval
Information Retrieval (IR) is about vagueness and uncertainty in information systems.
Vagueness: the user cannot give a precise specification of her information need
vague query conditions
iterative query formulation
Uncertainty: the system has uncertain knowledge about the (content of the) objects in the database
uncertain representation (→ wrong answers)
incomplete representation (→ missing answers)
IR = Content-Oriented Search (Narrow Definition of IR)
Searching at different abstraction levels:
Syntax: document as a sequence of symbols
Semantics: meaning of a text/media object
Pragmatics: usefulness for solving my current problem
Syntax, Semantics and Pragmatics
Example text: Welcome at the Information Engineering Group. Our current work focuses on information retrieval, digital libraries and web-based information systems, with special emphasis on user-oriented research.
Syntax: 'digital library': no match
Semantics: 'research area': match
Pragmatics: 'potential project partner for medical information project'?
Retrieval Quality
The concept of relevance
in contrast to a database system, an IR system cannot decide whether an answer is correct or not
the user has an information need
relevance: relationship between document and information need
judged by the user
Facets of Relevance
Situational relevance: related to the perceived task
Pertinence: related to the information need
Intellectual topicality: as judged by a human observer
Algorithmic relevance: system score comparing request/query with object
In the following: relevance as pertinence/topicality, without further distinction
Retrieval metrics
RET: set of retrieved documents
REL: set of relevant documents in the database
Precision p: proportion of relevant documents among the retrieved ones
Recall r: proportion of retrieved documents among the relevant ones
$$p = \frac{|REL \cap RET|}{|RET|}, \qquad r = \frac{|REL \cap RET|}{|REL|}$$
Example: There are 20 relevant documents for the current query. The system returns 10 documents, of which 8 are relevant.
Precision: p = 8/10 = 0.8
Recall: r = 8/20 = 0.4
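The two metrics can be sketched in a few lines of Python. This is only an illustration: the function name `precision_recall`, the document ids, and the convention of returning 0.0 for an empty set are assumptions, not part of the slides.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)          # |REL ∩ RET|
    # Assumption: report 0.0 when a denominator would be zero.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ids reproducing the example: 20 relevant documents,
# 10 retrieved, 8 of the retrieved ones relevant.
REL = set(range(20))                  # documents 0..19 are relevant
RET = set(range(8)) | {100, 101}      # 8 relevant hits plus 2 nonrelevant
p, r = precision_recall(RET, REL)
print(p, r)  # 0.8 0.4
```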
Representations
Semantic Descriptions; Free Text Search; Objects, Representations, and Descriptions
Free text search: search in the document text
Semantic approach: assign semantic descriptions
Semantic Descriptions
Classification schemes: e.g. hierarchical classification, as in libraries or product catalogs
Tagging: users assign tags
Ontologies: e.g. OWL (Web Ontology Language)
Classification/Ontology Example: DMOZ
Free Text Search
Problems
Inflection: computer – computers, fly – flies, go – goes – going
Derivation: compute – computer – computerization – computation
Synonyms: mobile – smartphone, table – bench – board – counter
Polysemes: bank, head
Compounds: steamboat, testbed
Phrases: information retrieval – retrieval of information
Approaches
Inflection, derivation: stemming algorithms, e.g. computer, computation, computerize → comput
Synonyms: synonym lexicons
Compounds: splitting algorithms
Phrases: adjacency search
Most systems implement only stemming and adjacency search!
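A toy sketch of the two approaches that most systems implement, stemming and adjacency search. The suffix list and the names `toy_stem` and `adjacency_match` are made up for illustration; this is not a real stemming algorithm such as Porter's, and punctuation is not handled.

```python
# Hypothetical suffix list, ordered so that longer suffixes are tried first.
SUFFIXES = ("erization", "erize", "ation", "ers", "ing", "er", "es", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def adjacency_match(phrase, text):
    """True if the stemmed phrase tokens occur at consecutive text positions."""
    q = [toy_stem(w) for w in phrase.split()]
    t = [toy_stem(w) for w in text.split()]
    return any(t[i:i + len(q)] == q for i in range(len(t) - len(q) + 1))


print([toy_stem(w) for w in ("computer", "computation", "computerize")])
# ['comput', 'comput', 'comput']
print(adjacency_match("information retrieval", "research in information retrieval"))  # True
print(adjacency_match("information retrieval", "retrieval of information"))           # False
```

The last call shows the phrase problem: plain adjacency search does not match "retrieval of information", although it expresses the same concept.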
Objects, Representations, and Descriptions
A Document Object
Example: document text, representation, description
Text:
Research in the probabilistic theory of information retrieval involves the construction of mathematical models. In this kind of theory construction the assumptions laid down ...
Stopword removal and stemming:
research probabil theory informat retriev involv construct mathemat model kind theory construct assum lay down
Representation (bag of words):
(research,1), (probabil,1), (theory,2), (informat,1), (retriev,1), (involv,1), (construct,2), (mathemat,1), (model,1), (kind,1), (assum,1), (lay,1), (down,1)
Description:
(research,0.5), (probabil,0.5), (theory,1.0), (informat,0.5), (retriev,0.5), (involv,0.5), (construct,1.0), (mathemat,0.5), (model,0.5), (kind,0.5), (assum,0.5), (lay,0.5), (down,0.5)
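The step from the stemmed token sequence to representation and description can be sketched as follows. The slide does not say how the description weights are computed; dividing each term frequency by the maximum term frequency is an assumption that happens to reproduce the numbers shown.

```python
from collections import Counter

# Stemmed tokens after stopword removal, copied from the example.
tokens = ("research probabil theory informat retriev involv construct "
          "mathemat model kind theory construct assum lay down").split()

# Representation: bag of words (term -> within-document frequency).
representation = Counter(tokens)

# Description: assumed weighting tf / max tf (not stated in the slides).
max_tf = max(representation.values())
description = {term: tf / max_tf for term, tf in representation.items()}

print(representation["theory"], description["theory"])      # 2 1.0
print(representation["research"], description["research"])  # 1 0.5
```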
Conceptual Model
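[Figure: conceptual model; labels in the diagram: $\alpha_Q$, $\beta_Q$, $\alpha_D$, $\beta_D$, $\varrho$, rel. judg., IR]
$\underline{q}_k \in \underline{Q}$: query
$q_k \in Q$: query representation
$q_k^D \in Q^D$: query description
$\underline{d}_m \in \underline{D}$: document
$d_m \in D$: document representation
$d_m^D \in D^D$: document description
$R$: relevance scale
$\varrho$: retrieval function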
Probabilistic Models
Probabilistic Event Space; Probability Ranking Principle; Binary Independence Retrieval Model; BM25 model; Learning to Rank
Probabilistic Event Space
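[Fuhr 92]
[Figure: probabilistic event space over queries and documents]
$\underline{Q}$: queries; $\underline{q}_k$: query; $q_k$: query representation
$\underline{D}$: documents; $\underline{d}_m$: document; $d_m$: document representation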
Event space
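[Figure: event space of queries (Q) and documents (D) as a grid of query-document pairs, each cell judged R (relevant) or N (nonrelevant); for the pairs sharing the representations $q_k$ and $d_m$: $P(R \mid q_k, d_m) = 0.5$]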
Event space: $\underline{Q} \times \underline{D}$
single element: query-document pair $(\underline{q}_k, \underline{d}_m)$
all elements are equiprobable
relevance judgement: $(\underline{q}_k, \underline{d}_m) \in R$
relevance judgements for different documents w.r.t. the same query are independent of each other
Probability of relevance $P(rel \mid q_k, d_m)$: probability of an element of $(q_k, d_m)$ being relevant
regard collections as samples of possibly infinite sets
poor representation of retrieval objects: a single representation may stand for a number of different objects
Probability Ranking Principle
defines optimum retrieval for probabilistic models:
rank documents according to decreasing values of the probability of relevance $P(rel \mid q, d)$
Advantage: the PRP yields
optimum retrieval quality
minimum retrieval costs
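Operationally, the PRP amounts to a sort. A minimal sketch, assuming the probability estimates are already available; the document ids, the numbers, and the name `rank_by_prp` are hypothetical.

```python
def rank_by_prp(prob_of_relevance):
    """Order document ids by decreasing estimated P(rel | q, d)."""
    return sorted(prob_of_relevance, key=prob_of_relevance.get, reverse=True)


# Made-up estimates for one query.
estimates = {"d1": 0.2, "d2": 0.9, "d3": 0.5}
print(rank_by_prp(estimates))  # ['d2', 'd3', 'd1']
```

How the estimates are obtained is the job of the concrete probabilistic model; the principle only fixes the order in which documents are presented.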
Binary Independence Retrieval Model
represent queries and documents as sets of terms
$T = \{t_1, \ldots, t_n\}$: set of terms in the collection
$q \in Q$: query representation
$d_m \in D$: document representation
$q^T$: set of query terms
$d_m^T$: set of document terms
simple retrieval function: coordination level match
$$\varrho_{COORD}(q, d_m) = |q^T \cap d_m^T|$$
Binary independence retrieval (BIR) model: assign weights to query terms
$$\varrho_{BIR}(q, d_m) = \sum_{t_i \in q^T \cap d_m^T} c_i$$
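Both retrieval functions operate on term sets and can be sketched directly. The example terms and the weights $c_i$ below are made-up numbers used only for illustration.

```python
def coord_score(query_terms, doc_terms):
    """Coordination level match: number of terms shared by query and document."""
    return len(set(query_terms) & set(doc_terms))


def bir_score(query_terms, doc_terms, weights):
    """BIR retrieval function: sum of weights c_i over the shared terms."""
    return sum(weights[t] for t in set(query_terms) & set(doc_terms))


# Hypothetical query, document and term weights.
q_terms = {"information", "retrieval", "model"}
d_terms = {"retrieval", "model", "probabilistic"}
c = {"information": 0.5, "retrieval": 1.5, "model": 2.0}

print(coord_score(q_terms, d_terms))   # 2
print(bir_score(q_terms, d_terms, c))  # 3.5
```

With all weights equal to 1, `bir_score` reduces to `coord_score`.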
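$$\varrho_{BIR}(q, d_m) = \sum_{t_i \in q^T \cap d_m^T} c_i, \qquad c_i = \log \frac{p_i (1 - s_i)}{s_i (1 - p_i)}$$
$p_i = P(t_i \mid rel)$: probability that $t_i$ occurs in an arbitrary relevant document
$s_i = P(t_i \mid \overline{rel})$: probability that $t_i$ occurs in an arbitrary nonrelevant document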
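Parameter estimation: Relevance Feedback
                 t_i occurs    t_i does not occur     total
relevant         r_i           R − r_i                R
not relevant     n_i − r_i     N − n_i − R + r_i      N − R
total            n_i           N − n_i                N
$p_i = P(t_i \mid rel)$: probability that $t_i$ occurs in an arbitrary relevant document
$$p_i \approx \frac{r_i}{R}$$
$s_i = P(t_i \mid \overline{rel})$: probability that $t_i$ occurs in an arbitrary nonrelevant document
$$s_i \approx \frac{n_i - r_i}{N - R} \approx \frac{n_i}{N}$$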
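A sketch of how the feedback counts turn into a term weight $c_i = \log \frac{p_i(1-s_i)}{s_i(1-p_i)}$. The function name `bir_weight` and the example counts are hypothetical. The plain estimates break down when a cell of the table is zero, so this sketch simply rejects such inputs; smoothing the counts is common practice but is not covered by the slides.

```python
import math


def bir_weight(r_i, R, n_i, N):
    """Term weight c_i from relevance feedback counts.

    r_i: relevant documents containing t_i,  R: relevant documents,
    n_i: documents containing t_i,           N: documents in the collection.
    """
    if R <= 0 or N <= R:
        raise ValueError("need at least one relevant and one nonrelevant document")
    p_i = r_i / R                   # P(t_i | rel)
    s_i = (n_i - r_i) / (N - R)     # P(t_i | nonrelevant)
    if not (0 < p_i < 1 and 0 < s_i < 1):
        raise ValueError("degenerate estimate; counts would need smoothing")
    return math.log(p_i * (1 - s_i) / (s_i * (1 - p_i)))


# Made-up counts: 8 of 10 relevant documents contain the term,
# 100 of 1000 documents overall.
c_i = bir_weight(r_i=8, R=10, n_i=100, N=1000)
print(c_i > 0)  # True: the term is more frequent in relevant documents
```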
Parameter estimation w/o relevance feedback
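$N$: number of documents in the collection
$n_i$: number of documents containing term $t_i$
$s_i = P(t_i \mid \overline{rel})$: probability that $t_i$ occurs in an arbitrary nonrelevant document
$$s_i \approx \frac{n_i}{N}$$
$p_i = P(t_i \mid rel)$: probability that $t_i$ occurs in an arbitrary relevant document; assume a constant value: $p = 0.5$
$$c_i = \log \frac{p_i (1 - s_i)}{s_i (1 - p_i)} = \log \frac{p}{1 - p} + \log \frac{1 - s_i}{s_i} = 0 + \log \frac{N - n_i}{n_i} \approx \log \frac{N}{n_i}$$
IDF (inverse document frequency) weight: $\log (N / n_i)$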
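The derivation can be checked numerically. A small sketch with made-up collection statistics; the function names are hypothetical, and $0 < n_i < N$ is assumed.

```python
import math


def bir_weight_no_feedback(N, n_i, p=0.5):
    """c_i with s_i = n_i / N and a constant p_i = p (requires 0 < n_i < N)."""
    s_i = n_i / N
    return math.log(p / (1 - p)) + math.log((1 - s_i) / s_i)


def idf(N, n_i):
    """IDF weight log(N / n_i), the approximation for n_i much smaller than N."""
    return math.log(N / n_i)


# Made-up numbers: a term occurring in 10 of 1000 documents.
print(round(bir_weight_no_feedback(1000, 10), 3))  # 4.595  (= log 99)
print(round(idf(1000, 10), 3))                     # 4.605  (= log 100)
```

For rare terms the two values are nearly identical; the approximation gets worse as $n_i$ approaches $N$.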
BM25
[Robertson et al 95]
heuristic extension of the BIR model from binary to weighted indexing (consideration of within-document frequency tf)
[Figure: plot of the weight $u_{mi}$ against $tf_{mi}$, with the value 1 marked on the weight axis]
tf*idf Weighting
originally developed for the (non-probabilistic) vector space model
set of heuristics: the weight of a term should be higher ...
1. the less frequently the term occurs in the collection (inverse document frequency, idf; see above)
2. the more often the term occurs in the document (tf)
3. the shorter the document
From binary to weighted Indexing
$l_m$: document length (number of tokens in $d_m$)
$al$: average document length in $D$
$tf_{mi}$: occurrence frequency of $t_i$ in $d_m$
$b$: weight of length normalization, $0 \le b \le 1$
$k$: weight of occurrence frequency
length normalization:
$$B = (1 - b) + b \, \frac{l_m}{al}$$
normalized within-document frequency: $ntf_{mi} = tf_{mi} / B$
BM25 weight:
$$u_{mi} = \frac{ntf_{mi}}{k + ntf_{mi}} = \frac{tf_{mi}}{k \left( (1 - b) + b \, \frac{l_m}{al} \right) + tf_{mi}}$$
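The weight formula translates directly into code. A minimal sketch: the function name and the default values k = 1.2 and b = 0.75 are assumptions (commonly used settings), not values given in the slides.

```python
def bm25_weight(tf, doc_len, avg_len, k=1.2, b=0.75):
    """BM25 indexing weight u_mi for one term in one document."""
    B = (1 - b) + b * doc_len / avg_len   # length normalization
    ntf = tf / B                          # normalized within-document frequency
    return ntf / (k + ntf)


# Hypothetical documents: same tf, different lengths (average length 100).
print(round(bm25_weight(tf=2, doc_len=100, avg_len=100), 3))  # 0.625
print(bm25_weight(2, 200, 100) < bm25_weight(2, 100, 100))    # True: longer document, lower weight
print(bm25_weight(50, 100, 100) < 1.0)                        # True: the weight stays below 1
```

The weight grows with tf but saturates below 1, and it shrinks for documents longer than average, matching heuristics 2 and 3 of the tf*idf slide.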
Learning to Rank
Parameter learning in IR
[Fuhr 92]
Learning approaches in IR
Learning to Rank for Web Searches
Interactive Retrieval
Search models; Anomalous State of Knowledge; Ingwersen’s Cognitive Model
Search models
Classical search process model
Empirical studies
information search consists of a sequence of connected, but different searches
a search result may trigger new searches
only the task context remains the same
main goal of a search is accumulated learning and collection of new information while searching
Berrypicking model
[Bates 90]
continuous change of information need and queries during search
information need cannot be satisfied by a single result set
instead: sequence of selections and collection of pieces of information during search
Anomalous State of Knowledge (ASK) (1)
[Belkin 80]
Classic IR systems: "best match" principle
the system returns those documents that best fit the representation of the information need (e.g. query statement)
only feasible if the user can give a precise specification of her information need (as, e.g., in a DBMS)
Anomalous State of Knowledge (ASK) (2)
ASK hypothesis
the information need results from the user’s anomalous state of knowledge (ASK)
the user is unable to precisely specify the information need for removing the ASK
instead: describe the ASK
requires capture of cognitive and situation-specific aspects for resolving this anomaly
Ingwersen’s Cognitive Model
[Figure: Ingwersen’s cognitive model. Components: cognitive actor(s) (team); organizational, social and cultural context; interface; information objects; IT (engines, logics, algorithms). Relations: cognitive transformations and influence; interactive communications of cognitive structures.]