IR2012S Lecture05 Modeling II(Set, Algebra & Probabilistic)
7/31/2019 IR2012S Lecture05 Modeling II(Set, Algebra & Probabilistic)
Modeling in Information Retrieval- Fuzzy Set, Extended Boolean,
Generalized Vector Space,
Set-based Models, and Best Match Models
Berlin Chen, Department of Computer Science & Information Engineering
National Taiwan Normal University
References:
1. Modern Information Retrieval, Chapter 3 & Teaching material
2. Language Modeling for Information Retrieval, Chapter 3
Taxonomy of Classic IR Models
By document property:
Text
  Classic Models: Boolean, Vector, Probabilistic
  Set Theoretic: Fuzzy, Extended Boolean, Set-based
  Probabilistic: BM25, Language Models, Divergence from Randomness, Bayesian Networks
  Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks, Support Vector Machines
Links (Web): Page Rank, Hubs & Authorities
Multimedia Retrieval: Image Retrieval, Audio and Music Retrieval, Video Retrieval
Semi-structured Text: Proximal Nodes and others, XML-based
Outline
Alternative Set Theoretic Models
Fuzzy Set Model (Fuzzy Information Retrieval)
Extended Boolean Model
Set-based Model
Alternative Algebraic Model
Generalized Vector Space Model
Alternative Probabilistic Models
Best Match Models (BM1, BM15, BM11 & BM25)
Fuzzy Set Model
Premises
Docs and queries are represented through sets of keywords; therefore the matching between them is vague
Keywords cannot completely describe the user's information need and the doc's main theme ("aboutness")
For each query term (keyword):
Define a fuzzy set, and let each doc have a degree of membership (0~1) in the set
Fuzzy Set Model (cont.)
Fuzzy Set Theory
A framework for representing classes (sets) whose boundaries are not well defined
The key idea is to introduce the notion of a degree of membership associated with the elements of a set
This degree of membership varies from 0 to 1 and allows modeling the notion of marginal membership
  0: no membership
  1: full membership
Thus, membership is now gradual instead of abrupt, unlike conventional Boolean logic
Here we will define a fuzzy set for each query (or index) term; thus each doc has a degree of membership in this set
Fuzzy Set Model (cont.)
Definition
A fuzzy subset A of a universe of discourse U is characterized by a membership function μ_A: U → [0,1], which associates with each element u of U a number μ_A(u) in the interval [0,1]
Let A and B be two fuzzy subsets of U. Also, let Ā be the complement of A. Then:
  Complement:   μ_Ā(u) = 1 − μ_A(u)
  Union:        μ_{A∪B}(u) = max(μ_A(u), μ_B(u))
  Intersection: μ_{A∩B}(u) = min(μ_A(u), μ_B(u))
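As a quick illustration, the three membership operations above can be sketched in a few lines of Python (a minimal sketch; the membership degrees below are invented):

```python
# Classical fuzzy set operations over membership functions,
# using the complement/max/min definitions from this slide.
def complement(mu_a):
    return lambda u: 1 - mu_a(u)

def union(mu_a, mu_b):
    return lambda u: max(mu_a(u), mu_b(u))

def intersection(mu_a, mu_b):
    return lambda u: min(mu_a(u), mu_b(u))

# Example memberships for a single element u (degrees chosen arbitrarily)
mu_a = lambda u: 0.7   # u is "fairly" in A
mu_b = lambda u: 0.2   # u is "barely" in B

print(complement(mu_a)("u"))          # ~0.3
print(union(mu_a, mu_b)("u"))         # 0.7
print(intersection(mu_a, mu_b)("u"))  # 0.2
```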
Fuzzy Set Model (cont.)
Fuzzy information retrieval
Fuzzy sets are modeled based on a thesaurus
This thesaurus can be constructed by a term-termcorrelation matrix (or called keyword connection matrix)
: a term-term correlation matrix
: a normalized correlation factor for terms kiand k
l
We now have the notion ofproximity among index terms
The relationship is symmetric !
c
lic
,
lili
li
linnn
nc
,
,
,
li
i
n
n
,
: no of docs that contain ki
: no of docs that contain both kiand kl
Defining term relationship
ikillilk kcck li ,,
docs, paragraphs, sentences, ..ranged from 0 to 1
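The correlation matrix can be computed directly from the document collection. A minimal Python sketch over an invented toy collection (term and doc contents are made up):

```python
# Keyword connection matrix: c_{i,l} = n_{i,l} / (n_i + n_l - n_{i,l}),
# computed from a toy collection where each doc is a set of terms.
docs = [
    {"ka", "kb"},
    {"ka", "kc"},
    {"kb", "kc"},
    {"ka", "kb", "kc"},
]
terms = sorted({t for d in docs for t in d})

def n(term):                # number of docs containing the term
    return sum(1 for d in docs if term in d)

def n_both(ti, tl):         # number of docs containing both terms
    return sum(1 for d in docs if ti in d and tl in d)

def c(ti, tl):              # normalized correlation factor, in [0, 1]
    both = n_both(ti, tl)
    return both / (n(ti) + n(tl) - both)

# The matrix is symmetric, with c(t, t) = 1 on the diagonal
for ti in terms:
    print([round(c(ti, tl), 2) for tl in terms])
```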
Fuzzy Set Model (cont.)
The union and intersection operations are modified here
Union: algebraic sum (instead of max)
  μ_{A1∪A2∪...∪An}(k) = 1 − Π_{j=1..n} (1 − μ_{Aj}(k))
Intersection: algebraic product (instead of min)
  μ_{A1∩A2∩...∩An}(k) = Π_{j=1..n} μ_{Aj}(k)
For two sets, the algebraic sum is a "negative algebraic product":
  a + b − ab = 1 − (1 − a)(1 − b)
where a = μ_{A1}(k) and b = μ_{A2}(k)
Fuzzy Set Model (cont.)
Example:
Query q = k_a ∧ (k_b ∨ ¬k_c)
In disjunctive normal form:
  q_dnf = (k_a ∧ k_b ∧ k_c) ∨ (k_a ∧ k_b ∧ ¬k_c) ∨ (k_a ∧ ¬k_b ∧ ¬k_c) = cc1 + cc2 + cc3
where each cc_i is a conjunctive component
D_a is the fuzzy set of docs associated to the term k_a (similarly for D_b and D_c); cc1, cc2, and cc3 correspond to regions of the Venn diagram over D_a, D_b, D_c
Degree of membership?
Fuzzy Set Model (cont.)
For a doc d_j in the fuzzy answer set D_q, the degree of membership is:
  μ_{q,j} = μ_{cc1+cc2+cc3, j}
          = 1 − Π_{i=1..3} (1 − μ_{cci, j})        (algebraic sum, i.e. negative algebraic product, over the conjunctive components)
          = 1 − (1 − μ_{a,j} μ_{b,j} μ_{c,j}) × (1 − μ_{a,j} μ_{b,j} (1 − μ_{c,j})) × (1 − μ_{a,j} (1 − μ_{b,j}) (1 − μ_{c,j}))
where each conjunctive component is computed with the algebraic product, e.g. μ_{cc1,j} = μ_{a,j} μ_{b,j} μ_{c,j}
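The membership computation above can be sketched as follows (a minimal Python sketch; the membership degrees passed in are invented for illustration):

```python
# Membership of a doc in the answer set of q = ka AND (kb OR NOT kc):
# each conjunctive component uses the algebraic product, and the
# disjunction of the components uses the algebraic sum.
def mu_query(mu_a, mu_b, mu_c):
    cc1 = mu_a * mu_b * mu_c                # (ka, kb, kc)
    cc2 = mu_a * mu_b * (1 - mu_c)          # (ka, kb, not kc)
    cc3 = mu_a * (1 - mu_b) * (1 - mu_c)    # (ka, not kb, not kc)
    # algebraic sum over the three conjunctive components
    return 1 - (1 - cc1) * (1 - cc2) * (1 - cc3)

print(mu_query(1.0, 1.0, 0.0))  # doc fully about ka and kb, not kc -> 1.0
print(mu_query(0.0, 0.9, 0.9))  # doc not about ka at all -> 0.0
print(mu_query(0.8, 0.5, 0.5))  # partial membership, strictly between 0 and 1
```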
More on Fuzzy Set Model
Advantages
The correlations among index terms are considered
A degree of relevance between queries and docs can be obtained
Disadvantages
Fuzzy IR models have been discussed mainly in the literature associated with fuzzy theory
Experiments with standard test collections are not available
They do not consider the frequency (or counts) of a term in a document or a query
Extended Boolean Model
Motive
Extend the Boolean model with the functionality of
partial matching and term weighting
E.g.: in the Boolean model, for the conjunctive query q = k_x ∧ k_y, a doc that contains only one of k_x and k_y is as irrelevant as another doc which contains neither of them
How about the disjunctive query q = k_x ∨ k_y?
Combine Boolean query formulations with characteristics of the vector model:
  Term weighting
  Algebraic distances for similarity measures (so that a ranking can be obtained)
(Salton et al., 1983)
Extended Boolean Model (cont.)
Term weighting
The weight for the term k_x in a doc d_j is
  w_{x,j} = f_{x,j} × ( idf_x / max_i idf_i )
where f_{x,j} is the normalized frequency of k_x in d_j (ranged from 0 to 1) and idf_x / max_i idf_i is the normalized idf; thus w_{x,j} is normalized to lie between 0 and 1
Assume two index terms k_x and k_y were used
  Let x = w_{x,j} denote the weight of term k_x on doc d_j
  Let y = w_{y,j} denote the weight of term k_y on doc d_j
The doc vector is represented as d_j = (w_{x,j}, w_{y,j})
Queries and docs can then be plotted in a two-dimensional map
Extended Boolean Model (cont.)
If the query is q = k_x ∧ k_y (conjunctive query):
  The docs near the point (1,1) are preferred
  The similarity measure is defined via the Euclidean distance from (1,1) (a 2-norm model):
    sim(q_and, d_j) = 1 − ( ((1 − x)² + (1 − y)²) / 2 )^(1/2)
  where x = w_{x,j}, y = w_{y,j}, and d_j = (w_{x,j}, w_{y,j})
Extended Boolean Model (cont.)
If the query is q = k_x ∨ k_y (disjunctive query):
  The docs far from the point (0,0) are preferred
  The similarity measure is defined via the Euclidean distance from (0,0) (a 2-norm model):
    sim(q_or, d_j) = ( (x² + y²) / 2 )^(1/2)
  where x = w_{x,j} and y = w_{y,j}
Extended Boolean Model (cont.)
The similarity measures sim(q_and, d_j) and sim(q_or, d_j) also lie between 0 and 1
Extended Boolean Model (cont.)
Generalization
If t index terms are used, docs and queries lie in a t-dimensional space
p-norm model, with 1 ≤ p ≤ ∞:
  q_and = k1 ∧^p k2 ∧^p ... ∧^p k_m
  q_or  = k1 ∨^p k2 ∨^p ... ∨^p k_m
  sim(q_and, d) = 1 − ( ((1 − x1)^p + (1 − x2)^p + ... + (1 − x_m)^p) / m )^(1/p)
  sim(q_or, d)  = ( (x1^p + x2^p + ... + x_m^p) / m )^(1/p)
Some interesting properties:
  p = 1 (similar to the vector space model):
    sim(q_and, d) = sim(q_or, d) = (x1 + x2 + ... + x_m) / m
  p = ∞ (just like the formulas of fuzzy logic):
    sim(q_and, d) = min(x_i)
    sim(q_or, d)  = max(x_i)
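The p-norm measures and their limiting behavior can be sketched as follows (the weights x_i are invented for illustration):

```python
# p-norm similarity measures; x is the list of term weights x_i = w_{i,j}
# of the query terms in doc d_j.
def sim_and(x, p):
    m = len(x)
    return 1 - (sum((1 - xi) ** p for xi in x) / m) ** (1 / p)

def sim_or(x, p):
    m = len(x)
    return (sum(xi ** p for xi in x) / m) ** (1 / p)

x = [0.9, 0.4, 0.1]
# p = 1: both measures collapse to the average of the weights (vector-like)
print(sim_and(x, 1), sim_or(x, 1))
# large p: sim_and approaches min(x), sim_or approaches max(x) (fuzzy-like)
print(sim_and(x, 100), sim_or(x, 100))
```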
Extended Boolean Model (cont.)
Example query 1: q = (k1 ∧^p k2) ∨^p k3
Processed by grouping the operators in a predefined order:
  sim(q, d) = ( ( (1 − ( ((1 − x1)^p + (1 − x2)^p) / 2 )^(1/p) )^p + x3^p ) / 2 )^(1/p)
Example query 2: q = (k1 ∧^∞ k2) ∨² k3
Combination of different algebraic distances:
  sim(q, d) = ( ( (min(x1, x2))² + x3² ) / 2 )^(1/2)
More on Extended Boolean Model
Advantages
A hybrid model including properties of both the set
theoretic models and the algebraic models
That is, relax the Boolean algebra by interpreting
Boolean operations in terms of algebraic distances
By varying the parameter p between 1 and infinity, we can vary the p-norm ranking behavior from that of a vector-like ranking to that of a fuzzy logic-like ranking
It is possible to use combinations of different values of the parameter p in the same query request
More on Extended Boolean Model (cont.)
Disadvantages
The distributive operation does not hold for ranking computation
E.g.: consider q1 = (k1 ∨² k2) ∧² k3 and q2 = (k1 ∧² k3) ∨² (k2 ∧² k3), which are equivalent in Boolean algebra, but
  sim(q1, d) = 1 − ( ( (1 − ((x1² + x2²)/2)^(1/2))² + (1 − x3)² ) / 2 )^(1/2)
  sim(q2, d) = ( ( (1 − (((1 − x1)² + (1 − x3)²)/2)^(1/2))² + (1 − (((1 − x2)² + (1 − x3)²)/2)^(1/2))² ) / 2 )^(1/2)
so that sim(q1, d) ≠ sim(q2, d)
Assumes mutual independence of index terms
Generalized Vector Model
Premise
Classic models enforce independence of index terms
For the Vector model:
  The set of term vectors {k1, k2, ..., kt} is linearly independent and forms a basis for the subspace of interest
  Frequently, this is taken to mean pairwise orthogonality: ∀ i,j: k_i · k_j = 0 (a more restrictive sense)
Wong et al. (1985) proposed an alternative interpretation: the index term vectors are linearly independent, but not pairwise orthogonal → the Generalized Vector Model
Generalized Vector Model (cont.)
Key idea
The index term vectors that form the basis of the space are not orthogonal, and are represented in terms of smaller components (minterms)
Notations
  {k1, k2, ..., kt}: the set of all terms
  w_{i,j}: the weight associated with [k_i, d_j]
  Minterms: binary indicators (0 or 1) of all patterns of occurrence of terms within documents
    Each represents one kind of co-occurrence of index terms in a specific document
Generalized Vector Model (cont.)
Representations of minterms
The 2^t minterms (binary patterns over the t terms):
  m1 = (0,0,...,0)
  m2 = (1,0,...,0)
  m3 = (0,1,...,0)
  m4 = (1,1,...,0)
  m5 = (0,0,1,...,0)
  ...
  m_{2^t} = (1,1,1,...,1)
and the 2^t minterm vectors (unit vectors of a 2^t-dimensional space):
  vec(m1) = (1,0,0,0,0,...,0)
  vec(m2) = (0,1,0,0,0,...,0)
  ...
  vec(m_{2^t}) = (0,0,0,0,0,...,1)
m4 points to the docs where only index terms k1 and k2 co-occur and the other index terms are absent
m_{2^t} points to the docs containing all the index terms
The pairwise orthogonal vectors vec(m_i) associated with the minterms m_i serve as the basis for the generalized vector space
Generalized Vector Model (cont.)
Minterm vectors are pairwise orthogonal. But this does not mean that the index terms are independent
Each minterm specifies a kind of dependence among index terms
That is, the co-occurrence of index terms inside docs in the collection induces dependencies among these index terms
Generalized Vector Model (cont.)
The vector associated with the term k_i is represented by summing up the vectors of all minterms containing it, and normalizing:

  k_i = ( Σ_{r, g_i(m_r)=1} c_{i,r} m_r ) / ( Σ_{r, g_i(m_r)=1} c_{i,r}² )^(1/2)

  c_{i,r} = Σ_{d_j | g_l(d_j) = g_l(m_r) for all l} w_{i,j}

where g_i(m_r) = 1 indicates that the index term k_i is in the minterm m_r
The sum in c_{i,r} runs over all the docs whose term co-occurrence relation (pattern) exactly coincides with that of minterm m_r
That is, the weight c_{i,r} associated with the pair [k_i, m_r] sums up the weights of the term k_i in all the docs which have a term occurrence pattern given by m_r
Notice that for a collection of size N, only N minterms affect the ranking (and not 2^t)
Generalized Vector Model (cont.)
The similarity between the query and doc is calculated in the space of minterm vectors

  d_j = Σ_i w_{i,j} k_i = Σ_r s_{dj,r} m_r
  q   = Σ_i w_{i,q} k_i = Σ_r s_{q,r} m_r

  sim(q, d_j) = ( Σ_r s_{q,r} × s_{dj,r} ) / ( (Σ_r s_{q,r}²)^(1/2) × (Σ_r s_{dj,r}²)^(1/2) )

That is, docs and queries are first expressed in the t-dimensional index term space and then mapped into the 2^t-dimensional minterm space, where the cosine similarity is computed
Generalized Vector Model (cont.)
Example (a system with three index terms)
(Figure: the seven docs d1–d7 plotted in the space of the index terms k1, k2, k3)
k1 k2 k3 minterm
d1 2 0 1 m6
d2 1 0 0 m2
d3 0 1 3 m7
d4 2 0 0 m2
d5 1 2 4 m8
d6 1 2 0 m4
d7 0 5 0 m3
q 1 2 3
minterm k1 k2 k3
m1 0 0 0
m2 1 0 0
m3 0 1 0
m4 1 1 0
m5 0 0 1
m6 1 0 1
m7 0 1 1
m8 1 1 1
From the table, the coefficients c_{i,r} are:

  c_{1,2} = w_{1,d2} + w_{1,d4} = 1 + 2 = 3,  c_{1,4} = w_{1,d6} = 1,  c_{1,6} = w_{1,d1} = 2,  c_{1,8} = w_{1,d5} = 1
  c_{2,3} = w_{2,d7} = 5,  c_{2,4} = w_{2,d6} = 2,  c_{2,7} = w_{2,d3} = 1,  c_{2,8} = w_{2,d5} = 2
  c_{3,5} = 0,  c_{3,6} = w_{3,d1} = 1,  c_{3,7} = w_{3,d3} = 3,  c_{3,8} = w_{3,d5} = 4

so the normalized index term vectors are:

  k1 = (c_{1,2} m2 + c_{1,4} m4 + c_{1,6} m6 + c_{1,8} m8) / (c_{1,2}² + c_{1,4}² + c_{1,6}² + c_{1,8}²)^(1/2)
     = (3 m2 + 1 m4 + 2 m6 + 1 m8) / √15
  k2 = (c_{2,3} m3 + c_{2,4} m4 + c_{2,7} m7 + c_{2,8} m8) / (c_{2,3}² + c_{2,4}² + c_{2,7}² + c_{2,8}²)^(1/2)
     = (5 m3 + 2 m4 + 1 m7 + 2 m8) / √34
  k3 = (c_{3,5} m5 + c_{3,6} m6 + c_{3,7} m7 + c_{3,8} m8) / (c_{3,5}² + c_{3,6}² + c_{3,7}² + c_{3,8}²)^(1/2)
     = (0 m5 + 1 m6 + 3 m7 + 4 m8) / √26
Generalized Vector Model (cont.)
Example: Ranking
With k1 = (3 m2 + m4 + 2 m6 + m8)/√15, k2 = (5 m3 + 2 m4 + m7 + 2 m8)/√34, and k3 = (m6 + 3 m7 + 4 m8)/√26, the doc d1 and the query q become:

  d1 = 2 k1 + 1 k3
     = (6/√15) m2 + (2/√15) m4 + (4/√15 + 1/√26) m6 + (3/√26) m7 + (2/√15 + 4/√26) m8

  q = 1 k1 + 2 k2 + 3 k3
    = (3/√15) m2 + (10/√34) m3 + (1/√15 + 4/√34) m4 + (2/√15 + 3/√26) m6 + (2/√34 + 9/√26) m7 + (1/√15 + 4/√34 + 12/√26) m8

Writing d1 = Σ_r s_{d1,r} m_r and q = Σ_r s_{q,r} m_r, the similarity between the query and doc is calculated in the space of minterm vectors:

  sim(q, d1) = cosine(q, d1) = ( Σ_r s_{q,r} × s_{d1,r} ) / ( (Σ_r s_{q,r}²)^(1/2) × (Σ_r s_{d1,r}²)^(1/2) )

Here the nonzero components are s_{q,2}, s_{q,3}, s_{q,4}, s_{q,6}, s_{q,7}, s_{q,8} for the query and s_{d1,2}, s_{d1,4}, s_{d1,6}, s_{d1,7}, s_{d1,8} for the doc
Generalized Vector Model (cont.)
Term Correlation
The degree of correlation between the terms k_i and k_j can now be computed as
  k_i · k_j = Σ_{r | g_i(m_r)=1 ∧ g_j(m_r)=1} c_{i,r} × c_{j,r}
No further normalization is needed here, because the c_{i,r} factors were already normalized when the k_i vectors were built (see the earlier definition of k_i)
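The whole worked example (building the term vectors from the table above, mapping docs and the query into minterm space, and ranking by cosine) can be reproduced with a short Python sketch:

```python
# Generalized Vector Model on the slide's example collection.
import math

# weights w_{i,j} from the table (columns: k1, k2, k3)
docs = {
    "d1": (2, 0, 1), "d2": (1, 0, 0), "d3": (0, 1, 3), "d4": (2, 0, 0),
    "d5": (1, 2, 4), "d6": (1, 2, 0), "d7": (0, 5, 0),
}
query = (1, 2, 3)
t = 3

def minterm(w):                         # co-occurrence pattern of a doc
    return tuple(1 if x > 0 else 0 for x in w)

# c_{i,r}: sum of w_{i,j} over docs whose pattern coincides with minterm r
c = {}
for w in docs.values():
    r = minterm(w)
    for i in range(t):
        c[(i, r)] = c.get((i, r), 0) + w[i]

def term_vector(i):                     # k_i as a dict over minterms, normalized
    comp = {r: cir for (j, r), cir in c.items() if j == i and r[i] == 1}
    norm = math.sqrt(sum(v * v for v in comp.values()))
    return {r: v / norm for r, v in comp.items()}

ks = [term_vector(i) for i in range(t)]

def to_minterm_space(w):                # d_j (or q) = sum_i w_i * k_i
    s = {}
    for i, wi in enumerate(w):
        for r, v in ks[i].items():
            s[r] = s.get(r, 0) + wi * v
    return s

def cosine(a, b):
    dot = sum(a[r] * b.get(r, 0) for r in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

q = to_minterm_space(query)
ranking = sorted(docs, key=lambda d: cosine(q, to_minterm_space(docs[d])),
                 reverse=True)
print(ranking)  # d5 = (1,2,4), closest in direction to q = (1,2,3), ranks first
```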
More on Generalized Vector Model
Advantages
Model considers correlations among index terms
Model does introduce interesting new ideas
Disadvantages
Not clear in which situations it is superior to the standard vector model
Computation cost is fairly high with large collections, since the number of active minterms might be proportional to the number of documents in the collection
Despite these drawbacks, the generalized vector model does introduce new ideas which are of importance from a theoretical point of view
Set-Based Model
This is a more recent approach (2005) that combines set theory with a vectorial ranking
The fundamental idea is to use mutual dependencies among index terms to improve results
Term dependencies are captured through termsets, which are sets of correlated terms
The approach, which leads to improved results with various collections, constitutes the first IR model that effectively took advantage of term dependence with general collections
Set-Based Model: Termsets
Termset is a concept used in place of the index terms
  A termset S_i = {k_a, k_b, ..., k_n} is a subset of the terms in the collection
  If all index terms in S_i occur in a document d_j, then we say that the termset S_i occurs in d_j
There are 2^t termsets that might occur in the documents of a collection, where t is the vocabulary size
  However, most combinations of terms have no semantic meaning
  Thus, the actual number of termsets in a collection is far smaller than 2^t
Set-Based Model: Termsets (cont.)
Let t be the number of terms of the collection
Then, the set V_S = {S_1, S_2, ..., S_{2^t}} is the vocabulary-set of the collection
To illustrate, consider the document collection below
Set-Based Model: Termsets (cont.)
To simplify notation, let us define
Further, let the letters a...n refer to the index
terms ka...kn , respectively
Set-Based Model: Termsets (cont.)
Consider the query q as "to do be it", i.e., q = {a, b, d, n}
For this query, the vocabulary-set is as below
Set-Based Model: Termsets (cont.)
At query processing time, only the termsets generated by the query need to be considered
  A termset composed of n terms is called an n-termset
  Let N_i be the number of documents in which S_i occurs
An n-termset S_i is said to be frequent if N_i is greater than or equal to a given threshold
  This implies that an n-termset can be frequent only if all of its (n−1)-termsets are also frequent
Frequent termsets can be used to reduce the number of termsets to consider with long queries
Set-Based Model: Termsets (cont.)
Let the threshold on the frequency of termsets be 2
To compute all frequent termsets for the query q = {a, b, d, n}, we proceed as follows:
1. Compute the frequent 1-termsets and their inverted lists:
   S_a = {d1, d2}
   S_b = {d1, d3, d4}
   S_d = {d1, d2, d3, d4}
2. Combine the inverted lists to compute frequent 2-termsets:
   S_ad = {d1, d2}
   S_bd = {d1, d3, d4}
3. Since there are no frequent 3-termsets, stop
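The level-wise computation above can be sketched in Python (an Apriori-style sketch over the inverted lists given on this slide; the empty list for term n is an assumption, since the slide only tells us that S_n is not frequent):

```python
# Frequent termsets for q = {a, b, d, n} with a frequency threshold of 2,
# computed level by level from the standard inverted lists.
from itertools import combinations

inverted = {            # 1-termset inverted lists from the slide
    "a": {"d1", "d2"},
    "b": {"d1", "d3", "d4"},
    "d": {"d1", "d2", "d3", "d4"},
    "n": set(),         # assumed: n is not frequent in the example collection
}
threshold = 2

# level 1: frequent 1-termsets
frequent = {frozenset([t]): docs for t, docs in inverted.items()
            if len(docs) >= threshold}
level = list(frequent)

# level n: intersect inverted lists of pairs of frequent (n-1)-termsets
while level:
    nxt = []
    for s1, s2 in combinations(level, 2):
        cand = s1 | s2
        if len(cand) != len(s1) + 1 or cand in frequent:
            continue                     # not an n-termset, or already found
        docs = frequent[s1] & frequent[s2]
        if len(docs) >= threshold:
            frequent[cand] = docs
            nxt.append(cand)
    level = nxt

for s in sorted(frequent, key=lambda s: (len(s), sorted(s))):
    print("".join(sorted(s)), sorted(frequent[s]))
```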
Set-Based Model: Termsets (cont.)
Notice that there are only 5 frequent termsets in our collection
Inverted lists for frequent n-termsets can be computed by starting with the inverted lists of frequent 1-termsets
  Thus, the only indices required are the standard inverted lists used by any IR system
This is reasonably fast for short queries of up to 4-5 terms
Set-Based Model: Ranking Computation
The ranking computation is based on the vector model,
but adopts termsets instead of index terms
Given a query q, let
  {S_1, S_2, ...} be the set of all termsets originated from q
  N_i be the number of documents in which termset S_i occurs
  N be the total number of documents in the collection
  F_{i,j} be the frequency of termset S_i in document d_j
For each pair [S_i, d_j] we compute a weight W_{i,j}, a tf-idf style weight over termsets given by
  W_{i,j} = (1 + log F_{i,j}) × log(1 + N/N_i)  if F_{i,j} > 0,  and W_{i,j} = 0 otherwise
We also compute a W_{i,q} value for each pair [S_i, q]
Set-Based Model: Ranking Computation (cont.)
Consider
query q = {a, b, d, n}
document d1 = a b c a d a d c a b
Assume here a minimum threshold frequency of 1
A document d_j and a query q are represented as vectors in a 2^t-dimensional space of termsets
The rank of d_j with respect to the query q is computed as in the vector model, but summing over termsets:
  sim(d_j, q) = ( Σ_{S_i} W_{i,j} × W_{i,q} ) / ( |d_j| × |q| )
For termsets that are not in the query q, W_{i,q} = 0
Set-Based Model: Ranking Computation (cont.)
The document norm |d_j| is hard to compute in the space of termsets
Thus, its computation is restricted to 1-termsets
Let again q = {a, b, d, n} and d_1
The document norm in terms of 1-termsets is given by
  |d_j| = ( Σ_{S_i a 1-termset} W_{i,j}² )^(1/2)
Set-Based Model: Ranking Computation (cont.)
To compute the rank of d_1, we need to consider the seven termsets S_a, S_b, S_d, S_ab, S_ad, S_bd, and S_abd
The rank of d_1 is then given by
Set-Based Model: Ranking Computation (cont.)
BM25 (Best Match 25)
BM25 was created as the result of a series of experiments on variations of the probabilistic model
A good term weighting is based on three principles:
  Inverse document frequency
  Term frequency
  Document length normalization
The classic probabilistic model covers only the first of these principles
This reasoning led to a series of experiments with the Okapi system, which led to the BM25 ranking formula
BM1, BM11 and BM15 Formulas
At first, the Okapi system used the Equation
below as ranking formula
which is just the equation used in the probabilistic
model, when no relevance information is provided
It was referred to as the BM1 formula (Best
Match 1)
  sim_BM1(d_j, q) ~ Σ_{k_i ∈ q ∧ k_i ∈ d_j} log( (N − n_i + 0.5) / (n_i + 0.5) )

where N is the number of documents in the collection and n_i is the number of documents containing k_i
BM1, BM11 and BM15 Formulas (cont.)
The first idea for improving the ranking was to introduce a term-frequency factor F_{i,j} in the BM1 formula
This factor, after some changes, evolved to become
  F_{i,j} = S1 × f_{i,j} / (K1 + f_{i,j})
where
  f_{i,j} is the frequency of term k_i within document d_j
  K1 is a constant set up experimentally for each collection
  S1 is a scaling constant, normally set to S1 = (K1 + 1)
If K1 = 0, this whole factor becomes equal to 1 and bears no effect on the ranking
BM1, BM11 and BM15 Formulas (cont.)
The factor f_{i,j} / (K1 + f_{i,j}) can be viewed as a saturation function of f_{i,j}: it grows quickly for small term frequencies and approaches 1 asymptotically, so additional occurrences of a term contribute less and less
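The saturation behavior can be seen numerically (the value of K1 below is an arbitrary choice for illustration):

```python
# Term-frequency saturation factor f/(K1 + f): monotonically increasing
# in f but bounded by 1, with diminishing returns per extra occurrence.
def saturation(f, k1):
    return f / (k1 + f)

k1 = 1.2  # illustrative constant
vals = [saturation(f, k1) for f in range(0, 11)]
print([round(v, 3) for v in vals])
# the 2nd occurrence of a term adds far less than the 1st:
print(vals[1] - vals[0], vals[2] - vals[1])
```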
BM1, BM11 and BM15 Formulas (cont.)
The next step was to modify the F_{i,j} factor by adding document length normalization to it, as follows:
  F_{i,j} = S1 × f_{i,j} / ( K1 × len(d_j)/avg_doclen + f_{i,j} )
where
  len(d_j) is the length of document d_j (computed, for instance, as the number of terms in the document)
  avg_doclen is the average document length for the collection
BM1, BM11 and BM15 Formulas (cont.)
Next, a correction factor G_{j,q}, dependent on the document and query lengths, was added:
  G_{j,q} = K2 × len(q) × (avg_doclen − len(d_j)) / (avg_doclen + len(d_j))
where
  len(q) is the query length (number of terms in the query)
  K2 is a constant
BM1, BM11 and BM15 Formulas (cont.)
A third additional factor, aimed at taking into account term frequencies within queries, was defined as
  F_{i,q} = S3 × f_{i,q} / (K3 + f_{i,q})
where
  f_{i,q} is the frequency of term k_i within query q
  K3 is a constant
  S3 is a scaling constant related to K3, normally set to S3 = (K3 + 1)
BM1, BM11 and BM15 Formulas (cont.)
Introduction of these three factors led to various
BM (Best Matching) formulas, as follows:
  sim_BM1(d_j, q)  ~ Σ_{k_i ∈ q ∧ k_i ∈ d_j} log( (N − n_i + 0.5) / (n_i + 0.5) )

  sim_BM15(d_j, q) ~ G_{j,q} + Σ_{k_i ∈ q ∧ k_i ∈ d_j} F_{i,j} × F_{i,q} × log( (N − n_i + 0.5) / (n_i + 0.5) ),
                     with F_{i,j} = S1 × f_{i,j} / (K1 + f_{i,j})

  sim_BM11(d_j, q) ~ G_{j,q} + Σ_{k_i ∈ q ∧ k_i ∈ d_j} F_{i,j} × F_{i,q} × log( (N − n_i + 0.5) / (n_i + 0.5) ),
                     with F_{i,j} = S1 × f_{i,j} / (K1 × len(d_j)/avg_doclen + f_{i,j})
BM1, BM11 and BM15 Formulas (cont.)
Experiments using TREC data have shown that
BM11 outperforms BM15 (due to additional
document length normalization)
Further, empirical considerations can be used to simplify the previous equations, as follows:
  Empirical evidence suggests that the best value of K2 is 0, which eliminates the G_{j,q} factor from these equations (i.e., from BM15 and BM11)
  Further, good estimates for the scaling constants S1 and S3 are K1 + 1 and K3 + 1, respectively
  Empirical evidence also suggests that making K3 very large is better; as a result, the F_{i,q} factor reduces simply to f_{i,q}
  For short queries, we can assume that f_{i,q} is 1 for all terms
BM1, BM11 and BM15 Formulas (cont.)
These considerations lead to simpler equations
as follows
  sim_BM1(d_j, q)  ~ Σ_{k_i ∈ q ∧ k_i ∈ d_j} log( (N − n_i + 0.5) / (n_i + 0.5) )

  sim_BM15(d_j, q) ~ Σ_{k_i ∈ q ∧ k_i ∈ d_j} ( (K1 + 1) × f_{i,j} / (K1 + f_{i,j}) ) × log( (N − n_i + 0.5) / (n_i + 0.5) )

  sim_BM11(d_j, q) ~ Σ_{k_i ∈ q ∧ k_i ∈ d_j} ( (K1 + 1) × f_{i,j} / (K1 × len(d_j)/avg_doclen + f_{i,j}) ) × log( (N − n_i + 0.5) / (n_i + 0.5) )
BM25 Ranking Formula
BM25: a combination of BM11 and BM15
The motivation was to combine the BM11 and BM15 term frequency factors as follows:
  B_{i,j} = (K1 + 1) × f_{i,j} / ( K1 × [ (1 − b) + b × len(d_j)/avg_doclen ] + f_{i,j} )
where b is a constant with values in the interval [0, 1]
  If b = 0, it reduces to the BM15 term frequency factor
  If b = 1, it reduces to the BM11 term frequency factor
  For values of b between 0 and 1, the equation provides a combination of BM11 with BM15
BM25 Ranking Formula (cont.)
The ranking equation for the BM25 model can then be written as
  sim_BM25(d_j, q) ~ Σ_{k_i ∈ q ∧ k_i ∈ d_j} B_{i,j} × log( (N − n_i + 0.5) / (n_i + 0.5) )
where K1 and b are empirical constants
  K1 = 1 works well with real collections
  b should be kept closer to 1 to emphasize the document length normalization effect present in the BM11 formula
  For instance, b = 0.75 is a reasonable assumption
  Constant values can be fine-tuned for particular collections through proper experimentation
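Putting the pieces together, BM25 as given on these slides can be sketched in Python (the toy corpus and the constants K1 = 1.2, b = 0.75 are illustrative assumptions, not part of the lecture):

```python
# BM25 scoring over a tiny invented corpus.
import math
from collections import Counter

corpus = {
    "d1": "the cat sat on the mat".split(),
    "d2": "the dog chased the cat".split(),
    "d3": "dogs make good pets".split(),
    "d4": "a mat on the floor".split(),
    "d5": "the quick brown fox".split(),
}
N = len(corpus)
avg_doclen = sum(len(d) for d in corpus.values()) / N
tfs = {name: Counter(doc) for name, doc in corpus.items()}

def idf(term):
    # log((N - n_i + 0.5) / (n_i + 0.5)), as in the BM1/BM25 formulas
    n_i = sum(1 for tf in tfs.values() if term in tf)
    return math.log((N - n_i + 0.5) / (n_i + 0.5))

def bm25(query, doc_name, k1=1.2, b=0.75):
    tf, doclen = tfs[doc_name], len(corpus[doc_name])
    score = 0.0
    for term in query:
        f = tf.get(term, 0)
        if f == 0:
            continue
        # B_{i,j}: saturated, length-normalized term-frequency factor
        B = (k1 + 1) * f / (k1 * ((1 - b) + b * doclen / avg_doclen) + f)
        score += B * idf(term)
    return score

query = ["cat", "mat"]
for name in corpus:
    print(name, round(bm25(query, name), 3))  # d1 matches both terms, ranks first
```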
BM25 Ranking Formula (cont.)
Unlike the probabilistic model, the BM25 formula
can be computed without relevance information
There is consensus that BM25 outperforms the classic vector model for general collections
Thus, it has been used as a baseline for evaluating new ranking functions, in substitution to the classic vector model