Some Information Retrieval Models and Our Experiments for TREC KBA

INFORMATION RETRIEVAL MODELS / TREC KBA. Patrice Bellot, Aix-Marseille Université - CNRS (LSIS UMR 7296; OpenEdition), patrice.bellot@univ-amu.fr. LSIS DIMAG team: http://www.lsis.org/spip.php?id_rubrique=291. OpenEdition Lab: http://lab.hypotheses.org

description

Presentation of the main IR models. Presentation of our submission to TREC KBA 2014 (entity-oriented information retrieval), in partnership with the Kware company (V. Bouvier, M. Benoit).

Transcript of Some Information Retrieval Models and Our Experiments for TREC KBA

Page 1: Some Information Retrieval Models and Our Experiments for TREC KBA

INFORMATION RETRIEVAL MODELS / TREC KBA

Patrice Bellot, Aix-Marseille Université - CNRS (LSIS UMR 7296; OpenEdition), patrice.bellot@univ-amu.fr. LSIS - DIMAG team: http://www.lsis.org/spip.php?id_rubrique=291. OpenEdition Lab: http://lab.hypotheses.org

Page 2: Some Information Retrieval Models and Our Experiments for TREC KBA


— What Web search engines can do and still can’t do?

— The Main Statistical Information Retrieval Models for Texts

— Entity linking and Entity oriented Document Retrieval


Mining large text collections. Robustness (documents, queries, information needs, languages…)

Be fast, be relevant

Do we really need (formal) semantics? Do we need deep (symbolic) language analysis?

Page 3: Some Information Retrieval Models and Our Experiments for TREC KBA


Vertical vs horizontal search vs … ?


Horizontal search (Google search, Bing…)

Vertical search (e.g. health search engines)

Future?

What models? What NLP? What resources should be used? What (and how) can be learned?

Page 4: Some Information Retrieval Models and Our Experiments for TREC KBA


INFORMATION RETRIEVAL MODELS


Page 5: Some Information Retrieval Models and Our Experiments for TREC KBA


Information Retrieval / Document Retrieval

• Objective: finding the « documents » that best correspond to the user’s request

• Problems: — Interpreting the query — Interpreting the documents (indexing) — Defining a score of relatedness (a ranking function)

• Solutions: — Distributional hypothesis = statistical and probabilistic approaches (+ linear algebra) — Natural Language Processing — Knowledge Engineering

• Indexing: — Assigning terms to documents (number of terms = exhaustivity vs specificity) — Index term weighting based on the occurrence frequency of terms in documents and on the number of documents in which a term occurs (document frequency)



Page 6: Some Information Retrieval Models and Our Experiments for TREC KBA


Evaluation

• The aim is to retrieve as many relevant documents as possible and as few non-relevant documents as possible

• Relevance is not truth

• Precision and Recall


• Precision and recall can be estimated at different cut-off ranks (P@n)

• Other measures : (mean) average precision (MAP), Discounted Cumulative Gain, Mean Reciprocal Rank…

• International Challenges : TREC, CLEF, INEX, NTCIR…


Evaluation: Precision and Recall

In the ideal case, the set of retrieved documents is equal to the set of relevant documents. However, in most cases, the two sets will be different. This difference is formally measured with precision and recall.

$$\text{Precision} = \frac{\text{number of relevant documents retrieved}}{\text{number of documents retrieved}}$$

$$\text{Recall} = \frac{\text{number of relevant documents retrieved}}{\text{number of relevant documents}}$$

Mounia Lalmas (Yahoo! Research) 20-21 June 2011 59 / 171
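To make the two ratios concrete, here is a minimal Python sketch (mine, not from the slides) computing precision, recall, and precision at a cut-off rank P@n over a toy ranked list; the document IDs and relevance judgments are invented.

```python
# Precision and recall for a ranked result list (illustrative sketch).

def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved documents."""
    hits = len(set(retrieved) & set(relevant))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def precision_at(n, ranking, relevant):
    """P@n: precision computed over the first n ranked documents."""
    return precision_recall(ranking[:n], relevant)[0]

ranking = ["d4", "d2", "d5", "d7", "d1"]   # hypothetical ranked output
relevant = {"d4", "d1", "d3"}              # hypothetical ground truth

print(precision_recall(ranking, relevant))  # (0.4, ~0.667)
print(precision_at(3, ranking, relevant))   # ~0.333
```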

Page 7: Some Information Retrieval Models and Our Experiments for TREC KBA


Document retrieval: the Vector Space Model

• Classical solution: the Vector Space Model

• In the index: a (non-binary) weight is associated with every word in each document that contains it

• Every document d is represented as a vector
• The query q is represented as a vector in the document space
• The degree of similarity between a document and the query is computed according to the weights w of the words m


$$\vec{d} = \begin{pmatrix} w_{m_1,d} \\ w_{m_2,d} \\ \vdots \\ w_{m_n,d} \end{pmatrix} \qquad \vec{q} = \begin{pmatrix} w_{m_1,q} \\ w_{m_2,q} \\ \vdots \\ w_{m_n,q} \end{pmatrix}$$

$$s(\vec{d}, \vec{q}) = \sum_{i=1}^{n} w_{m_i,d} \cdot w_{m_i,q} \quad (1)$$

Page 8: Some Information Retrieval Models and Our Experiments for TREC KBA


Ranking function: e.g. dot product / cosine

• Similarity function: dot product

• Normalization?

• Cosine similarity function

$$s(\vec{d}, \vec{q}) = \sum_{i=1}^{n} w_{m_i,d} \cdot w_{m_i,q} \quad (1)$$

$$w_{i,d} = \frac{w_{i,d}}{\sqrt{\sum_{j=1}^{n} w_{j,d}^2}} \quad (2)$$

$$s(\vec{d}, \vec{q}) = \sum_{i=1}^{n} \frac{w_{i,d}}{\sqrt{\sum_{j=1}^{n} w_{j,d}^2}} \cdot \frac{w_{i,q}}{\sqrt{\sum_{j=1}^{n} w_{j,q}^2}} = \frac{\vec{d} \cdot \vec{q}}{\|\vec{d}\|_2 \cdot \|\vec{q}\|_2} = \cos(\vec{d}, \vec{q}) \quad (3)$$

[Figure: the cosine of the angle between the document vector and the query vector.]

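As a quick illustration of equations (1)-(3), here is a minimal sketch (mine, with invented term weights) scoring two toy documents against a query, first with the raw dot product and then with the cosine:

```python
import numpy as np

# Toy term weights over a 4-term vocabulary (invented for illustration).
d1 = np.array([2.0, 0.0, 1.0, 0.0])
d2 = np.array([0.5, 0.5, 0.5, 0.5])
q = np.array([1.0, 0.0, 1.0, 0.0])

def dot_score(d, q):
    # Equation (1): unnormalized dot product.
    return float(np.dot(d, q))

def cosine_score(d, q):
    # Equation (3): dot product of the L2-normalized vectors.
    return float(np.dot(d, q) / (np.linalg.norm(d) * np.linalg.norm(q)))

for name, d in [("d1", d1), ("d2", d2)]:
    print(name, dot_score(d, q), round(cosine_score(d, q), 3))
```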

Page 9: Some Information Retrieval Models and Our Experiments for TREC KBA

Example



Terms:
T1: Bab(y,ies,y’s)
T2: Child(ren’s)
T3: Guide
T4: Health
T5: Home
T6: Infant
T7: Proofing
T8: Safety
T9: Toddler

Documents:
D1: Infant & Toddler First Aid
D2: Babies and Children’s Room (For Your Home)
D3: Child Safety at Home
D4: Your Baby’s Health and Safety: From Infant to Toddler
D5: Baby Proofing Basics
D6: Your Guide to Easy Rust Proofing
D7: Beanie Babies Collector’s Guide

The indexed terms are italicized in the titles. Also, the stems [BB05] of the terms for baby (and its variants) and child (and its variants) are used to save storage and improve performance. The term-by-document matrix for this document collection is

$$A = \begin{bmatrix}
0 & 1 & 0 & 1 & 1 & 0 & 1 \\
0 & 1 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 1 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 1 & 1 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 1 & 0 \\
0 & 0 & 1 & 1 & 0 & 0 & 0 \\
1 & 0 & 0 & 1 & 0 & 0 & 0
\end{bmatrix}.$$

For a query on baby health, the query vector is

$$q = [\,1\ 0\ 0\ 1\ 0\ 0\ 0\ 0\ 0\,]^T.$$

To process the user’s query, the cosines

$$\delta_i = \cos\theta_i = \frac{q^T d_i}{\|q\|_2 \, \|d_i\|_2}$$

are computed. The documents corresponding to the largest elements of δ are most relevant to the user’s query. For our example,

$$\delta \approx [\,0 \quad 0.40824 \quad 0 \quad 0.63245 \quad 0.5 \quad 0 \quad 0.5\,],$$

so document vector 4 is scored most relevant to the query on baby health. To calculate the recall and precision scores, one needs to be working with a small, well-studied document collection. In this example, documents d4, d1, and d3 are the three documents in the collection relevant to baby health. Consequently, with τ = .1, the recall score is 1/3 and the precision is 1/4.
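The δ vector above can be reproduced directly from the matrix; here is a short sketch (mine, using the data of this example):

```python
import numpy as np

# Term-by-document matrix A from the Langville & Meyer example
# (rows: baby, child, guide, health, home, infant, proofing, safety, toddler).
A = np.array([
    [0, 1, 0, 1, 1, 0, 1],
    [0, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0],
    [1, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0, 0, 0],
    [1, 0, 0, 1, 0, 0, 0],
], dtype=float)

q = np.array([1, 0, 0, 1, 0, 0, 0, 0, 0], dtype=float)  # "baby health"

# delta_i = cos(theta_i) = q^T d_i / (||q||_2 ||d_i||_2), per document column.
norms = np.linalg.norm(A, axis=0) * np.linalg.norm(q)
delta = (q @ A) / norms
print(np.round(delta, 5))  # [0. 0.40825 0. 0.63246 0.5 0. 0.5]
```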

63.2 Latent Semantic Indexing

In the 1990s, an improved information retrieval system replaced the vector space model. This system is called Latent Semantic Indexing (LSI) [Dum91] and was the product of Susan Dumais, then at Bell Labs. LSI simply creates a low rank approximation $A_k$ to the term-by-document matrix A from the vector space model.

from Langville & Meyer, 2006, Handbook of Linear Algebra

Page 10: Some Information Retrieval Models and Our Experiments for TREC KBA


Term Weighting

• Zipf’s law (1949): the distribution of word frequencies is similar for (large) texts

• Luhn’s hypothesis (1957): the frequency of a word is a measurement of its significance… and thus a criterion that measures the capacity of a word to discriminate documents by their content


Indexing and TF-IDF: Index Term Weighting (Mounia Lalmas, Yahoo! Research, 20-21 June 2011)

Zipf’s law [1949]: the distribution of word frequencies is similar for different texts (natural language) of significantly large size. [Figure: frequency of words f plotted against words by rank order r.] Zipf’s law holds even for different languages!

Luhn’s analysis (observation): [Figure: the same frequency/rank curve with an upper and a lower cut-off; the significant words lie between the common words and the rare words, where the resolving power is highest.]

from  M.  Lalmas,  2012

Rank    Word    Frequency
1       the     200
2       a       150
…       …       …
hapax           1    (hapaxes ≈ 50% of the vocabulary)

rank × freq ≈ constant
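A tiny sketch (mine; the token stream is invented, so the products below are only indicative, whereas on a large real corpus they come out roughly constant) of the rank × frequency check behind the table:

```python
from collections import Counter

# Invented token stream; replace with a real corpus to observe Zipf's law properly.
tokens = ("the a the of the a the to of the a in the of to and the a of to " * 100).split()
counts = Counter(tokens)

for rank, (word, freq) in enumerate(counts.most_common(), start=1):
    print(rank, word, freq, rank * freq)  # Zipf: rank x freq roughly constant
```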

Page 11: Some Information Retrieval Models and Our Experiments for TREC KBA


Term weighting

• In a given document, a word is important (discriminant) if it occurs often in the document and is rare in the collection


• TF.IDF weighting schemes

The IDF component follows from the amount of information carried by a term (with $n_i$ the number of documents containing $m_i$ and $N$ the total number of documents):

$$\mathrm{QteInfo}(m_i) = -\log_2 P(m_i) \quad\longrightarrow\quad IDF(m_i) = -\log \frac{n_i}{N}$$

Document weighting $w_{i,D}$ / Query weighting $w_{i,R}$:

(a) $w_{i,D} = \dfrac{tf(m_i, D) \cdot \log\frac{N}{n(m_i)}}{\sqrt{\sum_{j/m_j \in D} \left( tf(m_j, D) \cdot \log\frac{N}{n(m_j)} \right)^2}}$ ; $w_{i,R} = \left( 0.5 + 0.5\, \dfrac{tf(m_i, R)}{\max_{j/m_j \in R} tf(m_j, R)} \right) \cdot \log\frac{N}{n(m_i)}$

(b) $w_{i,D} = 0.5 + 0.5\, \dfrac{tf(m_i, D)}{\max_{j/m_j \in D} tf(m_j, D)}$ ; $w_{i,R} = \log\dfrac{N - n(m_i)}{n(m_i)}$

(c) $w_{i,D} = \log\dfrac{N}{n(m_i)}$ ; $w_{i,R} = \log\dfrac{N}{n(m_i)}$

(d) $w_{i,D} = 1$ ; $w_{i,R} = \log\dfrac{N - n(m_i)}{n(m_i)}$

(e) $w_{i,D} = \dfrac{tf(m_i, D)}{\sqrt{\sum_{j/m_j \in D} tf(m_j, D)^2}}$ ; $w_{i,R} = tf(m_i, R)$

(f) $w_{i,D} = 1$ ; $w_{i,R} = 1$

Table 1 - Weightings cited and evaluated in [Salton & Buckley, 1988]

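A small sketch (mine, over an invented toy corpus) of the document side of scheme (a) from Table 1: tf times log(N/n(m_i)), followed by cosine (L2) normalization:

```python
import math
from collections import Counter

docs = [
    "the cat drove out the dog",
    "the dog of the neighbor",
    "baby health and safety guide",
]  # toy corpus (invented)

N = len(docs)
tokenized = [d.split() for d in docs]
# n(m_i): number of documents containing term m_i (document frequency)
df = Counter(t for doc in tokenized for t in set(doc))

def tfidf_vector(tokens):
    tf = Counter(tokens)
    # w_i = tf(m_i, D) * log(N / n(m_i)), then L2-normalized (scheme (a)).
    w = {t: f * math.log(N / df[t]) for t, f in tf.items()}
    norm = math.sqrt(sum(x * x for x in w.values())) or 1.0
    return {t: x / norm for t, x in w.items()}

for doc, toks in zip(docs, tokenized):
    print(doc, "->", {t: round(x, 2) for t, x in tfidf_vector(toks).items()})
```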

Page 12: Some Information Retrieval Models and Our Experiments for TREC KBA


Vector Space Model: some drawbacks

• The dimensions are orthogonal

– “automobile” and “car” are as distant as “car” and “apricot tree”…

—> the user query must contain the same words as the documents the user wishes to find…

• The word order and the syntax are not used

– the cat drove out the dog of the neighbor
– ≈ the dog drove out the cat of the neighbor
– ≈ the cat close to the dog drives out

– It assumes words are statistically independent
– It does not take into account the syntax of the sentences, nor the negations…

– “this paper is about politics” vs. “this paper is not about politics”: very similar sentences…


Page 13: Some Information Retrieval Models and Our Experiments for TREC KBA


Probabilistic model (1)

• 1976: Robertson and Sparck Jones

• Query: {relevant documents}: {features}

• Problem: guessing the characteristics (features) of the relevant documents (binary independence retrieval model: based on the presence or absence of terms)

• Solutions:
• an iterative and interactive process {user selection of relevant documents = relevance feedback}
• selection of the documents according to a cost function

2 The probabilistic model

The probabilistic model represents the document retrieval process as a decision process: the cost, for the user, associated with retrieving a document must be minimized. In other words, a document is shown to the user only if the cost associated with retrieving it is lower than the cost of not retrieving it (see [Losee, Kluwer, BU 006.35, p. 62]):

$$EC_{retr}(d) < EC_{\overline{retr}}(d) \quad (4)$$

with:

$$EC_{retr}(d) = P(rel \mid \vec{d})\, C_{retrieved,rel} + P(\overline{rel} \mid \vec{d})\, C_{retrieved,\overline{rel}} \quad (5)$$

where $P(rel \mid \vec{d})$ is the probability that document $d$ is relevant given its features $\vec{d}$, $P(\overline{rel} \mid \vec{d})$ the probability that it is not, $C_{retrieved,rel}$ the cost associated with retrieving a relevant document, and $C_{retrieved,\overline{rel}}$ that of retrieving a non-relevant one.

The decision rule then becomes: retrieve a document $d$ only if:

$$P(rel \mid \vec{d})\, C_{retr,rel} + P(\overline{rel} \mid \vec{d})\, C_{retr,\overline{rel}} < P(rel \mid \vec{d})\, C_{\overline{retr},rel} + P(\overline{rel} \mid \vec{d})\, C_{\overline{retr},\overline{rel}} \quad (6)$$

that is:

$$\frac{P(rel \mid \vec{d})}{P(\overline{rel} \mid \vec{d})} > \frac{C_{retr,\overline{rel}} - C_{\overline{retr},\overline{rel}}}{C_{\overline{retr},rel} - C_{retr,rel}} = \text{constant} = \lambda \quad (7)$$

The value of the constant $\lambda$ depends on the kind of search being performed: do we wish to favor recall or precision, etc.

Another way of looking at the probabilistic model is to consider that it seeks to model the set of relevant documents, in other words to estimate the probability that a given word appears in such documents.

2.1 Binary Relevance Model

Let $q$ be a query and $d_j$ a document. The probabilistic model tries to estimate the probability that the user finds document $d_j$ interesting given the query $q$. It assumes that there exists a set $R$ of interesting documents (the ideal set) and that these documents are the relevant ones. Let $\overline{R}$ be the complement of $R$. The model assigns to each document $d_j$ its probability of relevance as follows:

$$d_j \;\mapsto\; \frac{P(d_j \text{ is relevant})}{P(d_j \text{ is not relevant})} \quad (8)$$

$$sim(d_j, q) = \frac{P(R \mid \vec{d_j})}{P(\overline{R} \mid \vec{d_j})} \quad (9)$$


Page 14: Some Information Retrieval Models and Our Experiments for TREC KBA


Probabilistic model (2)

• Estimating the probability that a document d is relevant (or not relevant) for the query q

• Bayes theorem: using the probability of observing the document given relevance, the prior probability of relevance, and the probability of observing the document at random

• The Retrieval Status Value:


Thus, if the probability that $d_j$ is relevant is high but the probability that it is not relevant is also high, the similarity $sim(d_j, q)$ will be low. Since this quantity can only be computed if we know how to define the relevance of a document with respect to $q$ (which we do not), it must be estimated from examples of relevant documents.

By Bayes’ rule, $P(R \mid \vec{d_j}) = \frac{P(R) \cdot P(\vec{d_j} \mid R)}{P(\vec{d_j})}$, so the similarity equals:

$$sim(d_j, q) = \frac{P(\vec{d_j} \mid R) \cdot P(R)}{P(\vec{d_j} \mid \overline{R}) \cdot P(\overline{R})} \;\propto\; \frac{P(\vec{d_j} \mid R)}{P(\vec{d_j} \mid \overline{R})} \quad (10)$$

$P(\vec{d_j} \mid R)$ corresponds to the probability of randomly selecting $d_j$ from the set of relevant documents, and $P(R)$ to the probability that a document chosen at random from the collection is relevant. $P(R)$ and $P(\overline{R})$ do not depend on $q$, so they are not needed to rank the $sim(d_j, q)$. A threshold $\lambda$ can then be defined, below which documents are no longer considered relevant.

Assuming that words occur independently of one another in texts (an assumption that is obviously false... but workable in practice!), the probabilities reduce to bag-of-words probabilities:

$$P(\vec{d_j} \mid R) = \prod_{i=1}^{n} P(d_{j,i} \mid R) = \prod_{i=1}^{n} P(w_{m_i,d_j} \mid R) \quad (11)$$

$$P(\vec{d_j} \mid \overline{R}) = \prod_{i=1}^{n} P(d_{j,i} \mid \overline{R}) = \prod_{i=1}^{n} P(w_{m_i,d_j} \mid \overline{R}) \quad (12)$$

In the probabilistic model, the weights of the index entries $m_i$ are binary:

$$w_{m_i,d_j} \in \{0, 1\} \quad (13)$$

The probability of randomly selecting $d_j$ from the set of relevant documents is the product of the probabilities that the words of $d_j$ belong to a (randomly chosen) document of $R$ and of the probabilities that the words absent from $d_j$ do not belong to a (randomly chosen) document of $R$:

$$sim(d_j, q) \;\propto\; \frac{\prod_{m_i \in d_j} P(m_i \mid R) \times \prod_{m_i \notin d_j} P(\overline{m_i} \mid R)}{\prod_{m_i \in d_j} P(m_i \mid \overline{R}) \times \prod_{m_i \notin d_j} P(\overline{m_i} \mid \overline{R})} \quad (14)$$

with $P(m_i \mid R)$ the probability that word $m_i$ is present in a document randomly selected from $R$, and $P(\overline{m_i} \mid R)$ the probability that word $m_i$ is not present in a document randomly selected from $R$.

This equation can be split into two parts, according to whether the word belongs to document $d_j$ or not:

$$sim(d_j, q) \;\propto\; \prod_{m_i \in d_j} \frac{P(m_i \mid R)}{P(m_i \mid \overline{R})} \times \prod_{m_i \notin d_j} \frac{P(\overline{m_i} \mid R)}{P(\overline{m_i} \mid \overline{R})} \quad (15)$$



Page 15: Some Information Retrieval Models and Our Experiments for TREC KBA


Probabilistic model (3)

• Hypothesis: bag of words = words occur independently

• The Retrieval Status Value:


Let $p_i = P(m_i \in d_j \mid R)$ be the probability that the $i$-th word of $d_j$ appears in a relevant document, and $q_i = P(m_i \in d_j \mid \overline{R})$ the probability that it appears in a non-relevant document. Clearly $1 - p_i = P(m_i \notin d_j \mid R)$ and $1 - q_i = P(m_i \notin d_j \mid \overline{R})$. Finally, it is generally assumed that $p_i = q_i$ for the words that do not appear in the query ([Fuhr, 1992, "Probabilistic Models in IR"]). Under these conditions:

$$sim(d_j, q) \propto \prod_{m_i \in d_j} \frac{p_i}{q_i} \times \prod_{m_i \notin d_j} \frac{1 - p_i}{1 - q_i} \quad (16)$$

$$\propto \prod_{m_i \in d_j \cap q} \frac{p_i}{q_i} \times \prod_{m_i \in d_j, m_i \notin q} \frac{p_i}{q_i} \times \prod_{m_i \notin d_j, m_i \in q} \frac{1 - p_i}{1 - q_i} \times \prod_{m_i \notin d_j, m_i \notin q} \frac{1 - p_i}{1 - q_i} \quad (17)$$

$$\propto \prod_{m_i \in d_j \cap q} \frac{p_i}{q_i} \times \prod_{m_i \notin d_j, m_i \in q} \frac{1 - p_i}{1 - q_i} \quad (18)$$

$$= \prod_{m_i \in d_j \cap q} \frac{p_i}{q_i} \times \frac{\prod_{m_i \in q} \frac{1 - p_i}{1 - q_i}}{\prod_{m_i \in d_j \cap q} \frac{1 - p_i}{1 - q_i}} \quad (19)$$

$$= \prod_{m_i \in d_j \cap q} \frac{p_i (1 - q_i)}{q_i (1 - p_i)} \times \prod_{m_i \in q} \frac{1 - p_i}{1 - q_i} \quad (20)$$

The second factor of this product does not depend on the document (all the query words are taken into account, independently of $d_j$). Since all we want is to rank the documents, this factor can be ignored.

Taking the logarithm as well(*):

$$sim(d_j, q) \propto \sum_{m_i \in d_j \cap q} \log \frac{p_i (1 - q_i)}{q_i (1 - p_i)} = RSV(d_j, q) \quad (22)$$

$sim(d_j, q)$ is often called the RSV (Retrieval Status Value) of $d_j$ for the query $q$.

Keeping the same notation:

$$sim(d_j, q) \propto \sum_{m_i \in q \cap d_j} \left( \log \frac{P(m_i \mid R)}{1 - P(m_i \mid R)} + \log \frac{1 - P(m_i \mid \overline{R})}{P(m_i \mid \overline{R})} \right) \quad (23)$$

(*) Other derivations [Losee, Kluwer, BU 006.35, p. 65] compute the probabilities with a binary (Bernoulli) distribution, which describes the probability of a binary event (the word belongs or does not belong) as a function of the value of the variable and the probability of that value:

$$\varphi(x; p) = p^x (1 - p)^{1 - x} \quad (21)$$

which gives the probability that $x$ is 1 or 0 as a function of $p$. The parameter $p$ can be interpreted as the probability that $x$ equals 1, or as the proportion of times where $x = 1$.


• Let $p_i$ (resp. $q_i$) be the probability that a relevant (resp. a non-relevant) document contains $m_i$

• RSV = Retrieval Status Value

• A non-binary model? = using term frequency and document length
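A minimal sketch (mine, with invented $p_i$ and $q_i$ values) of the RSV of equation (22): each query term present in the document contributes its log-odds ratio.

```python
import math

# Invented per-term estimates: p_i = P(term | relevant), q_i = P(term | non-relevant).
p = {"baby": 0.8, "health": 0.6, "guide": 0.3}
q = {"baby": 0.2, "health": 0.1, "guide": 0.25}

def rsv(doc_terms, query_terms):
    # Equation (22): sum of log-odds ratios over terms shared by doc and query.
    score = 0.0
    for t in query_terms & doc_terms:
        score += math.log((p[t] * (1 - q[t])) / (q[t] * (1 - p[t])))
    return score

doc = {"baby", "health", "safety"}
query = {"baby", "health", "guide"}
print(round(rsv(doc, query), 3))  # contributions from "baby" and "health" only
```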


2.4 Learning the parameters automatically

Bayesian methods make it possible to estimate the parameters from the relevance feedback given by a user [Bookstein, 1983, "Information retrieval: A sequential learning process", JASIS].

2.5 Integrating non-binary distributions

Starting from the original probabilistic model, Robertson and the team of the Centre for Interactive Systems Research at City University (London) extended it so as to take into account the frequency of words in documents and in the query, as well as the length of documents. This extension originally corresponded to integrating Harter’s 2-Poisson model (which Harter used to select good index terms, not to weight them) into the probabilistic model. From the 2-Poisson model and the notion of an elite set $E$ for a word (for Harter, the set of documents most representative of the word’s usage; more generally, the set of documents that contain the word), the conditional probabilities $p(E \mid R)$, $p(E \mid \overline{R})$, $p(\overline{E} \mid R)$ and $p(\overline{E} \mid \overline{R})$ are derived, giving a new probabilistic model depending on $E$ and $\overline{E}$. Together with other variables such as document length and the number of occurrences of the word within the document, this model gave rise to a family of weightings called BM (Best Match).

In general, taking into account the weights $w$ of words in the documents and in the query is expressed as:

$$sim(d_j, q) = \sum_{m_i \in d_j \cap q} w_{m_i,d_j} \cdot w_{m_i,q} \cdot \log \frac{p_i (1 - q_i)}{q_i (1 - p_i)} \quad (33)$$

When no information about the set $R$ of relevant documents is available, it is customary to turn this equality into a classical dot product:

$$sim(d_j, q) = \sum_{m_i \in d_j \cap q} w_{m_i,d_j} \cdot w_{m_i,q} \quad (34)$$

2.5.1 Integrating Harter’s 2-Poisson model

A Poisson distribution is obtained when the number of events is very large and the elementary probability is very small (examples: defects on a production line, typos on a page). Some experiments have shown that the distribution of only 50% (up to 70% depending on the experiment) of words resembles a Poisson model [Margulis 91; Fuhr 92].

Definition 1 (Poisson distribution) Given $\mu_i$, the average number of occurrences of word $m_i$ per document in a set of documents $R$, the probability that the number of occurrences $f(m_i, d)$ …

Page 16: Some Information Retrieval Models and Our Experiments for TREC KBA


Eliteness

• « We hypothesize that occurrences of a term in a document have a random or stochastic element, which nevertheless reflects a real but hidden distinction between those documents which are “about” the concept represented by the term and those which are not. Those documents which are “about” this concept are described as “elite” for the term. »

• The assumption is that the distribution of within-document frequencies is Poisson for the elite documents, and also (but with a different mean) for the non-elite documents.

• Modeling within-document term frequencies by means of a mixture of two Poisson distributions


It would be possible to derive this model from a more basic one, under which a document was a random stream of term occurrences, each one having a fixed, small probability of being the term in question, this probability being constant over all elite documents, and also constant (but smaller) over all non-elite documents. Such a model would require that all documents were the same length. Thus the 2-Poisson model is usually said to assume that document length is constant: although technically it does not require that assumption, it makes little sense without it. Document length is discussed further below (section 5).

The approach taken in [6] was to estimate the parameters of the two Poisson distributions for each term directly from the distribution of within-document frequencies. These parameters were then used in various weighting functions. However, little performance benefit was gained. This was seen essentially as a result of estimation problems: partly that the estimation method for the Poisson parameters was probably not very good, and partly because the model is complex in the sense of requiring a large number of different parameters to be estimated. Subsequent work on mixed-Poisson models has suggested that alternative estimation methods may be preferable [9].

Combining the 2-Poisson model with formula 4, under the various assumptions given about dependencies, we obtain [6] the following weight for a term t:

$$w = \log \frac{\left( p' \lambda^{tf} e^{-\lambda} + (1 - p') \mu^{tf} e^{-\mu} \right) \left( q' e^{-\lambda} + (1 - q') e^{-\mu} \right)}{\left( q' \lambda^{tf} e^{-\lambda} + (1 - q') \mu^{tf} e^{-\mu} \right) \left( p' e^{-\lambda} + (1 - p') e^{-\mu} \right)}, \quad (5)$$

where λ and μ are the Poisson means for tf in the elite and non-elite sets for t respectively, $p' = P(\text{document elite for } t \mid R)$, and $q'$ is the corresponding probability for $\overline{R}$.

The estimation problem is very apparent from equation 5, in that there are four parameters for each term, for none of which are we likely to have direct evidence (because of eliteness being a hidden variable). It is precisely this estimation problem which makes the weighting function intractable. This consideration leads directly to the approach taken in the next section.

4 A Rough Model for Term Frequency

4.1 The Shape of the tf Effect

Many different functions have been used to allow within-document term frequency tf to influence the weight given to the particular document on account of the term in question. In some cases a linear function has been used; in others, the effect has been dampened by using a suitable transformation such as log tf.

Even if we do not use the full equation 5, we may allow it to suggest the shape of an appropriate, but simpler, function. In fact, equation 5 has the following characteristics: (a) It is zero for tf = 0; (b) it increases monotonically with tf; (c) but to an asymptotic maximum; (d) which approximates to the Robertson/Sparck Jones weight that would be given to a direct indicator of eliteness.

Only in an extreme case, where eliteness is identical to relevance, is the function linear in tf. These points can be seen from the following re-arrangement of equation 5:

$$w = \log \frac{\left( p' + (1 - p') (\mu/\lambda)^{tf} e^{\lambda - \mu} \right) \left( q' e^{\mu - \lambda} + (1 - q') \right)}{\left( q' + (1 - q') (\mu/\lambda)^{tf} e^{\lambda - \mu} \right) \left( p' e^{\mu - \lambda} + (1 - p') \right)}. \quad (6)$$

μ is smaller than λ. As tf → ∞ (to give us the asymptotic maximum), $(\mu/\lambda)^{tf}$ goes to zero, so those components drop out. $e^{\mu - \lambda}$ will be small, so the approximation is:

$$w = \log \frac{p' (1 - q')}{q' (1 - p')}. \quad (7)$$

(The last approximation may not be a good one: for a poor and/or infrequent term, $e^{\mu - \lambda}$ will not be very small. Although this should not affect the component in the numerator, because $q'$ is likely to be small, it will affect the component in the denominator.)

4.2 A Simple Formulation

What is required, therefore, is a simple tf-related weight that has something like the characteristics (a)-(d) listed in the previous section. Such a function can be constructed as follows. The function tf/(constant + tf) increases from zero to an asymptotic maximum in approximately the right fashion. The constant determines the rate at which the increase drops off: with a large constant, the function

Robertson & Walker, 1994, ACM SIGIR

$$p(k) = \frac{\lambda^k}{k!} e^{-\lambda}$$

[Figure: scatter of points labelled A and B, two document populations whose term frequencies follow two different Poisson distributions.]
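To see the shape described above, here is a sketch (mine, with invented parameter values) of the eliteness weight of equation (5): it is zero at tf = 0, grows monotonically, and saturates towards the asymptote of equation (7).

```python
import math

def two_poisson_weight(tf, lam, mu, p_elite, q_elite):
    """Equation (5): eliteness weight for a term with within-doc frequency tf.
    lam/mu: Poisson means in the elite/non-elite sets;
    p_elite = P(elite | R), q_elite = the corresponding probability for non-R."""
    num = (p_elite * lam**tf * math.exp(-lam) + (1 - p_elite) * mu**tf * math.exp(-mu)) \
        * (q_elite * math.exp(-lam) + (1 - q_elite) * math.exp(-mu))
    den = (q_elite * lam**tf * math.exp(-lam) + (1 - q_elite) * mu**tf * math.exp(-mu)) \
        * (p_elite * math.exp(-lam) + (1 - p_elite) * math.exp(-mu))
    return math.log(num / den)

lam, mu, p_el, q_el = 4.0, 0.5, 0.6, 0.05  # invented values, lam > mu
for tf in [0, 1, 2, 5, 10, 50]:
    print(tf, round(two_poisson_weight(tf, lam, mu, p_el, q_el), 3))
# tf = 0 gives 0; large tf approaches log(p'(1-q')/(q'(1-p'))), equation (7):
print(round(math.log(p_el * (1 - q_el) / (q_el * (1 - p_el))), 3))
```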

Page 17: Some Information Retrieval Models and Our Experiments for TREC KBA


Divergence From Randomness (DFR) models

• The 2-Poisson model: in an elite set of documents, informative words occur to a greater extent than in the rest of the documents of the collection. But other words do not possess elite documents, and their frequencies follow a random distribution.

• Divergence from randomness (DFR): — selecting a basic randomness model — applying normalisations

• « The more the divergence of the within-document term-frequency from its frequency within the collection, the more the information carried by the word t in the document d »

• « if a rare term has many occurrences in a document then it has a very high probability (almost the certainty) to be informative for the topic described by the document »

• By using a binomial distribution or a geometric distribution


$$score(d, Q) = \sum_{t \in Q} qtw \cdot w(t, d)$$

http://ir.dcs.gla.ac.uk/wiki/FormulasOfDFRModels

I(n)L2:

$$w(t, d) = \frac{1}{tfn + 1} \cdot tfn \cdot \log_2 \frac{N + 1}{n_t + 0.5}$$
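A compact sketch (mine) of DFR scoring with the I(n)L2 formula above; for the term-frequency normalisation tfn I assume Terrier-style "Normalisation 2" (tfn = tf · log2(1 + c · avg_len/len)), and all counts are invented.

```python
import math

def tfn_normalisation2(tf, doc_len, avg_doc_len, c=1.0):
    # DFR "Normalisation 2" (assumed here for tfn).
    return tf * math.log2(1 + c * avg_doc_len / doc_len)

def inl2_weight(tf, doc_len, avg_doc_len, N, n_t, c=1.0):
    # I(n)L2: w(t,d) = 1/(tfn+1) * tfn * log2((N+1)/(n_t+0.5))
    tfn = tfn_normalisation2(tf, doc_len, avg_doc_len, c)
    return (tfn / (tfn + 1)) * math.log2((N + 1) / (n_t + 0.5))

def score(query_terms, doc_tf, doc_len, avg_doc_len, N, df):
    # score(d,Q) = sum over t in Q of qtw * w(t,d); qtw = query term weight.
    return sum(qtw * inl2_weight(doc_tf[t], doc_len, avg_doc_len, N, df[t])
               for t, qtw in query_terms.items() if doc_tf.get(t, 0) > 0)

# Invented toy statistics
N, df = 1000, {"baby": 40, "health": 120}
doc_tf, doc_len, avg_len = {"baby": 3, "health": 1}, 150, 200
print(round(score({"baby": 1, "health": 1}, doc_tf, doc_len, avg_len, N, df), 3))
```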

Page 18: Some Information Retrieval Models and Our Experiments for TREC KBA


Probabilistic model (4)

• Estimating p and q? = better estimating term weights according to the number of documents n_i containing word m_i and the total number N of documents

• Iterative process (relevance feedback): the user selects the relevant documents from a first list of retrieved documents

• if no sample is available = pseudo-relevance feedback (and the 2-Poisson model)

• With no relevance information, it approximates TF/IDF:

If the number of occurrences $f(m_i, d_j)$ of $m_i$ in $d_j$ is taken into account, we obtain:

$$sim(d_j, q) \propto \sum_{m_i \in d_j \cap q} f(m_i, d_j) \cdot \log \frac{p_i (1 - q_i)}{q_i (1 - p_i)} \quad (24)$$

2.2 Parameter estimation

2.3 Original method without relevance feedback

At the first iteration, no relevant document has been found yet, so the values of $P(m_i \mid R)$ and $P(m_i \mid \overline{R})$ must be posited. One thus assumes that any word of the index has one chance in two of being present in a relevant document, and that the probability that a word is present in a non-relevant document is proportional to its distribution in the collection (given that the number of non-relevant documents is generally much larger than the number of relevant ones):

$$P(m_i \mid R) = 0.5 \quad (25)$$

$$P(m_i \mid \overline{R}) = \frac{n_i}{N} \quad (26)$$

with $n_i$ the number of documents of the collection that contain $m_i$ and $N$ the total number of documents of the collection. These values must be re-estimated at each iteration from the documents they allow to retrieve (and, possibly, from the user’s selection of the relevant ones).

From these initial values, $sim(d_j, q)$ can be computed for every document of the collection, keeping only those whose similarity exceeds $\lambda$. Choosing $\lambda$ amounts to choosing a rank $r$ beyond which documents are discarded. Let $V_i$ be the number of documents, in the subset of retained documents, that contain $m_i$ ($V$ being the number of retained documents). $P(m_i \mid R)$ and $P(m_i \mid \overline{R})$ are then computed recursively:

$$P(m_i \mid R) = \frac{V_i}{V} \quad (27)$$

$$P(m_i \mid \overline{R}) = \frac{n_i - V_i}{N - V} \quad (28)$$

or (to avoid a problem with the values $V = 1$ and $V_i = 0$):

$$P(m_i \mid R) = \frac{V_i + 0.5}{V + 1} \quad (29)$$

$$P(m_i \mid \overline{R}) = \frac{n_i - V_i + 0.5}{N - V + 1} \quad (30)$$

and, more often:

$$P(m_i \mid R) = \frac{V_i + \frac{n_i}{N}}{V + 1} \quad (31)$$

$$P(m_i \mid \overline{R}) = \frac{n_i - V_i + \frac{n_i}{N}}{N - V + 1} \quad (32)$$


V <=> threshold (cost)


1st estimation
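A sketch (mine, with invented counts) of one feedback iteration: start from the no-information estimates (25)-(26), retain V documents, re-estimate with the smoothed forms (29)-(30), and plug the result into the log-odds weight of equation (22).

```python
import math

N = 10000          # documents in the collection (invented)
n_i = 300          # documents containing term m_i (invented)
V, V_i = 20, 9     # retained (pseudo-relevant) docs, and those containing m_i

# First iteration, no relevance information, equations (25)-(26):
p_i, q_i = 0.5, n_i / N

# After retaining V documents, smoothed re-estimation, equations (29)-(30):
p_i = (V_i + 0.5) / (V + 1)
q_i = (n_i - V_i + 0.5) / (N - V + 1)

# Term relevance weight, as in equation (22):
w = math.log(p_i * (1 - q_i) / (q_i * (1 - p_i)))
print(round(p_i, 3), round(q_i, 5), round(w, 3))
```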

Among the drawbacks of the model is the systematic prediction that two occurrences of a word in a document are less probable than three or four occurrences. As the 2-Poisson model did not lead to particularly satisfactory results, other authors proposed a mixture of n Poisson distributions [Margulis, cited by Ponte & Croft, ACM SIGIR 1998]. Another possibility is to use Katz’s K-mixture, which gives results as good as a negative binomial distribution but is much simpler to use; see [Manning & Schütze, "Foundations of...", p. 549].

2.5.2 Integrating a Gaussian model

If words are considered to be normally distributed, the similarity proposed in 1982 by Bookstein is:

$$RSV(d_j, q) = \sum_{m_i \in q \cap d_j} f(m_i, d_j) \left[ \left( \frac{\mu_{m_i}}{\sigma_{m_i}^2} - \frac{\overline{\mu}_{m_i}}{\overline{\sigma}_{m_i}^2} \right) - \frac{f(m_i, d_j)}{2} \cdot \left( \frac{1}{\sigma_{m_i}^2} - \frac{1}{\overline{\sigma}_{m_i}^2} \right) \right] \quad (41)$$

with $\mu$ and $\sigma$ the means and standard deviations in $R$ and in $\overline{R}$.

2.5.3 The Okapi weightings

A common way to define the IDF (Inverse Document Frequency) component, with $N$ the number of documents in the collection and $n(m_i)$ the number of documents of the collection containing $m_i$, is(*):

$$IDF(m_i) = \log \frac{N - n(m_i) + 0.5}{n(m_i) + 0.5} \quad (43)$$

The number of occurrences $f(m_i, d_j)$ is generally normalized by the average length $\overline{l}$ of the documents of the collection, $l(d_j)$ being the size (in word occurrences) of $d_j$. With $K$ a constant, usually chosen between 1.0 and 2.0, one possibility is to define the TF component so as to favor short documents:

$$TF(m_i, d_j) = \frac{(K + 1) \cdot f(m_i, d_j)}{f(m_i, d_j) + K \cdot (l(d_j)/\overline{l})} \quad (44)$$

A large number of weightings have been tested. The first results were published during the TREC-2 and TREC-3 campaigns. The definition of these new weightings was concomitant with the generalization of automatic query expansion from the first retrieved documents.

(*) Since $n(m_i)$ is generally small with respect to $N$, this definition can sometimes be simplified to:

$$IDF(m_i) = \log \left( \frac{N + 0.5}{n(m_i) + 0.5} \right) \quad (42)$$


Page 19: Some Information Retrieval Models and Our Experiments for TREC KBA


Probabilistic model (5)

• “OKAPI” (BM25) with tuning constants = a (very) good baseline

Let:
– $N$ the number of documents in the collection;
– $n(m_i)$ the number of documents containing the word $m_i$;
– $R$ the number of documents known to be relevant for the query $q$;
– $r(m_i)$ the number of those $R$ documents containing the word $m_i$;
– $tf(m_i, d_j)$ the number of occurrences of $m_i$ in $d_j$;
– $tf(m_i, q)$ the number of occurrences of $m_i$ in $q$;
– $l(d_j)$ the size (in words) of $d_j$;
– $\overline{l}$ the average size of the documents of the collection;
– $k_i$ and $b$ parameters depending on the query and, if possible, on the collection.

The weight $w$ of a word $m_i$ is defined by:

$$w(m_i) = \log \frac{(r(m_i) + 0.5)/(R - r(m_i) + 0.5)}{(n(m_i) - r(m_i) + 0.5)/(N - n(m_i) - R + r(m_i) + 0.5)} \quad (45)$$

Definition 3 (BM25) The BM25 weighting is defined as follows:

$$sim(d_j, q) = \sum_{m_i \in q} w(m_i) \times \frac{(k_1 + 1) \cdot tf(m_i, d_j)}{K + tf(m_i, d_j)} \times \frac{(k_3 + 1) \cdot tf(m_i, q)}{k_3 + tf(m_i, q)} \quad (46)$$

with:

$$K = k_1 \cdot \left( (1 - b) + b \cdot \frac{l(d_j)}{\overline{l}} \right) \quad (47)$$

When no information about $R$ and $r(m_i)$ is available, this definition reduces to (the weighting used in the Okapi system during TREC-1):

$$w(m_i) = \log \frac{N - n(m_i) + 0.5}{n(m_i) + 0.5} \quad (48)$$

with $R = r(m_i) = 0$. These are the values used in the two following examples.

During the TREC-8 campaign, the Okapi system was used with the values $k_1 = 1.2$, $b = 0.75$ (lower values of $b$ are sometimes worthwhile) and, for long queries, $k_3$ set either to 7 or to 1000:

$$sim(d_j, q) = \sum_{m_i \in q} \frac{2.2 \cdot tf(m_i, d_j)}{0.3 + 0.9 \cdot \frac{l(d_j)}{\overline{l}} + tf(m_i, d_j)} \times \frac{1001 \cdot tf(m_i, q)}{1000 + tf(m_i, q)} \times \log_2 \frac{N - n(m_i) + 0.5}{n(m_i) + 0.5} \quad (49)$$

The Inquery system [Allan, 1996] uses BM25 with $k_1 = 2$, $b = 0.75$ and $\forall i: tf(m_i, q) = 1$:

$$sim(d_j, q) = \sum_{m_i \in q} \frac{tf(m_i, d_j)}{0.5 + 1.5 \cdot \frac{l(d_j)}{\overline{l}} + tf(m_i, d_j)} \times \frac{\log_2 \frac{N + 0.5}{n(m_i)}}{\log_2 (N + 1)} \quad (50)$$

3 Language models for document retrieval

Unlike the probabilistic model, which tries to represent the set of relevant documents, language-model-based document retrieval sets out to model the process …



7 Experiments

7.1 TREC

The TREC (Text REtrieval Conference) conferences, of which there have been two, with the third due to start early 1994, are concerned with controlled comparisons of different methods of retrieving documents from large collections of assorted textual material. They are funded by the US Advanced Research Projects Agency (ARPA) and organised by Donna Harman of NIST (National Institute of Standards and Technology). There were about 31 participants, academic and commercial, in the TREC-2 conference which took place at Gaithersburg, MD in September 1993 [2]. Information needs are presented in the form of highly structured "topics" from which queries are to be derived automatically and/or manually by participants. Documents include newspaper articles, entries from the Federal Register, patents and technical abstracts, varying in length from a line or two to several hundred thousand words.

A large number of relevance judgments have been made at NIST by a panel of experts assessing thetop-ranked documents retrieved by some of the participants in TREC–1 and TREC–2. The number ofknown relevant documents for the 150 topics varies between 1 and more than 1000, with a mean of 281.

7.2 Experiments Conducted

Some of the experiments reported here were also reported at TREC–2 [1].

Database and Queries

The experiments reported here involved searches of one of the TREC collections, described as disks 1 & 2 (TREC raw data has been distributed on three CD-ROMs). It contains about 743,000 documents. It was indexed by keyword stems, using a modified Porter stemming procedure [13], spelling normalisation designed to conflate British and American spellings, a moderate stoplist of about 250 words, and a small cross-reference table and "go" list. Topics 101-150 of the 150 TREC-1 and -2 topic statements were used. The mean length (number of unstopped tokens) of the queries derived from the title and concepts fields only was 30.3; for those additionally using the narrative and description fields, the mean length was 81.

Search Procedure

Searches were carried out automatically by means of City University's Okapi text retrieval software. The weighting functions described in Sections 4-6 were implemented as BM15 (the model using equation 8 for the document term frequency component) and BM11 (using equation 10). Both functions incorporated the document length correction factor of equation 13. These were compared with BM1 (w(1) weights, approximately ICF, since no relevance information was used in these experiments) and with a simple coordination-level model BM0 in which terms are given equal weights. Note that BM11 and BM15 both reduce to BM1 when k1 and k2 are zero. The within-query term frequency component (equation 15) could be used with any of these functions.

To summarize, the following functions were used:

w = 1    (BM0)

w = \log \frac{N - n + 0.5}{n + 0.5} \times \frac{qtf}{k_3 + qtf}    (BM1)

w = \frac{tf}{k_1 + tf} \times \log \frac{N - n + 0.5}{n + 0.5} \times \frac{qtf}{k_3 + qtf} + k_2 \times nq \times \frac{\Delta - d}{\Delta + d}    (BM15)

w = \frac{tf}{k_1 \cdot d / \Delta + tf} \times \log \frac{N - n + 0.5}{n + 0.5} \times \frac{qtf}{k_3 + qtf} + k_2 \times nq \times \frac{\Delta - d}{\Delta + d}    (BM11)

where d is the document length, \Delta the average document length and nq the query length.

In the experiments reported below where k3 is given as ∞, the factor qtf/(k3 + qtf) is implemented as qtf on its own (equation 16).

BM = Best Match

Page 20: Some Information Retrieval Models and Our Experiments for TREC KBA

P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition)

Generative models - eg. Language model

• A model that « generates » phrases

• A probability distribution (unigrams, bigrams, n-grams) over samples

• For IR: what is the probability that a document produces a given query? This query likelihood is used as an estimate of the probability that the document is relevant

• IR = what is the document that is the most likely to generate the query

• Different types of language models: unigram models assume word independence

• Estimating P(t|d) with Maximum Likelihood (the number of times the query word t occurs in the document d divided by the total number of word occurrences in d)

• Problem: estimating the « zero frequency » probability (t may not occur in d) → smoothing (Laplace, Jelinek-Mercer, Dirichlet…)

20

Standard LM Approach

Assume that query terms are drawn identically and independently from a document (unigram models):

P(q|d) = \prod_{t \in q} P(t|d)^{n(t,q)}

where n(t, q) is the number of occurrences of term t in query q.

Maximum Likelihood Estimate of P(t|d): simply use the number of times the query term occurs in the document divided by the total number of term occurrences.

Problem: the zero probability (frequency) problem.

Mounia Lalmas (Yahoo! Research), 20-21 June 2011, 118/171

Document Priors

Remember P(d|q) = P(q|d)P(d)/P(q) ∝ P(q|d)P(d). P(d) is typically assumed to be uniform, so it is usually ignored, leading to P(d|q) ≈ P(q|d). Yet P(d) provides an interesting avenue for encoding a priori knowledge about the document:

- Document length (longer doc → more relevant)
- Average word length (bigger words → more relevant)
- Time of publication (newer doc → more relevant)
- Number of web links (more in-links → more relevant)
- PageRank (more popular → more relevant)

Mounia Lalmas (Yahoo! Research), 20-21 June 2011, 125/171
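As a small sketch of how such a prior combines with the query likelihood (the length-based prior P(d) = |d| / total corpus length is an assumption for the example, not taken from the slides; a query log-likelihood such as the one in the smoothing sketch below can be plugged in):

    import math

    def posterior_log_score(query_loglik, doc_len, total_len):
        """log P(d|q) up to a constant: log P(q|d) + log P(d),
        with a simple document-length prior (longer doc -> higher prior)."""
        return query_loglik + math.log(doc_len / total_len)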

Estimating Document Models

Examples of smoothing methods:

Laplace: P(t|\theta_d) = \frac{n(t,d) + \alpha}{\sum_{t'} n(t',d) + \alpha |T|}, where |T| is the number of terms in the vocabulary

Jelinek-Mercer: P(t|\theta_d) = \lambda \cdot P(t|d) + (1 - \lambda) \cdot P(t)

Dirichlet: P(t|\theta_d) = \frac{|d|}{|d| + \mu} \cdot P(t|d) + \frac{\mu}{|d| + \mu} \cdot P(t)

Mounia Lalmas (Yahoo! Research), 20-21 June 2011, 123/171

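A minimal Python sketch of the query likelihood with Dirichlet smoothing (the background model estimated from collection counts, and the variable names, are illustrative assumptions):

    import math
    from collections import Counter

    def query_log_likelihood(query_terms, doc_terms, collection_tf, collection_len, mu=2000):
        """log P(q|d) under a unigram model with Dirichlet smoothing;
        collection_tf / collection_len define the background model P(t)."""
        tf_d = Counter(doc_terms)
        dl = len(doc_terms)
        log_p = 0.0
        for t in query_terms:
            p_bg = collection_tf.get(t, 0) / collection_len
            if p_bg == 0:            # term unseen everywhere: skip (or use a floor)
                continue
            p = (tf_d[t] + mu * p_bg) / (dl + mu)   # Dirichlet-smoothed P(t|theta_d)
            log_p += math.log(p)
        return log_p

The smoothed estimate (tf + µ·P(t)) / (|d| + µ) is exactly the Dirichlet formula above, rewritten as a single fraction.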

12.3.2. Naive Bayes classification (language models)

A language model [DEM 98] is a set of properties and constraints on word sequences, obtained from examples. These examples may represent, more or less faithfully, a language or a topic. Estimating probabilities from examples makes it possible, by extension, to determine the probability that an arbitrary sentence could have been generated by the model. Categorizing a new text then amounts to computing the probability of its word sequence under the language model of each category; the new text is labeled with the topic whose language model yields the maximal probability.

Let W be a sequence of words w_1, w_2, …, w_n. We assume that word occurrences are mutually independent (an obviously false hypothesis, but one that works rather well). With a trigram language model (history of length 2), the probability of this word sequence can be computed as:

P(W) = \prod_{i=1}^{n} P(w_i | w_{i-2}, w_{i-1})    [12.7]

The representativeness of the training corpus with respect to the data to be processed is crucial (8). Nigam et al. [NIG 00] have shown, however, that using an EM algorithm can partly compensate for having too little of the latter.

Example. Bayes' rule can be used to solve categorization problems. Suppose, for instance, that we want to determine the language mostly used in a text. We then compute the probability of each language L given the text S. Bayes' formula "inverts" this probability into factors computed from the language models of the different languages. Comparing

P(L = English | S) = P(S | L = English) · P(L = English) / P(S)

and

P(L = Spanish | S) = P(S | L = Spanish) · P(L = Spanish) / P(S)    [12.8]

amounts to comparing only the numerators, since P(S) is identical in both cases.

(8) The computation will most likely involve trigrams that were never observed. The simplest technique to handle this problem is to systematically add a small number k of occurrences to each word and to renormalize all the counts.
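A compact Python sketch of this trigram language-identification scheme, with the add-k smoothing suggested in footnote (8); the corpus variables are placeholders, not data from the source:

    import math
    from collections import Counter

    def train_trigram_lm(words, k=0.5):
        """Add-k smoothed trigram model P(w_i | w_{i-2}, w_{i-1})."""
        trigrams = Counter(zip(words, words[1:], words[2:]))
        bigrams = Counter(zip(words, words[1:]))
        vocab_size = len(set(words))
        def logprob(w2, w1, w):
            return math.log((trigrams[(w2, w1, w)] + k) /
                            (bigrams[(w2, w1)] + k * vocab_size))
        return logprob

    def sequence_loglik(words, lm):
        """log P(W): sum of trigram log-probabilities (equation 12.7)."""
        return sum(lm(w2, w1, w) for w2, w1, w in zip(words, words[1:], words[2:]))

    # Label a text with the language whose model maximizes P(S|L):
    # english_lm = train_trigram_lm(english_corpus_words)
    # spanish_lm = train_trigram_lm(spanish_corpus_words)
    # best = max([("English", english_lm), ("Spanish", spanish_lm)],
    #            key=lambda pair: sequence_loglik(text_words, pair[1]))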

Page 21: Some Information Retrieval Models and Our Experiments for TREC KBA

P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition)

Language models (2)

• Priors make it possible to take into account diverse elements about the documents / the collection / the query

• the document length (the longer a document, the more relevant it is?)

• the time of publication

• the number of links / citations

• the page rank of the document (Web)

• the language…

• Sequential Dependence Model

21

SDM(Q, D) = \lambda_T \sum_{q \in Q} f_T(q, D) + \lambda_O \sum_{i=1}^{|Q|-1} f_O(q_i, q_{i+1}, D) + \lambda_U \sum_{i=1}^{|Q|-1} f_U(q_i, q_{i+1}, D)

with, typically, \lambda_T = 0.85, \lambda_O = 0.1, \lambda_U = 0.05, where f_T scores single query terms, f_O ordered (adjacent) query term pairs and f_U unordered query term pairs within a window.

http://www.lemurproject.org

#weight( 0.75 #combine( hubble telescope achievements )
         0.25 #combine( universe system mission search galaxies ) )
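A rough Python sketch of SDM scoring under the definitions above; the window size, the background constant and the exact counting scheme are illustrative assumptions, not the reference implementation:

    import math

    def count_ordered(pair, doc):
        """Number of exact adjacent occurrences of the bigram `pair` in `doc`."""
        return sum(1 for a, b in zip(doc, doc[1:]) if (a, b) == pair)

    def count_unordered(pair, doc, w=8):
        """Number of length-w windows containing both terms, in any order
        (a simplified counting scheme)."""
        return sum(1 for i in range(len(doc))
                   if pair[0] in doc[i:i + w] and pair[1] in doc[i:i + w])

    def feature(count, doc_len, bg=1e-6, mu=2000):
        """Dirichlet-smoothed log-probability feature; bg stands in for a
        collection background probability."""
        return math.log((count + mu * bg) / (doc_len + mu))

    def sdm_score(query, doc, lam_t=0.85, lam_o=0.10, lam_u=0.05):
        """SDM(Q, D): weighted sum of term, ordered and unordered features."""
        s = lam_t * sum(feature(doc.count(q), len(doc)) for q in query)
        for q1, q2 in zip(query, query[1:]):
            s += lam_o * feature(count_ordered((q1, q2), doc), len(doc))
            s += lam_u * feature(count_unordered((q1, q2), doc), len(doc))
        return s

    # Example: sdm_score(["hubble", "telescope"],
    #                    ["the", "hubble", "telescope", "images"])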

Page 22: Some Information Retrieval Models and Our Experiments for TREC KBA

P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition)

Some other models

• Inference networks (Bayesian networks) : combination of distinct evidence sources - modeling causal relationship - ex. Probabilistic inference network (Inquery) —> cf. Learning to rank from multiple and diverse features

• Fuzzy models

• (Extended) Boolean Model / Inference logical models

• Information-based models

• Algebraic models (Latent Semantic Indexing…)

• Semantic IR models based on ontologies and conceptualization

• and… Web-based models (PageRank…) / XML-based models…

22

Page 23: Some Information Retrieval Models and Our Experiments for TREC KBA

P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition)

Web Page Retrieval

IR Systems on the web

Use many scores (> 300)

• Similarity between the query and the docs

• Localization of the keywords in the pages

• Structure of the pages

• Page Authority (Google’s PageRank)

• Domain Authority

23

— Hyperlink matrix (the link structure of the Web): the entry a_{i,j} = 1 / |O_i| if there is a link from page i to page j, and 0 otherwise, where |O_i| is the number of outgoing links of page i.

Page 24: Some Information Retrieval Models and Our Experiments for TREC KBA

P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition)

PageRank

The authority of a Web page? The authority of a Web site, of a domain?

24

Random Walk : the PageRank of a page is the probability of arriving at that page after a large number of clicks

http://en.wikipedia.org/wiki/PageRank

Page 25: Some Information Retrieval Models and Our Experiments for TREC KBA

P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) 25

Fast, Scalable Graph Processing: Apache Giraph on YARN

1. All vertices start with the same PageRank: 1.0, 1.0, 1.0.

2. Each vertex distributes an equal portion of its PageRank to all neighbors: 0.5 / 0.5, 1, 1.

3. Each vertex sums the incoming values times a weight factor and adds in a small adjustment 0.15 / (# vertices in graph): (.5*.85) + (.15/3), (1.5*.85) + (.15/3), (1*.85) + (.15/3).

4. This value becomes the vertex's PageRank for the next iteration: .43, .21, .64.

5. Repeat until convergence (change in PR per iteration < epsilon).

From : Fast, Scalable Graph Processing: Apache Giraph on YARNhttp://fr.slideshare.net/Hadoop_Summit/fast-scalable-graph-processing-apache-giraph-on-yarn
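A minimal Python power-iteration sketch of the procedure in steps 1-5 above (it initializes ranks to 1/N rather than 1.0 and assumes every page appears as a key, with no dangling-node handling; a simplification, not the Giraph implementation):

    def pagerank(out_links, damping=0.85, eps=1e-8, max_iter=100):
        """Spread rank over out-links, then keep
        damping * incoming + (1 - damping) / N, until convergence."""
        n = len(out_links)
        pr = {v: 1.0 / n for v in out_links}
        for _ in range(max_iter):
            incoming = dict.fromkeys(out_links, 0.0)
            for v, outs in out_links.items():
                for u in outs:
                    incoming[u] += pr[v] / len(outs)
            new_pr = {v: damping * incoming[v] + (1 - damping) / n
                      for v in out_links}
            if max(abs(new_pr[v] - pr[v]) for v in out_links) < eps:
                return new_pr
            pr = new_pr
        return pr

    # Three pages: a -> b, b -> c, c -> a and c -> b
    print(pagerank({"a": ["b"], "b": ["c"], "c": ["a", "b"]}))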

Page 26: Some Information Retrieval Models and Our Experiments for TREC KBA

P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) 26

Page 27: Some Information Retrieval Models and Our Experiments for TREC KBA

P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) 27

Page 28: Some Information Retrieval Models and Our Experiments for TREC KBA

P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition)

Entity oriented IR on the Web — Example: LSIS / Kware @ TREC KBA

28

Page 29: Some Information Retrieval Models and Our Experiments for TREC KBA

P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) 29

http://trec-­‐kba.org/ Knowledge  Base  Acceleration

2014  :  1.2B  documents  (Web,  social…),  11  TB http://s3.amazonaws.com/aws-­‐publicdatasets/trec/kba/index.html

Page 30: Some Information Retrieval Models and Our Experiments for TREC KBA

P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition)

Some Challenges

- Queries focused on a specific entity

- Key issues

- Ambiguity in names = Need Disambiguation

- Profile definition

- Novelty detection / event detection / event attribution

- Dynamic models (outdated information, new information, new aspects/properties)

- Time oriented IR models

30

Page 31: Some Information Retrieval Models and Our Experiments for TREC KBA

P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) 31

Evaluation using the TREC KBA Framework

Our Approach:
— is based on two classifiers in cascade: one for filtering out non-mentioning documents and the other to dissociate poorly relevant documents from centrally relevant ones;
— does not require new training data when processing a new entity;
— deals with three different types of features, based on: the entity, the time and the found documents;
— has been evaluated using the Knowledge Base Acceleration Framework provided for the Text REtrieval Conference 2012 (TREC KBA).

About KBA: first session in 2012; participants: … teams; our rank: …rd (before enhancement); number of submissions: …

Figure 1: Time lag between the publication date of cited news articles and the date of an edit to WP creating the citation (Frank et al 2012)

Table 1: KBA 2012 results
Run           F-Measure
Our Approach  .382
Best KBA      .359
Median KBA    .289
Mean KBA      .220

Table 2: Robustness evaluation results
Run                      F-Measure
1 vs All                 .361
1 vs All Top10 Features  .355
Cross10                  .355
Cross 5                  .350
Cross 3                  .354
Cross 2                  .339

Figure 2: Variable Importance for Classification


Page 32: Some Information Retrieval Models and Our Experiments for TREC KBA

P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) 32

Approach for Documents Filtering on a Content Stream, by Vincent Bouvier, Ludovic Bonnefoy, Patrice Bellot, Michel Benoit

Text REtrieval Conference: Knowledge Base Acceleration. KBA is about retrieving and filtering information from a content stream in order to expand knowledge bases like Wikipedia and to recommend edits.

Topic Preprocessing: variant extraction using:
- bold text found in the topic's wikipedia page;
- text from links that point to the topic's wikipedia page in the whole wikipedia corpus.

Example:
Boris_Berezovsky_(business-man): boris berezovsky; boris abramovich berezovsky
Boris_Berezovsky_(pianist): boris berezovsky; boris vadimovich berezovsky

Relation extraction is also performed using link titles from and to the topic's wikipedia page.

Information Retrieval: we adopted a recall-oriented approach, aiming to retrieve all documents containing at least one of the previously found variants. We used the IR system provided by Terrier with TF-IDF term weighting.

Classification: when dealing with a content stream, we use a decision tree based classifier, relying on:
- time related features: statistics on found documents; presence/absence of known relations concerning the current topic during a week, on a day scale;
- common IR features: TF-IDF; mention distribution every 10% of the document; cosine similarity (1-gram / 2-gram) with the topic's wikipedia page.

Process description (figure).

Top 10 Features from Gini Score (overall):
COUNT_RELATED_ENTITIES_MENTION    100.00000
COSINE_SIMLARITY_1G               100.00000
COSINE_SIMLARITY_2G               100.00000
COUNT_MENTION_IN_60_70%_DOCUMENT   88.33264
STAT_MENTION                       73.70804
COUNT_RELATED_CITED                72.44485
COUNT_MENTION_IN_10_20%_DOCUMENT   70.76657
AVG_DOCUMENTS_IN_QUEUE             66.24520
COUNT_SENTENCE_WITH_MENTION        65.93995
COUNT_RELATED_LINKED               65.86051

Results:
               count     vs KBA    vs LSIS
total LSIS     44,351
total KBA      52,244
intersection   23,245    44.49%    52.41%
complement     50,105    55.41%    47.59%

Run results (–: missing value):
Run #       Central F1   Central SU   Rel./Cent. F1   Rel./Cent. SU
–           .359         .410         .639            .635
4 RF-Yes    .342         –            .617            .600
3 RF-All    .330         .279         .614            .601
5 SRF-All   –            –            .603            –
6 SRF-Yes   –            –            –               –
1 All-All   –            –            .553            .554
All-Yes     .306         .193         –               –
median      –            –            .543            .549
means       –            .311         .405            –

RF: Weka Random Forest; SRF: Salford Random Forest; All (run): Weka Random Committee of Random Forests.
Yes: includes only central judgments; All: includes central and relevant judgments.

The scores of the two cascaded classifiers are combined as:

score(d_i) = s(d_i, c_1) \cdot s(d_i, c_2)

score(d_i) = s(d_i, c_1) if s(d_i, c_1) < 0.5, otherwise 0.5 + \frac{s(d_i, c_1) + s(d_i, c_2)}{2}

LIA: [email protected]; LSIS: {vincent.bouvier, patrice.bellot}@lsis.org; Kware: [email protected]


Page 33: Some Information Retrieval Models and Our Experiments for TREC KBA

P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition)

Numerical and Temporal Meta-Features for Entity Document Filtering and Ranking

— Entity related features

— Document related meta-features

— Time related meta-features

33


recall = \frac{\#documents_{found} \in corpus}{\#documents_{found} \in train \cup test}    (1)

Table 1. Recall depending on using variant names or not, on both KBA12 and KBA13 train and test subsets:

                 With Variants   Without Variants
KBA12 Train      .862            .772
KBA12 Test       .819            .726
KBA12 Overall    .835            .743
KBA13 Train      .877            .831
KBA13 Test       .611            .534
KBA13 Overall    .646            .573

3.2 The Ranking Method

The ranking method comes right after the document pre-selection filter and thus takes as input a document mentioning an entity. The method ranks documents into four classes: garbage, neutral (no information, or not informative), useful, or vital. It has been shown in [9] that Naive Bayes, Decision Tree and SVM classifiers perform similarly on several test collections. For the ranking method, we use a Random Forest classifier (a decision-tree-based classifier) which, in addition to its strong performance, is really useful for post analysis.

We want our method to be adaptive and therefore not dependent on the entity on which the classifier is trained. So we designed a series of meta-features that strive to depict evidence regarding an entity in a way that can be applied to other entities. The remainder details the three types of meta-features: document, entity and time related meta-features.

3.2.1 Entity related meta-features

The entity related meta-features are used to determine how much a document concerns the target entity it has been extracted for. In order to structure all the information we have about an entity, we build an entity profile that contains:

- a variant collection V_e: the different variant names found for an entity e (cf. section 3.1);
- a relation collection R_{e,relType}: the different types relType of relations an entity e has with other entities;
- an entity language model θ_e: a textual representation of the entity e as a bag of n-grams;
- an entity Stream Information Language Model eSilm_e: a textual representation, as a bag of n-grams, of one or more documents selected by our system for the entity e. The eSilm_e is used to evaluate the divergence with upcoming documents, in order to try to distinguish novelty from already known "new" information.

A system may have no information at all (besides the name) concerning an entity; the entity language model θ_e then remains empty. The wikipedia page can be used, though, when it is known. However, for entities where no information at all is available, we thought it could be useful to use documents that mention the entities and that are well ranked. With the aim of keeping the entity background separated from the entity's new information, we build the other model eSilm_e to store the information that comes from the stream about an entity e. We will see in section 4.2 the different ways we experiment with to update this model.

The relation collection can be obtained in different manners depending on the prior information on the entity. When having the entity's wikipedia page, it is possible, while extracting variant names, to gather the pages that contain hyperlinks pointing to the entity page. It is also possible to gather all hyperlinks from the entity page that point to another page. So it is possible to define three types of relations: incoming (from a page to the entity page), outgoing (from the entity page to another page) and mutual (both incoming and outgoing). When using social networks, those relations are explicitly defined. On twitter for instance, an incoming relation would be when a user is followed, an outgoing relation when a user is following, and mutual when both users follow each other.

Some meta-features require a term frequency (TF) to be computed. To compute the TF of an entity e, we sum up the frequencies of all mentions of variant names v_i from the collection V_e in a document D, and normalize by the number of words |D| in D (cf. equation 2). We also compute meta-features for each type of relation (incoming, outgoing, mutual) using equation 2 where, instead of variants, all relations sharing the same type are used.

tf(e, D) = \frac{\sum_{i=1}^{|V_e|} f(v_i, D)}{|D|}    (2)

A snippet is computed from a document and the different mentions of an entity: it contains the set of paragraphs where the mentions of the entity are. The coverage cov(D_snippet, D) of the snippet for the document D is computed from the length |D_snippet| of the snippet and the length |D| of the document (cf. equation 3).

cov(D_snippet, D) = \frac{|D_snippet|}{|D|}    (3)

The following table summarizes all entity related meta-features:

tf_title                        tf(e, D_title)
tf_document                     tf(e, D)
length_θe                       |θ_e|
length_eSilme                   |eSilm_e|
cov_snippet                     equation 3
tf_relationType                 tf(rel_type, D)
cosine(θ_e, D)                  similarity between θ_e and D
jensenShannon(θ_e, D)           divergence between θ_e and D
jensenShannon(eSilm_e, D)       divergence between eSilm_e and D
jensenShannon(θ_e, eSilm_e)     divergence between θ_e and eSilm_e

Table 2. Entity related features
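A small Python sketch of the variant-based term frequency (equation 2) and snippet coverage (equation 3); tokenization and variant-matching details are illustrative assumptions:

    def entity_tf(variants, doc_words):
        """Equation (2): summed variant-mention frequencies, normalized by |D|.
        `variants` is a set of tuples of lowercased words, standing in for V_e."""
        words = [w.lower() for w in doc_words]
        count = 0
        for v in variants:
            n = len(v)
            count += sum(1 for i in range(len(words) - n + 1)
                         if tuple(words[i:i + n]) == v)
        return count / max(len(words), 1)

    def snippet_coverage(snippet_words, doc_words):
        """Equation (3): cov(D_snippet, D) = |D_snippet| / |D|."""
        return len(snippet_words) / max(len(doc_words), 1)

    # entity_tf({("boris", "berezovsky"), ("boris", "abramovich", "berezovsky")},
    #           "A recital by Boris Berezovsky".split())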

3.2.2 Document related meta-features

Documents can give much information regardless of the entity. For instance, it is possible to compute the amount of information carried by a document using the entropy of the document D. In addition, the length (number of words) of a document also gives information on whether a document is rather short or long: a document considered long (compared to others) might be more likely vital than short ones; this is at least the kind of behavior we could expect. Since we want to be able to distinguish documents not mentioning the entity in the document title (entity meta-feature tf(e, D_title)) from those that simply don't have a title, we add a meta-feature has_title(D). Table 3 gathers all document related meta-features.

has_title(D) ∈ {0, 1}
length_document    |D|
entropy(D)         -\sum_{i} p(w_i, D) \log_2 p(w_i, D)

Table 3. Document related meta-features
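A one-function Python sketch of the entropy meta-feature from Table 3 (word tokenization assumed):

    import math
    from collections import Counter

    def doc_entropy(doc_words):
        """entropy(D) = -sum_i p(w_i, D) * log2 p(w_i, D),
        over the document's word distribution."""
        counts = Counter(doc_words)
        total = sum(counts.values())
        return -sum((c / total) * math.log2(c / total) for c in counts.values())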

3.2.3 Time related meta-features

Let's consider a stream of documents where each document has a publication date and time. It is then possible to use this information to detect, for instance, abnormal activity around an entity, which might mean that something really important to that entity is happening. As shown in figure 3, drawn from the KBA13 stream-corpus, a burst does not always indicate vital documents, although it may still be relevant information for classification.

Figure 3. Burst on different entities does not always imply vital documents.

To depict the burst effect we used an implementation of the Kleinberg algorithm [11]. Given a time series, it captures bursts and measures their strength as well as their direction (up or down). We decided to scale the time series on an hourly basis. In order not to confuse the classifiers with too much information, we decided not to use the direction as a separate feature but to merge it with the strength, applying a coefficient of -1 when the direction is down and 1 otherwise.

In addition to burst detection, we also consider the number of documents having a mention in the last 24 hours.

We noticed from our last year's experiments on KBA12 that time features were actually degrading the final results, since our scores were better when ignoring them. So we decided to focus only on features (cf. table 4) that can really bring useful time information.

kleinberg1h    burst strength and direction
match24h       # documents found in the last 24h

Table 4. Time related features used for classification
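The paper relies on Kleinberg's algorithm; the sketch below is a deliberately simpler stand-in (a signed z-score over hourly mention counts, not Kleinberg's state machine) that mimics the merged strength-and-direction feature, plus the match24h count:

    import statistics
    from collections import Counter

    def signed_burst_strength(timestamps, window=24):
        """Bucket mentions per hour, compare the latest hour to the trailing
        window, and sign the score: > 0 burst up, < 0 burst down."""
        hours = Counter(int(t // 3600) for t in timestamps)  # epoch seconds -> hours
        if not hours:
            return 0.0
        last = max(hours)
        history = [hours.get(h, 0) for h in range(last - window, last)]
        mean = statistics.mean(history)
        std = statistics.pstdev(history) or 1.0
        return (hours[last] - mean) / std

    def match_24h(timestamps, now):
        """match24h: # documents mentioning the entity in the last 24 hours."""
        return sum(1 for t in timestamps if now - t <= 24 * 3600)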

3.2.4 Classification

To perform the classification, we decided not to rely on only one method. Instead, we designed different ways to classify the information given the meta-features described in the previous section.

For the first method, TwoSteps, we consider the problem as a binary classification problem where we use two classifiers in cascade. The first one, C_{GN/UV}, classifies between two classes: Garbage/Neutral and Useful/Vital. For documents classified as Useful/Vital, a second classifier C_{U/V} is used to determine the final output class, between Useful and Vital.

The second method, Single, directly performs a classification between the four classes.

The third method, VitalVSOthers, trains a classifier on all documents considering only two classes: vital and others (all classes but vital). When this classifier gives a non-vital class, the Single method is used to determine another class, from Garbage to Useful.

The last but not least method, CombineScores, uses the scores emitted by all previous classifiers and tries to learn the best output class considering all classifiers' scores for every class.
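A minimal sketch of the TwoSteps cascade using scikit-learn random forests; the class, labels and feature handling are placeholders, not the authors' code:

    from sklearn.ensemble import RandomForestClassifier

    class TwoSteps:
        def __init__(self):
            # C_GN/UV: Garbage/Neutral vs Useful/Vital; C_U/V: Useful vs Vital
            self.c_gn_uv = RandomForestClassifier(n_estimators=100)
            self.c_u_v = RandomForestClassifier(n_estimators=100)

        def fit(self, X, labels):
            # labels in {"garbage", "neutral", "useful", "vital"}
            coarse = ["UV" if y in ("useful", "vital") else "GN" for y in labels]
            self.c_gn_uv.fit(X, coarse)
            X_uv = [x for x, y in zip(X, labels) if y in ("useful", "vital")]
            y_uv = [y for y in labels if y in ("useful", "vital")]
            self.c_u_v.fit(X_uv, y_uv)

        def predict(self, X):
            out = []
            for x, c in zip(X, self.c_gn_uv.predict(X)):
                # only Useful/Vital documents reach the second classifier
                out.append(self.c_u_v.predict([x])[0] if c == "UV"
                           else "garbage/neutral")
            return out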

4 Experiments on the KBA Framework

4.1 Setup

The KBA organizers have built a stream-corpus, a huge corpus of dated web documents that can be processed chronologically; hence it is possible to simulate a real-time system. The documents come from newswires, blogs, forums, reviews and memetracker. In addition, a set of target entities, coming from wikipedia or from twitter, has been selected for their ambiguity or unpopularity. Last but not least, more than 60,000 documents have been annotated so that systems can train on them. The training period covers documents published from October 2011 until February 2012, and the test period runs from February 2012 to February 2013.

The KBA track is divided into two tasks: CCR (Cumulative Citation Recommendation) and SSF (Streaming Slot Filling). The CCR task is to filter out documents worth citing in a profile of an entity (e.g., a wikipedia or freebase article). The SSF task is to detect changes on given slots for each of the target entities. We focus only on the CCR task.

The KBA 2013 task is more challenging than the one from KBA12 since entities are more diversified (29 Wikipedia entities in 2012 vs 141 entities from wikipedia and twitter in 2013), the amount of annotated data per entity is much lower, and the ranking is more difficult with vital classes. Table 5 shows the differences in the training data between KBA13 and KBA12.

Classes            #Docs 2012   #Docs 2013   #Docs/Entity 2012   #Docs/Entity 2013
Garbage            8467         2176         284                 20
Neutral            1584         1152         73                  11
Relevant/Useful    5186         2293         181                 20
Central/Vital      2671         1718         92                  19
Total              17482        7222

Table 5. Number of documents per class, overall and per entity, for both evaluations KBA12 and KBA13.

4.2 System Output

We detailed in section 3.2.1 how an entity profile is built and how this profile is dynamically altered by updates to the entity Stream Information Language Model (eSilm_e). We ran different experiments to understand how this kind of model evolves using trivial update methods based on two parameters:

- UPDATE WITH: Full-Documents, Snippet, or No Update;
- UPDATE THRESHOLD: Useful/Vital or Vital documents only.

Given those parameters, the system produces 5 different outputs depending on how the eSilm is updated: NO UPDT, V UPDT DOC, V UPDT SNPT, VU UPDT DOC, VU UPDT SNPT. In addition, as said in section 3.2.4, four classification methods are used to experiment with different types of classification. To summarize, 20 outputs are expected at the end of the whole process.

Bouvier  &  Bellot,  TREC  2013

Page 34: Some Information Retrieval Models and Our Experiments for TREC KBA

P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition)

Temporal Features

Burstiness: some words tend to appear in bursts.

Hypothesis: entity name bursts are related to important news about the entity (social Web; news…)

34

tf_title              tf(e, D_title)
tf_document           tf(e, D)
voc_size_document     |D|
cov_snippet           equation 3
tf_relationType       tf(rel_type, D)

Table 5: Entity related features

When building the profile, we said that we extract the relations an entity may have with others from WP, using three different kinds of relations: incoming, outgoing and mutual. For each kind of relation and for each entity in this relation group, we compute the average tf over the whole document.

Time related features: the corpus offers the advantage of working with time information. We designed the time related features so that the classifiers can use information concerning previous documents. Such information may help detect that something is going on around an entity, using different clues such as the burst effect. As shown in figure 2, a burst does not always indicate vital documents, although it may still be relevant information for classification.

Figure 2: Burst on different entities does not always imply vital documents.

To depict the burst effect we used an implementation of the Kleinberg algorithm (Kleinberg, 2003). Given a time series, it captures bursts and measures their strength as well as their direction (up or down). We decided to scale the time series on an hourly basis. In order not to confuse the classifiers with too much information, we decided not to use the direction as a separate feature but to merge it with the strength, applying a coefficient of -1 when the direction is down and 1 otherwise.

In addition to burst detection, we also consider the number of documents having a mention in the last 24 hours.

We noticed from our last year's experiments on KBA12 that time features were actually degrading the final results, since our scores were better when ignoring them. So we decided to focus only on features (cf. table 6) that can really bring useful time information.

kleinberg1h    burst strength and direction
match24h       # documents found in the last 24h

Table 6: Time related features used for classification

4.1 Classification

As a reminder of section 3.1.2, we implemented different ways to update (or not) a dynamic language model:

- No Update: NO UPDT
- Update with Snippet: UPDT SNPT
- Update with Document: UPDT DOC

When we update the dynamic model, we can choose to update with either Vital or Vital and Useful documents, which adds 2 different outputs. In total, 5 outputs are computed.

To classify documents based on the computed features, we designed several methods. The first method, "TwoStep", considers the problem as a binary classification problem where we use two classifiers in cascade. The first one, C_{GN/UV}, classifies between two classes: "Garbage/Neutral" and "Useful/Vital". For documents classified as "Useful/Vital", the second classifier C_{U/V} is used to determine the final output class, between "Useful" and "Vital".

The second method, "Single", directly performs a classification between the four classes.

The third method, "VitalVSOthers", trains a classifier to recognize vital documents amongst all other classes. When this classifier gives a non-vital class, the "Single" method is used to determine another class, from "Garbage" to "Useful".

The last but not least method, "CombineScores", uses the scores emitted by all previous classifiers and tries to learn the best output class considering all classifiers' scores for every class.

4.2 System Outputs

To summarize, we have 5 different possible outputs with 4 different methods, which makes 20 different runs. For the official run submission, we had issues with our system making our runs not consistent enough. In addition, we also had issues extracting documents from the stream-corpus, which made our system miss a lot of documents. The result of those…


Jon Kleinberg, ‘Bursty and hierarchical structure in streams’, Data Mining and Knowledge Discovery, 7(4), 373–397, (2003)

Bouvier  &  Bellot,  DN,2014

Page 35: Some Information Retrieval Models and Our Experiments for TREC KBA

P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) 35

V.  Bouvier  &  P.  Bellot  (TREC  2014,  to  appear)

http://docreader:4444/data/index.html

DEMO: IR KBA platform software (Kware company / LSIS) — V. Bouvier, P. Bellot, M. Benoit

Page 36: Some Information Retrieval Models and Our Experiments for TREC KBA

P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) 36

Page 37: Some Information Retrieval Models and Our Experiments for TREC KBA

P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) 37

Page 38: Some Information Retrieval Models and Our Experiments for TREC KBA

P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition)

Some Interesting Perspectives

— More features, more (linguistic / semantic) resources, more data… — Deeper Linguistic / Semantic Analysis

= Machine Learning Approaches (Learning to rank) + Natural Language Processing + Knowledge Management

Pluridisciplinarity:

— Neurolinguistics (What Models could be adapted to Information Retrieval / Text Mining / Knowledge Retrieval)

— Psycholinguistics (psychological / neurobiological) / (models / features)

38

One  example  ?

Page 39: Some Information Retrieval Models and Our Experiments for TREC KBA

P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition)

Recent publications

39

Scientific publications — h-index = 15; i10 = 22 (Google Scholar); 375 citations since 2009

Edited book
1. P. Bellot, "Recherche d'information contextuelle, assistée et personnalisée", Hermès (collection Recherche d'Information et Web), 306 pages, Paris, ISBN 978-2746225831, December 2011.

Special issue editing
1. P. Bellot, C. Cauvet, G. Pasi, N. Valles, "Approches pour la recherche d'information en contexte", Document numérique RSTI série DN, vol. 15, num. 1/2012.

Conference proceedings editing
1. G. Pasi, P. Bellot, "COnférence en Recherche d'Infomations et Applications - CORIA 2011, 8th French Information Retrieval Conference", Avignon, France, Editions Universitaires d'Avignon, 2011.
2. F. Béchet, J.-F. Bonastre, P. Bellot, "Actes de JEP-TALN 2008 - Journées d'Etudes sur la Parole 2008, Traitement Automatique des Langues Naturelles 2008", Avignon, France, 2008.

Indexed journal papers
1. Romain Deveaud, Eric SanJuan, Patrice Bellot, "Accurate and Effective Latent Concept Modeling", Document Numérique RSTI, vol. 17-1, 2014.
2. L. Bonnefoy, V. Bouvier, P. Bellot, "Approches de classification pour le filtrage de documents importants au sujet d'une entité nommée", Document Numérique RSTI, vol. 17-1, 2014.
3. P. Bellot, B. Grau, "Recherche et Extraction d'Information", L'information Grammaticale, p. 37-45, 2014 (indexed by Persée) — AERES rank B.
4. P. Bellot, A. Doucet, S. Geva, S. Gurajada, J. Kamps, G. Kazai, M. Koolen, V. Moriceau, J. Mothe, M. Sanderson, E. Sanjuan, F. Scholer, A. Schuh, X. Tannier, "Report on INEX 2013", ACM SIGIR Forum, 47(2), p. 21-32, 2013.
5. P. Bellot, T. Chappell, A. Doucet, S. Geva, S. Gurajada, J. Kamps, G. Kazai, M. Koolen, M. Landoni, M. Marx, A. Mishra, V. Moriceau, J. Mothe, M. Preminger, G. Ramírez, M. Sanderson, E. Sanjuan, F. Scholer, A. Schuh, X. Tannier, M. Theobald, M. Trappett, A. Trotman, Q. Wang, "Report on INEX 2012", ACM SIGIR Forum, 46(2), p. 50-59, 2012.
6. P. Bellot, T. Chappell, A. Doucet, S. Geva, J. Kamps, G. Kazai, M. Koolen, M. Landoni, M. Marx, V. Moriceau, J. Mothe, G. Ramírez, M. Sanderson, E. SanJuan, F. Scholer, X. Tannier, M. Theobald, M. Trappett, A. Trotman, Q. Wang, "Report on INEX 2011", ACM SIGIR Forum, 46(1), p. 33-42, 2012.
7. D. Alexander, P. Arvola, T. Beckers, P. Bellot, T. Chappell, C.M. De Vries, A. Doucet, N. Fuhr, S. Geva, J. Kamps, G. Kazai, M. Koolen, S. Kutty, M. Landoni, V. Moriceau, R. Nayak, R. Nordlie, N. Pharo, E. SanJuan, R. Schenkel, A. Tagarelli, X. Tannier, J.A. Thom, A. Trotman, J. Vainio, Q. Wang, C. Wu, "Report on INEX 2010", ACM SIGIR Forum, 45(1), p. 2-17, 2011.
8. R. Lavalley, C. Clavel, P. Bellot, "Extraction probabiliste de chaînes de mots relatives à une opinion", Traitement Automatique des Langues (TAL), vol. 50-3, p. 101-130, 2011 — AERES rank A.
9. L. Sitbon, P. Bellot, P. Blache, "Vers une recherche d'informations adaptée aux capacités de lecture des utilisateurs – Recherche d'informations et résumé automatique pour des personnes dyslexiques", Revue des Sciences et Technologies de l'Information, série Document numérique, vol. 13-1, p. 161-186, 2010.
10. T. Beckers, P. Bellot, G. Demartini, L. Denoyer, C.M. De Vries, A. Doucet, K.N. Fachry, N. Fuhr, P. Gallinari, S. Geva, W.-C. Huang, T. Iofciu, J. Kamps, G. Kazai, M. Koolen, S. Kutty, M. Landoni, M. Lehtonen, V. Moriceau, R. Nayak, R. Nordlie, N. Pharo, E. SanJuan, R. Schenkel, X. Tannier, M. Theobald, J.A. Thom, A. Trotman, A.P. de Vries, "Report on INEX 2009", ACM SIGIR Forum, 44(1), p. 38-57, August 2010. DOI: 10.1145/1842890.1842897.
11. Juan-Manuel Torres-Moreno, Pier-Luc St-Onge, Michel Gagnon, Marc El-Bèze, Patrice Bellot, "Automatic Summarization System coupled with a Question-Answering System (QAAS)", CoRR, arXiv:0905.2990v1, 2009.

Publications scientifiquesh-index = 15 ; i10 = 22 (Google Scholar)

375 citations depuis 2009

Direction d’ouvrage1. P. Bellot, "Recherche d’information contextuelle, assistée et personnalisée" – Hermès (collection Recherche d’In-

formation et Web), 306 pages, Paris, ISBN-978-2746225831, décembre 2011.

Direction de numéros spéciaux1. P. Bellot, C. Cauvet, G. Pasi, N. Valles, "Approches pour la recherche d’information en contexte", Document

numérique RSTI série DN - Volume 15 – num. 1/2012.

Edition d’actes de conférences1. G. Pasi, P. Bellot, "COnférence en Recherche d’Infomations et Applications - CORIA 2011, 8th French Information

Retrieval Conference", Avignon, France, Editions Universitaires d’Avignon, 2011.2. F. Béchet, J.-F. Bonastre, P. Bellot, "Actes de JEP-TALN 2008 - Journées d’Etudes sur la Parole 2008, Traitement

Automatique des Langues Naturelles 2008", Avignon, France, 2008.

Revues répertoriées1. Romain Deveaud, Eric SanJuan, Patrice Bellot, "Accurate and Effective Latent Concept Modeling", Document

Numérique RSTI, vol. 17-1, 20142. L. Bonnefoy, V. Bouvier, P. Bellot, "Approches de classification pour le filtrage de documents importants au sujet

d’une entité nommée", Document Numérique RSTI, vol. 17-1, 20143. P. Bellot, B. Grau, "Recherche et Extraction d’Information", L’information Grammaticale, p. 37-45, 2014, (indexée

par Persée) — rang B AERES4. P. Bellot, A. Doucet, S. Geva, S. Gurajada, J. Kamps, G. Kazai, M. Koolen, V. Moriceau, J. Mothe, M. Sanderson,

E. Sanjuan, F. Scholer, A. Schuh, X. Tannier, "Report on INEX 2013", ACM SIGIR Forum 47 (2), 21-32, 2013.5. P. Bellot, T. Chappell, A. Doucet, S. Geva, S. Gurajada, J. Kamps, G. Kazai, M. Koolen, M. Landoni, M. Marx,

A. Mishra, V. Moriceau, J. Mothe, M. Preminger, G. Ramírez, M. Sanderson, E. Sanjuan, F. Scholer, A. Schuh, X.Tannier, M. Theobald, M. Trappett, A. Trotman, Q. Wang, "Report on INEX 2012", ACM SIGIR Forum, vol. 46-2,p. 50-59, 2012.

6. Patrice Bellot, Timothy Chappell, Antoine Doucet, Shlomo Geva, Jaap Kamps, Gabriella Kazai, Marijn Koolen,Monica Landoni, Maarten Marx, Véronique Moriceau, Josiane Mothe, G. Ramírez, Mark Sanderson, Eric SanJuan,Falk Scholer, Xavier Tannier, Martin Theobald, Matthew Trappett, Andrew Trotman, Qiuyue Wang, Report onINEX 2011, ACM SIGIR Forum,vol. 46-1, p. 33-42, 2012

7. D. Alexander, P. Arvola, T. Beckers, P. Bellot, T. Chappell, C.M. De Vries, A. Doucet, N. Fuhr, S. Geva, J. Kamps,G. Kazai, M. Koolen, S. Kutty, M. Landoni, V. Moriceau, R. Nayak, R. Nordlie, N. Pharo, E. SanJuan, R. Schenkel,A. Tagarelli, X. Tannier, J.A. Thom, A. Trotman, J. Vainio, Q. Wang, C. Wu. Report on INEX 2010. ACM SIGIRForum,vol. 45-1, p. 2-17, 2011

8. R. Lavalley, C. Clavel, P. Bellot, "Extraction probabiliste de chaînes de mots relatives à une opinion", TraitementAutomatique des Langues (TAL), p. 101-130, vol. 50, 3-2011. — rang A AERES

9. L. Sitbon, P. Bellot, P. Blache, "Vers une recherche d’informations adaptée aux capacités de lecture des utilisa-teurs – Recherche d’informations et résumé automatique pour des personnes dyslexiques", Revue des Sciences etTechnologies de l’Information, série Document numérique, volume 13, 1-2010, p. 161-186, 2010

10. T. Beckers, P. Bellot, G. Demartini, L. Denoyer, C. M. De Vries, A. Doucet, K. N. Fachry, N. Fuhr, P. Galli-nari, S. Geva, W.-C. Huang, T. Iofciu, J. Kamps, G. Kazai, M. Koolen, S. Kutty, M. Landoni, M. Lehtonen,V. Moriceau, R. Nayak, R. Nordlie, N. Pharo, E. SanJuan, R. Schenkel, X. Tannier, M. Theobald, J. A. Thom,A. Trotman, and A. P. de Vries, 2010. Report on INEX 2009. ACM SIGIR Forum 44, 1 (August 2010), 38-57.DOI=10.1145/1842890.1842897 http ://doi.acm.org/10.1145/1842890.1842897

11. Juan-Manuel Torres-Moreno, Pier-Luc St-Onge, Michel Gagnon, Marc El-Bèze, Patrice Bellot, "Automatic Sum-marization System coupled with a Question-Answering System (QAAS)", CoRR, arXiv :0905.2990v1, 2009.

10

Publications scientifiquesh-index = 15 ; i10 = 22 (Google Scholar)

375 citations depuis 2009

Direction d’ouvrage1. P. Bellot, "Recherche d’information contextuelle, assistée et personnalisée" – Hermès (collection Recherche d’In-

formation et Web), 306 pages, Paris, ISBN-978-2746225831, décembre 2011.

Direction de numéros spéciaux1. P. Bellot, C. Cauvet, G. Pasi, N. Valles, "Approches pour la recherche d’information en contexte", Document

numérique RSTI série DN - Volume 15 – num. 1/2012.

Edition d’actes de conférences1. G. Pasi, P. Bellot, "COnférence en Recherche d’Infomations et Applications - CORIA 2011, 8th French Information

Retrieval Conference", Avignon, France, Editions Universitaires d’Avignon, 2011.2. F. Béchet, J.-F. Bonastre, P. Bellot, "Actes de JEP-TALN 2008 - Journées d’Etudes sur la Parole 2008, Traitement

Automatique des Langues Naturelles 2008", Avignon, France, 2008.

Articles in indexed journals

1. Romain Deveaud, Eric SanJuan, Patrice Bellot, "Accurate and Effective Latent Concept Modeling", Document Numérique RSTI, vol. 17-1, 2014.
2. L. Bonnefoy, V. Bouvier, P. Bellot, "Approches de classification pour le filtrage de documents importants au sujet d’une entité nommée", Document Numérique RSTI, vol. 17-1, 2014.
3. P. Bellot, B. Grau, "Recherche et Extraction d’Information", L’information Grammaticale, p. 37-45, 2014 (indexed by Persée; AERES rank B).
4. P. Bellot, A. Doucet, S. Geva, S. Gurajada, J. Kamps, G. Kazai, M. Koolen, V. Moriceau, J. Mothe, M. Sanderson, E. Sanjuan, F. Scholer, A. Schuh, X. Tannier, "Report on INEX 2013", ACM SIGIR Forum, vol. 47-2, p. 21-32, 2013.
5. P. Bellot, T. Chappell, A. Doucet, S. Geva, S. Gurajada, J. Kamps, G. Kazai, M. Koolen, M. Landoni, M. Marx, A. Mishra, V. Moriceau, J. Mothe, M. Preminger, G. Ramírez, M. Sanderson, E. Sanjuan, F. Scholer, A. Schuh, X. Tannier, M. Theobald, M. Trappett, A. Trotman, Q. Wang, "Report on INEX 2012", ACM SIGIR Forum, vol. 46-2, p. 50-59, 2012.

6. Patrice Bellot, Timothy Chappell, Antoine Doucet, Shlomo Geva, Jaap Kamps, Gabriella Kazai, Marijn Koolen, Monica Landoni, Maarten Marx, Véronique Moriceau, Josiane Mothe, G. Ramírez, Mark Sanderson, Eric SanJuan, Falk Scholer, Xavier Tannier, Martin Theobald, Matthew Trappett, Andrew Trotman, Qiuyue Wang, "Report on INEX 2011", ACM SIGIR Forum, vol. 46-1, p. 33-42, 2012.
7. D. Alexander, P. Arvola, T. Beckers, P. Bellot, T. Chappell, C.M. De Vries, A. Doucet, N. Fuhr, S. Geva, J. Kamps, G. Kazai, M. Koolen, S. Kutty, M. Landoni, V. Moriceau, R. Nayak, R. Nordlie, N. Pharo, E. SanJuan, R. Schenkel, A. Tagarelli, X. Tannier, J.A. Thom, A. Trotman, J. Vainio, Q. Wang, C. Wu, "Report on INEX 2010", ACM SIGIR Forum, vol. 45-1, p. 2-17, 2011.
8. R. Lavalley, C. Clavel, P. Bellot, "Extraction probabiliste de chaînes de mots relatives à une opinion", Traitement Automatique des Langues (TAL), vol. 50-3, p. 101-130, 2011 (AERES rank A).
9. L. Sitbon, P. Bellot, P. Blache, "Vers une recherche d’informations adaptée aux capacités de lecture des utilisateurs – Recherche d’informations et résumé automatique pour des personnes dyslexiques", Revue des Sciences et Technologies de l’Information, série Document numérique, vol. 13-1, p. 161-186, 2010.
10. T. Beckers, P. Bellot, G. Demartini, L. Denoyer, C. M. De Vries, A. Doucet, K. N. Fachry, N. Fuhr, P. Gallinari, S. Geva, W.-C. Huang, T. Iofciu, J. Kamps, G. Kazai, M. Koolen, S. Kutty, M. Landoni, M. Lehtonen, V. Moriceau, R. Nayak, R. Nordlie, N. Pharo, E. SanJuan, R. Schenkel, X. Tannier, M. Theobald, J. A. Thom, A. Trotman, A. P. de Vries, "Report on INEX 2009", ACM SIGIR Forum, vol. 44-1, p. 38-57, August 2010. DOI=10.1145/1842890.1842897, http://doi.acm.org/10.1145/1842890.1842897
11. Juan-Manuel Torres-Moreno, Pier-Luc St-Onge, Michel Gagnon, Marc El-Bèze, Patrice Bellot, "Automatic Summarization System coupled with a Question-Answering System (QAAS)", CoRR, arXiv:0905.2990v1, 2009.


12. P. Zweigenbaum, B. Grau, A.-L. Ligozat, I. Robba, S. Rosset, X. Tannier, A. Vilnat (LIMSI) & P. Bellot (Univ. Avignon), "Apports de la linguistique dans les systèmes de recherche d’informations précises", RFLA (Revue Française de Linguistique Appliquée), XIII (1), p. 41-62, 2008. Special issue on the contribution of linguistics to information extraction, with contributions from C.J. Van Rijsbergen (Glasgow), H. Saggion (Sheffield), P. Vossen (Amsterdam) and M.C. L’Homme (Montréal); http://www.rfla-journal.org/som_2008-1.html

13. L. Sitbon, P. Bellot, P. Blache, "Éléments pour adapter les systèmes de recherche d’information aux dyslexiques", Traitement Automatique des Langues (TAL), vol. 48-2, p. 123-147, 2007 (AERES rank A).

14. Laurent Gillard, Laurianne Sitbon, Patrice Bellot, Marc El-Bèze, "Dernières évolutions de SQuALIA, le système de Questions/Réponses du LIA", Traitement Automatique des Langues (TAL), vol. 46-3, p. 41-70, Hermès, 2006.

15. P. Bellot, M. El-Bèze, "Classification locale non supervisée pour la recherche documentaire", Traitement Automatique des Langues (TAL), vol. 42-2, Hermès, p. 335-366, 2001.

16. P. Bellot, M. El-Bèze, "Classification et segmentation de textes par arbres de décision", Technique et Science Informatiques (TSI), Editions Hermès, vol. 20-3, p. 397-424, 2001.

17. P.-F. Marteau, C. De Loupy, P. Bellot, M. El-Bèze, "Le Traitement Automatique du Langage Naturel, Outil d’Assistance à la Fonction d’Intelligence Economique", Systèmes et Sécurité, vol. 5, num. 4, p. 8-41, 1999.

Book chapters

1. P. Bellot, L. Bonnefoy, V. Bouvier, F. Duvert, Young-Min Kim, "Large Scale Text Mining Approaches for Information Retrieval and Extraction", in Innovations in Intelligent Machines-4, chapter 1, Springer International Publishing Switzerland, editors: Lakhmi C., Colette Faucher, ISBN 978-3-319-01865-2, pp. 1-43, 2013.

2. J.M. Torres-Moreno, M. El-Bèze, P. Bellot, F. Béchet, "Opinion Detection as a Topic Classification Problem", in Textual Information Access: Statistical Models, E. Gaussier & F. Yvon Eds., J. Wiley-ISTE, chapter 9, ISBN 978-1-84821-322-7, 2012.

3. P. Bellot, "Vers une prise en compte de certains handicaps langagiers dans les processus de recherche d’information", in Recherche d’information contextuelle, assistée et personnalisée, edited by P. Bellot, chapter 7, p. 191-226, collection Recherche d’information et Web, Hermès, 2011.

4. J.M. Torres-Moreno, M. El-Bèze, P. Bellot, F. Béchet, "Peut-on voir la détection d’opinions comme un problème de classification thématique ?", in Modèles statistiques pour l’accès à l’information textuelle, edited by E. Gaussier and F. Yvon, Hermès, chapter 9, p. 389-422, 2011.

5. P. Bellot, M. Boughanem, "Recherche d’information et systèmes de questions-réponses", in La recherche d’informations précises : traitement automatique de la langue, apprentissage et connaissances pour les systèmes de question-réponse (Traité IC2, série Informatique et systèmes d’information), edited by B. Grau, Hermès-Lavoisier, chapter 1, p. 5-35, 2008.

6. Patrice Bellot, "Classification de documents et enrichissement de requêtes", in Méthodes avancées pour les systèmes de recherche d’informations (Traité des sciences et techniques de l’information), edited by M. Ihadjadene, chapter 4, p. 73-96, Hermès, 2004.

7. J.-C. Meilland, P. Bellot, "Extraction automatique de terminologie à partir de libellés textuels courts", in La Linguistique de corpus, edited by G. Williams, Presses Universitaires de Rennes, p. 357-370, 2005.

Peer-reviewed international conferences (ACTI)

1. H. Hamdan, P. Bellot, F. Béchet, "The Impact of Z score on Twitter Sentiment Analysis", Int. Workshop on Semantic Evaluation (SEMEVAL 2014), COLING 2014, Dublin, Ireland, 2014.
2. Chahinez Benkoussas, Hussam Hamdan, Patrice Bellot, Frédéric Béchet, Elodie Faath, "A Collection of Scholarly Book Reviews from the Platforms of electronic sources in Humanities and Social Sciences OpenEdition.org", 9th International Conference on Language Resources and Evaluation (LREC 2014), Reykjavik, Iceland, May 2014.

3. Romain Deveaud, Eric SanJuan, Patrice Bellot, "Are Semantically Coherent Topic Models Useful for Ad Hoc Information Retrieval?", 51st Annual Meeting of the Association for Computational Linguistics (ACL 2013), Sofia, Bulgaria, August 2013.

4. L. Bonnefoy, V. Bouvier, P. Bellot, "A weakly-supervised detection of entity central documents in a stream", The 36th Annual ACM SIGIR Conference (SIGIR’13), Dublin, Ireland, July 2013.

5. Romain Deveaud, Eric SanJuan, Patrice Bellot, "Estimating Topical Context by Diverging from External Resources", The 36th Annual ACM SIGIR Conference (SIGIR’13), Dublin, Ireland, July 2013.

