Querylog-based Assessment of Retrievability Bias in a Large Newspaper Corpus

Post on 03-Mar-2017

Myriam C. Traub, Thaer Samar, Jacco van Ossenbruggen, Jiyin He, Arjen de Vries, Lynda Hardman

Motivation

• Users want to be able

• to get a fair overview of the archive’s content

• to access all (relevant) documents in the archive

• However,

• data collections are implicitly and explicitly biased,

• users are biased,

• and technology induces even more bias(es)

… which I can deal with if the bias is made explicit.

Retrievability Bias

• Bias in search results

• Potential sources are:

• User interest

• Search skills of users

• Users’ willingness to explore results

• Collection bias (indexed documents)

• OCR errors

• Side-effects of ranking algorithm

• Side-effects of result presentation

Research Questions

RQ1: Detecting and quantifying retrievability bias

RQ2: Influence of document features on retrievability bias

RQ3: Representativeness of simulated queries and experimental setup

Retrievability

• Introduced by Azzopardi et al. [1] in 2008 in a study based on born-digital documents and simulated queries

• Retrievability score counts how often a document is retrieved as one of the top K documents by a given set of queries

• Gini coefficient and Lorenz curves can visualize and quantify inequality in the distribution of the scores
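The retrievability score described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `retrievability_scores`, the toy queries and the document ids are all hypothetical, and each query is assumed to come with an already ranked result list.

```python
from collections import Counter

def retrievability_scores(ranked_results, cutoff):
    """Count, for each document, how many queries return it in the top `cutoff`."""
    scores = Counter()
    for ranking in ranked_results.values():
        # Only the first `cutoff` results of a query contribute to r(d).
        for doc_id in ranking[:cutoff]:
            scores[doc_id] += 1
    return scores

# Hypothetical toy data: query -> ranked list of document ids.
ranked_results = {
    "amsterdam": ["d1", "d2", "d3"],
    "rotterdam haven": ["d2", "d1", "d4"],
    "koningin": ["d2", "d5", "d1"],
}
scores = retrievability_scores(ranked_results, cutoff=2)
```

Documents that never appear within the cutoff (here `d3` and `d4`) keep a score of 0, which is why the choice of cutoff has such a strong effect on the resulting distribution.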

[1]  L. Azzopardi and V. Vinay. Retrievability: An evaluation measure for higher order information access tasks. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM ’08, pages 561–570, New York, NY, USA, 2008. ACM.

Lorenz Curve & Gini Coefficient

• Introduced by economists to express and visualize inequality in wealth distribution

• Gini coefficient (G):

• perfect communist (G=0): 1, 1, 1, 1, 1

• in-between (G=0.5): 0, 0, 1, 1, 2

• perfect tyranny (G=0.8): 0, 0, 0, 0, 1

• There is no good or bad G.

[Figure: Lorenz curves for n=5. x-axis: % of population (% of documents); y-axis: % of income (% of accumulated r(d)).]
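The formula for G is not preserved in this transcript. As a sketch, the standard definition based on the mean absolute difference between all pairs of values reproduces the three values quoted for n=5 (0, 0.5 and 0.8). The function names `lorenz_curve` and `gini` are illustrative, and a non-zero total is assumed.

```python
def lorenz_curve(values):
    """Return cumulative shares of the total, with values sorted ascending."""
    ordered = sorted(values)
    total = sum(ordered)
    cumulative, running = [0.0], 0
    for v in ordered:
        running += v
        cumulative.append(running / total)
    return cumulative

def gini(values):
    """Gini coefficient: sum of |x_i - x_j| over all ordered pairs / (2 * n * total)."""
    n = len(values)
    total = sum(values)
    diff_sum = sum(abs(a - b) for a in values for b in values)
    return diff_sum / (2 * n * total)

equal = gini([1, 1, 1, 1, 1])       # everyone has the same share
in_between = gini([0, 0, 1, 1, 2])
tyranny = gini([0, 0, 0, 0, 1])     # one member holds everything
curve = lorenz_curve([0, 0, 1, 1, 2])
```

With n values the maximum of this measure is (n-1)/n, which is why the most unequal five-element case reaches 0.8 rather than 1. The pairwise sum is quadratic in n, so a collection of this size would need a sorted, linear-time formulation instead.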

Experimental setup / Parameters

• Digitized collection of Dutch historic newspapers

• View data extracted from user logs

• Real queries, simulated queries

• Standard Information Retrieval models: TFIDF, LM1000, BM25 (using Lemur framework)

• Pre-processing (corpus & queries): Stemming, stopword removal, operator removal

• Cutoff values: c=10, c=100, c=1000
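The query side of the pre-processing step can be illustrated roughly as follows. The operator set and the stopword list are hypothetical placeholders, since the slides do not list them, and stemming is left out to keep the sketch free of external libraries.

```python
import re

# Hypothetical operator set and stopword list, for illustration only.
OPERATORS = {"AND", "OR", "NOT"}
STOPWORDS = {"de", "het", "een", "en", "van", "in"}

def preprocess_query(query):
    """Remove operator syntax and stopwords; return lower-cased terms."""
    # Strip quotes, parentheses and +/- signs often used as query operators.
    # Note: this also splits hyphenated words, which may not be desired.
    cleaned = re.sub(r'["()+\-]', " ", query)
    terms = [t for t in cleaned.split() if t not in OPERATORS]
    return [t.lower() for t in terms if t.lower() not in STOPWORDS]

terms = preprocess_query('"de haven" AND Rotterdam')
```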

Document Collection: Dutch Newspaper Archive

June 1618 - December 1995

Document type      Share   Documents
Articles           67%     69,237,655
Advertisements     29%     29,591,599
Notifications*     2%      1,918,375
Captions           2%      1,970,899
Total size                 102,718,528
Vocabulary size            353,086,358

* Familiebericht

Simulated Queries

• Followed similar strategy as previous studies

• Top 2 million single terms from the preprocessed corpus + top 2 million bigram terms

• No filtering for OCR errors
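A minimal sketch of this query simulation strategy, assuming that "top" means most frequent in the preprocessed corpus and that documents are already tokenized. The function name `simulated_queries` and the toy documents are illustrative only.

```python
from collections import Counter

def simulated_queries(documents, top_n):
    """Return the top_n most frequent single terms plus the top_n most frequent bigrams."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in documents:
        unigrams.update(tokens)
        # Adjacent token pairs within one document form the bigrams.
        bigrams.update(zip(tokens, tokens[1:]))
    single = [term for term, _ in unigrams.most_common(top_n)]
    pairs = [" ".join(pair) for pair, _ in bigrams.most_common(top_n)]
    return single + pairs

# Hypothetical, already preprocessed toy documents.
documents = [
    ["koning", "bezoekt", "amsterdam"],
    ["koning", "bezoekt", "rotterdam"],
    ["koning", "haven"],
]
queries = simulated_queries(documents, top_n=1)
```

Because no filtering is applied, frequent OCR misreadings would enter the query set just like correctly recognized terms.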

Real Queries

• User logs collected between March and July 2015 on Delpher, the online web service of the National Library of the Netherlands

• Extracted queries and viewed items related to newspaper archive

• Total of 957,239 unique queries

RQ1: Detecting and Quantifying Retrievability Bias

Inequality c=10

Real queries: G_BM25 = 0.97

Simulated queries: G_BM25 = 0.85

A very large fraction of documents is never retrieved.

Inequality

Real queries, c=1000: G_BM25 = 0.76

Simulated queries, c=100: G_BM25 = 0.52

Limitations

• The Lorenz curves and Gini values

• are strongly influenced by non-retrieved documents,

• can indicate the degree of bias, but they tell us nothing about the type of bias.

Does the inequality arise from the users’ interest / search behavior? Or from a technological bias towards a particular document feature?

Are Retrievability Scores Meaningful?

• Created 4 subsets of documents according to their score and selected a set of target documents from each subset

• Generated queries from selected documents, tailored to retrieve these specific documents

• Performed search tasks and measured ranks of target documents

• Showed that documents with lower scores are actually harder to find

Subsets (as labelled on the slide): Rarely, Sometimes, Often, Very often
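The rank measurement in this validation step could look roughly like the sketch below. The data layout, the function names and the choice to skip targets that are not retrieved at all are assumptions made for illustration; the slides do not say how such cases were handled.

```python
def target_rank(ranking, target):
    """1-based rank of the target document, or None if it is not retrieved."""
    for position, doc_id in enumerate(ranking, start=1):
        if doc_id == target:
            return position
    return None

def mean_rank_per_subset(runs):
    """runs: list of (subset_label, ranking, target). Unretrieved targets are skipped."""
    ranks = {}
    for label, ranking, target in runs:
        rank = target_rank(ranking, target)
        if rank is not None:
            ranks.setdefault(label, []).append(rank)
    return {label: sum(r) / len(r) for label, r in ranks.items()}

# Hypothetical search runs: one tailored query per target document.
runs = [
    ("rarely", ["d7", "d8", "d1"], "d1"),
    ("rarely", ["d2", "d9"], "d2"),
    ("often", ["d3", "d4"], "d3"),
]
mean_ranks = mean_rank_per_subset(runs)
```

A higher mean rank for a subset means its target documents sit further down the result lists, i.e. they are harder to find.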

RQ2: Influence of Document Features

OCR Confidence Scores

• Generated by OCR engine during digitization

• Documents ordered by page confidence (PC) and split into bins

• Mean score per bin

[Figure: mean r(d) per bin (y-axis) against bins based on page confidence (PC) (x-axis).]

Document Length

• Documents ordered by length and split into bins of 20,000

• LM1000 (left): upward trend, longer documents more retrievable

• BM25 and TFIDF (right): seem to be better at retrieving documents of medium length

[Figure: two panels showing mean r(d) per bin (y-axis) against bins based on document length (x-axis).]
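The binning used for both document features (page confidence and document length) can be sketched as follows: sort the documents by the feature, cut the sorted list into fixed-size bins, and average r(d) within each bin. The function name and the toy values are hypothetical; the slides use bins of 20,000 documents for document length.

```python
def mean_score_per_bin(docs, bin_size):
    """docs: list of (feature_value, r_d) pairs. Returns the mean r(d) of each bin."""
    ordered = sorted(docs, key=lambda pair: pair[0])
    means = []
    for start in range(0, len(ordered), bin_size):
        # The last bin may hold fewer than bin_size documents.
        chunk = ordered[start:start + bin_size]
        means.append(sum(score for _, score in chunk) / len(chunk))
    return means

# Hypothetical (document length, r(d)) pairs.
docs = [(120, 4), (15, 0), (300, 6), (40, 2), (80, 3), (500, 9)]
means = mean_score_per_bin(docs, bin_size=2)
```

Plotting these means against the bin index gives the kind of curve described on the slides, where an upward trend indicates that documents with a higher feature value are more retrievable.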

RQ3: Representativeness of Simulated Queries and Experimental Setup

Top retrieved article for real queries

Top retrieved article(s) for simulated queries

Differences between query sets

• Real queries:

• Mean length: 2.32 terms

• Unique terms: 253,637

• 56 references to persons or locations in top 100 terms

• Simulated queries:

• Mean length: 1.5 terms

• Unique terms: 2,028,617

• 5 references to persons or locations in top 100 terms

Actual views

• Only 2.7M out of 102M documents were viewed by users (G = 0.98)

• most documents have not been viewed at all

• many documents only viewed once

• very few are viewed multiple times

[Figure: counts (y-axis) against number of views (x-axis).]

Overlap with views

• How many documents were viewed by the users, but not retrieved in our study?

• Many non-retrieved documents

• were found using facets or operators

• scored a rank just below the cutoff

• A better representation of the real search engine is needed, taking faceted search and operators into account

[Figure: retrieved and non-retrieved documents for c=10, c=100 and c=1000.]
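The overlap question can be expressed with plain set operations. This is only a sketch under assumed inputs: a set of viewed document ids taken from the logs and, per cutoff, the set of documents retrieved at least once in the experiment. All names and ids are hypothetical.

```python
def view_overlap(viewed, retrieved_by_cutoff):
    """For each cutoff, count viewed documents that were retrieved and not retrieved."""
    result = {}
    for cutoff, retrieved in retrieved_by_cutoff.items():
        hit = viewed & retrieved
        result[cutoff] = (len(hit), len(viewed) - len(hit))
    return result

# Hypothetical toy data.
viewed = {"d1", "d2", "d3", "d4"}
retrieved_by_cutoff = {
    10: {"d1"},
    100: {"d1", "d2", "d9"},
    1000: {"d1", "d2", "d3", "d9"},
}
overlap = view_overlap(viewed, retrieved_by_cutoff)
```

Viewed documents that stay in the non-retrieved count even at the largest cutoff are candidates for having been reached through facets or operators.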

Document Types Viewed

Document type    Simulated   Real   Viewed
Article          3.89        0.90   2.61%
Advertisement    3.32        0.51   2.07%
Notification     3.22        4.80   40.10%
Caption          3.06        0.84   4.01%

Conclusions

• Real and simulated queries differ with regard to

• composition of query sets

• number of (unique) terms used

• use of named entities

• Apart from document length and page confidence, we did not find strong evidence for technical bias

• Using real queries is important for realistic results

• Simulation strategies for queries need to be improved

• Retrievability studies should take faceted search and operators into account

We would like to thank the

for making the newspaper corpus and the (sensitive) user data available to us for research.

travel grant

Supported by
