
An Exploration of Retrieval-Enhancing Methods for Integrated Search in a Digital Library

TBAS2012, Barcelona. April 1, 2012

Diana Ransgaard Sørensen, Toine Bogers, Birger Larsen

Royal School of Library and Information Science, Copenhagen, Denmark

02/04/2012 1

http://www.iva.dk/english

Outline

• Introduction

– Problem

– Our focus

– Goal

• Methodology

• Experiments

• Conclusion



Introduction

• Problem

– Different document types contain different amounts of text (full text vs. metadata-only)

– Some document types are more likely to be retrieved than others, regardless of relevance

• Our focus

– How to best combine & rank different document types and representations in a digital library setting?

• Goal

– Present the user with a single ranked list containing the optimal mix of document types

– Explore different techniques for integrating different document types and representations into a single results list



Outline

• Introduction

• Methodology

– Test collection

– Topics

– Experimental setup

• Experiments

• Conclusion



Test collection

• iSearch collection

– Based on the digital physics library arXiv.org

– Available from http://itlab.dbit.dk/~isearch

– Three different document types

• 18,443 metadata-only book records (BK)

• 291,246 metadata-only article records + abstracts (PN)

• 143,571 full-text article records, including metadata (PF)

– Topics

• 65 topics with graded relevance assessments

• Created by 23 lecturers and experienced postgraduate and graduate students from three different university departments of physics


Topics

• Each topic representation contains five fields

– Description of information sought

– User background knowledge

– Work task description

– Ideal answer

– Keywords

• “What are the key search terms used to express your situation and your information needs?”



Experimental setup

• Indexing & retrieval

– Indri 5.0 toolkit

• Stop word filtering

• Stemming

– Language modeling algorithms with three different smoothing methods

• Jelinek-Mercer smoothing (JM)

• Bayesian smoothing using Dirichlet priors (DIR)

• Two-stage smoothing (TWO)

• Evaluation

– Normalized Discounted Cumulated Gain (NDCG)

– Two-tailed paired Student's t-test
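As an illustration of the evaluation measure, here is a minimal sketch of NDCG for a single topic with graded relevance judgments. It assumes the common log2(rank + 1) discount, which may differ from the exact variant used in the experiments; the function names and the toy judgments are hypothetical.

```python
import math

def dcg(gains):
    """Discounted cumulated gain, assuming a log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg(ranked_doc_ids, relevance):
    """NDCG of one ranked list.

    relevance maps document id -> graded relevance; unjudged documents count as 0.
    """
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal_dcg = dcg(sorted(relevance.values(), reverse=True))
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Toy example: two judged documents, ranked in the ideal and in the reversed order.
judgments = {"a": 3, "b": 1}
perfect = ndcg(["a", "b"], judgments)
swapped = ndcg(["b", "a"], judgments)
```

Per-topic scores like these would then be averaged over all topics, and two systems compared with a paired test on the per-topic values.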



Outline

• Introduction

• Methodology

• Experiments

1) Out of the box

2) Weighting

3) Fusion

• Conclusion



Experiments

1) Default settings + optimized baseline runs

2) Adjust weighting of the three document types

3) Fusing different document types



1) Out-of-the-box vs. optimized

• We optimize the settings of the system and of the retrieval model on a combined index of all three document types (BK, PF and PN)

– Using the default, out-of-the-box settings does not always provide the best retrieval performance

– Default parameter settings can be seen as a generalization over many different test collections

• Goal is to examine how much performance can improve over default settings in this integrated search scenario




• What do we compare?

– Out-of-the-box

– Tuned

• What do we optimize?

(i) Stop word filtering: Yes or no

(ii) Krovetz stemming: Yes or no

(iii) LM smoothing parameters:

– λ in [0, 1] in steps of 0.1

– μ in [0, 5000] in steps of 500
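To make the two tuned parameters concrete, here is a minimal sketch of the textbook forms of Jelinek-Mercer and Dirichlet smoothing, together with the parameter grids listed above. The function names and the example numbers are hypothetical, and the formulas are the standard definitions rather than Indri's internal implementation.

```python
def jelinek_mercer(tf, doc_len, p_coll, lam):
    """P(w|d) as a linear mix of the document model and the collection model."""
    p_ml = tf / doc_len if doc_len > 0 else 0.0
    return (1.0 - lam) * p_ml + lam * p_coll

def dirichlet(tf, doc_len, p_coll, mu):
    """P(w|d) with a Dirichlet prior of strength mu on the collection model."""
    if doc_len + mu == 0:
        return 0.0
    return (tf + mu * p_coll) / (doc_len + mu)

# Parameter grids from the slide: 11 values each.
lambdas = [round(0.1 * i, 1) for i in range(11)]   # 0.0, 0.1, ..., 1.0
mus = [500 * i for i in range(11)]                 # 0, 500, ..., 5000

# Hypothetical term: 2 occurrences, collection probability 0.01.
short_doc = dirichlet(tf=2, doc_len=50, p_coll=0.01, mu=2500)
long_doc = dirichlet(tf=2, doc_len=5000, p_coll=0.01, mu=2500)
```

With Dirichlet smoothing the influence of the collection model depends on document length, which is one reason the choice of μ matters when short metadata-only records and long full-text documents share an index.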




[Figure: NDCG of the default vs. optimized runs. Default: NDCG 0.2856. Optimized: NDCG 0.3263. A marker in the figure denotes statistical significance.]



Optimizing the baseline runs increases the NDCG scores by:

17.4% (JM)

9.8% (DIR)

17.2% (TWO)

The best-performing model is JM, with an NDCG score of 0.3263.

This run serves as the baseline in the remaining experiments (weighting and fusion).



2) Weighting document types

• Weights are drawn from the set {0.0001, 0.2, 0.4, 0.6, 0.8, 1.0}

– 6 × 6 × 6 = 216 unique weight combinations for the three document types
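The count of 216 follows from assigning one of the six weights to each of the three document types. A minimal sketch of enumerating that grid is shown below; the variable names are hypothetical, and how each weight is applied inside the retrieval model is not covered here.

```python
from itertools import product

WEIGHTS = [0.0001, 0.2, 0.4, 0.6, 0.8, 1.0]
DOC_TYPES = ("BK", "PN", "PF")  # book records, article metadata, full text

# One dict per run: a weight for every document type.
weight_combinations = [
    dict(zip(DOC_TYPES, ws)) for ws in product(WEIGHTS, repeat=len(DOC_TYPES))
]
```

Each entry in `weight_combinations` corresponds to one weighted run to be evaluated against the optimized baseline.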



2) Weighting: top 10 of 216

[Figure: the top 10 of the 216 weighted runs, showing the weights for book records, metadata and full text, compared against the optimized baseline (NDCG 0.3263).]


3) Fusing document types

• Three separate indexes, optimized runs in each

• Two types of fusion

– Round-robin merging

– Linear combination (LC) with score- or rank-normalization
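The two fusion strategies can be sketched as follows. This is a minimal illustration that assumes min-max score normalization for the linear combination; the function names, the weights and the toy scores are hypothetical, and the normalization and weighting actually used in the experiments may differ.

```python
def round_robin(ranked_lists):
    """Interleave rankings: take rank 1 from each list, then rank 2, and so on."""
    merged, seen = [], set()
    longest = max((len(r) for r in ranked_lists), default=0)
    for i in range(longest):
        for ranking in ranked_lists:
            if i < len(ranking) and ranking[i] not in seen:
                seen.add(ranking[i])
                merged.append(ranking[i])
    return merged

def min_max(scores):
    """Rescale the scores of one run to [0, 1] so runs become comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def linear_combination(runs, weights):
    """Weighted sum of normalized scores, returned as a ranked list of ids."""
    fused = {}
    for run, w in zip(runs, weights):
        for doc, s in min_max(run).items():
            fused[doc] = fused.get(doc, 0.0) + w * s
    return sorted(fused, key=fused.get, reverse=True)

# Toy runs from two separate indexes (scores resemble log probabilities).
bk_run = {"bk1": -5.0, "bk2": -7.0}
pn_run = {"pn1": -4.0, "pn2": -6.0, "pn3": -8.0}
rr = round_robin([["bk1", "bk2"], ["pn1", "pn2", "pn3"]])
lc = linear_combination([bk_run, pn_run], weights=[1.0, 0.5])
```

Round-robin ignores scores entirely, while the linear combination lets the per-index weights decide how strongly each document type is represented near the top of the merged list.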



3) Fusing document types

[Figure: fusion results. Best fusion run: NDCG 0.3286, 0.7% above the baseline on one combined index.]



Outline

• Introduction

• Methodology

• Experiments

• Conclusion

– Discussion

– Future work



Conclusions

• As expected, optimization of the retrieval model produces beneficial results on a combined index of document types.

• Our approach for weighting document types is not an effective way of improving integrated search performance.

• Round-robin merging is not an effective strategy for integrating different document types.

• Fusion based on linear combination (LC) over separate indexes for the document types produces results that are slightly better than the baseline (NDCG 0.3286 vs. 0.3263, +0.7%).



Discussion

We expected that weighting the document types in our combined index differently could boost performance even further, but this was not the case.

Trend in the best weighted runs: book records tended to have higher weights, while article metadata and full text tended to have lower weights.



Future work

• A more extensive analysis of the performance of the individual document types. Goal: find more fruitful techniques for weighting them properly.

• Weighting: calculate document-specific weights based on analysis of different document features, instead of only assigning a weight based on the document type.

• Use the citation information from the documents available in the iSearch collection as an additional source of information.



Questions? Comments? Suggestions?


