An exploration of retrieval enhancing methods for integrated search in a digital library ecir2012,...

An Exploration of Retrieval-Enhancing Methods for Integrated Search in a Digital Library

TBAS2012, Barcelona. April 1, 2012

Diana Ransgaard Sørensen, Toine Bogers, Birger Larsen

Royal School of Library and Information Science, Copenhagen, Denmark

02/04/2012 1

Outline

• Introduction

– Problem

– Our focus

– Goal

• Methodology

• Experiments

• Conclusion

Introduction

• Problem

– Different document types contain different amounts of text (full text vs. metadata-only)

– Some document types are more likely to be retrieved than others, regardless of relevance

• Our focus

– How to best combine & rank different document types and representations in a digital library setting?

• Goal

– Present the user with a single ranked list containing the optimal mix of document types

– Explore different techniques for integrating different document types and representations into a single results list

Outline

• Introduction

• Methodology

– Test collection

– Topics

– Experimental setup

• Experiments

• Conclusion

Test collection

• iSearch collection

– Based on the digital physics library arXiv.org

– Available from http://itlab.dbit.dk/~isearch

– Three different document types

• 18,443 metadata-only book records (BK)

• 291,246 metadata-only article records + abstracts (PN)

• 143,571 full-text article records, including metadata (PF)

– Topics

• 65 topics with graded relevance assessments

• Created by 23 lecturers and experienced postgraduate and graduate students from three different university departments of physics

Topics

• Each topic representation contains five fields

– Description of information sought

– User background knowledge

– Work task description

– Ideal answer

– Keywords

• “What are the key search terms used to express your situation and your information needs?”

Experimental setup

• Indexing & retrieval

– Indri 5.0 toolkit

• Stop word filtering

• Stemming

– Language modeling algorithms with three different smoothing methods

• Jelinek-Mercer smoothing (JM)

• Bayesian smoothing using Dirichlet priors (DIR)

• Two-stage smoothing (TWO)

• Evaluation

– Normalized Discounted Cumulated Gain (NDCG)

– Two-tailed paired Student's t-test
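A minimal sketch of the two evaluation steps, for illustration only. The slides do not say which NDCG variant was used, so the log2(rank + 1) discount below is an assumption, and the function names are hypothetical. The t-test part computes only the paired t statistic; looking up the two-tailed p-value is left out.

```python
import math
from statistics import mean, stdev

def dcg(gains):
    """Discounted cumulated gain; the log2(rank + 1) discount is an assumption."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg(ranked_doc_ids, relevance, k=None):
    """NDCG of one ranked list, given graded judgments {doc_id: grade}."""
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

def paired_t_statistic(scores_a, scores_b):
    """t statistic for per-topic score pairs from two runs (same topic order)."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

# Hypothetical judgments for a single topic.
judgments = {"a": 3, "b": 2, "c": 0}
perfect = ndcg(["a", "b", "c"], judgments)
reversed_run = ndcg(["c", "b", "a"], judgments)
```

Averaging `ndcg` over all topics gives one score per run; the per-topic scores of two runs are what the paired test compares.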

Outline

• Introduction

• Methodology

• Experiments

1) Out of the box

2) Weighting

3) Fusion

• Conclusion

Experiments

1) Default settings + optimized baseline runs

2) Adjust weighting of the three document types

3) Fusing different document types

1) Out-of-the-box vs. optimized

• We optimize the settings of the system and of the retrieval model on a combined index of all three document types (BK, PF and PN)

– Using the default, out-of-the-box settings does not always provide the best retrieval performance

– Default parameter settings can be seen as a generalization over many different test collections

• Goal is to examine how much performance can improve over default settings in this integrated search scenario

1) Out-of-the-box vs. optimized

• What do we compare?

– Out-of-the-box

– Tuned

• What do we optimize?

(i) Stop word filtering: Yes or no

(ii) Krovetz stemming: Yes or no

(iii) LM smoothing parameters:

– λ in [0, 1] in steps of 0.1

– μ in [0, 5000] in steps of 500
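The parameter sweep can be sketched as follows. This is an illustration, not the Indri configuration used in the experiments: the formulas are the standard textbook forms of Jelinek-Mercer and Dirichlet smoothing, λ is taken here as the weight on the collection model (conventions differ between toolkits), and all function and variable names are hypothetical.

```python
def jm_prob(tf, doc_len, p_coll, lam):
    """Jelinek-Mercer: interpolate the document model with the collection model.

    lam is assumed to be the weight on the collection model.
    """
    p_ml = tf / doc_len if doc_len > 0 else 0.0
    return (1 - lam) * p_ml + lam * p_coll

def dirichlet_prob(tf, doc_len, p_coll, mu):
    """Dirichlet prior smoothing: mu acts as a pseudo-count from the collection."""
    denom = doc_len + mu
    return (tf + mu * p_coll) / denom if denom > 0 else 0.0

# Parameter grids as listed on the slide.
lambdas = [i / 10 for i in range(11)]   # 0.0, 0.1, ..., 1.0
mus = list(range(0, 5001, 500))         # 0, 500, ..., 5000

# Smoothed probability of a term occurring twice in a 10-term record,
# with a hypothetical collection probability of 0.01.
p_jm = jm_prob(tf=2, doc_len=10, p_coll=0.01, lam=0.5)
p_dir = dirichlet_prob(tf=2, doc_len=10, p_coll=0.01, mu=1000)
```

With Dirichlet smoothing the amount of smoothing depends on document length, so a short metadata record and a long full-text article are affected differently by the same μ.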

1) Out-of-the-box vs. optimized

[Figure: NDCG of the default vs. optimized runs; a marker in the figure denotes statistical significance]

Optimized: NDCG 0.3263

Default: NDCG 0.2856

1) Out-of-the-box vs. optimized

Optimizing the baseline runs increases the NDCG scores by:

17.4% (JM)

9.8% (DIR)

17.2% (TWO)

The best-performing model is JM, with an NDCG score of 0.3263.

This run is the baseline in the remaining tests (weighting and fusion).

2) Weighting document types

• Weights taken from the set {0.0001, 0.2, 0.4, 0.6, 0.8, 1.0}

• 216 (6 × 6 × 6) unique weight combinations for the three document types
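The count of 216 follows from assigning one of the six weights to each of the three document types. A short sketch of enumerating the grid is shown here; the variable names are hypothetical, and how each weight was applied inside the retrieval system is not stated on the slides.

```python
from itertools import product

WEIGHTS = [0.0001, 0.2, 0.4, 0.6, 0.8, 1.0]
DOC_TYPES = ("BK", "PN", "PF")  # book records, article metadata, full text

# One dict per run: a weight for every document type.
weight_grid = [
    dict(zip(DOC_TYPES, weights))
    for weights in product(WEIGHTS, repeat=len(DOC_TYPES))
]

print(len(weight_grid))  # 6 ** 3 combinations
```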

2) Weighting: top 10 of 216

[Figure: weights for book records, article metadata, and full text in the top-ranked weighted runs, compared with the optimized baseline (NDCG 0.3263)]

3) Fusing document types

• Three separate indexes, optimized runs in each

• Two types of fusion

– Round-robin merging

– Linear combination (LC) with score- or rank-normalization
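The two fusion types can be sketched as below. This is a generic illustration under assumptions: min-max scaling stands in for score normalization and a simple 1 - rank/n mapping for rank normalization, since the slides do not specify the exact formulas; the function names, run contents, and weights are hypothetical.

```python
from itertools import zip_longest

def round_robin(runs):
    """Interleave ranked lists of doc ids: first of each run, then second, ..."""
    merged, seen = [], set()
    for tier in zip_longest(*runs):
        for doc_id in tier:
            if doc_id is not None and doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged

def score_norm(scores):
    """Min-max scale {doc_id: score} into [0, 1] (assumed normalization)."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

def rank_norm(scores):
    """Replace scores by 1 - rank/n, so the top document gets 1.0."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return {doc_id: 1 - i / len(ranked) for i, doc_id in enumerate(ranked)}

def linear_combination(runs, weights, normalize=score_norm):
    """Weighted sum of normalized scores; returns doc ids, best first."""
    fused = {}
    for run, weight in zip(runs, weights):
        for doc_id, s in normalize(run).items():
            fused[doc_id] = fused.get(doc_id, 0.0) + weight * s
    return sorted(fused, key=fused.get, reverse=True)

# Hypothetical per-index runs with log-probability-like scores.
bk_run = {"b1": -5.0, "b2": -7.0}
pf_run = {"f1": -3.0, "f2": -4.0, "f3": -9.0}
fused = linear_combination([bk_run, pf_run], weights=[1.0, 0.5])
interleaved = round_robin([["b1", "b2"], ["f1", "f2", "f3"]])
```

Normalization matters here because scores from separate indexes are not directly comparable; round-robin sidesteps that by ignoring scores entirely.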

3) Fusing document types

Fusion: NDCG 0.3286, 0.7% above the baseline (one index)

Outline

• Introduction

• Methodology

• Experiments

• Conclusion

– Discussion

– Future work

Conclusions

• As expected, optimizing the retrieval model produces beneficial results on a combined index of document types.

• Our approach for weighting document types is not an effective way of improving integrated search performance.

• Round-robin merging is not an effective strategy for integrating different document types.

• Fusion based on Linear Combination on individual indexes for the document types produces results that are slightly better than the baseline.

Discussion

We expected that weighting the document types in our combined index differently could boost performance even further, but this was not the case.

Trend of the best weighted runs: book records tended to have higher weights, and article metadata and full text lower weights.

Future work

• A more extensive analysis of the performance of the individual document types. Goal: more fruitful techniques for weighting them properly.

• Weighting: calculate document-specific weights based on analysis of different document features, instead of only assigning a weight based on the document type.

• Use the citation information from the documents available in the iSearch collection as an additional source of information.

Questions? Comments? Suggestions?
