An Exploration of Retrieval-Enhancing Methods for Integrated Search in a Digital Library
TBAS2012, Barcelona. April 1, 2012
Diana Ransgaard Sørensen, Toine Bogers, Birger Larsen
Royal School of Library and Information Science, Copenhagen, Denmark
02/04/2012 1
Outline
• Introduction
– Problem
– Our focus
– Goal
• Methodology
• Experiments
• Conclusion
Introduction
• Problem
– Different document types contain different amounts of text (full text vs. metadata-only)
– Some document types are more likely to be retrieved than others, regardless of relevance
• Our focus
– How to best combine & rank different document types and representations in a digital library setting?
• Goal
– Present the user with a single ranked list containing the optimal mix of document types
– Explore different techniques for integrating different document types and representations into a single results list
Outline
• Introduction
• Methodology
– Test collection
– Topics
– Experimental setup
• Experiments
• Conclusion
Test collection
• iSearch collection
– Based on the digital physics library arXiv.org
– Available from http://itlab.dbit.dk/~isearch
– Three different document types
• 18,443 metadata-only book records (BK)
• 291,246 metadata-only article records + abstracts (PN)
• 143,571 full-text article records, including metadata (PF)
– Topics
• 65 topics with graded relevance assessments
• Created by 23 lecturers and experienced postgraduate and graduate students from three different university departments of physics
Topics
• Each topic representation contains five fields
– Description of information sought
– User background knowledge
– Work task description
– Ideal answer
– Keywords
• “What are the key search terms used to express your situation and your information needs?”
Experimental setup
• Indexing & retrieval
– Indri 5.0 toolkit
• Stop word filtering
• Stemming
– Language modeling algorithms with three different smoothing methods
• Jelinek-Mercer smoothing (JM)
• Bayesian smoothing using Dirichlet priors (DIR)
• Two-stage smoothing (TWO)
• Evaluation
– Normalized Discounted Cumulated Gain (NDCG)
– Two-tailed paired Student's t-test
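As an illustration of the evaluation measure, NDCG over graded relevance assessments could be computed roughly as follows. This is a minimal sketch, not the evaluation code used in the experiments: the log2 rank discount, the use of the raw relevance grade as gain, and the function names `dcg` and `ndcg` are all assumptions.

```python
import math


def dcg(gains):
    """Discounted cumulated gain with a log2 rank discount (assumed variant)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(ranked_gains, all_gains, k=None):
    """Normalize the DCG of a ranking by the DCG of the ideal ranking.

    ranked_gains: relevance grades of the retrieved documents, in rank order.
    all_gains:    relevance grades of every judged document for the topic.
    k:            optional rank cutoff (hypothetical parameter).
    """
    ideal = sorted(all_gains, reverse=True)
    if k is not None:
        ranked_gains, ideal = ranked_gains[:k], ideal[:k]
    ideal_dcg = dcg(ideal)
    # A topic without any relevant document gets a score of 0.
    return dcg(ranked_gains) / ideal_dcg if ideal_dcg > 0 else 0.0
```

Per-topic scores from two runs could then be compared with a paired t-test over the 65 topics.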
Outline
• Introduction
• Methodology
• Experiments
1) Out of the box
2) Weighting
3) Fusion
• Conclusion
Experiments
1) Default settings + optimized baseline runs
2) Adjust weighting of the three document types
3) Fusing different document types
1) Out-of-the-box vs. optimized
• We optimize the settings of the system and of the retrieval model on a combined index of all three document types (BK, PF and PN)
– Using the default, out-of-the-box settings does not always provide the best retrieval performance
– Default parameter settings can be seen as a generalization over many different test collections
• Goal is to examine how much performance can improve over default settings in this integrated search scenario
1) Out-of-the-box vs. optimized
• What do we compare?
– Out-of-the-box
– Tuned
• What do we optimize?
(i) Stop word filtering: Yes or no
(ii) Krovetz stemming: Yes or no
(iii) LM smoothing parameters:
λ in [0, 1] in steps of 0.1
μ in [0, 5000] in steps of 500
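To make the role of λ and μ concrete, the sketch below shows the textbook forms of Jelinek-Mercer and Dirichlet smoothing together with the two parameter grids. It is an illustration only: here λ is taken as the weight on the collection model, which may differ from how Indri names its parameters, and the function and variable names are hypothetical.

```python
def jelinek_mercer(tf, doc_len, p_collection, lam):
    """P(w|d) = (1 - lam) * tf/|d| + lam * P(w|C)."""
    p_ml = tf / doc_len if doc_len > 0 else 0.0
    return (1.0 - lam) * p_ml + lam * p_collection


def dirichlet(tf, doc_len, p_collection, mu):
    """P(w|d) = (tf + mu * P(w|C)) / (|d| + mu)."""
    denom = doc_len + mu
    return (tf + mu * p_collection) / denom if denom > 0 else 0.0


# Parameter grids as described above: 11 values each.
lambdas = [round(0.1 * i, 1) for i in range(11)]  # 0.0, 0.1, ..., 1.0
mus = list(range(0, 5001, 500))                   # 0, 500, ..., 5000
```

Two-stage smoothing combines both steps, so its tuning would cover pairs of (λ, μ) values rather than a single parameter.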
1) Out-of-the-box vs. optimized
[Figure: NDCG of default vs. optimized runs; a marker indicates statistical significance]
Optimized: NDCG 0.3263
Default: NDCG 0.2856
1) Out-of-the-box vs. optimized
Optimizing the baseline runs increases the NDCG scores by:
17.4% (JM)
9.8% (DIR)
17.2% (TWO)
The best-performing model is JM, with an NDCG score of 0.3263.
It serves as the baseline in the remaining tests (weighting and fusion).
2) Weighting document types
• Weights: each document type takes a value from [0.0001, 0.2, 0.4, 0.6, 0.8, 1.0]
– 216 unique combinations (6 × 6 × 6) for the three document types
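The 216 weight combinations can be enumerated as in the sketch below. It only generates the grid; how each weight is applied inside the retrieval system is not shown, and the names `WEIGHTS`, `DOC_TYPES`, and `combinations` are illustrative.

```python
from itertools import product

WEIGHTS = [0.0001, 0.2, 0.4, 0.6, 0.8, 1.0]
DOC_TYPES = ("BK", "PN", "PF")  # books, article metadata, full-text articles

# One weight per document type: 6 * 6 * 6 combinations.
combinations = [
    dict(zip(DOC_TYPES, weights))
    for weights in product(WEIGHTS, repeat=len(DOC_TYPES))
]

print(len(combinations))  # 216
```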
2) Weighting: top 10 of 216
[Table: top 10 of the 216 weighted runs, with the weights for book records, metadata, and full text]
Optimized baseline: NDCG 0.3263
3) Fusing document types
• Three separate indexes, optimized runs in each
• Two types of fusion
– Round-robin merging
– Linear combination (LC) with score- or rank-normalization
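The two fusion types can be sketched as follows. This is a rough illustration under assumptions, not the code behind the reported runs: min-max is assumed for score normalization, a simple linear rank-to-score mapping for rank normalization, and all function names are hypothetical.

```python
from itertools import zip_longest


def round_robin(*runs):
    """Interleave ranked lists of document ids, skipping duplicates."""
    merged, seen = [], set()
    for tier in zip_longest(*runs):
        for doc in tier:
            if doc is not None and doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


def min_max(scores):
    """Score normalization: rescale a {doc: score} run to [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {d: (s - lo) / span if span > 0 else 1.0 for d, s in scores.items()}


def rank_normalize(ranking):
    """Rank normalization: the top document gets 1.0, decreasing linearly."""
    n = len(ranking)
    return {doc: 1.0 - i / n for i, doc in enumerate(ranking)}


def linear_combination(runs, weights):
    """Sum weighted normalized scores per document and sort descending."""
    fused = {}
    for run, w in zip(runs, weights):
        for doc, score in run.items():
            fused[doc] = fused.get(doc, 0.0) + w * score
    # Ties are broken by document id to keep the output deterministic.
    return sorted(fused, key=lambda d: (-fused[d], d))
```

With one run per index (BK, PN, PF), each run would first be normalized and then passed to `linear_combination` with one weight per document type.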
3) Fusing document types
Fusion: NDCG 0.3286, 0.7% above the optimized baseline on one index (NDCG 0.3263)
Outline
• Introduction
• Methodology
• Experiments
• Conclusion
– Discussion
– Future work
Conclusions
• As expected, optimization of the retrieval model produces beneficial results on a combined index of document types.
• Our approach for weighting document types is not an effective way of improving integrated search performance.
• Round-robin merging is not an effective strategy for integrating different document types.
• Fusion based on Linear Combination on individual indexes for the document types produces results that are slightly better than the baseline.
Discussion
We expected that weighting the document types in our combined index differently could boost performance even further, but this was not the case.
Trend of the best weighted runs: book records tended to have higher weights, and article metadata and full text lower weights.
Future work
• A more extensive analysis of the performance of the individual document types. Goal: more fruitful techniques for weighting them properly.
• Weighting: calculate document-specific weights based on analysis of different document features, instead of only assigning a weight based on the document type.
• Use the citation information from the documents available in the iSearch collection as an additional source of information.