Copulas for Information Retrieval (SIGIR'13)
Copulas for Information Retrieval
Carsten Eickhoff, Arjen P. de Vries, Kevyn Collins-Thompson
Copulas – What is it all about?
• Assume two sufficiently different commodities
• Rare elemental metals
• Pork bellies
• No apparent correlations
[Figure: Rare Earths and Pork Bellies]
Copulas – What is it all about?
• Two seemingly independent variables
• Yet, for rare extreme cases, there are co-movements
• “Tail dependencies”
• Copulas decouple observations and dependencies
• IR models are good at estimating marginals
• Copulas are good at combining them
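The decoupling of marginals and dependency structure can be illustrated with a small simulation. This is a minimal sketch, not the authors' code: it assumes a bivariate Clayton copula with an arbitrary parameter `theta`, and the two marginals (an exponential and a cubic power transform) are hypothetical choices made only for illustration. The rank dependence (Kendall's tau) is the same before and after the marginals are applied, because it is carried by the copula alone.

```python
import math
import random


def sample_clayton(n, theta, rng):
    """Draw n pairs (u, v) from a bivariate Clayton copula (theta > 0)
    by conditional inversion."""
    pairs = []
    for _ in range(n):
        u = max(rng.random(), 1e-12)  # keep u strictly positive
        w = max(rng.random(), 1e-12)
        v = (u ** -theta * (w ** (-theta / (1.0 + theta)) - 1.0) + 1.0) ** (-1.0 / theta)
        pairs.append((u, v))
    return pairs


def exp_quantile(p, rate):
    """Inverse cdf of an exponential distribution (one possible marginal)."""
    p = min(p, 1.0 - 1e-12)  # guard against log(0)
    return -math.log1p(-p) / rate


def kendall_tau(xs, ys):
    """Kendall's rank correlation, O(n^2), skipping tied pairs."""
    n = len(xs)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx = xs[i] - xs[j]
            dy = ys[i] - ys[j]
            if dx == 0 or dy == 0:
                continue
            s += 1 if (dx > 0) == (dy > 0) else -1
    return s / (n * (n - 1) / 2)


if __name__ == "__main__":
    rng = random.Random(0)
    pairs = sample_clayton(400, theta=2.0, rng=rng)
    us = [u for u, _ in pairs]
    vs = [v for _, v in pairs]
    # Different marginals, same copula: the rank dependence is unchanged.
    xs = [exp_quantile(u, rate=0.5) for u in us]
    ys = [v ** 3 for v in vs]
    print(kendall_tau(us, vs), kendall_tau(xs, ys))
```

For a Clayton copula, Kendall's tau equals theta / (theta + 2), so the printed values should lie near 0.5 for `theta=2.0`.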
Overview
1. Non-linear Dependency Structures in IR
2. Copulas – Intuition & Background
3. Multivariate Relevance Estimation
4. When to use them?
5. Score Fusion
6. Conclusion & Future Directions
1 Non-Linear Dependency Structures in IR
Multivariate Relevance Modelling
• IR systems index and retrieve a growing variety of document types
• Many structured, or at least “complex”
• Single-criteria relevance frameworks do not perform well
• Multi-criteria models tend to be either:
a) Naïve (e.g., independence assumption), or
b) Hard for humans to interpret qualitatively (e.g., learning to rank, L2R)
Non-Linear Dependencies
• Non-linear dependency structures are still a challenge
• TREC 2010 Faceted Blog Distillation Task, Topic 1171, “mysql”
• Relevance Criteria:
• Topicality
• Subjectivity
Non-Linear Dependencies
• Pearson’s ρ = 0.18
• So, there is no real dependency
• …right?
• In the lower third of the scale, we note ρ = 0.37
• And in the upper third, it turns to ρ = -0.4
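The effect described here, a weak overall correlation hiding opposite-signed correlations in different score ranges, can be checked by computing Pearson's ρ per segment. The following is a sketch under assumptions: the slide does not say how "the scale" is split, so the code segments the range of the first criterion into equal-width thirds, and `toy_scores` generates hypothetical data rather than the Topic 1171 scores.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient; None if it is undefined."""
    n = len(xs)
    if n < 2:
        return None
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return None
    return cov / math.sqrt(vx * vy)


def pearson_by_segment(xs, ys, n_segments=3):
    """Pearson's rho within equal-width segments of the xs range."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / n_segments
    segments = [([], []) for _ in range(n_segments)]
    for x, y in zip(xs, ys):
        k = min(int((x - lo) / width), n_segments - 1) if width > 0 else 0
        segments[k][0].append(x)
        segments[k][1].append(y)
    return [pearson(sx, sy) for sx, sy in segments]


def toy_scores(n=30):
    """Hypothetical scores: y rises with x in the lower third, is flat in
    the middle, and falls in the upper third."""
    xs = [i / (n - 1) for i in range(n)]
    ys = []
    for i, x in enumerate(xs):
        wiggle = 0.02 if i % 2 == 0 else -0.02
        if x < 1 / 3:
            ys.append(x + wiggle)
        elif x < 2 / 3:
            ys.append(1 / 3 + wiggle)
        else:
            ys.append(1 - x + wiggle)
    return xs, ys


if __name__ == "__main__":
    xs, ys = toy_scores()
    print("overall:", pearson(xs, ys))
    print("per third:", pearson_by_segment(xs, ys))
```

On the toy data the overall coefficient is close to zero while the lower and upper thirds show strong positive and negative correlation, which is the kind of structure a single linear coefficient cannot express.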
2 Copulas – Intuition & Background
Copulas (from copulare, to join)
• Copulas model complex non-linear dependencies between variables that simple correlations can't capture
• Decouple marginal distributions from dependency structure
• Approximate joint multivariate distributions
• Applied previously in portfolio and risk management, meteorology, river flooding predictions, …
Formal Basics
• Given a k-dimensional random variable (rv)
• Map to unit cube
• Describe joint cdf with copula
• Isolation of a component
• Copula’s zero
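In standard copula notation, the five steps above are commonly written as follows. The symbols X, F_i, U_i and C are assumptions introduced here for illustration, since the slide's own notation is not reproduced in the bullets; the statements themselves are the textbook definitions (the third line is Sklar's theorem).

```latex
\begin{align*}
X &= (X_1, \dots, X_k), \quad X_i \sim F_i
  && \text{($k$-dimensional random variable with continuous marginal cdfs)} \\
U &= (U_1, \dots, U_k) = \bigl(F_1(X_1), \dots, F_k(X_k)\bigr) \in [0,1]^k
  && \text{(map to unit cube)} \\
F(x_1, \dots, x_k) &= C\bigl(F_1(x_1), \dots, F_k(x_k)\bigr)
  && \text{(joint cdf described by the copula $C$)} \\
C(1, \dots, 1, u_i, 1, \dots, 1) &= u_i
  && \text{(isolation of a component)} \\
C(u_1, \dots, u_k) &= 0 \quad \text{if any } u_i = 0
  && \text{(copula's zero)}
\end{align*}
```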
Closing the circle
• Recall the example TREC topic 1171
• Linear combination: AP = 0.14, below collection average (0.25)
• Fit Clayton copula to model joint relevance distribution
• AP rises to 0.22
3 Multivariate Relevance Estimation
Joint Relevance Estimation
• Estimate marginal distributions from data
• Estimate copula fitting parameters to maximize posterior probability of observing data
• Use copula to represent joint probability of relevance
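The three steps above can be sketched for two relevance criteria as follows. This is an illustrative sketch and not the authors' implementation: marginals are estimated with rescaled empirical cdfs, the copula is assumed to be a bivariate Clayton copula, and its parameter is obtained by inverting Kendall's tau (theta = 2·tau / (1 − tau)), a simple moment-style estimator swapped in for the fitting procedure named on the slide. All function names are hypothetical.

```python
import bisect


def make_ecdf(sample):
    """Return a rescaled empirical cdf with values strictly inside (0, 1)."""
    ordered = sorted(sample)
    n = len(ordered)

    def ecdf(x):
        rank = bisect.bisect_right(ordered, x)
        rank = min(max(rank, 1), n)  # clamp so the copula input is never 0
        return rank / (n + 1)

    return ecdf


def kendall_tau(xs, ys):
    """Kendall's rank correlation, O(n^2), skipping tied pairs."""
    n = len(xs)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = xs[i] - xs[j], ys[i] - ys[j]
            if dx == 0 or dy == 0:
                continue
            s += 1 if (dx > 0) == (dy > 0) else -1
    return s / (n * (n - 1) / 2)


def clayton_theta_from_tau(tau):
    """Invert tau = theta / (theta + 2); only valid for positive dependence."""
    if not 0 < tau < 1:
        raise ValueError("Clayton copula needs 0 < tau < 1")
    return 2.0 * tau / (1.0 - tau)


def clayton_cdf(u, v, theta):
    """Bivariate Clayton copula C(u, v) for u, v in (0, 1] and theta > 0."""
    return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)


def fit_joint_model(criterion_a, criterion_b):
    """Fit marginals and copula on paired training scores; return a scorer."""
    f_a = make_ecdf(criterion_a)
    f_b = make_ecdf(criterion_b)
    theta = clayton_theta_from_tau(kendall_tau(criterion_a, criterion_b))

    def joint_score(a, b):
        return clayton_cdf(f_a(a), f_b(b), theta)

    return joint_score, theta


if __name__ == "__main__":
    # Hypothetical paired training scores for two criteria.
    scorer, theta = fit_joint_model([1, 2, 3, 4], [1, 3, 2, 4])
    print(theta, scorer(4, 4), scorer(1, 1))
```

Documents can then be ranked by `joint_score`, so that the combination of criteria reflects the fitted dependency structure rather than a fixed linear weighting.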
Joint Relevance Estimation
• We study three different scenarios:
• Opinionated blog posts
• Personalized bookmarks
• Child-friendly websites
• Use the original training portion of each corpus where available
• Otherwise, a 90/10 split
Results I – Opinionated Blog Posts
• TREC Blogs08 dataset
• 1.3 M documents
• Relevance dimensions: Topicality & Subjectivity
• Significantly higher performance than linear combination model
Results II – Personalized Bookmarks
• Dataset by Vallet & Castells
• 339k documents
• Relevance Dimensions: Topicality & Personal relevance
• Significant performance gains in some metrics
Results III – Child-friendly Websites
• Dataset from the PuppyIR project (http://puppyir.eu)
• 22k documents
• Relevance Dimensions: Topicality & Child-suitability
• Worse-than-baseline performance
4 Copulas – When to use them?
When to use them?
• Previously: Strongly varying performance for different settings
• Is there a way of predicting the merit?
• Recall: copulas model tail dependencies between dimensions
Types of Tail Dependencies
Measuring Tail Dependencies
• According to Frees and Valdez (1998), IL and IU measure the strength of lower and upper tail dependencies
• Anderson-Darling test of goodness-of-fit between copula and observed data
Domain                  | Frees tail index | Anderson-Darling | Actual retrieval performance
Opinionated Blogs       | IL = 0.07        | 0.67             | Copulas > linear
Personalized Bookmarks  | IU = 0.49        | 0.47             | Copulas = linear
Child-friendly Websites | IL = IU = 0      | 0.046            | Copulas < linear
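A rough way to inspect tail dependence in paired scores is the empirical tail concentration: the share of observations that fall jointly into the lowest (or highest) q-fraction of both criteria, divided by q. The sketch below is an assumption-laden illustration and is not the Frees and Valdez tail index or the Anderson-Darling test reported in the table; the threshold `q` and all function names are hypothetical. For reference, the Clayton copula has the closed-form lower tail dependence coefficient 2^(−1/theta) and no upper tail dependence.

```python
def pseudo_observations(values):
    """Map values to ranks scaled into (0, 1); ties are broken by order."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank / (n + 1)
    return ranks


def tail_concentration(xs, ys, q=0.1):
    """Empirical lower and upper tail concentration at threshold q.

    Values near 1 suggest strong joint extremes; values near q are what
    independent criteria would produce."""
    us, vs = pseudo_observations(xs), pseudo_observations(ys)
    n = len(us)
    lower = sum(1 for u, v in zip(us, vs) if u <= q and v <= q) / (n * q)
    upper = sum(1 for u, v in zip(us, vs) if u > 1 - q and v > 1 - q) / (n * q)
    return lower, upper


def clayton_lower_tail(theta):
    """Lower tail dependence coefficient of a Clayton copula (theta > 0)."""
    return 2.0 ** (-1.0 / theta)


if __name__ == "__main__":
    scores = list(range(100))
    print(tail_concentration(scores, scores))        # perfectly co-moving
    print(tail_concentration(scores, scores[::-1]))  # perfectly opposed
    print(clayton_lower_tail(1.0))
```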
5 Copulas for Score Fusion
Score Fusion
• A different angle on relevance estimation
• Combine individual retrieval system scores instead of modelling relevance from content criteria
• In this setting, submissions to historic TRECs serve as criteria
• We randomly draw k individual runs and combine them using copulas
Fusion Methods
• Established:
• Copula-based:
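The established baselines named on the following slides, CombSUM and CombMNZ, can be sketched as below. This covers only the two classic fusion rules in their usual form (sum of normalized scores, and that sum multiplied by the number of runs retrieving the document); the copula-based variants are not reproduced here. Min-max normalization per run and the dictionary-based run format are assumptions made for this example.

```python
def min_max_normalize(run):
    """Rescale one run's scores to [0, 1]. run: dict of doc_id -> raw score."""
    lo, hi = min(run.values()), max(run.values())
    if hi == lo:
        return {doc: 1.0 for doc in run}
    return {doc: (score - lo) / (hi - lo) for doc, score in run.items()}


def comb_sum(runs):
    """CombSUM: sum of normalized scores over all runs retrieving a document."""
    fused = {}
    for run in runs:
        for doc, score in min_max_normalize(run).items():
            fused[doc] = fused.get(doc, 0.0) + score
    return fused


def comb_mnz(runs):
    """CombMNZ: CombSUM multiplied by the number of runs retrieving the document."""
    sums = comb_sum(runs)
    hits = {}
    for run in runs:
        for doc in run:
            hits[doc] = hits.get(doc, 0) + 1
    return {doc: sums[doc] * hits[doc] for doc in sums}


if __name__ == "__main__":
    # Two hypothetical runs over three documents.
    runs = [{"a": 3, "b": 1}, {"a": 10, "b": 5, "c": 0}]
    print(comb_sum(runs))
    print(comb_mnz(runs))
```

Both rules treat the runs as independent evidence, which is the point at which an explicit model of the dependency structure between runs can be substituted.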
Results – TREC 4
• Results are averaged across 200 randomizations per setting of k
• Relative improvements over the best, worst and median fused run in terms of percentages of MAP
• Small but consistent improvements over non-copula fusion baselines
Robustness – CombSUM and CombMNZ
• Fusion approaches are often sensitive to weak contributions
• We control the number of weak submissions added to the fusion
• Copulas’ explicit modeling of dependency structure is more robust
6 Conclusion and Future Directions
Conclusion
• Copulas decouple observations and dependencies
• IR models are good at estimating marginals
• Copulas are good at combining them
• We use them for multivariate relevance estimation
• Strongly scenario-dependent performance
• Tail indices & goodness-of-fit tests as estimators of expected performance
• Copulas for score fusion
• Robust to outliers
The Road Ahead
• Currently, we use single copulas for relevance modelling
• Copula mixtures and composite Archimedean copulas for higher accuracy
• Here, we use pre-existing copula families and fit them to data
• Instead, can we formalize copulas from scratch to include domain knowledge?
• So far, we explored two-dimensional relevance spaces
• What happens as we move into higher-order systems?
Thank You!