Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval
![Page 1: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval](https://reader033.fdocuments.in/reader033/viewer/2022052911/559d38561a28ab5c398b46e0/html5/thumbnails/1.jpg)
AdMIRe 2012 Lyon, France · April 17th Picture by ERdi43 (Wikipedia)
Towards Minimal Test Collections for Evaluation of
Audio Music Similarity and Retrieval
@julian_urbano University Carlos III of Madrid
@m_schedl Johannes Kepler University
![Page 2: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval](https://reader033.fdocuments.in/reader033/viewer/2022052911/559d38561a28ab5c398b46e0/html5/thumbnails/2.jpg)
Problem
evaluation of IR systems is costly
Annotations
time consuming, expensive, boring
(Bad) Consequence
small and biased test collections, unlikely to change from year to year
Solution
apply low-cost evaluation methodologies
![Page 3: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval](https://reader033.fdocuments.in/reader033/viewer/2022052911/559d38561a28ab5c398b46e0/html5/thumbnails/3.jpg)
[Timeline figure, 1960 to 2011]
ISMIR (2000-today)
MIREX (2005-today)
TREC (1992-today)
CLEF (2000-today)
NTCIR (1999-today)
Cranfield 2 (1962-1966)
MEDLARS (1966-1967)
SMART (1961-1995)
nearly 2 decades of Meta-Evaluation in Text IR
a lot of things have happened here!
some good practices inherited from here
![Page 4: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval](https://reader033.fdocuments.in/reader033/viewer/2022052911/559d38561a28ab5c398b46e0/html5/thumbnails/4.jpg)
Minimal Test Collections (MTC) [Carterette et al.]
estimate the ranking of systems with very few judgments (high incompleteness)
Application in Audio Music Similarity (AMS)
dozens of volunteers required by MIREX every year to make thousands of judgments
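| Year | Teams | Systems | Queries | Results | Judgments | Overlap |
|------|-------|---------|---------|---------|-----------|---------|
| 2006 | 5 | 6 | 60 | 1,800 | 1,629 | 10% |
| 2007 | 8 | 12 | 100 | 6,000 | 4,832 | 19% |
| 2009 | 9 | 15 | 100 | 7,500 | 6,732 | 10% |
| 2010 | 5 | 8 | 100 | 4,000 | 2,737 | 32% |
| 2011 | 10 | 18 | 100 | 9,000 | 6,322 | 30% |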
![Page 5: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval](https://reader033.fdocuments.in/reader033/viewer/2022052911/559d38561a28ab5c398b46e0/html5/thumbnails/5.jpg)
evaluation with
incomplete judgments
![Page 6: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval](https://reader033.fdocuments.in/reader033/viewer/2022052911/559d38561a28ab5c398b46e0/html5/thumbnails/6.jpg)
Basic Idea
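treat similarity scores as random variables that can be estimated with uncertainty
gain of an arbitrary document: $G_i$ follows a multinomial distribution
$E[G_i] = \sum_{l \in \mathcal{L}} P(G_i = l) \cdot l$
$\mathcal{L}_{BROAD} = \{0, 1, 2\}$, $\mathcal{L}_{FINE} = \{0, 1, \dots, 100\}$
whenever document i is judged with level l:
$E[G_i] = l$, $Var[G_i] = 0$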
*all variance formulas in the paper
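A minimal sketch of the expectation above, assuming a uniform distribution over the scale for unjudged documents (the deck adopts that assumption later, for the MIREX 2011 experiment). The function names are hypothetical, and the variance is the textbook definition for a discrete variable, not necessarily the form used in the paper.

```python
from fractions import Fraction

BROAD = [0, 1, 2]
FINE = list(range(101))

def uniform(scale):
    """Assumed prior: every level of the scale is equally likely."""
    p = Fraction(1, len(scale))
    return {level: p for level in scale}

def judged(level):
    """After a judgment, all probability mass sits on the observed level."""
    return {level: Fraction(1)}

def expected_gain(dist):
    """E[G_i] = sum over levels of P(G_i = l) * l."""
    return sum(p * level for level, p in dist.items())

def gain_variance(dist):
    """Var[G_i] = sum over levels of P(G_i = l) * (l - E[G_i])^2."""
    mean = expected_gain(dist)
    return sum(p * (level - mean) ** 2 for level, p in dist.items())
```

Under this prior an unjudged document has expected gain 1 on the BROAD scale and 50 on the FINE scale. A judged document has zero variance.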
![Page 7: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval](https://reader033.fdocuments.in/reader033/viewer/2022052911/559d38561a28ab5c398b46e0/html5/thumbnails/7.jpg)
AG@k is also treated as a random variable
$E[AG@k] = \frac{1}{k} \sum_{i \in \mathcal{D}} E[G_i] \cdot I(A_i \le k)$
the sum iterates all documents $\mathcal{D}$ (in practice, only the top k retrieved)
$A_i$ is the ranking at which document i was retrieved
Ultimate Goal
compute a good estimate with the least effort
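As an illustration of the expected AG@k, here is a small sketch for one system and one query. The data layout (a dict from document id to rank, and a dict from document id to expected gain) is an assumption made for the example.

```python
def expected_ag_at_k(ranks, expected_gain, k):
    """E[AG@k] = (1/k) * sum of E[G_i] over documents retrieved at rank <= k.

    ranks: document id -> 1-based rank at which the system retrieved it.
    expected_gain: document id -> E[G_i] (the true gain if already judged).
    """
    return sum(expected_gain[doc] for doc, rank in ranks.items() if rank <= k) / k

# Document "a" is judged (gain 2); "b" and "c" are unjudged, so they carry
# the mean of an assumed uniform prior on the BROAD scale, which is 1.
ranks = {"a": 1, "b": 2, "c": 3}
gains = {"a": 2.0, "b": 1.0, "c": 1.0}
estimate = expected_ag_at_k(ranks, gains, k=2)
```

Only documents inside the top k contribute, which is why in practice the sum never needs to visit the rest of the collection.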
![Page 8: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval](https://reader033.fdocuments.in/reader033/viewer/2022052911/559d38561a28ab5c398b46e0/html5/thumbnails/8.jpg)
Comparing Two Systems
$E[\Delta AG@k] = \frac{1}{k} \sum_{i \in \mathcal{D}} E[G_i] \cdot \left( I(A_i \le k) - I(B_i \le k) \right)$
what really matters is the sign of the difference
Evaluating Several Queries
$E[\Delta AG@k] = \frac{1}{|\mathcal{Q}|} \sum_{q \in \mathcal{Q}} E[\Delta AG@k_q]$
the sum iterates all queries
The Rationale
if $\alpha < P(\Delta AG@k \le 0) < 1 - \alpha$ then judge another document, else stop judging
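The two expectations on this slide can be sketched as follows. The data layout and function names are assumptions for illustration; a document missing from a system's dict is treated as not retrieved.

```python
def indicator(ranks, doc, k):
    """I(rank of doc <= k); documents the system did not retrieve count as 0."""
    rank = ranks.get(doc)
    return 1 if rank is not None and rank <= k else 0

def expected_delta(ranks_a, ranks_b, expected_gain, k):
    """E[dAG@k] for one query: (1/k) * sum of E[G_i] * (I(A_i<=k) - I(B_i<=k))."""
    docs = set(ranks_a) | set(ranks_b)
    return sum(
        expected_gain[d] * (indicator(ranks_a, d, k) - indicator(ranks_b, d, k))
        for d in docs
    ) / k

def expected_delta_over_queries(per_query, k):
    """Mean of the per-query expectations; per_query holds (ranks_a, ranks_b, gains)."""
    return sum(expected_delta(a, b, g, k) for a, b, g in per_query) / len(per_query)

q1 = ({"x": 1, "y": 2}, {"y": 1, "z": 2}, {"x": 2, "y": 1, "z": 0})
q2 = ({"p": 1, "q": 2}, {"q": 1, "r": 2}, {"p": 0, "q": 2, "r": 1})
mean_delta = expected_delta_over_queries([q1, q2], k=2)
```

In the first query, document `y` is retrieved by both systems, so its gain is multiplied by zero and drops out of the difference.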
![Page 9: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval](https://reader033.fdocuments.in/reader033/viewer/2022052911/559d38561a28ab5c398b46e0/html5/thumbnails/9.jpg)
Distribution of AG@k
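$P(AG@k = z) := \sum_{\gamma_k \in \Gamma_k} P(AG@k = z \mid \gamma_k) \cdot P(\gamma_k)$
what are the possible assignments of similarity?
the sum iterates all possible permutations of k similarity assignments $\gamma_k \in \Gamma_k$
$P(\gamma_k)$ ultimately depends on the distribution of $G_i$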
Plain English
the ratio of similarity assignments s.t. AG@k=z
For Complex Measures or Large Similarity Scales
run Monte Carlo simulation
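A sketch of both routes, assuming independent gains and a per-document distribution given as a `{level: probability}` dict. The exact version enumerates every assignment of levels to the top k documents. The Monte Carlo version samples assignments instead, which is the only practical option when the scale or k is large. All names are hypothetical.

```python
import itertools
import random
from collections import Counter, defaultdict
from fractions import Fraction

def exact_distribution(dists):
    """P(AG@k = z) by enumerating all level assignments to the k documents."""
    k = len(dists)
    out = defaultdict(Fraction)
    for assignment in itertools.product(*(d.items() for d in dists)):
        prob, total = Fraction(1), 0
        for level, p in assignment:
            prob *= p          # assumes gains are independent
            total += level
        out[Fraction(total, k)] += prob
    return dict(out)

def monte_carlo_distribution(dists, trials, rng):
    """Approximate the same distribution by sampling assignments."""
    k = len(dists)
    tables = [(list(d), [float(p) for p in d.values()]) for d in dists]
    counts = Counter()
    for _ in range(trials):
        total = sum(rng.choices(levels, weights=w)[0] for levels, w in tables)
        counts[Fraction(total, k)] += 1
    return {z: c / trials for z, c in counts.items()}

uniform_broad = {level: Fraction(1, 3) for level in (0, 1, 2)}
exact = exact_distribution([uniform_broad, uniform_broad])
approx = monte_carlo_distribution([uniform_broad, uniform_broad], 20000, random.Random(0))
```

With two unjudged documents on the BROAD scale, three of the nine assignments sum to 2, so the exact probability of AG@2 = 1 is 1/3, and the sampled estimate should land close to it.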
![Page 10: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval](https://reader033.fdocuments.in/reader033/viewer/2022052911/559d38561a28ab5c398b46e0/html5/thumbnails/10.jpg)
Actually, AG@k is a Special Case
let G be the similarity of the top k for all queries
1. take a sample of k documents. Mean = X1
2. take a sample of k documents. Mean = X2
...
Q. take a sample of k documents. Mean = XQ
Mean of sample means = X
each X1 ... XQ is AG@k for a single query
X is the mean AG@k over all queries
Central Limit Theorem
as Q→∞, X approximates a normal distribution, regardless of the distribution of G
![Page 11: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval](https://reader033.fdocuments.in/reader033/viewer/2022052911/559d38561a28ab5c398b46e0/html5/thumbnails/11.jpg)
AG@k is Normally Distributed
use the normal cumulative distribution function Φ
$P(\Delta AG@k \le 0) = \Phi\left( \frac{-E[\Delta AG@k]}{\sqrt{Var[\Delta AG@k]}} \right)$
[Figure: density of AG@5 on the BROAD scale (x-axis 0 to 2) and on the FINE scale (x-axis 0 to 100)]
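The normal approximation and the stopping rule can be sketched with the standard library. Defining confidence as `max(p, 1 - p)` is an assumption about how confidence in the sign is measured; the slides do not spell it out.

```python
from math import sqrt
from statistics import NormalDist

def prob_delta_nonpositive(mean, variance):
    """P(dAG@k <= 0) = Phi(-E[dAG@k] / sqrt(Var[dAG@k])); requires variance > 0."""
    return NormalDist().cdf(-mean / sqrt(variance))

def confidence(mean, variance):
    """Assumed definition: probability mass on the more likely sign."""
    p = prob_delta_nonpositive(mean, variance)
    return max(p, 1 - p)

def keep_judging(mean, variance, alpha=0.05):
    """Judge another document while alpha < P(dAG@k <= 0) < 1 - alpha."""
    p = prob_delta_nonpositive(mean, variance)
    return alpha < p < 1 - alpha
```

For example, an expected difference of 0.2 with variance 0.01 sits two standard deviations above zero, so `keep_judging(0.2, 0.01)` is false at α = 0.05, while `keep_judging(0.05, 0.01)` is still true.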
![Page 12: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval](https://reader033.fdocuments.in/reader033/viewer/2022052911/559d38561a28ab5c398b46e0/html5/thumbnails/12.jpg)
Confidence as a Function of # Judgments
[Figure: confidence in the ranking of systems (50 to 100) as a function of the percent of judgments made (0 to 100)]
what documents should we judge? those that maximize the confidence
at some point we can stop judging
or keep judging to be really confident
or waste our time
![Page 13: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval](https://reader033.fdocuments.in/reader033/viewer/2022052911/559d38561a28ab5c398b46e0/html5/thumbnails/13.jpg)
The Trick
documents retrieved by both systems are useless: there is no need to judge them
whatever Gi is, it is added and then subtracted
Comparing Several Systems
compute a weight wi for each query-document
judge the document with the largest effect
wi in the Original MTC
wi = largest weight across system pairs
reduces to # of system pairs affected by query-doc i
![Page 14: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval](https://reader033.fdocuments.in/reader033/viewer/2022052911/559d38561a28ab5c398b46e0/html5/thumbnails/14.jpg)
wi Dependent on Confidence
if we are highly confident about a pair of systems we do not need to judge another of their documents, even if it has the largest weight
$w_i = \sum_{(A,B) \in \mathcal{S} - \mathcal{R}} (1 - C_{A,B}) \cdot \left( I(A_i \le k) - I(B_i \le k) \right)^2$
the sum iterates system pairs with low confidence
weight inversely proportional to confidence
better results than traditional weights
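A sketch of the confidence-dependent weight for one query-document. It assumes that the excluded set of pairs consists of those whose confidence already reached a threshold; the threshold value, the data layout and the names are all hypothetical.

```python
from itertools import combinations

def in_top_k(ranks, doc, k):
    """I(rank <= k); a document the system did not retrieve counts as 0."""
    return 1 if ranks.get(doc, k + 1) <= k else 0

def weight(doc, systems, conf, k, threshold=0.95):
    """w_i = sum over low-confidence pairs of (1 - C_AB) * (I(A_i<=k) - I(B_i<=k))^2."""
    w = 0.0
    for a, b in combinations(sorted(systems), 2):
        c = conf[(a, b)]
        if c >= threshold:   # assumed rule: this pair is already resolved
            continue
        diff = in_top_k(systems[a], doc, k) - in_top_k(systems[b], doc, k)
        w += (1 - c) * diff ** 2
    return w

systems = {"A": {"x": 1, "y": 2}, "B": {"y": 1, "z": 2}, "C": {"x": 1, "z": 2}}
conf = {("A", "B"): 0.6, ("A", "C"): 0.99, ("B", "C"): 0.7}
weights = {d: weight(d, systems, conf, k=2) for d in ("x", "y", "z")}
```

Here document `x` separates both unresolved pairs, so it gets the largest weight and would be judged first. The squared difference is 0 for any pair whose two systems both retrieved the document, or both missed it.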
![Page 15: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval](https://reader033.fdocuments.in/reader033/viewer/2022052911/559d38561a28ab5c398b46e0/html5/thumbnails/15.jpg)
MTC for AMS
with AG@k
![Page 16: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval](https://reader033.fdocuments.in/reader033/viewer/2022052911/559d38561a28ab5c398b46e0/html5/thumbnails/16.jpg)
MTC for ΔAG@k
while $\frac{1}{|\mathcal{S}|} \sum_{(A,B) \in \mathcal{S}} C_{A,B} \le 1 - \alpha$ do (average confidence on the ranking)
    $i^* \leftarrow \arg\max_i w_i$ from all unjudged query-documents (select the best document)
    judge query-document $i^*$ (obtain true $gain_{i^*}$)
    $E[G_{i^*}] \leftarrow gain_{i^*}$, $Var[G_{i^*}] \leftarrow 0$ (update: increase confidence)
end while
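The loop can be sketched end to end under several stated assumptions: a uniform prior on the BROAD scale, independent gains, the normal approximation for confidence, and a hypothetical `oracle` callable standing in for the human assessor. This is an illustration of the control flow, not the authors' implementation.

```python
from itertools import combinations
from math import sqrt
from statistics import NormalDist

PRIOR_MEAN, PRIOR_VAR = 1.0, 2.0 / 3.0  # assumed uniform prior on BROAD {0, 1, 2}

def top_k(run, k):
    return set(run[:k])

def pair_confidence(runs_a, runs_b, judged, k):
    """Confidence in the sign of the mean dAG@k over all queries."""
    n = len(runs_a)
    mean = var = 0.0
    for q in runs_a:
        a, b = top_k(runs_a[q], k), top_k(runs_b[q], k)
        # Documents retrieved by both systems cancel out, so skip them.
        for docs, sign in ((a - b, 1), (b - a, -1)):
            for d in sorted(docs):
                if (q, d) in judged:
                    mean += sign * judged[(q, d)] / (k * n)
                else:
                    mean += sign * PRIOR_MEAN / (k * n)
                    var += PRIOR_VAR / (k * n) ** 2
    if var == 0:
        return 1.0  # nothing left to judge for this pair
    p = NormalDist().cdf(-mean / sqrt(var))
    return max(p, 1 - p)

def mtc(runs, oracle, k, alpha=0.05):
    judged = {}
    pairs = list(combinations(sorted(runs), 2))
    while True:
        conf = {(a, b): pair_confidence(runs[a], runs[b], judged, k) for a, b in pairs}
        if sum(conf.values()) / len(pairs) > 1 - alpha:
            return judged, conf
        weights = {}
        for (a, b), c in conf.items():
            if c > 1 - alpha:
                continue  # pair already resolved
            for q in sorted(runs[a]):
                for d in sorted(top_k(runs[a][q], k) ^ top_k(runs[b][q], k)):
                    if (q, d) not in judged:
                        weights[(q, d)] = weights.get((q, d), 0.0) + (1 - c)
        best = max(weights, key=weights.get)  # select the best document
        judged[best] = oracle(best)           # variance of this gain is now 0

runs = {
    "A": {"q1": ["d1", "d2", "d3"], "q2": ["e1", "e2", "e3"]},
    "B": {"q1": ["d2", "d4", "d1"], "q2": ["e3", "e1", "e4"]},
    "C": {"q1": ["d5", "d6", "d1"], "q2": ["e5", "e6", "e1"]},
}
truth = {("q1", "d1"): 2, ("q1", "d2"): 1, ("q1", "d4"): 0, ("q1", "d5"): 0,
         ("q1", "d6"): 0, ("q2", "e1"): 2, ("q2", "e2"): 2, ("q2", "e3"): 0,
         ("q2", "e5"): 1, ("q2", "e6"): 0}
judged, conf = mtc(runs, truth.__getitem__, k=2)
```

The loop always terminates: each pass judges one new document, and a pair with nothing left to judge has confidence 1. With a collection this small most documents end up judged; the savings reported in the talk come from realistic sizes.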
![Page 17: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval](https://reader033.fdocuments.in/reader033/viewer/2022052911/559d38561a28ab5c398b46e0/html5/thumbnails/17.jpg)
MTC in MIREX AMS 2011
![Page 18: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval](https://reader033.fdocuments.in/reader033/viewer/2022052911/559d38561a28ab5c398b46e0/html5/thumbnails/18.jpg)
Why MIREX 2011
largest edition so far
18 systems (153 pairwise comparisons)
100 queries and 6,322 judgments
Distribution of Gi
let us work with a uniform distribution for now
![Page 19: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval](https://reader033.fdocuments.in/reader033/viewer/2022052911/559d38561a28ab5c398b46e0/html5/thumbnails/19.jpg)
Confidence as Judgments are Made
correct bins: estimated sign is correct or not significant anyway
![Page 20: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval](https://reader033.fdocuments.in/reader033/viewer/2022052911/559d38561a28ab5c398b46e0/html5/thumbnails/20.jpg)
![Page 21: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval](https://reader033.fdocuments.in/reader033/viewer/2022052911/559d38561a28ab5c398b46e0/html5/thumbnails/21.jpg)
![Page 22: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval](https://reader033.fdocuments.in/reader033/viewer/2022052911/559d38561a28ab5c398b46e0/html5/thumbnails/22.jpg)
high confidence with considerably
less effort
![Page 23: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval](https://reader033.fdocuments.in/reader033/viewer/2022052911/559d38561a28ab5c398b46e0/html5/thumbnails/23.jpg)
Accuracy as Judgments are Made
estimated bins always better than expected
![Page 24: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval](https://reader033.fdocuments.in/reader033/viewer/2022052911/559d38561a28ab5c398b46e0/html5/thumbnails/24.jpg)
Accuracy as Judgments are Made
estimated signs highly correlated with confidence
![Page 25: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval](https://reader033.fdocuments.in/reader033/viewer/2022052911/559d38561a28ab5c398b46e0/html5/thumbnails/25.jpg)
Accuracy as Judgments are Made
rankings with tau = 0.9 traditionally considered equivalent (same as 95% accuracy)
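The equivalence between tau = 0.9 and 95% accuracy follows from the definition of Kendall's tau when there are no ties. A short derivation, with $p$ denoting the fraction of system pairs whose order is estimated correctly (a symbol introduced here for illustration):

```latex
\tau = \frac{\#\text{concordant} - \#\text{discordant}}{\#\text{pairs}}
     = p - (1 - p) = 2p - 1
\qquad\Longrightarrow\qquad
\tau = 0.9 \iff p = \frac{1 + 0.9}{2} = 0.95
```

With the 153 pairwise comparisons of MIREX 2011, this corresponds to roughly 145 correctly ordered pairs.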
![Page 26: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval](https://reader033.fdocuments.in/reader033/viewer/2022052911/559d38561a28ab5c398b46e0/html5/thumbnails/26.jpg)
high confidence and
high accuracy with considerably
less effort
![Page 27: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval](https://reader033.fdocuments.in/reader033/viewer/2022052911/559d38561a28ab5c398b46e0/html5/thumbnails/27.jpg)
Statistical Significance
MTC allows us to accurately estimate the ranking, but only for the current set of queries
can we generalize to a general set of queries?
Not Trivial
we have the variance of the estimates but not the sample variance
![Page 28: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval](https://reader033.fdocuments.in/reader033/viewer/2022052911/559d38561a28ab5c398b46e0/html5/thumbnails/28.jpg)
Work with Upper and Lower Bounds of ΔAG@k
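Upper bound: best case for A
Lower bound: best case for B
$\Delta AG@k^{+} = \frac{1}{k} \sum_{i \in \pi} G_i \cdot \left( I(A_i \le k) - I(B_i \le k) \right) + \frac{1}{k} \sum_{i \notin \pi} l^{+} \cdot I(A_i \le k) - \frac{1}{k} \sum_{i \notin \pi} l^{-} \cdot I(B_i \le k \wedge A_i > k)$
first term: known judgments ($i \in \pi$)
*same for the lower bound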
![Page 29: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval](https://reader033.fdocuments.in/reader033/viewer/2022052911/559d38561a28ab5c398b46e0/html5/thumbnails/29.jpg)
Work with Upper and Lower Bounds of ΔAG@k
second term: unknown judgments retrieved by A get the best similarity score $l^{+}$
![Page 30: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval](https://reader033.fdocuments.in/reader033/viewer/2022052911/559d38561a28ab5c398b46e0/html5/thumbnails/30.jpg)
Work with Upper and Lower Bounds of ΔAG@k
third term: unknown judgments retrieved by B but not by A get the worst similarity score $l^{-}$
*same for the lower bound
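The bounds can be sketched for a single query as below. The upper bound follows the three terms on the slide. The lower bound is written by swapping the roles of A and B, which is an assumption about what "same for the lower bound" means. Function and parameter names are hypothetical.

```python
def delta_bounds(run_a, run_b, judged, k, l_min, l_max):
    """Return (lower, upper) bounds of dAG@k for one query.

    run_a, run_b: ranked lists of document ids.
    judged: document id -> known gain.
    l_min, l_max: worst and best levels of the similarity scale.
    """
    top_a, top_b = set(run_a[:k]), set(run_b[:k])
    # First term: known judgments (booleans subtract as 0/1 integers).
    known = sum(g * ((d in top_a) - (d in top_b)) for d, g in judged.items()) / k
    unknown_a = [d for d in top_a if d not in judged]
    unknown_b = [d for d in top_b if d not in judged]
    # Best case for A: its unknowns get l_max, B-only unknowns get l_min.
    upper = (known + l_max * len(unknown_a) / k
             - l_min * len([d for d in unknown_b if d not in top_a]) / k)
    # Best case for B: mirrored (assumed form of the lower bound).
    lower = (known - l_max * len(unknown_b) / k
             + l_min * len([d for d in unknown_a if d not in top_b]) / k)
    return lower, upper

lower, upper = delta_bounds(["x", "y", "w"], ["y", "z", "x"], {"x": 2}, 2, 0, 2)
```

With only `x` judged, the interval runs from -1.0 to 2.0, wider than the true range of the difference, which illustrates why the talk calls these bounds very unrealistic.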
![Page 31: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval](https://reader033.fdocuments.in/reader033/viewer/2022052911/559d38561a28ab5c398b46e0/html5/thumbnails/31.jpg)
3 Rules
1. Assume best case for A (upper bound): if A <<< B then conclude A <<< B
2. Assume best case for B (lower bound): if B <<< A then conclude B <<< A
3. If in the best case for A we do not have A >>> B, and in the best case for B we do not have B >>> A, then conclude they are not significantly different
Problem
upper and lower bounds are very unrealistic
![Page 32: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval](https://reader033.fdocuments.in/reader033/viewer/2022052911/559d38561a28ab5c398b46e0/html5/thumbnails/32.jpg)
Incorporate a Heuristic
4. If the estimated difference is larger than t, naively conclude significance
Choose t Based on Power Analysis
t = effect size detectable by a t-test with
• sample variance σ²=0.0615 (from previous MIREX editions)
• sample size n=100
• Type I Error rate α=0.05 (typical value)
• Type II Error rate β=0.15 (typical value)
t ≈ 0.067
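The threshold can be roughly reproduced with a textbook power calculation. The sketch below assumes a one-sided test and uses normal quantiles in place of t quantiles; neither choice is stated on the slide, so treat the result as an approximation.

```python
from math import sqrt
from statistics import NormalDist

def detectable_effect(variance, n, alpha, beta):
    """Smallest mean difference detectable with the given error rates.

    Assumes a one-sided test and a normal approximation:
    effect = (z_{1-alpha} + z_{1-beta}) * sqrt(variance / n).
    """
    z = NormalDist()
    return (z.inv_cdf(1 - alpha) + z.inv_cdf(1 - beta)) * sqrt(variance / n)

t = detectable_effect(variance=0.0615, n=100, alpha=0.05, beta=0.15)
```

This gives a value just under the 0.067 reported on the slide. The small gap is consistent with the slide using t quantiles with 99 degrees of freedom, which are slightly larger than the normal ones.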
![Page 33: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval](https://reader033.fdocuments.in/reader033/viewer/2022052911/559d38561a28ab5c398b46e0/html5/thumbnails/33.jpg)
Accuracy of the Significance Estimates
rule 4 (heuristic) ends up overestimating significance
pretty good around 95% confidence
![Page 34: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval](https://reader033.fdocuments.in/reader033/viewer/2022052911/559d38561a28ab5c398b46e0/html5/thumbnails/34.jpg)
Accuracy of the Significance Estimates
rule 4 (heuristic) ends up overestimating significance
rules 1 to 3 begin to apply and correct overestimations
![Page 35: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval](https://reader033.fdocuments.in/reader033/viewer/2022052911/559d38561a28ab5c398b46e0/html5/thumbnails/35.jpg)
Accuracy of the Significance Estimates
closer to expected
never under 90%
![Page 36: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval](https://reader033.fdocuments.in/reader033/viewer/2022052911/559d38561a28ab5c398b46e0/html5/thumbnails/36.jpg)
significance can be estimated
fairly well too
![Page 37: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval](https://reader033.fdocuments.in/reader033/viewer/2022052911/559d38561a28ab5c398b46e0/html5/thumbnails/37.jpg)
what we did
![Page 38: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval](https://reader033.fdocuments.in/reader033/viewer/2022052911/559d38561a28ab5c398b46e0/html5/thumbnails/38.jpg)
Introduce MTC to the MIR folks
Work out the Math for MTC with AG@k
See How Well it would have Done in AMS 2011
quite well actually!
![Page 39: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval](https://reader033.fdocuments.in/reader033/viewer/2022052911/559d38561a28ab5c398b46e0/html5/thumbnails/39.jpg)
what now
![Page 40: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval](https://reader033.fdocuments.in/reader033/viewer/2022052911/559d38561a28ab5c398b46e0/html5/thumbnails/40.jpg)
Learn the true Distribution of Similarity Judgments
it's clearly not uniform
would give more accurate estimates with less effort
use previous AMS data or fit a model as we judge
Significance Testing with Incomplete Judgments
best-case scenarios are very unrealistic
Study Low-Cost Methodologies for other MIR Tasks
![Page 41: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval](https://reader033.fdocuments.in/reader033/viewer/2022052911/559d38561a28ab5c398b46e0/html5/thumbnails/41.jpg)
what for
![Page 42: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval](https://reader033.fdocuments.in/reader033/viewer/2022052911/559d38561a28ab5c398b46e0/html5/thumbnails/42.jpg)
MTC Greatly Reduces the Effort for AMS (and SMS)
have MIREX volunteers incrementally create brand new test collections for other tasks
Better Yet
study low-cost methodologies for the other tasks
Not Only for MIREX
private collections for in-house evaluations
no possibility of gathering large pools of annotators
low-cost becomes paramount
![Page 43: Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval](https://reader033.fdocuments.in/reader033/viewer/2022052911/559d38561a28ab5c398b46e0/html5/thumbnails/43.jpg)
the MIR community needs a paradigm shift
from a priori to a posteriori evaluation methods
to reduce cost and gain reliability