Audio Music Similarity and Retrieval: Evaluation Power and Stability
![Page 1: Audio Music Similarity and Retrieval: Evaluation Power and Stability](https://reader036.fdocuments.in/reader036/viewer/2022081404/557647b7d8b42ac31b8b4f35/html5/thumbnails/1.jpg)
ISMIR 2011 · Miami, USA · October 26th
Picture by Michael Shane
Audio Music Similarity and Retrieval:
Evaluation Power and Stability
Julián Urbano @julian_urbano
Diego Martín, Mónica Marrero and Jorge Morato
University Carlos III of Madrid
![Page 2: Audio Music Similarity and Retrieval: Evaluation Power and Stability](https://reader036.fdocuments.in/reader036/viewer/2022081404/557647b7d8b42ac31b8b4f35/html5/thumbnails/2.jpg)
AMS
retrieve audio clips
musically similar
to a query clip
![Page 3: Audio Music Similarity and Retrieval: Evaluation Power and Stability](https://reader036.fdocuments.in/reader036/viewer/2022081404/557647b7d8b42ac31b8b4f35/html5/thumbnails/3.jpg)
grand results (MIREX 2009)
![Page 6: Audio Music Similarity and Retrieval: Evaluation Power and Stability](https://reader036.fdocuments.in/reader036/viewer/2022081404/557647b7d8b42ac31b8b4f35/html5/thumbnails/6.jpg)
grand results (MIREX 2009)
I won!
but the difference is not significant…
yeah, it's not significant!
did you hear?
damn it!
don't worry about it
shut up… we are!
oh, come on! it's so close!
![Page 7: Audio Music Similarity and Retrieval: Evaluation Power and Stability](https://reader036.fdocuments.in/reader036/viewer/2022081404/557647b7d8b42ac31b8b4f35/html5/thumbnails/7.jpg)
Picture by Sara A. Beyer
what does it mean?
![Page 9: Audio Music Similarity and Retrieval: Evaluation Power and Stability](https://reader036.fdocuments.in/reader036/viewer/2022081404/557647b7d8b42ac31b8b4f35/html5/thumbnails/9.jpg)
proper interpretation of p-values
H0: mean score of system A = mean score of system B
H1: mean scores are different
a statistical test returns p<0.01, so we conclude A >> B
it means that if we assume H0 and repeat the experiment, there is a <0.01 probability of having this result again*
*or one even more extreme
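One way to make this reading of the p-value concrete is a paired randomization test: if H0 holds, the sign of each per-query difference is arbitrary, so we can flip signs at random many times and count how often the mean difference is at least as extreme as the observed one. This is only a minimal sketch; the scores, the function name and the number of trials are made up for illustration, and this is not the test used in MIREX.

```python
import random

def sign_flip_p_value(scores_a, scores_b, trials=10000, seed=0):
    """Two-sided p-value for H0: mean score of A = mean score of B."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    extreme = 0
    for _ in range(trials):
        # Under H0 each difference is equally likely to be positive or negative.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        # Small tolerance so results equal to the observed one are counted.
        if abs(sum(flipped) / len(flipped)) >= observed - 1e-12:
            extreme += 1
    return extreme / trials

# Hypothetical per-query scores for two systems (not MIREX data).
system_a = [0.62, 0.55, 0.71, 0.48, 0.66, 0.59, 0.73, 0.51, 0.64, 0.68]
system_b = [0.50, 0.47, 0.60, 0.45, 0.52, 0.49, 0.61, 0.44, 0.50, 0.55]
p = sign_flip_p_value(system_a, system_b)
print(f"p = {p:.4f}")
```

The returned value estimates the probability, assuming H0, of a result like the observed one or a more extreme one.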
![Page 10: Audio Music Similarity and Retrieval: Evaluation Power and Stability](https://reader036.fdocuments.in/reader036/viewer/2022081404/557647b7d8b42ac31b8b4f35/html5/thumbnails/10.jpg)
conclusions about general behavior
MIREX 2009: system A is better than B, but it's not statistically significant (A > B)
this evaluation is not powerful
MIREX 2010: we can expect anything with a different collection (A ? B)
MIREX 2009: A is better than B, and it's statistically significant (A >> B)
this one is powerful…
MIREX 2010: we expect the same: A is significantly better than B (A >> B)
…and stable
but these could also happen:
A > B: lack of power in MIREX 2010
A < B: minor stability conflict
A << B: major stability conflict
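The four possible outcomes in the second collection can be written down as a small lookup. This is just a sketch of the taxonomy on this slide; the function name and the string encoding of outcomes are assumptions made for illustration.

```python
def classify(first, second):
    """Label what a second collection says about a significant result.

    Outcomes are 'A>>B', 'A>B', 'A<B' or 'A<<B'; a doubled sign means
    the difference is statistically significant.
    """
    if first != "A>>B":
        raise ValueError("only defined when the first collection shows A>>B")
    labels = {
        "A>>B": "stable",
        "A>B": "lack of power",
        "A<B": "minor stability conflict",
        "A<<B": "major stability conflict",
    }
    return labels[second]

for second in ("A>>B", "A>B", "A<B", "A<<B"):
    print(second, "->", classify("A>>B", second))
```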
![Page 11: Audio Music Similarity and Retrieval: Evaluation Power and Stability](https://reader036.fdocuments.in/reader036/viewer/2022081404/557647b7d8b42ac31b8b4f35/html5/thumbnails/11.jpg)
it's all about reliability
![Page 12: Audio Music Similarity and Retrieval: Evaluation Power and Stability](https://reader036.fdocuments.in/reader036/viewer/2022081404/557647b7d8b42ac31b8b4f35/html5/thumbnails/12.jpg)
on the shoulders of giants
Isaac Newton
![Page 13: Audio Music Similarity and Retrieval: Evaluation Power and Stability](https://reader036.fdocuments.in/reader036/viewer/2022081404/557647b7d8b42ac31b8b4f35/html5/thumbnails/13.jpg)
Text REtrieval Conference
1% to 14% of comparisons show stability conflicts
~25% differences to ensure <5% conflicts with 50 queries
depends on the measure used
no significance testing
[Buckley and Voorhees, 2000]
improved reliability with pairwise t-tests
others were not as good
virtually no conflicts if >10% differences with significance
[Sanderson and Zobel, 2005]
with many queries, even significance is unreliable
[Voorhees, 2009]
major review: other collections and more recent measures
some measures are much better than others (sensitivity, effort)
does not mean they should not be used!
[Sakai, 2007]
![Page 14: Audio Music Similarity and Retrieval: Evaluation Power and Stability](https://reader036.fdocuments.in/reader036/viewer/2022081404/557647b7d8b42ac31b8b4f35/html5/thumbnails/14.jpg)
Music Similarity and Retrieval
alternative forms of ground truth for Symbolic Melodic Similarity (SMS)
no prefixed relevance scale
reliable and comprehensive but too expensive
[Typke et al., 2005][Urbano et al., 2010] (more about this in 30 mins)
specific measure for the task
[Typke et al., 2006]
agreement between judgments by different people
despite high agreement, evaluation does change…
propose to use more queries
[Jones et al., 2007]
cheaper judgments via crowdsourcing seem reliable
[Urbano et al., 2010][Lee, 2010]
many other things
[Urbano, 2011]
![Page 16: Audio Music Similarity and Retrieval: Evaluation Power and Stability](https://reader036.fdocuments.in/reader036/viewer/2022081404/557647b7d8b42ac31b8b4f35/html5/thumbnails/16.jpg)
it's actually about the
effort-reliability tradeoff
task
# of queries
relevance judgments
measures
statistical methods
# of systems
system similarity
![Page 17: Audio Music Similarity and Retrieval: Evaluation Power and Stability](https://reader036.fdocuments.in/reader036/viewer/2022081404/557647b7d8b42ac31b8b4f35/html5/thumbnails/17.jpg)
Picture by Wessex Archaeology
measures
&
judgments
![Page 18: Audio Music Similarity and Retrieval: Evaluation Power and Stability](https://reader036.fdocuments.in/reader036/viewer/2022081404/557647b7d8b42ac31b8b4f35/html5/thumbnails/18.jpg)
how much information does the user gain?
results as a set
AG@5: Average Gain in the top 5 documents
measure used in MIREX (with different name)
results as a list
more realistic user model: best documents first, and the lower the rank the lower the gain*
NDCG@5: Normalized Discounted Cumulated Gain
ANDCG@5: Average NDCG across ranks
ADR@5: Average Dynamic Recall
*details in the paper
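To show how the set-based and list-based views differ, here is a rough sketch of AG@k, NDCG@k and ANDCG@k. The slides defer the exact definitions to the paper, so the log2 rank discount and the normalization by the ideal ordering below are assumptions based on the common textbook formulation; ADR is left out. All function and variable names are hypothetical.

```python
import math

def ag(gains, k=5):
    """Average Gain: mean gain of the top k results, ignoring their order."""
    return sum(gains[:k]) / k

def dcg(gains, k=5):
    """Discounted cumulated gain with a log2 rank discount (an assumption)."""
    return sum(g / math.log2(rank + 1)
               for rank, g in enumerate(gains[:k], start=1))

def ndcg(gains, known_gains, k=5):
    """DCG normalized by the DCG of the best possible ordering of known gains."""
    ideal = dcg(sorted(known_gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0

def andcg(gains, known_gains, k=5):
    """Average of NDCG@1 ... NDCG@k."""
    return sum(ndcg(gains, known_gains, i) for i in range(1, k + 1)) / k

# Hypothetical BROAD gains (0, 1, 2) for the same five results in two orders.
best_first = [2, 2, 1, 1, 0]
worst_first = [0, 1, 1, 2, 2]
for name, ranking in (("best first", best_first), ("worst first", worst_first)):
    print(name, ag(ranking), round(ndcg(ranking, best_first), 3),
          round(andcg(ranking, best_first), 3))
```

AG gives both orderings the same score, while the list-based measures reward putting the best documents first.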
![Page 19: Audio Music Similarity and Retrieval: Evaluation Power and Stability](https://reader036.fdocuments.in/reader036/viewer/2022081404/557647b7d8b42ac31b8b4f35/html5/thumbnails/19.jpg)
how much information does a result provide?
BROAD relevance judgments
not similar = 0
somewhat similar = 1
very similar = 2
FINE relevance judgments
real-valued, from 0 to 10 or 100
![Page 20: Audio Music Similarity and Retrieval: Evaluation Power and Stability](https://reader036.fdocuments.in/reader036/viewer/2022081404/557647b7d8b42ac31b8b4f35/html5/thumbnails/20.jpg)
look at MIREX 2009
largest evaluation until 2011
![Page 21: Audio Music Similarity and Retrieval: Evaluation Power and Stability](https://reader036.fdocuments.in/reader036/viewer/2022081404/557647b7d8b42ac31b8b4f35/html5/thumbnails/21.jpg)
Picture by Roger Green
power
![Page 26: Audio Music Similarity and Retrieval: Evaluation Power and Stability](https://reader036.fdocuments.in/reader036/viewer/2022081404/557647b7d8b42ac31b8b4f35/html5/thumbnails/26.jpg)
% of pairwise comparisons that are significant
what's the effect of:
number of queries
relevance judgments
effectiveness measures
random sample: 5 query subset drawn from the set of all 100 queries
evaluation with Broad judgments and Fine judgments
repeat 500 times for 5 query subsets to minimize random effects
52,500 system comparisons
[Plots: % significant vs. # queries, one panel for Broad judgments and one for Fine judgments]
![Page 27: Audio Music Similarity and Retrieval: Evaluation Power and Stability](https://reader036.fdocuments.in/reader036/viewer/2022081404/557647b7d8b42ac31b8b4f35/html5/thumbnails/27.jpg)
10 query subset drawn from the set of all 100 queries
evaluation with Broad judgments and Fine judgments
repeat another 500 times for 10 query subsets
52,500 system comparisons
[Plots: % significant vs. # queries, one panel for Broad judgments and one for Fine judgments]
![Page 28: Audio Music Similarity and Retrieval: Evaluation Power and Stability](https://reader036.fdocuments.in/reader036/viewer/2022081404/557647b7d8b42ac31b8b4f35/html5/thumbnails/28.jpg)
10 query subset drawn from the set of all 100 queries
stratified random sampling with equal priors
balanced across 10 genres: baroque, blues, classical, country, edance, jazz, metal, rap-hiphop, rock&roll, romantic
evaluation with Broad judgments and Fine judgments
![Page 29: Audio Music Similarity and Retrieval: Evaluation Power and Stability](https://reader036.fdocuments.in/reader036/viewer/2022081404/557647b7d8b42ac31b8b4f35/html5/thumbnails/29.jpg)
subset with all 100 queries
evaluation with Broad judgments and Fine judgments
[Plots: % significant vs. # queries, one panel for Broad judgments and one for Fine judgments]
![Page 30: Audio Music Similarity and Retrieval: Evaluation Power and Stability](https://reader036.fdocuments.in/reader036/viewer/2022081404/557647b7d8b42ac31b8b4f35/html5/thumbnails/30.jpg)
we simulate possible
evaluation scenarios
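The simulation can be sketched as follows: draw a genre-balanced query subset, test every pair of systems on it, and report the fraction of significant pairs over many repetitions. This is a minimal sketch under stated assumptions: the data are synthetic, all names are hypothetical, and an exact sign test stands in for the statistical procedure actually used (chosen only because it fits in a few lines of standard-library code).

```python
import itertools
import math
import random

def sign_test_p(xs, ys):
    """Exact two-sided sign test p-value for paired scores (ties dropped)."""
    wins = sum(x > y for x, y in zip(xs, ys))
    losses = sum(x < y for x, y in zip(xs, ys))
    n = wins + losses
    if n == 0:
        return 1.0
    k = max(wins, losses)
    tail = sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def stratified_sample(queries_by_genre, per_genre, rng):
    """Draw the same number of queries from every genre (equal priors)."""
    sample = []
    for genre in sorted(queries_by_genre):
        sample.extend(rng.sample(queries_by_genre[genre], per_genre))
    return sample

def power(scores, queries_by_genre, per_genre, trials=500, alpha=0.05, seed=0):
    """Fraction of pairwise system comparisons found significant."""
    rng = random.Random(seed)
    pairs = list(itertools.combinations(sorted(scores), 2))
    significant = 0
    for _ in range(trials):
        subset = stratified_sample(queries_by_genre, per_genre, rng)
        for a, b in pairs:
            p = sign_test_p([scores[a][q] for q in subset],
                            [scores[b][q] for q in subset])
            significant += p < alpha
    return significant / (trials * len(pairs))

# Synthetic collection: 10 genres x 10 queries, 4 systems (not MIREX data).
data_rng = random.Random(1)
queries_by_genre = {f"genre{g}": [f"g{g}-q{i}" for i in range(10)]
                    for g in range(10)}
all_queries = [q for qs in queries_by_genre.values() for q in qs]
scores = {f"sys{s}": {q: data_rng.random() + 0.05 * s for q in all_queries}
          for s in range(4)}
for per_genre in (1, 5, 10):
    print(per_genre * 10, "queries:",
          power(scores, queries_by_genre, per_genre, trials=100))
```

With 15 systems there are 105 pairs, so 500 repetitions give the 52,500 comparisons mentioned on the slides.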
![Page 33: Audio Music Similarity and Retrieval: Evaluation Power and Stability](https://reader036.fdocuments.in/reader036/viewer/2022081404/557647b7d8b42ac31b8b4f35/html5/thumbnails/33.jpg)
power results (larger is better)
power in MIREX 2009
[Plots: % significant comparisons (46 to 64) vs. query set size (40 to 100) for AG, NDCG, ANDCG and ADR; one panel for Broad judgments, one for Fine judgments]
similar logarithmic trend except for ADR with Fine judgments (expected)
same power with 70% effort!
only 2 significant pairs missed with 70% effort
(probably unstable)
![Page 34: Audio Music Similarity and Retrieval: Evaluation Power and Stability](https://reader036.fdocuments.in/reader036/viewer/2022081404/557647b7d8b42ac31b8b4f35/html5/thumbnails/34.jpg)
merely using more queries
does not pay off
when looking for power
![Page 35: Audio Music Similarity and Retrieval: Evaluation Power and Stability](https://reader036.fdocuments.in/reader036/viewer/2022081404/557647b7d8b42ac31b8b4f35/html5/thumbnails/35.jpg)
Picture by Dave Hunt
stability
![Page 40: Audio Music Similarity and Retrieval: Evaluation Power and Stability](https://reader036.fdocuments.in/reader036/viewer/2022081404/557647b7d8b42ac31b8b4f35/html5/thumbnails/40.jpg)
% of pairwise comparisons that are conflicting
what's the effect of:
number of queries
relevance judgments
effectiveness measures
two independent samples from the set of all 100 queries: a 5 query subset and another 5 query subset
genres: baroque, blues, classical, country, edance, jazz, metal, rap-hiphop, rock&roll, romantic
evaluation of each subset with Broad judgments and Fine judgments
52,500 cross-collection system comparisons
repeat 500 times to minimize random effects
[Plots: % conflicting vs. # queries, one panel for Broad judgments and one for Fine judgments]
![Page 41: Audio Music Similarity and Retrieval: Evaluation Power and Stability](https://reader036.fdocuments.in/reader036/viewer/2022081404/557647b7d8b42ac31b8b4f35/html5/thumbnails/41.jpg)
two independent samples: a 50 query subset and another 50 query subset
evaluation of each subset with Broad judgments and Fine judgments
with 100 total queries we can't go beyond 50
[Plots: % conflicting vs. # queries, one panel for Broad judgments and one for Fine judgments]
![Page 42: Audio Music Similarity and Retrieval: Evaluation Power and Stability](https://reader036.fdocuments.in/reader036/viewer/2022081404/557647b7d8b42ac31b8b4f35/html5/thumbnails/42.jpg)
we simulate comparisons
across possible collections
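A rough sketch of this cross-collection simulation: split off two disjoint query subsets, evaluate every system pair on both, and count the pairs whose significant result in the first subset is not confirmed in the second. Several things here are assumptions: the names are hypothetical, the data are synthetic, the samples are plain random rather than balanced by genre, an exact sign test stands in for the real statistical procedure, and conflicts are counted in one direction only.

```python
import itertools
import math
import random

def significant(xs, ys, alpha=0.05):
    """Placeholder decision rule: exact two-sided sign test."""
    wins = sum(x > y for x, y in zip(xs, ys))
    losses = sum(x < y for x, y in zip(xs, ys))
    n, k = wins + losses, max(wins, losses)
    if n == 0:
        return False
    return 2 * sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n < alpha

def outcome(xs, ys):
    """Return (sign of the mean difference, whether it is significant)."""
    diff = sum(xs) / len(xs) - sum(ys) / len(ys)
    return (diff > 0) - (diff < 0), significant(xs, ys)

def conflict_type(first, second):
    """Classify a pair of outcomes; None means no conflict is counted."""
    (sign1, sig1), (sign2, sig2) = first, second
    if not sig1:
        return None                  # nothing was claimed in the first subset
    if sign2 == sign1:
        return None if sig2 else "power"
    return "major" if sig2 else "minor"

def conflict_rate(scores, queries, size, trials=500, seed=0):
    """Fraction of cross-collection system comparisons that conflict."""
    rng = random.Random(seed)
    pairs = list(itertools.combinations(sorted(scores), 2))
    conflicts = 0
    for _ in range(trials):
        drawn = rng.sample(queries, 2 * size)   # two disjoint subsets
        first, second = drawn[:size], drawn[size:]
        for a, b in pairs:
            kind = conflict_type(
                outcome([scores[a][q] for q in first],
                        [scores[b][q] for q in first]),
                outcome([scores[a][q] for q in second],
                        [scores[b][q] for q in second]))
            conflicts += kind is not None
    return conflicts / (trials * len(pairs))

# Synthetic collection: 100 queries, 4 systems (not MIREX data).
data_rng = random.Random(1)
queries = [f"q{i}" for i in range(100)]
scores = {f"sys{s}": {q: data_rng.random() + 0.05 * s for q in queries}
          for s in range(4)}
for size in (5, 25, 50):
    print(size, "queries:", conflict_rate(scores, queries, size, trials=100))
```

Because the two subsets must not share queries, `size` cannot exceed half of the available queries.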
![Page 46: Audio Music Similarity and Retrieval: Evaluation Power and Stability](https://reader036.fdocuments.in/reader036/viewer/2022081404/557647b7d8b42ac31b8b4f35/html5/thumbnails/46.jpg)
stability results (lower is better)
stability in MIREX 2009
[Plots: % conflicting comparisons (2 to 22) vs. query subset size (5 to 50) for AG, NDCG, ANDCG and ADR; one panel for Broad judgments, one for Fine judgments]
lack of power in one collection but not in the other
converge to <5% for >40 queries (consistent with α=0.05)
ADR takes longer to converge
![Page 47: Audio Music Similarity and Retrieval: Evaluation Power and Stability](https://reader036.fdocuments.in/reader036/viewer/2022081404/557647b7d8b42ac31b8b4f35/html5/thumbnails/47.jpg)
merely using more queries
does not pay off
when looking for stability
![Page 48: Audio Music Similarity and Retrieval: Evaluation Power and Stability](https://reader036.fdocuments.in/reader036/viewer/2022081404/557647b7d8b42ac31b8b4f35/html5/thumbnails/48.jpg)
type of conflicts (50 queries)

| judgments | measure | conflicts | A>B (power) | A<B (minor) | A<<B (major) |
|---|---|---|---|---|---|
| Broad | AG | 3.36% | 100% | 0% | 0% |
| Broad | NDCG | 3.77% | 99.90% | 0.10% | 0% |
| Broad | ANDCG | 4.73% | 99.96% | 0.04% | 0% |
| Broad | ADR | 9.03% | 99.94% | 0.06% | 0% |
| Fine | AG | 2.64% | 99.86% | 0.14% | 0% |
| Fine | NDCG | 2.94% | 99.74% | 0.26% | 0% |
| Fine | ANDCG | 4.03% | 99.91% | 0.09% | 0% |
| Fine | ADR | 19.08% | 99.50% | 0.50% | 0% |

virtually all conflicts due to lack of power in one collection
no major conflict whatsoever
![Page 49: Audio Music Similarity and Retrieval: Evaluation Power and Stability](https://reader036.fdocuments.in/reader036/viewer/2022081404/557647b7d8b42ac31b8b4f35/html5/thumbnails/49.jpg)
if significance shows up
it most probably is correct
are we being too conservative?
![Page 50: Audio Music Similarity and Retrieval: Evaluation Power and Stability](https://reader036.fdocuments.in/reader036/viewer/2022081404/557647b7d8b42ac31b8b4f35/html5/thumbnails/50.jpg)
statistics
Milton Friedman
John Tukey
Frank Wilcoxon
![Page 51: Audio Music Similarity and Retrieval: Evaluation Power and Stability](https://reader036.fdocuments.in/reader036/viewer/2022081404/557647b7d8b42ac31b8b4f35/html5/thumbnails/51.jpg)
compare two systems
is the difference significant?
t-test, Wilcoxon test, sign test, etc.
they make different assumptions
significance level α
probability of Type I error
(finding a significant difference when there is none)
usually, α=0.05 or α=0.01
5% or 1% of my significant results are just wrong (stability conflict)
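As an illustration of one of these tests, here is a rough sketch of a 1-tailed Wilcoxon signed-rank test using the normal approximation. The function name and example scores are hypothetical, and the approximation (with continuity correction but no tie correction) is a simplification that is only a rough guide for small or heavily tied samples.

```python
from statistics import NormalDist

def wilcoxon_one_tailed(xs, ys):
    """p-value for H1: scores in xs tend to be larger than those in ys."""
    diffs = [x - y for x, y in zip(xs, ys) if x != y]  # zero differences dropped
    n = len(diffs)
    if n == 0:
        return 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of differences with the same magnitude.
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for t in range(i, j + 1):
            ranks[order[t]] = (i + j) / 2 + 1          # average rank for ties
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = (n * (n + 1) * (2 * n + 1) / 24) ** 0.5
    z = (w_plus - mean - 0.5) / sd                     # continuity correction
    return 1 - NormalDist().cdf(z)

# Hypothetical per-query scores for two systems (not MIREX data).
mine = [0.62, 0.55, 0.71, 0.48, 0.66, 0.59, 0.73, 0.51, 0.64, 0.68]
theirs = [0.50, 0.47, 0.60, 0.45, 0.52, 0.49, 0.61, 0.44, 0.50, 0.55]
print(f"p = {wilcoxon_one_tailed(mine, theirs):.4f}")
```

The difference is declared significant when the returned p-value falls below the chosen α.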
![Page 52: Audio Music Similarity and Retrieval: Evaluation Power and Stability](https://reader036.fdocuments.in/reader036/viewer/2022081404/557647b7d8b42ac31b8b4f35/html5/thumbnails/52.jpg)
compare several systems
15 systems = 105 comparisons (MIREX 2009)
experiment-wide significance level = $1-(1-\alpha)^{105} = 0.995$ (with α=0.05)
we can expect at least one significant comparison to be wrong
instead, compare all systems at once
ANOVA, Friedman test, Kruskal-Wallis, etc. (with different assumptions)
correct p-values to keep experiment-wide significance level <0.05
Tukey's HSD, Bonferroni, Scheffe, Duncan, Newman-Keuls, etc. (with different assumptions)
Friedman test and Tukey's HSD are used in MIREX
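The experiment-wide level is easy to check numerically. The sketch below assumes, as the formula on the slide does, that the comparisons are independent; the function name is hypothetical, and Bonferroni is shown only as the simplest of the corrections listed.

```python
from math import comb

def experiment_wide_level(alpha, comparisons):
    """Probability of at least one Type I error across independent tests."""
    return 1 - (1 - alpha) ** comparisons

systems = 15
pairs = comb(systems, 2)                     # all pairwise comparisons
print(pairs, round(experiment_wide_level(0.05, pairs), 3))   # 105 0.995

# Bonferroni: run each test at alpha / pairs so the overall level stays below alpha.
per_test_alpha = 0.05 / pairs
print(experiment_wide_level(per_test_alpha, pairs) < 0.05)   # True
```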
![Page 53: Audio Music Similarity and Retrieval: Evaluation Power and Stability](https://reader036.fdocuments.in/reader036/viewer/2022081404/557647b7d8b42ac31b8b4f35/html5/thumbnails/53.jpg)
more stability
at the cost of
less power
is it worth it?
![Page 54: Audio Music Similarity and Retrieval: Evaluation Power and Stability](https://reader036.fdocuments.in/reader036/viewer/2022081404/557647b7d8b42ac31b8b4f35/html5/thumbnails/54.jpg)
what a MIREX participant wants
compare my system with the other 14
comparisons between those 14 are uninteresting
subexperiment: only 14 pairwise comparisons, not 105
number of comparisons grows linearly with number of systems
get back the power missed by considering the other 91
subexperiment-wide significance level = $1-(1-\alpha)^{14} = 0.512$ (with α=0.05)
compare all systems with 1-tailed Wilcoxon tests at α=0.01
experiment-wide significance level = $1-(1-0.01)^{105} = 0.652$
subexperiment-wide significance level = $1-(1-0.01)^{14} = 0.131$
should throw out more conflicts too
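The counts and levels on this slide can be worked out step by step. As with the formula itself, this assumes the individual comparisons are independent, which is a simplification.

```latex
\begin{align*}
\text{all pairs among } n \text{ systems} &= \binom{n}{2} = \frac{n(n-1)}{2} = \frac{15 \cdot 14}{2} = 105 \\
\text{pairs involving my system} &= n - 1 = 14 \\
\text{pairs left out} &= 105 - 14 = 91 \\
1 - (1 - 0.01)^{105} &= 1 - 0.99^{105} \approx 0.652 \\
1 - (1 - 0.01)^{14} &= 1 - 0.99^{14} \approx 0.131 \\
1 - (1 - 0.05)^{14} &= 1 - 0.95^{14} \approx 0.512
\end{align*}
```

The first line grows quadratically with the number of systems while the second grows linearly, which is why restricting attention to one system's comparisons keeps the subexperiment-wide level so much lower.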
![Page 57: Audio Music Similarity and Retrieval: Evaluation Power and Stability](https://reader036.fdocuments.in/reader036/viewer/2022081404/557647b7d8b42ac31b8b4f35/html5/thumbnails/57.jpg)
power results (larger is better)
[Plots: % significant comparisons (46 to 64) vs. query set size (40 to 100) for AG, NDCG, ANDCG and ADR; one panel for Broad judgments, one for Fine judgments; Friedman+Tukey (as in MIREX) shown for comparison]
all 1-tailed Wilcoxon comparisons are up to 20% more powerful than Friedman+Tukey
same power with 50% effort!
![Page 59: Audio Music Similarity and Retrieval: Evaluation Power and Stability](https://reader036.fdocuments.in/reader036/viewer/2022081404/557647b7d8b42ac31b8b4f35/html5/thumbnails/59.jpg)
stability results (lower is better)
[Plots: % conflicting comparisons (2 to 22) vs. query subset size (5 to 50) for AG, NDCG, ANDCG and ADR; one panel for Broad judgments, one for Fine judgments]
earlier convergence because of increased power
AG converges again to 3-4%
(A)NDCG converge to 5-6%
![Page 60: Audio Music Similarity and Retrieval: Evaluation Power and Stability](https://reader036.fdocuments.in/reader036/viewer/2022081404/557647b7d8b42ac31b8b4f35/html5/thumbnails/60.jpg)
type of conflicts (50 queries)

| judgments | measure | conflicts | A>B (power) | A<B (minor) | A<<B (major) |
|---|---|---|---|---|---|
| Broad | AG | 3.68% | 96.32% | 3.68% | 0% |
| Broad | NDCG | 5.05% | 96.82% | 3.18% | 0% |
| Broad | ANDCG | 6.08% | 96.84% | 3.13% | 0.03% |
| Broad | ADR | 5.93% | 95.12% | 4.88% | 0% |
| Fine | AG | 3.32% | 98.34% | 1.66% | 0% |
| Fine | NDCG | 6.58% | 96.61% | 3.39% | 0% |
| Fine | ANDCG | 6.44% | 94.94% | 5.06% | 0% |
| Fine | ADR | 12.48% | 90.58% | 9.37% | 0.05% |

again, due to lack of power in one collection
no major conflicts
within known Type III error rates
![Page 61: Audio Music Similarity and Retrieval: Evaluation Power and Stability](https://reader036.fdocuments.in/reader036/viewer/2022081404/557647b7d8b42ac31b8b4f35/html5/thumbnails/61.jpg)
effort-reliability tradeoff

| judgments | measure | Friedman+Tukey with 100 queries: power - conflicts = stable | 1-tailed Wilcoxon with 50 queries: power - conflicts = stable |
|---|---|---|---|
| Broad | AG | 57.14% - 3.64% = 53.50% | 55.10% - 3.68% = 51.42% |
| Broad | NDCG | 57.14% - 4.08% = 53.06% | 57.01% - 5.05% = 51.96% |
| Broad | ANDCG | 57.14% - 4.19% = 52.95% | 57.37% - 6.08% = 51.29% |
| Broad | ADR | 56.19% - 7.13% = 49.06% | 57.30% - 5.93% = 51.37% |
| Fine | AG | 54.29% - 3.20% = 51.09% | 54.31% - 3.32% = 50.99% |
| Fine | NDCG | 56.19% - 3.04% = 53.15% | 57.56% - 6.58% = 50.98% |
| Fine | ANDCG | 56.19% - 2.96% = 53.23% | 57.38% - 6.44% = 50.94% |
| Fine | ADR | 56.19% - 19.97% = 36.22% | 55.03% - 12.48% = 42.55% |

virtually same reliability with half the effort!
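The "stable" column is simply power minus conflicts. A small sketch that recomputes it for the Broad judgments and shows the gap between the two setups; the figures are copied from this slide, while the variable and function names are made up for illustration.

```python
# (power %, conflicts %) for Broad judgments, copied from the slide.
friedman_tukey_100 = {"AG": (57.14, 3.64), "NDCG": (57.14, 4.08),
                      "ANDCG": (57.14, 4.19), "ADR": (56.19, 7.13)}
wilcoxon_50 = {"AG": (55.10, 3.68), "NDCG": (57.01, 5.05),
               "ANDCG": (57.37, 6.08), "ADR": (57.30, 5.93)}

def stable(power, conflicts):
    """Stable power: significant comparisons minus the conflicting ones."""
    return round(power - conflicts, 2)

for measure in friedman_tukey_100:
    full = stable(*friedman_tukey_100[measure])   # 100 queries
    half = stable(*wilcoxon_50[measure])          # 50 queries
    print(f"{measure}: {full:.2f}% vs {half:.2f}% (gap {full - half:+.2f})")
```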
![Page 62: Audio Music Similarity and Retrieval: Evaluation Power and Stability](https://reader036.fdocuments.in/reader036/viewer/2022081404/557647b7d8b42ac31b8b4f35/html5/thumbnails/62.jpg)
Friedman-Tukey requires
too much effort
![Page 63: Audio Music Similarity and Retrieval: Evaluation Power and Stability](https://reader036.fdocuments.in/reader036/viewer/2022081404/557647b7d8b42ac31b8b4f35/html5/thumbnails/63.jpg)
my point?
![Page 64: Audio Music Similarity and Retrieval: Evaluation Power and Stability](https://reader036.fdocuments.in/reader036/viewer/2022081404/557647b7d8b42ac31b8b4f35/html5/thumbnails/64.jpg)
Do not attempt to accomplish greater results
by a greater effort of your little understanding,
but by a greater understanding of your little effort.
Walter Russell
![Page 65: Audio Music Similarity and Retrieval: Evaluation Power and Stability](https://reader036.fdocuments.in/reader036/viewer/2022081404/557647b7d8b42ac31b8b4f35/html5/thumbnails/65.jpg)
if significance shows up it most probably is true
at worst, conflicts are due to lack of power
using more and more queries is pointless
too much effort for the small gain in power and stability
using different similarity scales has little effect
using only one is probably just fine
some effectiveness measures are better than others
they should still be used: they measure different things
but bear in mind their power and stability
some statistical methods are better than others
virtually same reliability with half the effort
![Page 66: Audio Music Similarity and Retrieval: Evaluation Power and Stability](https://reader036.fdocuments.in/reader036/viewer/2022081404/557647b7d8b42ac31b8b4f35/html5/thumbnails/66.jpg)
Picture by Ronny Welter
![Page 67: Audio Music Similarity and Retrieval: Evaluation Power and Stability](https://reader036.fdocuments.in/reader036/viewer/2022081404/557647b7d8b42ac31b8b4f35/html5/thumbnails/67.jpg)
reduce the judging effort
more queries in Symbolic Melodic Similarity
reliable low-cost in-house evaluations and Crowdsourcing
other collections, tasks and measures
deeper evaluation cutoffs
not just the top 5 documents: pay attention to ranking
probably more reliable, and certainly more reusable
effect of the number of systems
especially if developed by the same research group
forget about power and worry about effect-size
eventually, significance becomes meaningless
other statistical methods
Multiple Comparisons with a Control (baseline)
![Page 68: Audio Music Similarity and Retrieval: Evaluation Power and Stability](https://reader036.fdocuments.in/reader036/viewer/2022081404/557647b7d8b42ac31b8b4f35/html5/thumbnails/68.jpg)
guide experimenters in
the interpretation of the results and the
tradeoff between
effort and reliability