New Metrics for Meaningful Evaluation of Informally Structured Speech Retrieval (ECIR 2012)
Transcript of New Metrics for Meaningful Evaluation of Informally Structured Speech Retrieval (ECIR 2012)
Maria Eskevich (1), Walid Magdy (2, 3), Gareth J.F. Jones (1, 2)
(1) Centre for Digital Video Processing, (2) Centre for Next Generation Localisation
School of Computing, Dublin City University, Dublin, Ireland
(3) Qatar Computing Research Institute, Qatar Foundation, Doha, Qatar
April 3, 2012
Outline
Speech Retrieval
Speech Search Evaluation
    Mean Average Precision (MAP)
    Mean Average interpolated Precision (MAiP)
    mean Generalized Average Precision (mGAP)
New Metrics
    Mean Average Segment Precision (MASP)
    Mean Average Segment Distance-Weighted Precision (MASDWP)
Retrieval Collection
Experimental Results
Conclusions
Speech Documents Diversity
- Broadcast news
- Meetings
Speech Retrieval
[Figure: speech retrieval pipeline diagram with the components: speech collection; automatic speech recognition system; transcript; segmentation; segments; indexing; indexed segments; information request; queries (audio); automatic speech recognition system; queries (text); retrieval; retrieval results: textual segments; retrieval results: speech segments.]
Related Work in Speech Search Evaluation
Retrieval units:
- Clearly defined documents: TREC SDR, Mean Average Precision (MAP)
- Passages: INEX, Mean Average interpolated Precision (MAiP)
- Jump-in points: CLEF CL-SR, mean Generalized Average Precision (mGAP)
Mean Average interpolated Precision (MAiP)
Task: passage text retrieval.
Document relevance is not counted in a binary way.
Precision at rank r: the fraction of retrieved characters that are relevant.
Average interpolated Precision (AiP): the average of interpolated precision scores calculated at 101 recall levels (0.00, 0.01, ..., 1.00):
$$AiP = \frac{1}{101} \sum_{x = 0.00, 0.01, \ldots, 1.00} iP[x]$$
Shortcomings: averaging over characters in the transcript is not suitable for speech tasks.
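As a rough illustration of the averaging step, the sketch below computes AiP from per-rank precision and recall values. The interpolation rule used here (iP[x] is the highest precision at any rank whose recall is at least x, and 0 if no rank reaches x) is the usual INEX convention and is an assumption, since the slide does not spell it out. The function name and the input lists are hypothetical.

```python
def average_interpolated_precision(precisions, recalls):
    """AiP over the 101 recall levels 0.00, 0.01, ..., 1.00.

    precisions[i] and recalls[i] are the precision and recall measured
    after rank i + 1 (character-based in MAiP).
    """
    if len(precisions) != len(recalls):
        raise ValueError("precisions and recalls must have the same length")

    total = 0.0
    for step in range(101):
        level = step / 100  # recall level x
        # Assumed interpolation rule: best precision among the ranks that
        # reach recall >= x; 0.0 when no rank reaches that recall.
        reached = [p for p, r in zip(precisions, recalls) if r >= level]
        total += max(reached) if reached else 0.0
    return total / 101
```

Because the score is sampled at fixed recall levels, a run that never reaches high recall collects zeros for all the upper levels, whatever its early precision.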
mean Generalized Average Precision (mGAP)
Task: retrieval of the jump-in points in time for relevant content.
$$GAP = \frac{1}{n} \sum_{r=1}^{N} P[r] \cdot \left(1 - \frac{Distance}{Granularity} \cdot 0.1\right)$$
Shortcomings: does not take into account how much time the user needs to spend listening to access the relevant content.
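A minimal sketch of the distance penalty in the GAP formula follows. Two details are assumptions rather than something the slide states: the penalty is floored at 0 once the distance exceeds ten granularity units, and the absolute value of the distance is used so that jump-in points before and after the true start are treated alike. The function names and argument layout are hypothetical.

```python
def distance_penalty(distance, granularity, step=0.1):
    """Return 1 - (|distance| / granularity) * step, floored at 0.0."""
    if granularity <= 0:
        raise ValueError("granularity must be positive")
    return max(0.0, 1.0 - (abs(distance) / granularity) * step)


def generalized_average_precision(precisions, distances, n, granularity):
    """GAP = (1/n) * sum over ranks of P[r] * penalty(distance at rank r).

    precisions[i]: P[r] at rank r = i + 1
    distances[i]:  distance between the returned jump-in point and the
                   start of the relevant content, or None when rank r
                   has no relevant content (it then contributes nothing)
    n:             the normaliser n from the formula
    """
    total = 0.0
    for precision, distance in zip(precisions, distances):
        if distance is not None:
            total += precision * distance_penalty(distance, granularity)
    return total / n
```

The penalty depends only on where playback starts, which is why the score says nothing about how long the user then has to listen.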
Time Precision Oriented Metrics
Motivation:
- Create a metric that measures both the ranking quality and the segmentation quality with respect to relevance in a single score.
- Reflect how far the user has to listen into the segment at a certain rank until the relevant part actually begins.
Mean Average Segment Precision (MASP)
Segment Precision (SP[r]) at rank r: the length of relevant content retrieved up to rank r divided by the total length of the segments retrieved up to rank r.
Average Segment Precision:
$$ASP = \frac{1}{n} \sum_{r=1}^{N} SP[r] \cdot rel(s_r)$$
rel(s_r) = 1 if relevant content is present in segment s_r, otherwise rel(s_r) = 0.
Difference from other metrics:
- the amount of relevant content is measured over time instead of text
- average segment precision (ASP) is calculated at the ranks of segments containing relevant content rather than at fixed recall points as in MAiP
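The ASP formula can be sketched as follows. This is an illustration under two assumptions drawn from the comparative example later in the talk: SP[r] is the cumulative relevant length divided by the cumulative segment length up to rank r, and n is the number of retrieved segments that contain relevant content. The function name and the input format are hypothetical.

```python
def average_segment_precision(segments):
    """ASP for one query.

    segments: ranked list of (relevant_length, total_length) pairs,
              one pair per retrieved segment, in rank order.
    """
    relevant_so_far = 0.0
    total_so_far = 0.0
    score = 0.0
    n_relevant = 0
    for relevant_length, total_length in segments:
        relevant_so_far += relevant_length
        total_so_far += total_length
        if relevant_length > 0:  # rel(s_r) = 1
            n_relevant += 1
            score += relevant_so_far / total_so_far  # SP[r]
    return score / n_relevant if n_relevant else 0.0
```

Called with the six segments of the comparative example, `[(2, 3), (0, 5), (3, 4), (6, 6), (0, 2), (5, 10)]`, it returns a value that rounds to 0.557. MASP is then the mean of ASP over all queries.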
Mean Average Segment Distance-Weighted Precision (MASDWP)
Penalize ASP results as in mGAP:
$$ASDWP = \frac{1}{n} \sum_{r=1}^{N} SP[r] \cdot rel(s_r) \cdot \left(1 - \frac{Distance}{Granularity} \cdot 0.1\right)$$
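A sketch of ASDWP under stated assumptions: SP[r] is taken as the cumulative relevant length over the cumulative segment length, n as the number of retrieved segments with relevant content, and the distance weight is floored at 0 and applied to the absolute distance. None of these details is spelled out on the slide, and the function name and inputs are hypothetical.

```python
def average_segment_distance_weighted_precision(segments, distances, granularity):
    """ASDWP for one query.

    segments:    ranked list of (relevant_length, total_length) pairs
    distances:   per-rank distance from the segment start to the start of
                 the relevant content; ignored (may be None) for segments
                 without relevant content
    granularity: size of one penalty step, in the same unit as distances
    """
    relevant_so_far = 0.0
    total_so_far = 0.0
    score = 0.0
    n_relevant = 0
    for (relevant_length, total_length), distance in zip(segments, distances):
        relevant_so_far += relevant_length
        total_so_far += total_length
        if relevant_length > 0:  # rel(s_r) = 1
            n_relevant += 1
            # Assumed floor at 0 so distant segments never score negatively.
            weight = max(0.0, 1.0 - (abs(distance) / granularity) * 0.1)
            score += (relevant_so_far / total_so_far) * weight
    return score / n_relevant if n_relevant else 0.0
```

A segment that holds relevant content but starts far from it still counts towards n, so it lowers the score rather than being skipped.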
Comparative example of AP, ASP and ASDWP
Retrieved segment   Rel Len/Total Len   AP     ASP     ASDWP
1                   2/3                 1      2/3     2/3 * 1.0
2                   0/5                 1/2    2/8     2/8 * 0.0
3                   3/4                 2/3    5/12    5/12 * 0.9
4                   6/6                 3/4    11/18   11/18 * 0.0
5                   0/2                 3/5    11/20   11/20 * 0.0
6                   5/10                4/6    16/30   16/30 * 0.0
MAP 0.771           MASP 0.557          MASDWP 0.260
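The three scores can be checked by hand. The working below assumes, as the numbers suggest, that each sum runs over the four ranks holding relevant content (1, 3, 4 and 6) and is divided by n = 4:

```latex
\begin{align*}
AP    &= \tfrac{1}{4}\left(1 + \tfrac{2}{3} + \tfrac{3}{4} + \tfrac{4}{6}\right) \approx 0.771 \\
ASP   &= \tfrac{1}{4}\left(\tfrac{2}{3} + \tfrac{5}{12} + \tfrac{11}{18} + \tfrac{16}{30}\right) \approx 0.557 \\
ASDWP &= \tfrac{1}{4}\left(\tfrac{2}{3} \cdot 1.0 + \tfrac{5}{12} \cdot 0.9 + \tfrac{11}{18} \cdot 0.0 + \tfrac{16}{30} \cdot 0.0\right) \approx 0.260
\end{align*}
```

The same ranked list therefore scores lower as the metric accounts first for the non-relevant time inside segments (ASP) and then for the distance to the start of the relevant content (ASDWP).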
Test Collection
Speech collection: AMI Corpus
- Ca. 100 hours of data (80 hours of speech)
- 160 meetings, average length 30 minutes
- Transcripts:
    - Manual
    - Automatic Speech Recognition (ASR), WER ≈ 30%
Retrieval test set:
- 25 queries with text taken from PowerPoint slides provided with the AMI Corpus (average length > 10 content words)
- Manual relevance assessment
Segmentation Methods and Retrieval Runs
- Segmentation*:
    - Lexical cohesion based algorithms: TextTiling, C99
    - Time- and length-based algorithms: time length = 60, 120, 150, 180 seconds; number of words per segment = 300, 400
    - Extreme case: no segmentation
- Retrieval system:
    - SMART extended to use language modeling
* Manual boundaries for both types of transcript
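The time- and length-based segmentations are simple enough to sketch. The code below is an illustration only: it assumes the transcript is available as (token, start time in seconds) pairs and that time windows are aligned to the start of the recording, neither of which is stated in the talk. Both function names are hypothetical.

```python
def segment_by_time(words, window_seconds):
    """Group tokens into consecutive fixed-length time windows.

    words: list of (token, start_time_in_seconds) pairs.
    Windows that contain no tokens are omitted from the result.
    """
    windows = {}
    for token, start in words:
        index = int(start // window_seconds)
        windows.setdefault(index, []).append(token)
    return [windows[index] for index in sorted(windows)]


def segment_by_length(tokens, words_per_segment):
    """Split a token list into chunks of at most words_per_segment tokens."""
    return [
        tokens[start:start + words_per_segment]
        for start in range(0, len(tokens), words_per_segment)
    ]
```

With `window_seconds` set to 60, 120, 150 or 180 and `words_per_segment` set to 300 or 400, these correspond to the parameter values listed above. Treating the whole meeting as a single segment gives the no-segmentation case.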
Scores: results for 1000 retrieved documents, "asr man" runs
Run        MAP     MAiP    MASP    MASDWP
c99        0.438   0.275   0.218   0.177
tt         0.421   0.275   0.221   0.173
len 300    0.416   0.287   0.248   0.181
len 400    0.463   0.286   0.237   0.147
time 120   0.428   0.296   0.256   0.196
time 150   0.448   0.283   0.243   0.171
time 180   0.473   0.300   0.246   0.163
time 60    0.333   0.259   0.238   0.220
one doc    0.686   0.109   0.085   0.009
- one doc run: only MAP gives it the highest score, while all other metrics give it the lowest score -> the MAP result contradicts user experience
- time 60: ranked highest by MASDWP -> the shorter average segment length makes it easier to capture a segment that starts closer to the jump-in point
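The two observations can be reproduced by ordering the runs under each metric. The snippet below is a small sketch that copies the scores from the table; the names `SCORES`, `METRICS` and `rank_runs` are hypothetical.

```python
# Scores copied from the results table (1000 retrieved documents).
SCORES = {
    "c99":      {"MAP": 0.438, "MAiP": 0.275, "MASP": 0.218, "MASDWP": 0.177},
    "tt":       {"MAP": 0.421, "MAiP": 0.275, "MASP": 0.221, "MASDWP": 0.173},
    "len 300":  {"MAP": 0.416, "MAiP": 0.287, "MASP": 0.248, "MASDWP": 0.181},
    "len 400":  {"MAP": 0.463, "MAiP": 0.286, "MASP": 0.237, "MASDWP": 0.147},
    "time 120": {"MAP": 0.428, "MAiP": 0.296, "MASP": 0.256, "MASDWP": 0.196},
    "time 150": {"MAP": 0.448, "MAiP": 0.283, "MASP": 0.243, "MASDWP": 0.171},
    "time 180": {"MAP": 0.473, "MAiP": 0.300, "MASP": 0.246, "MASDWP": 0.163},
    "time 60":  {"MAP": 0.333, "MAiP": 0.259, "MASP": 0.238, "MASDWP": 0.220},
    "one doc":  {"MAP": 0.686, "MAiP": 0.109, "MASP": 0.085, "MASDWP": 0.009},
}
METRICS = ("MAP", "MAiP", "MASP", "MASDWP")


def rank_runs(scores, metric):
    """Return run names ordered from highest to lowest score on `metric`."""
    return sorted(scores, key=lambda run: scores[run][metric], reverse=True)


if __name__ == "__main__":
    for metric in METRICS:
        ordered = rank_runs(SCORES, metric)
        print(f"{metric}: best = {ordered[0]}, worst = {ordered[-1]}")
```

Comparing the first and last entries per metric shows the one doc run at the top under MAP and at the bottom under the other three metrics.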
Capturing Difference Between Segmentations
Rank   c99 / time 180 / time 60 (cells in slide order)
3      179/179 (–)     60/60 (–)
4      243/243 (–)     179/179 (–)    59/59 (1)
5      180/180 (-69)   60/60 (–)
6      105/125 (20)    59/59 (-10)
7      157/204 (47)    179/179 (0)    59/59 (–)
8      107/107 (-45)   59/179         60/60 (–)
9      350/429 (47)    162/180 (-4)   60/60 (21)
10     122/122 (-11)   143/181 (–)
AP: one doc > time 180 > c99 > time 60
AiP: c99 > time 180 > time 60 > one doc
ASP: time 180 > c99 > time 60 > one doc
ASDWP: c99 > time 180 > time 60 > one doc
Impact of Averaging Techniques
AiP: man < asr man; ASP: man > asr man (relevant content moves down from higher ranks)
Conclusions
MAP and MAiP do not reflect the user experience of informally structured speech documents:
- MAP is appropriate for clearly defined documents
- MAiP works with transcript characters
Introduced MASP and MASDWP:
- MASP: captures the amount of relevant content that appears at different ranks
- MASDWP: rewards runs where segmentation algorithms put boundaries closer to the relevant content and these segments are higher in the ranked list
Thank you for your attention!