New Metrics for Meaningful Evaluation of Informally Structured Speech Retrieval (ECIR 2012)


Maria Eskevich¹, Walid Magdy²,³, Gareth J.F. Jones¹,²

¹ Centre for Digital Video Processing
² Centre for Next Generation Localisation
School of Computing, Dublin City University, Dublin, Ireland

³ Qatar Computing Research Institute - Qatar Foundation, Doha, Qatar

April 3, 2012


Outline

Speech Retrieval

Speech Search Evaluation
- Mean Average Precision (MAP)
- Mean Average interpolated Precision (MAiP)
- mean Generalized Average Precision (mGAP)

New Metrics
- Mean Average Segment Precision (MASP)
- Mean Average Segment Distance-Weighted Precision (MASDWP)

Retrieval Collection

Experimental Results

Conclusions

Speech Documents Diversity

- Broadcast news:

- Meetings:

Speech Retrieval

[Diagram: speech retrieval pipeline. Elements: Speech Collection; Automatic Speech Recognition System; Transcript; Segmentation; Segments; Indexing; Indexed Segments; Information Request; Queries (audio); Queries (text); Retrieval; Retrieval Results: textual segments; Retrieval Results: speech segments]


Speech Search Evaluation

Related Work in Speech Search Evaluation

Retrieval Units:

- Clearly defined documents: TREC SDR, Mean Average Precision (MAP)
- Passages: INEX, Mean Average interpolated Precision (MAiP)
- Jump-in points: CLEF CL-SR, mean Generalized Average Precision (mGAP)

Mean Average interpolated Precision (MAiP)

Task: passage text retrieval.

Document relevance is not counted in a binary way.

Precision at rank r: the fraction of retrieved characters that are relevant.

Average interpolated Precision (AiP): the average of interpolated precision scores calculated at 101 recall levels (0.00, 0.01, ..., 1.00):

\[ \mathrm{AiP} = \frac{1}{101} \sum_{x = 0.00, 0.01, \ldots, 1.00} iP[x] \]

Shortcomings: averaging over characters in the transcript is not suitable for speech tasks.
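The AiP average above can be sketched in a few lines of Python. This is a minimal illustration, not the INEX evaluation tool: it assumes iP[x] is the highest precision observed at any rank whose recall is at least x, and that the caller already has character-level (recall, precision) pairs per rank. The function name and input layout are hypothetical.

```python
def average_interpolated_precision(points):
    """Average interpolated precision over 101 recall levels.

    points: list of (recall, precision) pairs, one per rank
            (hypothetical input layout, values in [0, 1]).
    Assumption: iP[x] is the maximum precision among points whose
    recall is at least x, and 0.0 if no such point exists.
    """
    total = 0.0
    for i in range(101):
        level = i / 100  # recall levels 0.00, 0.01, ..., 1.00
        reachable = [p for r, p in points if r >= level]
        total += max(reachable, default=0.0)
    return total / 101
```

For example, `average_interpolated_precision([(0.5, 1.0), (1.0, 0.5)])` averages a precision of 1.0 over the recall levels up to 0.50 and 0.5 over the remaining levels.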

mean Generalized Average Precision (mGAP)

Task: retrieval of the jump-in points in time for relevant content

\[ \mathrm{GAP} = \frac{1}{n} \sum_{r=1}^{N} P[r] \cdot \left(1 - \frac{\mathrm{Distance}}{\mathrm{Granularity}} \cdot 0.1\right) \]

Shortcomings: does not take into account how much time the user needs to spend listening to access the relevant content.
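A minimal Python sketch of the GAP formula follows. Several details are assumptions rather than part of the slide: P[r] is taken to be the usual precision at rank r, only ranks with a relevant jump-in point contribute, n is the number of relevant results in the list, the absolute value of the distance is used, and the penalty factor is clamped at zero. All names are hypothetical.

```python
def distance_weight(distance, granularity):
    """Penalty factor 1 - (Distance / Granularity) * 0.1.

    Assumptions: the absolute distance is used and the factor is
    clamped at 0.0 so that it never becomes negative.
    """
    return max(0.0, 1.0 - (abs(distance) / granularity) * 0.1)


def generalized_average_precision(results, granularity):
    """results: ranked list of (is_relevant, distance) pairs, where
    distance is the offset between the returned jump-in point and the
    true start of the relevant content, in the same unit as granularity.
    """
    n = sum(1 for is_relevant, _ in results if is_relevant)
    if n == 0:
        return 0.0
    hits = 0
    total = 0.0
    for rank, (is_relevant, distance) in enumerate(results, start=1):
        if not is_relevant:
            continue  # assumption: non-relevant ranks contribute nothing
        hits += 1
        precision_at_rank = hits / rank
        total += precision_at_rank * distance_weight(distance, granularity)
    return total / n
```

The weight depends only on where the jump-in point falls, which is the shortcoming noted above: the listening time needed to reach the relevant content within a result is not part of the score.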

New Metrics

Time Precision Oriented Metrics

Motivation:

- Create a metric that measures both the ranking quality and the segmentation quality with respect to relevance in a single score.

- Reflect how far the user has to listen into the segment at a certain rank until the relevant part actually begins.

Mean Average Segment Precision (MASP)

Segment Precision (SP[r]) at rank r: the length of relevant content retrieved up to rank r divided by the total length of content retrieved up to rank r (Rel Len / Total Len).

Average Segment Precision:

\[ \mathrm{ASP} = \frac{1}{n} \sum_{r=1}^{N} SP[r] \cdot rel(s_r) \]

rel(s_r) = 1 if relevant content is present in segment s_r, otherwise rel(s_r) = 0.

Difference from other metrics:

- the amount of relevant content is measured over time instead of text
- average segment precision (ASP) is calculated at the ranks of segments containing relevant content rather than at fixed recall points as in MAiP
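The ASP computation can be sketched as follows. This is an illustrative reading of the formula, assuming SP[r] is the cumulative relevant length divided by the cumulative total length up to rank r, and that n counts the retrieved segments that contain relevant content. The function name and the (relevant length, total length) input layout are hypothetical.

```python
def average_segment_precision(segments):
    """segments: ranked list of (relevant_length, total_length) pairs,
    both measured in time units such as seconds.

    Assumptions: SP[r] is cumulative relevant length over cumulative
    total length up to rank r, and n is the number of retrieved
    segments with relevant_length > 0.
    """
    relevant_so_far = 0.0
    total_so_far = 0.0
    sp_sum = 0.0
    n = 0
    for relevant_length, total_length in segments:
        relevant_so_far += relevant_length
        total_so_far += total_length
        if relevant_length > 0:  # rel(s_r) = 1
            sp_sum += relevant_so_far / total_so_far
            n += 1
    return sp_sum / n if n else 0.0
```

Because lengths are accumulated over time, a long segment with little relevant content lowers SP at every later rank, which is how segmentation quality enters the score.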

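Mean Average Segment Distance-Weighted Precision (MASDWP)

Penalize ASP results in the same way as mGAP:

\[ \mathrm{ASDWP} = \frac{1}{n} \sum_{r=1}^{N} SP[r] \cdot rel(s_r) \cdot \left(1 - \frac{\mathrm{Distance}}{\mathrm{Granularity}} \cdot 0.1\right) \]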
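A sketch of ASDWP under the same reading of SP[r] (cumulative relevant length over cumulative total length) is given below. The distance is assumed to be the offset between the segment start and the start of the relevant content, its absolute value is used, and the penalty factor is clamped at zero; these choices and all names are assumptions made for illustration.

```python
def average_segment_distance_weighted_precision(segments, granularity):
    """segments: ranked list of (relevant_length, total_length, distance)
    tuples; distance and granularity share one time unit.

    Assumptions: n counts retrieved segments with relevant content, and
    the weight 1 - (Distance / Granularity) * 0.1 is clamped at 0.0.
    """
    relevant_so_far = 0.0
    total_so_far = 0.0
    weighted_sum = 0.0
    n = 0
    for relevant_length, total_length, distance in segments:
        relevant_so_far += relevant_length
        total_so_far += total_length
        if relevant_length > 0:  # rel(s_r) = 1
            n += 1
            weight = max(0.0, 1.0 - (abs(distance) / granularity) * 0.1)
            weighted_sum += (relevant_so_far / total_so_far) * weight
    return weighted_sum / n if n else 0.0
```

A segment that contains relevant content but starts far from it still counts towards n while adding little or nothing to the sum, so poorly placed boundaries pull the score down.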

Comparative example of AP, ASP and ASDWP

Retrieved Segments (in rank order), Rel Len/Total Len:

  2/3 * 1.0
  2/8 * 0.0
  5/12 * 0.9
  11/18 * 0.0
  11/20 * 0.0
  16/30 * 0.0

MAP: 0.771
MASP: 0.557
MASDWP: 0.260
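The three scores can be checked by hand. The working below assumes, as the cumulative Rel Len values suggest, that relevant content appears at ranks 1, 3, 4 and 6 (so n = 4), and that the second factor in each row is the distance weight used by ASDWP:

```latex
\begin{align*}
\mathrm{AP} &= \tfrac{1}{4}\left(\tfrac{1}{1} + \tfrac{2}{3} + \tfrac{3}{4} + \tfrac{4}{6}\right) \approx 0.771 \\
\mathrm{ASP} &= \tfrac{1}{4}\left(\tfrac{2}{3} + \tfrac{5}{12} + \tfrac{11}{18} + \tfrac{16}{30}\right) \approx 0.557 \\
\mathrm{ASDWP} &= \tfrac{1}{4}\left(\tfrac{2}{3} \cdot 1.0 + \tfrac{5}{12} \cdot 0.9 + \tfrac{11}{18} \cdot 0.0 + \tfrac{16}{30} \cdot 0.0\right) \approx 0.260
\end{align*}
```

AP only counts which ranks are relevant, ASP additionally accounts for the non-relevant time retrieved along the way, and ASDWP further discounts segments whose start is far from the relevant content.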


Retrieval Collection

Test Collection

Speech collection: AMI Corpus
- Ca. 100 hours of data (80 hours of speech)
- 160 meetings:
  - average length: 30 minutes
- Transcript:
  - Manual
  - Automatic Speech Recognition (ASR), WER ≈ 30%

Retrieval test set:
- 25 queries with text taken from PowerPoint slides provided with the AMI Corpus (average length > 10 content words)
- Manual relevance assessment

Segmentation Methods and Retrieval Runs

- Segmentation*:
  - Lexical cohesion based algorithms: TextTiling, C99
  - Time- and length-based algorithms: time length = 60, 120, 150, 180 seconds; number of words per segment = 300, 400
  - Extreme case: no segmentation

- Retrieval system:
  - SMART extended to use language modeling

* Manual boundaries for both types of transcript
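The time- and length-based segmentation settings listed above can be illustrated with a short Python sketch. It is not the code used for the experiments: it assumes the transcript is available as (start time in seconds, token) pairs, and both function names are hypothetical.

```python
def segment_by_time(words, window_seconds):
    """Group (start_time_seconds, token) pairs into fixed-length time windows.

    Assumption: a word belongs to the window in which it starts;
    windows that contain no words are skipped.
    """
    windows = {}
    for start_time, token in words:
        index = int(start_time // window_seconds)
        windows.setdefault(index, []).append(token)
    return [windows[index] for index in sorted(windows)]


def segment_by_length(tokens, words_per_segment):
    """Split a token list into consecutive segments of a fixed word count;
    the last segment may be shorter."""
    return [
        tokens[i:i + words_per_segment]
        for i in range(0, len(tokens), words_per_segment)
    ]
```

With `window_seconds=60` the first function corresponds to the time 60 setting, and `words_per_segment=300` in the second corresponds to len 300.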

Experimental Results

Scores for 1000 retrieved documents (transcript: asr man)

Run       MAP    MAiP   MASP   MASDWP
c99       0.438  0.275  0.218  0.177
tt        0.421  0.275  0.221  0.173
len 300   0.416  0.287  0.248  0.181
len 400   0.463  0.286  0.237  0.147
time 120  0.428  0.296  0.256  0.196
time 150  0.448  0.283  0.243  0.171
time 180  0.473  0.300  0.246  0.163
time 60   0.333  0.259  0.238  0.220
one doc   0.686  0.109  0.085  0.009

- one doc run: only MAP gives it the highest score, all other metrics give it the lowest score -> MAP contradicts user experience

- time 60: the highest MASDWP rank -> the shorter average length of the segments makes it easier to capture a segment closer to the jump-in point
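The two observations can be checked directly against the table. The snippet below copies the scores from the table and reports the best and worst run per metric; the helper name is hypothetical and the snippet adds no new data.

```python
# Scores copied from the results table (asr man, 1000 retrieved documents).
SCORES = {
    "c99":      {"MAP": 0.438, "MAiP": 0.275, "MASP": 0.218, "MASDWP": 0.177},
    "tt":       {"MAP": 0.421, "MAiP": 0.275, "MASP": 0.221, "MASDWP": 0.173},
    "len 300":  {"MAP": 0.416, "MAiP": 0.287, "MASP": 0.248, "MASDWP": 0.181},
    "len 400":  {"MAP": 0.463, "MAiP": 0.286, "MASP": 0.237, "MASDWP": 0.147},
    "time 120": {"MAP": 0.428, "MAiP": 0.296, "MASP": 0.256, "MASDWP": 0.196},
    "time 150": {"MAP": 0.448, "MAiP": 0.283, "MASP": 0.243, "MASDWP": 0.171},
    "time 180": {"MAP": 0.473, "MAiP": 0.300, "MASP": 0.246, "MASDWP": 0.163},
    "time 60":  {"MAP": 0.333, "MAiP": 0.259, "MASP": 0.238, "MASDWP": 0.220},
    "one doc":  {"MAP": 0.686, "MAiP": 0.109, "MASP": 0.085, "MASDWP": 0.009},
}


def best_and_worst(scores, metric):
    """Return the names of the highest- and lowest-scoring runs for a metric."""
    ranked = sorted(scores, key=lambda run: scores[run][metric], reverse=True)
    return ranked[0], ranked[-1]


for metric in ("MAP", "MAiP", "MASP", "MASDWP"):
    best, worst = best_and_worst(SCORES, metric)
    print(f"{metric}: best = {best}, worst = {worst}")
```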

Capturing Difference Between Segmentations

Columns: c99, time 180, time 60 (entries listed in column order; some ranks have no entry for a run)

Rank 3: 179/179 (–), 60/60 (–)
Rank 4: 243/243 (–), 179/179 (–), 59/59 (1)
Rank 5: 180/180 (-69), 60/60 (–)
Rank 6: 105/125 (20), 59/59 (-10)
Rank 7: 157/204 (47), 179/179 (0), 59/59 (–)
Rank 8: 107/107 (-45), 59/179, 60/60 (–)
Rank 9: 350/429 (47), 162/180 (-4), 60/60 (21)
Rank 10: 122/122 (-11), 143/181 (–)

AP: one doc > time 180 > c99 > time 60
AiP: c99 > time 180 > time 60 > one doc
ASP: time 180 > c99 > time 60 > one doc
ASDWP: c99 > time 180 > time 60 > one doc

Impact of Averaging Techniques

AiP: man < asr man; ASP: man > asr man
(relevant content moves down from higher ranks)

Conclusions

MAP and MAiP do not reflect the user experience of informally structured speech documents:

- MAP is appropriate for clearly defined documents
- MAiP works with transcript characters

Introduced MASP and MASDWP:

- MASP: captures the amount of relevant content that appears at different ranks

- MASDWP: rewards runs where segmentation algorithms put boundaries closer to the relevant content and these segments are higher in the ranked list


Thank you for your attention!
