New Metrics for Meaningful Evaluation of Informally Structured Speech Retrieval (ECIR 2012)

Post on 13-Jan-2015

We introduce two new metrics for the evaluation of search effectiveness for informally structured speech data: mean average segment precision (MASP) which measures retrieval performance in terms of both content segmentation and ranking with respect to relevance; and mean average segment distance-weighted precision (MASDWP) which takes into account the distance between the start of the relevant segment and the retrieved segment. We demonstrate the effectiveness of these new metrics on a retrieval test collection based on the AMI meeting corpus.

Transcript of New Metrics for Meaningful Evaluation of Informally Structured Speech Retrieval (ECIR 2012)

Maria Eskevich (1), Walid Magdy (2,3), Gareth J.F. Jones (1,2)

(1) Centre for Digital Video Processing
(2) Centre for Next Generation Localisation
School of Computing, Dublin City University, Dublin, Ireland
(3) Qatar Computing Research Institute - Qatar Foundation, Doha, Qatar

April 3, 2012

Speech Retrieval New Metrics for Meaningful Evaluation of Informally Structured Speech Retrieval

Outline

- Speech Retrieval
- Speech Search Evaluation
  - Mean Average Precision (MAP)
  - Mean Average interpolated Precision (MAiP)
  - mean Generalized Average Precision (mGAP)
- New Metrics
  - Mean Average Segment Precision (MASP)
  - Mean Average Segment Distance-Weighted Precision (MASDWP)
- Retrieval Collection
- Experimental Results
- Conclusions

Speech Documents Diversity

- Broadcast news
- Meetings

Speech Retrieval

[Diagram: speech retrieval pipeline. Box labels: Speech Collection; Queries (audio); Queries (text); Automatic Speech Recognition System; Transcript; Segmentation; Segments; Indexing; Indexed Segments; Information Request; Retrieval; Retrieval Results: textual segments; Retrieval Results: speech segments.]

Related Work in Speech Search Evaluation

Retrieval units:

- Clearly defined documents: TREC SDR, Mean Average Precision (MAP)
- Passages: INEX, Mean Average interpolated Precision (MAiP)
- Jump-in points: CLEF CL-SR, mean Generalized Average Precision (mGAP)

Mean Average interpolated Precision (MAiP)

Task: passage text retrieval.

Document relevance is not counted in a binary way.

Precision at rank r: the fraction of retrieved characters that are relevant.

Average interpolated Precision (AiP): the average of the interpolated precision scores iP[x] calculated at 101 recall levels (0.00, 0.01, ..., 1.00):

$$\mathrm{AiP} = \frac{1}{101} \sum_{x = 0.00, 0.01, \ldots, 1.00} iP[x]$$

Shortcomings: averaging over characters in the transcript is not suitable for speech tasks.
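The AiP average can be sketched in a few lines of Python. This is a minimal illustration, not the INEX evaluation tool: the function name `average_interpolated_precision` and the input format (a list of `(recall, precision)` pairs measured down the ranking) are assumptions made here, and interpolated precision is taken in the usual sense of the highest precision observed at any recall at or above x.

```python
def average_interpolated_precision(points):
    """points: (recall, precision) pairs observed down the ranking (hypothetical input format)."""
    levels = [i / 100 for i in range(101)]  # 0.00, 0.01, ..., 1.00
    total = 0.0
    for x in levels:
        # Interpolated precision iP[x]: best precision at any recall >= x, or 0 if none.
        candidates = [p for r, p in points if r >= x]
        total += max(candidates, default=0.0)
    return total / 101


# Illustrative values only: precision 1.0 up to recall 0.5, then 0.5 at full recall.
aip = average_interpolated_precision([(0.5, 1.0), (1.0, 0.5)])
print(round(aip, 3))
```

Because the recall levels are fixed, the score depends on how precision behaves across the whole recall range rather than on the ranks at which relevant content appears.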

mean Generalized Average Precision (mGAP)

Task: retrieval of the jump-in points in time for relevant content.

$$\mathrm{GAP} = \frac{1}{n} \sum_{r=1}^{N} P[r] \cdot \left(1 - \frac{\mathrm{Distance}}{\mathrm{Granularity}} \cdot 0.1\right)$$

Shortcomings: does not take into account how much time the user needs to spend listening to access the relevant content.
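A rough Python sketch of the GAP formula follows. Several details are assumptions made for illustration and are not stated on the slide: the function names, flooring the penalty at 0, using binary precision at rank r for P[r], and skipping results that match no relevant jump-in point (marked `None`).

```python
def distance_penalty(distance, granularity, step=0.1):
    """Weight 1 - (Distance / Granularity) * 0.1, floored at 0 (the floor is an assumption)."""
    return max(0.0, 1.0 - (abs(distance) / granularity) * step)


def generalized_average_precision(distances, n, granularity):
    """distances: one entry per rank; None marks a result with no relevant jump-in point.

    n is the number of relevant items used for normalisation.
    """
    hits = 0
    score = 0.0
    for rank, distance in enumerate(distances, start=1):
        if distance is None:
            continue  # non-relevant result contributes nothing
        hits += 1
        precision = hits / rank  # P[r] as binary precision at rank r (assumption)
        score += precision * distance_penalty(distance, granularity)
    return score / n if n else 0.0


# Illustrative values: exact hit at rank 1, miss at rank 2, hit 15 s off at rank 3.
gap = generalized_average_precision([0, None, 15], n=2, granularity=15)
```

The penalty depends only on where the jump-in point lands, which is why the score says nothing about how long the user must listen once playback starts.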

Time Precision Oriented Metrics

Motivation:

- Create a metric that measures both the ranking quality and the segmentation quality with respect to relevance in a single score.
- Reflect how far the user has to listen into the segment at a certain rank until the relevant part actually begins.

Mean Average Segment Precision (MASP)

Segment Precision (SP[r]) at rank r: in the comparative example, the cumulative relevant length up to rank r divided by the cumulative length of the segments retrieved up to rank r.

Average Segment Precision:

$$\mathrm{ASP} = \frac{1}{n} \sum_{r=1}^{N} SP[r] \cdot rel(s_r)$$

rel(s_r) = 1 if relevant content is present, otherwise rel(s_r) = 0.

Difference from other metrics:

- the amount of relevant content is measured over time instead of text
- average segment precision (ASP) is calculated at the ranks of segments containing relevant content rather than at fixed recall points as in MAiP
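The ASP computation can be sketched as below. This is an illustration under stated assumptions rather than the authors' code: the function name and the `(relevant_length, total_length)` input format are hypothetical, SP[r] is taken as cumulative relevant length over cumulative retrieved length (the reading that reproduces the comparative example in these slides), and n is taken as the number of retrieved segments containing relevant content.

```python
def average_segment_precision(segments):
    """segments: ranked list of (relevant_length, total_length) in time units (hypothetical format)."""
    rel_sum = 0.0   # relevant time retrieved so far
    len_sum = 0.0   # total time retrieved so far
    score = 0.0
    n = 0           # segments with relevant content (assumed normaliser)
    for rel_len, total_len in segments:
        rel_sum += rel_len
        len_sum += total_len
        if rel_len > 0:                  # rel(s_r) = 1
            score += rel_sum / len_sum   # SP[r]
            n += 1
    return score / n if n else 0.0


# Rel Len/Total Len values from the comparative example in the slides.
example = [(2, 3), (0, 5), (3, 4), (6, 6), (0, 2), (5, 10)]
asp = average_segment_precision(example)
print(round(asp, 3))  # 0.557
```

Because lengths are accumulated in time units, a long segment with little relevant content lowers SP at every later rank, which binary precision cannot express.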

Mean Average Segment Distance-Weighted Precision (MASDWP)

Penalize ASP results as in mGAP:

$$\mathrm{ASDWP} = \frac{1}{n} \sum_{r=1}^{N} SP[r] \cdot rel(s_r) \cdot \left(1 - \frac{\mathrm{Distance}}{\mathrm{Granularity}} \cdot 0.1\right)$$
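A minimal sketch of ASDWP, combining segment precision with the distance penalty. The function name, the `(relevant_length, total_length, distance)` input format, the floor at 0 for the penalty, and the choice of n as the number of segments with relevant content are all assumptions made for illustration; the distances in the usage line are invented values, not data from the slides.

```python
def average_segment_distance_weighted_precision(segments, granularity):
    """segments: ranked list of (relevant_length, total_length, distance).

    distance is the gap between the segment start and the start of the
    relevant content; it is ignored when relevant_length is 0.
    """
    rel_sum = 0.0
    len_sum = 0.0
    score = 0.0
    n = 0
    for rel_len, total_len, distance in segments:
        rel_sum += rel_len
        len_sum += total_len
        if rel_len > 0:  # rel(s_r) = 1
            n += 1
            # Penalty 1 - (Distance / Granularity) * 0.1, floored at 0 (assumption).
            penalty = max(0.0, 1.0 - (abs(distance) / granularity) * 0.1)
            score += (rel_sum / len_sum) * penalty
    return score / n if n else 0.0


# Invented distances: exact start at rank 1, one granularity step off at rank 3.
asdwp = average_segment_distance_weighted_precision(
    [(2, 3, 0), (0, 5, None), (3, 4, 10)], granularity=10
)
```

A segment that starts far from the relevant content still counts towards n, so a distant start lowers the score instead of being ignored.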

Comparative example of AP, ASP and ASDWP

Retrieved Segment   Rel Len/Total Len   AP    ASP     ASDWP
1                   2/3                 1     2/3     2/3 * 1.0
2                   0/5                 1/2   2/8     2/8 * 0.0
3                   3/4                 2/3   5/12    5/12 * 0.9
4                   6/6                 3/4   11/18   11/18 * 0.0
5                   0/2                 3/5   11/20   11/20 * 0.0
6                   5/10                4/6   16/30   16/30 * 0.0

MAP 0.771, MASP 0.557, MASDWP 0.260
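The three scores can be checked by hand from the per-rank values of the example. The worked sums below assume n = 4, the number of retrieved segments that contain relevant content (ranks 1, 3, 4 and 6), which is the value that reproduces the reported figures; non-relevant ranks and ranks with a 0.0 weight contribute nothing.

```latex
\begin{align*}
\mathrm{AP}    &= \tfrac{1}{4}\left(1 + \tfrac{2}{3} + \tfrac{3}{4} + \tfrac{4}{6}\right)
                = \tfrac{3.0833}{4} \approx 0.771 \\
\mathrm{ASP}   &= \tfrac{1}{4}\left(\tfrac{2}{3} + \tfrac{5}{12} + \tfrac{11}{18} + \tfrac{16}{30}\right)
                = \tfrac{2.2278}{4} \approx 0.557 \\
\mathrm{ASDWP} &= \tfrac{1}{4}\left(\tfrac{2}{3}\cdot 1.0 + \tfrac{5}{12}\cdot 0.9\right)
                = \tfrac{1.0417}{4} \approx 0.260
\end{align*}
```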

Test Collection

Speech collection: AMI Corpus

- Ca. 100 hours of data (80 hours of speech)
- 160 meetings, average length 30 minutes
- Transcripts:
  - Manual
  - Automatic Speech Recognition (ASR), WER ≈ 30%

Retrieval test set:

- 25 queries with text taken from PowerPoint slides provided with the AMI Corpus (average length > 10 content words)
- Manual relevance assessment

Segmentation Methods and Retrieval Runs

- Segmentation*:
  - Lexical cohesion based algorithms: TextTiling, C99
  - Time- and length-based algorithms: time length = 60, 120, 150, 180 seconds; number of words per segment = 300, 400
  - Extreme case: no segmentation
- Retrieval system:
  - SMART extended to use language modeling

* Manual boundaries for both types of transcript
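The time- and length-based baselines listed above can be sketched as follows. This is only an illustration of fixed-window segmentation: the function names and the `(start_time_seconds, token)` input format are hypothetical, and the slides do not say how the authors aligned windows or handled a short final segment.

```python
def segment_by_time(words, window=60.0):
    """words: list of (start_time_seconds, token) pairs sorted by time (hypothetical format)."""
    segments = {}
    for start, token in words:
        # Each word goes to the fixed window that contains its start time.
        segments.setdefault(int(start // window), []).append(token)
    return [segments[key] for key in sorted(segments)]


def segment_by_length(tokens, size=300):
    """Split a token list into consecutive chunks of at most `size` words."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


# Illustrative input: four words, 60-second windows.
timed = [(0.0, "a"), (59.9, "b"), (60.0, "c"), (185.0, "d")]
by_time = segment_by_time(timed, window=60.0)
by_length = segment_by_length(list(range(5)), size=2)
```

Neither function looks at content, so segment boundaries fall wherever the clock or word count dictates, unlike TextTiling and C99, which place boundaries using lexical cohesion.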

Scores for 1000 retrieved documents (asr man)

Run        MAP     MAiP    MASP    MASDWP
c99        0.438   0.275   0.218   0.177
tt         0.421   0.275   0.221   0.173
len 300    0.416   0.287   0.248   0.181
len 400    0.463   0.286   0.237   0.147
time 120   0.428   0.296   0.256   0.196
time 150   0.448   0.283   0.243   0.171
time 180   0.473   0.300   0.246   0.163
time 60    0.333   0.259   0.238   0.220
one doc    0.686   0.109   0.085   0.009

- one doc run: the highest score on MAP only, the lowest score on all other metrics -> this contradicts the user experience
- time 60: the highest MASDWP score -> the shorter average segment length makes it easier to capture a segment closer to the jump-in point

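Capturing Difference Between Segmentations

Entries at ranks 3 to 10 for the runs c99, time 180 and time 60 (empty cells of the original table are not preserved, so ranks with fewer than three entries cannot be assigned to columns):

Rank 3: 179/179 (–), 60/60 (–)
Rank 4: 243/243 (–), 179/179 (–), 59/59 (1)
Rank 5: 180/180 (-69), 60/60 (–)
Rank 6: 105/125 (20), 59/59 (-10)
Rank 7: 157/204 (47), 179/179 (0), 59/59 (–)
Rank 8: 107/107 (-45), 59/179, 60/60 (–)
Rank 9: 350/429 (47), 162/180 (-4), 60/60 (21)
Rank 10: 122/122 (-11), 143/181 (–)

AP: one doc > time 180 > c99 > time 60
AiP: c99 > time 180 > time 60 > one doc
ASP: time 180 > c99 > time 60 > one doc
ASDWP: c99 > time 180 > time 60 > one doc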

Impact of Averaging Techniques

AiP: man < asr man; ASP: man > asr man (relevant content moves down from higher ranks)

Conclusions

MAP and MAiP do not reflect the user experience of informally structured speech documents:

- MAP is appropriate for clearly defined documents
- MAiP works with transcript characters

Introduced MASP and MASDWP:

- MASP: captures the amount of relevant content that appears at different ranks
- MASDWP: rewards runs where segmentation algorithms put boundaries closer to the relevant content and these segments are higher in the ranked list

Thank you for your attention!