Comparative ASR Evaluation - Voxeo - SpeechTEK NY 2010
Comparative ASR evaluation
Dan Burnett, Director of Speech Technologies, Voxeo
SpeechTEK New York, August 2010
Goals for today
• Learn about data selection
• Learn all the steps of doing an eval by actually doing them
• Leave with code that runs
Outline
• Overview of comparative ASR evaluation
• How to select an evaluation data set
• Why transcription is important and how to do it properly
• What and how to test
• Analyzing the results
Comparative ASR Evaluation
• How could you compare ASR accuracy?
• Can you test against any dataset?
• What settings should you use? The optimal ones, right?
Today’s approach
• Choose representative evaluation data set
• Determine human classification of each recording
• For each ASR engine
• Determine machine classification of each recording at “optimal” setting
• Compare to human classification to determine accuracy
• Intelligently compare results for the two engines
Evaluation data set
• Ideally at least 100 recordings per grammar path for good confidence in results (large grammars can require a minimum of 10,000 or more)
• Must be representative
• Best to take from actual calls (why?)
• Do you need all the calls? Consider
• Time of day, day of week, holidays
• Regional differences
• Simplest is to use every nth call
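The every-nth-call approach can be sketched as follows. This is an illustrative Python sketch (the labs themselves use PHP), and the call file names are invented placeholders:

```python
# Systematic sampling sketch: keep every nth call so the sample spans
# times of day, days of week, and regions roughly in proportion.

def every_nth(calls, target_size):
    """Return roughly target_size calls by keeping every nth one."""
    if len(calls) <= target_size:
        return list(calls)
    n = len(calls) // target_size
    return calls[::n][:target_size]

# Hypothetical file names standing in for real recorded calls.
all_calls = [f"call_{i:05d}.wav" for i in range(1607)]
sample = every_nth(all_calls, 100)
```

Because the calls are taken in their original chronological order, every nth one spreads the sample across the collection period automatically.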
Lab data set
• Stored in all-data
• In “original” format as recorded
• Only post-endpointed data for today
• 1607 recordings of answers to yes/no question
• Likely to contain yes/no, but not guaranteed
Transcription
• Why is it needed? Why not automatic?
• Stages
• Classification
• Transcription
Audio classification
• Motivation:
• Applications may distinguish (i.e., possibly behave differently) among the following cases:

| Case | Possible behavior |
| --- | --- |
| No speech in audio sample (nospeech) | Mention that you didn’t hear anything and ask for repeat |
| Speech, but not intelligible (unintelligible) | Ask for repeat |
| Intelligible speech, but not in app grammar (out-of-grammar speech) | Encourage in-grammar speech |
| Intelligible speech, and within app grammar (in-grammar speech) | Respond to what person said |
Transcribing speech
• Words only, all lower case
• No digits
• Only punctuation allowed is apostrophe
Lab 1
• Copy yn_files.csv to yn_finaltrans.csv and edit
• For each file, append category of nospeech, unintelligible, or speech
• Example: all-data/.../utt01.wav,unintelligible
• Append transcription if speech
• Example: all-data/.../utt01.wav,speech,yes
• Transcription instructions in transcription.html
• How might you validate transcriptions?
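One way to validate transcriptions is mechanically checking each row against the formatting rules above (lowercase words, no digits, apostrophe as the only punctuation, and a transcription present exactly when the category is speech). A Python sketch, with invented sample rows; the real labs use PHP:

```python
import csv, io, re

# Validator sketch for the yn_finaltrans.csv format described above:
# path,category[,transcription].
WORD = re.compile(r"^[a-z']+$")
CATEGORIES = {"nospeech", "unintelligible", "speech"}

def validate_row(row):
    """Return a list of problems found in one CSV row (empty if clean)."""
    problems = []
    if len(row) < 2 or row[1] not in CATEGORIES:
        problems.append("missing or unknown category")
    elif row[1] == "speech":
        if len(row) < 3 or not row[2].strip():
            problems.append("speech row without transcription")
        elif not all(WORD.match(w) for w in row[2].split()):
            problems.append("transcription breaks formatting rules")
    return problems

# Hypothetical rows: the second breaks the lowercase/punctuation rules.
sample = ("all-data/utt01.wav,speech,yes\n"
          "all-data/utt02.wav,speech,Yes!\n"
          "all-data/utt03.wav,nospeech\n")
report = {row[0]: validate_row(row) for row in csv.reader(io.StringIO(sample))}
```

A mechanical check like this catches format errors; validating the content of transcriptions still requires a second human listener.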
What and how to test
• Understanding what to test/measure
• Preparing the data
• Building a test harness
• Running the test
What to test/measure
• To measure accuracy, we need
• For each data file
• the human categorization and transcription, and
• the recognizer’s categorization, recognized string, and confidence score
Preparing the data
• Recognizer needs a grammar (typically from your application)
• This grammar can be used to classify transcribed speech as In-grammar/Out-of-grammar
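The in-grammar/out-of-grammar step can be sketched in Python (hedged: a full GRXML parser is out of scope, so the sketch simply enumerates the phrases a yes/no grammar accepts, which is easy to do by hand for a grammar this small):

```python
# Classify human transcriptions as in-grammar / out-of-grammar by
# matching them against the phrases the grammar accepts.
IN_GRAMMAR = {"yes", "no"}  # hand-enumerated coverage of a yes/no grammar

def classify(category, transcription=""):
    """Map a human category (+ transcription) to an evaluation category."""
    if category != "speech":
        return category                      # nospeech / unintelligible
    if transcription.strip() in IN_GRAMMAR:
        return "in_grammar"
    return "out_of_grammar"

labels = [classify("speech", "yes"),
          classify("speech", "yes yes yes"),
          classify("nospeech")]
```

Note that "yes yes yes" comes out out-of-grammar here; the semantic-equivalence step later in the deck revisits exactly this kind of case.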
Lab 2
• Fix the GRXML yes/no grammar yesno.grxml in the “a” directory
• Copy yn_finaltrans.csv to yn_igog.csv
• Edit yn_igog.csv and change every “yes” or “no” line to have a category of “in_grammar” (should be 756 yes and 159 no, for a total of 915)
Building a test harness
• Why build a test harness? What about vendor batch reco tools?
• End-to-end vs. recognizer-only testing
• Harness should be
• generic
• customizable to different ASR engines
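The generic/customizable split can be sketched like this: the per-file loop is engine-independent, and each engine plugs in as an adapter function. This is an illustrative Python sketch (the lab harness is PHP); `fake_engine` and its return format are invented stand-ins for a real recognizer invocation such as a subprocess call to a/a.php:

```python
# Generic harness sketch: the loop never changes; only the engine
# adapter does.

def run_harness(files, grammar, engine):
    """Run every file through the engine; collect results.csv-style rows."""
    rows = []
    for path in files:
        classification, text, confidence = engine(path, grammar)
        rows.append((path, classification, text, confidence))
    return rows

def fake_engine(path, grammar):
    # Invented stand-in: always claims a confident "yes".
    return ("recognized", "yes", 87)

results = run_harness(["utt01.wav", "utt02.wav"], "yesno.grxml", fake_engine)
```

Swapping engines then means writing one new adapter, not a new harness, which is exactly what Lab 6 will require.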
Lab 3
• Complete the test harness harness.php (see harness_outline.txt)
• The harness must use the “a/scripts” scripts
• A list of “missing commands” is in harness_components.txt
• Please review (examine) these scripts
• FYI, ASR engine is a/a.php -- treat as black box
Lab 4
• Now run the test harness:
• php harness.php a/scripts <data file> <rundir>
• Output will be in <rundir>/results.csv
• Compare your output to “def_results.csv”
Analyzing results
• What are the possible outcomes and errors?
• How do we evaluate errors?
Possible ASR Engine Classifications
• Silence/nospeech (nospeech)
• Reject (rejected)
• Recognize (recognized)
• What about DTMF?
Possible outcomes

| True category | nospeech | rejected | recognized |
| --- | --- | --- | --- |
| nospeech | Correct classification | Improperly rejected | Incorrect |
| unintelligible | Improperly treated as silence | Correct behavior | Assume incorrect |
| out-of-grammar | Improperly treated as silence | Correct behavior | Incorrect |
| in-grammar | Improperly treated as silence | Improperly rejected | Either correct or incorrect |

(Rows are the true category from human classification; columns are the ASR engine’s classification.)
Three types of errors
• Missilences -- called silence, but wasn’t
• Misrejections -- rejected inappropriately
• Misrecognitions -- recognized inappropriately or incorrectly
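The outcome table reads directly as a function from (true category, engine result) to an error type. A Python sketch; `strings_match` is a hypothetical flag standing in for whatever string or semantic comparison the evaluation uses for in-grammar recognitions:

```python
# Label one (true category, engine result) pair per the outcome table.

def error_type(true_cat, engine_result, strings_match=False):
    """Return None if correct, else the error type."""
    if engine_result == "nospeech":
        return None if true_cat == "nospeech" else "missilence"
    if engine_result == "rejected":
        if true_cat in ("unintelligible", "out-of-grammar"):
            return None                      # rejection is correct behavior
        return "misrejection"
    # engine_result == "recognized"
    if true_cat == "in-grammar" and strings_match:
        return None
    return "misrecognition"                  # includes "assume incorrect"

outcomes = [error_type("in-grammar", "nospeech"),
            error_type("out-of-grammar", "recognized"),
            error_type("in-grammar", "recognized", strings_match=True)]
```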
So how do we evaluate these?
Evaluating errors
• Run ASR Engine on data set
• Try every rejection threshold value
• Plot errors as function of threshold
• Find optimal value
Try every rejection threshold value
• We ran the data files through the test harness with a rejection threshold of 0 (i.e., no rejection), but recorded each file’s confidence score
• Now, for each possible rejection threshold from 0 to 100
• Calculate number of misrecognitions, misrejections, and missilences
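Because every file already has a confidence score, re-applying a threshold t offline just turns recognitions with confidence below t into rejections. A Python sketch with invented rows of the form (true category, engine result, confidence, strings-match flag):

```python
# Threshold-sweep sketch: count errors at each threshold from 0 to 100.
rows = [("in-grammar", "recognized", 80, True),
        ("in-grammar", "recognized", 40, True),
        ("out-of-grammar", "recognized", 30, False),
        ("nospeech", "nospeech", 0, False)]

def errors_at(threshold):
    """Count (missilences, misrejections, misrecognitions) at a threshold."""
    missil = misrej = misrec = 0
    for true_cat, result, conf, ok in rows:
        if result == "recognized" and conf < threshold:
            result = "rejected"              # re-apply the threshold offline
        if result == "nospeech" and true_cat != "nospeech":
            missil += 1
        elif result == "rejected" and true_cat in ("nospeech", "in-grammar"):
            misrej += 1
        elif result == "recognized" and not (true_cat == "in-grammar" and ok):
            misrec += 1
    return missil, misrej, misrec

sweep = {t: sum(errors_at(t)) for t in range(0, 101)}
best = min(sweep, key=sweep.get)             # optimal threshold
```

The single threshold-0 run is the expensive part; the whole sweep is then cheap arithmetic over the recorded results.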
Semantic equivalence
• We call “yes” in-grammar, but what about “yes yes yes”?
• Application only cares about whether it does the right thing, so
• Our final results need to be semantic results
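Semantic scoring can be sketched as reducing both the transcription and the recognizer string to a semantic value before comparing, so "yes yes yes" and "yes" match. The synonym map below is a hypothetical stand-in for what synonyms.txt might contain:

```python
# Semantic-equivalence sketch: collapse utterances to semantic values.
SYNONYMS = {"yes": "yes", "yeah": "yes", "yup": "yes",
            "no": "no", "nope": "no"}

def semantic_value(text):
    """Collapse an utterance to one semantic value, or None if mixed/unknown."""
    values = {SYNONYMS.get(w) for w in text.split()}
    if len(values) == 1:
        return values.pop()                  # may be None for unknown words
    return None                              # mixed signals, e.g. "yes no"

pairs = [(semantic_value("yes yes yes"), semantic_value("yes")),
         (semantic_value("yeah"), semantic_value("yes"))]
matches = [a == b and a is not None for a, b in pairs]
```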
Lab 5
• Look at synonyms.txt file
• Analyze at single threshold and look at the result
• php analyze_csv.php <csv file> 50 synonyms.txt
• Note the difference between raw and semantic results
• Now evaluate at all thresholds and look at the (semantic) results
• php analyze_all_thresholds.php <csv file> <synonyms file>
ASR Engine A errors
[Plot: counts of “missilences”, misrecognitions, and “misrejections” vs. rejection threshold (0 to 100)]
ASR Engine A errors
[Plot: total error (sum of the three error types) vs. rejection threshold (0 to 100), with the minimum total error marked]
Lab 6
• You now have engine B in the “b” directory
• Change harness and component scripts as necessary to run the same test
• You need to know that
• The API for engine B is different. Run “php b/b.php” to find out what it is. It takes ABNF grammars instead of XML.
• Engine B stores its output in a different file.
• Possible outputs from engine B are
• <audiofilename>: [NOSPEECH, REJECTION, SPOKETOOSOON, MAXSPEECHTIMEOUT]
• <audiofilename>: ERROR processing file
• Remember to run at confidence threshold of 0 . . .
ASR Engine B errors
[Plot: counts of “missilences”, misrecognitions, and “misrejections” vs. rejection threshold (0 to 100)]
ASR Engine B errors
[Plot: total error (sum of the three error types) vs. rejection threshold (0 to 100), with the minimum total error marked]
Comparing ASR accuracy
• Plot and compare
• Remember to compare optimal error rates of each (representing tuned accuracy)
Total errors: A vs B
[Plot: total error vs. rejection threshold (0 to 100) for ASR Engine A and ASR Engine B]
Comparison conclusions
• Optimal error rates are very similar on this data set
• Engine A is much more sensitive to rejection threshold changes
Natural Numbers
[Plot: total error vs. rejection threshold (0 to 100) for ASR Engine A and ASR Engine B on a natural-numbers task]
Note that optimal thresholds are different!
Today we . . .
• Learned all the steps of doing an eval by actually doing them
• Collecting data
• Transcribing data
• Running a test
• Analyzing results
• Finished with code that runs (and some homework . . .)