Speech Recognition MIT 6.893 SMA 5508 Spring 2004 Larry Rudolph (MIT)
-
Upload
clare-george -
Category
Documents
-
view
213 -
download
1
Transcript of Speech Recognition MIT 6.893 SMA 5508 Spring 2004 Larry Rudolph (MIT)
![Page 1: Speech Recognition MIT 6.893 SMA 5508 Spring 2004 Larry Rudolph (MIT)](https://reader036.fdocuments.in/reader036/viewer/2022083005/56649f205503460f94c38892/html5/thumbnails/1.jpg)
Speech Recognition
MIT 6.893SMA 5508
Spring 2004Larry Rudolph (MIT)
![Page 2: Speech Recognition MIT 6.893 SMA 5508 Spring 2004 Larry Rudolph (MIT)](https://reader036.fdocuments.in/reader036/viewer/2022083005/56649f205503460f94c38892/html5/thumbnails/2.jpg)
6.893 5508 Spring 2004: Speech Recognition Larry RudolphDRAFT -- To Be Revised Shortly
A long term goal
Since 1950, AI researchers claimed
Crucial problem
Will be solved within the decade
Finally, it appears true
Failure rates still too high
90% hit rate is 10% error ratewant 98% or 99% success rate
![Page 3: Speech Recognition MIT 6.893 SMA 5508 Spring 2004 Larry Rudolph (MIT)](https://reader036.fdocuments.in/reader036/viewer/2022083005/56649f205503460f94c38892/html5/thumbnails/3.jpg)
6.893 Spring 2004: User Interface Larry Rudolph
Spectrum of choices
Constrained Domain
Unconstrained Domain
Speaker Dependent
Voice tags (e.g. phone)
Trained Dictation (Viavoice)
Speaker Independent
Galaxy(we are here)
What everyone
wants
![Page 4: Speech Recognition MIT 6.893 SMA 5508 Spring 2004 Larry Rudolph (MIT)](https://reader036.fdocuments.in/reader036/viewer/2022083005/56649f205503460f94c38892/html5/thumbnails/4.jpg)
6.893 5508 Spring 2004: Speech Recognition Larry RudolphDRAFT -- To Be Revised Shortly
Waveform to Phonemes
Waveform is very fuzzy
We think there is a large break between words and sentences
hard to see from waveform
Mapping waveform segments to phonemes is not accurate
![Page 5: Speech Recognition MIT 6.893 SMA 5508 Spring 2004 Larry Rudolph (MIT)](https://reader036.fdocuments.in/reader036/viewer/2022083005/56649f205503460f94c38892/html5/thumbnails/5.jpg)
6.893 5508 Spring 2004: Speech Recognition Larry RudolphDRAFT -- To Be Revised Shortly
Phonemes to words
Group phonemes into words
not always 1-1 mappingmissing phonemes
false phonemes (extra ones)
accents
many possible choices
Word should be known to system
domain or dictionary
![Page 6: Speech Recognition MIT 6.893 SMA 5508 Spring 2004 Larry Rudolph (MIT)](https://reader036.fdocuments.in/reader036/viewer/2022083005/56649f205503460f94c38892/html5/thumbnails/6.jpg)
6.893 5508 Spring 2004: Speech Recognition Larry RudolphDRAFT -- To Be Revised Shortly
words to sentences
People do not always speak grammatically correct
some invariant rules (for speech)
extra or missing words
phrases not always sentences
Easier when sentence is in domain
domain specified by grammar
![Page 7: Speech Recognition MIT 6.893 SMA 5508 Spring 2004 Larry Rudolph (MIT)](https://reader036.fdocuments.in/reader036/viewer/2022083005/56649f205503460f94c38892/html5/thumbnails/7.jpg)
6.893 5508 Spring 2004: Speech Recognition Larry RudolphDRAFT -- To Be Revised Shortly
sentences into meaning
Dictation system: want sentences
Other system: want to understand
Integrate high-level processing
Most applications need it anyway
Helps with recognitionuseful to disambiguate input
![Page 8: Speech Recognition MIT 6.893 SMA 5508 Spring 2004 Larry Rudolph (MIT)](https://reader036.fdocuments.in/reader036/viewer/2022083005/56649f205503460f94c38892/html5/thumbnails/8.jpg)
6.893 5508 Spring 2004: Speech Recognition Larry RudolphDRAFT -- To Be Revised Shortly
meaning into action
What happens after meaning?
Respond to user (even a beep)
Usually generate more substantial response
Action should be valid in context
![Page 9: Speech Recognition MIT 6.893 SMA 5508 Spring 2004 Larry Rudolph (MIT)](https://reader036.fdocuments.in/reader036/viewer/2022083005/56649f205503460f94c38892/html5/thumbnails/9.jpg)
6.893 5508 Spring 2004: Speech Recognition Larry RudolphDRAFT -- To Be Revised Shortly
Disambiguation
Each transformation is rarely highly accurate
Lots of choices
Subsequent steps can rule out choices from previous steps
![Page 10: Speech Recognition MIT 6.893 SMA 5508 Spring 2004 Larry Rudolph (MIT)](https://reader036.fdocuments.in/reader036/viewer/2022083005/56649f205503460f94c38892/html5/thumbnails/10.jpg)
6.893 5508 Spring 2004: Speech Recognition Larry RudolphDRAFT -- To Be Revised Shortly
disambiguation strategy
Select “n-best” choices and pass on
Each step restricts possible meaning
Make heavy use of probability
Viterby search
state transitions along with probabilities.
push through n choices at once
![Page 11: Speech Recognition MIT 6.893 SMA 5508 Spring 2004 Larry Rudolph (MIT)](https://reader036.fdocuments.in/reader036/viewer/2022083005/56649f205503460f94c38892/html5/thumbnails/11.jpg)
6.893 5508 Spring 2004: Speech Recognition Larry RudolphDRAFT -- To Be Revised Shortly
after domain dependent
Handling out-of-vocabulary words
Multimodal input
improve recognition ratese.g. lip reading
sometimes easier to point than say