Posted on 31-Dec-2015
04/19/23 3
It’s hard to recognize speech...
People speak in very different ways
- Across-speaker variation
- Within-speaker variation
Speech sounds vary according to the speech context
The environment varies with respect to noise
The transcription task must handle all of this and produce a transcript of spoken words
Success: low word error rate, WER = (S + I + D) / N * 100
Example: hypothesis "Thesis test" vs. reference "This is a test": 1 substitution + 2 deletions over 4 reference words = 75% WER
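A WER computation along these lines can be sketched with a standard word-level edit distance (the function name `wer` is ours, not from the slides):

```python
def wer(ref, hyp):
    """Word error rate: (S + I + D) / N * 100, via word-level edit distance."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = min edits to turn the first i reference words into the first j hypothesis words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                              # deletions
    for j in range(len(h) + 1):
        d[0][j] = j                              # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + sub) # substitution or match
    return d[len(r)][len(h)] / len(r) * 100
```

On the slide's example, "thesis" for "this" plus deleted "is" and "a" gives 3 errors over 4 reference words, i.e. 75%.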
Progress:
- Very large training corpora
- Fast machines and cheap storage
- Bake-offs
- A market for real-time systems
- New representations and algorithms: finite state transducers
Varieties of Speech Recognition
Mode: isolated words --> continuous
Style: read, prepared, spontaneous
Enrollment: speaker-dependent or speaker-independent
Vocabulary size: <20 --> 5K --> 60K --> ~1M
Language model: finite state, n-grams, CFGs, CSGs
Perplexity: <10 --> >100
SNR: >30 dB (high) --> <10 dB (low)
Input device: telephone, microphones
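Since tasks are ranked here by perplexity, a minimal sketch of how perplexity is computed from a model's per-word log probabilities (toy values, no particular model assumed):

```python
import math

def perplexity(logprobs):
    """Perplexity = 2 ** (-average log2 probability): roughly the model's
    average branching factor on the test data; lower is better."""
    return 2 ** (-sum(logprobs) / len(logprobs))

# A model that gives every word probability 0.5 has perplexity 2;
# probability 0.1 per word gives perplexity 10.
```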
ASR and the Noisy Channel Model
Source --> noisy channel --> Hypothesis
Find the most likely input to have generated the (observed) "noisy" sentence by finding the most likely sentence W in language L given acoustic input O:
W' = argmax_{W in L} P(W|O)
Bayes' rule: P(x|y) = P(y|x) P(x) / P(y), so
W' = argmax_{W in L} P(O|W) P(W) / P(O)
P(O) is the same for every hypothesized W, so
W' = argmax_{W in L} P(O|W) P(W)
where P(W) is the prior and P(O|W) the (acoustic) likelihood
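The decision rule can be illustrated with a toy decoder; the candidate sentences and all scores below are made-up numbers standing in for real acoustic-model and language-model outputs:

```python
# Toy illustration (made-up numbers): score each candidate sentence W by
# log P(O|W) + log P(W); P(O) is constant across W and can be dropped.
acoustic_loglik = {          # log P(O|W), from a hypothetical acoustic model
    "this is a test": -12.0,
    "thesis test": -11.5,
}
lm_logprob = {               # log P(W), from a hypothetical language model
    "this is a test": -4.0,
    "thesis test": -9.0,
}

def decode(candidates):
    # argmax over W of log P(O|W) + log P(W)
    return max(candidates, key=lambda w: acoustic_loglik[w] + lm_logprob[w])

best = decode(list(acoustic_loglik))
```

Here the language model prior overrules a slightly better acoustic score, which is exactly how the prior repairs acoustically confusable hypotheses.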
Simple Isolated Digit Recognition
Train 10 acoustic templates M_i, one per digit
Compare input x with each
Select the most similar template j according to some comparison function, minimizing differences: j = argmin_i f(x, M_i)
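A minimal sketch of this template scheme, with invented two-dimensional "templates" in place of real acoustic ones (practical systems compare whole frame sequences, e.g. with dynamic time warping):

```python
# Hypothetical templates: one feature vector per digit. Real templates
# would be sequences of spectral frames, not single points.
templates = {0: [0.1, 0.9], 1: [0.8, 0.2], 2: [0.5, 0.5]}

def euclidean(a, b):
    """A simple comparison function f(x, M_i)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def recognize(x, templates, f=euclidean):
    # j = argmin_i f(x, M_i): the template most similar to the input wins
    return min(templates, key=lambda i: f(x, templates[i]))
```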
Scaling Up: Continuous Speech Recognition
Collect training and test corpora:
- Speech + word transcription
- Speech + phonetic transcription (built by hand or using TTS)
- Text corpus
Determine a representation for the signal
Build probabilistic models:
- Acoustic model: signal to phones
- Pronunciation model: phones to words
- Language model: words to sentences
Select search procedures to decode new input given these trained models
Representing the Signal
What parameters (features) of the waveform:
- Can be extracted automatically?
- Will preserve phonetic identity and distinguish it from other phones?
- Will be independent of speaker variability and channel conditions?
- Will not take up too much space?
... the power spectrum
Speech is captured by a microphone and digitized
The signal is divided into frames
A power spectrum is computed to represent energy in different bands of the signal (LPC spectrum, cepstra, PLP)
Each frame's spectral features are represented by a small set of numbers
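The framing and power-spectrum steps can be sketched directly; this uses a naive DFT for clarity, where a real front end would use an FFT followed by LPC, cepstral, or PLP processing:

```python
import math

def frames(signal, frame_len, hop):
    """Split digitized samples into (possibly overlapping) frames."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def power_spectrum(frame):
    """Naive DFT power spectrum: energy in each frequency bin.
    Real front ends use an FFT, then derive compact features."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * t / n) for t, x in enumerate(frame))
        im = -sum(x * math.sin(2 * math.pi * k * t / n) for t, x in enumerate(frame))
        spec.append((re * re + im * im) / n)
    return spec
```

A pure tone at a given frequency concentrates its energy in the matching bin, which is what makes the spectrum phonetically informative.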
Why does it work? Different phonemes have different spectral characteristics
Why doesn't it work? Phonemes can have different properties in different acoustic contexts, spoken by different people, ...
Acoustic Models
Model the likelihood of a phone given spectral features and prior context
Usually represented as an HMM:
- A set of states representing phones or other subword units
- Transition probabilities on states: how likely is it to see one phone after another?
- Observation/output likelihoods: how likely is a spectral feature vector to be observed from state i, given state i-1?
Train initial model on small hand-labeled corpus to get estimate of transition and observation probabilities
Tune parameters on large corpus with only transcription
Iterate until no further improvement
Pronunciation Model
Models the likelihood of a word given a network of candidate phone hypotheses (a weighted phone lattice)
Must model allophonic variation: butter vs. but
The lexicon may be an HMM or a simple dictionary
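A dictionary-style lexicon can be sketched as a plain mapping from words to phone sequences; the ARPAbet-like pronunciations below are illustrative, not taken from any actual dictionary:

```python
# Toy lexicon (assumed pronunciations): each word maps to one or more
# phone sequences, so allophonic variants can be listed explicitly,
# e.g. the flapped /t/ in "butter" vs. the released /t/ in "but".
lexicon = {
    "butter": [["b", "ah", "dx", "er"]],   # [dx] = flap between vowels
    "but":    [["b", "ah", "t"]],          # released [t] word-finally
}

def words_for(phones, lexicon):
    """Return words whose listed pronunciations match a phone sequence."""
    return [w for w, prons in lexicon.items() if phones in prons]
```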
Language Models
Models the likelihood of a word sequence given candidate word hypotheses
Grammars: finite state or CFG
N-grams: corpus-trained, with smoothing issues
The out-of-vocabulary (OOV) problem
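A corpus-trained bigram model with add-one (Laplace) smoothing is about the simplest instance of the above; the helper names are ours:

```python
from collections import Counter

def train_bigram(corpus):
    """Collect unigram and bigram counts from a toy corpus
    (a list of tokenized sentences), with sentence boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent + ["</s>"]
        unigrams.update(toks[:-1])               # contexts w1
        bigrams.update(zip(toks[:-1], toks[1:])) # pairs (w1, w2)
    return unigrams, bigrams

def p_bigram(w2, w1, unigrams, bigrams, vocab_size):
    """Add-one smoothed P(w2 | w1): unseen bigrams still get mass."""
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab_size)
```

Smoothing reserves probability mass for unseen word pairs, but it does not by itself solve the OOV problem: a word absent from the vocabulary still cannot be hypothesized at all.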
Search
Find the best hypothesis given:
- A lattice of subword units (AM)
- Segmentations of all paths into possible words (PM)
- Probabilities of word sequences (LM)
Huge search space: Viterbi decoding, beam search
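The Viterbi recurrence can be sketched as follows; the two-phone HMM and all of its probabilities are invented for illustration:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Best-scoring state path through an HMM for an observation sequence.
    Real decoders work in log space over a huge space and prune with a
    beam; this exhaustive version just shows the recurrence."""
    # V[t][s] = (best prob of a path ending in state s at time t, that path)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}]
    for o in obs[1:]:
        row = {}
        for s in states:
            prob, path = max(
                (V[-1][prev][0] * trans_p[prev][s] * emit_p[s][o],
                 V[-1][prev][1])
                for prev in states)
            row[s] = (prob, path + [s])
        V.append(row)
    return max(V[-1].values())[1]

# Toy two-phone HMM with made-up probabilities: /t/ tends to emit
# high-frequency energy, /ih/ low-frequency energy.
states = ["t", "ih"]
start_p = {"t": 0.6, "ih": 0.4}
trans_p = {"t": {"t": 0.7, "ih": 0.3}, "ih": {"t": 0.4, "ih": 0.6}}
emit_p = {"t": {"low": 0.1, "high": 0.9}, "ih": {"low": 0.8, "high": 0.2}}
```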
Challenges for Transcription
Robustness to channel characteristics and noise
Portability to new applications
Adaptation: to speakers, to environments
LMs: simple n-grams need help
Confidence measures
OOV words
New speaking styles/genres
New applications
Challenges for Understanding
Recognizing communicative "problems":
- ASR errors
- User corrections
- Disfluencies and self-repairs
An Unsuccessful Dialogue
S: Hi, this is AT&T Amtrak schedule system. This is TOOT. How may I help you?
U: I'd like to find a train from Ch- to Chicago from Baltimore.
S: Sorry, I can't understand you. Please repeat your utterance.
U: I'd like to find a train from Baltimore to Chicago.
S: I heard you say "leave today around 5 thirty in the morning." Which city do you want to go to?
U: I would like to leave Saturday at eight in the morning.
S: You can say the arrival city name, such as "New York City."
U: Baltimore to Chicago.
S: You can say the arrival city name, such as "New York City."
U: Baltimore to Chicago. (I hate it.)
U: Train from Baltimore ...
S: ...I heard you s...
U: to Chicago.... Hello?
S: You can say the name of your departure city, such as "New York City."
U: Baltimore.
U: Bal-ti-more....
S: You can say...
U: Baltimore.
S: Do you want to continue and find a new set of trains?
U: Yes.
S: You can say "yes" or "I am done here" to find a new set of trains.
U: Yes.
S: Thank you for using AT&T Amtrak train time table system. See you next time.
U: I nev-
Disfluencies and Self-Repairs
Disfluencies abound in spontaneous speech: one every 4.6s in radio call-in (Blackmer & Mitton '91)
- Hesitation: Ch- change strategy.
- Filled pause: Um Baltimore.
- Self-repair: Ba- uh Chicago.
Hard to recognize:
- Ch- change strategy. --> to D C D C today ten fifteen.
- Um Baltimore. --> From Baltimore ten.
- Ba- uh Chicago. --> For Boston Chicago.
Possibilities for Understanding
Recognizing speaker emotion
Identifying speech acts: okay
Locating topic boundaries for topic tracking, audio browsing, speech data mining