Automatic speech recognition
Institut de Recherche en Informatique et Systèmes Aléatoires
Automatic speech recognition
Gwénolé Lecorvé – [email protected]
Research in Computer Science (SIF) master
Vocal and Acoustic Interactions - Automatic Speech Recognition (M2 SIF)
Spoken interaction
Speech synthesis / Text-to-speech (TTS)
Automatic speech recognition (ASR) / Speech-to-text
Spoken language understanding
Dialogue management
Spoken interaction
Spoken language understanding
Dialogue management
Vocal tract, articulators
Inner ear, nerve
Phonetics
Morphology, lexicon, syntax
Semantics, pragmatics
Speech synthesis / Text-to-speech (TTS)
Automatic speech recognition (ASR) / Speech-to-text
Spoken interaction
Lecturers: G. Lecorvé, C. Raymond
Outline
1) Introduction and definitions
2) Statistical approach
3) Speech analysis
4) Acoustic modeling
5) Lexicon and pronunciation modeling
6) Language modeling
7) Decoding
8) End-to-end approach
Introduction and definitions
Reading:
• Jurafsky and Martin (2008). Speech and Language Processing (2nd ed.)
What is speech recognition?
►Transform raw audio into a sequence of words
– No meaning extracted
→ "Recognize speech" ~ "Wreck a nice beach"
→ "Barack Obama" ~ "Barraque aux Bahamas"
►Related tasks
– Speaker diarization/recognition: Who spoke when?
– Spoken language understanding: What's the meaning?
– Sentiment analysis, opinion mining: How does the speaker feel/think?
Difficulties
►Hierarchical problem?
[Figure: hierarchy of speech processing levels (source: Julia Hirschberg)]
Difficulties
►Not that simple, because of lots of variability
– Acoustics
● Intra-speaker variability, inter-speaker variability
● Noise, reverberation, etc.
– Phonetics
● Co-articulation, elisions, etc.
● Word confusability
– Linguistics
● Word variations
● Vocabulary size
● Polysemy
● Ellipses, anaphora, etc.
Speech production
Speech production
►Source-filter model (Fant, 1960)
►Signal s = f * e (convolution)
– Source e (excitation)
– Filter f (vocal tract)
– (assuming f is linear and time-invariant)
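The convolution s = f * e can be illustrated with a toy numerical example; the excitation and filter below are made-up signals for illustration, not physiological models:

```python
import numpy as np

# Source-filter sketch: the signal s is the convolution of an excitation e
# (glottal source) with a filter f (vocal tract impulse response), under the
# linearity and time-invariance assumption. Toy signals only.
e = np.zeros(64)
e[::16] = 1.0                    # impulse train ~ voiced excitation (period 16)
f = np.exp(-np.arange(8) / 2.0)  # made-up decaying impulse response
s = np.convolve(e, f)            # s = f * e
print(s.shape)                   # (71,)
```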
Speech production
[Figure: source-filter pipeline: initial waveform (airflow vs time), glottal spectrum of the source (vibration for voiced sounds, none for unvoiced), formants of the vocal tract filter, and the resulting output spectrum (intensity vs frequency)]
Speech production
[Figure: same pipeline annotated with F0 (fundamental frequency) on the glottal spectrum and the formants F1, F2, F3, F4 on the vocal tract response]
Phonetics and phonology
►Spoken words are made of phonemes/phones
– "French" → /fɹɛntʃ/
– "français" → /fʁɑ̃sɛ/
►Acoustic view
– Phone
– Realized: [ f o n ʊ̯ ]
►Phonological view
– Phoneme
– Symbolic: / f o n i m ʊ /
►Transcribed with the International Phonetic Alphabet (IPA)
Phonetics and phonology
►1 phoneme = voicing + position of the articulators
►All phonemes = set of elementary sounds in a language
– Language-dependent (ɹ ≠ ʁ)
– Elementary: principle of minimal pairs
● "kill" versus "kiss"
● "pat" versus "bat"
►Allophones = free variants of a phoneme
– No minimal pair
– "père" → [pɛr], [pɛʀ] or [pɛʁ]
Linguistics
►Word
– Sequence of graphemes (symbolic view)
– Morphemes: "recognition" = "re" + "cogni" + "tion"
– Morpho-syntax: Part Of Speech (POS)
● Grammatical class: noun, verb, etc.
● Inflectional information: singular/plural, gender, etc.
– Syntax
● Function: subject, object, etc.
● Shallow vs. deep parsing (compound structures)
– Meaning
→ Representation?
Linguistics
►Vocabulary = set of words in a
– Task
– Language
– Several languages
►Syntax
– None: isolated words
– Grammar
– Free → continuous speech recognition
→ Large vocabulary continuous speech recognition (LVCSR)
Evaluation: Word Error Rate (WER)
►Reference: manual transcript
►Hypothesis: ASR output
►Alignment by edit distance
►Edit operations: insertions, deletions, substitutions
►Score: WER = (S + D + I) / N
– Perfect = 0%
– Can be > 100% (many insertions)
►Example
– Reference: "the lazy dog jumps"
– Hypothesis: "*** amazing dog jumps" (one deletion, one substitution)
– WER = (1 + 1 + 0) / 4 = 50%
Evaluation: Word Error Rate (WER)
►Word alignment: Wagner-Fischer algorithm (dynamic programming)
– 3 costs: insertion, deletion, substitution
→ All errors may not harm the same (w.r.t. the task)

            <s>   amazing   dog   jumps
  <s>        0       1       2      3
  the      1 (↓)     1       2      3
  lazy       2     2 (↙)     2      3
  dog        3       3     2 (↙)    3
  jumps      4       4       3    2 (↙)
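The alignment above can be computed with a short Wagner-Fischer implementation (unit costs for the three edit operations; a minimal sketch, not a production scorer):

```python
def wer(reference, hypothesis):
    """Word error rate via Wagner-Fischer edit distance (unit costs)."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # deletions only
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # insertions only
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# The slide's example: one deletion + one substitution over 4 reference words
print(wer("the lazy dog jumps", "amazing dog jumps"))  # 0.5
```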
Statistical (historical) approach
Reading:
• Jurafsky and Martin (2008). Speech and Language Processing (2nd ed.)
• Jelinek (1998). Statistical Methods for Speech Recognition
Statistical formalisation
►Y: sequence of acoustic features
►W: sequence of words (of the vocabulary)
►Fundamental equation: W* = argmax_W p(Y|W) P(W)
– p(Y|W): acoustic model
– P(W): language model
– Search space = f(vocabulary)
Steps and components
[Diagram: ASR system: speech signal → speech analysis → acoustic features Y → decoding → best hypothesis W*]
– Acoustic model: p(Y|W)
– Lexicon
– Language model: P(W)
– Decoding: maximize p(Y|W) P(W)^ψ I^|W|
Generative view
[Diagram: generative view: the language model P(W) generates the transcript hypothesis W = w1 ... wN; the pronunciation dictionary maps W to the phoneme sequence H = p1 ... pL; the acoustic model p(Y|H) links H to the acoustic features Y = y1 ... yT extracted from the signal]
Speech analysis
Reading:
• Han et al. (2006). An efficient MFCC extraction method in speech recognition. IEEE International Symposium.
• Hermansky et al. (1992). RASTA-PLP speech analysis technique. In Proc. ICASSP.
• Hermansky et al. (2000). Tandem connectionist feature extraction for conventional HMM systems. In Proc. ICASSP.
Sampling and quantization
►Sampling
– Usual resolution: fs = 8 kHz – 16 kHz (sampling period 1/fs)
►Quantization
– 8 bits / sample
[Figure: continuous waveform (amplitude vs time) and its sampled, quantized version]
Windowing, frames
►1 frame = window of samples
►Overlap across frames
– 32 ms span (256 samples for fs = 8 kHz)
– Shifted every 10 ms
[Figure: overlapping frames i – 1, i, i + 1 over the signal]
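The framing scheme above (32 ms windows every 10 ms) can be sketched as follows; `frames` is a hypothetical helper name, and no tapering window (e.g. Hamming) is applied in this sketch:

```python
import numpy as np

def frames(signal, fs=8000, span_ms=32, shift_ms=10):
    """Slice a signal into overlapping frames (values from the slide)."""
    span = int(fs * span_ms / 1000)    # 256 samples at 8 kHz
    shift = int(fs * shift_ms / 1000)  # 80 samples at 8 kHz
    n = 1 + max(0, (len(signal) - span) // shift)
    return np.stack([signal[i * shift : i * shift + span] for i in range(n)])

x = np.arange(8000, dtype=float)       # 1 s of dummy samples at 8 kHz
f = frames(x)
print(f.shape)                          # (97, 256)
```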
Feature extraction
►Features = energy + frequencies
►Desirable properties
– Robust to F0 changes (and F0 harmonics)
– Robust across speakers
– Robust against noise and channel distortion
– As low-dimensional as possible at equal accuracy
– No redundancy among features
Mel-Frequency Cepstral Coefficients (MFCC)
Pipeline (per frame):
1) Fast Fourier Transform (FFT)
2) Power spectrum + logarithm (log |...|²)
3) Mel filterbank (~20-40 filters)
4) Inverse DCT → cepstrum (~12 coefficients)
5) Append energy, then Δ and ΔΔ derivatives
→ 3 × (12 + 1) = 39 coefficients
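The pipeline can be sketched for a single frame with numpy only. This is a simplified version under stated assumptions: no pre-emphasis, no windowing, no liftering; the filterbank size and coefficient count are the usual values cited above:

```python
import numpy as np

def mfcc_frame(frame, fs=8000, n_filters=24, n_ceps=12):
    """Simplified MFCC for one frame: |FFT|^2 -> log mel energies -> DCT."""
    nfft = len(frame)
    power = np.abs(np.fft.rfft(frame)) ** 2
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = imel(np.linspace(0.0, mel(fs / 2.0), n_filters + 2))
    bins = np.floor((nfft + 1) * edges / fs).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for i in range(n_filters):            # triangular mel filters
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):
            fbank[i, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fbank[i, k] = (r - k) / max(r - c, 1)
    logmel = np.log(fbank @ power + 1e-10)
    n = np.arange(n_filters)              # cosine transform -> cepstrum
    dct = np.cos(np.pi * np.outer(np.arange(1, n_ceps + 1), n + 0.5) / n_filters)
    return dct @ logmel

coeffs = mfcc_frame(np.random.default_rng(0).standard_normal(256))
print(coeffs.shape)  # (12,)
```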
Feature extraction (cont.)
►Other features
– Perceptual Linear Prediction (PLP): autoregressive
– Tandem: discriminative
►Normalization: avoid mismatches across samples
– Mean/variance normalization
"zéro" (0 in French)
Examples
"trois" (3 in French)
Time
Mel
filte
r
Acoustic modeling
Reading:
• Gales and Young (2007). "The Application of Hidden Markov Models in Speech Recognition". Foundations and Trends in Signal Processing, 1(3), 195-304.
• Rabiner and Juang (1989). "An introduction to hidden Markov models". IEEE ASSP Magazine.
• Hinton et al. (2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6), 82-97.
• Palaz (2016). Towards End-to-End Speech Recognition.
Isolated words
►Pattern recognition
Acoustic features Y = y1 y2 y3 ... yT–1 yT
→ Classifier → "right"
Overview
►Decomposition into phonemes
Acoustic features Y = y1 y2 y3 ... yT–1 yT
→ Acoustic model (AM) → /ɹ/ /aɪ/ /t/
Hidden Markov Models (HMM)
►Overview
►Temporal structure: begin, middle and end states for each phoneme (/ɹ/ /aɪ/ /t/)
►Observation probabilities: the states emit the acoustic features Y = y1 y2 y3 ... yT–1 yT
HMMs
►Probabilistic automaton
– States qi
● State at step t: st
● Initial probability: πi = Pr(s0 = qi)
● Transition probabilities: aij = Pr(st+1 = qj | st = qi)
– Outputs
● Observation at step t: ot
● Alphabet of symbols X = (xk)
● Emission probability: bi(x) = Pr(ot = x | st = qi)
HMM
►1 phoneme
– 3- (or 5-) state linear HMM: beginning, middle, end
– Observation independence:
● Pr(ot = x | st = qi, st–1 = qi', ..., s0 = qi'') = Pr(ot = x | st = qi)
– Probability to reach the end state e at frame yt:
● P(/a/ | t) = P(st = e) be(yt)
►Context-dependent phonemes = triphones
– Same linear HMMs
– E.g., /bab/, /bal/, /bap/, etc.
– State tying: gather similar HMM states to overcome data sparsity
HMMs
►Training: Baum-Welch (forward-backward)
►Usage: Viterbi
[Figure: trellis of hidden states q1, q2, q3 (initial probabilities π1, π2, π3) over time steps/observations y0 ... y4; the forward pass computes probabilities, the backward pass recovers the hidden states]
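Decoding with a trained HMM uses the Viterbi algorithm. Below is a minimal log-domain sketch over a toy two-state model (all probabilities made up for illustration):

```python
import numpy as np

def viterbi(log_pi, log_A, log_B):
    """Most likely state path. log_pi: (S,) initial log-probs,
    log_A: (S, S) transition log-probs, log_B: (S, T) emission log-probs."""
    S, T = log_B.shape
    delta = log_pi + log_B[:, 0]             # best score ending in each state
    back = np.zeros((S, T), dtype=int)       # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_A      # scores[i, j]: from i to j
        back[:, t] = np.argmax(scores, axis=0)
        delta = scores[back[:, t], np.arange(S)] + log_B[:, t]
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):            # follow backpointers
        path.append(int(back[path[-1], t]))
    return path[::-1]

log_pi = np.log([0.9, 0.1])
log_A = np.log([[0.7, 0.3], [0.2, 0.8]])
log_B = np.log([[0.9, 0.8, 0.1], [0.1, 0.2, 0.9]])  # states x frames
print(viterbi(log_pi, log_A, log_B))  # [0, 0, 1]
```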
HMM
►Usage:
– Not finding the hidden state sequence
– Give the probability of each phoneme end at time t
⇒ All (context-dependent) phoneme HMMs run in parallel
►AM performance
– Accuracy in recognizing the proper phonemes
GMM/HMM
►Gaussian Mixture Model (GMM)
– M components
– Pr(ot = y | st = qi) = Σ_{m=1..M} c_{i,m} N(y; μ_{i,m}, Σ_{i,m}), with Σ_m c_{i,m} = 1
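The emission density above is usually evaluated in the log domain; a diagonal-covariance sketch (weights, means and variances here are illustrative arguments, not trained values):

```python
import numpy as np

def gmm_logpdf(y, weights, means, variances):
    """Log-likelihood of vector y under a diagonal-covariance GMM.
    weights: (M,), means/variances: (M, D)."""
    y = np.asarray(y, dtype=float)
    diff2 = (y - means) ** 2 / variances
    log_comp = (np.log(weights)
                - 0.5 * np.sum(np.log(2 * np.pi * variances) + diff2, axis=1))
    m = np.max(log_comp)
    return m + np.log(np.sum(np.exp(log_comp - m)))   # log-sum-exp

# A single standard-normal component reduces to the usual Gaussian log-pdf
val = gmm_logpdf([0.0], np.array([1.0]), np.zeros((1, 1)), np.ones((1, 1)))
print(round(val, 4))   # -0.9189, i.e. -0.5*log(2*pi)
```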
GMM/HMM
►GMM training
– Initialize the M components: K-means
– PDF estimation
● (M + M²) × dim(Y) parameters
● EM algorithm
● Maximize the training-data likelihood
● Constraint: the mixture weights sum to 1
→ See the ADM course
DNN/HMM
►DNN (DNN/HMM)
– Train a GMM/HMM, label the sequence elements
– Train a DNN (supervised learning)
● X × Q → [0, 1]
● (x, q) → Pr(ot = x | st = q)
►Feedforward NNs, convolutional NNs
– Features of step t: yt → phoneme class p
– With neighbours: (yt–1, yt, yt+1) → p
(source: Hinton et al., 2012)
Recurrent NNs, end-to-end models
►Recurrent NNs, LSTMs, etc.
– (Segment of) sequence: YtC = (yi)_{i = t–C..t} → sequence of pi → last phoneme pt
►End-to-end
– Combine feature extraction and phoneme prediction
(source: Palaz et al., 2016)
Pronunciation
Reading:
• Jurafsky and Martin (2008). Speech and Language Processing (2nd ed.)
• Rasipuram (2014). Grapheme-based automatic speech recognition using probabilistic lexical modeling.
• Collobert et al. (2016). Wav2Letter: an End-to-End ConvNet-based Speech Recognition System. ArXiv.
Back to ASR
►Words are sequences of states Q
Pronunciation model
Pronunciation dictionary / Lexicon
►Most basic way
– Word → phoneme sequence
– No probabilities
– Written by human experts
→ Key aspect in ASR accuracy
►Coverage
– AM training set: all words needed to train the HMMs (or whatever model)
– Maximize coverage over some representative texts
N-M relation
►English
– "hello" → /hɛloʊ/
– "hello" → /həloʊ/
– "there" → /ðɛɹ/
– "their" → /ðɛɹ/
►French
– "les" → /lɛ/, /le/, /lɛz/, /lez/
– "clans" → /klɑ̃/, /klɑ̃z/
– "clé" → /kle/
– "clef" → /kle/
– "être" → /ɛtʁə/, /ɛtʁ/, /ɛt/
Lexical tree
►Prefix factorization
[Figure: lexical tree over the phoneme sequences /k l e/, /k l ɑ̃ (z)/ and /ɛ t ʁ ə/, with shared prefixes stored once; nodes are labelled with the words they complete: {clef, clé}, {clans} (with an optional /z/), and {être} at its exit points /ɛ t/, /ɛ t ʁ/ and /ɛ t ʁ ə/]
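Prefix factorization can be sketched as a trie over phoneme sequences (the `#words` key marking word ends is an arbitrary convention of this sketch):

```python
# Prefix (trie) factorization of a pronunciation lexicon: shared phoneme
# prefixes are stored once; words label the nodes where they end.
def build_tree(lexicon):
    root = {}
    for word, phones in lexicon:
        node = root
        for p in phones:
            node = node.setdefault(p, {})
        node.setdefault("#words", set()).add(word)
    return root

lex = [("clans", ["k", "l", "ɑ̃"]), ("clans", ["k", "l", "ɑ̃", "z"]),
       ("clé", ["k", "l", "e"]), ("clef", ["k", "l", "e"])]
tree = build_tree(lex)
print(tree["k"]["l"]["e"]["#words"])   # both homophones end here
```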
Out Of Vocabulary (OOV) words
►Constructing a dictionary involves
– 1) Selection of the words in the dictionary: want to ensure high coverage of the words in test data
– 2) Representation of the pronunciation(s) of each word
►OOV rate: percentage of word tokens in the test data that are not in the ASR system's dictionary
►OOV rate increase ⇒ WER increase
– 1.5-2 errors per OOV word (> 1 because of the loss of context)
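The OOV rate as defined above is a one-liner to compute (toy vocabulary and test data below):

```python
def oov_rate(test_tokens, vocabulary):
    """Percent of word tokens in the test data not in the dictionary."""
    oov = sum(1 for w in test_tokens if w not in vocabulary)
    return 100.0 * oov / len(test_tokens)

vocab = {"the", "dog", "jumps"}
print(oov_rate("the lazy dog jumps".split(), vocab))  # 25.0
```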
Vocabulary content
►Words
►Multi-words: frequent sequences of words
– "want to" → "want_to"
– "je suis" → "je_suis"
→ Handling of pronunciation variants
►Subword units (morphemes, characters, etc.)
– OOV words
– Character-based languages
– Agglutinative languages
Word normalization
►Many variants
– Hong-Kong, Hong Kong
– U.N., UN, U. N.
– Trinity College, new college
– 2, two
– Mr, Mister
– 100m, 100 meters
►Automatic learning/discovery based on knowledge resources (Wikipedia, Wiktionary, WordNet, etc.)
Current topics
►Pronunciation variants or alternative pronunciations
– Grapheme-to-phoneme (G2P) models: automatic learning of pronunciations of new words
– Probability distribution over possible pronunciations
►Codebook learning: joint learning of the inventory of subword units and the pronunciation lexicon
→ Minimum description length (MDL)
►Sub-phonetic / articulatory feature models
Current topics
►Grapheme-based acoustic modelling (Rasipuram, 2014)
– Character level
– No separate pronunciation modelling anymore
►Grapheme-based speech recognition: wav2letter (Collobert et al., 2016)
– End-to-end approach
Language modeling
Reading:
• Jurafsky and Martin (2008). Speech and Language Processing (2nd ed.)
• Bengio et al. (2006). "Neural probabilistic language models" (sections 6.1-6.3, 6.6-6.8). Studies in Fuzziness and Soft Computing, Volume 194, Springer, chapter 6.
• Mikolov et al. (2011). "Extensions of recurrent neural network language model". Proc. of ICASSP.
• Jozefowicz et al. (2016). "Exploring the Limits of Language Modeling". ArXiv.
Constraints
►What the speaker is allowed to say
– Constrained grammar
– Binary decision
– Task-oriented
– + More precise
– – Less flexible
►What the speaker may say
– Free grammar (given the vocabulary)
– Probabilities over possible sequences
– A priori: trained on some text
Regular/Context-free grammar
<Root> = <Date>
<Date> = <Day> "the" <Ith> "of" <Month>
<Day> = "Monday" | "Tuesday" | … | "Sunday"
<Ith> = "first" | "second" | … | "thirty-first"
<Month> = "January" | … | "December"

►No training data to be collected
►Finite state automaton / pushdown automaton
►Grammar can be made probabilistic
[Figure: finite state automaton with arcs "Monday" … "Sunday", "the", "first" … "thirty-first", "of", "January" … "December"]
Statistical language modeling
►Idea
– Cover all possible sequences (V* = V × V × V × …)
– Disambiguate acoustically ambiguous sequences: "recognize speech" vs "wreck a nice beach"
►Sequence: W = w1 ... wn
►Chain rule: P(W) = Π_i P(wi | w1 ... wi–1)
►Maximum likelihood estimation (MLE): P(wi | w1 ... wi–1) = C(w1 ... wi) / C(w1 ... wi–1), with C(w1 ... wi) the observed count in the training data
– The conditioning word sequence is the history h
Smoothing and Backoff
►What if a sequence was never observed during training?
→ The longer the sequence, the more zero counts
►History of words h
►Smoothing
– Redistribute probability mass from observed to unobserved events: change counts and renormalize
● Absolute discounting, Kneser-Ney smoothing
►Backoff
– Link unseen events to the most related seen events
n-gram model
►Truncate the word history h to the n–1 previous words: P(wi | h) ≈ P(wi | wi–n+1 ... wi–1), with n usually 2..5
►n = 1 → "unigram", n = 2 → "bigram", n = 3 → "trigram"
►Backoff
– Unseen wawbwc → fall back to the truncated history h– = wbwc
– P(wc | wawb) = P(wc | wb) β(wawb)
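A bigram MLE with a crude backoff can be sketched as follows. The constant β = 0.4 is a "stupid backoff"-style simplification, not a properly normalized backoff weight:

```python
from collections import Counter

corpus = "the dog jumps the dog sleeps".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(w, h, beta=0.4):
    if (h, w) in bigrams:
        return bigrams[(h, w)] / unigrams[h]       # MLE: C(h w) / C(h)
    return beta * unigrams[w] / len(corpus)        # back off to the unigram

print(p_bigram("dog", "the"))     # 1.0  (both "the" are followed by "dog")
print(p_bigram("sleeps", "dog"))  # 0.5
```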
n-gram model
[Figure: unseen trigram history h = wa wb backs off to the truncated history h– = wb when predicting wc]
Perplexity
►How well is a text T predicted by a model M?
►Definition 1
– Cross-entropy H(T, M)
– Perplexity PPL = 2^H(T, M)
►Definition 2
– Average log-likelihood of M over T (n words): LL = (1/n) Σ_i log P(wi | hi)
– Perplexity PPL = exp(–LL)
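Definition 2 can be computed directly from per-word probabilities (the probabilities below are made up):

```python
import math

# Perplexity over a text of n words: PPL = exp(-(1/n) * sum log P(w_i | h_i))
def perplexity(word_probs):
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)

# A model that always assigns 1/4 has branching factor 4
print(round(perplexity([0.25, 0.25, 0.25, 0.25]), 6))  # 4.0
```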
Perplexity
►The lower, the better
►Best theoretical perplexity = 1
►Interpretation
– Branching factor: average number of candidate next words ŵ1 ... ŵ|V| the model hesitates between after a history hi (one of them, ŵj = wi, being the actual next word)
Advanced n-gram models
►Factored language models
– 1 word wi → 1 feature vector fi = (wi, xi, yi, …)
– Data sparsity ⇒ exploit feature dependencies
– Backoff scheme over the factors
[Figure: history of feature vectors fi–3, fi–2, fi–1, each grouping (wi–k, xi–k, yi–k), used to predict fi]
Advanced n-gram models
►Structured language models (LMs)
►Long-span/distant dependencies
►Syntax parsing
→ Grammatical function + headword
►Idea: condition words on the parent's information
►Difficulty: online parsing
Cache/trigger/topic models
►Cache assumption: words said once may be said again
►Trigger assumption: some words may increase the probability of other words later on
►Extension to topic models
►Additional models
Exponential (MaxEnt) models
►Decompose (h, w) into feature functions fj(h, w):
P(w | h) = exp(Σ_j λj fj(h, w)) / Z(h)
– With Z(h), a normalization factor
– λj, parameters to be optimized
►A feature function denotes a characteristic
– Have been observed together
– Is syntactically correct
– Is thematically correct
– Etc.
→ Binary value
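The normalized exponential form can be sketched directly; the vocabulary, feature function and weight below are made up for illustration:

```python
import math

# MaxEnt LM sketch: P(w|h) = exp(sum_j lambda_j f_j(h, w)) / Z(h),
# with a binary feature function and a hand-picked weight.
def maxent_prob(w, h, vocab, features, lam):
    def score(word):
        return math.exp(sum(l * f(h, word) for f, l in zip(features, lam)))
    return score(w) / sum(score(v) for v in vocab)   # Z(h) normalizes over V

vocab = ["speech", "beach"]
features = [lambda h, w: float(h == "recognize" and w == "speech")]
lam = [2.0]
p = maxent_prob("speech", "recognize", vocab, features, lam)
print(round(p, 3))  # 0.881
```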
Exponential (MaxEnt) models
►Training
– Maximum entropy of the model
– Under constraints for each feature function fj, with Kj the probability mass usually observed in the training data
►Iterative algorithms
– Slow
– No smoothing
– Not always better than n-gram MLE
Neural network LMs
►Feed-forward approach
(source: Koehn, 2016)
Neural network LMs
►Word embedding
►Shared weights C
(source: Koehn, 2016)
(Word embeddings)
►Projection into a continuous space ℝ^d
►Topological properties
– Morpho-syntax
– Semantics
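The topological properties can be illustrated with cosine similarity; the 3-dimensional vectors below are fabricated toy examples, not trained embeddings:

```python
import numpy as np

# Toy illustration of embedding topology: nearby vectors ~ related words.
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# king - man + woman lands nearer queen than man in this toy space
target = emb["king"] - emb["man"] + emb["woman"]
print(cosine(target, emb["queen"]) > cosine(target, emb["man"]))  # True
```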
Recurrent neural network LMs
►Build embeddings of word histories
(source: Koehn, 2016)
![Page 75: Automatic speech recognition](https://reader031.fdocuments.in/reader031/viewer/2022012915/61c629451978be230f10923a/html5/thumbnails/75.jpg)
Long short-term memory (LSTM)
►Recurrent neural networks with forgetting mechanisms
⇒ Important information is remembered longer
(source: Sundermeyer et al., 2015)
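The forgetting mechanism can be sketched as a single LSTM step (a minimal numpy version with random toy weights; real implementations add biases per gate, peepholes, etc.). The forget gate f decides how much of the old cell state survives, which is what lets important information persist across many steps:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step: gates decide what to forget, write and expose."""
    z = W @ x + U @ h_prev + b            # all four gate pre-activations at once
    d = h_prev.size
    f = sigmoid(z[0*d:1*d])               # forget gate: keep or erase old memory
    i = sigmoid(z[1*d:2*d])               # input gate: admit new information
    o = sigmoid(z[2*d:3*d])               # output gate
    g = np.tanh(z[3*d:4*d])               # candidate memory content
    c = f * c_prev + i * g                # cell state: long-term memory
    h = o * np.tanh(c)                    # hidden state: what the cell exposes
    return h, c

d_in, d_h = 3, 4
rng = np.random.default_rng(1)
W = rng.normal(size=(4 * d_h, d_in)); U = rng.normal(size=(4 * d_h, d_h))
b = np.zeros(4 * d_h)
h, c = np.zeros(d_h), np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):      # run over 5 input frames
    h, c = lstm_step(x, h, c, W, U, b)
```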
![Page 76: Automatic speech recognition](https://reader031.fdocuments.in/reader031/viewer/2022012915/61c629451978be230f10923a/html5/thumbnails/76.jpg)
Decoding
Reading:
• Ney and Ortmanns (1999). Dynamic programming search for continuous speech recognition. IEEE Signal Processing Magazine, 16(5), 64-83.
• Mohri et al. (2008). Speech recognition with weighted finite-state transducers. In Springer Handbook of Speech Processing, pp. 559-584.
• Mangu et al. (2000). Finding consensus in speech recognition: word error minimization and other applications of confusion networks. Computer Speech & Language, 14(4), 373-400.
![Page 77: Automatic speech recognition](https://reader031.fdocuments.in/reader031/viewer/2022012915/61c629451978be230f10923a/html5/thumbnails/77.jpg)
Beam search decoding
►Need: find
Ŵ = argmax_W P(Y|W) × P(W)^α × β^|W|
(α: LM scale factor, β: word insertion penalty)
while not exploring the whole search space
►Solution: beam search
– Frame-synchronous (start at t = 0)
– Idea: parallel explorations limited to a maximum of K active states
– Advantage: memory- and time-efficient thanks to pruning of low-interest partial hypotheses
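The idea of frame-synchronous pruning can be sketched on a toy problem (the scorer below is hypothetical; in ASR the per-frame scores would combine acoustic and, at word ends, linguistic contributions): at each frame, every active hypothesis is extended, then only the K best survive.

```python
import math

def beam_search(n_frames, step_scores, K=3):
    """Frame-synchronous beam search: at each frame, extend every active
    hypothesis, then keep only the K best (pruning)."""
    beams = [((), 0.0)]                       # (partial hypothesis, log score)
    for t in range(n_frames):
        candidates = []
        for hyp, score in beams:
            for sym, logp in step_scores(t, hyp).items():
                candidates.append((hyp + (sym,), score + logp))
        candidates.sort(key=lambda x: x[1], reverse=True)
        beams = candidates[:K]                # prune low-interest hypotheses
    return beams

# Toy scorer: symbol "a" is slightly more likely except at frame 1
def step_scores(t, hyp):
    return {"a": math.log(0.4 if t == 1 else 0.6),
            "b": math.log(0.6 if t == 1 else 0.4)}

best_hyp, best_score = beam_search(3, step_scores, K=2)[0]   # → ("a", "b", "a")
```

With K=2 the search never holds more than two hypotheses, yet it still finds the globally best sequence here; with harsher pruning that guarantee is lost, which is the usual speed/accuracy trade-off.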
![Page 78: Automatic speech recognition](https://reader031.fdocuments.in/reader031/viewer/2022012915/61c629451978be230f10923a/html5/thumbnails/78.jpg)
Beam search decoding
►Until end of word
– Aggregate acoustic scores
►At end of word
– Add LM score and insertion penalty
►At each step
– Check the number of active states → pruning (a, b, c)
[Figure: frame-synchronous search over frames yt (t = 0 to 6) through a lexical tree; acoustic scores accumulate within words, and each end-of-word hypothesis (a, b, c) adds a linguistic score plus an insertion penalty (e.g. P(wa), P(wb|wa), P(wb), P(wc), P(wd)) before entering the lexical tree of a new word history]
![Page 79: Automatic speech recognition](https://reader031.fdocuments.in/reader031/viewer/2022012915/61c629451978be230f10923a/html5/thumbnails/79.jpg)
WFST-based decoding
►Weighted Finite State Transducer (WFST)
– Finite state automaton
– Weighted edges
– Output symbols (in addition to input symbols)
►String conversion
[Figure: example WFST with transitions A:a/0.5, A:a/0.6, A:b/0.4, B:b/0.4, B:b/0.5 and a final state]
Input sequence: BBB
→ Output sequence: bbb (probability: 0.4 × 0.4 × 0.5 = 0.08)
Input sequence: AA
→ Best output sequence: aa (probability: 0.5 × 0.6 = 0.3)
→ 2nd best output sequence: ab (probability: 0.5 × 0.4 = 0.2)
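These probabilities can be reproduced with a tiny WFST, assuming a topology consistent with the example (the original figure's exact layout is an assumption here): state 0 has a B:b/0.4 self-loop and arcs A:a/0.5 to state 1 and B:b/0.5 to the final state 2; state 1 has arcs A:a/0.6 and A:b/0.4 to state 2.

```python
# Arcs: (source state, input symbol) -> list of (next state, output symbol, weight)
ARCS = {
    (0, "A"): [(1, "a", 0.5)],
    (0, "B"): [(0, "b", 0.4), (2, "b", 0.5)],
    (1, "A"): [(2, "a", 0.6), (2, "b", 0.4)],
}
FINAL = {2}

def transduce(inputs, state=0, out="", prob=1.0):
    """Enumerate (output string, probability) for all complete paths,
    best first. A complete path consumes all input and ends in a final state."""
    if not inputs:
        return [(out, prob)] if state in FINAL else []
    results = []
    for nxt, sym, w in ARCS.get((state, inputs[0]), []):
        results.extend(transduce(inputs[1:], nxt, out + sym, prob * w))
    return sorted(results, key=lambda r: r[1], reverse=True)

ranked = transduce("AA")      # → [("aa", 0.3), ("ab", 0.2)]
```

Real decoders do not enumerate paths like this; they run shortest-path algorithms over the composed machine, but the transduction semantics is the same.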
![Page 80: Automatic speech recognition](https://reader031.fdocuments.in/reader031/viewer/2022012915/61c629451978be230f10923a/html5/thumbnails/80.jpg)
WFST-based decoding
►All models can be written as WFSTs
►WFST composition
– H ∘ C ∘ L ∘ G (+ determinization + minimization) maps HMM states to words
(H: HMM topology, C: phone context-dependency, L: lexicon, G: grammar/language model)
►Fast but requires memory
►Kaldi toolkit
![Page 81: Automatic speech recognition](https://reader031.fdocuments.in/reader031/viewer/2022012915/61c629451978be230f10923a/html5/thumbnails/81.jpg)
Alternative hypotheses
►Word lattice
►Confusion network
(Source: Gales and Young, 2008)
►Confidence measures
►Rescoring
![Page 82: Automatic speech recognition](https://reader031.fdocuments.in/reader031/viewer/2022012915/61c629451978be230f10923a/html5/thumbnails/82.jpg)
Multi-pass architecture
[Diagram: multi-pass architecture]
– 1st pass: decoding of the audio signal (monophone AM, simple LM, Viterbi) → best hypothesis
– 2nd pass: rescoring (triphone AM, more complex LM) → N-best hypotheses
– 3rd pass: adaptation and rescoring (adapted AM, adapted LM)
– 4th step: rescoring and reranking (other models) → N-best hypotheses
– 5th step: consensus decoding → final transcript
![Page 83: Automatic speech recognition](https://reader031.fdocuments.in/reader031/viewer/2022012915/61c629451978be230f10923a/html5/thumbnails/83.jpg)
End-to-end approach
Reading:
• Lu et al. (2015). A study of the recurrent neural network encoder-decoder for large vocabulary speech recognition. In Proc. Interspeech, pp. 3249-3253.
![Page 84: Automatic speech recognition](https://reader031.fdocuments.in/reader031/viewer/2022012915/61c629451978be230f10923a/html5/thumbnails/84.jpg)
End-to-end approach
►Encode Y (yt) as a sequence of contexts C (ct)
►Decode C into a sequence of words W
►Beam-search decoding
[Diagram: encoder maps acoustic frames y0 … yt to contexts c0 … co; decoder maps the contexts to words w0 … wo]
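A bare-bones numerical sketch of this encoder-decoder pipeline (toy dimensions, random untrained weights, greedy output instead of beam search; real systems also use attention over all contexts rather than just the last one):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, V = 4, 5, 6                  # feature dim, hidden dim, vocab size

# Encoder weights: turn acoustic frames y_t into context vectors c_t
W_e = rng.normal(size=(d_h, d_in)); U_e = rng.normal(size=(d_h, d_h))
# Decoder weights: turn contexts into a word sequence
U_d = rng.normal(size=(d_h, d_h)); W_o = rng.normal(size=(V, d_h))

def encode(frames):
    h = np.zeros(d_h); contexts = []
    for y in frames:
        h = np.tanh(W_e @ y + U_e @ h)  # recurrent encoding of the signal
        contexts.append(h)
    return contexts

def decode(contexts, n_words):
    s = contexts[-1]                    # initialise from the final context
    words = []
    for _ in range(n_words):
        s = np.tanh(U_d @ s)            # decoder hidden state s_o
        logits = W_o @ s
        words.append(int(np.argmax(logits)))   # greedy; beam search in practice
    return words

frames = rng.normal(size=(10, d_in))    # 10 synthetic acoustic frames
words = decode(encode(frames), n_words=3)
```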
![Page 85: Automatic speech recognition](https://reader031.fdocuments.in/reader031/viewer/2022012915/61c629451978be230f10923a/html5/thumbnails/85.jpg)
RNN encoder
►co, context at step o
►yt, input at step t
►ht, (encoder’s) hidden layer at step t
(source : Renals, 2017)
![Page 86: Automatic speech recognition](https://reader031.fdocuments.in/reader031/viewer/2022012915/61c629451978be230f10923a/html5/thumbnails/86.jpg)
RNN decoder
►wo, output at step o
►co, context at step o
►so, (decoder’s) hidden layer at step o
(source : Renals, 2017)
![Page 87: Automatic speech recognition](https://reader031.fdocuments.in/reader031/viewer/2022012915/61c629451978be230f10923a/html5/thumbnails/87.jpg)
Related tasks
![Page 88: Automatic speech recognition](https://reader031.fdocuments.in/reader031/viewer/2022012915/61c629451978be230f10923a/html5/thumbnails/88.jpg)
Adaptation
►Speaker adaptation
– GMM, DNN parameter changes
►Language model
– (A) Grab task-related texts (e.g. from the web)
– (B) Spot discriminating words and phrases
– Increase their probabilities
►Vocabulary
– (A + B)
– Sub-word units, phonetic transcription
– Enrich the pronunciation dictionary
– Add n-grams / exploit parameters of related words
![Page 89: Automatic speech recognition](https://reader031.fdocuments.in/reader031/viewer/2022012915/61c629451978be230f10923a/html5/thumbnails/89.jpg)
Usage of speech transcripts
►Spoken language understanding/dialogue
►Spying
►Command
►Indexing/information retrieval
►Clustering/summarization
►Multimedia hyperlinking
![Page 90: Automatic speech recognition](https://reader031.fdocuments.in/reader031/viewer/2022012915/61c629451978be230f10923a/html5/thumbnails/90.jpg)
Conclusion
![Page 91: Automatic speech recognition](https://reader031.fdocuments.in/reader031/viewer/2022012915/61c629451978be230f10923a/html5/thumbnails/91.jpg)
Keypoints
►Statistical approach
– Acoustic model, language model
►End-to-end approach
– 1 big neural network
►Current trends
– LSTMs
– Removal of expert (acoustic, linguistic) knowledge
►Remaining challenges
– Adaptation
– Noisy environments
►Performance
– Human performance still not matched
– Many possible applications nonetheless