Deep learning for end-to-end speech recognition
Transcript of slides: ttic.uchicago.edu/~llu/pdf/liang_ttic_slides.pdf
Deep learning for end-to-end speech recognition
Liang Lu
Centre for Speech Technology Research, The University of Edinburgh
11 March 2016
2 of 41
Golden age for speech technology?
Speech technology is around us
3 of 41
Golden age for speech technology?
Driven by data
[Bar chart: training data (hours) for TIMIT, WSJ0, AMI, SWB, Google, and Baidu; the labelled bars read 3 hours, 15 hours, 70 hours, 300 hours, and 2000 hours, with the y-axis running to 12,000 hours]
4 of 41
Golden age for speech technology?
Driven by data
[Bar chart, as on the previous slide: training data (hours) for TIMIT, WSJ0, AMI, SWB, Google, and Baidu, annotated with ≈ 5 × 10^9 (5 billion)]
5 of 41
Golden age for speech technology?
Driven by deep learning
[Diagrams: feed-forward neural network, convolutional neural network, and recurrent neural network (unfolded in time)]
6 of 41
Golden age for speech technology?
Driven by deep learning
2009: DBN-DNN (G. Hinton, et al), TIMIT
2010–2011: CD-DNN-HMM (Microsoft & Toronto), SWB
2012: Sequence training, Hessian-free (IBM, Google, Academics)
2013: Panic period, deep everything; mainstream RNNs, CNNs, Maxout, Dropout, ReLU, ...
2014: LSTM-HMM (Alex Graves, and Google, etc); Kaldi, Theano, Torch
2015: CTC, learning from waves, complex networks (CLDNN)
?
7 of 41
But, what is next?
• Open challenges in speech recognition
  - Efficient adaptation to speakers, environment, etc.
  - Distant speech recognition, from close-talk microphone to distant microphone(s)
  - Small footprint models, reduce the model size for mobile devices
  - Open-vocabulary speech recognition
  - Low-resource languages
  - ...
• In this talk, I would like to revisit the fundamental architecture for speech recognition
8 of 41
Speech recognition problem
• Speech recognition is a typical sequence-to-sequence transduction problem
• Given y = y_1, ..., y_J, y ∈ Y, and X = x_1, ..., x_T, compute P(y | X)
• However, it is difficult
  - T ≫ J, and T can be large (> 1000)
  - Large size of vocabulary, |Y| ≈ 60K
  - Uncertainty and variability in features
  - Context-dependency problem
  - ...
[Diagram: a waveform with channel distortion + noise, a bit of signal processing, then the sequence of features x_1, x_2, ..., x_T and the sequence of labels y_1, y_2, ..., y_J]
9 of 41
Hidden Markov Models
10 of 41
Hidden Markov Models
• Why does the hidden Markov model work for speech recognition?
• It converts the sequence-level classification problem into a frame-level problem

$$
\begin{aligned}
P(y \mid X) &\propto p(X \mid y)\, P(y) \\
&\approx p(X_{1:T} \mid Q_{1:T})\, P(y) \\
&\approx P(y) \prod_t p(x_t \mid q_t)\, p(q_t \mid q_{t-1})
\end{aligned}
$$
[Diagram: HMM graphical model with hidden states q_{t-1}, q_t, q_{t+1} emitting observations x_{t-1}, x_t, x_{t+1}]
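To make the factorisation concrete, here is a minimal numpy sketch of the forward recursion that evaluates p(x_1:T) under exactly these assumptions. It is my own illustration rather than anything from the slides: the emission and transition scores are random placeholders, whereas in a hybrid system p(x_t | q_t) would come from a GMM or a neural network.

```python
import numpy as np

def logsumexp(a, axis):
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def hmm_log_likelihood(log_emit, log_trans, log_init):
    """log p(x_1:T) via the forward recursion over the hidden states q_t.

    log_emit : (T, S) log p(x_t | q_t = s)
    log_trans: (S, S) log p(q_t = j | q_{t-1} = i)
    log_init : (S,)   log p(q_1 = s)
    """
    T, S = log_emit.shape
    alpha = log_init + log_emit[0]                       # log alpha_1(s)
    for t in range(1, T):
        # sum over the previous state, then add the frame-level emission score
        alpha = log_emit[t] + logsumexp(alpha[:, None] + log_trans, axis=0)
    return logsumexp(alpha, axis=0)

# Toy example: a 3-state left-to-right phone HMM and 5 frames of fake scores.
rng = np.random.default_rng(0)
log_emit = np.log(rng.dirichlet(np.ones(3), size=5))     # placeholder emission scores
log_trans = np.log(np.array([[0.6, 0.4, 1e-10],
                             [1e-10, 0.7, 0.3],
                             [1e-10, 1e-10, 1.0]]))
log_init = np.log(np.array([1.0, 1e-10, 1e-10]))
print(hmm_log_likelihood(log_emit, log_trans, log_init))
```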
11 of 41
Hidden Markov Models
• Problems of HMMs:
  - Loss function: minimise the word error L(y, ŷ) versus maximise the likelihood p(X_{1:T} | Q_{1:T})
  - Conditional independence assumption
  - Weak sequence model – first order Markov rule
  - System complexity: monophone → alignment → triphone → alignment → neural net → alignment → neural net
[Diagram: the same HMM graphical model, with hidden states q_{t-1}, q_t, q_{t+1} and observations x_{t-1}, x_t, x_{t+1}]
12 of 41
End-to-end speech recognition
• Can we train a model that directly computes P(y | X)?
• CTC – Connectionist Temporal Classification
• Attention-based recurrent neural network (RNN) encoder-decoder
• Segmental recurrent neural networks
13 of 41
End-to-end speech recognition
• CTC – Connectionist Temporal Classification
  - Trick: y_1, ..., y_J → y_1, ..., y_T → x_1, ..., x_T
  - Replicate the labels (a b c → a a b b b c) with a blank symbol
  - Approximate the conditional probability

$$P(y \mid X) = \prod_{t=1}^{T} P(y_t \mid x_t) \quad (1)$$
[1] A. Graves, et al, "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks", ICML 2006
[2] A. Graves and N. Jaitly, "Towards end-to-end speech recognition with recurrent neural networks", ICML 2014
[3] A. Hannun, et al, "Deep Speech: Scaling up end-to-end speech recognition", arXiv 2014
[4] H. Sak, et al, "Fast and Accurate Recurrent Neural Network Acoustic Models for Speech Recognition", INTERSPEECH 2015
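For a concrete sense of how such a model is trained in practice, here is a hedged PyTorch sketch built around the library's nn.CTCLoss, which implements the CTC objective from [1]. The network sizes, feature dimension, and label inventory below are placeholders of my own, not the configurations used in the papers above.

```python
import torch
import torch.nn as nn

num_feats, num_labels = 40, 49            # e.g. FBANK dims; 48 phones + 1 blank
rnn = nn.LSTM(num_feats, 128, num_layers=2, bidirectional=True)
proj = nn.Linear(2 * 128, num_labels)
ctc = nn.CTCLoss(blank=0)                 # label index 0 reserved for the blank symbol

T, N, J = 300, 4, 30                      # frames, batch size, label-sequence length
x = torch.randn(T, N, num_feats)          # dummy acoustic features
y = torch.randint(1, num_labels, (N, J))  # dummy label sequences (no blanks)

h, _ = rnn(x)                             # per-frame encodings
log_probs = proj(h).log_softmax(dim=-1)   # (T, N, num_labels): P(y_t | x_t) in (1)
loss = ctc(log_probs, y,
           input_lengths=torch.full((N,), T, dtype=torch.long),
           target_lengths=torch.full((N,), J, dtype=torch.long))
loss.backward()                           # CTC marginalises over all blank-augmented alignments
```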
14 of 41
End-to-end speech recognition
• Maximum Entropy Markov Model (MEMM)
• Still relies on the independence assumption
[Diagram: frame-level model, where inputs x_1 ... x_4 feed hidden states h_1 ... h_4, each predicting its own label y_1 ... y_4]
15 of 41
End-to-end speech recognition
state-of-the-art
16 of 41
End-to-end speech recognition
• Attention-based RNN encoder-decoder
$$P(y \mid X) \approx \prod_j P(y_j \mid y_1, \cdots, y_{j-1}, c_j) \quad (2)$$

$$h_{1:T} = \mathrm{RNN}(x_{1:T}) \quad (3)$$

$$c_j = \mathrm{Attend}(h_{1:T}) \quad (4)$$

[1] D. Bahdanau, et al, "Neural Machine Translation by Jointly Learning to Align and Translate", ICLR 2015
[2] J. Chorowski, et al, "Attention-Based Models for Speech Recognition", NIPS 2015
[3] L. Lu, et al, "A Study of the Recurrent Neural Network Encoder-Decoder for Large Vocabulary Speech Recognition", INTERSPEECH 2015
[4] W. Chan, et al, "Listen, Attend and Spell", arXiv 2015
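To make equations (2)-(4) concrete, here is a rough PyTorch sketch of one content-based attention variant. It is a simplification with assumed layer sizes, and the Attend() below also conditions on the decoder state, which goes slightly beyond the notation in (4); it is not the exact model evaluated later in the talk.

```python
import torch
import torch.nn as nn

feat_dim, enc_dim, dec_dim, vocab = 40, 128, 128, 49

encoder = nn.LSTM(feat_dim, enc_dim, batch_first=True)     # h_1:T = RNN(x_1:T), eq. (3)
attn_score = nn.Linear(enc_dim + dec_dim, 1)                # scores e_{j,t}
decoder_cell = nn.LSTMCell(enc_dim + vocab, dec_dim)
output_layer = nn.Linear(dec_dim, vocab)

def attend(s_j, h):
    """c_j = Attend(h_1:T), eq. (4): a weighted sum of the encoder states."""
    T = h.size(1)
    e = attn_score(torch.cat([h, s_j.unsqueeze(1).expand(-1, T, -1)], dim=-1))
    a = torch.softmax(e.squeeze(-1), dim=-1)                 # attention weights over t
    return torch.bmm(a.unsqueeze(1), h).squeeze(1)

x = torch.randn(1, 200, feat_dim)                            # one utterance, 200 frames
h, _ = encoder(x)
s = torch.zeros(1, dec_dim)                                  # decoder hidden state
c_state = torch.zeros(1, dec_dim)                            # decoder cell state
y_prev = torch.zeros(1, vocab)                               # one-hot stand-in for y_{j-1}
for _ in range(5):                                           # a few decoding steps
    c_j = attend(s, h)
    s, c_state = decoder_cell(torch.cat([c_j, y_prev], dim=-1), (s, c_state))
    p_yj = torch.softmax(output_layer(s), dim=-1)             # P(y_j | y_<j, c_j), eq. (2)
    y_prev = torch.nn.functional.one_hot(p_yj.argmax(-1), vocab).float()
```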
17 of 41
End-to-end speech recognition
• Attention-based RNN encoder-decoder
[Diagram: attention over inputs x_1 ... x_5 while the decoder emits y_1]
18 of 41
End-to-end speech recognition
• Attention-based RNN encoder-decoder
[Diagram: attention over inputs x_1 ... x_5 while the decoder emits y_1, y_2]
19 of 41
End-to-end speech recognition
• Attention-based RNN encoder-decoder
[Diagram: attention over inputs x_1 ... x_5 while the decoder emits y_1, y_2, y_3]
20 of 41
End-to-end speech recognition
• Attention-based RNN encoder-decoder
[Diagram: attention over inputs x_1 ... x_5 while the decoder emits y_1, ..., y_4]
21 of 41
End-to-end speech recognition
• Attention-based RNN encoder-decoder
[Diagram: attention over inputs x_1 ... x_5 while the decoder emits y_1, ..., y_5]
22 of 41
End-to-end speech recognition
• Attention-based RNN encoder-decoder
[Diagram: the full model over x_1 ... x_5 and y_1 ... y_5, with the Encoder computing h_1:T = RNN(x_1:T), the Attention module computing c_j = Attend(h_1:T), and the Decoder computing P(y_j | y_1, ..., y_{j-1}, c_j)]
23 of 41
End-to-end speech recognition
• Attention-based RNN encoder-decoder
  - A flexible sequence-to-sequence transducer
  - "Revolutionising" machine translation
  - Popularising the attention-based scheme
  - But it may not be a natural model for speech recognition, why?
24 of 41
End-to-end speech recognition
• Segmental recurrent neural network – segmental CRF + RNN
[Diagram: CRF, segmental CRF, and segmental RNN, with labels y_1, y_2, y_3 attached to segments of the inputs x_1 ... x_6]
[1] L. Kong, et al, "Segmental Recurrent Neural Networks", ICLR 2016
[2] L. Lu, L. Kong, et al, "Segmental Recurrent Neural Networks for End-to-end Speech Recognition", submitted to INTERSPEECH 2016
[3] Many many more on (segmental) CRFs
25 of 41
Segmental recurrent neural network
• CRF [Lafferty et al. 2001]
$$P(y \mid X) = \frac{1}{Z(X)} \prod_j \exp\left(\mathbf{w}^\top \Phi(y_j, X)\right) \quad (5)$$

where the lengths of y and X should be equal.

• Segmental (semi-Markov) CRF [Sarawagi and Cohen 2004]

$$P(y, E \mid X) = \frac{1}{Z(X)} \prod_j \exp\left(\mathbf{w}^\top \Phi(y_j, e_j, X)\right) \quad (6)$$

where e_j = ⟨s_j, n_j⟩ denotes the beginning (s_j) and end (n_j) time tags of y_j; E = e_1, ..., e_J is the latent segment label.
26 of 41
Segmental recurrent neural network
• Segmental recurrent neural network – using neural networks to learn the feature function Φ(·).

[Diagram: segmental RNN, with labels y_1, y_2, y_3 attached to segments of the inputs x_1 ... x_6]
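As a hedged sketch of this idea: encode the frames inside a candidate segment with an RNN and score the segment together with a label embedding. The layer sizes and the way the segment and label embeddings are combined are my assumptions, not the architecture from the paper.

```python
import torch
import torch.nn as nn

feat_dim, hidden, num_labels = 40, 128, 48
frame_rnn = nn.LSTM(feat_dim, hidden, batch_first=True)      # runs over the frames of one segment
label_emb = nn.Embedding(num_labels, hidden)
scorer = nn.Linear(2 * hidden, 1)

def segment_score(x, s, n, y):
    """A learned stand-in for w^T Phi(y_j, e_j, X): score of labelling
    frames x[:, s:n+1, :] (segment <s_j, n_j>) with label y."""
    _, (h, _) = frame_rnn(x[:, s:n + 1, :])                   # summarise the segment
    z = torch.cat([h[-1], label_emb(torch.tensor([y]))], dim=-1)
    return scorer(z).squeeze()

x = torch.randn(1, 60, feat_dim)                              # one utterance, 60 frames
print(segment_score(x, s=10, n=14, y=7))                      # one candidate segment's score
```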
27 of 41
Segmental recurrent neural network
• Training criteria
  - Conditional maximum likelihood

$$\mathcal{L}(\theta) = \log P(y \mid X) = \log \sum_{E} P(y, E \mid X) \quad (7)$$

  - Max-margin – maximising the distance between the ground truth and negative labels

$$\mathcal{L}(\theta) = \sum_{\hat{y} \in \Omega} \underbrace{D_\theta(y, \hat{y})}_{\text{model distance}} \quad (8)$$

H. Tang, et al, "A comparison of training approaches for discriminative segmental models", INTERSPEECH 2014
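As a small illustrative follow-up to (7), the sketch below computes the conditional log-likelihood for a zeroth-order segmental model with random placeholder segment scores: a label-constrained forward pass (summing over all segmentations E of the reference labels) minus the unconstrained partition function Z(X). The zeroth-order assumption and the maximum segment duration are mine, not details from the slides.

```python
import numpy as np

def logsumexp(v):
    m = max(v)
    return m + np.log(sum(np.exp(x - m) for x in v)) if np.isfinite(m) else m

def constrained_forward(seg_score, labels, max_dur):
    """log sum_E exp(score(y, E, X)) for the fixed label sequence y."""
    T = seg_score.shape[0]
    J = len(labels)
    a = np.full((J + 1, T + 1), -np.inf)
    a[0, 0] = 0.0
    for j in range(1, J + 1):
        for t in range(j, T + 1):
            a[j, t] = logsumexp([a[j - 1, t - d] + seg_score[t - 1, d - 1, labels[j - 1]]
                                 for d in range(1, min(max_dur, t) + 1)])
    return a[J, T]

def log_partition(seg_score, max_dur):
    """log Z(X): sum over all label sequences and segmentations."""
    T, _, L = seg_score.shape
    a = np.full(T + 1, -np.inf)
    a[0] = 0.0
    for t in range(1, T + 1):
        a[t] = logsumexp([a[t - d] + seg_score[t - 1, d - 1, l]
                          for d in range(1, min(max_dur, t) + 1)
                          for l in range(L)])
    return a[T]

# seg_score[t, d-1, l] stands in for w^T Phi(l, <t-d+1, t>, X); here it is random.
T, max_dur, L = 20, 5, 4
seg_score = np.random.randn(T, max_dur, L)
labels = [2, 0, 3, 1]
log_lik = constrained_forward(seg_score, labels, max_dur) - log_partition(seg_score, max_dur)
print("log P(y | X) =", log_lik)          # the quantity maximised in (7)
```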
28 of 41
Segmental recurrent neural network
• Viterbi decoding
  - Partially Viterbi decoding

$$y^* = \arg\max_{y} \log \sum_{E} P(y, E \mid X) \quad (9)$$

  - Fully Viterbi decoding

$$y^*, E^* = \arg\max_{y, E} \log P(y, E \mid X) \quad (10)$$

More details: L. Lu, L. Kong, et al, "Segmental Recurrent Neural Networks for End-to-end Speech Recognition", arXiv 2016.
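For illustration, here is a small numpy sketch of the fully Viterbi decoding in (10) for a zeroth-order segmental model. The random segment scores and the maximum-duration cap are placeholders of my own, standing in for the scores a segmental RNN would produce.

```python
import numpy as np

def viterbi_decode(seg_score, max_dur):
    """Jointly maximise over labels and segmentation, eq. (10)."""
    T, _, L = seg_score.shape
    best = np.full(T + 1, -np.inf)
    best[0] = 0.0
    back = [None] * (T + 1)                                   # (previous frame, label)
    for t in range(1, T + 1):
        for d in range(1, min(max_dur, t) + 1):               # segments ending at frame t
            scores = best[t - d] + seg_score[t - 1, d - 1, :]
            y = int(scores.argmax())
            if scores[y] > best[t]:
                best[t], back[t] = scores[y], (t - d, y)
    segments, t = [], T                                        # trace back y* and E*
    while t > 0:
        prev_t, y = back[t]
        segments.append((prev_t + 1, t, y))                    # (s_j, n_j, y_j), frames 1-indexed
        t = prev_t
    return best[T], segments[::-1]

T, max_dur, L = 20, 5, 4
seg_score = np.random.randn(T, max_dur, L)                     # placeholder segment scores
score, segments = viterbi_decode(seg_score, max_dur)
print(score, segments)
```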
29 of 41
Experiment 1
• TIMIT dataset
  - 3696 training utterances (∼ 3 hours)
  - core test set (192 testing utterances)
  - trained on 48 phonemes, and mapped to 39 for scoring
  - log filterbank features (FBANK)
  - using LSTM as an implementation of RNN
30 of 41
Experiment 1
• Speed up training
[Diagram: hierarchical subsampling over x_1, x_2, x_3, x_4, ..., with two variants: a) concatenate / add neighbouring hidden states, b) skip frames between layers]
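As a rough sketch of the two subsampling variants, assuming a factor of 2 and arbitrary layer sizes rather than the exact configuration used in these experiments:

```python
import torch
import torch.nn as nn

def subsample_skip(h, rate=2):
    """b) skip: keep every `rate`-th frame between recurrent layers."""
    return h[:, ::rate, :]

def subsample_concat(h, rate=2):
    """a) concatenate: stack `rate` neighbouring frames into one wider frame."""
    B, T, D = h.shape
    T = (T // rate) * rate                                     # drop a trailing odd frame
    return h[:, :T, :].reshape(B, T // rate, rate * D)

lstm1 = nn.LSTM(40, 128, batch_first=True)
lstm2 = nn.LSTM(2 * 128, 128, batch_first=True)                # input width doubled by concatenation

x = torch.randn(4, 301, 40)                                    # a batch of 4 utterances
h1, _ = lstm1(x)
h2, _ = lstm2(subsample_concat(h1))                            # 301 frames -> 150 frames
print(h1.shape, h2.shape)
```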
31 of 41
Experiment 1
Table: Results of hierarchical subsampling networks.
| System | LSTM layers | hidden | PER (%) |
|--------|-------------|--------|---------|
| skip   | 3           | 128    | 21.2    |
| conc   | 3           | 128    | 21.3    |
| add    | 3           | 128    | 23.2    |
| skip   | 3           | 250    | 20.1    |
| conc   | 3           | 250    | 20.5    |
| add    | 3           | 250    | 21.5    |
32 of 41
Experiment 1
Table: Results of tuning the hyperparameters.
| Dropout | layers | hidden | PER  |
|---------|--------|--------|------|
| 0.2     | 3      | 128    | 21.2 |
| 0.2     | 3      | 250    | 20.1 |
| 0.2     | 6      | 250    | 19.3 |
| 0.1     | 3      | 128    | 21.3 |
| 0.1     | 3      | 250    | 20.9 |
| 0.1     | 6      | 250    | 20.4 |
| ×       | 6      | 250    | 21.9 |
33 of 41
Experiment 1
Table: Results of three types of acoustic features.
| Features     | Deltas | d(x_t) | PER  |
|--------------|--------|--------|------|
| 24-dim FBANK | √      | 72     | 19.3 |
| 40-dim FBANK | √      | 120    | 18.9 |
| Kaldi        | ×      | 40     | 17.3 |

Kaldi features – 39-dimensional MFCCs spliced by a context window of 7, followed by LDA and MLLT transforms, and with feature-space speaker-dependent MLLR
34 of 41
Experiment 1
Table: Comparison to related works.
| System                                           | LM | SD | PER  |
|--------------------------------------------------|----|----|------|
| HMM-DNN                                          | √  | √  | 18.5 |
| first-pass SCRF [Zweig 2012]                     | √  | ×  | 33.1 |
| Boundary-factored SCRF [He 2012]                 | ×  | ×  | 26.5 |
| Deep Segmental NN [Abdel 2013]                   | √  | ×  | 21.9 |
| Discriminative segmental cascade [Tang 2015]     | √  | ×  | 21.7 |
| + 2nd pass with various features                 | √  | ×  | 19.9 |
| CTC [Graves 2013]                                | ×  | ×  | 18.4 |
| RNN transducer [Graves 2013]                     | -  | ×  | 17.7 |
| Attention-based RNN baseline [Chorowski 2015]    | -  | ×  | 17.6 |
| Segmental RNN                                    | ×  | ×  | 18.9 |
| Segmental RNN                                    | ×  | √  | 17.3 |
35 of 41
Experiment 2
• Switchboard dataset (∼ 300 hours ≈ 100 million frames)
• Attention-based RNN systems (EncDec)
• No language model in baseline systems
Table: Attention-Based RNN vs. CTC and DNN-HMM hybrid systems.
| System                       | Output | Avg  |
|------------------------------|--------|------|
| HMM-DNN sMBR [Vesely 2013]   | -      | 18.4 |
| CTC no LM [Maas 2015]        | char   | 47.1 |
| + 7-gram                     | char   | 35.9 |
| + RNNLM (3 hidden layers)    | char   | 30.8 |
| Deep Speech [Hannun 2014]    | char   | 25.9 |
| EncDec no LM                 | word   | 36.4 |
| EncDec no LM                 | char   | 37.8 |
36 of 41
Experiment 2
• Long memory decoder
[Diagrams: a) baseline decoder and b) LongMem decoder, each predicting y_j from c_j and y_{j-1}]
37 of 41
Experiment 2
Table: Results of language model rescoring and using long memory decoder.
| System              | Output | Avg  |
|---------------------|--------|------|
| EncDec no LM        | word   | 37.6 |
| + LongMem           | word   | 36.4 |
| + 3-gram rescoring  | word   | 36.0 |
| EncDec no LM        | char   | 42.8 |
| + LongMem           | char   | 41.3 |
| + 5-gram rescoring  | char   | 40.5 |
L. Lu, et al, "On Training the Recurrent Neural Network Encoder-Decoder for Large Vocabulary End-to-End Speech Recognition", ICASSP 2016.
L. Lu, et al, "A Study of the Recurrent Neural Network Encoder-Decoder for Large Vocabulary Speech Recognition", INTERSPEECH 2015.
38 of 41
Summary
• End-to-end speech recognition is a new and exciting research area
• Three new models have been discussed:
  - Connectionist Temporal Classification (CTC)
  - Attention-based recurrent neural network
  - Segmental recurrent neural network
39 of 41
Acknowledgement
• Joint work with:
  - Xingxing Zhang (Ph.D. student at Edinburgh)
  - Lingpeng Kong (Ph.D. student at CMU)
• Funded by the NST project
40 of 41
Thank you ! Questions?
41 of 41