Transcript of: llcao.net/cu-deeplearning17/lecture/lecture6_Markus.pdf

End-to-end Automatic Speech Recognition

Markus Nussbaum-Thom

IBM Thomas J. Watson Research Center, Yorktown Heights, NY 10598, USA

February 22, 2017

Nussbaum-Thom: IBM Thomas J. Watson Research Center 1 February 22, 2017


Contents

1. Introduction

2. Connectionist Temporal Classification (CTC)

3. Attention Model

4. References


Terminology

- Features: x, x_t, x_1^T := x_1, . . . , x_T.

- Words: w, u, v, w_m, w_1^M := w_1, . . . , w_M.

- Word sequences: W, W_n, V.

- States: s, s_t, s_1^T := s_1, . . . , s_T.

- Class-conditional posterior probabilities: p(s_t | x_t), p(W, s_1^T | x_1^T).


Bayes’ Decision Rule


Towards End-to-End Automatic Speech Recognition


Towards End-to-End Automatic Speech Recognition


End-to-End Automatic Speech Recognition


End-to-End Approach

- End-to-end: training all modules to optimize a global performance criterion (LeCun et al., 1998).

- Easy: the classes have no sub-structure, e.g. image classification.

- Difficult: the classes have a sub-structure (sequences, graphs), e.g. automatic speech recognition, automatic handwriting recognition, machine translation.

- Segmentation problem: which part of the input relates to which part of the sub-structure?


Towards End-to-end Automatic Speech Recognition

- End-to-end acoustic model:
  - Using characters instead of phonemes.
  - Connectionist temporal classification using recurrent or convolutional neural networks.
  - Purely neural attention models.

- End-to-end feature extraction:
  - Feature extraction integrated into the acoustic model.
  - Using the raw time signal.
  - Learning a specific type of filter.

- Towards real end-to-end modeling:
  - Using words as targets instead of characters or phonemes, together with a massive amount of data.


Basic Problem

- Input: X = x_1^T = (x_1, . . . , x_T).

- Neural network outputs: p(· | x_1), . . . , p(· | x_T).

- Target: W = w_1^M = (w_1, . . . , w_M), but M ≪ T.

- How do we solve this?
  - Connectionist Temporal Classification (CTC) [Graves et al., 2006; Graves et al., 2009].
  - Attention models [Bahdanau et al., 2016; Chorowski et al., 2015].
  - Inverted hidden Markov models [Doetsch et al., 2016, Inverted HMM - a Proof of Concept].


Overview CTC

- Concept.

- Training.

- Recognition.


Connectionist Temporal Classification (CTC)

- Given X = (x_1, . . . , x_5) and W = (a, b, c).

- Introduce a blank symbol ∅ and allow label repetitions. Possible alignments s_1^5 over x_1, . . . , x_5 include:

  (a, b, c, ∅, ∅), (a, ∅, b, c, ∅), (a, a, b, c, ∅), . . . , (∅, ∅, a, b, c)

- Blank and repetition removal B: B(a, ∅, b, c, ∅) = (a, b, c).
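The mapping B can be sketched directly in Python (a toy illustration; `-` stands in for the blank symbol ∅, and `inverse_image` enumerates B^{-1}(W) by brute force):

```python
from itertools import product

BLANK = "-"

def collapse(path):
    """B: merge repeated symbols, then drop blanks."""
    out = []
    for s in path:
        if not out or s != out[-1]:
            out.append(s)
    return tuple(s for s in out if s != BLANK)

def inverse_image(word, T, alphabet):
    """B^{-1}(word): all length-T paths that collapse to word (brute force)."""
    return [p for p in product(alphabet + (BLANK,), repeat=T)
            if collapse(p) == word]

print(collapse(("a", BLANK, "b", "c", BLANK)))   # ('a', 'b', 'c')
paths = inverse_image(("a", "b", "c"), 5, ("a", "b", "c"))
print(len(paths))                                # 28 alignments
```

Note that repeats are merged before blanks are dropped, so a blank is needed between two occurrences of the same label.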


Connectionist Temporal Classification (CTC)

- Posterior for a sentence W = w_1^M and features X = x_1^T:

  p(W | X) = Σ_{s_1^T ∈ B^{-1}(W)} p(s_1^T | X) := Σ_{s_1^T ∈ B^{-1}(W)} Π_{t=1}^T p(s_t | x_t)

- Training criterion for training samples (X_n, W_n), n = 1, . . . , N:

  F_CTC(Λ) = −(1/N) Σ_{n=1}^N log p_Λ(W_n | X_n)
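On a toy example the posterior and the criterion can be checked by brute-force enumeration (the frame posteriors below are made-up numbers; symbol 0 plays the role of the blank):

```python
import math
from itertools import product

# Made-up frame posteriors p(.|x_t): 4 frames, symbols {0: blank, 1, 2}.
P = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1], [0.3, 0.2, 0.5], [0.5, 0.1, 0.4]]

def collapse(path):
    """B: merge repeated symbols, then drop blanks (blank = 0)."""
    out = []
    for s in path:
        if not out or s != out[-1]:
            out.append(s)
    return tuple(s for s in out if s != 0)

def posterior(W, P):
    """p(W|X) = sum over s_1^T in B^{-1}(W) of prod_t p(s_t|x_t)."""
    T, S = len(P), len(P[0])
    return sum(math.prod(P[t][s] for t, s in enumerate(path))
               for path in product(range(S), repeat=T)
               if collapse(path) == W)

# Every path collapses to exactly one W, so the posteriors sum to 1.
total = sum(posterior(W, P)
            for L in range(len(P) + 1)
            for W in product((1, 2), repeat=L))
# Criterion contribution of a single training sample (X, W = (1, 2)):
loss = -math.log(posterior((1, 2), P))
```

Summing over all possible label sequences gives exactly 1, which confirms that the sum over B^{-1}(W) defines a proper distribution under the model assumption.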


Overview CTC

- Concept.

- Training.

- Recognition.


Forward-Backward Decomposition (CTC)

- α(t, m, v): sum over s_1^t ∈ B^{-1}(w_1^m) for given x_1^t, ending in v.

- β(t, m, v): sum over s_t^T ∈ B^{-1}(w_m^M) for given x_t^T, starting in v.

- Choose t ∈ {1, . . . , T}:

  p(w_1^M | x_1^T) = Σ_{s_1^T ∈ B^{-1}(w_1^M)} p(s_1^T | x_1^T)

                   = . . .

                   = Σ_{m=1}^M Σ_{v ∈ {w_m, ∅}} [α(t, m, v) / p(v | x_t)] · p(v | x_t) · [β(t, m, v) / p(v | x_t)]

Forward Algorithm (CTC)

- Initialization: compute α(1, m, v) = p(v | x_1).

- For t = 2, . . . , 12: compute α(t, m, v) from the values α(t − 1, ·, ·).

(Slides 16-28 of the original deck step through α(1, m, v) up to α(12, m, v) on a trellis figure, one frame at a time.)

Backward Algorithm (CTC)

- Initialization: compute β(12, M, v) = p(v | x_12).

- For t = 11, . . . , 1: compute β(t, m, v) from the values β(t + 1, ·, ·).

(Slides 29-41 of the original deck step through β(12, M, v) down to β(1, m, v) on the same trellis figure, one frame at a time.)

Posterior Decomposition (CTC)

Choose t ∈ {1, . . . , T}:

p(w_1^M | X) = Σ_{s_1^T ∈ B^{-1}(w_1^M)} p(s_1^T | X)   ("definition")

= Σ_{s_1^T ∈ B^{-1}(w_1^M)} Π_{τ=1}^T p(s_τ | x_τ)   ("model assumption")

= Σ_{m=1}^M Σ_{v ∈ {w_m, ∅}} Σ_{s_1^T ∈ B^{-1}(w_1^M), s_t = v} Π_{τ=1}^T p(s_τ | x_τ)   ("decomposition around t")

= Σ_{m=1}^M Σ_{v ∈ {w_m, ∅}} Σ_{s_1^T ∈ B^{-1}(w_1^M), s_t = v} Π_{τ=1}^{t−1} p(s_τ | x_τ) · p(v | x_t) · Π_{ρ=t+1}^T p(s_ρ | x_ρ)


Posterior Decomposition (CTC)

= Σ_{m=1}^M Σ_{v ∈ {w_m, ∅}} Σ_{s_1^T ∈ B^{-1}(w_1^M), s_t = v} Π_{τ=1}^{t−1} p(s_τ | x_τ) · p(v | x_t) · Π_{ρ=t+1}^T p(s_ρ | x_ρ)

= Σ_{m=1}^M Σ_{v ∈ {w_m, ∅}} Σ_{s_1^T ∈ B^{-1}(w_1^M), s_t = v} [Π_{τ=1}^t p(s_τ | x_τ) / p(v | x_t)] · p(v | x_t) · [Π_{ρ=t}^T p(s_ρ | x_ρ) / p(v | x_t)]


Posterior Decomposition (CTC)

= Σ_{m=1}^M Σ_{v ∈ {w_m, ∅}} [Σ_{s_1^t ∈ B^{-1}(w_1^m), s_t = v} Π_{τ=1}^t p(s_τ | x_τ) / p(v | x_t)] · p(v | x_t) · [Σ_{s_t^T ∈ B^{-1}(w_m^M), s_t = v} Π_{ρ=t}^T p(s_ρ | x_ρ) / p(v | x_t)]

= Σ_{m=1}^M Σ_{v ∈ {w_m, ∅}} [α(t, m, v) / p(v | x_t)] · p(v | x_t) · [β(t, m, v) / p(v | x_t)]

- α(t, m, v): sum over s_1^t ∈ B^{-1}(w_1^m) for given x_1^t, ending in v.

- β(t, m, v): sum over s_t^T ∈ B^{-1}(w_m^M) for given x_t^T, starting in v.


Forward Path Decomposition (CTC)

- Consider a path s_1^t ∈ B^{-1}(w_1^m) with s_t = v.

- Case s_t = w_m:

  B(s_1^{t−1})        s_{t−1}    s_t
  w_1 . . . w_?       ?          w_m
  w_1 . . . w_m       w_m        w_m
  w_1 . . . w_{m−1}   ∅          w_m
  w_1 . . . w_{m−1}   w_{m−1}    w_m

- Case s_t = ∅:

  B(s_1^{t−1})        s_{t−1}    s_t
  w_1 . . . w_?       ?          ∅
  w_1 . . . w_m       w_m        ∅
  w_1 . . . w_m       ∅          ∅


Forward Probabilities (CTC)

α(t, m, v) = Σ_{s_1^t ∈ B^{-1}(w_1^m), s_t = v} Π_{τ=1}^t p(s_τ | x_τ)

For v = w_m:

  α(t, m, v) = p(v | x_t) · [ Σ_{u ∈ {w_{m−1}, ∅}} Σ_{s_1^{t−1} ∈ B^{-1}(w_1^{m−1}), s_{t−1} = u} Π_{τ=1}^{t−1} p(s_τ | x_τ)
                              + Σ_{s_1^{t−1} ∈ B^{-1}(w_1^m), s_{t−1} = w_m} Π_{τ=1}^{t−1} p(s_τ | x_τ) ]

For v = ∅:

  α(t, m, v) = p(v | x_t) · Σ_{u ∈ {w_m, ∅}} Σ_{s_1^{t−1} ∈ B^{-1}(w_1^m), s_{t−1} = u} Π_{τ=1}^{t−1} p(s_τ | x_τ)


Forward Probabilities (CTC)

α(t, m, v) = p(v | x_t) · { Σ_{u ∈ {w_{m−1}, ∅}} α(t − 1, m − 1, u) + α(t − 1, m, w_m),   v = w_m
                            Σ_{u ∈ {w_m, ∅}} α(t − 1, m, u),                              v = ∅
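The recursion can be sketched as a dynamic program. The version below works over the equivalent blank-interleaved sequence (∅, w_1, ∅, . . . , w_M, ∅) rather than the (m, v) indexing of the slides, and is checked against brute-force path enumeration (made-up posteriors, blank = 0):

```python
import math
from itertools import product

def collapse(path):
    """B: merge repeated symbols, then drop blanks (blank = 0)."""
    out = []
    for s in path:
        if not out or s != out[-1]:
            out.append(s)
    return tuple(s for s in out if s != 0)

def ctc_forward(W, P):
    """Forward DP over the blank-interleaved sequence
    z = (0, w_1, 0, w_2, ..., 0, w_M, 0); returns p(W | X)."""
    z = [0]
    for w in W:
        z += [w, 0]
    T, Z = len(P), len(z)
    alpha = [[0.0] * Z for _ in range(T)]
    alpha[0][0] = P[0][z[0]]                      # start in the leading blank
    alpha[0][1] = P[0][z[1]]                      # or directly in w_1
    for t in range(1, T):
        for j in range(Z):
            a = alpha[t - 1][j]                   # stay in the same state
            if j >= 1:
                a += alpha[t - 1][j - 1]          # advance by one state
            if j >= 2 and z[j] != 0 and z[j] != z[j - 2]:
                a += alpha[t - 1][j - 2]          # skip a blank between labels
            alpha[t][j] = a * P[t][z[j]]
    return alpha[-1][-1] + alpha[-1][-2]          # end in final blank or w_M

def brute_force(W, P):
    """p(W | X) by enumerating every path s_1^T with B(s_1^T) = W."""
    T, S = len(P), len(P[0])
    return sum(math.prod(P[t][s] for t, s in enumerate(path))
               for path in product(range(S), repeat=T)
               if collapse(path) == W)

# Made-up frame posteriors: 5 frames, symbols {0: blank, 1, 2}.
P = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1], [0.3, 0.2, 0.5],
     [0.5, 0.1, 0.4], [0.4, 0.4, 0.2]]
for W in [(1,), (1, 2), (2, 1, 1)]:
    assert abs(ctc_forward(W, P) - brute_force(W, P)) < 1e-12
```

The dynamic program costs O(T · M) instead of the exponential cost of enumerating B^{-1}(W); the blank-skip condition z_j ≠ z_{j−2} enforces the mandatory blank between repeated labels, as in (2, 1, 1).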


Backward Probabilities (CTC)

- β(t, m, v): sum over all paths s_t^T ∈ B^{-1}(w_m^M) for given x_t^T, starting in a word v.

β(t, m, v) = Σ_{s_t^T ∈ B^{-1}(w_m^M), s_t = v} Π_{τ=t}^T p(s_τ | x_τ)

= { p(v | x_t) · [ Σ_{u ∈ {w_{m+1}, ∅}} β(t + 1, m + 1, u) + β(t + 1, m, w_m) ],   v = w_m
    p(∅ | x_t) · Σ_{u ∈ {w_m, ∅}} β(t + 1, m, u),                                  v = ∅
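A sketch of both passes in the same blank-interleaved formulation as before; the assertions verify the forward-backward decomposition p(W | X) = Σ α · β / p at every frame t (made-up posteriors, blank = 0):

```python
def ctc_alpha_beta(W, P):
    """Forward and backward DP over the blank-interleaved sequence z;
    alpha[t][j] and beta[t][j] both include the factor p(z_j | x_t)."""
    z = [0]
    for w in W:
        z += [w, 0]
    T, Z = len(P), len(z)
    alpha = [[0.0] * Z for _ in range(T)]
    beta = [[0.0] * Z for _ in range(T)]
    alpha[0][0] = P[0][z[0]]                     # start in the leading blank
    alpha[0][1] = P[0][z[1]]                     # or directly in w_1
    for t in range(1, T):
        for j in range(Z):
            a = alpha[t - 1][j]
            if j >= 1:
                a += alpha[t - 1][j - 1]
            if j >= 2 and z[j] != 0 and z[j] != z[j - 2]:
                a += alpha[t - 1][j - 2]
            alpha[t][j] = a * P[t][z[j]]
    beta[T - 1][Z - 1] = P[T - 1][z[Z - 1]]      # end in the final blank
    beta[T - 1][Z - 2] = P[T - 1][z[Z - 2]]      # or in the last label
    for t in range(T - 2, -1, -1):
        for j in range(Z):
            b = beta[t + 1][j]
            if j + 1 < Z:
                b += beta[t + 1][j + 1]
            if j + 2 < Z and z[j + 2] != 0 and z[j + 2] != z[j]:
                b += beta[t + 1][j + 2]
            beta[t][j] = b * P[t][z[j]]
    return z, alpha, beta

# Made-up posteriors: 4 frames, symbols {0: blank, 1, 2}; W = (1, 2).
P = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1], [0.3, 0.2, 0.5], [0.5, 0.1, 0.4]]
z, alpha, beta = ctc_alpha_beta((1, 2), P)
full = alpha[-1][-1] + alpha[-1][-2]             # p(W|X) from the forward pass
for t in range(len(P)):
    cut = sum(alpha[t][j] * beta[t][j] / P[t][z[j]] for j in range(len(z)))
    assert abs(cut - full) < 1e-12               # decomposition holds at every t
```

Since α and β each contain the frame factor p(v | x_t) once, their product double-counts it, which is exactly why the decomposition divides by p(v | x_t).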


Derivatives (CTC)

- Derivative of the posterior:

∇_{p(s|x_t)} P(W | X) = ∇_{p(s|x_t)} Σ_{m=1}^M Σ_{v ∈ {w_m, ∅}} [α(t, m, v) / p(v | x_t)] · p(v | x_t) · [β(t, m, v) / p(v | x_t)]

                      = Σ_{m=1}^M Σ_{v ∈ {w_m, ∅}} δ(v, s) · α(t, m, s) · β(t, m, s) / p²(s | x_t)

∇ log P(W | X) = [1 / P(W | X)] · ∇P(W | X)

- Derivative of the training criterion:

⇒ ∇F_CTC(Λ) = −(1/N) Σ_{n=1}^N ∇ log p(W_n | X_n)


CTC Architectures

- What kind of encoders? DNNs, (bidirectional) LSTMs, CNNs.

- Subsampling: reducing the frame rate through the network.


Subsampling

- Join input frames.

- Reshape the input to the next layer.

- Return every 2nd frame to the next layer.

- CNNs: use strides.
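Both variants can be sketched with numpy (assuming the frames are the rows of a T × d matrix):

```python
import numpy as np

T, d = 8, 3
x = np.arange(T * d, dtype=float).reshape(T, d)  # T frames with d features each

stacked = x.reshape(T // 2, 2 * d)  # join pairs of frames: shape (T/2, 2d)
strided = x[::2]                    # return every 2nd frame: shape (T/2, d)
```

Frame stacking keeps all the information at half the rate (wider frames), while striding simply drops every other frame; CNN strides achieve the latter inside the convolution itself.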


Peaking Behavior

(Figure from [Graves and Jaitly, 2014]: CTC frame-level output posteriors form sharp spikes at the label positions, while the blank symbol dominates everywhere else.)


Overview CTC

- Concept.

- Training.

- Recognition.


Hybrid Recognition (CTC)

- Hybrid model:

  p(x | s) = p(s | x) · p(x) / p(s) ∝ p(s | x) / p(s)

- Decoding:

  ŵ_1^N = argmax_{w_1^N} { p(w_1^N) · max_{s_1^T} Π_{τ=1}^T p(x_τ | s_τ) }
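A minimal numeric illustration of the prior scaling p(s | x) / p(s) with made-up numbers: dividing the network posterior by the state prior can flip a per-frame decision towards rare states.

```python
import numpy as np

# Hypothetical frame posteriors p(s|x) over two states, and state priors p(s).
posterior = np.array([0.6, 0.4])
prior = np.array([0.9, 0.1])

scaled = posterior / prior       # pseudo-likelihood p(s|x) / p(s)

assert posterior.argmax() == 0   # the raw posterior prefers the frequent state
assert scaled.argmax() == 1      # prior scaling favors the rare state
```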


Weighted Finite State Transducer Recognition (WFST)

- Token transducer T.

- Language model G.

- Lexicon L.

- Search space:

  S = T ∘ min(det(L ∘ G))


Resources for CTC

- Keras:
  - Tensorflow backend: https://github.com/fchollet/keras/blob/master/keras/backend/tensorflow_backend.py
  - Theano backend: https://github.com/fchollet/keras/blob/master/keras/backend/theano_backend.py
  - Example: https://github.com/fchollet/keras/blob/master/examples/image_ocr.py

- Baidu:
  - https://github.com/baidu-research/warp-ctc
  - https://github.com/sherjilozair/ctc
  - https://github.com/baidu-research/ba-dls-deepspeech

- Eesen: https://github.com/srvk/eesen

- Kaldi: https://github.com/lingochamp/kaldi-ctc


Further Literature on CTC

[Miao et al., 2015, EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding]

[Collobert et al., 2016, Wav2Letter: an End-to-End ConvNet-based Speech Recognition System]

[Zhang et al., 2017, Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks]

[Senior et al., 2015, Acoustic modelling with CD-CTC-sMBR LSTM RNNs]

[Soltau et al., 2016, Neural Speech Recognizer: Acoustic-to-Word LSTM Model for Large Vocabulary Speech Recognition]


Attention Model

- Encoder-decoder architecture:
  - Encoder: performs feature extraction/encoding based on the input.
  - Decoder: produces the output label sequence from the encoded features.


Attention Encoder-Decoder Architecture

- What kind of encoders? DNNs, (bidirectional) LSTMs, CNNs.

(Figure from [Chorowski et al., 2015]: schematic of the Attention-based Recurrent Sequence Generator. At each time step t, an MLP combines the hidden state s_{t−1} with all input vectors h_l to compute the attention weights α_{tl}; subsequently the new hidden state s_t and the prediction for the output label y_t are computed. Labeled components: encoder, decoder, MLP, glimpse, output.)


Attention Encoder-Decoder Architecture

- Encoder-decoder:
  - Input: x_1^T = x_1, . . . , x_T
  - Encoder: h_1^{T/4} = h_1, . . . , h_{T/4} = Encoder(x_1^T)
  - Decoder, for m = 1, . . . , M:
    - Attention: α_m = Attend(h_1^{T/4}, s_{m−1}, α_{m−1})
    - Glimpse: g_m = Σ_{τ=1}^{T/4} α_{m,τ} · h_τ
    - Generator: y_m = Generator(g_m, s_{m−1}), with c_m = RNN(c_{m−1}, g_m, s_{m−1}) and y_m = Softmax(c_m)
    - Transition: s_m = RNN(s_{m−1}, y_m, g_m)
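The decoder loop above can be sketched with numpy and random weights; the RNN transition and the Softmax generator are replaced by a simple tanh stand-in, so this only illustrates the Attend/glimpse/transition data flow, not the original architecture (all names and sizes are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
Tq, n = 6, 4                      # number of encoded frames (~T/4), hidden size
h = rng.normal(size=(Tq, n))      # encoder outputs h_1^{T/4} (assumed given)
W_, V_ = rng.normal(size=(n, n)), rng.normal(size=(n, n))
E_, b_ = rng.normal(size=n), rng.normal(size=n)

def attend(s_prev):
    """Content-based scores E . tanh(W s_{m-1} + V h_t + b), softmax-normalized."""
    eps = np.array([E_ @ np.tanh(W_ @ s_prev + V_ @ ht + b_) for ht in h])
    a = np.exp(eps - eps.max())
    return a / a.sum()

s = np.zeros(n)                   # decoder state s_0
glimpses = []
for m in range(3):                # three decoder steps
    alpha = attend(s)             # attention weights alpha_m over the frames
    g = alpha @ h                 # glimpse g_m = sum_t alpha_{m,t} h_t
    s = np.tanh(W_ @ s + V_ @ g)  # toy transition standing in for the RNN
    glimpses.append(g)
```

At every step the attention weights form a distribution over the encoded frames, and the glimpse is the corresponding convex combination of the encoder outputs.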


Attention Mechanism

- Content-based (weights E, W, V; bias b):

  ε_{m,t} = E · tanh(W · s_{m−1} + V · h_t + b)

- Location-based (weights E, W, V, U, convolution filter F; bias b):

  f_m = F ∗ α_{m−1}

  ε_{m,t} = E · tanh(W · s_{m−1} + V · h_t + U · f_{m,t} + b)

- Renormalization (sharpening γ):

  α_{m,t} = exp(γ · ε_{m,t}) / Σ_{τ=1}^{T/4} exp(γ · ε_{m,τ})
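The effect of the sharpening factor γ on the renormalization, with made-up scores: larger γ concentrates the attention weights on the highest-scoring frame.

```python
import numpy as np

eps = np.array([1.0, 2.0, 4.0, 2.0])        # made-up scores eps_{m,t}

def renorm(eps, gamma):
    a = np.exp(gamma * eps)
    return a / a.sum()                       # alpha_{m,t}

flat = renorm(eps, gamma=0.5)
sharp = renorm(eps, gamma=2.0)

assert abs(flat.sum() - 1.0) < 1e-12 and abs(sharp.sum() - 1.0) < 1e-12
assert sharp.max() > flat.max()              # larger gamma concentrates weight
```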


Window Around Median

- Compute the median of the previous attention weights:

  τ_m = argmin_{k=1,...,T/4} | Σ_{ρ=1}^k α_{m−1,ρ} − Σ_{θ=k+1}^{T/4} α_{m−1,θ} |

- Compute the attention only in a window around the median:

  T_m = {τ_m − ω_left, . . . , τ_m + ω_right}

  α_{m,t} = exp(γ · ε_{m,t}) / Σ_{τ ∈ T_m} exp(γ · ε_{m,τ})   for t ∈ T_m,   and α_{m,t} = 0 otherwise.
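A small sketch of the median computation (0-based indices; the weights and window widths are made-up):

```python
import numpy as np

alpha_prev = np.array([0.05, 0.1, 0.6, 0.15, 0.1])  # previous attention weights

c = np.cumsum(alpha_prev)
tau = int(np.argmin(np.abs(c - (1.0 - c))))  # |sum_{rho<=k} - sum_{theta>k}|

w_left, w_right = 1, 1                       # window widths
lo = max(tau - w_left, 0)
hi = min(tau + w_right, len(alpha_prev) - 1)
window = list(range(lo, hi + 1))             # frames considered at step m
```

The window clipping at the sequence boundaries is an added safeguard not spelled out on the slide.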


Other Techniques (Attention)

- Monotonic regularization:

  r_m = max( 0, Σ_{τ=1}^{T/4} ( Σ_{i=1}^τ α_{m,i} − Σ_{i=1}^τ α_{m−1,i} ) )

- Curriculum learning: start with shorter sequences and gradually increase the sequence length.

- Flat start: initial positions are chosen according to the speaker speed.
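A sketch of the penalty term, assuming the max(0, ·) is applied to the summed cumulative differences (made-up attention vectors): moving attention backward raises the cumulative sums of α_m above those of α_{m−1} and is penalized, moving forward is not.

```python
import numpy as np

def monotonic_penalty(alpha_prev, alpha):
    """r_m = max(0, sum_tau (cumsum(alpha_m)_tau - cumsum(alpha_{m-1})_tau))."""
    return max(0.0, float(np.sum(np.cumsum(alpha) - np.cumsum(alpha_prev))))

prev = np.array([0.0, 0.0, 1.0, 0.0])
forward = np.array([0.0, 0.0, 0.0, 1.0])   # attention moved forward: no penalty
backward = np.array([1.0, 0.0, 0.0, 0.0])  # attention jumped back: penalized
```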


Resources for Attention

- Theano + Bricks + Blocks: https://github.com/rizar/attention-lvcsr

- TensorFlow: https://www.tensorflow.org/tutorials/seq2seq

- Keras: https://github.com/farizrahman4u/seq2seq


Further Literature on Attention

[Bahdanau et al., 2016, End-to-end Attention-based Large Vocabulary Speech Recognition]

[Chorowski et al., 2015, Attention-Based Models for Speech Recognition]

[Chorowski et al., 2015, End-to-end Continuous Speech Recognition using Attention-based Recurrent NN: First Results]

[Kim et al., 2016, Joint CTC-Attention based End-to-End Speech Recognition using Multi-task Learning]


Evaluation Framework

- Development and evaluation sets are different from the training set.

- Levenshtein distance: minimum number of insertions, deletions, and substitutions, e.g. aligning the recognized hypothesis T E A C H E R against the spoken reference S P E A K E R (figure: word alignment marking correct, substituted, inserted, and deleted positions).


Evaluation Framework

- Levenshtein distance: minimum number of insertions, deletions, and substitutions:

    L(w_1^N, v_1^M) = min_{s,t} { ∑_{i=1}^{λ} ( 1 − δ(w_{s(i)}, v_{t(i)}) ) }

  with the Kronecker delta δ(w, v) = 1 if v = w, and 0 if v ≠ w.

- Word Error Rate (WER):

    WER(Spoken_1^R, Recognized_1^R) = ∑_{r=1}^{R} L(Spoken_r, Recognized_r) / ∑_{r=1}^{R} |Spoken_r|


Experimental Results

Model                                   CER    WER
Bahdanau et al. (2015)
  Attention                             6.4    18.6
  Attention + bigram LM                 5.3    11.7
  Attention + trigram LM                4.8    10.8
  Attention + extended trigram LM       3.9     9.3
Graves and Jaitly (2014)
  CTC                                   9.2    30.1
Hannun et al. (2014)
  CTC + bigram LM                       n/a    14.1
Miao et al. (2015)
  CTC + bigram LM                       n/a    26.9
  CTC for phonemes + lexicon            n/a    26.9
  CTC for phonemes + trigram LM         n/a     7.3
  CTC + trigram LM                      n/a     9.0
Hybrid BGRU (15 h)                      n/a     2.0


Attention Modeling Example from Handwriting


Inverted Hidden Markov Model (HMM)

- Traditional HMM:

    p(w_1^N, x_1^T) = p(w_1^N) · p(x_1^T | w_1^N)
                    = p(w_1^N) ∑_{s_1^T} p(s_1^T, x_1^T | w_1^N)
                    = ∏_{n=1}^{N} p(w_n | w_1^{n−1}) ∑_{s_1^T} ∏_{t=1}^{T} p(s_t, x_t | s_1^{t−1}, x_1^{t−1}, w_1^N)

- Inverted HMM:

    p(w_1^N | x_1^T) = ∑_{t_1^N} p(w_1^N, t_1^N | x_1^T)
                     = ∑_{t_1^N} ∏_{n=1}^{N} p(w_n, t_n | w_1^{n−1}, t_1^{n−1}, x_1^T)

[Doetsch et al., 2016, Inverted HMM - a Proof of Concept]


Unsolved Problems for End-to-End ASR

- Error rates: still higher than traditional HMM-based systems (with one exception).

- Global search: still a transducer-based or HMM-based search.

- Acoustic model: word- and character-based end-to-end learning.

- Language model: no integration of the language model into training yet.


References

Bahdanau, D., Chorowski, J., Serdyuk, D., Brakel, P., and Bengio, Y. (2016). End-to-end attention-based large vocabulary speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016, Shanghai, China, March 20-25, 2016, pages 4945–4949.

Chorowski, J., Bahdanau, D., Serdyuk, D., Cho, K., and Bengio, Y. (2015). Attention-based models for speech recognition. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 577–585.

Collobert, R., Puhrsch, C., and Synnaeve, G. (2016). Wav2Letter: an end-to-end ConvNet-based speech recognition system. CoRR, abs/1609.03193.

Doetsch, P., Hegselmann, S., Schlüter, R., and Ney, H. (2016). Inverted HMM - a proof of concept. In Neural Information Processing Systems Workshop, Barcelona, Spain.

Graves, A., Fernández, S., Gomez, F., and Schmidhuber, J. (2006). Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, ICML '06, pages 369–376, New York, NY, USA. ACM.

Graves, A. and Jaitly, N. (2014). Towards end-to-end speech recognition with recurrent neural networks. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pages 1764–1772.

Graves, A., Liwicki, M., Fernández, S., Bertolami, R., Bunke, H., and Schmidhuber, J. (2009). A novel connectionist system for unconstrained handwriting recognition. IEEE Trans. Pattern Anal. Mach. Intell., 31(5):855–868.

Kim, S., Hori, T., and Watanabe, S. (2016). Joint CTC-attention based end-to-end speech recognition using multi-task learning. CoRR, abs/1609.06773.

Miao, Y., Gowayyed, M., and Metze, F. (2015). EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding. In 2015 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2015, Scottsdale, AZ, USA, December 13-17, 2015, pages 167–174.

Senior, A. W., Sak, H., de Chaumont Quitry, F., Sainath, T. N., and Rao, K. (2015). Acoustic modelling with CD-CTC-SMBR LSTM RNNs. In 2015 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2015, Scottsdale, AZ, USA, December 13-17, 2015, pages 604–609.

Soltau, H., Liao, H., and Sak, H. (2016). Neural speech recognizer: Acoustic-to-word LSTM model for large vocabulary speech recognition. CoRR, abs/1610.09975.

Zhang, Y., Pezeshki, M., Brakel, P., Zhang, S., Laurent, C., Bengio, Y., and Courville, A. C. (2017). Towards end-to-end speech recognition with deep convolutional neural networks. CoRR, abs/1701.02720.