Transcript of: llcao.net/cu-deeplearning17/lecture/lecture6_Markus.pdf

End-to-end Automatic Speech Recognition

Markus Nussbaum-Thom

IBM Thomas J. Watson Research Center, Yorktown Heights, NY 10598, USA

February 22, 2017

Nussbaum-Thom: IBM Thomas J. Watson Research Center 1 February 22, 2017


Contents

1. Introduction

2. Connectionist Temporal Classification (CTC)

3. Attention Model

4. References


Terminology

- Features: x, x_t, x_1^T := x_1, . . . , x_T.

- Words: w, u, v, w_m, w_1^M := w_1, . . . , w_M.

- Word sequences: W, W_n, V.

- States: s, s_t, s_1^T := s_1, . . . , s_T.

- Class-conditional posterior probabilities: p(s_t | x_t), p(W, s_1^T | x_1^T).


Bayes’ Decision Rule


Towards End-to-End Automatic Speech Recognition


Towards End-to-End Automatic Speech Recognition


End-to-End Automatic Speech Recognition


End-to-End Approach

- End-to-end: training all modules to optimize a global performance criterion (LeCun et al., 1998).

- Easy: the classes have no sub-structure, e.g. image classification.

- Difficult: the classes have a sub-structure (sequences, graphs), e.g. automatic speech recognition, automatic handwriting recognition, machine translation.

- Segmentation problem: which part of the input relates to which part of the sub-structure?


Towards End-to-end Automatic Speech Recognition

- End-to-end acoustic model:
  - Using characters instead of phonemes.
  - Connectionist temporal classification using recurrent or convolutional neural networks.
  - Purely neural attention models.

- End-to-end feature extraction:
  - Feature extraction integrated into the acoustic model.
  - Using the raw time signal.
  - Learning a specific type of filter.

- Towards real end-to-end modeling:
  - Using words as targets instead of characters or phonemes, together with a massive amount of data.


Basic Problem

- Input: X = x_1^T = (x_1, . . . , x_T).

- Neural network outputs: p(· | x_1), . . . , p(· | x_T).

- Target: W = w_1^M = (w_1, . . . , w_M), but M ≪ T.

- How do we solve this?
  - Connectionist Temporal Classification (CTC) [Graves et al., 2006; Graves et al., 2009].
  - Attention models [Bahdanau et al., 2016; Chorowski et al., 2015].
  - Inverted hidden Markov models [Doetsch et al., 2016, Inverted HMM - a Proof of Concept].


Overview CTC

- Concept.

- Training.

- Recognition.


Connectionist Temporal Classification (CTC)

- Given X = (x_1, . . . , x_5) and W = (a, b, c).

- Introduce a blank symbol ∅ and allow label repetitions. Possible alignments s_1^5 over x_1, . . . , x_5 include:

  (a, b, c, ∅, ∅), (a, ∅, b, c, ∅), (a, a, b, c, ∅), . . . , (∅, ∅, a, b, c)

- Blank and repetition removal B: B(a, ∅, b, c, ∅) = (a, b, c).
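The mapping B can be sketched directly in Python (a toy illustration; `-` stands in for the blank symbol ∅, and `inverse_image` enumerates B^{-1}(W) by brute force):

```python
from itertools import product

BLANK = "-"

def collapse(path):
    """B: merge repeated symbols, then drop blanks."""
    out = []
    for s in path:
        if not out or s != out[-1]:
            out.append(s)
    return tuple(s for s in out if s != BLANK)

def inverse_image(word, T, alphabet):
    """B^{-1}(word): all length-T paths that collapse to word (brute force)."""
    return [p for p in product(alphabet + (BLANK,), repeat=T)
            if collapse(p) == word]

print(collapse(("a", BLANK, "b", "c", BLANK)))   # ('a', 'b', 'c')
paths = inverse_image(("a", "b", "c"), 5, ("a", "b", "c"))
print(len(paths))                                # 28 alignments
```

Note that repeats are merged before blanks are dropped, so a blank is needed between two occurrences of the same label.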


Connectionist Temporal Classification (CTC)

- Posterior for a sentence W = w_1^M and features X = x_1^T:

  p(W | X) = Σ_{s_1^T ∈ B^{-1}(W)} p(s_1^T | X) := Σ_{s_1^T ∈ B^{-1}(W)} Π_{t=1}^T p(s_t | x_t)

- Training criterion for training samples (X_n, W_n), n = 1, . . . , N:

  F_CTC(Λ) = −(1/N) Σ_{n=1}^N log p_Λ(W_n | X_n)
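On a toy example the posterior and the criterion can be checked by brute-force enumeration (the frame posteriors below are made-up numbers; symbol 0 plays the role of the blank):

```python
import math
from itertools import product

# Made-up frame posteriors p(.|x_t): 4 frames, symbols {0: blank, 1, 2}.
P = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1], [0.3, 0.2, 0.5], [0.5, 0.1, 0.4]]

def collapse(path):
    """B: merge repeated symbols, then drop blanks (blank = 0)."""
    out = []
    for s in path:
        if not out or s != out[-1]:
            out.append(s)
    return tuple(s for s in out if s != 0)

def posterior(W, P):
    """p(W|X) = sum over s_1^T in B^{-1}(W) of prod_t p(s_t|x_t)."""
    T, S = len(P), len(P[0])
    return sum(math.prod(P[t][s] for t, s in enumerate(path))
               for path in product(range(S), repeat=T)
               if collapse(path) == W)

# Every path collapses to exactly one W, so the posteriors sum to 1.
total = sum(posterior(W, P)
            for L in range(len(P) + 1)
            for W in product((1, 2), repeat=L))
# Criterion contribution of a single training sample (X, W = (1, 2)):
loss = -math.log(posterior((1, 2), P))
```

Summing over all possible label sequences gives exactly 1, which confirms that the sum over B^{-1}(W) defines a proper distribution under the model assumption.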


Overview CTC

- Concept.

- Training.

- Recognition.


Forward-Backward Decomposition (CTC)

- α(t, m, v): sum over s_1^t ∈ B^{-1}(w_1^m) for given x_1^t, ending in v.

- β(t, m, v): sum over s_t^T ∈ B^{-1}(w_m^M) for given x_t^T, starting in v.

- Choose t ∈ {1, . . . , T}:

  p(w_1^M | x_1^T) = Σ_{s_1^T ∈ B^{-1}(w_1^M)} p(s_1^T | x_1^T)

                   = . . .

                   = Σ_{m=1}^M Σ_{v ∈ {w_m, ∅}} [α(t, m, v) / p(v | x_t)] · p(v | x_t) · [β(t, m, v) / p(v | x_t)]

Forward Algorithm (CTC)

- Initialization: compute α(1, m, v) = p(v | x_1).

- For t = 2, . . . , 12: compute α(t, m, v) from the values α(t − 1, ·, ·).

(Slides 16-28 of the original deck step through α(1, m, v) up to α(12, m, v) on a trellis figure, one frame at a time.)

Backward Algorithm (CTC)

- Initialization: compute β(12, M, v) = p(v | x_12).

- For t = 11, . . . , 1: compute β(t, m, v) from the values β(t + 1, ·, ·).

(Slides 29-41 of the original deck step through β(12, M, v) down to β(1, m, v) on the same trellis figure, one frame at a time.)

Posterior Decomposition (CTC)

Choose t ∈ {1, . . . , T}:

p(w_1^M | X) = Σ_{s_1^T ∈ B^{-1}(w_1^M)} p(s_1^T | X)   ("definition")

= Σ_{s_1^T ∈ B^{-1}(w_1^M)} Π_{τ=1}^T p(s_τ | x_τ)   ("model assumption")

= Σ_{m=1}^M Σ_{v ∈ {w_m, ∅}} Σ_{s_1^T ∈ B^{-1}(w_1^M), s_t = v} Π_{τ=1}^T p(s_τ | x_τ)   ("decomposition around t")

= Σ_{m=1}^M Σ_{v ∈ {w_m, ∅}} Σ_{s_1^T ∈ B^{-1}(w_1^M), s_t = v} Π_{τ=1}^{t−1} p(s_τ | x_τ) · p(v | x_t) · Π_{ρ=t+1}^T p(s_ρ | x_ρ)


Posterior Decomposition (CTC)

= Σ_{m=1}^M Σ_{v ∈ {w_m, ∅}} Σ_{s_1^T ∈ B^{-1}(w_1^M), s_t = v} Π_{τ=1}^{t−1} p(s_τ | x_τ) · p(v | x_t) · Π_{ρ=t+1}^T p(s_ρ | x_ρ)

= Σ_{m=1}^M Σ_{v ∈ {w_m, ∅}} Σ_{s_1^T ∈ B^{-1}(w_1^M), s_t = v} [Π_{τ=1}^t p(s_τ | x_τ) / p(v | x_t)] · p(v | x_t) · [Π_{ρ=t}^T p(s_ρ | x_ρ) / p(v | x_t)]


Posterior Decomposition (CTC)

= Σ_{m=1}^M Σ_{v ∈ {w_m, ∅}} [Σ_{s_1^t ∈ B^{-1}(w_1^m), s_t = v} Π_{τ=1}^t p(s_τ | x_τ) / p(v | x_t)] · p(v | x_t) · [Σ_{s_t^T ∈ B^{-1}(w_m^M), s_t = v} Π_{ρ=t}^T p(s_ρ | x_ρ) / p(v | x_t)]

= Σ_{m=1}^M Σ_{v ∈ {w_m, ∅}} [α(t, m, v) / p(v | x_t)] · p(v | x_t) · [β(t, m, v) / p(v | x_t)]

- α(t, m, v): sum over s_1^t ∈ B^{-1}(w_1^m) for given x_1^t, ending in v.

- β(t, m, v): sum over s_t^T ∈ B^{-1}(w_m^M) for given x_t^T, starting in v.


Forward Path Decomposition (CTC)

- Consider a path s_1^t ∈ B^{-1}(w_1^m) with s_t = v.

- Case s_t = w_m:

  B(s_1^{t−1})        s_{t−1}    s_t
  w_1 . . . w_?       ?          w_m
  w_1 . . . w_m       w_m        w_m
  w_1 . . . w_{m−1}   ∅          w_m
  w_1 . . . w_{m−1}   w_{m−1}    w_m

- Case s_t = ∅:

  B(s_1^{t−1})        s_{t−1}    s_t
  w_1 . . . w_?       ?          ∅
  w_1 . . . w_m       w_m        ∅
  w_1 . . . w_m       ∅          ∅


Forward Probabilities (CTC)

α(t, m, v) = Σ_{s_1^t ∈ B^{-1}(w_1^m), s_t = v} Π_{τ=1}^t p(s_τ | x_τ)

For v = w_m:

  α(t, m, v) = p(v | x_t) · [ Σ_{u ∈ {w_{m−1}, ∅}} Σ_{s_1^{t−1} ∈ B^{-1}(w_1^{m−1}), s_{t−1} = u} Π_{τ=1}^{t−1} p(s_τ | x_τ)
                              + Σ_{s_1^{t−1} ∈ B^{-1}(w_1^m), s_{t−1} = w_m} Π_{τ=1}^{t−1} p(s_τ | x_τ) ]

For v = ∅:

  α(t, m, v) = p(v | x_t) · Σ_{u ∈ {w_m, ∅}} Σ_{s_1^{t−1} ∈ B^{-1}(w_1^m), s_{t−1} = u} Π_{τ=1}^{t−1} p(s_τ | x_τ)


Forward Probabilities (CTC)

α(t, m, v) = p(v | x_t) · { Σ_{u ∈ {w_{m−1}, ∅}} α(t − 1, m − 1, u) + α(t − 1, m, w_m),   v = w_m
                            Σ_{u ∈ {w_m, ∅}} α(t − 1, m, u),                              v = ∅
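The recursion can be sketched as a dynamic program. The version below works over the equivalent blank-interleaved sequence (∅, w_1, ∅, . . . , w_M, ∅) rather than the (m, v) indexing of the slides, and is checked against brute-force path enumeration (made-up posteriors, blank = 0):

```python
import math
from itertools import product

def collapse(path):
    """B: merge repeated symbols, then drop blanks (blank = 0)."""
    out = []
    for s in path:
        if not out or s != out[-1]:
            out.append(s)
    return tuple(s for s in out if s != 0)

def ctc_forward(W, P):
    """Forward DP over the blank-interleaved sequence
    z = (0, w_1, 0, w_2, ..., 0, w_M, 0); returns p(W | X)."""
    z = [0]
    for w in W:
        z += [w, 0]
    T, Z = len(P), len(z)
    alpha = [[0.0] * Z for _ in range(T)]
    alpha[0][0] = P[0][z[0]]                      # start in the leading blank
    alpha[0][1] = P[0][z[1]]                      # or directly in w_1
    for t in range(1, T):
        for j in range(Z):
            a = alpha[t - 1][j]                   # stay in the same state
            if j >= 1:
                a += alpha[t - 1][j - 1]          # advance by one state
            if j >= 2 and z[j] != 0 and z[j] != z[j - 2]:
                a += alpha[t - 1][j - 2]          # skip a blank between labels
            alpha[t][j] = a * P[t][z[j]]
    return alpha[-1][-1] + alpha[-1][-2]          # end in final blank or w_M

def brute_force(W, P):
    """p(W | X) by enumerating every path s_1^T with B(s_1^T) = W."""
    T, S = len(P), len(P[0])
    return sum(math.prod(P[t][s] for t, s in enumerate(path))
               for path in product(range(S), repeat=T)
               if collapse(path) == W)

# Made-up frame posteriors: 5 frames, symbols {0: blank, 1, 2}.
P = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1], [0.3, 0.2, 0.5],
     [0.5, 0.1, 0.4], [0.4, 0.4, 0.2]]
for W in [(1,), (1, 2), (2, 1, 1)]:
    assert abs(ctc_forward(W, P) - brute_force(W, P)) < 1e-12
```

The dynamic program costs O(T · M) instead of the exponential cost of enumerating B^{-1}(W); the blank-skip condition z_j ≠ z_{j−2} enforces the mandatory blank between repeated labels, as in (2, 1, 1).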


Backward Probabilities (CTC)

- β(t, m, v): sum over all paths s_t^T ∈ B^{-1}(w_m^M) for given x_t^T, starting in a word v.

β(t, m, v) = Σ_{s_t^T ∈ B^{-1}(w_m^M), s_t = v} Π_{τ=t}^T p(s_τ | x_τ)

= { p(v | x_t) · [ Σ_{u ∈ {w_{m+1}, ∅}} β(t + 1, m + 1, u) + β(t + 1, m, w_m) ],   v = w_m
    p(∅ | x_t) · Σ_{u ∈ {w_m, ∅}} β(t + 1, m, u),                                  v = ∅
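A sketch of both passes in the same blank-interleaved formulation as before; the assertions verify the forward-backward decomposition p(W | X) = Σ α · β / p at every frame t (made-up posteriors, blank = 0):

```python
def ctc_alpha_beta(W, P):
    """Forward and backward DP over the blank-interleaved sequence z;
    alpha[t][j] and beta[t][j] both include the factor p(z_j | x_t)."""
    z = [0]
    for w in W:
        z += [w, 0]
    T, Z = len(P), len(z)
    alpha = [[0.0] * Z for _ in range(T)]
    beta = [[0.0] * Z for _ in range(T)]
    alpha[0][0] = P[0][z[0]]                     # start in the leading blank
    alpha[0][1] = P[0][z[1]]                     # or directly in w_1
    for t in range(1, T):
        for j in range(Z):
            a = alpha[t - 1][j]
            if j >= 1:
                a += alpha[t - 1][j - 1]
            if j >= 2 and z[j] != 0 and z[j] != z[j - 2]:
                a += alpha[t - 1][j - 2]
            alpha[t][j] = a * P[t][z[j]]
    beta[T - 1][Z - 1] = P[T - 1][z[Z - 1]]      # end in the final blank
    beta[T - 1][Z - 2] = P[T - 1][z[Z - 2]]      # or in the last label
    for t in range(T - 2, -1, -1):
        for j in range(Z):
            b = beta[t + 1][j]
            if j + 1 < Z:
                b += beta[t + 1][j + 1]
            if j + 2 < Z and z[j + 2] != 0 and z[j + 2] != z[j]:
                b += beta[t + 1][j + 2]
            beta[t][j] = b * P[t][z[j]]
    return z, alpha, beta

# Made-up posteriors: 4 frames, symbols {0: blank, 1, 2}; W = (1, 2).
P = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1], [0.3, 0.2, 0.5], [0.5, 0.1, 0.4]]
z, alpha, beta = ctc_alpha_beta((1, 2), P)
full = alpha[-1][-1] + alpha[-1][-2]             # p(W|X) from the forward pass
for t in range(len(P)):
    cut = sum(alpha[t][j] * beta[t][j] / P[t][z[j]] for j in range(len(z)))
    assert abs(cut - full) < 1e-12               # decomposition holds at every t
```

Since α and β each contain the frame factor p(v | x_t) once, their product double-counts it, which is exactly why the decomposition divides by p(v | x_t).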


Derivatives (CTC)

- Derivative of the posterior:

∇_{p(s|x_t)} P(W | X) = ∇_{p(s|x_t)} Σ_{m=1}^M Σ_{v ∈ {w_m, ∅}} [α(t, m, v) / p(v | x_t)] · p(v | x_t) · [β(t, m, v) / p(v | x_t)]

                      = Σ_{m=1}^M Σ_{v ∈ {w_m, ∅}} δ(v, s) · α(t, m, s) · β(t, m, s) / p²(s | x_t)

∇ log P(W | X) = [1 / P(W | X)] · ∇P(W | X)

- Derivative of the training criterion:

⇒ ∇F_CTC(Λ) = −(1/N) Σ_{n=1}^N ∇ log p(W_n | X_n)


CTC Architectures

- What kind of encoders? DNNs, (bidirectional) LSTMs, CNNs.

- Subsampling: reducing the frame rate through the network.


Subsampling

- Join input frames.

- Reshape the input to the next layer.

- Return every 2nd frame to the next layer.

- CNNs: use strides.
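Both variants can be sketched with numpy (assuming the frames are the rows of a T × d matrix):

```python
import numpy as np

T, d = 8, 3
x = np.arange(T * d, dtype=float).reshape(T, d)  # T frames with d features each

stacked = x.reshape(T // 2, 2 * d)  # join pairs of frames: shape (T/2, 2d)
strided = x[::2]                    # return every 2nd frame: shape (T/2, d)
```

Frame stacking keeps all the information at half the rate (wider frames), while striding simply drops every other frame; CNN strides achieve the latter inside the convolution itself.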


Peaking Behavior

(Figure from [Graves and Jaitly, 2014]: CTC frame-level output posteriors form sharp spikes at the label positions, while the blank symbol dominates everywhere else.)


Overview CTC

- Concept.

- Training.

- Recognition.


Hybrid Recognition (CTC)

- Hybrid model:

  p(x | s) = p(s | x) · p(x) / p(s) ∝ p(s | x) / p(s)

- Decoding:

  ŵ_1^N = argmax_{w_1^N} { p(w_1^N) · max_{s_1^T} Π_{τ=1}^T p(x_τ | s_τ) }
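A minimal numeric illustration of the prior scaling p(s | x) / p(s) with made-up numbers: dividing the network posterior by the state prior can flip a per-frame decision towards rare states.

```python
import numpy as np

# Hypothetical frame posteriors p(s|x) over two states, and state priors p(s).
posterior = np.array([0.6, 0.4])
prior = np.array([0.9, 0.1])

scaled = posterior / prior       # pseudo-likelihood p(s|x) / p(s)

assert posterior.argmax() == 0   # the raw posterior prefers the frequent state
assert scaled.argmax() == 1      # prior scaling favors the rare state
```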


Weighted Finite State Transducer Recognition (WFST)

- Token transducer T.

- Language model G.

- Lexicon L.

- Search space:

  S = T ∘ min(det(L ∘ G))


Resources for CTC

- Keras:
  - Tensorflow backend: https://github.com/fchollet/keras/blob/master/keras/backend/tensorflow_backend.py
  - Theano backend: https://github.com/fchollet/keras/blob/master/keras/backend/theano_backend.py
  - Example: https://github.com/fchollet/keras/blob/master/examples/image_ocr.py

- Baidu:
  - https://github.com/baidu-research/warp-ctc
  - https://github.com/sherjilozair/ctc
  - https://github.com/baidu-research/ba-dls-deepspeech

- Eesen: https://github.com/srvk/eesen

- Kaldi: https://github.com/lingochamp/kaldi-ctc


Further Literature on CTC

[Miao et al., 2015, EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding]

[Collobert et al., 2016, Wav2Letter: an End-to-End ConvNet-based Speech Recognition System]

[Zhang et al., 2017, Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks]

[Senior et al., 2015, Acoustic modelling with CD-CTC-sMBR LSTM RNNs]

[Soltau et al., 2016, Neural Speech Recognizer: Acoustic-to-Word LSTM Model for Large Vocabulary Speech Recognition]


Attention Model

- Encoder-decoder architecture:
  - Encoder: performs feature extraction/encoding based on the input.
  - Decoder: produces the output label sequence from the encoded features.


Attention Encoder-Decoder Architecture

- What kind of encoders? DNNs, (bidirectional) LSTMs, CNNs.

(Figure from [Chorowski et al., 2015]: schematic of the Attention-based Recurrent Sequence Generator. At each time step t, an MLP combines the hidden state s_{t−1} with all input vectors h_l to compute the attention weights α_{tl}; subsequently the new hidden state s_t and the prediction for the output label y_t are computed. Labeled components: encoder, decoder, MLP, glimpse, output.)


Attention Encoder-Decoder Architecture

- Encoder-decoder:
  - Input: x_1^T = x_1, . . . , x_T
  - Encoder: h_1^{T/4} = h_1, . . . , h_{T/4} = Encoder(x_1^T)
  - Decoder, for m = 1, . . . , M:
    - Attention: α_m = Attend(h_1^{T/4}, s_{m−1}, α_{m−1})
    - Glimpse: g_m = Σ_{τ=1}^{T/4} α_{m,τ} · h_τ
    - Generator: y_m = Generator(g_m, s_{m−1}), with c_m = RNN(c_{m−1}, g_m, s_{m−1}) and y_m = Softmax(c_m)
    - Transition: s_m = RNN(s_{m−1}, y_m, g_m)
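The decoder loop above can be sketched with numpy and random weights; the RNN transition and the Softmax generator are replaced by a simple tanh stand-in, so this only illustrates the Attend/glimpse/transition data flow, not the original architecture (all names and sizes are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
Tq, n = 6, 4                      # number of encoded frames (~T/4), hidden size
h = rng.normal(size=(Tq, n))      # encoder outputs h_1^{T/4} (assumed given)
W_, V_ = rng.normal(size=(n, n)), rng.normal(size=(n, n))
E_, b_ = rng.normal(size=n), rng.normal(size=n)

def attend(s_prev):
    """Content-based scores E . tanh(W s_{m-1} + V h_t + b), softmax-normalized."""
    eps = np.array([E_ @ np.tanh(W_ @ s_prev + V_ @ ht + b_) for ht in h])
    a = np.exp(eps - eps.max())
    return a / a.sum()

s = np.zeros(n)                   # decoder state s_0
glimpses = []
for m in range(3):                # three decoder steps
    alpha = attend(s)             # attention weights alpha_m over the frames
    g = alpha @ h                 # glimpse g_m = sum_t alpha_{m,t} h_t
    s = np.tanh(W_ @ s + V_ @ g)  # toy transition standing in for the RNN
    glimpses.append(g)
```

At every step the attention weights form a distribution over the encoded frames, and the glimpse is the corresponding convex combination of the encoder outputs.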


Attention Mechanism

- Content-based (weights E, W, V; bias b):

  ε_{m,t} = E · tanh(W · s_{m−1} + V · h_t + b)

- Location-based (weights E, W, V, U, convolution filter F; bias b):

  f_m = F ∗ α_{m−1}

  ε_{m,t} = E · tanh(W · s_{m−1} + V · h_t + U · f_{m,t} + b)

- Renormalization (sharpening γ):

  α_{m,t} = exp(γ · ε_{m,t}) / Σ_{τ=1}^{T/4} exp(γ · ε_{m,τ})
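The effect of the sharpening factor γ on the renormalization, with made-up scores: larger γ concentrates the attention weights on the highest-scoring frame.

```python
import numpy as np

eps = np.array([1.0, 2.0, 4.0, 2.0])        # made-up scores eps_{m,t}

def renorm(eps, gamma):
    a = np.exp(gamma * eps)
    return a / a.sum()                       # alpha_{m,t}

flat = renorm(eps, gamma=0.5)
sharp = renorm(eps, gamma=2.0)

assert abs(flat.sum() - 1.0) < 1e-12 and abs(sharp.sum() - 1.0) < 1e-12
assert sharp.max() > flat.max()              # larger gamma concentrates weight
```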


Window Around Median

- Compute the median of the previous attention weights:

  τ_m = argmin_{k=1,...,T/4} | Σ_{ρ=1}^k α_{m−1,ρ} − Σ_{θ=k+1}^{T/4} α_{m−1,θ} |

- Compute the attention only in a window around the median:

  T_m = {τ_m − ω_left, . . . , τ_m + ω_right}

  α_{m,t} = exp(γ · ε_{m,t}) / Σ_{τ ∈ T_m} exp(γ · ε_{m,τ})   for t ∈ T_m,   and α_{m,t} = 0 otherwise.
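A small sketch of the median computation (0-based indices; the weights and window widths are made-up):

```python
import numpy as np

alpha_prev = np.array([0.05, 0.1, 0.6, 0.15, 0.1])  # previous attention weights

c = np.cumsum(alpha_prev)
tau = int(np.argmin(np.abs(c - (1.0 - c))))  # |sum_{rho<=k} - sum_{theta>k}|

w_left, w_right = 1, 1                       # window widths
lo = max(tau - w_left, 0)
hi = min(tau + w_right, len(alpha_prev) - 1)
window = list(range(lo, hi + 1))             # frames considered at step m
```

The window clipping at the sequence boundaries is an added safeguard not spelled out on the slide.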


Other Techniques (Attention)

- Monotonic regularization:

  r_m = max( 0, Σ_{τ=1}^{T/4} ( Σ_{i=1}^τ α_{m,i} − Σ_{i=1}^τ α_{m−1,i} ) )

- Curriculum learning: start with shorter sequences and gradually increase the sequence length.

- Flat start: initial positions are chosen according to the speaker speed.
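A sketch of the penalty term, assuming the max(0, ·) is applied to the summed cumulative differences (made-up attention vectors): moving attention backward raises the cumulative sums of α_m above those of α_{m−1} and is penalized, moving forward is not.

```python
import numpy as np

def monotonic_penalty(alpha_prev, alpha):
    """r_m = max(0, sum_tau (cumsum(alpha_m)_tau - cumsum(alpha_{m-1})_tau))."""
    return max(0.0, float(np.sum(np.cumsum(alpha) - np.cumsum(alpha_prev))))

prev = np.array([0.0, 0.0, 1.0, 0.0])
forward = np.array([0.0, 0.0, 0.0, 1.0])   # attention moved forward: no penalty
backward = np.array([1.0, 0.0, 0.0, 0.0])  # attention jumped back: penalized
```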


Resources for Attention

- Theano + Bricks + Blocks: https://github.com/rizar/attention-lvcsr

- TensorFlow: https://www.tensorflow.org/tutorials/seq2seq

- Keras: https://github.com/farizrahman4u/seq2seq


Further Literature on Attention

[Bahdanau et al., 2016, End-to-end Attention-based Large Vocabulary Speech Recognition]

[Chorowski et al., 2015, Attention-Based Models for Speech Recognition]

[Chorowski et al., 2015, End-to-end Continuous Speech Recognition using Attention-based Recurrent NN: First Results]

[Kim et al., 2016, Joint CTC-Attention based End-to-End Speech Recognition using Multi-task Learning]


Evaluation Framework

- Development and evaluation sets are different from the training set.

- Levenshtein distance: minimum number of insertions, deletions, and substitutions, e.g. aligning the recognized hypothesis T E A C H E R against the spoken reference S P E A K E R (figure: word alignment marking correct, substituted, inserted, and deleted positions).


Evaluation Framework

- Levenshtein distance: minimum number of insertions, deletions, and substitutions:

    L(w_1^N, v_1^M) = min_{s,t} { ∑_{i=1}^{λ} ( 1 − δ(w_{s(i)}, v_{t(i)}) ) }

  with the Kronecker delta δ(w, v) = 1 if v = w, and 0 if v ≠ w.

- Word Error Rate (WER):

    WER(Spoken_1^R, Recognized_1^R) = ∑_{r=1}^{R} L(Spoken_r, Recognized_r) / ∑_{r=1}^{R} |Spoken_r|


Experimental Results

Model                                   CER    WER
Bahdanau et al. (2015)
  Attention                             6.4    18.6
  Attention + bigram LM                 5.3    11.7
  Attention + trigram LM                4.8    10.8
  Attention + extended trigram LM       3.9     9.3
Graves and Jaitly (2014)
  CTC                                   9.2    30.1
Hannun et al. (2014)
  CTC + bigram LM                       n/a    14.1
Miao et al. (2015)
  CTC + bigram LM                       n/a    26.9
  CTC for phonemes + lexicon            n/a    26.9
  CTC for phonemes + trigram LM         n/a     7.3
  CTC + trigram LM                      n/a     9.0
Hybrid BGRU (15 h)                      n/a     2.0


Attention Modeling Example from Handwriting


Inverted Hidden Markov Model (HMM)

- Traditional HMM:

    p(w_1^N, x_1^T) = p(w_1^N) · p(x_1^T | w_1^N)
                    = p(w_1^N) ∑_{s_1^T} p(s_1^T, x_1^T | w_1^N)
                    = ∏_{n=1}^{N} p(w_n | w_1^{n−1}) ∑_{s_1^T} ∏_{t=1}^{T} p(s_t, x_t | s_1^{t−1}, x_1^{t−1}, w_1^N)

- Inverted HMM:

    p(w_1^N | x_1^T) = ∑_{t_1^N} p(w_1^N, t_1^N | x_1^T)
                     = ∑_{t_1^N} ∏_{n=1}^{N} p(w_n, t_n | w_1^{n−1}, t_1^{n−1}, x_1^T)

[Doetsch et al., 2016, Inverted HMM - a Proof of Concept]


Unsolved Problems for End-to-End ASR

- Error rates: still higher than traditional HMM-based systems (with one exception).

- Global search: still a transducer-based or HMM-based search.

- Acoustic model: word- and character-based end-to-end learning.

- Language model: no integration of the language model into training yet.


References

Bahdanau, D., Chorowski, J., Serdyuk, D., Brakel, P., and Bengio, Y. (2016). End-to-end attention-based large vocabulary speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016, Shanghai, China, March 20-25, 2016, pages 4945–4949.

Chorowski, J., Bahdanau, D., Serdyuk, D., Cho, K., and Bengio, Y. (2015). Attention-based models for speech recognition. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 577–585.

Collobert, R., Puhrsch, C., and Synnaeve, G. (2016). Wav2Letter: an end-to-end ConvNet-based speech recognition system. CoRR, abs/1609.03193.

Doetsch, P., Hegselmann, S., Schlüter, R., and Ney, H. (2016). Inverted HMM - a proof of concept. In Neural Information Processing Systems Workshop, Barcelona, Spain.

Graves, A., Fernández, S., Gomez, F., and Schmidhuber, J. (2006). Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, ICML '06, pages 369–376, New York, NY, USA. ACM.

Graves, A. and Jaitly, N. (2014). Towards end-to-end speech recognition with recurrent neural networks. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pages 1764–1772.

Graves, A., Liwicki, M., Fernández, S., Bertolami, R., Bunke, H., and Schmidhuber, J. (2009). A novel connectionist system for unconstrained handwriting recognition. IEEE Trans. Pattern Anal. Mach. Intell., 31(5):855–868.

Kim, S., Hori, T., and Watanabe, S. (2016). Joint CTC-attention based end-to-end speech recognition using multi-task learning. CoRR, abs/1609.06773.

Miao, Y., Gowayyed, M., and Metze, F. (2015). EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding. In 2015 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2015, Scottsdale, AZ, USA, December 13-17, 2015, pages 167–174.

Senior, A. W., Sak, H., de Chaumont Quitry, F., Sainath, T. N., and Rao, K. (2015). Acoustic modelling with CD-CTC-SMBR LSTM RNNs. In 2015 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2015, Scottsdale, AZ, USA, December 13-17, 2015, pages 604–609.

Soltau, H., Liao, H., and Sak, H. (2016). Neural speech recognizer: Acoustic-to-word LSTM model for large vocabulary speech recognition. CoRR, abs/1610.09975.

Zhang, Y., Pezeshki, M., Brakel, P., Zhang, S., Laurent, C., Bengio, Y., and Courville, A. C. (2017). Towards end-to-end speech recognition with deep convolutional neural networks. CoRR, abs/1701.02720.