Speech Recognition
speech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR (v10).pdf

Speech Recognition HUNG-YI LEE 李宏毅

Transcript of "Speech Recognition" (speech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR (v10).pdf)

Page 1

Speech Recognition — HUNG-YI LEE 李宏毅

Page 2

Speech Recognition is Difficult?

I heard the story from Prof Haizhou Li.

Page 3

Speech Recognition

speech → text

Speech: a sequence of vectors (length T, dimension d)

Text: a sequence of tokens (length N, V different tokens)

Usually T > N
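To make the shapes concrete, here is a minimal sketch of one speech–text training pair (assuming NumPy; the sizes are illustrative, not from the slides):

```python
# A toy speech-text pair with the shapes described above (illustrative numbers).
import numpy as np

T, d = 300, 80          # e.g., 3 s of audio at 100 frames/s, 80-dim filter bank features
N, V = 13, 30           # e.g., "one_punch_man" as graphemes; 26 letters + space + punctuation

speech = np.random.randn(T, d)            # acoustic feature sequence: length T, dimension d
text = np.random.randint(0, V, size=N)    # token id sequence: length N, ids in [0, V)
assert T > N                              # usually the acoustic sequence is much longer
```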

Page 4

Token

Grapheme: the smallest unit of a writing system
• English example: one_punch_man → N=13, V = 26 (English alphabet) + { _ } (space) + {punctuation marks}
• Chinese example: "一", "拳", "超", "人" → N=4, V ≈ 4000 (Chinese does not need a "space")
• Lexicon free!

Phoneme: a unit of sound
• "one punch man" → W AH N  P AH N CH  M AE N
• Requires a lexicon (word to phonemes), e.g., one → W AH N, punch → P AH N CH, man → M AE N, good → G UH D, cat → K AE T
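A lexicon is simply a word-to-phoneme table; a minimal sketch (the entries are the ones on the slide):

```python
# Toy lexicon mapping words to phoneme sequences (entries taken from the slide).
lexicon = {
    "one":   ["W", "AH", "N"],
    "punch": ["P", "AH", "N", "CH"],
    "man":   ["M", "AE", "N"],
    "good":  ["G", "UH", "D"],
    "cat":   ["K", "AE", "T"],
}

def words_to_phonemes(words):
    # Phoneme tokenization needs every word to be in the lexicon (unlike graphemes).
    return [p for w in words for p in lexicon[w]]

print(words_to_phonemes(["one", "punch", "man"]))
# ['W', 'AH', 'N', 'P', 'AH', 'N', 'CH', 'M', 'AE', 'N']
```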

Page 5

Token

Word: for some languages, V can be too large!
• English example: "one punch man" → N=3, usually V > 100K
• Chinese example: "一拳" "超人" → N=2, V = ???

Page 6

Token

Turkish is an agglutinative language: a single "word" can run to 70 characters?!

Source of information: http://tkturkey.com/ (土女時代)

Page 7

Token

Morpheme: the smallest meaningful unit (smaller than a word, larger than a grapheme)
• unbreakable → "un" "break" "able"
• rekillable → "re" "kill" "able"
• What are the morphemes in a language? They can be found with linguistic knowledge or by statistical methods.

(Word tokens, repeated for comparison: "one punch man" → N=3, usually V > 100K; "一拳" "超人" → N=2, V = ???; for some languages, V can be too large!)

Page 8

Token

Bytes (!): with UTF-8 encoding, the system can be language independent! V is always 256. [Li, et al., ICASSP’19]
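A quick illustration of byte tokens (a Python sketch, not from the slides):

```python
# Byte-level tokens: any language's text becomes a sequence of UTF-8 bytes, so V = 256 always.
text = "一拳超人"                        # "one punch man" in Chinese
tokens = list(text.encode("utf-8"))      # each of these characters becomes 3 bytes
print(tokens)                            # [228, 184, 128, 230, 139, 179, ...] -> N = 12, every id < 256
print(bytes(tokens).decode("utf-8"))     # decoding the byte ids recovers the original text
```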

Page 9

Token — a survey of more than 100 papers from INTERSPEECH’19, ICASSP’19, ASRU’19.

(感謝助教群的辛勞 — Thanks to the teaching assistants for their hard work.)

Page 10

Speech Recognition → "one punch man" → word embeddings

Speech Recognition + Translation: (speech) → "一拳超人"

Speech Recognition + Intent classification: "one ticket to Taipei on March 2nd" → <buy ticket>

Speech Recognition + Slot filling: "one ticket to Taipei on March 2nd" → NA NA NA <LOC> NA <TIME> <TIME>

Page 11

Acoustic Feature

• A frame: a 25 ms window = 400 sample points at 16 kHz, converted into one vector, e.g., 39-dim MFCC or 80-dim filter bank output (this gives the sequence of length T, dimension d).
• The window shifts by 10 ms each step, so 1 s of speech → 100 frames.

數位語音處理第七章 (Digital Speech Processing, Chapter 7): Speech Signal and Front-end Processing
http://ocw.aca.ntu.edu.tw/ntu-ocw/ocw/cou/104S204/7
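A quick arithmetic check of the framing above (a minimal sketch; the exact frame count depends on padding conventions):

```python
# 25 ms window, 10 ms shift, 16 kHz sampling rate.
sr = 16000
win = int(0.025 * sr)      # 400 sample points per frame
hop = int(0.010 * sr)      # 160 sample points between frame starts

def num_frames(num_samples):
    # Number of full 25 ms windows obtained with a 10 ms shift (no padding).
    return 1 + (num_samples - win) // hop

print(num_frames(sr))      # 1 s of audio -> 98 full windows, i.e., roughly 100 frames
```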

Page 12

Acoustic Feature

Waveform → (DFT) → spectrogram → (filter bank) → (log) → filter bank output → (DCT) → MFCC
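A rough sketch of this pipeline, assuming the librosa library is available (standard librosa calls; the file name and parameter choices are illustrative and follow the 25 ms / 10 ms framing on Page 11):

```python
# Waveform -> (DFT) -> spectrogram -> (mel filter bank) -> (log) -> filter bank output -> (DCT) -> MFCC
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)       # waveform at 16 kHz (hypothetical file)
n_fft, hop = int(0.025 * sr), int(0.010 * sr)          # 25 ms window, 10 ms shift

fbank = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft, hop_length=hop, n_mels=80)
log_fbank = np.log(fbank + 1e-6)                       # 80-dim log filter bank output, shape (80, T)

mfcc = librosa.feature.mfcc(S=log_fbank, n_mfcc=13)    # DCT of the log filter bank, shape (13, T)
# Appending delta and delta-delta coefficients would give the 39-dim MFCC mentioned on Page 11.
```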

Page 13

Acoustic Feature — a survey of more than 100 papers from INTERSPEECH’19, ICASSP’19, ASRU’19.

(感謝助教群的辛勞 — Thanks to the teaching assistants for their hard work.)

Page 14

How much data do we need? (English corpora)

• Typical corpus sizes: 4 hr, 80 hr, 300 hr, 960 hr, 2000 hr, "4096 hr" [Chiu, et al., ICASSP’18] [Huang, et al., arXiv’19]
• For comparison with image datasets (counting raw values as 16 kHz samples):
  MNIST: 28 × 28 × 1 × 60000 = 47,040,000 values ≈ "49 minutes" of audio
  CIFAR-10: 32 × 32 × 3 × 50000 = 153,600,000 values ≈ "2 hours 40 minutes" of audio
• The commercial systems use more than that ……
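The MNIST / CIFAR-10 comparison above is just counting raw values; a quick check:

```python
# How long would 16 kHz audio have to be to contain as many sample values as each image dataset?
sr = 16000

def as_audio_minutes(num_values):
    return num_values / sr / 60

mnist = 28 * 28 * 1 * 60000        # 47,040,000 values
cifar10 = 32 * 32 * 3 * 50000      # 153,600,000 values

print(as_audio_minutes(mnist))     # 49.0 minutes
print(as_audio_minutes(cifar10))   # 160.0 minutes = 2 hours 40 minutes
```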

Page 15

Two Points of View

Seq-to-seq vs. HMM

Source of image: 李琳山老師《數位語音處理概論》 (Prof. Lin-shan Lee, Introduction to Digital Speech Processing)

Page 16

Models to be introduced

• Listen, Attend, and Spell (LAS) [Chorowski, et al., NIPS’15]
• Connectionist Temporal Classification (CTC) [Graves, et al., ICML’06]
• RNN Transducer (RNN-T) [Graves, ICML workshop’12]
• Neural Transducer [Jaitly, et al., NIPS’16]
• Monotonic Chunkwise Attention (MoChA) [Chiu, et al., ICLR’18]

Page 17

Models — a survey of more than 100 papers from INTERSPEECH’19, ICASSP’19, ASRU’19.

(感謝助教群的辛勞 — Thanks to the teaching assistants for their hard work.)

Page 18

Models to be introduced

• Listen, Attend, and Spell (LAS) [Chorowski, et al., NIPS’15]
• Connectionist Temporal Classification (CTC) [Graves, et al., ICML’06]
• RNN Transducer (RNN-T) [Graves, ICML workshop’12]
• Neural Transducer [Jaitly, et al., NIPS’16]
• Monotonic Chunkwise Attention (MoChA) [Chiu, et al., ICLR’18]

LAS is an encoder ("Listen") plus a decoder ("Spell"); it is the typical seq2seq with attention.

Page 19

Listen (the encoder)

Input: acoustic features x1, x2, ⋯, xT
Output: high-level representations h1, h2, ⋯, hT

• Extract content information
• Remove speaker variance, remove noise

Page 20

Listen

The encoder can be an RNN: acoustic features x1, ⋯, xT → high-level representations h1, ⋯, hT.

Page 21

Listen

The encoder can also be a 1-D CNN: each filter slides along the time axis over x1, ⋯, xT to produce h1, ⋯, hT.

Page 22

Listen

1-D CNN (continued)
• Filters in higher layers can consider a longer span of the input sequence.
• CNN + RNN is common.

Page 23

Listen

The encoder can also be built from self-attention layers: x1, ⋯, xT → h1, ⋯, hT. [Zeyer, et al., ASRU’19] [Karita, et al., ASRU’19]

Please refer to the ML video recording: https://www.youtube.com/watch?v=ugWDIIOHtPA

Page 24

Listen – Down Sampling

Speech is long (1 s is already 100 frames), so the encoder usually reduces the time resolution:
• Pyramid RNN: each layer combines pairs of outputs from the layer below, halving the sequence length.
• Pooling over time: each layer keeps one output per pair of time steps.
[Chan, et al., ICASSP’16] [Bahdanau, et al., ICASSP’16]
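A minimal sketch of the pooling-over-time idea (hypothetical shapes, assuming NumPy):

```python
# Halve the frame rate of an encoder layer's output by averaging adjacent frames.
import numpy as np

h = np.random.randn(100, 256)        # 1 s of encoder states: 100 frames, 256 dims
h_down = (h[0::2] + h[1::2]) / 2     # combine each pair of consecutive frames
print(h_down.shape)                  # (50, 256): half the time resolution
```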

Page 25

Listen – Down Sampling

• Time-delay DNN (TDNN): only a subset of the frames in each context window is actually used; dilated CNN has the same concept. [Peddinti, et al., INTERSPEECH’15]
• Truncated self-attention: attention is restricted to a limited range around the current frame. [Yeh, et al., arXiv’19]

Page 26

Attention

Dot-product attention [Chan, et al., ICASSP’16]
• The decoder state z0 is matched against each encoder output h: transform both with Wh and Wz, then take the dot product.
• match(h1, z0) = (Wh h1) · (Wz z0) = α0^1

Page 27

Attention

Additive attention [Chorowski, et al., NIPS’15]
• Transform h1 and z0 with Wh and Wz, add them, pass the sum through tanh, then apply a final linear transform W.
• match(h1, z0) = W · tanh(Wh h1 + Wz z0) = α0^1

Page 28

Attention

• Compute a match score α0^i between z0 and every encoder output h1, ⋯, h4.
• Normalize the scores with softmax to get α̂0^1, ⋯, α̂0^4 (e.g., 0.5, 0.5, 0.0, 0.0).
• The context vector is the weighted sum c0 = Σi α̂0^i h^i = 0.5 h1 + 0.5 h2.
• c0 is used as the input of the decoder RNN (the running example output is "cat").
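Putting Pages 26–28 together, a minimal NumPy sketch of one attention step (dot-product variant; dimensions are hypothetical):

```python
import numpy as np

T, d_h, d_z = 4, 8, 8
h = np.random.randn(T, d_h)                 # encoder outputs h1..h4
z0 = np.random.randn(d_z)                   # initial decoder state
Wh = np.random.randn(d_h, d_h)              # transforms inside the "match" function
Wz = np.random.randn(d_h, d_z)

alpha = (h @ Wh.T) @ (Wz @ z0)              # match scores: (Wh h_i) . (Wz z0) for each i
alpha_hat = np.exp(alpha) / np.exp(alpha).sum()   # softmax
c0 = alpha_hat @ h                          # context vector c0 = sum_i alpha_hat_i * h_i
# (Additive attention would instead score with W . tanh(Wh h_i + Wz z0).)
```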

Page 29

Spell

• The decoder RNN takes the context vector c0 (together with the initial state z0) and produces the hidden state z1.
• From z1, the decoder outputs a distribution over all tokens (size V), e.g., a: 0.0, b: 0.1, c: 0.6, d: 0.1, ……
• Taking the max gives the first token: "c" (the target word is "cat").

Page 30

Spell

The new hidden state z1 is then matched against h1, ⋯, h4 again (computing scores such as α1^1) to prepare the next attention step.

Page 31

Spell

• With z1, compute new match scores α1^1, ⋯, α1^4 and softmax them (e.g., 0.0, 0.3, 0.6, 0.1).
• The next context vector is c1 = Σi α̂1^i h^i = 0.3 h2 + 0.6 h3 + 0.1 h4.
• c1 (with z1) drives the next decoder step, giving z2 and a new output distribution; taking the max gives the second token "a".

Page 32

Spell

• Repeat attend-and-spell: c2 and z3 give "t", c3 and z4 give <EOS>; decoding stops once <EOS> is produced.
• Beam Search is usually used instead of greedy max (not today).

Page 33

Training

• The training target is the ground-truth transcription ("cat") written as one-hot vectors.
• At the first step, the output distribution over all tokens (size V) is compared with the one-hot vector for "c" using cross entropy: we want p(c) to be as large as possible.

Page 34

Training

• At the second step, the input fed to the decoder is the ground-truth token "c" — not whatever the model actually produced — and the output distribution is trained with cross entropy against the one-hot vector for "a" (maximize p(a)).
• Always feeding the ground truth during training is called Teacher Forcing.
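A self-contained sketch of teacher forcing (the tiny decoder below is a stand-in, not the actual LAS decoder; all names and numbers are illustrative):

```python
import numpy as np

V, d = 5, 8                                    # toy vocabulary size and state size
rng = np.random.default_rng(0)
W_in = rng.normal(size=(d, d + V))             # fake "RNN" weights
W_out = rng.normal(size=(V, d))                # projection to the token distribution

def decoder_step(z, prev_token):
    """One decoder step: new state from (old state, previous token), then a distribution over V tokens."""
    x = np.concatenate([z, np.eye(V)[prev_token]])
    z_new = np.tanh(W_in @ x)
    logits = W_out @ z_new
    dist = np.exp(logits) / np.exp(logits).sum()
    return z_new, dist

def training_loss(ground_truth, teacher_forcing=True):
    z, prev, loss = np.zeros(d), 0, 0.0        # token 0 plays the role of <BOS> here
    for target in ground_truth:
        z, dist = decoder_step(z, prev)
        loss += -np.log(dist[target])          # cross entropy against the one-hot target
        # Teacher forcing: condition the next step on the ground truth,
        # not on the model's own (possibly wrong) prediction.
        prev = target if teacher_forcing else int(dist.argmax())
    return loss

print(training_loss([1, 2, 3]))                # e.g., spell "c", "a", "t" as token ids 1, 2, 3
```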

Page 35

Why Teacher Forcing?

If we instead use the model's previous output during training: early in training the first output is often incorrect (say "a" instead of "c"), so the second step learns to produce its target conditioned on a wrong history. Once the model later learns to output "c" correctly at the first step, everything the second step learned from those wrong histories is wasted.

(Slide dialogue: 「你要輸出 c 你要先講,好嗎?」 "If you want me to output c, you have to say so first, okay?" 「那我先講…」 "Then I'll say it first…" 「那我也要睡啦」 "Then I'll go to sleep too." 「之前的訓練都白費了……」 "All the earlier training was wasted……")

Page 36

Why Teacher Forcing?

With teacher forcing, the ground truth "c" is always fed to the second step during training, so the decoder consistently learns "given c, output a" — no matter what it actually produced at the first step.

Page 37

Back to Attention

Two ways to use the attention result: the context vector c^t can be fed into the next decoder step (when computing z^{t+1}), or used directly in the current step when generating the output. [Bahdanau, et al., ICLR’15] [Luong, et al., EMNLP’15]

Page 38

LAS [Chan, et al., ICASSP’16] uses both: c^t is used in the current step and also fed to the next step (the "+" in the diagram).

Page 39

Back to Attention

[Figure: the attention weights over h1 ⋯ h4 when generating the 1st, 2nd, and 3rd tokens.]

Page 40

Location-aware attention [Chorowski, et al., NIPS’15]

• Generating the 1st token: compute α0^1, ⋯, α0^4, then softmax → α̂0^1, ⋯, α̂0^4, as before.
• Generating the 2nd token: the match function additionally takes the previous attention history (the α̂0 values around the current position, after some processing) as input, so the attention is aware of where it has already looked.

原始論文難以理解,感謝劉浩然同學指點 (The original paper is hard to follow — thanks to classmate 劉浩然 for the explanation.)

Page 41

LAS – Does it work?

Early results were modest: experiments on TIMIT [Chorowski, et al., NIPS’15] and on 300 hours of data [Lu, et al., INTERSPEECH’15], while a conventional system reached 10.4% on SWB … [Soltau, et al., ICASSP’14]

Page 42

LAS – Yes, it works!

With enough data it works: 2000 hours [Chan, et al., ICASSP’16] and 12,500 hours [Chiu, et al., ICASSP’18].

Page 43

[Figure from [Chan, et al., ICASSP’16]. Location-aware attention is not used here.]

Page 44

Hokkien (閩南語、台語 — Taiwanese)

• 台語語音辨識 (Taiwanese speech recognition): (Taiwanese speech) → "母湯" (看不懂… — "can't understand it…")
• 台語語音辨識 + 中文翻譯 (Taiwanese speech recognition + Chinese translation): (Taiwanese speech) → "不行" ("no, you can't")
• Training data: Taiwanese TV dramas from YouTube (Taiwanese speech with Chinese subtitles), about 1500 hours — then simply train LAS on it directly.

(感謝李仲翊同學提供實驗結果 — Thanks to classmate 李仲翊 for providing the experimental results.)

Page 45

Hokkien (閩南語、台語 — Taiwanese)

• Background music and sound effects? 不管… (Don't care…)
• Speech and subtitles are not aligned? 不管… (Don't care…)
• Use Tâi-lô romanization (台羅拼音)? 不用… (No need…)

只有用深度學習「硬train一發」 — Just use deep learning and "train it hard" in one go.

Page 46

Results — Accuracy = 62.1%

Example outputs (Mandarin text recognized from Taiwanese speech):
• 你的身體撐不住 ("Your body can't take it")
• 沒事你為什麼要請假 ("If nothing is wrong, why did you ask for leave?")
• 要生了嗎 ("Is the baby coming?") — 正解 (ground truth): 不會膩嗎 ("Don't you get tired of it?")
• 我有幫廠長拜託 ("I put in a word with the factory director") — 正解 (ground truth): 我拜託廠長了 ("I asked the factory director")

Page 47

Limitation of LAS

• LAS outputs the first token only after listening to the whole input.
• Users expect on-line speech recognition: 今 天 的 天 氣 非 常 好 ("The weather today is very good") should appear word by word as it is spoken.
• LAS is not the final solution of ASR!

Page 48

Models to be introduced

• Listen, Attend, and Spell (LAS) [Chorowski, et al., NIPS’15]
• Connectionist Temporal Classification (CTC) [Graves, et al., ICML’06]
• RNN Transducer (RNN-T) [Graves, ICML workshop’12]
• Neural Transducer [Jaitly, et al., NIPS’16]
• Monotonic Chunkwise Attention (MoChA) [Chiu, et al., ICLR’18]

Page 49

CTC

• The encoder maps x1, ⋯, x4 to h1, ⋯, h4; a linear classifier is applied to each h^i, giving a token distribution Softmax(W h^i) over the vocabulary (size V).
• This supports streaming speech recognition (if you choose a uni-directional RNN encoder).

Page 50

CTC

Because the classifier must produce something at every frame even when no token should be emitted, a blank symbol 𝜙 is added to the vocabulary: the distribution Softmax(W h^i) has size V + 1.

Page 51

CTC

• Input T acoustic features, output T tokens (ignoring down sampling).
• The output tokens include 𝜙; to get the final text, merge duplicate tokens and then remove 𝜙. For example:
  𝜙 𝜙 d d 𝜙 e 𝜙 e 𝜙 p p → d e e p
  𝜙 𝜙 d d 𝜙 e e e 𝜙 p p → d e p
  好 好 棒 棒 棒 棒 棒 → 好 棒
  好 𝜙 棒 𝜙 棒 𝜙 𝜙 → 好 棒 棒
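A minimal sketch of this collapsing rule (using "-" for 𝜙):

```python
# CTC output rule: merge duplicate tokens, then remove the blank symbol.
def ctc_collapse(frame_outputs, blank="-"):
    out, prev = [], None
    for tok in frame_outputs:
        if tok != prev and tok != blank:   # keep a token only when it changes and is not blank
            out.append(tok)
        prev = tok
    return out

print(ctc_collapse(list("--dd-e-e-pp")))   # ['d', 'e', 'e', 'p']
print(ctc_collapse(list("--dd-eee-pp")))   # ['d', 'e', 'p']
```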

Page 52

CTC

• Paired training data: (x1, x2, x3, x4) ↔ 好棒.
• Training uses a cross-entropy loss at every frame, which needs a target token for each of the T classifier outputs — but the pair gives only 2 tokens for 4 acoustic features; the alignment is unknown.

Page 53

CTC – Training

• Paired training data: (x1, x2, x3, x4) ↔ 好棒.
• Candidate frame-level targets (alignments) listed on the slide: 𝜙好棒𝜙, 好𝜙𝜙棒, 好棒𝜙𝜙, 好𝜙棒𝜙, 𝜙好𝜙棒, 𝜙𝜙好棒, 好棒𝜙棒, 好好棒𝜙, 𝜙好棒棒, 好棒棒棒, …
• (An alignment is valid only if it collapses back to 好棒 — e.g., 好棒𝜙棒 collapses to 好棒棒, so it does not count.)
• How do we enumerate all the possible alignments?
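A brute-force sketch that enumerates the valid alignments for this toy example (the actual CTC loss sums over all of them with dynamic programming):

```python
# Every length-4 sequence over {好, 棒, 𝜙} that collapses to "好棒" is a valid alignment.
from itertools import product

def collapse(path, blank="𝜙"):
    out, prev = [], None
    for tok in path:
        if tok != prev and tok != blank:
            out.append(tok)
        prev = tok
    return "".join(out)

alignments = [p for p in product("好棒𝜙", repeat=4) if collapse(p) == "好棒"]
for a in alignments:
    print("".join(a))
print(len(alignments))   # brute force works for T=4; real CTC uses dynamic programming
```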

Page 54

Does CTC work?

[Results from [Graves, et al., ICML’14], V = 7K; example words shown on the slide include "terry?" and "dietary".]

One can increase V to obtain better performance. [Sak, et al., INTERSPEECH’15]

Page 55

Does CTC work?

[Results from [Bahdanau, et al., ICASSP’16] on 80 hours of data.]

Page 56

Issue

The CTC classifier is applied to each encoder output h1, ⋯, h4 separately, so the tokens are generated "independently" — the output at one frame does not take the previously emitted tokens into account.

Page 57

Models to be introduced

• Listen, Attend, and Spell (LAS) [Chorowski, et al., NIPS’15]
• Connectionist Temporal Classification (CTC) [Graves, et al., ICML’06]
• RNN Transducer (RNN-T) [Graves, ICML workshop’12]
• Neural Transducer [Jaitly, et al., NIPS’16]
• Monotonic Chunkwise Attention (MoChA) [Chiu, et al., ICLR’18]

Page 58

RNA (Recurrent Neural Aligner) [Sak, et al., INTERSPEECH’17]

• The decoder still takes one encoder output as input and emits one token per step (e.g., "c", 𝜙, "a", 𝜙 for h1 ⋯ h4), but the previous output is fed back into the next step, so the outputs are no longer independent.
• Question: can one input vector map to multiple tokens, for example "th"?

Page 59

RNN-T

One input vector can map to multiple tokens: for h^t the decoder keeps emitting tokens ("t", then "h", ……) until it outputs 𝜙, and only then moves on to h^{t+1}.

Page 60

RNN-T (the same diagram as Page 59).

Page 61

RNN-T example: for h^t the decoder emits "t", "h", then 𝜙; for h^{t+1} it emits "e", then 𝜙; for h^{t+2} it emits 𝜙 right away; for h^{t+3} it emits "_", then 𝜙. Each input vector contributes exactly one 𝜙, so T inputs yield T 𝜙's in the output.

Page 62

RNN-T (the same example as Page 61).

Page 63

RNN-T

Page 64

Models to be introduced

• Listen, Attend, and Spell (LAS) [Chorowski, et al., NIPS’15]
• Connectionist Temporal Classification (CTC) [Graves, et al., ICML’06]
• RNN Transducer (RNN-T) [Graves, ICML workshop’12]
• Neural Transducer [Jaitly, et al., NIPS’16]
• Monotonic Chunkwise Attention (MoChA) [Chiu, et al., ICLR’18]

Page 65

Neural Transducer

• CTC, RNA, and RNN-T read one encoder output h^t at a time.
• The Neural Transducer instead reads a window of encoder outputs (h^t, h^{t+1}, ⋯, h^{t+w}) at a time, applies attention over the window, and emits tokens (e.g., "c", "a") until it outputs 𝜙 — then it moves to the next window.

Page 66

Neural Transducer

Example: attention over window 1 (h1 ⋯ h4) produces "c", "a", then 𝜙; attention over window 2 (h5 ⋯ h8) produces "t", then 𝜙.

Page 67

Neural Transducer

[Results figure from [Jaitly, et al., NIPS’16].]

Page 68

Models to be introduced

• Listen, Attend, and Spell (LAS) [Chorowski, et al., NIPS’15]
• Connectionist Temporal Classification (CTC) [Graves, et al., ICML’06]
• RNN Transducer (RNN-T) [Graves, ICML workshop’12]
• Neural Transducer [Jaitly, et al., NIPS’16]
• Monotonic Chunkwise Attention (MoChA) [Chiu, et al., ICLR’18]

Page 69

MoChA: Monotonic Chunkwise Attention

The window is no longer fixed. A module similar to attention takes z0 and an encoder output h and answers yes/no: should we stop here and send the information to the decoder?

Page 70

MoChA

Scanning the encoder outputs with z0, the module answers "no", "no", "no", … and then "yes" — the "yes" decides where the window is placed.

Page 71

MoChA

Once "yes" is produced, attention is applied over that window, and the decoder (with z0) outputs exactly one token, e.g., "c", giving z1. MoChA does not output 𝜙. Then the yes/no scan continues for the next token.

Page 72

MoChA

With z1, the scan resumes (Output? no / yes) to place the window for the next token.

Page 73

MoChA

Each window yields exactly one token (MoChA does not output 𝜙): the first window gives "c" (z1), the next gives "a" (z2), and so on. Please refer to the original paper for model training. [Chiu, et al., ICLR’18]

Page 74

Summary

• LAS: just seq2seq.
• CTC: seq2seq whose decoder is a linear classifier.
• RNA: seq2seq that must output one token for each input.
• RNN-T: seq2seq that can output multiple tokens for each input.
• Neural Transducer: an RNN-T that reads one window of inputs at a time.
• MoChA: a Neural Transducer whose window moves and stretches freely.

Page 75

Two Points of View

Seq-to-seq vs. HMM

Source of image: 李琳山老師《數位語音處理概論》 (Prof. Lin-shan Lee, Introduction to Digital Speech Processing)

Page 76

Reference

• [Li, et al., ICASSP’19] Bo Li, Yu Zhang, Tara Sainath, Yonghui Wu, William Chan, Bytes are All You Need: End-to-End Multilingual Speech Recognition and Synthesis with Bytes, ICASSP 2019

• [Bahdanau. et al., ICLR’15] Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio, Neural Machine Translation by Jointly Learning to Align and Translate, ICLR, 2015

• [Bahdanau. et al., ICASSP’16] Dzmitry Bahdanau, Jan Chorowski, Dmitriy Serdyuk, Philemon Brakel, Yoshua Bengio, End-to-End Attention-based Large Vocabulary Speech Recognition, ICASSP, 2016

• [Chan, et al., ICASSP’16] William Chan, Navdeep Jaitly, Quoc V. Le, Oriol Vinyals, Listen, Attend and Spell, ICASSP, 2016

• [Chiu, et al., ICLR’18] Chung-Cheng Chiu, Colin Raffel, Monotonic Chunkwise Attention, ICLR, 2018

• [Chiu, et al., ICASSP’18] Chung-Cheng Chiu, Tara N. Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J. Weiss, Kanishka Rao, Ekaterina Gonina, Navdeep Jaitly, Bo Li, Jan Chorowski, Michiel Bacchiani, State-of-the-art Speech Recognition With Sequence-to-Sequence Models, ICASSP, 2018

Page 77

Reference

• [Chorowski, et al., NIPS’15] Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, Yoshua Bengio, Attention-Based Models for Speech Recognition, NIPS, 2015

• [Huang, et al., arXiv’19] Hongzhao Huang, Fuchun Peng, An Empirical Study of Efficient ASR Rescoring with Transformers, arXiv, 2019

• [Graves, et al., ICML’06] Alex Graves, Santiago Fernández, Faustino Gomez, Jürgen Schmidhuber, Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks, ICML, 2006

• [Graves, ICML workshop’12] Alex Graves, Sequence Transduction with Recurrent Neural Networks, ICML workshop, 2012

• [Graves, et al., ICML’14] Alex Graves, Navdeep Jaitly, Towards end-to-end speech recognition with recurrent neural networks, ICML, 2014

• [Lu, et al., INTERSPEECH’15] Liang Lu, Xingxing Zhang, Kyunghyun Cho, Steve Renals, A Study of the Recurrent Neural Network Encoder-Decoder for Large Vocabulary Speech Recognition, INTERSPEECH, 2015

• [Luong, et al., EMNLP’15] Minh-Thang Luong, Hieu Pham, Christopher D. Manning, Effective Approaches to Attention-based Neural Machine Translation, EMNLP, 2015

Page 78

Reference

• [Karita, et al., ASRU’19] Shigeki Karita, Nanxin Chen, Tomoki Hayashi, Takaaki Hori, Hirofumi Inaguma, Ziyan Jiang, Masao Someki, Nelson Enrique Yalta Soplin, Ryuichi Yamamoto, Xiaofei Wang, Shinji Watanabe, Takenori Yoshimura, Wangyou Zhang, A Comparative Study on Transformer vs RNN in Speech Applications, ASRU, 2019

• [Soltau, et al., ICASSP’14] Hagen Soltau, George Saon, Tara N. Sainath, Joint training of convolutional and non-convolutional neural networks, ICASSP, 2014

• [Sak, et al., INTERSPEECH’15] Haşim Sak, Andrew Senior, Kanishka Rao, Françoise Beaufays, Fast and Accurate Recurrent Neural Network Acoustic Models for Speech Recognition, INTERSPEECH, 2015

• [Sak, et al., INTERSPEECH’17] Haşim Sak, Matt Shannon, Kanishka Rao, Françoise Beaufays, Recurrent Neural Aligner: An Encoder-Decoder Neural Network Model for Sequence to Sequence Mapping, INTERSPEECH, 2017

• [Jaitly, et al., NIPS’16] Navdeep Jaitly, Quoc V. Le, Oriol Vinyals, Ilya Sutskever, David Sussillo, Samy Bengio, An Online Sequence-to-Sequence Model Using Partial Conditioning, NIPS, 2016

Page 79

Reference

• [Rao, et al., ASRU’17] Kanishka Rao, Haşim Sak, Rohit Prabhavalkar, Exploring Architectures, Data and Units For Streaming End-to-End Speech Recognition with RNN-Transducer, ASRU, 2017

• [Peddinti, et al., INTERSPEECH’15] Vijayaditya Peddinti, Daniel Povey, Sanjeev Khudanpur, A time delay neural network architecture for efficient modeling of long temporal contexts, INTERSPEECH, 2015

• [Yeh, et al., arXiv’19] Ching-Feng Yeh, Jay Mahadeokar, Kaustubh Kalgaonkar, Yongqiang Wang, Duc Le, Mahaveer Jain, Kjell Schubert, Christian Fuegen, Michael L. Seltzer, Transformer-Transducer: End-to-End Speech Recognition with Self-Attention, arXiv, 2019

• [Zeyer, et al., ASRU’19] Albert Zeyer, Parnia Bahar, Kazuki Irie, Ralf Schlüter, Hermann Ney, A Comparison of Transformer and LSTM Encoder Decoder Models for ASR, ASRU, 2019