
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin

Silicon Valley AI Lab (SVAIL)*

*Presenters: Jesse Engel, Shubho Sengupta

Abstract

We demonstrate a generic speech engine that handles a broad range of scenarios without needing to resort to domain-specific optimizations.

We perform a focused search through model architectures, finding that deep recurrent nets with multiple layers of 2D convolution and layer-to-layer batch normalization perform best.

In many cases, our system is competitive with the transcription performance of human workers when benchmarked on standard datasets.

This approach also adapts easily to multiple languages, which we illustrate by applying essentially the same system to both Mandarin and English speech.

To train on 12,000 hours of speech, we perform many systems optimizations. Placing all computation on distributed GPUs, we can perform synchronous SGD at 3 TFLOP/s per GPU for up to 128 GPUs (weak scaling).

For deployment, we develop a batching scheduler to improve computational efficiency while minimizing latency. We also create specialized matrix multiplication kernels that perform well at small batch sizes.

Combined with forward-only variants of our research models, we achieve a low-latency production system with little loss in recognition accuracy.

Architecture Search / Data Collection

Frequency Convolutions

More layers of 2D convolution generalize better in noisy environments, while more layers of 1D convolution lead to worse generalization.

Model: Temporal Stride: 2, Frequency Stride: 2/layer, 7-Recurrent Layers (38M Params)

Batch Normalization

Batch normalization (between layers, not between timesteps) improves both training and generalization in deep recurrent networks.
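To make "between layers, not between timesteps" concrete, here is a minimal NumPy sketch of one forward recurrent layer with sequence-wise batch normalization applied only to the input-to-hidden term. The shapes, the ReLU nonlinearity, and the training-time-only statistics are assumptions of the sketch, not the exact SVAIL implementation:

```python
import numpy as np

def sequence_batchnorm(Wx, gamma, beta, eps=1e-5):
    """Sequence-wise batch norm: normalize the input-to-hidden pre-activation over
    both the batch and time dimensions (one mean/variance per hidden unit), rather
    than keeping separate statistics at every timestep. Wx: (time, batch, hidden)."""
    mean = Wx.mean(axis=(0, 1), keepdims=True)
    var = Wx.var(axis=(0, 1), keepdims=True)
    return gamma * (Wx - mean) / np.sqrt(var + eps) + beta

def rnn_layer_with_bn(x, W, U, b, gamma, beta):
    """One forward recurrent layer, h_t = relu(BN(W x_t) + U h_{t-1} + b).
    BN touches only the layer-to-layer term, so the recurrent dynamics are unchanged.
    x: (time, batch, input); returns (time, batch, hidden)."""
    T, B, _ = x.shape
    H = U.shape[0]
    wx = sequence_batchnorm(x @ W, gamma, beta)   # normalized once over (time, batch)
    h = np.zeros((T, B, H))
    h_prev = np.zeros((B, H))
    for t in range(T):
        h_prev = np.maximum(0.0, wx[t] + h_prev @ U.T + b)
        h[t] = h_prev
    return h

# Illustrative usage (hypothetical sizes):
# x = np.random.randn(50, 16, 160)
# h = rnn_layer_with_bn(x, W=np.random.randn(160, 256) * 0.01,
#                       U=np.random.randn(256, 256) * 0.01,
#                       b=np.zeros(256), gamma=np.ones(256), beta=np.zeros(256))
```

Normalizing W x_t over both the batch and time dimensions gives a single mean and variance per hidden unit, while the recurrent term U h_{t-1} is left untouched.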

Deep Recurrence

More layers of recurrence help for both Gated Recurrent Units (GRUs) and simple recurrent units. The increase at 7 GRU layers is likely due to too few parameters (the layers are too thin).

Model: 1x1D Convolution Layer (38M Params)

Data Scaling

WER decreases following an approximate power law with the amount of training audio (~40% relative per decade). There is a near-constant relative offset between noisy and clean generalization (~60%).

Model: 2x2D Convolution, Temporal Stride: 2, Frequency Stride: 2/layer, 7-Recurrent Layers (70M Params)
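To make the power-law claim concrete, here is a small worked example; the 10% starting WER and the hour counts below are illustrative numbers, not results from the poster:

```python
# If WER falls ~40% relative for every 10x increase in training audio, then
#   WER(h) = a * h**(-k),  WER(10h)/WER(h) = 10**(-k) = 0.60  =>  k = -log10(0.60) ~ 0.22
import math

k = -math.log10(0.60)          # ~ 0.222

def projected_wer(wer_now, hours_now, hours_later):
    """Extrapolate WER under the fitted power law (illustrative only)."""
    return wer_now * (hours_later / hours_now) ** (-k)

print(round(k, 3))                                  # 0.222
print(round(projected_wer(10.0, 1200, 12000), 2))   # 10% WER at 1.2k hours -> ~6.0 at 12k hours
```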

Model Size

WER is measured as a function of model size (millions of parameters) for two architectures:

GRU: 1x2D Convolution, Temporal Stride: 3, Frequency Stride: 2, 3-Recurrent Layers, Bigrams

Simple Recurrent: 3x2D Convolution, Temporal Stride: 3, Frequency Stride: 2/layer, 7-Recurrent Layers, Bigrams

[Figure: network architecture. Spectrogram input, followed by convolution layers, recurrent layers, a fully connected layer, and Connectionist Temporal Classification (CTC). Two configurations shown: 1x1D Convolution, 7-Recurrent Layers (38M Params); and 3x2D Convolution (Bigrams / BN), Temporal Stride: 3, Frequency Stride: 2/layer, 7-Recurrent Layers (100M Params).]
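For orientation, here is a rough PyTorch sketch of the pipeline the diagram describes (spectrogram → 2D convolutions over time and frequency → stacked recurrent layers → fully connected → CTC). The kernel sizes, strides, hidden width, stock bidirectional GRU, and character count are illustrative assumptions; the actual models use custom recurrent layers, batch normalization, and in some configurations bigram outputs:

```python
import torch
import torch.nn as nn

class DeepSpeech2Like(nn.Module):
    """A rough DS2-style network sketch (layer sizes and strides are illustrative)."""
    def __init__(self, n_freq=161, n_hidden=1280, n_rnn_layers=7, n_classes=29):
        super().__init__()
        # 2D convolutions over (time, frequency); striding cuts the number of timesteps.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=(11, 41), stride=(2, 2), padding=(5, 20)),
            nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=(11, 21), stride=(1, 2), padding=(5, 10)),
            nn.ReLU(),
        )
        conv_out = 32 * (((n_freq + 1) // 2 + 1) // 2)   # channels * reduced frequency bins
        self.rnn = nn.GRU(conv_out, n_hidden, num_layers=n_rnn_layers, bidirectional=True)
        self.fc = nn.Linear(2 * n_hidden, n_classes)      # characters plus the CTC blank

    def forward(self, spec):                   # spec: (batch, time, freq)
        x = self.conv(spec.unsqueeze(1))       # (batch, 32, time', freq')
        b, c, t, f = x.shape
        x = x.permute(2, 0, 1, 3).reshape(t, b, c * f)   # (time', batch, features)
        x, _ = self.rnn(x)
        return self.fc(x).log_softmax(dim=-1)  # (time', batch, classes), ready for a CTC loss
```

The returned log-probabilities are in the (time, batch, classes) layout that a CTC loss expects.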

SortaGrad: dev WER with and without sorting the first epoch by utterance length:

              Not Sorted   Sorted
Baseline      11.96        10.83
Batch Norm     9.78         9.52

[Figure (Frequency Convolutions): clean and noisy dev WER vs. number of convolution layers (1-3), comparing 1D and 2D convolutions.]

[Figure (Deep Recurrence): clean dev WER vs. number of recurrent layers (1, 3, 5, 7), comparing GRU and simple recurrent networks.]

Batch Normalization: train and dev WER for increasingly deep networks, with and without BN:

Layers (Recurrent/Total)   Width   Train, No BN   Train, BN   Dev, No BN   Dev, BN
1/5                        2400    10.55          11.99       13.55        14.40
3/5                        1880     9.55           8.29       11.61        10.56
5/7                        1510     8.59           7.61       10.77         9.78
7/9                        1280     8.76           7.68       10.83         9.52

[Figure (Data Scaling): clean and noisy dev WER vs. hours of training audio (roughly 100 to 10,000, log-log scale).]

[Figure (Model Size): clean and noisy dev WER vs. model size (roughly 20-100 million parameters) for the GRU and simple recurrent architectures.]

[Figure (Batch Normalization): CTC cost vs. training iterations (x10^5) for networks with 1 and 7 recurrent layers, with and without BN.]

Diverse Datasets

English training data:

Dataset       Speech Type      Hours
WSJ           read                80
Switchboard   conversational     300
Fisher        conversational    2000
LibriSpeech   read               960
Baidu         read              5000
Baidu         mixed             3600
Total                          11940

Mandarin training data:

Dataset               Type          Hours
Unaccented Mandarin   spontaneous    5000
Accented Mandarin     spontaneous    3000
Assorted Mandarin     mixed          1400
Total                                9400

Single Model Approaching Human Performance on Many Test Sets

Read Speech (WER):

Test set                  DS1     DS2     Human
WSJ eval'92               4.94    3.60    5.03
WSJ eval'93               6.94    4.98    8.08
LibriSpeech test-clean    7.89    5.33    5.83
LibriSpeech test-other   21.74   13.25   12.69

[Figure (Bigrams): WER vs. temporal stride for unigram and bigram outputs, with and without a language model.]

Accented Speech (WER):

Test set                      DS1     DS2     Human
VoxForge American-Canadian   15.01    7.55    4.85
VoxForge Commonwealth        28.46   13.56    8.15
VoxForge European            31.20   17.55   12.76
VoxForge Indian              45.35   22.44   22.15

Noisy Speech (WER):

Test set           DS1     DS2     Human
CHiME eval clean    6.30    3.34    3.46
CHiME eval real    67.94   21.79   11.84
CHiME eval sim     80.27   45.05   31.33

Architectural Improvements Generalize to Mandarin ASR

Dev error (WER for English, CER for Mandarin), with and without a language model:

Language   Architecture            Dev, No LM   Dev, LM
English    5 layers, 1 recurrent   27.79        14.39
English    9 layers, 7 recurrent   14.93         9.52
Mandarin   5 layers, 1 recurrent    9.80         7.13
Mandarin   9 layers, 7 recurrent    7.55         5.81

Mandarin dev and test error (CER) by architecture:

Architecture                            Dev    Test
5 layers, 1 recurrent                   7.13   15.41
5 layers, 3 recurrent                   6.49   11.85
5 layers, 3 recurrent + BN              6.22    9.39
9 layers, 7 recurrent + BN + 2D conv    5.81    7.93

Bigrams

Model: 3x2D Convolution, Temporal Stride: 4, Frequency Stride: 2/layer, 70M Params

Inspired by Mandarin, we can maintain performance while reducing the number of timesteps (thus dramatically reducing computation) by creating a larger alphabet (bigrams) with fewer characters per second. This decreases the entropy density of the language, allowing for more temporal compression. (e.g., [the_cats_sat] -> [th, e, _, ca, ts, _, sa, t])
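One way to implement the bigram encoding in the example above; the segmentation rules beyond the single example shown are an assumption of this sketch:

```python
def to_bigrams(text):
    """Encode a transcript with non-overlapping bigrams: each word is split into
    character pairs, an odd trailing character stays a unigram, and the word
    separator stays a unigram, so "the_cats_sat" -> th e _ ca ts _ sa t."""
    out = []
    for word in text.split("_"):
        pairs = [word[i:i + 2] for i in range(0, len(word) - 1, 2)]
        if len(word) % 2:                 # odd-length word keeps its last char as a unigram
            pairs.append(word[-1])
        out.extend(pairs)
        out.append("_")                   # restore the separator between words
    return out[:-1]                       # drop the trailing separator

print(to_bigrams("the_cats_sat"))         # ['th', 'e', '_', 'ca', 'ts', '_', 'sa', 't']
```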

Symbol rate and entropy comparison:

Alphabet   Chars / Sec   Bits / Char   Bits / Sec
Unigram    14.1           4.1          58.1
Bigram      9.6           2.4          23.3
Mandarin    3.2          12.6          40.7

SortaGrad

“A Sort of Stochastic Gradient Descent”: curriculum learning with utterance length as a proxy for difficulty. Sorting the first epoch by length improves convergence stability and final values.
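A minimal sketch of the SortaGrad sampling schedule, assuming the dataset is a list of (audio, transcript, duration) tuples (the tuple format is an assumption of the sketch):

```python
import random

def sortagrad_batches(dataset, batch_size, epoch):
    """Yield minibatches for one epoch. In the first epoch, utterances are processed
    shortest-first (length as a proxy for difficulty); afterwards the usual random
    shuffling is restored."""
    if epoch == 0:
        order = sorted(dataset, key=lambda ex: ex[2])          # sort by duration
    else:
        order = random.sample(dataset, len(dataset))           # shuffled copy
    for i in range(0, len(order), batch_size):
        yield order[i:i + batch_size]
```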

English training set: a mix of conversational, read, and spontaneous speech; average utterance length ~6 seconds.


Mandarin training set: a mix of conversational, read, and spontaneous speech; average utterance length ~3 seconds. (Metrics: WER for English, CER for Mandarin.)

Key to the test-set comparisons above:

DS 1 (Hannun et al., Deep Speech: Scaling up end-to-end speech recognition, 2014, http://arxiv.org/abs/1412.5567): 1x1D Convolution, Temporal Stride: 2, Unigrams / No BN, 1-Recurrent Layer (38M Params).

DS 2: this work.

Human: Amazon Mechanical Turk transcription. Each utterance was transcribed by 2 turkers, averaging 30 seconds of work time per utterance.

Systems Optimizations / Deployment

[Figure (Computing CTC on GPU): the CTC forward (α) and backward (β) recursions over a lattice of output symbols and timesteps 1 … T.]

Node Architecture

Optimized compute-dense node with 53 TFLOP/s peak compute and a large PCI root complex for efficient communication.

Scaling Neural Network Training

We built a system to train deep recurrent neural networks that scales linearly from 1 to 128 GPUs while sustaining 3 TFLOP/s of computational throughput per GPU throughout an entire training run (weak scaling). We did this by carefully designing our node, building an efficient communication layer, and writing a very fast implementation of the CTC cost function, so that our entire network runs on the GPU.

Computing CTC on GPU

An optimized CTC implementation on the GPU is significantly faster than the corresponding CPU implementation. This also eliminates all round-trip copies between CPU and GPU. All times are in seconds for training one epoch of a 5-layer network with 3 bidirectional recurrent layers.
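As an illustration of why keeping CTC on the GPU matters, the sketch below uses PyTorch's stock nn.CTCLoss rather than SVAIL's custom kernel: the activations, the loss, and the gradient all stay on the device, so no per-iteration CPU-GPU copies of the full activations are needed. The tensor sizes are arbitrary placeholders:

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
T, B, C = 300, 32, 29                      # timesteps, batch, characters incl. the blank
log_probs = torch.randn(T, B, C, device=device).log_softmax(-1).requires_grad_()
targets = torch.randint(1, C, (B, 50), device=device)          # dummy transcripts
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 50, dtype=torch.long)

loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
loss.backward()                            # gradient w.r.t. log_probs stays on the GPU
```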

Fast MPI AllReduce

We use synchronous SGD for deterministic results from run to run. The weights are synchronized through MPI Allreduce. We wrote our own Allreduce implementation that is 20x faster than OpenMPI's Allreduce for up to 16 GPUs, and it scales up to 128 GPUs while maintaining its superior performance.
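A minimal sketch of the weight-synchronization step, using mpi4py's Allreduce for illustration (the poster's implementation is a custom, faster Allreduce; `compute_gradients` and `apply_update` below are hypothetical helpers):

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

def allreduce_gradients(local_grad: np.ndarray) -> np.ndarray:
    """Sum each worker's gradient across all processes, then average, so every
    replica applies an identical update (deterministic run-to-run behavior)."""
    summed = np.empty_like(local_grad)
    comm.Allreduce(local_grad, summed, op=MPI.SUM)
    return summed / comm.Get_size()

# e.g. inside the training loop:
# grads = compute_gradients(minibatch)     # hypothetical per-GPU gradient
# grads = allreduce_gradients(grads)
# apply_update(weights, grads)             # hypothetical optimizer step
```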

Row Convolutions

[Figure: a row convolution layer stacked on top of the recurrent layer; each output h_t is computed from the current and a small number of future timesteps.]

A special convolution layer captures a forward context of a fixed number of timesteps into the future. This allows us to train and deploy a model with only forward recurrent layers. Models with only forward recurrent layers satisfy the latency constraints for deployment, since we do not need to wait for the end of the utterance.

As the research vs. production comparison below shows, the production system with only forward recurrent layers is very close in performance to the research system with bidirectional layers on two different test sets.
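A minimal NumPy sketch of the row convolution idea described above: each feature at time t is a weighted sum of that same feature over the current and a few future timesteps (the context size and weights in the usage line are illustrative):

```python
import numpy as np

def row_convolution(h, W):
    """h: (time, hidden) activations of the last forward recurrent layer.
    W: (context, hidden) per-timestep mixing weights for the current + future frames.
    Gives a forward-only network a small look-ahead without full bidirectionality."""
    T, H = h.shape
    C = W.shape[0]
    h_pad = np.vstack([h, np.zeros((C - 1, H))])      # pad so the last frames see a full window
    out = np.zeros_like(h)
    for t in range(T):
        out[t] = np.sum(W * h_pad[t:t + C], axis=0)   # elementwise over the window, per feature
    return out

# e.g. a context of the current frame plus 2 future frames (hypothetical sizes):
# out = row_convolution(np.random.randn(100, 1280), W=np.random.randn(3, 1280) * 0.01)
```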

Deployment Optimized Matrix-Matrix Multiply

We wrote a special matrix-matrix multiply kernel that is efficient for half precision and small batch sizes. The deployment system uses small batches to minimize latency and half precision to minimize memory footprint. Our kernel is 2x faster than the publicly available kernels from Nervana Systems.
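For a sense of how such a kernel is evaluated, here is a small throughput microbenchmark in the spirit of the figure below, using PyTorch's stock half-precision matmul rather than the custom kernel; it requires a CUDA GPU, and the layer width is an assumed value:

```python
import time
import torch

def gemm_tflops(batch, hidden=2560, dtype=torch.float16, iters=200):
    """Measure matrix-multiply throughput at deployment-sized batch sizes."""
    a = torch.randn(batch, hidden, device="cuda", dtype=dtype)
    w = torch.randn(hidden, hidden, device="cuda", dtype=dtype)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        a @ w
    torch.cuda.synchronize()
    flops = 2.0 * batch * hidden * hidden * iters
    return flops / (time.time() - start) / 1e12

# Sweep the small batch sizes that matter for latency-bound serving:
# for b in range(1, 11):
#     print(b, round(gemm_tflops(b), 3), "TFLOP/s")
```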

Batch Dispatch

We built a batching scheduler that assembles streams of data from user requests into batches before doing forward propagation. A large batch size results in more computational efficiency but also higher latency. As expected, batching works best when the server is heavily loaded: as load increases, the distribution shifts to favor processing requests in larger batches. Even for a light load of only 10 concurrent user requests, our system performs more than half the work in batches with at least 2 samples.
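A toy sketch of the batching idea, not the production scheduler: requests queue up while the GPU is busy, and a worker forwards whatever has accumulated as a single batch, so batch size grows naturally with load:

```python
import queue
import threading

class BatchDispatcher:
    """Collect concurrently submitted requests and process them in batches."""
    def __init__(self, forward_fn, max_batch=10):
        self.q = queue.Queue()
        self.forward_fn = forward_fn        # runs forward propagation on a list of requests
        self.max_batch = max_batch
        threading.Thread(target=self._worker, daemon=True).start()

    def submit(self, request):
        self.q.put(request)                 # called concurrently by request handlers

    def _worker(self):
        while True:
            batch = [self.q.get()]          # block until at least one request arrives
            while len(batch) < self.max_batch:
                try:
                    batch.append(self.q.get_nowait())
                except queue.Empty:
                    break                   # nothing else waiting; don't add latency
            self.forward_fn(batch)          # heavier load -> naturally larger batches
```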

[Figure (Node Architecture): 8 GPUs connected through PLX PCIe switches to 2 CPUs.]

CTC time per training epoch (seconds):

Language   CPU CTC Time   GPU CTC Time   Speedup
English    5888.12        203.56         28.9
Mandarin   1688.01        135.05         12.5

Research vs. production system accuracy (CER, internal Mandarin voice search test set):

System       Clean   Noisy
Research      6.05    9.75
Production    6.10   10.00

[Figure (Scaling Neural Network Training): time per epoch (seconds) vs. number of GPUs (1-128) for a narrow deep network and a wide shallow network.]

[Figure (Fast MPI AllReduce): Allreduce time (seconds) vs. number of GPUs (4-128), comparing our implementation with OpenMPI Allreduce.]

[Figure (Deployment Optimized Matrix-Matrix Multiply): TeraFLOP/s vs. batch size (1-10) for the Baidu HGEMM and Nervana HGEMM kernels.]

[Figure (Batch Dispatch): probability that a request is processed in a batch of a given size, vs. batch size (1-10), for loads of 10, 20, and 30 concurrent streams.]
