
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin

Silicon Valley AI Lab (SVAIL)*

*Presenters: Jesse Engel, Shubho Sengupta

Abstract

We demonstrate a generic speech engine that handles a broad range of scenarios without needing to resort to domain-specific optimizations.

We perform a focused search through model architectures, finding that deep recurrent nets with multiple layers of 2D convolution and layer-to-layer batch normalization perform best.

In many cases, our system is competitive with the transcription performance of human workers when benchmarked on standard datasets.

This approach also adapts easily to multiple languages, which we illustrate by applying essentially the same system to both Mandarin and English speech.

To train on 12,000 hours of speech, we perform many systems optimizations. Placing all computation on distributed GPUs, we can perform synchronous SGD at 3 TFLOP/s per GPU for up to 128 GPUs (weak scaling).

For deployment, we develop a batching scheduler to improve computational efficiency while minimizing latency. We also create specialized matrix multiplication kernels that perform well at small batch sizes.

Combined with forward-only variants of our research models, we achieve a low-latency production system with little loss in recognition accuracy.

Architecture Search / Data Collection

Frequency Convolutions

More layers of 2D convolution generalize better in noisy environments, while more layers of 1D convolution lead to worse generalization.

Model: Temporal Stride: 2, Frequency Stride: 2/layer, 7-Recurrent Layers (38M Params)

Batch Normalization

Batch normalization (between layers, not between timesteps) improves both training and generalization in deep recurrent networks.
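To make "between layers, not between timesteps" concrete, here is a minimal NumPy sketch of one forward recurrent layer with sequence-wise batch normalization applied only to the input-to-hidden term. The shapes, the ReLU nonlinearity, and the training-time-only statistics are assumptions of the sketch, not the exact SVAIL implementation:

```python
import numpy as np

def sequence_batchnorm(Wx, gamma, beta, eps=1e-5):
    """Sequence-wise batch norm: normalize the input-to-hidden pre-activation over
    both the batch and time dimensions (one mean/variance per hidden unit), rather
    than keeping separate statistics at every timestep. Wx: (time, batch, hidden)."""
    mean = Wx.mean(axis=(0, 1), keepdims=True)
    var = Wx.var(axis=(0, 1), keepdims=True)
    return gamma * (Wx - mean) / np.sqrt(var + eps) + beta

def rnn_layer_with_bn(x, W, U, b, gamma, beta):
    """One forward recurrent layer, h_t = relu(BN(W x_t) + U h_{t-1} + b).
    BN touches only the layer-to-layer term, so the recurrent dynamics are unchanged.
    x: (time, batch, input); returns (time, batch, hidden)."""
    T, B, _ = x.shape
    H = U.shape[0]
    wx = sequence_batchnorm(x @ W, gamma, beta)   # normalized once over (time, batch)
    h = np.zeros((T, B, H))
    h_prev = np.zeros((B, H))
    for t in range(T):
        h_prev = np.maximum(0.0, wx[t] + h_prev @ U.T + b)
        h[t] = h_prev
    return h

# Illustrative usage (hypothetical sizes):
# x = np.random.randn(50, 16, 160)
# h = rnn_layer_with_bn(x, W=np.random.randn(160, 256) * 0.01,
#                       U=np.random.randn(256, 256) * 0.01,
#                       b=np.zeros(256), gamma=np.ones(256), beta=np.zeros(256))
```

Normalizing W x_t over both the batch and time dimensions gives a single mean and variance per hidden unit, while the recurrent term U h_{t-1} is left untouched.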

Deep Recurrence

More layers of recurrence help for both Gated Recurrent Units (GRUs) and simple recurrent units. The increase at 7 GRU layers is likely due to too few parameters (the layers are too thin).

Model: 1x1D Convolution Layer (38M Params)

Data Scaling

WER decreases following an approximate power law with the amount of training audio (~40% relative per decade). There is a near-constant relative offset between noisy and clean generalization (~60%).

Model: 2x2D Convolution, Temporal Stride: 2, Frequency Stride: 2/layer, 7-Recurrent Layers (70M Params)
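To make the power-law claim concrete, here is a small worked example; the 10% starting WER and the hour counts below are illustrative numbers, not results from the poster:

```python
# If WER falls ~40% relative for every 10x increase in training audio, then
#   WER(h) = a * h**(-k),  WER(10h)/WER(h) = 10**(-k) = 0.60  =>  k = -log10(0.60) ~ 0.22
import math

k = -math.log10(0.60)          # ~ 0.222

def projected_wer(wer_now, hours_now, hours_later):
    """Extrapolate WER under the fitted power law (illustrative only)."""
    return wer_now * (hours_later / hours_now) ** (-k)

print(round(k, 3))                                  # 0.222
print(round(projected_wer(10.0, 1200, 12000), 2))   # 10% WER at 1.2k hours -> ~6.0 at 12k hours
```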

Model Size

WER is measured as a function of model size (millions of parameters) for two architectures:

GRU: 1x2D Convolution, Temporal Stride: 3, Frequency Stride: 2, 3-Recurrent Layers, Bigrams

Simple Recurrent: 3x2D Convolution, Temporal Stride: 3, Frequency Stride: 2/layer, 7-Recurrent Layers, Bigrams

[Figure: network architecture. Spectrogram input, followed by convolution layers, recurrent layers, a fully connected layer, and Connectionist Temporal Classification (CTC). Two configurations shown: 1x1D Convolution, 7-Recurrent Layers (38M Params); and 3x2D Convolution (Bigrams / BN), Temporal Stride: 3, Frequency Stride: 2/layer, 7-Recurrent Layers (100M Params).]
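For orientation, here is a rough PyTorch sketch of the pipeline the diagram describes (spectrogram → 2D convolutions over time and frequency → stacked recurrent layers → fully connected → CTC). The kernel sizes, strides, hidden width, stock bidirectional GRU, and character count are illustrative assumptions; the actual models use custom recurrent layers, batch normalization, and in some configurations bigram outputs:

```python
import torch
import torch.nn as nn

class DeepSpeech2Like(nn.Module):
    """A rough DS2-style network sketch (layer sizes and strides are illustrative)."""
    def __init__(self, n_freq=161, n_hidden=1280, n_rnn_layers=7, n_classes=29):
        super().__init__()
        # 2D convolutions over (time, frequency); striding cuts the number of timesteps.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=(11, 41), stride=(2, 2), padding=(5, 20)),
            nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=(11, 21), stride=(1, 2), padding=(5, 10)),
            nn.ReLU(),
        )
        conv_out = 32 * (((n_freq + 1) // 2 + 1) // 2)   # channels * reduced frequency bins
        self.rnn = nn.GRU(conv_out, n_hidden, num_layers=n_rnn_layers, bidirectional=True)
        self.fc = nn.Linear(2 * n_hidden, n_classes)      # characters plus the CTC blank

    def forward(self, spec):                   # spec: (batch, time, freq)
        x = self.conv(spec.unsqueeze(1))       # (batch, 32, time', freq')
        b, c, t, f = x.shape
        x = x.permute(2, 0, 1, 3).reshape(t, b, c * f)   # (time', batch, features)
        x, _ = self.rnn(x)
        return self.fc(x).log_softmax(dim=-1)  # (time', batch, classes), ready for a CTC loss
```

The returned log-probabilities are in the (time, batch, classes) layout that a CTC loss expects.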

SortaGrad: dev WER with and without sorting the first epoch by utterance length:

              Not Sorted   Sorted
Baseline      11.96        10.83
Batch Norm     9.78         9.52

[Figure (Frequency Convolutions): clean and noisy dev WER vs. number of convolution layers (1-3), comparing 1D and 2D convolutions.]

[Figure (Deep Recurrence): clean dev WER vs. number of recurrent layers (1, 3, 5, 7), comparing GRU and simple recurrent networks.]

Batch Normalization: train and dev WER for increasingly deep networks, with and without BN:

Layers (Recurrent/Total)   Width   Train, No BN   Train, BN   Dev, No BN   Dev, BN
1/5                        2400    10.55          11.99       13.55        14.40
3/5                        1880     9.55           8.29       11.61        10.56
5/7                        1510     8.59           7.61       10.77         9.78
7/9                        1280     8.76           7.68       10.83         9.52

[Figure (Data Scaling): clean and noisy dev WER vs. hours of training audio (roughly 100 to 10,000, log-log scale).]

[Figure (Model Size): clean and noisy dev WER vs. model size (roughly 20-100 million parameters) for the GRU and simple recurrent architectures.]

[Figure (Batch Normalization): CTC cost vs. training iterations (x10^5) for networks with 1 and 7 recurrent layers, with and without BN.]

Diverse Datasets

English training data:

Dataset       Speech Type      Hours
WSJ           read                80
Switchboard   conversational     300
Fisher        conversational    2000
LibriSpeech   read               960
Baidu         read              5000
Baidu         mixed             3600
Total                          11940

Mandarin training data:

Dataset               Type          Hours
Unaccented Mandarin   spontaneous    5000
Accented Mandarin     spontaneous    3000
Assorted Mandarin     mixed          1400
Total                                9400

Single Model Approaching Human Performance on Many Test Sets

Read Speech (WER):

Test set                  DS1     DS2     Human
WSJ eval'92               4.94    3.60    5.03
WSJ eval'93               6.94    4.98    8.08
LibriSpeech test-clean    7.89    5.33    5.83
LibriSpeech test-other   21.74   13.25   12.69

[Figure (Bigrams): WER vs. temporal stride for unigram and bigram outputs, with and without a language model.]

Accented Speech (WER):

Test set                      DS1     DS2     Human
VoxForge American-Canadian   15.01    7.55    4.85
VoxForge Commonwealth        28.46   13.56    8.15
VoxForge European            31.20   17.55   12.76
VoxForge Indian              45.35   22.44   22.15

Noisy Speech (WER):

Test set           DS1     DS2     Human
CHiME eval clean    6.30    3.34    3.46
CHiME eval real    67.94   21.79   11.84
CHiME eval sim     80.27   45.05   31.33

Architectural Improvements Generalize to Mandarin ASR

Dev error (WER for English, CER for Mandarin), with and without a language model:

Language   Architecture            Dev, No LM   Dev, LM
English    5 layers, 1 recurrent   27.79        14.39
English    9 layers, 7 recurrent   14.93         9.52
Mandarin   5 layers, 1 recurrent    9.80         7.13
Mandarin   9 layers, 7 recurrent    7.55         5.81

Mandarin dev and test error (CER) by architecture:

Architecture                            Dev    Test
5 layers, 1 recurrent                   7.13   15.41
5 layers, 3 recurrent                   6.49   11.85
5 layers, 3 recurrent + BN              6.22    9.39
9 layers, 7 recurrent + BN + 2D conv    5.81    7.93

Bigrams

Model: 3x2D Convolution, Temporal Stride: 4, Frequency Stride: 2/layer, 70M Params

Inspired by Mandarin, we can maintain performance while reducing the number of timesteps (thus dramatically reducing computation) by creating a larger alphabet (bigrams) with fewer characters per second. This decreases the entropy density of the language, allowing for more temporal compression. (e.g., [the_cats_sat] -> [th, e, _, ca, ts, _, sa, t])
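One way to implement the bigram encoding in the example above; the segmentation rules beyond the single example shown are an assumption of this sketch:

```python
def to_bigrams(text):
    """Encode a transcript with non-overlapping bigrams: each word is split into
    character pairs, an odd trailing character stays a unigram, and the word
    separator stays a unigram, so "the_cats_sat" -> th e _ ca ts _ sa t."""
    out = []
    for word in text.split("_"):
        pairs = [word[i:i + 2] for i in range(0, len(word) - 1, 2)]
        if len(word) % 2:                 # odd-length word keeps its last char as a unigram
            pairs.append(word[-1])
        out.extend(pairs)
        out.append("_")                   # restore the separator between words
    return out[:-1]                       # drop the trailing separator

print(to_bigrams("the_cats_sat"))         # ['th', 'e', '_', 'ca', 'ts', '_', 'sa', 't']
```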

Symbol rate and entropy comparison:

Alphabet   Chars / Sec   Bits / Char   Bits / Sec
Unigram    14.1           4.1          58.1
Bigram      9.6           2.4          23.3
Mandarin    3.2          12.6          40.7

SortaGrad

“A Sort of Stochastic Gradient Descent”: curriculum learning with utterance length as a proxy for difficulty. Sorting the first epoch by length improves convergence stability and final values.
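A minimal sketch of the SortaGrad sampling schedule, assuming the dataset is a list of (audio, transcript, duration) tuples (the tuple format is an assumption of the sketch):

```python
import random

def sortagrad_batches(dataset, batch_size, epoch):
    """Yield minibatches for one epoch. In the first epoch, utterances are processed
    shortest-first (length as a proxy for difficulty); afterwards the usual random
    shuffling is restored."""
    if epoch == 0:
        order = sorted(dataset, key=lambda ex: ex[2])          # sort by duration
    else:
        order = random.sample(dataset, len(dataset))           # shuffled copy
    for i in range(0, len(order), batch_size):
        yield order[i:i + batch_size]
```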

English training set: a mix of conversational, read, and spontaneous speech; average utterance length ~6 seconds.


Mandarin training set: a mix of conversational, read, and spontaneous speech; average utterance length ~3 seconds. (Metrics: WER for English, CER for Mandarin.)

Key to the test-set comparisons above:

DS 1 (Hannun et al., Deep Speech: Scaling up end-to-end speech recognition, 2014, http://arxiv.org/abs/1412.5567): 1x1D Convolution, Temporal Stride: 2, Unigrams / No BN, 1-Recurrent Layer (38M Params).

DS 2: this work.

Human: Amazon Mechanical Turk transcription. Each utterance was transcribed by 2 turkers, averaging 30 seconds of work time per utterance.

Systems Optimizations / Deployment

[Figure (Computing CTC on GPU): the CTC forward (α) and backward (β) recursions over a lattice of output symbols and timesteps 1 … T.]

Node Architecture

Optimized compute-dense node with 53 TFLOP/s peak compute and a large PCI root complex for efficient communication.

Scaling Neural Network Training

We built a system to train deep recurrent neural networks that scales linearly from 1 to 128 GPUs while sustaining 3 TFLOP/s of computational throughput per GPU throughout an entire training run (weak scaling). We did this by carefully designing our node, building an efficient communication layer, and writing a very fast implementation of the CTC cost function, so that our entire network runs on the GPU.

Computing CTC on GPU

An optimized CTC implementation on the GPU is significantly faster than the corresponding CPU implementation. This also eliminates all round-trip copies between CPU and GPU. All times are in seconds for training one epoch of a 5-layer network with 3 bidirectional recurrent layers.
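As an illustration of why keeping CTC on the GPU matters, the sketch below uses PyTorch's stock nn.CTCLoss rather than SVAIL's custom kernel: the activations, the loss, and the gradient all stay on the device, so no per-iteration CPU-GPU copies of the full activations are needed. The tensor sizes are arbitrary placeholders:

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
T, B, C = 300, 32, 29                      # timesteps, batch, characters incl. the blank
log_probs = torch.randn(T, B, C, device=device).log_softmax(-1).requires_grad_()
targets = torch.randint(1, C, (B, 50), device=device)          # dummy transcripts
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 50, dtype=torch.long)

loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
loss.backward()                            # gradient w.r.t. log_probs stays on the GPU
```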

Fast MPI AllReduce

We use synchronous SGD for deterministic results from run to run. The weights are synchronized through MPI Allreduce. We wrote our own Allreduce implementation that is 20x faster than OpenMPI's Allreduce for up to 16 GPUs, and it scales up to 128 GPUs while maintaining its superior performance.
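A minimal sketch of the weight-synchronization step, using mpi4py's Allreduce for illustration (the poster's implementation is a custom, faster Allreduce; `compute_gradients` and `apply_update` below are hypothetical helpers):

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

def allreduce_gradients(local_grad: np.ndarray) -> np.ndarray:
    """Sum each worker's gradient across all processes, then average, so every
    replica applies an identical update (deterministic run-to-run behavior)."""
    summed = np.empty_like(local_grad)
    comm.Allreduce(local_grad, summed, op=MPI.SUM)
    return summed / comm.Get_size()

# e.g. inside the training loop:
# grads = compute_gradients(minibatch)     # hypothetical per-GPU gradient
# grads = allreduce_gradients(grads)
# apply_update(weights, grads)             # hypothetical optimizer step
```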

Row Convolutions

[Figure: a row convolution layer stacked on top of the recurrent layer; each output h_t is computed from the current and a small number of future timesteps.]

A special convolution layer captures a forward context of a fixed number of timesteps into the future. This allows us to train and deploy a model with only forward recurrent layers. Models with only forward recurrent layers satisfy the latency constraints for deployment, since we do not need to wait for the end of the utterance.

As the research vs. production comparison below shows, the production system with only forward recurrent layers is very close in performance to the research system with bidirectional layers on two different test sets.
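A minimal NumPy sketch of the row convolution idea described above: each feature at time t is a weighted sum of that same feature over the current and a few future timesteps (the context size and weights in the usage line are illustrative):

```python
import numpy as np

def row_convolution(h, W):
    """h: (time, hidden) activations of the last forward recurrent layer.
    W: (context, hidden) per-timestep mixing weights for the current + future frames.
    Gives a forward-only network a small look-ahead without full bidirectionality."""
    T, H = h.shape
    C = W.shape[0]
    h_pad = np.vstack([h, np.zeros((C - 1, H))])      # pad so the last frames see a full window
    out = np.zeros_like(h)
    for t in range(T):
        out[t] = np.sum(W * h_pad[t:t + C], axis=0)   # elementwise over the window, per feature
    return out

# e.g. a context of the current frame plus 2 future frames (hypothetical sizes):
# out = row_convolution(np.random.randn(100, 1280), W=np.random.randn(3, 1280) * 0.01)
```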

Deployment Optimized Matrix-Matrix Multiply

We wrote a special matrix-matrix multiply kernel that is efficient for half precision and small batch sizes. The deployment system uses small batches to minimize latency and half precision to minimize memory footprint. Our kernel is 2x faster than the publicly available kernels from Nervana Systems.
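For a sense of how such a kernel is evaluated, here is a small throughput microbenchmark in the spirit of the figure below, using PyTorch's stock half-precision matmul rather than the custom kernel; it requires a CUDA GPU, and the layer width is an assumed value:

```python
import time
import torch

def gemm_tflops(batch, hidden=2560, dtype=torch.float16, iters=200):
    """Measure matrix-multiply throughput at deployment-sized batch sizes."""
    a = torch.randn(batch, hidden, device="cuda", dtype=dtype)
    w = torch.randn(hidden, hidden, device="cuda", dtype=dtype)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        a @ w
    torch.cuda.synchronize()
    flops = 2.0 * batch * hidden * hidden * iters
    return flops / (time.time() - start) / 1e12

# Sweep the small batch sizes that matter for latency-bound serving:
# for b in range(1, 11):
#     print(b, round(gemm_tflops(b), 3), "TFLOP/s")
```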

Batch Dispatch

We built a batching scheduler that assembles streams of data from user requests into batches before doing forward propagation. A large batch size results in more computational efficiency but also higher latency. As expected, batching works best when the server is heavily loaded: as load increases, the distribution shifts to favor processing requests in larger batches. Even for a light load of only 10 concurrent user requests, our system performs more than half the work in batches with at least 2 samples.
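A toy sketch of the batching idea, not the production scheduler: requests queue up while the GPU is busy, and a worker forwards whatever has accumulated as a single batch, so batch size grows naturally with load:

```python
import queue
import threading

class BatchDispatcher:
    """Collect concurrently submitted requests and process them in batches."""
    def __init__(self, forward_fn, max_batch=10):
        self.q = queue.Queue()
        self.forward_fn = forward_fn        # runs forward propagation on a list of requests
        self.max_batch = max_batch
        threading.Thread(target=self._worker, daemon=True).start()

    def submit(self, request):
        self.q.put(request)                 # called concurrently by request handlers

    def _worker(self):
        while True:
            batch = [self.q.get()]          # block until at least one request arrives
            while len(batch) < self.max_batch:
                try:
                    batch.append(self.q.get_nowait())
                except queue.Empty:
                    break                   # nothing else waiting; don't add latency
            self.forward_fn(batch)          # heavier load -> naturally larger batches
```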

[Figure (Node Architecture): 8 GPUs connected through PLX PCIe switches to 2 CPUs.]

CTC time per training epoch (seconds):

Language   CPU CTC Time   GPU CTC Time   Speedup
English    5888.12        203.56         28.9
Mandarin   1688.01        135.05         12.5

Research vs. production system accuracy (CER, internal Mandarin voice search test set):

System       Clean   Noisy
Research      6.05    9.75
Production    6.10   10.00

[Figure (Scaling Neural Network Training): time per epoch (seconds) vs. number of GPUs (1-128) for a narrow deep network and a wide shallow network.]

[Figure (Fast MPI AllReduce): Allreduce time (seconds) vs. number of GPUs (4-128), comparing our implementation with OpenMPI Allreduce.]

[Figure (Deployment Optimized Matrix-Matrix Multiply): TeraFLOP/s vs. batch size (1-10) for the Baidu HGEMM and Nervana HGEMM kernels.]

[Figure (Batch Dispatch): probability that a request is processed in a batch of a given size, vs. batch size (1-10), for loads of 10, 20, and 30 concurrent streams.]
