Using Neural Network Language Models for LVCSR


Transcript of Using Neural Network Language Models for LVCSR

Page 1: Using Neural Network Language Models for LVCSR

Using Neural Network Language Models for LVCSR

Holger Schwenk and Jean-Luc Gauvain
Presented by Erin Fitzgerald

CLSP Reading Group

December 10, 2004

Page 2: Using Neural Network Language Models for LVCSR


Introduction

Build and use neural networks to estimate LM posterior probabilities for ASR tasks.

Idea:
• Project word indices onto a continuous space
• The resulting smooth probability functions over word representations generalize better to unseen n-grams
• Still an n-gram approach, but posteriors are interpolated for any possible context; no backing off

Result: significant WER reduction at small computational cost.

Page 3: Using Neural Network Language Models for LVCSR


Architecture

Standard fully connected multilayer perceptron

[Figure: network architecture. The n-1 context words wj-n+1, ..., wj-1 of history hj (each a 1-of-N coded index, N = 51k) are mapped through a shared projection layer into a continuous space of dimension P = 50 (activations ck); a hidden layer of H ≈ 1k units follows (activations dj, weights M, bias b), and an output layer (activations oi, weights V, bias k) has one unit per vocabulary word, giving pi = P(wj = i | hj) for i = 1, ..., N.]

Page 4: Using Neural Network Language Models for LVCSR


Architecture

[Figure: the same network annotated with its layer equations, where ck are the projection-layer outputs, dj the hidden units, and oi the output units.]

Hidden layer: d = tanh(M*c + b)
Output layer: o = V*d + k
Softmax normalization: pk = exp(ok) / Σ_{i=1..N} exp(oi), so that pi = P(wj = i | hj)
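As a concreteness check, here is a minimal sketch of this forward pass in NumPy. All names and the tiny dimensions are illustrative (the real network uses N = 51k, P = 50, H ≈ 1k); this is not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

N, P, H, n = 1000, 50, 200, 4                     # vocab, projection, hidden sizes; n-gram order
Cproj = rng.normal(scale=0.1, size=(N, P))        # shared projection: one P-dim vector per word
M = rng.normal(scale=0.1, size=(H, (n - 1) * P))  # projection -> hidden weights
b = np.zeros(H)                                   # hidden bias
V = rng.normal(scale=0.1, size=(N, H))            # hidden -> output weights
k = np.zeros(N)                                   # output bias

def forward(context_ids):
    """context_ids: the n-1 previous word indices w_{j-n+1}, ..., w_{j-1}."""
    c = Cproj[context_ids].ravel()      # look up and concatenate the projections
    d = np.tanh(M @ c + b)              # hidden layer: d = tanh(M*c + b)
    o = V @ d + k                       # output layer: o = V*d + k
    o -= o.max()                        # shift for numerical stability
    p = np.exp(o) / np.exp(o).sum()     # softmax: p_i = P(w_j = i | h_j)
    return p

p = forward([3, 17, 42])
print(p.shape, p.sum())                 # (1000,) 1.0
```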

Page 5: Using Neural Network Language Models for LVCSR


Training

Train with the standard back-propagation algorithm.
Error function: cross-entropy.
Weight decay regularization is used.

Targets are set to 1 for wj and to 0 otherwise.
These outputs have been shown to converge to the posterior probabilities.

Back-propagation continues through the projection layer, so the NN learns the projection of words onto the continuous space that is best for the probability estimation task.
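A hedged sketch of one such training step, reusing the variables from the forward-pass sketch above; the learning rate and weight-decay constant are invented for illustration.

```python
lr, wd = 0.1, 1e-5   # illustrative learning rate and weight-decay strength

def train_step(context_ids, target_id):
    global V, k, M, b
    # forward pass, kept inline so the intermediates are available for backprop
    c = Cproj[context_ids].ravel()
    d = np.tanh(M @ c + b)
    o = V @ d + k
    o -= o.max()
    p = np.exp(o) / np.exp(o).sum()
    loss = -np.log(p[target_id])             # cross-entropy against a one-hot target

    # backward pass: softmax + cross-entropy gives grad_o = p - target
    grad_o = p.copy()
    grad_o[target_id] -= 1.0
    grad_d = (V.T @ grad_o) * (1 - d ** 2)   # back through tanh
    grad_c = M.T @ grad_d                    # gradient reaching the projection layer

    # parameter updates with weight decay regularization
    V -= lr * (np.outer(grad_o, d) + wd * V)
    k -= lr * grad_o
    M -= lr * (np.outer(grad_d, c) + wd * M)
    b -= lr * grad_d
    # back-prop through the projection layer: only the n-1 context rows move
    # (assumes the context positions hold distinct word ids)
    Cproj[context_ids] -= lr * grad_c.reshape(len(context_ids), -1)
    return loss

print(train_step([3, 17, 42], target_id=7))
```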

Page 6: Using Neural Network Language Models for LVCSR

Optimizations

Page 7: Using Neural Network Language Models for LVCSR


Fast Recognition Techniques

1) Lattice Rescoring
2) Shortlists
3) Regrouping
4) Block mode
5) CPU optimization

Page 8: Using Neural Network Language Models for LVCSR


Fast Recognition Techniques

1) Lattice Rescoring
   • Decode with a standard backoff LM to build lattices, then rescore them with the NN LM (a sketch follows this list)
2) Shortlists
3) Regrouping
4) Block mode
5) CPU optimization
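A minimal illustration of the rescoring step; the lattice representation here (a flat list of arcs) is invented for the example, since real decoders use their own formats.

```python
import math

def rescore_lattice(arcs, lm_prob, lm_weight=1.0):
    """arcs: dicts with 'context' (tuple of word ids), 'word', and an
    'acoustic' log score; rescoring swaps in the new LM probability."""
    for arc in arcs:
        new_lm = math.log(lm_prob(arc["context"], arc["word"]))
        arc["score"] = arc["acoustic"] + lm_weight * new_lm
    return arcs
```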

Page 9: Using Neural Network Language Models for LVCSR


Fast Recognition Techniques

1) Lattice Rescoring
2) Shortlists
   • The NN only predicts a high-frequency subset of the vocabulary, the shortlist (see the sketch after the formula below)
3) Regrouping
4) Block mode
5) CPU optimization

P(wt | ht) = PN(wt | ht) * Ps(ht)   if wt ∈ shortlist(ht)
P(wt | ht) = PB(wt | ht)            otherwise

where Ps(ht) = Σ_{w ∈ shortlist(ht)} PB(w | ht)

This redistributes the probability mass of the shortlist words: the NN estimate PN is rescaled by the mass Ps that the backoff LM PB assigns to the shortlist in context ht, and the backoff LM covers all other words.
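A sketch of this recombination; `backoff_prob(w, h)` and `nn_shortlist_dist(h)` are assumed helpers (the standard backoff LM and the NN softmax over shortlist words), not functions from the paper.

```python
def shortlist_prob(w, h, shortlist, backoff_prob, nn_shortlist_dist):
    if w in shortlist:
        # mass the backoff LM assigns to shortlist words in this context
        p_s = sum(backoff_prob(v, h) for v in shortlist)
        return nn_shortlist_dist(h)[w] * p_s    # P_N(w|h) * P_s(h)
    return backoff_prob(w, h)                   # P_B(w|h) for all other words
```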

Page 10: Using Neural Network Language Models for LVCSR


Shortlist optimization

[Figure: the same network with the output layer reduced to the shortlist: outputs pi = P(wj = i | hj) for i = 1, ..., S, where S is the shortlist size (S << N).]

Page 11: Using Neural Network Language Models for LVCSR


Fast Recognition Techniques

1) Lattice Rescoring
2) Shortlists
3) Regrouping – an optimization of #1
   • Collect and sort the LM probability requests
   • For all probability requests sharing the same context ht, only one forward pass is necessary (see the sketch after this list)
4) Block mode
5) CPU optimization
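A sketch of regrouping, under the same illustrative request format as the lattice example above: one forward pass serves every request that shares a context.

```python
from collections import defaultdict

def regrouped_probs(requests, forward):
    """requests: (context, word_id) pairs; forward(context) returns the
    full softmax vector for that context, as in the earlier sketch."""
    by_context = defaultdict(list)
    for context, word in requests:
        by_context[tuple(context)].append(word)

    probs = {}
    for context, words in by_context.items():
        p = forward(list(context))      # one forward pass per distinct context
        for w in words:
            probs[(context, w)] = p[w]
    return probs
```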

Page 12: Using Neural Network Language Models for LVCSR


Fast Recognition Techniques

1) Lattice Rescoring
2) Shortlists
3) Regrouping
4) Block mode
   • Several examples are propagated through the NN at once
   • Takes advantage of faster matrix-matrix operations
5) CPU optimization

Page 13: Using Neural Network Language Models for LVCSR


Block mode calculations

[Figure: the network processing a single example.]

One example at a time:
d = tanh(M*c + b)
o = V*d + k (followed by the softmax)

Page 14: Using Neural Network Language Models for LVCSR


Block mode calculations

[Figure: the network processing a whole bunch at once.]

For a bunch of examples, the projected contexts are stacked as the columns of a matrix C, and the matrix-vector products become matrix-matrix products:
D = tanh(M*C + B)
O = V*D + K
where the columns of B and K repeat the bias vectors b and k.
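A block-mode version of the earlier forward-pass sketch: the projected contexts of a bunch become the columns of a matrix, so the matrix-vector products above turn into matrix-matrix products (the bunch contents are illustrative).

```python
def forward_block(context_id_batch):
    """context_id_batch: a list of contexts, each n-1 word indices."""
    Cmat = np.stack([Cproj[ids].ravel() for ids in context_id_batch], axis=1)
    D = np.tanh(M @ Cmat + b[:, None])         # D = tanh(M*C + B); B repeats b per column
    O = V @ D + k[:, None]                     # O = V*D + K;       K repeats k per column
    O -= O.max(axis=0)
    return np.exp(O) / np.exp(O).sum(axis=0)   # column-wise softmax

P_block = forward_block([[3, 17, 42], [5, 9, 11]])
print(P_block.shape)                           # (N, bunch size) = (1000, 2)
```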

Page 15: Using Neural Network Language Models for LVCSR


Fast Recognition – Test Results

1) Lattice Rescoring – an average of 511 nodes per lattice
2) Shortlists (2000 words) – 90% prediction coverage
   • 3.8M 4-grams requested; 3.4M processed by the NN
3) Regrouping – only 1M forward passes required
4) Block mode – bunch size = 128
5) CPU optimization

Total processing takes under 9 minutes (0.03×RT). Without the optimizations it is 10× slower.

Page 16: Using Neural Network Language Models for LVCSR


Fast Training Techniques

1) Parallel implementations
   • Full connections require low-latency communication; very costly
2) Resampling techniques
   • Floating-point operations perform best on contiguous memory locations

Page 17: Using Neural Network Language Models for LVCSR


Fast Training Techniques

1) Floating-point precision – 1.5× faster
2) Suppress internal calculations – 1.3× faster
3) Bunch mode – 10+× faster
   • Forward and back-propagation for many examples at once (see the sketch after this list)
4) Multiprocessing – 1.5× faster

Training time drops from 47 hours to 1h27m with bunch size 128.
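A sketch of why bunch mode pays off in training too: the forward and backward passes over a whole bunch reduce to a handful of matrix-matrix products (projection update omitted for brevity; bunch contents and rates are illustrative).

```python
def train_bunch(context_id_batch, target_ids, lr=0.1):
    global M, b, V, k
    Cmat = np.stack([Cproj[ids].ravel() for ids in context_id_batch], axis=1)
    D = np.tanh(M @ Cmat + b[:, None])
    O = V @ D + k[:, None]
    O -= O.max(axis=0)
    P_block = np.exp(O) / np.exp(O).sum(axis=0)

    B = len(target_ids)
    G_o = P_block.copy()                 # grad at output: p - t, one column per example
    G_o[target_ids, range(B)] -= 1.0
    G_d = (V.T @ G_o) * (1 - D ** 2)     # back through tanh

    V -= lr * (G_o @ D.T) / B            # one matrix product updates
    k -= lr * G_o.mean(axis=1)           # the whole bunch at once
    M -= lr * (G_d @ Cmat.T) / B
    b -= lr * G_d.mean(axis=1)

train_bunch([[3, 17, 42], [5, 9, 11]], target_ids=[7, 8])
```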

Page 18: Using Neural Network Language Models for LVCSR

Application to CTS and BN LVCSR

Page 19: Using Neural Network Language Models for LVCSR


Application to ASR

Neural net LM techniques focus on CTS because:
• There is far less in-domain training data, giving data sparsity
• The NN can only handle a small amount of training data

New Fisher CTS data: 20M words (vs. 7M previously)
BN data: 500M words

Page 20: Using Neural Network Language Models for LVCSR


Application to CTS

Baseline: train standard backoff LMs for each domain and then interpolate.

Expt #1: interpolate the CTS neural net with the in-domain backoff LM.

Expt #2: interpolate the CTS neural net with the full-data backoff LM.

Page 21: Using Neural Network Language Models for LVCSR


Application to CTS – PPL

Baseline: train standard backoff LMs for each domain and then interpolate.
In-domain PPL: 50.1; full-data PPL: 47.5

Expt #1: interpolate the CTS neural net with the in-domain backoff LM.
In-domain PPL: 45.5

Expt #2: interpolate the CTS neural net with the full-data backoff LM.
Full-data PPL: 44.2

Page 22: Using Neural Network Language Models for LVCSR


Application to CTS – WER

Baseline: train standard backoff LMs for each domain and then interpolate.
In-domain WER: 19.9; full-data WER: 19.3

Expt #1: interpolate the CTS neural net with the in-domain backoff LM.
In-domain WER: 19.1

Expt #2: interpolate the CTS neural net with the full-data backoff LM.
Full-data WER: 18.8

Page 23: Using Neural Network Language Models for LVCSR


Application to BN

Only a subset of the 500M available words could be used for training: a 27M-word train set.

Still useful:
• The NN LM gave a 12% PPL gain over a backoff LM trained on the small 27M set
• The NN LM gave a 4% PPL gain over a backoff LM trained on the full 500M-word training set
• Overall WER reduction of 0.3% absolute

Page 24: Using Neural Network Language Models for LVCSR


Conclusion

Neural net LMs provide significant improvements in PPL and WER.

Optimizations can speed up NN training by 20× and keep lattice rescoring under 0.05×RT.

While the NN LM was developed for and works best on CTS, gains were found in the BN task too.