Using Neural Network Language Models for LVCSR
Transcript of Using Neural Network Language Models for LVCSR
Holger Schwenk and Jean-Luc Gauvain
Presented by Erin Fitzgerald
CLSP Reading Group
December 10, 2004
December 10, 2004 - Using Neural Network LMs for LVCSR - slide 2
Introduction

Build and use neural networks to estimate LM posterior probabilities for ASR tasks.

Idea:
- Project word indices onto a continuous space
- The resulting smooth probability functions of the word representations generalize better to unseen n-grams
- Still an n-gram approach, but posteriors are interpolated for any possible context; no backing off

Result: significant WER reduction with small computational cost.
Architecture

Standard fully connected multilayer perceptron.

[Figure: network diagram. The context words w_{j-n+1}, ..., w_{j-1} enter a shared input projection layer, followed by a hidden layer (weights M, bias b) and an output layer (weights V, bias k) producing p_1 = P(w_j=1|h_j), ..., p_i = P(w_j=i|h_j), ..., p_N = P(w_j=N|h_j). Dimensions: N = 51k (vocabulary), P = 50 (projection), H ≈ 1k (hidden).]
Architecture

[Figure: the same network annotated with layer dimensions P, H, N and parameters M, b (hidden layer) and V, k (output layer).]

Hidden layer: d = tanh(M c + b)
Output activations: o = V d + k
Softmax output: p_i = e^{o_i} / sum_k e^{o_k} = P(w_j = i | h_j), for i = 1, ..., N
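The forward pass on this slide can be sketched end to end in pure Python. The tiny dimensions and hand-set weights below are illustrative stand-ins for the paper's N = 51k, P = 50, H ≈ 1k and for parameters that would really be learned:

```python
import math

# Toy dimensions; the paper uses N = 51k, P = 50, H ~ 1k.
N, P, H, CTX = 8, 3, 4, 2          # vocab, projection, hidden, context length (n-1)

# Hand-set weights for illustration; in reality all are learned by back-propagation.
proj = [[0.1 * ((i + j) % 5 - 2) for j in range(P)] for i in range(N)]          # N x P
M = [[0.05 * ((3 * i + j) % 7 - 3) for j in range(CTX * P)] for i in range(H)]  # H x (CTX*P)
b = [0.01] * H
V = [[0.05 * ((i + 2 * j) % 5 - 2) for j in range(H)] for i in range(N)]        # N x H
k = [0.0] * N

def forward(context):
    # 1) Project each context word index onto the continuous space and concatenate.
    c = [x for w in context for x in proj[w]]
    # 2) Hidden layer: d = tanh(M c + b)
    d = [math.tanh(sum(M[i][j] * c[j] for j in range(len(c))) + b[i]) for i in range(H)]
    # 3) Output activations: o = V d + k
    o = [sum(V[i][j] * d[j] for j in range(H)) + k[i] for i in range(N)]
    # 4) Softmax turns o into p_i = P(w_j = i | h_j)
    mx = max(o)
    e = [math.exp(x - mx) for x in o]
    s = sum(e)
    return [x / s for x in e]

p = forward([1, 5])   # the full distribution P(w_j = . | h_j) in one pass
```

Because the whole distribution over the vocabulary comes out of a single forward pass, one pass can answer many probability requests for the same context, which the regrouping optimization later exploits.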
Training

- Train with the standard back-propagation algorithm
  - Error function: cross-entropy
  - Weight decay regularization used
- Targets set to 1 for w_j and to 0 otherwise
  - These outputs are shown to converge to the posterior probabilities
- Back-propagation continues through the projection layer
  - The NN learns the best projection of words onto the continuous space for the probability estimation task
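With 0/1 targets and a softmax output, the cross-entropy error gives a particularly simple gradient at the output activations, p - t; a minimal sketch of that starting point of back-propagation:

```python
import math

def softmax(o):
    mx = max(o)
    e = [math.exp(x - mx) for x in o]
    s = sum(e)
    return [x / s for x in e]

def output_loss_and_grad(o, target):
    """Cross-entropy loss for a one-hot target, and its gradient w.r.t. o.

    With targets set to 1 for w_j and 0 otherwise, the gradient is simply
    p - t, which is where back-propagation starts at the output layer.
    """
    p = softmax(o)
    loss = -math.log(p[target])
    grad = [pi - (1.0 if i == target else 0.0) for i, pi in enumerate(p)]
    return loss, grad

loss, grad = output_loss_and_grad([2.0, 1.0, 0.1], target=0)
```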
Optimizations
Fast Recognition Techniques

1) Lattice rescoring
2) Shortlists
3) Regrouping
4) Block mode
5) CPU optimization
Fast Recognition Techniques

1) Lattice rescoring
   - Decode with a standard backoff LM to build lattices
2) Shortlists
3) Regrouping
4) Block mode
5) CPU optimization
Fast Recognition Techniques

1) Lattice rescoring
2) Shortlists
   - NN only predicts a high-frequency subset of the vocabulary
   - P^(w_t | h_t) = P_NN(w_t | h_t) * P_s(h_t) if w_t is in the shortlist, and P_B(w_t | h_t) otherwise,
     where P_s(h_t) = sum over w in shortlist(h_t) of P_B(w | h_t)
   - This redistributes the probability mass of the shortlist words
3) Regrouping
4) Block mode
5) CPU optimization
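The shortlist scheme above can be checked numerically: rescaling the NN's shortlist distribution by the backoff mass P_s(h) keeps the combined distribution normalized. The two toy distributions below are made-up values, not outputs of real models:

```python
# Toy distributions for illustration only.
shortlist = {"the", "a", "of"}
p_backoff = {"the": 0.4, "a": 0.2, "of": 0.1, "cat": 0.2, "dog": 0.1}  # P_B(w|h)
p_nn = {"the": 0.6, "a": 0.3, "of": 0.1}   # P_NN(w|h), normalized over the shortlist

# P_s(h) = total backoff probability mass of the shortlist words.
p_s = sum(p_backoff[w] for w in shortlist)

def p_hat(w):
    # P^(w|h) = P_NN(w|h) * P_s(h) if w is in the shortlist, else P_B(w|h).
    return p_nn[w] * p_s if w in shortlist else p_backoff[w]

total = sum(p_hat(w) for w in p_backoff)   # combined model is still normalized
```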
Shortlist optimization

[Figure: the same network with the output layer reduced to the shortlist, producing p_i = P(w_j=i|h_j) only for the S shortlist words, i = 1, ..., S, instead of all N words.]
Fast Recognition Techniques

1) Lattice rescoring
2) Shortlists
3) Regrouping - optimization of #1
   - Collect and sort the LM probability requests
   - For all probability requests with the same context h_t, only one forward pass is necessary
4) Block mode
5) CPU optimization
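Regrouping is just a bucketing step before the NN is called; a sketch with hypothetical lattice requests, where each distinct context costs one forward pass whose softmax output answers every word requested for it:

```python
from collections import defaultdict

# Hypothetical (context, word) probability requests collected from a lattice.
requests = [(("we", "will"), "go"),
            (("will", "go"), "home"),
            (("we", "will"), "stay"),
            (("we", "will"), "see")]

# Group requests sharing the same context h_t.
by_context = defaultdict(list)
for context, word in requests:
    by_context[context].append(word)

# One forward pass per distinct context instead of one per request:
# the softmax output already contains the probability of every requested word.
forward_passes = len(by_context)
```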
Fast Recognition Techniques

1) Lattice rescoring
2) Shortlists
3) Regrouping
4) Block mode
   - Several examples are propagated through the NN at once
   - Takes advantage of faster matrix operations
5) CPU optimization
Block mode calculations

[Figure: the per-example forward pass, with dimensions P, H, N and parameters M, V, b, k.]

d = tanh(M c + b)
o = V d + k
Block mode calculations

[Figure: the block version of the network, with matrices C, D, O in place of the vectors c, d, o.]

D = tanh(M C + B)
O = V D + K

Each column of C holds one example's projected context; B and K repeat the biases b and k across the columns.
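In pure Python the block-mode computation is just two matrix products with a broadcast bias; the parameter values below are toys, and a per-column softmax would follow O as on the earlier architecture slides:

```python
import math

def matmul(A, B):
    # (m x t) times (t x n) -> (m x n)
    return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def add_bias(X, bias):
    # Broadcast the bias vector over every column (every example in the bunch).
    return [[X[i][j] + bias[i] for j in range(len(X[0]))] for i in range(len(X))]

# Toy parameters: 2 hidden units, 3 output words, projected context of size 2.
M = [[0.5, -0.2], [0.1, 0.3]]
b = [0.0, 0.1]
V = [[0.2, -0.1], [0.4, 0.0], [-0.3, 0.2]]
k = [0.0, 0.0, 0.0]

# Each column of C is one example's projected context (bunch size 2 here).
C = [[0.1, 0.4], [0.2, -0.5]]

D = [[math.tanh(x) for x in row] for row in add_bias(matmul(M, C), b)]  # D = tanh(MC + B)
O = add_bias(matmul(V, D), k)                                           # O = VD + K
```

The speedup comes from replacing many small matrix-vector products with a few large matrix-matrix products, which optimized BLAS-style routines execute far more efficiently.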
Fast Recognition - Test Results

1) Lattice rescoring: 511 nodes per lattice on average
2) Shortlists (2000 words): 90% prediction coverage
   - 3.8M 4-grams requested; 3.4M processed by the NN
3) Regrouping: only 1M forward passes required
4) Block mode: bunch size = 128
5) CPU optimization

Total processing: under 9 minutes (0.03xRT). Without the optimizations, about 10x slower.
Fast Training Techniques

1) Parallel implementations
   - The full connections require low latency; very costly
2) Resampling techniques
   - Floating-point operations perform best with contiguous memory locations
Fast Training Techniques

1) Floating-point precision: 1.5x faster
2) Suppressing internal calculations: 1.3x faster
3) Bunch mode: 10+x faster
   - Forward and back-propagation for many examples at once
4) Multiprocessing: 1.5x faster

Overall: 47 hours down to 1h27m with bunch size 128.
Application to CTS and BN LVCSR
Application to ASR

Neural net LM techniques focus on CTS because:
- There is far less in-domain training data (data sparsity)
- The NN can only handle a small amount of training data

New Fisher CTS data: 20M words (vs. 7M previously)
BN data: 500M words
Application to CTS

- Baseline: train standard backoff LMs for each domain and then interpolate
- Expt #1: interpolate the CTS neural net LM with the in-domain backoff LM
- Expt #2: interpolate the CTS neural net LM with the full-data backoff LM
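In both experiments the combination is a simple linear interpolation of the two models' probabilities; a sketch with a made-up weight (in practice the weight is tuned on held-out data):

```python
lam = 0.5  # hypothetical interpolation weight; tuned on held-out data in practice

# Toy next-word distributions from the two models for the same context.
p_nn = {"yes": 0.7, "no": 0.3}
p_backoff = {"yes": 0.5, "no": 0.5}

# P(w|h) = lam * P_NN(w|h) + (1 - lam) * P_B(w|h)
p_interp = {w: lam * p_nn[w] + (1 - lam) * p_backoff[w] for w in p_nn}
```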
Application to CTS - PPL

- Baseline (interpolated backoff LMs): in-domain PPL 50.1; full-data PPL 47.5
- Expt #1 (NN + in-domain backoff LM): PPL 45.5
- Expt #2 (NN + full-data backoff LM): PPL 44.2
Application to CTS - WER

- Baseline (interpolated backoff LMs): in-domain WER 19.9; full-data WER 19.3
- Expt #1 (NN + in-domain backoff LM): WER 19.1
- Expt #2 (NN + full-data backoff LM): WER 18.8
Application to BN

Only a subset of the 500M available words could be used for NN training: a 27M-word train set.

Still useful:
- The NN LM gave a 12% PPL gain over a backoff LM trained on the same small 27M-word set
- The NN LM gave a 4% PPL gain over a backoff LM trained on the full 500M-word set
- Overall WER reduction of 0.3% absolute
Conclusion

- Neural net LMs provide significant improvements in PPL and WER
- Optimizations can speed up NN training by 20x and keep lattice rescoring under 0.05xRT
- While the NN LM was developed for and works best with CTS, gains were found on the BN task too