Using Neural Network Language Models for LVCSR
Transcript of Using Neural Network Language Models for LVCSR
Holger Schwenk and Jean-Luc Gauvain
Presented by Erin Fitzgerald
CLSP Reading Group
December 10, 2004
December 10, 2004 - Using Neural Network LMs for LVCSR - slide 2
Introduction

Build and use neural networks to estimate LM posterior probabilities for ASR tasks.

Idea:
- Project word indices onto a continuous space
- The resulting smooth probability functions of the word representations generalize better to unseen n-grams
- Still an n-gram approach, but posteriors are interpolated for any possible context; no backing off

Result: significant WER reduction with small computational cost.
Architecture

Standard fully connected multilayer perceptron.

[Figure: network diagram. The context words w_{j-n+1}, ..., w_{j-1} enter a shared input projection layer, followed by a hidden layer (weights M, bias b) and an output layer (weights V, bias k) producing p_1 = P(w_j=1|h_j), ..., p_i = P(w_j=i|h_j), ..., p_N = P(w_j=N|h_j). Dimensions: N = 51k (vocabulary), P = 50 (projection), H ≈ 1k (hidden).]
Architecture

[Figure: the same network annotated with layer dimensions P, H, N and parameters M, b (hidden layer) and V, k (output layer).]

Hidden layer: d = tanh(M c + b)
Output activations: o = V d + k
Softmax output: p_i = e^{o_i} / sum_k e^{o_k} = P(w_j = i | h_j), for i = 1, ..., N
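The forward pass on this slide can be sketched end to end in pure Python. The tiny dimensions and hand-set weights below are illustrative stand-ins for the paper's N = 51k, P = 50, H ≈ 1k and for parameters that would really be learned:

```python
import math

# Toy dimensions; the paper uses N = 51k, P = 50, H ~ 1k.
N, P, H, CTX = 8, 3, 4, 2          # vocab, projection, hidden, context length (n-1)

# Hand-set weights for illustration; in reality all are learned by back-propagation.
proj = [[0.1 * ((i + j) % 5 - 2) for j in range(P)] for i in range(N)]          # N x P
M = [[0.05 * ((3 * i + j) % 7 - 3) for j in range(CTX * P)] for i in range(H)]  # H x (CTX*P)
b = [0.01] * H
V = [[0.05 * ((i + 2 * j) % 5 - 2) for j in range(H)] for i in range(N)]        # N x H
k = [0.0] * N

def forward(context):
    # 1) Project each context word index onto the continuous space and concatenate.
    c = [x for w in context for x in proj[w]]
    # 2) Hidden layer: d = tanh(M c + b)
    d = [math.tanh(sum(M[i][j] * c[j] for j in range(len(c))) + b[i]) for i in range(H)]
    # 3) Output activations: o = V d + k
    o = [sum(V[i][j] * d[j] for j in range(H)) + k[i] for i in range(N)]
    # 4) Softmax turns o into p_i = P(w_j = i | h_j)
    mx = max(o)
    e = [math.exp(x - mx) for x in o]
    s = sum(e)
    return [x / s for x in e]

p = forward([1, 5])   # the full distribution P(w_j = . | h_j) in one pass
```

Because the whole distribution over the vocabulary comes out of a single forward pass, one pass can answer many probability requests for the same context, which the regrouping optimization later exploits.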
Training

- Train with the standard back-propagation algorithm
  - Error function: cross-entropy
  - Weight decay regularization used
- Targets set to 1 for w_j and to 0 otherwise
  - These outputs are shown to converge to the posterior probabilities
- Back-propagation continues through the projection layer
  - The NN learns the best projection of words onto the continuous space for the probability estimation task
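With 0/1 targets and a softmax output, the cross-entropy error gives a particularly simple gradient at the output activations, p - t; a minimal sketch of that starting point of back-propagation:

```python
import math

def softmax(o):
    mx = max(o)
    e = [math.exp(x - mx) for x in o]
    s = sum(e)
    return [x / s for x in e]

def output_loss_and_grad(o, target):
    """Cross-entropy loss for a one-hot target, and its gradient w.r.t. o.

    With targets set to 1 for w_j and 0 otherwise, the gradient is simply
    p - t, which is where back-propagation starts at the output layer.
    """
    p = softmax(o)
    loss = -math.log(p[target])
    grad = [pi - (1.0 if i == target else 0.0) for i, pi in enumerate(p)]
    return loss, grad

loss, grad = output_loss_and_grad([2.0, 1.0, 0.1], target=0)
```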
Optimizations
Fast Recognition Techniques

1) Lattice rescoring
2) Shortlists
3) Regrouping
4) Block mode
5) CPU optimization
Fast Recognition Techniques

1) Lattice rescoring
   - Decode with a standard backoff LM to build lattices
2) Shortlists
3) Regrouping
4) Block mode
5) CPU optimization
Fast Recognition Techniques

1) Lattice rescoring
2) Shortlists
   - NN only predicts a high-frequency subset of the vocabulary
   - P^(w_t | h_t) = P_NN(w_t | h_t) * P_s(h_t) if w_t is in the shortlist, and P_B(w_t | h_t) otherwise,
     where P_s(h_t) = sum over w in shortlist(h_t) of P_B(w | h_t)
   - This redistributes the probability mass of the shortlist words
3) Regrouping
4) Block mode
5) CPU optimization
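The shortlist scheme above can be checked numerically: rescaling the NN's shortlist distribution by the backoff mass P_s(h) keeps the combined distribution normalized. The two toy distributions below are made-up values, not outputs of real models:

```python
# Toy distributions for illustration only.
shortlist = {"the", "a", "of"}
p_backoff = {"the": 0.4, "a": 0.2, "of": 0.1, "cat": 0.2, "dog": 0.1}  # P_B(w|h)
p_nn = {"the": 0.6, "a": 0.3, "of": 0.1}   # P_NN(w|h), normalized over the shortlist

# P_s(h) = total backoff probability mass of the shortlist words.
p_s = sum(p_backoff[w] for w in shortlist)

def p_hat(w):
    # P^(w|h) = P_NN(w|h) * P_s(h) if w is in the shortlist, else P_B(w|h).
    return p_nn[w] * p_s if w in shortlist else p_backoff[w]

total = sum(p_hat(w) for w in p_backoff)   # combined model is still normalized
```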
Shortlist optimization

[Figure: the same network with the output layer reduced to the shortlist, producing p_i = P(w_j=i|h_j) only for the S shortlist words, i = 1, ..., S, instead of all N words.]
Fast Recognition Techniques

1) Lattice rescoring
2) Shortlists
3) Regrouping - optimization of #1
   - Collect and sort the LM probability requests
   - For all probability requests with the same context h_t, only one forward pass is necessary
4) Block mode
5) CPU optimization
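Regrouping is just a bucketing step before the NN is called; a sketch with hypothetical lattice requests, where each distinct context costs one forward pass whose softmax output answers every word requested for it:

```python
from collections import defaultdict

# Hypothetical (context, word) probability requests collected from a lattice.
requests = [(("we", "will"), "go"),
            (("will", "go"), "home"),
            (("we", "will"), "stay"),
            (("we", "will"), "see")]

# Group requests sharing the same context h_t.
by_context = defaultdict(list)
for context, word in requests:
    by_context[context].append(word)

# One forward pass per distinct context instead of one per request:
# the softmax output already contains the probability of every requested word.
forward_passes = len(by_context)
```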
Fast Recognition Techniques

1) Lattice rescoring
2) Shortlists
3) Regrouping
4) Block mode
   - Several examples are propagated through the NN at once
   - Takes advantage of faster matrix operations
5) CPU optimization
Block mode calculations

[Figure: the per-example forward pass, with dimensions P, H, N and parameters M, V, b, k.]

d = tanh(M c + b)
o = V d + k
Block mode calculations

[Figure: the block version of the network, with matrices C, D, O in place of the vectors c, d, o.]

D = tanh(M C + B)
O = V D + K

Each column of C holds one example's projected context; B and K repeat the biases b and k across the columns.
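In pure Python the block-mode computation is just two matrix products with a broadcast bias; the parameter values below are toys, and a per-column softmax would follow O as on the earlier architecture slides:

```python
import math

def matmul(A, B):
    # (m x t) times (t x n) -> (m x n)
    return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def add_bias(X, bias):
    # Broadcast the bias vector over every column (every example in the bunch).
    return [[X[i][j] + bias[i] for j in range(len(X[0]))] for i in range(len(X))]

# Toy parameters: 2 hidden units, 3 output words, projected context of size 2.
M = [[0.5, -0.2], [0.1, 0.3]]
b = [0.0, 0.1]
V = [[0.2, -0.1], [0.4, 0.0], [-0.3, 0.2]]
k = [0.0, 0.0, 0.0]

# Each column of C is one example's projected context (bunch size 2 here).
C = [[0.1, 0.4], [0.2, -0.5]]

D = [[math.tanh(x) for x in row] for row in add_bias(matmul(M, C), b)]  # D = tanh(MC + B)
O = add_bias(matmul(V, D), k)                                           # O = VD + K
```

The speedup comes from replacing many small matrix-vector products with a few large matrix-matrix products, which optimized BLAS-style routines execute far more efficiently.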
Fast Recognition - Test Results

1) Lattice rescoring: 511 nodes per lattice on average
2) Shortlists (2000 words): 90% prediction coverage
   - 3.8M 4-grams requested; 3.4M processed by the NN
3) Regrouping: only 1M forward passes required
4) Block mode: bunch size = 128
5) CPU optimization

Total processing: under 9 minutes (0.03xRT). Without the optimizations, about 10x slower.
Fast Training Techniques

1) Parallel implementations
   - The full connections require low latency; very costly
2) Resampling techniques
   - Floating-point operations perform best with contiguous memory locations
Fast Training Techniques

1) Floating-point precision: 1.5x faster
2) Suppressing internal calculations: 1.3x faster
3) Bunch mode: 10+x faster
   - Forward and back-propagation for many examples at once
4) Multiprocessing: 1.5x faster

Overall: 47 hours down to 1h27m with bunch size 128.
Application to CTS and BN LVCSR
Application to ASR

Neural net LM techniques focus on CTS because:
- There is far less in-domain training data (data sparsity)
- The NN can only handle a small amount of training data

New Fisher CTS data: 20M words (vs. 7M previously)
BN data: 500M words
Application to CTS

- Baseline: train standard backoff LMs for each domain and then interpolate
- Expt #1: interpolate the CTS neural net LM with the in-domain backoff LM
- Expt #2: interpolate the CTS neural net LM with the full-data backoff LM
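In both experiments the combination is a simple linear interpolation of the two models' probabilities; a sketch with a made-up weight (in practice the weight is tuned on held-out data):

```python
lam = 0.5  # hypothetical interpolation weight; tuned on held-out data in practice

# Toy next-word distributions from the two models for the same context.
p_nn = {"yes": 0.7, "no": 0.3}
p_backoff = {"yes": 0.5, "no": 0.5}

# P(w|h) = lam * P_NN(w|h) + (1 - lam) * P_B(w|h)
p_interp = {w: lam * p_nn[w] + (1 - lam) * p_backoff[w] for w in p_nn}
```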
Application to CTS - PPL

- Baseline (interpolated backoff LMs): in-domain PPL 50.1; full-data PPL 47.5
- Expt #1 (NN + in-domain backoff LM): PPL 45.5
- Expt #2 (NN + full-data backoff LM): PPL 44.2
Application to CTS - WER

- Baseline (interpolated backoff LMs): in-domain WER 19.9; full-data WER 19.3
- Expt #1 (NN + in-domain backoff LM): WER 19.1
- Expt #2 (NN + full-data backoff LM): WER 18.8
Application to BN

Only a subset of the 500M available words could be used for NN training: a 27M-word train set.

Still useful:
- The NN LM gave a 12% PPL gain over a backoff LM trained on the same small 27M-word set
- The NN LM gave a 4% PPL gain over a backoff LM trained on the full 500M-word set
- Overall WER reduction of 0.3% absolute
Conclusion

- Neural net LMs provide significant improvements in PPL and WER
- Optimizations can speed up NN training by 20x and keep lattice rescoring under 0.05xRT
- While the NN LM was developed for and works best with CTS, gains were found on the BN task too