Random Forests for Language Modeling
Peng Xu, Frederick Jelinek
CLSP, The Johns Hopkins University, Dept. of ECE
Outline
- Basic Language Modeling
  - Language Models
  - Smoothing in n-gram Language Models
  - Decision Tree Language Models
- Random Forests for Language Models
  - Random Forests
  - n-gram
  - Structured Language Model (SLM)
- Experiments
- Conclusions and Future Work
Basic Language Modeling
- Estimate the source probability $P(W)$, $W = w_1, \ldots, w_N$, from a training corpus: a large amount of text chosen for similarity to the expected sentences
- Parametric conditional models:
  $P(w_i \mid w_1, \ldots, w_{i-1}), \quad w_i \in V,\ i = 1, \ldots, N$
  where $w_1, \ldots, w_{i-1}$ is the history
Basic Language Modeling
- Smooth models: $P(w_i \mid w_1, \ldots, w_{i-1}) > 0$
- Perplexity (PPL):
  $PPL = \exp\!\left(-\frac{1}{N} \sum_{i=1}^{N} \log P_M(w_i \mid w_1, \ldots, w_{i-1})\right)$
- n-gram models:
  $P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})$
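To make the PPL definition concrete, here is a minimal Python sketch (not from the slides) that evaluates the perplexity of a test word sequence under an arbitrary conditional model; `model_prob` is a hypothetical stand-in for any smoothed $P_M(w_i \mid \text{history})$.

```python
import math

def perplexity(test_words, model_prob, order=3):
    """Compute PPL = exp(-(1/N) * sum_i log P_M(w_i | history)).

    model_prob(history, word) is assumed to return a smoothed,
    strictly positive probability; 'order' truncates the history
    to the last n-1 words, as an n-gram model would.
    """
    log_prob_sum = 0.0
    for i, word in enumerate(test_words):
        history = tuple(test_words[max(0, i - order + 1):i])
        log_prob_sum += math.log(model_prob(history, word))
    return math.exp(-log_prob_sum / len(test_words))
```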
Estimate n-gram Parameters
- Maximum Likelihood (ML) estimate:
  $P(w_i \mid w_{i-n+1}^{i-1}) = \frac{C(w_{i-n+1}^{i})}{C(w_{i-n+1}^{i-1})}$
- Best on training data: lowest PPL
- Data sparseness problem: for n=3 and |V|=10k there are $10^{12}$ possible trigrams, so on the order of a trillion words of training data would be needed
- Zero probability for almost all test data!
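As a quick illustration (not from the slides), a relative-frequency trigram estimator fits in a few lines; any trigram absent from the training counts gets probability zero, which is exactly the sparseness problem described above.

```python
from collections import defaultdict

def train_ml_trigram(words):
    """Collect trigram and bigram-history counts for the ML estimate
    P(w_i | w_{i-2}, w_{i-1}) = C(w_{i-2}, w_{i-1}, w_i) / C(w_{i-2}, w_{i-1})."""
    tri, hist = defaultdict(int), defaultdict(int)
    for i in range(2, len(words)):
        h = (words[i - 2], words[i - 1])
        tri[h + (words[i],)] += 1
        hist[h] += 1
    return tri, hist

def ml_prob(tri, hist, h, w):
    # Unseen histories or trigrams get probability 0 -- unusable on test data.
    return tri[h + (w,)] / hist[h] if hist[h] else 0.0
```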
Dealing with Sparsity
- Smoothing: use lower order statistics
- Word clustering: reduce the size of V
- History clustering: reduce the number of histories
- Maximum entropy: use exponential models
- Neural network: represent words in real space $\mathbb{R}^d$, use an exponential model
Smoothing Techniques
- Good smoothing techniques: Deleted Interpolation, Katz, Absolute Discounting, Kneser-Ney (KN)
- Kneser-Ney: consistently the best [Chen & Goodman, 1998]

$P_{KN}(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max\!\left(C(w_{i-n+1}^{i}) - D,\ 0\right)}{C(w_{i-n+1}^{i-1})} + \lambda(w_{i-n+1}^{i-1})\, P_{KN}(w_i \mid w_{i-n+2}^{i-1})$

$\lambda(w_{i-n+1}^{i-1}) = \frac{D \cdot \left|\{w_i : C(w_{i-n+1}^{i}) > 0\}\right|}{C(w_{i-n+1}^{i-1})}$

where the lower order distribution uses continuation counts $\hat{C}(w_{i-n+2}^{i}) = \left|\{w_{i-n+1} : C(w_{i-n+1}^{i}) > 0\}\right|$ in place of raw counts.
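The following is a minimal sketch (my own, not the authors' code) of interpolated Kneser-Ney for bigrams in the absolute-discounting form shown above; the discount D is assumed given, and the lower order (unigram) level uses continuation counts. In practice the unigram level would itself be smoothed further.

```python
from collections import defaultdict

def train_kn_bigram(words, D=0.75):
    """Interpolated Kneser-Ney for bigrams:
    P_KN(w|h) = max(C(h,w)-D, 0)/C(h) + lambda(h) * P_cont(w)."""
    bigram, unigram = defaultdict(int), defaultdict(int)
    followers = defaultdict(set)   # distinct words following each history
    preceders = defaultdict(set)   # distinct histories preceding each word
    for h, w in zip(words, words[1:]):
        bigram[(h, w)] += 1
        unigram[h] += 1
        followers[h].add(w)
        preceders[w].add(h)
    n_bigram_types = sum(len(s) for s in preceders.values())

    def prob(h, w):
        # Continuation probability: fraction of bigram types that end in w.
        p_cont = len(preceders[w]) / n_bigram_types
        if unigram[h] == 0:
            return p_cont                      # back off fully for unseen histories
        lam = D * len(followers[h]) / unigram[h]
        return max(bigram[(h, w)] - D, 0) / unigram[h] + lam * p_cont
    return prob
```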
Decision Tree Language Models
- Goal: history clustering by a binary decision tree (DT)
- Internal nodes: a set of histories, one or two questions
- Leaf nodes: a set of histories
- Node splitting algorithms
- DT growing algorithms
Example DT
Training data: aba, aca, acb, bcb, bda (history = first two words, predicted word = third)

- Root node {ab, ac, bc, bd}, predicted-word counts a:3 b:2
  - Question "Is the first word 'a'?" -> leaf {ab, ac}, counts a:2 b:1
  - Question "Is the first word 'b'?" -> leaf {bc, bd}, counts a:1 b:1

New event 'cba': Stuck! The history 'cb' answers no to both questions, so it reaches no leaf.
Previous Work
- DT is an appealing idea: deals with data sparseness
- [Bahl, et al 1989] 20 words in histories, slightly better than 3-gram
- [Potamianos and Jelinek, 1998] fair comparison, negative results on letter n-gram
- Both are top-down with a stopping criterion
- Why doesn't it work in practice?
  - Training data fragmentation: data sparseness
  - No theoretically founded stopping criterion: early termination
  - Greedy algorithms: early termination
Outline
- Basic Language Modeling
  - Language Models
  - Smoothing in n-gram Language Models
  - Decision Tree Language Models
- Random Forests for Language Models
  - Random Forests
  - n-gram
  - Structured Language Model (SLM)
- Experiments
- Conclusions and Future Work
Random Forests
- [Amit & Geman 1997] shape recognition with randomized trees
- [Ho 1998] random subspace
- [Breiman 2001] random forests
- Random Forest (RF): a classifier consisting of a collection of tree-structured classifiers
Our Goal
- Main problems:
  - Data sparseness
  - Smoothing
  - Early termination
  - Greedy algorithms
- Expectations from Random Forests:
  - Less greedy algorithms: randomization and voting
  - Avoid early termination: randomization
  - Conquer data sparseness: voting
Outline
- Basic Language Modeling
  - Language Models
  - Smoothing in n-gram Language Models
  - Decision Tree Language Models
- Random Forests for Language Models
  - Random Forests
  - n-gram: general approach
  - Structured Language Model (SLM)
- Experiments
- Conclusions and Future Work
General DT Growing Approach
- Grow a DT until maximum depth using training data
- Perform no smoothing during growing
- Prune the fully grown DT to maximize heldout data likelihood
- Incorporate KN smoothing during pruning
Node Splitting Algorithm
- Questions: about identities of words in the history
- Definitions (β_i(v) here names the history set whose symbol was lost in the transcript):
  - H(p): the set of histories in a node p
  - position: distance from a word in the history to the predicted word
  - β_i(v): the set of histories with word v in position i
  - split: non-empty sets A_i and B_i, each consisting of sets β_i(v)
  - L(A_i): training data log-likelihood of the node under the split A_i, B_i, using relative frequencies
Node Splitting Algorithm
Algorithm sketch:
1. For each position i:
   a) Initialization: A_i, B_i
   b) For each β_i(v) in A_i:
      i.   Tentatively move β_i(v) to B_i
      ii.  Calculate the log-likelihood increase L(A_i - β_i(v)) - L(A_i)
      iii. If the increase is positive, move β_i(v) and modify the counts
   c) Carry out the same for each β_i(v) in B_i
   d) Repeat b)-c) until no move is possible
2. Split the node according to the best position: the one whose increase in log-likelihood is largest
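A compact sketch of this exchange-style search (my own reading of the steps above, not the authors' implementation) is given below. Events are assumed to be (history-set id, predicted word) pairs for one position, and the likelihood is recomputed from scratch after each tentative move rather than updated incrementally, to keep the sketch short.

```python
import math
from collections import Counter

def log_likelihood(events, side_of):
    """Sum of C(node, w) * log(C(node, w)/C(node)) over both halves of the split,
    where side_of maps a history-set id (beta_i(v)) to 'A' or 'B'."""
    node_word, node_tot = Counter(), Counter()
    for beta, w in events:
        node = side_of[beta]
        node_word[(node, w)] += 1
        node_tot[node] += 1
    return sum(c * math.log(c / node_tot[node]) for (node, w), c in node_word.items())

def split_position(events, betas):
    """Exchange algorithm for one position: start with almost all beta sets in A,
    then keep moving single sets between A and B while the likelihood grows."""
    side_of = {b: 'A' for b in betas}
    side_of[betas[0]] = 'B'                      # crude non-empty initialization
    best = log_likelihood(events, side_of)
    improved = True
    while improved:
        improved = False
        for b in betas:
            if sum(1 for x in betas if side_of[x] == side_of[b]) == 1:
                continue                          # keep both sides non-empty
            side_of[b] = 'B' if side_of[b] == 'A' else 'A'      # tentative move
            new = log_likelihood(events, side_of)
            if new > best:
                best, improved = new, True        # accept the move
            else:
                side_of[b] = 'B' if side_of[b] == 'A' else 'A'  # undo
    return side_of, best
```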
Pruning a Decision Tree
- Smoothing (KN-style) at the DT nodes, where $\Phi_{DT}(\cdot)$ maps a history to its equivalence class (node) in the DT:

$P_{DT}(w_i \mid \Phi_{DT}(w_{i-n+1}^{i-1})) = \frac{\max\!\big(C(w_i, \Phi_{DT}(w_{i-n+1}^{i-1})) - D,\ 0\big)}{C(\Phi_{DT}(w_{i-n+1}^{i-1}))} + \lambda(\Phi_{DT}(w_{i-n+1}^{i-1}))\, P_{KN}(w_i \mid w_{i-n+2}^{i-1})$

- Define:
  - L(p): set of all leaves rooted in p
  - LH(p): smoothed heldout data log-likelihood in p
  - LH(L(p)): smoothed heldout data log-likelihood in L(p)
  - potential: LH(L(p)) - LH(p)
- Pruning: traverse all internal nodes, prune the subtree rooted in p if its potential is negative (similar to CART)
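A minimal sketch of the pruning pass (hypothetical node structure; not the authors' implementation): compute each internal node's potential from smoothed heldout log-likelihoods, bottom-up, and cut subtrees whose potential is negative.

```python
def prune(node):
    """Bottom-up CART-style pruning by heldout potential.

    Each node is assumed to expose .children (a list, empty for leaves) and
    .heldout_loglik, the smoothed heldout log-likelihood of the data reaching
    it when treated as a single leaf (LH(p) in the slides).
    Returns the heldout log-likelihood of the best pruned subtree rooted here."""
    if not node.children:
        return node.heldout_loglik
    leaves_loglik = sum(prune(child) for child in node.children)  # LH(L(p))
    potential = leaves_loglik - node.heldout_loglik
    if potential < 0:
        node.children = []            # prune: keep p itself as a leaf
        return node.heldout_loglik
    return leaves_loglik
```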
Towards Random Forests
- Randomized question selection:
  - Randomized initialization of A_i, B_i
  - Randomized position selection
- Generating a random forest LM:
  - M decision trees are grown randomly
  - Each DT generates a probability sequence on test data
  - Aggregation:

$P_{RF}(w_i \mid w_{i-n+1}^{i-1}) = \frac{1}{M} \sum_{j=1}^{M} P_{DT_j}(w_i \mid \Phi_{DT_j}(w_{i-n+1}^{i-1}))$
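The aggregation step is just an average of the member models' conditional probabilities; a sketch, with `dt_probs` as a hypothetical list of per-tree probability functions:

```python
def rf_prob(dt_probs, history, word):
    """P_RF(w|h) = (1/M) * sum_j P_DTj(w | Phi_DTj(h)).

    Each element of dt_probs is assumed to map the full n-gram history to its
    own equivalence class internally and return a smoothed probability."""
    return sum(p(history, word) for p in dt_probs) / len(dt_probs)
```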
Remarks on RF-LM
- Random Forest Language Model (RF-LM): a collection of randomly constructed DT-LMs
- A DT-LM is an RF-LM: a small forest
- An n-gram LM is a DT-LM: no pruning
- Therefore, an n-gram LM is an RF-LM!
- Single compact model
Outline
- Basic Language Modeling
  - Language Models
  - Smoothing in n-gram Language Models
  - Decision Tree Language Models
- Random Forests for Language Models
  - Random Forests
  - n-gram
  - Structured Language Model (SLM)
- Experiments
- Conclusions and Future Work
A Parse Tree

The Structured Language Model (SLM)

Partial Parse Tree
SLM Probabilities
- Joint probability of words and parse:

$P(W, T) = \prod_{i=1}^{n+1} \Big[ P(w_i \mid W_{i-1} T_{i-1})\, P(t_i \mid W_{i-1} T_{i-1}, w_i) \prod_{j=1}^{N_i} P(p_i^j \mid W_{i-1} T_{i-1}, w_i, t_i, p_i^1, \ldots, p_i^{j-1}) \Big]$

where $W_{i-1} T_{i-1}$ is the word-parse $(i-1)$-prefix, $t_i$ the tag of $w_i$, and $p_i^1, \ldots, p_i^{N_i}$ the parser operations after $w_i$.

- Word probabilities:

$P_{SLM}(w_i \mid W_{i-1}) = \sum_{T_{i-1} \in S_{i-1}} P(w_i \mid W_{i-1} T_{i-1})\, \rho(W_{i-1}, T_{i-1}), \qquad \rho(W_{i-1}, T_{i-1}) = \frac{P(W_{i-1} T_{i-1})}{\sum_{T_{i-1} \in S_{i-1}} P(W_{i-1} T_{i-1})}$

where $S_{i-1}$ is the set of parses kept for the prefix $W_{i-1}$.
Using RFs for the SLM
- Ideally: run the SLM one time, with random forests for all three components:
  $P_{PREDICTOR}(RF),\ P_{TAGGER}(RF),\ P_{PARSER}(RF)$
- Parallel approximation: run the SLM M times, the m-th run using a single DT per component:
  $P_{PREDICTOR}(DT_m),\ P_{TAGGER}(DT_m),\ P_{PARSER}(DT_m), \quad m = 1, \ldots, M$
- Aggregate the M probability sequences
Outline
- Basic Language Modeling
  - Language Models
  - Smoothing in n-gram Language Models
  - Decision Tree Language Models
- Random Forests for Language Models
  - Random Forests
  - n-gram
  - Structured Language Model (SLM)
- Experiments
- Conclusions and Future Work
Experiments
- Goal: compare with Kneser-Ney (KN)
- Perplexity (PPL):
  - UPenn Treebank: 1 million words training, 82k words test
  - Normalized text
- Word Error Rate (WER):
  - WSJ text: 20 or 40 million words training
  - WSJ DARPA'93 HUB1 test data: 213 utterances, 3,446 words
  - N-best rescoring: standard trigram baseline trained on 40 million words
Experiments: trigram perplexity
- Baseline: KN-trigram
- No randomization: DT-trigram
- 100 random DTs: RF-trigram

Model        Heldout PPL   Gain     Test PPL   Gain
KN-trigram   160.1         -        145.0      -
DT-trigram   158.6         0.9%     163.3      -12.6%
RF-trigram   126.8         20.8%    129.7      10.5%
Experiments: Aggregating
Improvements within 10 trees!
Experiments: Why does it work?
- A test event is seen if:
  - KN-trigram: $(w_i \mid w_{i-n+1}^{i-1})$ occurs in the training data
  - DT-trigram: $(w_i \mid \Phi_{DT}(w_{i-n+1}^{i-1}))$ occurs in the training data
  - RF-trigram: $(w_i \mid \Phi_{DT_m}(w_{i-n+1}^{i-1}))$ occurs in the training data for any m

Model        Seen %   Seen PPL   Unseen %   Unseen PPL
KN-trigram   45.6%    19.7       54.4%      773
DT-trigram   58.1%    26.2       41.9%      2069
RF-trigram   91.7%    75.6       8.3%       49818
Experiments: SLM perplexity
- Baseline: KN-SLM
- 100 random DTs for each of the components
- Parallel approximation
- Interpolate with KN-trigram (λ is the weight on the KN-trigram)

Model     λ=0.0   λ=0.4   λ=1.0
KN-SLM    137.9   127.2   145.0
RF-SLM    122.8   117.6   145.0
Gain      10.9%   7.5%    -
Experiments: speech recognition
- Baseline: KN-trigram, KN-SLM
- 100 random DTs for RF-trigram and RF-SLM-P (predictor)
- Interpolate with KN-trigram (40M); WER (%) as a function of the interpolation weight λ:

Model              λ=0.0   0.2    0.4    0.6    0.8
KN-trigram (20M)   14.0    13.6   13.3   13.2   13.1
RF-trigram (20M)   12.9    12.9   13.0   13.0   12.7
KN-trigram (40M)   13.0    -      -      -      -
RF-trigram (40M)   12.4    12.7   12.7   12.7   12.7
KN-SLM (20M)       12.8    12.5   12.6   12.7   12.7
RF-SLM-P (20M)     11.9    12.2   12.3   12.3   12.6
Conclusions
- New RF language modeling approach
- More general LM: RF ⊇ DT ⊇ n-gram
- Randomized history clustering: non-reciprocal data sharing
- Good performance in PPL and WER
- Generalizes well to unseen data
- Portable to other tasks
Future Work
- Random samples of training data
- More linguistically oriented questions
- Direct implementation in the SLM
- Lower order random forests
- Larger test data for speech recognition
- Language model adaptation
Smoothing with a lower order random forest (cf. "lower order random forests" above):

$P_{DT}(w_i \mid \Phi_{DT}(w_{i-n+1}^{i-1})) = \frac{\max\!\big(C(w_i, \Phi_{DT}(w_{i-n+1}^{i-1})) - D,\ 0\big)}{C(\Phi_{DT}(w_{i-n+1}^{i-1}))} + \lambda(\Phi_{DT}(w_{i-n+1}^{i-1}))\, P_{RF}(w_i \mid w_{i-n+2}^{i-1})$
Thank you!