Random Forests for Language Modeling
Peng Xu, Frederick Jelinek
CLSP, The Johns Hopkins University, Dept. of ECE
Outline
- Basic Language Modeling
  - Language Models
  - Smoothing in n-gram Language Models
  - Decision Tree Language Models
- Random Forests for Language Models
  - Random Forests
  - n-gram
  - Structured Language Model (SLM)
- Experiments
- Conclusions and Future Work
Basic Language Modeling
- Estimate the source probability $P(W)$, $W = w_1, \ldots, w_N$, from a training corpus: a large amount of text chosen for similarity to the expected sentences
- Parametric conditional models:
  $P(w_i \mid w_1, \ldots, w_{i-1}), \quad w_i \in V,\ i = 1, \ldots, N$
  where $w_1, \ldots, w_{i-1}$ is the history
Basic Language Modeling
- Smooth models: $P(w_i \mid w_1, \ldots, w_{i-1}) > 0$
- Perplexity (PPL):
  $PPL = \exp\!\left(-\frac{1}{N} \sum_{i=1}^{N} \log P_M(w_i \mid w_1, \ldots, w_{i-1})\right)$
- n-gram models:
  $P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})$
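To make the PPL definition concrete, here is a minimal Python sketch (not from the slides) that evaluates the perplexity of a test word sequence under an arbitrary conditional model; `model_prob` is a hypothetical stand-in for any smoothed $P_M(w_i \mid \text{history})$.

```python
import math

def perplexity(test_words, model_prob, order=3):
    """Compute PPL = exp(-(1/N) * sum_i log P_M(w_i | history)).

    model_prob(history, word) is assumed to return a smoothed,
    strictly positive probability; 'order' truncates the history
    to the last n-1 words, as an n-gram model would.
    """
    log_prob_sum = 0.0
    for i, word in enumerate(test_words):
        history = tuple(test_words[max(0, i - order + 1):i])
        log_prob_sum += math.log(model_prob(history, word))
    return math.exp(-log_prob_sum / len(test_words))
```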
Estimate n-gram Parameters
- Maximum Likelihood (ML) estimate:
  $P(w_i \mid w_{i-n+1}^{i-1}) = \frac{C(w_{i-n+1}^{i})}{C(w_{i-n+1}^{i-1})}$
- Best on training data: lowest PPL
- Data sparseness problem: for n=3 and |V|=10k there are $10^{12}$ possible trigrams, so on the order of a trillion words of training data would be needed
- Zero probability for almost all test data!
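As a quick illustration (not from the slides), a relative-frequency trigram estimator fits in a few lines; any trigram absent from the training counts gets probability zero, which is exactly the sparseness problem described above.

```python
from collections import defaultdict

def train_ml_trigram(words):
    """Collect trigram and bigram-history counts for the ML estimate
    P(w_i | w_{i-2}, w_{i-1}) = C(w_{i-2}, w_{i-1}, w_i) / C(w_{i-2}, w_{i-1})."""
    tri, hist = defaultdict(int), defaultdict(int)
    for i in range(2, len(words)):
        h = (words[i - 2], words[i - 1])
        tri[h + (words[i],)] += 1
        hist[h] += 1
    return tri, hist

def ml_prob(tri, hist, h, w):
    # Unseen histories or trigrams get probability 0 -- unusable on test data.
    return tri[h + (w,)] / hist[h] if hist[h] else 0.0
```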
Dealing with Sparsity
- Smoothing: use lower order statistics
- Word clustering: reduce the size of V
- History clustering: reduce the number of histories
- Maximum entropy: use exponential models
- Neural network: represent words in real space $\mathbb{R}^d$, use an exponential model
Smoothing Techniques
- Good smoothing techniques: Deleted Interpolation, Katz, Absolute Discounting, Kneser-Ney (KN)
- Kneser-Ney: consistently the best [Chen & Goodman, 1998]

$P_{KN}(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max\!\left(C(w_{i-n+1}^{i}) - D,\ 0\right)}{C(w_{i-n+1}^{i-1})} + \lambda(w_{i-n+1}^{i-1})\, P_{KN}(w_i \mid w_{i-n+2}^{i-1})$

$\lambda(w_{i-n+1}^{i-1}) = \frac{D \cdot \left|\{w_i : C(w_{i-n+1}^{i}) > 0\}\right|}{C(w_{i-n+1}^{i-1})}$

where the lower order distribution uses continuation counts $\hat{C}(w_{i-n+2}^{i}) = \left|\{w_{i-n+1} : C(w_{i-n+1}^{i}) > 0\}\right|$ in place of raw counts.
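The following is a minimal sketch (my own, not the authors' code) of interpolated Kneser-Ney for bigrams in the absolute-discounting form shown above; the discount D is assumed given, and the lower order (unigram) level uses continuation counts. In practice the unigram level would itself be smoothed further.

```python
from collections import defaultdict

def train_kn_bigram(words, D=0.75):
    """Interpolated Kneser-Ney for bigrams:
    P_KN(w|h) = max(C(h,w)-D, 0)/C(h) + lambda(h) * P_cont(w)."""
    bigram, unigram = defaultdict(int), defaultdict(int)
    followers = defaultdict(set)   # distinct words following each history
    preceders = defaultdict(set)   # distinct histories preceding each word
    for h, w in zip(words, words[1:]):
        bigram[(h, w)] += 1
        unigram[h] += 1
        followers[h].add(w)
        preceders[w].add(h)
    n_bigram_types = sum(len(s) for s in preceders.values())

    def prob(h, w):
        # Continuation probability: fraction of bigram types that end in w.
        p_cont = len(preceders[w]) / n_bigram_types
        if unigram[h] == 0:
            return p_cont                      # back off fully for unseen histories
        lam = D * len(followers[h]) / unigram[h]
        return max(bigram[(h, w)] - D, 0) / unigram[h] + lam * p_cont
    return prob
```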
Decision Tree Language Models
- Goal: history clustering by a binary decision tree (DT)
- Internal nodes: a set of histories, one or two questions
- Leaf nodes: a set of histories
- Node splitting algorithms
- DT growing algorithms
Example DT
Training data: aba, aca, acb, bcb, bda (history = first two words, predicted word = third)

- Root node {ab, ac, bc, bd}, predicted-word counts a:3 b:2
  - Question "Is the first word 'a'?" -> leaf {ab, ac}, counts a:2 b:1
  - Question "Is the first word 'b'?" -> leaf {bc, bd}, counts a:1 b:1

New event 'cba': Stuck! The history 'cb' answers no to both questions, so it reaches no leaf.
Previous Work
- DT is an appealing idea: deals with data sparseness
- [Bahl, et al 1989] 20 words in histories, slightly better than 3-gram
- [Potamianos and Jelinek, 1998] fair comparison, negative results on letter n-gram
- Both are top-down with a stopping criterion
- Why doesn't it work in practice?
  - Training data fragmentation: data sparseness
  - No theoretically founded stopping criterion: early termination
  - Greedy algorithms: early termination
Outline
- Basic Language Modeling
  - Language Models
  - Smoothing in n-gram Language Models
  - Decision Tree Language Models
- Random Forests for Language Models
  - Random Forests
  - n-gram
  - Structured Language Model (SLM)
- Experiments
- Conclusions and Future Work
Random Forests
- [Amit & Geman 1997] shape recognition with randomized trees
- [Ho 1998] random subspace
- [Breiman 2001] random forests
- Random Forest (RF): a classifier consisting of a collection of tree-structured classifiers
Our Goal
- Main problems:
  - Data sparseness
  - Smoothing
  - Early termination
  - Greedy algorithms
- Expectations from Random Forests:
  - Less greedy algorithms: randomization and voting
  - Avoid early termination: randomization
  - Conquer data sparseness: voting
Outline
- Basic Language Modeling
  - Language Models
  - Smoothing in n-gram Language Models
  - Decision Tree Language Models
- Random Forests for Language Models
  - Random Forests
  - n-gram: general approach
  - Structured Language Model (SLM)
- Experiments
- Conclusions and Future Work
General DT Growing Approach
- Grow a DT until maximum depth using training data
- Perform no smoothing during growing
- Prune the fully grown DT to maximize heldout data likelihood
- Incorporate KN smoothing during pruning
Node Splitting Algorithm
- Questions: about identities of words in the history
- Definitions (β_i(v) here names the history set whose symbol was lost in the transcript):
  - H(p): the set of histories in a node p
  - position: distance from a word in the history to the predicted word
  - β_i(v): the set of histories with word v in position i
  - split: non-empty sets A_i and B_i, each consisting of sets β_i(v)
  - L(A_i): training data log-likelihood of the node under the split A_i, B_i, using relative frequencies
Node Splitting Algorithm
Algorithm sketch:
1. For each position i:
   a) Initialization: A_i, B_i
   b) For each β_i(v) in A_i:
      i.   Tentatively move β_i(v) to B_i
      ii.  Calculate the log-likelihood increase L(A_i - β_i(v)) - L(A_i)
      iii. If the increase is positive, move β_i(v) and modify the counts
   c) Carry out the same for each β_i(v) in B_i
   d) Repeat b)-c) until no move is possible
2. Split the node according to the best position: the one whose increase in log-likelihood is largest
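A compact sketch of this exchange-style search (my own reading of the steps above, not the authors' implementation) is given below. Events are assumed to be (history-set id, predicted word) pairs for one position, and the likelihood is recomputed from scratch after each tentative move rather than updated incrementally, to keep the sketch short.

```python
import math
from collections import Counter

def log_likelihood(events, side_of):
    """Sum of C(node, w) * log(C(node, w)/C(node)) over both halves of the split,
    where side_of maps a history-set id (beta_i(v)) to 'A' or 'B'."""
    node_word, node_tot = Counter(), Counter()
    for beta, w in events:
        node = side_of[beta]
        node_word[(node, w)] += 1
        node_tot[node] += 1
    return sum(c * math.log(c / node_tot[node]) for (node, w), c in node_word.items())

def split_position(events, betas):
    """Exchange algorithm for one position: start with almost all beta sets in A,
    then keep moving single sets between A and B while the likelihood grows."""
    side_of = {b: 'A' for b in betas}
    side_of[betas[0]] = 'B'                      # crude non-empty initialization
    best = log_likelihood(events, side_of)
    improved = True
    while improved:
        improved = False
        for b in betas:
            if sum(1 for x in betas if side_of[x] == side_of[b]) == 1:
                continue                          # keep both sides non-empty
            side_of[b] = 'B' if side_of[b] == 'A' else 'A'      # tentative move
            new = log_likelihood(events, side_of)
            if new > best:
                best, improved = new, True        # accept the move
            else:
                side_of[b] = 'B' if side_of[b] == 'A' else 'A'  # undo
    return side_of, best
```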
Pruning a Decision Tree
- Smoothing (KN-style) at the DT nodes, where $\Phi_{DT}(\cdot)$ maps a history to its equivalence class (node) in the DT:

$P_{DT}(w_i \mid \Phi_{DT}(w_{i-n+1}^{i-1})) = \frac{\max\!\big(C(w_i, \Phi_{DT}(w_{i-n+1}^{i-1})) - D,\ 0\big)}{C(\Phi_{DT}(w_{i-n+1}^{i-1}))} + \lambda(\Phi_{DT}(w_{i-n+1}^{i-1}))\, P_{KN}(w_i \mid w_{i-n+2}^{i-1})$

- Define:
  - L(p): set of all leaves rooted in p
  - LH(p): smoothed heldout data log-likelihood in p
  - LH(L(p)): smoothed heldout data log-likelihood in L(p)
  - potential: LH(L(p)) - LH(p)
- Pruning: traverse all internal nodes, prune the subtree rooted in p if its potential is negative (similar to CART)
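A minimal sketch of the pruning pass (hypothetical node structure; not the authors' implementation): compute each internal node's potential from smoothed heldout log-likelihoods, bottom-up, and cut subtrees whose potential is negative.

```python
def prune(node):
    """Bottom-up CART-style pruning by heldout potential.

    Each node is assumed to expose .children (a list, empty for leaves) and
    .heldout_loglik, the smoothed heldout log-likelihood of the data reaching
    it when treated as a single leaf (LH(p) in the slides).
    Returns the heldout log-likelihood of the best pruned subtree rooted here."""
    if not node.children:
        return node.heldout_loglik
    leaves_loglik = sum(prune(child) for child in node.children)  # LH(L(p))
    potential = leaves_loglik - node.heldout_loglik
    if potential < 0:
        node.children = []            # prune: keep p itself as a leaf
        return node.heldout_loglik
    return leaves_loglik
```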
Towards Random Forests
- Randomized question selection:
  - Randomized initialization of A_i, B_i
  - Randomized position selection
- Generating a random forest LM:
  - M decision trees are grown randomly
  - Each DT generates a probability sequence on test data
  - Aggregation:

$P_{RF}(w_i \mid w_{i-n+1}^{i-1}) = \frac{1}{M} \sum_{j=1}^{M} P_{DT_j}(w_i \mid \Phi_{DT_j}(w_{i-n+1}^{i-1}))$
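The aggregation step is just an average of the member models' conditional probabilities; a sketch, with `dt_probs` as a hypothetical list of per-tree probability functions:

```python
def rf_prob(dt_probs, history, word):
    """P_RF(w|h) = (1/M) * sum_j P_DTj(w | Phi_DTj(h)).

    Each element of dt_probs is assumed to map the full n-gram history to its
    own equivalence class internally and return a smoothed probability."""
    return sum(p(history, word) for p in dt_probs) / len(dt_probs)
```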
Remarks on RF-LM
- Random Forest Language Model (RF-LM): a collection of randomly constructed DT-LMs
- A DT-LM is an RF-LM: a small forest
- An n-gram LM is a DT-LM: no pruning
- Therefore, an n-gram LM is an RF-LM!
- Single compact model
Outline
- Basic Language Modeling
  - Language Models
  - Smoothing in n-gram Language Models
  - Decision Tree Language Models
- Random Forests for Language Models
  - Random Forests
  - n-gram
  - Structured Language Model (SLM)
- Experiments
- Conclusions and Future Work
A Parse Tree

The Structured Language Model (SLM)

Partial Parse Tree
SLM Probabilities
- Joint probability of words and parse:

$P(W, T) = \prod_{i=1}^{n+1} \Big[ P(w_i \mid W_{i-1} T_{i-1})\, P(t_i \mid W_{i-1} T_{i-1}, w_i) \prod_{j=1}^{N_i} P(p_i^j \mid W_{i-1} T_{i-1}, w_i, t_i, p_i^1, \ldots, p_i^{j-1}) \Big]$

where $W_{i-1} T_{i-1}$ is the word-parse $(i-1)$-prefix, $t_i$ the tag of $w_i$, and $p_i^1, \ldots, p_i^{N_i}$ the parser operations after $w_i$.

- Word probabilities:

$P_{SLM}(w_i \mid W_{i-1}) = \sum_{T_{i-1} \in S_{i-1}} P(w_i \mid W_{i-1} T_{i-1})\, \rho(W_{i-1}, T_{i-1}), \qquad \rho(W_{i-1}, T_{i-1}) = \frac{P(W_{i-1} T_{i-1})}{\sum_{T_{i-1} \in S_{i-1}} P(W_{i-1} T_{i-1})}$

where $S_{i-1}$ is the set of parses kept for the prefix $W_{i-1}$.
Using RFs for the SLM
- Ideally: run the SLM one time, with random forests for all three components:
  $P_{PREDICTOR}(RF),\ P_{TAGGER}(RF),\ P_{PARSER}(RF)$
- Parallel approximation: run the SLM M times, the m-th run using a single DT per component:
  $P_{PREDICTOR}(DT_m),\ P_{TAGGER}(DT_m),\ P_{PARSER}(DT_m), \quad m = 1, \ldots, M$
- Aggregate the M probability sequences
Outline
- Basic Language Modeling
  - Language Models
  - Smoothing in n-gram Language Models
  - Decision Tree Language Models
- Random Forests for Language Models
  - Random Forests
  - n-gram
  - Structured Language Model (SLM)
- Experiments
- Conclusions and Future Work
Experiments
- Goal: compare with Kneser-Ney (KN)
- Perplexity (PPL):
  - UPenn Treebank: 1 million words training, 82k words test
  - Normalized text
- Word Error Rate (WER):
  - WSJ text: 20 or 40 million words training
  - WSJ DARPA'93 HUB1 test data: 213 utterances, 3,446 words
  - N-best rescoring: standard trigram baseline trained on 40 million words
Experiments: trigram perplexity
- Baseline: KN-trigram
- No randomization: DT-trigram
- 100 random DTs: RF-trigram

Model        Heldout PPL   Gain     Test PPL   Gain
KN-trigram   160.1         -        145.0      -
DT-trigram   158.6         0.9%     163.3      -12.6%
RF-trigram   126.8         20.8%    129.7      10.5%
Experiments: Aggregating
Improvements within 10 trees!
Experiments: Why does it work?
- A test event is seen if:
  - KN-trigram: $(w_i \mid w_{i-n+1}^{i-1})$ occurs in the training data
  - DT-trigram: $(w_i \mid \Phi_{DT}(w_{i-n+1}^{i-1}))$ occurs in the training data
  - RF-trigram: $(w_i \mid \Phi_{DT_m}(w_{i-n+1}^{i-1}))$ occurs in the training data for any m

Model        Seen %   Seen PPL   Unseen %   Unseen PPL
KN-trigram   45.6%    19.7       54.4%      773
DT-trigram   58.1%    26.2       41.9%      2069
RF-trigram   91.7%    75.6       8.3%       49818
Experiments: SLM perplexity
- Baseline: KN-SLM
- 100 random DTs for each of the components
- Parallel approximation
- Interpolate with KN-trigram (λ is the weight on the KN-trigram)

Model     λ=0.0   λ=0.4   λ=1.0
KN-SLM    137.9   127.2   145.0
RF-SLM    122.8   117.6   145.0
Gain      10.9%   7.5%    -
Experiments: speech recognition
- Baseline: KN-trigram, KN-SLM
- 100 random DTs for RF-trigram and RF-SLM-P (predictor)
- Interpolate with KN-trigram (40M); WER (%) as a function of the interpolation weight λ:

Model              λ=0.0   0.2    0.4    0.6    0.8
KN-trigram (20M)   14.0    13.6   13.3   13.2   13.1
RF-trigram (20M)   12.9    12.9   13.0   13.0   12.7
KN-trigram (40M)   13.0    -      -      -      -
RF-trigram (40M)   12.4    12.7   12.7   12.7   12.7
KN-SLM (20M)       12.8    12.5   12.6   12.7   12.7
RF-SLM-P (20M)     11.9    12.2   12.3   12.3   12.6
Conclusions
- New RF language modeling approach
- More general LM: RF ⊇ DT ⊇ n-gram
- Randomized history clustering: non-reciprocal data sharing
- Good performance in PPL and WER
- Generalizes well to unseen data
- Portable to other tasks
Future Work
- Random samples of training data
- More linguistically oriented questions
- Direct implementation in the SLM
- Lower order random forests
- Larger test data for speech recognition
- Language model adaptation
Smoothing with a lower order random forest (cf. "lower order random forests" above):

$P_{DT}(w_i \mid \Phi_{DT}(w_{i-n+1}^{i-1})) = \frac{\max\!\big(C(w_i, \Phi_{DT}(w_{i-n+1}^{i-1})) - D,\ 0\big)}{C(\Phi_{DT}(w_{i-n+1}^{i-1}))} + \lambda(\Phi_{DT}(w_{i-n+1}^{i-1}))\, P_{RF}(w_i \mid w_{i-n+2}^{i-1})$
Thank you!