EARS STT Workshop, March 24, 2005

A Study of Some Factors Impacting SuperARV Language Modeling

Wen Wang1

Andreas Stolcke1

Mary P. Harper2

1. Speech Technology & Research Laboratory, SRI International

2. School of Electrical and Computer Engineering, Purdue University


Motivation

• RT-03 SuperARV gave excellent results using a backoff N-gram approximation [ICASSP’04 paper]

• N-gram backoff approximation of RT-04 SuperARV did not generalize to RT-04 evaluation test set

– Dev04: achieved 1.0% absolute WER reduction over baseline LM

– Eval04: no gain in WER (in fact, a small loss)

• RT-04 SARV LM was developed under considerable time pressure

– Training procedure is very time-consuming (weeks to months), due to the syntactic parsing of training data

– Did not have time to examine all design choices in combination

• Reexamine all design decisions in detail


What Changed?

RT-04 SARV training differed from RT-03 in two respects:

• Retrained the Charniak parser using a combination of the Switchboard Penn Treebank and Wall Street Journal Penn Treebank

(The 2003 parser was trained on the WSJ Treebank only.)

• Built a SuperARV LM with additional modifiee lexical feature constraints (Standard+ model)

(The 2003 LM was a SuperARV LM without these additional constraints, i.e., the Standard model.)

Changes had given improvements at various points, but weren’t tested in complete systems on new Fisher data.


Plan of Attack

• Revisit changes to training procedure

– Check effect on old and new data sets and systems

• Revisit the backoff N-gram approximation

– Did we just get lucky in 2003?

– Evaluate full SuperARV LM in N-best rescoring

– Find better approximations

• Start investigation by going back to 2003 LM, then move to current system.

• Validate training software (and document and release)

• Work in progress

• Holding out on eval04 testing (to avoid implicit tuning)


Perplexity of RT-03 LMs

• RT-03 LM training data

• LM types tested:

– “Word”: Word backoff 4-gram, KN smoothed

– “SARV N-gram”: N-gram approximation to standard SuperARV LM

– “SARV Standard”: full SuperARV (without additional constraints)
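For reference, the figures below are the standard per-word perplexity over a test set of N words:

```latex
\mathrm{PPL} = \exp\Bigl(-\frac{1}{N}\sum_{i=1}^{N}\log P(w_i \mid w_1,\dots,w_{i-1})\Bigr)
```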

• Full model gains smaller on dev04

• N-gram approximation breaks down on dev04

Test Set     Word    SARV N-gram   SARV Standard
dev2001      64.34   53.74         52.70
eval2003     70.80   56.25         54.18
dev2004      63.45   62.87         56.97


N-best Rescoring with Full SuperARV LM

• Evaluated full Standard SARV LM in final N-best rescoring

• Based on PLP subsystem of RT-03 CTS system

• Full SARV rescoring is expensive, so tried increasingly longer N-best lists

– Top-50
– Top-500
– Top-2000 (max used in eval system)

• Early passes (including MLLR) use baseline LM, so gains will be limited
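Mechanically, N-best rescoring just replaces the LM score on each hypothesis and reranks. A minimal sketch, assuming a hypothetical score_lm callback standing in for the full SuperARV LM and illustrative combination weights (not the values used here):

```python
# Rerank N-best hypotheses by combining the acoustic score with a new LM
# score; score_lm() is a hypothetical stand-in for the full SuperARV LM.
import math

def rescore_nbest(nbest, score_lm, lm_weight=8.0, word_penalty=0.0):
    """nbest: list of (words, acoustic_logprob) pairs from the decoder.
    score_lm: maps a word sequence to its total log probability."""
    best_hyp, best_score = None, -math.inf
    for words, ac_logprob in nbest:
        total = ac_logprob + lm_weight * score_lm(words) + word_penalty * len(words)
        if total > best_score:
            best_hyp, best_score = words, total
    return best_hyp
```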


RT-03 LM N-best Rescoring Results

• Standard SuperARV reduces WER on eval02, eval03

• No gain on dev04

• Identical gains on eval03-SWB and eval03-Fisher

• SuperARV gain increases with larger hypothesis space

WER (%):

Test Set     Top-50                  Top-500                 Top-2000
             Word   SARV Standard    Word   SARV Standard    Word   SARV Standard
eval2002     26.7   26.1             26.6   25.8             26.3   25.6
eval2003     ---    ---              26.4   26.1             ---    ---
dev2004      18.2   18.2             18.1   18.1             ---    ---


Adding Modifiee Constraints

• Constraints enforced by a Constraint Dependency Grammar (on which SuperARV is based) can be enhanced by utilizing modifiee information in unary and binary constraints

• Expected that this information could improve the SuperARV LM.

• In RT-04 development, explored using only the modifiee’s lexical category in the LM, adding it to the SuperARV tag structure.

• This reduced perplexity and WER in early experiments.

• But: additional tag constraints could have hurt LM generalization!


Perplexity with Modifiee Constraints

• Trained a SuperARV LM augmented with modifiee lexical features on RT-03 LM data (“Standard+” model)

• Standard+ model reduces perplexity on the dev2001 and eval2003 test sets (relative to Standard)

• But not on Fisher (dev04) test set!

Test Set     Word N-gram   SARV N-gram   SARV Standard   SARV Standard+
dev2001      64.34         53.74         52.70           51.35
eval2003     70.80         56.25         54.18           53.09
dev2004      63.45         62.87         56.97           57.53


N-best Rescoring with Modifiee Constraints

• WER reductions consistent with perplexity results

• No improvement on dev04

WER (%):

Test Set     Top-50                                          Top-500
             Word N-gram  SARV Standard  SARV Standard+      Word N-gram  SARV Standard  SARV Standard+
eval2002     26.7         26.1           26.0                26.6         25.8           25.6
eval2003     ---          ---            ---                 26.4         26.1           25.8
dev2004      18.2         18.2           18.2                18.1         18.1           ---


In-domain vs. Out-of-domain Parser Training

• SuperARVs are collected from CDG parses that are obtained by transforming CFG parses

• CFG parses are generated using existing state-of-the-art parsers.

• In 2003: CTS data parsed with parser trained on Wall Street Journal Treebank (out-of-domain parser)

• In 2004: obtained trainable version of Charniak parser

• Retrained parser on a combination of Switchboard Treebank and WSJ Treebank (in-domain parser)

– Expected improved consistency and accuracy of parse structures
– However, there were bugs in that retraining; fixed for the current experiment


Rescoring Results with In-domain Parser

• Reparsed the RT-03 LM training data with in-domain parser

• Retrained Standard SuperARV model (“Standard-retrained”)

• N-best rescoring system as before

• In-domain parsing helps

• Also: number of distinct SuperARV tags reduced in retraining (improved parser consistency)

Top-500 Rescoring WER (%):

Test Set     Word N-gram  SARV Standard  SARV Standard+  SARV Standard-retrained  SARV Standard-retrained+
eval2002     26.6         25.8           25.6            25.6                     25.4


Summary So Far

• Prior design decisions have been validated

• Adding modifiee constraints helps LM on matched data

• Reparsing with retrained in-domain parser improves LM quality

• Now: reexamine approximation used in decoding (work in progress)


N-best Rescoring with RT-04 Full Standard+ Model

• RT-04 model is “Standard+” model (includes modifiee constraints)

• RT-04 had been built with in-domain parser

• Caveat: old parser runs were affected by some (not catastrophic) bugs; still need to reparse RT-04 LM training data (significantly more than RT-03 data)

• Improved WER, but smaller gains than on older test sets

• Gains improve with more hypotheses

• Suggests need for better approximation to enable use of SuperARV in search

WER (%):

Test Set     Top-50                           Top-500
             Word N-gram  SARV Standard+      Word N-gram  SARV Standard+
dev2004      18.0         17.8                17.9         17.6


Original N-gram Approximation Algorithm

• Algorithm description:

1. For each N-gram observed in the training data (whose SuperARV tags are known), compute its probability under the Standard or Standard+ SuperARV LM, generating a new LM after renormalization.

2. For each of these N-grams w1…wn, with tags t1…tn:

   a. Extract the short-SuperARV sequence (a subset of the components of a SuperARV) from t1…tn, denoted st1…stn.

   b. Using the lexicon constructed after training, find the list of word sequences sharing the same short-SuperARV sequence st1…stn.

   c. From this list, select N-grams that do not occur in the training data and that, when added, reduce perplexity on a held-out test set or increase it by less than a threshold.

3. The resulting LM can be pruned to make its size comparable to a word-based LM.

• If the held-out set is small, the algorithm will overfit

• If the held-out set is large, the algorithm will be slow
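The selection in step 2(c) can be sketched as follows, assuming hypothetical helpers sarv_logprob() (query the full SuperARV LM) and heldout_perplexity() (evaluate a candidate LM on the held-out set); this illustrates the procedure above, not the released implementation:

```python
# Sketch of step 2(c): among unseen N-grams sharing a short-SuperARV
# sequence with a training N-gram, keep those that reduce held-out
# perplexity, or increase it by less than a relative threshold.
# sarv_logprob() and heldout_perplexity() are hypothetical helpers.

def select_candidate_ngrams(candidates, lm, sarv_logprob,
                            heldout_perplexity, threshold=0.01):
    """candidates: unseen N-grams (word tuples) to consider adding.
    lm: dict mapping N-gram -> log probability (extended in place)."""
    base_ppl = heldout_perplexity(lm)
    for ngram in candidates:
        trial = dict(lm)
        trial[ngram] = sarv_logprob(ngram)   # scored by the full SuperARV LM
        new_ppl = heldout_perplexity(trial)
        if new_ppl < base_ppl * (1.0 + threshold):
            lm[ngram] = trial[ngram]         # admit the N-gram
            base_ppl = min(base_ppl, new_ppl)
    return lm
```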


Revised N-gram Approximation for SuperARV LMs

• Idea: build a test-set-specific N-gram LM that approximates the SuperARV LM [suggested by Dimitra Vergyri]

• Include all N-grams that “matter” to the decoder

• Method:

Step 1: Perform first-pass decoding on the test set using a word-based language model, and generate HTK lattices.
Step 2: Extract N-grams from the HTK lattices; prune based on posterior counts.
Step 3: Compute conditional probabilities for these N-grams using a standard SuperARV language model.
Step 4: Compute backoff weights based on the conditional probabilities.
Step 5: Apply the resulting N-gram LM in all subsequent decoding passes (using standard tools).

• Some approximations remain:

– Due to pruning in Step 2
– From using only N-gram context, not the full sentence prefix

• Drawback: Step 3 takes significant compute time

– Currently 10xRT, but not yet optimized for speed
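To make Steps 3 and 4 concrete: the per-N-gram probabilities come from the full model, and each context's backoff weight is then set so the distribution renormalizes, as in a standard ARPA-style backoff LM. A minimal sketch, with sarv_cond_prob() as a hypothetical interface to the full SuperARV LM (lowest-order handling omitted):

```python
# Sketch of Steps 3-4: score lattice N-grams with the full SuperARV LM,
# then compute ARPA-style backoff weights so each context renormalizes.
# sarv_cond_prob(word, context) is a hypothetical interface returning
# P(word | context) under the full SuperARV LM.

from collections import defaultdict

def build_approx_lm(ngrams, sarv_cond_prob):
    """ngrams: iterable of word tuples extracted from first-pass lattices."""
    prob = {}                              # (context, word) -> P(word | context)
    by_context = defaultdict(list)
    for ng in ngrams:                      # Step 3: conditional probabilities
        context, word = ng[:-1], ng[-1]
        prob[(context, word)] = sarv_cond_prob(word, context)
        by_context[context].append(word)

    bow = {}                               # Step 4: backoff weights per context
    for context, words in by_context.items():
        explicit = sum(prob[(context, w)] for w in words)
        lower = sum(prob.get((context[1:], w), 0.0) for w in words)
        bow[context] = (1.0 - explicit) / max(1.0 - lower, 1e-10)
    return prob, bow
```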


Lattice N-gram Approximation Experiment

• Based on RT-03 Standard SuperARV LM

• Extracted N-grams from first-pass HTK lattices

• Pruned N-grams with total posterior count < 10^-3

• Left with 3.6M N-grams on a 6h test set

• RT-02/03 experiment

– Uses 2003 acoustic models
– 2000-best rescoring (1st pass)

• Dev-04 experiment

– Uses 2004 acoustic models
– Lattice rescoring (1st pass)
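The pruning step simply accumulates each N-gram's posterior mass over all its lattice occurrences and drops rare N-grams. A sketch, with lattice_ngrams() as a hypothetical iterator yielding (N-gram, posterior) pairs from the HTK lattices:

```python
# Sketch of posterior-count pruning: keep only N-grams whose total
# posterior mass across the test-set lattices reaches a threshold.
# lattice_ngrams is a hypothetical iterable of (ngram, posterior) pairs.

from collections import defaultdict

def prune_by_posterior(lattice_ngrams, threshold=1e-3):
    counts = defaultdict(float)
    for ngram, posterior in lattice_ngrams:
        counts[ngram] += posterior          # accumulate occurrence posteriors
    return {ng for ng, c in counts.items() if c >= threshold}
```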


Lattice N-gram Approximation Results

• 1.2% absolute gain on old (matched) test sets

• Small 0.2% gain on Fisher (mismatched) test set

• Recall: no Fisher gain previously with N-best rescoring

• Better exploitation of full hypothesis space yields results

WER (%):

Test Set     Word N-gram   SARV Lattice N-gram
eval2002     32.1          30.9
eval2003     32.1          30.9
dev2004      20.7          20.5


Conclusions and Future Work

• There is a tradeoff between the generality and selectivity of a SuperARV model, much as was observed in our past CDG grammar induction experiments.

– When a model is made more constrained, its generality may be reduced.

– Modifiee lexical features help strengthen constraints for word prediction, but they may need more, or better-matched, training data.

– We need a better understanding of the interaction between this knowledge source and characteristics of the training data, e.g., the Fisher domain.

• For a structured model like the SuperARV model, it is beneficial to improve the quality of the training syntactic structures, e.g., making them less errorful or more consistent.

– Observed LM win from better parses (using retrained parser)
– Can expect further gains from advances in parse accuracy


Conclusions and Future Work (Cont.)

• Old N-gram approximation was flawed

• New N-gram approximation looks promising, but also needs more work

– Tests using full system
– Rescoring algorithm needs speeding up

• Still to do: reparse current CTS LM training set.

• Longer term: plan to investigate how conversational speech phenomena (sentence fragments, disfluencies) can be modeled better in the SuperARV framework.