EARS STT Workshop, March 24, 2005

A Study of Some Factors Impacting SuperARV Language Modeling

Wen Wang1

Andreas Stolcke1

Mary P. Harper2

1. Speech Technology & Research Laboratory, SRI International

2. School of Electrical and Computer Engineering, Purdue University


Motivation

• RT-03 SuperARV gave excellent results using a backoff N-gram approximation [ICASSP’04 paper]

• N-gram backoff approximation of RT-04 SuperARV did not generalize to RT-04 evaluation test set

– Dev04: achieved 1.0% absolute WER reduction over baseline LM

– Eval04: no gain in WER (in fact, a small loss)

• RT-04 SARV LM was developed under considerable time pressure

– Training procedure is very time-consuming (weeks to months), due to the syntactic parsing of training data

– Did not have time to examine all design choices in combination

• Reexamine all design decisions in detail


What Changed?

RT-04 SARV training differed from RT-03 in two respects:

• Retrained the Charniak parser using a combination of the Switchboard Penn Treebank and Wall Street Journal Penn Treebank

(The 2003 parser was trained on the WSJ Treebank only.)

• Built a SuperARV LM with additional modifiee lexical feature constraints (Standard+ model)

(The 2003 LM was a SuperARV LM without these additional constraints, i.e., the Standard model.)

Changes had given improvements at various points, but weren’t tested in complete systems on new Fisher data.


Plan of Attack

• Revisit changes to training procedure

– Check effect on old and new data sets and systems

• Revisit the backoff N-gram approximation

– Did we just get lucky in 2003?

– Evaluate full SuperARV LM in N-best rescoring

– Find better approximations

• Start investigation by going back to 2003 LM, then move to current system.

• Validate training software (and document and release)

• Work in progress

• Holding out on eval04 testing (to avoid implicit tuning)


Perplexity of RT-03 LMs

• RT-03 LM training data

• LM types tested:

– “Word”: Word backoff 4-gram, KN smoothed

– “SARV N-gram”: N-gram approximation to standard SuperARV LM

– “SARV Standard”: full SuperARV (without additional constraints)
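For reference, the figures below are the standard per-word perplexity over a test set of N words:

```latex
\mathrm{PPL} = \exp\Bigl(-\frac{1}{N}\sum_{i=1}^{N}\log P(w_i \mid w_1,\dots,w_{i-1})\Bigr)
```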

• Full model gains smaller on dev04

• N-gram approximation breaks down on dev04

Test Set     Word    SARV N-gram   SARV Standard
dev2001      64.34   53.74         52.70
eval2003     70.80   56.25         54.18
dev2004      63.45   62.87         56.97


N-best Rescoring with Full SuperARV LM

• Evaluated full Standard SARV LM in final N-best rescoring

• Based on PLP subsystem of RT-03 CTS system

• Full SARV rescoring is expensive, so tried increasingly longer N-best lists

– Top-50
– Top-500
– Top-2000 (max used in eval system)

• Early passes (including MLLR) use baseline LM, so gains will be limited
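Mechanically, N-best rescoring just replaces the LM score on each hypothesis and reranks. A minimal sketch, assuming a hypothetical score_lm callback standing in for the full SuperARV LM and illustrative combination weights (not the values used here):

```python
# Rerank N-best hypotheses by combining the acoustic score with a new LM
# score; score_lm() is a hypothetical stand-in for the full SuperARV LM.
import math

def rescore_nbest(nbest, score_lm, lm_weight=8.0, word_penalty=0.0):
    """nbest: list of (words, acoustic_logprob) pairs from the decoder.
    score_lm: maps a word sequence to its total log probability."""
    best_hyp, best_score = None, -math.inf
    for words, ac_logprob in nbest:
        total = ac_logprob + lm_weight * score_lm(words) + word_penalty * len(words)
        if total > best_score:
            best_hyp, best_score = words, total
    return best_hyp
```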


RT-03 LM N-best Rescoring Results

• Standard SuperARV reduces WER on eval02, eval03

• No gain on dev04

• Identical gains on eval03-SWB and eval03-Fisher

• SuperARV gain increases with larger hypothesis space

WER (%):

Test Set     Top-50                  Top-500                 Top-2000
             Word   SARV Standard    Word   SARV Standard    Word   SARV Standard
eval2002     26.7   26.1             26.6   25.8             26.3   25.6
eval2003     ---    ---              26.4   26.1             ---    ---
dev2004      18.2   18.2             18.1   18.1             ---    ---


Adding Modifiee Constraints

• Constraints enforced by a Constraint Dependency Grammar (on which SuperARV is based) can be enhanced by utilizing modifiee information in unary and binary constraints

• Expected that this information could improve the SuperARV LM.

• In RT-04 development, explored using only the modifiee’s lexical category in the LM, adding it to the SuperARV tag structure.

• This reduced perplexity and WER in early experiments.

• But: additional tag constraints could have hurt LM generalization!


Perplexity with Modifiee Constraints

• Trained a SuperARV LM augmented with modifiee lexical features on RT-03 LM data (“Standard+” model)

• Standard+ model reduces perplexity on the dev2001 and eval2003 test sets (relative to Standard)

• But not on Fisher (dev04) test set!

Test Set     Word N-gram   SARV N-gram   SARV Standard   SARV Standard+
dev2001      64.34         53.74         52.70           51.35
eval2003     70.80         56.25         54.18           53.09
dev2004      63.45         62.87         56.97           57.53


N-best Rescoring with Modifiee Constraints

• WER reductions consistent with perplexity results

• No improvement on dev04

WER (%):

Test Set     Top-50                                          Top-500
             Word N-gram  SARV Standard  SARV Standard+      Word N-gram  SARV Standard  SARV Standard+
eval2002     26.7         26.1           26.0                26.6         25.8           25.6
eval2003     ---          ---            ---                 26.4         26.1           25.8
dev2004      18.2         18.2           18.2                18.1         18.1           ---


In-domain vs. Out-of-domain Parser Training

• SuperARVs are collected from CDG parses that are obtained by transforming CFG parses

• CFG parses are generated using existing state-of-the-art parsers.

• In 2003: CTS data parsed with parser trained on Wall Street Journal Treebank (out-of-domain parser)

• In 2004: obtained trainable version of Charniak parser

• Retrained parser on a combination of Switchboard Treebank and WSJ Treebank (in-domain parser)

– Expected improved consistency and accuracy of parse structures
– However, there were bugs in that retraining; fixed for the current experiment


Rescoring Results with In-domain Parser

• Reparsed the RT-03 LM training data with in-domain parser

• Retrained Standard SuperARV model (“Standard-retrained”)

• N-best rescoring system as before

• In-domain parsing helps

• Also: number of distinct SuperARV tags reduced in retraining (improved parser consistency)

Top-500 Rescoring WER (%):

Test Set     Word N-gram  SARV Standard  SARV Standard+  SARV Standard-retrained  SARV Standard-retrained+
eval2002     26.6         25.8           25.6            25.6                     25.4


Summary So Far

• Prior design decisions have been validated

• Adding modifiee constraints helps LM on matched data

• Reparsing with retrained in-domain parser improves LM quality

• Now: reexamine approximation used in decoding (work in progress)


N-best Rescoring with RT-04 Full Standard+ Model

• RT-04 model is “Standard+” model (includes modifiee constraints)

• RT-04 had been built with in-domain parser

• Caveat: old parser runs were affected by some (not catastrophic) bugs; still need to reparse RT-04 LM training data (significantly more than RT-03 data)

• Improved WER, but smaller gains than on older test sets

• Gains improve with more hypotheses

• Suggests need for better approximation to enable use of SuperARV in search

WER (%):

Test Set     Top-50                           Top-500
             Word N-gram  SARV Standard+      Word N-gram  SARV Standard+
dev2004      18.0         17.8                17.9         17.6


Original N-gram Approximation Algorithm

• Algorithm description:

1. For each N-gram observed in the training data (whose SuperARV tags are known), compute its probability under the Standard or Standard+ SuperARV LM, generating a new LM after renormalization.

2. For each of these N-grams w1…wn, with tags t1…tn:

   a. Extract the short-SuperARV sequence (a subset of the components of a SuperARV) from t1…tn, denoted st1…stn.

   b. Using the lexicon constructed after training, find the list of word sequences sharing the same short-SuperARV sequence st1…stn.

   c. From this list, select N-grams that do not occur in the training data and that, when added, reduce perplexity on a held-out test set or increase it by less than a threshold.

3. The resulting LM can be pruned to make its size comparable to a word-based LM.

• If the held-out set is small, the algorithm will overfit

• If the held-out set is large, the algorithm will be slow
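The selection in step 2(c) can be sketched as follows, assuming hypothetical helpers sarv_logprob() (query the full SuperARV LM) and heldout_perplexity() (evaluate a candidate LM on the held-out set); this illustrates the procedure above, not the released implementation:

```python
# Sketch of step 2(c): among unseen N-grams sharing a short-SuperARV
# sequence with a training N-gram, keep those that reduce held-out
# perplexity, or increase it by less than a relative threshold.
# sarv_logprob() and heldout_perplexity() are hypothetical helpers.

def select_candidate_ngrams(candidates, lm, sarv_logprob,
                            heldout_perplexity, threshold=0.01):
    """candidates: unseen N-grams (word tuples) to consider adding.
    lm: dict mapping N-gram -> log probability (extended in place)."""
    base_ppl = heldout_perplexity(lm)
    for ngram in candidates:
        trial = dict(lm)
        trial[ngram] = sarv_logprob(ngram)   # scored by the full SuperARV LM
        new_ppl = heldout_perplexity(trial)
        if new_ppl < base_ppl * (1.0 + threshold):
            lm[ngram] = trial[ngram]         # admit the N-gram
            base_ppl = min(base_ppl, new_ppl)
    return lm
```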


Revised N-gram Approximation for SuperARV LMs

• Idea: build a test-set-specific N-gram LM that approximates the SuperARV LM [suggested by Dimitra Vergyri]

• Include all N-grams that “matter” to the decoder

• Method:

Step 1: Perform first-pass decoding on the test set using a word-based language model, and generate HTK lattices.
Step 2: Extract N-grams from the HTK lattices; prune based on posterior counts.
Step 3: Compute conditional probabilities for these N-grams using a standard SuperARV language model.
Step 4: Compute backoff weights based on the conditional probabilities.
Step 5: Apply the resulting N-gram LM in all subsequent decoding passes (using standard tools).

• Some approximations remain:

– Due to pruning in Step 2
– From using only N-gram context, not the full sentence prefix

• Drawback: Step 3 takes significant compute time

– Currently 10xRT, but not yet optimized for speed
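To make Steps 3 and 4 concrete: the per-N-gram probabilities come from the full model, and each context's backoff weight is then set so the distribution renormalizes, as in a standard ARPA-style backoff LM. A minimal sketch, with sarv_cond_prob() as a hypothetical interface to the full SuperARV LM (lowest-order handling omitted):

```python
# Sketch of Steps 3-4: score lattice N-grams with the full SuperARV LM,
# then compute ARPA-style backoff weights so each context renormalizes.
# sarv_cond_prob(word, context) is a hypothetical interface returning
# P(word | context) under the full SuperARV LM.

from collections import defaultdict

def build_approx_lm(ngrams, sarv_cond_prob):
    """ngrams: iterable of word tuples extracted from first-pass lattices."""
    prob = {}                              # (context, word) -> P(word | context)
    by_context = defaultdict(list)
    for ng in ngrams:                      # Step 3: conditional probabilities
        context, word = ng[:-1], ng[-1]
        prob[(context, word)] = sarv_cond_prob(word, context)
        by_context[context].append(word)

    bow = {}                               # Step 4: backoff weights per context
    for context, words in by_context.items():
        explicit = sum(prob[(context, w)] for w in words)
        lower = sum(prob.get((context[1:], w), 0.0) for w in words)
        bow[context] = (1.0 - explicit) / max(1.0 - lower, 1e-10)
    return prob, bow
```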


Lattice N-gram Approximation Experiment

• Based on RT-03 Standard SuperARV LM

• Extracted N-grams from first-pass HTK lattices

• Pruned N-grams with total posterior count < 10^-3

• Left with 3.6M N-grams on a 6h test set

• RT-02/03 experiment

– Uses 2003 acoustic models
– 2000-best rescoring (1st pass)

• Dev-04 experiment

– Uses 2004 acoustic models
– Lattice rescoring (1st pass)
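The pruning step simply accumulates each N-gram's posterior mass over all its lattice occurrences and drops rare N-grams. A sketch, with lattice_ngrams() as a hypothetical iterator yielding (N-gram, posterior) pairs from the HTK lattices:

```python
# Sketch of posterior-count pruning: keep only N-grams whose total
# posterior mass across the test-set lattices reaches a threshold.
# lattice_ngrams is a hypothetical iterable of (ngram, posterior) pairs.

from collections import defaultdict

def prune_by_posterior(lattice_ngrams, threshold=1e-3):
    counts = defaultdict(float)
    for ngram, posterior in lattice_ngrams:
        counts[ngram] += posterior          # accumulate occurrence posteriors
    return {ng for ng, c in counts.items() if c >= threshold}
```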


Lattice N-gram Approximation Results

• 1.2% absolute gain on old (matched) test sets

• Small 0.2% gain on Fisher (mismatched) test set

• Recall: no Fisher gain previously with N-best rescoring

• Better exploitation of full hypothesis space yields results

WER (%):

Test Set     Word N-gram   SARV Lattice N-gram
eval2002     32.1          30.9
eval2003     32.1          30.9
dev2004      20.7          20.5


Conclusions and Future Work

• There is a tradeoff between the generality and selectivity of a SuperARV model, much as was observed in our past CDG grammar induction experiments.

– When a model is made more constrained, its generality may be reduced.

– Modifiee lexical features help strengthen constraints for word prediction, but they may need more, or better-matched, training data.

– We need a better understanding of the interaction between this knowledge source and characteristics of the training data, e.g., the Fisher domain.

• For a structured model like the SuperARV model, it is beneficial to improve the quality of the training syntactic structures, e.g., making them less errorful or more consistent.

– Observed LM win from better parses (using retrained parser)
– Can expect further gains from advances in parse accuracy


Conclusions and Future Work (Cont.)

• Old N-gram approximation was flawed

• New N-gram approximation looks promising, but also needs more work

– Tests using full system
– Rescoring algorithm needs speeding up

• Still to do: reparse current CTS LM training set.

• Longer term: plan to investigate how conversational speech phenomena (sentence fragments, disfluencies) can be modeled better in the SuperARV framework.