
Lattice Minimum Bayes-Risk for WFST-based G2P Conversion
Josef R. Novak

Transcript of novakj/lmbr-decoding-g2p.pdf

Page 1:

Lattice Minimum Bayes-Risk for WFST-based G2P Conversion

Josef R. Novak

Page 2:

Talk outline

•  Review of Grapheme-to-Phoneme conversion
  •  Problem outline
  •  Important previous work
•  Lattice Minimum Bayes-Risk decoding
  •  Problem outline
  •  Important previous work
•  Applying LMBR decoding to G2P conversion
•  Conclusions and Future work

Page 3:

G2P conversion: basic idea

•  Given a new word, predict the pronunciation:
   Input:  REST
   Output: r eh s t
•  G2P conversion is usually broken into 3 related sub-problems:

   R  E  S  T
   |  |  |  |
   r  EH s  T

1. Sequence alignment: align letters and phonemes in the training dictionary
2. Model building: train a model to generate pronunciation hypotheses for previously unseen words
3. Decoding: find the best hypothesis

Page 4:

Current state-of-the-art (1)

•  Joint sequence models
•  Train a model using joint G-P tokens:
   R:r I:A G:ε H:ε T:t
•  Bisani and Ney, "Joint Sequence Models for Grapheme-to-Phoneme Conversion", CSL, 2008
  •  Uses a modified EM algorithm to simultaneously learn G-P alignment and G-P chunk information.
  •  Learn: Pr(I:A) > Pr(G:A)
  •  Learn: Pr(I:A|R:r) > Pr(G:ε|R:r)
•  Pros: state-of-the-art accuracy, elegant formulation
•  Cons: the EM procedure is very slow, may result in over-training for some setups, and decoding may be slow.
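To make the joint-token idea concrete, here is a minimal Python sketch; the tiny aligned dictionary below is invented for illustration (the real aligner and EM training are of course far more involved):

    from collections import Counter

    # Each aligned entry is a sequence of joint G-P tokens; "_" stands in
    # for the epsilon phoneme in tokens like G:ε.
    aligned = [
        ["R:r", "I:A", "G:_", "H:_", "T:t"],  # RIGHT
        ["R:r", "E:EH", "S:s", "T:t"],        # REST
    ]

    # A joint n-gram model is simply an n-gram LM over these joint symbols;
    # contexts such as Pr(I:A | R:r) come from counts like these.
    bigrams = Counter()
    for seq in aligned:
        for a, b in zip(["<s>"] + seq, seq + ["</s>"]):
            bigrams[(a, b)] += 1

    print(bigrams[("R:r", "I:A")], bigrams[("R:r", "E:EH")])  # 1 1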

Page 5:

Current state-of-the-art (2)

•  Discriminative training
•  Jiampojamarn and Kondrak
  •  "Joint Processing and Discriminative Training for L2P Conversion", Proc. ACL, 2008
  •  "Integrating Joint n-gram Features into a Discriminative Training Framework", Proc. NAACL, 2010
•  Pros: state-of-the-art accuracy
•  Cons: slow to train and decode, the ensemble solution is complex, and it requires template construction.

Page 6:

Proposed system

•  Phonetisaurus
•  Decoupled joint-sequence approach based on the Weighted Finite-State Transducer (WFST) framework
  •  EM sequence alignment
  •  Joint N-gram model
  •  Decoder(s):
    •  Simple shortest path (toy sketch below)
    •  N-best re-ranking with a Recurrent Neural Network Language Model (RNNLM)
    •  Lattice Minimum Bayes-Risk decoding for G2P lattices
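The "simple shortest path" decoder is conceptually just a best-path search through the phoneme lattice Project_o(w ∘ M). A self-contained toy version, with a hand-made stand-in lattice and weights as negative log probabilities (not the toolkit's actual OpenFst implementation):

    import heapq

    lattice = {  # state -> list of (next_state, phoneme, weight)
        0: [(1, "r", 0.1)],
        1: [(2, "EH", 0.3), (2, "AH", 1.2)],
        2: [(3, "s", 0.2)],
        3: [(4, "t", 0.1), (4, "d", 1.5)],
        4: [],
    }

    def shortest_path(lat, start=0, final=4):
        # Dijkstra search: pops states in order of accumulated cost,
        # so the first time we pop the final state we have the best path.
        heap = [(0.0, start, [])]
        best = {}
        while heap:
            cost, state, path = heapq.heappop(heap)
            if state == final:
                return cost, path
            if best.get(state, float("inf")) <= cost:
                continue
            best[state] = cost
            for nxt, sym, w in lat[state]:
                heapq.heappush(heap, (cost + w, nxt, path + [sym]))

    print(shortest_path(lattice))  # best cost ~0.7, path ['r', 'EH', 's', 't']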

Page 7:

Lattice Minimum Bayes-Risk decoding

•  Tromble, et al., "Lattice Minimum Bayes-Risk Decoding for Statistical Machine Translation", Proc. EMNLP 2008
•  Blackwood, et al., "Efficient Path Counting Transducers for MBR Decoding of Statistical Machine Translation Lattices", Proc. ACL 2010
•  Basic idea is to find the hypothesis with the lowest risk given the model
  •  Considers the # of times each N-gram occurs in the lattice
•  Result for G2P is a pronunciation-level consensus

The following excerpt (Chapter 7, "Lattice Minimum Bayes-Risk Decoding with WFSTs") gives the decomposed MBR gain function and the N-gram path posteriors:


a sum of independent local gain functions g_u, the decoder can be reformulated in terms of n-gram matches between E and E′ and computed efficiently (Tromble et al., 2008).

Let $\mathcal{N} = \{u_1, \ldots, u_{|\mathcal{N}|}\}$ denote the set of all n-grams in lattice $\mathcal{E}$ and define the n-gram local gain function between two hypotheses $g_u : \mathcal{E} \times \mathcal{E} \rightarrow \mathbb{R}$ for each $u \in \mathcal{N}$ as

$$g_u(E, E') = \theta_u \, \#_u(E') \, \delta_u(E), \qquad (7.5)$$

where $\theta_u$ is an n-gram specific constant, $\#_u(E')$ is the number of times $u$ occurs in $E'$, and $\delta_u(E)$ is 1 if $u$ occurs in $E$ and zero otherwise. The gain $g_u$ is thus $\theta_u$ times the number of occurrences of $u$ in $E'$, or zero if $u$ does not occur in $E$. Using a first order Taylor-series approximation to the gain in log corpus BLEU (Tromble et al., 2008), the overall gain function $G(E, E')$ can be approximated as a linear sum of these local gain functions and a constant $\theta_0$ times the length of the hypothesis $E'$:

$$G(E, E') = \theta_0 |E'| + \sum_{u \in \mathcal{N}} g_u(E, E') \qquad (7.6)$$

Substituting this linear decomposition of the gain function into Equation (7.2) results in an MBR decoder with the form

$$\hat{E} = \mathop{\mathrm{argmax}}_{E' \in \mathcal{E}} \Big\{ \theta_0 |E'| + \sum_{u \in \mathcal{N}} \theta_u \, \#_u(E') \, p(u|\mathcal{E}) \Big\}, \qquad (7.7)$$

where $p(u|\mathcal{E})$ is the path posterior probability of n-gram $u$, which can be computed from the lattice. The important point is that the linear decomposition of the gain function replaces the sum over an exponentially large set of hypotheses $E \in \mathcal{E}$ in the lattice with a sum over n-grams $u \in \mathcal{N}$, which can be computed exactly even for large lattices. The n-gram path posterior probability is the sum of the posterior probabilities of all paths containing the n-gram:

$$p(u|\mathcal{E}) = \sum_{E \in \mathcal{E}_u} P(E|F), \qquad (7.8)$$

where $\mathcal{E}_u = \{E \in \mathcal{E} : \#_u(E) > 0\}$ is the subset of lattice paths containing the n-gram $u$ at least once. The next section describes how these path posterior probabilities can be computed efficiently using general purpose WFST operations.
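To make the decision rule (7.7) concrete, here is a toy worked instance in Python over a three-hypothesis "lattice". The hypotheses, posteriors, and θ values are illustrative only (the θ values are borrowed from the factor slide later in the deck), not output of the toolkit:

    from collections import Counter

    # Toy instance of Eq. (7.7): choose E' maximizing
    # theta_0 * |E'| + sum_u theta_u * #_u(E') * p(u|E).
    hyps = {("r", "EH", "s", "t"): 0.6,    # hypothesis -> posterior P(E|F)
            ("r", "AH", "s", "t"): 0.3,
            ("r", "EH", "s", "d"): 0.1}
    N = 2
    theta = {0: -0.1, 1: 0.058, 2: 0.081}  # n-gram factors (see later slides)

    def ngrams(seq, n):
        return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

    # n-gram path posteriors p(u|E), Eq. (7.8): each path counts once per u
    p = Counter()
    for E, post in hyps.items():
        for u in {u for n in range(1, N + 1) for u in ngrams(E, n)}:
            p[u] += post

    def gain(E_prime):
        g = theta[0] * len(E_prime)
        for n in range(1, N + 1):
            for u, c in Counter(ngrams(E_prime, n)).items():
                g += theta[n] * c * p[u]
        return g

    print(max(hyps, key=gain))  # ('r', 'EH', 's', 't'): the consensus path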

7.1.4 Decoding with Weighted Finite-State Acceptors

This section describes an implementation of lattice minimum Bayes-risk decoding based on weighted finite-state acceptors (Mohri, 1997) and the OpenFst toolkit (Allauzen et al., 2007). Each lattice $\mathcal{E}$ is a weighted directed acyclic graph (DAG) (Cormen et al., 2001) encoding a large space of hypothesised translations output by the baseline system. Denote by $\mathcal{E}_h$ the hypothesis space (e.g. the top 1000-best hypotheses in a k-best list generated from the lattice) and by $\mathcal{E}_e$ the evidence space.

The lattice MBR decoder of Equation (7.7) is implemented by the algorithm shown in Figure 7.1. The input parameters are the posterior distribution smoothing factor $\alpha$, evidence space $\mathcal{E}_e$, hypothesis space $\mathcal{E}_h$, and n-gram factors $\theta_n$ for $n = 0, \ldots, 4$. The return value is the translation hypothesis that maximises the conditional expected gain. The algorithm corresponds to the following sequence of operations:


Page 8:

Lattice Minimum Bayes-Risk decoding (annotated)

The equations above, with the slide's callouts mapped to their terms:


•  #_u(E′): # times n-gram 'u' appears in hypothesis E′
•  θ_u: N-gram weight factor
•  Ê: LMBR best hypothesis
•  θ_0|E′|: theta-weighted copy of hypothesis E′
•  N: set of N-grams occurring in the lattice
•  E_u: subset of hypotheses including the order-n N-gram 'u'

Page 9:

LMBR decoding under the WFST framework

•  Scary algorithm (walked through step by step on the following slides)

5.1 RNNLM N-best rescoring

Recurrent Neural Network Language Models have recently enjoyed a resurgence in popularity in the context of ASR applications (m10). In another recent publication we investigated the applicability of this approach to G2P conversion with joint sequence models by providing support for the rnnlm toolkit (m11). The training corpus for the G2P LM is a corpus of joint sequences, thus it can be used without modification to train a parallel RNNLM. N-best reranking is then accomplished with the proposed toolkit by causing the decoder to output the N-best joint G-P sequences, and employing rnnlm to rerank the N-best joint sequences,

$$H_{Nbest} = \mathrm{NShortestPaths}(w \circ M)$$
$$H_{best} = \mathrm{Project}_o(\mathrm{Rescore}_{rnn}(H_{Nbest})). \qquad (2)$$

In practice the rnnlm models require considerable tuning, and somewhat more time to train, but provide a consistent WA boost. For further details see (m10).
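A minimal sketch of the rescoring step in Equation (2); the stand-in rnnlm scorer and the interpolation weight lam are assumptions for illustration, not the paper's actual combination:

    # Take the N shortest paths (joint G-P sequences), rescore each with an
    # external LM, keep the best, and project it down to phonemes.
    nbest = [
        (3.1, ["R:r", "E:EH", "S:s", "T:t"]),  # (decoder -log prob, joint seq)
        (3.4, ["R:r", "E:AH", "S:s", "T:t"]),
    ]

    def rnnlm_neglogprob(joint_seq):           # placeholder for the real RNNLM
        return 1.0 if "E:EH" in joint_seq else 2.0

    def rescore(nbest, lam=0.5):
        scored = [(lam * g2p + (1 - lam) * rnnlm_neglogprob(seq), seq)
                  for g2p, seq in nbest]
        _, best = min(scored)
        return [tok.split(":")[1] for tok in best]  # Project_o: keep phonemes

    print(rescore(nbest))  # ['r', 'EH', 's', 't']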

5.2 Lattice Minimum Bayes-Risk decoding for G2P

In (t07) the authors note that the aim of MBR decoding is to find the hypothesis that has the "least expected loss under the model". MBR decoding was successfully applied to Statistical Machine Translation (SMT) lattices in (t07), and significantly improved in (b10). Noting the similarities between G2P conversion and SMT, we have begun work implementing an integrated LMBR decoder for the proposed toolkit.

Our approach closely follows that described in (b10), and the algorithm implementation is summarized in Algorithm 3. The inputs are the full phoneme lattice that results from composing the input word with the G2P model and projecting output labels, an exponential scale factor α, and N-gram precision factors θ_{0...N}. The θ_n are computed from the average N-gram precision p and a match ratio r using the following equations: θ_0 = −1/T; θ_n = 1/(4·T·p·r^{n−1}). T is, in effect, an arbitrary constant which does not affect the MBR decision. Line 1 applies α to the raw lattice. In effect this controls how much

Algorithm 3: G2P Lattice MBR-Decode
Input: E ← Project_o(w ∘ M), α, θ_{0...N}

 1  E ← ScaleLattice(α × E)
 2  N_N ← ExtractN-grams(E)
 3  for n ← 1 to N do
 4      Φ_n ← MakeMapper(N_n)
 5      Ψ_n^R ← MakePathCounter(N_n)
 6      U_n ← Opt((E ∘ Φ_n) ∘ Ψ_n^R)
 7      Ω_n ← Φ_n
 8      for state q ∈ Q[Ω_n] do
 9          for arc e ∈ E[q] do
10              w[e] ← θ_n × U_n(o[e])
11  P ← Project_input(E_{θ0} ∘ Ω_1)
12  for n ← 2 to N do
13      P ← Project_input(P ∘ Ω_n)
14  H_best ← ShortestPath(P)

we trust the raw lattice weights. After applying α, E is normalized by pushing weights to the final state and removing any final weights. In line 2 all unique N-grams up to order N are extracted from the lattice. Lines 4-10 create, for each order, a context-dependency FST (Φ_n) and a special path-posterior counting WFST (Ψ_n^R), which are then used to compute N-gram posteriors (U_n), and finally to create a decoder WFST (Ω_n). The full MBR decoder is then computed by first making an unweighted copy of E, applying θ_0 uniformly to all arcs, and iteratively composing and input-projecting with each Ω_n. The MBR hypothesis is then the best path through the result P. See (t07; b10) for further details.

6 Experimental results

Experimental evaluations were conducted utilizing three standard G2P test sets. These included replications of the NetTalk, CMUdict, and OALD English language dictionary evaluations described in detail in (b08). Results comparing various configurations of the proposed toolkit to the joint sequence model Sequitur (b08) and an alternative discriminative training toolkit DirecTL+ (j10) are described in Table 3. Here m2m-P indicates the proposed toolkit using the alignment algorithm from (j07), and m2m-fst-P indicates the alternative

Callout: E is the lattice of pronunciation hypotheses.

Page 10:


Algorithm line 1: apply exponential scale factor α = 0.3 (log semiring)
•  Scale arcs × 0.3
•  Optimize
•  Cut final weights
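A sketch of what the α-scaling does, shown on whole-path log-semiring scores rather than on individual arcs (values invented); smaller α flattens the distribution, i.e. trusts the raw lattice weights less:

    import math

    alpha = 0.3
    path_neglog = [1.2, 2.0, 3.5]                    # made-up raw path scores

    scaled = [alpha * w for w in path_neglog]         # alpha x E
    z = -math.log(sum(math.exp(-w) for w in scaled))  # log-semiring normalizer
    posteriors = [math.exp(-(w - z)) for w in scaled]

    print(round(sum(posteriors), 6))                  # 1.0: paths sum to 1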

Page 11:


After scaling and normalization, the path posteriors satisfy Σ_{E∈E} P(E|F) = 1: all hypotheses sum to 1.

Page 12:


Extracting N-grams

•  Objective: find all unique N-grams up to order N
•  Naïve solution: recursively search the lattice
   ✗ Very slow for lattices with a large average branching factor
•  Better solution: build a counting FST, compose it with the lattice, and optimize the result
   ○ Much faster, easier to search
•  Even better: modify the FST, restrict to the largest N-gram order, and recursively search the result
   ○○ Even faster, even easier!

Page 13:

Extracting N-grams

•  Build a specialized counting FST for the largest order planned for the LMBR decoding process:
  •  ε (epsilon): null symbol
  •  σ (sigma): match any non-ε symbol
•  Compose, optimize and extract all unique N-grams:
   N_N = Extract( Optimize( Project_I( E ∘ C_N ) ) )

[Figure: an unweighted copy of the lattice is composed with the optimized 2-gram counter, giving an optimized 2-gram lattice; the extracted unique N-grams are:]

1-grams   2-grams
d         dAH
r         rEH
EH        EHs
AH        AHs
s         sd, st
t         --
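A naive-search version of the extraction step in Python, on a toy lattice matching the d/r, AH/EH, s, t/d example above; the counting-FST route computes the same set by composition and scales much better with branching factor:

    # DFS over a lattice DAG, collecting every unique n-gram up to order N.
    # Exponential in the worst case: exactly the slide's "very slow" caveat.
    lattice = {0: [(1, "d"), (2, "r")],
               1: [(3, "AH")], 2: [(3, "EH")],
               3: [(4, "s")], 4: [(5, "t"), (5, "d")], 5: []}

    def extract_ngrams(lat, N, state=0, suffix=(), found=None):
        if found is None:
            found = set()
        for nxt, sym in lat[state]:
            hist = (suffix + (sym,))[-N:]
            for n in range(1, len(hist) + 1):
                found.add(hist[-n:])          # all n-grams ending at sym
            extract_ngrams(lat, N, nxt, hist, found)
        return found

    grams = extract_ngrams(lattice, N=2)
    print(sorted(u for u in grams if len(u) == 2))
    # [('AH','s'), ('EH','s'), ('d','AH'), ('r','EH'), ('s','d'), ('s','t')]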

Page 14:


Building N-gram mapping FSTs

•  Objective: transform the raw lattice into an explicit N-gram lattice for order n.
•  Solution: build an N-gram tree from N_N and connect final states as dictated by the raw lattice.

Page 15:

Building N-gram mapping FSTs (1)

•  Use N_N to generate an N-gram tree FST of order ≤ n
  •  Input symbols: unigrams
  •  Output symbols: order-n sequences are N-gram labels, all other output labels are ε
•  Connect final states as dictated by the N-gram table and optimize the result:

Φ_2 = Opt( ConnectFinal( Φ_2 tree ) )
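What the mapper Φ_n computes on a single path, as a Python sketch: unigrams in, ε (shown as None) out until a full order-n window has been read, then the n-gram label:

    def map_to_ngrams(path, n):
        out = []
        for i, sym in enumerate(path):
            if i < n - 1:
                out.append(None)                      # epsilon output
            else:
                out.append("".join(path[i - n + 1:i + 1]))
        return out

    print(map_to_ngrams(["r", "EH", "s", "t"], 2))
    # [None, 'rEH', 'EHs', 'st']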

Page 16:

Building N-gram mapping FSTs (2)

•  Compose with E, project output labels (Project_o), and optimize to obtain the order-n version of the raw unweighted lattice


Page 17:


Building lattice posterior N-gram path counting WFSTs

•  Objective: efficiently compute the lattice posterior N-gram probabilities p(u|E), for each N-gram u
•  Solution: build a right path-counter Ψ_n^R to simultaneously extract the final occurrence of each n-gram from each order-n lattice copy


Page 18:

Building lattice posterior N-gram path counting WFSTs

•  Construct Ψ_n^R from N_n for each order n:
  •  ε (epsilon): null symbol
  •  σ (sigma): match any non-ε symbol
  •  ρ (rho): match all other symbols
•  The ρ arcs map all other symbols to ε.
  •  At state 1, arc ρ:ε maps everything but the 2-gram 'dAH' to ε
  •  Arc 1→2 ensures that all but the final occurrence of 'rEH' on a path is deleted
•  Composition with the order-n lattice will extract just the last occurrence of each n-gram on a path


2-grams: dAH, rEH, EHs, AHs, sd, st

Page 19:


Computing lattice posterior N-gram probabilities

•  Objective: use E, Φ_n and Ψ_n^R to compute all p(u|E)
•  Solution: compose the component WFSTs together and optimize


Page 20:

Computing lattice posterior N-gram probabilities

•  Iteratively compose together the components E, Φ_2 and Ψ_2^R and optimize, U_2 = Opt((E ∘ Φ_2) ∘ Ψ_2^R), to obtain an FST encoding all p(u|E) of order 2


Extracted 2-grams and weights:

2-grams   p(u|E)
dAH       1.0689
rEH       0.42065
EHs       0.42065
AHs       1.0689
sd        1.1289
st        0.39065
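A sketch of the quantity U_n encodes, computed directly from an enumerated path set. The posterior values here are invented toy probabilities; they will not match the slide's table, whose entries come out of the log-semiring computation rather than being normalized probabilities:

    from collections import Counter

    # For each n-gram u, sum the posteriors of paths containing u, counting
    # u once per path (the path counter keeps only the final occurrence).
    paths = [(0.5,   ["d", "AH", "s", "t"]),   # (posterior, phoneme path)
             (0.375, ["r", "EH", "s", "t"]),
             (0.125, ["d", "AH", "s", "d"])]

    U = Counter()
    for post, p in paths:
        grams = {tuple(p[i:i + 2]) for i in range(len(p) - 1)}  # once per path
        for u in grams:
            U[u] += post

    print(U[("d", "AH")], U[("s", "t")])  # 0.625 0.875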

Page 21:

LMBR decoding under the WFST framework   Scary algorithm

5.1 RNNLM N-best rescoring

Recurrent Neural Network Language Modelshave recently enjoyed a resurgence in popular-ity in the context of ASR applications (m10).In another recent publication we investigatedthe applicability of this approach to G2P con-version with joint sequence models by provid-ing support for the rnnlm toolkit (m11). Thetraining corpus for the G2P LM is a corpusof joint sequences, thus it can be used withoutmodification to train a parallel RNNLM. N-bestreranking is then accomplished with the pro-posed toolkit by causing the decoder to outputthe N-best joint G-P sequences, and employingrnnlm to rerank the the N-best joint sequences,

HNbest =NShortestPaths(w !M)

Hbest =Projecto(Rescorernn(HNbest)).(2)

In practice the rnnlm models require consider-able tuning, and somewhat more time to train,but provide a consistent WA boost. For furtherdetails see (m10).

5.2 Lattice Minimum Bayes-Riskdecoding for G2P

In (t07) the authors note that the aim of MBRdecoding is to find the hypothesis that has the“least expected loss under the model”. MBRdecoding was successfully applied to Statisti-cal Machine Translation (SMT) lattices in (t07),and significantly improved in (b10). Noting thesimilarities between G2P conversion and SMT,we have begun work implementing an integratedLMBR decoder for the proposed toolkit.

Our approach closely follows that describedin (b10), and the algorithm implementation issummarized in Algorithm 3. The inputs are thefull phoneme lattice that results from compos-ing the input word with the G2P model and pro-jecting output labels, an exponential scale factor!, and N-gram precision factors "0!N . The "nare computed using average N-gram precision p,and a match ratio r using the following equa-tions, "0 = "1/T ; "n = 1/4Tprn!1. T is, ine!ect an arbitrary constant which does not af-fect the MBR decision. Line 1 applies ! to theraw lattice. In e!ect this controls how much

Algorithm 3: G2P Lattice MBR-Decode

Input: $E \leftarrow \mathrm{Project}_o(w \circ M)$, $\alpha$, $\theta_0, \ldots, \theta_N$

1   $E \leftarrow \mathrm{ScaleLattice}(\alpha \times E)$
2   $\mathcal{N} \leftarrow \mathrm{ExtractNgrams}(E)$
3   for $n \leftarrow 1$ to $N$ do
4       $\Phi_n \leftarrow \mathrm{MakeMapper}(\mathcal{N}_n)$
5       $\Psi_n^R \leftarrow \mathrm{MakePathCounter}(\mathcal{N}_n)$
6       $U_n \leftarrow \mathrm{Opt}((E \circ \Phi_n) \circ \Psi_n^R)$
7       $\Omega_n = \Phi_n$
8       for state $q \in Q[\Omega_n]$ do
9           for arc $e \in E[q]$ do
10              $w[e] \leftarrow \theta_n \times U(o[e])$
11  $P \leftarrow \mathrm{Project}_{input}(E_{\theta_0} \circ \Omega_1)$
12  for $n \leftarrow 2$ to $N$ do
13      $P \leftarrow \mathrm{Project}_{input}(P \circ \Omega_n)$
14  $H_{best} = \mathrm{ShortestPath}(P)$

Line 1 applies $\alpha$ to the raw lattice; in effect this controls how much we trust the raw lattice weights. After applying $\alpha$, $E$ is normalized by pushing weights to the final state and removing any final weights. In line 2 all unique N-grams up to order $N$ are extracted from the lattice. Lines 4-10 create, for each order, a context-dependency FST ($\Phi_n$) and a special path-posterior counting WFST ($\Psi_n^R$), which are then used to compute N-gram posteriors ($U_n$), and finally to create a decoder WFST ($\Omega_n$). The full MBR decoder is then computed by first making an unweighted copy of $E$, applying $\theta_0$ uniformly to all arcs, and iteratively composing and input-projecting with each $\Omega_n$. The MBR hypothesis is then the best path through the result $P$. See (t07; b10) for further details.
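To make lines 11-14 concrete without the WFST machinery, here is a small self-contained Python sketch that applies the same linearized-BLEU decision rule to an explicit k-best list rather than a full lattice (the k-best shortcut is mentioned later under "LMBR decoding: Issues"). The hypothesis format and the way posteriors are derived from costs are illustrative assumptions, not the toolkit's integrated implementation.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All order-n subsequences of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def mbr_rerank(nbest, N=2, T=10.0, p=0.85, r=0.72):
    """nbest: list of (cost, phoneme_token_list); cost = -log probability.

    Approximates lattice MBR over a k-best list: the posteriors p(u|E)
    are accumulated from normalized hypothesis probabilities (one count
    per path, mirroring the path-posterior counting WFST), and each
    hypothesis is scored with the linear gain
        theta_0 * |E'| + sum_u theta_{|u|} * #_u(E') * p(u|E).
    """
    # Hypothesis posteriors from the (negative-log) costs.
    Z = sum(math.exp(-c) for c, _ in nbest)
    post = [math.exp(-c) / Z for c, _ in nbest]

    # N-gram factors, using the 1/(N*T*p*r^(n-1)) variant from the slides.
    theta = {0: -1.0 / T}
    for n in range(1, N + 1):
        theta[n] = 1.0 / (N * T * p * r ** (n - 1))

    # p(u|E): total mass of hypotheses containing u at least once.
    ngram_post = Counter()
    for w, (_, hyp) in zip(post, nbest):
        for n in range(1, N + 1):
            for u in set(ngrams(hyp, n)):
                ngram_post[u] += w

    # Score each hypothesis with the linear gain; return best-first.
    def gain(hyp):
        g = theta[0] * len(hyp)
        for n in range(1, N + 1):
            for u, cnt in Counter(ngrams(hyp, n)).items():
                g += theta[n] * cnt * ngram_post[u]
        return g

    return sorted(nbest, key=lambda ch: -gain(ch[1]))
```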

6 Experimental results

Experimental evaluations were conducted utilizing three standard G2P test sets. These included replications of the NetTalk, CMUdict, and OALD English language dictionary evaluations described in detail in (b08). Results comparing various configurations of the proposed toolkit to the joint sequence model Sequitur (b08) and an alternative discriminative training toolkit direcTL+ (j10) are described in Table 3. Here m2m-P indicates the proposed toolkit using the alignment algorithm from (j07), m2m-fst-P indicates the alternative

  Objective: Obtain the Minimum Bayes-Risk best hypothesis given the posterior lattice N-gram probabilities

  Solution: Make a copy of each mapper, and apply the posterior N-gram probabilities, scaled by the N-gram factors, to all arcs

Building LMBR decoders

Page 22: novakj/lmbr-decoding-g2p.pdfCurrent state-of-the-art (1) Joint sequence models Train a model using joint G-P tokens, R:r I:A G:ε H:ε T:t Bisani and Ney, “Joint Sequence Models

Building LMBR decoders
•  Build a decoder WFST for each order-n
•  Make a copy of the order-n mapper
•  Apply θn-scaled posterior N-gram probabilities to all arcs

Computing BLEU N-gram factors

•  T: scale factor
•  p: avg. n-gram precision
•  r: n-gram match ratio
•  n: n-gram order

$\theta_0 = -1/T \qquad \theta_n = \dfrac{1}{NTp\,r^{\,n-1}}$

With T = 10, p = 0.85, r = 0.72 and N = 2 this gives $\theta_0 = -0.1$, $\theta_1 = 0.058$, $\theta_2 = 0.081$. (Under the fixed BLEU-4 constant used in the paper text, $\theta_n = 1/(4Tp\,r^{n-1})$, the same inputs give $\theta_2 = 0.04$.)

Applying $\theta_2$ to the extracted 2-gram posteriors:

2-gram   p(u|E)    θ2 × p(u|E)
dAH      1.0689    0.08722
rEH      0.42065   0.03423
EHs      0.42065   0.03423
AHs      1.0689    0.08722
sd       1.1289    0.09211
st       0.39065   0.03187
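A quick numerical check of the factors and the scaled table above; a standalone sketch using the slide's worked values.

```python
# Recompute the worked N-gram factors and the theta_2-scaled 2-gram
# posteriors from the table above, using the 1/(N*T*p*r^(n-1)) variant
# of the factor formula (it matches the slide's worked values).
T, p, r, N = 10.0, 0.85, 0.72, 2

theta0 = -1.0 / T                                   # -0.1
theta = {n: 1.0 / (N * T * p * r ** (n - 1)) for n in range(1, N + 1)}
print(theta0, theta)                                # theta_1 ~ 0.059, theta_2 ~ 0.082

posteriors = {"dAH": 1.0689, "rEH": 0.42065, "EHs": 0.42065,
              "AHs": 1.0689, "sd": 1.1289, "st": 0.39065}
for u, pu in sorted(posteriors.items()):
    print(u, round(theta[2] * pu, 5))               # ~ the table's third column
```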


Page 23


  Objective: Obtain the Minimum Bayes-Risk best hypothesis using the decoders

  Solution: Iteratively compose and input-project the scaled lattice and extract the shortest path through the final result

Computing the LMBR best hypothesis


Page 24

Computing the LMBR best hypothesis
•  Construct $E_{\theta_0}$ by applying $\theta_0$ to an un-weighted copy of $E$
•  Iteratively compose with each $\Omega_n$ and project input labels after each composition
   •  Applies n-gram posteriors and accumulates multiple counts
•  Extract the shortest path through the final result*

N-gram factors: T = 10, p = 0.85, r = 0.72, giving $\theta_0 = -0.1$, $\theta_1 = 0.058$, $\theta_2 = 0.081$.

For N = 2:
$H_{best} = \mathrm{ShortestPath}(\mathrm{Project}_i(\mathrm{Project}_i(E_{\theta_0} \circ \Omega_1) \circ \Omega_2))$

For N = 3:
$H_{best} = \mathrm{ShortestPath}(\mathrm{Project}_i(\mathrm{Project}_i(\mathrm{Project}_i(E_{\theta_0} \circ \Omega_1) \circ \Omega_2) \circ \Omega_3))$
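A minimal sketch of this composition chain, assuming OpenFst-style Python bindings such as pynini; constructing the per-order decoders $\Omega_n$ (lines 4-10 of Algorithm 3) is omitted, so this only illustrates the final loop (lines 11-14).

```python
import pynini

def lmbr_best_path(e_theta0, omegas):
    """e_theta0: unweighted copy of the evidence lattice with theta_0
    applied uniformly to every arc. omegas: decoder FSTs Omega_1..Omega_N.

    Iteratively compose with each decoder, projecting onto input labels
    after every composition, then return the shortest path.
    """
    P = e_theta0
    for omega in omegas:
        P = pynini.compose(P, omega).project("input")
    return pynini.shortestpath(P)
```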

Page 25

LMBR Gain: What happened?
•  The small alpha parameter smooths the raw posterior lattice distribution
•  The LMBR process emphasizes N-gram occurrences across paths in the lattice

Compare: the ‘rEH’ and ‘EHs’ 2-grams appear in 2 hypotheses. This boosts the 2nd-rank hypothesis into the 1st.
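A toy recreation of this effect; the hypotheses and posterior masses below are invented for illustration, and only the shared-2-gram structure mirrors the slide.

```python
# Toy illustration: hypothesis B shares its 2-grams with hypothesis C,
# so the accumulated 2-gram posteriors lift B above the MAP-best A.
# All numbers here are invented for illustration.
theta0, theta1, theta2 = -0.1, 0.058, 0.081

hyps = {"A": "d AH s t".split(), "B": "r EH s t".split(),
        "C": "r EH s".split()}
post = {"A": 0.40, "B": 0.35, "C": 0.25}          # hypothesis posteriors

def grams(toks, n):
    return [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]

# N-gram posteriors: mass of the hypotheses containing each n-gram.
pu = {}
for h, toks in hyps.items():
    for n in (1, 2):
        for u in set(grams(toks, n)):
            pu[u] = pu.get(u, 0.0) + post[h]

def gain(toks):
    g = theta0 * len(toks)
    g += sum(theta1 * pu[u] for u in grams(toks, 1))
    g += sum(theta2 * pu[u] for u in grams(toks, 2))
    return g

# B outranks A even though A has the highest raw posterior.
for h, toks in sorted(hyps.items(), key=lambda kv: -gain(kv[1])):
    print(h, round(gain(toks), 4))
```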

Page 26

LMBR decoding: Issues
•  Sensitive to parameter values:
   •  N-gram order
   •  alpha
   •  thetas
•  Slower than MAP decoding due to the large number of posterior computations
•  Sometimes preferable to use a k-best subset of the lattice if speed is a concern

Page 27

Recent G2P Experiments
•  N-best rescoring with a Recurrent Neural Network Language Model (RNNLM)
   •  Cross comparison
   •  State-of-the-art improvements
•  LMBR decoding
   •  Small improvements versus MAP phoneme-accuracy

Page 28

Experiments: RNNLM N-best rescoring


•  Replication of standard English G2P test sets from Bisani 2008
•  Small but consistent improvements from the proposed m2m-fst-P alignment algorithm
•  Consistent improvements to the state-of-the-art with RNNLM-based N-best rescoring

Page 29

Experiments: training & decoding speed
•  Comparison of relative training and decoding times for the proposed system versus a variety of alternatives

[Figure: word accuracy vs. training time for the proposed system and alternatives]

Page 30

Experiments: LMBR decoding for G2P

NETtalk-15k

Decoder       WA (%)   PA (%)
MAP           66.5     91.80
LMBR (n=6)    66.5     91.82

•  LMBR is impacted by n-gram order: improvement through n=6, whereas SMT typically sets n=4
•  Very small improvement to phoneme accuracy vs. MAP
•  Still worse than RNNLM; better thetas and alpha are needed

Page 31

Experiments: LMBR real-world re-ranking example
•  ./phonetisaurus-g2p -m model.fst -n 5 -w abbreviate -a .3 -d 1 -r

N=1:
-0.21198   e b r i v i e t
-0.20022   @ b r i v i e t
-0.17939   @ b r e v i e t
-0.17864   e b b r i v i e t
-0.16689   @ b b r i v i e t

N=2:
-0.235998  @ b r i v i e t
-0.120462  @ b b r i v i e t
 0.0844044 @ b i v i e t
 0.140279  @ b r i v i x t
 0.140505  x b r i v i e t

N=3:
-0.126297  @ b r i v i e t
 0.210171  @ b b r i v i e t
 0.432433  @ b i v i e t
 0.4813    x b r i v i e t
 0.557048  @ b r x v i e t

N=4:
 0.0823436 @ b r i v i e t
 0.723454  @ b b r i v i e t
 0.895162  x b r i v i e t
 0.933964  @ b i v i e t
 1.01223   x b b r i v i e t

N=5:
 0.419434  @ b r i v i e t
 1.32085   @ b b r i v i e t
 1.38801   x b r i v i e t
 1.54077   x b b r i v i e t
 1.56691   @ b i v i e t

N=6:
 0.821685  @ b r i v i e t
 1.99726   x b r i v i e t
 2.06059   @ b b r i v i e t
 2.2304    x b b r i v i e t
 2.23651   a b r i v i e t

Page 32

Conclusions and Future work (1)
•  New state-of-the-art results for G2P conversion on several test sets:
   •  Small improvements to EM-driven multiple sequence alignment algorithm
   •  N-best rescoring with a Recurrent Neural Network Language Model
•  Lattice Minimum Bayes-Risk decoding applied to G2P conversion
   •  Small improvement to phoneme accuracy compared to MAP-based approach
   •  Need to determine a better set of BLEU n-gram factors for the G2P problem
   •  LMBR decoder applicable to alignment as well

Page 33

Conclusions and Future Work (2)
•  Apply the LMBR decoder to incremental spoken dialog system processing
   •  The Psi WFST can extract either the first or last occurrence of an N-gram from a lattice path
   •  Most action hypotheses in the dialog system should be roughly aligned in time
   •  Should be possible to incrementally aggregate acts that are in agreement across a lattice
   •  Use this as a pruning technique during dialog management
   •  Use this as a method to ‘jump’, ‘barge-in’, or provide a back-channeling response to a user who hasn’t finished talking