Lattice Minimum Bayes-Risk for WFST-based G2P Conversion
Josef R. Novak
Talk outline
Review of Grapheme-to-Phoneme conversion
Problem outline
Important previous work
Lattice Minimum Bayes-Risk decoding
Problem outline
Important previous work
Applying LMBR decoding to G2P conversion
Conclusions and Future work
G2P conversion: basic idea
Given a new word, predict the pronunciation:
Input: REST → Output: r eh s t
G2P conversion is usually broken into 3 related sub-problems:
R  E   S  T
|  |   |  |
r  EH  s  T
1. Sequence alignment Align letters and phonemes in the training dictionary
2. Model building Train a model to generate pronunciation hypotheses for previously unseen words
3. Decoding Find the best hypothesis
Current state-of-the-art (1): Joint sequence models
Train a model using joint G-P tokens:
R:r I:A G:ε H:ε T:t
Bisani and Ney, “Joint Sequence Models for Grapheme-to-Phoneme Conversion”, Computer Speech & Language, 2008
Uses a modified EM algorithm to simultaneously learn G-P alignment and G-P chunk information.
Learn: Pr(I:A) > Pr(G:A)
Learn: Pr(I:A|R:r) > Pr(G:ε|R:r)
Pros: State-of-the-art accuracy, elegant formulation
Cons: EM procedure is very slow, may result in over-training for some setups, decoding may be slow.
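Concretely, the joint-token idea can be sketched in a few lines of Python. The helper name and the `<eps>` marker are illustrative only, not part of any toolkit; the alignment itself is assumed to be given by the EM aligner:

```python
def joint_tokens(graphemes, phonemes):
    """Zip an aligned grapheme/phoneme pair into joint G:P tokens,
    the training units for a joint n-gram model. The alignment,
    including epsilon insertions, is assumed to be given."""
    if len(graphemes) != len(phonemes):
        raise ValueError("aligned sequences must have equal length")
    return ["%s:%s" % (g, p) for g, p in zip(graphemes, phonemes)]

# The slide's example, with "<eps>" standing in for the null symbol:
tokens = joint_tokens(list("RIGHT"), ["r", "A", "<eps>", "<eps>", "t"])
```

A corpus of such token sequences can then be fed unchanged to a standard n-gram language model trainer.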
Current state-of-the-art (2): Discriminative training
Jiampojamarn and Kondrak, “Joint Processing and Discriminative Training for L2P Conversion”, Proc. ACL, 2008.
“Integrating Joint n-gram Features into a Discriminative Training Framework”, NAACL, 2010.
Pros: State-of-the-art accuracy
Cons: Slow to train, decode, ensemble solution is complex, requires template construction.
Proposed system: Phonetisaurus
Decoupled joint-sequence approach based on the Weighted Finite-State Transducer (WFST) framework:
EM sequence alignment
Joint N-gram model
Decoder(s)
Simple shortest path
N-best re-ranking with a Recurrent Neural Network Language Model (RNNLM)
Lattice Minimum Bayes-Risk decoding for G2P lattices
Decomposed MBR gain function
N-gram path posteriors
Lattice Minimum Bayes-Risk decoding
Tromble, et al., “Lattice Minimum Bayes-Risk Decoding for Statistical Machine Translation”, Proc. EMNLP, 2008
Blackwood, et al., “Efficient Path Counting Transducers for MBR Decoding of Statistical Machine Translation Lattices”, Proc. ACL 2010
Basic idea is to find the hypothesis with the lowest risk given the model.
Considers the number of times each N-gram occurs in the lattice.
Result for G2P is a pronunciation-level consensus
CHAPTER 7. LATTICE MINIMUM BAYES-RISK DECODING WITH WFSTS

When the overall gain function can be expressed as a sum of independent local gain functions g_u, the decoder can be reformulated in terms of n-gram matches between E and E′ and computed efficiently (Tromble et al., 2008).

Let N = {u_1, ..., u_|N|} denote the set of all n-grams in lattice ℰ and define the n-gram local gain function between two hypotheses g_u : ℰ × ℰ → ℝ for each u ∈ N as

g_u(E, E′) = θ_u #_u(E′) δ_u(E),   (7.5)

where θ_u is an n-gram specific constant, #_u(E′) is the number of times u occurs in E′, and δ_u(E) is 1 if u occurs in E and zero otherwise. The gain g_u is thus θ_u times the number of occurrences of u in E′, or zero if u does not occur in E. Using a first-order Taylor series approximation to the gain in log corpus BLEU (Tromble et al., 2008), the overall gain function G(E, E′) can be approximated as a linear sum of these local gain functions and a constant θ_0 times the length of the hypothesis E′:

G(E, E′) = θ_0 |E′| + Σ_{u∈N} g_u(E, E′)   (7.6)

Substituting this linear decomposition of the gain function into Equation (7.2) results in an MBR decoder with the form

Ê = argmax_{E′∈ℰ} { θ_0 |E′| + Σ_{u∈N} θ_u #_u(E′) p(u|ℰ) },   (7.7)

where p(u|ℰ) is the path posterior probability of n-gram u, which can be computed from the lattice. The important point is that the linear decomposition of the gain function replaces the sum over an exponentially large set of hypotheses E ∈ ℰ in the lattice with a sum over n-grams u ∈ N, which can be computed exactly even for large lattices. The n-gram path posterior probability is the sum of the posterior probabilities of all paths containing the n-gram:

p(u|ℰ) = Σ_{E∈ℰ_u} P(E|F),   (7.8)

where ℰ_u = {E ∈ ℰ : #_u(E) > 0} is the subset of lattice paths containing the n-gram u at least once. The next section describes how these path posterior probabilities can be computed efficiently using general-purpose WFST operations.

7.1.4 Decoding with Weighted Finite-State Acceptors

This section describes an implementation of lattice minimum Bayes-risk decoding based on weighted finite-state acceptors (Mohri, 1997) and the OpenFst toolkit (Allauzen et al., 2007). Each lattice ℰ is a weighted directed acyclic graph (DAG) (Cormen et al., 2001) encoding a large space of hypothesised translations output by the baseline system. Denote by ℰ_h the hypothesis space (e.g. the top 1000-best hypotheses in a k-best list generated from the lattice) and by ℰ_e the evidence space.

The lattice MBR decoder of Equation (7.7) is implemented by the algorithm shown in Figure 7.1. The input parameters are the posterior distribution smoothing factor α, evidence space ℰ_e, hypothesis space ℰ_h, and n-gram factors θ_n for n = 0, ..., 4. The return value is the translation hypothesis that maximises the conditional expected gain. The algorithm corresponds to the following sequence of operations:
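Equations (7.5)-(7.8) can be made concrete with a small sketch. Assume, purely for illustration, that the lattice has been flattened into an explicit list of (path, posterior) pairs; a real decoder keeps the lattice as a WFST and never enumerates paths:

```python
# Toy "lattice": explicit (phoneme path, posterior) pairs, assumed given.
lattice = [
    (("r", "EH", "s", "t"), 0.6),
    (("r", "EH", "s", "d"), 0.3),
    (("r", "AH", "s", "t"), 0.1),
]

def ngrams(seq, max_n):
    """Set of all n-grams of orders 1..max_n occurring in seq."""
    return {tuple(seq[i:i + n]) for n in range(1, max_n + 1)
            for i in range(len(seq) - n + 1)}

def path_posteriors(lattice, max_n):
    """Eq. (7.8): p(u|E) is the total posterior of paths containing u."""
    p = {}
    for path, post in lattice:
        for u in ngrams(path, max_n):
            p[u] = p.get(u, 0.0) + post
    return p

def mbr_score(hyp, p, theta):
    """Eq. (7.7) objective: theta[0]*|E'| + sum_u theta[n] * #_u(E') * p(u|E),
    where theta[n] is a shared factor for all n-grams of order n."""
    score = theta[0] * len(hyp)
    for n in range(1, len(theta)):
        for i in range(len(hyp) - n + 1):        # iterating positions counts #_u
            score += theta[n] * p.get(tuple(hyp[i:i + n]), 0.0)
    return score
```

Maximising `mbr_score` over the lattice's own paths yields the consensus hypothesis; the θ values would come from the n-gram precision factors θ_n.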
Slide annotations for Equation (7.7):
#_u(E′): number of times n-gram u appears in hypothesis E′
θ_u: N-gram weight factor
Ê: LMBR best hypothesis
θ-weighted copy of hypothesis E′
N: set of N-grams occurring in the lattice
ℰ_u: subset of hypotheses including the order-n N-gram u
LMBR decoding under the WFST framework: the “scary” algorithm
5.1 RNNLM N-best rescoring

Recurrent Neural Network Language Models have recently enjoyed a resurgence in popularity in the context of ASR applications (m10). In another recent publication we investigated the applicability of this approach to G2P conversion with joint sequence models by providing support for the rnnlm toolkit (m11). The training corpus for the G2P LM is a corpus of joint sequences, thus it can be used without modification to train a parallel RNNLM. N-best reranking is then accomplished with the proposed toolkit by causing the decoder to output the N-best joint G-P sequences, and employing rnnlm to rerank the N-best joint sequences,

H_Nbest = NShortestPaths(w ∘ M)
H_best = Project_o(Rescore_rnn(H_Nbest)).   (2)

In practice the rnnlm models require considerable tuning, and somewhat more time to train, but provide a consistent WA boost. For further details see (m10).
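The rescoring step in Eq. (2) can be sketched as follows. Here `rnnlm_logprob` is a stand-in callable for the rnnlm toolkit's scorer and the interpolation weight `lam` is a made-up knob; the toolkit's actual score combination may differ:

```python
def rescore_nbest(nbest, rnnlm_logprob, lam=0.5):
    """nbest: list of (joint_token_seq, g2p_logprob) from the decoder.
    Re-rank by an interpolated score, then project output labels by
    keeping the phoneme half of each joint G:P token (dropping epsilons)."""
    def combined(item):
        joint, g2p_lp = item
        return (1.0 - lam) * g2p_lp + lam * rnnlm_logprob(joint)
    best_joint, _ = max(nbest, key=combined)
    return [t.split(":")[1] for t in best_joint if t.split(":")[1] != "<eps>"]
```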
5.2 Lattice Minimum Bayes-Risk decoding for G2P

In (t07) the authors note that the aim of MBR decoding is to find the hypothesis that has the “least expected loss under the model”. MBR decoding was successfully applied to Statistical Machine Translation (SMT) lattices in (t07), and significantly improved in (b10). Noting the similarities between G2P conversion and SMT, we have begun work implementing an integrated LMBR decoder for the proposed toolkit.

Our approach closely follows that described in (b10), and the algorithm implementation is summarized in Algorithm 3. The inputs are the full phoneme lattice that results from composing the input word with the G2P model and projecting output labels, an exponential scale factor α, and N-gram precision factors θ_{0-N}. The θ_n are computed using average N-gram precision p and a match ratio r via the following equations: θ_0 = −1/T; θ_n = 1/(4Tpr^{n−1}). T is, in effect, an arbitrary constant which does not affect the MBR decision. Line 1 applies α to the raw lattice. In effect this controls how much
Algorithm 3: G2P Lattice MBR-Decode
Input: E ← Project_o(w ∘ M), α, θ_{0-N}
1  E ← ScaleLattice(α × E)
2  N_N ← ExtractN-grams(E)
3  for n ← 1 to N do
4      Φ_n ← MakeMapper(N_n)
5      Ψ_n^R ← MakePathCounter(N_n)
6      U_n ← Opt((E ∘ Φ_n) ∘ Ψ_n^R)
7      Ω_n ← Φ_n
8      for state q ∈ Q[Ω_n] do
9          for arc e ∈ E[q] do
10             w[e] ← θ_n × U(o[e])
11 P ← Project_input(E_{θ_0} ∘ Ω_1)
12 for n ← 2 to N do
13     P ← Project_input(P ∘ Ω_n)
14 H_best ← ShortestPath(P)
we trust the raw lattice weights. After applying α, E is normalized by pushing weights to the final state and removing any final weights. In line 2 all unique N-grams up to order N are extracted from the lattice. Lines 4-10 create, for each order, a context-dependency FST (Φ_n) and a special path-posterior counting WFST (Ψ_n^R), which are then used to compute N-gram posteriors (U_n), and finally to create a decoder WFST (Ω_n). The full MBR decoder is then computed by first making an unweighted copy of E, applying θ_0 uniformly to all arcs, and iteratively composing and input-projecting with each Ω_n. The MBR hypothesis is then the best path through the result P. See (t07; b10) for further details.
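The precision-factor formulas above are straightforward to compute directly. A minimal sketch, with illustrative (not tuned) values of p and r:

```python
def ngram_factors(p, r, N=4, T=1.0):
    """N-gram factors for LMBR: theta_0 = -1/T and
    theta_n = 1/(4*T*p*r**(n-1)) for n = 1..N.
    T is an arbitrary constant that does not affect the argmax."""
    thetas = [-1.0 / T]
    thetas += [1.0 / (4.0 * T * p * r ** (n - 1)) for n in range(1, N + 1)]
    return thetas

# Illustrative precision p and match ratio r (assumed values, not the
# paper's settings):
thetas = ngram_factors(p=0.85, r=0.72)
```

Since r < 1, the factors grow with the order n, so longer matching n-grams contribute more gain per occurrence.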
6 Experimental results
Experimental evaluations were conducted utilizing three standard G2P test sets. These included replications of the NetTalk, CMUdict, and OALD English language dictionary evaluations described in detail in (b08). Results comparing various configurations of the proposed toolkit to the joint sequence model Sequitur (b08) and an alternative discriminative training toolkit direcTL+ (j10) are described in Table 3. Here m2m-P indicates the proposed toolkit using the alignment algorithm from (j07), m2m-fst-P indicates the alternative
Slide annotations (Algorithm 3):
E: lattice of pronunciation hypotheses
Σ_{E∈ℰ} P(E|F) = 1
Ψ_n^R: path-posterior counting WFST, e.g. Ψ_2^R = Opt(ConnectFinal(Ψ_2 tree)), used to compute p(u|E)
• Scale arcs × 0.3 • Optimize • Cut final weights
Apply exponential scale factor α = 0.3* (*in the log semiring)
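The annotated step (scale, optimize, cut final weights) amounts to multiplying every negative-log weight by α and renormalizing so path posteriors sum to one. A toy path-level sketch in the log semiring; real code performs this on the WFST itself with weight pushing:

```python
import math

def scale_and_normalize(paths, alpha):
    """paths: list of (label, neglog_weight). Multiply -log weights by
    alpha (an exponential scale in probability space), then renormalize
    so that the posteriors exp(-w) sum to 1."""
    scaled = [(label, alpha * w) for label, w in paths]
    log_z = math.log(sum(math.exp(-w) for _, w in scaled))
    return [(label, w + log_z) for label, w in scaled]
```

Smaller α flattens the distribution (trusting the raw lattice weights less); α = 1 leaves the relative weights unchanged.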
Slide annotation (Algorithm 3): Σ_{E∈ℰ} P(E|F) = 1 (all hypotheses sum to 1)
Extracting N-grams
Objective: find all unique N-grams up to order N
Naïve solution: recursively search the lattice ✗ very slow for lattices with a large average branching factor
Better solution: build a counting FST, compose it with the lattice, and optimize the result ○ much faster, easier to search
Even better: modify the FST, restrict to the largest N-gram order, and recursively search the result ○○ even faster, even easier!
Extracting N-grams
Build a specialized counting FST for the largest order planned for the LMBR decoding process:
ε (epsilon): null symbol
σ (sigma): match any non-ε symbol
Compose, optimize and extract all unique N-grams:
N_N = Extract( Optimize( Project_I( E ∘ C_N ) ) )

Compose the unweighted copy of the lattice with the optimized 2-gram counter to obtain the optimized 2-gram lattice, then extract the unique N-grams:

1-grams | 2-grams
d       | d AH
r       | r EH
EH      | EH s
AH      | AH s
s       | s d, s t
t       | --
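The naïve recursive search the slide warns about looks roughly like this sketch. A toy dict-based DAG stands in for the WFST lattice; the counting-FST approach replaces this path-by-path traversal with a single composition plus optimization:

```python
# Toy DAG lattice: state -> list of (phoneme, next_state) arcs; state 4 final.
toy = {
    0: [("r", 1)],
    1: [("EH", 2), ("AH", 2)],
    2: [("s", 3)],
    3: [("t", 4), ("d", 4)],
    4: [],
}

def extract_ngrams(lattice, start, max_n):
    """Collect all unique n-grams up to order max_n by depth-first search,
    carrying a history bounded to the last max_n symbols. Worst case is
    exponential in the number of distinct histories, which is why this is
    slow for lattices with a large average branching factor."""
    found = set()
    stack = [(start, ())]
    while stack:
        state, hist = stack.pop()
        for sym, nxt in lattice[state]:
            new_hist = (hist + (sym,))[-max_n:]
            for n in range(1, len(new_hist) + 1):
                found.add(new_hist[-n:])          # every n-gram ending here
            stack.append((nxt, new_hist))
    return found
```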
LMBR decoding under the WFST framework Scary algorithm
5.1 RNNLM N-best rescoring
Recurrent Neural Network Language Modelshave recently enjoyed a resurgence in popular-ity in the context of ASR applications (m10).In another recent publication we investigatedthe applicability of this approach to G2P con-version with joint sequence models by provid-ing support for the rnnlm toolkit (m11). Thetraining corpus for the G2P LM is a corpusof joint sequences, thus it can be used withoutmodification to train a parallel RNNLM. N-bestreranking is then accomplished with the pro-posed toolkit by causing the decoder to outputthe N-best joint G-P sequences, and employingrnnlm to rerank the the N-best joint sequences,
HNbest =NShortestPaths(w !M)
Hbest =Projecto(Rescorernn(HNbest)).(2)
In practice the rnnlm models require consider-able tuning, and somewhat more time to train,but provide a consistent WA boost. For furtherdetails see (m10).
5.2 Lattice Minimum Bayes-Riskdecoding for G2P
In (t07) the authors note that the aim of MBRdecoding is to find the hypothesis that has the“least expected loss under the model”. MBRdecoding was successfully applied to Statisti-cal Machine Translation (SMT) lattices in (t07),and significantly improved in (b10). Noting thesimilarities between G2P conversion and SMT,we have begun work implementing an integratedLMBR decoder for the proposed toolkit.
Our approach closely follows that described in (b10), and the algorithm implementation is summarized in Algorithm 3. The inputs are the full phoneme lattice that results from composing the input word with the G2P model and projecting output labels, an exponential scale factor α, and N-gram precision factors θ_0…N. The θ_n are computed from the average N-gram precision p and a match ratio r using the equations θ_0 = −1/T; θ_n = 1/(NTp·r^(n−1)). T is in effect an arbitrary constant which does not affect the MBR decision. Line 1 applies α to the raw lattice. In effect this controls how much
Algorithm 3: G2P Lattice MBR-Decode
Input: E ← Project_o(w ∘ M), α, θ_0…N
1   E ← ScaleLattice(α × E)
2   N_N ← ExtractN-grams(E)
3   for n ← 1 to N do
4       Φ_n ← MakeMapper(N_n)
5       Ψ_n^R ← MakePathCounter(N_n)
6       U_n ← Opt((E ∘ Φ_n) ∘ Ψ_n^R)
7       Ω_n ← Φ_n
8       for state q ∈ Q[Ω_n] do
9           for arc e ∈ E[q] do
10              w[e] ← θ_n × U(o[e])
11  P ← Project_input(E_θ0 ∘ Ω_1)
12  for n ← 2 to N do
13      P ← Project_input(P ∘ Ω_n)
14  H_best ← ShortestPath(P)
we trust the raw lattice weights. After applying α, E is normalized by pushing weights to the final state and removing any final weights. In line 2 all unique N-grams up to order N are extracted from the lattice. Lines 4-10 create, for each order, a context-dependency FST (Φ_n) and a special path-posterior counting WFST (Ψ_n^R), which are then used to compute N-gram posteriors (U_n), and finally to create a decoder WFST (Ω_n). The full MBR decoder is then computed by first making an unweighted copy of E, applying θ_0 uniformly to all arcs, and iteratively composing and input-projecting with each Ω_n. The MBR hypothesis is then the best path through the result P. See (t07; b10) for further details.
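The effect of the composition cascade can be imitated over an explicit hypothesis list. The sketch below evaluates the same linearised BLEU gain, G(h) = θ_0·|h| + Σ_u θ_|u|·#_u(h)·p(u|E), but by direct enumeration rather than by WFST composition; all posteriors and θ values are toy numbers, not outputs of the real toolkit.

```python
from collections import Counter

def ngram_counts(tokens, max_order):
    """Count every n-gram of order 1..max_order in a token sequence."""
    counts = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def mbr_gain(tokens, posteriors, thetas):
    """Linearised gain: theta_0*|h| + sum_u theta_|u| * count_u(h) * p(u|E)."""
    gain = thetas[0] * len(tokens)
    for u, cnt in ngram_counts(tokens, len(thetas) - 1).items():
        gain += thetas[len(u)] * cnt * posteriors.get(u, 0.0)
    return gain

def mbr_decode(hypotheses, posteriors, thetas):
    """Pick the hypothesis with the highest expected gain."""
    return max(hypotheses, key=lambda h: mbr_gain(h.split(), posteriors, thetas))

# Toy unigram posteriors from an imagined two-path lattice (invented numbers).
posteriors = {("r",): 1.0, ("eh",): 0.6, ("ih",): 0.4, ("s",): 1.0, ("t",): 1.0}
thetas = [-0.1, 0.5]  # theta_0, theta_1: illustrative values only
```

Hypotheses whose n-grams carry more posterior mass accumulate a larger gain, which is exactly what the iterated Ω_n compositions achieve on the lattice.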
6 Experimental results
Experimental evaluations were conducted utilizing three standard G2P test sets. These included replications of the NetTalk, CMUdict, and OALD English language dictionary evaluations described in detail in (b08). Results comparing various configurations of the proposed toolkit to the joint sequence model Sequitur (b08) and an alternative discriminative training toolkit direcTL+ (j10) are described in Table 3. Here m2m-P indicates the proposed toolkit using the alignment algorithm from (j07), m2m-fst-P indicates the alternative
Objective: Transform the raw lattice into an explicit N-gram lattice for order n.
Solution: Build an N-gram tree from N_N and connect final states as dictated by the raw lattice.
Building N-gram mapping FSTs (1)
Use N_N to generate an N-gram tree FST of order ≤ n:
Input symbols: unigrams
Output symbols: order-n sequences are N-gram labels; all other output labels are ε
Connect final states as dictated by the N-gram table and optimize the result:
Φ_2 = Opt(ConnectFinal(Φ_2^tree))
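Functionally, composing the unweighted lattice with the order-n mapper relabels each path as its order-n n-gram sequence. A sketch of that relabeling in plain Python (the joined-string labels are for display only; in the FST the mapper emits ε until n input symbols of context are available):

```python
def map_to_ngram_labels(tokens, n):
    """Relabel a phoneme sequence as its order-n n-gram label sequence,
    one label per position once n symbols of context have been read."""
    return ["".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

labels = map_to_ngram_labels(["d", "AH", "s", "t"], 2)
```

This reproduces the 2-gram labels shown on the slides, e.g. "dAH" and "st".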
Building N-gram mapping FSTs (2)
Compose with E, project output labels, and optimize to obtain the order-n version of the raw unweighted lattice.
Objective: Efficiently compute the lattice posterior probability p(u|E) for each N-gram u.
Solution: Build a right path-counter Ψ_n^R that extracts the final occurrence of each N-gram from each order-n lattice copy.
Building lattice posterior N-gram path counting WFSTs
Building lattice posterior N-gram path counting WFSTs
Construct Ψ_n^R from N_n for each order n:
ε (epsilon): null symbol
σ (sigma): match any non-ε symbol
ρ (rho): match all other symbols
The ρ arcs map all other symbols to ε. At state 1, the arc ρ:ε maps everything but the 2-gram 'dAH' to ε.
The arc 1→2 ensures that all but the final occurrence of 'rEH' on a path is deleted.
Composition with the order-n lattice extracts just the last occurrence of each n-gram on a path.
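The last-occurrence semantics can be sketched with a one-line dictionary comprehension; later duplicates overwrite earlier ones, mirroring the ρ-arc deletions, so each n-gram survives at most once per path. The path and labels below are artificial.

```python
def last_occurrences(labels):
    """Map each distinct label to the index of its final occurrence on a path.
    Iterating left to right, a repeated label overwrites its earlier index."""
    return {lab: i for i, lab in enumerate(labels)}

# 'rEH' occurs twice on this artificial path; only its last index survives.
kept = last_occurrences(["rEH", "EHs", "sr", "rEH"])
```

Counting each n-gram once per path is what makes the subsequent accumulation a true path posterior rather than an expected occurrence count.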
2-grams: dAH, rEH, EHs, AHs, sd, st
Objective: Use E, Φ_n, and Ψ_n^R to compute all posterior N-gram probabilities p(u|E).
Solution: Compose the component WFSTs together and optimize.
Computing lattice posterior N-gram probabilities
Computing lattice posterior N-gram probabilities
Iteratively compose together the components E, Φ_n, and Ψ_n^R, and optimize to obtain U_n, encoding all p(u|E) of order n.
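A plain-Python sketch of this posterior computation: p(u|E) is the total probability of the paths on which u occurs, with u counted at most once per path (the path-counter semantics). The paths and probabilities are invented; the real computation happens through the Opt((E ∘ Φ_n) ∘ Ψ_n^R) composition in the log semiring.

```python
def ngram_posteriors(paths, max_order):
    """paths: list of (token_list, probability) pairs; probabilities sum to 1.
    Returns p(u|E) for every n-gram u up to max_order."""
    post = {}
    for tokens, prob in paths:
        seen = set()
        for n in range(1, max_order + 1):
            for i in range(len(tokens) - n + 1):
                seen.add(tuple(tokens[i:i + n]))
        for u in seen:  # once per path, however often u occurs on it
            post[u] = post.get(u, 0.0) + prob
    return post

# Two toy paths with invented posteriors summing to 1.
paths = [(["r", "eh", "s", "t"], 0.7), (["r", "ih", "s", "t"], 0.3)]
post = ngram_posteriors(paths, 2)
```

N-grams shared by both paths, like ("r",) and ("s", "t"), receive the full posterior mass, which is the effect LMBR later exploits.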
U_2 = Opt((E ∘ Φ_2) ∘ Ψ_2^R)
Extracted 2-gram posterior weights:
2-gram  p(u|E)
dAH     1.0689
rEH     0.42065
EHs     0.42065
AHs     1.0689
sd      1.1289
st      0.39065
Objective: Obtain the Minimum Bayes-Risk best hypothesis given the posterior lattice N-gram probabilities.
Solution: Make a copy of each mapper, and apply the posterior N-gram probabilities, scaled by the N-gram factors θ_n, to all arcs.
Building LMBR decoders
Building LMBR decoders
Build a decoder WFST Ω_n for each order n: make a copy of the order-n mapper Φ_n, then apply the θ_n-scaled posterior N-gram probabilities to all arcs.
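The arc-weighting step reduces to one multiplication per n-gram label. The sketch below represents the decoder's arc weights as a plain label-to-weight table; the posterior values are taken from the slides, while the θ_2 value of 0.0816 is an assumption implied by the slide's weight table rather than stated in the deck.

```python
def decoder_weights(posteriors, theta_n):
    """Weight each mapper arc that outputs n-gram u with theta_n * p(u|E)."""
    return {u: theta_n * p for u, p in posteriors.items()}

# Posterior weights from the slides; 0.0816 is an assumed theta_2 (see lead-in).
w = decoder_weights({"dAH": 1.0689, "st": 0.39065}, 0.0816)
```

These products match the "θ_2 × u" column on the slide (e.g. dAH ≈ 0.08722, st ≈ 0.03187).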
Computing BLEU N-gram factors
• T: scale factor
• p: avg. n-gram precision
• r: n-gram match ratio
• n: n-gram order
θ_0 = −1/T
θ_n = 1/(4Tp·r^(n−1))
Example: T = 10, p = 0.85, r = 0.72 ⇒ θ_2 = 0.04
2-gram  p(u|E)   θ_2 × u
dAH     1.0689   0.08722
rEH     0.42065  0.03423
EHs     0.42065  0.03423
AHs     1.0689   0.08722
sd      1.1289   0.09211
st      0.39065  0.03187
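The factors can be computed directly. Note that the deck shows both a 1/(4Tp·r^(n−1)) and a 1/(NTp·r^(n−1)) variant of θ_n; the sketch below uses the N form, which reproduces the θ values quoted on the final decoding slide (θ_0 = −0.1, θ_1 = 0.058, θ_2 = 0.081).

```python
def ngram_factors(N, T, p, r):
    """theta_0 = -1/T; theta_n = 1/(N*T*p*r**(n-1)) for n = 1..N."""
    return [-1.0 / T] + [1.0 / (N * T * p * r ** (n - 1))
                         for n in range(1, N + 1)]

# Parameter values from the slides: T=10, p=0.85, r=0.72, with N=2.
thetas = ngram_factors(2, 10, 0.85, 0.72)
```

Since T cancels out of the MBR decision, its exact value only rescales all factors uniformly.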
Objective: Obtain the Minimum Bayes-Risk best hypothesis using the decoders Ω_n.
Solution: Iteratively compose and input-project the scaled lattice, then extract the shortest path through the final result.
Computing the LMBR best hypothesis
Computing the LMBR best hypothesis
Construct E_θ0 by applying θ_0 to an unweighted copy of E.
Iteratively compose with each Ω_n, projecting input labels after each composition; this applies the n-gram posteriors and accumulates multiple counts.
Extract the shortest path through the final result*
N-gram factors (example): T = 10, p = 0.85, r = 0.72 ⇒ θ_0 = −0.1, θ_1 = 0.058, θ_2 = 0.081
H_best = ShortestPath(Project_i(Project_i(E_θ0 ∘ Ω_1) ∘ Ω_2))  (shown for N = 2; one further composition and projection with Ω_3 for N = 3, and so on)
LMBR Gain: What happened?
The small alpha parameter smooths the raw posterior lattice distribution.
The LMBR process emphasizes N-gram occurrences across paths in the lattice
Compare: the 'rEH' and 'EHs' 2-grams appear in 2 hypotheses. This boosts the 2nd-rank hypothesis into 1st place.
LMBR decoding: Issues
Sensitive to parameter values: N-gram order, alpha, thetas.
Slower than MAP decoding due to the large number of posterior computations. It is sometimes preferable to use a k-best subset of the lattice if speed is a concern.
Recent G2P Experiments
N-best rescoring with a Recurrent Neural Network Language Model (RNNLM): cross comparison, state-of-the-art improvements
LMBR decoding: small improvements versus MAP phoneme accuracy
Experiments: RNNLM N-best rescoring
Replication of standard English G2P test sets from Bisani 2008.
Small but consistent improvements from the proposed m2m-fst-P alignment algorithm.
Consistent improvements to state-of-the-art with RNNLM-based N-best rescoring.
Experiments: training & decoding speed
Comparison of relative training and decoding times for the proposed system versus a variety of alternatives.
Experiments: LMBR decoding for G2P
NETtalk-15k
Decoder     WA (%)  PA (%)
MAP         66.5    91.80
LMBR (n=6)  66.5    91.82
• LMBR is impacted by the n-gram order: improvement through n=6 (SMT typically sets n=4)
• Very small improvement to phoneme accuracy vs. MAP
• Still worse than RNNLM rescoring; better thetas and alpha may help
Experiments: LMBR real-world re-ranking example ./phonetisaurus-g2p -m model.fst -n 5 -w abbreviate -a .3 -d 1 -r
N=1: -0.21198 e b r i v i e t | -0.20022 @ b r i v i e t | -0.17939 @ b r e v i e t | -0.17864 e b b r i v i e t | -0.16689 @ b b r i v i e t
N=2: -0.235998 @ b r i v i e t | -0.120462 @ b b r i v i e t | 0.0844044 @ b i v i e t | 0.140279 @ b r i v i x t | 0.140505 x b r i v i e t
N=3: -0.126297 @ b r i v i e t | 0.210171 @ b b r i v i e t | 0.432433 @ b i v i e t | 0.4813 x b r i v i e t | 0.557048 @ b r x v i e t
N=4: 0.0823436 @ b r i v i e t | 0.723454 @ b b r i v i e t | 0.895162 x b r i v i e t | 0.933964 @ b i v i e t | 1.01223 x b b r i v i e t
N=5: 0.419434 @ b r i v i e t | 1.32085 @ b b r i v i e t | 1.38801 x b r i v i e t | 1.54077 x b b r i v i e t | 1.56691 @ b i v i e t
N=6: 0.821685 @ b r i v i e t | 1.99726 x b r i v i e t | 2.06059 @ b b r i v i e t | 2.2304 x b b r i v i e t | 2.23651 a b r i v i e t
Conclusions and Future work (1)
New state-of-the-art results for G2P conversion on several test sets:
Small improvements to the EM-driven multiple sequence alignment algorithm
N-best rescoring with a Recurrent Neural Network Language Model
Lattice Minimum Bayes-Risk decoding applied to G2P conversion: small improvement to phoneme accuracy compared to the MAP-based approach
Need to determine a better set of BLEU n-gram factors for the G2P problem
The LMBR decoder is applicable to alignment as well
Conclusions and Future Work (2)
Apply the LMBR decoder to incremental spoken dialog system processing:
The Ψ WFST can extract either the first or last occurrence of an N-gram from a lattice path
Most action hypotheses in the dialog system should be roughly aligned in time
It should be possible to incrementally aggregate acts that are in agreement across a lattice
Use this as a pruning technique during dialog management
Use this as a method to 'jump', 'barge-in', or provide a back-channeling response to a user who hasn't finished talking