
Tree parsing for tree-adjoining machine translation

MATTHIAS BÜCHSE and HEIKO VOGLER, Department of Computer Science, Technische Universität Dresden, D-01062 Dresden, Germany. E-mail: [email protected]; [email protected]

MARK-JAN NEDERHOF, School of Computer Science, University of St Andrews, North Haugh, St Andrews, KY16 9SX, UK. E-mail: [email protected]

Abstract
Tree parsing is an important problem in statistical machine translation. In this context, one is given (a) a synchronous grammar that describes the translation from one language into another and (b) a recognizable set of trees; the aim is to construct a finite representation of the set of those derivations that derive elements from the given set, either on the source side (input restriction) or on the target side (output restriction). In tree-adjoining machine translation the grammar is a kind of synchronous tree-adjoining grammar. For this case, only partial solutions to the tree parsing problem have been described, some being restricted to the unweighted case, some to the monolingual case. We introduce a class of synchronous tree-adjoining grammars which is effectively closed under input and output restrictions to weighted regular tree languages, i.e. the restricted translations can again be represented by grammars in the same class; this enables, e.g. cascading restrictions. Moreover, we present an algorithm that constructs these grammars for input and output restriction.

Keywords: Tree parsing, synchronous tree-adjoining grammars, statistical machine translation

1 Introduction

Many recent systems for statistical machine translation (SMT) [32] use some grammar at their core, for instance: (a) synchronous context-free grammars (SCFG) [8], which derive pairs of translationally equivalent sentences; (b) tree-to-string transducers (called xRLNS) [26], which describe pairs of the form (phrase-structure tree, string); and (c) synchronous tree-adjoining grammars (STAGs) [1, 13, 28, 42, 50], which derive pairs of phrase-structure trees (or dependency trees [17]). The technology in the last instance is called tree-adjoining machine translation [12]. Common variants of STAGs are synchronous tree-substitution grammars (STSGs) and synchronous tree-insertion grammars (STIGs).

For grammar-based systems, a variety of tasks can be described using the general concepts of input product and output product [33]. Roughly speaking, these products restrict the translation described by the grammar to a given tree or string language on the input or output side. For practical purposes, the derivations of the restricted translation are represented in a compact way, e.g. using a weighted regular tree grammar (WRTG) [2]. The process of obtaining this representation is called tree parsing or string parsing, depending on the type of restriction. We illustrate the importance of input and output product by considering their role in three essential tasks of SMT.

Grammar Estimation: After the rules of the grammar have been obtained from a sample of translation pairs (rule extraction), the probabilities of the rules need to be estimated. To this end, two approaches have been employed.


Table 1. Tree-parsing algorithms published so far in comparison with this paper.

  paper          AL    grammar   restriction type        result
  [26]           3–4   xRLNS     tree                    best derivation
  [23]           3–4   xRLN      (tree, tree)            derivation WRTG
  [17]           3–4   STSG      (tree, tree)            derivations
  [40]           2     WLIG      regular tree language   WLIG
  [34]           2     XTT       regular tree language   XTT
  (this paper)   1–3   WSTAG     regular tree language   WSTAG

Some systems [8, 13] hypothesize a canonical derivation for each translation pair, and apply relative-frequency estimation to the resulting derivations to obtain rule probabilities. While this procedure is computationally inexpensive, it only maximizes the likelihood of the training data under the assumption that the canonical derivations are the true ones.

Other systems [17, 23, 42] use a variant of the EM algorithm [11] called Inside-Outside. This algorithm requires that the set of derivations for a given translation pair be representable by a WRTG, which in most cases can be computed by restricting the translation grammar to the given translation pair, that is, by applying input and output product. Note that the translation pair can be a pair of strings, a pair of trees, or even some other combination.

Feature Weight Estimation: In the systems mentioned at the beginning of this section, a probability distribution of the form p(e,d | f) is estimated within a log-linear model [4, 44], where e and f are sentences from the target language and the source language, respectively, and d is a derivation. Such a distribution combines information from different sources, such as the grammar or a probability distribution over the target language sentences. These pieces of information are called features. The features are represented by real-valued functions h_i(e,d,f). For said combination, each feature gets a weight λ_i.
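In the usual log-linear formulation, these ingredients are combined as

p(e,d | f) = (1/Z(f)) · ∏_i h_i(e,d,f)^{λ_i} ,

where Z(f) is a normalizing constant; the decoding rule given below drops Z(f), which does not affect the argmax.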

The feature weights are usually estimated using minimum-error-rate training [43]. For this it is necessary to compute, for a given f, the set D_f of n highest ranking derivations generating f on the source language side. Roughly speaking, this set can be computed by applying the input product with f, and then applying the n-best algorithm [5, 25]. We note that, while f is usually a string, it can in some circumstances be a phrase-structure tree, as in [26].

Decoding: The actual translation, or decoding, problem amounts to finding, for a given f, the target language sentence e such that

e = argmax_e ∑_d ∏_i h_i(e,d,f)^{λ_i} .

Even for the simplest grammars, this problem is NP-hard [7]. As a result, SMT systems use approximations such as crunching or variational decoding [31]. Here we focus on the former, which amounts to restricting the sum in the equation to the set D_f. Since this set is finite, the sum is then zero for almost all e, which makes the computation of e feasible. As mentioned before, the input product can be used to compute D_f.
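As a concrete illustration of crunching, the following sketch (all names are illustrative; it assumes D_f has already been obtained via the input product and n-best extraction, given as pairs of a target sentence e and the feature values of one derivation d) evaluates the restricted sum and returns the best e:

from collections import defaultdict
from math import prod

def crunch_decode(d_f, weights):
    # d_f: finite list of (e, features) pairs, one entry per derivation d in D_f,
    #      where features[i] = h_i(e, d, f); weights[i] = lambda_i.
    # Returns the e maximizing sum_d prod_i h_i(e, d, f) ** lambda_i over D_f.
    score = defaultdict(float)
    for e, features in d_f:
        score[e] += prod(h ** lam for h, lam in zip(features, weights))
    return max(score, key=score.get)

The restriction to D_f is what makes the outer maximization over e tractable: only finitely many e receive a non-zero score.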

As these tasks illustrate, tree parsing is employed in various components of recent SMT systems. Table 1 lists five relevant contributions to the problem of (bilingual or monolingual) tree parsing. These contributions can be classified according to a number of characteristics indicated by the column headings.


One of these characteristics is the abstraction level (AL), which we categorize as follows: (1) language-theoretic result, (2) construction, (3) algorithm and (4) implementation.

The first three entries of Table 1 deal with contributions that are restricted to tree substitution. In [26] the authors show an algorithm for computing the best derivation of the input product of an xRLNS with a single tree. In [23] an algorithm is presented that computes the derivation WRTG for the input and output product of a tree-to-tree transducer (called xRLN) with a single pair of trees. In [17] an algorithm is described that computes the set of derivations for the input and output product of an STSG with a single pair of trees.

We note that the grammar classes covered so far are strictly less powerful than STAGs. This is due to the fact that STAGs additionally permit an operation called adjoining. As is pointed out in [13, 42], the adjoining operation has a well-founded linguistic motivation, and permitting it improves translation quality.

There are two papers approaching the problem of tree parsing for STAGs, given in the fourth and fifth entries of the table. These papers establish closure properties, that is, their constructions yield a grammar of the same type as the original grammar. Since the resulting grammars are compact representations of the derivations of the input product or output product, respectively, these constructions constitute tree parsing.

In [40] it is shown that weighted linear indexed grammars (WLIGs) are closed under weighted intersection with tree languages generated by WRTGs. WLIGs derive phrase-structure trees, and they are weakly equivalent to tree-adjoining grammars (TAGs). It is not clear how this result can be transferred to a synchronized setting.

In [34] STAGs are represented in an alternative way, namely as extended tree transducers (XTT) with explicit substitution. In this framework, adjoining is encoded into the phrase-structure trees by introducing special symbols, to be evaluated in a separate step. The author indicates that his representation of STAG is closed under input and output product with regular tree languages by providing a corresponding construction. However, in his setting, both the translations and the languages are unweighted.

The advantage of closure properties of the above kind is that they allow cascades of input and output products to be constructed in a uniform way. They also allow further operations on grammars, such as projection. Ultimately, SMT tasks may be described in this framework, as illustrated by typical applications of toolboxes for WFSTs [38] and XTTs [36].

In this article, we propose a weighted formulation of STAGs which is closed under input and output product with WRTGs, and we present a corresponding tree-parsing algorithm. This article is organized as follows.

In Section 3, we introduce our formulation of STAGs, which is called weighted synchronous tree-adjoining grammar (WSTAG). The major difference with respect to the classical STAGs is two-fold: (i) we use states and (ii) we encode substitution and adjoining sites as variables in the trees. The states make intersection with regular properties possible (without the need for relabelling as in [47] and [34]). In addition, they permit implementing all features of conventional STAG/STIG, such as potential adjoining and left/right adjoining. The variables are used for synchronization of the input and output sides.

In Section 4, we state that WSTAGs are closed under input and output product with tree languages generated by WRTGs (cf. Theorem 3). We provide a direct construction for the input and output product (Section 5), which is based on the standard technique for composing two top-down tree transducers (cf. page 195 of [3]). This technique has been extended in Theorem 4.12 of [20] to the composition of a macro tree transducer and a top-down tree transducer (also cf. [45]); in fact, our direct construction is very similar to the latter one. In Section 6 we prove Theorem 3.


Figure 1. (a) Example of a WSTAG (following [28]) where q1 is the initial state, (b) derivation tree WRTG D_G, (c) derivation tree d_ex, (d) input tree s = h1(d_ex), (e) output tree t = h2(d_ex).

Section 7 contains Algorithm 1, which computes our construction. It is similar to Earley's algorithm [16, 24] in its strategy to avoid a certain portion of useless rules. Both time and space complexity are linear in the size of the input WSTAG. The algorithm is presented in the framework of deductive parsing [22, 39].

In Sections 8, 9, and 10, we discuss the correctness of our algorithm, present an extended example of the generation of items and derive the complexity of the algorithm, respectively.

In Section 11 we conclude with a discussion and further research topics. The present study is an extended version of [6].

2 Trees and weighted regular tree grammars

We denote the set of all unranked, ordered, labelled trees over some alphabet Σ by U_Σ. We represent trees as well-formed expressions, e.g. S(Adv(yesterday),∗); a graphical representation of this tree occurs at the very bottom of Figure 1a. Sometimes we assign a rank (or: arity) k ∈ N to a symbol σ ∈ Σ and then require that every σ-labelled position of a tree has exactly k successors. We denote the set of all positions of a tree t ∈ U_Σ by pos(t).


A position is represented as a finite sequence of natural numbers (Gorn notation). If w ∈ pos(t), then t(w) denotes the label of t at w, and rk_t(w) denotes the number of successors of w.

A weighted regular tree grammar (for short: WRTG) is a tuple H = (P, Σ, p0, R, wt) where P is a finite set of states,¹ Σ is an alphabet, p0 ∈ P is the initial state, and R is a finite set of rules; every rule ρ has the form p → σ(p1,...,pk) where k ∈ N, p, p1,...,pk ∈ P, and σ ∈ Σ (note that σ(p1,...,pk) is a tree over Σ ∪ P); finally, wt : R → R≥0 is the weight assignment, where R≥0 is the set of all non-negative real numbers.

We note that in the literature WRTGs occur in a more general form in which the right-hand side of a rule may contain an arbitrary number (including zero) of symbols of the alphabet. We assume that ∞ is an element of R≥0, which implies that each WRTG of the general form can be normalized into a WRTG of our restricted form.

A run (of H) is a tree κ ∈ U_P. Let t ∈ U_Σ and let κ be a run. We say that κ is a run on t if pos(κ) = pos(t) and for every position w ∈ pos(t) the rule ρ(κ,t,w) is in R, where ρ(κ,t,w) is defined as

κ(w) → t(w)(κ(w1), ..., κ(w rk_t(w))) .

The weight of a run κ on t is the value wt(κ,t) ∈ R≥0 defined by

wt(κ,t) = ∏_{w ∈ pos(t)} wt(ρ(κ,t,w)) .

The weighted tree language generated by H is the weighted tree language L(H) : U_Σ → R≥0 defined by

L(H)(t) = ∑_{κ run on t, κ(ε) = p0} wt(κ,t) .

The support supp(L(H)) of L(H) is the set of all t ∈ U_Σ such that L(H)(t) ≠ 0. We say that H is unambiguous if for every tree t ∈ U_Σ there is at most one run κ of H on t with wt(κ,t) ≠ 0.
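To make these definitions concrete, the following sketch (illustrative only; trees are written as nested tuples (symbol, child, ..., child) and rules as plain tuples) computes L(H)(t) by an inside recursion, which sums the weights wt(κ,t) over all runs κ on t:

from collections import defaultdict

def inside(tree, rules):
    # Maps each state p to the total weight of all runs on `tree` whose root is
    # labelled p.  `rules` is a list of (p, sigma, (p_1, ..., p_k), weight).
    label, children = tree[0], tree[1:]
    child_weights = [inside(child, rules) for child in children]
    total = defaultdict(float)
    for p, sigma, body, weight in rules:
        if sigma != label or len(body) != len(children):
            continue
        w = weight
        for p_i, cw in zip(body, child_weights):
            w *= cw[p_i]          # contribution of the run on the i-th subtree
        total[p] += w
    return total

def language_weight(tree, rules, p0):
    # L(H)(t): sum of wt(kappa, t) over all runs kappa on t with kappa(eps) = p0
    return inside(tree, rules)[p0]

For instance, the tree S(Adv(yesterday), ∗) from the beginning of this section would be written as ('S', ('Adv', ('yesterday',)), ('*',)).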

3 Weighted synchronous tree-adjoining grammars

We formulate a STAG syntax based on states and variables. The usage of states is in the spirit of [21]. We formalize substitution sites and adjoining sites as explicit positions labelled by variables, and each variable carries a state. We thus achieve four advantages: (i) the states increase expressive power just enough to enable intersection with finite-state devices via a product construction as known from automata theory. (ii) They also provide a simple, uniform representation for well-established phenomena such as obligatory adjoining, potential adjoining and adjoining order. (iii) Due to the variables, synchronization is effortless (even for an arbitrary number of grammars). (iv) Our formulation is concise and has rigorously defined semantics.

In the following, we will use variables x1, x2, ... of rank 0 to denote substitution sites and variables z1, z2, ... of rank 1 to denote adjoining sites; the foot node will be labelled by the nullary symbol ∗. A weighted synchronous tree-adjoining grammar with states (for short: WSTAG) is a tuple G = (Q, F, Σ, q0, R, wt) where

¹We choose the name 'state' here rather than 'nonterminal' in order to avoid confusion with the labels of internal nodes in parse trees generated by context-free grammars.


• Q and F are disjoint finite sets (of nullary and unary states, respectively, each denoted by variants of q and f, respectively),
• Σ is an alphabet (terminal alphabet),
• q0 is a nullary state (initial state),
• R is a finite set of rules of either of the following forms (α) or (β):

q → ⟨ζ, ζ′, q1 ··· qm, f1 ··· fl⟩   (α)
f(∗) → ⟨ζ, ζ′, q1 ··· qm, f1 ··· fl⟩   (β)

where ζ and ζ′ are trees over Σ ∪ V and

– if the rule has the form (α), then V = {x1,...,xm, z1,...,zl} and
– if the rule has the form (β), then V = {x1,...,xm, z1,...,zl, ∗},

and every element of V occurs exactly once in each of ζ and ζ′, and
• wt : R → R≥0 is the weight assignment.

Rules of the forms (α) and (β) are called (m,l)-rules; ζ and ζ′ are called the input tree and the output tree of the rule, respectively. For fixed q and f, the sets of all rules of the forms (α) and (β) are denoted by R_q and R_f, respectively. Figure 1a shows an example of a WSTAG.

In the following, let G = (Q, F, Σ, q0, R, wt) be a WSTAG. We define the semantics in terms of bimorphisms [48]: first, we define the derivation tree WRTG D_G, which generates the weighted tree language of derivation trees of G. Second, we define two embedded tree homomorphisms h1 and h2, which retrieve from a derivation tree the derived input tree and output tree, respectively.

Let y ∈ Q ∪ F. The y-derivation tree WRTG of G is the WRTG D_G^y = (Q ∪ F, R, y, R′, wt′) where

• we assign the rank m + l to every (m,l)-rule,
• R′ is the set of all rules Dρ with ρ ∈ R and
– if ρ has the form (α), then Dρ = q → ρ(q1,...,qm, f1,...,fl),
– if ρ has the form (β), then Dρ = f → ρ(q1,...,qm, f1,...,fl),
• and wt′(Dρ) = wt(ρ).

We let D_G be D_G^{q0}. Recall that L(D_G) is the weighted tree language generated by D_G. We call the trees in supp(L(D_G)) derivation trees. Figures 1b and c show the derivation tree WRTG of the WSTAG of Figure 1a and a derivation tree, respectively.

Next we define the embedded tree homomorphism h1. Since we aim at an inductive definition, we have to distinguish between the Q-embedded tree homomorphism h1^Q and the F-embedded tree homomorphism h1^F with

h1^Q : ⋃_{q∈Q} supp(L(D_G^q)) → U_Σ  and  h1^F : ⋃_{f∈F} supp(L(D_G^f)) → U_{Σ∪{∗}} .

We define h1^Q and h1^F simultaneously as follows. Let Y ∈ {Q,F} and ρ be an (m,l)-rule with input tree ζ; ρ has type (α) if Y = Q, and it has type (β) if Y = F. Then

h1^Y(ρ(d1,...,dm, d′1,...,d′l)) = ζ [x1/h1^Q(d1)] ... [xm/h1^Q(dm)] ⟦z1/h1^F(d′1)⟧ ... ⟦zl/h1^F(d′l)⟧   (1)


where the first-order substitution [x/s] replaces the single occurrence of x in ζ by s, and the second-order substitution ⟦z/s⟧ is defined inductively:

• σ(ζ1,...,ζk)⟦z/s⟧ = σ(ζ1⟦z/s⟧, ..., ζk⟦z/s⟧),
• xj⟦z/s⟧ = xj,
• ∗⟦z/s⟧ = ∗, and
• zj(ζ)⟦z/s⟧ = s[∗/ζ] if z = zj, and zj(ζ⟦z/s⟧) otherwise,

where the tree s[∗/ζ] is obtained from s by replacing ∗ by ζ.

We note that every tree in the image of h1^F contains ∗ exactly once.

Then we let h1 = h1^Q restricted to supp(L(D_G)). The embedded tree homomorphism h2 is defined in the same way, but using the output tree ζ′ of ρ in (1). We call d a derivation tree of the pair (h1(d), h2(d)). The tree in Figure 1c is a derivation tree of the pair given by Figure 1d and e.
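Both substitution operations can be phrased directly over trees written as nested tuples; the following sketch (illustrative names; '*' stands for the foot node) mirrors the inductive definition above:

def first_order(tree, x, s):
    # [x/s]: replace the occurrence of the nullary variable x (there is exactly
    # one, by the rule constraints) by the tree s.
    label, children = tree[0], tree[1:]
    if label == x and not children:
        return s
    return (label,) + tuple(first_order(child, x, s) for child in children)

def second_order(tree, z, s):
    # [[z/s]]: replace the occurrence z(zeta) by s with its foot node * replaced
    # by zeta (adjoining); all other positions are copied unchanged.
    label, children = tree[0], tree[1:]
    if label == z and len(children) == 1:
        return first_order(s, '*', children[0])    # s[*/zeta]
    return (label,) + tuple(second_order(child, z, s) for child in children)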

Example 1
To illustrate the embedded tree homomorphism, we show that h1(d_ex) = s, where s and d_ex are the trees from Figure 1. Using the abbreviations [1] = [x1/h1(ρ2)], [2] = [x2/h1(ρ3(ρ4))], and ⟦3⟧ = ⟦z1/h1(ρ5)⟧, we derive

h1(d_ex) = z1(S(x1, VP(V(saw), x2))) [1] [2] ⟦3⟧
         = z1(S(h1(ρ2), VP(V(saw), x2))) [2] ⟦3⟧
         = z1(ξ) ⟦3⟧,   where ξ = S(h1(ρ2), VP(V(saw), h1(ρ3(ρ4))))
         = S(Adv(yesterday), ξ)
         = ... = s .

□

The WSTAG G induces the weighted tree transformation T(G) : U_Σ × U_Σ → R≥0 defined by

T(G)(s,t) = ∑_{d derivation tree of (s,t)} L(D_G)(d) .

We recall that ∞ is an element of R≥0 to make the equation well defined also for infinite sums. Alternatively, we can require G to be productive, a syntactic property that was described for WSTSGs in [21]; then the sum is guaranteed to be finite.

Due to the construction of D_G, for every derivation tree d there is exactly one run κ_d of D_G on d: for every w ∈ pos(d), we let κ_d(w) be the state which occurs in the left-hand side of the rule d(w). Hence, D_G is unambiguous and L(D_G)(d) = wt(κ_d, d). Thus, T(G)(s,t) = ∑_d wt(κ_d, d). As an example, we reconsider the derivation tree d_ex of the pair (s,t) given by Figure 1. Then we have

L(D_G)(d_ex) = wt(κ_{d_ex}, d_ex) = ∏_{i=1}^{5} wt(ρi) = 0.24 .

Since d_ex is the only derivation tree of (s,t), we have that T(G)(s,t) = 0.24.


Relationship to existing STAG formalisms. Previous definitions of TAG-related formalisms such as those considered in [1, 28, 49] usually contain features not present in the syntax of WSTAGs:

strictness: a substitution site or adjunction site can only be filled with a tree whose root node is labelled with the same symbol as the site where it is substituted/adjoined. Deviating from this, [29] define non-strict TAGs.

adjoining constraints: in many definitions the user can specify a specific constraint for each site, such as null adjunction, selective adjunction or obligatory adjunction.

left/right adjoining: in tree-insertion grammars [13, 14, 41, 42, 46] adjoining sites are partitioned into left- and right-adjoining sites.

Using states, WSTAGs can simulate each of these features in a systematic way, as indicated in the following. For strictness we encode the root symbols into the states, cf. the discussion in Section 5 of [21].

For adjoining constraints, we indicate how to implement selective adjunction by the following excerpt of a WSTAG:

q → ⟨z1(A(x1)), z1(B(x1)), q′, f⟩   (2)
f(∗) → ⟨∗, ∗, ε, ε⟩   (3)
f(∗) → ⟨z1(∗), z1(∗), ε, f′⟩   (4)

Rule (2) shows adjoining at the root position governed by state f, rule (3) implements the case of not adjoining, and rule (4) redirects to the state f′, which is intended to handle the actual adjoining. We note that the separation between f and f′ allows modelling independent probabilities for the activation of the adjoining site and the choice of the adjoined rule.

Finally, the left/right adjoining restriction of STIGs can be handled by keeping appropriate finite information in the states.

We conjecture that the expressive power of WSTAGs depends on (a) the maximum number of variables in any rule and (b) the maximum number of symbols in any rule. The first dependency may make binarization for WSTAGs as problematic as it is for SCFGs (cf. [27]).

STSGs as defined in [21] are WSTAGs that only have nullary states.

4 Closure under input and output product

First we define the input product and output product of a weighted tree transformation T : U_Σ × U_Σ → R≥0 and a weighted tree language L : U_Σ → R≥0. The input product of T and L is the weighted tree transformation

L ⊳ T : U_Σ × U_Σ → R≥0,  (s,t) ↦ L(s) · T(s,t) .

Similarly, we define the output product of T and L as the mapping

T ⊳ L : U_Σ × U_Σ → R≥0,  (s,t) ↦ T(s,t) · L(t) .

We note that the input product and output product can be considered as a kind of composition of weighted tree transformations by viewing L as the mapping L′ : U_Σ × U_Σ → R≥0 with L′(s,t) = L(s) if s = t, and L′(s,t) = 0 otherwise.


Example 2
We consider the WSTAG G and the WRTG H given in Figure 2a and b, where the states e and o generate backbones of even and odd lengths, respectively. Figure 2c and d show the shape of the derivation trees of G and a concrete derived tree pair, respectively. Note that the weight of a derivation tree of the given shape is 0.3^k · 0.7^{n1+···+nk}, and the derived string pair is

(a^{n1} b^{n1} # ... # a^{nk} b^{nk} ### c^{nk} d^{nk} # ... # c^{n1} d^{n1} ,  (ad)^{n1} (bc)^{n1} # ... # (ad)^{nk} (bc)^{nk} ###^k) .

The weighted tree language L(H) maps trees of the form shown in Figure 2e, where the unlabelled nodes may carry any label in {a,b,c,d,#}, to the weight 0.5^{2n} · 0.2^{4n} if the number of occurrences of S is 2n. Every other tree is mapped to 0.

The input product L(H) ⊳ T(G) maps pairs like in Figure 2d to 0.5^{2n} · 0.2^{4n} · 0.3^k · 0.7^{n1+···+nk} if 2n = k + 2(n1+···+nk). Every other pair is mapped to 0.

□

The following theorem comprises our closure result. We show its effectiveness by giving a direct construction in Section 5. Moreover, we provide its proof in Section 6.

Theorem 3
For every WSTAG G and WRTG H there are WSTAGs H ⊳ G and G ⊳ H such that

• T(H ⊳ G) = L(H) ⊳ T(G) and
• T(G ⊳ H) = T(G) ⊳ L(H).

Moreover, if H is unambiguous, then there is a bijection π between supp(L(D_{H⊳G})) and supp(L_H(D_G)), where L_H(D_G)(d) = L(H)(h1(d)) · L(D_G)(d), such that π preserves weights and derived input and output trees; formally, wt′(d′) = wt(π(d′)) and (h′1(d′), h′2(d′)) = (h1(π(d′)), h2(π(d′))), where the primed symbols stem from the constructed WSTAG.

Consequently, the n-best derivation trees of H ⊳ G correspond to the n-best derivation trees of G, when their weights are adjusted according to the input product. The analogous statement holds for the output product.

5 Direct construction

Here we provide our construction of the WSTAG H ⊳ G, whose existence is postulated in Theorem 3. Let G = (Q, F, Σ, q0, R, wt) be a WSTAG and H = (P, Σ, p0, R_H, wt_H) a WRTG.

First we define an enrichment of H that can generate trees like ζ as they occur in rules of G, that is, including variables xj, zj, and possibly ∗. To this end, let ρ ∈ R. A (state) assignment for ρ is a mapping θ that maps each nullary variable in ρ to a state of H and each unary variable in ρ to a pair of such states. Likewise, if ∗ occurs in ζ, then θ maps it to a state of H. Then, for every p ∈ P and assignment θ, we define the WRTG H(p,θ), which is obtained from H by using p as initial state and adding the following rules with weight 1 for every p, p′ ∈ P:

p → xj   if θ(xj) = p,
p → zj(p′)   if θ(zj) = (p,p′), and
p → ∗   if θ(∗) = p .


Figure 2. (a) WSTAG G for Example 2. (b) WRTG H for Example 2. (c) Shape of derivation trees of G for k ∈ N and n1,...,nk ∈ N. (d) Derived tree pair for k = 2, n1 = 2, and n2 = 1. (e) Shape of trees in supp(L(H)) for n ∈ N.

Second, for every rule ρ ∈ R, we let H(p,θ) 'run' on the input tree ζ of ρ. Formally, we define the product WSTAG of H and G as the WSTAG

H ⊳ G = (Q × P, F × (P × P), Σ, (q0,p0), R′, wt′),

as follows. Let ρ ∈ R, p ∈ P, and θ an assignment for ρ. Then, depending on the form (α) or (β) of ρ, the rule

(q,p) → ⟨ζ, ζ′, u, v⟩   (α)
(f, (p, θ(∗))) → ⟨ζ, ζ′, u, v⟩   (β)


is in R′, where

• u = (q1, θ(x1)) ··· (qm, θ(xm)) and
• v = (f1, θ(z1)) ··· (fl, θ(zl)).

We denote this rule by (ρ,p,θ). Its weight is

wt′(ρ,p,θ) = wt(ρ) · L(H(p,θ))(ζ) .

There are no further elements in R′.
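The weight wt′(ρ,p,θ) can be computed by letting H, enriched as above, run over the input tree ζ. The sketch below (illustrative names; trees are nested tuples, and θ maps the labels of nullary variables and '*' to states of H and the labels of unary variables to pairs of states) folds the extra rules of H(p,θ) directly into an inside recursion:

from collections import defaultdict

def inside_assigned(zeta, rules_h, theta):
    # Inside weights of zeta under H(p, theta): a nullary variable x_j or the
    # foot node * is accepted with weight 1 in the state prescribed by theta;
    # a unary variable z_j with theta(z_j) = (p, p') is accepted in p and
    # requires its subtree to be derivable from p'; ordinary symbols use the
    # rules of H.
    label, children = zeta[0], zeta[1:]
    total = defaultdict(float)
    if label in theta and not children:              # x_j or *
        total[theta[label]] = 1.0
        return total
    if label in theta and len(children) == 1:        # z_j
        p_top, p_below = theta[label]
        total[p_top] = inside_assigned(children[0], rules_h, theta)[p_below]
        return total
    child_weights = [inside_assigned(child, rules_h, theta) for child in children]
    for p, sigma, body, weight in rules_h:
        if sigma == label and len(body) == len(children):
            w = weight
            for p_i, cw in zip(body, child_weights):
                w *= cw[p_i]
            total[p] += w
    return total

def product_rule_weight(wt_rho, zeta, rules_h, p, theta):
    # wt'(rho, p, theta) = wt(rho) * L(H(p, theta))(zeta)
    return wt_rho * inside_assigned(zeta, rules_h, theta)[p]

Enumerating all p ∈ P and all assignments θ in this way reproduces the rule set R′; Algorithm 1 in Section 7 instead avoids constructing a certain portion of the useless combinations.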

Example 4 (Example 2 continued)
The WSTAG H ⊳ G is shown in Figure 3, where rules with weight 0 are omitted. Each rule is shown with its shorthand notation (ρ,p,θ) on its left. Note that ρ3 in G contains exactly two nodes labelled S, so this rule does not affect the parity of the total number of S-labelled nodes. Hence, rules in H ⊳ G stemming from ρ3 only contain the states (f,(e,e)) and (f,(o,o)), but not (f,(e,o)) or (f,(o,e)). Also note how these rules alternate between said states.

□

We have that |R′| ∈ O(|R| · |P| · |P|^C) where C = max{m + 2·l + y | ∃ρ : ρ is an (m,l)-rule, y = 0 in case (α), y = 1 in case (β)}. More specifically, the factors |R|, |P|, and |P|^C are due to the choices of ρ ∈ R, p, and θ, respectively, where C is a worst-case estimate.

6 Proofs

We will only prove the closure under input product, because the proof for the output product is similar. First we show the proof of Theorem 3 in the unweighted case, and second we prove the correctness of the direct construction given in Section 5.

6.1 Proof of Theorem 3 in the unweighted case

For the unweighted case, the closure result follows from classical results as follows. We obtain the unweighted case if we replace the algebra in which the weights are calculated by another one: R≥0 is replaced by the set B = {true, false}, and the operations + and · are replaced by disjunction and conjunction, respectively. In other words, we replace the inside semiring (R≥0, +, ·, 0, 1) by the Boolean semiring (B, ∨, ∧, false, true). Then L(H) and T(G) become sets L(H) ⊆ U_Σ and T(G) ⊆ U_Σ × U_Σ. In this setting and using h : supp(L(D_G)) → U_Σ × U_Σ defined by h(d) = (h1(d), h2(d)), we have that

L(H) ⊳ T(G) = h(h1^{-1}(L(H))) .

Note that h1^{-1}(L(H)) ⊆ supp(L(D_G)). Now we observe that h1 can be computed by a particular macro tree transducer [10, 18]; we note that in [48] such macro tree transducers were called embedded tree transducers.

By [20, Theorem 7.4] the class of regular tree languages is closed under the inverse of macro tree transducers. Thus, since L(H) is a regular tree language, also h1^{-1}(L(H)) is a regular tree language. Hence L(H) ⊳ T(G) = h(L(H′)) for some regular tree grammar H′. Now it is easy to construct (using h1 and h2) a STAG H ⊳ G from H′ such that T(H ⊳ G) = h(L(H′)).


Figure 3. WSTAG H ⊳ G for Example 4 (rules with positive weight).

6.2 Proof of Theorem 3

First, we define the image of a weighted tree language under a mapping. To this end, let A be a set, Σ an alphabet, and g : U_Σ → A a mapping. For every weighted tree language L : U_Σ → R≥0 we define the image of L under g as the weighted tree language g(L) : A → R≥0 with g(L)(a) = ∑_{s ∈ g^{-1}(a)} L(s). Recall that we assume ∞ ∈ R≥0.


Now we can express T(G) for an arbitrary WSTAG G succinctly as follows. We combine the embedded tree homomorphisms h1 and h2 into the mapping h_G : U_R → U_Σ × U_Σ with h_G(d) = (h1(d), h2(d)). Then we have T(G) = h_G(L(D_G)).

Next, we relate derivation trees of H ⊳ G with derivation trees of G via the relabelling π : U_{R′} → U_R, which maps each occurrence of (ρ,p,θ) to ρ. Informally, π erases the state behaviour of H from derivation trees of H ⊳ G.

Lemma 5
T(H ⊳ G) = L(H) ⊳ T(G).

Proof. Recall the definition of L_H(D_G) from Theorem 3. Then

T(H ⊳ G) = h_{H⊳G}(L(D_{H⊳G})) = h_G(π(L(D_{H⊳G})))
         = h_G(L_H(D_G))          (by Lemma 7)
         = L(H) ⊳ T(G) .          (by Lemma 6)

Lemma 6
L(H) ⊳ T(G) = h_G(L_H(D_G)).

Proof.

(L(H) ⊳ T(G))(s,t) = L(H)(s) · T(G)(s,t)
                   = L(H)(s) · ∑_{d : h_G(d)=(s,t)} L(D_G)(d)
                   = ∑_{d : h_G(d)=(s,t)} L(H)(h1(d)) · L(D_G)(d)
                   = (h_G(L_H(D_G)))(s,t) .    □

Lemma 7
π(L(D_{H⊳G})) = L_H(D_G).

Before we can prove this lemma, we have to introduce some concepts and auxiliary statements (cf. Figure 4 for an illustration). For every derivation tree d of G we define an equivalence relation ≡_d over runs of H on h1(d). Two runs will be deemed equivalent if they coincide in all positions at which the state is determined by a corresponding derivation of H ⊳ G. To this end, we define the auxiliary WSTAG ab(G) as the WSTAG obtained from G by applying the following mapping to each position of the input tree of each rule:

ζ(w) ↦ ζ(w)   if ζ(w) ∈ {∗, x1, x2, ..., z1, z2, ...},
ζ(w) ↦ •      if w = ε and ζ(w) ∈ Σ, or if w = u1 for some u, ζ(w) ∈ Σ, and ζ(u) = zi,
ζ(w) ↦ .      otherwise.

Clearly, the derivation trees of ab(G) and those of G are in a one-to-one correspondence, and we will not distinguish between them. Moreover, we have that h_{ab(G),1}(d) is a tree over {•, .}, where an occurrence of • indicates a position at which the state is fixed by a corresponding derivation of H ⊳ G, whereas . indicates a position at which the state is not fixed. For every d ∈ supp(L(D_G)) we define the equivalence relation ≡_d on the runs of H on h1(d) by


Figure 4. An illustration of the concepts used in the proof of Theorem 3. The set in the lower left corner contains equivalence classes of ≡_d, each represented by a tree.

κ1 ≡_d κ2 iff for every w ∈ pos(h1(d)) we have that (h_{ab(G),1}(d))(w) = • implies κ1(w) = κ2(w).

We let ϕ(d) be the set of equivalence classes induced by ≡_d.

Lemma 8
Let d be a derivation tree of G. Then there is a bijection between the sets ϕ(d) and π^{-1}(d).

Proof. A derivation d′ ∈ π^{-1}(d) fixes the states of exactly those positions of h1(d) that are mapped to • by h_{ab(G),1}(d). Hence, such a derivation corresponds to an equivalence class of ≡_d. It is easy to see that for every equivalence class there is such a derivation. □

Lemma 9
π(L(D_{H⊳G}))(d) = ∑_{ν∈ϕ(d)} L(D_G)(d) · ∑_{κ∈ν} wt(κ, h1(d)).

Proof. This follows from Lemma 8, the definition of wt(κ, h1(d)), the definition of wt′, and commutativity, associativity and distributivity of + and ·. □


Proof (of Lemma 7).

π(L(D_{H⊳G}))(d) = ∑_{ν∈ϕ(d)} L(D_G)(d) · ∑_{κ∈ν} wt(κ, h1(d))      (Lemma 9)
                 = L(D_G)(d) · ∑_{ν∈ϕ(d)} ∑_{κ∈ν} wt(κ, h1(d))
                 = L(D_G)(d) · L(H)(h1(d))
                 = L_H(D_G)(d) .    □

Corollary 10
Let H be unambiguous. The restriction of π to supp(L(D_{H⊳G})) → supp(L_H(D_G)) is a bijection.

Proof. Injectivity: Let d′1, d′2 ∈ supp(L(D_{H⊳G})) be distinct and d = π(d′1) = π(d′2). By Lemma 8 we have that d′1 and d′2 correspond to distinct equivalence classes ν1, ν2 in ϕ(d). Since H is unambiguous, at most one run κ of H on h1(d) has positive weight. Since we cannot have κ ∈ ν1 and κ ∈ ν2 at the same time, Lemma 9 yields that we also cannot have that both d′1, d′2 ∈ supp(L(D_{H⊳G})).

Surjectivity: Direct consequence of Lemma 7. □

Lemma 5 and Corollary 10 together prove Theorem 3.

7 Algorithm

Now we present Algorithm 1, which performs the construction of H ⊳ G. It uses a strategy similar to that of Earley's algorithm to construct at least all useful rules of H ⊳ G while avoiding construction of a certain portion of useless rules. A rule is useful if it occurs in some derivation tree; otherwise it is useless.

Algorithm 1: Product construction algorithm

Require: G = (Q, F, Σ, q0, R, wt) a WSTAG and H = (P, Σ, p0, R_H, wt_H) a WRTG
Ensure: R_u contains at least the useful rules of H ⊳ G; wt_u coincides with the weight assignment of H ⊳ G on R_u

     ▷ step 1: compute I
 1:  I ← ∅
 2:  repeat
 3:      add items to I by applying the rules in Figure 5
 4:  until convergence
     ▷ step 2: compute rules
 5:  R_u ← ∅
 6:  for [ρ, ε, p, θ] ∈ I do
 7:      R_u ← R_u ∪ {(ρ, p, θ)} as in Section 5
     ▷ step 3 (optional): reduce
 8:  perform reachability analysis to remove useless rules from R_u
     ▷ step 4: compute weights
 9:  for (ρ, p, θ) ∈ R_u do
10:      wt_u(ρ, p, θ) ← wt(ρ) · W([ρ, ε, p, θ])

Conceptually, the algorithm proceeds in four steps. Note that, in practice, some of these steps may be implemented to run in an interleaved manner in order to reduce constants in the runtime complexity.


Figure 5. Deductive parsing schema for the input product.

The first step is based on a deductive system, or deductive parsing schema, which is given in Figure 5. Its central notion is that of an item, which is a syntactic representation of a proposition. We say that an item holds if the corresponding proposition is true. In Section 8 we will explain the meaning of the items in detail and in Section 9 we will illustrate the generation of items. Roughly speaking, the items drive a depth-first left-to-right simulation of H on the input trees of rules of G. Items with round brackets are responsible for top-down traversal and items with square brackets for horizontal and bottom-up traversal. The deductive system contains inference rules which are, as usual, syntactic representations of conditional implications [22, 39]. The first step of the algorithm computes the least set I of items that is closed under application of the inference rules. This is done in the usual iterative way, starting with the empty set and applying rules until convergence. Since there are only finitely many items, this process will terminate. Note that, given the soundness of the inference rules (cf. Section 8), all items in I hold.
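Step 1 is an ordinary deductive closure; a generic agenda-driven sketch (the concrete inference rules of Figure 5 are abstracted into a callback here, and all names are illustrative) looks as follows:

def deductive_closure(axioms, consequences):
    # Computes the least set of items closed under the inference rules.
    #   axioms       -- iterable of items that hold unconditionally
    #   consequences -- function (item, chart) -> iterable of items that become
    #                   derivable using `item` together with the items in `chart`
    chart = set()
    agenda = list(axioms)
    while agenda:
        item = agenda.pop()
        if item in chart:
            continue
        chart.add(item)
        agenda.extend(consequences(item, chart))
    return chart

Since there are only finitely many items, the agenda empties after finitely many steps, which matches the termination argument above.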

In the second step, we construct the rule (ρ,p,θ) of H ⊳ G for each [ρ,ε,p,θ] in I, as in (α) or as in (β) in Section 5, depending on whether ρ ∈ R_q or ρ ∈ R_f.

The third step is a reachability analysis to remove useless rules. This step is optional; it depends on the application whether the runtime spent here is amortized by subsequent savings.


In the fourth step, we determine the weight of each of the remaining rules. For a rule (ρ,p,θ) this is wt(ρ) · W([ρ,ε,p,θ]), where W is defined recursively by case analysis as follows.

• If ζ(w) ∈ Σ, with k = rk_ζ(w), and θj is the restriction of θ to variables below node wj in ρ and Dρ = p → ζ(w)(p1,...,pk), then

W([ρ,w,p,θ]) = ∑_{p1,...,pk : [ρ,w1,p1,θ1],...,[ρ,wk,pk,θk] ∈ I} wt_H(Dρ) · ∏_j W([ρ,wj,pj,θj]) .

• If ζ(w) ∈ {∗, x1,...,xm}, then W([ρ,w,p,θ]) = 1.
• If ζ(w) = zj, with θ(zj) = (p,p′), then

W([ρ,w,p,θ]) = W([ρ,w1,p′, θ \ {zj ↦ (p,p′)}]) .

The computation of W([ρ,ε,p,θ]) can be sped up by storing 'back pointers' for each item, i.e. the items which were used for its generation. Alternatively, it is possible to compute the weights on-the-fly during the first step, thus alleviating the need for a separate recursive computation. To this end, items should be prioritized to make sure that they are generated in the right order for the computation. To be more precise, one has to ensure that all items referred to on the right-hand sides of the equations are generated before the items on the left-hand sides.

8 Meaning of items

The meaning of the items can best be illustrated by the concepts of enriched derivation tree and partial enriched derivation tree.

An enriched derivation tree is a modified derivation tree in which node labels have the form (ρ,c), where ρ is a rule, as in the familiar derivation trees, and c is an additional decoration that maps every position of the input tree ζ of ρ to a state of the WRTG H. Moreover, c must be consistent with the rules of H, and positions that coincide in the derived tree must be decorated with the same state (cf. Figure 6, dashed lines). A partial enriched derivation tree (for short: pedt) is an enriched derivation tree in which subtrees can still be missing (represented by ⊥) or the decoration with states from H is not yet complete (i.e. some positions are mapped to ?).

Figure 6 shows an example pedt d, where we represent the decoration at each position of d by annotating the corresponding input tree.

This pedt can be viewed as representing application of the following rules:

(1), (2q), (8z), (2f), (4), (5), (4), (7), ...

Now we make our description of pedts more precise. Let n be a position of d, ρ be the rule occurring in the label d(n) and ζ be the input tree of ρ. Then there is a position w of ζ such that

• every position u which is lexicographically smaller than w is decorated by a state and, if ζ(u) is a variable xj or zj, then the subtree of n which corresponds to the variable does not contain ⊥ or ?, and
• every position v which is lexicographically greater than w is decorated by ? and, if ζ(v) is a variable xj or zj, then the child of n which corresponds to the variable is ⊥.

For instance, if we consider the root n = ε of the pedt in Figure 6, then w = 1 (i.e. the position with label S).


Figure 6. Partial enriched derivation tree. Grammars taken from Figure 2.

Finally we can describe the meaning of the items by referring to properties of pedts.

• (q,p) (and (f,p)): There are a pedt d, a position n of d, a rule ρ and a decoration c such that d(n) = (ρ,c), ρ ∈ R_q (resp., ρ ∈ R_f), and c(ε) = p.
• [q,p]: There are a pedt d, a position n of d, a rule ρ and a decoration c such that d(n) = (ρ,c), ρ ∈ R_q, c(ε) = p, and c is a complete decoration.
• [f,p,p′]: There are a pedt d, a position n of d, a rule ρ and a decoration c such that d(n) = (ρ,c), ρ ∈ R_f, c(ε) = p, c is a complete decoration, and c maps the ∗-labelled position of the input tree of ρ to p′.
• (ρ,w,p): There are a pedt d, a position n of d and a decoration c such that d(n) = (ρ,c) and c(w) = p.
• [ρ,w,p,θ]: There are a pedt d, a position n of d and a decoration c such that d(n) = (ρ,c), c(w) = p and, if a position u below w of the input tree ζ of ρ is labelled by xj, then c(u) = θ(xj), and if it is labelled by zj, then (c(u), c(u1)) = θ(zj), and if it is labelled by ∗, then c(u) = θ(∗).
• [ρ,w,j,p,p1···pk,θ]: There are a pedt d, a position n of d and a decoration c such that d(n) = (ρ,c), c(w) = p, and, if a position u of the input tree ζ of ρ is labelled by some variable y and u is lexicographically smaller than wj, then c and θ agree in the same way as for the preceding item.

Given these semantics of items, it is not difficult to see that the inference rules of the deduction system are sound. The completeness of the system can be derived by means of a small proof by contradiction.

9 Example of the generation of items

Here we illustrate the inference rules of Figure 5. We consider the WSTAG G and the WRTG H of Figure 2, and demonstrate the generation of items (lines 1–4 of Algorithm 1) for the rule ρ3 of G.

In Figure 7 we show the input tree ζ of ρ3 in bold-face letters and lines. On top of this syntactic structure, we have drawn another graph, consisting of items and arrows. Close to every node of ζ, we have placed those items that carry the Gorn-address of that node.


Figure 7. Generation of items on the rule ρ3 of G of Example 4.

The arrows show the dependencies between the items as they are expressed by the rules of the deduction schema; the label of an arrow refers to the name of the corresponding inference rule, with one exception: the edge between I9 = (f,o) and I10 = [f,o,o] is due to further generation of items, viz. on an f-rule with state o of H (which we do not show).

The list of the items involved in Figure 7 is shown in Table 2. Apart from the first two items, they are grouped according to the nodes of ζ. Note that the items in the rightmost column only serve to achieve a left-to-right traversal over the children of the relevant node.

Finally, we note that the deduction schema in Figure 5 can be considered as an attribute grammar [9, 19, 30] that is based on a macro grammar rather than a context-free grammar. From this perspective, Figure 7 shows the dependency graph on ζ, where the items are the attribute occurrences and the arrows are the attribute dependencies.

10 Complexity analysis

In this section, we analyse the worst-case space and time complexity of step 1 of Algorithm 1. The space complexity is O(|G|_in · |R_H| · |P|^C), which is determined by the number of possible items of the form [ρ,w,j,p,p1···pk,θ]. The first factor, |G|_in, denotes the input size of G, defined by ∑_{ρ∈R} |pos(ζ(ρ))|, where ζ(ρ) is the input tree of ρ. It captures the components ρ, w and j in said items, which together identify exactly one node of an input tree of G. The factor |R_H| captures the components p and p1···pk. The final factor, |P|^C, captures θ, where C is given at the end of Section 5.


Table 2. List of items, where θ1(∗) = e and θ2(∗) = e, θ2(z1) = (o,o).

             I1 = (f,e),  I32 = [f,e,e]
S (root):    I2  = (ρ3,ε,e)                 I3  = [ρ3,ε,0,e,ror,∅]
             I31 = [ρ3,ε,e,θ2]              I7  = [ρ3,ε,1,e,ror,∅]
                                            I26 = [ρ3,ε,2,e,ror,θ2]
                                            I30 = [ρ3,ε,3,e,ror,θ2]
a:           I4  = (ρ3,1,r)                 I5  = [ρ3,1,0,r,ε,∅]
             I6  = [ρ3,1,r,∅]
z1:          I8  = (ρ3,2,o)
             I25 = [ρ3,2,o,θ2]
             I9  = (f,o)
             I10 = [f,o,o]
S:           I11 = (ρ3,21,o)                I12 = [ρ3,21,0,o,rer,∅]
             I24 = [ρ3,21,o,θ1]             I16 = [ρ3,21,1,o,rer,∅]
                                            I19 = [ρ3,21,2,o,rer,θ1]
                                            I23 = [ρ3,21,3,o,rer,θ1]
b:           I13 = (ρ3,211,r)               I14 = [ρ3,211,0,r,ε,∅]
             I15 = [ρ3,211,r,∅]
∗:           I17 = (ρ3,212,e)
             I18 = [ρ3,212,e,θ1]
c:           I20 = (ρ3,213,r)               I21 = [ρ3,213,0,r,ε,∅]
             I22 = [ρ3,213,r,∅]
d:           I27 = (ρ3,3,r)                 I28 = [ρ3,3,0,r,ε,∅]
             I29 = [ρ3,3,r,∅]

Following [37] we determine the time complexity by the number of instantiations of the inference rules. In our case the time complexity coincides with the space complexity.

11 Conclusion and discussion

We have introduced a formulation of STAGs that is closed under input product and output product with regular weighted tree languages. By the result of [35], this implies closure under input product and output product with regular weighted string languages. We have provided a direct construction of the STAG that generates said input product (and, mutatis mutandis, the output product). No such construction has been published before that deals with both weights and synchronization. Moreover, we have presented a novel algorithm for computing our construction. This algorithm is inspired by Earley's algorithm to the effect that computation of a certain portion of useless rules is avoided.

The next step towards an implementation would be to consider pruning. This amounts to partitioning the set I of items and imposing a bound on the size of each partition. Such a technique has already been presented in [8] for the cube-pruning algorithm.

Another possible future contribution could be an algorithm specifically tailored to the input product with a regular weighted string language. For this scenario, several contributions exist, requiring additional restrictions however. For instance, in [42] a CYK-like algorithm is shown that intersects a STIG with a pair of strings. This algorithm requires that the trees of the grammar be binarized. As the authors of [13] point out, this makes the grammar strictly less powerful. They in turn propose a construction that converts the STIG into an equivalent tree-to-string transducer, and they use corresponding algorithms for parsing, such as the one in [15].


However, their construction relies on the fact that tree-insertion grammars are weakly equivalent to context-free grammars. Thus, it is not applicable to the more general STAGs.

Acknowledgments

We are grateful to the referees for thorough remarks and helpful suggestions.

Funding

Matthias Büchse was financially supported by DFG VO 1011/6-1.

References

[1] A. Abeillé, Y. Schabes, and A. K. Joshi. Using lexicalized TAGs for machine translation. In Proceedings of Conference on Computational Linguistics COLING 1990, H. Karlgren, ed., 1990.

[2] A. Alexandrakis and S. Bozapalidis. Weighted grammars and Kleene's theorem. Information Processing Letters, 24, 1–4, 1987.

[3] B. S. Baker. Composition of top-down and bottom-up tree transductions. Information and Control, 41, 186–213, 1979.

[4] A. L. Berger, V. J. Della Pietra, and S. A. Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22, 39–71, 1996.

[5] M. Büchse, D. Geisler, T. Stüber, and H. Vogler. n-best parsing revisited. In Proceedings of Workshop on Applications of Tree Automata in Natural Language Processing ATANLP 2010, F. Drewes and M. Kuhlmann, eds, pp. 46–54, 2010.

[6] M. Büchse, M.-J. Nederhof, and H. Vogler. Tree parsing with synchronous tree-adjoining grammars. In Proceedings of International Conference on Parsing Technologies IWPT 2011, J. Nivre, ed., pp. 14–25, 2011.

[7] F. Casacuberta and C. de la Higuera. Computational complexity of problems on probabilistic grammars and transducers. In Proceedings of Grammatical Inference: Algorithms and Applications ICGI, A. L. Oliveira, ed., Lecture Notes in Computer Science, Vol. 189, 2000.

[8] D. Chiang. Hierarchical phrase-based translation. Computational Linguistics, 33, 201–228, 2007.

[9] B. Courcelle. Attribute grammars: definitions, analysis of dependencies, proof methods. In Methods and Tools for Compiler Construction, B. Lorho, ed., pp. 81–102. Cambridge University Press, 1984.

[10] B. Courcelle and P. Franchi-Zannettacci. Attribute grammars and recursive program schemes I and II. Theoretical Computer Science, 17, 163–191, 235–257, 1982.

[11] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39, 1–38, 1977.

[12] S. DeNeefe. Tree-Adjoining Machine Translation. PhD Thesis, University of Southern California, 2011.

[13] S. DeNeefe and K. Knight. Synchronous tree-adjoining machine translation. In Proceedings of Conference on Empirical Methods in Natural Language Processing EMNLP 2009, P. Koehn and R. Mihalcea, eds, pp. 727–736, 2009.

[14] S. DeNeefe, K. Knight, and H. Vogler. A decoder for probabilistic synchronous tree insertion grammars. In Proceedings of the 2010 Workshop on Applications of Tree Automata in Natural Language Processing, ACL 2010, F. Drewes, and M. Kuhlmann, eds, pp. 10–18, 2010.
[15] J. DeNero, M. Bansal, A. Pauls, and D. Klein. Efficient parsing for transducer grammars. In Proceedings of North American Chapter of the Association for Computational Linguistics – Human Language Technologies NAACL 2009, pp. 227–235, 2009.
[16] J. Earley. An efficient context-free parsing algorithm. Communications of the ACM, 13, 94–102, 1970.
[17] J. Eisner. Learning non-isomorphic tree mappings for machine translation. In Proceedings of Annual Meeting of the Association for Computational Linguistics ACL 2003, pp. 205–208, 2003.
[18] J. Engelfriet. Some open questions and recent results on tree transducers and tree languages. In Formal Language Theory: Perspectives and Open Problems, 1st edn, R. V. Book, ed., pp. 241–286. Academic Press, 1980.
[19] J. Engelfriet. Attribute grammars: attribute evaluation methods. In Methods and Tools for Compiler Construction, B. Lorho, ed., pp. 103–138. Cambridge University Press, 1984.
[20] J. Engelfriet and H. Vogler. Macro tree transducers. Journal of Computer and System Sciences, 31, 71–146, 1985.
[21] Z. Fülöp, A. Maletti, and H. Vogler. Preservation of recognizability for synchronous tree substitution grammars. In Proceedings of 2010 Workshop on Applications of Tree Automata in Natural Language Processing, F. Drewes, and M. Kuhlmann, eds, pp. 1–9, 2010.
[22] J. Goodman. Semiring parsing. Computational Linguistics, 25, 573–605, 1999.
[23] J. Graehl, K. Knight, and J. May. Training tree transducers. Computational Linguistics, 34, 391–427, 2008.
[24] S. L. Graham, M. Harrison, and W. L. Ruzzo. An improved context-free recognizer. ACM Transactions on Programming Languages and Systems, 2, 415–462, 1980.
[25] L. Huang and D. Chiang. Better k-best parsing. In Proceedings of International Conference on Parsing Technologies IWPT 2005, pp. 53–64, 2005.
[26] L. Huang, K. Knight, and A. Joshi. Statistical syntax-directed translation with extended domain of locality. In Proceedings of Association for Machine Translation in the Americas AMTA 2006, pp. 66–73, 2006.
[27] L. Huang, H. Zhang, D. Gildea, and K. Knight. Binarization of synchronous context-free grammars. Computational Linguistics, 35, 559–595, 2009.
[28] A. K. Joshi and Y. Schabes. Tree-adjoining grammars. In Handbook of Formal Languages. Springer, 1997.
[29] S. Kepser and J. Rogers. The equivalence of tree adjoining grammars and monadic linear context-free tree grammars. In The Mathematics of Language, Vol. 6149 of Lecture Notes in Computer Science, pp. 129–144. Springer, 2010.
[30] D. E. Knuth. Semantics of context-free languages. Mathematical Systems Theory, 2, 127–145, 1968.
[31] Z. Li, J. Eisner, and S. Khudanpur. Variational decoding for statistical machine translation. In Proceedings of Joint Conference of the 47th Annual Meeting of the ACL and 4th International Joint Conference on Natural Language Processing of the AFNLP ACL-IJCNLP '09, pp. 593–601, 2009.
[32] A. Lopez. Statistical machine translation. ACM Computing Surveys, 40, 1–49, 2008.
[33] A. Maletti. Input and output products for weighted extended top-down tree transducers. In Proceedings of Developments in Language Theory DLT 2010, Vol. 6224 of Lecture Notes in Computer Science, H.-C. Yen, and O. H. Ibarra, eds, pp. 316–327. Springer, 2010.
[34] A. Maletti. A tree transducer model for synchronous tree-adjoining grammars. In Proceedings of Annual Meeting of the Association for Computational Linguistics ACL 2010, pp. 1067–1076, 2010.
[35] A. Maletti and G. Satta. Parsing algorithms based on tree automata. In Proceedings of International Conference on Parsing Technologies IWPT 2009, pp. 1–12, 2009.
[36] J. May and K. Knight. Tiburon: a weighted tree automata toolkit. In 11th International Conference on Implementation and Application of Automata CIAA 2006, O. H. Ibarra, and H.-C. Yen, eds, Vol. 4094 of Lecture Notes in Computer Science, pp. 102–113. Springer, 2006.
[37] D. McAllester. On the complexity analysis of static analyses. Journal of the ACM, 49, 512–537, 2002.
[38] M. Mohri. Weighted automata algorithms. In Handbook of Weighted Automata, M. Droste, W. Kuich, and H. Vogler, eds, chapter 6, pp. 213–254. Springer, 2009.
[39] M.-J. Nederhof. Weighted deductive parsing and Knuth's algorithm. Computational Linguistics, 29, 135–143, 2003.
[40] M.-J. Nederhof. Weighted parsing of trees. In Proceedings of International Conference on Parsing Technologies IWPT 2009, pp. 13–24, 2009.
[41] R. Nesson, S. M. Shieber, and A. Rush. Induction of probabilistic synchronous tree-insertion grammars. Technical report TR-20-05, Computer Science Group, Harvard University, Cambridge, Massachusetts, 2005.
[42] R. Nesson, S. M. Shieber, and A. Rush. Induction of probabilistic synchronous tree-insertion grammars for machine translation. In Proceedings of Association for Machine Translation in the Americas AMTA 2006, 2006.
[43] F. J. Och. Minimum error rate training in statistical machine translation. In Proceedings of Annual Meeting of the Association for Computational Linguistics ACL 2003, pp. 160–167, 2003.
[44] F. J. Och and H. Ney. Discriminative training and maximum entropy models for statistical machine translation. In Proceedings of Annual Meeting of the Association for Computational Linguistics ACL 2002, pp. 295–302, 2002.
[45] W. C. Rounds. Mappings and grammars on trees. Mathematical Systems Theory, 4, 257–287, 1970.
[46] Y. Schabes and R. C. Waters. Tree insertion grammars: a cubic-time, parsable formalism that lexicalizes context-free grammar without changing the trees produced. Computational Linguistics, 21, 479–513, 1994.
[47] S. M. Shieber. Synchronous grammars as tree transducers. In Proceedings of Seventh International Workshop on Tree Adjoining Grammars and Related Formalisms, pp. 88–95, 2004.
[48] S. M. Shieber. Unifying synchronous tree-adjoining grammars and tree transducers via bimorphisms. In Proceedings of 11th Conference of the European Chapter of the Association for Computational Linguistics EACL 2006, D. McCarthy, and S. Winter, eds, Trento, Italy, 2006.
[49] S. M. Shieber and Y. Schabes. Synchronous tree-adjoining grammars. In Proceedings of 13th International Conference on Computational Linguistics, Vol. 3, pp. 253–258, Helsinki, Finland, 1990.
[50] M. Zhang, H. Jiang, A. Aw, H. Li, C. L. Tan, and S. Li. A tree sequence alignment-based tree-to-tree translation model. In Proceedings of Annual Meeting of the Association for Computational Linguistics ACL 2008, 2008.

Received 19 January 2012
