
Transcript of Bayesian Synchronous Tree-Substitution Grammar Induction and its Application to Sentence Compression


Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 937–947, Uppsala, Sweden, 11–16 July 2010. © 2010 Association for Computational Linguistics

Bayesian Synchronous Tree-Substitution Grammar Induction

and its Application to Sentence Compression

Elif Yamangil and Stuart M. Shieber

Harvard University, Cambridge, Massachusetts, USA

{elif, shieber}@seas.harvard.edu

Abstract

We describe our experiments with training algorithms for tree-to-tree synchronous tree-substitution grammar (STSG) for monolingual translation tasks such as sentence compression and paraphrasing. These translation tasks are characterized by the relative ability to commit to parallel parse trees and availability of word alignments, yet the unavailability of large-scale data, calling for a Bayesian tree-to-tree formalism. We formalize nonparametric Bayesian STSG with epsilon alignment in full generality, and provide a Gibbs sampling algorithm for posterior inference tailored to the task of extractive sentence compression. We achieve improvements against a number of baselines, including expectation maximization and variational Bayes training, illustrating the merits of nonparametric inference over the space of grammars as opposed to sparse parametric inference with a fixed grammar.

1 Introduction

Given an aligned corpus of tree pairs, we might want to learn a mapping between the paired trees. Such induction of tree mappings has application in a variety of natural-language-processing tasks including machine translation, paraphrase, and sentence compression. The induced tree mappings can be expressed by synchronous grammars. Where the tree pairs are isomorphic, synchronous context-free grammars (SCFG) may suffice, but in general, non-isomorphism can make the problem of rule extraction difficult (Galley and McKeown, 2007). More expressive formalisms such as synchronous tree-substitution (Eisner, 2003) or tree-adjoining grammars may better capture the pairings.

In this work, we explore techniques for inducing synchronous tree-substitution grammars (STSG) using as a testbed application extractive sentence compression. Learning an STSG from aligned trees is tantamount to determining a segmentation of the trees into elementary trees of the grammar along with an alignment of the elementary trees (see Figure 1 for an example of such a segmentation), followed by estimation of the weights for the extracted tree pairs.1 These elementary tree pairs serve as the rules of the extracted grammar. For SCFG, segmentation is trivial — each parent with its immediate children is an elementary tree — but the formalism then restricts us to deriving isomorphic tree pairs. STSG is much more expressive, especially if we allow some elementary trees on the source or target side to be unsynchronized, so that insertions and deletions can be modeled, but the segmentation and alignment problems become nontrivial.

Previous approaches to this problem have treated the two steps — grammar extraction and weight estimation — with a variety of methods. One approach is to use word alignments (where these can be reliably estimated, as in our testbed application) to align subtrees and extract rules (Och and Ney, 2004; Galley et al., 2004), but this leaves open the question of finding the right level of generality of the rules — how deep the rules should be and how much lexicalization they should involve — necessitating resorting to heuristics such as minimality of rules, and leading to large grammars.

1 Throughout the paper we will use the word STSG to refer to the tree-to-tree version of the formalism, although the string-to-tree version is also commonly used.


Once a given set of rules is extracted, weights can be imputed using a discriminative approach to maximize the (joint or conditional) likelihood or the classification margin in the training data (taking or not taking into account the derivational ambiguity). This option leverages a large amount of manual domain knowledge engineering and is not in general amenable to latent variable problems.

A simpler alternative to this two-step approach is to use a generative model of synchronous derivation and simultaneously segment and weight the elementary tree pairs to maximize the probability of the training data under that model; the simplest exemplar of this approach uses expectation maximization (EM) (Dempster et al., 1977). This approach has two frailties. First, EM search over the space of all possible rules is computationally impractical. Second, even if such a search were practical, the method is degenerate, pushing the probability mass towards larger rules in order to better approximate the empirical distribution of the data (Goldwater et al., 2006; DeNero et al., 2006). Indeed, the optimal grammar would be one in which each tree pair in the training data is its own rule. Therefore, proposals for using EM for this task start with a precomputed subset of rules, and with EM used just to assign weights within this grammar. In summary, previous methods suffer from problems of narrowness of search, having to restrict the space of possible rules, and overfitting in preferring overly specific grammars.

We pursue the use of hierarchical probabilistic models incorporating sparse priors to simultaneously solve both the narrowness and overfitting problems. Such models have been used as generative solutions to several other segmentation problems, ranging from word segmentation (Goldwater et al., 2006), to parsing (Cohn et al., 2009; Post and Gildea, 2009) and machine translation (DeNero et al., 2008; Cohn and Blunsom, 2009; Liu and Gildea, 2009). Segmentation is achieved by introducing a prior bias towards grammars that are compact representations of the data, namely by enforcing simplicity and sparsity: preferring simple rules (smaller segments) unless the use of a complex rule is evidenced by the data (through repetition), and thus mitigating the overfitting problem. A Dirichlet process (DP) prior is typically used to achieve this interplay. Interestingly, sampling-based nonparametric inference further allows the possibility of searching over the infinite space of grammars (and, in machine translation, possible word alignments), thus side-stepping the narrowness problem outlined above as well.

In this work, we use an extension of the aforementioned models of generative segmentation for STSG induction, and describe an algorithm for posterior inference under this model that is tailored to the task of extractive sentence compression. This task is characterized by the availability of word alignments, providing a clean testbed for investigating the effects of grammar extraction. We achieve substantial improvements against a number of baselines including EM, support vector machine (SVM) based discriminative training, and variational Bayes (VB). By comparing our method to a range of other methods that are subject differentially to the two problems, we can show that both play an important role in performance limitations, and that our method helps address both as well. Our results are thus not only encouraging for grammar estimation using sparse priors but also illustrate the merits of nonparametric inference over the space of grammars as opposed to sparse parametric inference with a fixed grammar.

In the following, we define the task of extractive sentence compression and the Bayesian STSG model, and algorithms we used for inference and prediction. We then describe the experiments in extractive sentence compression and present our results in contrast with alternative algorithms. We conclude by giving examples of compression patterns learned by the Bayesian method.

2 Sentence compression

Sentence compression is the task of summarizing a sentence while retaining most of the informational content and remaining grammatical (Jing, 2000). In extractive sentence compression, which we focus on in this paper, an order-preserving subset of the words in the sentence is selected to form the summary; that is, we summarize by deleting words (Knight and Marcu, 2002). An example sentence pair, which we use as a running example, is the following:

• Like FaceLift, much of ATM’s screen performance depends on the underlying application.

• ATM’s screen performance depends on the underlying application.


Figure 1: A portion of an STSG derivation of the example sentence and its extractive compression.

where the underlined words were deleted. In supervised sentence compression, the goal is to generalize from a parallel training corpus of sentences (source) and their compressions (target) to unseen sentences in a test set to predict their compressions. An unsupervised setup also exists; methods for the unsupervised problem typically rely on language models and linguistic/discourse constraints (Clarke and Lapata, 2006a; Turner and Charniak, 2005). Because these methods rely on dynamic programming to efficiently consider hypotheses over the space of all possible compressions of a sentence, they may be harder to extend to general paraphrasing.

3 The STSG Model

Synchronous tree-substitution grammar is a formalism for synchronously generating a pair of non-isomorphic source and target trees (Eisner, 2003). Every grammar rule is a pair of elementary trees aligned at the leaf level at their frontier nodes, which we will denote using the form

cs/ct → es/et, γ

(indices s for source, t for target) where cs, ct are root nonterminals of the elementary trees es, et respectively and γ is a 1-to-1 correspondence between the frontier nodes in es and et. For example, the rule

S / S → (S (PP (IN Like) NP[ε]) NP[1] VP[2]) / (S NP[1] VP[2])

can be used to delete a subtree rooted at PP. We use square-bracketed indices to represent the alignment γ of frontier nodes — NP[1] aligns with NP[1], VP[2] aligns with VP[2], NP[ε] aligns with the special symbol ε denoting a deletion from the source tree. Symmetrically, ε-aligned target nodes are used to represent insertions into the target tree. Similarly, the rule

NP / ε → (NP (NN FaceLift)) / ε

can be used to continue deriving the deleted subtree. See Figure 1 for an example of how an STSG with these rules would operate in synchronously generating our example sentence pair.
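To make the rule representation concrete, the following is a minimal sketch in Python (our own illustration, not code from the paper) of an elementary tree pair with its frontier alignment γ; the class and field names, and the EPSILON marker, are invented for exposition.

```python
# Illustrative sketch of an STSG rule c_s/c_t -> e_s/e_t, gamma.
# All names here are assumptions for exposition; EPSILON stands for the
# special symbol used for deletions (source side) and insertions (target side).
from dataclasses import dataclass
from typing import Optional, Tuple

EPSILON = "<eps>"

@dataclass(frozen=True)
class Node:
    label: str                          # nonterminal or terminal label
    children: Tuple["Node", ...] = ()   # empty tuple = frontier node or leaf

@dataclass(frozen=True)
class Rule:
    root_s: str                         # c_s, e.g. "S"
    root_t: str                         # c_t, e.g. "S" or EPSILON
    etree_s: Optional[Node]             # e_s (None when root_s is EPSILON)
    etree_t: Optional[Node]             # e_t (None when root_t is EPSILON)
    alignment: Tuple = ()               # gamma: pairs of frontier indices, with
                                        # EPSILON marking a deletion/insertion

# The deletion rule from the running example:
# S / S -> (S (PP (IN Like) NP[eps]) NP[1] VP[2]) / (S NP[1] VP[2])
example = Rule(
    root_s="S", root_t="S",
    etree_s=Node("S", (
        Node("PP", (Node("IN", (Node("Like"),)), Node("NP"))),
        Node("NP"), Node("VP"))),
    etree_t=Node("S", (Node("NP"), Node("VP"))),
    # source frontier order: NP (under PP), NP, VP; target frontier: NP, VP
    alignment=((0, EPSILON), (1, 0), (2, 1)),
)
```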

STSG is a convenient choice of formalism for a number of reasons. First, it eliminates the isomorphism and strong independence assumptions of SCFGs. Second, the ability to have rules deeper than one level provides a principled way of modeling lexicalization, whose importance has been emphasized (Galley and McKeown, 2007; Yamangil and Nelken, 2008). Third, we may have our STSG operate on trees instead of sentences, which allows for efficient parsing algorithms, as well as providing syntactic analyses for our predictions, which is desirable for automatic evaluation purposes.

A straightforward extension of the popular EM algorithm for probabilistic context-free grammars (PCFG), the inside-outside algorithm (Lari and Young, 1990), can be used to estimate the rule weights of a given unweighted STSG based on a corpus of parallel parse trees t = t1, . . . , tN where tn = tn,s/tn,t for n = 1, . . . , N.


Figure 2: Gibbs sampling updates. We illustrate a sampler move to align/unalign a source node with a target node (top row, in blue), and split/merge a deletion rule via aligning with ε (bottom row, in red).

Similarly, an extension of the Viterbi algorithm is available for finding the maximum probability derivation, useful for predicting the target analysis tN+1,t for a test instance tN+1,s (Eisner, 2003). However, as noted earlier, EM is subject to the narrowness and overfitting problems.

3.1 The Bayesian generative process

Both of these issues can be addressed by taking a nonparametric Bayesian approach, namely, assuming that the elementary tree pairs are sampled from an independent collection of Dirichlet process (DP) priors. We describe such a process for sampling a corpus of tree pairs t.

For all pairs of root labels c = cs/ct that we consider, where up to one of cs or ct can be ε (e.g., S / S, NP / ε), we sample a sparse discrete distribution Gc over infinitely many elementary tree pairs e = es/et sharing the common root c from a DP

Gc ∼ DP(αc, P0(· | c))    (1)

where the DP has the concentration parameter αc controlling the sparsity of Gc, and the base distribution P0(· | c) is a distribution over novel elementary tree pairs that we describe more fully shortly.

We then sample a sequence of elementary tree pairs to serve as a derivation for each observed derived tree pair. For each n = 1, . . . , N, we sample elementary tree pairs en = en,1, . . . , en,dn in a derivation sequence (where dn is the number of rules used in the derivation), consulting Gc whenever an elementary tree pair with root c is to be sampled:

e ∼ Gc  (i.i.d.), for all e whose root label is c

Given the derivation sequence en, a tree pair tn is determined, that is,

p(tn | en) = 1 if en,1, . . . , en,dn derives tn, and 0 otherwise.    (2)

The hyperparameters αc can be incorporated

into the generative model as random variables; however, we opt to fix these at various constants to investigate different levels of sparsity.

For the base distribution P0(· | c) there are a variety of choices; we used the following simple scenario. (We take c = cs/ct.)

Synchronous rules: For the case where neither cs nor ct is the special symbol ε, the base distribution first generates es and et independently, and then samples an alignment between the frontier nodes. Given a nonterminal, an elementary tree is generated by first making a decision to expand the nonterminal (with probability βc) or to leave it as a frontier node (with probability 1 − βc). If the decision to expand was made, we sample an appropriate rule from a PCFG which we estimate ahead of time from the training corpus.


We expand the nonterminal using this rule, and then repeat the same procedure for every child generated that is a nonterminal until there are no generated nonterminal children left. This is done independently for both es and et. Finally, we sample an alignment between the frontier nodes uniformly at random out of all possible alignments.

Deletion/insertion rules: If ct = ε, that is, we have a deletion rule, we need to generate e = es/ε. (The insertion rule case is symmetric.) The base distribution generates es using the same process described for synchronous rules above. Then with probability 1 we align all frontier nodes in es with ε. In essence, this process generates TSG rules, rather than STSG rules, which are used to cover deleted (or inserted) subtrees.

This simple base distribution does nothing to enforce an alignment between the internal nodes of es and et. One may come up with more sophisticated base distributions. However, the main point of the base distribution is to encode a controllable preference towards simpler rules; we therefore make the simplest possible assumption.
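As a concrete illustration of the base distribution just described, the sketch below (our own, not the paper's code) expands nonterminals with probability β using a pretrained PCFG and otherwise leaves them as frontier nodes; the helpers `pcfg_sample` and `is_nonterminal` are assumed to exist, and the uniform frontier alignment step is only noted in a comment.

```python
# A hedged sketch of P0: PCFG expansion with Bernoulli(beta) stopping.
import random

def sample_elementary_tree(label, beta, pcfg_sample, is_nonterminal):
    """Top-down sampling of one elementary tree rooted at `label`."""
    if not is_nonterminal(label) or random.random() > beta:
        # leave as a frontier node (or a terminal leaf)
        return {"label": label, "children": []}
    children = []
    for child_label in pcfg_sample(label):      # one PCFG expansion of `label`
        children.append(
            sample_elementary_tree(child_label, beta, pcfg_sample, is_nonterminal))
    return {"label": label, "children": children}

def sample_rule_pair(c_s, c_t, beta, pcfg_sample, is_nonterminal, epsilon="<eps>"):
    """Sample e_s/e_t from P0(. | c_s/c_t)."""
    e_s = sample_elementary_tree(c_s, beta, pcfg_sample, is_nonterminal)
    if c_t == epsilon:
        # deletion rule: every frontier node of e_s is aligned with epsilon
        return e_s, None
    e_t = sample_elementary_tree(c_t, beta, pcfg_sample, is_nonterminal)
    # for synchronous rules one would also sample an alignment between the
    # frontier nodes of e_s and e_t uniformly at random, as described above
    return e_s, e_t
```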

3.2 Posterior inference via Gibbs sampling

Assuming fixed hyperparameters α = {αc} and β = {βc}, our inference problem is to find the posterior distribution of the derivation sequences e = e1, . . . , eN given the observations t = t1, . . . , tN. Applying Bayes’ rule, we have

p(e | t) ∝ p(t | e)p(e) (3)

where p(t | e) is a 0/1 distribution (2) which does not depend on Gc, and p(e) can be obtained by collapsing Gc for all c.

Consider repeatedly generating elementary tree pairs e1, . . . , ei, all with the same root c, i.i.d. from Gc. Integrating over Gc, the ei become dependent. The conditional prior of the i-th elementary tree pair given previously generated ones e<i = e1, . . . , ei−1 is given by

p(ei | e<i) = (nei + αc P0(ei | c)) / (i − 1 + αc)    (4)

where nei denotes the number of times ei occurs in e<i. Since the collapsed model is exchangeable in the ei, this formula forms the backbone of the inference procedure that we describe next. It also makes clear the DP’s inductive bias to reuse elementary tree pairs.
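For illustration, equation (4) amounts to the following count-based computation, a minimal sketch under assumed bookkeeping: `counts` maps (root, rule) pairs to how often the rule occurs in e<i, and `total` maps each root to the number of previous draws with that root.

```python
def conditional_prior(rule, root, counts, total, alpha, p0):
    """p(e_i | e_<i) = (n_{e_i} + alpha_c * P0(e_i | c)) / (i - 1 + alpha_c).
    A single alpha is used here; the paper indexes it by root category."""
    n_rule = counts.get((root, rule), 0)    # times this rule was used before
    n_root = total.get(root, 0)             # i - 1: previous draws rooted in c
    return (n_rule + alpha * p0(rule, root)) / (n_root + alpha)
```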

We use Gibbs sampling (Geman and Geman, 1984), a Markov chain Monte Carlo (MCMC) method, to sample from the posterior (3). A derivation e of the corpus t is completely specified by an alignment between the source nodes and the corresponding target nodes (as well as ε on either side), which we take to be the state of the sampler. We start at a random derivation of the corpus, and at every iteration resample a derivation by amending the current one through local changes made at the node level, in the style of Goldwater et al. (2006).

Our sampling updates are extensions of those used by Cohn and Blunsom (2009) in MT, but are tailored to our task of extractive sentence compression. In our task, no target node can align with ε (which would indicate a subtree insertion), and barring unary branches no source node i can align with two different target nodes j and j′ at the same time (indicating a tree expansion). Rather, the configurations of interest are those in which only source nodes i can align with ε, and two source nodes i and i′ can align with the same target node j. Thus, the alignments of interest are not arbitrary relations, but (partial) functions from nodes in es to nodes in et or ε. We therefore sample in the direction from source to target. In particular, we visit every tree pair and each of its source nodes i, and update its alignment by selecting between and within two choices: (a) unaligned, (b) aligned with some target node j or ε. The number of possibilities j in (b) is significantly limited, firstly by the word alignment (for instance, a source node dominating a deleted subspan cannot be aligned with a target node), and secondly by the current alignment of other nearby aligned source nodes. (See Cohn and Blunsom (2009) for details of matching spans under tree constraints.)2

2 One reviewer was concerned that since we explicitly disallow insertion rules in our sampling procedure, our model that generates such rules wastes probability mass and is therefore “deficient”. However, we regard sampling as a separate step from the data generation process, in which we can formulate more effective algorithms by using our domain knowledge that our data set was created by annotators who were instructed to delete words only. Also, disallowing insertion rules in the base distribution unnecessarily complicates the definition of the model, whereas it is straightforward to define the joint distribution of all (potentially useful) rules and then use domain knowledge to constrain the support of that distribution during inference, as we do here. In fact, it is possible to prove that our approach is equivalent up to a rescaling of the concentration parameters. Since we fit these parameters to the data, our approach is equivalent.


More formally, let eM be the elementary tree pair rooted at the closest aligned ancestor i′ of node i when it is unaligned; and let eA and eB be the elementary tree pairs rooted at i′ and i respectively when i is aligned with some target node j or ε. Then, by exchangeability of the elementary trees sharing the same root label, and using (4), we have

p(unalign) = (neM + αcM P0(eM | cM)) / (ncM + αcM)    (5)

p(align with j) = (neA + αcA P0(eA | cA)) / (ncA + αcA)    (6)
                × (neB + αcB P0(eB | cB)) / (ncB + αcB)    (7)

where the counts ne·, nc· are with respect to the current derivation of the rest of the corpus; except for neB, ncB we also make sure to account for having generated eA. See Figure 2 for an illustration of the sampling updates.

It is important to note that the sampler described can move from any derivation to any other derivation with positive probability (if only, for example, by virtue of fully merging and then resegmenting), which guarantees convergence to the posterior (3). However, some of these transition probabilities can be extremely small due to passing through low-probability states with large elementary trees; in turn, the sampling procedure is prone to local modes. In order to counteract this and to improve mixing we used simulated annealing. The probability mass function (5–7) was raised to the power 1/T with T dropping linearly from T = 5 to T = 0. Furthermore, using a final temperature of zero, we recover a maximum a posteriori (MAP) estimate which we denote eMAP.
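The sketch below illustrates the shape of one such node-level update with annealing, following equations (5)–(7). It is our own illustration, not the authors' implementation: the `state` helpers (`rules_affected_by`, `merged_rule`, `candidate_targets`, `split_rules`, `apply`) are assumed to exist and encapsulate the tree bookkeeping, and `conditional_prior` is the function sketched earlier for equation (4).

```python
import random

def resample_node(node, state, counts, total, alpha, p0, temperature):
    """One Gibbs update for a single source node, annealed by 1/temperature."""
    # remove the rules currently touching `node` from the corpus counts
    for root, rule in state.rules_affected_by(node):
        counts[(root, rule)] -= 1
        total[root] -= 1

    options, weights = [], []

    # (a) leave `node` unaligned: it is absorbed into the merged rule e_M
    root_m, rule_m = state.merged_rule(node)
    options.append(("unalign", None))
    weights.append(conditional_prior(rule_m, root_m, counts, total, alpha, p0))

    # (b) align `node` with a target node j or with epsilon: split into e_A, e_B
    for j in state.candidate_targets(node):     # constrained by word alignments etc.
        (root_a, rule_a), (root_b, rule_b) = state.split_rules(node, j)
        w = conditional_prior(rule_a, root_a, counts, total, alpha, p0)
        # when scoring e_B, account for having already generated e_A
        counts[(root_a, rule_a)] = counts.get((root_a, rule_a), 0) + 1
        total[root_a] = total.get(root_a, 0) + 1
        w *= conditional_prior(rule_b, root_b, counts, total, alpha, p0)
        counts[(root_a, rule_a)] -= 1
        total[root_a] -= 1
        options.append(("align", j))
        weights.append(w)

    if temperature > 0:
        annealed = [w ** (1.0 / temperature) for w in weights]
        choice = random.choices(options, weights=annealed, k=1)[0]
    else:
        # final temperature of zero: take the argmax (MAP-style move)
        choice = options[max(range(len(weights)), key=weights.__getitem__)]

    state.apply(node, choice)                   # rewire the alignment accordingly
    # put back the counts for the rules under the new configuration
    for root, rule in state.rules_affected_by(node):
        counts[(root, rule)] = counts.get((root, rule), 0) + 1
        total[root] = total.get(root, 0) + 1
```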

3.3 Prediction

We discuss the problem of predicting a target tree tN+1,t that corresponds to a source tree tN+1,s unseen in the observed corpus t. The maximum probability tree (MPT) can be found by considering all possible ways to derive it. However, a much simpler alternative is to choose the target tree implied by the maximum probability derivation (MPD), which we define as

e∗ = argmaxe p(e | ts, t)
   = argmaxe Σe p(e | ts, e) p(e | t)

where e denotes a derivation for t = ts/tt (we suppress the N + 1 subscripts for brevity), and the sum ranges over derivations e = e1, . . . , eN of the observed corpus t. We approximate this objective first by substituting δeMAP(e) for p(e | t) and secondly using a finite STSG model for the infinite p(e | ts, eMAP), which we obtain simply by normalizing the rule counts in eMAP. We use dynamic programming for parsing under this finite model (Eisner, 2003).3

Unfortunately, this approach does not ensure that the test instances are parsable, since ts may include unseen structure or novel words. A workaround is to include all zero-count context-free copy rules such as

NP / NP → (NP NP[1] PP[2]) / (NP NP[1] PP[2])
NP / ε → (NP NP[ε] PP[ε]) / ε

in order to smooth our finite model. We used Laplace smoothing (adding 1 to all counts) as it gave us interpretable results.
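A minimal sketch of this smoothing step (our own illustration with invented names): take the rule counts from eMAP, add the zero-count copy rules, and renormalize per root category with add-one smoothing.

```python
from collections import defaultdict

def finite_grammar(map_rule_counts, copy_rules):
    """map_rule_counts: {(root, rule): count} taken from e_MAP;
       copy_rules: iterable of (root, rule) CFG copy/deletion rules for smoothing."""
    counts = defaultdict(int, map_rule_counts)
    for root, rule in copy_rules:
        counts[(root, rule)] += 0               # ensure zero-count rules are present
    totals = defaultdict(float)
    for (root, rule), n in counts.items():
        totals[root] += n + 1                   # add-one (Laplace) smoothing
    weights = {(root, rule): (n + 1) / totals[root]
               for (root, rule), n in counts.items()}
    return weights                              # normalized per root category
```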

4 Evaluation

We compared the Gibbs sampling compressor (GS) against a version of maximum a posteriori EM (with Dirichlet parameter greater than 1) and a discriminative STSG based on SVM training (Cohn and Lapata, 2008) (SVM). EM is a natural benchmark, while SVM is also appropriate since it can be taken as the state of the art for our task.4

We used a publicly available extractive sentence compression corpus: the Broadcast News compressions corpus (BNC) of Clarke and Lapata (2006a). This corpus consists of 1370 sentence pairs that were manually created from transcribed Broadcast News stories. We split the pairs into training, development, and testing sets of 1000, 170, and 200 pairs, respectively. The corpus was parsed using the Stanford parser (Klein and Manning, 2003).

3 We experimented with MPT using Monte Carlo integration over possible derivations; the results were not significantly different from those using MPD.

4 The comparison system described by Cohn and Lapata (2008) attempts to solve a more general problem than ours, abstractive sentence compression. However, given the nature of the data that we provided, it can only learn to compress by deleting words. Since the system is less specialized to the task, their model requires additional heuristics in decoding not needed for extractive compression, which might cause a reduction in performance. Nonetheless, because the comparison system is a generalization of the extractive SVM compressor of Cohn and Lapata (2007), we do not expect that the results would differ qualitatively.


                   SVM     EM      GS
Precision          55.60   58.80   58.94
Recall             53.37   56.58   64.59
Relational F1      54.46   57.67   61.64
Compression rate   59.72   64.11   65.52

Table 1: Precision, recall, relational F1, and compression rate (%) for various systems on the 200-sentence BNC test set. The compression rate for the gold standard was 65.67%.

              SVM     EM      GS      Gold
Grammar       2.75†   2.85∗   3.69    4.25
Importance    2.85    2.67∗   3.41    3.82
Comp. rate    68.18   64.07   67.97   62.34

Table 2: Average grammar and importance scores for various systems on the 20-sentence subsample. Scores marked with ∗ are significantly different than the corresponding GS score at α < .05 and with † at α < .01 according to post-hoc Tukey tests. ANOVA was significant at p < .01 both for grammar and importance.


In our experiments with the publicly available SVM system, we used all except paraphrasal rules extracted from bilingual corpora (Cohn and Lapata, 2008). The model chosen for testing had the parameter for trade-off between training error and margin set to C = 0.001, used margin rescaling, and used Hamming distance over bags of tokens with a brevity penalty as the loss function. EM used a subset of the rules extracted by SVM, namely all rules except non-head-deleting compression rules, and was initialized uniformly. Each EM instance was characterized by two parameters: α, the smoothing parameter for MAP-EM, and δ, the smoothing parameter for augmenting the learned grammar with rules extracted from unseen data (add-(δ − 1) smoothing was used), both of which were fit to the development set using grid-search over (1, 2]. The model chosen for testing was (α, δ) = (1.0001, 1.01).

GS was initialized at a random derivation. We sampled the alignments of the source nodes in random order. The sampler was run for 5000 iterations with annealing. All hyperparameters αc, βc were held constant at α, β for simplicity and were fit using grid-search over α ∈ [10−6, 106], β ∈ [10−3, 0.5]. The model chosen for testing was (α, β) = (100, 0.1).

As an automated metric of quality, we compute F-score based on grammatical relations (relational F1, or RelF1) (Riezler et al., 2003), by which the consistency between the set of predicted grammatical relations and those from the gold standard is measured, which has been shown by Clarke and Lapata (2006b) to correlate reliably with human judgments. We also conducted a small human subjective evaluation of the grammaticality and informativeness of the compressions generated by the various methods.

4.1 Automated evaluation

For all three systems we obtained predictions for the test set and used the Stanford parser to extract grammatical relations from the predicted trees and the gold standard. We computed precision, recall, RelF1 (all based on grammatical relations), and compression rate (percentage of the words that are retained), which we report in Table 1. The results for GS are averages over five independent runs. EM gives a strong baseline since it already uses rules that are limited in depth and number of frontier nodes by stipulation, helping with the overfitting we have mentioned, surprisingly outperforming its discriminative counterpart in both precision and recall (and consequently RelF1). GS, however, maintains the same level of precision as EM while improving recall, bringing an overall improvement in RelF1.
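For concreteness, the following is a small sketch (ours, not the evaluation script used in the paper) of how relational F1 and compression rate can be computed once grammatical relations have been extracted, assuming relations are represented as hashable tuples such as (relation, head, dependent).

```python
def relational_f1(predicted_relations, gold_relations):
    """Precision, recall, and F1 over sets of grammatical relations."""
    predicted, gold = set(predicted_relations), set(gold_relations)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def compression_rate(compressed_tokens, source_tokens):
    """Percentage of the source words retained in the compression."""
    return 100.0 * len(compressed_tokens) / len(source_tokens)
```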

4.2 Human evaluation

We randomly subsampled our 200-sentence test set for 20 sentences to be evaluated by human judges through Amazon Mechanical Turk. We asked 15 self-reported native English speakers for their judgments of GS, EM, and SVM output sentences and the gold standard in terms of grammaticality (how fluent the compression is) and importance (how much of the meaning of and important information from the original sentence is retained) on a scale of 1 (worst) to 5 (best). We report the average scores in Table 2. EM and SVM perform at very similar levels, which we attribute to using the same set of rules, while GS performs at a level substantially better than both, and much closer to human performance in both criteria.


Figure 3: RelF1, precision, and recall plotted against compression rate for GS, EM, and VB.

The human evaluation indicates that the superiority of the Bayesian nonparametric method is underappreciated by the automated evaluation metric.

4.3 Discussion

The fact that GS performs better than EM can be attributed to two reasons: (1) GS uses a sparse prior and selects a compact representation of the data (grammar sizes ranged from 4K–7K for GS compared to a grammar of about 35K rules for EM). (2) GS does not commit to a precomputed grammar and searches over the space of all grammars to find one that best represents the corpus.

It is possible to introduce DP-like sparsity in EM using variational Bayes (VB) training. We experiment with this next in order to understand how dominant the two factors are. The VB algorithm requires a simple update to the M-step formulas for EM where the expected rule counts are normalized, such that instead of updating the rule weight in the t-th iteration as in the following

θc,e(t+1) = (nc,e + α − 1) / (nc,· + Kα − K)

where nc,e represents the expected count of rule c → e, and K is the total number of ways to rewrite c, we now take into account our DP(αc, P0(· | c)) prior in (1), which, when truncated to a finite grammar, reduces to a K-dimensional Dirichlet prior with parameter αcP0(· | c). Thus in VB we perform a variational E-step with the subprobabilities given by

θc,e(t+1) = exp(Ψ(nc,e + αc P0(e | c))) / exp(Ψ(nc,· + αc))

where Ψ denotes the digamma function (Liu and Gildea, 2009); see MacKay (1997) for details. Hyperparameters were handled the same way as for GS.
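The variational E-step update above can be computed as in the following sketch (our own illustration; `expected_counts` and `p0` are assumed inputs, and SciPy's digamma plays the role of Ψ).

```python
import math
from scipy.special import digamma

def vb_subprobabilities(expected_counts, alpha, p0):
    """theta_{c,e} = exp(psi(n_{c,e} + alpha*P0(e|c))) / exp(psi(n_{c,.} + alpha))."""
    totals = {}
    for (root, rule), n in expected_counts.items():
        totals[root] = totals.get(root, 0.0) + n      # n_{c,.} per root category
    theta = {}
    for (root, rule), n in expected_counts.items():
        num = math.exp(digamma(n + alpha * p0(rule, root)))
        den = math.exp(digamma(totals[root] + alpha))
        theta[(root, rule)] = num / den               # subprobability (need not sum to 1)
    return theta
```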

Instead of selecting a single model on the development set, here we provide the whole spectrum of models and their performances in order to better understand their comparative behavior. In Figure 3 we plot RelF1 on the test set versus compression rate and compare GS, EM, and VB (β = 0.1 fixed, (α, δ) ranging in [10−6, 106] × (1, 2]). Overall, we see that GS maintains roughly the same level of precision as EM (despite its larger compression rates) while achieving an improvement in recall, consequently performing at a higher RelF1 level. We note that VB somewhat bridges the gap between GS and EM, without quite reaching GS performance. We conclude that mitigation of both factors (narrowness and overfitting) contributes to the performance gain of GS.5

4.4 Example rules learned

In order to provide some insight into the grammar extracted by GS, we list in Tables 3 and 4 high-probability subtree-deletion rules expanding categories ROOT / ROOT and S / S, respectively.

5 We have also experimented with VB with parametric independent symmetric Dirichlet priors. The results were similar to EM, with the exception of sparse priors resulting in smaller grammars and slightly improving performance.


(ROOT (S CC[ε] NP[1] VP[2] .[3])) / (ROOT (S NP[1] VP[2] .[3]))
(ROOT (S NP[1] ADVP[ε] VP[2] (. .))) / (ROOT (S NP[1] VP[2] (. .)))
(ROOT (S ADVP[ε] (, ,) NP[1] VP[2] (. .))) / (ROOT (S NP[1] VP[2] (. .)))
(ROOT (S PP[ε] (, ,) NP[1] VP[2] (. .))) / (ROOT (S NP[1] VP[2] (. .)))
(ROOT (S PP[ε] ,[ε] NP[1] VP[2] .[3])) / (ROOT (S NP[1] VP[2] .[3]))
(ROOT (S NP[ε] (VP VBP[ε] (SBAR (S NP[1] VP[2]))) .[3])) / (ROOT (S NP[1] VP[2] .[3]))
(ROOT (S ADVP[ε] NP[1] (VP MD[2] VP[3]) .[4])) / (ROOT (S NP[1] (VP MD[2] VP[3]) .[4]))
(ROOT (S (SBAR (IN as) S[ε]) ,[ε] NP[1] VP[2] .[3])) / (ROOT (S NP[1] VP[2] .[3]))
(ROOT (S S[ε] (, ,) CC[ε] (S NP[1] VP[2]) .[3])) / (ROOT (S NP[1] VP[2] .[3]))
(ROOT (S PP[ε] NP[1] VP[2] .[3])) / (ROOT (S NP[1] VP[2] .[3]))
(ROOT (S S[1] (, ,) CC[ε] S[2] (. .))) / (ROOT (S NP[1] VP[2] (. .)))
(ROOT (S S[ε] ,[ε] NP[1] ADVP[2] VP[3] .[4])) / (ROOT (S NP[1] ADVP[2] VP[3] .[4]))
(ROOT (S (NP (NP NNP[ε] (POS ’s)) NNP[1] NNP[2]) (VP (VBZ reports)) .[3])) / (ROOT (S (NP NNP[1] NNP[2]) (VP (VBZ reports)) .[3]))

Table 3: High-probability ROOT / ROOT compression rules from the final state of the sampler.

(S NP[1] ADVP[ε] VP[2]) / (S NP[1] VP[2])
(S INTJ[ε] (, ,) NP[1] VP[2] (. .)) / (S NP[1] VP[2] (. .))
(S (INTJ (UH Well)) ,[ε] NP[1] VP[2] .[3]) / (S NP[1] VP[2] .[3])
(S PP[ε] (, ,) NP[1] VP[2]) / (S NP[1] VP[2])
(S ADVP[ε] (, ,) S[1] (, ,) (CC but) S[2] .[3]) / (S S[1] (, ,) (CC but) S[2] .[3])
(S ADVP[ε] NP[1] VP[2]) / (S NP[1] VP[2])
(S NP[ε] (VP VBP[ε] (SBAR (IN that) (S NP[1] VP[2]))) (. .)) / (S NP[1] VP[2] (. .))
(S NP[ε] (VP VBZ[ε] ADJP[ε] SBAR[1])) / S[1]
(S CC[ε] PP[ε] (, ,) NP[1] VP[2] (. .)) / (S NP[1] VP[2] (. .))
(S NP[ε] (, ,) NP[1] VP[2] .[3]) / (S NP[1] VP[2] .[3])
(S NP[1] (, ,) ADVP[ε] (, ,) VP[2]) / (S NP[1] VP[2])
(S CC[ε] (NP PRP[1]) VP[2]) / (S (NP PRP[1]) VP[2])
(S ADVP[ε] ,[ε] PP[ε] ,[ε] NP[1] VP[2] .[3]) / (S NP[1] VP[2] .[3])
(S ADVP[ε] (, ,) NP[1] VP[2]) / (S NP[1] VP[2])

Table 4: High-probability S / S compression rules from the final state of the sampler.

Of especial interest are deep lexicalized rules such as the last rule in Table 3, a pattern of compression used many times in the BNC in sentence pairs such as “NPR’s Anne Garrels reports” / “Anne Garrels reports”. Such an informative rule with nontrivial collocation (between the possessive marker and the word “reports”) would be hard to extract heuristically and can only be extracted by reasoning across the training examples.

5 Conclusion

We explored nonparametric Bayesian learning of non-isomorphic tree mappings using Dirichlet process priors. We used the task of extractive sentence compression as a testbed to investigate the effects of sparse priors and nonparametric inference over the space of grammars. We showed that, despite its degeneracy, expectation maximization is a strong baseline when given a reasonable grammar. However, Gibbs-sampling–based nonparametric inference achieves improvements against this baseline. Our investigation with variational Bayes showed that the improvement is due both to finding sparse grammars (mitigating overfitting) and to searching over the space of all grammars (mitigating narrowness). Overall, we take these results as being encouraging for STSG induction via Bayesian nonparametrics for monolingual translation tasks. Future work would involve natural extensions such as mixing over the space of word alignments; this would allow application to MT-like tasks where flexible word reordering is allowed, such as abstractive sentence compression and paraphrasing.

References

James Clarke and Mirella Lapata. 2006a. Constraint-based sentence compression: An integer programming approach. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 144–151, Sydney, Australia, July. Association for Computational Linguistics.

James Clarke and Mirella Lapata. 2006b. Models for sentence compression: A comparison across domains, training requirements and evaluation measures. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 377–384, Sydney, Australia, July. Association for Computational Linguistics.

Trevor Cohn and Phil Blunsom. 2009. A Bayesian model of syntax-directed tree to string grammar induction. In EMNLP ’09: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 352–361, Morristown, NJ, USA. Association for Computational Linguistics.

Trevor Cohn and Mirella Lapata. 2007. Large margin synchronous generation and its application to sentence compression. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and on Computational Natural Language Learning, pages 73–82, Prague. Association for Computational Linguistics.

Trevor Cohn and Mirella Lapata. 2008. Sentence compression beyond word deletion. In COLING ’08: Proceedings of the 22nd International Conference on Computational Linguistics, pages 137–144, Manchester, United Kingdom. Association for Computational Linguistics.

Trevor Cohn, Sharon Goldwater, and Phil Blunsom. 2009. Inducing compact but accurate tree-substitution grammars. In NAACL ’09: Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 548–556, Morristown, NJ, USA. Association for Computational Linguistics.

A. Dempster, N. Laird, and D. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39 (Series B):1–38.

John DeNero, Dan Gillick, James Zhang, and Dan Klein. 2006. Why generative phrase models underperform surface heuristics. In StatMT ’06: Proceedings of the Workshop on Statistical Machine Translation, pages 31–38, Morristown, NJ, USA. Association for Computational Linguistics.

John DeNero, Alexandre Bouchard-Cote, and Dan Klein. 2008. Sampling alignment structure under a Bayesian translation model. In EMNLP ’08: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 314–323, Morristown, NJ, USA. Association for Computational Linguistics.

Jason Eisner. 2003. Learning non-isomorphic tree mappings for machine translation. In ACL ’03: Proceedings of the 41st Annual Meeting on Association for Computational Linguistics, pages 205–208, Morristown, NJ, USA. Association for Computational Linguistics.

Michel Galley and Kathleen McKeown. 2007. Lexicalized Markov grammars for sentence compression. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pages 180–187, Rochester, New York, April. Association for Computational Linguistics.

Michel Galley, Mark Hopkins, Kevin Knight, and Daniel Marcu. 2004. What’s in a translation rule? In Daniel Marcu, Susan Dumais, and Salim Roukos, editors, HLT-NAACL 2004: Main Proceedings, pages 273–280, Boston, Massachusetts, USA, May 2 - May 7. Association for Computational Linguistics.

S. Geman and D. Geman. 1984. Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:721–741.

Sharon Goldwater, Thomas L. Griffiths, and Mark Johnson. 2006. Contextual dependencies in unsupervised word segmentation. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 673–680, Sydney, Australia, July. Association for Computational Linguistics.

Hongyan Jing. 2000. Sentence reduction for automatic text summarization. In Proceedings of the Sixth Conference on Applied Natural Language Processing, pages 310–315, Morristown, NJ, USA. Association for Computational Linguistics.

Dan Klein and Christopher D. Manning. 2003. Fast exact inference with a factored model for natural language parsing. In Advances in Neural Information Processing Systems 15 (NIPS), pages 3–10. MIT Press.

Kevin Knight and Daniel Marcu. 2002. Summarization beyond sentence extraction: a probabilistic approach to sentence compression. Artificial Intelligence, 139(1):91–107.

K. Lari and S. J. Young. 1990. The estimation of stochastic context-free grammars using the Inside-Outside algorithm. Computer Speech and Language, 4:35–56.

Ding Liu and Daniel Gildea. 2009. Bayesian learning of phrasal tree-to-string templates. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 1308–1317, Singapore, August. Association for Computational Linguistics.


David J.C. MacKay. 1997. Ensemble learning for hidden Markov models. Technical report, Cavendish Laboratory, Cambridge, UK.

Franz Josef Och and Hermann Ney. 2004. The alignment template approach to statistical machine translation. Computational Linguistics, 30(4):417–449.

Matt Post and Daniel Gildea. 2009. Bayesian learning of a tree substitution grammar. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages 45–48, Suntec, Singapore, August. Association for Computational Linguistics.

Stefan Riezler, Tracy H. King, Richard Crouch, and Annie Zaenen. 2003. Statistical sentence condensation using ambiguity packing and stochastic disambiguation methods for lexical-functional grammar. In NAACL ’03: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, pages 118–125, Morristown, NJ, USA. Association for Computational Linguistics.

Jenine Turner and Eugene Charniak. 2005. Supervised and unsupervised learning for sentence compression. In ACL ’05: Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 290–297, Morristown, NJ, USA. Association for Computational Linguistics.

Elif Yamangil and Rani Nelken. 2008. Mining Wikipedia revision histories for improving sentence compression. In Proceedings of ACL-08: HLT, Short Papers, pages 137–140, Columbus, Ohio, June. Association for Computational Linguistics.
