



Proceedings of the 5th International Joint Conference on Natural Language Processing, pages 474–482, Chiang Mai, Thailand, November 8 – 13, 2011. © 2011 AFNLP

TriS: A Statistical Sentence Simplifier with Log-linear Models and Margin-based Discriminative Training

Nguyen Bach, Qin Gao, Stephan Vogel and Alex Waibel
Language Technologies Institute

Carnegie Mellon University
5000 Forbes Avenue, Pittsburgh PA, 15213

{nbach, qing, stephan.vogel, ahw}@cs.cmu.edu

Abstract

We propose a statistical sentence simplification system with log-linear models. In contrast to state-of-the-art methods that drive the sentence simplification process by hand-written linguistic rules, our method uses a margin-based discriminative learning algorithm that operates on a feature set. The feature set is defined on statistics of the surface form as well as the syntactic and dependency structures of the sentences. A stack decoding algorithm is used, which allows us to efficiently generate and search simplification hypotheses. Experimental results show that the simplified text produced by the proposed system reduces the Flesch-Kincaid grade level by 1.7 compared with the original text. We will show that a comparison of a state-of-the-art rule-based system (Heilman and Smith, 2010) to the proposed system demonstrates an improvement of 0.2, 0.6, and 4.5 points in ROUGE-2, ROUGE-4, and AveF_10, respectively.

1 Introduction

Complicated sentences impose difficulties on reading comprehension. For instance, a person in 5th grade can comprehend a comic book easily but will struggle to understand New York Times articles, which require at least a 12th grade average reading level (Flesch, 1981). Complicated sentences also challenge natural language processing applications including, but not limited to, text summarization, question answering, information extraction, and machine translation (Chandrasekar et al., 1996). An example of this is syntactic parsing, in which long and complicated sentences generate a large number of hypotheses and usually fail in disambiguating the attachments. Therefore, it is desirable to pre-process complicated sentences and generate simpler counterparts. There are direct applications of sentence simplification. Daelemans et al. (2004) applied sentence simplification so that automatically generated closed captions can fit into a limited display area. The Facilita system generates accessible content from Brazilian Portuguese web pages for low-literacy readers using both summarization and simplification technologies (Watanabe et al., 2009).

This paper tackles sentence-level factual simplification (SLFS). The objective of SLFS is twofold. First, SLFS processes syntactically complicated sentences. Second, while preserving the content meaning, SLFS outputs a sequence of simple sentences. SLFS is an instance of the broader spectrum of text-to-text generation problems, which includes summarization, sentence compression, paraphrasing, and sentence fusion. Compared to sentence compression, sentence simplification requires the conversion to be lossless in the sense of semantics. It also differs from paraphrasing in that it generates multiple sentences instead of one sentence with a different construction.

There are certain specific characteristics that complicate a sentence, which include length, syntactic structure, syntactic and lexical ambiguity, and an abundance of complex words. As suggested by its objective, sentence simplification outputs “simple sentences”. Intuitively, a simple sentence is easy to read and understand, and arguably easily processed by computers. A more fine-tuned definition of a simple sentence is suggested in Klebanov et al. (2004), and is termed Easy Access Sentences (EAS). An EAS in English is defined as follows: 1) an EAS is a grammatical sentence; 2) an EAS has one finite verb; 3) an EAS does not make any claims that were not present, explicitly or implicitly; 4) an EAS should contain as many named entities as possible.

While the last two requirements are difficult to quantify, the first two provide a practical guideline for sentence simplification. In this paper, we treat the sentence simplification process as a process of statistical machine translation. Given a syntactically complicated input sentence, we translate it into a set of EAS that preserves as much information as possible from the original sentence. We develop an algorithm that can generate a set of EAS from the original sentence and a model to incorporate features that indicate the merit of the simplified candidates. The model is discriminatively trained on a data set of manually simplified sentences.

We briefly review related work in the area of text-to-text generation in Section 2. The proposed model for statistical sentence simplification is presented in Section 3. In Section 4 we introduce the decoding algorithm. Sections 5 and 6 describe the discriminative training method we use and the feature functions. Experiments and analysis are presented in Section 7, followed by the conclusion in Section 8.

2 Related Work

Given the nature of text-to-text generation, which takes a sentence or a document as input and optimizes the output toward a certain objective, we briefly review state-of-the-art text-to-text generation methods.

Early approaches to summarization focus on extraction methods, which try to isolate and then summarize the most significant sentences or paragraphs of the text. However, this has been found to be insufficient because it usually generates incoherent summaries. Barzilay and McKeown (2005) proposed sentence fusion for multi-document summarization, which produces a sentence that conveys the common information of multiple sentences based upon dependency tree structures and lexical similarity.

Sentence compression generates a summary of a single sentence with minimal information loss, which can also be treated as sentence-level summarization. This approach applies word deletion, in which non-informative words are removed from the original sentence. A variety of models were developed based on this perspective, ranging from generative models (Knight and Marcu, 2002; Turner and Charniak, 2005) to discriminative models (McDonald, 2006) and Integer Linear Programming (Clarke, 2008). Another line of research treats sentence compression as machine translation, in which tree-based translation models have been developed (Galley and McKeown, 2007; Cohn and Lapata, 2008; Zhu et al., 2010). Recently, Woodsend and Lapata (2011) proposed a framework to combine tree-based simplification with ILP.

In contrast to sentence compression, sentence simplification generates multiple sentences from one input sentence and tries to preserve the meaning of the original sentence. The major objective is to transform sentences with complicated structures into a set of easy-to-read sentences, which will be easier for humans to comprehend, and hopefully easier for computers to deal with.

Numerous attempts have been made to tackle the sentence simplification problem. One line of research has explored simplification with linguistic rules. Jonnalagadda (2006) developed a rule-based system that takes into account discourse information. This method has been applied to the simplification of biomedical text (Jonnalagadda et al., 2009) and protein-protein information extraction (Jonnalagadda and Gonzalez, 2010). Chandrasekar and Srinivas (1997) automatically induced simplification rules based on dependency trees. Additionally, Klebanov et al. (2004) developed a set of rules that generate a set of EAS from syntactically complicated sentences. Heilman and Smith (2010) proposed an algorithm for extracting simplified declarative sentences from syntactically complex sentences.

Rule-based systems perform well on English. However, in order to develop a more generic framework for other languages, a statistical framework is preferable. In this work, we follow this direction and treat the whole process as a statistical machine translation task with an online large-margin learning framework. The method is generalizable to other languages given labeled data. To ensure that information is preserved, we build a table of EAS for each object, and use stack decoding to search for the optimal combination of EAS. A feature vector is assigned to each combination, and we use an end-to-end discriminative training framework to tune the parameters given a set of training data. Our method differs from Klebanov et al. (2004) in that we apply a statistical model to rank the generated sentences. The difference between our method and Heilman and Smith (2010) is that we integrate linguistic rules into the decoding process as soft constraints in order to explore a much larger search space.

3 Statistical Sentence Simplification with Log-linear Models

Assume that we are given an English sentence e, which is to be simplified into a set S of k simple sentences {s_1, ..., s_i, ..., s_k}. Among all possible simplified sets, we select the set with the highest probability, S(e) = argmax_S Pr(S|e). As the true probability distribution Pr(S|e) is unknown, we approximate Pr(S|e) with a log-linear model p(S|e). In contrast to noisy-channel models (Knight and Marcu, 2002; Turner and Charniak, 2005), we directly compute the simplification probability with a conditional exponential model as follows:

p(S|e) = exp[∑_{m=1}^{M} w_m f_m(S, e)] / ∑_{S'} exp[∑_{m=1}^{M} w_m f_m(S', e)]    (1)

where f_m(S, e), m = 1, ..., M are feature functions defined on each sentence, and w_m are the corresponding feature weights to be learned.
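As a concrete illustration of Equation (1), the following minimal Python sketch scores candidate simplification sets with a conditional exponential model. The function name, toy feature vectors, and weights are our own illustrative choices, not the paper's implementation.

```python
import math

def loglinear_prob(features, weights, candidates):
    """Conditional exponential model of Equation (1):
    p(S|e) = exp(w . f(S, e)) / sum over S' of exp(w . f(S', e)).
    `candidates` holds the feature vectors f(S', e) of every
    candidate set used in the normalization term."""
    def score(f):
        return sum(w * v for w, v in zip(weights, f))
    z = sum(math.exp(score(f)) for f in candidates)  # partition function
    return math.exp(score(features)) / z

# toy example: two candidate simplification sets, two features each
weights = [0.5, -1.0]
cands = [[2.0, 1.0], [1.0, 3.0]]
p = [loglinear_prob(f, weights, cands) for f in cands]
```

By construction the probabilities over all candidates sum to one, and the candidate whose weighted feature sum is larger receives the larger probability.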

In this framework, we need to solve decoding, learning, and modeling problems. The decoding problem, also known as the search problem, is denoted by the argmax operation, which finds the optimal S that maximizes the model probability. The learning problem amounts to obtaining suitable parameter values w_1^M subject to a loss function on training samples. Finally, the modeling problem amounts to developing suitable feature functions that capture the relevant properties of the sentence simplification task. Our sentence simplification model can be viewed as an English-to-English log-linear translation model. The defining characteristic that makes the problem difficult is that we need to translate one syntactically complicated sentence into k simple sentences, where k is not predetermined.



[Figure 1: Constructing simple sentences. The sentence "John comes from England , works for IMF , and is an active hiker ." is chunked into NPs and VPs; the candidate subjects and objects are {John, England, IMF, an active hiker} and the candidate verbs are {comes from, works for, is}.]

4 Decoding

This section presents a solution to the decoding problem. The solution is based on a stack decoding algorithm that finds the best S given an English sentence e. Our decoding algorithm is inspired by the decoding algorithms in speech recognition and machine translation (Jelinek, 1998; Koehn et al., 2007). For example, given a sentence e “John comes from England, works for IMF, and is an active hiker”, the stack decoding algorithm tries to find S, which is a set of three sentences: “John comes from England”, “John works for IMF” and “John is an active hiker”. Note that S is a set of k simple sentences S = {s_1, ..., s_i, ..., s_k}. We can assume the items s_i are drawn from a finite set S of grammatical sentences that can be derived from e. Therefore, the first step is to construct the set S.

4.1 Constructing simple sentences

We define a simple English sentence as a sentence with SVO structure, which has one subject, one verb, and one object. Our definition is similar to the definition of EAS mentioned in Section 1; however, we only focus on the SVO structure and relax the other constraints. We assume both subjects (S) and objects (O) are noun phrases (NP) in the parse tree. For a given English sentence e, we extract a list S_NP of NPs and a list S_V of verbs. S_NP has an additional empty NP in order to handle intransitive verbs. A straightforward way to construct simple sentences is to enumerate all possible sentences based on S_NP and S_V. This results in |S_NP|²·|S_V| simple sentences.

Figure 1 illustrates the construction for “John comes from England, works for IMF, and is an active hiker”. The system extracts a noun phrase list S_NP = {John, England, IMF, an active hiker} and a verb list S_V = {comes from, works for, is}. Our model constructs simple sentences such as “John comes from England”, “John comes from IMF” and “John comes from an active hiker”. The total number of simple sentences, |S|, is 48.
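The enumeration step above can be sketched as follows. The helper name is our own, and for simplicity the sketch omits the paper's additional empty object NP for intransitive verbs, so the toy example reproduces the |S_NP|² · |S_V| = 48 count from the example.

```python
from itertools import product

def enumerate_simple_sentences(noun_phrases, verbs):
    """Enumerate all |S_NP|^2 * |S_V| candidate SVO sentences
    (a sketch of Section 4.1; the empty object NP used for
    intransitive verbs is omitted here)."""
    return [f"{s} {v} {o}"
            for s, v, o in product(noun_phrases, verbs, noun_phrases)]

nps = ["John", "England", "IMF", "an active hiker"]
verbs = ["comes from", "works for", "is"]
sents = enumerate_simple_sentences(nps, verbs)
```

Most of these candidates are of course semantically wrong ("England works for John"); ranking them is exactly the job of the features and weights described later.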

4.2 Decoding algorithm

Given a list of simple sentences S, the decoder's objective is to construct and find the best simplification candidate S ⊂ S. We call S a hypothesis in the context of the decoder. The rationale behind our left-right stack decoding algorithm is to construct a hypothesis that covers all noun phrases and verb phrases of the original sentence.

[Figure 2: Left-right decoding by objects. Candidate simple sentences such as "John comes from England", "IMF comes from England", "John works for IMF", and "John is an active hiker" are grouped by their object noun phrase ("England", "IMF", "an active hiker"); the optimal path through these groups is marked with bold arrows.]

The decoding task is to find the optimal solution over all possible combinations of simple sentences, given the feature values and learned feature weights. Depending on the number of simple sentences per hypothesis, the search space grows exponentially. Since each simple sentence contains an object, we can group the candidate sentences by the object noun phrase. As a result, it is not necessary to evaluate all combinations; we can look at one object at a time from left to right. Figure 2 demonstrates the idea of decoding via objects. We have three objects: “England”, “IMF” and “an active hiker”. The algorithm first finds potential simple sentences which have “England” as object. After finishing “England”, the algorithm expands to “IMF” and “an active hiker”. In this example the optimal path is the path with bold arrows.

Algorithm 1: K-Best Stack Decoding

1: Initialize an empty hypothesis list HypList
2: Initialize HYPS as a stack of 1-simple-sentence hypotheses
3: for i = 0 to |S_V| do
4:   Initialize stack expandh
5:   while HYPS is not empty do
6:     pop h from HYPS
7:     expandh ← Expand-Hypothesis(h)
8:   end while
9:   expandh ← Prune-Hypothesis(expandh, stack-size)
10:  HYPS ← expandh
11:  Store hypotheses of expandh into HypList
12: end for
13: SortedHypList ← Sort-Hypothesis(HypList)
14: Return K-best hypotheses in SortedHypList

Algorithm 1 is a version of stack decoding for sentence simplification. The decoding process advances by extending a state that is equivalent to a stack of hypotheses. Lines 1 and 2 initialize the HYPS stack and HypList. The HYPS stack maintains the current search state, while HypList stores potential hypotheses after each state. HYPS is initialized with hypotheses containing one simple sentence. Line 3 starts a loop over states. The maximum number of states is equal to the size of S_V plus one. Lines 4-8 represent the hypothesis expansion.

[Figure 3: A visualization of stack decoding. (a) Pop and Expand: hypotheses are popped from a HYPS stack of 1-simple-sentence hypotheses and expanded into the stack expandh. (b) Hypothesis pruning: expandh is pruned and becomes the new HYPS stack of 2-simple-sentence hypotheses.]

Figure 3a illustrates the pop-expand process of the HYPS stack with 1-simple-sentence hypotheses. The expansion in this situation produces a 2-simple-sentence hypothesis stack expandh. The size of expandh increases exponentially with the sizes of S_V and S_NP. Therefore, we prefer to keep expandh within a limited number (stack-size) of hypotheses. Line 9 helps the decoder control the size of expandh by applying different pruning strategies: word coverage, model score, or both. Figure 3b illustrates the pruning process on expandh with 2-simple-sentence hypotheses. Line 10 replaces the current state with a new state of the expanded hypotheses. Before moving to a new state, HypList is used to preserve potential hypotheses of the current state. Line 13 sorts hypotheses in HypList according to their model scores, and a K-best list is returned in line 14.
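The expand-prune loop described above can be sketched roughly as follows. This is a heavily simplified, assumed reading of Algorithm 1: it expands every hypothesis by one candidate sentence per state, prunes by model score only, and uses a toy coverage-based scorer in place of the learned log-linear model.

```python
def stack_decode(simple_sents, score, n_states, stack_size=10, k=5):
    """Minimal sketch of Algorithm 1 (names are ours, not the paper's).
    A hypothesis is a tuple of simple sentences; each state extends
    every hypothesis with one more candidate sentence, then prunes
    the expanded stack to `stack_size` by model score."""
    hyps = [(s,) for s in simple_sents]          # 1-sentence hypotheses
    hyp_list = list(hyps)
    for _ in range(n_states):                    # the paper loops |S_V| + 1 states
        expanded = [h + (s,) for h in hyps
                    for s in simple_sents if s not in h]
        expanded.sort(key=score, reverse=True)
        hyps = expanded[:stack_size]             # pruning step (line 9)
        hyp_list.extend(hyps)                    # store state's hypotheses
    hyp_list.sort(key=score, reverse=True)
    return hyp_list[:k]

# toy scorer: prefer hypotheses covering more distinct words
score = lambda h: len(set(" ".join(h).split()))
best = stack_decode(["John comes from England", "John works for IMF"],
                    score, n_states=1, k=2)
```

With this coverage scorer, the top hypotheses are the two-sentence combinations, since they cover more of the original words than any single sentence.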

5 Learning

Having defined the log-linear sentence simplification model and the decoding algorithm, this section describes a discriminative learning algorithm for the learning problem. We learn an optimized weight vector w using the Margin Infused Relaxed Algorithm, or MIRA (Crammer and Singer, 2003), which is an online learner closely related to both the support vector machine and perceptron learning frameworks. In general, weights are updated at each time step i according to:

w_{i+1} = argmin_{w_{i+1}} ||w_{i+1} − w_i||
s.t. score(S, e) ≥ score(S', e) + L(S, S')    (2)

where L(S, S') is a measure of the loss of using S' instead of the simplification reference S, and score() is a cost function of e and S, in this case the decoder score.

Algorithm 2: MIRA training for Sentence Simplifier

Input: a training set τ = {f_t, e_t}, t = 1, ..., T, of T original English sentences with feature vectors f_t of e_t; the simplification reference set ε; an m-oracle set O = {}; the current weight vector w_i.

1: i = 0
2: for j = 1 to Q do
3:   for t = 1 to T do
4:     H ← get-K-Best(S_t; w_i)
5:     O ← get-m-Oracle(H; ε_t)
6:     γ = ∑_{o=1}^{m} ∑_{h=1}^{K} α(e_o, e_h; ε_t)(f_{e_o} − f_{e_h})
7:     w_{i+1} = w_i + γ
8:     i = i + 1
9:   end for
10: end for
11: Return (∑_{i=1}^{Q·T} w_i) / (Q·T)
Algorithm 2 is a version of MIRA for training the weights of our sentence simplification model. On each iteration, MIRA considers a single instance from the training set (S_t, e_t) and updates the weights so that the score of the correct simplification ε_t is greater than the score of all other simplifications by a margin proportional to their loss. However, for a given sentence there is an exponential number of possible simplification candidates. Therefore, the optimizer has to deal with an exponentially large number of constraints. To tackle this, we only consider the K-best hypotheses and choose m oracle hypotheses to support the weight update decision. This idea is similar to the way MIRA has been used in dependency parsing and machine translation (McDonald et al., 2005; Liang et al., 2006; Watanabe et al., 2007).

On each update, MIRA attempts to keep the new weight vector as close as possible to the old weight vector, subject to margin constraints that keep the score of the correct output above the score of the guessed output by an amount given by the loss of the incorrect output. In line 6, α can be interpreted as an update step size; when α is a large number we update our weights aggressively, otherwise weights are updated conservatively. α is computed as follows:

α = max(0, δ)
δ = min{ C, [L(e_o, e_h; ε_t) − (score(e_o) − score(e_h))] / ||f_{e_o} − f_{e_h}||_2^2 }    (3)

where C is a positive constant used to cap the maximum possible value of α; score() is the decoder score; and L(e_o, e_h; ε_t) is the loss function. L(e_o, e_h; ε_t) measures the difference between an oracle e_o and a hypothesis e_h according to the gold reference ε_t. L is crucial for guiding the optimizer to learn optimized weights. We define L(e_o, e_h; ε_t) as follows:

L(e_o, e_h; ε_t) = AveF_N(e_o, ε_t) − AveF_N(e_h, ε_t)    (4)

where AveF_N(e_o, ε_t) and AveF_N(e_h, ε_t) are the average n-gram (n = 2, ..., N) co-occurrence F-scores of (e_o, ε_t) and (e_h, ε_t), respectively. In this case, we optimize the weights directly against the AveF_N metric over the training data. AveF_N can be substituted with other evaluation metrics such as the ROUGE family of metrics (Lin, 2004). Similar to the perceptron method, the actual weight vector used during decoding is averaged across the number of iterations and training instances; it is computed in line 11.
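A single MIRA update for one (oracle, hypothesis) pair, following Equation (3), might look like the following sketch; the function signature and the toy numbers are ours, not the paper's code.

```python
def mira_step(w, f_oracle, f_hyp, loss, score, C=0.01):
    """One MIRA weight update for an (oracle, hypothesis) pair,
    a sketch of Equation (3). `score(f)` is the decoder score w . f;
    `loss` is L(e_o, e_h) = AveF_N(oracle) - AveF_N(hypothesis)."""
    diff = [a - b for a, b in zip(f_oracle, f_hyp)]
    norm_sq = sum(d * d for d in diff)           # ||f_o - f_h||_2^2
    if norm_sq == 0:
        return w                                 # identical features: no update
    margin = score(f_oracle) - score(f_hyp)
    delta = min(C, (loss - margin) / norm_sq)
    alpha = max(0.0, delta)                      # the update step size
    return [wi + alpha * d for wi, d in zip(w, diff)]

w = [0.0, 0.0]
score = lambda f: sum(wi * fi for wi, fi in zip(w, f))
w2 = mira_step(w, [1.0, 0.0], [0.0, 1.0], loss=0.5, score=score, C=1.0)
# margin is 0, so delta = min(1, 0.5 / 2) = 0.25
```

The new weights move toward the oracle's features and away from the hypothesis's, by exactly the step size α that Equation (3) prescribes.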

6 Modeling

We now turn to the modeling problem. Our fundamental question is: given the model in Equation 1 with M feature functions, what linguistic features can be leveraged to capture the semantic information of the original sentence? We address this question in this section by describing features that cover different levels of linguistic structure. Our model incorporates 177 features based on information from the original English sentence e, including chunks and syntactic and dependency parse trees (Ramshaw and Marcus, 1995; de Marneffe et al., 2006).

6.1 Simple sentence level features

A simplification hypothesis S contains k simple sentences. Therefore, it is crucial that our model choose reasonable simple sentences to form a hypothesis. For each simple sentence s_i we incorporate the following feature functions:

Word Count These features count the number of words in the subject (S), verb (V), and object (O), as well as the number of proper nouns in S and the number of proper nouns in O.

Distance between NPs and Verbs These features focus on the number of NPs and VPs between S, V, and O. This feature group includes the number of NPs between S and V, the number of NPs between V and O, the number of VPs between S and V, and the number of VPs between V and O.

Dependency Structures It is possible that the decoder constructs semantically incorrect simple sentences, in which S, V, and O do not have any semantic connection. One way to reduce this kind of mistake is to analyze the dependency chains between S, V, and O on the original dependency tree of e. Our dependency structure features include the minimum and maximum distances of (S:O), (S:V), and (V:O).

Syntactic Structures Another source of information is the syntactic parse tree of e, which can be used to extract syntactic features. The sentence-like boundary feature considers the path from S to O along the syntactic parse tree to see whether it crosses a sentence-like boundary (e.g. a relative clause). For example, in the original sentence “John comes from England and works for IMF which stands for International Monetary Funds”, the simple sentence “IMF stands for International Monetary Funds” triggers the sentence-like boundary feature, since the path from “IMF” to “International Monetary Funds” in the syntactic tree of the original sentence contains an SBAR node.

Another feature is the PP attachment feature, which checks whether O contains a prepositional phrase attachment. Moreover, the single pronoun feature checks whether S and O are single pronouns. The last feature is VO common ancestor, which looks at the syntactic tree to see whether V and O share the same VP tag as a common ancestor.

6.2 Interactive simple sentence features

A collection of grammatically sound simplified sentences does not necessarily make a good hypothesis. Dropped words, unnecessary repetition, or even wrong order can make the hypothesis unreadable. Therefore, our model needs to be equipped with features that are capable of measuring the interaction across simple sentences and are also able to represent S in the best possible manner. We incorporate the following features into our model:

Sentence Count This group of features considers the number of sentences in the hypothesis. It consists of an integral feature of the sentence count sc_i = |S|, and a group of binary features sc_bk = δ(|S| = k), where k ∈ [1, 6] is the number of sentences.

NP and Verb Coverage The decoder's objective is to improve the chance of generating hypotheses that cover all NPs and verbs of the original sentence e. These features count the number of NPs and verbs that have been covered by the hypothesis, and by the 1st and 2nd simple sentences. Similarly, these features also count the number of missing NPs and verbs.

S and O cross sentences These features count how many times the S of the 1st simple sentence is repeated as the S of the 2nd simple sentence in a hypothesis. They also count the number of times the O of the 1st sentence is the S of the 2nd sentence.

Readability This group of features computes statistics related to readability. It includes the Flesch, Gunning-Fog, SMOG, Flesch-Kincaid, and automatic readability index scores, and the average of all scores (Flesch, 1948; Gunning, 1968; McLaughlin, 1969; Kincaid et al., 1975). We also compute the edit distance of the hypothesis against the original sentence, and the average number of words per simple sentence.

Typed Dependency At the simple sentence level we examine the dependency chains of S, V, and O, while at the hypothesis level we analyze the typed dependencies between words. Our model has 46 typed dependencies, which are represented by 92 count features for the 1st and 2nd simple sentences.

7 Experiments and Analysis

7.1 Data

To enable the study of sentence simplification with our statistical models, we searched for parallel corpora in which the source is an original English sentence and the target is its simplification reference. For example, the source is “Lu is married to Lian Hsiang , who is also a vajra master , and is referred as Grand Madam Lu”. The simplification reference contains 3 simple sentences: “Lu is married to Lian Hsiang”; “Lian Hsiang is also a vajra master”; “Lu is referred as Grand Madam Lu”. To the best of our knowledge, no such corpora are publicly available under these conditions¹.

Our first attempt was to collect data automatically from original English and Simple English Wikipedia, based on the suggestions of Napoles and Dredze (2010). However, we found that the collected corpus was unsuitable for our model. For example, consider the original sentence “Hawking was the Lucasian Professor of Mathematics at the University of Cambridge for thirty years, taking up the post in 1979 and retiring on 1 October 2009”. The Simple Wikipedia reads “Hawking was a professor of mathematics at the University of Cambridge (a position that Isaac Newton once had)” and “He retired on October 1st 2009”. The problems here are that “(a position that Isaac Newton once had)” does not appear in the original text, and the pronoun “He” would require our model to perform anaphora resolution, which is out of the scope of this work.

We finally decided to collect a set of sentences for which we obtained one manual simplification per sentence. The corpus contains 854 sentences, of which 25% are from the New York Times and 75% are from Wikipedia. The average sentence length is 30.5 words. We reserved 100 sentences for the unseen test set; the rest serves as the development set and training data. The annotators were given instructions that explained the task and defined sentence simplification with the aid of examples. They were encouraged not to introduce new words and to simplify by restructuring the original sentence. They were asked to simplify while preserving all important information and ensuring that the simplified sentences remained grammatically correct.² Some examples from our corpus are given below:

¹ We are aware of the data sets of (Cohn and Lapata, 2008; Zhu et al., 2010); however, they are more suitable for sentence compression than for our task.

Original: “His name literally means Peach Taro ; as Taro is a common Japanese boy ’s name , it is often translated as Peach Boy .”
Simplification: “His name literally means Peach Taro” ; “Taro is a common Japanese boy ’s name” ; “Taro is often translated as Peach Boy”

Original: “These rankings are likely to change thanks to one player , Nokia , which has seen its market share shrink in the United States .”
Simplification: “These rankings are likely to change thanks to one player , Nokia” ; “Nokia has seen its market share shrink in the United States”

7.2 Evaluation methods

Evaluating sentence simplification is a difficult problem. One possible way to overcome this is to use readability tests, such as Flesch, Gunning-Fog, SMOG, and Flesch-Kincaid (Flesch, 1948; Gunning, 1968; McLaughlin, 1969; Kincaid et al., 1975). In this work, we use the Flesch-Kincaid grade level, which can be interpreted as the number of years of education generally required to understand a text.
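The Flesch-Kincaid grade level combines average sentence length and average syllables per word: 0.39 · (words/sentences) + 11.8 · (syllables/words) − 15.59. A minimal sketch, using a crude vowel-group syllable counter (an approximation of our own; production readability tools use more careful syllabification):

```python
def count_syllables(word):
    """Rough syllable estimate: count maximal runs of vowels."""
    vowels = "aeiouy"
    groups, prev_vowel = 0, False
    for ch in word.lower():
        is_vowel = ch in vowels
        if is_vowel and not prev_vowel:
            groups += 1
        prev_vowel = is_vowel
    return max(groups, 1)

def flesch_kincaid_grade(sentences):
    """sentences: list of token lists, one per sentence."""
    n_sents = len(sentences)
    words = [w for s in sentences for w in s]
    n_words = len(words)
    n_syll = sum(count_syllables(w) for w in words)
    return 0.39 * n_words / n_sents + 11.8 * n_syll / n_words - 15.59
```

Very simple text yields a low (even negative) grade; long sentences with polysyllabic words push the grade up.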

Furthermore, automatic evaluation of summaries has also been explored recently. The work of Lin (2004) on the ROUGE family of metrics is perhaps the best-known study of automatic summarization evaluation. Other methods have been proposed, such as Pyramid (Nenkova et al., 2007). Recently, Aluisio et al. (2010) proposed readability assessment for sentence simplification.

Our models are optimized toward AveF10, the average F-score of n-gram concurrence between hypothesis and reference, with n ranging from 2 to 10. Besides AveF10, we report automatic evaluation scores on the unseen test set in Flesch-Kincaid grade level, ROUGE-2, and ROUGE-4. When we evaluate on a test set, scores are reported as averages per sentence.
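The paper defines AveF10 only at this level of detail. Under the assumption that it is the mean F1 of clipped n-gram matches for n = 2..10, a sketch:

```python
from collections import Counter

def ngram_fscore(hyp, ref, n):
    """F1 of clipped n-gram overlap between two token lists."""
    h = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    r = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    overlap = sum((h & r).values())   # clipped match counts
    if not h or not r or overlap == 0:
        return 0.0
    precision = overlap / sum(h.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def ave_f10(hyp, ref):
    """Average n-gram F-score for n = 2..10, scaled to percent."""
    return 100 * sum(ngram_fscore(hyp, ref, n) for n in range(2, 11)) / 9
```

An exact match scores 100, and a hypothesis sharing no bigram or longer n-gram with the reference scores 0.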

7.3 Model behaviors

How well does our system learn from the labeled corpus? To answer this question we investigate the interactions of model and decoder hyperparameters over the training data. We performed controlled experiments on the stack-size, K-best, C, and m-oracle parameters. For each parameter, all other model and decoder values are fixed, and the only change is in the value of the parameter of interest. Figure 4 illustrates these experiments over the training data during 15 MIRA training iterations with the AveF10 metric. The weight vector w is initialized randomly.

² Our corpus will be made publicly available to other researchers.



[Figure 4 (plots omitted): Performance of the sentence simplifier on training data over 15 iterations when optimized toward the AveF10 metric, under various conditions. Panels: (a) stack-size (50–500), (b) K-best (100–500), (c) constant C (0.04–0.3), (d) m-oracle (1–5); each panel plots AveF10 against the 15 training iterations.]

In Figure 4a, we experimented with 5 different values from 100 to 500 hypotheses per stack. The expectation is that with a larger stack size the decoder has a better chance of finding good hypotheses. However, a larger stack size obviously costs more memory and slows decoding, so we want a stack size that balances these concerns. These experiments show that with a stack size of 200 our model performs reasonably well compared with 300 and 500. A stack size of 100 is no better than 200, while a stack size of 50 is much worse than 200.

In Figure 4b, we experimented with 5 different K-best list sizes, with K from 100 to 500. We observed that a K-best list of 300 hypotheses performs well compared to the other values. In terms of stability, the curve for the 300-best list fluctuates less than the other curves over the 15 iterations.

C is the hyperparameter used in Equation 3 for weight updating in MIRA. Figure 4c shows experiments with different values of the constant C. A large C means our model prefers an aggressive weight-updating scheme; a small C means the model updates weights conservatively. When C is 0.3 or 0.2 the performance is worse than with 0.1, 0.07, or 0.04.
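The role of C can be illustrated with the standard clipped MIRA update (a sketch of the textbook rule, not necessarily identical to the paper's Equation 3): the step size toward the oracle's features is capped at C, so a small C forces conservative updates.

```python
def mira_update(w, feat_oracle, feat_hyp, loss, C):
    """One clipped MIRA update over sparse feature dicts.
    Moves w toward the oracle hypothesis, step size capped at C."""
    keys = set(feat_oracle) | set(feat_hyp)
    delta = {k: feat_oracle.get(k, 0.0) - feat_hyp.get(k, 0.0)
             for k in keys}
    norm_sq = sum(v * v for v in delta.values())
    if norm_sq == 0.0:
        return w  # identical features: nothing to update
    margin = sum(w.get(k, 0.0) * v for k, v in delta.items())
    # Step size: enough to satisfy the margin constraint, clipped at C.
    alpha = min(C, max(0.0, (loss - margin) / norm_sq))
    return {k: w.get(k, 0.0) + alpha * delta.get(k, 0.0)
            for k in set(w) | keys}
```

With loss 1.0 and a unit feature difference, C = 0.3 caps the update at 0.3, while a large C lets the full correction through.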

The last controlled experiments are shown in Figure 4d, in which we test different values of m ranging from 1 to 5. These experiments show that using 2 oracle hypotheses consistently leads to better performance than the other values.

7.4 Performance on the unseen test set

After exploring different model configurations, we trained the final model with stack-size = 200, K-best = 300, C = 0.04, and m-oracle = 2. The AveF10 score of the final system on the training set is 50.69, which is about one AveF10 point better than any system in Figure 4. We use the final system to evaluate on the unseen test set. We also compare our system with the rule-based system (henceforth H&S) proposed by Heilman and Smith (2010).³ ⁴

Original   Reference   H&S   Our system
  9.6         8.2      8.3      7.9

Table 1: Flesch-Kincaid grade level of the original text, reference, H&S, and our proposed simplification on the unseen test set.

System       AveF10   ROUGE-2   ROUGE-4
H&S           51.0      82.2      72.3
Our system    55.5      82.4      72.9

Table 2: Results on the unseen test set with AveF10, ROUGE-2, and ROUGE-4 scores. Our system outperforms the rule-based system proposed by Heilman and Smith (2010).

We first compare our system with H&S on the Flesch-Kincaid grade level, which indicates the comprehension difficulty of an English text: the higher the number, the more difficult the text. Table 1 shows that the original text requires a reader of grade level 9 or 10. Both H&S and our system produce simplification candidates that are easier to read than the original text. Our model generated simpler hypotheses than the reference, while the H&S outputs were slightly more difficult to read than the reference.

Next, we compare our system with H&S on n-gram-based metrics: AveF10, ROUGE-2, and ROUGE-4, as shown in Table 2. Our results are better than H&S by 0.2 and 0.6 points in ROUGE-2 and ROUGE-4, respectively. More interestingly, our system outperformed H&S by 4.5 points in AveF10.

³ We thank Michael Heilman for providing us with his code.
⁴ We could not reach the authors of (Zhu et al., 2010) to obtain outputs. Kristian Woodsend kindly provided us with partial outputs of (Woodsend and Lapata, 2011); therefore, we do not include their outputs in this section.



Positive examples

O: In 2011 , IBM gained worldwide attention for its artificial intelligence program Watson , which was exhibited on Jeopardy against game show champions Ken Jennings and Brad Rutter .
S: Watson was exhibited on Jeopardy against game show champions Ken Jennings and Brad Rutter .
   In 2011 , IBM gained worldwide attention for its artificial intelligence program Watson .
R: In 2011 , IBM gained worldwide attention for its artificial intelligence program Watson .
   Watson was exhibited on Jeopardy against game show champions Ken Jennings and Brad Rutter .

O: He told Radiozurnal that he was halting the campaign for Christmas and would restart it in the new year .
S: He told Radiozurnal .
   He was halting the campaign for Christmas .
   He would restart it in the new year .
R: He told Radiozurnal .
   He was halting the campaign for Christmas .
   He would restart it in the new year .

Negative examples

O: He drives a 10-year-old Opel Corsa , but lives in a pleasant town house in the sleepy capital, Maseru, with wireless Internet and a housekeeper who comes twice a week .
S: He drives a 10-year-old Opel Corsa .
   He lives in a pleasant town house in the sleepy capital, Maseru, with wireless Internet and a housekeeper who .
R: He drives a 10-year-old Opel Corsa .
   He lives in a pleasant town house in the sleepy capital, Maseru, with wireless Internet and a housekeeper .
   a housekeeper comes twice a week .

O: An elderly Georgian woman was scavenging for copper to sell as scrap when she accidentally sliced through an underground cable and cut off Internet services to all of neighbouring Armenia , it emerged on Wednesday .
S: An elderly Georgian woman was scavenging for copper to sell .
   scrap cut off Internet services to all of neighbouring Armenia .
R: An elderly Georgian woman was scavenging for copper to sell as scrap .
   she accidentally sliced through an underground cable .
   she cut off Internet services to all of neighbouring Armenia .
   it emerged on Wednesday .

Table 3: The original sentence (O), our simplification (S), and the simplification reference (R). Positive examples are cases where our simplifications closely match the reference; negative examples show cases where our model cannot produce good simplifications.

AveF10 is a metric that considers both precision and recall of n-grams up to length 10. Over the 100 sentences of the unseen test set, H&S outperforms our system on 43 sentences but is worse than our system on 51 sentences.

Table 3 shows examples of our system's output on the unseen test set. We present examples of cases where the proposed model works well and cases where it does not.

7.5 Discussions

This work shares the same line of research as Klebanov et al. (2004) and Heilman and Smith (2010): all focus on sentence-level factual simplification. However, a major focus of our work is on log-linear models, which offer a new perspective on the decoding, training, and modeling problems of sentence simplification. To contrast, consider rule-based systems (Klebanov et al., 2004; Daelemans et al., 2004; Siddharthan, 2006; Heilman and Smith, 2010), in which the sentence simplification process is driven by hand-written linguistic rules. The linguistic rules represent prior information about how each word and phrase can be restructured. In our model, each linguistic rule is encoded as a feature function, and we let the model learn optimized feature weights from the nature of the training data.

A potential issue is that the proposed model might be susceptible to data sparseness. We alleviate this by using structure-level and count feature functions, which are lexically independent.

There are some limitations to this work. First, the proposed model does not introduce new words, which may lead it to generate grammatically incorrect simple sentences. Second, we focus on structural rather than lexical simplification. Finally, the proposed model does not perform anaphora resolution, which means it can repeat noun phrases across multiple simple sentences.

8 Conclusions

In this paper we proposed a novel method for sentence simplification based on log-linear models. Our major contributions are the stack decoding algorithm, the discriminative training algorithm, and the 177 feature functions within the model. We have presented detailed analyses of our model in controlled settings to show the impact of different model hyperparameters. We demonstrated that the proposed model outperforms a state-of-the-art rule-based system on ROUGE-2, ROUGE-4, and AveF10 by 0.2, 0.6, and 4.5 points, respectively.

Another way to improve our model is feature engineering, which we leave to future work. To address the data sparsity issue, we plan to use crowdsourcing such as Amazon Mechanical Turk to collect more training data. We also plan to incorporate an n-gram LM to improve the grammaticality of hypotheses.

Acknowledgments

We would like to thank Colin Cherry for fruitful discussions and the anonymous reviewers for helpful comments. We also thank Julianne Mentzer for proofreading the final version of the paper.

References

Sandra Aluisio, Lucia Specia, Caroline Gasperin, and Carolina Scarton. 2010. Readability assessment for text simplification. In Proceedings of the NAACL HLT 2010 Fifth Workshop on Innovative Use of NLP for Building Educational Applications, pages 1–9. Association for Computational Linguistics.

Regina Barzilay and Kathleen R. McKeown. 2005. Sentence fusion for multidocument news summarization. Computational Linguistics, 31, September.

Raman Chandrasekar and Bangalore Srinivas. 1997. Automatic induction of rules for text simplification. Knowledge Based Systems, 10(3):183–190.

Raman Chandrasekar, Doran Christine, and Bangalore Srinivas. 1996. Motivations and methods for text simplification. In Proceedings of the 16th International Conference on Computational Linguistics, pages 1041–1044. Association for Computational Linguistics.

James Clarke. 2008. Global Inference for Sentence Compression: An Integer Linear Programming Approach. Ph.D. thesis, University of Edinburgh.

Trevor Cohn and Mirella Lapata. 2008. Sentence compression beyond word deletion. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 137–144, Manchester, UK, August. Coling 2008 Organizing Committee.

Koby Crammer and Yoram Singer. 2003. Ultraconservative online algorithms for multiclass problems. Journal of Machine Learning Research, 3:951–991.

Walter Daelemans, Anja Hothker, and Erik Tjong Kim Sang. 2004. Automatic sentence simplification for subtitling in Dutch and English. In Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC-04), pages 1045–1048.

Rudolf Flesch. 1948. A new readability yardstick. Journal of Applied Psychology, 32:221–233.

Rudolf Flesch. 1981. How to Write Plain English. Barnes & Noble.

Michel Galley and Kathleen McKeown. 2007. Lexicalized Markov grammars for sentence compression. In Proceedings of Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics, pages 180–187, Rochester, New York, April. Association for Computational Linguistics.

Robert Gunning. 1968. The Technique of Clear Writing. McGraw-Hill, New York, NY.

Michael Heilman and Noah Smith. 2010. Extracting simplified statements for factual question generation. In Proceedings of the 3rd Workshop on Question Generation.

Frederick Jelinek. 1998. Statistical Methods for Speech Recognition. The MIT Press.

Siddhartha Jonnalagadda and Graciela Gonzalez. 2010. Sentence simplification aids protein-protein interaction extraction. CoRR.

Siddhartha Jonnalagadda, Luis Tari, Jorg Hakenberg, Chitta Baral, and Graciela Gonzalez. 2009. Towards effective sentence simplification for automatic processing of biomedical text. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, pages 177–180, Boulder, Colorado, June. Association for Computational Linguistics.

Siddhartha Jonnalagadda. 2006. Syntactic simplification and text cohesion. Technical report, University of Cambridge.

Peter Kincaid, Robert Fishburne, Richard Rogers, and Brad Chissom. 1975. Derivation of new readability formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for Navy enlisted personnel (Research Branch Report 8-75). Memphis, TN: Naval Air Station.

Beata Klebanov, Kevin Knight, and Daniel Marcu. 2004. Text simplification for information seeking applications. In On the Move to Meaningful Internet Systems, Lecture Notes in Computer Science, volume 3290, pages 735–747, Morristown, NJ, USA. Springer-Verlag.

Kevin Knight and Daniel Marcu. 2002. Summarization beyond sentence extraction: A probabilistic approach to sentence compression. Artificial Intelligence, 139(1):91–107.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of ACL'07, pages 177–180, Prague, Czech Republic, June.

Percy Liang, Alexandre Bouchard-Cote, Dan Klein, and Ben Taskar. 2006. An end-to-end discriminative approach to machine translation. In Proceedings of ACL'06, pages 761–768, Sydney, Australia. Association for Computational Linguistics.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Marie-Francine Moens and Stan Szpakowicz, editors, Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pages 74–81, Barcelona, Spain, July. Association for Computational Linguistics.

Marie-Catherine de Marneffe, Bill MacCartney, and Christopher Manning. 2006. Generating typed dependency parses from phrase structure parses. In Proceedings of LREC'06, Genoa, Italy.

Ryan McDonald, Koby Crammer, and Fernando Pereira. 2005. Online large-margin training of dependency parsers. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, ACL '05, pages 91–98. Association for Computational Linguistics.

Ryan McDonald. 2006. Discriminative sentence compression with soft syntactic evidence. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics, pages 297–304.

Harry McLaughlin. 1969. SMOG grading: A new readability formula. Journal of Reading, 12(8):639–646.

Courtney Napoles and Mark Dredze. 2010. Learning Simple Wikipedia: A cogitation in ascertaining abecedarian language. In Proceedings of the NAACL HLT 2010 Workshop on Computational Linguistics and Writing: Writing Processes and Authoring Aids, pages 42–50, Los Angeles, CA, USA, June. Association for Computational Linguistics.

Ani Nenkova, Rebecca Passonneau, and Kathleen McKeown. 2007. The pyramid method: Incorporating human content selection variation in summarization evaluation. ACM Transactions on Speech and Language Processing, 4, May.

Lance Ramshaw and Mitchell Marcus. 1995. Text chunking using transformation-based learning. In Proceedings of the Third ACL Workshop on Very Large Corpora, MIT, June.

Advaith Siddharthan. 2006. Syntactic simplification and text cohesion. Research on Language and Computation, 4(1):77–109, June.

Jenine Turner and Eugene Charniak. 2005. Supervised and unsupervised learning for sentence compression. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, ACL '05, pages 290–297. Association for Computational Linguistics.

Taro Watanabe, Jun Suzuki, Hajime Tsukada, and Hideki Isozaki. 2007. Online large-margin training for statistical machine translation. In Proceedings of EMNLP-CoNLL, pages 764–773, Prague, Czech Republic, June. Association for Computational Linguistics.

W. M. Watanabe, A. C. Junior, V. R. Uzeda, R. P. M. Fortes, T. A. S. Pardo, and S. M. Aluísio. 2009. Facilita: Reading assistance for low-literacy readers. In Proceedings of the 27th ACM International Conference on Design of Communication, pages 29–36. ACM.

Kristian Woodsend and Mirella Lapata. 2011. Learning to simplify sentences with quasi-synchronous grammar and integer programming. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 409–420, Edinburgh, Scotland, UK, July. Association for Computational Linguistics.

Zhemin Zhu, Delphine Bernhard, and Iryna Gurevych. 2010. A monolingual tree-based translation model for sentence simplification. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 1353–1361, Beijing, China, August.
