Automatically Generating Wikipedia Articles: A Structure-Aware Approach


Christina Sauper and Regina Barzilay
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology
{csauper,regina}@csail.mit.edu

Abstract

In this paper, we investigate an approach for creating a comprehensive textual overview of a subject composed of information drawn from the Internet. We use the high-level structure of human-authored texts to automatically induce a domain-specific template for the topic structure of a new overview. The algorithmic innovation of our work is a method to learn topic-specific extractors for content selection jointly for the entire template. We augment the standard perceptron algorithm with a global integer linear programming formulation to optimize both local fit of information into each topic and global coherence across the entire overview. The results of our evaluation confirm the benefits of incorporating structural information into the content selection process.

1 Introduction

In this paper, we consider the task of automatically creating a multi-paragraph overview article that provides a comprehensive summary of a subject of interest. Examples of such overviews include actor biographies from IMDB and disease synopses from Wikipedia. Producing these texts by hand is a labor-intensive task, especially when relevant information is scattered throughout a wide range of Internet sources. Our goal is to automate this process. We aim to create an overview of a subject – e.g., 3-M Syndrome – by intelligently combining relevant excerpts from across the Internet.

As a starting point, we can employ methods developed for multi-document summarization. However, our task poses additional technical challenges with respect to content planning. Generating a well-rounded overview article requires proactive strategies to gather relevant material, such as searching the Internet. Moreover, the challenge of maintaining output readability is magnified when creating a longer document that discusses multiple topics.

In our approach, we explore how the high-level structure of human-authored documents can be used to produce well-formed comprehensive overview articles. We select relevant material for an article using a domain-specific automatically generated content template. For example, a template for articles about diseases might contain diagnosis, causes, symptoms, and treatment. Our system induces these templates by analyzing patterns in the structure of human-authored documents in the domain of interest. Then, it produces a new article by selecting content from the Internet for each part of this template. An example of our system’s output¹ is shown in Figure 1.

The algorithmic innovation of our work is a method for learning topic-specific extractors for content selection jointly across the entire template. Learning a single topic-specific extractor can be easily achieved in a standard classification framework. However, the choices for different topics in a template are mutually dependent; for example, in a multi-topic article, there is potential for redundancy across topics. Simultaneously learning content selection for all topics enables us to explicitly model these inter-topic connections.

We formulate this task as a structured classification problem. We estimate the parameters of our model using the perceptron algorithm augmented with an integer linear programming (ILP) formulation, run over a training set of example articles in the given domain.

The key features of this structure-aware approach are twofold:

¹ This system output was added to Wikipedia at http://en.wikipedia.org/wiki/3-M_syndrome on June 26, 2008. The page’s history provides examples of changes performed by human editors to articles created by our system.

Diagnosis . . . No laboratories offering molecular genetic testing for prenatal diagnosis of 3-M syndrome are listed in the GeneTests Laboratory Directory. However, prenatal testing may be available for families in which the disease-causing mutations have been identified in an affected family member in a research or clinical laboratory.

Causes Three M syndrome is thought to be inherited as an autosomal recessive genetic trait. Human traits, including the classic genetic diseases, are the product of the interaction of two genes, one received from the father and one from the mother. In recessive disorders, the condition does not occur unless an individual inherits the same defective gene for the same trait from each parent. . . .

Symptoms . . . Many of the symptoms and physical features associated with the disorder are apparent at birth (congenital). In some cases, individuals who carry a single copy of the disease gene (heterozygotes) may exhibit mild symptoms associated with Three M syndrome.

Treatment . . . Genetic counseling will be of benefit for affected individuals and their families. Family members of affected individuals should also receive regular clinical evaluations to detect any symptoms and physical characteristics that may be potentially associated with Three M syndrome or heterozygosity for the disorder. Other treatment for Three M syndrome is symptomatic and supportive.

Figure 1: A fragment from the automatically created article for 3-M Syndrome.

• Automatic template creation: Templates are automatically induced from human-authored documents. This ensures that the overview article will have the breadth expected in a comprehensive summary, with content drawn from a wide variety of Internet sources.

• Joint parameter estimation for content selection: Parameters are learned jointly for all topics in the template. This procedure optimizes both local relevance of information for each topic and global coherence across the entire article.

We evaluate our approach by creating articles in two domains: Actors and Diseases. For a data set, we use Wikipedia, which contains articles similar to those we wish to produce in terms of length and breadth. An advantage of this data set is that Wikipedia articles explicitly delineate topical sections, facilitating structural analysis. The results of our evaluation confirm the benefits of structure-aware content selection over approaches that do not explicitly model topical structure.

2 Related Work

Concept-to-text generation and text-to-text generation take very different approaches to content selection. In traditional concept-to-text generation, a content planner provides a detailed template for what information should be included in the output and how this information should be organized (Reiter and Dale, 2000). In text-to-text generation, such templates for information organization are not available; sentences are selected based on their salience properties (Mani and Maybury, 1999). While this strategy is robust and portable across domains, output summaries often suffer from coherence and coverage problems.

In between these two approaches is work on domain-specific text-to-text generation. Instances of these tasks are biography generation in summarization and answering definition requests in question-answering. In contrast to a generic summarizer, these applications aim to characterize the types of information that are essential in a given domain. This characterization varies greatly in granularity. For instance, some approaches coarsely discriminate between biographical and non-biographical information (Zhou et al., 2004; Biadsy et al., 2008), while others go beyond binary distinction by identifying atomic events – e.g., occupation and marital status – that are typically included in a biography (Weischedel et al., 2004; Filatova and Prager, 2005; Filatova et al., 2006). Commonly, such templates are specified manually and are hard-coded for a particular domain (Fujii and Ishikawa, 2004; Weischedel et al., 2004).

Our work is related to these approaches; however, content selection in our work is driven by domain-specific automatically induced templates. As our experiments demonstrate, patterns observed in domain-specific training data provide sufficient constraints for topic organization, which is crucial for a comprehensive text.

Our work also relates to a large body of recent work that uses Wikipedia material. Instances of this work include information extraction, ontology induction and resource acquisition (Wu and Weld, 2007; Biadsy et al., 2008; Nastase, 2008; Nastase and Strube, 2008). Our focus is on a different task: generation of new overview articles that follow the structure of Wikipedia articles.

3 Method

The goal of our system is to produce a comprehensive overview article given a title – e.g., Cancer. We assume that relevant information on the subject is available on the Internet but scattered among several pages interspersed with noise.

We are provided with a training corpus consisting of $n$ documents $d_1 \ldots d_n$ in the same domain – e.g., Diseases. Each document $d_i$ has a title and a set of delineated sections² $s_{i1} \ldots s_{im}$. The number of sections $m$ varies between documents. Each section $s_{ij}$ also has a corresponding heading $h_{ij}$ – e.g., Treatment.

Our overview article creation process consists of three parts. First, a preprocessing step creates a template and searches for a number of candidate excerpts from the Internet. Next, parameters must be trained for the content selection algorithm using our training data set. Finally, a complete article may be created by combining a selection of candidate excerpts.

1. Preprocessing (Section 3.1) Our preprocessing step leverages previous work in topic segmentation and query reformulation to prepare a template and a set of candidate excerpts for content selection. Template generation must occur once per domain, whereas search occurs every time an article is generated in both learning and application.

(a) Template Induction To create a content template, we cluster all section headings $h_{i1} \ldots h_{im}$ for all documents $d_i$. Each cluster is labeled with the most common heading $h_{ij}$ within the cluster. The largest $k$ clusters are selected to become topics $t_1 \ldots t_k$, which form the domain-specific content template.

(b) Search For each document that we wish to create, we retrieve from the Internet a set of $r$ excerpts $e_{j1} \ldots e_{jr}$ for each topic $t_j$ from the template. We define appropriate search queries using the requested document title and topics $t_j$.

2. Learning Content Selection (Section 3.2) For each topic $t_j$, we learn the corresponding topic-specific parameters $\mathbf{w}_j$ to determine the quality of a given excerpt. Using the perceptron framework augmented with an ILP formulation for global optimization, the system is trained to select the best excerpt for each document $d_i$ and each topic $t_j$. For training, we assume the best excerpt is the original human-authored text $s_{ij}$.

² In data sets where such mark-up is not available, one can employ topical segmentation algorithms as an additional preprocessing step.

3. Application (Section 3.2) Given the title of a requested document, we select several excerpts from the candidate vectors returned by the search procedure (1b) to create a comprehensive overview article. We perform the decoding procedure jointly using learned parameters $\mathbf{w}_1 \ldots \mathbf{w}_k$ and the same ILP formulation for global optimization as in training. The result is a new document with $k$ excerpts, one for each topic.

3.1 Preprocessing

Template Induction A content template specifies the topical structure of documents in one domain. For instance, the template for articles about actors consists of four topics $t_1 \ldots t_4$: biography, early life, career, and personal life. Using this template to create the biography of a new actor will ensure that its information coverage is consistent with existing human-authored documents.

We aim to derive these templates by discovering common patterns in the organization of documents in a domain of interest. There has been a sizable amount of research on structure induction ranging from linear segmentation (Hearst, 1994) to content modeling (Barzilay and Lee, 2004). At the core of these methods is the assumption that fragments of text conveying similar information have similar word distribution patterns. Therefore, often a simple segment clustering across domain texts can identify strong patterns in content structure (Barzilay and Elhadad, 2003). Clusters containing fragments from many documents are indicative of topics that are essential for a comprehensive summary. Given the simplicity and robustness of this approach, we utilize it for template induction.

We cluster all section headings $h_{i1} \ldots h_{im}$ from all documents $d_i$ using a repeated bisectioning algorithm (Zhao et al., 2005). As a similarity function, we use cosine similarity weighted with TF*IDF. We eliminate any clusters with low internal similarity (i.e., smaller than 0.5), as we assume these are “miscellaneous” clusters that will not yield unified topics.
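To make the similarity function and the cluster filter concrete, the following is a minimal Python sketch. It is not the repeated bisectioning algorithm itself: it assumes cluster assignments are already available and only shows TF*IDF-weighted cosine similarity and the 0.5 cutoff. The smoothed IDF formula, the use of average pairwise similarity as "internal similarity", and all function names are assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_vectors(headings):
    """Map each heading to a sparse TF*IDF vector (dict: word -> weight)."""
    docs = [Counter(h.lower().split()) for h in headings]
    n = len(docs)
    df = Counter(word for doc in docs for word in doc)  # document frequency
    # Smoothed IDF (an assumption) keeps weights positive for frequent words.
    return [{w: tf * (math.log((1 + n) / (1 + df[w])) + 1.0)
             for w, tf in doc.items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def internal_similarity(cluster, vectors):
    """Average pairwise cosine similarity among cluster members (indices)."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 1.0  # a singleton is trivially coherent
    return sum(cosine(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)


def keep_coherent(clusters, vectors, threshold=0.5):
    """Drop "miscellaneous" clusters whose internal similarity is below threshold."""
    return [c for c in clusters if internal_similarity(c, vectors) >= threshold]
```

Here a cluster is a list of indices into the heading list, so a cluster that mixes unrelated headings such as "Causes", "History", and "External links" scores 0 and is discarded.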

We determine the average number of sections $k$ over all documents in our training set, then select the $k$ largest section clusters as topics. We order these topics as $t_1 \ldots t_k$ using a majority ordering algorithm (Cohen et al., 1998). This algorithm finds a total order among clusters that is consistent with a maximal number of pairwise relationships observed in our data set.
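The selection and ordering of topics can be sketched as follows. This is a simplified illustration, not the exact algorithm of Cohen et al. (1998): it assumes each training document has already been reduced to the sequence of cluster labels of its sections, and it orders topics with a greedy heuristic that repeatedly emits the topic that most often precedes the remaining ones. The function names and the rounding of $k$ are assumptions.

```python
from collections import Counter
from itertools import combinations


def template_size(documents):
    """k = average number of sections per training document, rounded."""
    return round(sum(len(doc) for doc in documents) / len(documents))


def largest_topics(documents, k):
    """Labels of the k clusters that contain the most sections."""
    counts = Counter(topic for doc in documents for topic in doc)
    return [topic for topic, _ in counts.most_common(k)]


def majority_order(documents, topics):
    """Greedy approximation of a total order agreeing with most pairwise orders."""
    before = Counter()  # before[(a, b)] = documents in which a precedes b
    for doc in documents:
        present = [t for t in doc if t in topics]
        for a, b in combinations(present, 2):
            before[(a, b)] += 1

    remaining, order = list(topics), []
    while remaining:
        def potential(t):
            # Net number of observations placing t before the other topics.
            return sum(before[(t, u)] - before[(u, t)]
                       for u in remaining if u != t)
        best = max(remaining, key=potential)
        order.append(best)
        remaining.remove(best)
    return order
```

With four toy disease articles whose sections mostly follow the order Diagnosis, Causes, Symptoms, Treatment, the sketch recovers that order even when one article swaps Causes and Symptoms.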

Each topic $t_j$ is identified by the most frequent heading found within the cluster – e.g., Causes. This set of topics forms the content template for a domain.

Search To retrieve relevant excerpts, we must define appropriate search queries for each topic $t_1 \ldots t_k$. Query reformulation is an active area of research (Agichtein et al., 2001). We have experimented with several of these methods for drawing search queries from representative words in the body text of each topic; however, we find that the best performance is provided by deriving queries from a conjunction of the document title and topic – e.g., “3-M syndrome” diagnosis.

Using these queries, we search using Yahoo! and retrieve the first ten result pages for each topic. From each of these pages, we extract all possible excerpts consisting of chunks of text between standardized boundary indicators (such as <p> tags). In our experiments, an average of 6 excerpts is taken from each page. For each topic $t_j$ of each document we wish to create, the total number of excerpts $r$ found on the Internet may differ. We label the excerpts $e_{j1} \ldots e_{jr}$.
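A minimal sketch of the two search-side operations described above, query construction and excerpt extraction, is given below. It uses only Python's standard `html.parser` module and treats `<p>` tags as the sole boundary indicator; the search engine call itself is omitted. The names `build_query`, `ParagraphExtractor`, and `extract_excerpts`, and the lowercasing of the topic label, are assumptions for illustration.

```python
from html.parser import HTMLParser


def build_query(title, topic):
    """Conjunction of the quoted document title and the topic label."""
    return f'"{title}" {topic.lower()}'


class ParagraphExtractor(HTMLParser):
    """Collects the text between <p> and </p> tags as candidate excerpts."""

    def __init__(self):
        super().__init__()
        self.excerpts = []
        self._buffer = None  # None while outside a paragraph

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # tolerate unclosed <p> tags
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def _flush(self):
        if self._buffer is not None:
            text = " ".join("".join(self._buffer).split())  # normalize whitespace
            if text:
                self.excerpts.append(text)
        self._buffer = None


def extract_excerpts(html):
    """Return the list of paragraph-level excerpts found in one result page."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()
    return parser.excerpts
```

For example, `build_query("3-M syndrome", "Diagnosis")` yields the query string `"3-M syndrome" diagnosis`, matching the example in the text.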

3.2 Selection Model

Our selection model takes the content template $t_1 \ldots t_k$ and the candidate excerpts $e_{j1} \ldots e_{jr}$ for each topic $t_j$ produced in the previous steps. It then selects a series of $k$ excerpts, one from each topic, to create a coherent summary.

One possible approach is to perform individual selections from each set of excerpts $e_{j1} \ldots e_{jr}$ and then combine the results. This strategy is commonly used in multi-document summarization (Barzilay et al., 1999; Goldstein et al., 2000; Radev et al., 2000), where the combination step eliminates the redundancy across selected excerpts. However, separating the two steps may not be optimal for this task: the balance between coverage and redundancy is harder to achieve when a multi-paragraph summary is generated. In addition, a more discriminative selection strategy is needed when candidate excerpts are drawn directly from the web, as they may be contaminated with noise.

We propose a novel joint training algorithm that learns selection criteria for all the topics simultaneously. This approach enables us to maximize both local fit and global coherence. We implement this algorithm using the perceptron framework, as it can be easily modified for structured prediction while preserving convergence guarantees (Daumé III and Marcu, 2005; Snyder and Barzilay, 2007).

In this section, we first describe the structure and decoding procedure of our model. We then present an algorithm to jointly learn the parameters of all topic models.

3.2.1 Model Structure

The model inputs are as follows:

• The title of the desired document
• $t_1 \ldots t_k$: topics from the content template
• $e_{j1} \ldots e_{jr}$: candidate excerpts for each topic $t_j$

In addition, we define feature and parameter vectors:

• $\phi(e_{jl})$: feature vector for the $l$th candidate excerpt for topic $t_j$
• $\mathbf{w}_1 \ldots \mathbf{w}_k$: parameter vectors, one for each of the topics $t_1 \ldots t_k$

Our model constructs a new article by following these two steps:

Ranking First, we attempt to rank candidate excerpts based on how representative they are of each individual topic. For each topic $t_j$, we induce a ranking of the excerpts $e_{j1} \ldots e_{jr}$ by mapping each excerpt $e_{jl}$ to a score:

$$\mathrm{score}_j(e_{jl}) = \phi(e_{jl}) \cdot \mathbf{w}_j$$

Candidates for each topic are ranked from highest to lowest score. After this procedure, the position $l$ of excerpt $e_{jl}$ within the topic-specific candidate vector is the excerpt’s rank.

Optimizing the Global Objective To avoid redundancy between topics, we formulate an optimization problem using excerpt rankings to create the final article. Given $k$ topics, we would like to select one excerpt $e_{jl}$ for each topic $t_j$, such that the rank is minimized; that is, $\mathrm{score}_j(e_{jl})$ is high.
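As a concrete illustration of the ranking step, the sketch below scores each candidate by the dot product of its feature vector with the topic's parameter vector and sorts candidates from highest to lowest score. Representing vectors as sparse dictionaries, and the names `score` and `rank`, are assumptions made for this example.

```python
def score(features, weights):
    """Dot product of a sparse feature vector with a topic's parameter vector."""
    return sum(value * weights.get(name, 0.0) for name, value in features.items())


def rank(candidates, weights):
    """Return candidates sorted from highest to lowest score.

    `candidates` is a list of (excerpt_id, feature_dict) pairs. The 1-based
    position of an excerpt in the returned list is its rank l.
    """
    return sorted(candidates, key=lambda c: score(c[1], weights), reverse=True)
```

The rank positions produced here are what the global objective consumes: the optimization never looks at raw scores, only at the position of each excerpt in its topic's list.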

To select the optimal excerpts, we employ integer linear programming (ILP). This framework is commonly used in generation and summarization applications where the selection process is driven by multiple constraints (Marciniak and Strube, 2005; Clarke and Lapata, 2007).

We represent excerpts included in the output using a set of indicator variables, $x_{jl}$. For each excerpt $e_{jl}$, the corresponding indicator variable $x_{jl} = 1$ if the excerpt is included in the final document, and $x_{jl} = 0$ otherwise.

Our objective is to minimize the ranks of the excerpts selected for the final document:

$$\min \sum_{j=1}^{k} \sum_{l=1}^{r} l \cdot x_{jl}$$

We augment this formulation with two types of constraints.

Exclusivity Constraints We want to ensure that exactly one indicator $x_{jl}$ is nonzero for each topic $t_j$. These constraints are formulated as follows:

$$\sum_{l=1}^{r} x_{jl} = 1 \quad \forall j \in \{1 \ldots k\}$$

Redundancy Constraints We also want to prevent redundancy across topics. We define $\mathrm{sim}(e_{jl}, e_{j'l'})$ as the cosine similarity between excerpts $e_{jl}$ from topic $t_j$ and $e_{j'l'}$ from topic $t_{j'}$. We introduce constraints that ensure no pair of selected excerpts has similarity above 0.5:

$$(x_{jl} + x_{j'l'}) \cdot \mathrm{sim}(e_{jl}, e_{j'l'}) \le 1 \quad \forall j, j' = 1 \ldots k \quad \forall l, l' = 1 \ldots r$$

If excerpts $e_{jl}$ and $e_{j'l'}$ have cosine similarity $\mathrm{sim}(e_{jl}, e_{j'l'}) > 0.5$, only one excerpt may be selected for the final document – i.e., either $x_{jl}$ or $x_{j'l'}$ may be 1, but not both. Conversely, if $\mathrm{sim}(e_{jl}, e_{j'l'}) \le 0.5$, both excerpts may be selected.

Solving the ILP Solving an integer linear program is NP-hard (Cormen et al., 1992); however, in practice there exist several strategies for solving certain ILPs efficiently. In our study, we employed lp_solve,³ an efficient mixed integer programming solver which implements the Branch-and-Bound algorithm. On a larger scale, there are several alternatives to approximate the ILP results, such as a dynamic programming approximation to the knapsack problem (McDonald, 2007).
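To illustrate what the ILP computes, the sketch below solves the same selection problem by exhaustive enumeration with a simple cost cutoff. It is a stand-in for lp_solve, not a use of it, and is practical only for small candidate sets. Exclusivity holds by construction because exactly one index is chosen per topic; redundancy is checked pairwise against the 0.5 threshold. The function name and the `sim` callback are assumptions.

```python
from itertools import product


def select_excerpts(ranked, sim, threshold=0.5):
    """Pick one excerpt per topic, minimizing the sum of ranks.

    ranked[j] lists the excerpts of topic j from best to worst, so index l
    corresponds to rank l + 1. `sim(a, b)` returns a similarity in [0, 1].
    Returns the chosen excerpts, or None if every selection is redundant.
    """
    best_cost, best_choice = None, None
    for choice in product(*(range(len(topic)) for topic in ranked)):
        cost = sum(l + 1 for l in choice)  # objective: sum of selected ranks
        if best_cost is not None and cost >= best_cost:
            continue  # cannot improve on the best feasible selection so far
        picked = [ranked[j][l] for j, l in enumerate(choice)]
        redundant = any(sim(picked[a], picked[b]) > threshold
                        for a in range(len(picked))
                        for b in range(a + 1, len(picked)))
        if not redundant:
            best_cost, best_choice = cost, picked
    return best_choice
```

When the two top-ranked excerpts of different topics are near-duplicates, the cheapest feasible selection keeps one of them and falls back to the second-ranked excerpt for the other topic, which is the behavior the redundancy constraints are designed to produce.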

³ http://lpsolve.sourceforge.net/5.5/

Feature                  Value
UNI word_i               count of word occurrences
POS word_i               first position of word in excerpt
BI word_i word_i+1       count of bigram occurrences
SENT                     count of all sentences
EXCL                     count of exclamations
QUES                     count of questions
WORD                     count of all words
NAME                     count of title mentions
DATE                     count of dates
PROP                     count of proper nouns
PRON                     count of pronouns
NUM                      count of numbers
FIRST word_1             1∗
FIRST word_1 word_2      1†
SIMS                     count of similar excerpts‡

Table 1: Features employed in the ranking model.
∗ Defined as the first unigram in the excerpt.
† Defined as the first bigram in the excerpt.
‡ Defined as excerpts with cosine similarity > 0.5.

Features As shown in Table 1, most of the features we select in our model have been employed in previous work on summarization (Mani and Maybury, 1999). All features except the SIMS feature are defined for individual excerpts in isolation. For each excerpt $e_{jl}$, the value of the SIMS feature is the count of excerpts $e_{jl'}$ in the same topic $t_j$ for which $\mathrm{sim}(e_{jl}, e_{jl'}) > 0.5$. This feature quantifies the degree of repetition within a topic, often indicative of an excerpt’s accuracy and relevance.
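A rough sketch of how several of the Table 1 features might be computed is shown below. It covers only features that need no linguistic tools (DATE, PROP, and PRON would require a date recognizer and a part-of-speech tagger and are omitted), and its tokenization and sentence counting are deliberately crude. The feature-name prefixes and function names are assumptions.

```python
import re
from collections import Counter


def extract_features(excerpt, title):
    """Compute a subset of the Table 1 features for one excerpt."""
    words = re.findall(r"[A-Za-z0-9'-]+", excerpt.lower())
    features = Counter()
    features["WORD"] = len(words)
    features["SENT"] = len(re.findall(r"[.!?]+", excerpt))  # crude sentence count
    features["EXCL"] = excerpt.count("!")
    features["QUES"] = excerpt.count("?")
    features["NAME"] = excerpt.lower().count(title.lower())
    features["NUM"] = len(re.findall(r"\b\d+(?:\.\d+)?\b", excerpt))
    for position, word in enumerate(words):
        features["UNI_" + word] += 1                   # unigram count
        features.setdefault("POS_" + word, position)   # first position only
    for first, second in zip(words, words[1:]):
        features["BI_" + first + "_" + second] += 1    # bigram count
    if words:
        features["FIRST_" + words[0]] = 1
    if len(words) > 1:
        features["FIRST_" + words[0] + "_" + words[1]] = 1
    return features


def add_sims(feature_dicts, excerpts, sim, threshold=0.5):
    """SIMS: number of other excerpts in the same topic with sim > threshold."""
    for i, feats in enumerate(feature_dicts):
        feats["SIMS"] = sum(1 for j, other in enumerate(excerpts)
                            if j != i and sim(excerpts[i], other) > threshold)
```

Note that SIMS is the only feature computed over the whole candidate set of a topic, which is why it is filled in by a separate pass.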

3.2.2 Model Training

Generating Training Data For training, we are given $n$ original documents $d_1 \ldots d_n$, a content template consisting of topics $t_1 \ldots t_k$, and a set of candidate excerpts $e_{ij1} \ldots e_{ijr}$ for each document $d_i$ and topic $t_j$. For each section of each document, we add the gold excerpt $s_{ij}$ to the corresponding vector of candidate excerpts $e_{ij1} \ldots e_{ijr}$. This excerpt represents the target for our training algorithm. Note that the algorithm does not require annotated ranking data; only knowledge of this “optimal” excerpt is required. However, if the excerpts provided in the training data have low quality, noise is introduced into the system.

Training Procedure Our algorithm is a modification of the perceptron ranking algorithm (Collins, 2002), which allows for joint learning across several ranking problems (Daumé III and Marcu, 2005; Snyder and Barzilay, 2007). Pseudocode for this algorithm is provided in Figure 2.

First, we define $\mathrm{Rank}(e_{ij1} \ldots e_{ijr}, \mathbf{w}_j)$, which ranks all excerpts from the candidate excerpt vector $e_{ij1} \ldots e_{ijr}$ for document $d_i$ and topic $t_j$. Excerpts are ordered by $\mathrm{score}_j(e_{jl})$ using the current parameter values. We also define $\mathrm{Optimize}(e_{ij1} \ldots e_{ijr})$, which finds the optimal selection of excerpts (one per topic) given ranked lists of excerpts $e_{ij1} \ldots e_{ijr}$ for each document $d_i$ and topic $t_j$. These functions follow the ranking and optimization procedures described in Section 3.2.1. The algorithm maintains $k$ parameter vectors $\mathbf{w}_1 \ldots \mathbf{w}_k$, one associated with each topic $t_j$ desired in the final article. During initialization, all parameter vectors are set to zeros (line 2).

To learn the optimal parameters, this algorithm iterates over the training set until the parameters converge or a maximum number of iterations is reached (line 3). For each document in the training set (line 4), the following steps occur: First, candidate excerpts for each topic are ranked (lines 5-6). Next, decoding through ILP optimization is performed over all ranked lists of candidate excerpts, selecting one excerpt for each topic (line 7). Finally, the parameters are updated in a joint fashion. For each topic (line 8), if the selected excerpt is not similar enough to the gold excerpt (line 9), the parameters for that topic are updated using a standard perceptron update rule (line 10). When convergence is reached or the maximum iteration count is exceeded, the learned parameter values are returned (line 12).

The use of ILP during each step of training sets this algorithm apart from previous work. In prior research, ILP was used as a postprocessing step to remove redundancy and make other global decisions about parameters (McDonald, 2007; Marciniak and Strube, 2005; Clarke and Lapata, 2007). However, in our training, we intertwine the complete decoding procedure with the parameter updates. Our joint learning approach finds per-topic parameter values that are maximally suited for the global decoding procedure for content selection.
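The training loop of Figure 2 can be sketched in Python as follows. This is an illustrative reading of the pseudocode under several assumptions: feature vectors are sparse dictionaries returned by a caller-supplied `phi`, similarity is a caller-supplied `sim`, the ILP is replaced by exhaustive enumeration (feasible only for small candidate sets), and decoding falls back to the top-ranked excerpts if every selection is redundant. All names are hypothetical.

```python
from itertools import product


def dot(features, weights):
    """Sparse dot product used as the per-topic score."""
    return sum(v * weights.get(f, 0.0) for f, v in features.items())


def decode(ranked, sim, threshold=0.5):
    """Minimum-total-rank selection, one excerpt per topic, no redundant pair."""
    best = None
    for choice in product(*(range(len(r)) for r in ranked)):
        picked = [ranked[j][l] for j, l in enumerate(choice)]
        if any(sim(a, b) > threshold
               for i, a in enumerate(picked) for b in picked[i + 1:]):
            continue
        cost = sum(choice)  # 0-based ranks; same minimizer as 1-based ranks
        if best is None or cost < best[0]:
            best = (cost, picked)
    # Fallback (an assumption): top-ranked excerpts if nothing is feasible.
    return best[1] if best else [r[0] for r in ranked]


def train(documents, phi, sim, k, max_iter=10, match=0.8):
    """Joint perceptron training over k topics.

    documents: list of (candidates, gold) pairs, where candidates[j] is the
    list of excerpts for topic j (gold[j] included) and gold[j] is the
    human-authored section text for topic j.
    """
    weights = [dict() for _ in range(k)]                    # lines 1-2
    for _ in range(max_iter):                               # line 3
        updated = False
        for candidates, gold in documents:                  # line 4
            ranked = [sorted(candidates[j],                 # lines 5-6
                             key=lambda e, j=j: dot(phi(e), weights[j]),
                             reverse=True)
                      for j in range(k)]
            selected = decode(ranked, sim)                  # line 7
            for j in range(k):                              # line 8
                if sim(selected[j], gold[j]) < match:       # line 9
                    for f, v in phi(gold[j]).items():       # line 10: + phi(s_ij)
                        weights[j][f] = weights[j].get(f, 0.0) + v
                    for f, v in phi(selected[j]).items():   # line 10: - phi(x_j)
                        weights[j][f] = weights[j].get(f, 0.0) - v
                    updated = True
        if not updated:                                     # converged
            break
    return weights                                          # line 12
```

The point of the sketch is the placement of `decode` inside the loop: updates for each topic are driven by the jointly selected excerpt, not by that topic's top-ranked candidate.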

4 Experimental Setup

We evaluate our method by observing the quality of automatically created articles in different domains. We compute the similarity of a large number of articles produced by our system and several baselines to the original human-authored articles using ROUGE, a standard metric for summary quality. In addition, we perform an analysis of editor reaction to system-produced articles submitted to Wikipedia.

Input:
    d_1 ... d_n: A set of n documents, each containing k sections s_i1 ... s_ik
    e_ij1 ... e_ijr: Sets of candidate excerpts for each topic t_j and document d_i
Define:
    Rank(e_ij1 ... e_ijr, w_j):
        As described in Section 3.2.1:
        Calculates score_j(e_ijl) for all excerpts for document d_i and topic t_j, using parameters w_j.
        Orders the list of excerpts by score_j(e_ijl) from highest to lowest.
    Optimize(e_i11 ... e_ikr):
        As described in Section 3.2.1:
        Finds the optimal selection of excerpts to form a final article, given ranked lists of excerpts for each topic t_1 ... t_k.
        Returns a list of k excerpts, one for each topic.
    φ(e_ijl):
        Returns the feature vector representing excerpt e_ijl
Initialization:
 1  For j = 1 ... k
 2      Set parameters w_j = 0
Training:
 3  Repeat until convergence or while iter < iter_max:
 4      For i = 1 ... n
 5          For j = 1 ... k
 6              Rank(e_ij1 ... e_ijr, w_j)
 7          x_1 ... x_k = Optimize(e_i11 ... e_ikr)
 8          For j = 1 ... k
 9              If sim(x_j, s_ij) < 0.8
10                  w_j = w_j + φ(s_ij) − φ(x_j)
11      iter = iter + 1
12  Return parameters w_1 ... w_k

Figure 2: An algorithm for learning several ranking problems with a joint decoding mechanism.

Data For evaluation, we consider two domains: American Film Actors and Diseases. These domains have been commonly used in prior work on summarization (Weischedel et al., 2004; Zhou et al., 2004; Filatova and Prager, 2005; Demner-Fushman and Lin, 2007; Biadsy et al., 2008). Our text corpus consists of articles drawn from the corresponding categories in Wikipedia. There are 2,150 articles in American Film Actors and 523 articles in Diseases. For each domain, we randomly select 90% of articles for training and test on the remaining 10%. Human-authored articles in both domains contain an average of four topics, and each topic contains an average of 193 words. In order to model the real-world scenario where Wikipedia articles are not always available (as for new or specialized topics), we specifically exclude Wikipedia sources during our search procedure (Section 3.1) for evaluation.

                       Avg. Excerpts   Avg. Sources
Amer. Film Actors
  Search                    2.3            1
  No Template               4              4.0
  Disjoint                  4              2.1
  Full Model                4              3.4
  Oracle                    4.3            4.3
Diseases
  Search                    3.1            1
  No Template               4              2.5
  Disjoint                  4              3.0
  Full Model                4              3.2
  Oracle                    5.8            3.9

Table 2: Average number of excerpts selected and sources used in article creation for test articles.

Baselines Our first baseline, Search, relies solely on search engine ranking for content selection. Using the article title as a query – e.g., Bacillary Angiomatosis – this method selects the web page that is ranked first by the search engine. From this page we select the first $k$ paragraphs, where $k$ is defined in the same way as in our full model. If there are fewer than $k$ paragraphs on the page, all paragraphs are selected, but no other sources are used. This yields a document of comparable size with the output of our system. Despite its simplicity, this baseline is not naive: extracting material from a single document guarantees that the output is coherent, and a page highly ranked by a search engine may readily contain a comprehensive overview of the subject.

Our second baseline, No Template, does not use a template to specify desired topics; therefore, there are no constraints on content selection. Instead, we follow a simplified form of previous work on biography creation, where a classifier is trained to distinguish biographical text (Zhou et al., 2004; Biadsy et al., 2008).

In this case, we train a classifier to distinguish domain-specific text. Positive training data is drawn from all topics in the given domain corpus. To find negative training data, we perform the search procedure as in our full model (see Section 3.1) using only the article titles as search queries. Any excerpts which have very low similarity to the original articles are used as negative examples. During the decoding procedure, we use the same search procedure. We then classify each excerpt as relevant or irrelevant and select the $k$ non-redundant excerpts with the highest relevance confidence scores.

Our third baseline, Disjoint, uses the ranking perceptron framework as in our full system; however, rather than perform an optimization step during training and decoding, we simply select the highest-ranked excerpt for each topic. This equates to standard linear classification for each section individually.

In addition to these baselines, we compare against an Oracle system. For each topic present in the human-authored article, the Oracle selects the excerpt from our full model’s candidate excerpts with the highest cosine similarity to the human-authored text. This excerpt is the optimal automatic selection from the results available, and therefore represents an upper bound on our excerpt selection task. Some articles contain additional topics beyond those in the template; in these cases, the Oracle system produces a longer article than our algorithm.
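The Oracle's selection rule is simple enough to sketch directly. The version below assumes a single pool of candidate excerpts and unweighted bag-of-words cosine similarity; whether the original system pools candidates across topics or weights terms is not stated here, so both choices, along with the function names, are assumptions.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two texts using raw word counts."""
    u, v = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(count * v[word] for word, count in u.items())
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0


def oracle_select(candidates, human_sections):
    """For each human-authored section, pick the most similar candidate excerpt."""
    return [max(candidates, key=lambda e: cosine(e, section))
            for section in human_sections]
```

Because the Oracle emits one excerpt per section of the human-authored article rather than one per template topic, its output can be longer than that of the full model.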

Table 2 shows the average number of excerpts selected and sources used in articles created by our full model and each baseline.

Automatic Evaluation To assess the quality of the resulting overview articles, we compare them with the original human-authored articles. We use ROUGE, an evaluation metric employed at the Document Understanding Conferences (DUC), which assumes that proximity to human-authored text is an indicator of summary quality. We use the publicly available ROUGE toolkit (Lin, 2004) to compute recall, precision, and F-score for ROUGE-1. We use the Wilcoxon Signed Rank Test to determine statistical significance.

Analysis of Human Edits In addition to our automatic evaluation, we perform a study of reactions to system-produced articles by the general public. To achieve this goal, we insert automatically created articles⁴ into Wikipedia itself and examine the feedback of Wikipedia editors. Selection of specific articles is constrained by the need to find topics which are currently of “stub” status that have enough information available on the Internet to construct a valid article. After a period of time, we analyzed the edits made to the articles to determine the overall editor reaction. We report results on 15 articles in the Diseases category.⁵

⁴ In addition to the summary itself, we also include proper citations to the sources from which the material is extracted.

⁵ We are continually submitting new articles; however, we report results on those that have at least a 6-month history at the time of writing.

                       Recall   Precision   F-score
Amer. Film Actors
  Search                0.09      0.37       0.13∗
  No Template           0.33      0.50       0.39∗
  Disjoint              0.45      0.32       0.36∗
  Full Model            0.46      0.40       0.41
  Oracle                0.48      0.64       0.54∗
Diseases
  Search                0.31      0.37       0.32†
  No Template           0.32      0.27       0.28∗
  Disjoint              0.33      0.40       0.35∗
  Full Model            0.36      0.39       0.37
  Oracle                0.59      0.37       0.44∗

Table 3: Results of ROUGE-1 evaluation.
∗ Significant with respect to our full model for p ≤ 0.05.
† Significant with respect to our full model for p ≤ 0.10.
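For readers unfamiliar with the metric, ROUGE-1 recall, precision, and F-score are unigram-overlap measures between a system article and the human-authored reference. The sketch below is a simplified single-reference version written for illustration; it is not the ROUGE toolkit, omits its stemming and stopword options, and uses plain whitespace tokenization, all of which are assumptions.

```python
from collections import Counter


def rouge_1(candidate, reference):
    """Unigram-overlap recall, precision, and F-score against one reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # matches clipped by reference counts
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return recall, precision, f_score
```

Under this definition a short system output that copies a few reference words exactly gets high precision but low recall, which is the pattern seen for the Search baseline on American Film Actors.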

Since Wikipedia is a live resource, we do not repeat this procedure for our baseline systems. Adding articles from systems which have previously demonstrated poor quality would be improper, especially in Diseases. Therefore, we present this analysis as an additional observation rather than a rigorous technical study.

5 Results

Automatic Evaluation The results of this evaluation are shown in Table 3. Our full model outperforms all of the baselines. By surpassing the Disjoint baseline, we demonstrate the benefits of joint classification. Furthermore, the high performance of both our full model and the Disjoint baseline relative to the other baselines shows the importance of structure-aware content selection. The Oracle system, which represents an upper bound on our system’s capabilities, performs well.

The remaining baselines have different flaws: Articles produced by the No Template baseline tend to focus on a single topic extensively at the expense of breadth, because there are no constraints to ensure diverse topic selection. On the other hand, performance of the Search baseline varies dramatically. This is expected; this baseline relies heavily on both the search engine and individual web pages. The search engine must correctly rank relevant pages, and the web pages must provide the important material first.

Analysis of Human Edits The results of our observation of editing patterns are shown in Table 4. These articles have resided on Wikipedia for a period of time ranging from 5 to 11 months. All of them have been edited, and no articles were removed due to lack of quality. Moreover, ten automatically created articles have been promoted by human editors from stubs to regular Wikipedia entries based on the quality and coverage of the material. Information was removed in three cases for being irrelevant: one entire section and two smaller pieces. The most common changes were small edits to formatting and introduction of links to other Wikipedia articles in the body text.

Type                       Count
Total articles              15
Promoted articles           10
Edit types
  Intra-wiki links          36
  Formatting                25
  Grammar                   20
  Minor topic edits          2
  Major topic changes        1
Total edits                 85

Table 4: Distribution of edits on Wikipedia.

6 Conclusion

In this paper, we investigated an approach for creating a multi-paragraph overview article by selecting relevant material from the web and organizing it into a single coherent text. Our algorithm yields significant gains over a structure-agnostic approach. Moreover, our results demonstrate the benefits of structured classification, which outperforms independently trained topical classifiers. Overall, the results of our evaluation combined with our analysis of human edits confirm that the proposed method can effectively produce comprehensive overview articles.

This work opens several directions for future research. Diseases and American Film Actors exhibit fairly consistent article structures, which are successfully captured by a simple template creation process. However, with categories that exhibit structural variability, more sophisticated statistical approaches may be required to produce accurate templates. Moreover, a promising direction is to consider hierarchical discourse formalisms such as RST (Mann and Thompson, 1988) to supplement our template-based approach.

Acknowledgments

The authors acknowledge the support of the NSF (CAREER grant IIS-0448168, grant IIS-0835445, and grant IIS-0835652) and NIH (grant V54LM008748). Thanks to Mike Collins, Julia Hirschberg, and members of the MIT NLP group for their helpful suggestions and comments. Any opinions, findings, conclusions, or recommendations expressed in this paper are those of the authors, and do not necessarily reflect the views of the funding organizations.

References

Eugene Agichtein, Steve Lawrence, and Luis Gravano. 2001. Learning search engine specific query transformations for question answering. In Proceedings of WWW, pages 169–178.

Regina Barzilay and Noemie Elhadad. 2003. Sentence alignment for monolingual comparable corpora. In Proceedings of EMNLP, pages 25–32.

Regina Barzilay and Lillian Lee. 2004. Catching the drift: Probabilistic content models, with applications to generation and summarization. In Proceedings of HLT-NAACL, pages 113–120.

Regina Barzilay, Kathleen R. McKeown, and Michael Elhadad. 1999. Information fusion in the context of multi-document summarization. In Proceedings of ACL, pages 550–557.

Fadi Biadsy, Julia Hirschberg, and Elena Filatova. 2008. An unsupervised approach to biography production using wikipedia. In Proceedings of ACL/HLT, pages 807–815.

James Clarke and Mirella Lapata. 2007. Modelling compression with discourse constraints. In Proceedings of EMNLP-CoNLL, pages 1–11.

William W. Cohen, Robert E. Schapire, and Yoram Singer. 1998. Learning to order things. In Proceedings of NIPS, pages 451–457.

Michael Collins. 2002. Ranking algorithms for named-entity extraction: Boosting and the voted perceptron. In Proceedings of ACL, pages 489–496.

Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest. 1992. Introduction to Algorithms. The MIT Press.

Hal Daume III and Daniel Marcu. 2005. A large-scale exploration of effective global features for a joint entity detection and tracking model. In Proceedings of HLT/EMNLP, pages 97–104.

Dina Demner-Fushman and Jimmy Lin. 2007. Answering clinical questions with knowledge-based and statistical techniques. Computational Linguistics, 33(1):63–103.

Elena Filatova and John M. Prager. 2005. Tell me what you do and I’ll tell you what you are: Learning occupation-related activities for biographies. In Proceedings of HLT/EMNLP, pages 113–120.

Elena Filatova, Vasileios Hatzivassiloglou, and Kathleen McKeown. 2006. Automatic creation of domain templates. In Proceedings of ACL, pages 207–214.

Atsushi Fujii and Tetsuya Ishikawa. 2004. Summarizing encyclopedic term descriptions on the web. In Proceedings of COLING, page 645.

Jade Goldstein, Vibhu Mittal, Jaime Carbonell, and Mark Kantrowitz. 2000. Multi-document summarization by sentence extraction. In Proceedings of NAACL-ANLP, pages 40–48.

Marti A. Hearst. 1994. Multi-paragraph segmentation of expository text. In Proceedings of ACL, pages 9–16.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Proceedings of ACL, pages 74–81.

Inderjeet Mani and Mark T. Maybury. 1999. Advances in Automatic Text Summarization. The MIT Press.

William C. Mann and Sandra A. Thompson. 1988. Rhetorical structure theory: Toward a functional theory of text organization. Text, 8(3):243–281.

Tomasz Marciniak and Michael Strube. 2005. Beyond the pipeline: Discrete optimization in NLP. In Proceedings of CoNLL, pages 136–143.

Ryan McDonald. 2007. A study of global inference algorithms in multi-document summarization. In Proceedings of ECIR, pages 557–564.

Vivi Nastase and Michael Strube. 2008. Decoding wikipedia categories for knowledge acquisition. In Proceedings of AAAI, pages 1219–1224.

Vivi Nastase. 2008. Topic-driven multi-document summarization with encyclopedic knowledge and spreading activation. In Proceedings of EMNLP, pages 763–772.

Dragomir R. Radev, Hongyan Jing, and Malgorzata Budzikowska. 2000. Centroid-based summarization of multiple documents: sentence extraction, utility-based evaluation, and user studies. In Proceedings of ANLP/NAACL, pages 21–29.

Ehud Reiter and Robert Dale. 2000. Building Natural Language Generation Systems. Cambridge University Press, Cambridge.

Benjamin Snyder and Regina Barzilay. 2007. Multiple aspect ranking using the good grief algorithm. In Proceedings of HLT-NAACL, pages 300–307.

Ralph M. Weischedel, Jinxi Xu, and Ana Licuanan. 2004. A hybrid approach to answering biographical questions. In New Directions in Question Answering, pages 59–70.

Fei Wu and Daniel S. Weld. 2007. Autonomously semantifying wikipedia. In Proceedings of CIKM, pages 41–50.

Ying Zhao, George Karypis, and Usama Fayyad. 2005. Hierarchical clustering algorithms for document datasets. Data Mining and Knowledge Discovery, 10(2):141–168.

L. Zhou, M. Ticrea, and Eduard Hovy. 2004. Multi-document biography summarization. In Proceedings of EMNLP, pages 434–441.
