RPI BLENDER TAC-KBP2015 System Description

Yu Hong1,2, Di Lu1, Dian Yu1, Xiaoman Pan1, Xiaobin Wang2, Yadong Chen2, Lifu Huang1, Heng Ji1

1 Computer Science Department, Rensselaer Polytechnic Institute
2 Computer Science Department, Soochow University
[email protected], [email protected]

1 Introduction

This year the RPI BLENDER team participated in five tasks at KBP2015 and achieved top 1 in four of them: Event Nugget Detection (section 3), Event Nugget Coreference Resolution (section 4), Cold-start Slot Filling Validation Filtering (section 5) and Tri-lingual Entity Linking, as well as top 2 in Tri-lingual Entity Discovery and Linking (section 2).

2 Tri-lingual Entity Discovery and Linking

2.1 Entity Mention Identification

To extract English name mentions, we apply a linear-chain CRFs model trained on the ACE 2003-2005 corpora (Li et al., 2012a). For Chinese and Spanish, we use the Stanford name tagger (Finkel et al., 2005). We also encode several regular-expression-based rules to extract poster name mentions in discussion forum posts. In this year's task, person nominal mention extraction is added. There are two major challenges: (1) only person nominal mentions referring to specific, individual real-world entities need to be extracted, so a system should be able to distinguish specific from generic person nominal mentions; (2) within-document coreference resolution should be applied to cluster person nominal and name mentions. We apply heuristic rules to address these two challenges: (1) we consider person nominal mentions that appear after indefinite articles (e.g., a/an) or conditional conjunctions (e.g., if) as generic. The person nominal mention extraction F1 score of this approach is around 46% on English training data. (2) For coreference resolution, if the closest mention to a person nominal mention is a name, we consider them coreferential. The accuracy of this approach is 67% using perfect mentions on English training data.
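A minimal sketch of the two heuristics above, assuming mentions are simple dictionaries with character offsets; the cue-word list and data layout are illustrative choices, not the system's exact implementation.

```python
# Heuristic handling of person nominal mentions (illustrative sketch).
GENERIC_CUES = {"a", "an", "if"}  # indefinite articles and a conditional conjunction

def is_generic(tokens, mention_start):
    """Rule (1): mark a person nominal mention as generic if the preceding
    token is an indefinite article or a conditional conjunction."""
    if mention_start == 0:
        return False
    return tokens[mention_start - 1].lower() in GENERIC_CUES

def corefer_with_closest_name(mentions, nominal):
    """Rule (2): link a nominal mention to the closest preceding mention
    if and only if that mention is a name."""
    preceding = [m for m in mentions if m["end"] <= nominal["start"]]
    if not preceding:
        return None
    closest = max(preceding, key=lambda m: m["end"])
    return closest if closest["type"] == "NAME" else None
```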

2.2 Unsupervised Entity Linking

Our entity linking system is a domain- and language-independent system (Wang et al., 2015) based on an unsupervised collective inference approach. Given a set of English entity mentions M = {m1, m2, ..., mn}, our system first constructs a graph over all entity mentions based on their co-occurrence within a paragraph. Then, for each entity mention m, our system uses the surface form dictionary <f, e1, e2, ..., ek>, where e1, e2, ..., ek is the set of entities with surface form f according to their properties (e.g., labels, names, aliases), to locate a list of candidate entities e ∈ E and computes an importance score with an entropy-based approach (Zheng et al., 2014). Finally, it computes a similarity score for each entity mention and candidate entity pair <m, e> and selects the candidate with the highest score as the entity to link to. For Chinese and Spanish, we first translate mentions into English using name translation dictionaries mined with the approaches described in (Ji et al., 2009). If mentions cannot be found in the dictionaries, we use Pinyin for Chinese mentions and normalize special characters for Spanish mentions.
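A compressed sketch of the per-mention candidate ranking step, assuming a pre-built surface form dictionary; the scoring functions stand in for the entropy-based importance (Zheng et al., 2014) and similarity measures, and the way they are combined here is an assumption, not the paper's exact formula.

```python
# Illustrative candidate selection for one mention (not the full collective inference).
def link_mention(mention, surface_form_dict, importance, similarity):
    """surface_form_dict: surface form -> list of candidate KB entities.
    importance(e): salience score of a candidate entity (assumed given).
    similarity(m, e): mention/candidate similarity score (assumed given)."""
    candidates = surface_form_dict.get(mention["surface"], [])
    if not candidates:
        return None  # NIL mention: no KB entry found
    # Combining the two scores multiplicatively is an assumption of this sketch.
    return max(candidates, key=lambda e: importance(e) * similarity(mention, e))
```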

2.3 Linking Feedback for Typing

This year's pre-defined entity types are expanded to Person, Geo-political Entity, Organization, Location and Facility. Therefore, we implement a fine-grained entity typing system based on linking feedback and map the fine-grained types back to these five types. We utilize the Abstract Meaning Representation (AMR) corpus (Banarescu et al., 2013), which contains over 100 fine-grained entity types and human-annotated KB titles. DBPedia1 also provides rich types for each page. Therefore, we generate a mapping table between AMR types and DBPedia rdf:type (e.g., university - TechnicalUniversitiesAndColleges). Finally, we can get a list of typing candidates for each linking candidate. For example, given a mention RPI, we can obtain a list of linking candidates [Rensselaer Polytechnic Institute, Rensselaer at Hartford, Lally School of Management & Technology, ...] using our entity linking system. Each linking candidate has a list of mapped AMR types, e.g., Rensselaer Polytechnic Institute: [university - EducationalInstitution, university - UniversitiesAndCollegesInNewYork, organization - Organisation, ...]. If the confidence value of the top 1 linking candidate is reliable, we only select its typing result. Otherwise, we merge the typing results of all candidates. The typing F1 score is 93.2% for perfect mentions on English training data.
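A sketch of the linking-feedback typing logic under the assumption that each linking candidate carries a confidence and a list of AMR types; the tiny mapping table and the confidence threshold are hypothetical placeholders for the real resources.

```python
# Illustrative typing via linking feedback.
COARSE_MAP = {"university": "ORG", "organization": "ORG", "city": "GPE"}
# Toy excerpt of the AMR-type -> five-type mapping table (assumption).

def type_mention(linking_candidates, confidence_threshold=0.8):
    """linking_candidates: list of (entity, confidence, amr_types),
    sorted by confidence. Returns coarse types fed back from linking."""
    top_entity, top_conf, top_types = linking_candidates[0]
    if top_conf >= confidence_threshold:
        pool = top_types                  # trust the top candidate alone
    else:
        pool = [t for _, _, ts in linking_candidates for t in ts]  # merge all
    return {COARSE_MAP[t] for t in pool if t in COARSE_MAP}
```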

2.4 Build EDL for a New Language Overnight

We also propose a novel unsupervised entity typing framework that combines symbolic and distributional semantics. We start by learning general embeddings for each entity mention, compose the embeddings of specific contexts using linguistic structures, link the mention to knowledge bases and learn its related knowledge representations. Then we develop a novel joint hierarchical clustering and linking algorithm to type all mentions using these representations. This framework doesn't rely on any annotated data, predefined typing schema, or hand-crafted features; it is therefore highly extensible and can be quickly adapted to new languages. Different languages may have different linguistic resources available. For example, English has rich linguistic resources (e.g., Abstract Meaning Representation) that can be utilized to model local contexts, while some languages don't. For these low-resource languages, we can utilize the embeddings of context words that occur within a limited-size window instead of compositional specific-context embeddings based on rich linguistic resources. In addition, for low-resource languages there may not be enough unlabeled documents to train word embeddings, and KBs may not be available. In this case, we can utilize other feature representations such as bag-of-words tf-idf instead of embedding-based representations. To prove this, we apply our framework to two low-resource languages: Hausa and Yoruba. The mention-level typing accuracy with perfect boundaries is very promising: 85.42% for Hausa and 72.26% for Yoruba. Experiments on various languages show performance comparable with state-of-the-art supervised typing systems trained on a large amount of labeled data. Then we can apply the above unsupervised language-independent linking component to link each mention to the English KB and use the linking feedback to refine the typing results.

1 http://dbpedia.org

3 Nugget Detection

3.1 Baseline Maximum Entropy Model

We utilize a Maximum Entropy model (MaxEnt) to predict the event type and realis type of each candidate event nugget, based on the linguistic features summarized in Table 1. They can be roughly divided into the following categories:

Lexical features include nbr, sense and bro. nbr refers to the unigrams or bigrams within a text window of size 2. bro is a Brown cluster, learned from the ACE English corpus (Brown et al., 1992); we used the clusters with prefixes of length 13, 16 and 20 for each token. The synonyms are taken from the most likely synset in WordNet (George, 1995).

Syntactic features include dg words, dg typ, be pron and be mod. They represent the characteristics of dependency and coreference.

Entity features include en typ and en typ(dg), which capture the participation of entities in events of specific types.

The statistical feature is ev typ, which is learned from the distribution of event types over nuggets in the training corpus.

Feature      Description
nbr          neighbor grams and POS
sense        lemma, synonyms and case of the target token
bro          Brown clusters (Brown et al., 1992)
dg words     dependent and governor words
be pron      whether the target token is a non-referential pronoun
be mod       whether the target is a modifier of a job title
en typ       types of entities within a text window of size 3, if any
en typ(dg)   types of the dependent and governor entities of the target token
ev typ       argmax event type of the target token in training data

Table 1: Features for Event Nugget Classification

Feature      Description
nbr          neighbor grams and POS
dg words     dependent and governor words
dg typ       dependency types associated with dg words
en typ(dg)   types of the dependent and governor entities of the target token
nug attr     lemma, POS, event type and subtype of the nugget
fv           first verb within the clause containing the nugget
co neg       whether the nugget occurs with a negative word (e.g., not)
co unc       whether the nugget occurs with a modifier of uncertainty (e.g., maybe)
co madv      whether the nugget occurs with a modal adverb (e.g., possibly)
dis          distance between the nugget and co neg or co unc

Table 2: Features for Realis Classification

Table 2 shows the features used in realis classification. Besides the basic features (nbr, dg words, dg typ, en typ(dg) and nug attr), we employ words that express negation and uncertainty through co neg, co unc, and co madv. The feature dis is used to identify the scope affected by negation and hypothesis. In addition, we consider the first verb of a clause (fv); an empirical finding shows that some fvs are capable of constraining the realis type in a clause.
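A minimal sketch of how such features might be assembled into a MaxEnt feature dictionary; the feature names follow the tables above, but the helper layout and the window handling are illustrative assumptions rather than the system's exact extractors.

```python
# Illustrative feature assembly for one candidate nugget (token index i).
def nugget_features(tokens, pos, i, window=2):
    """Builds a sparse feature dict in the style of Table 1.
    tokens/pos: surface tokens and POS tags for the sentence."""
    feats = {}
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    for j in range(lo, hi):
        if j != i:
            feats[f"nbr_word={tokens[j]}"] = 1.0   # neighbor unigrams
            feats[f"nbr_pos={pos[j]}"] = 1.0       # neighbor POS
    feats[f"lemma={tokens[i].lower()}"] = 1.0      # stand-in for the sense feature
    # Brown-cluster, dependency and entity features would be added here,
    # given the corresponding resources (omitted in this sketch).
    return feats
```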

3.2 Using Homogeneous Data to Improve Nugget Classification

Homogeneous data refers to a set of samples that have common attributes. The attributes can be predefined in terms of specific needs. In rhetorical analysis of discourse, for example, we employ stylistic characteristics (narrative, argumentative, lyric, etc.) as the attributes for homogeneity detection. Stylistically homogeneous discourses generally share similar rhetorical structure, which helps an analyzer predict the rhetorical structure of a specific discourse sample by learning from its homogeneous samples. Similarly, in other fields we may consider Part-of-Speech based homogeneity detection for machine learning on syntax, semantics for grammar, pragmatics for word sense, etc.

In this paper, we regard an occurrence of a nugget as a sample. We focus on the application of pragmatically homogeneous samples in event nugget classification.

Context is the most important aspect of pragmatics. In our case, it can be used to verify whether the samples of a nugget indicate events of the same type. See the following sentences, where the samples of the nugget "death" in (1) and (2) occur in very similar contexts and both indicate a "Die" event. By contrast, the samples in (1) and (3) occur in different contexts and indicate different events. Accordingly, we define homogeneous samples as those that occur in similar contexts. We call such samples pragmatically homogeneous samples, meaning that they convey similar senses in a specific context and trigger events of the same type.

1) UN puts the conflict's death [Die/Actual] toll at over 67,000.

2) I urge you to take in a variety of sources in researching the civilian death [Die/Actual] toll of the conflict.

3) Elliott was the third convict put to death [Execute/Actual] in the state since the start of the year and number 105 since 1976. (Here, death refers to the death penalty.)


It is worth noting that not only the pragmatically homogeneous samples of a single nugget, but also those shared among different nuggets, can take advantage of homogeneity in revealing the same event type. See sentences 4) and 5), where the samples of the nuggets "executed" and "death" occur in similar contexts, both of which evoke an "Execute" event.

4) A convicted murderer was executed [Execute/Actual] in the electric chair Tuesday in Virginia.

5) Larry Elliott, 60, was the first person put to death [Execute/Actual] by electrocution since June of last year.

In order to reduce the errors caused by uncertainty, a MaxEnt model always makes decisions in terms of the most reliable prior knowledge. Accordingly, it always drives the classifier to assign a sample to the type class represented by its homogeneous samples. For example, suppose that the nugget "death" in either sentence (2) or (3) is a training sample, while (1) is a test sample. As mentioned above, the nugget in (1) is pragmatically homogeneous with (2) but heterogeneous with (3). From the viewpoint of pragmatic features, therefore, the context in (2) provides reliable prior knowledge while (3) doesn't. To ensure the reliability of classification, the MaxEnt model will predict (1) to have the same event type as (2), not (3).

It is predictable that the nugget classification system can be optimized if trained on rich homogeneous samples. The remaining problems are then 1) how to enrich homogeneous data and 2) how to enable learning among the homogeneous samples. To solve these problems, we propose a feature-oriented transfer learning method and a lexicon-based enrichment method for homogeneous data.

3.2.1 Feature Oriented Transfer Learning

We detect pragmatically homogeneous samples of a nugget in terms of contextual similarity. The context appears as the words co-occurring with a sample, such as those in a chunk, text window or dependency tree. In some cases, a similar context means that two nuggets have very similar contextual words (see 1) and 2)). In other cases, the agreement in contexts can be reached only at the level of semantics. See 9) and 10), where the contexts have the same meaning but consist of very different words.

9) A good deal of this hatred is related to the fact that Congress has a tradition of preventing its own members convicted of crimes from ever going to jail [Arrest-Jail/Other].

10) Scholars of Brazil's judicial system say legislators in corruption scandals often avoid jail [Arrest-Jail/Other].

We propose to employ semantic-feature-based transfer learning for training the MaxEnt model. Transfer learning is a process of learning from homogeneous data. A feature-based transfer learner develops a uniform feature representation to characterize the common attributes of homogeneous samples and thus unifies them in the feature space. This usually facilitates the learning process over homogeneous samples.

We use frame semantics for characterizing the contexts of homogeneous samples. Given a sample along with its context, we transform the words in the context into their semantic frames. A semantic frame is the conceptual representation of a cluster of words, reflecting the general semantics of those words; see Table 3. Using such frames as features, we can generate a nearly uniform feature representation for semantically similar contexts. See 11), which exhibits the common semantic frames of some key words in 9) and 10).

11) Frame: Leadership (see Table 3)
Lexical units in 9) and 10): congress, legislator

Frame: Preventing (see Table 3)
Lexical units in 9) and 10): prevent, avoid

Frame: Being Incarcerated
Lexical units in 9) and 10): jail

Accordingly, homogeneous samples that have similar contexts can always be transferred to the same region in the feature space, far from the heterogeneous samples. In terms of this space partition, the MaxEnt model is capable of assigning a nugget to the class type of its semantically-similar homogeneous samples, even if its context is very different in content from those samples, as in 9) and 10).

frm 1: Preventing   prevent, stave off, avoid, avert, obviate, prohibit, upset, etc.
frm 2: Leadership   congressman, legislator, administer, bishop, chairman, chief, etc.

Table 3: Examples of semantic frames and the lexical units they contain

In practice, we perform the semantic-level transfer learning for all kinds of homogeneous samples, i.e., the nuggets of 39 event types, including the 38 KBP event types and an N/A type (meaning "Other"). In the feature space employed by the original classification system, we replace the contextual features with their semantic frames. The contextual features include the words co-occurring with a candidate nugget in the same chunk, text window and first-level dependency subtree. We retrain the MaxEnt model using the revised feature space over the same training data.
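A sketch of the frame-substitution step, assuming a FrameNet-style lookup from lexical unit to frame names; the lookup table here is a toy excerpt, and backing off to the lexical feature for unmapped words is an assumption of this sketch.

```python
# Illustrative replacement of contextual word features with semantic frames.
WORD_TO_FRAMES = {                      # toy excerpt of a FrameNet-style index
    "prevent": ["Preventing"], "avoid": ["Preventing"],
    "congress": ["Leadership"], "legislator": ["Leadership"],
    "jail": ["Being_Incarcerated"],
}

def frame_features(context_words):
    """Map each context word to its semantic frame(s) so that semantically
    similar contexts share a nearly uniform feature representation."""
    feats = set()
    for w in context_words:
        frames = WORD_TO_FRAMES.get(w.lower())
        if frames:
            feats.update(f"frame={f}" for f in frames)
        else:
            feats.add(f"word={w}")   # back off to the lexical feature
    return feats
```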

3.2.2 Lexicon Based Enrichment of Homogeneous Data

Because of the lack of training data, the diversity of nuggets in the test data and the generally narrow scope of application of homogeneity, we suspect that for some test samples there isn't any reliable homogeneous sample to use during learning. Our solution is to introduce more potential homogeneous samples into the training data from external linguistic resources. We consider two classes of words for the enrichment of homogeneous samples: those that can serve as a nugget to trigger an event even though they never occurred in the training data, and those that have occurred as known ground-truth nuggets but are now associated with some new context. We call the former brand-new (BN) homogeneous samples and the latter half-new (HN) samples.

In practice, for either BN or HN samples, it is necessary to add their contexts to the training data, because the MaxEnt model can learn the pragmatic features only from the contexts. From this perspective, our goal is actually to diversify the pragmatic contexts specific to a certain event type.

Acquisition of BN Samples

We propose to acquire BN samples from FrameNet, which organizes lexical units (words and phrases) into different clusters and assigns a conceptual representation to each cluster2. The conceptual representation is also called a semantic frame. Moreover, FrameNet provides an index structure that facilitates the search for semantic frames given a lexical unit as a query. By accessing frames, we can obtain all the lexical units semantically similar to the query.

Technically, we use a ground-truth nugget as the query. By going through the index, we discover all the related semantic frames and, further, the semantically-similar lexical units. In terms of the prior correspondence between the ground-truth nugget and an event type, we associate the retrieved lexical units with that event type, generating event-type-specific candidate nuggets. By using all ground-truth nuggets as queries, we search all possible candidate nuggets in FrameNet. Based on this approach we can filter the duplicated candidates for each KBP event type to eventually produce unique new nuggets.

As mentioned above, the newly found nuggets are not yet eligible BN samples until we attach some contexts to each of them. We acquire the contexts using an ad-hoc Information Retrieval (IR) technique. The detailed retrieval procedure is as follows:

Input: a nugget x
Output: the set of contexts C

Step 1: Use x as a query to search for the related documents D in the KBP dataset.
Step 2: Pick a document d from D.
Step 3: Extract a sentence Scon that contains x, along with its left and right neighbor sentences Snei, from d.
Step 4: Use the words co-occurring with x in chunks and dependency trees in Scon as the contexts cinc and cdep respectively; use the words in a radius-fixed text window as the context cwin (the window may span contents in both Scon and Snei).
Step 5: If C is empty, add the triple T(cinc, cdep, cwin) to C; otherwise calculate the similarity of the current triple Tcur to every existing triple Texi in C, using the VSM-based cosine metric. If the similarity to each Texi is smaller than a threshold θsim, add Tcur to C; otherwise skip to Step 6 directly.
Step 6: If the number of context triples T in C is larger than n, break out of the loop.
Step 7: If there is no further Scon in d, go to Step 2; otherwise go to Step 3.

2 https://framenet.icsi.berkeley.edu/fndrupal/

In the procedure, we set n to 50 and θsim to 0.05. Accordingly, we obtain various contexts T(cinc, cdep, cwin) for each newly found nugget. We use every pair of a nugget x and a context T as a BN sample.
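A sketch of this context-collection loop in Python, with cosine similarity over bag-of-words vectors following Step 5; the search, sentence-extraction and triple-extraction helpers are assumed inputs, and comparing triples as merged word bags is a simplification of this sketch.

```python
import math
from collections import Counter

def cosine(a, b):
    """VSM cosine similarity over bag-of-words Counters."""
    dot = sum(v * b[w] for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def collect_contexts(x, search, sentences_with, extract_triple, n=50, theta_sim=0.05):
    """Steps 1-7 of the retrieval procedure. search(x) returns documents;
    sentences_with(d, x) yields (sentence, neighbors) pairs containing x;
    extract_triple(...) returns the (c_inc, c_dep, c_win) word bags.
    All three helpers are assumptions of this sketch."""
    contexts, vectors = [], []
    for d in search(x):                                  # Steps 1-2
        for sent, neighbors in sentences_with(d, x):     # Steps 3 and 7
            triple = extract_triple(sent, neighbors)     # Step 4
            vec = Counter()
            for bag in triple:
                vec.update(bag)
            # Step 5: keep a triple only if it differs from every kept triple
            if all(cosine(vec, v) < theta_sim for v in vectors):
                contexts.append(triple)
                vectors.append(vec)
            if len(contexts) >= n:                       # Step 6
                return contexts
    return contexts
```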

In the set of BN samples, however, there is a lot of noise. Most cases are caused by ambiguous ground-truth nuggets. For example, the nugget "strike" can have many meanings, such as "walk off the job and protest", "hit in a forceful way" and "be impressed", corresponding to the semantic frames "Political actions", "Attack/Cause harm" and "Coming to believe" respectively. The nugget indicates a "Demonstrate", "Attack" or "Injury" event only if it conveys one of the former two meanings. In this case, the eligible semantic frames are "Political actions" and "Attack/Cause harm". Under the remaining meaning (i.e., "Coming to believe"), it cannot trigger any kind of KBP event.

To filter the noise, we need to disable the ineligible semantic frames at the beginning of the acquisition process. We employ two rules to identify such frames:

r1: FrameNet associates each semantic frame of a lexical unit with a POS tag, meaning the POS is fixed when the unit indicates a specific meaning. For example, the word "fine" conveys the semantic frame "Fining" (which means "penalty paid") only if its POS is noun or verb. Similarly, a nugget generally has a fixed POS when triggering a type of KBP event. For a ground-truth nugget, accordingly, we detect its general POS in triggering specific types of events. We then determine the semantic frames unassociated with that POS to be ineligible.

r2: Given all the ground-truth nuggets of a specific event type etyp and a semantic frame sfrm, we measure the agreement of the nuggets in taking on the meaning of sfrm. If the agreement is high, we determine sfrm to be eligible. Of course, in this case, sfrm should be associated with the event type etyp. It means that the lexical units in sfrm are very likely to trigger events of the type etyp. Therefore, they should be used as new candidate nuggets.

We measure the agreement in rule r2 by calculating the average joint probability of the nuggets of etyp occurring as lexical units in the semantic frame sfrm:

$$P(e_{typ}|s_{frm}) = \frac{\sum_{i=1,\, g_i \Rightarrow e_{typ}}^{n_e} Bool(g_i, s_{frm})}{n_e \, n_s}$$

$$Bool(g_i, s_{frm}) = \begin{cases} 1, & g_i \in s_{frm} \\ 0, & \text{otherwise} \end{cases}$$

where gi is a ground-truth nugget indicating an event with type etyp, ne is the total number of ground-truth nuggets known to trigger etyp, and ns is the number of lexical units in the semantic frame sfrm. The boolean function indicates whether gi occurs as a lexical unit of sfrm in FrameNet. The equation implies that a semantic frame is associated with a KBP event type only if many lexical units with such semantics have occurred as nuggets of that event type in the training data.

We only employ an sfrm whose agreement P(etyp|sfrm) is larger than θagr in acquiring BN samples for etyp. We empirically set θagr to 0.01.
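A direct transcription of the agreement score into Python; the data structures (nugget sets, frame membership sets) are assumed inputs.

```python
def agreement(etyp_nuggets, frame_units):
    """P(etyp|sfrm): count of ground-truth nuggets of an event type that
    appear as lexical units of the frame, normalized by n_e * n_s.
    etyp_nuggets: set of ground-truth nuggets g_i known to trigger etyp.
    frame_units: set of lexical units in the semantic frame sfrm."""
    n_e, n_s = len(etyp_nuggets), len(frame_units)
    if n_e == 0 or n_s == 0:
        return 0.0
    hits = sum(1 for g in etyp_nuggets if g in frame_units)  # Bool(g_i, sfrm)
    return hits / (n_e * n_s)

# A frame is kept for an event type when agreement(...) > theta_agr (0.01 in the paper).
```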

Acquisition of HN Samples

HN samples are ground-truth nuggets attached to new contexts. To acquire HN samples, we similarly employ an IR system to search for sentence-level contexts. To ensure the diversity of the contexts, we only search for and keep distinctive contexts. Different from the initial condition when searching for BN samples, the set of contexts C for the acquisition of HN samples is nonempty. The scale of the context sets is imbalanced: some are loaded with rich contexts, others with few. We focus on increasing the scale in the latter cases, adding new contexts to the sets until their scale reaches the level of the largest known set.

3.3 Topic based Nugget Disambiguation

There exist many ambiguous nuggets in reality. We define an ambiguous nugget as a word that may trigger multiple types of KBP events. Considering the utility of topics in word sense disambiguation, we suggest that topic is also an important clue to the real event type of a nugget.

The question now is how to use topic modeling for nugget disambiguation. It is predictably useful to generate a relation network between topics and the KBP event types; the relationship strength, empirically, enables the determination of the exact event type. Therefore, using the topic-event relationship as a discriminative feature is probably a feasible way to achieve this goal. However, the dataset is not large enough to provide robust estimates of the topic-event relationship, considering that a topic in the test data may never occur in the training data.

We propose to implement a Case-Oriented Real-Time topic-event relation detection system (CORT for short). CORT performs a case study S(g,d) for each nugget g in a test document d. It collects a cluster C of documents related to d in real time. At the document level, CORT regards the documents in C as having the same topic tC, similar to the topic td of d. Under the precondition that g frequently occurs in C, CORT detects the most probable event type eC of g and associates it with the topic tC. Given the strong similarity between the single-document topic td and the multi-document topic tC, CORT propagates the association of eC with tC to td.

The main characteristics of CORT include:

Reliable: CORT analyzes the topic-event relation in a dataset containing abundant related documents. Compared to the original single document, the related documents provide richer topic-related information for surveying the closest topic-event relation. The estimation, therefore, is more reliable.

Case oriented: CORT regards a test document as an independent scenario. A nugget in the document only evokes events of the type specific to the sole topic of the scenario. This facilitates a case study for every nugget in each test document.

Unsupervised: During the case study, CORT collects related documents for the test document in real time using a content-based local text search engine. Over these documents, CORT analyzes topic-event pairs in terms of their probability distributions.

CORT is used as post-processing behind the supervised nugget classification system. Its input is a document in the test data. The available prior knowledge is a list of ambiguous nuggets found by the classifier. CORT looks through the test document, word by word, and enforces disambiguation for every occurrence of an ambiguous nugget. Specifically, CORT re-estimates the event type in terms of the topic-event association strength, using the new estimate to replace the original one if they differ.

We employ a Lucene-based search engine to support the acquisition of related documents. For a test document, we use its keywords to formulate the query and retrieve related documents from the KBP dataset in terms of content similarity. From the ranked search results, we uniformly keep the top n (n=100) most similar documents as the reliable related documents.

To determine the topic-related event type, we introduce a margin model for determining the topic-event association strength. Given an ambiguous nugget g and the event types E it evokes in the cluster C of retrieved related documents, the margin M is calculated as:

$$M(e_i, e_j) = \frac{O(e_i) - O(e_j)}{O(e_i)\,O(e_j)}, \quad \forall e_i, e_j \in E$$

$$R(e^*, T_C) = \begin{cases} 1, & M(e^*, e_i) > \theta_M \\ 0, & \text{otherwise} \end{cases}$$

where e* is an event type evoked by g, O(·) denotes the occurrence frequency of an event type in C, TC is the common topic of the related documents, and θM is a threshold for the margin M. The margin reflects whether one event type of g occurs in C much more frequently than any other type. Constrained by the threshold (θM=0.7), the most frequently occurring event type in C is determined to be related to the topic TC; the rest are regarded as unreliable estimates. Accordingly, for each ambiguous nugget g, CORT retains only one reliable topic-related event type as the determination result.
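A sketch of the margin test, assuming the event-type occurrence frequencies over the retrieved cluster have already been counted; it follows the formulas above, with the single-type edge case handled by an assumption noted in the comment.

```python
def topic_related_type(freq, theta_m=0.7):
    """freq: occurrence frequency O(e) in the retrieved cluster C for each
    event type e evoked by the ambiguous nugget (all counts >= 1).
    Returns the type whose margin over every other type exceeds theta_m."""
    for e_star, o_star in freq.items():
        others = [o for e, o in freq.items() if e != e_star]
        if all((o_star - o) / (o_star * o) > theta_m for o in others):
            return e_star   # vacuously dominant when only one type is evoked
    return None             # no type is reliably topic-related
```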

For CORT to make decisions on the topic-event relationship, the event types of the ambiguous nuggets must be annotated beforehand. However, there are no ground-truth event type annotations in the documents retrieved in real time. To solve this problem, we can use a weaker IE system to detect the event types beforehand. Obviously, the detection results involve some errors. The event types used in the margin calculation, therefore, are pseudo-relevant data, which can easily cause a wrong decision on the topic-related event type. The way to reduce this risk is to enhance CORT with an adaptive boosting method: the IE system generates the event types in the retrieved related documents, while CORT serves as post-processing to improve the IE system, and this cycle proceeds iteratively.

3.4 Multi-word Nugget Identification

A nugget can also be a phrase or chunk, such as shoot up (phrase) and thrown in jail (chunk). A basic step to identify such nuggets, accordingly, is to parse sentence constituents and extract phrases and chunks. However, it is difficult to handle jump-over connections, such as throw in jail in throw him in jail. Instead of syntactic parsing, we employ dependency parsing to extract candidate multi-gram nuggets. Specifically, we use the dependency structures of the ground-truth multi-gram nuggets as templates to extract nuggets from the test data. Using these templates, we extract nearly all eligible event nuggets (0.99 recall). Nevertheless, the precision is very low due to two problems:

Some general dependency structures introduce large-scale noise, such as to be, to throw, in the, over and, by step, etc. In contrast, the eligible grams should be ones like throw out, in the jail, over and over, step by step, etc.

Most candidate multi-grams are not KBP-event-related nuggets.

To filter the noise, we propose a bilingual word alignment based multi-gram qualification verification method. It determines the eligibility of an English multi-gram by checking whether the gram has a consistently aligned Chinese word or phrase in bilingual data.

For type-specific multi-gram nugget detection, we evaluated three methods: rule-based detection, supervised classification using linguistic features, and semantics-based clustering. Instead of words, we use the multi-grams as the objects in both training and testing. In the clustering, we generate a semantic vector for both the candidate multi-gram nuggets and the ground-truth word-level ones. On this basis, we partition the nuggets into different clusters in terms of semantic similarity. In each cluster, we use the word-level nuggets and their event types as references to determine the event types of the semantically similar multi-gram nuggets. Similarly, we employ a propagation algorithm in the determination procedure.

3.5 Experiment and Analysis

3.5.1 Data and Evaluation Metrics

We trained and tested our system on the TAC KBP 2015 Event Nugget Training Data set, which contains 158 documents. We chose 126 documents as training data and 32 documents as testing data. We evaluated our system with four metrics: CEAFe, B3, MUC and BLANC.

3.5.2 Analysis of New Methods for Event Nugget Detection

We separately evaluated the proposed methods on the KBP 2015 nugget detection training data by 4-fold cross validation; the performance is shown in Table 4.

System          plain   typ     rls     typ+rls
Baseline (Bas)  63.19   53.28   44.91   37.35
Bas+FTL         64.84   60.05   45.95   42.21
Bas+BN          67.68   62.01   47.25   42.99
Bas+TND         67.93   63.09   47.57   43.73
all             68.16   62.64   47.52   43.34

typ: mention type; rls: realis status

Table 4: Improvement Achieved over the Baseline

BN denotes the collected type-specific Brand-New homogeneous samples, including BN nuggets acquired from FrameNet and their contexts retrieved from the local KBP dataset. We added the BN homogeneous samples to the original training data and retrained the MaxEnt classifier. The goal is to increase prior knowledge about potential nuggets as well as their contexts. The results show that the supervised MaxEnt classifier achieves better discriminative ability by learning from this richer knowledge.

By contrast, FTL doesn't rely on rich homogeneous samples. Table 4 shows that FTL achieves significant improvements over the baseline using the same training data. This proves that the MaxEnt model obtains a real-time reasoning ability through transfer learning from the homogeneous data at the semantic level.

TND yields the most significant improvement over the baseline. It performs particularly well in determining mention types, which proves that the topic-event relationship specific to a nugget benefits its disambiguation in different scenarios.

4 Event Nugget Coreference

4.1 Approach

We propose a method that views the event nugget coreference space as an undirected weighted graph in which the nodes represent the event nuggets and the edge weights indicate the coreference confidence between two event nuggets. There are two modules in our system: a Maximum Entropy model and a clustering module.

4.1.1 Pairwise Model

We train a Maximum Entropy model to generate the confidence matrix W. Each confidence value indicates the probability that there exists a coreference link C between event nuggets eni and enj:

$$P(C|en_i, en_j) = \frac{e^{\sum_k \lambda_k g_k(en_i, en_j, C)}}{Z(en_i, en_j)}$$

where gk(eni, enj, C) is a feature and λk is its weight; Z(eni, enj) is the normalizing factor. The feature sets used during training are listed in Table 5.

4.1.2 Clustering

Let EN = {enn : 1 ≤ n ≤ N} be the N event nuggets of one event type in a document and EH = {ehk : 1 ≤ k ≤ K} be the K event hoppers. Let f : EN → EH be the function mapping an event nugget enn ∈ EN to an event hopper ehk ∈ EH. Let coref : EN × EN → [0, 1] be the function that computes the coreference confidence value between two event nuggets eni, enj ∈ EN. For each event type in the document, we construct a graph G(V, E), where V = {enn | enn ∈ EN} and E = {(eni, enj, coref(eni, enj)) | eni, enj ∈ EN}. We then apply a hierarchical clustering algorithm to the graph. Because the number of hoppers K, which has to be set in advance, is unknown, we define a parameter ε to quantify the quality of the clustering results while varying the number of hoppers K from 1 to N, where N is the number of event nuggets. For each K, ε is the ratio of the number of conflicting edges (Nc) to the number of all edges (Ne); the smaller ε is, the better the clustering result:

$$\varepsilon_K = \frac{N_c}{N_e}$$

An edge between two event nuggets is defined as conflicting in either of the following two cases, where δ is the confidence threshold:

1. f(eni) = f(enj) but coref(eni, enj) < δ

2. f(eni) ≠ f(enj) but coref(eni, enj) > δ

Here is an example of how to compute ε. In Figure 1, five event nuggets are clustered into three event hoppers, and there are three conflicting edges (coref(en1, en5), coref(en2, en4), coref(en3, en4))


Features                                Remarks (EN1: the first event nugget, EN2: the second event nugget)
type subtype match                      1 if the types and subtypes of the event nuggets match
trigger pair exact match                1 if the spellings of the triggers in EN1 and EN2 exactly match
stem of the trigger match               1 if the stems of the triggers in EN1 and EN2 match
similarity of the triggers (WordNet)    quantized semantic similarity score (0-5) using the WordNet resource
similarity of the triggers (word2vec)   quantized semantic similarity score (0-5) using word2vec embeddings
token dist                              how many tokens lie between the triggers of EN1 and EN2
realis conflict                         1 if the realis values of EN1 and EN2 exactly match
sentence match                          1 if the sentences of EN1 and EN2 exactly match
extent match                            1 if the extents of EN1 and EN2 exactly match
POS match                               1 if the two sentences have the same NNP, CD

Table 5: Features for the Pairwise Model

Figure 1: Clustering Example. [Figure showing five event nuggets grouped into three clusters, with edges marked by weight ≥ threshold or weight < threshold and the edges coref(en1, en5), coref(en2, en5) and coref(en2, en3) labeled.]

according to our definition. Thus

$$\varepsilon_3 = \frac{3}{10} = 0.3$$
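A small sketch of the conflict-ratio computation for one candidate clustering, following the definition above; nugget and cluster representations are simplified, and treating a confidence exactly at δ as "confident" is an assumption of this sketch.

```python
from itertools import combinations

def epsilon(nuggets, cluster_of, coref, delta):
    """Fraction of conflicting edges in the complete graph over the nuggets.
    cluster_of: nugget -> hopper id; coref: (a, b) -> confidence in [0, 1]."""
    conflicts, edges = 0, 0
    for a, b in combinations(nuggets, 2):
        edges += 1
        same_hopper = cluster_of[a] == cluster_of[b]
        confident = coref(a, b) >= delta
        if same_hopper != confident:   # cases 1 and 2 in the definition above
            conflicts += 1
    return conflicts / edges if edges else 0.0

# The clustering with the smallest epsilon over K = 1..N is selected.
```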

4.1.3 Adding Cross-Media Features

In our recent work (Zhang et al., 2015), we proved that additional visual similarity features can be used to improve cross-document event coreference resolution for news videos. For event nugget coreference resolution we extend this approach to retrieve related images automatically, though we didn't use it for the evaluation because web access is not allowed. The rich visual concepts in images can help us identify event hoppers, even when they are difficult to identify based on text alone. For example, when two event nuggets have many different arguments, it's very challenging for text features to tell whether they are coreferential. However, we can use the arguments as keywords to search for images online. If we get similar or even identical images, it's quite likely that the two event nuggets refer to the same event hopper. For example, the following two event nuggets refer to the same event hopper, yet it's very difficult to make the correct judgement based on text information alone.

• EN1: Nigeria's graft-facing ex-governor {arrested} in Dubai.

• EN2: London's Metropolitan Police confirmed Ibori's {arrest}.

To make use of this multimedia information, Figure 2 shows the top retrieved images using ["Nigeria", "ex-governor", "arrested", "Dubai"] as the query via Google image search, and Figure 3 shows the top retrieved images using ["London", "Metropolitan", "Police", "Ibori", "arrest"] as the query. We can see that the retrieved images of these two event nuggets share many visual features, which indicates they might describe the arrest of the same person. Thus it's highly possible that they are coreferential.

Figure 2: Retrieved Images for EN1

Figure 3: Retrieved Images for EN2

Figures 4 and 5 show another example involving the following two event nuggets. The queries for image search are ["attacked", "oil", "facility", "Niger", "Delta"] and ["attacked", "Agip", "facility"] respectively.

• EN3: A militant group says it {attacked} an oil facility in the Niger Delta.

• EN4: A statement sent to reporters from the Joint Revolutionary Council claims militants {attacked} the Agip facility early Wednesday morning.

Figure 4: Retrieved Images for EN3

Figure 5: Retrieved Images for EN4

4.1.4 Feedback for Realis Improvement

The realis classification accuracy of the event nugget detection system is quite low (around 0.5). We hypothesize that event nuggets which refer to the same event hopper should have the same realis type. So we trained another classifier without the realis conflict feature, and used the event coreference results from this classifier to improve realis classification. For example, the following three event nuggets from the evaluation data are automatically clustered into the same event hopper:

• EN1: E3 prison Justice Arrest-Jail Generic 0.54

• EN2: E20 prison Justice Arrest-Jail Actual 1.00

• EN3: E21 prison Justice Arrest-Jail Actual 1.00

According to the output of our event nugget detection system, the realis values of these three event nuggets are ("Generic", 0.54), ("Actual", 1.00) and ("Actual", 1.00) respectively. While the realis confidence of EN1 is low, those of EN2 and EN3 are quite high. Thus we modify the realis of EN1 to "Actual" by majority voting. This feedback approach yields substantial improvement on realis classification.
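A sketch of this vote within a hopper, assuming each nugget carries a (realis, confidence) pair; weighting each vote by its confidence is an illustrative choice, since the paper only describes a majority vote.

```python
from collections import defaultdict

def vote_realis(hopper):
    """hopper: list of (realis_label, confidence) pairs for coreferential nuggets.
    Returns the majority realis, weighting each vote by its confidence (assumption)."""
    tally = defaultdict(float)
    for label, conf in hopper:
        tally[label] += conf
    return max(tally, key=tally.get)

# The paper's example hopper: EN1 is re-labeled "Actual".
print(vote_realis([("Generic", 0.54), ("Actual", 1.00), ("Actual", 1.00)]))
```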

4.2 Experiments

4.2.1 Analysis on Metrics

Figure 6: F-scores based on four metrics versus confidence threshold

Figure 6 shows the F-scores versus the threshold δ based on four evaluation metrics. The maximal CEAFe score, 0.83, was obtained at a threshold of 0.68, so we chose 0.68 as the confidence threshold. However, it's obvious that the F-scores based on the MUC and BLANC metrics are highly sensitive to δ.


This is similar to the observation from previous work on ACE event coreference resolution (Chen et al., 2015). Event nugget coreference is a new task, so we think it's not necessary to use these metrics just for comparison with previous work. We recommend using only B3 and CEAFe to evaluate event nugget coreference resolution.

4.2.2 Remaining Challenges

Even though we achieved top 1 in end-to-end event nugget detection and coreference resolution, some challenges remain. We observed that some coreferential event nuggets share very few features. For example, the following two event nuggets refer to the same hopper:

• EN1: After months of speculation, the Simon Property Group on Tuesday finally made an unsolicited $10 billion offer for General Growth Properties, its {bankrupt} rival.

• EN2: it is not sufficient to pre-empt the process we are undertaking to explore all avenues to emerge from {Chapter 11} and maximize value for all the Company's stakeholders

However, it's hard for the system to tell that they refer to the same event. First of all, it's challenging to determine that the two trigger phrases "bankrupt" and "Chapter 11" are semantically similar. Second, these two sentences don't share any obvious patterns or arguments.

5 Slot Filling Validation - Filtering

Our basic assumption is that a response is more likely to be true if it is supported by multiple strong teams. In order to validate this assumption, we first propose an unsupervised method to roughly estimate the performance of teams with little/no prior knowledge. After that, we focus on categorizing runs into multiple tiers based on their estimated performance and apply a tier-specific voting strategy incorporating linguistic constraints.

5.1 Tier Classification

Our objective is to classify each team as strong, relatively strong or relatively weak. The performance of a team is usually consistent across individual queries. Thus, we can take advantage of the team ranking based on the preliminary assessments to estimate the overall performance/rank of a run.

Unfortunately, preliminary assessment results are often not available. But we can still obtain reliable initial credibility scores for runs by analyzing the common characteristics among the various runs. Given the set of runs R = {r1, ..., rm}, we initialize their credibility scores c(r) based on their interactions on claims (i.e., combinations of query, slot type and filler). Suppose each run ri generates a set of claims Mri. The similarity between two runs ri and rj is defined as follows (Mihalcea, 2004):

$$similarity(r_i, r_j) = \frac{|M_{r_i} \cap M_{r_j}|}{\log(|M_{r_i}|) + \log(|M_{r_j}|)} \quad (1)$$

Then we construct a weighted undirected graph G = ⟨R, E⟩, where R(G) = {r1, ..., rm} and E(G) = {⟨ri, rj⟩} with edge weight ⟨ri, rj⟩ = similarity(ri, rj), and apply the TextRank algorithm (Mihalcea, 2004) on G to obtain c(r).

In our task, the classification problem can be regarded as finding two intervals within a set of credibility scores C = {c(r1), ..., c(rm)} with optimal interval borders. We implemented the Jenks optimization method to determine the best arrangement of runs into three tiers. This is done by minimizing each tier's average deviation from the tier mean, while maximizing each tier's deviation from the means of the other groups (McMaster and McMaster, 2002).
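A sketch of the run-similarity computation from Equation (1), with claims represented as hashable tuples; the guard against log(0) is an assumption of this sketch, and the TextRank and Jenks steps are only indicated.

```python
import math

def run_similarity(claims_i, claims_j):
    """Equation (1): claim-set overlap, normalized by the log sizes.
    claims_*: sets of (query, slot_type, filler) tuples."""
    if len(claims_i) <= 1 or len(claims_j) <= 1:
        return 0.0  # avoids log(1) + log(1) = 0 in the denominator (assumption)
    overlap = len(claims_i & claims_j)
    return overlap / (math.log(len(claims_i)) + math.log(len(claims_j)))

def build_graph(runs):
    """runs: dict run_id -> claim set. Returns weighted edges for TextRank."""
    ids = sorted(runs)
    return {(a, b): run_similarity(runs[a], runs[b])
            for i, a in enumerate(ids) for b in ids[i + 1:]}

# TextRank over these edges yields the credibility scores c(r); Jenks natural
# breaks then split the scores into three tiers (both steps omitted here).
```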

Figure 7: Tier-specific constraints. [Figure showing responses partitioned into six fields Ai1/Ai2: rows are Tier 1-3; columns indicate whether a response is supported by ≥ 2 teams (Ai1) or by only one team (Ai2).]

5.2 Linguistic Constraints

We analyze the evidence sentence extracted for each claim by checking whether it satisfies trigger constraints and whether the candidate filler satisfies type constraints.


Trigger Constraints

A trigger is defined as the smallest extent of text which most clearly indicates a relation type. For trigger-driven slot types such as per:city of birth, a response without sufficient lexical support will be directly judged as wrong by annotators.

We generally follow our previous work (Yu et al., 2015) on trigger mining. We mined fact-specific trigger lists based on patterns (Chen et al., 2010; Min et al., 2012; Li et al., 2012b) and correct evidence sentences from the KBP 2012-2014 training corpus. In our experiment, we use 8,237 triggers in total, 392 triggers on average for each trigger-driven slot type3.

Type Constraints

In the Slot Filling (SF)/Cold Start Slot Filling (CSSF) task, slots are labeled as Name, Value, or String based on the content of their fillers.

Name slots (e.g., per:spouse, org:parents) are required to be filled by a person, organization or geopolitical entity (GPE). We rely on name tagging results to validate type constraints for name slots. We also use city/state/country dictionaries to further validate whether an entity belongs to a GPE subtype.

Value slots should be filled by a numerical value (e.g., per:age, per:date of death and org:website). We use regular expressions and/or the name tagging results to verify the correctness of a value format.

For a string slot (e.g., per:religion and per:origin), we collected category dictionaries from the SF source corpus, such as religion, origin, disease, title and crime, which help us make a rough judgement as to whether a candidate filler belongs to a specific category. We also mined dictionaries from the NELL (Carlson et al., 2010) annotated KBP corpus, which contains rich semantic category labels for millions of noun phrases. We mapped these semantic categories to slot types and kept high-confidence noun phrases in order to generate clean dictionaries.

5.3 Tier-specific Voting based on Constraints

We divided all the responses into 6 fields as shown in Figure 7. Ai1 represents the responses submitted by at least two teams in Tier i and Ai2 denotes the responses submitted by only one team in Tier i.

3 The trigger lists are publicly available for research purposes at: http://nlp.cs.rpi.edu/data/triggers.zip

Method                   τ      Zτ
PageRank                 0.69   8.35
Preliminary Assessment   0.79   9.59
Golden Standard          1.00   N/A

Table 6: Performance of estimating the ranks.

Note that a team can submit at most five runs during the evaluation. In this step, we keep the responses submitted by multiple strong or relatively strong teams. In other words, the responses from A11 and A21, and the common responses of A12 and A22, are annotated as correct. We discard all the responses in A32 since they are extremely noisy, and we lose less than 3% of correct claims by doing so. A response in the remaining fields is annotated as correct if it satisfies the slot-specific trigger and type constraints described above.

5.4 Evaluation

Based on the released CSSF assessment results (69 runs in total), we can use the Kendall rank correlation coefficient τ (Kendall, 1948) to evaluate the degree of similarity between our estimated ranking and the standard ranking over the same set of N runs. The symmetric difference distance between two sets of ordered pairs P1 and P2 is denoted d∆(P1, P2). For example, the ordered set of N = 3 runs [r1, r2, r3] with ranks [1,3,2] can be decomposed into (1/2)N(N − 1) ordered pairs P = {[r1, r3], [r1, r2], [r3, r2]}. For N ≥ 10, a null hypothesis test can be performed by transforming τ into a Z value as shown in Formula 3 (Abdi, 2007).

$$\tau = 1 - \frac{2 \times d_\Delta(P_1, P_2)}{N(N-1)} \quad (2)$$

$$Z_\tau = \frac{\tau}{\sqrt{\dfrac{2(2N+5)}{9N(N-1)}}} \quad (3)$$
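A small worked implementation of Formulas 2 and 3, assuming strict rankings given as rank lists aligned by run; it counts discordant pairs, each of which contributes two ordered pairs to the symmetric difference.

```python
import math
from itertools import combinations

def kendall_tau(ranks_a, ranks_b):
    """Formula (2) via discordant pairs: d_delta = 2 * (discordant pairs),
    so tau = 1 - 4 * discordant / (N(N-1)). Assumes no tied ranks."""
    n = len(ranks_a)
    discordant = sum(
        1 for i, j in combinations(range(n), 2)
        if (ranks_a[i] - ranks_a[j]) * (ranks_b[i] - ranks_b[j]) < 0
    )
    return 1 - 4 * discordant / (n * (n - 1))

def z_value(tau, n):
    """Formula (3): normal approximation, valid for N >= 10."""
    return tau / math.sqrt(2 * (2 * n + 5) / (9 * n * (n - 1)))
```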

From Table 6, we can see that the Z values of both methods are large enough to reject the null hypothesis, so we can conclude that our predicted ranking and the final ranking display significant agreement. In addition, the tier classification method successfully annotates the top 31 runs as Tier 1 and the bottom 15 runs as Tier 3.


CSSF run           F-score (%)
                   Original   Filtered
SFV2015 KB 12 4    30.1       36.7
SFV2015 KB 12 1    30.4       34.9
SFV2015 KB 12 5    30.9       34.3
SFV2015 KB 12 3    29.6       33.7
SFV2015 SF 03 3    31.4       31.4
SFV2015 KB 12 2    33.8       33.1
SFV2015 KB 03 2    30.5       34.1

Table 7: Performance upon top runs (CSLDC level).

CSSF run           F-score (%)
                   Original   Filtered
SFV2015 KB 12 4    27.6       30.3
SFV2015 KB 12 1    27.5       28.4
SFV2015 KB 12 5    27.1       27.3
SFV2015 KB 12 3    26.7       27.3
SFV2015 SF 03 3    27.5       29.1
SFV2015 KB 12 2    28.8       26.9
SFV2015 KB 03 2    22.9       26.8

Table 8: Performance upon top runs (CSSF level).

In Tables 7 and 8, we show that our method can improve upon the top CSSF runs. Here we used the final scores, which are computed considering both hops. Compared with the SF task, CSSF runs have relatively lower recall and are therefore more sensitive to the filtering process. In addition, the majority voting method performs relatively worse since there are no pre-assigned queries; in this case, a correct claim is more likely to be submitted by only one team. Our strategy of discarding all the responses in A32 leads to the failure of our method in annotating responses of weak runs.

Acknowledgments

This work was supported by the U.S. DARPA LORELEI Program, DARPA DEFT Program No. FA8750-13-2-0041, a DARPA Multimedia Seedling award, ARL NS-CTA No. W911NF-09-2-0053, NSF CAREER Award IIS-1523198, the AFRL DREAM project, and gift awards from IBM, Google and Bosch. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation here on.

References

H. Abdi. 2007. The Kendall rank correlation coefficient. In Encyclopedia of Measurement and Statistics, pages 508-510. Sage, Thousand Oaks, CA.

L. Banarescu, C. Bonial, S. Cai, M. Georgescu, K. Griffitt, U. Hermjakob, K. Knight, P. Koehn, M. Palmer, and N. Schneider. 2013. Abstract meaning representation for sembanking. In Proc. ACL 2013 Workshop on Linguistic Annotation and Interoperability with Discourse.

A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. Hruschka Jr., and T. M. Mitchell. 2010. Toward an architecture for never-ending language learning. In Proc. Association for the Advancement of Artificial Intelligence (AAAI 2010).

Z. Chen, S. Tamang, A. Lee, X. Li, W. Lin, M. Snover, J. Artiles, M. Passantino, and H. Ji. 2010. CUNY-BLENDER TAC-KBP2010 entity linking and slot filling system description. In Proc. Text Analysis Conference (TAC 2010).

Zheng Chen, Heng Ji, and Robert Haralick. 2015. A pairwise coreference model, feature impact and evaluation for event coreference resolution. In Proc. RANLP 2009 Workshop on Events in Emerging Text Types.

Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics.

H. Ji, R. Grishman, D. Freitag, M. Blume, J. Wang, S. Khadivi, R. Zens, and H. Ney. 2009. Name extraction and translation for distillation. In Handbook of Natural Language Processing and Machine Translation: DARPA Global Autonomous Language Exploitation.

Maurice George Kendall. 1948. Rank Correlation Methods.

Qi Li, Haibo Li, Heng Ji, Wen Wang, Jing Zheng, and Fei Huang. 2012a. Joint bilingual name tagging for parallel corpora. In Proc. 21st ACM International Conference on Information and Knowledge Management.

Y. Li, S. Chen, Z. Zhou, J. Yin, H. Luo, L. Hong, W. Xu, G. Chen, and J. Guo. 2012b. PRIS at TAC2012 KBP track. In Proc. Text Analysis Conference (TAC 2012).

R. McMaster and S. McMaster. 2002. A history of twentieth-century American academic cartography. Cartography and Geographic Information Science, 29(3):305-321.

R. Mihalcea. 2004. Graph-based ranking algorithms for sentence extraction, applied to text summarization. In Proc. Association for Computational Linguistics (ACL 2004).

B. Min, X. Li, R. Grishman, and A. Sun. 2012. New York University 2012 system for KBP slot filling. In Proc. Text Analysis Conference (TAC 2012).

Han Wang, Jin Guang Zheng, Xiaogang Ma, Peter Fox, and Heng Ji. 2015. Language and domain independent entity linking with quantified collective validation. In Proc. Conference on Empirical Methods in Natural Language Processing (EMNLP 2015).

D. Yu, H. Ji, S. Li, and C. Lin. 2015. Why read if you can scan? Trigger scoping strategy for biographical fact extraction. In Proc. North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2015).

Tongtao Zhang, Hongzhi Li, Heng Ji, and Shih-Fu Chang. 2015. Cross-document event coreference resolution based on cross-media features. In Proc. Conference on Empirical Methods in Natural Language Processing (EMNLP 2015).

Jin Guang Zheng, Daniel Howsmon, Boliang Zhang, Juergen Hahn, Deborah McGuinness, James Hendler, and Heng Ji. 2014. Entity linking for biomedical literature. BMC Medical Informatics and Decision Making.