
GAIA - A Multi-media Multi-lingual Knowledge Extraction and Hypothesis Generation System

Tongtao Zhang1, Ananya Subburathinam1, Ge Shi1, Lifu Huang1, Di Lu1, Xiaoman Pan1, Manling Li1, Boliang Zhang1, Qingyun Wang1, Spencer Whitehead1, Heng Ji1

1 Rensselaer Polytechnic Institute
[email protected]

Alireza Zareian2, Hassan Akbari2, Brian Chen2, Ruiqi Zhong2, Steven Shao2, Emily Allaway2, Shih-Fu Chang2, Kathleen McKeown2

2 Columbia University
[email protected], [email protected]

Dongyu Li3, Xin Huang3, Kexuan Sun3, Xujun Peng3, Ryan Gabbard3, Marjorie Freedman3, Mayank Kejriwal3, Ram Nevatia3, Pedro Szekely3, T.K. Satish Kumar3

3 Information Sciences Institute, University of Southern California
[email protected]

Ali Sadeghian4, Giacomo Bergami4, Sourav Dutta4, Miguel Rodriguez4, Daisy Zhe Wang4

4 University of Florida
[email protected]

1 Introduction

An analyst or a planner seeking a rich, deep understanding of an emergent situation today is faced with a paradox: multimodal, multilingual, real-time information about most emergent situations is freely available, but the sheer volume and diversity of such information makes the task of understanding a specific situation or finding relevant information an immensely challenging one. To remedy this situation, the Generating Alternative Interpretations for Analysis (GAIA) team in the DARPA AIDA program aims for automated solutions that provide an integrated, comprehensive, nuanced, and timely view of emerging events, situations, and trends of interest. GAIA focuses on developing a multi-hypothesis semantic engine that embodies a novel synthesis of new and existing technologies in multimodal knowledge extraction, semantic integration, knowledge graph generation, and inference.

In the past year the GAIA team has developed an end-to-end knowledge extraction, grounding, inference, clustering and hypothesis generation system that covers all languages, data modalities and knowledge element types defined in the AIDA ontologies. We participated in the evaluations of all tasks within TA1, TA2 and TA3. The system incorporates a number of impactful and fresh research innovations:

• Generative Adversarial Imitation Learning for Event Extraction: We developed a dynamic mechanism, inverse reinforcement learning, to directly assess correct and wrong labels on instances in entity and event extraction (Zhang and Ji, 2018). We assign explicit scores to cases, or rewards in terms of Reinforcement Learning (RL). This framework naturally fits AIDA's future setting of streaming data because we adopt discriminators from generative adversarial networks (GANs) to efficiently assign different reward values to instances at different difficulty levels and different epochs.

• Global Attention: Popular deep learning methods have modeled Information Extraction (IE) as sequence labeling problems based on distributional embedding features, and have led to improvements in quality for some low-level IE tasks such as name tagging and event trigger labeling (Chiu and Nichols, 2016; Lample et al., 2016; Zeng et al., 2014; Liu et al., 2015; Nguyen and Grishman, 2015b; Yang et al., 2016; Chen et al., 2015; Nguyen and Grishman, 2015a, 2016; Feng et al., 2016b; Huang et al., 2017a; Zhang et al., 2017a; Lin et al., 2018; Shi et al., 2018; Zhang et al., 2018a). However, more advanced extraction problems such as relation extraction and event argument labeling require us to capture global constraints that go beyond the sentence level, as well as inter-dependency among knowledge elements (Li et al., 2013; Li and Ji, 2014; Li et al., 2014). In the GAIA system we tackle this challenge by introducing a novel document-level attention mechanism (Zhang et al., 2018b) along with linguistic patterns and global constraints.

• Multi-media Multi-lingual Common Semantic Space: We developed a novel common semantic space for unifying cross-media and cross-language semantics, and applied it to encode visual attention to improve entity extraction (Lu et al., 2018) and event extraction (Zhang et al., 2017b), and to ground entities detected from texts into images and videos.

• Hierarchical Inconsistency Detection: We developed a novel inconsistency detection method. Hierarchical entity matching and ontologically induced functional dependencies are used to detect logical inconsistencies. Such logical features are also summarised into an inconsistency metric showing the degree of inconsistency that each hypothesis bears.

• Path Ranking Based Query Answering: We are developing a novel hypothesis mining approach based on the path ranking algorithm (Lao et al., 2011). Our method creates a dual path-node embedding space which can be used to represent complex subgraphs, e.g., hypotheses.

• Embedding Based Graph Clustering: We are developing graph clustering algorithms based on various features and distance metrics. We also explore a novel approach inspired by "entity+graph" hybrid embedding methods. Such a method can be used to cluster similar/repeating subgraphs in the KG.

2 Multimodal Ontology Creation

A sufficiently expressive and refined ontology is paramount to the various IE tasks as it defines the target information that should be extracted. The ontology must offer a suitable target, with types that are reasonably disjoint and have a level of granularity such that a system can feasibly distinguish between them. Existing ontologies can be too coarse-grained (ACE or ERE) to offer clarity in settings that have conflicting information, as in AIDA, or too fine-grained (YAGO, WordNet, FrameNet, or VerbNet) to offer systems a clear target when performing the extraction. Therefore, we carefully design a multimodal ontology that is able to accurately capture the AIDA scenario-relevant entities, relations, and events across multiple data modalities. We approach the development of the ontology by first creating a text-based ontology to serve as the foundation for the multimodal ontology. In parallel, we identify a set of salient visual types to form a visual ontology that is most relevant to AIDA. We then refine these ontologies and merge the textual and visual ontologies to form our final multimodal ontology.

2.1 Textual Ontology

2.1.1 Entity Ontology

We use the YAGO ontology as a base entity ontology, which we refine, restructure, and merge with the AIDA ontology to form our textual ontology. YAGO offers a greatly expanded set of entity types that are incredibly expressive and has annotations from sources, such as Wikipedia, that potentially already include entities of interest in the AIDA scenario. We start from a set of 7,309 YAGO types.

We take a data-driven approach to form our ontology. First, we apply our entity discovery and linking system (Pan et al., 2017) to the AIDA Seedling data. This system is able to extract entities from the text data and perform fine-grained typing using the YAGO ontology. With the results of this system, we compute the frequency of each entity type in the AIDA Seedling data. We filter these types using these frequencies, only keeping the top 250 most frequent entity types.

The data-driven approach offers the advantage that it ensures that the entity types included in the ontology are relevant and occur with relatively high frequency. However, further refinement is needed to ensure that the types are robust and truly cover the knowledge elements important to the AIDA program. Additionally, it is best to have the structure of the entity ontology be easily amenable to the event and relation ontology arguments, so as to ensure that these ontologies may be used in concert with one another. Therefore, we carefully: 1) refine the set of 250 types and append any relevant types from the YAGO ontology that were mistakenly filtered out; 2) restructure the ontology such that the top level of the ontology is the same as the coarse-grained AIDA ontology, which is based upon ERE.

Starting from the 250 types, we prune types and add back YAGO entity types based upon four criteria. First, the importance and/or relevance of a type to the AIDA scenarios. As a result, any truly irrelevant types, such as "Skidder110605088: A person who slips or slides because of loss of traction", are filtered out. Second, the level of granularity of a type, which further removes types like "MapProjection103720443: A projection of the globe onto a flat map using a grid of lines of latitude and longitude." Third, a type's level of coherence with the existing AIDA ontology and target information: if a YAGO type overlaps with an AIDA ontology type, then we use the AIDA ontology type to prevent redundancy. Lastly, we consider whether or not a type can only be represented as an entity, or if the information conveyed by the type would be better represented as a relation or event, such as "MaleSibling110286084: A sibling who is male." We then map the set of refined types determined using these criteria to the existing AIDA ontology. Throughout this process, as mentioned above, we favor the AIDA ontology types and merge our refined types into this ontology. This merging step ensures that the entity, event, and relation ontologies can be coherently used together.

The result of this process is a set of 163 fine-grained entity types that serve as the foundation for the multi-modal ontology. These types fit within the framework of the AIDA ontology, while still expanding the expressiveness of the ontology. Furthermore, due to our combination of a data-driven approach and careful design, we are able to ensure that the entity types are relevant, coherent, and able to capture the entities most important to AIDA.

2.1.2 Event Ontology

For events, we follow a similar procedure to the entity ontology. We use the AIDA event ontology as the base ontology into which we merge external event ontologies. The external event ontologies we consider are: CauseEx, FrameNet (Baker et al., 1998), VerbNet (Schuler, 2005), and PropBank (Palmer et al., 2005).

In contrast to the entity ontology, a data-driven approach to forming the event ontology is less plausible since the performance of event extraction systems (Li et al., 2013; Nguyen and Grishman, 2015a; Huang et al., 2017b; Zhang et al., 2017b) is markedly lower than that of entity discovery and linking (Pan et al., 2015; Lample et al., 2016; Pan et al., 2017). Thus, we manually merge the external ontologies into the AIDA ontology.

Our first step is to map the external event types and argument roles to the AIDA ontology. During this process, it is likely that many of the types overlap between ontologies. In such instances, we set an order of precedence for overlapping event types and argument roles. This approach enables us to verify that the event types being incorporated are coherent and fit well within the AIDA framework. The following ordering is used to resolve these cases: 1) AIDA ontology; 2) CauseEx; 3) FrameNet; 4) VerbNet; 5) PropBank. For example, if CauseEx has an event type that matches a FrameNet type, then we select the CauseEx type. After merging all 5 event ontologies, we prune and refine the entire ontology in a similar fashion to the entity ontology. Specifically, we only keep types that are: 1) relevant and salient to the AIDA scenario and 2) detectable by a system and a human. This merging and refining process for the event ontology yields 114 salient event types along with their arguments.

2.2 Visual Enrichment

Visual data conveys knowledge elements that rarely appear in text. For instance, a news article may refer to an explosion, but not the flames and smoke that appear in a related image. Similarly, text may refer to an organization, which may only be visually represented by a logo. Text may mention rescuing in a disaster, but not refer to the medical stretcher that is used to carry an injured person.

In order to extract and represent multimodal knowledge in a unified language, and discover links between them, our ontology should cover visual concepts as well. In this section, we report how we enriched a text-driven ontology by adding visual concepts. Overall, we proposed a set of 150 entity types and 25 event types. Out of the 150 entity types, only 26 were already covered by the text-driven concepts, while 124 were new and were added to the ontology.

To expand the ontology, we look for concepts that are: (1) visually expressible, (2) important, and (3) related to the scenario of interest. If a concept has all three characteristics, we determine whether it already exists in our ontology, and if not, determine where we can add it in the ontology hierarchy. In this section, we review three techniques that we used to discover such concepts, and we share some examples and insights.

2.2.1 Ontology-driven expansion

To expand the ontology, we search existing visual ontologies to discover new visual concepts. More specifically, given a list of visual concepts from a reference visual ontology, we select the ones that are important and relevant to the scenario, and add those to our target ontology. Here we use the list of 20,000 Open Images v3 (OI3) categories as our reference ontology, and the 9,000 YAGO types as our target ontology.

To measure the relevance of OI3 concepts to the seedling scenario, we extract a doc2vec (Le and Mikolov, 2014) representation from the AIDA Seedling Data Collection and Annotation Plan V2.0 and match it to the word embeddings of the reference ontology. We use the cosine similarity between each concept embedding and the document embedding to sort the concepts in order of relevance. We choose a cut-off threshold by subjectively assessing the relevance of the sorted list to the scenario, which leads to the top 800 concepts being selected. Then we manually verify the relevance of each of the top 800 to the scenario, based on subjective intuition. Out of the 800 selected concepts, 306 were verified to be relevant, which indicates a precision of 0.38. We further prune this list by manually removing concepts that are too specific or infrequent, leading to 120 visual concepts.
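The relevance ranking described above is a cosine-similarity sort of concept embeddings against the scenario document embedding. The following Python sketch illustrates this step; the function name, the 300-dimensional random vectors, and the top-800 cut-off treated as a parameter are illustrative assumptions rather than the exact GAIA implementation.

import numpy as np

def rank_concepts_by_relevance(concept_vecs, doc_vec, top_k=800):
    """Sort concepts by cosine similarity to the scenario document embedding.

    concept_vecs: dict mapping concept name -> embedding (np.ndarray)
    doc_vec: doc2vec embedding of the scenario document (np.ndarray)
    Returns the top_k (concept, similarity) pairs, most relevant first.
    """
    doc_norm = doc_vec / np.linalg.norm(doc_vec)
    scored = []
    for name, vec in concept_vecs.items():
        sim = float(np.dot(vec / np.linalg.norm(vec), doc_norm))
        scored.append((name, sim))
    scored.sort(key=lambda x: x[1], reverse=True)
    return scored[:top_k]

# Toy usage with random embeddings standing in for the OI3 concept vectors
# (OI3 has ~20,000 categories) and the doc2vec scenario embedding.
rng = np.random.default_rng(0)
concepts = {f"concept_{i}": rng.normal(size=300) for i in range(2000)}
document = rng.normal(size=300)
shortlist = rank_concepts_by_relevance(concepts, document, top_k=800)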

The selected 120 visual concepts are all important and relevant visual concepts, and can be good choices for the target ontology. In the next step we try to link each of the 120 to the YAGO concepts. To this end, for each of the 120 selected concepts, we extract semantic embeddings and compare them to all YAGO concept embeddings, matching to the one with the highest similarity. We do this using 4 different methods for extracting semantic embeddings, which results in 4 possibly different matches for each selected concept. We manually assess the matches and select the best match. We consider 5 possible cases:

1. The selected visual concept and the YAGO match are identical, which makes it redundant.

2. The selected concept is a subtype of the YAGO match, e.g. police car, which was matched to car. In these cases, the selected concept can be added as a child node to the YAGO ontology.

3. The YAGO match is a subtype of the selected concept, e.g. Emergency Vehicle, which was matched to Ambulance. In these cases, the selected concept can be added as a node between the YAGO match and its current parent.

4. The selected concept and its YAGO matches are sibling concepts, e.g. Flight Engineer, which was matched to Pilot. In these cases, the selected concept can be added as a child node to the parent node of the YAGO match.

5. The selected visual concept does not have any of the aforementioned relationships with any of the 4 matches from YAGO. In that case, the selected concept is completely new and should be manually added to YAGO. Examples are Missile Destroyer and military robot.

Out of the 120 selected visual concepts, 49, 45, 4, 11, and 11 fall under each of the 5 cases respectively. This means that, while 49 of the selected concepts already exist in YAGO, 71 can be added to enrich YAGO. Moreover, discovering 45 child nodes compared to 4 parent nodes suggests that the visual concepts that are not covered by YAGO are usually more fine-grained than YAGO concepts and belong to the lower levels of the hierarchy.

2.2.2 Data-driven expansion: bottom up

Whilst the first approach is limited to a predefined source ontology, the second approach starts from a corpus of scenario-related documents. We use Latent Dirichlet Allocation (LDA) (Blei et al., 2003) along with a phrase mining tool (sha) to get a list of keywords. Each keyword must be frequent enough in the corpus to be picked; therefore, the keywords tend to be important and relevant to the scenario. We follow (Chen et al., 2014) to determine whether each keyword represents a visual concept or not, as sketched in the code below. For each keyword, we use the Google search API to crawl 100 images, and remove outliers using the method proposed in (Chen et al., 2014). Then we split the remaining images into training and test parts, and train a classifier on the training portion to separate those images from other keywords. We determine that a keyword is a visual concept if the accuracy of the trained model on the test images is above a threshold.
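The sketch below illustrates this visualness test under the assumption that feature vectors have already been extracted for the crawled images (e.g., from a CNN); the classifier choice (logistic regression), the 70/30 split, and the 0.8 accuracy threshold are illustrative assumptions.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def is_visual_concept(keyword_feats, other_feats, threshold=0.8):
    """Decide whether a keyword is a visual concept.

    keyword_feats: (n, d) image features crawled for the keyword
    other_feats:   (m, d) image features crawled for other keywords
    The keyword is 'visual' if a classifier can separate its images from
    the others with test accuracy above the threshold.
    """
    X = np.vstack([keyword_feats, other_feats])
    y = np.concatenate([np.ones(len(keyword_feats)), np.zeros(len(other_feats))])
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=0, stratify=y)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te)) >= threshold

# Toy usage with random features standing in for CNN image descriptors.
rng = np.random.default_rng(1)
kw = rng.normal(loc=1.0, size=(100, 128))      # images crawled for one keyword
others = rng.normal(loc=0.0, size=(300, 128))  # images from other keywords
print(is_visual_concept(kw, others))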

Using LDA and (sha), we find 238 and 336 concepts respectively, out of which 61 and 72 were determined to be visual concepts. Those selected concepts were linked and added to YAGO using the same approach described in Section 2.2.1. Out of the 61 and 72 selected concepts, 19 and 39 concepts respectively were new (not existing in YAGO), and therefore can be used to enrich the YAGO ontology. Examples are riot police, missile launcher, and anti-aircraft.

2.2.3 Data-driven expansion: multimodal pattern mining

In this method, we utilize our recent work on multimodal pattern mining (Li et al., 2016). This work aims to find frequent image-text accompaniment patterns and assign them a name. It extracts both visual and textual features (with an AlexNet and a Word2Vec model, respectively), and then generates transactions to be used in association rule mining. The transactions are a 1000-D vector (with 0 or 1 entries, each standing for a word cluster present in the text) and a 256-D vector (with 0 or 1 entries, each standing for the activation of a feature map of the pool5 layer in AlexNet). The feature map activations are calculated for each of the 6x6 regions of the pool5 layer. Thus, there are 36 transactions for each image-caption pair. After running association rule mining, the most frequent patterns are named by TF-IDF and the area corresponding to each transaction is chosen as the visual representation for that name.

We ran this model on a dataset of 61k image-caption pairs collected in-house from the VOA website. Since association rule mining needs a discrimination annotation, we ran event extraction by trigger words on the VOA dataset and grouped it into 33 ACE events. The model found 225 event-specific concepts. By manual inspection, we chose 28 of them using the following criteria:

1. Event relatedness: The discovered concepts (pattern names and related images) are relevant to the event, and would be useful entities or attributes in an event schema for that event.

2. Visual semantic coherence: The visual pattern associated with the name is semantically consistent. Namely, the images shown under the pattern depict a coherent semantic concept, and not a mix of many different concepts. If the majority of images in a pattern are consistent, with few outliers, the pattern is considered to exhibit this property.

3. Text-visual matching: The pattern name correctly describes the semantic concept of the visual pattern.

Among the 28 hand-picked concepts, 14 were not present in the YAGO ontology. Examples are water cannon, protesters, and tear gas, as shown in Figure 1.

Figure 1: Examples of discovered multimodal patterns.

3 TA1 Text Knowledge Extraction

3.1 Approach Overview

As shown in Figure 2, the text knowledge extraction system is a comprehensive end-to-end trilingual system. Note that all components that do not specify a language are language-independent. Given tri-lingual text documents, we first extract named and nominal mentions using Bi-LSTM-CRFs extractors, and determine the coreference of named mentions based on collective entity linking and NIL clustering, followed by nominal coreference using a Bi-LSTM Attention Network. After that, relations are extracted using an assembled CNN extractor, with post-processing via pattern ranking. Meanwhile, with entities as input, English events are extracted through GAIL, Bi-LSTM-CRFs and CNN extractors, followed by attribute extraction through Logistic Regression hedge detection and syntax-based negation detection. For Russian and Ukrainian events, triggers are extracted by a Bi-LSTM extractor, followed by a CNN argument extractor. After that, coreference among all events is determined with graph-based resolution. In the end, the extraction results for entities, relations and events compose a document-level KB for TA1.a. Furthermore, this KB is refined by human hypotheses based on second-stage classifiers to generate the TA1.b KB. The details of each component are described in the following sections.

Figure 2: The Architecture of Text Knowledge Extraction

3.2 Mention Extraction

We consider mention extraction as a sequence labeling problem, to tag each token in a sentence as the Beginning (B), Inside (I) or Outside (O) of a name mention with one of seven types: Person (PER), Organization (ORG), Geo-political Entity (GPE), Location (LOC), Facility (FAC), Weapon (WEA) or Vehicle (VEH). Predicting the tag for each token needs evidence from both its previous context and its future context in the entire sentence. Bi-LSTM networks (Graves et al., 2013; Lample et al., 2016) meet this need by processing each sequence in both directions with two separate hidden layers, which are then fed into the same output layer. Moreover, there are strong classification dependencies among tags in a sequence. For example, "I-LOC" cannot follow "B-ORG". A CRF model, which is particularly good at jointly modeling tagging decisions, can be built on top of the Bi-LSTM networks. External information such as gazetteers, Brown clustering, etc. has proven to be beneficial for mention extraction. We use an additional Bi-LSTM to consume the external feature embeddings of each token and concatenate both Bi-LSTM encodings of feature embeddings and word embeddings before the output layer.

We set the word input dimension to 100, the word LSTM hidden layer dimension to 100, the character input dimension to 50, the character LSTM hidden layer dimension to 25, the input dropout rate to 0.5, and use stochastic gradient descent with learning rate 0.01 for optimization.
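The following PyTorch sketch outlines the encoder described above, using the stated dimensions (100-dim word embeddings and word LSTM, 50-dim character embeddings with a 25-dim character LSTM, 0.5 input dropout). The CRF transition layer and the additional feature-embedding Bi-LSTM are omitted for brevity, and the vocabulary sizes are illustrative assumptions.

import torch
import torch.nn as nn

class BiLSTMEmissions(nn.Module):
    """Bi-LSTM encoder producing per-token emission scores for BIO tags.

    A CRF layer (not shown) would be placed on top of the emission scores
    to model tag-transition constraints such as "I-LOC cannot follow B-ORG".
    """
    def __init__(self, word_vocab=50000, char_vocab=200, num_tags=15):
        # num_tags = B-/I- for the 7 entity types plus O.
        super().__init__()
        self.word_emb = nn.Embedding(word_vocab, 100)
        self.char_emb = nn.Embedding(char_vocab, 50)
        self.char_lstm = nn.LSTM(50, 25, bidirectional=True, batch_first=True)
        self.dropout = nn.Dropout(0.5)
        self.word_lstm = nn.LSTM(100 + 50, 100, bidirectional=True, batch_first=True)
        self.out = nn.Linear(200, num_tags)

    def forward(self, word_ids, char_ids):
        # word_ids: (batch, seq_len); char_ids: (batch, seq_len, max_chars)
        b, t, c = char_ids.shape
        char_vecs = self.char_emb(char_ids.view(b * t, c))
        _, (h, _) = self.char_lstm(char_vecs)            # h: (2, b*t, 25)
        char_repr = torch.cat([h[0], h[1]], dim=-1).view(b, t, 50)
        tokens = torch.cat([self.word_emb(word_ids), char_repr], dim=-1)
        encoded, _ = self.word_lstm(self.dropout(tokens))
        return self.out(encoded)                          # (batch, seq_len, num_tags)

# Toy forward pass.
model = BiLSTMEmissions()
words = torch.randint(0, 50000, (2, 12))
chars = torch.randint(0, 200, (2, 12, 8))
print(model(words, chars).shape)  # torch.Size([2, 12, 15])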

3.3 Entity Linking

Given a set of name mentions M = {m_1, m_2, ..., m_n}, we first generate an initial list of candidate entities E_m = {e_1, e_2, ..., e_n} for each name mention m, and then rank them to select the candidate entity with the highest score as the appropriate entity for linking.

We adopt a dictionary-based candidate generation approach (Medelyan and Legg, 2008). In order to improve the coverage of the dictionary, we also generate a secondary dictionary by normalizing all keys in the primary dictionary using the phonetic algorithm NYSIIS (Taft, 1970). If a name mention m is not in the primary dictionary, we use the secondary dictionary to generate candidates.

Then we rank these entity candidates based on three measures: salience, similarity and coherence (Pan et al., 2015).

We utilize Wikipedia anchor links to compute salience based on the entity prior:

p_prior(e) = A_{*,e} / A_{*,*}    (1)

where A_{*,e} is the set of anchor links that point to entity e, and A_{*,*} is the set of all anchor links in Wikipedia. We define the mention-to-entity probability as

p_mention(e|m) = A_{m,e} / A_{m,*}    (2)

where A_{m,*} is the set of anchor links with the same anchor text m, and A_{m,e} is the subset of A_{m,*} that points to entity e.

Then we compute the similarity between a mention and each candidate entity. We first utilize the entity types of mentions, which are extracted from name tagging. For each entity e in the KB, we assign a coarse-grained entity type t (PER, ORG, GPE, LOC, Miscellaneous (MISC)) using a Maximum Entropy based entity classifier (Pan et al., 2017). We incorporate entity types by combining them with the mention-to-entity probability p_mention(e|m) (Ling et al., 2015):

p_type(e|m, t) = p(e|m) / Σ_{e ↦ t} p(e|m)    (3)

where e ↦ t indicates that t is the entity type of e.

Following (Huang et al., 2017c), we construct a weighted undirected graph G = (E, D) from DBpedia, where E is the set of all entities in DBpedia and d_ij ∈ D indicates that two entities e_i and e_j share some DBpedia properties. The weight of d_ij, w_ij, is computed as:

w_ij = |p_i ∩ p_j| / max(|p_i|, |p_j|)    (4)

where p_i and p_j are the sets of DBpedia properties of e_i and e_j respectively. After constructing the knowledge graph, we apply the graph embedding framework proposed by (Tang et al., 2015) to generate knowledge representations for all entities in the KB. We compute the cosine similarity between the vector representations of two entities to model the coherence between these two entities, coh(e_i, e_j). Given a name mention m and its candidate entity e, we define the coherence score as:

p_coh(e) = (1 / |C_m|) Σ_{c ∈ C_m} coh(e, c)    (5)

where C_m is the union of entities for coherent mentions of m.

Finally, we combine these measures and compute the final score for each candidate entity e.
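As an illustration of how the measures in Equations 1-5 can be combined, the sketch below computes each score from precomputed statistics; the input data structures and the equal-weight sum at the end are illustrative assumptions rather than the exact scoring function used in our system.

import numpy as np

def p_prior(e, anchor_count, total_anchors):
    # Eq. 1: fraction of all Wikipedia anchors that point to entity e.
    return anchor_count.get(e, 0) / total_anchors

def p_type(e, m, t, mention_entity_count, entity_type):
    # Eq. 2-3: mention-to-entity probability, renormalized over
    # candidates whose KB type matches the mention type t.
    counts = mention_entity_count.get(m, {})
    total = sum(c for cand, c in counts.items() if entity_type.get(cand) == t)
    if total == 0 or entity_type.get(e) != t:
        return 0.0
    return counts.get(e, 0) / total

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def p_coh(e, coherent_entities, emb):
    # Eq. 5: average embedding similarity to entities of coherent mentions.
    if not coherent_entities:
        return 0.0
    return sum(cosine(emb[e], emb[c]) for c in coherent_entities) / len(coherent_entities)

def link_score(e, m, t, stats, emb, coherent_entities):
    # Equal-weight combination, shown only as an example.
    return (p_prior(e, stats["anchor_count"], stats["total_anchors"])
            + p_type(e, m, t, stats["mention_entity_count"], stats["entity_type"])
            + p_coh(e, coherent_entities, emb))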

3.4 Within-document Coreference Resolution

3.4.1 Name Coreference Resolution

For name mentions that cannot be linked to the KB, we apply the heuristic rules described in Table 1 to cluster them within each document.

Rule | Description
Exact match | Create initial clusters based on mention surface form.
Normalization | Normalize surface forms (e.g., remove designators and stop words) and group mentions with the same normalized surface form.
NYSIIS (Taft, 1970) | Obtain the soundex NYSIIS representation of each mention and group mentions with the same representation longer than 4 letters.
Edit distance | Cluster two mentions if the edit distance between their normalized surface forms is equal to or smaller than D, where D = length(mention1)/8 + 1.
Translation | Merge two clusters if they include mentions with the same translation.

Table 1: Heuristic Rules for NIL Clustering.

For each cluster, we assign the most frequent name mention as the document-level canonical mention.
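A minimal sketch of the exact-match and edit-distance rules from Table 1, implemented as a single clustering pass; the helper names, the greedy cluster-assignment strategy, and the integer division in the threshold are illustrative assumptions.

def edit_distance(a, b):
    """Standard Levenshtein distance via dynamic programming."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def nil_cluster(mentions):
    """Group NIL mentions: exact match first, then the edit-distance rule
    D = length(mention1)/8 + 1 from Table 1."""
    clusters = []  # each cluster is a list of mention strings
    for m in mentions:
        placed = False
        for cluster in clusters:
            rep = cluster[0]
            threshold = len(rep) // 8 + 1
            if m == rep or edit_distance(m, rep) <= threshold:
                cluster.append(m)
                placed = True
                break
        if not placed:
            clusters.append([m])
    return clusters

# Toy usage on illustrative spelling variants.
print(nil_cluster(["Odessa", "Odesa", "Slovyansk", "Sloviansk"]))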

3.4.2 Nominal Coreference Resolution

We take in the output of both the entity linking and NIL clustering systems to predict nominal coreference. Our main architecture is based on End-to-End Neural Coreference Resolution (Lee et al., 2017). However, since we already know the mention types, we further simplify the coreference system by only computing scores between a nominal and the corresponding named entities with the same type.

We are given a document D = {y_1, y_2, ..., y_n}, named entity spans S_a = {s_{a_1}, s_{a_2}, ..., s_{a_m}} with types T_a = {t_{a_1}, ..., t_{a_m}} and cluster assignments C_a = {c_{a_1}, ..., c_{a_m}}, and nominal mention spans S_o = {s_{o_1}, ..., s_{o_l}} with types T_o = {t_{o_1}, ..., t_{o_l}}, where the y_i are individual words, n is the length of the document, m is the number of named entities, c_{a_i} ∈ {1, 2, ..., v}, v is the number of clusters, l is the number of nominal mentions, and each span s_{a_i} or s_{o_i} is delimited by two indices START(i) and END(i). We compute word representations from pre-trained Word2Vec (Mikolov et al., 2013) embeddings together with character embeddings produced by 1-dimensional convolutional neural networks with window sizes of 2, 3, and 4 characters. The character embeddings are randomly initialized and optimized during training. To compute span representations, we first use BiLSTMs as encoders to obtain hidden states h_{k,1} and h_{k,-1} for each word, where 1 and -1 denote the two directions of the LSTM. For the word at position k, its representation is y*_k = [h_{k,1}, h_{k,-1}]. The LSTM is run independently for each sentence to improve efficiency.

Then, for each span (either a named entity or a nominal mention), we compute soft attention (Bahdanau et al., 2014) over each word in the span:

α_t = ω_α · FFNN_α(y*_t)    (6)

α_{t,k} = exp(α_t) / Σ_{n=START(k)}^{END(k)} exp(α_n)    (7)

ȳ_k = Σ_{t=START(k)}^{END(k)} α_{t,k} · y_t    (8)

where ȳ_k is a weighted sum of the word hidden states in span k and FFNN_α is a feed-forward neural network. Then the representation g_k of a span k is computed as:

g_k = [y*_{START(k)}, y*_{END(k)}, ȳ_k, Φ_g(k)]    (9)

where the feature vector Φ_g(k) encodes the size and type of span k.

Finally, to calculate the nominal clusters, we first group the named entity spans according to their clusters. Then, for each nominal span g_{o,i}, we check it against every name span g_{a,j} with the same type. For each pair of nominal span g_{o,i} and name span g_{a,j}, we compute a score x(i, j):

x(i, j) = w_x · [g_{o,i}, g_{a,j}, g_{o,i} ∘ g_{a,j}, Φ_x(i, j)]    (10)

where ∘ is the element-wise similarity and the feature vector Φ_x(i, j) encodes the distance between the two spans. We first combine the scores of the spans belonging to the same cluster q_e for span o_i:

P^{q_e}_{o_i} = Σ_{j : c_{a_j} = q_e} x(i, j)    (11)

where q_e ∈ {1, 2, ..., v}. We then compute the conditional probability P(o_i = q_e | D) for each cluster:

P(o_i = q_e | D) = P(o_i = q_e | C_a)    (12)
                 = exp(P^{q_e}_{o_i}) / Σ_{e=1}^{v} exp(P^{q_e}_{o_i})    (13)

We then optimize the marginal log-likelihood of all correct clusters given the gold clustering:

Π_{i=1}^{l} P(o_i = q'_e | D)    (14)

where q'_e is the gold cluster. For English training, we used ACE2005, EDL2016, and EDL2017 data to train the system.

3.4.3 Additional Russian/Ukrainian name coreference resolution

For Russian and Ukrainian, we performed additional name coreference as a post-process. We built a graph where each AIF cluster in the input was a node. In order, we applied the following rules to add edges between clusters if:

1. there was an exact match in string or stem between any name mentions in either cluster;

2. there existed two strings from the two clusters where the Levenshtein distance between them was less than or equal to 2 and the size of the string or stem was at least 6 or 5, respectively;

3. (person names only) there existed a single-token string in one cluster which appeared as the second token of a two-token string in another cluster and did not appear as the second token of any other two-token string (e.g. Smith to John Smith unless Jane Smith also appears in the document);

4. (person names only) there existed a two-token string in one cluster which matched a two-token string in another cluster, accounting for morphological inflection of both strings.

A coreference cluster was made from each connected component of the resulting graph. "Stems" were determined in a simple, heuristic way by stripping a list of common Russian and Ukrainian inflectional affixes from the end of the string.
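The sketch below illustrates this post-process for rules 1 and 2 together with the connected-component step (rules 3 and 4 for person names are omitted); the affix list, the stemming helper, and the pairwise merge strategy are illustrative assumptions.

def levenshtein(a, b):
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def stem(s, affixes=("ами", "ях", "ов", "ах", "ом", "у", "а", "і", "и", "е")):
    # Crude heuristic stemming: strip one common inflectional ending.
    for suf in affixes:
        if s.endswith(suf) and len(s) > len(suf) + 2:
            return s[: -len(suf)]
    return s

def should_link(c1, c2):
    """Rules 1 and 2: exact string/stem match, or Levenshtein <= 2 with
    a minimum length of 6 (strings) / 5 (stems)."""
    for s1 in c1:
        for s2 in c2:
            if s1 == s2 or stem(s1) == stem(s2):
                return True
            if levenshtein(s1, s2) <= 2 and (min(len(s1), len(s2)) >= 6
                                             or min(len(stem(s1)), len(stem(s2))) >= 5):
                return True
    return False

def merge_clusters(clusters):
    """Connected components over the should_link graph via union-find."""
    parent = list(range(len(clusters)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            if should_link(clusters[i], clusters[j]):
                parent[find(i)] = find(j)
    merged = {}
    for i, c in enumerate(clusters):
        merged.setdefault(find(i), []).extend(c)
    return list(merged.values())

# Toy usage on illustrative name variants.
print(merge_clusters([["Порошенко"], ["Порошенка"], ["Київ"]]))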

3.5 Relation Extraction

Most AIDA relation types can be mapped to the ACE/ERE ontologies. For these types, we adopt a convolutional neural network (CNN) based model (Section 3.5.1). For the new types, such as Genafl.Sponsor, Measurement.Count, and Genafl.Orgweb, we implement a rule-based system (Section 3.5.2).


3.5.1 Relation Extraction for types in ACE/ERE

Our relation system takes named entity coreference results as input, predicting relations between entity mention pairs that occur within a sentence. We utilize a typical neural network architecture that consists of a Convolutional Neural Network (CNN) with piecewise pooling as our underlying learning model.

Formally, given a source sentence (s, e_1, e_2, r) where s = [w_1, ..., w_m], for each word w_i we generate a multi-feature embedding v_i = [w_i, p_i^1, p_i^2, t_i^1, t_i^2, c_i, η_i], where w_i denotes the word embedding. Specifically, we use the Skip-Gram model to pre-train the word embeddings. p_i^1 and p_i^2 are position embeddings indicating the relative distance from w_i to e_1 and e_2 respectively. t_i^1 and t_i^2 are the entity type embeddings of e_1 and e_2. c_i is the chunking embedding, and η_i is a binary digit indicating whether the word is within the shortest dependency path between e_1 and e_2. All these embeddings except the pre-trained word embeddings are randomly initialized and optimized during training. Thus the input layer is a sequence of word representations V = {v_1, v_2, ..., v_n}. We then apply the convolution weights W to each sliding n-gram phrase g_j with a bias vector b, i.e., g'_j = tanh(W · g_j + b). Considering that different segments do not share the same weights in composing the semantics of a sentence, we split all the g'_j into three parts based on the positions of the two entity mentions and perform piecewise max pooling, after which we concatenate the representations of the three segments and feed them into a fully connected layer to obtain a dense vector; we then use a linear projection function with a softmax as the relation classifier to determine the relation type.
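The following PyTorch sketch illustrates the piecewise max pooling idea: a 1-D convolution over the token representations, followed by separate max pooling over the three segments delimited by the two entity positions. The dimensions, single filter width, and segment boundary handling are illustrative assumptions, not the exact configuration of our extractor.

import torch
import torch.nn as nn

class PiecewiseCNN(nn.Module):
    def __init__(self, in_dim=150, n_filters=230, n_relations=20):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, n_filters, kernel_size=3, padding=1)
        self.fc = nn.Linear(3 * n_filters, n_relations)

    def forward(self, x, e1_pos, e2_pos):
        # x: (batch, seq_len, in_dim) token representations v_i
        h = torch.tanh(self.conv(x.transpose(1, 2)))  # (batch, n_filters, seq_len)
        pooled = []
        for b in range(x.size(0)):
            i, j = sorted((e1_pos[b], e2_pos[b]))
            segments = [h[b, :, : i + 1], h[b, :, i + 1 : j + 1], h[b, :, j + 1 :]]
            # Max-pool each of the three segments separately (piecewise pooling);
            # empty segments fall back to a zero vector.
            seg_vecs = [s.max(dim=1).values if s.size(1) > 0
                        else h.new_zeros(h.size(1)) for s in segments]
            pooled.append(torch.cat(seg_vecs))
        return self.fc(torch.stack(pooled))            # (batch, n_relations)

# Toy forward pass.
model = PiecewiseCNN()
tokens = torch.randn(2, 30, 150)
logits = model(tokens, e1_pos=[4, 10], e2_pos=[12, 25])
print(logits.shape)  # torch.Size([2, 20])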

Training Details For English training, we manually map DEFT Rich ERE and ACE2005 to the AIDA schema and combine all the filtered resources. For Ukrainian and Russian, we employed a native speaker to annotate part of the seedling corpus. To further improve system performance, we also adopted a majority-vote strategy to ensemble the models for all three languages. In addition, we found that our model can hardly capture global information due to instance-level training. Thus, we extract some high-confidence frequent relation patterns (Table 2) from the training data and use them as hard constraints to tackle this problem.

Relation type | Relation pattern
Employment | Person of Organization
Organization origin | Organization in Geography
Leadership | Geography prime Person
Located near | Person at Facility
Part whole | Organization of the Organization
Manufacture | Geography Weapon

Table 2: Examples of frequent relation patterns.

3.5.2 Relation Extraction for New Relation Types

We implement a rule-based component for Genafl.Sponsor, Measurement.Count, and Genafl.Orgweb relations based on the following rules (for English only):

Genafl.Sponsor We first use spaCy to extract the shortest dependency path (SDP) between two entity mentions. If keywords like "pledge allegiance", "draws support from", or "behind" occur in the SDP and the length of the SDP is shorter than a threshold, we tag this relation as Genafl.Sponsor. We augmented a seed set of manually determined keywords with trigger words computed by our sentiment system, trained on the Good-For/Bad-For corpus (Deng et al., 2013), for sentiment relations with high confidence.

Measurement.Count We first apply Stanford CoreNLP to extract all NUMBER mentions N. Then, for each n_i ∈ N, we check the word after n_i; if it is part of a named entity extracted by our EDL system, we tag the relation between them as Measurement.Count.

Genafl.Orgweb Our EDL system can detect URLs as ORG and cluster them with other name mentions. We tag those URLs as Genafl.Orgweb of the ORG named entities.
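A minimal sketch of the Genafl.Sponsor rule: the dependency parse from spaCy is treated as an undirected graph and the shortest path between the two mentions is checked for keywords. The seed keywords are those listed above; the path-length threshold of 6, the token indices in the example, and the helper names are illustrative assumptions.

import spacy
import networkx as nx

nlp = spacy.load("en_core_web_sm")
SPONSOR_KEYWORDS = {"pledge allegiance", "draws support from", "behind"}

def shortest_dependency_path(doc, i, j):
    """Tokens on the shortest path between token indices i and j,
    using the dependency tree as an undirected graph."""
    g = nx.Graph()
    for tok in doc:
        g.add_edge(tok.i, tok.head.i)
    path = nx.shortest_path(g, source=i, target=j)
    return [doc[k] for k in path]

def is_sponsor_relation(sentence, head_idx, tail_idx, max_len=6):
    doc = nlp(sentence)
    path = shortest_dependency_path(doc, head_idx, tail_idx)
    path_text = " ".join(tok.text.lower() for tok in path)
    return (len(path) <= max_len
            and any(kw in path_text for kw in SPONSOR_KEYWORDS))

# Toy usage; indices 0 and 4 point at "Moscow" and "separatists".
print(is_sponsor_relation("Moscow is behind the separatists .", 0, 4))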

3.6 Event Extraction

3.6.1 Event Extraction with Generative Adversarial Imitation Learning

We use Generative Adversarial Imitation Learning (GAIL), an inverse reinforcement learning framework, to extract event triggers and detect argument roles. Event extraction consists of two steps: 1. trigger labeling: the system detects trigger words or phrases from natural language sentences; 2. argument role labeling: the system determines the relation between the triggers and the entities detected in Section 3.2.

Figure 3: A diagram illustrating the dynamic rewards estimated by the GAN in the GAIL event extractor.

Q-Learning for trigger labeling We use Q-learning to label the triggers from sequences of raw text tokens. We have Q-tables and Q-values

Q_sl(s_t, a_t) = f_sl(s_t | s_{t-1}, ..., a_{t-1}, ...),    (15)

where s_t denotes a state (feature) for token t:

s_t = <v_t, a_{t-1}>,    (16)

and a_t denotes an action (or label) from the agent (or extractor):

a_t = argmax_{a_t} Q_sl(s_t, a_t).    (17)

The v_t in Equation 16 is the context embedding of token t. We use local embeddings concatenated from the following: 1. 200-dim Word2Vec embeddings, 2. a 50-dim representation of Part-of-Speech (PoS) tags, 3. 100-dim token embeddings randomly initialized and updated in the training phase, and 4. 32-dim character embeddings. We adopt a Bi-LSTM to calculate v_t. We utilize another mono-directional LSTM to calculate Equation 15 and determine the labels from Equation 17.

In the training phase, we use the Bellman Equation to update Q-values and Q-tables:

Q^{π*}_sl(s_t, a_t) = r_t + γ max_{a_{t+1}} Q_sl(s_{t+1}, a_{t+1}),    (18)

where r_t denotes a reward based on the current state s_t and action a_t.

The word embeddings and the parameters of the LSTMs are updated by minimizing the objective

L_sl = (1/n) Σ_t^n Σ_a (Q'_sl(s_t, a_t) - Q_sl(s_t, a_t))^2,    (19)

where Q' denotes the updated Q-table from Equation 18.
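A small PyTorch sketch of the temporal-difference target from Equation 18 and the squared loss of Equation 19, shown here only for the chosen action at each step; the tensor shapes, discount factor, and reward values are illustrative assumptions.

import torch

def q_learning_loss(q_values, next_q_values, actions, rewards, gamma=0.9):
    """Squared TD error between the current Q-values and the Bellman
    targets r_t + gamma * max_a' Q(s_{t+1}, a') (Eq. 18-19).

    q_values, next_q_values: (seq_len, num_labels)
    actions: (seq_len,) chosen labels; rewards: (seq_len,) estimated rewards
    """
    q_sa = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)   # Q(s_t, a_t)
    target = rewards + gamma * next_q_values.max(dim=1).values   # Bellman target
    return ((target.detach() - q_sa) ** 2).mean()

# Toy usage: 5 tokens, 10 possible trigger labels.
q = torch.randn(5, 10, requires_grad=True)
q_next = torch.randn(5, 10)
loss = q_learning_loss(q, q_next, torch.randint(0, 10, (5,)), torch.randn(5))
loss.backward()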

Policy-gradient for argument role labeling For argument role labeling, we use another RL algorithm, policy gradient.

In this scenario, we also define a state (feature) for the agent (extractor):

s_{tr,ar} = <v_{t_tr}, v_{t_ar}, a_{t_tr}, a_{t_ar}, f_ss>,    (20)

where the v's are context embeddings as in Equation 16 and f_ss denotes a sub-sentence embedding calculated using a Bi-LSTM which consumes the sequence of context embeddings between the trigger and the argument.

We feed the state representation into an MLP to obtain a Q-table which represents the probability distribution over actions, and we determine the argument role with

a_{tr,ar} = argmax_{a_{tr,ar}} Q_{tr,ar}(s_{tr,ar}, a_{tr,ar}).    (21)

We also assign a reward R for the action (argument role), and we train the models by minimizing

L_pg = -R log P(a_{tr,ar} | s_{tr,ar}).    (22)

Dynamic Reward Estimation with GAN Our framework includes a reward estimator based on a GAN to issue dynamic rewards with regard to the labels committed by the event extractor, as shown in Figure 3. The reward estimator is trained upon the difference between the labels from the ground truth (expert) and the extractor (agent). If the extractor repeatedly commits wrong labels, the estimator issues increasingly strong penalties in later epochs, as illustrated in Figure 3.


The original GAN has a pool of real data, a generator G and a discriminator D, while in our framework the real data is the ground truth (expert), the generator G is the agent, and the discriminator D is a reward estimator.

To train the discriminator, we minimize the following objective function:

L_D = -(E[log D(s, a_E)] + E[log(1 - D(s, a_A))]),    (23)

where s is the state representation from Equations 16 and 20.

We use a linear transform to estimate rewards:

R(s, a) = 20 · (D(s, a) - 0.5).    (24)
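The following PyTorch sketch illustrates the reward estimator: a discriminator trained with the objective of Equation 23 and mapped to rewards with the linear transform of Equation 24. The discriminator architecture, dimensions, and one-hot action encoding are illustrative assumptions.

import torch
import torch.nn as nn

class RewardEstimator(nn.Module):
    """Discriminator D(s, a) scoring a state/action pair, mapped to a
    reward by R(s, a) = 20 * (D(s, a) - 0.5) (Eq. 24)."""
    def __init__(self, state_dim=512, num_labels=34):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + num_labels, 256), nn.ReLU(),
            nn.Linear(256, 1), nn.Sigmoid())

    def forward(self, state, action_onehot):
        return self.net(torch.cat([state, action_onehot], dim=-1)).squeeze(-1)

    def reward(self, state, action_onehot):
        return 20.0 * (self.forward(state, action_onehot) - 0.5)

def discriminator_loss(d, state, expert_action, agent_action):
    # Eq. 23: expert labels should score high, agent labels low.
    return -(torch.log(d(state, expert_action) + 1e-8).mean()
             + torch.log(1.0 - d(state, agent_action) + 1e-8).mean())

# Toy usage.
d = RewardEstimator()
s = torch.randn(4, 512)
expert = torch.eye(34)[torch.randint(0, 34, (4,))]
agent = torch.eye(34)[torch.randint(0, 34, (4,))]
loss = discriminator_loss(d, s, expert, agent)
print(loss.item(), d.reward(s, agent).shape)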

3.6.2 Event Extraction with Bi-LSTM

Trigger Labeling Event trigger detection remains a challenge due to the difficulty of encoding word semantics and word senses in various contexts. Previous approaches heavily depend on language-specific knowledge. However, compared to English, the resources and tools for Chinese are limited and yield low quality. A more promising approach is to automatically learn effective features from data, without relying on language-specific resources.

We developed a language-independent neural network architecture, Bi-LSTM-CRFs, which can effectively capture meaningful sequential information and jointly model nugget type decisions for event nugget detection. This architecture is similar to the one in (Yu et al., 2016; Feng et al., 2016a; Al-Badrashiny et al., 2017).

Given a sentence X = (X_1, X_2, ..., X_n) and its corresponding tags Y = (Y_1, Y_2, ..., Y_n), where n is the number of units contained in the sequence, we initialize each word with a vector by looking up word embeddings. Specifically, we use the Skip-Gram model to pre-train the word embeddings (Mikolov et al., 2013). Then, the sequence of words in each sentence is taken as input to the Bi-LSTM to get meaningful and contextual features. We feed these features into CRFs and maximize the log-probabilities of all tag predictions for the sequence.

Argument Labeling For event argument extraction, given a sentence, we first adopt the event trigger detection system to identify candidate triggers and utilize the EDL system to recognize all candidate arguments, including Person, Location, Organization, Geo-Political Entity, Time expression and Money phrases. For each trigger and each candidate argument, we select two types of sequence information as input to our neural architecture: the surface contexts between the trigger word and the candidate argument, or the shortest dependency path, which is obtained with the Breadth-First-Search (BFS) algorithm over the whole dependency parsing output.

Each word in the sequence is assigned a vector, which is concatenated from the vectors of the word embedding, position and POS tag. In order to better capture the relatedness between words, we also adopt CNNs to generate a vector for each word based on its character sequence. For each dependency relation, we randomly initialize a vector, which has the same dimensionality as each word vector.

We encode the two types of vector sequences with CNNs and Bi-LSTMs respectively. For the surface context based sequence, we utilize a general CNN architecture with max-pooling to obtain a vector representation. For the dependency path based sequence, we adopt Bi-LSTMs with max-pooling to get an overall vector representation. Finally, we concatenate these two vectors and feed them to two softmax functions: one to predict whether the current entity mention is a candidate argument of the trigger, and the other to predict the final argument role.

Training Details For all the above components, we utilize all the available event annotations from DEFT Rich ERE and ACE2005 for training.

3.6.3 Event Extraction for New Event Types

For the event types not included in the English training data, we implemented a system that maps FrameNet to the AIDA ontology (Table 3).

Framenet event type | AIDA event type
Conquering | Transaction.TransferControl
Criminal investigation | Justice.Investigate
Sign agreement | Government.Agreements
Inspecting | Inspection
Destroying | Existence.DamageDestroy
Damaging | Existence.DamageDestroy

Table 3: Mapping from Framenet to AIDA event ontology.

We tag frame elements as event arguments only if they overlap with named entities in our EDL outputs.


For the Government.Vote and Government.Spy event types, which cannot be mapped from Framenet, we implement a rule-based system. We detect event triggers based on fuzzy string matching with keywords (see footnote 1), and we further determine event arguments based on dependency trees (Table 4).

Event Type | Syntactic Relation | Argument Role
Government.Vote | nsubj | Voter
Government.Vote | nmod:for | Candidate
Government.Vote | dobj | Candidate
Government.Vote | nmod:in | Place, Date
Government.Spy | nsubj | Agent
Government.Spy | nmod:for | Beneficiary
Government.Spy | dobj | Target
Government.Spy | nmod:in | Place, Date

Table 4: Dependency Tree-based Argument Role identification for Government.Vote and Government.Spy events.

3.6.4 Russian and Ukrainian Event Extraction

Event extraction for Russian and Ukrainian textual data is performed using a two-stage language-independent model. The first stage uses a Bi-LSTM (Bi-directional Long Short Term Memory network) which takes pre-trained word embeddings of each word in the sentence as input and generates a vector representation of each word in the context of the sentence. To capture structured label sequences, such as B-attack followed by I-attack in the case of event triggers like "shot down", we use the vectors from the Bi-LSTM with a Softmax classifier to obtain tag predictions for each word.

The second stage uses a Convolutional Neural Network (CNN) followed by a Softmax classification (Kim, 2014) to label the arguments of the event mention. Each input to the CNN is a portion of the sentence (a 'sub-sentence') between a named entity mention and the trigger word. If there are multiple named entities present in a sentence containing a trigger, all such sub-sentences are retrieved. Each word in an input sub-sentence is represented by its word embedding. We concatenate an additional dimension to each word to represent its trigger type if it is a trigger word. This forms the input to the CNN.

1 Government.Vote: vote, ballot, poll; Government.Spy: spy

We use filters of size 3, 4, and 5 in the convolution layer, followed by a max-pooling layer and a fully-connected layer to generate a single vector representation of the input sub-sentence. We then use this vector and perform a Softmax classification to identify and label arguments.

Training Details For the Russian and Ukrainian languages, due to the absence of relevant training data, we use a native speaker to annotate three hundred documents in each language. This annotation is performed using the Brat annotation tool (Stenetorp et al., 2012), which we set up to allow annotations using the AIDA Event Ontology. These annotated files serve as the training and development sets for the above event extraction system.

3.6.5 Event Attribute Extraction

Hedge Detection We apply an uncertainty classifier (Vincze, 2015) to detect the uncertainty of each sentence.

Negation Detection We apply a negation detection toolkit (Gkotsis et al., 2016), given a sentence and a keyword (e.g. an event trigger) as inputs.

3.7 Event Coreference Resolution

We apply our language-independent within-document event coreference model to all English, Russian and Ukrainian documents. The model is based on our previous work (Al-Badrashiny et al., 2017; Yu et al., 2016; Hong et al., 2015; Chen and Ji, 2009). We view the event coreference space as an undirected weighted graph in which the nodes represent all the event nuggets and the edge weights indicate the coreference confidence between two event nuggets. We then apply hierarchical clustering to classify event nuggets into event hoppers. To compute the coreference confidence between two events, we train a Maximum Entropy classifier using the features listed in Table 5.
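The clustering step can be sketched as follows: pairwise coreference confidences are turned into distances and grouped with average-linkage hierarchical clustering from SciPy. The distance transform (1 - confidence) and the 0.5 cut threshold are illustrative assumptions rather than the exact settings of our model.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_event_nuggets(confidence, cut=0.5):
    """Group event nuggets into hoppers from a pairwise confidence matrix.

    confidence: (n, n) symmetric matrix of coreference confidences in [0, 1]
    Returns an array of cluster (hopper) ids, one per nugget.
    """
    distance = 1.0 - confidence
    np.fill_diagonal(distance, 0.0)
    condensed = squareform(distance, checks=False)
    tree = linkage(condensed, method="average")
    return fcluster(tree, t=1.0 - cut, criterion="distance")

# Toy usage: nuggets 0 and 1 are confidently coreferent, nugget 2 is not.
conf = np.array([[1.0, 0.9, 0.1],
                 [0.9, 1.0, 0.2],
                 [0.1, 0.2, 1.0]])
print(cluster_event_nuggets(conf))  # e.g. [1 1 2]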

3.8 Refinement with Human Hypotheses

A human hypothesis is a small topic-level connected knowledge graph. Given five human hypotheses, we generate TA1.b KBs conditioned upon each hypothesis through entity refinement, relation refinement, and event refinement.

Features | Remarks (EM1: the first event mention, EM2: the second event mention)
type subtype match | 1 if the types and subtypes of the event nuggets match
trigger pair exact match | 1 if the spellings of the triggers in EM1 and EM2 exactly match
Distance between the word embeddings | quantized semantic similarity score (0-1) using pre-trained word embeddings
token dist | how many tokens between the triggers of EM1 and EM2 (quantized)
Argument match | argument roles that are associated with the same entities
Argument conflict | number of argument roles that are associated with different entities

Table 5: Event Coreference Features.

Entity Refinement For each entity in a human hypothesis, we link it to the entities in the TA1.a KB based on its name string. If an entity appears in the hypothesis but not in the TA1.a KB, we propagate the KB by adding the new entity. If the entity has several types in one hypothesis, we select the type with the highest frequency. If there are conflicts between the human hypothesis and the TA1.a KB, we trust the one with the longer name string. For example, if an entity in the human hypothesis is 'Kramatorsk airport' and the corresponding entity mention in the TA1.a KB is 'airport', we update 'airport' to 'Kramatorsk airport'. Besides, to better support relation and event refinement, we also update name translations from English to Russian and Ukrainian after entity refinement.

Relation Refinement Given a relation in a human hypothesis, we propagate relations based on the co-occurrence sentences of its head and tail entities. There are four steps to generate these sentences. Firstly, we enrich hypothetical relations using entity clusters in the human hypothesis, i.e., entity coreference information. Namely, more hypothetical relations are constructed by substituting each entity with its coreferential entities. Secondly, three-way translations are done to further enrich the hypothetical relations. Specifically, for each language, apart from the original hypothetical relations, we also add the translated ones from the other two languages. Note that English serves as a bridge for the translation between Russian and Ukrainian. Thirdly, we link head and tail hypothetical entities to our system's entities following Section 5.1. Fourthly, we gather all entity mentions of the above linked entities to find co-occurrence sentences, including named entity mentions, nominal mentions and pronominal mentions.

We trained a second-stage binary classifier which predicts whether there exists a meaningful relation type in a given sentence. Compared with the first-stage classifier (denoted S1) (Section 3.5.1), most parts of the second-stage classifier (denoted S2), including the features, are the same except for the output layer. In the testing phase, we adopt three different strategies to refine relations. (1) For conflicting results where S1 and S2 both predict meaningful relations (relation types defined in the AIDA schema), we trust the binary classifier due to the high quality of the human hypotheses; (2) in cases where S1 predicts a meaningful relation type but S2 predicts no relation, we remove the S1 relation results; (3) for cases in which S1 predicts no relation, we adopt the S1 results, because the S2 classifier tends to be too aggressive.

Event Refinement Event refinement is based on the sentences which are extracted according to the co-occurrence of each event type and argument provided by the human hypothesis. The occurrence of each event type is based on the triggers detected by our event extraction systems. For each sentence containing an event trigger, we link each argument entity to entity mentions in our generated knowledge graph similarly to relation refinement, i.e., through entity cluster-based enrichment, three-way translations, entity linking and entity mention linking. If one of the entity mentions is also in the sentence, the sentence is extracted. Based on these sentences, we apply rules to propagate events. For example, for the event type 'Transaction.TransferOwnership', if the argument role is 'Thing', we add this argument to our extracted event.

3.9 Experiments

3.9.1 Performance Overview

Table 6 lists each text knowledge extraction component, the benchmark data set used, and the corresponding results.

3.9.2 What Works

We conducted ablation experiments on each feature component for relation extraction. Among the linguistic features we used, the entity type and position features contribute the most to performance. For example, relation extraction performance decreases by about 8% when the entity type feature is removed. We find that the entity type feature is vital for ensuring that the types of the two entity mentions are consistent with the hard entity type constraint of each relation type defined in the AIDA schema.


Components | Benchmark | # of Training Sents | # of Dev Sents | Our F1 (%) | SOTA F1 (%)
Mention Extraction | EDL | – | – | – | –
EDL + NIL Clustering | – | – | – | – | –
Nominal Coreference | ACE2005 & EDL | 721 | 21 | 67.6 | –
Relation Extraction (English) | ACE & ERE | 61,857 | 6,879 | 65.6 | N/A
Relation Extraction (Russian) | AIDA Seedling | 37,179 | 4,137 | 72.4 | N/A
Relation Extraction (Ukrainian) | AIDA Seedling | 17,001 | 1,896 | 68.2 | N/A
GAIL Event Extraction (Trigger) | ACE2005 | 14,837 | 863 | 72.9 | 69.6
GAIL Event Extraction (Argument) | ACE2005 | 14,837 | 863 | 59.0 | 57.2
Bi-LSTM Event Extraction (Trigger) | ERE | 45,801 | 5,088 | 65.41 | –
Bi-LSTM Event Extraction (Argument) | ERE | 3,240 | 1,056 | 85.02 | –
Russian Event Extraction (Trigger) | AIDA Seedling | 3,703 | 1,233 | 56.15 | N/A
Russian Event Extraction (Argument) | AIDA Seedling | 3,456 | 1,480 | 58.16 | N/A
Ukrainian Event Extraction (Trigger) | AIDA Seedling | 3,500 | 1,166 | 58.98 | N/A
Ukrainian Event Extraction (Argument) | AIDA Seedling | 3,268 | 1,398 | 61.13 | N/A
Event Coreference | – | – | – | – | –

Table 6: List of Components for Text Knowledge Extraction


For instances with ambiguity, our dynamic reward function provides larger margins between correct and wrong labels. For example, for "... they sentenced him to death ...", with the identical parameter set as above, the reward for the wrong Die label is −5.74 while the correct Execute label gains 6.53. For simpler cases, e.g., "... submitted his resignation ...", the rewards are flatter: 2.74 for End-Position, −1.33 for None, and −1.67 for Meet, which are sufficient to commit the correct labels.

3.9.3 Remaining Challenges

Score losses mainly come from missed trigger words and arguments. For example, the Meet trigger "pow-wow" is missed because it is rarely used to describe a formal political meeting, and there is no token with a similar surface form in the training data that could be recovered using character embeddings.

We observe some erroneous cases caused by fully biased annotation. In the sentence "Bombers have also hit targets ...", the entity "bombers" is mistakenly classified as the Attacker argument of the Attack event triggered by the word "hit". Here "bombers" refers to aircraft, is considered a VEH (Vehicle) entity, and should be an Instrument of the Attack event; however, "bombers" entities in the training data are annotated as Person (who detonates bombs) and never as Instrument. Although this is an ambiguous case, it does not compromise our claim about the merit of the proposed framework against ambiguous errors, because the framework still requires a mixture of different labels to acknowledge ambiguity.

4 Visual Knowledge Extraction

Recent advances in computer vision have made it possible to extract various types of knowledge from images and videos, e.g., object detection and tracking, face detection and recognition, and human activity understanding. Nevertheless, it is challenging to create a framework that extracts knowledge from various types of media, including visual media; communicates the information in a unified language for higher-level analysis; and cross-references knowledge elements between modalities.

Our first step toward such a unified system was the creation of a multimodal ontology that covers essential visual concepts. In this section, we summarize the system we developed for parsing visual knowledge from images and videos. We utilize an ensemble of state-of-the-art object detection models to detect, localize, and categorize a variety of entity types in still images and video keyframes. Next, we use face verification and object instance matching models to link and coreference the detected entities. The methods are elaborated in the following.


4.1 Entity Detection

Entities can appear in an image either as physical objects (e.g., Vehicle) or as scenes (e.g., Airport). To detect physical objects, we utilize Faster RCNN (Girshick, 2015), which is the current state of the art in object detection. We use three Faster RCNN models trained on the Open Images (Krasin et al., 2017) v2 (600 classes), MSCOCO (Lin et al., 2014) (80 classes), and Pascal VOC (Everingham et al., 2010) (20 classes) datasets. These models can accurately detect, localize, and categorize hundreds of object types. Nevertheless, supervised models like Faster RCNN come with a considerable limitation: they require a massive dataset of images with full bounding box annotation, which is not available for every visual concept.

To detect object types that are not covered by publicly available bounding box datasets, we use a Weakly Supervised Object Detection (WSOD) technique (Zhou et al., 2016), which requires only image-level annotation for training. We train that model on a subset of 250 classes from OpenImages v4 that are selected using the techniques described in Section 2.2.1. Since this method works from image-level labels and is not limited by bounding boxes, it can also be used to recognize scenes and image-level events. Therefore, we include scene and event types in the 250 selected classes.

To integrate these four models, we perform several post-processing steps. The first challenge is that each model generates a distribution over its own classes, and since the lists of classes are not shared between models, the probability values cannot be compared directly. To fix this, we rescale the confidence scores generated by each model to a unified distribution shared across all models. We use a reference dataset, randomly sampled from OpenImages, and apply all models to it. Then, for each object type and each model, we collect all correctly detected bounding boxes and fit a Gaussian distribution to their confidence scores. We rescale the confidence values by a constant multiplication and addition to adjust the mean and standard deviation to a fixed value for all models and categories.
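The following Python sketch shows one way such an affine rescaling could be computed per model and category; the target mean and standard deviation are placeholder values, not the ones used by the system.

import numpy as np

def fit_rescaler(correct_scores, target_mean=0.5, target_std=0.15):
    # correct_scores: confidences of correctly detected boxes on the reference set
    # for one (model, category) pair. Returns (a, b) such that a * score + b has
    # the target mean and standard deviation.
    mu, sigma = np.mean(correct_scores), np.std(correct_scores) + 1e-8
    a = target_std / sigma
    b = target_mean - a * mu
    return a, b

# After rescaling, scores from different detectors become directly comparable.
a, b = fit_rescaler(np.array([0.62, 0.74, 0.80, 0.91]))
unified_score = a * 0.85 + b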

Furthermore, we merge bounding boxes of the same type if their overlap is higher than a threshold. We also propagate class labels up the hierarchy tree and remove detections that fall outside our ontology.

In addition to our object detection ensemble, we utilize MTCNN (Zhang et al., 2016), a 3-stage CNN, to detect and align faces. A detected face is linked with any person object detected by the ensemble if the intersection of the two bounding boxes covers a high ratio of the total face area.
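For concreteness, a small Python sketch of the face-to-person linking test is given below; the 0.7 ratio threshold is an assumed value, not one reported in the paper.

def link_face_to_person(face_box, person_box, ratio_threshold=0.7):
    # Boxes are (x1, y1, x2, y2). Link the face to the person detection if the
    # intersection covers a high ratio of the total face area.
    x1 = max(face_box[0], person_box[0])
    y1 = max(face_box[1], person_box[1])
    x2 = min(face_box[2], person_box[2])
    y2 = min(face_box[3], person_box[3])
    intersection = max(0, x2 - x1) * max(0, y2 - y1)
    face_area = (face_box[2] - face_box[0]) * (face_box[3] - face_box[1])
    return face_area > 0 and intersection / face_area >= ratio_threshold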

4.2 Face Recognition

We extract facial features from each detected face using a FaceNet (Schroff et al., 2015) model trained on the CASIA dataset. These features are compared with our database of facial features and matched to the closest person if the distance between features is below a threshold. To construct the database of facial features, we first create a list of people that are important to detect. We collect named person annotations from the seedling data and use them as seeds to discover other relevant names from DBPedia. To do so, we find all incoming and outgoing relations to/from each seed name and repeat for two hops. We filter the results with a set of rules to keep only highly relevant names. This yields a list of 388 people that are potentially relevant to the scenario.

After that, we use the Google search API to collect images for each named person. Finally, we extract FaceNet features from each image and store them in the database. At test time, we compare the features of a query face to all entries in our database and compute the average distance for each person. We link the query to the closest person if the average distance is less than a threshold.
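A minimal Python sketch of this matching step, assuming a dictionary that maps each person name to an array of stored FaceNet embeddings and an illustrative distance threshold:

import numpy as np

def match_face(query_feature, feature_db, threshold=1.0):
    # feature_db: dict mapping person name -> (k, d) array of stored embeddings.
    best_name, best_dist = None, float("inf")
    for name, features in feature_db.items():
        # Average Euclidean distance between the query and this person's features.
        avg_dist = np.mean(np.linalg.norm(features - query_feature, axis=1))
        if avg_dist < best_dist:
            best_name, best_dist = name, avg_dist
    # Link only if the closest person is within the distance threshold.
    return best_name if best_dist < threshold else None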

4.3 Visual Entity Coreference

We use the DBSCAN clustering algorithm (Ester et al., 1996) on features extracted from entities of the same type to link mentions of the same entity instance. The linking accuracy depends heavily on the quality of the features. For entities of type person, we use FaceNet features, which have proven highly accurate in various face recognition and verification tasks (Schroff et al., 2015). For other entity types, instance-level feature representation is still an unsolved research issue.

We use an approach similar to FaceNet, training a CNN with a triplet loss to learn a Euclidean metric in which images of the same instance are closer to each other than to other instances of the same type. To obtain such triplets, we use the YouTube BoundingBoxes dataset, which was originally intended for object tracking. Each video comes with bounding box annotations for certain objects at every frame. For each triplet, we randomly sample two frames of an object tracklet as positive images of the same object instance and randomly choose a bounding box from another tracklet of the same type as a negative example.
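The sampling strategy can be sketched as follows in Python; the tracklet data structure (a dict from object type to lists of per-frame crops) is an assumption made for illustration, and tracklets with fewer than two frames would need to be filtered out beforehand.

import random

def sample_triplet(tracklets_by_type):
    # tracklets_by_type: dict mapping object type -> list of tracklets, where each
    # tracklet is a list of per-frame crops of a single object instance.
    obj_type = random.choice(list(tracklets_by_type))
    tracklets = tracklets_by_type[obj_type]
    positive_tracklet = random.choice(tracklets)
    # Two frames of the same instance form the anchor/positive pair.
    anchor, positive = random.sample(positive_tracklet, 2)
    # A crop from a different tracklet of the same type is the negative example.
    negative_tracklet = random.choice([t for t in tracklets if t is not positive_tracklet])
    negative = random.choice(negative_tracklet)
    return anchor, positive, negative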

Another type of coreference resolution is done by merging highly overlapping bounding boxes in an image. For example, if a face bounding box falls within a human body bounding box, we link them to indicate they represent the same person entity. Moreover, if a grounded text entity overlaps with a bounding box detected by our object detection system, we link those as well.

5 Cross-Media Coreference

In this section, we demonstrate how visual knowledge is integrated with textual knowledge via cross-modal grounding. We propose a model for visual grounding of text phrases that outperforms the state of the art. The input to this model is an image and sentence pair, and the output is a localization and relevance score for each word in the sentence. The model can also localize phrases instead of words if phrase boundaries are provided. Figure 4 illustrates the model's block diagram. It consists of visual and textual branches, each extracting local and global features.

More specifically, we use PNASNet (Liu et al., 2017) and ELMo (Peters et al., 2018) as our visual and textual branches, respectively. PNASNet produces a global image feature at its output and several levels of local feature maps at its intermediate layers; we use one of the middle layers, where each local feature represents an image region. ELMo produces local features for each word as well as a global feature for the sentence. We compute correlations between all word features and all image region features, which results in a heatmap for each word.
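A simplified Python sketch of the word-region correlation step follows; it uses cosine similarity over L2-normalised features, which is one plausible choice rather than the exact formulation used in the system.

import numpy as np

def word_region_heatmaps(word_features, region_features):
    # word_features: (n_words, d); region_features: (h, w, d) from an intermediate
    # layer. Returns an (n_words, h, w) stack of per-word heatmaps.
    h, w, d = region_features.shape
    regions = region_features.reshape(-1, d)
    regions = regions / (np.linalg.norm(regions, axis=1, keepdims=True) + 1e-8)
    words = word_features / (np.linalg.norm(word_features, axis=1, keepdims=True) + 1e-8)
    return (words @ regions.T).reshape(words.shape[0], h, w)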

The model is trained on Flickr30k (Young et al., 2014) by maximizing the overall image-to-sentence matching score for relevant image-caption pairs and minimizing it for non-relevant pairs. Qualitatively, we observe a large improvement over the previous baseline (Engilberge et al., 2018), which was the state of the art in visual grounding. Example results on the seedling data are illustrated in Figure 5.

Figure 4: Block diagram of the proposed visual grounding method.

Figure 5: Example grounding results.

6 Cross Document Knowledge Element Linking

Our approach to cross-document knowledge element linking uses connected component detection. We formulate a graph that relies heavily on information associated with the knowledge elements by TA1. For entities, we apply the following rules:

1. Within a single document, we link nodes that belong to the same TA1-identified cluster (i.e., we respect within-document coreference decisions from TA1).

2. We link those entities for which TA1 associated the same external link.

3. We use string similarity to cluster named entities. Because pairwise string matching of all named entities is not scalable, we apply blocking and only compare entities that meet one of the following conditions: (a) they share the first three characters; (b) they meet a min-hashing requirement. Within a block, two entities are linked if they meet all of the following criteria: (a) they are of the same type and that type is one of GPE, LOC, PER, ORG, or Facilities; (b) neither entity has a reliable external link; (c) the Jaro distance is above a threshold (0.9).
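A minimal Python sketch of this blocking-and-matching step is shown below; it covers only the character-prefix block (not min-hashing), and the entity dictionary schema and the similarity_fn argument (e.g., a Jaro similarity implementation) are assumptions for illustration.

from collections import defaultdict
from itertools import combinations

ALLOWED_TYPES = {"GPE", "LOC", "PER", "ORG", "FAC"}

def candidate_links(entities, similarity_fn, threshold=0.9):
    # entities: list of dicts with 'name', 'type', and 'external_link' keys.
    blocks = defaultdict(list)
    for entity in entities:
        blocks[entity["name"][:3].lower()].append(entity)  # block on first three characters
    links = []
    for block in blocks.values():
        for e1, e2 in combinations(block, 2):
            if (e1["type"] == e2["type"] and e1["type"] in ALLOWED_TYPES
                    and not e1["external_link"] and not e2["external_link"]
                    and similarity_fn(e1["name"], e2["name"]) >= threshold):
                links.append((e1, e2))
    return links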


7 Question Answering

Both the document-level and the corpus-level tasks are evaluated via question answering using three types of queries: (1) Class-based queries: find all instances of a known class in a document (for TA1) or the corpus (for TA2), e.g., find all organizations. (2) Zero-hop queries: given a specific instance of an entity, find all instances of that entity in a document (for TA1) or the corpus (for TA2), e.g., find all instances of the PERSON entity indicated by bounding box A,B,C,D. (3) Graph queries: given an instance (specified using offsets or a bounding box, as in a zero-hop query) and a graph of relations connected to that instance, find the nodes that could be included in the graph.

Class-based queries are answered with a simple SPARQL query of the KB. Zero-hop queries are answered with a slight relaxation of the offsets and/or bounding box of the query, which accounts for errors in exact extent. Answering graph queries is more complex: even with state-of-the-art extractions from TA1, we expect missing information in the KB, so matching the full graph (often more than 50 edges) is unlikely. To match a graph query, we first use the relaxation strategies described for zero-hop queries on any entry points. If simple, extent-based relaxation fails to match, we further relax the query by using exact string match on the name attribute in the KB within a single document. If all of these relaxations fail, we will not find any answers to the query. Assuming that we find an entry point, we attempt to match the full graph. Much of the time, this fails. Our first relaxation attempts to find a 'backbone' of the query by eliminating nodes that are connected to only one other node in the graph; we then attempt to match the backbone. In some cases, nodes are connected to the graph by more than one edge. If we are unable to match the full backbone, we try an alternative relaxation strategy in which we allow a match if any (rather than all) of the edges linking a node connect. As a final relaxation strategy, we combine the backbone approach with the match-any approach. We use the same strategies for both TA1 and TA2. For TA1, because we limit string-based matching to a single document (to avoid, e.g., conflating all John Smiths), we will not find answers where the entry-point entity was not found in the relevant document. For TA2, we rely on system cross-document linking to enable answering questions across documents.
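As an illustration of the backbone relaxation, the short Python sketch below drops query nodes that are connected to only one other node; the edge-list representation is an assumption, not the system's actual data structure.

from collections import Counter

def query_backbone(edges):
    # edges: list of (node_a, node_b) pairs from the graph query.
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    # Keep only edges whose endpoints are each connected to more than one node.
    return [(a, b) for a, b in edges if degree[a] > 1 and degree[b] > 1]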

8 HypoGator: Alternative Hypotheses Generation and Ranking

In this section we provide an overview of HypoGator, our hypothesis generation system. HypoGator relies on the KB constructed by combining and aligning the document-level knowledge elements. Using multiple features and inconsistency detection methods, it extracts coherent and consistent hypotheses, and finally returns a sorted list of the alternative hypotheses relevant to the query. We describe the major components of our system in the following sections.

8.1 Hypotheses Generation and Ranking

The input KB is first converted from the AIDA Interchange Format (AIF) to our internal format, where each edge in the KB is represented as (event, role, entity, properties). We also find entry points and their roles from the information need .xml files.

The major components of HypoGator's pipeline include:

Extraction: Find all the nodes in the graph that match any of the entry points from the information need. We use string similarity metrics in combination with exact text-offset matching to find all the matches. The set of matched entities are called seeds.

Expansion: Using heuristic methods, each of the seed entities is locally expanded to a subgraph. This subgraph contains several nodes and events in the neighborhood of the seed.

Exposition: During the training phase (on the annotated data), multiple features indicative of coherence and relevance were developed and identified. For example, one feature measures how similar a given subgraph is to the query; another approximates coherence using pairwise vertex connectivity.

These features are then aggregated to calculate a score for each subgraph. Finally, the subgraphs are sorted by this score.

8.2 Inconsistency Detection

The inconsistency detection focuses on identifying conflicting information in multiple events and relationships represented as tuples. Such inconsistency may be inferred from either negated tuples or conflicting information provided within the tuples' fields.


Figure 6: The HypoGator architecture for hypothesis mining and ranking. The TA2 KB and the information need feed a query processor (entry-point extraction via string/offset similarity), followed by candidate hypothesis generation (local graph expansion, A* search), inconsistency and coherence features (hierarchical logical inconsistency detection, query relevance, edge intensity, entity correlation), hypothesis clustering (subgraph similarity, graph embedding), and confidence computation (score calculation, ensembling), producing AIDA hypotheses.

Another source of inconsistency comes from event and filler types associated with the wrong field information according to the project's ontology. We then compare events after a transformation phase under ontology-extracted functional dependencies, which allows us to select the fields that need to be compared. Field inconsistency is detected using hierarchical information extracted from common-sense networks. Such data allows us to detect whether all the entities and fillers occurring in the same event or relationship are compatible, that is, whether they appear within the same is-a/part-of generalisation path.

Each hypothesis is associated with an inconsistency function $I_{MI_c} : K \to \mathbb{R}^{+\infty}_{\ge 0}$ (Hunter and Konieczny, 2008), which summarises the amount of detected inconsistency as a non-negative real number. The function weights the minimal inconsistent sets in $MI(H)$ of each hypothesis $H$ and is defined as follows:

$$I_{MI_c}(H) = \sum_{M \in MI(H)} \frac{1}{|M|}$$

We chose this function because it satisfies relevant rationality postulates such as super-additivity, attenuation, and irrelevance of syntax. Moreover, it overcomes the problems of the lottery paradox (Kyburg, 1961) by assigning a stronger weight to inconsistencies that can be drawn from few pieces of evidence.
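A direct Python rendering of this measure, assuming the minimal inconsistent sets have already been computed for a hypothesis:

def inconsistency_measure(minimal_inconsistent_sets):
    # I_MIc(H): sum of 1/|M| over every minimal inconsistent set M of hypothesis H.
    return sum(1.0 / len(m) for m in minimal_inconsistent_sets if m)

# A conflict supported by few assertions weighs more than one that needs many:
print(inconsistency_measure([{"a1", "a2"}, {"a3", "a4", "a5", "a6"}]))  # 0.5 + 0.25 = 0.75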

8.3 Hypotheses Clustering

Due to the nature of AIDA's data (e.g., multiple documents about the same hypothesis, and noise in TA2's alignment), it is possible to have multiple subgraphs representing the same hypothesis. Our system uses subgraph clustering to prune duplicate hypotheses.

Because we focus on hypotheses and work with a smaller amount of data than TA2 does, we can afford more expensive subgraph-level clustering. We compute a similarity score for each pair of generated subgraph hypotheses, using entity features, TA2's alignments, and the graph structure. We finally use spectral clustering to group the set of hypotheses into K clusters.

9 Hypothesis Generation at ISI

TA2 generates a full Knowledge Graph (KG). Given such a KG and a file containing the "information need" that identifies a topic in the KG, the steps required to generate the top k hypotheses in TA3 are as follows.

Retrieve the knowledge elements relevant to the Entry Points. We convert the XML file containing the information need to SPARQL and perform a query on the full KG generated by TA2.

Convert the full KG to data in our own internal format. The data in this new format only contains information about the knowledge elements used for reasoning, such as the types of Entities, Events, and Relations, as well as the arguments of these Events and Relations. Each of these Entities, Events, and Relations has a confidence value.

Generate mutual exclusion constraints. Based on the LDC ontology, we generate two additional types of mutual exclusion constraints: (a) cardinality constraints and (b) domain constraints. For some types of Events or Relations, some arguments can only take a limited number of distinct objects; for example, given an Event Life.Born, the argument Life.Born Time can have only one object. Such constraints qualify as cardinality constraints. For each type of Event or Relation, each argument can accept only specific types of objects. Such constraints qualify as domain constraints.

Represent the new data as weighted constraints. In our approach, an assertion in the KG is not automatically deemed true in a hypothesis, because it carries a confidence value. Instead, a hypothesis fixes the truth value of each assertion, i.e., it is a subset of the full KG. We therefore treat each assertion as a Boolean variable with domain {0, 1}, where '0' and '1' represent 'False' and 'True', respectively. For each Boolean variable v corresponding to an assertion with confidence value p, we add a unary weighted constraint with weights −log(1 − p) and −log(p) against the assignments v = 0 and v = 1, respectively. For each mutual exclusion constraint c between two variables v1 and v2, we add a binary weighted constraint with weights −log(p/2), −log(1 − p/2), −log(1 − p/2), and −log(p/2) against the assignments (v1 = 0, v2 = 0), (v1 = 0, v2 = 1), (v1 = 1, v2 = 0), and (v1 = 1, v2 = 1), respectively.
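The weight construction can be written down directly; the Python sketch below assumes confidence values strictly between 0 and 1 and uses p as described in the text for the mutual exclusion case.

import math

def unary_weights(p):
    # Weights against v = 0 and v = 1 for an assertion with confidence p (0 < p < 1).
    return {0: -math.log(1 - p), 1: -math.log(p)}

def mutex_weights(p):
    # Weights against the four joint assignments of (v1, v2) for a mutual
    # exclusion constraint, following the formulation in the text.
    return {
        (0, 0): -math.log(p / 2),
        (0, 1): -math.log(1 - p / 2),
        (1, 0): -math.log(1 - p / 2),
        (1, 1): -math.log(p / 2),
    }

# Example: a confident assertion (p = 0.9) is cheap to set true, expensive to set false.
print(unary_weights(0.9))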

Solve a Weighted Constraint Satisfaction Problem (WCSP) defined on the weighted constraints. We use the WCSP-LIFT solver, based on the idea of the Constraint Composite Graph (CCG) (Xu et al., 2017), for exploiting structure in combinatorial optimization problems. The WCSP solver finds an optimal assignment of values to all Boolean variables that minimizes the total weight; this optimal assignment is the top hypothesis of the original problem. To generate the kth hypothesis, we construct a new WCSP that includes all the weighted constraints described above as well as additional constraints that disallow the previously generated k − 1 hypotheses.

Focus hypothesis generation on a subgraph. Step 1 identifies the Entry Points. From the full KG, we retrieve a subgraph of knowledge elements that are within n hops of these Entry Points, for a user-specified value of n. For each of the top k hypotheses, we discard the elements that do not exist in this subgraph, producing a hypothesis that is relevant to the topic of interest specified in the "information need".

Score the hypotheses. Each time we solve a WCSP instance, we obtain an optimal assignment as well as a total weight w. The score of the hypothesis is then e^{-w}, inverting the transformation that defined the weighted constraints from the confidence values.

10 Conclusion and Future Work

Acknowledgments

This work was supported by the U.S. DARPA AIDA Program No. FA8750-18-2-0014, LORELEI Program No. HR0011-15-C-0115, Air Force No. FA8650-17-C-7715, NSF IIS-1523198, and U.S. ARL NS-CTA No. W911NF-09-2-0053. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.

References

Automated phrase mining from massive text corpora.

Mohamed Al-Badrashiny, Jason Bolton, Arun Tejasvi Chaganty, Kevin Clark, Craig Harman, Lifu Huang, Matthew Lamm, Jinhao Lei, Di Lu, Xiaoman Pan, et al. 2017. Tinkerbell: Cross-lingual cold-start knowledge base construction. In TAC.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet project. In Proceedings of the 17th International Conference on Computational Linguistics - Volume 1, pages 86–90. Association for Computational Linguistics.

D. Blei, A. Ng, and M. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.

Jiawei Chen, Yin Cui, Guangnan Ye, Dong Liu, and Shih-Fu Chang. 2014. Event-driven semantic concept discovery by exploiting weakly tagged internet images. In Proceedings of the International Conference on Multimedia Retrieval, page 1. ACM.

Yubo Chen, Liheng Xu, Kang Liu, Daojian Zeng, and Jun Zhao. 2015. Event extraction via dynamic multi-pooling convolutional neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing.

Zheng Chen and Heng Ji. 2009. Graph-based event coreference resolution. In Proceedings of the 2009 Workshop on Graph-based Methods for Natural Language Processing, pages 54–57. Association for Computational Linguistics.

Jason P.C. Chiu and Eric Nichols. 2016. Named entity recognition with bidirectional LSTM-CNNs. In Transactions of the Association for Computational Linguistics.

Lingjia Deng, Yoonjung Choi, and Janyce Wiebe. 2013. Benefactive/malefactive event and writer attitude annotation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pages 120–125.

Martin Engilberge, Louis Chevallier, Patrick Perez, and Matthieu Cord. 2018. Finding beans in burgers: Deep semantic-visual embedding with localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3984–3993.

Martin Ester, Hans-Peter Kriegel, Jorg Sander, Xiaowei Xu, et al. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, volume 96, pages 226–231.

Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. 2010. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338.

Xiaocheng Feng, Lifu Huang, Duyu Tang, Bing Qin, Heng Ji, and Ting Liu. 2016a. A language-independent neural network for event detection. In The 54th Annual Meeting of the Association for Computational Linguistics, page 66.

Xiaocheng Feng, Heng Ji, Duyu Tang, Bing Qin, and Ting Liu. 2016b. A language-independent neural network for event detection. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.

Ross Girshick. 2015. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448.

George Gkotsis, Sumithra Velupillai, Anika Oellrich, Harry Dean, Maria Liakata, and Rina Dutta. 2016. Don't let notes be misunderstood: A negation detection method for assessing risk of suicide in mental health records. In Proceedings of the Third Workshop on Computational Linguistics and Clinical Psychology, pages 95–105.

Alan Graves, Navdeep Jaitly, and Abdel-rahman Mohamed. 2013. Hybrid speech recognition with deep bidirectional LSTM. In Automatic Speech Recognition and Understanding, 2013 IEEE Workshop on.

Yu Hong, Di Lu, Dian Yu, Xiaoman Pan, Xiaobin Wang, Yadong Chen, Lifu Huang, and Heng Ji. 2015. RPI BLENDER TAC-KBP2015 system description. In Proc. Text Analysis Conference (TAC2015).

Lifu Huang, Avirup Sil, Heng Ji, and Radu Florian. 2017a. Improving slot filling performance with attentive neural networks on dependency structures. In Proc. Conference on Empirical Methods in Natural Language Processing (EMNLP2017).

Lifu Huang, Heng Ji, Kyunghyun Cho, and Clare R. Voss. 2017b. Zero-shot transfer learning for event extraction. arXiv preprint arXiv:1707.01066.

Lifu Huang, Jonathan May, Xiaoman Pan, Heng Ji, Xiang Ren, Jiawei Han, Lin Zhao, and James A. Hendler. 2017c. Liberal entity extraction: Rapid construction of fine-grained entity typing systems. Big Data, 5.

Anthony Hunter and Sebastien Konieczny. 2008. Measuring inconsistency through minimal inconsistent sets. In Proceedings of the Eleventh International Conference on Principles of Knowledge Representation and Reasoning, KR'08, pages 358–366. AAAI Press.

Yoon Kim. 2014. Convolutional neural networks for sentence classification.

Ivan Krasin, Tom Duerig, Neil Alldrin, Vittorio Ferrari, Sami Abu-El-Haija, Alina Kuznetsova, Hassan Rom, Jasper Uijlings, Stefan Popov, Shahab Kamali, Matteo Malloci, Jordi Pont-Tuset, Andreas Veit, Serge Belongie, Victor Gomes, Abhinav Gupta, Chen Sun, Gal Chechik, David Cai, Zheyun Feng, Dhyanesh Narayanan, and Kevin Murphy. 2017. OpenImages: A public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://storage.googleapis.com/openimages/web/index.html.

Henry E. Kyburg. 1961. Probability and the Logic of Rational Belief. Wesleyan University Press.

Guillaume Lample, Miguel Ballesteros, Kazuya Kawakami, Sandeep Subramanian, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Ni Lao, Tom Mitchell, and William W. Cohen. 2011. Random walk inference and learning in a large scale knowledge base. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 529–539. Association for Computational Linguistics.

Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In International Conference on Machine Learning, pages 1188–1196.


Kenton Lee, Luheng He, Mike Lewis, and Luke Zettlemoyer. 2017. End-to-end neural coreference resolution. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 188–197. Association for Computational Linguistics.

Hongzhi Li, Joseph G. Ellis, Heng Ji, and Shih-Fu Chang. 2016. Event specific multimodal pattern mining for knowledge base construction. In Proceedings of the 2016 ACM on Multimedia Conference, pages 821–830. ACM.

Qi Li and Heng Ji. 2014. Incremental joint extraction of entity mentions and relations. In Proc. the 52nd Annual Meeting of the Association for Computational Linguistics (ACL2014).

Qi Li, Heng Ji, Yu Hong, and Sujian Li. 2014. Constructing information networks using one single model. In Proc. the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP2014).

Qi Li, Heng Ji, and Liang Huang. 2013. Joint event extraction via structured prediction with global features. In Proc. the 51st Annual Meeting of the Association for Computational Linguistics (ACL2013).

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer.

Ying Lin, Shengqi Yang, Veselin Stoyanov, and Heng Ji. 2018. A multi-lingual multi-task architecture for low-resource sequence labeling. In Proc. the 56th Annual Meeting of the Association for Computational Linguistics (ACL2018).

Xiao Ling, Sameer Singh, and Daniel Weld. 2015. Design challenges for entity linking. Transactions of the Association for Computational Linguistics, 3.

Chenxi Liu, Barret Zoph, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. 2017. Progressive neural architecture search. arXiv preprint arXiv:1712.00559.

Yang Liu, Furu Wei, Sujian Li, Heng Ji, and Ming Zhou. 2015. A dependency-based neural network for relation classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics.

Di Lu, Leonardo Neves, Vitor Carvalho, Ning Zhang, and Heng Ji. 2018. Visual attention model for name tagging in multimodal social media. In Proc. the 56th Annual Meeting of the Association for Computational Linguistics (ACL2018).

O. Medelyan and C. Legg. 2008. Integrating Cyc and Wikipedia: Folksonomy meets rigorously defined common-sense. In Proc. AAAI 2008 Workshop on Wikipedia and Artificial Intelligence.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Thien Huu Nguyen and Ralph Grishman. 2015a. Event detection and domain adaptation with convolutional neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing.

Thien Huu Nguyen and Ralph Grishman. 2015b. Relation extraction: Perspective from convolutional neural networks. In Proceedings of the NAACL Workshop on Vector Space Modeling for NLP.

Thien Huu Nguyen and Ralph Grishman. 2016. Joint event extraction via recurrent neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics.

Martha Palmer, Daniel Gildea, and Paul Kingsbury. 2005. The Proposition Bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1):71–106.

Xiaoman Pan, Taylor Cassidy, Ulf Hermjakob, Heng Ji, and Kevin Knight. 2015. Unsupervised entity linking with abstract meaning representation. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Xiaoman Pan, Boliang Zhang, Jonathan May, Joel Nothman, Kevin Knight, and Heng Ji. 2017. Cross-lingual name tagging and linking for 282 languages. In Proc. the 55th Annual Meeting of the Association for Computational Linguistics.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.

Florian Schroff, Dmitry Kalenichenko, and James Philbin. 2015. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823.

Karin Kipper Schuler. 2005. VerbNet: A broad-coverage, comprehensive verb lexicon.

Ge Shi, Chong Feng, Lifu Huang, Boliang Zhang, Heng Ji, Lejian Liao, and Heyan Huang. 2018. Genre separation network with adversarial training for cross-genre relation extraction. In Proc. 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP2018).

Pontus Stenetorp, Sampo Pyysalo, Goran Topic, Tomoko Ohta, Sophia Ananiadou, and Jun'ichi Tsujii. 2012. brat: a web-based tool for NLP-assisted text annotation. In Proceedings of the Demonstrations Session at EACL 2012, Avignon, France. Association for Computational Linguistics.

Robert L. Taft. 1970. Name Search Techniques. New York State Identification and Intelligence System, Albany, New York, US.

Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. 2015. LINE: Large-scale information network embedding. In WWW, pages 1067–1077. International World Wide Web Conferences Steering Committee.

Veronika Vincze. 2015. Uncertainty Detection in Natural Language Texts. Ph.D. thesis, SZTE.

Hong Xu, T. K. Satish Kumar, and Sven Koenig. 2017. The Nemhauser-Trotter reduction and lifted message passing for the weighted CSP. In Proceedings of the 14th International Conference on Integration of Artificial Intelligence and Operations Research Techniques in Constraint Programming (CPAIOR), pages 387–402.

Yunlun Yang, Yunhai Tong, Shulei Ma, and Zhi-Hong Deng. 2016. A position encoding convolutional neural network based on dependency tree for relation classification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78.

Dian Yu, Xiaoman Pan, Boliang Zhang, Lifu Huang, Di Lu, Spencer Whitehead, and Heng Ji. 2016. RPI BLENDER TAC-KBP2016 system description.

Daojian Zeng, Kang Liu, Siwei Lai, Guangyou Zhou, and Jun Zhao. 2014. Relation classification via convolutional deep neural network. In Proceedings of the 25th International Conference on Computational Linguistics.

Boliang Zhang, Di Lu, Xiaoman Pan, Ying Lin, Halidanmu Abudukelimu, Heng Ji, and Kevin Knight. 2017a. Embracing non-traditional linguistic resources for low-resource language name tagging. In Proc. the 8th International Joint Conference on Natural Language Processing (IJCNLP 2017).

Boliang Zhang, Spencer Whitehead, Lifu Huang, and Heng Ji. 2018a. Global attention for name tagging. In Proc. the SIGNLL Conference on Computational Natural Language Learning (CoNLL2018).

Boliang Zhang, Spencer Whitehead, Lifu Huang, and Heng Ji. 2018b. Global attention for name tagging. In Proc. the SIGNLL Conference on Computational Natural Language Learning (CoNLL2018).

Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, and Yu Qiao. 2016. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10):1499–1503.

Tongtao Zhang and Heng Ji. 2018. Event extraction with generative adversarial imitation learning. In arXiv.

Tongtao Zhang, Spencer Whitehead, Hanwang Zhang, Hongzhi Li, Joseph Ellis, Lifu Huang, Wei Liu, Heng Ji, and Shih-Fu Chang. 2017b. Improving event extraction via cross-modal integration. In Proc. the 25th ACM International Conference on Multimedia (ACMMM2017).

Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. 2016. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2921–2929.