
MAVEN: A Massive General Domain Event Detection Dataset

Xiaozhi Wang1, Ziqi Wang1, Xu Han1, Wangyi Jiang1, Rong Han1, Zhiyuan Liu1, Juanzi Li1, Peng Li2, Yankai Lin2, Jie Zhou2

1 Department of Computer Science and Technology, Tsinghua University, Beijing, China
2 Pattern Recognition Center, WeChat AI, Tencent Inc, China

{xz-wang16, ziqi-wan16, hanxu17}@mails.tsinghua.edu.cn

Abstract

Event detection (ED), which identifies event trigger words and classifies event types according to contexts, is the first and most fundamental step for extracting event knowledge from plain text. Most existing datasets exhibit the following issues that limit further development of ED: (1) The small scale of existing datasets is not sufficient for training and stably benchmarking increasingly sophisticated modern neural methods. (2) The limited event types of existing datasets mean that trained models cannot be easily adapted to general-domain scenarios. To alleviate these problems, we present a MAssive eVENt detection dataset (MAVEN), which contains 4,480 Wikipedia documents, 117,200 event mention instances, and 207 event types. MAVEN alleviates the lack-of-data problem and covers much more general event types. Besides the dataset, we reproduce the recent state-of-the-art ED models and conduct a thorough evaluation of these models on MAVEN. The experimental results and empirical analyses show that existing ED methods cannot achieve results as promising as on the small datasets, which suggests that ED in the real world remains a challenging task and requires further research efforts. The dataset and baseline code will be released in the future to promote this field.

1 Introduction

Event detection (ED) is an important task of information extraction, which aims to identify event triggers (the words or phrases evoking events in text) and classify event types. For instance, in the sentence "Bill Gates founded Microsoft in 1975", an ED model should recognize that the word "founded" is the trigger of a Found event. ED is the first stage of extracting event knowledge from text (Ahn, 2006) and is also fundamental to various NLP applications (Yang et al., 2003; Basile et al., 2014; Cheng and Erk, 2018).

Preprint. Work in progress.

Figure 1: Main statistics of the ACE 2005 English dataset.

Due to the rising importance of obtaining event knowledge, many efforts have been devoted to ED in recent years. Extensive supervised methods along with various techniques have been developed, including feature-based models (Ji and Grishman, 2008; Gupta and Ji, 2009; Li et al., 2013; Araki and Mitamura, 2015) and advanced neural models (Chen et al., 2015; Nguyen and Grishman, 2015; Nguyen et al., 2016; Feng et al., 2016; Ghaeini et al., 2016; Liu et al., 2017; Zhao et al., 2018; Chen et al., 2018; Ding et al., 2019; Yan et al., 2019).

While advanced models have been continuously proposed, the benchmark datasets for ED have been upgraded slowly. As event annotation is complex and expensive, the existing datasets are mostly small-scale. As shown in Figure 1, the most widely-used ACE 2005 English dataset (Walker et al., 2006) only contains 599 documents and 5,349 annotated instances. Due to the inherent data imbalance problem, 20 of its 33 event types have fewer than 100 annotated instances. As recent neural methods are typically data-hungry, these small-scale datasets are not sufficient for training and stably benchmarking modern models.


Moreover, the event types covered by existing datasets are limited. The ACE 2005 English dataset only contains 8 event types and 33 specific subtypes. The Rich ERE ontology (Song et al., 2015) used by the TAC KBP challenges (Ellis et al., 2015, 2016) covers 9 event types and 38 subtypes. The coverage of these datasets is low for general domain events, which means that models trained on them cannot be easily transferred and applied to general applications.

Recent research (Huang et al., 2016; Chen et al., 2017) has shown that the existing datasets, suffering from the lack-of-data and low-coverage problems, are increasingly inadequate for benchmarking emerging supervised methods, i.e., the evaluation results on these datasets can hardly reflect the effectiveness of novel methods. To tackle these issues, some works adopt distantly supervised methods (Mintz et al., 2009) to automatically annotate data with existing event instances in knowledge bases (Chen et al., 2017; Zeng et al., 2018; Araki and Mitamura, 2018) or use bootstrapping methods to generate new instances (Ferguson et al., 2018; Wang et al., 2019a). However, the generated data are inevitably noisy and homogeneous due to the limited number of event instances in existing knowledge bases.

In this paper, we present MAVEN, a human-annotated large-scale general domain event detection dataset constructed from English Wikipedia1 and FrameNet (Baker et al., 1998), which significantly alleviates the lack-of-data and low-coverage problems: (1) MAVEN contains 109,567 different events and 117,200 event mentions, twenty times more than the most widely-used ACE 2005 dataset, in 4,480 annotated documents in total. To the best of our knowledge, this is the largest human-annotated event detection dataset to date. (2) MAVEN contains 207 event types in total, which cover a much broader range of general domain events. These event types are manually selected and derived from the frames defined in the linguistic resource FrameNet (Baker et al., 1998). As previous works (Aguilar et al., 2014; Huang et al., 2018) show, the FrameNet frames have good coverage of general event semantics. In our event schema, we maintain this coverage of event semantics while simplifying the event types to enable crowd-sourced annotation, by manually selecting the event-related frames and merging some fine-grained frames.

1 https://en.wikipedia.org

We reproduce some recent state-of-the-art ED models and conduct a thorough evaluation of these models on our MAVEN dataset. From the experimental results and empirical analyses, we observe a significant performance drop of these models compared with their results on existing ED benchmarks. This indicates that detecting general-domain events is still challenging and that existing datasets can hardly support further explorations. We hope that MAVEN can encourage the community to make further explorations and breakthroughs towards better ED methods.

2 Event Detection Formalization

In our dataset, we mostly follow the settings and terminologies defined in the ACE 2005 program (Doddington et al., 2004). We specify the vital terminologies as follows:

An event is a specific occurrence involving participants (Consortium, 2005). In MAVEN, we mainly focus on extracting the basic events that can be described in one or a few sentences, rather than high-level grand events. Each event is labeled with a certain event type. An event mention is a sentence whose semantic meaning describes an event. As the same event may be mentioned many times in different sentences of a document, the number of event mentions is larger than the number of events. An event trigger is the key word or phrase in an event mention that most clearly expresses the event.

The ED task is to identify event triggers and classify event types for given sentences. Accordingly, ED is conventionally divided into two subtasks: Trigger Identification and Trigger Classification (Ahn, 2006). Trigger identification is to identify the annotated triggers among all possible words and phrases. Trigger classification is to classify the corresponding event types for the identified triggers. Both subtasks are evaluated with precision, recall, and F-1 scores. Recent neural methods typically formulate ED as a token-level multi-class classification task (Chen et al., 2015; Nguyen et al., 2016) or a sequence labeling task (Chen et al., 2018; Zeng et al., 2018), and only report the classification results for comparison (an additional type N/A is classified at the same time, indicating that a candidate is not a trigger). In MAVEN, we inherit all the abovementioned settings in both constructing the dataset and evaluating the typical models.
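As an illustration of this evaluation protocol, here is a minimal sketch of our own (not the official scorer) of how the two subtasks' precision, recall, and F-1 can be computed from gold and predicted triggers:

```python
# A minimal sketch of the two subtask metrics, written for illustration
# only; it is not the official MAVEN evaluation script. Gold and predicted
# triggers are (sentence_id, start, end, event_type) tuples.

def prf(correct, n_pred, n_gold):
    p = correct / n_pred if n_pred else 0.0
    r = correct / n_gold if n_gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def evaluate(gold, pred):
    # Trigger identification: the predicted span must match a gold trigger,
    # regardless of the predicted event type.
    gold_spans = {(s, b, e) for (s, b, e, _) in gold}
    pred_spans = {(s, b, e) for (s, b, e, _) in pred}
    identification = prf(len(gold_spans & pred_spans), len(pred_spans), len(gold_spans))
    # Trigger classification: both the span and the event type must match.
    classification = prf(len(set(gold) & set(pred)), len(pred), len(gold))
    return identification, classification

gold = [(0, 2, 3, "Founding")]
pred = [(0, 2, 3, "Founding"), (0, 5, 6, "Attack")]
print(evaluate(gold, pred))  # both subtasks: P=0.5, R=1.0, F-1≈0.67
```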

3 Data Collection

In this section, we will introduce the detailed data collection process of MAVEN, including the following stages: (1) event schema construction, (2) document selection, (3) automatic labeling and candidate selection, and (4) human annotation.

3.1 Event Schema Construction

The event schemas used by existing ED datasets like ACE (Doddington et al., 2004), Light ERE (Aguilar et al., 2014) and Rich ERE (Song et al., 2015) only include limited event types (e.g., Life, Movement, Contact). Hence, we need to construct a new event schema with good coverage of general-domain events for our dataset.

Inspired by Aguilar et al. (2014), we mostly use the frames defined in FrameNet (Baker et al., 1998) as our event types for good coverage. FrameNet follows the frame semantic theory (Fillmore et al., 1976; Charles et al., 1982) and defines over 1,200 semantic frames along with corresponding frame elements, frame relations, and lexical units. From the ED perspective, some of the frames and lexical units can be used as event types and example triggers respectively. The frame elements can also be used as event arguments in the future.

Considering FrameNet is primarily a linguistic resource constructed by linguistic experts, it prioritizes lexicographic and linguistic completeness over ease of annotation (Aguilar et al., 2014). To enable crowd-sourced annotation with massive numbers of annotators, we simplify the original frame schema to construct our event schema. We first collect 598 event-related frames from FrameNet by selecting the frames inherited from the Event frame. Then we manually filter out some overly abstract frames (e.g., Activity_ongoing, Process_resume), merge some similar frames (e.g., Choosing and Adopt_selection), and assemble too fine-grained frames into more generalized frames (e.g., Visitor_arrival, Visit_host_arrival, and Drop_in_on into Arriving). Finally, we get a new event schema with 207 event types.

3.2 Document Selection

To support the annotation, we need to select informative documents as our basic corpus. Each document should contain a number of events, and the events should have connections (e.g., a storyline) to facilitate further research like event coreference resolution and event relation extraction.

Topic              Occurrence  Percentage
Military conflict  1,458       32.5%
Hurricane          480         10.7%
Civilian attack    287         6.4%
Concert tour       255         5.7%
Music festival     170         3.8%
Total              2,650       59.2%

Table 1: Top-5 topics of documents.

To this end, we choose English Wikipedia as our data source, considering it is informative and widely used (Rajpurkar et al., 2016; Yang et al., 2018). Meanwhile, Wikipedia contains rich entities, which will benefit further event argument annotation in the next step.

To effectively select articles containing enough events, we follow a simple intuition: articles describing grand "topic events" may contain many more basic events than articles expressing specific entity definitions. Recently, EventWiki (Ge et al., 2018), a knowledge base of major events, was proposed, and each major event in EventWiki is described with a Wikipedia article. We thus utilize the articles indexed by EventWiki as a base and select some articles to annotate their basic events concerned with our event schema. Finally, we select 4,480 documents in total, covering 90 of the 95 major event types defined in EventWiki. Table 1 shows the top 5 EventWiki types of our selected documents.

To ensure the quality of the articles, we follow previous settings (Yao et al., 2019) and use the introductory sections of the Wikipedia articles for annotation. Moreover, we filter out articles with fewer than 5 sentences or fewer than 10 event-related frames labeled by a semantic labeling tool (Swayamdipta et al., 2017); a minimal sketch of this filter follows.
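The following sketch illustrates the filtering heuristic described above, under the assumption that `count_event_frames` is a hypothetical wrapper around the frame-semantic parser of Swayamdipta et al. (2017); it is our illustration, not the authors' pipeline code.

```python
# A sketch of the article filtering heuristic: keep only introductory
# sections with at least 5 sentences and at least 10 event-related frames.
# count_event_frames() is a hypothetical wrapper around the frame-semantic
# parser returning the number of event-related frames labeled in a text.
import nltk  # requires the "punkt" sentence tokenizer data

def keep_article(intro_text, count_event_frames):
    sentences = nltk.sent_tokenize(intro_text)
    return len(sentences) >= 5 and count_event_frames(intro_text) >= 10
```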

3.3 Automatic Labeling and Candidate Selection

As we have a large scale of data to be annotated with 207 event types and our annotators are typically not linguistic experts, it is necessary to adopt some heuristic methods to limit the number of trigger candidates and the number of event type candidates for each trigger candidate.


Dataset           #Document  #Token  #Event Type  #Event   #Event Mention
ACE 2005          599        303k    33           4,090    5,349
Rich ERE
  LDC2015E29      91         43k     38           1,439    2,196
  LDC2015E68      197        164k    37           2,650    3,567
  LDC2015E78      171        114k    31           2,285    2,933
  TAC KBP 2014    351        282k    34           10,719   10,719
  TAC KBP 2015    360        238k    38           7,460    12,976
  TAC KBP 2016    169        109k    18           3,191    4,155
  TAC KBP 2017    167        99k     18           2,963    4,375
  All             1,272      854k    38           29,293   38,853
MAVEN             4,480      1,276k  207          109,567  117,200

Table 2: Statistics of MAVEN compared with existing widely-used ED datasets. The #Event Type column shows the number of the most fine-grained types (i.e., the "subtypes" of ACE and ERE). For multilingual datasets, we report the statistics of the English subset (typically the largest subset) for direct comparison. We merge all the Rich ERE datasets and remove the duplicate documents to get the "All" statistics. The numbers of tokens are acquired with the NLTK tokenizer (Bird, 2006).

We first do POS tagging with the NLTK toolkit (Bird, 2006)2, and select the content words (nouns, verbs, adjectives, and adverbs) as the trigger candidates to be annotated. As event triggers can also be phrases rather than single words, the phrases in documents that match the phrases provided in FrameNet are also selected as trigger candidates. For each trigger candidate, we only recommend the 15 event types with the highest cosine similarities between the trigger word embedding and the average of the word embeddings of the event types' corresponding lexical units in FrameNet. The word embeddings used here are the pre-trained word vectors3 trained with GloVe (Pennington et al., 2014). A sketch of these heuristics follows.
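The sketch below is our reconstruction of these heuristics from the description above, not the authors' released code. Here `glove` is assumed to be a dict-like mapping from lowercased words to vectors, and `type_lexical_units` a hypothetical mapping from each event type to the lexical units of its FrameNet frame.

```python
# A sketch of trigger candidate selection (content words via POS tagging)
# and event type recommendation (top-k by cosine similarity to the average
# lexical unit embedding of each type).
import numpy as np
import nltk  # requires the "punkt" and "averaged_perceptron_tagger" data

CONTENT_TAGS = ("NN", "VB", "JJ", "RB")  # nouns, verbs, adjectives, adverbs

def trigger_candidates(sentence):
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    return [word for word, tag in tagged if tag.startswith(CONTENT_TAGS)]

def recommend_types(trigger, glove, type_lexical_units, k=15):
    v = glove[trigger.lower()]  # assumes the trigger is in the vocabulary
    scores = {}
    for event_type, units in type_lexical_units.items():
        # Average the embeddings of the type's lexical units, then score
        # the candidate by cosine similarity.
        m = np.mean([glove[u] for u in units if u in glove], axis=0)
        scores[event_type] = float(v @ m / (np.linalg.norm(v) * np.linalg.norm(m)))
    return sorted(scores, key=scores.get, reverse=True)[:k]
```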

We also automatically label some of the trigger candidates with a frame semantic parser (Swayamdipta et al., 2017) and use the predicted frames as the default event types. The annotators can change them to more appropriate event types or just keep them as the final decision to save time and effort.

3.4 Human Annotation

The final step is human annotation. The annotators are required to label the trigger candidates with appropriate event types (if multiple event types are suitable for an event, we ask the annotators to label the most fine-grained one) and merge the event mentions (triggers) expressing the same event.

2 https://www.nltk.org
3 https://nlp.stanford.edu/projects/glove

As event annotation is complicated, to ensure the accuracy and consistency of our annotation, we follow the ACE 2005 annotation process (Consortium, 2005) and organize a two-stage iterative annotation. In the first stage, 121 crowd-source annotators annotate the documents given the default results and candidate sets described in the last section. Each document is annotated twice by two independent annotators. In the second stage, 17 experienced annotators and experts review the results of the first stage and give the final result given the results of the two first-stage annotators.

4 Data Analysis

In this section, we analyze our MAVEN to show the features of the dataset.

4.1 Data Size

We show the main statistics of our MAVEN compared with some existing widely-used ED datasets in Table 2, including the most widely-used ACE 2005 dataset (Walker et al., 2006) and a series of Rich ERE annotation datasets used by the TAC KBP competitions, which are DEFT Rich ERE English Training Annotation V2 (LDC2015E29), DEFT Rich ERE English Training Annotation R2 V2 (LDC2015E68), DEFT Rich ERE Chinese and English Parallel Annotation V2 (LDC2015E78), TAC KBP Event Nugget Data 2014-2016 (LDC2017E02) (Ellis et al., 2014, 2015, 2016) and TAC KBP 2017 (LDC2017E55) (Getman et al., 2017).


Figure 2: Distribution of our MAVEN by event type (only the top-50 event types are shown; y-axis: #Instance). The most frequent types include Catastrophe, Attack, Competition, Process_start, and Killing.

Subset  #Doc.  #Trigger  #Negative
Train   2,913  89,251    319,780
Dev     710    21,508    78,794
Test    857    25,014    92,464

Table 3: Statistics of the MAVEN data split.

The Rich ERE datasets can be combined together as in Lin et al. (2019) and Lu et al. (2019), but the combined dataset is still much smaller than our MAVEN. MAVEN is larger than all existing ED datasets, especially in the number of events. Hopefully, the large-scale dataset can facilitate research on general domain ED.

4.2 Data Distribution

Figure 2 shows the distribution of instance numbers over the top-50 event types in MAVEN. It is still a long-tail distribution due to the inherent data imbalance problem. However, as MAVEN is large-scale, 32% and 87% of the event types have more than 500 and 100 instances respectively. Compared with existing datasets like ACE 2005 (where only 13 of 33 event types have more than 100 instances), MAVEN significantly alleviates the data sparsity and data imbalance problems, which will benefit developing strong ED models and various event-related downstream applications.

4.3 Benchmark Setting

Before the experiments, we introduce the benchmark setting here. Our data are randomly split into training, development and test sets, and the statistics of the three sets are shown in Table 3. After the data split, 52 and 167 of the 207 event types have more than 500 and 100 training instances respectively, which ensures the models can be well-trained.

Conventionally, the existing ED datasets only provide the standard annotation of positive instances (i.e., the annotated event triggers), and researchers sample the negative instances (i.e., non-trigger words or phrases) from the documents by themselves, which may lead to potentially unfair comparisons between different methods. In MAVEN, we provide official negative instances to ensure fair comparisons. As described in the candidate selection part of Section 3, the negative instances are the content words labeled by the NLTK POS tagger or the phrases that can be matched with the lexical units in FrameNet. In other words, we only filter out the empty words, which will not influence the application of models developed on MAVEN.

5 Experiments

In this section, we conduct experiments to show the challenge of MAVEN and analyze the experimental results to discuss future directions for ED.

5.1 Experimental Setting

We first introduce the experimental settings, including the representative models we use and the evaluation metrics.

Models Recently, various neural models have been developed for ED and have achieved superior performance compared with traditional feature-based models. Hence, we reproduce three representative state-of-the-art neural models and report their performance on both MAVEN and the widely-used ACE 2005 to assess the challenge of MAVEN. The models include:

• DMCNN (Chen et al., 2015) is representative of convolutional neural network (CNN) models. It leverages a CNN to automatically learn sequence representations and proposes a dynamic multi-pooling mechanism to aggregate the CNN outputs into trigger-specific representations. Specifically, dynamic multi-pooling separately max-pools the feature vectors before and after the position of the trigger candidate and concatenates the two features as the final sentence representation (a sketch follows this list).

• MOGANED (Yan et al., 2019) is an advanced graph neural network (GNN) model. It proposes a multi-order graph attention network to effectively model the multi-order syntactic relations in dependency trees and improve ED (a simplified sketch follows this list).


Method           ACE 2005              MAVEN
          P     R     F-1      P          R          F-1
DMCNN     75.6  63.6  69.1     67.6±0.49  52.4±0.59  59.0±0.27
MOGANED   79.5  72.3  75.7     59.9±0.72  63.6±0.86  61.7±0.36
DMBERT    73.8  83.1  78.2     60.0±1.09  69.7±1.26  64.5±0.27

Table 4: The overall performance of different models on ACE 2005 and MAVEN. The DMCNN and MOGANED results on ACE 2005 are taken from the original papers (Chen et al., 2015; Yan et al., 2019).


• DMBERT is a vanilla BERT-based model proposed in Wang et al. (2019a). It takes advantage of the effective pre-trained language representation model BERT (Devlin et al., 2019) and also uses a dynamic multi-pooling method to aggregate the features. We use the BERTBASE model and the released pre-trained checkpoints4. DMBERT is implemented with HuggingFace's Transformers library (Wolf et al., 2019). In our reproduction, we insert special tokens around the trigger candidate and use a much larger batch size, hence our results are higher than those of the original implementation (Wang et al., 2019a). A sketch of the trigger marking follows this list.
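For DMCNN, the following is a minimal sketch of the dynamic multi-pooling operation as described above (our illustration, not the authors' implementation): the convolutional feature maps are max-pooled separately before and after the trigger candidate position, and the two pooled vectors are concatenated.

```python
# A minimal sketch of dynamic multi-pooling over convolutional features.
import torch

def dynamic_multi_pooling(conv_features, trigger_pos):
    # conv_features: (seq_len, num_filters); trigger_pos: token index
    left = conv_features[: trigger_pos + 1].max(dim=0).values
    right = conv_features[trigger_pos:].max(dim=0).values
    return torch.cat([left, right], dim=-1)  # (2 * num_filters,)

features = torch.randn(20, 128)  # e.g., outputs of a 1-D convolution layer
print(dynamic_multi_pooling(features, trigger_pos=7).shape)  # torch.Size([256])
```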
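For MOGANED, the sketch below conveys only the multi-order aggregation idea in simplified form: word representations are aggregated over the 1st- to K-th-order neighborhoods of the dependency graph and combined with learned attention weights. The real model uses graph attention layers; this sketch uses plain adjacency powers for brevity and is our interpretation, not the authors' code.

```python
# A simplified sketch of multi-order aggregation over a dependency graph.
import torch
import torch.nn as nn

class MultiOrderAggregation(nn.Module):
    def __init__(self, hidden, K=3):
        super().__init__()
        self.K = K
        self.linears = nn.ModuleList([nn.Linear(hidden, hidden) for _ in range(K)])
        self.att = nn.Linear(hidden, 1)  # scores each order's output

    def forward(self, x, adj):
        # x: (seq_len, hidden); adj: (seq_len, seq_len) dependency adjacency
        outputs, a = [], adj
        for k in range(self.K):
            outputs.append(torch.tanh(a @ self.linears[k](x)))
            a = (a @ adj).clamp(max=1.0)  # (k+1)-th order reachability
        h = torch.stack(outputs)               # (K, seq_len, hidden)
        w = torch.softmax(self.att(h), dim=0)  # attention over orders
        return (w * h).sum(dim=0)              # (seq_len, hidden)
```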
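For DMBERT, here is a sketch of marking a trigger candidate with special tokens before encoding, in the spirit of our reproduction; the marker token names here are our assumption, not necessarily the ones used in the paper.

```python
# A sketch of marking a trigger candidate span with special tokens.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokenizer.add_special_tokens({"additional_special_tokens": ["[TRG]", "[/TRG]"]})
# If a model is used, its embeddings must be resized accordingly:
# model.resize_token_embeddings(len(tokenizer))

tokens = ["Bill", "Gates", "founded", "Microsoft", "in", "1975"]
start, end = 2, 3  # trigger candidate span: "founded"
marked = tokens[:start] + ["[TRG]"] + tokens[start:end] + ["[/TRG]"] + tokens[end:]
encoding = tokenizer(" ".join(marked), return_tensors="pt")
# The encoder output around the marked span is then aggregated (e.g., by
# dynamic multi-pooling) and classified over the event types plus N/A.
```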

Evaluation Following the widely-used setting introduced in Section 2, we report precision, recall, and F-1 scores as our evaluation metrics, and all the models directly do trigger classification without an additional identification stage. On MAVEN, all the models use the standard data split and negative instances described in Section 4.3. We run each model multiple times and report the average and standard deviation for each metric. On ACE 2005, we use 40 newswire articles for testing, 30 random documents for development, and 529 documents for training, following previous works (Chen et al., 2015; Wang et al., 2019b), and sample all the unlabeled content words as negative instances.

5.2 Experimental Results

We report the overall experimental results in Table 4, from which we have the following observations:

4 https://github.com/google-research/bert

• Although the models perform well on the small-scale ACE 2005 dataset, their performance on MAVEN is significantly lower and not satisfying. This indicates that our MAVEN is challenging and that general-domain ED still needs more research efforts.

• The deviations of DMBERT are larger than those of the other models, which is consistent with recent findings (Dodge et al., 2020) that the fine-tuning process of pre-trained language models (PLMs) is often unstable. It is necessary to run multiple times and report averages or medians when evaluating PLM-based models.

6 Related Work

As stated in Section 2, we follow the ED task definition specified in the ACE challenges, especially the ACE 2005 dataset (Doddington et al., 2004), which requires ED models to generally detect event triggers and classify them into specific event types. The ACE event schema and annotation standard were simplified into Light ERE and further extended to Rich ERE (Song et al., 2015) to cover more, but still a limited number of, event types. Rich ERE was used to create various datasets and the TAC KBP 2014-2017 challenges (Ellis et al., 2014, 2015, 2016; Getman et al., 2017). Nowadays, the majority of ED and event extraction models (Ji and Grishman, 2008; Li et al., 2013; Chen et al., 2015; Feng et al., 2016; Liu et al., 2017; Zhao et al., 2018; Yan et al., 2019) are developed on these datasets. Our MAVEN follows this effective framework, extends it to numerous general domain event types, and contains data at a large scale.

There are also various datasets that define the ED task in a different way. The early MUC series datasets (Grishman and Sundheim, 1996) define event extraction as a slot-filling task. The TDT corpus (Allan, 2012) and some other datasets (Araki and Mitamura, 2018; Sims et al., 2019; Liu et al., 2019) follow the open-domain paradigm, which does not require models to classify specific event types for better coverage but limits the downstream application of the extracted events. Some datasets are developed for domain-specific event extraction, e.g., the biomedical domain (Pyysalo et al., 2007; Kim et al., 2008; Thompson et al., 2009; Buyko et al., 2010; Nedellec et al., 2013), literature (Sims et al., 2019), Twitter (Ritter et al., 2012; Guo et al., 2013) and breaking news (Pustejovsky et al., 2003). These datasets are also typically small-scale due to the inherent complexity of event annotation, but their different settings are complementary to our framework.

7 Conclusion and Future Work

In this paper, we present a massive general domain event detection dataset (MAVEN), which significantly alleviates the data sparsity and low coverage problems of existing datasets. We conduct a thorough evaluation of the state-of-the-art models on our MAVEN and observe an obvious performance drop, which indicates that our MAVEN is challenging and may facilitate further research on ED.

We will further construct a hierarchical event schema and conduct more experiments to verify the effectiveness of MAVEN. The dataset and code will be released when we finish all the data cleansing and experiments in more settings. If you need the dataset in this version during this period, please e-mail the authors. In the future, we will also explore extending MAVEN to more event-related tasks like event argument extraction and event sequencing.

References

Jacqueline Aguilar, Charley Beller, Paul McNamee, Benjamin Van Durme, Stephanie Strassel, Zhiyi Song, and Joe Ellis. 2014. A comparison of the events and relations across ACE, ERE, TAC-KBP, and FrameNet annotation standards. In Proceedings of the Second Workshop on EVENTS: Definition, Detection, Coreference, and Representation, pages 45–53.

David Ahn. 2006. The stages of event extraction. In Proceedings of the ACL Workshop on Annotating and Reasoning about Time and Events, pages 1–8.

James Allan. 2012. Topic detection and tracking: event-based information organization, volume 12. Springer Science & Business Media.

Jun Araki and Teruko Mitamura. 2015. Joint event trigger identification and event coreference resolution with structured perceptron. In Proceedings of EMNLP, pages 2074–2080.

Jun Araki and Teruko Mitamura. 2018. Open-domain event detection using distant supervision. In Proceedings of COLING, pages 878–891. Association for Computational Linguistics.

Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet project. In Proceedings of ACL-COLING, pages 86–90.

P. Basile, A. Caputo, G. Semeraro, and L. Siciliani. 2014. Extending an information retrieval system through time event extraction. In Proceedings of DART, pages 36–47.

Steven Bird. 2006. NLTK: The Natural Language Toolkit. In Proceedings of the COLING/ACL 2006 Interactive Presentation Sessions, pages 69–72.

Ekaterina Buyko, Elena Beisswanger, and Udo Hahn. 2010. The GeneReg corpus for gene expression regulation events — an overview of the corpus and its in-domain and out-of-domain interoperability. In Proceedings of LREC.

Fillmore Charles et al. 1982. Frame semantics. The Linguistic Society of Korea (eds), Linguistics in the Morning Calm, Seoul, Hanshin, pages 111–137.

Yubo Chen, Shulin Liu, Xiang Zhang, Kang Liu, and Jun Zhao. 2017. Automatically labeled data generation for large scale event extraction. In Proceedings of ACL, pages 409–419.

Yubo Chen, Liheng Xu, Kang Liu, Daojian Zeng, and Jun Zhao. 2015. Event extraction via dynamic multi-pooling convolutional neural networks. In Proceedings of ACL-IJCNLP, pages 167–176.

Yubo Chen, Hang Yang, Kang Liu, Jun Zhao, and Yantao Jia. 2018. Collective event detection via a hierarchical and bias tagging networks with gated multi-level attention mechanisms. In Proceedings of EMNLP, pages 1267–1276.

Pengxiang Cheng and Katrin Erk. 2018. Implicit argument prediction with event knowledge. In Proceedings of ACL, pages 831–840.

Linguistic Data Consortium. 2005. ACE (Automatic Content Extraction) English annotation guidelines for events. Version, 5(4).

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186.

Ning Ding, Ziran Li, Zhiyuan Liu, Haitao Zheng, and Zibo Lin. 2019. Event detection with trigger-aware lattice neural network. In Proceedings of EMNLP-IJCNLP, pages 347–356.


George Doddington, Alexis Mitchell, Mark Przybocki, Lance Ramshaw, Stephanie Strassel, and Ralph Weischedel. 2004. The automatic content extraction (ACE) program – tasks, data, and evaluation. In Proceedings of LREC.

Jesse Dodge, Gabriel Ilharco, Roy Schwartz, Ali Farhadi, Hannaneh Hajishirzi, and Noah Smith. 2020. Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping. arXiv preprint arXiv:2002.06305.

Joe Ellis, Jeremy Getman, Dana Fore, Neil Kuster, Zhiyi Song, Ann Bies, and Stephanie M. Strassel. 2015. Overview of linguistic resources for the TAC KBP 2015 evaluations: Methodologies and results. In TAC.

Joe Ellis, Jeremy Getman, Dana Fore, Neil Kuster, Zhiyi Song, Ann Bies, and Stephanie M. Strassel. 2016. Overview of linguistic resources for the TAC KBP 2016 evaluations: Methodologies and results. In TAC.

Joe Ellis, Jeremy Getman, and Stephanie M. Strassel. 2014. Overview of linguistic resources for the TAC KBP 2014 evaluations: Planning, execution, and results. In TAC.

Xiaocheng Feng, Lifu Huang, Duyu Tang, Bing Qin, Heng Ji, and Ting Liu. 2016. A language-independent neural network for event detection. In Proceedings of ACL, pages 66–71.

James Ferguson, Colin Lockard, Daniel Weld, and Hannaneh Hajishirzi. 2018. Semi-supervised event extraction with paraphrase clusters. In Proceedings of NAACL, pages 359–364.

Charles J. Fillmore et al. 1976. Frame semantics and the nature of language. In Annals of the New York Academy of Sciences: Conference on the Origin and Development of Language and Speech, volume 280, pages 20–32.

Tao Ge, Lei Cui, Baobao Chang, Zhifang Sui, Furu Wei, and Ming Zhou. 2018. EventWiki: A knowledge base of major events. In Proceedings of LREC.

Jeremy Getman, Joe Ellis, Zhiyi Song, Jennifer Tracey, and Stephanie M. Strassel. 2017. Overview of linguistic resources for the TAC KBP 2017 evaluations: Methodologies and results. In TAC.

Reza Ghaeini, Xiaoli Fern, Liang Huang, and Prasad Tadepalli. 2016. Event nugget detection with forward-backward recurrent neural networks. In Proceedings of ACL, pages 369–373.

Ralph Grishman and Beth Sundheim. 1996. Message Understanding Conference-6: A brief history. In Proceedings of COLING.

Weiwei Guo, Hao Li, Heng Ji, and Mona Diab. 2013. Linking tweets to news: A framework to enrich short text data in social media. In Proceedings of ACL, pages 239–249.

Prashant Gupta and Heng Ji. 2009. Predicting unknown time arguments based on cross-event propagation. In Proceedings of ACL-IJCNLP, pages 369–372, Suntec, Singapore. Association for Computational Linguistics.

Lifu Huang, Taylor Cassidy, Xiaocheng Feng, Heng Ji, Clare R. Voss, Jiawei Han, and Avirup Sil. 2016. Liberal event extraction and event schema induction. In Proceedings of ACL, pages 258–268.

Lifu Huang, Heng Ji, Kyunghyun Cho, Ido Dagan, Sebastian Riedel, and Clare Voss. 2018. Zero-shot transfer learning for event extraction. In Proceedings of ACL, pages 2160–2170.

Heng Ji and Ralph Grishman. 2008. Refining event extraction through cross-document inference. In Proceedings of ACL, pages 254–262.

Jin-Dong Kim, Tomoko Ohta, Kanae Oda, and Jun'ichi Tsujii. 2008. From Text to Pathway: Corpus annotation for knowledge acquisition from biomedical literature. Volume 6, pages 165–176.

Qi Li, Heng Ji, and Liang Huang. 2013. Joint event extraction via structured prediction with global features. In Proceedings of ACL, pages 73–82.

Hongyu Lin, Yaojie Lu, Xianpei Han, and Le Sun. 2019. Cost-sensitive regularization for label confusion-aware event detection. In Proceedings of ACL, pages 5278–5283.

Shulin Liu, Yubo Chen, Kang Liu, and Jun Zhao. 2017. Exploiting argument information to improve event detection via supervised attention mechanisms. In Proceedings of ACL, pages 1789–1798.

Xiao Liu, Heyan Huang, and Yue Zhang. 2019. Open domain event extraction using neural latent variable models. In Proceedings of ACL, pages 2860–2871.

Yaojie Lu, Hongyu Lin, Xianpei Han, and Le Sun. 2019. Distilling discrimination and generalization knowledge for event detection via delta-representation learning. In Proceedings of ACL, pages 4366–4376.

Mike Mintz, Steven Bills, Rion Snow, and Daniel Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of ACL, pages 1003–1011.

Claire Nédellec, Robert Bossy, Jin-Dong Kim, Jung-Jae Kim, Tomoko Ohta, Sampo Pyysalo, and Pierre Zweigenbaum. 2013. Overview of BioNLP shared task 2013. In Proceedings of the BioNLP Shared Task 2013 Workshop, pages 1–7.

Thien Huu Nguyen, Kyunghyun Cho, and Ralph Grishman. 2016. Joint event extraction via recurrent neural networks. In Proceedings of NAACL, pages 300–309.


Thien Huu Nguyen and Ralph Grishman. 2015. Event detection and domain adaptation with convolutional neural networks. In Proceedings of ACL, pages 365–371.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of EMNLP, pages 1532–1543.

James Pustejovsky, Patrick Hanks, Roser Saurí, Andrew See, Rob Gaizauskas, Andrea Setzer, Dragomir Radev, Beth Sundheim, David Day, Lisa Ferro, and Marcia Lazo. 2003. The TimeBank corpus. In Proceedings of Corpus Linguistics.

Sampo Pyysalo, Filip Ginter, Juho Heimonen, Jari Björne, Jorma Boberg, Jouni Järvinen, and Tapio Salakoski. 2007. BioInfer: A corpus for information extraction in the biomedical domain. BMC Bioinformatics, 8:50.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of EMNLP, pages 2383–2392.

Alan Ritter, Mausam, Oren Etzioni, and Sam Clark. 2012. Open domain event extraction from Twitter. In Proceedings of ACM SIGKDD, pages 1104–1112.

Matthew Sims, Jong Ho Park, and David Bamman. 2019. Literary event detection. In Proceedings of ACL, pages 3623–3634.

Zhiyi Song, Ann Bies, Stephanie Strassel, Tom Riese, Justin Mott, Joe Ellis, Jonathan Wright, Seth Kulick, Neville Ryant, and Xiaoyi Ma. 2015. From light to rich ERE: Annotation of entities, relations, and events. In Proceedings of the 3rd Workshop on EVENTS: Definition, Detection, Coreference, and Representation, pages 89–98.

Swabha Swayamdipta, Sam Thomson, Chris Dyer, and Noah A. Smith. 2017. Frame-semantic parsing with softmax-margin segmental RNNs and a syntactic scaffold. arXiv preprint arXiv:1706.09528.

Paul Thompson, Syed Iqbal, John McNaught, and Sophia Ananiadou. 2009. Construction of an annotated corpus to support biomedical information extraction. BMC Bioinformatics, 10:349.

Christopher Walker, Stephanie Strassel, Julie Medero, and Kazuaki Maeda. 2006. ACE 2005 multilingual training corpus. Linguistic Data Consortium, Philadelphia, 57.

Xiaozhi Wang, Xu Han, Zhiyuan Liu, Maosong Sun, and Peng Li. 2019a. Adversarial training for weakly supervised event detection. In Proceedings of NAACL, pages 998–1008.

Xiaozhi Wang, Ziqi Wang, Xu Han, Zhiyuan Liu, Juanzi Li, Peng Li, Maosong Sun, Jie Zhou, and Xiang Ren. 2019b. HMEAE: Hierarchical modular event argument extraction. In Proceedings of EMNLP-IJCNLP, pages 5777–5783.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.

Haoran Yan, Xiaolong Jin, Xiangbin Meng, Jiafeng Guo, and Xueqi Cheng. 2019. Event detection with multi-order graph convolution and aggregated attention. In Proceedings of EMNLP-IJCNLP, pages 5766–5770.

Hui Yang, Tat-Seng Chua, Shuguang Wang, and Chun-Keat Koh. 2003. Structured use of external knowledge for event-based open domain question answering. In Proceedings of SIGIR, pages 33–40.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of EMNLP, pages 2369–2380.

Yuan Yao, Deming Ye, Peng Li, Xu Han, Yankai Lin, Zhenghao Liu, Zhiyuan Liu, Lixin Huang, Jie Zhou, and Maosong Sun. 2019. DocRED: A large-scale document-level relation extraction dataset. In Proceedings of ACL, pages 764–777.

Ying Zeng, Yansong Feng, Rong Ma, Zheng Wang, Rui Yan, Chongde Shi, and Dongyan Zhao. 2018. Scale up event extraction learning via automatic training data generation. In Proceedings of AAAI.

Yue Zhao, Xiaolong Jin, Yuanzhuo Wang, and Xueqi Cheng. 2018. Document embedding enhanced event detection with hierarchical and supervised attention. In Proceedings of ACL, pages 414–419.