
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 347–356, Hong Kong, China, November 3–7, 2019. ©2019 Association for Computational Linguistics


Event Detection with Trigger-Aware Lattice Neural Network

Ning Ding1,2∗, Ziran Li1,2∗, Zhiyuan Liu2,3,4, Hai-Tao Zheng1,2†, Zibo Lin1,2

1 Tsinghua Shenzhen International Graduate School, Tsinghua University
2 Department of Computer Science and Technology, Tsinghua University, Beijing, China

3 Institute for Artificial Intelligence, Tsinghua University, Beijing, China
4 State Key Lab on Intelligent Technology and Systems, Tsinghua University, Beijing, China

{dingn18}@mails.tsinghua.edu.cn

Abstract

Event detection (ED) aims to locate trigger words in raw text and then classify them into correct event types. In this task, neural network based models have become mainstream in recent years. However, two problems arise when it comes to languages without natural delimiters, such as Chinese. First, word-based models severely suffer from the problem of word-trigger mismatch, limiting the performance of these methods. In addition, even if trigger words could be accurately located, the ambiguity caused by polysemous triggers could still affect the trigger classification stage. To address the two issues simultaneously, we propose the Trigger-aware Lattice Neural Network (TLNN). (1) The framework dynamically incorporates word and character information so that the trigger-word mismatch issue can be avoided. (2) Moreover, for polysemous characters and words, we model all of their senses with the help of an external linguistic knowledge base, so as to alleviate the problem of ambiguous triggers. Experiments on two benchmark datasets show that our model can effectively tackle the two issues and outperforms previous state-of-the-art methods significantly, giving the best results. The source code of this paper can be obtained from https://github.com/thunlp/TLNN.

1 Introduction

Event Detection (ED) is a pivotal part of Event Extraction, which aims to detect the position of event triggers in raw text and classify them into corresponding event types.

∗ indicates equal contribution
† Corresponding author: Hai-Tao Zheng. (E-mail: [email protected])

(a) An example of trigger-word mismatch.

(b) An example of polysemous trigger.

Figure 1: Examples of trigger-word mismatch and polysemous triggers in Event Detection.

Conventionally, the stage of locating trigger words is known as Trigger Identification (TI), and the stage of classifying trigger words into particular event types is called Trigger Classification (TC). Although neural network methods have achieved significant progress in event detection (Nguyen and Grishman, 2015; Chen et al., 2015; Zeng et al., 2016), both steps are still exposed to the following two issues.

In the TI stage, the problem of trigger-word mismatch can severely impact the performance of event detection systems. In languages without natural delimiters, such as Chinese, mainstream approaches are mostly word-based models, in which segmentation must first be performed as a necessary preprocessing step. Unfortunately, these word-wise methods neglect an important problem: a trigger can be a specific part of one word or contain multiple words.


Datasets     Trigger-word Match        Trigger Polysemy
             Match      Mismatch       Polysemy   Univocal
ACE 2005     85.39%     14.61%         41.64%     58.36%
KBP 2017     76.28%     23.72%         52.14%     47.86%

Table 1: Proportion of trigger-word mismatch and polysemous triggers on ACE 2005 and KBP 2017.

As shown in Figure 1(a), "射" (shoot) and "杀" (kill) are two triggers that are both parts of the word "射杀" (shoot and kill). In the other case, "示威游行" (demonstration) is a trigger that crosses two words. Under this circumstance, triggers cannot be located correctly with word-based methods, which becomes a serious limitation of the task. Some feature-based methods have been proposed (Chen and Ji, 2009; Qin et al., 2010; Li and Zhou, 2012) to alleviate the issue, but they rely heavily on hand-crafted features. Lin et al. (2018) propose the nugget proposal networks (NPN) for this issue, which use a neural network to model the character compositional structure of trigger words in a fixed-size window. However, the mechanism of NPN limits the scope of trigger candidates to a fixed-size window, which is inflexible and suffers from the problem of trigger overlaps.

Even if the locations of triggers can be correctly detected in the TI step, the TC step can still be severely affected by the inherent ambiguity of polysemy, because a trigger word with multiple word senses can be classified into different event types. Take Figure 1(b) as an example: the polysemous trigger word "释放" (release) can represent two distinctly different event types. In the first case, the word 'release' triggers an Attack event (release tear gas). But in the second case, the event triggered by 'release' becomes Release-Parole (release a man in court).

To further illustrate that the two problems mentioned above do exist, we manually compute the proportion of mismatched triggers and polysemous triggers on two widely used datasets. The statistics are shown in Table 1, and we can observe that data with trigger-word mismatch and trigger polysemy account for a considerable proportion and thus affect the task.

In this paper, we propose the Trigger-aware Lattice Network (TLNN), a comprehensive model that can simultaneously tackle both issues. To avoid error propagation from NLP tools such as word segmenters, we take characters as the basic units of the input sequence. Moreover, we utilize HowNet (Dong and Dong, 2003), an external knowledge base that manually annotates polysemous Chinese and English words, to obtain sense-level information. Further, we develop the trigger-aware lattice LSTM as the feature extractor of our model, which can leverage character-level, word-level and sense-level information at the same time. More specifically, in order to address the trigger-word mismatch issue, we construct shortcut paths to link the cell state between the start and the end characters of each word. It is worth mentioning that the paths are sense-level, which means all the sense information of words that end in one specific character will flow into the memory cell of that character. Hence, with the utilization of multiple granularities of information (character, word and word sense), the problem of polysemous triggers can be effectively alleviated.

We conduct sets of experiments on two real-world event detection datasets. Empirical results of the main experiments show that our model can efficiently address both of the issues mentioned above. In comprehensive comparisons with other proposed methods, our model achieves state-of-the-art results on both datasets. Further, subsidiary experiments are conducted to analyze in detail how TLNN addresses the two issues.

Figure 2: The architecture of TLNN.


2 Methodology

In this paper, event detection is regarded as a sequence labelling task. For each character, the model should identify whether it is part of a trigger and correctly classify the trigger into a specific event type.

The architecture of our model is shown in Figure 2, which primarily includes the following three parts: (1) Hierarchical Representation Learning, which learns the character-level, word-level and sense-level embedding vectors in an unsupervised way. (2) Trigger-aware Feature Extractor, which automatically extracts different levels of semantic features with a tree-structured LSTM model. (3) Sequence Tagger, which calculates the probability of being a trigger for each character candidate.
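To make the data flow between these three parts concrete, the following is a minimal sketch of how they could be wired together in one tagging pass. All class and method names here are illustrative placeholders, not the authors' released code.

```python
# Minimal sketch of the TLNN pipeline: hierarchical embeddings -> trigger-aware
# lattice LSTM -> CRF tagger. Components are injected as placeholders.
class TLNNSketch:
    def __init__(self, hierarchical_embedder, lattice_extractor, crf_tagger):
        self.embedder = hierarchical_embedder   # (1) char / word / sense embeddings
        self.extractor = lattice_extractor      # (2) trigger-aware lattice LSTM
        self.tagger = crf_tagger                # (3) CRF sequence tagger

    def tag(self, chars, matched_words):
        # chars: the characters of one sentence; matched_words: lexicon words
        # found in the sentence with their (start, end) character indices
        char_emb, word_sense_emb = self.embedder(chars, matched_words)
        hidden = self.extractor(char_emb, word_sense_emb, matched_words)
        return self.tagger.decode(hidden)       # one label per character
```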

2.1 Hierarchical Representation Learning

Consider an input sequence $S = \{c_1, c_2, ..., c_N\}$, where $c_i$ represents the $i$th character in the sequence. At the character level, each character is represented as an embedding vector $x^c$ learned with the Skip-Gram method (Mikolov et al., 2013):

$$x^{c}_{i} = e(c_{i}) \qquad (1)$$

At the word level, the input sequence $S$ can also be represented as $S = \{w_1, w_2, ..., w_M\}$, where the basic unit is a single word $w_i$. In this paper, we use two indexes $b$ and $e$ to represent the start and the end of a word. In this case, the word embeddings are:

$$x^{w}_{b,e} = e(w_{b,e}) \qquad (2)$$

However, the Skip-Gram method maps each word to only one single embedding, ignoring the fact that many words have multiple senses. Hence, a representation of finer granularity is still necessary to capture deep semantics. With the help of HowNet (Dong and Dong, 2003), we can obtain the representation of each sense of a character or a word. For each character $c$, there are possibly multiple senses $sen^{(c_i)} \in S(c)$ annotated in HowNet. Similarly, for each word $w$, the senses are $sen^{(w_i)} \in S(w)$. Consequently, we can obtain the sense embeddings by jointly learning word and sense embeddings in a Skip-Gram manner. A similar mechanism is also applied in (Niu et al., 2017).

$$s^{c_i}_{j} = e\big(sen^{(c_i)}_{j}\big) \qquad (3)$$

$$s^{w_{b,e}}_{j} = e\big(sen^{(w_{b,e})}_{j}\big) \qquad (4)$$

where $sen^{(c_i)}_{j}$ and $sen^{(w_{b,e})}_{j}$ represent the $j$th sense of the character $c_i$ and of the word $w_{b,e}$ in the sequence, and $s^{c_i}_{j}$ and $s^{w_{b,e}}_{j}$ are the corresponding sense embeddings of $c_i$ and $w_{b,e}$.
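The lookups in Eqs. 1-4 can be sketched as follows, assuming the character, word and sense embedding tables have already been pretrained with Skip-Gram. The HowNet sense inventory is represented here as a plain dictionary; this is not the real HowNet interface, and the embedding dimension of 50 follows the experimental settings reported later.

```python
# Sketch of hierarchical embedding lookup (characters, words, and their senses).
import numpy as np

emb_char  = {}   # character         -> 50-dim vector, x^c_i       = e(c_i)
emb_word  = {}   # word              -> 50-dim vector, x^w_{b,e}   = e(w_{b,e})
emb_sense = {}   # (token, sense_id) -> 50-dim vector, s_j         = e(sen_j)
hownet    = {}   # token -> list of its HowNet sense ids, S(c) or S(w)

def sense_vectors(token):
    """Return the embedding of every HowNet sense of a character or word."""
    senses = hownet.get(token, [token])   # monosemous tokens fall back to one sense
    return [emb_sense.get((token, s), np.zeros(50)) for s in senses]

# Example: every character gets x^c_i plus its sense vectors; every lexicon word
# w_{b,e} matched in the sentence would get its sense vectors the same way.
sentence = list("若罪名成立")
char_reprs = [(emb_char.get(c, np.zeros(50)), sense_vectors(c)) for c in sentence]
```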

2.2 Trigger-Aware Feature Extractor

The trigger-aware feature extractor is the core component of our model. After training, the outputs of the extractor are the hidden state vectors $h$ of an input sentence.

Conventional LSTM. LSTM (Hochreiter and Schmidhuber, 1997) is an extension of the recurrent neural network (RNN) with additional gates to control the information flow. Traditionally, there are the following basic gates in an LSTM: the input gate $i$, the output gate $o$ and the forget gate $f$. They collectively control which information will be retained, forgotten and output. All three gates are accompanied by a corresponding weight matrix $W$. The current cell state $c$ records all historical information flow up to the current time. Therefore, the character-based LSTM functions are:

$$\begin{bmatrix} i^{c}_{i} \\ o^{c}_{i} \\ f^{c}_{i} \\ \tilde{c}^{c}_{i} \end{bmatrix} = \begin{bmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{bmatrix}\left( W^{c\top} \begin{bmatrix} x^{c}_{i} \\ h^{c}_{i-1} \end{bmatrix} + b^{c} \right) \qquad (5)$$

$$c^{c}_{i} = f^{c}_{i} \odot c^{c}_{i-1} + i^{c}_{i} \odot \tilde{c}^{c}_{i} \qquad (6)$$

$$h^{c}_{i} = o^{c}_{i} \odot \tanh(c^{c}_{i}) \qquad (7)$$

where $h^{c}_{i}$ is the hidden state vector.
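Eqs. 5-7 correspond to one standard LSTM step over characters. A compact NumPy sketch of such a step (parameter shapes are illustrative; this is not the authors' code) is:

```python
# One character-level LSTM step implementing Eqs. (5)-(7).
# W_c stacks the four gate blocks (i, o, f, candidate) along its columns.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def char_lstm_step(x_ci, h_prev, c_prev, W_c, b_c):
    """x_ci: char embedding (d,); h_prev, c_prev: previous states (h,)."""
    z = W_c.T @ np.concatenate([x_ci, h_prev]) + b_c    # (4h,), Eq. (5)
    h_dim = h_prev.shape[0]
    i = sigmoid(z[0*h_dim:1*h_dim])                     # input gate  i^c_i
    o = sigmoid(z[1*h_dim:2*h_dim])                     # output gate o^c_i
    f = sigmoid(z[2*h_dim:3*h_dim])                     # forget gate f^c_i
    c_tilde = np.tanh(z[3*h_dim:4*h_dim])               # candidate   ~c^c_i
    c = f * c_prev + i * c_tilde                        # Eq. (6)
    h = o * np.tanh(c)                                  # Eq. (7)
    return h, c
```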

Trigger-Aware Lattice LSTM is the core feature extractor of our framework, which is an extension of LSTM and lattice LSTM. In this subsection, we derive and theoretically analyze the model in detail.

In this section, characters and words are assumed to have $K$ senses. As mentioned in Section 2.1, for the $j$th sense of the $i$th character $c_i$, the embedding is $s^{c_i}_{j}$. Then an additional LSTM cell is utilized to integrate all senses of the character, hence the calculation of the cell gate of the multi-sense character $c_i$ is:

$$\begin{bmatrix} i^{c_i}_{j} \\ f^{c_i}_{j} \\ \tilde{c}^{c_i}_{j} \end{bmatrix} = \begin{bmatrix} \sigma \\ \sigma \\ \tanh \end{bmatrix}\left( W^{c\top} \begin{bmatrix} s^{c_i}_{j} \\ h^{c}_{i-1} \end{bmatrix} + b^{c} \right) \qquad (8)$$

$$c^{c_i}_{j} = f^{c_i}_{j} \odot c^{c}_{i-1} + i^{c_i}_{j} \odot \tilde{c}^{c_i}_{j} \qquad (9)$$


Figure 3: The structure of the Trigger-Aware Feature Extractor. The input of the example is part of the sentence "若罪名成立,他将被逮捕" (If convicted, he will be arrested). In this case, "罪名成立" (convicted) is a trigger with event type Justice: Sentence. "成立" (convicted/found) and "立" (stand/conclude) are polysemous words. To keep the figure concise, we (1) only show two senses for each polysemous word; (2) only show the forward direction.

where $c^{c_i}_{j}$ is the cell state of the $j$th sense of the $i$th character, and $c^{c}_{i-1}$ is the final cell state of the $(i-1)$th character. In order to obtain the cell state of the character, an additional gate is used:

$$g^{c_i}_{j} = \sigma\left( W^{\top} \begin{bmatrix} x^{c}_{i} \\ c^{c_i}_{j} \end{bmatrix} + b \right) \qquad (10)$$

Then all the senses are dynamically integrated into the temporary cell state:

$$c^{*c}_{i} = \sum_{j}^{K} \alpha^{c_i}_{j} \odot c^{c_i}_{j} \qquad (11)$$

where $\alpha^{c_i}_{j}$ is the character sense gate after normalization:

$$\alpha^{c_i}_{j} = \frac{\exp(g^{c_i}_{j})}{\sum_{k}^{K} \exp(g^{c_i}_{k})} \qquad (12)$$
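In other words, Eqs. 8-12 run one LSTM-style cell per sense and then mix the resulting sense cells with softmax-normalised gates. A NumPy sketch of this step follows; parameter shapes and names are illustrative, not the released implementation.

```python
# Sense-level integration for one character c_i, Eqs. (8)-(12).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(scores):
    e = np.exp(scores - scores.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def char_sense_cell(sense_embs, x_ci, h_prev, c_prev, W_c, b_c, W_g, b_g):
    """sense_embs: the K sense embeddings s^{c_i}_j of character c_i."""
    h_dim = h_prev.shape[0]
    sense_cells, gate_scores = [], []
    for s_j in sense_embs:
        z = W_c.T @ np.concatenate([s_j, h_prev]) + b_c            # Eq. (8)
        i_j = sigmoid(z[0*h_dim:1*h_dim])
        f_j = sigmoid(z[1*h_dim:2*h_dim])
        c_tilde_j = np.tanh(z[2*h_dim:3*h_dim])
        c_j = f_j * c_prev + i_j * c_tilde_j                       # Eq. (9)
        sense_cells.append(c_j)
        g_j = sigmoid(W_g.T @ np.concatenate([x_ci, c_j]) + b_g)   # Eq. (10)
        gate_scores.append(g_j)
    alpha = softmax(np.stack(gate_scores))                         # Eq. (12)
    return (alpha * np.stack(sense_cells)).sum(axis=0)             # Eq. (11): c*^c_i
```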

Eq. 11 obtains the temporary cell state of the character $c^{*c}_{i}$ by incorporating all the sense information of the character. However, word-level information needs to be considered as well. As mentioned in Section 2.1, $s^{w_{b,e}}_{j}$ is the embedding for the $j$th sense of the word $w_{b,e}$. Similar to characters, an extra LSTM cell is used to calculate the cell state of each word that matches the lexicon $D$:

$$\begin{bmatrix} i^{w_{b,e}}_{j} \\ f^{w_{b,e}}_{j} \\ \tilde{c}^{w_{b,e}}_{j} \end{bmatrix} = \begin{bmatrix} \sigma \\ \sigma \\ \tanh \end{bmatrix}\left( W^{c\top} \begin{bmatrix} s^{w_{b,e}}_{j} \\ h^{c}_{b-1} \end{bmatrix} + b^{c} \right) \qquad (13)$$

$$c^{w_{b,e}}_{j} = f^{w_{b,e}}_{j} \odot c^{c}_{b-1} + i^{w_{b,e}}_{j} \odot \tilde{c}^{w_{b,e}}_{j} \qquad (14)$$

Similar to Eq. 11, the cell state of the word can be computed by incorporating all the cells of its senses:

$$g^{w_{b,e}}_{j} = \sigma\left( W^{\top} \begin{bmatrix} x^{c}_{b} \\ c^{w_{b,e}}_{j} \end{bmatrix} + b \right) \qquad (15)$$

$$c^{w_{b,e}} = \sum_{j}^{K} \alpha^{w_{b,e}}_{j} \odot c^{w_{b,e}}_{j} \qquad (16)$$

where $\alpha^{w_{b,e}}_{j}$ is the word sense gate after normalization:

$$\alpha^{w_{b,e}}_{j} = \frac{\exp(g^{w_{b,e}}_{j})}{\sum_{k}^{K} \exp(g^{w_{b,e}}_{k})} \qquad (17)$$


For a character $c_i$, the temporary cell state $c^{*c}_{i}$ that contains sense information is calculated by Eq. 11. Moreover, we can calculate all the cell states of words that end at index $i$ by Eq. 16, which are represented as $\{c^{w_{b,i}} \mid b \in [1, i],\, w_{b,i} \in D\}$. In order to ensure that the corresponding information can flow into the final cell state of $c_i$, an extra gate $g^{m}_{b,i}$ is used to merge character and word cells:

$$g^{m}_{b,i} = \sigma\left( W^{l\top} \begin{bmatrix} x^{c}_{i} \\ c^{w_{b,i}} \end{bmatrix} + b^{l} \right) \qquad (18)$$

and the computation of the final cell state of the character $c^{c}_{i}$ is:

$$c^{c}_{i} = \sum_{b \in \{b' \mid w_{b',i} \in D\}} \alpha^{w_{b,i}} \odot c^{w_{b,i}} + \alpha^{c}_{i} \odot c^{*c}_{i} \qquad (19)$$

where $\alpha^{w_{b,i}}$ and $\alpha^{c}_{i}$ are the word gate and the character gate after normalization. The computation is similar to Eq. 12 and Eq. 17.

Therefore, the final cell state $c^{c}_{i}$ can represent the ambiguous characters and words in a dynamic manner. Similar to Eq. 7, hidden state vectors can then be calculated and transmitted to the sequence tagger layer.
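Putting Eqs. 13-19 together, the final cell of a character mixes its own sense-merged cell with the sense-merged cells of every lexicon word ending at that position, using one normalised gate per information source. A loose NumPy sketch of the merge step is given below; the gate for the character cell is assumed to follow Eq. 10, and the code is illustrative rather than the released implementation.

```python
# Lattice merge at the end character c_i, Eqs. (18)-(19).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(scores):
    e = np.exp(scores - scores.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def lattice_merge(x_ci, c_star_ci, word_cells, W_l, b_l, W_g, b_g):
    """
    x_ci:       embedding of the current (end) character c_i
    c_star_ci:  the character's sense-merged temporary cell, Eq. (11)
    word_cells: merged word cells c^{w_{b,i}} (Eq. 16) of all lexicon words
                that end at position i
    """
    # one gate per source: the character cell (gate assumed as in Eq. 10)
    # and every word cell ending at i (Eq. 18)
    gates = [sigmoid(W_g.T @ np.concatenate([x_ci, c_star_ci]) + b_g)]
    cells = [c_star_ci]
    for c_w in word_cells:
        gates.append(sigmoid(W_l.T @ np.concatenate([x_ci, c_w]) + b_l))
        cells.append(c_w)
    alpha = softmax(np.stack(gates))                # normalise over all sources
    return (alpha * np.stack(cells)).sum(axis=0)    # final cell state c^c_i, Eq. (19)
```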

2.3 Sequence Tagger

In this paper, the event detection task is regarded as a sequence tagging problem. For an input sequence $S = \{c_1, c_2, ..., c_N\}$, there is a corresponding label sequence $L = \{y_1, y_2, ..., y_N\}$. The hidden vectors $h$ for each character obtained in Section 2.2 are used as the input. We use a classic CRF layer to perform the sequence tagging, thus the probability distribution is:

$$P(L|S) = \frac{\exp\Big(\sum_{i=1}^{N}\big(S(y_i) + T(y_{i-1}, y_i)\big)\Big)}{\sum_{L' \in C}\exp\Big(\sum_{i=1}^{N}\big(S(y'_i) + T(y'_{i-1}, y'_i)\big)\Big)} \qquad (20)$$

where $S$ is the score function that computes the emission score from the hidden vector $h_i$ to the label $y_i$:

$$S(y_i) = W^{y_i}_{CRF} h_i + b^{y_i}_{CRF} \qquad (21)$$

$W^{y_i}_{CRF}$ and $b^{y_i}_{CRF}$ are learned parameters specific to $y_i$. In Eq. 20, $T$ is the transition function that computes the transition score from $y_{i-1}$ to $y_i$. $C$ contains all the possible label sequences on sequence $S$, and $L'$ is any label sequence in $C$.

We use the standard Viterbi algorithm (Viterbi, 1967) as a decoder to find the highest-scored label sequence. The loss function of our model is the sentence-level log-likelihood:

$$Loss = \sum_{i=1}^{M} \log\big(P(L_i|S_i)\big) \qquad (22)$$

where $M$ is the number of sentences and $L_i$ is the correct label sequence for the sentence $S_i$.
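The quantity inside the exponentials of Eq. 20 is a per-sentence score that sums emission and transition terms. A small sketch of that score is shown below; the start label is a hypothetical convention, and the log-partition term of Eq. 20 (computed by a CRF layer with the forward algorithm) is omitted.

```python
# Sentence-level score: sum of emission scores (Eq. 21) and transition scores.
import numpy as np

def sentence_score(hiddens, labels, W_crf, b_crf, T, start_label=0):
    """
    hiddens: (N, h) hidden vectors from the feature extractor
    labels:  (N,)   gold label ids y_1..y_N
    W_crf:   (num_labels, h) emission weights; b_crf: (num_labels,)
    T:       (num_labels, num_labels) transition scores T[y_prev, y_cur]
    """
    score, prev = 0.0, start_label
    for h_i, y_i in zip(hiddens, labels):
        emission = W_crf[y_i] @ h_i + b_crf[y_i]   # S(y_i), Eq. (21)
        score += emission + T[prev, y_i]           # + T(y_{i-1}, y_i)
        prev = y_i
    return score
```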

3 Experiments

3.1 Datasets and Experimental Settings

Datasets. In this paper, we conduct a series of experiments on two real-world datasets: the ACE 2005 Chinese dataset (ACE2005) and the TAC KBP 2017 Event Nugget Detection Evaluation dataset (KBP2017). For better comparison, we use the same data split as previous works (Chen and Ji, 2009; Zeng et al., 2016; Feng et al., 2018; Lin et al., 2018). Specifically, ACE2005 (LDC2006T06) contains 697 articles, with 569 articles for training, 64 for validation and the remaining 64 for testing. For the KBP2017 Chinese dataset (LDC2017E55), we follow the same setup as Lin et al. (2018), using 506/20/167 documents as the training/development/test sets respectively.

Evaluation Metrics. Standard micro-averaged Precision, Recall and F1 are used as evaluation metrics. For ACE2005, the computation is the same as in Chen and Ji (2009). To remain rigorous, we use the official evaluation toolkit 1 to compute the metrics for KBP2017.

Hyper-Parameter Settings. We tune the parameters of our models by grid search on the validation dataset. Adam (Kingma and Ba, 2014) with learning rate decay is utilized as the optimizer. The embedding sizes of characters and senses are both 50. To avoid overfitting, the Dropout mechanism (Srivastava et al., 2014) is used in the system, and the dropout rate is set to 0.5. We select the best models by early stopping using the F1 results on the validation dataset. Because of their limited influence, we follow empirical settings for the other hyper-parameters.
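For reference, the stated hyper-parameters can be collected in a single configuration sketch; only the values reported above are filled in, and the learning rate and its decay schedule are left unspecified because they are not reported.

```python
# Configuration sketch of the reported settings; unreported values stay None.
config = {
    "char_embedding_dim": 50,
    "sense_embedding_dim": 50,
    "dropout": 0.5,
    "optimizer": "Adam",                      # with learning-rate decay
    "learning_rate": None,                    # tuned empirically, not reported
    "model_selection": "early stopping on dev F1",
}
```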

3.2 Overall Results

In this section, we compare our model with previous state-of-the-art methods. The compared models are as follows:

1 github.com/hunterhector/EvmEval


Model               ACE2005                                              KBP2017
                    Trigger Identification   Trigger Classification     Trigger Identification   Trigger Classification
                    P      R      F1         P      R      F1           P      R      F1         P      R      F1
Char
  DMCNN             60.10  61.60  60.90      57.10  58.50  57.80        53.67  49.92  51.73      50.03  46.53  48.22
  C-BiLSTM*         65.60  66.70  66.10      60.00  60.90  60.40        -      -      -          -      -      -
  HBTNGMA           41.67  59.29  48.94      38.74  55.13  45.50        40.52  46.76  43.41      35.93  41.47  38.50
Word
  DMCNN             66.60  63.60  65.10      61.60  58.80  60.20        60.43  51.64  55.69      54.81  46.84  50.51
  HNN*              74.20  63.10  68.20      77.10  53.10  63.00        -      -      -          -      -      -
  HBTNGMA           54.29  62.82  58.25      49.86  57.69  53.49        46.92  53.57  50.02      37.54  42.86  40.03
Feature
  Rich-C*           62.20  71.90  66.70      58.90  68.10  63.20        -      -      -          -      -      -
  KBP2017 Best*     -      -      -          -      -      -            67.76  45.92  54.74      62.69  42.48  50.64
Hybrid
  NPN               70.63  64.74  67.56      67.13  61.54  64.21        58.03  59.91  58.96      52.04  53.73  52.87
  TLNN (Ours)       67.34  74.68  70.82      64.45  71.47  67.78        65.93  59.07  62.31      60.72  54.41  57.39

Table 2: Overall results of the compared methods and TLNN on ACE2005 and KBP2017. * indicates results adapted from the original paper. For KBP2017, "Trigger Identification" and "Trigger Classification" correspond to the "Span" and "Type" metrics in the official evaluation.

DMCNN (Chen et al., 2015) put forward a dynamic multi-pooling CNN as a sentence-level feature extractor. Moreover, we add a classifier to DMCNN using IOB encoding.

C-BiLSTM (Zeng et al., 2016) put forward the Convolutional Bi-LSTM model for the event detection task.

HNN (Feng et al., 2018) designed a Hybrid Neural Network model which combines a CNN with a Bi-LSTM.

HBTNGMA (Chen et al., 2018) put forward Hierarchical and Bias Tagging Networks with Gated Multi-level Attention Mechanisms to integrate sentence-level and document-level information collectively.

NPN (Lin et al., 2018) proposed a comprehensive model that automatically learns the inner compositional structures of triggers to solve the trigger mismatch problem.

The results of all the models are shown in Table 2. From the results, we can observe that:

(1) For both ACE2005 and KBP2017, TLNN outperforms the other models significantly, achieving the best results on the two datasets. This demonstrates that the trigger-aware lattice structure can enhance the accuracy of locating triggers. Further, thanks to the usage of sense-level information, triggers can be more precisely classified into correct event types.

(2) In the TI stage, TLNN gives the best performance. By linking shortcut paths of all word candidates with the current character, the model can effectively exploit both character and word information, thereby alleviating the issue of trigger-word mismatch.

Model               ACE2005           KBP2017
                    TI      TC        TI      TC
Word baseline       61.13   57.81     56.68   50.98
 +char CNN          61.11   57.52     57.14   51.66
 +char LSTM         61.51   58.15     56.14   50.46
Char baseline       64.97   61.78     53.95   49.63
 +bigram            66.89   62.95     57.01   52.71
 +softword          67.79   64.11     60.68   54.57
 +softword+bigram   68.22   64.23     60.42   54.85
TLNN                70.82   67.78     62.31   57.39

Table 3: F1-score of word-based and character-based baselines and TLNN on ACE2005 and KBP2017.

(3) In the TC stage, TLNN still maintains its advantages. The results indicate that the linguistic knowledge of HowNet and the unique structure that dynamically utilizes sense-level information can enhance the performance in the TC stage. More located triggers can be classified into correct event types by considering the ambiguity of triggers.

3.3 Effect of Trigger-aware Feature Extractor

For word-based baselines, the input is seg-mented into word sequences firstly. Furthermore,we implement extra CNN and LSTM to learncharacter-level features as additional modules. Forcharacter-based baselines, the basic units of the


Model           ACE2005               KBP2017
                Match     Mismatch    Match     Mismatch
Char Baseline   65.58     63.89       55.65     42.76
Word Baseline   66.30      2.78       64.43     17.63
NPN             65.94     55.56       56.47     41.53
TLNN            69.93     72.22       65.52     48.56

Table 4: Recall rates of the two trigger-word match splits of the two datasets on the Trigger Identification task.

For the character-based baselines, the basic units of the input sequence are characters. Then we enhance the character representation by adding external word-level features, including bigrams and the softword (the word in which the current character is located). Hence, both baselines can collectively utilize character and word information.
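A small sketch of how these two features could be constructed is given below, under the assumption of a generic word segmenter; the segment function is a placeholder, not a specific tool named in the paper.

```python
# Build per-character features for the character baseline: the character itself,
# its bigram with the following character, and its softword (the segmenter word
# that covers it).
def char_features(sentence, segment):
    """sentence: a string of characters; segment: function str -> list of words."""
    softword = []
    for w in segment(sentence):
        softword.extend([w] * len(w))     # map each position to its covering word
    feats = []
    for i, c in enumerate(sentence):
        feats.append({
            "char": c,
            "bigram": sentence[i:i + 2],  # current character + following character
            "softword": softword[i],
        })
    return feats
```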

As shown in Table 3, experiments with the two types of baselines and our model are conducted on ACE2005 and KBP2017. For the word baseline, although adding character-level features can improve the performance, the effects are relatively limited. The char baseline gains considerable improvements when word-level features are taken into account. The results of the baselines indicate that integrating different levels of information is an effective strategy to improve the performance of models. TLNN achieves the best F1-score among all the baselines on both datasets, showing remarkable superiority and robustness. The results show that by dynamically combining multi-grained information, the trigger-aware feature extractor can explore deeper semantic features more effectively than the feature-based strategies used in the baselines.

3.4 Influence of Trigger Mismatch

In order to explore the influence of the trigger-word mismatch problem, we split the test data of ACE2005 and KBP2017 into two parts: match and mismatch. Table 1 shows the proportion of word-trigger match and mismatch on the two datasets.
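The split can be read as follows: a trigger falls into the match part if its character span coincides exactly with one segmenter word, and into the mismatch part otherwise (part of a word, or crossing word boundaries). A sketch under the assumption of a generic segmenter:

```python
# Split gold triggers into match / mismatch groups against segmenter output.
def split_by_trigger_word_match(sentence, trigger_spans, segment):
    """trigger_spans: list of (start, end) character offsets, end exclusive."""
    word_spans, pos = [], 0
    for w in segment(sentence):
        word_spans.append((pos, pos + len(w)))
        pos += len(w)
    match, mismatch = [], []
    for span in trigger_spans:
        (match if span in word_spans else mismatch).append(span)
    return match, mismatch
```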

The recall of the different methods on each split of the Trigger Identification task is shown in Table 4. We can observe that:

(1) The results indicate that the word-trigger mismatch problem can severely impact the performance of the task. All approaches except ours give lower recall rates on the trigger-mismatch part than on the trigger-match part. In contrast, our model robustly addresses the word-trigger mismatch problem, reaching the best results on both parts of the two datasets.

Model               ACE2005           KBP2017
                    TI      TC        TI      TC
NPN                 67.56   64.21     58.96   52.87
TLNN                70.82   67.78     62.31   57.39
 - w/o Sense info   68.65   65.83     61.17   56.55

Table 5: F1-score of NPN, TLNN and TLNN - w/o Sense info on ACE2005 and KBP2017.

Model               ACE2005           KBP2017
                    Poly    Mono      Poly    Mono
NPN                 66.27   61.08     51.15   48.63
TLNN                69.11   62.56     58.61   56.56
 - w/o Sense info   67.53   62.89     56.01   55.96

Table 6: F1-score of the two splits of the two datasets on the Trigger Classification task. The splits are based on the polysemy of triggers. "Poly" and "Mono" correspond to the polysemous and monosemous trigger splits.

(2) To a certain extent, the NPN model can alleviate the problem by utilizing hybrid representation learning and a nugget generator within a fixed-size window. However, this mechanism is still not flexible and robust enough to integrate character and word information.

(3) The word-based baseline is most severely affected by the trigger-word mismatch problem. This phenomenon is explainable: if a trigger is not segmented as a separate word in the preprocessing stage, it is impossible to locate it correctly.

3.5 Influence of Trigger Polysemy

In this section, we mainly focus on the influence of polysemous triggers. We select the NPN model for comparison, and we implement a version of TLNN without sense information, which is denoted as TLNN - w/o Sense info in Table 5 and Table 6.

Empirical results in Table 5 show the overall performance on ACE2005 and KBP2017. We can observe that TLNN is weakened by removing sense information, which indicates the effectiveness of using sense-level information. Even without sense information, our model still outperforms the NPN model on both datasets.

To further explore and analyze the effect of word sense information, we split the KBP2017 dataset into two parts based on the polysemy of triggers and their contexts. The F1-score of each split is shown in Table 6, in which TLNN yields the best results on both "Poly" parts.


Sentence 1: 立刻/抗敌援友 (Resist the enemies and aid the allies at once)
  Word baseline: (抗敌援友, Attack)    NPN: (刻抗, Attack)    TLNN: (抗, Attack)    Answer: (抗, Attack)

Sentence 2: 送/他/一笔/赴/欧洲/的/旅费 (Send him money for European tourism)
  NPN: (送, TransferPerson)    TLNN - w/o Sense info: (送, TransferPerson)    TLNN: (送, TransferMoney)    Answer: (送, TransferMoney)

Table 7: Two model prediction examples. The first sentence is an example of trigger-word mismatch, while the second one is about polysemous triggers. For each prediction result, (A, B) indicates that A is a trigger with event type B.

Without sense information, TLNN - w/o Sense info gives F1-scores comparable to TLNN on the "Mono" parts. The results indicate that the trigger-aware feature extractor can dynamically learn all the senses of characters and words, gaining significant improvements under the condition of polysemy.

3.6 Case Study

Table 7 shows two examples comparing the TLNN model with other ED methods. The former example is about trigger-word mismatch, in which the correct trigger "抗" (resist) is part of the idiom "抗敌援友" (resist the enemies and aid the allies). In this case, the word baseline gives the whole word "抗敌援友" as the prediction, because it is impossible for word-based methods to detect part-of-word triggers. Additionally, the NPN model recognizes a non-existent word "刻抗". The reason is that NPN enumerates the combinations of all characters within a window as trigger candidates, which is likely to generate invalid words. In contrast, our model detects the event trigger "抗" accurately.

In the latter example, the trigger "送" (send) is a polysemous word with two different meanings: "送行" (see him off) and "送钱" (give him money). Without considering the multiple word senses of polysemes, NPN and TLNN (w/o Sense info) classify the trigger "送" into the wrong event type TransferPerson. On the contrary, TLNN can dynamically select the word sense for polysemous triggers by utilizing context information. Thus the correct event type TransferMoney is predicted.

4 Related Work

Event Detection (ED) is a crucial subtask of Event Extraction. Feature-based methods (Ahn, 2006; Ji and Grishman, 2008; Liao and Grishman, 2010; Huang and Riloff, 2012; Patwardhan and Riloff, 2009; McClosky et al., 2011) were widely used in the ED task, but these traditional methods rely heavily on manual features, limiting their scalability and robustness.

Recent developments in deep learning have led to a renewed interest in neural event detection. Neural networks can automatically learn features of the input sequence and conduct token-level classification. CNN-based models are the seminal neural network models in ED (Nguyen and Grishman, 2015; Chen et al., 2015; Nguyen and Grishman, 2016). However, these models can only capture local context features in a fixed-size window. Some approaches design comprehensive models to explore the interdependency among trigger words (Chen et al., 2018; Feng et al., 2018). To further improve the ED task, some joint models have been designed (Nguyen et al., 2016; Lu and Nguyen, 2018; Yang and Mitchell, 2016). These methods have achieved great success on English datasets.

However, in languages without delimiters, such as Chinese, word-trigger mismatch becomes significantly more severe. Some feature-based methods have been proposed to solve the problem (Chen and Ji, 2009; Qin et al., 2010; Li and Zhou, 2012), but they rely heavily on hand-crafted features. Lin et al. (2018) propose NPN, a neural network based method to address the issue. However, the mechanism of NPN limits the scope of trigger candidates to a fixed-size window, which causes two problems. First, NPN still cannot take all the possible trigger candidates into account, leading to meaningless computation. Furthermore, the overlap of triggers is serious in NPN. Lattice-based models have been used in other fields to combine character and word information (Li et al., 2019; Zhang and Yang, 2018; Yang et al., 2018). Mainstream methods also suffer from the problem of trigger polysemy. Lu and Nguyen (2018) propose a multi-task learning model which uses word sense disambiguation to alleviate the effect of the trigger polysemy problem.


However, their work requires word sense disambiguation datasets. In contrast, our model can solve both the word-trigger mismatch and trigger polysemy problems at the same time.

5 Conclusion and Future Work

We propose a novel framework, TLNN, for event detection, which can simultaneously address the problems of trigger-word mismatch and polysemous triggers. With hierarchical representation learning and the trigger-aware feature extractor, TLNN effectively exploits multi-grained information and learns deep semantic features. Sets of experiments on two real-world datasets show that TLNN can efficiently address the two issues and yields better empirical results than a variety of neural network models.

In future work, we will conduct experiments on more languages with and without explicit word delimiters. In addition, we will try to develop a dynamic mechanism that selectively considers sense-level information rather than taking all the senses of characters and words into account.

6 Acknowledgement

This work is supported by the National Natural Science Foundation of China (Grant No. 61773229, 61572273, 61661146007), the National Key Research and Development Program of China (No. 2018YFB1004503), the Basic Scientific Research Program of Shenzhen City (Grant No. JCYJ20160331184440545), the Overseas Cooperation Research Fund of Graduate School at Shenzhen, Tsinghua University (Grant No. HW2018002), and Shenzhen Giiso Information Technology Co. Ltd. Finally, we would like to thank the anonymous reviewers for their helpful feedback and suggestions.

References

David Ahn. 2006. The stages of event extraction. In Proceedings of the Workshop on Annotating and Reasoning about Time and Events, pages 1–8.

Yubo Chen, Liheng Xu, Kang Liu, Daojian Zeng, and Jun Zhao. 2015. Event extraction via dynamic multi-pooling convolutional neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pages 167–176.

Yubo Chen, Hang Yang, Kang Liu, Jun Zhao, and Yantao Jia. 2018. Collective event detection via a hierarchical and bias tagging networks with gated multi-level attention mechanisms. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1267–1276.

Zheng Chen and Heng Ji. 2009. Language specific issue and feature exploration in Chinese event extraction. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, pages 209–212.

Zhendong Dong and Qiang Dong. 2003. HowNet - a hybrid language and knowledge resource. In Proceedings of NLP-KE.

Xiaocheng Feng, Bing Qin, and Ting Liu. 2018. A language-independent neural network for event detection. Science China Information Sciences, 61(9):092106.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Ruihong Huang and Ellen Riloff. 2012. Modeling textual cohesion for event extraction. In Twenty-Sixth AAAI Conference on Artificial Intelligence.

Heng Ji and Ralph Grishman. 2008. Refining event extraction through cross-document inference. In Proceedings of ACL-08: HLT, pages 254–262.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Peifeng Li and Guodong Zhou. 2012. Employing morphological structures and sememes for Chinese event extraction. In Proceedings of COLING 2012, pages 1619–1634.

Ziran Li, Ning Ding, Zhiyuan Liu, Haitao Zheng, and Ying Shen. 2019. Chinese relation extraction with multi-grained information and external linguistic knowledge. In Proceedings of the 57th Conference of the Association for Computational Linguistics, pages 4377–4386.

Shasha Liao and Ralph Grishman. 2010. Using document level cross-event inference to improve event extraction. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 789–797. Association for Computational Linguistics.

Hongyu Lin, Yaojie Lu, Xianpei Han, and Le Sun. 2018. Nugget proposal networks for Chinese event detection. arXiv preprint arXiv:1805.00249.

Weiyi Lu and Thien Huu Nguyen. 2018. Similar but not the same: Word sense disambiguation improves event detection via neural representation matching.


In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4822–4828.

David McClosky, Mihai Surdeanu, and Christopher D. Manning. 2011. Event extraction as dependency parsing. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 1626–1635. Association for Computational Linguistics.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Thien Huu Nguyen, Kyunghyun Cho, and Ralph Grishman. 2016. Joint event extraction via recurrent neural networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 300–309.

Thien Huu Nguyen and Ralph Grishman. 2015. Event detection and domain adaptation with convolutional neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), volume 2, pages 365–371.

Thien Huu Nguyen and Ralph Grishman. 2016. Modeling skip-grams for event detection with convolutional neural networks. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 886–891, Austin, Texas. Association for Computational Linguistics.

Yilin Niu, Ruobing Xie, Zhiyuan Liu, and Maosong Sun. 2017. Improved word representation learning with sememes. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2049–2058, Vancouver, Canada. Association for Computational Linguistics.

Siddharth Patwardhan and Ellen Riloff. 2009. A unified model of phrasal and sentential evidence for information extraction. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1, pages 151–160. Association for Computational Linguistics.

Bing Qin, Yanyan Zhao, Xiao Ding, Ting Liu, and Guofu Zhai. 2010. Event type recognition based on trigger expansion. Tsinghua Science and Technology, 15(3):251–258.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.

Andrew Viterbi. 1967. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13(2):260–269.

Bishan Yang and Tom Mitchell. 2016. Joint extraction of events and entities within a document context. arXiv preprint arXiv:1609.03632.

Jie Yang, Yue Zhang, and Shuailong Liang. 2018. Subword encoding in lattice LSTM for Chinese word segmentation. arXiv preprint arXiv:1810.12594.

Ying Zeng, Honghui Yang, Yansong Feng, Zheng Wang, and Dongyan Zhao. 2016. A convolution BiLSTM neural network model for Chinese event extraction. In Natural Language Understanding and Intelligent Applications, pages 275–287. Springer.

Yue Zhang and Jie Yang. 2018. Chinese NER using lattice LSTM. arXiv preprint arXiv:1805.02023.