arXiv:1905.07458v4 [cs.CL] 27 Aug 2019extract as entities the spans “Mrs. Tsuruyama”,...

20
Neural Metric Learning for Fast End-to-End Relation Extraction Tung Tran Department of Computer Science University of Kentucky, USA [email protected] Ramakanth Kavuluru Division of Biomedical Informatics Department of Internal Medicine Department of Computer Science University of Kentucky, USA [email protected] Relation extraction (RE) is an indispensable information extraction task in several disci- plines. RE models typically assume that named entity recognition (NER) is already performed in a previous step by another independent model. Several recent efforts, under the theme of end-to-end RE, seek to exploit inter-task correlations by modeling both NER and RE tasks jointly. Earlier work in this area commonly reduces the task to a table-filling problem wherein an additional expensive decoding step involving beam search is applied to obtain globally consistent cell labels. In efforts that do not employ table-filling, global optimization in the form of CRFs with Viterbi decoding for the NER component is still necessary for competitive performance. We introduce a novel neural architecture utilizing the table structure, based on repeated applications of 2D convolutions for pooling local dependency and metric-based features, that improves on the state-of-the-art without the need for global optimization. We validate our model on the ADE and CoNLL04 datasets for end-to-end RE and demonstrate 1% gain (in F-score) over prior best results with training and testing times that are seven to ten times faster — the latter highly advantageous for time-sensitive end user applications. 1. Introduction Information extraction (IE) systems are fundamental to the automatic construction of knowledge bases and ontologies from unstructured text. While important, in and of themselves, these result- ing resources can be harnessed to advance other important language understanding applications including knowledge discovery and question answering systems. Among IE tasks are named entity recognition (NER) and binary relation extraction (RE) which involve identifying named entities and relations among them, respectively, where the latter is typically a set of triplets identifying pairs of related entities and their relation types. We present Figure 1 as an example of the NER and RE problem given the input sentence “Mrs. Tsuruyama is from Yatsushiro in Kumamoto Prefecture in southern Japan.” First, we extract as entities the spans “Mrs. Tsuruyama”, “Yatsushiro”, “Kumamoto Prefecture”, and “Japan” where “Mrs. Tsuruyama” is of type PERSON and the rest are of type LOCATION. Thus, NER consists of identifying both the bounds and type of entities mentioned in the © 2016 Association for Computational Linguistics arXiv:1905.07458v4 [cs.CL] 27 Aug 2019

Transcript of arXiv:1905.07458v4 [cs.CL] 27 Aug 2019extract as entities the spans “Mrs. Tsuruyama”,...

Page 1: arXiv:1905.07458v4 [cs.CL] 27 Aug 2019extract as entities the spans “Mrs. Tsuruyama”, “Yatsushiro”, “Kumamoto Prefecture”, and “Japan” where “Mrs. Tsuruyama” is

Neural Metric Learning for Fast End-to-EndRelation Extraction

Tung TranDepartment of Computer ScienceUniversity of Kentucky, [email protected]

Ramakanth KavuluruDivision of Biomedical InformaticsDepartment of Internal MedicineDepartment of Computer ScienceUniversity of Kentucky, [email protected]

Relation extraction (RE) is an indispensable information extraction task in several disci-plines. RE models typically assume that named entity recognition (NER) is already performedin a previous step by another independent model. Several recent efforts, under the theme ofend-to-end RE, seek to exploit inter-task correlations by modeling both NER and RE tasksjointly. Earlier work in this area commonly reduces the task to a table-filling problem wherein anadditional expensive decoding step involving beam search is applied to obtain globally consistentcell labels. In efforts that do not employ table-filling, global optimization in the form of CRFswith Viterbi decoding for the NER component is still necessary for competitive performance. Weintroduce a novel neural architecture utilizing the table structure, based on repeated applicationsof 2D convolutions for pooling local dependency and metric-based features, that improves on thestate-of-the-art without the need for global optimization. We validate our model on the ADE andCoNLL04 datasets for end-to-end RE and demonstrate ≈ 1% gain (in F-score) over prior bestresults with training and testing times that are seven to ten times faster — the latter highlyadvantageous for time-sensitive end user applications.

1. Introduction

Information extraction (IE) systems are fundamental to the automatic construction of knowledgebases and ontologies from unstructured text. While important, in and of themselves, these result-ing resources can be harnessed to advance other important language understanding applicationsincluding knowledge discovery and question answering systems. Among IE tasks are namedentity recognition (NER) and binary relation extraction (RE) which involve identifying namedentities and relations among them, respectively, where the latter is typically a set of tripletsidentifying pairs of related entities and their relation types.

We present Figure 1 as an example of the NER and RE problem given the input sentence“Mrs. Tsuruyama is from Yatsushiro in Kumamoto Prefecture in southern Japan.” First, weextract as entities the spans “Mrs. Tsuruyama”, “Yatsushiro”, “Kumamoto Prefecture”, and“Japan” where “Mrs. Tsuruyama” is of type PERSON and the rest are of type LOCATION.Thus, NER consists of identifying both the bounds and type of entities mentioned in the

© 2016 Association for Computational Linguistics

arX

iv:1

905.

0745

8v4

[cs

.CL

] 2

7 A

ug 2

019

Page 2: arXiv:1905.07458v4 [cs.CL] 27 Aug 2019extract as entities the spans “Mrs. Tsuruyama”, “Yatsushiro”, “Kumamoto Prefecture”, and “Japan” where “Mrs. Tsuruyama” is

Computational Linguistics Volume 1, Number 1

Mrs. Tsuruyama is from Yatsushiro in Kumamoto Prefecture in southern Japan .PERSON

LOCATED_IN

LIVE_IN

LOCATED_IN

LOCATED_INLOCATION

LIVE_INLIVE_IN

LOCATION LOCATION

Figure 1A simple relation extraction example.

sentence. Once entities are identified, the next step is to extract relation triplets of the form(subject,predicate,object), if any, based on the context; for example, (Mrs.Tsuruyama, LIVE_IN, Yatsushiro) is a relation triple that may be extracted from theexample sentence as output of an RE system. Given this, it is clear that E2ERE is a complexproblem given the sparse nature of the output space; for a sentence of n length with k possiblerelation types, the output is a variable-length set of relations each drawn from kn2 possiblerelation combinations.

NER and RE have been traditionally treated as independent problems to be solved separatelyand later combined in an ad-hoc manner as part of a pipeline system. End-to-end RE (E2ERE) isa relatively new research direction that seeks to model NER and RE jointly in a unified architec-ture. As these tasks are closely intertwined, joint models that simultaneously extract entitiesand their relations in a single framework have the capacity to exploit inter-task correlationsand dependencies leading to potential performance gains. Moreover, joint approaches, like ourmethod, are better equipped to handle datasets where entity annotations are non-exhaustive (thatis, only entities involved in a relation are annotated), since standalone NER systems are notdesigned to handle incomplete annotations. Recent advancements in deep learning for E2EREare broadly divided into two categories: (1). The first category involves applying deep learningto the table structure first introduced by Miwa and Sasaki (2014), including Gupta, Schütze, andAndrassy (2016), Pawar, Bhattacharyya, and Palshikar (2017), and Zhang, Zhang, and Fu (2017)where E2ERE is reduced to some variant of the table-filling problem such that the (i, j)-th cellis assigned a label that represents the relation between tokens at positions i and j in the sentence.We further describe the table-filling problem in Section 3.1. Recent approaches based on the tablestructure operate on the idea that cell labels are dependent on features or predictions derived frompreceding or adjacent cells; hence, the table is filled incrementally leading to potential efficiencyissues. Also, these methods typically require an additional expensive decoding step, involvingbeam search, to obtain a globally optimal table-wide label assignment. (2). The second categoryincludes models where NER and RE are modeled jointly with shared components or parameterswithout the table structure. Even state-of-the-art methods not utilizing the table structure rely onconditional random fields (CRFs) as an integral component of the NER subsystem where Viterbialgorithm is used to decode the best label assignment at test time (Bekoulis et al. 2018b,a).

Our model utilizes the table formulation by embedding features along the third dimension.We overcome efficiency issues by utilizing a more efficient and effective approach for deepfeature aggregation such that local metric, dependency, and position based features are simulta-neously pooled — in a 3× 3 cellular window — over many applications of the 2D convolution.Intuitively, preliminary decisions are made at earlier layers and corroborated at later layers. Finallabel assignments for both NER and RE are made simultaneously via a simple softmax layer.Thus, computationally, our model is expected to improve over earlier efforts without a costlydecoding step. We validate our proposed method on the CoNLL04 dataset (Roth and Yih 2004)and the ADE dataset (Gurulingappa et al. 2012), which correspond to the general English andthe biomedical domain respectively, and show that our method improves over prior state-of-

2

Page 3: arXiv:1905.07458v4 [cs.CL] 27 Aug 2019extract as entities the spans “Mrs. Tsuruyama”, “Yatsushiro”, “Kumamoto Prefecture”, and “Japan” where “Mrs. Tsuruyama” is

Tran et al. Neural Metric Learning for Fast End-to-End Relation Extraction

the-art in E2ERE. We also show that our approach leads to training and testing times that areseven to ten times faster, where the latter can be critical for time-sensitive end-user applications.Lastly, we perform extensive error analyses and show that our network is visually interpretableby examining the activity of hidden pooling layers (corresponding to intermediate decisions).To our knowledge, our study is the first to perform this type of visual analysis of a deep neuralarchitecture for end-to-end relation extraction.1

2. Related Work

In this section, we provide an overview of three main types of relation extraction methods in theliterature: studies that are limited to relation classification, early E2ERE methods that assumeknown entity bounds, and recent efforts on E2ERE that perform full entity recognition andrelation extraction in an end-to-end fashion.

2.1 Relation Classification

The majority of past and current efforts in relation extraction treat the problem as a simplerrelation classification problem where pairs of entities are known during test time; the goal is toclassify the pair of entities, given the context, as being either positive or negative for a particulartype of relation. Many works on relation classification preprocess the input as a dependencyparse tree (Bunescu and Mooney 2005; Qian et al. 2008) and exploit features corresponding tothe shortest dependency path between candidate entities; this general approach has also beensuccessfully applied in the biomedical domain (Airola et al. 2008; Fundel, Küffner, and Zimmer2007; Li et al. 2008; Özgür et al. 2008), where they typically involve a graph kernel based SupportVector Machine (SVM) classifier (Li et al. 2008; Rink, Harabagiu, and Roberts 2011). Theconcept of network centrality has also been explored (Özgür et al. 2008) to extract gene-diseaserelations. Other studies, such as the effort by Frunza, Inkpen, and Tran (2011), apply the moretraditional bag-of-words approach focusing on syntactic and lexical features while exploringa wide variety of classification algorithms including decision trees, SVMs, and Naïve Bayes.More recently, innovations in relation extraction have centered around designing meaningfuldeep learning architectures. Liu et al. (2016) proposed a dependency-based convolutional neuralnetwork (CNN) architecture wherein the convolution is applied over words adjacent accordingto the shortest path connecting the entities in the dependency tree, rather than words adjacentwith respect to the order expressed, to detect drug-drug interactions (DDIs). In Kavuluru, Rios,and Tran (2017), ensembling of both character-level and word-level recurrent neural networks(RNNs) is further proposed for improved performance in DDI extraction. Raj, SAHU, and Anand(2017) proposed a deep learning architecture such that word representations are first processed bya bidirectional RNN followed by a standard CNN, with an optional attention mechanism towardsthe output layer. Zhang, Qi, and Manning (2018) showed that relation extraction performancecan be improved by applying graph convolutions over a pruned version of the dependency tree.

2.2 End-to-End Relation Extraction with Known Entity Bounds

Early efforts in E2ERE, as covered in this section, assume that entity bounds are known duringtest time. Hence, the NER aspect of these methods is limited to classifying entity type (e.g.,is "President Kennedy" a person, place, or organization?). In a seminal work, Roth and Yih(2004) proposed an integer linear programming (LP) approach to tackle the end-to-end problem.

1 Our code is included as supplementary material and will be made publicly available on GitHub.

3

Page 4: arXiv:1905.07458v4 [cs.CL] 27 Aug 2019extract as entities the spans “Mrs. Tsuruyama”, “Yatsushiro”, “Kumamoto Prefecture”, and “Japan” where “Mrs. Tsuruyama” is

Computational Linguistics Volume 1, Number 1

They discovered that the LP component was effective in enhancing classifier results by reducingsemantic inconsistencies in the predictions compared to a traditional pipeline wherein the outputsof an NER component are passed as features into the RE component. Their results indicate thatthere are mutual inter-dependencies between NER and RE as subtasks which can be exploited.The LP technique has been also been successfully applied in jointly modeling entities andrelations with respect to opinion recognition by Choi, Breck, and Cardie (2006). Kate andMooney (2010) proposed a similar approach but presented a global inference mechanism inducedby building a graph resembling a card-pyramid structure. A dynamic programming algorithm,similar to CYK (Jurafsky and Martin 2008) parsing, called card-pyramid parsing is applied alongwith beam search to identify the most probable joint assignment of entities and their relationsbased on outputs of local classifiers. Other efforts to this end involve the use of probabilisticgraphical models (Yu and Lam 2010; Singh et al. 2013).

2.3 End-to-End Relation Extraction

Li and Ji (2014) proposed one of the first truly joint models wherein entities, including entitymention bounds, and their relations are predicted. Structured perceptrons (Collins 2002), as alearning framework, are used to estimate feature weights while beam search is used to explorepartial solutions to incrementally arrive at the most probable structure. Miwa and Sasaki (2014)proposed the idea of using a table representation which simplifies the task into a table-fillingproblem such that NER and relation labels are assigned to cells of the table; the aim was to predictthe most probable label assignment to the table, out of all possible assignments, using beamsearch. While the representation is in table form, beam search is performed sequentially, one cell-assignment per step. The table-filling problem for E2ERE has since been successfully transferredto the deep neural network setting (Gupta, Schütze, and Andrassy 2016; Pawar, Bhattacharyya,and Palshikar 2017; Zhang, Zhang, and Fu 2017).

Other recent approaches not utilizing a table structure involve modeling the entity and rela-tion extraction task jointly with shared parameters (Miwa and Bansal 2016; Li et al. 2016; Zhenget al. 2017a; Li et al. 2017; Katiyar and Cardie 2017; Bekoulis et al. 2018b; Zeng et al. 2018).Katiyar and Cardie (2017) and Bekoulis et al. (2018b) specifically use attention mechanisms forthe RE component without the need for dependency parse features. Zheng et al. (2017b) operateby reducing the problem to a sequence-labeling task that relies on a novel tagging scheme. Zenget al. (2018) use an encoder-decoder network such that the input sentence is encoded as fixed-length vector and decoded to relation triples directly. Most recently, Bekoulis et al. (2018a) foundthat adversarial training (AT) is an effective regularization approach for E2ERE performance.

3. Method

We present our version of the table-filling problem, a novel neural network architecture to fillthe table, and details of the training process. Here, Greek letter symbols are used to distinguishhyper-parameters from variables that are learned during training.

3.1 The Table-Filling Problem

Given a sentence of length n, we use an n× n table to represent a set of semantic relationssuch that the (i, j)-th cell represents the relationship (or non-relation) between tokens i and j.In practice, we assign a tag for each cell in the table such that entity tags are encoded along thediagonal while relation tags are encoded at non-diagonal cells. For entity recognition, we use theBILOU tagging scheme (Ratinov and Roth 2009). In the BILOU scheme, B, I, and L tags are

4

Page 5: arXiv:1905.07458v4 [cs.CL] 27 Aug 2019extract as entities the spans “Mrs. Tsuruyama”, “Yatsushiro”, “Kumamoto Prefecture”, and “Japan” where “Mrs. Tsuruyama” is

Tran et al. Neural Metric Learning for Fast End-to-End Relation Extraction

Mrs.

Tsuruyama is

from Yatsushiro

in Kumamoto Prefecture

in southern

Japan .

Mrs

. Tsur

uyam

a is

fr

om

Yats

ushi

ro

in

Kum

amot

o Pr

efec

ture

in

so

uthe

rn

Japa

n .

PERSON

LOCATION

LIVE_IN

LOCATION_IN

Figure 2Table representation for the example in Figure 1. BILOU-encoded entity tags are assigned along thediagonal and relation tags are assigned where entity spans intersect. Empty cells are implicitly assignedthe O tag.

used to indicate the beginning, inside, and last token of a multi-token entity respectively. The Otag indicates whether the token outside of an entity span, and U is used for unit-length entities.

In tabular form, entity and relation tags are drawn from a unified list Z serving as thelabel space; that is, each cell in the table is assigned exactly one tag from Z . For simplicity,the O tag is also used to indicate a null relation when occurring outside of a diagonal. As eachentity type requires a BILOU variant, a problem with nent entity types and nrel relation typeshas |Z| = 4nent + nrel + 1 where the last term accounts for the O tag. Our conception of thetable-filling problem differs from Miwa and Sasaki (2014) in that we utilize the entire table asopposed to only the lower triangle; this allows us to model directed relations without the needfor additional inverse-relation tags. Moreover, we assign relation tags to cells where entity spansintersect instead of where head words intersect; thus encoded relations manifest as rectangularblocks in the proposed table representation. We present a visualization of our table representationin Figure 2. At test time, entities are first extracted, and relations are subsequently extracted byaveraging the output probability estimates of the blocks where entities intersect. We describe theexact procedure for extracting relations from these blocks at test-time in Section 3.3.

3.2 Our Model: Relation-Metric Network

We propose a novel neural architecture, which we call the relation-metric network, combiningthe ideas of metric learning and convolutional neural networks (CNNs) for table filling. Theschematic of the network is shown in Figure 3, whose components will be detailed in this section.

3.2.1 Context Embeddings Layer. In addition to word embeddings, we employ character-CNNbased representations as commonly observed in recent neural NER models (Chiu and Nichols2016) and E2ERE models (Li et al. 2017). Character-based features can capture morphological

5

Page 6: arXiv:1905.07458v4 [cs.CL] 27 Aug 2019extract as entities the spans “Mrs. Tsuruyama”, “Yatsushiro”, “Kumamoto Prefecture”, and “Japan” where “Mrs. Tsuruyama” is

Computational Linguistics Volume 1, Number 1

���

���������

��

���

�����

����

����

����

����

����

�������������������������� ���������

�����������

���������

������������

���������

������������

������ ��­

������������

��

��

���

��

� �����������

���������������

������� �����

��������������������

�����­�����������

���������������

��

��

����������������

Figure 3Overview of the network architecture for λ = 2. For simplicity, we ignore punctuation tokens.

features and help generalize to out-of-vocabulary words. For the proposed model, such represen-tations are composed by convolving over character embeddings of size π using a window of size3, producing η feature maps; the feature maps are then max-pooled to produce η-length featurerepresentations. As our approach is standard, we refer readers to Chiu and Nichols (2016) forfull details. This portion of the network is illustrated in step 1 of Figure 3.

Suppose the input is a sentence of length n represented by a sequence of word indicesw1, . . . , wn into the vocabulary VWord. Each word is mapped to an embedding vector viaembedding matrices EWord ∈ R|VWord|×δ such that δ is a hyperparameter that determines the sizeof word embeddings. Next, let C[i] be the character-based representation for the ith word. Aninput sentence is represented by matrix S wherein rows are words mapped to their correspondingembedding vectors; or concretely,

S =

EWord[w1] ‖ C[1]...

EWord[wn] ‖ C[n]

where ‖ is the vector concatenation operator and EWord[i] is the ith row of EWord.

Next, we compose context embedding vectors (CVs), which embed each word of thesentence with additional contextual features. Suppose

−−−−→LSTM and

←−−−−LSTM represent a long short

term memory (LSTM) network composition in the forward and backward direction, respectively,and let ρ be a hyperparameter that determines context embedding size. We feed S to a Bi-LSTMlayer of hidden unit size 1

2ρ such that

−→h i =

−−−−→LSTM(S[i]),

←−h i =

←−−−−LSTM(S[i]),

hi =−→h i ‖

←−h i, for i = 1, . . . , n,

where S[i] is the ith row of S and hi ∈ Rρ represents the context centered at the ith word. Theoutput of the Bi-LSTM can be represented as a matrixH ∈ Rn×ρ such thatH =

(h1, . . . ,hn

)>.This concludes step 2 of Figure 3.

6

Page 7: arXiv:1905.07458v4 [cs.CL] 27 Aug 2019extract as entities the spans “Mrs. Tsuruyama”, “Yatsushiro”, “Kumamoto Prefecture”, and “Japan” where “Mrs. Tsuruyama” is

Tran et al. Neural Metric Learning for Fast End-to-End Relation Extraction

3.2.2 Relation-Metric Learning. Our goal is to design a network such that any two CVs canbe compared via some “relatedness” measure; that is, we wish to learn a relatedness measure(as a parameterized function) that is able to capture correlative features indicating semanticrelationships. A common approach in metric learning to parameterize a relatedness function isto model it in bilinear form. Here, for input vectors x, z ∈ Rm, a similarity function in bilinearform is formally defined as

sR(x, z) = x>Rz (1)

where R ∈ Rm×m is a parameter of the relatedness function, dubbed a relation-metric embed-ding matrix, that is learned during the training process.

In machine learning research, Eq. 1 is also associated with a type of attention mechanismcommonly referred to as “multiplicative” attention (Luong, Pham, and Manning 2015). However,we apply Eq. 1 with the classical goal of learning a variety of metric-based features. Our aim isto compute sR for all pairs of CVs in the sentence. Concretely, we can compute a “relational-metric table” G ∈ Rn×n over all pairs of CVs in the sentence such that Gi,j = hi

>Rhj . In fact,

we can learn a collection of κ similarity functions corresponding to κ relation metric tables; forour purposes, this is analogous to learning a diverse set of convolution filters in the context ofCNNs. Thus we have the 3-dimensional tensor

Gi,j,k = hi>Rkhj , for k = 1, . . . , κ, (2)

with G ∈ Rn×n×κ where the first and second dimension correspond to word position indiceswhile the third dimension embeds metric-based features. This constitutes step 3 of Figure 3.We show how G is consumed by the rest of the network in Section 3.2.6. However, as aprerequisite, we first describe how dependency parse and relative position information is preparedin Section 3.2.3 and Section 3.2.4 respectively and define the 2D convolution in Section 3.2.5.

3.2.3 Dependency Embeddings Table. Let Vdep be the vocabulary of syntactic dependency tags(e.g., nsubj, dobj). For an input sentence, let T = {(a1, b1, z1), . . . , (ad, bd, zd)} be the setof dependency relations where zi are mappings to tags in Vdep that express the dependency-based relations between pairs of words at positions ai, bi ∈ {1, . . . , n}, respectively. We definethe dependency embedding matrix as F dep ∈ R|Vdep|×β , where each unique dependency tag is aβ-dimensional embedding. We compose the dependency representation tensor D for T as

Di,j,k =

{F dept,k if (i, j, t) ∈ T or (j, i, t) ∈ T ,φk otherwise,

for k = 1, . . . , β, where φ is a trainable embedding vector representing the null dependencyrelation. As shown in the above equation for Di,j,k, we embed the dependency parse tree simplyas an undirected graph.

3.2.4 Position Embeddings Table. First proposed by Zeng et al. (2014), so called positionvectors have been shown to be effective in neural models for relation classification. Positionvectors are designed to encode the relative offset between a word and the two candidate entities(for RE) as fixed-length embeddings. We bring this idea to the tabular setting by proposinga position embeddings table P , which is composed the same way as the dependencies table;however, instead of dependency tags, we simply encode the distance between two candidateCVs as discrete labels mapped to fixed-length embeddings (of size γ, a hyperparameter). It

7

Page 8: arXiv:1905.07458v4 [cs.CL] 27 Aug 2019extract as entities the spans “Mrs. Tsuruyama”, “Yatsushiro”, “Kumamoto Prefecture”, and “Japan” where “Mrs. Tsuruyama” is

Computational Linguistics Volume 1, Number 1

��

��

��

��

����� �

���

Figure 42D convolution on 3D input with padding

is straightforward to see there will be 2(nmax − 1) + 1 distinct position offset labels wherenmax is the maximum length of a sentence in the training data. Specifically, given a positionvocabulary Vdist, associated position embedding matrix F dist ∈ R|Vdist|×γ , the position embed-dings tensor is Pi,j,k = F dist

(i−j),k for k = 1, . . . , γ. As an implementation detail, we set Vdist to{−nmax, . . . , nmax} where nmax is the maximum sentence length over all training examples. Bothdependency and position embedding tensors are concatenated to the metric tensor (Eq. (2)) alongthe 3rd dimension prior to every convolution operation. Hence they are shown in steps 4 and6 of Figure 3 for the network with two convolutional layers.

3.2.5 2D Convolution Operation. Unlike the standard 2D convolution typically used in NLPtasks, which takes 2D input, our 2D convolution operates on 3D input commonly seen incomputer vision tasks where colored image data has height, width, and an additional dimensionfor color channel. The goal of the 2D convolution is to pool information within a 3× 3 windowalong the first two dimensions such that metric features and dependency/positional informationof adjacent cells are pooled locally over several layers. However, it is necessary to perform apadded convolution to ensure that dimensions corresponding to word positions are not altered bythe convolution. We denote this padding transformation using the hat accent. That is, for sometensor input X ∈ Rn×n×m, the padded version is X ∈ R(n+2)×(n+2)×m and the zero-paddingexists at the beginning and at the end of the first and second dimensions. Next, we define the2D convolution operation via the ? operator which corresponds to an element-wise product oftwo tensors followed by summation over the products; formally, for two input tensors A and B,A ? B =

∑i

∑j

∑k Ai,j,kBi,j,k.

Now our 2D convolution step is a tensor map fv(X) : Rn×n×u → Rn×n×v with v filters ofsize 3× 3× u, defined as

fv(X)i,j,k =W k ? X[i:i+2][j:j+2][1:u] + bk (3)

for i = 1, . . . , n, j = 1, . . . , n, k = 1, . . . , v,whereW k ∈ R3×3×u for k = 1, . . . , v, and b ∈ Rvare filter and bias variables respectively, and X[i:i+2][j:j+2][1:u] is a 3× 3× u window of X fromi to i+ 2 along the first dimension, j to j + 2 along the second dimension, and 1 to u alongthe final dimension. We show how fv(X) is used to repeatedly pool contextual information inSection 3.2.6. Instead of a 3× 3 window, the convolution operation can be over any t× twindowfor some odd t ≥ 3 where large t values lead to larger parameter spaces and multiplication

8

Page 9: arXiv:1905.07458v4 [cs.CL] 27 Aug 2019extract as entities the spans “Mrs. Tsuruyama”, “Yatsushiro”, “Kumamoto Prefecture”, and “Japan” where “Mrs. Tsuruyama” is

Tran et al. Neural Metric Learning for Fast End-to-End Relation Extraction

operations. The 2D convolution is illustrated in Figure 4 and manifests in steps 5 and 7 ofFigure 3.

3.2.6 Pooling Mechanism. Central to our architecture is the iterative pooling mechanism de-signed so that preliminary decisions are made in early iterations and further corroborated insubsequent iterations. It also facilitates the propagation of local metric and dependency/positionalfeatures to neighboring cells. Let Z be the set of tags for the target task. We denote hyper-parameters κ and λ as the number of channels and the number of CNN layers respectively, whereκ is same hyperparameter previously defined to represent the size of metric-based features. Thepooling layers are defined recursively with base case L1 = relu(fκ(G ‖ D ‖ P )) and

Li =

{relu(fκ(Li−1 ‖ D ‖ P )) 1 < i < λ,

f|Z|(Li−1 ‖ D ‖ P ) i = λ,

where f is the convolution function from Eq. (3),G is the tensor from Eq. (2), and ‖ is the tensorconcatenation operator along the third dimension, and relu(x) = max(0, x) is the linear rectifieractivation function. Here, κ and λ determine the breadth and depth of the architecture. A higherλ corresponds to a larger receptive field when making final predictions. For example at λ = 2,the decision at some cell is informed by its immediate neighbors with a receptive field of 3× 3.However, at λ = 3, decisions are informed by all adjacent neighbors in a 5× 5 window. Thelast layer, Lλ, is the output layer immediately prior to application of the softmax function. Giventhe architecture in Figure 3 with two convolutional layers, the convolve-and-pool operation isapplied twice, indicated as steps 5 and 7 in the figure.

3.2.7 Softmax Output Layer. Given Lλ, we apply the softmax function along the third dimen-sion to obtain a categorical distribution tensor Q ∈ Rn×n×|Z|over output tags Z for each wordposition pair such that Qi,j,k = exp(Lλi,j,k)/(

∑|Z|l=1 exp(L

λi,j,l)), where Qi,j,k is the probability

estimate of the pair of words at position i and j being assigned the kth tag. This constitutes thefinal step 8 of the network (Figure 3). Suppose Y ∈ Rn×n×|Z| represents the corresponding one-hot encoded ground truth along the third dimension such that Yi,j,k ∈ {0, 1}. Then the example-based loss ` is obtained by summing the categorical cross-entropy loss over each cell in the table,normalized by the number of words in the sentence; that is,

`(Y,Q; θ) = − 1

n

n∑i=1

n∑j=1

|Z|∑k=1

Yi,j,k log(Qi,j,k),

where θ is the network parameter set. During training, the loss ` is computed per example andaveraged along the mini-batch dimension.

3.3 Decoding

While we learn concrete tags during training, the process for extracting predictions isslightly more nuanced. Entity spans are straightforwardly extracted by decoding BILOUtags along the diagonal. However, RE is based on “ensembling” the cellular outputs of thetable where entity spans intersect. For entities a and b represented by their starting andending offsets, (aS, aE) and (bS, bE), the relation between them is the label computed asargmax1≤k≤|Z|

∑aEi=aS

∑bEj=bS

Qi,j,k, which indexes a tag in the label space Z .

9

Page 10: arXiv:1905.07458v4 [cs.CL] 27 Aug 2019extract as entities the spans “Mrs. Tsuruyama”, “Yatsushiro”, “Kumamoto Prefecture”, and “Japan” where “Mrs. Tsuruyama” is

Computational Linguistics Volume 1, Number 1

4. Experimental Setup

In this section, we describe the established evaluation method, the datasets used for training andtesting, and the configuration of our model. We note that the computing hardware is controlledacross experiments given we report training and testing run times. Specifically, we used theAmazon AWS EC2 p2.xlarge instance which supports the NVIDIA Tesla K80 GPU with 12GB memory.

4.1 Evaluation Metrics

We use the well-known F1 measure (along with precision and recall) to evaluate NER and REsubtasks as in prior work. For NER, a predicted entity is treated as a true positive if it is exactlymatched to an entity in the groundtruth based on both character offsets and entity type. ForRE, a predicted relation is treated as a true positive if it is exactly matched to a relation in theground truth based on subject/object entities and relation type. As relation extraction performancedirectly subsumes NER performance, we focus purely on relation extraction performance as theprimary evaluation metric of this study.

4.2 Datasets

CoNLL04. We use the dataset originally released by Roth and Yih (2004) with 1441 examplesconsisting of news articles from outlets such as WSJ and AP. The dataset has four entity typesincluding Person, Location, Organization, and Other and five relation types including Live_In,Located_In, OrgBased_In, Work_For, and Kill. We report results based on training/testing on thesame train-test split as established by Gupta, Schütze, and Andrassy (2016); Adel and Schütze(2017); Bekoulis et al. (2018b,a), which consists of 910 training, 243 development, and 288testing instances.

ADE. We also validate our method on the Adverse Drug Events (ADE) dataset from Gurulin-gappa et al. (2012) for extracting drug-related adverse effects from medical text. Here, the onlyentity types are Drug and Disease and the relation extraction task is strictly binary (i.e., Yes/Now.r.t the ADE relation). The examples come from 1644 PubMed abstracts and are divided in twopartitions: the first partition of 6821 sentences contain at least one drug/disease pair while thesecond partition of 16695 sentences contain no drug/disease pairs. As with prior work (Li et al.2016, 2017; Bekoulis et al. 2018b,a), we only use examples from the first partition from which120 relations with nested entity annotations (such as “lithium intoxication” where lithium andlithium intoxication are the drug/disease pair) are removed. Since sentences are duplicated foreach pair of drug/disease mention in the original dataset, when collapsed on unique sentences,the final dataset used in our experiments constitutes 4271 sentences in total. Given there are noofficial train-test splits, we report results based on 10-fold cross-validation, where results arebased on averaging performance across the ten folds, as in prior work.

4.3 Model Configuration

We tuned our model on the CoNLL04 development set; the corresponding configuration of ourmodel (including hyperparameter values) used in our main experiments is shown in Table 1.For the ADE dataset, we used Word2Vec embeddings pretrained on the corpus of PubMedabstracts (Pyysalo et al. 2013). For the CoNLL04 dataset, we used GloVe embeddings pretrainedon Wikipedia and Gigaword (Pennington, Socher, and Manning 2014). All other variables areinitialized using values drawn from a normal distribution with a mean of 0 and standard deviation

10

Page 11: arXiv:1905.07458v4 [cs.CL] 27 Aug 2019extract as entities the spans “Mrs. Tsuruyama”, “Yatsushiro”, “Kumamoto Prefecture”, and “Japan” where “Mrs. Tsuruyama” is

Tran et al. Neural Metric Learning for Fast End-to-End Relation Extraction

Table 1Model configuration as tuned on the CoNLL04 development set.

Setting Value

Optimization Method RMSPropLearning Rate 0.005Dropout Rate 0.5Num. Epochs 100Num. Channels (κ) 15Num. Layers (λ) 8

Setting Value

Character Embedding Size (π) 25Character Representation Size (η) 50Position Embedding Size (γ) 25Dependency Embedding Size (β) 10Word Embedding Size (δ) 200Context Embedding Size (ρ) 200

of 0.1 and further tuned during training. Words were tokenized on both spaces and punctuations;punctuation tokens were kept as is common practice for NER systems. For part-of-speech anddependency parsing, we use the well-known tool spaCy2. For both datasets, we used projectivedependency parses produced from the default pretrained English models. We found that usingmodels pretrained on biomedical text (namely, the GENIA (Kim et al. 2003) corpus) did notimprove performance on the ADE dataset.

Early experiments showed that applying exponential decay to the learning rate in conjunc-tion with batch normalization (Ioffe and Szegedy 2015) is essential for stable/effective learningfor this particular architecture. We apply exponential decay to the learning rate such that it isroughly halved every 10 epochs; concretely, rk = rb

k10 where rb is the base learning rate and rk

is the rate at the kth epoch. We apply dropout (Srivastava et al. 2014) on hi for i = 1, . . . , n asregularization at the earlier layers. However, dropout had a detrimental impact when applied tolater layers. We instead apply batch normalization as a form of regularization on representationsG and Li for i = 1, . . . , λ− 1. We optimize the objective loss using RMSProp (Tieleman andHinton 2012) with a relatively high initial learning rate of 0.005 given exponential decay is used.

5. Results and Discussion

Table 2Results comparing to other methods on the CoNLL04 dataset. We report 95% confidence intervals aroundthe mean F1 over 30 runs for models in the last two rows. Our model was tuned on the CoNLL04development set corresponding to the configuration from Table 1.

Entity Recognition Relation Extraction Avg. Epoch Avg.

Model P (%) R (%) F (%) P (%) R (%) F (%) Train Time Test Time ∗

Table Representation(Miwa and Sasaki 2014) 81.20 80.20 80.70 76.00 50.90 61.00 - -

Multihead (Bekoulis et al. 2018b) 83.75 84.06 83.90 63.75 60.43 62.04 - -

Multihead with AT (Bekoulis et al. 2018a) - - 83.61 - - 61.95 - -

Replicating Multihead with AT (Bekoulis et al. 2018a)† 84.36 85.80 85.07 ± 0.26 65.81 57.59 61.38 ± 0.50 614 sec 34 sec

Relation-Metric (Ours)† 84.46 84.67 84.57 ± 0.29 67.97 58.18 62.68 ± 0.46 101 sec 4.5 sec

† These results are directly comparable given the same train-test splits, pretrained word embeddings, and computing hardware.∗ Average test time is per test set of 288 examples; dependency parsing accounts for approximately 0.5 second of our reported test time.

2 https://spacy.io/

11

Page 12: arXiv:1905.07458v4 [cs.CL] 27 Aug 2019extract as entities the spans “Mrs. Tsuruyama”, “Yatsushiro”, “Kumamoto Prefecture”, and “Japan” where “Mrs. Tsuruyama” is

Computational Linguistics Volume 1, Number 1

Table 3Results comparing to other methods on the ADE dataset. We report the mean performance over 10-foldcross-validation for models in the last two rows. Our model was tuned on the CoNLL04 development setcorresponding to the configuration from Table 1.

Entity Recognition Relation Extraction Avg. Epoch Avg.

Model P (%) R (%) F (%) P (%) R (%) F (%) Train Time Test Time ∗

Neural Joint Model (Li et al. 2016) 79.50 79.60 79.50 64.00 62.90 63.40 - -

Neural Joint Model (Li et al. 2017) 82.70 86.70 84.60 67.50 75.80 71.40 - -

Multihead (Bekoulis et al. 2018b) 84.72 88.16 86.40 72.10 77.24 74.58 - -

Multihead with AT (Bekoulis et al. 2018a) - - 86.73 - - 75.52 - -

Replicating Multihead with AT (Bekoulis et al. 2018a)† 85.76 88.17 86.95 74.43 78.45 76.36 1567 sec 40 sec

Relation-Metric (Ours)† 86.16 88.08 87.11 77.36 77.25 77.29 134 sec 4.5 sec

† These results are directly comparable given the same fixed 10-fold splits, pretrained word embeddings, and computing hardware.∗ Average test time is per test set of 427 examples; dependency parsing accounts for approximately 0.5 second of our reported test time.

We report our main results in Tables 2 and 3 for the CoNLL04 and ADE datasets respec-tively. As a baseline, we replicate the prior best models (Bekoulis et al. 2018a) for both datasetsbased on publicly available source code3. Unlike prior work, which reports performance basedon a single run, we report the 95% confidence interval around the mean F1 based on 30 runs withdiffering seed values for the CoNLL04 dataset. For the ADE dataset, we instead report the meanperformance over 10-fold cross-validation so that results are comparable to established work.These experiments were performed using the same splits, pretrained embeddings, and computinghardware; hence, results are directly comparable.

We make the following observations based on our results from Table 2. Both our model andthe model from Bekoulis et al. (2018a) tend to skew heavily towards precision. However, ourmethod improves on both precision and recall, and by over 1% F1 on relation extraction whereimprovements are statistically significant (p < 0.05) based on the two-tailed Student’s t-test. Wenote that our model performs slightly worse when evaluated purely on NER. We contend thisis a worthwhile trade-off given our model is tuned purely on relation extraction and the relationextraction metric, being end-to-end, indirectly accounts for NER performance. Based on Table 3,when tested on the ADE dataset, our method improves over prior best results by approximately1% F1 for RE on average. While the prior best skews toward recall in this case, our methodexhibits better balance of precision and recall. Based on run time results, we contend that ourmethod is more computationally efficient given training and testing times are nearly seven timeslower on the CoNLL04 and ten times lower on the ADE set when compared to prior efforts. Wenote that dependency parsing accounts approximately one-half second of our testing time. Whiletraining time may not be crucial in most settings, we argue that fast and efficient predictions areimportant for many end-user applications.

As an auxiliary experiment, we tested the potential for integrating adversarial training (AT)with our model; however, there were no performance gains even with extensive tuning. On theCoNLL04 dataset, our method with AT performs at 62.26% F1, compared to 62.68% without AT.On the ADE dataset, our method performs at 76.83% F1 with AT, compared to 77.29% withoutAT. Given this, we have elected not to include AT evaluations as part of our main results.

3 https://github.com/bekou/multihead_joint_entity_relation_extraction

12

Page 13: arXiv:1905.07458v4 [cs.CL] 27 Aug 2019extract as entities the spans “Mrs. Tsuruyama”, “Yatsushiro”, “Kumamoto Prefecture”, and “Japan” where “Mrs. Tsuruyama” is

Tran et al. Neural Metric Learning for Fast End-to-End Relation Extraction

20 30 40 50 60 70 80Number of Epochs

67.5

68.0

68.5

69.0

69.5

70.0

Mea

n F1

Trainable Word EmbeddingsTrainable Word Embeddings with Downscaled GradientsStatic Word Embeddings

Figure 5Mean F1-score (over 10 runs) on CoNLL04 development set with respect to number of training epochs forvarious embedding training strategies.

Comparison with More Prior Efforts. Gupta, Schütze, and Andrassy (2016), Adel and Schütze(2017), and Zhang, Zhang, and Fu (2017) also experimented with the CoNLL04 dataset; however,Gupta, Schütze, and Andrassy (2016) evaluate on a more relaxed evaluation metric for matchingentity bounds while Adel and Schütze (2017) assume entity bounds are known at test time thustreating the NER aspect as a simpler entity classification problem. Of the three studies, resultsfrom Zhang, Zhang, and Fu (2017) are most comparable given they consider entity bounds intheir evaluations; however, their results are based on a random 80%–20% split of the train and testset. As we use established splits based on prior work, the two results are not directly comparable.

5.1 Ablation Analysis

Table 4Ablation studies for relation extraction over the CoNLL04 and ADE dataset; each row after the firstindicates removal of a particular feature/component.

CoNLL04 (Relation) ADE (Relation)Model P (%) R (%) F (%) P (%) R (%) F (%)

Full model 67.97 58.18 62.68 77.36 77.25 77.29– Character-based Input 67.30 52.69 59.09 76.73 76.44 76.58– Dependency Embeddings 66.56 57.69 61.78 75.79 77.16 76.45– Position Embeddings 68.57 57.34 62.43 75.94 76.62 76.27– Pretrained Word Embeddings 62.33 46.09 52.96 72.50 71.41 71.91

We report ablation analysis results in Table 4 using our best model as the baseline. We notethat the model hyperparameters were tuned on the CoNLL04 development set. Character anddependency based features all had a notable impact on performance for either dataset. On thehand, while position embeddings had a positive effect on the ADE dataset, performance gainswere negligible when testing on CoNLL04. For the CoNLL04 dataset, we find that characterbased features had little effect on precision while improving recall substantially.

Unsurprisingly, pretrained word embeddings had the greatest impact on performance interms of both precision and recall. Early experiments showed that, unlike models from priorwork that used static word embeddings (Li et al. 2017; Bekoulis et al. 2018a), our model benefits

13

Page 14: arXiv:1905.07458v4 [cs.CL] 27 Aug 2019extract as entities the spans “Mrs. Tsuruyama”, “Yatsushiro”, “Kumamoto Prefecture”, and “Japan” where “Mrs. Tsuruyama” is

Computational Linguistics Volume 1, Number 1

from trainable word embeddings as shown in Figure 5. Here, trainable word embeddings withdownscaled gradients refer to reducing the gradient of word embeddings by a factor of 10 at eachtraining step.

5.2 Error Analyses

In this section, we first perform a class based analysis where performance variations for differentclasses of examples are examined. Then, a more in-depth error analysis is performed for inter-esting example cases. The class based analyses entail partitioning examples by length, entitydistance, and relation type and are covered in Section 5.2.1. The more in-depth example basedanalysis is discussed in Section 5.2.2.

5.2.1 Class based analyses. Long sentences are a natural source of difficulty for relationextraction models given the potential for long-term dependencies. In this section, we performstraightforward analysis by conducting experiments to assess model performance with respect toincreasing sentence length. For this experiment, we train a single model using 80% of the datasetwith 20% held out for testing. For some sentence length limit k, we evaluate on a subset of theoverall test set that includes only examples with a sentence length that is less than or equal to k.

60

65

70

75

80

85

90

Perfo

rman

ce (%

)

ENTITY FENTITY RENTITY P

RELATION FRELATION RRELATION P

10 20 30 40 50 60 70 80Maximum sentence length

100200300

# of

Exa

mpl

es

Figure 6CoNLL04: Entity and relation extractionperformance with respective to change inmaximum sentence length.

70

75

80

85

90

Perfo

rman

ce (%

)

ENTITY FENTITY RENTITY P

RELATION FRELATION RRELATION P

10 20 30 40 50 60 70 80Maximum sentence length

250500750

# of

Exa

mpl

es

Figure 7ADE: Entity and relation extraction performancewith respect to change in maximum sentencelength.

Results from these experiments are plotted in Figures 6 and 7, for the CoNLL04 and ADEdatasets respectively, such that k is varied along the horizontal x-axis. The top graph displaysperformance, while the bottom graph plots the number of examples with sentence length less thanor equal to k that are used for evaluation. As shown, performances for both NER and RE tendto decline as longer sentences are added to the evaluation set. Unsurprisingly, relation extractionis more susceptible to long sentences compared to entity recognition. While there is a decline inboth relation extraction precision and recall, we note that recall drops at a faster rate with respectto maximum sentence length and this phenomenon is apparent for both datasets.

In addition to length-based analysis, we also conducted experiments to study the variation inrelation extraction performance with respect to the distance between subject and object entitiesas shown in Table 5. We measure distance by computing the absolute character offset between

14

Page 15: arXiv:1905.07458v4 [cs.CL] 27 Aug 2019extract as entities the spans “Mrs. Tsuruyama”, “Yatsushiro”, “Kumamoto Prefecture”, and “Japan” where “Mrs. Tsuruyama” is

Tran et al. Neural Metric Learning for Fast End-to-End Relation Extraction

Table 5Relation extraction performance partitioned based on “Entity Distance”, which is defined as the number ofcharacters separating the subject and object entities (i.e., absolute character offset).

CoNLL04 (Relation) ADE (Relation)Entity Distance # of Examples P (%) R (%) F (%) # of Examples P (%) R (%) F (%)

0 — 20 207 83.7 43.80 57.51 447 88.50 42.02 56.9820 — 40 51 59.09 24.07 34.21 265 77.17 35.51 48.6440 — 60 43 80.00 18.60 30.19 181 78.72 37.00 50.3460 — 80 22 100.00 25.93 41.18 125 82.35 29.58 43.5280 — 100 13 100.00 15.38 26.67 91 85.00 34.00 48.57

the last character of the first occurring entity and first character of the second occurring entity,which is henceforth simply referred to as “entity distance.” Our results show that, at least onthe CoNLL04 dataset, notable performance differences occur at the boundary cases; i.e., veryshort range relations (0-20 entity distance) tend to be easier and very long range relations (80-100 entity distance) tend to be harder (mostly due to changes in recall). For the ADE dataset,performance is similar across all partitions of entity distances. This is surprising, as sentencelength appears to have a more notable impact on relation extraction performance than entitydistance for this particular architecture.

Table 6Relation extraction performance on the CoNLL04 dataset partitioned based on relation type.

CoNLL04 (Relation)Relation Type # of Examples Avg. Entity Distance P (%) R (%) F (%)

Kill 46 47 81.25 82.98 82.11Live_In 82 37 71.76 61.00 65.95Located_In 58 28 80.77 44.68 57.53Work_For 65 24 60.56 56.58 58.50OrgBased_In 70 29 91.38 50.48 65.03

Table 6 shows variance in performance when examined by relation type. Here, we see thatperformance depends heavily on the type of relation being extracted; our model exhibits muchhigher accuracy on the Kill relation at 80% F1, with Located_In and Work_For being the mostdifficult with performance below 60% F1. These results further corroborate our analysis based onTable 5 that entity distance does not correlate with example difficulty given that the Kill relation,being the easiest relation to extract, occurs with the highest average entity distance.

5.2.2 Example based analysis. A common source of difficulty that occurs is ambiguity withrespect to expression of the Live_In and Work_In relation types. For example, consider the sen-tence “After buying the shawl for $1,600, Darryl Breniser of Blue Ball, said the approximately2-by-5 foot shawl was worth the money.” The ground truth relation is (Darryl Breniser, Live_In,Blue Ball) which indicates that “Blue Ball” is in fact a location. However, it is difficult to assesswhether “Blue Ball” is a location or company based on the context alone and without broadergeographical knowledge (even for humans). Our model predicted (Darryl Breniser, Work_For,

15

Page 16: arXiv:1905.07458v4 [cs.CL] 27 Aug 2019extract as entities the spans “Mrs. Tsuruyama”, “Yatsushiro”, “Kumamoto Prefecture”, and “Japan” where “Mrs. Tsuruyama” is

Computational Linguistics Volume 1, Number 1

Blue Ball) in this case. We observe a similar pattern in the following case: “Santa Monica artistTom Van Sant said Monday after the 23-foot-tall statue was found crushed and broken in pieces.”;here, we see the same phenomenon where our model mistakes (Tom Van Sant, Live_In, SantaMonica) for (Tom Van Sant, Work_For, Santa Monica). Finally, we present the most interestingexample of this type of ambiguity in the sentence: “‘Temperatures didn’t get too low, but the windchill was bad’, said Bingham County Sheriff’s Lt. Bill Gordon.” Here, the ground truth indicatesthat the only relation to be extracted is (Bill Gordon, Live_In, Bingham County); however, ourmodel extracts (Bill Gordon, Work_For, Bingham County Sheriff), which is also technically avalid relation. Such cases present ambiguities that are also difficult for human annotators; here,imbuing the NER component with external knowledge or learning based on a broader level ofcontext may alleviate these types of errors.

Inconsistencies in the way entities are annotated can also cause issues when it comes todemarcating names that are accompanied with honorifics or titles. For example, some groundtruth annotations will include the title, such as “President Park Chung-hee” or “Sen. Bob Dole”,and other cases will leave out the title, such as “Kennedy” instead of “President Kennedy.” Thesetruth annotations are inconsistent and present a source of difficulty for the model during trainingand testing. For example, “Navy spokeswoman Lt. Nettie Johnson was unable to say immediatelywhether the aircraft had experienced problems from faulty check and drain valves.” Here, ourmodel extracted (Lt. Nettie Johnson, Work_For, Navy), while the groundtruth is (Nettie Johnson,Work_For, Navy) — while both are technically correctly, the extremely precise nature of theevaluation metric causes this prediction to be considered a false positive.

We also see such issues with annotation at the relation extraction stage; for example, considerthe sentence “In 1964, a jury in Dallas found Jack Ruby guilty of murdering Lee Harvey Oswald,the accused assassin of President Kennedy.” Figure 8 shows the internal activity of the networkas it attempts to extract entities and relations from this particular example. Here, the ground truthannotation includes (Lee Harvey Oswald, Kills, President Kennedy), which our model fails torecognize; we instead obtain the prediction (Jack Ruby, Kills, Lee Harvey Oswald) which is avalid relation missed by the ground truth. In fact, it can be argued that the latter relation is astronger manifested of the “Kill” relation based on the linguistic context as evidenced by thetrigger phrase “found [..] guilty of murdering”. We note that our model is able to detect (LeeHarvey Oswald, Kills, President Kennedy) as shown in the center-bottom heatmap of Figure 8;however, signals were not strong enough to warrant a concrete extraction of the relation.

In the ADE dataset, we mostly observe issues with entity recognition where boundariesof noun phrases are not properly recognized. Modifier phrases are sometimes not predicted aspart of the named entity, for example: “protracted neuromuscular block” instead extracted as“neuromuscular block”, and “generalized mite infestation” instead extracted as simply “miteinfestation.” The nature of the data results in especially long named entities that are oftenentire noun or verb phrases which can be difficult to delimit. For example, consider thefollowing case: “DISCUSSION: Central nervous system (CNS) toxicity has been describedwith ifosfamide, with most cases reported in the pediatric population.” Here, instead of ex-tracting (Central nervous system (CNS) toxicity, ifosfamide) as the relation pair, our modelpredicts (Central nervous system, ifosfamide) and (CNS, ifosfamide). Essentially, long entityphrases are often not recognized in their entirety, and broken down into segments whereeach segment is independently involved in a relation. In this particular case, this error inprediction lead to one false negative and two false positives. This phenomenon occurs fre-quently with coordinated noun phrases which present a nontrivial challenge. For example,“Growth and adrenal suppression in asthmatic children treated with high-dose fluticasone pro-pionate.” is annotated with “Growth and adrenal suppression” as a singular entity, while ourmodel falsely recognizes it as two entities “Growth” and “adrenal suppression.” We see similaroutcomes for the sentence: “Generalized maculopapular and papular purpuric eruptions are per-

16

Page 17: arXiv:1905.07458v4 [cs.CL] 27 Aug 2019extract as entities the spans “Mrs. Tsuruyama”, “Yatsushiro”, “Kumamoto Prefecture”, and “Japan” where “Mrs. Tsuruyama” is

Tran et al. Neural Metric Learning for Fast End-to-End Relation Extraction

In1964

,a

juryin

Dallasfound

JackRubyguilty

ofmurdering

LeeHarveyOswald

,the

accusedassassin

ofPresidentKennedy

.

L1 L2 L3

In1964

,a

juryin

Dallasfound

JackRubyguilty

ofmurdering

LeeHarveyOswald

,the

accusedassassin

ofPresidentKennedy

.

L4 L5 L6

In19

64 , ajur

y inDalla

sfou

ndJack

Ruby

guilty of

murderi

ngLeeHarv

ey

Oswald

,the

accuse

d

assass

in of

Presid

ent

Kenn

edy .

In1964

,a

juryin

Dallasfound

JackRubyguilty

ofmurdering

LeeHarveyOswald

,the

accusedassassin

ofPresidentKennedy

.

L7

In19

64 , ajur

y inDalla

sfou

ndJack

Ruby

guilty of

murderi

ngLeeHarv

ey

Oswald

,the

accuse

d

assass

in of

Presid

ent

Kenn

edy .

Prediction (L8)

In19

64 , ajur

y inDalla

sfou

ndJack

Ruby

guilty of

murderi

ngLeeHarv

ey

Oswald

,the

accuse

d

assass

in of

Presid

ent

Kenn

edy .

Ground Truth

0.0

0.2

0.4

0.6

0.8

1.0

Figure 8Visualization of activity of pooling layers at various depths (Li for i = 1, . . . , λ), as tabular heatmaps, fora network with a depth of λ = 8 given the following input sentence: “In 1964, a jury in Dallas found JackRuby guilty of murdering Lee Harvey Oswald, the accused assassin of President Kennedy.” Here, wemeasure activity by sum-pooling the activations along the channel dimension of each hiddenrepresentation. For the prediction activity, we simply max-pool probabilities along the relation dimensionthus ignoring the exact type of entity or relation.

haps the most common thionamide-induced reactions.” Such cases occur frequently which wesuspect are a major source of hampered precision given the increased number of false positivesfor each predictive mistake.

6. Conclusion

In this study, we introduced a novel neural architecture that combines the ideas of metriclearning and convolutional neural networks to tackle the highly challenging problem of end-to-end relation extraction. Our method is able to simultaneously and efficiently recognize entityboundaries, the type of each entity, and the relationships among them. It achieves this by learningintermediate table representations by pooling local metric, dependency, and position informationvia repeated application of the 2D convolution. For end-to-end relation extraction, this approach

17

Page 18: arXiv:1905.07458v4 [cs.CL] 27 Aug 2019extract as entities the spans “Mrs. Tsuruyama”, “Yatsushiro”, “Kumamoto Prefecture”, and “Japan” where “Mrs. Tsuruyama” is

Computational Linguistics Volume 1, Number 1

improves over the state-of-the-art across two datasets from different domains with statisticallysignificant results based on examining average performance of repeated runs. Moreover, theproposed architecture operates at substantially reduced training and testing times with testingtimes that are seven to ten times faster, the latter important for many user-end applications. Wealso perform extensive error analysis and show that our network can be visually analyzed byobserving the hidden pooling activity leading to preliminary or intermediate decisions. Currently,the architecture is designed for extracting relations involving two entities and occur withinsentence bounds; handling n-ary relations and exploring document-level extraction involvingcross-sentence relations will be the focus of future work.

ReferencesAdel, Heike and Hinrich Schütze. 2017. Global

normalization of convolutional neuralnetworks for joint entity and relationclassification. In Proceedings of the 2017Conference on Empirical Methods in NaturalLanguage Processing (EMNLP), pages1723–1729.

Airola, Antti, Sampo Pyysalo, Jari Björne, TapioPahikkala, Filip Ginter, and Tapio Salakoski.2008. All-paths graph kernel forprotein-protein interaction extraction withevaluation of cross-corpus learning. BMCbioinformatics, 9(11):S2.

Bekoulis, Giannis, Johannes Deleu, ThomasDemeester, and Chris Develder. 2018a.Adversarial training for multi-context jointentity and relation extraction. In Proceedingsof the 2018 Conference on Empirical Methodsin Natural Language Processing (EMNLP),pages 2830–2836.

Bekoulis, Giannis, Johannes Deleu, ThomasDemeester, and Chris Develder. 2018b. Jointentity recognition and relation extraction as amulti-head selection problem. Expert Systemswith Applications, 114:34–45.

Bunescu, Razvan C and Raymond J Mooney.2005. A shortest path dependency kernel forrelation extraction. In Proceedings of theconference on human language technology andempirical methods in natural languageprocessing, pages 724–731, Association forComputational Linguistics.

Chiu, Jason PC and Eric Nichols. 2016. Namedentity recognition with bidirectionalLSTM-CNNs. Transactions of the Associationfor Computational Linguistics, 4:357–370.

Choi, Yejin, Eric Breck, and Claire Cardie. 2006.Joint extraction of entities and relations foropinion recognition. In Proceedings of the2006 Conference on Empirical Methods inNatural Language Processing, pages 431–439.

Collins, Michael. 2002. Discriminative trainingmethods for hidden markov models: Theoryand experiments with perceptron algorithms.In Proceedings of the ACL-02 conference onEmpirical methods in natural language

processing-Volume 10, pages 1–8.Frunza, Oana, Diana Inkpen, and Thomas Tran.

2011. A machine learning approach foridentifying disease-treatment relations in shorttexts. IEEE Transactions on Knowledge andData Engineering, 23(6):801–814.

Fundel, Katrin, Robert Küffner, and Ralf Zimmer.2007. RelEx - relation extraction usingdependency parse trees. Bioinformatics,23(3):365–371.

Gupta, Pankaj, Hinrich Schütze, and BerntAndrassy. 2016. Table filling multi-taskrecurrent neural network for joint entity andrelation extraction. In Proceedings of COLING2016, the 26th International Conference onComputational Linguistics: Technical Papers,pages 2537–2547.

Gurulingappa, Harsha, Abdul Mateen Rajput,Angus Roberts, Juliane Fluck, MartinHofmann-Apitius, and Luca Toldo. 2012.Development of a benchmark corpus tosupport the automatic extraction ofdrug-related adverse effects from medical casereports. Journal of biomedical informatics,45(5):885–892.

Ioffe, Sergey and Christian Szegedy. 2015. Batchnormalization: Accelerating deep networktraining by reducing internal covariate shift. InProceedings of the 32nd InternationalConference on Machine Learning (ICML2015), pages 448–456.

Jurafsky, Daniel and James H Martin. 2008.Speech and language processing (prentice hallseries in artificial intelligence).

Kate, Rohit J and Raymond J Mooney. 2010.Joint entity and relation extraction usingcard-pyramid parsing. In Proceedings of theFourteenth Conference on ComputationalNatural Language Learning (CoNLL 2010),pages 203–212.

Katiyar, Arzoo and Claire Cardie. 2017. Goingout on a limb: Joint extraction of entitymentions and relations without dependencytrees. In Proceedings of the 55th AnnualMeeting of the Association for ComputationalLinguistics (Volume 1: Long Papers),volume 1, pages 917–928.

18

Page 19: arXiv:1905.07458v4 [cs.CL] 27 Aug 2019extract as entities the spans “Mrs. Tsuruyama”, “Yatsushiro”, “Kumamoto Prefecture”, and “Japan” where “Mrs. Tsuruyama” is

Tran et al. Neural Metric Learning for Fast End-to-End Relation Extraction

Kavuluru, Ramakanth, Anthony Rios, and TungTran. 2017. Extracting drug-drug interactionswith word and character-level recurrent neuralnetworks. In Fifth IEEE InternationalConference on Healthcare Informatics (ICHI),pages 5–12, IEEE.

Kim, J-D, Tomoko Ohta, Yuka Tateisi, andJunâAZichi Tsujii. 2003. Genia corpusâATasemantically annotated corpus forbio-textmining. Bioinformatics,19(suppl_1):i180–i182.

Li, Fei, Meishan Zhang, Guohong Fu, andDonghong Ji. 2017. A neural joint model forentity and relation extraction from biomedicaltext. BMC bioinformatics, 18(1):198.

Li, Fei, Yue Zhang, Meishan Zhang, andDonghong Ji. 2016. Joint models for extractingadverse drug events from biomedical text. InProceedings of the Twenty-Fifth InternationalJoint Conference on Artificial Intelligence(IJCAI 2015), volume 2016, pages 2838–2844.

Li, Jiexun, Zhu Zhang, Xin Li, and HsinchunChen. 2008. Kernel-based learning forbiomedical relation extraction. Journal of theAssociation for Information Science andTechnology, 59(5):756–769.

Li, Qi and Heng Ji. 2014. Incremental jointextraction of entity mentions and relations. InProceedings of the 52nd Annual Meeting of theAssociation for Computational Linguistics(Volume 1: Long Papers), volume 1, pages402–412.

Liu, Shengyu, Kai Chen, Qingcai Chen, andBuzhou Tang. 2016. Dependency-basedconvolutional neural network for drug-druginteraction extraction. In 2016 IEEEInternational Conference on Bioinformaticsand Biomedicine (BIBM), pages 1074–1080,IEEE.

Luong, Minh-Thang, Hieu Pham, andChristopher D Manning. 2015. Effectiveapproaches to attention-based neural machinetranslation. In Proceedings of the 2015Conference on Empirical Methods in NaturalLanguage Processing (EMNLP), pages1412–1421.

Miwa, Makoto and Mohit Bansal. 2016.End-to-end relation extraction using LSTMson sequences and tree structures. InProceedings of the 54th Annual Meeting of theAssociation for Computational Linguistics(Volume 1: Long Papers), volume 1, pages1105–1116.

Miwa, Makoto and Yutaka Sasaki. 2014.Modeling joint entity and relation extractionwith table representation. In Proceedings ofthe 2014 Conference on Empirical Methods inNatural Language Processing (EMNLP),pages 1858–1869.

Özgür, Arzucan, Thuy Vu, Günes Erkan, andDragomir R Radev. 2008. Identifyinggene-disease associations using centrality on aliterature mined gene-interaction network.Bioinformatics, 24(13):i277–i285.

Pawar, Sachin, Pushpak Bhattacharyya, andGirish Palshikar. 2017. End-to-end relationextraction using neural networks and markovlogic networks. In Proceedings of the 15thConference of the European Chapter of theAssociation for Computational Linguistics:Volume 1, Long Papers, volume 1, pages818–827.

Pennington, Jeffrey, Richard Socher, andChristopher Manning. 2014. Glove: Globalvectors for word representation. InProceedings of the 2014 conference onempirical methods in natural languageprocessing (EMNLP), pages 1532–1543.

Pyysalo, Sampo, Filip Ginter, Hans Moen, TapioSalakoski, and Sophia Ananiadou. 2013.Distributional semantics resources forbiomedical text processing. In Proceedings of5th International Symposium on Languages inBiology and Medicine, pages 39–44.

Qian, Longhua, Guodong Zhou, Fang Kong,Qiaoming Zhu, and Peide Qian. 2008.Exploiting constituent dependencies for treekernel-based semantic relation extraction. InProceedings of the 22nd InternationalConference on Computational Linguistics,volume 1, pages 697–704, Association forComputational Linguistics.

Raj, Desh, SUNIL SAHU, and Ashish Anand.2017. Learning local and global contexts usinga convolutional recurrent network model forrelation classification in biomedical text. InProceedings of the 21st Conference onComputational Natural Language Learning(CoNLL 2017), pages 311–321.

Ratinov, Lev and Dan Roth. 2009. Designchallenges and misconceptions in named entityrecognition. In Proceedings of the ThirteenthConference on Computational NaturalLanguage Learning, pages 147–155,Association for Computational Linguistics.

Rink, Bryan, Sanda Harabagiu, and Kirk Roberts.2011. Automatic extraction of relationsbetween medical concepts in clinical texts.Journal of the American Medical InformaticsAssociation, 18(5):594–600.

Roth, Dan and Wen-tau Yih. 2004. A linearprogramming formulation for global inferencein natural language tasks. In Proceedings of theAnnual Conference on Computational NaturalLanguage Learning (CoNLL), pages 1–8.

Singh, Sameer, Sebastian Riedel, Brian Martin,Jiaping Zheng, and Andrew McCallum. 2013.Joint inference of entities, relations, and

19

Page 20: arXiv:1905.07458v4 [cs.CL] 27 Aug 2019extract as entities the spans “Mrs. Tsuruyama”, “Yatsushiro”, “Kumamoto Prefecture”, and “Japan” where “Mrs. Tsuruyama” is

Computational Linguistics Volume 1, Number 1

coreference. In Proceedings of the 2013workshop on Automated knowledge baseconstruction, pages 1–6.

Srivastava, Nitish, Geoffrey Hinton, AlexKrizhevsky, Ilya Sutskever, and RuslanSalakhutdinov. 2014. Dropout: A simple wayto prevent neural networks from overfitting.The Journal of Machine Learning Research,15(1):1929–1958.

Tieleman, Tijmen and Geoffrey Hinton. 2012.Lecture 6.5-rmsprop: Divide the gradient by arunning average of its recent magnitude.COURSERA: Neural Networks for MachineLearning, 4(2).

Yu, Xiaofeng and Wai Lam. 2010. Jointlyidentifying entities and extracting relations inencyclopedia text via a graphical modelapproach. In Proceedings of the 23rdInternational Conference on ComputationalLinguistics: Posters, pages 1399–1407.

Zeng, Daojian, Kang Liu, Siwei Lai, GuangyouZhou, Jun Zhao, et al. 2014. Relationclassification via convolutional deep neuralnetwork. In Proceedings of the 25thInternational Conference on ComputationalLinguistics: Technical Papers (COLING 2014),pages 2335–2344.

Zeng, Xiangrong, Daojian Zeng, Shizhu He,Kang Liu, and Jun Zhao. 2018. Extractingrelational facts by an end-to-end neural modelwith copy mechanism. In Proceedings of the56th Annual Meeting of the Association forComputational Linguistics (Volume 1: LongPapers), volume 1, pages 506–514.

Zhang, Meishan, Yue Zhang, and Guohong Fu.2017. End-to-end neural relation extractionwith global optimization. In Proceedings ofthe 2017 Conference on Empirical Methods inNatural Language Processing, pages1730–1740.

Zhang, Yuhao, Peng Qi, and Christopher DManning. 2018. Graph convolution overpruned dependency trees improves relationextraction. In Proceedings of the 2018Conference on Empirical Methods in NaturalLanguage Processing.

Zheng, Suncong, Yuexing Hao, Dongyuan Lu,Hongyun Bao, Jiaming Xu, Hongwei Hao, andBo Xu. 2017a. Joint entity and relationextraction based on a hybrid neural network.Neurocomputing, 257:59–66.

Zheng, Suncong, Feng Wang, Hongyun Bao,Yuexing Hao, Peng Zhou, and Bo Xu. 2017b.Joint extraction of entities and relations basedon a novel tagging scheme. In Proceedings ofthe 55th Annual Meeting of the Association forComputational Linguistics (Volume 1: LongPapers), pages 1227–1236.

20