
Document-Level N-ary Relation Extraction with Multiscale Representation Learning

Robin Jia1∗ Cliff Wong2 Hoifung Poon2

1 Stanford University, Stanford, California, USA
2 Microsoft Research, Redmond, Washington, USA

[email protected], {cliff.wong,hoifung}@microsoft.com

Abstract

Most information extraction methods focus on binary relations expressed within single sentences. In high-value domains, however, n-ary relations are in great demand (e.g., drug-gene-mutation interactions in precision oncology). Such relations often involve entity mentions that are far apart in the document, yet existing work on cross-sentence relation extraction is generally confined to small text spans (e.g., three consecutive sentences), which severely limits recall. In this paper, we propose a novel multiscale neural architecture for document-level n-ary relation extraction. Our system combines representations learned over various text spans throughout the document and across the subrelation hierarchy. Widening the system’s purview to the entire document maximizes potential recall. Moreover, by integrating weak signals across the document, multiscale modeling increases precision, even in the presence of noisy labels from distant supervision. Experiments on biomedical machine reading show that our approach substantially outperforms previous n-ary relation extraction methods.

1 Introduction

Knowledge acquisition is a perennial challenge in AI. In high-value domains, it has acquired new urgency in recent years due to the advent of big data. For example, the dramatic drop in genome sequencing cost has created unprecedented opportunities for tailoring cancer treatment to a tumor’s genetic composition (Bahcall, 2015). Despite this potential, operationalizing personalized medicine is difficult, in part because it requires painstaking curation of precision oncology knowledge from the biomedical literature. With tens of millions of papers on PubMed, and thousands more added every day,1 we are sorely in need of automated methods to accelerate manual curation.

∗ Work done as an intern at Microsoft Research.

“We next expressed ALK F1174L, ALK F1174L/L1198P, ALK F1174L/G1123S, and ALK F1174L/G1123D in the original SH-SY5Y cell line.”

(. . . 15 sentences spanning 3 paragraphs . . . )

“The 2 mutations that were only found in the neuroblastoma resistance screen (G1123S/D) are located in the glycine-rich loop, which is known to be crucial for ATP and ligand binding and are the first mutations described that induce resistance to TAE684, but not to PF02341066.”

Figure 1: Two examples of drug-gene-mutation relations from a biomedical journal paper. The relations are expressed across multiple paragraphs, requiring document-level extraction.

Prior work in machine reading has made great strides in sentence-level binary relation extraction. However, generalizing extraction to n-ary relations poses new challenges. Higher-order relations often involve entity mentions that are far away in the document. Recent work on n-ary relation extraction has begun to explore cross-sentence extraction (Peng et al., 2017; Wang and Poon, 2018), but the scope is still confined to short text spans (e.g., three consecutive sentences), even though a document may contain hundreds of sentences and tens of thousands of words. While this already increases the yield compared to sentence-level extraction, it still misses many relations. For example, in Figure 1, the drug-gene-mutation relations between PF02341066, ALK, and G1123S(D) (PF02341066 can treat cancers with mutation G1123S(D) in gene ALK) can only be extracted by substantially expanding the scope. High-value information, such as the latest medical findings, might only be mentioned once in the corpus. Maximizing recall is thus of paramount importance.

1 ncbi.nlm.nih.gov/pubmed

In this paper, we propose a novel multiscale neural architecture for document-level n-ary relation extraction. By expanding the extraction scope to the entire document, rather than restricting relation candidates to co-occurring entities in a short text span, we ensure maximum potential recall. To combat the ensuing difficulties in document-level extraction, such as low precision, we introduce multiscale learning, which combines representations learned over text spans of varying scales and for various subrelations (Figure 2). This approach deviates from past methods in several key regards.

First, we adopt an entity-centric formulation by making a single prediction for each entity tuple occurring in a document. Previous n-ary relation extraction methods typically classify individual mention tuples, but this approach scales poorly to whole documents. Since each entity can be mentioned many times in the same document, applying mention-level methods leads to a combinatorial explosion of mention tuples. This creates not only computational challenges but also learning challenges, as the vast majority of these tuples do not express the relation. Our entity-centric formulation alleviates both of these problems.

Second, for each candidate tuple, prior methods typically take as input the contiguous text span encompassing the mentions. For document-level extraction, the resulting text span could become untenably large, even though most of it is unrelated to the relation of interest. Instead, we allow discontiguous input formed by multiple discourse units (e.g., sentences or paragraphs) containing the given entity mentions.

Finally, while an n-ary relation might not reside within a single discourse unit, its subrelations might. In Figure 1, the paper first mentions a gene-mutation subrelation, then discusses a drug-mutation subrelation in a later paragraph. By including subrelations in our modeling, we can predict n-ary relations even when all n entities never co-occur in the same discourse unit.

With multiscale learning, we turn the document view from a challenge into an advantage by combining weak signals across text spans and subrelations. Following recent work in cross-sentence relation extraction, we conduct a thorough evaluation in biomedical machine reading. Our approach substantially outperforms prior n-ary relation extraction methods, attaining state-of-the-art results on a large benchmark dataset recently released by a major cancer center. Ablation studies show that multiscale modeling is the key to these gains.2

2 Document-Level N-ary Relation Extraction

Prior work on relation extraction typically formulates it as a mention-level classification problem. Let e1, . . . , en be entity mentions that co-occur in a text span T. Relation extraction amounts to classifying whether a relation R holds for e1, . . . , en in T. For the well-studied case of binary relations within single sentences, n = 2 and T is a sentence.

In high-value domains, however, there is increasing demand for document-level n-ary relation extraction, where n > 2 and T is a full document that may contain hundreds of sentences. For example, a molecular tumor board needs to know if a drug is relevant for treating cancer patients with a certain mutation in a given gene. We can help the tumor board by extracting such ternary interactions from biomedical articles. The mention-centric view of relation extraction does not scale well to this general setting. Each of the n entities may be mentioned many times in a document, resulting in a large number of candidate mention tuples, even though the vast majority of them are irrelevant to the extraction task.

In this paper, we adopt an entity-centric formulation for document-level n-ary relation extraction. We use upper case for entities (E1, · · · , En) and lower case for mentions (e1, · · · , en). We define an n-ary relation candidate to be an (n + 1)-tuple (E1, . . . , En, T), where each entity Ei is mentioned at least once in the text span T. The relation extraction model is given a candidate (E1, . . . , En, T) and outputs whether or not the tuple expresses the relation R.3 Deciding what information to use from the various entity mentions within T is now a modeling question, which we address in the next section.
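The contrast between the mention-centric and entity-centric formulations can be sketched in a few lines of Python. The entity names below come from the Figure 1 example, but the mention positions and counts are invented for illustration:

```python
from itertools import product

def mention_tuples(mentions):
    """Mention-centric view: every cross-product of per-entity mention lists.
    mentions: dict mapping entity -> list of mention positions in the document."""
    return list(product(*mentions.values()))

def entity_candidate(mentions, doc_id):
    """Entity-centric view: a single (E1, ..., En, T) candidate per document,
    provided every entity is mentioned at least once."""
    if all(len(m) > 0 for m in mentions.values()):
        return tuple(mentions.keys()) + (doc_id,)
    return None

# A document mentioning a drug 4 times, a gene 6 times, and a mutation 3 times.
mentions = {
    "PF02341066": [5, 40, 51, 88],
    "ALK": [1, 2, 5, 40, 60, 88],
    "G1123S": [40, 60, 88],
}
print(len(mention_tuples(mentions)))       # 72 mention tuples (4 * 6 * 3)
print(entity_candidate(mentions, "doc1"))  # one entity-level candidate
```

Even at this toy scale, mention-level classification must score 72 tuples for a single fact, while the entity-centric formulation yields exactly one candidate to classify.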

3 Our Approach: Multiscale Representation Learning

We present a general framework for document-level n-ary relation extraction using multiscale

2 Our code and data will be available at hanover.azurewebsites.net

3 It is easy to extend our approach to situations where k mutually exclusive relations R1, . . . , Rk must be distinguished, resulting in a (k + 1)-way classification problem.


[Figure 2 schematic. The “(1) Input text” panel shows a masked document in which only the entity mentions gefitinib, EGFR, and T790M are visible; the remaining panels are labeled “(2) Mention-level Representations”, “(3) Entity-level Representations”, and “(4) Final prediction”.]

Figure 2: Multiscale representation learning for document-level n-ary relation extraction, an entity-centric approach that combines mention-level representations learned across text spans and the subrelation hierarchy. (1) Entity mentions (red, green, blue) are identified from text, and mentions that co-occur within a discourse unit (e.g., paragraph) are isolated. (2) Within each discourse unit, mention-level representations are computed for each tuple of entity mentions. These representations may correspond to the entire n-ary relation (white) or subrelations over subsets of entities (magenta, yellow, cyan). (3) At the document scale, mention-level representations for both the n-ary relation and its subrelations are combined into entity-level representations. (4) Entity-level representations are used to predict the relation.

representation learning. Given a document with text T and entities E1, . . . , En, we first build mention-level representations for groups of these entities whenever they co-occur within the same discourse unit. We then aggregate these representations across the whole document, yielding entity-level representations for each subset of entities. Finally, we predict whether E1, . . . , En participate in the relation based on the concatenation of these entity-level representations. These steps are depicted in Figure 2.

3.1 Mention-level representation

Let the full document T be composed of discourse units T1, . . . , Tm (e.g., different paragraphs). Let Tj be one such discourse unit, and suppose e1, . . . , en are entity mentions of E1, . . . , En that co-occur in Tj. We construct a contextualized representation for the mention tuple (e1, . . . , en) in Tj. In this paper, we use a standard approach: we apply a bi-directional LSTM (BiLSTM) to Tj, concatenate the hidden states for each mention, and feed the result through a single-layer neural network. We denote the resulting vector as r(R, e1, . . . , en, Tj) for the relation R.
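A minimal numpy sketch of this step, assuming the BiLSTM hidden states over Tj have already been computed. The shapes, the tanh nonlinearity, and the random weights are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def mention_representation(hidden_states, mention_positions, W, b):
    """Build a mention-tuple representation: concatenate the BiLSTM hidden
    state at each mention's position, then apply a single-layer network.
    hidden_states: (seq_len, 2h) array, assumed to be BiLSTM output over Tj.
    mention_positions: one token index per entity mention e1, ..., en."""
    concat = np.concatenate([hidden_states[p] for p in mention_positions])
    return np.tanh(W @ concat + b)  # r(R, e1, ..., en, Tj)

seq_len, h, d = 30, 8, 16                 # toy sizes
H = rng.standard_normal((seq_len, 2 * h))  # stand-in for BiLSTM states
W = rng.standard_normal((d, 3 * 2 * h))    # 3 mentions: drug, gene, mutation
b = np.zeros(d)
r = mention_representation(H, [4, 11, 27], W, b)
print(r.shape)  # (16,)
```

In practice H would come from a trained BiLSTM and W, b would be learned end-to-end; the sketch only shows the concatenate-then-project wiring.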

3.2 Entity-level representation

Let M(R, E1, . . . , En, T) denote the set of all mention tuples (e1, . . . , en) and discourse units Tj within T such that each ei appears in Tj. We can create an entity-level representation r(R, E1, . . . , En, T) of the n entities by combining mention-level representations using an aggregation operator C:

C_{(e1, . . . , en, Tj) ∈ M(R, E1, . . . , En, T)} r(R, e1, . . . , en, Tj)

A standard choice for C is max pooling, which works well if it is clear-cut whether a mention tuple expresses a relation. In practice, however, individual mention tuples can be ambiguous and uncertain, yet collectively express a relation in the document. This motivates us to experiment with logsumexp, the smooth version of max, where

logsumexp(x1, . . . , xk) = log ∑_{i=1}^{k} exp(xi).

This facilitates accumulating weak signals from individual mention tuples, and our experiments show that it substantially improves extraction accuracy compared to max pooling.
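A small numpy sketch contrasting the two aggregation operators, with toy numbers chosen only to illustrate how weak signals accumulate under logsumexp but not under max:

```python
import numpy as np

def aggregate(mention_reps, op="logsumexp"):
    """Combine mention-level representations (k, d) into one entity-level
    representation (d,), elementwise over the k mention tuples."""
    reps = np.asarray(mention_reps, dtype=float)
    if op == "max":
        return reps.max(axis=0)
    # Numerically stable logsumexp: shift by the max before exponentiating.
    m = reps.max(axis=0)
    return m + np.log(np.exp(reps - m).sum(axis=0))

# Three weak, individually ambiguous mention signals in the first dimension...
weak = [[0.5, 0.0], [0.6, 0.0], [0.4, 0.0]]
print(aggregate(weak, "max")[0])        # 0.6: keeps only the strongest signal
print(aggregate(weak, "logsumexp")[0])  # ~1.60: the weak signals accumulate
```

Max pooling discards all but one mention tuple per dimension, whereas logsumexp lets several borderline mentions jointly push the entity-level score upward, which matches the paper's motivation for the smooth operator.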


3.3 Subrelations

For higher-order relations (i.e., larger n), it is less likely that they will be completely contained within a single discourse unit. Often, the relation can be decomposed into subrelations over subsets of entities, each of which is more likely to be expressed in a single discourse unit. This motivates us to construct entity-level representations for subrelations as well. The process is straightforward. Let RS be the |S|-ary subrelation over entities E_{S1}, · · · , E_{S|S|}, where S ⊆ {1, . . . , n} and |S| denotes its size. We first construct mention-level representations r(RS, e_{S1}, · · · , e_{S|S|}, Tj) for RS and its relevant entity mentions, then combine them into an entity-level representation r(RS, E_{S1}, · · · , E_{S|S|}, T) using the chosen aggregation operator C. We do this for every S ⊆ {1, . . . , n} with |S| ≥ 2 (including the whole set, which corresponds to the full relation R). This gives us an entity-level representation for each subrelation of arity at least 2, or equivalently, each subset of entities of size at least 2.
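The subset enumeration is easy to make concrete (a sketch; integer indices stand in for the entities):

```python
from itertools import combinations

def subrelation_subsets(n):
    """All S subset of {0, ..., n-1} with |S| >= 2, including the full set:
    one subset per subrelation, plus the full n-ary relation itself."""
    indices = range(n)
    return [s for k in range(2, n + 1) for s in combinations(indices, k)]

# For a ternary drug-gene-mutation relation (n = 3):
print(subrelation_subsets(3))       # [(0, 1), (0, 2), (1, 2), (0, 1, 2)]
print(len(subrelation_subsets(3)))  # 4: three binary subrelations + full relation
```

The count 2^n - n - 1 grows quickly, but for the ternary relations studied in the paper only four entity-level representations are needed.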

3.4 Relation prediction

To make a final prediction, we first concatenate all of the entity-level representations r(RS, E_{S1}, . . . , E_{S|S|}, T) for all S ⊆ {1, . . . , n} with |S| ≥ 2. The concatenated representation is fed through a two-layer feedforward neural network followed by a softmax function to predict the relation type.

It is possible that for some subrelations RS, the |S| entities never co-occur in any discourse unit. When this happens, we set r(RS, E_{S1}, . . . , E_{S|S|}, T) to a bias vector that is learned separately for each RS. This ensures that the concatenation is done over a fixed number of vectors, e.g., 4 for a ternary relation (three binary subrelations plus the main relation). Importantly, this strategy enables us to make meaningful predictions for relation candidates even if all n entities never co-occur in the same discourse unit; such candidates would never be generated by a system that only looks at single discourse units in isolation.
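A sketch of the concatenation with the learned-bias fallback. The dimensions and the zero-initialized bias vectors are illustrative assumptions (in practice the biases are learned parameters):

```python
import numpy as np

def concat_entity_reps(entity_reps, bias_vectors):
    """Concatenate entity-level representations over every subset S,
    substituting a per-subrelation bias vector whenever the |S| entities
    never co-occur in any discourse unit (entity_reps[S] is None)."""
    parts = [rep if rep is not None else bias_vectors[S]
             for S, rep in entity_reps.items()]
    return np.concatenate(parts)

d = 4
subsets = [(0, 1), (0, 2), (1, 2), (0, 1, 2)]
bias = {S: np.zeros(d) for S in subsets}   # learned separately per RS in practice
reps = {S: np.ones(d) for S in subsets}
reps[(0, 1, 2)] = None   # all three entities never share a discourse unit
x = concat_entity_reps(reps, bias)
print(x.shape)  # (16,): always 4 * d for a ternary relation
```

The fixed-length output is what lets the downstream feedforward network handle candidates whose full entity set never co-occurs anywhere in the document.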

3.5 Document model

                 Sentence   Paragraph   Document
                 level      level       level
  Text Units     2,326      3,687       3,362
  Pos. Examples  2,222      4,906       8,514
  Neg. Examples  2,849      13,371      323,584

Table 1: Statistics of our training corpus, using PMC-OA articles and distant supervision from CIVIC, GDKD, and OncoKB. “Text Units” refers to the number of distinct sentences, paragraphs, and documents that contain a candidate triple of drug, gene, and mutation.

                           Development   Test
  Documents                118           225
  Annotated facts          701           1,324
  Paragraphs per document  101           105
  Sentences per document   314           320
  Words per document       6,871         7,010

Table 2: Statistics of the CKB evaluation corpus.

Our document model is actually a family of representation learning methods, conditioned on the choice of discourse units, subrelations, and aggregation operators. In this paper, we consider sentences and paragraphs as possible discourse units. We explore max and logsumexp as aggregation operators. Moreover, we explore ensemble prediction as an additional aggregation method. Specifically, we learn a restricted multiscale model by limiting the text span to a single discourse unit (e.g., a paragraph); the model still combines representations across mentions and subrelations. At test time, given a full document with m discourse units, we obtain independent predictions p1, . . . , pm for each discourse unit. We then combine these probabilities using an ensemble operator P. A natural choice for P is max, though we also experiment with noisy-or:

P(p1, · · · , pk) = 1 − ∏_{i=1}^{k} (1 − pi).

It is also possible to ensemble multiple models that operate on different discourse units, using this same operator.
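The noisy-or operator is a one-liner; a sketch with toy probabilities shows how it differs from max as an ensemble operator:

```python
def noisy_or(probs):
    """Noisy-or ensemble: probability that at least one discourse unit
    expresses the relation, treating the per-unit predictions as independent."""
    p = 1.0
    for pi in probs:
        p *= (1.0 - pi)
    return 1.0 - p

# Several weak per-paragraph predictions combine into a confident document-level
# prediction under noisy-or, whereas max reports only the strongest single unit.
per_paragraph = [0.3, 0.25, 0.2]
print(max(per_paragraph))                 # 0.3
print(round(noisy_or(per_paragraph), 3))  # 0.58
```

This mirrors the logsumexp-versus-max contrast at the representation level: noisy-or rewards a relation that is weakly suggested in several places, not just one.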

Our model can be trained using standard supervised or indirectly supervised methods. In this paper, we focus on distant supervision, as it is a particularly potent learning paradigm for high-value domains. Our entity-centric formulation is well aligned with distant supervision: distant supervision at the entity level is significantly less noisy than at the mention level, so we do not need sophisticated denoising strategies such as multi-instance learning (Hoffmann et al., 2011).
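Entity-level distant supervision then reduces to a KB lookup per candidate. A sketch, with the KB fact taken from the Figure 1 example and the labeling helper itself hypothetical:

```python
def distant_labels(candidates, kb):
    """Entity-level distant supervision: a candidate (drug, gene, mutation, doc)
    is labeled positive iff the (drug, gene, mutation) fact appears in the KB.
    No per-mention labeling or denoising is needed at this granularity."""
    return [(c, (c[0], c[1], c[2]) in kb) for c in candidates]

kb = {("PF02341066", "ALK", "G1123S")}
candidates = [("PF02341066", "ALK", "G1123S", "doc1"),
              ("PF02341066", "ALK", "F1174L", "doc1")]
for cand, label in distant_labels(candidates, kb):
    print(cand[2], label)  # G1123S True / F1174L False
```

Because each document contributes one label per entity tuple rather than one per mention tuple, a KB error corrupts a single training example instead of dozens.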


4 Experiments

4.1 Biomedical machine reading

We validate our approach on a standard biomedical machine reading task: extracting drug-gene-mutation interactions from biomedical articles (Peng et al., 2017; Wang and Poon, 2018). We cast this task as binary classification: given a drug, gene, mutation, and a document in which they are mentioned, determine whether the document asserts that the mutation in the gene affects response to the drug. For training, we use documents from the PubMed Central Open Access Subset (PMC-OA).4 For distant supervision, we use three existing knowledge bases (KBs) with hand-curated drug-gene-mutation facts: CIVIC,5 GDKD (Dienstmann et al., 2015), and OncoKB (Chakravarty et al., 2017). Table 1 shows basic statistics of this training data. Past methods using distant supervision often need to up-weight positive examples, due to the large proportion of negative candidates. Interestingly, we found that our document model was robust to this imbalance: re-weighting had little effect, so we did not use it in our final results.

Evaluating distant supervision methods is challenging, as there is often no gold-standard test set, especially at the mention level. Prior work thus resorts to reporting sample precision (the estimated proportion of correct system extractions) and absolute recall (the estimated number of correct system extractions). This requires subsampling extraction results and manually annotating them, and subsampling variance introduces additional noise into the estimates.

Instead, we used CKB CORE™, a public subset of the Clinical Knowledgebase (CKB)6 (Patterson et al., 2016), as our gold-standard test set. CKB CORE™ contains document-level annotations of drug-gene-mutation interactions manually curated by The Jackson Laboratory (JAX), an NCI-designated cancer center. It is a high-quality KB containing facts from a few hundred PubMed articles for 86 genes, with minimal overlap with the three KBs we used for distant supervision. To avoid contamination, we removed CKB entries whose documents were used in our training data, and split the rest into a development set and a test set. See Table 2 for statistics. We tuned hyperparameters and thresholds on the development set, and report results on the test set.

4 www.ncbi.nlm.nih.gov/pmc
5 civic.genome.wustl.edu
6 ckbhome.jax.org

4.2 Implementation details

We conducted standard preprocessing and entity linking, similar to Wang and Poon (2018) (see Section A.1). Following standard practice, we masked all entities of the same type with a dummy token, to prevent the classifier from simply memorizing the facts in distant supervision. Wang and Poon (2018) observed that many errors stemmed from incorrect gene-mutation associations. We therefore developed a simple rule-based system that predicts which gene-mutation pairs are valid (see Section A.2), and removed candidates containing a gene-mutation pair that the rule-based system did not predict.

4.3 Main results

We evaluate primarily on area under the precision-recall curve (AUC).7 We also report maximum recall, the fraction of true facts for which a candidate was generated. Finally, we report precision, recall, and F1, using a threshold tuned to maximize F1 on the CKB development set.

We compared our multiscale system (MULTISCALE) with three restricted variants (SENTLEVEL, PARALEVEL, DOCLEVEL). SENTLEVEL and PARALEVEL restrict training and prediction to single discourse units (i.e., sentences and paragraphs, respectively), and produce a document-level prediction by applying the ensemble operator over individual discourse units. DOCLEVEL takes the whole document as input, with each paragraph as a discourse unit. MULTISCALE further combines SENTLEVEL, PARALEVEL, and DOCLEVEL using the ensemble operator. For additional details about the models, see Section A.3. We also compared MULTISCALE with DPL (Wang and Poon, 2018), the prior state of the art in cross-sentence n-ary relation extraction. DPL classifies drug-gene-mutation interactions within three consecutive sentences using the same model architecture as Peng et al. (2017), but incorporates additional indirect supervision such as data programming and joint inference. We used the DPL code from the authors and produced a document-level prediction similarly using the ensemble operator. In the base version, we used max as the ensemble operator. We also evaluated the effect when we

7 We compute area using average precision, which is similar to a right Riemann sum. This avoids errors introduced by the trapezoidal rule, which may overestimate the area.
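The difference between the two area estimates can be seen in a short sketch. The precision-recall points are toy values, and the initial precision of 1.0 at recall 0 is a common convention assumed here, not a detail from the paper:

```python
def pr_areas(points):
    """Compare two estimates of area under a precision-recall curve:
    average precision (sum of precision * recall-increment, a right Riemann
    sum) versus the trapezoidal rule, which linearly interpolates between
    points and can overestimate the area.
    points: (recall, precision) pairs sorted by increasing recall."""
    ap = trap = 0.0
    prev_r, prev_p = 0.0, 1.0  # convention: precision 1.0 at recall 0
    for r, p in points:
        ap += (r - prev_r) * p                    # right Riemann sum
        trap += (r - prev_r) * (p + prev_p) / 2.0  # trapezoidal rule
        prev_r, prev_p = r, p
    return ap, trap

pr = [(0.2, 1.0), (0.4, 0.5), (0.6, 0.4)]
ap, trap = pr_areas(pr)
print(round(ap, 2), round(trap, 2))  # 0.38 0.44: trapezoid reports more area
```

On a sawtooth-shaped precision-recall curve, the trapezoid's linear interpolation credits precision that was never actually achieved, which is why average precision is the safer estimate.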


  System       AUC    Max Recall   Precision   Recall   F1
  Base versions
  DPL          24.4   53.8         27.3        42.3     33.2
  SENTLEVEL    22.4   36.6         39.3        34.7     36.9
  PARALEVEL    33.1   58.9         36.5        44.6     40.1
  DOCLEVEL     36.7   79.0         45.4        38.5     41.7
  MULTISCALE   37.3   79.0         41.8        43.3     42.5
  + Noisy-Or
  DPL          31.5   53.8         33.3        41.5     36.9
  SENTLEVEL    25.3   36.6         39.3        35.3     37.2
  PARALEVEL    35.6   58.9         44.3        40.6     42.4
  DOCLEVEL     36.7   79.0         45.4        38.5     41.7
  MULTISCALE   39.7   79.0         48.1        38.9     43.0
  + Noisy-Or + Gene-mutation filter
  DPL          39.1   52.6         50.5        47.8     49.1
  SENTLEVEL    29.0   35.5         63.3        34.2     44.4
  PARALEVEL    42.1   57.2         50.6        50.7     50.7
  DOCLEVEL     42.9   74.4         49.3        46.6     47.9
  MULTISCALE   47.5   74.4         52.6        53.0     52.8

Table 3: Comparison of our multiscale system with restricted variants and DPL (Wang and Poon, 2018) on CKB.

used noisy-or as the ensemble operator, as well as when we applied the gene-mutation filter during postprocessing.

Table 3 shows the results on the CKB test set. In all scenarios, our full model (MULTISCALE) substantially outperforms the prior state-of-the-art system (DPL). For example, in the best setting, using both noisy-or and the gene-mutation filter, the full model improves over DPL by 8.4 AUC points. Multiscale learning is the key to this performance gain, with MULTISCALE substantially outperforming more restricted variants. Not surprisingly, expanding the extraction scope from sentences to paragraphs resulted in the biggest gain, already surpassing DPL. Conducting end-to-end learning over a document-level representation, as in DOCLEVEL, is beneficial compared to ensembling over predictions for individual discourse units (SENTLEVEL, PARALEVEL), especially in the base version. Interestingly, MULTISCALE still attained a significant gain over DOCLEVEL with an ensemble over SENTLEVEL and PARALEVEL, suggesting that the document-level representation can still be improved. In addition to prediction accuracy, the document-level models also have much more room to grow, as maximum recall is about 20 absolute points higher in MULTISCALE and DOCLEVEL compared to PARALEVEL or DPL.8

The ensemble operator had a surprisingly large effect, as shown by the gain when it was changed from max (base version) to noisy-or. This suggests that combining weak signals across multiple scales can be quite beneficial. Our hand-crafted gene-mutation filter also improved all systems substantially, corroborating the analysis of Wang and Poon (2018). In particular, without the filter, it is hard for the document-level models to achieve high precision, so they sacrifice a lot of recall to get good F1 scores. Using the filter helps them attain significantly higher recall while maintaining respectable precision.

8 The difference in actual recall is less pronounced, as we chose thresholds to maximize F1 score. We expect actual recall to increase significantly as document-level models improve, whereas the other models are closer to their ceiling.

Figure 3: Precision-recall curves on CKB (with noisy-or and gene-mutation filter). MULTISCALE attained generally better precision than PARALEVEL, and higher maximum recall like DOCLEVEL.

Figure 3 shows the precision-recall curves for the four models (with noisy-or and gene-mutation filter). DOCLEVEL has higher maximum recall than PARALEVEL, but generally lower precision at the same recall level. By ensembling all three variants, MULTISCALE achieves the best combination: it generally improves precision while capturing more cross-paragraph relations. This can also be seen in Table 4, where we ablate each of the three variants used by MULTISCALE. All three variants in the ensemble contributed to overall performance.

  System        AUC    MR     P      R      F1
  MULTISCALE    47.5   74.4   52.6   53.0   52.8
  – SENTLEVEL   47.0   74.4   43.0   55.8   48.6
  – PARALEVEL   45.9   74.4   48.8   49.8   49.3
  – DOCLEVEL    42.4   57.2   59.6   44.4   50.9

Table 4: Results on CKB when removing either SENTLEVEL, PARALEVEL, or DOCLEVEL from the ensemble computed by MULTISCALE. MR = max recall, P = precision, R = recall.

  System        AUC    P      R      F1
  SENTLEVEL     28.3   62.7   35.1   45.0
  PARALEVEL     38.1   47.4   52.2   49.7
  DOCLEVEL      41.1   48.2   45.6   46.9
  MULTISCALE    43.7   45.7   51.2   48.3

Table 5: Results on CKB after replacing logsumexp with max (with noisy-or and gene-mutation filter). P = precision, R = recall. Max recall is the same as in Table 3.

We use logsumexp as the aggregation operator to combine mention-level representations into an entity-level one. If we replace it with max pooling, performance drops substantially across the board, as shown in Table 5. For example, MULTISCALE loses 3.8 absolute points in AUC. A similar difference was also observed by Verga et al. (2018). As with the comparison of ensemble operators, this demonstrates the benefit of combining weak signals using a multiscale representation.

4.4 Cross-sentence and cross-paragraph extractions

Compared to standard sentence-level extraction,our method can extract relations among entitiesthat never co-occur in the same sentence or evenparagraph. Figure 4 shows the proportion of cor-rectly predicted facts by MULTISCALE that areexpressed across paragraph or sentence bound-aries. MULTISCALE can substantially improvethe recall by making additional cross-sentenceand cross-paragraph extractions. We manuallyinspected twenty correct cross-paragraph extrac-tions (with the chosen threshold for the preci-sion/recall numbers in Table 3) and found that ourmodel was able to handle some interesting linguis-tic phenomena. Often, a paper would first de-scribe the mutations present in a patient cohort,

Figure 4: Breakdown of MULTISCALE recall based onwhether entities in a correctly extracted fact occurredwithin a single sentence, cross-sentence but within asingle paragraph, or only cross-paragraph. Addingcross-sentence and cross-paragraph extractions is im-portant for high recall.

System AUC MR P R F1Base versionSENTDRUGMUT 31.0 40.8 60.0 40.7 48.5SENTDRUGGENE 17.9 64.2 31.4 27.9 29.5PARADRUGMUT 39.9 57.7 49.3 50.3 49.8PARADRUGGENE 19.9 68.9 32.1 18.9 23.8+ Noisy-OrSENTDRUGMUT 32.6 40.8 61.1 39.4 47.9SENTDRUGGENE 23.5 64.2 36.3 34.2 35.2PARADRUGMUT 42.0 57.7 49.9 51.5 50.7PARADRUGGENE 26.1 68.9 46.1 29.5 36.0

Table 6: Results of subrelation decomposition base-lines on CKB, with the gene-mutation filter. MR=maxrecall, P=precision, R=recall.

and later describe the effects of drug treatment. There are also instances of bridging anaphora, for example via cell lines. One paper first stated the gene and mutation for a cell line: "The FLT3-inhibitor resistant cells Ba/F3-ITD+691, Ba/F3-ITD+842, ..., which harbored FLT3-ITD plus F691L, Y842C, ... mutations...", and later stated the drug effect on the cell line: "E6201 also demonstrated strong anti-proliferative effects in FLT3-inhibitor resistant cells... such as Ba/F3-ITD+691, Ba/F3-ITD+842 ...".

4.5 Subrelation decomposition

As a baseline, we also consider a different document-level strategy where we decompose the n-ary relation into subrelations of lower arity, train independent classifiers for them, then join the subrelation predictions into one for the n-ary relation. We found that with distant supervision, the


gene-mutation subrelation classifier was too noisy. Therefore, we focused on training drug-gene and drug-mutation classifiers, and joined each with the rule-based gene-mutation predictions to make ternary predictions. Table 6 shows the results on CKB. The paragraph-level drug-mutation model is quite competitive, benefiting from the fact that the gene-mutation associations in a document are unique; this is not true of general n-ary relations. Still, it trails MULTISCALE by a large margin in predictive accuracy, and with an even larger gap in the potential upside (i.e., maximum recall). The drug-gene model has higher maximum recall, but much worse precision. This low precision is expected, as it is usually not valid to assume that if a drug and gene interact, then all possible mutations in the gene will have an effect on the drug response.
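A sketch of the join step in this baseline (the entity names and prediction sets are hypothetical; this is not the authors' implementation):

```python
def join_subrelations(drug_gene, gene_mutation):
    """Join binary subrelation predictions into ternary candidates.

    drug_gene:     set of (drug, gene) pairs from one classifier
    gene_mutation: set of (gene, mutation) pairs, e.g. from the
                   rule-based gene-mutation linker
    Returns the (drug, gene, mutation) triples that agree on the
    shared gene argument -- a natural join on gene.
    """
    return {(d, g, m)
            for (d, g) in drug_gene
            for (g2, m) in gene_mutation
            if g == g2}

drug_gene = {("gefitinib", "EGFR"), ("imatinib", "ABL1")}
gene_mutation = {("EGFR", "T790M"), ("EGFR", "L858R")}
triples = join_subrelations(drug_gene, gene_mutation)
print(sorted(triples))
```

The fan-out visible here is exactly the source of the drug-gene model's low precision: a single interacting drug-gene pair licenses every known mutation of that gene.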

4.6 Error Analysis

While much higher than that of other systems, the maximum recall for MULTISCALE is still far from 100%. For over 20% of the relations, we cannot find all three entities in the document. In many cases, the missing entities are in figures or supplements, beyond the scope of our extraction. Some mutations are indirectly referenced by well-known cell lines. There are also remaining entity linking errors (e.g., due to missing drug synonyms).

We next manually analyzed some sample prediction errors. Among 50 false positive errors, we found that a significant portion were actually true mentions in the paper but were excluded by curators due to additional curation criteria. For example, CKB does not curate a fact referenced in related work, or one whose empirical evidence the curators deem insufficient. This suggests the need for even higher-order relation extraction to cover these aspects. We also inspected 50 sample false negative errors. In 40% of the cases, the textual evidence is vague and requires corroboration from a table or figure. In most of the remaining cases, there is direct textual evidence, though it requires cross-paragraph reasoning (e.g., bridging anaphora). While MULTISCALE was sometimes able to handle such phenomena, there is clearly much room to improve.

5 Related Work

N-ary relation extraction. Prior work on n-ary relation extraction generally follows Davidsonian semantics by reducing the n-ary relation to n binary relations between the reified relation and its arguments, a.k.a. slot filling. For example, early work on the Message Understanding Conference (MUC) dataset aims to identify event participants in news articles (Chinchor, 1998). More recently, there has been much work on extracting semantic roles for verbs, as in semantic role labeling (Palmer et al., 2010), as well as properties of popular entities, as in Wikipedia Infobox (Wu and Weld, 2007) and TAC KBP9. In biomedicine, the BioNLP Event Extraction Shared Task aims to extract genetic events such as expression and regulation (Kim et al., 2009). These approaches typically assume that the whole document refers to a single coherent event, or require an event anchor (e.g., the verb in semantic role labeling and the trigger word in event extraction). We instead follow recent work in cross-sentence n-ary relation extraction (Peng et al., 2017; Wang and Poon, 2018), which does not have these restrictions.

Document-level relation extraction. Most information extraction work focuses on modeling and prediction within sentences (Surdeanu and Ji, 2014). Duan et al. (2017) introduce a pretrained document embedding to aid event detection, but their extraction is still at the sentence level. Past work on cross-sentence extraction often relies on explicit coreference annotations or the assumption of a single event per document (Wick et al., 2006; Gerber and Chai, 2010; Swampillai and Stevenson, 2011; Yoshikawa et al., 2011; Koch et al., 2014; Yang and Mitchell, 2016). Recently, there has been increasing interest in general cross-sentence relation extraction (Quirk and Poon, 2017; Peng et al., 2017; Wang and Poon, 2018), but its scope is still limited to short text spans of a few consecutive sentences. These methods all extract relations at the mention level, which does not scale to whole documents due to the combinatorial explosion of relation candidates. Wu et al. (2018b) apply manually crafted rules to heavily filter the candidates. We instead adopt an entity-centric approach and combine mention-level representations to create an entity-level representation for extraction. Mintz et al. (2009) aggregate mention-level features into entity-level ones within a document, but they only consider binary relations within single sentences. Kilicoglu (2016) uses hand-crafted features to improve cross-sentence extraction, but focuses on binary relations, and the documents considered are limited to abstracts, which are substantially shorter than the full-text articles we consider. Verga et al. (2018) apply self-attention to combine the representations of all mention pairs into an entity-pair representation, which can be viewed as a special case of our framework. Their work is also limited to binary relations and abstracts, rather than full documents.

9 http://www.nist.gov/tac/2016/KBP/ColdStart/index.html

Multiscale modeling. Deep learning over long sequences can benefit from multiscale modeling that accounts for varying scales in the discourse structure. Prior work focuses on generative learning such as language modeling (Chung et al., 2017). We instead apply multiscale modeling to discriminative learning for relation extraction. In addition to modeling various scales of discourse units (sentence, paragraph, document), we also combine mention-level representations into an entity-level one, as well as subrelations of the n-ary relation. McDonald et al. (2005) learn $\binom{n}{2}$ pairwise relation classifiers, then construct maximal cliques of related entities, which also bears resemblance to our subrelation modeling. However, our approach incorporates the entire subrelation hierarchy, provides a principled end-to-end learning framework, and extracts relations from the whole document rather than within single sentences.

Distant supervision. Distant supervision has emerged as a powerful paradigm for generating large but potentially noisy labeled datasets (Craven et al., 1999; Mintz et al., 2009). A common denoising strategy applies multi-instance learning by treating mention-level labels as latent variables (Hoffmann et al., 2011). Noise from distant supervision increases as the extraction scope expands beyond single sentences, motivating a variety of indirect supervision approaches (Quirk and Poon, 2017; Peng et al., 2017; Wang and Poon, 2018). Our entity-centric representation and multiscale modeling provide an orthogonal approach to combating noise by combining weak signals spanning various text spans and subrelations.

6 Conclusion

We propose a multiscale, entity-centric approach for document-level n-ary relation extraction. We vastly increase maximum recall by scoring document-level candidates. Meanwhile, we preserve precision with a multiscale approach that combines representations learned across the subrelation hierarchy and across text spans of various scales. Our method substantially outperforms prior cross-sentence n-ary relation extraction approaches in the high-value domain of precision oncology.

Our document-level view opens opportunities for multimodal learning by integrating information from tables and figures (Wu et al., 2018a). We used the ternary drug-gene-mutation relation as a running example in this paper, but knowledge bases often store additional fields such as effect (sensitive or resistant), cancer type (solid tumor or leukemia), and evidence (human trial or cell-line experiment). It is straightforward to apply our method to such higher-order relations. Finally, it will be interesting to validate our approach in a real-world assisted-curation setting, where a machine reading system proposes candidate facts to be verified by human curators.

7 Acknowledgements

We thank Sara Patterson and Susan Mockus for guidance on precision oncology knowledge curation and CKB data, Hai Wang for help in running experiments with deep probabilistic logic, and Tristan Naumann, Rajesh Rao, Peng Qi, John Hewitt, and the anonymous reviewers for their helpful comments.

References

Orli Bahcall. 2015. Precision medicine. Nature, 526:335.

Debyani Chakravarty, Jianjiong Gao, Sarah Phillips, Ritika Kundra, Hongxin Zhang, Jiaojiao Wang, Julia E. Rudolph, Rona Yaeger, Tara Soumerai, Moriah H. Nissan, Matthew T. Chang, Sarat Chandarlapaty, Tiffany A. Traina, Paul K. Paik, Alan L. Ho, Feras M. Hantash, Andrew Grupe, Shrujal S. Baxi, Margaret K. Callahan, Alexandra Snyder, Ping Chi, Daniel C. Danila, Mrinal Gounder, James J. Harding, Matthew D. Hellmann, Gopa Iyer, Yelena Y. Janjigian, Thomas Kaley, Douglas A. Levine, Maeve Lowery, Antonio Omuro, Michael A. Postow, Dana Rathkopf, Alexander N. Shoushtari, Neerav Shukla, Martin H. Voss, Ederlinda Paraiso, Ahmet Zehir, Michael F. Berger, Barry S. Taylor, Leonard B. Saltz, Gregory J. Riely, Marc Ladanyi, David M. Hyman, José Baselga, Paul Sabbatini, David B. Solit, and Nikolaus Schultz. 2017. OncoKB: A precision oncology knowledge base. JCO Precision Oncology.


Nancy Chinchor. 1998. Overview of MUC-7/MET-2. Technical report, Science Applications International Corporation, San Diego, CA.

Junyoung Chung, Sungjin Ahn, and Yoshua Bengio. 2017. Hierarchical multiscale recurrent neural networks. In ICLR.

M. Craven, J. Kumlien, et al. 1999. Constructing biological knowledge bases by extracting information from text sources. In ISMB, pages 77–86.

Rodrigo Dienstmann, In Sock Jang, Brian Bot, Stephen Friend, and Justin Guinney. 2015. Database of genomic biomarkers for cancer drugs and clinical targetability in solid tumors. Cancer Discovery, 5.

Shaoyang Duan, Ruifang He, and Wenli Zhao. 2017. Exploiting document level information to improve event detection via recurrent neural networks. In IJCNLP.

Matthew Gerber and Joyce Y. Chai. 2010. Beyond NomBank: A study of implicit arguments for nominal predicates. In Proceedings of the Forty-Eighth Annual Meeting of the Association for Computational Linguistics.

R. Hoffmann, C. Zhang, X. Ling, L. S. Zettlemoyer, and D. S. Weld. 2011. Knowledge-based weak supervision for information extraction of overlapping relations. In Association for Computational Linguistics (ACL), pages 541–550.

Halil Kilicoglu. 2016. Inferring implicit causal relationships in biomedical literature. In Proceedings of the 15th Workshop on Biomedical Natural Language Processing.

Jin-Dong Kim, Tomoko Ohta, Sampo Pyysalo, Yoshinobu Kano, and Junichi Tsujii. 2009. Overview of BioNLP'09 shared task on event extraction. In Proceedings of the BioNLP Workshop 2009.

D. Kingma and J. Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Mitchell Koch, John Gilmer, Stephen Soderland, and Daniel S. Weld. 2014. Type-aware distantly supervised relation extraction with linked arguments. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1891–1901. Association for Computational Linguistics.

Ryan McDonald, Fernando Pereira, Seth Kulick, Scott Winters, Yang Jin, and Pete White. 2005. Simple algorithms for complex relation extraction with applications to biomedical IE. In Proceedings of the Forty-Third Annual Meeting of the Association for Computational Linguistics.

M. Mintz, S. Bills, R. Snow, and D. Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Association for Computational Linguistics (ACL), pages 1003–1011.

Martha Palmer, Daniel Gildea, and Nianwen Xue. 2010. Semantic role labeling. Synthesis Lectures on Human Language Technologies, 3(1).

Sarah E. Patterson, Rangjiao Liu, Cara M. Statz, Daniel Durkin, Anuradha Lakshminarayana, and Susan M. Mockus. 2016. The clinical trial landscape in oncology and connectivity of somatic mutational profiles to targeted therapies. Human Genomics, 10.

Nanyun Peng, Hoifung Poon, Chris Quirk, Kristina Toutanova, and Wen-tau Yih. 2017. Cross-sentence n-ary relation extraction with graph LSTMs. Transactions of the Association for Computational Linguistics, 5:101–115.

Sampo Pyysalo, F. Ginter, Hans Moen, T. Salakoski, and Sophia Ananiadou. 2013. Distributional semantics resources for biomedical text processing. In Proceedings of Languages in Biology and Medicine.

Chris Quirk and Hoifung Poon. 2017. Distant supervision for relation extraction beyond the sentence boundary. In Proceedings of the Fifteenth Conference of the European Chapter of the Association for Computational Linguistics.

Mihai Surdeanu and Heng Ji. 2014. Overview of the English Slot Filling Track at the TAC2014 Knowledge Base Population evaluation. In Proceedings of the TAC-KBP 2014 Workshop, pages 1–15, Gaithersburg, Maryland, USA.

Kumutha Swampillai and Mark Stevenson. 2011. Extracting relations within and across sentences. In Proceedings of the Conference on Recent Advances in Natural Language Processing.

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. 2017. Attention is all you need. arXiv preprint arXiv:1706.03762.

Patrick Verga, Emma Strubell, and Andrew McCallum. 2018. Simultaneously self-attending to all mentions for full-abstract biological relation extraction. In NAACL.

Hai Wang and Hoifung Poon. 2018. Deep probabilistic logic: A unifying framework for indirect supervision. In EMNLP.

Michael Wick, Aron Culotta, and Andrew McCallum. 2006. Learning field compatibilities to extract database records from unstructured text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Fei Wu and Daniel S. Weld. 2007. Autonomously semantifying Wikipedia. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management.

S. Wu, L. Hsiao, X. Cheng, B. Hancock, T. Rekatsinas, P. Levis, and C. Ré. 2018a. Fonduer: Knowledge base construction from richly formatted data. In Proceedings of SIGMOD 2018.

Sen Wu, Luke Hsiao, Xiao Cheng, Braden Hancock, Theodoros Rekatsinas, Philip Levis, and Christopher Ré. 2018b. Fonduer: Knowledge base construction from richly formatted data. In SIGMOD.

Bishan Yang and Tom Mitchell. 2016. Joint extraction of events and entities within a document context. In NAACL.

Katsumasa Yoshikawa, Sebastian Riedel, Tsutomu Hirao, Masayuki Asahara, and Yuji Matsumoto. 2011. Coreference based event-argument relation extraction on biomedical text. Journal of Biomedical Semantics, 2(5).


A Appendices

A.1 Preprocessing

Full-text documents in this study were obtained from PMC. The text was first tokenized using NLTK10, then entities were extracted using a combination of regular expressions and dictionary lookups. To identify mutation mentions, we applied a regular expression rule for missense mutations. To identify gene mentions, we used dictionary lookup from the HUGO Gene Nomenclature Committee (HGNC)11 dataset. To identify drug mentions, we used dictionary lookup from a curated list of drugs and their synonyms. For our training set, our list of drugs consists of all the drugs present in the distant supervision knowledge bases and selected cancer-related drugs from DrugBank12 (770 drugs total). For our test set, our drug dictionary consists of all the drugs in CKB (1,119 drugs).
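The paper does not spell out its missense pattern, but a common sketch matches "reference amino acid, position, alternate amino acid" in one-letter IUPAC codes (e.g., T790M); the exact regular expression below is an assumption:

```python
import re

# Assumption: the paper's rule is not given verbatim; this is a common
# pattern for protein-level missense mentions in one-letter IUPAC codes.
AMINO = "ACDEFGHIKLMNPQRSTVWY"
MISSENSE = re.compile(rf"\b([{AMINO}])(\d+)([{AMINO}])\b")

text = "Cells harbored FLT3-ITD plus F691L and Y842C mutations."
print(MISSENSE.findall(text))  # [('F', '691', 'L'), ('Y', '842', 'C')]
```

The word boundaries keep the pattern from firing inside longer tokens such as gene symbols (e.g., the "T3" inside "FLT3" does not match).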

A.2 Gene-mutation rule-based system

Here we describe our rule-based system for linking mutations and genes within a document. We first generate a global mapping of mutations to sets of genes by combining publicly available mutation-gene datasets (COSMIC13, COSMIC Cell Lines Project14, CIViC15, and OncoKB16). We then augment this mapping by finding the gene that most frequently co-occurs with each mutation across all PubMed Central (PMC) full-text articles, based on three high-precision rules:

1. Gene and mutation are in the same token (e.g., "EGFR-T790M")

2. Gene token is immediately followed by the mutation token (e.g., "EGFR T790M")

3. Gene token is followed by a token of any single character and then the mutation token (e.g., "EGFR - T790M")

For each mutation, we start with the first rule and find all text matches for a gene with that mutation under that rule. If we find at least one match, we add the gene that occurred in the most matches to the global map. Otherwise, we repeat with the next rule.

10 https://www.nltk.org/
11 https://www.genenames.org/
12 https://www.drugbank.ca/
13 https://cancer.sanger.ac.uk/cosmic
14 https://cancer.sanger.ac.uk/cell_lines
15 https://civicdb.org
16 http://oncokb.org/
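The cascade described above can be sketched as follows (the Counter inputs stand in for corpus-wide match counts and are hypothetical):

```python
from collections import Counter

def pick_gene(match_counts_per_rule):
    """Apply the rule cascade: rules are tried in priority order, and
    the first rule with any matches decides the gene (the one with the
    most matches). Lower-precision rules are consulted only when all
    earlier rules found nothing.

    match_counts_per_rule: list of Counter({gene: num_matches}), one
    per rule, ordered from highest to lowest precision.
    """
    for counts in match_counts_per_rule:
        if counts:
            return counts.most_common(1)[0][0]
    return None  # no rule matched; fall back to document-level rules

# Rule 1 ("EGFR-T790M" style) found nothing; rule 2 found EGFR twice:
rules = [Counter(), Counter({"EGFR": 2, "ERBB2": 1})]
print(pick_gene(rules))  # EGFR
```

Ordering the rules this way lets the high-precision patterns dominate while the later rules only fill in coverage gaps.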

Each mutation in the global mutation-gene map is mapped to more than 20 genes on average. However, within the context of a document, each mutation is (usually) associated with just a single gene. Given a document containing a mutation, we associate that mutation with the gene that (1) is in the global mutation-gene map for that mutation, and (2) appears closest to any mention of that mutation in the document.

To associate genes with the remaining mutations, we apply two recall-friendly regular expression rules within the document:

4. Mutation is in the same sentence as "GENE mut"

5. Mutation is in the same paragraph as "GENE mutation"

We choose the first gene in the document that satisfies one of the two rules, in the above order. If there is still no matching gene at this point, the most frequent gene in the document is selected for that mutation.

A.3 Model details

We used 200-dimensional word vectors, initialized with word2vec vectors trained on a biomedical text corpus (Pyysalo et al., 2013). We updated these vectors during training. At each step, our BiLSTM received as input the concatenation of the word vector and a 100-dimensional embedding of the index of the current discourse unit within the document. Following Vaswani et al. (2017), we used sinusoidal embeddings to represent these indices. We used a single-layer bidirectional LSTM with a 200-dimensional hidden state. Mention-level representations were 400-dimensional, computed from BiLSTM hidden states using a single linear layer followed by the tanh activation function. For the final prediction layer, we used a two-layer feedforward network with 400 hidden units and the ReLU activation function. We trained using the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 1 × 10−5. During training, we treated each document as a single batch, which allows us to reuse computation across different relation candidates in the same document.
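As an illustrative sketch of the sinusoidal index embeddings (the base of 10000 follows the Transformer default and is an assumption here; the paper only states that the embeddings are sinusoidal and 100-dimensional):

```python
import math

def sinusoidal_embedding(index, dim=100):
    """Sinusoidal embedding of a discourse-unit index, in the style of
    Vaswani et al. (2017): even dimensions use sine, odd use cosine,
    with geometrically spaced wavelengths so that nearby indices get
    similar embeddings.
    """
    emb = []
    for i in range(dim):
        angle = index / (10000 ** (2 * (i // 2) / dim))
        emb.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return emb

e = sinusoidal_embedding(3)  # embedding of the 4th discourse unit
print(len(e))  # 100
```

Because the embeddings are a fixed function of the index, they generalize to documents with more discourse units than seen during training.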