
Scalable Multi-Hop Relational Reasoning for Knowledge-Aware Question Answering

Yanlin Feng♣* Xinyue Chen♠* Bill Yuchen Lin♥ Peifeng Wang♥ Jun Yan♥ Xiang Ren♥

fengyanlin@pku.edu.cn, kiwisher@sjtu.edu.cn, {yuchen.lin, peifengw, yanjun, xiangren}@usc.edu

♥University of Southern California  ♣Peking University  ♠Shanghai Jiao Tong University

Abstract

While fine-tuning pre-trained language models (PTLMs) has yielded strong results on a range of question answering (QA) benchmarks, these methods still suffer in cases where external knowledge is needed to infer the right answer. Existing work on augmenting QA models with external knowledge (e.g., knowledge graphs) either struggles to model multi-hop relations efficiently, or lacks transparency into the model's prediction rationale. In this paper, we propose a novel knowledge-aware approach that equips PTLMs with a multi-hop relational reasoning module, named multi-hop graph relation networks (MHGRN). It performs multi-hop, multi-relational reasoning over subgraphs extracted from external knowledge graphs. The proposed reasoning module unifies path-based reasoning methods and graph neural networks to achieve better interpretability and scalability. We also empirically show its effectiveness and scalability on the CommonsenseQA and OpenbookQA datasets, and interpret its behaviors with case studies. In particular, MHGRN achieves state-of-the-art performance (76.5% accuracy) on the CommonsenseQA official test set.¹

1 Introduction

Many recently proposed question answering tasks require not only machine comprehension of the question and context, but also relational reasoning over entities (concepts) and their relationships by referencing external knowledge (Talmor et al., 2019; Sap et al., 2019; Clark et al., 2018; Mihaylov et al., 2018). For example, the question in Fig. 1 requires a model to perform relational reasoning over mentioned entities, i.e., to infer latent relations among the concepts: {CHILD, SIT, DESK,

* The first two authors contributed equally. The major work was done when both authors interned at USC.

¹ https://github.com/INK-USC/MHGRN

[Figure 1 content: a ConceptNet subgraph linking the question concepts Child, Sit, and Desk to the answer concepts Schoolroom and Classroom via relations such as AtLocation, Desires, and Learn. Question: "Where does a child likely sit at a desk?" A. Schoolroom* B. Furniture store C. Patio D. Office building E. Library]

Figure 1: Illustration of knowledge-aware QA. A sample question from CommonsenseQA can be better answered if a relevant subgraph of ConceptNet is provided as evidence. Blue nodes correspond to entities mentioned in the question, pink nodes correspond to those in the answer. The other nodes are some associated entities introduced when extracting the subgraph. ⋆ indicates the correct answer.

SCHOOLROOM}. Background knowledge such as "a child is likely to appear in a schoolroom" may not be readily contained in the questions themselves, but is commonsensical to humans.

Despite the success of large-scale pre-trained language models (PTLMs) (Devlin et al., 2019; Liu et al., 2019b), there is still a large performance gap between fine-tuned models and human performance on datasets that probe relational reasoning. These models also fall short of providing interpretable predictions, as the knowledge in their pre-training corpus is not explicitly stated but rather implicitly learned. It is thus difficult to recover the knowledge used in the reasoning process.

This has led many works to leverage knowledge graphs to improve machine reasoning ability to infer these latent relations for answering

[Figure 2 plots: left panel, path count (up to ~3.1e3) vs. node count (0-80) for K=2; right panel, path count (up to ~6.3e4) vs. node count for K=2 and K=3.]

Figure 2: Number of K-hop relational paths w.r.t. the node count in extracted graphs on CommonsenseQA. Left: The path count is polynomial w.r.t. the number of nodes. Right: The path count is exponential w.r.t. the number of hops.

this kind of question (Mihaylov and Frank, 2018; Lin et al., 2019; Wang et al., 2019; Yang et al., 2019). Knowledge graphs represent relational knowledge between entities as multi-relational edges, thus making it easier for a model to acquire relational knowledge and improve its reasoning ability. Moreover, incorporating knowledge graphs brings the potential of interpretable and trustworthy predictions, as the knowledge is now explicitly stated. For example, in Fig. 1, the relational path (CHILD → AtLocation → CLASSROOM →

Synonym → SCHOOLROOM) naturally provides evidence for the answer SCHOOLROOM.

A straightforward approach to leveraging a knowledge graph is to directly model these relational paths. KagNet (Lin et al., 2019) and MHPGM (Bauer et al., 2018) extract relational paths from the knowledge graph and encode them with sequence models, resulting in multi-hop relations being explicitly modeled. Applying attention mechanisms over these relational paths can further offer good interpretability. However, these models are hardly scalable because the number of possible paths in a graph is (1) polynomial w.r.t. the number of nodes and (2) exponential w.r.t. the path length (see Fig. 2). Therefore, some works (Weissenborn et al., 2017; Mihaylov and Frank, 2018) resort to using only one-hop paths, namely triples, to balance scalability and reasoning capacity.

Graph neural networks (GNNs), in contrast, enjoy better scalability via their message passing formulation, but usually lack transparency. The most commonly used GNN variant, the Graph Convolutional Network (GCN) (Kipf and Welling, 2017), performs message passing by aggregating neighborhood information for each node, but ignores the relation types. RGCN (Schlichtkrull et al., 2018) generalizes GCN by performing relation-specific

                            GCN   RGCN   KagNet   MHGRN
Multi-Relational Encoding    ✗     ✓      ✓        ✓
Interpretable                ✗     ✗      ✓        ✓
Scalable w.r.t. node         ✓     ✓      ✗        ✓
Scalable w.r.t. hop          ✓     ✓      ✗        ✓

Table 1: Properties of our MHGRN and other representative models for graph encoding.

aggregation, making it applicable to encoding multi-relational graphs. However, these models do not distinguish the importance of different neighbors or relation types, and thus cannot provide explicit relational paths for model behavior interpretation.

In this paper, we propose a novel graph encoding architecture, Multi-hop Graph Relation Networks (MHGRN), which combines the strengths of path-based models and GNNs. Our model inherits scalability from GNNs by preserving the message passing formulation. It also enjoys the interpretability of path-based models by incorporating a structured relational attention mechanism to model message passing routes. Our key motivation is to perform multi-hop message passing within a single layer to allow each node to directly attend to its multi-hop neighbours, thus enabling multi-hop relational reasoning with MHGRN. We outline the favorable features of knowledge-aware QA models in Table 1 and compare our MHGRN with representative GNNs and path-based methods.

We summarize the main contributions of this work as follows: 1) We propose MHGRN, a novel model architecture tailored to multi-hop relational reasoning. Our model is capable of explicitly modeling multi-hop relational paths at scale. 2) We propose a structured relational attention mechanism for efficient and interpretable modeling of multi-hop reasoning paths, along with its training and inference algorithms. 3) We conduct extensive experiments on two question answering datasets and show that our models bring significant improvements compared to knowledge-agnostic pre-trained language models, and outperform other graph encoding methods by a large margin.

2 Problem Formulation and Overview

In this paper, we limit the scope to the task of multiple-choice question answering, although it can easily be generalized to other knowledge-guided tasks (e.g., natural language inference). The overall paradigm of knowledge-aware QA is illustrated in Fig. 3. Formally, given an external knowledge graph (KG) as the knowledge source and a

[Figure 3 components: graph extraction from the KG feeds a graph encoder (output g); question q and answer a form a statement that feeds a text encoder (output s); both outputs feed an MLP that produces the plausibility score.]

Figure 3: Overview of the knowledge-aware QA framework. It integrates the output from the graph encoder (for relational reasoning over contextual subgraphs) and the text encoder (for textual understanding) to generate the plausibility score for an answer option.

question q, our goal is to identify the correct answer from a set C of given choices. We turn this problem into measuring the plausibility score between q and each answer choice a ∈ C, then selecting the answer with the highest plausibility score.

We use q and a to denote the representation vectors of question q and option a. To measure the score for q and a, we first concatenate them to form a statement s = [q; a]. Then we extract from the external KG a subgraph G (i.e., a schema graph in KagNet (Lin et al., 2019)) with the guidance of s (detailed in §5.1). This contextualized subgraph is defined as a multi-relational graph G = (V, E, φ). Here V is a subset of entity nodes in the external KG, containing only entities relevant to s. E ⊆ V × R × V is the set of edges that connect nodes in V, where R = {1, …, m} are ids of all pre-defined relation types. The mapping function φ(i): V → T = {E_q, E_a, E_o} takes node i ∈ V as input and outputs E_q if i is an entity mentioned in q, E_a if it is mentioned in a, or E_o otherwise. We finally encode the statement to s, G to g, and concatenate s and g for calculating the plausibility score.

3 Background: Multi-Relational Graph Encoding Methods

We leave the encoding of s to pre-trained language models, which have shown powerful text representation ability, while we focus on the challenge of encoding the graph G to capture latent relations between entities. Current methods for encoding multi-relational graphs mainly fall into two categories: graph neural networks and path-based models. Graph neural networks encode structured information by passing messages between nodes, directly operating on the graph structure, while path-based methods first decompose the graph into paths and then pool features over all the paths.

Graph Encoding with GNNs. For a graph with n nodes, a graph neural network (GNN) takes a set of node features {h_1, h_2, …, h_n} as input and computes their corresponding node embeddings {h′_1, h′_2, …, h′_n} via message passing (Gilmer et al., 2017). A compact graph representation for G can thus be obtained by pooling over the node embeddings {h′_i}:

GNN(G) = Pool({h′_1, h′_2, …, h′_n})   (1)

As a notable variant of GNNs, graph convolutional networks (GCNs) (Kipf and Welling, 2017) additionally update node embeddings by aggregating messages from their direct neighbors. RGCNs (Schlichtkrull et al., 2018) extend GCNs to encode multi-relational graphs by defining a relation-specific weight matrix W_r for each edge type:

h′_i = σ( (Σ_{r∈R} |N_i^r|)^{-1} Σ_{r∈R} Σ_{j∈N_i^r} W_r h_j )   (2)

where N_i^r denotes the neighbors of node i under relation r.²

While GNNs have proven to have good scalability, their reasoning is done at the node level, making them incompatible with modeling path-level reasoning chains, a crucial component for QA tasks that require relational reasoning. This property also hinders the model's decisions from being interpreted at the path level.

Graph Encoding with Path-Based Models. In addition to directly modeling the graph with GNNs, one can also view a graph as a set of relational paths connecting pairs of entities.

Relation Networks (RN) (Santoro et al., 2017) can be adapted to multi-relational graph encoding under QA settings. RNs use MLPs to encode all triples (one-hop paths) in G whose head entity is in Q = {j | φ(j) = E_q} and tail entity is in A = {i | φ(i) = E_a}. They then pool the triple embeddings to generate a vector for G as follows:

RN(G) = Pool({MLP(h_j ⊕ e_r ⊕ h_i) | j ∈ Q, i ∈ A, (j, r, i) ∈ E})   (3)

Here h_j and h_i are features for nodes j and i, e_r is the embedding of relation r ∈ R, and ⊕ denotes vector concatenation.
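A short sketch of this triple encoding and pooling (Eq. 3); the MLP shape and the use of mean pooling are assumptions for illustration.

```python
import torch
import torch.nn as nn

def relation_network_encode(h, rel_emb, edges, q_nodes, a_nodes, mlp, out_dim):
    """Eq. 3: encode every Q-to-A triple with an MLP, then mean-pool.

    h:        (n, d) node features;  rel_emb: (m, d_r) relation embeddings e_r
    edges:    iterable of (j, r, i) triples
    q_nodes / a_nodes: sets of question / answer entity indices
    mlp:      module mapping concat(h_j, e_r, h_i) to an out_dim vector
    """
    triples = [torch.cat([h[j], rel_emb[r], h[i]])
               for (j, r, i) in edges if j in q_nodes and i in a_nodes]
    if not triples:
        return torch.zeros(out_dim)
    return mlp(torch.stack(triples)).mean(dim=0)

# e.g. mlp = nn.Sequential(nn.Linear(2 * d + d_r, d), nn.ReLU(), nn.Linear(d, d))
```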

² For simplicity, we assume a single graph convolutional layer. In practice, multiple layers are stacked to enable message passing from multi-hop neighbors.

[Figure 4 components: a text encoder (e.g., BERT) produces the statement vector s; node features go through (1) node-type-specific transformation, (2) multi-hop message passing with structured relational attention α(j, r_1, …, r_k, i), and (3) nonlinear activation, repeated over ×L layers; attentive pooling over node embeddings yields g, and an MLP over (g, s) outputs the plausibility score. Example multi-hop relations shown include Child →CapableOf / UsedFor⁻¹ / AtLocation→ …, Sit →AtLocation / UsedFor⁻¹→ …, and Desk →AtLocation→ ….]

Figure 4: Our proposed MHGRN architecture for relational reasoning. MHGRN takes a multi-relational graph G and a (question-answer) statement vector s as input, and outputs a scalar that represents the plausibility score of this statement.

To further equip RN with the ability to model nondegenerate paths, KagNet (Lin et al., 2019) adopts LSTMs to encode all paths connecting question entities and answer entities with lengths no more than K. It then aggregates all path embeddings via an attention mechanism:

KAGNET(G) = Pool({LSTM(j, r_1, …, r_k, i) | (j, r_1, j_1), …, (j_{k−1}, r_k, i) ∈ E, 1 ≤ k ≤ K})   (4)
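The path-enumeration-plus-LSTM encoding of Eq. 4 can be sketched as below; real KagNet additionally uses a GCN and path-level attention, while this simplified version mean-pools the LSTM outputs.

```python
import torch
import torch.nn as nn

def enumerate_qa_paths(edges, q_nodes, a_nodes, max_hops):
    """All relational paths [j, r1, j1, ..., rk, i] from Q to A with k <= max_hops."""
    frontier = [(j, [j]) for j in q_nodes]
    for _ in range(max_hops):
        nxt = []
        for node, path in frontier:
            for (u, r, v) in edges:
                if u == node:
                    new_path = path + [r, v]
                    nxt.append((v, new_path))
                    if v in a_nodes:
                        yield new_path
        frontier = nxt

def encode_paths(paths, node_emb, rel_emb, lstm):
    """Run an LSTM over each path's alternating node/relation embeddings, then pool.

    lstm is assumed to be nn.LSTM(input_size=d, hidden_size=h, batch_first=True).
    """
    vecs = []
    for path in paths:
        seq = torch.stack([node_emb[x] if t % 2 == 0 else rel_emb[x]
                           for t, x in enumerate(path)])
        out, _ = lstm(seq.unsqueeze(0))       # (1, len, hidden)
        vecs.append(out[0, -1])               # last hidden state as the path embedding
    return torch.stack(vecs).mean(dim=0)      # mean pooling instead of attention
```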

4 Proposed Method: Multi-Hop Graph Relation Networks (MHGRN)

This section presents Multi-hop Graph Relation Networks (MHGRN), a novel graph neural network architecture that unifies GNNs and RNs for encoding multi-relational graphs to augment text comprehension. MHGRN inherits the capability of path reasoning and the interpretability of path-based models, while preserving the good scalability of GNNs via the message passing formulation.

4.1 MHGRN Model Architecture

Following the introduction to GNNs in Sec. 3, we consider encoding a multi-relational graph G = (V, E, φ) into a fixed-size vector g, conditioned on the textual representation s, by first transforming the input node features {h_1, …, h_n} into node embeddings {h′_1, …, h′_n} and then pooling these embeddings. Node features can be initialized with pre-trained weights (details in Appendix A), so we focus on the computation of node embeddings.

Type-Specific Transformation. To make our model aware of the node type information φ, we first perform a node-type-specific linear transformation on the input node features:

x_i = U_{φ(i)} h_i + b_{φ(i)}   (5)

where the learnable parameters U and b are specific to the type of node i.
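A small PyTorch sketch of Eq. 5, assuming the three node types {E_q, E_a, E_o} are indexed 0-2:

```python
import torch
import torch.nn as nn

class TypeSpecificTransform(nn.Module):
    """x_i = U_{phi(i)} h_i + b_{phi(i)}  (Eq. 5): one affine map per node type."""
    def __init__(self, dim, num_types=3):
        super().__init__()
        self.linears = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_types))

    def forward(self, h, node_type):
        # h: (n, dim) input node features; node_type: length-n sequence of type ids
        return torch.stack([self.linears[int(t)](h[i])
                            for i, t in enumerate(node_type)])
```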

Multi-Hop Message Passing. As mentioned before, our motivation is to endow GNNs with the capability of directly modeling paths. To this end, we propose to pass messages directly over all the relational paths of lengths up to K, where K is a hyperparameter. The set of valid k-hop relational paths is defined as

Φ_k = {(j, r_1, …, r_k, i) | (j, r_1, j_1), …, (j_{k−1}, r_k, i) ∈ E}   (1 ≤ k ≤ K)   (6)

We perform k-hop (1 ≤ k ≤ K) message passing over these paths, which is a generalization of the single-hop message passing in RGCN (see Eq. 2):

z_i^k = Σ_{(j, r_1, …, r_k, i) ∈ Φ_k} [ α(j, r_1, …, r_k, i) / d_i^k ] · W_0^K ⋯ W_0^{k+1} W_{r_k}^k ⋯ W_{r_1}^1 x_j   (1 ≤ k ≤ K)   (7)

where the W_r^t (1 ≤ t ≤ K, 0 ≤ r ≤ m) matrices are learnable,³ α(j, r_1, …, r_k, i) is an attention score elaborated in §4.2, and d_i^k = Σ_{(j, …, i) ∈ Φ_k} α(j, …, i) is the normalization factor. The {W_{r_k}^k ⋯ W_{r_1}^1 | 1 ≤ r_1, …, r_k ≤ m} matrices can be interpreted as a low-rank approximation of an (m × ⋯ × m) × d × d tensor (with k relation dimensions) that assigns a separate transformation to each k-hop relation, where d is the dimension of x_i.
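For clarity, the following sketch computes z_i^k of Eq. 7 by explicitly enumerating k-hop paths and assuming uniform attention α ≡ 1 (so d_i^k is just the path count); the structured attention of §4.2 and the linear-time dynamic program of Appendix B are left out here.

```python
import torch

def k_hop_messages(x, edges, W, k):
    """Naive computation of z^k in Eq. 7 with alpha = 1.

    x:     (n, d) transformed node features
    edges: list of (j, r, i) triples with relation ids r in {1, ..., m}
    W:     (K, m + 1, d, d); W[t - 1, r] is W_r^t, and W[t - 1, 0] is the padding matrix W_0^t
    """
    n, d = x.shape
    K = W.shape[0]
    z = torch.zeros(n, d)
    count = torch.zeros(n)
    # enumerate all k-hop relational paths (j, r_1, ..., r_k, i)
    paths = [(j, [r], i) for (j, r, i) in edges]
    for _ in range(k - 1):
        paths = [(j, rels + [r2], i2)
                 for (j, rels, i) in paths
                 for (j2, r2, i2) in edges if j2 == i]
    for (j, rels, i) in paths:
        msg = x[j]
        for t, r in enumerate(rels, start=1):     # apply W_{r_1}^1, ..., W_{r_k}^k in order
            msg = W[t - 1, r] @ msg
        for t in range(k + 1, K + 1):             # then the padding matrices W_0^{k+1} ... W_0^K
            msg = W[t - 1, 0] @ msg
        z[i] += msg
        count[i] += 1
    return z / count.clamp(min=1).unsqueeze(-1)   # divide by d_i^k (here, the path count)
```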

Incoming messages from paths of different lengths are aggregated via an attention mechanism (Vaswani et al., 2017):

z_i = Σ_{k=1}^{K} softmax(bilinear(s, z_i^k)) · z_i^k   (8)

³ The W_0^t (1 ≤ t ≤ K) are introduced as padding matrices so that K transformations are applied regardless of k, thus ensuring comparable scale of z_i^k across different k.
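A sketch of the hop-level aggregation in Eq. 8; modeling the bilinear score with nn.Bilinear is an assumption about the exact parameterization.

```python
import torch
import torch.nn as nn

class HopAttention(nn.Module):
    """z_i = sum_k softmax_k(bilinear(s, z_i^k)) * z_i^k  (Eq. 8)."""
    def __init__(self, s_dim, d):
        super().__init__()
        self.bilinear = nn.Bilinear(s_dim, d, 1)

    def forward(self, s, z_hops):
        # s: (s_dim,) statement vector;  z_hops: (K, n, d) hop-wise messages z^1 ... z^K
        K, n, d = z_hops.shape
        s_rep = s.unsqueeze(0).unsqueeze(0).expand(K, n, -1)
        scores = self.bilinear(s_rep.reshape(K * n, -1),
                               z_hops.reshape(K * n, d)).view(K, n)
        weights = torch.softmax(scores, dim=0).unsqueeze(-1)   # normalize across hops
        return (weights * z_hops).sum(dim=0)                   # (n, d) aggregated messages
```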

Non-linear Activation. Finally, we apply a shortcut connection and a nonlinear activation to obtain the output node embeddings:

h′_i = σ(V h_i + V′ z_i)   (9)

where V and V′ are learnable model parameters, and σ is a non-linear activation function.

4.2 Structured Relational Attention

The remaining problem is how to effectively parameterize the attention score α(j, r_1, …, r_k, i) in Eq. 7 for all possible k-hop paths without introducing O(m^k) parameters. We first regard it as the probability of a relation sequence (φ(j), r_1, …, r_k, φ(i)) conditioned on s:

α(j, r_1, …, r_k, i) = p(φ(j), r_1, …, r_k, φ(i) | s)   (10)

which can naturally be modeled by a probabilistic graphical model such as a conditional random field (Lafferty et al., 2001):

p(⋯ | s) ∝ exp( f(φ(j), s) + Σ_{t=1}^{k} δ(r_t, s) + Σ_{t=1}^{k−1} τ(r_t, r_{t+1}) + g(φ(i), s) )
         ≜ β(r_1, …, r_k, s) [relation type attention] · γ(φ(j), φ(i), s) [node type attention]   (11)

where f(·), δ(·) and g(·) are parameterized by two-layer MLPs, and τ(·) by a transition matrix of shape m × m. Intuitively, β(·) models the importance of a k-hop relation, while γ(·) models the importance of messages from node type φ(j) to node type φ(i) (e.g., the model can learn to pass messages only from question entities to answer entities).

Our model scores a k-hop relation by decomposing it into both context-aware single-hop relations (modeled by δ) and two-hop relations (modeled by τ). We argue that τ is indispensable: without it, the model may assign high importance to illogical multi-hop relations (e.g., [AtLocation, CapableOf]) or noisy relations (e.g., [RelatedTo, RelatedTo]).
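The factorized score of Eq. 11 for one path can be sketched as follows; the embedding sizes, and feeding [s; embedding] into the two-layer MLPs for f, δ and g, are assumptions about the exact parameterization.

```python
import torch
import torch.nn as nn

class StructuredRelAttention(nn.Module):
    """Unnormalized alpha(j, r_1, ..., r_k, i) = beta * gamma from Eq. 11."""
    def __init__(self, s_dim, num_rels, num_types=3, emb=64, hid=64):
        super().__init__()
        self.rel_emb = nn.Embedding(num_rels, emb)
        self.type_emb = nn.Embedding(num_types, emb)
        def two_layer_mlp():   # scalar potential over [s; embedding]
            return nn.Sequential(nn.Linear(s_dim + emb, hid), nn.ReLU(), nn.Linear(hid, 1))
        self.f, self.delta, self.g = two_layer_mlp(), two_layer_mlp(), two_layer_mlp()
        self.tau = nn.Parameter(torch.zeros(num_rels, num_rels))  # m x m transition matrix

    def forward(self, s, src_type, rels, dst_type):
        # s: (s_dim,); src_type, dst_type: node-type ids; rels: list of relation ids r_1..r_k
        score = self.f(torch.cat([s, self.type_emb.weight[src_type]])) \
              + self.g(torch.cat([s, self.type_emb.weight[dst_type]]))
        for t, r in enumerate(rels):
            score = score + self.delta(torch.cat([s, self.rel_emb.weight[r]]))
            if t + 1 < len(rels):
                score = score + self.tau[r, rels[t + 1]]
        # exp(f + sum delta + sum tau + g); normalization over paths happens in Eq. 7 via d_i^k
        return torch.exp(score)
```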

4.3 Computation Complexity Analysis

Although the multi-hop message passing process in Eq. 7 and the structured relational attention module in Eq. 11 handle a potentially exponential number of paths, we show that they can be computed in

Model            Time               Space

G is a dense graph
K-hop KagNet     O(m^K n^{K+1} K)   O(m^K n^{K+1} K)
K-layer RGCN     O(m n^2 K)         O(m n K)
MHGRN            O(m^2 n^2 K)       O(m n K)

G is a sparse graph with maximum node degree ∆ ≪ n
K-hop KagNet     O(m^K n ∆^K K)     O(m^K n ∆^K K)
K-layer RGCN     O(m n K ∆)         O(m n K)
MHGRN            O(m^2 n K ∆)       O(m n K)

Table 2: Computation complexity of different K-hop reasoning models on a dense/sparse multi-relational graph with n nodes and m relation types. Despite the quadratic complexity w.r.t. m, MHGRN's time cost is similar to RGCN on GPUs with parallelizable matrix multiplications (cf. Fig. 7).

linear time using dynamic programming (see Appendix B). As summarized in Table 2, the time complexity and space complexity of our model on a sparse graph are O(m^2 n K ∆) and O(m n K) respectively, both of which are linear with respect to either the path length K or the number of nodes n.

4.4 Expressive Power of MHGRN

In addition to efficiency and scalability, we now discuss the modeling capacity of our model. With the message passing formulation and relation-specific transformations, MHGRN is by nature a generalization of RGCN. Furthermore, it is capable of directly modeling paths, making it interpretable in the same way as path-based models like RN and KagNet. To show this, we first generalize RN (Eq. 3) to the multi-hop setting and introduce K-hop RN (see Appendix C for a formal definition). K-hop RN models a multi-hop relation as the composition of single-hop relations with element-wise multiplication. We show that MHGRN is capable of representing K-hop RN (see Appendix D for the proof).

4.5 Learning, Inference and Path Decoding

We now discuss the learning and inference process of MHGRN, instantiated for the task of multiple-choice question answering. Following the problem formulation in Sec. 2, we aim to determine the plausibility of an answer candidate a ∈ C given the question q, with the information from both the text s and the graph G. We first obtain the graph representation g by performing attentive pooling over the output node embeddings of answer

Methods                            BERT-Base                        BERT-Large                       RoBERTa-Large
                                   IHdev-Acc (%)   IHtest-Acc (%)   IHdev-Acc (%)   IHtest-Acc (%)   IHdev-Acc (%)   IHtest-Acc (%)

w/o KG                             57.31 (±1.07)   53.47 (±0.87)    61.06 (±0.85)   55.39 (±0.40)    73.07 (±0.45)   68.69 (±0.56)
RGCN (Schlichtkrull et al., 2018)  56.94 (±0.38)   54.50 (±0.56)    62.98 (±0.82)   57.13 (±0.36)    72.69 (±0.19)   68.41 (±0.66)
GconAttn (Wang et al., 2019)       57.29 (±0.62)   54.41 (±0.50)    62.98 (±0.17)   56.94 (±0.77)    73.38 (±0.27)   69.88 (±0.47)
KagNet† (Lin et al., 2019)         55.57           56.19            62.35           57.16            -               -
RN (1-hop)                         58.27 (±0.22)   56.20 (±0.45)    63.04 (±0.58)   58.46 (±0.71)    74.57 (±0.91)   70.08 (±0.21)
RN (2-hop)                         59.81 (±0.76)   56.61 (±0.68)    63.36 (±0.26)   58.92 (±0.14)    73.65 (±3.09)   69.59 (±3.80)
MHGRN                              60.36 (±0.23)   57.23 (±0.82)    63.29 (±0.51)   60.59 (±0.58)    74.45 (±0.10)   71.11 (±0.81)

Table 3: Performance comparison on the CommonsenseQA in-house split. We report in-house Dev (IHdev) and Test (IHtest) accuracy (mean and standard deviation of four runs) using the data split of Lin et al. (2019) on CommonsenseQA. † indicates reported results in its paper.

entities {h′_i | i ∈ A}. Next, we simply concatenate it with the text representation s and compute the plausibility score by ρ(q, a) = MLP(s ⊕ g).

During training, we maximize the plausibility score of the correct answer â by minimizing the cross-entropy loss:

L = E_{q, â, C} [ −log ( exp(ρ(q, â)) / Σ_{a∈C} exp(ρ(q, a)) ) ]   (12)

The whole model is trained end-to-end, jointly with the text encoder (e.g., RoBERTa).
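A compact sketch of the scoring and training objective (Eq. 12); the statement and graph vectors are assumed to come from the text encoder and MHGRN respectively, and the MLP head is a placeholder.

```python
import torch
import torch.nn.functional as F

def qa_loss(statement_vecs, graph_vecs, correct_idx, score_mlp):
    """Cross-entropy over answer choices (Eq. 12).

    statement_vecs: (|C|, d_s) encoded statements s = [q; a], one per answer choice
    graph_vecs:     (|C|, d_g) pooled graph representations g, one per answer choice
    correct_idx:    index of the correct answer within C
    score_mlp:      module mapping concat(s, g) to a scalar plausibility score rho(q, a)
    """
    scores = score_mlp(torch.cat([statement_vecs, graph_vecs], dim=-1)).squeeze(-1)  # (|C|,)
    return F.cross_entropy(scores.unsqueeze(0), torch.tensor([correct_idx]))
```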

During inference, we predict the most plausible answer by argmax_{a∈C} ρ(q, a). Additionally, we can decode a reasoning path as evidence for model predictions, endowing our model with the interpretability enjoyed by path-based models. Specifically, we first determine the answer entity i* with the highest score in the pooling layer and the path length k* with the highest score in Eq. 8. Then the reasoning path is decoded by argmax α(j, r_1, …, r_{k*}, i*), which can be computed in linear time using dynamic programming.

5 Experimental Setup

We introduce how we construct G (§5.1), the datasets (§5.2), as well as the baseline methods (§5.3). Appendix A shows more implementation and experimental details for reproducibility.

5.1 Extracting G from External KG

We use ConceptNet (Speer et al., 2017), a general-domain knowledge graph, as our external KG to test models' ability to harness a structured knowledge source. Following KagNet (Lin et al., 2019), we merge relation types to increase graph density and add reverse relations to construct a multi-relational graph with 34 relation types. To extract an informative contextual graph G from the KG, we recognize entity mentions in s and link them to entities in

Methods                                  Single   Ensemble

RoBERTa†                                 72.1     72.5
ALBERT† (Lan et al., 2019)               -        76.5
XLNet + GR† (Lv et al., 2019)            75.3     -
RoBERTa + KEDGN†                         72.5     74.4
RoBERTa + KE†                            73.3     -
RoBERTa + HyKAS 2.0† (Ma et al., 2019)   73.2     -
XLNet + DREAM†                           66.9     73.3
RoBERTa + FreeLB† (Zhu et al., 2020)     72.2     73.1

MHGRN (RoBERTa, K = 2)                   75.4     76.5

Table 4: Performance comparison on the official test set of CommonsenseQA with leaderboard SoTAs⁴ (accuracy in %). † indicates reported results on the leaderboard. MHGRN achieves state-of-the-art performance in both the single-model and ensemble settings with the RoBERTa-large encoder.

ConceptNet, with which we initialize our node set V. We then add to V all the entities that appear on any two-hop path between pairs of mentioned entities. Unlike KagNet, we do not perform any pruning, but instead retain all the edges between nodes in V, forming our G.
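A hedged sketch of this extraction step using networkx; `statement_concepts` (the ConceptNet entities linked from s) and the `kg` MultiDiGraph are assumed data representations, not the released preprocessing code.

```python
import itertools
import networkx as nx

def extract_contextual_graph(statement_concepts, kg: nx.MultiDiGraph) -> nx.MultiDiGraph:
    """Ground concepts, add every entity on a <=2-hop path between grounded pairs,
    and keep all edges among the resulting node set (no pruning)."""
    nodes = set(c for c in statement_concepts if c in kg)
    for a, b in itertools.combinations(sorted(nodes), 2):
        # kg is assumed to already contain reverse relations, so directed paths suffice
        for path in nx.all_simple_paths(kg, a, b, cutoff=2):
            nodes.update(path)
    return kg.subgraph(nodes).copy()
```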

5.2 Datasets

We evaluate models on two multiple-choice question answering datasets, CommonsenseQA and OpenbookQA. Both require world knowledge beyond textual understanding to perform well.

CommonsenseQA (Talmor et al., 2019) necessitates various commonsense reasoning skills. The questions are created with entities from ConceptNet and are designed to probe latent compositional relations between entities in ConceptNet.

OpenBookQA (Mihaylov et al., 2018) provides elementary science questions together with an open

⁴ Models based on ConceptNet are no longer shown on the leaderboard, and thus we obtained our results directly from the organizers.

book of science facts. This dataset also probes general commonsense knowledge beyond the provided facts. As our model is orthogonal to text-form knowledge retrieval, we do not utilize the provided open book and instead use ConceptNet. Consequently, we do not compare our methods with those using the open book.

5.3 Compared Methods

We implement both knowledge-agnostic fine-tuning of pre-trained LMs and models that incorporate a KG as an external source as our baselines. Additionally, we directly compare our model with the results from the corresponding leaderboards. These methods typically leverage textual knowledge or extra training data, as opposed to an external KG. In all our implemented models, we use pre-trained LMs as text encoders for s for fair comparison. We stick to our focus of encoding structured KGs, and therefore do not compare our models with those (Pan et al., 2019; Zhang et al., 2018; Sun et al., 2019; Banerjee et al., 2019) augmented by other text-form external knowledge (e.g., Wikipedia).

Specifically, we fine-tune BERT-BASE, BERT-LARGE (Devlin et al., 2019), and ROBERTA (Liu et al., 2019b) for multiple-choice questions. We take RGCN (Eq. 2 in Sec. 3), RN⁵ (Eq. 3 in Sec. 3), KagNet (Eq. 4 in Sec. 3), and GconAttn (Wang et al., 2019) as baselines. GconAttn generalizes match-LSTM (Wang and Jiang, 2016) from sequence modeling to entity set modeling and achieves success in language inference tasks.

6 Results and Discussions

In this section, we present the results of our models in comparison with baselines as well as methods on the leaderboards for both CommonsenseQA and OpenbookQA. We also provide an analysis of the models' components and characteristics.

6.1 Main Results

For CommonsenseQA (Table 3), we first use the in-house data split of Lin et al. (2019) to compare our models with our implemented baselines. In this setting, we take 1,241 examples from the official training examples as our in-house test examples and regard the remaining 8,500 ones as our in-house training examples. Almost all KG-augmented models achieve performance gains over vanilla pre-trained

⁵ We use mean pooling for 1-hop RN and attentive pooling for 2-hop RN (detailed in Appendix C).

Methods                       Dev (%)          Test (%)

BERT (w/o KG)†                60.4             60.4
BERT Multi-Task (w/o KG)†     -                63.8
GapQA† (Khot et al., 2019)    -                59.4 (±1.30)

RoBERTa-Large (w/o KG)        66.76 (±1.14)    64.80 (±2.37)
+ RGCN                        64.65 (±1.96)    62.45 (±1.57)
+ GconAttn                    66.85 (±1.82)    64.75 (±1.48)
+ RN (1-hop)                  64.85 (±1.11)    63.65 (±2.31)
+ RN (2-hop)                  67.00 (±0.71)    65.20 (±1.18)
+ MHGRN (K = 3)               68.10 (±1.02)    66.85 (±1.19)

Table 5: Dev and Test accuracy on OpenbookQA. † indicates reported results on the leaderboard.

Methods                                      IHdev-Acc (%)

MHGRN (K = 3)                                74.45 (±0.10)
- Type-specific transformation (§4.1)        73.16 (±0.28)
- Structured relational attention (§4.2)     73.26 (±0.31)
- Relation type attention (§4.2)             73.55 (±0.68)
- Node type attention (§4.2)                 73.92 (±0.65)

Table 6: Ablation study on model components (removing one component each time) using ROBERTA-LARGE as the text encoder. We report the IHdev accuracy on CommonsenseQA.

LMs, demonstrating the value of external knowledge on this dataset. Additionally, we evaluate our MHGRN (with ROBERTA-LARGE as the text encoder) on the official split (Table 4) for fair comparison with other methods on the leaderboard, in both the single-model setting and the ensemble-model setting. In both cases, we achieve state-of-the-art performance across all existing models.

For OpenbookQA (Table 5), we use the official split and build models with ROBERTA-LARGE as the text encoder. Overall, external knowledge still brings benefits to this task. Our model surpasses all baselines with an absolute increase of ~2% on Test. As we seek to evaluate the performance of models in terms of encoding external knowledge graphs, we do not compare with models based on document retrieval and reading comprehension. Note that other submissions on the OBQA leaderboard are usually fine-tuned on external question answering datasets or use a large corpus via information retrieval. As we seek to systematically examine knowledge-aware reasoning methods that reason over interpretable structures, we do not include comparisons with models that are beyond this scope. Other fine-tuned PTLMs can simply be incorporated as well whenever they are available.

6.2 Performance Analysis

Ablation Study on Model Components. We assess the impact of our model's components, shown in Table 6. Disabling the type-specific transformation results in a ~1.3% drop in performance, demonstrating the need to distinguish node types for QA tasks. Our structured relational attention mechanism is also critical, with its two sub-components contributing almost equally.

Impact of the Amount of Training Data. We use different fractions of the CommonsenseQA training data and report results of fine-tuning text encoders alone and of jointly training the text encoder and graph encoder in Fig. 5. Regardless of the training data fraction, our model shows consistently larger performance improvements over knowledge-agnostic fine-tuning compared with the other graph encoding methods, indicating MHGRN's complementary strengths to text encoders.

[Figure 5 plot: accuracy (%) from 45 to 65 vs. percentage of training data (20-100%), with curves for MHGRN, GconAttn, and w/o KG.]

Figure 5: Performance change (accuracy in %) w.r.t. the amount of training data on the CommonsenseQA IHtest set (same as Lin et al. (2019)).

Impact of the Number of Hops (K). We investigate the impact of the hyperparameter K on MHGRN via its performance on CommonsenseQA (shown in Fig. 6). Increasing K brings consistent gains up to K = 3, but performance begins to drop when K > 3. This might be attributed to the exponential growth of noisy relational paths as paths in the knowledge graph get longer.

6.3 Model Scalability

Fig. 7 presents the computation cost of MHGRN and RGCN (measured by training time). The computation cost of both models grows linearly w.r.t. K. Although the theoretical complexity of MHGRN is m times that of RGCN, the ratio of their empirical costs only approaches 2, demonstrating that our model can be well parallelized.

[Figure 6 data (IHdev accuracy % by reasoning hops): K=1: 73.24, K=2: 73.93, K=3: 74.65, K=4: 73.65, K=5: 74.16.]

Figure 6: Effect of K in MHGRN. We show the IHdev accuracy of MHGRN on CommonsenseQA using the data split of Lin et al. (2019) with different numbers of hops.

[Figure 7 plot: seconds per batch (0.0-2.0) vs. number of reasoning hops K (2-8), with curves for RGCN and MHGRN.]

Figure 7: Analysis of model scalability. Comparison of per-batch training efficiency w.r.t. the number of hops K.

6.4 Model Interpretability

We can analyze our model's reasoning process by decoding the reasoning path using the method described in §4.5. Fig. 8 shows two examples from CommonsenseQA where our model correctly answers the questions and provides reasonable path evidence. In the example on the left, the model links question entities and the answer entity in a chain to support reasoning, while the example on the right provides a case where our model leverages unmentioned entities to bridge the reasoning gap between the question entity and the answer entities, in a way that is coherent with the implied relation between CHAPEL and the desired answer in the question.

[Figure 8 content: question "Why do parents encourage their kids to play baseball?" (A. round B. cheap C. break window D. hard E. fun to play*) with decoded path BASEBALL →UsedFor⁻¹→ PLAY_BASEBALL →HasProperty→ FUN_TO_PLAY; and question "Where is known for a multitude of wedding chapels?" (A. town B. texas C. city D. church building E. Nevada*) with decoded path CHAPEL →AtLocation→ RENO →PartOf→ NEVADA.]

Figure 8: Case study on model interpretability. We present two sampled questions from CommonsenseQA with the reasoning paths output by MHGRN.

6.5 Potential Compatibility with Other Methods

In theory, our approach is naturally compatible with methods that utilize textual knowledge or extra data (such as the leaderboard methods in Table 4), because in our paradigm the encoding of the textual statement and of the graph are structurally decoupled (Fig. 3). We can, for example, take the fine-tuned RoBERTa+KE⁶ system as our text encoder and leave the rest of our model architecture unchanged.

7 Related Work

Knowledge-Aware Methods for NLP. Various works have investigated the potential to empower NLP models with external knowledge. Many attempt to extract structured knowledge, either in the form of nodes (Yang and Mitchell, 2017; Wang et al., 2019), triples (Weissenborn et al., 2017; Mihaylov and Frank, 2018), paths (Bauer et al., 2018; Kundu et al., 2019; Lin et al., 2019), or subgraphs (Li and Clark, 2015), and encode them to augment textual understanding.

The recent success of pre-trained LMs motivates many works (Pan et al., 2019; Ye et al., 2019; Zhang et al., 2018; Li et al., 2019; Banerjee et al., 2019) to probe LMs' potential as latent knowledge bases. This line of work turns to textual knowledge (e.g., Wikipedia) to directly impart knowledge to pre-trained LMs. They generally fall into two paradigms: 1) fine-tuning LMs on large-scale general-domain datasets (e.g., RACE (Lai et al., 2017)) or on knowledge-rich text; 2) providing LMs with evidence via information retrieval techniques. However, these models cannot provide explicit reasoning and evidence, and are thus hardly trustworthy. They are also subject to the availability of in-domain datasets and the maximum input length of pre-trained LMs.

Neural Graph Encoding. Graph Attention Networks (GAT) (Velickovic et al., 2018) incorporate an attention mechanism in feature aggregation. RGCN (Schlichtkrull et al., 2018) proposes relational message passing, which makes it applicable to multi-relational graphs. However, these models only perform single-hop message passing and cannot be interpreted at the path level. Other works (Abu-El-Haija et al., 2019; Nikolentzos et al., 2019) aggregate, for each node, its K-hop neighbors based on node-wise distances, but they are designed for non-relational graphs. MHGRN addresses these issues by reasoning on multi-relational graphs and being interpretable via maintaining paths as reasoning chains.

⁶ https://github.com/jose77/csqa

8 Conclusion

We present a principled, scalable method, MHGRN, that can leverage general knowledge via multi-hop reasoning over interpretable structures (e.g., ConceptNet). The proposed MHGRN generalizes and combines the advantages of GNNs and path-based reasoning models. It explicitly performs multi-hop relational reasoning and is empirically shown to outperform existing methods with superior scalability and interpretability. Our extensive experiments systematically compare MHGRN with other existing knowledge-aware methods. In particular, we achieve state-of-the-art performance on the CommonsenseQA dataset.

References

Sami Abu-El-Haija, Bryan Perozzi, Amol Kapoor, Nazanin Alipourfard, Kristina Lerman, Hrayr Harutyunyan, Greg Ver Steeg, and Aram Galstyan. 2019. MixHop: Higher-order graph convolutional architectures via sparsified neighborhood mixing. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 21–29. PMLR.

Pratyay Banerjee, Kuntal Kumar Pal, Arindam Mitra, and Chitta Baral. 2019. Careful selection of knowledge to solve open book question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6120–6129, Florence, Italy. Association for Computational Linguistics.

Lisa Bauer, Yicheng Wang, and Mohit Bansal. 2018. Commonsense for generative multi-hop question answering tasks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 4220–4230. Association for Computational Linguistics.

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. CoRR, abs/1803.05457.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. 2017. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages 1263–1272. PMLR.

Tushar Khot, Ashish Sabharwal, and Peter Clark. 2019. What's missing: A knowledge gap guided approach for multi-hop question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2814–2828, Hong Kong, China. Association for Computational Linguistics.

Thomas N. Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net.

Souvik Kundu, Tushar Khot, Ashish Sabharwal, and Peter Clark. 2019. Exploiting explicit paths for multi-hop reading comprehension. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2737–2747, Florence, Italy. Association for Computational Linguistics.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), Williams College, Williamstown, MA, USA, June 28 - July 1, 2001, pages 282–289. Morgan Kaufmann.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale ReAding comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785–794, Copenhagen, Denmark. Association for Computational Linguistics.

Zhen-Zhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A lite BERT for self-supervised learning of language representations. ArXiv, abs/1909.11942.

Shiyang Li, Jianshu Chen, and Dian Yu. 2019. Teaching pretrained models with commonsense reasoning: A preliminary KB-based approach. CoRR, abs/1909.09743.

Yang Li and Peter Clark. 2015. Answering elementary science questions by constructing coherent scenes using background knowledge. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2007–2012, Lisbon, Portugal. Association for Computational Linguistics.

Bill Yuchen Lin, Xinyue Chen, Jamin Chen, and Xiang Ren. 2019. KagNet: Knowledge-aware graph networks for commonsense reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2829–2839, Hong Kong, China. Association for Computational Linguistics.

Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. 2019a. On the variance of the adaptive learning rate and beyond. CoRR, abs/1908.03265.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.

Shangwen Lv, Daya Guo, Jingjing Xu, Duyu Tang, Nan Duan, Ming Gong, Linjun Shou, Daxin Jiang, Guihong Cao, and Songlin Hu. 2019. Graph-based reasoning over heterogeneous external knowledge for commonsense question answering. CoRR, abs/1909.05311.

Kaixin Ma, Jonathan Francis, Quanyang Lu, Eric Nyberg, and Alessandro Oltramari. 2019. Towards generalizable neuro-symbolic systems for commonsense question answering. In Proceedings of the First Workshop on Commonsense Inference in Natural Language Processing, pages 22–32, Hong Kong, China. Association for Computational Linguistics.

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381–2391, Brussels, Belgium. Association for Computational Linguistics.

Todor Mihaylov and Anette Frank. 2018. Knowledgeable reader: Enhancing cloze-style reading comprehension with external commonsense knowledge. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 821–832, Melbourne, Australia. Association for Computational Linguistics.

Giannis Nikolentzos, George Dasoulas, and Michalis Vazirgiannis. 2019. k-hop graph neural networks. CoRR, abs/1907.06051.

Xiaoman Pan, Kai Sun, Dian Yu, Heng Ji, and Dong Yu. 2019. Improving question answering with external knowledge. CoRR, abs/1902.00993.

Adam Santoro, David Raposo, David G. T. Barrett, Mateusz Malinowski, Razvan Pascanu, Peter W. Battaglia, and Tim Lillicrap. 2017. A simple neural network module for relational reasoning. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 4967–4976.

Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. 2019. Social IQa: Commonsense reasoning about social interactions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4463–4473, Hong Kong, China. Association for Computational Linguistics.

Michael Sejr Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. 2018. Modeling relational data with graph convolutional networks. In The Semantic Web - 15th International Conference, ESWC 2018, Heraklion, Crete, Greece, June 3-7, 2018, Proceedings, volume 10843 of Lecture Notes in Computer Science, pages 593–607. Springer.

Robyn Speer, Joshua Chin, and Catherine Havasi. 2017. ConceptNet 5.5: An open multilingual graph of general knowledge. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, pages 4444–4451. AAAI Press.

Kai Sun, Dian Yu, Dong Yu, and Claire Cardie. 2019. Improving machine reading comprehension with general reading strategies. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2633–2643, Minneapolis, Minnesota. Association for Computational Linguistics.

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158, Minneapolis, Minnesota. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 5998–6008.

Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2018. Graph attention networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net.

Shuohang Wang and Jing Jiang. 2016. Learning natural language inference with LSTM. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1442–1451, San Diego, California. Association for Computational Linguistics.

Xiaoyan Wang, Pavan Kapanipathi, Ryan Musa, Mo Yu, Kartik Talamadupula, Ibrahim Abdelaziz, Maria Chang, Achille Fokoue, Bassem Makni, Nicholas Mattei, and Michael Witbrock. 2019. Improving natural language inference using external knowledge in the science questions domain. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019, pages 7208–7215. AAAI Press.

Dirk Weissenborn, Tomas Kocisky, and Chris Dyer. 2017. Dynamic integration of background knowledge in neural NLU systems. arXiv preprint arXiv:1706.02596.

An Yang, Quan Wang, Jing Liu, Kai Liu, Yajuan Lyu, Hua Wu, Qiaoqiao She, and Sujian Li. 2019. Enhancing pre-trained language representations with rich knowledge for machine reading comprehension. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2346–2357, Florence, Italy. Association for Computational Linguistics.

Bishan Yang and Tom Mitchell. 2017. Leveraging knowledge bases in LSTMs for improving machine reading. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1436–1446, Vancouver, Canada. Association for Computational Linguistics.

Zhi-Xiu Ye, Qian Chen, Wen Wang, and Zhen-Hua Ling. 2019. Align, mask and select: A simple method for incorporating commonsense knowledge into language representation models. CoRR, abs/1908.06725.

Yuyu Zhang, Hanjun Dai, Kamil Toraman, and Le Song. 2018. KG^2: Learning to reason science exam questions with contextual knowledge graph embeddings. CoRR, abs/1805.12393.

Chen Zhu, Yu Cheng, Zhe Gan, Siqi Sun, Tom Goldstein, and Jingjing Liu. 2020. FreeLB: Enhanced adversarial training for natural language understanding. In International Conference on Learning Representations.

A Implementation Details

                   CommonsenseQA   OpenbookQA

BERT-BASE          3 × 10⁻⁵        -
BERT-LARGE         2 × 10⁻⁵        -
ROBERTA-LARGE      1 × 10⁻⁵        1 × 10⁻⁵

Table 7: Learning rates for text encoders on different datasets.

               CommonsenseQA   OpenbookQA

RN             3 × 10⁻⁴        3 × 10⁻⁴
RGCN           1 × 10⁻³        1 × 10⁻³
GconAttn       3 × 10⁻⁴        1 × 10⁻⁴
KV-Memory      1 × 10⁻³        1 × 10⁻³
MHGRN          1 × 10⁻³        1 × 10⁻³

Table 8: Learning rates for graph encoders on different datasets.

Our models are implemented in PyTorch. We use the cross-entropy loss and the RAdam (Liu et al., 2019a) optimizer. We find it beneficial to use separate learning rates for the text encoder and the graph encoder, and we tune the learning rates for both on the two datasets. We first fine-tune ROBERTA-LARGE, BERT-LARGE, and BERT-BASE on CommonsenseQA and ROBERTA-LARGE on OpenbookQA, respectively, and choose a dataset-specific learning rate from {1 × 10⁻⁵, 2 × 10⁻⁵, 3 × 10⁻⁵, 6 × 10⁻⁵, 1 × 10⁻⁴} for each text encoder based on the best performance on the development set, as listed in Table 7. We report the performance of these fine-tuned text encoders and also adopt their dataset-specific optimal learning rates in joint training with graph encoders. For models that involve a KG, the learning rates of their graph encoders⁷ are chosen from {1 × 10⁻⁴, 3 × 10⁻⁴, 1 × 10⁻³, 3 × 10⁻³} based on their best development set performance with ROBERTA-LARGE as the text encoder. We report the optimal learning rates for graph encoders in Table 8. In training, we set the maximum input sequence length of text encoders to 64, the batch size to 32, and perform early stopping.

For the input node features, we first use templates to turn knowledge triples in CommonsenseQA into sentences and feed them into pre-trained BERT-LARGE, obtaining a sequence of

⁷ KV-Memory actually encodes triples to enhance the token representations of s, instead of encoding the graph alone. But for simplicity, we also refer to it as a graph encoder.

token embeddings from the last layer of BERT-LARGE for each triple. For each entity in CommonsenseQA, we perform mean pooling over the tokens of the entity's occurrences across all the sentences to form a 1024-dimensional vector as its corresponding node feature. We use this set of features for all our implemented models.
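A sketch of this feature-initialization step with the HuggingFace transformers library; the sentence templates themselves are not shown, and the character-offset matching of entity mentions is an illustrative simplification.

```python
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-large-uncased")
bert = BertModel.from_pretrained("bert-large-uncased").eval()

def entity_feature(entity, sentences):
    """Mean-pool BERT-large token embeddings of `entity` over its occurrences in the
    templated triple sentences, yielding a 1024-d node feature."""
    vecs = []
    for sent in sentences:
        enc = tokenizer(sent, return_offsets_mapping=True, return_tensors="pt")
        with torch.no_grad():
            hidden = bert(input_ids=enc["input_ids"],
                          attention_mask=enc["attention_mask"]).last_hidden_state[0]
        start = sent.lower().find(entity.lower())
        if start < 0:
            continue
        end = start + len(entity)
        # keep sub-word tokens whose character span overlaps the entity mention
        keep = [t for t, (cs, ce) in enumerate(enc["offset_mapping"][0].tolist())
                if cs < end and ce > start and ce > cs]
        if keep:
            vecs.append(hidden[keep].mean(dim=0))
    return torch.stack(vecs).mean(dim=0)   # 1024-d node feature
```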

We use a 2-layer RGCN and a single-layer MHGRN across our experiments.

B Dynamic Programming Algorithm for Eq. 7

To show that multi-hop message passing can be computed in linear time, we observe that Eq. 7 can be re-written in matrix form:

Z^k = (D^k)^{-1} Σ_{(r_1, …, r_k) ∈ R^k} β(r_1, …, r_k, s) · G A_{r_k} ⋯ A_{r_1} F X (W_{r_1}^1)^⊤ ⋯ (W_{r_k}^k)^⊤ (W_0^{k+1})^⊤ ⋯ (W_0^K)^⊤   (1 ≤ k ≤ K)   (13)

where G = diag(exp([g(φ(v_1), s), …, g(φ(v_n), s)])) (F is similarly defined), A_r is the adjacency matrix for relation r, and D^k is defined as follows:

D^k = diag( Σ_{(r_1, …, r_k) ∈ R^k} β(r_1, …, r_k, s) · G A_{r_k} ⋯ A_{r_1} F X 1 )   (1 ≤ k ≤ K)   (14)

Using this matrix formulation, we can compute Eq. 7 using dynamic programming (Algorithm 1).
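A sketch of the recursion behind Algorithm 1 under assumed tensor layouts: the node-type potentials f and g are folded into the diagonal matrices F and G as in Eq. 13, and the normalizers d^k are omitted (they come from re-running the same recursion with identity weights and X replaced by a vector of ones).

```python
import torch

def multihop_dp(X, A, W, W0, delta, tau, F, G, K):
    """Unnormalized Z^1 ... Z^K of Eq. 13 via dynamic programming (cf. Algorithm 1).

    X: (n, d) node features          A:  (m, n, n) adjacency matrices A_r
    W: (K, m, d, d) weights W_r^k    W0: (K, d, d) padding matrices W_0^k
    delta: (m,) potentials delta(r, s)   tau: (m, m) transition potentials tau(r', r)
    F, G:  (n, n) diagonal matrices built from f and g
    """
    n, d = X.shape
    m = A.shape[0]
    # suffix products of padding matrices: Wbar[k-1] = W_0^{k+1} ... W_0^K (identity for k = K)
    Wbar = [torch.eye(d) for _ in range(K)]
    for k in range(K - 2, -1, -1):
        Wbar[k] = W0[k + 1] @ Wbar[k + 1]

    M = [F @ X for _ in range(m)]     # M_r: beta-weighted sums over paths ending with relation r
    Z = []
    for k in range(K):
        if k == 0:
            M = [delta[r].exp() * (A[r] @ M[r] @ W[k, r]) for r in range(m)]
        else:
            M = [delta[r].exp() *
                 (A[r] @ sum(tau[rp, r].exp() * M[rp] for rp in range(m)) @ W[k, r])
                 for r in range(m)]
        Z.append(G @ sum(M) @ Wbar[k])   # hop-(k+1) output before normalization
    return Z
```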

C Formal Definition of K-hop RN

Definition 1 (K-hop Relation Network). A multi-hop relation network is a function that maps a multi-relational graph to a fixed-size vector:

KHopRN(G; W, E, H) = Σ_{k=1}^{K} Σ_{(j, r_1, …, r_k, i) ∈ Φ_k, j ∈ Q, i ∈ A} β(j, r_1, …, r_k, i) · W (h_j ⊕ (e_{r_1} ∘ ⋯ ∘ e_{r_k}) ⊕ h_i)   (15)

where ∘ denotes the element-wise product and β(⋯) = 1 / (K|A| · |{(j′, i) ∈ G | j′ ∈ Q}|) defines the pooling weights.

Algorithm 1: Dynamic programming algorithm for multi-hop message passing
Input: s, X, A_r (1 ≤ r ≤ m), W_r^t (r ∈ R, 1 ≤ t ≤ K), F, G, δ, τ
Output: Z

1:  W̄^K ← I
2:  for k ← K−1 to 1 do
3:      W̄^k ← W_0^{k+1} W̄^{k+1}
4:  end for
5:  for r ∈ R do
6:      M_r ← F X
7:  end for
8:  for k ← 1 to K do
9:      if k > 1 then
10:         for r ∈ R do
11:             M′_r ← e^{δ(r, s)} A_r ( Σ_{r′∈R} e^{τ(r′, r)} M_{r′} ) W_r^k
12:         end for
13:         for r ∈ R do
14:             M_r ← M′_r
15:         end for
16:     else
17:         for r ∈ R do
18:             M_r ← e^{δ(r, s)} · A_r M_r W_r^k
19:         end for
20:     end if
21:     Z^k ← G Σ_{r∈R} M_r W̄^k
22: end for
23: Replace W_r^t (0 ≤ r ≤ m, 1 ≤ t ≤ K) with identity matrices and X with 1, and re-run line 1 - line 19 to compute d^1, …, d^K
24: for k ← 1 to K do
25:     Z^k ← (diag(d^k))^{-1} Z^k
26: end for
27: return Z^1, Z^2, …, Z^K

D Expressing K-hop RN with MHGRN

Theorem 1. Given any W, E, H, there exists a parameter setting such that the output of the model becomes KHopRN(G; W, E, H) for arbitrary G.

Proof. Suppose W = [W_1, W_2, W_3], where W_1, W_3 ∈ R^{d_3 × d_1} and W_2 ∈ R^{d_3 × d_2}. For MHGRN, we set the parameters as follows: H = H, U_* = [I; 0] ∈ R^{(d_1+d_2) × d_1}, b_* = [0; 1]^⊤ ∈ R^{d_1+d_2}, W_r^t = diag(1 ⊕ e_r) ∈ R^{(d_1+d_2) × (d_1+d_2)} (r ∈ R, 1 ≤ t ≤ K), V = W_3 ∈ R^{d_3 × d_1}, V′ = [W_1, W_2] ∈ R^{d_3 × (d_1+d_2)}. We disable the relation type attention module and enable message passing only from Q to A. By further choosing σ as the identity function and performing pooling over A, we observe that the output of MHGRN becomes

(1/|A|) Σ_{i∈A} h′_i
  = (1/|A|) Σ_{i∈A} (V h_i + V′ z_i)
  = (1/(K|A|)) Σ_{i∈A} Σ_{k=1}^{K} (V h_i + V′ z_i^k)
  = Σ_{k=1}^{K} Σ_{(j, r_1, …, r_k, i) ∈ Φ_k, j ∈ Q, i ∈ A} β(j, r_1, …, r_k, i) (V h_i + V′ W_{r_k}^k ⋯ W_{r_1}^1 x_j)
  = Σ_{k=1}^{K} Σ_{(j, r_1, …, r_k, i) ∈ Φ_k, j ∈ Q, i ∈ A} β(⋯) (V h_i + V′ W_{r_k}^k ⋯ W_{r_1}^1 U_{φ(j)} h_j + V′ W_{r_k}^k ⋯ W_{r_1}^1 b_{φ(j)})
  = Σ_{k=1}^{K} Σ_{(j, r_1, …, r_k, i) ∈ Φ_k, j ∈ Q, i ∈ A} β(⋯) (W_3 h_i + W_1 h_j + W_2 (e_{r_1} ∘ ⋯ ∘ e_{r_k}))
  = Σ_{k=1}^{K} Σ_{(j, r_1, …, r_k, i) ∈ Φ_k, j ∈ Q, i ∈ A} β(⋯) W (h_j ⊕ (e_{r_1} ∘ ⋯ ∘ e_{r_k}) ⊕ h_i)
  = KHopRN(G; W, E, H)   (16)

Page 2: Scalable Multi-Hop Relational Reasoning for Knowledge ... · 1 Introduction Many recently proposed question answering tasks require not only machine comprehension of the question

0 80node

31e3

pat

hK=2

0 80node

63e4K=2K=3

Figure 2 Number of K-hop relational paths wrtthe node count in extracted graphs on Common-senseQA Left The path count is polynomial wrt thenumber of nodes Right The path count is exponentialwrt the number of hops

this kind of questions (Mihaylov and Frank 2018Lin et al 2019 Wang et al 2019 Yang et al2019) Knowledge graphs represent relationalknowledge between entities into multi-relationaledges thus making it easier for a model to acquirerelational knowledge and improve its reasoningability Moreover incorporating knowledge graphsbrings the potential of interpretable and trustwor-thy predictions as the knowledge is now explicitlystated For example in Fig 1 the relational path(CHILD rarr AtLocation rarr CLASSROOM rarr

Synonym rarr SCHOOLROOM) naturally providesevidence for the answer SCHOOLROOM

A straightforward approach to leveraging aknowledge graph is to directly model these rela-tional paths KagNet (Lin et al 2019) and MH-PGM (Bauer et al 2018) extract relational pathsfrom knowledge graph and encode them with se-quence models resulting in multi-hop relationsbeing explicitly modeled Application of attentionmechanisms upon these relational paths can furtheroffer good interpretability However these modelsare hardly scalable because the number of possiblepaths in a graph is (1) polynomial wrt the num-ber of nodes (2) exponential wrt the path length(see Fig 2) Therefore some (Weissenborn et al2017 Mihaylov and Frank 2018) resort to onlyusing one-hop paths namely triples to balancescalability and reasoning capacities

Graph neural networks (GNNs), in contrast, enjoy better scalability via their message passing formulation, but usually lack transparency. The most commonly used GNN variant, the Graph Convolutional Network (GCN) (Kipf and Welling, 2017), performs message passing by aggregating neighborhood information for each node, but ignores relation types. RGCN (Schlichtkrull et al., 2018) generalizes GCN by performing relation-specific aggregation, making it applicable to encoding multi-relational graphs. However, these models do not distinguish the importance of different neighbors or relation types and thus cannot provide explicit relational paths for interpreting model behavior.

                               GCN   RGCN   KagNet   MHGRN
Multi-Relational Encoding       ✗     ✓       ✓       ✓
Interpretable                   ✗     ✗       ✓       ✓
Scalable w.r.t. #nodes          ✓     ✓       ✗       ✓
Scalable w.r.t. #hops           ✓     ✓       ✗       ✓

Table 1: Properties of our MHGRN and other representative models for graph encoding.

In this paper, we propose a novel graph encoding architecture, Multi-hop Graph Relation Networks (MHGRN), which combines the strengths of path-based models and GNNs. Our model inherits scalability from GNNs by preserving the message passing formulation. It also enjoys the interpretability of path-based models by incorporating a structured relational attention mechanism to model message passing routes. Our key motivation is to perform multi-hop message passing within a single layer, allowing each node to directly attend to its multi-hop neighbours and thus enabling multi-hop relational reasoning with MHGRN. We outline the favorable features of knowledge-aware QA models in Table 1 and compare our MHGRN with representative GNNs and path-based methods.

We summarize the main contributions of this work as follows: 1) We propose MHGRN, a novel model architecture tailored to multi-hop relational reasoning. Our model is capable of explicitly modeling multi-hop relational paths at scale. 2) We propose a structured relational attention mechanism for efficient and interpretable modeling of multi-hop reasoning paths, along with its training and inference algorithms. 3) We conduct extensive experiments on two question answering datasets and show that our models bring significant improvements compared to knowledge-agnostic pre-trained language models, and outperform other graph encoding methods by a large margin.

2 Problem Formulation and Overview

In this paper, we limit the scope to the task of multiple-choice question answering, although the method can be easily generalized to other knowledge-guided tasks (e.g., natural language inference). The overall paradigm of knowledge-aware QA is illustrated in Fig. 3.

Figure 3: Overview of the knowledge-aware QA framework. It integrates the output from the graph encoder (for relational reasoning over contextual subgraphs) and the text encoder (for textual understanding) to generate the plausibility score for an answer option.

Formally, given an external knowledge graph (KG) as the knowledge source and a question q, our goal is to identify the correct answer from a set C of given choices. We turn this problem into measuring the plausibility score between q and each answer choice a ∈ C, and then selecting the answer with the highest plausibility score.

We use q and a to denote the representation vectors of question q and option a. To measure the score for q and a, we first concatenate them to form a statement vector s = [q; a]. Then we extract from the external KG a subgraph G (i.e., the schema graph in KagNet (Lin et al., 2019)) with the guidance of s (detailed in §5.1). This contextualized subgraph is defined as a multi-relational graph G = (V, E, φ). Here V is a subset of the entity nodes in the external KG, containing only entities relevant to s. E ⊆ V × R × V is the set of edges that connect nodes in V, where R = {1, ..., m} are the ids of all pre-defined relation types. The mapping function φ(i): V → T = {E_q, E_a, E_o} takes node i ∈ V as input and outputs E_q if i is an entity mentioned in q, E_a if it is mentioned in a, or E_o otherwise. We finally encode the statement to s and G to g, and concatenate s and g to calculate the plausibility score.
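To make this scoring pipeline concrete, the following is a minimal sketch in PyTorch (the framework the implementation in Appendix A is based on) of how a plausibility score can be computed from a statement vector s and a graph vector g; the module names and dimensions are illustrative assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn

class PlausibilityScorer(nn.Module):
    """Scores one (question, answer) statement from its text and graph vectors."""
    def __init__(self, text_dim=1024, graph_dim=200, hidden_dim=200):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(text_dim + graph_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, s, g):
        # s: (batch, text_dim) statement vector from the text encoder
        # g: (batch, graph_dim) graph vector from the graph encoder
        return self.mlp(torch.cat([s, g], dim=-1)).squeeze(-1)  # (batch,)

# For a question with |C| choices, a score is computed per choice and the
# argmax over choices is taken as the prediction.
```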

3 Background: Multi-Relational Graph Encoding Methods

We leave the encoding of s to pre-trained language models, which have shown powerful text representation ability, and focus on the challenge of encoding the graph G to capture latent relations between entities. Current methods for encoding multi-relational graphs mainly fall into two categories: graph neural networks and path-based models. Graph neural networks encode structured information by passing messages between nodes, directly operating on the graph structure, while path-based methods first decompose the graph into paths and then pool features over all the paths.

Graph Encoding with GNNs. For a graph with n nodes, a graph neural network (GNN) takes a set of node features {h_1, h_2, ..., h_n} as input and computes their corresponding node embeddings {h'_1, h'_2, ..., h'_n} via message passing (Gilmer et al., 2017). A compact graph representation for G can thus be obtained by pooling over the node embeddings {h'_i}:

$$\text{GNN}(\mathcal{G}) = \text{Pool}(\{h'_1, h'_2, \dots, h'_n\}). \tag{1}$$

As a notable variant of GNNs, graph convolutional networks (GCNs) (Kipf and Welling, 2017) update node embeddings by aggregating messages from their direct neighbors. RGCNs (Schlichtkrull et al., 2018) extend GCNs to encode multi-relational graphs by defining a relation-specific weight matrix W_r for each edge type:

$$h'_i = \sigma\left(\Big(\sum_{r\in R} |\mathcal{N}_i^r|\Big)^{-1} \sum_{r\in R}\sum_{j\in\mathcal{N}_i^r} W_r h_j\right), \tag{2}$$

where N_i^r denotes the neighbors of node i under relation r.²
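For reference, below is a minimal sketch of the single-layer RGCN update in Eq. 2, written with dense per-relation adjacency matrices for clarity (a real implementation would use sparse message passing); the tensor shapes and initialization are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SimpleRGCNLayer(nn.Module):
    """One RGCN layer (Eq. 2): relation-specific aggregation over direct neighbors."""
    def __init__(self, num_relations, dim):
        super().__init__()
        self.W = nn.Parameter(torch.randn(num_relations, dim, dim) * 0.02)

    def forward(self, H, A):
        # H: (n, dim) node features; A: (num_relations, n, n), A[r, i, j] = 1 iff (j, r, i) in E
        msgs = torch.einsum('rij,jd,rde->ie', A, H, self.W)  # relation-specific messages from neighbors
        deg = A.sum(dim=(0, 2)).clamp(min=1).unsqueeze(-1)   # normalizer: sum_r |N_i^r|
        return torch.relu(msgs / deg)                        # sigma(normalized aggregation)
```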

While GNNs have proved to have good scalability, their reasoning is done at the node level, making them incompatible with modeling path-level reasoning chains, a crucial component for QA tasks that require relational reasoning. This property also hinders the model's decisions from being interpreted at the path level.

Graph Encoding with Path-Based Models. In addition to directly modeling the graph with GNNs, one can also view a graph as a set of relational paths connecting pairs of entities.

Relation Networks (RNs) (Santoro et al., 2017) can be adapted to multi-relational graph encoding under QA settings. RNs use MLPs to encode all triples (one-hop paths) in G whose head entity is in Q = {j | φ(j) = E_q} and tail entity is in A = {i | φ(i) = E_a}. They then pool the triple embeddings to generate a vector for G as follows:

$$\text{RN}(\mathcal{G}) = \text{Pool}(\{\text{MLP}(h_j \oplus e_r \oplus h_i) \mid j\in Q,\ i\in A,\ (j,r,i)\in E\}). \tag{3}$$

Here h_j and h_i are features for nodes j and i, e_r is the embedding of relation r ∈ R, and ⊕ denotes vector concatenation.
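A corresponding sketch of the one-hop RN encoder in Eq. 3, again with illustrative shapes; the triple list format, pooling choice, and MLP sizes are assumptions.

```python
import torch
import torch.nn as nn

class OneHopRN(nn.Module):
    """RN over QA-relevant triples (Eq. 3): MLP(h_j ⊕ e_r ⊕ h_i), then pooling."""
    def __init__(self, num_relations, node_dim, rel_dim=100, hidden_dim=200):
        super().__init__()
        self.rel_emb = nn.Embedding(num_relations, rel_dim)
        self.mlp = nn.Sequential(
            nn.Linear(2 * node_dim + rel_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, H, triples):
        # H: (n, node_dim); triples: (T, 3) rows (j, r, i), j a question entity, i an answer entity
        j, r, i = triples[:, 0], triples[:, 1], triples[:, 2]
        x = torch.cat([H[j], self.rel_emb(r), H[i]], dim=-1)
        return self.mlp(x).mean(dim=0)  # mean pooling over triples -> graph vector
```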

²For simplicity, we assume a single graph convolutional layer. In practice, multiple layers are stacked to enable message passing from multi-hop neighbors.

Figure 4: Our proposed MHGRN architecture for relational reasoning. MHGRN takes a multi-relational graph G and a (question-answer) statement vector s as input, and outputs a scalar that represents the plausibility score of this statement. Each layer performs (1) a node-type-specific transform, (2) multi-hop message passing with structured relational attention α(j, r_1, ..., r_k, i), and (3) a nonlinear activation, followed by attentive pooling and an MLP.

To further equip RN with the ability to model nondegenerate paths, KagNet (Lin et al., 2019) adopts LSTMs to encode all paths connecting question entities and answer entities with lengths no more than K. It then aggregates all path embeddings via an attention mechanism:

$$\text{KAGNET}(\mathcal{G}) = \text{Pool}(\{\text{LSTM}(j, r_1, \dots, r_k, i) \mid (j, r_1, j_1), \dots, (j_{k-1}, r_k, i)\in E,\ 1\le k\le K\}). \tag{4}$$

4 Proposed Method: Multi-Hop Graph Relation Networks (MHGRN)

This section presents the Multi-hop Graph Relation Network (MHGRN), a novel graph neural network architecture that unifies both GNNs and RNs for encoding multi-relational graphs to augment text comprehension. MHGRN inherits the capability of path reasoning and the interpretability of path-based models, while preserving the good scalability of GNNs through the message passing formulation.

4.1 MHGRN Model Architecture

Following the introduction to GNNs in Sec. 3, we consider encoding a multi-relational graph G = (V, E, φ) into a fixed-size vector g, conditioned on the textual representation s, by first transforming the input node features {h_1, ..., h_n} into node embeddings {h'_1, ..., h'_n} and then pooling these embeddings. Node features can be initialized with pre-trained weights (details in Appendix A), so we focus on the computation of node embeddings.

Type-Specific Transformation. To make our model aware of the node type information φ, we first perform a node-type-specific linear transformation on the input node features:

$$x_i = U_{\phi(i)} h_i + b_{\phi(i)}, \tag{5}$$

where the learnable parameters U and b are specific to the type of node i.

Multi-Hop Message Passing. As mentioned before, our motivation is to endow GNNs with the capability of directly modeling paths. To this end, we propose to pass messages directly over all relational paths of lengths up to K, where K is a hyperparameter. The set of valid k-hop relational paths is defined as

$$\Phi_k = \{(j, r_1, \dots, r_k, i) \mid (j, r_1, j_1), \dots, (j_{k-1}, r_k, i)\in E\} \quad (1\le k\le K). \tag{6}$$

We perform k-hop (1 ≤ k ≤ K) message passing over these paths, which is a generalization of the single-hop message passing in RGCN (see Eq. 2):

$$z_i^k = \sum_{(j, r_1, \dots, r_k, i)\in\Phi_k} \frac{\alpha(j, r_1, \dots, r_k, i)}{d_i^k}\cdot W_0^K\cdots W_0^{k+1} W_{r_k}^k\cdots W_{r_1}^1 x_j \quad (1\le k\le K), \tag{7}$$

where the W_r^t (1 ≤ t ≤ K, 0 ≤ r ≤ m) matrices are learnable,³ α(j, r_1, ..., r_k, i) is an attention score elaborated in §4.2, and d_i^k = Σ_{(j,...,i)∈Φ_k} α(j, ..., i) is the normalization factor. The matrices {W_{r_k}^k ⋯ W_{r_1}^1 | 1 ≤ r_1, ..., r_k ≤ m} can be interpreted as a low-rank approximation of an m × ⋯ × m (k times) × d × d tensor that assigns a separate transformation for each k-hop relation, where d is the dimension of x_i.

Incoming messages from paths of different lengths are aggregated via an attention mechanism (Vaswani et al., 2017):

$$z_i = \sum_{k=1}^{K} \text{softmax}(\text{bilinear}(s, z_i^k))\cdot z_i^k. \tag{8}$$

³The W_0^t (0 ≤ t ≤ K) matrices are introduced as padding matrices so that K transformations are applied regardless of k, thus ensuring a comparable scale of z_i^k across different k.
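The path-enumeration form of Eqs. 5-8 can be sketched as follows. This naive version loops over explicit k-hop paths for readability (the efficient dynamic-programming form is given in Appendix B), and takes the attention scores α and the hop-level `bilinear` scorer as given callables; all names and data-structure choices are illustrative assumptions.

```python
import torch

def multihop_messages(x, paths_by_hop, alpha_by_hop, W, s, bilinear):
    # x: (n, d) type-transformed node features (Eq. 5)
    # paths_by_hop: dict k -> list of (j, [r_1..r_k], i) tuples in Phi_k (Eq. 6)
    # alpha_by_hop: dict k -> list of attention scores aligned with paths_by_hop[k] (Sec. 4.2)
    # W[t][r]: (d, d) learnable matrix W_r^t; r = 0 is the padding relation
    # bilinear(s, z_hops): returns a (K, n) score matrix for hop-level attention (Eq. 8)
    n, d, K = x.size(0), x.size(1), len(paths_by_hop)
    z_hops = torch.zeros(K, n, d)
    for k in range(1, K + 1):
        num, den = torch.zeros(n, d), torch.zeros(n, 1)
        for (j, rels, i), a in zip(paths_by_hop[k], alpha_by_hop[k]):
            m = x[j]
            for t, r in enumerate(rels, start=1):   # W_{r_1}^1 ... W_{r_k}^k
                m = W[t][r] @ m
            for t in range(k + 1, K + 1):           # padding matrices W_0^{k+1} ... W_0^K
                m = W[t][0] @ m
            num[i] += a * m
            den[i] += a
        z_hops[k - 1] = num / den.clamp(min=1e-8)   # Eq. 7 with normalizer d_i^k
    att = torch.softmax(bilinear(s, z_hops), dim=0)  # (K, n): attention over path lengths
    return (att.unsqueeze(-1) * z_hops).sum(dim=0)   # Eq. 8: aggregated z_i
```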

Non-linear Activation. Finally, we apply a shortcut connection and a nonlinear activation to obtain the output node embeddings:

$$h'_i = \sigma(V h_i + V' z_i), \tag{9}$$

where V and V' are learnable model parameters and σ is a non-linear activation function.

4.2 Structured Relational Attention

The remaining problem is how to effectively parameterize the attention score α(j, r_1, ..., r_k, i) in Eq. 7 for all possible k-hop paths without introducing O(m^k) parameters. We first regard it as the probability of a relation sequence (φ(j), r_1, ..., r_k, φ(i)) conditioned on s:

$$\alpha(j, r_1, \dots, r_k, i) = p(\phi(j), r_1, \dots, r_k, \phi(i) \mid s), \tag{10}$$

which can naturally be modeled by a probabilistic graphical model such as a conditional random field (Lafferty et al., 2001):

$$
\begin{aligned}
p(\cdots \mid s) \propto{}& \exp\Big(f(\phi(j), s) + \sum_{t=1}^{k}\delta(r_t, s) + \sum_{t=1}^{k-1}\tau(r_t, r_{t+1}) + g(\phi(i), s)\Big) \\
\overset{\Delta}{=}{}& \underbrace{\beta(r_1, \dots, r_k, s)}_{\text{Relation Type Attention}} \cdot \underbrace{\gamma(\phi(j), \phi(i), s)}_{\text{Node Type Attention}},
\end{aligned}
\tag{11}
$$

where f(·), δ(·), and g(·) are parameterized by two-layer MLPs, and τ(·) by a transition matrix of shape m × m. Intuitively, β(·) models the importance of a k-hop relation, while γ(·) models the importance of messages from node type φ(j) to node type φ(i) (e.g., the model can learn to pass messages only from question entities to answer entities).

Our model scores a k-hop relation by decomposing it into both context-aware single-hop relations (modeled by δ) and two-hop relations (modeled by τ). We argue that τ is indispensable: without it, the model may assign high importance to illogical multi-hop relations (e.g., [AtLocation, CapableOf]) or noisy relations (e.g., [RelatedTo, RelatedTo]).
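Below is a sketch of how the factored score in Eq. 11 can be computed for a single path. The two-layer MLPs for f, δ, g and the transition matrix for τ are reasonable parameterizations consistent with the description above, but their exact sizes and initialization are assumptions.

```python
import torch
import torch.nn as nn

class StructuredRelationalAttention(nn.Module):
    """Unnormalized score of a relation sequence (phi(j), r_1..r_k, phi(i)) given s (Eq. 11)."""
    def __init__(self, num_relations, num_node_types, s_dim, hidden=100):
        super().__init__()
        def mlp(out_dim):
            return nn.Sequential(nn.Linear(s_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim))
        self.f = mlp(num_node_types)      # start node-type potential f(phi(j), s)
        self.g = mlp(num_node_types)      # end node-type potential g(phi(i), s)
        self.delta = mlp(num_relations)   # context-aware single-relation potential delta(r_t, s)
        self.tau = nn.Parameter(torch.zeros(num_relations, num_relations))  # transition tau(r_t, r_{t+1})

    def forward(self, s, start_type, rels, end_type):
        # s: (s_dim,) statement vector; rels: LongTensor of relation ids (r_1 .. r_k)
        score = self.f(s)[start_type] + self.g(s)[end_type]
        score = score + self.delta(s)[rels].sum()
        score = score + self.tau[rels[:-1], rels[1:]].sum()
        return torch.exp(score)  # proportional to alpha before normalization over Phi_k
```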

4.3 Computation Complexity Analysis

Although the multi-hop message passing process in Eq. 7 and the structured relational attention module in Eq. 11 handle a potentially exponential number of paths, we show that they can be computed in linear time using dynamic programming (see Appendix B). As summarized in Table 2, the time complexity and space complexity of our model on a sparse graph are O(m²nK∆) and O(mnK) respectively, both of which are linear with respect to either the path length K or the number of nodes n.

Model            Time                   Space
-- G is a dense graph --
K-hop KagNet     O(m^K n^{K+1} K)       O(m^K n^{K+1} K)
K-layer RGCN     O(m n² K)              O(m n K)
MHGRN            O(m² n² K)             O(m n K)
-- G is a sparse graph with maximum node degree ∆ ≪ n --
K-hop KagNet     O(m^K n ∆^K K)         O(m^K n ∆^K K)
K-layer RGCN     O(m n K ∆)             O(m n K)
MHGRN            O(m² n K ∆)            O(m n K)

Table 2: Computation complexity of different K-hop reasoning models on a dense/sparse multi-relational graph with n nodes and m relation types. Despite the quadratic complexity w.r.t. m, MHGRN's time cost is similar to that of RGCN on GPUs thanks to parallelizable matrix multiplications (cf. Fig. 7).

4.4 Expressive Power of MHGRN

In addition to efficiency and scalability, we now discuss the modeling capacity of our model. With the message passing formulation and relation-specific transformations, MHGRN is by nature a generalization of RGCN. Furthermore, it is capable of directly modeling paths, making it interpretable in the same way as path-based models like RN and KagNet. To show this, we first generalize RN (Eq. 3) to the multi-hop setting and introduce K-hop RN (see Appendix C for a formal definition). K-hop RN models a multi-hop relation as the composition of single-hop relations via element-wise multiplication. We show that MHGRN is capable of representing K-hop RN (see Appendix D for the proof).

4.5 Learning, Inference and Path Decoding

We now discuss the learning and inference process of MHGRN, instantiated for the task of multiple-choice question answering. Following the problem formulation in Sec. 2, we aim to determine the plausibility of an answer candidate a ∈ C given the question q, with information from both the text s and the graph G. We first obtain the graph representation g by performing attentive pooling over the output node embeddings of the answer entities {h'_i | i ∈ A}. Next, we simply concatenate it with the text representation s and compute the plausibility score by ρ(q, a) = MLP(s ⊕ g).

Methods                              BERT-Base                     BERT-Large                    RoBERTa-Large
                                     IHdev (%)      IHtest (%)     IHdev (%)      IHtest (%)     IHdev (%)      IHtest (%)
w/o KG                               57.31 (±1.07)  53.47 (±0.87)  61.06 (±0.85)  55.39 (±0.40)  73.07 (±0.45)  68.69 (±0.56)
RGCN (Schlichtkrull et al., 2018)    56.94 (±0.38)  54.50 (±0.56)  62.98 (±0.82)  57.13 (±0.36)  72.69 (±0.19)  68.41 (±0.66)
GconAttn (Wang et al., 2019)         57.29 (±0.62)  54.41 (±0.50)  62.98 (±0.17)  56.94 (±0.77)  73.38 (±0.27)  69.88 (±0.47)
KagNet† (Lin et al., 2019)           55.57          56.19          62.35          57.16          -              -
RN (1-hop)                           58.27 (±0.22)  56.20 (±0.45)  63.04 (±0.58)  58.46 (±0.71)  74.57 (±0.91)  70.08 (±0.21)
RN (2-hop)                           59.81 (±0.76)  56.61 (±0.68)  63.36 (±0.26)  58.92 (±0.14)  73.65 (±3.09)  69.59 (±3.80)
MHGRN                                60.36 (±0.23)  57.23 (±0.82)  63.29 (±0.51)  60.59 (±0.58)  74.45 (±0.10)  71.11 (±0.81)

Table 3: Performance comparison on the CommonsenseQA in-house split. We report in-house Dev (IHdev) and Test (IHtest) accuracy (mean and standard deviation of four runs) using the data split of Lin et al. (2019) on CommonsenseQA. † indicates results reported in its paper.

During training, we maximize the plausibility score of the correct answer â by minimizing the cross-entropy loss:

$$\mathcal{L} = \mathbb{E}_{q,\hat{a},C}\left[-\log\frac{\exp(\rho(q,\hat{a}))}{\sum_{a\in C}\exp(\rho(q,a))}\right]. \tag{12}$$

The whole model is trained end-to-end, jointly with the text encoder (e.g., RoBERTa).

During inference, we predict the most plausible answer by argmax_{a∈C} ρ(q, a). Additionally, we can decode a reasoning path as evidence for model predictions, endowing our model with the interpretability enjoyed by path-based models. Specifically, we first determine the answer entity i* with the highest score in the pooling layer and the path length k* with the highest score in Eq. 8. Then the reasoning path is decoded by argmax α(j, r_1, ..., r_{k*}, i*), which can be computed in linear time using dynamic programming.

5 Experimental Setup

We introduce how we construct G (§5.1), the datasets (§5.2), and the baseline methods (§5.3). Appendix A provides more implementation and experimental details for reproducibility.

Methods                                    Single   Ensemble
RoBERTa†                                   72.1     72.5
ALBERT† (Lan et al., 2019)                 -        76.5
XLNet + GR† (Lv et al., 2019)              75.3     -
RoBERTa + KEDGN†                           72.5     74.4
RoBERTa + KE†                              73.3     -
RoBERTa + HyKAS 2.0† (Ma et al., 2019)     73.2     -
XLNet + DREAM†                             66.9     73.3
RoBERTa + FreeLB† (Zhu et al., 2020)       72.2     73.1
MHGRN (RoBERTa, K = 2)                     75.4     76.5

Table 4: Performance comparison on the official test set of CommonsenseQA with leaderboard SoTAs⁴ (accuracy in %). † indicates results reported on the leaderboard. MHGRN achieves state-of-the-art in both single and ensemble settings with the RoBERTa-large encoder.

5.1 Extracting G from an External KG

We use ConceptNet (Speer et al., 2017), a general-domain knowledge graph, as our external KG to test models' ability to harness structured knowledge sources. Following KagNet (Lin et al., 2019), we merge relation types to increase graph density and add reverse relations to construct a multi-relational graph with 34 relation types. To extract an informative contextual graph G from the KG, we recognize entity mentions in s and link them to entities in ConceptNet, with which we initialize our node set V. We then add to V all the entities that appear in any two-hop path between pairs of mentioned entities. Unlike KagNet, we do not perform any pruning but instead keep all the edges between nodes in V, forming our G.
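A schematic sketch of this extraction step, assuming a pre-linked set of mentioned concepts and a networkx-style view of the merged ConceptNet graph; the function names and data structures are assumptions, while the pruning-free policy follows the description above.

```python
import itertools
import networkx as nx

def extract_schema_graph(statement_concepts, conceptnet: nx.MultiDiGraph) -> nx.MultiDiGraph:
    """Build the contextual subgraph G for one (question, answer) statement.

    statement_concepts: set of ConceptNet node ids linked from entity mentions in s.
    conceptnet: merged 34-relation multigraph (with reverse relations already added).
    """
    nodes = set(statement_concepts)
    # Add every entity lying on a two-hop path between two mentioned entities.
    for a, b in itertools.combinations(statement_concepts, 2):
        nodes |= set(conceptnet.successors(a)) & set(conceptnet.predecessors(b))
        nodes |= set(conceptnet.successors(b)) & set(conceptnet.predecessors(a))
    # Keep all edges among the selected nodes (no pruning).
    return conceptnet.subgraph(nodes).copy()
```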

5.2 Datasets

We evaluate models on two multiple-choice question answering datasets, CommonsenseQA and OpenbookQA. Both require world knowledge beyond textual understanding to perform well.

CommonsenseQA (Talmor et al., 2019) necessitates various commonsense reasoning skills. The questions are created with entities from ConceptNet and are designed to probe latent compositional relations between entities in ConceptNet.

OpenBookQA (Mihaylov et al., 2018) provides elementary science questions together with an open book of science facts. This dataset also probes general commonsense knowledge beyond the provided facts. As our model is orthogonal to text-form knowledge retrieval, we do not utilize the provided open book and instead use ConceptNet. Consequently, we do not compare our methods with those using the open book.

⁴Models based on ConceptNet are no longer shown on the leaderboard, so we obtained our results directly from the organizers.

5.3 Compared Methods

We implement both knowledge-agnostic fine-tuning of pre-trained LMs and models that incorporate a KG as an external source as our baselines. Additionally, we directly compare our model with the results from the corresponding leaderboards. These methods typically leverage textual knowledge or extra training data, as opposed to an external KG. In all our implemented models, we use pre-trained LMs as text encoders for s for fair comparison. We stick to our focus of encoding structured KGs and therefore do not compare our models with those (Pan et al., 2019; Zhang et al., 2018; Sun et al., 2019; Banerjee et al., 2019) augmented by other text-form external knowledge (e.g., Wikipedia).

Specifically, we fine-tune BERT-BASE, BERT-LARGE (Devlin et al., 2019), and ROBERTA (Liu et al., 2019b) for multiple-choice questions. We take RGCN (Eq. 2 in Sec. 3), RN⁵ (Eq. 3 in Sec. 3), KagNet (Eq. 4 in Sec. 3), and GconAttn (Wang et al., 2019) as baselines. GconAttn generalizes match-LSTM (Wang and Jiang, 2016) from sequence modeling to entity set modeling and achieves success in language inference tasks.

6 Results and Discussions

In this section, we present the results of our models in comparison with baselines as well as methods on the leaderboards for both CommonsenseQA and OpenbookQA. We also provide analyses of the models' components and characteristics.

6.1 Main Results

For CommonsenseQA (Table 3), we first use the in-house data split of Lin et al. (2019) to compare our models with our implemented baselines. In this setting, we take 1,241 examples from the official training set as our in-house test examples and regard the remaining 8,500 as our in-house training examples. Almost all KG-augmented models achieve performance gains over vanilla pre-trained LMs, demonstrating the value of external knowledge on this dataset. Additionally, we evaluate our MHGRN (with ROBERTA-LARGE as the text encoder) on the official split (Table 4) for fair comparison with other methods on the leaderboard, in both the single-model and the ensemble-model setting. In both cases, we achieve state-of-the-art performance across all existing models.

For OpenbookQA (Table 5), we use the official split and build models with ROBERTA-LARGE as the text encoder. Overall, external knowledge still brings benefit to this task. Our model surpasses all baselines with an absolute increase of ~2% on Test. As we seek to evaluate models in terms of encoding external knowledge graphs, we do not compare with models based on document retrieval and reading comprehension. Note that other submissions on the OBQA leaderboard are usually fine-tuned on external question answering datasets or use large corpora via information retrieval. As we seek to systematically examine knowledge-aware reasoning methods that reason over interpretable structures, we do not include comparisons with models that are beyond this scope. Other fine-tuned PTLMs can simply be incorporated whenever they are available.

⁵We use mean pooling for 1-hop RN and attentive pooling for 2-hop RN (detailed in Appendix C).

Methods                              Dev (%)          Test (%)
BERT (w/o KG)†                       60.4             60.4
BERT Multi-Task (w/o KG)†            -                63.8
GapQA† (Khot et al., 2019)           -                59.4 (±1.30)
RoBERTa-Large (w/o KG)               66.76 (±1.14)    64.80 (±2.37)
 + RGCN                              64.65 (±1.96)    62.45 (±1.57)
 + GconAttn                          66.85 (±1.82)    64.75 (±1.48)
 + RN (1-hop)                        64.85 (±1.11)    63.65 (±2.31)
 + RN (2-hop)                        67.00 (±0.71)    65.20 (±1.18)
 + MHGRN (K = 3)                     68.10 (±1.02)    66.85 (±1.19)

Table 5: Dev and Test accuracy on OpenbookQA. † indicates results reported on the leaderboard.

Methods                                          IHdev-Acc (%)
MHGRN (K = 3)                                    74.45 (±0.10)
 - Type-specific transformation (§4.1)           73.16 (±0.28)
 - Structured relational attention (§4.2)        73.26 (±0.31)
 - Relation type attention (§4.2)                73.55 (±0.68)
 - Node type attention (§4.2)                    73.92 (±0.65)

Table 6: Ablation study on model components (removing one component each time) using ROBERTA-LARGE as the text encoder. We report the IHdev accuracy on CommonsenseQA.

6.2 Performance Analysis

Ablation Study on Model Components. We assess the impact of our model's components, shown in Table 6. Disabling the type-specific transformation results in a ~1.3% drop in performance, demonstrating the need to distinguish node types for QA tasks. Our structured relational attention mechanism is also critical, with its two sub-components contributing almost equally.

Impact of the Amount of Training Data. We use different fractions of the CommonsenseQA training data and report results of fine-tuning text encoders alone and of jointly training the text encoder and graph encoder in Fig. 5. Regardless of the training data fraction, our model shows consistently larger performance improvements over knowledge-agnostic fine-tuning than the other graph encoding methods, indicating MHGRN's complementary strengths to text encoders.

Figure 5: Performance change (accuracy in %) w.r.t. the amount of training data on the CommonsenseQA IHtest set (same split as Lin et al. (2019)).

Impact of the Number of Hops (K). We investigate the impact of the hyperparameter K for MHGRN through its performance on CommonsenseQA (shown in Fig. 6). Increasing K continues to bring benefit up to K = 3; however, performance begins to drop when K > 3. This might be attributed to exponentially growing noise in longer relational paths in the knowledge graph.

6.3 Model Scalability

Fig. 7 presents the computation cost of MHGRN and RGCN (measured by training time). The computation cost of both models grows linearly w.r.t. K. Although the theoretical complexity of MHGRN is m times that of RGCN, the ratio of their empirical costs only approaches 2, demonstrating that our model can be parallelized.

Figure 6: Effect of K in MHGRN. We show the IHdev accuracy of MHGRN on CommonsenseQA using the data split of Lin et al. (2019) with different numbers of hops (K = 1: 73.24, K = 2: 73.93, K = 3: 74.65, K = 4: 73.65, K = 5: 74.16).

Figure 7: Analysis of model scalability. Comparison of per-batch training efficiency (s/batch) w.r.t. the number of reasoning hops K, for RGCN and MHGRN.

6.4 Model Interpretability

We can analyze our model's reasoning process by decoding the reasoning path using the method described in §4.5. Fig. 8 shows two examples from CommonsenseQA where our model correctly answers the questions and provides reasonable path evidence. In the example on the left, the model links question entities and the answer entity in a chain to support reasoning, while the example on the right provides a case where our model leverages unmentioned entities to bridge the reasoning gap between the question entity and the answer entity, in a way that is coherent with the implied relation between CHAPEL and the desired answer in the question.

Figure 8: Case study on model interpretability. We present two sampled questions from CommonsenseQA ("Why do parents encourage their kids to play baseball?" and "Where is known for a multitude of wedding chapels?") together with the reasoning paths output by MHGRN.

6.5 Potential Compatibility with Other Methods

In theory, our approach is naturally compatible with methods that utilize textual knowledge or extra data (such as the leaderboard methods in Table 4), because in our paradigm the encoding of the textual statement and of the graph are structurally decoupled (Fig. 3). For example, we could take the fine-tuned RoBERTa+KE⁶ system as our text encoder and leave the rest of our model architecture unchanged.

7 Related Work

Knowledge-Aware Methods for NLP. Various works have investigated the potential to empower NLP models with external knowledge. Many attempt to extract structured knowledge, either in the form of nodes (Yang and Mitchell, 2017; Wang et al., 2019), triples (Weissenborn et al., 2017; Mihaylov and Frank, 2018), paths (Bauer et al., 2018; Kundu et al., 2019; Lin et al., 2019), or subgraphs (Li and Clark, 2015), and encode them to augment textual understanding.

The recent success of pre-trained LMs motivates many works (Pan et al., 2019; Ye et al., 2019; Zhang et al., 2018; Li et al., 2019; Banerjee et al., 2019) to probe LMs' potential as latent knowledge bases. This line of work turns to textual knowledge (e.g., Wikipedia) to directly impart knowledge to pre-trained LMs. They generally fall into two paradigms: 1) fine-tuning LMs on large-scale general-domain datasets (e.g., RACE (Lai et al., 2017)) or on knowledge-rich text; 2) providing LMs with evidence via information retrieval techniques. However, these models cannot provide explicit reasoning and evidence, and are thus hardly trustworthy. They are also subject to the availability of in-domain datasets and the maximum input length of pre-trained LMs.

Neural Graph Encoding. Graph Attention Networks (GAT) (Velickovic et al., 2018) incorporate an attention mechanism in feature aggregation. RGCN (Schlichtkrull et al., 2018) proposes relational message passing, which makes it applicable to multi-relational graphs. However, these models only perform single-hop message passing and cannot be interpreted at the path level. Other works (Abu-El-Haija et al., 2019; Nikolentzos et al., 2019) aggregate, for each node, its K-hop neighbors based on node-wise distances, but they are designed for non-relational graphs. MHGRN addresses these issues by reasoning on multi-relational graphs and remaining interpretable by maintaining paths as reasoning chains.

⁶https://github.com/jose77/csqa

8 Conclusion

We present a principled, scalable method, MHGRN, that can leverage general knowledge via multi-hop reasoning over interpretable structures (e.g., ConceptNet). The proposed MHGRN generalizes and combines the advantages of GNNs and path-based reasoning models. It explicitly performs multi-hop relational reasoning and is empirically shown to outperform existing methods with superior scalability and interpretability. Our extensive experiments systematically compare MHGRN with other existing knowledge-aware methods. In particular, we achieve state-of-the-art performance on the CommonsenseQA dataset.

References

Sami Abu-El-Haija, Bryan Perozzi, Amol Kapoor, Nazanin Alipourfard, Kristina Lerman, Hrayr Harutyunyan, Greg Ver Steeg, and Aram Galstyan. 2019. MixHop: Higher-order graph convolutional architectures via sparsified neighborhood mixing. In ICML 2019, pages 21–29.

Pratyay Banerjee, Kuntal Kumar Pal, Arindam Mitra, and Chitta Baral. 2019. Careful selection of knowledge to solve open book question answering. In ACL 2019, pages 6120–6129.

Lisa Bauer, Yicheng Wang, and Mohit Bansal. 2018. Commonsense for generative multi-hop question answering tasks. In EMNLP 2018, pages 4220–4230.

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. CoRR, abs/1803.05457.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT 2019, pages 4171–4186.

Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. 2017. Neural message passing for quantum chemistry. In ICML 2017, pages 1263–1272.

Tushar Khot, Ashish Sabharwal, and Peter Clark. 2019. What's missing: A knowledge gap guided approach for multi-hop question answering. In EMNLP-IJCNLP 2019, pages 2814–2828.

Thomas N. Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In ICLR 2017.

Souvik Kundu, Tushar Khot, Ashish Sabharwal, and Peter Clark. 2019. Exploiting explicit paths for multi-hop reading comprehension. In ACL 2019, pages 2737–2747.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML 2001, pages 282–289.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale ReAding comprehension dataset from examinations. In EMNLP 2017, pages 785–794.

Zhen-Zhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A lite BERT for self-supervised learning of language representations. ArXiv, abs/1909.11942.

Shiyang Li, Jianshu Chen, and Dian Yu. 2019. Teaching pretrained models with commonsense reasoning: A preliminary KB-based approach. CoRR, abs/1909.09743.

Yang Li and Peter Clark. 2015. Answering elementary science questions by constructing coherent scenes using background knowledge. In EMNLP 2015, pages 2007–2012.

Bill Yuchen Lin, Xinyue Chen, Jamin Chen, and Xiang Ren. 2019. KagNet: Knowledge-aware graph networks for commonsense reasoning. In EMNLP-IJCNLP 2019, pages 2829–2839.

Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. 2019a. On the variance of the adaptive learning rate and beyond. CoRR, abs/1908.03265.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.

Shangwen Lv, Daya Guo, Jingjing Xu, Duyu Tang, Nan Duan, Ming Gong, Linjun Shou, Daxin Jiang, Guihong Cao, and Songlin Hu. 2019. Graph-based reasoning over heterogeneous external knowledge for commonsense question answering. CoRR, abs/1909.05311.

Kaixin Ma, Jonathan Francis, Quanyang Lu, Eric Nyberg, and Alessandro Oltramari. 2019. Towards generalizable neuro-symbolic systems for commonsense question answering. In Proceedings of the First Workshop on Commonsense Inference in Natural Language Processing, pages 22–32.

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? A new dataset for open book question answering. In EMNLP 2018, pages 2381–2391.

Todor Mihaylov and Anette Frank. 2018. Knowledgeable Reader: Enhancing cloze-style reading comprehension with external commonsense knowledge. In ACL 2018, pages 821–832.

Giannis Nikolentzos, George Dasoulas, and Michalis Vazirgiannis. 2019. k-hop graph neural networks. CoRR, abs/1907.06051.

Xiaoman Pan, Kai Sun, Dian Yu, Heng Ji, and Dong Yu. 2019. Improving question answering with external knowledge. CoRR, abs/1902.00993.

Adam Santoro, David Raposo, David G. T. Barrett, Mateusz Malinowski, Razvan Pascanu, Peter W. Battaglia, and Tim Lillicrap. 2017. A simple neural network module for relational reasoning. In NeurIPS 2017, pages 4967–4976.

Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. 2019. Social IQa: Commonsense reasoning about social interactions. In EMNLP-IJCNLP 2019, pages 4463–4473.

Michael Sejr Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. 2018. Modeling relational data with graph convolutional networks. In ESWC 2018, pages 593–607.

Robyn Speer, Joshua Chin, and Catherine Havasi. 2017. ConceptNet 5.5: An open multilingual graph of general knowledge. In AAAI 2017, pages 4444–4451.

Kai Sun, Dian Yu, Dong Yu, and Claire Cardie. 2019. Improving machine reading comprehension with general reading strategies. In NAACL-HLT 2019, pages 2633–2643.

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In NAACL-HLT 2019, pages 4149–4158.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NeurIPS 2017, pages 5998–6008.

Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2018. Graph attention networks. In ICLR 2018.

Shuohang Wang and Jing Jiang. 2016. Learning natural language inference with LSTM. In NAACL-HLT 2016, pages 1442–1451.

Xiaoyan Wang, Pavan Kapanipathi, Ryan Musa, Mo Yu, Kartik Talamadupula, Ibrahim Abdelaziz, Maria Chang, Achille Fokoue, Bassem Makni, Nicholas Mattei, and Michael Witbrock. 2019. Improving natural language inference using external knowledge in the science questions domain. In AAAI 2019, pages 7208–7215.

Dirk Weissenborn, Tomas Kocisky, and Chris Dyer. 2017. Dynamic integration of background knowledge in neural NLU systems. arXiv preprint arXiv:1706.02596.

An Yang, Quan Wang, Jing Liu, Kai Liu, Yajuan Lyu, Hua Wu, Qiaoqiao She, and Sujian Li. 2019. Enhancing pre-trained language representations with rich knowledge for machine reading comprehension. In ACL 2019, pages 2346–2357.

Bishan Yang and Tom Mitchell. 2017. Leveraging knowledge bases in LSTMs for improving machine reading. In ACL 2017, pages 1436–1446.

Zhi-Xiu Ye, Qian Chen, Wen Wang, and Zhen-Hua Ling. 2019. Align, mask and select: A simple method for incorporating commonsense knowledge into language representation models. CoRR, abs/1908.06725.

Yuyu Zhang, Hanjun Dai, Kamil Toraman, and Le Song. 2018. KG^2: Learning to reason science exam questions with contextual knowledge graph embeddings. CoRR, abs/1805.12393.

Chen Zhu, Yu Cheng, Zhe Gan, Siqi Sun, Tom Goldstein, and Jingjing Liu. 2020. FreeLB: Enhanced adversarial training for natural language understanding. In ICLR 2020.

A Implementation Details

                   CommonsenseQA    OpenbookQA
BERT-BASE          3 × 10⁻⁵         -
BERT-LARGE         2 × 10⁻⁵         -
ROBERTA-LARGE      1 × 10⁻⁵         1 × 10⁻⁵

Table 7: Learning rates for text encoders on the two datasets.

                   CommonsenseQA    OpenbookQA
RN                 3 × 10⁻⁴         3 × 10⁻⁴
RGCN               1 × 10⁻³         1 × 10⁻³
GconAttn           3 × 10⁻⁴         1 × 10⁻⁴
KV-Memory          1 × 10⁻³         1 × 10⁻³
MHGRN              1 × 10⁻³         1 × 10⁻³

Table 8: Learning rates for graph encoders on the two datasets.

Our models are implemented in PyTorch. We use the cross-entropy loss and the RAdam (Liu et al., 2019a) optimizer. We find it beneficial to use separate learning rates for the text encoder and the graph encoder, and we tune both on the two datasets. We first fine-tune ROBERTA-LARGE, BERT-LARGE, and BERT-BASE on CommonsenseQA, and ROBERTA-LARGE on OpenbookQA, and choose a dataset-specific learning rate for each text encoder from {1 × 10⁻⁵, 2 × 10⁻⁵, 3 × 10⁻⁵, 6 × 10⁻⁵, 1 × 10⁻⁴} based on the best performance on the development set, as listed in Table 7. We report the performance of these fine-tuned text encoders and also adopt their dataset-specific optimal learning rates in joint training with graph encoders. For models that involve a KG, the learning rates of their graph encoders⁷ are chosen from {1 × 10⁻⁴, 3 × 10⁻⁴, 1 × 10⁻³, 3 × 10⁻³} based on the best development set performance with ROBERTA-LARGE as the text encoder. We report the optimal learning rates for graph encoders in Table 8. In training, we set the maximum input sequence length of the text encoders to 64, the batch size to 32, and perform early stopping.

For the input node features, we first use templates to turn knowledge triples in CommonsenseQA into sentences and feed them into pre-trained BERT-LARGE, obtaining a sequence of token embeddings from the last layer of BERT-LARGE for each triple. For each entity in CommonsenseQA, we perform mean pooling over the tokens of the entity's occurrences across all the sentences to form a 1024-dimensional vector as its corresponding node feature. We use this set of features for all our implemented models.

⁷KV-Memory actually encodes triples to enhance token representations of s instead of encoding the graph alone, but for simplicity we also refer to it as a graph encoder.

We use a 2-layer RGCN and a single-layer MHGRN across our experiments.
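A sketch of this feature-initialization step using the Hugging Face transformers library; the template function, entity-span helper, and batching are simplified assumptions.

```python
import torch
from collections import defaultdict
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-large-uncased")
bert = BertModel.from_pretrained("bert-large-uncased").eval()

def entity_features(triples, triple_to_sentence, entity_spans):
    """Average last-layer BERT token embeddings over each entity's occurrences.

    triples: list of KG triples; triple_to_sentence: template that verbalizes a triple;
    entity_spans(sent): yields (entity, token_start, token_end) spans in the tokenized sentence.
    """
    sums, counts = defaultdict(lambda: torch.zeros(1024)), defaultdict(int)
    with torch.no_grad():
        for t in triples:
            sent = triple_to_sentence(t)
            enc = tokenizer(sent, return_tensors="pt")
            hidden = bert(**enc).last_hidden_state[0]          # (seq_len, 1024)
            for entity, start, end in entity_spans(sent):
                sums[entity] += hidden[start:end].mean(dim=0)  # pool the entity's tokens
                counts[entity] += 1
    return {e: sums[e] / counts[e] for e in sums}              # 1024-d node feature per entity
```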

B Dynamic Programming Algorithm for Eq. 7

To show that multi-hop message passing can be computed in linear time, we observe that Eq. 7 can be re-written in matrix form:

$$Z^k = (D^k)^{-1}\sum_{(r_1,\dots,r_k)\in R^k}\beta(r_1,\dots,r_k,s)\cdot G A_{r_k}\cdots A_{r_1} F X\, {W_{r_1}^1}^{\!\top}\cdots {W_{r_k}^k}^{\!\top}\cdot {W_0^{k+1}}^{\!\top}\cdots {W_0^K}^{\!\top} \quad (1\le k\le K), \tag{13}$$

where G = diag(exp([g(φ(v_1), s), ..., g(φ(v_n), s)])) (F is similarly defined), A_r is the adjacency matrix for relation r, and D^k is defined as follows:

$$D^k = \text{diag}\Big(\sum_{(r_1,\dots,r_k)\in R^k}\beta(r_1,\dots,r_k,s)\cdot G A_{r_k}\cdots A_{r_1} F X \mathbf{1}\Big) \quad (1\le k\le K). \tag{14}$$

Using this matrix formulation, we can compute Eq. 7 using dynamic programming.
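A sketch of the dynamic program in matrix form (Eqs. 13-14 and Algorithm 1): only the per-relation running messages M_r are maintained, so the cost stays linear in K rather than enumerating relation sequences. The dense-tensor shapes, the row-vector convention, and the way δ and τ are passed in are assumptions for illustration.

```python
import torch

def dp_multihop(X, A, W, W0, F_diag, G_diag, delta, tau, K):
    # X: (n, d) node features; A: (m, n, n) per-relation adjacency matrices
    # W[t]: (m, d, d) relation-specific matrices W_r^t; W0[t]: (d, d) padding matrices W_0^t
    # F_diag, G_diag: (n,) diagonals of F and G (exponentiated node-type potentials)
    # delta: (m,) single-relation scores; tau: (m, m) transition scores, tau[r', r]
    m, n, d = A.shape[0], X.shape[0], X.shape[1]
    M = (F_diag.unsqueeze(-1) * X).unsqueeze(0).expand(m, n, d).clone()  # M_r <- F X
    Z = []
    for k in range(1, K + 1):
        if k == 1:
            M = torch.exp(delta).view(m, 1, 1) * torch.einsum('rij,rjd,rde->rie', A, M, W[k])
        else:
            mixed = torch.einsum('pr,pjd->rjd', torch.exp(tau), M)  # sum_{r'} e^{tau(r',r)} M_{r'}
            M = torch.exp(delta).view(m, 1, 1) * torch.einsum('rij,rjd,rde->rie', A, mixed, W[k])
        Zk = G_diag.unsqueeze(-1) * M.sum(dim=0)   # apply end potential G and sum over relations
        for t in range(k + 1, K + 1):              # remaining padding matrices W_0^{k+1} ... W_0^K
            Zk = Zk @ W0[t]
        Z.append(Zk)  # unnormalized Z^k; Eq. 14's normalizer is obtained by re-running with W = I, X = 1
    return Z
```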

C Formal Definition of K-hop RN

Definition 1 (K-hop Relation Network). A multi-hop relation network is a function that maps a multi-relational graph to a fixed-size vector:

$$\text{KHopRN}(\mathcal{G}; W, E, H) = \sum_{k=1}^{K}\sum_{\substack{(j,r_1,\dots,r_k,i)\in\Phi_k \\ j\in Q,\ i\in A}} \beta(j,r_1,\dots,r_k,i)\cdot W\big(h_j \oplus (e_{r_1}\circ\cdots\circ e_{r_k}) \oplus h_i\big), \tag{15}$$

where ∘ denotes the element-wise product and β(⋯) = 1 / (K|A| · |{(j', ..., i) ∈ G | j' ∈ Q}|) defines the pooling weights.

Algorithm 1: Dynamic programming algorithm for multi-hop message passing
Input: s, X, A_r (1 ≤ r ≤ m), W_r^t (r ∈ R, 1 ≤ t ≤ K), F, G, δ, τ
Output: Z

 1: W^K ← I
 2: for k ← K − 1 to 1 do
 3:     W^k ← W_0^{k+1} W^{k+1}
 4: end for
 5: for r ∈ R do
 6:     M_r ← F X
 7: end for
 8: for k ← 1 to K do
 9:     if k > 1 then
10:         for r ∈ R do
11:             M'_r ← e^{δ(r,s)} A_r Σ_{r'∈R} e^{τ(r',r)} M_{r'} W_r^k
12:         end for
13:         for r ∈ R do
14:             M_r ← M'_r
15:         end for
16:     else
17:         for r ∈ R do
18:             M_r ← e^{δ(r,s)} · A_r M_r W_r^k
19:         end for
20:     end if
21:     Z^k ← G Σ_{r∈R} M_r W^k
22: end for
23: Replace W_r^t (0 ≤ r ≤ m, 1 ≤ t ≤ K) with identity matrices and X with 1, and re-run line 1 - line 19 to compute d^1, ..., d^K
24: for k ← 1 to K do
25:     Z^k ← (diag(d^k))^{-1} Z^k
26: end for
27: return Z^1, Z^2, ..., Z^K

D Expressing K-hop RN with MHGRN

Theorem 1. Given any W, E, H, there exists a parameter setting such that the output of the model becomes KHopRN(G; W, E, H) for arbitrary G.

Proof. Suppose W = [W_1, W_2, W_3], where W_1, W_3 ∈ R^{d_3×d_1} and W_2 ∈ R^{d_3×d_2}. For MHGRN, we set the parameters as follows: H = H, U_* = [I; 0] ∈ R^{(d_1+d_2)×d_1}, b_* = [0; 1]^⊤ ∈ R^{d_1+d_2}, W_r^t = diag(1 ⊕ e_r) ∈ R^{(d_1+d_2)×(d_1+d_2)} (r ∈ R, 1 ≤ t ≤ K), V = W_3 ∈ R^{d_3×d_1}, V' = [W_1, W_2] ∈ R^{d_3×(d_1+d_2)}. We disable the relation type attention module and enable message passing only from Q to A. By further choosing σ as the identity function and performing pooling over A, we observe that the output of MHGRN becomes

$$
\begin{aligned}
\frac{1}{|A|}\sum_{i\in A} h'_i
&= \frac{1}{|A|}\sum_{i\in A}\big(V h_i + V' z_i\big) \\
&= \frac{1}{K|A|}\sum_{i\in A}\sum_{k=1}^{K}\big(V h_i + V' z_i^k\big) \\
&= \sum_{k=1}^{K}\sum_{\substack{(j,r_1,\dots,r_k,i)\in\Phi_k\\ j\in Q,\ i\in A}}\beta(j,r_1,\dots,r_k,i)\big(V h_i + V' W_{r_k}^k\cdots W_{r_1}^1 x_j\big) \\
&= \sum_{k=1}^{K}\sum_{\substack{(j,r_1,\dots,r_k,i)\in\Phi_k\\ j\in Q,\ i\in A}}\beta(\cdots)\big(V h_i + V' W_{r_k}^k\cdots W_{r_1}^1 U_{\phi(j)} h_j + V' W_{r_k}^k\cdots W_{r_1}^1 b_{\phi(j)}\big) \\
&= \sum_{k=1}^{K}\sum_{\substack{(j,r_1,\dots,r_k,i)\in\Phi_k\\ j\in Q,\ i\in A}}\beta(\cdots)\big(W_3 h_i + W_1 h_j + W_2(e_{r_1}\circ\cdots\circ e_{r_k})\big) \\
&= \sum_{k=1}^{K}\sum_{\substack{(j,r_1,\dots,r_k,i)\in\Phi_k\\ j\in Q,\ i\in A}}\beta(\cdots)\,W\big(h_j\oplus(e_{r_1}\circ\cdots\circ e_{r_k})\oplus h_i\big) \\
&= \text{KHopRN}(\mathcal{G}; W, E, H).
\end{aligned}
\tag{16}
$$

Page 3: Scalable Multi-Hop Relational Reasoning for Knowledge ... · 1 Introduction Many recently proposed question answering tasks require not only machine comprehension of the question

PlausibilityScore

Graph ExtractionFrom KG

Graph Encoder

Text Encoder

120022

119956119954

119938

Figure 3 Overview of the knowledge-aware QAframework It integrates the output from graph en-coder (for relational reasoning over contextual sub-graphs) and text encoder (for textual understanding) togenerate the plausibility score for an answer option

question q our goal is to identify the correct answerfrom a set C of given choices We turn this prob-lem into measuring the plausibility score betweenq and each answer choice a isin C then selecting theanswer with the highest plausibility score

We use q and a to denote the representationvectors of question q and option a To measurethe score for q and a we first concatenate them toform a statement s = [qa] Then we extract fromthe external KG a subgraph G (ie schema graphin KagNet (Lin et al 2019)) with the guidance ofs (detailed in sect51) This contextualized subgraphis defined as a multi-relational graph G = (V E φ)Here V is a subset of entity nodes in the externalKG containing only entities relevant to s E sube

V timesRtimesV is the set of edges that connect nodes inV where R = 1⋯m are ids of all pre-definedrelation types The mapping function φ(i) ∶ V rarrT = EqEaEo takes node i isin V as input andoutputs Eq if i is an entity mentioned in q Ea ifit is mentioned in a or Eo otherwise We finallyencode the statement to s G to g concatenate sand g for calculating the plausibility score

3 Background Multi-Relational GraphEncoding Methods

We leave encoding of s to pre-trained languagemodels which have shown powerful text represen-tation ability while we focus on the challengeof encoding graph G to capture latent relationsbetween entities Current methods for encodingmulti-relational graphs mainly fall into two cat-egories graph neural networks and path-basedmodels Graph neural networks encode structuredinformation by passing messages between nodesdirectly operating on the graph structure whilepath-based methods first decompose the graph intopaths and then pool features over all the paths

Graph Encoding with GNNs For a graph withn nodes a graph neural network (GNN) takes aset of node features h1h2 hn as input andcomputes their corresponding node embeddingshprime1hprime2 hprimen via message passing (Gilmeret al 2017) A compact graph representation forG can thus be obtained by pooling over the nodeembeddings hprimei

GNN(G) = Pool(hprime1hprime2 hprimen) (1)

As a notable variant of GNNs graph convo-lutional networks (GCNs) (Kipf and Welling2017) additionally update node embeddings byaggregating messages from its direct neighborsRGCNs (Schlichtkrull et al 2018) extend GCNs toencode multi-relational graphs by defining relation-specific weight matrix Wr for each edge type

hprimei = σ

⎛⎜⎝(sumrisinR

∣N ri ∣)

minus1

sumrisinR

sumjisinN r

i

Wrhj⎞⎟⎠ (2)

where N ri denotes neighbors of node i under rela-

tion r2

While GNNs have proved to have good scalabil-ity their reasoning is done at the node level thusmaking them incompatible with modeling path-level reasoning chains-a crucial component for QAtasks that require relational reasoning This prop-erty also hinders the modelrsquos decisions from beinginterpreted at the path levelGraph Encoding with Path-Based Models Inaddition to directly modeling the graph with GNNsone can also view a graph as a set of relationalpaths connecting pairs of entities

Relation Networks (RN) (Santoro et al 2017)can be adapted to multi-relational graph encodingunder QA settings RNs use MLPs to encode alltriples (one-hop paths) in G whose head entity isin Q = j ∣ φ(j) = Eq and tail entity is inA = i ∣ φ(i) = Ea It then pools the tripleembeddings to generate a vector for G as follows

RN(G) = Pool(MLP(hj oplus eroplus

hi) ∣ j isin Q i isin A (j r i) isin E) (3)

Here hj and hi are features for nodes j and i er isthe embedding of relation r isin R oplus denotes vectorconcatenation

2For simplicity we assume a single graph convolutionallayer In practice multiple layers are stacked to enable mes-sage passing from multi-hop neighbors

AttentivePooling

2 Multi-hop Message Passing

Node Features

Text Encodereg BERT

1 Node Type Specific Transform

3 Nonlinear Activation

timesLayer

PlausibilityScore

MLP

120022 119956

AggregateMessages

120630(119947 119955120783 hellip 119955119948 119946)

Structured Relational Att

Learn

Child

Child CapableOf UsedForminus1 AtLocation

Desires UsedForminus1 Synonym

AtLocation Synonym

Sit AtLocationUsedForminus1

Desk AtLocation

hellipMulti-hop Message Passing

Figure 4 Our proposed MHGRN architecturefor relational reasoning MHGRN takes a multi-relational graph G and a (question-answer) statementvector s as input and outputs a scalar that represent theplausibility score of this statement

To further equip RN with the ability of model-ing nondegenerate paths KagNet (Lin et al 2019)adopts LSTMs to encode all paths connecting ques-tion entities and answer entities with lengths nomore than K It then aggregates all path embed-dings via attention mechanism

KAGNET(G) = Pool(LSTM(j r1 rk i) ∣

(j r1 j1)⋯ (jkminus1 rk i) isin E 1 le k le K)(4)

4 Proposed Method Multi-Hop GraphRelation Networks (MHGRN)

This section presents Multi-hop Graph RelationNetwork (MHGRN) a novel graph neural networkarchitecture that unifies both GNN and RN forencoding multi-relational graphs to augment textcomprehension MHGRN inherits the capability ofpath reasoning and interpretabilty from path-basedmodels while preserving good scalability of GNNwith the message passing formulation

41 MHGRN Model Architecture

Following the introduction to GNN in Sec 3 weconsider encoding a multi-relational graph G =

(V E φ) into a fixed size vector g conditioned ontextual representation s by first transforming input

node features h1 hn into node embeddingshprime1 hprimen and then pooling these embeddingsNode features can be initialized with pre-trainedweights (details in Appendix A) and we focus onthe computation of node embeddings

Type-Specific Transformation To make ourmodel aware of the node type information φ wefirst perform node type specific linear transforma-tion on the input node features

xi = Uφ(i)hi + bφ(i) (5)

where the learnable parameters U and b are specificto the type of node i

Multi-Hop Message Passing As mentioned be-fore our motivation is to endow GNN with thecapability of directly modeling paths To this endwe propose to pass messages directly over all therelational paths of lengths up to K where K is ahyper-parameter The set of valid k-hop relationalpaths is defined as

Φk = (j r1 rk i) ∣ (j r1 j1)⋯ (jkminus1 rk i) isin E (1 le k le K) (6)

We perform k-hop (1 le k le K) message passingover these paths which is a generalization of thesingle-hop message passing in RGCN (see Eq 2)

zki = sum

(jr1rki)isinΦk

α(j r1 rk i)dki sdotWK0

⋯Wk+10 W

krk⋯W

1r1xj (1 le k le K) (7)

where the Wtr (1 le t le K 0 le r le m) matrices

are learnable3 α(j r1 rk i) is an attentionscore elaborated in sect42 and d

ki = sum(j⋯i)isinΦk

α(j⋯i)is the normalization factor The W k

rk⋯W1r1 ∣ 1 le

r1 rk le m matrices can be interpreted as thelow rank approximation of a mtimes⋯timesmktimesdtimesdtensor that assigns a separate transformation foreach k-hop relation where d is the dim of xi

Incoming messages from paths of differentlengths are aggregated via attention mecha-nism (Vaswani et al 2017)

zi =K

sumk=1

softmax(bilinear(s zki )) sdot zki (8)

3W

t0(0 le t le K) are introduced as padding matrices

so that K transformations are applied regardless of k thusensuring comparable scale of zki across different k

Non-linear Activation Finally we apply shortcutconnection and nonlinear activation to obtain theoutput node embeddings

hprimei = σ (V hi + V

primezi) (9)

where V and Vprime are learnable model parameters

and σ is a non-linear activation function

42 Structured Relational AttentionThe remaining problem becomes how to effectivelyparameterize the attention score α(j r1 rk i) in

Eq 7 for all possible k-hop paths without introduc-ing O(mk) parameters We first regard it as the prob-ability of a relation sequence (φ(j) r1 rk φ(i))conditioned on s

α(j r1 rk i) = p (φ(j) r1 rk φ(i) ∣ s) (10)

which can naturally be modeled by a probabilisticgraphical model such as conditional random field(Lafferty et al 2001)

p (⋯ ∣ s)prop exp(f(φ(j) s) +k

sumt=1

δ(rt s)

+kminus1

sumt=1

τ(rt rt+1) + g(φ(i) s))

∆= β(r1 rk s)IacuteOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveNtildeOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveIuml

Relation Type Attention

sdot γ(φ(j) φ(i) s)IacuteOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveNtildeOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveIumlNode Type Attention

(11)

where f(sdot) δ(sdot) and g(sdot) are parameterized bytwo-layer MLPs and τ(sdot) by a transition matrixof shape m timesm Intuitively β(sdot) models the im-portance of a k-hop relation while γ(sdot) models theimportance of messages from node type φ(j) tonode type φ(i) (eg the model can learn to passmessages only from question entities to answerentities)

Our model scores a k-hop relation by decomposing it into both context-aware single-hop relations (modeled by δ) and two-hop relations (modeled by τ). We argue that τ is indispensable; without it, the model may assign high importance to illogical multi-hop relations (e.g., [AtLocation, CapableOf]) or noisy relations (e.g., [RelatedTo, RelatedTo]).
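As an illustration, a minimal sketch of this factorized scoring is given below; the module and parameter names (and the two-layer MLP sizes) are assumptions for exposition, not the released code. It returns the unnormalized log-potential of Eq. 11 for a single path; normalizing exp(·) of these scores over all paths ending at a node (the d_i^k factor in Eq. 7) yields α, which can be computed with the dynamic program of Appendix B.

```python
import torch
import torch.nn as nn

class StructuredRelationalAttention(nn.Module):
    def __init__(self, n_rel, n_node_type, s_dim, hidden=128):
        super().__init__()
        def mlp(out_dim):
            return nn.Sequential(nn.Linear(s_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, out_dim))
        self.f = mlp(n_node_type)        # f(phi(j), s): source node-type score
        self.g = mlp(n_node_type)        # g(phi(i), s): target node-type score
        self.delta = mlp(n_rel)          # delta(r_t, s): per-relation score
        self.tau = nn.Parameter(torch.zeros(n_rel, n_rel))  # tau(r_t, r_{t+1})

    def unnormalized_log_score(self, src_type, rel_seq, tgt_type, s):
        """Log of the CRF potential in Eq. 11 for one relational path
        (before normalization over competing paths)."""
        score = self.f(s)[src_type] + self.g(s)[tgt_type]
        rel_scores = self.delta(s)
        for t, r in enumerate(rel_seq):
            score = score + rel_scores[r]
            if t + 1 < len(rel_seq):
                score = score + self.tau[r, rel_seq[t + 1]]
        return score
```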

4.3 Computation Complexity Analysis

Although the multi-hop message passing process in Eq. 7 and the structured relational attention module in Eq. 11 handle a potentially exponential number of paths, we show that they can be computed in

Model            Time                 Space

G is a dense graph
K-hop KagNet     O(m^K n^(K+1) K)     O(m^K n^(K+1) K)
K-layer RGCN     O(m n^2 K)           O(m n K)
MHGRN            O(m^2 n^2 K)         O(m n K)

G is a sparse graph with maximum node degree Δ ≪ n
K-hop KagNet     O(m^K n Δ^K K)       O(m^K n Δ^K K)
K-layer RGCN     O(m n K Δ)           O(m n K)
MHGRN            O(m^2 n K Δ)         O(m n K)

Table 2: Computation complexity of different K-hop reasoning models on a dense/sparse multi-relational graph with n nodes and m relation types. Despite the quadratic complexity w.r.t. m, MHGRN's time cost is similar to RGCN on GPUs with parallelizable matrix multiplications (cf. Fig. 7).

linear time using dynamic programming (see Appendix B). As summarized in Table 2, the time complexity and space complexity of our model on a sparse graph are O(m^2 n K Δ) and O(m n K), respectively, both of which are linear with respect to either the path length K or the number of nodes n.

4.4 Expressive Power of MHGRN

In addition to efficiency and scalability, we now discuss the modeling capacity of our model. With the message passing formulation and relation-specific transformations, MHGRN is by nature a generalization of RGCN. Furthermore, it is capable of directly modeling paths, making it interpretable as are path-based models like RN and KagNet. To show this, we first generalize RN (Eq. 3) to the multi-hop setting and introduce K-hop RN (see Appendix C for a formal definition). K-hop RN models a multi-hop relation as the composition of single-hop relations with element-wise multiplication. We show that MHGRN is capable of representing K-hop RN (see Appendix D for the proof).

4.5 Learning, Inference and Path Decoding

We now discuss the learning and inference process of MHGRN, instantiated for the task of multiple-choice question answering. Following the problem formulation in Sec. 2, we aim to determine the plausibility of an answer candidate a ∈ C given the question q, with the information from both the text s and the graph G. We first obtain the graph representation g by performing attentive pooling over the output node embeddings of the answer

Methods                            BERT-Base                      BERT-Large                     RoBERTa-Large
                                   IHdev-Acc(%)   IHtest-Acc(%)   IHdev-Acc(%)   IHtest-Acc(%)   IHdev-Acc(%)   IHtest-Acc(%)
w/o KG                             57.31 (±1.07)  53.47 (±0.87)   61.06 (±0.85)  55.39 (±0.40)   73.07 (±0.45)  68.69 (±0.56)
RGCN (Schlichtkrull et al., 2018)  56.94 (±0.38)  54.50 (±0.56)   62.98 (±0.82)  57.13 (±0.36)   72.69 (±0.19)  68.41 (±0.66)
GconAttn (Wang et al., 2019)       57.29 (±0.62)  54.41 (±0.50)   62.98 (±0.17)  56.94 (±0.77)   73.38 (±0.27)  69.88 (±0.47)
KagNet† (Lin et al., 2019)         55.57          56.19           62.35          57.16           -              -
RN (1-hop)                         58.27 (±0.22)  56.20 (±0.45)   63.04 (±0.58)  58.46 (±0.71)   74.57 (±0.91)  70.08 (±0.21)
RN (2-hop)                         59.81 (±0.76)  56.61 (±0.68)   63.36 (±0.26)  58.92 (±0.14)   73.65 (±3.09)  69.59 (±3.80)
MHGRN                              60.36 (±0.23)  57.23 (±0.82)   63.29 (±0.51)  60.59 (±0.58)   74.45 (±0.10)  71.11 (±0.81)

Table 3: Performance comparison on the CommonsenseQA in-house split. We report in-house Dev (IHdev) and Test (IHtest) accuracy (mean and standard deviation of four runs) using the data split of Lin et al. (2019) on CommonsenseQA. † indicates results reported in the original paper.

entities {h'_i | i ∈ A}. Next, we simply concatenate it with the text representation s and compute the plausibility score by ρ(q, a) = MLP(s ⊕ g).

During training, we maximize the plausibility score of the correct answer â by minimizing the cross-entropy loss:

\mathcal{L} = \mathbb{E}_{q, \hat{a}, C} \left[ -\log \frac{\exp(\rho(q, \hat{a}))}{\sum_{a \in C} \exp(\rho(q, a))} \right] \tag{12}

The whole model is trained end-to-end, jointly with the text encoder (e.g., RoBERTa).

During inference, we predict the most plausible answer by argmax_{a∈C} ρ(q, a). Additionally, we can decode a reasoning path as evidence for model predictions, endowing our model with the interpretability enjoyed by path-based models. Specifically, we first determine the answer entity i* with the highest score in the pooling layer and the path length k* with the highest score in Eq. 8. Then the reasoning path is decoded by argmax α(j, r_1, …, r_{k*}, i*), which can be computed in linear time using dynamic programming.
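A minimal sketch of the scoring and training objective is shown below. Module names, hidden sizes, and the pooling parameterization are illustrative assumptions rather than the released code; the path-decoding step reuses the dynamic program of Appendix B and is omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PlausibilityScorer(nn.Module):
    def __init__(self, d_node, d_text, d_hidden=200):
        super().__init__()
        self.att = nn.Linear(d_text, d_node)   # query for attentive pooling over answer entities
        self.mlp = nn.Sequential(nn.Linear(d_text + d_node, d_hidden),
                                 nn.ReLU(), nn.Linear(d_hidden, 1))

    def forward(self, s, answer_node_embs):
        # attentive pooling over {h'_i | i in A}
        q = self.att(s)                                   # (d_node,)
        w = torch.softmax(answer_node_embs @ q, dim=0)    # (|A|,)
        g = (w.unsqueeze(-1) * answer_node_embs).sum(0)   # graph vector g
        return self.mlp(torch.cat([s, g], dim=-1)).squeeze(-1)  # rho(q, a)

def qa_loss(candidate_scores, correct_idx):
    """Eq. 12: softmax cross-entropy over the scores of all candidates in C."""
    return F.cross_entropy(candidate_scores.unsqueeze(0),
                           torch.tensor([correct_idx]))
```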

5 Experimental Setup

We introduce how we construct G (§5.1), the datasets (§5.2), as well as the baseline methods (§5.3). Appendix A gives more implementation and experimental details for reproducibility.

5.1 Extracting G from External KG

We use ConceptNet (Speer et al., 2017), a general-domain knowledge graph, as our external KG to test models' ability to harness structured knowledge sources. Following KagNet (Lin et al., 2019), we merge relation types to increase graph density and add reverse relations to construct a multi-relational graph with 34 relation types. To extract an informative contextual graph G from the KG, we recognize entity mentions in s and link them to entities in

Methods                                   Single   Ensemble
RoBERTa†                                  72.1     72.5
ALBERT† (Lan et al., 2019)                -        76.5
XLNet + GR† (Lv et al., 2019)             75.3     -
RoBERTa + KEDGN†                          72.5     74.4
RoBERTa + KE†                             73.3     -
RoBERTa + HyKAS 2.0† (Ma et al., 2019)    73.2     -
XLNet + DREAM†                            66.9     73.3
RoBERTa + FreeLB† (Zhu et al., 2020)      72.2     73.1
MHGRN (RoBERTa, K = 2)                    75.4     76.5

Table 4: Performance comparison on the official test set of CommonsenseQA with leaderboard SoTAs⁴ (accuracy in %). † indicates results reported on the leaderboard. MHGRN achieves state-of-the-art in both the single-model and ensemble settings with the RoBERTa-large encoder.

ConceptNet, with which we initialize our node set V. We then add to V all the entities that appear in any two-hop path between pairs of mentioned entities. Unlike KagNet, we do not perform any pruning; instead, we keep all the edges between nodes in V, forming our G.
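For illustration, a hedged sketch of this extraction step on a toy multi-relational graph is given below. The function and variable names are assumptions; the actual pipeline operates on the full merged 34-relation ConceptNet graph and includes an entity-linking step not shown here.

```python
import networkx as nx

def extract_contextual_graph(kg: nx.MultiDiGraph, question_ents, answer_ents):
    """Build the contextual graph G: mentioned entities, every entity on a
    path of length <= 2 between them, and all edges among these nodes."""
    mentioned = set(question_ents) | set(answer_ents)
    nodes = set(mentioned)
    und = kg.to_undirected(as_view=True)
    for a in mentioned:
        for b in mentioned:
            if a == b or a not in kg or b not in kg:
                continue
            for path in nx.all_simple_paths(und, a, b, cutoff=2):
                nodes.update(path)          # keep intermediate bridge entities
    return kg.subgraph(nodes).copy()        # no pruning: keep all induced edges
```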

5.2 Datasets

We evaluate models on two multiple-choice question answering datasets, CommonsenseQA and OpenbookQA. Both require world knowledge beyond textual understanding to perform well.

CommonsenseQA (Talmor et al., 2019) necessitates various commonsense reasoning skills. The questions are created with entities from ConceptNet, and they are designed to probe latent compositional relations between entities in ConceptNet.

OpenBookQA (Mihaylov et al., 2018) provides elementary science questions together with an open book of science facts. This dataset also probes general commonsense knowledge beyond the provided facts. As our model is orthogonal to text-form knowledge retrieval, we do not utilize the provided open book and instead use ConceptNet. Consequently, we do not compare our methods with those using the open book.

⁴ Models based on ConceptNet are no longer shown on the leaderboard, and thus we obtained our results directly from the organizers.

5.3 Compared Methods

We implement both knowledge-agnostic fine-tuning of pre-trained LMs and models that incorporate a KG as an external source as our baselines. Additionally, we directly compare our model with the results from the corresponding leaderboards. These methods typically leverage textual knowledge or extra training data, as opposed to an external KG. In all our implemented models, we use pre-trained LMs as text encoders for s for fair comparison. We stick to our focus of encoding structured KGs and therefore do not compare our models with those (Pan et al., 2019; Zhang et al., 2018; Sun et al., 2019; Banerjee et al., 2019) augmented by other text-form external knowledge (e.g., Wikipedia).

Specifically, we fine-tune BERT-BASE, BERT-LARGE (Devlin et al., 2019), and ROBERTA (Liu et al., 2019b) for multiple-choice questions. We take RGCN (Eq. 2 in Sec. 3), RN⁵ (Eq. 3 in Sec. 3), KagNet (Eq. 4 in Sec. 3), and GconAttn (Wang et al., 2019) as baselines. GconAttn generalizes match-LSTM (Wang and Jiang, 2016) from sequence modeling to entity set modeling and achieves success in language inference tasks.

6 Results and Discussions

In this section, we present the results of our models in comparison with baselines, as well as methods on the leaderboards, for both CommonsenseQA and OpenbookQA. We also provide analysis of the models' components and characteristics.

6.1 Main Results

For CommonsenseQA (Table 3), we first use the in-house data split of Lin et al. (2019) to compare our models with our implemented baselines. In this setting, we take 1,241 examples from the official training examples as our in-house test examples and regard the remaining 8,500 as our in-house training examples. Almost all KG-augmented models achieve performance gains over vanilla pre-trained

⁵ We use mean pooling for 1-hop RN and attentive pooling for 2-hop RN (detailed in Appendix C).

Methods                        Dev (%)          Test (%)
BERT (w/o KG)†                 60.4             60.4
BERT Multi-Task (w/o KG)†      -                63.8
GapQA† (Khot et al., 2019)     -                59.4 (±1.30)
RoBERTa-Large (w/o KG)         66.76 (±1.14)    64.80 (±2.37)
+ RGCN                         64.65 (±1.96)    62.45 (±1.57)
+ GconAttn                     66.85 (±1.82)    64.75 (±1.48)
+ RN (1-hop)                   64.85 (±1.11)    63.65 (±2.31)
+ RN (2-hop)                   67.00 (±0.71)    65.20 (±1.18)
+ MHGRN (K = 3)                68.10 (±1.02)    66.85 (±1.19)

Table 5: Dev and Test accuracy on OpenbookQA. † indicates results reported on the leaderboard.

Methods                                      IHdev-Acc (%)
MHGRN (K = 3)                                74.45 (±0.10)
- Type-specific transformation (§4.1)        73.16 (±0.28)
- Structured relational attention (§4.2)     73.26 (±0.31)
- Relation type attention (§4.2)             73.55 (±0.68)
- Node type attention (§4.2)                 73.92 (±0.65)

Table 6: Ablation study on model components (removing one component each time) using ROBERTA-LARGE as the text encoder. We report the IHdev accuracy on CommonsenseQA.

LMs, demonstrating the value of external knowledge on this dataset. Additionally, we evaluate our MHGRN (with the text encoder being ROBERTA-LARGE) on the official split (Table 4) for fair comparison with other methods on the leaderboard, in both the single-model and ensemble-model settings. In both cases, we achieve state-of-the-art performance across all existing models.

For OpenbookQA (Table 5), we use the official split and build models with ROBERTA-LARGE as the text encoder. Overall, external knowledge still brings benefits to this task. Our model surpasses all baselines, with an absolute increase of ~2% on Test. As we seek to evaluate models in terms of encoding external knowledge graphs, we do not compare with models based on document retrieval and reading comprehension. Note that other submissions on the OBQA leaderboard were usually fine-tuned on external question answering datasets or used large corpora via information retrieval. As we seek to systematically examine knowledge-aware reasoning methods that reason over interpretable structures, we do not include comparisons with models beyond this scope. Other fine-tuned PTLMs can simply be incorporated as well whenever they are available.

6.2 Performance Analysis

Ablation Study on Model Components  We assess the impact of our model's components, shown in Table 6. Disabling the type-specific transformation results in a ~1.3% drop in performance, demonstrating the need for distinguishing node types in QA tasks. Our structured relational attention mechanism is also critical, with its two sub-components contributing almost equally.

Impact of the Amount of Training Data  We use different fractions of the CommonsenseQA training data and report the results of fine-tuning text encoders alone and of jointly training the text encoder and graph encoder in Fig. 5. Regardless of the training-data fraction, our model shows consistently larger performance improvements over knowledge-agnostic fine-tuning compared with the other graph encoding methods, indicating MHGRN's complementary strengths to text encoders.

[Figure 5 plot: accuracy (%) vs. percentage of training data, comparing MHGRN, GconAttn, and w/o KG.]

Figure 5: Performance change (accuracy in %) w.r.t. the amount of training data on the CommonsenseQA IHtest set (same split as Lin et al. (2019)).

Impact of Number of Hops (K)  We investigate the impact of the hyperparameter K on MHGRN via its performance on CommonsenseQA (shown in Fig. 6). Increasing K brings benefit up to K = 3, but performance begins to drop when K > 3. This might be attributed to exponential noise in longer relational paths in the knowledge graph.

6.3 Model Scalability

Fig. 7 presents the computation cost of MHGRN and RGCN (measured by training time). The computation cost of both models grows linearly w.r.t. K. Although the theoretical complexity of MHGRN is m times that of RGCN, the ratio of their empirical costs only approaches 2, demonstrating that our model can be parallelized.

[Figure 6 plot: IHdev accuracy (%) by number of reasoning hops — K=1: 73.24, K=2: 73.93, K=3: 74.65, K=4: 73.65, K=5: 74.16.]

Figure 6: Effect of K in MHGRN. We show the IHdev accuracy of MHGRN on CommonsenseQA using the data split of Lin et al. (2019) with different numbers of hops.

[Figure 7 plot: per-batch training time (s/batch) vs. number of reasoning hops K, for RGCN and MHGRN.]

Figure 7: Analysis of model scalability. Comparison of per-batch training efficiency w.r.t. the number of hops K.

6.4 Model Interpretability

We can analyze our model's reasoning process by decoding the reasoning path using the method described in §4.5. Fig. 8 shows two examples from CommonsenseQA where our model correctly answers the questions and provides reasonable path evidence. In the example on the left, the model links question entities and the answer entity in a chain to support reasoning, while the example on the right provides a case where our model leverages unmentioned entities to bridge the reasoning gap between the question entity and the answer entity, in a way that is coherent with the implied relation between CHAPEL and the desired answer in the question.

[Figure 8 examples: (1) "Where is known for a multitude of wedding chapels?" (A. town, B. texas, C. city, D. church building, E. Nevada*) — decoded path: CHAPEL --AtLocation--> RENO --PartOf--> NEVADA. (2) "Why do parents encourage their kids to play baseball?" (A. round, B. cheap, C. break window, D. hard, E. fun to play*) — decoded path: BASEBALL --UsedFor⁻¹--> PLAY_BASEBALL --HasProperty--> FUN_TO_PLAY.]

Figure 8: Case study on model interpretability. We present two sampled questions from CommonsenseQA with the reasoning paths output by MHGRN.

6.5 Potential Compatibility with Other Methods

In theory, our approach is naturally compatible with methods that utilize textual knowledge or extra data (such as the leaderboard methods in Table 4), because in our paradigm the encodings of the textual statement and the graph are structurally decoupled (Fig. 3). We can, for example, take the fine-tuned RoBERTa+KE⁶ system as our text encoder and leave the rest of our model architecture unchanged.

7 Related Work

Knowledge-Aware Methods for NLP  Various works have investigated the potential of empowering NLP models with external knowledge. Many attempt to extract structured knowledge, either in the form of nodes (Yang and Mitchell, 2017; Wang et al., 2019), triples (Weissenborn et al., 2017; Mihaylov and Frank, 2018), paths (Bauer et al., 2018; Kundu et al., 2019; Lin et al., 2019), or subgraphs (Li and Clark, 2015), and encode them to augment textual understanding.

The recent success of pre-trained LMs motivates many works (Pan et al., 2019; Ye et al., 2019; Zhang et al., 2018; Li et al., 2019; Banerjee et al., 2019) to probe LMs' potential as latent knowledge bases. This line of work turns to textual knowledge (e.g., Wikipedia) to directly impart knowledge to pre-trained LMs. They generally fall into two paradigms: 1) fine-tuning LMs on large-scale general-domain datasets (e.g., RACE (Lai et al., 2017)) or on knowledge-rich text; 2) providing LMs with evidence via information retrieval techniques. However, these models cannot provide explicit reasoning and evidence, and are thus hardly trustworthy. They are also subject to the availability of in-domain datasets and the maximum input length of pre-trained LMs.

Neural Graph Encoding  Graph Attention Networks (GAT) (Velickovic et al., 2018) incorporate an attention mechanism in feature aggregation. RGCN (Schlichtkrull et al., 2018) proposes relational message passing, which makes it applicable to multi-relational graphs. However, they only perform single-hop message passing and cannot be interpreted at the path level. Other works (Abu-El-Haija et al., 2019; Nikolentzos et al., 2019) aggregate, for a node, its K-hop neighbors based on node-wise distances, but they are designed for non-relational graphs. MHGRN addresses these issues by reasoning on multi-relational graphs and being interpretable via maintaining paths as reasoning chains.

⁶ https://github.com/jose77/csqa

8 Conclusion

We present a principled, scalable method, MHGRN, that can leverage general knowledge via multi-hop reasoning over interpretable structures (e.g., ConceptNet). The proposed MHGRN generalizes and combines the advantages of GNNs and path-based reasoning models. It explicitly performs multi-hop relational reasoning and is empirically shown to outperform existing methods with superior scalability and interpretability. Our extensive experiments systematically compare MHGRN with other existing knowledge-aware methods. In particular, we achieve state-of-the-art performance on the CommonsenseQA dataset.

References

Sami Abu-El-Haija, Bryan Perozzi, Amol Kapoor, Nazanin Alipourfard, Kristina Lerman, Hrayr Harutyunyan, Greg Ver Steeg, and Aram Galstyan. 2019. Mixhop: Higher-order graph convolutional architectures via sparsified neighborhood mixing. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 21-29. PMLR.

Pratyay Banerjee, Kuntal Kumar Pal, Arindam Mitra, and Chitta Baral. 2019. Careful selection of knowledge to solve open book question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6120-6129, Florence, Italy. Association for Computational Linguistics.

Lisa Bauer, Yicheng Wang, and Mohit Bansal. 2018. Commonsense for generative multi-hop question answering tasks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 4220-4230. Association for Computational Linguistics.

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. CoRR, abs/1803.05457.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171-4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. 2017. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages 1263-1272. PMLR.

Tushar Khot, Ashish Sabharwal, and Peter Clark. 2019. What's missing: A knowledge gap guided approach for multi-hop question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2814-2828, Hong Kong, China. Association for Computational Linguistics.

Thomas N. Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net.

Souvik Kundu, Tushar Khot, Ashish Sabharwal, and Peter Clark. 2019. Exploiting explicit paths for multi-hop reading comprehension. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2737-2747, Florence, Italy. Association for Computational Linguistics.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), Williams College, Williamstown, MA, USA, June 28 - July 1, 2001, pages 282-289. Morgan Kaufmann.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale ReAding comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785-794, Copenhagen, Denmark. Association for Computational Linguistics.

Zhen-Zhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A lite BERT for self-supervised learning of language representations. ArXiv, abs/1909.11942.

Shiyang Li, Jianshu Chen, and Dian Yu. 2019. Teaching pretrained models with commonsense reasoning: A preliminary KB-based approach. CoRR, abs/1909.09743.

Yang Li and Peter Clark. 2015. Answering elementary science questions by constructing coherent scenes using background knowledge. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2007-2012, Lisbon, Portugal. Association for Computational Linguistics.

Bill Yuchen Lin, Xinyue Chen, Jamin Chen, and Xiang Ren. 2019. KagNet: Knowledge-aware graph networks for commonsense reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2829-2839, Hong Kong, China. Association for Computational Linguistics.

Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. 2019a. On the variance of the adaptive learning rate and beyond. CoRR, abs/1908.03265.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.

Shangwen Lv, Daya Guo, Jingjing Xu, Duyu Tang, Nan Duan, Ming Gong, Linjun Shou, Daxin Jiang, Guihong Cao, and Songlin Hu. 2019. Graph-based reasoning over heterogeneous external knowledge for commonsense question answering. CoRR, abs/1909.05311.

Kaixin Ma, Jonathan Francis, Quanyang Lu, Eric Nyberg, and Alessandro Oltramari. 2019. Towards generalizable neuro-symbolic systems for commonsense question answering. In Proceedings of the First Workshop on Commonsense Inference in Natural Language Processing, pages 22-32, Hong Kong, China. Association for Computational Linguistics.

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381-2391, Brussels, Belgium. Association for Computational Linguistics.

Todor Mihaylov and Anette Frank. 2018. Knowledgeable reader: Enhancing cloze-style reading comprehension with external commonsense knowledge. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 821-832, Melbourne, Australia. Association for Computational Linguistics.

Giannis Nikolentzos, George Dasoulas, and Michalis Vazirgiannis. 2019. k-hop graph neural networks. CoRR, abs/1907.06051.

Xiaoman Pan, Kai Sun, Dian Yu, Heng Ji, and Dong Yu. 2019. Improving question answering with external knowledge. CoRR, abs/1902.00993.

Adam Santoro, David Raposo, David G. T. Barrett, Mateusz Malinowski, Razvan Pascanu, Peter W. Battaglia, and Tim Lillicrap. 2017. A simple neural network module for relational reasoning. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 4967-4976.

Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. 2019. Social IQa: Commonsense reasoning about social interactions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4463-4473, Hong Kong, China. Association for Computational Linguistics.

Michael Sejr Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. 2018. Modeling relational data with graph convolutional networks. In The Semantic Web - 15th International Conference, ESWC 2018, Heraklion, Crete, Greece, June 3-7, 2018, Proceedings, volume 10843 of Lecture Notes in Computer Science, pages 593-607. Springer.

Robyn Speer, Joshua Chin, and Catherine Havasi. 2017. ConceptNet 5.5: An open multilingual graph of general knowledge. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, pages 4444-4451. AAAI Press.

Kai Sun, Dian Yu, Dong Yu, and Claire Cardie. 2019. Improving machine reading comprehension with general reading strategies. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2633-2643, Minneapolis, Minnesota. Association for Computational Linguistics.

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149-4158, Minneapolis, Minnesota. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 5998-6008.

Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2018. Graph attention networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net.

Shuohang Wang and Jing Jiang. 2016. Learning natural language inference with LSTM. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1442-1451, San Diego, California. Association for Computational Linguistics.

Xiaoyan Wang, Pavan Kapanipathi, Ryan Musa, Mo Yu, Kartik Talamadupula, Ibrahim Abdelaziz, Maria Chang, Achille Fokoue, Bassem Makni, Nicholas Mattei, and Michael Witbrock. 2019. Improving natural language inference using external knowledge in the science questions domain. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019, pages 7208-7215. AAAI Press.

Dirk Weissenborn, Tomas Kocisky, and Chris Dyer. 2017. Dynamic integration of background knowledge in neural NLU systems. arXiv preprint arXiv:1706.02596.

An Yang, Quan Wang, Jing Liu, Kai Liu, Yajuan Lyu, Hua Wu, Qiaoqiao She, and Sujian Li. 2019. Enhancing pre-trained language representations with rich knowledge for machine reading comprehension. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2346-2357, Florence, Italy. Association for Computational Linguistics.

Bishan Yang and Tom Mitchell. 2017. Leveraging knowledge bases in LSTMs for improving machine reading. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1436-1446, Vancouver, Canada. Association for Computational Linguistics.

Zhi-Xiu Ye, Qian Chen, Wen Wang, and Zhen-Hua Ling. 2019. Align, mask and select: A simple method for incorporating commonsense knowledge into language representation models. CoRR, abs/1908.06725.

Yuyu Zhang, Hanjun Dai, Kamil Toraman, and Le Song. 2018. KG^2: Learning to reason science exam questions with contextual knowledge graph embeddings. CoRR, abs/1805.12393.

Chen Zhu, Yu Cheng, Zhe Gan, Siqi Sun, Tom Goldstein, and Jingjing Liu. 2020. FreeLB: Enhanced adversarial training for natural language understanding. In International Conference on Learning Representations.

A Implementation Details

                   CommonsenseQA   OpenbookQA
BERT-BASE          3 × 10⁻⁵        -
BERT-LARGE         2 × 10⁻⁵        -
ROBERTA-LARGE      1 × 10⁻⁵        1 × 10⁻⁵

Table 7: Learning rates for text encoders on different datasets.

                   CommonsenseQA   OpenbookQA
RN                 3 × 10⁻⁴        3 × 10⁻⁴
RGCN               1 × 10⁻³        1 × 10⁻³
GconAttn           3 × 10⁻⁴        1 × 10⁻⁴
KV-Memory          1 × 10⁻³        1 × 10⁻³
MHGRN              1 × 10⁻³        1 × 10⁻³

Table 8: Learning rates for graph encoders on different datasets.

Our models are implemented in PyTorch. We use cross-entropy loss and the RAdam (Liu et al., 2019a) optimizer. We find it beneficial to use separate learning rates for the text encoder and the graph encoder, and we tune these learning rates on both datasets. We first fine-tune ROBERTA-LARGE, BERT-LARGE, and BERT-BASE on CommonsenseQA and ROBERTA-LARGE on OpenbookQA, respectively, and choose a dataset-specific learning rate from {1 × 10⁻⁵, 2 × 10⁻⁵, 3 × 10⁻⁵, 6 × 10⁻⁵, 1 × 10⁻⁴} for each text encoder based on the best performance on the development set, as listed in Table 7. We report the performance of these fine-tuned text encoders and also adopt their dataset-specific optimal learning rates in joint training with graph encoders. For models that involve a KG, the learning rates of their graph encoders⁷ are chosen from {1 × 10⁻⁴, 3 × 10⁻⁴, 1 × 10⁻³, 3 × 10⁻³} based on the best development set performance with ROBERTA-LARGE as the text encoder. We report the optimal learning rates for graph encoders in Table 8. In training, we set the maximum input sequence length to text encoders to 64, the batch size to 32, and perform early stopping.

For the input node features, we first use templates to turn knowledge triples in CommonsenseQA into sentences and feed them into pre-trained BERT-LARGE, obtaining a sequence of token embeddings from the last layer of BERT-LARGE for each triple. For each entity in CommonsenseQA, we perform mean pooling over the tokens of the entity's occurrences across all the sentences to form a 1024-d vector as its corresponding node feature. We use this set of features for all our implemented models.

⁷ KV-Memory actually encodes triples to enhance token representations of s instead of encoding the graph alone, but for simplicity we also refer to it as a graph encoder.
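The sketch below illustrates this feature-construction step with the HuggingFace transformers API; the exact verbalization templates, tokenization spans, and pooling details of the released pipeline may differ, and the function and argument names here are assumptions.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased")
bert = AutoModel.from_pretrained("bert-large-uncased").eval()

def entity_features(triple_sentences, entity_occurrences):
    """
    triple_sentences   : list of template-verbalized triples (one string each)
    entity_occurrences : dict entity -> list of (sentence_idx, (start, end))
                         token spans where the entity occurs
    Returns a dict entity -> 1024-d mean-pooled node feature.
    """
    with torch.no_grad():
        enc = tokenizer(triple_sentences, return_tensors="pt",
                        padding=True, truncation=True)
        hidden = bert(**enc).last_hidden_state           # (n_sent, L, 1024)
    feats = {}
    for ent, occs in entity_occurrences.items():
        vecs = [hidden[s, a:b].mean(dim=0) for s, (a, b) in occs]
        feats[ent] = torch.stack(vecs).mean(dim=0)        # mean over occurrences
    return feats
```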

We use a 2-layer RGCN and a single-layer MHGRN across our experiments.

B Dynamic Programming Algorithm for Eq. 7

To show that multi-hop message passing can be computed in linear time, we observe that Eq. 7 can be re-written in matrix form:

Z^k = (D^k)^{-1} \sum_{(r_1, \dots, r_k) \in R^k} \beta(r_1, \dots, r_k, s) \cdot G A_{r_k} \cdots A_{r_1} F X (W_{r_1}^1)^\top \cdots (W_{r_k}^k)^\top (W_0^{k+1})^\top \cdots (W_0^K)^\top, \quad 1 \le k \le K \tag{13}

where G = diag(exp([g(φ(v_1), s), …, g(φ(v_n), s)])) (F is similarly defined), A_r is the adjacency matrix for relation r, and D^k is defined as follows:

D^k = \text{diag}\Big( \sum_{(r_1, \dots, r_k) \in R^k} \beta(r_1, \dots, r_k, s) \cdot G A_{r_k} \cdots A_{r_1} F \mathbf{1} \Big), \quad 1 \le k \le K \tag{14}

Using this matrix formulation, we can compute Eq. 7 with dynamic programming, as summarized in Algorithm 1.

C Formal Definition of K-hop RN

Definition 1 (K-hop Relation Network). A multi-hop relation network is a function that maps a multi-relational graph to a fixed-size vector:

\text{KHopRN}(\mathcal{G}; W, E, H) = \sum_{k=1}^{K} \sum_{\substack{(j, r_1, \dots, r_k, i) \in \Phi_k \\ j \in Q,\; i \in A}} \beta(j, r_1, \dots, r_k, i) \cdot W \big( h_j \oplus (e_{r_1} \circ \cdots \circ e_{r_k}) \oplus h_i \big) \tag{15}

where ∘ denotes the element-wise product and β(⋯) = 1 / (K|A| · |{(j', …, i) ∈ G | j' ∈ Q}|) defines the pooling weights.

Algorithm 1: Dynamic programming algorithm for multi-hop message passing
Input: s, X, A_r (1 ≤ r ≤ m), W_r^t (r ∈ R, 1 ≤ t ≤ K), F, G, δ, τ
Output: Z

1:  W^K ← I
2:  for k ← K − 1 to 1 do
3:      W^k ← W_0^{k+1} W^{k+1}
4:  end for
5:  for r ∈ R do
6:      M_r ← FX
7:  end for
8:  for k ← 1 to K do
9:      if k > 1 then
10:         for r ∈ R do
11:             M'_r ← e^{δ(r, s)} A_r Σ_{r' ∈ R} e^{τ(r', r)} M_{r'} W_r^k
12:         end for
13:         for r ∈ R do
14:             M_r ← M'_r
15:         end for
16:     else
17:         for r ∈ R do
18:             M_r ← e^{δ(r, s)} · A_r M_r W_r^k
19:         end for
20:     end if
21:     Z^k ← G Σ_{r ∈ R} M_r W^k
22: end for
23: Replace W_r^t (0 ≤ r ≤ m, 1 ≤ t ≤ K) with identity matrices and X with 1, and re-run line 1 – line 19 to compute d^1, …, d^K
24: for k ← 1 to K do
25:     Z^k ← (diag(d^k))^{-1} Z^k
26: end for
27: return Z^1, Z^2, …, Z^K
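To make the recursion concrete, here is a hedged NumPy sketch of the forward pass above. Shapes, the treatment of δ and τ as precomputed score arrays for the given s, and the handling of the node-type factors F and G are simplifying assumptions; the per-hop normalization by D^k (lines 23–26) is omitted for brevity.

```python
import numpy as np

def multihop_dp(X, A, W, W0, delta, tau, F, G, K):
    """
    X     : (n, d) transformed node features
    A     : (m, n, n) adjacency matrices per relation
    W     : (K+1, m, d, d) with W[t, r] = W_r^t (indices t >= 1 used)
    W0    : (K+1, d, d) padding matrices W_0^t
    delta : (m,) relation scores for this statement s
    tau   : (m, m) transition scores tau[r_prev, r]
    F, G  : (n,) exp(node-type scores) for source / target nodes
    Returns [Z^1, ..., Z^K] of unnormalized hop-wise messages.
    """
    n, d = X.shape
    m = A.shape[0]
    # trailing padding products (W_0^{k+1} ... W_0^K applied as transposes)
    pad = [np.eye(d) for _ in range(K + 1)]
    for k in range(K - 1, 0, -1):
        pad[k] = pad[k + 1] @ W0[k + 1]
    # first hop: one message per relation, reused by later hops
    M = np.stack([(F[:, None] * X) @ W[1, r].T * np.exp(delta[r]) for r in range(m)])
    M = np.einsum('rij,rjd->rid', A, M)
    Z = [G[:, None] * (M.sum(0) @ pad[1].T)]
    for k in range(2, K + 1):
        # mix previous-hop messages with transition scores tau, then hop again
        mixed = np.einsum('pr,pnd->rnd', np.exp(tau), M)
        M = np.stack([np.exp(delta[r]) * (A[r] @ (mixed[r] @ W[k, r].T))
                      for r in range(m)])
        Z.append(G[:, None] * (M.sum(0) @ pad[k].T))
    return Z
```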

D Expressing K-hop RN with MHGRN

Theorem 1. Given any W, E, H, there exists a parameter setting such that the output of the model becomes KHopRN(G; W, E, H) for arbitrary G.

Proof. Suppose W = [W_1, W_2, W_3], where W_1, W_3 ∈ R^{d_3×d_1} and W_2 ∈ R^{d_3×d_2}. For MHGRN, we set the parameters as follows: H = H, U_* = [I; 0] ∈ R^{(d_1+d_2)×d_1}, b_* = [0; 1]^⊤ ∈ R^{d_1+d_2}, W_r^t = diag(1 ⊕ e_r) ∈ R^{(d_1+d_2)×(d_1+d_2)} (r ∈ R, 1 ≤ t ≤ K), V = W_3 ∈ R^{d_3×d_1}, V' = [W_1, W_2] ∈ R^{d_3×(d_1+d_2)}. We disable the relation type attention module and enable message passing only from Q to A. By further choosing σ as the identity function and performing pooling over A, we observe that the output of MHGRN becomes

\frac{1}{|A|} \sum_{i \in A} h'_i
  = \frac{1}{|A|} \sum_{i \in A} (V h_i + V' z_i)
  = \frac{1}{K|A|} \sum_{i \in A} \sum_{k=1}^{K} (V h_i + V' z_i^k)
  = \sum_{k=1}^{K} \sum_{\substack{(j, r_1, \dots, r_k, i) \in \Phi_k \\ j \in Q,\; i \in A}} \beta(j, r_1, \dots, r_k, i) \, (V h_i + V' W_{r_k}^k \cdots W_{r_1}^1 x_j)
  = \sum_{k=1}^{K} \sum_{\substack{(j, r_1, \dots, r_k, i) \in \Phi_k \\ j \in Q,\; i \in A}} \beta(\cdots) \, (V h_i + V' W_{r_k}^k \cdots W_{r_1}^1 U_{\phi(j)} h_j + V' W_{r_k}^k \cdots W_{r_1}^1 b_{\phi(j)})
  = \sum_{k=1}^{K} \sum_{\substack{(j, r_1, \dots, r_k, i) \in \Phi_k \\ j \in Q,\; i \in A}} \beta(\cdots) \, (W_3 h_i + W_1 h_j + W_2 (e_{r_1} \circ \cdots \circ e_{r_k}))
  = \sum_{k=1}^{K} \sum_{\substack{(j, r_1, \dots, r_k, i) \in \Phi_k \\ j \in Q,\; i \in A}} \beta(\cdots) \, W (h_j \oplus (e_{r_1} \circ \cdots \circ e_{r_k}) \oplus h_i)
  = \text{KHopRN}(\mathcal{G}; W, E, H). \tag{16}


Multi-Hop Message Passing As mentioned be-fore our motivation is to endow GNN with thecapability of directly modeling paths To this endwe propose to pass messages directly over all therelational paths of lengths up to K where K is ahyper-parameter The set of valid k-hop relationalpaths is defined as

Φk = (j r1 rk i) ∣ (j r1 j1)⋯ (jkminus1 rk i) isin E (1 le k le K) (6)

We perform k-hop (1 le k le K) message passingover these paths which is a generalization of thesingle-hop message passing in RGCN (see Eq 2)

zki = sum

(jr1rki)isinΦk

α(j r1 rk i)dki sdotWK0

⋯Wk+10 W

krk⋯W

1r1xj (1 le k le K) (7)

where the Wtr (1 le t le K 0 le r le m) matrices

are learnable3 α(j r1 rk i) is an attentionscore elaborated in sect42 and d

ki = sum(j⋯i)isinΦk

α(j⋯i)is the normalization factor The W k

rk⋯W1r1 ∣ 1 le

r1 rk le m matrices can be interpreted as thelow rank approximation of a mtimes⋯timesmktimesdtimesdtensor that assigns a separate transformation foreach k-hop relation where d is the dim of xi

Incoming messages from paths of differentlengths are aggregated via attention mecha-nism (Vaswani et al 2017)

zi =K

sumk=1

softmax(bilinear(s zki )) sdot zki (8)

3W

t0(0 le t le K) are introduced as padding matrices

so that K transformations are applied regardless of k thusensuring comparable scale of zki across different k

Non-linear Activation Finally we apply shortcutconnection and nonlinear activation to obtain theoutput node embeddings

hprimei = σ (V hi + V

primezi) (9)

where V and Vprime are learnable model parameters

and σ is a non-linear activation function

42 Structured Relational AttentionThe remaining problem becomes how to effectivelyparameterize the attention score α(j r1 rk i) in

Eq 7 for all possible k-hop paths without introduc-ing O(mk) parameters We first regard it as the prob-ability of a relation sequence (φ(j) r1 rk φ(i))conditioned on s

α(j r1 rk i) = p (φ(j) r1 rk φ(i) ∣ s) (10)

which can naturally be modeled by a probabilisticgraphical model such as conditional random field(Lafferty et al 2001)

p (⋯ ∣ s)prop exp(f(φ(j) s) +k

sumt=1

δ(rt s)

+kminus1

sumt=1

τ(rt rt+1) + g(φ(i) s))

∆= β(r1 rk s)IacuteOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveNtildeOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveIuml

Relation Type Attention

sdot γ(φ(j) φ(i) s)IacuteOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveNtildeOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveOgraveIumlNode Type Attention

(11)

where f(sdot) δ(sdot) and g(sdot) are parameterized bytwo-layer MLPs and τ(sdot) by a transition matrixof shape m timesm Intuitively β(sdot) models the im-portance of a k-hop relation while γ(sdot) models theimportance of messages from node type φ(j) tonode type φ(i) (eg the model can learn to passmessages only from question entities to answerentities)

Our model scores a k-hop relation by decom-posing it into both context-aware single-hop re-lations (modeled by δ) and two-hop relations(modeled by τ ) We argue that τ is indis-pensable without which the model may assignhigh importance to illogical multi-hop relations(eg [AtLocation CapableOf]) or noisy re-lations (eg [RelatedTo RelatedTo])

43 Computation Complexity AnalysisAlthough the multi-hop message passing processin Eq 7 and the structured relational attention mod-ule in Eq11 handles potentially exponential num-ber of paths we show that it can be computed in

Model Time Space

G is a dense graph

K-hop KagNet O (mKnK+1

K) O (mKnK+1

K)K-layer RGCN O (mn2

K) O (mnK)MHGRN O (m2

n2K) O (mnK)

G is a sparse graph with maximum node degree ∆ ≪ n

K-hop KagNet O (mKnK∆

K) O (mKnK∆

K)K-layer RGCN O (mnK∆) O (mnK)MHGRN O (m2

nK∆) O (mnK)

Table 2 Computation complexity of different K-hopreasoning models on a densesparse multi-relationalgraph with n nodes and m relation types Despite thequadratic complexity wrt m MHGRNrsquos time cost issimilar to RGCN on GPUs with parallelizable matrixmultiplications (cf Fig 7)

linear time using dynamic programming (see Ap-pendix B) As summarized in Table 2 the timecomplexity and space complexity of our model ona sparse graph are O (m2

nK∆) and O (mnK)respectively both of which are linear with respectto either the path length K or the number of nodesn

44 Expressive Power of MHGRN

In addition to the efficiency and scalability wenow discuss the modeling capacity of our modelWith the message passing formulation and relation-specific transformations MHGRN is by nature thegeneralization of RGCN Furthermore it is capableof directly modeling paths making it interpretableas are path-based models like RN and KagNet Toshow this we first generalize RN (Eq 3) to themulti-hop setting and introduce K-hop RN (seeAppendix C for a formal definition) K-hop RNmodels multi-hop relation as the composition ofsingle-hop relations with element-wise multiplica-tion We show that MHGRN is capable of repre-senting K-hop RN (see Appendix D for the proof)

45 Learning Inference and Path Decoding

We now discuss the learning and inference processof MHGRN instantiated for the task of multiple-choice question answering Following the prob-lem formulation in Sec 2 we aim to determinethe plausibility of an answer candidate a isin Cgiven the question q with the information fromboth text s and graph G We first obtained the graphrepresentation g by performing attentive poolingover the output node embeddings of answer enti-

Methods BERT-Base BERT-Large RoBERTa-Large

IHdev-Acc() IHtest-Acc() IHdev-Acc() IHtest-Acc() IHdev-Acc() IHtest-Acc()

wo KG 5731 (plusmn107) 5347 (plusmn087) 6106 (plusmn085) 5539 (plusmn040) 7307 (plusmn045) 6869(plusmn056)

RGCN (Schlichtkrull et al 2018) 5694 (plusmn038) 5450 (plusmn056) 6298 (plusmn082) 5713 (plusmn036) 7269 (plusmn019) 6841 (plusmn066)GconAttn (Wang et al 2019) 5729 (plusmn062) 5441 (plusmn050) 6298 (plusmn017) 5694 (plusmn077) 7338( plusmn027) 6988 (plusmn047)KagNetdagger (Lin et al 2019) 5557 5619 6235 5716 - -RN (1-hop) 5827 (plusmn022) 5620 (plusmn045) 6304 (plusmn058) 5846 (plusmn071) 7457 (plusmn091) 7008 (plusmn021)RN (2-hop) 5981 (plusmn076) 5661 (plusmn068) 6336 (plusmn026) 5892 (plusmn014) 7365 (plusmn309) 6959 (plusmn380)

MHGRN 6036 (plusmn023) 5723 (plusmn082) 6329(plusmn051) 6059 (plusmn058) 7445 (plusmn010) 7111 (plusmn081)

Table 3 Performance comparison on CommonsenseQA in-house split We report in-house Dev (IHdev) andTest (IHtest) accuracy (mean and standard deviation of four runs) using the data split of Lin et al (2019) onCommonsenseQA dagger indicates reported results in its paper

ties hprimei ∣ i isin A Next we simply concatenateit with the text representation s and compute theplausibility score by ρ(qa) = MLP(soplus g)

During training we maximize the plausibilityscore of the correct answer a by minimizing thecross-entropy loss

L = EqaC [minus logexp(ρ(q a))

sumaisinC exp(ρ(qa))] (12)

The whole model is trained end-to-end jointly withthe text encoder (eg RoBERTa)

During inference we predict the most plau-sible answer by argmaxaisinC ρ(qa) Addition-ally we can decode a reasoning path as evidencefor model predictions endowing our model withthe interpretability enjoyed by path-based mod-els Specifically we first determine the answerentity ilowast with the highest score in the pooling layerand the path length k with the highest score inEq 8 Then the reasoning path is decoded byargmax α(j r1 rklowast i

lowast) which can be com-puted in linear time using dynamic programming

5 Experimental SetupWe introduce how we construct G (sect51) thedatasets (sect52) as well as the baseline methods(sect53) Appendix A shows more implementationand experimental details for reproducibility

51 Extracting G from External KGWe use ConceptNet (Speer et al 2017) a general-domain knowledge graph as our external KG to testmodelsrsquo ability to harness structured knowledgesource Following KagNet (Lin et al 2019) wemerge relation types to increase graph density andadd reverse relations to construct a multi-relationalgraph with 34 relation types To extract an informa-tive contextual graph G from the KG we recognizeentity mentions in s and link them to entities in

Methods Single Ensemble

RoBERTadagger 721 725ALBERTdagger (Lan et al 2019) - 765

XLNet + GRdagger (Lv et al 2019) 753 -RoBERTa+KEDGNdagger 725 744RoBERTa + KEdagger 733 -RoBERTa+HyKAS 20dagger (Ma et al 2019) 732 -XLNet+DREAMdagger 669 733RoBERTa+KEDGNdagger 725 744RoBERTa+FreeLBdagger(Zhu et al 2020) 722 731

MHGRN (RoBERTa K = 2) 754 765

Table 4 Performance comparison on official test ofCommonsenseQA with leaderboard SoTAs4 (accuracyin ) dagger indicates reported results on the leaderboardMHGRN achieves state-of-the-art on both single andensemble settings with RoBERTa-large encoder

ConceptNet with which we initialize our node setV We then add to V all the entities that appearin any two-hop paths between pairs of mentionedentities Unlike KagNet we do not perform anypruning but instead reserve all the edges betweennodes in V forming our G

52 Datasets

We evaluate models on two multiple-choice ques-tion answering datasets CommonsenseQA andOpenbookQA Both require world knowledge be-yond textual understanding to perform well

CommonsenseQA (Talmor et al 2019) neces-sitates various commonsense reasoning skills Thequestions are created with entities from ConceptNetand they are designed to probe latent compositionalrelations between entities in ConceptNet

OpenBookQA (Mihaylov et al 2018) provideselementary science questions together with an open

4Models based on ConceptNet are no longer shown on theleaderboard and thus we got oursrsquo results directly from theorganizers

book of science facts This dataset also probes gen-eral commonsense knowledge beyond the providedfacts As our model is orthogonal to text-formknowledge retrieval we do not utilize the providedopen book and instead use ConceptNet Conse-quently we do not compare our methods with thoseusing the open book

53 Compared Methods

We implement both knowledge-agnostic fine-tuning of pre-trained LMs and models that incor-porate KG as external sources as our baselinesAdditionally we directly compare our model withthe results from corresponding leaderboard Thesemethods typically leverage textual knowledge or ex-tra training data as opposed to external KG In allour implemented models we use pre-trained LMsas text encoders for s for fair comparison We stickto our focus of encoding structured KG and there-fore do not compare our models with those (Panet al 2019 Zhang et al 2018 Sun et al 2019Banerjee et al 2019) augmented by other text-formexternal knowledge (eg Wikipedia)

Specifically we fine-tune BERT-BASE BERT-LARGE (Devlin et al 2019) and ROBERTA (Liuet al 2019b) for multiple-choice questions Wetake RGCN (Eq 2 in Sec 3) RN5 (Eq 3 in Sec 3)KagNet (Eq 4 in Sec 3) and GconAttn (Wanget al 2019) as baselines GconAttn general-izes match-LSTM (Wang and Jiang 2016) fromsequence modeling to entity set modeling andachieves success in language inference tasks

6 Results and Discussions

In this section we present the results of our modelsin comparison with baselines as well as methodson the leaderboards for both CommonsenseQA andOpenbookQA We also provide analysis of modelsrsquocomponents and characteristics

61 Main ResultsFor CommonsenseQA (Table 3) we first use the in-house data split of Lin et al (2019) to compare ourmodels with our implemented baselines In this set-ting we take 1241 examples from official trainingexamples as our in-house test examples and regardthe remaining 8500 ones as our in-house train-ing examples Almost all KG-augmented modelsachieve performance gain over vanilla pre-trained

5We use mean pooling for 1-hop RN and attentive poolingfor 2-hop RN (detailed in Appendix C)

Methods Dev () Test ()

BERT (wo KG)dagger 604 604BERT Multi-Task (wo KG)dagger - 638GapQAdagger (Khot et al 2019) - 594 (plusmn130)

RoBERTa-Large (wo KG) 6676 (plusmn114) 6480 (plusmn237)+ RGCN 6465 (plusmn196) 6245 (plusmn157)+ GconAttn 6685 (plusmn182) 6475 (plusmn148)+ RN (1-hop) 6485 (plusmn111) 6365 (plusmn231)+ RN (2-hop) 6700 (plusmn071) 6520 (plusmn118)

+ MHGRN (K = 3) 6810 (plusmn102) 6685 (plusmn119)

Table 5 Dev and Test accuracy on OpenbookQA daggerindicates reported results on the leaderboard

Methods IHdev-Acc ()

MHGRN (K = 3) 7445 (plusmn010)- Type-specific transformation (sect41) 7316 (plusmn028)- Structured relational attention (sect42) 7326 (plusmn031)- Relation type attention (sect42) 7355 (plusmn068)- Node type attention (sect42) 7392 (plusmn065)

Table 6 Ablation study on model components (re-moving one component each time) using ROBERTA-LARGE as the text encoder We report the IHdev ac-curacy on CommonsenseQA

LMs demonstrating the value of external knowl-edge on this dataset Additionally we evaluate ourMHGRN (with the text encoder being ROBERTA-LARGE) on official split (Table 4) for fair compar-ison with other methods on leaderboard in bothsingle-model setting and ensemble-model settingIn both cases we achieve state-of-the-art perfor-mances across all existing models

For OpenbookQA (Table 5) we use official splitand build models with ROBERTA-LARGE as textencoder Overall external knowledge still bringsbenefit to this task Our model surpasses all base-lines with an absolute increase of sim2 on Test Aswe seek to evaluate the performance of models interms of encoding external knowledge graphs wedid not compare with other models based on doc-ument retrieval and reading comprehension Notethat other submissions on OBQA leaderboard usu-ally were fine-tuned on external question answer-ing datasets or using large corpus via informationretrieval As we seek to systematically examineknowledge-aware reasoning methods by reason-ing over interpretable structures we do not includecomparisons with models that are beyond the scopeOther fine-tuned PTLMs can be simply incorpo-rated as well whenever they are available

62 Performance Analysis

Ablation Study on Model Components We as-sess the impact of our modelsrsquo components shownin Table 6 Disabling type-specific transformationresults in sim 13 drop in performance demonstrat-ing the need for distinguishing node type for QAtasks Our structured relational attention mecha-nism is also critical with its two sub-componentscontributing almost equally

Impact of the Amount of Training Data Weuse different fractions of training data of Common-senseQA and report results of fine-tuning text en-coders alone and jointly training text encoder andgraph encoder in Fig 5 Regardless of training datafraction our model shows consistently more per-formance improvement over knowledge-agnosticfine-tuning compared with the other graph encod-ing methods indicating MHGRNrsquos complementarystrengths to text encoders

20 40 60 80 100Percentage of Training Data

45

50

55

60

65

Accu

racy

()

MultiGRNGconAttnwo KG

Figure 5 Performance change (accuracy in ) wrtthe amounts of training data on CommonsenseQA IHT-est set (same as Lin et al (2019))

Impact of Number of Hops (K) We investigatethe impact of hyperparameter K for MHGRN byits performance on CommonsenseQA (shown inFig 6) The increase of K continues to bring ben-efit until K = 4 However performance beginsto drop when K gt 3 This might be attributedto exponential noise in longer relational paths inknowledge graph

63 Model Scalability

Fig 7 presents the computation cost of MultiRGNand RGCN (measured by training time) The com-putation cost of both models grow linearly wrt toK Although the theoretical complexity of Multi-RGN is m times that of RGCN the ratio of theirempirical cost only approaches 2 demonstratingthat our model can be parallelized

65 70 75Accuracy ()

K=1

K=2

K=3

K=4

K=5

Reasoning Hops

7324

7393

7465

7365

7416

Figure 6 Effect of K in MHGRN We show IHDevaccuracy of MHGRN on CommonsenseQA using thedata split of Lin et al (2019) with different hops

2 4 6 800

05

10

15

20

sba

tch

K Reasoning Hops

RGCNMHGRN

Figure 7 Analysis of model scalability Comparisonof per-batch training efficiency wrt hops K

64 Model InterpretabilityWe can analyze our modelrsquos reasoning process bydecoding the reasoning path using the method de-scribed in sect45 Fig 8 shows two examples fromCommonsenseQA where our model correctly an-swers the questions and provides reasonable pathevidences In the example on the left the modellinks question entities and answer entity in a chainto support reasoning while the example on theright provides a case where our model leverage un-mentioned entities to bridge the reasoning gap be-tween question entity and answer entities in a waythat is coherent with the implied relation betweenCHAPEL and the desired answer in the question

Where is known for a multitude of wedding chapels

A town B texasC city D church building E Nevada

RenoChapel

PartOf

AtLocation

UsedForminus1 Synonym

AtLocation Synonym

Neveda

Baseball

Play_Baseball

HasProperty

UsedForminus1

Fun_to_Play

Why do parents encourage their kids to play baseball

A round B cheap C break window D hardE fun to play Fun_to_Play

HasProperty

Reno

Chapel

AtLocation

Neveda

PartOf

MultiGRNFigure 8 Case study on model interpretability Wepresent two sampled questions from CommonsenseQAwith the reasoning paths output by MHGRN

65 Potential Compatibility with OtherMethods

In theory our approach is naturally compatiblewith the methods that utilize textual knowledge

or extra data (such as leaderboard methods inTable 4) because in our paradigm the encodingof textual statement and graph are structurally-decoupled (Fig 3) We can take for example thefine-tuned RoBERTa+KE6 system as our text en-coder and leave the rest of our model architectureunchanged

7 Related Work

Knowledge-Aware Methods for NLP Variouswork have investigated the potential to empowerNLP models with external knowledge Many at-tempt to extract structured knowledge either in theform of nodes (Yang and Mitchell 2017 Wanget al 2019) triples (Weissenborn et al 2017Mihaylov and Frank 2018) paths (Bauer et al2018 Kundu et al 2019 Lin et al 2019) or sub-graphs (Li and Clark 2015) and encode them toaugment textual understanding

Recent success of pre-trained LMs motivatesmany (Pan et al 2019 Ye et al 2019 Zhang et al2018 Li et al 2019 Banerjee et al 2019) to probeLMsrsquo potential as latent knowledge bases This lineof work turn to textual knowledge (eg Wikipedia)to directly impart knowledge to pre-trained LMsThey generally fall into two paradigms 1) Fine-tuning LMs on large-scale general-domain datasets(eg RACE (Lai et al 2017)) or on knowledge-richtext 2) Providing LMs with evidence via informa-tion retrieval techniques However these modelscannot provide explicit reasoning and evidencethus hardly trustworthy They are also subject tothe availability of in-domain datasets and maxi-mum input token of pre-trained LMs

Neural Graph Encoding Graph Attention Net-works (GAT) (Velickovic et al 2018) incorpo-rates attention mechanism in feature aggregationRGCN (Schlichtkrull et al 2018) proposes rela-tional message passing which makes it applicableto multi-relational graphs However they only per-form single-hop message passing and cannot beinterpreted at path level Other work (Abu-El-Haijaet al 2019 Nikolentzos et al 2019) aggregate fora node its K-hop neighbors based on node-wisedistances but they are designed for non-relationalgraphs MHGRN addresses these issues by rea-soning on multi-relational graphs and being inter-pretable via maintaining paths as reasoning chains

6httpsgithubcomjose77csqa

8 Conclusion

We present a principled, scalable method, MHGRN, that can leverage general knowledge via multi-hop reasoning over interpretable structures (e.g., ConceptNet). The proposed MHGRN generalizes and combines the advantages of GNNs and path-based reasoning models. It explicitly performs multi-hop relational reasoning and is empirically shown to outperform existing methods with superior scalability and interpretability. Our extensive experiments systematically compare MHGRN with other knowledge-aware methods. In particular, we achieve state-of-the-art performance on the CommonsenseQA dataset.

References

Sami Abu-El-Haija, Bryan Perozzi, Amol Kapoor, Nazanin Alipourfard, Kristina Lerman, Hrayr Harutyunyan, Greg Ver Steeg, and Aram Galstyan. 2019. MixHop: Higher-order graph convolutional architectures via sparsified neighborhood mixing. In Proceedings of the 36th International Conference on Machine Learning (ICML 2019), volume 97 of Proceedings of Machine Learning Research, pages 21–29. PMLR.

Pratyay Banerjee, Kuntal Kumar Pal, Arindam Mitra, and Chitta Baral. 2019. Careful selection of knowledge to solve open book question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6120–6129, Florence, Italy. Association for Computational Linguistics.

Lisa Bauer, Yicheng Wang, and Mohit Bansal. 2018. Commonsense for generative multi-hop question answering tasks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4220–4230, Brussels, Belgium. Association for Computational Linguistics.

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. CoRR, abs/1803.05457.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. 2017. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017), volume 70 of Proceedings of Machine Learning Research, pages 1263–1272. PMLR.

Tushar Khot, Ashish Sabharwal, and Peter Clark. 2019. What's missing: A knowledge gap guided approach for multi-hop question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2814–2828, Hong Kong, China. Association for Computational Linguistics.

Thomas N. Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In 5th International Conference on Learning Representations (ICLR 2017), Conference Track Proceedings. OpenReview.net.

Souvik Kundu, Tushar Khot, Ashish Sabharwal, and Peter Clark. 2019. Exploiting explicit paths for multi-hop reading comprehension. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2737–2747, Florence, Italy. Association for Computational Linguistics.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), pages 282–289. Morgan Kaufmann.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale ReAding Comprehension dataset from Examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785–794, Copenhagen, Denmark. Association for Computational Linguistics.

Zhen-Zhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A lite BERT for self-supervised learning of language representations. ArXiv, abs/1909.11942.

Shiyang Li, Jianshu Chen, and Dian Yu. 2019. Teaching pretrained models with commonsense reasoning: A preliminary KB-based approach. CoRR, abs/1909.09743.

Yang Li and Peter Clark. 2015. Answering elementary science questions by constructing coherent scenes using background knowledge. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2007–2012, Lisbon, Portugal. Association for Computational Linguistics.

Bill Yuchen Lin, Xinyue Chen, Jamin Chen, and Xiang Ren. 2019. KagNet: Knowledge-aware graph networks for commonsense reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2829–2839, Hong Kong, China. Association for Computational Linguistics.

Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. 2019a. On the variance of the adaptive learning rate and beyond. CoRR, abs/1908.03265.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.

Shangwen Lv, Daya Guo, Jingjing Xu, Duyu Tang, Nan Duan, Ming Gong, Linjun Shou, Daxin Jiang, Guihong Cao, and Songlin Hu. 2019. Graph-based reasoning over heterogeneous external knowledge for commonsense question answering. CoRR, abs/1909.05311.

Kaixin Ma, Jonathan Francis, Quanyang Lu, Eric Nyberg, and Alessandro Oltramari. 2019. Towards generalizable neuro-symbolic systems for commonsense question answering. In Proceedings of the First Workshop on Commonsense Inference in Natural Language Processing, pages 22–32, Hong Kong, China. Association for Computational Linguistics.

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381–2391, Brussels, Belgium. Association for Computational Linguistics.

Todor Mihaylov and Anette Frank. 2018. Knowledgeable reader: Enhancing cloze-style reading comprehension with external commonsense knowledge. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 821–832, Melbourne, Australia. Association for Computational Linguistics.

Giannis Nikolentzos, George Dasoulas, and Michalis Vazirgiannis. 2019. k-hop graph neural networks. CoRR, abs/1907.06051.

Xiaoman Pan, Kai Sun, Dian Yu, Heng Ji, and Dong Yu. 2019. Improving question answering with external knowledge. CoRR, abs/1902.00993.

Adam Santoro, David Raposo, David G. T. Barrett, Mateusz Malinowski, Razvan Pascanu, Peter W. Battaglia, and Tim Lillicrap. 2017. A simple neural network module for relational reasoning. In Advances in Neural Information Processing Systems 30, pages 4967–4976.

Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. 2019. Social IQa: Commonsense reasoning about social interactions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4463–4473, Hong Kong, China. Association for Computational Linguistics.

Michael Sejr Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. 2018. Modeling relational data with graph convolutional networks. In The Semantic Web - 15th International Conference (ESWC 2018), volume 10843 of Lecture Notes in Computer Science, pages 593–607. Springer.

Robyn Speer, Joshua Chin, and Catherine Havasi. 2017. ConceptNet 5.5: An open multilingual graph of general knowledge. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pages 4444–4451. AAAI Press.

Kai Sun, Dian Yu, Dong Yu, and Claire Cardie. 2019. Improving machine reading comprehension with general reading strategies. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2633–2643, Minneapolis, Minnesota. Association for Computational Linguistics.

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158, Minneapolis, Minnesota. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008.

Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2018. Graph attention networks. In 6th International Conference on Learning Representations (ICLR 2018), Conference Track Proceedings. OpenReview.net.

Shuohang Wang and Jing Jiang. 2016. Learning natural language inference with LSTM. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1442–1451, San Diego, California. Association for Computational Linguistics.

Xiaoyan Wang, Pavan Kapanipathi, Ryan Musa, Mo Yu, Kartik Talamadupula, Ibrahim Abdelaziz, Maria Chang, Achille Fokoue, Bassem Makni, Nicholas Mattei, and Michael Witbrock. 2019. Improving natural language inference using external knowledge in the science questions domain. In The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI 2019), pages 7208–7215. AAAI Press.

Dirk Weissenborn, Tomas Kocisky, and Chris Dyer. 2017. Dynamic integration of background knowledge in neural NLU systems. arXiv preprint arXiv:1706.02596.

An Yang, Quan Wang, Jing Liu, Kai Liu, Yajuan Lyu, Hua Wu, Qiaoqiao She, and Sujian Li. 2019. Enhancing pre-trained language representations with rich knowledge for machine reading comprehension. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2346–2357, Florence, Italy. Association for Computational Linguistics.

Bishan Yang and Tom Mitchell. 2017. Leveraging knowledge bases in LSTMs for improving machine reading. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1436–1446, Vancouver, Canada. Association for Computational Linguistics.

Zhi-Xiu Ye, Qian Chen, Wen Wang, and Zhen-Hua Ling. 2019. Align, mask and select: A simple method for incorporating commonsense knowledge into language representation models. CoRR, abs/1908.06725.

Yuyu Zhang, Hanjun Dai, Kamil Toraman, and Le Song. 2018. KG^2: Learning to reason science exam questions with contextual knowledge graph embeddings. CoRR, abs/1805.12393.

Chen Zhu, Yu Cheng, Zhe Gan, Siqi Sun, Tom Goldstein, and Jingjing Liu. 2020. FreeLB: Enhanced adversarial training for natural language understanding. In International Conference on Learning Representations.

A Implementation Details

Text encoder       CommonsenseQA   OpenbookQA
BERT-BASE          3 × 10⁻⁵        –
BERT-LARGE         2 × 10⁻⁵        –
ROBERTA-LARGE      1 × 10⁻⁵        1 × 10⁻⁵

Table 7: Learning rate for text encoders on different datasets.

Graph encoder      CommonsenseQA   OpenbookQA
RN                 3 × 10⁻⁴        3 × 10⁻⁴
RGCN               1 × 10⁻³        1 × 10⁻³
GconAttn           3 × 10⁻⁴        1 × 10⁻⁴
KV-Memory          1 × 10⁻³        1 × 10⁻³
MHGRN              1 × 10⁻³        1 × 10⁻³

Table 8: Learning rate for graph encoders on different datasets.

Our models are implemented in PyTorch. We use the cross-entropy loss and the RAdam (Liu et al., 2019a) optimizer. We find it beneficial to use separate learning rates for the text encoder and the graph encoder, and we tune both on the two datasets. We first fine-tune ROBERTA-LARGE, BERT-LARGE, and BERT-BASE on CommonsenseQA, and ROBERTA-LARGE on OpenbookQA, choosing a dataset-specific learning rate from {1 × 10⁻⁵, 2 × 10⁻⁵, 3 × 10⁻⁵, 6 × 10⁻⁵, 1 × 10⁻⁴} for each text encoder based on the best performance on the development set, as listed in Table 7. We report the performance of these fine-tuned text encoders and also adopt their dataset-specific optimal learning rates in joint training with graph encoders. For models that involve a KG, the learning rate of the graph encoder7 is chosen from {1 × 10⁻⁴, 3 × 10⁻⁴, 1 × 10⁻³, 3 × 10⁻³} based on the best development set performance with ROBERTA-LARGE as the text encoder. We report the optimal learning rates for graph encoders in Table 8. In training, we set the maximum input sequence length of the text encoder to 64, the batch size to 32, and perform early stopping.
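For concreteness, the two learning rates can be realized with optimizer parameter groups. The following is a minimal PyTorch sketch, not the released implementation; the module names text_encoder and graph_encoder and the batch fields are hypothetical stand-ins.

    import torch
    from torch.optim import RAdam   # built into recent PyTorch; RAdam follows Liu et al. (2019a)

    def build_optimizer(model, text_lr=1e-5, graph_lr=1e-3):
        # Separate learning rates for the text encoder and the graph encoder
        # via optimizer parameter groups (values here mirror Tables 7 and 8).
        return RAdam([
            {"params": model.text_encoder.parameters(), "lr": text_lr},
            {"params": model.graph_encoder.parameters(), "lr": graph_lr},
        ])

    def train_step(model, optimizer, batch):
        # model(batch) returns plausibility scores of shape [batch, num_choices];
        # batch["label"] holds the index of the correct answer choice.
        optimizer.zero_grad()
        scores = model(batch)
        loss = torch.nn.functional.cross_entropy(scores, batch["label"])
        loss.backward()
        optimizer.step()
        return loss.item()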

For the input node features, we first use templates to turn knowledge triples in CommonsenseQA into sentences and feed them into pre-trained BERT-LARGE, obtaining a sequence of token embeddings from the last layer of BERT-LARGE for each triple. For each entity in CommonsenseQA, we perform mean pooling over the tokens of the entity's occurrences across all the sentences to form a 1024-dimensional vector as its corresponding node feature. We use this set of features for all our implemented models.

7 KV-Memory actually encodes triples to enhance the token representations of s instead of encoding the graph alone, but for simplicity we also refer to it as a graph encoder.
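A minimal sketch of this feature-extraction step with the HuggingFace transformers library is shown below. It assumes a hypothetical list of (sentence, character-span) occurrences per entity; the triple-to-sentence templates and span bookkeeping of our actual pipeline are omitted.

    import torch
    from transformers import BertModel, BertTokenizerFast

    tokenizer = BertTokenizerFast.from_pretrained("bert-large-uncased")
    bert = BertModel.from_pretrained("bert-large-uncased").eval()

    @torch.no_grad()
    def entity_node_feature(occurrences):
        # occurrences: list of (sentence, char_start, char_end) for one entity across
        # all templated triple sentences. Returns a 1024-d node feature obtained by
        # mean pooling the last-layer BERT-large embeddings of the entity's tokens.
        vectors = []
        for sentence, start, end in occurrences:
            enc = tokenizer(sentence, return_offsets_mapping=True,
                            return_tensors="pt", truncation=True)
            offsets = enc.pop("offset_mapping")[0]            # [seq_len, 2] char spans per token
            hidden = bert(**enc).last_hidden_state[0]         # [seq_len, 1024]
            mask = (offsets[:, 0] >= start) & (offsets[:, 1] <= end) \
                   & (offsets[:, 1] > offsets[:, 0])          # tokens inside the entity span
            if mask.any():
                vectors.append(hidden[mask])
        return torch.cat(vectors).mean(dim=0)                 # [1024]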

We use a 2-layer RGCN and a single-layer MHGRN across our experiments.

B Dynamic Programming Algorithm for Eq. 7

To show that multi-hop message passing can be computed in linear time, we observe that Eq. 7 can be re-written in matrix form:

$$Z^k = (D^k)^{-1} \sum_{(r_1,\dots,r_k)\in R^k} \beta(r_1,\dots,r_k,s) \cdot G A_{r_k}\cdots A_{r_1} F X \, {W^1_{r_1}}^{\!\top}\cdots {W^k_{r_k}}^{\!\top} \cdot {W^{k+1}_0}^{\!\top}\cdots {W^K_0}^{\!\top} \quad (1\le k\le K) \qquad (13)$$

where $G = \mathrm{diag}\big(\exp([g(\phi(v_1),s),\dots,g(\phi(v_n),s)])\big)$ ($F$ is similarly defined), $A_r$ is the adjacency matrix for relation $r$, and $D^k$ is defined as follows:

$$D^k = \mathrm{diag}\Big(\sum_{(r_1,\dots,r_k)\in R^k} \beta(r_1,\dots,r_k,s) \cdot G A_{r_k}\cdots A_{r_1} F X \mathbf{1}\Big) \quad (1\le k\le K) \qquad (14)$$

Using this matrix formulation, we can compute Eq. 7 with dynamic programming, as shown in Algorithm 1.

C Formal Definition of K-hop RN

Definition 1 (K-hop Relation Network). A multi-hop relation network is a function that maps a multi-relational graph to a fixed-size vector:

$$\mathrm{KHopRN}(\mathcal{G}; W, E, H) = \sum_{k=1}^{K} \sum_{\substack{(j,r_1,\dots,r_k,i)\in\Phi_k \\ j\in\mathcal{Q},\, i\in\mathcal{A}}} \beta(j,r_1,\dots,r_k,i) \cdot W\big(h_j \oplus (e_{r_1}\circ\cdots\circ e_{r_k}) \oplus h_i\big) \qquad (15)$$

where $\circ$ denotes the element-wise product and $\beta(\cdots) = 1/\big(K|\mathcal{A}| \cdot |\{(j',\dots,i)\in\mathcal{G} \mid j'\in\mathcal{Q}\}|\big)$ defines the pooling weights.

Algorithm 1: Dynamic programming algorithm for multi-hop message passing
Input: s, X, A_r (1 ≤ r ≤ m), W_r^t (r ∈ R, 1 ≤ t ≤ K), F, G, δ, τ
Output: Z

 1: W_K ← I
 2: for k ← K−1 to 1 do
 3:     W_k ← W_0^{k+1} W_{k+1}
 4: end for
 5: for r ∈ R do
 6:     M_r ← FX
 7: end for
 8: for k ← 1 to K do
 9:     if k > 1 then
10:         for r ∈ R do
11:             M'_r ← e^{δ(r,s)} A_r Σ_{r'∈R} e^{τ(r',r)} M_{r'} W_r^k
12:         end for
13:         for r ∈ R do
14:             M_r ← M'_r
15:         end for
16:     else
17:         for r ∈ R do
18:             M_r ← e^{δ(r,s)} · A_r M_r W_r^k
19:         end for
20:     end if
21:     Z^k ← G Σ_{r∈R} M_r W_k
22: end for
23: Replace W_r^t (0 ≤ r ≤ m, 1 ≤ t ≤ K) with identity matrices and X with 1, and re-run lines 1–19 to compute d^1, …, d^K
24: for k ← 1 to K do
25:     Z^k ← (diag(d^k))^{-1} Z^k
26: end for
27: return Z^1, Z^2, …, Z^K
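To make the recurrence concrete, the following is a minimal NumPy sketch (not the released implementation) of the core loop of Algorithm 1 (lines 5–22). It assumes dense per-relation adjacency matrices A[r], node features X, hop- and relation-specific weights W[k-1][r], and precomputed attention scores delta, tau, f, g; the normalization of lines 23–26 and the trailing W_k products are omitted for brevity.

    import numpy as np

    def multihop_messages(A, X, W, delta, tau, f, g, K):
        # A:     list of m adjacency matrices, each [n, n], one per relation
        # X:     node features [n, d]
        # W:     W[k-1][r] is the hop-k, relation-r transform [d, d]
        # delta: delta[r], context-conditioned relation score (scalar)
        # tau:   tau[r_prev][r], relation-transition score (scalar)
        # f, g:  node-type attention scores for source / target nodes, shape [n]
        # Returns the (unnormalized) k-hop aggregates Z[0..K-1].
        m = len(A)
        F, G = np.diag(np.exp(f)), np.diag(np.exp(g))
        # M[r] accumulates, for every node, messages arriving via paths whose last relation is r,
        # so the exponentially many relation sequences are never enumerated explicitly.
        M = [F @ X for _ in range(m)]
        Z = []
        for k in range(1, K + 1):
            if k == 1:
                M = [np.exp(delta[r]) * (A[r] @ M[r] @ W[k - 1][r]) for r in range(m)]
            else:
                M = [np.exp(delta[r]) *
                     (A[r] @ sum(np.exp(tau[rp][r]) * M[rp] for rp in range(m)) @ W[k - 1][r])
                     for r in range(m)]
            Z.append(G @ sum(M))   # aggregate over final relations (line 21, W_k omitted)
        return Z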

D Expressing K-hop RN with MHGRN

Theorem 1. Given any $W, E, H$, there exists a parameter setting such that the output of the model becomes $\mathrm{KHopRN}(\mathcal{G}; W, E, H)$ for arbitrary $\mathcal{G}$.

Proof. Suppose $W = [W_1\ W_2\ W_3]$, where $W_1, W_3 \in \mathbb{R}^{d_3\times d_1}$ and $W_2 \in \mathbb{R}^{d_3\times d_2}$. For MHGRN, we set the parameters as follows: $H = H$, $U_* = [I;\,0] \in \mathbb{R}^{(d_1+d_2)\times d_1}$, $b_* = [\mathbf{0};\,\mathbf{1}]^{\top} \in \mathbb{R}^{d_1+d_2}$, $W_r^t = \mathrm{diag}(\mathbf{1} \oplus e_r) \in \mathbb{R}^{(d_1+d_2)\times(d_1+d_2)}$ $(r \in R,\ 1 \le t \le K)$, $V = W_3 \in \mathbb{R}^{d_3\times d_1}$, $V' = [W_1\ W_2] \in \mathbb{R}^{d_3\times(d_1+d_2)}$. We disable the relation type attention module and enable message passing only from $\mathcal{Q}$ to $\mathcal{A}$. By further choosing $\sigma$ as the identity function and performing pooling over $\mathcal{A}$, we observe that the output of MHGRN becomes

$$
\begin{aligned}
\frac{1}{|\mathcal{A}|}\sum_{i\in\mathcal{A}} h'_i
&= \frac{1}{|\mathcal{A}|}\sum_{i\in\mathcal{A}} \big(V h_i + V' z_i\big) \\
&= \frac{1}{K|\mathcal{A}|}\sum_{i\in\mathcal{A}}\sum_{k=1}^{K} \big(V h_i + V' z_i^k\big) \\
&= \sum_{k=1}^{K}\sum_{\substack{(j,r_1,\dots,r_k,i)\in\Phi_k \\ j\in\mathcal{Q},\, i\in\mathcal{A}}} \beta(j,r_1,\dots,r_k,i)\big(V h_i + V' W_{r_k}^{k}\cdots W_{r_1}^{1} x_j\big) \\
&= \sum_{k=1}^{K}\sum_{\substack{(j,r_1,\dots,r_k,i)\in\Phi_k \\ j\in\mathcal{Q},\, i\in\mathcal{A}}} \beta(\cdots)\big(V h_i + V' W_{r_k}^{k}\cdots W_{r_1}^{1} U_{\phi(j)} h_j + V' W_{r_k}^{k}\cdots W_{r_1}^{1} b_{\phi(j)}\big) \\
&= \sum_{k=1}^{K}\sum_{\substack{(j,r_1,\dots,r_k,i)\in\Phi_k \\ j\in\mathcal{Q},\, i\in\mathcal{A}}} \beta(\cdots)\big(W_3 h_i + W_1 h_j + W_2 (e_{r_1}\circ\cdots\circ e_{r_k})\big) \\
&= \sum_{k=1}^{K}\sum_{\substack{(j,r_1,\dots,r_k,i)\in\Phi_k \\ j\in\mathcal{Q},\, i\in\mathcal{A}}} \beta(\cdots)\, W\big(h_j \oplus (e_{r_1}\circ\cdots\circ e_{r_k}) \oplus h_i\big) \\
&= \mathrm{KHopRN}(\mathcal{G}; W, E, H). \qquad (16)
\end{aligned}
$$


Figure 7 Analysis of model scalability Comparisonof per-batch training efficiency wrt hops K

64 Model InterpretabilityWe can analyze our modelrsquos reasoning process bydecoding the reasoning path using the method de-scribed in sect45 Fig 8 shows two examples fromCommonsenseQA where our model correctly an-swers the questions and provides reasonable pathevidences In the example on the left the modellinks question entities and answer entity in a chainto support reasoning while the example on theright provides a case where our model leverage un-mentioned entities to bridge the reasoning gap be-tween question entity and answer entities in a waythat is coherent with the implied relation betweenCHAPEL and the desired answer in the question

Where is known for a multitude of wedding chapels

A town B texasC city D church building E Nevada

RenoChapel

PartOf

AtLocation

UsedForminus1 Synonym

AtLocation Synonym

Neveda

Baseball

Play_Baseball

HasProperty

UsedForminus1

Fun_to_Play

Why do parents encourage their kids to play baseball

A round B cheap C break window D hardE fun to play Fun_to_Play

HasProperty

Reno

Chapel

AtLocation

Neveda

PartOf

MultiGRNFigure 8 Case study on model interpretability Wepresent two sampled questions from CommonsenseQAwith the reasoning paths output by MHGRN

65 Potential Compatibility with OtherMethods

In theory our approach is naturally compatiblewith the methods that utilize textual knowledge

or extra data (such as leaderboard methods inTable 4) because in our paradigm the encodingof textual statement and graph are structurally-decoupled (Fig 3) We can take for example thefine-tuned RoBERTa+KE6 system as our text en-coder and leave the rest of our model architectureunchanged

7 Related Work

Knowledge-Aware Methods for NLP Variouswork have investigated the potential to empowerNLP models with external knowledge Many at-tempt to extract structured knowledge either in theform of nodes (Yang and Mitchell 2017 Wanget al 2019) triples (Weissenborn et al 2017Mihaylov and Frank 2018) paths (Bauer et al2018 Kundu et al 2019 Lin et al 2019) or sub-graphs (Li and Clark 2015) and encode them toaugment textual understanding

Recent success of pre-trained LMs motivatesmany (Pan et al 2019 Ye et al 2019 Zhang et al2018 Li et al 2019 Banerjee et al 2019) to probeLMsrsquo potential as latent knowledge bases This lineof work turn to textual knowledge (eg Wikipedia)to directly impart knowledge to pre-trained LMsThey generally fall into two paradigms 1) Fine-tuning LMs on large-scale general-domain datasets(eg RACE (Lai et al 2017)) or on knowledge-richtext 2) Providing LMs with evidence via informa-tion retrieval techniques However these modelscannot provide explicit reasoning and evidencethus hardly trustworthy They are also subject tothe availability of in-domain datasets and maxi-mum input token of pre-trained LMs

Neural Graph Encoding Graph Attention Net-works (GAT) (Velickovic et al 2018) incorpo-rates attention mechanism in feature aggregationRGCN (Schlichtkrull et al 2018) proposes rela-tional message passing which makes it applicableto multi-relational graphs However they only per-form single-hop message passing and cannot beinterpreted at path level Other work (Abu-El-Haijaet al 2019 Nikolentzos et al 2019) aggregate fora node its K-hop neighbors based on node-wisedistances but they are designed for non-relationalgraphs MHGRN addresses these issues by rea-soning on multi-relational graphs and being inter-pretable via maintaining paths as reasoning chains

6httpsgithubcomjose77csqa

8 Conclusion

We present a principled scalable method MHGRNthat can leverage general knowledge by multi-hopreasoning over interpretable structures (eg Con-ceptNet) The proposed MHGRN generalizes andcombines the advantages of GNNs and path-basedreasoning models It explicitly performs multi-hoprelational reasoning and is empirically shown tooutperform existing methods with superior scal-ablility and interpretability Our extensive experi-ments systematically compare MHGRN and otherexisting methods on knowledge-aware methodsParticularly we achieve the state-of-the-art perfor-mance on the CommonsenseQA dataset

ReferencesSami Abu-El-Haija Bryan Perozzi Amol Kapoor

Nazanin Alipourfard Kristina Lerman Hrayr Haru-tyunyan Greg Ver Steeg and Aram Galstyan 2019Mixhop Higher-order graph convolutional architec-tures via sparsified neighborhood mixing In Pro-ceedings of the 36th International Conference onMachine Learning ICML 2019 9-15 June 2019Long Beach California USA volume 97 of Pro-ceedings of Machine Learning Research pages 21ndash29 PMLR

Pratyay Banerjee Kuntal Kumar Pal Arindam Mi-tra and Chitta Baral 2019 Careful selection ofknowledge to solve open book question answeringIn Proceedings of the 57th Annual Meeting of theAssociation for Computational Linguistics pages6120ndash6129 Florence Italy Association for Compu-tational Linguistics

Lisa Bauer Yicheng Wang and Mohit Bansal 2018Commonsense for generative multi-hop question an-swering tasks In Proceedings of the 2018 Confer-ence on Empirical Methods in Natural LanguageProcessing Brussels Belgium October 31 - Novem-ber 4 2018 pages 4220ndash4230 Association for Com-putational Linguistics

Peter Clark Isaac Cowhey Oren Etzioni Tushar KhotAshish Sabharwal Carissa Schoenick and OyvindTafjord 2018 Think you have solved questionanswering try arc the AI2 reasoning challengeCoRR abs180305457

Jacob Devlin Ming-Wei Chang Kenton Lee andKristina Toutanova 2019 BERT Pre-training ofdeep bidirectional transformers for language under-standing In Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics Human LanguageTechnologies Volume 1 (Long and Short Papers)pages 4171ndash4186 Minneapolis Minnesota Associ-ation for Computational Linguistics

Justin Gilmer Samuel S Schoenholz Patrick F RileyOriol Vinyals and George E Dahl 2017 Neuralmessage passing for quantum chemistry In Pro-ceedings of the 34th International Conference onMachine Learning ICML 2017 Sydney NSW Aus-tralia 6-11 August 2017 volume 70 of Proceedingsof Machine Learning Research pages 1263ndash1272PMLR

Tushar Khot Ashish Sabharwal and Peter Clark 2019Whatrsquos missing A knowledge gap guided approachfor multi-hop question answering In Proceedings ofthe 2019 Conference on Empirical Methods in Nat-ural Language Processing and the 9th InternationalJoint Conference on Natural Language Processing(EMNLP-IJCNLP) pages 2814ndash2828 Hong KongChina Association for Computational Linguistics

Thomas N Kipf and Max Welling 2017 Semi-supervised classification with graph convolutionalnetworks In 5th International Conference on Learn-ing Representations ICLR 2017 Toulon FranceApril 24-26 2017 Conference Track ProceedingsOpenReviewnet

Souvik Kundu Tushar Khot Ashish Sabharwal andPeter Clark 2019 Exploiting explicit paths formulti-hop reading comprehension In Proceedingsof the 57th Annual Meeting of the Association forComputational Linguistics pages 2737ndash2747 Flo-rence Italy Association for Computational Linguis-tics

John D Lafferty Andrew McCallum and FernandoC N Pereira 2001 Conditional random fieldsProbabilistic models for segmenting and labeling se-quence data In Proceedings of the Eighteenth Inter-national Conference on Machine Learning (ICML2001) Williams College Williamstown MA USAJune 28 - July 1 2001 pages 282ndash289 MorganKaufmann

Guokun Lai Qizhe Xie Hanxiao Liu Yiming Yangand Eduard Hovy 2017 RACE Large-scale ReAd-ing comprehension dataset from examinations InProceedings of the 2017 Conference on EmpiricalMethods in Natural Language Processing pages785ndash794 Copenhagen Denmark Association forComputational Linguistics

Zhen-Zhong Lan Mingda Chen Sebastian Good-man Kevin Gimpel Piyush Sharma and RaduSoricut 2019 Albert A lite bert for self-supervised learning of language representationsArXiv abs190911942

Shiyang Li Jianshu Chen and Dian Yu 2019 Teach-ing pretrained models with commonsense reason-ing A preliminary kb-based approach CoRRabs190909743

Yang Li and Peter Clark 2015 Answering elementaryscience questions by constructing coherent scenesusing background knowledge In Proceedings of

the 2015 Conference on Empirical Methods in Nat-ural Language Processing pages 2007ndash2012 Lis-bon Portugal Association for Computational Lin-guistics

Bill Yuchen Lin Xinyue Chen Jamin Chen and Xi-ang Ren 2019 KagNet Knowledge-aware graphnetworks for commonsense reasoning In Proceed-ings of the 2019 Conference on Empirical Methodsin Natural Language Processing and the 9th Inter-national Joint Conference on Natural Language Pro-cessing (EMNLP-IJCNLP) pages 2829ndash2839 HongKong China Association for Computational Lin-guistics

Liyuan Liu Haoming Jiang Pengcheng He WeizhuChen Xiaodong Liu Jianfeng Gao and Jiawei Han2019a On the variance of the adaptive learning rateand beyond CoRR abs190803265

Yinhan Liu Myle Ott Naman Goyal Jingfei Du Man-dar Joshi Danqi Chen Omer Levy Mike LewisLuke Zettlemoyer and Veselin Stoyanov 2019bRoberta A robustly optimized BERT pretraining ap-proach CoRR abs190711692

Shangwen Lv Daya Guo Jingjing Xu Duyu TangNan Duan Ming Gong Linjun Shou Daxin JiangGuihong Cao and Songlin Hu 2019 Graph-based reasoning over heterogeneous external knowl-edge for commonsense question answering CoRRabs190905311

Kaixin Ma Jonathan Francis Quanyang Lu Eric Ny-berg and Alessandro Oltramari 2019 Towardsgeneralizable neuro-symbolic systems for common-sense question answering In Proceedings of theFirst Workshop on Commonsense Inference in Natu-ral Language Processing pages 22ndash32 Hong KongChina Association for Computational Linguistics

Todor Mihaylov Peter Clark Tushar Khot and AshishSabharwal 2018 Can a suit of armor conduct elec-tricity a new dataset for open book question an-swering In Proceedings of the 2018 Conference onEmpirical Methods in Natural Language Processingpages 2381ndash2391 Brussels Belgium Associationfor Computational Linguistics

Todor Mihaylov and Anette Frank 2018 Knowledge-able reader Enhancing cloze-style reading compre-hension with external commonsense knowledge InProceedings of the 56th Annual Meeting of the As-sociation for Computational Linguistics (Volume 1Long Papers) pages 821ndash832 Melbourne AustraliaAssociation for Computational Linguistics

Giannis Nikolentzos George Dasoulas and MichalisVazirgiannis 2019 k-hop graph neural networksCoRR abs190706051

Xiaoman Pan Kai Sun Dian Yu Heng Ji and Dong Yu2019 Improving question answering with externalknowledge CoRR abs190200993

Adam Santoro David Raposo David G T BarrettMateusz Malinowski Razvan Pascanu Peter WBattaglia and Tim Lillicrap 2017 A simple neu-ral network module for relational reasoning In Ad-vances in Neural Information Processing Systems30 Annual Conference on Neural Information Pro-cessing Systems 2017 4-9 December 2017 LongBeach CA USA pages 4967ndash4976

Maarten Sap Hannah Rashkin Derek Chen RonanLe Bras and Yejin Choi 2019 Social IQa Com-monsense reasoning about social interactions InProceedings of the 2019 Conference on EmpiricalMethods in Natural Language Processing and the9th International Joint Conference on Natural Lan-guage Processing (EMNLP-IJCNLP) pages 4463ndash4473 Hong Kong China Association for Computa-tional Linguistics

Michael Sejr Schlichtkrull Thomas N Kipf PeterBloem Rianne van den Berg Ivan Titov and MaxWelling 2018 Modeling relational data with graphconvolutional networks In The Semantic Web - 15thInternational Conference ESWC 2018 HeraklionCrete Greece June 3-7 2018 Proceedings volume10843 of Lecture Notes in Computer Science pages593ndash607 Springer

Robyn Speer Joshua Chin and Catherine Havasi 2017Conceptnet 55 An open multilingual graph of gen-eral knowledge In Proceedings of the Thirty-FirstAAAI Conference on Artificial Intelligence Febru-ary 4-9 2017 San Francisco California USApages 4444ndash4451 AAAI Press

Kai Sun Dian Yu Dong Yu and Claire Cardie 2019Improving machine reading comprehension withgeneral reading strategies In Proceedings of the2019 Conference of the North American Chapter ofthe Association for Computational Linguistics Hu-man Language Technologies Volume 1 (Long andShort Papers) pages 2633ndash2643 Minneapolis Min-nesota Association for Computational Linguistics

Alon Talmor Jonathan Herzig Nicholas Lourie andJonathan Berant 2019 CommonsenseQA A ques-tion answering challenge targeting commonsenseknowledge In Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics Human LanguageTechnologies Volume 1 (Long and Short Papers)pages 4149ndash4158 Minneapolis Minnesota Associ-ation for Computational Linguistics

Ashish Vaswani Noam Shazeer Niki Parmar JakobUszkoreit Llion Jones Aidan N Gomez LukaszKaiser and Illia Polosukhin 2017 Attention is allyou need In Advances in Neural Information Pro-cessing Systems 30 Annual Conference on NeuralInformation Processing Systems 2017 4-9 Decem-ber 2017 Long Beach CA USA pages 5998ndash6008

Petar Velickovic Guillem Cucurull Arantxa CasanovaAdriana Romero Pietro Lio and Yoshua Bengio

2018 Graph attention networks In 6th Inter-national Conference on Learning RepresentationsICLR 2018 Vancouver BC Canada April 30 - May3 2018 Conference Track Proceedings OpenRe-viewnet

Shuohang Wang and Jing Jiang 2016 Learning nat-ural language inference with LSTM In Proceed-ings of the 2016 Conference of the North Ameri-can Chapter of the Association for ComputationalLinguistics Human Language Technologies pages1442ndash1451 San Diego California Association forComputational Linguistics

Xiaoyan Wang Pavan Kapanipathi Ryan MusaMo Yu Kartik Talamadupula Ibrahim AbdelazizMaria Chang Achille Fokoue Bassem MakniNicholas Mattei and Michael Witbrock 2019 Im-proving natural language inference using externalknowledge in the science questions domain In TheThirty-Third AAAI Conference on Artificial Intelli-gence AAAI 2019 The Thirty-First Innovative Ap-plications of Artificial Intelligence Conference IAAI2019 The Ninth AAAI Symposium on EducationalAdvances in Artificial Intelligence EAAI 2019 Hon-olulu Hawaii USA January 27 - February 1 2019pages 7208ndash7215 AAAI Press

Dirk Weissenborn Tomas Kocisky and Chris Dyer2017 Dynamic integration of background knowl-edge in neural nlu systems arXiv preprintarXiv170602596

An Yang Quan Wang Jing Liu Kai Liu Yajuan LyuHua Wu Qiaoqiao She and Sujian Li 2019 En-hancing pre-trained language representations withrich knowledge for machine reading comprehensionIn Proceedings of the 57th Annual Meeting of theAssociation for Computational Linguistics pages2346ndash2357 Florence Italy Association for Compu-tational Linguistics

Bishan Yang and Tom Mitchell 2017 Leveragingknowledge bases in LSTMs for improving machinereading In Proceedings of the 55th Annual Meet-ing of the Association for Computational Linguistics(Volume 1 Long Papers) pages 1436ndash1446 Van-couver Canada Association for Computational Lin-guistics

Zhi-Xiu Ye Qian Chen Wen Wang and Zhen-HuaLing 2019 Align mask and select A sim-ple method for incorporating commonsense knowl-edge into language representation models CoRRabs190806725

Yuyu Zhang Hanjun Dai Kamil Toraman andLe Song 2018 Kgˆ2 Learning to reason scienceexam questions with contextual knowledge graphembeddings CoRR abs180512393

Chen Zhu Yu Cheng Zhe Gan Siqi Sun Tom Gold-stein and Jingjing Liu 2020 Freelb Enhanced ad-versarial training for natural language understandingIn International Conference on Learning Represen-tations

A Implementation Details

CommonsenseQA OpenbookQA

BERT-BASE 3 times 10minus5 -

BERT-LARGE 2 times 10minus5 -

ROBERTA-LARGE 1 times 10minus5

1 times 10minus5

Table 7 Learning rate for text encoders on differentdatasets

CommonsenseQA OpenbookQA

RN 3 times 10minus4

3 times 10minus4

RGCN 1 times 10minus3

1 times 10minus3

GconAttn 3 times 10minus4

1 times 10minus4

KV-Memory 1 times 10minus3

1 times 10minus3

MultiGRN 1 times 10minus3

1 times 10minus3

Table 8 Learning rate for graph encoders on differentdatasets

Our models are implemented in PyTorch We usecross-entropy loss and RAdam (Liu et al 2019a)optimizer We find it beneficial to use separatelearning rates for the text encoder and the graph en-coder We tune learning rates for text encoders andgraph encoders on two datasets We first fine-tuneROBERTA-LARGE BERT-LARGE BERT-BASE

on CommonsenseQA and ROBERTA-LARGE onOpenbookQA respectively and choose a dataset-specific learning rate from 1times10

minus5 2times10

minus5 3times

10minus5 6 times 10

minus5 1 times 10

minus4 for each text encoderbased on the best performance on development setas listed in Table 7 We report the performance ofthese fine-tuned text encoders and also adopt theirdataset-specific optimal learning rates in joint train-ing with graph encoders For models that involveKG the learning rate of their graph encoders 7 arechosen from 1 times 10

minus4 3 times 10

minus4 1 times 10

minus3 3 times

10minus3 based on their best development set perfor-

mance with ROBERTA-LARGE as the text encoderWe report the optimal learning rates for graph en-coders in Table 8 In training we set the maximuminput sequence length to text encoders to 64 batchsize to 32 and perform early stopping

For the input node features we first use tem-plates to turn knowledge triples in Common-senseQA into sentences and feed them into pre-trained BERT-LARGE obtaining a sequence of

7KV-memory actually encode triples to enhance tokenrepresentations of s instead of encoding graph alone But forsimplicity we also refer to it as a graph encoder

token embeddings from the last layer of BERT-LARGE for each triple For each entity in Com-monsenseQA we perform mean pooling over thetokens of the entityrsquos occurrences across all the sen-tences to form a 1024d vector as its correspondingnode feature We use this set of features for all ourimplemented models

We use 2-layer RGCN and single-layer Multi-GRN across our experiments

B Dynamic Programming Algorithm forEq 7

To show that multi-hop message passing can becomputed in linear time we observe that Eq 7 canbe re-written in matrix form

Zk= (Dk)minus1 sum

(r1rk)isinRk

β(r1 rk s)

sdotGArk⋯Ar1FXW1r1

⊤⋯W

krk

sdotWk+10

⊤⋯W

K0

⊤ (1 le k le K) (13)

where G = diag(exp([g(φ(v1) s) g(φ(vn)s)]) (F is similarly defined) Ar is the adjacencymatrix for relation r and D

k is defined as follows

Dk= diag( sum

(r1rk)isinRk

β(r1 rk s)

sdotGArk⋯Ar1FX1) (1 le k le K) (14)

Using this matrix formulation we can computeEq 7 using dynamic programming

C Formal Definition of K-hop RN

Definition 1 (K-hop Relation Network) A multi-hop relation network is a function that maps amulti-relational graph to a fixed size vector

KHopRN(G W E H) =K

sumk=1

sum(jr1rki)isinΦk

jisinQ iisinA

β(j r1 rk i)sdotW (hjoplus(er1⋯erk)oplushi)(15)

where denotes element-wise product and β(⋯) =1(K∣A∣ sdot ∣(j prime i) isin G ∣ j prime isin Q∣) defines thepooling weights

Algorithm 1 Dynamic programming algorithm formulti-hop message passingInput sXAr(1 le r le m)W t

r (r isin R 1 le t le

k)F G δ τOutput Z

1 WKlarr I

2 for k larr K minus 1 to 1 do3 W

klarrW

k+10 W

k+1

4 end for5 for r isin R do6 Mr larr FX7 end for8 for k larr 1 to K do9 if k gt 1 then

10 for r isin R do11 M

primer larr e

δ(rs)ArsumrprimeisinR e

τ(rprimers)MrprimeW

kr

12 end for13 for r isin R do14 Mr larrM

primer

15 end for16 else17 for r isin R do18 Mr larr e

δ(rs) sdotArMrWkr

19 end for20 end if21 Z

klarr GsumrisinR MrW

k

22 end for23 Replace W

tr (0 le r le m 1 le t le k) with identity

matrices and X with 1 and re-run line 1 - line 19 tocompute d

1 d

K

24 for k larr 1 to K do25 Z

klarr (diag(dk))minus1

Zk

26 end for27 return Z

1Z

2 Z

k

D Expressing K-hop RN with MultiGRN

Theorem 1 Given any W E H there exists a pa-rameter setting such that the output of the modelbecomes KHopRN(GW E H) for arbitrary G

Proof Suppose W = [W1 W2 W3] whereW1 W3 isin Rd3timesd1 W2 isin Rd3timesd2 ForMultiRGN we set the parameters as followsH = HUlowast = [I0] isin R(d1+d2)timesd1 blowast =

[01]⊤ isin Rd1+d2 Wtr = diag(1 oplus er) isin

R(d1+d2)times(d1+d2)(r isin R 1 le t le K)V = W3 isin

Rd3timesd1 Vprime= [W1 W2] isin Rd3times(d1+d2) We dis-

able the relation type attention module and enablemessage passing only from Q to A By furtherchoosing σ as the identity function and perform-ing pooling over A we observe that the output ofMultiGRN becomes

1

∣A∣ sumiisinA

hprimei

=1

∣A∣ (V hi + Vprimezi)

=1

K∣A∣K

sumk=1

(V hi + Vprimezki )

=

K

sumk=1

sum(jr1rki)isinΦk

jisinQ iisinA

β(j r1 rk i)(V hi+

VprimeW

krk⋯W

1r1xj)

=

K

sumk=1

sum(jr1rki)isinΦk

jisinQ

β(⋯)(V hi + VprimeW

krk

⋯W1r1Uφ(j)hj + V

primeW

krk⋯W

1r1bφ(j))

=

K

sumk=1

sum(jr1rki)isinΦk

jisinQ iisinA

β(⋯)(W3hi + W1hj

+ W2(er1 ⋯ erk))

=

K

sumk=1

sum(jr1rki)isinΦk

jisinQ iisinA

β(⋯)W (hjoplus

(er1 ⋯ erk)oplus hi)= RN(G W E H)

(16)

Page 6: Scalable Multi-Hop Relational Reasoning for Knowledge ... · 1 Introduction Many recently proposed question answering tasks require not only machine comprehension of the question

Methods BERT-Base BERT-Large RoBERTa-Large

IHdev-Acc() IHtest-Acc() IHdev-Acc() IHtest-Acc() IHdev-Acc() IHtest-Acc()

wo KG 5731 (plusmn107) 5347 (plusmn087) 6106 (plusmn085) 5539 (plusmn040) 7307 (plusmn045) 6869(plusmn056)

RGCN (Schlichtkrull et al 2018) 5694 (plusmn038) 5450 (plusmn056) 6298 (plusmn082) 5713 (plusmn036) 7269 (plusmn019) 6841 (plusmn066)GconAttn (Wang et al 2019) 5729 (plusmn062) 5441 (plusmn050) 6298 (plusmn017) 5694 (plusmn077) 7338( plusmn027) 6988 (plusmn047)KagNetdagger (Lin et al 2019) 5557 5619 6235 5716 - -RN (1-hop) 5827 (plusmn022) 5620 (plusmn045) 6304 (plusmn058) 5846 (plusmn071) 7457 (plusmn091) 7008 (plusmn021)RN (2-hop) 5981 (plusmn076) 5661 (plusmn068) 6336 (plusmn026) 5892 (plusmn014) 7365 (plusmn309) 6959 (plusmn380)

MHGRN 6036 (plusmn023) 5723 (plusmn082) 6329(plusmn051) 6059 (plusmn058) 7445 (plusmn010) 7111 (plusmn081)

Table 3 Performance comparison on CommonsenseQA in-house split We report in-house Dev (IHdev) andTest (IHtest) accuracy (mean and standard deviation of four runs) using the data split of Lin et al (2019) onCommonsenseQA dagger indicates reported results in its paper

ties hprimei ∣ i isin A Next we simply concatenateit with the text representation s and compute theplausibility score by ρ(qa) = MLP(soplus g)

During training we maximize the plausibilityscore of the correct answer a by minimizing thecross-entropy loss

L = EqaC [minus logexp(ρ(q a))

sumaisinC exp(ρ(qa))] (12)

The whole model is trained end-to-end jointly withthe text encoder (eg RoBERTa)

During inference we predict the most plau-sible answer by argmaxaisinC ρ(qa) Addition-ally we can decode a reasoning path as evidencefor model predictions endowing our model withthe interpretability enjoyed by path-based mod-els Specifically we first determine the answerentity ilowast with the highest score in the pooling layerand the path length k with the highest score inEq 8 Then the reasoning path is decoded byargmax α(j r1 rklowast i

lowast) which can be com-puted in linear time using dynamic programming

5 Experimental SetupWe introduce how we construct G (sect51) thedatasets (sect52) as well as the baseline methods(sect53) Appendix A shows more implementationand experimental details for reproducibility

51 Extracting G from External KGWe use ConceptNet (Speer et al 2017) a general-domain knowledge graph as our external KG to testmodelsrsquo ability to harness structured knowledgesource Following KagNet (Lin et al 2019) wemerge relation types to increase graph density andadd reverse relations to construct a multi-relationalgraph with 34 relation types To extract an informa-tive contextual graph G from the KG we recognizeentity mentions in s and link them to entities in

Methods                                  | Single | Ensemble
RoBERTa†                                 | 72.1   | 72.5
ALBERT† (Lan et al., 2019)               | -      | 76.5
XLNet + GR† (Lv et al., 2019)            | 75.3   | -
RoBERTa + KEDGN†                         | 72.5   | 74.4
RoBERTa + KE†                            | 73.3   | -
RoBERTa + HyKAS 2.0† (Ma et al., 2019)   | 73.2   | -
XLNet + DREAM†                           | 66.9   | 73.3
RoBERTa + FreeLB† (Zhu et al., 2020)     | 72.2   | 73.1
MHGRN (RoBERTa, K = 2)                   | 75.4   | 76.5

Table 4: Performance comparison on the official test set of CommonsenseQA with leaderboard SoTAs⁴ (accuracy in %). † indicates results reported on the leaderboard. MHGRN achieves state-of-the-art in both the single and ensemble settings with the RoBERTa-Large encoder.

ConceptNet, with which we initialize our node set V. We then add to V all the entities that appear in any two-hop path between pairs of mentioned entities. Unlike KagNet, we do not perform any pruning but instead retain all the edges between nodes in V, forming our G.
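As a rough illustration of this extraction step, here is a minimal sketch using NetworkX; the KG object, the already-linked entity lists, and the helper name are assumptions for illustration, not the released pipeline.

```python
import networkx as nx

def extract_contextual_graph(kg: nx.MultiDiGraph, question_entities, answer_entities):
    """Builds G: mentioned entities plus any node on a <=2-hop path between them, with all induced edges."""
    mentioned = set(question_entities) | set(answer_entities)
    nodes = set(mentioned)
    undirected = kg.to_undirected(as_view=True)
    for src in mentioned:
        for dst in mentioned:
            if src == dst or src not in kg or dst not in kg:
                continue
            # collect intermediate nodes on paths with at most two edges (one bridge entity)
            for path in nx.all_simple_paths(undirected, src, dst, cutoff=2):
                nodes.update(path)
    # keep every edge between the selected nodes (no pruning)
    return kg.subgraph(nodes).copy()
```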

5.2 Datasets

We evaluate models on two multiple-choice question answering datasets, CommonsenseQA and OpenBookQA. Both require world knowledge beyond textual understanding to perform well.

CommonsenseQA (Talmor et al., 2019) necessitates various commonsense reasoning skills. The questions are created with entities from ConceptNet and are designed to probe latent compositional relations between entities in ConceptNet.

OpenBookQA (Mihaylov et al., 2018) provides elementary science questions together with an open book of science facts. This dataset also probes general commonsense knowledge beyond the provided facts. As our model is orthogonal to text-form knowledge retrieval, we do not utilize the provided open book and instead use ConceptNet. Consequently, we do not compare our methods with those using the open book.

4 Models based on ConceptNet are no longer shown on the leaderboard, and thus we obtained our results directly from the organizers.

5.3 Compared Methods

We implement both knowledge-agnostic fine-tuning of pre-trained LMs and models that incorporate a KG as an external source as our baselines. Additionally, we directly compare our model with the results from the corresponding leaderboards. These methods typically leverage textual knowledge or extra training data, as opposed to an external KG. In all our implemented models, we use pre-trained LMs as text encoders for s for fair comparison. We stick to our focus of encoding structured KGs and therefore do not compare our models with those (Pan et al., 2019; Zhang et al., 2018; Sun et al., 2019; Banerjee et al., 2019) augmented by other text-form external knowledge (e.g., Wikipedia).

Specifically, we fine-tune BERT-BASE, BERT-LARGE (Devlin et al., 2019), and ROBERTA (Liu et al., 2019b) for multiple-choice questions. We take RGCN (Eq. 2 in Sec. 3), RN⁵ (Eq. 3 in Sec. 3), KagNet (Eq. 4 in Sec. 3), and GconAttn (Wang et al., 2019) as baselines. GconAttn generalizes match-LSTM (Wang and Jiang, 2016) from sequence modeling to entity set modeling and achieves success in language inference tasks.

6 Results and Discussions

In this section, we present the results of our models in comparison with baselines as well as methods on the leaderboards for both CommonsenseQA and OpenBookQA. We also provide analyses of the models' components and characteristics.

6.1 Main Results

For CommonsenseQA (Table 3), we first use the in-house data split of Lin et al. (2019) to compare our models with our implemented baselines. In this setting, we take 1,241 examples from the official training set as our in-house test examples and regard the remaining 8,500 as our in-house training examples. Almost all KG-augmented models achieve performance gains over vanilla pre-trained LMs, demonstrating the value of external knowledge on this dataset. Additionally, we evaluate our MHGRN (with ROBERTA-LARGE as the text encoder) on the official split (Table 4) for fair comparison with other methods on the leaderboard, in both the single-model and the ensemble-model setting. In both cases, we achieve state-of-the-art performance across all existing models.

5 We use mean pooling for 1-hop RN and attentive pooling for 2-hop RN (detailed in Appendix C).

Methods                       | Dev (%)       | Test (%)
BERT (w/o KG)†                | 60.4          | 60.4
BERT Multi-Task (w/o KG)†     | -             | 63.8
GapQA† (Khot et al., 2019)    | -             | 59.4 (±1.30)
RoBERTa-Large (w/o KG)        | 66.76 (±1.14) | 64.80 (±2.37)
+ RGCN                        | 64.65 (±1.96) | 62.45 (±1.57)
+ GconAttn                    | 66.85 (±1.82) | 64.75 (±1.48)
+ RN (1-hop)                  | 64.85 (±1.11) | 63.65 (±2.31)
+ RN (2-hop)                  | 67.00 (±0.71) | 65.20 (±1.18)
+ MHGRN (K = 3)               | 68.10 (±1.02) | 66.85 (±1.19)

Table 5: Dev and Test accuracy on OpenBookQA. † indicates results reported on the leaderboard.

Methods                                   | IHdev-Acc (%)
MHGRN (K = 3)                             | 74.45 (±0.10)
- Type-specific transformation (§4.1)     | 73.16 (±0.28)
- Structured relational attention (§4.2)  | 73.26 (±0.31)
- Relation type attention (§4.2)          | 73.55 (±0.68)
- Node type attention (§4.2)              | 73.92 (±0.65)

Table 6: Ablation study on model components (removing one component at a time) using ROBERTA-LARGE as the text encoder. We report IHdev accuracy on CommonsenseQA.

For OpenBookQA (Table 5), we use the official split and build models with ROBERTA-LARGE as the text encoder. Overall, external knowledge still brings benefit to this task. Our model surpasses all baselines, with an absolute improvement of ~2% on Test. As we seek to evaluate models in terms of encoding external knowledge graphs, we do not compare with models based on document retrieval and reading comprehension. Note that other submissions on the OBQA leaderboard are usually fine-tuned on external question answering datasets or use large corpora via information retrieval. As we seek to systematically examine knowledge-aware reasoning methods that reason over interpretable structures, we do not include comparisons with models that are beyond this scope. Other fine-tuned PTLMs can easily be incorporated whenever they are available.

6.2 Performance Analysis

Ablation Study on Model Components. We assess the impact of our model's components, as shown in Table 6. Disabling the type-specific transformation results in a ~1.3% drop in performance, demonstrating the need to distinguish node types for QA tasks. Our structured relational attention mechanism is also critical, with its two sub-components contributing almost equally.

Impact of the Amount of Training Data. We use different fractions of the CommonsenseQA training data and report the results of fine-tuning text encoders alone and of jointly training the text encoder and graph encoder in Fig. 5. Regardless of the training-data fraction, our model shows consistently larger improvements over knowledge-agnostic fine-tuning than the other graph encoding methods, indicating MHGRN's complementary strengths to text encoders.

Figure 5: Performance change (accuracy in %) w.r.t. the amount of training data on the CommonsenseQA IHtest set (same split as Lin et al. (2019)). [Plot: x-axis, percentage of training data (20–100%); y-axis, accuracy (45–65%); curves for MHGRN, GconAttn, and fine-tuning w/o KG.]

Impact of the Number of Hops (K). We investigate the impact of the hyperparameter K on MHGRN's performance on CommonsenseQA (Fig. 6). Increasing K brings benefit up to K = 3, but performance begins to drop when K > 3. This might be attributed to the exponential growth of noisy longer relational paths in the knowledge graph.

6.3 Model Scalability

Fig. 7 presents the computation cost of MHGRN and RGCN (measured by training time). The computation cost of both models grows linearly w.r.t. K. Although the theoretical complexity of MHGRN is m times that of RGCN, the ratio of their empirical costs only approaches 2, demonstrating that our model can be effectively parallelized.

Figure 6: Effect of K in MHGRN. We show IHdev accuracy of MHGRN on CommonsenseQA using the data split of Lin et al. (2019) with different numbers of hops: 73.24 (K=1), 73.93 (K=2), 74.65 (K=3), 73.65 (K=4), 74.16 (K=5).

Figure 7: Analysis of model scalability. Comparison of per-batch training time (s/batch, 0.0–2.0) of RGCN and MHGRN w.r.t. the number of reasoning hops K (2–8).

6.4 Model Interpretability

We can analyze our model's reasoning process by decoding the reasoning path using the method described in §4.5. Fig. 8 shows two examples from CommonsenseQA where our model correctly answers the question and provides reasonable path evidence. In the example on the left, the model links question entities and the answer entity in a chain to support reasoning, while the example on the right shows a case where our model leverages unmentioned entities to bridge the reasoning gap between the question entity and the answer entity, in a way that is coherent with the relation implied between CHAPEL and the desired answer in the question.

Figure 8: Case study on model interpretability. We present two sampled questions from CommonsenseQA with the reasoning paths output by MHGRN. Left: "Where is known for a multitude of wedding chapels?" (choices: town, texas, city, church building, Nevada; answer: Nevada), with the decoded path CHAPEL →AtLocation→ RENO →PartOf→ NEVADA. Right: "Why do parents encourage their kids to play baseball?" (choices: round, cheap, break window, hard, fun to play; answer: fun to play), with the decoded path BASEBALL →UsedFor⁻¹→ PLAY_BASEBALL →HasProperty→ FUN_TO_PLAY.

6.5 Potential Compatibility with Other Methods

In theory, our approach is naturally compatible with methods that utilize textual knowledge or extra data (such as the leaderboard methods in Table 4), because in our paradigm the encodings of the textual statement and the graph are structurally decoupled (Fig. 3). For example, we could take the fine-tuned RoBERTa+KE⁶ system as our text encoder and leave the rest of our model architecture unchanged.

7 Related Work

Knowledge-Aware Methods for NLP. Various works have investigated the potential of empowering NLP models with external knowledge. Many attempt to extract structured knowledge, either in the form of nodes (Yang and Mitchell, 2017; Wang et al., 2019), triples (Weissenborn et al., 2017; Mihaylov and Frank, 2018), paths (Bauer et al., 2018; Kundu et al., 2019; Lin et al., 2019), or subgraphs (Li and Clark, 2015), and encode them to augment textual understanding.

The recent success of pre-trained LMs motivates many works (Pan et al., 2019; Ye et al., 2019; Zhang et al., 2018; Li et al., 2019; Banerjee et al., 2019) to probe LMs' potential as latent knowledge bases. This line of work turns to textual knowledge (e.g., Wikipedia) to directly impart knowledge to pre-trained LMs. They generally fall into two paradigms: 1) fine-tuning LMs on large-scale general-domain datasets (e.g., RACE (Lai et al., 2017)) or on knowledge-rich text; 2) providing LMs with evidence via information retrieval techniques. However, these models cannot provide explicit reasoning and evidence and are thus hardly trustworthy. They are also subject to the availability of in-domain datasets and the maximum input length of pre-trained LMs.

Neural Graph Encoding. Graph Attention Networks (GAT) (Velickovic et al., 2018) incorporate an attention mechanism in feature aggregation. RGCN (Schlichtkrull et al., 2018) proposes relational message passing, which makes it applicable to multi-relational graphs. However, these models only perform single-hop message passing and cannot be interpreted at the path level. Other works (Abu-El-Haija et al., 2019; Nikolentzos et al., 2019) aggregate, for each node, its K-hop neighbors based on node-wise distances, but they are designed for non-relational graphs. MHGRN addresses these issues by reasoning on multi-relational graphs and being interpretable via maintaining paths as reasoning chains.

6 https://github.com/jose77/csqa

8 Conclusion

We present a principled, scalable method, MHGRN, that can leverage general knowledge via multi-hop reasoning over interpretable structures (e.g., ConceptNet). The proposed MHGRN generalizes and combines the advantages of GNNs and path-based reasoning models. It explicitly performs multi-hop relational reasoning and is empirically shown to outperform existing methods with superior scalability and interpretability. Our extensive experiments systematically compare MHGRN with other knowledge-aware methods. In particular, we achieve state-of-the-art performance on the CommonsenseQA dataset.

References

Sami Abu-El-Haija, Bryan Perozzi, Amol Kapoor, Nazanin Alipourfard, Kristina Lerman, Hrayr Harutyunyan, Greg Ver Steeg, and Aram Galstyan. 2019. MixHop: Higher-order graph convolutional architectures via sparsified neighborhood mixing. In ICML 2019, volume 97 of Proceedings of Machine Learning Research, pages 21–29. PMLR.

Pratyay Banerjee, Kuntal Kumar Pal, Arindam Mitra, and Chitta Baral. 2019. Careful selection of knowledge to solve open book question answering. In ACL 2019, pages 6120–6129.

Lisa Bauer, Yicheng Wang, and Mohit Bansal. 2018. Commonsense for generative multi-hop question answering tasks. In EMNLP 2018, pages 4220–4230.

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. CoRR, abs/1803.05457.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT 2019, pages 4171–4186.

Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. 2017. Neural message passing for quantum chemistry. In ICML 2017, volume 70 of Proceedings of Machine Learning Research, pages 1263–1272. PMLR.

Tushar Khot, Ashish Sabharwal, and Peter Clark. 2019. What's missing: A knowledge gap guided approach for multi-hop question answering. In EMNLP-IJCNLP 2019, pages 2814–2828.

Thomas N. Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In ICLR 2017.

Souvik Kundu, Tushar Khot, Ashish Sabharwal, and Peter Clark. 2019. Exploiting explicit paths for multi-hop reading comprehension. In ACL 2019, pages 2737–2747.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML 2001, pages 282–289. Morgan Kaufmann.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale ReAding comprehension dataset from examinations. In EMNLP 2017, pages 785–794.

Zhen-Zhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A lite BERT for self-supervised learning of language representations. ArXiv, abs/1909.11942.

Shiyang Li, Jianshu Chen, and Dian Yu. 2019. Teaching pretrained models with commonsense reasoning: A preliminary KB-based approach. CoRR, abs/1909.09743.

Yang Li and Peter Clark. 2015. Answering elementary science questions by constructing coherent scenes using background knowledge. In EMNLP 2015, pages 2007–2012.

Bill Yuchen Lin, Xinyue Chen, Jamin Chen, and Xiang Ren. 2019. KagNet: Knowledge-aware graph networks for commonsense reasoning. In EMNLP-IJCNLP 2019, pages 2829–2839.

Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. 2019a. On the variance of the adaptive learning rate and beyond. CoRR, abs/1908.03265.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.

Shangwen Lv, Daya Guo, Jingjing Xu, Duyu Tang, Nan Duan, Ming Gong, Linjun Shou, Daxin Jiang, Guihong Cao, and Songlin Hu. 2019. Graph-based reasoning over heterogeneous external knowledge for commonsense question answering. CoRR, abs/1909.05311.

Kaixin Ma, Jonathan Francis, Quanyang Lu, Eric Nyberg, and Alessandro Oltramari. 2019. Towards generalizable neuro-symbolic systems for commonsense question answering. In Proceedings of the First Workshop on Commonsense Inference in Natural Language Processing, pages 22–32.

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? A new dataset for open book question answering. In EMNLP 2018, pages 2381–2391.

Todor Mihaylov and Anette Frank. 2018. Knowledgeable reader: Enhancing cloze-style reading comprehension with external commonsense knowledge. In ACL 2018, pages 821–832.

Giannis Nikolentzos, George Dasoulas, and Michalis Vazirgiannis. 2019. k-hop graph neural networks. CoRR, abs/1907.06051.

Xiaoman Pan, Kai Sun, Dian Yu, Heng Ji, and Dong Yu. 2019. Improving question answering with external knowledge. CoRR, abs/1902.00993.

Adam Santoro, David Raposo, David G. T. Barrett, Mateusz Malinowski, Razvan Pascanu, Peter W. Battaglia, and Tim Lillicrap. 2017. A simple neural network module for relational reasoning. In NeurIPS 2017, pages 4967–4976.

Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. 2019. Social IQa: Commonsense reasoning about social interactions. In EMNLP-IJCNLP 2019, pages 4463–4473.

Michael Sejr Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. 2018. Modeling relational data with graph convolutional networks. In ESWC 2018, volume 10843 of Lecture Notes in Computer Science, pages 593–607. Springer.

Robyn Speer, Joshua Chin, and Catherine Havasi. 2017. ConceptNet 5.5: An open multilingual graph of general knowledge. In AAAI 2017, pages 4444–4451.

Kai Sun, Dian Yu, Dong Yu, and Claire Cardie. 2019. Improving machine reading comprehension with general reading strategies. In NAACL-HLT 2019, pages 2633–2643.

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In NAACL-HLT 2019, pages 4149–4158.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NeurIPS 2017, pages 5998–6008.

Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2018. Graph attention networks. In ICLR 2018.

Shuohang Wang and Jing Jiang. 2016. Learning natural language inference with LSTM. In NAACL-HLT 2016, pages 1442–1451.

Xiaoyan Wang, Pavan Kapanipathi, Ryan Musa, Mo Yu, Kartik Talamadupula, Ibrahim Abdelaziz, Maria Chang, Achille Fokoue, Bassem Makni, Nicholas Mattei, and Michael Witbrock. 2019. Improving natural language inference using external knowledge in the science questions domain. In AAAI 2019, pages 7208–7215.

Dirk Weissenborn, Tomas Kocisky, and Chris Dyer. 2017. Dynamic integration of background knowledge in neural NLU systems. arXiv preprint arXiv:1706.02596.

An Yang, Quan Wang, Jing Liu, Kai Liu, Yajuan Lyu, Hua Wu, Qiaoqiao She, and Sujian Li. 2019. Enhancing pre-trained language representations with rich knowledge for machine reading comprehension. In ACL 2019, pages 2346–2357.

Bishan Yang and Tom Mitchell. 2017. Leveraging knowledge bases in LSTMs for improving machine reading. In ACL 2017, pages 1436–1446.

Zhi-Xiu Ye, Qian Chen, Wen Wang, and Zhen-Hua Ling. 2019. Align, mask and select: A simple method for incorporating commonsense knowledge into language representation models. CoRR, abs/1908.06725.

Yuyu Zhang, Hanjun Dai, Kamil Toraman, and Le Song. 2018. KG^2: Learning to reason science exam questions with contextual knowledge graph embeddings. CoRR, abs/1805.12393.

Chen Zhu, Yu Cheng, Zhe Gan, Siqi Sun, Tom Goldstein, and Jingjing Liu. 2020. FreeLB: Enhanced adversarial training for natural language understanding. In ICLR 2020.

A Implementation Details

Text encoder   | CommonsenseQA | OpenbookQA
BERT-BASE      | 3 × 10⁻⁵      | -
BERT-LARGE     | 2 × 10⁻⁵      | -
ROBERTA-LARGE  | 1 × 10⁻⁵      | 1 × 10⁻⁵

Table 7: Learning rates for text encoders on different datasets.

Graph encoder | CommonsenseQA | OpenbookQA
RN            | 3 × 10⁻⁴      | 3 × 10⁻⁴
RGCN          | 1 × 10⁻³      | 1 × 10⁻³
GconAttn      | 3 × 10⁻⁴      | 1 × 10⁻⁴
KV-Memory     | 1 × 10⁻³      | 1 × 10⁻³
MHGRN         | 1 × 10⁻³      | 1 × 10⁻³

Table 8: Learning rates for graph encoders on different datasets.

Our models are implemented in PyTorch. We use cross-entropy loss and the RAdam (Liu et al., 2019a) optimizer. We find it beneficial to use separate learning rates for the text encoder and the graph encoder, and we tune these learning rates on both datasets. We first fine-tune ROBERTA-LARGE, BERT-LARGE, and BERT-BASE on CommonsenseQA and ROBERTA-LARGE on OpenbookQA, respectively, and choose a dataset-specific learning rate from {1 × 10⁻⁵, 2 × 10⁻⁵, 3 × 10⁻⁵, 6 × 10⁻⁵, 1 × 10⁻⁴} for each text encoder based on the best performance on the development set, as listed in Table 7. We report the performance of these fine-tuned text encoders and also adopt their dataset-specific optimal learning rates in joint training with graph encoders. For models that involve a KG, the learning rates of their graph encoders⁷ are chosen from {1 × 10⁻⁴, 3 × 10⁻⁴, 1 × 10⁻³, 3 × 10⁻³} based on the best development-set performance with ROBERTA-LARGE as the text encoder. We report the optimal learning rates for graph encoders in Table 8. In training, we set the maximum input sequence length for text encoders to 64 and the batch size to 32, and we perform early stopping.
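For concreteness, here is a minimal PyTorch sketch of the separate learning-rate setup; the placeholder modules stand in for the actual text and graph encoders, and the specific values are those from Tables 7 and 8.

```python
import torch
from torch.optim import RAdam  # RAdam is built into recent PyTorch; the paper cites Liu et al. (2019a)

# Placeholder modules; in practice these would be RoBERTa-Large and the MHGRN graph encoder.
text_encoder = torch.nn.Linear(64, 1024)
graph_encoder = torch.nn.Linear(100, 100)

optimizer = RAdam([
    {"params": text_encoder.parameters(), "lr": 1e-5},   # dataset-specific text-encoder LR (Table 7)
    {"params": graph_encoder.parameters(), "lr": 1e-3},  # graph-encoder LR (Table 8)
])
```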

For the input node features, we first use templates to turn knowledge triples in CommonsenseQA into sentences and feed them into pre-trained BERT-LARGE, obtaining a sequence of token embeddings from the last layer of BERT-LARGE for each triple. For each entity in CommonsenseQA, we perform mean pooling over the tokens of the entity's occurrences across all the sentences to form a 1024-dimensional vector as its corresponding node feature. We use this set of features for all our implemented models.

7 KV-Memory actually encodes triples to enhance token representations of s instead of encoding the graph alone, but for simplicity we also refer to it as a graph encoder.
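Below is a rough sketch of this node-feature construction using the HuggingFace transformers API; the template, the entity-to-token matching, and the averaging of per-occurrence means are simplified assumptions that only approximate the pooling described above.

```python
import torch
from collections import defaultdict
from transformers import BertTokenizerFast, BertModel

tokenizer = BertTokenizerFast.from_pretrained("bert-large-uncased")
bert = BertModel.from_pretrained("bert-large-uncased").eval()

def triple_to_sentence(head, rel, tail):
    # simplified template; the actual templates are relation-specific
    return f"{head.replace('_', ' ')} {rel} {tail.replace('_', ' ')}."

@torch.no_grad()
def entity_features(triples):
    """Mean-pools last-layer BERT token embeddings over each entity's occurrences."""
    sums = defaultdict(lambda: torch.zeros(1024))
    counts = defaultdict(int)
    for head, rel, tail in triples:
        sent = triple_to_sentence(head, rel, tail)
        enc = tokenizer(sent, return_tensors="pt", return_offsets_mapping=True)
        hidden = bert(input_ids=enc["input_ids"],
                      attention_mask=enc["attention_mask"]).last_hidden_state[0]
        offsets = enc["offset_mapping"][0].tolist()
        for entity in (head, tail):
            surface = entity.replace("_", " ")
            start = sent.find(surface)
            # indices of sub-word tokens whose character span lies inside the entity mention
            idx = [t for t, (a, b) in enumerate(offsets)
                   if a >= start and b <= start + len(surface) and b > a]
            if idx:
                sums[entity] += hidden[idx].mean(dim=0)
                counts[entity] += 1
    return {e: sums[e] / counts[e] for e in sums}
```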

We use a 2-layer RGCN and a single-layer MHGRN across all our experiments.

B Dynamic Programming Algorithm for Eq. 7

To show that multi-hop message passing can be computed in linear time, we observe that Eq. 7 can be re-written in matrix form:

\[
Z^{k} = \left(D^{k}\right)^{-1}\sum_{(r_1,\dots,r_k)\in R^{k}}\beta(r_1,\dots,r_k,s)\cdot G\,A_{r_k}\cdots A_{r_1}F\,X\,W^{1\top}_{r_1}\cdots W^{k\top}_{r_k}\cdot W^{(k+1)\top}_{0}\cdots W^{K\top}_{0} \qquad (1\le k\le K) \tag{13}
\]

where G = diag(exp([g(φ(v₁), s), …, g(φ(vₙ), s)])) (F is similarly defined), A_r is the adjacency matrix for relation r, and D^k is defined as follows:

\[
D^{k} = \mathrm{diag}\left(\sum_{(r_1,\dots,r_k)\in R^{k}}\beta(r_1,\dots,r_k,s)\cdot G\,A_{r_k}\cdots A_{r_1}F\,X\,\mathbf{1}\right) \qquad (1\le k\le K) \tag{14}
\]

Using this matrix formulation, we can compute Eq. 7 using dynamic programming.
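The following is a simplified PyTorch sketch of this dynamic-programming computation, assuming uniform relation-sequence scores β (i.e., without the structured relational attention of Algorithm 1) and omitting the trailing W₀ alignment factors; tensor names and shapes are illustrative.

```python
import torch

def multihop_message_passing(A, X, W, F_diag, G_diag, K):
    """Simplified K-hop propagation in matrix form (uniform beta, output alignment omitted).

    A: (m, n, n) adjacency matrices, one per relation type
    X: (n, d) input node features
    W: list of K tensors of shape (m, d, d); W[k][r] transforms hop-(k+1) messages of relation r
    F_diag, G_diag: (n,) node-type gating scores for source / target nodes
    Returns Z, a list where Z[k-1] holds the normalized length-k path messages per node.
    """
    m, n, _ = A.shape
    M = F_diag.unsqueeze(1) * X                      # F X
    d = F_diag.unsqueeze(1) * torch.ones(n, 1)       # path-count normalizer (X, W replaced by ones/identity,
                                                     # as in Algorithm 1, line 23)
    Z = []
    for k in range(K):
        # one hop: aggregate over all m relation types; cost grows linearly in K
        M = torch.stack([A[r] @ (M @ W[k][r]) for r in range(m)]).sum(0)
        d = torch.stack([A[r] @ d for r in range(m)]).sum(0)
        Z.append(G_diag.unsqueeze(1) * M / (G_diag.unsqueeze(1) * d).clamp(min=1e-6))
    return Z
```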

C Formal Definition of K-hop RN

Definition 1 (K-hop Relation Network). A multi-hop relation network is a function that maps a multi-relational graph to a fixed-size vector:

\[
\mathrm{KHopRN}(G;W,E,H) = \sum_{k=1}^{K}\sum_{\substack{(j,r_1,\dots,r_k,i)\in\Phi_k\\ j\in Q,\ i\in A}}\beta(j,r_1,\dots,r_k,i)\cdot W\left(h_j\oplus(e_{r_1}\circ\cdots\circ e_{r_k})\oplus h_i\right) \tag{15}
\]

where ∘ denotes element-wise product and β(⋯) = 1 / (K|A| · |{(j′, i) ∈ G | j′ ∈ Q}|) defines the pooling weights.

Algorithm 1: Dynamic programming algorithm for multi-hop message passing
Input: s, X, A_r (1 ≤ r ≤ m), W_r^t (r ∈ R, 1 ≤ t ≤ K), F, G, δ, τ
Output: Z

1:  W̄^K ← I
2:  for k ← K−1 to 1 do
3:      W̄^k ← W_0^{k+1} W̄^{k+1}
4:  end for
5:  for r ∈ R do
6:      M_r ← F X
7:  end for
8:  for k ← 1 to K do
9:      if k > 1 then
10:         for r ∈ R do
11:             M′_r ← e^{δ(r,s)} A_r Σ_{r′∈R} e^{τ(r′,r,s)} M_{r′} W_r^k
12:         end for
13:         for r ∈ R do
14:             M_r ← M′_r
15:         end for
16:     else
17:         for r ∈ R do
18:             M_r ← e^{δ(r,s)} · A_r M_r W_r^k
19:         end for
20:     end if
21:     Z^k ← G Σ_{r∈R} M_r W̄^k
22: end for
23: Replace W_r^t (0 ≤ r ≤ m, 1 ≤ t ≤ K) with identity matrices and X with 1, and re-run lines 1–19 to compute d^1, …, d^K
24: for k ← 1 to K do
25:     Z^k ← (diag(d^k))^{-1} Z^k
26: end for
27: return Z^1, Z^2, …, Z^K

D Expressing K-hop RN with MHGRN

Theorem 1. Given any W, E, H, there exists a parameter setting such that the output of the model becomes KHopRN(G; W, E, H) for arbitrary G.

Proof. Suppose W = [W₁, W₂, W₃], where W₁, W₃ ∈ R^{d₃×d₁} and W₂ ∈ R^{d₃×d₂}. For MHGRN, we set the parameters as follows: H = H, U_* = [I; 0] ∈ R^{(d₁+d₂)×d₁}, b_* = [0, 1]^⊤ ∈ R^{d₁+d₂}, W_r^t = diag(1 ⊕ e_r) ∈ R^{(d₁+d₂)×(d₁+d₂)} (r ∈ R, 1 ≤ t ≤ K), V = W₃ ∈ R^{d₃×d₁}, V′ = [W₁, W₂] ∈ R^{d₃×(d₁+d₂)}. We disable the relation type attention module and enable message passing only from Q to A. By further choosing σ as the identity function and performing pooling over A, we observe that the output of MHGRN becomes

\begin{aligned}
\frac{1}{|A|}\sum_{i\in A} h'_i
&= \frac{1}{|A|}\sum_{i\in A}\left(V h_i + V' z_i\right)\\
&= \frac{1}{K|A|}\sum_{i\in A}\sum_{k=1}^{K}\left(V h_i + V' z_i^{k}\right)\\
&= \sum_{k=1}^{K}\sum_{\substack{(j,r_1,\dots,r_k,i)\in\Phi_k\\ j\in Q,\ i\in A}}\beta(j,r_1,\dots,r_k,i)\left(V h_i + V' W^{k}_{r_k}\cdots W^{1}_{r_1} x_j\right)\\
&= \sum_{k=1}^{K}\sum_{\substack{(j,r_1,\dots,r_k,i)\in\Phi_k\\ j\in Q,\ i\in A}}\beta(\cdots)\left(V h_i + V' W^{k}_{r_k}\cdots W^{1}_{r_1} U_{\phi(j)} h_j + V' W^{k}_{r_k}\cdots W^{1}_{r_1} b_{\phi(j)}\right)\\
&= \sum_{k=1}^{K}\sum_{\substack{(j,r_1,\dots,r_k,i)\in\Phi_k\\ j\in Q,\ i\in A}}\beta(\cdots)\left(W_3 h_i + W_1 h_j + W_2\,(e_{r_1}\circ\cdots\circ e_{r_k})\right)\\
&= \sum_{k=1}^{K}\sum_{\substack{(j,r_1,\dots,r_k,i)\in\Phi_k\\ j\in Q,\ i\in A}}\beta(\cdots)\,W\left(h_j\oplus(e_{r_1}\circ\cdots\circ e_{r_k})\oplus h_i\right)\\
&= \mathrm{KHopRN}(G;W,E,H). \qquad (16)
\end{aligned}

Page 7: Scalable Multi-Hop Relational Reasoning for Knowledge ... · 1 Introduction Many recently proposed question answering tasks require not only machine comprehension of the question

book of science facts This dataset also probes gen-eral commonsense knowledge beyond the providedfacts As our model is orthogonal to text-formknowledge retrieval we do not utilize the providedopen book and instead use ConceptNet Conse-quently we do not compare our methods with thoseusing the open book

53 Compared Methods

We implement both knowledge-agnostic fine-tuning of pre-trained LMs and models that incor-porate KG as external sources as our baselinesAdditionally we directly compare our model withthe results from corresponding leaderboard Thesemethods typically leverage textual knowledge or ex-tra training data as opposed to external KG In allour implemented models we use pre-trained LMsas text encoders for s for fair comparison We stickto our focus of encoding structured KG and there-fore do not compare our models with those (Panet al 2019 Zhang et al 2018 Sun et al 2019Banerjee et al 2019) augmented by other text-formexternal knowledge (eg Wikipedia)

Specifically we fine-tune BERT-BASE BERT-LARGE (Devlin et al 2019) and ROBERTA (Liuet al 2019b) for multiple-choice questions Wetake RGCN (Eq 2 in Sec 3) RN5 (Eq 3 in Sec 3)KagNet (Eq 4 in Sec 3) and GconAttn (Wanget al 2019) as baselines GconAttn general-izes match-LSTM (Wang and Jiang 2016) fromsequence modeling to entity set modeling andachieves success in language inference tasks

6 Results and Discussions

In this section we present the results of our modelsin comparison with baselines as well as methodson the leaderboards for both CommonsenseQA andOpenbookQA We also provide analysis of modelsrsquocomponents and characteristics

61 Main ResultsFor CommonsenseQA (Table 3) we first use the in-house data split of Lin et al (2019) to compare ourmodels with our implemented baselines In this set-ting we take 1241 examples from official trainingexamples as our in-house test examples and regardthe remaining 8500 ones as our in-house train-ing examples Almost all KG-augmented modelsachieve performance gain over vanilla pre-trained

5We use mean pooling for 1-hop RN and attentive poolingfor 2-hop RN (detailed in Appendix C)

Methods Dev () Test ()

BERT (wo KG)dagger 604 604BERT Multi-Task (wo KG)dagger - 638GapQAdagger (Khot et al 2019) - 594 (plusmn130)

RoBERTa-Large (wo KG) 6676 (plusmn114) 6480 (plusmn237)+ RGCN 6465 (plusmn196) 6245 (plusmn157)+ GconAttn 6685 (plusmn182) 6475 (plusmn148)+ RN (1-hop) 6485 (plusmn111) 6365 (plusmn231)+ RN (2-hop) 6700 (plusmn071) 6520 (plusmn118)

+ MHGRN (K = 3) 6810 (plusmn102) 6685 (plusmn119)

Table 5 Dev and Test accuracy on OpenbookQA daggerindicates reported results on the leaderboard

Methods IHdev-Acc ()

MHGRN (K = 3) 7445 (plusmn010)- Type-specific transformation (sect41) 7316 (plusmn028)- Structured relational attention (sect42) 7326 (plusmn031)- Relation type attention (sect42) 7355 (plusmn068)- Node type attention (sect42) 7392 (plusmn065)

Table 6 Ablation study on model components (re-moving one component each time) using ROBERTA-LARGE as the text encoder We report the IHdev ac-curacy on CommonsenseQA

LMs demonstrating the value of external knowl-edge on this dataset Additionally we evaluate ourMHGRN (with the text encoder being ROBERTA-LARGE) on official split (Table 4) for fair compar-ison with other methods on leaderboard in bothsingle-model setting and ensemble-model settingIn both cases we achieve state-of-the-art perfor-mances across all existing models

For OpenbookQA (Table 5) we use official splitand build models with ROBERTA-LARGE as textencoder Overall external knowledge still bringsbenefit to this task Our model surpasses all base-lines with an absolute increase of sim2 on Test Aswe seek to evaluate the performance of models interms of encoding external knowledge graphs wedid not compare with other models based on doc-ument retrieval and reading comprehension Notethat other submissions on OBQA leaderboard usu-ally were fine-tuned on external question answer-ing datasets or using large corpus via informationretrieval As we seek to systematically examineknowledge-aware reasoning methods by reason-ing over interpretable structures we do not includecomparisons with models that are beyond the scopeOther fine-tuned PTLMs can be simply incorpo-rated as well whenever they are available

62 Performance Analysis

Ablation Study on Model Components We as-sess the impact of our modelsrsquo components shownin Table 6 Disabling type-specific transformationresults in sim 13 drop in performance demonstrat-ing the need for distinguishing node type for QAtasks Our structured relational attention mecha-nism is also critical with its two sub-componentscontributing almost equally

Impact of the Amount of Training Data Weuse different fractions of training data of Common-senseQA and report results of fine-tuning text en-coders alone and jointly training text encoder andgraph encoder in Fig 5 Regardless of training datafraction our model shows consistently more per-formance improvement over knowledge-agnosticfine-tuning compared with the other graph encod-ing methods indicating MHGRNrsquos complementarystrengths to text encoders

20 40 60 80 100Percentage of Training Data

45

50

55

60

65

Accu

racy

()

MultiGRNGconAttnwo KG

Figure 5 Performance change (accuracy in ) wrtthe amounts of training data on CommonsenseQA IHT-est set (same as Lin et al (2019))

Impact of Number of Hops (K) We investigatethe impact of hyperparameter K for MHGRN byits performance on CommonsenseQA (shown inFig 6) The increase of K continues to bring ben-efit until K = 4 However performance beginsto drop when K gt 3 This might be attributedto exponential noise in longer relational paths inknowledge graph

63 Model Scalability

Fig 7 presents the computation cost of MultiRGNand RGCN (measured by training time) The com-putation cost of both models grow linearly wrt toK Although the theoretical complexity of Multi-RGN is m times that of RGCN the ratio of theirempirical cost only approaches 2 demonstratingthat our model can be parallelized

65 70 75Accuracy ()

K=1

K=2

K=3

K=4

K=5

Reasoning Hops

7324

7393

7465

7365

7416

Figure 6 Effect of K in MHGRN We show IHDevaccuracy of MHGRN on CommonsenseQA using thedata split of Lin et al (2019) with different hops

2 4 6 800

05

10

15

20

sba

tch

K Reasoning Hops

RGCNMHGRN

Figure 7 Analysis of model scalability Comparisonof per-batch training efficiency wrt hops K

64 Model InterpretabilityWe can analyze our modelrsquos reasoning process bydecoding the reasoning path using the method de-scribed in sect45 Fig 8 shows two examples fromCommonsenseQA where our model correctly an-swers the questions and provides reasonable pathevidences In the example on the left the modellinks question entities and answer entity in a chainto support reasoning while the example on theright provides a case where our model leverage un-mentioned entities to bridge the reasoning gap be-tween question entity and answer entities in a waythat is coherent with the implied relation betweenCHAPEL and the desired answer in the question

Where is known for a multitude of wedding chapels

A town B texasC city D church building E Nevada

RenoChapel

PartOf

AtLocation

UsedForminus1 Synonym

AtLocation Synonym

Neveda

Baseball

Play_Baseball

HasProperty

UsedForminus1

Fun_to_Play

Why do parents encourage their kids to play baseball

A round B cheap C break window D hardE fun to play Fun_to_Play

HasProperty

Reno

Chapel

AtLocation

Neveda

PartOf

MultiGRNFigure 8 Case study on model interpretability Wepresent two sampled questions from CommonsenseQAwith the reasoning paths output by MHGRN

65 Potential Compatibility with OtherMethods

In theory our approach is naturally compatiblewith the methods that utilize textual knowledge

or extra data (such as leaderboard methods inTable 4) because in our paradigm the encodingof textual statement and graph are structurally-decoupled (Fig 3) We can take for example thefine-tuned RoBERTa+KE6 system as our text en-coder and leave the rest of our model architectureunchanged

7 Related Work

Knowledge-Aware Methods for NLP Variouswork have investigated the potential to empowerNLP models with external knowledge Many at-tempt to extract structured knowledge either in theform of nodes (Yang and Mitchell 2017 Wanget al 2019) triples (Weissenborn et al 2017Mihaylov and Frank 2018) paths (Bauer et al2018 Kundu et al 2019 Lin et al 2019) or sub-graphs (Li and Clark 2015) and encode them toaugment textual understanding

Recent success of pre-trained LMs motivatesmany (Pan et al 2019 Ye et al 2019 Zhang et al2018 Li et al 2019 Banerjee et al 2019) to probeLMsrsquo potential as latent knowledge bases This lineof work turn to textual knowledge (eg Wikipedia)to directly impart knowledge to pre-trained LMsThey generally fall into two paradigms 1) Fine-tuning LMs on large-scale general-domain datasets(eg RACE (Lai et al 2017)) or on knowledge-richtext 2) Providing LMs with evidence via informa-tion retrieval techniques However these modelscannot provide explicit reasoning and evidencethus hardly trustworthy They are also subject tothe availability of in-domain datasets and maxi-mum input token of pre-trained LMs

Neural Graph Encoding Graph Attention Net-works (GAT) (Velickovic et al 2018) incorpo-rates attention mechanism in feature aggregationRGCN (Schlichtkrull et al 2018) proposes rela-tional message passing which makes it applicableto multi-relational graphs However they only per-form single-hop message passing and cannot beinterpreted at path level Other work (Abu-El-Haijaet al 2019 Nikolentzos et al 2019) aggregate fora node its K-hop neighbors based on node-wisedistances but they are designed for non-relationalgraphs MHGRN addresses these issues by rea-soning on multi-relational graphs and being inter-pretable via maintaining paths as reasoning chains

6httpsgithubcomjose77csqa

8 Conclusion

We present a principled scalable method MHGRNthat can leverage general knowledge by multi-hopreasoning over interpretable structures (eg Con-ceptNet) The proposed MHGRN generalizes andcombines the advantages of GNNs and path-basedreasoning models It explicitly performs multi-hoprelational reasoning and is empirically shown tooutperform existing methods with superior scal-ablility and interpretability Our extensive experi-ments systematically compare MHGRN and otherexisting methods on knowledge-aware methodsParticularly we achieve the state-of-the-art perfor-mance on the CommonsenseQA dataset

ReferencesSami Abu-El-Haija Bryan Perozzi Amol Kapoor

Nazanin Alipourfard Kristina Lerman Hrayr Haru-tyunyan Greg Ver Steeg and Aram Galstyan 2019Mixhop Higher-order graph convolutional architec-tures via sparsified neighborhood mixing In Pro-ceedings of the 36th International Conference onMachine Learning ICML 2019 9-15 June 2019Long Beach California USA volume 97 of Pro-ceedings of Machine Learning Research pages 21ndash29 PMLR

Pratyay Banerjee Kuntal Kumar Pal Arindam Mi-tra and Chitta Baral 2019 Careful selection ofknowledge to solve open book question answeringIn Proceedings of the 57th Annual Meeting of theAssociation for Computational Linguistics pages6120ndash6129 Florence Italy Association for Compu-tational Linguistics

Lisa Bauer Yicheng Wang and Mohit Bansal 2018Commonsense for generative multi-hop question an-swering tasks In Proceedings of the 2018 Confer-ence on Empirical Methods in Natural LanguageProcessing Brussels Belgium October 31 - Novem-ber 4 2018 pages 4220ndash4230 Association for Com-putational Linguistics

Peter Clark Isaac Cowhey Oren Etzioni Tushar KhotAshish Sabharwal Carissa Schoenick and OyvindTafjord 2018 Think you have solved questionanswering try arc the AI2 reasoning challengeCoRR abs180305457

Jacob Devlin Ming-Wei Chang Kenton Lee andKristina Toutanova 2019 BERT Pre-training ofdeep bidirectional transformers for language under-standing In Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics Human LanguageTechnologies Volume 1 (Long and Short Papers)pages 4171ndash4186 Minneapolis Minnesota Associ-ation for Computational Linguistics

Justin Gilmer Samuel S Schoenholz Patrick F RileyOriol Vinyals and George E Dahl 2017 Neuralmessage passing for quantum chemistry In Pro-ceedings of the 34th International Conference onMachine Learning ICML 2017 Sydney NSW Aus-tralia 6-11 August 2017 volume 70 of Proceedingsof Machine Learning Research pages 1263ndash1272PMLR

Tushar Khot Ashish Sabharwal and Peter Clark 2019Whatrsquos missing A knowledge gap guided approachfor multi-hop question answering In Proceedings ofthe 2019 Conference on Empirical Methods in Nat-ural Language Processing and the 9th InternationalJoint Conference on Natural Language Processing(EMNLP-IJCNLP) pages 2814ndash2828 Hong KongChina Association for Computational Linguistics

Thomas N Kipf and Max Welling 2017 Semi-supervised classification with graph convolutionalnetworks In 5th International Conference on Learn-ing Representations ICLR 2017 Toulon FranceApril 24-26 2017 Conference Track ProceedingsOpenReviewnet

Souvik Kundu Tushar Khot Ashish Sabharwal andPeter Clark 2019 Exploiting explicit paths formulti-hop reading comprehension In Proceedingsof the 57th Annual Meeting of the Association forComputational Linguistics pages 2737ndash2747 Flo-rence Italy Association for Computational Linguis-tics

John D Lafferty Andrew McCallum and FernandoC N Pereira 2001 Conditional random fieldsProbabilistic models for segmenting and labeling se-quence data In Proceedings of the Eighteenth Inter-national Conference on Machine Learning (ICML2001) Williams College Williamstown MA USAJune 28 - July 1 2001 pages 282ndash289 MorganKaufmann

Guokun Lai Qizhe Xie Hanxiao Liu Yiming Yangand Eduard Hovy 2017 RACE Large-scale ReAd-ing comprehension dataset from examinations InProceedings of the 2017 Conference on EmpiricalMethods in Natural Language Processing pages785ndash794 Copenhagen Denmark Association forComputational Linguistics

Zhen-Zhong Lan Mingda Chen Sebastian Good-man Kevin Gimpel Piyush Sharma and RaduSoricut 2019 Albert A lite bert for self-supervised learning of language representationsArXiv abs190911942

Shiyang Li Jianshu Chen and Dian Yu 2019 Teach-ing pretrained models with commonsense reason-ing A preliminary kb-based approach CoRRabs190909743

Yang Li and Peter Clark 2015 Answering elementaryscience questions by constructing coherent scenesusing background knowledge In Proceedings of

the 2015 Conference on Empirical Methods in Nat-ural Language Processing pages 2007ndash2012 Lis-bon Portugal Association for Computational Lin-guistics

Bill Yuchen Lin Xinyue Chen Jamin Chen and Xi-ang Ren 2019 KagNet Knowledge-aware graphnetworks for commonsense reasoning In Proceed-ings of the 2019 Conference on Empirical Methodsin Natural Language Processing and the 9th Inter-national Joint Conference on Natural Language Pro-cessing (EMNLP-IJCNLP) pages 2829ndash2839 HongKong China Association for Computational Lin-guistics

Liyuan Liu Haoming Jiang Pengcheng He WeizhuChen Xiaodong Liu Jianfeng Gao and Jiawei Han2019a On the variance of the adaptive learning rateand beyond CoRR abs190803265

Yinhan Liu Myle Ott Naman Goyal Jingfei Du Man-dar Joshi Danqi Chen Omer Levy Mike LewisLuke Zettlemoyer and Veselin Stoyanov 2019bRoberta A robustly optimized BERT pretraining ap-proach CoRR abs190711692

Shangwen Lv Daya Guo Jingjing Xu Duyu TangNan Duan Ming Gong Linjun Shou Daxin JiangGuihong Cao and Songlin Hu 2019 Graph-based reasoning over heterogeneous external knowl-edge for commonsense question answering CoRRabs190905311

Kaixin Ma Jonathan Francis Quanyang Lu Eric Ny-berg and Alessandro Oltramari 2019 Towardsgeneralizable neuro-symbolic systems for common-sense question answering In Proceedings of theFirst Workshop on Commonsense Inference in Natu-ral Language Processing pages 22ndash32 Hong KongChina Association for Computational Linguistics

Todor Mihaylov Peter Clark Tushar Khot and AshishSabharwal 2018 Can a suit of armor conduct elec-tricity a new dataset for open book question an-swering In Proceedings of the 2018 Conference onEmpirical Methods in Natural Language Processingpages 2381ndash2391 Brussels Belgium Associationfor Computational Linguistics

Todor Mihaylov and Anette Frank 2018 Knowledge-able reader Enhancing cloze-style reading compre-hension with external commonsense knowledge InProceedings of the 56th Annual Meeting of the As-sociation for Computational Linguistics (Volume 1Long Papers) pages 821ndash832 Melbourne AustraliaAssociation for Computational Linguistics

Giannis Nikolentzos George Dasoulas and MichalisVazirgiannis 2019 k-hop graph neural networksCoRR abs190706051

Xiaoman Pan Kai Sun Dian Yu Heng Ji and Dong Yu2019 Improving question answering with externalknowledge CoRR abs190200993

Adam Santoro David Raposo David G T BarrettMateusz Malinowski Razvan Pascanu Peter WBattaglia and Tim Lillicrap 2017 A simple neu-ral network module for relational reasoning In Ad-vances in Neural Information Processing Systems30 Annual Conference on Neural Information Pro-cessing Systems 2017 4-9 December 2017 LongBeach CA USA pages 4967ndash4976

Maarten Sap Hannah Rashkin Derek Chen RonanLe Bras and Yejin Choi 2019 Social IQa Com-monsense reasoning about social interactions InProceedings of the 2019 Conference on EmpiricalMethods in Natural Language Processing and the9th International Joint Conference on Natural Lan-guage Processing (EMNLP-IJCNLP) pages 4463ndash4473 Hong Kong China Association for Computa-tional Linguistics

Michael Sejr Schlichtkrull Thomas N Kipf PeterBloem Rianne van den Berg Ivan Titov and MaxWelling 2018 Modeling relational data with graphconvolutional networks In The Semantic Web - 15thInternational Conference ESWC 2018 HeraklionCrete Greece June 3-7 2018 Proceedings volume10843 of Lecture Notes in Computer Science pages593ndash607 Springer

Robyn Speer Joshua Chin and Catherine Havasi 2017Conceptnet 55 An open multilingual graph of gen-eral knowledge In Proceedings of the Thirty-FirstAAAI Conference on Artificial Intelligence Febru-ary 4-9 2017 San Francisco California USApages 4444ndash4451 AAAI Press

Kai Sun Dian Yu Dong Yu and Claire Cardie 2019Improving machine reading comprehension withgeneral reading strategies In Proceedings of the2019 Conference of the North American Chapter ofthe Association for Computational Linguistics Hu-man Language Technologies Volume 1 (Long andShort Papers) pages 2633ndash2643 Minneapolis Min-nesota Association for Computational Linguistics

Alon Talmor Jonathan Herzig Nicholas Lourie andJonathan Berant 2019 CommonsenseQA A ques-tion answering challenge targeting commonsenseknowledge In Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics Human LanguageTechnologies Volume 1 (Long and Short Papers)pages 4149ndash4158 Minneapolis Minnesota Associ-ation for Computational Linguistics

Ashish Vaswani Noam Shazeer Niki Parmar JakobUszkoreit Llion Jones Aidan N Gomez LukaszKaiser and Illia Polosukhin 2017 Attention is allyou need In Advances in Neural Information Pro-cessing Systems 30 Annual Conference on NeuralInformation Processing Systems 2017 4-9 Decem-ber 2017 Long Beach CA USA pages 5998ndash6008

Petar Velickovic Guillem Cucurull Arantxa CasanovaAdriana Romero Pietro Lio and Yoshua Bengio

2018 Graph attention networks In 6th Inter-national Conference on Learning RepresentationsICLR 2018 Vancouver BC Canada April 30 - May3 2018 Conference Track Proceedings OpenRe-viewnet

Shuohang Wang and Jing Jiang 2016 Learning nat-ural language inference with LSTM In Proceed-ings of the 2016 Conference of the North Ameri-can Chapter of the Association for ComputationalLinguistics Human Language Technologies pages1442ndash1451 San Diego California Association forComputational Linguistics

Xiaoyan Wang Pavan Kapanipathi Ryan MusaMo Yu Kartik Talamadupula Ibrahim AbdelazizMaria Chang Achille Fokoue Bassem MakniNicholas Mattei and Michael Witbrock 2019 Im-proving natural language inference using externalknowledge in the science questions domain In TheThirty-Third AAAI Conference on Artificial Intelli-gence AAAI 2019 The Thirty-First Innovative Ap-plications of Artificial Intelligence Conference IAAI2019 The Ninth AAAI Symposium on EducationalAdvances in Artificial Intelligence EAAI 2019 Hon-olulu Hawaii USA January 27 - February 1 2019pages 7208ndash7215 AAAI Press

Dirk Weissenborn Tomas Kocisky and Chris Dyer2017 Dynamic integration of background knowl-edge in neural nlu systems arXiv preprintarXiv170602596

An Yang Quan Wang Jing Liu Kai Liu Yajuan LyuHua Wu Qiaoqiao She and Sujian Li 2019 En-hancing pre-trained language representations withrich knowledge for machine reading comprehensionIn Proceedings of the 57th Annual Meeting of theAssociation for Computational Linguistics pages2346ndash2357 Florence Italy Association for Compu-tational Linguistics

Bishan Yang and Tom Mitchell 2017 Leveragingknowledge bases in LSTMs for improving machinereading In Proceedings of the 55th Annual Meet-ing of the Association for Computational Linguistics(Volume 1 Long Papers) pages 1436ndash1446 Van-couver Canada Association for Computational Lin-guistics

Zhi-Xiu Ye Qian Chen Wen Wang and Zhen-HuaLing 2019 Align mask and select A sim-ple method for incorporating commonsense knowl-edge into language representation models CoRRabs190806725

Yuyu Zhang Hanjun Dai Kamil Toraman andLe Song 2018 Kgˆ2 Learning to reason scienceexam questions with contextual knowledge graphembeddings CoRR abs180512393

Chen Zhu Yu Cheng Zhe Gan Siqi Sun Tom Gold-stein and Jingjing Liu 2020 Freelb Enhanced ad-versarial training for natural language understandingIn International Conference on Learning Represen-tations

A Implementation Details

CommonsenseQA OpenbookQA

BERT-BASE 3 times 10minus5 -

BERT-LARGE 2 times 10minus5 -

ROBERTA-LARGE 1 times 10minus5

1 times 10minus5

Table 7 Learning rate for text encoders on differentdatasets

CommonsenseQA OpenbookQA

RN 3 times 10minus4

3 times 10minus4

RGCN 1 times 10minus3

1 times 10minus3

GconAttn 3 times 10minus4

1 times 10minus4

KV-Memory 1 times 10minus3

1 times 10minus3

MultiGRN 1 times 10minus3

1 times 10minus3

Table 8 Learning rate for graph encoders on differentdatasets

Our models are implemented in PyTorch We usecross-entropy loss and RAdam (Liu et al 2019a)optimizer We find it beneficial to use separatelearning rates for the text encoder and the graph en-coder We tune learning rates for text encoders andgraph encoders on two datasets We first fine-tuneROBERTA-LARGE BERT-LARGE BERT-BASE

on CommonsenseQA and ROBERTA-LARGE onOpenbookQA respectively and choose a dataset-specific learning rate from 1times10

minus5 2times10

minus5 3times

10minus5 6 times 10

minus5 1 times 10

minus4 for each text encoderbased on the best performance on development setas listed in Table 7 We report the performance ofthese fine-tuned text encoders and also adopt theirdataset-specific optimal learning rates in joint train-ing with graph encoders For models that involveKG the learning rate of their graph encoders 7 arechosen from 1 times 10

minus4 3 times 10

minus4 1 times 10

minus3 3 times

10minus3 based on their best development set perfor-

mance with ROBERTA-LARGE as the text encoderWe report the optimal learning rates for graph en-coders in Table 8 In training we set the maximuminput sequence length to text encoders to 64 batchsize to 32 and perform early stopping

For the input node features we first use tem-plates to turn knowledge triples in Common-senseQA into sentences and feed them into pre-trained BERT-LARGE obtaining a sequence of

7KV-memory actually encode triples to enhance tokenrepresentations of s instead of encoding graph alone But forsimplicity we also refer to it as a graph encoder

token embeddings from the last layer of BERT-LARGE for each triple For each entity in Com-monsenseQA we perform mean pooling over thetokens of the entityrsquos occurrences across all the sen-tences to form a 1024d vector as its correspondingnode feature We use this set of features for all ourimplemented models

We use 2-layer RGCN and single-layer Multi-GRN across our experiments

B Dynamic Programming Algorithm forEq 7

To show that multi-hop message passing can becomputed in linear time we observe that Eq 7 canbe re-written in matrix form

Zk= (Dk)minus1 sum

(r1rk)isinRk

β(r1 rk s)

sdotGArk⋯Ar1FXW1r1

⊤⋯W

krk

sdotWk+10

⊤⋯W

K0

⊤ (1 le k le K) (13)

where G = diag(exp([g(φ(v1) s) g(φ(vn)s)]) (F is similarly defined) Ar is the adjacencymatrix for relation r and D

k is defined as follows

Dk= diag( sum

(r1rk)isinRk

β(r1 rk s)

sdotGArk⋯Ar1FX1) (1 le k le K) (14)

Using this matrix formulation we can computeEq 7 using dynamic programming

C Formal Definition of K-hop RN

Definition 1 (K-hop Relation Network) A multi-hop relation network is a function that maps amulti-relational graph to a fixed size vector

KHopRN(G W E H) =K

sumk=1

sum(jr1rki)isinΦk

jisinQ iisinA

β(j r1 rk i)sdotW (hjoplus(er1⋯erk)oplushi)(15)

where denotes element-wise product and β(⋯) =1(K∣A∣ sdot ∣(j prime i) isin G ∣ j prime isin Q∣) defines thepooling weights

Algorithm 1 Dynamic programming algorithm formulti-hop message passingInput sXAr(1 le r le m)W t

r (r isin R 1 le t le

k)F G δ τOutput Z

1 WKlarr I

2 for k larr K minus 1 to 1 do3 W

klarrW

k+10 W

k+1

4 end for5 for r isin R do6 Mr larr FX7 end for8 for k larr 1 to K do9 if k gt 1 then

10 for r isin R do11 M

primer larr e

δ(rs)ArsumrprimeisinR e

τ(rprimers)MrprimeW

kr

12 end for13 for r isin R do14 Mr larrM

primer

15 end for16 else17 for r isin R do18 Mr larr e

δ(rs) sdotArMrWkr

19 end for20 end if21 Z

klarr GsumrisinR MrW

k

22 end for23 Replace W

tr (0 le r le m 1 le t le k) with identity

matrices and X with 1 and re-run line 1 - line 19 tocompute d

1 d

K

24 for k larr 1 to K do25 Z

klarr (diag(dk))minus1

Zk

26 end for27 return Z

1Z

2 Z

k

D Expressing K-hop RN with MultiGRN

Theorem 1 Given any W E H there exists a pa-rameter setting such that the output of the modelbecomes KHopRN(GW E H) for arbitrary G

Proof Suppose W = [W1 W2 W3] whereW1 W3 isin Rd3timesd1 W2 isin Rd3timesd2 ForMultiRGN we set the parameters as followsH = HUlowast = [I0] isin R(d1+d2)timesd1 blowast =

[01]⊤ isin Rd1+d2 Wtr = diag(1 oplus er) isin

R(d1+d2)times(d1+d2)(r isin R 1 le t le K)V = W3 isin

Rd3timesd1 Vprime= [W1 W2] isin Rd3times(d1+d2) We dis-

able the relation type attention module and enablemessage passing only from Q to A By furtherchoosing σ as the identity function and perform-ing pooling over A we observe that the output ofMultiGRN becomes

1

∣A∣ sumiisinA

hprimei

=1

∣A∣ (V hi + Vprimezi)

=1

K∣A∣K

sumk=1

(V hi + Vprimezki )

=

K

sumk=1

sum(jr1rki)isinΦk

jisinQ iisinA

β(j r1 rk i)(V hi+

VprimeW

krk⋯W

1r1xj)

=

K

sumk=1

sum(jr1rki)isinΦk

jisinQ

β(⋯)(V hi + VprimeW

krk

⋯W1r1Uφ(j)hj + V

primeW

krk⋯W

1r1bφ(j))

=

K

sumk=1

sum(jr1rki)isinΦk

jisinQ iisinA

β(⋯)(W3hi + W1hj

+ W2(er1 ⋯ erk))

=

K

sumk=1

sum(jr1rki)isinΦk

jisinQ iisinA

β(⋯)W (hjoplus

(er1 ⋯ erk)oplus hi)= RN(G W E H)

(16)

Page 8: Scalable Multi-Hop Relational Reasoning for Knowledge ... · 1 Introduction Many recently proposed question answering tasks require not only machine comprehension of the question

62 Performance Analysis

Ablation Study on Model Components We as-sess the impact of our modelsrsquo components shownin Table 6 Disabling type-specific transformationresults in sim 13 drop in performance demonstrat-ing the need for distinguishing node type for QAtasks Our structured relational attention mecha-nism is also critical with its two sub-componentscontributing almost equally

Impact of the Amount of Training Data Weuse different fractions of training data of Common-senseQA and report results of fine-tuning text en-coders alone and jointly training text encoder andgraph encoder in Fig 5 Regardless of training datafraction our model shows consistently more per-formance improvement over knowledge-agnosticfine-tuning compared with the other graph encod-ing methods indicating MHGRNrsquos complementarystrengths to text encoders

[Figure 5: Performance change (accuracy in %) w.r.t. the amount of training data on the CommonsenseQA IHtest set (same split as Lin et al. (2019)). x-axis: percentage of training data (20–100%); y-axis: accuracy (%); curves: MultiGRN, GconAttn, w/o KG.]

Impact of the Number of Hops (K). We investigate the impact of the hyperparameter K on MHGRN through its performance on CommonsenseQA (shown in Fig. 6). Increasing K continues to bring benefits up to K = 3, but performance begins to drop when K > 3. This might be attributed to the exponential growth of noise in longer relational paths in the knowledge graph.

6.3 Model Scalability

Fig. 7 presents the computation cost of MHGRN and RGCN (measured by training time). The computation cost of both models grows linearly w.r.t. K. Although the theoretical complexity of MHGRN is m times that of RGCN, the ratio of their empirical costs only approaches 2, demonstrating that our model can be effectively parallelized.

[Figure 6: Effect of K in MHGRN. We show IHdev accuracy of MHGRN on CommonsenseQA using the data split of Lin et al. (2019) with different numbers of hops. Accuracy by reasoning hops: K=1: 73.24%, K=2: 73.93%, K=3: 74.65%, K=4: 73.65%, K=5: 74.16%.]

[Figure 7: Analysis of model scalability. Comparison of per-batch training efficiency w.r.t. the number of reasoning hops K, for RGCN and MHGRN. x-axis: K (2–8); y-axis: training time in s/batch (0.0–2.0).]

6.4 Model Interpretability

We can analyze our model's reasoning process by decoding the reasoning path using the method described in §4.5. Fig. 8 shows two examples from CommonsenseQA where our model correctly answers the question and provides reasonable path evidence. In the example on the left, the model links question entities and the answer entity in a chain to support reasoning, while the example on the right provides a case where our model leverages unmentioned entities to bridge the reasoning gap between the question entity and the answer entities, in a way that is coherent with the implied relation between CHAPEL and the desired answer in the question.
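Since §4.5 is not reproduced here, the snippet below is only a hedged sketch of what such path decoding could look like: it enumerates relational paths of length at most K from question entities to the predicted answer entity in the extracted subgraph and returns the highest-scoring one. The `rel_score` table is a stand-in for the model's learned relation attention, not the paper's exact scoring function, and all names are ours.

```python
def decode_best_path(edges, q_nodes, a_node, rel_score, K=3):
    """Hedged sketch: find the highest-scoring relational path of length <= K
    from any question entity to the answer entity.

    edges:     list of (head, relation, tail) triples from the extracted subgraph
    q_nodes:   set of question-entity ids
    a_node:    answer-entity id
    rel_score: dict relation -> score (stand-in for learned relation attention)
    """
    # adjacency list: head -> [(relation, tail), ...]
    adj = {}
    for h, r, t in edges:
        adj.setdefault(h, []).append((r, t))

    best_path, best_score = None, float("-inf")

    def dfs(node, path, score, visited):
        nonlocal best_path, best_score
        if node == a_node and path:
            if score > best_score:
                best_path, best_score = list(path), score
            return
        if len(path) == K:
            return
        for r, t in adj.get(node, []):
            if t in visited:              # avoid cycles
                continue
            path.append((node, r, t))
            dfs(t, path, score + rel_score.get(r, 0.0), visited | {t})
            path.pop()

    for q in q_nodes:
        dfs(q, [], 0.0, {q})
    return best_path, best_score
```

On a subgraph like the chapel example in Fig. 8, this kind of search would surface a chain such as CHAPEL -AtLocation-> RENO -PartOf-> NEVADA.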

[Figure 8: Case study on model interpretability. We present two sampled questions from CommonsenseQA with the reasoning paths output by MHGRN. Question 1: "Where is known for a multitude of wedding chapels?" (A. town, B. texas, C. city, D. church building, E. Nevada; correct answer: Nevada); decoded path: CHAPEL -AtLocation-> RENO -PartOf-> NEVADA. Question 2: "Why do parents encourage their kids to play baseball?" (A. round, B. cheap, C. break window, D. hard, E. fun to play; correct answer: fun to play); decoded path: BASEBALL -UsedFor⁻¹-> PLAY_BASEBALL -HasProperty-> FUN_TO_PLAY.]

6.5 Potential Compatibility with Other Methods

In theory, our approach is naturally compatible with methods that utilize textual knowledge or extra data (such as the leaderboard methods in Table 4), because in our paradigm the encodings of the textual statement and of the graph are structurally decoupled (Fig. 3). For example, we could take the fine-tuned RoBERTa+KE⁶ system as our text encoder and leave the rest of our model architecture unchanged.

7 Related Work

Knowledge-Aware Methods for NLP. Various works have investigated the potential of empowering NLP models with external knowledge. Many attempt to extract structured knowledge, either in the form of nodes (Yang and Mitchell, 2017; Wang et al., 2019), triples (Weissenborn et al., 2017; Mihaylov and Frank, 2018), paths (Bauer et al., 2018; Kundu et al., 2019; Lin et al., 2019), or subgraphs (Li and Clark, 2015), and encode them to augment textual understanding.

The recent success of pre-trained LMs motivates many works (Pan et al., 2019; Ye et al., 2019; Zhang et al., 2018; Li et al., 2019; Banerjee et al., 2019) to probe LMs' potential as latent knowledge bases. This line of work turns to textual knowledge (e.g., Wikipedia) to directly impart knowledge to pre-trained LMs. They generally fall into two paradigms: 1) fine-tuning LMs on large-scale general-domain datasets (e.g., RACE (Lai et al., 2017)) or on knowledge-rich text; 2) providing LMs with evidence via information retrieval techniques. However, these models cannot provide explicit reasoning or evidence and are thus hardly trustworthy. They are also subject to the availability of in-domain datasets and to the maximum input length of pre-trained LMs.

Neural Graph Encoding. Graph Attention Networks (GAT) (Velickovic et al., 2018) incorporate an attention mechanism in feature aggregation. RGCN (Schlichtkrull et al., 2018) proposes relational message passing, which makes it applicable to multi-relational graphs. However, these models only perform single-hop message passing and cannot be interpreted at the path level. Other works (Abu-El-Haija et al., 2019; Nikolentzos et al., 2019) aggregate, for each node, its K-hop neighbors based on node-wise distances, but they are designed for non-relational graphs. MHGRN addresses these issues by reasoning on multi-relational graphs while remaining interpretable via maintaining paths as reasoning chains.

⁶ https://github.com/jose77/csqa

8 Conclusion

We present a principled, scalable method, MHGRN, that can leverage general knowledge via multi-hop reasoning over interpretable structures (e.g., ConceptNet). The proposed MHGRN generalizes and combines the advantages of GNNs and path-based reasoning models. It explicitly performs multi-hop relational reasoning and is empirically shown to outperform existing methods with superior scalability and interpretability. Our extensive experiments systematically compare MHGRN with other existing knowledge-aware methods. In particular, we achieve state-of-the-art performance on the CommonsenseQA dataset.

References

Sami Abu-El-Haija, Bryan Perozzi, Amol Kapoor, Nazanin Alipourfard, Kristina Lerman, Hrayr Harutyunyan, Greg Ver Steeg, and Aram Galstyan. 2019. MixHop: Higher-order graph convolutional architectures via sparsified neighborhood mixing. In Proceedings of ICML 2019, pages 21–29.

Pratyay Banerjee, Kuntal Kumar Pal, Arindam Mitra, and Chitta Baral. 2019. Careful selection of knowledge to solve open book question answering. In Proceedings of ACL 2019, pages 6120–6129.

Lisa Bauer, Yicheng Wang, and Mohit Bansal. 2018. Commonsense for generative multi-hop question answering tasks. In Proceedings of EMNLP 2018, pages 4220–4230.

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. CoRR, abs/1803.05457.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT 2019, pages 4171–4186.

Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. 2017. Neural message passing for quantum chemistry. In Proceedings of ICML 2017, pages 1263–1272.

Tushar Khot, Ashish Sabharwal, and Peter Clark. 2019. What's missing: A knowledge gap guided approach for multi-hop question answering. In Proceedings of EMNLP-IJCNLP 2019, pages 2814–2828.

Thomas N. Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In Proceedings of ICLR 2017.

Souvik Kundu, Tushar Khot, Ashish Sabharwal, and Peter Clark. 2019. Exploiting explicit paths for multi-hop reading comprehension. In Proceedings of ACL 2019, pages 2737–2747.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of ICML 2001, pages 282–289.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale ReAding Comprehension dataset from Examinations. In Proceedings of EMNLP 2017, pages 785–794.

Zhen-Zhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A lite BERT for self-supervised learning of language representations. ArXiv, abs/1909.11942.

Shiyang Li, Jianshu Chen, and Dian Yu. 2019. Teaching pretrained models with commonsense reasoning: A preliminary KB-based approach. CoRR, abs/1909.09743.

Yang Li and Peter Clark. 2015. Answering elementary science questions by constructing coherent scenes using background knowledge. In Proceedings of EMNLP 2015, pages 2007–2012.

Bill Yuchen Lin, Xinyue Chen, Jamin Chen, and Xiang Ren. 2019. KagNet: Knowledge-aware graph networks for commonsense reasoning. In Proceedings of EMNLP-IJCNLP 2019, pages 2829–2839.

Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. 2019a. On the variance of the adaptive learning rate and beyond. CoRR, abs/1908.03265.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.

Shangwen Lv, Daya Guo, Jingjing Xu, Duyu Tang, Nan Duan, Ming Gong, Linjun Shou, Daxin Jiang, Guihong Cao, and Songlin Hu. 2019. Graph-based reasoning over heterogeneous external knowledge for commonsense question answering. CoRR, abs/1909.05311.

Kaixin Ma, Jonathan Francis, Quanyang Lu, Eric Nyberg, and Alessandro Oltramari. 2019. Towards generalizable neuro-symbolic systems for commonsense question answering. In Proceedings of the First Workshop on Commonsense Inference in Natural Language Processing, pages 22–32.

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of EMNLP 2018, pages 2381–2391.

Todor Mihaylov and Anette Frank. 2018. Knowledgeable reader: Enhancing cloze-style reading comprehension with external commonsense knowledge. In Proceedings of ACL 2018, pages 821–832.

Giannis Nikolentzos, George Dasoulas, and Michalis Vazirgiannis. 2019. k-hop graph neural networks. CoRR, abs/1907.06051.

Xiaoman Pan, Kai Sun, Dian Yu, Heng Ji, and Dong Yu. 2019. Improving question answering with external knowledge. CoRR, abs/1902.00993.

Adam Santoro, David Raposo, David G. T. Barrett, Mateusz Malinowski, Razvan Pascanu, Peter W. Battaglia, and Tim Lillicrap. 2017. A simple neural network module for relational reasoning. In Advances in Neural Information Processing Systems 30, pages 4967–4976.

Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. 2019. Social IQa: Commonsense reasoning about social interactions. In Proceedings of EMNLP-IJCNLP 2019, pages 4463–4473.

Michael Sejr Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. 2018. Modeling relational data with graph convolutional networks. In The Semantic Web – ESWC 2018, pages 593–607.

Robyn Speer, Joshua Chin, and Catherine Havasi. 2017. ConceptNet 5.5: An open multilingual graph of general knowledge. In Proceedings of AAAI 2017, pages 4444–4451.

Kai Sun, Dian Yu, Dong Yu, and Claire Cardie. 2019. Improving machine reading comprehension with general reading strategies. In Proceedings of NAACL-HLT 2019, pages 2633–2643.

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of NAACL-HLT 2019, pages 4149–4158.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008.

Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2018. Graph attention networks. In Proceedings of ICLR 2018.

Shuohang Wang and Jing Jiang. 2016. Learning natural language inference with LSTM. In Proceedings of NAACL-HLT 2016, pages 1442–1451.

Xiaoyan Wang, Pavan Kapanipathi, Ryan Musa, Mo Yu, Kartik Talamadupula, Ibrahim Abdelaziz, Maria Chang, Achille Fokoue, Bassem Makni, Nicholas Mattei, and Michael Witbrock. 2019. Improving natural language inference using external knowledge in the science questions domain. In Proceedings of AAAI 2019, pages 7208–7215.

Dirk Weissenborn, Tomas Kocisky, and Chris Dyer. 2017. Dynamic integration of background knowledge in neural NLU systems. arXiv preprint arXiv:1706.02596.

An Yang, Quan Wang, Jing Liu, Kai Liu, Yajuan Lyu, Hua Wu, Qiaoqiao She, and Sujian Li. 2019. Enhancing pre-trained language representations with rich knowledge for machine reading comprehension. In Proceedings of ACL 2019, pages 2346–2357.

Bishan Yang and Tom Mitchell. 2017. Leveraging knowledge bases in LSTMs for improving machine reading. In Proceedings of ACL 2017, pages 1436–1446.

Zhi-Xiu Ye, Qian Chen, Wen Wang, and Zhen-Hua Ling. 2019. Align, mask and select: A simple method for incorporating commonsense knowledge into language representation models. CoRR, abs/1908.06725.

Yuyu Zhang, Hanjun Dai, Kamil Toraman, and Le Song. 2018. KG^2: Learning to reason science exam questions with contextual knowledge graph embeddings. CoRR, abs/1805.12393.

Chen Zhu, Yu Cheng, Zhe Gan, Siqi Sun, Tom Goldstein, and Jingjing Liu. 2020. FreeLB: Enhanced adversarial training for natural language understanding. In Proceedings of ICLR 2020.

A Implementation Details

                 CommonsenseQA    OpenbookQA
BERT-BASE        3 × 10⁻⁵         -
BERT-LARGE       2 × 10⁻⁵         -
ROBERTA-LARGE    1 × 10⁻⁵         1 × 10⁻⁵

Table 7: Learning rates for text encoders on different datasets.

                 CommonsenseQA    OpenbookQA
RN               3 × 10⁻⁴         3 × 10⁻⁴
RGCN             1 × 10⁻³         1 × 10⁻³
GconAttn         3 × 10⁻⁴         1 × 10⁻⁴
KV-Memory        1 × 10⁻³         1 × 10⁻³
MultiGRN         1 × 10⁻³         1 × 10⁻³

Table 8: Learning rates for graph encoders on different datasets.

Our models are implemented in PyTorch. We use a cross-entropy loss and the RAdam (Liu et al., 2019a) optimizer. We find it beneficial to use separate learning rates for the text encoder and the graph encoder, and we tune both on the two datasets. We first fine-tune ROBERTA-LARGE, BERT-LARGE, and BERT-BASE on CommonsenseQA and ROBERTA-LARGE on OpenbookQA, respectively, and choose a dataset-specific learning rate from {1 × 10⁻⁵, 2 × 10⁻⁵, 3 × 10⁻⁵, 6 × 10⁻⁵, 1 × 10⁻⁴} for each text encoder based on the best performance on the development set, as listed in Table 7. We report the performance of these fine-tuned text encoders and also adopt their dataset-specific optimal learning rates in joint training with graph encoders. For models that involve a KG, the learning rates of their graph encoders⁷ are chosen from {1 × 10⁻⁴, 3 × 10⁻⁴, 1 × 10⁻³, 3 × 10⁻³} based on the best development set performance with ROBERTA-LARGE as the text encoder. We report the optimal learning rates for graph encoders in Table 8. In training, we set the maximum input sequence length for text encoders to 64 and the batch size to 32, and we perform early stopping.
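As an illustration of the separate-learning-rate setup described above, here is a minimal sketch using two parameter groups; it assumes torch.optim.RAdam (available in PyTorch ≥ 1.10) rather than the authors' original RAdam code, and the `text_encoder` / `graph_encoder` attribute names are placeholders, not the released code's API.

```python
import torch

def build_optimizer(model, text_lr=1e-5, graph_lr=1e-3):
    """Sketch: one RAdam optimizer with dataset-specific learning rates for the
    text encoder and the graph encoder (cf. Tables 7 and 8). The attribute names
    below are assumptions made for this illustration."""
    return torch.optim.RAdam([
        {"params": model.text_encoder.parameters(), "lr": text_lr},
        {"params": model.graph_encoder.parameters(), "lr": graph_lr},
    ])
```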

For the input node features, we first use templates to turn knowledge triples in ConceptNet into sentences and feed them into pre-trained BERT-LARGE, obtaining a sequence of token embeddings from the last layer of BERT-LARGE for each triple. For each entity in CommonsenseQA, we perform mean pooling over the tokens of the entity's occurrences across all the sentences to form a 1024-dimensional vector as its corresponding node feature. We use this set of features for all our implemented models.

⁷ KV-Memory actually encodes triples to enhance the token representations of s, instead of encoding the graph alone. But for simplicity, we also refer to it as a graph encoder.
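The node-feature construction can be sketched with the HuggingFace transformers API as below; the triple-to-sentence templating and the entity-span bookkeeping are simplified placeholders rather than the authors' exact pipeline.

```python
import torch
from transformers import BertTokenizerFast, BertModel

tokenizer = BertTokenizerFast.from_pretrained("bert-large-uncased")
bert = BertModel.from_pretrained("bert-large-uncased").eval()

def entity_features(sentences, entity_occurrences):
    """Sketch of the 1024-d node features: mean-pool BERT-LARGE token embeddings
    over every occurrence of an entity across the templated triple sentences.

    sentences:          list of strings (templated knowledge triples)
    entity_occurrences: dict entity -> list of (sentence_idx, char_start, char_end)
    """
    with torch.no_grad():
        enc = tokenizer(sentences, return_tensors="pt", padding=True,
                        truncation=True, return_offsets_mapping=True)
        offsets = enc.pop("offset_mapping")          # (num_sent, seq_len, 2)
        hidden = bert(**enc).last_hidden_state       # (num_sent, seq_len, 1024)

    features = {}
    for entity, spans in entity_occurrences.items():
        vecs = []
        for sent_idx, start, end in spans:
            for tok_idx, (s, e) in enumerate(offsets[sent_idx].tolist()):
                if s < end and e > start and e > s:  # token overlaps the entity span
                    vecs.append(hidden[sent_idx, tok_idx])
        if vecs:
            features[entity] = torch.stack(vecs).mean(dim=0)  # 1024-d node feature
    return features
```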

We use a 2-layer RGCN and a single-layer MultiGRN across our experiments.

B Dynamic Programming Algorithm forEq 7

To show that multi-hop message passing can becomputed in linear time we observe that Eq 7 canbe re-written in matrix form

Zk= (Dk)minus1 sum

(r1rk)isinRk

β(r1 rk s)

sdotGArk⋯Ar1FXW1r1

⊤⋯W

krk

sdotWk+10

⊤⋯W

K0

⊤ (1 le k le K) (13)

where G = diag(exp([g(φ(v1) s) g(φ(vn)s)]) (F is similarly defined) Ar is the adjacencymatrix for relation r and D

k is defined as follows

Dk= diag( sum

(r1rk)isinRk

β(r1 rk s)

sdotGArk⋯Ar1FX1) (1 le k le K) (14)

Using this matrix formulation we can computeEq 7 using dynamic programming

C Formal Definition of K-hop RN

Definition 1 (K-hop Relation Network) A multi-hop relation network is a function that maps amulti-relational graph to a fixed size vector

KHopRN(G W E H) =K

sumk=1

sum(jr1rki)isinΦk

jisinQ iisinA

β(j r1 rk i)sdotW (hjoplus(er1⋯erk)oplushi)(15)

where denotes element-wise product and β(⋯) =1(K∣A∣ sdot ∣(j prime i) isin G ∣ j prime isin Q∣) defines thepooling weights

Algorithm 1 Dynamic programming algorithm formulti-hop message passingInput sXAr(1 le r le m)W t

r (r isin R 1 le t le

k)F G δ τOutput Z

1 WKlarr I

2 for k larr K minus 1 to 1 do3 W

klarrW

k+10 W

k+1

4 end for5 for r isin R do6 Mr larr FX7 end for8 for k larr 1 to K do9 if k gt 1 then

10 for r isin R do11 M

primer larr e

δ(rs)ArsumrprimeisinR e

τ(rprimers)MrprimeW

kr

12 end for13 for r isin R do14 Mr larrM

primer

15 end for16 else17 for r isin R do18 Mr larr e

δ(rs) sdotArMrWkr

19 end for20 end if21 Z

klarr GsumrisinR MrW

k

22 end for23 Replace W

tr (0 le r le m 1 le t le k) with identity

matrices and X with 1 and re-run line 1 - line 19 tocompute d

1 d

K

24 for k larr 1 to K do25 Z

klarr (diag(dk))minus1

Zk

26 end for27 return Z

1Z

2 Z

k

D Expressing K-hop RN with MultiGRN

Theorem 1 Given any W E H there exists a pa-rameter setting such that the output of the modelbecomes KHopRN(GW E H) for arbitrary G

Proof Suppose W = [W1 W2 W3] whereW1 W3 isin Rd3timesd1 W2 isin Rd3timesd2 ForMultiRGN we set the parameters as followsH = HUlowast = [I0] isin R(d1+d2)timesd1 blowast =

[01]⊤ isin Rd1+d2 Wtr = diag(1 oplus er) isin

R(d1+d2)times(d1+d2)(r isin R 1 le t le K)V = W3 isin

Rd3timesd1 Vprime= [W1 W2] isin Rd3times(d1+d2) We dis-

able the relation type attention module and enablemessage passing only from Q to A By furtherchoosing σ as the identity function and perform-ing pooling over A we observe that the output ofMultiGRN becomes

1

∣A∣ sumiisinA

hprimei

=1

∣A∣ (V hi + Vprimezi)

=1

K∣A∣K

sumk=1

(V hi + Vprimezki )

=

K

sumk=1

sum(jr1rki)isinΦk

jisinQ iisinA

β(j r1 rk i)(V hi+

VprimeW

krk⋯W

1r1xj)

=

K

sumk=1

sum(jr1rki)isinΦk

jisinQ

β(⋯)(V hi + VprimeW

krk

⋯W1r1Uφ(j)hj + V

primeW

krk⋯W

1r1bφ(j))

=

K

sumk=1

sum(jr1rki)isinΦk

jisinQ iisinA

β(⋯)(W3hi + W1hj

+ W2(er1 ⋯ erk))

=

K

sumk=1

sum(jr1rki)isinΦk

jisinQ iisinA

β(⋯)W (hjoplus

(er1 ⋯ erk)oplus hi)= RN(G W E H)

(16)

Page 9: Scalable Multi-Hop Relational Reasoning for Knowledge ... · 1 Introduction Many recently proposed question answering tasks require not only machine comprehension of the question

or extra data (such as leaderboard methods inTable 4) because in our paradigm the encodingof textual statement and graph are structurally-decoupled (Fig 3) We can take for example thefine-tuned RoBERTa+KE6 system as our text en-coder and leave the rest of our model architectureunchanged

7 Related Work

Knowledge-Aware Methods for NLP Variouswork have investigated the potential to empowerNLP models with external knowledge Many at-tempt to extract structured knowledge either in theform of nodes (Yang and Mitchell 2017 Wanget al 2019) triples (Weissenborn et al 2017Mihaylov and Frank 2018) paths (Bauer et al2018 Kundu et al 2019 Lin et al 2019) or sub-graphs (Li and Clark 2015) and encode them toaugment textual understanding

Recent success of pre-trained LMs motivatesmany (Pan et al 2019 Ye et al 2019 Zhang et al2018 Li et al 2019 Banerjee et al 2019) to probeLMsrsquo potential as latent knowledge bases This lineof work turn to textual knowledge (eg Wikipedia)to directly impart knowledge to pre-trained LMsThey generally fall into two paradigms 1) Fine-tuning LMs on large-scale general-domain datasets(eg RACE (Lai et al 2017)) or on knowledge-richtext 2) Providing LMs with evidence via informa-tion retrieval techniques However these modelscannot provide explicit reasoning and evidencethus hardly trustworthy They are also subject tothe availability of in-domain datasets and maxi-mum input token of pre-trained LMs

Neural Graph Encoding Graph Attention Net-works (GAT) (Velickovic et al 2018) incorpo-rates attention mechanism in feature aggregationRGCN (Schlichtkrull et al 2018) proposes rela-tional message passing which makes it applicableto multi-relational graphs However they only per-form single-hop message passing and cannot beinterpreted at path level Other work (Abu-El-Haijaet al 2019 Nikolentzos et al 2019) aggregate fora node its K-hop neighbors based on node-wisedistances but they are designed for non-relationalgraphs MHGRN addresses these issues by rea-soning on multi-relational graphs and being inter-pretable via maintaining paths as reasoning chains

6httpsgithubcomjose77csqa

8 Conclusion

We present a principled scalable method MHGRNthat can leverage general knowledge by multi-hopreasoning over interpretable structures (eg Con-ceptNet) The proposed MHGRN generalizes andcombines the advantages of GNNs and path-basedreasoning models It explicitly performs multi-hoprelational reasoning and is empirically shown tooutperform existing methods with superior scal-ablility and interpretability Our extensive experi-ments systematically compare MHGRN and otherexisting methods on knowledge-aware methodsParticularly we achieve the state-of-the-art perfor-mance on the CommonsenseQA dataset

ReferencesSami Abu-El-Haija Bryan Perozzi Amol Kapoor

Nazanin Alipourfard Kristina Lerman Hrayr Haru-tyunyan Greg Ver Steeg and Aram Galstyan 2019Mixhop Higher-order graph convolutional architec-tures via sparsified neighborhood mixing In Pro-ceedings of the 36th International Conference onMachine Learning ICML 2019 9-15 June 2019Long Beach California USA volume 97 of Pro-ceedings of Machine Learning Research pages 21ndash29 PMLR

Pratyay Banerjee Kuntal Kumar Pal Arindam Mi-tra and Chitta Baral 2019 Careful selection ofknowledge to solve open book question answeringIn Proceedings of the 57th Annual Meeting of theAssociation for Computational Linguistics pages6120ndash6129 Florence Italy Association for Compu-tational Linguistics

Lisa Bauer Yicheng Wang and Mohit Bansal 2018Commonsense for generative multi-hop question an-swering tasks In Proceedings of the 2018 Confer-ence on Empirical Methods in Natural LanguageProcessing Brussels Belgium October 31 - Novem-ber 4 2018 pages 4220ndash4230 Association for Com-putational Linguistics

Peter Clark Isaac Cowhey Oren Etzioni Tushar KhotAshish Sabharwal Carissa Schoenick and OyvindTafjord 2018 Think you have solved questionanswering try arc the AI2 reasoning challengeCoRR abs180305457

Jacob Devlin Ming-Wei Chang Kenton Lee andKristina Toutanova 2019 BERT Pre-training ofdeep bidirectional transformers for language under-standing In Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics Human LanguageTechnologies Volume 1 (Long and Short Papers)pages 4171ndash4186 Minneapolis Minnesota Associ-ation for Computational Linguistics

Justin Gilmer Samuel S Schoenholz Patrick F RileyOriol Vinyals and George E Dahl 2017 Neuralmessage passing for quantum chemistry In Pro-ceedings of the 34th International Conference onMachine Learning ICML 2017 Sydney NSW Aus-tralia 6-11 August 2017 volume 70 of Proceedingsof Machine Learning Research pages 1263ndash1272PMLR

Tushar Khot Ashish Sabharwal and Peter Clark 2019Whatrsquos missing A knowledge gap guided approachfor multi-hop question answering In Proceedings ofthe 2019 Conference on Empirical Methods in Nat-ural Language Processing and the 9th InternationalJoint Conference on Natural Language Processing(EMNLP-IJCNLP) pages 2814ndash2828 Hong KongChina Association for Computational Linguistics

Thomas N Kipf and Max Welling 2017 Semi-supervised classification with graph convolutionalnetworks In 5th International Conference on Learn-ing Representations ICLR 2017 Toulon FranceApril 24-26 2017 Conference Track ProceedingsOpenReviewnet

Souvik Kundu Tushar Khot Ashish Sabharwal andPeter Clark 2019 Exploiting explicit paths formulti-hop reading comprehension In Proceedingsof the 57th Annual Meeting of the Association forComputational Linguistics pages 2737ndash2747 Flo-rence Italy Association for Computational Linguis-tics

John D Lafferty Andrew McCallum and FernandoC N Pereira 2001 Conditional random fieldsProbabilistic models for segmenting and labeling se-quence data In Proceedings of the Eighteenth Inter-national Conference on Machine Learning (ICML2001) Williams College Williamstown MA USAJune 28 - July 1 2001 pages 282ndash289 MorganKaufmann

Guokun Lai Qizhe Xie Hanxiao Liu Yiming Yangand Eduard Hovy 2017 RACE Large-scale ReAd-ing comprehension dataset from examinations InProceedings of the 2017 Conference on EmpiricalMethods in Natural Language Processing pages785ndash794 Copenhagen Denmark Association forComputational Linguistics

Zhen-Zhong Lan Mingda Chen Sebastian Good-man Kevin Gimpel Piyush Sharma and RaduSoricut 2019 Albert A lite bert for self-supervised learning of language representationsArXiv abs190911942

Shiyang Li Jianshu Chen and Dian Yu 2019 Teach-ing pretrained models with commonsense reason-ing A preliminary kb-based approach CoRRabs190909743

Yang Li and Peter Clark 2015 Answering elementaryscience questions by constructing coherent scenesusing background knowledge In Proceedings of

the 2015 Conference on Empirical Methods in Nat-ural Language Processing pages 2007ndash2012 Lis-bon Portugal Association for Computational Lin-guistics

Bill Yuchen Lin Xinyue Chen Jamin Chen and Xi-ang Ren 2019 KagNet Knowledge-aware graphnetworks for commonsense reasoning In Proceed-ings of the 2019 Conference on Empirical Methodsin Natural Language Processing and the 9th Inter-national Joint Conference on Natural Language Pro-cessing (EMNLP-IJCNLP) pages 2829ndash2839 HongKong China Association for Computational Lin-guistics

Liyuan Liu Haoming Jiang Pengcheng He WeizhuChen Xiaodong Liu Jianfeng Gao and Jiawei Han2019a On the variance of the adaptive learning rateand beyond CoRR abs190803265

Yinhan Liu Myle Ott Naman Goyal Jingfei Du Man-dar Joshi Danqi Chen Omer Levy Mike LewisLuke Zettlemoyer and Veselin Stoyanov 2019bRoberta A robustly optimized BERT pretraining ap-proach CoRR abs190711692

Shangwen Lv Daya Guo Jingjing Xu Duyu TangNan Duan Ming Gong Linjun Shou Daxin JiangGuihong Cao and Songlin Hu 2019 Graph-based reasoning over heterogeneous external knowl-edge for commonsense question answering CoRRabs190905311

Kaixin Ma Jonathan Francis Quanyang Lu Eric Ny-berg and Alessandro Oltramari 2019 Towardsgeneralizable neuro-symbolic systems for common-sense question answering In Proceedings of theFirst Workshop on Commonsense Inference in Natu-ral Language Processing pages 22ndash32 Hong KongChina Association for Computational Linguistics

Todor Mihaylov Peter Clark Tushar Khot and AshishSabharwal 2018 Can a suit of armor conduct elec-tricity a new dataset for open book question an-swering In Proceedings of the 2018 Conference onEmpirical Methods in Natural Language Processingpages 2381ndash2391 Brussels Belgium Associationfor Computational Linguistics

Todor Mihaylov and Anette Frank 2018 Knowledge-able reader Enhancing cloze-style reading compre-hension with external commonsense knowledge InProceedings of the 56th Annual Meeting of the As-sociation for Computational Linguistics (Volume 1Long Papers) pages 821ndash832 Melbourne AustraliaAssociation for Computational Linguistics

Giannis Nikolentzos George Dasoulas and MichalisVazirgiannis 2019 k-hop graph neural networksCoRR abs190706051

Xiaoman Pan Kai Sun Dian Yu Heng Ji and Dong Yu2019 Improving question answering with externalknowledge CoRR abs190200993

Adam Santoro David Raposo David G T BarrettMateusz Malinowski Razvan Pascanu Peter WBattaglia and Tim Lillicrap 2017 A simple neu-ral network module for relational reasoning In Ad-vances in Neural Information Processing Systems30 Annual Conference on Neural Information Pro-cessing Systems 2017 4-9 December 2017 LongBeach CA USA pages 4967ndash4976

Maarten Sap Hannah Rashkin Derek Chen RonanLe Bras and Yejin Choi 2019 Social IQa Com-monsense reasoning about social interactions InProceedings of the 2019 Conference on EmpiricalMethods in Natural Language Processing and the9th International Joint Conference on Natural Lan-guage Processing (EMNLP-IJCNLP) pages 4463ndash4473 Hong Kong China Association for Computa-tional Linguistics

Michael Sejr Schlichtkrull Thomas N Kipf PeterBloem Rianne van den Berg Ivan Titov and MaxWelling 2018 Modeling relational data with graphconvolutional networks In The Semantic Web - 15thInternational Conference ESWC 2018 HeraklionCrete Greece June 3-7 2018 Proceedings volume10843 of Lecture Notes in Computer Science pages593ndash607 Springer

Robyn Speer Joshua Chin and Catherine Havasi 2017Conceptnet 55 An open multilingual graph of gen-eral knowledge In Proceedings of the Thirty-FirstAAAI Conference on Artificial Intelligence Febru-ary 4-9 2017 San Francisco California USApages 4444ndash4451 AAAI Press

Kai Sun Dian Yu Dong Yu and Claire Cardie 2019Improving machine reading comprehension withgeneral reading strategies In Proceedings of the2019 Conference of the North American Chapter ofthe Association for Computational Linguistics Hu-man Language Technologies Volume 1 (Long andShort Papers) pages 2633ndash2643 Minneapolis Min-nesota Association for Computational Linguistics

Alon Talmor Jonathan Herzig Nicholas Lourie andJonathan Berant 2019 CommonsenseQA A ques-tion answering challenge targeting commonsenseknowledge In Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics Human LanguageTechnologies Volume 1 (Long and Short Papers)pages 4149ndash4158 Minneapolis Minnesota Associ-ation for Computational Linguistics

Ashish Vaswani Noam Shazeer Niki Parmar JakobUszkoreit Llion Jones Aidan N Gomez LukaszKaiser and Illia Polosukhin 2017 Attention is allyou need In Advances in Neural Information Pro-cessing Systems 30 Annual Conference on NeuralInformation Processing Systems 2017 4-9 Decem-ber 2017 Long Beach CA USA pages 5998ndash6008

Petar Velickovic Guillem Cucurull Arantxa CasanovaAdriana Romero Pietro Lio and Yoshua Bengio

2018 Graph attention networks In 6th Inter-national Conference on Learning RepresentationsICLR 2018 Vancouver BC Canada April 30 - May3 2018 Conference Track Proceedings OpenRe-viewnet

Shuohang Wang and Jing Jiang 2016 Learning nat-ural language inference with LSTM In Proceed-ings of the 2016 Conference of the North Ameri-can Chapter of the Association for ComputationalLinguistics Human Language Technologies pages1442ndash1451 San Diego California Association forComputational Linguistics

Xiaoyan Wang Pavan Kapanipathi Ryan MusaMo Yu Kartik Talamadupula Ibrahim AbdelazizMaria Chang Achille Fokoue Bassem MakniNicholas Mattei and Michael Witbrock 2019 Im-proving natural language inference using externalknowledge in the science questions domain In TheThirty-Third AAAI Conference on Artificial Intelli-gence AAAI 2019 The Thirty-First Innovative Ap-plications of Artificial Intelligence Conference IAAI2019 The Ninth AAAI Symposium on EducationalAdvances in Artificial Intelligence EAAI 2019 Hon-olulu Hawaii USA January 27 - February 1 2019pages 7208ndash7215 AAAI Press

Dirk Weissenborn Tomas Kocisky and Chris Dyer2017 Dynamic integration of background knowl-edge in neural nlu systems arXiv preprintarXiv170602596

An Yang Quan Wang Jing Liu Kai Liu Yajuan LyuHua Wu Qiaoqiao She and Sujian Li 2019 En-hancing pre-trained language representations withrich knowledge for machine reading comprehensionIn Proceedings of the 57th Annual Meeting of theAssociation for Computational Linguistics pages2346ndash2357 Florence Italy Association for Compu-tational Linguistics

Bishan Yang and Tom Mitchell 2017 Leveragingknowledge bases in LSTMs for improving machinereading In Proceedings of the 55th Annual Meet-ing of the Association for Computational Linguistics(Volume 1 Long Papers) pages 1436ndash1446 Van-couver Canada Association for Computational Lin-guistics

Zhi-Xiu Ye Qian Chen Wen Wang and Zhen-HuaLing 2019 Align mask and select A sim-ple method for incorporating commonsense knowl-edge into language representation models CoRRabs190806725

Yuyu Zhang Hanjun Dai Kamil Toraman andLe Song 2018 Kgˆ2 Learning to reason scienceexam questions with contextual knowledge graphembeddings CoRR abs180512393

Chen Zhu Yu Cheng Zhe Gan Siqi Sun Tom Gold-stein and Jingjing Liu 2020 Freelb Enhanced ad-versarial training for natural language understandingIn International Conference on Learning Represen-tations

A Implementation Details

CommonsenseQA OpenbookQA

BERT-BASE 3 times 10minus5 -

BERT-LARGE 2 times 10minus5 -

ROBERTA-LARGE 1 times 10minus5

1 times 10minus5

Table 7 Learning rate for text encoders on differentdatasets

CommonsenseQA OpenbookQA

RN 3 times 10minus4

3 times 10minus4

RGCN 1 times 10minus3

1 times 10minus3

GconAttn 3 times 10minus4

1 times 10minus4

KV-Memory 1 times 10minus3

1 times 10minus3

MultiGRN 1 times 10minus3

1 times 10minus3

Table 8 Learning rate for graph encoders on differentdatasets

Our models are implemented in PyTorch We usecross-entropy loss and RAdam (Liu et al 2019a)optimizer We find it beneficial to use separatelearning rates for the text encoder and the graph en-coder We tune learning rates for text encoders andgraph encoders on two datasets We first fine-tuneROBERTA-LARGE BERT-LARGE BERT-BASE

on CommonsenseQA and ROBERTA-LARGE onOpenbookQA respectively and choose a dataset-specific learning rate from 1times10

minus5 2times10

minus5 3times

10minus5 6 times 10

minus5 1 times 10

minus4 for each text encoderbased on the best performance on development setas listed in Table 7 We report the performance ofthese fine-tuned text encoders and also adopt theirdataset-specific optimal learning rates in joint train-ing with graph encoders For models that involveKG the learning rate of their graph encoders 7 arechosen from 1 times 10

minus4 3 times 10

minus4 1 times 10

minus3 3 times

10minus3 based on their best development set perfor-

mance with ROBERTA-LARGE as the text encoderWe report the optimal learning rates for graph en-coders in Table 8 In training we set the maximuminput sequence length to text encoders to 64 batchsize to 32 and perform early stopping

For the input node features we first use tem-plates to turn knowledge triples in Common-senseQA into sentences and feed them into pre-trained BERT-LARGE obtaining a sequence of

7KV-memory actually encode triples to enhance tokenrepresentations of s instead of encoding graph alone But forsimplicity we also refer to it as a graph encoder

token embeddings from the last layer of BERT-LARGE for each triple For each entity in Com-monsenseQA we perform mean pooling over thetokens of the entityrsquos occurrences across all the sen-tences to form a 1024d vector as its correspondingnode feature We use this set of features for all ourimplemented models

We use 2-layer RGCN and single-layer Multi-GRN across our experiments

B Dynamic Programming Algorithm forEq 7

To show that multi-hop message passing can becomputed in linear time we observe that Eq 7 canbe re-written in matrix form

Zk= (Dk)minus1 sum

(r1rk)isinRk

β(r1 rk s)

sdotGArk⋯Ar1FXW1r1

⊤⋯W

krk

sdotWk+10

⊤⋯W

K0

⊤ (1 le k le K) (13)

where G = diag(exp([g(φ(v1) s) g(φ(vn)s)]) (F is similarly defined) Ar is the adjacencymatrix for relation r and D

k is defined as follows

Dk= diag( sum

(r1rk)isinRk

β(r1 rk s)

sdotGArk⋯Ar1FX1) (1 le k le K) (14)

Using this matrix formulation we can computeEq 7 using dynamic programming

C Formal Definition of K-hop RN

Definition 1 (K-hop Relation Network) A multi-hop relation network is a function that maps amulti-relational graph to a fixed size vector

KHopRN(G W E H) =K

sumk=1

sum(jr1rki)isinΦk

jisinQ iisinA

β(j r1 rk i)sdotW (hjoplus(er1⋯erk)oplushi)(15)

where denotes element-wise product and β(⋯) =1(K∣A∣ sdot ∣(j prime i) isin G ∣ j prime isin Q∣) defines thepooling weights

Algorithm 1 Dynamic programming algorithm formulti-hop message passingInput sXAr(1 le r le m)W t

r (r isin R 1 le t le

k)F G δ τOutput Z

1 WKlarr I

2 for k larr K minus 1 to 1 do3 W

klarrW

k+10 W

k+1

4 end for5 for r isin R do6 Mr larr FX7 end for8 for k larr 1 to K do9 if k gt 1 then

10 for r isin R do11 M

primer larr e

δ(rs)ArsumrprimeisinR e

τ(rprimers)MrprimeW

kr

12 end for13 for r isin R do14 Mr larrM

primer

15 end for16 else17 for r isin R do18 Mr larr e

δ(rs) sdotArMrWkr

19 end for20 end if21 Z

klarr GsumrisinR MrW

k

22 end for23 Replace W

tr (0 le r le m 1 le t le k) with identity

matrices and X with 1 and re-run line 1 - line 19 tocompute d

1 d

K

24 for k larr 1 to K do25 Z

klarr (diag(dk))minus1

Zk

26 end for27 return Z

1Z

2 Z

k

D Expressing K-hop RN with MultiGRN

Theorem 1 Given any W E H there exists a pa-rameter setting such that the output of the modelbecomes KHopRN(GW E H) for arbitrary G

Proof Suppose W = [W1 W2 W3] whereW1 W3 isin Rd3timesd1 W2 isin Rd3timesd2 ForMultiRGN we set the parameters as followsH = HUlowast = [I0] isin R(d1+d2)timesd1 blowast =

[01]⊤ isin Rd1+d2 Wtr = diag(1 oplus er) isin

R(d1+d2)times(d1+d2)(r isin R 1 le t le K)V = W3 isin

Rd3timesd1 Vprime= [W1 W2] isin Rd3times(d1+d2) We dis-

able the relation type attention module and enablemessage passing only from Q to A By furtherchoosing σ as the identity function and perform-ing pooling over A we observe that the output ofMultiGRN becomes

1

∣A∣ sumiisinA

hprimei

=1

∣A∣ (V hi + Vprimezi)

=1

K∣A∣K

sumk=1

(V hi + Vprimezki )

=

K

sumk=1

sum(jr1rki)isinΦk

jisinQ iisinA

β(j r1 rk i)(V hi+

VprimeW

krk⋯W

1r1xj)

=

K

sumk=1

sum(jr1rki)isinΦk

jisinQ

β(⋯)(V hi + VprimeW

krk

⋯W1r1Uφ(j)hj + V

primeW

krk⋯W

1r1bφ(j))

=

K

sumk=1

sum(jr1rki)isinΦk

jisinQ iisinA

β(⋯)(W3hi + W1hj

+ W2(er1 ⋯ erk))

=

K

sumk=1

sum(jr1rki)isinΦk

jisinQ iisinA

β(⋯)W (hjoplus

(er1 ⋯ erk)oplus hi)= RN(G W E H)

(16)

Page 10: Scalable Multi-Hop Relational Reasoning for Knowledge ... · 1 Introduction Many recently proposed question answering tasks require not only machine comprehension of the question

Justin Gilmer Samuel S Schoenholz Patrick F RileyOriol Vinyals and George E Dahl 2017 Neuralmessage passing for quantum chemistry In Pro-ceedings of the 34th International Conference onMachine Learning ICML 2017 Sydney NSW Aus-tralia 6-11 August 2017 volume 70 of Proceedingsof Machine Learning Research pages 1263ndash1272PMLR

Tushar Khot Ashish Sabharwal and Peter Clark 2019Whatrsquos missing A knowledge gap guided approachfor multi-hop question answering In Proceedings ofthe 2019 Conference on Empirical Methods in Nat-ural Language Processing and the 9th InternationalJoint Conference on Natural Language Processing(EMNLP-IJCNLP) pages 2814ndash2828 Hong KongChina Association for Computational Linguistics

Thomas N Kipf and Max Welling 2017 Semi-supervised classification with graph convolutionalnetworks In 5th International Conference on Learn-ing Representations ICLR 2017 Toulon FranceApril 24-26 2017 Conference Track ProceedingsOpenReviewnet

Souvik Kundu Tushar Khot Ashish Sabharwal andPeter Clark 2019 Exploiting explicit paths formulti-hop reading comprehension In Proceedingsof the 57th Annual Meeting of the Association forComputational Linguistics pages 2737ndash2747 Flo-rence Italy Association for Computational Linguis-tics

John D Lafferty Andrew McCallum and FernandoC N Pereira 2001 Conditional random fieldsProbabilistic models for segmenting and labeling se-quence data In Proceedings of the Eighteenth Inter-national Conference on Machine Learning (ICML2001) Williams College Williamstown MA USAJune 28 - July 1 2001 pages 282ndash289 MorganKaufmann

Guokun Lai Qizhe Xie Hanxiao Liu Yiming Yangand Eduard Hovy 2017 RACE Large-scale ReAd-ing comprehension dataset from examinations InProceedings of the 2017 Conference on EmpiricalMethods in Natural Language Processing pages785ndash794 Copenhagen Denmark Association forComputational Linguistics

Zhen-Zhong Lan Mingda Chen Sebastian Good-man Kevin Gimpel Piyush Sharma and RaduSoricut 2019 Albert A lite bert for self-supervised learning of language representationsArXiv abs190911942

Shiyang Li Jianshu Chen and Dian Yu 2019 Teach-ing pretrained models with commonsense reason-ing A preliminary kb-based approach CoRRabs190909743

Yang Li and Peter Clark 2015 Answering elementaryscience questions by constructing coherent scenesusing background knowledge In Proceedings of

the 2015 Conference on Empirical Methods in Nat-ural Language Processing pages 2007ndash2012 Lis-bon Portugal Association for Computational Lin-guistics

Bill Yuchen Lin Xinyue Chen Jamin Chen and Xi-ang Ren 2019 KagNet Knowledge-aware graphnetworks for commonsense reasoning In Proceed-ings of the 2019 Conference on Empirical Methodsin Natural Language Processing and the 9th Inter-national Joint Conference on Natural Language Pro-cessing (EMNLP-IJCNLP) pages 2829ndash2839 HongKong China Association for Computational Lin-guistics

Liyuan Liu Haoming Jiang Pengcheng He WeizhuChen Xiaodong Liu Jianfeng Gao and Jiawei Han2019a On the variance of the adaptive learning rateand beyond CoRR abs190803265

Yinhan Liu Myle Ott Naman Goyal Jingfei Du Man-dar Joshi Danqi Chen Omer Levy Mike LewisLuke Zettlemoyer and Veselin Stoyanov 2019bRoberta A robustly optimized BERT pretraining ap-proach CoRR abs190711692

Shangwen Lv Daya Guo Jingjing Xu Duyu TangNan Duan Ming Gong Linjun Shou Daxin JiangGuihong Cao and Songlin Hu 2019 Graph-based reasoning over heterogeneous external knowl-edge for commonsense question answering CoRRabs190905311

Kaixin Ma Jonathan Francis Quanyang Lu Eric Ny-berg and Alessandro Oltramari 2019 Towardsgeneralizable neuro-symbolic systems for common-sense question answering In Proceedings of theFirst Workshop on Commonsense Inference in Natu-ral Language Processing pages 22ndash32 Hong KongChina Association for Computational Linguistics

Todor Mihaylov Peter Clark Tushar Khot and AshishSabharwal 2018 Can a suit of armor conduct elec-tricity a new dataset for open book question an-swering In Proceedings of the 2018 Conference onEmpirical Methods in Natural Language Processingpages 2381ndash2391 Brussels Belgium Associationfor Computational Linguistics

Todor Mihaylov and Anette Frank 2018 Knowledge-able reader Enhancing cloze-style reading compre-hension with external commonsense knowledge InProceedings of the 56th Annual Meeting of the As-sociation for Computational Linguistics (Volume 1Long Papers) pages 821ndash832 Melbourne AustraliaAssociation for Computational Linguistics

Giannis Nikolentzos George Dasoulas and MichalisVazirgiannis 2019 k-hop graph neural networksCoRR abs190706051

Xiaoman Pan Kai Sun Dian Yu Heng Ji and Dong Yu2019 Improving question answering with externalknowledge CoRR abs190200993

Adam Santoro David Raposo David G T BarrettMateusz Malinowski Razvan Pascanu Peter WBattaglia and Tim Lillicrap 2017 A simple neu-ral network module for relational reasoning In Ad-vances in Neural Information Processing Systems30 Annual Conference on Neural Information Pro-cessing Systems 2017 4-9 December 2017 LongBeach CA USA pages 4967ndash4976

Maarten Sap Hannah Rashkin Derek Chen RonanLe Bras and Yejin Choi 2019 Social IQa Com-monsense reasoning about social interactions InProceedings of the 2019 Conference on EmpiricalMethods in Natural Language Processing and the9th International Joint Conference on Natural Lan-guage Processing (EMNLP-IJCNLP) pages 4463ndash4473 Hong Kong China Association for Computa-tional Linguistics

Michael Sejr Schlichtkrull Thomas N Kipf PeterBloem Rianne van den Berg Ivan Titov and MaxWelling 2018 Modeling relational data with graphconvolutional networks In The Semantic Web - 15thInternational Conference ESWC 2018 HeraklionCrete Greece June 3-7 2018 Proceedings volume10843 of Lecture Notes in Computer Science pages593ndash607 Springer

Robyn Speer Joshua Chin and Catherine Havasi 2017Conceptnet 55 An open multilingual graph of gen-eral knowledge In Proceedings of the Thirty-FirstAAAI Conference on Artificial Intelligence Febru-ary 4-9 2017 San Francisco California USApages 4444ndash4451 AAAI Press

Kai Sun Dian Yu Dong Yu and Claire Cardie 2019Improving machine reading comprehension withgeneral reading strategies In Proceedings of the2019 Conference of the North American Chapter ofthe Association for Computational Linguistics Hu-man Language Technologies Volume 1 (Long andShort Papers) pages 2633ndash2643 Minneapolis Min-nesota Association for Computational Linguistics

Alon Talmor Jonathan Herzig Nicholas Lourie andJonathan Berant 2019 CommonsenseQA A ques-tion answering challenge targeting commonsenseknowledge In Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics Human LanguageTechnologies Volume 1 (Long and Short Papers)pages 4149ndash4158 Minneapolis Minnesota Associ-ation for Computational Linguistics

Ashish Vaswani Noam Shazeer Niki Parmar JakobUszkoreit Llion Jones Aidan N Gomez LukaszKaiser and Illia Polosukhin 2017 Attention is allyou need In Advances in Neural Information Pro-cessing Systems 30 Annual Conference on NeuralInformation Processing Systems 2017 4-9 Decem-ber 2017 Long Beach CA USA pages 5998ndash6008

Petar Velickovic Guillem Cucurull Arantxa CasanovaAdriana Romero Pietro Lio and Yoshua Bengio

2018 Graph attention networks In 6th Inter-national Conference on Learning RepresentationsICLR 2018 Vancouver BC Canada April 30 - May3 2018 Conference Track Proceedings OpenRe-viewnet

Shuohang Wang and Jing Jiang 2016 Learning nat-ural language inference with LSTM In Proceed-ings of the 2016 Conference of the North Ameri-can Chapter of the Association for ComputationalLinguistics Human Language Technologies pages1442ndash1451 San Diego California Association forComputational Linguistics

Xiaoyan Wang Pavan Kapanipathi Ryan MusaMo Yu Kartik Talamadupula Ibrahim AbdelazizMaria Chang Achille Fokoue Bassem MakniNicholas Mattei and Michael Witbrock 2019 Im-proving natural language inference using externalknowledge in the science questions domain In TheThirty-Third AAAI Conference on Artificial Intelli-gence AAAI 2019 The Thirty-First Innovative Ap-plications of Artificial Intelligence Conference IAAI2019 The Ninth AAAI Symposium on EducationalAdvances in Artificial Intelligence EAAI 2019 Hon-olulu Hawaii USA January 27 - February 1 2019pages 7208ndash7215 AAAI Press

Dirk Weissenborn Tomas Kocisky and Chris Dyer2017 Dynamic integration of background knowl-edge in neural nlu systems arXiv preprintarXiv170602596

An Yang Quan Wang Jing Liu Kai Liu Yajuan LyuHua Wu Qiaoqiao She and Sujian Li 2019 En-hancing pre-trained language representations withrich knowledge for machine reading comprehensionIn Proceedings of the 57th Annual Meeting of theAssociation for Computational Linguistics pages2346ndash2357 Florence Italy Association for Compu-tational Linguistics

Bishan Yang and Tom Mitchell 2017 Leveragingknowledge bases in LSTMs for improving machinereading In Proceedings of the 55th Annual Meet-ing of the Association for Computational Linguistics(Volume 1 Long Papers) pages 1436ndash1446 Van-couver Canada Association for Computational Lin-guistics

Zhi-Xiu Ye Qian Chen Wen Wang and Zhen-HuaLing 2019 Align mask and select A sim-ple method for incorporating commonsense knowl-edge into language representation models CoRRabs190806725

Yuyu Zhang Hanjun Dai Kamil Toraman andLe Song 2018 Kgˆ2 Learning to reason scienceexam questions with contextual knowledge graphembeddings CoRR abs180512393

Chen Zhu Yu Cheng Zhe Gan Siqi Sun Tom Gold-stein and Jingjing Liu 2020 Freelb Enhanced ad-versarial training for natural language understandingIn International Conference on Learning Represen-tations

A Implementation Details

CommonsenseQA OpenbookQA

BERT-BASE 3 times 10minus5 -

BERT-LARGE 2 times 10minus5 -

ROBERTA-LARGE 1 times 10minus5

1 times 10minus5

Table 7 Learning rate for text encoders on differentdatasets

CommonsenseQA OpenbookQA

RN 3 times 10minus4

3 times 10minus4

RGCN 1 times 10minus3

1 times 10minus3

GconAttn 3 times 10minus4

1 times 10minus4

KV-Memory 1 times 10minus3

1 times 10minus3

MultiGRN 1 times 10minus3

1 times 10minus3

Table 8 Learning rate for graph encoders on differentdatasets

Our models are implemented in PyTorch We usecross-entropy loss and RAdam (Liu et al 2019a)optimizer We find it beneficial to use separatelearning rates for the text encoder and the graph en-coder We tune learning rates for text encoders andgraph encoders on two datasets We first fine-tuneROBERTA-LARGE BERT-LARGE BERT-BASE

on CommonsenseQA and ROBERTA-LARGE onOpenbookQA respectively and choose a dataset-specific learning rate from 1times10

minus5 2times10

minus5 3times

10minus5 6 times 10

minus5 1 times 10

minus4 for each text encoderbased on the best performance on development setas listed in Table 7 We report the performance ofthese fine-tuned text encoders and also adopt theirdataset-specific optimal learning rates in joint train-ing with graph encoders For models that involveKG the learning rate of their graph encoders 7 arechosen from 1 times 10

minus4 3 times 10

minus4 1 times 10

minus3 3 times

10minus3 based on their best development set perfor-

mance with ROBERTA-LARGE as the text encoderWe report the optimal learning rates for graph en-coders in Table 8 In training we set the maximuminput sequence length to text encoders to 64 batchsize to 32 and perform early stopping

For the input node features we first use tem-plates to turn knowledge triples in Common-senseQA into sentences and feed them into pre-trained BERT-LARGE obtaining a sequence of

7KV-memory actually encode triples to enhance tokenrepresentations of s instead of encoding graph alone But forsimplicity we also refer to it as a graph encoder

token embeddings from the last layer of BERT-LARGE for each triple For each entity in Com-monsenseQA we perform mean pooling over thetokens of the entityrsquos occurrences across all the sen-tences to form a 1024d vector as its correspondingnode feature We use this set of features for all ourimplemented models

We use 2-layer RGCN and single-layer Multi-GRN across our experiments

B Dynamic Programming Algorithm forEq 7

To show that multi-hop message passing can becomputed in linear time we observe that Eq 7 canbe re-written in matrix form

Zk= (Dk)minus1 sum

(r1rk)isinRk

β(r1 rk s)

sdotGArk⋯Ar1FXW1r1

⊤⋯W

krk

sdotWk+10

⊤⋯W

K0

⊤ (1 le k le K) (13)

where G = diag(exp([g(φ(v1) s) g(φ(vn)s)]) (F is similarly defined) Ar is the adjacencymatrix for relation r and D

k is defined as follows

Dk= diag( sum

(r1rk)isinRk

β(r1 rk s)

sdotGArk⋯Ar1FX1) (1 le k le K) (14)

Using this matrix formulation we can computeEq 7 using dynamic programming

C Formal Definition of K-hop RN

Definition 1 (K-hop Relation Network) A multi-hop relation network is a function that maps amulti-relational graph to a fixed size vector

KHopRN(G W E H) =K

sumk=1

sum(jr1rki)isinΦk

jisinQ iisinA

β(j r1 rk i)sdotW (hjoplus(er1⋯erk)oplushi)(15)

where denotes element-wise product and β(⋯) =1(K∣A∣ sdot ∣(j prime i) isin G ∣ j prime isin Q∣) defines thepooling weights

Algorithm 1 Dynamic programming algorithm formulti-hop message passingInput sXAr(1 le r le m)W t

r (r isin R 1 le t le

k)F G δ τOutput Z

1 WKlarr I

2 for k larr K minus 1 to 1 do3 W

klarrW

k+10 W

k+1

4 end for5 for r isin R do6 Mr larr FX7 end for8 for k larr 1 to K do9 if k gt 1 then

10 for r isin R do11 M

primer larr e

δ(rs)ArsumrprimeisinR e

τ(rprimers)MrprimeW

kr

12 end for13 for r isin R do14 Mr larrM

primer

15 end for16 else17 for r isin R do18 Mr larr e

δ(rs) sdotArMrWkr

19 end for20 end if21 Z

klarr GsumrisinR MrW

k

22 end for23 Replace W

tr (0 le r le m 1 le t le k) with identity

matrices and X with 1 and re-run line 1 - line 19 tocompute d

1 d

K

24 for k larr 1 to K do25 Z

klarr (diag(dk))minus1

Zk

26 end for27 return Z

1Z

2 Z

k

D Expressing K-hop RN with MultiGRN

Theorem 1 Given any W E H there exists a pa-rameter setting such that the output of the modelbecomes KHopRN(GW E H) for arbitrary G

Proof Suppose W = [W1 W2 W3] whereW1 W3 isin Rd3timesd1 W2 isin Rd3timesd2 ForMultiRGN we set the parameters as followsH = HUlowast = [I0] isin R(d1+d2)timesd1 blowast =

[01]⊤ isin Rd1+d2 Wtr = diag(1 oplus er) isin

R(d1+d2)times(d1+d2)(r isin R 1 le t le K)V = W3 isin

Rd3timesd1 Vprime= [W1 W2] isin Rd3times(d1+d2) We dis-

able the relation type attention module and enablemessage passing only from Q to A By furtherchoosing σ as the identity function and perform-ing pooling over A we observe that the output ofMultiGRN becomes

$$
\begin{aligned}
\frac{1}{|A|}\sum_{i\in A} h'_i
&= \frac{1}{|A|}\sum_{i\in A}\big(V h_i + V' z_i\big)\\
&= \frac{1}{K|A|}\sum_{i\in A}\sum_{k=1}^{K}\big(V h_i + V' z^{k}_{i}\big)\\
&= \sum_{k=1}^{K}\ \sum_{\substack{(j,r_1,\dots,r_k,i)\in\Phi_k\\ j\in Q,\ i\in A}}\beta(j,r_1,\dots,r_k,i)\big(V h_i + V' W^{k}_{r_k}\cdots W^{1}_{r_1} x_j\big)\\
&= \sum_{k=1}^{K}\ \sum_{\substack{(j,r_1,\dots,r_k,i)\in\Phi_k\\ j\in Q,\ i\in A}}\beta(\cdots)\big(V h_i + V' W^{k}_{r_k}\cdots W^{1}_{r_1} U_{\phi(j)} h_j + V' W^{k}_{r_k}\cdots W^{1}_{r_1} b_{\phi(j)}\big)\\
&= \sum_{k=1}^{K}\ \sum_{\substack{(j,r_1,\dots,r_k,i)\in\Phi_k\\ j\in Q,\ i\in A}}\beta(\cdots)\big(W_3 h_i + W_1 h_j + W_2\,(e_{r_1}\circ\cdots\circ e_{r_k})\big)\\
&= \sum_{k=1}^{K}\ \sum_{\substack{(j,r_1,\dots,r_k,i)\in\Phi_k\\ j\in Q,\ i\in A}}\beta(\cdots)\,W\big(h_j\oplus(e_{r_1}\circ\cdots\circ e_{r_k})\oplus h_i\big)\\
&= \operatorname{KHopRN}(\mathcal{G}; W, E, H)
\end{aligned}
\tag{16}
$$


