
Transcript of arXiv:2004.13455v1 [cs.CL] 28 Apr 2020


DTCA: Decision Tree-based Co-Attention Networks for Explainable Claim Verification

Lianwei Wu, Yuan Rao, Yongqiang Zhao, Hao Liang, Ambreen Nazir
Lab of Social Intelligence and Complexity Data Processing,
School of Software Engineering, Xi’an Jiaotong University, China
Shannxi Joint Key Laboratory for Artifact Intelligence (Sub-Lab of Xi’an Jiaotong University), China
Research Institute of Xi’an Jiaotong University, Shenzhen, China
{stayhungry,yongqiang1210,favorablelearner,ambreen.nazir}@stu.xjtu.edu.cn

[email protected]

Abstract

Recently, many methods discover effective evidence from reliable sources via appropriate neural networks for explainable claim verification, and these methods have been widely recognized. However, in these methods the discovery process of evidence is nontransparent and unexplained. Moreover, the discovered evidence only roughly targets the interpretability of the whole sequence of a claim and is insufficient to focus on its false parts. In this paper, we propose a Decision Tree-based Co-Attention model (DTCA) to discover evidence for explainable claim verification. Specifically, we first construct a Decision Tree-based Evidence model (DTE) to select comments with high credibility as evidence in a transparent and interpretable way. Then we design Co-attention Self-attention networks (CaSa) to make the selected evidence interact with claims, which serves to 1) train DTE to determine the optimal decision thresholds and obtain more powerful evidence, and 2) utilize the evidence to find the false parts of claims. Experiments on two public datasets, RumourEval and PHEME, demonstrate that DTCA not only provides explanations for the results of claim verification but also achieves state-of-the-art performance, boosting the F1-score by 3.11% and 2.41%, respectively.

1 Introduction

The increasing popularity of social media has brought unprecedented challenges to the ecology of information dissemination, causing a large volume of false or unverified claims, like extreme news, hoaxes, rumors, fake news, etc., to run rampant. Research indicates that during the 2016 US presidential election, fake news accounted for nearly 6% of all news consumption, where 1% of users were exposed to 80% of fake news and 0.1% of users were responsible for sharing 80% of fake news (Grinberg et al., 2019), and that democratic elections are vulnerable to manipulation by false or unverified claims on social media (Aral and Eckles, 2019), which renders the automatic verification of claims a crucial problem.

Currently, the methods for automatic claim verification can be divided into two categories. The first category relies on deep neural networks to learn credibility indicators from claim content and auxiliary relevant articles or comments (i.e., responses) (Volkova et al., 2017; Rashkin et al., 2017; Dungs et al., 2018). Despite their effectiveness, these methods find it difficult to explain why claims are true or false in practice. To overcome this weakness, a trend in recent studies (the second category) is to explore evidence-based verification solutions, which focus on capturing fragments of evidence obtained from reliable sources by appropriate neural networks (Popat et al., 2018; Hanselowski et al., 2018; Ma et al., 2019; Nie et al., 2019). For instance, Thorne et al. (2018) build multi-task learning to extract evidence from Wikipedia and synthesize information from multiple documents to verify claims. Popat et al. (2018) capture signals from external evidence articles and model joint interactions between various factors, like the context of a claim and the trustworthiness of sources of related articles, for the assessment of claims. Ma et al. (2019) propose hierarchical attention networks to learn sentence-level evidence from claims and their related articles based on coherence modeling and natural language inference for claim verification.

Although these methods provide evidence to address the explainability of claim verification to some degree, there are still several limitations. First, it is generally hard to interpret their discovery process of evidence for claims; that is, they lack the interpretability of the methods themselves, because these methods are all based on neural networks and thus belong to nontransparent black-box models. Second, the provided evidence only offers a coarse-grained explanation of claims: it is aimed at the interpretability of the whole sequence of a claim but is insufficient to focus on the false parts of the claim.

To address the above problems, we design Decision Tree-based Co-Attention networks (DTCA) to discover evidence for explainable claim verification, which contain two stages: 1) a Decision Tree-based Evidence model (DTE) for discovering evidence in a transparent and interpretable way; and 2) Co-attention Self-attention networks (CaSa) that use the evidence to explore the false parts of claims. Specifically, DTE is constructed on the basis of the structured and hierarchical comments on a claim; it considers multiple factors as decision conditions from the perspective of the content and metadata of comments and selects comments with high credibility as evidence. CaSa exploits the selected evidence to interact with claims at a deep semantic level, which serves two roles: one is to train DTE to pursue the optimal decision thresholds and finally obtain more powerful evidence; the other is to utilize the evidence to find the false parts of claims. Experimental results reveal that DTCA not only achieves state-of-the-art performance but also provides interpretability for both the results of claim verification and the selection process of evidence. Our contributions are summarized as follows:

• We propose a transparent and interpretable scheme that incorporates a decision tree model into co-attention networks, which not only discovers evidence for explainable claim verification (Section 4.4.3) but also interprets the discovery process of evidence through decision conditions (Section 4.4.2).

• The designed co-attention networks promote deep semantic interaction between evidence and claims, which can train DTE to obtain more powerful evidence and effectively focus on the false parts of claims (Section 4.4.3).

• Experiments on two public, widely used fake news datasets demonstrate that DTCA outperforms previous state-of-the-art methods (Section 4.3.2).

2 Related Work

Claim Verification Many studies on claim verification extract an appreciable quantity of credibility-indicative features around semantics (Ma et al., 2018b; Khattar et al., 2019; Wu et al., 2020), emotions (Ajao et al., 2019), stances (Ma et al., 2018a; Kochkina et al., 2018; Wu et al., 2019), writing styles (Potthast et al., 2018; Grondahl and Asokan, 2019), and source credibility (Popat et al., 2018; Baly et al., 2018a) from claims and relevant articles (or comments). For a concrete instance, Wu et al. (2019) devise sifted multi-task learning networks to jointly train stance detection and fake news detection, effectively utilizing the common features of the two tasks to improve performance. Despite reliable performance, these methods for claim verification are unexplainable. To address this issue, recent research concentrates on the discovery of evidence for explainable claim verification, mainly by designing different deep models to exploit semantic matching (Nie et al., 2019; Zhou et al., 2019), semantic conflicts (Baly et al., 2018b; Dvorak and Woltran, 2019; Wu and Rao, 2020), and semantic entailments (Hanselowski et al., 2018; Ma et al., 2019) between claims and relevant articles. For instance, Nie et al. (2019) develop neural semantic matching networks that encode, align, and match the semantics of two text sequences to capture evidence for verifying claims. Building on the strengths of these studies, we strive to perceive explainable evidence through semantic interaction for claim verification.

Explainable Machine Learning Our work is also related to explainable machine learning, which can generally be divided into two categories: intrinsic explainability and post-hoc explainability (Du et al., 2018). Intrinsic explainability (Shu et al., 2019; He et al., 2015; Zhang and Chen, 2018) is achieved by constructing self-explanatory models that incorporate explainability directly into their structures, which requires building fully interpretable models that clearly expose the explainable process. However, current deep learning models are black-box models, which makes intrinsic explainability difficult to achieve (Gunning, 2017). Post-hoc explainability (Samek et al., 2017; Wang et al., 2018; Chen et al., 2018) requires designing a second model to provide explanations for an existing model. For example, Wang et al. (2018) combine the strengths of an embeddings-based model and a tree-based model to develop explainable recommendation, where the tree-based model obtains evidence and the embeddings-based model improves the performance of recommendation.



Figure 1: The architecture of DTCA. DTCA includes two stages, i.e., DTE for discovering evidence and CaSa using the evidence to explore the false parts of claims.

In this paper, following post-hoc explainability, we harness a decision tree model to explain the discovery process of evidence and design co-attention networks to boost the task performance.

3 Decision Tree-based Co-Attention Networks (DTCA)

In this section, we introduce the decision tree-based co-attention networks (DTCA) for explainable claim verification, whose architecture is shown in Figure 1. DTCA involves two stages: the decision tree-based evidence model (DTE) and the co-attention self-attention networks (CaSa), which consist of a 3-level hierarchical structure, i.e., a sequence representation layer, a co-attention layer, and an output layer. Next, we describe each part of DTCA in detail.

3.1 Decision Tree-based Evidence Model (DTE)

DTE is based on the tree of comments (including replies) on one claim. We first build a tree network based on the hierarchical comments, as shown on the left of Figure 2. The root node is one claim, and the nodes at the second level and below are users' comments on the claim (R_11, ..., R_kn), where k and n denote the depth of the comment tree and the width of the last level, respectively. We aim to select comments with high credibility as evidence for the claim, so we need to evaluate the credibility of each node (comment) in the network and decide whether to select the comment or not.
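For concreteness, the tree comment network can be sketched as a small recursive structure; this is only an illustrative layout, not the authors' implementation, and the field names are assumptions.

```python
# A minimal sketch of the tree comment network; field names (text, meta,
# children) are illustrative assumptions, not the paper's actual code.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Node:
    text: str                                   # claim or comment text
    meta: Dict = field(default_factory=dict)    # reviewer / comment metadata
    children: List["Node"] = field(default_factory=list)

# The root holds the claim C; descendants hold the comments R_11, ..., R_kn.
claim = Node("example claim text")
claim.children.append(Node("example reply", meta={"verified": True}))
```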


Figure 2: Overview of DTE. DTE consists of two parts: the tree comment network (left) and the decision tree model (right), which is used to evaluate the credibility of each node in the tree comment network for discovering evidence.

Three factors from the perspective of the content and metadata of comments are considered; their details are as follows.

The semantic similarity between comments and claims. This measures the relevance between comments and claims and aims to filter irrelevant and noisy comments. Specifically, we adopt the soft cosine measure (Sidorov et al., 2014) between the average word embeddings of claims and comments as semantic similarity.

The credibility of reviewers[1]. This follows the observation that "reviewers with high credibility also usually have high reliability in their comments" (Shan, 2016). Specifically, we utilize multiple metadata features of reviewers to evaluate reviewer credibility, i.e., whether the following elements exist or not: verified, geo, screen name, and profile image; and the number of followers, friends, and favorites. Examples are shown in Appendix A.

The credibility of comments. This is based on the metadata of comments and roughly measures their credibility (Shu et al., 2017), i.e., 1) whether the following elements exist or not: geo, source, favorite the comment; and 2) the number of favorites and the content length. Examples are shown in Appendix A.
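As a concrete illustration of the first condition, the sketch below scores a comment against a claim with plain cosine over averaged word embeddings; this is a simplification of the soft cosine measure of Sidorov et al. (2014), and the `embed` lookup (token to vector) is an assumed helper.

```python
# A simplified stand-in for the semantic similarity condition: cosine
# similarity of averaged word embeddings (the soft cosine measure would
# additionally weight term pairs by a word-similarity matrix).
import numpy as np

def avg_embedding(tokens, embed):
    # embed: assumed token -> d-dimensional vector lookup
    return np.mean([embed(t) for t in tokens], axis=0)

def semantic_similarity(claim_tokens, comment_tokens, embed):
    a = avg_embedding(claim_tokens, embed)
    b = avg_embedding(comment_tokens, embed)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
```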

To integrate these factors in a transparent and interpretable way, we build a decision tree model that takes the factors as decision conditions to measure the node credibility of the comment tree, as shown in the grey part of Figure 2.

We represent the structure of the decision tree model as Q = {V, E}, where V and E denote nodes and edges, respectively. Nodes in V have two types: decision (a.k.a. internal) nodes and leaf nodes. Each decision node splits on a decision condition x_i (one of the three factors) with two decision edges (decision results) based on a specific decision threshold a_i. The leaf node gives the decision result (the red circle), i.e., whether the comment is selected or not. In our experiments, if any decision node is yes, the evaluated comment in the tree comment network is selected as a piece of evidence. In this way, the selection of each comment as evidence is transparent and interpretable, i.e., interpreted by the decision conditions.

[1] People who post comments.

Once the comment nodes in the tree network have been evaluated by the decision tree model, we leverage a post-pruning algorithm to select comment subtrees as the evidence set for training CaSa (Section 3.2).
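A minimal sketch of this selection pass is given below, assuming the Node type sketched above and treating each decision condition as a predicate over (claim, comment); the OR semantics follows the rule stated above, while the post-pruning step and the exact threshold comparisons are left abstract.

```python
# A sketch of DTE's evidence selection: depth-first traversal of the
# comment tree; a comment is selected when any decision condition fires.
def select_evidence(claim, node, conditions, evidence=None):
    # conditions: predicates f(claim, comment) -> bool, one per decision
    # condition (semantic similarity, reviewer and comment credibility),
    # each closing over its trained threshold a_i.
    evidence = [] if evidence is None else evidence
    for child in node.children:
        if any(cond(claim, child) for cond in conditions):  # any "yes" edge
            evidence.append(child)                          # piece of evidence
        select_evidence(claim, child, conditions, evidence)
    return evidence
```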

3.2 Co-attention Self-attention Networks (CaSa)

In DTE, the decision thresholds a_i are uncertain; that is, different decision thresholds yield different numbers of comments as evidence for CaSa training. To train the decision thresholds in DTE so as to obtain more powerful evidence, and then to exploit this evidence to explore the false parts of fake news, we devise CaSa to promote the interaction between evidence and claims. The details are as follows.

3.2.1 Sequence Representation Layer

The inputs of CaSa include a sequence of evidence (the evidence set obtained by DTE is concatenated into one sequence) and the sequence of a claim. Given a sequence of l tokens X = {x_1, x_2, ..., x_l}, X ∈ R^(l×d), which could be either a claim or the evidence, each token x_i ∈ R^d is a d-dimensional vector obtained by the pre-trained BERT model (Devlin et al., 2019). We encode each token into a fixed-size hidden vector h_i and then obtain the sequence representations of the claim X_c and the evidence X_e via two BiLSTM (Graves et al., 2005) networks, respectively.

\overrightarrow{h_i} = \overrightarrow{\mathrm{LSTM}}(\overrightarrow{h_{i-1}}, x_i) \quad (1)

\overleftarrow{h_i} = \overleftarrow{\mathrm{LSTM}}(\overleftarrow{h_{i+1}}, x_i) \quad (2)

h_i = [\overrightarrow{h_i}; \overleftarrow{h_i}] \quad (3)

where \overrightarrow{h_i} \in \mathbb{R}^h and \overleftarrow{h_i} \in \mathbb{R}^h are the hidden states of the forward LSTM \overrightarrow{\mathrm{LSTM}} and the backward LSTM \overleftarrow{\mathrm{LSTM}}, h is the number of hidden units of the LSTM, and ; denotes the concatenation operation. Finally, R^e ∈ R^(l×2h) and R^c ∈ R^(l×2h) are the representations of the evidence sequence and the claim sequence, respectively. Additionally, experiments confirm that the BiLSTM in CaSa can be replaced by a BiGRU (Cho et al., 2014) with comparable performance.
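A minimal PyTorch sketch of this layer is shown below, with d = 768 and h = 120 as reported in Section 4.2; the module names are illustrative.

```python
# A sketch of the sequence representation layer: one BiLSTM over BERT
# token vectors; out[:, i, :] corresponds to h_i = [forward; backward].
import torch
import torch.nn as nn

class SequenceEncoder(nn.Module):
    def __init__(self, d=768, h=120):
        super().__init__()
        self.bilstm = nn.LSTM(d, h, batch_first=True, bidirectional=True)

    def forward(self, x):             # x: (batch, l, d) BERT embeddings
        out, _ = self.bilstm(x)       # out: (batch, l, 2h)
        return out

encoder_e, encoder_c = SequenceEncoder(), SequenceEncoder()   # two BiLSTMs
r_e = encoder_e(torch.randn(2, 16, 768))                      # evidence R^e
r_c = encoder_c(torch.randn(2, 16, 768))                      # claim R^c
```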

3.2.2 Co-attention Layer

Co-attention networks are composed of two hierarchical self-attention networks. The sequence of evidence first leverages one self-attention network to conduct deep semantic interaction with the claim, capturing the false parts of the claim. Then the semantics of the interacted claim attend to the semantics of the evidence sequence via another self-attention network, concentrating on the key parts of the evidence. Both self-attention networks are based on the multi-head attention mechanism (Vaswani et al., 2017). Given a matrix of l query vectors Q ∈ R^(l×2h), keys K ∈ R^(l×2h), and values V ∈ R^(l×2h), the scaled dot-product attention, the core of self-attention networks, is described as

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d}}\right)V \quad (4)

In particular, to enable the claim and evidence to interact more directly and effectively, in the first self-attention network, Q = R^e_pool (R^e_pool ∈ R^2h) is the max-pooled vector of the sequence representation of the evidence, and K = V = R^c, where R^c is the sequence representation of the claim. In the second self-attention network, Q = C, i.e., the output vector of the self-attention network for the claim (details in Eq. 7), and K = V = R^e, where R^e is the sequence representation of the evidence.
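The two passes can be sketched with a single-head version of Eq. (4); the j-head projections of Eqs. (5-6) and the FFN of Eq. (7) are omitted for brevity, and r_e, r_c stand for the BiLSTM outputs (random placeholders here).

```python
# A single-head sketch of the two co-attention passes built on Eq. (4).
import torch
import torch.nn.functional as F

def attention(q, k, v):
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5   # QK^T / sqrt(d)
    return F.softmax(scores, dim=-1) @ v

batch, l, two_h = 2, 16, 240
r_e, r_c = torch.randn(batch, l, two_h), torch.randn(batch, l, two_h)

re_pool = r_e.max(dim=1, keepdim=True).values   # max-pooled evidence query
c = attention(re_pool, r_c, r_c)                # pass 1: claim attended by evidence
e = attention(c, r_e, r_e)                      # pass 2: evidence attended by claim
```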

To achieve high parallelizability of attention, multi-head attention first linearly projects the queries, keys, and values j times with different linear projections, and the j projections then perform the scaled dot-product attention in parallel. Finally, the attention results are concatenated and once again projected to obtain the new representation. Formally, multi-head attention can be formulated as:

\mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V) \quad (5)

O' = \mathrm{MultiHead}(Q, K, V) = [\mathrm{head}_1; \mathrm{head}_2; \ldots; \mathrm{head}_j]W^o \quad (6)

where W_i^Q ∈ R^(2h×D), W_i^K ∈ R^(2h×D), W_i^V ∈ R^(2h×D), and W^o ∈ R^(2h×2h) are trainable parameters and D = 2h/j.
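In practice, the project-split-attend-concatenate-project computation of Eqs. (5-6) matches what PyTorch's built-in nn.MultiheadAttention performs, so a sketch can simply wrap it (j = 6 heads as in Section 4.2):

```python
# A sketch of multi-head attention (Eqs. 5-6) via PyTorch's built-in
# module, which applies the W^Q_i, W^K_i, W^V_i, W^o projections internally.
import torch
import torch.nn as nn

two_h, j = 240, 6                      # model width 2h and head count; D = 2h/j
mha = nn.MultiheadAttention(embed_dim=two_h, num_heads=j, batch_first=True)

q = torch.randn(2, 1, two_h)           # e.g., the pooled evidence query
k = v = torch.randn(2, 16, two_h)      # e.g., the claim sequence
o_prime, _ = mha(q, k, v)              # O' of Eq. (6): shape (2, 1, 2h)
```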

Subsequently, the co-attention networks pass through a feed-forward network (FFN) to add non-linear features while preserving scale-invariant features; the FFN contains a single hidden layer with a ReLU.

O = \mathrm{FFN}(O') = \max(0, O'W_1 + b_1)W_2 + b_2 \quad (7)

where W_1, W_2, b_1, and b_2 are learned parameters. O = C and O = E are the output vectors of the two self-attention networks for the claim and the evidence, respectively.

Finally, to fully integrate the evidence and the claim, we adopt the absolute difference and the element-wise product to fuse the vectors E and C (Wu et al., 2019).

EC = [E; |E - C|; E \odot C; C] \quad (8)

where \odot denotes the element-wise multiplication operation.
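This fusion is a one-liner in practice; a sketch for batched vectors:

```python
# A direct sketch of the fusion of Eq. (8): [E; |E-C|; E*C (element-wise); C].
import torch

def fuse(e, c):
    return torch.cat([e, (e - c).abs(), e * c, c], dim=-1)

ec = fuse(torch.randn(4, 240), torch.randn(4, 240))   # (4, 960), i.e., 8h
```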

3.2.3 Output Layer

As the last layer, a softmax function emits the predicted probability distribution:

p = \mathrm{softmax}(W_p \, EC + b_p) \quad (9)

We train the model to minimize the cross-entropy error for a training sample with ground-truth label y:

\mathrm{Loss} = -\sum y \log p \quad (10)
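A sketch of the output layer follows; the 8h fused width and the two-class (true/false) setup are inferred from Eq. (8) and the task definition, not stated dimensions.

```python
# A sketch of Eqs. (9-10): linear projection of the fused vector plus
# softmax cross-entropy against the gold label.
import torch
import torch.nn as nn

classifier = nn.Linear(960, 2)            # W_p, b_p over EC (8h = 960 assumed)
criterion = nn.CrossEntropyLoss()         # softmax + negative log-likelihood

ec = torch.randn(4, 960)                  # fused vectors from Eq. (8)
labels = torch.tensor([0, 1, 1, 0])       # ground-truth true/false labels
loss = criterion(classifier(ec), labels)  # Eq. (10)
```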

The training process of DTCA is presented in Algorithm 1 of Appendix B.

4 Experiments

As the key contribution of this work is to verify claims accurately and offer evidence as explanations, we design experiments to answer the following questions:

• RQ1: Can DTCA achieve better performance compared with state-of-the-art models?

• RQ2: How do the decision conditions in the decision tree affect model performance (that is, the interpretability of the evidence selection process)?

• RQ3: Can DTCA make verification results easy to interpret through evidence and find the false parts of claims?

4.1 Datasets

To evaluate our proposed model, we use two widely used datasets, RumourEval (Derczynski et al., 2017) and PHEME (Zubiaga et al., 2016). Structure: the datasets contain 325 and 6,425 Twitter conversation threads, respectively, associated with different newsworthy events like the Charlie Hebdo attack, the shooting in Ottawa, etc. A thread consists of a claim and a tree of comments (a.k.a. responses) expressing opinions toward the claim. Labels: both datasets have the same labels, i.e., true, false, and unverified. Since our goal is to verify whether a claim is true or false, we filter out unverified tweets. Table 1 gives statistics of the two datasets.

Subset       Veracity   RumourEval           PHEME
                        #posts   #comments   #posts   #comments
Training     True       83       1,949       861      24,438
             False      70       1,504       625      17,676
             Total      153      3,453       1,468    42,114
Validation   True       10       101         95       1,154
             False      12       141         115      1,611
             Total      22       242         210      2,765
Testing      True       9        412         198      3,077
             False      12       437         219      3,265
             Total      21       849         417      6,342

Table 1: Statistics of the datasets.

In consideration of the imbalanced label distributions, besides accuracy (A) we adopt precision (P), recall (R), and F1-score (F1) as evaluation metrics for DTCA and the baselines. We divide the two datasets into training, validation, and testing subsets with proportions of 70%, 10%, and 20%, respectively.

4.2 Settings

We tune all hyper-parameters on the validation set and achieve the best performance via a small grid search. For the hyper-parameter configurations: (1) in DTE, the ranges of semantic similarity, the credibility of reviewers, and the credibility of comments are [0, 0.8], [0, 0.8], and [0, 0.7], respectively; (2) in CaSa, the word embedding size d is set to 768; the size of the LSTM hidden states h is 120; the numbers of attention heads and blocks are 6 and 4, respectively; the dropout of multi-head attention is set to 0.8; the initial learning rate is set to 0.001; the dropout rate is 0.5; and the mini-batch size is 64.
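For reference, these reported CaSa hyper-parameters can be collected in one place; this is a convenience summary, not the authors' released configuration file.

```python
# The hyper-parameters reported above, gathered as a plain dict.
config = {
    "embed_dim": 768,       # BERT word embedding size d
    "lstm_hidden": 120,     # LSTM hidden size h (so 2h = 240)
    "attn_heads": 6,
    "attn_blocks": 4,
    "attn_dropout": 0.8,
    "dropout": 0.5,
    "learning_rate": 0.001,
    "batch_size": 64,
}
```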

4.3 Performance Evaluation (RQ1)

4.3.1 Baselines

SVM (Derczynski et al., 2017) detects fake news based on manually extracted features.

CNN (Chen et al., 2017) adopts different window sizes to obtain semantic features similar to n-grams for rumor classification.

TE (Guacho et al., 2018) creates article-by-article graphs relying on tensor decomposition and derives article embeddings for rumor detection.

DeClarE (Popat et al., 2018) presents attention networks to aggregate signals from external evidence articles for claim verification.

TRNN (Ma et al., 2018b) proposes two tree-structured RNN models, top-down and bottom-up, integrating the semantics of structure and content to detect rumors. In this work, we adopt the top-down model, which gives better results, as the baseline.


Dataset      Measure   SVM     CNN     TE      DeClarE   TRNN    MTL-LSTM   Bayesian-DL   Sifted-MTL   Ours
RumourEval   A (%)     71.42   61.90   66.67   66.67     76.19   66.67      80.95         81.48        82.54
             P (%)     66.67   54.54   60.00   58.33     70.00   57.14      77.78         72.24        78.25
             R (%)     66.67   66.67   66.67   77.78     77.78   88.89      77.78         86.31        85.60
             F1 (%)    66.67   59.88   63.15   66.67     73.68   69.57      77.78         78.65        81.76
PHEME        A (%)     72.18   59.23   65.22   67.87     78.65   74.94      80.33         81.27        82.46
             P (%)     78.80   56.14   63.05   64.68     77.11   68.77      78.29         73.41        79.08
             R (%)     75.75   64.64   64.64   71.21     78.28   87.87      79.29         88.10        86.24
             F1 (%)    72.10   60.09   63.83   67.89     77.69   77.15      78.78         80.09        82.50

Table 2: The performance comparison of DTCA against the baselines.

MTL-LSTM (Kochkina et al., 2018) jointly trains rumor detection, claim verification, and stance detection, learning the correlations among these tasks.

Bayesian-DL (Zhang et al., 2019) uses Bayesian deep learning to represent the uncertainty in predicting the veracity of claims and then encodes responses to update the posterior representations.

Sifted-MTL (Wu et al., 2019) is a sifted multi-task learning model that jointly trains fake news detection and stance detection and adopts gate and attention mechanisms to screen shared features.

4.3.2 Results of Comparison

Table 2 shows the experimental results of all compared models on the two datasets. We observe that:

• SVM, which integrates semantics from claim content and comments, outperforms traditional neural networks that only capture semantics from claim content, like CNN and TE, by at least 4.75% and 6.96% in accuracy on the two datasets, respectively, which indicates that the semantics of comments are helpful for claim verification.

• On the whole, most neural network models with semantic interaction between comments and claims, such as TRNN and Bayesian-DL, achieve improvements of 4.77% to 9.53% in accuracy on the two datasets over SVM, which lacks any interaction; this reveals the effectiveness of the interaction between comments and claims.

• TRNN, Bayesian-DL, and DTCA all let claims and comments interact, but the first two models perform worse than DTCA (at least 1.06% and 1.19% lower accuracy, respectively). That is because they integrate all comments indiscriminately and may introduce noise into their models, while DTCA picks the more valuable comments via DTE.

• Multi-task learning models, e.g., MTL-LSTM and Sifted-MTL, which leverage stance features, show at most 3.29% and 1.86% higher recall than DTCA on the two datasets, respectively, but they also bring in noise, falling 1.06% to 21.11% behind DTCA on the other three metrics. Besides, DTCA achieves 3.11% and 2.41% higher F1-score than the latest baseline (Sifted-MTL) on the two datasets, respectively. These results demonstrate the effectiveness of DTCA.

4.4 Discussions

4.4.1 The impact of comments on DTCA

In Section 4.3, we found that the use of comments can improve model performance. To further investigate the quantitative impact of comments on our model, we evaluate the performance of DTCA and CaSa with 0%, 50%, and 100% of comments. The experimental results are shown in Table 3. We make the following observations:

• Models without comment features present the lowest performance, a decrease of 5.08% to 9.76% in accuracy on the two datasets, which implies that comments contain a large number of veracity-indicative features.

• As the proportion of comments expands, model performance improves continuously. However, when the rate of comments for CaSa rises from 50% to 100%, the boost is not significant, only 1.44% in accuracy on RumourEval, while DTCA obtains better gains, with 3.90% and 3.28% accuracy boosts on the two datasets. This fully proves that DTCA can choose valuable comments and ignore unimportant ones with the help of DTE.

4.4.2 The impact of decision conditions of DTE on DTCA (RQ2)

To answer RQ2, we analyze the changes in model performance under different decision conditions.


(a) Different semantic similarity
Threshold               <0.2    <0.3    <0.4    <0.5    <0.6    <0.7
A (%) on RumourEval     74.84   76.77   78.43   82.54   82.38   82.46
A (%) on PHEME          75.36   77.24   79.03   82.46   82.42   82.40
D-value on RumourEval   2.06    1.93    1.66    4.11    -0.16   0.08
D-value on PHEME        2.15    1.88    1.79    3.43    -0.04   -0.02

(b) Different credibility of reviewers
Threshold               <0.2    <0.3    <0.4    <0.5    <0.6    <0.7
A (%) on RumourEval     74.48   76.32   77.90   78.79   80.12   82.54
A (%) on PHEME          75.31   76.57   78.43   79.10   80.18   82.46
D-value on RumourEval   1.70    1.84    1.58    0.89    1.33    2.42
D-value on PHEME        1.90    1.26    1.86    0.67    1.08    2.28

(c) Different credibility of comments
Threshold               <0.1    <0.2    <0.3    <0.4    <0.5    <0.6
A (%) on RumourEval     73.67   74.90   76.21   77.46   79.32   82.54
A (%) on PHEME          73.98   75.19   77.27   78.26   80.05   82.46
D-value on RumourEval   0.89    1.23    1.31    1.25    1.86    3.22
D-value on PHEME        0.77    1.21    2.08    0.99    1.79    2.41

Figure 3: The accuracy comparison of DTCA under different decision conditions. D-values represent the performance difference between the current decision condition and the previous decision condition.

                          A       P       R       F1
No (0%) comments
RumourEval    CaSa        72.78   67.03   72.87   69.83
              DTCA        72.78   67.03   72.87   69.83
PHEME         CaSa        73.21   71.26   74.74   72.96
              DTCA        73.21   71.26   74.74   72.96
50% comments
RumourEval    CaSa        76.42   70.21   76.78   73.35
              DTCA        78.64   73.43   80.06   76.60
PHEME         CaSa        77.65   74.18   78.11   76.09
              DTCA        79.18   75.24   80.66   77.86
All (100%) comments
RumourEval    CaSa        77.86   71.92   79.24   75.40
              DTCA        82.54   78.25   85.60   81.76
PHEME         CaSa        79.85   75.06   80.35   77.61
              DTCA        82.46   79.08   86.24   82.50

Table 3: The performance comparison of models with different numbers of comments.

Different decision conditions choose different comments as evidence to participate in model learning. By observing the resulting changes in verification performance, we can explain the process of evidence selection through the decision conditions. Specifically, we test different values (in the interval [0, 1]) as thresholds of the decision conditions so that DTE screens different comments. Figures 3(a), (b), and (c) present the influence of semantic similarity (simi), the credibility of reviewers (r_cred), and the credibility of comments (c_cred) on the performance of DTCA, where the maximum thresholds are set to 0.7, 0.7, and 0.6, respectively, because few comments remain when the decision thresholds exceed these values. We observe that:

• When simi is less than 0.4, the model improves continually, with an average performance gain of about 2% (broken lines) on the two datasets for every 0.1 increase in simi. In particular, DTCA earns its best performance when simi is set to 0.5 (<0.5), while further increases bring little improvement. This illustrates that DTCA can provide more credibility features under appropriate semantic similarity for verification.

• DTCA continues to improve as r_cred increases, which matches common sense: the more authoritative people are, the more credible their statements. Analogously, DTCA improves as c_cred increases. These trends show the reasonability of building the credibility of reviewers and comments from metadata.

• When simi is set to 0.5 (<0.5), r_cred to 0.7 (<0.7), and c_cred to 0.6 (<0.6), DTCA gains its biggest improvements, i.e., at least 3.43%, 2.28%, and 2.41% on the two datasets, respectively. At this point, we infer that the comments captured by the model contain the most powerful evidence for claim verification. That is, the optimal evidence is formed under the conditions of moderate semantic similarity, high reviewer credibility, and high comment credibility, which explains the selection process of evidence.

4.4.3 Explainability Analysis (RQ3)

To answer RQ3, we visualize the comments (evidence) captured by DTE and the key semantics learned by CaSa when the training of DTCA is optimized. Figure 4 depicts the results for a specific sample in PHEME, where, at the comment level, red arrows represent the captured evidence and grey arrows denote the unused comments; at the word level, darker shades indicate higher weights given to the corresponding words, representing higher attention. We observe that:

• In line with the optimization of DTCA, the comments finally captured by DTE contain abundant evidence that questions the claim, like 'presumably Muslim? How do you arrive at that?', 'but it should be confirmed 1st before speculating on air.', and 'false flag', proving the falsity of the claim (the label of the claim is false). This indicates that DTCA can effectively discover evidence to explain the results of claim verification.

[Figure 4 shows the claim "10 people have died in a shooting at the Paris HQ of French weekly Charlie Hebdo, reports say [URL]." (label: false) and its tree of comments, e.g., "@SkyNews presumably Muslim? How do you arrive at that?", "@MoonMetropolis @SkyNews but it should be confirmed 1st before speculating on air.", and "@SkyNews False Flag."; each comment is annotated with its decision-condition values (Simi, u_cred, c_cred).]

Figure 4: The visualization of a sample (labeled false) in PHEME by DTCA, where the captured evidence (red arrows) and the specific values of the decision conditions (blue) are presented by DTE, and the attention over different words (red shades) is obtained by CaSa.

Additionally, the captured comments share common characteristics: moderate semantic similarity (interval [0.40, 0.49]), high reviewer credibility (over 0.66), and high comment credibility (over 0.62). For instance, the values of the three characteristics for the evidence '@TimGriffiths85 where was that reported? #dickface' are 0.47, 0.66, and 0.71, respectively. These phenomena show that DTCA can give reasonable explanations for the captured evidence through the decision conditions of DTE, which visually reflects the interpretability of the DTCA method itself.

• At the word level, the evidence-related words 'presumably Muslim', 'made fun of', 'shooter', and 'isn't Islamist' in comments receive higher weights than the evidence-independent words 'surprised', 'confirmed 1st', and 'speculating', which illustrates that DTCA can grasp the key semantics of evidence. Moreover, 'weekly Charlie Hebdo' in the claim and 'Islamist' and 'Muhammad' in comments receive close attention, which relates to the background knowledge that Charlie Hebdo is a French satirical weekly magazine that often publishes bold satire on religion and politics. 'reports say' in the claim is questioned in the comments, like 'How do you arrive at that?' and 'false flag'. These observations visually demonstrate that DTCA can uncover the questionable and even false parts of claims.

4.4.4 Error Analysis

Table 4 reports the performance of DTCA on claims with different numbers of comments. We observe that DTCA achieves satisfactory performance on claims with more than 8 comments, while on claims with fewer than 8 comments it does not perform well, falling short of its best performance by at least 4.92% and 3.10% in accuracy on the two datasets, respectively. Two reasons might explain this: 1) a claim with few comments attracts limited attention, and its false parts are hard for the public to find; 2) DTCA is capable of capturing worthwhile semantics from multiple comments, but it is not well suited to verifying claims with few comments.

Claims                              Datasets     A (%)   P (%)   R (%)   F1 (%)
Claims with less than 3 comments    RumourEval   73.42   68.21   73.50   70.76
                                    PHEME        74.10   72.45   75.32   73.86
Claims with comments ∈ [3, 8]       RumourEval   75.33   69.16   75.07   71.99
                                    PHEME        77.26   74.67   79.03   76.79
Claims with more than 8 comments    RumourEval   80.25   75.61   83.45   79.34
                                    PHEME        80.36   75.52   84.33   79.68

Table 4: The performance comparison of DTCA on claims with different numbers of comments.

5 Conclusion

We proposed a novel framework combining a decision tree and neural attention networks to explore a transparent and interpretable way to discover evidence for explainable claim verification. The framework constructs a decision tree model to select comments with high credibility as evidence, and then designs co-attention networks that make the evidence and claims interact with each other to unearth the false parts of claims. Results on two public datasets demonstrate the effectiveness and explainability of this framework. In the future, we will extend the proposed framework by considering more context (metadata) information, such as time, storylines, and comment sentiment, to further enrich its explainability.

Acknowledgments

The research work is supported by the National Key Research and Development Program in China (2019YFB2102300); the World-Class Universities (Disciplines) and the Characteristic Development Guidance Funds for the Central Universities of China (PY3A022); Ministry of Education Fund Projects (18JZD022 and 2017B00030); Shenzhen Science and Technology Project (JCYJ20180306170836595); Basic Scientific Research Operating Expenses of Central Universities (ZDYF2017006); Xi’an Navinfo Corp. & Engineering Center of Xi’an Intelligence Spatial-temporal Data Analysis Project (C2020103); and Beilin District of Xi’an Science & Technology Project (GX1803). We would like to thank the anonymous reviewers for their insightful comments.

References

Oluwaseun Ajao, Deepayan Bhowmik, and Shahrzad Zargari. 2019. Sentiment aware fake news detection on online social networks. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2507–2511. IEEE.

Sinan Aral and Dean Eckles. 2019. Protecting elections from social media manipulation. Science, 365(6456):858–861.

Ramy Baly, Georgi Karadzhov, Dimitar Alexandrov, James Glass, and Preslav Nakov. 2018a. Predicting factuality of reporting and bias of news media sources. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3528–3539.

Ramy Baly, Mitra Mohtarami, James Glass, Lluis Marquez, Alessandro Moschitti, and Preslav Nakov. 2018b. Integrating stance detection and fact checking in a unified corpus. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 21–27.

Xu Chen, Yongfeng Zhang, Hongteng Xu, Yixin Cao, Zheng Qin, and Hongyuan Zha. 2018. Visually explainable recommendation. arXiv preprint arXiv:1801.10288.

Yi-Chin Chen, Zhao-Yang Liu, and Hung-Yu Kao. 2017. IKM at SemEval-2017 Task 8: Convolutional neural networks for stance detection and rumor verification. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 465–469.

Kyunghyun Cho, Bart Van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259.

Leon Derczynski, Kalina Bontcheva, Maria Liakata, Rob Procter, Geraldine Wong Sak Hoi, and Arkaitz Zubiaga. 2017. SemEval-2017 Task 8: RumourEval: Determining rumour veracity and support for rumours. arXiv preprint arXiv:1704.05972.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Mengnan Du, Ninghao Liu, and Xia Hu. 2018. Techniques for interpretable machine learning. arXiv preprint arXiv:1808.00033.

Sebastian Dungs, Ahmet Aker, Norbert Fuhr, and Kalina Bontcheva. 2018. Can rumour stance alone predict veracity? In Proceedings of the 27th International Conference on Computational Linguistics, pages 3360–3370.

Wolfgang Dvorak and Stefan Woltran. 2019. Complexity of abstract argumentation under a claim-centric view. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 2801–2808.

Alex Graves, Santiago Fernandez, and Jurgen Schmidhuber. 2005. Bidirectional LSTM networks for improved phoneme classification and recognition. In International Conference on Artificial Neural Networks, pages 799–804. Springer.

Nir Grinberg, Kenneth Joseph, Lisa Friedland, Briony Swire-Thompson, and David Lazer. 2019. Fake news on Twitter during the 2016 US presidential election. Science, 363(6425):374–378.

Tommi Grondahl and N Asokan. 2019. Text analysis in adversarial settings: Does deception leave a stylistic trace? ACM Computing Surveys (CSUR), 52(3):45.

Gisel Bastidas Guacho, Sara Abdali, Neil Shah, and Evangelos E Papalexakis. 2018. Semi-supervised content-based detection of misinformation via tensor embeddings. In 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), pages 322–325. IEEE.

David Gunning. 2017. Explainable artificial intelligence (XAI). Defense Advanced Research Projects Agency (DARPA), nd Web, 2.

Andreas Hanselowski, Hao Zhang, Zile Li, Daniil Sorokin, Benjamin Schiller, Claudia Schulz, and Iryna Gurevych. 2018. UKP-Athene: Multi-sentence textual entailment for claim verification. In Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), pages 103–108.


Xiangnan He, Tao Chen, Min-Yen Kan, and Xiao Chen. 2015. TriRank: Review-aware explainable recommendation by modeling aspects. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management, pages 1661–1670. ACM.

Dhruv Khattar, Jaipal Singh Goud, Manish Gupta, and Vasudeva Varma. 2019. MVAE: Multimodal variational autoencoder for fake news detection. In The World Wide Web Conference, pages 2915–2921. ACM.

Elena Kochkina, Maria Liakata, and Arkaitz Zubiaga. 2018. All-in-one: Multi-task learning for rumour verification. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3402–3413.

Jing Ma, Wei Gao, Shafiq Joty, and Kam-Fai Wong. 2019. Sentence-level evidence embedding for claim verification with hierarchical attention networks. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2561–2571.

Jing Ma, Wei Gao, and Kam-Fai Wong. 2018a. Detect rumor and stance jointly by neural multi-task learning. In Companion Proceedings of The Web Conference 2018, pages 585–593. International World Wide Web Conferences Steering Committee.

Jing Ma, Wei Gao, and Kam-Fai Wong. 2018b. Rumor detection on Twitter with tree-structured recursive neural networks. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1980–1989.

Yixin Nie, Haonan Chen, and Mohit Bansal. 2019. Combining fact extraction and verification with neural semantic matching networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6859–6866.

Kashyap Popat, Subhabrata Mukherjee, Andrew Yates, and Gerhard Weikum. 2018. DeClarE: Debunking fake news and false claims using evidence-aware deep learning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 22–32.

Martin Potthast, Johannes Kiesel, Kevin Reinartz, Janek Bevendorff, and Benno Stein. 2018. A stylometric inquiry into hyperpartisan and fake news. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 231–240.

Hannah Rashkin, Eunsol Choi, Jin Yea Jang, Svitlana Volkova, and Yejin Choi. 2017. Truth of varying shades: Analyzing language in fake news and political fact-checking. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2931–2937.

Wojciech Samek, Thomas Wiegand, and Klaus-Robert Muller. 2017. Explainable artificial intelligence: Understanding, visualizing and interpreting deep learning models. arXiv preprint arXiv:1708.08296.

Yan Shan. 2016. How credible are online product reviews? The effects of self-generated and system-generated cues on source credibility evaluation. Computers in Human Behavior, 55:633–641.

Kai Shu, Limeng Cui, Suhang Wang, Dongwon Lee, and Huan Liu. 2019. dEFEND: Explainable fake news detection. In KDD.

Kai Shu, Amy Sliva, Suhang Wang, Jiliang Tang, and Huan Liu. 2017. Fake news detection on social media: A data mining perspective. ACM SIGKDD Explorations Newsletter, 19(1):22–36.

Grigori Sidorov, Alexander Gelbukh, Helena Gomez-Adorno, and David Pinto. 2014. Soft similarity and soft cosine measure: Similarity of features in vector space model. Computacion y Sistemas, 18(3):491–504.

James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: A large-scale dataset for fact extraction and verification. arXiv preprint arXiv:1803.05355.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Svitlana Volkova, Kyle Shaffer, Jin Yea Jang, and Nathan Hodas. 2017. Separating facts from fiction: Linguistic models to classify suspicious and trusted news posts on Twitter. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 647–653.

Xiang Wang, Xiangnan He, Fuli Feng, Liqiang Nie, and Tat-Seng Chua. 2018. TEM: Tree-enhanced embedding model for explainable recommendation. In Proceedings of the 2018 World Wide Web Conference, pages 1543–1552. International World Wide Web Conferences Steering Committee.

Lianwei Wu and Yuan Rao. 2020. Adaptive interaction fusion networks for fake news detection. arXiv preprint arXiv:2004.10009.

Lianwei Wu, Yuan Rao, Haolin Jin, Ambreen Nazir, and Ling Sun. 2019. Different absorption from the same sharing: Sifted multi-task learning for fake news detection. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4636–4645.


Lianwei Wu, Yuan Rao, Ambreen Nazir, and Haolin Jin. 2020. Discovering differential features: Adversarial learning for information credibility evaluation. Information Sciences, 516:453–473.

Qiang Zhang, Aldo Lipani, Shangsong Liang, and Emine Yilmaz. 2019. Reply-aided detection of misinformation via Bayesian deep learning. In The World Wide Web Conference, pages 2333–2343. ACM.

Yongfeng Zhang and Xu Chen. 2018. Explainable recommendation: A survey and new perspectives. arXiv preprint arXiv:1804.11192.

Jie Zhou, Xu Han, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, and Maosong Sun. 2019. GEAR: Graph-based evidence aggregating and reasoning for fact verification. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 892–901.

Arkaitz Zubiaga, Maria Liakata, Rob Procter, Geraldine Wong Sak Hoi, and Peter Tolmie. 2016. Analysing how people orient to and spread rumours in social media by looking at conversational threads. PLoS ONE, 11(3):e0150989.


Element                               Standard of criterion     Credibility score
Whether the elements exist or not    verified: True            3
                                     geo: True                 3
                                     screen name: True         1
                                     profile image: True       2
The value of elements                followers: [0, 100)       2
                                     followers: [100, 500)     5
                                     followers: [500, ∞)       10
                                     friends: [0, 100)         1
                                     friends: [100, 200)       2
                                     friends: [200, ∞)         5
                                     favourites: [0, 100)      2
                                     favourites: [100, 200)    3
                                     favourites: [200, ∞)      5

Table 5: The credibility score of reviewers.

A The Details of Decision Conditions

The paper has introduced the details of the semantic similarity between claims and comments; here we introduce the details of the other two decision conditions.

Table 5 shows the scores of metadata related to reviewer credibility. For the elements under 'whether the elements exist or not', if the element is false, the score is zero. The credibility score of a reviewer (r_cred) is formulated as follows:

r\_cred = \frac{A_1}{B_1} \quad (11)

where A_1 denotes the accumulated score of all metadata related to reviewer credibility and B_1 is the total credibility score of reviewers.

Table 6 describes the credibility scores of comments. As with the reviewer credibility score, for the elements under 'whether the elements exist or not', if the element is false, the score is zero. The credibility score of a comment (c_cred) is formulated as follows:

c\_cred = \frac{A_2}{B_2} \quad (12)

where A_2 denotes the accumulated score of all metadata related to comment credibility and B_2 is the total credibility score of comments.
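A sketch of the reviewer score of Eq. (11) under the rubric of Table 5 follows; the comment score of Eq. (12) follows Table 6 analogously. The metadata keys are illustrative, and B_1 is taken here as the maximum attainable score, which is one reading of "the total credibility score".

```python
# A sketch of r_cred = A1 / B1 using the Table 5 rubric; metadata keys
# and the normalization by the maximum attainable score are assumptions.
def reviewer_credibility(meta):
    def bucket(n, cuts, pts):              # score a count against band cutoffs
        for cut, p in zip(cuts, pts):
            if n < cut:
                return p
        return pts[-1]

    a1 = (3 * bool(meta.get("verified")) + 3 * bool(meta.get("geo"))
          + 1 * bool(meta.get("screen_name")) + 2 * bool(meta.get("profile_image")))
    a1 += bucket(meta.get("followers", 0), [100, 500], [2, 5, 10])
    a1 += bucket(meta.get("friends", 0), [100, 200], [1, 2, 5])
    a1 += bucket(meta.get("favourites", 0), [100, 200], [2, 3, 5])
    b1 = 3 + 3 + 1 + 2 + 10 + 5 + 5        # maximum attainable score
    return a1 / b1
```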

B Algorithm of DTCA

The training procedure of DTCA is shown in Algorithm 1.

Element                               Standard of criterion                Credibility score
Whether the elements exist or not    geo: True                            3
                                     source: True                         3
                                     favorite the comment: True           1
The value of elements                the number of favorites: [0, 100)    5
                                     the number of favorites: [100, ∞)    7
                                     the length of content: [0, 10)       3
                                     the length of content: [10, ∞)       7

Table 6: The credibility score of comments.

Algorithm 1: Training Procedure of DTCA.

Require: Dataset S = {C_i, R_i, y}_1^T with T training samples, where C_i denotes one claim and R_i denotes its tree comments, i.e., R_i = r_1, r_2, ..., r_k (k differs across claims); the thresholds of the three decision conditions a_1, a_2, a_3; the evidence set E; model parameters Θ; learning rate ε.

1:  Initialize parameters;
2:  repeat
3:    for i = 1 to T do
        // Part 1: DTE
4:      Obtain a series of subtree sets S by depth-first search of the tree comments R_i, i.e., S = [S_1, S_2, ..., S_n];
5:      On each subtree S_i:
6:      for r_i in S_i do
7:        // The semantic similarity between comments and claims
8:        if simi(r_i, C_i) ∈ interval [a, b]:
9:          E = E + r_i;
10:       // The credibility of reviewers
11:       if r_cred(r_i) > a_2:
12:         E = E + r_i;
13:       // The credibility of comments
14:       if c_cred(r_i) > a_3:
15:         E = E + r_i;
16:     end for
17:     By traversing all subtrees, the final evidence set E that meets the conditions is captured.
18:     // Part 2: CaSa
19:     Compute word embeddings of the evidence E: X_e = BERT_embed(E);
20:     Compute word embeddings of the claim C_i: BERT_embed(C_i);
21:     Get representations of the evidence and the claim by Eq. (1-3), i.e., R^e and R^c;
22:     Get R^e_pool by max-pooling over R^e;
23:     Get the deep interaction semantics C of the claim attended by the evidence through integrating R^e_pool into the self-attention networks by Eq. (4-7);
24:     Get the deep interaction semantics E of the evidence attended by the claim through integrating C into the self-attention networks by Eq. (4-7);
25:     Get the fused vector of evidence and claim by Eq. (8);
26:     Compute the loss L(Θ) using Eq. (9, 10);
27:     Compute the gradient ∇(Θ);
28:     Update the model: Θ ← Θ − ε∇(Θ);
29:   end for
30:   Update parameters a_1, a_2, a_3;
31: until a_1 = 0.8, a_2 = 0.8, and a_3 = 0.7.