Two-path Deep Semi-supervised Learning for Timely Fake News Detection

Xishuang Dong, Uboho Victor, and Lijun Qian, Senior Member, IEEE

Abstract—News on social media such as Twitter is generated in high volume and at high speed. However, very little of it is labeled (as fake or true news) by professionals in near real time. To achieve timely detection of fake news in social media, a novel framework of two-path deep semi-supervised learning is proposed, where one path is for supervised learning and the other is for unsupervised learning. The supervised learning path learns on the limited amount of labeled data, while the unsupervised learning path is able to learn on a huge amount of unlabeled data. Furthermore, these two paths, implemented with convolutional neural networks (CNN), are jointly optimized to complete semi-supervised learning. In addition, we build a shared CNN to extract low-level features from both labeled and unlabeled data and feed them into these two paths. To verify this framework, we implement a Word CNN based semi-supervised learning model and test it on two datasets, namely, LIAR and PHEME. Experimental results demonstrate that the model built on the proposed framework can recognize fake news effectively with very few labeled data.

Index Terms—Fake News Detection, Deep Semi-supervised Learning, Convolutional Neural Networks, Joint Optimization

I. INTRODUCTION

Social media (e.g., Twitter and Facebook) has become a new ecosystem for spreading news [1]. Nowadays, people rely more on social media services than on traditional media because of advantages such as social awareness, global connectivity, and real-time sharing of digital information. Unfortunately, social media is full of fake news. Fake news consists of information that is intentionally and verifiably false and is meant to mislead readers, motivated by the pursuit of personal or organizational profit [2]. For example, fake news propagated on Twitter like an infectious virus during the 2016 election cycle in the United States [3], [4]. Understanding what can be done to discourage fake news is of great importance.

One of the fundamental steps to discourage fake news is timely fake news detection. Fake news detection [5], [6], [7] determines the truthfulness of news by analyzing the news contents and related information such as propagation patterns. This problem has attracted a lot of attention from different perspectives, and supervised learning based fake news detection dominates this domain. For instance, Ma et al. detect fake news with data representations of the contents that are learned on labeled news [8]. Early detection is also an effective approach to recognizing fake news by identifying the signature of text phrases in social media posts [9]. Moreover, temporal features play a crucial role in the fast-paced social media environment because information spreads more rapidly than in traditional media [10]. For example, detecting a burst of topics on social media can capture the variations in temporal patterns of news [11], [12]. In particular, deep learning based fake news detection achieves state-of-the-art performance on different datasets [13], where both recurrent neural networks (RNN) and convolutional neural networks (CNN) are employed to recognize fake news [6], [14], [15]. However, since news spreads on social media at very high speed when an event happens, only very limited labeled data is available in practice for fake news detection, which is inadequate for a supervised model to perform well.

X. Dong, U. Victor, and L. Qian are with the Center of Excellence in Research and Education for Big Military Data Intelligence (CREDIT Center), Department of Electrical and Computer Engineering, Prairie View A&M University, Texas A&M University System, Prairie View, TX 77446, USA. Email: [email protected], [email protected], [email protected]

As an emerging task in the field of natural language processing (NLP), fake news detection requires large amounts of labeled data to build supervised learning based detection models. However, annotating the news on social media is too expensive and costs a huge amount of human labor due to the huge size of social media data. Furthermore, this is almost impossible to achieve in near real time. In addition, it is difficult to ensure annotation consistency when labeling big data [16], and the inconsistency becomes worse as the data size grows. Therefore, using unlabeled data to enhance fake news detection becomes a promising and increasingly urgent solution.

In this paper, we propose a deep semi-supervised learning framework that builds two-path convolutional neural networks to accomplish timely fake news detection in the case of limited labeled data. The framework is shown in Figure 1. It consists of three components, namely, a shared CNN, a supervised CNN, and an unsupervised CNN. One path is composed of the shared CNN and the supervised CNN, while the other is made of the shared CNN and the unsupervised CNN. Moreover, the architectures of these three CNNs can be similar or different, depending on the application and the required performance. All data (labeled and unlabeled) are used to generate the mean squared error loss, while only labeled data are used to calculate the cross-entropy loss. A weighted sum of these two losses is then used to optimize the proposed framework. We validate the proposed framework on detecting fake news using two datasets, namely, LIAR [6] and PHEME [17]. Experimental results demonstrate the effectiveness of the proposed framework even with very limited labeled data.

In summary, the contributions of this study are as follows:

• We propose a novel two-path deep semi-supervised learning (TDSL) framework containing three CNNs, where both labeled data and unlabeled data are used jointly to train the model and enhance the detection performance.

• We implement a Word CNN [18] based TDSL to detect fake news with limited labeled data and compare its performance with various deep learning based baselines. Specifically, we validate the implemented model by testing on the LIAR and PHEME datasets. It is observed that the proposed model detects fake news effectively even with extremely limited labeled data.

• The proposed framework could be applied to address other tasks. Furthermore, novel deep semi-supervised learning models can be implemented based on the proposed framework with various designs of CNNs, which will be determined by the intended applications and tasks.

arXiv:2002.00763v1 [cs.CL] 31 Jan 2020

[Figure 1: block diagram. The input $x_i$ passes through a word embedding and the shared CNN (convolutional and pooling layers), then through the supervised CNN (output $z_i$) and the unsupervised CNN (output $z'_i$). Softmax on $z_i$ gives $y'_i$ and, with $y_i$, the cross-entropy loss $l_i$ (only for labeled data); $z_i$ and $z'_i$ give the mean squared error loss $l'_i$ (for all data). A weighted sum of the losses with weight $w(t)$ is passed to the optimizer.]

Fig. 1. Framework of two-path deep semi-supervised learning. Samples $x_i$ are inputs. Labels $y_i$ are available only for the labeled inputs and the associated cross-entropy loss component is evaluated only for those. $z_i$ and $z'_i$ are outputs from the supervised CNN and the unsupervised CNN, respectively. $y'_i$ is the predicted label for $x_i$. $l_i$ is the cross-entropy loss and $l'_i$ is the mean squared error loss. $w(t)$ are the weights for joint optimization of $l_i$ and $l'_i$.

II. RELATED WORK

Fake news detection has attracted a lot of attention in recent years. Existing studies fall broadly into content based methods and propagation pattern based methods. Content based methods typically involve two steps: preprocessing the news contents and training a supervised learning model on the preprocessed contents. The first step usually involves tokenization, stemming, and/or weighting words [19], [20]. In the second step, Term Frequency-Inverse Document Frequency (TF-IDF) [21], [22] may be employed to build samples to train supervised learning models. However, the samples generated by TF-IDF are sparse, especially for social media data. To overcome this challenge, word embedding methods such as word2vec [23] and GloVe [24] are used to convert words into vectors.

In addition, Mihalcea et al. [25] used linguistic inquiry and word count (LIWC) [26] to explore the difference in word usage between deceptive and non-deceptive language. In particular, deep learning based models have been explored more than other supervised learning models [13]. For example, Rashkin et al. [27] built the detection model with two LSTM RNN models: one learns on simple word embeddings, and the other enhances the performance by concatenating long short-term memory (LSTM) outputs with LIWC feature vectors. Doc2vec [28] is also applied to represent content that is related to each social engagement. Attention based RNNs are employed to achieve better performance as well. Long et al. [15] incorporate the speaker names and the statement topics into the inputs to the attention based RNN. In addition, convolutional neural networks are also widely used since they succeed in many text classification tasks. Karimi et al. [29] proposed the Multi-source Multi-class Fake news Detection framework (MMFD), where a CNN analyzes local patterns of each text in a claim and an LSTM analyzes temporal dependencies in the entire text.

In propagation pattern based methods, the propagation patterns are extracted from time-series information of news spreading on social media such as Twitter and are used as features for detecting fake news. For instance, to identify fake news from microblogs, Ma et al. [30] proposed the Dynamic Series-Time Structure (DSTS) to capture variations in social context features such as microblog contents and users over time for early detection of rumors. Lan et al. [31] proposed the Hierarchical Attention RNN (HARNN), which uses a Bi-GRU layer with the attention mechanism to capture high-level representations of rumor contents, and a gated recurrent unit (GRU) layer to extract semantic changes. Hashimoto et al. [32] visualized topic structures in time-series variation and sought help from external reliable sources to determine the topic truthfulness.

In summary, most of the current methods are based on supervised learning, which requires a large amount of labeled data to implement the detection processes, especially for the deep learning based approaches. However, annotating the news on social media is too expensive and costs a huge amount of human labor due to the huge size of social media data. Furthermore, this is almost impossible to achieve in near real time. Even with labeled data, constructing a huge labeled corpus is an extremely difficult task in the field of natural language processing, as it consumes a large volume of resources and it is challenging to guarantee label consistency. Therefore, it is imperative to incorporate unlabeled data together with labeled data in fake news detection to enhance the detection performance. Semi-supervised learning [33], [34] is a technique that is able to use both labeled data and unlabeled data. Therefore, we propose a novel framework of deep semi-supervised learning via convolutional neural networks for timely fake news detection. In the next section, we present the proposed framework and show how to implement the deep semi-supervised learning model in detail.

Algorithm 1 Learning in the proposed framework
Require: $x_i$ = training sample
Require: $S$ = set of labeled training samples
Require: $y_i$ = label for labeled $x_i$, $i \in S$
Require: $f_{\mathrm{embedding}}(x)$ = word embedding
Require: $f_{\theta_{\mathrm{shared}}}(x)$ = shared CNN with trainable parameters $\theta_{\mathrm{shared}}$
Require: $f_{\theta_{\mathrm{sup}}}(x)$ = supervised CNN with trainable parameters $\theta_{\mathrm{sup}}$
Require: $f_{\theta_{\mathrm{unsup}}}(x)$ = unsupervised CNN with trainable parameters $\theta_{\mathrm{unsup}}$
Require: $w(t)$ = unsupervised weight ramp-up function
1: for $t$ in $[1, \mathrm{num\_epochs}]$ do
2:   for each minibatch $B$ do
3:     $x'_{i \in B} \leftarrow f_{\mathrm{embedding}}(x_{i \in B})$   ▷ represent words with word embedding
4:     $z_{i \in B} \leftarrow f_{\theta_{\mathrm{sup}}}(f_{\theta_{\mathrm{shared}}}(x'_{i \in B}))$   ▷ evaluate supervised CNN outputs for inputs
5:     $z'_{i \in B} \leftarrow f_{\theta_{\mathrm{unsup}}}(f_{\theta_{\mathrm{shared}}}(x'_{i \in B}))$   ▷ evaluate unsupervised CNN outputs for inputs
6:     $l_{i \in B} \leftarrow -\frac{1}{|B|} \sum_{i \in B \cap S} \log f_{\mathrm{softmax}}(z_i)[y_i]$   ▷ supervised loss component
7:     $l'_{i \in B} \leftarrow \frac{1}{C|B|} \sum_{i \in B} \| z_i - z'_i \|^2$   ▷ unsupervised loss component
8:     $\mathrm{loss} \leftarrow l_{i \in B} + w(t) \times l'_{i \in B}$   ▷ total loss
9:     update $\theta_{\mathrm{shared}}, \theta_{\mathrm{sup}}, \theta_{\mathrm{unsup}}$ using, e.g., ADAM   ▷ update parameters
return $\theta_{\mathrm{shared}}, \theta_{\mathrm{sup}}, \theta_{\mathrm{unsup}}$

III. MODEL

We propose a general framework of deep semi-supervised learning and apply it to accomplish fake news detection. Suppose the training data consist of a total of $N$ inputs, out of which $M$ are labeled. The inputs, denoted by $x_j$, where $j \in \{1, \ldots, N\}$, are the news contents that contain sentences related to fake news. In general, news on social media contains a limited number of words, typically 100 or fewer, which leads to data sparsity if we apply TF-IDF to extract features. To relieve the data sparsity problem, we employ word embedding techniques, for instance, word2vec [23], [35], [36], to represent the news contents. Here $S$ is the set of labeled inputs, $|S| = M$. For every $i \in S$, we have a known correct label $y_i \in \{1, \ldots, C\}$, where $C$ is the number of different classes.

The proposed framework of deep semi-supervised learning and the corresponding learning procedure are shown in Figure 1 and Algorithm 1, respectively. As shown in Figure 1, we evaluate the network for each training input $x_i$ with the supervised path and the unsupervised path to complete two tasks, resulting in prediction vectors $z_i$ and $z'_i$, respectively. One task is to learn how to mine patterns of fake news with respect to the news labels, while the other is to optimize the representations of news without the news labels. Specifically, before these two paths, there is a shared CNN that extracts low-level features to feed the two later CNNs. This is similar to deep multi-task learning [37], [38], since the low-level features are shared across the different tasks [39], [40]. The major difference between the proposed framework and deep multi-task learning is that tasks in the proposed framework involve both supervised learning and unsupervised learning, while all tasks in deep multi-task learning are based only on supervised learning.

In addition, these two paths can have independent CNNs with identical or different setups for supervised learning and unsupervised learning, respectively. They generate two prediction vectors that are new representations of the inputs with respect to their tasks. For identical setups of these two paths, i.e., using the same CNN structure for both paths, it is important to notice that, because of dropout regularization, training the CNNs in these two paths is a stochastic process. This results in the two CNNs having different link weights and filters during training. It implies that there will be a difference between the two prediction vectors $z_i$ and $z'_i$ of the same input $x_i$. Given that the original input $x_i$ is the same, this difference can be seen as an error, and thus minimizing the mean squared error (MSE) is a reasonable objective in the learning procedure.

We utilize those two vectors $z_i$ and $z'_i$ to calculate the loss given by

$$\mathrm{Loss} = -\frac{1}{|B|} \sum_{i \in B \cap S} \log f_{\mathrm{softmax}}(z_i)[y_i] + w(t) \times \frac{1}{C|B|} \sum_{i \in B} \| z_i - z'_i \|^2, \tag{1}$$

where $B$ is the minibatch in the learning process. The loss consists of two components. As illustrated in Algorithm 1, $l_i$ is the standard cross-entropy loss, evaluated for labeled inputs only. On the other hand, $l'_i$, evaluated for all inputs, penalizes different predictions for the same training input $x_i$ by taking the mean squared error between $z_i$ and $z'_i$. To combine the supervised loss $l_i$ and the unsupervised loss $l'_i$, we scale the latter by a time-dependent weighting function $w(t)$ [41] that ramps up, starting from zero, along a Gaussian curve. In the beginning of training, the total loss and the learning gradients are dominated by the supervised loss component, i.e., by the labeled data only. Unlabeled data contribute more than the labeled data at later stages of training.

[Figure 2: diagram of the Word CNN based model. The input $x_i$ is mapped by the word embedding to $x'_i = e_1 \oplus e_2 \oplus \cdots \oplus e_n$ (word vectors of dimension $k$), which feeds three branches of convolutional and pooling layers in the shared CNN. The supervised CNN produces $r_1$, $r_2$, $r_3$ (concatenated into $z_i$) and the unsupervised CNN produces $r'_1$, $r'_2$, $r'_3$ (concatenated into $z'_i$). The loss computation and optimizer are as in Figure 1.]

Fig. 2. Word-CNN Based Deep Semi-supervised Learning. In the shared CNN, each convolutional layer contains 100 (3 × 3) filters, 100 (4 × 4) filters, and 100 (5 × 5) filters, respectively. Both the supervised CNN and the unsupervised CNN have the same architecture as the shared CNN but with different numbers of filters, where each convolutional layer contains 100 (3 × 3) filters. We use (2 × 2) max-pooling for all pooling layers. ⊕ is the concatenation operator. $r_1$, $r_2$ and $r_3$ are outputs from the supervised path while $r'_1$, $r'_2$ and $r'_3$ are those from the unsupervised path. Furthermore, we concatenate $r_1$, $r_2$ and $r_3$ to construct $z_i$ and concatenate $r'_1$, $r'_2$ and $r'_3$ to generate $z'_i$.
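As a rough illustration of Equation (1) and the ramp-up weight, the following standard-library Python sketch computes the combined loss for one minibatch. Several details are assumptions rather than the authors' settings: the function names `gaussian_rampup` and `tdsl_loss`, the ramp-up length of 80 epochs, the maximum weight `w_max`, the specific curve exp(-5(1 - p)^2) (the form commonly used with the Π model), and the use of `None` to mark unlabeled samples. Each $z_i$ is assumed to have $C$ entries so that the softmax can be indexed by the label.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def gaussian_rampup(epoch, rampup_epochs=80, w_max=1.0):
    """Assumed schedule for w(t): 0 at epoch 0, then a Gaussian-shaped rise to w_max."""
    if epoch <= 0:
        return 0.0
    p = min(epoch / rampup_epochs, 1.0)
    return w_max * math.exp(-5.0 * (1.0 - p) ** 2)


def tdsl_loss(z_sup, z_unsup, labels, epoch, **rampup_kwargs):
    """Cross-entropy on labeled samples plus w(t) times the MSE between the two paths.

    z_sup, z_unsup: lists of C-dimensional vectors (one per sample in the minibatch).
    labels: class index for labeled samples, None for unlabeled ones (assumed convention).
    """
    batch_size = len(z_sup)
    num_classes = len(z_sup[0])

    # Supervised component: summed over labeled samples only, divided by |B|.
    cross_entropy = 0.0
    for z, y in zip(z_sup, labels):
        if y is not None:
            cross_entropy -= math.log(softmax(z)[y])
    cross_entropy /= batch_size

    # Unsupervised component: squared distance over all samples, divided by C|B|.
    squared = sum(
        (a - b) ** 2
        for z, z_prime in zip(z_sup, z_unsup)
        for a, b in zip(z, z_prime)
    )
    mse = squared / (num_classes * batch_size)

    return cross_entropy + gaussian_rampup(epoch, **rampup_kwargs) * mse
```

With this schedule the unsupervised term is switched off at epoch 0 and reaches its full weight once `epoch` equals `rampup_epochs`, which matches the qualitative behavior described in the text.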

Based on the proposed framework, we implement a deep semi-supervised learning model with Word CNN [18], where the architecture is shown in Figure 2. The Word CNN is a powerful classifier with a simple CNN architecture for sentence classification [18]. Considering fake news detection, let $e_i \in \mathbb{R}^k$ be the $k$-dimensional word vector corresponding to the $i$-th word of the sentence in the news. A sentence of length $n$ is represented as

$$x'_i = e_1 \oplus e_2 \oplus \cdots \oplus e_n, \tag{2}$$

where ⊕ is the concatenation operator. A convolution operation involves a filter $c \in \mathbb{R}^{hk}$, which is applied to a window of $h$ words to produce a new feature. The pooling operation deals with variable sentence lengths. As shown in Figure 2, the input $x_i$ is represented as $x'_i$ with the word embedding¹. Then the representation $x'_i$ is input into three convolutional layers followed by pooling layers to extract low-level features in the shared CNN. The extracted low-level features are given to the supervised CNN and the unsupervised CNN, respectively. Moreover, the outputs from the pooling layers in these two paths are concatenated as two vectors $z_i$ and $z'_i$, respectively. For the supervised path, we use the label $y_i$ and the result generated by running softmax on $z_i$ to build the supervised loss with the cross entropy. On the other hand, the two vectors $z_i$ and $z'_i$ are employed to construct the unsupervised loss with the mean squared error. Finally, the supervised loss and the unsupervised loss are jointly optimized with the Adam optimizer, as shown in Algorithm 1.

¹ https://www.tensorflow.org/api_docs/python/tf/nn/embedding_lookup
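To make the sentence representation in Equation (2) and the filter over a window of $h$ words concrete, here is a minimal standard-library Python sketch. It is an illustration under stated assumptions, not the implemented model: the helper names are hypothetical, bias and non-linearity are omitted, and the pooling shown is the global max-over-time pooling of the original Word CNN rather than the (2 × 2) max-pooling listed in Figure 2. It also assumes the sentence has at least $h$ words.

```python
def concat(word_vectors):
    """x' = e_1 (+) e_2 (+) ... (+) e_n: flatten n word vectors of size k into one list."""
    return [value for vector in word_vectors for value in vector]


def conv_features(word_vectors, filt, h):
    """Slide a filter of length h*k over every window of h consecutive words."""
    k = len(word_vectors[0])
    n = len(word_vectors)
    assert len(filt) == h * k, "filter must cover exactly h words of dimension k"
    x = concat(word_vectors)
    features = []
    for i in range(n - h + 1):
        window = x[i * k:(i + h) * k]  # words i .. i+h-1, concatenated
        features.append(sum(w * v for w, v in zip(filt, window)))
    return features


def max_over_time(features):
    """Collapse a variable-length feature map into a single value."""
    return max(features)


# Three words with k = 2 and a filter spanning h = 2 words.
sentence = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
feature_map = conv_features(sentence, filt=[1.0, 1.0, 1.0, 1.0], h=2)
pooled = max_over_time(feature_map)
```

A sentence of $n$ words yields $n - h + 1$ features per filter, and the pooling step reduces them to a fixed-size output, which is how sentences of different lengths end up with representations of the same dimension.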

Although there are a few related works in the literature, such as the Π model [41], there exist significant differences. Firstly, the Π model is designed for processing image data, while the proposed framework is more suitable for solving problems related to NLP. Secondly, the Π model combines image data augmentation with dropout to generate two outputs, which cannot be applied directly to NLP tasks since data augmentation for text contents is difficult to implement. Finally, instead of using a one-path CNN, we apply two independent CNNs to generate those two outputs. Furthermore, the proposed model is more flexible, as the two independent paths can be tuned for specific goals.

IV. EXPERIMENT

A. Datasets

To validate the effectiveness of the proposed framework on fake news detection, we test the implemented model on two benchmarks: LIAR and PHEME.

[Figure 3: five horizontal bar-chart panels titled Charlie Hebdo, Ferguson, Germanwings-crash, Ottawa shooting, and Sydney siege. In each panel the x-axis is labeled "TF-IDF (10-e3)" and the y-axis lists the top 50 words for that event.]

Fig. 3. An example of the difference in word distributions between five events in the PHEME dataset. The x-axis indicates the TF-IDF value of the word, while the y-axis shows the top 50 words ranked by their TF-IDF values.

1) LIAR: LIAR [6] is a recent benchmark dataset for fake news detection. This dataset includes 12,836 real-world short statements collected from PolitiFact², where editors handpicked the claims from a variety of occasions such as debates, campaigns, Facebook, Twitter, interviews, ads, etc. Each statement is labeled with one of six grades of truthfulness, namely, true, false, half-true, pants-fire, barely-true, and mostly-true. Information about the subjects, party, context, and speakers is also included in this dataset. The benchmark is split into three parts: a training dataset with 10,269 statements, a validation dataset with 1,284 statements, and a testing dataset with 1,283 statements. Furthermore, we reorganize the data into two classes by treating the five classes false, half-true, pants-fire, barely-true, and mostly-true as the Fake class and true as the True class. Therefore, fake news detection on this benchmark becomes a binary classification task.
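The two-class reorganization described above amounts to a simple label mapping, sketched here in Python. The exact label strings are an assumption about how the grades are spelled in the data files, and `to_binary` is a hypothetical helper name.

```python
# Grades treated as the Fake class; only "true" maps to the True class.
FAKE_GRADES = {"false", "half-true", "pants-fire", "barely-true", "mostly-true"}


def to_binary(label):
    """Map a six-grade truthfulness label to the binary Fake/True scheme."""
    normalized = label.strip().lower()
    if normalized == "true":
        return "True"
    if normalized in FAKE_GRADES:
        return "Fake"
    raise ValueError(f"unexpected label: {label!r}")


binary_labels = [to_binary(y) for y in ["true", "half-true", "mostly-true"]]
```

Because five of the six grades are merged into one class, the resulting binary task is likely to be imbalanced toward Fake, which is worth keeping in mind when reading accuracy figures.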

2) PHEME: We employ PHEME [17], which collects 330 Twitter threads, to verify the effectiveness of the proposed framework on social media data. Tweets in PHEME are labeled as true or false in terms of thread structures and follow-follower relationships. The PHEME dataset is related to nine events, whereas this paper focuses only on the five main events, namely, Germanwings-crash (GC), Charlie Hebdo (CH), Sydney siege (SS), Ferguson (FE), and Ottawa shooting (OS). It has different levels of annotations, such as thread level and tweet level. We adopt only the annotations on the tweet level and thus classify the tweets as fake or true. The detailed distribution of tweets and classes after preprocessing steps such as removing redundant data is shown in Table II. It is observed that the class distribution differs among these events. For example, the three events CH, SS, and FE have more true news, while the events GC and OS have more fake news. Furthermore, the class distributions for events such as CH, SS, and FE are significantly imbalanced, which is a challenge for the detection task (a binary classification task).

² https://www.politifact.com

In addition, the word distributions vary among the five events. One example is shown in Figure 3, where the top 50 words are ranked by their TF-IDF values and the TF-IDF values are used to evaluate the relevance of a word to the event [42]. It is observed that the top ranked words are very different for different events. Moreover, even for the same word, its relevance is not the same for different events. For example, the relevance (TF-IDF value) of the word "thank" differs between the events Charlie Hebdo (CH), Ferguson (FE), and Ottawa shooting (OS).
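A per-event ranking of this kind can be sketched in a few lines of Python. The text does not state which TF-IDF variant was used, so everything below is an assumption: each tweet is treated as a document, tokens are split on whitespace, the inverse document frequency is smoothed as log(N/df) + 1, and a word's score is its TF-IDF averaged over the tweets of one event. The function name `top_words_by_tfidf` is hypothetical.

```python
import math
from collections import Counter


def top_words_by_tfidf(tweets, top_k=50):
    """Rank the words of one event by their average TF-IDF over its tweets."""
    docs = [tweet.lower().split() for tweet in tweets]
    n_docs = len(docs)

    # Document frequency: number of tweets that contain each word.
    df = Counter(word for doc in docs for word in set(doc))

    scores = Counter()
    for doc in docs:
        if not doc:
            continue
        counts = Counter(doc)
        for word, count in counts.items():
            tf = count / len(doc)
            idf = math.log(n_docs / df[word]) + 1.0  # assumed smoothing
            scores[word] += tf * idf / n_docs
    return scores.most_common(top_k)


ranking = top_words_by_tfidf(["police police", "police shot"])
```

Running the same function on the tweets of each event separately and comparing the resulting lists gives the kind of per-event contrast that Figure 3 is meant to convey.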

In addition, we perform leave-one-event-out (LOEO) cross-validation [43], which is closer to the realistic scenario where the truthfulness of news about unseen events has to be verified. For example, the training data can contain the events GC, CH, SS, and FE, whereas the testing data contains the event OS. However, as highlighted in Figure 3, this fake news detection task becomes very difficult since the training data has a very different data distribution from that of the testing data.
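The LOEO protocol can be sketched as a small generator that holds out one event at a time. The tuple layout `(event, text, label)` and the function name are assumptions made for illustration.

```python
def leave_one_event_out(samples):
    """Yield (held_out_event, train, test) with one event held out per fold.

    samples: list of (event, text, label) tuples (assumed layout).
    """
    events = sorted({event for event, _, _ in samples})
    for held_out in events:
        train = [s for s in samples if s[0] != held_out]
        test = [s for s in samples if s[0] == held_out]
        yield held_out, train, test


# Toy data with three events; the texts and labels are placeholders.
toy = [
    ("GC", "tweet a", "fake"),
    ("GC", "tweet b", "true"),
    ("CH", "tweet c", "true"),
    ("OS", "tweet d", "fake"),
]
folds = list(leave_one_event_out(toy))
```

With the five PHEME events this yields five folds, and no tweet from the held-out event ever appears in the corresponding training set.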

TABLE I
HYPER-PARAMETERS FOR THE TRAINING OF WORD CNN BASED TDSL

| Hyper-parameters | Values |
|---|---|
| Dropout | 0.5 |
| Minibatch size | 128 |
| Number of epochs | 200 |
| Maximum learning rate | 1e-4 |

TABLE II
NUMBER OF TWEETS AND CLASS DISTRIBUTION IN THE PHEME DATASET.

| Events | Tweets | Fake | True |
|---|---|---|---|
| Germanwings-crash | 3,920 | 2,220 | 1,700 |
| Charlie Hebdo | 34,236 | 6,452 | 27,784 |
| Sydney siege | 21,837 | 7,765 | 14,072 |
| Ferguson | 21,658 | 5,952 | 15,706 |
| Ottawa shooting | 10,848 | 5,603 | 5,245 |
| Total | 92,499 | 27,992 | 64,507 |

It should be noted that the data in the original datasets is fully labeled. We employ all labeled data to train the baseline models. In contrast to the baselines, we train the proposed models with only partially labeled data, and the percentage of labeled data used for training our proposed models ranges from 1% to 30%.
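One way to emulate the partially labeled setting on a fully labeled dataset is to keep the labels of a small random subset and treat the rest as unlabeled. The sketch below assumes uniform random sampling with a fixed seed; how the labeled subset was actually drawn is not specified here, and the function name is hypothetical.

```python
import random


def split_labeled_unlabeled(n_samples, labeled_ratio, seed=0):
    """Return (labeled_indices, unlabeled_indices) for a given labeled ratio."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    # Keep at least one labeled sample so the supervised path has data.
    n_labeled = max(1, round(n_samples * labeled_ratio))
    labeled = sorted(indices[:n_labeled])
    unlabeled = sorted(indices[n_labeled:])
    return labeled, unlabeled


labeled_idx, unlabeled_idx = split_labeled_unlabeled(1000, 0.01)
print(len(labeled_idx), len(unlabeled_idx))
```

With imbalanced classes, a stratified draw may be preferable so that both classes appear in the labeled subset; that choice is left open in this sketch.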

B. Experiment Setup

The key hyper-parameters for the implemented model are shown in Table I, and we employ the Adam optimizer to train the implemented model (TDSL).

In addition, we employ five deep supervised learning models as baselines: 1) word-level CNN (Word CNN) [18], 2) character-level CNN (Char CNN) [44], 3) very deep CNN (VD CNN) [45], 4) attention-based bidirectional RNN (Att RNN) [46], and 5) recurrent CNN (RCNN) [47], all of which perform well on text classification. Specifically, Word CNN performs well on sentence classification, which makes it well suited to social media data, whose content is short like a sentence. In addition, we build 6) a word-level bidirectional RNN (Word RNN) for comparison with the implemented model. Word RNN contains one embedding layer and one bidirectional RNN layer, and all outputs of the RNN layer are concatenated and fed to the final fully-connected layer. Thus, there are 6 baseline models in total. Note that the baseline models use all labeled data from the original datasets.

C. Evaluation

We apply different metrics to evaluate the performance of fake news detection, according to the task features of the two benchmarks.

• LIAR: We employ accuracy, precision, recall, and Fscore to evaluate the detection performance. Accuracy is calculated by dividing the number of statements detected correctly by the total number of statements:

$$\mathrm{Accuracy} = \frac{N_{correct}}{N_{total}}. \tag{3}$$

In addition, we employ the Fscore of each class to check the performance, since the task is binary text classification:

$$\mathrm{Fscore} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \tag{4}$$

where Precision measures the capability of a model to flag only fake statements and Recall measures its ability to retrieve all fake statements:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \tag{5}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}. \tag{6}$$

Here, for a given class, TP (True Positive) counts the news items predicted as that class whose annotated label matches, FP (False Positive) counts the items predicted as that class whose annotated label differs, and FN (False Negative) counts the items annotated with that class that the model assigns to the other class.

• PHEME: Accuracy is one of the common evaluation metrics for measuring fake news detection performance on this dataset [13]. However, we also evaluate performance based on the Fscore, since our task on the PHEME dataset is binary text classification with imbalanced data. Specifically, as we perform leave-one-event-out cross-validation on the PHEME dataset, we utilize the macro-averaged Fscore [48] to evaluate the overall performance of detecting fake news across events:

$$\mathrm{MacroF} = \frac{1}{T} \sum_{t=1}^{T} \mathrm{Fscore}_t, \tag{7}$$

$$\mathrm{MacroP} = \frac{1}{T} \sum_{t=1}^{T} \mathrm{Precision}_t, \tag{8}$$

$$\mathrm{MacroR} = \frac{1}{T} \sum_{t=1}^{T} \mathrm{Recall}_t, \tag{9}$$

where $T$ denotes the total number of events and $\mathrm{Fscore}_t$, $\mathrm{Precision}_t$, and $\mathrm{Recall}_t$ are the Fscore, Precision, and Recall values for the $t$-th event. Additionally, we use the macro-averaged accuracy over the five events to examine performance:

$$\mathrm{MacroA} = \frac{1}{T} \sum_{t=1}^{T} \mathrm{Accuracy}_t. \tag{10}$$

The main goal of learning from imbalanced datasets is to improve recall without hurting precision. However, the two goals often conflict: when the number of true positives (TP) for the minority class (True) increases, the number of false positives (FP) can also increase, which reduces precision [49]. This means that when MacroP increases, MacroR may decrease in the case of PHEME.
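The metrics in Eqs. (3)-(10) can be computed as in the following sketch. It assumes binary labels with 1 as the positive class and returns 0 for a ratio whose denominator is zero, which is a convention chosen here rather than one stated in the text. The function names and the toy predictions are hypothetical.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and Fscore as in Eqs. (3)-(6)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    fn = sum(1 for t, p in pairs if p != positive and t == positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    # Assumed convention: an undefined ratio (zero denominator) counts as 0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    fscore = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "fscore": fscore}


def macro_average(per_event):
    """Average each metric over the T events as in Eqs. (7)-(10)."""
    t = len(per_event)
    keys = ("accuracy", "precision", "recall", "fscore")
    return {k: sum(m[k] for m in per_event.values()) / t for k in keys}


# Hypothetical predictions for two held-out events.
per_event = {
    "A": binary_metrics([1, 1, 1, 0, 0, 1, 0, 1], [1, 1, 0, 0, 1, 1, 0, 1]),
    "B": binary_metrics([1, 0, 0, 0], [0, 0, 0, 1]),
}
macro = macro_average(per_event)
print(per_event["A"])
print(macro)
```

Because the macro average weights every event equally, a small event with poor recall lowers MacroR as much as a large one would.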


D. Results and Discussion

We compare the two-path deep semi-supervised learning (TDSL) model implemented with Word CNN against the baselines on two datasets, LIAR and PHEME. In addition, we examine how different hyper-parameters affect performance. All evaluation results are averages over five runs.

1) LIAR: Table III presents the performance comparison on the LIAR dataset. Among the baselines, Word CNN outperforms the others with respect to accuracy, recall, and Fscore. With respect to precision, however, Att RNN is better than the other baselines. This means that the model should be selected according to the evaluation metric of concern. In addition, we present results generated by the proposed TDSL, where we use 1% and 30% of the labeled data, together with the remaining data as unlabeled data, to build the deep semi-supervised model. The performance improves as the amount of labeled data for training TDSL increases. Notably, even when we use only a small amount of labeled data, we still obtain acceptable performance. For example, when we use 1% labeled training data to construct the Word CNN based TDSL, its accuracy and Fscore are only about 6% and 3% lower, respectively, than those of Word CNN (79.81% versus 85.01% accuracy and 88.62% versus 91.02% Fscore). Moreover, the accuracy, recall, and Fscore of TDSL (1%) are better than those of several baselines, namely Char CNN, RCNN, Word RNN, and Att RNN. This indicates that a deep semi-supervised model built on the proposed framework is able to detect fake news even with little labeled data.

Table IV shows how the ratio of labeled data affects the detection performance of TDSL. As we increase the ratio of labeled data step by step, accuracy, recall, and Fscore improve significantly, while precision remains relatively stable. Moreover, when we use 30% labeled data, the performance is similar to the state of the art obtained with Word CNN. Precision even decreases slightly as the ratio of labeled data increases. The supervised loss, cross entropy, is defined in terms of prediction accuracy. Therefore, adding more labeled data to the training procedure yields higher accuracy, but there is no guarantee that precision will increase.

Additionally, we examine the effects of different hyper-parameters on performance in Figures 4, 5, and 6. In Figure 4, we focus on the effects of different batch sizes. When we train the model with less labeled data, the batch size affects the performance more significantly. For instance, the performance hardly changes with 30% labeled data, whereas it varies with only 1% labeled data. This is because more ground-truth information is embedded in the training samples when the batch size is large, which results in more robust prediction.

For Figures 5 and 6, we examine the effects of different embedding sizes and learning rates on performance. In Figure 5, the observations are similar to those for batch size. For example, with more labeled data for training, the embedding size does not affect the performance significantly. With less labeled data, in contrast, the embedding size should be chosen carefully because different embedding sizes lead to different performance. In addition, a larger embedding size enhances the performance. In Figure 6, we observe no significant difference between learning rates in the cases of 10% and 30% labeled data, while only a small performance difference can be observed in the case of 1% labeled data.

TABLE III
COMPARING PERFORMANCE BETWEEN BASELINES AND PROPOSED MODEL (TDSL) ON LIAR DATASETS. THE BASELINES, NAMELY, WORD CNN, CHAR CNN, VD CNN, RCNN, WORD RNN, AND ATT RNN, ARE BUILT WITH THE TRAINING DATA THAT IS FULLY LABELED. ON THE CONTRARY, WE ONLY APPLY 1% AND 30% LABELED TRAINING DATA AND REST OF UNLABELED TRAINING DATA TO ACCOMPLISH LEARNING OF THE PROPOSED MODEL.

| Model | Accuracy | Precision | Recall | Fscore |
|---|---|---|---|---|
| Word CNN [18] | 85.01% | 83.57% | 99.94% | 91.02% |
| Char CNN [44] | 77.93% | 84.09% | 91.41% | 87.59% |
| VD CNN [45] | 83.77% | 83.56% | 99.88% | 91.00% |
| RCNN [47] | 79.30% | 84.19% | 89.66% | 86.81% |
| Word RNN | 72.44% | 84.19% | 81.52% | 82.79% |
| Att RNN [46] | 75.90% | 84.26% | 86.79% | 85.52% |
| TDSL (1%) | 79.81% | 83.62% | 94.30% | 88.62% |
| TDSL (30%) | 83.36% | 83.59% | 99.64% | 90.91% |

TABLE IV
COMPARING PERFORMANCES GENERATED BY PROPOSED MODEL (TDSL) LEARNING ON DIFFERENT RATIOS OF LABELED TRAINING DATA AND REST OF UNLABELED TRAINING DATA.

| Ratio of Labeled Data | Accuracy | Precision | Recall | Fscore |
|---|---|---|---|---|
| 1% | 79.81% | 83.62% | 94.30% | 88.62% |
| 3% | 80.71% | 83.69% | 95.54% | 89.21% |
| 5% | 80.74% | 83.70% | 95.59% | 89.23% |
| 8% | 82.24% | 83.56% | 98.02% | 90.21% |
| 10% | 82.52% | 83.58% | 98.41% | 90.39% |
| 30% | 83.36% | 83.59% | 99.84% | 90.91% |

2) PHEME: Compared to the LIAR dataset, the PHEME dataset introduces new challenges such as imbalanced class distributions and differences in word distribution among events. It therefore leads to different observations from the LIAR case. Table V presents the performance comparison on the PHEME dataset. Among the baselines, VD CNN outperforms the other baselines with respect to MacroA,

TABLE V
COMPARING PERFORMANCE BETWEEN BASELINES AND PROPOSED MODEL (TDSL) ON PHEME DATASETS. THE BASELINES, NAMELY, WORD CNN, CHAR CNN, VD CNN, RCNN, WORD RNN, AND ATT RNN, ARE BUILT WITH THE TRAINING DATA THAT IS FULLY LABELED. ON THE CONTRARY, WE ONLY APPLY 1% AND 30% LABELED TRAINING DATA AND REST OF UNLABELED TRAINING DATA TO ACCOMPLISH LEARNING OF THE PROPOSED MODEL.

| Model | MacroA | MacroP | MacroR | MacroF |
|---|---|---|---|---|
| Word CNN [18] | 61.75% | 50.82% | 17.60% | 24.03% |
| Char CNN [44] | 63.68% | 50.66% | 19.91% | 26.73% |
| VD CNN [45] | 65.42% | 49.21% | 30.04% | 28.50% |
| RCNN [47] | 60.62% | 45.86% | 16.40% | 22.24% |
| Word RNN | 59.70% | 45.57% | 22.89% | 28.22% |
| Att RNN [46] | 60.32% | 45.58% | 25.49% | 31.15% |
| TDSL (1%) | 56.19% | 38.83% | 18.73% | 21.13% |
| TDSL (30%) | 60.64% | 41.14% | 4.77% | 6.75% |


Fig. 4. Performance obtained with three batch sizes, 128, 256, and 512, on three ratios of labeled data, namely 1%, 10%, and 30% (one panel per ratio). The x-axis shows the evaluation metrics (Accuracy, Precision, Recall, Fscore), while the y-axis shows performance. Bar colors indicate the batch size: green for 128, blue for 256, and red for 512.

Fig. 5. Performance obtained with three embedding sizes, 64, 128, and 256, on three ratios of labeled data, namely 1%, 10%, and 30% (one panel per ratio). The x-axis shows the evaluation metrics (Accuracy, Precision, Recall, Fscore), while the y-axis shows performance. Bar colors indicate the embedding size: green for 64, blue for 128, and red for 256.

TABLE VI
COMPARING PERFORMANCES GENERATED BY PROPOSED MODEL (TDSL) LEARNING ON DIFFERENT RATIOS OF LABELED TRAINING DATA FROM PHEME DATASETS.

| Labeled Ratio | MacroA | MacroP | MacroR | MacroF |
|---|---|---|---|---|
| 1% | 56.19% | 38.83% | 18.73% | 21.13% |
| 3% | 58.58% | 39.38% | 13.12% | 17.83% |
| 5% | 58.40% | 39.18% | 12.58% | 16.31% |
| 8% | 59.74% | 40.48% | 8.11% | 11.18% |
| 10% | 59.84% | 40.38% | 7.08% | 10.49% |
| 30% | 60.64% | 41.14% | 4.77% | 6.75% |

MacroR, and MacroF. Considering MacroP, however, Word CNN is better than the other baselines. In addition, we observe that the performance (MacroA) improves as the ratio of labeled data for training TDSL increases. Moreover, even when we use only a small amount of labeled data, we still obtain acceptable performance. For example, when we use 1% labeled training data to construct the Word CNN based TDSL, compared with VD CNN, its MacroA and MacroF are reduced by only about

TABLE VII
COMPARING PERFORMANCE WITH DIFFERENT BATCH SIZES ON PHEME DATASETS. WE CHOOSE THREE CASES OF RATIOS OF LABELED TRAINING DATA, NAMELY, 1%, 10%, AND 30%.

| Labeled data | Batch size | MacroA | MacroP | MacroR | MacroF |
|---|---|---|---|---|---|
| 1% | 128 | 56.19% | 38.83% | 18.73% | 21.13% |
| 1% | 256 | 57.78% | 39.72% | 18.50% | 23.52% |
| 1% | 512 | 57.78% | 39.07% | 15.73% | 20.43% |
| 10% | 128 | 59.84% | 40.38% | 7.08% | 10.49% |
| 10% | 256 | 60.08% | 40.46% | 8.55% | 12.49% |
| 10% | 512 | 59.02% | 40.81% | 12.96% | 17.18% |
| 30% | 128 | 60.64% | 41.14% | 4.77% | 6.75% |
| 30% | 256 | 60.59% | 41.50% | 7.09% | 10.77% |
| 30% | 512 | 59.94% | 42.77% | 8.51% | 12.27% |


Fig. 6. Performance obtained with two learning rates, 1e-3 and 1e-4, on three ratios of labeled data, namely 1%, 10%, and 30% (one panel per ratio). The x-axis shows the evaluation metrics (Accuracy, Precision, Recall, Fscore), while the y-axis shows performance. Bar colors indicate the learning rate: blue for 1e-3 and red for 1e-4.

TABLE VIII
COMPARING PERFORMANCE WITH DIFFERENT EMBEDDING SIZES ON PHEME DATASETS. WE CHOOSE THREE CASES OF RATIOS OF LABELED TRAINING DATA, NAMELY, 1%, 10%, AND 30%.

| Labeled data | Embedding size | MacroA | MacroP | MacroR | MacroF |
|---|---|---|---|---|---|
| 1% | 64 | 57.38% | 39.20% | 16.57% | 20.78% |
| 1% | 128 | 56.19% | 38.83% | 18.73% | 21.13% |
| 1% | 256 | 57.13% | 38.86% | 18.07% | 22.23% |
| 10% | 64 | 59.73% | 40.44% | 7.91% | 11.06% |
| 10% | 128 | 59.84% | 40.38% | 7.08% | 10.49% |
| 10% | 256 | 58.89% | 40.27% | 11.67% | 15.22% |
| 30% | 64 | 60.79% | 42.32% | 4.37% | 6.99% |
| 30% | 128 | 60.64% | 41.14% | 4.77% | 6.75% |
| 30% | 256 | 60.63% | 42.54% | 4.63% | 6.66% |

9% and 7%, respectively. However, MacroF decreases while MacroA increases as more labeled data is added for training. There are two reasons for this observation. One is that the learning of TDSL aims to optimize accuracy, not the Fscore. The other is that the distribution of the training data differs from that of the testing data, since we use the leave-one-event-out policy for validation, which breaks the assumption that the training data shares the same distribution as the testing data. The more labeled data there is, the more serious the distribution difference becomes.
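A small worked example shows how accuracy can rise while the Fscore falls on imbalanced data. The confusion counts below are hypothetical and are not taken from the PHEME experiments; they describe a test set of 100 items with 30 positives, scored once for a model that predicts the positive class often and once for a more conservative model.

```python
def scores_from_counts(tp, fp, fn, tn):
    """Accuracy, precision, recall, and Fscore from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    fscore = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "fscore": fscore}


# Hypothetical model A: predicts the positive class for 40 of 100 items.
liberal = scores_from_counts(tp=15, fp=25, fn=15, tn=45)
# Hypothetical model B: predicts the positive class for only 8 of 100 items.
conservative = scores_from_counts(tp=3, fp=5, fn=27, tn=65)

print(liberal)
print(conservative)
```

The conservative model is more accurate (0.68 versus 0.60) because it agrees with the majority class more often, yet its recall drops from 0.5 to 0.1 and its Fscore drops with it, which is the same direction of change as in the MacroA and MacroF columns discussed above.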

Similar to the case of LIAR, Table VI illustrates how the ratio of labeled data affects the detection performance. As we increase the ratio of labeled data step by step, MacroA improves as well, but MacroR and MacroF decrease significantly. The supervised loss, cross entropy, is defined in terms of prediction accuracy. Therefore, there is no guarantee that MacroR and MacroF will increase when more labeled data is added to the training procedure.

In Tables VII and VIII, we examine the performance differences when choosing different batch sizes and embedding sizes, respectively. We observe similar trends for MacroA and MacroP: neither changes much when different batch sizes and embedding sizes are chosen for building the proposed model. However, MacroR changes more significantly with batch size than with embedding size.

Moreover, Figures 7, 8, and 9 show the detailed performance for the five events when choosing different batch sizes to train the model on different ratios of labeled data. In the results shown in Figure 7, MacroA increases for the events Charlie Hebdo, Sydney siege, and Ferguson as the amount of labeled data increases, whereas for the events Germanwings-crash and Ottawa shooting, MacroA decreases. This is because involving more imbalanced classes in the training procedure reduces the performance. For MacroR and MacroF, the performance is reduced when the ratio of labeled data increases. Similar observations can be made from the results shown in Figures 8 and 9. The difference, however, is that larger batch sizes can reduce the effect that the batch size has on performance.

Finally, we examine the performance differences when choosing two different embedding sizes, namely 64 and 256, where the results are shown in Figures 10 and 11, respectively. MacroA and MacroP increase for the events Charlie Hebdo, Sydney siege, and Ferguson as the amount of labeled data for training increases, whereas for the events Germanwings-crash and Ottawa shooting, MacroA decreases. The reason is the same as in the case of batch size.

In summary, based on the aforementioned results, increasing the amount of labeled data for training TDSL enhances performance when the training data has a similar distribution to the testing data, as in the case of LIAR. In addition, for both benchmarks, we can


Fig. 7. Comparing detailed performance obtained with batch size 128 for the five events (one panel each: Charlie Hebdo, Ferguson, Germanwings-crash, Ottawa shooting, Sydney siege). The x-axis shows the evaluation metrics (MacroA, MacroP, MacroR, MacroF), while the y-axis shows performance. Bar colors indicate the ratio of labeled data: green for 1%, blue for 10%, and red for 30%.

obtain acceptable performance even when using extremely limited labeled data for training. However, we should pay more attention to choosing the ratio of labeled data when handling an imbalanced classification task, as in the case of PHEME. Meanwhile, the hyper-parameters should be chosen carefully in order to obtain reasonable performance.

V. CONCLUSION AND FUTURE WORK

In this paper, a novel framework of deep semi-supervised learning is proposed for fake news detection. Because fake news propagates fast, timely detection is critical to mitigate its effects. However, usually very few data samples can be labeled in a short time, which in turn makes supervised learning models infeasible. Hence, a deep semi-supervised learning model is implemented based on the proposed framework. The two paths in the model generate a supervised loss (cross-entropy) and an unsupervised loss (mean squared error), respectively. Training is then performed by jointly optimizing these two losses. Experimental results indicate that the implemented model can detect fake news on the two benchmarks, LIAR and PHEME, effectively using very limited labeled data and a large amount of unlabeled data. Furthermore, given the data distribution differences between the training and testing datasets in the case of PHEME under leave-one-event-out cross-validation, increasing the percentage of labeled data used to train the semi-supervised model does not automatically imply performance improvement. In the future, we plan to examine the proposed framework on other NLP tasks such as sentiment analysis and dependency analysis.
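As a compact reminder of the joint objective summarized above, the following sketch combines a cross-entropy term on the labeled samples with a mean-squared-error term on all samples. It assumes that the MSE is taken between the class-probability outputs of the two paths and that the two terms are combined with a scalar weight; both are assumptions of this illustration, and the function names, the weight, and the toy probabilities are hypothetical.

```python
import math


def cross_entropy(probs, label):
    """Negative log-probability assigned to the annotated class."""
    return -math.log(probs[label])


def mean_squared_error(p, q):
    """Mean squared difference between two probability vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) / len(p)


def joint_loss(sup_probs, unsup_probs, labels, weight=1.0):
    """Return (total, supervised, unsupervised) loss for one minibatch.

    labels[i] is a class index, or None when sample i is unlabeled.
    """
    labeled = [(p, y) for p, y in zip(sup_probs, labels) if y is not None]
    # Supervised term: averaged over labeled samples only.
    sup = (sum(cross_entropy(p, y) for p, y in labeled) / len(labeled)
           if labeled else 0.0)
    # Unsupervised term: averaged over all samples, labeled or not.
    unsup = (sum(mean_squared_error(p, q)
                 for p, q in zip(sup_probs, unsup_probs)) / len(sup_probs))
    return sup + weight * unsup, sup, unsup


# Hypothetical outputs of the two paths for three samples, one labeled.
sup_probs = [[0.9, 0.1], [0.4, 0.6], [0.5, 0.5]]
unsup_probs = [[0.8, 0.2], [0.5, 0.5], [0.5, 0.5]]
labels = [0, None, None]
total, sup, unsup = joint_loss(sup_probs, unsup_probs, labels, weight=1.0)
print(total, sup, unsup)
```

Only the labeled sample contributes to the supervised term, while every sample contributes to the unsupervised term, which is how unlabeled data can influence training in this kind of objective.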

ACKNOWLEDGMENT

This research work is supported in part by the U.S. Office of the Under Secretary of Defense for Research and Engineering (OUSD(R&E)) under agreement number FA8750-15-2-0119. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Office of the Under Secretary of Defense for Research and Engineering (OUSD(R&E)) or the U.S. Government.

REFERENCES

[1] G. Pennycook and D. G. Rand, “Fighting misinformation on social media using crowdsourced judgments of news source quality,” Proceedings of the National Academy of Sciences, p. 201806781, 2019.

[2] K. Shu, A. Sliva, S. Wang, J. Tang, and H. Liu, “Fake news detection on social media: A data mining perspective,” ACM SIGKDD Explorations Newsletter, vol. 19, no. 1, pp. 22–36, 2017.

[3] H. Allcott and M. Gentzkow, “Social media and fake news in the 2016 election,” Journal of economic perspectives, vol. 31, no. 2, pp. 211–36, 2017.

[4] N. Grinberg, K. Joseph, L. Friedland, B. Swire-Thompson, and D. Lazer, “Fake news on twitter during the 2016 us presidential election,” Science, vol. 363, no. 6425, pp. 374–378, 2019.

[5] D. Hovy, “The enemy in your own camp: How well can we detect statistically-generated fake reviews–an adversarial study,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), vol. 2, 2016, pp. 351–356.

[6] W. Y. Wang, “‘Liar, liar pants on fire’: A new benchmark dataset for fake news detection,” arXiv preprint arXiv:1705.00648, 2017.

[7] M. Potthast, J. Kiesel, K. Reinartz, J. Bevendorff, and B. Stein, “A stylometric inquiry into hyperpartisan and fake news,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 231–240.

[8] J. Ma, W. Gao, P. Mitra, S. Kwon, B. J. Jansen, K.-F. Wong, and M. Cha, “Detecting rumors from microblogs with recurrent neural networks,” in IJCAI, 2016, pp. 3818–3824.

[9] Z. Zhao, P. Resnick, and Q. Mei, “Enquiring minds: Early detection of rumors in social media from enquiry posts,” in Proceedings of the 24th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2015, pp. 1395–1405.

[10] S. Kwon, M. Cha, K. Jung, W. Chen, and Y. Wang, “Prominent features of rumor propagation in online social media,” in 2013 IEEE 13th International Conference on Data Mining. IEEE, 2013, pp. 1103–1108.


Fig. 8. Comparing detailed performance obtained with batch size 256 for the five events (one panel each: Charlie Hebdo, Ferguson, Germanwings-crash, Ottawa shooting, Sydney siege). The x-axis shows the evaluation metrics (MacroA, MacroP, MacroR, MacroF), while the y-axis shows performance. Bar colors indicate the ratio of labeled data: green for 1%, blue for 10%, and red for 30%.

Fig. 9. Comparing detailed performance obtained with batch size 512 for the five events (one panel each: Charlie Hebdo, Ferguson, Germanwings-crash, Ottawa shooting, Sydney siege). The x-axis shows the evaluation metrics (MacroA, MacroP, MacroR, MacroF), while the y-axis shows performance. Bar colors indicate the ratio of labeled data: green for 1%, blue for 10%, and red for 30%.

[11] Y. Chang, M. Yamada, A. Ortega, and Y. Liu, “Lifecycle modeling for buzz temporal pattern discovery,” ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 11, no. 2, p. 20, 2016.

[12] Y. Chang, M. Yamada, A. Ortega, and Y. Liu, “Ups and downs in buzzes: Life cycle modeling for temporal pattern discovery,” in 2014 IEEE International Conference on Data Mining. IEEE, 2014, pp. 749–754.

[13] R. Oshikawa, J. Qian, and W. Y. Wang, “A survey on natural language processing for fake news detection,” arXiv preprint arXiv:1811.00770, 2018.

[14] H. Rashkin, E. Choi, J. Y. Jang, S. Volkova, and Y. Choi, “Truth of varying shades: Analyzing language in fake news and political fact-checking,” in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 2931–2937.

[15] Y. Long, Q. Lu, R. Xiang, M. Li, and C.-R. Huang, “Fake news detection through multi-perspective speaker profiles,” in Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), 2017, pp. 252–256.

[16] C. Bosco, V. Lombardo, L. Lesmo, and V. Daniela, “Building a treebank for italian: a data-driven annotation schema,” in LREC 2000. ELDA, 2000, pp. 99–105.

[17] A. Zubiaga, M. Liakata, R. Procter, G. W. S. Hoi, and P. Tolmie, “Analysing how people orient to and spread rumours in social media by looking at conversational threads,” PloS one, vol. 11, no. 3, p. e0150989, 2016.

[18] Y. Kim, “Convolutional neural networks for sentence classification,” arXiv preprint arXiv:1408.5882, 2014.

[19] C. D. Manning and H. Schütze, Foundations of statistical natural language processing. MIT press, 1999.

[20] S. Tong and D. Koller, “Support vector machine active learning with applications to text classification,” Journal of machine learning research, vol. 2, no. Nov, pp. 45–66, 2001.

[21] J. Ramos et al., “Using tf-idf to determine word relevance in document queries,” in Proceedings of the first instructional conference on machine learning, vol. 242. Piscataway, NJ, 2003, pp. 133–142.

Fig. 10. Comparing performance for the case of embedding size 64 for the five events (one panel each: Charlie Hebdo, Ferguson, Germanwings-crash, Ottawa shooting, Sydney siege). The x-axis shows the evaluation metrics (MacroA, MacroP, MacroR, MacroF), while the y-axis shows performance. Bar colors indicate the ratio of labeled data: green for 1%, blue for 10%, and red for 30%.

Fig. 11. Comparing performance for the case of embedding size 256 for the five events (one panel each: Charlie Hebdo, Ferguson, Germanwings-crash, Ottawa shooting, Sydney siege). The x-axis shows the evaluation metrics (MacroA, MacroP, MacroR, MacroF), while the y-axis shows performance. Bar colors indicate the ratio of labeled data: green for 1%, blue for 10%, and red for 30%.

[22] J. H. Paik, “A novel tf-idf weighting scheme for effective ranking,” in Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval. ACM, 2013, pp. 343–352.

[23] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” arXiv preprint arXiv:1301.3781, 2013.

[24] J. Pennington, R. Socher, and C. Manning, “Glove: Global vectors for word representation,” in Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 2014, pp. 1532–1543.

[25] R. Mihalcea and C. Strapparava, “The lie detector: Explorations in the automatic recognition of deceptive language,” in Proceedings of the ACL-IJCNLP 2009 Conference Short Papers. Association for Computational Linguistics, 2009, pp. 309–312.

[26] J. W. Pennebaker, M. E. Francis, and R. J. Booth, “Linguistic inquiry and word count: Liwc 2001,” Mahway: Lawrence Erlbaum Associates, vol. 71, no. 2001, p. 2001, 2001.

[27] N. Ruchansky, S. Seo, and Y. Liu, “Csi: A hybrid deep model for fake news detection,” in Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. ACM, 2017, pp. 797–806.

[28] Q. Le and T. Mikolov, “Distributed representations of sentences and documents,” in International conference on machine learning, 2014, pp. 1188–1196.

[29] H. Karimi, P. Roy, S. Saba-Sadiya, and J. Tang, “Multi-source multi-class fake news detection,” in Proceedings of the 27th International Conference on Computational Linguistics, 2018, pp. 1546–1557.

[30] J. Ma, W. Gao, Z. Wei, Y. Lu, and K.-F. Wong, “Detect rumors using time series of social context information on microblogging websites,” in Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. ACM, 2015, pp. 1751–1754.

[31] T. Lan, C. Li, and J. Li, “Mining semantic variation in time series for rumor detection via recurrent neural networks,” in 2018 IEEE 20th International Conference on High Performance Computing and Communications; IEEE 16th International Conference on Smart City; IEEE 4th International Conference on Data Science and Systems (HPCC/SmartCity/DSS). IEEE, 2018, pp. 282–289.

[32] T. Hashimoto, T. Kuboyama, and Y. Shirota, “Rumor analysis frameworkin social media,” in TENCON 2011-2011 IEEE Region 10 Conference.IEEE, 2011, pp. 133–137.

[33] O. Chapelle, B. Scholkopf, and A. Zien, “Semi-supervised learning(chapelle, o. et al., eds.; 2006)[book reviews],” IEEE Transactions onNeural Networks, vol. 20, no. 3, pp. 542–542, 2009.

[34] X. J. Zhu, “Semi-supervised learning literature survey,” University ofWisconsin-Madison Department of Computer Sciences, Tech. Rep.,2005.

[35] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean,“Distributed representations of words and phrases and their composi-tionality,” in Advances in neural information processing systems, 2013,pp. 3111–3119.

[36] Y. Goldberg and O. Levy, “word2vec explained: deriving Mikolov et al.’s negative-sampling word-embedding method,” arXiv preprint arXiv:1402.3722, 2014.

[37] Z. Zhang, P. Luo, C. C. Loy, and X. Tang, “Facial landmark detection by deep multi-task learning,” in European conference on computer vision. Springer, 2014, pp. 94–108.

[38] S. Ruder, “An overview of multi-task learning in deep neural networks,” arXiv preprint arXiv:1706.05098, 2017.

[39] S. Chowdhury, X. Dong, L. Qian, X. Li, Y. Guan, J. Yang, and Q. Yu, “A multitask bi-directional RNN model for named entity recognition on Chinese electronic medical records,” BMC bioinformatics, vol. 19, no. 17, p. 499, 2018.

[40] X. Dong, S. Chowdhury, L. Qian, X. Li, Y. Guan, J. Yang, and Q. Yu, “Deep learning for named entity recognition on Chinese electronic medical records: Combining deep transfer learning with multitask bi-directional LSTM RNN,” PloS one, vol. 14, no. 5, p. e0216046, 2019.

[41] S. Laine and T. Aila, “Temporal ensembling for semi-supervised learning,” arXiv preprint arXiv:1610.02242, 2016.

[42] A. Rajaraman and J. D. Ullman, Mining of massive datasets. Cambridge University Press, 2011.

[43] E. Kochkina, M. Liakata, and A. Zubiaga, “All-in-one: Multi-task learning for rumour verification,” in Proceedings of the 27th International Conference on Computational Linguistics, 2018, pp. 3402–3413.

[44] X. Zhang, J. Zhao, and Y. LeCun, “Character-level convolutional networks for text classification,” in Advances in neural information processing systems, 2015, pp. 649–657.

[45] A. Conneau, H. Schwenk, L. Barrault, and Y. LeCun, “Very deep convolutional networks for text classification,” arXiv preprint arXiv:1606.01781, 2016.

[46] P. Zhou, W. Shi, J. Tian, Z. Qi, B. Li, H. Hao, and B. Xu, “Attention-based bidirectional long short-term memory networks for relation classification,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2016, pp. 207–212.

[47] S. Lai, L. Xu, K. Liu, and J. Zhao, “Recurrent convolutional neural networks for text classification,” in Twenty-ninth AAAI conference on artificial intelligence, 2015.

[48] Y. Yang, “A study of thresholding strategies for text categorization,” in Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, 2001, pp. 137–145.

[49] N. V. Chawla, “Data mining for imbalanced datasets: An overview,” in Data mining and knowledge discovery handbook. Springer, 2009, pp. 875–886.