DRAFT

Predicting News Headline Popularity with Syntactic and Semantic Knowledge Using Multi-Task Learning

Daniel Hardt, Copenhagen Business School

[email protected]

Dirk Hovy, Bocconi University

[email protected]

Sotiris Lamprinidis, Copenhagen University

[email protected]

Abstract

Newspapers need to attract readers with headlines, anticipating their readers' preferences. These preferences rely on topical, structural, and lexical factors. We model each of these factors in a multi-task GRU network to predict headline popularity. We find that pre-trained word embeddings provide significant improvements over untrained embeddings, as does the combination of two auxiliary tasks, news-section prediction and part-of-speech tagging. However, we also find that performance is very similar to that of a simple logistic regression model over character n-grams. Feature analysis reveals structural patterns of headline popularity, including the use of forward-looking deictic expressions and second person pronouns.

1 Introduction

The data generated from online news consumption constitutes a rich resource, which allows us to explore the relation between news content and user opinions and behaviors. In order to stay in business, newspapers need to pay attention to this information. For example, what headlines do users click on, and why? With the volume of news being consumed online today, there is great interest in addressing this problem algorithmically. We collaborate with a large Danish newspaper, who gave us access to several years' worth of headlines, and the number of clicks generated by readers.

We aggregate the viewing logs to classify headlines as popular or unpopular, and build models to predict those classifications. We use an expanded version of the dataset investigated by Hardt and Rambow (2017). That work found that bag-of-words models based on headlines did indeed have predictive value concerning viewing behavior, although models based on the article body were more accurate. As Hardt and Rambow noted, this is somewhat paradoxical: how can a model based on the article text be better at predicting clicks? After all, the choice to click on an article must be based on the headline alone – the article is only seen after the clicking decision is made. Hardt and Rambow speculate that "it is possible that the headline on its own gives readers a lot of semantic information which we are not capturing with our features, but which the whole article does provide. So human readers can 'imagine' the article before they read it and implicitly base their behavior on their expectation." (Hardt and Rambow, 2017)

In other words, readers are able to anticipate the contents of an article in advance from a headline, because of the linguistic and world knowledge that they bring to bear when assessing the headline. If we can incorporate this "future" knowledge into a prediction model, we are likely to improve performance.

We test this hypothesis by defining ways to model aspects of the lexical, structural, and topical knowledge of human news readers:

• Lexical – Word Embeddings: we provide our models with pretrained word embeddings from large datasets. This models aspects of the rich lexical information and association that human readers bring to bear in reading a headline.

• Structural – POS Tagging: part-of-speech information is a basic component of structural linguistic knowledge, reflected in the structure of common headline templates such as "Can X do Y?" or "You will not believe what happened when X".

• Topical – Section Prediction: Each article is labeled with a section (sports, politics, etc.). We include a task which predicts the section of a headline. This models the ability of a news reader to understand the most salient


and interesting topical material in a headline text.

We use a multi-task learning (MTL) setup (Caruana, 1993), which provides a natural framework to test the above hypotheses: one of the first uses of MTL was to include the outcome of future diagnostic tests in a prediction task (Caruana et al., 1996).

We explore the effect of pretrained word embeddings, and the effects of auxiliary tasks involving POS tagging and section prediction. We find that the combination of all of these factors results in substantial improvements over both the baseline and the single-task system of previous work. We also build logistic regression models, both for word and character n-grams. The word-based models have the advantage that the predictiveness of individual words can be examined.

While the word n-gram models have performance comparable to the baseline neural net, the character n-gram model has higher performance, competing with the best MTL result. This finding is in line with the results of Zhang et al. (2015).

Our results indicate that MTL can indeed provide the tools to implement prediction processes that involve expectations about the future. Given the successful integration of two auxiliary tasks, we see this as a promising starting point for future research. However, the performance parity with the character model underscores the fact that simple model architectures still have a place. Our findings, in line with other current work (Benton et al., 2017), shed light on the question of auxiliary task selection and interaction, and highlight that MTL results should be rigorously tested.

A good predictive model is a powerful diagnostic tool for editors, allowing them to select proposed headlines. However, journalism is a creative production process, so detection is only part of the application. We also want to be able to give strategic advice to headline writers. To this end, we report an analysis of common n-gram features in the word-based logistic regression model that provides some insights into successful headline patterns.

Contributions We explore an MTL architecture with two auxiliary tasks for headline popularity prediction. We show how aspects of lexical, structural, and topical knowledge are all relevant for headline popularity. The positive results reported here provide a fruitful basis for further development of MTL models for news data. We also analyze lexical features that are predictive of headline popularity.

Figure 1: Example of a Jyllands-Posten headline as seen by the audience

2 Data

News Data The present work is based on a significantly expanded and cleaned version of the dataset used by Hardt and Rambow (2017). This dataset includes Jyllands-Posten articles and logs. Jyllands-Posten is a major Danish newspaper (and became known to an international audience through the cartoon controversy). The data covers the period from July 2015 through July 2017. We removed any articles from before July 2015, when the viewing logs began, since these older articles have unreliable numbers of clicks. The resulting dataset consists of 82,532 articles and a total of 281,005,390 user views. We furthermore extracted the news section each article belongs to (sports, politics, etc.) from the URL.

We bin the articles by number of clicks into two bins, thus defining a classification task: is the article in the top 50% of clicks or not? The data is divided into 80% training data, with 10% each for development and test.
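The binning described above can be sketched as a simple median split. This is a minimal illustration; how ties at the median are resolved in the actual dataset is an assumption here.

```python
from statistics import median

def popularity_labels(click_counts):
    """Binary popularity labels: 1 if an article's click count is above
    the median (i.e., in the top 50%), 0 otherwise."""
    cutoff = median(click_counts)
    return [1 if clicks > cutoff else 0 for clicks in click_counts]

# Four articles with their total click counts:
labels = popularity_labels([120, 54_000, 310, 9_800])  # → [0, 1, 0, 1]
```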

Figure 1 shows the top headline on the Jyllands-Posten web site for August 27, 2018. Our data does not include information such as the position of a headline on the page, or possible associated graphical material.

Additional Data In addition to the news data from JP, we obtained a corpus of 100 million words of Danish text from the Society for Danish Language and Literature, or DSL (Jørg Asmussen, 2018). This corpus was collected from diverse


sources over a period from 1990 to 2010. The corpus has been automatically annotated for part of speech and lemmatization, and we use this for our POS tagging task. We also downloaded the Danish Wikipedia, which consists of approximately 49 million words of Danish text. We use these corpora in conjunction with the JP article texts to induce pre-trained Danish word embeddings.

Data Statement A. CURATION RATIONALE The dataset is collected by Jyllands-Posten as part of a general strategy to understand user behavior and preferences with respect to the news content on the site.

B. LANGUAGE VARIETY The data is Danish (da-DK).

C. SPEAKER DEMOGRAPHIC The text is produced by professional journalists.

D. ANNOTATOR DEMOGRAPHIC There is no manual annotation of the text.

E. SPEECH SITUATION The texts were produced from July 2015 until July 2017; the intended audience is Danish news consumers.

F. TEXT CHARACTERISTICS The text is standard, mainstream Danish journalism.

3 Models

Our task is to predict which articles get the most user clicks, based on the headline alone. We report results using logistic regression and a neural network using MTL.

Logistic Regression We define the following features for the logistic regression models:

1. n-chars: sequences of n characters, with n ranging from 2 to 6 in all experiments.

2. word unigrams: tf-idf scores for all word unigrams

3. word bigrams: tf-idf scores for all word bigrams
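The n-chars feature set (item 1) can be sketched as a plain counting function. This is a minimal illustration of the n-gram extraction only; the vectorization and weighting details of the actual experiments are not specified here.

```python
from collections import Counter

def char_ngrams(text, n_min=2, n_max=6):
    """Count all character n-grams of length n_min..n_max in a headline."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts

feats = char_ngrams("Sådan gør du")
feats["Så"]     # a character bigram
feats["dan g"]  # a 5-gram spanning a word boundary
```

Character n-grams cross word boundaries, which lets the model pick up on morphology and short function words without any tokenization.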

GRU Neural Network While the task is classification, which could be done with a feed-forward model, we want a sequential architecture, so that we can incorporate POS tagging as an auxiliary task, adding POS output at each time step.

Based on good results in recent work (Lee and Dernoncourt, 2016; Liu et al., 2016), we choose a recurrent neural network architecture; after a series of experiments on the training set, validated on our development set, we obtained the best results using GRU (Gated Recurrent Unit) units.

Each layer k consists of two sets of units, labeled fw and bw, that process the sequence forwards and backwards respectively, so that information from the whole sequence is available at every timestep t. The two directions' activations are concatenated and fed to a fully-connected softmax (for multi-class classification) or sigmoid layer (for binary classification) to get the output probability y^k_t of the task associated with layer k. So that higher-level tasks can benefit, we embed the output probabilities using a fully connected label embedding (LE) layer, a technique used in similar scenarios (Rønning et al., 2018). The embedded label is concatenated with the GRU output to get the activation a^k_t, which is fed into the next layer, or into the final fully connected prediction layer, as presented in Figure 2.

In the sequential auxiliary task, i.e. POS tagging, this is done at every timestep, while for the classification tasks the prediction is made on the final timestep.

For regularization, we apply dropout on everylayer of our network.

Figure 2: Representation of a single timestep t for a pair of forward-backward units on layer k, where h^k_{t−1} is the previous hidden state.
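The per-timestep dataflow of Figure 2 can be sketched with plain NumPy. This is an illustration only: the parameter shapes, the label-embedding size, and the random weights are all assumptions, and the real model uses trained GRU units rather than the given hidden-state vectors used here.

```python
import numpy as np

rng = np.random.default_rng(0)

H = 4         # hidden size per direction (the paper uses H = 112)
N_LABELS = 3  # label inventory of this layer's task (e.g., POS tags)
E = 2         # label-embedding size (illustrative)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# Hypothetical parameters for one layer k: the task classifier over the
# concatenated directions, and the label-embedding (LE) matrix.
W_task = rng.normal(size=(N_LABELS, 2 * H))
LE = rng.normal(size=(E, N_LABELS))

def layer_step(h_fw, h_bw):
    """One timestep t on layer k: concatenate the forward and backward
    activations, predict the layer's task, embed the prediction with LE,
    and concatenate it back onto the GRU output for the next layer."""
    h = np.concatenate([h_fw, h_bw])  # combined bidirectional state, shape (2H,)
    y = softmax(W_task @ h)           # task probabilities y^k_t
    a = np.concatenate([h, LE @ y])   # activation a^k_t fed to layer k+1
    return y, a

y, a = layer_step(rng.normal(size=H), rng.normal(size=H))
```

The label embedding turns the soft prediction into a dense vector, so the next layer sees both the sequence representation and the lower-level task's output.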

Auxiliary Tasks In our setup, we use two auxiliary tasks:

1. POS tagging: we include POS tagging using the DSL dataset on the first recurrent layer of the GRU.

2. Section prediction: we include classification into one of the 227 sections of the Jyllands-Posten website. The output for this task is based on the penultimate recurrent layer.


Hyper-parameters and Training We perform a grid search to find the best hyper-parameters for a single-task model (i.e., without any auxiliary tasks) and then keep those settings for all our experiments. We settle on a model with hidden size H = 112 and Nk = 3 layers. The dropout probability p = 0.3 gave the best results for both models.

We train the model for 10 epochs using the Adam optimizer with the default parameters, clipping the gradient updates so that their norm is not higher than 5. We train the different tasks sequentially in each epoch, with the lower-level task (POS tagging) first and popularity prediction last. Additionally, we decay the learning rate by a factor of 0.9 after each epoch. While this is not common with adaptive methods such as Adam, it performed better. We stop training if the accuracy on the development set stops improving.
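The training schedule above (sequential tasks per epoch, learning-rate decay by 0.9, early stopping on development accuracy) can be sketched as follows. The `Task` interface and the dummy accuracy curve are hypothetical stand-ins for the real GRU task heads, and gradient clipping is omitted.

```python
class DummyTask:
    """Stand-in for a real task head; its dev accuracy improves, then plateaus."""
    def __init__(self):
        self._accs = iter([0.60, 0.65, 0.67, 0.67])
        self._acc = 0.0

    def train_one_epoch(self, lr):
        pass  # a real head would run one epoch of Adam updates here

    def dev_accuracy(self):
        self._acc = next(self._accs, self._acc)
        return self._acc

def train(tasks, max_epochs=10, base_lr=1e-3, decay=0.9):
    """Sequential MTL schedule: each epoch trains every task in order
    (lower-level tasks first, popularity prediction last), decays the
    learning rate by `decay` after each epoch, and stops once dev
    accuracy on the main task stops improving."""
    best_dev = -1.0
    lrs = []
    for epoch in range(max_epochs):
        lr = base_lr * decay ** epoch  # decayed after each epoch
        lrs.append(lr)
        for task in tasks:
            task.train_one_epoch(lr)
        dev = tasks[-1].dev_accuracy()
        if dev <= best_dev:
            break  # early stopping
        best_dev = dev
    return lrs

lrs = train([DummyTask()])  # stops at the fourth epoch for this dummy curve
```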

4 Results

Tables 1 and 2 report accuracy for the logistic regression and neural classifiers. We also give the best score from Hardt and Rambow (2017) for comparison purposes (note, though, that the datasets are not identical and can therefore not be directly compared). We observe a substantial improvement over the baseline GRU when incorporating the pre-trained embeddings and both auxiliary tasks. It seems that pretrained embeddings and MTL act at least partly as regularizers, as these models trained for more epochs without overfitting. Interestingly, we observe a similar improvement of the character n-gram model over the word-based logistic regression models.

5 Analysis and Discussion

Our main focus in this paper is on MTL as a framework to explore the lexical, structural, and topical knowledge involved in users' selection of headlines. However, recognizing a popular headline and giving advice on how to write one are not the same: we want to provide editors and journalists with insights as to what constructions are likely to attract more eyeballs.

One way to explore this is to examine individual words and their contribution to predictiveness. Table 3 displays the top 20 unigrams based on their coefficients in the logistic regression model. For each unigram we provide a translation (if needed) and a comment. We classify several unigrams as Deictic-reference. This follows Blom and Hansen (2015), who suggest that headline "clickbait" often relies on forward-looking expressions, such as "This", as in, e.g., "This is how you should eat an avocado". Here, "this" is a referring expression, but the reader understands that the antecedent will be found in the article body. Several of these top unigrams are names that are of specific topical interest in areas such as sports and politics. Others mention topics of more general interest (Researchers, dead, found). The second person pronoun is also on the list – in general, we found that second person pronouns are far more predictive of popularity than first or third person pronouns. Finally, several unigrams identify sections of the newspaper of particular interest (car, weather, analysis, and satire).
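Ranking unigrams by coefficient, as in Table 3, reduces to a simple sort over the learned weights. The weights below are made-up illustrative values, not the model's actual coefficients.

```python
def top_features(coefs, k=3):
    """Return the k feature names with the largest positive
    logistic-regression coefficients."""
    return sorted(coefs, key=coefs.get, reverse=True)[:k]

# Hypothetical coefficients for a few Danish unigrams:
coefs = {"du": 1.4, "Se": 1.1, "død": 0.9, "og": -0.2}
top_features(coefs, k=2)  # → ["du", "Se"]
```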

6 Related Work

Prediction of news headline popularity is an increasingly important problem, as news consumption has moved online. The insights and models described here might well be applicable to related problems of interest: for example, Balakrishnan and Parekh (2014) and Jaidka et al. (2018) study the problem of predicting clicks on email subject lines.

Subramanian et al. (2018) show that a regression-based multitask approach can increase performance for classification-based popularity prediction. Their work looks at the popularity of online petitions, but the methodology applies to our subject as well, and ties in with the approaches taken in this project.

Benton et al. (2017) caution that in order to evaluate MTL results properly, we need to take the number of parameters into account. Our results to some extent support this finding, by showing that a simpler linear model can fare equally well on the task.

The choice of auxiliary tasks greatly influences the performance of MTL architectures, prompting several recent investigations into the selection process (Alonso and Plank, 2017; Bingel and Søgaard, 2017). However, it is still unclear whether these tasks serve as mere regularizers, or whether they can also impart some additional information.


model                    | input                | accuracy
Hardt and Rambow (2017)  | word unigrams        | 61.2
logistic regression      | word unigrams        | 65.6
logistic regression      | word bigrams         | 65.7
logistic regression      | character 2-6-grams  | 67.4

Table 1: Accuracy results for various Logistic Regression models

model                       | input                 | auxiliary tasks | epoch | accuracy
GRU 3 layers w/ 112 hidden  | random embeddings     | —               | 3     | 65.2
GRU 3 layers w/ 112 hidden  | pretrained embeddings | —               | 5     | 66.8
GRU 3 layers w/ 112 hidden  | pretrained embeddings | POS             | 5     | 65.7
GRU 3 layers w/ 112 hidden  | pretrained embeddings | section         | 4     | 66.8
GRU 3 layers w/ 112 hidden  | pretrained embeddings | POS+section     | 7     | 67.4

Table 2: Accuracy results for various GRU model implementations

Unigram     | Translation  | Comment
Magnussen   | –            | Name (Sports)
Trump       | –            | Name (Politics)
AGF         | –            | Name (Sports)
Test        | –            | –
Her         | Here         | Deictic-reference
død         | dead         | topical
Wozniac     | –            | Name (Tech)
Trumps      | –            | Name (Politics)
Forskere    | Researchers  | topical
fundet      | found        | topical
du          | you          | pronoun
AGF-træner  | AGF coach    | Name (Sports)
Se          | Watch        | Deictic-reference
Kevin       | –            | Name (Sports)
Islamisk    | Islamic      | Name (Politics)
Analyse     | Analysis     | Section
Sådan       | This         | Deictic-reference
Satire      | Satire       | Section
bil         | car          | Section
DMI         | Weather      | Section

Table 3: Top twenty Unigrams (Logistic Regression)

7 Conclusion

We presented an exploratory approach to predicting newspaper article popularity from headlines alone. Using pre-trained embeddings and an MTL setup, we are able to incorporate rich structural and semantic knowledge into the task and substantially improve performance. While the results are encouraging and allow the exploration of further auxiliary tasks (for example, article word prediction), we find that a simple character-based n-gram model performs competitively. These findings highlight two aspects: 1) For any application of MTL, this is a strong case for comparing the results to non-deep models. While it is comparatively easy to show an improvement over the basic STL model, there might be other simple models that are competitive. 2) The selection of auxiliary tasks greatly influences performance, even beyond simple regularization, and in a non-linear way. It does, however, provide us with a tool to test human intuitions about task interactions and the importance of certain problem aspects.

Acknowledgments

Thanks to A. Michele Colombo for help with the data and experiments. We also thank Jyllands-Posten for giving us access to the data, and DSL for the data for embeddings and POS annotations.

References

Héctor Martínez Alonso and Barbara Plank. 2017. When is multitask learning effective? Semantic sequence prediction under varying data conditions. In 15th Conference of the European Chapter of the Association for Computational Linguistics.

Raju Balakrishnan and Rajesh Parekh. 2014. Learning to predict subject-line opens for large-scale email marketing. In Big Data (Big Data), 2014 IEEE International Conference on, pages 579–584. IEEE.

Adrian Benton, Margaret Mitchell, and Dirk Hovy. 2017. Multitask Learning for Mental Health Conditions with Limited Social Media Data. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, volume 1, pages 152–162.


Joachim Bingel and Anders Søgaard. 2017. Identifying beneficial task relations for multi-task learning in deep neural networks. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, volume 2, pages 164–169.

Jonas Nygaard Blom and Kenneth Reinecke Hansen. 2015. Click bait: Forward-reference as lure in online news headlines. Journal of Pragmatics, 76:87–100.

Rich Caruana. 1993. Multitask Learning: A Knowledge-Based Source of Inductive Bias. In Proceedings of the Tenth International Conference on Machine Learning, pages 41–48.

Rich Caruana, Shumeet Baluja, Tom Mitchell, et al. 1996. Using the future to "sort out" the present: Rankprop and multitask learning for medical risk evaluation. Advances in Neural Information Processing Systems, pages 959–965.

Daniel Hardt and Owen Rambow. 2017. Predicting user views in online news. In Proceedings of the 2017 EMNLP Workshop: Natural Language Processing meets Journalism, pages 7–12.

Kokil Jaidka, Tanya Goyal, and Niyati Chhaya. 2018. Predicting Email and Article Clickthroughs with Domain-adaptive Language Models. In Proceedings of the 10th ACM Conference on Web Science, pages 177–184. ACM.

Jørg Asmussen. 2018. Society for Danish Language and Literature. http://dsl.dk. [Online; accessed 28-April-2018].

Ji Young Lee and Franck Dernoncourt. 2016. Sequential short-text classification with recurrent and convolutional neural networks. arXiv preprint arXiv:1603.03827.

Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2016. Recurrent neural network for text classification with multi-task learning. arXiv preprint arXiv:1605.05101.

Ola Rønning, Daniel Hardt, and Anders Søgaard. 2018. Sluice Resolution without Hand-crafted Features over Brittle Syntax Trees. In NAACL.

Shivashankar Subramanian, Timothy Baldwin, and Trevor Cohn. 2018. Content-based Popularity Prediction of Online Petitions Using a Deep Regression Model. Transactions of the Association for Computational Linguistics.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pages 649–657.