
Demystifying content popularity on Reddit

Katyaini Himabindu Lakkaraju
Project Group Number: 25

[email protected]

December 10, 2012

Abstract

The World Wide Web is now teeming with user generated content as more and more sites allow users to become an integral part of content generation and dissemination. Sites such as YouTube, Digg, Twitter, Facebook and Reddit allow users to post content of their choice and to comment on and rate the content posted by other users. This user generated content has given rise to a host of interesting problems, such as predicting content popularity, predicting user popularity and many more. Solving these kinds of problems involves understanding the underlying sociological processes and devising appropriate algorithmic solutions. The goal of this report is twofold. First, I present my findings about the influence of various factors, such as community aspects, user attributes, content quality and multiple content exposures, on the popularity of content on Reddit. Second, I propose models that can effectively predict content popularity, in terms of the number of comments a particular post will have received at any given point in time.

Introduction

With the inception of sites such as Twitter, Facebook, Reddit and Digg, the web is now a powerhouse of user generated content. Users share content of their choice, post comments, write reviews, upload data, and acknowledge content posted by other users on these sites. In some sense, users are leaving their digital footprints almost everywhere on the web. This kind of data, when harnessed appropriately, can be used for purposes ranging from understanding customer segments for better product marketing to understanding sociological aspects of human behavior. The surge in this kind of data has reinforced the need for computational algorithms and techniques which effectively capture human behavior, at both the individual and the aggregate level.

Figure 1: Snapshot of the Reddit front page

This paper focuses on analyzing content popularity and the associated dynamics on the social content sharing site Reddit. Reddit has a very interesting approach to organizing content and determining content visibility. The principal entities of Reddit are users and subreddits 1. A user can follow other users (this is 'friending' in Reddit terminology) and subscribe to various subreddits. At any given point in time, a user's homepage provides seamless access to various content views. A user can browse through the activity of the people he follows, and can see the most recent posts and the most popular posts submitted across various subreddits. In addition, content from the subreddits to which the user has subscribed also shows up on his homepage. The user browses each of these content views by clicking the various tabs on his homepage. Further, users can share interesting links and textual content in the subreddits of their choice, and other users vote and comment on these submissions.

1. I will use the terms communities and subreddits interchangeably throughout this writeup.


The interactions between all these entities make the task of predicting content popularity all the more challenging and interesting.

The concept of content popularity is subjective. In this work, we associate the popularity of a post with the number of comments it receives. The questions that we aim to answer in this work are:

• How much impact does the quality of the content have on its popularity?

• What role does the subreddit play in the content's popularity?

• What is the impact of the user who posts the content on its popularity?

• Does the title of the submission play a role in its popularity? If so, what constitutes a good submission title?

• How does submission popularity vary across multiple subsequent submissions of the same content?

• Can we design a framework which accounts for submission popularity and accommodates the effects of the various factors highlighted above?

• How can we account for the variations in a submission's popularity over time?

Surprisingly, there is not much published work on analyzing the dynamics of Reddit; this work is one of the initial attempts in that direction.

Related Work

In this section, I briefly discuss the literature most relevant to this work, which cuts across two areas: content popularity prediction on social media, and the analysis of voting/commenting patterns on social sharing sites. This section discusses prior work along these two lines.

Content Popularity on Social Media

There has been a surge of interest in problems of this kind in the recent past. [2] deals with the problem of determining whether a given post on Twitter will elicit any response, making no distinction between kinds of responses such as replies, retweets and favorites. It presents a detailed analysis of content features based on natural language heuristics, such as the number of stop words, positive and negative sentiment words, and lexical items highly correlated with previous tweets that elicited responses. Though this work gives detailed insights into various kinds of content features, the problem it solves is a limiting case of what we propose. [1] attempts to predict the popularity of an article on a social networking site (Twitter), where popularity is measured by how many times the article's URL is shared across Twitter. The authors employ classification and regression based approaches with features such as the categories/topics the article belongs to, the subjectivity of the article, the named entities mentioned, and the source of the article. However, this work also does not deal with predicting the popularity of an item at various points in its lifetime. [3] deals with the problem of predicting whether the popularity of some content exceeds a certain threshold, with an approach rooted in the survival analysis framework; the main idea is to use initial popularity measures to detect the long term popularity of a piece of content. There is also a large body of related work on predicting the popularity of posts on social media and other controlled forums [4] [5].

Analysis of Voting and Commenting Patterns on Social Media

Though there is a lot of literature on analyzing user behavior on social media, this subsection focuses on user behavioral activity on social sharing sites such as Digg. [6], [7] and [8] present an extensive analysis of Digg. These papers present simple models of user behavior and content visibility that capture the dynamics of Digg, and they highlight user influence and the interestingness of the content as the two primary factors behind the popularity of a piece of content. Further, the authors use this model to predict long term content popularity. However, the dynamics of the data we are dealing with are considerably more complex: there is the additional component of topic-based communities, and the content we deal with comprises a lot of personal content (pictures, queries etc.) in addition to knowledge based postings (as opposed to the mostly news based URLs on Digg).


Our work differs from the state of the art in two aspects. Firstly, the process of content generation, organization and visibility on Reddit has not been explored so far. Secondly, we attempt to predict the popularity of content at various phases of its lifetime, rather than just its initial or long term popularity.

Dataset and Features

This section describes the data and the feature space in detail.

Dataset

The data collection process spans three phases.

• The first phase comprised crawling posts which are duplicate submissions of the same content. As discussed in the introduction, one objective of this work is to understand how popularity varies across multiple submissions of the same content, so the first phase of the crawl primarily targeted such resubmissions. Note that Reddit alerts a user who is trying to submit a duplicate URL; however, there is currently no mechanism which alerts a user submitting a duplicate image. Hence, the submissions obtained during this crawl are primarily images. This dataset comprises about 102K posts spanning 928 communities. The number of unique images in this set is 9,022, which means that on average there are roughly 10-12 resubmissions of each image.

• Since the data crawled in the first phase was highly skewed towards duplicate content, a second phase of crawling gathered data from the communities found in the first phase. This was also an attempt to obtain more data from communities such as politics and technology, which were underrepresented in the first phase. This second crawl, carried out bi-hourly for about 4 days, added another 20K posts to the dataset from the first phase.

• The final phase of the crawl was carried out primarily to monitor the front page of Reddit and to determine the thresholds for content promotion to the front page, which was necessary to model popularity changes over time. The data obtained in this third phase comprises 1248 posts which hit the front page; it is used only to estimate the score threshold that qualifies a submission to appear on the front page and does not figure in any other analysis.

Figure 2: Log-log plot of the comment distribution (power-law fit: xmin = 789, alpha = 3.5)

Figure 3: Community vs. fraction of posts

The entire analysis in this report is carried out on a dataset of about 120K posts from 226 communities. Note that these numbers are smaller than those given above: several communities had very low activity (fewer than 50 posts), so I pruned those communities and all their associated posts. The basic statistics of this dataset and the correlations between the numbers of upvotes, downvotes and comments are shown in Table 1. It can be seen that there is a very high correlation between the numbers of upvotes, downvotes and comments.


Table 1: Basic statistics of the dataset

# of posts                                          120,423
# of users                                           58,032
# of communities                                      1,012
# of active communities                                 226
# of upvotes                                            89M
# of downvotes                                          68M
# of comments                                          3.8M
Average # of posts per user                            1.83
Average # of posts per community                      114.1
Average # of upvotes per post                           845
Average # of downvotes per post                         642
Average # of comments per post                           36

Post-level correlations
Correlation between # of upvotes and downvotes         0.99
Correlation between # of upvotes and comments          0.73
Correlation between # of downvotes and comments        0.72

Community-level correlations
Correlation between # of posts and comments            0.97
Correlation between # of posts and upvotes             0.96
Correlation between # of posts and downvotes           0.97
Correlation between # of upvotes and downvotes         0.99
Correlation between # of upvotes and comments          0.96
Correlation between # of downvotes and comments        0.95

Another unsurprising aspect, empirically verified by the results in Table 1, is the strong correlation between the numbers of posts, upvotes, downvotes and comments at the community level.

Figure 2 shows the distribution of comments in the dataset; the density function follows a power law. Figure 3 shows the fraction of posts belonging to each community: pics, funny, aww, gaming and atheism are among the best represented communities in the dataset.
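As an aside, a fit like the one reported in Figure 2 can be reproduced with standard tooling. The sketch below uses the third-party powerlaw package; comment_counts is an illustrative stand-in for the real per-post comment counts, not the actual data.

```python
import numpy as np
import powerlaw  # third-party package for fitting heavy-tailed distributions

# Illustrative stand-in for the per-post comment counts in the dataset.
comment_counts = np.random.zipf(3.5, size=100_000)

# Fit a discrete power law; xmin is chosen automatically by the package.
fit = powerlaw.Fit(comment_counts, discrete=True)
print("xmin =", fit.power_law.xmin, "alpha =", fit.power_law.alpha)

# Overlay the empirical pdf and the fitted power law on log-log axes.
ax = fit.plot_pdf()
fit.power_law.plot_pdf(ax=ax, linestyle="--")
```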

Features

This subsection discusses in detail all the features employed in the approaches described in the subsequent sections. The features in our dataset can be broadly categorized into user features, post title features, community features, submission features and content features. I present a detailed analysis of each of these feature groups and describe how the feature set was pruned to arrive at an appropriate feature list for the prediction tasks.

• User Features - As discussed in the introduction, one of the goals of this work is to study how a user's influence affects the popularity of a submission. On sites like Twitter, this factor is crucial to the popularity of a post. Reddit's content organization mechanism, on the other hand, allows content to take precedence over user influence in accounting for submission popularity. However, it was important to understand the role of the user in such an environment. Some features which would allow us to capture this are the average numbers of comments, upvotes and downvotes received by the user's previous submissions, and the user's subscriber list. Unfortunately, subscriber lists are mostly private on Reddit and hence inaccessible. Also, our dataset did not provide enough representation of each user to confidently estimate metrics such as the average numbers of comments, upvotes and downvotes. To overcome these issues, I extracted each user's link karma from Reddit and used it as a proxy for user influence. In fact, doing so yielded some interesting correlations with the numbers of upvotes, downvotes and comments.

Figure 4: Link karma vs. average number of comments (in the first 2 hours of post submission)

Figure 4 plots user link karma against the average number of comments received by the user's submissions. The curve exhibits an exponential trend, and users with link karma greater than 10K receive a relatively higher average number of comments. This is interesting because it indicates that users with high link karma are expert users who have mastered the art of submitting posts on Reddit in a way that attracts high attention. Users with higher link karma also typically have more visibility, because many other users subscribe to them, and the high attention received by their submissions can also be attributed to this.

• Post Title Features - Post title features comprise the attributes of the title of the post (the descriptive text written by the user). There are multiple features relevant to the text of the title that we can consider. I explored a large set of features in this space, such as the KL divergence of the words used in the title with respect to the community's language, the presence of named entities and of positive and negative sentiment, the subjectivity of the title, the number of words in the title, the fraction of content-specific words in the title, and the fraction of community-specific words in the title. Among all these, the fraction of content-specific words, the fraction of community-specific words, and the subjectivity of the title proved to be the most effective. KL divergence also showed some promising insights; however, it was highly correlated with the fraction of community-specific words, and metrics such as information gain suggested that the fraction of community-specific words was the more useful of the two.

In order to determine the community-specific and content-specific words, I used the following metrics. Let $w_s$ denote the subreddit specificity score of a word $w$ with respect to subreddit $s$, and $w_{c,s}$ its content specificity score:

$$w_s = \frac{\#\text{ of times } w \text{ appears in titles on subreddit } s}{\#\text{ of times } w \text{ appears in the entire dataset}}$$

$$w_{c,s} = \frac{\#\text{ of times } w \text{ appears in titles of content } c}{\#\text{ of times } w \text{ appears in subreddit } s}$$

These two scores measure the specificity of a word with respect to its community and the associated content respectively. I designated thresholds for these scores; any word with a score above the threshold was tagged as a content-specific or community-specific word, and the fraction of such words in each post was computed (a code sketch of these scores appears after this feature list). Formally, the subreddit-specific ($sp^{subr}_p$) and content-specific ($sp^{cont}_p$) scores of a particular post $p$ are:

$$sp^{subr}_p = \frac{\#\text{ of subreddit-specific words in the title of post } p}{\#\text{ of words in the title of post } p}$$

$$sp^{cont}_p = \frac{\#\text{ of content-specific words in the title of post } p}{\#\text{ of words in the title of post } p}$$

Subjectivity of the title is a more involved concept. I initially started with the naive approach of using lexicons of positive and negative words, but found that an approach involving subjectivity detection at the phrase level 2 was much more effective. I employed this approach, which assigns each title a score in the range 0 to 1 indicating the extent of its subjectivity. Note that we are not distinguishing between positive and negative sentiment here, but merely extracting the subjectivity of each title. This aspect exhibited interesting correlations with the number of comments at the community level.

2. www.lingpipe.com

Figure 5: Community specificity score of a post vs. average # of comments (normalized)

Figure 6: Content specificity score of a post vs. average # of comments (normalized)

Figures 5, 6 and 7 show the effect of the aforementioned three factors on the average number of comments received by submissions exhibiting the corresponding feature values.

• Community Features - The major feature in this category was the 'activity' of the community. Table 1 shows a positive correlation between the number of posts in a community and the number of comments obtained by posts in that community, so the number of posts could serve as a good indicator of activity within the community. Further, I analyzed the peak activity periods of each community. Figure 8 shows the hour of the day (in UTC) plotted against the average number of comments in that hour for the top 5 communities in the dataset. This shows that the hour of the day is one of the factors influencing the popularity of a post. The number of posts within a community during a given hour of the day exhibits similar sinusoidal behavior. To quantify the 'activity' of a community, the average number of posts in the community at that particular hour turned out to be the most effective feature, offering the highest information gain.

Figure 7: Subjectivity score of a post vs. average # of comments (normalized)

• Resubmission Features - One of the key aspects of our study is to understand how content popularity varies with multiple submissions of the same content, albeit with different community, submission and user attributes.

Figures 9a and 9b plot the submission number against the number of comments at the corpus level and at the community level, respectively. The average number of comments decreases as the submission number increases, both at the corpus and at the community level. This is primarily because the first submission is usually novel and gets more attention from subscribers; once the content becomes stale, fewer people are interested in commenting on it again. However, this decrease is non-monotonic, and occasional peaks appear in the curve.

Our analysis showed that this non-monotonicity can be explained by a couple of interesting factors. Firstly, if the difference in time between two submissions is large (of the order of several days), the resubmission's popularity remains unaffected by the previous submission. Secondly, the number of comments a submission gets depends on the number of comments obtained by the previous submissions. Figure 10 demonstrates both of these aspects. If the previous submissions failed to get adequate visibility or to generate sufficient interest, for instance due to being posted in an inappropriate community or with an uncatchy title, then the probability that the current submission gets more visibility, and hence more comments, increases. On the other hand, if the previous submissions received the attention they were due, then the current submission is likely to receive fewer comments, because users have already seen the previous submissions of the same content. This dependency on the previous submissions is broken only by the time difference between adjacent submissions: the larger this difference, the weaker the dependence of the current submission's popularity on the previous ones.

To capture this behavior, let us define a metric called the potential of post $p$, both across communities and within the same community: $\phi_{c,p}$ and $\phi^s_{c,p}$, where $p$ denotes a submission of some content $c$ and $s$ denotes a specific subreddit. Let $A_n$ and $V_{c,n}$ denote, respectively, the average number of comments a post with submission number $n$ gets (averaged across all communities) and the actual number of comments obtained by that submission (a subscript $s$ on these quantities denotes their subreddit specific values). The value of $k$ is set based on empirical performance and is discussed in detail in the experiments section. A code sketch of this potential appears after this feature list.

$$\phi_{c,p} = \frac{1}{k} \sum_{i=n-k}^{n-1} \frac{A_i - V_{c,i}}{\Delta t_{c,p,i} + 1}$$

$$\phi^{s}_{c,p} = \frac{1}{k} \sum_{i=n-k}^{n-1} \frac{A_{i,s} - V_{c,i,s}}{\Delta t^{s}_{c,p,i} + 1}$$

where $\Delta t_{c,p,i}$ denotes the time difference in days between submissions $p$ and $i$ of content $c$, and $\Delta t^{s}_{c,p,i}$ denotes the same time difference when both $p$ and $i$ are posted in subreddit $s$.

Figure 8: Hour of the day vs. average number of comments (normalized against the maximum number of comments received by a post)

So, we pick the following resubmission features:

– submission number

– potential function of post p evaluated across the previous k submissions

Note that each of the aforementioned features is computed both across all communities and within the community.

• Content Features - This corresponds to the interestingness of the image itself. This is a latent attribute and needs to be inferred from the multiple submissions of the same content. I parameterize it as a numeric quantity: the higher the value, the better the content. Note that we could also quantify the interestingness of content with respect to individual communities, but our dataset does not contain enough submissions of the same content in multiple communities. Hence, I stick with a single notion of interestingness that holds across all communities.
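As referenced in the discussion of post title features above, here is a minimal sketch of how the word-level specificity scores $w_s$ and $w_{c,s}$ could be computed. The input layout and all names are illustrative, not from the original implementation; in particular, the numerator of $w_{c,s}$ is read here as counting occurrences in titles of content $c$ within subreddit $s$.

```python
from collections import Counter

def specificity_scores(titles):
    """Compute w_s and w_{c,s} as defined in the Features section.
    `titles` is an iterable of (subreddit, content_id, title_text) tuples."""
    total = Counter()        # word -> count over the entire dataset
    per_sub = {}             # subreddit -> Counter of words in its titles
    per_content = {}         # (content, subreddit) -> Counter of words
    for sub, content, text in titles:
        words = text.lower().split()
        total.update(words)
        per_sub.setdefault(sub, Counter()).update(words)
        per_content.setdefault((content, sub), Counter()).update(words)
    # w_s: share of a word's dataset-wide occurrences that fall in subreddit s
    w_s = {(w, s): n / total[w]
           for s, ctr in per_sub.items() for w, n in ctr.items()}
    # w_{c,s}: share of a word's subreddit occurrences that fall in content c
    w_cs = {(w, c, s): n / per_sub[s][w]
            for (c, s), ctr in per_content.items() for w, n in ctr.items()}
    return w_s, w_cs
```

The post-level scores $sp^{subr}_p$ and $sp^{cont}_p$ then follow by thresholding these word scores and counting the fraction of tagged words per title.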
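Likewise, here is a sketch of the potential feature $\phi$ for one post. It assumes the caller supplies, for the $k$ submissions preceding the current one, the average comment counts $A_i$, the actual counts $V_{c,i}$ and the time gaps in days; the names are illustrative.

```python
def potential(avg_comments, actual_comments, day_gaps, k):
    """phi for one post: mean over the k previous submissions of the
    shortfall (A_i - V_{c,i}), damped by the time gap (delta_t + 1).
    The i-th entry of each list refers to the i-th previous submission,
    most recent first."""
    terms = [(a - v) / (dt + 1.0)
             for a, v, dt in zip(avg_comments[:k], actual_comments[:k],
                                 day_gaps[:k])]
    return sum(terms) / k
```

The within-subreddit variant $\phi^s$ is identical, with the averages, counts and gaps restricted to submissions in subreddit $s$.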

Our Approach

This section discusses in detail the framework I employ to predict the popularity of a given submission at various points in time. The framework has two parts: the first is a model for predicting initial popularity as a function of all the features discussed in the previous section; the second is a model which takes the initial predictions as input and predicts the popularity of the submission at subsequent points in time.

Figure 9: a. Submission number (aggregate) vs. number of comments. b. Submission number (community level) vs. average number of comments (normalized against the maximum number of comments received by a post)

Predicting initial popularity

The typical approach to predicting initial popularity from a set of features is to use off-the-shelf classification or regression methods. In our case, however, as seen in the previous section, most features affect popularity differently depending on the subreddit. For instance, subreddits such as politics favor submissions whose titles are specific to the community and to the content, while subreddits such as pics and funny do not emphasize this aspect. Further, the interestingness of the content is an important aspect which needs to be modeled as a latent variable and integrated into the model as an explanatory factor. Putting these requirements together, it becomes essential to design our own model incorporating these aspects.

Table 2: List of all the features and notations

Notation        Feature Description
c               index denoting content
s               index denoting subreddit
u               index denoting a user
p               index denoting a post
n               submission number of the post across all subreddits
n_s             submission number within subreddit s
lk_u            link karma of user u
sp^subr_p       subreddit specificity score of post p
sp^cont_p       content specificity score of post p
sp^subj_p       subjectivity score of post p
avgc_{s,h}      average # of comments in subreddit s during hour h of the day
φ_p             potential function of post p carrying content c (across all subreddits)
φ^s_p           potential function of post p carrying content c with respect to subreddit s
Δt              time in days since previous submission (across all subreddits)
Δt_s            time in days since previous submission within subreddit s
r_c             interestingness of content c (latent variable)

Figure 10: Time difference vs. correlation (explained in the text of the section)

I propose a regression based formulation encapsulating a latent variable for the content interestingness. Let $\hat{V}_p$ denote the estimated initial popularity of a post $p$. It is modeled as:

$$\hat{V}_p = \beta_0 + \beta_1 \exp(\alpha_1 \, lk_{u_p}) + \beta^s_2 \, sp^{subr}_p + \beta^s_3 \, sp^{cont}_p + \beta^s_4 \, sp^{subj}_p + \beta_5 \, avgc_{s,h} + \beta_6 \exp(\alpha_2 \, \phi_p) + \beta_7 \exp(\alpha_3 \, \phi^s_p) + \beta_8 \exp(\alpha_4 \, n_p) + \beta_9 \exp(\alpha_5 \, n_{s,p}) + \beta^s_{10} \, r_c \quad (1)$$

The unknown variables in the equation above are the $\alpha$ and $\beta$ quantities and $r_c$. Note that parameters with superscript $s$ are subreddit-level parameters, and $r_c$ is a content-level parameter. The objective function to be minimized is:

$$\min \; \frac{1}{2} \sum_p \left( V_p - \hat{V}_p \right)^2$$

I employed gradient descent with simultaneous updates to solve the above optimization problem. The algorithm is run until the difference in squared error between subsequent iterations falls below 0.0001.
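Below is a minimal sketch of this fitting procedure, not the original implementation: it assumes the α's have been fixed, so the exponential terms become precomputed columns of the feature matrix, and it collapses the subreddit-level β's to global ones for brevity. What remains is a least squares problem in the β's and the latent per-content r_c, solved by gradient descent with simultaneous updates and the stopping rule above.

```python
import numpy as np

def fit_initial_popularity(X, content_ids, y, lr=0.01, tol=1e-4,
                           max_iter=100_000):
    """Minimize 0.5 * sum_p (y_p - X_p @ beta - r[content_ids[p]])**2.
    X           : (n, d) feature matrix (exp terms precomputed, alphas fixed).
    content_ids : (n,) non-negative int id of the content of each post.
    y           : (n,) observed number of comments in the first hour."""
    n, d = X.shape
    beta = np.zeros(d)
    r = np.zeros(content_ids.max() + 1)   # latent interestingness r_c
    prev_err = np.inf
    for _ in range(max_iter):
        resid = X @ beta + r[content_ids] - y
        err = 0.5 * np.dot(resid, resid)
        if prev_err - err < tol:           # stop once squared error stabilizes
            break
        prev_err = err
        # Simultaneous updates: both gradients use the same residual vector.
        beta = beta - lr * (X.T @ resid) / n
        r = r - lr * np.bincount(content_ids, weights=resid,
                                 minlength=len(r)) / n
    return beta, r
```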

Predicting content popularity at the next time instant

The next step is to generalize this framework to predict the popularity of a particular submission at various time instants. The popularity of a given post $p$ at the first time instant is given by equation 1.

Autoregressive (AR) models are simple and effective approaches to time series modeling. An AR model posits the value of a variable at time $t$ as a function of the variable's values at the $m$ previous time instants. The AR($m$) model is defined as

$$V_t = c + \sum_{i=1}^{m} \psi_i V_{t-i} + \varepsilon_t \quad (2)$$

where $c$ is a constant, $\varepsilon_t$ is white noise, and the $\psi_i$ are the coefficients to be estimated. A major limitation of AR models is that they do not capture bursts effectively: in our application, submission popularity increases in bursts when submissions are promoted to the front page. So, I propose an Augmented Autoregressive (AAR) model which can capture these bursts in submission popularity.

AAR is designed to capture the sudden bursts in commenting activity due to promotion to the front page. Let $t_{cur}$ denote the number of hours from the time the submission was posted to the current time instant, let $t_{pr}$ denote the time at which the upvotes on the submission exceeded the promotion threshold and the submission was promoted to the front page, and let $t_{de}$ denote the time to decay, i.e., the average number of hours after promotion at which a submission is removed from the front page. The augmented autoregressive model is formulated as:

$$\hat{V}_{p,t} = c \, u(t_{cur} - t_{pr}) \, u(t_{pr} - t_{cur} + 2) + \sum_{i=1}^{m} \psi_i V_{t-i} - d \, u(t_{cur} - t_{pr} - t_{de}) \, u(t_{pr} + t_{de} - t_{cur} + 2) \quad (3)$$

Here $c$ and $d$ are coefficients to be estimated and $u(\cdot)$ is a step function: $u(x) = 1$ if $x > 0$ and $u(x) = 0$ otherwise. The first term models the sudden increase in the rate of comments when a submission is promoted to the front page: $u(t_{cur} - t_{pr}) = 1$ once the content has been promoted, and $u(t_{pr} - t_{cur} + 2) = 1$ for the first 2 time instants after promotion, so the term is active only while both step functions evaluate to 1, i.e., during the first two time instants after the submission's promotion. After this interval the step functions evaluate to 0, and the onus is on the middle term to model the comments as a function of prior popularity. Analogously, the last term accounts for the decrease in the rate at which comments pour in after the submission has been pulled off the front page.

The parameters to be estimated are $c$, $d$, $t_{pr}$, $t_{de}$ and the $\psi_i$; $t_{cur}$ is simply the number of hours elapsed since the submission of the post. The parameters $t_{pr}$ and $t_{de}$ are set from empirical observations of the data. As discussed in the Dataset section, I obtained 1248 posts which were promoted to the front page. Analysis of these posts revealed that promoted submissions need to cross a specific threshold on the number of upvotes within about 10 hours of submission. Reddit does not publish the minimum number of upvotes needed for a post to be featured on the front page; however, in the data collected, the minimum number of upvotes of any submission featured on the front page was 1012. Typically, the rate of accumulation of comments increases steeply from the instant the content shows up on the front page, and a submission stays on the front page for an average of 5 hours (1-9 hours is the typical range observed in the dataset). From these observations, $t_{de} = 5$, and $t_{pr}$ is defined as follows:

$$t_{pr} = \begin{cases} t & \text{if } V^{upvotes}_{p,t} \geq 1012 \text{ and } t \leq 10 \\ 0 & \text{otherwise} \end{cases}$$

That is, $t_{pr}$ is the time instant $t$ at which the number of upvotes of post $p$ reaches 1012, provided $t$ is within 10 hours of the post's inception; otherwise $t_{pr}$ is 0. To estimate $c$, $d$ and the $\psi_i$, the formulation in equation 3 is fit by least squares minimization using gradient descent, as discussed in the previous subsection.

Experimentation

This section discusses the experimentation carried out to demonstrate the effectiveness of the proposed framework and to analyze the effect of various features on submission popularity. It is divided into three parts: the first subsection evaluates the initial predictor; the second evaluates the AAR model on the task of predicting the number of comments obtained by a submission; the last discusses the relative importance of the features in explaining content popularity.

Setup. All the experiments have been performed using 75% of the data as the training set and the remaining 25% as the test set. All feature values have been normalized to a mean of 0 and unit variance, in order to facilitate faster convergence of the gradient descent algorithm and easier interpretation of the weights assigned to the respective features.
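A minimal sketch of this normalization step (standard z-scoring; the function name is illustrative):

```python
import numpy as np

def zscore(X):
    """Normalize each feature column to zero mean and unit variance."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)
```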

Baselines. Since there has not been much work on popularity prediction for Reddit, I compare our model against its ablations. In addition, since the data contains many duplicate submissions, we want to examine the explanatory capability of resubmissions alone. To this end, I used two more baselines: the first models an exponential decay of the number of comments with submission number; the second models an exponential decay with submission number within the subreddit.

Baseline 1:

$$\hat{V}_p = \beta^c_0 + V_{c,1} - \beta_1 e^{\alpha n_p}$$

where $V_{c,1}$ denotes the number of comments on the first submission of content $c$ (a known quantity) and $n_p$ is the submission number of post $p$. The other parameters are again estimated by least squares minimization. A sketch of fitting this baseline appears below.

Baseline 2:

$$\hat{V}_p = \beta^c_0 + V_{c,s,1} - \beta_1 e^{\alpha n_{p,s}}$$

where $V_{c,s,1}$ denotes the number of comments on the first submission of content $c$ in subreddit $s$ (a known quantity) and $n_{p,s}$ is the submission number of post $p$ within subreddit $s$. Note that $\beta^c_0$ is a content-level parameter.

Baseline 3: This baseline is the linear regression formulation without the latent variable $r_c$ and with all coefficients taken at the global level.

9

Page 10: Demystifying content popularity on Redditsnap.stanford.edu/class/cs224w-2012/projects/cs224... · social media and analysis of voting/commentingpat-terns on social sharing sites.

Table 3: Initial prediction results

Approach                    R^2
Our Model                   0.672
Our Model - User            0.603
Our Model - Title           0.588
Our Model - Resubmission    0.267
Our Model - Content         0.512
Baseline 1                  0.287
Baseline 2                  0.376
Baseline 3                  0.412

Evaluating Initial Predictions

Table 3 shows the results of predicting the number of comments in the first hour. The predictions of the models are evaluated using the $R^2$ metric, the coefficient of determination, which relates to the mean squared error and the variance as:

$$R^2 = 1 - \frac{MSE}{Var}$$

The results of the model's ablations are also shown in the table. 'Our Model - feature' denotes an ablation which does not capture the effects of that particular feature; for instance, 'Our Model - User' does not contain the user features.

Discussion. It can be seen that the proposed model is the most effective at explaining submission popularity. Table 3 also indicates that removing the resubmission features decreases the accuracy of the model considerably, while the user and title features have a much smaller impact on overall accuracy. Interestingly, ignoring the content modeling also results in a significant dip in accuracy. Further, the baselines, especially the second, account for a decent portion of the popularity as well, indicating that the initial submissions in a subreddit typically get far more attention than subsequent submissions of the same content.

Evaluating Predictions at Subsequent Time Instants

Figure 11 shows the predictive performance of the AAR and AR models on the time series prediction task at hand, together with the performance of the initial prediction model 3. It is important to discuss the details of the training phase here. The AAR and AR models are fit to the data by observing the number of comments at the previous 2 time instants. The initial prediction model, on the other hand, was trained separately for each time instant t, to see how effective it would be at making predictions at various points in time. Note that making the initial prediction model work in such a setup requires training it separately for each slice of time, incurring additional overhead; it is not ideal to re-estimate the coefficients of every feature at each time instant. The AAR and AR models, in contrast, have relatively few parameters to keep track of and perform efficiently.

3. Note that all the other ablations and baselines were also tested in the same way as the initial prediction model for this task; however, they exhibited poorer performance than the initial prediction model and have been left out of the figure to avoid clutter.

Discussion. From Figure 11, it can be seen that the AAR model, despite its very simple augmented additive terms, performs better at almost all time instants. The AR model exhibits comparable performance at time instants > 16 hours, the period by which the flow of comments has stabilized. The initial prediction model, despite being trained separately at each time instant, performs slightly worse than the AAR model. The features were observed to explain the popularity of a post well during the initial phases of its lifetime; however, when it comes to modeling the bursts and sudden decays in the rate of comments, conditioning popularity on the features alone performs poorly.

Evaluation on Bursty Data. The main objective of using an augmented model rather than the AR model directly was to account for the bursts and steep decays in the rate of comment flow. To study this more closely, I picked those posts from the test set for which the rate of increase of comments suddenly rises and falls beyond a particular threshold. About 28.32% of the posts in the entire dataset exhibit this property; the test set in particular contains 8.92% of such posts (percentage computed over the entire dataset). Figure 12 compares the performance of the AAR and AR models on these posts. The differences in performance are now much more clearly visible. The AR model performs poorly in the beginning, then adapts itself to the changes in comment rates, degrades again when the comment rate declines, and becomes stable towards the end of the curve. As can be seen in the figure, AAR models these abrupt changes much more effectively.

Table 4: Communities with low and high coefficients for various features

Feature                  High coefficient                                    Low coefficient
Community specificity    atheism, gaming, technology, politics, worldnews    aww, pics, funny, gifs, reactiongifs
Content specificity      politics, business, atheism, gaming, movies         funny, pics, gifs, reactiongifs, aww
Subjectivity             pics, funny, gifs, aww, food                        politics, business, food, worldnews, technology
Content r_c              pics, food, apple, funny, gifs                      politics, business, technology, games, movies

Figure 11: Time since posting of submission vs. R^2 of prediction

Figure 12: Time since posting of submission vs. R^2 of prediction (bursty data)

Feature Analysis

In this subsection, we address the question of which features are most important in explaining content popularity, by comparing the coefficients obtained from the initial prediction model. Note that not all coefficients are at the global level; several are subreddit-level coefficients. We begin with the coefficients at the global level. $\beta_1$, $\beta_6$, $\beta_7$, $\beta_8$ and $\beta_9$ have multiplicative exponential terms. Comparing these coefficients, the potential function within the subreddit turned out to be the most influential factor, followed by the submission number within the subreddit, then the potential function across the entire corpus, then the submission number across the entire corpus; user influence ranked last.

Moving on to the subreddit-specific coefficients, it was interesting to see that different subreddits emphasize different aspects. As discussed in the Features section, some subreddits favor the usage of community-specific words while others do not. Table 4 highlights some of the top communities placing higher and lower weights on the various factors under consideration. Communities such as politics, business, worldnews and technology, which are focused on particular content, place higher emphasis on content specificity and community specificity and discourage subjectivity in the title of the post, whereas communities such as pics, gifs and funny tend to exhibit the opposite behavior. Another interesting insight is that the coefficients on the content attribute are high for communities such as pics, gifs and funny, and low for communities such as politics, technology and games. This is rather surprising. Delving deeper, I realized the reason for this behavior: titles in communities such as pics are not descriptive enough to discriminate the content or the image, so the actual image or content becomes important. In communities such as politics and technology, on the other hand, titles tend to be descriptive enough to indicate what the content is, and hence the importance given to the topic is split between the title and the actual content.

Conclusion

In this report, I have presented in detail my work on analyzing the popularity of a given post. I have outlined methods suitable for predicting both initial popularity and popularity at subsequent time instants. This work is a preliminary attempt at understanding the richness of the content on Reddit. As is evident from this writeup, several factors contribute to the popularity of content on Reddit; I have made an effort to analyze their interplay and to propose a framework that ties all these aspects together. There is certainly much left to be done in exploring these factors more rigorously and in developing more scalable models for handling the data on Reddit.

References

[1] Roja Bandari, Sitaram Asur, Bernardo A. Huberman: The Pulse of News in Social Media: Forecasting Popularity. In ICWSM, 2012.

[2] Yoav Artzi, Patrick Pantel, Michael Gamon: Predicting Responses to Microblog Posts. In HLT-NAACL, 2012: 602-606.

[3] Jong Gun Lee, Sue Moon, Kave Salamatian: An Approach to Model and Predict the Popularity of Online Contents with Explanatory Factors. In IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, 2010.

[4] Himabindu Lakkaraju, Jitendra Ajmera: Attention Prediction on Social Media Brand Pages. In CIKM, 2011.

[5] Sitaram Asur, Bernardo A. Huberman: Predicting the Future With Social Media. In WI-IAT, 2010.

[6] Kristina Lerman, Aram Galstyan: Analysis of Social Voting Patterns on Digg. In WOSN, 2008.

[7] Tad Hogg, Kristina Lerman: Social Dynamics of Digg. In ICWSM, 2010.

[8] Kristina Lerman, Tad Hogg: Using a Model of Social Dynamics to Predict Popularity of News. In WWW, 2010.
