Ignorance isn't Bliss: An Empirical Analysis of Attention Patterns in Online Communities


Presented at the ASE/IEEE International Conference on Social Computing 2012 in Amsterdam


Claudia Wagner, Matthew Rowe, Markus Strohmaier and Harith Alani

Amsterdam, 16.4.2012

with…

Matthew Rowe

Markus Strohmaier

Harith Alani

Motivation

Which factors impact how much attention a post gets?

We use the number of replies as a proxy measurement of attention

Research Questions

Which factors impact the attention level a post gets in certain community forums?

How do these factors differ between individual community forums?

Methodology

Empirical study of attention patterns in 20 randomly selected forums

Two-stage approach

Differentiate between threadstarter posts that got at least one reply (seed posts) and threadstarter posts that got no replies at all (non-seed posts)

Predict the level of attention that seed posts will generate - i.e. the number of replies

Dataset

Most popular Irish message boards, Boards.ie

725 Forums

Year 2005 and 2006


Feature Engineering

Aim

Identify the features that impact upon seeding a discussion

Identify features associated with seed posts that generate the most attention

Five Feature Groups

User Features

User account age, post count, in-degree, out-degree, post rate

Content Features

Post length, complexity, readability, link count, time in day, informativeness, polarity

Title Features

Length, question marks, linguistic dimensions (LIWC)

Focus Features

Forum entropy, forum likelihood, topic entropy, topic likelihood, topic distance

Community Features

Topical community fit, topical community distance, evolution score, inequity score
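The slides only name the focus features. As an illustration, forum entropy can be read as the Shannon entropy of a user's posts across forums. This definition, the function name and the toy data below are assumptions and are not taken from the slides.

```python
import math
from collections import Counter


def forum_entropy(forum_ids):
    """Shannon entropy (in bits) of a user's posts across forums.

    Assumed definition: 0 means the user posts in a single forum,
    higher values mean the posts are spread over more forums.
    """
    counts = Counter(forum_ids)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy


# Hypothetical users: one posts only in a single forum, one spreads evenly over four.
focused_user = ["golf"] * 8
scattered_user = ["golf", "golf", "work", "work",
                  "astro", "astro", "spanish", "spanish"]

print(forum_entropy(focused_user))
print(forum_entropy(scattered_user))
```

Under this reading, a user who concentrates on one forum has zero entropy, and a user who posts evenly in four forums has an entropy of two bits.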

Feature Computation

For each threadstarter post published in one of the 20 randomly selected forums in 2006 we computed our 28 features

[Figure: timeline covering 2005 and 2006 with a 6-month window marked]

Fit LDA model with standard parameters: T=50, beta=0.01, alpha=50/T


Seed Post Identification Experiment

Identify Posts which got replies (Binary Classification Task)

Split data of each forum into train and test data (80/20)

Train a logistic regression classifier with each feature group in isolation and all features combined

Compare performance by using F1 score and the Matthews correlation coefficient (MCC)
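Both evaluation measures can be computed from the confusion counts of a binary classifier. The sketch below uses the standard formulas for F1 and MCC; the labels and predictions are made-up toy values, not results from the study.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, where 1 = seed post."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient, in [-1, 1]; 0 is chance level."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy test split (hypothetical): 1 = post received a reply, 0 = no reply.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(f1_score(tp, fp, fn), mcc(tp, tn, fp, fn))
```

Unlike F1, MCC also takes the true negatives into account, which makes it informative when seed and non-seed posts are unevenly distributed.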


Seed Post Identification Results

For these 9 forums our classifiers outperform the random baseline:

Astronomy & Space: a classifier trained with content features alone performs best

Spanish: a classifier trained with title features alone performs best


Seed Post Identification Feature Impact

Analyze impact of individual features rather than groups

Interpret statistically significant coefficients of the best performing feature group learned by the logistic regression model

Rank the features of the best performing feature group using the Information Gain Ratio (IGR) as a ranking criterion
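A minimal sketch of the Information Gain Ratio for a discrete feature follows. It uses the common definition (information gain divided by the entropy of the feature's own value distribution). How continuous features were discretised in the study is not stated on the slides, so the binned toy feature below is an assumption.

```python
import math
from collections import Counter, defaultdict


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(values).values())


def information_gain_ratio(feature_values, labels):
    """Information gain of the feature, normalised by its split information."""
    total = len(labels)
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    # Expected label entropy after splitting on the feature.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - conditional
    split_info = entropy(feature_values)
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical binned title length versus seed (1) / non-seed (0) labels.
labels = [1, 1, 0, 0]
informative = ["short", "short", "long", "long"]
uninformative = ["short", "long", "short", "long"]

print(information_gain_ratio(informative, labels))
print(information_gain_ratio(uninformative, labels))
```

A feature that separates seed from non-seed posts perfectly scores 1, and a feature that carries no information about the label scores 0.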


Seed Post Identification Observations

In the Spanish community the title length is the most important feature (IGR=0.558, coef=-0.326)

Posts with long titles are less likely to get replies

In the Bank & Insurance forum short but complex posts which are authored by newbies are most likely to get replies

Content length coef=-0.017, p<0.05

Topic distance coef=2.890, p<0.01

Complexity has highest IGR (IGR=0.354)
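One way to read logistic regression coefficients such as those above is as odds ratios: exponentiating a coefficient gives the factor by which the odds of receiving a reply change per unit increase of the feature. The sketch assumes the coefficients are log-odds on the feature scale used in the study, which the slides do not specify. The dictionary keys are illustrative names.

```python
import math

# Coefficients quoted on the slide for the Bank & Insurance forum.
coefficients = {
    "content_length": -0.017,
    "topic_distance": 2.890,
}


def odds_ratio(coef):
    """Multiplicative change in the odds of a reply per unit of the feature."""
    return math.exp(coef)


for name, coef in coefficients.items():
    direction = "raises" if coef > 0 else "lowers"
    print(f"{name}: odds ratio {odds_ratio(coef):.3f} ({direction} the odds)")
```

An odds ratio below 1 corresponds to a negative coefficient, such as content length here, and an odds ratio above 1 to a positive one, such as topic distance.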


Number of links has a negative impact in the Work & Jobs and Golf forums, but a positive impact in the Astronomy & Space forum

Purpose of community

Links have a positive impact in content- and information-driven communities

Links have a negative impact in other communities


Some communities require posts to fit to the topics they usually discuss (e.g., Golf) while others are more open to diverse topics (e.g., Work & Jobs)

Specificity of community’s subject

Subject of the Work & Jobs forum is very general: high topical community distance has a positive impact

Subject of the Golf forum is very specific: high topical community distance has a negative impact


Activity Level Prediction Experiment

Identify the features that were correlated with lengthy discussions

Rank posts according to their attention level

Evaluate our predicted rank using normalized Discounted Cumulative Gain (nDCG) at varying rank positions i.e. top-k where k={1, 5, 10, 20, 50, 100}

nDCG = DCG of the predicted ranking divided by the DCG of the actual ranking
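The ratio can be sketched as follows. The slides do not say which DCG variant was used, so this example assumes the reply count as gain and a log2 position discount. The scores and reply counts are toy values.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain of the first k items (log2 discount)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(predicted_scores, reply_counts, k):
    """DCG of the predicted ranking divided by DCG of the actual ranking."""
    # Order posts by predicted score, highest first.
    order = sorted(range(len(reply_counts)),
                   key=lambda i: predicted_scores[i], reverse=True)
    predicted_gains = [reply_counts[i] for i in order]
    # The actual (ideal) ranking orders posts by their true reply counts.
    ideal_gains = sorted(reply_counts, reverse=True)
    ideal = dcg_at_k(ideal_gains, k)
    return dcg_at_k(predicted_gains, k) / ideal if ideal > 0 else 0.0


# Hypothetical seed posts with their true reply counts.
reply_counts = [12, 3, 7, 1]
good_scores = [0.9, 0.2, 0.5, 0.1]  # ranks posts in the true order
bad_scores = [0.1, 0.5, 0.2, 0.9]   # ranks the least-discussed post first

print(ndcg_at_k(good_scores, reply_counts, k=2))
print(ndcg_at_k(bad_scores, reply_counts, k=2))
```

A predicted ranking that matches the true ordering scores 1, and rankings that place little-discussed posts near the top score lower.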


Activity Level Prediction Results

[Figure: averaged normalised discounted cumulative gain]

A value of 1 indicates that the predicted ranking of posts perfectly matched their real ranking.


For the Astronomy & Space community content features were best for identifying seed posts and are also best for ranking posts according to the attention level they will generate.


Golf forum (343): Combination of all features worked best for identifying seed posts. Focus features alone are best for ranking posts.


Bank & Insurance forum (544): Combination of all features worked best for identifying seed posts. Community features alone are best for ranking posts.


Activity Level Prediction Summary

Factors that impact discussion initiation often differ from the factors that impact discussion length

e.g. for the Golf community

Seed Posts = all features

Activity level = focus features


Factors that are associated with lengthy discussion tend to be different for different communities

Title length is the only feature with a small but statistically significant positive impact on the number of replies a post gets across several communities

Work & Jobs forum: title length coef=0.034, p<0.01

Satellite forum: title length coef=0.030, p<0.05

Conclusions (1)

Different community forums exhibit interesting differences in terms of how attention is generated

Most attention patterns which we identified are local and community-specific

“Global” patterns may depend heavily on the composition of the dataset

Conclusions (2)

Same features that have a positive impact on the start of discussions in one community can have a negative impact in another community

Example: number of links

Negative impact in most communities

Positive impact in information and content driven communities

Conclusions (3)

Purpose of community and specificity of community’s subject may impact their reply behavior

Communities which have a supportive purpose are most likely driven by different factors than communities with an informational purpose.

Communities around very specific topics require posts to fit to the topical focus. Communities around more general topics do not have this requirement.

Limitations & Future Work

Correlation versus Causality

We cannot answer the “what would have happened if” question with our approach

Controlled experiments where platform is manipulated

Most attention patterns are local. But how local?

Can we automatically identify the context in which attention patterns may hold?

Experimental Setup

THANK YOU

claudia.wagner@joanneum.at

http://claudiawagner.info


Attention patterns tend to be local and community-specific. Ignoring communities’ idiosyncrasies isn’t bliss.