Dynamic Multi-Relational Chinese Restaurant Process for Analyzing Influences on Users in Social Media*
Indrajit Bhattacharya
Research Scientist, IBM Research, Bangalore
*Collaboration w/ Himabindu Lakkaraju & Chiranjib Bhattacharyya
Workshop on Social Computing, IIT Kharagpur, Oct 5-6 2012
Social Media Analysis: Motivation
Microblogs: Twitter, Facebook, MySpace
Understanding and analyzing topics & trends
Influences on users
Variety of stakeholders
Business
Government
Social scientists
Social Media Analysis: Challenges
Network and Influences on Users
User personality: Personal preferences, global and geographic trends, social circle in the network [Yang WSDM 11]
Dynamic nature
Topics & user personalities evolve over time
Volume of data
Existing approaches fall short
Social Media Analysis: State of the Art
Content Analysis
Ramage ICWSM 2010, Hong SOMA 2010
Variants of LDA
Inferring User Interests
Ahmed KDD 2011, Wen KDD 2010
Individual features such as user activity or network
Patterns in Temporal Evolution
Yang et al WSDM 2011
Bayesian Non-parametric Models
Choosing the number of components in a mixture model
Particularly severe problem for large data volumes such as social media data
Bayesian solution
Infinite dimensional prior
Allows the number of mixture components to grow with data size
Cannot capture richness of social media data
Algorithms often not scalable
Talk Outline
Background: Chinese Restaurant Processes
CRP with multiple relationships: (RelCRP, MRelCRP)
Dynamic MRelCRP
Multi-threaded Online Inference Algorithm
Experimental Results
Dirichlet Process (Informal)
Dirichlet Process: Properties
Chinese Restaurant Process (CRP)
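The body of the CRP slide did not survive in this transcript. As a reminder of the standard construction, here is a minimal Python sketch: each new customer joins an existing table with probability proportional to its occupancy, or opens a new table with probability proportional to a concentration parameter. The function name, the parameter name `alpha`, and the seeding are illustrative choices, not taken from the talk.

```python
import random

def sample_crp(n_customers, alpha, seed=0):
    """Draw one seating arrangement from a CRP with concentration alpha > 0."""
    rng = random.Random(seed)
    table_counts = []   # table_counts[k] = number of customers at table k
    assignments = []    # assignments[i] = table index of customer i
    for _ in range(n_customers):
        # Existing tables are weighted by occupancy, a new table by alpha.
        weights = table_counts + [alpha]
        table = rng.choices(range(len(weights)), weights=weights)[0]
        if table == len(table_counts):
            table_counts.append(1)      # open a new table
        else:
            table_counts[table] += 1    # join an existing table
        assignments.append(table)
    return assignments, table_counts

assignments, table_counts = sample_crp(n_customers=100, alpha=1.0)
```

The number of occupied tables is not fixed in advance and tends to grow slowly with the number of customers, which is the property the talk relies on when using the CRP as a prior for an infinite mixture model.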
Relational Chinese Restaurant Process (RelCRP)
Influence of World-wide Factors
Influence of Personal Preferences
Influence of Friend Network
Influence of Geography
[Figure labels: India, China, UK]
Aggregating Influences
RelCRP is exchangeable like the CRP
Useful as a prior for infinite mixture model
RelCRP captures influence of one relation on posts
Influences act simultaneously on any user
Aggregated influence pattern is user specific
Different users affected differently by same combination of world-wide and geographic factors
Multi Relational CRP
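The formal definition of the Multi Relational CRP is not preserved in this transcript. The sketch below shows one plausible reading of "influences act simultaneously" with a "user specific" aggregate: each relation contributes a CRP-style predictive distribution over topics, and a user mixes them with personal weights. The mixing form, the weights, the relation names, and the `NEW_TOPIC` sentinel are all assumptions made for illustration, not the authors' model.

```python
NEW_TOPIC = "<new>"  # hypothetical sentinel for "start a new topic"

def relation_predictive(topic_counts, alpha):
    """CRP-style predictive over topics from the posts linked by one relation."""
    n = sum(topic_counts.values())
    probs = {topic: c / (n + alpha) for topic, c in topic_counts.items()}
    probs[NEW_TOPIC] = alpha / (n + alpha)
    return probs

def aggregated_predictive(relation_counts, user_weights, alpha):
    """Mix per-relation predictives with user-specific weights (assumed to sum to 1)."""
    mixed = {}
    for relation, weight in user_weights.items():
        counts = relation_counts.get(relation, {})
        for topic, p in relation_predictive(counts, alpha).items():
            mixed[topic] = mixed.get(topic, 0.0) + weight * p
    return mixed

# Toy example: this user is swayed more by friends than by world-wide trends.
relation_counts = {
    "world": {"sports": 6, "music": 2},
    "friends": {"music": 3},
}
user_weights = {"world": 0.25, "friends": 0.75}
prior = aggregated_predictive(relation_counts, user_weights, alpha=1.0)
```

Under these assumed weights "music" gets the largest prior mass even though "sports" dominates world-wide, which is the kind of user-specific effect the preceding slide describes.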
Evolving Patterns in Social Media
Number of Topics
Topics die and new ones are born
User Personalities
Susceptibility to influence by world-wide, geographic and friends’ preferences
Existing Topic Distributions
Words go out of fashion, new ones enter vocabulary
Topic Characters:
Popularity of topic changes world-wide, in users' preferences, sub-networks and geographies
Dynamic MultiRelCRP
User Personality Trends
Evolving Topic Distributions
Topic Character Trends
Inference and Estimation Tasks
Online Algorithm
Traditional iterative framework does not scale for social media data
Sequential Monte Carlo methods [Canini AIStats ‘09] that rejuvenate some old labels also infeasible
Online sampling [Banerjee SDM ‘07] does not revisit old labels at all; initial batch phase
Adapt for non-parametric setting
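To make "does not revisit old labels" concrete, here is a minimal single-pass sketch for a plain CRP mixture of word distributions. Each post is labelled once, on arrival, and the label is never resampled. It uses an ordinary CRP prior rather than the talk's relational model, and the smoothing parameter `beta`, the `vocab_size` argument, and all function names are assumptions for illustration.

```python
import math
import random
from collections import Counter

def log_predictive(words, counts, total, beta, vocab_size):
    """Log probability of a post's words under one topic (Dirichlet-multinomial)."""
    seen = Counter()
    logp = 0.0
    for i, w in enumerate(words):
        logp += math.log((counts[w] + seen[w] + beta) / (total + i + vocab_size * beta))
        seen[w] += 1
    return logp

def online_crp_topics(posts, alpha=1.0, beta=0.1, vocab_size=1000, seed=0):
    """Assign each post a topic in one pass; earlier labels are never revisited."""
    rng = random.Random(seed)
    sizes, word_counts, totals, labels = [], [], [], []
    for words in posts:
        log_weights = []
        for k in range(len(sizes) + 1):
            is_new = k == len(sizes)
            prior = alpha if is_new else sizes[k]
            counts = Counter() if is_new else word_counts[k]
            total = 0 if is_new else totals[k]
            log_weights.append(math.log(prior)
                               + log_predictive(words, counts, total, beta, vocab_size))
        top = max(log_weights)  # subtract the max before exponentiating, for stability
        weights = [math.exp(lw - top) for lw in log_weights]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(sizes):     # the sampled label is a brand-new topic
            sizes.append(0)
            word_counts.append(Counter())
            totals.append(0)
        sizes[k] += 1
        word_counts[k].update(words)
        totals[k] += len(words)
        labels.append(k)
    return labels

posts = [["goal", "match"], ["goal", "team"], ["song", "album"]]
labels = online_crp_topics(posts, vocab_size=10)
```

The cost per post is proportional to the current number of topics, with no sweeps over old data, which is what makes this style of sampler attractive at social-media scale.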
Multi-threaded Implementation
Sequential online implementation does not scale
Iterative Gibbs sampling algorithms parallelized for hierarchical Bayesian models [Asuncion NIPS 08, Smola VLDB 10]
Our algorithm is parallel, online and non-parametric
Explicit consolidation by master thread at the end of each iteration
Only new topics consolidated
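The slide does not say how consolidation is carried out, so the following is only a toy sketch of the general pattern: worker threads label their shard against a read-only snapshot of the known topics, and a master step then registers the newly proposed topics under global ids. The labelling rule (first word of the post), the function names, and the way duplicate proposals are merged are all hypothetical stand-ins; the talk's implementation is Java-based, and Python is used here only for brevity.

```python
from concurrent.futures import ThreadPoolExecutor

def process_shard(shard, snapshot):
    """Hypothetical worker: label posts with a known topic or a shard-local new one."""
    labels, new_topics = [], []
    for post in shard:
        key = post[0]  # toy stand-in for sampling a topic for this post
        if key in snapshot:
            labels.append(("global", snapshot[key]))
        else:
            if key not in new_topics:
                new_topics.append(key)
            labels.append(("local", new_topics.index(key)))
    return labels, new_topics

def run_iteration(shards, known_topics, n_threads=7):
    """One iteration: parallel labelling, then consolidation of new topics only."""
    snapshot = dict(known_topics)  # workers never write to shared state
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = list(pool.map(lambda s: process_shard(s, snapshot), shards))
    all_labels = []
    for labels, new_topics in results:  # master thread consolidates
        local_to_global = []
        for key in new_topics:
            if key not in known_topics:  # topic first proposed in this iteration
                known_topics[key] = len(known_topics)
            local_to_global.append(known_topics[key])
        all_labels.append([tid if kind == "global" else local_to_global[tid]
                           for kind, tid in labels])
    return all_labels

known_topics = {"goal": 0}
shards = [[["goal", "match"], ["song", "album"]],
          [["song", "tour"], ["vote", "poll"]]]
labels = run_iteration(shards, known_topics)
```

Because existing topics are only read during the parallel phase, the master has to reconcile just the topics created in the current iteration, which keeps the sequential step small.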
Datasets and Baselines
Twitter: 360 million tweets (Jun-Dec 2009)
Facebook: 300,000 posts (public profiles, 3 months)
Latent Dirichlet Allocation (LDA)
[Hong SOMA 2010]
Labeled LDA (L-LDA)
Hashtags as topics [Ramage ICWSM 2010]
Timeline
Dynamic non-parametric topic model [Ahmed UAI 2010]
1 Model Goodness
Perplexity: Ability to generalize to unseen data
Both network and dynamics are important for modeling social media data
Perplexity
Model      Twitter   Facebook
DMRelCRP   1188.29   1562.34
Timeline   1582.86   1802.9
L-LDA      1982.76   -
LDA        2932.06   3602
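For readers unfamiliar with the metric, perplexity is the exponential of the negative average per-token log-likelihood on held-out data, so lower is better. The helper below is a minimal sketch of that standard definition; the function and argument names are illustrative, and how the talk's models compute the held-out log-likelihoods is not shown here.

```python
import math

def perplexity(doc_log_likelihoods, doc_token_counts):
    """exp(-(total held-out log-likelihood) / (total number of tokens))."""
    total_ll = sum(doc_log_likelihoods)
    total_tokens = sum(doc_token_counts)
    return math.exp(-total_ll / total_tokens)

# Sanity check: a model that spreads mass uniformly over 4 words has perplexity 4.
uniform_ll = [3 * math.log(0.25), 2 * math.log(0.25)]
value = perplexity(uniform_ll, [3, 2])
```

Read this way, a perplexity of about 1188 means the model is, on average, as uncertain about the next word as a uniform choice among roughly 1188 words.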
2 Quality of Discovered Topics
Label assigned to each post indicating category
Distribution over words indicating semantics
A. Clustering posts using topic labels
B. Prediction using topic labels
Predicting post authorship & user commenting activity
C. Major event detection
2A Post Clustering using Topics
Use hashtags as gold standard (for Twitter)
16K posts #NIPS2009, #ICML2009, #bollywood etc
DMRelCRP close to L-LDA without using hashtags
DMRelCRP produces ‘finer-grained’ clusters
Clustering accuracy (Tw)
Model      nMI    R-Index   F1
DMRelCRP   0.93   0.88      0.86
Timeline   0.81   0.72      0.73
L-LDA      1      1         1
LDA        0.55   0.52      0.48
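As an illustration of the nMI column, the sketch below computes normalized mutual information between gold labels (hashtags, in the talk's setup) and predicted topic labels. The slide does not state which normalization was used; dividing by the geometric mean of the two entropies is an assumption here, and the function name is illustrative.

```python
import math
from collections import Counter

def normalized_mutual_info(gold, pred):
    """NMI = I(gold; pred) / sqrt(H(gold) * H(pred)), using natural logs."""
    n = len(gold)
    gold_counts = Counter(gold)
    pred_counts = Counter(pred)
    joint = Counter(zip(gold, pred))
    mi = sum((c / n) * math.log(c * n / (gold_counts[g] * pred_counts[p]))
             for (g, p), c in joint.items())
    h_gold = -sum((c / n) * math.log(c / n) for c in gold_counts.values())
    h_pred = -sum((c / n) * math.log(c / n) for c in pred_counts.values())
    if h_gold == 0 or h_pred == 0:
        # Degenerate case: at least one side puts everything in a single cluster.
        return 1.0 if h_gold == h_pred else 0.0
    return mi / math.sqrt(h_gold * h_pred)

gold = ["#a", "#a", "#b", "#b"]
exact = normalized_mutual_info(gold, [0, 0, 1, 1])   # same partition as gold
finer = normalized_mutual_info(gold, [0, 1, 2, 2])   # splits one gold cluster
```

A clustering that splits a gold cluster into pure sub-clusters scores below 1 under this measure, which is consistent with a model that produces finer-grained clusters scoring slightly below a model whose topics are the hashtags themselves.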
2B Prediction Using Topics
Authorship: Given post and user, predict if author
Commenting activity: Given post and (non-author) user, predict if user comments on that post
DMRelCRP topics lead to more accurate prediction
             Authorship           Commenting
Model      Twitter   Facebook   Twitter   Facebook
DMRelCRP   0.793     0.734      0.683     0.648
Timeline   0.718     0.669      0.582     0.579
L-LDA      0.521     0.432      0.429     0.482
LDA        0.647     -          0.542     -
2C Major Event Detection
3 Analysis of Influences
3A Global Personality Trends
[Figure annotations: Michael Jackson’s death, FIFA WC, Google Wave]
3B Geo-specific Personality Trends
Personality trends very similar in UK and US
Geographic influences high at different epochs
India: World-wide and geographic influences weaker
China: World-wide influence weak, geographic influence strong; stable pattern
3C Topic Character Trends
Scaling with Data Size
Java-based multi-threaded framework; 7 threads
8-core machine, 32 GB RAM
Scales largely because of multi-threading
Summary
First attempt at studying user influences in social media data
New non-parametric model that captures multiple relationships and temporal evolution
Multi-threaded online Gibbs sampling algorithm
Extensive evaluation on large real datasets
Topics lead to better clustering and prediction
Insights on user influence patterns