Language Technologies Institute
1
Analysis of Social Media
Trend Analysis
Mohit Kumar
Oct 31, 2007
Roadmap
• The Predictive Power of Online Chatter – Gruhl et al., KDD’05
• Topics over Time: A Non-Markov Continuous-Time Model of Topical Trends – Wang and McCallum, KDD’06
• Briefly: Visualizing Tags over Time – Dubinko et al., WWW’06
The Predictive Power of Online Chatter – KDD’05
Daniel Gruhl, IBM Almaden Research Center
R. Guha, Google, Inc.
Ravi Kumar, IBM Almaden Research Center
Jasmine Novak, IBM Almaden Research Center
Andrew Tomkins, IBM Almaden Research Center (now Yahoo!)
Motivation
• Demonstrate a link between online content (blogs) and customer behavior (purchase decisions)
• Predict spikes in sales rank based on online chatter
Domain
• Sales rank of books on Amazon.com
• Postings in blogs, media, and webpages
Findings
• Hand-crafted queries produce matching posts whose volume predicts sales rank
• These queries can be automatically generated
• Successfully predict spikes in sales rank (not general sales rank motion)
Causation
• Bloggers are most likely non-causative indicators of another root cause (typically an event in the outside world)
• Possible explanations for the delay between postings and changes in sales rank, linked to profiles of bloggers:
  • Forward-thinking people who write and buy early, but represent only a small fraction of the population
  • Representative of the population, but the threshold to write about a product may be lower than the threshold to buy it
Data Details
• IBM WebFountain
  • 300K blogs
  • 200K postings per day
  • 3B web pages
  • 200K media articles per day (Factiva media feed)
• Amazon sales rank data
  • 2430 books
  • 480K sales rank readings
  • Duration: 120 days
Correlation between Sales Rank and Blog Mentions
• Spike: all the ranks that do not occur within a week of the minimum rank are “large enough”
• Large enough ~ max(m + 50, 1.5m), where m is the minimum rank
• 50 books contain spikes during the considered time interval of study
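The “large enough” criterion above can be sketched as a small check, assuming m is the minimum sales rank in the series and a one-week window (function and parameter names are hypothetical, not from the paper):

```python
def is_spike(ranks, window=7):
    """Spike test sketch: every rank observed more than a week away
    from the minimum-rank day must be "large enough", i.e. at least
    max(m + 50, 1.5 * m), where m is the minimum (best) rank."""
    min_idx = ranks.index(min(ranks))
    m = ranks[min_idx]
    threshold = max(m + 50, 1.5 * m)
    outside = [r for i, r in enumerate(ranks) if abs(i - min_idx) > window]
    return all(r >= threshold for r in outside)
```

Note that a lower rank means better sales on Amazon, so a sales “spike” appears as a dip in the rank value.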
Example 1
Query: Lance Armstrong OR Tour de France
• Spike corresponds to Armstrong winning the Tour de France on July 25
Example 2
Query: What not to wear
• Two plausible factors:
  • A message board corresponding to a TV show started taking style questions
  • The authors had another book release
Example 3
Query: Vanity Fair OR William Thackeray
• Spike coincides with the movie release on Sep 1
Cross correlation
• Query: Lance Armstrong OR Tour de France – best lag: leading
• Query: What not to wear – best lag: leading
• Query: Vanity Fair OR William Thackeray – best lag: slightly trailing
• Out of 50 books with spikes, 10 have highly correlated blog mentions
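A cross-correlation of this kind can be sketched with NumPy; a positive best lag means blog mentions lead the sales-rank response. The function name and the 14-day lag cap are illustrative assumptions:

```python
import numpy as np

def best_lag(mentions, sales_signal, max_lag=14):
    """Return the lag k maximizing corr(mentions[t], sales_signal[t+k]).
    k > 0 means mentions lead the sales signal by k days."""
    x = (mentions - mentions.mean()) / (mentions.std() + 1e-12)
    y = (sales_signal - sales_signal.mean()) / (sales_signal.std() + 1e-12)
    n = len(x)
    best_k, best_c = 0, -np.inf
    for k in range(-max_lag, max_lag + 1):
        if k >= 0:
            c = float(np.mean(x[:n - k] * y[k:]))   # x leads y by k
        else:
            c = float(np.mean(x[-k:] * y[:n + k]))  # y leads x by -k
        if c > best_c:
            best_k, best_c = k, c
    return best_k
```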
Correlation between Sales Rank and Blog Mentions
Factors affecting sales rank but not Blog mentions
• Marketing promotions
• Book release (books released in 2004 were filtered out of their dataset)
• Wholesale purchases
• Lower-ranking books get spikes but not public attention
Separate Example – Trend
Query: The Notebook AND Nicholas Sparks
• High point following the release of the film – Jun 28, 2004
• Steadily falling afterwards
Findings
• Hand-crafted queries produce matching posts whose volume predicts sales rank
• These queries can be automatically generated
• Successfully predict spikes in sales rank (not general sales rank motion)
Automatically generated query
• Query generation based on author name
• Fairly simple
• Needs more exploration
Query: Buster Olney
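Since the paper only describes its query generation as “fairly simple” and based on the author name, one hypothetical sketch of such a builder (the fallback heuristic and the `ambiguous_author` flag are assumptions, not from the paper):

```python
def author_query(book):
    """Hypothetical query builder: quote the author name, and fall back
    to title OR author when the author name alone is too ambiguous."""
    author = book["author"]
    if book.get("ambiguous_author"):  # assumed flag, not from the paper
        return '"{}" OR "{}"'.format(book["title"], author)
    return '"{}"'.format(author)
```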
Automatically generated query
Scatter plot of cross-correlation versus lag for 182 automatically-generated queries.
Findings
• Hand-crafted queries produce matching posts whose volume predicts sales rank
• These queries can be automatically generated
• Successfully predict spikes in sales rank (not general sales rank motion)
Problem statement
• Given: a time series representing sales rank up to time t
• Is the addition of blog-mention data helpful in predicting the future trend of the sales rank?
Predicting Motion (Sales up or down)
• 2-class classification problem
• Natural predictors:
  • Moving averages – 63% accuracy (best classifier)
  • Least-squares predictors – 60% accuracy (best classifier)
  • Markov predictor – 63% accuracy
• Not explicitly mentioned how the blog-mention data is used
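The moving-average and Markov baselines can be sketched as follows; since the paper does not specify their exact configurations, the window size and the rank-vs-average rule here are illustrative assumptions:

```python
from collections import Counter

def moving_average_motion(ranks, window=5):
    """Predict the next motion from a moving average: if the latest
    rank is below (better than) its recent average, predict 'up' in
    sales, else 'down'.  (Lower Amazon rank = better sales.)"""
    ma = sum(ranks[-window:]) / window
    return "up" if ranks[-1] < ma else "down"

def markov_motion(motions):
    """First-order Markov predictor: emit the motion that most often
    followed the current motion in the observed history."""
    counts = Counter(zip(motions[:-1], motions[1:]))
    last = motions[-1]
    followers = {b: c for (a, b), c in counts.items() if a == last}
    return max(followers, key=followers.get)
```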
Predicting Volatility (difference in sales by a threshold)
• Threshold chosen to indicate volatility: 44
• 72% accuracy – best classifier
• Not explicitly mentioned how the blog-mention data is used
Predicting Spikes
• Create labeled data for evaluation by tagging spikes
• Problem:
  • Given: product, time t, blog mentions
  • Output: spike in the near future? (binary classification)
Predicting Spikes – Heuristic Algorithm
• Three principles for finding spikes:
  • Biggest ever
  • Exceeds historical averages significantly
  • Rises relatively quickly
• Translates into a function with 3 linear equations
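The three principles can be sketched as one boolean test; the multiplier and window sizes below are illustrative assumptions, not the paper’s fitted constants:

```python
def mention_spike(mentions, t, hist_window=30, rise_window=3):
    """Heuristic spike test at time t: (1) biggest ever, (2) significantly
    above the historical average, (3) rose relatively quickly."""
    history = mentions[max(0, t - hist_window):t]
    if len(history) < rise_window:
        return False
    cur = mentions[t]
    hist_avg = sum(history) / len(history)
    biggest_ever = cur >= max(history)
    exceeds_avg = cur > 2.0 * hist_avg                         # assumed factor
    rose_quickly = cur - mentions[t - rise_window] > hist_avg  # assumed rule
    return biggest_ever and exceeds_avg and rose_quickly
```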
Evaluation
• Not treated as binary classification at each time instance
• Instead treated as making Leading & Trailing predictions (2-week window)
• Fairly weak/complicated evaluation
• 2/3 of the predictions made have Leading/Trailing matches, so accuracy may be ~66%
• Recall ~ 0.5
Conclusion
• Online chatter ‘may’ represent early indicator of real-world behavior
Critique
• Preliminary work exploring an important problem
• Good roadmap for future research by decomposing the problem as follows:
  • Get ‘relevant’ blog mentions
  • Correlate mentions with sales rank
Topics over Time: A Non-Markov Continuous-Time Model of Topical Trends – KDD’06
Xuerui Wang, UMass Amherst
Andrew McCallum, UMass Amherst
Acknowledgement: slides borrowed from Linda Buisman, Australian National University
Motivation
• Information retrieval & text mining
  • Text is high-dimensional
• Topic models:
  • Discover summaries of documents
  • Reduce dimensionality
  • Model co-occurrences of words
    • mouse, cat, Tweety -> cartoons
    • mouse, keyboard -> computer supplies
• Topics over time:
  • Co-occurrences are dynamic
  • Additional modality: time
    • united, states, war @ 1850 -> Mexican-American War
    • united, states, war @ 1918 -> World War I
    • united, states, war @ 2006 -> War in Iraq
Modeling time
• Earlier approaches:
  • Discretize – a fixed interval size does not fit all topics
  • Markov model – state at time t+1 depends on t, but not earlier
• Solution:
  • Treat time as a continuous variable
  • Time is a parameter in a Bayesian network
Bayesian network
• Generative model (vs discriminative: SVM, NN, …)
• Bayes’ rule: P(H|X) = P(X|H) P(H) / P(X)
• Bayesian network:
  • Directed graph of parameters
  • A connected to B: probability of B conditionally depends on A
• Generation step: estimate conditional probabilities for all (hidden) parameters
• Goal: predict the probability of hypothesis H being true for observation X
Topics-over-time model
• Based on an earlier topic model, LDA
• “Bag-of-words” approach:
  • Word counts in a document are significant
  • Position and order are not
• Timestamp of the document becomes another parameter
• Generate the Bayesian network from existing documents
  • Exact inference is computationally infeasible; use approximate inference
• Goal: predict the probability of a document belonging to topic T
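The per-word generative story (topic from the document mixture, word from the topic’s word distribution, timestamp from the topic’s Beta distribution over normalized time) can be sketched with toy parameters; the vocabularies and Beta shapes below are illustrative, not the paper’s fitted values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy parameters for two topics (illustrative only):
topic_words = {0: (["war", "army", "treaty"], [0.5, 0.3, 0.2]),
               1: (["trade", "tariff", "bank"], [0.4, 0.4, 0.2])}
topic_time = {0: (2.0, 8.0),   # Beta(2, 8): topic 0 peaks early in [0, 1]
              1: (8.0, 2.0)}   # Beta(8, 2): topic 1 peaks late

def generate_doc(theta, n_words=10):
    """TOT sketch: for each word, draw topic z ~ Multinomial(theta),
    word w ~ phi_z, and a normalized timestamp t ~ Beta(psi_z)."""
    words, times = [], []
    for _ in range(n_words):
        z = int(rng.choice(len(theta), p=theta))
        vocab, probs = topic_words[z]
        words.append(str(rng.choice(vocab, p=probs)))
        a, b = topic_time[z]
        times.append(float(rng.beta(a, b)))
    return words, times
```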
Model
Results
[Figure: words associated with each topic, and the distribution of each topic over time]
Comparison with LDA
[Figure: TOT vs LDA topics. LDA confuses the Mexican-American War with WWI, and the Panama Canal with other activities in Central America]
KL Divergence between topics
TOT topics are more distinct from each other
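The distinctness claim can be checked by computing KL divergence between two topics’ word distributions over a shared vocabulary, as in the sketch below. The smoothing epsilon is an assumption; note KL is asymmetric, so a symmetrized variant may be preferred for pairwise comparisons:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two word distributions over the same
    vocabulary, smoothed to avoid log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)
```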
Time Prediction
Task: predict the decade given the text of the State of the Union (SoU) Address
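Given word-topic assignments, a MAP timestamp can be sketched by scoring a grid of candidate normalized times under each topic’s Beta distribution; the grid resolution and parameters are illustrative, not the paper’s inference procedure:

```python
import math

def beta_log_pdf(t, a, b):
    """Log density of Beta(a, b) at t in (0, 1)."""
    log_B = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return (a - 1) * math.log(t) + (b - 1) * math.log(1 - t) - log_B

def predict_time(word_topics, topic_time, steps=100):
    """Score each candidate timestamp by the summed Beta log-likelihood
    of the document's word-topic assignments; return the argmax."""
    best_t, best_ll = None, float("-inf")
    for i in range(1, steps):
        t = i / steps
        ll = sum(beta_log_pdf(t, *topic_time[z]) for z in word_topics)
        if ll > best_ll:
            best_t, best_ll = t, ll
    return best_t
```

The predicted normalized time can then be mapped back to a calendar decade.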
Topic Distribution Profile
[Figure: NIPS dataset. Early years emphasize neural networks, analog circuits, and cells; later years emphasize SVMs, optimization, probability, and inference]
Topic Co-occurrences over time
[Figure: co-occurrence of topics with the “classification” topic in the NIPS dataset]
Analysis
• Generative vs discriminative methods
  • Discriminative is usually faster
  • Accuracy depends on the application
  • A generative model offers more information, e.g. not just the topic(s) of a document, but also:
    • Predicting the timestamp given a document
    • The distribution of topics over time
Analysis (cont)
• Limitations and simplifications:
  • “Bag-of-words” instead of word sequences or phrases
    • “computer science” vs “computer”, “science”
  • No account of position within the document
    • Title, introduction, body, footnote