Post on 14-Jul-2015
CIKM’12
Generating Event Storylines from Microblogs
ABSTRACT
We explore the problem of generating storylines
from microblogs for user input queries.
Given a query about an ongoing event, we propose
to sketch the real-time storyline of the event with a
two-level solution:
1. A language model with dynamic
pseudo relevance feedback to obtain relevant
tweets
2. Storyline generation via graph optimization
INTRODUCTION
Generating Event Storyline from Microblogs
(GESM)
Differences between GESM and prior studies:
1. Prior studies work on well-edited facts; GESM
works on short, noisy text
2. GESM provides a personalized service
3. A two-level framework is necessary: at the low
level, a retrieval model finds all relevant tweets
along the timeline of the event; at the high level,
the relevant tweets and their latent structure are
summarized to produce a storyline.
INTRODUCTION
Challenges
1. The dynamic and sparse nature of microblogs:
how to match the underlying event expressed
by a vague event query to potentially relevant
tweets that may not contain any query terms
2. Numerous duplicate tweets, plus direct and
indirect retweets
INTRODUCTION
Contributions
1. The problem of generating event storylines from microblogs
2. A dynamic pseudo relevance feedback (DPRF)
language model
3. Storyline generation formulated as a graph-based
optimization problem, solved by approximation
algorithms for minimum-weight dominating set and
directed Steiner tree
THE FRAMEWORK OVERVIEW
The generated storyline should be a graph structure
Each node is labeled by a summary
Each edge represents a causal relationship between two
phases
Offline layer
Online layers
THE RETRIEVAL MODEL
Preliminaries
The original query is usually short and vague
Query expansion
In pseudo relevance feedback, the few top-ranked
documents d+ retrieved by the initial query Q are used
to build a relevance model θF; the new query is then
a linear combination of the original query Q and
the relevance model θF
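The linear combination can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes maximum-likelihood unigram models over whitespace tokens, and the function names and the interpolation weight `lam` are hypothetical.

```python
from collections import Counter


def unigram_model(texts):
    """Maximum-likelihood unigram distribution over whitespace tokens."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tok: c / total for tok, c in counts.items()}


def expand_query(query, feedback_docs, lam=0.5):
    """Return (1 - lam) * P(w | Q) + lam * P(w | theta_F) for every word w.

    lam is an assumed interpolation weight; the slides do not give its value.
    """
    q_model = unigram_model([query])
    f_model = unigram_model(feedback_docs)
    vocab = set(q_model) | set(f_model)
    return {
        w: (1 - lam) * q_model.get(w, 0.0) + lam * f_model.get(w, 0.0)
        for w in vocab
    }


# Toy usage: the top-ranked tweets contribute "tahrir", which the query lacks.
expanded = expand_query(
    "egypt protest",
    ["egypt protest tahrir square", "tahrir crowd protest"],
    lam=0.4,
)
```

The expanded model gives non-zero weight to terms that occur only in the feedback tweets, which is how tweets without any original query term can still be matched.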
THE RETRIEVAL MODEL
Dynamic Pseudo Relevance Feedback
K burst periods
Assume that the prior probability of a relevant
document d+ depends on the distance from its time
stamp td+ to the centroids of the burst periods,
denoted as Φ = {φ1, ..., φK}
Three probability functions model the effective
range of a burst period, its decay coefficient, and its
skewness:
1. Mixture Gaussian Distribution
2. Local Power Distribution
3. Skewed Linear Distribution
THE RETRIEVAL MODEL
Mixture Gaussian Distribution
Local Power Distribution
Skewed Linear Distribution
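The formulas for these three priors did not survive in the slides. As a rough illustration of the first one only, the sketch below assumes an equal-weight mixture of Gaussians with a shared width `sigma` centered on the burst centroids; the function name, the equal weights, and the shared width are all assumptions rather than the paper's definition.

```python
import math


def gaussian_mixture_prior(t, centroids, sigma=1.0):
    """Assumed temporal prior for a document time stamp t.

    Averages one Gaussian density per burst centroid phi_k, so documents
    posted close to any burst period receive a higher prior.
    """
    if not centroids:
        raise ValueError("need at least one burst centroid")
    norm = sigma * math.sqrt(2 * math.pi)
    density = sum(
        math.exp(-((t - phi) ** 2) / (2 * sigma ** 2)) / norm
        for phi in centroids
    )
    return density / len(centroids)


# A tweet posted at a burst centroid outranks one posted between two bursts.
near = gaussian_mixture_prior(10, [10, 20], sigma=2.0)
far = gaussian_mixture_prior(15, [10, 20], sigma=2.0)
```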
THE RETRIEVAL MODEL
Burst Period Detection
A burst period is one in which query terms:
1. appear more frequently than usual
2. are continuously frequent around the time point.
Burst periods of the event are detected by:
1. for each query term, finding the time intervals
of arbitrary length in which the query term
is consistently frequent;
2. picking the time points within these intervals
with the largest sum of frequencies over all query terms.
THE RETRIEVAL MODEL
“Bursty score”
Find the time interval Tw,j = <st, et, LS, RS> with the maximal cumulative burst score B(w, Tw,j)
Compute the score H(q, t) of each query term q at each time point t
Rank the time points by ∑q∈Q H(q, t) and choose the K highest-scoring time points φk.
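The final ranking step can be sketched as below. It assumes the per-term scores H(q, t) have already been computed and are stored as equal-length lists indexed by time point; the function name and input layout are hypothetical.

```python
import heapq


def top_k_burst_points(scores, k):
    """Pick the K time points with the largest summed score over query terms.

    scores maps each query term q to a list [H(q, 0), H(q, 1), ...].
    """
    series = list(scores.values())
    if not series:
        return []
    n_points = len(series[0])
    if any(len(s) != n_points for s in series):
        raise ValueError("all score series must cover the same time points")
    totals = [sum(s[t] for s in series) for t in range(n_points)]
    return heapq.nlargest(k, range(n_points), key=totals.__getitem__)


# Toy scores for two query terms over four time points.
burst_points = top_k_burst_points(
    {"quake": [0, 3, 1, 0], "tsunami": [1, 2, 0, 6]},
    k=2,
)
```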
STORYLINE GENERATION
1. Representative tweets
2. Depict the evolving structure of the event
3. an optimistic connection
A multi-view tweet graph is constructed
A minimum dominating set on the tweet graph
A minimum Steiner tree
STORYLINE GENERATION
Three non-negative real parameters α, τ1, τ2, with τ1 < τ2.
Define E: text similarity > α
Define A: τ1 ≤ tj − ti ≤ τ2
Vertex weight: w(vi) = 1 − score(Q, vi).
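A minimal sketch of building such a graph from the three definitions above. The slides do not say which text similarity is used, so Jaccard overlap of tokens stands in as an assumption; the tuple layout and function names are hypothetical as well.

```python
from itertools import combinations


def jaccard(a, b):
    """Token-set overlap, used here as a stand-in text similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def build_tweet_graph(tweets, alpha, tau1, tau2):
    """tweets: list of (text, timestamp, relevance_score) tuples.

    Returns vertex weights, undirected similarity edges E, and directed
    time arcs A, following the definitions of E, A and w(vi).
    """
    n = len(tweets)
    weights = [1.0 - score for _, _, score in tweets]
    edges = {
        (i, j)
        for i, j in combinations(range(n), 2)
        if jaccard(tweets[i][0], tweets[j][0]) > alpha
    }
    arcs = {
        (i, j)
        for i in range(n)
        for j in range(n)
        if i != j and tau1 <= tweets[j][1] - tweets[i][1] <= tau2
    }
    return weights, edges, arcs


tweets = [
    ("quake hits city", 0, 0.9),
    ("quake hits the city", 1, 0.8),
    ("rescue teams arrive", 5, 0.7),
]
weights, edges, arcs = build_tweet_graph(tweets, alpha=0.5, tau1=1, tau2=4)
```

In this toy input the first two tweets are near-duplicates, so they share a similarity edge, while the arcs chain the tweets forward in time.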
STORYLINE GENERATION
A subset S of the vertex set of an undirected
graph is a dominating set if, for each vertex u,
either u is in S or u is adjacent to a vertex in S.
STORYLINE GENERATION
greedy algorithm
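The slide names a greedy algorithm without giving its details. One common greedy scheme for the weighted dominating set, shown here as an assumption rather than the paper's exact procedure, repeatedly picks the vertex that covers the most still-uncovered vertices per unit of weight.

```python
def greedy_dominating_set(weights, edges, eps=1e-9):
    """Greedy approximation of a minimum-weight dominating set.

    weights[v] is the vertex weight, edges is a set of undirected (u, v)
    pairs, and eps avoids division by zero for zero-weight vertices.
    """
    n = len(weights)
    # Closed neighborhood: the vertex itself plus its neighbors.
    closed = [{v} for v in range(n)]
    for u, v in edges:
        closed[u].add(v)
        closed[v].add(u)
    uncovered = set(range(n))
    chosen = []
    while uncovered:
        best = max(
            range(n),
            key=lambda v: len(closed[v] & uncovered) / (weights[v] + eps),
        )
        chosen.append(best)
        uncovered -= closed[best]
    return chosen


# Star graph around vertex 0, plus an isolated vertex 4.
representatives = greedy_dominating_set(
    [0.5, 0.5, 0.5, 0.5, 0.5],
    {(0, 1), (0, 2), (0, 3)},
)
```

Every iteration covers at least one new vertex, so the loop terminates; the chosen vertices play the role of representative tweets.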
STORYLINE GENERATION
A Steiner tree of a graph G with respect to a
vertex subset S is the edge-induced sub-tree of G
that contains all the vertices of S and has the
minimum total cost, where the cost is
the total weight of the vertices.
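To make the definition concrete, the sketch below connects a set of terminals in a directed, node-weighted graph with a simple shortest-path heuristic: each terminal is attached to the growing tree by the cheapest path from any vertex already in the tree. This is not the directed Steiner tree approximation algorithm used in the paper; it assumes non-negative vertex weights and terminals listed in time order, and the function name is hypothetical.

```python
import heapq


def connect_terminals(weights, arcs, terminals):
    """Heuristic Steiner-style tree rooted at terminals[0].

    weights[v] is the cost of including vertex v, arcs is a set of
    directed (u, v) pairs. Returns (tree_nodes, tree_arcs).
    """
    succ = {}
    for u, v in arcs:
        succ.setdefault(u, []).append(v)
    tree_nodes = {terminals[0]}
    tree_arcs = set()
    for target in terminals[1:]:
        if target in tree_nodes:
            continue
        # Multi-source Dijkstra from the current tree; entering v costs weights[v].
        dist = {v: 0.0 for v in tree_nodes}
        parent = {}
        heap = [(0.0, v) for v in tree_nodes]
        heapq.heapify(heap)
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist.get(u, float("inf")):
                continue
            if u == target:
                break
            for v in succ.get(u, []):
                nd = d + weights[v]
                if nd < dist.get(v, float("inf")):
                    dist[v] = nd
                    parent[v] = u
                    heapq.heappush(heap, (nd, v))
        if target not in dist:
            raise ValueError(f"terminal {target} is unreachable from the tree")
        # Walk back from the target until the existing tree is reached.
        node = target
        while node not in tree_nodes:
            tree_arcs.add((parent[node], node))
            tree_nodes.add(node)
            node = parent[node]
    return tree_nodes, tree_arcs


# Two routes from 0 to 3; the one through the lighter vertex 2 is chosen.
nodes, tree = connect_terminals(
    [0.1, 0.9, 0.2, 0.1],
    {(0, 1), (0, 2), (1, 3), (2, 3)},
    terminals=[0, 3],
)
```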
EXPERIMENTS
Data Set
EXPERIMENTS
Tweet Retrieval
49 queries
Evaluation metrics:
precision at top 30 tweets (P@30)
mean average precision (MAP)
precision at top 100 tweets (P@100)
R-precision (R-PREC)
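For reference, the metrics listed above follow their standard definitions and can be computed as in this sketch. The function names and the toy ranking are illustrative only; MAP is the mean of the per-query average precision.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked items that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant item occurs."""
    hits, total = 0, 0.0
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """runs: list of (ranked, relevant) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


def r_precision(ranked, relevant):
    """Precision at rank R, where R is the number of relevant items."""
    r = len(relevant)
    return precision_at_k(ranked, relevant, r) if r else 0.0


ranked = ["a", "b", "c", "d"]
relevant = {"a", "c"}
p_at_2 = precision_at_k(ranked, relevant, 2)
ap = average_precision(ranked, relevant)
r_prec = r_precision(ranked, relevant)
```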
EXPERIMENTS
Comparative Study
EXPERIMENTS
Parameter Tuning
EXPERIMENTS
Summarization Capability
EXPERIMENTS
Parameter Tuning
EXPERIMENTS
A User Study
CONCLUSION
The proposed dynamic pseudo relevance
feedback model
Minimum-weight Steiner tree on a dominating set
Extensive experiments
OMG, I Have to Tweet That!
A Study of Factors that Influence Tweet Rates
Abstract
Key limitation of using social media as a signal:
it depends on people self-reporting their own
behaviors and observations.
A large-scale quantitative analysis of some of
the factors that influence self-reporting bias.
Focus: the daily variations in tweet rates about weather
events
Introduction
Treating social media as a signal to measure the relative real-world occurrence of events
Critical challenge: the bias introduced by the self-reported nature of social media
What is it about an event that makes it more or less “tweetable”?
A first large-scale, quantitative analysis of some of the factors that influence self-reporting bias by comparing a year of tweets about weather events in cities across the United States and Canada to ground-truth knowledge about actual weather occurrences.
Introduction
Three potential factors:
1. How extreme is the weather?
2. How expected is the weather given the time of year?
3. How much did the weather change?
Data Preparation
Tweets posted between Jun 1, 2010 and Jun 30, 2011
56 different metropolitan areas
Historical weather data provided by the National
Oceanic and Atmospheric Administration (NOAA) of the
United States.
Identifying Weather-related Tweets
Goal: discover the daily rate of weather-related
tweets in each metropolitan area
1. Filter the full archive of tweets for tweets that
contain at least one weather-related word or phrase
from a list of 179
2. Build a classifier for weather-related tweets
a simple classifier that estimates the probability
of a tweet being weather related as
Identifying the Location of Tweets
Geo-coded tweets
The textual user-provided location field in a user’s Twitter profile
Normalize the arbitrary, textual user-provided location information into concrete geo-coded coordinates:
1. Build a mapping from user-provided location fields to latitude-longitude coordinates.
2. Merge location fields with similar geo-mappings to create clusters for roughly metropolitan-sized areas.
Historical Weather Data
Calculate daily summaries
For each daily summary of weather data at a
location:
Expectation: how normal the observed weather
is at a location
Extremeness: how extreme the weather is on a
particular day
Change: how different the observed weather data
is from previous days’ weather
Analysis and Results
Tweet Rates and Weather Reports
Analysis and Results
Linear Regression
Models the relationship between a set of weather-derived
features and the daily rate of weather-related
tweets
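As a simplified illustration of this kind of model, the sketch below fits an ordinary least-squares line with a single feature and reports R², the share of variance in the tweet rate that the feature explains. The study uses a set of features; the single-feature form, the function name, and the toy numbers (made up, not taken from the study) are assumptions.

```python
def fit_simple_regression(x, y):
    """Ordinary least squares for y = intercept + slope * x, with R^2."""
    n = len(x)
    if n < 2 or n != len(y):
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("the feature must vary")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1.0 - ss_res / ss_tot if ss_tot else 1.0
    return slope, intercept, r2


# Made-up toy data: a weather-derived feature vs. daily weather-tweet rate.
feature = [0.0, 1.0, 2.0, 3.0]
tweet_rate = [1.0, 3.0, 5.0, 7.0]
slope, intercept, r2 = fit_simple_regression(feature, tweet_rate)
```

Comparing R² across models built from different feature groups is one way to read the statements in the following slides about which factor explains more of the variation.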
Analysis and Results
Correlating Basic Weather Data and Tweet
Rates
Analysis and Results
Correlating Expectation and Tweet Rates
The expectation measure adds little information about likely tweet rates beyond what is already contained in basic weather data.
Correlating Extremeness and Tweet Rates
Extremeness can independently explain more of the variation in weather-related tweet rates than basic weather alone.
Correlating Delta Change and Tweet Rates
There is little difference in the amount of information gained from building these delta-change models.
Combining Extremeness, Expectation, and Delta Change Models
Analysis and Results
Per-Location Models
Discussion
Additional Factors Likely to Affect Tweet Rates
Sentiment
Privacy concerns, embarrassment, and safety
Population segments
Mobile devices
Time-of-day, day-of-week, holiday, and other
effects of time
Conclusions
The correlation between daily tweet
rates and the expectation, extremeness, and
change in observed weather.
Global models
Location-specific models
Extremeness > change > expectation