TwitterStand: News in Tweets
Jagan Sankaranarayanan, Hanan Samet, Benjamin E. Teitler, Michael D. Lieberman, Jon Sperling
Department of Computer Science, University of Maryland
Group Members
• Enkh-Amgalan Baatarjav
• Jedsada Chartree
• Thiraphat Meesumrarn
Outline
Introduction to Twitter
Problem statement
Contributions
Key concepts
Methodology
Assumptions
Questions
References
Introduction: Twitter
Three actors: user, followers, friends
Relationships: unidirectional or bidirectional
Multi-interface: website, SMS, applications, IM, etc.
Introduction to Twitter
Search Engines
Twitter Services: API
Twitter API
Functions to obtain user-specific information
Twitter dataset samples through public feeds
• Public timeline
• Spritzer
• GardenHose: sparse sampling of all feeds
• BirdDog: tweets written by up to 200,000 users
Introduction: Statistics
Figure: U.S. unique visitors (000) trend (Source: comScore Media Metrix)
Introduction: Statistics
21% of Twitter accounts are empty placeholders
94% of Twitter accounts have fewer than 100 followers
10% of Twitter users create 86% of all activity
49.6% of Twitter users are inactive (1 tweet in last 7 days)
55% of Twitter users use third-party applications
Introduction: Statistics
Problem Statement
Conventional systems:
• News aggregators: Google News, Bing News, and Yahoo! News
• Content providers: newspapers, television stations, news blogs
A vast amount of information is generated by Twitter users, e.g.:
• 2008 Southern California earthquake
• Iranian election
Separating news from junk
Contributions
Mobilizing millions of Twitter users to be the eyes and ears of the world
Geographic proximity plays an important role
TwitterStand:
• Identifying current news
• Clustering similar tweets into news stories
• Ranking news based on importance
• Geotagging news topics
Key Concepts
Separating news from noise
Clustering tweets
Mapping the clusters to geographic locations
Example: Twitter Vs Aggregator
Benefits of Twitter
Social networking website
• Community and structure
• Metadata: description, source location, friends, etc.
Very open community
• Diverse community with varied interests
• Broadcasts less popular viewpoints
Capturing breaking news
• Very little lag time between an event and a tweet
Challenges of Twitter
Determining whether a tweet is news or not
• Most tweets are not news
Very high throughput
• Needs to be fast and resilient to noise
Brevity of tweets
• Lacking conveyed information; time critical
Credibility issues
Key Strategies
1. Using an online algorithm
   Tweets arrive in a stream at a furious rate
2. Extracting useful information from noise
   Noise, spelling and grammatical errors, abbreviations, etc.
3. Keeping up with Twitter's evolution
4. Finding a core group of users who tweet about news
   Manually identifying the core group is better than mining the social network structure
   Finding the most common set of followers among them
5. Obtaining user-generated news content
   Videos, photographs, unconventional news; biased toward entertainment, politics, and technology
Architecture of TwitterStand
Architecture: Input
Seeders: 2,000 handpicked users that are known to publish news (newspapers, television stations, reporters, bloggers, etc.)
GardenHose: sampling of all tweets; very noisy feed covering diverse topics
BirdDog: feeds from up to 200,000 users, identified by the “friend finder”
Artifacts: links to external resources, retained only from the seeders' feed
Track: automatically generates a pool of search keys to scour the stream of tweets for potential news tweets of interest
Separating the Chaff
Classify incoming tweets as either junk or news
• Except for tweets from seeders
Goal
• Not to get rid of noise completely
• Discard as many tweets as possible without losing many news tweets
Train a naïve Bayes classifier on a corpus of tweets marked as either junk or news
Cont.
The probability that a tweet t is junk or news is computed using Bayes' theorem
Assumes independence among the words in t
Cont.
If D < 0, the tweet is classified as news; otherwise it is junk
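The junk/news decision above can be sketched in a few lines. This is a minimal illustration, not the presenters' implementation: it assumes D is the log-odds of junk versus news, D = log P(junk | t) - log P(news | t), which is one definition consistent with the rule "D < 0 means news". The whitespace tokenizer, the add-one smoothing, and the function names `train`, `log_odds_junk`, and `classify` are all assumptions made for the example.

```python
import math
from collections import Counter

def train(labeled_tweets):
    """Build word and document counts from (text, label) pairs, label in {"news", "junk"}."""
    word_counts = {"news": Counter(), "junk": Counter()}
    doc_counts = Counter()
    for text, label in labeled_tweets:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts["news"]) | set(word_counts["junk"])
    return word_counts, doc_counts, vocab

def log_odds_junk(text, model):
    """Assumed definition: D = log P(junk | t) - log P(news | t), words treated as independent."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    # Prior term: log P(junk) - log P(news)
    d = math.log(doc_counts["junk"] / total_docs) - math.log(doc_counts["news"] / total_docs)
    totals = {c: sum(word_counts[c].values()) for c in ("news", "junk")}
    for word in text.lower().split():
        if word not in vocab:
            continue  # words never seen in training carry no evidence
        # Add-one (Laplace) smoothing is an assumption, not taken from the slides
        p_junk = (word_counts["junk"][word] + 1) / (totals["junk"] + len(vocab))
        p_news = (word_counts["news"][word] + 1) / (totals["news"] + len(vocab))
        d += math.log(p_junk) - math.log(p_news)
    return d

def classify(text, model):
    """Apply the slide's rule: D < 0 means news, otherwise junk."""
    return "news" if log_odds_junk(text, model) < 0 else "junk"

corpus = [
    ("earthquake hits southern california", "news"),
    ("election results announced in iran", "news"),
    ("eating lunch now lol", "junk"),
    ("so bored at home lol", "junk"),
]
model = train(corpus)
label = classify("earthquake in iran", model)
```

Because the P(t) denominator of Bayes' theorem is the same for both classes, it cancels in the log-odds, so only the priors and the per-word likelihoods are needed.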
Cont.
How to ensure that tweets related to news are not classified as junk?
The corpus is made up of two components:
Static
• A large collection of tweets marked as news
• A large collection of tweets marked as junk
Dynamic
• Periodically obtained from the clustering module
• Names of people, hashtags
News tweets:
• Static: helps identify news tweets on topics that have not been encountered previously
• Dynamic: helps identify news tweets about current events
Online Clustering
Goal: automatically group news tweets into sets of tweets, called clusters
Topic detection: each cluster contains tweets pertaining to a specific topic
Challenges
• Topics are not predefined
• No training set
• Online clustering
Cont.
Leader-follower clustering
• Feature: able to cluster on both content and time
Algorithm details
• Active cluster list
• Feature vectors: tweets' terms (TF-IDF)
• Time centroid
• Inactive cluster: time centroid older than 3 days
Cont.
Cosine similarity measure
• Feature vectors TFVt, TFVc
• Pre-specified constant ε
• If the distance is > ε, start a new cluster
To account for the temporal dimension
• Apply a Gaussian attenuator
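A rough sketch of leader-follower clustering with a time-attenuated cosine measure follows. The slides do not give the exact formulas, so several choices here are assumptions: the distance is taken as 1 minus (cosine similarity times a Gaussian decay in the time difference), centroids are running means, time is measured in days, and the names `Cluster`, `attenuated_distance`, `assign`, `eps`, and `sigma` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {feature: weight} dicts."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

class Cluster:
    def __init__(self, vec, t):
        self.centroid = dict(vec)   # feature centroid (e.g. TF-IDF weights)
        self.time_centroid = t      # mean timestamp of member tweets, in days
        self.size = 1

    def add(self, vec, t):
        """Update both centroids as running means."""
        n = self.size
        for f in set(self.centroid) | set(vec):
            self.centroid[f] = (self.centroid.get(f, 0.0) * n + vec.get(f, 0.0)) / (n + 1)
        self.time_centroid = (self.time_centroid * n + t) / (n + 1)
        self.size = n + 1

def attenuated_distance(vec, t, cluster, sigma):
    """Assumed form: 1 - cosine * exp(-(dt^2) / (2 * sigma^2))."""
    decay = math.exp(-((t - cluster.time_centroid) ** 2) / (2 * sigma ** 2))
    return 1.0 - cosine(vec, cluster.centroid) * decay

def assign(vec, t, clusters, eps=0.5, sigma=1.0):
    """Add the tweet to its nearest cluster, or start a new one if the distance exceeds eps."""
    best = min(clusters, key=lambda c: attenuated_distance(vec, t, c, sigma), default=None)
    if best is None or attenuated_distance(vec, t, best, sigma) > eps:
        best = Cluster(vec, t)
        clusters.append(best)
    else:
        best.add(vec, t)
    return best

clusters = []
a = assign({"quake": 1.0, "california": 1.0}, 0.0, clusters)
b = assign({"quake": 1.0, "california": 0.8}, 0.1, clusters)   # similar content, close in time
c = assign({"election": 1.0, "iran": 1.0}, 0.2, clusters)      # different content
d = assign({"quake": 1.0, "california": 1.0}, 10.0, clusters)  # same content, ten days later
```

In this sketch the last tweet starts its own cluster even though its content matches the first one, because the Gaussian decay drives the similarity toward zero once the time gap is large relative to `sigma`.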
Cont.: Optimization
Inverted index of cluster centroids
• Reduces the number of distance computations
• For each feature f, the index stores pointers to all clusters containing f
• A distance is computed iff at least one feature is common between a tweet and a cluster
Maintaining a list of active clusters
• Centroids are less than three days old
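The inverted index idea can be illustrated with a small sketch. It assumes clusters are identified by integer ids and features are plain strings; the class name `ClusterIndex` and its methods are hypothetical, chosen only for this example.

```python
from collections import defaultdict

class ClusterIndex:
    """Maps each feature to the ids of the clusters whose centroid contains it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, cluster_id, features):
        for f in features:
            self.postings[f].add(cluster_id)

    def remove(self, cluster_id, features):
        """Drop a cluster, e.g. when it becomes inactive."""
        for f in features:
            self.postings[f].discard(cluster_id)

    def candidates(self, tweet_features):
        """Clusters sharing at least one feature with the tweet; only these need a distance computation."""
        found = set()
        for f in tweet_features:
            found |= self.postings.get(f, set())
        return found

index = ClusterIndex()
index.add(1, {"quake", "california"})
index.add(2, {"election", "iran"})
hits = index.candidates({"quake", "lunch"})
```

A tweet that shares no feature with any centroid yields an empty candidate set, so no distance is computed for it at all.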
Additional Tweaks: Dealing with Noise
Very noisy medium
Seeding good-quality clusters
• Only seeders are allowed to start a new cluster
• Unreliable feeds are allowed to add to an existing cluster
Drawback
• Seeders mostly consist of conventional news sources
Solution
• Relax the rule: any tweet can form a cluster, which is marked inactive if, after k tweets have been added to it, none of the k tweets is from a seeder
• Cluster status changes to active when a seeder tweet is added to the cluster
Tweak: Fragmentation
Several different clusters on a single topic
• Frequently occurs with online clustering algorithms
• Tweets are distributed among tens or hundreds of duplicate clusters
Solution
• Periodically check for duplicate clusters among active clusters
• Master cluster: the one with the older time centroid
• Slave cluster: the one with the younger time centroid
• Any new tweet belonging to the slave cluster is added to the master cluster
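The master/slave redirection can be sketched as follows. The slides do not say how duplicates are detected, so this example assumes two clusters are duplicates when the cosine similarity of their centroids reaches a hypothetical threshold `dup_threshold`; `find_redirects` and `resolve` are likewise illustrative names.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {feature: weight} dicts."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def find_redirects(clusters, dup_threshold=0.8):
    """clusters: {id: (centroid, time_centroid)}. Returns {slave_id: master_id}."""
    redirect = {}
    # Sort by time centroid so that in every pair the first id is the older cluster
    ordered = sorted(clusters, key=lambda cid: clusters[cid][1])
    for older, younger in combinations(ordered, 2):
        if older in redirect or younger in redirect:
            continue  # a slave never becomes a master, so no redirect chains form
        if cosine(clusters[older][0], clusters[younger][0]) >= dup_threshold:
            redirect[younger] = older
    return redirect

def resolve(cluster_id, redirect):
    """Cluster that should actually receive a tweet assigned to cluster_id."""
    return redirect.get(cluster_id, cluster_id)

active = {
    1: ({"quake": 1.0, "california": 1.0}, 5.0),
    2: ({"quake": 1.0, "california": 0.9}, 2.0),
    3: ({"election": 1.0}, 1.0),
}
redirect = find_redirects(active)
```

Here cluster 2 has the older time centroid, so it becomes the master and tweets that would land in cluster 1 are routed to it.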
Tweak: Weight Upper Bounds
Dynamic corpus: newly added features have high TF-IDF values
• Relatively unimportant or misspelled words, etc.
Problem: spurious clusters
• Clustering based on an unimportant feature
Solution
• For a tweet to be added to a cluster, the tweet and the cluster should share k common features (k > 1)
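The shared-feature requirement amounts to a simple filter applied before a tweet may join a cluster. A minimal sketch, assuming features are strings and clusters are keyed by id; the function name `eligible_clusters` is hypothetical.

```python
def eligible_clusters(tweet_features, cluster_features, k=2):
    """Return ids of clusters that share at least k features with the tweet."""
    tweet_features = set(tweet_features)
    return {
        cid
        for cid, feats in cluster_features.items()
        if len(tweet_features & set(feats)) >= k
    }

cluster_features = {
    1: {"quake", "california", "damage"},
    2: {"teh", "lunch"},   # "teh" is a misspelling that would get a high TF-IDF weight
}
tweet = {"quake", "california", "teh"}
allowed = eligible_clusters(tweet, cluster_features)
```

With k = 2 the single rare feature "teh" is not enough to pull the tweet into cluster 2, which is the spurious match this tweak is meant to prevent.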
Tweak: Phrases
Phrase: a feature containing two or more terms
Problem
• Treating a phrase as separate features results in lost meaning: “San Francisco”
• Treating a phrase as a single feature results in a large TF-IDF score
Solution
• Distinguish two kinds of relationships between the words in a phrase by determining the volume of occurrences of t1 close to t2
• Finding a dominant word: “Barack” “Obama” => “Obama”
• Merging words into a single feature: “San” “Francisco” => “San Francisco”
Topic Geographic Focus
Associate each cluster of tweets with a set of geographic locations
Tweet content: geotagging
1. Toponym recognition: finding all instances of textual references to geographic locations
2. Toponym resolution: determining the correct location for each recognized toponym out of all possible interpretations
Source location of the user
• Metadata contains the user's location
• Containment or prominence heuristic
Computing topic focus
• Ranking geographic locations by frequency
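Ranking locations by frequency can be sketched with a counter. This example assumes toponym recognition and resolution have already been done, so each tweet in a cluster arrives with a list of resolved location names. Counting a location at most once per tweet is a design choice made for this sketch, not something stated in the slides, and `topic_focus` is a hypothetical name.

```python
from collections import Counter

def topic_focus(resolved_locations_per_tweet, top_n=3):
    """Rank locations by the number of tweets in the cluster that mention them."""
    counts = Counter()
    for locations in resolved_locations_per_tweet:
        counts.update(set(locations))  # count each location once per tweet
    return counts.most_common(top_n)

cluster_locations = [
    ["Los Angeles", "California"],
    ["California"],
    ["Los Angeles", "California", "California"],
]
focus = topic_focus(cluster_locations, top_n=2)
```

The user's source location from the metadata could be appended to each tweet's list in the same way if it should also contribute to the ranking.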
User Interface Issues
NewsStand
Topic hashtags
- Reducing the ε value
- Proactively searching for more tweets belonging to a particular topic
Conclusion
General technique to extract concepts from noise
Adaptable to different environments
Dynamic corpus generation with an online algorithm
Pinpointing news clusters to geographic locations
User interface for displaying news
Harbinger of a futuristic technology that can capture and transmit the sum total of all human experiences of the moment
Assumptions
Noise
• Tweets that do not belong to the news domain
Tweets from seeders are considered to be reliable news
To apply the naïve Bayes classifier, the words in a tweet are assumed to be independent
Questions & Answers
References
Sankaranarayanan, J., et al., “TwitterStand: News in Tweets”, Proc. ACM GIS ’09, Seattle, WA, USA
Rohit Bhargava, “Influential Marketing Blog”, http://rohitbhargava.typepad.com/weblog/2009/07/10-stunning-and-useful-stats-about-twitter.html
“In-depth study of Twitter: How much we tweet, and when”, http://royal.pingdom.com/2009/11/13/in-depth-study-of-twitter-how-much-we-tweet-and-when/