
TwitterStand: News in Tweets

Jagan Sankaranarayanan, Hanan Samet, Benjamin E. Teitler, Michael D. Lieberman, Jon Sperling

Department of Computer Science, University of Maryland

Group Members

• Enkh-Amgalan Baatarjav
• Jedsada Chartree
• Thiraphat Meesumrarn

Outline

Introduction to Twitter

Problem statement

Contributions

Key concepts

Methodology

Assumptions

Questions

References

Introduction: Twitter

• Three actors: user, followers, friends
• Relationships: unidirectional or bidirectional
• Multiple interfaces: website, SMS, applications, IM, etc.

Introduction to Twitter

Search Engines

Twitter Services: API

• Twitter API: functions to obtain user-specific information
• Twitter dataset samples through public feeds
  - Public timeline
  - Spritzer
  - GardenHose: sparse sampling of all feeds
  - BirdDog: tweets written by up to 200,000 users

Introduction: Statistics

Chart: U.S. unique visitors (thousands), trend over time (source: comScore Media Metrix)

Introduction: Statistics

• 21% of Twitter accounts are empty placeholders
• 94% of Twitter accounts have fewer than 100 followers
• 10% of Twitter users create 86% of all activity
• 49.6% of Twitter users are inactive (1 tweet in the last 7 days)
• 55% of Twitter users use third-party applications


Problem Statement

• Conventional systems
  - News aggregators: Google News, Bing News, and Yahoo! News
  - Content providers: newspapers, television stations, news blogs
• Vast amount of information being generated by Twitter users
  - 2008 Southern California earthquake
  - Iranian election
• Separating news from junk

Contributions

• Mobilizing millions of Twitter users to be the eyes and ears of the world
• Geographic proximity plays an important role
• TwitterStand
  - Identifying current news
  - Clustering similar tweets into news stories
  - Ranking news based on importance
  - Geo-tagging news topics

Key Concepts

• Separating news from noise
• Clustering tweets
• Mapping the clusters to geographic locations

Example: Twitter Vs Aggregator

Benefits of Twitter

• Social networking website
  - Community and structure
• Metadata
  - Description, source location, friends, etc.
• Very open community
  - Diverse community with varied interests
  - Broadcasts less popular viewpoints
• Capturing breaking news
  - Very little lag time between event and tweet

Challenges of Twitter

• Determining whether a tweet is news or not
  - Most tweets are not news
• Very high throughput
  - Needs to be fast and resilient to noise
• Brevity of tweets
  - Little information conveyed; time-critical
• Credibility issues

Key Strategies

1. Utilizing an online algorithm
  - The stream of tweets arrives at a furious rate
2. Extracting useful information from noise
  - Noise, spelling and grammatical errors, abbreviations, etc.
3. Keeping up with Twitter’s evolution
4. Finding a core group of users who tweet about news
  - Manually identifying the core group works better than mining the social network structure
  - Finding the most common set of followers among them
5. Obtaining user-generated news content
  - Videos, photographs, unconventional news; biased toward entertainment, politics, and tech

Architecture of TwitterStand

Architecture: Input

• Seeders: 2,000 handpicked users that are known to publish news (newspapers, television stations, reporters, bloggers, etc.)
• GardenHose: sampling of all tweets; very noisy feeds on diverse topics
• BirdDog: feeds from up to 200,000 users, identified by the “friend finder”
• Artifacts: links to external resources, retained only from the seeders’ feed
• Track: automatically generated pool of search keys used to scour the stream of tweets for potential news tweets of interest

Separating the Chaff

• Classify incoming tweets as either junk or news, except for tweets from seeders
• Goal
  - Not to rid the stream of noise completely
  - Discard as many tweets as possible without losing many news tweets
• Train a naïve Bayes classifier on a corpus of tweets marked as either junk or news

Cont.

• The probability that a tweet t is junk or news is computed using Bayes’ theorem
• Assumes independence among the words in t
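The slide’s formula is not reproduced in the text, so the following is only a sketch of the standard naïve Bayes form that matches the surrounding description. The class labels J (junk) and N (news), the words w_1, ..., w_n of tweet t, and the definition of D as a log-odds score are assumptions chosen to agree with the rule that D < 0 means news.

```latex
P(J \mid t) = \frac{P(J)\,P(t \mid J)}{P(t)}, \qquad
P(N \mid t) = \frac{P(N)\,P(t \mid N)}{P(t)}

P(t \mid C) = \prod_{i=1}^{n} P(w_i \mid C), \qquad C \in \{J, N\}

D = \log\frac{P(J \mid t)}{P(N \mid t)}
  = \log\frac{P(J)}{P(N)} + \sum_{i=1}^{n} \log\frac{P(w_i \mid J)}{P(w_i \mid N)}
```

Under this form P(t) cancels, so only the class priors and the per-word likelihoods are needed to decide the class.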

Cont.

• If D < 0, the tweet is classified as news; otherwise it is junk
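A minimal sketch of such a classifier, assuming D is the log-odds of junk versus news so that a negative value means news. The function names, the whitespace tokenizer, and the add-one smoothing are illustrative assumptions, not details taken from the slides.

```python
import math
from collections import Counter

def train(news_tweets, junk_tweets):
    """Count word occurrences per class from two labeled lists of tweets."""
    news = Counter(w for t in news_tweets for w in t.lower().split())
    junk = Counter(w for t in junk_tweets for w in t.lower().split())
    vocab = set(news) | set(junk)
    return {
        "news": news, "junk": junk, "vocab_size": len(vocab),
        "n_news": len(news_tweets), "n_junk": len(junk_tweets),
    }

def log_odds(model, tweet):
    """Return D = log P(junk | t) - log P(news | t) under word independence."""
    total_news = sum(model["news"].values())
    total_junk = sum(model["junk"].values())
    v = model["vocab_size"]
    # Class priors estimated from the number of training tweets per class.
    d = math.log(model["n_junk"] / model["n_news"])
    for w in tweet.lower().split():
        # Add-one (Laplace) smoothing keeps unseen words from zeroing a class.
        p_junk = (model["junk"][w] + 1) / (total_junk + v)
        p_news = (model["news"][w] + 1) / (total_news + v)
        d += math.log(p_junk / p_news)
    return d

def classify(model, tweet):
    """Apply the decision rule: D < 0 means news, otherwise junk."""
    return "news" if log_odds(model, tweet) < 0 else "junk"
```

Working with a sum of logarithms rather than a product of probabilities avoids numerical underflow and makes the threshold at zero explicit.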

Cont.

• How to ensure that news-related tweets are not classified as junk?
• The corpus is made up of two components
  - Static: a large collection of tweets marked as news and a large collection of tweets marked as junk
  - Dynamic: periodically obtained from the clustering module (names of people, hashtags)
• News tweets
  - Static: helps identify news tweets on topics that have not been encountered previously
  - Dynamic: helps identify news tweets about current events

Online Clustering

• Goal: automatically group news tweets into sets of tweets (clusters)
• Topic detection: each cluster contains tweets pertaining to a specific topic
• Challenges
  - Topics are not predefined
  - No training set
  - Online clustering

Cont.

• Leader-follower clustering
• Features: able to cluster on both content and time
• Algorithm details
  - Active cluster list
  - Feature vectors: tweets’ terms (TF-IDF)
  - Time centroid
  - Inactive cluster: time centroid older than 3 days

Cont.

• Cosine similarity measure
  - Feature vectors TFVt (tweet) and TFVc (cluster)
  - Pre-specified constant ε
  - If the distance to the nearest cluster is > ε, start a new cluster
• To account for the temporal dimension, apply a Gaussian attenuator
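The steps above can be sketched as a small leader-follower loop. This is an illustration under stated assumptions rather than the authors’ implementation: the distance is taken to be 1 minus the time-attenuated cosine similarity, times are in hours, and `sigma_hours`, `eps`, and the dictionary layout of a cluster are hypothetical choices.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def distance(tweet_vec, tweet_time, cluster, sigma_hours=24.0):
    """1 - (cosine similarity attenuated by a Gaussian of the time gap)."""
    gap = tweet_time - cluster["time"]
    attenuation = math.exp(-(gap * gap) / (2.0 * sigma_hours ** 2))
    return 1.0 - cosine(tweet_vec, cluster["centroid"]) * attenuation

def add_tweet(clusters, tweet_vec, tweet_time, eps=0.5):
    """Assign the tweet to its nearest cluster, or start a new one."""
    best = min(clusters, key=lambda c: distance(tweet_vec, tweet_time, c),
               default=None)
    if best is None or distance(tweet_vec, tweet_time, best) > eps:
        best = {"centroid": {}, "time": tweet_time, "size": 0}
        clusters.append(best)
    # Update the running means of the feature centroid and the time centroid.
    n = best["size"]
    for f in set(best["centroid"]) | set(tweet_vec):
        old = best["centroid"].get(f, 0.0)
        best["centroid"][f] = (old * n + tweet_vec.get(f, 0.0)) / (n + 1)
    best["time"] = (best["time"] * n + tweet_time) / (n + 1)
    best["size"] = n + 1
    return best
```

Because each tweet is compared only against existing centroids and then folded into a running mean, the loop never revisits earlier tweets, which is what makes it usable on a stream.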

Cont.: Optimization

• Inverted index of cluster centroids
  - Reduces the number of distance computations
  - For each feature f, the index stores pointers to all clusters containing f
  - A distance is computed iff a tweet and a cluster have at least one feature in common
• Maintaining a list of active clusters
  - Centroids are less than three days old
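A minimal sketch of the inverted index described above, assuming clusters are identified by hashable ids and features are plain strings. The class and method names are hypothetical.

```python
from collections import defaultdict

class ClusterIndex:
    """Inverted index from a feature to the ids of active clusters containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, cluster_id, features):
        """Register a cluster under every feature of its centroid."""
        for f in features:
            self.postings[f].add(cluster_id)

    def remove(self, cluster_id, features):
        """Drop a cluster, for example once it becomes inactive."""
        for f in features:
            self.postings[f].discard(cluster_id)

    def candidates(self, tweet_features):
        """Return the clusters sharing at least one feature with the tweet."""
        found = set()
        for f in tweet_features:
            # .get avoids creating empty postings for unseen features.
            found |= self.postings.get(f, set())
        return found
```

Only the clusters returned by `candidates` need a distance computation; every other cluster shares no feature with the tweet and so has zero cosine similarity to it.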

Additional Tweaks: Dealing with Noise

• Very noisy medium
• Seeding good-quality clusters
  - Only seeders are allowed to start a new cluster
  - Unreliable feeds are allowed only to add to existing clusters
• Drawback: seeders mostly consist of conventional news sources
• Solution
  - Relax the rule: any tweet can start a cluster, which is marked inactive if none of the k tweets added to it comes from a seeder
  - The cluster’s status changes to active when a seeder tweet is added to it

Tweak: Fragmentation

• Several different clusters form on a single topic
  - Frequently occurs with online clustering algorithms
  - Tweets are distributed across tens or hundreds of duplicate clusters
• Solution
  - Periodically check for duplicate clusters among the active clusters
  - Master cluster: the one with the older time centroid
  - Slave cluster: the one with the younger time centroid
  - Any new tweet belonging to the slave cluster is added to the master cluster
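The periodic duplicate check can be sketched as follows. The slides do not say how duplicates are detected, so this sketch assumes two clusters are duplicates when the cosine similarity of their centroids reaches a hypothetical `threshold`; the function names and the cluster layout are also assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def find_masters(clusters, threshold=0.8):
    """Map each duplicate (slave) cluster id to its master cluster id.

    clusters: dict of id -> {"centroid": dict, "time": float}, where a
    smaller "time" means an older time centroid.
    """
    # Visit clusters from oldest to youngest so masters are seen first.
    ordered = sorted(clusters, key=lambda cid: clusters[cid]["time"])
    master_of = {}
    for i, older in enumerate(ordered):
        if older in master_of:
            continue  # already a slave; its tweets go to its own master
        for younger in ordered[i + 1:]:
            if younger in master_of:
                continue
            sim = cosine(clusters[older]["centroid"],
                         clusters[younger]["centroid"])
            if sim >= threshold:
                master_of[younger] = older
    return master_of

def route(cluster_id, master_of):
    """Redirect a tweet assigned to a slave cluster to its master."""
    return master_of.get(cluster_id, cluster_id)
```

Keeping a slave-to-master map, rather than physically merging clusters, means existing cluster contents are untouched and only new tweets are redirected, as the last bullet describes.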

Tweak: Weight Upper Bounds

• Dynamic corpus: newly added features have high TF-IDF values
  - Relatively unimportant terms, misspelled words, etc.
• Problem: spurious clusters
  - Clustering based on an unimportant feature
• Solution
  - For a tweet to be added to a cluster, the tweet and the cluster should share k common features (k > 1)
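The shared-feature requirement amounts to a simple set intersection test before a tweet may join a cluster. The sketch below reads “share k common features” as “at least k”; the function name and the default k = 2 are illustrative assumptions.

```python
def shares_enough_features(tweet_features, cluster_features, k=2):
    """Return True if the tweet and the cluster have at least k features in common."""
    return len(set(tweet_features) & set(cluster_features)) >= k
```

With k > 1, a single inflated feature such as a misspelled word can no longer pull a tweet into a cluster on its own.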

Tweak: Phrases

• Phrase: a feature containing two or more terms
• Problem
  - Treating a phrase as separate features loses meaning: “San Francisco”
  - Treating a phrase as a single feature results in a large TF-IDF score
• Solution: distinguish two kinds of relationships between the words in a phrase
  - Determine the volume of occurrences of t1 close to t2
  - Find a dominant word: “Barack” “Obama” => “Obama”
  - Merge the words into a single feature: “San” “Francisco” => “San Francisco”

Topic Geographic Focus

• Associate each cluster of tweets with a set of geographic locations
• Tweet content: geotagging
  1. Toponym recognition: finding all textual references to geographic locations
  2. Toponym resolution: determining the correct location for each recognized toponym out of all possible interpretations
• Source location of the user
  - Metadata contains the user’s location
  - Containment or prominence heuristic
• Computing topic focus
  - Ranking geographic locations by frequency
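Ranking locations by frequency can be sketched with a counter over the locations attached to a cluster’s tweets. The input layout (one list of already-resolved location names per tweet) and the `top_n` parameter are assumptions; toponym recognition and resolution are taken as already done.

```python
from collections import Counter

def topic_focus(tweet_locations, top_n=3):
    """Rank the locations linked to a cluster by how often they occur.

    tweet_locations: one list of resolved location names per tweet,
    combining locations found in the text and the user's source location.
    """
    counts = Counter(loc for locs in tweet_locations for loc in locs)
    return counts.most_common(top_n)
```

For example, a cluster whose tweets mention Los Angeles three times and California twice would have Los Angeles as its leading focus.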

User Interface Issues

• NewsStand: http://newsstand.umiacs.umd.edu/

Topic Hashtags

- Reducing the ε value
- Proactively searching for more tweets belonging to a particular topic

Conclusion

• General technique to extract concepts from noise
• Adaptable to different environments
• Dynamic corpus generation with an online algorithm
• Pinpointing news clusters to geographic locations
• User interface for displaying news
• Harbinger of a futuristic technology that can capture and transmit the sum total of all human experiences of the moment

Assumptions

• Noise: tweets that do not belong to the news domain
• Tweets from seeders are considered to be reliable news
• To apply the naïve Bayes classifier, the words in a tweet are assumed to be independent

Questions & Answers

References

• Sankaranarayanan, J., et al., “TwitterStand: News in Tweets”, Proc. ACM GIS ’09, Seattle, WA, USA
• Rohit Bhargava, “Influential Marketing Blog”, http://rohitbhargava.typepad.com/weblog/2009/07/10-stunning-and-useful-stats-about-twitter.html
• “In-depth study of Twitter: How much we tweet, and when”, http://royal.pingdom.com/2009/11/13/in-depth-study-of-twitter-how-much-we-tweet-and-when/

