
TwitInfo: Aggregating and Visualizing Microblogs for Event Exploration

ABSTRACT
Microblogs are a tremendous repository of user-generated content about world events. However, for people trying to understand events by querying services like Twitter, a chronological log of posts makes it very difficult to get a detailed understanding of an event. In this paper, we present TwitInfo, a system for visualizing and summarizing events on Twitter. TwitInfo allows users to browse a large collection of tweets using a timeline-based display that highlights peaks of high tweet activity. A novel algorithm automatically discovers these peaks and labels them meaningfully using text from the tweets. Users can explore events via a variety of metadata like geolocation, sentiment, and popular URLs. An evaluation of the system revealed that users were able to reconstruct meaningful summaries of events in a small amount of time. An in-depth review by a Pulitzer Prize-winning journalist suggested that the system would be especially useful for establishing an understanding of a long-running event and for tracking down eyewitnesses. Quantitatively, our system can identify 80-100% of manually labeled peaks, facilitating a relatively complete view of each event studied.

Author Keywords
Twitter, Visualization, Exploration, Event Detection

ACM Classification Keywords
H.5.2 Information Interfaces and Presentation: Miscellaneous

INTRODUCTION
Twitter can provide a moment-by-moment pulse of the public consciousness during an event. Researchers have made considerable progress studying and visualizing individual political events [6, 24], media events [23], and crisis scenarios [27, 30]. Each such project requires authors to build scalable solutions for capturing tweets and extracting metadata. Once the data is in place, current microblog analysis interfaces still suffer from a number of limitations. First, it can be hard to tell which posts are specifically about an event of interest, as the same keywords may appear in many posts.

Second, the same post may appear many times, due to phenomena like re-tweeting, and there may be spam posts of low quality or with little content. Third, without complex analytics interfaces, users must wade through thousands of posts to get a sense of the timeline of an event. Taken together, this process is simply too costly or out of reach for many citizen journalists.

To address these limitations, we present TwitInfo, a system and interface for visualizing and summarizing events from Twitter. TwitInfo (Figure 1) supports the entire process from collection through exploration. Users identify a set of tweets of interest by inputting a set of keywords and a time range. TwitInfo provides an interface that facilitates exploring and understanding the various points of interest (“sub-events”) in this input set. Specifically, TwitInfo provides a graphical timeline interface that groups peaks of high-volume posts into sub-events and highlights important terms and tweets in the conversation around each sub-event. Users can zoom into this timeline to visualize more peaks and sub-events. By prioritizing novel content over retweets, TwitInfo helps users avoid wading through similar content. TwitInfo also includes a sentiment analysis engine that displays poster sentiment about various sub-events, and additionally shows the geographic distribution of posted messages. When a cluster of tweets contains frequently occurring links that might provide a bigger picture, TwitInfo also displays those links in the user interface.

There are several engineering challenges in building such a system, because it must provide a useful summary of an event with little user involvement. As such, determining interesting sub-events, providing reasonable labels for those events, summarizing overall sentiment, and providing a geographic breakdown for the events should be automated as much as possible. For example, we need algorithms to identify peaks in conversation around a given topic. TwitInfo contributes one such algorithm, as well as a normalization component that helps dampen tweets from search terms with dual meanings like the term football. To allow users to quickly understand what a peak is about, TwitInfo automatically generates labels for each peak, although our interface allows humans to specify labels if they choose.

Specifically, in this paper we make three contributions:

1. We present the TwitInfo interface for collecting and aggregating tweets into a display of nested, labelled events, the sentiment surrounding those events, and the geographic location of such tweets with minimal user involvement.

2. We develop a new algorithm for performing streaming event detection as tweets arrive, by identifying large increases in tweets over a timeframe relative to their historical trends. One challenge in running this algorithm on Twitter is that the keywords which users track might add noise to the overall signal. For example, the term football on Twitter more often refers to American football, so if a user wishes to track a low-frequency European soccer match using the terms football, soccer, manchester, and liverpool, the football term might introduce large numbers of unwanted tweets. We present a normalization step to reduce the effect of such terms.

3. We describe a formative evaluation of the interface with 12 users, as well as one professional journalist. These participants successfully used the interface to rapidly summarize complex event timelines. They reported finding the tool useful, particularly when the goal is to quickly assess what people on the Internet have been saying about a particular topic or event over a long time period. We also performed a quantitative analysis of the algorithm's ability to identify events on different datasets, showing that, compared to a human labeling of events, TwitInfo has recall and precision of 80-95%.

We start with a motivating scenario for TwitInfo. We go on to discuss related approaches, describe the system in detail, and present our algorithms. Finally, we evaluate the algorithms and user interface, and discuss future work.

MOTIVATING SCENARIO
Ellen has caught wind of recent protests in Iran concerning the outcome of the June 2009 election. A few traditional news media articles have come out based on eyewitness reports, but the media are barred from covering events. Instead, many individuals utilized the Twitter microblogging service to broadcast updates and report on events on the ground. Ellen wants to get a bigger picture. Her key questions are: what were the major events of the protests, what were people located inside the country saying, and how do people generally feel about each event? Performing a Twitter search for the #iranElection hashtag, used by many people to identify the subject of their posts, produces a list of thousands of individual messages that aren't grouped or organized in any meaningful way.

Instead, Ellen visits the TwitInfo website to view a timeline of tweets containing the term #iranElection. Whenever there is a peak in activity, TwitInfo highlights the peak and labels it with commonly occurring words like the protest location. She wants to know how individuals in Israel, a country whose citizens are interested in Iranian affairs, are reacting to the event. So, she turns to the map interface and zooms in to Tel Aviv, then reads several comments on the events. Ellen hypothesizes that public sentiment about the election will shift over time, so she uses TwitInfo's sentiment aggregation interface to examine overall opinion at each peak in the timeline. Ellen centers in on one protest event of interest, and follows a popular link mentioned in tweets for deeper exploration.

RELATED WORK

We now review related work in Twitter analytics, temporal exploration interfaces, and topic detection algorithms.

The growth of microblogging has led to a rapidly-expanding body of research exploring its makeup, norms, and latent data. One branch of this research examines Twitter at a macro level using aggregate statistics. For example, Java et al. studied the topological and geographical distribution of Twitter [15]. Kwak et al. extended this analysis and found that Twitter exhibited network properties more like a news media than a social network, suggesting that it may be possible to aggregate tweets to provide coverage of a news event and reactions to it [16]. Suh et al. studied diffusion on Twitter via retweets and saw that URLs and hashtags strongly influence such diffusion [28].

A second branch of Twitter research attempts to understand the service's emerging norms and conversational patterns. Naaman et al. detailed a division between informers, who spread information, and meformers, who discuss their own affairs [19]. TwitInfo is targeted at harvesting data from informers, who spread news about world events. However, we cannot ignore conversation: Honeycutt and Herring were surprised to find a high degree of conversationality on Twitter [12], and boyd et al. explored the phenomenon specifically through the lens of retweets [5]. Starbird, Vieweg, and colleagues have contributed analyses of microblog usage and information lifecycles during crisis situations [27, 30].

A particularly relevant branch of microblogging research analyzes reactions to news events on Twitter. Many of the papers in this category involve hand-creating timelines, graphs, and visualizations very similar to those that TwitInfo produces automatically or interactively. Diakopoulos and Shamma did inspirational work in this vein, demonstrating the use of timeline analytics to explore the 2008 Presidential debates through Twitter sentiment [6]. Shamma et al. later demonstrated a similar analysis at President Obama's Inauguration, finding that tweet volume increases and @replies drop during important moments [24]. Starbird et al. tracked tweets from different geographical areas on a timeline as the Red River flood proceeded [27]. Gaffney performed a retrospective analysis of the recent Iran election, in which Twitter played a pivotal role in news reporting [9]. Jansen et al. found that tweet sentiment analysis could provide useful insights into product opinion, an idea we leverage for the sentiment visualization in TwitInfo [14]. Our goal in TwitInfo is to make these kinds of analyses easier and faster.

TwitInfo builds on recent work on exploratory interfaces for temporal exploration. Most closely related are Statler [23] and Eddi [4], which both provide timeline visualizations of Twitter data. Statler focuses on media events, and Eddi on only one user's feed; our goal with TwitInfo is to extend and generalize these interfaces to an arbitrary topic of interest on Twitter. To do this, we draw on other work exploring temporal visualization (for a survey, see Silva [26]). Leskovec et al.'s Memetracker pursued similar goals, tracking prominent phrases through blogs and news media [17] rather than Twitter. Zoetrope allows the user to graph web content trends like temperature or images over time [1]. TwitInfo also takes inspiration from Continuum's [2] notion of hierarchical timelines, allowing users to define sub-events in the timeline for further exploration. Opinion Space [7] investigated visual layouts for browsing many user comments; in the future, we plan to integrate its ideas for treating tweets with sentiment or opinion separately from information-sharing tweets.

We build on work pursuing topic and event detection. Swan and Allan developed a χ² significance test for identifying bursty extracted noun phrases [29] in a stream, but it is not entirely appropriate for Twitter's needs. We need an online algorithm – one that can update quickly as tweets stream in – and tweets are so short that noun agreement may not occur naturally. Newsjunkie developed algorithms for users who wish to find novel content in their stream [8]: this approach would help TwitInfo filter out retweets and find a broader set of representative tweets to display. BlogScope [3] pursues many of the same goals as TwitInfo, but is focused on blogs, and does not go into detail about peak detection algorithms. One body of work that does address such scenarios is TCP congestion control [13], which has minimal memory requirements and works in an online fashion to determine outlier delays in packet transmission. We base our approach on this notion.

Finally, it is worth touching on some related non-academic projects. Twitter Sentiment (http://twittersentiment.appspot.com/) is a site which displays a message frequency timeline, sentiment pie chart, and sentiment-coded tweets. The site inspired many visual elements in TwitInfo, and we utilize their approach to extract sentiment from tweets, but they fall victim to the aggregate sentiment normalization issues we solve with TwitInfo. Google News Timeline (http://newstimeline.googlelabs.com/) facilitates exploration of news media in given time ranges, but does not visually signal to the user which items to focus on, and does not provide intuition into which articles are selected for a given period. Systems such as Flipboard (http://www.flipboard.com/) make news socially relevant by providing a magazine-like interface for exploring news sources extracted from your social networks on Twitter and Facebook, but do not provide context about events of interest. Twitter is planning a real-time analytics framework (http://gigaom.com/2010/09/22/twitter-to-launch-free-real-time-analytics-this-year/), but this does not seem to be geared at relaying news or informing end-users.

THE TWITINFO SYSTEM
In this section we describe TwitInfo, beginning with the user's initial event query, then detailing the exploration interface and the creation of nested events.

Creating the Event
TwitInfo users define an event by specifying a Twitter keyword query. For example, for a soccer game users might enter search keywords soccer, football, premierleague, and team names like manchester and liverpool. Users give the event a human-readable name like “Soccer: Manchester City vs. Liverpool” or “Obama presidency,” as well as an optional time window. When users are done entering the information, TwitInfo saves the event and begins logging tweets matching the query. In its current implementation, we only track tweets for a keyword when a user first enters the keyword into the system, but this is not a hard limitation: to get a historical view of an event from its keywords, we could continuously collect a sample of all tweets, and historically index each keyword as users begin tracking them.

Timeline and Tweets
Once users have created an event, they can monitor the event in realtime by navigating to a web page that TwitInfo creates for the event. The main TwitInfo interface (Figure 1) is a dashboard summarizing the event over time. The dashboard displays a timeline for this event, raw tweet text sampled from the event, an overview graph of tweet sentiment, and a map view displaying tweet sentiment and locations.

The event timeline (Figure 1B) reports tweet activity by volume. The more tweets that match the query during a period of time, the higher the y-axis value on the timeline for that period. So, when many users are tweeting about a topic (for example, Obama), the timeline spikes. TwitInfo's peak detection algorithm, described later in the paper, automatically identifies these spikes and flags them as peaks in the interface. Peaks appear as flags in the timeline and are listed to the right of it along with automatically generated key terms that appear frequently in tweets during the peak. For example, in Figure 1B, TwitInfo automatically tags one of the goals in the soccer game as peak “F” and annotates it on the right with representative terms from the tweets like 3-0 (the new score) and tevez (the soccer player who scored). Users can perform text search on this list of key terms to locate a specific peak. To visually scale from short events to long-running queries, TwitInfo aggregates tweets on the timeline at minute-, hour-, or day-level granularity depending on the timespan the user is viewing, as sketched below.
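For concreteness, here is a minimal sketch in Python of how such granularity selection might work; the exact cutoffs are our assumptions, since the paper does not specify them.

```python
def bin_size_minutes(span_seconds: float) -> int:
    """Pick minute-, hour-, or day-level bins based on the viewed timespan.

    The thresholds below are illustrative assumptions, not TwitInfo's
    actual cutoffs.
    """
    if span_seconds <= 12 * 60 * 60:       # up to half a day: per-minute bins
        return 1
    if span_seconds <= 14 * 24 * 60 * 60:  # up to two weeks: hourly bins
        return 60
    return 24 * 60                         # longer spans: daily bins
```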

The timeline is a way to filter the tweets in the rest of the interface: when the user clicks on a peak, the other interface elements (map, links, tweet list, and sentiment graph) refresh to show only tweets in the time period of that peak.

While derived information is useful at a glance, users need to have access to the tweets behind the event. The Relevant Tweets panel (Figure 1D) lists tweets that fall within the event's start and end times. These tweets are sorted by similarity to the event or peak keywords, so that tweets near the top are most representative of the selected event. Tweets are colored red, blue, or white depending on whether their detected sentiment is positive, negative, or neutral.

Aggregate Metadata Views
In addition to skimming sentiment for individual tweets, a user likely wishes to see the general sentiment on Twitter about a given topic. The Overall Sentiment panel (Figure 1F) displays a pie chart representing the total proportion of positive and negative tweets during the event. We discuss the difficulties in accurately depicting aggregate sentiment and other aggregated classifier output in the Algorithms section.

Twitterers will share links as a story unfolds [16], so the Popular Links panel (Figure 1E) aggregates the top three URLs extracted from tweets in the timeframe being explored. This list provides users with sources for further exploration.

Figure 1. The TwitInfo user interface. A) The user-defined name of the event, as well as keywords sent to the Twitter Search API to log the tweets. B) The central timeline interface, with the y-axis corresponding to tweet volume. Automatically-detected peaks (sub-events) are labeled with lettered flags. When the user clicks on a peak, all the other interface elements filter to tweets in that time period. C) Event-related tweets with geolocation are displayed on a map. D) Tweets for the currently selected event or peak (sub-event), colored red if TwitInfo detects negative sentiment or blue if TwitInfo detects positive sentiment. E) The most popular links in the currently-selected event or sub-event. F) Aggregated sentiment of all tweets in the event or sub-event.

Often, opinion on an event differs by geographic region. A user should be able to quickly zoom in on clusters of activity around New York and Boston during a Red Sox-Yankees baseball game, with sentiment toward a given peak (e.g., a home run) varying by region. The Tweet Map (Figure 1C) displays tweets that provide geolocation metadata. The marker for each tweet is colored according to its sentiment, and clicking on a pin reveals the associated tweet.

Creating Sub-Events
An important contribution of TwitInfo is its interactive support for hierarchically nested events: users can start by viewing several months of the Obama presidency on Twitter, zoom in to a sub-event to see a timeline of just a single speech, then zoom in again to do a detailed analysis of reactions to a single remark the President made. The user can transform any peak on the timeline into its own first-class event with a dedicated page. To do so, users are asked to “Label this event” in the timeline and provide a human-readable name for the event. This serves to provide a better event label and to allow future users to navigate deeper into the sub-event structure.

These labeled sub-events are shared between users, so TwitInfo builds up a folksonomy of labeled events over time [11, 18]. To speed the creation of sub-events, each event can have a schema type such as “Soccer Game.” Schema types also form a rough folksonomy, learned from the natural parent-child hierarchies that users author while creating sub-events. For example, if a user creates a new child event on an existing parent of type “Soccer Game,” and if it is common for “Soccer Game” to have sub-events named “Goal,” “Half Time,” and “Overtime,” then these types will be offered for the new sub-event. For advanced users, TwitInfo allows type equivalences so that “Soccer Game” and “Football Game” can be cross-referenced. With enough data, we expect that such relationships can be learned based on the key terms frequently appearing with certain schema types.

Realtime Updating
Diffusion speed is one of Twitter's competitive advantages in tracking the news, so it is important for Twitter interfaces to be able to handle ongoing events. Because TwitInfo captures realtime streams from Twitter, its interface reacts accordingly. The interface refreshes at regular intervals to include any new tweets that have entered the system. In this way, a citizen journalist might track a speech, sporting event, or media event in realtime as it occurs.

ALGORITHMS FOR DISPLAYING EVENTS
We now turn to the algorithms we use for detecting and displaying sub-events in a collection of tweets.

Event Detection
A key technical contribution of this paper is identifying temporal peaks in tweet frequency. Our algorithm first bins the tweets into a histogram by time, for example by minute. This binning allows us to count the tweet-arrival rate in each time period. We then calculate a historically weighted running average of the tweet rate and identify rates that are significantly higher than the mean tweet rate. For these rate spikes, we find the local maximum of the tweet rate and identify a window surrounding the local maximum. Finally, we collect tweets within this window that contain the desired keywords, and identify key frequent terms to provide an automated label for each peak. We now describe each of these steps in detail.

Given a time-sorted collection of tweets, we first group the tweets that were posted within a one-minute time window of one another. Increasing bin size serves to smooth out small spikes in longer-running events, and users can adjust bin size to hours or days if they want.
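A minimal sketch of this binning step in Python, assuming each tweet carries a datetime timestamp; the representation is ours, not the authors' code.

```python
from collections import Counter
from datetime import datetime

def bin_counts(timestamps: list[datetime], minutes_per_bin: int = 1) -> list[int]:
    """Group tweet timestamps into fixed-width time bins and count each bin."""
    origin = min(timestamps)
    seconds_per_bin = 60 * minutes_per_bin
    counts = Counter(
        int((t - origin).total_seconds() // seconds_per_bin) for t in timestamps)
    # Return a dense list [C_1, ..., C_N], including empty bins as zeros.
    return [counts.get(i, 0) for i in range(max(counts) + 1)]
```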

After our binning process, we are left with a list [C1, ..., CN] of tweet counts, where Ci is the number of tweets in bin i. For example, a list [25, 50, 13] with minute-sized bins would mean that there were 25 tweets in the first minute of an event, 50 tweets in the second, and 13 tweets in the third. We wish to determine when there is a peak in conversation about some topic. A naïve approach would be to find local maxima amongst the Ci's, but this would mis-identify peaks which are only marginally taller than their neighbors. Instead, we wish to identify each bin i such that Ci is large relative to the recent history Ci−1, Ci−2, .... A similar problem is addressed in TCP's congestion control mechanism [13], which must determine whether a packet is taking unusually long to be acknowledged and is thus an outlier. The analogue in our system is that the algorithm must determine whether a bin has an unusually large number of tweets in it. We take inspiration from TCP's approach, which uses a weighted moving average and variance.

An offline version of peak detection is described in Algorithm 1. The function find_peak_windows(C) takes binned tweet counts [C1, ..., CN] as input. The function returns a list [W1, ..., WN] of peak windows, where each window Wi = (Si, Fi), with Si < Fi, is represented by Si, the start bin of the window, and Fi, the final bin of the window. After initializing the mean to the first bin count (line 3) and the mean deviation to the variance of the first p (we use p = 5) bins (line 4), we loop through the subsequent bin counts (line 6). Line 7 contains our peak detection logic: if the current bin count (Ci) is more than τ (we use τ = 2) mean deviations from the current mean, and the bin counts are increasing, we say that the increase is significant, and begin a new peak window (line 8). The loop in lines 9-12 performs hill-climbing to find the peak of this window: we loop until we reach a bin count smaller than the previous one. Every time we iterate over a new bin count, we update the mean and mean deviation (lines 10, 18, 24). After a peak Ci has been found, we enter the loop of lines 13-21. Here we follow the peak to its bottom, which occurs either when the bin counts are back at the level they started (line 13) or when another significant increase is found (the if statement on line 14).

Algorithm 1 Offline Peak-Finding Algorithm
1:  function find_peak_windows(C):
2:    windows = []
3:    mean = C1
4:    meandev = variance(C1, ..., Cp)
5:
6:    for i = 2; i < len(C); i++ do
7:      if |Ci − mean| / meandev > τ and Ci > Ci−1 then
8:        start = i − 1
9:        while i < len(C) and Ci > Ci−1 do
10:         (mean, meandev) = update(mean, meandev, Ci)
11:         i++
12:       end while
13:       while i < len(C) and Ci > Cstart do
14:         if |Ci − mean| / meandev > τ and Ci > Ci−1 then
15:           end = −−i
16:           break
17:         else
18:           (mean, meandev) = update(mean, meandev, Ci)
19:           end = i++
20:         end if
21:       end while
22:       windows.append(start, end)
23:     else
24:       (mean, meandev) = update(mean, meandev, Ci)
25:     end if
26:   end for
27:   return windows
28:
29: function update(oldmean, oldmeandev, updatevalue):
30:   diff = |oldmean − updatevalue|
31:   newmeandev = α*diff + (1−α)*oldmeandev
32:   newmean = α*updatevalue + (1−α)*oldmean
33:   return (newmean, newmeandev)

To summarize the algorithm: when it encounters a significant increase in bin count relative to the historical mean, it follows that increase to its maximum, and ends the peak's window once the bin count returns to the level at which the peak started or another significant increase is encountered. The implementation of find_peak_windows(C) in TwitInfo differs from the one in Algorithm 1 in order to facilitate realtime streaming interactivity. To make peak-finding an online algorithm, one would make the for loop between lines 6 and 26 reachable using a continuation-passing style, with the function taking a single bin count at a time.

The update(oldmean, oldmeandev, updatevalue) function is the update step: given a new bin count (updatevalue), it updates oldmean and oldmeandev and returns their new values. Because we are doing this in a streaming context (we do not want to maintain a list of historical values) and because we want to eventually forget old bin counts to adjust to changing message frequency trends, we maintain an exponentially weighted moving mean of bin counts. Similarly, we maintain exponentially weighted moving mean deviations. The formulas for these update steps are on lines 31 and 32, and require that α < 1. We have found that α = .125 captures a reasonable amount of historical information.

The significance tests of lines 7 and 14 are similar to outlier detection criteria which identify outliers if they are a certain number of standard deviations from the mean. We utilize mean deviations instead of standard deviations because their update step (line 31) does not require keeping old bin count values around. The update steps of lines 31 and 32, as well as the use of exponentially weighted moving averages and mean deviations, are based on ideas from the TCP congestion control literature.
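For readers who prefer runnable code, the following is a direct Python translation of Algorithm 1, a sketch under the paper's parameter choices (τ = 2, α = 0.125, p = 5); we add a guard against a zero mean deviation, which the pseudocode leaves implicit, and assume at least p bins of input.

```python
import statistics

TAU = 2.0      # significance threshold, in mean deviations
ALPHA = 0.125  # exponential weighting factor
P = 5          # number of initial bins seeding the mean deviation

def update(mean, meandev, value):
    # Exponentially weighted moving mean and mean deviation (Alg. 1, lines 29-33).
    diff = abs(mean - value)
    return ALPHA * value + (1 - ALPHA) * mean, ALPHA * diff + (1 - ALPHA) * meandev

def significant(c, i, mean, meandev):
    # Significance test of lines 7 and 14, guarded against meandev == 0.
    return meandev > 0 and abs(c[i] - mean) / meandev > TAU and c[i] > c[i - 1]

def find_peak_windows(c):
    """c: list of per-bin tweet counts (0-indexed). Returns (start, end) pairs."""
    windows = []
    mean = c[0]
    meandev = statistics.pvariance(c[:P])  # seed from the first P bins
    i = 1
    while i < len(c):
        if significant(c, i, mean, meandev):
            start = i - 1
            while i < len(c) and c[i] > c[i - 1]:   # climb to the local maximum
                mean, meandev = update(mean, meandev, c[i])
                i += 1
            end = i - 1
            while i < len(c) and c[i] > c[start]:   # descend to the window's end
                if significant(c, i, mean, meandev):
                    i -= 1                          # re-examine this bin next
                    end = i
                    break
                mean, meandev = update(mean, meandev, c[i])
                end = i
                i += 1
            windows.append((start, end))
        else:
            mean, meandev = update(mean, meandev, c[i])
        i += 1
    return windows

# Example: a noisy baseline with one burst yields a single window.
print(find_peak_windows([10, 12, 8, 11, 9, 10, 45, 80, 55, 20, 9, 10]))  # [(5, 9)]
```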

Once we have identified a window W = (S, F), we must generate a label for W. To label W, we consider all tweets containing the event's keywords that were posted between the start of bin S and the end of bin F. We then tokenize all tweets into unigrams and rank the unigrams by their Term Frequency / Inverse Document Frequency (TF-IDF) [22]. We present the top five ranked unigrams as the label for W.
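A minimal sketch of this labeling step, assuming whitespace tokenization and some background collection of tweets to supply document frequencies (the paper does not specify either, so both are assumptions):

```python
import math
from collections import Counter

def label_window(window_tweets: list[str], background: list[str], k: int = 5) -> list[str]:
    """Rank unigrams from a peak window by TF-IDF and return the top k."""
    df = Counter()
    for tweet in background:
        df.update(set(tweet.lower().split()))  # document frequency per term
    n = len(background)

    tf = Counter()
    for tweet in window_tweets:
        tf.update(tweet.lower().split())       # term frequency in the window

    def tfidf(term: str) -> float:
        return tf[term] * math.log(n / (1 + df[term]))

    return sorted(tf, key=tfidf, reverse=True)[:k]
```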

Removing Noisy Query Terms
If a user includes a tangential but high-volume keyword in the list of event keywords, the resulting timeline can lose its meaningful signal. For example, suppose the user wanted to track music news using band keywords like Journey, T-Pain, and Justin Bieber. Justin Bieber is a Twitter celebrity and perpetually trending — the volume of Bieber-related tweets would overwhelm signal about the other musicians.

We employ an IDF-normalizing technique to avoid this situation. Rather than count the number of tweets which contain any of the desired keywords in a given timeframe, we first downweight always-popular terms using their global popularity (IDF). Because Bieber will frequently appear in the Twitter stream, its contribution to the volume of a period will be dampened.
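A minimal sketch of this normalization (our framing of the idea; the exact weighting TwitInfo uses is not spelled out in the paper): each matched keyword contributes its IDF weight to the bin's volume rather than a raw count of one.

```python
import math

def normalized_volume(bin_tweets: list[str], keywords: set[str],
                      global_df: dict[str, int], n_global: int) -> float:
    """Sum IDF weights of matched keywords instead of raw match counts.

    global_df and n_global describe an assumed large background sample of
    tweets: per-term document frequencies and the sample size.
    """
    volume = 0.0
    for tweet in bin_tweets:
        terms = set(tweet.lower().split())
        for kw in keywords:
            if kw in terms:
                # A perpetually popular term like "bieber" has high df,
                # hence low IDF, hence a dampened contribution.
                volume += math.log(n_global / (1 + global_df.get(kw, 0)))
    return volume
```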

Identifying Relevant Tweets
Because there are typically more tweets than can be digested by a user for any sub-event, we try to identify relevant tweets for the user in the Relevant Tweets list (Figure 1D). Given our 5-term label for each identified event, we rank tweets by the number of terms from the sub-event label that the tweet contains. Since retweets contain the same words as the original tweet and we want to promote diversity, we halve the matching term count of retweeted tweets.
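A minimal sketch of this ranking, assuming tweets are represented as (text, is_retweet) pairs (a representation we invent for illustration):

```python
def rank_relevant(tweets: list[tuple[str, bool]],
                  label_terms: list[str]) -> list[tuple[str, bool]]:
    """Sort tweets by overlap with the peak's 5-term label, halving retweets."""
    label = {t.lower() for t in label_terms}

    def score(tweet: tuple[str, bool]) -> float:
        text, is_retweet = tweet
        matches = len(label & set(text.lower().split()))
        return matches / 2 if is_retweet else matches

    return sorted(tweets, key=score, reverse=True)
```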

Representing Aggregate Sentiment
Our sentiment analysis algorithm places tweets into positive and negative classes. It uses a Naïve Bayes classifier trained on unigram features. We built the algorithm around an idea described by Go et al. [10], generating training sets for the positive and negative classes using tweets with happy and sad emoticons.
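A minimal sketch of such emoticon-based training, with scikit-learn standing in for whatever classifier implementation TwitInfo actually uses, and an assumed (incomplete) emoticon list:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

POSITIVE = (":)", ":-)", ":D")  # assumed emoticon lists, for illustration
NEGATIVE = (":(", ":-(")

def train_sentiment(tweets: list[str]):
    """Use emoticons as noisy labels, then train Naive Bayes on unigrams."""
    texts, labels = [], []
    for t in tweets:
        if any(e in t for e in POSITIVE):
            labels.append(1)
        elif any(e in t for e in NEGATIVE):
            labels.append(0)
        else:
            continue  # no emoticon: not usable as a training example
        for e in POSITIVE + NEGATIVE:
            t = t.replace(e, " ")  # strip the label-revealing emoticon
        texts.append(t)
    vectorizer = CountVectorizer()  # unigram features
    classifier = MultinomialNB().fit(vectorizer.fit_transform(texts), labels)
    return vectorizer, classifier
```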

Traditionally, Twitter-based sentiment analysis algorithms are evaluated on a tweet-by-tweet basis, using standard precision/recall metrics. In practice, however, they are often used in aggregated form, and the standard means of displaying aggregate sentiment has significant biases. We illustrate the issue in Figure 2. TwitInfo contributes a correction for this bias.

Our classifier predicts the probability that a tweet is a member of the positive class (pp) and the probability that it is a member of the negative class (pn). But not all tweets express a strong emotion, so we need a confidence threshold τ on the classifiers, below which we place predictions into a neutral category. Increasing the threshold for a class on a well-trained classifier has the effect of increasing the precision and decreasing the recall of the classifier. For example, we can increase the precision of the positive classifier by increasing the threshold τp above which we accept tweets as positive based on their pp values.

Figure 2. Positive and negative sentiment classifiers often have different performance. Suppose both the positive and the negative class contain 500 tweets. If the negative classifier is conservative and only identifies 50 tweets while the positive classifier is more liberal and identifies 250, the relative proportions will be biased (bottom left). If we divide by an estimate of each classifier's recall (as a number in [0 ... 1]), however, we produce a correct visualization.
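In code, the thresholding might look like the following sketch (function and parameter names are ours):

```python
def classify(pp: float, pn: float, tau_p: float, tau_n: float) -> str:
    """Map class probabilities to a label, falling back to neutral when
    neither classifier is confident enough."""
    if pp >= tau_p and pp >= pn:
        return "positive"
    if pn >= tau_n and pn > pp:
        return "negative"
    return "neutral"
```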

Our positive and negative classifications will have different precision/recall characteristics as we adjust their thresholds. It is not guaranteed, nor was it our experience, that the precision/recall curves will be identical for the positive and negative classes. In such situations, the two classes may have different recall values when set to the same precision.

This means that one classifier may be conservatively ignoring tweets while the other is accepting them. Following the example of Figure 2, if we have 500 positive and 500 negative tweets, but the negative classifier has a recall of .1 whereas the positive classifier has a recall of .5, then the positive classifier will identify 5 times as many positive tweets as negative ones in aggregate. If we do not correct for this problem, the user will see an aggregate visualization that grossly overestimates the number of positive tweets in the data.

Our solution to this problem is to recall-normalize the aggregate numbers. We first adjust both classifiers so that they have equivalent precision values on a test dataset. We then measure their recalls on this set. When we count the number of positive tweets, we divide the count by the positive recall, and we similarly normalize the negative count. So, if we know recall is .1 and we observed 50 negative tweets, we estimate 50/.1 = 500 negative tweets. This correction assumes that the precision/recall characteristics of our sentiment analysis algorithm in practice are similar to those on the test dataset. Because aggregated sentiment is very important in many domains, we think that researching a principled solution to this problem will be important future work.
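The correction itself is a one-liner per class; here is a sketch with the worked example from the text and Figure 2:

```python
def recall_normalize(observed: int, recall: float) -> float:
    """Estimate the true class count from an observed count and the
    classifier's measured recall on a test set."""
    return observed / recall

# Figure 2's example: 250 tweets classified positive at recall 0.5 and
# 50 classified negative at recall 0.1 both normalize to 500, restoring
# the true 50/50 split.
assert recall_normalize(250, 0.5) == 500.0
assert recall_normalize(50, 0.1) == 500.0
```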

IMPLEMENTATION
TwitInfo layers an interface and infrastructure over the Twitter streaming query API. When an event is created, TwitInfo searches for the keywords for that event using the Twitter Stream API. It associates tweets containing relevant keywords with events, counts tweet volume over time, and extracts metadata like sentiment and URLs. It records location metadata by harvesting geolocation tags on the tweet or, failing that, attempting to convert the freetext user location field into a latitude/longitude pair using a geolocation service.

These tweets are serialized to a database using Django, indexed by the keywords that matched the streaming query. When the user requests a TwitInfo page, Django looks up relevant tweets using indexes on keyword and timestamp. To power the visualization, we use the Google Visualization API (http://code.google.com/apis/visualization/documentation/gallery.html) for our annotated timeline and pie charts, and generate a map using the Google Maps API.
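A minimal sketch of how such storage might look in Django (field names and schema are our assumptions, not the authors' actual models):

```python
from django.db import models

class Tweet(models.Model):
    keyword = models.CharField(max_length=140, db_index=True)  # matched query term
    text = models.CharField(max_length=140)
    created_at = models.DateTimeField(db_index=True)
    sentiment = models.FloatField(null=True)   # classifier output, if any
    latitude = models.FloatField(null=True)    # from a geotag or geocoded location
    longitude = models.FloatField(null=True)

# Fetching the tweets behind one peak window:
# Tweet.objects.filter(keyword__in=event_keywords,
#                      created_at__range=(window_start, window_end))
```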

EVALUATION
Now that we have described our system and algorithms, we investigate the two main contributions of this paper: the performance of the algorithm for detecting events in the timeline, and the effectiveness of the user interface at visualizing events using Twitter data. We analyze the algorithm through a quantitative analysis of its ability to identify events, and analyze the user interface through a user study with twelve users and a semi-structured interview with a Pulitzer Prize-winning former Washington Post investigative reporter.

Algorithm Evaluation
Our goal with the algorithmic evaluation was to determine whether our automated event detection approach matches human intuitions about correct behavior. To do so, we gathered tweets from three soccer games and one month of earthquakes. To create ground truth for the soccer data, one researcher annotated major events in the soccer game without looking at the tweets. For ground truth on the earthquake data, we gathered data from the US Geological Survey on major earthquakes during the time period.

We tested how many of the events our algorithm detected were part of this ground truth set (precision), and how many events in the ground truth set our algorithm detected (recall). We used the default threshold cutoff for determining events — we can trade off precision and recall by adjusting this threshold, but we chose to use the single value that users will experience. Results are in Table 1.

Data Source                     Precision       Recall
Soccer Events                   18/23 ⇒ 78%     18/23 ⇒ 78%
Soccer Events & Discussion      22/23 ⇒ 96%     22/27 ⇒ 81%
Major Earthquakes                6/44 ⇒ 14%      5/5 ⇒ 100%
All Earthquakes                 29/44 ⇒ 66%     N/A
All Earthquakes & Discussion    39/44 ⇒ 89%     N/A

Table 1. Event detection performance on our data sets. Without a reasonable ground truth labeling of all earthquakes, we omit recall values for them. Note that there were five major earthquakes in the period of interest, but one event had two distinct peaks on Twitter.

Figure 3. The TwitInfo timeline displaying three soccer games (Real Madrid vs. Penarol, Manchester City vs. Liverpool, and Barcelona vs. AC Milan), with peaks annotated as Game Start, Goal, Halftime, Game End, Penalty Shots, Pregame Chatter, and Livestream URL. The peak detection algorithm finds only 1 false positive, labeled in red. The algorithm did not find Yellow Card events, but there was no spike in Twitter usage to suggest them. The algorithm occasionally flagged pregame chatter like retweeting a livestream URL.

The algorithm has high recall, finding all of the major earthquakes and most soccer events. Figure 3 displays several ground truth timelines and the events that the algorithm identified. If the algorithm failed to catch an event, it was typically because Twitter volume did not peak when the event occurred. For example, the only uncaught ground truth events in the soccer data set were yellow cards, which simply did not appear in large quantities in the tweet stream, even though our researcher had identified them as interesting events in his initial analysis of the game.

Precision depended on activity type: it was high for soccer, but there were several false positives for major earthquakes because the algorithm also flagged minor earthquakes. If we include all earthquakes in our ground truth dataset (Table 1), precision rises from 14% to 66%. The algorithm still produces false positives, because Twitter spikes when its users discuss an earthquake or share information about donating aid money. If we include all earthquake-related discussion, precision rises to 89%. Likewise, including soccer-relevant discussion (like conversations identifying live video stream URLs for a game) leads to 96% precision in the soccer dataset. The remaining false positives come from high-volume topics that happened to mention the word earthquake or a soccer term. For example, one Twitterer asked celebrity Justin Bieber to follow her because she survived Chile's earthquake, and she was retweeted around Bieber's fan base.

The algorithm can be biased by Twitter's interests. If there is a medium-sized earthquake in the middle of the ocean, nobody will notice or tweet about it, so TwitInfo will miss the event. Conversely, if there is even a minor earthquake in a densely populated area, lots of locals will tweet about it and TwitInfo may flag it as an event.

We also note two artifacts that occur because of a combination of our algorithm, timeline visualization, and use of Twitter as a datasource. First, a well-defined event, such as the September 2010 earthquake in Christchurch, NZ, often appears as multiple peaks on Twitter. In the Christchurch earthquake, we noticed three distinct peaks, all of which were identified by the algorithm: 1) news about the earthquake, 2) thoughts and prayers for those in Christchurch, and 3) information about aid and scams. The second artifact occurs when multiple events overlap in time. For example, two earthquakes in different locations might result in a single larger peak on the timeline. Our algorithm does not distinguish overlapping events, and our visualization, which does not provide stacked layers as MemeTracker [17] does, does not identify the two events as distinct.

User Interface Evaluation
Our next goal was to understand whether TwitInfo users can understand an event's structure via the interface, and whether some elements of the interface are more helpful for event exploration than others.

Method
We recruited twelve participants, six of whom had Twitter accounts, and ten of whom had been to the Twitter website before. All of the users could explain Twitter's mechanics, including tweet length and broadcast messaging. Four of our participants were female. We compensated them with a $10 gift certificate.

We asked the participants to perform directed search tasks for the first half of the study. We chose tasks that exercised all the elements of the interface and allowed us to gather feedback on a variety of usage scenarios. Some example tasks on the earthquake dataset included:

• Find the earthquake that happened in Vanuatu. When did it occur, and what was the magnitude?
• How did sentiment about the earthquake differ between Vanuatu and New Zealand?
• What are people who live in Christchurch saying about the earthquake there?

During this process, we gathered usability feedback and observed which interface elements were useful or ignored.

The second half of the study was a time-limited exploration task. We gave participants up to five minutes to understand an event that TwitInfo was visualizing and five minutes to dictate a news report on the event to the experimenter. So that we would observe more variation, we gave half of the participants a dataset of a soccer game between Manchester City and Liverpool, and half a dataset containing sixteen days of tweets about President Barack Obama.

Finally, we guided participants in a semi-structured interview. We asked about the best and worst parts of the TwitInfo interface. We also tried to understand the situations in which our participants would want to use TwitInfo, and asked them for alternate modes of performing their assigned reporting task in the absence of TwitInfo.

Results
Here is one participant's summary of President Obama's activities over a 16-day period:

After the peace talks, Obama traveled to the ASEAN conference and to NATO to work on issues in those parts of the world. He then spent the week dedicated to domestic economic issues. First he proposed a research tax break, then a $50 billion investment in infrastructure, then the issue came up about whether he should keep some tax breaks that Bush had implemented, and he's asking for some tax breaks from business, and these are generating some controversy because . . .

Participants such as this one successfully reconstructed events from the Twitter timeline within five minutes, even without previous knowledge. TwitInfo gave quick, high-level understanding, with one user describing some of the visual elements as “at-a-glance” interfaces, and another praising its speed of distilling information, but explaining that the extracted information was “shallow.”

Common Usage Patterns. When performing freeform exploration, users picked the largest peak and explored it carefully: they read the tweets thoroughly, drilled in on the map, and followed links to articles. Most users relied on tweets to confirm event details, though a few skimmed the automatically-generated labels. Under time pressure, users instead skimmed all peaks to get a broad sense of the chronology. Self-reported “news junkies” tended to follow at least one link, either from a tweet citing an interesting news source, or from the popular links section. These users cited the lack of detail in the tweets as their reason for going outside of the system.

The Timeline Focused User Activity. When providing a high-level summary of the interface in the semi-structured interview, most users cited the timeline and event labels as the most memorable and helpful elements. The timeline was also the most actively used component in the interface.

Mapping Requires Aggregation. The map provided helpful filtering features, but could have been even more useful if it offered aggregation. Users were excited to find that tweet density increased in areas affected by earthquakes, but this fact was not discoverable while the map was zoomed out. So, these users requested a map indicator for geographic areas with unusually high volume, for example a heatmap. Other participants wanted the bounds and zoom level of the map to act as a filter for the rest of the interface, for example re-sampling the timeline based only on tweets from Iran.

Users Do Not Trust Sentiment Analysis. We tested our sentiment classifier against previous work investigating tweet sentiment [14], finding similar performance. However, users often did not agree with its model of sentiment. For example, users were often surprised to find that aggregate sentiment for earthquakes was positive. After examining the positive and negative sentiment tweets, they realized that many tweets contained positive messages such as “sending best wishes to those recovering from the earthquake in haiti.” They would decide that the sentiment classifier was working correctly, but that it was not giving them the information they wanted: it extracted sentiment about tweets, not topics.

What Alternatives Did Users Suggest? We asked users how they would have performed their reporting task had TwitInfo not been provided to them. Users who successfully reported on the soccer match suggested, correctly, that they could have searched for an article about the game and extracted much of the same information. Users reporting on two weeks of Obama's activities had a harder time arriving at solutions. One user explained that he would have collected news weeklies (e.g., Newsweek), identified key events from each of President Obama's weeks, and then searched for articles on those events. Most of the other users explained that they would search on Google News for historical day-by-day coverage, and compile a set of the most interesting stories. All of the users had difficulty determining how, across the sets of daily search results, they would identify which results to pay attention to.

A Journalist's Perspective
Twitter has considerable buzz as a news medium [16]. We wanted to understand whether TwitInfo would be a useful tool for journalists exploring a new topic. We conducted a semi-structured interview with a Pulitzer Prize-winning former Washington Post investigative reporter who is now a Knight Professor of Journalism. We asked the journalist to interact with the soccer, earthquake, and Obama examples, and to reflect on how journalists might use TwitInfo.

When presented with TwitInfo, the journalist initially focused on the labeled timeline view. She explained that the timeline with labeled events would be useful for “backgrounding” on a longer-running topic. The journalist expanded: “When I do long stories, they are very arcane subjects. Being able to see a timeline on an arcane subject could be really helpful.” Specifically, she felt that the automated timeline would be useful for understanding which substories to explore further. Additionally, she recommended blending the extracted peaks from the tweet stream with traditional stories from sources such as Google News about a topic: interest on Twitter could draw attention to a topic which could then be expanded upon using traditional news sources. In order to navigate longer timelines covering multiple subevents, she recommended a topic-based drill-down interface along the lines of what Eddi [4] provides.

The journalist was also interested in a separate use case for the map view of geolocated tweets, which could be used by reporters seeking to contact on-the-ground sources. The map view was compelling because “from the perspective of us covering a breaking event, you want to know what the people on the ground are saying,” and zooming in on the location of interest could lead to eyewitnesses to follow up with. The journalist drew an analogy to the #iranElection, which was billed as a Twitter revolution, but left reporters wanting to talk to the 200-odd people in Iran who were the eyewitness sources of on-the-ground information. However, she cautioned that livetweeters may not be good sources: “when you're writing [a quote] down, you're not listening,” making it difficult to provide true context around a quote.

For the most part, the journalist's view on sentiment analysis was similar to that of the user study participants. She said that journalists would be skeptical not only of the quality and accuracy of the algorithm, but also of the sample population expressing the sentiment. While aware of the research suggesting that extracted Twitter sentiment matches sentiment measured through other polling measures [20], she expressed skepticism that a journalist would trust the numbers at face value.

Overall, the journalist found TwitInfo to be a compelling first version of a tool for making sense of the tweet stream from a reporter's perspective. She expressed that, with her recommended interface modifications, she could imagine the tool being useful for reporting on breaking news, following realtime events, and familiarizing oneself with overall interest and current angles on an arcane topic.

DISCUSSION
TwitInfo successfully allows individuals to aggregate thousands of short messages into a coherent picture of an event. If a user or journalist wishes to receive a high-level background on a new topic, or if a reporter wishes to identify eyewitnesses to follow up with, our study and interview suggest that TwitInfo would be a useful tool.

Our evaluation had several limitations. First-use studies make it difficult to understand how a dedicated person might use the interface longitudinally to track a topic of interest. Such studies also have low external validity: could a citizen journalist do real investigative reporting using nothing but Twitter and TwitInfo? We plan to release TwitInfo publicly to begin answering some of these questions.

As a news source, our users found that Twitter, for all its benefits, was often only a shallow window into world events. Though analysis and conversation do appear on Twitter [12, 5], for many events Twitter is better described as a medium for quick reactions and information sharing. Even the summary tweets sometimes barely explained the headline news: statistical summarization interfaces could help convey meaning even if individual tweets are unclear [25, 4, 21].

Finally, we note that Twitter is just one entry point into understanding our world. The next step is to synthesize TwitInfo with other data sources in order to paint a more nuanced picture. Online news has a complementary set of properties: it might not be real-time, but it provides deeper analysis. We envision a version of TwitInfo that embeds blogs and news items in the visualization directly, searching for such content using labels extracted from identified peaks. Live video is also a promising addition for broadcast events [23].

CONCLUSION
Twitter is fast becoming a critical source of information about world events large and small. But it is difficult to translate this information into a format that allows users to draw higher-level conclusions. In this paper we present TwitInfo, a novel microblog event tracking interface that can collect, aggregate, and visualize tweets on any event. TwitInfo embeds a novel algorithm for peak detection and labeling, as well as a new technique for correcting aggregate sentiment displays. The peak detection algorithm identifies 80-100% of manually labeled peaks. TwitInfo users were able to understand several weeks' worth of Twitter data in a matter of minutes, on diverse topics like earthquakes, politics, and sports. A professional journalist considered the interface as a means of exploring complex topics and identifying eyewitness accounts. We discuss implications of our approach, including Twitter's blurring of conversation and information.

Aggregate Twitter interfaces promise a new set of exploratory data analysis tools. These tools are already being used for social science [27, 30] and augmented media experiences [23]. We envision a future where citizen journalists, citizen scientists, and curious minds can use tools like TwitInfo to explore and understand their world.

REFERENCES

1. E. Adar, M. Dontcheva, J. Fogarty, and D. S. Weld. Zoetrope: interacting with the ephemeral web. In UIST '08. ACM Press, 2008.

2. P. André, M. L. Wilson, A. Russell, D. A. Smith, A. Owens, and m. schraefel. Continuum: designing timelines for hierarchies, relationships and scale. In UIST '07. ACM Press, 2007.

3. N. Bansal and N. Koudas. BlogScope: a system for online analysis of high volume text streams. In VLDB '07. VLDB Endowment, 2007.

4. M. S. Bernstein, B. Suh, L. Hong, J. Chen, S. Kairam, and E. H. Chi. Eddi: interactive topic-based browsing of social status streams. In UIST '10. ACM Press, 2010.

5. danah boyd, S. Golder, and G. Lotan. Tweet, tweet, retweet: conversational aspects of retweeting on Twitter. In HICSS. IEEE, 2010.

6. N. Diakopoulos and D. A. Shamma. Characterizing debate performance via aggregated Twitter sentiment. In CHI '10. ACM Press, 2010.

7. S. Faridani, E. Bitton, K. Ryokai, and K. Goldberg. Opinion Space: a scalable tool for browsing online comments. In CHI '10. ACM Press, 2010.

8. E. Gabrilovich, S. Dumais, and E. Horvitz. Newsjunkie: providing personalized newsfeeds via analysis of information novelty. In WWW '04. ACM Press, 2004.

9. D. Gaffney. #iranElection: quantifying online activism. In Web Science, 2010.

10. A. Go, R. Bhayani, and L. Huang. Twitter Sentiment website, September 2010. http://twittersentiment.appspot.com.

11. J. Heer, F. B. Viégas, and M. Wattenberg. Voyagers and voyeurs: supporting asynchronous collaborative information visualization. In CHI '07, 2007.

12. C. Honeycutt and S. C. Herring. Beyond microblogging: conversation and collaboration via Twitter. In HICSS '09, 2009.

13. IETF Network Working Group. RFC 2988: computing TCP's retransmission timer, November 2000. http://tools.ietf.org/html/rfc2988.

14. B. J. Jansen, M. Zhang, K. Sobel, and A. Chowdury. Twitter power: tweets as electronic word of mouth. JASIST, 60(11), 2009.

15. A. Java, X. Song, T. Finin, and B. Tseng. Why we twitter: understanding the microblogging effect in user intentions and communities. In WebKDD, 2007.

16. H. Kwak, C. Lee, H. Park, and S. Moon. What is Twitter, a social network or a news media? In WWW 2010, 2010.

17. J. Leskovec, L. Backstrom, and J. Kleinberg. Meme-tracking and the dynamics of the news cycle. In KDD '09. ACM Press, 2009.

18. A. Mathes. Folksonomies – cooperative classification and communication through shared metadata. Technical Report LIS590CMC, University of Illinois Urbana-Champaign, December 2004.

19. M. Naaman, J. Boase, and C.-H. Lai. Is it really about me?: message content in social awareness streams. In CSCW '10. ACM Press, 2010.

20. B. O'Connor, R. Balasubramanyan, B. R. Routledge, and N. A. Smith. From tweets to polls: linking text sentiment to public opinion time series. In ICWSM '10, 2010.

21. D. Ramage, S. Dumais, and D. Liebling. Characterizing microblogs with topic models. In ICWSM '10, 2010.

22. G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513–523, 1988.

23. D. Shamma, L. Kennedy, and E. Churchill. Tweetgeist: can the Twitter timeline reveal the structure of broadcast events? In CSCW 2010 Horizon, 2010.

24. D. A. Shamma, L. Kennedy, and E. F. Churchill. Conversational shadows: describing live media events using short messages. In ICWSM '10, 2010.

25. B. Sharifi, M.-A. Hutton, and J. Kalita. Summarizing microblogs automatically. In NAACL '10. ACL, 2010.

26. S. F. Silva and T. Catarci. Visualization of linear time-oriented data: a survey. Web Information Systems Engineering, 1:0310, 2000.

27. K. Starbird, L. Palen, A. L. Hughes, and S. Vieweg. Chatter on the red: what hazards threat reveals about the social life of microblogged information. In CSCW '10. ACM Press, 2010.

28. B. Suh, L. Hong, P. L. Pirolli, and E. H. Chi. Want to be retweeted? Large scale analytics on factors impacting retweet in Twitter network. In SocialCom 2010, 2010.

29. R. Swan and J. Allan. Extracting significant time varying features from text. In CIKM '99. ACM Press, 1999.

30. S. Vieweg, A. L. Hughes, K. Starbird, and L. Palen. Microblogging during two natural hazards events: what Twitter may contribute to situational awareness. In CHI '10. ACM Press, 2010.
