Twitterology - The Science of Twitter

Bruno Gonçalves www.bgoncalves.com


www.bgoncalves.com @bgoncalves

The Internet In Real Time



Social Media


Twitter


Anatomy of a Tweet



Fields of a tweet:
[u'contributors', u'truncated', u'text', u'in_reply_to_status_id', u'id', u'favorite_count', u'source', u'retweeted', u'coordinates', u'entities', u'in_reply_to_screen_name', u'in_reply_to_user_id', u'retweet_count', u'id_str', u'favorited', u'user', u'geo', u'in_reply_to_user_id_str', u'possibly_sensitive', u'lang', u'created_at', u'in_reply_to_status_id_str', u'place', u'metadata']

Fields of the embedded u'user' object:
[u'follow_request_sent', u'profile_use_background_image', u'default_profile_image', u'id', u'profile_background_image_url_https', u'verified', u'profile_text_color', u'profile_image_url_https', u'profile_sidebar_fill_color', u'entities', u'followers_count', u'profile_sidebar_border_color', u'id_str', u'profile_background_color', u'listed_count', u'is_translation_enabled', u'utc_offset', u'statuses_count', u'description', u'friends_count', u'location', u'profile_link_color', u'profile_image_url', u'following', u'geo_enabled', u'profile_banner_url', u'profile_background_image_url', u'screen_name', u'lang', u'profile_background_tile', u'favourites_count', u'name', u'notifications', u'url', u'created_at', u'contributors_enabled', u'time_zone', u'protected', u'default_profile', u'is_translator']

http://www.bgoncalves.com/teaching/data-mining.html


Nested fields of a geographic object such as u'coordinates':
[u'type', u'coordinates']

Nested fields of u'entities':
[u'symbols', u'user_mentions', u'hashtags', u'urls']
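The field lists above can be explored with a few lines of Python. The following is a minimal sketch, assuming a tweet has already been decoded from JSON into a dictionary; the sample tweet and the helper name `summarize_tweet` are hypothetical and only use keys that appear in the lists above. It also assumes the GeoJSON convention in which the point is stored as [longitude, latitude].

```python
# Hypothetical, hand-written tweet with a handful of the fields listed above.
sample_tweet = {
    "id_str": "1",
    "text": "Hello #Twitterology",
    "lang": "en",
    "created_at": "Mon Jan 05 10:00:00 +0000 2015",
    "user": {"screen_name": "example_user", "followers_count": 42},
    "entities": {"hashtags": [{"text": "Twitterology"}], "urls": [],
                 "user_mentions": [], "symbols": []},
    # GeoJSON-style point: [longitude, latitude]
    "coordinates": {"type": "Point", "coordinates": [-73.99, 40.73]},
}


def summarize_tweet(tweet):
    """Pull a few commonly used fields out of a tweet dictionary."""
    coords = tweet.get("coordinates")  # None when the tweet is not geotagged
    return {
        "id": tweet["id_str"],
        "author": tweet["user"]["screen_name"],
        "hashtags": [h["text"] for h in tweet["entities"]["hashtags"]],
        "lon_lat": tuple(coords["coordinates"]) if coords else None,
    }


summary = summarize_tweet(sample_tweet)
```

Most tweets carry no location, so the sketch returns `None` for `lon_lat` instead of failing when `coordinates` is empty.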

Demographics


Market Penetration PLoS One 8, E61981 (2013)


World Coverage


Age Distribution PLoS One 10, e0115545 (2015)


Demographics

(a) Normal representation (b) Area cartogram representation

Figure 2: Per-county over- and underrepresentation of U.S. population in Twitter, relative to the median per-county representation rate of 0.324%, presented in both (a) a normal layout and (b) an area cartogram based on the 2000 Census population. Blue colors indicate underrepresentation, while red colors represent overrepresentation. The intensity of the color corresponds to the log of the over- or underrepresentation rate. Clear trends are visible, such as the underrepresentation of the mid-west and overrepresentation of populous counties.

less than 95% predictive (e.g., the name Avery was observed to correspond to male babies only 56.8% of the time; it was therefore removed). The result is a list of 5,836 names that we use to infer gender.

Limitations. Clearly, this approach to detecting gender is subject to a number of potential limitations. First, users may misrepresent their name, leading to an incorrect gender inference. Second, there may be differences in choosing to reveal one's name between genders, leading us to believe that fewer users of one gender are present. Third, the name lists above may cover different fractions of the male and female populations.

Gender of Twitter users
We first determine the number of the 3,279,425 U.S.-based users for whom we could infer a gender, based on their name and the list previously described. We do so by comparing the first word of their self-reported name to the gender list. We observe that there exists a match for 64.2% of the users. Moreover, we find a strong bias towards male users: fully 71.8% of the users for whom we find a name match had a male name.
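The name-matching procedure described above can be sketched in a few lines. This is an illustrative sketch only, not the authors' code: the function names and the baby-name counts are hypothetical (the Avery counts are chosen to reproduce the 56.8% quoted in the text), and the 95% cut-off is taken from the excerpt.

```python
def build_gender_list(name_counts, threshold=0.95):
    """Keep names whose majority gender covers at least `threshold` of babies."""
    gender_of = {}
    for name, (male, female) in name_counts.items():
        total = male + female
        if total == 0:
            continue
        if male / total >= threshold:
            gender_of[name.lower()] = "male"
        elif female / total >= threshold:
            gender_of[name.lower()] = "female"
    return gender_of


def infer_gender(display_name, gender_of):
    """Match the first word of a self-reported name against the list."""
    words = display_name.split()
    if not words:
        return None
    return gender_of.get(words[0].lower())


# Hypothetical (male, female) counts; only the Avery ratio mirrors the text.
name_counts = {"James": (990, 10), "Mary": (5, 995), "Avery": (568, 432)}
gender_of = build_gender_list(name_counts)
```

Users whose first word is missing from the list get `None`, which corresponds to the unmatched share of users mentioned in the text.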

[Plot: fraction of joining users who are male (y-axis, 0 to 1) against join date (x-axis, 2007-01 to 2009-07)]

Figure 3: Gender of joining users over time, binned into groups of 10,000 joining users (note that the join rate increases substantially). The bias towards male users is observed to be decreasing over time.

To further explore this trend, we examine the historic gender bias. To do so, we use the join date of each user (available in the user's profile). Figure 3 plots the average fraction of joining users who are male over time. From this plot, it is clear that while the male gender bias was significantly stronger among the early Twitter adopters, the bias is decreasing over time.

Race/ethnicity

Detecting race/ethnicity using last names
Again, since we have very limited information available on each Twitter user, we resort to inferring race/ethnicity using self-reported last name. We examine the last name of users, and correlate the last name with data from the U.S. 2000 Census (U.S. Census 2000). In more detail, for each last name with over 100 individuals in the U.S. during the 2000 Census, the Census releases the distribution of race/ethnicity for that last name. For example, the last name Myers was observed to correspond to Caucasians 86% of the time, African-Americans 9.7%, Asians 0.4%, and Hispanics 1.4%.

Race/ethnicity distribution of Twitter users
We first determined the number of U.S.-based users for whom we could infer the race/ethnicity by comparing the last word of their self-reported name to the U.S. Census last name list. We found a match for 71.8% of the users. We then determined the distribution of race/ethnicity in each county by taking the race/ethnicity distribution in the Census list, weighted by the frequency of each name occurring in Twitter users in that county.¹ Due to the large amount of ambiguity in the last name-to-race/ethnicity list (in particular, the last name list is more than 95% predictive for only 18.5% of the users), we are unable to compare the Twitter race/ethnicity distribution directly to the race/ethnicity distribution in the U.S. Census. However, we are able to make relative comparisons between Twitter users in different geographic regions, allowing us to explore geographic trends in the race/ethnicity distribution. Thus, we examine the per-county race/ethnicity distribution of Twitter users.

¹ This is effectively the census.model approach discussed in prior work (Chang et al. 2010).

(a) Caucasian (non-hispanic) (b) African-American (c) Asian or Pacific Islander (d) Hispanic

Figure 4: Per-county area cartograms of Twitter over- and undersampling rates of Caucasian, African-American, Asian, and Hispanic users, relative to the 2000 U.S. Census. Only counties with more than 500 Twitter users with inferred race/ethnicity are shown. Blue regions correspond to undersampling; red regions to oversampling.

In order to account for the uneven distribution of race/ethnicity across the U.S., we examine the per-county race/ethnicity distribution relative to the distribution from the overall U.S. Census. For example, if we observed that 25% of Twitter users in a county were predicted to be Hispanic, and the 2000 U.S. Census counted 23% of people in that county as being Hispanic, we would consider Twitter to be oversampling the Hispanic users in that county. Figure 4 plots the per-county race/ethnicity distribution, relative to the 2000 U.S. Census, for all counties in which we observed more than 500 Twitter users with identifiable last names. A number of geographic trends are visible, such as the undersampling of Hispanic users in the southwest; the undersampling of African-American users in the south and mid-west; and the oversampling of Caucasian users in many major cities.
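The per-county estimate and the over/undersampling comparison can be illustrated as follows. This is a sketch under stated assumptions rather than the paper's implementation: the function names and the "garcia" row are hypothetical, only the "myers" row comes from the text, and the log-ratio score is an assumption borrowed from the colour scale of Figure 2.

```python
import math


def county_distribution(last_names, census):
    """Average the Census per-surname distributions over a county's users."""
    totals, matched = {}, 0
    for name in last_names:
        dist = census.get(name.lower())
        if dist is None:
            continue  # surname not in the Census list
        matched += 1
        for group, share in dist.items():
            totals[group] = totals.get(group, 0.0) + share
    return {g: s / matched for g, s in totals.items()} if matched else {}


def sampling_log_ratio(twitter_share, census_share):
    """Positive means oversampling, negative means undersampling."""
    return math.log(twitter_share / census_share)


# Hypothetical surname table; only the Myers row comes from the text.
census = {
    "myers": {"caucasian": 0.86, "african_american": 0.097,
              "asian": 0.004, "hispanic": 0.014},
    "garcia": {"caucasian": 0.05, "african_american": 0.01,
               "asian": 0.01, "hispanic": 0.93},
}
users = ["Myers", "Garcia", "Garcia", "Unknownname"]
dist = county_distribution(users, census)
```

Iterating over users rather than over distinct surnames is what weights each surname by its frequency in the county. With the 25% versus 23% example from the text, `sampling_log_ratio(0.25, 0.23)` is positive, i.e. oversampling.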

Related work
A few other studies have examined the demographics of social network users. For example, recent studies have examined the ethnicity of Facebook users (Chang et al. 2010), general demographics of Facebook users (Corbett 2010), and differences in online behavior on Facebook and MySpace by gender (Strayhorn 2009). However, studies of general social networking sites are able to leverage the broad nature of the profiles available; in contrast, on Twitter, users self-report only a minimal set of information, making calculating demographics significantly more difficult.

Conclusion
Twitter has received significant research interest lately as a means for understanding, monitoring, and even predicting real-world phenomena. However, most existing work does not address the sampling bias, simply applying machine learning and data mining algorithms without an understanding of the Twitter user population. In this paper, we took a first look at the user population itself, and examined the population along the axes of geography, gender, and race/ethnicity. Overall, we found that Twitter users significantly overrepresent the densely populated regions of the U.S., are predominantly male, and represent a highly non-random sample of the overall race/ethnicity distribution.

Going forward, our study sets the foundation for future work upon Twitter data. Existing approaches could immediately use our analysis to improve predictions or measurements. By enabling post-hoc corrections, our work is a first step towards turning Twitter into a tool that can make inferences about the population as a whole. More nuanced analyses on the biases in the Twitter population will enhance the ability for Twitter to be used as a sophisticated inference tool.

Acknowledgements
We thank Fabricio Benevenuto and Meeyoung Cha for their assistance in gathering the Twitter data used in this study. We also thank Jim Bagrow for valuable discussions and his collection of geographic data from Google Maps. This research was supported in part by NSF grant IIS-0964465 and an Amazon Web Services in Education Grant.

References
Asur, S., and Huberman, B. 2010. Predicting the future with social media. http://arxiv.org/abs/1003.5699.
Bollen, J.; Mao, H.; and Zeng, X.-J. 2010. Twitter mood predicts the stock market. In ICWSM.
Cha, M.; Haddadi, H.; Benevenuto, F.; and Gummadi, K. 2010. Measuring user influence in twitter: The million follower fallacy. In ICWSM.
Chang, J.; Rosenn, I.; Backstrom, L.; and Marlow, C. 2010. epluribus: Ethnicity on social networks. In ICWSM.
Corbett, P. 2010. Facebook demographics and statistics report 2010. http://www.istrategylabs.com/2010/01/facebook-demographics-and-statistics-report-2010-145-growth-in-1-year.
Gastner, M. T., and Newman, M. E. J. 2004. Diffusion-based method for producing density-equalizing maps. PNAS 101.
O'Connor, B.; Balasubramanyan, R.; Routledge, B.; and Smith, N. 2010. From tweets to polls: Linking text sentiment to public opinion time series. In ICWSM.
Social Security Administration. 2010. Most popular baby names. http://www.ssa.gov/oact/babynames.
Strayhorn, T. 2009. Sex differences in use of facebook and myspace among first-year college students. Stud. Affairs 10(2).
U.S. Census. 2000. Genealogy data: Frequently occurring surnames from census. http://www.census.gov/genealogy/www/data/2000surnames.

ICWSM'11, 375 (2011)


Network Structure


Twitter Network


WWW 2010 Full Paper April 26-30 Raleigh NC USA


2. TWITTER SPACE CRAWL
Twitter offers an Application Programming Interface (API) that makes it easy to crawl and collect data. We crawled and collected profiles of all users on Twitter from July 6th to July 31st, 2009. Additionally, we collected profiles of users who mentioned trending topics until September 24th, 2009. On top of user profiles we also collected popular topics on Twitter and tweets related to them. Below we describe in detail how we collected user profiles, popular topics, and related tweets.

2.1 Data Collection

User Profile
A Twitter user keeps a brief profile about themselves. The public profile includes the full name, the location, a web page, a short biography, and the number of tweets of the user. The people who follow the user and those that the user follows are also listed. In order to collect user profiles, we began with Perez Hilton, who has over one million followers, and crawled breadth-first along the direction of followers and followings. Twitter rate-limits each whitelisted IP to 20,000 requests per hour. Using 20 machines with different IPs and self-regulating the collection rate at 10,000 requests per hour, we collected user profiles from July 6th to July 31st, 2009. To crawl users not connected to the Giant Connected Component of the Twitter network, we additionally collected profiles of those who refer to trending topics in their tweets from June to August. The final tally of user profiles we collected is 41.7 million. There exist 1.47 billion directed relations of following and being followed.
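The breadth-first crawl described above can be sketched on a toy graph. This is a minimal illustration, not the crawler used in the paper: `crawl_breadth_first` and `get_neighbors` are hypothetical names, and the in-memory dictionary stands in for API calls that would return a user's followers and followings.

```python
from collections import deque


def crawl_breadth_first(seed, get_neighbors, max_users=None):
    """Visit users breadth-first along follower and following links."""
    seen = {seed}
    queue = deque([seed])
    order = []
    while queue:
        user = queue.popleft()
        order.append(user)
        if max_users is not None and len(order) >= max_users:
            break
        for other in get_neighbors(user):
            if other not in seen:
                seen.add(other)
                queue.append(other)
    return order


# Hypothetical in-memory stand-in for the API: user -> followers + followings.
toy_graph = {
    "seed": ["u1", "u2"],
    "u1": ["seed", "u3"],
    "u2": ["seed"],
    "u3": ["u1"],
    "isolated": [],
}
visited = crawl_breadth_first("seed", lambda u: toy_graph.get(u, []))
```

The user `isolated` is never reached from the seed, which mirrors why the text adds a second collection pass based on trending topics for users outside the giant connected component.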

Trending Topics
Twitter tracks phrases, words, and hashtags that are most often mentioned and posts them under the title of "trending topics" regularly. A hashtag is a convention among Twitter users to create and follow a thread of discussion by prefixing a word with a # character. The social bookmarking site Del.icio.us also uses the same hashtag convention.

Twitter shows a list of top ten trending topics of the moment on a right sidebar on every user's homepage by default, unless set otherwise. Twitter does not group similar trending topics and, when Michael Jackson died, most of the top ten trending topics were about him: Michael Jackson, MJ, King of Pop, etc. Although the exact mechanism of how Twitter mines the top ten trending topics is not known, we believe the trending topics are a good representation, if not complete, of issues that draw most attention and have decided to crawl them. We collected the top ten trending topics every five minutes via Twitter Search API [36]. The API returns the trending topic title, a query string, and the time of the API request. We used the query string to grab all the tweets that mention the trending topic. In total we have collected 4,262 unique trending topics and their tweets.

Once any phrase, word, or hashtag appears as a top trending topic, we follow it for seven more days after it is taken off the top ten trending topics list.

Tweets
On top of trending topics, we collected all the tweets that mentioned the trending topics. The Twitter Search API returns a maximum number of 1,500 tweets per query. We downloaded the tweets of a trending topic every 5 minutes. That is, we captured at most 5 tweets per second (1,500 tweets per 300 seconds). We collected the full text, the author, the written time, the ISO standard language code of a tweet, as well as the receiver, if the tweet is a reply, and the third party application, such as Tweetie.

2.2 Removing Spam Tweets
Spam tweets have increased in Twitter as the popularity of Twitter grows, as reported in [35]. Just as spam web page farms undermine the accuracy of PageRank and spam keywords inserted in web pages hinder relevant web page extraction, spam tweets add noise and bias to our analysis. The Twitter Support Team suspends any user reported to be a spammer. Still, unreported spam tweets can creep into our data. In order to remove spam tweets, we employ the well-known mechanism of the Firefox add-on, Clean Tweets [6]. Clean Tweets filters tweets from users who have been on Twitter for less than a day when presenting Twitter search results in Firefox. It also removes those tweets that contain three or more trending topics. We use the same mechanisms in removing spam tweets from our data.

Before we set the threshold of the trending topics to 3 in our spam filtering, we vary the number from 3 to 10 and see the change in the number of identified spam tweets. As we decrease the threshold from 10 to 8, 5, and 3, an order of magnitude more tweets are categorized as spam each time and removed. A tweet is limited to 140 characters and most references to other web pages are abbreviated via URL shortening services (e.g., http://www.tiny.cc/ and http://bit.ly) so that readers cannot guess where the references point. This is an appealing feature to spammers, who add as many trending topics as possible to appear in top results for any search in Twitter. There are 20,217,061 tweets with more than 3 trending topics and 1,966,461 unique users are responsible for those tweets. For the rest of the paper we remove those tweets from collected tweets. The final number of collected tweets is 106 million.
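The two filtering rules described above (accounts younger than a day, and tweets containing three or more trending topics) can be sketched as follows. This is an illustrative approximation and not the Clean Tweets code: the function names are hypothetical, and the naive substring match is an assumption that would over-count short topics such as "MJ" in real data.

```python
from datetime import datetime, timedelta


def count_trending_topics(text, trending_topics):
    """Count the trending topics that occur in the tweet (naive substring match)."""
    lowered = text.lower()
    return sum(1 for topic in trending_topics if topic.lower() in lowered)


def is_spam(text, tweet_time, account_created, trending_topics,
            topic_threshold=3, min_account_age=timedelta(days=1)):
    """Flag tweets from very new accounts or stuffed with trending topics."""
    too_new = tweet_time - account_created < min_account_age
    too_many = count_trending_topics(text, trending_topics) >= topic_threshold
    return too_new or too_many


# Hypothetical data for illustration.
trending = ["Michael Jackson", "MJ", "King of Pop"]
now = datetime(2009, 7, 10)
old_account = datetime(2009, 1, 1)
new_account = datetime(2009, 7, 9, 18, 0)

stuffed = is_spam("Michael Jackson MJ King of Pop click here", now, old_account, trending)
ordinary = is_spam("I miss Michael Jackson", now, old_account, trending)
newcomer = is_spam("I miss Michael Jackson", now, new_account, trending)
```

Raising `topic_threshold` from 3 towards 10 reproduces the sensitivity check described in the text, where fewer tweets are flagged at higher thresholds.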

3. ON TWITTERERS' TRAIL
We begin our analysis of Twitter space with the following question: How does the directed relationship in Twitter impact the topological characteristics? Numerous social networks have been analyzed and compared against each other. Before we delve into the eccentricities and peculiarities of Twitter, we run a batch of well-known analyses and present the summary.

3.1 Basic Analysis

Figure 1: Number of followings and followers

We construct a directed network based on the following and followed and analyze its basic characteristics. Figure 1 displays the distribution of the number of followings as the solid line and that of followers as the dotted line. The y-axis represents the complementary cumulative distribution function (CCDF). We first explain the distribution of the number of followings. There are noticeable glitches in the solid line. The first occurs at x = 20. Twitter recommends an initial set of 20 people a newcomer can follow by a single click and quite a few people take up the offer. The second glitch is at around x = 2000. Before 2009 there was an upper limit on the number of people a user could follow [12]. Twitter removed this cap and there is no limit now. The glitch represents the gap in the momentum of network building inflicted by the upper limit. A very small number of users follow more than 10,000. They are mostly official pages of politicians and celebrities who need to offer some form of customer service.

The dashed line in Figure 1 up to x = 10^5 fits to a power-law distribution with the exponent of 2.276. Most real networks including social networks have a power-law exponent between 2 and 3. The data points beyond x = 10^5 represent users who have many more followers than the power-law distribution predicts. Similar tail behavior in degree distribution has been reported from Cyworld in [1] but not from other social networks. The common characteristics between Twitter and Cyworld are that many celebrities are present and they readily form online relations with their fans.
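An empirical CCDF and a rough power-law exponent can be computed as in the sketch below. The excerpt does not say how the fit was obtained, so the continuous maximum-likelihood estimator used here is a swapped-in, standard technique and not necessarily the authors' method; the function names and the follower counts are hypothetical.

```python
import math
from collections import Counter


def ccdf(values):
    """Return sorted (x, P(X >= x)) pairs for a list of positive integers."""
    n = len(values)
    counts = Counter(values)
    points, remaining = [], n
    for x in sorted(counts):
        points.append((x, remaining / n))
        remaining -= counts[x]
    return points


def powerlaw_exponent_mle(values, x_min=1.0):
    """Continuous maximum-likelihood estimate: 1 + n / sum(ln(x / x_min))."""
    tail = [x for x in values if x >= x_min]
    log_sum = sum(math.log(x / x_min) for x in tail)
    return 1.0 + len(tail) / log_sum


# Hypothetical follower counts, only to exercise the functions.
followers = [1, 1, 2, 2, 2, 3, 5, 8, 20, 100]
points = ccdf(followers)
alpha = powerlaw_exponent_mle(followers)
```

The estimator returns the exponent of the probability density; for a pure power law the CCDF decays with an exponent that is smaller by one, which matters when comparing against a slope read off a CCDF plot.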

There are only 40 users with more than a million followers and all of them are either celebrities (e.g. Ashton Kutcher, Britney Spears) or mass media (e.g. the Ellen DeGeneres Show, CNN Breaking News, the New York Times, the Onion, NPR Politics, TIME). The top 20 are listed in Figure 7. Some of them follow their followers, but most of them do not (the median number of followings of the top 40 users is 114, three orders of magnitude smaller than the number of followers). We revisit the issue of reciprocity in Section 3.3.

3.2 Followers vs. Tweets

Figure 2: The number of followers and that of tweets per user

In order to gauge the correlation between the number of followers and that of written tweets, we plot the number of tweets (y) against the number of followers a user has (x) in Figure 2. We bin the number of followers in logscale and plot the median per bin in the dashed line. The majority of users who have fewer than 10 followers never tweeted or did just once and thus the median stays at 1. The average number of tweets against the number of followers per user is always above the median, indicating that there are outliers who tweet far more than expected from the number of followers. The median number of tweets stays relatively flat in x = 100 to 1,000, and grows by an order of magnitude for x > 5,000.
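The log-scale binning with a median per bin can be sketched as follows. This is a minimal illustration under assumptions: the function name, the one-bin-per-decade default, and the sample pairs are all hypothetical, since the excerpt does not state the bin width.

```python
import math
from collections import defaultdict
from statistics import median


def log_binned_median(pairs, bins_per_decade=1):
    """Group (x, y) pairs by log10(x) bin and return the median y per bin."""
    bins = defaultdict(list)
    for x, y in pairs:
        if x <= 0:
            continue  # log scale cannot hold users with zero followers
        bins[math.floor(math.log10(x) * bins_per_decade)].append(y)
    return {
        10 ** (b / bins_per_decade): median(ys)  # lower edge of the bin
        for b, ys in sorted(bins.items())
    }


# Hypothetical (followers, tweets) pairs.
pairs = [(1, 0), (3, 1), (9, 1), (10, 4), (50, 6), (99, 200), (5000, 900)]
medians = log_binned_median(pairs)
```

In the second bin the outlier with 200 tweets pulls the mean well above the median of 6, which is the same effect the text describes when the average lies above the median.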

We gauge the inclination to be active by the number of people a user follows and plot it in Figure 3. As pointed out for Figure 1, irregularities at x = 20 and x = 2000 are observed. Yet the graph plunges at a few more points, x = 250, 500, 2000, 5000. We conjecture that they are spam accounts, as many of them have disappeared as of October 2009. We also bin the number of followers in logscale and plot the median per bin in the dashed line. The dashed line shows a positive trend, while the line is flat between 100 and 1,000. As in Figure 2 the number of tweets increases by an order of magnitude as the number of followings goes over 5,000.

Figure 3: The number of followings and that of tweets per user

Figures 2 and 3 demonstrate that the median number of tweets increases up to x = 10 against both the numbers of followers and followings and remains relatively flat up till x = 100. Then beyond x = 5,000 the number of tweets increases by an order of magnitude or more. Our numbers do not state causation of the peer pressure, but only state the correlation between the numbers of tweets and followers.

3.3 Reciprocity
In Section 3.1 we briefly mention that top users by the number of followers in Twitter are mostly celebrities and mass media and most of them do not follow their followers back. In fact Twitter shows a low level of reciprocity; 77.9% of user pairs with any link between them are connected one-way, and only 22.1% have a reciprocal relationship between them. We call those r-friends of a user as they reciprocate a user's following. Previous studies have reported much higher reciprocity on other social networking services: 68% on Flickr [4] and 84% on Yahoo! 360 [18].

Moreover, 67.6% of users are not followed by any of their followings in Twitter. We conjecture that for these users Twitter is rather a source of information than a social networking site. Further validation is out of the scope of this paper and we leave it for future work.
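Both reciprocity measures mentioned above can be computed from a list of directed edges. The sketch below is an assumption-laden illustration: the function names and the toy edge list are hypothetical, and the second measure is taken over users who follow at least one account, which may differ from the denominator used in the paper.

```python
def reciprocity(edges):
    """Fraction of connected user pairs whose link goes both ways."""
    edge_set = {(a, b) for a, b in edges if a != b}
    pairs = {frozenset(e) for e in edge_set}
    mutual = 0
    for pair in pairs:
        a, b = tuple(pair)
        if (a, b) in edge_set and (b, a) in edge_set:
            mutual += 1
    return mutual / len(pairs) if pairs else 0.0


def share_without_follow_back(edges):
    """Fraction of users who follow someone but are followed back by none."""
    edge_set = set(edges)
    followings = {}
    for a, b in edge_set:
        followings.setdefault(a, set()).add(b)
    lonely = sum(
        1 for a, outs in followings.items()
        if not any((b, a) in edge_set for b in outs)
    )
    return lonely / len(followings) if followings else 0.0


# Hypothetical edges written as (follower, followed).
edges = [("a", "b"), ("b", "a"), ("c", "a"), ("d", "a"), ("d", "b")]
pair_reciprocity = reciprocity(edges)
no_follow_back = share_without_follow_back(edges)
```

In the toy graph one of four connected pairs is mutual, and two of the four users who follow someone receive no follow back.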

3.4 Degree of Separation

Figure 4: Degree of separation

The concept of degrees of separation has become a key to understanding the societal structure, ever since Stanley Milgram's famous "six degrees of separation" experiment [27]. In his work he reports that any two people could be connected on average within

WWW'10, 591 (2010)


Retweet Trees

6. IMPACT OF RETWEET
We have seen how trending topics rise in popularity and eventually die in Section 5. Then how exactly does information spread on Twitter? Retweet is an effective means to relay the information beyond adjacent neighbors. We dig into the retweet trees constructed per trending topic and examine key factors that impact the eventual spread of information.

6.1 Audience Size of Retweet

Figure 14: Average and median numbers of additional recipients of the tweet via retweeting

People subscribe to mass media in various forms: radio, TV, and newspapers. They are immediate recipients and consumers of the news the established media produce. On Twitter people acquire information not always directly from those they follow, but often via retweets. Assuming a tweet posted by a user is viewed and consumed by all of the user's followers, we count the number of additional recipients who are not immediate followers of the original tweet owner. Figure 14 displays its average and median per tweet against the number of followers of the original tweet user. The median lies almost always below the average, indicating that many tweets have a very large number of additional recipients. Up to about 1,000 followers, the average number of additional recipients is not affected by the number of followers of the tweet source. That is, no matter how many followers a user has, the tweet is likely to reach a certain number of audience, once the user's tweet starts spreading via retweets. This illustrates the power of retweeting. That is, the mechanism of retweet has given every user the power to spread information broadly. We recall that influentials by the number of retweets are dissimilar to those by the number of followers or PageRank. Individual users have the power to dictate which information is important and should spread by the form of retweet, which collectively determines the importance of the original tweet. In a way we are witnessing the emergence of collective intelligence.
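The count of additional recipients can be illustrated on a small follower table. This is a minimal sketch, assuming (as the text does) that every follower of a retweeter sees the retweet; the function name and the follower table are hypothetical.

```python
def additional_recipients(author, retweeters, followers_of):
    """Count users reached only through retweets, not by following the author."""
    direct = set(followers_of.get(author, ()))
    reached = set()
    for user in retweeters:
        reached.update(followers_of.get(user, ()))
    return len(reached - direct - {author})


# Hypothetical follower lists: user -> users who follow them.
followers_of = {
    "alice": ["bob", "carol"],
    "bob": ["carol", "dave", "erin"],
    "dave": ["frank", "alice"],
}
extra = additional_recipients("alice", ["bob", "dave"], followers_of)
```

Here the retweets by `bob` and `dave` add `dave`, `erin`, and `frank` to the audience; `carol` already follows the author and the author herself is not counted.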

6.2 Retweet Trees

Knowing that retweet actually delivers information to far more people than a source's immediate followers, we are now interested in how far and deep retweets travel in Twitter. In order to answer the question we build an information diffusion tree of every tweet that is retweeted and call it a retweet tree. All retweet trees are subgraphs of the Twitter network.
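As a minimal sketch of the construction described above, a retweet tree can be grown from the original poster with a breadth-first search, and its height read off as the largest number of hops. The function name and the `(retweeter, source)` pair format are assumptions made for illustration; they do not come from the paper.

```python
from collections import defaultdict, deque

def retweet_tree_height(root, retweets):
    """Return (height, number of users) of the retweet tree rooted at `root`.

    `retweets` is an iterable of (retweeter, source) pairs, meaning that
    `retweeter` relayed the tweet after receiving it from `source`.
    This edge format is a hypothetical one chosen for the example.
    """
    children = defaultdict(list)
    for retweeter, source in retweets:
        children[source].append(retweeter)

    # Breadth-first search from the original poster; depth counts hops.
    depth = {root: 0}
    queue = deque([root])
    while queue:
        user = queue.popleft()
        for child in children[user]:
            if child not in depth:  # skip repetitive and cross-retweets
                depth[child] = depth[user] + 1
                queue.append(child)
    return max(depth.values()), len(depth)

edges = [("bob", "alice"), ("carol", "alice"), ("dave", "bob"), ("alice", "dave")]
height, users = retweet_tree_height("alice", edges)
```

In this toy example the last pair is a cross-retweet back to the original poster, which the search ignores so that the result stays a tree.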

We illustrate all the retweet trees of the topic 'air france flight' in Figure 15. In every connected component different colors represent different tweets. The forest of retweet trees has a large number of one or two-hop chains. We find interesting retweet patterns such as repetitive retweet and cross-retweet; the former is repeatedly retweeting the same tweet, and cross-retweet is retweeting each other.

Figure 15: Retweet trees of 'air france flight' tweets

Figure 16: Height and participating users in retweet trees

In Figure 16 we plot the CCDFs of the retweet tree heights and the number of users in a retweet tree. The height of 1 is the most



WWW'10, 591 (2010)




Link Function

Agreement Discussion

ICWSM'11, 89 (2011)

Social Interactions


Friends Talk to Each Other

PLoS One 6, e22656 (2011)


Online Conversations

[Figure: (A) average weight per connection, $\omega^{out}$, as a function of $k^{out}$; (B) number of reciprocated connections as a function of $k^{in}$. Both horizontal axes run from 0 to 600.]

$$\omega_i^{out} = \frac{\sum_j \omega_{ij}}{k_i^{out}}$$

1.7 million users
370 million messages

Saturation of the number of reciprocated connections
Number of connections for which interaction strength is highest

PLoS One 6, e22656 (2011)
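The average weight per connection plotted on this slide is the total number of messages a user sends divided by the number of distinct users they send them to. A minimal sketch, assuming the conversation data is available as hypothetical `(sender, receiver)` pairs with one pair per message:

```python
from collections import defaultdict

def average_out_weight(messages):
    """Average weight per outgoing connection for every sender.

    `messages` is an iterable of (sender, receiver) pairs, one per message.
    The weight of a connection is the number of messages sent along it, and
    k_out is the number of distinct receivers. The input format is an
    assumption made for illustration.
    """
    weights = defaultdict(lambda: defaultdict(int))
    for sender, receiver in messages:
        weights[sender][receiver] += 1
    return {
        user: sum(out_links.values()) / len(out_links)
        for user, out_links in weights.items()
    }

messages = [("a", "b"), ("a", "b"), ("a", "c"), ("b", "a")]
omega_out = average_out_weight(messages)
```

Here user `a` sends three messages over two connections, so the average weight is 1.5, while `b` sends one message over one connection.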


erated by two-point joint probability functions $P_{kk'}(w,w')$, and among those, initially only the ones that are degree independent, given by functions of the type $P(w,w')$.

In order to construct weighted networks along these lines, we use the so-called Barabási-Albert (BA) model [22], where new nodes entering the network connect to old ones with a probability proportional to their degree [23]. The networks generated by this model are scale-free [their degree distribution reads as $P_k(k) \sim k^{-3}$], have no degree-degree correlations, and their clustering coefficient (probability of finding triangles) tends to zero when the system size tends to infinity. All this makes them ideal null models to test correlations between edge weights. Once the network is grown, a joint probability distribution for the link weights $P(w,w')$ and an algorithm for weight assignation are needed. With the function $P(w,w')$ one can calculate the weight distribution $P(w)=\int dw'\,P(w,w')$, and the conditional probability of having a weight $w'$ provided that a neighboring link has a weight $w$, $P(w'|w)=P(w,w')/P(w)$. We start by choosing an edge at random and giving it a weight obtained from $P(w)$. Then we move to the nodes at its extremes and assign weights to the neighboring links. To do this, we follow a recursive method: if the edge from which the node is accessed has a weight $w_0$, the rest, $w_1,\dots,w_{k-1}$, are obtained from the conditional distributions $P(w_i|w_{i-1})$. The recursion is necessary to increase the variability in case of anticorrelation (see below). If any of the links $j$ already have a weight, it remains without change and its value affects the subsequent edges $j+1,\dots,k-1$. We repeat this process until all the edges of the network have a weight assigned [24].

For $P(w,w')$, we have considered different possibilities but here we will focus only on the following three:

$$
\begin{aligned}
P_+(w,w') &= \frac{X_+}{(w+w')^{2+\alpha}},\\
P_U(w,w') &= \frac{X_U}{(w\,w')^{1+\alpha}},\\
P_-(w,w') &= \frac{X_-}{(w\,w'+1)^{1+\alpha}},
\end{aligned}
\qquad (2)
$$

where $X_+=2^{\alpha}\alpha(1+\alpha)$, $X_U=\alpha^2$, and $X_-=\alpha^2/{}_2F_1(\alpha,\alpha,1+\alpha,-1)$ are the normalization factors for the distributions on the domain of weights $(1,\infty)$, and ${}_2F_1(\,)$ is the Gauss hypergeometric function [25]. Without losing generality, we have chosen these particular functional forms due to their analytical and numerical tractability. The distributions generated by Eqs. (2) asymptotically decay as $P(w)\sim w^{-1-\alpha}$. The reason to use power-law decaying distributions is that empirical networks commonly show very wide weight distributions that in a first approach can be modeled as power laws (see Fig. 6 and Refs. [3–5,26]). We name the functions as $+$ (positively correlated), $-$ (anticorrelated), and $U$ (uncorrelated) because the average weight $\langle w\rangle(w_0)=\int dw\,w\,P(w|w_0)$, obtained with the conditional probabilities from a certain seed $w_0$, grows as $\langle w\rangle_+(w_0)=(1+\alpha+w_0)/\alpha$, decreases as $\langle w\rangle_-(w_0)=(\alpha+1/w_0)/(\alpha-1)$ and remains constant $\langle w\rangle_U=\alpha/(\alpha-1)$, respectively. This means that in $+$ networks the links of each node tend to be relatively uniform in the weights [see Fig. 1(a)], with separate areas of the graph concentrating the strong or the weak links, while in the negative case, links with high and low weights are heavily mixed.
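To make the three regimes concrete, the conditional averages quoted above can be evaluated directly. This is only a sketch: `alpha` is the name used here for the exponent of the weight distributions, the function names are illustrative, and the formulas require `alpha > 1` for the averages to be finite.

```python
def mean_weight_positive(w0, alpha):
    """Average neighbor weight for the positively correlated case."""
    return (1 + alpha + w0) / alpha

def mean_weight_negative(w0, alpha):
    """Average neighbor weight for the anticorrelated case."""
    return (alpha + 1 / w0) / (alpha - 1)

def mean_weight_uncorrelated(alpha):
    """Average neighbor weight for the uncorrelated case (no seed dependence)."""
    return alpha / (alpha - 1)

alpha = 3.0
seeds = [1.0, 10.0, 100.0]
positive = [mean_weight_positive(w0, alpha) for w0 in seeds]
negative = [mean_weight_negative(w0, alpha) for w0 in seeds]
uncorrelated = mean_weight_uncorrelated(alpha)
```

With these values the positive case increases with the seed weight, the negative case decreases toward the uncorrelated constant of 1.5, and the uncorrelated case ignores the seed altogether.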

From a numerical point of view, we have checked how the variables to measure vary with the network size $N$. In the following, most results are shown for $N=10^5$, which is big enough to avoid significant finite size effects. For each value of the exponent $\alpha$ [from Eqs. (2)] and for each type of correlation, we have averaged over more than 600 realizations. Note that we use $\alpha$ as a control parameter for the strength of the correlations. For high values of $\alpha$, $P(w)$ decays very fast and the correlations become negligible; all links have almost the same weight. When $\alpha$ decreases however, the higher moments of $P(w)$ diverge and one would expect the correlations to be more prominent.

III. MEASURES OF WEIGHT CORRELATIONS

After a look at the sketch of Fig. 1, the first estimator to consider in order to estimate weight correlations is the standard deviation of the weights of the links arriving at each node. If the weights are relatively homogeneous, the standard deviation will be lower compared with its counterpart in a randomized instance of the graph. The opposite will happen if the correlations are negative as in Fig. 1(b). More specifically, for a generic node of the network $i$, $\sigma_w(i)$ can be defined as

$$\sigma_w^2(i) = \sum_{j\in\nu(i)} \left(w_{ij} - \langle w\rangle_i\right)^2, \qquad (3)$$

where $\nu(i)$ is the set of neighbors of $i$ and $\langle w\rangle_i$ is the mean value of the weight of the links arriving at $i$. Once the deviation is calculated for each node, an average can be taken over the full network getting $\langle\sigma_w\rangle=(1/N)\sum_i\sigma_w(i)$. Then to evaluate the effects of weight correlations, it is necessary to compare the value of $\langle\sigma_w\rangle_{org}$ obtained for the original network with that measured on uncorrelated graphs. It is, of course, important that the statistical properties of such uncorrelated

FIG. 1. (Color online) Two possible cases in networks with correlations in the link weight: (a) positively correlated nets and (b) anticorrelated networks. The width of the line of the links represents the value of the weight.

JOSÉ J. RAMASCO AND BRUNO GONÇALVES, PHYSICAL REVIEW E 76, 066106 (2007)
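The per-node deviation of Eq. (3) and its network average can be sketched in a few lines. The input format (a dictionary mapping each node to the list of weights of its links) and the function names are assumptions made for illustration; the sum is taken exactly as it appears in Eq. (3) above, without any additional normalization.

```python
import math

def weight_deviation(weights_by_node):
    """Deviation of link weights around each node's mean weight.

    `weights_by_node` maps a node to the list of weights of the links
    arriving at it (hypothetical input format).
    """
    sigma = {}
    for node, weights in weights_by_node.items():
        mean = sum(weights) / len(weights)
        sigma[node] = math.sqrt(sum((w - mean) ** 2 for w in weights))
    return sigma

def network_average(sigma):
    """Average of the per-node deviations over all nodes."""
    return sum(sigma.values()) / len(sigma)

links = {"a": [1.0, 1.0, 1.0], "b": [1.0, 3.0]}
sigma = weight_deviation(links)
average_sigma = network_average(sigma)
```

Node `a` has perfectly homogeneous weights and a deviation of zero, while node `b` mixes a weak and a strong link. Comparing `average_sigma` on the original graph against a version with shuffled weights is the comparison the text describes.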

[Figure 2 diagram: three nodes, A, B, and C, each annotated with its in-degree k_in, out-degree k_out, in-strength s_in, and out-strength s_out.]

Figure 2: Example of a meme diffusion network involving three users mentioning and retweeting each other. The values of various node statistics are shown next to each node. The strength s refers to weighted degree, k stands for degree.

Observing a retweet at node B provides implicit confirmation that information from A appeared in B's Twitter feed, while a mention of B originating at node A explicitly confirms that A's message appeared in B's Twitter feed. This may or may not be noticed by B, therefore mention edges are less reliable indicators of information flow compared to retweet edges.

Retweet and reply/mention information parsed from the text can be ambiguous, as in the case when a tweet is marked as being a "retweet" of multiple people. Rather, we rely on Twitter metadata, which designates users replied to or retweeted by each message. Thus, while the text of a tweet may contain several mentions, we only draw an edge to the user explicitly designated as the mentioned user by the metadata. In so doing, we may miss retweets that do not use the explicit retweet feature and thus are not captured in the metadata. Note that this is separate from our use of mentions as memes (§ 3.1), which we parse from the text of the tweet.
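The node statistics used in the diffusion network (degree k and strength s, in both directions) can be sketched as follows. The input is assumed to be a list of hypothetical `(source, target)` pairs, one per retweet or mention event and oriented in the direction of information flow; repeated pairs increase the weight of an edge.

```python
from collections import Counter

def node_statistics(events):
    """Compute k_in, k_out, s_in and s_out for every node.

    `events` is an iterable of (source, target) pairs, one per retweet or
    mention event (an assumed format). Degree counts distinct neighbors,
    strength counts events.
    """
    edge_weight = Counter(events)
    stats = {}
    for (source, target), weight in edge_weight.items():
        for node in (source, target):
            stats.setdefault(node, {"k_in": 0, "k_out": 0, "s_in": 0, "s_out": 0})
        stats[source]["k_out"] += 1
        stats[source]["s_out"] += weight
        stats[target]["k_in"] += 1
        stats[target]["s_in"] += weight
    return stats

events = [("A", "B"), ("A", "B"), ("A", "C"), ("B", "C")]
stats = node_statistics(events)
```

In this toy network A reaches two users through three events, so its out-degree is 2 and its out-strength is 3.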

4 System Architecture

We implemented a system based on the data representation described above to automatically monitor the data stream from Twitter, detect relevant memes, collect the tweets that match themes of interest, and produce basic statistical features relative to patterns of diffusion. These features are then passed to our meme classifier and/or visualized. We called this system "Truthy". The different stages that lead to the identification of the truthy memes are described in the following subsections. A screenshot of the meme overview page of our website (truthy.indiana.edu) is shown in Fig. 3. Upon clicking on any meme, the user is taken to another page with more detailed statistics about that meme. They are also given an opportunity to label the meme as "truthy"; the idea is to crowdsource the identification of truthy memes, as an input to the classifier described in § 5.

4.1 Data Collection

To collect meme diffusion data we rely on whitelisted access to the Twitter "Gardenhose" streaming API (dev.twitter.com/pages/streaming_api). The Gardenhose provides detailed data on a sample of the Twitter corpus at a rate that varied between roughly 4 million tweets per day near the beginning of our study, to around 8 million tweets per day at the time of this writing. While the process of sampling edges (tweets between users) from a network to investigate structural properties has been shown to produce suboptimal approximations of true network characteristics (Leskovec and Faloutsos 2006), we find that the analyses described below are able to produce accurate classifications of truthy memes even in light of this shortcoming.

Figure 3: Screenshot of the Meme Overview page of our website, displaying a number of vital statistics about tracked memes. Users can then select a particular meme for more detailed information.

4.2 Meme Detection

A second component of our system is devoted to scanning the collected tweets in real time. The task of this meme detection component is to determine which of the collected tweets are to be stored in our database for further analysis. Our goal is to collect only tweets (a) with content related to U.S. politics, and (b) of sufficiently general interest in that context. Political relevance is determined by matching against a manually compiled list of keywords. We consider a meme to be of general interest if the number of tweets with that meme observed in a sliding window of time exceeds a given threshold. We implemented a filtering step for each of these criteria, described elsewhere (Ratkiewicz et al. 2011).
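The general-interest criterion can be illustrated with a small sliding-window counter. This is a sketch under stated assumptions, not the filter used by the Truthy system: the class name, the window length, and the threshold value are all hypothetical, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import defaultdict, deque

class MemeDetector:
    """Flag a meme once its tweet count inside a sliding window exceeds a threshold."""

    def __init__(self, window_seconds, threshold):
        self.window = window_seconds
        self.threshold = threshold
        self.seen = defaultdict(deque)  # meme -> timestamps still inside the window

    def observe(self, meme, timestamp):
        """Record one tweet and report whether the meme is of general interest."""
        times = self.seen[meme]
        times.append(timestamp)
        # Discard tweets that have fallen out of the window.
        while timestamp - times[0] >= self.window:
            times.popleft()
        return len(times) > self.threshold

detector = MemeDetector(window_seconds=3600, threshold=2)
flags = [detector.observe("#topic", t) for t in (0, 10, 20, 4000)]
```

The third tweet pushes the count above the threshold; by the fourth, the earlier tweets have left the one-hour window and the meme is no longer flagged.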

Our system has tracked a total of approximately 305 million tweets collected from September 14 until October 27, 2010. Of these, 1.2 million contain one or more of our political keywords; the meme filtering step further reduced this number to 600,000. Note that this number of tweets does not directly correspond to the number of tracked memes, as each tweet might contribute to several memes.

4.3 Network Analysis

To characterize the structure of each meme's diffusion network we compute several statistics based on the topology of the largest connected component of the retweet/mention

The Strength of Weak Ties

Interviews to find out how individuals learned about job opportunities.

Mostly from acquaintances or friends of friends.

It is argued that the degree of overlap of two individuals' social networks varies directly with the strength of their tie to one another.















The Strength of Weak Ties (1973)



the system. Within this new framework we study a family of information propagation processes, namely the rumour spreading model [38,39]. We tackle the case in which the dynamics of contacts and the spreading process are acting on the same time-scale. Interestingly, both in synthetic and real time-varying networks we find that memory hampers the rumour spreading process. Strong ties have an important role in the early cessation of the rumor diffusion by favouring interactions among agents already aware of the "gossip". The celebrated Granovetter conjecture that spreading is mostly supported by weak ties [40] goes along with a negative effect of strong ties. In other words, while favouring locally the rumor spreading, strong ties have an active role in confining the process for a time sufficient to its cessation.

Results

We focus on a prototypical large scale communication network where mobile phone users are nodes and the calls among them links.

The common analysis framework for such systems neglects the temporal nature of the connections in favour of time-aggregated representations. In these representations, the degree k of a node indicates the total number of contacted individuals, while the weight of a link w (the strength of the tie) the total number of calls between the pair of connected nodes. The distributions of these quantities are shown in Fig. 1.a, and b. Interestingly, they are characterized by heavy-tailed distributions. Although the study of the time-aggregated network provides basic information about its structure, it cannot inform us on the processes driving its dynamics. This intuition is clearly exemplified in Fig. 2.a and b. These figures show two snapshots of the network at different times covering few hours of calls in a town. The two plots capture dynamical interaction patterns not visible from the aggregated network representation (Fig. 2.c).

Here we aim to study and identify the mechanisms driving the evolution, and dynamics of the egocentric networks (egonets) of the global network. Egonets were thoroughly investigated earlier in psychology and sociology [41–43]. Some other characteristics have been recently mapped out with the availability of large-scale data [44–48]. We tackle this problem from a different angle focusing on the activity rate, a, that allows describing the network evolution beyond simple static measures. It is defined as the probability of any given node to be involved in an interaction at each unit time. The activity distribution is also heavy-tailed (see Fig. 1.c), but contrary to degree and weight, is a time invariant property of individuals [23]. It does not change by using different time aggregation scales [23,25]. This quantity is the basic ingredient of the activity-driven modelling framework [23]. Here we extend this approach by identifying, and modelling another crucial component: the memory of each agent. We encode this ingredient in a simple non-Markovian reinforcing mechanism that allows to reproduce with great accuracy the empirical data.

Egocentric network dynamics. In general, social networks are characterized by two types of links. The first class describes strong ties that identify time repeated and frequent interactions among specific couples of agents. The second class characterizes weak ties among agents that are activated only occasionally. It is natural to assume that strong ties are the first to appear in the system, while weak ties are incrementally added to the egonet of each agent [1]. This intuition has been recently confirmed [49] in a large-scale dataset and indicates a particular egocentric network evolution. In order to quantify it, we measure the probability, p(n), that the next communication event of an agent having n social ties will occur via the establishment of a new (n + 1)th link. We calculate these probabilities in the MPC dataset averaging them for users with the same degree k at the end of the observation time. We therefore

Figure 1 | Distributions of the characteristic measures of the aggregated MPC network, and activity-driven networks. In panels (a), and (d) we plot the degree distributions. In panels (b), and (e) we plot the weight distributions. Finally, in panels (c), and (f) we plot the activity distributions. In each figure grey symbols are assigning the original distributions while coloured symbols are denoting the same distributions after logarithmic binning. Measured quantities in MPC sequences were recorded for 182 days (see Methods). In panels (d), (e), and (f) solid lines are assigned to the distributions induced by the reinforced process, while dashed lines denote results of the original memoryless process. Model calculations were performed with parameters $N = 10^6$, $\epsilon = 10^{-4}$ and $T = 10^4$.

Figure 2 | Dynamics of the MPC network. Panels (a), and (b) show calls within 3 hours between people in the same town in two different time windows. Panel (c) presents the total weighted social network structure, which was recorded by aggregating interactions during 6 months. Node size and colors describe the activity of users, while link width and color represent weight.

SCIENTIFIC REPORTS | 4 : 4001 | DOI: 10.1038/srep04001


Network Structure

a mention involves some effort and addresses only single targeted users.

2.3 Internal links

According to Granovetter's theory, one could expect the internal connections inside a group to bear closer relations. Mechanisms such as homophily [43], cognitive balance [44,45] or triadic closure [12] favor this kind of structural configurations. Unfortunately, we have no means to measure the closeness of a user-user relation in a sociological sense in our Twitter dataset. However we can verify whether the link has been used for mentions, whether the interchange has been reciprocated or whether it has happened more than once. We define the fraction $f^i_p$ of links with interaction $i$ in position $p$ with respect to the groups of size $s$ as

$$f^i_p(s) = \frac{n^i_p(s)}{N^i}, \qquad (1)$$

where $n^i_p(s)$ is the number of links with that type of interaction in position $p$ with respect to the groups of size $s$ and $N^i$ is the total number of links with interaction $i$. The fraction $f^i_{internal}(s)$ reveals an interesting pattern as a function of the group size, as can be seen in Figure 3A. Note that the fraction of links in the follower network (black curve) is taken as the reference for comparison. Links with mentions are more abundant as internal links than the baseline follower relations for groups of size up to 150 users. This particular value brings reminiscences of the quantity known as the Dunbar number [46], the cognitive limit to the number of people with whom each person can have a close relationship and that has recently been discussed in the context of Twitter [47]. Although we have identified larger groups, the density of mentions is similar to the density of links in the follower network. In addition, the distribution of the number of times that a link is used (intensity) for mentions is wide, which allows for a systematic study of the dependence of intensity and position (see Figure 3B). The more intense (or reciprocated) a link with mentions is, the more likely it becomes to find this link as internal (Figure 3C). This corresponds

Figure 1. Groups and links. (A) Sample of Twitter network: nodes represent users and links, interactions. The follower connections are plotted as gray arrows, mentions in red, and retweets in green. The width of the arrows is proportional to the number of times that the link has been used for mentions. We display three groups (yellow, purple and turquoise) and a user (blue star) belonging to two groups. (B) Different types of links depending on their position with respect to the groups structure: internal, between groups, intermediary links and no-group links.
doi:10.1371/journal.pone.0029358.g001

The Strength of Intermediary Ties in Social Media

PLoS ONE | www.plosone.org 3 January 2012 | Volume 7 | Issue 1 | e29358

People whose networks bridge the structural holes between groups have an advantage in detecting and developing rewarding opportunities. Information arbitrage is their advantage. They are able to see early, see more broadly, and translate information across groups.

AJS Volume 110 Number 2 (September 2004): 349–99

© 2004 by The University of Chicago. All rights reserved. 0002-9602/2004/11002-0004$10.00

Structural Holes and Good Ideas¹

Ronald S. Burt
University of Chicago

This article outlines the mechanism by which brokerage provides social capital. Opinion and behavior are more homogeneous within than between groups, so people connected across groups are more familiar with alternative ways of thinking and behaving. Brokerage across the structural holes between groups provides a vision of options otherwise unseen, which is the mechanism by which brokerage becomes social capital. I review evidence consistent with the hypothesis, then look at the networks around managers in a large American electronics company. The organization is rife with structural holes, and brokerage has its expected correlates. Compensation, positive performance evaluations, promotions, and good ideas are disproportionately in the hands of people whose networks span structural holes. The between-group brokers are more likely to express ideas, less likely to have ideas dismissed, and more likely to have ideas evaluated as valuable. I close with implications for creativity and structural change.

The hypothesis in this article is that people who stand near the holes in a social structure are at higher risk of having good ideas. The argument is that opinion and behavior are more homogeneous within than between groups, so people connected across groups are more familiar with alter-

¹ Portions of this material were presented as the 2003 Coleman Lecture at the University of Chicago, at the Harvard-MIT workshop on economic sociology, in workshops at the University of California at Berkeley, the University of Chicago, the University of Kentucky, the Russell Sage Foundation, the Stanford Graduate School of Business, the University of Texas at Dallas, Universiteit Utrecht, and the "Social Aspects of Rationality" conference at the 2003 meetings of the American Sociological Association. I am grateful to Christina Hardy for her assistance on the manuscript and to several colleagues for comments affecting the final text: William Barnett, James Baron, Jonathan Bendor, Jack Birner, Matthew Bothner, Frank Dobbin, Chip Heath, Rachel Kranton, Rakesh Khurana, Jeffrey Pfeffer, Joel Podolny, Holly Raider, James Rauch, Don Ronchi, Ezra Zuckerman, and two AJS reviewers. I am especially grateful to Peter Marsden for his comments as discussant at the Coleman Lecture. Direct correspondence to Ron Burt, Graduate School of Business, University of Chicago, Chicago, Illinois 60637.

PLoS One 7, e29358 (2012)


to Granovetter's expectation that the stronger the tie is, the higher the number of mutual contacts of both parties and the higher the chance that the parties belong to the same group.

2.4 Links between groups

The next question to consider is the characteristics of links between groups. These links occur mainly between groups containing less than 200 users (Figure 4A–C). However, their frequency depends on the quality of the links (if they bear mentions or retweets). While links with mentions are less abundant than the baseline, those with retweets are slightly more abundant. According to the strength of weak ties theory [12,14–16], weak links are typically connections between persons not sharing neighbors, being important to keep the network connected and for information diffusion. We investigate whether the links between groups play a similar role in the online network as information transmitters. The actions more related to information diffusion are retweets [24] that show a slight preference for occurring on between-group links (Figures 4B and 4C). This preference is enhanced when the similarity between connected groups is taken into account. We define the similarity between two groups, A and B, in terms of the Jaccard index of their connections:

$$\mathrm{similarity}(A,B) = \frac{|\cap\ \text{links of A and B}|}{|\cup\ \text{links of A and B}|}. \qquad (2)$$

The similarity is the overlap between the groups' connections and it estimates network proximity of the groups. The general pattern is that links with mentions more likely occur between close groups and retweets occur between groups with medium similarity (Figure 4D). Mentions as personal messages are typically exchanged between users with similar environments, which is predicted by the strength of weak ties theory. Links with retweets are related to information transfer and the similarity of the groups between which they take place should be small according to Granovetter's theory. The results show that the most likely to attract retweets are the links connecting groups that are neither too
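The Jaccard similarity between two groups can be sketched with Python sets. How a group's connections are represented is an assumption here: each group is given as a collection of hypothetical identifiers for the users it is linked to, and the function name is illustrative.

```python
def group_similarity(links_a, links_b):
    """Jaccard index of the connections of two groups.

    `links_a` and `links_b` are collections of identifiers for the
    connections of each group (an assumed representation). Returns 0.0
    when neither group has any connection.
    """
    a, b = set(links_a), set(links_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

group_a = {"u1", "u2", "u3"}
group_b = {"u2", "u3", "u4", "u5"}
similarity = group_similarity(group_a, group_b)
```

The two groups share two of five distinct connections, giving a similarity of 0.4; identical connection sets give 1 and disjoint ones give 0.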
