Unleashing twitter data for fun and insight

Agile Data Solutions Mining the Social Web Matthew A. Russell http://linkedin.com/in/ptwobrussell @ptwobrussell Unleashing Twitter Data for fun and insight

description

Matthew Russell, VP Engineering at Digital Reasoning, discusses techniques and results of mining twitter data for fun and insight

Transcript of Unleashing twitter data for fun and insight

Page 1: Unleashing twitter data for fun and insight

Agile Data Solutions
Mining the Social Web

Matthew A. Russell

http://linkedin.com/in/ptwobrussell
@ptwobrussell

Unleashing Twitter Data for fun and insight

Page 2: Unleashing twitter data for fun and insight

Happy Groundhog Day!

Page 3: Unleashing twitter data for fun and insight

Mining the Social Web Chapters 1-5

Introduction: Trends, Tweets, and Twitterers

Microformats: Semantic Markup and Common Sense Collide

Mailboxes: Oldies but Goodies

Friends, Followers, and Setwise Operations

Twitter: The Tweet, the Whole Tweet, and Nothing but the Tweet

Page 4: Unleashing twitter data for fun and insight

Mining the Social Web Chapters 6-10

LinkedIn: Clustering Your Professional Network For Fun (and Profit?)

Google Buzz: TF-IDF, Cosine Similarity, and Collocations

Blogs et al: Natural Language Processing (and Beyond)

Facebook: The All-In-One Wonder

The Semantic Web: A Cocktail Discussion

Page 5: Unleashing twitter data for fun and insight

Overview

•Trends, Tweets, and Retweet Visualizations

•Friends, Followers, and Setwise Operations

•The Tweet, the Whole Tweet, and Nothing but the Tweet

Page 6: Unleashing twitter data for fun and insight

Insight Matters

•What is @user's potential influence?

•What are @user's passions right now?

•Who are @user's most trusted friends?

Page 7: Unleashing twitter data for fun and insight


Part 1: Tweets, Trends, and Retweet Visualizations

Page 8: Unleashing twitter data for fun and insight

A point to ponder: Twitter : Data :: JavaScript : Programming Languages (???)

Page 9: Unleashing twitter data for fun and insight


Getting Ready To Code

Page 10: Unleashing twitter data for fun and insight

Python Installation

•Mac users already have it

•Linux users probably have it

•Windows users should grab ActivePython

Page 11: Unleashing twitter data for fun and insight

easy_install

•Installs packages from PyPI

•Get it:

•http://pypi.python.org/pypi/setuptools

•Ships with ActivePython

•It really is easy:

easy_install twitter

easy_install nltk

easy_install networkx

Page 12: Unleashing twitter data for fun and insight

Git It?

•http://github.com/ptwobrussell/Mining-the-Social-Web

•git clone git://github.com/ptwobrussell/Mining-the-Social-Web.git

•introduction__*.py

•friends_followers__*.py

•the_tweet__*.py

Page 13: Unleashing twitter data for fun and insight


Getting Data

Page 14: Unleashing twitter data for fun and insight

Twitter Data Sources

•Twitter API Resources

•GNIP

•Infochimps

•Library of Congress

Page 15: Unleashing twitter data for fun and insight

Trending Topics

>>> import twitter # Remember to "easy_install twitter"
>>> twitter_search = twitter.Twitter(domain="search.twitter.com")
>>> trends = twitter_search.trends()
>>> [ trend['name'] for trend in trends['trends'] ]
[u'#ZodiacFacts', u'#nowplaying', u'#ItsOverWhen', u'#Christoferdrew', u'Justin Bieber', u'#WhatwouldItBeLike', u'#Sagittarius', u'SNL', u'#SurveySays', u'#iDoit2']

Page 16: Unleashing twitter data for fun and insight

Search Results

>>> search_results = []
>>> for page in range(1,6):
...     search_results.append(twitter_search.search(q="SNL", rpp=100, page=page))

Page 17: Unleashing twitter data for fun and insight

Search Results (continued)

>>> import json
>>> print json.dumps(search_results, sort_keys=True, indent=1)
[
 {
  "completed_in": 0.088122000000000006,
  "max_id": 11966285265,
  "next_page": "?page=2&max_id=11966285265&rpp=100&q=SNL",
  "page": 1,
  "query": "SNL",
  "refresh_url": "?since_id=11966285265&q=SNL",
  ...more...

Page 18: Unleashing twitter data for fun and insight

Search Results (continued)

  "results": [
   {
    "created_at": "Sun, 11 Apr 2010 01:34:52 +0000",
    "from_user": "bieber_luv2",
    "from_user_id": 106998169,
    "geo": null,
    "id": 11966285265,
    "iso_language_code": "en",
    "metadata": {
     "result_type": "recent"
    },
    ...more...

Page 19: Unleashing twitter data for fun and insight

Search Results (continued)

    "profile_image_url": "http://a1.twimg.com/profile_images/80...",
    "source": "<a href="http://twitter.com/&quo...",
    "text": "im nt gonna go to sleep happy unless i see ...",
    "to_user_id": null
   }
   ... output truncated - 99 more tweets ...
  ],
  "results_per_page": 100,
  "since_id": 0
 },
 ... output truncated - 4 more pages ...
]

Page 20: Unleashing twitter data for fun and insight

Lexical Diversity

•Ratio of unique terms to total terms

•A measure of "stickiness"?

•A measure of "group think"?

•A crude indicator of retweets to originally authored tweets?

Page 21: Unleashing twitter data for fun and insight

Distilling Tweet Text

>>> # search_results is already defined
>>> tweets = [ r['text'] \
...     for result in search_results \
...         for r in result['results'] ]
>>> words = []
>>> for t in tweets:
...     words += [ w for w in t.split() ]
...

Page 22: Unleashing twitter data for fun and insight


Analyzing Data

Page 23: Unleashing twitter data for fun and insight

Lexical Diversity

>>> len(words)
7238
>>> # unique words
>>> len(set(words))
1636
>>> # lexical diversity
>>> 1.0*len(set(words))/len(words)
0.22602928985907708
>>> # average number of words per tweet
>>> 1.0*sum([ len(t.split()) for t in tweets ])/len(tweets)
14.476000000000001
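The two ratios on this slide can be wrapped in small helper functions. The following is a minimal sketch in Python 3 (the slides use Python 2); the function names and the sample tweets are made up for illustration and are not from the talk.

```python
def lexical_diversity(tokens):
    """Return the ratio of unique tokens to total tokens (0.0 if empty)."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def average_words_per_tweet(tweets):
    """Return the mean number of whitespace-separated words per tweet."""
    if not tweets:
        return 0.0
    return sum(len(t.split()) for t in tweets) / len(tweets)


# Made-up sample tweets, not data from the talk
sample_tweets = ["snl is on tonight", "rt snl is on", "watching snl tonight"]
sample_words = [w for t in sample_tweets for w in t.split()]

print(lexical_diversity(sample_words))         # 6 unique words out of 11
print(average_words_per_tweet(sample_tweets))  # 11 words over 3 tweets
```

A value near 1.0 means almost every word is distinct; a low value, like the 0.226 reported on the slide, suggests heavy repetition across tweets.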

Page 24: Unleashing twitter data for fun and insight

Size Frequency Matters

•Counting: always the first step

•Simple but effective

•NLTK saves us a little trouble

Page 25: Unleashing twitter data for fun and insight

Frequency Analysis

>>> import nltk
>>> freq_dist = nltk.FreqDist(words)
>>> freq_dist.keys()[:50] # 50 most frequent tokens

[u'snl', u'on', u'rt', u'is', u'to', u'i', u'watch', u'justin', u'@justinbieber', u'be', u'the', u'tonight', u'gonna', u'at', u'in', u'bieber', u'and', u'you', u'watching', u'tina', u'for', u'a', u'wait', u'fey', u'of', u'@justinbieber:', u'if', u'with', u'so', u"can't", u'who', u'great', u'it', u'going', u'im', u':)', u'snl...', u'2nite...', u'are', u'cant', u'dress', u'rehearsal', u'see', u'that', u'what', u'but', u'tonight!', u':d', u'2', u'will']
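If NLTK is not available, a similar frequency count can be done with the standard library. This is a sketch in Python 3 that swaps in collections.Counter for nltk.FreqDist; the lowercasing step is an assumption, made because the tokens shown above appear in lowercase, and the sample words are invented.

```python
from collections import Counter


def most_frequent_tokens(words, n=50):
    """Return the n most common tokens as (token, count) pairs."""
    # Lowercasing is an assumption; the slide does not show where it happens
    counts = Counter(w.lower() for w in words)
    return counts.most_common(n)


# Made-up sample tokens, not data from the talk
sample_words = "SNL on RT snl is on snl tonight".split()
print(most_frequent_tokens(sample_words, 2))
```

Counter.most_common returns tokens with their counts, which makes it easy to see how steeply the frequencies fall off.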

Page 26: Unleashing twitter data for fun and insight

Frequency Visualization

Page 27: Unleashing twitter data for fun and insight

Tweet and RT were sitting on a fence. Tweet fell off. Who was left?

Page 28: Unleashing twitter data for fun and insight

RTs: past, present, & future

•Retweet: Tweeting a tweet that's already been tweeted

•RT or via followed by @mention

•Example: RT @SocialWebMining Justin Bieber is on SNL 2nite. w00t?!?

•Relatively new APIs were rolled out last year for retweeting sans conventions

Page 29: Unleashing twitter data for fun and insight

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. -- Jamie Zawinski

Page 30: Unleashing twitter data for fun and insight

Parsing Retweets

>>> example_tweets = ["Visualize Twitter search results w/ this simple script http://bit.ly/cBu0l4 - Gist instructions http://bit.ly/9SZ2kb (via @SocialWebMining @ptwobrussell)"]
>>> import re
>>> rt_patterns = re.compile(r"(RT|via)((?:\b\W*@\w+)+)", \
...     re.IGNORECASE)
>>> rt_origins = []
>>> for t in example_tweets:
...     try:
...         rt_origins += [mention.strip() \
...             for mention in rt_patterns.findall(t)[0][1].split()]
...     except IndexError, e:
...         pass
...
>>> [rto.strip("@") for rto in rt_origins]
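The same pattern can be packaged as a reusable function. This is a sketch in Python 3 rather than the slide's Python 2; the function name is hypothetical, and unlike the slide's code it looks at every RT/via match in a tweet instead of only the first one.

```python
import re

# Same regular expression as on the slide
RT_PATTERN = re.compile(r"(RT|via)((?:\b\W*@\w+)+)", re.IGNORECASE)


def retweet_origins(tweet_text):
    """Return the screen names credited after 'RT' or 'via', without '@'."""
    origins = []
    for _marker, mentions in RT_PATTERN.findall(tweet_text):
        # Pull each @name out of the matched run of mentions
        origins += re.findall(r"@(\w+)", mentions)
    return origins


print(retweet_origins(
    "Visualize Twitter search results w/ this simple script "
    "http://bit.ly/cBu0l4 - Gist instructions http://bit.ly/9SZ2kb "
    "(via @SocialWebMining @ptwobrussell)"))
print(retweet_origins(
    "RT @SocialWebMining Justin Bieber is on SNL 2nite. w00t?!?"))
```

Tweets that follow neither convention simply yield an empty list, so no exception handling is needed.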

Page 31: Unleashing twitter data for fun and insight


Visualizing Data

Page 32: Unleashing twitter data for fun and insight

Graph Construction

>>> import networkx as nx
>>> g = nx.DiGraph()
>>> g.add_edge("@SocialWebMining", "@ptwobrussell", \
...     {"tweet_id" : 4815162342},)

Page 33: Unleashing twitter data for fun and insight

Writing out DOT

OUT_FILE = "out_file.dot"

try:
    nx.drawing.write_dot(g, OUT_FILE)
except ImportError, e:
    # Fallback: build the DOT text by hand if write_dot cannot be imported
    dot = ['"%s" -> "%s" [tweet_id=%s]' % \
           (n1, n2, g[n1][n2]['tweet_id']) for n1, n2 in g.edges()]

    f = open(OUT_FILE, 'w')
    f.write('strict digraph {\n%s\n}' % (';\n'.join(dot),))
    f.close()
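The hand-written fallback needs nothing beyond string formatting. The sketch below, in Python 3, isolates that step as a function over plain tuples; the function name and the tuple layout are assumptions for illustration, not part of NetworkX.

```python
def edges_to_dot(edges):
    """Build DOT text from (source, target, tweet_id) tuples."""
    lines = ['"%s" -> "%s" [tweet_id=%s]' % (src, dst, tweet_id)
             for src, dst, tweet_id in edges]
    return "strict digraph {\n%s\n}" % ";\n".join(lines)


# The single edge from the Graph Construction slide
edges = [("@SocialWebMining", "@ptwobrussell", 4815162342)]
dot_text = edges_to_dot(edges)
print(dot_text)
```

Writing dot_text to a file gives input that Graphviz can render.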

Page 34: Unleashing twitter data for fun and insight

Example DOT Language

strict digraph {
    "@ericastolte" -> "bonitasworld" [tweet_id=11965974697];
    "@mpcoelho" -> "Lil_Amaral" [tweet_id=11965954427];
    "@BieberBelle123" -> "BELIEBE4EVER" [tweet_id=11966261062];
    "@BieberBelle123" -> "sabrina9451" [tweet_id=11966197327];
}

Page 35: Unleashing twitter data for fun and insight

DOT to Image

•Download Graphviz: http://www.graphviz.org/

•$ dot -Tpng out_file.dot > graph.png

•Windows users might prefer GVEdit

Page 36: Unleashing twitter data for fun and insight

Graphviz: Extreme Closeup

Page 37: Unleashing twitter data for fun and insight

But you want more sexy?

Page 38: Unleashing twitter data for fun and insight

Mining the Social Web

Protovis: Extreme Closeup


Page 39: Unleashing twitter data for fun and insight

It Doesn't Have To Be a Graph

Graph Connectedness

Page 40: Unleashing twitter data for fun and insight


Part 2: Friends, Followers, and Setwise Operations

Page 41: Unleashing twitter data for fun and insight

Insight Matters

•What is my potential influence?

•Who are the most popular people in my network?

•Who are my mutual friends?

•What common friends/followers do I have with @user?

•Who is not following me back?

•What can I learn from analyzing my friendship cliques?

Page 42: Unleashing twitter data for fun and insight


Getting Data

Page 43: Unleashing twitter data for fun and insight

OAuth (1.0a)

import twitter
from twitter.oauth_dance import oauth_dance

# Get these from http://dev.twitter.com/apps/new
consumer_key, consumer_secret = 'key', 'secret'

(oauth_token, oauth_token_secret) = oauth_dance('MiningTheSocialWeb',
    consumer_key, consumer_secret)

auth = twitter.oauth.OAuth(oauth_token, oauth_token_secret,
    consumer_key, consumer_secret)

t = twitter.Twitter(domain='api.twitter.com', auth=auth)

Page 44: Unleashing twitter data for fun and insight

Getting Friendship Data

friend_ids = t.friends.ids(screen_name='timoreilly', cursor=-1)
follower_ids = t.followers.ids(screen_name='timoreilly', cursor=-1)

# store the data somewhere...

Page 45: Unleashing twitter data for fun and insight

Perspective: Fetching all of Lady Gaga's ~7M followers would take ~4 hours
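The 4-hour figure can be reproduced with simple arithmetic. This Python 3 sketch assumes each request returns up to 5,000 ids (a page size the slides do not state) and uses the 350 requests/hr limit for authenticated requests; the function name is made up.

```python
import math


def hours_to_fetch(total_ids, ids_per_request=5000, requests_per_hour=350):
    """Estimate the hours needed to page through total_ids under a rate limit."""
    # ids_per_request=5000 is an assumption, not a number from the slides
    requests_needed = math.ceil(total_ids / ids_per_request)
    return requests_needed / requests_per_hour


# ~7M followers -> 1,400 requests -> 4 hours at 350 requests/hr
print(hours_to_fetch(7_000_000))
```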

Page 46: Unleashing twitter data for fun and insight

But there's always a catch...

Page 47: Unleashing twitter data for fun and insight

Rate Limits

•350 requests/hr for authenticated requests

•150 requests/hr for anonymous requests

•Coping mechanisms:

•Caching & Archiving Data

•Streaming API

•HTTP 400 codes

•See http://dev.twitter.com/pages/rate-limiting

Page 48: Unleashing twitter data for fun and insight

The Beloved Fail Whale

•Twitter is sometimes "overcapacity"

•HTTP 503 Error

•Handle it just as any other HTTP error

•RESTfulness has its advantages
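Treating the Fail Whale like any other HTTP error usually means waiting and retrying. The following Python 3 sketch shows one possible wrapper; HTTPStatusError and call_with_retries are hypothetical names for illustration, not part of the twitter package, and the doubling delay is one design choice among many.

```python
import time


class HTTPStatusError(Exception):
    """Hypothetical error type that carries an HTTP status code."""

    def __init__(self, status):
        super().__init__("HTTP %d" % status)
        self.status = status


def call_with_retries(request, max_retries=3, wait_period=2,
                      retryable=(503,), sleep=time.sleep):
    """Call request(); after a retryable error, wait and retry with a doubled delay."""
    for attempt in range(max_retries + 1):
        try:
            return request()
        except HTTPStatusError as error:
            # Give up on non-retryable errors or when retries are exhausted
            if error.status not in retryable or attempt == max_retries:
                raise
            sleep(wait_period)
            wait_period *= 2


# Simulated request: "over capacity" twice, then a successful response
attempts = {"count": 0}


def flaky_request():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise HTTPStatusError(503)
    return {"ids": [1, 2, 3]}


waits = []  # record the delays instead of really sleeping
result = call_with_retries(flaky_request, sleep=waits.append)
print(result, waits)
```

Passing the sleep function as a parameter keeps the wrapper easy to test without real delays.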

Page 49: Unleashing twitter data for fun and insight

Abstraction Helps

friend_ids = []
wait_period = 2 # secs
cursor = -1

while cursor != 0:
    response = makeTwitterRequest(t, # twitter.Twitter instance
                                  t.friends.ids,
                                  screen_name=screen_name,
                                  cursor=cursor)

    friend_ids += response['ids']
    cursor = response['next_cursor']

    # break out of loop early if you don't need all ids
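The cursor loop can be exercised without touching the network by passing in a function that returns one page per cursor. This is a Python 3 sketch; collect_ids, fake_fetch, and the canned pages are illustrative stand-ins, not the talk's makeTwitterRequest.

```python
def collect_ids(fetch_page, max_ids=None):
    """Follow next_cursor values until the API signals the last page with 0."""
    ids = []
    cursor = -1  # -1 asks for the first page
    while cursor != 0:
        response = fetch_page(cursor)
        ids += response["ids"]
        cursor = response["next_cursor"]
        # Break out early if the caller does not need every id
        if max_ids is not None and len(ids) >= max_ids:
            break
    return ids


# Canned two-page response keyed by cursor (made-up data)
PAGES = {
    -1: {"ids": [1, 2], "next_cursor": 10},
    10: {"ids": [3, 4], "next_cursor": 0},
}


def fake_fetch(cursor):
    return PAGES[cursor]


print(collect_ids(fake_fetch))
print(collect_ids(fake_fetch, max_ids=2))
```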

Page 50: Unleashing twitter data for fun and insight

Abstracting Abstractions

screen_name = 'timoreilly'

# This is what you ultimately want...

friend_ids = getFriends(screen_name)
follower_ids = getFollowers(screen_name)

Page 51: Unleashing twitter data for fun and insight


Storing Data

Page 52: Unleashing twitter data for fun and insight

Flat Files?

./
    screen_name1/
        friend_ids.json
        follower_ids.json
        user_info.json
    screen_name2/
        ...
    ...

Page 53: Unleashing twitter data for fun and insight

Pickles?

import cPickle

o = {
    'friend_ids' : friend_ids,
    'follower_ids' : follower_ids,
    'user_info' : user_info
}

f = open('screen_name1.pickle', 'wb')
cPickle.dump(o, f)
f.close()

Page 54: Unleashing twitter data for fun and insight

A relational database?

import sqlite3 as sqlite

conn = sqlite.connect('data.db')
c = conn.cursor()

c.execute('''create table friends...''')

c.execute('''insert into friends... ''')

# Lots of fun...sigh...
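For a sense of what the relational route involves, here is a small Python 3 sketch using an in-memory SQLite database. The table layout is a hypothetical schema chosen for illustration; the slide elides the actual statements, and the ids are made up.

```python
import sqlite3

# In-memory database so the sketch leaves no file behind
conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row per (user, friend) pair
conn.execute(
    "create table friends ("
    " screen_name text, friend_id integer,"
    " primary key (screen_name, friend_id))"
)

friend_ids = [101, 102, 103]  # made-up ids
conn.executemany(
    "insert into friends values (?, ?)",
    [("timoreilly", i) for i in friend_ids],
)
conn.commit()

rows = conn.execute(
    "select friend_id from friends where screen_name = ? order by friend_id",
    ("timoreilly",),
).fetchall()
stored_ids = [row[0] for row in rows]
print(stored_ids)
conn.close()
```

Set operations such as "friends who are not followers" then become joins or EXCEPT queries, which is part of why the talk moves on to a store with native sets.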

Page 55: Unleashing twitter data for fun and insight

import redis

r = redis.Redis()

[ r.sadd("timoreilly$friend_ids", i) for i in friend_ids ]

r.smembers("timoreilly$friend_ids") # returns a set

Redis (A Data Structures Server)

Windows binary: http://code.google.com/p/servicestack/wiki/RedisWindowsDownload

Project page: http://redis.io

Page 56: Unleashing twitter data for fun and insight

Redis Set Operations

•Key/value store...on typed values!

•Common set operations

•smembers, scard

•sinter, sdiff, sunion

•sadd, srem, etc.

•See http://code.google.com/p/redis/wiki/CommandReference

•Don't forget to $ easy_install redis

Page 57: Unleashing twitter data for fun and insight


Analyzing Data

Page 58: Unleashing twitter data for fun and insight

Setwise Operations

•Union

•Intersection

•Difference

•Complement
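Python's built-in set type supports the same operations, which is handy for trying them out before moving to Redis. A minimal Python 3 sketch with made-up ids:

```python
# Made-up user ids for illustration
friend_ids = {1, 2, 3, 4}
follower_ids = {3, 4, 5}

everyone = friend_ids | follower_ids            # union
mutual_friends = friend_ids & follower_ids      # intersection
not_following_back = friend_ids - follower_ids  # difference
not_followed_back = follower_ids - friend_ids   # difference, other direction

print(everyone, mutual_friends, not_following_back, not_followed_back)
```

A complement is taken relative to some universe of ids, so in practice it is expressed as a difference, for example everyone - friend_ids.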

Page 59: Unleashing twitter data for fun and insight

Venn Diagrams

[Venn diagram: two overlapping circles, Friends and Followers. Regions: Friends - Followers, Friends ∩ Followers, Followers - Friends.]

Page 60: Unleashing twitter data for fun and insight

Count Your Blessings

# A utility function
def getRedisIdByScreenName(screen_name, key_name):
    return 'screen_name$' + screen_name + '$' + key_name

# Number of friends
n_friends = r.scard(getRedisIdByScreenName(screen_name, 'friend_ids'))

# Number of followers
n_followers = r.scard(getRedisIdByScreenName(screen_name, 'follower_ids'))

Page 61: Unleashing twitter data for fun and insight

Asymmetric Relationships

# Friends who aren't following back
friends_diff_followers = r.sdiffstore('temp',
    [ getRedisIdByScreenName(screen_name, 'friend_ids'),
      getRedisIdByScreenName(screen_name, 'follower_ids') ])

# ... compute interesting things ...

r.delete('temp')

Page 62: Unleashing twitter data for fun and insight

Asymmetric Relationships

# Followers who aren't friended
followers_diff_friends = r.sdiffstore('temp',
    [ getRedisIdByScreenName(screen_name, 'follower_ids'),
      getRedisIdByScreenName(screen_name, 'friend_ids') ])

# ... compute interesting things ...

r.delete('temp')

Page 63: Unleashing twitter data for fun and insight

Symmetric Relationships

mutual_friends = r.sinterstore('temp',
    [ getRedisIdByScreenName(screen_name, 'follower_ids'),
      getRedisIdByScreenName(screen_name, 'friend_ids') ])

# ... compute interesting things ...

r.delete('temp')

Page 64: Unleashing twitter data for fun and insight

Sample Output

timoreilly is following 663

timoreilly is being followed by 1,423,704

131 of 663 are not following timoreilly back

1,423,172 of 1,423,704 are not being followed back by timoreilly

timoreilly has 532 mutual friends

Page 65: Unleashing twitter data for fun and insight

Who Isn't Following Back?

user_ids = [ ... ] # Resolve these to user info objects

while len(user_ids) > 0:
    # Look up users in batches of 100 ids
    user_ids_str = ','.join([ str(i) for i in user_ids[:100] ])
    user_ids = user_ids[100:]

    response = t.users.lookup(user_id=user_ids_str)

    if type(response) is dict:
        response = [response]

    # Index each user info object by user id...
    r.mset(dict([(getRedisIdByUserId(resp['id'], 'info.json'), json.dumps(resp))
                 for resp in response]))

    # ...and by screen name
    r.mset(dict([(getRedisIdByScreenName(resp['screen_name'], 'info.json'), json.dumps(resp))
                 for resp in response]))

Page 66: Unleashing twitter data for fun and insight

Friends in Common

# Assume we've harvested friends/followers and it's in Redis...
screen_names = ['timoreilly', 'mikeloukides']

r.sinterstore('temp$friends_in_common',
    [getRedisIdByScreenName(screen_name, 'friend_ids')
     for screen_name in screen_names])

r.sinterstore('temp$followers_in_common',
    [getRedisIdByScreenName(screen_name, 'follower_ids')
     for screen_name in screen_names])

# Manipulate the sets

Page 67: Unleashing twitter data for fun and insight

Potential Influence

•My followers?

•My followers' followers?

•My followers' followers' followers?

•for n in range(1, 7): # 6 degrees?
      print "My " + "followers' "*n + "followers?"

Page 68: Unleashing twitter data for fun and insight

Saving a Thousand Words...

[Figure: a tree with nodes numbered 1-15 in four rows (1; 2 3; 4 5 6 7; 8 through 15). Depth = 3, branching factor = 2.]

Page 69: Unleashing twitter data for fun and insight

Same Data, Different Layout

[Figure: the same tree of nodes 1-15, drawn in a different layout.]

Page 70: Unleashing twitter data for fun and insight

Space Complexity

Number of nodes by branching factor (rows) and depth (columns):

Branching              Depth
Factor       1     2     3     4     5
2            3     7    15    31    63
3            4    13    40   121   364
4            5    21    85   341  1365
5            6    31   156   781  3906
6            7    43   259  1555  9331
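Each entry in the table is the number of nodes in a full tree with the given branching factor b and depth d, that is 1 + b + b^2 + ... + b^d. A short Python 3 sketch (the function name is made up) that recomputes the table:

```python
def nodes_in_tree(branching_factor, depth):
    """Count nodes in a full tree: the sum of b**level for level 0..depth."""
    return sum(branching_factor ** level for level in range(depth + 1))


# Recompute the table: one row per branching factor, depths 1 through 5
for b in range(2, 7):
    print(b, [nodes_in_tree(b, d) for d in range(1, 6)])
```

The growth is geometric, so each extra hop into a follower network multiplies the amount of data to fetch and store.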

Page 71: Unleashing twitter data for fun and insight

Breadth-First Traversal

Create an empty graph
Create an empty queue to keep track of unprocessed nodes

Add the starting point to the graph as the "root node"
Add the root node to a queue for processing

Repeat until some maximum depth is reached or the queue is empty:
    Remove a node from queue
    For each of the node's neighbors:
        If the neighbor hasn't already been processed:
            Add it to the graph
            Add it to the queue
            Add an edge to the graph connecting the node & its neighbor
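The traversal above can be sketched in Python 3 over any function that returns a node's neighbors. The names here are illustrative, the follower data is invented, and the sketch reads the pseudocode as adding an edge only when a neighbor is first discovered, which is one interpretation of the steps.

```python
from collections import deque


def breadth_first_graph(root, neighbors_of, max_depth):
    """Return (nodes, edges) discovered within max_depth hops of root."""
    nodes = {root}
    edges = []
    queue = deque([(root, 0)])  # unprocessed nodes with their depth
    while queue:
        node, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not expand beyond the maximum depth
        for neighbor in neighbors_of(node):
            if neighbor not in nodes:
                nodes.add(neighbor)
                queue.append((neighbor, depth + 1))
                edges.append((node, neighbor))
    return nodes, edges


# Made-up follower lists standing in for API calls
FOLLOWERS = {"a": ["b", "c"], "b": ["c", "d"], "c": [], "d": ["e"]}

nodes, edges = breadth_first_graph("a", lambda n: FOLLOWERS.get(n, []), 2)
print(sorted(nodes), edges)
```

Tracking the depth alongside each queued node makes the stopping rule explicit; "e" is three hops away and is therefore never reached.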

Page 72: Unleashing twitter data for fun and insight

Breadth-First Harvest

next_queue = [ 'timoreilly' ] # seed node
d = 1

while d < depth:
    d += 1
    queue, next_queue = next_queue, []
    for screen_name in queue:
        follower_ids = getFollowers(screen_name=screen_name)
        next_queue += follower_ids
        getUserInfo(user_ids=next_queue)

Page 73: Unleashing twitter data for fun and insight

The Most Popular Followers

freqs = {}
for follower in followers:
    cnt = follower['followers_count']
    if not freqs.has_key(cnt):
        freqs[cnt] = []

    freqs[cnt].append({'screen_name': follower['screen_name'],
                       'user_id': follower['id']})

# The 100 largest follower counts; look them up in freqs to get the users
popular_followers = sorted(freqs, reverse=True)[:100]

Page 74: Unleashing twitter data for fun and insight

Average # of Followers

# One entry per follower: repeat each count once for every user who has it
all_freqs = [k for k in freqs for user in freqs[k]]
avg = sum(all_freqs) / len(all_freqs)

Page 75: Unleashing twitter data for fun and insight

@timoreilly's Popular Followers

The top 10 followers from the sample:

aplusk            4,993,072
BarackObama       4,114,901
mashable          2,014,615
MarthaStewart     1,932,321
Schwarzenegger    1,705,177
zappos            1,689,289
Veronica          1,612,827
jack              1,592,004
stephenfry        1,531,813
davos             1,522,621

Page 76: Unleashing twitter data for fun and insight

Futzing the Numbers

•The average number of timoreilly's followers' followers: 445

•Discarding the top 10 lowers the average to around 300

•Discarding any follower with fewer than 10 followers of their own increases the average to over 1,000!

•Doing both brings the average to around 800

Page 77: Unleashing twitter data for fun and insight

The Right Tool For the Job: NetworkX for Networks

Page 78: Unleashing twitter data for fun and insight

Friendship Graphs

for i in ids: # ids is timoreilly's id along with friend ids
    info = json.loads(r.get(getRedisIdByUserId(i, 'info.json')))
    screen_name = info['screen_name']
    friend_ids = list(r.smembers(getRedisIdByScreenName(screen_name, 'friend_ids')))
    for friend_id in [fid for fid in friend_ids if fid in ids]:
        friend_info = json.loads(r.get(getRedisIdByUserId(friend_id, 'info.json')))
        g.add_edge(screen_name, friend_info['screen_name'])

nx.write_gpickle(g, 'timoreilly.gpickle') # see also nx.read_gpickle

Page 79: Unleashing twitter data for fun and insight

Clique Analysis

•Cliques

•Maximum Cliques

•Maximal Cliques

http://en.wikipedia.org/wiki/Clique_problem
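The difference between maximal and maximum cliques is easiest to see on a tiny graph. The sketch below, in Python 3, uses the basic Bron-Kerbosch recursion as a stand-in enumeration method; the function name and the graph are made up for illustration.

```python
def maximal_cliques(adjacency):
    """Yield every maximal clique (basic Bron-Kerbosch, no pivoting)."""
    def expand(clique, candidates, excluded):
        if not candidates and not excluded:
            yield sorted(clique)  # nothing can extend it: maximal
            return
        for v in list(candidates):
            yield from expand(clique | {v},
                              candidates & adjacency[v],
                              excluded & adjacency[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    yield from expand(set(), set(adjacency), set())


# Made-up undirected graph: a triangle (a, b, c) plus a tail c-d-e
adjacency = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}

cliques = sorted(maximal_cliques(adjacency))
max_size = max(len(c) for c in cliques)
maximum_cliques = [c for c in cliques if len(c) == max_size]
print(cliques, maximum_cliques)
```

Every clique printed is maximal because no node can be added to it, but only the triangle is a maximum clique because it is the largest.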

Page 80: Unleashing twitter data for fun and insight

Calculating Cliques

cliques = [c for c in nx.find_cliques(g)]

num_cliques = len(cliques)
clique_sizes = [len(c) for c in cliques]

max_clique_size = max(clique_sizes)
avg_clique_size = sum(clique_sizes) / num_cliques
max_cliques = [c for c in cliques if len(c) == max_clique_size]
num_max_cliques = len(max_cliques)

people_in_every_max_clique = list(reduce(
    lambda x, y: x.intersection(y), [set(c) for c in max_cliques]))

Page 81: Unleashing twitter data for fun and insight

Cliques for @timoreilly

Num cliques: 762573
Avg clique size: 14
Max clique size: 26
Num max cliques: 6
Num people in every max clique: 20

Page 82: Unleashing twitter data for fun and insight


Visualizing Data

Page 83: Unleashing twitter data for fun and insight

Graphs, etc

•Your first instinct is naturally G = (V, E) ?

Page 84: Unleashing twitter data for fun and insight

Dorling Cartogram

•A location-aware bubble chart (ish)

•At least 3-dimensional

•Position, color, size

•Look at friends/followers by state

Page 85: Unleashing twitter data for fun and insight

Sunburst of Friends

•A very compact visualization

•Slice and dice friends/followers by gender, country, locale, etc.

Page 86: Unleashing twitter data for fun and insight


Part 3: The Tweet, the Whole Tweet, and Nothing but the Tweet

Page 87: Unleashing twitter data for fun and insight

Insight Matters

•Which entities frequently appear in @user's tweets?

•How often does @user talk about specific friends?

•Who does @user retweet most frequently?

•How frequently is @user retweeted (by anyone)?

•How many #hashtags are usually in @user's tweets?

Page 88: Unleashing twitter data for fun and insight

Pen : Sword :: Tweet : Machine Gun (?!?)

Page 89: Unleashing twitter data for fun and insight

Mining the Social Web

Getting Data

Page 90: Unleashing twitter data for fun and insight

Let me count the APIs...

•Timelines

•Tweets

•Favorites

•Direct Messages

•Streams

Page 91: Unleashing twitter data for fun and insight

Anatomy of a Tweet (1/2)

{
  "created_at" : "Thu Jun 24 14:21:11 +0000 2010",
  "id" : 16932571217,
  "text" : "Great idea from @crowdflower: Crowdsourcing ... #opengov",
  "user" : {
    "description" : "Founder and CEO, O'Reilly Media. Watching the alpha geeks...",
    "id" : 2384071,
    "location" : "Sebastopol, CA",
    "name" : "Tim O'Reilly",
    "screen_name" : "timoreilly",
    "url" : "http://radar.oreilly.com"
  },

...

Page 92: Unleashing twitter data for fun and insight

Anatomy of a Tweet (2/2)

...

  "entities" : {
    "hashtags" : [
      {"indices" : [ 97, 103 ], "text" : "gov20"},
      {"indices" : [ 104, 112 ], "text" : "opengov"}
    ],
    "urls" : [
      {"expanded_url" : null, "indices" : [ 76, 96 ], "url" : "http://bit.ly/9o4uoG"}
    ],
    "user_mentions" : [
      {"id" : 28165790, "indices" : [ 16, 28 ], "name" : "crowdFlower", "screen_name" : "crowdFlower"}
    ]
  }
}

Page 93: Unleashing twitter data for fun and insight

Entities & Annotations

•Entities

•Opt-in now but will "soon" be standard

• $ easy_install twitter_text

•Annotations

•User-defined metadata

•See http://dev.twitter.com/pages/annotations_overview

Page 94: Unleashing twitter data for fun and insight

Manual Entity Extraction

import twitter_text

extractor = twitter_text.Extractor(tweet['text'])

mentions = extractor.extract_mentioned_screen_names_with_indices()
hashtags = extractor.extract_hashtags_with_indices()
urls = extractor.extract_urls_with_indices()

# Splice info into a tweet object

Page 95: Unleashing twitter data for fun and insight

Mining the Social Web

Storing Data

Page 96: Unleashing twitter data for fun and insight

•Flat files? (Really, who does that?)

•A relational database?

•Redis?

•CouchDB (Relax...?)

Storing Tweets

Page 97: Unleashing twitter data for fun and insight

CouchDB: Relax

•Document-oriented key/value

•Map/Reduce

•RESTful API

•Erlang

Page 98: Unleashing twitter data for fun and insight

As easy as sitting on the couch

•Get it - http://www.couchone.com/get

• Install it

•Relax - http://localhost:5984/_utils/

•Also - $ easy_install couchdb

Page 99: Unleashing twitter data for fun and insight

Storing Timeline Data

import couchdb
import twitter

TIMELINE_NAME = "user" # or "home" or "public"

t = twitter.Twitter(domain='api.twitter.com', api_version='1')

# DB (the database name) and MAX_PAGES are assumed to be defined elsewhere
server = couchdb.Server('http://localhost:5984')
db = server.create(DB)

page_num = 1
while page_num <= MAX_PAGES:
    api_call = getattr(t.statuses, TIMELINE_NAME + '_timeline')
    tweets = makeTwitterRequest(t, api_call, page=page_num)
    db.update(tweets, all_or_nothing=True)
    print 'Fetched %i tweets' % len(tweets)
    page_num += 1

Page 100: Unleashing twitter data for fun and insight

Mining the Social Web

Analyzing & Visualizing Data

Page 101: Unleashing twitter data for fun and insight

Approach: Map/Reduce on Tweets

Page 102: Unleashing twitter data for fun and insight

Map/Reduce Paradigm

•Mapper: yields key/value pairs

•Reducer: operates on keyed mapper output

•Example: Computing the sum of squares

•Mapper Input: (k, [2,4,6])

•Mapper Output: (k, [4,16,36])

•Reducer Input: [(k, 4,16), (k, 36)]

•Reducer Output: 56
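The sum-of-squares example can be traced in a few lines of Python 3. The mapper and reducer names are illustrative only, and the split of the mapper output into two groups mirrors the reducer input shown above; it is not how any particular framework is guaranteed to partition data.

```python
from functools import reduce


def mapper(key, values):
    """Emit the key with each value squared."""
    return key, [v * v for v in values]


def reducer(key, values):
    """Sum the values that share a key."""
    return reduce(lambda a, b: a + b, values)


key, squares = mapper("k", [2, 4, 6])  # ("k", [4, 16, 36])

# The reducer may see the mapped values in separate groups...
partial_sums = [reducer(key, squares[:2]), reducer(key, squares[2:])]

# ...and is then applied again to combine the partial results
total = reducer(key, partial_sums)
print(squares, partial_sums, total)
```

Because addition is associative, reducing partial groups and then reducing the partial sums gives the same answer as a single pass.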

Page 103: Unleashing twitter data for fun and insight

Which entities frequently appear in @mention's tweets?

Page 104: Unleashing twitter data for fun and insight

@timoreilly's Tweet Entities

Page 105: Unleashing twitter data for fun and insight

How often does @timoreilly mention specific friends?

Page 106: Unleashing twitter data for fun and insight

Filtering Tweet Entities

•Let's find out how often someone talks about specific friends

•We have friend info on hand

•We've extracted @mentions from the tweets

•Let's count friend vs non-friend mentions

Page 107: Unleashing twitter data for fun and insight

@timoreilly's friend mentions

Number of @user entities in tweets: 20

Number of @user entities in tweets who are friends: 18
ahier pkedrosky CodeforAmerica nytimes brady carlmalamud pahlkadot make jamesoreilly andrewsavikas gnat slashdot OReillyMedia dalepd mikeloukides monkchips fredwilson digiphile

Number of user entities in tweets who are not friends: 2
n2vip timoreilly

Page 108: Unleashing twitter data for fun and insight

Who does @timoreilly retweet most frequently?

Page 109: Unleashing twitter data for fun and insight

Counting Retweets

•Map @mentions out of tweets using a regex

•Reduce to sum them up

•Sort the results

•Display results

Page 110: Unleashing twitter data for fun and insight

Retweets by @timoreilly

Page 111: Unleashing twitter data for fun and insight

How frequently is @timoreilly retweeted?

Page 112: Unleashing twitter data for fun and insight

Retweet Counts

•An API resource /statuses/retweet_count exists (and is now functional)

•Example: http://twitter.com/statuses/show/29016139807.json

•retweet_count

•retweeted

Page 113: Unleashing twitter data for fun and insight

Survey Says...

@timoreilly is retweeted about 2/3 of the time

Page 114: Unleashing twitter data for fun and insight

How often does @timoreilly include #hashtags in tweets?

Page 115: Unleashing twitter data for fun and insight

Counting Hashtags

•Use a mapper to emit the #hashtag entities for tweets

•Use a reducer to sum them all up

•Been there, done that...
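The mapper and reducer for hashtags can be sketched in plain Python 3 without a database. The tweets below are invented and only mimic the "entities" layout shown earlier in the deck; the function names and the lowercasing of tags are assumptions for illustration.

```python
from collections import Counter


def map_hashtags(tweet):
    """Emit a (hashtag, 1) pair for every hashtag entity in a tweet."""
    for tag in tweet.get("entities", {}).get("hashtags", []):
        yield tag["text"].lower(), 1  # lowercasing is an assumption


def reduce_counts(pairs):
    """Sum the emitted values per key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals


# Made-up tweets in the shape of the "entities" field
tweets = [
    {"entities": {"hashtags": [{"text": "gov20"}, {"text": "opengov"}]}},
    {"entities": {"hashtags": [{"text": "OpenGov"}]}},
    {"entities": {"hashtags": []}},
]

pairs = [pair for tweet in tweets for pair in map_hashtags(tweet)]
hashtag_counts = reduce_counts(pairs)
tweets_with_hashtags = sum(1 for t in tweets if t["entities"]["hashtags"])
print(hashtag_counts, tweets_with_hashtags, len(tweets))
```

Dividing tweets_with_hashtags by the number of tweets gives the kind of "1 out of every N tweets" figure quoted for a user.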

Page 116: Unleashing twitter data for fun and insight

Survey Says...

About 1 out of every 3 tweets by @timoreilly contains #hashtags

Page 117: Unleashing twitter data for fun and insight

Mining the Social Web

But if you order within the next 5 minutes...

Page 118: Unleashing twitter data for fun and insight

Mining the Social Web

Bonus Material:

What do #JustinBieber and #TeaParty have in common?

Page 119: Unleashing twitter data for fun and insight

Tweet Entities

Page 120: Unleashing twitter data for fun and insight

#JustinBieber co-occurrences

#bieberblast #Eclipse #somebodytolove http://bit.ly/aARD4t http://bit.ly/b2Kc1L #Escutando #justinBieber #Restart #TT #Telezwerge @rheinzeitung #WTF

http://tinyurl.com/343kax4 @JustBieberFact @TinselTownDirt #beliebers #BieberFact #Celebrity #Dschungel @_Yassi_ #musicmonday #video #tickets

#music @justinbieber #nowplaying #Justinbieber #JUSTINBIEBER #Proform http://migre.me/TJwj @ProSieben @lojadoaltivo #JustinBieber #justinbieber

Page 121: Unleashing twitter data for fun and insight

#TeaParty co-occurrences

@blogging_tories #cdnpoli #fail #nra #roft @BrnEyeSuss @crispix49 @koopersmith @Kriskxx #Kagan @Liliaep #nvsen @First_Patriots #patriot #pjtv @andilinks @RonPaulNews #ampats #cnn #jews #GOPDeficit #wethepeople #asamom @thenewdeal #AFIRE #Dems @JIDF

@STOPOBAMA2012 @TheFlaCracker #palin2012 #AZ #TopProg #conservative http://tinyurl.com/386k5hh @ResistTyranny #tsot @ALIPAC #majority #NoAmnesty #patriottweets @Drudge_Report #military #palin12 #rnc #TCOT http://tinyurl.com/24h36zq #spwbt @welshman007 #FF #liberty #glennbeck #news #oilspill #rs #Teaparty

#jcot #tweetcongress #Obama #topprog #palin #dems #acon #cspj #immigration #politics #hhrs #TeaParty #vote2010 #libertarian #obama #ucot #iamthemob #GOP #tpp #dnc #twisters #sgp #ocra #gop #tlot #p2 #tcot #teaparty

Page 122: Unleashing twitter data for fun and insight

Hashtag Distributions

Page 123: Unleashing twitter data for fun and insight

Hashtag Analysis

•TeaParty: ~ 5 hashtags per tweet.

•Example: “Rarely is the questioned asked: Is our children learning?” - G.W. Bush #p2 #topprog #tcot #tlot #teaparty #GOP #FF

•JustinBieber: ~ 2 hashtags per tweet

•Example: #justinbieber is so coool

Page 124: Unleashing twitter data for fun and insight

Common #hashtags

#lol #jesus #worldcup #teaparty #AZ #milk #ff #guns #WorldCup #bp #News

#dancing #music #glennbeck @addthis #nowplaying #news #WTF #fail #toomanypeople #oilspill #catholic

Page 125: Unleashing twitter data for fun and insight

Retweet Patterns

Page 126: Unleashing twitter data for fun and insight

Retweet Behaviors

Page 127: Unleashing twitter data for fun and insight

Friendship Networks

Page 128: Unleashing twitter data for fun and insight

Juxtaposing Friendships

•Harvest search results for #JustinBieber and #TeaParty

•Get friend ids for each @mention with /friends/ids

•Resolve screen names with /users/lookup

•Populate a NetworkX graph

•Analyze it

•Visualize with Graphviz

Page 129: Unleashing twitter data for fun and insight

Node Degrees

Page 130: Unleashing twitter data for fun and insight

Two Kinds of Hairballs...

#TeaParty #JustinBieber

Page 131: Unleashing twitter data for fun and insight

The world twitterverse is your oyster

Page 132: Unleashing twitter data for fun and insight

Mining the Social Web

• Twitter : @SocialWebMining

• GitHub: http://bit.ly/socialwebmining

• Facebook: http://facebook.com/MiningTheSocialWeb