Making Sense of Millions of Thoughts: Finding Patterns in the Tweets

Post on 11-Aug-2014


description

I gave this presentation at the Workshop on Interactive Language Learning, Visualization, and Interfaces at ACL 2014 in Baltimore, MD on June 27, 2014. http://nlp.stanford.edu/events/illvi2014/index.html

ABSTRACT: Every day on Twitter, millions of thoughts are captured and shared with the world in the form of 140-character messages, or Tweets. There is much we could learn from these thoughts if we could find a way to digest this gigantic dataset. Visualization is one of the many ways to extract information from these Tweets. In this presentation, I will talk about several visualizations based on Tweets and share experiences and challenges from working with Tweet data.

Transcript of Making Sense of Millions of Thoughts: Finding Patterns in the Tweets

Making Sense of Millions of Thoughts

Finding patterns in the Tweets

“Knowing comes from learning, finding from seeking.” (50 characters)

“What we call chaos is just patterns we haven't recognized.” (58 characters)

“I am looking for a needle in the haystack.” (42 characters)

“140-character text messages, called Tweets” (42 characters)

Krist Wongsuphasawat

X-Men

Prof. X
Ability: Telepathy (mind reading)

Cerebro
Enhances telepathy

With this power…

What are you thinking?

What are people thinking about x?

Product, Event, Person, etc.

Reality

Cerebro

Internet

[Diagram: thoughts flow into a Platform (crowdsourcing, social networks) and become Data]

[Diagram: tweets flow into Twitter and become Tweets]

Tweets

• 140 characters

• text + media

• geo

• time


What can we learn from these Tweets?

visual-insights@twitter

@miguelrios @philogb @trebor @kristw

World Cup

Election

Oscars

Pure Curiosity

Grammy

TV Shows

New Year

Breaking news

Earthquake

Insights, Stories

DATA (Tweets)

with limited time

Audience: general public

Tools

• Hadoop

• Apache Pig

• Vertica

• node.js, python

• d3 & co.

Pig

DATA (Tweets) → Filter → Insights, Stories

Having all Tweets

How people think I feel. How I really feel.

Filter data

Want only relevant Tweets

Good news: Have all Tweets

Bad news: Too many Tweets

Filter data (2)

• #hashtags, e.g. #WorldCup

  • easy to filter

  • hashtags must be present

  • typos?

• keywords, e.g. goal

  • broader

  • can be ambiguous
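The trade-off between hashtag and keyword filters can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used for the visualizations in this talk; the sample Tweets and the helper names `has_hashtag` and `has_keyword` are hypothetical.

```python
import re

# Hypothetical sample Tweets; real data would come from a Tweet archive.
SAMPLE_TWEETS = [
    "What a GOAL by Brazil! #WorldCup",
    "My goal this year is to run a marathon",
    "Watching the match tonight #worldcup",
    "Lunch was great today",
]

HASHTAG_RE = re.compile(r"#(\w+)")


def has_hashtag(text, tag):
    """True if the Tweet contains exactly this hashtag (case-insensitive)."""
    wanted = tag.lstrip("#").lower()
    return any(found.lower() == wanted for found in HASHTAG_RE.findall(text))


def has_keyword(text, keyword):
    """True if the keyword appears as a whole word (case-insensitive)."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None


by_hashtag = [t for t in SAMPLE_TWEETS if has_hashtag(t, "#WorldCup")]
by_keyword = [t for t in SAMPLE_TWEETS if has_keyword(t, "goal")]
```

The hashtag filter is precise but misses Tweets without the tag. The keyword filter is broader but also picks up the marathon Tweet, which is the ambiguity noted above.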

Filter data (3)

• Combine with other attributes

  • Time

    • during the first half of World Cup final

  • Location

    • Tweets from Brazil

    • Not every Tweet is geotagged.
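Combining time and location filters might look like the sketch below. The `Tweet` record, the kickoff time, and the bounding box for Brazil are illustrative assumptions (the box is a rough rectangle, not an exact border), and Tweets without coordinates are simply dropped by the location filter.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple


@dataclass
class Tweet:
    """Hypothetical minimal Tweet record for this example."""
    text: str
    created_at: datetime
    coordinates: Optional[Tuple[float, float]] = None  # (lat, lon); None if not geotagged


def in_time_window(tweet, start, end):
    """True if the Tweet was posted in the half-open window [start, end)."""
    return start <= tweet.created_at < end


def in_bounding_box(tweet, south, west, north, east):
    """True if the Tweet is geotagged inside the box; untagged Tweets never match."""
    if tweet.coordinates is None:
        return False
    lat, lon = tweet.coordinates
    return south <= lat <= north and west <= lon <= east


# Illustrative values only: an assumed kickoff time and a rough box around Brazil.
KICKOFF = datetime(2014, 7, 13, 19, 0, tzinfo=timezone.utc)
FIRST_HALF_END = KICKOFF + timedelta(minutes=45)
BRAZIL_BOX = dict(south=-33.8, west=-74.0, north=5.3, east=-34.8)

tweets = [
    Tweet("Here we go!", KICKOFF + timedelta(minutes=10), (-22.9, -43.2)),
    Tweet("Full time", KICKOFF + timedelta(minutes=120), (-22.9, -43.2)),
    Tweet("No location on this one", KICKOFF + timedelta(minutes=20)),
    Tweet("Watching from Berlin", KICKOFF + timedelta(minutes=30), (52.5, 13.4)),
]

selected = [
    t for t in tweets
    if in_time_window(t, KICKOFF, FIRST_HALF_END) and in_bounding_box(t, **BRAZIL_BOX)
]
```

Only the first Tweet passes both filters; the untagged one is lost even though it was posted during the first half.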

Filter data (4)

• Languages

• Sometimes use only English Tweets

• Future

• Translation?

DATA (Tweets) → Filter → Clean → Insights, Stories

Clean data

• Typo (Mobile input)

• Abbreviation (due to 140-character limit)

• Exaggeration (e.g. GOOOOALLLL)

• Twitter-specific conventions, e.g. old-style retweet “RT …”

• Inappropriate content
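Two of these cleaning steps can be sketched with regular expressions. The function names and the `max_run` parameter are assumptions for illustration; typos, abbreviations, and inappropriate content need heavier tools (dictionaries, classifiers) that are not shown here.

```python
import re


def strip_old_style_retweet(text):
    """Remove a leading old-style retweet prefix such as 'RT @user: '."""
    return re.sub(r"^\s*RT\s+@\w+:?\s*", "", text)


def collapse_repeats(text, max_run=2):
    """Shorten any run of one repeated character to at most max_run characters."""
    pattern = r"(.)\1{%d,}" % max_run
    return re.sub(pattern, lambda m: m.group(1) * max_run, text)


cleaned = collapse_repeats(strip_old_style_retweet("RT @kristw: GOOOOALLLL"))
```

With the default `max_run=2` the result is "GOOALL", which keeps legitimate double letters such as "hello" intact. Setting `max_run=1` yields "GOAL" but would also turn "hello" into "helo", so a dictionary lookup would be needed to choose between the two.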

DATA (Tweets) → Filter → Clean → Visualize → Insights, Stories

DATA

• TEXT (+ media: photos, videos): What?

• GEO: Where?

• TIME: When?

Visualize Data

TIME Tweets/second

TIME Tweets/second + Annotation

http://www.flickr.com/photos/twitteroffice/5681263084/

• Manual

• To automate: Top Tweets (most Retweets, Favs)
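One way to automate the annotation is to find the busiest second and label it with the most retweeted and favorited Tweet from that second. The sketch below assumes each Tweet is a dict with hypothetical keys `ts` (epoch seconds), `text`, `retweets`, and `favorites`; it is an illustration, not the method used for the charts shown here.

```python
from collections import Counter

# Hypothetical sample data.
SAMPLE = [
    {"ts": 100.1, "text": "Kickoff!", "retweets": 2, "favorites": 1},
    {"ts": 100.5, "text": "GOAL!!!", "retweets": 50, "favorites": 30},
    {"ts": 100.9, "text": "What a strike", "retweets": 5, "favorites": 4},
    {"ts": 101.2, "text": "Replay please", "retweets": 0, "favorites": 1},
    {"ts": 103.0, "text": "Half time", "retweets": 1, "favorites": 0},
]


def tweets_per_second(tweets):
    """Count Tweets in each whole second."""
    return Counter(int(t["ts"]) for t in tweets)


def annotate_peak(tweets):
    """Return (peak second, Tweet count, text of the top Tweet in that second)."""
    counts = tweets_per_second(tweets)
    peak_second, peak_count = counts.most_common(1)[0]
    in_peak = [t for t in tweets if int(t["ts"]) == peak_second]
    top = max(in_peak, key=lambda t: t["retweets"] + t["favorites"])
    return peak_second, peak_count, top["text"]


peak = annotate_peak(SAMPLE)
```

Summing Retweets and Favs is just one possible score; a real system would likely weight them or filter spam first.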

GEO Heatmap

Color scale: Low density to High density
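The density behind a heatmap can be approximated by counting geotagged Tweets per grid cell. This sketch uses a plain latitude/longitude grid with an assumed cell size in degrees; the function names are hypothetical, and a production map would normally use a proper map projection or tile scheme.

```python
import math
from collections import Counter


def grid_cell(lat, lon, cell_deg=1.0):
    """Map a coordinate to the integer index of its grid cell."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def density_grid(points, cell_deg=1.0):
    """Count points per cell; higher counts render as higher density."""
    return Counter(grid_cell(lat, lon, cell_deg) for lat, lon in points)


# Hypothetical coordinates: two near New York City, one near San Francisco.
points = [(40.71, -73.99), (40.76, -73.98), (37.77, -122.42)]
grid = density_grid(points)
```

Each cell count can then be mapped to a color between the low-density and high-density ends of the scale.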

GEO New York City

flickr.com/photos/twitteroffice/8798020541

GEO San Francisco

flickr.com/photos/twitteroffice/8798020541

GEO San Francisco

Rebuild the world based on tweet volumes

twitter.github.io/interactive/andes/

TIME + GEO

blog.twitter.com/2011/global-pulse

youtu.be/SybWjN9pKQk

Japan Earthquake 2011

TIME + GEO Tweet pattern [Rios & Lin 2012]

Legend: Night, Late night, Daytime

TEXT Trends

TEXT

www.wordle.net

Some samples from World Cup

TEXT Word cloud of Tweets right after the 1st goal

www.wordle.net
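A word cloud is driven by word frequencies, which can be computed as in the sketch below. The stopword list, the tokenizer, and the sample Tweets are simplified assumptions; this is not how wordle.net works internally.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "for", "on", "rt"}


def word_frequencies(tweets):
    """Count lowercase tokens across Tweets, skipping URLs and stopwords."""
    counts = Counter()
    for text in tweets:
        text = re.sub(r"https?://\S+", " ", text.lower())
        for token in re.findall(r"[#@]?\w+", text):
            if token not in STOPWORDS and len(token) > 1:
                counts[token] += 1
    return counts


# Hypothetical sample Tweets.
sample = [
    "GOAL for Brazil!",
    "What a goal! #WorldCup",
    "Brazil scores the first goal https://example.com/x",
]
freqs = word_frequencies(sample)
```

The resulting counts would set the font size of each word in the cloud.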

TEXT

• Now

  • Derived information: Sentiment, Topic

  • Combine with other information (geo & time) + context

• Future

  • Better techniques that involve more NLP, e.g. key phrases

TEXT Descriptive Keyphrases [Chuang et al. 2012]

TEXT

• Challenge

  • Scale

GEO + TEXT Real-time Tweet map

most frequent term
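Picking the most frequent term per map region can be sketched by combining a grid with per-cell word counts. The grid size, stopword list, and whitespace-style tokenizer below are assumptions for illustration only; the real-time map itself is not implemented this way as far as this transcript shows.

```python
import math
import re
from collections import Counter, defaultdict

# A tiny illustrative stopword list.
STOPWORDS = {"is", "for", "in", "the", "and", "a", "not"}


def tokenize(text):
    """Lowercase word tokens minus stopwords; assumes space-delimited languages."""
    return [w for w in re.findall(r"\w+", text.lower()) if w not in STOPWORDS]


def top_term_per_cell(tweets, cell_deg=1.0):
    """Return {grid cell: most frequent term} for (text, lat, lon) tuples."""
    per_cell = defaultdict(Counter)
    for text, lat, lon in tweets:
        cell = (math.floor(lat / cell_deg), math.floor(lon / cell_deg))
        per_cell[cell].update(tokenize(text))
    return {
        cell: counts.most_common(1)[0][0]
        for cell, counts in per_cell.items()
        if counts
    }


# Hypothetical sample Tweets with coordinates.
sample = [
    ("gmail down", 40.7, -73.9),
    ("gmail not working", 40.8, -73.95),
    ("sunny day", 37.7, -122.4),
    ("sunny and warm", 37.8, -122.45),
]
top_terms = top_term_per_cell(sample)
```

The regular-expression tokenizer only works for languages that separate words with spaces; other languages need a dedicated word segmenter.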

GEO + TEXT Real-time Tweet map

Gmail went down Jan 24, 2014

GEO + TEXT Real-time Tweet map

Nelson Mandela passed away Dec 5, 2013

GEO + TEXT Real-time Tweet map

• Next:

• Involve more NLP

• Tokenization: languages without spaces between words

• etc.

• Challenge:

• Real-time

GEO + TEXT

www.yelp.com/wordmap

Yelp Wordmap

TIME + TEXT

http://www.babynamewizard.com/voyager

Baby Name Voyager

TIME + TEXT

UEFA Champions League

Biggest Tournament for European soccer clubs

Many Tweets during the matches

TIME + TEXT UEFA Champions League

Dortmund Bayern Munich

Count Tweets mentioning the teams every minute

Team 1 Team 2
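Counting team mentions per minute can be sketched as below. The search terms, timestamps, and sample Tweets are hypothetical, and a Tweet that mentions both teams is counted once for each team.

```python
from collections import Counter
from datetime import datetime

# Assumed search terms per team; real filters would need aliases and hashtags.
TEAM_TERMS = {
    "Dortmund": ["dortmund"],
    "Bayern Munich": ["bayern"],
}


def mentions_per_minute(tweets):
    """Return {team: Counter({minute: count})} for (timestamp, text) tuples."""
    counts = {team: Counter() for team in TEAM_TERMS}
    for ts, text in tweets:
        minute = ts.replace(second=0, microsecond=0)
        lowered = text.lower()
        for team, terms in TEAM_TERMS.items():
            if any(term in lowered for term in terms):
                counts[team][minute] += 1
    return counts


# Hypothetical sample Tweets with made-up timestamps.
sample = [
    (datetime(2013, 5, 25, 19, 0, 5), "Come on Dortmund!"),
    (datetime(2013, 5, 25, 19, 0, 40), "Bayern look strong"),
    (datetime(2013, 5, 25, 19, 0, 50), "Dortmund vs Bayern, what a final"),
    (datetime(2013, 5, 25, 19, 1, 10), "GOAL Bayern!!!"),
]
series = mentions_per_minute(sample)
```

Plotting each team's counter against time gives the two mirrored streams, one per team.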

TIME + TEXT UEFA Champions League

+ “goal” count + context

TIME + TEXT UEFA Champions League

+ “offside”

TIME + TEXT UEFA Champions League

+ players

Competition Tree

[Figure: bracket in which A beats B and C beats D, then C beats A]

uclfinal.twitter.com

TIME + TEXT UEFA Champions League

• Challenges

• Filter relevant Tweets

• Multiple matches at the same time

• Ambiguous words: “goal”, “red”, “yellow”

• Tweets mentioning both teams e.g. “#GER 2-2 #GHA”

TIME + GEO + TEXT State of the Union

1) timeline + topic from Tweets

2) context (speech)

3) Volume of Tweets by topics during selected part of the SOTU

4) Density map of Tweets about selected topic

twitter.github.io/interactive/sotu2014

TIME + GEO + TEXT New Year 2014

twitter.github.io/interactive/newyear2014/

Recap

What can we learn from these Tweets?

many, many things.

better

the examples in this talk

imagine…

DATA (Tweets) → Filter → Clean → Process & Visualize → Insights, Stories

NLP

Visualize data: TEXT (What?), GEO (Where?), TIME (When?)

Research

Working together

Raw data

Human

Aggregated information

Ignored information, Processed information

Computer (One machine, Cloud, MapReduce, etc.)

NLP: Make computers think more like humans.

VIS: Help people consume information.

HCI: User interactions or provide feedback. Bridge the gap. Connect human & computer.

Advanced techniques vs. Scalability

LifeFlow (Research) => Flying Sessions (System at Twitter)

Summary

• Thoughts are captured in the Tweets: what, where, when

• Finding patterns from: text + geo + time

• Opportunities for NLP + HCI + VIS collaboration

• Better technique vs. Scalability + Real-time

@kristw / interactive.twitter.com

Questions?

Thank you