From the Twitter Stream to your Stats Screen

139
...shedding light on psychosocial phenomena through big language analysis. H. Andrew Schwartz @ International Conference and Global Working Group meeting on Big Data for Official Statistics 29 October, 2014, Beijing, China From the Twitter Stream to your Stats Screen: Towards Working with Social Media Data for Official Statistics

Transcript of From the Twitter Stream to your Stats Screen

Page 1: From the Twitter Stream to your Stats Screen

...shedding light on psychosocial phenomena through big language analysis.

H. Andrew Schwartz

@

International Conference and Global Working Group meeting on Big Data for Official Statistics

29 October, 2014, Beijing, China

From the Twitter Stream to your Stats Screen:

Towards Working with Social Media Data for Official Statistics

Page 2: From the Twitter Stream to your Stats Screen

Thank You

United Nations Statistics Division (UNSD)

National Bureau of Statistics of China (NBS)

Page 3: From the Twitter Stream to your Stats Screen

Social Media

Page 4: From the Twitter Stream to your Stats Screen

300mil. tweets/day

Social Media

Page 5: From the Twitter Stream to your Stats Screen

300mil. tweets/day

4bil. messages/day

Social Media

Page 6: From the Twitter Stream to your Stats Screen

300mil. tweets/day

4bil. messages/day

100mil. (Sina) weibos/day

Social Media

Page 7: From the Twitter Stream to your Stats Screen

300mil. tweets/day

4bil. messages/day

100mil. (Sina) weibos/day

Social Media

BIGGERDATA

Page 8: From the Twitter Stream to your Stats Screen

300mil. tweets/day

4bil. messages/day

100mil. (Sina) weibos/day

Social Media

Page 9: From the Twitter Stream to your Stats Screen

300mil. tweets/day

4bil. messages/day

100mil. (Sina) weibos/day

Social Media

150mil. (2014)

1bil. (2014)

75mil. (2014)

PEOPLE:

Page 10: From the Twitter Stream to your Stats Screen

300mil. tweets/day

4bil. messages/day

100mil. (Sina) weibos/day

Social Media

150mil. (2014)

1bil. (2014)

75mil. (2014)

PEOPLE:

Largest dataset(s) of everyday human behavior and conerns.

Page 11: From the Twitter Stream to your Stats Screen

Social Media

Applications

Largest dataset(s) of everyday human behavior and conerns.

Page 12: From the Twitter Stream to your Stats Screen

Social Media

Applications

Largest dataset(s) of everyday human behavior and conerns.

1. Measurement

Page 13: From the Twitter Stream to your Stats Screen

Social Media

Applications

Largest dataset(s) of everyday human behavior and conerns.

1. MeasurementTo what extent can we replace traditional survey-based methods?

Page 14: From the Twitter Stream to your Stats Screen

Measurement: Personality

Page 15: From the Twitter Stream to your Stats Screen

Measurement: Personality

Page 16: From the Twitter Stream to your Stats Screen

Measurement: Personality

Page 17: From the Twitter Stream to your Stats Screen

Social Media

Applications

Largest dataset(s) of everyday human behavior and conerns.

1. MeasurementTo what extent can we replace traditional survey-based methods?

Page 18: From the Twitter Stream to your Stats Screen

Social Media

Applications

Largest dataset(s) of everyday human behavior and conerns.

1. MeasurementTo what extent can we replace traditional survey-based methods?2. Data-driven discovery

Page 19: From the Twitter Stream to your Stats Screen

Social Media

Applications

Largest dataset(s) of everyday human behavior and conerns.

1. MeasurementTo what extent can we replace traditional survey-based methods?2. Data-driven discoveryCan we discovery new links with outcomes? What is driving a trend?

Page 20: From the Twitter Stream to your Stats Screen

Data-driven Social Science: Extraversionsociable, assertive, active, energetic, talkative, outgoing

Schwartz, H. A., Eichstaedt, J. C., Kern, M. L., Dziurzynski, L., Ramones, S. M., Agrawal, M., Shah, A., Kosinski, M., Stillwell, D., Seligman, M. E. P., & Ungar, L. H. (2013). Personality, Gender, and Age in the Language of Social Media: The Open-Vocabulary Approach. In PLOS ONE 8(9).

Page 21: From the Twitter Stream to your Stats Screen

Data-driven Social Science: Introversion

Schwartz, H. A., Eichstaedt, J. C., Kern, M. L., Dziurzynski, L., Ramones, S. M., Agrawal, M., Shah, A., Kosinski, M., Stillwell, D., Seligman, M. E. P., & Ungar, L. H. (2013). Personality, Gender, and Age in the Language of Social Media: The Open-Vocabulary Approach. In PLOS ONE 8(9).

Page 22: From the Twitter Stream to your Stats Screen

moody, anxious, fearful, worry-prone, depressive

Explicit Language Warning

Data-Driven Social Science: Neuroticism

Page 23: From the Twitter Stream to your Stats Screen

moody, anxious, fearful, worry-prone, depressive

Data-Driven Social Science: Neuroticism

Page 24: From the Twitter Stream to your Stats Screen

moody, anxious, fearful, worry-prone, depressive

Data-Driven Social Science: Neuroticism

Page 25: From the Twitter Stream to your Stats Screen

Data-Driven Social Science: Neuroticism

Page 26: From the Twitter Stream to your Stats Screen

Data-Driven Social Science: Neuroticism

Page 27: From the Twitter Stream to your Stats Screen

Social Media1. MeasurementTo what extent can we replace traditional survey-based methods?2. Data-driven discoveryCan we discovery new links with outcomes? What is driving a trend?

Page 28: From the Twitter Stream to your Stats Screen

Social Media1. MeasurementTo what extent can we replace traditional survey-based methods?2. Data-driven discoveryCan we discovery new links with outcomes? What is driving a trend?

TraditionalOfficial

Statistics

BigData

?

Page 29: From the Twitter Stream to your Stats Screen

Social Media1. MeasurementTo what extent can we replace traditional survey-based methods?2. Data-driven discoveryCan we discovery new links with outcomes? What is driving a trend?

TraditionalOfficial

StatisticsBig

Data

Page 30: From the Twitter Stream to your Stats Screen

Overview

• Introduction

• Background on Social Media Data

• Examples

• Challenges

• Summary

Page 31: From the Twitter Stream to your Stats Screen

Overview

• Introduction

• Background on Social Media Data– Sources

– Types

– Acquisition

– Analysis Methodology

• Examples

• Challenges

• Summary

Page 32: From the Twitter Stream to your Stats Screen

microblogging social interaction messaging

TwitterWeibo

Facebook Renren

Text MessagesSnapChatWeChat

mostly public somewhat private private

Social Media Sources

Page 33: From the Twitter Stream to your Stats Screen

Social Media Sourcesmicroblogging social interaction messaging

TwitterWeibo

Facebook Renren

Text MessagesSnapChatWeChat

mostly public somewhat private private

Page 34: From the Twitter Stream to your Stats Screen

Social Media Sourcesmicroblogging social interaction messaging

TwitterWeibo

Facebook Renren

Text MessagesSnapChatWeChat

mostly public somewhat private private

big bigger biggest

Page 35: From the Twitter Stream to your Stats Screen

Social Media Sourcesmicroblogging social interaction messaging

TwitterWeibo

Facebook Renren

Text MessagesSnapChatWeChat

mostly public somewhat private private

big bigger biggestOther social mediaInstagramYouTubeYelpPinterestTumblrReddit

Page 36: From the Twitter Stream to your Stats Screen

Social Media Sources

Other social mediaInstagramYouTubeYelpPinterestTumblrReddit

SearchGoogleYahooBaiduBing

microblogging social interaction messagingTwitterWeibo

Facebook Renren

Text MessagesSnapChatWeChat

mostly public somewhat private private

big bigger biggest

Page 37: From the Twitter Stream to your Stats Screen

Social Media Data Types:

Text!

Page 38: From the Twitter Stream to your Stats Screen

Social Media Data Types:

Text!

Page 39: From the Twitter Stream to your Stats Screen

Social Media Data Types:

Text!

Page 40: From the Twitter Stream to your Stats Screen

Social Media Data Types:

Text!

Page 41: From the Twitter Stream to your Stats Screen

Social Media Data Types:

Text!

Page 42: From the Twitter Stream to your Stats Screen

Social Media Data Types:

Text!

Page 43: From the Twitter Stream to your Stats Screen

Social Media Data Types:Levels of Analysis

Page 44: From the Twitter Stream to your Stats Screen

Social Media Data Types:

Page 45: From the Twitter Stream to your Stats Screen

Acquiring Social Media

Twitter

– Application Programming Interfaces (APIs)● random stream (1% daily = ~2 to 3.5m)● filter stream (1%; not random sample)● search API (180 queries per 15 minutes)

Page 46: From the Twitter Stream to your Stats Screen

Acquiring Social Media

Twitter

– Application Programming Interfaces (APIs)● random stream (1% daily = ~2 to 3.5m)● filter stream (1%; not random sample)● search API (180 queries per 15 minutes)

– More data provided by third parties(Datasift, Gnip, ...)

Page 47: From the Twitter Stream to your Stats Screen

Acquiring Social Media

{ "coordinates": None, "created_at": "Wed Jan 29 22:58:50 +0000 2014", "favorite_count": 19, "favorited": False, "geo": None, "id": 428663556889145344, "lang": "en", "place": None, "retweet_count": 14, "retweeted": False, "text": "Wow, where did January go? Was I in Tulsa or Yemen? Or Vermont?", ...}

JSON encoding

Page 48: From the Twitter Stream to your Stats Screen

Acquiring Social Media

Facebooko Graph API

o Limited public data

o Consent participants to share private data through Facebook App.

Page 49: From the Twitter Stream to your Stats Screen

Analysis / Methodology

Page 50: From the Twitter Stream to your Stats Screen

Features

words and phrases: 1 to 3 word sequences more likely to occur together than chance.

• Words identified from text via social-media aware tokenization.

• usually restricted to those used more than a few times• e.g. 'day', 'the beautiful day', 'Mexico City', etc...

Analysis / Methodology

Page 51: From the Twitter Stream to your Stats Screen

Analysis / MethodologyFeatures

words and phrases: 1 to 3 word sequences more likely to occur together than chance.

• Words identified from text via social-media aware tokenization.

• usually restricted to those used more than a few times• e.g. 'day', 'the beautiful day', 'Mexico City', etc...

topics: Clusters of semantically-related words found via latent Dirichlet allocation

e.g.

Page 52: From the Twitter Stream to your Stats Screen

Features

words and phrases: 1 to 3 word sequences more likely to occur together than chance.

topics: Clusters of semantically-related words found via latent Dirichlet allocation

lexica: Manually-created clusters of wordse.g. positive emotion: happy, joyous, like, etc… negative emotion: sad, hate, terrible, etc…

Method:Data-driven language analysis

Page 53: From the Twitter Stream to your Stats Screen

Analysis / Methodology

open-vocabulary : Not restricted to predefined lists of features.

Page 54: From the Twitter Stream to your Stats Screen

Analysis / MethodologyExample: Sentiment Analysis

+ / - Emotion from LIWC(Pennebaker et al., 2001)

Thumbs up... (Pang and Lee, 2004)

NRC Canada(Mohammad et al., 2013)

Page 55: From the Twitter Stream to your Stats Screen

Analysis / Methodology

All require validation in new domain. (e.g., new platform, time-frame, or level of analysis)

Page 56: From the Twitter Stream to your Stats Screen

Analysis / Methodology

PredictionHow to fit a single model on lots of language variables?(e.g. 25,000 words and phrases)

Methods from Machine Learning:● discrete outcomes: support vector machines (SVM)● continuous outcomes: ridge regression

Page 57: From the Twitter Stream to your Stats Screen

Analysis / Methodology

PredictionIssues with words as variables:● sparseness: most words do not occur very often● high co-variance: e.g. people that say “soccer”

often are also more likely to say “goal”

Page 58: From the Twitter Stream to your Stats Screen

Analysis / Methodology

PredictionIssues with words as variables:● sparseness: most words do not occur very often● high co-variance: e.g. people that say “statistics”

often are also more likely to say “variable”

Solutions:● L1 penalized fit (lasso regression)● Use principal components analysis before fit

Page 59: From the Twitter Stream to your Stats Screen

Analysis / Methodology

Page 60: From the Twitter Stream to your Stats Screen

Some Available Resources

MALLET: Machine Learning Language ToolkitGood for topic modelinghttp://mallet.cs.umass.edu/GUI: http://code.google.com/p/topic-modeling-tool/

Lightside: Point and Click Machine Learninghttp://ankara.lti.cs.cmu.edu/side/download.html

WWBP Resources wwbp.org/data.htmlComing this January: “LexHub: Language Analysis X social science”email to get on list: [email protected]

Page 61: From the Twitter Stream to your Stats Screen

Overview

• Introduction

• Background on Social Media Data– Sources

– Types

– Acquisition

– Analysis Methodology

• Examples

• Challenges

• Summary

Page 62: From the Twitter Stream to your Stats Screen

Overview

• Introduction

• Background on Social Media Data

• Examples– Heart Disease Mortality

– HIV Prevalence

– Life Satisfaction

– Flu Tracking

• Challenges

• Summary

Page 63: From the Twitter Stream to your Stats Screen

Example: Community Heart Disease Mortality

Eichstaedt, Schwartz, Park, Kern, … Ungar, Seligman. (2014; in press)

Page 64: From the Twitter Stream to your Stats Screen

Twitter Dataset Studied:

10% of tweets from June 2009 to March 2010(826 million tweets)

United States CDC data:

2009-2011 Atherosclerotic Heart Disease Mortality

Example: Community Heart Disease Mortality

Page 65: From the Twitter Stream to your Stats Screen

Example: Community Heart Disease Mortality

Page 66: From the Twitter Stream to your Stats Screen

Example: Community Heart Disease Mortality

Page 67: From the Twitter Stream to your Stats Screen

* **

Example: Community Heart Disease Mortality

Page 68: From the Twitter Stream to your Stats Screen

* **

Example: Community Heart Disease Mortality

Page 69: From the Twitter Stream to your Stats Screen

* **

Example: Community Heart Disease Mortality

Page 70: From the Twitter Stream to your Stats Screen

* **

Example: Community Heart Disease Mortality

Page 71: From the Twitter Stream to your Stats Screen

* **

Example: Community Heart Disease Mortality

Page 72: From the Twitter Stream to your Stats Screen

* **

Example: Community Heart Disease Mortality

Page 73: From the Twitter Stream to your Stats Screen

Language positively correlated with US-county-level Heart Disease

Page 74: From the Twitter Stream to your Stats Screen

Language negatively correlated with US-county-level Heart Disease

Page 75: From the Twitter Stream to your Stats Screen

Example: County Life Satisfaction

In collaboration with Molly Ireland and Dolores Albaraccin

Page 76: From the Twitter Stream to your Stats Screen

Example: County Life Satisfaction

education level, income, demographics, ethnicity Twitter

Page 77: From the Twitter Stream to your Stats Screen

Example: County Life Satisfaction

Page 78: From the Twitter Stream to your Stats Screen

Example: County HIV Prevalence

In collaboration with Molly Ireland and Dolores Albaraccin

Page 79: From the Twitter Stream to your Stats Screen

Example: County HIV Prevalence

Page 80: From the Twitter Stream to your Stats Screen

Example: County HIV Prevalence

HIV prevalence is higher in counties with less future tense in...

all 1375 qualifying counties(Beta = -0.48, p <.001)

top 200 most populated counties (Beta = -0.27, p <.001)

Page 81: From the Twitter Stream to your Stats Screen

Example: Flu Trends

Page 82: From the Twitter Stream to your Stats Screen

Google Flu Trends

Page 83: From the Twitter Stream to your Stats Screen

Health Tweets

http://www.healthtweets.org/(Mark Dredze and Michael Paul; Johns Hopkins University)

narrows in on health-related tweets

Page 84: From the Twitter Stream to your Stats Screen

Overview

• Introduction

• Background on Social Media Data

• Examples– Heart Disease Mortality

– HIV Prevalence

– Life Satisfaction

– Flu Tracking

• Challenges

• Summary

Page 85: From the Twitter Stream to your Stats Screen

Overview

• Introduction

• Background on Social Media Data

• Examples

• Challenges

• Summary

Page 86: From the Twitter Stream to your Stats Screen

Challenges

● Ethical / Privacy

● Technical

● Methodological

Page 87: From the Twitter Stream to your Stats Screen

Challenges

Page 88: From the Twitter Stream to your Stats Screen

Challenges

● Ethical / Privacy

– Public Awareness / Participant Consent● Technical

● Methodological

Page 89: From the Twitter Stream to your Stats Screen

Challenges

● Ethical / Privacy

– Public Awareness / Participant Consent● Technical

– Data Storage and Analysis Infrastructure

– Evolving APIs● Methodological

Page 90: From the Twitter Stream to your Stats Screen

Challenges● Ethical / Privacy

– Public Awareness / Participant Consent● Technical

– Data Storage and Analysis Infrastructure– Evolving APIs

● Methodological

– Word meaning / domains– Correlation versus Causation– Sample Bias– Self-presentation Bias

Page 91: From the Twitter Stream to your Stats Screen

Issues attributed to missclassification Facebook status update.

Page 92: From the Twitter Stream to your Stats Screen

Challenges● Ethical / Privacy

– Public Awareness / Participant Consent● Technical

– Data Storage and Analysis Infrastructure– Evolving APIs

● Methodological

– Word meaning / domains– Correlation versus Causation– Sample Bias– Self-presentation Bias

Page 93: From the Twitter Stream to your Stats Screen
Page 94: From the Twitter Stream to your Stats Screen
Page 95: From the Twitter Stream to your Stats Screen
Page 96: From the Twitter Stream to your Stats Screen
Page 97: From the Twitter Stream to your Stats Screen
Page 98: From the Twitter Stream to your Stats Screen
Page 99: From the Twitter Stream to your Stats Screen
Page 100: From the Twitter Stream to your Stats Screen

Representative Sample?

● Alternaitve: Post-stratification

– Demographics are one of the most accurately predicted from language

● gender 92% accuracy● age 0.86 correlation

Page 101: From the Twitter Stream to your Stats Screen

Challenges● Ethical / Privacy

– Public Awareness / Participant Consent● Technical

– Data Storage and Analysis Infrastructure– Evolving APIs

● Methodological

– Word meaning / domains– Correlation versus Causation– Sample Bias– Self-presentation Bias

Page 102: From the Twitter Stream to your Stats Screen

Challenges

validate

● Ethical / Privacy

– Public Awareness / Participant Consent● Technical

– Data Storage and Analysis Infrastructure– Evolving APIs

● Methodological

– Word meaning / domains– Correlation versus Causation– Sample Bias– Self-presentation Bias

Page 103: From the Twitter Stream to your Stats Screen
Page 104: From the Twitter Stream to your Stats Screen

Thank You! Questions?

[email protected]

| wwbp.org

TraditionalOfficial

StatisticsBig

Data

Page 105: From the Twitter Stream to your Stats Screen

Thank You! Questions?

[email protected]

| wwbp.org

Big Data for Official

Statistics

Page 106: From the Twitter Stream to your Stats Screen

APPENDIX

Page 107: From the Twitter Stream to your Stats Screen

Method: County-Mapping

94% accurate map to human-judged intended city, state pair.

Page 108: From the Twitter Stream to your Stats Screen

Distributed Computing● approximately 1 billion tweets

○ Too much for single computer system

● Utilize map-reduce in a “Hadoop” style cluster:

image: http://xiaochongzhang.me

Page 109: From the Twitter Stream to your Stats Screen

Well-Being and Policy

=> Life Satisfaction(across domains)

Page 110: From the Twitter Stream to your Stats Screen

What topics matter for all counties (that we have data for) in the United

States?

Evidence for moderation• A moderator alters the strength or direction of a relationship• Question of external validity – how universal is the effect?

Daivd Kenny – Moderator Variables: Introduction, http://davidakenny.net/cm/moderation.htm

What topics matter for the poorest 25% of counties in

the United States?

Page 111: From the Twitter Stream to your Stats Screen

Individual Well-Being

satisfaction with life

Page 112: From the Twitter Stream to your Stats Screen

Individual Well-Being: message to user-level

message and user-level

Page 113: From the Twitter Stream to your Stats Screen
Page 114: From the Twitter Stream to your Stats Screen
Page 115: From the Twitter Stream to your Stats Screen
Page 116: From the Twitter Stream to your Stats Screen

Fit unrepresentative sample to representative sample results (i.e. implicitly maps unrepresentative sample to representative)In the end we are validating against representative data.

Representative Sample?

Page 117: From the Twitter Stream to your Stats Screen

MyPersonality Dataset

• Facebook application to take “Big-5” personality survey.

• Approximately 75,000 users of the app:o shared their status updates for researcho wrote at least 1,000 wordso share their age and gender

Individual Traits in Facebook

Page 118: From the Twitter Stream to your Stats Screen

Community Well-Being Through Twitter

Page 119: From the Twitter Stream to your Stats Screen

Twitter

> 150 million active monthly users> 350 million messages a day

often list a location or geo-coordinates

Community Well-Being through Twitter

Page 120: From the Twitter Stream to your Stats Screen
Page 121: From the Twitter Stream to your Stats Screen

You Are What You Tweet

status update!

tweet!

tweet!

tweet!

status update!

status update!

tweet!

tweet!

status update!

status update! language

analysis

● prediction(measurement)

● insights

Outcomes

?

Page 122: From the Twitter Stream to your Stats Screen

Example JSON - Tweet{ "coordinates": None, "created_at": "Wed Jan 29 22:58:50 +0000 2014", "favorite_count": 19, "favorited": False, "geo": None, "id": 428663556889145344, "lang": "en", "place": None, "retweet_count": 14, "retweeted": False, "text": "Wow, where did January go? Was I in Tulsa or Yemen? Or Vermont?", ...}

"created_at": "Wed Jan 29 22:58:50 +0000 2014",

Page 123: From the Twitter Stream to your Stats Screen

Twitter APIs● REST APIs

o Twitter App building (e.g. smartphone apps)o Search API

● Streaming APIso Firehoseo public random sampleo “user” and “site” streams

● REST APIso Twitter App building (e.g. smartphone apps)o Search API

● Streaming APIso Firehoseo public random sampleo “user” and “site” streams

https://dev.twitter.com/docs

Page 124: From the Twitter Stream to your Stats Screen

Sample Stream● 1 % of all public tweets● real time● useful for representative language sample

o less than 40% of tweets are in Englisho can be useful for frequencies of terms looked at

Sample Stream

Page 125: From the Twitter Stream to your Stats Screen

Search API● Specific to what you’re looking for● same content as the web search

https://twitter.com/search?q=obama● parameters include

o Recent vs Top tweetso Geolocalizationo Language filter (Twitter’s algorithm is “best effort”)o time ranges (limited)o more:https://dev.twitter.com/docs/api/1.1/get/search/tweets

Search Stream

Page 126: From the Twitter Stream to your Stats Screen

Method: Prediction● Lasso, L1 penalized, regression● Controls:

○ demographics: age, gender, ethnicity○ socio-economic status: income, education

Community Heart Disease through Twitter

Schwartz, H. A., Eichstaedt, J. C., Kern, M. L., Dziurzynski, L., Lucas, R. E., Agrawal, M., Park, G. J., Lakshmikanth, S. K., Jha, S., Seligman, M. E. P., & Ungar, L. H. (2013). Characterizing Geographic Variation in Well-Being using Tweets. In Proceedings of the Seventh International AAAI Conference on Weblogs and Social Media (ICWSM). Boston, MA.

Page 127: From the Twitter Stream to your Stats Screen

Search API● Specific to what you’re looking for● same content as the web search

https://twitter.com/search?q=obama● parameters include

o Recent vs Top tweetso Geolocalizationo Language filter (Twitter’s algorithm is “best effort”)o time ranges (limited)o more:https://dev.twitter.com/docs/api/1.1/get/search/tweets

Search Stream

Page 128: From the Twitter Stream to your Stats Screen

Who has access to APIs ?● Twitter uses OAuth2 for authentication● Not a “username, password” authentication● Need a “Twitter App” (and a Twitter account)

o Anyone can create a blank appo Go to https://apps.twitter.com/app/newo Generate API key, API secret, access token &

access secret on this page: https://apps.twitter.com/app/YOUR_APP_ID/keys

Page 129: From the Twitter Stream to your Stats Screen

What’s in a “Tweet JSON”● text of the tweet● unique Twitter id● created date & time● replies:

o user id & tweet id of tweet replied to● retweets:

o Tweet JSON of the original tweet● favorited & retweeted counts● entities

o expanded links, hashtags, media & user mentions● user info:

o unique Twitter ido screen name, handle, location, descriptiono nb tweets, favourites, followerso profile picture & background information

Example Tweet JSON: https://gist.github.com/gnip/764239

Find a complete list of fields at:https://dev.twitter.com/docs/platform-objects/tweets &https://dev.twitter.com/docs/platform-objects/users

!! Some fields are optional !!

Page 130: From the Twitter Stream to your Stats Screen

Limitations of Twitter APISample Stream:● only 1 % of all tweets● terms that aren’t frequent enough

might not even appear in your dataset

Search:

● 180 “queries” limit in every 15 minute window

● each search query can only contain 10 terms

Page 131: From the Twitter Stream to your Stats Screen

● Free APIso Graph APIo Chat APIo FQL API

● Third party APIso Public Feed APIo Keywords Insights API

Facebook API● Free APIs

o Graph APIo Chat APIo FQL API

● Third party APIso Public Feed APIo Keywords Insights API

That’s where the

data is

Page 132: From the Twitter Stream to your Stats Screen

Graph API● Every data point is a node in a graph

John

John’s posts

JohnJohnJohnJohn’s friends

Likes & Comments

Page 133: From the Twitter Stream to your Stats Screen

Graph API● Every data point is a node in a graph

John

John’s posts

JohnJohnJohnJohn’s friends

Likes & Comments

Page 134: From the Twitter Stream to your Stats Screen

What did we learn?● API = Application Programming Interface● Easier for huge amounts of data● Twitter has multiple APIs● So does Facebook● How to use the Graph API to post/delete a

status● You might want to ask your programmer for

help

Page 135: From the Twitter Stream to your Stats Screen
Page 136: From the Twitter Stream to your Stats Screen

Individual Traits in Facebook

Page 137: From the Twitter Stream to your Stats Screen

MyPersonality Dataset

• Facebook application to take “Big-5” personality survey.

• Approximately 75,000 users of the app:o shared their status updates for researcho wrote at least 1,000 wordso share their age and gender

Individual Traits in Facebook

Page 138: From the Twitter Stream to your Stats Screen

Individual Traits in Facebook: Female

Gender

Page 139: From the Twitter Stream to your Stats Screen

Individual Traits in Facebook: Male

Gender