Why Twitter Is All the Rage: A Data Miner's Perspective

Post on 10-May-2015


A presentation on data mining with Twitter that was originally presented as an O'Reilly webinar. See http://oreillynet.com/pub/e/2928 for the archived webinar video.

Transcript of Why Twitter Is All the Rage: A Data Miner's Perspective

Why Twitter Is All the Rage: A Data Miner's Perspective

Matthew A. Russell

O'Reilly Webcast

15 Oct 2013

1

Hello, My Name Is ... Matthew

2

Educated as a Computer Scientist

CTO @ Digital Reasoning Systems

Data mining; machine learning

Author @ O'Reilly Media

5 published books on technology

Principal @ Zaffra

Selective boutique consulting

Transforming Curiosity Into Insight

3

An open source software (OSS) project

http://bit.ly/MiningTheSocialWeb2E

A book

http://bit.ly/135dHfs

Accessible to (virtually) everyone

Virtual machine with turn-key coding templates for data science experiments

Think of the book as "premium" support for the OSS project

Overview

Background

Twitter as a data science platform

Politics, influence, world events

Data science tools for mining Twitter

Q&A

4

Background

5

Data Science

6

Data => Actionable information

Highly interdisciplinary

Nascent

Necessary

http://wikipedia.org/wiki/Data_science

Digital Signal Explosion

A model for the world: signals and sinks

Growth in data exhaust is accelerating

Digital fingerprints

"Software is eating the world"

Data mining opportunities galore...

7

Digital Data Stats

100 terabytes of data uploaded daily to Facebook.

Brands and organizations on Facebook receive 34,722 Likes every minute of the day.

According to Twitter’s own research in early 2012, it sees roughly 175 million tweets every day.

30 billion pieces of content shared on Facebook every month.

Data production will be 44 times greater in 2020 than it was in 2009.

According to estimates, the volume of business data worldwide, across all companies, doubles every 1.2 years.

8

See http://wikibon.org/blog/big-data-statistics

Social Media Is All the Rage

World population: ~7B people

Facebook: 1.15B users

Twitter: 500M users

Google+: 343M users

LinkedIn: 238M users

~200M+ blogs (conservative estimate)

9

Why Does Social Media Matter?

It's the frontier for predictive analytics

Understanding world events

Swaying political elections

Modeling human behavior

Analyzing sentiment

Making intelligent recommendations

10

Twitter Is All the Rage

It satisfies fundamental human desires

We want to be heard

We want to satisfy our curiosity

We want it easy

We want it now

Accessible, rich, and (mostly) "open" data

RESTful APIs and JSON responses

Great proving ground for predictive analytics

11

Twitter's Network Dynamics

500M curious users

100M curious users actively engaging

Real-time communication

Short, sweet, ... and fast

Asymmetric Following Model

An interest graph

12

Twitter as a data science platform

13

What's in a Tweet?

14

140 Characters ...

... Plus ~5KB of metadata!

Authorship

Time & location

Tweet "entities"

Replying, retweeting, favoriting, etc.
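As a rough illustration of the metadata categories listed above, the sketch below pulls authorship, time, location, entities, and reply/retweet/favorite information out of a single tweet. The tweet itself is made up and heavily trimmed; the field names follow the JSON shape returned by Twitter's REST API v1.1, and `summarize_tweet` is a hypothetical helper, not something from the slides.

```python
import json

# A trimmed, made-up tweet in the shape of a Twitter REST API v1.1 response.
# Real tweets carry many more fields; every value here is illustrative only.
SAMPLE_TWEET_JSON = """
{
  "text": "Following the #Syria debate with @example_user http://t.co/xyz",
  "created_at": "Tue Oct 15 17:00:00 +0000 2013",
  "coordinates": null,
  "retweet_count": 3,
  "favorite_count": 1,
  "in_reply_to_screen_name": null,
  "user": {"screen_name": "example_author", "followers_count": 120},
  "entities": {
    "hashtags": [{"text": "Syria"}],
    "user_mentions": [{"screen_name": "example_user"}],
    "urls": [{"expanded_url": "http://example.com/article"}]
  }
}
"""


def summarize_tweet(tweet):
    """Pull the metadata categories named on the slide out of one tweet dict."""
    entities = tweet.get("entities", {})
    return {
        # Authorship
        "author": tweet["user"]["screen_name"],
        # Time & location (coordinates is null unless the tweet is geotagged)
        "created_at": tweet["created_at"],
        "coordinates": tweet.get("coordinates"),
        # Tweet "entities"
        "hashtags": [h["text"] for h in entities.get("hashtags", [])],
        "mentions": [m["screen_name"] for m in entities.get("user_mentions", [])],
        "urls": [u["expanded_url"] for u in entities.get("urls", [])],
        # Replying, retweeting, favoriting
        "in_reply_to": tweet.get("in_reply_to_screen_name"),
        "retweet_count": tweet.get("retweet_count", 0),
        "favorite_count": tweet.get("favorite_count", 0),
    }


summary = summarize_tweet(json.loads(SAMPLE_TWEET_JSON))
print(summary)
```

The same function could be mapped over any collection of tweets once they have been parsed from JSON.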

Twitter and Facebook Compared

15

Twitter

Account Types: "Anything"

"Following" Relationships

Favorites

Retweets

Replies

(Almost) No Privacy Controls

Facebook

Account Types: People & Pages

Mutual Connections

"Likes"

"Shares"

"Comments"

Extensive Privacy Controls
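One way to picture the difference between "following" relationships and mutual connections is as a directed graph versus its symmetric subset. The sketch below is a minimal illustration under that assumption: the account names are borrowed from the diagrams in this deck, but who follows whom is invented, and the helper functions are hypothetical.

```python
from collections import defaultdict

# Hypothetical "following" edges (follower -> followed). The names appear in
# the slide diagrams; the relationships themselves are invented.
FOLLOWS = [
    ("Jorge", "Ana"),
    ("Ana", "Jorge"),
    ("Ana", "U2"),
    ("Nina", "U2"),
    ("Nina", "Jorge"),
    ("Jorge", "Juan Luis Guerra"),
]


def build_following(edges):
    """Directed adjacency map: account -> set of accounts it follows."""
    following = defaultdict(set)
    for follower, followed in edges:
        following[follower].add(followed)
    return following


def mutual_pairs(following):
    """Pairs that follow each other, i.e. the symmetric part of the graph."""
    pairs = set()
    for a, followed in following.items():
        for b in followed:
            if a in following.get(b, set()):
                pairs.add(tuple(sorted((a, b))))
    return pairs


def followers_of(following, account):
    """Everyone who follows `account`, whether or not it follows back."""
    return {a for a, followed in following.items() if account in followed}


following = build_following(FOLLOWS)
print(mutual_pairs(following))        # symmetric, "social network" style ties
print(followers_of(following, "U2"))  # one-way, "interest graph" style ties
```

In this toy data only Jorge and Ana are mutually connected, while U2 has followers it does not follow back, which is the asymmetry that makes the following model read as an interest graph.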

16

[Diagram nodes: Roberto, Mercedes, Jorge, Ana, Nina]

Social Network Mechanics

Interest Graph Mechanics

17

[Diagram nodes: Roberto, Mercedes, Jorge, Ana, Nina, U2, Juan Luis Guerra]

A (Social) Interest Graph

18

[Diagram nodes: Roberto, Mercedes, Jorge, Ana, Nina, U2, Juan Luis Guerra]

A (Political) Interest Graph

19

[Diagram nodes: Roberto, Mercedes, Jorge, Ana, Nina, Johnny Araya, Rodolfo Hernández]

Costa Rican Presidential Candidates

20

@Johnny_Araya, @ElDoctor2014

~3 Months on Twitter

21

Candidate                            Aug 2013   Sept 2013   % Change
Johnny Araya                           14,573      15,506      6.40%
Otto Guevara Guth                         114         159     39.47%
José María Villalta Florez-Estrada      8,160       8,990     10.17%
Dr. Rodolfo Hernández                     745         858     15.17%
Luis Guillermo Solís Rivera             1,192       1,487     24.75%
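The % Change figures above can be reproduced from the two monthly counts. The sketch below assumes the rows pair with the candidates in the order they are listed on the slide; `percent_change` is a hypothetical helper name.

```python
# Counts as listed on the slide: (candidate, Aug 2013, Sept 2013).
# Assumption: rows pair with candidate names in the order they are listed.
COUNTS = [
    ("Johnny Araya", 14573, 15506),
    ("Otto Guevara Guth", 114, 159),
    ("José María Villalta Florez-Estrada", 8160, 8990),
    ("Dr. Rodolfo Hernández", 745, 858),
    ("Luis Guillermo Solís Rivera", 1192, 1487),
]


def percent_change(old, new):
    """Relative change from `old` to `new`, expressed as a percentage."""
    return (new - old) / old * 100.0


changes = {name: percent_change(aug, sept) for name, aug, sept in COUNTS}
for name, pct in changes.items():
    print(f"{name}: {pct:.2f}%")
```

Note that the largest percentage gain belongs to the smallest account, which is a reminder to read percentage growth alongside the absolute counts.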

Who are Candidates Following?

22

What are Candidates Tweeting?

23

Potential Influence

24

Potential Twitter Influence

25

                      Araya     Hernández
Followers             ~14k      ~750
Theoretical Reach     ~40M      ~550k
Reach (10)            490       673
Reach (100)           289       702
Reach (1000)          2782      X
Reach (10,000)        2832      X
"Suspect" Followers   3,246     94

See also http://wp.me/p3QiJd-2a

Considerations for Measuring Influence

26

Spam bot accounts that effectively are zombies and can’t be harnessed for any utility at all

Inactive or abandoned accounts that can’t influence or be influenced since they are not in use

Accounts that follow so many other accounts that the likelihood of getting noticed (and thus influencing) is practically zero

The network effects of retweets by accounts that are active and can be influenced to spread a message

See also http://wp.me/p3QiJd-2a
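A minimal sketch of how some of these considerations might be applied when estimating reach. Several things here are assumptions rather than the method behind the slides: reach is taken to be the sum of the follower counts of an account's followers, the thresholds are arbitrary, and spam-bot detection and retweet network effects are not modeled. The dictionaries mimic the `followers_count`, `friends_count`, and `statuses_count` fields of a Twitter API v1.1 user object.

```python
# Hypothetical thresholds; the slides do not give specific cutoffs.
MAX_FRIENDS = 1000   # follows more accounts than this: unlikely to notice a tweet
MIN_STATUSES = 1     # has never tweeted: treated as inactive or abandoned


def is_reachable(user, max_friends=MAX_FRIENDS, min_statuses=MIN_STATUSES):
    """Rough heuristic for whether a follower could plausibly be influenced.

    `friends_count` is the number of accounts the user follows and
    `statuses_count` is the number of tweets the user has posted.
    """
    if user.get("statuses_count", 0) < min_statuses:
        return False
    if user.get("friends_count", 0) > max_friends:
        return False
    return True


def adjusted_reach(followers, **thresholds):
    """Sum follower counts over only the followers that pass the heuristic."""
    return sum(
        user.get("followers_count", 0)
        for user in followers
        if is_reachable(user, **thresholds)
    )


# Made-up followers of some account.
followers = [
    {"screen_name": "active_fan", "followers_count": 200,
     "friends_count": 150, "statuses_count": 900},
    {"screen_name": "abandoned", "followers_count": 5,
     "friends_count": 40, "statuses_count": 0},
    {"screen_name": "follows_everyone", "followers_count": 3000,
     "friends_count": 25000, "statuses_count": 12},
]

print(adjusted_reach(followers))
print(adjusted_reach(followers, max_friends=100000))
```

Loosening `max_friends` lets the account that follows 25,000 others back into the total, which shows how sensitive a reach estimate can be to the cutoff chosen.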

27

Social Media Popularity: Araya vs Hernández

[Charts: Twitter Popularity and Facebook Popularity, each split into Araya % and Hernández %]

Realtime Analysis: #Syria

28

Monitor Twitter's firehose for realtime data using filters such as #Syria

Keep in mind the sheer volume of data can be considerable

Analysis at MiningTheSocialWeb.com
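The sketch below illustrates the filtering idea on a tiny, made-up sample of newline-delimited tweet JSON. It is an offline stand-in, not a streaming client: in practice Twitter's streaming API can apply a filter such as #Syria on the server side through the `track` parameter of `statuses/filter`. The helper name `tweets_with_hashtag` is hypothetical. Because it is a generator, it processes one tweet at a time, which matters when the volume is large.

```python
import json


def tweets_with_hashtag(lines, hashtag):
    """Yield tweets from an iterable of JSON lines that carry `hashtag`."""
    wanted = hashtag.lstrip("#").lower()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines, which streams may send as keep-alives
        tweet = json.loads(line)
        tags = tweet.get("entities", {}).get("hashtags", [])
        if any(tag["text"].lower() == wanted for tag in tags):
            yield tweet


# Made-up sample standing in for a live stream.
SAMPLE_STREAM = [
    '{"text": "Watching events in #Syria", '
    '"entities": {"hashtags": [{"text": "Syria"}]}}',
    '',
    '{"text": "Lunch time", "entities": {"hashtags": []}}',
    '{"text": "News roundup #syria #news", '
    '"entities": {"hashtags": [{"text": "syria"}, {"text": "news"}]}}',
]

matched = list(tweets_with_hashtag(SAMPLE_STREAM, "#Syria"))
for tweet in matched:
    print(tweet["text"])
```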

#Syria: Who?

29

See http://wp.me/p3QiJd-1I

#Syria: Who?

30


#Syria: Who?

31


#Syria: What?

32


#Syria: What?

33


#Syria: Where?

34


#Syria: When?

35


#Syria: Why?

36

That's for you (as the data scientist) to decide

Quantitative automation can amplify human intelligence

Qualitative analysis still requires human intelligence

Data science tools for mining Twitter

37

MTSW Virtual Machine Experience

Goal: Make it easy to transform curiosity into insight

Vagrant-based virtual machine

VirtualBox or AWS

IPython Notebook User Experience

Point-and-click GUI

100+ turn-key examples and templates

Social web mining for the masses

38

Social Media Analysis Framework

A memorable four-step process to guide data science experiments:

Aspire

Acquire

Analyze

Summarize

39


Free Resources

Mining the Social Web 2E Chapter 1 (Chimera)

http://bit.ly/13XgNWR

Source Code (GitHub)

http://bit.ly/MiningTheSocialWeb2E

http://bit.ly/1fVf5ej (numbered examples)

Screencasts (Vimeo)

http://bit.ly/mtsw2e-screencasts

http://MiningTheSocialWeb.com

44

Q&A

45