Mining Social Web Data Like a Pro: Four Steps to Success

Post on 10-May-2015

GDA Presentation - Quito Ecuador - 20 Sept 2013

Mining Social Web Data Like a Pro: Four Steps to Success

Presented by Matthew A. Russell

"Data Journalism and Interactivity" - GDA Seminar

Quito, Ecuador - 20 September 2013

1

Hello

2

Trained as a Computer Scientist

CTO @ Digital Reasoning Systems

Data Mining, Machine Learning

Principal @ Zaffra

Boutique Consulting

Author @ O'Reilly Media

5 published books on technology

3

Transform Curiosity Into Insight

4

An open source project

http://bit.ly/MiningTheSocialWeb2E

Inherently accessible

Virtual machine & IPython Notebook UX

Turn-key code templates for bootstrapping data science experiments

Think of the book as "premium" support for the OSS project

Why not Spanish?

5

Investigative Journalist

6

"A person whose profession it is to discover the truth and to identify lapses from it in whatever media may be available."

Data Science

7

Data => Actionable Information

Highly interdisciplinary

Nascent

Necessary

http://wikipedia.org/wiki/Data_science

Digital Signal Explosion

A model for the world: signals and sinks

Growth in data exhaust is accelerating

Digital fingerprints

Software is eating the world

Data mining opportunities galore...

8

Digital Data Stats

100 terabytes of data uploaded daily to Facebook.

Brands and organizations on Facebook receive 34,722 Likes every minute of the day.

According to Twitter’s own research in early 2012, it sees roughly 175 million tweets every day

30 Billion pieces of content shared on Facebook every month.

Data production will be 44 times greater in 2020 than it was in 2009

According to estimates, the volume of business data worldwide, across all companies, doubles every 1.2 years.

9

See http://wikibon.org/blog/big-data-statistics

Social Media Is All the Rage

World population: ~7B people

Facebook: 1.15B users

Twitter: 500M users

Google+: 343M users

LinkedIn: 238M users

~200M+ blogs (conservative estimate)

10

But Why Is It All the Rage?

It satisfies fundamental human desires

We want to be heard

We want to satisfy our curiosity

We want it easy

We want it now

11

12

[Figure: graph with nodes Roberto, Mercedes, Jorge, Ana, and Nina]

Social Network Mechanics

Interest Graph Mechanics

13

[Figure: graph with nodes Roberto, Mercedes, Jorge, Ana, Nina, U2, and Juan Luis Guerra]

A (Social) Interest Graph

14

[Figure: graph with nodes Roberto, Mercedes, Jorge, Ana, Nina, U2, and Juan Luis Guerra]

A (Political) Interest Graph

15

[Figure: graph with nodes Roberto, Mercedes, Jorge, Ana, Nina, Johnny Araya, and Rodolfo Hernández]
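One way to picture an interest graph in code is as a set of (person, interest) edges indexed in both directions. The following is a minimal sketch: the names come from the slides, but the edges themselves are invented for illustration, since the figures do not survive in this transcript.

```python
from collections import defaultdict

# Hypothetical edges: who is interested in whom is made up for illustration.
edges = [
    ("Roberto", "Johnny Araya"),
    ("Mercedes", "Johnny Araya"),
    ("Jorge", "Rodolfo Hernández"),
    ("Ana", "Johnny Araya"),
    ("Ana", "Rodolfo Hernández"),
    ("Nina", "Rodolfo Hernández"),
]

def build_interest_graph(edges):
    """Index the edges both ways: person -> interests and interest -> people."""
    interests_of = defaultdict(set)
    fans_of = defaultdict(set)
    for person, interest in edges:
        interests_of[person].add(interest)
        fans_of[interest].add(person)
    return interests_of, fans_of

interests_of, fans_of = build_interest_graph(edges)

# How many people point at each interest
for interest in sorted(fans_of):
    print(interest, len(fans_of[interest]))
```

Keeping both indexes makes it cheap to ask either "what does Ana care about?" or "who cares about this candidate?".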

Social Media Dimensions

16

Facebook

Account Types: People & Pages

Mutual Connections

"Likes"

"Shares"

"Comments"

Extensive Privacy Controls

Twitter

Account Types: "Anything"

"Following" Relationships

Favorites

Retweets

Replies

(Almost) No Privacy Controls

Why Does This Matter?

"If you can measure it, you can improve it"

Modeling Behavior

Predictive Analysis

Recommending Content

Swaying political situations might just be the ultimate value proposition for social media

17

Social Media Analysis Framework

Four Steps To Success

Aspire

Acquire

Analyze

Summarize

Let's step through a trivial example...

18

(1) Aspire

Let's frame a trivial hypothesis to illustrate the four steps...

Frame a hypothesis about some real world phenomenon

For example: "Johnny Araya is a more popular candidate than Rodolfo Hernández"

Let's use social media as a basis of investigation

19

(2) Acquire

Collect the data that you need to test the hypothesis

How?

Use Facebook and Twitter APIs to harvest data about each candidate

Go after low hanging fruit before something more complex

You don't even need to write code to do this (yet)

20

They're both on Facebook

21

http://facebook.com/ElDoctor2014

http://facebook.com/JohnnyArayaMonge

They're both on Twitter

22

@Johnny_Araya

@ElDoctor2014

(3) Analyze

Count, Filter, and Rank the Data

Johnny Araya:

~50k Facebook likes

~14k Twitter followers

Rodolfo Hernández:

~37k Facebook likes

745 Twitter followers

Johnny Araya is indeed more popular in social media
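Although no code is needed at this stage, the counting and ranking can be sketched in a few lines of Python. The figures are the approximate ones quoted above; the dictionary layout and the `rank_by` helper are illustrative choices, not part of the original talk.

```python
# Approximate figures read off each candidate's public pages (see above).
candidates = {
    "Johnny Araya": {"facebook_likes": 50000, "twitter_followers": 14000},
    "Rodolfo Hernández": {"facebook_likes": 37000, "twitter_followers": 745},
}

def rank_by(metric, data):
    """Return (name, value) pairs sorted from most to least popular."""
    pairs = [(name, stats[metric]) for name, stats in data.items()]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

for metric in ("facebook_likes", "twitter_followers"):
    print(metric, "->", rank_by(metric, candidates))
```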

23

(4) Summarize

Present the data in a concise and easily understood manner

Charts

Tables

Simple visualizations

Some examples...
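As a rough illustration of a simple visualization, the sketch below prints text bar charts of the approximate counts from the Analyze step, once on a linear scale and once on a logarithmic scale. The `text_bar` helper and the row labels are hypothetical; the point is only that a log scale keeps the smallest value (745 followers) visible next to the largest.

```python
import math

def text_bar(value, max_value, width=40, log=False):
    """Return a bar of '#' characters scaled to `width` columns.

    With log=True both value and max_value must be greater than 1.
    """
    if log:
        fraction = math.log10(value) / math.log10(max_value)
    else:
        fraction = value / max_value
    # Always draw at least one character so tiny values stay visible.
    return "#" * max(1, round(width * fraction))

# Approximate counts from the Analyze step
rows = [
    ("Araya FB likes", 50000),
    ("Hernández FB likes", 37000),
    ("Araya followers", 14000),
    ("Hernández followers", 745),
]
largest = max(value for _, value in rows)

for log in (False, True):
    print("log scale" if log else "linear scale")
    for label, value in rows:
        print("{0:<22}{1}".format(label, text_bar(value, largest, log=log)))
```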

24

25

Social Media Popularity: Araya vs Hernández

[Figure: two pie charts, "Twitter Popularity" and "Facebook Popularity", each split between Araya and Hernández]

26

Social Media Popularity: Araya vs Hernández

[Figure: bar chart of Twitter followers and Facebook fans for Araya and Hernández, linear axis from 0 to 60,000]

27

Social Media Popularity: Araya vs Hernández

[Figure: bar chart of Twitter followers and Facebook fans for Araya and Hernández, logarithmic axis from 1 to 100,000]

Twitter Popularity

28

Facebook Popularity

29

Facebook Likes for Costa Rican Presidential Candidates

JohnnyArayaMonge: 35%

o0oguevaraguth: 17%

luisguillermosolisr: 3%

villaltaJM: 19%

ElDoctor2014: 26%

Recall the previous hypothesis:

"Johnny Araya is a more popular candidate than Rodolfo Hernández"

What do we know now that we didn't before?

The current state of each candidate's Twitter and Facebook popularity

Let's explore a slightly more complex hypothesis...

30

Reflect and Refine...

(1) Aspire

Redefine the hypothesis:

For example: "Johnny Araya has a more effective social media strategy than Rodolfo Hernández"

Presumably because of his superior social media status at the moment

31

(2) Acquire

Collect the data that you need to test the hypothesis

How? Use APIs to harvest data about each candidate

Let's consider any Facebook posts for 2013

32

33

import json
import requests

for candidate in ['JohnnyArayaMonge', 'ElDoctor2014']:

    # Get the data (XXX stands in for a real access token)
    url = ('https://graph.facebook.com/{0}?'
           'fields=posts.limit(500)&access_token=XXX').format(candidate)
    content = requests.get(url).json()

    # Save the data
    with open(candidate + '.json', 'w') as f:
        f.write(json.dumps(content))

Python Source Code

(3) Analyze

34

Count, Filter, and Rank the Data

Some more Python source code to crunch the numbers

Extract Facebook likes and shares this year

Facebook Vitals

35

ElDoctor2014
Total Likes 37495
Num Posts since Jan 1, 2013 (of 500 possible) 436
Total Post Likes 155473
Total Post Shares 9684
Oldest Post in Batch 2013-03-15T00:40:21+0000
Num posts prior to Jan 1, 2013 0
Avg likes/post 356.589449541 (0.951032003044%)
Avg shares/post 22.2110091743 (0.059237256099%)
Post Types [(u'photo', 286), (u'link', 77), (u'status', 40), (u'video', 32), (u'swf', 1)]

JohnnyArayaMonge
Total Likes 50301
Num Posts since Jan 1, 2013 (of 500 possible) 205
Total Post Likes 176161
Total Post Shares 7542
Oldest Post in Batch 2013-01-01T07:18:43+0000
Num posts prior to Jan 1, 2013 190
Avg likes/post 859.32195122 (1.70835957778%)
Avg shares/post 36.7902439024 (0.0731401838978%)
Post Types [(u'photo', 149), (u'status', 38), (u'link', 13), (u'video', 5)]
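Vitals like these come down to filtering posts by date, summing, and dividing. The sketch below is not the code used in the talk. It assumes each post has already been flattened into a dict with the hypothetical keys `created_time`, `likes`, `shares`, and `type` (the raw Graph API response is nested differently), and the sample posts are invented.

```python
from collections import Counter

def facebook_vitals(posts, page_likes, since="2013-01-01"):
    """Summarize posts created on or after `since` (an ISO 8601 date string)."""
    # ISO 8601 timestamps in the same timezone sort correctly as plain strings.
    recent = [p for p in posts if p["created_time"] >= since]
    if not recent:
        return None
    total_likes = sum(p["likes"] for p in recent)
    total_shares = sum(p["shares"] for p in recent)
    avg_likes = total_likes / len(recent)
    avg_shares = total_shares / len(recent)
    return {
        "num_posts": len(recent),
        "num_prior_posts": len(posts) - len(recent),
        "total_post_likes": total_likes,
        "total_post_shares": total_shares,
        "avg_likes": avg_likes,
        # Average likes per post as a percentage of the page's total likes
        "avg_likes_pct": 100.0 * avg_likes / page_likes,
        "avg_shares": avg_shares,
        "post_types": Counter(p["type"] for p in recent).most_common(),
    }

# Invented sample data, for illustration only
posts = [
    {"created_time": "2013-03-15T00:40:21+0000", "likes": 300, "shares": 20, "type": "photo"},
    {"created_time": "2013-04-02T12:00:00+0000", "likes": 500, "shares": 40, "type": "link"},
    {"created_time": "2012-12-30T09:00:00+0000", "likes": 100, "shares": 5, "type": "photo"},
]
vitals = facebook_vitals(posts, page_likes=40000)
print(vitals)
```

The percentage in parentheses on the slide appears to be computed the same way: for ElDoctor2014, 356.59 average likes per post against 37,495 page likes is about 0.95%.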

(4) Summarize

Present the data in a concise and easily understood manner

Like a table...

36

37

Metric                        Araya         Hernández
Total Likes                   50,301        37,495
Posts since 1 Jan 13          205           436
Num Prior Posts               190+          0
Earliest Post                 1 Jan 2013    15 March 2013
Post Likes since 1 Jan 13     176,161       155,473
Post Shares since 1 Jan 13    7,542         9,684
Avg Likes per Post            859           356
Avg Shares per Post           36            22

38

Recall the hypothesis:

"Johnny Araya has a more effective social media strategy than Rodolfo Hernández because he has more Facebook and Twitter popularity"

What do we know now?

Hernández has Facebook vitals that are quite competitive with Araya

However, Hernández only joined Facebook ~6 months ago!

It would appear that Hernández has the more effective strategy

What is he doing to rise in popularity so quickly?

39

Reflect and Refine...

40

Comparison of Facebook Content

Other Candidates

41

Johnny Araya FB Posts

42

Rodolfo Hernández FB Posts

43

44

Past ~2 Months on Facebook

45

Candidate                             Aug 2013 FB Likes   Sept 2013 FB Likes   % Change
Johnny Araya                          50,301              53,809               6.97%
Otto Guevara Guth                     24,146              27,675               14.62%
José María Villalta Florez-Estrada    27,262              35,169               29.00%
Dr. Rodolfo Hernández                 37,495              38,298               2.14%
Luis Guillermo Solís Rivera           5,334               6,763                26.79%
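The percent-change column is simply (new - old) / old. A short sketch that reproduces it from the Facebook like counts above and ranks the candidates by growth; the variable and function names are illustrative.

```python
# (Aug 2013 likes, Sept 2013 likes), taken from the table above
facebook_likes = {
    "Johnny Araya": (50301, 53809),
    "Otto Guevara Guth": (24146, 27675),
    "José María Villalta Florez-Estrada": (27262, 35169),
    "Dr. Rodolfo Hernández": (37495, 38298),
    "Luis Guillermo Solís Rivera": (5334, 6763),
}

def percent_change(before, after):
    """Relative growth from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before

growth = {
    name: round(percent_change(before, after), 2)
    for name, (before, after) in facebook_likes.items()
}

# Fastest-growing pages first
for name, pct in sorted(growth.items(), key=lambda item: item[1], reverse=True):
    print("{0:<36}{1:>7.2f}%".format(name, pct))
```

Ranking by growth rather than by raw counts tells a different story: the pages with the most likes are not the ones growing fastest.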

Past ~3 Months on Twitter

46

Candidate                             Aug 2013   Sept 2013   % Change
Johnny Araya                          14,573     15,506      6.40%
Otto Guevara Guth                     114        159         39.47%
José María Villalta Florez-Estrada    8,160      8,990       10.17%
Dr. Rodolfo Hernández                 745        858         15.17%
Luis Guillermo Solís Rivera           1,192      1,487       24.75%

Facebook and Twitter Compared

47

Candidate                             % FB Change   % Twitter Change
Johnny Araya                          6.97%         6.40%
Otto Guevara Guth                     14.62%        39.47%
José María Villalta Florez-Estrada    29.00%        10.17%
Dr. Rodolfo Hernández                 2.14%         15.17%
Luis Guillermo Solís Rivera           26.79%        24.75%

Your Imagination Is the Only Limit

Analyze the comments that people are leaving on Facebook pages

Try to ascertain common Facebook fans or Twitter followers amongst candidates

Deduce demographics from social media by synthesizing public data

Theorize about potential "reach" or "influence" using social media

Analyze data in realtime
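Finding the fans or followers that two candidates share is a set intersection once the ID lists are in hand. The sketch below assumes the follower IDs have already been collected; the IDs shown are invented, and the Jaccard ratio is one possible way to express the overlap, not a metric taken from the talk.

```python
def audience_overlap(followers_a, followers_b):
    """Return the shared followers and the Jaccard similarity of two audiences."""
    a, b = set(followers_a), set(followers_b)
    common = a & b
    union = a | b
    jaccard = len(common) / len(union) if union else 0.0
    return common, jaccard

# Invented follower IDs, for illustration only
araya_followers = [101, 102, 103, 104, 105]
hernandez_followers = [104, 105, 106]

common, jaccard = audience_overlap(araya_followers, hernandez_followers)
print(sorted(common), round(jaccard, 2))
```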

48

Thinking about Reach

49

Think about "liking" and "following" as opt-ins to feeds

Remember: Interest Graphs

Arriving at effective metrics is trickier than it initially seems

Potential Twitter Influence

50

Metric                 Araya    Hernández
Followers              ~14k     ~750
Theoretical Reach      ~40M     ~550k
Reach (10)             490      673
Reach (100)            289      702
Reach (1000)           2782     X
Reach (10,000)         2832     X
"Suspect" Followers    3,246    94

See also http://wp.me/p3QiJd-2a
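The slide does not define its reach metrics. As a sketch under one plausible assumption, that theoretical reach is the sum of the follower counts of an account's followers, the calculation could look like this. The sample counts are invented, and `top_share` is a hypothetical helper added to show why a raw sum can mislead.

```python
def theoretical_reach(follower_counts):
    """Sum the follower counts of an account's followers (assumed definition)."""
    return sum(follower_counts)

def top_share(follower_counts):
    """Fraction of the theoretical reach contributed by the single largest follower."""
    total = theoretical_reach(follower_counts)
    return max(follower_counts) / total if total else 0.0

# Invented data: one entry per follower, giving that follower's own follower count
sample = [12, 250, 3, 18000, 95]

print(theoretical_reach(sample))
print(round(top_share(sample), 2))
```

In this made-up sample a single large account supplies nearly all of the reach, which is one reason a simple sum is a tricky metric.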

Potential Influence

51

Who are Candidates Following?

52

What are Candidates Tweeting?

53

Realtime Analysis

54

Monitor Twitter's firehose for realtime data using filters such as #Syria

Keep in mind the sheer volume of data can be considerable

Analysis at MiningTheSocialWeb.com
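The filtering idea can be sketched without a live connection. The generator below scans any iterable of tweet dicts and yields those tagged with a given hashtag. It assumes each tweet carries Twitter's `entities` structure with a `hashtags` list; the sample stream is invented, and connecting to the real streaming API (credentials, reconnects) is out of scope here.

```python
def filter_by_hashtag(tweets, hashtag):
    """Yield tweets whose hashtag entities include `hashtag` (case-insensitive)."""
    wanted = hashtag.lstrip("#").lower()
    for tweet in tweets:
        tags = [h["text"].lower()
                for h in tweet.get("entities", {}).get("hashtags", [])]
        if wanted in tags:
            yield tweet

# Invented stand-in for a live stream
stream = [
    {"text": "Breaking news from #Syria", "entities": {"hashtags": [{"text": "Syria"}]}},
    {"text": "Lunch time", "entities": {"hashtags": []}},
    {"text": "#syria update", "entities": {"hashtags": [{"text": "syria"}]}},
]

matches = list(filter_by_hashtag(stream, "#Syria"))
print(len(matches))
```

Because it is a generator, the function processes one tweet at a time and never holds the whole stream in memory, which matters given the volume involved.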

Mapping #Syria Tweets

55

See http://wp.me/p3QiJd-1t

Temporal Analysis on #Syria
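A temporal analysis usually starts by bucketing tweets into time windows. The sketch below counts tweets per hour, assuming each tweet has a `created_at` string in the format Twitter's REST API uses (for example "Fri Sep 20 14:05:11 +0000 2013"). The sample tweets are invented.

```python
from collections import Counter
from datetime import datetime

# Format of Twitter's created_at field, e.g. "Fri Sep 20 14:05:11 +0000 2013"
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def tweets_per_hour(tweets):
    """Return sorted (hour, count) pairs, with hours formatted as 'YYYY-MM-DD HH:00'."""
    counts = Counter()
    for tweet in tweets:
        created = datetime.strptime(tweet["created_at"], TWITTER_TIME_FORMAT)
        counts[created.strftime("%Y-%m-%d %H:00")] += 1
    return sorted(counts.items())

# Invented sample data
sample_tweets = [
    {"created_at": "Fri Sep 20 14:05:11 +0000 2013"},
    {"created_at": "Fri Sep 20 14:47:02 +0000 2013"},
    {"created_at": "Fri Sep 20 15:01:30 +0000 2013"},
]

hourly = tweets_per_hour(sample_tweets)
for hour, count in hourly:
    print(hour, count)
```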

56

Analyzing #Syria Tweet Entities
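Entity analysis is mostly counting. The sketch below tallies hashtags and user mentions across a batch of tweets, assuming the `entities` structure that Twitter attaches to each tweet (`hashtags` entries with a `text` field, `user_mentions` entries with a `screen_name` field). The sample tweets are invented.

```python
from collections import Counter

def count_entities(tweets):
    """Return (hashtag_counts, mention_counts) for a batch of tweets."""
    hashtags = Counter()
    mentions = Counter()
    for tweet in tweets:
        entities = tweet.get("entities", {})
        # Lowercase hashtags so that #Syria and #syria count as one tag
        hashtags.update(h["text"].lower() for h in entities.get("hashtags", []))
        mentions.update(m["screen_name"] for m in entities.get("user_mentions", []))
    return hashtags, mentions

# Invented sample data
batch = [
    {"entities": {"hashtags": [{"text": "Syria"}, {"text": "UN"}],
                  "user_mentions": [{"screen_name": "ptwobrussell"}]}},
    {"entities": {"hashtags": [{"text": "syria"}], "user_mentions": []}},
]

hashtags, mentions = count_entities(batch)
print(hashtags.most_common(5))
print(mentions.most_common(5))
```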

57

Closing Remarks

Software is the gift that keeps on giving

Code it up once, run it ad infinitum...

Code designed for one account will work for other accounts

Analysis is all about knowing what to count

Coding it up is just the dirty work

Start somewhere and then iteratively explore...then exploit

58

Aspire to Do Great Things

Predicting demographic data such as age or gender is possible for some languages

Time and space are fundamentals for grounding online discussions in reality.

Twitter is about as good as it gets for realtime topical analysis

Think of the world as signal producers and signal collectors

Monitoring breaking news events like #Syria

59

The Tip of the Iceberg

60

Stay in Touch

Website: http://MiningTheSocialWeb.com

Twitter: @ptwobrussell

FB: http://facebook.com/MiningTheSocialWeb

LinkedIn: http://linkedin.com/in/ptwobrussell

Email: ptwobrussell@gmail.com

61