HUG August 2010:Whats the buzz_bay_area

© Datameer, Inc 2010 What’s Really the Buzz? Stefan Groschupf [email protected]

description

Stefan Groschupf, the co-founder and CTO of Datameer, will discuss challenges in social media analytics and how to overcome them using big data analytics built on Hadoop in his talk “Social Media: What’s Really the Buzz?”. Identifying true thought leaders and influencers in social media conversations is becoming increasingly important, so that companies can better understand who is having an impact on their customers' buying decisions. Rather than counting mentions in limited subsets of social media data, organizations need a solution that can uncover complex relationships buried in massive volumes of social media data, and a way to bring in data from multiple online sources to determine the quality and effectiveness of user commentary.

Transcript of HUG August 2010:Whats the buzz_bay_area

Page 1: HUG August 2010:Whats the buzz_bay_area

© Datameer, Inc 2010

What’s Really the Buzz? Stefan Groschupf [email protected]

Page 2: HUG August 2010:Whats the buzz_bay_area


How it started...

Page 3: HUG August 2010:Whats the buzz_bay_area


How it started...

Page 4: HUG August 2010:Whats the buzz_bay_area


How it started...

Page 5: HUG August 2010:Whats the buzz_bay_area

Agenda

Page 6: HUG August 2010:Whats the buzz_bay_area

Street Cred

Long time open source contributor

http://github.com/sgroschupf/zkclient

http://github.com/sgroschupf/aws-tasks

Page 7: HUG August 2010:Whats the buzz_bay_area

Cubicle Cred

Cloud Computing Architect

Hadoop consultant at e.g.

Co-Founder/CEO Scale Unlimited

Co-Founder / CEO Datameer Inc.

Page 8: HUG August 2010:Whats the buzz_bay_area

Hadoop vs. DB

Hadoop (mergesort) vs. DB (b-tree/index)

Page 9: HUG August 2010:Whats the buzz_bay_area

Twitter

Page 10: HUG August 2010:Whats the buzz_bay_area

Stack

Page 11: HUG August 2010:Whats the buzz_bay_area

Stack

AWS

• ...

Hadoop

• ...

Datameer

• Spreadsheet compiles into MapReduce jobs, + much more

• Commercial

Gephi

• Graph visualization, GPL, NetBeans based

Page 12: HUG August 2010:Whats the buzz_bay_area

Expenses

What          Price         Description
Client EC2    $65.00        0.085/h * 24 * 31
S3 Storage    $7.50         0.15/GB/m * 50
EMR           $24.00        0.20 * 20 * 6
Datameer      ~$20          private beta
Gephi         $0.00         GPL
Development   $8.99 + tax   6pk Pilsner Urquell
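The products in the Description column can be checked with a few lines of arithmetic. This is only a sketch: the factors are copied from the slide, and what the 24, 31, 20 and 6 stand for is an assumption noted in the comments, since the slide does not say.

```python
# Factors copied from the Description column of the expenses slide.
# The meaning of the unlabeled factors is assumed, not stated on the slide.
line_items = {
    "Client EC2": (0.085, 24, 31),  # $/h * 24 * 31 (presumably hours/day, days/month)
    "S3 Storage": (0.15, 50),       # $/GB/month * 50 (presumably GB stored)
    "EMR": (0.20, 20, 6),           # factors as printed; units not given
}


def product(factors):
    """Multiply all factors of one line item."""
    total = 1.0
    for factor in factors:
        total *= factor
    return total


computed = {name: round(product(f), 2) for name, f in line_items.items()}

for name, cost in computed.items():
    print(f"{name:12s} ${cost:.2f}")
```

The S3 and EMR products match the Price column. The EC2 product comes out slightly below the $65.00 on the slide, which suggests that figure was rounded up.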

Page 13: HUG August 2010:Whats the buzz_bay_area

Pipeline

Client on EC2 -> S3 -> EMR -> Datameer -> CSV/GEXF -> Gephi

Page 14: HUG August 2010:Whats the buzz_bay_area

Twitter Client

TwitterClientThread

Basic Auth

Compression Thread

UploadThread

S3

EC2 Server

256 MB 50 MB
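The three threads on this slide suggest a producer/consumer pipeline: one thread reads the stream, one compresses chunks, one uploads them. The following is a minimal sketch of that shape, not the actual client. The stream is replaced by an in-memory list, the S3 upload by a plain Python list, and the chunk size is a tiny stand-in for the sizes on the slide; all function names are hypothetical.

```python
import gzip
import queue
import threading

CHUNK_BYTES = 1024   # tiny stand-in chunk size for the example
SENTINEL = None      # marks end of stream on each queue


def reader(lines, raw_q):
    """Stand-in for TwitterClientThread: batch incoming lines into chunks."""
    buf, size = [], 0
    for line in lines:
        buf.append(line)
        size += len(line)
        if size >= CHUNK_BYTES:
            raw_q.put(b"".join(buf))
            buf, size = [], 0
    if buf:
        raw_q.put(b"".join(buf))
    raw_q.put(SENTINEL)


def compressor(raw_q, gz_q):
    """Stand-in for the compression thread: gzip each raw chunk."""
    while (chunk := raw_q.get()) is not SENTINEL:
        gz_q.put(gzip.compress(chunk))
    gz_q.put(SENTINEL)


def uploader(gz_q, store):
    """Stand-in for UploadThread: a real client would write each blob to S3."""
    while (blob := gz_q.get()) is not SENTINEL:
        store.append(blob)


def run_pipeline(lines):
    raw_q, gz_q, store = queue.Queue(maxsize=4), queue.Queue(maxsize=4), []
    threads = [
        threading.Thread(target=reader, args=(lines, raw_q)),
        threading.Thread(target=compressor, args=(raw_q, gz_q)),
        threading.Thread(target=uploader, args=(gz_q, store)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return store


fake_stream = [b'{"text": "hello hadoop"}\n'] * 200
uploaded = run_pipeline(fake_stream)
```

Bounded queues keep a slow upload from letting raw data pile up in memory, which matters on a small EC2 instance.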

Page 15: HUG August 2010:Whats the buzz_bay_area

Twitter API

Streaming API
curl http://stream.twitter.com/1/statuses/sample.json -u stefan:somepwd
Streams are interrupted once in a while
5% of public statuses by default

Search API
curl http://search.twitter.com/search.json?q=datameer -u stefan:somepwd

150 requests per hour
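A limit of 150 requests per hour works out to one request every 24 seconds if requests are spaced evenly. A minimal sketch of such a throttle follows, assuming even spacing is acceptable. The Throttle and FakeClock names are illustrative and not part of any Twitter client library; the fake clock only keeps the example from actually sleeping.

```python
import time

REQUESTS_PER_HOUR = 150
MIN_INTERVAL_S = 3600 / REQUESTS_PER_HOUR  # 24 seconds between requests


class Throttle:
    """Block until at least min_interval seconds have passed since the last call."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self.last = None

    def wait(self):
        now = self.clock()
        if self.last is not None:
            remaining = self.min_interval - (now - self.last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last = now


class FakeClock:
    """Test double: sleeping just advances a counter."""

    def __init__(self):
        self.t = 0.0

    def now(self):
        return self.t

    def sleep(self, seconds):
        self.t += seconds


fake = FakeClock()
throttle = Throttle(MIN_INTERVAL_S, clock=fake.now, sleep=fake.sleep)
for _ in range(3):
    throttle.wait()  # a real client would issue its search request here
```

With the default arguments, `Throttle(MIN_INTERVAL_S)` uses the real monotonic clock and `time.sleep`.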


Page 16: HUG August 2010:Whats the buzz_bay_area

Some Twitter Fields

user_screen_name
user_followers_count
user_friends_count
user_statuses_count
user_location
created_at
text
source
geo_type
geo_longitude
geo_latitude
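The field names above read like a flattened view of the nested status JSON. A sketch of that flattening is shown below. The nested layout (a `user` object, and a `geo` object with `type` and `coordinates` in latitude, longitude order) is an assumption about the API payload of the time, and the sample status is made up.

```python
def flatten_status(status):
    """Map a nested status dict onto the flat field names from the slide."""
    user = status.get("user") or {}
    geo = status.get("geo") or {}  # assumed to be null when no location is attached
    lat, lon = geo.get("coordinates") or (None, None)
    return {
        "user_screen_name": user.get("screen_name"),
        "user_followers_count": user.get("followers_count"),
        "user_friends_count": user.get("friends_count"),
        "user_statuses_count": user.get("statuses_count"),
        "user_location": user.get("location"),
        "created_at": status.get("created_at"),
        "text": status.get("text"),
        "source": status.get("source"),
        "geo_type": geo.get("type"),
        "geo_longitude": lon,
        "geo_latitude": lat,
    }


# Made-up sample record, for illustration only.
sample_status = {
    "created_at": "Wed Aug 18 19:00:00 +0000 2010",
    "text": "Hadoop user group tonight",
    "source": "web",
    "user": {
        "screen_name": "example_user",
        "followers_count": 120,
        "friends_count": 80,
        "statuses_count": 950,
        "location": "Bay Area",
    },
    "geo": {"type": "Point", "coordinates": [37.77, -122.42]},
}

row = flatten_status(sample_status)
```

A flat row like this is easy to write out as CSV, which fits a spreadsheet-style tool downstream.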


Page 17: HUG August 2010:Whats the buzz_bay_area

Simple Analytics

Trending topics
Tweet spread timing
Topic Reach
Topic Location Reach
Etc.
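As an illustration of the first item, trending topics can be approximated by counting hashtags. This is a simplified sketch that assumes a topic is a hashtag; the talk does not say how topics were actually defined, and the tweets are made up.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#(\w+)")


def trending(texts, top_n=2):
    """Count hashtags case-insensitively and return the most frequent ones."""
    counts = Counter()
    for text in texts:
        counts.update(tag.lower() for tag in HASHTAG.findall(text))
    return counts.most_common(top_n)


# Made-up tweets, for illustration only.
tweets = [
    "Loving #hadoop at #HUG",
    "#Hadoop on EMR",
    "graphs with #gephi",
    "#hadoop + #gephi",
]

top_topics = trending(tweets)
```

On a cluster the same count is a classic map (emit each hashtag) and reduce (sum per hashtag) job.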


Page 19: HUG August 2010:Whats the buzz_bay_area

From Stream to Network

Understand tweets as a network plus metadata (time, geo, etc.)
ReTweet => Citation => Link => PageRank
Twitter friends (not accessible from Streaming API)

• Technically not realistic to get

• Quality?
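The chain ReTweet => Citation => Link => PageRank can be sketched as follows: each "RT @user" in a tweet becomes a directed link from the retweeting author to the cited user, and a plain power-iteration PageRank ranks the users. This is an illustrative sketch, not the job that was run for the talk; the regular expression, the damping factor of 0.85 and the sample tweets are assumptions.

```python
import re
from collections import defaultdict

RT_PATTERN = re.compile(r"\bRT @(\w+)")


def retweet_edges(tweets):
    """Turn (author, text) pairs into (retweeter, cited_user) links."""
    edges = []
    for author, text in tweets:
        for cited in RT_PATTERN.findall(text):
            edges.append((author, cited))
    return edges


def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank; rank of dangling nodes is spread uniformly."""
    nodes = sorted({n for edge in edges for n in edge})
    out = defaultdict(list)
    for src, dst in edges:
        out[src].append(dst)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        dangling = sum(rank[node] for node in nodes if not out[node])
        new = {node: (1 - damping) / n + damping * dangling / n for node in nodes}
        for src in nodes:
            for dst in out[src]:
                new[dst] += damping * rank[src] / len(out[src])
        rank = new
    return rank


# Made-up tweets, for illustration only.
tweets = [
    ("alice", "RT @carol: Hadoop rocks"),
    ("bob", "RT @carol: Hadoop rocks"),
    ("carol", "Hadoop rocks"),
    ("dave", "RT @alice: RT @carol: Hadoop rocks"),
]

ranks = pagerank(retweet_edges(tweets))
most_cited = max(ranks, key=ranks.get)
```

Unlike a raw retweet count, the rank of a user also depends on how highly ranked the users retweeting them are.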


Page 20: HUG August 2010:Whats the buzz_bay_area

Social Network Analytics

Degree: number of direct connections a node has (Diane)

Betweenness: broker, point of failure, high influence of flow (Heather)

Closeness: shortest path to all others (Fernando, Garth)

David Krackhardt
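The names on this slide come from David Krackhardt's kite network, a ten-person example graph. The sketch below computes the three measures on it in plain Python. The edge list is the commonly published version of the kite and is not taken from the slide itself; betweenness uses Brandes' algorithm for unweighted graphs.

```python
from collections import deque

# Krackhardt's kite network as an undirected adjacency list.
KITE = {
    "Andre":    ["Beverly", "Carol", "Diane", "Fernando"],
    "Beverly":  ["Andre", "Diane", "Ed", "Garth"],
    "Carol":    ["Andre", "Diane", "Fernando"],
    "Diane":    ["Andre", "Beverly", "Carol", "Ed", "Fernando", "Garth"],
    "Ed":       ["Beverly", "Diane", "Garth"],
    "Fernando": ["Andre", "Carol", "Diane", "Garth", "Heather"],
    "Garth":    ["Beverly", "Diane", "Ed", "Fernando", "Heather"],
    "Heather":  ["Fernando", "Garth", "Ike"],
    "Ike":      ["Heather", "Jane"],
    "Jane":     ["Ike"],
}


def degree(graph):
    return {v: len(neighbors) for v, neighbors in graph.items()}


def closeness(graph):
    """(n - 1) divided by the sum of shortest-path distances to all others."""
    result = {}
    for source in graph:
        dist, todo = {source: 0}, deque([source])
        while todo:
            v = todo.popleft()
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    todo.append(w)
        result[source] = (len(graph) - 1) / sum(dist.values())
    return result


def betweenness(graph):
    """Brandes' algorithm: shortest paths through each node, unnormalized."""
    score = dict.fromkeys(graph, 0.0)
    for s in graph:
        order, preds = [], {v: [] for v in graph}
        sigma = dict.fromkeys(graph, 0)
        sigma[s] = 1
        dist, todo = {s: 0}, deque([s])
        while todo:
            v = todo.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    todo.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(graph, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Undirected graph: every pair was counted from both ends.
    return {v: b / 2 for v, b in score.items()}


deg, close, between = degree(KITE), closeness(KITE), betweenness(KITE)
```

On this graph the highest degree belongs to Diane, the highest betweenness to Heather, and Fernando and Garth tie for the highest closeness, matching the names on the slide.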

Page 21: HUG August 2010:Whats the buzz_bay_area

Social Network Analytics

Centralization: a centralized network is dominated by one or a few very central nodes; single point of failure

Prestige: the term for a node's centrality

Reach: degree to which any member of a network can reach other members

Boundary Spanners: central in the overall network, bridging clusters; innovators, since information comes from multiple clusters

Path Length: the distances between pairs of nodes in the network. Average path length is the average of these distances between all pairs of nodes.
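Average path length as defined above can be computed with one breadth-first search per node. The sketch below assumes an unweighted, undirected, connected graph; the four-node chain is a made-up example.

```python
from collections import deque
from itertools import combinations


def distances_from(graph, source):
    """Breadth-first search: hop count from source to every reachable node."""
    dist, todo = {source: 0}, deque([source])
    while todo:
        v = todo.popleft()
        for w in graph[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                todo.append(w)
    return dist


def average_path_length(graph):
    """Mean shortest-path distance over all unordered node pairs."""
    dist = {v: distances_from(graph, v) for v in graph}
    pairs = list(combinations(graph, 2))
    return sum(dist[a][b] for a, b in pairs) / len(pairs)


# Made-up example: a chain a - b - c - d.
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
avg = average_path_length(chain)
```

For the chain the six pairwise distances are 1, 2, 3, 1, 2 and 1, so the average is 10/6.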

http://en.wikipedia.org/wiki/Social_network
http://www.orgnet.com

Page 23: HUG August 2010:Whats the buzz_bay_area

Yes, we're hiring too...

Page 24: HUG August 2010:Whats the buzz_bay_area

Resources...

http://www.slideshare.net/padday/the-real-life-social-network-v2 (Paul Adams)
http://xrime.sourceforge.net/

• SNA Metrics and Structures

www.datameer.com
twitter.com/datameer
http://github.com/[email protected]