HUG August 2010:Whats the buzz_bay_area
© Datameer, Inc 2010
What’s Really the Buzz?
Stefan Groschupf
How it started...
Agenda
Street Cred
Long time open source contributor
http://github.com/sgroschupf/zkclient
http://github.com/sgroschupf/aws-tasks
Cubicle Cred
Cloud Computing Architect
Hadoop consultant at e.g.
Co-Founder/CEO Scale Unlimited
Co-Founder / CEO Datameer Inc.
Hadoop vs. DB
Hadoop (merge sort) vs. DB (B-tree/index)
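The contrast on this slide can be sketched in a few lines: a batch job streams through whole sorted runs and merges them, while an index answers a single lookup without a scan. This is a toy illustration only. Python's `heapq.merge` stands in for the merge phase, `bisect` over a sorted key list stands in for a B-tree index, and the records are made up.

```python
import bisect
import heapq

# Two already-sorted runs, as separate tasks might emit them (made-up data).
run_a = [("apple", 1), ("cherry", 3), ("melon", 5)]
run_b = [("banana", 2), ("cherry", 4), ("peach", 6)]


def merge_runs(*runs):
    """Batch style: read every record of every sorted run once, in order."""
    return list(heapq.merge(*runs))


def index_lookup(sorted_keys, key):
    """Index style: binary search for one key, touching O(log n) entries."""
    pos = bisect.bisect_left(sorted_keys, key)
    if pos < len(sorted_keys) and sorted_keys[pos] == key:
        return pos
    return None


merged = merge_runs(run_a, run_b)
keys = [k for k, _ in merged]
print(merged)
print(index_lookup(keys, "melon"))
```

The merge touches all records, which suits full-data analytics; the lookup touches only a handful, which suits point queries.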
Stack
AWS
• ...
Hadoop
• ...
Datameer
• Spreadsheet compiles into MapReduce jobs, + much more
• Commercial
Gephi
• Graph visualization, GPL, NetBeans based
Expenses
What          Price         Description
Client EC2    $65.00        0.085/h * 24 * 31
S3 Storage    $7.50         0.15/GB/m * 50
EMR           $24.00        0.20 * 20 * 6
Datameer      ~$20          private beta
Gephi         $0.00         GPL
Development   $8.99 + tax   6pk Pilsner Urquell
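The arithmetic in the description column can be checked with a short sketch. The factors are copied from the slide. The unit comments (hours per day, days per month, GB stored) are a reading of those factors, not something the slide states. Note that the EC2 product comes to $63.24, a little under the $65.00 listed; the slide does not say what accounts for the difference.

```python
# (name, price listed on the slide in USD, factors from the description column)
items = [
    ("Client EC2", 65.00, (0.085, 24, 31)),  # assumed: $/hour * hours/day * days
    ("S3 Storage", 7.50, (0.15, 50)),        # assumed: $/GB/month * GB stored
    ("EMR", 24.00, (0.20, 20, 6)),           # units not stated on the slide
]


def product(factors):
    """Multiply the factors and round to whole cents."""
    result = 1.0
    for factor in factors:
        result *= factor
    return round(result, 2)


computed = {name: product(factors) for name, _, factors in items}
for name, listed, _ in items:
    print(f"{name}: listed ${listed:.2f}, computed ${computed[name]:.2f}")
```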
Pipeline
Client on EC2 -> S3 -> EMR / Datameer -> CSV/GEXF -> Gephi
Twitter Client
EC2 Server:
TwitterClientThread (Basic Auth) -> Compression Thread -> UploadThread -> S3
Diagram labels: 256 MB, 50 MB
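A three-stage threaded client like the one in this diagram could look roughly like the sketch below. It is not the original client: the function names, queue sizes, and sentinel are hypothetical, the "stream" is an in-memory list, and the upload stage appends to a list instead of talking to S3.

```python
import gzip
import queue
import threading

STOP = object()  # sentinel that tells the next stage to shut down


def reader(source, raw_q):
    """Stands in for TwitterClientThread: pushes raw chunks downstream."""
    for chunk in source:
        raw_q.put(chunk)
    raw_q.put(STOP)


def compressor(raw_q, packed_q):
    """Stands in for the compression thread: gzips each chunk."""
    while (chunk := raw_q.get()) is not STOP:
        packed_q.put(gzip.compress(chunk))
    packed_q.put(STOP)


def uploader(packed_q, store):
    """Stands in for UploadThread: collects blobs in a list instead of S3."""
    while (blob := packed_q.get()) is not STOP:
        store.append(blob)


def run_pipeline(source):
    # Bounded queues apply back-pressure if a later stage falls behind.
    raw_q, packed_q, store = queue.Queue(maxsize=4), queue.Queue(maxsize=4), []
    threads = [
        threading.Thread(target=reader, args=(source, raw_q)),
        threading.Thread(target=compressor, args=(raw_q, packed_q)),
        threading.Thread(target=uploader, args=(packed_q, store)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return store


fake_stream = [b'{"text": "hello hadoop"}\n' * 1000 for _ in range(3)]
uploaded = run_pipeline(fake_stream)
print(len(uploaded), sum(map(len, fake_stream)), sum(map(len, uploaded)))
```

Decoupling the stages with queues keeps the reader draining the stream even while a slow upload is in progress.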
Twitter API
Streaming API
curl http://stream.twitter.com/1/statuses/sample.json \
  -ustefan:somepwd
Streams; interrupted once in a while
5% of public statuses by default
Search API
curl http://search.twitter.com/search.json?q=datameer \
  -ustefan:somepwd
150 requests per hour
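The two constraints on this slide (a stream that drops once in a while, and an hourly request quota) suggest a reconnect loop with backoff and a minimum spacing between search calls. The sketch below is an assumption about how a client might handle them, not the client from the talk. `connect` is a hypothetical callable that returns an iterator of lines, and the fake stream exists only to make the example runnable offline.

```python
import time


def min_interval_seconds(requests_per_hour):
    """Smallest spacing between calls that stays within an hourly quota."""
    return 3600.0 / requests_per_hour


def consume_with_reconnect(connect, handle, max_retries=5, base_delay=1.0,
                           sleep=time.sleep):
    """Read lines from connect(); on a dropped connection, back off and retry."""
    failures = 0
    while failures <= max_retries:
        try:
            for line in connect():
                handle(line)
                failures = 0  # progress was made, so reset the backoff
            return  # stream ended cleanly
        except ConnectionError:
            failures += 1
            sleep(base_delay * 2 ** (failures - 1))  # 1s, 2s, 4s, ...
    raise RuntimeError("giving up after repeated connection failures")


attempts = []


def fake_connect():
    """Fake stream: the first connection drops after one line."""
    attempts.append(1)
    yield "tweet-%d-a" % len(attempts)
    if len(attempts) == 1:
        raise ConnectionError("stream interrupted")
    yield "tweet-%d-b" % len(attempts)


received = []
consume_with_reconnect(fake_connect, received.append, sleep=lambda s: None)
print(received)
print(min_interval_seconds(150))
```

At 150 requests per hour, the spacing works out to one request every 24 seconds.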
Some Twitter Fields
user_screen_name
user_followers_count
user_friends_count
user_statuses_count
user_location
created_at
text
source
geo_type
geo_longitude
geo_latitude
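Flat field names like these suggest that each nested status object was flattened into one row. A minimal sketch of such a mapping follows. The nested JSON layout (`user`, `geo`, and a `[latitude, longitude]` coordinate order) is an assumption about the status format, and the sample values are made up.

```python
import json


def flatten_tweet(raw_json):
    """Map one status JSON object onto the flat field names from the slide."""
    status = json.loads(raw_json)
    user = status.get("user") or {}
    geo = status.get("geo") or {}
    coords = geo.get("coordinates") or [None, None]
    return {
        "user_screen_name": user.get("screen_name"),
        "user_followers_count": user.get("followers_count"),
        "user_friends_count": user.get("friends_count"),
        "user_statuses_count": user.get("statuses_count"),
        "user_location": user.get("location"),
        "created_at": status.get("created_at"),
        "text": status.get("text"),
        "source": status.get("source"),
        "geo_type": geo.get("type"),
        "geo_latitude": coords[0],   # assumed order: [latitude, longitude]
        "geo_longitude": coords[1],
    }


sample = json.dumps({
    "created_at": "Wed Aug 18 19:00:00 +0000 2010",
    "text": "At the Hadoop user group",
    "source": "web",
    "user": {"screen_name": "example_user", "followers_count": 10,
             "friends_count": 20, "statuses_count": 30, "location": "Bay Area"},
    "geo": {"type": "Point", "coordinates": [37.77, -122.42]},
})
flat = flatten_tweet(sample)
print(flat)
```

Statuses without a `geo` object simply yield `None` for the three geo fields.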
Simple Analytics
Trending topics
Tweet spread timing
Topic Reach
Topic Location Reach
Etc.
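As one example of these analytics, trending topics can be approximated by counting hashtags. The sketch below runs on a small in-memory list rather than as a MapReduce job, treats hashtags as "topics" (an assumption; the talk does not define the term), and uses made-up tweets.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#(\w+)")


def trending_topics(texts, top_n=3):
    """Return the most frequent hashtags, counted once per tweet."""
    counts = Counter()
    for text in texts:
        # A set keeps one repetitive tweet from inflating a topic's count.
        counts.update({tag.lower() for tag in HASHTAG.findall(text)})
    return counts.most_common(top_n)


tweets = [
    "Loving #Hadoop at #HUG",
    "#hadoop on EMR",
    "#gephi graphs",
    "#Hadoop #hadoop again",
]
print(trending_topics(tweets))
```

In a MapReduce setting the same idea maps to emitting `(hashtag, 1)` pairs and summing them per key.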
Demo
http://en.wikipedia.org/wiki/Social_network http://www.orgnet.com
From Stream to Network
Understand tweets as network plus metadata (time, geo, etc.)
ReTweet => Citation => Link => PageRank
Twitter friends (not accessible from Streaming API)
• Technically not realistic to get
• Quality?
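The chain "ReTweet => Citation => Link => PageRank" can be sketched as follows: each retweet becomes a directed link from the retweeter to the cited author, and PageRank is run over those links. This is an illustrative sketch, not the pipeline shown in the demo. The `RT @user` convention, the damping factor of 0.85, and the sample tweets are assumptions.

```python
import re

RETWEET = re.compile(r"\bRT @(\w+)")


def retweet_edges(tweets):
    """Yield (retweeter, cited_author) links from (screen_name, text) pairs."""
    for screen_name, text in tweets:
        match = RETWEET.search(text)
        if match:
            yield screen_name.lower(), match.group(1).lower()


def pagerank(edges, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over a list of directed edges."""
    edges = list(edges)
    nodes = {node for edge in edges for node in edge}
    if not nodes:
        return {}
    out_links = {node: set() for node in nodes}
    for src, dst in edges:
        out_links[src].add(dst)
    rank = dict.fromkeys(nodes, 1.0 / len(nodes))
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread over all nodes.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1.0 - damping + damping * dangling) / len(nodes)
        new_rank = dict.fromkeys(nodes, base)
        for src, targets in out_links.items():
            for dst in targets:
                new_rank[dst] += damping * rank[src] / len(targets)
        rank = new_rank
    return rank


sample_tweets = [
    ("alice", "RT @stefan: Hadoop talk tonight"),
    ("bob", "RT @stefan: Hadoop talk tonight"),
    ("carol", "RT @alice: RT @stefan: Hadoop talk tonight"),
    ("stefan", "Hadoop talk tonight"),
]
ranks = pagerank(retweet_edges(sample_tweets))
for name, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
```

Only the first `RT @user` in a tweet is used, so a retweet of a retweet links to the most recent citer.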
Social Network Analytics
Degree: number of direct connections a node has (Diane)
Betweenness: broker, point of failure, high influence on flow (Heather)
Closeness: shortest path to all others (Fernando, Garth)
David Krackhardt
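The three measures can be computed in plain Python on a small graph. The edge list below is the commonly published ten-node "kite" network attributed to David Krackhardt. The slide names only Diane, Heather, Fernando, and Garth, so the full edge list is an assumption brought in for illustration. Betweenness uses Brandes' algorithm for unweighted graphs.

```python
from collections import deque

EDGES = [  # assumed kite network; not spelled out on the slide
    ("Andre", "Beverly"), ("Andre", "Carol"), ("Andre", "Diane"),
    ("Andre", "Fernando"), ("Beverly", "Diane"), ("Beverly", "Ed"),
    ("Beverly", "Garth"), ("Carol", "Diane"), ("Carol", "Fernando"),
    ("Diane", "Ed"), ("Diane", "Fernando"), ("Diane", "Garth"),
    ("Ed", "Garth"), ("Fernando", "Garth"), ("Fernando", "Heather"),
    ("Garth", "Heather"), ("Heather", "Ike"), ("Ike", "Jane"),
]
graph = {}
for a, b in EDGES:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)


def degree(g):
    return {n: len(nbs) for n, nbs in g.items()}


def bfs_distances(g, source):
    dist, todo = {source: 0}, deque([source])
    while todo:
        v = todo.popleft()
        for w in g[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                todo.append(w)
    return dist


def closeness(g):
    """(reachable nodes) / (sum of shortest-path distances) per node."""
    result = {}
    for n in g:
        dist = bfs_distances(g, n)
        total = sum(dist.values())
        result[n] = (len(dist) - 1) / total if total else 0.0
    return result


def betweenness(g):
    """Brandes' algorithm; unnormalized, for an undirected graph."""
    score = dict.fromkeys(g, 0.0)
    for s in g:
        order, preds = [], {n: [] for n in g}
        sigma = dict.fromkeys(g, 0)
        sigma[s] = 1
        dist, todo = {s: 0}, deque([s])
        while todo:
            v = todo.popleft()
            order.append(v)
            for w in g[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    todo.append(w)
                if dist[w] == dist[v] + 1:  # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(g, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    return {n: b / 2 for n, b in score.items()}  # each pair was counted twice


def top(scores):
    best = max(scores.values())
    return sorted(n for n, v in scores.items() if v == best)


print("degree:", top(degree(graph)))
print("betweenness:", top(betweenness(graph)))
print("closeness:", top(closeness(graph)))
```

On this graph the top-scoring nodes match the names given on the slide for each measure.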
Social Network Analytics
Centralization: centralized network dominated by one or a few very central nodes; single point of failure
Prestige: the term for a node's centrality
Reach: degree to which any member of a network can reach other members
Boundary Spanners: central in the overall network, bridging clusters; innovators, since information comes from multiple clusters
Path Length: the distances between pairs of nodes in the network. Average path length is the average of these distances between all pairs of nodes.
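Average path length, as defined above, can be computed with one breadth-first search per node. The sketch assumes an unweighted, undirected graph stored as an adjacency dictionary and averages over connected pairs only; the four-node chain is made up.

```python
from collections import deque


def shortest_path_lengths(graph, source):
    """Breadth-first search: hop counts from source to every reachable node."""
    dist, todo = {source: 0}, deque([source])
    while todo:
        node = todo.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                todo.append(neighbor)
    return dist


def average_path_length(graph):
    """Mean shortest-path distance over all connected pairs of distinct nodes."""
    total = pairs = 0
    for source in graph:
        for target, d in shortest_path_lengths(graph, source).items():
            if target != source:
                total += d
                pairs += 1
    return total / pairs if pairs else 0.0


chain = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
print(average_path_length(chain))
```

Each pair is visited once in each direction, which leaves the average unchanged.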
Demo
Yes, we're hiring too...
Resources...
http://www.slideshare.net/padday/the-real-life-social-network-v2 (Paul Adams)
http://xrime.sourceforge.net/
• SNA Metrics and Structures
www.datameer.com
twitter.com/datameer
http://github.com/[email protected]