O'Reilly Strata: Distilling Data Exhaust
Transcript of O'Reilly Strata: Distilling Data Exhaust
Distilling Data Exhaust
How to Surface Insights & Build Data Products
Feb 2, 2011
Peter Skomoroch
LinkedIn
@peteskomoroch
What is Data Exhaust?
My Delicious Tags
Words I use on Twitter
What can you do with it?
•Data has value
•I’ll share some lessons I’ve learned about how to extract that value
•We’ll go through a case study
Part 1) 10 Lessons Learned
1) Choose a meaningful problem
http://www.flickr.com/photos/aloshbennett/
•Find pain points
•Work on stuff that matters
•Look for underutilized data
2) Find or collect relevant data
•DataWrangling
•InfoChimps
•Pete Warden
•Factual, SimpleGeo
•Mechanical Turk
3) Raw is better than processed
•Normalization could be incorrect
•Data might be lost or corrupted
•Good approach: public.resource.org
http://www.flickr.com/photos/nedraggett/347280918/
4) Guide user input when you can
•Auto suggest
•Validate inputs
•Collect tags, votes
•Makes data scrubbing easier
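A minimal sketch of the auto-suggest and validation bullets above, assuming a small canonical vocabulary. `KNOWN_SKILLS`, `suggest`, and `validate` are hypothetical names for illustration, not anything shown in the talk.

```python
# Hypothetical canonical vocabulary; in practice this would come from your own data.
KNOWN_SKILLS = ["Hadoop", "Pig", "Python", "NumPy", "Gephi"]


def suggest(prefix, vocabulary=KNOWN_SKILLS, limit=5):
    """Return up to `limit` vocabulary entries starting with `prefix` (case-insensitive)."""
    p = prefix.strip().lower()
    if not p:
        return []
    return [term for term in vocabulary if term.lower().startswith(p)][:limit]


def validate(value, vocabulary=KNOWN_SKILLS):
    """Map free-text input to its canonical spelling, or None if it is unknown."""
    lookup = {term.lower(): term for term in vocabulary}
    return lookup.get(value.strip().lower())
```

Steering users toward canonical spellings at entry time means fewer variants ("hadoop", " Hadoop ", "HADOOP") to reconcile later.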
5) Solve easier problems first
http://where2conf.com/where2010/public/schedule/detail/12400
6) Build a baseline model quickly
•Iterate rapidly after baseline is done
•Measure accuracy on hold out test set
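One possible shape for a quick baseline, assuming a classification task: always predict the most common training label, then score it on a hold-out set. The toy data and the function names are made up for illustration; the talk does not specify a particular baseline.

```python
import random
from collections import Counter


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of `rows` and hold out the last `test_fraction` as a test set."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[:-n_test], shuffled[-n_test:]


def fit_majority_baseline(train_rows):
    """Baseline 'model': the single most common label in the training rows."""
    return Counter(label for _, label in train_rows).most_common(1)[0][0]


def accuracy(predicted_label, test_rows):
    """Fraction of hold-out rows whose label matches the constant prediction."""
    hits = sum(1 for _, label in test_rows if label == predicted_label)
    return hits / len(test_rows)


# Made-up toy data: (feature, label) pairs, roughly 75% "a" and 25% "b".
rows = [(i, "a" if i % 4 else "b") for i in range(100)]
train, test = train_test_split(rows)
baseline = fit_majority_baseline(train)
print("baseline accuracy on hold-out:", accuracy(baseline, test))
```

Any later model then has a concrete number to beat, measured on data it never saw during training.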
7) Test code on sample data
•Build logical sample data
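A sketch of what "logical sample data" might look like: a hand-built input small enough to verify by eye, with the edge cases written in on purpose. `count_skills` and `SAMPLE` are hypothetical and stand in for whatever job is being tested.

```python
from collections import Counter


def count_skills(profiles):
    """Count how many profiles mention each skill (at most once per profile)."""
    counts = Counter()
    for skills in profiles:
        counts.update(set(skills))
    return counts


# Hand-built sample with deliberate edge cases.
SAMPLE = [
    ["hadoop", "pig"],
    ["hadoop", "hadoop"],  # duplicate skill within one profile
    [],                    # empty profile
]


def test_count_skills():
    counts = count_skills(SAMPLE)
    assert counts["hadoop"] == 2  # two profiles, duplicates ignored
    assert counts["pig"] == 1
    assert counts["python"] == 0  # Counter returns 0 for unseen keys


test_count_skills()
```

The same sample can be rerun locally in seconds before the job is pointed at the full data set.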
8) Use Continuous Integration
https://github.com/matthayes/azkaban
9) Pick the right tool for the job
10) Developer productivity is key
•Fast Iterations: Python, Ruby, Pig
•Convention over configuration
•Embrace Github, DevOps, & EC2
•Currently using JRuby & Sinatra
SNA Team: sna-projects.com
Part 2) Case Study: Strata
Conference Insights
•I’d like to understand the audience at Strata
•What companies do we work for?
•What are the top skills at Strata?
•Do attendees cluster together based on skill?
Round up a Data Viz team
Use the right tools
•Data Crunching: Hadoop, Pig
•Statistical Work: Python, NumPy
•Visualization: Gephi
Find Some Data: Attendees
Add LinkedIn data
Extract Skills from Profiles
What are skills?
Extract
Build Hadoop Skill Graph
Discover
Core Talent Graph for “Hadoop”
Igor Perisic
The Talent Graph
We can combine skills with the attendee directory to better understand Strata
What are skills @Strata?
Extract skills for attendees
Top Skills @Strata
Information Overload
Relevance Measures
Jaccard Similarity
TF-IDF
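For reference, a minimal sketch of the two measures named above. The slides do not say which TF-IDF variant was used; this assumes term frequency normalized by document length and idf = log(N / df). The function names are illustrative.

```python
import math


def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def tf_idf(term, doc, corpus):
    """TF-IDF of `term` in `doc`, where `corpus` is a list of token lists.

    Assumed variant: tf = count / len(doc), idf = log(N / document frequency).
    """
    tf = doc.count(term) / len(doc) if doc else 0.0
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf
```

With this variant, a skill that appears in every profile gets an idf of zero, which is how common but uninformative skills drop out of a "relevant skills" ranking.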
Relevant Skills @Strata
Do attendees cluster together based on skills?
•Compute similarity of attendees based on skill vector distance
•Cluster similarities in Gephi
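A rough sketch of the two steps above, under stated assumptions: cosine similarity stands in for the unspecified "skill vector distance", the attendee data is invented, and the output is a plain `Source,Target,Weight` edge list of the kind Gephi can import as a spreadsheet. All names here are hypothetical.

```python
import csv
import io
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {skill: weight} dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_edges(vectors, threshold=0.1):
    """Return (source, target, weight) for every pair at or above `threshold`."""
    edges = []
    for a, b in combinations(sorted(vectors), 2):
        score = cosine(vectors[a], vectors[b])
        if score >= threshold:
            edges.append((a, b, round(score, 4)))
    return edges


def edges_to_csv(edges):
    """Serialize edges as CSV text with a Source,Target,Weight header."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    writer.writerows(edges)
    return buf.getvalue()


# Invented attendees with binary skill vectors.
attendees = {
    "attendee_1": {"hadoop": 1, "pig": 1},
    "attendee_2": {"hadoop": 1, "python": 1},
    "attendee_3": {"gephi": 1},
}
edges = similarity_edges(attendees)
print(edges_to_csv(edges))
```

Thresholding the pairwise scores keeps the graph sparse enough for a layout algorithm to pull similar attendees into visible clusters.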
More analysis on the way
•DJ Patil has a session tomorrow
•We’ll blog about additional Strata insights soon
Questions?
Peter Skomoroch
LinkedIn
@peteskomoroch
http://linkedin.com/in/peterskomoroch
Blog: DataWrangling.com
Appendix