Back to Square One: Building a Data Science Team from Scratch
BUILDING DATA SCIENCE TEAMS FROM SCRATCH
Klaas Bosteels @klbostee
MY CAREER PATH SO FAR
2007: Began working with big data as PhD student
2009: Embarked on a data science career at Last.fm
2011: Joined Massive Media as Lead Data Scientist
Last.fm: data company at heart; one of the earliest Hadoop adopters worldwide; inventors of Ketama; organised the first “NoSQL” meetup in SF.
Massive Media: huge audience and tremendous potential, but a data science newcomer at the time.
MY TEAM AT MASSIVE MEDIA
Currently 4 permanent people (+ interns!), so not huge just yet
Relatively big and growing faster than anticipated though
OUR MISSION IS HELPING THE COMPANY...
MEASURE metrics dashboards
EVALUATE data-driven testing
DECIDE ad hoc data insights
IMPROVE e.g. abuse detection
EXTEND new product features
PROMOTE PR via data porn
Slide annotations:
• Higher risk but bigger returns
• Very wide range of tasks
STEP 1
FOLLOW THE MONEY
photo by Chris Isherwood
BOOTSTRAP BY SAVING OR GAINING MONEY
You need to get some capital to get started
Saving money tends to be easier in practice
Real-world example:
• Analyzing CDN logs unveiled abuse
• Stopping the abuse greatly reduced the bills
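The kind of CDN log analysis mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the three-field log format, the sample lines, the function names and the 50% traffic-share threshold are all hypothetical and not taken from the talk.

```python
from collections import Counter

# Hypothetical log format (an assumption): "<client_ip> <path> <bytes_sent>"
SAMPLE_LOG = """\
10.0.0.1 /img/a.jpg 2000
10.0.0.2 /img/b.jpg 1500
10.0.0.9 /video/x.mp4 900000
10.0.0.9 /video/x.mp4 950000
10.0.0.3 /img/a.jpg 2500
"""


def bytes_per_client(lines):
    """Sum the bytes served to each client, skipping malformed lines."""
    totals = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 3 or not parts[2].isdigit():
            continue  # ignore lines that do not match the assumed format
        client, _path, size = parts
        totals[client] += int(size)
    return totals


def heavy_clients(totals, share_threshold=0.5):
    """Return (client, bytes) pairs whose share of all traffic meets the threshold."""
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    return [
        (client, served)
        for client, served in totals.most_common()
        if served / grand_total >= share_threshold
    ]


totals = bytes_per_client(SAMPLE_LOG.splitlines())
suspects = heavy_clients(totals)
print(suspects)
```

A client responsible for most of the bandwidth is not necessarily abusive, but it is a cheap first place to look when the goal is reducing the bill.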
STEP 2
EMBRACE HADOOP
photo by Doug Kukurudza
HADOOP
Not the holy grail, but deserves a central role
It has a vibrant community and is proven to be:
ECONOMICAL runs on commodity hardware
SCALABLE smart distributed processing
MAINTAINABLE very robust and fault-tolerant
FLEXIBLE predefined schemas not required
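To make the "scalable" and "flexible" points concrete, here is a minimal local simulation of the map, shuffle and reduce phases over schema-less log lines. It does not use Hadoop itself; the sample events, the `run_local` driver and the function names are assumptions for illustration only.

```python
import json
from collections import defaultdict

# Events with differing fields: no predefined schema is required.
RAW_EVENTS = [
    '{"user": "ann", "action": "login"}',
    '{"user": "bob", "action": "login", "device": "mobile"}',
    '{"user": "ann", "action": "upload", "bytes": 1024}',
    'not json at all',
    '{"user": "bob", "action": "login"}',
]


def mapper(line):
    """Emit (action, 1) for every parseable event that has an action."""
    try:
        event = json.loads(line)
    except ValueError:
        return  # tolerate bad records instead of failing the whole job
    if isinstance(event, dict) and "action" in event:
        yield event["action"], 1


def reducer(key, values):
    """Sum the counts collected for one key."""
    yield key, sum(values)


def run_local(lines, mapper, reducer):
    """Hypothetical single-process stand-in for the map/shuffle/reduce phases."""
    groups = defaultdict(list)
    for line in lines:  # map phase
        for key, value in mapper(line):
            groups[key].append(value)  # shuffle: group values by key
    results = {}
    for key in sorted(groups):  # reduce phase
        for out_key, out_value in reducer(key, groups[key]):
            results[out_key] = out_value
    return results


action_counts = run_local(RAW_EVENTS, mapper, reducer)
print(action_counts)
```

Because the mapper only looks at the fields it needs, new fields can appear in the logs without breaking existing jobs, and because each line is mapped independently, the work can be spread over many machines.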
STEP 3
BUILD DASHBOARDS
photo by Dawn Hopkins
STATS PIPELINE BASED ON HADOOP
[Diagram] Components: Log collector, HDFS, MapReduce, HBase, Dashboards
Flow labels: continuous, in batches
Realtime processing (cfr. “lambda architecture”, coined by @nathanmarz)
Ad-hoc results
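The "lambda architecture" idea referenced in this pipeline, combining a periodically recomputed batch view with a realtime view of recent events, can be sketched in a few lines. Everything here is an in-memory stand-in: the `PageViewStats` class and its methods are hypothetical, with a list playing the role of the log store, and counters playing the role of the batch and realtime views.

```python
from collections import Counter


class PageViewStats:
    """Toy page-view counter that merges a batch view with a realtime view."""

    def __init__(self):
        self.master_log = []            # append-only record of all events
        self.batch_view = Counter()     # recomputed from scratch in batches
        self.realtime_view = Counter()  # covers events since the last batch run

    def record(self, page):
        """Store the raw event and update the realtime view immediately."""
        self.master_log.append(page)
        self.realtime_view[page] += 1

    def run_batch(self):
        """Recompute the batch view from the full log and reset the realtime view."""
        self.batch_view = Counter(self.master_log)
        self.realtime_view = Counter()

    def views(self, page):
        """Answer a query by merging both views."""
        return self.batch_view[page] + self.realtime_view[page]


stats = PageViewStats()
for page in ["/home", "/home", "/profile"]:
    stats.record(page)
before_batch = stats.views("/home")

stats.run_batch()
stats.record("/home")
after_batch = stats.views("/home")
print(before_batch, after_batch)
```

Recomputing the batch view from the raw log keeps the system easy to repair after bugs, while the realtime view keeps dashboards current between batch runs.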
PYTHON IS AN AWESOME JACK OF ALL TRADES
It is great for building dashboards:
• Hadoop support: Dumbo, Python UDFs for Pig, ...
• Several amazing web frameworks, e.g. Flask
• Likewise for drawing graphs, e.g. PyCairo
And it covers many other data science needs as well:
• Scripting, prototyping and full-blown programming
• NumPy, SciPy, PyLab, Scikit-learn, Pandas, ...
STEP 4
ASSEMBLE A TEAM
photo by Jean-François Schmitz
THE SECRET IS IN THE MIX
Hadoop’s tricks also apply to data science teams
• Avoid specialisation to allow easy distribution and scaling
• Exploit data locality by hiring people with a wide skill set
Great Data Scientists have the right mix of skills
• Hackers with solid technical background
• Analytical mind that knows statistics and machine learning
• Clever and creative in everything they do
STEP 5
EXPLORE & INNOVATE
photo by NASAr
SOME TIPS AND TRICKS
Dare to fail and/or start from estimates
Introduce data exploration/innovation days
• Basically 20% time devoted to playing with data
• Incorporate brainstorming
• Encourage collaboration
Communicate findings to the rest of the company
• Fun and silliness are allowed
• Prototype early and often
FIVE SIMPLE STEPS IS ALL IT TAKES
1. FOLLOW THE MONEY
2. EMBRACE HADOOP
3. BUILD DASHBOARDS
4. ASSEMBLE A TEAM
5. EXPLORE & INNOVATE
Thanks! Questions?