Back to Square One: Building a Data Science Team from Scratch

BUILDING DATA SCIENCE TEAMS FROM SCRATCH Klaas Bosteels @klbostee

description

Generally speaking, big data and data science originated in the west and are coming to Europe with a bit of a delay. There is at least one exception though: the London-based music discovery website Last.fm is a data company at heart and has been doing large-scale data processing and analysis for years. It started using Hadoop in early 2006, for instance, making it one of the earliest adopters worldwide. When I left Last.fm to join Massive Media, the social media company behind Netlog.com and Twoo.com, I basically moved from a data science forerunner to a newcomer. Massive Media had at least as much data to play with and tremendous potential, but they were not doing much with it yet. The data science team had to be built from the ground up, and every step had to be argued for and justified along the way. Having re-evaluated everything I learned at Last.fm and started over with a clean slate at Massive Media, I developed a pretty clear perspective on how to find good data scientists, what they should be doing, what tools they should be using, and how to organize them to work together efficiently as a team. That is precisely what I would like to share in this talk.

Transcript of Back to Square One: Building a Data Science Team from Scratch

Page 1: Back to Square One: Building a Data Science Team from Scratch

BUILDING DATA SCIENCE TEAMS FROM SCRATCH

Klaas Bosteels @klbostee

Page 2: Back to Square One: Building a Data Science Team from Scratch

MY CAREER PATH SO FAR

2007: Began working with big data as PhD student

2009: Embarked on a data science career at Last.fm

2011: Joined Massive Media as Lead Data Scientist

Last.fm: Data company at heart; one of the earliest Hadoop adopters worldwide; inventors of Ketama; organised first “NoSQL” meetup in SF.

Massive Media: Huge audience and tremendous potential, but data science newcomer at the time.

Page 3: Back to Square One: Building a Data Science Team from Scratch

MY TEAM AT MASSIVE MEDIA

Currently 4 permanent people (+ interns!), so not huge just yet

Relatively big and growing faster than anticipated though

Page 4: Back to Square One: Building a Data Science Team from Scratch

OUR MISSION IS HELPING THE COMPANY...

MEASURE metrics dashboards

EVALUATE data-driven testing

DECIDE ad hoc data insights

IMPROVE e.g. abuse detection

EXTEND new product features

PROMOTE PR via data porn

Page 6: Back to Square One: Building a Data Science Team from Scratch

OUR MISSION IS HELPING THE COMPANY...

MEASURE metrics dashboards

EVALUATE data-driven testing

DECIDE ad hoc data insights

IMPROVE e.g. abuse detection

EXTEND new product features

PROMOTE PR via data porn

higher risk but bigger returns

very wide range of tasks

Page 8: Back to Square One: Building a Data Science Team from Scratch

BOOTSTRAP BY SAVING OR GAINING MONEY

You need to get some capital to get started

Saving money tends to be easier in practice

Real-world example:

• Analyzing CDN logs unveiled abuse

• Stopping the abuse greatly reduced the bills
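
As a rough illustration of the real-world example above, the sketch below totals the bytes served per referrer from CDN access logs and flags referrers that account for an outsized share of the traffic. The log format (whitespace-separated `ip referrer bytes`), the function names, and the 50% threshold are all assumptions made for illustration; the talk does not describe how the actual analysis was done.

```python
from collections import Counter

def bytes_per_referrer(lines):
    """Sum bytes served per referrer from 'ip referrer bytes' log lines (assumed format)."""
    totals = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 3 or not parts[2].isdigit():
            continue  # skip malformed lines
        _ip, referrer, size = parts
        totals[referrer] += int(size)
    return totals

def flag_heavy_referrers(totals, max_share=0.5):
    """Return referrers whose share of total bytes exceeds max_share (hypothetical threshold)."""
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    return sorted(r for r, b in totals.items() if b / grand_total > max_share)

# Made-up sample data, not real log lines.
sample_log = [
    "10.0.0.1 example.org 500",
    "10.0.0.2 hotlinker.example 9000",
    "10.0.0.3 example.org 300",
    "garbage line",
    "10.0.0.4 hotlinker.example 7000",
]
totals = bytes_per_referrer(sample_log)
suspects = flag_heavy_referrers(totals)
print(suspects)
```

At real CDN volumes the same per-referrer aggregation would more likely run as a batch job than in a single process, but the logic stays the same.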

Page 10: Back to Square One: Building a Data Science Team from Scratch

HADOOP

Not the holy grail, but deserves a central role

It has a vibrant community and is proven to be:

ECONOMICAL runs on commodity hardware

SCALABLE smart distributed processing

MAINTAINABLE very robust and fault-tolerant

FLEXIBLE predefined schemas not required

Page 11: Back to Square One: Building a Data Science Team from Scratch

STEP 3

BUILD DASHBOARDS

photo by Dawn Hopkins

Page 12: Back to Square One: Building a Data Science Team from Scratch

STATS PIPELINE BASED ON HADOOP

Diagram components: Log collector, HDFS, MapReduce, HBase, Dashboards

Flow labels: continuous, in batches

Page 14: Back to Square One: Building a Data Science Team from Scratch

STATS PIPELINE BASED ON HADOOP

Diagram components: Log collector, HDFS, MapReduce, HBase, Realtime processing, Dashboards, Ad-hoc results

Flow labels: continuous, in batches

Cf. “lambda architecture”, coined by @nathanmarz
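
The pipeline pairs a batch path with a realtime path, in the spirit of the lambda architecture. A minimal sketch of how a dashboard query might combine the two is given here, with in-memory counters standing in for HBase tables. All names (`build_batch_view`, `update_realtime_view`, `query`), the event shape, and the cutoff are hypothetical and not taken from the talk.

```python
from collections import Counter

def build_batch_view(events):
    """Batch path: recompute per-metric counts from the stored event log."""
    return Counter(metric for metric, _ts in events)

def update_realtime_view(view, event):
    """Realtime path: count events that no batch run has covered yet."""
    metric, _ts = event
    view[metric] += 1

def query(metric, batch_view, realtime_view):
    """Dashboard read: batch and realtime counts added together."""
    return batch_view[metric] + realtime_view[metric]

# Events are (metric, timestamp) pairs; the last batch run covered timestamps < 100.
log = [("signup", 10), ("login", 20), ("login", 30), ("login", 120), ("signup", 130)]
batch_cutoff = 100

batch_view = build_batch_view(e for e in log if e[1] < batch_cutoff)
realtime_view = Counter()
for event in log:
    if event[1] >= batch_cutoff:
        update_realtime_view(realtime_view, event)

print(query("login", batch_view, realtime_view))
```

When the next batch run completes, the realtime counts it now covers can be discarded, so errors in the realtime path are eventually overwritten by the batch results.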

Page 15: Back to Square One: Building a Data Science Team from Scratch

PYTHON IS AN AWESOME JACK OF ALL TRADES

It is great for building dashboards:

• Hadoop support: Dumbo, Python UDFs for Pig, ...

• Several amazing web frameworks, e.g. Flask

• Likewise for drawing graphs, e.g. PyCairo

And it covers many other data science needs as well:

• Scripting, prototyping and full-blown programming

• NumPy, SciPy, PyLab, Scikit-learn, Pandas, ...
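
To give a flavour of the Hadoop support mentioned above, here is a small sketch of a page-view count written as a mapper and a reducer in the generator style that Dumbo scripts use. The log format (`user page`) and the local runner `run_locally` are assumptions for illustration; the runner only imitates Hadoop's sort-and-group step so the example runs without a cluster.

```python
from itertools import groupby
from operator import itemgetter

def mapper(key, value):
    """Emit (page, 1) for each log line of the assumed form 'user page'."""
    fields = value.split()
    if len(fields) == 2:
        yield fields[1], 1

def reducer(key, values):
    """Sum the counts collected for one page."""
    yield key, sum(values)

def run_locally(lines):
    """Hypothetical stand-in for the Hadoop shuffle: map, sort by key, group, reduce."""
    mapped = [kv for offset, line in enumerate(lines) for kv in mapper(offset, line)]
    mapped.sort(key=itemgetter(0))
    results = {}
    for key, group in groupby(mapped, key=itemgetter(0)):
        for out_key, out_value in reducer(key, (v for _k, v in group)):
            results[out_key] = out_value
    return results

# Made-up sample input; the last line is malformed and gets skipped.
page_views = run_locally(["alice /home", "bob /home", "alice /profile", "bad"])
print(page_views)
```

In an actual Dumbo script the two functions would normally be handed to `dumbo.run(mapper, reducer)` rather than to a local runner; that call is not exercised here.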

Page 16: Back to Square One: Building a Data Science Team from Scratch

STEP 4

ASSEMBLE A TEAM

photo by Jean-François Schmitz

Page 17: Back to Square One: Building a Data Science Team from Scratch

THE SECRET IS IN THE MIX

Hadoop’s tricks also apply to data science teams

• Avoid specialisation to allow easy distribution and scaling

• Exploit data locality by hiring people with a wide skill set

Great Data Scientists have the right mix of skills

• Hackers with solid technical background

• Analytical mind that knows statistics and machine learning

• Clever and creative in everything they do

Page 18: Back to Square One: Building a Data Science Team from Scratch

STEP 5

EXPLORE & INNOVATE

photo by NASAr

Page 19: Back to Square One: Building a Data Science Team from Scratch

SOME TIPS AND TRICKS

Dare to fail and/or start from estimates

Introduce data exploration/innovation days

• Basically 20% time devoted to playing with data

• Incorporate brainstorming

• Encourage collaboration

Communicate findings to the rest of the company

• Fun and silliness are allowed

• Prototype early and often

Page 20: Back to Square One: Building a Data Science Team from Scratch

FIVE SIMPLE STEPS IS ALL IT TAKES

1. FOLLOW THE MONEY

2. EMBRACE HADOOP

3. BUILD DASHBOARDS

4. ASSEMBLE A TEAM

5. EXPLORE & INNOVATE

Page 21: Back to Square One: Building a Data Science Team from Scratch

Thanks! Questions?