Building Data Science Teams: A Moneyball Approach

Josh Wills | Senior Director of Data Science
© Cloudera, Inc. All rights reserved.

Transcript of Building Data Science Teams: A Moneyball Approach

Page 1: Building Data Science Teams: A Moneyball Approach

Building Data Science Teams: A Moneyball Approach

Josh Wills | Senior Director of Data Science

Page 2: Building Data Science Teams: A Moneyball Approach

About Me

Page 3: Building Data Science Teams: A Moneyball Approach

A Team Building Exercise

Page 4: Building Data Science Teams: A Moneyball Approach

Data Scientist Supply vs. Data Scientist Demand

Page 5: Building Data Science Teams: A Moneyball Approach

Recruiting Techniques

Page 6: Building Data Science Teams: A Moneyball Approach

Moneyball and Data Science

Page 7: Building Data Science Teams: A Moneyball Approach

Choosing The Right Metrics

Page 8: Building Data Science Teams: A Moneyball Approach

1. Analyzing “Unstructured” Data Sources

Page 9: Building Data Science Teams: A Moneyball Approach

2. Building Machine Learning Models

Page 10: Building Data Science Teams: A Moneyball Approach

3. Turn Static Reports Into Analytical Applications

Page 11: Building Data Science Teams: A Moneyball Approach

Answering More Questions in Less Time

Page 12: Building Data Science Teams: A Moneyball Approach

How To Answer Questions Like A Data Scientist

Page 13: Building Data Science Teams: A Moneyball Approach

The Life of a Data Processing Job

1. Read and deserialize input data.
2. Project/filter input records.
3. Shuffle: serialize it, send over the network, deserialize it.
4. Apply aggregation logic.
5. Serialize output data.
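
The five steps on this slide can be sketched in a few lines of Python. This is a minimal single-process illustration, not the engine any particular system uses: the function name `run_job`, the JSON record layout (`user`, `amount`), and the byte-sum partitioning rule are all assumptions made for the example, and the "network" is only a dictionary of byte strings.

```python
import json
from collections import defaultdict


def run_job(raw_lines, num_partitions=2):
    """Toy data processing job; all names here are hypothetical."""
    # 1. Read and deserialize input data.
    records = [json.loads(line) for line in raw_lines]

    # 2. Project/filter: keep only the fields and rows the question needs.
    projected = [(r["user"], r["amount"]) for r in records if r["amount"] > 0]

    # 3. Shuffle: serialize each record, route it to a partition by key,
    #    then deserialize it on the receiving side.
    wire = defaultdict(list)
    for key, value in projected:
        partition = sum(key.encode("utf-8")) % num_partitions
        wire[partition].append(json.dumps([key, value]).encode("utf-8"))
    received = {
        p: [json.loads(blob.decode("utf-8")) for blob in blobs]
        for p, blobs in wire.items()
    }

    # 4. Apply aggregation logic (here: sum of amounts per user).
    totals = defaultdict(int)
    for rows in received.values():
        for key, value in rows:
            totals[key] += value

    # 5. Serialize output data.
    return [json.dumps({"user": k, "total": v}) for k, v in sorted(totals.items())]
```

Even in this toy version, records are serialized or deserialized in three of the five steps, which is the cost the following slides focus on.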

Page 14: Building Data Science Teams: A Moneyball Approach

Handling the Cost of Serialization

Page 15: Building Data Science Teams: A Moneyball Approach

The Traditional RDBMS Approach

Page 16: Building Data Science Teams: A Moneyball Approach

The Cost of The Traditional RDBMS Approach

Page 17: Building Data Science Teams: A Moneyball Approach

Query Scheduling and Exploratory Data Analysis

Page 18: Building Data Science Teams: A Moneyball Approach

The Spark Approach

Page 19: Building Data Science Teams: A Moneyball Approach

The Cost of the Spark Approach

Page 20: Building Data Science Teams: A Moneyball Approach

The MapReduce Approach

Page 21: Building Data Science Teams: A Moneyball Approach

MapReduce In The Hands of a Data Scientist

Page 22: Building Data Science Teams: A Moneyball Approach

Example: Hive Multi-Insert
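
The query shown on this slide is not preserved in the transcript. As a rough analogy only, the idea behind a multi-insert (read the input once, write several outputs) might look like the Python sketch below. The function name, the event fields `country` and `device`, and the output table names are hypothetical and do not come from the talk.

```python
from collections import Counter


def single_scan_multi_output(events):
    """Answer two questions from one pass over the input (illustrative only)."""
    by_country = Counter()
    by_device = Counter()
    for event in events:  # the input is scanned exactly once
        by_country[event["country"]] += 1
        by_device[event["device"]] += 1
    # Each key plays the role of a separate output table.
    return {
        "clicks_by_country": dict(by_country),
        "clicks_by_device": dict(by_device),
    }
```

The read and deserialize cost is paid once while two outputs are produced, so the second question is nearly free compared with running a second job.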

Page 23: Building Data Science Teams: A Moneyball Approach

Our Goal: Public Transit for Questions

Page 24: Building Data Science Teams: A Moneyball Approach

Data Modeling for Data Scientists

Page 25: Building Data Science Teams: A Moneyball Approach

Motivating Example: Spelling Correction

Page 26: Building Data Science Teams: A Moneyball Approach

Event Series Analytics

Page 27: Building Data Science Teams: A Moneyball Approach

A Simple Star Schema for Spell Correction

Page 28: Building Data Science Teams: A Moneyball Approach

The Combinatorial Explosion

Page 29: Building Data Science Teams: A Moneyball Approach

Designing the Spell Correction Data Product

• What parameters does this model need…
  • during the analysis phase?
  • during deployment?
• Some Candidates
  • Lag time between events
  • Similarity of queries
  • What else?
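
The two candidate parameters named on this slide, lag time between events and similarity of queries, can be made concrete with a small sketch. It assumes one user's search events arrive as time-sorted `(timestamp_seconds, query)` pairs and uses the standard library's `difflib.SequenceMatcher` as a stand-in similarity measure. The function name and both thresholds are illustrative assumptions, not values from the talk.

```python
from difflib import SequenceMatcher


def candidate_corrections(events, max_lag_seconds=30.0, min_similarity=0.7):
    """Find consecutive query pairs that look like a typo and its retry.

    events: list of (timestamp_seconds, query) for one session, sorted by time.
    Thresholds are hypothetical tuning parameters.
    """
    pairs = []
    for (t1, q1), (t2, q2) in zip(events, events[1:]):
        lag = t2 - t1  # lag time between events
        similarity = SequenceMatcher(None, q1, q2).ratio()  # similarity of queries
        if q1 != q2 and lag <= max_lag_seconds and similarity >= min_similarity:
            pairs.append((q1, q2, lag, similarity))
    return pairs
```

Both thresholds are exactly the kind of parameter the slide asks about: during analysis you would want to sweep them freely, while a deployed model would fix them.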

Page 30: Building Data Science Teams: A Moneyball Approach

A Supernova Schema for Search

Page 31: Building Data Science Teams: A Moneyball Approach

Spell Correction in SQL

Page 32: Building Data Science Teams: A Moneyball Approach

Exhibit: http://github.com/jwills/exhibit

Page 33: Building Data Science Teams: A Moneyball Approach

Querying Nested Types with Impala

Page 34: Building Data Science Teams: A Moneyball Approach

Trading Up: From Data Analyst to Data Scientist

• Core Metric: # Outputs / # Jobs
• Measure on both an individual and aggregate level
• Drive the marginal cost of asking one additional question towards zero
• Point business analysts at output tables for interactive analysis with Impala
• Self-serve BI frees up resources (compute + data science time)
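
The core metric on this slide, outputs per job, is simple to compute once job runs are logged. The sketch below assumes a hypothetical log of `(analyst, num_outputs)` tuples, one per job run, and reports the ratio at both the individual and the aggregate level. The log format and function name are assumptions for illustration.

```python
from collections import defaultdict


def outputs_per_job(job_log):
    """Compute # outputs / # jobs per analyst and overall.

    job_log: list of (analyst, num_outputs) tuples, one entry per job run
    (a hypothetical log format).
    """
    jobs = defaultdict(int)
    outputs = defaultdict(int)
    for analyst, num_outputs in job_log:
        jobs[analyst] += 1
        outputs[analyst] += num_outputs

    individual = {a: outputs[a] / jobs[a] for a in jobs}
    total_jobs = sum(jobs.values())
    aggregate = sum(outputs.values()) / total_jobs if total_jobs else 0.0
    return individual, aggregate
```

A ratio near 1 suggests every question costs a full job. A higher ratio means each scan of the data is answering several questions at once.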

Page 35: Building Data Science Teams: A Moneyball Approach

Thanks!

@josh_wills