MLconf NYC Josh Wills

MLConf NYC 2014: Josh Wills, Senior Director of Data Science, Cloudera

Page 2: MLconf NYC Josh Wills

A Little Bit About Me

Page 3: MLconf NYC Josh Wills

An Experience I Had Recently

Page 4: MLconf NYC Josh Wills

The Two Kinds of Data Scientists

• The Lab
  • Statisticians who got really good at programming
  • Neuroscientists, geneticists, etc.
• The Factory
  • Software engineers who were in the wrong place at the wrong time

Page 5: MLconf NYC Josh Wills

The Lab and The Factory

Analytics in the Lab

• Question-driven
• Interactive
• Ad-hoc, post-hoc
• Fixed data
• Focus on speed and flexibility
• Output is embedded into a report or in-database scoring engine

Analytics in the Factory

• Metric-driven
• Automated
• Systematic
• Fluid data
• Focus on transparency and reliability
• Output is a production system that makes customer-facing decisions

Page 6: MLconf NYC Josh Wills

Data Science In The Factory

Page 7: MLconf NYC Josh Wills

On Icebergs

Page 8: MLconf NYC Josh Wills

The Impedance Mismatch

Page 9: MLconf NYC Josh Wills

What Do We Need?

Page 10: MLconf NYC Josh Wills

Apache Spark

Page 11: MLconf NYC Josh Wills

A Feature Extraction DSL for Spark

Page 12: MLconf NYC Josh Wills

The R Formula Specification

Page 13: MLconf NYC Josh Wills

So Why Doesn’t This Exist Yet?

Page 14: MLconf NYC Josh Wills

Functional Programming to the Rescue

Page 15: MLconf NYC Josh Wills

Data Science in the Lab

Page 16: MLconf NYC Josh Wills

Great Tools for Investigative Analytics

Page 17: MLconf NYC Josh Wills

Cloudera Impala

Page 18: MLconf NYC Josh Wills

LLVM and NUMBA

Page 19: MLconf NYC Josh Wills

Python UDFs for Impala

Page 20: MLconf NYC Josh Wills

Python UDFs for Impala

• github.com/cloudera/impyla
• Already There
  • Numeric and boolean types (as native Python objects)
• In Progress
  • String support
  • C/C++ function integration
• Planned
  • Struct/tuple and array types
  • UDAFs
  • Include support for PyData stack (scikit-learn, NLTK)
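
As a rough illustration of the "Already There" item, the sketch below shows the kind of scalar function over numeric and boolean values that the list describes. It is plain Python and deliberately does not use the impyla API, since the slides do not show the registration or decorator calls; the function name `is_high_value`, its parameters, and the threshold are hypothetical.

```python
def is_high_value(amount: float, visits: int, threshold: float = 100.0) -> bool:
    """Return True when average spend per visit exceeds the threshold.

    Hypothetical example: takes numeric inputs and returns a boolean,
    the two type families listed as already supported.
    """
    if visits <= 0:
        # Avoid division by zero; treat rows without visits as not high value.
        return False
    return amount / visits > threshold


# Apply the function row by row, as a scalar UDF would be applied per record.
rows = [(250.0, 2), (90.0, 3), (10.0, 0)]
flags = [is_high_value(amount, visits) for amount, visits in rows]
```

Strings, arrays, and aggregate functions are left out on purpose, since the list marks them as in progress or planned.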

Page 21: MLconf NYC Josh Wills

Josh Wills, Director of Data Science, Cloudera

@josh_wills

Thank you!