MLconf NYC Josh Wills
Transcript of MLconf NYC Josh Wills
MLConf NYC 2014
Josh Wills, Senior Director of Data Science, Cloudera
A Little Bit About Me
An Experience I Had Recently
The Two Kinds of Data Scientists
• The Lab
  • Statisticians who got really good at programming
  • Neuroscientists, geneticists, etc.
• The Factory
  • Software engineers who were in the wrong place at the wrong time
The Lab and The Factory
Analytics in the Lab
• Question-driven
• Interactive
• Ad-hoc, post-hoc
• Fixed data
• Focus on speed and flexibility
• Output is embedded into a report or in-database scoring engine
Analytics in the Factory
• Metric-driven
• Automated
• Systematic
• Fluid data
• Focus on transparency and reliability
• Output is a production system that makes customer-facing decisions
Data Science In The Factory
On Icebergs
The Impedance Mismatch
What Do We Need?
Apache Spark
A Feature Extraction DSL for Spark
The R Formula Specification
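For readers unfamiliar with R formulas: a formula such as `clicked ~ age + country + age:country` names a response on the left of `~` and a list of predictor terms on the right, with `:` marking an interaction. The sketch below is only an illustration of that notation, not code from the talk; the function name `parse_formula` and the column names are hypothetical, and it handles only the `~`, `+`, and `:` operators.

```python
def parse_formula(formula):
    """Split an R-style formula like 'y ~ a + b + a:b' into (response, terms).

    Hypothetical helper for illustration; supports only '~', '+' and ':'.
    """
    if formula.count("~") != 1:
        raise ValueError("expected exactly one '~' in formula")
    lhs, rhs = formula.split("~")
    response = lhs.strip()
    terms = []
    for raw in rhs.split("+"):
        term = raw.strip()
        if not term:
            raise ValueError("empty term in formula")
        # 'a:b' denotes an interaction between columns a and b;
        # a plain column becomes a one-element tuple.
        terms.append(tuple(part.strip() for part in term.split(":")))
    return response, terms


response, terms = parse_formula("clicked ~ age + country + age:country")
print(response)  # clicked
print(terms)     # [('age',), ('country',), ('age', 'country')]
```

A feature extraction DSL could take a parsed structure like this and map each term to a column transformation, which is one way to read the connection between the two slides.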
So Why Doesn’t This Exist Yet?
Functional Programming to the Rescue
Data Science in the Lab
Great Tools for Investigative Analytics
Cloudera Impala
LLVM and NUMBA
Python UDFs for Impala
• github.com/cloudera/impyla
• Already There
  • Numeric and boolean types (as native python objects)
• In Progress
  • String support
  • C/C++ function integration
• Planned
  • Struct/tuple and array types
  • UDAFs
  • Include support for PyData stack (scikit-learn, NLTK)
Josh Wills, Director of Data Science, Cloudera
@josh_wills
Thank you!