Becoming a Data-Driven Organization with Machine Learning

Unless otherwise indicated, these slides are © 2013-2014 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/

Becoming a Data Driven Organization with Machine Learning
By Peter Harrington

description

Does your organization collect data? Lots of data? Does it make use of all the data it has collected? In this session you will learn what you can do with machine learning and what the building blocks are for an application that uses it. The session shows how to go from data you have collected to predictions for customers, and how valuable insights into your data can be gleaned while building the code to make those predictions.

Transcript of Becoming a Data-Driven Organization with Machine Learning

Page 1: Becoming a Data-Driven Organization with Machine Learning

Becoming a Data Driven Organization with Machine Learning

By Peter Harrington

Page 2: Becoming a Data-Driven Organization with Machine Learning

Goals for this talk

• Introduce Machine Learning (ML)

• Talk about how we can take ML outside of the

• Share some experience from the trenches

• Simplicity

Page 3: Becoming a Data-Driven Organization with Machine Learning

Agenda

• Introduce Myself

• Define Data Driven and ML

• Common tasks in ML

• Sample of some ML algos with examples

• *Interpretable ML

• *ML & Agile Development

• *What is a “Data Scientist”?

Page 4: Becoming a Data-Driven Organization with Machine Learning

About me

• Author of Machine Learning in Action

Page 5: Becoming a Data-Driven Organization with Machine Learning

My employer

• Provide customers with a list of WHO is using WHAT product.

• Customers are willing to pay us for this data.

• Collect data from numerous document sources where companies are talking about themselves.

• Our product is data

Page 6: Becoming a Data-Driven Organization with Machine Learning

What we do

• Natural Language Processing

• Knowledge Graph of Business Information

• 1.5B documents

• Update results daily

• Try to keep infrastructure costs down

Page 7: Becoming a Data-Driven Organization with Machine Learning

Spam

• We use lots and lots of Java

• We are hiring

• Santa Barbara, California

• Sunnyvale, California

• Apply on our website: www.hgdata.com

Page 8: Becoming a Data-Driven Organization with Machine Learning

“Data Driven”

• “Data driven means that progress in an activity is compelled by data, rather than by intuition or personal experience.” --Wikipedia

• This talk is going to show how you can use some techniques from Machine Learning to help make data-driven decisions in your workplace, or help your applications make decisions.

Page 9: Becoming a Data-Driven Organization with Machine Learning

What is Machine Learning?

• Some tools to allow a machine to learn from data.

Page 10: Becoming a Data-Driven Organization with Machine Learning

Example

[Figure: scatter plot of Height (inches) against Weight (Pounds); one point is labeled (50, 62).]

Page 11: Becoming a Data-Driven Organization with Machine Learning

Example (continued)

[Figure: the same scatter plot of Height (inches) against Weight (Pounds), with the point labeled (50, 62) and a fitted trend line y = 0.9922x + 12.472.]
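A trend line like the one on this slide can be produced with an ordinary least-squares fit. The sketch below is a minimal, dependency-free version; the function name `fit_line` and the five data points are made up for illustration and are not the data behind the chart.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of the x/y cross-deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical (weight, height) pairs, NOT the data from the slide.
weights = [10.0, 20.0, 30.0, 40.0, 50.0]
heights = [22.0, 33.0, 41.0, 53.0, 62.0]

slope, intercept = fit_line(weights, heights)

# Use the fitted line to predict the height for an unseen weight.
predicted = slope * 45.0 + intercept
print(f"height = {slope:.3f} * weight + {intercept:.3f}")
print(f"predicted height at weight 45: {predicted:.1f}")
```

The fitted slope and intercept play the same role as 0.9922 and 12.472 in the chart: once they are known, a prediction is a single multiply and add.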

Page 12: Becoming a Data-Driven Organization with Machine Learning

What is Machine Learning?

• Some tools to allow a machine to learn from data.

• Tools that can make decisions from non-deterministic data

Page 13: Becoming a Data-Driven Organization with Machine Learning

Common Tasks in Machine Learning

• Supervised Learning

  • Predicting a numerical value; this is called regression.

  • Predicting categories (spam or not spam, for example); this is called classification.

• Unsupervised Learning

  • Clustering

  • Association rule mining: {men, diapers} → {beer}

  • Topic modeling

• Semi-Supervised Learning
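An association rule such as {men, diapers} → {beer} is usually judged by its support (how often all the items occur together) and its confidence (how often the right-hand side appears when the left-hand side does). The following is a small sketch of those two measures; the baskets and the helper names `support` and `confidence` are invented for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support of antecedent and consequent together, relative to antecedent alone."""
    both = support(transactions, set(antecedent) | set(consequent))
    base = support(transactions, antecedent)
    return both / base if base else 0.0


# Hypothetical shopping baskets, made up for this example.
baskets = [
    {"men", "diapers", "beer"},
    {"men", "diapers", "beer"},
    {"men", "diapers", "milk"},
    {"women", "diapers", "milk"},
    {"men", "beer"},
]

rule_support = support(baskets, {"men", "diapers", "beer"})
rule_confidence = confidence(baskets, {"men", "diapers"}, {"beer"})
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

No labels are needed to compute either number, which is why association rule mining sits under unsupervised learning.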

Page 14: Becoming a Data-Driven Organization with Machine Learning

Where is ML being used today?

• Face detection

• Handwriting detection in mail

• Voice recognition (Siri, Sync)

• Answering questions (IBM Watson, Google)

• Forecasting weather

• Stock Trading

• Recommending things when you shop

• Spam detection online: email, forums

• Law: forecasting results, extracting info from docs

Page 15: Becoming a Data-Driven Organization with Machine Learning

Where else?

• Spacecraft

• Self-driving cars

• Identifying whales

• Predicting strokes

• Fighting financial fraud

Page 16: Becoming a Data-Driven Organization with Machine Learning

Why is ML useful?

• You do not need to be a domain expert to make a prediction/forecast.

  • Example 1: a mathematician outpredicts a law professor on Supreme Court rulings.

  • Example 2: an economist outpredicts wine snobs at picking the best vintages.

Both of the above examples are from the book “Super Crunchers” by Ian Ayres.

• We are not trying to escape study of these fields; rather, we are often asked to study fields that few have studied before.

Page 17: Becoming a Data-Driven Organization with Machine Learning

Classification Example

Page 18: Becoming a Data-Driven Organization with Machine Learning

Decision Tree Example

Page 19: Becoming a Data-Driven Organization with Machine Learning

Data wrangling

• The data doesn’t always come as easily as in these toy examples.

• 50-90% of our time is spent getting the data into the system

• Reasons why we need to do this

  • Wrong format

  • Not being recorded

  • Political reasons

Page 20: Becoming a Data-Driven Organization with Machine Learning

Regression Example

Page 21: Becoming a Data-Driven Organization with Machine Learning

Ridge Regression Example

Page 22: Becoming a Data-Driven Organization with Machine Learning

Semi-Supervised Learning

Image taken from: http://bioinformatics.oxfordjournals.org/content/24/6/783/F1.expansion.html

Page 23: Becoming a Data-Driven Organization with Machine Learning

How can you do this?

• Collect some data.

• Put the data into some existing package in your language of choice.

• Take the resulting model and put it into your application. Understand the model.

[Diagram labels: Data, ML code, Model: H = 0.9*W + 12.4]
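The last step, putting the resulting model into your application, can be very small for a linear model. The sketch below takes the slide's model to be H = 0.9 * W + 12.4 and wraps it in a function; the names `LinearModel`, `HEIGHT_MODEL` and `estimate_height`, and the input check, are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinearModel:
    """A trained one-variable linear model: y = slope * x + intercept."""
    slope: float
    intercept: float

    def predict(self, x: float) -> float:
        return self.slope * x + self.intercept


# Coefficients as read from the slide (H = 0.9 * W + 12.4).
HEIGHT_MODEL = LinearModel(slope=0.9, intercept=12.4)


def estimate_height(weight_lbs: float) -> float:
    """Application-facing wrapper: validate the input, then apply the model."""
    if weight_lbs < 0:
        raise ValueError("weight must be non-negative")
    return HEIGHT_MODEL.predict(weight_lbs)


print(f"estimated height for 50 lbs: {estimate_height(50):.1f}")
```

Because the model is just two numbers, it is also easy to understand: each extra pound adds 0.9 to the predicted height.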

Page 24: Becoming a Data-Driven Organization with Machine Learning

Better Approach

• Introduce Machine Learning (ML)

[Diagram labels: Data, Training Set, Test Set, ML code, Model: H = 0.9*W + 12.4, Test Code, Error: 1.5 lbs]
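The training set / test set idea in this diagram can be sketched as follows: fit the model on one part of the data and measure its error only on the part it has never seen. The data here are synthetic (generated around the slide's line with random noise), and the 80/20 split, the fixed seed and the helper names are assumptions made for illustration.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_absolute_error(model, xs, ys):
    """Average absolute gap between predictions and true values."""
    slope, intercept = model
    return sum(abs(slope * x + intercept - y) for x, y in zip(xs, ys)) / len(xs)


# Synthetic (weight, height) pairs: hypothetical data, not from the talk.
rng = random.Random(0)
data = [(float(w), 0.9 * w + 12.4 + rng.uniform(-2, 2)) for w in range(10, 60, 2)]

# Shuffle, then hold out the last 20% as the test set.
rng.shuffle(data)
split = int(0.8 * len(data))
train, test = data[:split], data[split:]

# "ML code": learn the model from the training set only.
model = fit_line([w for w, _ in train], [h for _, h in train])

# "Test code": report the error on data the model has not seen.
test_error = mean_absolute_error(model, [w for w, _ in test], [h for _, h in test])
print(f"model: H = {model[0]:.2f} * W + {model[1]:.2f}")
print(f"mean absolute error on the test set: {test_error:.2f}")
```

Measuring error on held-out data gives a more honest estimate of how the model will behave in the application than checking it against the same points it was trained on.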

Page 25: Becoming a Data-Driven Organization with Machine Learning

Agile Development and ML

Requirements

• Traditional Development: Negotiate with customer

• Agile Development: Working prototypes to constantly refine requirements

• Agile Research: Establish target accuracy, sufficient data

Implementation

• Traditional Development: Comprehensive plan

• Agile Development: Small teams, quick implementation cycle

• Agile Research: Rapid research cycle: focus on data or algorithm improvements

Measurement

• Traditional Development: Validate that the software meets specs with tests

• Agile Development: Iterate with customer to evaluate if requirements are met

• Agile Research: Accuracy metrics dominate, headroom analysis used to guide next sprint

Page 26: Becoming a Data-Driven Organization with Machine Learning

Interpretable ML

Page 27: Becoming a Data-Driven Organization with Machine Learning

Interpretable ML

• “Black box” models are not easy to interpret, and may be poorly received.

• Some models like decision trees are easy to interpret but may not have the best performance.

• Check out: Decision Lists and Sparse Integer Lists if this is something you are interested in.
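To show why a decision list is easy to interpret, here is a minimal sketch: an ordered list of if-then rules where the first matching rule decides, so every prediction comes with the rule that produced it. The spam rules, field names and the `classify` helper are hypothetical and hand-written here; in practice the rules would be learned from data.

```python
def classify(rules, default_label, example):
    """Return (label, reason) from the first rule whose condition matches."""
    for description, condition, label in rules:
        if condition(example):
            return label, description
    return default_label, "default rule"


# Hypothetical rules, checked from top to bottom.
RULES = [
    ("sender is in the address book", lambda m: m["known_sender"], "not spam"),
    ("more than 3 links", lambda m: m["num_links"] > 3, "spam"),
    ("subject is all caps", lambda m: m["subject"].isupper(), "spam"),
]

message = {"known_sender": False, "num_links": 5, "subject": "Hello"}
label, reason = classify(RULES, "not spam", message)
print(f"{label} (because: {reason})")
```

Returning the reason alongside the label is the point of the design: a reader can check the decision against the rule without knowing anything about how the rules were produced.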

Page 28: Becoming a Data-Driven Organization with Machine Learning

What is a Data Scientist?

A. A cross between a statistician and a developer?

B. A developer who knows ML?

C. A buzzword we use to attract developers?

D. All of the above?

Page 29: Becoming a Data-Driven Organization with Machine Learning

Thanks again for coming!

• Questions?

Page 30: Becoming a Data-Driven Organization with Machine Learning

True AI vs. Modern AI

• True AI seeks to understand how the human brain works by creating an artificial version.

• Modern AI is a collection of hard problems.
