Post on 31-Dec-2020

© 2016 KNIME.com AG. All Rights Reserved.

Data Science for Everyone

Greg Landrum

Rosaria Silipo

KNIME

Introduction to the characters

• The scientist (chemist, business analyst, domain expert, etc.).

– Deep domain knowledge

– Strong analytics needs (questions that need to be answered!)

• The data scientist (analyst, modeler, informatician, etc.)

– Deep knowledge of analytics, data processing

– Knows KNIME (and other tools)

The specific scenario/problem

The scientist:

“I’m trying to discover a new anti-malaria medicine. I’ve got a new dataset from a high-throughput screen against a malaria target. Doing the next experiments is expensive. I want to pick the right compounds from our inventory to try next.”

The scenario/problem

• Given a new dataset, clean it up so that a model can be built

• Build and validate a model from that dataset

• Use the model to prioritize a set of items from a catalog

• Let the user pick from that prioritized list

The steps for doing this

• Cleaning up the data

• Building and validating a model

• Ranking a set of new items from a catalog

• Letting the user pick the items they are interested in

• Providing an Excel file

This is a familiar pattern; we know how to do this.

A guided analytics solution

• The data scientist builds a data preparation and modeling workflow in KNIME capturing their most robust approach along with a solid validation protocol that won’t let a low-quality model pass.

• The data scientist deploys this model as a web application using the KNIME server.

• The scientist can then upload their data, build and validate a model, and then apply it to generate predictions for the items in their catalog in order to decide which experiments to do next.

Data Cleaning

The 80% problem

http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html

Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.

Data Cleaning & Data Scientists

https://twitter.com/mrogati/status/601538814746628096

7 Techniques for Dimensionality Reduction

Column Reduction based on:

1. Missing values

2. High correlation

3. Low standard deviation

4. PCA

5. Infrequent choice in random forest shallow trees

6. Backward Feature Elimination

7. Forward Feature Construction

Whitepaper on KNIME web site https://www.knime.org/files/knime_seventechniquesdatadimreduction.pdf
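
As a rough illustration of techniques 1 to 3, the Python sketch below filters columns by missing-value ratio, standard deviation, and pairwise correlation. It is not the KNIME workflow from the whitepaper: the function names, the dictionary-of-lists table layout, and all threshold values are assumptions made for this example. The standard-deviation threshold is only meaningful when the columns are on comparable scales.

```python
import math
from statistics import mean, pstdev


def missing_ratio(values):
    """Fraction of entries that are None (None stands for 'missing' here)."""
    return sum(v is None for v in values) / len(values)


def pearson(xs, ys):
    """Pearson correlation over the rows where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    if len(pairs) < 2:
        return 0.0
    mx = mean(x for x, _ in pairs)
    my = mean(y for _, y in pairs)
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    sx = math.sqrt(sum((x - mx) ** 2 for x, _ in pairs))
    sy = math.sqrt(sum((y - my) ** 2 for _, y in pairs))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column carries no correlation information
    return cov / (sx * sy)


def column_std(values):
    """Population standard deviation of the non-missing values."""
    present = [v for v in values if v is not None]
    return pstdev(present) if len(present) > 1 else 0.0


def reduce_columns(table, max_missing=0.5, min_std=0.01, max_corr=0.95):
    """Return the names of the columns that survive all three filters.

    All thresholds are illustrative defaults, not values from the slides.
    """
    # Technique 1: drop columns that are mostly empty.
    kept = [c for c, v in table.items() if missing_ratio(v) <= max_missing]
    # Technique 3: drop near-constant columns.
    kept = [c for c in kept if column_std(table[c]) >= min_std]
    # Technique 2: of each highly correlated pair, keep the earlier column.
    result = []
    for c in kept:
        if all(abs(pearson(table[c], table[r])) <= max_corr for r in result):
            result.append(c)
    return result


example = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],     # perfectly correlated with "a"
    "c": [5.0, 5.0, 5.0, 5.0],     # constant
    "d": [None, None, None, 1.0],  # mostly missing
    "e": [4.0, 1.0, 3.0, 2.0],
}
surviving = reduce_columns(example)
```

In this toy table only columns "a" and "e" survive. Which column of a correlated pair to keep is a design choice; the sketch simply keeps the one that appears first.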

Dataset Quality Measures

Additional Techniques for Data Dimensionality Reduction:

• Low Skewness

• Outlier Removal

Measure Dataset Quality Before and After:

• Average Error (%) from Cross-Validation

• Normalized Cronbach Alpha
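
For readers unfamiliar with Cronbach's alpha, the sketch below computes the standardized form, alpha = k * r / (1 + (k - 1) * r), where k is the number of columns and r is the mean pairwise Pearson correlation. Whether this is exactly what the slide means by "normalized" is an assumption; the function names are hypothetical and the code is not taken from the KNIME workflow.

```python
import math
from itertools import combinations
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equally long numeric sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # treat a constant column as uncorrelated
    return cov / (sx * sy)


def standardized_cronbach_alpha(columns):
    """Standardized alpha from the mean pairwise correlation of the columns."""
    k = len(columns)
    if k < 2:
        raise ValueError("at least two columns are required")
    r_bar = mean(pearson(a, b) for a, b in combinations(columns, 2))
    denominator = 1 + (k - 1) * r_bar
    if denominator == 0:
        raise ValueError("alpha is undefined for this mean correlation")
    return k * r_bar / denominator


# Three perfectly correlated columns give an alpha of (approximately) 1.
alpha = standardized_cronbach_alpha(
    [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0], [1.0, 3.0, 5.0, 7.0]]
)
```

Computing the measure before and after cleaning, as the slide suggests, shows whether the removed columns were contributing consistent information.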

Data Cleaning as a Process

• Reliable (cross-domain)

• Repeatable (not automatic)

• Interactive (human expert supervised)

• From a Web Browser (no KNIME expertise)

• On demand

CRM Dataset

• Customer Data

• Task is Upselling prediction

• Product is lawyer insurance

• Target is Lawyer Insurance (0/1)

• If lawyer insurance was bought, a lawyer was assigned a little while later

• 10K data rows x 33 data columns

From KNIME WebPortal: Login

From KNIME WebPortal: Start

From KNIME WebPortal: Upload File

Only .table and .csv files

From KNIME WebPortal: Initial Dataset Quality

From KNIME WebPortal: Missing Values

From KNIME WebPortal: Outliers

From KNIME WebPortal: Low Standard Deviation

From KNIME WebPortal: Low Skewness & High Correlation

From KNIME WebPortal: Final Dataset Quality

From KNIME WebPortal: Back to Refine

From KNIME WebPortal: Final Dataset Quality Again

From KNIME WebPortal: Workflow successful

Workflow

Metanode “Dataset Quality”

Summary of Dataset Quality

Malaria Dataset

• Patient Data

• Task is Pf3D7_ps_hit = yes/no

• Primary & secondary readouts, SMILES, experiment date, sample

• Many primary readouts are missing (?)

• 6675 data rows x 8 data columns

From KNIME WebPortal: Initial Dataset Quality

From KNIME WebPortal: Missing Values

From KNIME WebPortal: Outliers

From KNIME WebPortal: Low Standard Deviation

From KNIME WebPortal: Low Skewness & High Correlation

From KNIME WebPortal: Final Dataset Quality

That was easy!

Happy scientist!

Model building

The modeling and prediction workflow

Reading the cleaned data and adding the chemistry-specific details

Building a model

Evaluating the model

Ranking and picking new items

Robust learning: use multiple models and representations

• Multiple models:

– Random forest (representation 2)

– Gradient boosting (representation 1)

– Fingerprint Bayes (representation 1)

– Logistic regression (representation 1)

– Logistic regression (representation 2)

• Combine predictions using "model fusion"
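
One simple way to fuse predictions is to average the per-item class probabilities across models. The sketch below does exactly that in plain Python. The slides do not say which fusion rule the workflow uses, so the mean is an assumption here, and `fuse_predictions` is a hypothetical helper rather than a KNIME function.

```python
from statistics import mean


def fuse_predictions(per_model_probabilities):
    """Average P(hit) for each item across all models.

    per_model_probabilities: one list per model, each holding the predicted
    probability of the positive class for every item, in the same item order.
    """
    if not per_model_probabilities:
        raise ValueError("at least one model is required")
    lengths = {len(p) for p in per_model_probabilities}
    if len(lengths) != 1:
        raise ValueError("all models must score the same number of items")
    # zip(*...) regroups the scores item by item instead of model by model.
    return [mean(item_scores) for item_scores in zip(*per_model_probabilities)]


# Three models scoring two items (made-up numbers).
fused = fuse_predictions([[0.9, 0.2], [0.7, 0.4], [0.8, 0.0]])
```

Other rules, such as the median or the maximum, fit the same interface; averaging is chosen here only because it is the easiest to reason about.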

Validation

• The model will be used for ranking new items

• To ensure that it is useful, we evaluate it both on overall accuracy (using Cohen’s kappa) and on how accurate the early picks are (using enrichment)
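
Both measures are short enough to write out. The sketch below computes Cohen's kappa as (observed agreement - chance agreement) / (1 - chance agreement), and an enrichment factor as the hit rate among the top-ranked fraction divided by the overall hit rate. The function names, the 0/1 label encoding, and the choice of fraction are assumptions for illustration; the KNIME workflow may define enrichment differently.

```python
def cohens_kappa(actual, predicted):
    """Agreement between two label lists, corrected for chance agreement."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("label lists must be non-empty and equally long")
    n = len(actual)
    labels = set(actual) | set(predicted)
    observed = sum(a == p for a, p in zip(actual, predicted)) / n
    expected = sum(
        (actual.count(label) / n) * (predicted.count(label) / n)
        for label in labels
    )
    if expected == 1.0:
        raise ValueError("kappa is undefined when only one class occurs")
    return (observed - expected) / (1.0 - expected)


def enrichment_factor(actual, scores, fraction):
    """Hit rate in the top-scored fraction relative to the overall hit rate.

    actual holds 1 for a hit and 0 otherwise; scores are the model outputs.
    """
    total_hits = sum(actual)
    if total_hits == 0:
        raise ValueError("enrichment is undefined without any hits")
    ranked = sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, round(len(ranked) * fraction))
    hits_in_top = sum(label for _, label in ranked[:n_top])
    return (hits_in_top / n_top) / (total_hits / len(actual))


# Made-up example: 10 items, 3 true hits.
actual = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

kappa = cohens_kappa(actual, predicted)
ef_top20 = enrichment_factor(actual, scores, fraction=0.2)
```

An enrichment factor above 1 means the top of the ranking contains more hits than a random pick of the same size would, which is what matters when only a few items can be tested.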

Validation

• Parameters from the Scorer node, adapted to model fusion

• Accuracy parameters from the ROC node

When the model isn’t good enough

Accuracy thresholds are set by the data scientist when building the workflow

The workflow ends here. No sense continuing with a model that's unreliable or misleading.
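
The gating idea can be sketched as a check that stops processing when any validation metric falls below its threshold. The metric names, the threshold values, and the `ModelRejected` exception are all hypothetical; in the real workflow the data scientist configures the thresholds inside KNIME.

```python
class ModelRejected(Exception):
    """Raised when a validation metric falls below its required minimum."""


def require_quality(metrics, thresholds):
    """Return True if every metric meets its threshold, else raise."""
    failures = [
        f"{name}: {metrics.get(name)} < {minimum}"
        for name, minimum in thresholds.items()
        if name not in metrics or metrics[name] < minimum
    ]
    if failures:
        raise ModelRejected("; ".join(failures))
    return True


# Illustrative thresholds only; they are not taken from the slides.
thresholds = {"kappa": 0.4, "enrichment": 2.0}
passed = require_quality({"kappa": 0.6, "enrichment": 3.0}, thresholds)
```

Raising an exception, rather than returning a flag, mirrors the slide's point: nothing downstream should run on a model that failed validation.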

Making predictions

Read items from catalog

Generate predictions

Show histogram and ask for number of items to consider

Interactive selection

Download Excel file
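
The ranking and export steps can be sketched in a few lines: sort the catalog by predicted score, keep the top N, and write the result to a file. The sketch writes CSV, which Excel can open, because the Python standard library has no Excel writer; the KNIME workflow itself produces a real Excel file. The helper names and the sample item identifiers are made up.

```python
import csv
import io


def top_n(items, scores, n):
    """Return the n highest-scoring (item, score) pairs, best first."""
    ranked = sorted(zip(items, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]


def to_csv(rows):
    """Serialize (item, score) pairs as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["item", "score"])
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical catalog entries with fused model scores.
picks = top_n(["cmpd-A", "cmpd-B", "cmpd-C"], [0.2, 0.9, 0.5], n=2)
report = to_csv(picks)
```

In the guided workflow the user chooses N after looking at the score histogram and can still deselect individual rows before downloading.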

Interactive selection

Create images for the table

Create plots

Keep only rows that are selected in the table

The output, Excel at last!

That’s it!

• Whitepapers & workflows for the two different parts coming soon!

• For more information, email: education@knime.com

The KNIME® trademark and logo and OPEN FOR INNOVATION® trademark are used by KNIME.com AG under license from KNIME GmbH, and are registered in the United States.

KNIME® is also registered in Germany.