Data Science in Action · 2015-07-28 · machine learning collaboratively • Have open-sourced...

36
© Copyright 2014 Pivotal. All rights reserved. Data Science in Action Practical use cases that demonstrate how businesses generate value from data 21 March 2014 – SDS2014, Winterthur

Transcript of Data Science in Action · 2015-07-28 · machine learning collaboratively • Have open-sourced...

Page 1: Data Science in Action · 2015-07-28 · machine learning collaboratively • Have open-sourced several of our own tools for wide-spread ... team of machine learning experts from

© Copyright 2014 Pivotal. All rights reserved. © Copyright 2014 Pivotal. All rights reserved.

Data Science in Action Practical use cases that demonstrate how businesses generate value from data

21 March 2014 – SDS2014, Winterthur

Page 2: Data Science in Action · 2015-07-28 · machine learning collaboratively • Have open-sourced several of our own tools for wide-spread ... team of machine learning experts from

© Copyright 2014 Pivotal. All rights reserved.

Introduction to Pivotal Data Labs

• Leading edge tools to implement machine learning collaboratively

• Have open-sourced several of our own tools for wide-spread use

Our Tools

• Parallelized a wide variety of machine learning algorithms for optimum performance on the Pivotal platform

• Agile, test-driven, customer focused

Our Methods

P(A|B) = !P(B|A) P(A)!

P(B) !

num_

depe

nden

ts_0

male_

mem_

1su

b_ge

nder_

mfem

ale_m

em_0

avg_

mem_

age_

56_6

5su

b_ag

e_56

_65

over_

avg_

age

sub_

age_

0_17

avg_

mem_

age_

0_17

avg_

mem_

age_

41_5

5su

b_ag

e_41

_55

num_

depe

nden

ts_1

sub_

age_

18_2

5av

g_me

m_ag

e_18

_25

num_

depe

nden

ts_3

num_

depe

nden

ts_2

sub_

benif

it_typ

e_i_c

urnu

m_de

pend

ents_

4ma

le_me

m_4

male_

mem_

3fem

ale_m

em_4

female

_mem

_3ma

le_me

m_2

female

_mem

_2su

b_reg

ion_1

sub_

age_

66_p

lusav

g_me

m_ag

e_66

_plus

ca_re

siden

tsu

b_reg

ion_2

sub_

region

_3ma

le_me

m_0

female

_mem

_1su

b_ge

nder_

fun

der_a

vg_a

gesu

b_ag

e_26

_40

avg_

mem_

age_

26_4

0me

m_clm

_cnt_

0urb

ansu

b_reg

ion_4 rural

Males Under 25

Males 26−40

Males 41−55 or 66 and Over

Males 56−65

Females 20−40

Females 55 and Over

Married with Children

Married 56−65

Normalized,k=8

• More than 100 customer assignments carried out in the past 18 months

• Ensures quality and best practice in all our assignments

Our Experience

• Analytical workflow aligned with business needs and optimized for speed

• Supports iterative and collaborative working

Our Process

Data Warehouse

Data Sources Structured Data •  dbSNP •  refSeq

Analysis

Raw Sequence Data Unstructured Unstructured

Data

Raw Sequence Data

Image Data

External Data Sources

Visualization/Quality Reporting

• High caliber global team of machine learning experts from a wide variety of quantitative backgrounds

• Equally capable in coding & statistics

Our Team

Page 3: Data Science in Action · 2015-07-28 · machine learning collaboratively • Have open-sourced several of our own tools for wide-spread ... team of machine learning experts from

© Copyright 2014 Pivotal. All rights reserved.

4%

32%

64%

18%

43%

39%

18%

11%

25%

28%

11%

7%

Pivotal Data Science Team

PhD

By Degree By University By Subject

Bachelor

Masters

Top 25 Other

Top 50

Physics

Math

Statistics & OR

Computer Science

Engineering

Other

Page 4: Data Science in Action · 2015-07-28 · machine learning collaboratively • Have open-sourced several of our own tools for wide-spread ... team of machine learning experts from

© Copyright 2014 Pivotal. All rights reserved.

What does the Pivotal Data Science team do?

Deliver Data Science Labs

•  Deliver value for customers by applying best practice Data Science

•  Kick-off use and spread of the Pivotal platform

Engagement & Enablement

•  Enable customers to build on and extend Data Science Labs

•  Train and enable customers, partners and Pivotal teams

•  Recognized as thought leaders

IP Development

•  Develop leading approaches •  Develop and parallelize

algorithms •  Develop and file patents

Data Scientists as catalysts for data-driven transformation

Page 5: Data Science in Action · 2015-07-28 · machine learning collaboratively • Have open-sourced several of our own tools for wide-spread ... team of machine learning experts from

© Copyright 2014 Pivotal. All rights reserved.

Data Science in Action

� The Value of Data Science

� The Practice of Data Science

�  In-depth Use Case: Traffic Prediction

� Use Case Overviews

� Q&A

Page 6: Data Science in Action · 2015-07-28 · machine learning collaboratively • Have open-sourced several of our own tools for wide-spread ... team of machine learning experts from

© Copyright 2014 Pivotal. All rights reserved.

What do we mean with Data Science?

Complexity

Value of Analytics

($)

Descriptive Analytics

Diagnostic Analytics

Predictive Analytics

Prescriptive Analytics

What happened?

Why did it happen?

What will happen?

How can we make it happen?

Page 7: Data Science in Action · 2015-07-28 · machine learning collaboratively • Have open-sourced several of our own tools for wide-spread ... team of machine learning experts from

© Copyright 2014 Pivotal. All rights reserved.

Big Data & Data Science

Decision    =    Data    +    Rules  

“Big  Data”   Data  Science  

Page 8: Data Science in Action · 2015-07-28 · machine learning collaboratively • Have open-sourced several of our own tools for wide-spread ... team of machine learning experts from

© Copyright 2014 Pivotal. All rights reserved.

Social  Media  Commercial  &  Public  Data  Dark  Data  

“Big Data”

Opera7onal  Data  

Page 9: Data Science in Action · 2015-07-28 · machine learning collaboratively • Have open-sourced several of our own tools for wide-spread ... team of machine learning experts from

© Copyright 2014 Pivotal. All rights reserved.

Combining data sources: Example IPSQ (Quality)

Owner: TS Production team Test flags from production line

1 year ~300GB

APDM Owner: TS Production team

Full vehicle history including IPST (technical), IPSL (logistics), IPSQ test flags and all test

results. 30 years ~TBs

FASTA Owner: Aftersales

Dealership electronic tests Identifies early issues with cars

>25TB

IQS: Initial Quality Survey from JD Power Owner: R&D

Survey responses from new owners after 90 days for approx 1700 vehicles

Few thousand lines ~MB

Social Data Owner: R&D

Pulling 500MB per day from Twitter

TQP Owner: Supplier management

PDFs of parts spec sheets ~ 500GB

Page 10: Data Science in Action · 2015-07-28 · machine learning collaboratively • Have open-sourced several of our own tools for wide-spread ... team of machine learning experts from

© Copyright 2014 Pivotal. All rights reserved.

Generating value from data: Car configurator example

Sales

Configurations

Customers

Sales

Configurations

Customers

Basic Recommendation Engine

Sales

Configurations

Customers

Advanced Recommendation Engine

Sales

Configurations

Customers

Yield Optimization Engine

All car elements: •  Attribute frequencies (colors etc) •  Attribute combination frequencies

For instance: •  Browsing history •  Usage patterns •  Demographic insights

Ideally: •  Volumes •  Pricing •  By market •  Linkable to configurations

Page 11: Data Science in Action · 2015-07-28 · machine learning collaboratively • Have open-sourced several of our own tools for wide-spread ... team of machine learning experts from

© Copyright 2014 Pivotal. All rights reserved.

Traditional Systems “Big Data” “Fast Data”

The value of data over time

Time

Value of Data ($)

µs ms s hour day month year yr+

Pivotal Data Science Labs

Page 12: Data Science in Action · 2015-07-28 · machine learning collaboratively • Have open-sourced several of our own tools for wide-spread ... team of machine learning experts from

© Copyright 2014 Pivotal. All rights reserved.

Data Science

� The use of statistical and machine learning techniques on big multi-structured data in a distributed computing environment to identify correlations and causal relationships, classify and predict events, identify patterns and anomalies, and infer probabilities, interest, and sentiment.

�  In order to drive automated low latency actions in response to events of interest

Page 13: Data Science in Action · 2015-07-28 · machine learning collaboratively • Have open-sourced several of our own tools for wide-spread ... team of machine learning experts from

© Copyright 2014 Pivotal. All rights reserved.

Why do Data Science?

Time

Return on Analytics

($)

Business Transformation

Innovation

Optimization

Expansion

“low hanging fruit”

Page 14: Data Science in Action · 2015-07-28 · machine learning collaboratively • Have open-sourced several of our own tools for wide-spread ... team of machine learning experts from

© Copyright 2014 Pivotal. All rights reserved.

What transforms businesses today?

� Digitization

� Internet of Things

� Pervasive Computing

� Pervasive Connectivity

Page 15: Data Science in Action · 2015-07-28 · machine learning collaboratively • Have open-sourced several of our own tools for wide-spread ... team of machine learning experts from

© Copyright 2014 Pivotal. All rights reserved.

Example: major paradigm shifts in automotive

Genesis

Not a horse

Mass Production

Mass availability “You can have any

color of car, provided it’s black”

Modern Manufacturing

Brand proliferation You can have

any color

What’s Next?

Connected, autonomous

vehicles

Platform Strategy

Globalization You can have

anything anywhere

1908 1950s 1885 1980s 2020

Page 16: Data Science in Action · 2015-07-28 · machine learning collaboratively • Have open-sourced several of our own tools for wide-spread ... team of machine learning experts from

© Copyright 2014 Pivotal. All rights reserved.

The Connected Car drives innovation

Communication

Handsfree telephony

WiFi hotspot Car2X

comms

Rich media comms

Telematics

eMobility solutions

Car2X solutions

Remote activation

Floating car data

Stolen vehicle recovery Remote

diagnosis

Driver assistance

Real-time parking info

Behaviour monitoring

Navigation Hybrid

Navigation

Vehicle Tracking

Map/PoI updates

Geo fencing

Fleet Management

Responsive PoIs

Real-time traffic info

Concierge services

Next gen Navigation

Social Media

Share my trip

Car sharing

Car search

Traffic updates

eCommerce

City toll

Payment solutions

Parking space reservation

Pay as you drive

Road tolls

Entertainment

Music streaming

Online games

Web radio

VoD Environmental

browsing

Page 17: Data Science in Action · 2015-07-28 · machine learning collaboratively • Have open-sourced several of our own tools for wide-spread ... team of machine learning experts from

© Copyright 2014 Pivotal. All rights reserved.

Data Science in Action

� The Value of Data Science

� The Practice of Data Science

�  In-depth Use Case: Traffic Prediction

� Use Case Overviews

� Q&A

Page 18: Data Science in Action · 2015-07-28 · machine learning collaboratively • Have open-sourced several of our own tools for wide-spread ... team of machine learning experts from

© Copyright 2014 Pivotal. All rights reserved.

Data Warehouse

Visualization/Quality Reporting

Traditional Analytics Process Data Sources

Structured Data •  Sensors •  Flight

recordings

Analysis Scheduling

Analysis

Operations Analysis

Time-to-Insights

sample

forecast

In-memory statistics tool

In-memory optimization tool solution

Page 19: Data Science in Action · 2015-07-28 · machine learning collaboratively • Have open-sourced several of our own tools for wide-spread ... team of machine learning experts from

© Copyright 2014 Pivotal. All rights reserved.

Data Warehouse

Augmenting an analytical architecture Data Sources

Structured Data •  Sensors •  Flight recordings

Analysis

Raw Sequence Data Unstructured

In Database Analytics

Unstructured Data

Image Data Geolocation Data Voice Transciption

External Data Sources

Visualization/Quality Reporting

Benefits of a new architecture: •  Eliminates data movement •  Enables rapid data re-processing •  Seamless integration of additional

external resources into analyses

Weather Data

Open Gov Data

Page 20: Data Science in Action · 2015-07-28 · machine learning collaboratively • Have open-sourced several of our own tools for wide-spread ... team of machine learning experts from

© Copyright 2014 Pivotal. All rights reserved.

Machine Learning and Big Data Getting the whole picture improves predictive power

Sensor Data

•  More data from different sources •  Provides a more complete view •  Improves statistics and inference

False positive rate

True

pos

itive

rate

Sensor Data

Flight Records

Sensor Data

Geoloc & Weather

Flight Records

Sensor Data

Geoloc & Weather Imaging Flight

Records

Page 21: Data Science in Action · 2015-07-28 · machine learning collaboratively • Have open-sourced several of our own tools for wide-spread ... team of machine learning experts from

© Copyright 2014 Pivotal. All rights reserved.

The main types of use cases in practice

Data Mining Predicting Behavior Optimization

•  Categorize types (segmentation)

•  Categorize behaviors/usage

•  Identify co-occurrences and associations

•  Identify anomalies

•  Identify attitudes

•  Resolve entities

•  Predict churn likelihood

•  Predict cross/up sales potential

•  Predict fraud/waste/abuse likelihood

•  Predict performance

•  Predict reliability/quality

•  Make a “recommendation”

•  Optimize processes

•  Optimize process parameters

•  Optimize asset allocation

Page 22: Data Science in Action · 2015-07-28 · machine learning collaboratively • Have open-sourced several of our own tools for wide-spread ... team of machine learning experts from

© Copyright 2014 Pivotal. All rights reserved.

1 Find Data Platforms •  Greenplum DB •  Pivotal HD •  Hadoop (other) •  SAS HPA •  AWS

2 Write Code

Editing Tools •  Vi/Vim •  Emacs •  Smultron •  TextWrangler •  Eclipse •  Notepad++ •  IPython •  Sublime

Languages •  SQL •  Bash scripting •  C •  C++ •  C# •  Java •  Python •  R

3 Run Code Interfaces •  pgAdminIII •  psql •  psycopg2 •  Terminal •  Cygwin •  Putty •  Winscp

4 Write Code for Big Data In-Database •  SQL •  PL/Python •  PL/Java •  PL/R •  PL/pgSQL

Hadoop •  Pig •  Hive •  Java •  HAWQ

5 Implement Algorithms

Libraries •  MADlib Java •  Mahout R •  (Too many to list!) Text •  OpenNLP •  NLTK •  GPText C++ •  opencv

Python •  numpy •  scipy •  scikit-learn •  Pandas Programs •  Alpine Miner •  Rstudio •  MATLAB •  SAS •  Stata

6 Show Results

Visualization •  python-matplotlib •  python-networkx •  D3.js •  Tableau

•  GraphViz •  Gephi •  R (ggplot2, lattice,

shiny) •  Excel

7 Collaborate

Sharing Tools •  Chorus •  Confluence •  Socialcast •  Github •  Google Drive &

Hangouts

PIVOTAL DATA SCIENCE TOOLKIT

A large and varied tool box!

Page 23: Data Science in Action · 2015-07-28 · machine learning collaboratively • Have open-sourced several of our own tools for wide-spread ... team of machine learning experts from

© Copyright 2014 Pivotal. All rights reserved.

As Data Scientists, what do we want?

Infrastructure Independent

Fast & Scalable

Schema Free

Easy to Use

Real Time

•  Open source

•  PaaS

•  In-database analytics

•  MPP

•  Hadoop •  In-memory data grids embedded into the platform

•  SQL, not Java

•  Faster than Hive

Page 24: Data Science in Action · 2015-07-28 · machine learning collaboratively • Have open-sourced several of our own tools for wide-spread ... team of machine learning experts from

© Copyright 2014 Pivotal. All rights reserved.

Data Science in Action

� The Value of Data Science

� The Practice of Data Science

�  In-depth Use Case: Traffic Prediction

� Use Case Overviews

� Q&A

Page 25: Data Science in Action · 2015-07-28 · machine learning collaboratively • Have open-sourced several of our own tools for wide-spread ... team of machine learning experts from

© Copyright 2014 Pivotal. All rights reserved. © Copyright 2014 Pivotal. All rights reserved.

What does traffic data look like?

Page 26: Data Science in Action · 2015-07-28 · machine learning collaboratively • Have open-sourced several of our own tools for wide-spread ... team of machine learning experts from

© Copyright 2014 Pivotal. All rights reserved.

…like this?

Page 27: Data Science in Action · 2015-07-28 · machine learning collaboratively • Have open-sourced several of our own tools for wide-spread ... team of machine learning experts from

© Copyright 2014 Pivotal. All rights reserved.

…or this?

(Note: This is the least offensive topic

cluster in our Twitter data!)

Page 28: Data Science in Action · 2015-07-28 · machine learning collaboratively • Have open-sourced several of our own tools for wide-spread ... team of machine learning experts from

© Copyright 2014 Pivotal. All rights reserved.

Velocity by Time of Day

Page 29: Data Science in Action · 2015-07-28 · machine learning collaboratively • Have open-sourced several of our own tools for wide-spread ... team of machine learning experts from

© Copyright 2014 Pivotal. All rights reserved.

Distribution of Velocity over Time

Page 30: Data Science in Action · 2015-07-28 · machine learning collaboratively • Have open-sourced several of our own tools for wide-spread ... team of machine learning experts from

© Copyright 2014 Pivotal. All rights reserved.

0 50 100 150 200

0.00

00.

005

0.01

00.

015

0.02

00.

025

0.03

0

dens

ity

Link 1000064869

km/h

Velocity Distribution

Page 31: Data Science in Action · 2015-07-28 · machine learning collaboratively • Have open-sourced several of our own tools for wide-spread ... team of machine learning experts from

© Copyright 2014 Pivotal. All rights reserved.

Find Velocity Groups �  Velocity distributions can be fit well

with Gaussians

�  An ‘overlay’ of multiple Gaussians is called Gaussian Mixture Model

�  GMM fitting of the velocity distribution is done by Expectation-Maximization algorithm

�  Shapes and positions of Gaussians determine velocity groups

0 50 100 150 200

0.00

00.

005

0.01

00.

015

0.02

00.

025

0.03

0

dens

ity

CombinedComponent 1Component 2Component 3Component 4

Link 1000064869

km/h

Page 32: Data Science in Action · 2015-07-28 · machine learning collaboratively • Have open-sourced several of our own tools for wide-spread ... team of machine learning experts from

© Copyright 2014 Pivotal. All rights reserved.

Gaussian Mixture Model

0 50 100 150 200

0.000

0.005

0.010

0.015

0.020

0.025

0.030

dens

ity

CombinedComponent 1Component 2Component 3Component 4

Link 1000064869

km/h 0 50 100 150 200

0.00

00.

005

0.01

00.

015

0.02

00.

025

0.03

0

dens

ity

CombinedComponent 1Component 2Component 3Component 4

Link 1000064869

km/h

0 50 100 150 200

0.00

00.

005

0.01

00.

015

0.02

00.

025

0.03

0

dens

ity

CombinedComponent 1Component 2Component 3Component 4

Link 1000064869

km/h

0 50 100 150 200

0.00

00.

005

0.01

00.

015

0.02

00.

025

0.03

0

dens

ity

CombinedComponent 1Component 2Component 3Component 4

Link 1000064869

km/h

Page 33: Data Science in Action · 2015-07-28 · machine learning collaboratively • Have open-sourced several of our own tools for wide-spread ... team of machine learning experts from

© Copyright 2014 Pivotal. All rights reserved.

Decision Trees Example

0 50 100 150 200 250

0.00

0.01

0.02

0.03

0.04

dens

ity

CombinedComponent 1Component 2Component 3

hour >= 14

hour < 20

weekday = 1,2,3,4,5 weekday = 1,2,3,4,5

weekday = 1,2,3,4,5

nextlink = −1,100000

< 14

>= 20

6,7 6,7

6,7

100002

10.47

10.69

10.76

10.81

30.85

20.55

20.73

30.65

20.56

20.63

20.68

30.49

30.73

Page 34: Data Science in Action · 2015-07-28 · machine learning collaboratively • Have open-sourced several of our own tools for wide-spread ... team of machine learning experts from

© Copyright 2014 Pivotal. All rights reserved.

Sneak Peek at our TfL Data Demo �  Used the freely

accessible TfL data for a demo

�  Shows # of active disruptions over different days in London

Ø  Rush hour effects visible Ø  Nights are more quiet, but more disruptions on weekend nights

Page 35: Data Science in Action · 2015-07-28 · machine learning collaboratively • Have open-sourced several of our own tools for wide-spread ... team of machine learning experts from

© Copyright 2014 Pivotal. All rights reserved.

Data Science in Action

� The Value of Data Science

� The Practice of Data Science

�  In-depth Use Case: Traffic Prediction

� Use Case Overviews – not presented

� Q&A

Page 36: Data Science in Action · 2015-07-28 · machine learning collaboratively • Have open-sourced several of our own tools for wide-spread ... team of machine learning experts from

BUILT FOR THE SPEED OF BUSINESS