Dataiku - From Big Data To Machine Learning

Post on 27-Jan-2015

description

This presentation was given to an audience of CIOs to raise awareness of big data in practical terms and of the new uses of machine learning and analytics.

Transcript of Dataiku - From Big Data To Machine Learning

1Dataiku04/10/2023

Hi!

Florian Douetteau

Current Life: CEO, Dataiku

Past Life: Criteo, IsCool Entertainment, Exalead

Tweet about this: @dataiku @club_dsi_gun

Available on SlideShare: http://www.slideshare.net/Dataiku

Goals Today:
• Concrete Feedback on Data Analytics Projects
• Data Team in practice and Key technologies
• Motivate you to start a data science project

Slide deck allergic? Check: https://github.com/dataiku

Dataiku

Dataiku: an open source platform to help you build your data lab

Motivation

Collocation

Big Apple

Big Mama

Big Data

Collocation: a familiar grouping of words, especially words that habitually appear together and thereby convey meaning by association.

“Big” Data in 1999

struct Element { Key key; void* stat_data; } …

• C Optimized Data structures
• Perfect Hashing
• HP-UNIX Servers – 4GB RAM
• 100 GB data
• Web Crawler – Socket reuse HTTP 0.9

1 Month

• Hadoop
• Java / Pig / Hive / Scala / Clojure / …
• A Dozen NoSQL data stores
• MPP Databases
• Real-Time

Big Data in 2013

1 Hour

Data Analytics: The Stakes

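[Chart: data volume and business value by sector and year]

Sectors shown: Web Search (1999), Logistics (2004), Banking CRM (2008), Web Search (2010), Social Gaming (2011), Online Advertising (2012), E-Commerce (2013)

Data points shown (volume, value): 1 TB, ? $ • 1 TB, 100M $ • 1 TB, 1B $ • 100 TB, ? $ • 10 TB, 10M $ • 1000 TB, 500M $ • 50 TB, 1B $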

Meet Hal Alowne

Big Guys
• 10B$+ Revenue
• 100M+ customers
• 100+ Data Scientists

Hal Alowne, BI Manager, Dim’s Private Showroom

“Hey Hal! We need a big data platform like the big guys. Let’s just do as they do!”

Dim Sum, CEO & Founder, Dim’s Private Showroom

European E-commerce Web site
• 100M$ Revenue
• 1 Million customers
• 1 Data Analyst (Hal himself)

Big Data Copy Cat Project

Technology is complex

[Map of the technology landscape]

Regions: Machine Learning Mystery Land • Scalability Central • NoSQL-Slavia • SQL Columnar Republic • Visualization County • Data Clean Wasteland • Statistician Old House

Technologies: Hadoop • Ceph • Sphere • Cassandra • Spark • Scikit-Learn • Mahout • WEKA • MLBase • RapidMiner • Pandas • D3 • Crossfilter • InfiniDB • LucidDB • Impala • Elastic Search • SOLR • MongoDB • Riak • Membase • Pig • Hive • Cascading • Talend • R

Statistics and Machine Learning are complex!

Try to understand myself

(Some books you might want to read)

Plumbing is not complex (but difficult)

Implicit User Data (Views, Searches…)

Content Data (Title, Categories, Price, …)

Explicit User Data (Click, Buy, …)

User Information (Location, Graph…)

500TB

50TB

1TB

200GB

Transformation Matrix

Transformation Predictor

Per User Stats

Per Content Stats

User Similarity

Rank Predictor

Content Similarity

MERIT = TIME + ROI

Targeted Newsletter

Recommender Systems

Adapted Product/ Promotions

TIME : 6 MONTHS ROI : APPS

Build a lab in 6 months (rather than 18 months)

• Find the right people (6 months?)
• Choose the technology (6 months?)
• Make it work (6 months?)
• Build the lab (6 months)

Deploy apps that actually deliver value

2013 2014

2013

• Train People
• Reuse working patterns

The Problem

It’s utterly complex and unreasonable

Our Goal

Our Goal:

Change his perspective on data science projects

(sorry, we couldn’t find a picture of Hal smiling)

Agenda

• Why and For What?
  ◦ Business Theory
  ◦ Concrete Projects
• How people and project?
  ◦ How to start
  ◦ Dedicated team?
• What technologies?
  ◦ Machine Learning
  ◦ Architecture

Embodiment of Knowledge

Find your core business advantage

Product Success driven by Quality !

Margin / Customer Value / Traffic / Acquisition

Example: Launching an App on the App Store

Margin for new customers might decline …

Margin for new features might decline …

Is your business really scalable ?

you continue growing ….

Existing Customers Profiles

Existing Product Assets

Existing Specific Business Model

And your KNOWLEDGE of it

Where is your core business advantage ?

Data Driven Business: What is your value?

Number of Customers

Customer Knowledge

Increases over time with:
- Time spent in your app
- User relationships (network effect)
- Partner / Other Apps Interactions

Your Value

1,409,540 $1,03$2,57

$4,081,710,239

2,534,123

Data Impact: Not all businesses are equal

Online Advertising

Telecommunication

Insurance

Ability to Acquire

Margin New Services Overall

Subscription Market

Infrastructure Driver

Selling Data

Risk / Price Optimization

Subscription Market

Subscription Market

From Theory To Practice

Concrete Projects

What should be free in the application ?

How to optimize conversion ?

How to plan and create a business model ?

Main Pain Point: How to plan and optimize pricing in the application?

Freemium Application

Example (Freemium Application): Freemium Model Optimization

BusinessModel

User Cluster

Simulation

Optimized Pricing: Margin +23%

Business Planning Capability 1 month 9 months

R + Python + InfiniDB • On-Premise • 1 TB Dataset • 5-week project

Business Intelligence stack has scalability and maintenance issues

Backoffice implements business rules that are challenged

Existing infrastructure cannot cope with per-user information

Main Pain Point: 23 hours 52 minutes to compute Business Intelligence aggregates for one day.

Large E-Retailer

• Relieve their current DWH and accelerate production of some aggregates/KPIs

• Be the backbone for new personalized user experience on their website: more recommendations, more profiling, etc.

• Train existing people around machine learning and segmentation experience

1h12 to perform the aggregate, available every morning

New home page personalization deployed in a few weeks

Hadoop Cluster (24 cores) • Google Compute Engine • Python + R + Vertica • 12 TB dataset • 6-week project

Large E-Retailer : The Datalab

BI performed directly on production databases

New reports required the CTO’s direct involvement in design and implementation

Each photo tag manually validated and completed

Large Photo Bank

Main Pain Point: No visibility on new users’ behaviours

Implementing a Cloud-based data lab to :

• centralize all available data, previously scattered between SQL DB and file systems,

• improve web tracking granularity to enhance customer knowledge via behavior modeling and segmentation,

• create content-based recommendation engines with keywords clustering and association.

Large Photo Bank : The Datalab

R + Vertica + Hadoop • Amazon Web Services • 8-week project

Automated content filtering and recommendation

Large set of manually crafted linguistic resources for interpreting user queries

New brands, rare terms… hard to maintain

Large Online Directory

Main Pain Point: Ability to maintain a very large ontological knowledge set, with more than 100k concepts

Analyze clicks, rephrasing navigation to detect queries that require specific processing

Gather web and external data to enrich the existing index

Train the team on Hadoop and Machine Learning

Continuous Relevance Monitoring

Automated enrichment: 2x more productivity

Hadoop (48 cores) • Python • On Premise • 10-week project

Large Online Directory: The Data Lab

Launch A Marketing campaign

After a few days, PREDICT based on behaviours:
◦ Total ARPU for users after 3 months
◦ Efficiency of a campaign
◦ Continue or not?

Example (E-Application): Marketing Campaign Prediction

A very large community

Some mid-size communities

Lots of small clusters (mostly 2 players)

Correlation
◦ between community size and engagement / virality

Meaningful patterns
◦ 2 players / Family / Group

What is the minimum number of friends to have in the application to get additional engagement?

Example (Social Gaming) Social Gaming Communities

Agenda

• What others do?
  ◦ Concrete Projects
• How people and project?
  ◦ How to start
  ◦ Dedicated team?
• What technologies?
  ◦ Machine Learning
  ◦ Architecture

First Steps

A / B Test (or equivalent for your business) is the first step to get into a “data-driven” mind set

No advanced analytics required; some existing tools can help

Changing a button color: +21%

(1) Be Data Driven
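A lift such as the "+21%" quoted on this slide only tells you something if it can be separated from random noise. The sketch below shows one common way to check that, a two-sided two-proportion z-test in plain Python. The visitor and conversion counts are invented for illustration and are not figures from the talk.

```python
import math

def ab_test(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test for an A/B experiment.

    Returns (relative_lift, z_score, p_value).
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis "A and B convert equally".
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    lift = (p_b - p_a) / p_a
    return lift, z, p_value

# Hypothetical counts: 10,000 visitors per variant (assumed, not from the slides).
lift, z, p_value = ab_test(conv_a=1000, n_a=10000, conv_b=1210, n_b=10000)
print(f"lift={lift:+.1%}  z={z:.2f}  p={p_value:.2g}")
```

With much smaller samples the same relative lift could easily fail to reach significance, which is why the test is run on counts and not on the percentage alone.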

People Microsoft Excel

(2) Use Excel

Data Team Data Tools

(3) Build a team

The Business Expert who knows maths

The Analyst who reveals patterns

The Coding Guy who is enthusiastic

data lab, (n. m): a small group with all the expertise, including business minded people, machine learning knowledge and the right technology

A proven organization used by successful data-driven companies over the past few years (eBay, LinkedIn, Walmart…)

TEAM + TOOLS = LAB

Organization

Targeted campaigns
Price optimization

Personalized experience

Quality Assurance
Workload and yield management

User Feedback (A/B Test)
Continuous improvement

Data

Product Designer

Business & Marketing

Engineers

User Voice

Team | Short Term Focus | Long Term Drive
Business People | Optimize Margin, … | Create new business revenue streams
Marketing People | Optimize click ratio | Brand awareness and impact
IT People | Make IT work | Clean and efficient Architecture
Data People | Get Stats Right, make predictions | Create Data Driven Features

It’s just a new team

Super Intern

What is your ability to integrate a new smart guy and give him any data he would need and any computing power he would need to enhance your product?

Agenda

• What others do?
  ◦ Concrete Projects
• How people and project?
  ◦ How to start
  ◦ Dedicated team?
• What technologies?
  ◦ Machine Learning
  ◦ Architecture

An oversimplified view of big data architecture

Architecture Patterns

Database Business Layer Application

(What it really looks like)

What kind of scale?

Database Business Layer Application

Or

Data Science App

Or ?

What kind of interaction ?

Database Business Layer Application

Data Science App

?

?

? ? ?

?

Classic Columnar Architecture

Some data Some Place To Pour It In

Some Tool To Do Some Maths And Graphs

Classic Columnar Architecture

Lots of data Some Place To Pour It In

Some Tool To Do Some Maths And Graphs

Web Tracking Logs

Raw Server Logs

Order / Product / Customer

Facebook Info

Open Data (Weather, Currency …)

The Corinthian Architecture

Lots of data

Some Place To Perform Rapid Calculations

Some Tools To Do Some Maths And Charts

Some Place To Pour It In And Clean / Prepare It

Data Storage And Preparation

Large Scale:
• Hadoop Cluster
• Cassandra
• MPP SQL Columnar

Medium/Large Scale:
• CouchBase
• MongoDB
• …

Selection Drivers
• Volume
• Scalability

Calculations

Classic Database
• PostgreSQL
• MySQL
• …

MPP SQL Database
• Vertica, Vectorwise, InfiniDB, Greenplum HD…

Hadoop New Databases
• Impala

Selection Drivers:

Speed ( Interactivity )

Expressivity

The Corinthian Architecture

Lots of data

Some Place To Perform Rapid Calculations

Some Tools To Do Some Maths And Charts

Some Place To Pour It In And Clean / Prepare It

Statistics

Cohorts

Regressions

Bar Charts For Marketing

Nice Infographics for your Company Board

The Corinthian Architecture

Lots of data

Some Database To Perform Rapid Calculations

Some Tools To Do Some Maths Some Other To Do Some Charts

Some Place To Pour It In And Clean / Prepare It

Statistical Tools

Open Source:
• IPython
• RStudio

Commercial
• RapidMiner
• SAS
• RevolutionR

Selection Drivers

Existing Knowhow

Scalability

What is a statistical tool ?

Interact and explore data

Some stats capabilities

Some Graph Capabilities

Visualization Tools

Commercial:
• Spotfire
• Tableau
• QlikView

SaaS
• BIME
• ChartIO
• RevolutionR

HTML5 / AdHoc
• D3
• GraphViz

Selection Drivers

How Many Contributors / Readers ?

Scalability

The “one database won’t do it all” problem

Lots of data

Some Database To Perform Rapid Calculations

Some Tools To Do Some Maths Some Other To Do Some Charts

Some Place To Pour It In And Clean / Prepare It

JOIN / Aggregate

Rapid Group By Computations

Direct access to the computed results for production, etc.

The Roman Social Forum

Lots of data

Some Database To Perform Rapid Calculations And Some Database For Graphs

Some Tools To Do Some Maths Some Other To Do Some Charts

Some Place To Pour It In And Clean / Prepare It

Graph

Databases
• Neo4J
• Titan
• OrientDB
• InfiniteGraph

Analytic / Visualization
• Gephi

Selection Drivers

Scalability

What Algorithms ?

Licensing Constraints

The Key Value Store

Lots of data

Some Database To Perform Rapid Calculations And Some Database For Graphs And Some Distributed Key Value Store

Some Tools To Do Some Maths Some Other To Do Some Charts

Some Place To Pour It In And Clean / Prepare It

NoSQL

Search
• SOLR
• ElasticSearch

Document
• MongoDB
• CouchDB

KeyValue
• Redis
• HBase

Selection Drivers

Durability / Availability …

Performance

Ease of use and API

Indexing

Action requires Prediction

Lots of data

Some Database To Perform Rapid Calculations And Some Database For Graphs And Some Distributed Key Value Store

Some Tools To Do Some Maths Some Other To Do Some Charts

Some Place To Pour It In And Clean / Prepare It

Draw A Line For the future

What are my real users groups ?

Should I launch a discount offering or not ? To everybody or to specific users only ?

The Medieval Fairy Land

Lots of data Some Tools To Do Some Maths Some Other To Do Some Charts and some MACHINE LEARNING

Some Place To Pour It In And Clean / Prepare It

Some Database To Perform Rapid Calculations And Some Database For Graphs And Some Distributed Key Value Store

Predictions

Java
• Mahout (Hadoop)
• WEKA

Python
• Scikit-Learn
• PyML

R

Commercial
• Kxen
• SAS
• SPSS…

Selection Drivers

Scalability

Black Box / White Box ?

Data Management Integration

Can be fun

Machine Learning

Classes of Machine Learning Problems

Exploratory Data Analysis
◦ Identifying and visualizing key patterns and correlations within the dataset

Unsupervised Learning
◦ Create groups of similar observations sharing the same patterns (aka Clustering, Segmentation)

Supervised Learning
◦ Modeling a variable using independent features (aka Scoring, Predictive Modeling, Classification)

Time Series Forecasting
◦ Predict a time-dependent variable using its own history, and sometimes other covariates (variables)

Graph Analysis
◦ Analyzing relationships between a set of “nodes”, linked by “edges”

Associations / Sequences Mining
◦ Identifying frequently associated items within transactions / events databases, sometimes ordered over time

And many more…

Mapping ML to Business Questions

Class | Sample Business Questions
Exploratory Data Analysis | What does my dataset look like? What are the key correlations in my data?
Unsupervised Learning | Can I create groups of users who share the same purchasing behavior? The same navigation behavior?
Supervised Learning | Which users are likely to click on ad X? Which users are likely to convert to paying users? Who is going to leave my service? What is the profile of the users who do X?
Time Series Forecasting | What is the forecast of my revenue next month? Given the weather forecast, can I also forecast my sales? Product sales forecast (for overbooking)
Graph Analysis | Can I identify influencers in my user community? Can I recommend new friends to my users?
Association & Sequences Mining | Which products are frequently bought together? What is the typical navigation path on my website?
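The question "Which products are frequently bought together?" can be answered, in its simplest form, by counting how often each pair of products shares a basket. The sketch below does that with the standard library only. The baskets and the `min_support` threshold are made up for illustration, and real association mining (Apriori and similar) extends the same idea to larger itemsets.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_support):
    """Count product pairs and keep those present in at least `min_support` baskets."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) removes duplicates and gives each pair one canonical order.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

# Hypothetical order baskets (not data from the talk).
baskets = [
    ["shoes", "socks", "belt"],
    ["shoes", "socks"],
    ["belt", "hat"],
    ["shoes", "socks", "hat"],
]
pairs = frequent_pairs(baskets, min_support=2)
print(pairs)
```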

Machine Learning Methods Detailed

Analytical Task | ML Task | Sample Algorithms | Shape of Dataset
Exploratory Data Analysis | Univariate Analysis | Distribution, frequencies, histogram, boxplots, fit tests... | N obs. (1 row per obs.) * P features
Exploratory Data Analysis | Bivariate Analysis | Scatterplots, correlations (Pearson, Spearman), GLM, Chi Square... | N obs. (1 row per obs.) * P features
Exploratory Data Analysis | Multivariate Analysis | Principal components analysis, multi-dimensional scaling, correspondence analysis, factor analysis… | N obs. (1 row per obs.) * P features
“Oriented” Data Analysis | Unsupervised Learning | K-means, K-medoids, hierarchical clustering, Gaussian mixture models, mean shift, DBSCAN, spectral clustering... | N obs. (1 row per obs.) * P features
“Oriented” Data Analysis | Supervised Learning | Linear & logistic regression, decision trees, neural networks, SVM, naïve Bayes, K-NN, random forests… | N obs. (1 row per obs.) * P features
“Oriented” Data Analysis | Time Series Forecasting | ARMA, VARMAX, ARIMA… | Time Series (rows: time period, columns: measures)
“Oriented” Data Analysis | Graph Analysis | Centrality (closeness, betweenness, PageRank, HITS), modularity (Louvain)… | Nodes and Edges lists (+ attributes)
“Oriented” Data Analysis | Associations & Sequences | Frequent Itemsets, Apriori, Market Basket… | (Timestamped) events or transactions
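To make the "Nodes and Edges lists" shape in the Graph Analysis row concrete, here is a minimal sketch of PageRank by power iteration over a plain edge list. The edge list, the damping factor of 0.85, and the fixed iteration count are illustrative assumptions. A production job would more likely use a graph library or a distributed implementation.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Every node receives the same small "teleport" share first.
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for src in nodes:
            # A node without outgoing links spreads its rank over all nodes.
            targets = out_links[src] or nodes
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank

# Hypothetical "follows" graph: three users follow bob, bob follows ann.
edges = [("ann", "bob"), ("carl", "bob"), ("dan", "bob"), ("bob", "ann")]
ranks = pagerank(edges)
print(max(ranks, key=ranks.get))
```

In this toy graph the node with the most incoming links ends up with the highest score, which is the intuition behind using centrality to look for influencers.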

Unsupervised Method: K-Means

Cluster a dataset into K buckets by choosing the “closest” neighbours
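The K-Means idea on this slide can be sketched in a few lines of plain Python: assign each point to its closest centroid, move each centroid to the mean of its points, and repeat. The toy points, the random initialisation, and the fixed number of iterations are assumptions made for illustration. Libraries such as Scikit-Learn provide tuned implementations.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm: returns (centroids, label of each point)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    labels = [min(range(k), key=lambda i: sq_dist(p, centroids[i])) for p in points]
    return centroids, labels

# Two well-separated toy groups (illustrative data only).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

On less clearly separated data the result depends on the starting centroids, so it is common to run the algorithm several times and keep the best clustering.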

Supervised Method: K-Nearest-Neighbours

Predict the color of a point depending on the colors of its K closest neighbours
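As a minimal sketch of the idea described on this slide, the function below predicts the colour of a new point by majority vote among its K closest labelled points. The training points, the choice of K = 3, and the use of Euclidean distance are illustrative assumptions.

```python
from collections import Counter

def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn_predict(train, query, k=3):
    """Majority vote among the k training points closest to `query`.

    `train` is a list of (point, label) pairs.
    """
    nearest = sorted(train, key=lambda item: sq_dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical labelled points.
train = [
    ((1, 1), "red"), ((1, 2), "red"), ((2, 1), "red"),
    ((8, 8), "blue"), ((8, 9), "blue"), ((9, 8), "blue"),
]
print(knn_predict(train, (2, 2)))
print(knn_predict(train, (7, 7)))
```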

Supervised Method: Decision Tree

Find the most “significant” input variable and split value

Split the dataset recursively
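The first step on this slide, finding the most "significant" variable and split value, can be sketched as an exhaustive search over candidate thresholds. This sketch assumes Gini impurity as the measure of significance, which is one common choice and is not stated in the talk. The rows and labels are invented. A full tree would call `best_split` again on each side of the split until the groups are pure or too small.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure group)."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_gini) of the best binary split.

    A row goes left when row[feature_index] <= threshold, right otherwise.
    """
    best = None
    for f in range(len(rows[0])):
        for threshold in sorted({row[f] for row in rows}):
            left = [lab for row, lab in zip(rows, labels) if row[f] <= threshold]
            right = [lab for row, lab in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue  # a split must put at least one row on each side
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[2]:
                best = (f, threshold, score)
    return best

# Hypothetical customers: (age, visits last month) and whether they bought.
rows = [(25, 1), (30, 8), (45, 2), (50, 9), (35, 7), (40, 1)]
labels = ["no", "yes", "no", "yes", "yes", "no"]
print(best_split(rows, labels))
```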

Several Paths to Machine Learning

[Flowchart: choosing a method, starting from the Analytical Dataset]

Entry questions: I’m looking for clusters • I want to predict a variable • I just want to explore • I’m looking variable by variable, or pairs

Decision nodes (Yes / No): I know how many groups to look for • Small Dataset (<<1K) • Medium Dataset (<<100K) • I can sample • All my variables are numeric (Yes / Not Only) • I have a distance matrix • I value interpretability

Unsupervised Learning: HCA… • Partitioning (K-means…) • GMM… • DP GMM • K-means + Gap | Silhouette | … • 2-steps clustering • Affinity Propagation, Mean Shift…

Exploratory Data Analysis: Data Viz… • PCA… • CA… • MDS… • Correlation Analysis • GLM • Parametric and non parametric stat. tests

Supervised Learning*: Generalized Linear Model • Simple Decision Tree • Support Vector Machines • Neural Networks • K-Nearest Neighbors • Ensembles (Random Forest, Gradient Boosted Tree) • MARS • Generalized Additive Model

* Methods generally working for both classification & regression

Questions ?

Take Away
◦ There are new ways to perform data analytics that are within your reach and can bring business value

Some Additional Resources
◦ Open Source Projects
  Dataiku Cloud Transport Client: http://dctc.io
  Dataiku Web Tracker: https://github.com/dataiku/wt1
◦ Our Technical Blog: http://www.dataiku.com/blog