Dataiku - From Big Data To Machine Learning
Transcript of Dataiku - From Big Data To Machine Learning
1Dataiku04/10/2023
Hi !
Florian Douetteau
Current Life: CEO, Dataiku
Past Life: Criteo, IsCool Entertainment, Exalead
Tweet about this: @dataiku @club_dsi_gun
Available on SlideShare: http://www.slideshare.net/Dataiku
Goals Today:
• Concrete Feedback on Data Analytics Projects
• Data Team in practice and Key technologies
• Motivate you to start a data science project
Slide deck allergic ? Check: https://github.com/dataiku
Dataiku
“Dataiku: an open source platform to help you build your data lab”
Motivation
Collocation
Collocation: a familiar grouping of words, especially words that habitually appear together and thereby convey meaning by association.
Big Apple
Big Mama
Big Data
“Big” Data in 1999
struct Element { Key key; void* stat_data; } ….
C Optimized Data Structures
Perfect Hashing
HP-UNIX Servers – 4 GB RAM
100 GB data
Web Crawler – Socket reuse HTTP 0.9
1 Month
Big Data in 2013
Hadoop
Java / Pig / Hive / Scala / Clojure / …
A Dozen NoSQL data stores
MPP Databases
Real-Time
1 Hour
Data Analytics: The Stakes
Sectors shown: Social Gaming (2011), Web Search (1999), Logistics (2004), Online Advertising (2012), E-Commerce (2013), Banking CRM (2008), Web Search (2010)
Data volume / value labels: 1 TB / ? $, 1 TB / 100M $, 1 TB / 1B $, 100 TB / ? $, 10 TB / 10M $, 1000 TB / 500M $, 50 TB / 1B $
Meet Hal Alowne
Dataiku - Data Tuesday
Hal Alowne, BI Manager, Dim’s Private Showroom
Dim Sum, CEO & Founder, Dim’s Private Showroom
“Hey Hal ! We need a big data platform like the big guys. Let’s just do as they do!”
European E-commerce Web site
• 100M$ Revenue
• 1 Million customers
• 1 Data Analyst (Hal himself)
Big Guys
• 10B$+ Revenue
• 100M+ customers
• 100+ Data Scientists
Big Data Copy Cat Project
Technology is complex
Tools: Hadoop, Ceph, Sphere, Cassandra, Spark, Scikit-Learn, Mahout, WEKA, MLBase, RapidMiner, Pandas, D3, Crossfilter, InfiniDB, LucidDB, Impala, Elastic Search, SOLR, MongoDB, Riak, Membase, Pig, Hive, Cascading, Talend, R
Map regions: Machine Learning Mystery Land, Scalability Central, NoSQL-Slavia, SQL Columnar Republic, Visualization County, Data Clean Wasteland, Statistician Old House
Statistics and Machine Learning are complex !
Try to understand myself
(Some books you might want to read)
Plumbing is not complex (but difficult)
Implicit User Data (Views, Searches…)
Content Data (Title, Categories, Price, …)
Explicit User Data (Click, Buy, …)
User Information (Location, Graph…)
500TB
50TB
1TB
200GB
Transformation Matrix
Transformation Predictor
Per User Stats
Per Content Stats
User Similarity
Rank Predictor
Content Similarity
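The stages named on this slide (per-user stats, per-content stats, content similarity) can be illustrated with a minimal Python sketch. The event data, the function names and the choice of cosine similarity over sets of users are assumptions made for illustration; the slide does not say how these stages were actually computed.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical explicit user data: (user, item) click events.
events = [
    ("u1", "shoes"), ("u1", "socks"),
    ("u2", "shoes"), ("u2", "socks"), ("u2", "hat"),
    ("u3", "hat"),
]

def per_user_stats(events):
    """Count events per user."""
    counts = defaultdict(int)
    for user, _ in events:
        counts[user] += 1
    return dict(counts)

def per_content_stats(events):
    """Count events per item."""
    counts = defaultdict(int)
    for _, item in events:
        counts[item] += 1
    return dict(counts)

def content_similarity(events):
    """Cosine similarity between items, based on the sets of users who clicked them."""
    users_by_item = defaultdict(set)
    for user, item in events:
        users_by_item[item].add(user)
    items = sorted(users_by_item)
    sims = {}
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            shared = len(users_by_item[a] & users_by_item[b])
            norm = sqrt(len(users_by_item[a]) * len(users_by_item[b]))
            sims[(a, b)] = shared / norm
    return sims

user_stats = per_user_stats(events)
item_stats = per_content_stats(events)
sims = content_similarity(events)
```

At the data volumes quoted on the slide, the same logic would run as distributed jobs rather than in-memory loops; the sketch only shows what each stage produces.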
MERIT = TIME + ROI
TIME : 6 MONTHS
Build a lab in 6 months (rather than 18 months)
• Find the right people (6 months?)
• Choose the technology (6 months?)
• Make it work (6 months?)
• Build the lab (6 months)
• Train People
• Reuse working patterns
ROI : APPS
Deploy apps that actually deliver value
• Targeted Newsletter
• Recommender Systems
• Adapted Product / Promotions
2013 2014
The Problem
It’s utterly complex and unreasonable
Our Goal
Change his perspective on data science projects
(sorry, we couldn’t find a picture of Hal smiling)
Agenda
Why and For What ?
◦ Business Theory
◦ Concrete Projects
How people and project ?
◦ How to start
◦ Dedicated team ?
What technologies ?
◦ Machine Learning
◦ Architecture
Embodiment of Knowledge
Find your core business advantage
Product Success driven by Quality !
Margin / Customer Value / Traffic / Acquisition
Example: Launching an App on the App Store
Is your business really scalable ?
Margin for new customers might decline …
Margin for new features might decline …
you continue growing ….
Where is your core business advantage ?
Existing Customers Profiles
Existing Product Assets
Existing Specific Business Model
And your KNOWLEDGE of it
Data Driven Business: What is your value ?
Your Value
Number of Customers
Customer Knowledge
Increases over time with:
- Time spent in your app
- User relationship (network effect)
- Partner / Other Apps Interactions
1,409,540
$1,03
$2,57
$4,081,710,239
2,534,123
Data Impact: Not all businesses are equal
Online Advertising
Telecommunication
Insurance
Ability to Acquire
Margin New Services Overall
Subscription Market
Infrastructure Driver
Selling Data
Risk / Price Optimization
Subscription Market
Subscription Market
From Theory To Practice
Concrete Projects
Freemium Application
What should be free in the application ?
How to optimize conversion ?
How to plan and create a business model ?
Main Pain Point: How to plan and optimize pricing in the application ?
Example (Freemium Application): Freemium Model Optimization
Business Model
User Cluster
Simulation
Optimized Pricing: Margin +23%
Business Planning Capability 1 month 9 months
R + Python + InfiniDB
On-Premise
1 TB Dataset
5 weeks project
Large E-Retailer
Business Intelligence stack has scalability and maintenance issues
Backoffice implements business rules that are challenged
Existing infrastructure cannot cope with per-user information
Main Pain Point: 23 hours 52 minutes to compute Business Intelligence aggregates for one day.
Large E-Retailer : The Datalab
• Relieve their current DWH and accelerate production of some aggregates/KPIs
• Be the backbone for new personalized user experience on their website: more recommendations, more profiling, etc.
• Train existing people around machine learning and segmentation experience
1h12 to perform the aggregate, available every morning
New home page personalization deployed in a few weeks
Hadoop Cluster (24 cores)
Google Compute Engine
Python + R + Vertica
12 TB dataset
6 weeks project
Large Photo Bank
BI performed directly on production databases
New reports required direct work by the CTO for design and implementation
Each photo tag manually validated and completed
Main Pain Point: No visibility on new users’ behaviours
Large Photo Bank : The Datalab
Implementing a Cloud-based data lab to:
• centralize all available data, previously scattered between SQL DB and file systems,
• improve web tracking granularity to enhance customer knowledge via behavior modeling and segmentation,
• create content-based recommendation engines with keywords clustering and association.
Automated content filtering and recommendation
R + Vertica + Hadoop
Amazon Web Services
8 weeks project
Large Online Directory
Large set of manually crafted linguistic resources for interpreting users’ queries
New brands, rare terms... hard to maintain
Main Pain Point: Ability to maintain a very large ontological knowledge set, with more than 100k concepts
Large Online Directory: The Data Lab
Analyze clicks, rephrasing, navigation to detect queries that require specific processing
Gather web and external data to enrich the existing index
Train team on Hadoop and Machine Learning
Continuous Relevance Monitoring
Automated enrichment: 2x more productivity
Hadoop (48 cores)
Python
On Premise
10 weeks project
Example (E-Application): Marketing Campaign Prediction
Launch a marketing campaign
After a few days, PREDICT based on behaviours:
◦ Total ARPU for users after 3 months
◦ Efficiency of a campaign
◦ Continue or not ?
Example (Social Gaming): Social Gaming Communities
A very large community
Some mid-size communities
Lots of small clusters (mostly 2 players)
Correlation
◦ between community size and engagement / virality
Meaningful patterns
◦ 2 players / Family / Group
What is the minimum number of friends to have in the application to get additional engagement ?
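The community structure described on this slide (one very large community, some mid-size ones, many two-player clusters) can be measured by computing connected components of the friendship graph. A minimal sketch follows, assuming a hypothetical list of friendship pairs; the slide does not state which method was used in the actual project.

```python
from collections import Counter, defaultdict

def community_sizes(friendships):
    """Return the sizes of connected components in an undirected friendship graph."""
    neighbours = defaultdict(set)
    for a, b in friendships:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen = set()
    sizes = []
    for start in neighbours:
        if start in seen:
            continue
        # Depth-first traversal of one component.
        stack = [start]
        seen.add(start)
        size = 0
        while stack:
            node = stack.pop()
            size += 1
            for other in neighbours[node]:
                if other not in seen:
                    seen.add(other)
                    stack.append(other)
        sizes.append(size)
    return sorted(sizes, reverse=True)

# Hypothetical data: one 4-player community and two 2-player clusters.
friendships = [("a", "b"), ("b", "c"), ("c", "d"), ("e", "f"), ("g", "h")]
sizes = community_sizes(friendships)
histogram = Counter(sizes)  # community size -> number of communities
```

Joining each player's community size with an engagement metric would then support the correlation question raised on the slide.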
Agenda
What others do ?
◦ Concrete Projects
How people and project ?
◦ How to start
◦ Dedicated team ?
What technologies ?
◦ Machine Learning
◦ Architecture
First Steps
(1) Be Data Driven
A/B testing (or the equivalent for your business) is the first step to get into a “data-driven” mindset
No advanced analytics required; some existing tools can help
Changing a button color: +21%
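Before acting on an A/B result such as the +21% above, it helps to check that the difference is not noise. A minimal sketch using a two-proportion z-test follows; the visitor and conversion counts are invented for illustration, and the slide does not say which test, if any, was applied.

```python
from math import erf, sqrt

def ab_test(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test. Returns (relative lift, z, two-sided p-value)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Standard normal CDF expressed with the error function.
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    p_value = 2 * (1 - cdf)
    lift = (p_b - p_a) / p_a
    return lift, z, p_value

# Hypothetical counts: 10.0% conversion for A, 12.1% for B.
lift, z, p_value = ab_test(conv_a=1000, n_a=10000, conv_b=1210, n_b=10000)
```

With these assumed counts the relative lift is 21% and the p-value is far below 0.05; with only a few hundred visitors per variant the same lift could easily be noise.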
People Microsoft Excel
(2) Use Excel
(3) Build a team
Data Team, Data Tools
The Business Expert who knows maths
The Analyst who reveals patterns
The Coding Guy who is enthusiastic
data lab, (n. m): a small group with all the expertise, including business minded people, machine learning knowledge and the right technology
A proven organization used by successful data-driven companies over the past few years (eBay, LinkedIn, Walmart…)
TEAM + TOOLS = LAB
Organization
Targeted campaigns
Price optimization
Personalized experience
Quality Assurance
Workload and yield management
User Feedback (A/B Test)
Continuous improvement
Data
Product Designer
Business & Marketing
Engineers
User Voice
It’s just a new team
Team | Short Term Focus | Long Term Drive
Business People | Optimize Margin, …. | Create new business revenue streams
Marketing People | Optimize click ratio | Brand awareness and impact
IT People | Make IT work | Clean and efficient Architecture
Data People | Get Stats Right, make predictions | Create Data Driven Features
Super Intern
What is your ability to integrate a new smart guy and give him any data he would need and any computing power he would need to enhance your product ?
An oversimplified view of big data architecture
Architecture Patterns
Database Business Layer Application
(What it really looks like)
What kind of scale?
Database Business Layer Application
Or
Data Science App
Or ?
What kind of interaction ?
Database Business Layer Application
Data Science App
Classic Columnar Architecture
Some data
Some Place To Pour It In
Some Tool To Do Some Maths And Graphs
Classic Columnar Architecture
Lots of data
Some Place To Pour It In
Some Tool To Do Some Maths And Graphs
Web Tracking Logs
Raw Server Logs
Order / Product / Customer
Facebook Info
Open Data (Weather, Currency …)
The Corinthian Architecture
Lots of data
Some Place To Perform Rapid Calculations
Some Tools To Do Some Maths And Charts
Some Place To Pour It In And Clean / Prepare It
Data Storage And Preparation
Large Scale:
• Hadoop Cluster
• Cassandra
• MPP SQL Columnar
Medium/Large Scale:
• CouchBase
• MongoDB
• ….
Selection Drivers
Volume
Scalability
Calculations
Classic Database
• PostgreSQL
• MySQL
• ….
MPP SQL Database
• Vertica, Vectorwise, InfiniDB, GreenplumHD….
Hadoop New Databases
• Impala
…
Selection Drivers:
Speed (Interactivity)
Expressivity
The Corinthian Architecture
Lots of data
Some Place To Perform Rapid Calculations
Some Tools To Do Some Maths And Charts
Some Place To Pour It In And Clean / Prepare It
Statistics
Cohorts
Regressions
Bar Charts For Marketing
Nice Infographics For Your Company Board
The Corinthian Architecture
Lots of data
Some Database To Perform Rapid Calculations
Some Tools To Do Some Maths
Some Other To Do Some Charts
Some Place To Pour It In And Clean / Prepare It
Statistical Tools
Open Source:
• IPython
• RStudio
Commercial:
• RapidMiner
• SAS
• RevolutionR
Selection Drivers
Existing Know-how
Scalability
What is a statistical tool ?
Interact and explore data
Some stats capabilities
Some Graph Capabilities
Visualization Tools
Commercial:
• SpotFire
• Tableau
• QlikView
SAAS:
• BIME
• ChartIO
• RevolutionR
HTML5 / AdHoc:
• D3
• GraphViz
Selection Drivers
How Many Contributors / Readers ?
Scalability
The “one database won’t do it all” problem
Lots of data
Some Database To Perform Rapid Calculations
Some Tools To Do Some Maths
Some Other To Do Some Charts
Some Place To Pour It In And Clean / Prepare It
JOIN / Aggregate
Rapid Group By Computations
Direct access to the computed results from production, etc.
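The JOIN / aggregate / GROUP BY workload that the calculation database must serve quickly looks like the following minimal sketch. It uses Python's built-in `sqlite3` module purely as a stand-in for the databases named in the deck; the table names, columns and values are hypothetical.

```python
import sqlite3

# In-memory database standing in for the "rapid calculations" store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, country TEXT)")
conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "FR"), (2, "FR"), (3, "DE")],
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 10.0), (1, 5.0), (2, 20.0), (3, 7.5)],
)

# JOIN orders to customers, then aggregate per country.
rows = conn.execute(
    """
    SELECT c.country, COUNT(*) AS n_orders, SUM(o.amount) AS revenue
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
    GROUP BY c.country
    ORDER BY c.country
    """
).fetchall()
conn.close()
```

The query shape stays the same on an MPP or columnar database; what changes is how fast it runs once the tables reach terabytes.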
The Roman Social Forum
Lots of data
Some Database To Perform Rapid Calculations
And Some Database For Graphs
Some Tools To Do Some Maths
Some Other To Do Some Charts
Some Place To Pour It In And Clean / Prepare It
Graph
Databases:
• Neo4J
• Titan
• OrientDB
• InfiniteGraph
Analytic / Visualization:
• Gephi
Selection Drivers
Scalability
What Algorithms ?
Licensing Constraints
The Key Value Store
Lots of data
Some Database To Perform Rapid Calculations
And Some Database For Graphs
And Some Distributed Key Value Store
Some Tools To Do Some Maths
Some Other To Do Some Charts
Some Place To Pour It In And Clean / Prepare It
NoSQL
Search:
• SOLR
• ElasticSearch
Document:
• MongoDB
• CouchDB
Key/Value:
• Redis
• HBase
…
Selection Drivers
Durability / Availability …
Performance
Ease of use and API
Indexing
Action requires Prediction
Lots of data
Some Database To Perform Rapid Calculations
And Some Database For Graphs
And Some Distributed Key Value Store
Some Tools To Do Some Maths
Some Other To Do Some Charts
Some Place To Pour It In And Clean / Prepare It
Draw a line for the future
What are my real user groups ?
Should I launch a discount offering or not ? To everybody or to specific users only ?
The Medieval Fairy Land
Lots of data
Some Tools To Do Some Maths
Some Other To Do Some Charts
And Some MACHINE LEARNING
Some Place To Pour It In And Clean / Prepare It
Some Database To Perform Rapid Calculations
And Some Database For Graphs
And Some Distributed Key Value Store
Predictions
Java:
• Mahout (Hadoop)
• WEKA
Python:
• Scikit-Learn
• PyML
R
Commercial:
• Kxen
• SAS
• SPSS…
…
Selection Drivers
Scalability
Black Box / White Box ?
Data Management Integration
Can be fun
Machine Learning
Classes of Machine Learning Problems
Exploratory Data Analysis
◦ Identifying and visualizing key patterns and correlations within the dataset
Unsupervised Learning
◦ Create groups of similar observations sharing the same patterns (aka Clustering, Segmentation)
Supervised Learning
◦ Modeling a variable using independent features (aka Scoring, Predictive Modeling, Classification)
Time Series Forecasting
◦ Predict a time-dependent variable using its own history, and sometimes other covariates (variables)
Graph Analysis
◦ Analyzing relationships between a set of “nodes”, linked by “edges”
Associations / Sequences Mining
◦ Identifying frequently associated items within transaction/event databases, sometimes ordered over time
And many more…
10/04/2023Dataiku - Innovation Services 69
Mapping ML to Business Questions
Class | Sample Business Questions
Exploratory Data Analysis | What does my dataset look like ? What are the key correlations in my data ?
Unsupervised Learning | Can I create groups of users who share the same purchasing behavior ? The same navigation behavior ?
Supervised Learning | What users are likely to click on ad X ? What users are likely to convert to paying users ? Who is going to leave my service ? What is the profile of the users who do X ?
Time Series Forecasting | What is the forecast of my revenue next month ? Given the weather forecast, can I also forecast my sales ? Product sales forecast (for overbooking)
Graph Analysis | Can I identify influencers in my users community ? Can I recommend new friends to my users ?
Association & Sequences Mining | Which products are frequently bought together ? What is the typical navigation path on my website ?
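The question “Which products are frequently bought together ?” can be answered, in its simplest form, by counting item pairs across baskets. The sketch below is a brute-force pair count with a minimum support threshold, not a full frequent-itemset algorithm; the baskets and the function name are hypothetical.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_support):
    """Count how many baskets contain each pair of items; keep pairs seen at least min_support times."""
    counts = Counter()
    for basket in baskets:
        # Deduplicate and sort so each pair has one canonical form.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

# Hypothetical transactions.
baskets = [
    ["bread", "butter", "jam"],
    ["bread", "butter"],
    ["bread", "jam"],
    ["milk"],
]
pairs = frequent_pairs(baskets, min_support=2)
```

Counting every pair grows quadratically with basket size, which is why dedicated frequent-itemset methods prune candidates on larger catalogues.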
Machine Learning Methods Detailed
ML Task | Sample Algorithms | Shape of Dataset
Analytical Task: Exploratory Data Analysis
Univariate Analysis | Distribution, frequencies, histogram, boxplots, fit tests... | N obs. (1 row per obs.) * P features
Bivariate Analysis | Scatterplots, correlations (Pearson, Spearman), GLM, Chi Square... | N obs. (1 row per obs.) * P features
Multivariate Analysis | Principal components analysis, multi-dimensional scaling, correspondence analysis, factor analysis… | N obs. (1 row per obs.) * P features
Analytical Task: “Oriented” Data Analysis
Unsupervised Learning | K-means, K-medoids, hierarchical clustering, Gaussian mixture models, mean shift, DBSCAN, spectral clustering... | N obs. (1 row per obs.) * P features
Supervised Learning | Linear & logistic regression, decision trees, neural networks, SVM, naïve Bayes, K-NN, random forests… | N obs. (1 row per obs.) * P features
Time Series Forecasting | ARMA, VARMAX, ARIMA… | Time Series (rows: time period, columns: measures)
Graph Analysis | Centrality (closeness, betweenness, PageRank, HITS), modularity (Louvain)… | Nodes and Edges lists (+ attributes)
Associations & Sequences | Frequent Itemsets, Apriori, Market Basket… | (Timestamped) events or transactions
Unsupervised Method: K-Means
Cluster a dataset into K buckets by choosing the “closest” neighbours
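A minimal pure-Python sketch of K-means (Lloyd's algorithm) follows: assign each point to its closest centroid, recompute centroids, repeat. The points, the fixed iteration count and the seeding are assumptions for illustration; in practice a library implementation such as Scikit-Learn, listed earlier in the deck, would be used.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm. Returns (centroids, clusters from the last assignment step)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids picked among the points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point goes to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = tuple(sum(c) / len(cluster) for c in zip(*cluster))
    return centroids, clusters

# Hypothetical data: two well-separated groups of points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, k=2)
```

On less clean data the result depends on the initial centroids, so several restarts are usually compared.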
Supervised: K-Nearest-Neighbours
Predict the color of a point depending on the colors of its K closest neighbours
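The rule on this slide fits in a few lines of Python: sort the labelled points by distance to the query and take a majority vote among the first K. The training points and colors below are hypothetical.

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """Predict the label of `query` by majority vote among its k closest training points."""
    by_distance = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labelled points: (coordinates, color).
train = [
    ((0, 0), "red"), ((0, 1), "red"), ((1, 0), "red"),
    ((5, 5), "blue"), ((5, 6), "blue"), ((6, 5), "blue"),
]
near_origin = knn_predict(train, (0.5, 0.5))
near_far_group = knn_predict(train, (5.5, 5.5))
```

There is no training step: all the work happens at prediction time, which is why K-NN becomes expensive on large datasets.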
Supervised: Decision Tree
Find the most “significant” input variable and split value
Split the dataset recursively
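One step of decision-tree construction can be sketched as a search over every variable and threshold for the split that leaves the purest groups. Gini impurity is used here as the measure of “significance”, which is an assumption: the slide does not name a criterion. The rows and labels are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure group)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(rows, labels):
    """Return (weighted impurity, feature index, threshold) of the best binary split."""
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        for threshold in sorted({r[f] for r in rows}):
            left = [l for r, l in zip(rows, labels) if r[f] <= threshold]
            right = [l for r, l in zip(rows, labels) if r[f] > threshold]
            if not left or not right:
                continue  # skip splits that leave one side empty
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, f, threshold)
    return best

# Hypothetical dataset: feature 0 separates the classes, feature 1 does not.
rows = [(1, 7), (2, 3), (3, 8), (4, 2)]
labels = ["a", "a", "b", "b"]
split = best_split(rows, labels)
```

A full tree applies `best_split` again to the rows on each side of the chosen threshold until the groups are pure or a depth limit is reached.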
Several Paths to Machine Learning
[Flowchart starting from the Analytical Dataset, with three entry points and Yes/No decision nodes]
I just want to explore (I’m looking variable by variable, or pairs): Exploratory Data Analysis. Methods shown: correlation analysis, GLM, parametric and non-parametric statistical tests, data visualization, PCA, CA, MDS. Decision nodes: “All my variables are numeric”, “I have a distance matrix”.
I’m looking for clusters: Unsupervised Learning. Methods shown: HCA, partitioning (K-means…), GMM, DP GMM, K-means + Gap | Silhouette, 2-step clustering, Affinity Propagation, Mean Shift. Decision nodes: “I know how many groups to look for”, “Small Dataset (<<1K)”, “Medium Dataset (<<100K)”, “I can sample”.
I want to predict a variable: Supervised Learning*. Methods shown: Generalized Linear Model, Simple Decision Tree, MARS, Generalized Additive Model, Support Vector Machines, Neural Networks, K-Nearest Neighbors, Ensembles (Random Forest, Gradient Boosted Trees). Decision node: “I value interpretability” (Yes / Not Only).
* Methods generally working for both classification & regression
Questions ?
Take Away
◦ There are new ways to perform data analytics that are within your reach and can bring business value
Some Additional Resources
◦ Open Source Projects
Dataiku Cloud Transport Client: http://dctc.io
Dataiku Web Tracker: https://github.com/dataiku/wt1
◦ Our Technical Blog: http://www.dataiku.com/blog