H2O World - Data Science in Action @ 6sense - Viral Bajaria
WE FIND PROSPECTS THAT ARE IN MARKET TO BUY
6sense
WE ARE THE CENTRAL NERVOUS SYSTEM EMPOWERING ALL MARKETING, SALES AND BIZ OPERATIONS TEAMS
AS A TEAM, WE LIVE ON: DATA, STATISTICS AND BEER
about.me
CTO & CO-FOUNDER @ 6SENSE
EARLY HADOOP ADOPTER (LATE 2008)
3B+ EVENTS PER DAY
FUN FACT: Used a sledgehammer to unrack my first Hadoop cluster
Problem
Predict who is in-market to buy!!
e.g. Company XYZ is 90% likely to buy routers in the next 90 days.
Data Needs
What kind of data do we need… A lot!
1st Party:
- Web (e.g. Apache logs)
- Marketing Automation (e.g. Eloqua)
- CRM (e.g. Salesforce)
6sense Data Network:
- Publishers
- Ads
- Blogs
Insights
Research patterns are different for different products
- Expensive routers
- Freemium cloud services
- Open source tools (think H2O)
Data Science Needs
Need to build different models for each product
Plus, we don’t like to make our lives easy :)
- Where’s the fun in easy?
- Need to build 4 models per product
100s OF MODELS IN PROD
Processing Pipeline
[Pipeline diagram. Labels: Web; Identify Companies; Identify Contacts; Customer; Contacts; Sales; Normalize Companies; Custom Data Set; Make Consistent]
Modeling
Scikit-Learn or H2O
Output Types: pickle files or POJO
Script to promote model to production
Puts all artifacts used in S3
e.g. data, stats, queries
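The promotion step above can be sketched roughly as follows. This is a minimal illustration, not the 6sense script: the function name `promote_model`, the directory layout, and the manifest format are all assumptions, and a local directory stands in for the S3 bucket so the example runs with the standard library alone. The toy model and artifact contents are placeholders.

```python
import hashlib
import json
import pickle
import tempfile
from pathlib import Path


def promote_model(model, artifacts, dest_root, model_name, version):
    """Write a pickled model and its supporting artifacts under one versioned prefix.

    dest_root is a local directory here; in a real setup it would be an S3 prefix.
    """
    dest = Path(dest_root) / model_name / version
    dest.mkdir(parents=True, exist_ok=False)  # refuse to overwrite an existing version

    model_bytes = pickle.dumps(model)
    (dest / "model.pkl").write_bytes(model_bytes)

    # Keep everything used to build the model (data, stats, queries) next to it.
    for name, text in artifacts.items():
        (dest / name).write_text(text)

    # Hypothetical manifest so a later run can verify what was promoted.
    manifest = {
        "model": model_name,
        "version": version,
        "model_sha256": hashlib.sha256(model_bytes).hexdigest(),
        "artifacts": sorted(artifacts),
    }
    (dest / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return dest


if __name__ == "__main__":
    toy_model = {"weights": [0.2, 0.8], "bias": -0.1}  # stand-in for a fitted estimator
    with tempfile.TemporaryDirectory() as root:
        out = promote_model(
            toy_model,
            {"stats.json": json.dumps({"rows": 0}), "query.sql": "SELECT 1"},
            root,
            "routers",
            "v1",
        )
        print(sorted(p.name for p in out.iterdir()))
```

Storing the artifacts beside the model under a single versioned prefix means any production model can be traced back to the data, stats, and queries that produced it.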
Prod Pipeline
Multiple Models for same prediction
[Diagram. Labels: Model 1, Model Stats; Model 2, Model Stats; Model 3, Model Stats; Continue]
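One way to picture several models serving the same prediction, each with its own stats, is the sketch below. The slide does not say how models are invoked or which stats are kept, so the callable interface, the `score_with_all` name, and the chosen summary statistics are assumptions for illustration only.

```python
from statistics import mean


def score_with_all(models, rows):
    """Run every candidate model over the same rows and collect per-model stats."""
    results = {}
    for name, predict in models.items():
        scores = [predict(row) for row in rows]
        results[name] = {
            "scores": scores,
            # Hypothetical summary stats; the real pipeline's stats are not specified.
            "stats": {
                "n": len(scores),
                "mean": mean(scores),
                "min": min(scores),
                "max": max(scores),
            },
        }
    return results


if __name__ == "__main__":
    # Toy rows and toy scoring rules, purely illustrative.
    rows = [{"visits": 3, "opens": 1}, {"visits": 10, "opens": 4}]
    models = {
        "model_1": lambda r: min(1.0, r["visits"] / 10),
        "model_2": lambda r: min(1.0, r["opens"] / 4),
    }
    for name, out in score_with_all(models, rows).items():
        print(name, out["stats"])
```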
Experimental Modeling
Same pipeline as before…
Output written to temporary tables
Use templating to switch settings at runtime
Stats compared to production runs:
- top decile
- raw data for top-100 items
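A rough sketch of the two ideas on this slide: a templated query whose output table is switched at runtime, and a top-decile comparison between a production run and an experimental run. The table names, the query text, and the overlap metric are assumptions; the slides do not show the actual templates or how the top decile is compared.

```python
from string import Template

# Hypothetical query template; only the output table changes between run types.
QUERY = Template(
    "INSERT INTO ${output_table} SELECT company_id, score FROM ${score_source}"
)


def render_query(experimental, run_id):
    """Point the same query at a temporary table when running an experiment."""
    table = f"tmp_scores_{run_id}" if experimental else "scores"
    return QUERY.substitute(output_table=table, score_source="model_output")


def top_decile(scores):
    """Return the ids in the top 10% by score (at least one id when non-empty)."""
    k = max(1, len(scores) // 10)
    # Sort by descending score, then by id so ties break deterministically.
    ranked = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return set(ranked[:k])


def decile_overlap(prod_scores, exp_scores):
    """Fraction of the production top decile also in the experimental top decile.

    Assumes both score dictionaries are non-empty.
    """
    prod_top = top_decile(prod_scores)
    exp_top = top_decile(exp_scores)
    return len(prod_top & exp_top) / len(prod_top)


if __name__ == "__main__":
    print(render_query(experimental=True, run_id="run42"))
    prod = {f"c{i}": float(i) for i in range(20)}
    exp = dict(prod, c18=0.0)  # the experiment demotes one company
    print(decile_overlap(prod, exp))
```

A low overlap would flag an experimental model whose highest-ranked accounts differ sharply from production, which is a cue to inspect the raw data for the top items before promoting it.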
Tech Stack
Platform: AWS
Backend: Hadoop, Hive, Presto, Redshift… and a lot more
ML: H2O, Scikit-Learn
Ops: Fabric, Mesos, Docker, Marathon and home-grown tools