BreizhJUG - Janvier 2014 - Big Data - Dataiku - Pages Jaunes

BIG DATA: How Do Elephants Make Babies? Florian Douetteau, CEO, Dataiku

description

Presentation given in Rennes in January for the handsome BreizhJUG. It is a mixed presentation on big data technologies, covering topics such as: Why Hadoop? What next? Machine learning for big data in practice.

Transcript of BreizhJUG - Janvier 2014 - Big Data - Dataiku - Pages Jaunes

Page 1: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

BIG DATA
How do elephants make babies?

Florian Douetteau
CEO, Dataiku

Page 2: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Agenda

• Big Data & Hadoop Overview

• Practical Big Data Coding: Pig / Hive / Cascading

• PagesJaunes Big Data Use Case

• Machine Learning For Big Data

Page 3: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku 1/8/14

3

Motivation

Page 4: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku

Collocation

1/8/14

4

Big Apple
Big Mama
Big Data

Collocation: A familiar grouping of words, especially words that habitually appear together and thereby convey meaning by association.

Page 5: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku

“Big” Data in 1999

1/8/14

5

struct Element { Key key; void* stat_data ;}….

• C optimized data structures
• Perfect hashing
• HP-UNIX servers, 4 GB RAM
• 100 GB data
• Web crawler, socket reuse, HTTP 0.9

1 Month

Page 6: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku

• Hadoop
• Java / Pig / Hive / Scala / Clojure / …
• A dozen NoSQL data stores
• MPP databases
• Real-time

1/8/14

6

Big Data in 2013

1 Hour

Page 7: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku

Data Analytics: The Stakes

1/8/14

7

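[Figure: data volume and business value by sector and year]
Sectors: Web Search (1999), Logistics (2004), Banking CRM (2008), Web Search (2010), Social Gaming (2011), Online Advertising (2012), E-Commerce (2013)
Values: 1 TB / ? $, 1 TB / 100M $, 1 TB / 1B $, 10 TB / 10M $, 50 TB / 1B $, 100 TB / ? $, 1000 TB / 500M $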

Page 8: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku - Data Tuesday

Meet Hal Alowne

1/8/14

8

Big Guys
• 10B$+ Revenue
• 100M+ customers
• 100+ Data Scientists

Dim's Private Showroom: European E-commerce Web site
• 100M$ Revenue
• 1 Million customers
• 1 Data Analyst (Hal himself)

Hal Alowne, BI Manager, Dim's Private Showroom

Dim Sum, CEO & Founder, Dim's Private Showroom:
"Hey Hal! We need a big data platform like the big guys. Let's just do as they do!"

Big Data Copy Cat Project

Page 9: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

QUESTION #1: IS IT EASY OR NOT?

Page 10: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

SUBTLE PATTERNS

Page 11: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

"MORE BUSINESS"

BUTTONS

Page 12: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

QUESTION #2: WHO TO HIRE?

Page 13: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

DATA SCIENTIST AT NIGHT

Page 14: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

DATA CLEANER BY DAY

Page 15: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

PARADOX #3

WHERE ?

Page 16: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

MY DATA IS WORTH MILLIONS

Page 17: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

I SEND IT TO THE MARKETING CLOUD

Page 18: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

QUESTION #4: IS IT BIG OR NOT?

Page 19: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

WE ALL LIVE IN A BIG DATA

LAKE

Page 20: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

ALL MY DATA PROBABLY FITS IN HERE

Page 21: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

QUESTION #5 (at last)

HUMAN OR NOT ?

Page 22: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

MACHINE LEARNING WILL SAVE US ALL

Page 23: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

I JUST WANT MORE

REPORTS

Page 24: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku

MERIT = TIME + ROI

1/9/14

24

Targeted Newsletter

RecommenderSystems

Adapted Product/ Promotions

TIME : 6 MONTHS

ROI : APPS

Build a lab in 6 months

(rather than 18 months)

Find the right people

(6 months?)

Choose the technology (6 months?)

Make it work (6 months?)

Build the lab (6 months)

Deploy apps that actually deliver value

2013 2014

2013

• Train people
• Reuse working patterns

Page 25: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku

Statistics and Machine Learning are complex!

1/9/14

25

Try to understand myself

Page 26: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku

(Some books you might want to read)

1/9/14

26

Page 27: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku - Pig, Hive and Cascading

CHOOSE TECHNOLOGY

[Figure: map of the technology landscape]

Regions: Machine Learning Mystery Land, Scalability Central, NoSQL-Slavia, SQL Columnar Republic, Visualization County, Data Clean Wasteland, Statistician Old House, Real-time Island

Tools: Hadoop, Ceph, Sphere, Cassandra, Kafka, Flume, Spark, Storm, Scikit-Learn, GraphLab, prediction.io, Jubatus, Mahout, WEKA, MLbase, LibSVM, SAS, RapidMiner, SPSS, Pandas, R, QlikView, Tableau, Kibana, Spotfire, D3, InfiniDB, Drill, Vertica, Greenplum, Impala, Netezza, Elasticsearch, Solr, MongoDB, Riak, Membase, Pig, Cascading, Talend

Page 28: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku

• The Business Intelligence stack has scalability and maintenance issues

• The back office implements business rules that are challenged

• The existing infrastructure cannot cope with per-user information

Main pain point: 23 hours 52 minutes to compute Business Intelligence aggregates for one day.

1/9/14

29

Large E-Retailer

Page 29: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku - Data Tuesday

• Relieve their current DWH and accelerate production of some aggregates/KPIs

• Be the backbone for a new personalized user experience on their website: more recommendations, more profiling, etc.

• Train existing people around machine learning and segmentation experience

Results:
• 1h12 to perform the aggregate, available every morning
• New home page personalization deployed in a few weeks

Setup:
• Hadoop cluster (24 cores)
• Google Compute Engine
• Python + R + Vertica
• 12 TB dataset
• 6-week project

1/9/14

30

Large E-Retailer : The Datalab

Page 30: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku

A very large community

Some mid-size communities

Lots of small clusters (mostly 2 players)

Correlation
◦ between community size and engagement / virality

Meaningful patterns
◦ 2 players / Family / Group

What is the minimum number of friends to have in the application to get additional engagement ?

Example (Social Gaming) Social Gaming Communities

1/9/14

31

Page 31: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku - Pig, Hive and Cascading

How do I (pre)process data?

Implicit User Data (Views, Searches…)

Content Data (Title, Categories, Price, …)

Explicit User Data (Click, Buy, …)

User Information (Location, Graph…)

500TB

50TB

1TB

200GB

Transformation Matrix

Transformation Predictor

Per User Stats

Per Content Stats

User Similarity

Rank Predictor

Content Similarity

A/B Test Data

Predictor Runtime

Online User Information

Page 32: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Always the same

Pour Data In

Compute Something

Smart About It

Make Available

Page 33: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

The Questions

Pour Data In

Compute Something

Smart About It

Make Available

How often ? What kind of interaction? How much ?

How complex ? Do you need all data at once ? How incremental ?

Interaction ? Random Access ?

Page 34: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

At the Beginning was the elephant

Page 35: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes
Page 36: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku - Innovation Services

MapReduce: How to count words in many, many boxes

1/8/14

37
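The slide's picture of counting words spread over many boxes can be sketched in a few lines of plain Python. This is only an in-memory illustration of the map, shuffle and reduce steps, not Hadoop code; the function names and the sample boxes are assumptions made for the example.

```python
from collections import defaultdict

def map_phase(box):
    """Mapper: emit a (word, 1) pair for every word found in one box (input split)."""
    return [(word.lower(), 1) for line in box for word in line.split()]

def shuffle(pairs):
    """Shuffle: group the emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reducer: sum the partial counts of each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

def word_count(boxes):
    """Run the mappers box by box (independently), then shuffle and reduce."""
    pairs = [pair for box in boxes for pair in map_phase(box)]
    return reduce_phase(shuffle(pairs))

# Hypothetical sample data: two boxes, each holding a few lines of text.
boxes = [["big data big"], ["data elephant"]]
counts = word_count(boxes)
```

Each box is processed without looking at the others, which is what lets the map step run on many machines at once.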

Page 37: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

ELEPHANT MAKE BABIES

Page 38: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

After Hadoop

Massive Batch: MapReduce over HDFS

Random Access

Faster in Memory Computation

In Memory MultiCore Machine Learning

Real-Time Distributed Computation

Faster SQL Analytics Queries

Page 39: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku - Innovation Services

MapReduce: Simplicity is a complexity

1/8/14

40

Page 40: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku - Pig, Hive and Cascading

Agenda
• Hadoop and Context (->0:03)
• Pig, Hive, Cascading, … (->0:09)
• How they work (->0:15)
• Comparing the tools (->0:35)
• Make them work together (->0:40)
• Wrap-up and questions (->Beer)

Page 41: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku - Pig, Hive and Cascading

Pig History
• Yahoo Research in 2006
• Inspired from Sawzall, a Google paper from 2003
• 2007 as an Apache Project
• Initial motivation
  ◦ Search log analytics: how long is the average user session? How many links does a user click on before leaving a website? How do click patterns vary in the course of a day/week/month? …

words = LOAD '/training/hadoop-wordcount/output' USING PigStorage('\t')
        AS (word:chararray, count:int);

sorted_words = ORDER words BY count DESC;
first_words = LIMIT sorted_words 10;

DUMP first_words;

Page 42: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku - Pig, Hive and Cascading

Hive History
• Developed by Facebook in January 2007
• Open source in August 2008
• Initial motivation
  ◦ Provide a SQL-like abstraction to perform statistics on status updates

create external table wordcounts (word string, count int)
row format delimited fields terminated by '\t'
location '/training/hadoop-wordcount/output';

select * from wordcounts order by count desc limit 10;

select SUM(count) from wordcounts where word like 'th%';

Page 43: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku - Pig, Hive and Cascading

Cascading History
• Authored by Chris Wensel, 2008
• Associated projects
  ◦ Cascalog: Cascading in Clojure
  ◦ Scalding: Cascading in Scala (Twitter in 2012)
  ◦ Lingual (to be released soon): SQL layer on top of Cascading

Page 44: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku - Pig, Hive and Cascading

Agenda
• Hadoop and Context (->0:03)
• Pig, Hive, Cascading, … (->0:09)
• How they work (->0:15)
• Comparing the tools (->0:35)
• Make them work together (->0:40)
• Wrap-up and questions (->Beer)

Page 45: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku - Innovation Services

Pig & Hive: Mapping to MapReduce jobs

1/8/14

46* VAT

excluded

events = LOAD '/events' USING PigStorage('\t') AS (type:chararray, user:chararray, price:int, timestamp:int);

-- the filter condition on type is not spelled out on the slide
events_filtered = FILTER events BY type;

by_user = GROUP events_filtered BY user;

-- after GROUP, the key is exposed as "group" and fields are read through the grouped bag
price_by_user = FOREACH by_user GENERATE group AS user, SUM(events_filtered.price) AS total_price, MAX(events_filtered.timestamp) AS max_ts;

high_pbu = FILTER price_by_user BY total_price > 1000;

Job 1 (Mapper → shuffle and sort by user → Reducer): LOAD, FILTER, GROUP, FOREACH, FILTER

Page 46: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku - Innovation Services

Pig & Hive: Mapping to MapReduce jobs

1/8/14

47

events = LOAD '/events' USING PigStorage('\t') AS (type:chararray, user:chararray, price:int, timestamp:int);

-- the filter condition on type is not spelled out on the slide
events_filtered = FILTER events BY type;

by_user = GROUP events_filtered BY user;

-- after GROUP, the key is exposed as "group" and fields are read through the grouped bag
price_by_user = FOREACH by_user GENERATE group AS user, SUM(events_filtered.price) AS total_price, MAX(events_filtered.timestamp) AS max_ts;

high_pbu = FILTER price_by_user BY total_price > 1000;

recent_high = ORDER high_pbu BY max_ts DESC;

STORE recent_high INTO '/output';

Job 1 (Mapper → shuffle and sort by user → Reducer): LOAD, FILTER, GROUP, FOREACH, FILTER

Job 2 (Mapper → shuffle and sort by max_ts → Reducer): LOAD (from tmp), STORE
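To make the two-job split concrete, here is a minimal in-memory Python sketch of the same pipeline. It is an illustration only, not what Pig generates: the function names, the `keep_type="buy"` filter value and the sample events are assumptions, since the slide does not spell out the filter condition.

```python
from collections import defaultdict

def job1(events, keep_type="buy", min_total=1000):
    """Job 1: filter on the map side, shuffle by user, aggregate and filter on the reduce side."""
    # Map side: LOAD + FILTER, emitting (user, (price, timestamp)).
    # The dictionary plays the role of the shuffle that groups rows by user.
    grouped = defaultdict(list)
    for event_type, user, price, timestamp in events:
        if event_type == keep_type:  # hypothetical condition
            grouped[user].append((price, timestamp))
    # Reduce side: FOREACH (SUM, MAX) followed by the second FILTER.
    result = []
    for user, rows in grouped.items():
        total_price = sum(price for price, _ in rows)
        max_ts = max(timestamp for _, timestamp in rows)
        if total_price > min_total:
            result.append((user, total_price, max_ts))
    return result

def job2(rows):
    """Job 2: ORDER BY max_ts DESC needs a new shuffle key, hence a second job."""
    return sorted(rows, key=lambda row: row[2], reverse=True)

# Hypothetical sample events: (type, user, price, timestamp).
events = [
    ("buy", "alice", 800, 10),
    ("buy", "alice", 400, 30),
    ("view", "bob", 5000, 20),
    ("buy", "carol", 1500, 40),
    ("buy", "dave", 100, 50),
]
recent_high = job2(job1(events))
```

The second job exists only because the data must be re-sorted on a different key (`max_ts`) than the one used for grouping (`user`).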

Page 47: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku - Pig, Hive and Cascading

Pig: How does it work

Data Execution Plan compiled into 10 map reduce jobs executed in parallel (or not)

Page 48: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku - Innovation Services

Hive Joins: How to join with MapReduce?

1/8/14

49

Mappers output (each row tagged with its table index, tbl_idx)

tbl_idx  uid  name
1        1    Dupont
1        2    Durand

tbl_idx  uid  type
2        1    Type1
2        1    Type2
2        2    Type1

Shuffle by uid, sort by (uid, tbl_idx)

Reducer 1 input

uid  tbl_idx  name    type
1    1        Dupont
1    2                Type1
1    2                Type2

Reducer 2 input

uid  tbl_idx  name    type
2    1        Durand
2    2                Type1

Reducer 1 output

uid  name    type
1    Dupont  Type1
1    Dupont  Type2

Reducer 2 output

uid  name    type
2    Durand  Type1
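The reduce-side join shown in the tables can be sketched in Python as follows. This is a single-process illustration of the idea (tag each row with its table index, sort by `(uid, tbl_idx)`, then pair rows inside each `uid` group), not Hive's actual implementation; the function names are made up for the example.

```python
from itertools import groupby

users = [(1, "Dupont"), (2, "Durand")]
types = [(1, "Type1"), (1, "Type2"), (2, "Type1")]

def map_tagged(users, types):
    """Mappers: tag every row with the index of the table it comes from."""
    for uid, name in users:
        yield (uid, 1, name)
    for uid, event_type in types:
        yield (uid, 2, event_type)

def reduce_join(records):
    """Reducers: rows arrive sorted by (uid, tbl_idx), so names come before types."""
    ordered = sorted(records, key=lambda record: (record[0], record[1]))
    joined = []
    for uid, group in groupby(ordered, key=lambda record: record[0]):
        names = []
        for _, tbl_idx, value in group:
            if tbl_idx == 1:
                names.append(value)  # rows of the first table are buffered
            else:
                # rows of the second table are paired with every buffered name
                joined.extend((uid, name, value) for name in names)
    return joined

joined = reduce_join(map_tagged(users, types))
```

Sorting on `tbl_idx` within each `uid` is what lets the reducer buffer only the first table's rows and stream the second.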

Page 49: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku - Pig, Hive and Cascading

Agenda
• Hadoop and Context (->0:03)
• Pig, Hive, Cascading, … (->0:09)
• How they work (->0:15)
• Comparing the tools (->0:35)
• Make them work together (->0:40)
• Wrap-up and questions (->Beer)

Page 50: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku - Pig, Hive and Cascading

Comparing without Comparable
• Philosophy
  ◦ Procedural Vs Declarative
  ◦ Data Model and Schema
• Productivity
  ◦ Headachability
  ◦ Checkpointing
  ◦ Testing and environment
• Integration
  ◦ Partitioning
  ◦ Formats Integration
  ◦ External Code Integration
• Performance and optimization

Page 51: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku - Pig, Hive and Cascading

Procedural Vs Declarative

Procedural: transformation as a sequence of operations

Users = load 'users' as (name, age, ipaddr);
Clicks = load 'clicks' as (user, url, value);
ValuableClicks = filter Clicks by value > 0;
UserClicks = join Users by name, ValuableClicks by user;
Geoinfo = load 'geoinfo' as (ipaddr, dma);
UserGeo = join UserClicks by ipaddr, Geoinfo by ipaddr;
ByDMA = group UserGeo by dma;
ValuableClicksPerDMA = foreach ByDMA generate group, COUNT(UserGeo);
store ValuableClicksPerDMA into 'ValuableClicksPerDMA';

Declarative: transformation as a set of formulas

insert into ValuableClicksPerDMA
select dma, count(*)
from geoinfo join (
    select name, ipaddr
    from users join clicks on (users.name = clicks.user)
    where value > 0
) using ipaddr
group by dma;

Page 52: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku - Pig, Hive and Cascading

Data Type and Model: Rationale
• All three extend the basic data model with extended data types
  ◦ array-like [ event1, event2, event3 ]
  ◦ map-like { type1:value1, type2:value2, … }
• Different approaches
  ◦ Resilient Schema
  ◦ Static Typing
  ◦ No Static Typing

Page 53: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku Training – Hadoop for Data Science

Hive: Data Type and Schema

1/8/14

54

Simple type                      Details
TINYINT, SMALLINT, INT, BIGINT   1, 2, 4 and 8 bytes
FLOAT, DOUBLE                    4 and 8 bytes
BOOLEAN
STRING                           Arbitrary-length, replaces VARCHAR
TIMESTAMP

Complex type   Details
ARRAY          Array of typed items (0-indexed)
MAP            Associative map
STRUCT         Complex class-like objects

CREATE TABLE visit (
    user_name STRING,
    user_id INT,
    user_details STRUCT<age:INT, zipcode:INT>
);

Page 54: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku Training – Hadoop for Data Science

Pig: Data Types and Schema

rel = LOAD '/folder/path/'
      USING PigStorage('\t')
      AS (col:type, col:type, col:type);

1/8/14

55

Simple type                Details
int, long, float, double   32 and 64 bits, signed
chararray                  A string
bytearray                  An array of … bytes
boolean                    A boolean

Complex type   Details
tuple          a tuple is an ordered fieldname:value map
bag            a bag is a set of tuples

Page 55: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku - Pig, Hive and Cascading

Cascading: Data Type and Schema
• Support for any Java types, provided they can be serialized in Hadoop
• No support for typing

Simple type                Details
Int, Long, Float, Double   32 and 64 bits, signed
String                     A string
byte[]                     An array of … bytes
Boolean                    A boolean

Complex type   Details
Object         Object must be « Hadoop serializable »

Page 56: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku - Pig, Hive and Cascading

Style Summary

            Style        Typing                                         Data Model                               Metadata store
Pig         Procedural   Static + dynamic                               scalar + tuple + bag (fully recursive)   No (HCatalog)
Hive        Declarative  Static + dynamic, enforced at execution time   scalar + list + map                      Integrated
Cascading   Procedural   Weak                                           scalar + Java objects                    No

Page 57: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku - Pig, Hive and Cascading

Comparing without Comparable
• Philosophy
  ◦ Procedural Vs Declarative
  ◦ Data Model and Schema
• Productivity
  ◦ Headachability
  ◦ Checkpointing
  ◦ Testing, error management and environment
• Integration
  ◦ Partitioning
  ◦ Formats Integration
  ◦ External Code Integration
• Performance and optimization

Page 58: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku - Pig, Hive and Cascading

Does debugging the tool lead to bad headaches ?

Headachability: Motivation

Page 59: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku - Pig, Hive and Cascading

Pig Headaches
• Out Of Memory Error (Reducer)
• Exception in Building / Extended Functions (handling of null)
• Null vs “”
• Nested Foreach and scoping
• Date Management (pig 0.10)
• Field implicit ordering

Page 60: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku - Pig, Hive and Cascading

A Pig Error

Page 61: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku - Pig, Hive and Cascading

Hive Headaches
• Out of Memory Errors in Reducers
• Few Debugging Options
• Null / “”
• No builtin “first”

Page 62: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku - Pig, Hive and Cascading

Cascading Headaches
• Weak Typing Errors (comparing Int and String …)
• Illegal Operation Sequence (Group after group …)
• Field Implicit Ordering

Page 63: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku - Pig, Hive and Cascading

Testing: Motivation
• How to perform unit tests?
• How to have different versions of the same script (parameter)?

Page 64: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku - Pig, Hive and Cascading

Pig Testing
• System variables
• Comment to test
• No metaprogramming
• pig -x local to execute on local files

Page 65: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku - Pig, Hive and Cascading

Cascading Testing / Environment
• JUnit tests are possible
• Ability to use code to actually comment out some variables

Page 66: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku - Pig, Hive and Cascading

Checkpointing: Motivation
• Lots of iteration while developing on Hadoop
• Sometimes jobs fail
• Sometimes you need to restart from the start …

[Diagram steps: Parse Logs, Filtering, Per Page Stats, Page User Correlation, Output. One step FAILs: FIX and relaunch]

Page 67: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku - Pig, Hive and Cascading

Pig: Manual Checkpointing
• STORE command to manually store files
• // COMMENT the beginning of the script and relaunch

[Diagram steps: Parse Logs, Filtering, Per Page Stats, Page User Correlation, Output]

Page 68: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku - Pig, Hive and Cascading

Ability to re-run a flow automatically from the last saved checkpoint

Cascading Automated Checkpointing

addCheckpoint(…)

Page 69: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku - Pig, Hive and Cascading

Cascading Topological Scheduler
• Check each intermediate file timestamp
• Execute only if more recent

[Diagram steps: Parse Logs, Filtering, Per Page Stats, Page User Correlation, Output]
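The timestamp rule on this slide (re-run a step only when its inputs are more recent than its output) can be sketched in plain Python. This is not Cascading's API; `needs_run`, `run_flow` and the step tuples are hypothetical names chosen for the illustration, and steps are assumed to be listed in dependency order.

```python
import os

def needs_run(inputs, output):
    """A step must run if its output is missing or older than any of its inputs."""
    if not os.path.exists(output):
        return True
    output_mtime = os.path.getmtime(output)
    return any(os.path.getmtime(path) > output_mtime for path in inputs)

def run_flow(steps):
    """Run steps in order; each step is (input_paths, output_path, action).

    Returns the outputs that were actually recomputed, so a relaunch after a
    failure only redoes the steps downstream of what changed.
    """
    executed = []
    for inputs, output, action in steps:
        if needs_run(inputs, output):
            action(inputs, output)
            executed.append(output)
    return executed
```

A relaunch after fixing a failed step then skips every step whose output is still up to date, much like a `make` build.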

Page 70: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku - Pig, Hive and Cascading

Productivity Summary

            Headaches                            Checkpointing/Replay             Testing / Metaprogramming
Pig         Lots                                 Manual save                      Difficult metaprogramming, easy local testing
Hive        Few, but without debugging options   None (that's SQL)                None (that's SQL)
Cascading   Weak typing, complexity              Checkpointing, partial updates   Possible

Page 71: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku - Pig, Hive and Cascading

Comparing without Comparable
• Philosophy
  ◦ Procedural Vs Declarative
  ◦ Data Model and Schema
• Productivity
  ◦ Headachability
  ◦ Checkpointing
  ◦ Testing and environment
• Integration
  ◦ Formats Integration
  ◦ Partitioning
  ◦ External Code Integration
• Performance and optimization

Page 72: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku - Pig, Hive and Cascading

Formats Integration: Motivation
• Ability to integrate different file formats
  ◦ Text Delimited
  ◦ Sequence File (binary Hadoop format)
  ◦ Avro, Thrift …
• Ability to integrate with external data sources or sinks (MongoDB, ElasticSearch, Database, …)

Format impact on size and performance

Format                         Size on disk (GB)   Hive processing time (24 cores)
Text file, uncompressed        18.7                1m32s
1 text file, gzipped           3.89                6m23s (no parallelization)
JSON, compressed               7.89                2m42s
Multiple text files, gzipped   4.02                43s
Sequence file, block, gzip     5.32                1m18s
Text file, LZO indexed         7.03                1m22s

Page 73: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku - Pig, Hive and Cascading

Format Integration
• Hive: SerDe (Serializer-Deserializer)
• Pig: Storage
• Cascading: Tap

Page 74: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku - Pig, Hive and Cascading

Partitions: Motivation
• No support for “UPDATE” patterns: any increment is performed by adding or deleting a partition
• Common partition schemas on Hadoop
  ◦ By date: /apache_logs/dt=2013-01-23
  ◦ By data center: /apache_logs/dc=redbus01/…
  ◦ By country
  ◦ …
  ◦ Or any combination of the above
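As a small illustration of the `key=value` directory convention above, the following Python sketch builds partition paths and models an incremental load as replacing a whole partition. The helper names and the dictionary standing in for the file system are assumptions made for the example.

```python
def partition_path(root, **partition_values):
    """Build a path such as /apache_logs/dt=2013-01-23 from keyword arguments."""
    suffix = "/".join(f"{key}={value}" for key, value in partition_values.items())
    return f"{root}/{suffix}"

def overwrite_partition(store, root, rows, **partition_values):
    """No UPDATE: an increment replaces (or adds) one whole partition.

    `store` is a plain dict used here as a stand-in for the file system.
    """
    store[partition_path(root, **partition_values)] = list(rows)
    return store

# Hypothetical usage: load one day, then reload it after a fix.
store = {}
overwrite_partition(store, "/apache_logs", ["GET /a", "GET /b"], dt="2013-01-23")
overwrite_partition(store, "/apache_logs", ["GET /a"], dt="2013-01-23")
```

Reloading a day therefore never touches the other partitions.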

Page 75: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku Training – Hadoop for Data Science

Hive Partitioning: Partitioned tables

1/8/14

76

CREATE TABLE event (
    user_id INT,
    type STRING,
    message STRING
)
PARTITIONED BY (day STRING, server_id STRING);

Disk structure

/hive/event/day=2013-01-27/server_id=s1/file0
/hive/event/day=2013-01-27/server_id=s1/file1
/hive/event/day=2013-01-27/server_id=s2/file0
/hive/event/day=2013-01-27/server_id=s2/file1
…
/hive/event/day=2013-01-28/server_id=s2/file0
/hive/event/day=2013-01-28/server_id=s2/file1

INSERT OVERWRITE TABLE event PARTITION (day='2013-01-27', server_id='s1')
SELECT * FROM event_tmp;

Page 76: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku - Pig, Hive and Cascading

Cascading Partition
• No direct support for partitions
• Support for a “Glob” Tap, to read from files using patterns

➔ You can code your own custom or virtual partition schemes

Page 77: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku - Pig, Hive and Cascading

External Code Integration: Simple UDF

Pig

Hive

Cascading

Page 78: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku - Pig, Hive and Cascading

Hive Complex UDF (Aggregators)

Page 79: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku - Pig, Hive and Cascading

Cascading Direct Code Evaluation

Uses Janino, a very cool project: http://docs.codehaus.org/display/JANINO

Page 80: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku - Pig, Hive and Cascading

Spring Batch Cascading Integration
• Allows calling a Cascading flow from a Spring Batch job
• No full integration with Spring MessageSource or MessageHandler yet (only for local flows)

Page 81: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku - Pig, Hive and Cascading

Integration Summary

            Partition/Incremental Updates   External Code                                              Format Integration
Pig         No direct support               Simple                                                     Doable and rich community
Hive        Fully integrated, SQL-like      Very simple, but complex dev setup                         Doable and existing community
Cascading   With coding                     Complex UDFs but regular, and Java expression embeddable   Doable and growing community

Page 82: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku - Pig, Hive and Cascading

Comparing without Comparable
• Philosophy
  ◦ Procedural Vs Declarative
  ◦ Data Model and Schema
• Productivity
  ◦ Headachability
  ◦ Checkpointing
  ◦ Testing and environment
• Integration
  ◦ Formats Integration
  ◦ Partitioning
  ◦ External Code Integration
• Performance and optimization

Page 83: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku - Pig, Hive and Cascading

Optimization
• Several common MapReduce optimization patterns
  ◦ Combiners
  ◦ MapJoin
  ◦ Job Fusion
  ◦ Job Parallelism
  ◦ Reducer Parallelism
• Different support per framework
  ◦ Fully automatic
  ◦ Pragma / Directives / Options
  ◦ Coding style / Code to write

Page 84: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku - Pig, Hive and Cascading

SELECT date, COUNT(*) FROM product GROUP BY date

Combiner: Perform Partial Aggregate at Mapper Stage

[Diagram, without combiner: every (date, product) row goes from Map to Reduce]

Map output rows (two mappers, same sample shown):
2012-02-14 4354
2012-02-15 21we2
2012-02-14 qa334
2012-02-15 23aq2

Reduce output:
2012-02-14 20
2012-02-15 35
2012-02-16 1

Page 85: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku - Pig, Hive and Cascading

SELECT date, COUNT(*) FROM product GROUP BY date

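Combiner: Perform Partial Aggregate at Mapper Stage

[Diagram, with combiner: each mapper sends partial counts per date to Reduce]

Map input rows (sample):
2012-02-14 4354
2012-02-15 21we2
2012-02-14 qa334
2012-02-15 23aq2

Partial counts from the first mapper:
2012-02-14 12
2012-02-15 23
2012-02-16 1

Partial counts from the second mapper:
2012-02-14 8
2012-02-15 12

Reduce output:
2012-02-14 20
2012-02-15 35
2012-02-16 1

Reduced network bandwidth. Better parallelism.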
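A combiner can be sketched in Python as a local count inside each mapper followed by a merge in the reducer. This is an in-memory illustration, not Hive or Hadoop code; the function names are made up, and the partial counts reuse the numbers shown on the slide.

```python
from collections import Counter

def map_with_combiner(rows):
    """One mapper: count rows per date locally, so only one pair per date is shuffled."""
    return Counter(date for date, _product in rows)

def reduce_counts(partials):
    """Reducer: add up the partial counts received from all mappers."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)

# Partial counts as shown on the slide for the two mappers.
mapper_1 = Counter({"2012-02-14": 12, "2012-02-15": 23, "2012-02-16": 1})
mapper_2 = Counter({"2012-02-14": 8, "2012-02-15": 12})
totals = reduce_counts([mapper_1, mapper_2])
```

Instead of one record per input row, each mapper ships at most one record per distinct date, which is where the bandwidth saving comes from.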

Page 86: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku - Pig, Hive and Cascading

Join Optimization: Map Join

Hive

set hive.auto.convert.join = true;

Pig

Cascading

(no aggregation support after HashJoin)
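The idea behind a map join (also called a hash join) is that the small table is loaded into memory on every mapper, so the big table can be joined while it is streamed, with no shuffle. The sketch below is a plain Python illustration of that idea under the assumption that the small table fits in memory; the function name and sample rows are hypothetical.

```python
def map_join(small_table, big_table_rows):
    """Join a streamed big table against an in-memory hash of the small table."""
    # Build phase: hash the small table on the join key (done once per mapper).
    lookup = {}
    for key, value in small_table:
        lookup.setdefault(key, []).append(value)
    # Probe phase: stream the big table; no shuffle or reducer is needed.
    for key, value in big_table_rows:
        for small_value in lookup.get(key, []):
            yield (key, small_value, value)

users = [(1, "Dupont"), (2, "Durand")]                      # small table
events = [(1, "Type1"), (1, "Type2"), (2, "Type1"), (3, "Type1")]  # big table
joined = list(map_join(users, events))
```

Rows of the big table with no match in the small table (uid 3 here) are simply dropped, as in an inner join.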

Page 87: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku - Pig, Hive and Cascading

Number of Reducers
• Critical for performance
• Estimated from the size of the input files
  ◦ Hive: input size divided by hive.exec.reducers.bytes.per.reducer (default 1 GB)
  ◦ Pig: input size divided by pig.exec.reducers.bytes.per.reducer (default 1 GB)
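The estimate described above is a simple division rounded up. The sketch below shows the arithmetic only; it assumes 1 GB means 10^9 bytes, and the optional `max_reducers` cap is a hypothetical parameter added for illustration rather than something stated on the slide.

```python
import math

def estimate_reducers(input_bytes, bytes_per_reducer=10**9, max_reducers=None):
    """Estimate the reducer count as ceil(input size / bytes per reducer), at least 1."""
    estimate = max(1, math.ceil(input_bytes / bytes_per_reducer))
    if max_reducers is not None:  # hypothetical upper bound
        estimate = min(estimate, max_reducers)
    return estimate

# An 18.7 GB input with the 1 GB default gives 19 reducers.
reducers_for_text_file = estimate_reducers(18.7e9)
```

Because the estimate only looks at input size, a job whose data shrinks or explodes after the map stage may still need a hand-set value (the "DIY" case).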

Page 88: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Dataiku - Pig, Hive and Cascading

Performance & Optimization Summary

            Combiner Optimization   Join Optimization      Number of Reducers Optimization
Pig         Automatic               Option                 Estimate or DIY
Cascading   DIY                     HashJoin               DIY
Hive        Partial DIY             Automatic (Map Join)   Estimate or DIY

Page 89: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

BIG DATA AND MACHINE LEARNING USE CASE

• ERWAN PIGNEUL

• TEAM LEADER, PROJECT MANAGER

Search quality

Page 90: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

PAGESJAUNES CONTEXT

CORE BUSINESS: LOCAL SEARCH FOR PROFESSIONALS

PAGESJAUNES USES A SPECIFIC INTERPRETATION ENGINE THAT REQUIRES MANUAL INDEXING

THIS HANDLES THE MOST FREQUENT QUERIES WELL BUT DOES NOT COVER THE LONG TAIL

Page 91: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

HOW CAN WE IMPROVE THE RELEVANCE OF OUR ANSWERS BY ANALYZING USER BEHAVIOR?

User behavior signals:
• Click on a professional
• Top search
• Navigation click or filter
• Search reformulation
• No answer

Figures shown:
• >200M searches
• 20 M
• 1.4M queries
• >10 occurrences
• 0.5M prioritized queries

Analysis & corrections, automation

Page 92: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

SOLUTION

[Architecture diagram labels]
• pagesjaunes.fr
• Directory
• Crawl
• Other reference data
• Hadoop, Pig + Hive
• scikit-learn
• Machine
• Management, Exploration
• Indexing export
• Interpretation engine

Page 93: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

TECHNICAL LESSONS LEARNED

HADOOP / PIG / HIVE:
• Effective
• Challenges some test/prod practices (problems appear on large volumes)
• Caution, it is still young (compatibility, …)

DATAIKU STUDIO:
• Big data development accelerator
• Job scheduler that integrates all our jobs and manages dependencies
• Easy machine learning

ELASTICSEARCH:
• Indexed volume and search speed

Page 94: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

EFFECTIVENESS OF THE APPROACH

Evolution of the fragility of the query ‘Parc enfant’ (children's park)

[Chart: query ‘Parc enfant’ compared with the overall average, on a scale from Fragile to Not fragile]

Page 95: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Mahout 102: Clustering

Page 96: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Goal for Today

• Quick Introduction To Clustering

• How does it work in Practice

• How does it work in Mahout

• Overview of Mahout Algorithms

Page 97: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Clustering

c

Revenue

Age

Page 98: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Clustering

c

Revenue

Age

One Cluster

Centroid == Center of the cluster

Page 99: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

clustering applications

• Fraud: Detect Outliers

• CRM : Mine for customer segments

• Image Processing : Similar Images

• Search : Similar documents

• Search : Allocate Topics

Page 100: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

K-Means

Guess an initial placement for centroids

Assign each point to closest Center

Reposition Center

MAP

REDUCE
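The three K-Means steps above map naturally onto one map/reduce round per iteration: the assignment is the MAP step and the repositioning is the REDUCE step. The following is a minimal single-machine Python sketch of that loop, not Mahout's implementation; the function names, the stopping tolerance and the sample points are assumptions for the example.

```python
import math

def closest(point, centroids):
    """Index of the centroid nearest to the point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def kmeans_iteration(points, centroids):
    """One iteration: MAP assigns points to centroids, REDUCE repositions each centroid."""
    # MAP: emit (centroid index, point) for every point.
    assigned = {}
    for point in points:
        assigned.setdefault(closest(point, centroids), []).append(point)
    # REDUCE: the new centroid is the mean of the points assigned to it.
    updated = list(centroids)
    for index, members in assigned.items():
        updated[index] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return updated

def kmeans(points, centroids, max_iterations=20, tolerance=1e-6):
    """Repeat iterations until the centroids stop moving (convergence)."""
    for _ in range(max_iterations):
        updated = kmeans_iteration(points, centroids)
        shift = max(math.dist(old, new) for old, new in zip(centroids, updated))
        centroids = updated
        if shift < tolerance:
            break
    return centroids

# Hypothetical points forming two groups, with a rough initial guess for the centroids.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids = kmeans(points, [(0, 0), (10, 10)])
```

On Hadoop each iteration is a full job that reads and writes HDFS, which is why many iterations get slow compared with in-memory processing.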

Page 101: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes
Page 102: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes
Page 103: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes
Page 104: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes
Page 105: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes
Page 106: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes
Page 107: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes
Page 108: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes
Page 109: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes
Page 110: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

clustering challenges

• Curse of Dimensionality

• Choice of distance / number of parameters

• Performance

• Choice # of clusters

Page 111: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Mahout Clustering Challenges

• No Integrated Feature Engineering Stack: Get ready to write data processing in Java

• Hadoop SequenceFile required as an input

• Iterations as Map/Reduce read and write to disks: Relatively slow compared to in-memory processing

Page 112: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Data Processing

Data ProcessingVectorized

Data

Image

Voice

Log / DB

Page 113: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Mahout K-Means on Text Workflow

Text Files
→ mahout seqdirectory →
Mahout Sequence Files
→ mahout seq2sparse →
TF-IDF Vectors
→ mahout kmeans →
Clusters

Page 114: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Mahout K-Means on Database Extract Workflow

Database Dump (CSV)
→ org.apache.mahout.clustering.conversion.InputDriver →
Mahout Vectors
→ mahout kmeans →
Clusters

Page 115: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Convert a CSV File to Mahout Vector

• Real code would have
  ◦ Converting categorical variables to dimensions
  ◦ Variable rescaling
  ◦ Dropping IDs (name, forename …)
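The three preparation steps listed above can be sketched in plain Python. The output here is a list of ordinary Python lists, not Mahout vectors, and the function name, column names and sample data are all assumptions made for the illustration; min-max rescaling is one possible choice of rescaling.

```python
import csv
import io

def csv_to_vectors(text, id_columns, categorical_columns):
    """Turn CSV rows into numeric vectors: drop IDs, rescale numbers, one-hot encode categories."""
    rows = list(csv.DictReader(io.StringIO(text)))
    numeric_columns = [column for column in rows[0]
                       if column not in id_columns and column not in categorical_columns]
    # Variable rescaling: remember min and max of each numeric column.
    ranges = {column: (min(float(row[column]) for row in rows),
                       max(float(row[column]) for row in rows))
              for column in numeric_columns}
    # Categorical variables: one dimension per distinct value, in sorted order.
    categories = {column: sorted({row[column] for row in rows})
                  for column in categorical_columns}
    vectors = []
    for row in rows:
        vector = []
        for column in numeric_columns:
            low, high = ranges[column]
            vector.append((float(row[column]) - low) / (high - low) if high > low else 0.0)
        for column in categorical_columns:
            vector.extend(1.0 if row[column] == value else 0.0 for value in categories[column])
        vectors.append(vector)
    return vectors

# Hypothetical database dump.
dump = "name,age,revenue,country\nDupont,20,1000,FR\nDurand,40,3000,DE\nMartin,30,2000,FR\n"
vectors = csv_to_vectors(dump, id_columns=["name"], categorical_columns=["country"])
```

Without rescaling, a column such as revenue would dominate the distance used by K-Means simply because its values are larger.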

Page 116: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Mahout Algorithms

Algorithm                  Parameters                           Implicit Assumption                       Output
K-Means                    K (number of clusters), Convergence  Circles                                   Point -> ClusterId
Fuzzy K-Means              K (number of clusters), Convergence  Circles                                   Point -> ClusterId*, Probability
Expectation Maximization   K (number of clusters), Convergence  Gaussian distribution                     Point -> ClusterId*, Probability
Mean-Shift Clustering      Distance boundaries, Convergence     Gradient-like distribution                Point -> ClusterId
Top Down Clustering        Two clustering algorithms            Hierarchy                                 Point -> Large ClusterId, Small ClusterId
Dirichlet Process          Model distribution                   Points are a mixture of distributions     Point -> ClusterId, Probability
Spectral Clustering        -                                    -                                         Point -> ClusterId
MinHash Clustering         Number of hash / keys, Hash type     High dimension                            Point -> Hash*

Page 117: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Comparing Clustering

[Figure panels: KMeans, Dirichlet, Fuzzy KMeans, MeanShift]

Page 118: BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes

Canopy Optimization

• Pick a random point
• Inner radius T2: surely in cluster
• Outer radius T1: surely not in cluster
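Canopy clustering is commonly described as follows: pick a point, put every point closer than the loose threshold T1 in its canopy, and remove every point closer than the tight threshold T2 from the candidate list, then repeat. The Python below is a minimal sketch of that description, not Mahout's code; the function name, the `seed` parameter and the sample points are assumptions.

```python
import math
import random

def canopy_clustering(points, t1, t2, seed=0):
    """Cheap pre-clustering with a loose threshold T1 and a tight threshold T2 (T1 > T2 > 0)."""
    if not 0 < t2 < t1:
        raise ValueError("expected 0 < T2 < T1")
    rng = random.Random(seed)
    remaining = list(points)
    canopies = []
    while remaining:
        center = rng.choice(remaining)  # pick a random point
        # Within T1 of the center: member of this canopy (canopies may overlap).
        members = [p for p in points if math.dist(p, center) < t1]
        canopies.append((center, members))
        # Within T2 of the center: surely in this cluster, so no longer a candidate center.
        remaining = [p for p in remaining if math.dist(p, center) >= t2]
    return canopies

# Hypothetical data: two well separated groups.
group_a = [(0, 0), (1, 0), (0, 1)]
group_b = [(10, 10), (11, 10), (10, 11)]
canopies = canopy_clustering(group_a + group_b, t1=3, t2=2)
```

The canopy centers can then serve as the initial centroids for K-Means, and their count as a first guess for the number of clusters.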