Extending the Enterprise Data Warehouse with Hadoop Robert ... · •Extraction and transformation...

37
Extending the Enterprise Data Warehouse with Hadoop Robert Lancaster Nov 7, 2012

Transcript of Extending the Enterprise Data Warehouse with Hadoop Robert ... · •Extraction and transformation...

Page 1: Extending the Enterprise Data Warehouse with Hadoop Robert ... · •Extraction and transformation of data for loading into the data warehouse – “ETL”. •Off-loading of analysis

Extending the Enterprise Data Warehouse with

Hadoop

Robert Lancaster

Nov 7, 2012

Page 2: Extending the Enterprise Data Warehouse with Hadoop Robert ... · •Extraction and transformation of data for loading into the data warehouse – “ETL”. •Off-loading of analysis

• Robert Lancaster

• Solutions Architect, Hotel Supply Team

[email protected]

• @rob1lancaster

• Organizer of Chicago Machine Learning Study Group

• Co-organizer of Chicago Big Data.

page 2

Who I Am

Page 3: Extending the Enterprise Data Warehouse with Hadoop Robert ... · •Extraction and transformation of data for loading into the data warehouse – “ETL”. •Off-loading of analysis

page 3

Launched in 2001

Over 160 million

bookings

Page 4: Extending the Enterprise Data Warehouse with Hadoop Robert ... · •Extraction and transformation of data for loading into the data warehouse – “ETL”. •Off-loading of analysis

page 4

Some History…

Page 5: Extending the Enterprise Data Warehouse with Hadoop Robert ... · •Extraction and transformation of data for loading into the data warehouse – “ETL”. •Off-loading of analysis

• The Machine Learning team is formed to improve site performance.

For example, improving hotel search results.

• This required access to large volumes of behavioral data for analysis.

• Fortunately, the required data was collected in session data stored in web

analytics logs.

page 5

In 2009…

Page 6: Extending the Enterprise Data Warehouse with Hadoop Robert ... · •Extraction and transformation of data for loading into the data warehouse – “ETL”. •Off-loading of analysis

• The only archive of the required data went back about two weeks.

page 6

The Problem…

Transactional data

(e.g. bookings) and

aggregated Non-

transactional data

Data Warehouse

Non-transactional Data

(e.g. searches)

Page 7: Extending the Enterprise Data Warehouse with Hadoop Robert ... · •Extraction and transformation of data for loading into the data warehouse – “ETL”. •Off-loading of analysis

page 7

Hadoop Provided a Solution…

Data Warehouse

Detailed non-

transactional data

(what every user sees,

clicks, etc.)

Hadoop

Transactional data

(e.g. bookings) and

aggregated Non-

transactional data

Page 8: Extending the Enterprise Data Warehouse with Hadoop Robert ... · •Extraction and transformation of data for loading into the data warehouse – “ETL”. •Off-loading of analysis

• Distributed file system and parallel processing platform.

• Open source Apache project created by Doug Cutting.

• Modeled on papers published by Google on the Google File System

and MapReduce.

• Intended to run on a cluster of relatively inexpensive machines (aka

commodity hardware).

• Bring processing to the data.

page 8

What is Hadoop?

Page 9: Extending the Enterprise Data Warehouse with Hadoop Robert ... · •Extraction and transformation of data for loading into the data warehouse – “ETL”. •Off-loading of analysis

page 9

The Hadoop Ecosystem

Hadoop Distributed File System

Hive Pig HBase

Sqoop &

Flu

me

Zookeeper & Oozie

MapReduce

Page 10: Extending the Enterprise Data Warehouse with Hadoop Robert ... · •Extraction and transformation of data for loading into the data warehouse – “ETL”. •Off-loading of analysis

page 10

Deploying Hadoop Enabled Multiple Applications…

2.78%

34.30% 31.87%

71.67%

0.00%

10.00%

20.00%

30.00%

40.00%

50.00%

60.00%

70.00%

80.00%

90.00%

100.00%

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

Queries

Searches

Page 11: Extending the Enterprise Data Warehouse with Hadoop Robert ... · •Extraction and transformation of data for loading into the data warehouse – “ETL”. •Off-loading of analysis

page 11

And Useful Analyses…

Page 12: Extending the Enterprise Data Warehouse with Hadoop Robert ... · •Extraction and transformation of data for loading into the data warehouse – “ETL”. •Off-loading of analysis

• Most of these efforts are driven by development teams.

• The challenge now is unlocking the value of this data for non-

technical users.

• Support for Hadoop via traditional BI/reporting tools still meager.

page 12

But Brought New Challenges…

Page 13: Extending the Enterprise Data Warehouse with Hadoop Robert ... · •Extraction and transformation of data for loading into the data warehouse – “ETL”. •Off-loading of analysis

page 13

BI Vendors Are Working on Hadoop Integration

Both big (relatively)…

Page 14: Extending the Enterprise Data Warehouse with Hadoop Robert ... · •Extraction and transformation of data for loading into the data warehouse – “ETL”. •Off-loading of analysis

page 14

And small…

Page 15: Extending the Enterprise Data Warehouse with Hadoop Robert ... · •Extraction and transformation of data for loading into the data warehouse – “ETL”. •Off-loading of analysis

• Big Data team is formed under Business Intelligence team at Orbitz

Worldwide.

• Allows the Big Data team to work more closely with the data

warehouse and BI teams.

• Reflects the importance of big data to the future of the company.

• Our production cluster has grown 40-fold since it was launched.

page 15

In 2011& 2012

Page 16: Extending the Enterprise Data Warehouse with Hadoop Robert ... · •Extraction and transformation of data for loading into the data warehouse – “ETL”. •Off-loading of analysis

“We strongly believe that Hadoop is the nucleus of the next-generation

cloud EDW…”

“…but that promise is still three to five years from fruition.”*

*James Kobielus, Forrester Research,

“Hadoop, Is It Soup Yet?”

page 16

A View Shared Beyond Orbitz…

Page 17: Extending the Enterprise Data Warehouse with Hadoop Robert ... · •Extraction and transformation of data for loading into the data warehouse – “ETL”. •Off-loading of analysis

• Extraction and transformation of data for loading into the data

warehouse – “ETL”.

• Off-loading of analysis from the data warehouse.

page 17

Two Primary Ways We Use Hadoop to

Complement the EDW

Page 18: Extending the Enterprise Data Warehouse with Hadoop Robert ... · •Extraction and transformation of data for loading into the data warehouse – “ETL”. •Off-loading of analysis

Proposed Processing

page 18

ETL Example

Raw logs Hadoop Dimensional

model

Page 19: Extending the Enterprise Data Warehouse with Hadoop Robert ... · •Extraction and transformation of data for loading into the data warehouse – “ETL”. •Off-loading of analysis

Previous Processing in Data Warehouse

page 19

ETL Example: Click Data Processing

Web

Server

Logs ETL DW

Data

Cleansing

(Stored

procedure)

DW

Web

Server Web

Servers

Several hours of processing ~20% original

data size

Page 20: Extending the Enterprise Data Warehouse with Hadoop Robert ... · •Extraction and transformation of data for loading into the data warehouse – “ETL”. •Off-loading of analysis

• Moving to Hadoop:

• Removed load from the data warehouse.

• Facilitated adding additional attributes for processing.

• Allowed processing to be run more frequently.

page 20

ETL Example: Click Data Processing

Web

Server

Logs HDFS

Data

Cleansing

(MapReduce)

DW

Web

Server Web

Servers

Processing in Hadoop

Page 21: Extending the Enterprise Data Warehouse with Hadoop Robert ... · •Extraction and transformation of data for loading into the data warehouse – “ETL”. •Off-loading of analysis

• Facilitated analysis that allows for more personalized ad content.

• Allowed marketing team to analyze over a years worth of search

data.

• Provided analysis that was difficult to perform in the data warehouse.

page 21

Analysis Example: Geo-Targeting Ads

Page 22: Extending the Enterprise Data Warehouse with Hadoop Robert ... · •Extraction and transformation of data for loading into the data warehouse – “ETL”. •Off-loading of analysis

page 22

Example Processing Pipeline for Web Analytics

Data

Page 23: Extending the Enterprise Data Warehouse with Hadoop Robert ... · •Extraction and transformation of data for loading into the data warehouse – “ETL”. •Off-loading of analysis

page 23

Example Use Case: Selection Errors

Page 24: Extending the Enterprise Data Warehouse with Hadoop Robert ... · •Extraction and transformation of data for loading into the data warehouse – “ETL”. •Off-loading of analysis

page 24

Use Case – Selection Errors: Introduction

• Multiple points of entry.

• Multiple paths through site.

• Goal: tie events together to

form picture of customer

behavior.

Page 25: Extending the Enterprise Data Warehouse with Hadoop Robert ... · •Extraction and transformation of data for loading into the data warehouse – “ETL”. •Off-loading of analysis

page 25

Use Case – Selection Errors: Processing

Page 26: Extending the Enterprise Data Warehouse with Hadoop Robert ... · •Extraction and transformation of data for loading into the data warehouse – “ETL”. •Off-loading of analysis

page 26

Use Case – Selection Errors: Visualization

Page 27: Extending the Enterprise Data Warehouse with Hadoop Robert ... · •Extraction and transformation of data for loading into the data warehouse – “ETL”. •Off-loading of analysis

page 27

Example Use Case: Beta Data

Page 28: Extending the Enterprise Data Warehouse with Hadoop Robert ... · •Extraction and transformation of data for loading into the data warehouse – “ETL”. •Off-loading of analysis

page 28

Use Case – Beta Data: Introduction

• Hotel Sort Optimization

• Compare A vs. B

• Web Analytics Data

• What user saw.

• How user behaved

• Server Log Data

• Sorting behavior used.

Page 29: Extending the Enterprise Data Warehouse with Hadoop Robert ... · •Extraction and transformation of data for loading into the data warehouse – “ETL”. •Off-loading of analysis

page 29

Use Case – Beta Data Processing

Page 30: Extending the Enterprise Data Warehouse with Hadoop Robert ... · •Extraction and transformation of data for loading into the data warehouse – “ETL”. •Off-loading of analysis

page 30

Use Case – Beta Data: Visualization

Page 31: Extending the Enterprise Data Warehouse with Hadoop Robert ... · •Extraction and transformation of data for loading into the data warehouse – “ETL”. •Off-loading of analysis

page 31

Example Use Case: RCDC

Page 32: Extending the Enterprise Data Warehouse with Hadoop Robert ... · •Extraction and transformation of data for loading into the data warehouse – “ETL”. •Off-loading of analysis

• Understand and improve cache behavior.

• Improve “coverage”

• Traditionally search 1 page of hotels at a time.

• Get “just enough” information to present to consumers.

• Increase amount of availability information we have when consumer

performs a search.

• Data needed to support needs beyond reporting.

page 32

Use Case – RCDC: Introduction

Page 33: Extending the Enterprise Data Warehouse with Hadoop Robert ... · •Extraction and transformation of data for loading into the data warehouse – “ETL”. •Off-loading of analysis

page 33

Use Case – RCDC: Processing

Page 34: Extending the Enterprise Data Warehouse with Hadoop Robert ... · •Extraction and transformation of data for loading into the data warehouse – “ETL”. •Off-loading of analysis

page 34

Use Case – RCDC: Visualization

Page 35: Extending the Enterprise Data Warehouse with Hadoop Robert ... · •Extraction and transformation of data for loading into the data warehouse – “ETL”. •Off-loading of analysis

• Hadoop market is still immature, but growing quickly. Better tools are

on the way.

• Look beyond the usual (enterprise) suspects. Many of the most interesting

companies in the big data space are small startups.

• Hadoop won’t replace your EDW, but any organization with a large

EDW should at least be exploring Hadoop as a complement to their

BI infrastructure.

page 35

Conclusions

Page 36: Extending the Enterprise Data Warehouse with Hadoop Robert ... · •Extraction and transformation of data for loading into the data warehouse – “ETL”. •Off-loading of analysis

• Work closely with your existing data management teams.

• Your idea of what constitutes “big data” might quickly diverge from theirs.

• The flip-side to this is that Hadoop can be an excellent tool to off-load

resource-consuming jobs from your data warehouse.

page 36

Conclusions

Page 37: Extending the Enterprise Data Warehouse with Hadoop Robert ... · •Extraction and transformation of data for loading into the data warehouse – “ETL”. •Off-loading of analysis

Thank you!

Questions?

page 37