Extending the EDW with Hadoop - Chicago Data Summit 2011
Extending the Enterprise Data Warehouse with Hadoop
Robert Lancaster and Jonathan Seidman
Chicago Data Summit
April 26 | 2011
Who We Are
• Robert Lancaster
– Solutions Architect, Hotel Supply Team
– @rob1lancaster
• Jonathan Seidman
– Lead Engineer, Business Intelligence/Big Data Team
– Co-founder/organizer of Chicago Hadoop User Group (http://www.meetup.com/Chicago-area-Hadoop-User-Group-CHUG)
– @jseidman
page 2
page 3
Launched: 2001, Chicago, IL
page 4
Why are we using Hadoop?
Stop me if you’ve heard this before…
page 5
On Orbitz alone we handle millions of searches and transactions daily, generating hundreds of gigabytes of log data every day.
page 6
Hadoop provides us with efficient, economical, scalable, and reliable storage and processing of these large amounts of data.
$ per TB
And…
page 7
Hadoop places no constraints on how data is processed.
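A concrete illustration of this flexibility (not from the deck; a minimal Python sketch assuming a hypothetical tab-delimited search-log format): with Hadoop the schema is applied at read time, inside the mapper, rather than being imposed when the data is loaded.

```python
# Minimal MapReduce-style sketch of schema-on-read log processing.
# The log format (timestamp, session_id, query) is hypothetical.
import collections

def mapper(line):
    fields = line.rstrip("\n").split("\t")
    if len(fields) == 3:          # skip malformed lines instead of failing a load
        yield fields[2], 1        # emit (query, 1)

def reducer(pairs):
    counts = collections.Counter()
    for key, value in pairs:
        counts[key] += value
    return counts

log_lines = [
    "2011-04-26T10:00:01\ts1\tchicago hotels",
    "2011-04-26T10:00:02\ts2\tnew york hotels",
    "2011-04-26T10:00:03\ts3\tchicago hotels",
]
pairs = (pair for line in log_lines for pair in mapper(line))
print(reducer(pairs))  # Counter({'chicago hotels': 2, 'new york hotels': 1})
```

Because the raw lines are kept as-is, a different job can later re-parse the same data with a completely different schema.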
Before Hadoop
page 8
page 9
With Hadoop
page 10
Access to this non-transactional data enables a number of applications…
Optimizing Hotel Search
page 11
Recommendations
page 12
Page Performance Tracking
page 13
Cache Analysis
page 14
[Chart: reverse running totals of queries and searches across the top 20 query frequencies; 71.67% of queries (singletons) account for 31.87% of searches, while just 2.78% of queries account for 34.30% of search volume.]
72% of queries are singletons and make up nearly a third of total search volume.
A small number of queries (3%) make up more than a third of search volume.
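The reverse running totals behind the chart can be reproduced from a query-frequency table. The sketch below uses made-up data, not Orbitz's; it groups distinct queries by how often each was searched and accumulates the share of total volume.

```python
# Sketch of the chart's "reverse running total": what share of total search
# volume is covered by queries at or below a given frequency. Data is invented.
from collections import Counter

searches = ["q1"] * 50 + ["q2"] * 30 + ["q3", "q4", "q5"]  # hypothetical query log
freq = Counter(searches)              # searches per distinct query

# Group distinct queries by how often each was searched.
by_count = Counter(freq.values())     # here: {50: 1, 30: 1, 1: 3}

total_searches = sum(freq.values())
running = 0.0
for count in sorted(by_count):        # from singletons upward
    running += count * by_count[count] / total_searches
    print(f"queries searched <= {count} times cover {running:.1%} of volume")
```

With real logs, the singleton bucket (`count == 1`) is where the "72% of queries" figure on the previous slide would come from.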
User Segmentation
page 15
All of this is great, but…
Most of these efforts are driven by development teams.
The challenge now is to unlock the value in this data by making it more available to the rest of the organization.
page 16
page 17
“Given the ubiquity of data in modern organizations, a data warehouse can keep pace today only by being “magnetic”: attracting all the data sources that crop up within an organization regardless of data quality niceties.”*
*MAD Skills: New Analysis Practices for Big Data
page 18
In a better world…
Integrating Hadoop with the Enterprise Data Warehouse
page 20
The goal is a unified view of the data, allowing us to use the power of our existing tools for reporting and analysis.
page 21
BI vendors are working on integration with Hadoop…
page 22
And one more reporting tool…
Example Processing Pipeline for Web Analytics Data
page 23
Aggregating data for import into Data Warehouse
page 24
page 25
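One way to picture the aggregation step (a hedged sketch with invented column names, not the actual Orbitz pipeline): roll raw page-performance records up to the daily grain the warehouse stores, so only compact summary rows are exported to the DW loader.

```python
# Hypothetical sketch: aggregating raw page-performance records down to one
# summary row per (date, page) before export to the data warehouse.
import csv, io
from collections import defaultdict

raw = io.StringIO(
    "date,page,load_ms\n"
    "2011-04-25,hotel_search,420\n"
    "2011-04-25,hotel_search,380\n"
    "2011-04-25,checkout,910\n"
)

groups = defaultdict(list)
for row in csv.DictReader(raw):
    groups[(row["date"], row["page"])].append(int(row["load_ms"]))

# One summary row per (date, page) is all the DW needs to keep.
summary = {
    key: {"n": len(v), "mean_ms": sum(v) / len(v)}
    for key, v in groups.items()
}
print(summary[("2011-04-25", "hotel_search")])  # {'n': 2, 'mean_ms': 400.0}
```

In practice this rollup would run as a MapReduce or Hive job over HDFS; the shape of the computation is the same.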
Example Use Case: Beta Data Processing
Example Use Case – Beta Data Processing
page 26
page 27
Example Use Case – Beta Data Processing Output
page 28
Example Use Case: RCDC Processing
Example Use Case – RCDC Processing
page 29
page 30
Example Use Case: Click Data Processing
Click Data Processing – Current DW Processing
page 31
[Diagram: web server logs flow through ETL (~3 hours) into the DW, where a stored-procedure data cleansing step (~2 hours) reduces them to ~20% of the original data size.]
Click Data Processing – New Hadoop Processing
page 32
[Diagram: web server logs land in HDFS, are cleansed by a MapReduce job, and only the cleansed output is loaded into the DW.]
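The MapReduce cleansing step might look like the following map-only sketch (the record format and bot list are hypothetical, not Orbitz's actual logic): each mapper drops malformed and bot-generated click records so the warehouse only loads the small fraction of data it needs.

```python
# Sketch of click-data cleansing as a map-only job (hypothetical record format):
# drop malformed and bot-generated records before loading into the DW.
BOT_AGENTS = ("googlebot", "bingbot")    # illustrative bot list

def clean(line):
    """Return a cleansed record, or None to drop the line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:                  # malformed record
        return None
    timestamp, user_agent, url = fields
    if any(bot in user_agent.lower() for bot in BOT_AGENTS):
        return None                       # bot traffic
    return "\t".join((timestamp, url))    # keep only the columns the DW loads

lines = [
    "2011-04-26T10:00:01\tMozilla/5.0\t/hotel/123",
    "2011-04-26T10:00:02\tGooglebot/2.1\t/hotel/456",
    "broken record",
]
cleansed = [r for r in (clean(l) for l in lines) if r is not None]
print(len(cleansed))  # 1
```

Because each record is cleansed independently, this step needs no reducer and parallelizes trivially across the cluster, which is what moves the multi-hour stored-procedure work off the warehouse.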
Conclusions
• The market is still immature, but Hadoop has already become a valuable business intelligence tool and will become an increasingly important part of BI infrastructures.
• Hadoop won’t replace your EDW, but any organization with a large EDW should at least be exploring Hadoop as a complement to their BI infrastructure.
• Use Hadoop to offload the time- and resource-intensive processing of large data sets, freeing your data warehouse to serve user needs.
• The challenge now is making Hadoop more accessible to non-developers. Vendors are addressing this, so expect rapid advancements in Hadoop accessibility.
page 33
Oh, and also…
• Orbitz is looking for a Lead Engineer for the BI/Big Data team.
• Go to http://careers.orbitz.com/ and search for IRC19035.
page 34
References
• MAD Skills: New Analysis Practices for Big Data, Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph Hellerstein, and Caleb Welton, 2009
page 35