Putting Business Intelligence to Work on Hadoop Data Stores
Transcript of Putting Business Intelligence to Work on Hadoop Data Stores
Putting Business Intelligence to Work on Hadoop Data Stores
Ian Fyfe, Chief Technology Evangelist, Pentaho
© 2010, Pentaho. All Rights Reserved. www.pentaho.com. Worldwide: +1 (866) 660-7555 | Slide 1
Session Abstract
This presentation will cover how to overcome Hadoop's constraints to get more out of your business data analysis.
An inexpensive way of storing large volumes of data, Hadoop is also scalable and redundant. But getting data out of Hadoop is tough due to the lack of a built-in query language. Also, because users experience high latency (up to several minutes per query), Hadoop is not appropriate for ad hoc query, reporting, and business analysis with traditional tools.
The first step in overcoming Hadoop's constraints is connecting to Hive, a data warehouse infrastructure built on top of Hadoop, which provides the relational structure necessary for scheduled reporting on large datasets stored in Hadoop files. Hive also provides a simple query language called Hive QL, which is based on SQL and enables users familiar with SQL to query this data.
But to really unlock the power of Hadoop, you must be able to efficiently extract data stored across multiple (often tens or hundreds of) nodes with a user-friendly ETL (extract, transform and load) tool that will then allow you to move your Hadoop data into a relational data mart or warehouse where you can use BI tools for analysis.
Attendees will learn how an IT person without Java programming skills can:
• Integrate with Hadoop and Hive to bring ETL, data warehousing and BI applications to the tasks of analyzing Big Data;
• Provide key data integration and transformation functionality to Hadoop data;
• Manage and control Hadoop jobs using a graphical interface;
• Integrate Hadoop data with data from other sources to drive compelling reporting and analytics for today's massive volumes of data.
Slide 2
THE CASE FOR BIG DATA
Slide 3
The Case for Big Data
Enterprises increasingly face needs to store, process and maintain larger and larger volumes of structured and unstructured data
• Compliance
• Competitive Advantage
Challenges associated with big data
• Cost – storage and processing power
• Timeliness of data processing
Why Hadoop?
• Low cost, reliable scale-out architecture for storing massive amounts of data
• Parallel, distributed computing framework for processing data
• Proven success in solving Big Data problems at Fortune 500 companies like Google, Yahoo!, IBM and GE
• Vibrant community, exploding interest, strong commercial investments
[Figure: Google trends for 'Hadoop']
Slide 4
Hadoop for Data Integration and BI
Top Use Cases for Hadoop*
1. "mine data for improved business intelligence"
2. "reducing cost of data analysis"
3. "log analysis"
Top Challenges with Hadoop*
1. Steep technical learning curve
2. Hiring qualified people
3. Availability of appropriate products and tools
Unfortunately, Hadoop was not designed specifically for ETL and BI use cases:
• It's not a database
• High latency queries and jobs not ideal for all BI use cases
• Skill set mismatch for traditional ETL users and BI Solution architects
*Based on a survey of 100+ Hadoop users conducted by Karmasphere, Sept. 2010
Slide 5
ESTABLISHING AN ARCHITECTURE FOR BIG DATA
Slide 6
Example Use Cases Today
Transactional
• Fraud detection
• Financial services/stock markets
Sub-Transactional
• Weblogs
• Social/online media
• Telecoms events
Slide 7
Example Use Cases Today
Non-Transactional
• Web pages, blogs etc
• Documents
• Physical events
• Application events
• Machine events
In most cases structured or semi-structured
Slide 8
Traditional Business Intelligence (BI)
[Diagram labels: Data Source, Data Mart(s), Tape/Trash, several question marks]
Slide 9
Data Lake
• Single source
• Large volume
• Not distilled
• Typically no more than 0-2 lakes per company
• Known and unknown questions
• Multiple user communities
• Don't fit in traditional RDBMS with a reasonable cost
Slide 10
Data Lake Requirements
• Store all the data
• Satisfy routine reporting and analysis
• Satisfy ad-hoc query / analysis / reporting
• Balance performance and cost
Slide 11
What if...
[Diagram labels: Data Source, Data Lake(s), Data Mart(s), Ad-Hoc, Data Warehouse]
Slide 12
Big Data Does Not Replace Data Marts
• It's not a database
• High latency
• Optimized for massive data-crunching
• Big Data databases are immature
• Databases are no-SQL
Slide 13
What Hadoop Really is….
Core Components
HDFS
• A distributed file system allowing massive storage across a cluster of commodity servers
MapReduce
• Framework for distributed computation; common use cases include aggregating, sorting, and filtering BIG data sets
• Problem is broken up into small fragments of work that can be computed or recomputed in isolation on any node of the cluster
Slide 14
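The idea that a problem is broken into fragments that can be computed or recomputed in isolation can be sketched in plain Java. This is only an illustration under stated assumptions: it uses no Hadoop classes, runs on a single machine, and the class and method names (`FragmentSketch`, `split`, `sumFragment`, `total`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: plain Java, no Hadoop classes. All names are hypothetical.
class FragmentSketch {

    // Split the input into fixed-size fragments, roughly as a cluster splits a large file.
    static List<List<Integer>> split(List<Integer> records, int fragmentSize) {
        if (fragmentSize <= 0) {
            throw new IllegalArgumentException("fragmentSize must be positive");
        }
        List<List<Integer>> fragments = new ArrayList<>();
        for (int i = 0; i < records.size(); i += fragmentSize) {
            int end = Math.min(i + fragmentSize, records.size());
            fragments.add(new ArrayList<>(records.subList(i, end)));
        }
        return fragments;
    }

    // Work on one fragment. It reads nothing outside the fragment, so it could run
    // on any node and be rerun with the same result if that node fails.
    static long sumFragment(List<Integer> fragment) {
        long sum = 0;
        for (int value : fragment) {
            sum += value;
        }
        return sum;
    }

    // Combine the partial results from all fragments into the final aggregate.
    static long total(List<Integer> records, int fragmentSize) {
        long total = 0;
        for (List<Integer> fragment : split(records, fragmentSize)) {
            total += sumFragment(fragment);
        }
        return total;
    }
}
```

Because each partial sum depends only on its own fragment, the final total does not depend on how the input was split or on which fragment was processed first.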
What Hadoop Really is….
Related Projects
Hive – a data warehouse infrastructure on top of Hadoop
• Implements a SQL like Query language, including a JDBC driver
• Allows MapReduce developers to plugin custom mappers and reducers
Hbase – the Hadoop database – AH HA!
• A variant of NoSQL databases, problematic for traditional BI
• Best at storing large amounts of unstructured data
Slide 15
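Since Hive exposes a SQL-like language through a JDBC driver, a BI tool or a small program can query it with the standard `java.sql` API. The following is a minimal sketch, not a definitive client: the table `weblogs` and its `page` column are hypothetical, the JDBC URL format depends on the Hive version in use, and the Hive JDBC driver jar is assumed to be on the classpath when `runQuery` is called.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: the table and column names are assumptions, not from the slides.
class HiveQuerySketch {

    // Build a Hive QL aggregation over a hypothetical weblogs table.
    static String topPagesQuery(int limit) {
        return "SELECT page, COUNT(*) AS hits FROM weblogs "
             + "GROUP BY page ORDER BY hits DESC LIMIT " + limit;
    }

    // Run the query through plain JDBC and return each row as "page<TAB>hits".
    // Expect high latency: Hive typically turns the query into batch jobs on the cluster.
    static List<String> runQuery(String jdbcUrl, String sql) throws SQLException {
        List<String> rows = new ArrayList<>();
        try (Connection conn = DriverManager.getConnection(jdbcUrl);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                rows.add(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
        return rows;
    }
}
```

A call might look like `HiveQuerySketch.runQuery("jdbc:hive2://localhost:10000/default", HiveQuerySketch.topPagesQuery(10))`, where the host, port and database are placeholders and the `jdbc:hive2://` scheme assumes a HiveServer2 endpoint.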
Hadoop and BI?
• Distributed processing
• Distributed file system
• Commodity hardware
• Platform independent (in theory)
• Scales out beyond technology and/or economy of a RDBMS
In many cases it's the only viable solution
Slide 16
Hadoop and BI?
90% of new Hadoop use cases are transformation of semi/structured data*
* of those companies we've talked to...
Slide 17
Hadoop and BI?
"The working conditions within Hadoop are shocking"
ETL Developer
Slide 18
Hadoop and BI?
Instead of this...
Slide 19
Hadoop and BI?
You have to do this in Java...
    public void map(
        Text key,
        Text value,
        OutputCollector output,
        Reporter reporter)
    public void reduce(
        Text key,
        Iterator values,
        OutputCollector output,
        Reporter reporter)
Slide 20
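To give a feel for what goes inside those two methods, here is a word-count sketch in plain Java that mirrors the shape of `map` and `reduce`. It is an assumption-laden illustration rather than Hadoop code: `Collector` is a hypothetical stand-in for Hadoop's `OutputCollector`, the `Reporter` parameter is omitted, and `run` is a tiny local driver that plays the role the framework would play on a cluster.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: no Hadoop dependency. All names are hypothetical.
class MapReduceShapeSketch {

    // Stand-in for an output collector: receives emitted key/value pairs.
    interface Collector {
        void collect(String key, int value);
    }

    // map: called once per input record; emits (word, 1) for each word in the line.
    static void map(String key, String value, Collector output) {
        for (String word : value.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                output.collect(word, 1);
            }
        }
    }

    // reduce: called once per distinct key with all values emitted for it; emits the sum.
    static void reduce(String key, Iterator<Integer> values, Collector output) {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next();
        }
        output.collect(key, sum);
    }

    // Local driver: map every line, group emitted values by key, then reduce each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (int i = 0; i < lines.size(); i++) {
            map("line-" + i, lines.get(i),
                (k, v) -> grouped.computeIfAbsent(k, x -> new ArrayList<>()).add(v));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            reduce(entry.getKey(), entry.getValue().iterator(), result::put);
        }
        return result;
    }
}
```

Even for a task this small, the developer has to think in terms of emitted pairs and grouped values, which is the skill set mismatch the slides describe for traditional ETL users.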
People don't use Hadoop for BI because they want to...
Slide 21
...they do it because they have to...
Slide 22
... and unfortunately it wasn't designed for most BI requirements
Slide 23
Why not add to Hadoop the things it's missing...
Slide 24
... until it can do what we need it to?
Slide 25
If only we had a Java, embeddable, data transformation engine...
Slide 26
A Data Integration Engine for Hadoop
[Diagram labels: Data Marts, Data Warehouse, Analytical Applications; Data Integration Engine (shown several times, including within Hadoop); Hadoop; Design, Deploy, Orchestrate]
Slide 27
[Diagram labels: Visualize: Reporting / Dashboards / Analysis; Web Tier; Optimize: DM & DW (RDBMS); Optimize: Hive (Hadoop); Files / HDFS (Hadoop); Load: Applications & Systems]
Slide 28
[Diagram labels: Reporting / Dashboards / Analysis; Web Tier; DM & DW (RDBMS); Hive (Hadoop); Files / HDFS (Hadoop); Applications & Systems; Metadata]
Slide 29
[Diagram labels: Data Source, Data Lake(s), Data Mart(s), Ad-Hoc, Data Warehouse]
Slide 30
[Diagram labels: Reporting / Dashboards / Analysis; Web Tier; RDBMS; Data Lake; Hadoop; Applications & Systems]
Slide 31
Product Requirements for BI Against Hadoop
• Lower technical barriers through graphical ETL environment for creating and managing Hadoop MapReduce jobs
• Extreme ETL scalability through deployment across the Hadoop cluster
• Easily spin-off high performance data marts for interactive analysis
• Easily integrate data from Hadoop with data from other sources
• Provide end-to-end BI addressing common BI use cases with Hadoop including reporting, ad hoc query and interactive analysis
• Reduce costs through subscription-based pricing, reduced dependency on scarce technical resources, and easier maintainability
[Diagram labels: Interactive Analysis; Batch Reporting and Ad Hoc Query; Data Marts; Agile BI; Hadoop; Hive; Data Integration Jobs; Log Files; DBs and other sources]
Slide 32
THE ROAD AHEAD
Slide 33
The Road Ahead
Other NoSQL Integration
• Facilitate BI use cases on top of HBase, possibly others like MongoDB, Cassandra
Streaming Data Source Support
• In support of near-realtime use cases
• Long/always running data processing jobs
Contiguous Meta-data
• Data Lineage and Impact Analysis covering the entire big data architecture
The End of MapReduce (… as a concept ETL users need to understand)
• Push down optimization of Transformations that generate native MapReduce tasks in Hadoop
Slide 34
Hadoop Distro Wars
The Apache Software Foundation
Slide 35
Tools That Make Hadoop Easier
e.g. Apache Pig
• Pig is a platform for analyzing large data sets
• Produces sequences of MapReduce programs
Integrate Pig scripts into enterprise data integration workflows, e.g.
1. Submit and monitor a series of Pig and MapReduce jobs
2. Process a database bulk load step to ready data for ad-hoc analysis or report bursting
Slide 36
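The two-step workflow above (run a series of Pig and MapReduce jobs, then bulk load the result) amounts to ordered orchestration with monitoring. A minimal sketch in plain Java follows, assuming each job can be represented as an action that reports success or failure; the class `WorkflowSketch`, the `Step` type and the step names used with it are all hypothetical, and no real Pig or Hadoop API is called.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

// Illustrative only: steps are plain callbacks, not real Pig or Hadoop job submissions.
class WorkflowSketch {

    // One named step in the workflow; the action returns true on success.
    static final class Step {
        final String name;
        final BooleanSupplier action;

        Step(String name, BooleanSupplier action) {
            this.name = name;
            this.action = action;
        }
    }

    // Run steps in order and record a status line per step. Stop at the first failure
    // so that a later step, such as the bulk load, never runs on incomplete data.
    static List<String> run(List<Step> steps) {
        List<String> log = new ArrayList<>();
        for (Step step : steps) {
            boolean ok;
            try {
                ok = step.action.getAsBoolean();
            } catch (RuntimeException e) {
                ok = false; // treat an exception as a failed step
            }
            log.add(step.name + ": " + (ok ? "OK" : "FAILED"));
            if (!ok) {
                break;
            }
        }
        return log;
    }
}
```

In a real deployment each action would submit a job and poll its status; here the returned log simply shows which steps ran and where the workflow stopped.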
Growth in Adoption of Other NoSQL Big Data Platforms
• Hbase – the Hadoop database
• mongoDB – scalable, high-performance, document-oriented database
• LexisNexis HPCC – a data intensive computing system platform
• Many others
Slide 37
Summary
Hadoop and other Big Data NoSQL platforms
• Great at storing and processing large diverse data volumes
• Not designed for Business Intelligence
Choosing the right BI technology can unlock your Big Data to drive actionable insights
• Graphical user interfaces
• Scalable
• Spin-off data marts
• Integrate data into data warehouses
• Integrated dashboards, reporting, data analysis, data integration
Slide 38
Thank You!
ifyfe@pentaho.com
Slide 39