A Tour of the Zoo: the Hadoop Ecosystem
Prafulla Wani
Technical Architect - Big Data
Syntel
Confidential 2012 Syntel, Inc.
Agenda
Welcome to the Zoo!
Evolution Timeline
Traditional BI/DW Architecture
Where Hadoop Fits In
Welcome to the Zoo!
[Logos pictured: Jaql, Giraph, Shark, ZooKeeper, Pig, Hama, Hadoop]
I am sure you won't find a Shark in any other zoo.
http://zookeeper.apache.org/
https://cwiki.apache.org/confluence/display/Hive
What is Hadoop?
Hadoop is an open-source project overseen by the Apache Software Foundation
Hadoop is an ecosystem, not a single product
Originally based on papers published by Google in 2003 and 2004
Several projects in the ecosystem were inspired by whitepapers published by Google
Google calls it | Hadoop equivalent
GFS             | HDFS
MapReduce       | Hadoop MapReduce
Sawzall         | Hive, Pig
BigTable        | HBase
Chubby          | ZooKeeper
Pregel          | Giraph
Evolution Timeline
Started by Doug Cutting at Yahoo! in early 2006 and named after his kid's toy elephant
Hadoop committers work at several different organizations, including Facebook, Yahoo!, LinkedIn, Twitter, Cloudera, and Hortonworks
[Timeline figure: ecosystem project logos, including Jaql and Giraph, placed along the years 2006 to 2011]
Traditional Data Strategy - BI/DW Architecture
            | ETL Tools              | DW / Marts   | BI                  | Analytics
Commercial  | Informatica            | Teradata     | MicroStrategy       | SAS
            | Oracle Data Integrator | Oracle       | OBIEE               | TIBCO Spotfire
            | IBM DataStage          | DB2, Netezza | Cognos              | SPSS
            | Microsoft SSIS         | SQL Server   | Microsoft SSRS      |
Open source | Talend                 | MySQL        | Pentaho, Jaspersoft | R, RapidMiner
[Diagram: ERP, CRM, Database, Files -> ETL Process -> Data Warehouse / Data Marts -> Analytics, OLAP Analysis/BI, Ad Hoc Reporting]
How Hadoop Fits In
Hadoop can complement the existing DW environment as well as replace some of the components in a traditional data architecture.
[Diagram: ERP, CRM, Database, Files -> ETL Process -> Data Warehouse / Data Marts -> Analytics, OLAP Analysis/BI, Ad Hoc Reporting]
Data Storage
Hadoop Distributed File System (HDFS)
It's a file system, not a DBMS
Allows storage of both structured and unstructured data
Provides distributed, redundant storage for massive amounts of data on cheap, unreliable computers
The Hadoop 2.0 release (still beta) added some important features: HDFS Federation and High Availability
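To make "distributed, redundant storage" concrete, here is a toy Python simulation of splitting a file into fixed-size blocks and storing each block on several machines. It is only a sketch: it does not use the HDFS API, and the block size, replication factor, node names, and round-robin placement are illustrative assumptions rather than the placement policy HDFS actually applies.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: list[str], replication: int) -> dict[int, list[str]]:
    """Assign every block to `replication` distinct nodes, round-robin (toy policy)."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds the number of nodes")
    placement = {}
    for block_id in range(num_blocks):
        start = block_id % len(nodes)
        placement[block_id] = [nodes[(start + r) % len(nodes)] for r in range(replication)]
    return placement


def readable_after_failure(placement: dict[int, list[str]], failed_node: str) -> bool:
    """True if every block still has at least one replica on a surviving node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


if __name__ == "__main__":
    nodes = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical machine names
    blocks = split_into_blocks(b"x" * 1000, block_size=256)
    placement = place_replicas(len(blocks), nodes, replication=3)
    for block_id, replicas in placement.items():
        print(block_id, replicas)
    print("survives loss of node-b:", readable_after_failure(placement, "node-b"))
```

With several replicas of each block on distinct nodes, losing any single node leaves every block readable, which is the property that makes cheap, unreliable machines usable as storage.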
HBase
Distributed, versioned, column-oriented store on top of HDFS
Provides low-latency (OLTP-style) reads and writes, along with support for the batch-processing model of MapReduce
Goal: store tables with billions of rows and millions of columns
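One way to picture a "versioned, column-oriented store" is as a nested map: row key, then column name, then timestamp, then value. The Python sketch below models that idea in memory. The class `ToyVersionedTable` and its methods are hypothetical names invented for illustration; this is not the HBase client API and it ignores distribution entirely.

```python
from collections import defaultdict
from typing import Optional


class ToyVersionedTable:
    """In-memory stand-in for a versioned, column-oriented table (not the HBase API)."""

    def __init__(self, max_versions: int = 3):
        self.max_versions = max_versions
        # row key -> column name ("family:qualifier") -> [(timestamp, value)], newest first
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row: str, column: str, value: str, timestamp: int) -> None:
        """Store a new version of a cell and drop versions beyond max_versions."""
        versions = self._rows[row][column]
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)
        del versions[self.max_versions:]

    def get(self, row: str, column: str, timestamp: Optional[int] = None) -> Optional[str]:
        """Return the newest value at or before `timestamp` (latest if None)."""
        for ts, value in self._rows.get(row, {}).get(column, []):
            if timestamp is None or ts <= timestamp:
                return value
        return None


if __name__ == "__main__":
    table = ToyVersionedTable(max_versions=2)
    table.put("user1", "info:email", "old@example.com", timestamp=1)
    table.put("user1", "info:email", "new@example.com", timestamp=2)
    print(table.get("user1", "info:email"))               # new@example.com
    print(table.get("user1", "info:email", timestamp=1))  # old@example.com
    print(table.get("user1", "info:phone"))               # None
```

Because a row only stores the columns it actually has, a table can define a very large number of possible columns without paying for empty cells.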
Data Processing (ETL / Analytics)
Extract / Load
Source or target is an RDBMS: Sqoop
Log collection and aggregation: Flume, Scribe, Chukwa
Stream processing: S4, Storm (also supports transformation)
Transformation
MapReduce programming in Java or any other language, or high-level query languages such as Pig and Hive
Workflow design and implementation using tools such as Oozie and Azkaban
Iterative algorithms or in-memory cluster processing using Spark, Shark, etc.
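The map-reduce programming model mentioned above has three phases: a map step that emits key/value pairs, a shuffle that groups values by key, and a reduce step that aggregates each group. The single-process Python sketch below walks through those phases for a word count. It makes no Hadoop calls, and the function names are hypothetical; on a cluster the map and reduce steps would run as distributed tasks and the framework would perform the shuffle.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(line: str) -> Iterator[tuple[str, int]]:
    """Map: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle: group all emitted values by key (done by the framework on a cluster)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> dict[str, int]:
    """Run map, shuffle, and reduce in sequence; keys are returned in sorted order."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in sorted(grouped.items()))


if __name__ == "__main__":
    print(word_count(["pig hive pig", "hive hbase"]))
    # {'hbase': 1, 'hive': 2, 'pig': 2}
```

High-level languages such as Pig and Hive let the same kind of aggregation be expressed as a query instead of hand-written map and reduce functions.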
Analytics
Mahout: scalable machine learning library, with most of its algorithms implemented on top of Apache Hadoop using the map/reduce paradigm
RHadoop: provides R packages to access data in HDFS and HBase, and to write MapReduce jobs in R
Common Industry Use Cases
Use case | Solution | Comments
Cold data storage | HDFS | More cost-effective option compared to most appliances in the market
Huge transactional volume | HBase | StumbleUpon created OpenTSDB to capture their infrastructure metrics data
Batch processing | MapReduce / Hive / Pig |
Log aggregation | Flume, Scribe, Chukwa | Web-log collection on HDFS in near real-time
Real-time message/stream processing | Storm, S4 | Used by Twitter for real-time tweet processing
Iterative algorithms / in-memory processing | Spark / Shark | Predictive analytics, log mining
Machine learning / analytics | Mahout, RHadoop |
Graph data storage/processing | Giraph | Championed at Yahoo!
Proposed Big Data Roadmap
1. Kickoff - Assessment Study:
Understand the business processes
Understand organizational goals and current investments
Understand the challenges and pain points of the current setup
2. Proof of Concept:
A Proof of Concept can be performed to demonstrate the applicability of Hadoop to enhance the DW
3. Big Data integration - Initial steps:
Move cold/warm data to Hive/HBase to reduce expenses on storage infrastructure
Bring in new data sources such as web logs, which was not possible with traditional storage solutions
4. Big Data integration - Next steps:
Open the data up to business users for analysis, so they can appreciate the power of the new infrastructure
5. Big Data integration - Next steps:
Identify the opportunities in the ETL and Analytics space
Move hot data to Hadoop
Perform real-time data integration using Storm/Spark
6. Big Data integration - Next steps:
Implement advanced solutions
Hadoop Technology Stack
HDFS, HBase
Hive, Pig, MapReduce
Mahout, RHadoop
Thank You