Cloudera Sessions - Clinic 1 - Getting Started With Hadoop
1
DO NOT USE PUBLICLY PRIOR TO 10/23/12
From Zero to Hadoop
Speaker Name | Title April 18, 2023
2
Agenda
• Hadoop Ecosystem Overview
• Hadoop Core Technical Overview
  • HDFS
  • MapReduce
• Hadoop in the Enterprise
  • Cluster Planning
  • Cluster Management with Cloudera Manager
3
What Are All These Things?
Hadoop Ecosystem Overview
4
Hadoop Ecosystem
[Figure: Hadoop ecosystem diagram]
Stages: Ingest, Store, Explore, Process, Analyze, Serve
Layers: Connectors; Storage; Resource Management & Coordination; User Interface; Workflow Management; Metadata; Cloud Integration; Batch Compute; Batch Processing; Real-Time Access & Compute
Components: YARN, ZooKeeper, HDFS, HBase, Hue, Oozie, Whirr, Sqoop, Flume, Fuse-DFS (file), WebHDFS / HttpFS (REST), ODBC / JDBC (SQL), Metastore, Access, MapReduce, MapReduce2, Hive, Pig, Mahout, DataFu, Impala
Integrations: BI, ETL, RDBMS
Management software & technical support subscription options: Cloudera Navigator, Cloudera Manager; Core (required), RTD, RTQ, BDR, Audit (v1.0), Lineage, Access (v1.0), Lifecycle, Explore
5
Sqoop
Performs bidirectional data transfers between Hadoop and almost any SQL database with a JDBC driver
6
FlumeNG
[Diagram: four clients sending data to three Flume agents]
A streaming data collection and aggregation system for massive volumes of data from sources such as RPC services, Log4J, and Syslog.
7
HBase
• A low-latency, distributed, non-SQL database built on HDFS.
• A “Columnar Database”
8
Hive
• Relational database abstraction using a SQL-like dialect called HiveQL
• Statements are executed as one or more MapReduce jobs

-- The name of the second table (kjv) is assumed; it is missing from the transcript.
SELECT s.word, s.freq, k.freq
FROM shakespeare s JOIN kjv k ON (s.word = k.word)
WHERE s.freq >= 5;
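To make "executed as one or more MapReduce jobs" concrete, here is a minimal Python sketch of one common way a join like the one above can be expressed as a single job (a reduce-side join). This is an illustration only, not necessarily the plan Hive generates; the sample rows and the tags "s" and "k" are made up.

```python
from collections import defaultdict

# Made-up sample rows (word, freq) standing in for the two joined tables.
left_rows = [("the", 25848), ("thou", 5485), ("zounds", 3)]
right_rows = [("the", 62394), ("thou", 4522), ("ark", 230)]

def map_phase(rows, tag):
    """Emit (join key, (source tag, freq)) for every input row."""
    for word, freq in rows:
        yield word, (tag, freq)

def reduce_phase(word, tagged_values):
    """Pair left and right values sharing a key, then apply the WHERE filter."""
    left = [freq for tag, freq in tagged_values if tag == "s"]
    right = [freq for tag, freq in tagged_values if tag == "k"]
    for s_freq in left:
        for k_freq in right:
            if s_freq >= 5:
                yield word, s_freq, k_freq

# Shuffle: group all mapped values by join key.
groups = defaultdict(list)
mapped = list(map_phase(left_rows, "s")) + list(map_phase(right_rows, "k"))
for key, value in mapped:
    groups[key].append(value)

joined = [row for word in sorted(groups) for row in reduce_phase(word, groups[word])]
print(joined)
```

Words that appear in only one input produce no output row, which is the inner-join behavior the query asks for.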
9
Pig
• High-level scripting language for executing one or more MapReduce jobs
• Created to simplify authoring of MapReduce jobs
• Can be extended with user-defined functions

emps = LOAD 'people.txt' AS (id, name, salary);
rich = FILTER emps BY salary > 200000;
sorted_rich = ORDER rich BY salary DESC;
STORE sorted_rich INTO 'rich_people.txt';
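For readers new to Pig Latin, the same filter-then-order logic can be sketched in plain Python. This runs in a single process and only mirrors the relational steps; the three records are made-up stand-ins for the contents of people.txt.

```python
# Made-up records standing in for people.txt: (id, name, salary).
emps = [
    (1, "Ann", 250000),
    (2, "Bob", 90000),
    (3, "Cho", 410000),
]

# FILTER emps BY salary > 200000
rich = [emp for emp in emps if emp[2] > 200000]

# ORDER rich BY salary DESC
sorted_rich = sorted(rich, key=lambda emp: emp[2], reverse=True)

print(sorted_rich)
```

On a cluster, Pig would compile the equivalent statements into MapReduce jobs instead of looping over a list in memory.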
10
Oozie
A workflow engine and scheduler built specifically for large-scale job orchestration on a Hadoop cluster
11
ZooKeeper
• ZooKeeper is a distributed consensus engine
• Provides well-defined concurrent access semantics:
  • Leader election
  • Service discovery
  • Distributed locking / mutual exclusion
  • Message board / mailboxes
12
Mahout
A machine learning library with algorithms for:
• Recommendation based on users' behavior
• Clustering of related documents into groups
• Classification learned from existing categorized documents
• Frequent item-set mining (shopping cart content)
13
Hadoop Security
• Authentication is secured by MIT Kerberos v5 and integrated with LDAP
• Provides Identity, Authentication, and Authorization
• Useful for multitenancy or secure environments
14
Only the Good Parts
Hadoop Core Technical Overview
15
Components of HDFS
• NameNode – Holds all metadata for HDFS
  • Needs to be a highly reliable machine
    • RAID drives – typically RAID 10
    • Dual power supplies
    • Dual network cards – bonded
  • The more memory the better – typically 36GB to 64GB
• Secondary NameNode – Provides checkpointing for the NameNode. The same hardware as the NameNode should be used.
16
Components of HDFS – Contd.
• DataNodes – Hardware will depend on the specific needs of the cluster
  • No RAID needed; JBOD (just a bunch of disks) is used
  • Typical ratio is:
    • 1 hard drive
    • 2 cores
    • 4GB of RAM
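As a quick worked example of the ratio above, the sketch below scales it to a whole DataNode. The 12-drive chassis is a hypothetical input, not a figure from the slides, and the function name is made up for illustration.

```python
def datanode_spec(drives, cores_per_drive=2, ram_gb_per_drive=4):
    """Scale the 1 drive : 2 cores : 4GB RAM ratio to a whole DataNode."""
    return {
        "drives": drives,
        "cores": drives * cores_per_drive,
        "ram_gb": drives * ram_gb_per_drive,
    }

# Hypothetical 12-drive JBOD chassis.
spec = datanode_spec(12)
print(spec)
```

Treat the result as a starting point only; as the slide says, the right hardware depends on the specific needs of the cluster.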
17
HDFS Architecture Overview
[Diagram: Host 1 runs the NameNode, Host 2 runs the Secondary NameNode, and Hosts 3 through n each run a DataNode]
18
HDFS Block Replication
Block Size = 64MB
Replication Factor = 3
[Figure: a file in HDFS split into blocks 1 to 5 and stored on Nodes 1 to 5. Each node holds three blocks (2 3 4, 2 4 5, 1 3 5, 1 2 5, 1 3 4), so every block has three replicas.]
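The arithmetic behind this slide can be sketched in a few lines of Python. The 300MB file size is a hypothetical input, and the round-robin placement is a deliberate simplification for illustration; it is not HDFS's actual rack-aware placement policy.

```python
import math

BLOCK_SIZE_MB = 64
REPLICATION = 3

def block_count(file_size_mb):
    """Number of HDFS blocks needed; the last block may be only partly full."""
    return math.ceil(file_size_mb / BLOCK_SIZE_MB)

def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Simplified round-robin placement (assumes replication <= len(nodes))."""
    return {
        block + 1: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }

file_size_mb = 300  # hypothetical file
nodes = ["Node 1", "Node 2", "Node 3", "Node 4", "Node 5"]

blocks = block_count(file_size_mb)
raw_storage_mb = file_size_mb * REPLICATION  # every byte is stored three times
placement = place_replicas(blocks, nodes)

print(blocks, raw_storage_mb)
for block, holders in placement.items():
    print(block, holders)
```

With these inputs the file needs five blocks, and spreading three replicas of each over five nodes leaves three blocks on every node, the same shape as the figure.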
19
MapReduce – Map
• Records from the data source (lines out of files, rows of a database, etc.) are fed into the map function as key-value pairs: e.g., (filename, line).
• map() produces one or more intermediate values along with an output key from the input.
[Diagram: the Map Task emits (key 1, values), (key 2, values), and (key 3, values); the Shuffle Phase gathers the intermediate values for key 1; the Reduce Task emits the final (key, values)]
20
MapReduce – Reduce
• After the map phase is over, all the intermediate values for a given output key are combined together into a list.
• reduce() combines those intermediate values into one or more final values for that same output key.
21
MapReduce – Shuffle and Sort
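The map, shuffle-and-sort, and reduce steps can be sketched as a single-process word count in Python. This is only an illustration of the data flow, not Hadoop's Java API; the function names and the two sample records are made up.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(filename, line):
    """Map: emit an intermediate (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce: combine all intermediate values for one key into a final value."""
    yield word, sum(counts)

# Input records as (filename, line) key-value pairs.
records = [
    ("a.txt", "to be or not to be"),
    ("b.txt", "to thine own self be true"),
]

# Map phase.
intermediate = [pair for filename, line in records for pair in map_fn(filename, line)]

# Shuffle and sort: bring all values for the same key together.
intermediate.sort(key=itemgetter(0))

# Reduce phase: one call per distinct key.
results = {}
for word, group in groupby(intermediate, key=itemgetter(0)):
    for key, total in reduce_fn(word, (count for _, count in group)):
        results[key] = total

print(results)
```

On a real cluster the map and reduce calls run as tasks on many nodes, and the framework performs the sort and the grouping between them.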
22
How It Works In The Real World
Hadoop In the Enterprise
24
Networking
• One of the most important things to consider when setting up a Hadoop cluster
• Typically a top-of-rack switch is used with Hadoop, together with a core switch
• Be careful about oversubscribing the backplane of the switch!
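Oversubscription can be checked with simple arithmetic: compare the bandwidth the ports could offer at once with the capacity available to carry it. The sketch below uses hypothetical numbers (48 ports at 10 Gbps and 320 Gbps of capacity) and a made-up function name; substitute the figures from the switch's data sheet.

```python
def oversubscription_ratio(port_count, port_gbps, capacity_gbps):
    """Offered bandwidth divided by available capacity; above 1.0 is oversubscribed."""
    offered_gbps = port_count * port_gbps
    return offered_gbps / capacity_gbps

# Hypothetical switch: 48 ports at 10 Gbps against 320 Gbps of capacity.
ratio = oversubscription_ratio(port_count=48, port_gbps=10, capacity_gbps=320)
print(ratio)
```

The same calculation applies to rack uplinks: use the node-facing ports as the offered bandwidth and the total uplink bandwidth to the core switch as the capacity.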
25
Hadoop Typical Data Pipeline
[Figure: typical data pipeline. Labels: Data Sources; Sqoop and Flume; Hadoop (HDFS, MapReduce, Pig, Hive, Oozie) holding Original Source Data and Result or Calculated Data; Sqoop; Data Warehouse; Marts]
26
Hadoop Use Cases
Industry         Advanced Analytics Use Case      Data Processing Use Case
Web              Social Network Analysis          Clickstream Sessionization
Media            Content Optimization             Clickstream Sessionization
Telco            Network Analytics                Mediation
Retail           Loyalty & Promotions Analysis    Data Factory
Financial        Fraud Analysis                   Trade Reconciliation
Federal          Entity Analysis                  SIGINT
Bioinformatics   Sequencing Analysis              Genome Mapping
27
Hadoop in the Enterprise
[Figure: Hadoop in the enterprise. Labels: Logs, Files, Web Data, Relational Databases; IDEs, BI / Analytics, Enterprise Reporting; Enterprise Data Warehouse; Web Application; Management Tools. Users: Operators, Engineers, Analysts, Business Users, Customers]
28
Cloudera Manager
End-to-End Administration for CDH
1. Manage: Easily deploy, configure & optimize clusters
2. Monitor: Maintain a central view of all activity
3. Diagnose: Easily identify and resolve issues
4. Integrate: Use Cloudera Manager with existing tools
29
Install A Cluster In 3 Simple Steps
1. Find Nodes: Enter the names of the hosts which will be included in the Hadoop cluster. Click Continue.
2. Install Components: Cloudera Manager automatically installs the CDH components on the hosts you specified.
3. Assign Roles: Verify the roles of the nodes within your cluster. Make changes as necessary.
Cloudera Manager Key Features
30
Cloudera Manager Key Features: View Service Health & Performance
31
Cloudera Manager Key Features: Monitor & Diagnose Cluster Workloads
32
Cloudera Manager Key Features: Visualize Health Status With Heatmaps
33
Cloudera Manager Key Features: Rolling Upgrades
34
?
35