Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data)
Demystifying Hadoop 2.0 - Part 1
lansa-phillip -
Transcript of Demystifying Hadoop 2.0 - Part 1
8/11/2019 Demystifying Hadoop 2.0 - Part 1
Course Topics
Week 1
  Understanding Big Data
  A typical Hadoop Cluster
  Hadoop Cluster Administrator: Roles and Responsibilities
Week 2
  Hadoop 2.0
  Hadoop Configuration files
  Popular Hadoop Distributions
Week 3
  Different Hadoop Server Roles
  Data processing flow
  Cluster Network Configuration
Week 4
  Job Scheduling
  Fair Scheduler
  Monitoring a Hadoop C
Week 5
  Securing your Hadoop
  Kerberos and HDFS Fe
  Backup and Recovery
Week 6
  Oozie and Hive Admin
  HBase Architecture
  HBase Administration
Topics for Today
Revision
Hadoop 2.0
Hadoop Configuration Files
Plan your Hadoop Cluster: Hardware Considerations
Plan your Hadoop Cluster: Software Considerations
Popular Hadoop Distributions
Let's Revise
Hadoop Core Components
Different Cluster Modes
Hadoop 1.0
[Diagram: Hadoop 1.0 architecture. A Client works with HDFS (Name Node, Secondary Name Node, Data Nodes holding data blocks) and MapReduce (Job Tracker, Task Trackers running Map and Reduce tasks).]
Hadoop 1.0 Vs. Hadoop 2.0
Property            Hadoop 1.x                 Hadoop 2.x
NameNodes           1                          Many
High Availability   Not present                Highly Available
Processing Control  JobTracker, Task Tracker   Resource Manager, Node Manager, App Master
Hadoop 2.0 HDFS Federation
http://hadoop.apache.org/docs/r2.0.2-alpha/hadoop-yarn/hadoop-yarn-site/Federation.html
[Diagram: Without federation, a single Namenode provides the namespace (NS) and block management on top of Datanode storage. With federation, Namenodes NN-1 to NN-k each manage their own namespace (NS1 to NSk) and block pool (Pool 1 to Pool k), and all block pools share Datanode 1 and Datanode 2 as common storage.]
8/11/2019 Demystifying Hadoop 2.0 - Part 1
8/29
www
Hadoop 2.0 HDFS NameNode High Availability
[Diagram: an Active Name Node and a Standby Name Node connected through shared edit logs, four Data Nodes holding data blocks, and a Secondary Name Node.]
Active Name Node: all namespace edits are logged to shared NFS storage; single writer (fencing).
Standby Name Node: reads the edit logs and applies them to its own namespace.
Data Nodes are configured with the location of both Name Nodes and send block location information to both.
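The single-writer idea on this slide can be sketched as a toy model. This is only an illustration under simplifying assumptions, not how HDFS is implemented: the class names, the `fence` method and the `mkdir` edit are all hypothetical, and the shared NFS storage is reduced to an in-memory list.

```python
class SharedEditLog:
    """Stand-in for the shared storage that holds namespace edits."""

    def __init__(self):
        self.edits = []
        self.writer = None             # name of the only node allowed to write

    def fence(self, new_writer):
        """Make `new_writer` the single allowed writer, cutting off the old one."""
        self.writer = new_writer

    def append(self, writer, edit):
        if writer != self.writer:
            raise PermissionError(f"{writer} is fenced off")
        self.edits.append(edit)


class NameNode:
    def __init__(self, name, log):
        self.name, self.log = name, log
        self.namespace = set()
        self.applied = 0               # number of shared edits already applied

    def mkdir(self, path):
        """Active role: log the edit to shared storage, then apply it locally."""
        self.log.append(self.name, ("mkdir", path))
        self.catch_up()

    def catch_up(self):
        """Standby role: read new edits and apply them to the local namespace."""
        for op, path in self.log.edits[self.applied:]:
            if op == "mkdir":
                self.namespace.add(path)
        self.applied = len(self.log.edits)


log = SharedEditLog()
nn1, nn2 = NameNode("nn1", log), NameNode("nn2", log)
log.fence("nn1")       # nn1 is the active Name Node
nn1.mkdir("/data")
nn2.catch_up()         # the standby replays the shared edits
log.fence("nn2")       # failover: nn2 becomes the single writer
nn2.mkdir("/logs")
```

After the failover, any further write attempted by `nn1` raises `PermissionError`, which is the point of fencing: the old active node cannot corrupt the shared log.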
Hadoop 2.0 : YARN or MapReduce 2.0 (MRv2)
http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html
YARN = Yet Another Resource Negotiator
[Diagram: Clients submit jobs to the ResourceManager. Each Node Manager hosts Containers, and some Containers run an App Master. Arrows show MapReduce Status, Job Submission, Node Status and Resource Request.]
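The flow in the diagram (job submission, then a resource request from the App Master, then containers on Node Managers) can be sketched as a toy model. Everything here is hypothetical and greatly simplified: the class and method names are not the YARN API, and a container is just a slot count rather than memory and cores.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """A worker node that hosts containers (toy model, not the YARN API)."""
    name: str
    capacity: int                      # number of containers this node can run
    containers: list = field(default_factory=list)

    def free_slots(self):
        return self.capacity - len(self.containers)


class ResourceManager:
    """Tracks node status and hands out containers on request."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def allocate(self, owner, count=1):
        """Place `count` containers, each on the node with the most free slots."""
        placed = []
        for _ in range(count):
            node = max(self.nodes, key=NodeManager.free_slots)
            if node.free_slots() == 0:
                raise RuntimeError("cluster is full")
            node.containers.append(owner)
            placed.append(node.name)
        return placed

    def submit(self, app_id, tasks):
        """Job submission: start an App Master, which then asks for task containers."""
        master_node = self.allocate(f"{app_id}/AppMaster")[0]
        task_nodes = self.allocate(f"{app_id}/task", tasks)   # the resource request
        return {"app_master": master_node, "tasks": task_nodes}


rm = ResourceManager([NodeManager("nm1", 2), NodeManager("nm2", 2), NodeManager("nm3", 2)])
report = rm.submit("job_0001", tasks=3)
```

The sketch keeps the key split from the slide: the ResourceManager only decides where containers go, while the per-application App Master is the one that asks for them.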
8/11/2019 Demystifying Hadoop 2.0 - Part 1
10/29
www
Hadoop 2.0
[Diagram: Hadoop 2.0 overview. A Client works with HDFS and YARN. HDFS: an Active NameNode logs all namespace edits to shared NFS storage (single writer, fencing); a Standby NameNode reads the shared edit logs and applies them to its own namespace; Data Nodes store the blocks; a Secondary Name Node is also shown. YARN: a Resource Manager and Node Managers, each hosting Containers and App Masters.]
Poll Questions
Hadoop 2.0 Configuration Files
Configuration Filenames      Description
hadoop-env.sh, yarn-env.sh   Settings for the Hadoop daemons' process environment.
core-site.xml                Configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and YARN.
hdfs-site.xml                Configuration settings for the HDFS daemons, the Name Node and the Data Nodes.
yarn-site.xml                Configuration settings for ResourceManager and NodeManager.
mapred-site.xml              Configuration settings for MapReduce applications.
slaves                       A list of machines (one per line) that each run DataNode and NodeManager.
Deprecated Properties
Deprecated Property Name   New Property Name
dfs.data.dir               dfs.datanode.data.dir
dfs.http.address           dfs.namenode.http-address
fs.default.name            fs.defaultFS
The core functionality and usage of these core configuration files are the same in Hadoop 2.0 and 1.0, but many new properties have been added and many have been deprecated. For example:
fs.default.name has been deprecated and replaced with fs.defaultFS for YARN.
dfs.nameservices has been added to enable NameNode High Availability.
In the Hadoop 2.x.x (CDH4) release, you can use either the old or the new property names.
The old property names are now deprecated, but still work!
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/DeprecatedProperties.html
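When migrating an old configuration, the renames above can be applied mechanically. The following is a minimal sketch, assuming the configuration has already been loaded into a plain dict; the `modernize` helper is hypothetical and the mapping covers only the three properties listed on this slide.

```python
# Old-to-new names taken from the Deprecated Properties table.
DEPRECATED = {
    "dfs.data.dir": "dfs.datanode.data.dir",
    "dfs.http.address": "dfs.namenode.http-address",
    "fs.default.name": "fs.defaultFS",
}


def modernize(config):
    """Return a copy of `config` with deprecated keys renamed, plus the renamed keys.

    If both the old and the new name are present, the new name's value wins.
    """
    updated, renamed = {}, []
    for key, value in config.items():
        new_key = DEPRECATED.get(key, key)
        if new_key != key:
            renamed.append(key)
            if new_key in config:      # an explicit new-style setting takes priority
                continue
        updated[new_key] = value
    return updated, renamed


old_style = {"fs.default.name": "hdfs://test.abc.in:8020/", "dfs.replication": "1"}
new_style, renamed = modernize(old_style)
```

Here `new_style` carries `fs.defaultFS` in place of `fs.default.name`, and `renamed` lists which old names were found, which is handy for a migration report.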
8/11/2019 Demystifying Hadoop 2.0 - Part 1
15/29
www
Runtime Environment
Offers a way to provide custom parameters for each of the servers.
Sourced by the Hadoop Daemons start/stop scripts.
Examples of environment variables that you can specify:
  hadoop-env.sh: HADOOP_DATANODE_HEAPSIZE
  yarn-env.sh: YARN_HEAPSIZE
Set parameter JAVA_HOME
Configuration Files for Core Components
Core         core-site.xml
HDFS         hdfs-site.xml
MapReduce    mapred-site.xml
YARN         yarn-site.xml
core-site.xml and hdfs-site.xml
File            Property          Value
hdfs-site.xml   dfs.replication   1
core-site.xml   fs.defaultFS      hdfs://test.abc.in:8020/
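Both files use the same layout: a `configuration` root element containing `property` elements, each with a `name` and a `value`. As a rough sketch, the two settings from this slide can be rendered with Python's standard library; the `build_site_xml` helper is a hypothetical name introduced for illustration.

```python
import xml.etree.ElementTree as ET


def build_site_xml(properties):
    """Render a dict of property names and values as a Hadoop *-site.xml string."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


core_site = build_site_xml({"fs.defaultFS": "hdfs://test.abc.in:8020/"})
hdfs_site = build_site_xml({"dfs.replication": 1})
```

Each string would then be written to the matching file in the Hadoop configuration directory.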
mapred-site.xml
Property                       Value
mapreduce.jobhistory.address   test.abc.in:10020
http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
http://hadoop.apache.org/docs/stable/mapred_tutorial.html
8/11/2019 Demystifying Hadoop 2.0 - Part 1
19/29
www.e
yarn-site.xml
Property                       Value
yarn.resourcemanager.address   test.abc.in:8021
http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
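To check what a site file actually sets, it can be read back into a dict. This is a small sketch using only the Python standard library; the XML text below is an assumed rendering of the single property shown on this slide, and `read_properties` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Assumed contents of yarn-site.xml, built from the property on the slide.
YARN_SITE = """<configuration>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>test.abc.in:8021</value>
  </property>
</configuration>"""


def read_properties(xml_text):
    """Parse a *-site.xml document into a {name: value} dict."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value") for p in root.findall("property")}


props = read_properties(YARN_SITE)
host, port = props["yarn.resourcemanager.address"].rsplit(":", 1)
```

The same function works for mapred-site.xml, core-site.xml and hdfs-site.xml, since they share the property layout.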
8/11/2019 Demystifying Hadoop 2.0 - Part 1
20/29
www.e
Slaves
Contains a list of slave hosts, one per line, that are to host DataNode and NodeManager servers.
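Because the file is just one host per line, reading it takes a few lines of code. The sketch below uses hypothetical host names, and skipping lines that start with `#` is an assumption of this example rather than something stated on the slide.

```python
def parse_slaves(text):
    """Return the host names listed in a slaves file, one per line.

    Blank lines and lines starting with '#' are skipped; treating '#' as a
    comment marker is an assumption of this sketch.
    """
    hosts = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            hosts.append(line)
    return hosts


# Hypothetical host names, for illustration only.
SAMPLE = """
worker1.abc.in
worker2.abc.in

# worker3.abc.in is out for maintenance
worker4.abc.in
"""

hosts = parse_slaves(SAMPLE)
```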
Hadoop Cluster: Facebook
http://wiki.apache.org/hadoop/PoweredBy
8/11/2019 Demystifying Hadoop 2.0 - Part 1
22/29
www.e
Hadoop Cluster: A Typical Use Case (Hadoop 1.0)
Role                 RAM     Hard disk   Processor            Ethernet      OS
Name Node            64 GB   1 TB        Xeon with 8 cores    3 x 10 GB/s   32-bit CentOS
Secondary Name Node  32 GB   1 TB        Xeon                 3 x 10 GB/s   32-bit CentOS
Data Node            16 GB   6 x 2 TB    Xeon with 2 cores    3 x 10 GB/s   32-bit CentOS
Data Node            16 GB   6 x 2 TB    Xeon with 2 cores    3 x 10 GB/s   32-bit CentOS
Hadoop Cluster: Thinking About The Problem
Hadoop Cluster
Single Machine
  Great for testing, developing.
  Not a practical implementation for large amounts of data.
Small Cluster
  Initially four or six nodes.
  As the volume of data grows, more nodes can easily be added.
Large Cluster
  Ways of deciding when the cluster needs to grow:
  Increasing amount of computation power needed.
  Increasing amount of data which needs to be stored.
  Increasing amount of memory needed to process tasks.
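For the storage criterion, a back-of-the-envelope estimate shows how data growth translates into node count. This is a rough sketch, not a sizing rule from the course: the 12 TB per node comes from the 6 x 2 TB Data Node in the use-case slide, replication 3 is the HDFS default, and the 0.75 usable fraction and the 20 TB and 100 TB data volumes are assumed figures.

```python
import math


def data_nodes_needed(data_tb, replication=3, disk_per_node_tb=12.0, usable_fraction=0.75):
    """Estimate how many Data Nodes are needed to hold `data_tb` of user data.

    replication      -- HDFS replication factor (3 is the HDFS default)
    disk_per_node_tb -- raw disk per Data Node (6 x 2 TB, as in the use-case slide)
    usable_fraction  -- share of raw disk left for HDFS after OS, logs and
                        temporary data; 0.75 is an assumed figure
    """
    raw_needed_tb = data_tb * replication
    usable_per_node_tb = disk_per_node_tb * usable_fraction
    return math.ceil(raw_needed_tb / usable_per_node_tb)


nodes_now = data_nodes_needed(20)      # 20 TB of data today
nodes_later = data_nodes_needed(100)   # after growth to 100 TB
```

Under these assumptions, 20 TB of data needs 7 Data Nodes and 100 TB needs 34. Computation and memory needs would have to be estimated separately.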
Plan your Hadoop Cluster: Hardware
Master Hardware
  Namenode requirements
    RAM to fit metadata
    Modest but dedicated disk
  Secondary Namenode
    Almost identical to Namenode
  Resource Manager
    Retain Job Data, Memory Hungry
    Memory requirements can grow independent of cluster size
Slave Hardware
  Storage
  Computation
Cluster Sizing
  Usage Pattern and
    IO-bound or CPU-bound
  Consider requirements of additional components
    HBase
Plan your Hadoop Cluster: Software
Operating System
  Linux is the only production quality option today.
  A significant number run on RHEL.
Java
  JDK - the most critical software
  List of tested JVMs: http://wiki.apache.org/hadoop/HadoopJavaVersions
  Java 1.6.x
Operating System utilities
  ssh
  cron
  rsync
  ntp
8/11/2019 Demystifying Hadoop 2.0 - Part 1
26/29
www.e
Popular Hadoop Distributions
Choose a Distribution and Version of Hadoop
Apache Hadoop
  Complex cluster setup
  Manual install and integration of Hadoop ecosystem components such as Pig, Hive, HBase etc.
  No commercial support
  Good for first try
Cloudera
  Established distribution with many referenced deployments
  Powerful tools for deployment, management and monitoring such as Cloudera Manager
Popular Hadoop Distributions
HortonWorks
  Only distribution without any modification in Apache Hadoop
  HCatalog for metadata
  Stinger for Hive
MapR
  Supports native Unix filesystem
  HA features such as snapshots, mirroring or stateful failover
Amazon Elastic MapReduce (EMR)
  Hosted solution
  Only Pig and Hive are available as of now
Assignments Status
Attempt the following Assignments using the documents present in the L
Install single-node Apache Hadoop 2.0 using a Virtual Machine in VMPlayer or V
Thank You
See You in Class Next Week