Demystifying Hadoop 2.0 - Part 1

8/11/2019 Demystifying Hadoop 2.0 - Part 1


Course Topics

Week 1 Understanding Big Data

A typical Hadoop Cluster

Hadoop Cluster Administrator: Roles and Responsibilities

Week 2 Hadoop 2.0

Hadoop Configuration files

Popular Hadoop Distributions

Week 3 Different Hadoop Server Roles

Data processing flow

Cluster Network Configuration

Week 4 Job Scheduling

Fair Scheduler

Monitoring a Hadoop C

Week 5 Securing your Hadoop

Kerberos and HDFS Fe

Backup and Recovery

Week 6 Oozie and Hive Admin

HBase Architecture

HBase Administration



Topics for Today

Revision

Hadoop 2.0

Hadoop Configuration Files

Plan your Hadoop Cluster: Hardware Considerations

Plan your Hadoop Cluster: Software Considerations




Let's Revise

Hadoop Core Components

Different Cluster Modes



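Hadoop 1.0

[Diagram: a Client talks to HDFS and MapReduce. HDFS consists of the Name Node, the Secondary Name Node and the Data Nodes holding the data blocks. MapReduce consists of the Job Tracker and the Task Trackers, which run Map and Reduce tasks alongside the Data Nodes.]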



Hadoop 1.0 vs. Hadoop 2.0

Property             Hadoop 1.x                 Hadoop 2.x
NameNodes            1                          Many
High Availability    Not present                Highly available
Processing Control   JobTracker, Task Tracker   Resource Manager, Node Manager, App Master



Hadoop 2.0 HDFS Federation

[Diagram: on the left, a single Namenode providing the namespace (NS) and block management, with Datanodes as block storage. On the right, federation: Namenodes NN-1 ... NN-k each manage their own namespace (NS1 ... NSk) and their own block pool (Pool 1 ... Pool k), while Datanode 1, Datanode 2, ... serve as common storage for all block pools.]

http://hadoop.apache.org/docs/r2.0.2-alpha/hadoop-yarn/hadoop-yarn-site/Federation.html



Hadoop 2.0 HDFS NameNode High Availability

[Diagram: an Active Name Node and a Standby Name Node connected through shared edit logs, with four Data Nodes holding the data blocks. A Secondary Name Node box also appears in the diagram.]

Active Name Node: all namespace edits are logged to shared NFS storage; single writer (fencing).

Standby Name Node: reads the edit logs and applies them to its own namespace.

Data Nodes: configured with the location of both Name Nodes; they send block location information to both.
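The single-writer rule on the shared edit log can be illustrated with a toy model. This is only a sketch under stated assumptions: the classes `SharedEditLog` and `NameNode`, the epoch counter used for fencing, and the `mkdir` operation are all hypothetical names invented for illustration, and none of this is Hadoop code or the Hadoop API.

```python
class SharedEditLog:
    """Toy shared edit log that accepts appends from one writer at a time."""

    def __init__(self):
        self.entries = []       # ordered namespace edits
        self.writer_epoch = 0   # epoch of the only node allowed to write

    def claim_writer(self):
        # Handing out a new epoch fences off whoever held the previous one.
        self.writer_epoch += 1
        return self.writer_epoch

    def append(self, epoch, edit):
        if epoch != self.writer_epoch:
            raise PermissionError("fenced: another NameNode is the writer")
        self.entries.append(edit)


class NameNode:
    def __init__(self, name, log):
        self.name = name
        self.log = log
        self.namespace = set()  # paths known to this node
        self.applied = 0        # number of log entries already applied
        self.epoch = None       # None while the node is in standby

    def catch_up(self):
        # Standby behaviour: read new edits and apply them locally.
        for op, path in self.log.entries[self.applied:]:
            if op == "mkdir":
                self.namespace.add(path)
        self.applied = len(self.log.entries)

    def become_active(self):
        self.catch_up()
        self.epoch = self.log.claim_writer()

    def mkdir(self, path):
        # Active behaviour: log the edit first, then apply it.
        self.log.append(self.epoch, ("mkdir", path))
        self.catch_up()


log = SharedEditLog()
nn1, nn2 = NameNode("nn1", log), NameNode("nn2", log)
nn1.become_active()
nn1.mkdir("/data")
nn2.catch_up()        # the standby now sees /data
nn2.become_active()   # failover: nn1 is fenced from here on
```

After the failover, any further `nn1.mkdir(...)` call raises `PermissionError`, which is the toy equivalent of the old active node being fenced so that it cannot corrupt the shared log.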



Hadoop 2.0: YARN or MapReduce 2.0 (MRv2)

YARN = Yet Another Resource Negotiator

[Diagram: two Clients connect to the Resource Manager. Three Node Managers host Containers, and two of them also host an App Master. The arrows are labeled MapReduce Status, Job Submission, Node Status and Resource Request.]

http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html
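The flow in the diagram (node status, job submission, resource requests) can be sketched as a toy allocator. Everything here is an assumption made for illustration: the `ResourceManager` class, its method names, the first-fit placement rule, the memory figures and the node names are hypothetical and do not reflect the real YARN scheduler or its API.

```python
class ResourceManager:
    """Toy resource manager that places containers by free memory."""

    def __init__(self):
        self.free_mb = {}  # node name -> memory still available, in MB

    def node_status(self, node, free_mb):
        # A Node Manager reports how much memory it has available.
        self.free_mb[node] = free_mb

    def request_container(self, mb):
        # Serve a resource request from the first node with enough room.
        for node in sorted(self.free_mb):
            if self.free_mb[node] >= mb:
                self.free_mb[node] -= mb
                return node
        return None  # nothing fits right now

    def submit_job(self, task_mb, tasks, am_mb=512):
        # The first container of a job hosts its App Master ...
        am_node = self.request_container(am_mb)
        if am_node is None:
            return None
        # ... which then asks for one container per task.
        placements = [self.request_container(task_mb) for _ in range(tasks)]
        return {"app_master": am_node, "tasks": placements}


rm = ResourceManager()
rm.node_status("node1", 2048)
rm.node_status("node2", 1024)
job = rm.submit_job(task_mb=1024, tasks=3)
```

With these numbers the App Master and the first task land on `node1`, the second task on `node2`, and the third request stays unplaced (`None`) because no node has 1024 MB left.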



Hadoop 2.0

[Diagram: combined view of Hadoop 2.0 with a Client at the top. HDFS side: an Active NameNode and a Standby NameNode connected through shared edit logs, with a Secondary Name Node box and Data Nodes. All namespace edits are logged to shared NFS storage with a single writer (fencing); the standby reads the edit logs and applies them to its own namespace. YARN side: a Resource Manager and Node Managers, each hosting a Container and an App Master.]



Poll Questions



Hadoop 2.0 Configuration Files

Configuration Filenames      Description
hadoop-env.sh, yarn-env.sh   Settings for the Hadoop daemons' process environment.
core-site.xml                Configuration settings for Hadoop Core, such as I/O settings that com... and YARN.
hdfs-site.xml                Configuration settings for HDFS daemons, the Name Node and the D...
yarn-site.xml                Configuration settings for ResourceManager and NodeManager.
mapred-site.xml              Configuration settings for MapReduce applications.
slaves                       A list of machines (one per line) that each run DataNode and NodeManager.



Deprecated Properties

Deprecated Property Name   New Property Name
dfs.data.dir               dfs.datanode.data.dir
dfs.http.address           dfs.namenode.http-address
fs.default.name            fs.defaultFS

The core functionality and usage of these core configuration files are the same in Hadoop 2.0 and 1.0, but many new properties have been added and many have been deprecated. For example:

fs.default.name has been deprecated and replaced with fs.defaultFS for YA...

dfs.nameservices has been added to enable NameNode High Availability in h...

In the Hadoop 2.x.x (CDH4) release, you can use either the old or the new property names. The old property names are now deprecated, but still work!

http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/DeprecatedProperties.html
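When migrating an older configuration, the renames above can be applied mechanically. The following is a minimal sketch, not a Hadoop tool: the helper `modernize` is a hypothetical name, the mapping contains only the three pairs listed on this slide, and the rule that the new name wins when both spellings are present is an assumption chosen for the example.

```python
# Only the three renames listed on the slide; the full list is much longer.
DEPRECATED = {
    "dfs.data.dir": "dfs.datanode.data.dir",
    "dfs.http.address": "dfs.namenode.http-address",
    "fs.default.name": "fs.defaultFS",
}


def modernize(props):
    """Return a copy of props with deprecated property names renamed."""
    result = {}
    for key, value in props.items():
        new_key = DEPRECATED.get(key, key)
        # If both spellings are present, keep the value given under the new name.
        if new_key in result and key != new_key:
            continue
        result[new_key] = value
    return result


old_style = {"fs.default.name": "hdfs://test.abc.in:8020/", "dfs.replication": "1"}
new_style = modernize(old_style)
```

Properties that are not in the mapping, such as `dfs.replication`, pass through unchanged.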



Runtime Environment

hadoop-env.sh, yarn-env.sh

Offers a way to provide custom parameters for each of the servers.

Sourced by the Hadoop daemons' start/stop scripts.

Examples of environment variables that you can specify:

HADOOP_DATANODE_HEAPSIZE

YARN_HEAPSIZE

Set parameter JAVA_HOME

JV...



Configuration Files for Core Components

Component    Configuration file
Core         core-site.xml
HDFS         hdfs-site.xml
MapReduce    mapred-site.xml
YARN         yarn-site.xml



core-site.xml and hdfs-site.xml

File            Property          Value
hdfs-site.xml   dfs.replication   1
core-site.xml   fs.defaultFS      hdfs://test.abc.in:8020/
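Hadoop site files store each setting as a `property` element with `name` and `value` children inside a `configuration` root. As a rough sketch, the two settings above could be written and read back with Python's standard library; the helper names `render_site_xml` and `parse_site_xml` are hypothetical and exist only for this example.

```python
import xml.etree.ElementTree as ET


def render_site_xml(props):
    """Build the text of a *-site.xml file from a dict of properties."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


def parse_site_xml(text):
    """Read the properties of a *-site.xml document back into a dict."""
    root = ET.fromstring(text)
    return {p.findtext("name"): p.findtext("value") for p in root.iter("property")}


hdfs_site = render_site_xml({"dfs.replication": 1})
core_site = render_site_xml({"fs.defaultFS": "hdfs://test.abc.in:8020/"})
```

The output is a single unindented line per file, which is enough for Hadoop to parse; a real deployment would normally keep the files hand-formatted and under version control.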



mapred-site.xml

Property                       Value
mapreduce.jobhistory.address   test.abc.in:10020

http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml

http://hadoop.apache.org/docs/stable/mapred_tutorial.html



yarn-site.xml

Property                       Value
yarn.resourcemanager.address   test.abc.in:8021

http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-common/yarn-default.xml



Slaves

Contains a list of slave hosts, one per line, that are to host DataNode and NodeManager servers.
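A script that needs the worker list can read this file line by line. The sketch below is an illustration rather than the logic of Hadoop's own scripts: the function `parse_slaves` and the hostnames are hypothetical, and treating `#` as a comment marker and skipping blank lines are assumptions made for the example.

```python
def parse_slaves(text):
    """Return the hostnames listed in a slaves file, one per line."""
    hosts = []
    for line in text.splitlines():
        # Assumption: anything after '#' is a comment; blank lines are skipped.
        host = line.split("#", 1)[0].strip()
        if host:
            hosts.append(host)
    return hosts


sample = """\
# worker nodes (hypothetical hostnames)
worker1.abc.in
worker2.abc.in

worker3.abc.in  # added later
"""
workers = parse_slaves(sample)
```

Each host in `workers` would then be expected to run both a DataNode and a NodeManager.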


Hadoop Cluster: Facebook

http://wiki.apache.org/hadoop/PoweredBy



Hadoop Cluster: A Typical Use Case (Hadoop 1.0)

Name Node
RAM: 64 GB
Hard disk: 1 TB
Processor: Xeon with 8 cores
Ethernet: 3 X 10 GB/s
OS: 32bit CentOS

Secondary Name Node
RAM: 32 GB
Hard disk: 1 TB
Processor: Xeon wi...
Ethernet: 3 X 10 GB/s
OS: 32bit CentOS

Data Node (two shown, identical)
RAM: 16 GB
Hard disk: 6 X 2 TB
Processor: Xeon with 2 cores
Ethernet: 3 X 10 GB/s
OS: 32bit CentOS
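The Data Node disk layout above (6 X 2 TB) gives a quick way to estimate how many Data Nodes a given data volume needs. This is a back-of-the-envelope sketch: the replication factor of 3 is the HDFS default for dfs.replication, while the 25% reserve for temporary and non-HDFS data, the 100 TB workload and the function names are assumptions invented for the example.

```python
import math


def usable_tb_per_node(disks, tb_per_disk, replication, reserve=0.25):
    """Data (in TB) one Data Node can hold after reserve and replication."""
    raw = disks * tb_per_disk
    return raw * (1 - reserve) / replication


def data_nodes_needed(data_tb, disks=6, tb_per_disk=2, replication=3, reserve=0.25):
    """Smallest number of Data Nodes that can store data_tb of user data."""
    per_node = usable_tb_per_node(disks, tb_per_disk, replication, reserve)
    return math.ceil(data_tb / per_node)


per_node = usable_tb_per_node(6, 2, replication=3)   # 12 TB raw per node
nodes = data_nodes_needed(100)                       # hypothetical 100 TB of data
```

Under these assumptions each node contributes 3 TB of usable capacity, so 100 TB of data needs 34 Data Nodes; changing the replication factor or the reserve shifts the result proportionally.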



Hadoop Cluster: Thinking About The Problem

Single Machine
Great for testing, developing.
Not a practical implementation for large amounts of data.

Small Cluster
Initially four or six nodes.
As the volume of data grows, more nodes can easily be added.

Large Cluster
Ways of deciding when the cluster needs to grow:
Increasing amount of computation power needed.
Increasing amount of data which needs to be stored.
Increasing amount of memory needed to process tasks.



Plan your Hadoop Cluster: Hardware

Master Hardware
  Namenode requirements
    RAM to fit metadata
    Modest but dedicated disk
  Secondary Namenode
    Almost identical to Namenode
  Resource Manager
    Retain job data, memory hungry
    Memory requirements can grow independent of cluster size

Slave Hardware
  Storage
  Computation

Cluster Sizing
  Usage pattern and ...
  IO-bound or C...
  Consider requirem... additional compon... HBase



Plan your Hadoop Cluster: Software

Operating System
  Linux is the only production-quality option today.
  A significant number run on RHEL.

Java
  JDK: the most critical software
  Java 1.6.x
  List of tested JVMs: http://wiki.apache.org/hadoop/HadoopJavaVersions

Operating System utilities
  ssh
  cron
  rsync
  ntp



Popular Hadoop Distributions

Choose a Distribution and Version of Hadoop

Apache Hadoop

Complex cluster setup

Manual install and integration of Hadoop ecosystem components such as Pig, Hive, HBase etc.

No commercial support

Good for a first try

Cloudera

Established distribution with many referenced deployments

Powerful tools for deployment, management and monitoring, such as Cloudera Manager



HortonWorks

Only distribution without any modification to Apache Hadoop

HCatalog for metadata

Stinger for Hive

MapR

Supports native Unix filesystem

HA features such as snapshots, mirroring or stateful failover

Amazon Elastic Map Reduce (EMR)

Hosted Solution

Only Pig and Hive are available as of now




Assignments Status

Attempt the following Assignments using the documents present in the L

Install single-node Apache Hadoop 2.0 using a Virtual Machine in VMPlayer or V



Thank You

See You in Class Next Week
