Introduction to Hadoop Administration (edureka)
Transcript of "Introduction to Hadoop Administration"
View Hadoop Administration course details at www.edureka.co/hadoop-admin
Top 5 Hadoop Admin Tasks
www.edureka.co/hadoop-admin (Slide 2)
Objectives of this Session
At the end of this module, you will be able to:
» Understand cluster planning
» Set up a Hadoop fully distributed cluster
» Add further nodes to a running cluster
» Upgrade an existing Hadoop cluster
» Understand NameNode High Availability
Why Hadoop Administration?
With the rise of Hadoop adoption and usage across various industries, the role of the Hadoop Administrator has become very important and is in high demand.
Hadoop Administrator
Hadoop Administration Responsibilities
» HDFS support & maintenance
» Monitoring the Hadoop cluster
» Providing security
» Integrating different frameworks
» Hadoop infrastructure maintenance
Hadoop Admin Responsibilities
Top 5 Hadoop Admin Tasks
Top 5 Hadoop Admin Tasks
Task-1: Cluster Planning
Task-2: Hadoop Cluster Set Up
Task-3: Hadoop Version Upgrade
Task-4: Adding or Removing Nodes to the Cluster
Task-5: Providing High Availability to the Cluster
Cluster Planning
Task-1
Hadoop Cluster: A Typical Use Case

Active NameNode / Standby NameNode (optional):
RAM: 64 GB, Hard disk: 1 TB, Processor: Xeon with 8 cores, Ethernet: 3 x 10 GB/s, OS: 64-bit CentOS, Power: redundant power supply

Secondary NameNode:
RAM: 32 GB, Hard disk: 1 TB, Processor: Xeon with 4 cores, Ethernet: 3 x 10 GB/s, OS: 64-bit CentOS, Power: redundant power supply

DataNodes (each):
RAM: 16 GB, Hard disk: 6 x 2 TB, Processor: Xeon with 2 cores, Ethernet: 3 x 10 GB/s, OS: 64-bit CentOS
Cluster Growth Based on Storage Capacity
Planning cluster growth based on storage capacity is often a good approach.
» Data grows by approximately 5 TB per week
» HDFS is set up to replicate each block three times
» Thus, 15 TB of extra storage space is required per week
» Assume overheads (temporary and intermediate data) of 30%
» Assuming machines with 5 x 3 TB hard drives, this equates to roughly one new machine required each week
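The arithmetic above can be sketched as a quick calculation. All figures are the slide's planning assumptions, not measurements:

```shell
# Weekly capacity planning using the slide's figures (all assumptions).
weekly_growth_tb=5        # raw data growth per week
replication=3             # HDFS replication factor
overhead=1.30             # 30% overhead for temporary/intermediate data
drives_per_node=5         # assumed slave-node disk layout
drive_size_tb=3

# Storage needed per week = growth x replication x overhead
storage_needed=$(awk -v g="$weekly_growth_tb" -v r="$replication" -v o="$overhead" \
  'BEGIN { printf "%.1f", g * r * o }')
node_capacity=$(( drives_per_node * drive_size_tb ))

echo "Storage needed per week: ${storage_needed} TB"
echo "Raw capacity per node:   ${node_capacity} TB"
# ~19.5 TB needed vs 15 TB per node: roughly one new machine per week
```

This also shows why the 30% overhead matters: without it, one machine per week looks exactly sufficient; with it, the cluster slowly falls behind.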
Slave Nodes: Recommended Configuration
Higher-performance vs. lower-performance components: save the money, buy more nodes!

General 'base' configuration for a slave node (depends on requirements):
» 4 x 1 TB or 2 TB hard drives, in a JBOD (Just a Bunch Of Disks) configuration; do not use RAID!
» 2 x quad-core CPUs
» 24-32 GB RAM
» Gigabit Ethernet

Special configuration: multiples of (1 hard drive + 2 cores + 6-8 GB RAM) generally work well for many types of applications.

"A cluster with more nodes performs better than one with fewer, slightly faster nodes"
Slave Nodes: More Details (RAM)
» Generally, each Map or Reduce task will take 1 GB to 2 GB of RAM
» Slave nodes should not be using virtual memory
» Rule of thumb: total number of tasks = 1.5 x number of processor cores
» Ensure enough RAM is present to run all tasks, plus the DataNode and TaskTracker daemons, plus the operating system
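A minimal sketch of the rule of thumb, assuming an 8-core, 32 GB slave node and the worst-case 2 GB per task; the 4 GB allowance for the daemons and operating system is an illustrative assumption:

```shell
# Rule-of-thumb slot/RAM check for a slave node (illustrative numbers).
cores=8
ram_gb=32
tasks=$(( cores * 3 / 2 ))      # total tasks = 1.5 x number of cores
task_ram_gb=$(( tasks * 2 ))    # worst case: 2 GB per Map/Reduce task
daemon_ram_gb=4                 # assumed allowance: DataNode + TaskTracker + OS
needed_gb=$(( task_ram_gb + daemon_ram_gb ))

echo "Tasks: $tasks, RAM needed: ${needed_gb} GB of ${ram_gb} GB available"
```

If the needed figure exceeds the installed RAM, either reduce the configured task slots or buy nodes with more memory; relying on swap is exactly what the slide warns against.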
Master Node Hardware Recommendations
A master node requires:
» Carrier-class hardware (not commodity hardware)
» Dual power supplies
» Dual Ethernet cards (bonded to provide failover)
» RAID hard drives
» At least 32 GB of RAM
Hadoop Cluster Set up
Task-2
Hadoop Cluster Modes
Hadoop can run in any of the following three modes:
» Standalone (or Local) Mode: no daemons; everything runs in a single JVM. Suitable for running MapReduce programs during development. Has no DFS.
» Pseudo-Distributed Mode: all Hadoop daemons run on the local machine.
» Fully-Distributed Mode: Hadoop daemons run on a cluster of machines.
Hadoop 2.x Configuration Files – Apache Hadoop
» Core: core-site.xml
» HDFS: hdfs-site.xml
» YARN: yarn-site.xml
» MapReduce: mapred-site.xml
Configuration Files

Configuration Filename(s): Description
hadoop-env.sh, yarn-env.sh: Settings for the Hadoop daemons' process environment.
core-site.xml: Configuration settings for Hadoop Core, such as I/O settings common to both HDFS and YARN.
hdfs-site.xml: Configuration settings for the HDFS daemons: the NameNode and the DataNodes.
yarn-site.xml: Configuration settings for the ResourceManager and NodeManager.
mapred-site.xml: Configuration settings for MapReduce applications.
slaves: A list of machines (one per line) that each run a DataNode and NodeManager.
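As a sketch, the two most central files might look like the following minimal fragments. The hostname master, port 9000, and the demo directory are assumptions for illustration, not values from the slides:

```shell
# Write minimal demo copies of core-site.xml and hdfs-site.xml.
# Hostname "master", port 9000, and the target directory are assumptions.
conf_dir="${TMPDIR:-/tmp}/hadoop-conf-demo"
mkdir -p "$conf_dir"

cat > "$conf_dir/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>          <!-- URI of the default file system -->
    <value>hdfs://master:9000</value>
  </property>
</configuration>
EOF

cat > "$conf_dir/hdfs-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>       <!-- HDFS block replication factor -->
    <value>3</value>
  </property>
</configuration>
EOF

echo "Wrote demo config files to $conf_dir"
```

On a real cluster these files live in the Hadoop configuration directory (e.g. under etc/hadoop) and must be identical on every node.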
Hadoop Daemons
» NameNode daemon: runs on the master node of the Hadoop Distributed File System (HDFS); directs DataNodes to perform their low-level I/O tasks
» DataNode daemon: runs on each slave machine in the HDFS; does the low-level I/O work
» ResourceManager: runs on the master node of the data processing system (MapReduce); global resource scheduler
» NodeManager: runs on each slave node of the data processing system; platform for the data processing tasks
» JobHistoryServer: responsible for servicing all job-history-related requests from clients
Hadoop 1.x and Hadoop 2.x Ecosystem

Hadoop 1.x stack (bottom to top): HDFS (Hadoop Distributed File System); MapReduce Framework (handling both resource management and processing); Pig Latin (data analysis), Hive (DW system), and HBase on top; Apache Oozie (workflow) coordinates across the stack.

Hadoop 2.x stack (bottom to top): HDFS (Hadoop Distributed File System); YARN (cluster resource management); MapReduce Framework alongside other YARN frameworks (MPI, Giraph); Pig Latin (data analysis), Hive (DW system), and HBase on top; Apache Oozie (workflow) coordinates across the stack. HBase handles structured as well as unstructured/semi-structured data.
Demo On Hadoop Cluster Set Up
Hadoop Version upgrade
Task-3
1) Stop the MapReduce cluster and all client applications running on the DFS cluster
2) Take a backup of the file system namespace
3) Install the new version of the Hadoop software
4) Update all the configuration files in the new Hadoop installation
5) Start the NameNode with the upgrade command
6) Compare the new HDFS file system with the previous version's namespace
7) Finalize the upgrade
Hadoop Version Upgrade
1) Run reports: FSCK, LSR, DFSADMIN
2) Take backups: configuration, applications, data and metadata
3) Install the new version of Hadoop
4) Upgrade: hadoop-daemon.sh start namenode -upgrade
5) Run the new reports (FSCK, LSR, DFSADMIN), compare the old and new reports, and test the new cluster
6) Finalize the upgrade: hadoop dfsadmin -finalizeUpgrade
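The numbered steps can be sketched as one command sequence. This is a dry run: the run wrapper only prints each command rather than executing it, since these Hadoop 1.x-era commands only make sense against a live cluster:

```shell
# Dry-run sketch of the upgrade sequence (Hadoop 1.x-style commands).
run() { echo "+ $*"; }   # print each command instead of executing it

# 1) Pre-upgrade reports to snapshot the file system state
run hadoop fsck / -files -blocks
run hadoop fs -lsr /
run hadoop dfsadmin -report
# 2) Back up configuration, applications, data and metadata (site-specific)
# 3) Install the new Hadoop version (site-specific)
# 4) Start the NameNode in upgrade mode
run hadoop-daemon.sh start namenode -upgrade
# 5) Re-run the reports and compare against the pre-upgrade output
run hadoop dfsadmin -report
# 6) Only once the comparison is clean, make the upgrade permanent
run hadoop dfsadmin -finalizeUpgrade
```

Finalizing is irreversible: until step 6 runs, the old file system state is retained so the upgrade can be rolled back.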
Adding or Removing Nodes from Cluster
Task-4
Commissioning and Decommissioning of DataNode
[Diagram: a master node coordinating DataNodes as they are added to (commissioning) or removed from (decommissioning) the cluster]
Add (Commission) DataNodes
1) Update the network addresses in the 'include' files: dfs.include, mapred.include
2) Update the NameNode: hadoop dfsadmin -refreshNodes
3) Update the JobTracker: hadoop mradmin -refreshNodes
4) Update the 'slaves' file
5) Start the DataNode and TaskTracker on the new node: hadoop-daemon.sh start datanode, hadoop-daemon.sh start tasktracker
6) Cross-check the Web UI to ensure the successful addition
7) Run the Balancer to move HDFS blocks to the new DataNodes
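The commissioning steps above can be sketched as a dry-run command sequence. The hostname newnode01 is a placeholder assumption, and the run wrapper prints each command instead of executing it:

```shell
# Dry-run sketch of commissioning a DataNode (Hadoop 1.x-style commands).
run() { echo "+ $*"; }   # print each command instead of executing it
new_node="newnode01"     # placeholder hostname for the node being added

run echo "$new_node" '>> dfs.include'      # 1) add to the HDFS include file
run echo "$new_node" '>> mapred.include'   #    and the MapReduce include file
run hadoop dfsadmin -refreshNodes          # 2) tell the NameNode to re-read it
run hadoop mradmin -refreshNodes           # 3) tell the JobTracker likewise
run echo "$new_node" '>> slaves'           # 4) add to the slaves file
run hadoop-daemon.sh start datanode        # 5) start daemons on the new node
run hadoop-daemon.sh start tasktracker
run start-balancer.sh                      # 7) rebalance blocks onto the new node
```

Step 6, checking the NameNode/JobTracker web UIs for the new node, has no command-line equivalent and is done in the browser.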
Demo On Commissioning Data Node
Providing High Availability to Cluster
Task-5
Prior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster. Each cluster had a
single NameNode, and if that machine or process became unavailable, the cluster as a whole would be unavailable
until the NameNode was either restarted or brought up on a separate machine.
High availability can be achieved in two different ways:
» HA using the Quorum Journal Manager (QJM) to share edit logs between the Active and Standby NameNodes
» HA using NFS for shared storage instead of the QJM
High Availability (HA)
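A sketch of the core QJM-based HA settings in hdfs-site.xml; the nameservice name mycluster, the NameNode IDs nn1/nn2, and the JournalNode hostnames jn1-jn3 are placeholder assumptions:

```shell
# Write a demo hdfs-site.xml fragment with QJM-based HA settings.
# All hostnames and the nameservice name are placeholder assumptions.
conf_dir="${TMPDIR:-/tmp}/hadoop-ha-demo"
mkdir -p "$conf_dir"

cat > "$conf_dir/hdfs-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.nameservices</name>              <!-- logical cluster name -->
    <value>mycluster</value>
  </property>
  <property>
    <name>dfs.ha.namenodes.mycluster</name>    <!-- logical NameNode IDs -->
    <value>nn1,nn2</value>
  </property>
  <property>
    <name>dfs.namenode.shared.edits.dir</name> <!-- JournalNode quorum -->
    <value>qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster</value>
  </property>
  <property>
    <name>dfs.ha.automatic-failover.enabled</name>
    <value>true</value>
  </property>
</configuration>
EOF

echo "Wrote HA demo config to $conf_dir/hdfs-site.xml"
```

With automatic failover enabled, each NameNode also needs a ZKFailoverController and a ZooKeeper quorum (ha.zookeeper.quorum), matching the architecture on the next slide.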
HA Architecture
[Diagram] The Active NameNode writes edits to a set of JournalNodes (shared edits), and the Standby NameNode reads them. Each NameNode has a Failover Controller (one active, one standby) that monitors its status and health and manages HA state through the ZooKeeper service. The slave nodes send block reports and heartbeats to both NameNodes.
Demo On NameNode High Availability
Hadoop Admin Job Trends
Questions
Twitter @edurekaIN, Facebook /edurekaIN; use #askEdureka for questions