Introduction to Hadoop Administration (edureka)
Transcript of "Introduction to Hadoop Administration"
View Hadoop Administration course details at www.edureka.co/hadoop-admin
Top 5 Hadoop Admin Tasks
www.edureka.co/hadoop-admin (Slide 2)
Objectives of this Session
At the end of this module, you will be able to:
» Understand cluster planning
» Set up a Hadoop fully distributed cluster
» Add further nodes to a running cluster
» Upgrade an existing Hadoop cluster
» Understand NameNode High Availability
Why Hadoop Administration?
With the rise of Hadoop adoption and usage across various industries, the role of the Hadoop Administrator has become very important and is in high demand.
Hadoop Administrator
Hadoop Administration Responsibilities
» HDFS support & maintenance
» Monitoring the Hadoop cluster
» Providing security
» Integrating different frameworks
» Hadoop infrastructure maintenance
Hadoop Admin Responsibilities
Top 5 Hadoop Admin Tasks
Top 5 Hadoop Admin Tasks
Task-1: Cluster Planning
Task-2: Hadoop Cluster Set Up
Task-3: Hadoop Version Upgrade
Task-4: Adding or Removing Nodes to the Cluster
Task-5: Providing High Availability to the Cluster
Cluster Planning
Task-1
Hadoop Cluster: A Typical Use Case

Active NameNode / Standby NameNode (optional):
RAM: 64 GB, Hard disk: 1 TB, Processor: Xeon with 8 cores, Ethernet: 3 x 10 GB/s, OS: 64-bit CentOS, Power: redundant power supply

Secondary NameNode:
RAM: 32 GB, Hard disk: 1 TB, Processor: Xeon with 4 cores, Ethernet: 3 x 10 GB/s, OS: 64-bit CentOS, Power: redundant power supply

DataNodes (each):
RAM: 16 GB, Hard disk: 6 x 2 TB, Processor: Xeon with 2 cores, Ethernet: 3 x 10 GB/s, OS: 64-bit CentOS
Cluster Growth Based on Storage Capacity
Planning cluster growth based on storage capacity is often a good approach.
» Data grows by approximately 5 TB per week
» HDFS is set up to replicate each block three times
» Thus, 15 TB of extra storage space is required per week
» Assume overheads (temporary and intermediate data) of 30%
» Assuming machines with 5 x 3 TB hard drives, this equates to roughly one new machine required each week
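The arithmetic above can be sketched as a quick calculation. All figures are the slide's planning assumptions, not measurements:

```shell
# Weekly capacity planning using the slide's figures (all assumptions).
weekly_growth_tb=5        # raw data growth per week
replication=3             # HDFS replication factor
overhead=1.30             # 30% overhead for temporary/intermediate data
drives_per_node=5         # assumed slave-node disk layout
drive_size_tb=3

# Storage needed per week = growth x replication x overhead
storage_needed=$(awk -v g="$weekly_growth_tb" -v r="$replication" -v o="$overhead" \
  'BEGIN { printf "%.1f", g * r * o }')
node_capacity=$(( drives_per_node * drive_size_tb ))

echo "Storage needed per week: ${storage_needed} TB"
echo "Raw capacity per node:   ${node_capacity} TB"
# ~19.5 TB needed vs 15 TB per node: roughly one new machine per week
```

This also shows why the 30% overhead matters: without it, one machine per week looks exactly sufficient; with it, the cluster slowly falls behind.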
Slave Nodes: Recommended Configuration
Higher-performance vs. lower-performance components: save the money, buy more nodes!

General 'base' configuration for a slave node (depends on requirements):
» 4 x 1 TB or 2 TB hard drives, in a JBOD (Just a Bunch Of Disks) configuration; do not use RAID!
» 2 x quad-core CPUs
» 24-32 GB RAM
» Gigabit Ethernet

Special configuration: multiples of (1 hard drive + 2 cores + 6-8 GB RAM) generally work well for many types of applications.

"A cluster with more nodes performs better than one with fewer, slightly faster nodes"
Slave Nodes: More Details (RAM)
» Generally, each Map or Reduce task will take 1 GB to 2 GB of RAM
» Slave nodes should not be using virtual memory
» Rule of thumb: total number of tasks = 1.5 x number of processor cores
» Ensure enough RAM is present to run all tasks, plus the DataNode and TaskTracker daemons, plus the operating system
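A minimal sketch of the rule of thumb, assuming an 8-core, 32 GB slave node and the worst-case 2 GB per task; the 4 GB allowance for the daemons and operating system is an illustrative assumption:

```shell
# Rule-of-thumb slot/RAM check for a slave node (illustrative numbers).
cores=8
ram_gb=32
tasks=$(( cores * 3 / 2 ))      # total tasks = 1.5 x number of cores
task_ram_gb=$(( tasks * 2 ))    # worst case: 2 GB per Map/Reduce task
daemon_ram_gb=4                 # assumed allowance: DataNode + TaskTracker + OS
needed_gb=$(( task_ram_gb + daemon_ram_gb ))

echo "Tasks: $tasks, RAM needed: ${needed_gb} GB of ${ram_gb} GB available"
```

If the needed figure exceeds the installed RAM, either reduce the configured task slots or buy nodes with more memory; relying on swap is exactly what the slide warns against.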
Master Node Hardware Recommendations
A master node requires:
» Carrier-class hardware (not commodity hardware)
» Dual power supplies
» Dual Ethernet cards (bonded to provide failover)
» RAID hard drives
» At least 32 GB of RAM
Hadoop Cluster Set up
Task-2
Hadoop Cluster Modes
Hadoop can run in any of the following three modes:
» Standalone (or Local) Mode: no daemons; everything runs in a single JVM. Suitable for running MapReduce programs during development. Has no DFS.
» Pseudo-Distributed Mode: all Hadoop daemons run on the local machine.
» Fully-Distributed Mode: Hadoop daemons run on a cluster of machines.
Hadoop 2.x Configuration Files – Apache Hadoop
» Core: core-site.xml
» HDFS: hdfs-site.xml
» YARN: yarn-site.xml
» MapReduce: mapred-site.xml
Configuration Files

Configuration Filename(s): Description
hadoop-env.sh, yarn-env.sh: Settings for the Hadoop daemons' process environment.
core-site.xml: Configuration settings for Hadoop Core, such as I/O settings common to both HDFS and YARN.
hdfs-site.xml: Configuration settings for the HDFS daemons: the NameNode and the DataNodes.
yarn-site.xml: Configuration settings for the ResourceManager and NodeManager.
mapred-site.xml: Configuration settings for MapReduce applications.
slaves: A list of machines (one per line) that each run a DataNode and NodeManager.
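As a sketch, the two most central files might look like the following minimal fragments. The hostname master, port 9000, and the demo directory are assumptions for illustration, not values from the slides:

```shell
# Write minimal demo copies of core-site.xml and hdfs-site.xml.
# Hostname "master", port 9000, and the target directory are assumptions.
conf_dir="${TMPDIR:-/tmp}/hadoop-conf-demo"
mkdir -p "$conf_dir"

cat > "$conf_dir/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>          <!-- URI of the default file system -->
    <value>hdfs://master:9000</value>
  </property>
</configuration>
EOF

cat > "$conf_dir/hdfs-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>       <!-- HDFS block replication factor -->
    <value>3</value>
  </property>
</configuration>
EOF

echo "Wrote demo config files to $conf_dir"
```

On a real cluster these files live in the Hadoop configuration directory (e.g. under etc/hadoop) and must be identical on every node.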
Hadoop Daemons
» NameNode daemon: runs on the master node of the Hadoop Distributed File System (HDFS); directs DataNodes to perform their low-level I/O tasks
» DataNode daemon: runs on each slave machine in the HDFS; does the low-level I/O work
» ResourceManager: runs on the master node of the data processing system (MapReduce); global resource scheduler
» NodeManager: runs on each slave node of the data processing system; platform for the data processing tasks
» JobHistoryServer: responsible for servicing all job-history-related requests from clients
Hadoop 1.x and Hadoop 2.x Ecosystem

Hadoop 1.x stack (bottom to top): HDFS (Hadoop Distributed File System); MapReduce Framework (handling both resource management and processing); Pig Latin (data analysis), Hive (DW system), and HBase on top; Apache Oozie (workflow) coordinates across the stack.

Hadoop 2.x stack (bottom to top): HDFS (Hadoop Distributed File System); YARN (cluster resource management); MapReduce Framework alongside other YARN frameworks (MPI, Giraph); Pig Latin (data analysis), Hive (DW system), and HBase on top; Apache Oozie (workflow) coordinates across the stack. HBase handles structured as well as unstructured/semi-structured data.
Demo On Hadoop Cluster Set Up
Hadoop Version upgrade
Task-3
1) Stop the MapReduce cluster and all client applications running on the DFS cluster
2) Take a backup of the file system namespace
3) Install the new version of the Hadoop software
4) Update all the configuration files in the new Hadoop installation
5) Start the NameNode with the upgrade command
6) Compare the new HDFS file system with the previous version's namespace
7) Finalize the upgrade
Hadoop Version Upgrade
1) Run reports: FSCK, LSR, DFSADMIN
2) Take backups: configuration, applications, data and metadata
3) Install the new version of Hadoop
4) Upgrade: hadoop-daemon.sh start namenode -upgrade
5) Run the new reports (FSCK, LSR, DFSADMIN), compare the old and new reports, and test the new cluster
6) Finalize the upgrade: hadoop dfsadmin -finalizeUpgrade
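The numbered steps can be sketched as one command sequence. This is a dry run: the run wrapper only prints each command rather than executing it, since these Hadoop 1.x-era commands only make sense against a live cluster:

```shell
# Dry-run sketch of the upgrade sequence (Hadoop 1.x-style commands).
run() { echo "+ $*"; }   # print each command instead of executing it

# 1) Pre-upgrade reports to snapshot the file system state
run hadoop fsck / -files -blocks
run hadoop fs -lsr /
run hadoop dfsadmin -report
# 2) Back up configuration, applications, data and metadata (site-specific)
# 3) Install the new Hadoop version (site-specific)
# 4) Start the NameNode in upgrade mode
run hadoop-daemon.sh start namenode -upgrade
# 5) Re-run the reports and compare against the pre-upgrade output
run hadoop dfsadmin -report
# 6) Only once the comparison is clean, make the upgrade permanent
run hadoop dfsadmin -finalizeUpgrade
```

Finalizing is irreversible: until step 6 runs, the old file system state is retained so the upgrade can be rolled back.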
Adding or Removing Nodes from Cluster
Task-4
Commissioning and Decommissioning of DataNode
[Diagram: a master node coordinating DataNodes as they are added to (commissioning) or removed from (decommissioning) the cluster]
Add (Commission) DataNodes
1) Update the network addresses in the 'include' files: dfs.include, mapred.include
2) Update the NameNode: hadoop dfsadmin -refreshNodes
3) Update the JobTracker: hadoop mradmin -refreshNodes
4) Update the 'slaves' file
5) Start the DataNode and TaskTracker on the new node: hadoop-daemon.sh start datanode, hadoop-daemon.sh start tasktracker
6) Cross-check the Web UI to ensure the successful addition
7) Run the Balancer to move HDFS blocks to the new DataNodes
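The commissioning steps above can be sketched as a dry-run command sequence. The hostname newnode01 is a placeholder assumption, and the run wrapper prints each command instead of executing it:

```shell
# Dry-run sketch of commissioning a DataNode (Hadoop 1.x-style commands).
run() { echo "+ $*"; }   # print each command instead of executing it
new_node="newnode01"     # placeholder hostname for the node being added

run echo "$new_node" '>> dfs.include'      # 1) add to the HDFS include file
run echo "$new_node" '>> mapred.include'   #    and the MapReduce include file
run hadoop dfsadmin -refreshNodes          # 2) tell the NameNode to re-read it
run hadoop mradmin -refreshNodes           # 3) tell the JobTracker likewise
run echo "$new_node" '>> slaves'           # 4) add to the slaves file
run hadoop-daemon.sh start datanode        # 5) start daemons on the new node
run hadoop-daemon.sh start tasktracker
run start-balancer.sh                      # 7) rebalance blocks onto the new node
```

Step 6, checking the NameNode/JobTracker web UIs for the new node, has no command-line equivalent and is done in the browser.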
Demo On Commissioning Data Node
Providing High Availability to Cluster
Task-5
Prior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster. Each cluster had a
single NameNode, and if that machine or process became unavailable, the cluster as a whole would be unavailable
until the NameNode was either restarted or brought up on a separate machine.
High availability can be achieved in two different ways:
» HA using the Quorum Journal Manager (QJM) to share edit logs between the Active and Standby NameNodes
» HA using NFS for shared storage instead of the QJM
High Availability (HA)
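A sketch of the core QJM-based HA settings in hdfs-site.xml; the nameservice name mycluster, the NameNode IDs nn1/nn2, and the JournalNode hostnames jn1-jn3 are placeholder assumptions:

```shell
# Write a demo hdfs-site.xml fragment with QJM-based HA settings.
# All hostnames and the nameservice name are placeholder assumptions.
conf_dir="${TMPDIR:-/tmp}/hadoop-ha-demo"
mkdir -p "$conf_dir"

cat > "$conf_dir/hdfs-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.nameservices</name>              <!-- logical cluster name -->
    <value>mycluster</value>
  </property>
  <property>
    <name>dfs.ha.namenodes.mycluster</name>    <!-- logical NameNode IDs -->
    <value>nn1,nn2</value>
  </property>
  <property>
    <name>dfs.namenode.shared.edits.dir</name> <!-- JournalNode quorum -->
    <value>qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster</value>
  </property>
  <property>
    <name>dfs.ha.automatic-failover.enabled</name>
    <value>true</value>
  </property>
</configuration>
EOF

echo "Wrote HA demo config to $conf_dir/hdfs-site.xml"
```

With automatic failover enabled, each NameNode also needs a ZKFailoverController and a ZooKeeper quorum (ha.zookeeper.quorum), matching the architecture on the next slide.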
HA Architecture
[Diagram] The Active NameNode writes edits to a set of JournalNodes (shared edits), and the Standby NameNode reads them. Each NameNode has a Failover Controller (one active, one standby) that monitors its status and health and manages HA state through the ZooKeeper service. The slave nodes send block reports and heartbeats to both NameNodes.
Demo On NameNode High Availability
Hadoop Admin Job Trends
Questions
Twitter @edurekaIN, Facebook /edurekaIN; use #askEdureka for questions