Setting High Availability in Hadoop Cluster

Transcript of Setting High Availability in Hadoop Cluster

Page 1: Setting High Availability in Hadoop Cluster

www.edureka.co/hadoop-admin

Setting High Availability in Hadoop Cluster

Page 2: Setting High Availability in Hadoop Cluster

What will you learn today?

Hadoop: A synonym for Big Data

Hadoop High Availability

Hands-On: Achieving NameNode and YARN high availability

Hands-On: Securing HDFS through ACL

Hadoop as a Data Warehouse

Page 3: Setting High Availability in Hadoop Cluster

What is Hadoop?

Apache Hadoop is an open-source, scalable, and reliable framework that stores large data sets and allows distributed processing of them across clusters of computers using a simple programming model.

Page 4: Setting High Availability in Hadoop Cluster

A closer look at Apache Hadoop

Apache Hadoop includes the following modules:

Hadoop Distributed File System (HDFS): A distributed file system

Hadoop Common: The common utilities that support the other Hadoop modules

Hadoop YARN: A framework for job scheduling and cluster resource management

Hadoop MapReduce: A YARN-based system for parallel processing of large data sets
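The "simple programming model" behind MapReduce can be illustrated with a toy word count. This is a single-process sketch in plain Python, not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and on a real cluster the map and reduce steps would run in parallel on many nodes.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)

def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big data", "Big cluster"])
```

The programmer only writes the map and reduce functions; grouping by key and distributing the work is the framework's job.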

Page 5: Setting High Availability in Hadoop Cluster

High Availability

Page 6: Setting High Availability in Hadoop Cluster

Maintaining High Availability

In distributed computing, failure is the norm, which means YARN should offer an acceptable level of availability.

Diagram: a client gets block locations from a single NameNode and reads data from the DataNodes.

NameNode (namespace and block management): no horizontal scaling, no high availability

DataNodes: DataNode, DataNode, DataNode, ….

Page 7: Setting High Availability in Hadoop Cluster

NameNode: Single Point of Failure

Secondary NameNode:

"Not a hot standby" for the NameNode

Connects to the NameNode every hour*

Housekeeping and backup of NameNode metadata

Saved metadata can be used to rebuild a failed NameNode

Diagram: the NameNode, a single point of failure, sends its metadata to the Secondary NameNode. Caption: "You give me metadata every hour, I will make it secure."
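The housekeeping done by the Secondary NameNode is, in broad terms, a checkpoint: the last saved namespace image is merged with the edits logged since then. The sketch below shows that idea only. The dictionary layout and the `apply_edit` and `checkpoint` helpers are hypothetical and do not reflect the real on-disk formats.

```python
def apply_edit(namespace, edit):
    """Apply one logged operation to an in-memory namespace (path -> size)."""
    op, path, size = edit
    if op == "create":
        namespace[path] = size
    elif op == "delete":
        namespace.pop(path, None)
    return namespace

def checkpoint(image, edit_log):
    """Merge the last saved image with the edits logged since then."""
    merged = dict(image)  # work on a copy so the saved image stays intact
    for edit in edit_log:
        apply_edit(merged, edit)
    return merged

# Hypothetical sample data: one saved file, then two logged operations.
image = {"/data/a.txt": 128}
edits = [("create", "/data/b.txt", 64), ("delete", "/data/a.txt", 0)]
new_image = checkpoint(image, edits)
```

A failed NameNode rebuilt from `new_image` would be missing any edit made after the last checkpoint, which is one reason the Secondary NameNode is not a hot standby.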

Page 8: Setting High Availability in Hadoop Cluster

Hadoop 2.0 Cluster Architecture: High Availability

Diagram: a Hadoop 2.0 cluster with a client, an HDFS layer, and a YARN layer.

HDFS (NameNode High Availability): Active NameNode, Standby NameNode, Secondary NameNode, and DataNodes

Shared edit logs: all namespace edits are logged to shared NFS storage by a single writer (fencing)

The Standby NameNode reads the edit logs and applies them to its own namespace

YARN (Next Generation MapReduce): a Resource Manager, plus a Node Manager on each worker node, which runs Containers and App Masters alongside a DataNode

http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithNFS.html

HDFS HIGH AVAILABILITY
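As a rough illustration of the shared-edits setup, the sketch below assembles a few of the `hdfs-site.xml` properties used for NameNode HA over NFS and renders them as XML. The property names come from the Apache HDFS HA documentation. The nameservice ID, NameNode IDs, host names, ports, and NFS path are placeholder assumptions, and the helper functions are hypothetical. It is not a complete configuration.

```python
import xml.etree.ElementTree as ET

def ha_properties(nameservice, namenodes, shared_edits_dir):
    """Build HDFS HA properties; namenodes maps a NameNode ID to host:port."""
    props = {
        "dfs.nameservices": nameservice,
        f"dfs.ha.namenodes.{nameservice}": ",".join(namenodes),
        "dfs.namenode.shared.edits.dir": shared_edits_dir,
        "dfs.ha.fencing.methods": "sshfence",
    }
    for nn_id, address in namenodes.items():
        props[f"dfs.namenode.rpc-address.{nameservice}.{nn_id}"] = address
    return props

def to_site_xml(props):
    """Render properties in the <configuration> layout of hdfs-site.xml."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")

# All values below are placeholders, not taken from the presentation.
props = ha_properties(
    "mycluster",
    {"nn1": "machine1.example.com:8020", "nn2": "machine2.example.com:8020"},
    "file:///mnt/filer1/dfs/ha-name-dir-shared",
)
site_xml = to_site_xml(props)
```

A working setup needs further properties, for example the client failover proxy provider and the SSH key used by `sshfence`.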

Page 9: Setting High Availability in Hadoop Cluster

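Diagram: HDFS NameNode high availability with failover controllers.

Active NameNode (NN) and Standby NameNode, each with its own Failover Controller (active and standby) that monitors the NN's health

Each Failover Controller sends heartbeats to a ZooKeeper (ZK) ensemble of three nodes

Shared storage: shared NN state with a single writer (fencing)

DataNodes (DN 1, DN 2, ... DN n): block reports go to both the Active and the Standby NN; update commands (HDFS cmds) come from only one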
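Why a single writer matters can be shown with a toy model: after a failover, a stale NameNode that still believes it is active must not be able to write to the shared state. The sketch below uses an increasing writer number to reject stale writers. That counter is an assumption chosen for illustration; with shared NFS storage, HDFS relies on configured fencing methods such as `sshfence`. The class and method names are hypothetical.

```python
class FencedError(Exception):
    """Raised when a writer that has been fenced off tries to write."""

class SharedEditLog:
    """Toy shared storage that accepts edits from one writer at a time."""

    def __init__(self):
        self.epoch = 0   # number of the most recently granted writer
        self.edits = []

    def become_writer(self):
        """Grant write access to a new writer and fence every older one."""
        self.epoch += 1
        return self.epoch

    def append(self, writer_epoch, edit):
        """Append an edit, but only for the current writer."""
        if writer_epoch != self.epoch:
            raise FencedError(f"writer {writer_epoch} is fenced")
        self.edits.append(edit)

log = SharedEditLog()
old_active = log.become_writer()
log.append(old_active, "mkdir /a")
new_active = log.become_writer()        # failover: the standby takes over
log.append(new_active, "mkdir /b")
try:
    log.append(old_active, "mkdir /c")  # the stale active is rejected
    rejected = False
except FencedError:
    rejected = True
```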

Page 10: Setting High Availability in Hadoop Cluster

Diagram: an active and a passive Resource Manager, each with a ZKFC, both backed by ZooKeeper RM state (ZooKeeperRMState).

1. The active node stores all state in the ZK store

2. Failure

3. ZKFC detects the failure

4. Failover

5. The standby node becomes active
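The Resource Manager failover sequence can be sketched as a small simulation. This is a toy model under stated assumptions: `StateStore` is a plain dictionary standing in for the ZooKeeper-backed store, and the `ResourceManager` class and `failover` function are hypothetical names, not YARN APIs.

```python
class StateStore:
    """Stand-in for the ZooKeeper-backed state store; a plain dict here."""

    def __init__(self):
        self.apps = {}

class ResourceManager:
    def __init__(self, name, store):
        self.name, self.store = name, store
        self.active = False
        self.healthy = True
        self.apps = {}

    def submit(self, app_id, status):
        """The active node persists every state change to the store."""
        if not self.active:
            raise RuntimeError("only the active node accepts work")
        self.apps[app_id] = status
        self.store.apps[app_id] = status

    def become_active(self):
        """Load the persisted state, then start serving."""
        self.apps = dict(self.store.apps)
        self.active = True

def failover(active, standby):
    """On a detected failure, demote the old active and promote the standby."""
    if not active.healthy:
        active.active = False
        standby.become_active()
    return standby if standby.active else active

store = StateStore()
rm1, rm2 = ResourceManager("rm1", store), ResourceManager("rm2", store)
rm1.become_active()
rm1.submit("app_1", "RUNNING")
rm1.healthy = False            # the active node fails
current = failover(rm1, rm2)   # failure is detected, standby takes over
```

Because the state lives in the shared store rather than in the failed process, the new active node can pick up the applications the old one was tracking.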

Page 11: Setting High Availability in Hadoop Cluster

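Diagram: NameNode high availability with Journal Nodes and ZooKeeper.

Shared edits: the Active NameNode writes to three Journal Nodes; the Standby NameNode reads from them

A ZooKeeper Failover Controller (ZookeeperFC) next to each NameNode monitors its liveness and health

ZooKeeper service (three ZooKeeper nodes): the ZookeeperFC of the Active NameNode monitors and maintains the active lock; the ZookeeperFC of the Standby NameNode monitors and tries to take the active lock

DataNodes: DataNode, DataNode, DataNode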
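The active-lock behaviour in this diagram can be sketched as a toy election. `LockService` is a hypothetical in-memory stand-in for the ZooKeeper service, and `FailoverController.tick` is an illustrative monitoring round. In a real cluster the lock is a ZooKeeper znode that disappears when its holder's session ends.

```python
class LockService:
    """Toy stand-in for the ZooKeeper service: one holder of the active lock."""

    def __init__(self):
        self.holder = None

    def try_acquire(self, node):
        """Take the lock if it is free; report whether node holds it."""
        if self.holder is None:
            self.holder = node
        return self.holder == node

    def release(self, node):
        """Give up the lock, but only if node is the current holder."""
        if self.holder == node:
            self.holder = None

class FailoverController:
    def __init__(self, node, locks):
        self.node, self.locks = node, locks
        self.healthy = True  # result of the local liveness and health check

    def tick(self):
        """One monitoring round; returns the role the NameNode should take."""
        if not self.healthy:
            self.locks.release(self.node)  # an unhealthy node gives up the lock
            return "down"
        return "active" if self.locks.try_acquire(self.node) else "standby"

locks = LockService()
fc1, fc2 = FailoverController("nn1", locks), FailoverController("nn2", locks)
before = (fc1.tick(), fc2.tick())
fc1.healthy = False                 # nn1 fails its health check
after = (fc1.tick(), fc2.tick())
```

The standby keeps trying to take the lock on every round, so it becomes active on the first round after the lock is freed.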

Page 12: Setting High Availability in Hadoop Cluster

Hands-On: Achieving HDFS and YARN High Availability
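The transcript does not record the hands-on steps. As a hedged starting point, the sketch below builds the standard commands for checking which NameNode or Resource Manager is currently active. `hdfs haadmin -getServiceState` and `yarn rmadmin -getServiceState` are real Hadoop commands. The IDs `nn1` and `rm1` are placeholders for whatever IDs the cluster configuration defines, and the Python helpers are hypothetical wrappers.

```python
import subprocess

def namenode_state_cmd(namenode_id):
    """Command that asks whether a NameNode is active or standby."""
    return ["hdfs", "haadmin", "-getServiceState", namenode_id]

def resourcemanager_state_cmd(rm_id):
    """Command that asks whether a Resource Manager is active or standby."""
    return ["yarn", "rmadmin", "-getServiceState", rm_id]

def run(cmd):
    """Run a command and return its trimmed output.

    Only meaningful on a machine with a configured Hadoop client.
    """
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.strip()

# Placeholder IDs; the commands are built here but not executed.
commands = [namenode_state_cmd("nn1"), resourcemanager_state_cmd("rm1")]
```

On a cluster node, `run(namenode_state_cmd("nn1"))` would be called before and after stopping the active NameNode to confirm that the standby has taken over.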

Page 13: Setting High Availability in Hadoop Cluster

Hands-On: Securing HDFS through ACL
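The transcript does not record the ACL steps either. The sketch below builds the standard `hdfs dfs -setfacl` and `hdfs dfs -getfacl` commands for a named user and group. The path, the user `alice`, and the group `analysts` are made-up examples, and the Python helpers are hypothetical. ACLs also have to be enabled on the NameNode through the `dfs.namenode.acls.enabled` property.

```python
import re

# A named entry looks like user:alice:r-x or group:analysts:r--
ACL_ENTRY = re.compile(r"^(user|group):([A-Za-z0-9_.-]+):([r-][w-][x-])$")

def acl_entry(kind, name, perms):
    """Build and validate a named ACL entry such as user:alice:r-x."""
    entry = f"{kind}:{name}:{perms}"
    if not ACL_ENTRY.match(entry):
        raise ValueError(f"invalid ACL entry: {entry}")
    return entry

def setfacl_cmd(path, *entries):
    """Command that adds or modifies ACL entries on an HDFS path."""
    return ["hdfs", "dfs", "-setfacl", "-m", ",".join(entries), path]

def getfacl_cmd(path):
    """Command that prints the ACL currently set on an HDFS path."""
    return ["hdfs", "dfs", "-getfacl", path]

# Example values only; the commands are built here but not executed.
grant = setfacl_cmd(
    "/data/sales",
    acl_entry("user", "alice", "r-x"),
    acl_entry("group", "analysts", "r--"),
)
```

Running the `getfacl_cmd("/data/sales")` command afterwards is a quick way to verify that the new entries were applied.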

Page 14: Setting High Availability in Hadoop Cluster

What to do with Big Data?

Page 15: Setting High Availability in Hadoop Cluster

Hadoop: The Perfect Data Warehouse

Diagram: the layers of a Hadoop-based data warehouse.

HDFS Files: free text, images/videos, logs, transactions, sensors

Metadata: HCatalog

Query Engines: Hive SQL, Impala SQL, others …

BI Tools: Tableau, Cognos, QlikView, Pentaho

Page 16: Setting High Availability in Hadoop Cluster

What is a Data Warehouse good at?

Among other things, a data warehouse is the foundation for a successful business intelligence program

The Data Warehousing Institute

www.tdwi.org

Page 17: Setting High Availability in Hadoop Cluster

Thank You …

Questions/Queries/Feedback

Recording and presentation will be made available to you within 24 hours