Demystifying Hadoop 2.0 - Part 1


Transcript of Demystifying Hadoop 2.0 - Part 1

  • 8/11/2019 Demystifying Hadoop 2.0 - Part 1

    1/29


    2/29

    Course Topics

    Week 1 Understanding Big Data

    A typical Hadoop Cluster

    Hadoop Cluster Administrator: Roles and Responsibilities

    Week 2 Hadoop 2.0

    Hadoop Configuration files

    Popular Hadoop Distributions

    Week 3 Different Hadoop Server Roles

    Data processing flow

    Cluster Network Configuration

    Week 4 Job Scheduling

    Fair Scheduler

    Monitoring a Hadoop C

    Week 5 Securing your Hadoop

    Kerberos and HDFS Fe

    Backup and Recovery

    Week 6 Oozie and Hive Admin

    HBase Architecture

    HBase Administration

    3/29

    Topics for Today

    Revision

    Hadoop 2.0

    Hadoop Configuration Files

    Plan your Hadoop Cluster: Hardware Considerations

    Plan your Hadoop Cluster: Software Considerations

    Popular Hadoop Distributions

    4/29

    Hadoop Core Components

    Different Cluster Modes

    Let's Revise

    5/29

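    Hadoop 1.0

    [Diagram: Hadoop 1.0 architecture. A Client interacts with two layers, HDFS and MapReduce. HDFS: Name Node, Secondary Name Node, and Data Nodes holding data blocks. MapReduce: Job Tracker, and a Task Tracker on each Data Node running Map and Reduce tasks.]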

    6/29

    Hadoop 1.0 Vs. Hadoop 2.0

    Property             Hadoop 1.x                 Hadoop 2.x
    NameNodes            1                          Many
    High Availability    Not present                Highly Available
    Processing Control   JobTracker, Task Tracker   Resource Manager, Node Manager, App Master

    7/29

    Hadoop 2.0 HDFS Federation


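    [Diagram: HDFS Federation. Left: a single Namenode holding the namespace (NS) and block management, with Datanodes providing block storage. Right: multiple Namenodes (NN-1 ... NN-k), each owning its own namespace (NS1 ... NSk) and its own block pool (Pool 1 ... Pool k). The block pools sit on common storage provided by Datanode 1, Datanode 2, and so on.]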
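
    A minimal hdfs-site.xml sketch of a two-NameNode federation. The nameservice IDs (ns1, ns2) and hostnames (nn1.example.com, nn2.example.com) are hypothetical and do not appear on the slide. Each NameNode serves its own namespace, while every Datanode stores a block pool for each nameservice:

```xml
<!-- hdfs-site.xml (sketch): nameservice IDs and hostnames are illustrative -->
<configuration>
  <!-- One logical nameservice per independent namespace -->
  <property>
    <name>dfs.nameservices</name>
    <value>ns1,ns2</value>
  </property>
  <!-- RPC endpoint of the NameNode that owns namespace ns1 -->
  <property>
    <name>dfs.namenode.rpc-address.ns1</name>
    <value>nn1.example.com:8020</value>
  </property>
  <!-- RPC endpoint of the NameNode that owns namespace ns2 -->
  <property>
    <name>dfs.namenode.rpc-address.ns2</name>
    <value>nn2.example.com:8020</value>
  </property>
</configuration>
```

    The same file would typically be distributed to all Datanodes so that each one registers with every NameNode listed.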

    http://hadoop.apache.org/docs/r2.0.2-alpha/hadoop-yarn/hadoop-yarn-site/Federation.html

    8/29

    Hadoop 2.0 HDFS NameNode High Availability

    [Diagram: NameNode High Availability with shared edit logs]

    Active Name Node: all namespace edits are logged to shared NFS storage; single writer (fencing).

    Standby Name Node: reads the edit logs and applies them to its own namespace.

    Data Nodes: configured with the location of both Name Nodes; they send block location information to both.

    Also shown: Secondary Name Node, data blocks.
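
    A minimal hdfs-site.xml sketch of the NFS-based HA arrangement on this slide. It assumes a nameservice called mycluster, NameNode IDs nn1 and nn2, hosts nn1.example.com and nn2.example.com, and an NFS mount at /mnt/filer1; all of these are hypothetical, since the slide names none of them:

```xml
<!-- hdfs-site.xml (sketch): all IDs, hosts and paths are illustrative -->
<configuration>
  <property>
    <name>dfs.nameservices</name>
    <value>mycluster</value>
  </property>
  <!-- The two NameNodes (active and standby) behind the nameservice -->
  <property>
    <name>dfs.ha.namenodes.mycluster</name>
    <value>nn1,nn2</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn1</name>
    <value>nn1.example.com:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn2</name>
    <value>nn2.example.com:8020</value>
  </property>
  <!-- Shared edit log directory on NFS: written by the active, read by the standby -->
  <property>
    <name>dfs.namenode.shared.edits.dir</name>
    <value>file:///mnt/filer1/dfs/ha-name-dir-shared</value>
  </property>
  <!-- How clients find the currently active NameNode -->
  <property>
    <name>dfs.client.failover.proxy.provider.mycluster</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  </property>
  <!-- Fencing keeps a single writer on the shared edit log;
       sshfence additionally needs dfs.ha.fencing.ssh.private-key-files -->
  <property>
    <name>dfs.ha.fencing.methods</name>
    <value>sshfence</value>
  </property>
</configuration>
```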

    9/29

    Hadoop 2.0 : YARN or MapReduce 2.0 (MRv2)

    http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html

    YARN = Yet Another Resource Negotiator

    [Diagram: YARN architecture. Clients submit jobs to the Resource Manager. Each Node Manager hosts Containers, and some Containers run a per-application App Master. Arrow legend: MapReduce status, job submission, node status, resource request.]

    10/29

    Hadoop 2.0

    [Diagram: Hadoop 2.0 cluster. A Client works with HDFS and YARN.
    HDFS: the Active NameNode logs all namespace edits to shared NFS storage (shared edit logs; single writer, fencing). The Standby NameNode reads the edit logs and applies them to its own namespace. A Secondary Name Node and the Data Nodes are also shown.
    YARN: a Resource Manager, plus a Node Manager on each Data Node hosting Containers and App Masters.]

    11/29

    Poll Questions

    12/29

    Hadoop 2.0 Configuration Files

    Configuration Filenames       Description
    hadoop-env.sh, yarn-env.sh    Settings for the Hadoop daemons' process environment.
    core-site.xml                 Configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and YARN.
    hdfs-site.xml                 Configuration settings for the HDFS daemons: the Name Node and the Data Nodes.
    yarn-site.xml                 Configuration settings for the ResourceManager and NodeManager.
    mapred-site.xml               Configuration settings for MapReduce applications.
    slaves                        A list of machines (one per line) that each run a DataNode and a NodeManager.

    13/29

    Hadoop 2.0 Configuration Files

    14/29

    Deprecated Properties

    Deprecated Property Name    New Property Name
    dfs.data.dir                dfs.datanode.data.dir
    dfs.http.address            dfs.namenode.http-address
    fs.default.name             fs.defaultFS

    The core functionality and usage of these core configuration files are the same in Hadoop 2.0 and 1.0, but many new properties have been added and many have been deprecated. For example:

    fs.default.name has been deprecated and replaced with fs.defaultFS for YARN.
    dfs.nameservices has been added to enable NameNode High Availability in hdfs-site.xml.

    In the Hadoop 2.x.x (CDH4) release, you can use either the old or the new property names.

    The old property names are now deprecated, but still work!

    http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/DeprecatedProperties.html
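
    As a rough illustration of the renames listed on this slide, an hdfs-site.xml written with the new property names might look like the sketch below. The data directory path and the 50070 port are assumptions chosen for the example; only the property names and the test.abc.in host come from the slides:

```xml
<!-- hdfs-site.xml (sketch): new-style names, old names noted in comments -->
<configuration>
  <property>
    <!-- formerly dfs.data.dir -->
    <name>dfs.datanode.data.dir</name>
    <!-- illustrative path, not from the slides -->
    <value>/data/1/dfs/dn</value>
  </property>
  <property>
    <!-- formerly dfs.http.address -->
    <name>dfs.namenode.http-address</name>
    <!-- illustrative port, not from the slides -->
    <value>test.abc.in:50070</value>
  </property>
</configuration>
```

    The deprecated names are still accepted, so an existing Hadoop 1.0 file can be migrated one property at a time.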
    15/29

    Runtime Environment

    Files: hadoop-env.sh, yarn-env.sh

    Offers a way to provide custom parameters for each of the servers.

    Sourced by the Hadoop daemons' start/stop scripts.

    Examples of environment variables that you can specify:

        HADOOP_DATANODE_HEAPSIZE
        YARN_HEAPSIZE

    Set parameter JAVA_HOME.

    16/29

    Configuration Files for Core Components

    Core         core-site.xml
    HDFS         hdfs-site.xml
    MapReduce    mapred-site.xml
    YARN         yarn-site.xml

    17/29

    core-site.xml and hdfs-site.xml

    hdfs-site.xml:  dfs.replication = 1
    core-site.xml:  fs.defaultFS = hdfs://test.abc.in:8020/
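
    Written out in full, the two properties on this slide would look roughly as follows. The property names and values are taken from the slide; the surrounding layout is the standard Hadoop configuration format, and the comments are added for explanation:

```xml
<!-- hdfs-site.xml: keep a single copy of each block (suits a single-node setup) -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```

```xml
<!-- core-site.xml: default filesystem URI, pointing at the NameNode -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://test.abc.in:8020/</value>
  </property>
</configuration>
```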

    18/29

    mapred-site.xml

    mapreduce.jobhistory.address = test.abc.in:10020
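
    A minimal mapred-site.xml sketch built around the property on this slide. The mapreduce.framework.name entry is not on the slide; it is included here as an assumption because MapReduce jobs are normally pointed at YARN this way in Hadoop 2:

```xml
<!-- mapred-site.xml (sketch) -->
<configuration>
  <!-- From the slide: host and port of the MapReduce JobHistory server -->
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>test.abc.in:10020</value>
  </property>
  <!-- Not on the slide (assumption): run MapReduce jobs on YARN -->
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```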


    http://hadoop.apache.org/docs/stable/mapred_tutorial.html

    Noticecurren

    http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml

    19/29

    yarn-site.xml

    yarn.resourcemanager.address = test.abc.in:8021
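
    The same setting in file form, as a minimal yarn-site.xml sketch. The name and value are taken from the slide; the comments are added for explanation:

```xml
<!-- yarn-site.xml (sketch) -->
<configuration>
  <!-- Address clients use to submit applications to the ResourceManager.
       The port follows the slide; the stock default port is 8032. -->
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>test.abc.in:8021</value>
  </property>
</configuration>
```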


    http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-common/yarn-default.xml

    20/29

    Slaves

    Contains a list of slave hosts, one per line, that are to host DataNode and NodeManager servers.

    21/29

    Hadoop Cluster: Facebook

    http://wiki.apache.org/hadoop/PoweredBy

    22/29

    Hadoop Cluster: A Typical Use Case (Hadoop 1.0)

    Role                  RAM     Hard disk   Processor            Ethernet      OS
    Name Node             64 GB   1 TB        Xeon with 8 cores    3 x 10 Gb/s   32-bit CentOS
    Secondary Name Node   32 GB   1 TB        Xeon                 3 x 10 Gb/s   32-bit CentOS
    Data Node             16 GB   6 x 2 TB    Xeon with 2 cores    3 x 10 Gb/s   32-bit CentOS
    Data Node             16 GB   6 x 2 TB    Xeon with 2 cores    3 x 10 Gb/s   32-bit CentOS

    23/29

    Hadoop Cluster: Thinking About The Problem

    Single Machine

        Great for testing, developing.
        Not a practical implementation for large amounts of data.

    Hadoop Cluster: Small Cluster

        Initially four or six nodes.
        As the volume of data grows, more nodes can easily be added.

    Hadoop Cluster: Large Cluster

        Ways of deciding when the cluster needs to grow:
        Increasing amount of computation power needed.
        Increasing amount of data which needs to be stored.
        Increasing amount of memory needed to process tasks.

    24/29

    Plan your Hadoop Cluster: Hardware

    Master Hardware

        Namenode requirements
            RAM to fit metadata
            Modest but dedicated disk

        Secondary Namenode
            Almost identical to Namenode

        Resource Manager
            Retains job data; memory hungry
            Memory requirements can grow independent of cluster size

    Slave Hardware

        Storage
        Computation

    Cluster Sizing

        Usage pattern
        IO-bound or CPU-bound
        Consider requirements of additional components (HBase)

    25/29

    Plan your Hadoop Cluster: Software

    Operating System
        Linux is the only production-quality option today.
        A significant number run on RHEL.

    Java
        JDK: the most critical software
        List of tested JVMs: http://wiki.apache.org/hadoop/HadoopJavaVersions
        Java 1.6.x

    Operating System utilities
        ssh, cron, rsync, ntp
    26/29

    Choose a Distribution and Version of Hadoop

    Popular Hadoop Distributions

    Apache Hadoop
        Complex cluster setup
        Manual install and integration of Hadoop ecosystem components such as Pig, Hive, HBase, etc.
        No commercial support
        Good for a first try

    Cloudera
        Established distribution with many referenced deployments
        Powerful tools for deployment, management and monitoring, such as Cloudera Manager

    27/29

    Popular Hadoop Distributions

    HortonWorks
        Only distribution without any modification in Apache Hadoop
        HCatalog for metadata
        Stinger for Hive

    MapR
        Supports native Unix filesystem
        HA features such as snapshots, mirroring or stateful failover

    Amazon Elastic MapReduce (EMR)
        Hosted solution
        Only Pig and Hive are available as of now

    28/29

    Assignments Status

    Attempt the following Assignments using the documents present in the L

    Install single-node Apache Hadoop 2.0 using a Virtual Machine in VMPlayer or V

    29/29

    Thank You

    See You in Class Next Week