
Transcript

    Hadoop Training

    ---- By Sasidhar M

    [email protected]

    Module 1 Introduction and Basics


    Why Hadoop?

    Processing and querying vast amounts of data.

    Efficient.

    Security.

    Automatic distribution of data and work across machines.

    Open source

    Parallel Processing


    History of Hadoop


    Standalone Applications

    [Diagram] An application server connected to a database over a network connection -- pulling data, manipulating it, updating it, and synchronizing.


    Challenges in Standalone Applications

    Network latency: GBs, TBs, or even PBs of data must be moved over the network.

    What is the size of the application?

    What if the data size is huge?


    Changing Mindset

    Should the application be moved to the data, or the data to the application?


    Parallel Processing

    Apart from data storage, performance becomes a major concern.

    Traditional approaches to parallel processing:

    Multithreading

    OpenMP

    MPI (Message Passing Interface)


    Building Scalable Systems

    What is the need for scaling?

    Storage

    Processing

    Vertical scaling: adding extra hardware (RAM, CPU, etc.) to an existing machine

    Horizontal scaling: adding more nodes

    Can you scale your existing system?

    Elastic scalability


    Distributed Framework

    Data Localization: moving the application to where the data resides.

    Data Availability: when data is stored across nodes, it should be available and accessible to all other nodes. Even if nodes fail, data should not be lost.

    Data Consistency: data should be consistent at all times.

    Data Reliability


    Challenges in Distributed Framework

    How to reduce network latency?

    How to make sure that data is not lost?

    How to design a programming model to access the data?


    Chunking an input file
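    As a rough command-line illustration of chunking (not from the slide; the 64 MB size mirrors the default HDFS block size in Hadoop 1.x, and split is a standard coreutils tool):

        # break a large input file into 64 MB pieces named chunk_aa, chunk_ab, ...
        split -b 64m input.txt chunk_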


    Assignments - Prerequisites

    Linux OS

    Sun JDK 6 (>=1.6)

    Hadoop 1.0.3

    Eclipse

    Apache Maven 3.0.4


    Hadoop Training Module-2

    Hadoop Installation


    Next Module Overview

    Installing Hadoop

    Running Hadoop on Pseudo Mode

    Understanding configuration

    Understanding the Hadoop processes: NameNode (NN), DataNode (DN), Secondary NameNode (SNN), JobTracker (JT), TaskTracker (TT)

    Running sample MapReduce programs

    Overview of basic commands and their usage


    Agenda

    Hadoop Installation

    Running Sample MapReduce Program

    HDFS Commands


    Step 1: Installing Java on RHEL, CentOS, or Fedora

    Download the Sun JDK from the Oracle web site (rpm.bin file).

    Execute chmod +x <rpm.bin filename>, then run the .bin file.

    Java will be installed under the /usr/java folder.

    Set the JAVA_HOME environment variable in the .bashrc or .bash_profile file:

    export JAVA_HOME=/usr/java/jdk1.6.0_31
    export PATH=$PATH:$JAVA_HOME/bin

    source .bashrc

    Run java -version; it should show the version of the JDK you installed.


    Installing Java on Ubuntu (contd.)

    tar -xvzf jdk-7u9-linux-x64.tar.gz
    sudo mv jdk1.7.0_04 /usr/lib/jvm/
    sudo update-alternatives --install /usr/bin/javac javac /usr/lib/jvm/jdk1.7.0_04/bin/javac 1
    sudo update-alternatives --install /usr/bin/java java /usr/lib/jvm/jdk1.7.0_04/bin/java 1
    sudo update-alternatives --install /usr/bin/javaws javaws /usr/lib/jvm/jdk1.7.0_04/bin/javaws 1
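    To confirm the newly registered JDK is the one in use, a quick check (standard update-alternatives usage; not on the slide):

        sudo update-alternatives --config java    # choose the jdk1.7.0_04 entry if prompted
        java -version                             # should print the installed JDK version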


    Step 2: Disabling IPv6

    cat /proc/sys/net/ipv6/conf/all/disable_ipv6

    A value of 1 indicates that IPv6 is disabled (0 means it is still enabled).

    If IPv6 is not disabled, open /etc/sysctl.conf and add the following lines:

    net.ipv6.conf.all.disable_ipv6 = 1
    net.ipv6.conf.default.disable_ipv6 = 1

    net.ipv6.conf.lo.disable_ipv6 = 1
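    To apply the change without rebooting and re-check the value (a standard sysctl step, assumed here rather than taken from the slide):

        sudo sysctl -p                                   # reload /etc/sysctl.conf
        cat /proc/sys/net/ipv6/conf/all/disable_ipv6     # should now print 1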


    Step 3: Creating a new user for Hadoop

    Not a mandatory step, but in cluster mode make sure Hadoop is run as the same user on every node.

    useradd hadoop

    passwd hadoop
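    A minimal sketch of this step, including switching to the new account (the su step is an assumption; the SSH and HDFS paths later in this module live under /home/hadoop):

        useradd hadoop     # create the hadoop user
        passwd hadoop      # set its password
        su - hadoop        # continue the remaining steps as this user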


    Step 4: Configuring SSH

    Nodes in the cluster communicate with each other via SSH.

    The NameNode should be able to communicate with the DataNodes in a passwordless manner.

    Run the following command to generate a public/private key pair without a password:

    ssh-keygen -t rsa -P ""

    cat /home/hadoop/.ssh/id_rsa.pub >> /home/hadoop/.ssh/authorized_keys

    Set the permissions of authorized_keys: chmod 755 authorized_keys

    Now when you run ssh localhost, it should not ask for a password.
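    For a real multi-node cluster, the same public key also has to end up in authorized_keys on every DataNode. One common way (a sketch; ssh-copy-id availability and the hostname datanode1 are assumptions):

        ssh-copy-id hadoop@datanode1    # repeat for each DataNode
        ssh hadoop@datanode1            # should log in without a password prompt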


    Possible errors while configuring SSH

    ssh: connect to host localhost port 22: Connection refused

    Check whether sshd is running: ps -ef | grep sshd

    Check if ssh server and client are installed or not. If not install it

    sudo apt-get install openssh-client openssh-server (On Ubuntu)

    yum -y install openssh-server openssh-client (On Centos)

    Now start the service by running the following commands:

    chkconfig sshd on
    service sshd start


    Step 5: Installing Hadoop

    Untar the file by running the following command:

    tar -zxf hadoop-1.0.3.tar.gz

    Create the following environment variables in the .bashrc file:

    export HADOOP_HOME=/home/hadoop/hadoop-1.0.3

    export HADOOP_LIB=$HADOOP_HOME/lib
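    The whole step as one runnable sequence, plus putting the Hadoop binaries on PATH (the PATH line is a common convenience and an assumption, not shown on the slide):

        tar -zxf hadoop-1.0.3.tar.gz -C /home/hadoop
        echo 'export HADOOP_HOME=/home/hadoop/hadoop-1.0.3' >> ~/.bashrc
        echo 'export HADOOP_LIB=$HADOOP_HOME/lib' >> ~/.bashrc
        echo 'export PATH=$PATH:$HADOOP_HOME/bin' >> ~/.bashrc    # assumption: optional convenience
        source ~/.bashrc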


    Step 6: Configuring Hadoop
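    The files edited on the following slides live in the configuration directory of the Hadoop 1.x tarball, assumed here to be $HADOOP_HOME/conf:

        cd $HADOOP_HOME/conf
        ls
        # core-site.xml  hdfs-site.xml  mapred-site.xml  masters  slaves  hadoop-env.sh  ...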


    mapred-site.xml

    <property>
      <name>mapred.job.tracker</name>
      <value>localhost:54311</value>
    </property>


    hdfs-site.xml

    <property>
      <name>dfs.replication</name>
      <value>1</value>
    </property>


    core-site.xml

    <property>
      <name>hadoop.tmp.dir</name>
      <value>/home/hadoop/hadoop-${user.name}</value>
      <description>A base for other temporary directories.</description>
    </property>

    <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:54310</value>
    </property>


    Masters and Slaves file

    In the masters file, specify the IP or hostname of the node that runs the Secondary NameNode (despite its name, this file does not list the NameNode; the NameNode runs on the machine where the start scripts are invoked).

    In the slaves file, specify the IPs or hostnames of the slave nodes (the DataNodes/TaskTrackers).
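    For the pseudo-distributed, single-machine setup used in this module, both files typically contain just one line (a minimal example assuming the defaults):

        # $HADOOP_HOME/conf/masters
        localhost

        # $HADOOP_HOME/conf/slaves
        localhost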


    hadoop-env.sh

    Specify the JAVA_HOME path
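    For example, using the JDK path from Step 1 (edit the JAVA_HOME line in conf/hadoop-env.sh; the exact path depends on where your JDK was installed):

        # in $HADOOP_HOME/conf/hadoop-env.sh
        export JAVA_HOME=/usr/java/jdk1.6.0_31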


    Step 7: Formatting the NameNode and starting the Hadoop cluster

    Formatting is required to build the file system. Where is it created? (Under the hadoop.tmp.dir path configured in core-site.xml.)

    Run the following commands to format the NameNode:

    cd $HADOOP_HOME
    bin/hadoop namenode -format

    The output should show that the NameNode was successfully formatted.

    Run bin/start-all.sh to start the Hadoop cluster.

    Execute ps -ef | grep hadoop. It should show all five of the processes below running:

    NameNode

    DataNode

    Secondary NameNode

    Task Tracker

    Job Tracker
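    As an alternative to grepping ps, the jps tool that ships with the JDK gives a quick check of the running Hadoop daemons (a sketch; assumes the JDK bin directory is on PATH):

        jps
        # Expected output (PIDs will differ):
        # 12001 NameNode
        # 12002 DataNode
        # 12003 SecondaryNameNode
        # 12004 JobTracker
        # 12005 TaskTracker
        # 12006 Jps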


    Job Tracker UI

    http://localhost:50030


    NameNode UI

    http://localhost:50070
