Hadoop Fully Distributed Cluster



    Hadoop: Fully-Distributed Cluster Setup

    OS & Tools to use

    OS: Ubuntu

    JVM: Sun JDK

    Hadoop: Apache Hadoop

Note: Create a dedicated user on the Linux machines (master as well as slaves) for the Hadoop configuration and installation, and give it administrative (sudo) privileges. Suppose we have created a user called cluster; log in to the cluster account and start the configuration.

Hadoop: Prerequisites for Hadoop Setup in Ubuntu

1. Installing Java 1.6 (Sun JDK) in Ubuntu

    1. sudo apt-get install python-software-properties

    2. sudo add-apt-repository ppa:ferramroberto/java

    3. sudo apt-get update

    4. sudo apt-get install sun-java6-jdk

    5. sudo update-java-alternatives -s java-6-sun
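After these five steps, the JDK installation can be sanity-checked from the terminal (a quick check, not part of the original steps; the exact version string will vary):

java -version        # should report a 1.6.x Sun JVM
which java           # typically /usr/bin/java, linked via update-java-alternatives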

2. Installing SSH

The Apache Hadoop startup scripts (start-all.sh & stop-all.sh) use SSH to connect to the slave machines and start Hadoop there. So, to install SSH follow the steps below.

Step-1: Install SSH from the Ubuntu repository.

user1@ubuntu-server:~$ sudo apt-get install ssh

3. Hosts File configuration

The original /etc/hosts file will look like:

127.0.0.1 localhost
127.0.1.1 localhost

    # The following lines are desirable for IPv6 capable hosts



::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

We just need to make some changes:

127.0.0.1      localhost
#127.0.1.1     localhost
192.168.2.118  shashwat.blr.pointcross.com shashwat
192.168.2.117  chethan
192.168.2.116  tariq
192.168.2.56   alok
192.168.2.69   sandish
192.168.2.121  moses
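These host entries let the nodes address each other by name. A quick way to confirm that resolution works on this machine (illustrative only; use whichever hostnames you actually added):

ping -c 1 shashwat
ping -c 1 tariq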

4. Configure Passwordless ssh

Why ssh-keygen? Hadoop uses ssh with key-based (not password) authentication for the nodes to communicate with each other. The master's public key should be added to every slave's ~/.ssh/authorized_keys file, so that the master can easily communicate with all the slaves. We also add the master's public key to its own ~/.ssh/authorized_keys file, so that ssh to the local machine works without a password.

Start a terminal and issue the following commands:

1. ssh-keygen -t rsa -P ""
2. cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

    then try

    ssh localhost

Why ssh localhost? To check whether steps 1 & 2 were done correctly. ssh localhost should connect without asking for a password, because ssh uses the public key for authentication and we have already added the public key to the authorized_keys file.

Copy the content of the master's id_rsa.pub into the authorized_keys file (inside the ~/.ssh folder) on each of the slave machines; this is required to enable the master to communicate with the slaves without using any password.
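One way to do this from the master is with ssh-copy-id, or with a plain pipe over ssh if that tool is not available (a sketch, assuming the cluster user and the slave hostnames from the hosts file above; repeat for each slave):

ssh-copy-id cluster@tariq
# or, equivalently:
cat ~/.ssh/id_rsa.pub | ssh cluster@tariq 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys'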

5. Download and configure Hadoop

1. cd (with no argument this changes to the home directory)



2. wget http://www.apache.org/dist/hadoop/common/hadoop-0.20.2/hadoop-0.20.2.tar.gz
3. sudo tar -xzf hadoop-0.20.2.tar.gz
4. After extracting, just give these two commands:

1. chown -R cluster hadoop-0.20.2/
2. chmod -R 755 hadoop-0.20.2

3. Set JAVA_HOME in /hadoop/conf/hadoop-env.sh

Open hadoop-env.sh and put this line in it:

export JAVA_HOME=/usr/lib/jvm/java-6-sun
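For example, the line can be appended from the shell (a sketch; the hadoop-0.20.2 path comes from the extraction step above, so adjust it if your hadoop folder lives elsewhere):

echo 'export JAVA_HOME=/usr/lib/jvm/java-6-sun' >> ~/hadoop-0.20.2/conf/hadoop-env.sh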

6. Configure Hadoop in Fully Distributed (or Cluster) Mode

1. Edit the config file hadoop/conf/masters as shown below.

    localhost

2. Edit the /hadoop/conf/slaves as follows:

shashwat
chethan
tariq
alok

3. Edit the core-site.xml file and put the following lines inside the configuration tag of /hadoop/conf/core-site.xml as follows:

<property>
  <name>hadoop.tmp.dir</name>
  <value>tmp</value>
  <description>A base for other temporary directories.</description>
</property>



<property>
  <name>fs.default.name</name>
  <value>hdfs://shashwat:9000</value>
  <description>The name of the default file system. A URI whose scheme and
  authority determine the FileSystem implementation. The uri's scheme
  determines the config property (fs.SCHEME.impl) naming the FileSystem
  implementation class. The uri's authority is used to determine the host,
  port, etc. for a filesystem.</description>
</property>

4. Edit the mapred-site.xml file and put the following lines inside the configuration tag of /hadoop/conf/mapred-site.xml as follows:

<property>
  <name>mapred.job.tracker</name>
  <value>shashwat:54311</value>
  <description>The host and port that the MapReduce job tracker runs at.
  If "local", then jobs are run in-process as a single map and reduce
  task.</description>
</property>

5. Edit the hdfs-site.xml file and put the following lines inside the configuration tag of /hadoop/conf/hdfs-site.xml as follows:

<property>
  <name>dfs.replication</name>
  <value>2</value>
  <description>Default block replication (set according to the number of nodes).</description>
</property>

Then copy the same hadoop folder to the slaves, at the same path as on the master:

Suppose the hadoop folder is in /home/cluster/hadoop; it should exist on the slaves too, so you can use the following commands to copy it from the master to the slaves:

First ssh to all the slaves from the master to confirm that the passwordless login works, e.g. as shown below:

ssh alok
ssh tariq
ssh chethan
ssh moses

scp -r /home/cluster/hadoop cluster@tariq:/home/cluster
scp -r /home/cluster/hadoop cluster@alok:/home/cluster
scp -r /home/cluster/hadoop cluster@chethan:/home/cluster
scp -r /home/cluster/hadoop cluster@moses:/home/cluster
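With more slaves the same copy is easier to script as a loop (a small sketch, assuming the cluster user and the hostnames used above):

for host in tariq alok chethan moses; do
    scp -r /home/cluster/hadoop cluster@${host}:/home/cluster
done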


6. After all the above steps, issue the following command:

    bin/hadoop namenode -format

and then bin/start-all.sh

Check whether all the daemons (namenode, datanode, tasktracker, jobtracker, secondarynamenode) are running; if so, the configuration is complete for the master.

Issue the jps command to see the running Java processes:

The master should list NameNode, JobTracker and SecondaryNameNode.
All slaves should list DataNode and TaskTracker.
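For illustration, jps output would look roughly like this (the process IDs will of course differ):

On the master:
12701 NameNode
12822 SecondaryNameNode
12918 JobTracker
13067 Jps

On a slave:
3201 DataNode
3312 TaskTracker
3390 Jps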

Screenshots from a running hadoop machine:

core-site.xml

    hadoop-env.sh


    hdfs-site.xml

    mapred-site.xml


    masters


    slaves

1. Where to find the logs? In /hadoop/logs (see the example after this list).

2. How to check whether hadoop is running or not? Use the jps command, or go to http://localhost:50070 for more information on HDFS and to http://localhost:50030 for more information on MapReduce.
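Two command-line checks can complement the web interfaces (sketches, assuming the hadoop-0.20.2 folder from section 5; the log file name embeds the user, daemon and hostname, so yours will differ):

cd ~/hadoop-0.20.2
bin/hadoop dfsadmin -report                          # lists the live datanodes and their capacity
tail -f logs/hadoop-cluster-namenode-shashwat.log    # follow the NameNode log on the master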