Hadoop Fully Distributed Cluster



    Hadoop: Fully-Distributed Cluster Setup

    OS & Tools to use

    OS: Ubuntu

    JVM: Sun JDK

    Hadoop: Apache Hadoop

Note: Create a dedicated user on the Linux machines (master as well as slaves) for the Hadoop configuration and installation, and give it administrative (sudo) privileges. Suppose we have created a user called cluster; log in to the cluster account and start the configuration.

Hadoop: Prerequisites for Hadoop Setup in Ubuntu

1. Installing Java 1.6 (Sun JDK) in Ubuntu

    1. sudo apt-get install python-software-properties

    2. sudo add-apt-repository ppa:ferramroberto/java

    3. sudo apt-get update

    4. sudo apt-get install sun-java6-jdk

    5. sudo update-java-alternatives -s java-6-sun
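After these five steps, the JDK installation can be sanity-checked from the terminal (a quick check, not part of the original steps; the exact version string will vary):

java -version        # should report a 1.6.x Sun JVM
which java           # typically /usr/bin/java, linked via update-java-alternatives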

2. Installing SSH

The Apache Hadoop startup scripts (start-all.sh & stop-all.sh) use SSH to connect to the slave machines and start Hadoop there. So, to install SSH follow the steps below.

Step-1: Install SSH from the Ubuntu repository.

user1@ubuntu-server:~$ sudo apt-get install ssh

3. Hosts File configuration

The original /etc/hosts file will look like:

127.0.0.1 localhost
127.0.1.1 localhost

    # The following lines are desirable for IPv6 capable hosts



::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

We just need to make some changes:

127.0.0.1      localhost
#127.0.1.1     localhost
192.168.2.118  shashwat.blr.pointcross.com shashwat
192.168.2.117  chethan
192.168.2.116  tariq
192.168.2.56   alok
192.168.2.69   sandish
192.168.2.121  moses
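These host entries let the nodes address each other by name. A quick way to confirm that resolution works on this machine (illustrative only; use whichever hostnames you actually added):

ping -c 1 shashwat
ping -c 1 tariq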

4. Configure Passwordless ssh

Why ssh-keygen? Hadoop uses ssh with key-based (not password) authentication for the nodes to communicate with each other. The master's public key should be added to every slave's ~/.ssh/authorized_keys file, so that the master can easily communicate with all the slaves. We also add the master's public key to its own ~/.ssh/authorized_keys file, so that ssh to the local machine works without a password.

Start a terminal and issue the following commands:

1. ssh-keygen -t rsa -P ""
2. cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

    then try

    ssh localhost

Why ssh localhost? To check whether steps 1 & 2 were done correctly. ssh localhost should connect without asking for a password, because ssh uses the public key for authentication and we have already added the public key to the authorized_keys file.

Copy the content of the master's id_rsa.pub into the authorized_keys file (inside the ~/.ssh folder) on each of the slave machines; this is required to enable the master to communicate with the slaves without using any password.
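One way to do this from the master is with ssh-copy-id, or with a plain pipe over ssh if that tool is not available (a sketch, assuming the cluster user and the slave hostnames from the hosts file above; repeat for each slave):

ssh-copy-id cluster@tariq
# or, equivalently:
cat ~/.ssh/id_rsa.pub | ssh cluster@tariq 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys'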

5. Download and configure Hadoop

1. cd (with no argument this changes to the home directory)



2. wget http://www.apache.org/dist/hadoop/common/hadoop-0.20.2/hadoop-0.20.2.tar.gz
3. sudo tar -xzf hadoop-0.20.2.tar.gz
4. After extracting, just give these two commands:

1. chown -R cluster hadoop-0.20.2/
2. chmod -R 755 hadoop-0.20.2

3. Set JAVA_HOME in /hadoop/conf/hadoop-env.sh

Open hadoop-env.sh and put this line in it:

export JAVA_HOME=/usr/lib/jvm/java-6-sun
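For example, the line can be appended from the shell (a sketch; the hadoop-0.20.2 path comes from the extraction step above, so adjust it if your hadoop folder lives elsewhere):

echo 'export JAVA_HOME=/usr/lib/jvm/java-6-sun' >> ~/hadoop-0.20.2/conf/hadoop-env.sh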

6. Configure Hadoop in Fully Distributed (or Cluster) Mode

1. Edit the config file hadoop/conf/masters as shown below.

    localhost

2. Edit the /hadoop/conf/slaves as follows:

shashwat
chethan
tariq
alok

3. Edit the core-site.xml file and put the following lines inside the configuration tag of /hadoop/conf/core-site.xml as follows:

<property>
  <name>hadoop.tmp.dir</name>
  <value>tmp</value>
  <description>A base for other temporary directories.</description>
</property>



<property>
  <name>fs.default.name</name>
  <value>hdfs://shashwat:9000</value>
  <description>The name of the default file system. A URI whose scheme and
  authority determine the FileSystem implementation. The uri's scheme
  determines the config property (fs.SCHEME.impl) naming the FileSystem
  implementation class. The uri's authority is used to determine the host,
  port, etc. for a filesystem.</description>
</property>

4. Edit the mapred-site.xml file and put the following lines inside the configuration tag of /hadoop/conf/mapred-site.xml as follows:

<property>
  <name>mapred.job.tracker</name>
  <value>shashwat:54311</value>
  <description>The host and port that the MapReduce job tracker runs at.
  If "local", then jobs are run in-process as a single map and reduce
  task.</description>
</property>

5. Edit the hdfs-site.xml file and put the following lines inside the configuration tag of /hadoop/conf/hdfs-site.xml as follows:

<property>
  <name>dfs.replication</name>
  <value>2</value>
  <description>Default block replication (set according to the number of nodes).</description>
</property>

Then copy the same hadoop folder to the slaves, at the same path as on the master:

Suppose the hadoop folder is in /home/cluster/hadoop; it should exist on the slaves too, so you can use the following commands to copy it from the master to the slaves:

First ssh to all the slaves from the master to confirm that the passwordless login works, e.g. as shown below:

ssh alok
ssh tariq
ssh chethan
ssh moses

scp -r /home/cluster/hadoop cluster@tariq:/home/cluster
scp -r /home/cluster/hadoop cluster@alok:/home/cluster
scp -r /home/cluster/hadoop cluster@chethan:/home/cluster
scp -r /home/cluster/hadoop cluster@moses:/home/cluster
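With more slaves the same copy is easier to script as a loop (a small sketch, assuming the cluster user and the hostnames used above):

for host in tariq alok chethan moses; do
    scp -r /home/cluster/hadoop cluster@${host}:/home/cluster
done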


6. After all the above steps, issue the following command:

    bin/hadoop namenode -format

and then bin/start-all.sh

Check whether all the daemons (namenode, datanode, tasktracker, jobtracker, secondarynamenode) are running; if so, the configuration is complete for the master.

Issue the jps command to see the running Java processes:

The master should list NameNode, JobTracker and SecondaryNameNode.
All slaves should list DataNode and TaskTracker.
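For illustration, jps output would look roughly like this (the process IDs will of course differ):

On the master:
12701 NameNode
12822 SecondaryNameNode
12918 JobTracker
13067 Jps

On a slave:
3201 DataNode
3312 TaskTracker
3390 Jps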

Screenshots from a running hadoop machine:

core-site.xml

    hadoop-env.sh


    hdfs-site.xml

    mapred-site.xml


    masters


    slaves

1. Where to find the logs? In /hadoop/logs (see the example after this list).

2. How to check whether hadoop is running or not? Use the jps command, or go to http://localhost:50070 for more information on HDFS and to http://localhost:50030 for more information on MapReduce.
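Two command-line checks can complement the web interfaces (sketches, assuming the hadoop-0.20.2 folder from section 5; the log file name embeds the user, daemon and hostname, so yours will differ):

cd ~/hadoop-0.20.2
bin/hadoop dfsadmin -report                          # lists the live datanodes and their capacity
tail -f logs/hadoop-cluster-namenode-shashwat.log    # follow the NameNode log on the master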