02 Hadoop deployment and configuration

Hadoop Deployment and Configuration - Single machine and a cluster

Transcript of 02 Hadoop deployment and configuration

Page 1: 02 Hadoop deployment and configuration

Hadoop Deployment and Configuration - Single machine and a cluster

Page 2: 02 Hadoop deployment and configuration

Typical Hardware

• HP Compaq 8100 Elite CMT PC
– Specifications
· Processor: Intel Core i7-860
· RAM: 8GB PC3-10600 Memory (2x4GB)
· HDD: 1TB SATA 3.5
· Network: Intel 82578 GbE (Integrated)

• Network switch
– Netgear GS2608
– Specifications
· N Port
· 10/100/1000 Mbps Gigabit Switch

• Gateway node
– Dell Optiplex GX280
– Specifications
· Processor: Intel Pentium 4 2.80 GHz
· RAM: 1GB

Page 3: 02 Hadoop deployment and configuration

OS

• Install the Ubuntu Server (Maverick Meerkat) operating system that is available for download from the Ubuntu releases site.

• Some important points to remember while installing the OS:
– Ensure that the SSH server is selected to be installed

– Enter the proxy details needed for systems to connect to the internet from within your network

– Create a user on each installation

• Preferably with the same password on each node

Page 4: 02 Hadoop deployment and configuration

Prerequisites

• Supported Platforms

– GNU/Linux is supported as a development and production platform. Hadoop has been demonstrated on GNU/Linux clusters with 2000 nodes.

– Win32 is supported as a development platform. Distributed operation has not been well tested on Win32, so it is not supported as a production platform.

• Required Software
– Required software for Linux and Windows includes:

• Java™ 1.6.x, preferably from Sun, must be installed.

• ssh must be installed and sshd must be running to use the Hadoop scripts that manage remote Hadoop daemons.

• Additional requirements for Windows include:
– Cygwin - Required for shell support in addition to the required software above.

• Installing Software
– If your cluster doesn't have the requisite software you will need to install it.

– For example on Ubuntu Linux:
• $ sudo apt-get install ssh

$ sudo apt-get install rsync

• On Windows, if you did not install the required software when you installed cygwin, start the cygwin installer and select the packages:
– openssh - the Net category

Page 5: 02 Hadoop deployment and configuration

Install Sun’s java JDK

• Install Sun’s java JDK on each node in the cluster

• Add the canonical partner repository to your list of apt repositories.

• You can do this by adding the line below into your /etc/apt/sources.list file:
deb http://archive.canonical.com/ maverick partner

• Update the source list

– sudo apt-get update

• Install sun-java7-jdk

– sudo apt-get install sun-java6-jdk

• Select Sun’s java as the default on the machine

– sudo update-java-alternatives -s java-6-sun

• Verify the installation by running the command; a consolidated sketch of the whole sequence follows below

– java -version
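
The individual commands above can be run as one consolidated sequence on each node (a sketch, assuming the partner repository line is not already present in /etc/apt/sources.list):

# Append the Canonical partner repository to the apt sources
echo "deb http://archive.canonical.com/ maverick partner" | sudo tee -a /etc/apt/sources.list
# Refresh the package index and install the Sun JDK
sudo apt-get update
sudo apt-get install sun-java6-jdk
# Make Sun's Java the default and verify the installation
sudo update-java-alternatives -s java-6-sun
java -version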

Page 6: 02 Hadoop deployment and configuration

Adding a dedicated Hadoop system user

• Use a dedicated Hadoop user account for running Hadoop.

• While that’s not required, it is recommended because it helps to separate the Hadoop installation from other software applications and user accounts running on the same machine (think: security, permissions, backups, etc.).

• This will add the user hduser and the group hadoop to your local machine:
– $ sudo addgroup hadoop

– $ sudo adduser --ingroup hadoop hduser
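
To confirm the account and group were created as expected (a quick check, not part of the original steps), inspect the new user's group membership:

# Should list "hadoop" among hduser's groups
id hduser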

Page 7: 02 Hadoop deployment and configuration

Configuring SSH

• Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your local machine if you want to use Hadoop on it.

• For a single-node setup of Hadoop, we therefore need to configure SSH access to localhost for the hduser user we created in the previous slide.

• Have SSH up and running on your machine and configure it to allow SSH public key authentication (see http://ubuntuguide.org/).

• Generate an SSH key for the hduser user.

user@ubuntu:~$ su - hduser
hduser@ubuntu:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
9b:82:ea:58:b4:e0:35:d7:ff:19:66:a6:ef:ae:0e:d2 hduser@ubuntu
The key's randomart image is:
[...snipp...]
hduser@ubuntu:~$

Page 8: 02 Hadoop deployment and configuration

Configuring SSH

• Second, you have to enable SSH access to your local machine with this newly created key.

– hduser@ubuntu:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

• The final step is to test the SSH setup by connecting to your local machine with the hduser user.

• This step is also needed to save your local machine’s host key fingerprint to the hduser user’s known_hosts file.

• If you have any special SSH configuration for your local machine like a non-standard SSH port, you can define host-specific SSH options in $HOME/.ssh/config (see man ssh_config for more information).

hduser@ubuntu:~$ ssh localhost
The authenticity of host 'localhost (::1)' can't be established.
RSA key fingerprint is d7:87:25:47:ae:02:00:eb:1d:75:4f:bb:44:f9:36:26.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
Linux ubuntu 2.6.32-22-generic #33-Ubuntu SMP Wed Apr 28 13:27:30 UTC 2010 i686 GNU/Linux
Ubuntu 10.04 LTS
[...snipp...]
hduser@ubuntu:~$

Page 9: 02 Hadoop deployment and configuration

Disabling IPv6

• One problem with IPv6 on Ubuntu is that using 0.0.0.0 for the various networking-related Hadoop configuration options will result in Hadoop binding to the IPv6 addresses.

• To disable IPv6 on Ubuntu 10.04 LTS, open /etc/sysctl.conf in the editor of your choice and add the following lines to the end of the file:

• You have to reboot your machine in order to make the changes take effect.

• You can check whether IPv6 is enabled on your machine with the following command:

• You can also disable IPv6 only for Hadoop as documented in HADOOP-3437. You can do so by adding the following line to conf/hadoop-env.sh:

# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6

export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
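
As an optional shortcut (not part of the original steps), the sysctl settings above can also be applied without a reboot and then checked; the check command prints 1 once IPv6 is disabled:

# Re-read /etc/sysctl.conf without rebooting
sudo sysctl -p
# Prints 1 if IPv6 is disabled, 0 if it is still enabled
cat /proc/sys/net/ipv6/conf/all/disable_ipv6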

Page 10: 02 Hadoop deployment and configuration

Hadoop Installation

• You have to download Hadoop from the Apache Download Mirrors and extract the contents of the Hadoop package to a location of your choice.

• Say /usr/local/hadoop.

• Make sure to change the owner of all the files to the hduser user and hadoop group, for example:

• Create a symlink from hadoop-xxxxx to hadoop

$ cd /usr/local
$ sudo tar xzf hadoop-xxxx.tar.gz
$ sudo mv hadoop-xxxxx hadoop
$ sudo chown -R hduser:hadoop hadoop
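
If you prefer the symlink mentioned above to renaming the directory, a sketch of the equivalent sequence (the actual archive and directory names depend on the Hadoop release you downloaded) is:

$ cd /usr/local
$ sudo tar xzf hadoop-xxxx.tar.gz
# Keep the versioned directory and point a stable "hadoop" name at it
$ sudo ln -s hadoop-xxxxx hadoop
$ sudo chown -R hduser:hadoop hadoop-xxxxx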

Page 11: 02 Hadoop deployment and configuration

Update $HOME/.bashrc

• Add the following lines to the end of the $HOME/.bashrc file of user hduser.

• If you use a shell other than bash, you should of course update its appropriate configuration files instead of .bashrc.

# Set Hadoop-related environment variables
export HADOOP_HOME=/usr/local/hadoop

# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/usr/lib/jvm/java-6-sun

# Some convenient aliases and functions for running Hadoop-related commands
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"

Page 12: 02 Hadoop deployment and configuration

Update $HOME/.bashrc

# If you have LZO compression enabled in your Hadoop cluster and
# compress job outputs with LZOP (not covered in this tutorial):
# Conveniently inspect an LZOP compressed file from the command
# line; run via:
#
# $ lzohead /hdfs/path/to/lzop/compressed/file.lzo
#
# Requires installed 'lzop' command.
#
lzohead () {
  hadoop fs -cat $1 | lzop -dc | head -1000 | less
}

# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin
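
After saving the file, the new environment can be picked up in the current shell and sanity-checked (the reported version depends on the Hadoop release you installed):

# Reload the updated shell configuration
source $HOME/.bashrc
# Should print the Hadoop version and the configured HADOOP_HOME
hadoop version
echo $HADOOP_HOME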

Page 13: 02 Hadoop deployment and configuration

Configuration files

• The $HADOOP_INSTALL/hadoop/conf directory contains some configuration files for Hadoop. These are:

• hadoop-env.sh - This file contains some environment variable settings used by Hadoop. You can use these to affect some aspects of Hadoop daemon behavior, such as where log files are stored, the maximum amount of heap used etc. The only variable you should need to change in this file is JAVA_HOME, which specifies the path to the Java installation used by Hadoop.

• slaves - This file lists the hosts, one per line, where the Hadoop slave daemons (datanodes and tasktrackers) will run. By default this contains the single entry localhost

• hdfs-site.xml - This file contains site specific settings for the HDFS daemons (the NameNode, the Secondary NameNode and the DataNodes). The file is empty by default. Putting configuration properties in this file (for example dfs.replication) overrides the HDFS defaults. Use this file to tailor the behavior of HDFS on your site.

• mapred-site.xml - This file contains site specific settings for the Hadoop Map/Reduce daemons and jobs. The file is empty by default. Putting configuration properties in this file will override Map/Reduce settings in the mapred-default.xml file. Use this file to tailor the behavior of Map/Reduce on your site.

• core-site.xml - This file contains site specific settings for all Hadoop daemons and Map/Reduce jobs. This file is empty by default. Settings in this file override those in core-default.xml. This file should contain settings that must be respected by all servers and clients in a Hadoop installation, for instance, the location of the namenode and the jobtracker.

Page 14: 02 Hadoop deployment and configuration

Configuration : Single node

• hadoop-env.sh:
– The only required environment variable we have to configure for Hadoop in this case is JAVA_HOME.

– Open conf/hadoop-env.sh in the editor of your choice

– set the JAVA_HOME environment variable to the Sun JDK/JRE 6 directory

– export JAVA_HOME=/usr/lib/jvm/java-6-sun

• conf/*-site.xml
– We configure the following:

– core-site.xml

• hadoop.tmp.dir

• fs.default.name

– mapred-site.xml

• mapred.job.tracker

– hdfs-site.xml

• dfs.replication

Page 15: 02 Hadoop deployment and configuration

Configure HDFS

• We will configure the directory where Hadoop will store its data files, the network ports it listens to, etc.

• Our setup will use Hadoop’s Distributed File System, HDFS, even though our little “cluster” only contains our single local machine.

• You can leave the settings below ”as is” with the exception of the hadoop.tmp.dir variable which you have to change to the directory of your choice.

• We will use the directory /app/hadoop/tmp

• Hadoop’s default configurations use hadoop.tmp.dir as the base temporary directory both for the local file system and HDFS, so don’t be surprised if you see Hadoop creating the specified directory automatically on HDFS at some later point.

$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hduser:hadoop /app/hadoop/tmp
# ...and if you want to tighten up security, chmod from 755 to 750...
$ sudo chmod 750 /app/hadoop/tmp

Page 16: 02 Hadoop deployment and configuration

conf/core-site.xml

<!-- In: conf/core-site.xml -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>

</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system. A URI whose
  scheme and authority determine the FileSystem implementation. The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class. The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>

</property>

Page 17: 02 Hadoop deployment and configuration

conf/mapred-site.xml

<!-- In: conf/mapred-site.xml -->
<property>

  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at. If "local", then jobs are run in-process as a single map
  and reduce task.</description>

</property>

Page 18: 02 Hadoop deployment and configuration

conf/hdfs-site.xml

<!-- In: conf/hdfs-site.xml -->
<property>

  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.</description>

</property>
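
Note that the property snippets on the last three slides must sit inside the file's <configuration> root element. As a sketch, a minimal conf/hdfs-site.xml therefore looks like this:

<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>Default block replication.</description>
  </property>
</configuration>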

Page 19: 02 Hadoop deployment and configuration

Formatting the HDFS and Starting

• To format the filesystem (which simply initializes the directory specified by the dfs.name.dir variable), run the command

• hadoop namenode -format

• Run start-all.sh: This will start up a NameNode, a DataNode, a JobTracker and a TaskTracker on your machine

• Run stop-all.sh to stop all processes
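
Putting the commands together with a quick check (a sketch; run everything as hduser, and on a healthy single-node setup the jps listing typically shows NameNode, SecondaryNameNode, DataNode, JobTracker and TaskTracker):

# One-time formatting of the HDFS namespace
hadoop namenode -format
# Start the HDFS and MapReduce daemons
start-all.sh
# List the running Java daemons
jps
# Stop everything again when you are done
stop-all.sh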

Page 20: 02 Hadoop deployment and configuration

Download example input data

• Create a directory inside /home/…/gutenberg

• Download:
– The Outline of Science, Vol. 1 (of 4) by J. Arthur Thomson

http://www.gutenberg.org/ebooks/20417.txt.utf-8

– The Notebooks of Leonardo Da Vinci
http://www.gutenberg.org/cache/epub/5000/pg5000.txt

– Ulysses by James Joyce
http://www.gutenberg.org/cache/epub/4300/pg4300.txt

• Copy local example data to HDFS

– hdfs dfs -copyFromLocal gutenberg gutenberg

• Check

– hadoop dfs -ls gutenberg
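
A sketch of the download-and-copy steps as shell commands (the local directory /home/hduser/gutenberg is an assumption standing in for the elided path above):

# Download the three ebooks into a local directory
mkdir -p /home/hduser/gutenberg
cd /home/hduser/gutenberg
wget http://www.gutenberg.org/ebooks/20417.txt.utf-8
wget http://www.gutenberg.org/cache/epub/5000/pg5000.txt
wget http://www.gutenberg.org/cache/epub/4300/pg4300.txt
# Copy the local directory into HDFS and list it
cd /home/hduser
hdfs dfs -copyFromLocal gutenberg gutenberg
hdfs dfs -ls gutenberg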

Page 21: 02 Hadoop deployment and configuration

Run the MapReduce job

• Now, we run the WordCount example job

• hadoop jar /usr/lib/hadoop/hadoop-xxxx-example.jar wordcount gutenberg gutenberg-out

• This command will

– read all the files in the HDFS directory /user/hduser/gutenberg,

– process it, and

– store the result in the HDFS directory /user/hduser/gutenberg-out

• Check if the result is successfully stored in HDFS directory gutenberg-out

– hdfs dfs -ls gutenberg-out

• Retrieve the job result from HDFS

– hdfs dfs -cat gutenberg-out/part-r-00000

• Better:

– hdfs dfs -cat gutenberg-out/part-r-00000 | sort -nk 2,2 -r | less
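
If you would rather copy the whole result out of HDFS than page through it with -cat (an optional extra, assuming /tmp is writable), getmerge concatenates all output parts into a single local file:

# Merge all part files of the job output into one local file
hdfs dfs -getmerge gutenberg-out /tmp/gutenberg-out.txt
head /tmp/gutenberg-out.txt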

Page 22: 02 Hadoop deployment and configuration

Hadoop Web Interfaces

• Hadoop comes with several web interfaces which are by default (see conf/hadoop-default.xml) available at these locations:

• http://localhost:50030/ – web UI for MapReduce job tracker(s)

• http://localhost:50060/ – web UI for task tracker(s)

• http://localhost:50070/ – web UI for HDFS name node(s)

Page 23: 02 Hadoop deployment and configuration

Cluster setup

• Basic idea (diagram): the two single-node clusters built so far (Box 1 and Box 2, each acting as its own master) are combined into one multi-node cluster with a master and a slave, connected via a switch to the LAN behind the gateway node (reachable with Bitvise Tunnelier SSH port forwarding).

Page 24: 02 Hadoop deployment and configuration

Calling by name

• Now that you have two single-node clusters up and running, we will modify the Hadoop configuration to make

• one Ubuntu box the ”master” (which will also act as a slave) and

• the other Ubuntu box a ”slave”.

• We will call the designated master machine just the master from now on and the slave-only machine the slave.

• We will also give the two machines these respective hostnames in their networking setup, most notably in /etc/hosts.

• If the hostnames of your machines are different (e.g. node01) then you must adapt the settings as appropriate.

Page 25: 02 Hadoop deployment and configuration

Networking

• Connect both machines via a single hub or switch and configure the network interfaces to use a common network such as 192.168.0.x/24.

• To make it simple,

• we will assign the IP address 192.168.0.1 to the master machine and

• 192.168.0.2 to the slave machine.

• Update /etc/hosts on both machines with the following lines:

# /etc/hosts (for master AND slave)
192.168.0.1 master
192.168.0.2 slave

Page 26: 02 Hadoop deployment and configuration

SSH access

• The hduser user on the master (aka hduser@master) must be able to connect a) to its own user account on the master – i.e. ssh master in this context and not necessarily ssh localhost – and b) to the hduser user account on the slave (aka hduser@slave) via a password-less SSH login.

• You just have to add the hduser@master’s public SSH key (which should be in $HOME/.ssh/id_rsa.pub) to the authorized_keys file of hduser@slave (in this user’s $HOME/.ssh/authorized_keys).

• ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@slave

• Verify that the password-less access to all slaves from the master works:
ssh hduser@slave
ssh hduser@master

Page 27: 02 Hadoop deployment and configuration

How the final multi-node cluster will look

Page 28: 02 Hadoop deployment and configuration

Naming again

• The master node will run the “master” daemons for each layer:

– NameNode for the HDFS storage layer, and

– JobTracker for the MapReduce processing layer

• Both machines will run the “slave” daemons:

– DataNode for the HDFS layer, and

– TaskTracker for MapReduce processing layer

• The “master” daemons are responsible for coordination and management of the “slave” daemons while the latter will do the actual data storage and data processing work.

• Typically one machine in the cluster is designated as the NameNode and another machine as the JobTracker, exclusively.

• These are the actual “master nodes”.

• The rest of the machines in the cluster act as both DataNode and TaskTracker.

• These are the slaves or “worker nodes”.

Page 29: 02 Hadoop deployment and configuration

conf/masters (master only)

• The conf/masters file defines on which machines Hadoop will start secondary NameNodes in our multi-node cluster.

• In our case, this is just the master machine.

• The primary NameNode and the JobTracker will always be the machines on which you run the bin/start-dfs.sh and bin/start-mapred.sh scripts, respectively

• The primary NameNode and the JobTracker will be started on the same machine if you run bin/start-all.sh

• On master, update conf/masters so that it looks like this:
master

Page 30: 02 Hadoop deployment and configuration

conf/slaves (master only)

• This conf/slaves file lists the hosts, one per line, where the Hadoop slave daemons (DataNodes and TaskTrackers) will be run.

• We want both the master box and the slave box to act as Hadoop slaves because we want both of them to store and process data.

• On master, update conf/slaves so that it looks like this:

• If you have additional slave nodes, just add them to the conf/slaves file, one per line (do this on all machines in the cluster).

master
slave

Page 31: 02 Hadoop deployment and configuration

conf/*-site.xml (all machines)

• You have to change the configuration files

• conf/core-site.xml,

• conf/mapred-site.xml and

• conf/hdfs-site.xml

• on ALL machines:

– fs.default.name : The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. Set as hdfs://master:54310

– mapred.job.tracker: The host and port that the MapReduce job tracker runs at. Set as master:54311

– dfs.replication: Default block replication. Set as 2

– mapred.local.dir: Determines where temporary MapReduce data is written. It also may be a list of directories.

– mapred.map.tasks: As a rule of thumb, use 10x the number of slaves (i.e., number of TaskTrackers).

– mapred.reduce.tasks: As a rule of thumb, use 2x the number of slave processors (i.e., number of TaskTrackers).
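
As a sketch, the three mandatory cluster-wide values from the list above end up in the respective files roughly like this (descriptions omitted for brevity; each snippet goes inside the file's <configuration> element on every machine):

<!-- In: conf/core-site.xml -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://master:54310</value>
</property>

<!-- In: conf/mapred-site.xml -->
<property>
  <name>mapred.job.tracker</name>
  <value>master:54311</value>
</property>

<!-- In: conf/hdfs-site.xml -->
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>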

Page 32: 02 Hadoop deployment and configuration

Formatting the HDFS and Starting

• To format the filesystem (which simply initializes the directory specified by the dfs.name.dir variable on the NameNode), run the command

• hdfs namenode -format

• Starting the multi-node cluster
– Starting the cluster is done in two steps.

– First, the HDFS daemons are started: start-dfs.sh

• NameNode daemon is started on master, and

• DataNode daemons are started on all slaves (here: master and slave)

– Second, the MapReduce daemons are started: start-mapred.sh

• JobTracker is started on master, and

• TaskTracker daemons are started on all slaves (here: master and slave)

• stop-mapred.sh followed by stop-dfs.sh to stop
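
After both start scripts have run, a quick jps check on each box shows whether the expected daemons came up (a sketch of typical output; process IDs will differ and are omitted here):

# On master, as hduser: expect NameNode, SecondaryNameNode, DataNode, JobTracker, TaskTracker
hduser@master:~$ jps
# On slave, as hduser: expect DataNode, TaskTracker
hduser@slave:~$ jps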

Page 33: 02 Hadoop deployment and configuration

End of session

Day 1: Hadoop Deployment and Configuration - Single machine and a cluster

Run the PiEstimator example
hadoop jar /usr/lib/hadoop/hadoop-xxxxx-example.jar pi 2 100000