
Transcript

    Hadoop Training

    ---- By Sasidhar M

    [email protected]

    Module 1 Introduction and Basics


    Why Hadoop?

    Processing and querying vast amounts of data.

    Efficient.

    Security.

    Automatic distribution of data and work across machines.

    Open source

    Parallel Processing


    History of Hadoop


    Standalone Applications

    [Diagram] An application server connected to a database over a network connection -- pulling data, manipulating it, updating it, and synchronizing.


    Challenges in Standalone Applications

    Network latency: GBs, TBs, or even PBs of data must be moved over the network.

    What is the size of the application?

    What if the data size is huge?


    Changing Mindset

    Should the application be moved to the data, or the data to the application?


    Parallel Processing

    Apart from data storage, performance becomes a major concern.

    Traditional approaches to parallel processing:

    Multithreading

    OpenMP

    MPI (Message Passing Interface)


    Building Scalable Systems

    What is the need for scaling?

    Storage

    Processing

    Vertical scaling: adding extra hardware (RAM, CPU, etc.) to an existing machine

    Horizontal scaling: adding more nodes

    Can you scale your existing system?

    Elastic scalability


    Distributed Framework

    Data Localization: moving the application to where the data resides.

    Data Availability: when data is stored across nodes, it should be available and accessible to all other nodes. Even if nodes fail, data should not be lost.

    Data Consistency: data should be consistent at all times.

    Data Reliability


    Challenges in Distributed Framework

    How to reduce network latency?

    How to make sure that data is not lost?

    How to design a programming model to access the data?


    Chunking an input file
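    As a rough command-line illustration of chunking (not from the slide; the 64 MB size mirrors the default HDFS block size in Hadoop 1.x, and split is a standard coreutils tool):

        # break a large input file into 64 MB pieces named chunk_aa, chunk_ab, ...
        split -b 64m input.txt chunk_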


    Assignments - Prerequisites

    Linux OS

    Sun JDK 6 (>=1.6)

    Hadoop 1.0.3

    Eclipse

    Apache Maven 3.0.4


    Hadoop Training Module-2

    Hadoop Installation


    Next Module Overview

    Installing Hadoop

    Running Hadoop on Pseudo Mode

    Understanding configuration

    Understanding the Hadoop processes: NameNode (NN), DataNode (DN), Secondary NameNode (SNN), JobTracker (JT), TaskTracker (TT)

    Running sample MapReduce programs

    Overview of basic commands and their usage


    Agenda

    Hadoop Installation

    Running Sample MapReduce Program

    HDFS Commands


    Step 1: Installing Java on RHEL, CentOS, or Fedora

    Download the Sun JDK from the Oracle web site (rpm.bin file).

    Execute chmod +x <rpm.bin filename>, then run the .bin file.

    Java will be installed under the /usr/java folder.

    Set the JAVA_HOME environment variable in the .bashrc or .bash_profile file:

    export JAVA_HOME=/usr/java/jdk1.6.0_31
    export PATH=$PATH:$JAVA_HOME/bin

    source .bashrc

    Run java -version; it should show the version of the JDK you installed.


    Installing Java on Ubuntu (contd.)

    tar -xvzf jdk-7u9-linux-x64.tar.gz
    sudo mv jdk1.7.0_04 /usr/lib/jvm/
    sudo update-alternatives --install /usr/bin/javac javac /usr/lib/jvm/jdk1.7.0_04/bin/javac 1
    sudo update-alternatives --install /usr/bin/java java /usr/lib/jvm/jdk1.7.0_04/bin/java 1
    sudo update-alternatives --install /usr/bin/javaws javaws /usr/lib/jvm/jdk1.7.0_04/bin/javaws 1
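    To confirm the newly registered JDK is the one in use, a quick check (standard update-alternatives usage; not on the slide):

        sudo update-alternatives --config java    # choose the jdk1.7.0_04 entry if prompted
        java -version                             # should print the installed JDK version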


    Step 2: Disabling IPv6

    cat /proc/sys/net/ipv6/conf/all/disable_ipv6

    A value of 1 indicates that IPv6 is disabled (0 means it is still enabled).

    If IPv6 is not disabled, open /etc/sysctl.conf and add the following lines:

    net.ipv6.conf.all.disable_ipv6 = 1
    net.ipv6.conf.default.disable_ipv6 = 1

    net.ipv6.conf.lo.disable_ipv6 = 1
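    To apply the change without rebooting and re-check the value (a standard sysctl step, assumed here rather than taken from the slide):

        sudo sysctl -p                                   # reload /etc/sysctl.conf
        cat /proc/sys/net/ipv6/conf/all/disable_ipv6     # should now print 1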


    Step 3: Creating a new user for Hadoop

    Not a mandatory step, but in cluster mode make sure Hadoop is run as the same user on every node.

    useradd hadoop

    passwd hadoop
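    A minimal sketch of this step, including switching to the new account (the su step is an assumption; the SSH and HDFS paths later in this module live under /home/hadoop):

        useradd hadoop     # create the hadoop user
        passwd hadoop      # set its password
        su - hadoop        # continue the remaining steps as this user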


    Step 4: Configuring SSH

    Nodes in the cluster communicate with each other via SSH.

    The NameNode should be able to communicate with the DataNodes in a passwordless manner.

    Run the following command to generate a public/private key pair without a password:

    ssh-keygen -t rsa -P ""

    cat /home/hadoop/.ssh/id_rsa.pub >> /home/hadoop/.ssh/authorized_keys

    Set the permissions of authorized_keys: chmod 755 authorized_keys

    Now when you run ssh localhost, it should not ask for a password.
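    For a real multi-node cluster, the same public key also has to end up in authorized_keys on every DataNode. One common way (a sketch; ssh-copy-id availability and the hostname datanode1 are assumptions):

        ssh-copy-id hadoop@datanode1    # repeat for each DataNode
        ssh hadoop@datanode1            # should log in without a password prompt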


    Possible errors while configuring SSH

    ssh: connect to host localhost port 22: Connection refused

    Check whether sshd is running: ps -ef | grep sshd

    Check if ssh server and client are installed or not. If not install it

    sudo apt-get install openssh-client openssh-server (On Ubuntu)

    yum -y install openssh-server openssh-client (On Centos)

    Now start the service by running the following commands:

    chkconfig sshd on
    service sshd start


    Step 5: Installing Hadoop

    Untar the file by running the following command:

    tar -zxf hadoop-1.0.3.tar.gz

    Create the following environment variables in the .bashrc file:

    export HADOOP_HOME=/home/hadoop/hadoop-1.0.3

    export HADOOP_LIB=$HADOOP_HOME/lib
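    The whole step as one runnable sequence, plus putting the Hadoop binaries on PATH (the PATH line is a common convenience and an assumption, not shown on the slide):

        tar -zxf hadoop-1.0.3.tar.gz -C /home/hadoop
        echo 'export HADOOP_HOME=/home/hadoop/hadoop-1.0.3' >> ~/.bashrc
        echo 'export HADOOP_LIB=$HADOOP_HOME/lib' >> ~/.bashrc
        echo 'export PATH=$PATH:$HADOOP_HOME/bin' >> ~/.bashrc    # assumption: optional convenience
        source ~/.bashrc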


    Step 6: Configuring Hadoop
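    The files edited on the following slides live in the configuration directory of the Hadoop 1.x tarball, assumed here to be $HADOOP_HOME/conf:

        cd $HADOOP_HOME/conf
        ls
        # core-site.xml  hdfs-site.xml  mapred-site.xml  masters  slaves  hadoop-env.sh  ...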


    mapred-site.xml

    <property>
      <name>mapred.job.tracker</name>
      <value>localhost:54311</value>
    </property>


    hdfs-site.xml

    <property>
      <name>dfs.replication</name>
      <value>1</value>
    </property>


    core-site.xml

    <property>
      <name>hadoop.tmp.dir</name>
      <value>/home/hadoop/hadoop-${user.name}</value>
      <description>A base for other temporary directories.</description>
    </property>

    <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:54310</value>
    </property>


    Masters and Slaves file

    In the masters file, specify the IP or hostname of the node that runs the Secondary NameNode (despite its name, this file does not list the NameNode; the NameNode runs on the machine where the start scripts are invoked).

    In the slaves file, specify the IPs or hostnames of the slave nodes (the DataNodes/TaskTrackers).
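    For the pseudo-distributed, single-machine setup used in this module, both files typically contain just one line (a minimal example assuming the defaults):

        # $HADOOP_HOME/conf/masters
        localhost

        # $HADOOP_HOME/conf/slaves
        localhost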


    hadoop-env.sh

    Specify the JAVA_HOME path
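    For example, using the JDK path from Step 1 (edit the JAVA_HOME line in conf/hadoop-env.sh; the exact path depends on where your JDK was installed):

        # in $HADOOP_HOME/conf/hadoop-env.sh
        export JAVA_HOME=/usr/java/jdk1.6.0_31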


    Step 7: Formatting the NameNode and starting the Hadoop cluster

    Formatting is required to build the file system. Where is it created? (Under the hadoop.tmp.dir path configured in core-site.xml.)

    Run the following commands to format the NameNode:

    cd $HADOOP_HOME
    bin/hadoop namenode -format

    The output should show that the NameNode was successfully formatted.

    Run bin/start-all.sh to start the Hadoop cluster.

    Execute ps -ef | grep hadoop. It should show all five of the processes below running:

    NameNode

    DataNode

    Secondary NameNode

    Task Tracker

    Job Tracker
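    As an alternative to grepping ps, the jps tool that ships with the JDK gives a quick check of the running Hadoop daemons (a sketch; assumes the JDK bin directory is on PATH):

        jps
        # Expected output (PIDs will differ):
        # 12001 NameNode
        # 12002 DataNode
        # 12003 SecondaryNameNode
        # 12004 JobTracker
        # 12005 TaskTracker
        # 12006 Jps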


    Job Tracker UI

    http://localhost:50030


    NameNode UI

    http://localhost:50070
