Hadoop Training -- Part1
Hadoop Training
By Sasidhar M
Module 1: Introduction and Basics
Why Hadoop?
We can process and query vast amounts of data.
Efficient.
Security.
Automatic distribution of data and work across machines.
Open source
Parallel Processing
History of Hadoop
Standalone Applications
[Diagram: an application server connected to a database over a network connection; the application pulls, manipulates, and updates data, with synchronization between the two.]
Challenges in Standalone Application
Network latency: GBs, TBs, or PBs of data must be moved across the network.
What is the size of the application?
What if the data size is huge?
Changing Mindset
Should the application be moved to the data, or the data to the application?
Parallel Processing
Apart from data storage, performance becomes a major concern.
Approaches to parallel processing:
Multithreading
OpenMP
MPI (Message Passing Interface)
Building Scalable Systems
Why is scaling needed?
Storage
Processing
Vertical scaling: adding more hardware (CPU, RAM, etc.) to a single machine
Horizontal scaling: adding more nodes
Can you scale your existing system?
Elastic scalability
Distributed Framework
Data localization: move the application to where the data resides.
Data availability: when data is stored across nodes, it should be accessible from all other nodes.
Even if nodes fail, data should not be lost.
Data consistency: the data should be consistent at all times.
Data reliability
Challenges in Distributed Framework
How to reduce network latency?
How to make sure that data is not lost?
How to design a programming model to access the data?
Chunking an input file
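As a rough sketch of how HDFS chunks an input file: once a file has been copied into HDFS (covered in the next module), its blocks can be listed with fsck; the path below is only a placeholder:

bin/hadoop fsck /user/hadoop/input.txt -files -blocks -locations

The output shows how the file is split into fixed-size blocks (64 MB by default in Hadoop 1.x) and which DataNodes hold each block replica.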
Assignments: Prerequisites
Linux OS
Sun JDK 6 (>= 1.6)
Hadoop 1.0.3
Eclipse
Apache Maven 3.0.4
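A quick way to check that the prerequisites are in place (a minimal sketch; the Hadoop check only works after the installation steps in the next module, from the Hadoop installation directory):

java -version
mvn -version
bin/hadoop version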
Hadoop Training Module-2
Hadoop Installation
Next Module Overview
Installing Hadoop
Running Hadoop on Pseudo Mode
Understanding the configuration files
Understanding the Hadoop processes: NameNode (NN), DataNode (DN), Secondary NameNode (SNN), JobTracker (JT), TaskTracker (TT)
Running sample MapReduce programs (see the sketch below)
Overview of basic HDFS commands and their usage
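For reference, a typical run of the bundled WordCount example plus a few basic HDFS commands looks roughly like this (a sketch assuming the pseudo-distributed cluster set up in this module is already running; the input and output paths are placeholders):

bin/hadoop fs -mkdir /user/hadoop/input
bin/hadoop fs -put sample.txt /user/hadoop/input
bin/hadoop jar hadoop-examples-1.0.3.jar wordcount /user/hadoop/input /user/hadoop/output
bin/hadoop fs -cat /user/hadoop/output/part-r-00000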
Agenda
Hadoop Installation
Running Sample MapReduce Program
HDFS Commands
Step 1: Installing Java on RHEL, CentOS, or Fedora
Download the Sun JDK from the Oracle web site (an .rpm.bin file).
Make it executable with chmod +x <rpm.bin filename> and run it.
Java will be installed under the /usr/java folder.
Set the JAVA_HOME environment variable in the .bashrc or .bash_profile file:
export JAVA_HOME=/usr/java/jdk1.6.0_31
export PATH=$PATH:$JAVA_HOME/bin
source .bashrc
Run the command java -version; it should show the version of the JDK you installed.
Installing Java on Ubuntu (contd.)
tar -xvzf jdk-7u9-linux-x64.tar.gz
sudo mv jdk1.7.0_04 /usr/lib/jvm/
sudo update-alternatives --install /usr/bin/javac javac /usr/lib/jvm/jdk1.7.0_04/bin/javac 1
sudo update-alternatives --install /usr/bin/java java /usr/lib/jvm/jdk1.7.0_04/bin/java 1
sudo update-alternatives --install /usr/bin/javaws javaws /usr/lib/jvm/jdk1.7.0_04/bin/javaws 1
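To confirm which JDK is now active, the alternatives can be inspected (output varies by system):

sudo update-alternatives --config java
java -version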
Step 2: Disabling IPv6
Check the current setting: cat /proc/sys/net/ipv6/conf/all/disable_ipv6
A value of 1 indicates that IPv6 is disabled.
If IPv6 is not disabled, open /etc/sysctl.conf and add the following lines:
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
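To apply the new settings without a reboot and re-check the flag (assuming sudo privileges):

sudo sysctl -p
cat /proc/sys/net/ipv6/conf/all/disable_ipv6

The second command should now print 1.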
Step 3: Creating a new user for Hadoop
This is not a mandatory step, but in cluster mode make sure Hadoop is run as the same user on every node.
useradd hadoop
passwd hadoop
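The remaining steps assume you are working as this user; you can switch to it with:

su - hadoop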
Step 4: Configuring SSH
Nodes in the cluster communicate with each other via SSH.
The NameNode should be able to communicate with the DataNodes in a passwordless manner.
Run the following commands to generate a public/private key pair without a passphrase:
ssh-keygen -t rsa -P ""
cat /home/hadoop/.ssh/id_rsa.pub >> /home/hadoop/.ssh/authorized_keys
Set the permissions of authorized_keys: chmod 755 authorized_keys
Now ssh localhost should not ask for a password.
Possible Error While configuring ssh
ssh: connect to host localhost port 22: Connection refused
Check whether sshd is running: ps -ef | grep sshd
Check whether the SSH server and client are installed. If not, install them:
sudo apt-get install openssh-client openssh-server (on Ubuntu)
yum -y install openssh-server openssh-client (on CentOS)
Then enable and start the service:
chkconfig sshd on
service sshd start
Step 5: Installing Hadoop
Untar the downloaded archive:
tar zxf hadoop-1.0.3.tar.gz
Add the following environment variables to the .bashrc file:
export HADOOP_HOME=/home/hadoop/hadoop-1.0.3
export HADOOP_LIB=$HADOOP_HOME/lib
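Optionally (not part of the original slide), the Hadoop bin directory can also be added to the PATH so the hadoop command works from any directory:

export PATH=$PATH:$HADOOP_HOME/bin
source .bashrc
hadoop version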
Step 6: Configuring Hadoop
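All of the files edited in the following slides live in the conf directory of the Hadoop 1.x installation, for example:

cd $HADOOP_HOME/conf
ls core-site.xml hdfs-site.xml mapred-site.xml hadoop-env.sh masters slaves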
mapred-site.xml
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:54311</value>
  </property>
</configuration>
hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
core-site.xml
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop/hadoop-${user.name}</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
  </property>
</configuration>
Masters and Slaves file
In the masters file, specify the IP or hostname of the node that runs the Secondary NameNode (localhost in pseudo-distributed mode).
In the slaves file, specify the IPs or hostnames of the slave nodes (the machines running DataNode and TaskTracker).
hadoop-env.sh
Specify the JAVA_HOME path
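For example, matching the JDK path used in Step 1 (adjust it to your own installation), the line in conf/hadoop-env.sh would look like:

export JAVA_HOME=/usr/java/jdk1.6.0_31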
Step 7: Formatting the NameNode and starting the Hadoop cluster
Formatting is required to build the HDFS file system. Where is it created? Under the directory configured as hadoop.tmp.dir in core-site.xml.
Run the following commands to format the NameNode:
cd $HADOOP_HOME
bin/hadoop namenode -format
The output should show that the NameNode was successfully formatted.
Run bin/start-all.sh to start the Hadoop cluster.
Execute ps -ef | grep hadoop. It should show all five of the following processes running:
NameNode
DataNode
Secondary NameNode
TaskTracker
JobTracker
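As an alternative to grepping the process list, the JDK's jps tool lists the Java daemons directly (assuming the JDK bin directory is on the PATH):

jps

It should list NameNode, DataNode, SecondaryNameNode, JobTracker, and TaskTracker (plus Jps itself).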
Job Tracker UI
http://localhost:50030
NameNode UI
http://localhost:50070
7/28/2019 Hadoop Training -- Part1
33/33