Hadoop: Introduction & Setup

Prepared by: Nishant M Gandhi, Certified Network Manager by Nettech.

Diploma in Cyber Security (pursuing).

C K Pithawalla College of Engineering & Technology, Surat

What is Hadoop?

"Hadoop is a framework for running applications on large clusters built of commodity hardware."

HADOOP WIKI

Open source, Java

Google's MapReduce inspired Yahoo!'s Hadoop.

Now part of the Apache group

    Hadoop Architecture on DELL C Series Server

    Hadoop Software Stack

    Hadoop Common: The common utilities

    Hadoop Distributed File System (HDFS): A distributed file system

    Hadoop Map Reduce: Distributed processing on compute clusters.

    Other Hadoop-related projects:

    Avro: A data serialization system.

    Cassandra: A scalable multi-master database

    Chukwa: A data collection system for managing large distributed systems.

    HBase: A scalable, distributed database that supports structured data storage for large tables.

    Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.

Mahout: A scalable machine learning and data mining library.

Pig: A high-level data-flow language and execution framework for parallel computation.

ZooKeeper: A coordination service for distributed applications.

    Who Uses Hadoop?

The Yahoo! Search Webmap is a Hadoop application that runs on a Linux cluster of more than 10,000 cores and produces data that is now used in every Yahoo! Web search query.

On February 19, 2008, Yahoo! Inc. launched what it claimed was the world's largest Hadoop production application.

    Who Uses Hadoop?

The data warehouse Hadoop cluster at Facebook

Here are the cluster statistics from the HDFS cluster at Facebook:

21 PB of storage in a single HDFS cluster

2000 machines

12 TB per machine (a few machines have 24 TB each)

1200 machines with 8 cores each + 800 machines with 16 cores each

32 GB of RAM per machine

15 map-reduce tasks per machine

That is a total of more than 21 PB of configured storage capacity, larger than the previously known Yahoo! cluster of 14 PB.

    Who Uses Hadoop?

Creator of MapReduce

    Runs Hadoop for NS-Research cluster

    HDFS is inspired by GFS

    Who Uses Hadoop?

    Other Hadoop Users:

    IBM

New York Times

Twitter

    Veoh

    Amazon

    Apple

    eBay

    AOL

    Hewlett-Packard

    Joost

    Map Reduce

    Programming model developed at Google

    Sort/merge based distributed computing

Initially it was intended for their internal search/indexing application, but it is now used extensively by other organizations (e.g., Yahoo, Amazon.com, IBM, etc.).

It is a functional-style programming model (cf. LISP) that is naturally parallelizable across a large cluster of workstations or PCs.

The underlying system takes care of partitioning the input data, scheduling the program's execution across several machines, handling machine failures, and managing the required inter-machine communication. (This is the key to Hadoop's success.)
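
As a rough single-machine analogy (not Hadoop code), the classic word-count job can be sketched with standard Unix tools: tr plays the role of the map phase, sort the shuffle/sort phase, and uniq -c the reduce phase. The file name input.txt is just a placeholder.

    # "map": emit one word per line; "shuffle": sort so identical words are adjacent; "reduce": count each group
    tr -cs 'A-Za-z' '\n' < input.txt | sort | uniq -c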

Map Reduce (diagram)

Hadoop Distributed File System (HDFS)

At Google, MapReduce operations are run on a special file system called the Google File System (GFS) that is highly optimized for this purpose.

GFS is not open source.

Doug Cutting and others at Yahoo! reverse engineered GFS and called it the Hadoop Distributed File System (HDFS).

    Goals of HDFS

Very Large Distributed File System

  10K nodes, 100 million files, 10 PB

Assumes Commodity Hardware

  Files are replicated to handle hardware failure

  Detects failures and recovers from them

Optimized for Batch Processing

  Data locations are exposed so that computations can move to where the data resides

  Provides very high aggregate bandwidth

User space, runs on heterogeneous OS
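
Once a cluster is running (the single-node setup later in this deck is enough), replication and block placement can be inspected with the HDFS fsck tool; this invocation is only an illustration:

    bin/hadoop fsck / -files -blocks -locations   # report files, their blocks, replication, and block locations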

DFShell

The HDFS shell is invoked by: bin/hadoop dfs

Available subcommands include:

cat, chgrp, chmod, chown, copyFromLocal, copyToLocal, cp, du, dus, expunge, get, getmerge, ls, lsr, mkdir, moveFromLocal, mv, touchz, put, rm, rmr, setrep, stat, tail, test, text
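
A few illustrative invocations of these subcommands (note the leading dash when running them; the /user/demo paths and file names here are placeholders, not from the original slides):

    bin/hadoop dfs -mkdir /user/demo/input           # create a directory in HDFS
    bin/hadoop dfs -put notes.txt /user/demo/input   # copy a local file into HDFS
    bin/hadoop dfs -ls /user/demo/input              # list the directory
    bin/hadoop dfs -cat /user/demo/input/notes.txt   # print the file's contents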

    Hadoop Single Node Setup

    Step 1:

    Download hadoop from

    http://hadoop.apache.org/mapreduce/releases.html

    Step 2:

    Untar the hadoop file:

    tar xvfz hadoop-0.20.2.tar.gz
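
As an illustration, the 0.20.2 tarball untarred in Step 2 can usually be fetched from the Apache archive; the exact mirror or path may differ from this assumed one:

    wget http://archive.apache.org/dist/hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz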

    Hadoop Single Node Setup

    Step 3:

Set the path to the Java installation by editing the JAVA_HOME parameter in hadoop/conf/hadoop-env.sh
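
For example, on a machine where the Sun JDK 6 is installed under /usr/lib/jvm/java-6-sun (an assumed, system-specific path), the edited line in hadoop/conf/hadoop-env.sh would look like:

    export JAVA_HOME=/usr/lib/jvm/java-6-sun   # point to wherever the local JDK lives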

    Hadoop Single Node Setup

    Step 4:

Create an RSA key to be used by Hadoop when sshing to localhost:

ssh-keygen -t rsa -P ""

    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
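
The empty passphrase (-P "") lets Hadoop's scripts ssh to localhost without prompting. Assuming an SSH server is running locally, passwordless login can be verified with:

    ssh localhost   # should log in without asking for a password
    exit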

    Hadoop Single Node Setup

    Step 5:

Make the following changes to the configuration files under hadoop/conf.

core-site.xml (inside the <configuration> element):

  <property>
    <name>hadoop.tmp.dir</name>
    <value>TEMPORARY-DIR-FOR-HADOOP-DATASTORE</value>
  </property>

  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
  </property>

    Hadoop Single Node Setup

mapred-site.xml (inside the <configuration> element):

  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:54311</value>
  </property>

hdfs-site.xml (inside the <configuration> element):

  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>

    Hadoop Single Node Setup

    Step 6:

Format the Hadoop file system. From the hadoop directory, run the following:

    bin/hadoop namenode -format

    Using Hadoop

1) How to start Hadoop?

    cd hadoop/bin

    ./start-all.sh

2) How to stop Hadoop?

    cd hadoop/bin

    ./stop-all.sh

3) How to copy a file from the local machine to HDFS? (see also the combined walk-through after this list)

cd hadoop

bin/hadoop dfs -put local_machine_path hdfs_path

4) How to list files in HDFS?

    cd hadoop

    bin/hadoop dfs -ls
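
Putting these commands together, a rough end-to-end session might look like the following. It assumes the hadoop-0.20.2 release directory, a local file example.txt, and the word-count program from the examples jar bundled with that release (file names are placeholders and the jar name may differ between releases):

    cd hadoop
    bin/start-all.sh                                   # start the HDFS and MapReduce daemons
    bin/hadoop dfs -put example.txt input.txt          # copy a local file into HDFS
    bin/hadoop jar hadoop-0.20.2-examples.jar wordcount input.txt output   # run the bundled word-count job
    bin/hadoop dfs -ls output                          # inspect the result files
    bin/stop-all.sh                                    # shut the daemons down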

    Thank You..