Hadoop HP Day1


Transcript of Hadoop HP Day1

  • What Is Hadoop? A distributed computing framework:

    For clusters of computers: thousands of compute nodes, petabytes of data.

    Open source, written in Java. Google's MapReduce inspired Yahoo's Hadoop, which is now part of the Apache group.

  • What is Hadoop? The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. Hadoop includes:

    Hadoop Common: the common utilities.
    Avro: a data serialization system with scripting-language support.
    Chukwa: a data collection system for managing large distributed systems.
    HBase: a scalable, distributed database for large tables.
    HDFS: a distributed file system.
    Hive: data summarization and ad hoc querying.
    MapReduce: distributed processing on compute clusters.
    Pig: a high-level data-flow language for parallel computation.
    ZooKeeper: a coordination service for distributed applications.

  • Problem Scope

    Hadoop is a large-scale distributed batch processing infrastructure.

    It scales to hundreds or thousands of computers, each with several processor cores.

    It efficiently distributes large amounts of work across a set of machines.

  • How large an amount of work? Hundreds of gigabytes of data are the low end of Hadoop-scale.

    Hadoop is built to process "web-scale" data (hundreds of gigabytes to terabytes or petabytes).

    It includes a distributed file system that breaks up input data and distributes it across several machines.

  • Challenges at Large Scale

    Performing large-scale computation is difficult: as the scale grows, the probability of failures rises. In a distributed environment, partial failures are an expected and common occurrence.

    Individual compute nodes may overheat, crash, experience hard drive failures, or run out of memory or disk space. Data may be corrupted, or maliciously or improperly transmitted. Clocks may become desynchronized, and lock files may not be released.

    The rest of the distributed system should be able to recover from the component failure or transient error condition and continue to make progress.

  • Challenges at Large Scale (contd.)

    Hadoop is designed to handle hardware failure and data congestion issues very robustly.

    Compute hardware has finite resources: processor time, memory, hard drive space, and network bandwidth.

  • Moore's Law

    Moore's Law (named after Gordon Moore, a co-founder of Intel) states that the number of transistors that can be placed in a processor will double approximately every two years, for half the cost.

  • The Hadoop Approach

    Efficiently process large volumes of information by connecting many commodity computers together to work in parallel.

    These smaller machines are tied together into a single cost-effective compute cluster.

  • Comparison to Existing Techniques

    Hadoop vs. Condor:

    Hadoop: a simplified programming model; efficient, automatic distribution of data and work across machines.

    Condor: does not automatically distribute data; a separate SAN must be managed in addition to the compute cluster; collaboration between multiple compute nodes must be managed with a communication system such as MPI.

  • Data Distribution - Hadoop

    In a Hadoop cluster, data is distributed to all the nodes of the cluster as it is being loaded in.

    The Hadoop Distributed File System (HDFS) will split large data files into chunks.

    Each chunk is replicated across several machines.

    An active monitoring system then re-replicates the data in response to system failures.

  • Data Distribution - Hadoop (contd.)

    Data is conceptually record-oriented. Individual input files are broken into lines or into other formats specific to the application logic.

    Each process running on a node in the cluster then processes a subset of these records.

    Data is read from the local disk straight into the CPU, alleviating strain on network bandwidth and preventing unnecessary network transfers.

  • Data is distributed across nodes at load time.

  • MapReduce: Isolated Processes

    Hadoop limits the amount of communication between nodes.

    Hadoop will not run just any program and distribute it across a cluster.

    Programs must be written to conform to a particular programming model, named "MapReduce."

  • Flat Scalability

    One of the major benefits of using Hadoop in contrast to other distributed systems is its flat scalability curve.

    A program written in distributed frameworks other than Hadoop may require large amounts of refactoring when scaling from ten to one hundred or one thousand machines.

    After a Hadoop program is written and functioning on ten nodes, very little, if any, work is required for that same program to run on a much larger amount of hardware.

  • Hadoop Installation Preparation: Demo

  • Steps: Hadoop Installation Preparation

    Install VM Player. Import the RedHat Linux VM into VM Player. Start the VM Player. Use root/root123 to log on to the VM. Follow the tutorial.

  • Hadoop File System


  • The Hadoop Distributed File System

    HDFS is a distributed file system designed to hold very large amounts of data (terabytes or even petabytes) and to provide high-throughput access to this information.

    Files are stored in a redundant fashion across multiple machines to ensure their durability to failure and their high availability to highly parallel applications.

  • Basic Features: HDFS

    Highly fault-tolerant

    High throughput

    Suitable for applications with large data sets

    Streaming access to file system data

    Can be built out of commodity hardware

  • Fault tolerance

    Failure is the norm rather than the exception.

    An HDFS instance may consist of thousands of server machines, each storing part of the file system's data.

    Since we have a huge number of components, and each component has a non-trivial probability of failure, there is always some component that is non-functional.

    Detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.

  • Data Characteristics

    Streaming data access

    Applications need streaming access to data

    Batch processing rather than interactive user access.

    Large data sets and files: gigabytes to terabytes size

    High aggregate data bandwidth

    Scale to hundreds of nodes in a cluster

    Tens of millions of files in a single instance

    Write-once-read-many: a file, once created, written and closed, need not be changed; this assumption simplifies coherency.

    A map-reduce application or a web-crawler application fits perfectly with this model.

  • MapReduce

    [Figure: word-count dataflow. A terabyte-scale input of words (Cat, Bat, Dog, other words) is divided into splits; each split is processed by a map task, the map outputs are combined locally, and the reduce tasks produce the output partitions part0, part1 and part2.]

  • ARCHITECTURE


  • Namenode and Datanodes

    Master/slave architecture

    An HDFS cluster consists of a single Namenode, a master server that manages the file system namespace and regulates access to files by clients.

    There are a number of DataNodes, usually one per node in the cluster.

    The DataNodes manage storage attached to the nodes that they run on.

    HDFS exposes a file system namespace and allows user data to be stored in files.

    A file is split into one or more blocks, and the set of blocks is stored on DataNodes.

    DataNodes serve read and write requests and perform block creation, deletion, and replication upon instruction from the Namenode.

  • HDFS Architecture

    [Figure: HDFS architecture. The Namenode holds the metadata (name, replicas, ...; e.g. /home/foo/data, 6, ...) and handles clients' metadata operations and the Datanodes' block operations. Clients read blocks directly from Datanodes and write blocks to Datanodes, which replicate them across racks (Rack 1, Rack 2).]

  • File system Namespace

    Hierarchical file system with directories and files

    Create, remove, move, rename etc.

    The Namenode maintains the file system.

    Any meta-information changes to the file system are recorded by the Namenode.

    An application can specify the number of replicas of a file it needs: the replication factor of the file. This information is stored in the Namenode.

  • Data Replication

    HDFS is designed to store very large files across machines in a large cluster.

    Each file is a sequence of blocks.

    All blocks in the file except the last are of the same size.

    Blocks are replicated for fault tolerance.

    Block size and replicas are configurable per file.

    The Namenode receives a Heartbeat and a BlockReport from each DataNode in the cluster.

    BlockReport contains all the blocks on a Datanode.
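    To make "block size and replicas are configurable per file" concrete, here is a minimal, hedged Java sketch; the path and values are made up for illustration, and the calls shown are the generic HDFS client API rather than anything specific to this course.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PerFileSettings {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path p = new Path("/user/someone/bigfile.dat");          // hypothetical file
        // Create the file with an explicit replication factor (3) and block size (128 MB):
        // create(path, overwrite, bufferSize, replication, blockSize)
        fs.create(p, true, 4096, (short) 3, 128 * 1024 * 1024L).close();
        // Change the replication factor of an existing file to 2:
        fs.setReplication(p, (short) 2);
      }
    }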

  • Replica Placement

    The placement of the replicas is critical to HDFS reliability and performance.

    Optimizing replica placement distinguishes HDFS from other distributed file systems.

    Rack-aware replica placement. Goal: improve reliability, availability and network bandwidth utilization.

    There are many racks, and communication between racks goes through switches.

    Network bandwidth between machines on the same rack is greater than between machines on different racks.

    The Namenode determines the rack id for each DataNode.

  • Replica Placement

    Replicas are typically placed on unique racks

    Simple but non-optimal

    Writes are expensive

    Replication factor is 3

    Replicas are placed: one on a node in the local rack, one on a different node in the local rack, and one on a node in a different rack.

    With this policy, 1/3 of the replicas are on one node, 2/3 are on one rack, and the rest are distributed evenly across the remaining racks.

  • Replica Selection

    Replica selection for READ operation: HDFS tries

    to minimize the bandwidth consumption and

    latency.

    If there is a replica on the reader's node, then that replica is preferred.

    HDFS cluster may span multiple data centers:

    replica in the local data center is preferred over

    the remote one.

  • Safemode Startup

    On startup, the Namenode enters Safemode.

    Replication of data blocks does not occur in Safemode.

    Each DataNode checks in with a Heartbeat and a BlockReport.

    The Namenode verifies that each block has an acceptable number of replicas.

    After a configurable percentage of safely replicated blocks check in with the Namenode, Namenode exits Safemode.

    It then makes the list of blocks that need to be replicated.

    Namenode then proceeds to replicate these blocks to other Datanodes.

  • Filesystem Metadata

    The HDFS namespace is stored by Namenode.

    The Namenode uses a transaction log called the EditLog to record every change that occurs to the filesystem metadata: for example, creating a new file, or changing the replication factor of a file.

    The EditLog is stored in the Namenode's local filesystem.

    The entire filesystem namespace, including the mapping of blocks to files and the file system properties, is stored in a file called FsImage, also stored in the Namenode's local filesystem.

  • Namenode

    Keeps image of entire file system namespace and file Blockmap in memory.

    4GB of local RAM is sufficient to support the above data structures that represent the huge number of files and directories.

    When the Namenode starts up, it gets the FsImage and EditLog from its local file system, updates the FsImage with the EditLog information, and then stores a copy of the FsImage on the filesystem as a checkpoint.

    Periodic checkpointing is done so that the system can recover back to the last checkpointed state in case of a crash.

  • Datanode

    A Datanode stores data in files in its local file system.

    Datanode has no knowledge about HDFS filesystem

    It stores each block of HDFS data in a separate file.

    Datanode does not create all files in the same directory.

    It uses heuristics to determine the optimal number of files per directory and creates directories appropriately.

    Research issue?

    When the filesystem starts up, it generates a list of all HDFS blocks and sends this report to the Namenode: the Blockreport.

  • Configuring HDFS

    Cluster configuration

    The HDFS configuration is located in a set of XML files in the Hadoop configuration directory, conf/.

  • hadoop-defaults.xml

    This file contains default values for every parameter in Hadoop.

    This file is considered read-only.

    You override this configuration by setting new values in hadoop-site.xml.

    hadoop-site.xml should be replicated consistently across all machines in the cluster. (It is also possible, though not advisable, to host it on NFS.)

  • hadoop-site.xml

    Configuration settings are a set of key-value pairs of the format:

    <property>
      <name>property-name</name>
      <value>property-value</value>
    </property>

    Adding the line <final>true</final> inside the property body will prevent properties from being overridden by user applications.

  • hadoop-site.xml

    The following settings are necessary to configure

    HDFS:

    key    value    example

    fs.default.name protocol://servername:port hdfs://alpha.milkman.org:9000

    dfs.data.dir pathname /home/username/hdfs/data

    dfs.name.dir pathname /home/username/hdfs/name

  • A single-node configuration:

    hadoop-site.xml

    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://your.server.name.com:9000</value>
      </property>
      <property>
        <name>dfs.data.dir</name>
        <value>/home/username/hdfs/data</value>
      </property>
      <property>
        <name>dfs.name.dir</name>
        <value>/home/username/hdfs/name</value>
      </property>
    </configuration>

    * After copying this information into your conf/hadoop-site.xml file, copy this to the conf/ directories on all machines in the cluster.
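    For illustration only, a minimal Java sketch of how client code sees these settings: in this Hadoop generation a new Configuration() is expected to pick up the Hadoop defaults and then hadoop-site.xml from the classpath. The class name is made up; the property names are the ones listed above.

    import org.apache.hadoop.conf.Configuration;

    public class ShowConf {
      public static void main(String[] args) {
        // Loads the Hadoop defaults, then hadoop-site.xml, from the classpath
        Configuration conf = new Configuration();
        System.out.println("fs.default.name = " + conf.get("fs.default.name"));
        System.out.println("dfs.data.dir    = " + conf.get("dfs.data.dir"));
        System.out.println("dfs.name.dir    = " + conf.get("dfs.name.dir"));
      }
    }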

  • Starting HDFS

    Format the file system that was just configured:

    user@namenode:hadoop$ bin/hadoop namenode -format

    This process should only be performed once. When it is complete, you are free to start the distributed file system:

    user@namenode:hadoop$ bin/start-dfs.sh

    This command will start the NameNode server on the master machine (which is where the start-dfs.sh script was invoked).

    It will also start the DataNode instances on each of the slave machines.

    In a single-machine "cluster," this is the same machine as the NameNode instance.

    On a real cluster of two or more machines, this script will ssh into each slave machine and start a DataNode instance.

  • Interacting With HDFS

    Command format: user@machine:hadoop$ bin/hadoop moduleName -cmd args...

    The moduleName tells the program which subset of Hadoop functionality to use. -cmd

    is the name of a specific command within this module to execute. Its arguments

    follow the command name.

    Two such modules are relevant to HDFS: dfs and dfsadmin.

  • Tutorial:

    Hadoop - Single Node :Installation

    Tutorial-InstallationHDFSSingleNode.docx

  • PROTOCOL


  • The Communication Protocol

    All HDFS communication protocols are layered on top of the TCP/IP protocol

    A client establishes a connection to a configurable TCP port on the Namenode machine. It talks the ClientProtocol with the Namenode.

    The Datanodes talk to the Namenode using the Datanode protocol.

    RPC abstraction wraps both ClientProtocol and Datanode protocol.

    Namenode is simply a server and never initiates a request; it only responds to RPC requests issued by DataNodes or clients.

  • ROBUSTNESS


  • Objectives

    The primary objective of HDFS is to store data reliably in the presence of failures.

    Three common failures are: Namenode failure, Datanode failure and network partition.

  • DataNode failure and heartbeat

    A network partition can cause a subset of Datanodes to lose connectivity with the Namenode.

    Namenode detects this condition by the absence of a Heartbeat message.

    The Namenode marks Datanodes without a recent Heartbeat as dead and does not send any IO requests to them.

    Any data registered to the failed Datanode is not available to the HDFS.

    Also the death of a Datanode may cause replication factor of some of the blocks to fall below their specified value.


  • Re-replication

    The necessity for re-replication may arise due

    to:

    A Datanode may become unavailable,

    A replica may become corrupted,

    A hard disk on a Datanode may fail, or

    The replication factor on the block may be

    increased.


  • Cluster Rebalancing

    HDFS architecture is compatible with data rebalancing schemes.

    A scheme might move data from one Datanode to another if the free space on a Datanode falls below a certain threshold.

    In the event of a sudden high demand for a particular file, a scheme might dynamically create additional replicas and rebalance other data in the cluster.


  • Data Integrity

    Consider a situation: a block of data fetched from Datanode arrives corrupted.

    This corruption may occur because of faults in a storage device, network faults, or buggy software.

    An HDFS client creates a checksum of every block of its file and stores these checksums in hidden files in the HDFS namespace.

    When a client retrieves the contents of a file, it verifies that the corresponding checksums match.

    If they do not match, the client can retrieve the block from a replica.
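    To make the verify-on-read idea concrete, here is a small, HDFS-free Java sketch. This is not HDFS's internal checksum code; it simply shows a CRC32 over a block's bytes, with the block contents made up.

    import java.util.zip.CRC32;

    public class BlockChecksumDemo {
      // Compute a CRC32 checksum over one block's bytes
      static long checksumOf(byte[] block) {
        CRC32 crc = new CRC32();
        crc.update(block, 0, block.length);
        return crc.getValue();
      }

      public static void main(String[] args) {
        byte[] block = "example block contents".getBytes();
        long stored = checksumOf(block);       // computed when the block was written
        long recomputed = checksumOf(block);   // recomputed when the block is read
        if (recomputed != stored) {
          System.err.println("Block corrupted: fetch it from another replica");
        } else {
          System.out.println("Checksum OK");
        }
      }
    }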

  • Metadata Disk Failure

    FsImage and EditLog are central data structures of HDFS.

    A corruption of these files can cause an HDFS instance to be non-functional.

    For this reason, a Namenode can be configured to maintain multiple copies of the FsImage and EditLog.

    Multiple copies of the FsImage and EditLog files are updated

    synchronously.

    Meta-data is not data-intensive.

    The Namenode could be a single point of failure: automatic failover is NOT supported.

  • DATA ORGANIZATION


  • Data Blocks

    HDFS supports write-once-read-many semantics with reads at streaming speeds.

    A typical block size is 64 MB (or even 128 MB).

    A file is chopped into 64 MB chunks and stored; for example, a 1 GB file becomes sixteen 64 MB blocks.

  • Staging

    A client request to create a file does not reach Namenode immediately.

    The HDFS client caches the data into a temporary file. When the data reaches an HDFS block size, the client contacts the Namenode.

    Namenode inserts the filename into its hierarchy and allocates a data block for it.

    The Namenode responds to the client with the identity of the Datanode and the destination of the replicas (Datanodes) for the block.

    Then the client flushes the block of data from the local temporary file to the specified Datanode.

  • Staging (contd.)

    The client sends a message that the file is closed.

    The Namenode then commits the file creation operation into its persistent store.

    If the Namenode dies before file is closed, the file is lost.

    This client-side caching is required to avoid network congestion; it also has precedent in AFS (the Andrew File System).

  • Replication Pipelining

    When the client receives the response from the Namenode, it flushes its block in small pieces (4 KB) to the first replica, which in turn copies it to the next replica, and so on.

    Thus data is pipelined from one Datanode to the next.

  • API (ACCESSIBILITY)


  • Application Programming Interface

    HDFS provides a Java API for applications to use.

    Python access is also used in many applications.

    A C language wrapper for the Java API is also available.

    An HTTP browser can be used to browse the files of an HDFS instance.

  • FS Shell, Admin and Browser Interface

    HDFS organizes its data in files and directories.

    It provides a command line interface called the FS shell that lets the user interact with data in the HDFS.

    The syntax of the commands is similar to bash and csh.

    Example: to create a directory /foodir:

    bin/hadoop dfs -mkdir /foodir

    There is also a DFSAdmin interface available.

    A browser interface is also available to view the namespace.
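    The same mkdir operation can also be done through the Java API; a minimal sketch (the class name is illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class MakeDir {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        fs.mkdirs(new Path("/foodir"));   // equivalent of: bin/hadoop dfs -mkdir /foodir
      }
    }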

  • Space Reclamation

    When a file is deleted by a client, HDFS renames the file to a file in the /trash directory for a configurable amount of time.

    A client can request an undelete during this allowed time.

    After the specified time, the file is deleted and the space is reclaimed.

    When the replication factor is reduced, the Namenode selects excess replicas that can be deleted.

    The next Heartbeat transfers this information to the Datanode, which clears the blocks for reuse.

  • HDFS & GFS

    The design of HDFS is based on the design of GFS, the Google File System.

    HDFS is a block-structured file system: individual files are broken into blocks of a fixed size.

    These blocks are stored across a cluster of one or more machines with data storage capacity. Individual machines in the cluster are referred to as DataNodes.

  • HDFS Characteristics

    A file can be made of several blocks, and they are

    not necessarily stored on the same machine; the

    target machines which hold each block are

    chosen randomly on a block-by-block basis.

    Thus access to a file may require the cooperation of

    multiple machines, but supports file sizes far

    larger than a single-machine DFS; individual files

    can require more space than a single hard drive

    could hold.

  • Replication in HDFS

    DataNodes holding blocks of multiple files with a replication factor of 2.

    The NameNode maps the filenames onto the block ids.

  • Common Example Operations

  • Listing files

    someone@anynode:hadoop$ bin/hadoop dfs -ls

    someone@anynode:hadoop$

    someone@anynode:hadoop$ bin/hadoop dfs -ls /

    Found 2 items

    drwxr-xr-x - hadoop supergroup 0 2008-09-20 19:40 /hadoop

    drwxr-xr-x - hadoop supergroup 0 2008-09-20 20:08 /tmp

  • Create home directory, and then

    populate it with some files. Step 1: Create your home directory if it does not already exist.

    someone@anynode:hadoop$ bin/hadoop dfs -mkdir /user

    someone@anynode:hadoop$ bin/hadoop dfs -mkdir /user/someone

    Step 2: Upload a file. To insert a single file into HDFS, we can use the put command like so:

    someone@anynode:hadoop$ bin/hadoop dfs -put /home/someone/interestingFile.txt /user/yourUserName/

    Step 3: Verify the file is in HDFS. We can verify that the operation worked with either of the two following (equivalent) commands:

    someone@anynode:hadoop$ bin/hadoop dfs -ls /user/yourUserName

    someone@anynode:hadoop$ bin/hadoop dfs -ls

  • Uploading

    Step 4: Uploading multiple files at once. The put command is more powerful than moving a single file at a time. It can also be used to upload entire directory trees into HDFS.

    Create a local directory and put some files into it using the cp command. Our example user may have a situation like the following:

    someone@anynode:hadoop$ ls -R myfiles

    myfiles:

    file1.txt file2.txt subdir/

    myfiles/subdir:

    anotherFile.txt

    someone@anynode:hadoop$

    This entire myfiles/ directory can be copied into HDFS like so:

    someone@anynode:hadoop$ bin/hadoop dfs -put myfiles /user/myUsername

    someone@anynode:hadoop$ bin/hadoop dfs -ls

    Found 1 items

    /user/someone/myfiles 2008-06-12 20:59 rwxr-xr-x someone supergroup

    user@anynode:hadoop$ bin/hadoop dfs -ls myfiles

    Found 3 items

    /user/someone/myfiles/file1.txt 186731 2008-06-12 20:59 rw-r--r-- someone supergroup

    /user/someone/myfiles/file2.txt 168 2008-06-12 20:59 rw-r--r-- someone supergroup

    /user/someone/myfiles/subdir 2008-06-12 20:59 rwxr-xr-x someone supergroup

  • Uploading of Files

    Uploading a file into HDFS first copies the data onto the DataNodes.

    When they all acknowledge that they have received all the data and the file handle is closed, it is then made visible to the rest of the system.

    Thus, based on the return value of the put command, you can be confident that a file has either been successfully uploaded or has "fully failed."

    You will never get into a state where a file is partially uploaded and the partial contents are visible externally, but where the upload disconnected and did not complete the entire file contents. In a case like this, it will be as though no upload took place.

  • uses of the put command

    Command: bin/hadoop dfs -put foo bar
    Assuming: no file/directory named /user/$USER/bar exists in HDFS
    Outcome: uploads local file foo to a file named /user/$USER/bar

    Command: bin/hadoop dfs -put foo bar
    Assuming: /user/$USER/bar is a directory
    Outcome: uploads local file foo to a file named /user/$USER/bar/foo

    Command: bin/hadoop dfs -put foo somedir/somefile
    Assuming: /user/$USER/somedir does not exist in HDFS
    Outcome: uploads local file foo to a file named /user/$USER/somedir/somefile, creating the missing directory

    Command: bin/hadoop dfs -put foo bar
    Assuming: /user/$USER/bar is already a file in HDFS
    Outcome: no change in HDFS, and an error is returned to the user

  • Retrieving data from HDFS

    Step 1: Display data with cat.

    someone@anynode:hadoop$ bin/hadoop dfs -cat foo

    (contents of foo are displayed here)

    someone@anynode:hadoop$

    Step 2: Copy a file from HDFS to the local file system.

    The get command is the inverse operation of put; it will copy a file or directory (recursively) from HDFS into the target of your choosing on the local file system. A synonymous operation is called -copyToLocal.

    someone@anynode:hadoop$ bin/hadoop dfs -get foo localFoo

    someone@anynode:hadoop$ ls

    localFoo

    someone@anynode:hadoop$ cat localFoo

    (contents of foo are displayed here)

  • Shutting Down HDFS

    someone@namenode:hadoop$ bin/stop-dfs.sh

    This command must be performed by the same user who started HDFS with bin/start-dfs.sh.

  • HDFS Command Reference

    Running bin/hadoop dfs with no additional

    arguments will list all commands which can be

    run with the FsShell system

    bin/hadoop dfs -help commandName will

    display a short usage summary for the

    operation in question

  • HDFS Command Reference

    Command Operation

    -ls path : Lists the contents of the directory specified by path, showing the names, permissions, owner, size and modification date for each entry.

    -lsr path : Behaves like -ls, but recursively displays entries in all subdirectories of path.

    -du path : Shows disk usage, in bytes, for all files which match path; filenames are reported with the full HDFS protocol prefix.

    -dus path : Like -du, but prints a summary of disk usage of all files/directories in the path.

    -mv src dest : Moves the file or directory indicated by src to dest, within HDFS.

    -cp src dest : Copies the file or directory identified by src to dest, within HDFS.

    -rm path : Removes the file or empty directory identified by path.

    -moveToLocal [-crc] src localDest : Works like -get, but deletes the HDFS copy on success.

  • HDFS Command Reference

    Command Operation

    -get [-crc] src localDest : Copies the file or directory in HDFS identified by src to the local file system path identified by localDest.

    -getmerge src localDest [addnl] : Retrieves all files that match the path src in HDFS, and copies them to a single, merged file in the local file system identified by localDest.

    -cat filename : Displays the contents of filename on stdout.

    -copyToLocal [-crc] src localDest : Identical to -get.

    -moveToLocal [-crc] src localDest : Works like -get, but deletes the HDFS copy on success.

  • Tutorial

    HDFS Command.

    Tutorial-HDFSComand.docx

  • DFSAdmin Command Reference

    The dfs module provides common file and directory manipulation commands; they all work with objects within the file system.

    The dfsadmin module manipulates or queries the file system as a whole.

    Getting overall status: bin/hadoop dfsadmin -report. This returns basic information about the overall health of the HDFS cluster, as well as some per-server metrics.

    More involved status: bin/hadoop dfsadmin -metasave filename will record this

    information in filename. The metasave command will enumerate lists of blocks which are under-replicated, in the process of being replicated, and scheduled for deletion.

  • Safemode:

    Safemode: the file system is mounted read-only

    no replication is performed

    nor can files be created or deleted.

    Safemode is entered automatically as the NameNode starts, to allow all DataNodes time to check in with the NameNode.

    It waits until a specific percentage of the blocks are present and accounted for (dfs.safemode.threshold.pct).

    Safemode can be manipulated with bin/hadoop dfsadmin -safemode what, where what is one of:

    enter - Enters safemode

    leave - Forces the NameNode to exit safemode

    get - Returns a string indicating whether safemode is ON or OFF

    wait - Waits until safemode has exited and returns

  • HDFS

    Changing HDFS membership - When

    decommissioning nodes, it is important to

    disconnect nodes from HDFS gradually to

    ensure that data is not lost.

  • Upgrading HDFS versions

    # bin/start-dfs.sh -upgrade.

    It will then begin upgrading the HDFS version.

    # bin/hadoop dfsadmin -upgradeProgress status

    # bin/hadoop dfsadmin -upgradeProgress force

  • Upgrading HDFS versions.

    bin/start-dfs.sh -rollback.

    It will restore the previous HDFS state.

    Only one such archival copy can be kept at a time.

    bin/hadoop dfsadmin -finalizeUpgrade.

    The rollback command cannot be issued after this

    point. This must be performed before a second

    Hadoop upgrade is allowed

  • Getting help

    bin/hadoop dfsadmin -help cmd

  • Using HDFS in MapReduce

    HDFS is a powerful companion to Hadoop MapReduce.

    By setting the fs.default.name configuration option to point to the NameNode, Hadoop MapReduce jobs will automatically draw their input files from HDFS.

    Using the regular FileInputFormat subclasses, Hadoop will automatically draw its input data sources from file paths within HDFS, and will distribute the work over the cluster in an intelligent fashion to exploit block locality where possible.
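    A minimal sketch of what this path resolution looks like from client code, assuming fs.default.name is configured as described earlier; the paths reuse the document's earlier examples and the hostname is the document's placeholder.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PathResolution {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // A plain path handed to a FileInputFormat resolves against fs.default.name ...
        Path implicitPath = new Path("/user/someone/input");
        // ... and is equivalent to the fully-qualified HDFS URI:
        Path explicitPath = new Path("hdfs://your.server.name.com:9000/user/someone/input");
        System.out.println(fs.makeQualified(implicitPath));
        System.out.println(explicitPath);
      }
    }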

  • Using HDFS Programmatically

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.Path;

    public class HDFSHelloWorld {

      public static final String theFilename = "hello.txt";
      public static final String message = "Hello, world!\n";

      public static void main(String[] args) throws IOException {

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path filenamePath = new Path(theFilename);

        try {
          if (fs.exists(filenamePath)) {
            // remove the file first
            fs.delete(filenamePath);
          }

          FSDataOutputStream out = fs.create(filenamePath);
          out.writeUTF(message);
          out.close();

          FSDataInputStream in = fs.open(filenamePath);
          String messageIn = in.readUTF();
          System.out.print(messageIn);
          in.close();
        } catch (IOException ioe) {
          System.err.println("IOException during operation: " + ioe.toString());
          System.exit(1);
        }
      }
    }

  • HDFS Permissions and Security

    HDFS security is based on the POSIX model of users and groups.

    Each file or directory has three permissions (read, write and execute) associated with it at three different granularities: the file's owner, users in the same group as the owner, and all other users in the system.

    As the HDFS does not provide the full POSIX spectrum of activity, some combinations of bits will be meaningless.

    For example, no file can be executed; the +x bits cannot be set on files (only directories). Nor can an existing file be written to, although the +w bits may still be set.

  • HDFS Permissions and Security

    Security permissions and ownership can be modified using the bin/hadoop dfs -chmod, -chown, and -chgrp commands.

    They work in a similar fashion to the POSIX/Linux tools of the same name.
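    For comparison, a hedged sketch of the equivalent calls in the Java API; the path, mode, owner and group are illustrative, and changing ownership normally requires superuser status, as discussed on the next slide.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.FsPermission;

    public class PermissionDemo {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path p = new Path("/user/someone/report.txt");         // hypothetical file
        fs.setPermission(p, new FsPermission((short) 0644));   // like: dfs -chmod 644
        fs.setOwner(p, "someone", "supergroup");               // like: dfs -chown someone:supergroup
      }
    }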

  • Security

    Superuser status - The username which was

    used to start the Hadoop process (i.e., the

    username who actually ran bin/start-all.sh or

    bin/start-dfs.sh) is acknowledged to be the

    superuser for HDFS.

    If Hadoop is shutdown and restarted under a

    different username, that username is then

    bound to the superuser account.

  • Tutorial

    Showing Security and dfsadmin command

    Tutorial-HDFSAdmin&SecurityComand.docx

  • Additional HDFS Tasks

    Rebalancing Blocks

    Copying Large Sets of Files

    Decommissioning Nodes

    Verifying File System Health

    Rack Awareness

    HDFS Web Interface

  • Rebalancing Blocks

    New nodes can be added to a cluster in a straightforward manner.

    On the new node, the same Hadoop version and configuration (conf/hadoop-site.xml) as on the rest of the cluster should be installed.

    Starting the DataNode daemon on the machine will cause it to contact the NameNode and join the cluster. (The new node should be added to the slaves file on the master server as well, to inform the master how to invoke script-based commands on the new node.)

    But the new DataNode will have no data on board initially; it is therefore not alleviating space concerns on the existing nodes.

    New files will be stored on the new DataNode in addition to the existing ones, but for optimum usage, storage should be evenly balanced across all nodes.

  • Rebalancing Blocks

    The Balancer class will intelligently balance blocks across the nodes to achieve an even distribution of blocks within a given threshold, expressed as a percentage. (The default is 10%.)

    The balancer script can be run by starting bin/start-balancer.sh in the Hadoop directory, e.g., bin/start-balancer.sh -threshold 5.

    The balancer can always be terminated safely by the administrator by running bin/stop-balancer.sh.

  • Rebalancing Blocks

    The balancing script can be run when nobody else is using the cluster (e.g., overnight), but it can also be run in an "online" fashion while many other jobs are ongoing.

    the dfs.balance.bandwidthPerSec configuration

    parameter can be used to limit the number of

    bytes/sec each node may devote to

    rebalancing its data store.

  • Copying Large Sets of Files

    Hadoop includes a tool called distcp.

    bin/hadoop distcp src dest, Hadoop will start

    a MapReduce task to distribute the burden of

    copying a large number of files from src to

    dest.

    The paths are assumed to be directories, and

    are copied recursively. S3 URLs can be

    specified with s3://bucket-name/key.

  • Decommissioning Nodes

    nodes can also be removed from a cluster while it is running, without data loss.

    But if nodes are simply shut down "hard," data loss may occur, as they may hold the sole copy of one or more file blocks.

    Nodes must be retired on a schedule that allows HDFS to ensure that no blocks are entirely replicated within the to-be-retired set of DataNodes.

  • Decommissioning Nodes.. Steps

    Step 1: Cluster configuration. If it is assumed that nodes may be retired in your cluster, then before it is started, an excludes file must be configured. Add a key named dfs.hosts.exclude to your conf/hadoop-site.xml file. The value associated with this key provides the full path to a file on the NameNode's local file system which contains a list of machines which are not permitted to connect to HDFS.

    Step 2: Determine hosts to decommission. Each machine to be decommissioned should be added to the file identified by dfs.hosts.exclude, one per line. This will prevent them from connecting to the NameNode.

  • Decommissioning Nodes.. Steps

    Step 3: Force configuration reload. Run the command bin/hadoop dfsadmin -refreshNodes. This will force the NameNode to reread its configuration, including the newly-updated excludes file. It will decommission the nodes over a period of time, allowing time for each node's blocks to be replicated onto machines which are scheduled to remain active.

    Step 4: Shutdown nodes. After the decommission process has completed, the decommissioned hardware can be safely shut down for maintenance, etc. The bin/hadoop dfsadmin -report command will describe which nodes are connected to the cluster.

    Step 5: Edit excludes file again. Once the machines have been decommissioned, they can be removed from the excludes file. Running bin/hadoop dfsadmin -refreshNodes again will read the excludes file back into the NameNode, allowing the DataNodes to rejoin the cluster after maintenance has been completed, or additional capacity is needed in the cluster again, etc.

  • Verifying File System Health

    Hadoop provides an fsck command to do exactly this

    bin/hadoop fsck [path] [options]

    bin/hadoop fsck <path> -files -blocks

    By default, fsck will not operate on files still open for write by another client. A list of such files can be produced with the -openforwrite option

  • Rack Awareness

    For larger Hadoop installations which span

    multiple racks, it is important to ensure that

    replicas of data exist on multiple racks.

    HDFS can be made rack-aware by the use of a

    script which allows the master node to map

    the network topology of the cluster.

  • Rack Awareness

    #!/bin/bash

    # Set rack id based on IP address.

    # Assumes network administrator has complete control

    # over IP addresses assigned to nodes and they are

    # in the 10.x.y.z address space. Assumes that

    # IP addresses are distributed hierarchically. e.g.,

    # 10.1.y.z is one data center segment and 10.2.y.z is another;

    # 10.1.1.z is one rack, 10.1.2.z is another rack in

    # the same segment, etc.)

    ##

    # This is invoked with an IP address as its only argument

    # get IP address from the input

    ipaddr=$1

    # select "x.y" and convert it to "x/y"

    segments=`echo $ipaddr | cut --delimiter=. --fields=2-3 --output-delimiter=/`

    echo /${segments}

  • HDFS Web Interface

    HDFS exposes a web server which is capable of performing basic status monitoring and file browsing operations.

    http://namenode:50070/

    The address and port where the web interface listens can be changed by setting dfs.http.address in conf/hadoop-site.xml.

    It must be of the form address:port. To accept requests on all addresses, use 0.0.0.0

  • HDFS Web Interface

    Each DataNode exposes its file browser interface on port 50075.

    You can override this by setting the dfs.datanode.http.address configuration key to a setting other than 0.0.0.0:50075.

    Log files generated by the Hadoop daemons can be accessed through this interface, which is useful for distributed debugging and troubleshooting.

  • Tutorial

    Copying Large Sets of Files

    Verifying File System Health

    HDFS Web Interface : features

    Tutorial-HDFSMiscelle.docx

  • Lecture 2

    MapReduce

  • Outline

    MapReduce: Programming Model

    MapReduce Examples

    A Brief History

    MapReduce Execution Overview

    Hadoop

    MapReduce Resources

  • MapReduce Basics

    MapReduce is designed to compute large volumes of data in a parallel fashion.

    All data elements in MapReduce are immutable.

    MapReduce programs transform lists of input data elements into lists of output data elements, using two list-processing idioms: map and reduce (a small Java sketch of these two idioms follows the figure slides below).

  • Mapping Lists

  • Reducing Lists

  • Combination Map Reduce
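    Since the Mapping Lists, Reducing Lists and Combination figures are not reproduced here, a tiny Hadoop-free Java sketch of the same two idioms (the numbers are arbitrary): map transforms every element of a list, and reduce folds the results into a single value.

    import java.util.Arrays;
    import java.util.List;

    public class ListIdioms {
      public static void main(String[] args) {
        List<Integer> input = Arrays.asList(1, 2, 3, 4);
        int result = input.stream()
                          .map(x -> x * x)           // mapping: 1, 4, 9, 16
                          .reduce(0, Integer::sum);  // reducing: 30
        System.out.println(result);                  // prints 30
      }
    }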

  • MapReduce

    A simple and powerful interface that enables

    automatic parallelization and distribution of

    large-scale computations, combined with an

    implementation of this interface that achieves

    high performance on large clusters of

    commodity PCs.

    Dean and Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Google Inc.

  • MapReduce

    More simply, MapReduce is: A parallel programming model and associated

    implementation.

  • Programming Model

    Description: the mental model the programmer has about the detailed execution of their application.

    Purpose: improve programmer productivity.

    Evaluation: expressibility, simplicity, performance.

  • Programming Models

    Parallel programming models:

    Message passing: independent tasks encapsulating local data; tasks interact by exchanging messages.

    Shared memory: tasks share a common address space; tasks interact by reading and writing this space asynchronously.

    Data parallelization: tasks execute a sequence of independent operations; data is usually evenly partitioned across tasks; also referred to as "embarrassingly parallel."

  • MapReduce:

    Programming Model

    Process data using special map() and reduce()

    functions

    The map() function is called on every item in the input and

    emits a series of intermediate key/value pairs

    All values associated with a given key are grouped together.

    The reduce() function is called on every unique key, and its

    value list, and emits a value that is added to the output

  • MapReduce:

    Programming Model

    [Figure: the MapReduce framework applied to word counting. The input lines "How now Brown cow" and "How does It work now" flow through map tasks (M) and reduce tasks (R) to produce the output counts: brown 1, cow 1, does 1, How 2, it 1, now 2, work 1.]

  • MapReduce:

    Programming Model

    More formally,

    Map(k1,v1) --> list(k2,v2)

    Reduce(k2, list(v2)) --> list(v2)
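    As a concrete instance of this model, here is a sketch of the classic word-count job written against the old org.apache.hadoop.mapred Java API used elsewhere in this course; the input and output paths are taken from the command line and are otherwise arbitrary.

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class WordCount {

      // map(k1, v1) --> list(k2, v2): emit (word, 1) for each word in the input line
      public static class Map extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            output.collect(word, one);
          }
        }
      }

      // Reduce(k2, list(v2)) --> list(v2): sum the counts for each unique word
      public static class Reduce extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
          int sum = 0;
          while (values.hasNext()) {
            sum += values.next().get();
          }
          output.collect(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);   // combiner runs the reduce logic locally on map output
        conf.setReducerClass(Reduce.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
      }
    }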

  • MapReduce Runtime System

    1. Partitions input data

    2. Schedules execution across a set of

    machines

    3. Handles machine failure

    4. Manages interprocess communication

  • MapReduce Benefits

    Greatly reduces parallel programming complexity

    Reduces synchronization complexity

    Automatically partitions data

    Provides failure transparency

    Handles load balancing

    Practical: approximately 1000 Google MapReduce jobs run every day.

  • MapReduce Examples

    Word frequency

    [Figure: the word-frequency example. A document (doc) flows through the runtime system, which runs the Map and Reduce tasks.]

  • MapReduce Examples

    Distributed grep

    The map function emits the line if it matches the search criteria.

    The reduce function is the identity function.

    URL access frequency

    The map function processes web logs and emits <URL, 1>.

    The reduce function sums the values and emits <URL, total count>.
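    A sketch of the distributed-grep example in the same old-API style; the search pattern "ERROR" and the command-line paths are made up, and the reducer is the identity, as stated above.

    import java.io.IOException;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class DistributedGrep {
      public static class GrepMapper extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, Text> {
        private static final String PATTERN = "ERROR";   // assumed search criterion
        private static final Text EMPTY = new Text("");

        public void map(LongWritable offset, Text line,
                        OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
          if (line.toString().contains(PATTERN)) {
            out.collect(line, EMPTY);   // key = the matching line, value unused
          }
        }
      }

      public static void main(String[] args) throws IOException {
        JobConf job = new JobConf(DistributedGrep.class);
        job.setJobName("distributed-grep");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setMapperClass(GrepMapper.class);
        job.setReducerClass(IdentityReducer.class);   // identity reduce, as on the slide
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        JobClient.runJob(job);
      }
    }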

  • A Brief History

    Functional programming (e.g., Lisp)

    map() function

    Applies a function to each value of a sequence

    reduce() function

    Combines all elements of a sequence using a binary

    operator

  • MapReduce Execution Overview

    1. The user program, via the MapReduce

    library, shards the input data

    [Figure: the user program's input data is divided into shards (Shard 0 through Shard 6).]

    * Shards are typically 16-64 MB in size.

  • MapReduce Execution Overview

    2. The user program creates process copies distributed on a machine cluster. One copy will be the Master and the others will be worker threads.

    [Figure: the user program forks a Master process and multiple worker processes.]

  • MapReduce Execution Overview

    3. The master distributes M map and R reduce

    tasks to idle workers.

    M == number of shards

    R == the intermediate key space is divided into R

    parts

    [Figure: the Master sends a Do_map_task message to an idle worker.]

  • MapReduce Execution Overview

    4. Each map-task worker reads assigned input

    shard and outputs intermediate key/value

    pairs.

    Output is buffered in RAM.

    [Figure: a map worker reads Shard 0 and produces intermediate key/value pairs.]

  • MapReduce Execution Overview

    5. Each worker flushes intermediate values,

    partitioned into R regions, to disk and

    notifies the Master process.

    [Figure: the map worker writes intermediate data to local storage and sends the disk locations to the Master.]

  • MapReduce Execution Overview

    6. Master process gives disk locations to an

    available reduce-task worker who reads all

    associated intermediate data.

    [Figure: the Master passes the disk locations to a reduce worker, which reads the intermediate data from the map workers' (remote) storage.]

  • MapReduce Execution Overview

    7. Each reduce-task worker sorts its intermediate data, then calls the reduce function, passing in unique keys and their associated values. The reduce function's output is appended to the reduce task's partition output file.

    [Figure: a reduce worker sorts its data and writes a partition output file.]

  • MapReduce Execution Overview

    8. Master process wakes up user process when

    all tasks have completed. Output contained

    in R output files.

    [Figure: the Master wakes up the user program; the results are in the R output files.]

  • MapReduce Execution Overview

    Fault Tolerance

    Master process periodically pings workers

    Map-task failure

    Re-execute

    All output was stored locally

    Reduce-task failure

    Only re-execute partially completed tasks

    All output stored in the global file system

  • Tutorial:

    Running a MapReduce Program

    Tutorial-MapReduce.docx