Hadoop HP Day1


Transcript of Hadoop HP Day1

  • What Is Hadoop? A distributed computing framework:

    For clusters of computers: thousands of compute nodes, petabytes of data.

    Open source, written in Java. Google's MapReduce inspired Yahoo's Hadoop, which is now part of the Apache group.

  • What is Hadoop? The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. Hadoop includes:

    Hadoop Common: the common utilities.
    Avro: a data serialization system with scripting-language support.
    Chukwa: a data collection system for managing large distributed systems.
    HBase: a scalable, distributed database for large tables.
    HDFS: a distributed file system.
    Hive: data summarization and ad hoc querying.
    MapReduce: distributed processing on compute clusters.
    Pig: a high-level data-flow language for parallel computation.
    ZooKeeper: a coordination service for distributed applications.

  • Problem Scope

    Hadoop is a large-scale distributed batch processing infrastructure.

    It scales to hundreds or thousands of computers, each with several processor cores.

    It efficiently distributes large amounts of work across a set of machines.

  • How large an amount of work? Hundreds of gigabytes of data are the low end of Hadoop-scale.

    Hadoop is built to process "web-scale" data (hundreds of gigabytes to terabytes or petabytes).

    It includes a distributed file system that breaks up input data and distributes it across several machines.

  • Challenges at Large Scale

    Performing large-scale computation is difficult: as the scale grows, the probability of failures rises. In a distributed environment, partial failures are an expected and common occurrence.

    Individual compute nodes may overheat, crash, experience hard drive failures, or run out of memory or disk space. Data may be corrupted, or maliciously or improperly transmitted. Clocks may become desynchronized, and lock files may not be released.

    The rest of the distributed system should be able to recover from the component failure or transient error condition and continue to make progress.

  • Challenges at Large Scale (contd.)

    Hadoop is designed to handle hardware failure and data congestion issues very robustly.

    Compute hardware has finite resources: processor time, memory, hard drive space, and network bandwidth.

  • Moore's Law

    Moore's Law (named after Gordon Moore, a co-founder of Intel) states that the number of transistors that can be placed in a processor will double approximately every two years, for half the cost.

  • The Hadoop Approach

    Efficiently process large volumes of information by connecting many commodity computers together to work in parallel.

    These smaller machines are tied together into a single cost-effective compute cluster.

  • Comparison to Existing Techniques

    Hadoop vs. Condor:

    Hadoop: a simplified programming model; efficient, automatic distribution of data and work across machines.

    Condor: does not automatically distribute data; a separate SAN must be managed in addition to the compute cluster; collaboration between multiple compute nodes must be managed with a communication system such as MPI.

  • Data Distribution - Hadoop

    In a Hadoop cluster, data is distributed to all the nodes of the cluster as it is being loaded in.

    The Hadoop Distributed File System (HDFS) will split large data files into chunks.

    Each chunk is replicated across several machines.

    An active monitoring system then re-replicates the data in response to system failures.

  • Data Distribution - Hadoop (contd.)

    Data is conceptually record-oriented. Individual input files are broken into lines or into other formats specific to the application logic.

    Each process running on a node in the cluster then processes a subset of these records.

    Data is read from the local disk straight into the CPU, alleviating strain on network bandwidth and preventing unnecessary network transfers.

  • Data is distributed across nodes at load time.

  • MapReduce: Isolated Processes

    Hadoop limits the amount of communication between nodes.

    Hadoop will not run just any program and distribute it across a cluster.

    Programs must be written to conform to a particular programming model, named "MapReduce."

  • Flat Scalability

    One of the major benefits of using Hadoop in contrast to other distributed systems is its flat scalability curve.

    A program written in distributed frameworks other than Hadoop may require large amounts of refactoring when scaling from ten to one hundred or one thousand machines.

    After a Hadoop program is written and functioning on ten nodes, very little, if any, work is required for that same program to run on a much larger amount of hardware.

  • Hadoop Installation Preparation: Demo

  • Steps: Hadoop Installation Preparation

    Install VM Player. Import the RedHat Linux VM into VM Player. Start the VM Player. Use root/root123 to log on to the VM. Follow the tutorial.

  • Hadoop File System


  • The Hadoop Distributed File System

    HDFS is a distributed file system designed to hold very large amounts of data (terabytes or even petabytes) and to provide high-throughput access to this information.

    Files are stored in a redundant fashion across multiple machines to ensure their durability to failure and their high availability to highly parallel applications.

  • Basic Features: HDFS

    Highly fault-tolerant

    High throughput

    Suitable for applications with large data sets

    Streaming access to file system data

    Can be built out of commodity hardware

  • Fault tolerance

    Failure is the norm rather than the exception.

    An HDFS instance may consist of thousands of server machines, each storing part of the file system's data.

    Since we have a huge number of components, and each component has a non-trivial probability of failure, there is always some component that is non-functional.

    Detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.

  • Data Characteristics

    Streaming data access

    Applications need streaming access to data

    Batch processing rather than interactive user access.

    Large data sets and files: gigabytes to terabytes size

    High aggregate data bandwidth

    Scale to hundreds of nodes in a cluster

    Tens of millions of files in a single instance

    Write-once-read-many: a file, once created, written and closed, need not be changed; this assumption simplifies coherency.

    A map-reduce application or a web-crawler application fits perfectly with this model.

  • MapReduce

    [Figure: word-count dataflow. A terabyte-scale input of words (Cat, Bat, Dog, other words) is divided into splits; each split is processed by a map task, the map outputs are combined locally, and the reduce tasks produce the output partitions part0, part1 and part2.]

  • ARCHITECTURE


  • Namenode and Datanodes

    Master/slave architecture

    An HDFS cluster consists of a single Namenode, a master server that manages the file system namespace and regulates access to files by clients.

    There are a number of DataNodes, usually one per node in the cluster.

    The DataNodes manage storage attached to the nodes that they run on.

    HDFS exposes a file system namespace and allows user data to be stored in files.

    A file is split into one or more blocks, and the set of blocks is stored on DataNodes.

    DataNodes serve read and write requests and perform block creation, deletion, and replication upon instruction from the Namenode.

  • HDFS Architecture

    [Figure: HDFS architecture. The Namenode holds the metadata (name, replicas, ...; e.g. /home/foo/data, 6, ...) and handles clients' metadata operations and the Datanodes' block operations. Clients read blocks directly from Datanodes and write blocks to Datanodes, which replicate them across racks (Rack 1, Rack 2).]

  • File system Namespace

    Hierarchical file system with directories and files

    Create, remove, move, rename etc.

    The Namenode maintains the file system.

    Any meta-information changes to the file system are recorded by the Namenode.

    An application can specify the number of replicas of a file it needs: the replication factor of the file. This information is stored in the Namenode.

  • Data Replication

    HDFS is designed to store very large files across machines in a large cluster.

    Each file is a sequence of blocks.

    All blocks in the file except the last are of the same size.

    Blocks are replicated for fault tolerance.

    Block size and replicas are configurable per file.

    The Namenode receives a Heartbeat and a BlockReport from each DataNode in the cluster.

    BlockReport contains all the blocks on a Datanode.
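    To make "block size and replicas are configurable per file" concrete, here is a minimal, hedged Java sketch; the path and values are made up for illustration, and the calls shown are the generic HDFS client API rather than anything specific to this course.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PerFileSettings {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path p = new Path("/user/someone/bigfile.dat");          // hypothetical file
        // Create the file with an explicit replication factor (3) and block size (128 MB):
        // create(path, overwrite, bufferSize, replication, blockSize)
        fs.create(p, true, 4096, (short) 3, 128 * 1024 * 1024L).close();
        // Change the replication factor of an existing file to 2:
        fs.setReplication(p, (short) 2);
      }
    }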

  • Replica Placement

    The placement of the replicas is critical to HDFS reliability and performance.

    Optimizing replica placement distinguishes HDFS from other distributed file systems.

    Rack-aware replica placement. Goal: improve reliability, availability and network bandwidth utilization.

    There are many racks, and communication between racks goes through switches.

    Network bandwidth between machines on the same rack is greater than between machines on different racks.

    The Namenode determines the rack id for each DataNode.

  • Replica Placement

    Replicas are typically placed on unique racks

    Simple but non-optimal

    Writes are expensive

    Replication factor is 3

    Replicas are placed: one on a node in the local rack, one on a different node in the local rack, and one on a node in a different rack.

    With this policy, 1/3 of the replicas are on one node, 2/3 are on one rack, and the rest are distributed evenly across the remaining racks.

  • Replica Selection

    Replica selection for READ operation: HDFS tries

    to minimize the bandwidth consumption and

    latency.

    If there is a replica on the reader's node, then that replica is preferred.

    HDFS cluster may span multiple data centers:

    replica in the local data center is preferred over

    the remote one.

  • Safemode Startup

    On startup, the Namenode enters Safemode.

    Replication of data blocks does not occur in Safemode.

    Each DataNode checks in with a Heartbeat and a BlockReport.

    The Namenode verifies that each block has an acceptable number of replicas.

    After a configurable percentage of safely replicated blocks check in with the Namenode, Namenode exits Safemode.

    It then makes the list of blocks that need to be replicated.

    Namenode then proceeds to replicate these blocks to other Datanodes.

  • Filesystem Metadata

    The HDFS namespace is stored by Namenode.

    The Namenode uses a transaction log called the EditLog to record every change that occurs to the filesystem metadata: for example, creating a new file, or changing the replication factor of a file.

    The EditLog is stored in the Namenode's local filesystem.

    The entire filesystem namespace, including the mapping of blocks to files and the file system properties, is stored in a file called FsImage, also stored in the Namenode's local filesystem.

  • Namenode

    Keeps image of entire file system namespace and file Blockmap in memory.

    4GB of local RAM is sufficient to support the above data structures that represent the huge number of files and directories.

    When the Namenode starts up, it gets the FsImage and EditLog from its local file system, updates the FsImage with the EditLog information, and then stores a copy of the FsImage on the filesystem as a checkpoint.

    Periodic checkpointing is done so that the system can recover back to the last checkpointed state in case of a crash.

  • Datanode

    A Datanode stores data in files in its local file system.

    Datanode has no knowledge about HDFS filesystem

    It stores each block of HDFS data in a separate file.

    Datanode does not create all files in the same directory.

    It uses heuristics to determine the optimal number of files per directory and creates directories appropriately.

    Research issue?

    When the filesystem starts up, it generates a list of all HDFS blocks and sends this report to the Namenode: the Blockreport.

  • Configuring HDFS

    Cluster configuration

    The HDFS configuration is located in a set of XML files in the Hadoop configuration directory, conf/.

  • hadoop-defaults.xml

    This file contains default values for every parameter in Hadoop.

    This file is considered read-only.

    You override this configuration by setting new values in hadoop-site.xml.

    hadoop-site.xml should be replicated consistently across all machines in the cluster. (It is also possible, though not advisable, to host it on NFS.)

  • hadoop-site.xml

    Configuration settings are a set of key-value pairs of the format:

    <property>
      <name>property-name</name>
      <value>property-value</value>
    </property>

    Adding the line <final>true</final> inside the property body will prevent properties from being overridden by user applications.

  • hadoop-site.xml

    The following settings are necessary to configure

    HDFS:

    key    value    example

    fs.default.name protocol://servername:port hdfs://alpha.milkman.org:9000

    dfs.data.dir pathname /home/username/hdfs/data

    dfs.name.dir pathname /home/username/hdfs/name

  • A single-node configuration:

    hadoop-site.xml

    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://your.server.name.com:9000</value>
      </property>
      <property>
        <name>dfs.data.dir</name>
        <value>/home/username/hdfs/data</value>
      </property>
      <property>
        <name>dfs.name.dir</name>
        <value>/home/username/hdfs/name</value>
      </property>
    </configuration>

    * After copying this information into your conf/hadoop-site.xml file, copy this to the conf/ directories on all machines in the cluster.
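    For illustration only, a minimal Java sketch of how client code sees these settings: in this Hadoop generation a new Configuration() is expected to pick up the Hadoop defaults and then hadoop-site.xml from the classpath. The class name is made up; the property names are the ones listed above.

    import org.apache.hadoop.conf.Configuration;

    public class ShowConf {
      public static void main(String[] args) {
        // Loads the Hadoop defaults, then hadoop-site.xml, from the classpath
        Configuration conf = new Configuration();
        System.out.println("fs.default.name = " + conf.get("fs.default.name"));
        System.out.println("dfs.data.dir    = " + conf.get("dfs.data.dir"));
        System.out.println("dfs.name.dir    = " + conf.get("dfs.name.dir"));
      }
    }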

  • Starting HDFS

    Format the file system that was just configured:

    user@namenode:hadoop$ bin/hadoop namenode -format

    This process should only be performed once. When it is complete, you are free to start the distributed file system:

    user@namenode:hadoop$ bin/start-dfs.sh

    This command will start the NameNode server on the master machine (which is where the start-dfs.sh script was invoked).

    It will also start the DataNode instances on each of the slave machines.

    In a single-machine "cluster," this is the same machine as the NameNode instance.

    On a real cluster of two or more machines, this script will ssh into each slave machine and start a DataNode instance.

  • Interacting With HDFS

    Command format: user@machine:hadoop$ bin/hadoop moduleName -cmd args...

    The moduleName tells the program which subset of Hadoop functionality to use. -cmd

    is the name of a specific command within this module to execute. Its arguments

    follow the command name.

    Two such modules are relevant to HDFS: dfs and dfsadmin.

  • Tutorial:

    Hadoop - Single Node :Installation

    Tutorial-InstallationHDFSSingleNode.docx

  • PROTOCOL


  • The Communication Protocol

    All HDFS communication protocols are layered on top of the TCP/IP protocol

    A client establishes a connection to a configurable TCP port on the Namenode machine. It talks the ClientProtocol with the Namenode.

    The Datanodes talk to the Namenode using the Datanode protocol.

    RPC abstraction wraps both ClientProtocol and Datanode protocol.

    Namenode is simply a server and never initiates a request; it only responds to RPC requests issued by DataNodes or clients.

  • ROBUSTNESS


  • Objectives

    The primary objective of HDFS is to store data reliably in the presence of failures.

    Three common failures are: Namenode failure, Datanode failure and network partition.

  • DataNode failure and heartbeat

    A network partition can cause a subset of Datanodes to lose connectivity with the Namenode.

    Namenode detects this condition by the absence of a Heartbeat message.

    The Namenode marks Datanodes without a recent Heartbeat as dead and does not send any IO requests to them.

    Any data registered to the failed Datanode is not available to the HDFS.

    Also the death of a Datanode may cause replication factor of some of the blocks to fall below their specified value.


  • Re-replication

    The necessity for re-replication may arise due

    to:

    A Datanode may become unavailable,

    A replica may become corrupted,

    A hard disk on a Datanode may fail, or

    The replication factor on the block may be

    increased.


  • Cluster Rebalancing

    HDFS architecture is compatible with data rebalancing schemes.

    A scheme might move data from one Datanode to another if the free space on a Datanode falls below a certain threshold.

    In the event of a sudden high demand for a particular file, a scheme might dynamically create additional replicas and rebalance other data in the cluster.


  • Data Integrity

    Consider a situation: a block of data fetched from Datanode arrives corrupted.

    This corruption may occur because of faults in a storage device, network faults, or buggy software.

    An HDFS client creates a checksum of every block of its file and stores these checksums in hidden files in the HDFS namespace.

    When a client retrieves the contents of a file, it verifies that the corresponding checksums match.

    If they do not match, the client can retrieve the block from a replica.
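    To make the verify-on-read idea concrete, here is a small, HDFS-free Java sketch. This is not HDFS's internal checksum code; it simply shows a CRC32 over a block's bytes, with the block contents made up.

    import java.util.zip.CRC32;

    public class BlockChecksumDemo {
      // Compute a CRC32 checksum over one block's bytes
      static long checksumOf(byte[] block) {
        CRC32 crc = new CRC32();
        crc.update(block, 0, block.length);
        return crc.getValue();
      }

      public static void main(String[] args) {
        byte[] block = "example block contents".getBytes();
        long stored = checksumOf(block);       // computed when the block was written
        long recomputed = checksumOf(block);   // recomputed when the block is read
        if (recomputed != stored) {
          System.err.println("Block corrupted: fetch it from another replica");
        } else {
          System.out.println("Checksum OK");
        }
      }
    }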

  • Metadata Disk Failure

    FsImage and EditLog are central data structures of HDFS.

    A corruption of these files can cause an HDFS instance to be non-functional.

    For this reason, a Namenode can be configured to maintain multiple copies of the FsImage and EditLog.

    Multiple copies of the FsImage and EditLog files are updated

    synchronously.

    Meta-data is not data-intensive.

    The Namenode could be a single point of failure: automatic failover is NOT supported.

  • DATA ORGANIZATION


  • Data Blocks

    HDFS supports write-once-read-many semantics with reads at streaming speeds.

    A typical block size is 64 MB (or even 128 MB).

    A file is chopped into 64 MB chunks and stored; for example, a 1 GB file becomes sixteen 64 MB blocks.

  • Staging

    A client request to create a file does not reach Namenode immediately.

    The HDFS client caches the data into a temporary file. When the data reaches an HDFS block size, the client contacts the Namenode.

    Namenode inserts the filename into its hierarchy and allocates a data block for it.

    The Namenode responds to the client with the identity of the Datanode and the destination of the replicas (Datanodes) for the block.

    Then the client flushes the block of data from the local temporary file to the specified Datanode.

  • Staging (contd.)

    The client sends a message that the file is closed.

    The Namenode then commits the file creation operation into its persistent store.

    If the Namenode dies before file is closed, the file is lost.

    This client-side caching is required to avoid network congestion; it also has precedent in AFS (the Andrew File System).

  • Replication Pipelining

    When the client receives the response from the Namenode, it flushes its block in small pieces (4 KB) to the first replica, which in turn copies it to the next replica, and so on.

    Thus data is pipelined from one Datanode to the next.

  • API (ACCESSIBILITY)


  • Application Programming Interface

    HDFS provides a Java API for applications to use.

    Python access is also used in many applications.

    A C language wrapper for the Java API is also available.

    An HTTP browser can be used to browse the files of an HDFS instance.

  • FS Shell, Admin and Browser Interface

    HDFS organizes its data in files and directories.

    It provides a command line interface called the FS shell that lets the user interact with data in the HDFS.

    The syntax of the commands is similar to bash and csh.

    Example: to create a directory /foodir:

    bin/hadoop dfs -mkdir /foodir

    There is also a DFSAdmin interface available.

    A browser interface is also available to view the namespace.
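    The same mkdir operation can also be done through the Java API; a minimal sketch (the class name is illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class MakeDir {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        fs.mkdirs(new Path("/foodir"));   // equivalent of: bin/hadoop dfs -mkdir /foodir
      }
    }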

  • Space Reclamation

    When a file is deleted by a client, HDFS renames the file to a file in the /trash directory for a configurable amount of time.

    A client can request an undelete during this allowed time.

    After the specified time, the file is deleted and the space is reclaimed.

    When the replication factor is reduced, the Namenode selects excess replicas that can be deleted.

    The next Heartbeat transfers this information to the Datanode, which clears the blocks for reuse.

  • HDFS & GFS

    The design of HDFS is based on the design of GFS, the Google File System.

    HDFS is a block-structured file system: individual files are broken into blocks of a fixed size.

    These blocks are stored across a cluster of one or more machines with data storage capacity. Individual machines in the cluster are referred to as DataNodes.

  • HDFS Characteristics

    A file can be made of several blocks, and they are

    not necessarily stored on the same machine; the

    target machines which hold each block are

    chosen randomly on a block-by-block basis.

    Thus access to a file may require the cooperation of

    multiple machines, but supports file sizes far

    larger than a single-machine DFS; individual files

    can require more space than a single hard drive

    could hold.

  • Replication in HDFS

    DataNodes holding blocks of multiple files with a replication factor of 2.

    The NameNode maps the filenames onto the block ids.

  • Common Example Operations

  • Listing files

    someone@anynode:hadoop$ bin/hadoop dfs -ls

    someone@anynode:hadoop$

    someone@anynode:hadoop$ bin/hadoop dfs -ls /

    Found 2 items

    drwxr-xr-x - hadoop supergroup 0 2008-09-20 19:40 /hadoop

    drwxr-xr-x - hadoop supergroup 0 2008-09-20 20:08 /tmp

  • Create home directory, and then

    populate it with some files. Step 1: Create your home directory if it does not already exist.

    someone@anynode:hadoop$ bin/hadoop dfs -mkdir /user

    someone@anynode:hadoop$ bin/hadoop dfs -mkdir /user/someone

    Step 2: Upload a file. To insert a single file into HDFS, we can use the put command like so:

    someone@anynode:hadoop$ bin/hadoop dfs -put /home/someone/interestingFile.txt /user/yourUserName/

    Step 3: Verify the file is in HDFS. We can verify that the operation worked with either of the two following (equivalent) commands:

    someone@anynode:hadoop$ bin/hadoop dfs -ls /user/yourUserName

    someone@anynode:hadoop$ bin/hadoop dfs -ls

  • Uploading

    Step 4: Uploading multiple files at once. The put command is more powerful than moving a single file at a time. It can also be used to upload entire directory trees into HDFS.

    Create a local directory and put some files into it using the cp command. Our example user may have a situation like the following:

    someone@anynode:hadoop$ ls -R myfiles

    myfiles:

    file1.txt file2.txt subdir/

    myfiles/subdir:

    anotherFile.txt

    someone@anynode:hadoop$

    This entire myfiles/ directory can be copied into HDFS like so:

    someone@anynode:hadoop$ bin/hadoop dfs -put myfiles /user/myUsername

    someone@anynode:hadoop$ bin/hadoop dfs -ls

    Found 1 items

    /user/someone/myfiles 2008-06-12 20:59 rwxr-xr-x someone supergroup

    user@anynode:hadoop$ bin/hadoop dfs -ls myfiles

    Found 3 items

    /user/someone/myfiles/file1.txt 186731 2008-06-12 20:59 rw-r--r-- someone supergroup

    /user/someone/myfiles/file2.txt 168 2008-06-12 20:59 rw-r--r-- someone supergroup

    /user/someone/myfiles/subdir 2008-06-12 20:59 rwxr-xr-x someone supergroup

  • Uploading of Files

    Uploading a file into HDFS first copies the data onto the DataNodes.

    When they all acknowledge that they have received all the data and the file handle is closed, it is then made visible to the rest of the system.

    Thus, based on the return value of the put command, you can be confident that a file has either been successfully uploaded or has "fully failed."

    You will never get into a state where a file is partially uploaded and the partial contents are visible externally, but where the upload disconnected and did not complete the entire file contents. In a case like this, it will be as though no upload took place.

  • uses of the put command

    Command: bin/hadoop dfs -put foo bar
    Assuming: no file/directory named /user/$USER/bar exists in HDFS
    Outcome: uploads local file foo to a file named /user/$USER/bar

    Command: bin/hadoop dfs -put foo bar
    Assuming: /user/$USER/bar is a directory
    Outcome: uploads local file foo to a file named /user/$USER/bar/foo

    Command: bin/hadoop dfs -put foo somedir/somefile
    Assuming: /user/$USER/somedir does not exist in HDFS
    Outcome: uploads local file foo to a file named /user/$USER/somedir/somefile, creating the missing directory

    Command: bin/hadoop dfs -put foo bar
    Assuming: /user/$USER/bar is already a file in HDFS
    Outcome: no change in HDFS, and an error is returned to the user

  • Retrieving data from HDFS

    Step 1: Display data with cat.

    someone@anynode:hadoop$ bin/hadoop dfs -cat foo

    (contents of foo are displayed here)

    someone@anynode:hadoop$

    Step 2: Copy a file from HDFS to the local file system.

    The get command is the inverse operation of put; it will copy a file or directory (recursively) from HDFS into the target of your choosing on the local file system. A synonymous operation is called -copyToLocal.

    someone@anynode:hadoop$ bin/hadoop dfs -get foo localFoo

    someone@anynode:hadoop$ ls

    localFoo

    someone@anynode:hadoop$ cat localFoo

    (contents of foo are displayed here)

  • Shutting Down HDFS

    someone@namenode:hadoop$ bin/stop-dfs.sh

    This command must be performed by the same user who started HDFS with bin/start-dfs.sh.

  • HDFS Command Reference

    Running bin/hadoop dfs with no additional

    arguments will list all commands which can be

    run with the FsShell system

    bin/hadoop dfs -help commandName will

    display a short usage summary for the

    operation in question

  • HDFS Command Reference

    Command Operation

    -ls path : Lists the contents of the directory specified by path, showing the names, permissions, owner, size and modification date for each entry.

    -lsr path : Behaves like -ls, but recursively displays entries in all subdirectories of path.

    -du path : Shows disk usage, in bytes, for all files which match path; filenames are reported with the full HDFS protocol prefix.

    -dus path : Like -du, but prints a summary of disk usage of all files/directories in the path.

    -mv src dest : Moves the file or directory indicated by src to dest, within HDFS.

    -cp src dest : Copies the file or directory identified by src to dest, within HDFS.

    -rm path : Removes the file or empty directory identified by path.

    -moveToLocal [-crc] src localDest : Works like -get, but deletes the HDFS copy on success.

  • HDFS Command Reference

    Command Operation

    -get [-crc] src localDest : Copies the file or directory in HDFS identified by src to the local file system path identified by localDest.

    -getmerge src localDest [addnl] : Retrieves all files that match the path src in HDFS, and copies them to a single, merged file in the local file system identified by localDest.

    -cat filename : Displays the contents of filename on stdout.

    -copyToLocal [-crc] src localDest : Identical to -get.

    -moveToLocal [-crc] src localDest : Works like -get, but deletes the HDFS copy on success.

  • Tutorial

    HDFS Command.

    Tutorial-HDFSComand.docx

  • DFSAdmin Command Reference

    The dfs module provides common file and directory manipulation commands; they all work with objects within the file system.

    The dfsadmin module manipulates or queries the file system as a whole.

    Getting overall status: bin/hadoop dfsadmin -report. This returns basic information about the overall health of the HDFS cluster, as well as some per-server metrics.

    More involved status: bin/hadoop dfsadmin -metasave filename will record this

    information in filename. The metasave command will enumerate lists of blocks which are under-replicated, in the process of being replicated, and scheduled for deletion.

  • Safemode:

    Safemode: the file system is mounted read-only

    no replication is performed

    nor can files be created or deleted.

    Safemode is entered automatically as the NameNode starts, to allow all DataNodes time to check in with the NameNode.

    It waits until a specific percentage of the blocks are present and accounted for (dfs.safemode.threshold.pct).

    Safemode can be manipulated with bin/hadoop dfsadmin -safemode what, where what is one of:

    enter - Enters safemode

    leave - Forces the NameNode to exit safemode

    get - Returns a string indicating whether safemode is ON or OFF

    wait - Waits until safemode has exited and returns

  • HDFS

    Changing HDFS membership - When

    decommissioning nodes, it is important to

    disconnect nodes from HDFS gradually to

    ensure that data is not lost.

  • Upgrading HDFS versions

    # bin/start-dfs.sh -upgrade.

    It will then begin upgrading the HDFS version.

    # bin/hadoop dfsadmin -upgradeProgress status

    # bin/hadoop dfsadmin -upgradeProgress force

  • Upgrading HDFS versions.

    bin/start-dfs.sh -rollback.

    It will restore the previous HDFS state.

    Only one such archival copy can be kept at a time.

    bin/hadoop dfsadmin -finalizeUpgrade.

    The rollback command cannot be issued after this

    point. This must be performed before a second

    Hadoop upgrade is allowed

  • Getting help

    bin/hadoop dfsadmin -help cmd

  • Using HDFS in MapReduce

    HDFS is a powerful companion to Hadoop MapReduce.

    By setting the fs.default.name configuration option to point to the NameNode, Hadoop MapReduce jobs will automatically draw their input files from HDFS.

    Using the regular FileInputFormat subclasses, Hadoop will automatically draw its input data sources from file paths within HDFS, and will distribute the work over the cluster in an intelligent fashion to exploit block locality where possible.
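    A minimal sketch of what this path resolution looks like from client code, assuming fs.default.name is configured as described earlier; the paths reuse the document's earlier examples and the hostname is the document's placeholder.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PathResolution {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // A plain path handed to a FileInputFormat resolves against fs.default.name ...
        Path implicitPath = new Path("/user/someone/input");
        // ... and is equivalent to the fully-qualified HDFS URI:
        Path explicitPath = new Path("hdfs://your.server.name.com:9000/user/someone/input");
        System.out.println(fs.makeQualified(implicitPath));
        System.out.println(explicitPath);
      }
    }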

  • Using HDFS Programmatically

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.Path;

    public class HDFSHelloWorld {

      public static final String theFilename = "hello.txt";
      public static final String message = "Hello, world!\n";

      public static void main(String[] args) throws IOException {

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path filenamePath = new Path(theFilename);

        try {
          if (fs.exists(filenamePath)) {
            // remove the file first
            fs.delete(filenamePath);
          }

          FSDataOutputStream out = fs.create(filenamePath);
          out.writeUTF(message);
          out.close();

          FSDataInputStream in = fs.open(filenamePath);
          String messageIn = in.readUTF();
          System.out.print(messageIn);
          in.close();
        } catch (IOException ioe) {
          System.err.println("IOException during operation: " + ioe.toString());
          System.exit(1);
        }
      }
    }

  • HDFS Permissions and Security

    HDFS security is based on the POSIX model of users and groups.

    Each file or directory has three permissions (read, write and execute) associated with it at three different granularities: the file's owner, users in the same group as the owner, and all other users in the system.

    As the HDFS does not provide the full POSIX spectrum of activity, some combinations of bits will be meaningless.

    For example, no file can be executed; the +x bits cannot be set on files (only directories). Nor can an existing file be written to, although the +w bits may still be set.

  • HDFS Permissions and Security

    Security permissions and ownership can be modified using the bin/hadoop dfs -chmod, -chown, and -chgrp commands.

    They work in a similar fashion to the POSIX/Linux tools of the same name.
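    For comparison, a hedged sketch of the equivalent calls in the Java API; the path, mode, owner and group are illustrative, and changing ownership normally requires superuser status, as discussed on the next slide.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.FsPermission;

    public class PermissionDemo {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path p = new Path("/user/someone/report.txt");         // hypothetical file
        fs.setPermission(p, new FsPermission((short) 0644));   // like: dfs -chmod 644
        fs.setOwner(p, "someone", "supergroup");               // like: dfs -chown someone:supergroup
      }
    }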

  • Security

    Superuser status - The username which was

    used to start the Hadoop process (i.e., the

    username who actually ran bin/start-all.sh or

    bin/start-dfs.sh) is acknowledged to be the

    superuser for HDFS.

    If Hadoop is shutdown and restarted under a

    different username, that username is then

    bound to the superuser account.

  • Tutorial

    Showing Security and dfsadmin command

    Tutorial-HDFSAdmin&SecurityComand.docx

  • Additional HDFS Tasks

    Rebalancing Blocks

    Copying Large Sets of Files

    Decommissioning Nodes

    Verifying File System Health

    Rack Awareness

    HDFS Web Interface

  • Rebalancing Blocks

    New nodes can be added to a cluster in a straightforward manner.

    On the new node, the same Hadoop version and configuration (conf/hadoop-site.xml) as on the rest of the cluster should be installed.

    Starting the DataNode daemon on the machine will cause it to contact the NameNode and join the cluster. (The new node should be added to the slaves file on the master server as well, to inform the master how to invoke script-based commands on the new node.)

    But the new DataNode will have no data on board initially; it is therefore not alleviating space concerns on the existing nodes.

    New files will be stored on the new DataNode in addition to the existing ones, but for optimum usage, storage should be evenly balanced across all nodes.

  • Rebalancing Blocks

    The Balancer class will intelligently balance blocks across the nodes to achieve an even distribution of blocks within a given threshold, expressed as a percentage. (The default is 10%.)

    The balancer script can be run by starting bin/start-balancer.sh in the Hadoop directory, e.g., bin/start-balancer.sh -threshold 5.

    The balancer can always be terminated safely by the administrator by running bin/stop-balancer.sh.

  • Rebalancing Blocks

    The balancing script can be run when nobody else is using the cluster (e.g., overnight), but it can also be run in an "online" fashion while many other jobs are ongoing.

    the dfs.balance.bandwidthPerSec configuration

    parameter can be used to limit the number of

    bytes/sec each node may devote to

    rebalancing its data store.

  • Copying Large Sets of Files

    Hadoop includes a tool called distcp.

    bin/hadoop distcp src dest, Hadoop will start

    a MapReduce task to distribute the burden of

    copying a large number of files from src to

    dest.

    The paths are assumed to be directories, and

    are copied recursively. S3 URLs can be

    specified with s3://bucket-name/key.

  • Decommissioning Nodes

    nodes can also be removed from a cluster while it is running, without data loss.

    But if nodes are simply shut down "hard," data loss may occur, as they may hold the sole copy of one or more file blocks.

    Nodes must be retired on a schedule that allows HDFS to ensure that no blocks are entirely replicated within the to-be-retired set of DataNodes.

  • Decommissioning Nodes.. Steps

    Step 1: Cluster configuration. If it is assumed that nodes may be retired in your cluster, then before it is started, an excludes file must be configured. Add a key named dfs.hosts.exclude to your conf/hadoop-site.xml file. The value associated with this key provides the full path to a file on the NameNode's local file system which contains a list of machines which are not permitted to connect to HDFS.

    Step 2: Determine hosts to decommission. Each machine to be decommissioned should be added to the file identified by dfs.hosts.exclude, one per line. This will prevent them from connecting to the NameNode.

  • Decommissioning Nodes.. Steps

    Step 3: Force configuration reload. Run the command bin/hadoop dfsadmin -refreshNodes. This will force the NameNode to reread its configuration, including the newly-updated excludes file. It will decommission the nodes over a period of time, allowing time for each node's blocks to be replicated onto machines which are scheduled to remain active.

    Step 4: Shutdown nodes. After the decommission process has completed, the decommissioned hardware can be safely shut down for maintenance, etc. The bin/hadoop dfsadmin -report command will describe which nodes are connected to the cluster.

    Step 5: Edit excludes file again. Once the machines have been decommissioned, they can be removed from the excludes file. Running bin/hadoop dfsadmin -refreshNodes again will read the excludes file back into the NameNode, allowing the DataNodes to rejoin the cluster after maintenance has been completed, or additional capacity is needed in the cluster again, etc.

  • Verifying File System Health

    Hadoop provides an fsck command to do exactly this

    bin/hadoop fsck [path] [options]

    bin/hadoop fsck <path> -files -blocks

    By default, fsck will not operate on files still open for write by another client. A list of such files can be produced with the -openforwrite option

  • Rack Awareness

    For larger Hadoop installations which span

    multiple racks, it is important to ensure that

    replicas of data exist on multiple racks.

    HDFS can be made rack-aware by the use of a

    script which allows the master node to map

    the network topology of the cluster.

  • Rack Awareness

    #!/bin/bash

    # Set rack id based on IP address.

    # Assumes network administrator has complete control

    # over IP addresses assigned to nodes and they are

    # in the 10.x.y.z address space. Assumes that

    # IP addresses are distributed hierarchically. e.g.,

    # 10.1.y.z is one data center segment and 10.2.y.z is another;

    # 10.1.1.z is one rack, 10.1.2.z is another rack in

    # the same segment, etc.)

    ##

    # This is invoked with an IP address as its only argument

    # get IP address from the input

    ipaddr=$1

    # select "x.y" and convert it to "x/y"

    segments=`echo $ipaddr | cut --delimiter=. --fields=2-3 --output-delimiter=/`

    echo /${segments}

  • HDFS Web Interface

    HDFS exposes a web server which is capable of performing basic status monitoring and file browsing operations.

    http://namenode:50070/

    The address and port where the web interface listens can be changed by setting dfs.http.address in conf/hadoop-site.xml.

    It must be of the form address:port. To accept requests on all addresses, use 0.0.0.0

  • HDFS Web Interface

    Each DataNode exposes its file browser interface on port 50075.

    You can override this by setting the dfs.datanode.http.address configuration key to a setting other than 0.0.0.0:50075.

    Log files generated by the Hadoop daemons can be accessed through this interface, which is useful for distributed debugging and troubleshooting.

  • Tutorial

    Copying Large Sets of Files

    Verifying File System Health

    HDFS Web Interface : features

    Tutorial-HDFSMiscelle.docx

  • Lecture 2

    MapReduce

  • Outline

    MapReduce: Programming Model

    MapReduce Examples

    A Brief History

    MapReduce Execution Overview

    Hadoop

    MapReduce Resources

  • MapReduce Basics

    MapReduce is designed to compute large volumes of data in a parallel fashion.

    All data elements in MapReduce are immutable.

    MapReduce programs transform lists of input data elements into lists of output data elements, using two list-processing idioms: map and reduce (a small Java sketch of these two idioms follows the figure slides below).

  • Mapping Lists

  • Reducing Lists

  • Combination Map Reduce
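    Since the Mapping Lists, Reducing Lists and Combination figures are not reproduced here, a tiny Hadoop-free Java sketch of the same two idioms (the numbers are arbitrary): map transforms every element of a list, and reduce folds the results into a single value.

    import java.util.Arrays;
    import java.util.List;

    public class ListIdioms {
      public static void main(String[] args) {
        List<Integer> input = Arrays.asList(1, 2, 3, 4);
        int result = input.stream()
                          .map(x -> x * x)           // mapping: 1, 4, 9, 16
                          .reduce(0, Integer::sum);  // reducing: 30
        System.out.println(result);                  // prints 30
      }
    }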

  • MapReduce

    A simple and powerful interface that enables

    automatic parallelization and distribution of

    large-scale computations, combined with an

    implementation of this interface that achieves

    high performance on large clusters of

    commodity PCs.

    Dean and Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Google Inc.

  • MapReduce

    More simply, MapReduce is: A parallel programming model and associated

    implementation.

  • Programming Model

    Description: the mental model the programmer has about the detailed execution of their application.

    Purpose: improve programmer productivity.

    Evaluation: expressibility, simplicity, performance.

  • Programming Models

    Parallel programming models:

    Message passing: independent tasks encapsulating local data; tasks interact by exchanging messages.

    Shared memory: tasks share a common address space; tasks interact by reading and writing this space asynchronously.

    Data parallelization: tasks execute a sequence of independent operations; data is usually evenly partitioned across tasks; also referred to as "embarrassingly parallel."

  • MapReduce:

    Programming Model

    Process data using special map() and reduce()

    functions

    The map() function is called on every item in the input and

    emits a series of intermediate key/value pairs

    All values associated with a given key are grouped together.

    The reduce() function is called on every unique key, and its

    value list, and emits a value that is added to the output

  • MapReduce:

    Programming Model

    [Figure: the MapReduce framework applied to word counting. The input lines "How now Brown cow" and "How does It work now" flow through map tasks (M) and reduce tasks (R) to produce the output counts: brown 1, cow 1, does 1, How 2, it 1, now 2, work 1.]

  • MapReduce:

    Programming Model

    More formally,

    Map(k1,v1) --> list(k2,v2)

    Reduce(k2, list(v2)) --> list(v2)
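    As a concrete instance of this model, here is a sketch of the classic word-count job written against the old org.apache.hadoop.mapred Java API used elsewhere in this course; the input and output paths are taken from the command line and are otherwise arbitrary.

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class WordCount {

      // map(k1, v1) --> list(k2, v2): emit (word, 1) for each word in the input line
      public static class Map extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            output.collect(word, one);
          }
        }
      }

      // Reduce(k2, list(v2)) --> list(v2): sum the counts for each unique word
      public static class Reduce extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
          int sum = 0;
          while (values.hasNext()) {
            sum += values.next().get();
          }
          output.collect(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);   // combiner runs the reduce logic locally on map output
        conf.setReducerClass(Reduce.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
      }
    }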

  • MapReduce Runtime System

    1. Partitions input data

    2. Schedules execution across a set of

    machines

    3. Handles machine failure

    4. Manages interprocess communication

  • MapReduce Benefits

    Greatly reduces parallel programming complexity

    Reduces synchronization complexity

    Automatically partitions data

    Provides failure transparency

    Handles load balancing

    Practical: approximately 1000 Google MapReduce jobs run every day.

  • MapReduce Examples

    Word frequency

    [Figure: the word-frequency example. A document (doc) flows through the runtime system, which runs the Map and Reduce tasks.]

  • MapReduce Examples

    Distributed grep

    The map function emits the line if it matches the search criteria.

    The reduce function is the identity function.

    URL access frequency

    The map function processes web logs and emits <URL, 1>.

    The reduce function sums the values and emits <URL, total count>.
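    A sketch of the distributed-grep example in the same old-API style; the search pattern "ERROR" and the command-line paths are made up, and the reducer is the identity, as stated above.

    import java.io.IOException;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class DistributedGrep {
      public static class GrepMapper extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, Text> {
        private static final String PATTERN = "ERROR";   // assumed search criterion
        private static final Text EMPTY = new Text("");

        public void map(LongWritable offset, Text line,
                        OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
          if (line.toString().contains(PATTERN)) {
            out.collect(line, EMPTY);   // key = the matching line, value unused
          }
        }
      }

      public static void main(String[] args) throws IOException {
        JobConf job = new JobConf(DistributedGrep.class);
        job.setJobName("distributed-grep");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setMapperClass(GrepMapper.class);
        job.setReducerClass(IdentityReducer.class);   // identity reduce, as on the slide
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        JobClient.runJob(job);
      }
    }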

  • A Brief History

    Functional programming (e.g., Lisp)

    map() function

    Applies a function to each value of a sequence

    reduce() function

    Combines all elements of a sequence using a binary

    operator

  • MapReduce Execution Overview

    1. The user program, via the MapReduce

    library, shards the input data

    [Figure: the user program's input data is divided into shards (Shard 0 through Shard 6).]

    * Shards are typically 16-64 MB in size.

  • MapReduce Execution Overview

    2. The user program creates process copies distributed on a machine cluster. One copy will be the Master and the others will be worker threads.

    [Figure: the user program forks a Master process and multiple worker processes.]

  • MapReduce Execution Overview

    3. The master distributes M map and R reduce

    tasks to idle workers.

    M == number of shards

    R == the intermediate key space is divided into R

    parts

    [Figure: the Master sends a Do_map_task message to an idle worker.]

  • MapReduce Execution Overview

    4. Each map-task worker reads assigned input

    shard and outputs intermediate key/value

    pairs.

    Output is buffered in RAM.

    [Figure: a map worker reads Shard 0 and produces intermediate key/value pairs.]

  • MapReduce Execution Overview

    5. Each worker flushes intermediate values,

    partitioned into R regions, to disk and

    notifies the Master process.

    [Figure: the map worker writes intermediate data to local storage and sends the disk locations to the Master.]

  • MapReduce Execution Overview

    6. Master process gives disk locations to an

    available reduce-task worker who reads all

    associated intermediate data.

    [Figure: the Master passes the disk locations to a reduce worker, which reads the intermediate data from the map workers' (remote) storage.]

  • MapReduce Execution Overview

    7. Each reduce-task worker sorts its intermediate data, then calls the reduce function, passing in unique keys and their associated values. The reduce function's output is appended to the reduce task's partition output file.

    [Figure: a reduce worker sorts its data and writes a partition output file.]

  • MapReduce Execution Overview

    8. Master process wakes up user process when

    all tasks have completed. Output contained

    in R output files.

    [Figure: the Master wakes up the user program; the results are in the R output files.]

  • MapReduce Execution Overview

    Fault Tolerance

    Master process periodically pings workers

    Map-task failure

    Re-execute

    All output was stored locally

    Reduce-task failure

    Only re-execute partially completed tasks

    All output stored in the global file system

  • Tutorial:

    Running a MapReduce Program

    Tutorial-MapReduce.docx