Using the 50TB Hadoop Cluster on Discovery
Northeastern University Research Computing: Nilay K Roy, MS Computer Science, Ph.D Computational Physics
Introduction to Hadoop Distributed and Streaming Computing
• Hadoop is an open-source project, and everyone is free to use and modify its source.
• Hadoop has a free distribution (Apache Hadoop 2.4.1). Commercial distributions include:
  - Cloudera (Cloudera is able to claim Doug Cutting, Hadoop’s co-founder, as its chief architect)
  - EMC (Pivotal HD natively integrates EMC’s massively parallel processing (MPP) database technology with Apache Hadoop; the result is a high-performance Hadoop distribution with true SQL processing, so SQL-based queries and other business intelligence tools can be used to analyze data stored in HDFS)
  - Hortonworks
  - IBM
  - MapR
“Five or six years ago, the average large corporation had maybe 360 terabytes of data lying around”, Kirk Dunn (COO of Cloudera) says, “Cloudera now has some customers that are generating about that much new data nearly every day, and it’s not slowing down”.
Where is HDFS a good fit?
• Data Volume: store large datasets, in the TBs or PBs or even more.
• Data Variety: store different varieties of data: structured, unstructured, and semi-structured.
• Store data on commodity hardware (economical).
Where is HDFS not a good fit?
• Low-latency data access (HBase is the better option).
• Huge numbers of small files (up to millions is fine, but billions is beyond the capacity of current hardware; NameNode metadata storage capacity is the bottleneck).
• Random file access (random read, write, delete, or insert is not possible; Hadoop doesn't support OLTP, for which an RDBMS is the best fit).
OLTP (Online Transaction Processing, operational system) vs. OLAP (Online Analytical Processing, data warehouse):
• Source of data. OLTP: operational data; OLTP systems are the original source of the data. OLAP: consolidated data; OLAP data comes from the various OLTP databases.
• Purpose of data. OLTP: to control and run fundamental business tasks. OLAP: to help with planning, problem solving, and decision support.
• What the data reveals. OLTP: a snapshot of ongoing business processes. OLAP: multi-dimensional views of various kinds of business activities.
• Inserts and updates. OLTP: short, fast inserts and updates initiated by end users. OLAP: periodic long-running batch jobs refresh the data.
• Queries. OLTP: relatively standardized, simple queries returning relatively few records. OLAP: often complex queries involving aggregations.
• Processing speed. OLTP: typically very fast. OLAP: depends on the amount of data involved; batch data refreshes and complex queries may take many hours; query speed can be improved by creating indexes.
• Space requirements. OLTP: can be relatively small if historical data is archived. OLAP: larger, due to aggregation structures and history data; requires more indexes than OLTP.
• Database design. OLTP: highly normalized with many tables. OLAP: typically de-normalized with fewer tables; uses star and/or snowflake schemas.
• Backup and recovery. OLTP: back up religiously; operational data is critical to run the business, and data loss is likely to entail significant monetary loss and legal liability. OLAP: instead of regular backups, some environments may simply reload the OLTP data as a recovery method.
Relationship to and needs of “Big Data processing”
• Applications run on HDFS; it is best for large data sets.
• A typical file in HDFS is gigabytes to terabytes in size.
• A normal OS block size is 4 KB; with large files running into GB or TB, a 4 KB block size would make the metadata overwhelming.
• Hadoop uses a block size of 128 MB or 64 MB, depending on the release, so the metadata associated with a large file is extremely small.
• HDFS is tuned to support large files: it provides high aggregate data bandwidth, scales to hundreds of nodes in a single cluster, and supports tens of millions of files in a single instance.
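The effect of block size on NameNode metadata can be shown with back-of-the-envelope arithmetic. This is a conceptual sketch, not Hadoop code: it only counts how many block records the NameNode would have to track for a single 1 TB file at a 4 KB OS-style block size versus the 128 MB HDFS block size.

```python
# Sketch: number of block records the NameNode must hold for one file.

def block_count(file_size_bytes: int, block_size_bytes: int) -> int:
    """Number of blocks needed to store one file (last block may be partial)."""
    return -(-file_size_bytes // block_size_bytes)  # ceiling division

one_tb = 1024 ** 4

blocks_4k = block_count(one_tb, 4 * 1024)             # 4 KB OS block size
blocks_128m = block_count(one_tb, 128 * 1024 * 1024)  # 128 MB HDFS block size

print(blocks_4k)    # 268435456 block records for a single file
print(blocks_128m)  # 8192 block records for the same file
```

With 4 KB blocks, one terabyte file alone would generate over 268 million metadata entries; with 128 MB blocks it generates about eight thousand, which is why HDFS can keep the entire block map in NameNode memory.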
Data is stored in different formats and can be broadly classified into three types:
1. Structured data: characterized by a high degree of organization; the kind of data in relational databases or spreadsheets, searched and manipulated using standard algorithms.
2. Semi-structured data: stored in the form of text files, with some degree of order, but cannot be searched or manipulated using standard algorithms.
3. Unstructured data: no logical structure; analysis is tedious and cumbersome considering the huge volume of data.
BIG DATA means volume, variety, and velocity: terabytes and petabytes of data, in different file types, generated very fast. Hadoop is best suited to BIG DATA.
Why use Hadoop: advantages
• Hardware failure: the norm rather than the exception; detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.
• Streaming data access: HDFS is designed for batch processing, with emphasis on high throughput of data access rather than low latency; POSIX semantics in a few key areas are traded to increase data throughput rates.
• Large data sets: a typical file in HDFS is gigabytes to terabytes in size; HDFS is tuned to support large files, provides high aggregate data bandwidth, scales to hundreds of nodes in a single cluster, and supports tens of millions of files in a single instance.
• Simple coherency model: HDFS applications need a write-once-read-many access model for files; a file once created, written, and closed need not be changed. This assumption simplifies data coherency issues and enables high-throughput data access.
• “Moving computation is cheaper than moving data”: HDFS provides interfaces for applications to move themselves closer to where the data is located; when the data set is huge, a computation requested by an application is much more efficient if it executes near the data it operates on.
• Portability across heterogeneous hardware and software platforms, via the Java API.
Yahoo’s Hadoop cluster of 42,000 nodes is the largest Hadoop cluster to date.
Hadoop features:
• Scale-out architecture: add servers to increase capacity.
• High availability: serve mission-critical workflows and applications.
• Fault tolerance: automatically and seamlessly recover from failures.
• Flexible access: multiple and open frameworks for serialization and file system mounts.
• Load balancing: place data intelligently for maximum efficiency and utilization.
• Tunable replication: multiple copies of each file provide data protection and computational performance.
• Security: POSIX-based file permissions for users and groups, with optional LDAP integration.
• By default every block in Hadoop is 64 MB (or 128 MB) and is replicated three times.
• Blocks are replicated according to rack awareness: by default, two replicas in one rack and the third in another rack.
File write in HDFS:
• Data is written to HDFS as a pipelined write, block by block.
• An acknowledgement is passed back to the client indicating that the data was successfully written to HDFS.
• Blocks are placed in a rack-aware fashion.
• Each block is given a unique ID, which is stored in the metadata.
File reading from HDFS:
• When a client wants to read a file from HDFS, the request is initially handled by the NameNode.
• The NameNode verifies that the file exists in its metadata and asks the nearest DataNode to serve the file for reading.
• If the file name is not found in the metadata, an IO exception occurs.
Data replication in HDFS:
• HDFS stores each file as a sequence of blocks; all blocks in a file except the last block are the same size.
• Files in HDFS are write-once and have strictly one writer at any time.
• The NameNode makes all decisions regarding replication of blocks.
• It periodically receives a Heartbeat and a BlockReport from each of the DataNodes in the cluster.
• When a client writes data to HDFS, the file has (by default) a replication factor of three.
• Write data is pipelined from one DataNode to the next.
(Cluster diagram: discovery3 with the compute nodes compute-2-004, compute-2-005, and compute-2-006.)
Some limitations of HDFS include:
• Centralized master-slave architecture.
• No file locking.
• File data striped into uniformly sized blocks that are distributed across cluster servers.
• Block-level information exposed to applications.
• Simple coherency with a write-once, read-many model that restricts what users can do with data.
A solution for HPC that retains full POSIX semantics and has big-data capabilities with scaling is to use a true parallel file system: IBM GPFS.
GPFS provides a common storage plane – software defined storage
GPFS features include:
• High-performance, shared-disk cluster architecture with full POSIX semantics.
• Distributed metadata, space allocation, and lock management.
• File data blocks striped across multiple servers and disks.
• Block-level information not exposed to applications.
• Ability to open, read, and append to any section of a file.
• GPFS includes a set of features that support MapReduce workloads, called the GPFS File Placement Optimizer (FPO).
• GPFS-FPO is a distributed computing architecture in which each server is self-sufficient and uses local storage; compute tasks are divided between these independent systems, and no single server waits on another.
• GPFS-FPO provides higher availability through advanced clustering technologies, dynamic file system management, and advanced data replication techniques.
• GPFS supports a whole range of enterprise data storage features, such as snapshots, backup, archiving, tiered storage, data caching, WAN data replication, and management policies.
• GPFS can be used by a wide range of applications running Hadoop MapReduce workloads and accessing other unstructured file data.
• Benchmarks demonstrate that a GPFS-FPO-based system scales linearly, so that a file system with 40 servers would have 12 GB/s of throughput, and a system with 400 servers could achieve 120 GB/s.
Core Hadoop modules
Hadoop Common: the common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS): a distributed file system that provides high-throughput access to application data.
Hadoop YARN: a framework for job scheduling and cluster resource management.
Hadoop MapReduce: a YARN-based system for parallel processing of large data sets.
The other Hadoop EcoSystem:
• Ambari: administration tools for installing, monitoring, and maintaining a Hadoop cluster, and tools to add or remove slave nodes.
• Avro: a framework for the efficient serialization (a kind of transformation) of data into a compact binary format.
• Flume: a data flow service for the movement of large volumes of log data into Hadoop.
• HBase: a distributed columnar database that uses HDFS for its underlying storage; with HBase, you can store data in extremely large tables with variable column structures.
• Cassandra: a scalable multi-master database with no single points of failure.
• Chukwa: a data collection system for managing large distributed systems.
• HCatalog: a service providing a relational view of data stored in Hadoop, including a standard approach for tabular data.
• Hive: a distributed data warehouse for data stored in HDFS; also provides a query language based on SQL (HiveQL).
• Hue: a Hadoop administration interface with handy GUI tools for browsing files, issuing Hive and Pig queries, and developing Oozie workflows.
• Mahout: a library of machine learning and statistical algorithms implemented in MapReduce that can run natively on Hadoop.
• Oozie: a workflow management tool that handles the scheduling and chaining together of Hadoop applications.
• Pig: a platform for the analysis of very large data sets that runs on HDFS, with an infrastructure layer consisting of a compiler that produces sequences of MapReduce programs and a language layer consisting of the query language Pig Latin.
• Sqoop: a tool for efficiently moving large amounts of data between relational databases and HDFS.
• ZooKeeper: a simple interface to the centralized coordination of services (such as naming, configuration, and synchronization) used by distributed applications.
• Tez: a generalized data-flow programming framework built on Hadoop YARN; provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use cases. Tez is being adopted by Hive, Pig, and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop MapReduce as the underlying execution engine.
NICE TO HAVE BUT NOT NEEDED
Hadoop shell
• The FileSystem (FS) shell is invoked by “hadoop fs <args>”.
• All the FS shell commands take path URIs as arguments.
• The URI format is scheme://authority/path.
• For HDFS the scheme is hdfs, and for the local filesystem the scheme is file.
• The scheme and authority are optional; if not specified, the default scheme from the configuration is used.
• An HDFS file or directory such as /parent/child can be specified as hdfs://namenodehost/parent/child or simply as /parent/child (given that your configuration points to hdfs://namenodehost).
• Most of the commands in the FS shell behave like the corresponding Unix commands; differences are described with each command.
• Error information is sent to stderr and output is sent to stdout.
FS shell commands: cat, chgrp, chmod, chown, copyFromLocal, copyToLocal, cp, du, dus, expunge, get, getmerge, ls, lsr, mkdir, moveFromLocal, mv, put, rm, rmr, setrep, stat, tail, test, text, touchz
[nilay.roy@compute-2-005 test1]$ hadoop fs -mkdir hdfs://discovery3:9000/tmp/nilay.roy
[nilay.roy@compute-2-005 ~]$ hdfs dfs -put hadoop_test/ hdfs://discovery3:9000/tmp/nilay.roy/.
[nilay.roy@compute-2-005 ~]$ hdfs dfs -lsr hdfs://discovery3:9000/tmp/nilay.roy
[nilay.roy@compute-2-005 test1]$ hdfs dfs -cat hdfs://discovery3:9000/tmp/nilay.roy/hadoop_test/test1/output/part-00000
Hadoop API org.apache.hadoop org.apache.hadoop.classification org.apache.hadoop.conf org.apache.hadoop.contrib.bkjournal org.apache.hadoop.contrib.utils.join org.apache.hadoop.examples org.apache.hadoop.examples.dancing org.apache.hadoop.examples.pi org.apache.hadoop.examples.pi.math org.apache.hadoop.examples.terasort org.apache.hadoop.filecache org.apache.hadoop.fs org.apache.hadoop.fs.ftp org.apache.hadoop.fs.http.client org.apache.hadoop.fs.http.server org.apache.hadoop.fs.permission org.apache.hadoop.fs.s3 org.apache.hadoop.fs.s3native org.apache.hadoop.fs.swift.auth org.apache.hadoop.fs.swift.auth.entities org.apache.hadoop.fs.swift.exceptions org.apache.hadoop.fs.swift.http org.apache.hadoop.fs.swift.snative org.apache.hadoop.fs.swift.util org.apache.hadoop.fs.viewfs org.apache.hadoop.ha org.apache.hadoop.ha.proto org.apache.hadoop.ha.protocolPB org.apache.hadoop.http.lib org.apache.hadoop.io org.apache.hadoop.io.compress org.apache.hadoop.io.file.tfile org.apache.hadoop.io.serializer org.apache.hadoop.io.serializer.avro org.apache.hadoop.ipc.proto org.apache.hadoop.ipc.protobuf org.apache.hadoop.ipc.protocolPB org.apache.hadoop.jmx org.apache.hadoop.lib.lang org.apache.hadoop.lib.server org.apache.hadoop.lib.service org.apache.hadoop.lib.service.hadoop org.apache.hadoop.lib.service.instrumentation org.apache.hadoop.lib.service.scheduler org.apache.hadoop.lib.service.security
org.apache.hadoop.lib.servlet org.apache.hadoop.lib.util org.apache.hadoop.lib.wsrs org.apache.hadoop.log org.apache.hadoop.log.metrics org.apache.hadoop.mapred org.apache.hadoop.mapred.gridmix org.apache.hadoop.mapred.gridmix.emulators.resourceusage org.apache.hadoop.mapred.jobcontrol org.apache.hadoop.mapred.join org.apache.hadoop.mapred.lib org.apache.hadoop.mapred.lib.aggregate org.apache.hadoop.mapred.lib.db org.apache.hadoop.mapred.pipes org.apache.hadoop.mapred.proto org.apache.hadoop.mapred.tools org.apache.hadoop.mapreduce org.apache.hadoop.mapreduce.lib.aggregate org.apache.hadoop.mapreduce.lib.chain org.apache.hadoop.mapreduce.lib.db org.apache.hadoop.mapreduce.lib.fieldsel org.apache.hadoop.mapreduce.lib.input org.apache.hadoop.mapreduce.lib.jobcontrol org.apache.hadoop.mapreduce.lib.join org.apache.hadoop.mapreduce.lib.map org.apache.hadoop.mapreduce.lib.output org.apache.hadoop.mapreduce.lib.partition org.apache.hadoop.mapreduce.lib.reduce org.apache.hadoop.mapreduce.security org.apache.hadoop.mapreduce.server.jobtracker org.apache.hadoop.mapreduce.server.tasktracker org.apache.hadoop.mapreduce.task.annotation org.apache.hadoop.mapreduce.tools org.apache.hadoop.mapreduce.v2 org.apache.hadoop.mapreduce.v2.app.webapp.dao org.apache.hadoop.mapreduce.v2.hs.client org.apache.hadoop.mapreduce.v2.hs.proto org.apache.hadoop.mapreduce.v2.hs.protocol org.apache.hadoop.mapreduce.v2.hs.protocolPB org.apache.hadoop.mapreduce.v2.hs.server org.apache.hadoop.mapreduce.v2.hs.webapp.dao org.apache.hadoop.mapreduce.v2.security org.apache.hadoop.maven.plugin.protoc org.apache.hadoop.maven.plugin.util org.apache.hadoop.maven.plugin.versioninfo
org.apache.hadoop.metrics org.apache.hadoop.metrics.file org.apache.hadoop.metrics.ganglia org.apache.hadoop.metrics.spi org.apache.hadoop.metrics2 org.apache.hadoop.metrics2.annotation org.apache.hadoop.metrics2.filter org.apache.hadoop.metrics2.lib org.apache.hadoop.metrics2.sink org.apache.hadoop.metrics2.sink.ganglia org.apache.hadoop.metrics2.source org.apache.hadoop.metrics2.util org.apache.hadoop.minikdc org.apache.hadoop.mount org.apache.hadoop.net org.apache.hadoop.net.unix org.apache.hadoop.nfs org.apache.hadoop.nfs.nfs3 org.apache.hadoop.nfs.nfs3.request org.apache.hadoop.nfs.nfs3.response org.apache.hadoop.oncrpc org.apache.hadoop.oncrpc.security org.apache.hadoop.portmap org.apache.hadoop.record org.apache.hadoop.record.compiler org.apache.hadoop.record.compiler.ant org.apache.hadoop.record.compiler.generated org.apache.hadoop.record.meta org.apache.hadoop.security org.apache.hadoop.security.authentication.client org.apache.hadoop.security.authentication.examples org.apache.hadoop.security.authentication.server org.apache.hadoop.security.authentication.util org.apache.hadoop.security.proto org.apache.hadoop.security.protocolPB org.apache.hadoop.security.ssl org.apache.hadoop.service org.apache.hadoop.streaming org.apache.hadoop.streaming.io org.apache.hadoop.tools.mapred org.apache.hadoop.tools.mapred.lib org.apache.hadoop.tools.proto org.apache.hadoop.tools.protocolPB org.apache.hadoop.tools.rumen org.apache.hadoop.tools.rumen.anonymization org.apache.hadoop.tools.rumen.datatypes org.apache.hadoop.tools.rumen.datatypes.util org.apache.hadoop.tools.rumen.serializers org.apache.hadoop.tools.rumen.state
org.apache.hadoop.tools.util org.apache.hadoop.typedbytes org.apache.hadoop.util org.apache.hadoop.util.bloom org.apache.hadoop.util.hash org.apache.hadoop.yarn org.apache.hadoop.yarn.api org.apache.hadoop.yarn.api.protocolrecords org.apache.hadoop.yarn.api.records org.apache.hadoop.yarn.api.records.timeline org.apache.hadoop.yarn.applications.distributedshell org.apache.hadoop.yarn.applications.unmanagedamlauncher org.apache.hadoop.yarn.client org.apache.hadoop.yarn.client.api org.apache.hadoop.yarn.client.api.async org.apache.hadoop.yarn.client.api.async.impl org.apache.hadoop.yarn.client.api.impl org.apache.hadoop.yarn.client.cli org.apache.hadoop.yarn.conf org.apache.hadoop.yarn.event org.apache.hadoop.yarn.exceptions org.apache.hadoop.yarn.logaggregation org.apache.hadoop.yarn.security org.apache.hadoop.yarn.security.admin org.apache.hadoop.yarn.security.client org.apache.hadoop.yarn.sls org.apache.hadoop.yarn.sls.appmaster org.apache.hadoop.yarn.sls.conf org.apache.hadoop.yarn.sls.nodemanager org.apache.hadoop.yarn.sls.scheduler org.apache.hadoop.yarn.sls.utils org.apache.hadoop.yarn.sls.web org.apache.hadoop.yarn.state org.apache.hadoop.yarn.util org.apache.hadoop.yarn.util.resource org.apache.hadoop.yarn.util.timeline
[nilay.roy@discovery2 test1]$ head -20 WordCount.java
package org.myorg;
import java.io.*;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
public class WordCount extends Configured implements Tool {
  public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    static enum Counters { INPUT_WORDS }
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
[nilay.roy@discovery2 test1]$
HDFS (Storage) + MapReduce (Processing) MapReduce data flow with a single reduce task MapReduce data flow with multiple reduce tasks
MapReduce data flow with no reduce tasks
Fault tolerance
• Failure is the norm rather than the exception.
• An HDFS instance may consist of thousands of server machines, each storing part of the file system’s data.
• Since there is a huge number of components, and each component has a non-trivial probability of failure, some component is always non-functional.
• Detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.
9/26/2014
Data Characteristics
• Streaming data access: applications need streaming access to data; batch processing rather than interactive user access.
• Large data sets and files: gigabytes to terabytes in size.
• High aggregate data bandwidth; scales to hundreds of nodes in a cluster; tens of millions of files in a single instance.
• Write-once-read-many: a file once created, written, and closed need not be changed. This assumption simplifies coherency; a map-reduce application like sort or a web crawler fits this model perfectly.
(Figure: MapReduce word-count data flow. Input, terabytes of words such as “Cat”, “Bat”, “Dog”, is divided into splits; each split is processed by a map task, locally aggregated by a combine step, shuffled to reduce tasks, and written out as output partitions part0, part1, and part2.)
NameNode and DataNodes
• Master/slave architecture.
• An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients.
• There are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes they run on.
• HDFS exposes a file system namespace and allows user data to be stored in files.
• A file is split into one or more blocks, and the set of blocks is stored in DataNodes.
• DataNodes serve read and write requests and perform block creation, deletion, and replication upon instruction from the NameNode.
HDFS Architecture
(Figure: HDFS architecture. Clients issue metadata operations to the NameNode, which holds the metadata (name, replicas, e.g. /home/foo/data with replication 6). Clients read and write blocks directly on DataNodes spread across Rack1 and Rack2, while the NameNode directs block operations and block replication between racks.)
File System Namespace
• Hierarchical file system with directories and files: create, remove, move, rename, etc.
• The NameNode maintains the file system; any metadata change to the file system is recorded by the NameNode.
• An application can specify the number of replicas of a file needed: the replication factor of the file. This information is stored in the NameNode.
Data Replication
• HDFS is designed to store very large files across machines in a large cluster.
• Each file is a sequence of blocks; all blocks in the file except the last are of the same size.
• Blocks are replicated for fault tolerance; block size and replica count are configurable per file.
• The NameNode receives a Heartbeat and a BlockReport from each DataNode in the cluster; the BlockReport lists all the blocks on a DataNode.
Replica Placement
• The placement of replicas is critical to HDFS reliability and performance; optimizing replica placement distinguishes HDFS from other distributed file systems.
• Rack-aware replica placement aims to improve reliability, availability, and network bandwidth utilization (still a research topic).
• A cluster has many racks, and communication between racks goes through switches; network bandwidth between machines on the same rack is greater than between machines on different racks.
• The NameNode determines the rack id of each DataNode.
• Placing replicas on unique racks is simple but non-optimal: writes are expensive. With the default replication factor of 3 (another research topic), replicas are placed: one on a node in the local rack, one on a different node in the local rack, and one on a node in a different rack.
• Thus one third of the replicas are on one node, two thirds are on one rack, and the remaining third are distributed evenly across the remaining racks.
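The placement rule above can be sketched as a small function. This is an illustrative simulation with invented node names, not the NameNode's actual placement code: first replica on the writer's node, second on a different node in the same rack, third on a node in a different rack.

```python
# Hedged sketch of the rack-aware placement rule described above.

def place_replicas(writer_node, racks):
    """racks: dict mapping rack id -> list of node names.
    Returns three (rack, node) placements for replication factor 3."""
    local_rack = next(r for r, nodes in racks.items() if writer_node in nodes)
    remote_rack = next(r for r in racks if r != local_rack)
    second_node = next(n for n in racks[local_rack] if n != writer_node)
    return [
        (local_rack, writer_node),              # replica 1: the writer's node
        (local_rack, second_node),              # replica 2: different node, same rack
        (remote_rack, racks[remote_rack][0]),   # replica 3: node in a different rack
    ]

racks = {
    "rack1": ["compute-2-004", "compute-2-005"],
    "rack2": ["compute-2-006", "compute-2-007"],
}
print(place_replicas("compute-2-004", racks))
```

A write therefore crosses the rack switch only once, while a whole-rack failure still leaves one surviving replica.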
Replica Selection
• Replica selection for read operations: HDFS tries to minimize bandwidth consumption and latency.
• If there is a replica on the reader’s node, that replica is preferred.
• An HDFS cluster may span multiple data centers: a replica in the local data center is preferred over a remote one.
Safemode Startup
• On startup the NameNode enters Safemode; replication of data blocks does not occur in Safemode.
• Each DataNode checks in with a Heartbeat and a BlockReport.
• The NameNode verifies that each block has an acceptable number of replicas.
• After a configurable percentage of safely replicated blocks check in with the NameNode, the NameNode exits Safemode.
• It then makes a list of blocks that need to be replicated and proceeds to replicate these blocks to other DataNodes.
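The Safemode exit test can be sketched as a threshold check. This is a conceptual stand-in, not NameNode code; 0.999 is the usual default for the `dfs.namenode.safemode.threshold-pct` setting, but verify against your Hadoop release.

```python
# Sketch of the Safemode exit condition described above: leave Safemode once
# the fraction of blocks meeting minimum replication crosses a threshold.

def can_exit_safemode(block_replica_counts, min_replicas=1, threshold=0.999):
    """block_replica_counts: reported replica counts, one entry per block."""
    if not block_replica_counts:
        return True  # nothing to wait for
    safe = sum(1 for c in block_replica_counts if c >= min_replicas)
    return safe / len(block_replica_counts) >= threshold

print(can_exit_safemode([3, 3, 2, 0]))  # False: only 3 of 4 blocks are safe
print(can_exit_safemode([3] * 1000))    # True: every block is safely replicated
```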
Filesystem Metadata
• The HDFS namespace is stored by the NameNode.
• The NameNode uses a transaction log called the EditLog to record every change that occurs to the filesystem metadata, for example creating a new file or changing the replication factor of a file. The EditLog is stored in the NameNode’s local filesystem.
• The entire filesystem namespace, including the mapping of blocks to files and file system properties, is stored in a file called FsImage, also kept in the NameNode’s local filesystem.
NameNode
• Keeps an image of the entire file system namespace and file Blockmap in memory.
• 4 GB of local RAM is sufficient to support these data structures, even for a huge number of files and directories.
• When the NameNode starts up, it reads the FsImage and EditLog from its local file system, applies the EditLog transactions to the FsImage, and then stores a copy of the updated FsImage back to the filesystem as a checkpoint.
• Periodic checkpointing is done so that the system can recover to the last checkpointed state in case of a crash.
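The checkpoint sequence above can be sketched with invented data structures (a dict for the namespace, a list of operations for the EditLog); this is a conceptual model, not the NameNode's on-disk formats.

```python
# Sketch of checkpointing: load FsImage, replay EditLog, persist the merge.

def replay_checkpoint(fsimage, editlog):
    """fsimage: dict path -> metadata; editlog: list of (op, path, arg)."""
    namespace = dict(fsimage)           # start from the last checkpoint
    for op, path, arg in editlog:       # apply every logged change in order
        if op == "create":
            namespace[path] = arg
        elif op == "set_replication":
            namespace[path] = {**namespace[path], "replication": arg}
        elif op == "delete":
            namespace.pop(path, None)
    return namespace                    # this merged state becomes the new FsImage

image = {"/home/foo/data": {"replication": 6}}
log = [
    ("create", "/home/foo/new", {"replication": 3}),
    ("set_replication", "/home/foo/data", 3),
    ("delete", "/home/foo/new", None),
]
print(replay_checkpoint(image, log))  # {'/home/foo/data': {'replication': 3}}
```

After the merged image is written out, the EditLog can be truncated, which is exactly what keeps crash recovery bounded.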
DataNode
• A DataNode stores data in files in its local file system and has no knowledge of the HDFS filesystem.
• It stores each block of HDFS data in a separate file.
• A DataNode does not create all files in the same directory; it uses heuristics to determine the optimal number of files per directory and creates directories appropriately (a research issue).
• When the filesystem starts up, the DataNode generates a list of all the HDFS blocks it holds and sends this report to the NameNode: the BlockReport.
The Communication Protocol
• All HDFS communication protocols are layered on top of TCP/IP.
• A client establishes a connection to a configurable TCP port on the NameNode machine and speaks the ClientProtocol with the NameNode.
• The DataNodes talk to the NameNode using the DataNode protocol.
• An RPC abstraction wraps both the ClientProtocol and the DataNode protocol.
• The NameNode is simply a server and never initiates a request; it only responds to RPC requests issued by DataNodes or clients.
Robustness - Objectives
• The primary objective of HDFS is to store data reliably in the presence of failures.
• Three common failures are NameNode failure, DataNode failure, and network partition.
DataNode failure and Heartbeat
• A network partition can cause a subset of DataNodes to lose connectivity with the NameNode.
• The NameNode detects this condition by the absence of a Heartbeat message.
• The NameNode marks DataNodes without a recent Heartbeat as dead and does not send any IO requests to them.
• Any data registered to a failed DataNode is no longer available to HDFS.
• The death of a DataNode may also cause the replication factor of some blocks to fall below their specified value.
Re-replication
• The necessity for re-replication may arise when:
  - a DataNode becomes unavailable,
  - a replica becomes corrupted,
  - a hard disk on a DataNode fails, or
  - the replication factor of a block is increased.
Cluster Rebalancing
• The HDFS architecture is compatible with data rebalancing schemes.
• A scheme might move data from one DataNode to another if the free space on a DataNode falls below a certain threshold.
• In the event of sudden high demand for a particular file, a scheme might dynamically create additional replicas and rebalance other data in the cluster.
• These types of data rebalancing are not yet implemented: a research issue.
Data Integrity
• Consider this situation: a block of data fetched from a DataNode arrives corrupted.
• Corruption may occur because of faults in a storage device, network faults, or buggy software.
• An HDFS client creates a checksum of every block of its file and stores it in hidden files in the HDFS namespace.
• When a client retrieves the contents of a file, it verifies that the corresponding checksums match.
• If they do not match, the client can retrieve the block from a replica.
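The read-path verification above can be sketched as follows. This is a conceptual stand-in: MD5 is used here only for brevity, whereas HDFS actually stores per-chunk CRC checksums.

```python
# Sketch of client-side integrity checking: try each replica until one
# matches the checksum that was stored when the block was written.

import hashlib

def checksum(block: bytes) -> str:
    return hashlib.md5(block).hexdigest()

def read_with_verification(replicas, expected):
    """replicas: candidate byte blocks fetched from different DataNodes.
    Return the first replica whose checksum matches the stored one."""
    for block in replicas:
        if checksum(block) == expected:
            return block
    raise IOError("all replicas corrupted")

good = b"block contents"
stored = checksum(good)            # written alongside the block at create time
corrupted = b"block c0ntents"      # e.g. bits flipped by a faulty disk
print(read_with_verification([corrupted, good], stored) == good)  # True
```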
Metadata Disk Failure
• The FsImage and EditLog are central data structures of HDFS; corruption of these files can render an HDFS instance non-functional.
• For this reason, a NameNode can be configured to maintain multiple copies of the FsImage and EditLog, which are updated synchronously.
• Metadata is not data-intensive.
• The NameNode is a single point of failure: automatic failover is NOT supported (another research topic).
Data Organization - Data Blocks
• HDFS supports write-once-read-many with reads at streaming speeds.
• A typical block size is 64 MB (or even 128 MB); a file is chopped into chunks of that size and stored.
Staging
• A client request to create a file does not reach the NameNode immediately; the HDFS client caches the data in a temporary local file.
• When the data reaches an HDFS block size, the client contacts the NameNode.
• The NameNode inserts the filename into its hierarchy and allocates a data block for it.
• The NameNode responds to the client with the identity of the DataNode and the destinations of the replicas (DataNodes) for the block.
• The client then flushes the block from its local memory.
Staging (contd.)
• When the file is closed, the client sends a message to the NameNode, which commits the file creation operation into its persistent store.
• If the NameNode dies before the file is closed, the file is lost.
• This client-side caching is required to avoid network congestion; it has precedent in AFS (the Andrew File System).
Replication Pipelining
• When the client receives the response from the NameNode, it flushes its block in small pieces (4 KB) to the first replica, which in turn copies each piece to the next replica, and so on.
• Thus data is pipelined from one DataNode to the next.
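The pipelined write above can be sketched as a toy simulation (node names are illustrative, and real DataNodes forward chunks concurrently over the network rather than in a simple loop):

```python
# Toy sketch of 4 KB replication pipelining: the client streams small chunks
# to the first DataNode, and each node keeps a copy while forwarding onward.

def pipeline_write(block: bytes, datanodes, chunk_size=4096):
    """Returns a dict of node -> bytes received after the pipelined write."""
    received = {node: b"" for node in datanodes}
    for i in range(0, len(block), chunk_size):
        chunk = block[i:i + chunk_size]
        for node in datanodes:   # each node stores the chunk, then forwards it
            received[node] += chunk
    return received

nodes = ["compute-2-004", "compute-2-005", "compute-2-006"]
copies = pipeline_write(b"x" * 10000, nodes)
print(all(data == b"x" * 10000 for data in copies.values()))  # True
```

Because chunks are small, the second and third replicas start receiving data long before the client has finished sending the block, which is what makes the pipeline efficient.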
Application Programming Interface
• HDFS provides a Java API for applications to use; Python access is also used in many applications.
• A C language wrapper for the Java API is also available.
• An HTTP browser can be used to browse the files of an HDFS instance.
FS Shell, Admin, and Browser Interface
• HDFS organizes its data in files and directories and provides a command line interface, the FS shell, that lets the user interact with data in HDFS. The syntax of the commands is similar to bash and csh.
• Example: to create a directory /foodir: /bin/hadoop dfs -mkdir /foodir
• There is also a DFSAdmin interface, and a browser interface is available to view the namespace.
Space Reclamation
• When a file is deleted by a client, HDFS renames it to a file in the /trash directory for a configurable amount of time; a client can request an undelete within this window.
• After the specified time, the file is deleted and the space is reclaimed.
• When the replication factor of a file is reduced, the NameNode selects excess replicas that can be deleted; the next Heartbeat transfers this information to the DataNode, which clears the blocks for reuse.
Terminology
Google calls it / Hadoop equivalent:
• MapReduce: Hadoop
• GFS: HDFS
• Bigtable: HBase
• Chubby: ZooKeeper
Some MapReduce Terminology
Job – A “full program”: an execution of a Mapper and Reducer across a data set
Task – An execution of a Mapper or a Reducer on a slice of data; a.k.a. a Task-In-Progress (TIP)
Task Attempt – A particular instance of an attempt to execute a task on a machine
Task Attempts
A particular task will be attempted at least once, possibly more times if it crashes.
If the same input causes crashes over and over, that input will eventually be abandoned.
Multiple attempts at one task may occur in parallel with speculative execution turned on.
The Task ID from TaskInProgress is not a unique identifier; don’t use it that way.
MapReduce: High Level
[Diagram: a MapReduce job submitted by a client computer goes to the JobTracker on the master node; TaskTrackers on the slave nodes each run task instances.]
In our case: hadoop-10g queue
Nodes, Trackers, Tasks
The master node runs a JobTracker instance, which accepts job requests from clients.
TaskTracker instances run on the slave nodes.
A TaskTracker forks a separate Java process for each task instance.
Job Distribution
MapReduce programs are contained in a Java “jar” file + an XML file containing serialized program configuration options
Running a MapReduce job places these files into the HDFS and notifies TaskTrackers where to retrieve the relevant program code
… Where’s the data distribution?
Data Distribution
Implicit in the design of MapReduce! All mappers are equivalent, so each maps whatever data is local to its particular node in HDFS.
If lots of data does happen to pile up on the same node, nearby nodes will map it instead. Data transfer is handled implicitly by HDFS.
Data Flow in a MapReduce Program in Hadoop
• InputFormat • Map function • Partitioner • Sorting & Merging • Combiner • Shuffling • Merging • Reduce function • OutputFormat
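The stages listed above can be traced end-to-end in plain Python (a toy word count with assumed names — map_fn, partition, run_job — not Hadoop APIs; a simple sort stands in for the sort/merge and shuffle steps):

```python
# Toy word-count job tracing the data flow:
# map -> partition -> sort/merge (grouping) -> reduce.
from collections import defaultdict

def map_fn(line):                    # Mapper: one input record -> many pairs
    return [(word, 1) for word in line.split()]

def partition(key, num_partitions):  # Partitioner: pick a reduce task per key
    return hash(key) % num_partitions

def run_job(lines, num_reducers=2):
    partitions = defaultdict(list)
    for line in lines:               # map phase
        for k, v in map_fn(line):
            partitions[partition(k, num_reducers)].append((k, v))
    result = {}
    for part in partitions.values():  # one iteration per reduce task
        part.sort()                   # sorting groups equal keys together
        for k, v in part:
            result[k] = result.get(k, 0) + v   # reduce: sum per key
    return result

assert run_job(["a b a", "b c"]) == {"a": 2, "b": 2, "c": 1}
```

No Combiner or real shuffling appears here; in Hadoop those steps only move the same computation around for efficiency.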
Lifecycle of a MapReduce Job
[Diagram: over time, the input splits feed successive waves of map tasks (Map Wave 1, Map Wave 2), whose output feeds successive waves of reduce tasks (Reduce Wave 1, Reduce Wave 2).]
How are the number of splits, number of map and reduce tasks, memory allocation to tasks, etc., determined?
Job Configuration Parameters
• 190+ parameters in Hadoop
• Set manually, or defaults are used
What Happens In Hadoop? Depth First
Job Launch Process: Client
The client program creates a JobConf
Identify classes implementing the Mapper and Reducer interfaces: JobConf.setMapperClass(), setReducerClass()
Specify inputs and outputs: FileInputFormat.setInputPath(), FileOutputFormat.setOutputPath()
Optionally, other options too: JobConf.setNumReduceTasks(), JobConf.setOutputFormat(), …
Job Launch Process: JobClient
Pass the JobConf to JobClient.runJob() or submitJob(); runJob() blocks, submitJob() does not.
JobClient:
Determines the proper division of input into InputSplits
Sends job data to the master JobTracker server
Job Launch Process: JobTracker
JobTracker:
Inserts the jar and JobConf (serialized to XML) in a shared location
Posts a JobInProgress to its run queue
Job Launch Process: TaskTracker
TaskTrackers running on slave nodes periodically query the JobTracker for work
Retrieve the job-specific jar and config
Launch the task in a separate instance of Java; main() is provided by Hadoop
Job Launch Process: Task
TaskTracker.Child.main():
Sets up the child TaskInProgress attempt
Reads the XML configuration
Connects back to the necessary MapReduce components via RPC
Uses TaskRunner to launch the user process
Job Launch Process: TaskRunner
TaskRunner, MapTaskRunner, and MapRunner work in a daisy-chain to launch your Mapper
The task knows ahead of time which InputSplits it should be mapping
It calls the Mapper once for each record retrieved from the InputSplit
Running the Reducer is much the same
Creating the Mapper
You provide the instance of Mapper; it should extend MapReduceBase
One instance of your Mapper is initialized by the MapTaskRunner for a TaskInProgress
It exists in a separate process from all other instances of Mapper – no data sharing!
Mapper
void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter)
K types implement WritableComparable V types implement Writable
What is Writable?
Hadoop defines its own “box” classes for strings (Text), integers (IntWritable), etc.
All values are instances of Writable All keys are instances of WritableComparable
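The idea behind these box classes can be sketched in Python (IntBox is a made-up name for illustration, not part of any Hadoop API): a value must know how to serialize itself to bytes, and a key must additionally be comparable so it can be sorted during the shuffle.

```python
# Hypothetical "box" type in the spirit of IntWritable (illustration only).
import struct

class IntBox:
    def __init__(self, value=0):
        self.value = value

    def write(self):                  # analogous to Writable's write()
        return struct.pack(">i", self.value)   # fixed 4-byte big-endian form

    @classmethod
    def read(cls, data):              # analogous to Writable's readFields()
        return cls(struct.unpack(">i", data)[0])

    def __lt__(self, other):          # analogous to compareTo(): sortable keys
        return self.value < other.value

assert IntBox.read(IntBox(42).write()).value == 42   # round-trips via bytes
assert sorted([IntBox(3), IntBox(1)])[0].value == 1  # usable as a sort key
```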
Getting Data To The Mapper
[Diagram: the InputFormat divides each input file into InputSplits; a RecordReader reads each InputSplit and feeds records to a Mapper, which emits intermediates.]
Reading Data
Data sets are specified by InputFormats
Defines the input data (e.g., a directory)
Identifies the partitions of the data that form an InputSplit
Acts as a factory for RecordReader objects that extract (k, v) records from the input source
FileInputFormat and Friends
TextInputFormat – treats each ‘\n’-terminated line of a file as a value
KeyValueTextInputFormat – maps ‘\n’-terminated text lines of “k SEP v”
SequenceFileInputFormat – binary file of (k, v) pairs with some additional metadata
SequenceFileAsTextInputFormat – same, but maps (k.toString(), v.toString())
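A sketch of how the two text formats interpret the same line (illustrative Python stand-ins, not Hadoop code; tab is assumed as SEP, which is the usual default):

```python
# TextInputFormat: the key is the byte offset of the line within the file,
# and the value is the entire line.
def text_input_format(line, offset):
    return (offset, line)

# KeyValueTextInputFormat: split "k SEP v" at the first separator.
def key_value_text_input_format(line, sep="\t"):
    key, _, value = line.partition(sep)
    return (key, value)

assert text_input_format("hello world", 0) == (0, "hello world")
assert key_value_text_input_format("hello\tworld") == ("hello", "world")
```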
Filtering File Inputs
FileInputFormat will read all files out of a specified directory and send them to the mapper
It delegates filtering of this file list to a method that subclasses may override
e.g., create your own “xyzFileInputFormat” to read *.xyz from the directory list
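The filtering step itself is trivial; a sketch of what the hypothetical “xyzFileInputFormat” above would do to the directory listing (a Python stand-in, not the Hadoop API):

```python
# Keep only the *.xyz entries from a directory's file list.
import fnmatch

def filter_inputs(file_list, pattern="*.xyz"):
    return [name for name in file_list if fnmatch.fnmatch(name, pattern)]

files = ["a.xyz", "b.txt", "c.xyz", "_logs"]
assert filter_inputs(files) == ["a.xyz", "c.xyz"]
```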
Record Readers
Each InputFormat provides its own RecordReader implementation
Provides (unused?) capability for multiplexing
LineRecordReader – reads a line from a text file
KeyValueRecordReader – used by KeyValueTextInputFormat
Input Split Size
FileInputFormat will divide large files into chunks; the exact size is controlled by mapred.min.split.size
RecordReaders receive the file, the offset, and the length of the chunk
Custom InputFormat implementations may override the split size – e.g., “NeverChunkFile”
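The carving of a file into chunks can be sketched as follows (illustrative Python; the policy that the split size is the block size unless mapred.min.split.size raises it mirrors classic FileInputFormat behaviour, but treat it as an assumption here):

```python
# Sketch: carve a file into the (offset, length) chunks that
# RecordReaders receive.

def compute_splits(file_size, block_size=64 * 1024 * 1024, min_size=1):
    split_size = max(min_size, block_size)   # mapred.min.split.size raises this
    splits, offset = [], 0
    while offset < file_size:
        length = min(split_size, file_size - offset)  # last chunk may be short
        splits.append((offset, length))
        offset += length
    return splits

mb = 1024 * 1024
# a 150 MB file with 64 MB blocks -> 64 MB, 64 MB, and 22 MB chunks
assert compute_splits(150 * mb) == [(0, 64 * mb), (64 * mb, 64 * mb), (128 * mb, 22 * mb)]
```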
Sending Data To Reducers
The map function receives an OutputCollector object; OutputCollector.collect() takes (k, v) elements
Any (WritableComparable, Writable) pair can be used
By default, the mapper output types are assumed to be the same as the reducer output types
WritableComparator
Compares WritableComparable data Will call WritableComparable.compare() Can provide fast path for serialized data
JobConf.setOutputValueGroupingComparator()
Sending Data To The Client
The Reporter object sent to the Mapper allows simple asynchronous feedback:
incrCounter(Enum key, long amount)
setStatus(String msg)
It also allows self-identification of input: InputSplit getInputSplit()
Partition And Shuffle
[Diagram: each Mapper hands its intermediates to a Partitioner; shuffling then delivers each partition of intermediates to one of the Reducers.]
Partitioner
int getPartition(key, val, numPartitions) – outputs the partition number for a given key
One partition == the values sent to one Reduce task
HashPartitioner is used by default; it uses key.hashCode() to compute the partition number
JobConf sets the Partitioner implementation
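HashPartitioner fits in a few lines; a Python stand-in (the non-negative masking mirrors the common Java idiom of combining hashCode() with Integer.MAX_VALUE — an assumption here, not quoted source):

```python
# Partition number derived from the key's hash, so equal keys always
# land on the same reduce task.
def get_partition(key, num_partitions):
    return (hash(key) & 0x7FFFFFFF) % num_partitions   # mask keeps it >= 0

p = get_partition("hadoop", 4)
assert 0 <= p < 4
assert p == get_partition("hadoop", 4)   # equal keys -> equal partition
```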
Reduction
reduce( K2 key, Iterator<V2> values, OutputCollector<K3, V3> output, Reporter reporter )
Keys & values sent to one partition all go to the same reduce task
Calls are sorted by key – “earlier” keys are reduced and output before “later” keys
Finally: Writing The Output
[Diagram: each Reducer writes its output through a RecordWriter, supplied by the OutputFormat, to its own output file.]
OutputFormat
Analogous to InputFormat
TextOutputFormat – writes “key val\n” strings to the output file
SequenceFileOutputFormat – uses a binary format to pack (k, v) pairs
NullOutputFormat – discards output; only useful if defining your own output methods within reduce()
Example Program – Wordcount
map(): receives a chunk of text; outputs a set of word/count pairs
reduce(): receives a key and all its associated values; outputs the key and the sum of the values
package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {
Wordcount – main( )

public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setReducerClass(Reduce.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
}
Wordcount – map( )

public static class Map extends MapReduceBase … {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, …) … {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}
Wordcount – reduce( )

public static class Reduce extends MapReduceBase … {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, …) … {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
Hadoop Streaming
Allows you to create and run map/reduce jobs with any executable
Similar to Unix pipes – the format is: Input | Mapper | Reducer
echo “this sentence has five lines” | cat | wc
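The pipe analogy can be made literal: standard Unix tools can play the mapper and reducer roles for a word count (a local illustration only; nothing here touches Hadoop):

```shell
# tr acts as the mapper (one word per line), sort groups equal keys,
# and uniq -c acts as the reducer (a count per distinct word)
echo "to be or not to be" | tr ' ' '\n' | sort | uniq -c
```

The `sort` step plays the same grouping role that Hadoop's shuffle/sort phase plays between map and reduce.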
Hadoop Streaming
The Mapper and the Reducer receive data from stdin and output to stdout
Hadoop takes care of the transmission of data between the map/reduce tasks
It is still the programmer’s responsibility to set the correct key/value; the default format is “key \t value\n”
Let’s look at a Python example of a MapReduce word count program…
Streaming_Mapper.py
#!/usr/bin/env python
import sys

# read in one line of input at a time from stdin
for line in sys.stdin:
    line = line.strip()       # string
    words = line.split()      # list of strings
    # write data on stdout
    for word in words:
        print('%s\t%i' % (word, 1))
Hadoop Streaming
What are we outputting? Example output: “the 1”
By default, “the” is the key and “1” is the value
Hadoop Streaming handles delivering this key/value pair to a Reducer
It is able to send similar keys to the same Reducer, or to an intermediary Combiner
Streaming_Reducer.py
#!/usr/bin/env python
import sys

wordcount = {}                # empty dictionary
# read in one line of input at a time from stdin
for line in sys.stdin:
    line = line.strip()       # string
    key, value = line.split('\t')
    wordcount[key] = wordcount.get(key, 0) + int(value)
# write data on stdout
for word, count in sorted(wordcount.items()):
    print('%s\t%i' % (word, count))
Hadoop Streaming Gotcha
The Streaming Reducer receives single lines (key/value pairs) from stdin, whereas a regular Reducer receives a collection of all the values for a particular key
It is still the case that all the values for a particular key will go to a single Reducer
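Because a streaming reducer sees one key/value line at a time rather than a grouped iterator, it must detect key boundaries itself; since Hadoop sorts the lines by key first, a change in key means the previous key is finished. A sketch of that pattern (an in-memory list stands in for stdin):

```python
# Streaming-style reduce: sum values per key from pre-sorted "key\tvalue"
# lines, flushing a result every time the key changes.

def streaming_reduce(lines):
    results, current_key, total = [], None, 0
    for line in lines:
        key, value = line.strip().split("\t")
        if key != current_key:
            if current_key is not None:
                results.append((current_key, total))   # flush finished key
            current_key, total = key, 0
        total += int(value)
    if current_key is not None:
        results.append((current_key, total))           # flush the last key
    return results

sorted_input = ["be\t1", "be\t1", "to\t1", "to\t1"]
assert streaming_reduce(sorted_input) == [("be", 2), ("to", 2)]
```

This avoids holding the whole word count dictionary in memory, unlike the Streaming_Reducer.py example above.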
Using Hadoop Distributed File System (HDFS)
Can access HDFS through various shell commands (see the Further Resources slide for a link to the documentation):
hadoop fs -put <localsrc> … <dst>
hadoop fs -get <src> <localdst>
hadoop fs -ls
hadoop fs -rm file
Configuring Number of Tasks
Normal method:
jobConf.setNumMapTasks(400)
jobConf.setNumReduceTasks(4)
Hadoop Streaming method:
-jobconf mapred.map.tasks=400
-jobconf mapred.reduce.tasks=4
Note: the number of map tasks is only a hint to the framework; the actual number depends on the number of InputSplits generated
Running a Hadoop Job
Place the input file into HDFS:
hadoop fs -put ./input-file input-file
Run either the normal or the streaming version:
hadoop jar Wordcount.jar org.myorg.Wordcount input-file output-file
hadoop jar hadoop-streaming.jar \
    -input input-file \
    -output output-file \
    -file Streaming_Mapper.py \
    -mapper "python Streaming_Mapper.py" \
    -file Streaming_Reducer.py \
    -reducer "python Streaming_Reducer.py"
Submitting / Running via LSF
• Add the appropriate modules
• Get an interactive node on queue “hadoop-10g”
• Adjust the lines for transferring the input file to HDFS and starting the Hadoop job
• Know the expected runtime (it is generally good practice to overshoot your estimate)
NOTICE: “Every user in this queue will not get more than 10 cores at any given time. There is no queue time limit. Use “screen” on the login nodes so you can detach and exit while the job runs.”
Output Parsing
The output of the reduce tasks must be retrieved:
hadoop fs -get output-file hadoop-output
This creates a directory of output files, one per reduce task, numbered part-00000, part-00001, etc.
Sample output of Wordcount:
head -n5 part-00000
“’tis 1
“come 2
“coming 1
“edwin 1
“found 1
Extra Output
The stdout/stderr streams of Hadoop itself will be stored in an output file (whichever one is named in the startup script):
#$ -o output.$job_id
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = svc-3024-8-10.rc.usf.edu/10.250.4.205
…
11/03/02 18:28:47 INFO mapred.FileInputFormat: Total input paths to process : 1
11/03/02 18:28:47 INFO mapred.JobClient: Running job: job_local_0001
…
11/03/02 18:28:48 INFO mapred.MapTask: numReduceTasks: 1
…
11/03/02 18:28:48 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000000_0' done.
11/03/02 18:28:48 INFO mapred.Merger: Merging 1 sorted segments
11/03/02 18:28:48 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 43927 bytes
11/03/02 18:28:48 INFO mapred.JobClient: map 100% reduce 0%
…
11/03/02 18:28:49 INFO mapred.TaskRunner: Task 'attempt_local_0001_r_000000_0' done.
11/03/02 18:28:49 INFO mapred.JobClient: Job complete: job_local_0001
The 50TB Hadoop Cluster on Discovery
Ransomware hackers use HADOOP to hack HADOOP
Don’t try it … the Hadoop Cluster on Discovery is configured using a non-root account. So move your data over from HDFS once you are done.
WE NOW RUN TWO TEST EXAMPLES
• Download the document with example code from: http://nuweb12.neu.edu/rc/wp-content/uploads/2014/09/USING_HDFS_ON_DISCOVERY_CLUSTER-.pdf
• Detailed instructions are in the document above.
QUESTIONS