Using the 50TB Hadoop Cluster on Discovery
Northeastern University Research Computing: Nilay K Roy, MS Computer Science, Ph.D. Computational Physics

Transcript of Using the 50TB Hadoop Cluster on Discovery

Page 1:

Using the 50TB Hadoop Cluster on Discovery

Northeastern University Research Computing: Nilay K Roy, MS Computer Science, Ph.D Computational Physics

Page 2:
Page 3:

Introduction to Hadoop Distributed and Streaming Computing

• Hadoop is an open-source project; everyone is free to use and modify its source.
• Hadoop has a free distribution (Apache Hadoop 2.4.1). Commercial distributions include:
  - Cloudera (Cloudera is able to claim Doug Cutting, Hadoop's co-founder, as its chief architect)
  - EMC (Pivotal HD natively integrates EMC's massively parallel processing (MPP) database technology with Apache Hadoop; the result is a high-performance Hadoop distribution with true SQL processing for Hadoop, so SQL-based queries and other business intelligence tools can be used to analyze data stored in HDFS)
  - Hortonworks
  - IBM
  - MapR

“Five or six years ago, the average large corporation had maybe 360 terabytes of data lying around”, Kirk Dunn (COO of Cloudera) says, “Cloudera now has some customers that are generating about that much new data nearly every day, and it’s not slowing down”.

Page 4:

Where is HDFS a good fit?
• Data Volume: store large datasets, which may be TBs or PBs or even more.
• Data Variety: store different varieties of data - structured, unstructured, and semi-structured.
• Store data on commodity hardware (economical).

Where is HDFS not a good fit?
• Low-latency data access (HBase is a better option).
• Huge numbers of small files (up to millions is OK, but billions is beyond the capacity of current hardware; NameNode metadata storage capacity is the problem).
• Random file access (random read, write, delete, or insert is not possible; Hadoop doesn't support OLTP. An RDBMS is the best fit for OLTP operations).

OLTP System (Online Transaction Processing - operational system) vs. OLAP System (Online Analytical Processing - data warehouse)

Source of data: OLTP - operational data; OLTPs are the original source of the data. OLAP - consolidation data; OLAP data comes from the various OLTP databases.

Purpose of data: OLTP - to control and run fundamental business tasks. OLAP - to help with planning, problem solving, and decision support.

What the data reveals: OLTP - a snapshot of ongoing business processes. OLAP - multi-dimensional views of various kinds of business activities.

Inserts and updates: OLTP - short and fast inserts and updates initiated by end users. OLAP - periodic long-running batch jobs refresh the data.

Queries: OLTP - relatively standardized and simple queries returning relatively few records. OLAP - often complex queries involving aggregations.

Processing speed: OLTP - typically very fast. OLAP - depends on the amount of data involved; batch data refreshes and complex queries may take many hours; query speed can be improved by creating indexes.

Space requirements: OLTP - can be relatively small if historical data is archived. OLAP - larger due to the existence of aggregation structures and history data; requires more indexes than OLTP.

Database design: OLTP - highly normalized with many tables. OLAP - typically de-normalized with fewer tables; uses star and/or snowflake schemas.

Backup and recovery: OLTP - backup religiously; operational data is critical to run the business, and data loss is likely to entail significant monetary loss and legal liability. OLAP - instead of regular backups, some environments may consider simply reloading the OLTP data as a recovery method.

Page 5:

Relationship to and needs in "Big Data Processing"
• Applications run on HDFS - best for large data sets
• A typical file in HDFS is gigabytes to terabytes in size
• Normal OS block size is 4 KB
• Large files run into GBs or TBs - with a 4 KB block size the metadata would be overwhelming (see the worked example below)
• Hadoop uses a block size of 128 MB or 64 MB depending on the release
• The metadata associated with a large file is now extremely small
• HDFS is tuned to support large files
• Provides high aggregate data bandwidth and scales to hundreds of nodes in a single cluster
• Supports tens of millions of files in a single instance
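A quick worked example of why block size matters (pure arithmetic, not from the original slides): a single 1 TB file at the normal 4 KB OS block size corresponds to 2^40 / 2^12 = 268,435,456 block entries for the NameNode to track, while the same file at a 128 MB HDFS block size corresponds to only 2^40 / 2^27 = 8,192 entries - a 32,768x reduction in metadata.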

Data is stored in different formats and can be broadly classified into three types:
1. Structured data: characterized by a high degree of organization - the kind of data found in relational databases / spreadsheets - searched / manipulated using standard algorithms
2. Semi-structured data: data stored in the form of text files - some degree of order, but cannot be searched / manipulated using standard algorithms
3. Unstructured data: no logical structure - analysis is tedious and cumbersome considering the huge volume of data
BIG DATA: volume, variety, and velocity - terabytes and petabytes of data with different file types, generated very fast. Hadoop is best suited to BIG DATA.

Page 6:

Why use Hadoop - advantages
• Hardware failure: the norm rather than the exception - detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS
• Streaming data access: HDFS is designed for batch processing, with an emphasis on high throughput of data access rather than low latency of data access - POSIX semantics in a few key areas are traded to increase data throughput rates
• Large data sets: a typical file in HDFS is gigabytes to terabytes in size - HDFS is tuned to support large files - provides high aggregate data bandwidth and scales to hundreds of nodes in a single cluster - supports tens of millions of files in a single instance
• Simple coherency model: HDFS applications need a write-once-read-many access model for files - a file once created, written, and closed need not be changed - this assumption simplifies data coherency issues and enables high-throughput data access
• "Moving computation is cheaper than moving data": HDFS provides interfaces for applications to move themselves closer to where the data is located - when the size of the data set is huge, a computation requested by an application is much more efficient if it is executed near the data it operates on
• Portability across heterogeneous hardware and software platforms - Java API based

Yahoo's Hadoop cluster of 42,000 nodes is the largest Hadoop cluster to date.

Page 7:

Hadoop features
• Scale-out architecture - add servers to increase capacity
• High availability - serve mission-critical workflows and applications
• Fault tolerance - automatically and seamlessly recover from failures
• Flexible access - multiple and open frameworks for serialization and file system mounts
• Load balancing - place data intelligently for maximum efficiency and utilization
• Tunable replication - multiple copies of each file provide data protection and computational performance
• Security - POSIX-based file permissions for users and groups, with optional LDAP integration

• By default every block in Hadoop is 64 MB (128 MB) and is replicated three times
• The replicas of a block are placed according to rack awareness; by default two replicas go in one rack and the other in another rack (see the example commands below)

File write in HDFS:
• Data is written to HDFS as a pipeline write, block by block
• An acknowledgement is passed back to the client indicating that the data was successfully written to HDFS
• The blocks are placed in a rack-aware fashion
• Each block is given a unique ID, which is stored in the metadata
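The replication factor and block layout of existing files can be inspected and changed from the command line. The commands below are a hedged sketch using standard Hadoop 2.x FS shell / fsck options; the path under /tmp/nilay.roy is only reused from the examples later in this deck:

hadoop fs -setrep -w 2 /tmp/nilay.roy/hadoop_test          # change the replication factor of a tree to 2 and wait
hdfs fsck /tmp/nilay.roy/hadoop_test -files -blocks -locations -racks   # show files, blocks, and where each replica landed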

Page 8:

File reading from HDFS:
• When a client wants to read a file from HDFS, the request is initially handled by the NameNode
• The NameNode verifies the file exists and asks the nearest DataNode to handle the file for reading
• If the file name is not found in the metadata, an IOException occurs

Data replication in HDFS:
• HDFS stores each file as a sequence of blocks; all blocks in a file except the last block are the same size
• Files in HDFS are write-once and have strictly one writer at any time
• The NameNode makes all decisions regarding replication of blocks
• It periodically receives a Heartbeat and a BlockReport from each of the DataNodes in the cluster
• When a client is writing data to an HDFS file with a replication factor of three, the write data is pipelined from one DataNode to the next

Page 9:
Page 10:

(Diagram of the Discovery Hadoop cluster: discovery3 together with the compute nodes compute-2-004, compute-2-005, and compute-2-006.)

Page 11:

Some limitations of HDFS include:
• Centralized master-slave architecture
• No file locking
• File data striped into uniformly sized blocks that are distributed across cluster servers
• Block-level information exposed to applications
• Simple coherency with a write-once, read-many model that restricts what users can do with data

A solution for HPC that retains full POSIX semantics and has BIG DATA capabilities with scaling: use a true parallel file system - IBM GPFS.

GPFS provides a common storage plane - software-defined storage.

Page 12:

GPFS features include:
• High-performance, shared-disk cluster architecture with full POSIX semantics
• Distributed metadata, space allocation, and lock management
• File data blocks striped across multiple servers and disks
• Block-level information not exposed to applications
• Ability to open, read, and append to any section of a file
• GPFS includes a set of features that support MapReduce workloads, called GPFS File Placement Optimizer (FPO)
• GPFS-FPO is a distributed computing architecture where each server is self-sufficient and utilizes local storage - compute tasks are divided between these independent systems and no single server waits on another
• GPFS-FPO provides higher availability through advanced clustering technologies, dynamic file system management, and advanced data replication techniques
• GPFS supports a whole range of enterprise data storage features, such as snapshots, backup, archiving, tiered storage, data caching, WAN data replication, and management policies
• GPFS can be used by a wide range of applications running Hadoop MapReduce workloads and accessing other unstructured file data
• Benchmarks demonstrate that a GPFS-FPO-based system scales linearly, so a file system with 40 servers would have 12 GB/s throughput and a system with 400 servers could achieve 120 GB/s throughput

Page 13:

Core Hadoop modules

Hadoop Common: the common utilities that support the other Hadoop modules
Hadoop Distributed File System (HDFS): a distributed file system that provides high-throughput access to application data
Hadoop YARN: a framework for job scheduling and cluster resource management
Hadoop MapReduce: a YARN-based system for parallel processing of large data sets

Page 14:

The other Hadoop EcoSystem
Ambari: administration tools for installing, monitoring, and maintaining a Hadoop cluster, plus tools to add or remove slave nodes
Avro: a framework for the efficient serialization (a kind of transformation) of data into a compact binary format
Flume: a data flow service for the movement of large volumes of log data into Hadoop
HBase: a distributed columnar database that uses HDFS for its underlying storage; with HBase you can store data in extremely large tables with variable column structures
Cassandra: a scalable multi-master database with no single points of failure
Chukwa: a data collection system for managing large distributed systems
HCatalog: a service for providing a relational view of data stored in Hadoop, including a standard approach for tabular data
Hive: a distributed data warehouse for data that is stored in HDFS; also provides a query language that is based on SQL (HiveQL)
Hue: a Hadoop administration interface with handy GUI tools for browsing files, issuing Hive and Pig queries, and developing Oozie workflows
Mahout: a library of machine learning statistical algorithms that were implemented in MapReduce and can run natively on Hadoop
Oozie: a workflow management tool that can handle the scheduling and chaining together of Hadoop applications
Pig: a platform for the analysis of very large data sets that runs on HDFS, with an infrastructure layer consisting of a compiler that produces sequences of MapReduce programs and a language layer consisting of the query language named Pig Latin
Sqoop: a tool for efficiently moving large amounts of data between relational databases and HDFS
ZooKeeper: a simple interface to the centralized coordination of services (such as naming, configuration, and synchronization) used by distributed applications
Tez: a generalized data-flow programming framework built on Hadoop YARN - provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use-cases - Tez is being adopted by Hive, Pig, and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop MapReduce as the underlying execution engine

NICE TO HAVE BUT NOT NEEDED

Page 15:

Hadoop shell
• The FileSystem (FS) shell is invoked by "hadoop fs <args>"
• All FS shell commands take path URIs as arguments
• The URI format is scheme://authority/path
• For HDFS the scheme is hdfs, and for the local filesystem the scheme is file
• The scheme and authority are optional; if not specified, the default scheme specified in the configuration is used
• An HDFS file or directory such as /parent/child can be specified as hdfs://namenodehost/parent/child or simply as /parent/child (given that your configuration is set to point to hdfs://namenodehost)
• Most of the commands in the FS shell behave like the corresponding Unix commands; differences are described with each of the commands
• Error information is sent to stderr and the output is sent to stdout

Page 16:

FS Shell commands: cat, chgrp, chmod, chown, copyFromLocal, copyToLocal, cp, du, dus, expunge, get, getmerge, ls, lsr, mkdir, moveFromLocal, mv, put, rm, rmr, setrep, stat, tail, test, text, touchz

[nilay.roy@compute-2-005 test1]$ hadoop fs -mkdir hdfs://discovery3:9000/tmp/nilay.roy

[nilay.roy@compute-2-005 ~]$ hdfs dfs -put hadoop_test/ hdfs://discovery3:9000/tmp/nilay.roy/.

[nilay.roy@compute-2-005 ~]$ hdfs dfs -lsr hdfs://discovery3:9000/tmp/nilay.roy

[nilay.roy@compute-2-005 test1]$ hdfs dfs -cat hdfs://discovery3:9000/tmp/nilay.roy/hadoop_test/test1/output/part-00000
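Continuing in the same style, a few more FS shell operations one might run against the same /tmp/nilay.roy area (illustrative only; these specific invocations are not from the original deck):

hadoop fs -ls hdfs://discovery3:9000/tmp/nilay.roy/hadoop_test/test1
hadoop fs -get hdfs://discovery3:9000/tmp/nilay.roy/hadoop_test/test1/output/part-00000 ./part-00000.local
hadoop fs -rm -r hdfs://discovery3:9000/tmp/nilay.roy/hadoop_test/test1/output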

Page 17:

Hadoop API org.apache.hadoop org.apache.hadoop.classification org.apache.hadoop.conf org.apache.hadoop.contrib.bkjournal org.apache.hadoop.contrib.utils.join org.apache.hadoop.examples org.apache.hadoop.examples.dancing org.apache.hadoop.examples.pi org.apache.hadoop.examples.pi.math org.apache.hadoop.examples.terasort org.apache.hadoop.filecache org.apache.hadoop.fs org.apache.hadoop.fs.ftp org.apache.hadoop.fs.http.client org.apache.hadoop.fs.http.server org.apache.hadoop.fs.permission org.apache.hadoop.fs.s3 org.apache.hadoop.fs.s3native org.apache.hadoop.fs.swift.auth org.apache.hadoop.fs.swift.auth.entities org.apache.hadoop.fs.swift.exceptions org.apache.hadoop.fs.swift.http org.apache.hadoop.fs.swift.snative org.apache.hadoop.fs.swift.util org.apache.hadoop.fs.viewfs org.apache.hadoop.ha org.apache.hadoop.ha.proto org.apache.hadoop.ha.protocolPB org.apache.hadoop.http.lib org.apache.hadoop.io org.apache.hadoop.io.compress org.apache.hadoop.io.file.tfile org.apache.hadoop.io.serializer org.apache.hadoop.io.serializer.avro org.apache.hadoop.ipc.proto org.apache.hadoop.ipc.protobuf org.apache.hadoop.ipc.protocolPB org.apache.hadoop.jmx org.apache.hadoop.lib.lang org.apache.hadoop.lib.server org.apache.hadoop.lib.service org.apache.hadoop.lib.service.hadoop org.apache.hadoop.lib.service.instrumentation org.apache.hadoop.lib.service.scheduler org.apache.hadoop.lib.service.security

org.apache.hadoop.lib.servlet org.apache.hadoop.lib.util org.apache.hadoop.lib.wsrs org.apache.hadoop.log org.apache.hadoop.log.metrics org.apache.hadoop.mapred org.apache.hadoop.mapred.gridmix org.apache.hadoop.mapred.gridmix.emulators.resourceusage org.apache.hadoop.mapred.jobcontrol org.apache.hadoop.mapred.join org.apache.hadoop.mapred.lib org.apache.hadoop.mapred.lib.aggregate org.apache.hadoop.mapred.lib.db org.apache.hadoop.mapred.pipes org.apache.hadoop.mapred.proto org.apache.hadoop.mapred.tools org.apache.hadoop.mapreduce org.apache.hadoop.mapreduce.lib.aggregate org.apache.hadoop.mapreduce.lib.chain org.apache.hadoop.mapreduce.lib.db org.apache.hadoop.mapreduce.lib.fieldsel org.apache.hadoop.mapreduce.lib.input org.apache.hadoop.mapreduce.lib.jobcontrol org.apache.hadoop.mapreduce.lib.join org.apache.hadoop.mapreduce.lib.map org.apache.hadoop.mapreduce.lib.output org.apache.hadoop.mapreduce.lib.partition org.apache.hadoop.mapreduce.lib.reduce org.apache.hadoop.mapreduce.security org.apache.hadoop.mapreduce.server.jobtracker org.apache.hadoop.mapreduce.server.tasktracker org.apache.hadoop.mapreduce.task.annotation org.apache.hadoop.mapreduce.tools org.apache.hadoop.mapreduce.v2 org.apache.hadoop.mapreduce.v2.app.webapp.dao org.apache.hadoop.mapreduce.v2.hs.client org.apache.hadoop.mapreduce.v2.hs.proto org.apache.hadoop.mapreduce.v2.hs.protocol org.apache.hadoop.mapreduce.v2.hs.protocolPB org.apache.hadoop.mapreduce.v2.hs.server org.apache.hadoop.mapreduce.v2.hs.webapp.dao org.apache.hadoop.mapreduce.v2.security org.apache.hadoop.maven.plugin.protoc org.apache.hadoop.maven.plugin.util org.apache.hadoop.maven.plugin.versioninfo

org.apache.hadoop.metrics org.apache.hadoop.metrics.file org.apache.hadoop.metrics.ganglia org.apache.hadoop.metrics.spi org.apache.hadoop.metrics2 org.apache.hadoop.metrics2.annotation org.apache.hadoop.metrics2.filter org.apache.hadoop.metrics2.lib org.apache.hadoop.metrics2.sink org.apache.hadoop.metrics2.sink.ganglia org.apache.hadoop.metrics2.source org.apache.hadoop.metrics2.util org.apache.hadoop.minikdc org.apache.hadoop.mount org.apache.hadoop.net org.apache.hadoop.net.unix org.apache.hadoop.nfs org.apache.hadoop.nfs.nfs3 org.apache.hadoop.nfs.nfs3.request org.apache.hadoop.nfs.nfs3.response org.apache.hadoop.oncrpc org.apache.hadoop.oncrpc.security org.apache.hadoop.portmap org.apache.hadoop.record org.apache.hadoop.record.compiler org.apache.hadoop.record.compiler.ant org.apache.hadoop.record.compiler.generated org.apache.hadoop.record.meta org.apache.hadoop.security org.apache.hadoop.security.authentication.client org.apache.hadoop.security.authentication.examples org.apache.hadoop.security.authentication.server org.apache.hadoop.security.authentication.util org.apache.hadoop.security.proto org.apache.hadoop.security.protocolPB org.apache.hadoop.security.ssl org.apache.hadoop.service org.apache.hadoop.streaming org.apache.hadoop.streaming.io org.apache.hadoop.tools.mapred org.apache.hadoop.tools.mapred.lib org.apache.hadoop.tools.proto org.apache.hadoop.tools.protocolPB org.apache.hadoop.tools.rumen org.apache.hadoop.tools.rumen.anonymization org.apache.hadoop.tools.rumen.datatypes org.apache.hadoop.tools.rumen.datatypes.util org.apache.hadoop.tools.rumen.serializers org.apache.hadoop.tools.rumen.state

org.apache.hadoop.tools.util org.apache.hadoop.typedbytes org.apache.hadoop.util org.apache.hadoop.util.bloom org.apache.hadoop.util.hash org.apache.hadoop.yarn org.apache.hadoop.yarn.api org.apache.hadoop.yarn.api.protocolrecords org.apache.hadoop.yarn.api.records org.apache.hadoop.yarn.api.records.timeline org.apache.hadoop.yarn.applications.distributedshell org.apache.hadoop.yarn.applications.unmanagedamlauncher org.apache.hadoop.yarn.client org.apache.hadoop.yarn.client.api org.apache.hadoop.yarn.client.api.async org.apache.hadoop.yarn.client.api.async.impl org.apache.hadoop.yarn.client.api.impl org.apache.hadoop.yarn.client.cli org.apache.hadoop.yarn.conf org.apache.hadoop.yarn.event org.apache.hadoop.yarn.exceptions org.apache.hadoop.yarn.logaggregation org.apache.hadoop.yarn.security org.apache.hadoop.yarn.security.admin org.apache.hadoop.yarn.security.client org.apache.hadoop.yarn.sls org.apache.hadoop.yarn.sls.appmaster org.apache.hadoop.yarn.sls.conf org.apache.hadoop.yarn.sls.nodemanager org.apache.hadoop.yarn.sls.scheduler org.apache.hadoop.yarn.sls.utils org.apache.hadoop.yarn.sls.web org.apache.hadoop.yarn.state org.apache.hadoop.yarn.util org.apache.hadoop.yarn.util.resource org.apache.hadoop.yarn.util.timeline

Page 18:

[nilay.roy@discovery2 test1]$ head -20 WordCount.java
package org.myorg;

import java.io.*;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount extends Configured implements Tool {

   public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {

     static enum Counters { INPUT_WORDS }

     private final static IntWritable one = new IntWritable(1);
     private Text word = new Text();
[nilay.roy@discovery2 test1]$
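To turn this source into a runnable job, the usual Hadoop 2.x compile-and-submit sequence looks roughly like the following. This is only a sketch: it assumes WordCount.java is in the current directory, that the output directory does not already exist, and it reuses the /tmp/nilay.roy paths from the previous page as placeholders:

mkdir wordcount_classes
javac -classpath "$(hadoop classpath)" -d wordcount_classes WordCount.java
jar -cvf wordcount.jar -C wordcount_classes/ .
hadoop jar wordcount.jar org.myorg.WordCount \
    hdfs://discovery3:9000/tmp/nilay.roy/hadoop_test/test1/input \
    hdfs://discovery3:9000/tmp/nilay.roy/hadoop_test/test1/output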

Page 19:
Page 20:

HDFS (Storage) + MapReduce (Processing)
• MapReduce data flow with a single reduce task
• MapReduce data flow with multiple reduce tasks
• MapReduce data flow with no reduce tasks

Page 21:

Fault tolerance
• Failure is the norm rather than the exception
• An HDFS instance may consist of thousands of server machines, each storing part of the file system's data
• Since we have a huge number of components and each component has a non-trivial probability of failure, there is always some component that is non-functional
• Detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS


Page 22:

Data Characteristics
• Streaming data access: applications need streaming access to data
• Batch processing rather than interactive user access
• Large data sets and files: gigabytes to terabytes in size
• High aggregate data bandwidth
• Scales to hundreds of nodes in a cluster
• Tens of millions of files in a single instance
• Write-once-read-many: a file once created, written, and closed need not be changed - this assumption simplifies coherency
• A map-reduce application like sort or a web-crawler application fits perfectly with this model


Page 23:

(MapReduce word-count diagram: a terabyte-scale input containing words such as "Cat", "Bat", "Dog", and other words is divided into splits; each split goes to a map task, map outputs are locally combined, and the combined results are reduced into the output partitions part0, part1, and part2.)

Page 24:

Namenode and Datanodes
• Master/slave architecture
• An HDFS cluster consists of a single Namenode, a master server that manages the file system namespace and regulates access to files by clients
• There are a number of DataNodes, usually one per node in a cluster
• The DataNodes manage storage attached to the nodes that they run on
• HDFS exposes a file system namespace and allows user data to be stored in files
• A file is split into one or more blocks, and the set of blocks is stored in DataNodes
• DataNodes serve read and write requests and perform block creation, deletion, and replication upon instruction from the Namenode


Page 25:

HDFS Architecture

(Diagram: the Namenode holds the metadata - name, replicas, etc., e.g. /home/foo/data, 6, ... - and services metadata ops from clients and block ops to Datanodes; clients write blocks to and read blocks from Datanodes, which are spread across racks (Rack 1, Rack 2); block replication is coordinated by the Namenode.)

Page 26:

File System Namespace
• Hierarchical file system with directories and files
• Create, remove, move, rename, etc.
• The Namenode maintains the file system
• Any change to the file system meta-information is recorded by the Namenode
• An application can specify the number of replicas of a file needed - the replication factor of the file; this information is stored in the Namenode

Page 27:

Data Replication
• HDFS is designed to store very large files across machines in a large cluster
• Each file is a sequence of blocks; all blocks in the file except the last are of the same size
• Blocks are replicated for fault tolerance
• Block size and replicas are configurable per file (see the example below)
• The Namenode receives a Heartbeat and a BlockReport from each DataNode in the cluster
• The BlockReport contains all the blocks on a Datanode
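Per-file settings can be chosen at write time by overriding the configuration keys on the command line. A hedged illustration using standard Hadoop 2.x property names (the file names are only placeholders):

hdfs dfs -D dfs.blocksize=67108864 -D dfs.replication=2 -put mydata.csv /tmp/nilay.roy/mydata.csv   # 64 MB blocks, 2 replicas for this file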

Page 28:

Replica Placement
• The placement of the replicas is critical to HDFS reliability and performance
• Optimizing replica placement distinguishes HDFS from other distributed file systems
• Rack-aware replica placement - goal: improve reliability, availability, and network bandwidth utilization (a research topic)
• There are many racks, and communication between racks goes through switches
• Network bandwidth between machines on the same rack is greater than between machines on different racks
• The Namenode determines the rack id for each DataNode
• Placing replicas on unique racks is simple but non-optimal: writes are expensive
• With the default replication factor of 3, replicas are placed: one on a node in a local rack, one on a different node in the local rack, and one on a node in a different rack (another research topic?)
• 1/3 of the replicas are on one node, 2/3 of the replicas are on one rack, and the remaining 1/3 are distributed evenly across the remaining racks

Page 29:

Replica Selection
• For a READ operation, HDFS tries to minimize bandwidth consumption and latency
• If there is a replica on the reader node, that replica is preferred
• An HDFS cluster may span multiple data centers: a replica in the local data center is preferred over a remote one

Page 30:

Safemode Startup
• On startup the Namenode enters Safemode; replication of data blocks does not occur in Safemode
• Each DataNode checks in with a Heartbeat and a BlockReport
• The Namenode verifies that each block has an acceptable number of replicas
• After a configurable percentage of safely replicated blocks check in with the Namenode, the Namenode exits Safemode
• It then makes a list of blocks that still need to be replicated and proceeds to replicate these blocks to other Datanodes
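From a client node, the Safemode state and the overall health of the DataNodes can be checked with standard dfsadmin commands (an aside, not part of the original slides; some clusters restrict these commands to administrators):

hdfs dfsadmin -safemode get     # reports whether the Namenode is still in Safemode
hdfs dfsadmin -report           # per-DataNode capacity, usage, and last-heartbeat summary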

Page 31:

Filesystem Metadata
• The HDFS namespace is stored by the Namenode
• The Namenode uses a transaction log called the EditLog to record every change that occurs to the filesystem metadata - for example, creating a new file or changing the replication factor of a file - the EditLog is stored in the Namenode's local filesystem
• The entire filesystem namespace, including the mapping of blocks to files and the file system properties, is stored in a file called FsImage, also stored in the Namenode's local filesystem

Page 32:

Namenode
• Keeps an image of the entire file system namespace and the file Blockmap in memory
• 4 GB of local RAM is sufficient to support these data structures, which represent a huge number of files and directories
• When the Namenode starts up it reads the FsImage and EditLog from its local file system, applies the EditLog transactions to the FsImage, and then stores a copy of the FsImage on the filesystem as a checkpoint
• Periodic checkpointing is done so that the system can recover back to the last checkpointed state in case of a crash

Page 33:

Datanode
• A Datanode stores data in files in its local file system
• The Datanode has no knowledge of the HDFS filesystem
• It stores each block of HDFS data in a separate file
• The Datanode does not create all files in the same directory; it uses heuristics to determine the optimal number of files per directory and creates directories appropriately (a research issue?)
• When the filesystem starts up it generates a list of all HDFS blocks and sends this report to the Namenode: the Blockreport

Page 34:

The Communication Protocol
• All HDFS communication protocols are layered on top of the TCP/IP protocol
• A client establishes a connection to a configurable TCP port on the Namenode machine and talks the ClientProtocol with the Namenode
• The Datanodes talk to the Namenode using the Datanode protocol
• An RPC abstraction wraps both the ClientProtocol and the Datanode protocol
• The Namenode is simply a server and never initiates a request; it only responds to RPC requests issued by DataNodes or clients

Page 35:

Robustness - Objectives
• The primary objective of HDFS is to store data reliably in the presence of failures
• The three common failures are Namenode failure, Datanode failure, and network partition

Page 36:

DataNode failure and heartbeat
• A network partition can cause a subset of Datanodes to lose connectivity with the Namenode
• The Namenode detects this condition by the absence of a Heartbeat message
• The Namenode marks Datanodes without a Heartbeat and does not send any IO requests to them
• Any data registered to a failed Datanode is not available to HDFS
• The death of a Datanode may also cause the replication factor of some blocks to fall below their specified value

Page 37:

Re-replication
• The necessity for re-replication may arise because:
  - a Datanode may become unavailable,
  - a replica may become corrupted,
  - a hard disk on a Datanode may fail, or
  - the replication factor of the block may be increased.

Page 38:

Cluster Rebalancing
• The HDFS architecture is compatible with data rebalancing schemes
• A scheme might move data from one Datanode to another if the free space on a Datanode falls below a certain threshold
• In the event of a sudden high demand for a particular file, a scheme might dynamically create additional replicas and rebalance other data in the cluster
• These types of data rebalancing are not yet implemented (a research issue)

Page 39:

Data Integrity
• Consider a situation where a block of data fetched from a Datanode arrives corrupted
• This corruption may occur because of faults in a storage device, network faults, or buggy software
• An HDFS client creates a checksum of every block of its file and stores it in hidden files in the HDFS namespace
• When a client retrieves the contents of a file, it verifies that the corresponding checksums match
• If they do not match, the client can retrieve the block from a replica

Page 40:

Metadata Disk Failure
• The FsImage and EditLog are the central data structures of HDFS
• A corruption of these files can cause an HDFS instance to be non-functional
• For this reason, a Namenode can be configured to maintain multiple copies of the FsImage and EditLog
• Multiple copies of the FsImage and EditLog files are updated synchronously
• Metadata is not data-intensive
• The Namenode can be a single point of failure: automatic failover is NOT supported! (Another research topic.)

Page 41:

Data Organization - Data Blocks
• HDFS supports write-once-read-many with reads at streaming speeds
• A typical block size is 64 MB (or even 128 MB)
• A file is chopped into 64 MB chunks and stored

Page 42:

Staging
• A client request to create a file does not reach the Namenode immediately
• The HDFS client caches the data into a temporary file; when the data reaches an HDFS block size, the client contacts the Namenode
• The Namenode inserts the filename into its hierarchy and allocates a data block for it
• The Namenode responds to the client with the identity of the Datanode and the destination of the replicas (Datanodes) for the block
• The client then flushes the block from its local memory

Page 43:

Staging (contd.)
• The client sends a message that the file is closed
• The Namenode proceeds to commit the file creation operation into the persistent store
• If the Namenode dies before the file is closed, the file is lost
• This client-side caching is required to avoid network congestion; it also has precedence in AFS (the Andrew File System)

Page 44:

Replication Pipelining
• When the client receives the response from the Namenode, it flushes its block in small pieces (4 KB) to the first replica, which in turn copies it to the next replica, and so on
• Thus data is pipelined from one Datanode to the next

Page 45:

Application Programming Interface
• HDFS provides a Java API for applications to use
• Python access is also used in many applications
• A C language wrapper for the Java API is also available
• An HTTP browser can be used to browse the files of an HDFS instance (see the example below)
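As one concrete illustration of HTTP access, HDFS also exposes a REST interface (WebHDFS) that can be queried with a plain HTTP client. The line below is only a sketch: it assumes WebHDFS is enabled on the cluster and that the NameNode web interface listens on the Hadoop 2.x default port 50070.

curl -i "http://discovery3:50070/webhdfs/v1/tmp/nilay.roy?op=LISTSTATUS"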

Page 46:

FS Shell, Admin and Browser Interface
• HDFS organizes its data in files and directories
• It provides a command line interface called the FS shell that lets the user interact with data in HDFS
• The syntax of the commands is similar to bash and csh
• Example: to create a directory /foodir:
  /bin/hadoop dfs -mkdir /foodir
• There is also a DFSAdmin interface available
• A browser interface is also available to view the namespace

Page 47:

Space Reclamation
• When a file is deleted by a client, HDFS renames the file to a file in the /trash directory for a configurable amount of time
• A client can request an undelete within this allowed time
• After the specified time the file is deleted and the space is reclaimed
• When the replication factor is reduced, the Namenode selects excess replicas that can be deleted
• The next heartbeat(?) transfers this information to the Datanode, which clears the blocks for use

Page 48:

Terminology

Google calls it -> Hadoop equivalent
MapReduce -> Hadoop
GFS -> HDFS
Bigtable -> HBase
Chubby -> ZooKeeper

Page 49:

Some MapReduce Terminology
• Job - a "full program": an execution of a Mapper and Reducer across a data set
• Task - an execution of a Mapper or a Reducer on a slice of data, a.k.a. Task-In-Progress (TIP)
• Task Attempt - a particular instance of an attempt to execute a task on a machine

Page 50:

Task Attempts
• A particular task will be attempted at least once, possibly more times if it crashes
• If the same input causes crashes over and over, that input will eventually be abandoned
• Multiple attempts at one task may occur in parallel with speculative execution turned on
• The Task ID from TaskInProgress is not a unique identifier; don't use it that way

Page 51:

MapReduce: High Level

(Diagram: a MapReduce job submitted by a client computer goes to the JobTracker on the master node; TaskTrackers on the slave nodes each run task instances. In our case: the hadoop-10g queue.)

Page 52:

Nodes, Trackers, Tasks
• The master node runs the JobTracker instance, which accepts job requests from clients
• TaskTracker instances run on slave nodes
• The TaskTracker forks a separate Java process for each task instance

Page 53:

Job Distribution
• MapReduce programs are contained in a Java "jar" file plus an XML file containing serialized program configuration options
• Running a MapReduce job places these files into HDFS and notifies TaskTrackers where to retrieve the relevant program code
• ... Where's the data distribution?

Page 54:

Data Distribution
• Implicit in the design of MapReduce!
• All mappers are equivalent, so map whatever data is local to a particular node in HDFS
• If lots of data does happen to pile up on the same node, nearby nodes will map instead
• Data transfer is handled implicitly by HDFS

Page 55:

Data Flow in a MapReduce Program in Hadoop

• InputFormat
• Map function
• Partitioner
• Sorting & Merging
• Combiner
• Shuffling
• Merging
• Reduce function
• OutputFormat


Page 56:
Page 57:

Lifecycle of a MapReduce Job

Page 58:

(Diagram: over time, the input splits are processed in map waves - Map Wave 1, Map Wave 2 - followed by reduce waves - Reduce Wave 1, Reduce Wave 2.)

How are the number of splits, the number of map and reduce tasks, memory allocation to tasks, etc., determined?

Page 59:
Page 60:

Job Configuration Parameters
• 190+ parameters in Hadoop
• Set manually, or defaults are used (an example of overriding a few per job is sketched below)
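For a driver that is run through Hadoop's generic options parser (ToolRunner), individual parameters can be overridden per job on the command line. The Hadoop 2.x property names below are standard, but the jar, class, and path arguments are placeholders carried over from the WordCount example:

hadoop jar wordcount.jar org.myorg.WordCount \
    -D mapreduce.job.reduces=4 \
    -D mapreduce.map.memory.mb=2048 \
    /tmp/nilay.roy/hadoop_test/test1/input /tmp/nilay.roy/hadoop_test/test1/output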

Page 61:

What Happens In Hadoop? Depth First

Page 62:

Job Launch Process: Client
• The client program creates a JobConf
• Identify classes implementing the Mapper and Reducer interfaces: JobConf.setMapperClass(), setReducerClass()
• Specify inputs and outputs: FileInputFormat.setInputPath(), FileOutputFormat.setOutputPath()
• Optionally, other options too: JobConf.setNumReduceTasks(), JobConf.setOutputFormat()…

Page 63:

Job Launch Process: JobClient
• Pass the JobConf to JobClient.runJob() or submitJob(); runJob() blocks, submitJob() does not
• The JobClient determines the proper division of the input into InputSplits
• Sends the job data to the master JobTracker server

Page 64:

Job Launch Process: JobTracker
• The JobTracker inserts the jar and the JobConf (serialized to XML) in a shared location
• Posts a JobInProgress to its run queue

Page 65:

Job Launch Process: TaskTracker
• TaskTrackers running on slave nodes periodically query the JobTracker for work
• Retrieve the job-specific jar and config
• Launch the task in a separate instance of Java; main() is provided by Hadoop

Page 66:

Job Launch Process: Task
TaskTracker.Child.main():
• Sets up the child TaskInProgress attempt
• Reads the XML configuration
• Connects back to the necessary MapReduce components via RPC
• Uses TaskRunner to launch the user process

Page 67:

Job Launch Process: TaskRunner
• TaskRunner, MapTaskRunner, and MapRunner work in a daisy-chain to launch your Mapper
• The task knows ahead of time which InputSplits it should be mapping
• Calls the Mapper once for each record retrieved from the InputSplit
• Running the Reducer is much the same

Page 68:

Creating the Mapper
• You provide the instance of Mapper; it should extend MapReduceBase
• One instance of your Mapper is initialized by the MapTaskRunner for a TaskInProgress
• It exists in a separate process from all other instances of Mapper - no data sharing!

Page 69:

Mapper

void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter)

K types implement WritableComparable; V types implement Writable

Page 70:

What is Writable?

Hadoop defines its own “box” classes for strings (Text), integers (IntWritable), etc.

All values are instances of Writable; all keys are instances of WritableComparable

Page 71:

Getting Data To The Mapper

(Diagram: the InputFormat divides each input file into InputSplits; a RecordReader reads each InputSplit and feeds records to a Mapper, which produces intermediate output.)

Page 72:

Reading Data
• Data sets are specified by InputFormats
• An InputFormat defines the input data (e.g., a directory)
• Identifies the partitions of the data that form an InputSplit
• Acts as a factory for RecordReader objects to extract (k, v) records from the input source

Page 73:

FileInputFormat and Friends
• TextInputFormat - treats each '\n'-terminated line of a file as a value
• KeyValueTextInputFormat - maps '\n'-terminated text lines of "k SEP v"
• SequenceFileInputFormat - binary file of (k, v) pairs with some additional metadata
• SequenceFileAsTextInputFormat - same, but maps (k.toString(), v.toString())

Page 74:

Filtering File Inputs
• FileInputFormat will read all files out of a specified directory and send them to the mapper
• It delegates filtering of this file list to a method subclasses may override
• e.g., create your own "xyzFileInputFormat" to read *.xyz from the directory list

Page 75:

Record Readers
• Each InputFormat provides its own RecordReader implementation; this provides (unused?) capability for multiplexing
• LineRecordReader - reads a line from a text file
• KeyValueRecordReader - used by KeyValueTextInputFormat

Page 76:

Input Split Size
• FileInputFormat will divide large files into chunks; the exact size is controlled by mapred.min.split.size (see the example below)
• RecordReaders receive the file, offset, and length of the chunk
• Custom InputFormat implementations may override split size - e.g., "NeverChunkFile"
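As with the other job parameters, the minimum split size can be raised per job on the command line for a driver run through ToolRunner; the value (256 MB in bytes), jar, class, and paths below are only placeholders for illustration:

hadoop jar wordcount.jar org.myorg.WordCount -D mapred.min.split.size=268435456 <input> <output>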

Page 77:

Sending Data To Reducers
• The map function receives an OutputCollector object; OutputCollector.collect() takes (k, v) elements
• Any (WritableComparable, Writable) pair can be used
• By default, the mapper output type is assumed to be the same as the reducer output type

Page 78:

WritableComparator
• Compares WritableComparable data
• Will call WritableComparable.compare()
• Can provide a fast path for serialized data
• Set with JobConf.setOutputValueGroupingComparator()

Page 79:

Sending Data To The Client
• The Reporter object sent to the Mapper allows simple asynchronous feedback: incrCounter(Enum key, long amount), setStatus(String msg)
• It also allows self-identification of the input: InputSplit getInputSplit()

Page 80:

Partition And Shuffle

(Diagram: each Mapper's intermediate output passes through a Partitioner, and during shuffling the partitioned intermediates are delivered to the Reducers.)

Page 81:

Partitioner
• int getPartition(key, val, numPartitions) outputs the partition number for a given key
• One partition == the values sent to one Reduce task
• HashPartitioner is used by default; it uses key.hashCode() to return the partition number
• JobConf sets the Partitioner implementation

Page 82:

Reduction
• reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output, Reporter reporter)
• Keys and values sent to one partition all go to the same reduce task
• Calls are sorted by key - "earlier" keys are reduced and output before "later" keys

Page 83:

Finally: Writing The Output

(Diagram: each Reducer writes its results through a RecordWriter, supplied by the OutputFormat, to its own output file.)

Page 84:

OutputFormat
• Analogous to InputFormat
• TextOutputFormat - writes "key val\n" strings to the output file
• SequenceFileOutputFormat - uses a binary format to pack (k, v) pairs
• NullOutputFormat - discards output; only useful if defining your own output methods within reduce()

Page 85:

Example Program - Wordcount

map()
• Receives a chunk of text
• Outputs a set of word/count pairs

reduce()
• Receives a key and all its associated values
• Outputs the key and the sum of the values

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

Page 86:

Wordcount – main( )

public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
}

Page 87:

Wordcount – map( )

public static class Map extends MapReduceBase … {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, …) … {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}

Page 88:

Wordcount – reduce( )

public static class Reduce extends MapReduceBase … {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, …) … {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
}   // closes the WordCount class opened on Page 85

Page 89:

Hadoop Streaming

Allows you to create and run map/reduce jobs with any executable

• Similar to unix pipes, e.g. the format is: Input | Mapper | Reducer
    echo "this sentence has five lines" | cat | wc

Page 90:

Hadoop Streaming

Mapper and Reducer receive data from stdin and output to stdout

• Hadoop takes care of the transmission of data between the map/reduce tasks
• It is still the programmer's responsibility to set the correct key/value
• Default format: "key \t value\n"

Let’s look at a Python example of a MapReduce word count program…

Page 91:

Streaming_Mapper.py

import sys

# read in one line of input at a time from stdin
for line in sys.stdin:
    line = line.strip()          # string
    words = line.split()         # list of strings
    # write data on stdout
    for word in words:
        print '%s\t%i' % (word, 1)

Page 92:

Hadoop Streaming

• What are we outputting? Example output: "the 1"
• By default, "the" is the key, and "1" is the value

• Hadoop Streaming handles delivering this key/value pair to a Reducer
• Able to send similar keys to the same Reducer or to an intermediary Combiner

Page 93:

Streaming_Reducer.py

import sys

wordcount = { }                  # empty dictionary

# read in one line of input at a time from stdin
for line in sys.stdin:
    line = line.strip()          # string
    key, value = line.split()
    wordcount[key] = wordcount.get(key, 0) + int(value)   # counts arrive as strings

# write data on stdout
for word, count in sorted(wordcount.items()):
    print '%s\t%i' % (word, count)

Page 94:

Hadoop Streaming Gotcha

• Streaming Reducer receives single lines (which are key/value pairs) from stdin
• Regular Reducer receives a collection of all the values for a particular key
• It is still the case that all the values for a particular key will go to a single Reducer

Page 95:

Using Hadoop Distributed File System (HDFS)

• Can access HDFS through various shell commands (see Further Resources slide for link to documentation):
    hadoop fs -put <localsrc> … <dst>
    hadoop fs -get <src> <localdst>
    hadoop fs -ls
    hadoop fs -rm <file>
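Besides the shell, HDFS can also be reached programmatically. Below is a minimal sketch using the FileSystem Java API (the class name and paths are placeholders; the cluster configuration files are assumed to be on the classpath), mirroring the -put / -get / -ls / -rm commands above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsShellEquivalents {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);

        fs.copyFromLocalFile(new Path("./input-file"), new Path("input-file"));    // like: hadoop fs -put
        fs.copyToLocalFile(new Path("output-file"), new Path("./hadoop-output"));  // like: hadoop fs -get

        for (FileStatus status : fs.listStatus(new Path("."))) {                   // like: hadoop fs -ls
            System.out.println(status.getPath());
        }

        fs.delete(new Path("input-file"), false);                                  // like: hadoop fs -rm
        fs.close();
    }
}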

Page 96:

Configuring Number of Tasks

• Normal method:
    jobConf.setNumMapTasks(400)
    jobConf.setNumReduceTasks(4)

• Hadoop Streaming method:
    -jobconf mapred.map.tasks=400
    -jobconf mapred.reduce.tasks=4

Note: # of map tasks is only a hint to the framework. Actual number depends on the number of InputSplits generated

Page 97:

Running a Hadoop Job

• Place input file into HDFS:
    hadoop fs -put ./input-file input-file

• Run either the normal or the streaming version:

    hadoop jar Wordcount.jar org.myorg.WordCount input-file output-file

    hadoop jar hadoop-streaming.jar \
        -input input-file \
        -output output-file \
        -file Streaming_Mapper.py \
        -mapper "python Streaming_Mapper.py" \
        -file Streaming_Reducer.py \
        -reducer "python Streaming_Reducer.py"

Page 98:

Submitting / Running via LSF

• Add appropriate modules
• Get an interactive node on queue "hadoop-10g"
• Adjust the lines for transferring the input file to HDFS and starting the hadoop job
• Know expected runtime (generally good practice to overshoot your estimate)
• NOTICE: "Every user in this queue will not get more than 10 cores at any given time. There is no queue time limit. Use "screen" on the login nodes so you can detach and exit while the job runs."

Page 99:

Output Parsing

• Output of the reduce tasks must be retrieved:
    hadoop fs -get output-file hadoop-output
• This creates a directory of output files, 1 per reduce task
• Output files numbered part-00000, part-00001, etc.
• Sample output of Wordcount:

    head -n5 part-00000
    "'tis      1
    "come      2
    "coming    1
    "edwin     1
    "found     1

Page 100:

Extra Output

• The stdout/stderr streams of Hadoop itself will be stored in an output file (whichever one is named in the startup script)
    #$ -o output.$job_id

STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = svc-3024-8-10.rc.usf.edu/10.250.4.205
…
11/03/02 18:28:47 INFO mapred.FileInputFormat: Total input paths to process : 1
11/03/02 18:28:47 INFO mapred.JobClient: Running job: job_local_0001
…
11/03/02 18:28:48 INFO mapred.MapTask: numReduceTasks: 1
…
11/03/02 18:28:48 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000000_0' done.
11/03/02 18:28:48 INFO mapred.Merger: Merging 1 sorted segments
11/03/02 18:28:48 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 43927 bytes
11/03/02 18:28:48 INFO mapred.JobClient:  map 100% reduce 0%
…
11/03/02 18:28:49 INFO mapred.TaskRunner: Task 'attempt_local_0001_r_000000_0' done.
11/03/02 18:28:49 INFO mapred.JobClient: Job complete: job_local_0001

Page 101:

The 50TB Hadoop Cluster on Discovery

• Ransomware hackers use HADOOP to hack HADOOP
• Don't try it … the Hadoop Cluster on Discovery is configured using a non-root account.
• So move your data over from HDFS once done.

Page 102:

WE NOW RUN TWO TEST EXAMPLES

• Download the document with example code from:

    http://nuweb12.neu.edu/rc/wp-content/uploads/2014/09/USING_HDFS_ON_DISCOVERY_CLUSTER-.pdf
• Detailed instructions are in the document above.

Page 103:

QUESTIONS