Intro to HDFS and MapReduce


Copyright © 2012-2013, Think Big Analytics, All Rights Reserved

Introduction to HDFS and MapReduce

Thursday, January 10, 2013

Who Am I? Ryan Tabora

- Data Developer at Think Big Analytics
- Big Data Consulting
- Experience working with Hadoop, HBase, Hive, Solr, Cassandra, etc.

Think Big Analytics

• One of Silicon Valley's fastest-growing Big Data startups.
• 100% focus on Big Data consulting & Data Science solution services.
• Management background: Cambridge Technology, C-bridge, Oracle, Sun Microsystems, Quantcast, Accenture.
  - C-bridge Internet Solutions (CBIS): founders & executives; founded 1996, IPO 1999.
• Clients: 40+.
• North America locations:
  - US East: Boston, New York, Washington D.C.
  - US Central: Chicago, Austin
  - US West: HQ Mountain View, San Diego, Salt Lake City
• EMEA & APAC

Think Big is the leading professional services firm that's purpose-built for Big Data.

Think Big Recognized as a Top Pure-Play Big Data Vendor

Source: Forbes, February 2012

Agenda

- Big Data
- Hadoop Ecosystem
- HDFS
- MapReduce in Hadoop
- The Hadoop Java API
- Conclusions

Big Data

A Data Shift...

[Figure: growth of the digital universe. Source: EMC Digital Universe Study]

Motivation

"Simple algorithms and lots of data trump complex models."
- Halevy, Norvig, and Pereira (Google), IEEE Intelligent Systems

Pioneers

• Google and Yahoo:
  - Index 850+ million websites, over one trillion URLs.
• Facebook ad targeting:
  - 840+ million users, > 50% of whom are active daily.

Hadoop Ecosystem

Common Tool?

• Hadoop
  - Cluster: distributed computing platform.
  - Commodity, server-class hardware.
  - Extensible platform.

Hadoop Origins

• MapReduce and Google File System (GFS) pioneered at Google.
• Hadoop is the commercially-supported open-source equivalent.

What Is Hadoop?

• Hadoop is a platform.
• Distributes and replicates data.
• Manages parallel tasks created by users.
• Runs as several processes on a cluster.
• The term Hadoop generally refers to a toolset, not a single tool.

Why Hadoop?

• Handles unstructured to semi-structured to structured data.
• Handles enormous data volumes.
• Flexible data analysis and machine learning tools.
• Cost-effective scalability.

The Hadoop Ecosystem

• HDFS - Hadoop Distributed File System.
• Map/Reduce - A distributed framework for executing work in parallel.
• Hive - A SQL-like language with a metastore that allows SQL manipulation of data stored on HDFS.
• Pig - A top-down data flow scripting language for manipulating data.
• HBase - A NoSQL data store built for random, non-sequential access.

HDFS

What Is HDFS?

• Hadoop Distributed File System.
• Stores files in blocks across many nodes in a cluster.
• Replicates the blocks across nodes for durability.
• Master/Slave architecture.

HDFS Traits

• Not fully POSIX compliant.
• No file updates.
• Write once, read many times.
• Large blocks, sequential read patterns.
• Designed for batch processing.

HDFS Master

• NameNode
  - Runs on a single node as a master process
    ‣ Holds file metadata (which blocks are where)
    ‣ Directs client access to files in HDFS
• SecondaryNameNode
  - Not a hot failover
  - Maintains a copy of the NameNode metadata

HDFS Slaves

• DataNode
  - Generally runs on all nodes in the cluster
    ‣ Block creation/replication/deletion/reads
    ‣ Takes orders from the NameNode

HDFS Illustrated

[Figure, animated: a client issues "Put File" against a cluster of one NameNode and six DataNodes. The file is split into numbered blocks; for each block the NameNode tells the client which DataNodes to write to, and every block ends up replicated on several DataNodes.]
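To make the client side of "Put File" concrete, here is a minimal sketch (not from the original deck) using Hadoop's Java FileSystem API; the class name and paths are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPut {
  public static void main(String[] args) throws Exception {
    // Picks up the cluster settings (NameNode address, etc.) from the config.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // One call: the NameNode chooses the DataNodes for each block,
    // and the blocks are written and replicated behind the scenes.
    fs.copyFromLocalFile(new Path("/tmp/report.txt"),        // hypothetical local path
                         new Path("/user/demo/report.txt")); // hypothetical HDFS path
    fs.close();
  }
}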

Power of Hadoop

[Figure, animated: a client issues "Read File"; the NameNode returns each block's locations, and the client reads different blocks from several DataNodes in parallel. When one DataNode drops out of the cluster, the read continues from replicas on the remaining nodes.]

Aggregate read throughput = transfer rate x number of machines, e.g. 100 MB/s x 3 = 300 MB/s.

HDFS Shell

• Easy-to-use command line interface.
• Create, copy, move, and delete files.
• Administrative duties - chmod, chown, chgrp.
• Set replication factor for a file.
• Head, tail, cat to view files.
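As a quick illustration (not from the original deck), a session with the HDFS shell might look like this; the paths and file names are hypothetical:

hadoop fs -mkdir /user/demo                  # create a directory
hadoop fs -put local.txt /user/demo/         # copy a local file into HDFS
hadoop fs -ls /user/demo                     # list a directory
hadoop fs -cat /user/demo/local.txt          # print a file
hadoop fs -setrep -w 2 /user/demo/local.txt  # change a file's replication factor
hadoop fs -chmod 640 /user/demo/local.txt    # administrative duties
hadoop fs -rm /user/demo/local.txt           # delete a file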

The Hadoop Ecosystem

• HDFS - Hadoop Distributed File System.
• Map/Reduce - A distributed framework for executing work in parallel.
• Hive - A SQL-like language with a metastore that allows SQL manipulation of data stored on HDFS.
• Pig - A top-down data flow scripting language for manipulating data.
• HBase - A NoSQL data store built for random, non-sequential access.

MapReduce in Hadoop

MapReduce Basics

• Logical functions: Mappers and Reducers.
• Developers write map and reduce functions, then submit a jar to the Hadoop cluster.
• Hadoop handles distributing the Map and Reduce tasks across the cluster.
• Typically batch oriented.

MapReduce Daemons

• JobTracker (Master)
  - Manages MapReduce jobs, giving tasks to different nodes, managing task failure
• TaskTracker (Slave)
  - Creates individual map and reduce tasks
  - Reports task status to JobTracker

MapReduce in Hadoop

Let's look at how MapReduce actually works in Hadoop, using WordCount.

[Figure, animated: WordCount data flow - Input → Mappers → Sort/Shuffle → Reducers → Output. We need to convert the Input into the Output.]

Input (one document per mapper; one of the four documents is empty):

  "Hadoop uses MapReduce"
  "There is a Map phase"
  "There is a Reduce phase"

Mapper output (a (word, 1) pair for every word):

  (hadoop, 1), (uses, 1), (mapreduce, 1)
  (there, 1), (is, 1), (a, 1), (map, 1), (phase, 1)
  (there, 1), (is, 1), (a, 1), (reduce, 1), (phase, 1)

Sort/Shuffle (group values by key; keys partitioned across three reducers by range):

  0-9, a-l: (a, [1,1]), (hadoop, [1]), (is, [1,1])
  m-q:      (map, [1]), (mapreduce, [1]), (phase, [1,1])
  r-z:      (reduce, [1]), (there, [1,1]), (uses, [1])

Reducer output:

  a 2, hadoop 1, is 2
  map 1, mapreduce 1, phase 2
  reduce 1, there 2, uses 1

Map:
• Transform one input to 0-N outputs.

Reduce:
• Collect multiple inputs into one output.

Cluster View of MapReduce

[Figure, animated: a jar containing Map (M) and Reduce (R) code is submitted to the JobTracker; each worker node runs a TaskTracker alongside a DataNode.]

• The client submits the job jar to the JobTracker.
• Map Phase: the JobTracker launches map tasks (M) on the TaskTrackers.
• The map tasks emit intermediate (k,v) pairs. Intermediate data is stored locally, not in HDFS.
• Shuffle/Sort: the intermediate (k,v) pairs are sorted and moved to the nodes that will run the reducers.
• Reduce Phase: reduce tasks (R) aggregate the values for each key.
• Job Complete!

The Hadoop Java API

MapReduce in Java

Let's look at WordCount written in the MapReduce Java API.

Map Code

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class SimpleWordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  // Output key-value objects we'll reuse across calls.
  static final Text word = new Text();
  static final IntWritable one = new IntWritable(1);

  @Override
  public void map(LongWritable key, Text documentContents,
      OutputCollector<Text, IntWritable> collector, Reporter reporter)
      throws IOException {
    // Tokenize the line on whitespace and "collect" each (word, 1).
    String[] tokens = documentContents.toString().split("\\s+");
    for (String wordString : tokens) {
      if (wordString.length() > 0) {
        word.set(wordString.toLowerCase());
        collector.collect(word, one);
      }
    }
  }
}

Let's drill into this code...

• Mapper class with 4 type parameters for the input key-value types and output types.
• Output key-value objects we'll reuse.
• Map method with input, output "collector", and reporting object.
• Tokenize the line, "collect" each (word, 1).

Reduce Code

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SimpleWordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterator<IntWritable> counts,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    // Sum the counts for this word and emit (word, N).
    int count = 0;
    while (counts.hasNext()) {
      count += counts.next().get();
    }
    output.collect(key, new IntWritable(count));
  }
}

Let's drill into this code...

• Reducer class with 4 type parameters for the input key-value types and output types.
• Reduce method with input, output "collector", and reporting object.
• Count the counts per word and emit (word, N).
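The deck shows the mapper and reducer but not the driver that wires them together and submits the jar. Here is a minimal sketch, assuming the same classic org.apache.hadoop.mapred API; the class name SimpleWordCount and the command-line argument handling are our own additions:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SimpleWordCount {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SimpleWordCount.class);
    conf.setJobName("wordcount");

    // Key-value types the job emits.
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    // The classes from the two previous slides.
    conf.setMapperClass(SimpleWordCountMapper.class);
    conf.setReducerClass(SimpleWordCountReducer.class);

    // HDFS input and output locations, taken from the command line.
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // Submit the job and block until it finishes.
    JobClient.runJob(conf);
  }
}

Packaged into a jar, this would be launched with something like: hadoop jar wordcount.jar SimpleWordCount input output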

Other Options

• HDFS - Hadoop Distributed File System.
• Map/Reduce - A distributed framework for executing work in parallel.
• Hive - A SQL-like language with a metastore that allows SQL manipulation of data stored on HDFS.
• Pig - A top-down data flow scripting language for manipulating data.
• HBase - A NoSQL data store built for random, non-sequential access.

Conclusions

Hadoop Benefits

• A cost-effective, scalable way to:
  - Store massive data sets.
  - Perform arbitrary analyses on those data sets.

Hadoop Tools

• Offers a variety of tools for:
  - Application development.
  - Integration with other platforms (e.g., databases).

Hadoop Distributions

• A rich, open-source ecosystem.
  - Free to use.
  - Commercially-supported distributions.

Thank You!

- Feel free to contact me at
  ‣ ryan.tabora@thinkbiganalytics.com
- Or our solutions consultant
  ‣ matt.mcdevitt@thinkbiganalytics.com
- As always, THINK BIG!

Bonus Content

The Hadoop Ecosystem

• HDFS - Hadoop Distributed File System.
• Map/Reduce - A distributed framework for executing work in parallel.
• Hive - A SQL-like language with a metastore that allows SQL manipulation of data stored on HDFS.
• Pig - A top-down data flow scripting language for manipulating data.
• HBase - A NoSQL data store built for random, non-sequential access.

Hive: SQL for Hadoop

Let's look at WordCount written in Hive, the SQL for Hadoop.

CREATE TABLE docs (line STRING);

LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE docs;

CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
  (SELECT explode(split(line, '\s')) AS word FROM docs) w
GROUP BY word
ORDER BY word;

Let's drill into this code...

• Create a table to hold the raw text we're counting. Each line of text becomes one row in the single line column.
• Load the text in the 'docs' directory into the table.
• Create the final table and fill it with the results from a nested query of the docs table that performs WordCount on the fly.
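Saved to a script file, the whole thing can be run in one shot (Hive compiles the query into MapReduce jobs like the ones shown earlier); the file name wordcount.hql is hypothetical:

hive -f wordcount.hql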

Hive

Because so many Hadoop users come from SQL backgrounds, Hive is one of the most essential tools in the ecosystem!

The Hadoop Ecosystem

• HDFS - Hadoop Distributed File System.
• Map/Reduce - A distributed framework for executing work in parallel.
• Hive - A SQL-like language with a metastore that allows SQL manipulation of data stored on HDFS.
• Pig - A top-down data flow scripting language for manipulating data.
• HBase - A NoSQL data store built for random, non-sequential access.

Pig: Data Flow for Hadoop

Let's look at WordCount written in Pig, the Data Flow language for Hadoop.

inpt = LOAD 'docs' USING TextLoader AS (line:chararray);

words = FOREACH inpt GENERATE FLATTEN(TOKENIZE(line)) AS word;

grpd = GROUP words BY word;

cntd = FOREACH grpd GENERATE group, COUNT(words);

STORE cntd INTO 'output';

Let's drill into this code...

• Like the Hive example, load the 'docs' content; each line is a "field".
• Tokenize each line into words (a bag of words) and "flatten" the bag into separate records.
• Collect the same words together.
• Count each word.
• Save the results. Profit!
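As with Hive, the script can be run from a file; the name wordcount.pig is hypothetical:

pig wordcount.pig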

Pig

Pig and Hive overlap, but Pig is popular for ETL, e.g., data transformation, cleansing, ingestion, etc.

Questions?