Distributed Cache With MapReduce


Transcript of Distributed Cache With MapReduce


Slide 2

Objectives

At the end of this module, you will be able to:

Analyze different use cases where MapReduce is used
Differentiate between the traditional way and the MapReduce way
Learn about the Hadoop 2.x MapReduce architecture and its components
Understand the execution flow of a YARN MapReduce application
Understand what Distributed Cache is
Run a MapReduce program with Distributed Cache

Slide 3

Where Is MapReduce Used?

HealthCare
Problem Statement:» De-identifying personal health information.

Weather Forecasting
Problem Statement:» Finding the maximum temperature recorded in a year (sketched below).
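
The maximum-temperature use case maps naturally onto MapReduce. A minimal sketch in Java, assuming each input line has the simple layout "year temperature"; the class names and record layout are illustrative, not from the slides:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class MaxTemperature {

        // Mapper: emit (year, temperature) for every input line.
        public static class MaxTempMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                // Assumed layout: "year temperature"; no malformed-record handling in this sketch.
                String[] fields = line.toString().split("\\s+");
                context.write(new Text(fields[0]), new IntWritable(Integer.parseInt(fields[1])));
            }
        }

        // Reducer: keep the maximum temperature seen for each year.
        public static class MaxTempReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text year, Iterable<IntWritable> temps, Context context)
                    throws IOException, InterruptedException {
                int max = Integer.MIN_VALUE;
                for (IntWritable t : temps) {
                    max = Math.max(max, t.get());
                }
                context.write(year, new IntWritable(max));
            }
        }
    }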

Slide 4

Where Is MapReduce Used?

MapReduce
Features: a program model; a large-scale distributed model; parallel programming; a design pattern
Function: Map, Reduce
Used in: Classification (e.g. Top N records), Analytics (e.g. Join, Selection), Recommendation (e.g. Sort), Summarization (e.g. Inverted Index), Index and Search
Implemented by: Google, Apache Hadoop
For: HDFS, Pig, Hive, HBase

Slide 5

MapReduce Paradigm

The Overall MapReduce Word Count Process

Input: Deer Bear River / Car Car River / Deer Car Bear

Splitting (K1,V1):
Deer Bear River
Car Car River
Deer Car Bear

Mapping (List(K2,V2)):
Deer,1  Bear,1  River,1
Car,1  Car,1  River,1
Deer,1  Car,1  Bear,1

Shuffling (K2,List(V2)):
Bear,(1,1)
Car,(1,1,1)
Deer,(1,1)
River,(1,1)

Reducing:
Bear,2
Car,3
Deer,2
River,2

Final Result (List(K3,V3)): Bear,2  Car,3  Deer,2  River,2
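
The same flow can be written directly against the Hadoop MapReduce API. A minimal sketch of the word-count Mapper and Reducer; the class names are illustrative (mirroring the standard Hadoop example), the API calls are the regular org.apache.hadoop.mapreduce ones:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

        // Mapping: emit (word, 1) for every token, e.g. "Deer Bear River" -> (Deer,1) (Bear,1) (River,1).
        public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer tokens = new StringTokenizer(line.toString());
                while (tokens.hasMoreTokens()) {
                    word.set(tokens.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reducing: after the shuffle groups values by key, e.g. Bear,(1,1), sum them to get Bear,2.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable c : counts) {
                    sum += c.get();
                }
                context.write(word, new IntWritable(sum));
            }
        }
    }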

Slide 6

MapReduce Application Execution

Executing a MapReduce Application on YARN

Slide 7

YARN MR Application Execution Flow

MapReduce Job Execution

» Job Submission
» Job Initialization
» Task Assignment
» Memory Assignment
» Status Updates
» Failure Recovery

Slide 8

YARN MR Application Execution Flow

11. Tasks get executed.
12. If the job has a reducer, the Application Master again requests the Node Manager to start and allocate a container for it.
13. The output of all the maps is given to the reducer, and the reducer is executed.
14. Once the job is finished, the Application Master notifies the Resource Manager and the client library.
15. The Application Master is closed.

Slide 9

Hadoop 2.x: YARN Workflow

Diagram: the Resource Manager (Scheduler plus Applications Manager (AsM)) coordinates a cluster of Node Managers. Each application gets its own Application Master (AppMaster 1, AppMaster 2) running in a container on one of the Node Managers, and its tasks run in further containers (Container 1.1 and 1.2 for application 1; Container 2.1, 2.2 and 2.3 for application 2) spread across the Node Managers.

Slides 10-17

Summary: Application Workflow

Execution Sequence:

1. Client submits an application
2. RM allocates a container to start AM
3. AM registers with RM
4. AM asks RM for containers
5. AM notifies NM to launch containers
6. Application code is executed in container
7. Client contacts RM/AM to monitor application’s status
8. AM unregisters with RM

Diagram: a sequence diagram with Client, RM, NM and AM lanes, showing steps 1-8 in order.
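
On the client side, step 1 of this sequence usually amounts to submitting a configured Job; YARN then handles AM startup, container negotiation and status reporting. A minimal driver sketch, assuming the WordCount Mapper/Reducer from the earlier sketch and illustrative input/output paths:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCount.TokenizerMapper.class);    // from the word-count sketch above
            job.setCombinerClass(WordCount.IntSumReducer.class);
            job.setReducerClass(WordCount.IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory must not already exist
            // Submits the application to the RM and polls RM/AM for status until completion (steps 1 and 7).
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }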

Slide 18

Input Splits

Diagram: INPUT DATA is divided physically into HDFS Blocks and logically into Input Splits.

Slide 19

Relation Between Input Splits and HDFS Blocks

Logical records do not fit neatly into HDFS blocks.
Logical records are lines that cross the boundaries of the blocks.
The first split contains line 5, although that line spans two blocks.

Diagram: file lines 1-11 laid out across block boundaries, with three input splits covering them.
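
Split size is normally derived from the HDFS block size, but it can be bounded per job. A small sketch of the knobs, assuming a Job object like the one in the driver sketch above; the byte values are illustrative:

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizeExample {
        // Bound the logical split size for a FileInputFormat-based job.
        static void configureSplits(Job job) {
            FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);   // at least 64 MB per split (illustrative)
            FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);  // at most 256 MB per split (illustrative)
            // Equivalent configuration keys: mapreduce.input.fileinputformat.split.minsize / split.maxsize
        }
    }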

Slides 20-25

MapReduce Job Submission Flow

Input data is distributed to nodes
Each map task works on a “split” of data
Mapper outputs intermediate data
Data exchange between nodes in a “shuffle” process
Intermediate data of the same key goes to the same reducer
Reducer output is stored

Diagram: INPUT DATA flows to Map tasks on Node 1 and Node 2, is shuffled between the nodes, and is processed by Reduce tasks on Node 1 and Node 2.

Slide 26

Getting Data to the Mapper

Diagram: each Input File is divided into Input Splits; each Input Split is read by a RecordReader, which feeds one Mapper; each Mapper produces (intermediates).

Slide 27

Partition and Shuffle

Diagram: each Mapper’s (intermediates) pass through a Partitioner; the partitioned intermediates are then shuffled so that each Reducer receives the partitions assigned to it.
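
The Partitioner is the piece that decides which reducer a given intermediate key goes to; HashPartitioner is the default. A minimal custom-partitioner sketch for the word-count (Text, IntWritable) intermediates; the class name and the routing rule are purely illustrative:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Routes each intermediate (word, count) pair to a reduce partition.
    public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text word, IntWritable count, int numPartitions) {
            // Illustrative rule: partition by the first character of the word.
            String w = word.toString();
            int first = w.isEmpty() ? 0 : Character.toLowerCase(w.charAt(0));
            return first % numPartitions;
        }
    }

On the driver it would be wired in with job.setPartitionerClass(FirstLetterPartitioner.class) together with job.setNumReduceTasks(n).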

Slide 28

Input Format

Diagram: the Input Format takes the Input File(s), divides them into Input Splits, and supplies a RecordReader for each split; each RecordReader feeds one Mapper, which produces (intermediates).

Slide 29

Input Format – Class Hierarchy (package org.apache.hadoop.mapreduce)

InputFormat<K,V>
  FileInputFormat<K,V>
    CombineFileInputFormat<K,V>
    TextInputFormat
    KeyValueTextInputFormat
    NLineInputFormat
    SequenceFileInputFormat<K,V>
      SequenceFileAsBinaryInputFormat
      SequenceFileAsTextInputFormat
      SequenceFileInputFilter<K,V>
  DBInputFormat<T>
  ComposableInputFormat<K,V> <<interface>>
    CompositeInputFormat<K,V>
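
Which concrete InputFormat a job uses is set on the Job; TextInputFormat is the default. A brief sketch, reusing a Job object like the one in the driver sketch above (the wrapper class name is illustrative):

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

    public class InputFormatExample {
        static void chooseInputFormat(Job job) {
            // Read each line as a (key, value) pair split on the first tab instead of (offset, line).
            job.setInputFormatClass(KeyValueTextInputFormat.class);

            // Alternatively, give every mapper a fixed number of lines per split:
            // job.setInputFormatClass(NLineInputFormat.class);
            // NLineInputFormat.setNumLinesPerSplit(job, 1000);
        }
    }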

Slide 30

What is Distributed Cache?

In computing, a distributed cache is an extension of the traditional concept of a cache used in a single locale. A distributed cache may span multiple servers so that it can grow in size and in transactional capacity. The idea of distributed caching has become feasible now because main memory has become very cheap and network cards have become very fast.

In Hadoop, Distributed Cache is a facility provided by the MapReduce framework to cache files (text, archives, JARs, etc.) needed by applications. It distributes large, application-specific, read-only files efficiently.
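
In the Hadoop 2.x API this facility is exposed on the Job. A minimal sketch, assuming a hypothetical lookup file /cache/lookup.txt already sits in HDFS; the file path, the "lookup" alias and the tab-separated layout are all illustrative:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.net.URI;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;

    public class DistributedCacheExample {

        // Driver side: register the file; YARN localizes it on every node that runs a task.
        static void addLookupFile(Job job) throws Exception {
            job.addCacheFile(new URI("/cache/lookup.txt#lookup"));  // "#lookup" creates a local symlink named "lookup"
        }

        // Task side: read the localized copy once, in setup(), before any map() calls.
        public static class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {
            private final Map<String, String> lookup = new HashMap<>();

            @Override
            protected void setup(Context context) throws IOException {
                try (BufferedReader reader = new BufferedReader(new FileReader("lookup"))) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        String[] parts = line.split("\t", 2);        // assumed layout: key<TAB>value
                        if (parts.length == 2) {
                            lookup.put(parts[0], parts[1]);
                        }
                    }
                }
            }

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                String enriched = lookup.getOrDefault(line.toString(), "UNKNOWN");
                context.write(line, new Text(enriched));
            }
        }
    }

Parsing the cached file once in setup() keeps the per-record map() path cheap, instead of re-reading it for every record.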

Slide 31

Demo

Demo: Bulk Load with MR