Schedulers optimization to handle multiple jobs in hadoop cluster
Transcript of Schedulers optimization to handle multiple jobs in hadoop cluster
SCHEDULERS OPTIMIZATION TO HANDLE MULTIPLE JOBS IN HADOOP CLUSTER
By Shivaraj B G, 4th Sem, Department of Computer Networking, MITE.
Introduction
Big Data: data is becoming not only more accessible but also more usable by computers.
A large portion of the world's flow of Big Data is unruly material such as words, pictures and video on the Web, along with streams of sensor data.
This is called unstructured data, and it is not a natural fit for conventional databases.
Big Data must first be acquired, then processed and analyzed.
Characteristics of Big Data
Volume: the sheer amount of data.
Velocity: the speed at which data arrives.
Variety: unstructured and semi-structured data types.
Value: data has inherent value, but that value must be discovered.
Hadoop: an open-source framework of tools, libraries and techniques for 'big data' analysis.
Hadoop combines two essential components: the Hadoop Distributed File System (HDFS) for storage and management, and MapReduce for processing.
Why Hadoop? It is an adaptable framework of hardware and software with built-in application-level fault tolerance across clusters of computers.
Why use Hadoop?
Hadoop is reliable and simple, and stores extensive data with effectively unlimited, scalable capacity.
It is a 100% open-source package.
It recovers quickly from system failures.
It can process large amounts of data in parallel.
Data written once to HDFS can be read many times.
Contents of Hadoop
1. Hadoop Distributed File System (HDFS): HDFS has a master/slave architecture; a client first interacts with the master through the NameNode and Secondary NameNode.
Every slave runs both a DataNode and a TaskTracker daemon, which communicate with and take instructions from the master node.
The TaskTracker daemon is the slave to the JobTracker, and the DataNode daemon is the slave to the NameNode.
2. MapReduce: a simple programming model built around the use of key-value pairs.
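The key-value flow can be sketched outside Hadoop in a few lines of plain Python. This is a toy word count illustrating the model only, not the Hadoop API:

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) key-value pair for every word in the line.
    return [(word, 1) for word in line.split()]

def reduce_phase(pairs):
    # Reduce: sum the values for each key, mimicking shuffle + reduce.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["hadoop stores big data", "hadoop processes big data"]
pairs = [kv for line in lines for kv in map_phase(line)]
result = reduce_phase(pairs)
# result: {'hadoop': 2, 'stores': 1, 'big': 2, 'data': 2, 'processes': 1}
```

In real Hadoop the map and reduce functions run on different nodes, with the framework grouping pairs by key in between.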
Scheduling in Hadoop
Scheduling is the policy used to determine when a job executes its tasks. Nodes communicate with each other, and across clusters, over the TCP/IP suite through the distributed file system.
Factors include processing time and the communication cost of data transfers against the available bandwidth.
The need for scheduling arises from multiplexing and multitasking.
A fresh Hadoop setup uses only the First-In First-Out (FIFO) scheduler.
Pig: a scripting language intended to handle any sort of data set, including unstructured data.
Pig is made up of two parts: the first is the language itself, called Pig Latin; the second is the Pig runtime environment in which Pig Latin programs are executed.
Pig has two execution modes, Local Mode (pig -x local) and MapReduce Mode (pig or pig -x mapreduce), giving an ad-hoc way of creating and executing MapReduce jobs.
Pig Latin: high-level scripting language; requires no metadata or schema; statements are translated into a series of MapReduce jobs.
Grunt: interactive shell.
Piggybank: shared repository for User Defined Functions (UDFs).
Hive: a data warehouse system for Hadoop.
Provides tools for effortless extract/transform/load (ETL) of records stored directly in HDFS or in other storage systems such as HBase.
Holds metadata that describes the data residing in HDFS and HBase, not the data itself.
Uses a straightforward SQL-like query language called HiveQL.
Queries are executed through MapReduce.
With the aim of optimizing schedulers, the framework allows jobs to complete in a timely manner while letting users who submit queries get results back in a reasonable time, so users have more freedom to adopt the scheduler or other techniques most appropriate to their requirements.
Problem Statement
The First-In First-Out (FIFO) scheduler allows one job to take all task slots within the cluster, i.e., no other job can use the cluster until the current one completes. Consequently, jobs that arrive later or with lower priority are blocked by those ahead in the queue. When the total number of jobs is large, the FIFO scheduler introduces significant delay.
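This head-of-line blocking can be seen in a toy simulation (illustrative only; the job durations are arbitrary time units, not real Hadoop tasks). Under FIFO, two short jobs queued behind one long job inherit almost all of its runtime; under a simple time-shared policy they finish almost immediately:

```python
def fifo_finish_times(durations):
    # FIFO: each job waits for every job ahead of it in the queue.
    t, finish = 0, []
    for d in durations:
        t += d
        finish.append(t)
    return finish

def round_robin_finish_times(durations, quantum=1):
    # Time-sharing: every unfinished job gets one quantum per round.
    remaining, finish, t = list(durations), [0] * len(durations), 0
    while any(r > 0 for r in remaining):
        for i, r in enumerate(remaining):
            if r > 0:
                step = min(quantum, r)
                t += step
                remaining[i] -= step
                if remaining[i] == 0:
                    finish[i] = t
    return finish

jobs = [100, 1, 1]                     # one long job, then two short ones
print(fifo_finish_times(jobs))         # [100, 101, 102]
print(round_robin_finish_times(jobs))  # [102, 2, 3]
```

The short jobs finish at t=2 and t=3 instead of t=101 and t=102, at the cost of a slight delay to the long job.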
Proposed System
The proposed system is a novel framework for this optimization approach.
1. Fair scheduler
The Fair Scheduler starts execution whenever slots are available; while tasks are executing, smaller jobs are assigned to any free slot.
Merits of the Fair Scheduler
Even though the cluster is shared with large jobs, it lets small jobs run quickly.
Unlike the default FIFO scheduler, fair scheduling picks a job from the queue to run a task whenever a slot is free, starving neither large nor small jobs.
It serves multiple users with guaranteed levels of job execution in a shared cluster, and it is simple to configure and administer.
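As a sketch of how this is wired up: in Hadoop 1.x the Fair Scheduler is enabled in mapred-site.xml and pools are declared in an allocation file. The property name and allocation-file elements below are recalled from the Hadoop 1.x fair scheduler documentation rather than taken from the slides, so treat them as an assumption to verify against the release in use:

```xml
<!-- mapred-site.xml: switch the JobTracker to the Fair Scheduler
     (property name assumed from the Hadoop 1.x docs) -->
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>

<!-- fair-scheduler.xml (allocation file): a pool with a guaranteed
     minimum share of task slots and a higher weight -->
<allocations>
  <pool name="research">
    <minMaps>4</minMaps>
    <minReduces>2</minReduces>
    <weight>2.0</weight>
  </pool>
</allocations>
```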
2. Capacity Scheduler
The central concept of the Capacity Scheduler is scheduling queues.
It is designed for large clusters, guaranteeing a minimum capacity to each user while the cluster is partitioned among multiple users.
Merits of the Capacity Scheduler
Capacity guarantees: multiple queues support the simultaneous execution of jobs as submitted by users.
Elasticity: jobs can be assigned capacity beyond their queue's share when resources are freely available, which helps utilization of the entire cluster.
Multi-tenancy: system resources are provided to user-created queues, which link to the JobTracker to execute jobs.
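A two-queue Capacity Scheduler setup in Hadoop 1.x (matching the queueA/queueB runs shown later) would look roughly like the following; again, the property names are recalled from the Hadoop 1.x documentation, not taken from the slides, and should be verified:

```xml
<!-- mapred-site.xml: enable the Capacity Scheduler and declare queues -->
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.CapacityTaskScheduler</value>
</property>
<property>
  <name>mapred.queue.names</name>
  <value>default,queueA,queueB</value>
</property>

<!-- capacity-scheduler.xml: give queueA 40% and queueB 30% of slots -->
<property>
  <name>mapred.capacity-scheduler.queue.queueA.capacity</name>
  <value>40</value>
</property>
<property>
  <name>mapred.capacity-scheduler.queue.queueB.capacity</name>
  <value>30</value>
</property>
```

Jobs are then routed to a queue at submission time with -Dmapred.job.queue.name=queueA.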
Objective
To parallelize job execution across standalone mode or a cluster.
To estimate the processing time of parallel applications, both small and long-running jobs, by distributing tasks to the schedulers.
Methodology
The HDFS framework works on block size. A general parallel file system supports block sizes of 16 KB to 4 MB and maintains a default block size of 256 KB, whereas HDFS works with a default block size of 64 MB.
(Slides 17-19: methodology diagrams.)
Optimization Techniques
1. MapCombineReduce
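The idea of MapCombineReduce is that a combiner runs the reduce logic locally on each mapper's output before the shuffle, so fewer key-value pairs cross the network. A toy illustration in plain Python (not the Hadoop API):

```python
from collections import defaultdict

def combine(pairs):
    # Combiner: pre-aggregate one mapper's (key, value) output locally
    # before the shuffle, reducing the pairs sent over the network.
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return sorted(local.items())

mapper_output = [("hadoop", 1), ("data", 1), ("hadoop", 1), ("hadoop", 1)]
combined = combine(mapper_output)
# 4 pairs shrink to 2 before the shuffle: [('data', 1), ('hadoop', 3)]
```

Because the reducer only sums values, running the same sum early on each mapper does not change the final result.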
![Page 20: Schedulers optimization to handle multiple jobs in hadoop cluster](https://reader035.fdocuments.in/reader035/viewer/2022062412/58eebc5d1a28ab620a8b4655/html5/thumbnails/20.jpg)
2. Distinctive Block Sizes
dfs.block.size sets the file-system block size; the default is 67108864 bytes (64 MB).
E.g. if the input data size = 1 GB and dfs.block.size = 64 MB, the minimum number of maps is (1 * 1024) / 64 = 16.
E.g. if the input data size = 1 GB and dfs.block.size = 128 MB (134217728 bytes), the minimum number of maps is (1 * 1024) / 128 = 8.

| File Size | Time to Move Data to HDFS (64 MB) | Time to Move Data to HDFS (128 MB) | Blocks Created (64 MB) | Blocks Created (128 MB) |
|-----------|-----------------------------------|------------------------------------|------------------------|-------------------------|
| 1.3 GB    | 0.48 sec                          | 0.38 sec                           | 21                     | 11                      |
| 2.7 GB    | 1.38 sec                          | 1.17 sec                           | 41                     | 21                      |
| 4.0 GB    | 2.32 sec                          | 1.58 sec                           | 61                     | 31                      |
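The map counts above follow from ceiling division of file size by block size, since HDFS splits a file into ceil(size / block size) blocks and each block normally feeds one map task. A small sketch:

```python
def min_maps(file_size_bytes, block_size_bytes):
    # Number of HDFS blocks = ceil(file size / block size); each block
    # normally feeds one map task.
    return -(-file_size_bytes // block_size_bytes)  # ceiling division

MB = 1024 ** 2
GB = 1024 ** 3
print(min_maps(1 * GB, 64 * MB))         # 16 maps
print(min_maps(1 * GB, 128 * MB))        # 8 maps
print(min_maps(int(1.3 * GB), 64 * MB))  # 21 blocks, as in the table
```

Doubling the block size halves the number of maps, trading per-task overhead against parallelism.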
S/W and H/W Specification
Software Specification
Operating System: Ubuntu 12.04 LTS.
Java (JDK 6.1), required to run *.jar files.
Hadoop: hadoop-1.0.3.tar.gz (stable release).
Pig: apache-pig-0.14.0.tar.gz.
Hive: apache-hive-0.13.1-bin.tar.gz.
Hardware Specification

| Name             | Specification       |
|------------------|---------------------|
| Main processor   | P4, 2 GHz or higher |
| Secondary memory | 200 GB              |
| Primary memory   | 4 GB or higher      |
Results
Starting the Hadoop multi-node cluster:
$ cd /usr/local/hadoop/bin
$ ./start-all.sh
Displaying the master node and available cluster summary: using a web browser, open the master URL, e.g. localhost:50070 or master:50070.
Listing Total Number of Files in HDFS.
$ bin/hadoop dfs -ls
Input file of weblogs Stored in HDFS
Movie dataset Input File Stored in HDFS
Job Queue Scheduling Information with Default Scheduler
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar WeblogHitsByLinkProcessor.jar WeblogHitsByLinkProcessor /user/hduser/nasa_input /user/hduser/weblog_output/
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar WeblogMessagesizevsHitsProcessor.jar WeblogMessagesizevsHitsProcessor /user/hduser/nasa_input /user/hduser/weblogs_output1/
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar WeblogTimeOfDayHistogramCreator.jar WeblogTimeOfDayHistogramCreator /user/hduser/nasa_input /user/hduser/welogg_output2/
Scheduling Information with the Fair Scheduler. Using a web browser, open the Hadoop MapReduce URL, e.g. localhost:50030.
$ time bin/hadoop jar WeblogHits.jar WeblogHits /user/hduser/heterogeneous/inputs/weblog
/user/hduser/heterogeneous/fair/ouputs/weblog_output
$ time bin/hadoop jar WeblogMessagesizevsHitsProcessor.jar WeblogTimeOfDayHistogramCreator /user/hduser/heterogeneous/inputs/weblog /user/hduser/heterogeneous/fair/ouputs/weblogs_output1
$ time bin/hadoop jar WeblogTimeOfDayHistogramCreator.jar WeblogTimeOfDayHistogramCreator /user/hduser/heterogeneous/inputs/weblog
/user/hduser/heterogeneous/fair/ouputs/weblogg_output2
Job Summary for QueueA using Capacity Scheduler
$ time bin/hadoop jar Weblog.jar Weblog -Dmapred.job.queue.name=queueA user/hduser/heterogeneous/inputs/weblog
/user/hduser/heterogeneous/capacity/ouputs/weblog_output
Job Summary for QueueB using Capacity Scheduler.
$ time bin/hadoop jar WeblogMessagesizevsHitsProcessor.jar WeblogTimeOfDayHistogramCreator -Dmapred.job.queue.name=queueB /user/hduser/heterogeneous/inputs/weblogtime /user/hduser/heterogeneous/capacity/ouputs/weblogs_output1
Output for Web-Logs with Total Number of Hits on Particular Links.
Output for Web-Logs with Total Number of Messages and their Size.
Output for Web-Logs with Time between 0-23 Hours along Number of Users.
Performance Analysis of Proposed Hadoop System using Homogeneous Hadoop Jobs with Single Node Cluster
(Bar chart) Fair 64 MB: 3.7; Fair 128 MB: 2.92; Capacity 64 MB: 3.6; Capacity 128 MB: 3.25.
Performance Analysis of Proposed Hadoop System using Heterogeneous Hadoop Jobs with Single Node Cluster
(Bar chart) Fair 64 MB: 26.11; Fair 128 MB: 19.04; Capacity 64 MB: 29.96; Capacity 128 MB: 21.24.
Performance Analysis of Proposed Hadoop System using Homogeneous Hadoop Jobs with Multi-Node Cluster.
(Bar chart) Fair 64 MB: 1.54; Fair 128 MB: 1.5; Capacity 64 MB: 3.79; Capacity 128 MB: 1.66.
Performance Analysis of Proposed Hadoop System using Heterogeneous Hadoop Jobs with Multi-Node Cluster.
(Bar chart) Fair 64 MB: 7.38; Fair 128 MB: 6.38; Capacity 64 MB: 13.47; Capacity 128 MB: 12.26.
Results for the Movie Dataset using Pig
Running Pig in local mode: pig -x local
movies = LOAD '/Users/Rich/Documents/Courses/Fall2014/BigData/Pig/movies_data.csv' USING PigStorage(',') AS (id, name, year, rating, duration);
DUMP movies;
Filter: list the movies that were released between 1950 and 1960.
movies_1950_1960 = FILTER movies BY (float)year > 1949 AND (float)year < 1961;
STORE movies_1950_1960 INTO '/Users/Rich/Desktop/Demo/movies_1950_1960';
Foreach Generate: list movie names and their duration (duration/3600 gives hours if the field is stored in seconds).
movies_name_duration = FOREACH movies GENERATE name, (float)duration/3600;
STORE movies_name_duration INTO '/Users/Rich/Desktop/Demo/movies_name_duration';
Order: list all movies in descending order of year.
movies_year_sort = ORDER movies BY year DESC;
STORE movies_year_sort INTO '/Users/Rich/Desktop/Demo/movies_year_sort';
Results for the Movie Dataset using Hive
Displaying the contents of the movie data set:
hive> SELECT * FROM movies;
Result for movies released in a particular year:
hive> SELECT * FROM movies WHERE year = 1995;
Finding all movies exceeding a specified length:
hive> SELECT * FROM movies WHERE length > 3000;
Retrieving information using the GROUP BY clause:
hive> SELECT COUNT(1) FROM movies GROUP BY year;
Results for heterogeneous datasets using Hadoop, Pig and Hive (execution times):

|              | Hadoop | Pig  | Hive |
|--------------|--------|------|------|
| Word Count   | 2.367  | 1.58 | 1.52 |
| Movie Rating | 1.53   | 1.45 | 1.57 |
Conclusion
This effort is intended to give a high-level summary of what Big Data is and how to address the issues raised by the four V's, storing data in HDFS under various configuration parameters and setting up Hadoop, Pig and Hive to retrieve useful information from bulky data sets.
Thank you