Schedulers optimization to handle multiple jobs in hadoop cluster
Transcript of Schedulers optimization to handle multiple jobs in hadoop cluster
SCHEDULERS OPTIMIZATION TO HANDLE MULTIPLE JOBS IN HADOOP CLUSTER
By Shivaraj B G, 4th Sem, Department of Computer Networking, MITE.
Introduction
Big Data: data is becoming not only more accessible but also more usable by computers.
A large portion of the world's flow of Big Data is unruly material such as words, pictures and video on the Web, along with streams of sensor data.
This is called unstructured data, and it is not a natural fit for conventional databases.
Big Data must first be acquired, then processed and analyzed.
Characteristics of Big Data
Volume: the sheer amount of data.
Velocity: the speed at which data arrives.
Variety: unstructured and semi-structured data types.
Value: data has inherent value, but that value must be discovered.
Hadoop: an open-source framework of tools, libraries and techniques for 'big data' analysis.
Hadoop combines two essential components: the Hadoop Distributed File System (HDFS) for storage and management, and MapReduce for processing.
Why Hadoop? It is an adaptable framework of hardware and software with built-in application-level fault tolerance across clusters of computers.
Why use Hadoop?
Hadoop is reliable and simple, and stores extensive data with effectively unlimited, scalable capacity.
It is a 100% open-source package.
It recovers quickly from system failures.
It can process large amounts of data in parallel.
Data written once to HDFS can be read many times.
Contents of Hadoop
1. Hadoop Distributed File System (HDFS): HDFS has a master/slave architecture; a client first interacts with the master through the NameNode and Secondary NameNode.
Every slave runs both a DataNode and a TaskTracker daemon, which communicate with and take instructions from the master node.
The TaskTracker daemon is the slave to the JobTracker, and the DataNode daemon is the slave to the NameNode.
2. MapReduce: a simple programming model built around the use of key-value pairs.
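The key-value flow can be sketched outside Hadoop in a few lines of plain Python. This is a toy word count illustrating the model only, not the Hadoop API:

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) key-value pair for every word in the line.
    return [(word, 1) for word in line.split()]

def reduce_phase(pairs):
    # Reduce: sum the values for each key, mimicking shuffle + reduce.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["hadoop stores big data", "hadoop processes big data"]
pairs = [kv for line in lines for kv in map_phase(line)]
result = reduce_phase(pairs)
# result: {'hadoop': 2, 'stores': 1, 'big': 2, 'data': 2, 'processes': 1}
```

In real Hadoop the map and reduce functions run on different nodes, with the framework grouping pairs by key in between.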
Scheduling in Hadoop
Scheduling is the policy used to determine when a job executes its tasks. Nodes communicate with each other, and across clusters, over the TCP/IP suite through the distributed file system.
Factors include processing time and the communication cost of data transfers against the available bandwidth.
The need for scheduling arises from multiplexing and multitasking.
A fresh Hadoop setup uses only the First-In First-Out (FIFO) scheduler.
Pig: a scripting language intended to handle any sort of data set, including unstructured data.
Pig is made up of two parts: the first is the language itself, called Pig Latin; the second is the Pig runtime environment in which Pig Latin programs are executed.
Pig has two execution modes, Local Mode (pig -x local) and MapReduce Mode (pig or pig -x mapreduce), giving an ad-hoc way of creating and executing MapReduce jobs.
Pig Latin: high-level scripting language; requires no metadata or schema; statements are translated into a series of MapReduce jobs.
Grunt: interactive shell.
Piggybank: shared repository for User Defined Functions (UDFs).
Hive: a data warehouse system for Hadoop.
Provides tools for effortless extract/transform/load (ETL) of records stored directly in HDFS or in other storage systems such as HBase.
Holds metadata that describes the data residing in HDFS and HBase, not the data itself.
Uses a straightforward SQL-like query language called HiveQL.
Queries are executed through MapReduce.
With the aim of optimizing schedulers, the framework allows jobs to complete in a timely manner while letting users who submit queries get results back in a reasonable time, so users have more freedom to adopt the scheduler or other techniques most appropriate to their requirements.
Problem Statement
The First-In First-Out (FIFO) scheduler allows one job to take all task slots within the cluster, i.e., no other job can use the cluster until the current one completes. Consequently, jobs that arrive later or with lower priority are blocked by those ahead in the queue. When the total number of jobs is large, the FIFO scheduler introduces significant delay.
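This head-of-line blocking can be seen in a toy simulation (illustrative only; the job durations are arbitrary time units, not real Hadoop tasks). Under FIFO, two short jobs queued behind one long job inherit almost all of its runtime; under a simple time-shared policy they finish almost immediately:

```python
def fifo_finish_times(durations):
    # FIFO: each job waits for every job ahead of it in the queue.
    t, finish = 0, []
    for d in durations:
        t += d
        finish.append(t)
    return finish

def round_robin_finish_times(durations, quantum=1):
    # Time-sharing: every unfinished job gets one quantum per round.
    remaining, finish, t = list(durations), [0] * len(durations), 0
    while any(r > 0 for r in remaining):
        for i, r in enumerate(remaining):
            if r > 0:
                step = min(quantum, r)
                t += step
                remaining[i] -= step
                if remaining[i] == 0:
                    finish[i] = t
    return finish

jobs = [100, 1, 1]                     # one long job, then two short ones
print(fifo_finish_times(jobs))         # [100, 101, 102]
print(round_robin_finish_times(jobs))  # [102, 2, 3]
```

The short jobs finish at t=2 and t=3 instead of t=101 and t=102, at the cost of a slight delay to the long job.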
Proposed System
The proposed system is a novel framework for this optimization approach.
1. Fair scheduler
The Fair Scheduler starts execution whenever slots are available; while tasks are executing, smaller jobs are assigned to any free slot.
Merits of the Fair Scheduler
Even though the cluster is shared with large jobs, it lets small jobs run quickly.
Unlike the default FIFO scheduler, fair scheduling picks a job from the queue to run a task whenever a slot is free, starving neither large nor small jobs.
It serves multiple users with guaranteed levels of job execution in a shared cluster, and it is simple to configure and administer.
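As a sketch of how this is wired up: in Hadoop 1.x the Fair Scheduler is enabled in mapred-site.xml and pools are declared in an allocation file. The property name and allocation-file elements below are recalled from the Hadoop 1.x fair scheduler documentation rather than taken from the slides, so treat them as an assumption to verify against the release in use:

```xml
<!-- mapred-site.xml: switch the JobTracker to the Fair Scheduler
     (property name assumed from the Hadoop 1.x docs) -->
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>

<!-- fair-scheduler.xml (allocation file): a pool with a guaranteed
     minimum share of task slots and a higher weight -->
<allocations>
  <pool name="research">
    <minMaps>4</minMaps>
    <minReduces>2</minReduces>
    <weight>2.0</weight>
  </pool>
</allocations>
```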
2. Capacity Scheduler
The central concept of the Capacity Scheduler is scheduling queues.
It is designed for large clusters, guaranteeing a minimum capacity to each user while the cluster is partitioned among multiple users.
Merits of the Capacity Scheduler
Capacity guarantees: multiple queues support the simultaneous execution of jobs as submitted by users.
Elasticity: jobs can be assigned capacity beyond their queue's share when resources are freely available, which helps utilization of the entire cluster.
Multi-tenancy: system resources are provided to user-created queues, which link to the JobTracker to execute jobs.
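A two-queue Capacity Scheduler setup in Hadoop 1.x (matching the queueA/queueB runs shown later) would look roughly like the following; again, the property names are recalled from the Hadoop 1.x documentation, not taken from the slides, and should be verified:

```xml
<!-- mapred-site.xml: enable the Capacity Scheduler and declare queues -->
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.CapacityTaskScheduler</value>
</property>
<property>
  <name>mapred.queue.names</name>
  <value>default,queueA,queueB</value>
</property>

<!-- capacity-scheduler.xml: give queueA 40% and queueB 30% of slots -->
<property>
  <name>mapred.capacity-scheduler.queue.queueA.capacity</name>
  <value>40</value>
</property>
<property>
  <name>mapred.capacity-scheduler.queue.queueB.capacity</name>
  <value>30</value>
</property>
```

Jobs are then routed to a queue at submission time with -Dmapred.job.queue.name=queueA.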
Objective
To parallelize job execution across standalone mode or a cluster.
To estimate the processing time of parallel applications, both small and long-running jobs, by distributing tasks to the schedulers.
Methodology
The HDFS framework works on block size. A general parallel file system supports block sizes of 16 KB to 4 MB and maintains a default block size of 256 KB, whereas HDFS works with a default block size of 64 MB.
(Slides 17-19: methodology diagrams.)
Optimization Techniques
1. MapCombineReduce
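The idea of MapCombineReduce is that a combiner runs the reduce logic locally on each mapper's output before the shuffle, so fewer key-value pairs cross the network. A toy illustration in plain Python (not the Hadoop API):

```python
from collections import defaultdict

def combine(pairs):
    # Combiner: pre-aggregate one mapper's (key, value) output locally
    # before the shuffle, reducing the pairs sent over the network.
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return sorted(local.items())

mapper_output = [("hadoop", 1), ("data", 1), ("hadoop", 1), ("hadoop", 1)]
combined = combine(mapper_output)
# 4 pairs shrink to 2 before the shuffle: [('data', 1), ('hadoop', 3)]
```

Because the reducer only sums values, running the same sum early on each mapper does not change the final result.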
![Page 20: Schedulers optimization to handle multiple jobs in hadoop cluster](https://reader035.fdocuments.in/reader035/viewer/2022062412/58eebc5d1a28ab620a8b4655/html5/thumbnails/20.jpg)
2. Distinctive Block Sizes
dfs.block.size sets the file-system block size; the default is 67108864 bytes (64 MB).
E.g. if the input data size = 1 GB and dfs.block.size = 64 MB, the minimum number of maps is (1 * 1024) / 64 = 16.
E.g. if the input data size = 1 GB and dfs.block.size = 128 MB (134217728 bytes), the minimum number of maps is (1 * 1024) / 128 = 8.

| File Size | Time to Move Data to HDFS (64 MB) | Time to Move Data to HDFS (128 MB) | Blocks Created (64 MB) | Blocks Created (128 MB) |
|-----------|-----------------------------------|------------------------------------|------------------------|-------------------------|
| 1.3 GB    | 0.48 sec                          | 0.38 sec                           | 21                     | 11                      |
| 2.7 GB    | 1.38 sec                          | 1.17 sec                           | 41                     | 21                      |
| 4.0 GB    | 2.32 sec                          | 1.58 sec                           | 61                     | 31                      |
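The map counts above follow from ceiling division of file size by block size, since HDFS splits a file into ceil(size / block size) blocks and each block normally feeds one map task. A small sketch:

```python
def min_maps(file_size_bytes, block_size_bytes):
    # Number of HDFS blocks = ceil(file size / block size); each block
    # normally feeds one map task.
    return -(-file_size_bytes // block_size_bytes)  # ceiling division

MB = 1024 ** 2
GB = 1024 ** 3
print(min_maps(1 * GB, 64 * MB))         # 16 maps
print(min_maps(1 * GB, 128 * MB))        # 8 maps
print(min_maps(int(1.3 * GB), 64 * MB))  # 21 blocks, as in the table
```

Doubling the block size halves the number of maps, trading per-task overhead against parallelism.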
S/W and H/W Specification
Software Specification
Operating System: Ubuntu 12.04 LTS.
Java (JDK 6.1), required to run *.jar files.
Hadoop: hadoop-1.0.3.tar.gz (stable release).
Pig: apache-pig-0.14.0.tar.gz.
Hive: apache-hive-0.13.1-bin.tar.gz.
Hardware Specification

| Name             | Specification       |
|------------------|---------------------|
| Main processor   | P4, 2 GHz or higher |
| Secondary memory | 200 GB              |
| Primary memory   | 4 GB or higher      |
Results
Starting the Hadoop multi-node cluster:
$ cd /usr/local/hadoop/bin
$ ./start-all.sh
Displaying the master node and available cluster summary: using a web browser, open the master URL, e.g. localhost:50070 or master:50070.
Listing Total Number of Files in HDFS.
$ bin/hadoop dfs -ls
Input file of weblogs Stored in HDFS
Movie dataset Input File Stored in HDFS
Job Queue Scheduling Information with Default Scheduler
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar WeblogHitsByLinkProcessor.jar WeblogHitsByLinkProcessor /user/hduser/nasa_input /user/hduser/weblog_output/
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar WeblogMessagesizevsHitsProcessor.jar WeblogMessagesizevsHitsProcessor /user/hduser/nasa_input /user/hduser/weblogs_output1/
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar WeblogTimeOfDayHistogramCreator.jar WeblogTimeOfDayHistogramCreator /user/hduser/nasa_input /user/hduser/welogg_output2/
Scheduling Information with the Fair Scheduler. Using a web browser, open the Hadoop MapReduce URL, e.g. localhost:50030.
$ time bin/hadoop jar WeblogHits.jar WeblogHits /user/hduser/heterogeneous/inputs/weblog
/user/hduser/heterogeneous/fair/ouputs/weblog_output
$ time bin/hadoop jar WeblogMessagesizevsHitsProcessor.jar WeblogTimeOfDayHistogramCreator /user/hduser/heterogeneous/inputs/weblog /user/hduser/heterogeneous/fair/ouputs/weblogs_output1
$ time bin/hadoop jar WeblogTimeOfDayHistogramCreator.jar WeblogTimeOfDayHistogramCreator /user/hduser/heterogeneous/inputs/weblog
/user/hduser/heterogeneous/fair/ouputs/weblogg_output2
Job Summary for QueueA using Capacity Scheduler
$ time bin/hadoop jar Weblog.jar Weblog -Dmapred.job.queue.name=queueA user/hduser/heterogeneous/inputs/weblog
/user/hduser/heterogeneous/capacity/ouputs/weblog_output
Job Summary for QueueB using Capacity Scheduler.
$ time bin/hadoop jar WeblogMessagesizevsHitsProcessor.jar WeblogTimeOfDayHistogramCreator -Dmapred.job.queue.name=queueB /user/hduser/heterogeneous/inputs/weblogtime /user/hduser/heterogeneous/capacity/ouputs/weblogs_output1
Output for Web-Logs with Total Number of Hits on Particular Links.
Output for Web-Logs with Total Number of Messages and their Size.
Output for Web-Logs with Time between 0-23 Hours along Number of Users.
Performance Analysis of Proposed Hadoop System using Homogeneous Hadoop Jobs with Single Node Cluster
(Bar chart) Fair 64 MB: 3.7; Fair 128 MB: 2.92; Capacity 64 MB: 3.6; Capacity 128 MB: 3.25.
Performance Analysis of Proposed Hadoop System using Heterogeneous Hadoop Jobs with Single Node Cluster
(Bar chart) Fair 64 MB: 26.11; Fair 128 MB: 19.04; Capacity 64 MB: 29.96; Capacity 128 MB: 21.24.
Performance Analysis of Proposed Hadoop System using Homogeneous Hadoop Jobs with Multi-Node Cluster.
(Bar chart) Fair 64 MB: 1.54; Fair 128 MB: 1.5; Capacity 64 MB: 3.79; Capacity 128 MB: 1.66.
Performance Analysis of Proposed Hadoop System using Heterogeneous Hadoop Jobs with Multi-Node Cluster.
(Bar chart) Fair 64 MB: 7.38; Fair 128 MB: 6.38; Capacity 64 MB: 13.47; Capacity 128 MB: 12.26.
Results for the Movie Dataset using Pig
Running Pig in local mode: pig -x local
movies = LOAD '/Users/Rich/Documents/Courses/Fall2014/BigData/Pig/movies_data.csv' USING PigStorage(',') AS (id, name, year, rating, duration);
DUMP movies;
Filter: list the movies that were released between 1950 and 1960.
movies_1950_1960 = FILTER movies BY (float)year > 1949 AND (float)year < 1961;
STORE movies_1950_1960 INTO '/Users/Rich/Desktop/Demo/movies_1950_1960';
Foreach Generate: list movie names and their duration (duration/3600 gives hours if the field is stored in seconds).
movies_name_duration = FOREACH movies GENERATE name, (float)duration/3600;
STORE movies_name_duration INTO '/Users/Rich/Desktop/Demo/movies_name_duration';
Order: list all movies in descending order of year.
movies_year_sort = ORDER movies BY year DESC;
STORE movies_year_sort INTO '/Users/Rich/Desktop/Demo/movies_year_sort';
Results for the Movie Dataset using Hive
Displaying the contents of the movie data set:
hive> SELECT * FROM movies;
Result for movies released in a particular year:
hive> SELECT * FROM movies WHERE year = 1995;
Finding all movies exceeding a specified length:
hive> SELECT * FROM movies WHERE length > 3000;
Retrieving information using the GROUP BY clause:
hive> SELECT COUNT(1) FROM movies GROUP BY year;
Results for heterogeneous datasets using Hadoop, Pig and Hive (execution times):

|              | Hadoop | Pig  | Hive |
|--------------|--------|------|------|
| Word Count   | 2.367  | 1.58 | 1.52 |
| Movie Rating | 1.53   | 1.45 | 1.57 |
Conclusion
This effort is intended to give a high-level summary of what Big Data is and how to address the issues raised by the four V's, storing data in HDFS under various configuration parameters and setting up Hadoop, Pig and Hive to retrieve useful information from bulky data sets.
Thank you