REDUCTION OF INTERMEDIATE DATA THROUGH MAPREDUCE
FOR PRECISE BIG DATA PROCESSING IN A CLOUD COMPUTING
ENVIRONMENT
Shivani Sant
Department of Computer Science
Sagar Institute Of Research & Technology-Excellence
Bhopal, India
shivani.sant23@gmail.com
Prof. Anjana Verma
Department of Computer Science
Sagar Institute Of Research & Technology-Excellence
Bhopal, India
anjana9859@gmail.com
Abstract— Cloud computing has emerged as a natural platform for big data because of its ability to provide users with on-demand, reliable, flexible, and low-cost services. With the increasing use of cloud applications, the data generated by those applications also increases, which has become an important issue for the cloud known as the big data problem. There is therefore a need for a technology that can manage big data on the cloud efficiently. Apache Hadoop is an efficient solution for handling big data: it is an open-source technology that stores big data in a distributed manner over a cluster of heterogeneous systems and provides reliable storage as well as processing of big data over the cloud. But when the data is distributed over a cluster of multiple systems, the data movement between the systems creates a problem known as load balancing. In this paper, the proposed approach is to run the combine function inside the map method to minimize the volume of emitted intermediate results, so that network congestion on the cloud is reduced and the overall performance of the cluster on the cloud improves.
Keywords— cloud, big data, Hadoop, heterogeneous
cluster, performance, load balancing.
I. INTRODUCTION
There is a huge amount of data generated from the internet and various other sources. This rapidly increasing data is known as big data, and it creates problems of storage and processing. Traditional tools and techniques fail to manage such a huge volume of data. Of the data generated by growing technologies, only 20% is in structured form while the remaining 80% is unstructured, which is known as the big data problem [1]. This large amount of unstructured data and incomplete information makes it difficult for analysts to analyze the data. Traditional systems were used to store and process unstructured data, but handling such data with traditional systems is too time-consuming and expensive. Big data can be identified by three main attributes:
1. Volume – Data is huge and massive.
2. Velocity – Data changes rapidly and arrives quickly, so processing it in a short time is very difficult.
3. Variety – Data comes in different structures; it may be semi-structured or unstructured.
In the present scenario, it is instrumental to blend big data and analytics into a single entity termed big data analytics. Analytics involves the examination of data to derive meaningful insights, such as hidden patterns and trends, that can in turn benefit organizations in making important business decisions and developing newer business models. The problem of data deluge imposes potential challenges in processing and extracting useful information from data. It also requires skills for the management and analysis of huge data sets.
Cloud computing is one of the solutions for managing big data, hosting the big data workload on a cluster of
multiple machines. Cloud computing provides on-demand computing resources and systems that can deliver a number of integrated computing services, without being bound by local resources, to facilitate user access. These resources include data storage, backup, and self-synchronization, as well as software processing and task scheduling. Cloud computing is also a kind of shared-resource system through which a variety of online services can be offered, such as virtual server storage, applications, and licensing for desktop applications [10]. By leveraging shared resources, cloud computing is able to scale out and provide capacity. Cloud resources can be used in private mode through a private cloud, or shared publicly using a public cloud such as Google Cloud Platform, Amazon EC2, or Microsoft Azure.
Hadoop can be used for managing big data on a cluster of commodity hardware. Hadoop comes with HDFS and MapReduce [8]: it stores big data on HDFS and analyzes or processes it using MapReduce.
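To make this division of labor concrete, the following is a minimal word-count sketch written against Hadoop's standard Java MapReduce API; the class names and command-line argument layout are illustrative and not taken from our experiments.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits one (word, 1) pair for every token in its input split.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sums the counts shuffled to it for each distinct word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The (word, 1) pairs emitted by map() are shuffled across the network and grouped by key before reduce() runs; this shuffle is exactly the data movement that the rest of this paper aims to reduce.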
II. LITERATURE REVIEW
According to [1], with the development of smart-grid construction and the reasonable utilization of intermittent energy supplies, data processing on traditional platforms can no longer satisfy the needs of intermittent energy sources, which poses a great challenge for the whole platform. The paper proposes a technique in which a cloud platform is integrated with intermittent energy-source data, together with multi-factor predictive load balancing of the cloud platform, first deploying the overall process of intermittent energy data processing on a new processing platform.
In [2], handling massive data transfers among thousands of interconnected servers plays an important role in the cloud computing environment. Big data is nothing but a collection of relational data, unstructured data, semi-structured data, and streaming data from machines, sensors, web applications, and social media. In the existing system, this concept is enhanced by tuning parameters to overcome bottlenecks during data transfer in scientific cloud applications; the parameters are pipelining, parallelism, and concurrency. The key drawback is that an incorrect parameter combination leads to overloading and under-utilization of the network, which results in congestion and packet loss during data transfer. The authors propose new dynamic workload and queuing methods to improve the network rate while balancing the workload dynamically. They also invoke various scheduling algorithms to predict unbalanced resource utilization in the data center at the start of each interval, which is used to schedule the unbalanced resources. This rescheduling process recovers from over-utilization and under-utilization of the network.
According to [3], managing and processing big data in geo-distributed data centers has gained much attention in recent years. Despite the increasing attention on this subject, most efforts are centered on user-centric solutions, and unfortunately very few address the difficulties encountered by cloud providers in boosting their profits. A highly efficient framework for geo-distributed big data processing in a cloud federation setting is a crucial answer for maximizing the profit of cloud providers. The authors maximize the profit for cloud providers by minimizing costs and penalties. The work proposes to transfer computations to the geo-distributed data and to outsource only the required data to idle resources of federated clouds in order to reduce job costs, and proposes a dynamic job-reordering approach to reduce penalty costs. The performance analysis shows that the proposed algorithm can maximize profit, reduce the cost of MapReduce jobs, and improve the utilization of cluster resources.
The paper proposes a profit-maximization algorithm to efficiently maximize the profit of cloud providers running multiple MapReduce jobs on federated clouds under a deadline. Optimizing cloud-provider profit requires, first, efficiently reducing MapReduce job costs, obtained by outsourcing
the remaining map tasks to idle VMs across clouds; and second, overcoming the penalty of executing MapReduce jobs after a given deadline by applying the dynamic job-reordering approach [11]. The authors show that the reduction of wasted resources has a direct impact on cost reduction and that reordering job execution has a fairness impact on penalty costs and the accepted-job rate. Results show that the proposed algorithm improves performance regarding cloud providers' profit, job costs, and VM utilization.
According to [4], cloud-based Hadoop has recently gained plenty of interest, providing access to a Hadoop cluster environment for processing big data while eliminating the operational challenges of on-site hardware investment, IT support, and configuration of Hadoop components such as HDFS and MapReduce. On-demand Hadoop as a service helps industries focus on business growth, supported by a pay-per-use model for big data processing with an auto-scaling Hadoop cluster feature. The implementation of various MapReduce jobs such as Pi, TeraSort, and WordCount has been carried out on a cloud-based Hadoop deployment using Microsoft Azure cloud services, and the performance of the MapReduce jobs has been evaluated by CPU execution time with varying Hadoop cluster sizes [12]. From the experimental results, it is found that the CPU execution time to finish the jobs decreases as the number of DataNodes in the HDInsight cluster increases, indicating good latency along with increased performance as well as greater customer satisfaction.
According to [5], cloud computing leverages the Hadoop framework for processing big data in parallel. Hadoop has certain limitations that could be exploited to execute jobs more efficiently; these limitations are mostly attributable to data locality in the cluster, job and task scheduling, and resource allocation in Hadoop. The authors propose H2Hadoop, an enhanced Hadoop architecture that reduces the computation cost associated with big data analysis. The proposed architecture also addresses the issue of resource allocation in native Hadoop. H2Hadoop provides a better solution for text data, such as finding DNA sequences, and provides an efficient data-processing approach for cloud computing environments. The H2Hadoop architecture leverages the NameNode's ability to assign jobs to the TaskTrackers within the cluster [14]. By adding control features to the NameNode, H2Hadoop can intelligently direct and assign tasks to the DataNodes that contain the required data.
According to [6], MapReduce is a concurrent operational model for huge data processing in clusters and data centers. A MapReduce workload consists of a batch of jobs, each containing a number of map tasks and reduce tasks, where the map tasks are processed before the reduce tasks. Different job orderings and slot configurations of a MapReduce workload yield different completion times and patterns of resource usage depending on the workload. Two kinds of precise algorithms are utilized to decrease the makespan and the total completion time of an offline MapReduce workload. The first algorithm concentrates on job-ordering optimization for a MapReduce workload under a given MapReduce slot configuration [9]. The second algorithm performs a procedure that searches for an optimized MapReduce slot configuration for a MapReduce workload.
III. PROBLEM DEFINITION
In a heterogeneous Hadoop cluster [7] over the cloud, where each cluster node has its own local memory, Hadoop has the advantage of data locality [13], which means launching the task or processing operation on the machine where the data is located or stored. If data is not locally available on a processing node, it has to be migrated via network interconnects to the node that performs the data-processing operations. Migrating an immense quantity of data results in excessive network congestion, which in turn can deteriorate
system performance. When map tasks are launched on the multiple machines where the data is located, these map tasks operate on the data blocks, and the intermediate output of these blocks is transferred via the network to the reducer machine, which takes the intermediate output as input and produces the final output. The intermediate output of the map tasks is very large, so there is an overhead of moving large amounts of intermediate data from mapper nodes to reducer nodes, which becomes an issue limiting Hadoop's performance.
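For context, Hadoop's conventional mitigation of this shuffle overhead is a separate combiner that pre-aggregates map output before it crosses the network; a minimal sketch, reusing the IntSumReducer class from the earlier word-count listing and its job object, would be a single extra line in the driver:

// The framework may apply the combiner to map output before the shuffle.
job.setCombinerClass(IntSumReducer.class);

Hadoop, however, treats the combiner only as an optional optimization: it may run zero or more times, and only after the per-token pairs have already been serialized into the map-side buffer. This motivates moving the aggregation inside the map method itself, as proposed in the next section.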
IV. PROPOSED WORK
Hadoop MapReduce is now a popular choice for performing large-scale data analytics over the cloud. We created a heterogeneous cluster of multiple machines on the cloud to store and process a huge amount of data. For managing this huge volume over the cluster, we deploy Hadoop on the cloud, which stores the data across multiple machines, while processing is done by MapReduce. The data is processed by multiple machines in parallel, and their outputs are migrated via the network to the reducer machine, which collects this intermediate data and generates the final output. Migrating an immense quantity of data results in excessive network congestion, which in turn can deteriorate system performance. The key idea of the proposed approach is to run the combine function inside the map method to minimize the volume of emitted intermediate results, so that network congestion on the cloud is reduced, improving the overall performance of the cluster on the cloud.
Figure 1. Proposed Flow Diagram
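A minimal sketch of how this in-mapper combining could look for the word-count job is given below, assuming the standard Hadoop Java API; the class name and the choice of a plain HashMap for the per-mapper aggregate are our illustration rather than a prescribed implementation.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// In-mapper combining: counts are accumulated in a local map while the
// split is processed and emitted once per distinct word in cleanup(),
// so far fewer intermediate pairs are shuffled to the reducers.
public class InMapperCombiningMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final Map<String, Integer> counts = new HashMap<>();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            counts.merge(itr.nextToken(), 1, Integer::sum); // combine inside map
        }
    }

    @Override
    public void cleanup(Context context)
            throws IOException, InterruptedException {
        // Emit one aggregated pair per distinct word seen by this mapper.
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
        }
    }
}

Because the local map grows with the number of distinct keys the mapper sees, a practical variant would also flush it to the context whenever it exceeds a size threshold, rather than only in cleanup().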
V. EXPERIMENTAL & RESULT ANALYSIS
All the experiments were performed using an i5-2410M CPU @ 2.30 GHz processor and 4 GB of RAM running Ubuntu 14. We then installed Java, which is a prerequisite for Hadoop, and configured Hadoop on Ubuntu. All the experiments were performed on Google Cloud Platform (GCP), on which we developed a heterogeneous cluster of five nodes. The cluster is implemented on Linux with Hadoop configured on it; the cluster summary is shown in Table 1.
Table 1. Hardware properties of the experimental environment
To load the dataset into HDFS, we first created a heterogeneous Hadoop cluster, for which we used Google cloud services. On Google Cloud Platform we first created a GCP project and then, under Compute Engine, created five virtual machines (one for the master and the remaining for the slaves); Figure 2 shows the cluster of six virtual machines.
Figure 2. Heterogeneous cluster of six nodes
We then developed a MapReduce job performing a word-count application over a whole file of around 560 MB, running the MapReduce job with the data-locality configuration. After development, we launch the existing .jar file; after execution of the existing MapReduce job, the final output is written to the output directory along with performance fields such as the shuffle bytes and the time taken for execution. The execution time taken is shown in Figure 3.
Figure 3. Time taken by the existing MapReduce job
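The shuffle bytes and record counts reported here are exposed programmatically through Hadoop's built-in job counters; the following is a minimal sketch of reading them once a job has completed, where the JobMetrics helper class is our own illustration and not part of the experimental code.

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.TaskCounter;

public class JobMetrics {
    // Print the counters the experiments compare, after
    // job.waitForCompletion(true) has returned.
    public static void report(Job job) throws Exception {
        long mapOutputRecords = job.getCounters()
                .findCounter(TaskCounter.MAP_OUTPUT_RECORDS).getValue();
        long shuffledBytes = job.getCounters()
                .findCounter(TaskCounter.REDUCE_SHUFFLE_BYTES).getValue();
        System.out.println("Map output records: " + mapOutputRecords);
        System.out.println("Reduce shuffle bytes: " + shuffledBytes);
    }
}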
In the existing technique, on a cluster in the cloud where each node has a local disk, it is efficient to move data-processing operations to the nodes where the application data is located. If data is not locally available on a processing node, it has to be migrated via network interconnects to the node that performs the data-processing operations. Migrating an immense quantity of data results in excessive network congestion, which in turn reduces system performance.
So, to increase the performance of the cluster, we balance the load efficiently across the multiple nodes by performing computation on the nodes where the data is located. Because less data is sent to the other nodes, network congestion is reduced; we improve performance further by using the combine function inside the map method to minimize the volume of emitted intermediate results, through which network congestion on the cloud is lowered, improving the overall performance of the cluster on the cloud. Figure 4 shows the measurements for the proposed technique.
Figure 4. Time taken by the proposed MapReduce job
We can now compare the performance of the cluster over the cloud between the existing technique and the proposed technique; the time taken by each is shown in Table 2.
Table 2. Execution time taken by the existing and proposed systems
The tabular result shown in Table 2 is represented as a graph in Figure 5, on which it is
clearly shown that the proposed MapReduce job takes less execution time than the existing MapReduce job.
Figure 5. Graph representation of execution time taken
VI. CONCLUSION
A huge volume of data is generated by the various applications deployed to cloud computing environments, and this data needs to be processed. Cloud computing is one of the solutions for managing big data, hosting the big data workload on a cluster of multiple machines. In this work, we created a Hadoop cluster over a cloud computing environment, which is suitable for dealing with these kinds of applications in parallel. We used Google Cloud Platform (GCP) to create the Hadoop cluster, and fetched the data to the corresponding compute nodes in advance. The results show that the proposed mechanism effectively reduces the data-transmission overhead over the network.
REFERENCES
[1] T. Lin, P. Zhao and J. Zhao, "Study on Load Balancing of Intermittent Energy Big Data Cloud Platform", 2018 International Conference on Intelligent Transportation, Big Data & Smart City (ICITBS), IEEE, 2018.
[2] C. Jayashri, P. Abitha, S. Subburaj, S. Yamuna Devi, S. Suthir and S. Janakiraman, "Big Data Transfers through Dynamic and Load Balanced Flow on Cloud Networks", 3rd International Conference on Advances in Electrical, Electronics, Information, Communication and Bio-Informatics (AEEICB17), IEEE, 2017.
[3] T. Gouasmi, W. Louati and A. Hadj Kacem, "Geo-distributed BigData Processing for Maximizing Profit in Federated Clouds Environment", 26th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, IEEE, 2018.
[4] A. Bhardwaj, V. K. Singh, Vanraj and Y. Narayan, "Analyzing BigData with Hadoop Cluster in HDInsight Azure Cloud", IEEE, 2015.
[5] H. Alshammari, J. Lee and H. Bajwa, "H2Hadoop: Improving Hadoop Performance Using the Metadata of Related Jobs", IEEE Transactions on Cloud Computing, manuscript ID TCC-2015-11-0399, IEEE, 2015.
[6] R. Sadhana and S. Rabeena, "Improve Job Ordering and Slot Configuration in Bigdata", IEEE, 2017.
[7] J. Xie, S. Yin, X. Ruan, Z. Ding, Y. Tian, J. Majors, A. Manzanares and X. Qin, "Improving MapReduce Performance through Data Placement in Heterogeneous Hadoop Clusters", IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum (IPDPSW), (2010) April 19-23: Atlanta, USA.
[8] Z. Guo and G. Fox, "Improving MapReduce Performance in Heterogeneous Network Environments and Resource Utilization", IEEE, (2012).
[9] Q. Chen, C. Liu and Z. Xiao, "Improving MapReduce Performance Using Smart Speculative Execution Strategy", IEEE, (2013).
[10] S. Khalil, S. A. Salem, S. Nassar and E. M. Saad,
“Mapreduce Performance in Heterogeneous
Environments: A Review”, International Journal of
Scientific & Engineering Research, vol. 4, no. 4, (2013).
[11] Z. Tang, J. Q. Zhou, K. L. Li and R. X. Li, “MTSD:
A task scheduling algorithm for MapReduce base on
deadline constraints”, IEEE International Symposium on
Parallel and Distributed Processing Workshops and PhD
Forum (IPDPSW), (2012) May 21-25: Shanghai, China.
[12] M. Zaharia, D. Borthakur, J. Sen Sarma, K.
Elmeleegy, S. Shenker and I. Stoica, “Delay Scheduling:
A Simple Technique for Achieving Locality and
Fairness in Cluster Scheduling”, Proceedings of the 5th
European conference on Computer systems, (2010)
April 13-16: Paris, France.
[13] X. Zhang, Z. Zhong, S. Feng and B. Tu,
“Improving Data Locality of MapReduce by Scheduling
in Homogeneous Computing Environments”, IEEE 9th
International Symposium on Parallel and Distributed
Processing with Applications (ISPA), (2011) May 26-
28: Busan, Korea.
[14] C. Abad, Y. Lu and R. Campbell, “DARE:
Adaptive Data Replication for Efficient Cluster
Scheduling”, IEEE International Conference on Cluster
Computing (CLUSTER), (2011) September 26-30:
Austin, USA.