PeerJ PrePrints | https://dx.doi.org/10.7287/peerj.preprints.1320v1 | CC-BY 4.0 Open Access | rec: 24 Aug 2015, publ: 24 Aug 2015

Hybrid HDFS: Decreasing Energy Consumption and Speeding up Hadoop using SSDs

Ivanilton Polato
Dept. of Computer Science
Federal University of Technology – Paraná
Campo Mourão, PR, [email protected]

Fabio Kon
Dept. of Computer Science
University of São Paulo
São Paulo, SP, [email protected]

Denilson Barbosa and Abram Hindle
Dept. of Computer Science
University of Alberta
Edmonton, AB, Canada
{denilson,abram.hindle}@ualberta.ca

ABSTRACT

Apache Hadoop has evolved significantly over the last years, with more than 60 releases bringing new features. By implementing the MapReduce programming paradigm and leveraging HDFS, its distributed file system, Hadoop has become a reliable and fault-tolerant middleware for parallel and distributed computing over large datasets. Nevertheless, Hadoop may struggle under certain workloads, resulting in poor performance and high energy consumption. Users increasingly demand that high performance computing solutions begin to address sustainability and limit power consumption. In this paper, we introduce HDFSH, a hybrid storage mechanism for HDFS, which uses a combination of Hard Disks and Solid-State Disks to achieve higher performance while saving power in Hadoop computations. HDFSH brings to the middleware the best from HDs (affordable cost per GB and high storage capacity) and SSDs (high throughput and low energy consumption) in a configurable fashion, using dedicated storage zones for each storage device type. We implemented our mechanism as a block placement policy for HDFS, and assessed it over six recent releases of the Hadoop project, representing different designs of the Hadoop middleware. Results indicate that our approach increases overall job performance while decreasing the energy consumption under most hybrid configurations evaluated. Our results also show that in many cases storing only part of the data in SSDs results in significant energy savings and execution speedups.

1. INTRODUCTION

Nowadays, two perspectives are relevant for big data analysis: the 3 "Vs": volume, variety, and velocity [21, 36]; and hardware and software infrastructures capable of processing all the collected data. These processing infrastructures now encounter new performance challenges: energy consumption, power usage, and environmental impact. Over the last years, the volume and speed of data creation consistently increased. A recent study estimates that 90% of all data in the world was generated over the last two years [3]. The International Data Corporation (IDC) predicted that from 2005 to 2020, the digital universe will grow by a factor of 300, from 130 exabytes to 40,000 exabytes [31]. The same study also predicted that the "digital universe" will roughly double every two years and the storage market will grow 55%. As a consequence, they expect that the discovery and analytics software market will grow 33% until 2016, which represents an 8 billion-dollar business [31].

In terms of infrastructure, the storage server market has benefited from continually decreasing magnetic spinning hard disk prices and higher performance solid-state drives. Companies can store data at a relatively low cost per GB and create new data centers at affordable prices. Over the last three years, the cost for hard disks has ranged between $0.03 and $0.05 per GB [22], with specialized providers offering ever lower costs for hard disks bundled with services. SSDs are faster, but cost around 20 times or more per GB.

From this perspective, the cost of SSDs could still be considered prohibitive for general storage purposes. However, they can provide unique performance enhancements to data analysis by reducing energy consumption if they receive adequate support from processing platforms. In recent years, a number of frameworks have been released to support parallel and distributed computing, and provide high-level solutions to end-users [2, 6, 14, 23]. Some of these frameworks were built over well-known programming models, such as MPI and MapReduce. Nevertheless, these frameworks do not sufficiently acknowledge the coexistence of HDs and SSDs in the same environment, and therefore are not tailored to benefit from this possibility.

Within this context, this paper presents a hybrid storage model for the Hadoop Distributed File System (HDFS), called HDFSH, which seamlessly integrates both storage technologies – HDs and SSDs – to create a highly efficient hybrid storage system. Our results indicate that in most configurations this approach increases overall job performance while decreasing energy consumption. Additionally, our hybrid storage model splits the file system into storage zones, wherein a block placement strategy directs file blocks to zones according to predefined rules. This enables the use of different storage configurations for different workloads, thereby achieving the desired tradeoff between performance and energy consumption. Our goal is to allow the user to determine the best configuration for the available infrastructure, by setting how much of each storage device should be used during MapReduce computations. The observations made during experiments may also be used as a guide for users seeking to modify existing Hadoop clusters, or even put together a new cluster. The key original contributions of this paper are:

• A novel hybrid storage model for HDFS that takes into account the performance profiles of HDs and SSDs. Since SSDs are faster and use less power than HDs but are much more expensive, these characteristics must be adequately treated for different workloads.

• An HD- and SSD-aware block placement policy that is optimized for heterogeneous storage. This policy's rules were designed to take advantage of the differences between HDs and SSDs by accommodating a pre-defined percentage of the total number of blocks in the SSDs, and the remainder in the HDs. Additionally, we also present the results of an experiment in directing temporary files to a specific storage zone.

• Evaluation of the technique over multiple versions of Apache Hadoop. Our research showed that each Hadoop branch presents a unique energy consumption profile; we detail these results and show how our system behaves in each situation.

Our motivation stems from an inevitable side-effect of the data analysis field's expansion: the increase in data centers' energy consumption. The number of data centers has consistently grown, increasing the availability of computing nodes and storage space, and demanding more power. Data centers' maintenance costs and environmental impacts have consistently increased with the demand for more energy to power and cool them. In fact, energy accounts for 30% of the Total Cost of Ownership (TCO), a major and continuous cost for data centers [12].

Despite the fact that it may be easy and relatively cheap to store data today, the information extraction process remains a bottleneck for IT companies. A large body of research concerns how to perform computations on large-scale datasets within acceptable time limits. As a result, great effort is put into frameworks that help developers leverage numerous available computing resources, providing support for data integrity, replication, load balancing, task scheduling, scalability, and failure recovery.

The MapReduce paradigm [5, 8], through its open source middleware implementation, Apache Hadoop, comprises such an approach to processing large datasets. The map function generally filters and sorts the input data, while the reduce functions are responsible for summarizing the results. Popular for its ability to achieve high computing power by using commodity hardware in large-scale clusters, Hadoop is a relatively low-cost way to obtain an infrastructure capable of carrying out massive computations with a high level of parallelism. The Hadoop middleware framework has three major components: the Hadoop Distributed File System (HDFS), a block file storage designed to reliably hold very large datasets using data replication [30]; MapReduce, a system for parallel processing of large datasets using the homonymous programming paradigm; and the more recent component, the YARN scheduler and resource manager, which is included in Hadoop 0.23.0 and 2.0.0.

Using MapReduce over HDFS, the middleware has rapidly evolved over the last years, promoting transparent data integrity, replication, scalability, and failure recovery features that have made Apache Hadoop very popular both in academia and industry. The platform's popularization over the last five years has benefited from several concepts and technologies that consolidated in the same period, including, e.g., the use of Cloud Computing to achieve scalability, availability, and flexibility in data processing [27].

Despite Hadoop's popularity, it still struggles to properly incorporate certain features. For example, Hadoop typically lacks features that support heterogeneous environments. In its storage layer, Hadoop treats diverse storage devices uniformly. Thus, even though it supports the use of SSDs, HDFS does not acknowledge any differences between HDs and SSDs.

The remainder of this paper is organized as follows. Section 2 presents an overview of Apache Hadoop, including its history and components. Section 3 introduces our hybrid storage, the block placement models, and a cost model. Section 4 delineates the experimentation methodology and the infrastructure we used for testing. Section 5 presents and analyzes the results. Section 6 discusses related work. Finally, Section 7 presents our conclusions and future research directions.

2. APACHE HADOOP AND HDFS

The MapReduce programming model, now widely used in the Big Data context, is not new. One of its first uses was in the LISP programming language, which later inspired Google's approach [5, pp. 1]. Google's solution was initially composed of the GFS distributed filesystem [8] and an implementation of MapReduce [5]. Hadoop was developed based on this parallel approach, using the same idea from Google's implementation: hiding complexity from users and thereby allowing them to focus on programming the paradigm's two primitive functions, Map and Reduce. To move computation towards data, Hadoop uses the HDFS file system. HDFS [30] is a block-based storage system capable of storing, managing, and streaming large amounts of data to user applications in a reasonable time. As mentioned earlier, HDFS does not differentiate among the storage devices attached to a Hadoop cluster node; consequently, it cannot properly exploit the features provided by such devices to increase job performance or decrease a cluster's energy consumption in a configurable way. Our approach tackles this specific issue, creating storage zones according to the device types connected to the cluster nodes. To the best of our knowledge, this is a novel approach to represent and manage storage space for Hadoop's MapReduce computations.


Figure 1: Hadoop Versions Genealogy Tree

2.1 Brief Hadoop History

Throughout its history, Hadoop experienced more than 60 releases in several development branches. As of now, Hadoop has three main development branches: 1.x, 0.23.x, and 2.x. Our research focused on the recent releases of each of these branches. We observed that most Hadoop R&D work was developed using older versions ranging from 0.19.x to 0.22.x [27]. We also observed that the number of studies contributing to the project increased from 2010 to 2012, but decreased in 2013, meaning that fewer studies focused on Hadoop's new releases, since they were launched in late 2011 (version 0.23.0 on 2011-11-01; version 1.0.0 on 2011-12-27) and in the second quarter of 2012 (version 2.0.0 on 2012-05-23) [11]. Additionally, versions 0.23.x and 2.x adopted a new architecture, introducing the YARN resource manager. Thus, we suspect that the instability of the new releases in the YARN branches may have directly contributed to the low number of studies using such versions. We built a genealogy tree of the Hadoop project using the release logs from each project branch and its releases. Figure 1 shows a part of the Hadoop project genealogy tree, from version 0.20.0 up to the latest releases, including the release dates.

The versions we selected for our experiments cover the Hadoop project's three current branches: 1.x, 0.23.x, and 2.x. The 0.23.x and 2.x branches include the YARN resource manager. The difference between them is that 2.x releases include the Highly Available NameNode for HDFS – an effort focused on the automatic failover of the NameNode – whereas the 0.23.x releases exclude such a feature. The 1.x Hadoop releases do not include YARN and have the limitation that they are more tightly coupled to the MapReduce paradigm and mostly designed to run batch jobs, making them less flexible than the other two branches. Table 1 lists the selected group of releases. With these selections we cover almost two years of the project's releases, including different versions and branches, representing different Hadoop middleware designs.

3. HDFS HYBRID STORAGE

Table 1: Releases used in the experiments

Hadoop Release    Date
1.1.1             2012-11-18
1.2.1             2013-07-15
0.23.8            2013-06-05
0.23.10           2013-12-09
2.3.0             2014-02-20
2.4.0             2014-04-07

We developed a hybrid storage approach for HDFS that leverages the different characteristics of HDs and SSDs connected to a Hadoop cluster. This approach's key feature is the controlled use of SSDs to increase performance and reduce energy consumption. Yet, although these two SSD characteristics are outstanding, we must also consider their price. The cost per GB of SSDs sits between $1.00 and $1.50, with expectations that it will decrease to less than $1.00 per GB in the near future [10]. Even at this price, SSDs still cost around 20 times or more per GB than HDs. From this perspective, the cost of SSDs could still be considered prohibitive for general use as the single storage option. Recently, a newly developed hybrid storage device, known as SSHD, combines both SSD and HD technologies in the same device. Generally, these devices can improve performance by storing the most frequently accessed data in their NAND flash memory.

In this context, our goal is to allow the user to determine the best configuration according to the available infrastructure, by setting how much of each storage device should be used during MapReduce computations. The observations made during experiments may also be used as a guide for users seeking to modify existing Hadoop clusters, or even put together a new cluster. To describe our hybrid storage approach, the total HDFS space available must be expressed as the sum of the available space in each device of the cluster DataNodes. In our model, we limited the storage devices in the cluster nodes to HDs and SSDs, excluding any SSHDs, since they do not increase performance on read/write operations compared to SSDs. Table 2 contains a glossary of symbols and functions used in our storage model.

Table 2: Definitions used in the HDFS Hybrid Storage Model

Symbol      Definition
HD          Hard Disk
SSD         Solid-State Disk
DN          DataNode
d           Number of DataNodes in the cluster
DS          Dataset composed of multiple files
blockSize   HDFS default block size configured in dfs.block.size

First, we define the HDFS storage space as the sum of the available space on all DataNodes in the cluster. We use a generic function SpaceAvailableInNode(), which returns the free space that can be used by HDFS on a DataNode. Thus, the HDFS storage space is defined as:

HDFS = \sum_{i=1}^{d} SpaceAvailableInNode(DN_i)    (1)

Since we are modeling a hybrid environment, each DataNode may have different devices connected to it (HDs or SSDs), each with different available space. Thus, we define two storage space zones: HDzone is the sum of all HD space available for HDFS on DataNodes, and SSDzone is analogous for the SSDs. From each DataNode, we can obtain the available space for each device category using the following equations:

HDspace(DN) = \sum_{i=1}^{z} SpaceAvailableInDevice(HD_i)    (2)

SSDspace(DN) = \sum_{i=1}^{w} SpaceAvailableInDevice(SSD_i)    (3)

where z and w are, respectively, the number of configured HDs and SSDs on DataNode DN. The total HD space will compose the HDzone, and similarly, the SSDzone will be composed of the sum of the SSD space available on each DataNode. We define them as:

HDzone = \sum_{i=1}^{d} HDspace(DN_i)    (4)

SSDzone = \sum_{i=1}^{d} SSDspace(DN_i)    (5)

where d is the number of DataNodes running in the cluster.

Finally, we define the hybrid HDFS storage space for our model as:

HDFSH = HDzone + SSDzone    (6)

Our storage model aims to capture the existing nuances between the different storage devices in the same file system. The NameNode middleware must be aware of the storage zones and the difference between the devices on each DataNode. Originally, the dfs.data.dir Hadoop configuration variable contains a comma-separated list of the directories that HDFS can use. We extended this property to hold two lists, one for the SSD directories and the other for the HD directories. By using both lists to create the HDFS storage space, the NameNode maintains a general view of the HDFS storage and can also access the two device classes separately from the block placement policies.
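To make the zone bookkeeping concrete, the following is a minimal sketch of Equations 1–6 in Python. The per-node device lists mirror the two directory lists of the extended dfs.data.dir property described above; the free-space figures and the helper names (hd_space, ssd_space, zones) are illustrative assumptions, not Hadoop code.

# Sketch of the HDFSH storage model (Equations 1-6).
# Free-space values are illustrative; on a real DataNode they would come
# from the free space of each directory listed in the extended dfs.data.dir
# property (one list of HD directories, one of SSD directories).

def hd_space(datanode):
    """Equation 2: sum of the free space over the z configured HDs of a DataNode."""
    return sum(datanode["hd_free_gb"])

def ssd_space(datanode):
    """Equation 3: sum of the free space over the w configured SSDs of a DataNode."""
    return sum(datanode["ssd_free_gb"])

def zones(datanodes):
    """Equations 4-6: HDzone, SSDzone, and the hybrid HDFSH space."""
    hd_zone = sum(hd_space(dn) for dn in datanodes)    # Equation 4
    ssd_zone = sum(ssd_space(dn) for dn in datanodes)  # Equation 5
    return hd_zone, ssd_zone, hd_zone + ssd_zone       # Equation 6

# Example resembling the paper's testbed: eight DataNodes, each with one
# 1TB HD and one 120GB SSD (free-space figures are illustrative).
cluster = [{"hd_free_gb": [1000], "ssd_free_gb": [120]} for _ in range(8)]
print(zones(cluster))  # -> (8000, 960, 8960)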

3.1 Block Placement Policies

Following our HDFS hybrid storage model, we created the block placement policy for HDFSH. From version 1.x on, HDFS allows users to create their own pluggable block placement policies. For our purposes, a Block Placement Policy is an algorithm that specifies where a file's blocks will be stored on HDFS. These policies are controlled by the NameNode daemon, which manages the table of files, blocks, and locations.

The development of such a hybrid storage environment for Hadoop added more flexibility to HDFS. Since SSDs are faster and consume less power than HDs, we expect performance gains and lower energy consumption once they are installed on storage servers. Our first block placement policy can send a pre-configured percentage of the blocks to one of the designated storage zones, and the remainder of the blocks to the other. Let DS be a dataset that will be stored in HDFSH. To know the number of blocks of DS, we must know the number of blocks of each file in DS. Using a generic function called blocks that returns the number of blocks used by a file according to the default pre-configured HDFS block size, we have:

blocks(file) = \lceil size(file) / blockSize \rceil    (7)

where the generic function size returns a given file's number of bytes. To obtain the total number of blocks needed to store DS into HDFSH, we sum the individual number of blocks occupied by each file in DS:

blocksDS = \sum_{f=1}^{k} blocks(file_f)    (8)

where k is the number of files in DS .

Since we know in advance how many blocks DS will require in the file system, and our policy controls the percentage of blocks sent to each storage zone, we express the input of DS into HDFSH as:


HDFSH ← DS:
    SSDzone ← \lceil ρ · blocksDS \rceil
    HDzone ← \lfloor (1 − ρ) · blocksDS \rfloor    (9)

where ρ is the coefficient (0 ≤ ρ ≤ 1) that determines how many blocks will be sent to the SSDzone. The complement of ρ determines the number of blocks sent to the HDzone.
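As an illustration of Equations 7–9, the sketch below counts the blocks of a dataset and splits them between the zones for a given ρ. The file sizes, the 64 MB block size, and the function names are illustrative assumptions rather than HDFS internals.

import math

BLOCK_SIZE = 64 * 1024 * 1024  # illustrative dfs.block.size (64 MB)

def blocks(file_size_bytes, block_size=BLOCK_SIZE):
    """Equation 7: number of HDFS blocks occupied by one file."""
    return math.ceil(file_size_bytes / block_size)

def split_dataset(file_sizes_bytes, rho):
    """Equations 8-9: total blocks of DS and the per-zone split.

    rho (0 <= rho <= 1) is the fraction of blocks directed to the SSDzone;
    the remainder (the floor term of Equation 9) goes to the HDzone.
    """
    blocks_ds = sum(blocks(s) for s in file_sizes_bytes)  # Equation 8
    to_ssd = math.ceil(rho * blocks_ds)                   # ceil(rho * blocksDS)
    to_hd = blocks_ds - to_ssd                            # equals floor((1 - rho) * blocksDS)
    return blocks_ds, to_ssd, to_hd

# Example: three 1 GB files under the 20/80 configuration (rho = 0.8).
print(split_dataset([2**30] * 3, rho=0.8))  # -> (48, 39, 9)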

In all of the experiments, we configured the HDzone to hold the temporary files during job execution. To explore additional possibilities, we performed experiments using the SSDzone to store Hadoop temporary files. These experiments aim to show the added benefits of including SSD storage space in a Hadoop cluster.

3.2 Cost Model

Our storage model splits the dataset blocks in the HDFS into different storage zones using HD and SSD devices. Given that all files in HDFSH are stored either in the HDzone or in the SSDzone, the block proportions for the two zones are complementary. Thus, considering a Hadoop job, we have:

SSDprop + HDprop = 1    (10)

where SSDprop and HDprop are the storage proportions defined for each zone by the HDFSH policy.

Therefore, let DS be a dataset in HDFSH. We can model the storage cost for a job as:

HDFSH StorageCost = size(DS) × (SSDC + HDC)    (11)

where:

SSDC = SSDprop × SSDcostPerGB
HDC = HDprop × HDcostPerGB    (12)

Using Equation 10, we have:

HDC = (1 − SSDprop) × HDcostPerGB    (13)

Thus, we can estimate the storage cost of a Hadoop job using HDFSH by setting the SSD proportion and the SSDs' and HDs' cost per GB.
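A minimal sketch of the cost model (Equations 10–13) follows; the per-GB prices in the example are placeholders to be replaced with current market values.

def storage_cost(dataset_size_gb, ssd_prop, ssd_cost_per_gb, hd_cost_per_gb):
    """Equations 10-13: storage cost of a dataset under a given SSD proportion.

    ssd_prop is SSDprop; HDprop is its complement (Equation 10).
    """
    ssd_c = ssd_prop * ssd_cost_per_gb         # Equation 12
    hd_c = (1.0 - ssd_prop) * hd_cost_per_gb   # Equation 13
    return dataset_size_gb * (ssd_c + hd_c)    # Equation 11

# Example with the prices assumed in Section 5.5 ($0.05/GB HD, $1.00/GB SSD):
# a 256 GB dataset under the 50/50 configuration.
print(round(storage_cost(256, ssd_prop=0.5, ssd_cost_per_gb=1.00, hd_cost_per_gb=0.05), 2))
# -> 134.4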

4. EXPERIMENTAL METHODOLOGY AND INFRASTRUCTURE

With HDFSH and the block placement rules properly designed, we then selected the Hadoop releases and benchmarks for the experiments. Hadoop comes with a large set of examples and tools that allow benchmarking in several ways. Some of them use MapReduce applications, such as Word Count, Sort, Terasort, and Join. Another relevant benchmark set is HiBench [13], developed by Intel and publicly available on GitHub, which includes machine learning and data analytics benchmarks, as well as the Hadoop project's stock benchmarks. To cover an array of different situations, our final selection of benchmarks includes Sort and Join from Hadoop and the Mahout K-Means clustering from HiBench. Table 3 presents our set of benchmarks and their corresponding dataset sizes. Although our approach focuses on storage, we also performed experiments using CPU-bound benchmarks to analyze the behavior of the hybrid storage under these workloads.

Table 3: Benchmarks and Dataset Sizes used in the experiments

Benchmark             Dataset             Type
Sort                  10GB                I/O
Sort                  48GB                I/O
Sort                  256GB               I/O
Join                  20GB                CPU
K-Means Clustering    3 × 10^7 samples    CPU + I/O

Each benchmark uses a specific dataset that was generated and stored for experimental reuse and replication. For the Sort experiments, the datasets were generated using the RandomWriter job, which generates a predefined amount of random bytes. As a result of this job, a set of files with random keys serves as input for the Sort jobs. The Join benchmark performs a join between two datasets, in a database fashion. These experiments used datasets generated with DBGEN from the TPC-H benchmark [4], which is widely used by the database community. Finally, we used an implementation of the K-Means clustering algorithm from HiBench, based on the Mahout library [1], in our experiments. K-Means considers a Euclidean space and attempts to group the existing elements into a predefined number of clusters. This benchmark is CPU-bound during the iteration phase and I/O-bound during the clustering phase. We chose our experimentation benchmarks based not only on their I/O or CPU characteristics, but also on prior research analysis [13, 26].

Table 4: Configurations used in the experiments (data percentage per zone)

Configuration Short Name    HDzone    SSDzone
HD                          100%      -
80/20                       80%       20%
50/50                       50%       50%
20/80                       20%       80%
SSD                         -         100%

Our testing infrastructure limited the dataset sizes. Since our cluster has eight DataNodes with two disks each – one HD (1TB) and one SSD (120GB) – the SSDzone total size of 960GB limited our datasets. Additionally, after each execution the Sort benchmark doubles the used file system space. Therefore, the 256GB dataset benchmarks could not be performed using only the SSDs, and these experiments were run on the HD-only and hybrid configurations. For each benchmark, we ran a batch job as follows: for small datasets, a 30-job batch; and for medium and large ones, a 5-job batch. Each selected Hadoop release ran one batch for each benchmark/dataset pair. The 10GB and 48GB Sort experiments, the Join, and the K-Means experiments ran with a cluster configuration using 4 nodes, whereas the 256GB Sort ran in the full cluster configuration. We used this approach to check the framework's behavior under different configurations, and to check the cluster's energy consumption. To record the power measurements during the experiments, we deployed an additional daemon on the head node gathering the data from each power meter, and stored the power readings from each node in separate files. We configured this daemon to record one reading every second from each power meter.
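The collector daemon itself is simple; the sketch below shows the general shape of such a logger, assuming a hypothetical read_power_watts() helper for the meter protocol (the actual Watts Up? Pro interface is not reproduced here) and illustrative per-meter file names.

import time

METERS = ["meter0", "meter1", "meter2", "meter3"]  # one meter per DataNode pair

def read_power_watts(meter_id):
    """Hypothetical placeholder: query one power meter and return watts.
    In our setup this would wrap the protocol of the Watts Up? Pro device
    attached to the given pair of DataNodes."""
    raise NotImplementedError

def collect(interval_s=1.0):
    """Append one timestamped reading per meter to its own file, once per second."""
    while True:
        ts = time.time()
        for meter in METERS:
            watts = read_power_watts(meter)
            with open(meter + ".log", "a") as log:
                log.write("%.0f %s\n" % (ts, watts))
        time.sleep(interval_s)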

To illustrate the use of our block placement policy, we set five predefined proportions to split the data into the storage zones. The first one keeps all the data in the HDzone, and does not use the SSDzone. The three intermediate ones vary the amount of data stored in each zone, as Table 4 shows. The last one uses only the SSDzone to store the data.

Figure 2: Cluster Infrastructure

The testing infrastructure is a 9-node commodity cluster. Each node has a quad-core processor (AMD A8-5600K; 3600 MHz), 8GB of RAM, a 1TB hard disk (Western Digital WD Black WD1002FAEX; 7200 RPM; 64MB cache; SATA 6.0Gb/s), and a 120GB solid-state disk (Intel SSDSC2BW120A4; SATA 6Gb/s; 20nm MLC). All nodes run the Red Hat (4.4.7-4) GNU/Linux operating system. One of the nodes (head) exclusively runs the Hadoop NameNode and JobTracker daemons, and also keeps the job history log. The other eight nodes run Hadoop DataNode and TaskTracker daemons. For the energy consumption measurements, we instrumented the cluster with Watts Up? Pro, a hardware device that measures wall socket power use and reports power measures every second, containing watts, kWh, voltage, amps, power factor, and other information. We connected the DataNodes in pairs on each one of the four power meter devices. Since we were concerned with power measurements on hybrid storage, we did not instrument the head node, since it is responsible for coordinating Hadoop and does not store any of the HDFS block files. We present the results and analysis next. Figure 2 illustrates our hardware infrastructure.

Figure 3: Average Energy Consumption (job total energy in kJ, 10GB Sort benchmark; 100% HD vs. 100% SSD, per release)

Figure 4: Average Job Makespan (time in seconds, 10GB Sort benchmark; 100% HD vs. 100% SSD, per release)

5. EXPERIMENTAL RESULTS AND ANALYSIS

We were concerned with the energy consumption impact of the hybrid storage system. Our first finding was the large difference among the multiple Hadoop releases. Due to architectural changes in the middleware, Hadoop releases that included the YARN resource manager performed worse and consumed more energy when compared to the 1.x releases. In the following, we detail our experimental results. In terms of performance, we consider makespan as the time difference between the start and finish of Hadoop jobs.

5.1 I/O-Bound Benchmark Results

Starting with the 10GB Sort benchmark, Figure 3 shows the differences between two HDFSH configurations: HD and SSD. The differences among the three branches are easily identified. Whereas releases 0.23.x consumed on average 65% more power than releases 1.x, releases 2.x consumed on average 85% more power than the 1.1.1 and 1.2.1 releases.

Figure 5: Energy Consumption: Sort 10GB (energy in kJ per release for the HD, 80/20, 50/50, 20/80, and SSD configurations)

Figure 6: Energy Consumption: Sort 48GB (energy in kJ per release for the HD, 80/20, 50/50, 20/80, and SSD configurations)

The increase is partially explained by the performance loss: jobs running on 0.23.x releases were 27% slower than on 1.x releases. The same is observed in the 2.x branch, which was around 35% slower than 1.x releases, as Figure 4 illustrates. The significant difference in energy consumption can also be explained by the increase in the number of features included in recent Hadoop branches. The YARN component brought flexibility to the framework, allowing other types of jobs to be executed in Hadoop, in addition to the original MapReduce. YARN also enabled the instantiation of multiple JobTrackers and NameNodes. Our experiments demonstrated that all this flexibility came at a price: loss in performance and, consequently, an increase in energy consumption when executing MapReduce jobs. From Figures 3 and 4, we can also observe that there are small differences between releases from the same branch, with small energy consumption variations.

Further considering the 10GB Sort experiments, Figure 5 presents the results for all the releases using the five configurations we tested, in the following order: HD, 80/20, 50/50, 20/80, and SSD. We can also observe, in Figure 5, the expected tendency of energy savings when moving data to the configurations that favor SSD use.

Table 5: Sort Benchmarks: Energy consumed (kJ)

Dataset size    Release    HD      80/20    50/50    20/80    SSD
10GB            1.1.1      64      62       60       59       58
10GB            1.2.1      66      62       61       60       59
10GB            0.23.8     108     105      99       98       97
10GB            0.23.10    106     105      99       98       97
10GB            2.3.0      117     112      109      108      107
10GB            2.4.0      120     119      114      113      112
48GB            1.1.1      378     348      303      299      304
48GB            1.2.1      398     352      306      311      298
48GB            0.23.8     588     566      463      454      456
48GB            0.23.10    593     596      510      449      464
48GB            2.3.0      696     630      579      529      564
48GB            2.4.0      682     631      563      540      521
256GB           1.1.1      2005    1967     1795     1750     1626*
256GB           1.2.1      1972    1934     1796     1795     1588*
256GB           0.23.8     3339    3469     3001     3157     2790*
256GB           0.23.10    3168    2971     3000     3320     2736*
256GB           2.3.0      3461    3587     3248     3345     2885*
256GB           2.4.0      3530    3650     3212     3301     3077*

* HDFS Replication Factor = 1

Figure 7: Energy Consumption: 1.x Releases (normalized total job energy, Sort benchmark, 10GB and 48GB datasets; Hadoop 1.1.1 and 1.2.1 across the HD, 80/20, 50/50, 20/80, and SSD configurations)

Next, we moved on to the experiments with larger datasets. Figure 6 shows the results for the 48GB Sort experiments. The results support the tendency toward energy savings when using SSDs. Analyzing these two initial experiments, we noticed that increasing the dataset size shifts the tendency of power saving toward the middle configurations: 50/50 and 20/80. This indicates that, by storing only a fraction of the data on SSDs with these specific hybrid configurations, we achieve a significant increase in performance and, consequently, a reduction in energy consumption. Our results indicate that, if all data is processed from the SSDzone, there is an average reduction of 20% in energy consumption. Additionally, a similar reduction can also be observed in the 50/50 and 20/80 configurations in Figures 5 and 6. The energy consumption results from the Sort benchmarks are listed in Table 5.

To promote better data visualization, we normalized the energy results using the previously presented values – the average of the observations for each tested configuration. Therefore, for each release–dataset pair, we divided each value by the maximum value of the group. Generally this value is associated with the HD configuration, since it demands more power when running experiments. This allowed the side-by-side comparison of results from different dataset experiments, since their results are on different scales. Figures 7, 8, and 9 show this panorama. We put together the experiments from the two initial Sort benchmarks (10GB and 48GB) in both versions from each release. Surprisingly, for some releases the use of the 50/50 configuration results in energy consumption close to the power consumed when exclusively processing data from SSD. This can be clearly observed in releases 1.1.1, 1.2.1, 0.23.8, and 2.3.0. Even in the other two releases, the energy consumption rates are closer to the SSD values than to the HD ones.

Figure 8: Energy Consumption: 0.23.x Releases (normalized total job energy, Sort benchmark, 10GB and 48GB datasets; Hadoop 0.23.8 and 0.23.10 across the HD, 80/20, 50/50, 20/80, and SSD configurations)

Figure 9: Energy Consumption: 2.x Releases (normalized total job energy, Sort benchmark, 10GB and 48GB datasets; Hadoop 2.3.0 and 2.4.0 across the HD, 80/20, 50/50, 20/80, and SSD configurations)

The results from the Sort benchmark using the 256GB dataset also corroborate the previous statements. With the increase in the dataset size, the energy consumption rates decrease in the middle configurations, favoring the use of less SSD storage to achieve results similar to the SSD-only configurations. Table 5 presents these results. The power consumption reduction when using more SSD storage is clearly noticeable. During the 256GB Sort experiments, we observed an issue regarding the SSDzone total space. Our DataNode infrastructure has eight 120GB SSD disks, totaling 960GB of useful storage space. Using a replication factor of 3, a dataset triples its size in the HDFS. Thus, in this setting the 256GB dataset uses 768GB, leaving less than 200GB of free space in the SSDzone. This does not allow the Sort job execution, since it requires the equivalent of three times the dataset size of free space (at least 768GB) during its execution. Therefore, the 256GB Sort experiments ran without block replication in HDFSH for the SSD configuration with this particular dataset. These results appear in Table 5.

Figure 10: Energy Consumption: 1.x Releases (normalized total job energy, Sort benchmark, 48GB and 256GB datasets; Hadoop 1.1.1 and 1.2.1 across the HD, 80/20, 50/50, 20/80, and SSD configurations)

Figure 11: Energy Consumption: 2.x Releases (normalized total job energy, Sort benchmark, 48GB and 256GB datasets; Hadoop 2.3.0 and 2.4.0 across the HD, 80/20, 50/50, 20/80, and SSD configurations)

We also compared the results of the 256GB Sort with the 48GB Sort in normalized graphs. Figures 10 and 11 indicate that we can achieve relatively higher performance and lower energy consumption by increasing the dataset size.

To further investigate the effects of block replication on energy consumption, we ran experiments with the 256GB Sort dataset on the HD configurations, presented in Figure 12. The results indicate that triple block replication demands on average 1.5% more energy in the 1.x branch, 9.5% more energy in 0.23.x releases, and 6.2% in the 2.x branch. Following our previous results, this means that in the configurations favoring SSD storage, this percentage should be greatly reduced, since SSDs consume far less energy than HDs.

Figure 12: Impact of Triple Block Replication on Energy (job total energy in kJ, 256GB Sort benchmark; 100% HD with 3 replicas vs. 1 replica, per release)

5.2 CPU-Bound Benchmark Results

To evaluate whether our approach favors I/O-bound jobs only, we performed two sets of experiments using CPU-bound jobs. The differences among the hybrid, HD, and SSD configurations were insignificant, thus we present only the results for the two extreme configurations using the Join and K-Means benchmarks. Except for the differences among branches, the Join benchmark did not present any significant differences, as seen in Figure 13.

Figure 13: Energy Consumption, Join Benchmark (total job energy in kJ; 100% HD vs. 100% SSD, per release)

The Mahout K-Means is a hybrid benchmark that is CPU-bound in the iterations, and I/O-bound in the clustering. With our setup (3 iterations), 3/4 of the execution in this benchmark was CPU-bound, while the rest was I/O-bound. As can be seen in Figure 14, again, there is no significant difference across the configurations, except for the differences among branches.

We thus conclude that, except for the differences among branches and releases, there are no significant gains in terms of performance and energy consumption when running CPU-bound jobs with our approach.

Figure 14: Energy Consumption, K-Means Benchmark (total job energy in kJ; 100% HD vs. 100% SSD, per release)

5.3 Using SSD as Temporary Storage Space

Following these experimental results, we performed a set of experiments using the SSDzone to store the temporary files generated during Hadoop jobs. We named this configuration tmpSSD. It stores the dataset file blocks in the HDzone, and all the temporary files generated during job execution in the SSDzone. Figures 15 and 16 present the results. We notice that for the 10GB dataset there is no significant difference in the 1.x releases. The results are similar to the SSD configuration, which can be explained by the dataset size. But we observed a different behavior from the 0.23.x and 2.x releases, although they have the same software architecture. Releases from the 0.23.x branch did not achieve any energy benefits by using tmpSSD. The opposite happened with the 2.x releases, which achieved better performance and consequently reduced their energy demands.

Figure 15: Energy using the SSDzone as temporary storage (Sort 48GB; energy in kJ for the HD, SSD, and HD + tmpSSD configurations, per release)

The same pattern developed for the bigger datasets in the 0.23.x and 2.x releases. The novelty was the 1.x branch results, which performed better with bigger datasets. This means that not only did performance improve, but there were also significant energy savings when using the SSDzone as temporary space.

As a general conclusion here, it is worth mentioning that both 1.x and 2.x releases improved significantly when using the SSDzone to store temporary files during Hadoop jobs, such that in most cases the performance was even better than storing the dataset files in the SSDzone. By contrast, releases from the 0.23.x branch did not perform well, and in some cases the results suggest that the use of this strategy may be prohibitive, especially with small datasets. Therefore, using SSDs as part of a hybrid storage system offers two kinds of benefits for Hadoop computations: (I) primary storage in HDFS; and (II) temporary storage space. In the latter case, changes in the behavior of temporary files allowed several releases to achieve better performance with them on SSD.

Figure 16: Energy using the SSDzone as temporary storage (Sort 256GB; energy in kJ for the HD, SSD, and HD + tmpSSD configurations, per release)

5.4 Speedup

Figure 17: Hadoop Performance: Multiple Releases over Configurations (average job makespan in seconds, Sort 48GB and Sort 256GB, for the HD, 80/20, 50/50, 20/80, SSD, and tmpSSD configurations, per release)

Regarding job makespan performance, we observed that storing more data in the SSDs enabled the jobs to run faster, which was expected, since SSDs provide higher throughput. The novelty here is the non-linear behavior of the job makespan as we increase the dataset size. With the 10GB dataset, the 80/20 configuration was on average 7% faster than the HD configuration, with only 20% of the data processed from the SSD; in the 50/50 configuration, jobs were on average 17% faster than the HD configuration; and, in the 20/80 configuration, 22% faster; finally, jobs running with the SSD-only configuration were on average 26% faster than purely running on HD. In the 48GB Sort, the observed speedups compared to the HD configuration were: 80/20, 8% faster; 50/50, 27% faster; 20/80, 31% faster; and SSD, 30% faster. The average makespan from the Sort benchmarks can be seen in Figure 17.
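For clarity, the percentages above are simple makespan ratios, assuming that "X% faster" means the job makespan is X% shorter than the HD baseline; the makespans in the example are illustrative, not our measured values.

def speedup_percent(makespan_hd_s, makespan_config_s):
    """Percentage by which a configuration's makespan is shorter than the HD baseline."""
    return (makespan_hd_s - makespan_config_s) / makespan_hd_s * 100

# Illustrative example: a 300 s HD baseline and a 222 s SSD-only run would
# correspond to the 26% speedup reported for the 10GB Sort.
print(round(speedup_percent(300, 222)))  # -> 26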

5.5 Cost Model Analysis

With the SSDzone as temporary storage, we achieved more promising results than some SSD-only experiments. Figure 17 also presents these experiments' results. All experiments from the hybrid policy ran using the HD as temporary space. Thus, the SSD configuration stores the entire dataset on SSD disks, and uses HDs as temporary space. The tmpSSD configuration does exactly the opposite, storing the dataset in the HDzone, and the temporary files in the SSDs. Thus, it is fair to compare these two experiment sets. Although the HD experiments tend to be much slower than the SSD ones, in this case the HDzone using SSD as temporary space outperforms the SSDzone experiments in most cases. Results from the 48GB and 256GB Sort experiments in 1.x and 2.x releases point to a speedup factor of more than 2 times when comparing HD versus tmpSSD configurations, and on average more than 1.5 times when comparing SSD versus tmpSSD configurations.

For the storage cost analysis, we assume that HDcostPerGB = $0.05 and SSDcostPerGB = $1.00 (20 times the HD cost per GB). Next, we plotted the ratio between job makespan and job storage cost for the Sort jobs, using the latest tested release from each branch. We can observe in Figure 18 that, for each release, there is a Pareto-optimal configuration. This example can be used to identify which proportions are best in the tradeoff between cost and performance. The more SSD is used, the higher the cost, but with fewer hours spent to perform the job, less energy is used. On the other hand, by using more HD in the configurations, jobs take more hours to finish, demanding more energy, but the cost decreases greatly. This behavior can be observed in every release on the 48GB and 256GB datasets tested in the Sort experiments. Finally, we observe that, although there is a general behavior within the tested releases, each one has its particularities, which means that the optimal point varies from one release to another, especially among different branches.
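To illustrate how this tradeoff analysis could be automated, the sketch below filters the Pareto-optimal (cost, makespan) points among the tested configurations. It reuses the cost formulation of Section 3.2 with the assumed $0.05/GB and $1.00/GB prices; the makespan values are illustrative placeholders, not our measurements.

def storage_cost(size_gb, ssd_prop, ssd_per_gb=1.00, hd_per_gb=0.05):
    # Equation 11 with the prices assumed in this section.
    return size_gb * (ssd_prop * ssd_per_gb + (1 - ssd_prop) * hd_per_gb)

# Illustrative makespans (hours) for a 256GB Sort; not our measured values.
configs = {"HD": (0.0, 2.0), "80/20": (0.2, 1.8), "50/50": (0.5, 1.5),
           "20/80": (0.8, 1.55), "SSD": (1.0, 1.3)}

points = {name: (storage_cost(256, prop), hours)
          for name, (prop, hours) in configs.items()}

# A configuration is Pareto-optimal if no other one is both cheaper and faster.
pareto = [name for name, (cost, hours) in points.items()
          if not any(c <= cost and h <= hours and (c, h) != (cost, hours)
                     for c, h in points.values())]
print(pareto)  # -> ['HD', '80/20', '50/50', 'SSD'] (20/80 is dominated here)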

Figure 18: HDFSH storage cost versus performance (job makespan in hours vs. storage cost in $, assuming HD at $0.05/GB and SSD at $1.00/GB; releases 1.2.1, 0.23.10, and 2.4.0 on the Sort 48GB and Sort 256GB experiments)


6. RELATED WORK

In our comprehensive literature review [27], we observed that HDFS has been modified to increase its performance in multiple ways: tuning the I/O mode [16, 29, 34], solving data placement problems [33], and adapting it to support small-file processing [7, 16], since HDFS was not originally designed for such purposes. Some works replaced the original HDFS with a more suitable solution for specific compatibility problems [19, 24], or to support areas such as Cloud Computing [9, 19]. Yet, these approaches only targeted HDFS performance increases, without considering energy consumption or using a combination of different storage devices.

The use of SSDs as a storage solution on clusters is such a recent trend in the industry that the first proposals for MapReduce clusters have only recently appeared. Most of the research to date tends to analyze whether MapReduce can benefit, in terms of performance, from deploying HDFS on SSDs. Jeon et al. [15] analyze the Flash Translation Layer (FTL) – the SSD core engine – to understand the endurance implications of such technologies on Hadoop MapReduce workloads. Kang et al. [18] explore the benefits and limitations of in-storage processing on SSDs (the execution of applications on processors in the storage controller). Other researchers focus on incorporating SSDs into HDFS using caching mechanisms to achieve better performance [20, 28, 32, 35]. A few works also discuss SSDs' impact on Hadoop [17, 25], sometimes focusing on using SSDs as the sole storage device under HDFS. Our approach tends to be more affordable, since we developed a hybrid file system that seamlessly couples the best from HDs (affordable cost per GB, high storage capacity, and, to some extent, endurance) and SSDs (high performance and low energy consumption rates) in a configurable fashion.

7. CONCLUSIONS AND FUTURE WORK

In this paper, we presented our approach to seamlessly integrating HD and SSD technologies into HDFSH. Our block placement policy shows an energy consumption reduction even when only a fraction of the data is stored in the modified HDFS. We showed that, with larger datasets, the reduction in energy demand can be significant, achieving up to 20% savings under certain hybrid configurations. The general use of HDFSH affords immediate benefits, since it increases MapReduce jobs' performance and reduces energy consumption. Yet, for now, users must manually define these configurations. In future work, we intend to investigate autonomic heuristics that would analyze the workflow to automatically and dynamically configure HDFSH for optimal performance and energy consumption.

Regarding Hadoop branches and their significant performance and energy consumption differences, we acknowledge that YARN brought flexibility to the framework, but generated a significant performance loss. Since most of the energy consumed by Hadoop is associated with job makespan, releases from the 0.23.x and 2.x branches almost double the energy consumed to run the same jobs compared to 1.x releases. Users must focus on their real needs regarding Hadoop: flexibility or performance. In the latter case, we strongly recommend selecting Hadoop releases from the 1.x branch, and applying the compatible HDFS security patches.

Additionally, we observed that the tmpSSD configuration can bring benefits when used with the 1.x and 2.x branches. This configuration can increase speedup rates enough to outperform the exclusive use of SSDs on certain releases and benchmarks. Four out of the six tested releases showed performance enhancements using this configuration on our larger datasets. Therefore, the use of SSDs as an energy saver and performance enhancer for MapReduce computations ought to be seen as a viable option for existing Hadoop clusters, since Hadoop does not need to be modified to allow tmpSSD configurations. We recommend that Hadoop deployments consider using SSDs for temporary space.

Finally, we are also concerned with the Hadoop middleware architectural changes. We are conducting a code repository data mining experiment to investigate how source-code modifications across Hadoop releases have impacted performance and, consequently, energy consumption. Additionally, our results may show how much of the power being used relates to job makespan, correlating the evolution of the framework and its architectural changes with performance and energy consumption.

Acknowledgment

This work was funded by Fundação Araucária, CNPq-Brazil, and by the Emerging Leaders in the Americas Program (ELAP) from the Government of Canada. Abram Hindle is supported by an NSERC Discovery Grant.

8. REFERENCES

[1] Apache Mahout. Apache Mahout, 2014.

[2] D. Battre, S. Ewen, F. Hueske, O. Kao, V. Markl, and D. Warneke. Nephele/PACTs: a programming model and execution framework for web-scale analytical processing. In Proceedings of the 1st Symposium on Cloud Computing, pages 119–130, New York, NY, USA, 2010. ACM.

[3] P. B. Brandtzæg. Big data - for better or worse, May 2013.

[4] T. P. P. Council. TPC Benchmark H (Decision Support) Standard Specification, Jun 2002.

[5] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. In Proceedings of the 6th Conference on Operating Systems Design and Implementation, volume 6, pages 10–10, Berkeley, CA, USA, 2004. USENIX Association.

[6] D. J. DeWitt, E. Paulson, E. Robinson, J. Naughton, J. Royalty, S. Shankar, and A. Krioukov. Clustera: an integrated computation and data management system. Proceedings of the VLDB Endowment, 1(1):28–41, Aug. 2008.

[7] B. Dong, J. Qiu, Q. Zheng, X. Zhong, J. Li, and Y. Li. A novel approach to improving the efficiency of storing and accessing small files on Hadoop: A case study by PowerPoint files. In International Conference on Services Computing, pages 65–72. IEEE, July 2010.

[8] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google File System. ACM SIGOPS Operating Systems Review, 37(5):29–43, 2003.

[9] S. Guang-hua, C. Jun-na, Y. Bo-wei, and Z. Yao. QDFS: A quality-aware distributed file storage service based on HDFS. In International Conference on Computer Science and Automation Engineering, volume 2, pages 203–207. IEEE, June 2011.

[10] M. Hachman. SSD prices face uncertain future in 2014, 2014.

[11] A. Hadoop. Apache Hadoop releases. http://hadoop.apache.org/releases.html, 2015.

[12] J. Hamilton. Overall data center costs, September 2010.

[13] S. Huang, J. Huang, J. Dai, T. Xie, and B. Huang. The HiBench benchmark suite: Characterization of the MapReduce-based data analysis. In Data Engineering Workshops (ICDEW), 2010 IEEE 26th International Conference on, pages 41–51, March 2010.

[14] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: distributed data-parallel programs from sequential building blocks. ACM SIGOPS Operating Systems Review, 41(3):59–72, Mar. 2007.

[15] H. Jeon, K. El Maghraoui, and G. B. Kandiraju. Investigating hybrid SSD FTL schemes for Hadoop workloads. In Proceedings of the ACM International Conference on Computing Frontiers, CF '13, pages 20:1–20:10, New York, NY, USA, 2013. ACM.

[16] D. Jiang, B. C. Ooi, L. Shi, and S. Wu. The performance of MapReduce: an in-depth study. Proceedings of the VLDB Endowment, 3(1-2):472–483, Sept. 2010.

[17] K. Kambatla and Y. Chen. The truth about MapReduce performance on SSDs. In 28th Large Installation System Administration Conference (LISA14), pages 118–126, Seattle, WA, Nov. 2014. USENIX Association.

[18] Y. Kang, Y. suk Kee, E. Miller, and C. Park. Enabling cost-effective data processing with smart SSD. In Mass Storage Systems and Technologies (MSST), 2013 IEEE 29th Symposium on, pages 1–12, 2013.

[19] G. Kousiouris, G. Vafiadis, and T. Varvarigou. A front-end, Hadoop-based data management service for efficient federated clouds. In Third International Conference on Cloud Computing Technology and Science, pages 511–516. IEEE, Nov. 29–Dec. 1, 2011.

[20] K. Krish, M. Iqbal, and A. Butt. VENU: Orchestrating SSDs in Hadoop storage. In Big Data (Big Data), 2014 IEEE International Conference on, pages 207–212, Oct. 2014.

[21] D. Laney. 3D data management: Controlling data volume, velocity, and variety. Technical report, META Group, February 2001.

[22] J. C. MacCallum. Disk drive prices. http://www.jcmit.com/diskprice.htm, 2014.

[23] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In Proceedings of the International Conference on Management of Data, pages 135–146, New York, NY, USA, 2010. ACM.

[24] S. Mikami, K. Ohta, and O. Tatebe. Using the Gfarm File System as a POSIX compatible storage platform for Hadoop MapReduce applications. In Proceedings of the 12th International Conference on Grid Computing, pages 181–189, Washington, DC, USA, 2011. IEEE/ACM.

[25] S. Moon, J. Lee, and Y. S. Kee. Introducing SSDs to the Hadoop MapReduce framework. In Cloud Computing (CLOUD), 2014 IEEE 7th International Conference on, pages 272–279, June 2014.

[26] F. Pan, Y. Yue, J. Xiong, and D. Hao. I/O characterization of big data workloads in data centers. In J. Zhan, R. Han, and C. Weng, editors, Big Data Benchmarks, Performance Optimization, and Emerging Hardware, volume 8807 of Lecture Notes in Computer Science, pages 85–97. Springer International Publishing, 2014.

[27] I. Polato, R. Re, A. Goldman, and F. Kon. A comprehensive view of Hadoop research - A systematic literature review. Journal of Network and Computer Applications, 46:1–25, 2014.

[28] M. Roger, Y. Xu, and M. Zhao. BigCache for big-data systems. In Big Data (Big Data), 2014 IEEE International Conference on, pages 189–194, Oct. 2014.

[29] J. Shafer, S. Rixner, and A. Cox. The Hadoop Distributed Filesystem: Balancing portability and performance. In International Symposium on Performance Analysis of Systems & Software, pages 122–133. IEEE, March 2010.

[30] K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The Hadoop Distributed File System. In Proceedings of the 26th Symposium on Mass Storage Systems and Technologies, pages 1–10, Washington, DC, USA, 2010. IEEE.

[31] D. Vesset, A. Nadkarni, C. W. Olofson, and D. Schubmehl. Worldwide big data technology and services 2012-2016 forecast. Technical report, IDC Corporate USA, 2012.

[32] B. Wang, J. Jiang, and G. Yang. mpCache: Accelerating MapReduce with hybrid storage system on many-core clusters. In C.-H. Hsu, X. Shi, and V. Salapura, editors, Network and Parallel Computing, volume 8707 of Lecture Notes in Computer Science, pages 220–233. Springer Berlin Heidelberg, 2014.

[33] J. Xie, S. Yin, X. Ruan, Z. Ding, Y. Tian, J. Majors, A. Manzanares, and X. Qin. Improving MapReduce performance through data placement in heterogeneous Hadoop clusters. In International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum, pages 1–9. IEEE, April 2010.

[34] J. Zhang, X. Yu, Y. Li, and L. Lin. HadoopRsync. In International Conference on Cloud and Service Computing, pages 166–173, Dec. 2011.

[35] D. Zhao and I. Raicu. HyCache: A user-level caching middleware for distributed file systems. In Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2013 IEEE 27th International, pages 1997–2006, May 2013.

[36] P. Zikopoulos and C. Eaton. Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data. McGraw-Hill, 2011.
