
Exploring Energy and Performance Behaviors of Data-Intensive Scientific Workflows on Systems with Deep Memory Hierarchies

Marc Gamell, Ivan Rodero, Manish Parashar
Rutgers Discovery Informatics Institute

NSF Cloud and Autonomic Computing Center

Rutgers University, Piscataway, NJ, USA

Email: {mgamell, irodero, parashar}@cac.rutgers.edu

Stephen Poole
Computer Science and Mathematics & NCCS Divisions

Oak Ridge National Laboratory,

Oak Ridge, TN, USA

Email: [email protected]

Abstract—The increasing gap between the rate at which large-scale scientific simulations generate data and the corresponding storage speeds and capacities is leading to more complex system architectures with deep memory hierarchies. Advances in non-volatile memory (NVRAM) technology have made it an attractive candidate as intermediate storage in this memory hierarchy to address the latency and performance gap between main memory and disk storage. As a result, it is important to understand and model its energy/performance behavior from an application perspective as well as how it can be effectively used for staging data within an application workflow. In this paper, we target an NVRAM-based deep memory hierarchy and explore its potential for supporting in-situ/in-transit data analytics pipelines that are part of application workflow patterns. Specifically, we model the memory hierarchy and experimentally explore energy/performance behaviors of different data management strategies and data exchange patterns, as well as the tradeoffs associated with data placement, data movement and data processing.

I. INTRODUCTION

Large-scale scientific simulation workflows are generating increasing amounts of data at very high rates, and this data has to be transported and analyzed before scientific insights can be obtained. These simulation workflows are also becoming increasingly complex, and are composed of coupled applications and data processing components that need to interact and exchange data. As a result, data management, transportation, processing and analysis costs, in terms of time and energy, are quickly dominating and limiting the potential impacts of these advanced simulations. Recent research efforts are attempting to address these challenges from a latency and performance perspective, using techniques such as data staging and in-situ and in-transit processing [1], [2], [3]. However, understanding energy/performance behaviors for different types of data management strategies, data placement options, and data exchange patterns has also become increasingly important.

At the same time, technology trends and constraints are resulting in complex computer systems with a growing number of cores per chip with reduced memory per core, an increasing gap between compute and I/O speeds, and architectures with deep memory hierarchies. Advances in storage-class non-volatile memory (NVRAM) have made it an attractive candidate as intermediate storage within this memory hierarchy for addressing the latency and performance gap between main memory and disk storage. Existing research [4], [5] has studied how memory hierarchy performance can be improved using different NVRAM technologies and has considered NVRAM as a viable solution for I/O staging. Recent studies [6], [7] have also analyzed the potential impact of NVRAM on performance as part of a deep memory hierarchy. However, it is also important to understand how such an NVRAM-based deep memory hierarchy can be effectively used for data staging and by data-intensive scientific workflows, and the associated energy/performance behaviors from an application perspective. The goal of this paper is to explore these issues.

In this paper, we first model the performance and energy of a canonical application workflow, which is based on real application workflow patterns, on systems with a deep memory hierarchy. We then characterize the performance and energy behaviors of each level of the memory hierarchy. Finally, we evaluate the performance and energy consumption of different data management strategies and data exchange patterns, as well as the energy/performance tradeoffs associated with data placement, data movement and data processing. Specifically, we target architectures with deep memory hierarchies composed of four different levels: DRAM, storage-level PCIe-based NVRAM (simply NVRAM throughout the rest of the paper), solid-state drive (SSD) and spinning hard disk.

Complementing the existing body of work addressing power management at the system and runtime levels, in this paper we address energy efficiency from an application-aware and data-centric perspective, and focus on optimizing data placement/movement and scheduling computations as part of end-to-end simulation workflows. Furthermore, we address energy/power-efficiency tradeoffs in a holistic manner, in combination with performance and quality of solution. The key contributions of this paper are: (1) the design, modeling and implementation of a canonical workflow skeleton, based on characteristics extracted from real scientific data-intensive workflows; (2) the characterization of the different levels of a 4-level deep memory hierarchy using an empirical power/performance exploration, and the formulation of a set of rules to optimize their use from power and performance perspectives; and (3) an experimental evaluation of different relevant data management strategies and data exchange patterns using the canonical workflow skeleton on a system with a deep memory hierarchy.

The rest of the paper is organized as follows. Section II discusses the related work. Section III describes the key data-related challenges in executing typical scientific workflows and develops a canonical workflow. Section IV models the canonical workflow and presents its implementation for the experiments presented in the paper. Section V characterizes the power/performance behaviors of the different memory devices that are part of the targeted deep memory hierarchy. Section VI presents an experimental evaluation of relevant data management and exchange strategies. Section VII concludes the paper and outlines directions for future work.

II. RELATED WORK

In this section, we first present existing work related to the different memory technologies that have been used for building deep memory hierarchies. We then discuss existing studies that have benchmarked deep memory hierarchies and, finally, we describe related in-situ/in-transit data analysis approaches for scientific workflows.

Deep memory hierarchies technology: There is a large body of research exploring deep memory hierarchies and how they can be used to address I/O challenges. Different types of technologies have been developed to improve the memory hierarchy using non-volatile, solid-state memory (NVRAM) such as phase-change memory (PCM) [8], memristors [9], and Flash memory [4], [10]. Flash-based SSD technology has also been widely studied [11], characterized [12] and evaluated for different application types [13], [14]. In this paper, we focus on storage-class NVRAM (i.e., Fusion-io ioDrive technology).

NVRAM for I/O staging: Existing work has considered NVRAM as a viable solution for I/O staging [5], [15]. Caulfield et al. [6] proposed Moneta, an architecture with NVRAM as an I/O device for HPC applications. Akel et al. [16] extended Moneta with a real PCM device to understand the performance implications of using NVRAM. Dong et al. [17] studied NVRAM for HPC application checkpointing. Kannan et al. [18] studied NVRAM for I/O intensive benchmarks (e.g., post-processing applications), also in Cloud environments.

Performance evaluation of NVRAM: Several studies have measured the performance of different memory and storage devices such as PCIe-based NVRAM, SSD, spinning disk or networked devices. Cohen et al. [19] evaluated both memory-intensive benchmarks (e.g., a graph benchmark) and real I/O applications (e.g., Livermore Entity Extraction) using a Fusion-IO NAND Flash array on PCIe, a local SATA hard disk, and an NFS-mounted RAID. Caulfield et al. [6], [20] compared the performance and power of RAID storage and Fusion-io vs. Moneta, a prototype PCIe-attached storage array built from emulated PCM storage. He et al. [21] presented the design and evaluation of the DASH supercomputer and compared hard disk, Fusion-io and Oracle's F5100 with DASH using scientific applications. Polte et al. [22] compared Fusion-io with regular SSD, enterprise SSD and hard disk using standard benchmarks.

Power/Energy Efficiency: Recent work (e.g., µblades [23], FAWN [14] and Gordon [4]) has studied energy efficiency, for example, using low-power (thin or “wimpy”) cores that consume less energy [10] and determining how they can be exploited using micro-benchmarks [24]. An important part of this research is oriented to Cloud workloads rather than HPC. Dong et al. [25] proposed 3D stacked magnetic memory (MRAM) caches claiming better power efficiency. Vasudevan et al. [26] used standard I/O benchmarks to evaluate the energy and performance of several system configurations, including Fusion-io. Davis et al. [27] used JouleSort, a sequential I/O-based benchmark, to evaluate the performance and energy-efficiency of different storage systems: three SATA SSDs, Fusion-io, and two SATA hard disks. In recent work, Yoon et al. [7] studied the potential impact of NVRAM on the performance of deep memory hierarchies, and Li et al. [28] studied the potential impact of hybrid DRAM and byte-addressable NVRAM on scientific applications from energy and performance perspectives using simulation, which are complementary to our approach.

In-situ and in-transit processing: Largely data-parallel operations, including visualization [29], and statistical compression and queries [30], have been directly integrated into the simulation routines, enabling them to operate on in-memory simulation data. Another approach, used by FP [31] and CoDS [32], performs in-situ data operations on-chip using separate dedicated processor cores on multi/many-core nodes. Energy aspects of in-situ data analytics have been recently addressed [33], however, without using a deep memory hierarchy. The use of a data staging area, i.e., a set of additional compute nodes allocated by users when launching the parallel simulations, has been investigated in projects such as DataStager [2], PreDatA [3], JITStaging [34] and DataSpaces [1]/ActiveSpaces [35]. Most of these existing data staging solutions primarily focus on fast and asynchronous data movement off simulation nodes to lessen the impact of expensive I/O operations. They typically support limited data operations within the staging area, such as pre-processing and transformations, often resulting in under-utilization of the staging nodes' compute power.

To the best of our knowledge, there is a lack of work studying empirically how to effectively use such a deep memory hierarchy from energy/performance perspectives, including the study of its behaviors for data-centric scientific workflows. Most of the above research focuses on improving performance (e.g., I/O throughput) and does not translate to our hierarchical approach for scientific applications.

III. BACKGROUND

This section introduces the canonical deep memory architecture and scientific application workflow targeted by this work, and describes the key features of such an architecture and workflow.

System architecture: This work targets the canonical system architecture shown in Figure 1. Each node features a deep memory hierarchy that includes DRAM, NVRAM, SSD and disk storage levels. The system is composed of multiple nodes interconnected via a high-speed network (i.e., Infiniband). Furthermore, nodes can access the different levels of the memory hierarchy on remote nodes as well. This architecture also supports the creation of data staging areas, i.e., intermediate on-system storage between the sources of information and its consumers. Such a staging area can extend along both directions: horizontally, spanning the nodes running the applications that produce and consume the data, as well as dedicated staging nodes; and vertically, across the levels of the deep memory hierarchy at the nodes. For example, Figure 1 illustrates a horizontal staging area at the NVRAM level.

Fig. 1. Targeted system architecture with a deep memory hierarchy. Each node contains a multi-core processor, accelerators, and a DRAM/NVRAM/SSD/hard disk hierarchy; nodes are interconnected through a network.

Canonical application workflow: In this paper we consider the canonical workflow skeleton shown in Figure 2, which is derived from real application workflows. The workflow includes an arbitrary number of coupled parallel application components (e.g., simulations) and data analytics components. Application components generate data in either the local main memory (typically DRAM) or the staging area or both, at arbitrary rates. In-situ data analytics components process the data where it is generated (typically from DRAM) but can also interact with the local portion of the staging area. In-staging data analytics components process the data from the staging area, which may be local or remote. Both in-staging and in-situ analytics are typically asynchronous, but may be synchronous as well.

The subsequent paragraphs discuss key data-centric features of scientific application workflows using the proposed architecture and canonical workflow skeleton.

Quality of solution: Simulation-based application workflows are typically iterative (e.g., consisting of multiple time steps). Each simulation step usually generates data, which needs to be analyzed; however, the data is not typically processed and analyzed every step. Nevertheless, the data must be analyzed as often as possible given the system constraints, and the more often the data is analyzed, the better. For example, turbulent combustion direct numerical simulations currently resolve intermittent phenomena that occur on the order of 10 simulation steps; however, in order to maintain I/O overheads at a reasonable level, typically only every 400th step is saved to persistent storage for post-processing and, as a result, the data pertaining to these intermittent phenomena is lost [36]. We use the frequency of data analysis as a metric for the quality of solution in this paper.

Fig. 2. Canonical application workflow, showing the simulation step, main memory (DRAM), the staging area, and the in-situ and in-staging data analytics components with their generation and processing rates.

Data Coupling: The coupling, interaction and coordination of the different applications that compose a scientific data-intensive workflow can be expressed using different workflow patterns such as: (i) tight coupling – the coupled component applications run concurrently and exchange data frequently, and as a result data movement costs can quickly dominate the execution time; (ii) loose coupling – the coupling and data exchange are less frequent, often asynchronous and possibly opportunistic, and the coupled component applications may run concurrently or sequentially; and (iii) dataflow coupling – the coupling is determined by availability of data and the data flows through the workflow, which is often expressed as a directed acyclic graph (DAG). In our canonical workflow skeleton, we consider the ratio between data generation and processing rates as a metric of the coupling between the workflow components, i.e., the simulation and the analysis applications.

Online Data Analysis: We consider two variants for online data analysis in scientific workflows: in-situ and in-transit processing. Both variants are based on the idea of performing analyses as the simulation is running, and storing only the results, which can be several orders of magnitude smaller than the raw data, thus mitigating the effects of limited disk bandwidth or capacity. Their difference lies in how and where the analysis is performed. In-situ analysis is performed where the data is located, and typically shares resources with the primary simulation. The main advantage of in-situ analysis is that it avoids data movement. However, constraints on the acceptable impact on the execution of the simulation itself place significant restrictions on this kind of analysis. In contrast, in-transit or in-staging pipelines analyze the data while it is in the staging area (DRAM, NVRAM, SSD or HDD), either local or remote. Data is asynchronously transferred to the staging resources used for in-transit analysis, allowing the simulation to resume execution more quickly. However, in practice, transferring even a subset of the raw data over the network can become expensive. As a result, hybrid approaches are often used to balance the advantages of both techniques. The decision whether to perform in-situ and/or in-staging data analysis is also based on data location and performance/energy constraints. For example, one may use NVRAM as extended DRAM to analyze data in-situ rather than having to transfer it over the network or offload it to disk, thereby saving the time and energy required to transfer the data.

Data placement: The data volumes involved in the targeted application workflows naturally require the data to be partitioned, depending on the data analysis technique used, across multiple nodes (horizontal allocation) and/or different memory levels (vertical allocation). In general, vertical allocation is exploited when the data does not fit in DRAM, and higher levels of the memory hierarchy are used first to minimize latency. Consequently, at the data producer nodes (i.e., the simulation) we may try to use DRAM for “short term” data, which may eventually be aggregated and stored at other memory levels, for example, for post-processing. However, in the case of data staging nodes, we may use NVRAM to store the data and use DRAM as a cache. Other memory levels may be used if the associated I/O cost does not penalize the performance. The main goal of the horizontal allocation is storing data close to where it will be used in order to minimize the data transfer costs. However, data may also be staged where it is generated instead of where it is analyzed, if data transfer over the network significantly impacts the execution of the simulation.

Data motion: While the general rule is to minimize data movement, the most appropriate strategy depends on the workflow characteristics (e.g., tight vs. loose coupling) and how and where computations are scheduled (for example, moving data to where it will be analyzed). Most of the vertical data movement is from/to DRAM. Horizontal data transfer across the nodes is driven by the execution strategy used for the analysis pipeline. For example, in-staging analysis requires the simulation to use the network intensively, while data transfers are less frequent for in-situ analysis. Additionally, data transfer rates are typically dictated by analysis quality requirements.

Note that the above strategies may have different optimization objectives that might be conflicting, and the dominant one might drive the other strategies depending on requirements or goals. For example, if in-situ data processing is required, then data has to be placed where it is generated, and this overrides other possible optimizations.

IV. MODELING THE CANONICAL APPLICATION WORKFLOW

This section presents coarse-grain models for execution time and energy consumption for the generic canonical data analytics workflow presented in Figure 2. These models are based on certain assumptions, which are explained below. Their input parameters are presented in Table I.

A. Execution time model

The time to execute the workflow depends on the coupling between the simulation and the analysis (i.e., synchronous $t_{sync}$ vs. asynchronous $t_{async}$) as follows, where $t_{s,step}$ is the simulation time and $t_{a,step}$ is the analysis time of a given step:

$$t_{sync} = I_s \cdot \left( t_{s,step} + \frac{t_{a,step}}{I_a} \right)$$

$$t_{async} = I_s \cdot \max\left( t_{s,step}, \frac{t_{a,step}}{I_a} \right)$$

Simulation time: The time to simulate a step is defined as:

$$t_{s,step} = t_{prod,step} + t_{st,step}$$

  $V$                   number of variables
  $V_s$                 size of each variable (bytes)
  $N$                   number of nodes
  $C$                   number of cores, per node
  $C_s$                 simulation cores, per node
  $C_a$                 analysis cores, per node
  $I_s$                 number of simulation steps
  $I_a$                 number of simulation steps between two analyses
  $t_{prod,v}$          time to produce a variable (s)
  $t_{cons,v}$          time to consume a variable (s)
  $BW_{wr,mem,eff}$     effective memory write bandwidth (byte/s)
  $BW_{rd,mem,eff}$     effective memory read bandwidth (byte/s)
  $BW_{wr,stg,eff}$     effective staging area write bandwidth (byte/s)
  $BW_{rd,stg,eff}$     effective staging area read bandwidth (byte/s)
  $BW_{net,eff}$        effective network access bandwidth (byte/s)

TABLE I. PARAMETERS FOR THE WORKFLOW MODEL.

where $t_{prod,step}$ is the time to produce the data and $t_{st,step}$ is the time to move the data to its destination:

$$t_{prod,step} = \frac{V}{N \cdot C_s} \cdot t_{prod,v}$$

$$t_{st,step} = \frac{V}{N \cdot C_s} \cdot \left( t^{st}_{v,mem} + t^{st}_{v,stg} + t^{st}_{v,net} \right)$$

where $t_{prod,v}$ is the CPU-core time needed to generate (i.e., produce) a variable in cache. For example, if we are filling each variable with a random value, $t_{prod,v}$ is the time to call the rand() function. Note that we model the generated data as a set of variables (i.e., data units).

Assuming that the data is large and does not fit in cache, data will be generated directly in DRAM memory. Therefore, $t^{st}_{v,mem} = V_s / BW_{wr,mem,eff}$, and the time to store a single variable to the staging area is $t^{st}_{v,stg} = V_s / BW_{wr,stg,eff}$, assuming that the variable is stored as part of a block (with 256KB blocks or greater, as discussed in Section V). If the staging area is remote, the data has to be transferred over the network, resulting in $t^{st}_{v,net} = V_s / BW_{net,eff}$.

Note that the effective bandwidth is not the actual device bandwidth because all the cores in each node access each memory device. A lower bound on the effective bandwidth is the actual device bandwidth divided by the total number of simulation and analysis cores in a node.

Analysis time: The definition of the time to analyze a step, $t_{a,step}$, is similar to $t_{s,step}$ but involves the time to load and the time to consume instead of the time to store and the time to produce, respectively. It also uses the bandwidth of read operations instead of write operations, analysis cores $C_a$ instead of simulation cores $C_s$, and the consumption constant $t_{cons,v}$ instead of the production constant $t_{prod,v}$.
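To make the execution-time model concrete, the following minimal C sketch (not part of the paper's UPC framework) evaluates $t_{s,step}$, $t_{a,step}$, $t_{sync}$ and $t_{async}$ from the Table I parameters. The struct layout, field names, and the single remote-staging flag are illustrative assumptions.

#include <math.h>

/* Illustrative container for the Table I parameters (names are assumptions). */
typedef struct {
    double V, Vs;                  /* number of variables, size of each variable (bytes) */
    double N, C, Cs, Ca;           /* nodes, cores/node, simulation cores, analysis cores */
    double Is, Ia;                 /* simulation steps, steps between two analyses */
    double t_prod_v, t_cons_v;     /* per-variable produce/consume CPU time (s) */
    double BW_wr_mem, BW_rd_mem;   /* effective memory write/read bandwidth (byte/s) */
    double BW_wr_stg, BW_rd_stg;   /* effective staging write/read bandwidth (byte/s) */
    double BW_net;                 /* effective network bandwidth (byte/s) */
    int remote_staging;            /* nonzero if the staging area is on a remote node */
} params_t;

/* t_s,step = V/(N*Cs) * (t_prod,v + t_st_v,mem + t_st_v,stg + t_st_v,net) */
static double t_s_step(const params_t *p) {
    double t_st_v = p->Vs / p->BW_wr_mem + p->Vs / p->BW_wr_stg
                  + (p->remote_staging ? p->Vs / p->BW_net : 0.0);
    return p->V / (p->N * p->Cs) * (p->t_prod_v + t_st_v);
}

/* t_a,step: same structure, with read bandwidths, analysis cores and t_cons,v. */
static double t_a_step(const params_t *p) {
    double t_ld_v = p->Vs / p->BW_rd_mem + p->Vs / p->BW_rd_stg
                  + (p->remote_staging ? p->Vs / p->BW_net : 0.0);
    return p->V / (p->N * p->Ca) * (p->t_cons_v + t_ld_v);
}

double t_sync(const params_t *p)  { return p->Is * (t_s_step(p) + t_a_step(p) / p->Ia); }
double t_async(const params_t *p) { return p->Is * fmax(t_s_step(p), t_a_step(p) / p->Ia); }

As noted above, the effective bandwidths can be approximated by dividing the raw device bandwidth by the total number of simulation and analysis cores in a node.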

B. Energy model

The energy consumed by the workflow execution is defined as follows, where $t$ represents the execution time (i.e., $t_{sync}$ or $t_{async}$):

$$E = P_{node,idle} \cdot N \cdot t + E_{computation} + E_{dataMotion}$$

Energy for computation: The energy for computation is defined as follows:

$$E_{computation} = N \cdot I_s \cdot \left( C_s \cdot E_{prod,c,step} + \frac{C_a \cdot E_{cons,c,step}}{I_a} \right)$$


where $E_{prod,c,step}$ and $E_{cons,c,step}$ are the energy consumed by one core in order to produce and consume the data, respectively, assuming that the data is already in cache:

$$E_{prod,c,step} = t_{prod,step} \cdot P_{CPU,dynamic,core}$$

$$E_{cons,c,step} = t_{cons,step} \cdot P_{CPU,dynamic,core}$$

$$P_{CPU,dynamic,core} = \frac{P_{CPU,dynamic}}{C}$$

If we simplify:

$$E_{computation} = \frac{P_{CPU,dynamic}}{C} \cdot I_s \cdot V \cdot \left( t_{prod,v} + \frac{t_{cons,v}}{I_a} \right)$$

Energy for data motion: The energy for data motion is related to the time to load and store the data from any level of the hierarchy to DRAM memory, or from DRAM memory to cache, or vice versa:

$$E_{dataMotion} = V \cdot I_s \cdot \left( \left( t^{st}_{v,mem} + \frac{t^{ld}_{v,mem}}{I_a} \right) \cdot P_{mem,dyn} + \left( t^{st}_{v,stg} + \frac{t^{ld}_{v,stg}}{I_a} \right) \cdot P_{stg,dyn} + \left( t^{st}_{v,net} + \frac{t^{ld}_{v,net}}{I_a} \right) \cdot P_{net,dyn} \right)$$
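Continuing the previous sketch, the energy model can be evaluated as below. The power_t fields are illustrative, and the sketch assumes symmetric load/store times over the network; it is a hedged reading of the formulas above, not code from the paper.

/* Illustrative power parameters (W); names are assumptions. */
typedef struct {
    double P_node_idle;                      /* static power of one node */
    double P_cpu_dyn;                        /* dynamic CPU power of a node */
    double P_mem_dyn, P_stg_dyn, P_net_dyn;  /* dynamic power while moving data */
} power_t;

/* E = P_node,idle * N * t + E_computation + E_dataMotion */
double workflow_energy(const params_t *p, const power_t *w, double t_exec) {
    double E_comp = (w->P_cpu_dyn / p->C) * p->Is * p->V
                  * (p->t_prod_v + p->t_cons_v / p->Ia);

    double st_mem = p->Vs / p->BW_wr_mem, ld_mem = p->Vs / p->BW_rd_mem;
    double st_stg = p->Vs / p->BW_wr_stg, ld_stg = p->Vs / p->BW_rd_stg;
    double st_net = p->remote_staging ? p->Vs / p->BW_net : 0.0;
    double ld_net = st_net;                  /* assumes symmetric network bandwidth */

    double E_motion = p->V * p->Is *
        ( (st_mem + ld_mem / p->Ia) * w->P_mem_dyn
        + (st_stg + ld_stg / p->Ia) * w->P_stg_dyn
        + (st_net + ld_net / p->Ia) * w->P_net_dyn );

    return w->P_node_idle * p->N * t_exec + E_comp + E_motion;
}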

C. Workflow execution framework

We implemented a generic data-centric workflow framework that instantiates our canonical workflow architecture and the model shown above. This workflow framework implements both in-situ and in-staging data analytics, in both synchronous and asynchronous manners. The framework has been developed using the Unified Parallel C (UPC) PGAS language.

The parameters of the application workflow include the amount of data generated in each level of the memory hierarchy, the rate of data generation, the frequency of data movement and the amount of computation in the simulation and in the data analysis, among others (first 4 rows of Table I). The application provides three different kernels that can be chosen both for simulation and/or analysis: matrix multiplication (MUL), matrix addition (ADD) and no-operation (NOP). Each kernel provides a specific value for $t_{prod,v}$ and $t_{cons,v}$.

The simulation part of the workflow iteratively performs the following steps: (1) allocates two matrices in DRAM memory, (2) fills the matrices with random data, (3) runs one of the available kernels on the two matrices, and (4) stores the output matrix either in DRAM memory or in the staging area (i.e., NVRAM, SSD or hard disk, either local or remote).

To perform in-situ analysis, the application runs one of the kernels with a given configuration and, depending on the configuration, aggregates the analyzed data in the staging area. Similarly, to perform in-staging data analysis, the application loads the data from the staging area and runs one of the kernels.
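The framework itself is written in UPC and distributes the kernels across cores and nodes; the serial C sketch below only illustrates the four per-iteration steps listed above. The matrix size, kernel selection, and staging path are illustrative assumptions.

#include <stdio.h>
#include <stdlib.h>

#define M 1024                                     /* illustrative matrix dimension */
typedef enum { KERNEL_NOP, KERNEL_ADD, KERNEL_MUL } kernel_t;

/* One simulation iteration: (1) allocate two matrices, (2) fill them with
 * random data, (3) run a kernel, (4) store the result in the staging area
 * (here simply a file on the NVRAM/SSD/HDD mount point). */
static void simulation_step(kernel_t k, const char *staging_path) {
    double *a = malloc((size_t)M * M * sizeof *a);
    double *b = malloc((size_t)M * M * sizeof *b);
    double *out = calloc((size_t)M * M, sizeof *out);
    if (!a || !b || !out) { free(a); free(b); free(out); return; }

    for (size_t i = 0; i < (size_t)M * M; i++) {   /* (2) random fill */
        a[i] = drand48();
        b[i] = drand48();
    }

    switch (k) {                                   /* (3) compute kernel */
    case KERNEL_ADD:
        for (size_t i = 0; i < (size_t)M * M; i++) out[i] = a[i] + b[i];
        break;
    case KERNEL_MUL:
        for (int i = 0; i < M; i++)
            for (int l = 0; l < M; l++)
                for (int j = 0; j < M; j++)
                    out[(size_t)i * M + j] += a[(size_t)i * M + l] * b[(size_t)l * M + j];
        break;
    case KERNEL_NOP:
        break;
    }

    FILE *f = fopen(staging_path, "wb");           /* (4) store to the staging area */
    if (f) { fwrite(out, sizeof *out, (size_t)M * M, f); fclose(f); }

    free(a); free(b); free(out);
}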

Fig. 3. Execution time and energy consumption with different benchmarks using NVRAM, SSD, and hard disk in different modes.

V. DEEP MEMORY HIERARCHY CHARACTERIZATION

Before exploring the potential of the proposed deep memory hierarchy to efficiently handle data-intensive workflows from energy and performance perspectives, we need to characterize the data access of representative benchmark applications and understand the energy and performance behavior of the different memory levels of the hierarchy for different configurations and usage modes.

Testbed: Our experiments were conducted using a Dell server with an Intel quad-core Xeon X3220 processor, 4GB of RAM, a SATA hard disk, and an Infiniband network interface. We also used a 160GB Intel X25-M Mainstream SATA SSD and a 640GB Fusion-io ioDrive Duo (2 partitions of 320GB) NVRAM connected to PCI-Express x8. To measure power we used a “Watts Up? .NET” power meter, which has an accuracy of ±1.5% and a sampling rate of 1Hz.

Benchmarks: We used the following comprehensive set of HPC benchmarks and kernels that stress I/O in different ways:

• Flexible I/O benchmark (fio): an I/O tool meant to be used for benchmarking and stress/hardware verification.

• ioZone: a filesystem benchmark tool that generates and measures a variety of file operations.

• Forkthread: a synthetic benchmark to compare read operation rates. It creates p processes (each containing t threads) and performs random reads with a 16KB block size (a simplified sketch is shown after this list).
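Forkthread is described only by the summary above, so the following C sketch is a plausible reconstruction rather than the original code: it forks p processes, each spawning t threads that issue 16KB random reads against a pre-existing test file (paths, read counts and the thread limit are assumptions).

#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define BLOCK  (16 * 1024)            /* 16KB reads, as in the benchmark description */
#define NREADS 10000                  /* illustrative number of reads per thread */

static const char *path;              /* test file on the device under study */
static off_t file_size;

static void *reader(void *arg) {
    (void)arg;
    char *buf = malloc(BLOCK);
    int fd = open(path, O_RDONLY);
    unsigned seed = (unsigned)(uintptr_t)&seed;    /* cheap per-thread seed */
    for (int i = 0; i < NREADS && fd >= 0 && buf; i++) {
        off_t off = (off_t)(rand_r(&seed) % (file_size / BLOCK)) * BLOCK;
        if (pread(fd, buf, BLOCK, off) < 0) perror("pread");
    }
    if (fd >= 0) close(fd);
    free(buf);
    return NULL;
}

int main(int argc, char **argv) {
    if (argc != 5) { fprintf(stderr, "usage: %s file size_bytes p t\n", argv[0]); return 1; }
    path = argv[1];
    file_size = atoll(argv[2]);
    int p = atoi(argv[3]), t = atoi(argv[4]);      /* t is assumed to be <= 64 */

    for (int i = 0; i < p; i++) {                  /* p processes ...            */
        if (fork() == 0) {
            pthread_t th[64];
            for (int j = 0; j < t; j++)            /* ... each with t threads    */
                pthread_create(&th[j], NULL, reader, NULL);
            for (int j = 0; j < t; j++)
                pthread_join(th[j], NULL);
            _exit(0);
        }
    }
    while (wait(NULL) > 0) ;                       /* parent waits for all children */
    return 0;
}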

Energy and performance behavior overview: Figure 3 summarizes the results obtained with the different memory devices, benchmarks, operation types and access modes. The presented results are obtained with a 10GB data size and show the average, maximum and minimum values of 12 measurements using different block sizes and I/O depth sizes (i.e., the number of parallel accesses to NVRAM). Note that the values obtained with the hard disk are out of the range of the plot in some cases.


Fig. 4. fio bandwidth for different operations, access modes and block sizes using NVRAM. The error bars show the variability due to different numbers of accesses in parallel.

Figure 3 shows that energy and execution time are correlated; therefore, the following observations related to execution time can also be applied to energy consumption. Execution time with the hard disk is much longer than with the SSD, and the execution time with the SSD is longer than with NVRAM. The results with the cached usage mode (i.e., DRAM is used as a cache to access data) present lower variability than those with non-cached access (direct access - directIO) because the OS manages the cache well. However, with directIO we obtain shorter execution times in some cases, which means that execution time can be adapted depending on the access mode (i.e., sequential vs. random). Since we expect real applications to have different access modes over time, management strategies should be application-aware.

We can conclude that the hard disk should be used very sparingly (e.g., to write log files that will be processed offline, using large block sizes), and that NVRAM is preferable. However, SSD seems a good alternative, especially for sequential reading operations.

Determining the best usage modes: As we focus on data-intensive scientific applications, the main performance metric for each memory level of the hierarchy is bandwidth instead of other metrics such as latency. As a result, the best usage mode for each memory device of the hierarchy is the one that maximizes its bandwidth. In order to determine the best usage modes we evaluated each memory level of the hierarchy under different configurations using standard I/O benchmark workloads. Specifically, each device was used as a block storage device following the recommended standard format (i.e., EXT4 for hard disk and SSD, and XFS for NVRAM). The access to the devices was through regular I/O operations such as POSIX read() and write() operations.

Figure 4 displays the bandwidth obtained running the fio benchmark on top of the NVRAM with different block sizes (4KB to 1024KB), access modes (cached/non-cached) and different numbers of accesses in parallel. We present results obtained with a data size of 10GB since with smaller data sizes (e.g., 1GB) the OS caches data too effectively; therefore, data is managed mainly in DRAM and the different configurations do not impact bandwidth. As the different configurations use a fixed amount of data, both execution time and energy consumption are correlated with the observed bandwidth.

                       Seq-read         Rand-read        Seq-write        Rand-write
  Device   Access      Pwr     BW       Pwr     BW       Pwr     BW       Pwr     BW
  nvram    direct      157.1   782.3    158.2   764.3    141.5   500.2    142.7   501.1
  nvram    cached      158.0   774.4    158.3   463.3    165.5   512.5    167.0   506.1
  ssd      direct      118.5   210.2    117.7   190.3    117.3   101.5    117.3   101.5
  ssd      cached      121.4   240.3    118.0   175.9    123.5   108.3    122.5   105.3
  hdd      direct      212.6    98.4      -       -      214.5    96.3    211.8    37.0
  hdd      cached      215.2    96.6      -       -      221.2   100.8    216.0    58.1

TABLE II. SYSTEM POWER (PWR, IN W) AND BANDWIDTH (BW, IN MB/S) WITH DIFFERENT DEVICES, OPERATIONS AND ACCESS MODES.

             Seq-read          Rand-read         Seq-write         Rand-write
  Device     Pwr     BW        Pwr     BW        Pwr     BW        Pwr     BW
  nvram      direct  direct    direct  direct    direct  cached    direct  cached
  ssd        direct  cached    direct  direct    direct  cached    direct  cached
  hdd        direct  direct      -       -       direct  cached    direct  cached

TABLE III. SUMMARY OF OPTIMAL DEVICE ACCESS MODES (IN TERMS OF POWER AND BANDWIDTH) WITH RESPECT TO OPERATION TYPE.

The results also show that NVRAM transfers data faster with big block sizes. Block sizes of 256KB or greater are optimal for all operations, except for random read cached accesses, for which the optimal block size is 1024KB. In general, smaller block sizes can be a significant performance limitation due to the large number of accesses to the device, except for sequential read and write operations with the cached configuration. In these cases, the OS applies caching techniques (e.g., read-ahead) effectively and the block size does not impact the performance. The number of parallel accesses does not affect the bandwidth significantly due to the scalability of the XFS mechanisms.

Similar tests showed that the maximum bandwidth for SSD and HDD is achieved using block sizes >256KB and >1024KB, respectively. However, in the case of HDD the differences between using 1024KB and 256KB are negligible enough to consider 256KB as the optimal block size.

To find the optimal usage modes from a power perspective, we evaluated the different devices with the fio benchmark, using different operations, access modes, a 256KB block size, and 1 as the number of parallel accesses. We measured the bandwidth and the average power drawn by the whole system. We consider static power as the power dissipated when the system is idle and dynamic power as the power required to perform a task. In our experimental testbed, the measured static system power is 108.4 W.

Table II shows the power dissipated and the bandwidth for each of the different devices and configurations. The bandwidth results support that NVRAM is faster than the SSD, which, in turn, is faster than the HDD. Specifically, NVRAM performs 3.2 – 4.8 times better than SSD and 5 – 8 times better than HDD, and SSD performs 1.07 – 2 times better than the HDD. The best results are found using NVRAM with the non-cached mode for read operations, and with the cached mode for write operations. However, the power results follow a different trend. The HDD draws 1.3 – 1.5 times more power than NVRAM and NVRAM draws 1.2 – 1.3 times more power than the SSD. The lowest power is observed with the SSD using the non-cached configuration. Note that random accesses to HDD devices should be avoided due to the frequent seeking overhead and that non-cached accesses are tested using the Linux-specific O_DIRECT flag when opening files.
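As a reference for the access modes compared above, a minimal C sketch of a non-cached (direct) read path is shown below; the 256KB block size follows the characterization, while the 4KB buffer alignment is an assumption that matches typical device requirements.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define BLOCK (256 * 1024)    /* 256KB blocks, the optimal size found above */

/* Read a whole file with non-cached access. O_DIRECT bypasses the OS page
 * cache but requires the buffer, offset and length to be suitably aligned. */
ssize_t read_direct(const char *path) {
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0) return -1;

    void *buf;
    if (posix_memalign(&buf, 4096, BLOCK) != 0) { close(fd); return -1; }

    ssize_t total = 0, n;
    while ((n = read(fd, buf, BLOCK)) > 0)
        total += n;

    free(buf);
    close(fd);
    return (n < 0) ? -1 : total;
}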


Fig. 5. Execution time of the synthetic benchmark using different devices (NVRAM, HDD and SSD) as swap and as explicit I/O.

Table III summarizes the best usage mode of each device from performance and power perspectives for the different operation types, based on the results of Table II. We conclude that the non-cached access mode optimizes power and bandwidth for read operations, and the cached access mode optimizes bandwidth for write operations.

Out-of-core computations: Existing applications may need to work with datasets that do not fit in main memory. The OS offers mechanisms such as virtual memory (swap memory), which automatically offloads memory pages to a block device when memory is full. Therefore, access to the device is implicit from the point of view of the programmer. We evaluated the different devices as swap in order to quantify the potential of using them as extended memory and, in turn, the potential of swap to implement extended memory. We developed a synthetic benchmark that iteratively allocates memory (i.e., using malloc) and fills it with random data. In each iteration, the benchmark reads again all the previously allocated data and processes it (with addition operations). Figure 5 shows the execution time of running the synthetic benchmark using the SATA hard disk, SSD and NVRAM as swap. We also provide the results obtained with a variant of the synthetic benchmark that stores the generated data using I/O operations with different block sizes (i.e., using the filesystem's read/write) to quantify the potential of swapping algorithms vs. traditional I/O-aware out-of-core computation. We do not provide the figure that plots energy consumption for the same experiments because it is very similar to Figure 5. The vertical lines indicate the amount of free DRAM memory before running the experiment and the total DRAM size.
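A simplified serial sketch of this synthetic benchmark is shown below; the chunk size and number of iterations are illustrative and the original implementation may differ.

#include <stdio.h>
#include <stdlib.h>

#define CHUNK_BYTES (1UL << 30)       /* allocate 1GB per iteration (illustrative) */
#define ITERATIONS  14                /* grows well past the 4GB of physical DRAM */

int main(void) {
    double *chunks[ITERATIONS];
    size_t n = CHUNK_BYTES / sizeof(double);

    for (int it = 0; it < ITERATIONS; it++) {
        /* Allocate one more chunk and fill it with random data. */
        chunks[it] = malloc(CHUNK_BYTES);
        if (!chunks[it]) { perror("malloc"); return 1; }
        for (size_t i = 0; i < n; i++)
            chunks[it][i] = drand48();

        /* Re-read and process (add up) all data allocated so far; once the
         * total exceeds DRAM, this forces the swap device to be exercised. */
        double sum = 0.0;
        for (int j = 0; j <= it; j++)
            for (size_t i = 0; i < n; i++)
                sum += chunks[j][i];
        printf("iteration %d: %d GB touched, sum = %g\n", it + 1, it + 1, sum);
    }
    return 0;
}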

Figure 5 shows that the hard drive is not scalable as swap and that both SSD and NVRAM dramatically outperform the hard drive as swap. SSD and NVRAM present similar scalability trends; however, NVRAM is approximately twice as fast as SSD. It also shows that NVRAM as swap obtains worse results than as an explicit I/O device, in part because the current implementation of swap in the Linux kernel does not support all the PCIe-based NVRAM features, such as multi-threaded access to the device. Thus, an efficient implementation of extended memory would require an extended OS kernel or using appropriate interfaces to program PCIe-based NVRAM natively, for example, at the runtime level. We expect several times performance improvement with native byte-addressable NVRAM-based implementations, which seem to be a viable solution for out-of-core computation (e.g., to enable in-situ data processing). Studying such an approach is part of our future work, which we envision in combination with language extensions for cross-layer interactions between application, runtime and OS.

Fig. 6. Synchronous in-staging data analysis using different data sizes.

VI. EXPLORING ENERGY-PERFORMANCE IN THE DATA ANALYTICS PIPELINE

This section presents the experimental evaluation of different instantiations of the canonical model, using the framework introduced in Section IV-C. The main goal of the evaluation is to find the most effective methods for placing and moving data among the memory levels of the hierarchy. These methods are dictated by the needs of the various application types from the performance, energy and quality of solution perspectives.

Testbed: The experiments were conducted using two analysis nodes and a cluster of 32 simulation nodes, all connected via Infiniband. The analysis nodes are the Dell servers described in Section V. The cluster is composed of two blade enclosures with 16 nodes each. Each node has two Intel quad-core Xeon E5620 processors, 6GB of RAM, a SATA hard disk, and an Infiniband network interface. The power drawn by the 32-node cluster was measured using the native metering infrastructure.

Impact of data size: This experiment aims at validating whether the data size ($V$) impacts the execution time and energy linearly. As a result, all other parameters are fixed. Figure 6 displays the execution time and energy consumption for the synchronous workflow with different problem sizes (1–32GB), using local NVRAM and hard disk for data staging. The workflow performs 10 iterations and analyzes data from the staging area in each iteration. As shown in Figure 6, there is a large performance and energy gap between NVRAM and hard disk. Even for small data sizes, both execution time and energy consumption using the hard disk are several times larger than with NVRAM.

Impact of analysis frequency: This experiment quantifies the impact of the number of simulation steps between two analyses ($I_a$) and the number of analysis cores ($C_a$) on performance and energy.


Fig. 7. In-staging data analysis workflow with different qualities of solution: (a) synchronous analysis; (b) asynchronous analysis. The highlighted area marks the most efficient tests. The best tradeoff between quality of solution and energy/performance is analyzing the data every 8 simulation steps.

Another goal of this experiment is to identify the possible tradeoffs between performance/energy and quality of solution.

Figures 7(a) and 7(b) show the execution time and energy consumption for synchronous and asynchronous in-staging data analysis, respectively. The staging area is in either NVRAM or hard disk (HDD). Figure 7(b) shows results using either 1 or 2 dedicated cores for the data analysis. The problem size for the synchronous data analysis was fixed to 3GB in order to accommodate the entire dataset in main memory and avoid multiple interactions with the staging area while analyzing the data. Conversely, for the asynchronous data analysis, the problem size was fixed to 9GB to ensure that the data does not fit in main memory. Both figures show, once again, a large gap between NVRAM and hard disk. The highlighted areas mark a group of tests with similar execution time and energy consumption regardless of the frequency of data analysis. This implies that the best tradeoff between performance/energy and quality of solution is found when data is analyzed every 8 iterations. However, if the application requires higher frequency analysis, for example, to be able to visualize phenomena that are changing more rapidly, it will result in a larger cost. We also observe that, as expected, performing analysis very frequently (every 1–4 simulation steps) has a higher impact on performance and energy when using the hard disk than when using NVRAM. The plotted values in Figure 7(b) are relative to the highest value, obtained using the HDD. The figure shows that when using two cores for data analysis, both execution time and energy consumption increase by about 5% when using NVRAM; however, the increase from one analysis core to two analysis cores is not significant when using the hard disk.

Impact of the data placement and motion: The goal of this experiment is to understand the performance/energy behavior of data placement within the memory hierarchy, considering both local and remote memory, and how data movement frequency impacts these behaviors.

In the local placement experiment, only asynchronous analysis is considered, assigning $C_s = 3$ and $C_a = 1$ cores for simulation and analysis, respectively. Figure 8(a) displays the execution time and energy consumption of the asynchronous workflow using local NVRAM, SSD and hard disk as the staging area. The simulation performed 32 simulation steps and data was analyzed every 8 iterations. The total iteration data size was fixed to 3.2GB. Both execution time and energy consumption using NVRAM are half of those using SSD, and several times smaller than when using the hard disk. Also, the increase in energy is larger when using the hard disk. These results allow us to quantify how much better NVRAM or SSD is than the hard disk when defining data placement policies. For example, this characterization can be used when deciding the fraction of data that should be placed on a given device (e.g., hard disk) to meet deadline constraints and power budgets.

In the remote placement experiment, the following scenarios are explored, derived from workflow patterns in real simulations. In all cases, the total iteration data size is set to 4GB:

• In-situ: uses only the cluster nodes. Some cores are used for simulation while the rest are dedicated to analyzing the data from local DRAM. Two variants are considered: insitu7S1A (i.e., $C_s = 7$ and $C_a = 1$) and insitu6S2A (i.e., $C_s = 6$ and $C_a = 2$).

• In-staging: uses the cluster nodes for simulation and one (instaging8S4A) or two (instaging8S8A) analysis nodes to perform the data analysis asynchronously.

• Local in-staging: uses only the cluster nodes, but allocates the data in the local hard disk (i.e., $C_s = 7$ and $C_a = 1$).

Figure 9 displays the execution time and energy consumption for different frequencies of analysis. Figure 8(b) summarizes the most representative executions, and compares them to the local in-staging option. The figures show that the scenario with the shortest execution time and lowest energy consumption is in-situ data analysis, due to the smaller amount of data movement. However, the in-situ data analysis scenario can be used only when simulation components do not need data from the other components and the dataset fits in main memory.


If the dataset does not fit in main memory, data has to be stored and analyzed using the staging area. In such a case, using remote NVRAM is better than the local hard disk (see the two middle sets of columns of Figure 8(b)). We also observe that a high data movement frequency penalizes execution time and energy dramatically. For example, in the in-staging NVRAM & loosely coupled scenario the energy consumption and execution time are about three times smaller than in the tightly coupled scenario. Therefore, we can infer that data movement is crucial; however, the decision whether to move data or not should be taken in combination with the data placement strategy.

Fig. 8. Execution time and energy consumption using different data placement strategies: (a) vertical allocation exploration, using only one node; (b) horizontal allocation exploration, using the whole cluster plus the two analysis nodes in the last two columns. Both figures are relative to the left-most test.

Figure 9 also shows that as the quality of the solution improves, the execution time increases linearly for in-situ data analysis and almost exponentially for in-staging data analysis. Furthermore, using different numbers of data analysis components also impacts execution time and energy, but does so less significantly than adjusting the frequency of analysis. Specifically, if we perform in-situ analysis within the cluster, increasing the frequency of analysis from every 32 iterations to every 2 iterations penalizes both execution time and energy consumption by a factor of 1.72; however, if we use in-staging data analysis, execution time increases by a factor of about 3.5. We can also improve the quality of the solution by increasing the number of analysis components, with a performance and energy penalty of between 15% and 55%, depending on the specific case.

VII. CONCLUSIONS AND FUTURE WORK

This paper provides a detailed application-centric empirical evaluation of the use of NVRAM-based deep memory hierarchies by data-intensive scientific workflows, with a focus on understanding performance and energy behaviors and data placement, data movement and quality of solution tradeoffs. It also explores usage modes and the potential of NVRAM for I/O staging and in-situ/in-transit data analytics pipelines. The study used a canonical deep-memory architecture and application workflow patterns derived from real applications.

Fig. 9. Impact of the quality of the solution on the energy consumption and execution time using different placement strategies.

The canonical deep memory hierarchy included DRAM, NVRAM, SSD and disk storage levels, in addition to remote memory. The evaluation was conducted for both local and remote analytics pipelines, studied the performance and energy behaviors, and provided insights and recommendations on how scientific application workflows can effectively use the different levels of the deep memory hierarchy.

Our immediate future work includes the evaluation of other types of scientific workflow patterns as well as production scientific application workflows, developing strategies to autonomically balance energy and performance based on the strategies identified in this paper, and using modeling and/or simulation to evaluate the proposed architecture and data management approach at large scale, which can be useful for exascale system/software co-design.

ACKNOWLEDGMENTS

The research presented in this work is supported in part by the US National Science Foundation (NSF) via grant numbers ACI 1310283, DMS 1228203 and IIP 0758566, by the Director, Office of Advanced Scientific Computing Research, Office of Science, of the U.S. Department of Energy through the Scientific Discovery through Advanced Computing (SciDAC) Institute of Scalable Data Management, Analysis and Visualization (SDAV) under award number DE-SC0007455, the Advanced Scientific Computing Research and Fusion Energy Sciences Partnership for Edge Physics Simulations (EPSI) under award number DE-FG02-06ER54857, the ExaCT Combustion Co-Design Center via subcontract number 4000110839 from UT Battelle, and by an IBM Faculty Award. The research was conducted as part of the NSF Cloud and Autonomic Computing (CAC) Center at Rutgers University and the Rutgers Discovery Informatics Institute (RDI2).

REFERENCES

[1] C. Docan, M. Parashar, and S. Klasky, “Dataspaces: an interaction and coordination framework for coupled simulation workflows,” in Intl. Symp. on High Performance Distributed Computing, 2010, pp. 25–36.
[2] H. Abbasi, M. Wolf, G. Eisenhauer, S. Klasky, K. Schwan, and F. Zheng, “Datastager: scalable data staging services for petascale applications,” in Proc. of 18th Intl. Symp. on High Performance Distributed Computing (HPDC’09), 2009.
[3] F. Zheng, H. Abbasi, C. Docan, J. F. Lofstead, Q. Liu, S. Klasky, M. Parashar, N. Podhorszki, K. Schwan, and M. Wolf, “Predata - preparatory data analytics on peta-scale machines,” in Intl. Symp. on Parallel and Distributed Processing, 2010, pp. 1–12.
[4] A. M. Caulfield, L. M. Grupp, and S. Swanson, “Gordon: using flash memory to build fast, power-efficient clusters for data-intensive applications,” in Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, 2009, pp. 217–228.
[5] S. Kannan, A. Gavrilovska, K. Schwan, D. Milojicic, and V. Talwar, “Using active nvram for i/o staging,” in Intl. Workshop on Petascale Data Analytics: Challenges and Opportunities, 2011, pp. 15–22.
[6] A. M. Caulfield, A. De, J. Coburn, T. I. Mollow, R. K. Gupta, and S. Swanson, “Moneta: A high-performance storage array architecture for next-generation, non-volatile memories,” in IEEE/ACM Intl. Symp. on Microarch., 2010, pp. 385–395.
[7] D. H. Yoon, T. Gonzalez, P. Ranganathan, and R. S. Schreiber, “Exploring latency-power tradeoffs in deep nonvolatile memory hierarchies,” in Conf. on Comp. Frontiers, 2012, pp. 95–102.
[8] M. K. Qureshi, V. Srinivasan, and J. A. Rivers, “Scalable high performance main memory system using phase-change memory technology,” in Annual Intl. Symp. on Computer Arch., 2009, pp. 24–33.
[9] D. B. Strukov, G. S. Snider, D. R. Stewart, and R. S. Williams, “The missing memristor found,” Nature, vol. 453, pp. 80–83, May 2008.
[10] A. N. Cockcroft, “Millicomputing: The coolest computers and the flashiest storage,” in Intl. Computer Measurement Group Conference, 2007, pp. 407–414.
[11] Y. Joo, J. Ryu, S. Park, and K. G. Shin, “Fast: quick application launch on solid-state drives,” in USENIX Conf. on File and Storage Technologies, 2011, pp. 19–19.
[12] K. El Maghraoui, G. Kandiraju, J. Jann, and P. Pattnaik, “Modeling and simulating flash based solid-state disks for operating systems,” in WOSP/SIPEW Intl. Conf. on Performance Engineering, 2010, pp. 15–26.
[13] S. Chen, “Flashlogging: exploiting flash devices for synchronous logging performance,” in ACM SIGMOD Intl. Conf. on Management of Data, 2009, pp. 73–86.
[14] D. G. Andersen, J. Franklin, M. Kaminsky, A. Phanishayee, L. Tan, and V. Vasudevan, “Fawn: a fast array of wimpy nodes,” in SIGOPS Symp. on Operating Systems Principles, 2009, pp. 1–14.
[15] S. Kang, S. Park, H. Jung, H. Shim, and J. Cha, “Performance trade-offs in using nvram write buffer for flash memory-based storage devices,” IEEE Trans. Comput., vol. 58, no. 6, pp. 744–758, Jun. 2009.
[16] A. Akel, A. M. Caulfield, T. I. Mollov, R. K. Gupta, and S. Swanson, “Onyx: a prototype phase change memory storage array,” in USENIX Conf. on Hot Topics in Storage and File Systems, 2011, pp. 2–2.
[17] X. Dong, N. Muralimanohar, N. Jouppi, R. Kaufmann, and Y. Xie, “Leveraging 3d pcram technologies to reduce checkpoint overhead for future exascale systems,” in Conf. on High Performance Computing Networking, Storage and Analysis (SC), 2009, pp. 57:1–57:12.
[18] S. Kannan, D. Milojicic, V. Talwar, A. Gavrilovska, K. Schwan, and H. Abbasi, “Using active nvram for cloud i/o,” in Sixth Open Cirrus Summit, 2011, pp. 32–36.
[19] J. Cohen, D. Dossa, M. Gokhale, D. Hysom, J. May, P. Pearce, and A. Yoo, “Storage-intensive supercomputing benchmark study,” Lawrence Livermore National Laboratory, Tech. Rep. UCRL-TR-236179, November 2007.
[20] A. M. Caulfield, J. Coburn, T. Mollov, A. De, et al., “Understanding the impact of emerging non-volatile memories on high-performance, io-intensive computing,” in Intl. Conf. for High Performance Computing, Networking, Storage and Analysis (SC), 2010, pp. 1–11.
[21] J. He, A. Jagatheesan, S. Gupta, J. Bennett, and A. Snavely, “Dash: a recipe for a flash-based data intensive supercomputer,” in Intl. Conf. for High Performance Computing, Networking, Storage and Analysis (SC), 2010, pp. 1–11.
[22] M. Polte, J. Simsa, and G. Gibson, “Comparing performance of solid state devices and mechanical disks,” in Petascale Data Storage Workshop, 2008, pp. 1–7.
[23] K. Lim, P. Ranganathan, J. Chang, C. Patel, T. Mudge, and S. Reinhardt, “Understanding and designing new server architectures for emerging warehouse-computing environments,” in Annual Intl. Symp. on Computer Arch., 2008, pp. 315–326.
[24] S. Rivoire, M. A. Shah, P. Ranganathan, and C. Kozyrakis, “Joulesort: a balanced energy-efficiency benchmark,” in SIGMOD Intl. Conf. on Management of Data, 2007, pp. 365–376.
[25] X. Dong, X. Wu, Y. Xie, Y. Chen, and H. Li, “Stacking mram atop microprocessors: An architecture-level evaluation,” IET Computers & Digital Techniques, vol. 5, no. 3, pp. 213–220, Jun. 2011.
[26] V. Vasudevan, D. Andersen, M. Kaminsky, L. Tan, J. Franklin, and I. Moraru, “Energy-efficient cluster computing with fawn: workloads and implications,” in Intl. Conf. on Energy-Efficient Computing and Networking, 2010, pp. 195–204.
[27] J. D. Davis and S. Rivoire, “Building energy-efficient systems for sequential workloads,” Microsoft Research, Tech. Rep. MSR-TR-2010-30, March 2010.
[28] D. Li, J. S. Vetter, G. Marin, C. McCurdy, C. Cira, Z. Liu, and W. Yu, “Identifying opportunities for byte-addressable non-volatile memory in extreme-scale scientific applications,” in Intl. Parallel and Distr. Processing Symp., 2012, pp. 945–956.
[29] B. Whitlock, J.-M. Favre, and J. S. Meredith, “Parallel In Situ Coupling of Simulation with a Fully Featured Visualization System,” in Proc. of 11th Eurographics Symp. on Parallel Graphics and Visualization (EGPGV’11), April 2011.
[30] S. Lakshminarasimhan et al., “Isabela-qa: Query-driven analytics with isabela-compressed extreme-scale scientific data,” in Intl. Conf. for High Performance Computing, Networking, Storage and Analysis (SC), November 2011, pp. 1–11.
[31] M. Li, S. S. Vazhkudai, A. R. Butt, F. Meng et al., “Functional partitioning to optimize end-to-end performance on many-core architectures,” in Intl. Conf. for High Performance Computing, Networking, Storage and Analysis (SC), November 2010, pp. 1–12.
[32] F. Zhang, C. Docan, M. Parashar, S. Klasky, N. Podhorszki, and H. Abbasi, “Enabling in-situ execution of coupled scientific workflow on multi-core platform,” in IEEE Intl. Parallel and Distributed Processing Symp. (IPDPS), 2012.
[33] M. Gamell, I. Rodero, M. Parashar, J. Bennett et al., “Exploring Power Behaviors and Tradeoffs of In-situ Data Analytics,” in Intl. Conf. on High Performance Computing Networking, Storage and Analysis (SC), Denver, CO, Nov. 2013, pp. 1–12.
[34] H. Abbasi, G. Eisenhauer, M. Wolf, K. Schwan, and S. Klasky, “Just In Time: Adding Value to The IO Pipelines of High Performance Applications with JITStaging,” in Proc. of 20th Intl. Symp. on High Performance Distributed Computing (HPDC’11), June 2011.
[35] C. Docan, M. Parashar, J. Cummings, and S. Klasky, “Moving the Code to the Data - Dynamic Code Deployment Using ActiveSpaces,” in IEEE Intl. Parallel and Distributed Processing Symp. (IPDPS’11), May 2011.
[36] J. C. Bennett, H. Abbasi, P.-T. Bremer, R. Grout et al., “Combining in-situ and in-transit processing to enable extreme-scale scientific analysis,” in Intl. Conf. on High Performance Computing, Networking, Storage and Analysis (SC), 2012, pp. 49:1–49:9.
