Additivity: A Selection Criterion for Performance Events for...

See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/321919331 Additivity: A Selection Criterion for Performance Events for Reliable Energy Predictive Modeling Article · November 2017 DOI: 10.14529/jsfi170404 CITATIONS 0 READS 21 4 authors, including: Some of the authors of this publication are also working on these related projects: Meeting the Future Challenges of Heterogeneous and Extreme-Scale Parallel Computing View project Hierarchical MPI (HiMPI) View project Arsalan Shahid University College Dublin 9 PUBLICATIONS 0 CITATIONS SEE PROFILE Ravi Reddy Manumachu University College Dublin 32 PUBLICATIONS 333 CITATIONS SEE PROFILE Alexey L. Lastovetsky University College Dublin 165 PUBLICATIONS 1,443 CITATIONS SEE PROFILE All content following this page was uploaded by Arsalan Shahid on 29 December 2017. The user has requested enhancement of the downloaded file.

Page 1: Additivity: A Selection Criterion for Performance Events for …hcl.ucd.ie/system/files/153-992-1-PB.pdf · 2020-06-21 · Arsalan Shahid University College Dublin 9 PUBLICATIONS

Additivity: A Selection Criterion for Performance Events for

Reliable Energy Predictive Modeling

Arsalan Shahid¹, Muhammad Fahad¹, Ravi Reddy¹, Alexey Lastovetsky¹

© The Authors 2017. This paper is published with open access at SuperFri.org

Performance events or performance monitoring counters (PMCs) are now the dominant pre-

dictor variables for modeling energy consumption. Modern hardware processors provide a large

set of PMCs. Determining the best subset of PMCs for energy predictive modeling is a nontrivial task, because the PMCs cannot all be collected in a single application run.

Several techniques have been devised to address this challenge. While some techniques are based

on a statistical methodology, some use expert advice to pick a subset (that may not necessarily

be obtained in one application run) that, in experts’ opinion, are significant contributors to en-

ergy consumption. However, the existing techniques have not considered a fundamental property

of predictor variables that should have been applied in the first place to remove PMCs unfit for

modeling energy. We address this oversight in this paper.

We propose a novel selection criterion for PMCs called additivity, which can be used to de-

termine the subset of PMCs that can potentially be used for reliable energy predictive modeling.

It is based on the experimental observation that the energy consumption of a serial execution of

two applications is the sum of energy consumptions observed for the individual execution of each

application. A linear predictive energy model is consistent if and only if its predictor variables are

additive in the sense that the vector of predictor variables for a serial execution of two applications

is the sum of vectors for the individual execution of each application. The criterion, therefore, is

based on a simple and intuitive rule that the value of a PMC for a serial execution of two appli-

cations is equal to the sum of its values obtained for the individual execution of each application.

The PMC is branded as non-additive on a platform if there exists an application for which the

calculated value differs significantly from the value observed for the application execution on the

platform. The use of non-additive PMCs in a model renders it inconsistent.

We study the additivity of PMCs offered by the popular state-of-the-art tools, Likwid and

PAPI, by employing a detailed experimental methodology on a modern Intel Haswell multicore

server CPU. We show that many PMCs in Likwid and PAPI that are widely used in models as

key predictor variables are non-additive. This brings into question the reliability and the reported

prediction accuracy of these models.

Keywords: performance events, PMC, energy predictive models, Likwid, PAPI.

Introduction

Performance events or performance monitoring counters (PMCs) are special-purpose regis-

ters provided in modern microprocessors to store the counts of software and hardware activities.

We use the acronym PMCs to refer both to software events, which are pure kernel-level counters such as page faults and context switches, and to hardware events, which are micro-architectural events originating from the processor and its performance monitoring unit, such as cache misses and branch instructions. PMCs were developed primarily to aid low-level performance analysis and tuning. Remarkably, although they have not been used for performance modeling, over the years they have become the dominant predictor variables for energy predictive modeling.

Modern hardware processors provide a large set of PMCs. Consider the Intel Haswell multicore server CPU whose specification is shown in Tab. 1. On this server, the PAPI tool [19] provides 53 hardware performance events. The Likwid tool [17, 23] provides 167 PMCs. This includes events for uncore and micro-operations (µops) of CPU cores specific to the Haswell architecture that are not provided by PAPI. However, the PMCs cannot all be collected in a single application run, since only a limited number of registers is dedicated to collecting them. For example, to collect all the Likwid PMCs for a single runtime configuration of an application on the server, the application must be executed 53 times. It must also be pointed out that energy predictive models based on PMCs are not portable across a wide range of architectures. While a model based on either Likwid PMCs or PAPI PMCs may be portable across Intel and AMD architectures, it will be unsuitable for GPU architectures.

¹ School of Computer Science, University College Dublin, Dublin, Ireland

DOI: 10.14529/jsfi170404

50 Supercomputing Frontiers and Innovations

Table 1. Specification of the Intel Haswell Multicore CPU

Technical Specifications    Intel Haswell Server
Processor                   Intel E5-2670 v3 @2.30GHz
OS                          CentOS 7
Micro-architecture          Haswell
Thread(s) per core          2
Cores per socket            12
Socket(s)                   2
NUMA node(s)                2
L1d cache                   32 KB
L1i cache                   32 KB
L2 cache                    256 KB
L3 cache                    30720 KB
Main memory                 64 GB DDR4
Memory bandwidth            68 GB/sec
TDP                         240 W
Idle Power                  58 W

Therefore, there are three serious constraints that pose difficult challenges to employing

PMCs as predictor variables for energy predictive modeling. First, there is a large number of

PMCs to consider. Second, tremendous programming effort and time are required to automate

and collect all the PMCs. This is because the PMCs cannot all be collected in one single

application run. Third, a model purely based on PMCs lacks portability. In this paper, we focus

mainly on techniques employed to select a subset of PMCs to be used as predictor variables for

energy predictive modeling. We now present a brief survey of them.

O’Brien et al. [18] survey the state-of-the-art energy predictive models in HPC and present a

case study demonstrating the ineffectiveness of the dominant PMC-based modeling approach for

accurate energy predictions. In the case study, they use 35 carefully selected PMCs (out of a total

of 390 available in the platform) in their linear regression model for predicting dynamic energy

consumption. [1, 9, 10] select PMCs manually, based on in-depth study of architecture and

empirical analysis. [8, 15, 18, 21, 22, 27, 30] select PMCs that are highly correlated with energy

consumption using Spearman’s rank correlation coefficient (or Pearson’s correlation coefficient)

and principal component analysis (PCA). [1, 2, 15] use variants of linear regression to remove

PMCs that do not improve the average model prediction error.

From the survey, we can classify the existing techniques into three categories. The first

category contains techniques that consider all the PMCs with the goal to capture all possible

A. Shahid, M. Fahad, R. Reddy, A. Lastovetsky

2017, Vol. 4, No. 4 51

contributors to energy consumption. To the best of our knowledge, we found no research works

that adopt this approach. This could be due to several reasons: a) Gathering all PMCs requires

huge programming effort and time; b) Interpretation (for example, visual) of the relationship

between energy consumption and PMCs is difficult especially when there is a large number of

PMCs; c) Dynamic or runtime models must choose PMCs that can be gathered in just one

application run; d) Typically, simple models (those with fewer parameters) are preferred over

complex models not because they are accurate but because simplicity is considered a desirable

virtue.

The second category consists of techniques that are based on a statistical methodology. The

last category contains techniques that use expert advice or intuition to pick a subset (that may

not necessarily be determined in one application run) and that, in experts’ opinion, is a dominant

contributor to energy consumption. However, the existing techniques have not considered one

fundamental property of predictor variables that should have been considered in the first place

to remove PMCs unfit for modeling energy. We address this oversight in this paper.

We propose a novel selection criterion for PMCs called additivity, which can be used to de-

termine the subset of PMCs that can potentially be used for reliable energy predictive modeling.

It is based on the experimental observation that the energy consumption of a serial execution

of two applications is the sum of energy consumptions observed for the individual execution of

each application. We define a compound application to represent a serial execution of a combi-

nation of two or more individual applications. The individual applications are also termed as

base applications.

A linear predictive energy model is consistent if and only if its predictor variables are additive

in the sense that the vector of predictor variables for a compound application is the sum of vectors

for the individual execution of each application. The additivity criterion, therefore, is based on

a simple and intuitive rule that the value of a PMC for a compound application is equal to the sum

of its values for the executions of the base applications constituting the compound application.

We brand a PMC non-additive on a platform if there exists a compound application for which

the calculated value (the sum of its values for the base applications) differs from the value observed for the execution of the compound application on

the platform by more than a tolerance of 5.0%. If we fail to find a compound application (typically

from a sufficiently large suite of compound applications) for which the additivity criterion is

not satisfied, we term the PMC as potentially additive, which means that it can potentially be

used for reliable energy predictive modeling. By definition, a potentially additive PMC must be

deterministic and reproducible, that is, it must exhibit the same value (within a tolerance of

5.0%) for different executions of the same application with the same runtime configuration on the

same platform.

The use of a non-additive PMC as a predictor variable in a model renders it inconsistent

and therefore unreliable.

We study the additivity of PMCs offered by two popular tools, Likwid and PAPI, by employ-

ing a detailed statistical experimental methodology on a modern Intel Haswell multicore server

CPU. We observe that all the Likwid PMCs and PAPI PMCs are reproducible. However, we

show that while many PMCs are potentially additive, a considerable number of PMCs are not.

Some of the non-additive PMCs are widely used in energy predictive models as key predictor

variables.

For each non-additive PMC, we determine the maximum percentage error (averaged over several runs) observed experimentally. The percentage error is the ratio of the absolute difference between the PMC of a compound application and the sum of the PMCs of the base applications to the sum of the PMCs of the base applications. We show that there is a PMC where the error is as high as 3075% and there are several PMCs where the error is over 100%. This brings into question the reliability and reported prediction accuracy of models that use these PMCs.

Our key contribution in this work is that we propose a novel criterion called additivity that

can be used to identify PMCs not suitable for energy predictive modeling. PMCs offered by

popular tools such as Likwid and PAPI are classified based on this criterion using a detailed

experimental methodology on a modern Intel Haswell multicore server CPU. In our future work,

we plan to classify the non-additivity of PMCs into application-specific and platform-specific

categories.

The rest of the paper is structured as follows. Section 1 surveys popular tools that provide

programmatic and command-line interfaces to obtain PMCs. In Section 2, we define the property

of additivity and explain why it is important for reliable energy predictive modeling. Section 3

presents the experimental methodology used to determine the additivity of Likwid and PAPI

PMCs. Sections 4 and 5 present a classification of Likwid and PAPI PMCs respectively based

on the criterion of additivity. Finally, Section 6 concludes the paper.

1. Related Work

This section is divided into two parts. In the first part, we present tools widely used to

obtain PMCs. In the second part, we survey notable research on selection of PMCs for power

and energy modeling from a large set supplied by a tool.

1.1. Tools to Determine PMCs

PAPI [19] provides a standard API for accessing PMCs available on most modern micropro-

cessors. It provides two types of events, native events and preset events. Native events correspond

to PMCs native to a platform. They form the building blocks for preset events. A preset event

is mapped onto one or more native events on each hardware platform. While native events are

specific to each platform, preset events obtained on different platforms cannot be compared.

Likwid [22] provides command-line tools and an API to obtain PMCs for both Intel and

AMD processors on the Linux OS.

For Nvidia GPUs, the CUDA Profiling Tools Interface (CUPTI) [3] can be used for obtaining the PMCs. Intel PCM [14] is used for reading PMCs of core and uncore (which includes the QPI) components of an Intel processor. Perf [25], also called perf_events, can be used to gather the PMCs for CPUs in Linux.

1.2. Techniques for Selection of PMCs for Energy Predictive Modeling

All the models surveyed in this section are linear energy predictive models.

Singh et al. [20] use PMCs provided by AMD Phenom processor. They divide the PMCs

into four categories and rank them in the increasing order of correlation with power using the

Spearman’s rank correlation. Then they select the top PMC in each category (four in total) for

their energy prediction model.

Goel et al. [8] divide PMCs into event categories that they believe capture different kinds

of microarchitectural activity. The PMCs in each category are then ordered based on their

correlation to power consumption using the Spearman’s rank correlation. The PMCs with less

correlation are then investigated by analyzing the accuracy of several models that employ them.

Kadayif et al. [16] present a PMC-based model for predicting energy consumption of programs on an UltraSPARC platform. The platform provides 30 different PMCs. However, they use

only eight and do not specify how they have selected them.

Lively et al. [17] employ 40 PMCs in their predictive model. They use an elaborate statistical

methodology to select PMCs. They compute the Spearman’s rank correlation for each PMC and

remove those below a threshold. They compute the principal components (PCA) of the remaining

PMCs and select those with the highest PCA coefficients. Bircher et al. [1] employ an iterative

linear regression modeling process where they add a PMC at each step and stop once the desired average prediction error is achieved.

Song et al. [21] select a group of PMCs (for their energy model of Nvidia Fermi C2075 GPU)

that are strongly correlated to power consumption based on the Pearson correlation coefficient.

Witkowski et al. [26] use PMCs provided by the Perf tool for their model. They use the

correlation (Pearson correlation coefficient) between a PMC and the measured power consump-

tion and select those PMCs, which have high correlation coefficients. Although they find that

the PMCs related to DRAM have a low correlation with power consumption, they still use them

since these variables signify intensity of DRAM operations, which contribute significantly to

power consumption.

Gschwandtner et al. [9] deal with the problem of selecting the best subset of PMCs on the

IBM POWER7 processor, which offers over 500 different PMCs. They first manually select a

medium number of hardware counters that they believe are prominent contributors to energy

consumption. Then they empirically select a subset from their initial selection. Jarus et al. [15]

use PMCs provided by the Perf tool for their models. The PMCs employed differ for different

models and are selected using a two-stage process. In the first stage, PMCs that are correlated

90% or above are selected. In the second stage, stepwise regression with forward selection is used

to decide the final set of PMCs.

Haj-Yihia et al. [10] start with a set of 23 PMCs (offered by Likwid) based on expert

knowledge of the Intel architecture. Then they perform linear regression iteratively where they

drop PMCs (one by one) that do not impact the average prediction error of their model.

Wu et al. [29] use the Spearman correlation coefficient and PCA to select the subset of

PMCs that are highly correlated with power consumption. Chadha et al. [2] select a particular PMC from the list of PAPI PMCs available for their platform and check if it fits well with a linear regression model. If it does, they select it as a key parameter for their modeling and experimental

study. Otherwise, they skip it.

2. Additivity: Definition

The additivity criterion is based on a simple and intuitive rule that the value of a PMC for a

compound application is equal to the sum of its values for the executions of the base applications

constituting the compound application.

Table 2. List of Applications

Application    Description
NPB IS         Integer Sort, Kernel for random memory access
NPB LU         Lower-Upper Gauss-Seidel solver
NPB EP         Embarrassingly Parallel, Kernel
NPB BT         Block Tri-diagonal solver
NPB MG         Multi-Grid on a sequence of meshes
NPB FT         Discrete 3D fast Fourier Transform
NPB DC         Data Cube
NPB UA         Unstructured Adaptive mesh, dynamic and irregular memory access
NPB CG         Conjugate Gradient
NPB SP         Scalar Penta-diagonal solver
NPB DT         Data traffic
MKL FFT        Fast Fourier Transform
MKL DGEMM      Dense Matrix Multiplication
HPCG           High performance conjugate gradient
stress         CPU, disk and I/O stress
Naive MM       Naive Matrix-matrix multiplication
Naive MV       Naive Matrix-vector multiplication

We brand a PMC non-additive on a platform if there exists a compound application for which the calculated value (the sum of its values for the base applications) differs from the value observed for the execution of the compound application on the platform by more than a tolerance of 5.0%. If the experimentally observed PMCs (sample means) of two base applications are e1 and e2 respectively, then a non-additive PMC of the compound application will experimentally exhibit a count that does not lie between (e1 + e2) × (1 − ε) and (e1 + e2) × (1 + ε), where the tolerance ε = 0.05.

If we fail to find a compound application (typically from a large set of diverse compound

applications) for which the additivity criterion fails, we term the PMC as potentially additive,

which means that it can potentially be used for reliable energy predictive modeling. By definition,

a potentially additive PMC must be deterministic and reproducible, that is, it must exhibit the

same value (within a tolerance of 5.0%) for different executions of the same application with the

same runtime configuration on the same platform.

The use of a non-additive PMC as a predictor variable in a model renders it inconsistent

and therefore unreliable. We explain this point using a simple example. Consider an instance of

an energy prediction model that uses a non-additive PMC as a predictor variable. A natural and

intuitive approach to predict the energy consumption of an application that executes two base

applications one after the other is to substitute the sum of the PMCs for the base applications

in the model. However, since the PMC is non-additive, the prediction would be very inaccurate.

Therefore, using non-additive PMCs in energy predictive models adds noise and can signif-

icantly damage the predicting power of energy models based on them.
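The example above can be made explicit for a linear model. The derivation below is only a sketch under stated assumptions: the model has zero intercept, it predicts the energy of the base applications exactly, and the symbols $\beta_i$, $x_i(\cdot)$, $A$, $B$ and $AB$ (the compound application) are notation introduced here for illustration rather than taken from the paper.

```latex
\begin{align*}
E(X) &= \sum_{i=1}^{n} \beta_i \, x_i(X)
  && \text{(linear model for an application } X\text{)} \\
E(AB) &= E(A) + E(B) = \sum_{i=1}^{n} \beta_i \bigl( x_i(A) + x_i(B) \bigr)
  && \text{(observed additivity of energy)} \\
\hat{E}(AB) &= \sum_{i=1}^{n} \beta_i \, x_i(AB)
  && \text{(prediction from the PMCs of the compound run)} \\
\hat{E}(AB) - E(AB) &= \sum_{i=1}^{n} \beta_i \bigl( x_i(AB) - x_i(A) - x_i(B) \bigr)
\end{align*}
```

Under these assumptions, the prediction error vanishes for arbitrary coefficients only when every predictor satisfies $x_i(AB) = x_i(A) + x_i(B)$; any non-additive PMC with a non-zero coefficient contributes its deviation directly to the error.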

We now present a test to determine if a PMC is non-additive or potentially additive. We

call it the additivity test.

2.1. Additivity Test

The test consists of two stages. A PMC must pass both stages to be declared additive for a

given compound application on a given platform. At the first stage, we determine if the PMC is

deterministic and reproducible.

At the second stage, we examine how the PMC of the compound application relates to

its values for the base applications. At first, we collect the values of the PMC for the base

applications by executing them separately. Then, we execute the compound application and

obtain its value of the PMC. Typically, the core computations for the compound application

consist of the core computations of the base applications programmatically placed one after the

other. This has to be the case for PAPI PMCs. However, for Likwid PMCs, one can use the

system call to invoke the base application. It must also be ensured that the execution of the

compound application takes place under platform conditions similar to those for the execution

of its constituent base applications.

If the PMC of the compound application is equal to the sum of the PMCs of the base

applications (with a tolerance of 5.0%), we classify the PMC as potentially additive. Otherwise,

it is non-additive.
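The two stages can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the reproducibility check (every repeated observation within 5.0% of the sample mean) is one assumed reading of "same value within a tolerance of 5.0%".

```python
from statistics import mean

TOLERANCE = 0.05  # 5.0% tolerance, as stated in the text


def is_reproducible(counts, tol=TOLERANCE):
    """Stage 1 (assumed reading): each observation lies within tol of the sample mean."""
    m = mean(counts)
    if m == 0:
        return all(c == 0 for c in counts)
    return all(abs(c - m) <= tol * abs(m) for c in counts)


def passes_additivity_test(base1_runs, base2_runs, compound_runs, tol=TOLERANCE):
    """Return True if the PMC is potentially additive for this compound application."""
    # Stage 1: the PMC must be reproducible for every application involved.
    runs = (base1_runs, base2_runs, compound_runs)
    if not all(is_reproducible(r, tol) for r in runs):
        return False
    # Stage 2: the compound count must lie within tol of the sum of the base counts.
    expected = mean(base1_runs) + mean(base2_runs)
    observed = mean(compound_runs)
    return expected * (1 - tol) <= observed <= expected * (1 + tol)
```

Each argument is a list of counts of one PMC over repeated runs of one application. A PMC would be branded non-additive as soon as this function returns False for any compound application in the test suite.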

We call the PMC that passes the additivity test potentially additive. For it to be called

absolutely additive on a platform, ideally it must pass the test for all conceivable compound

applications on the platform. Therefore, we avoid this definition.

In our experiments, we observed that all the PMCs were deterministic and reproducible.

For each PMC, we determine the maximum percentage error. For a compound application,

the percentage error (averaged over several runs) is calculated as follows:

\[ \mathrm{Error}(\%) = \left| \frac{(e_{b1} + e_{b2}) - e_c}{e_{b1} + e_{b2}} \right| \times 100 \qquad (1) \]

where $e_c$, $e_{b1}$, $e_{b2}$ are the sample means of predictor variables for the compound application and the constituent base applications respectively. The maximum percentage error is then calculated as the maximum of the errors for all the compound applications in the experimental testsuite.
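As a small worked illustration of Eq. (1), the following Python sketch computes the percentage error for one compound application and the maximum over a suite. The function names and the numbers in the usage note are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def percentage_error(e_b1, e_b2, e_c):
    """Eq. (1): deviation of the compound count from the sum of the base counts, in percent."""
    expected = e_b1 + e_b2
    return abs(expected - e_c) / expected * 100.0


def max_percentage_error(measurements):
    """measurements: iterable of (e_b1, e_b2, e_c) sample means, one per compound application."""
    return max(percentage_error(b1, b2, c) for b1, b2, c in measurements)
```

For example, base counts of 100 and 200 with a compound count of 330 give an error of 10%, which exceeds the 5.0% tolerance and would mark the PMC as non-additive.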

3. Experimental Methodology to Obtain Likwid and PAPI PMCs

In this section, we present our experimental setup to determine the additivity of PMCs.

The experiments are performed on the Intel Haswell multicore CPU platform (specifications

given in Tab. 1). We used a diverse range of applications (both compute-bound and memory-bound) in our testsuite composed of NAS parallel benchmarking suite (NPB), Intel math kernel

library (MKL), HPCG [13], and stress [24] (description given in Tab. 2). The experimental

workflow is shown in Fig. 1 where the internals of the server are shown in great detail.

For each run of an application in our testsuite, we measure the following: 1) Dynamic energy

consumption, 2) Execution time, and 3) PMCs. The dynamic energy consumption and the

application execution time are obtained using the HCLWattsUp interface [11]. We would like

to mention that the output variables (or response variables) in the performance and energy

predictive models, i.e. energy consumption and execution time, are additive. We confirm this via

thorough experimentation and therefore we will not discuss them hereafter.

We now present our experimental methodologies for determining Likwid and PAPI PMCs.

3.1. LIKWID PMCs

In this section, we explain the experimental methodology to obtain Likwid PMCs.

[Figure 1: workflow diagram. Applications (NAS Parallel, Intel MKL, stress) are run on the Intel Haswell multicore server (two E5-2670 v3 sockets with DDR4 DIMMs); the outputs are performance events, dynamic energy, and execution time.]

Figure 1. Experimental workflow to determine the PMCs on the Intel Haswell server.

A sample Likwid command-line invocation is shown below, where EVENTS represents one or more PMCs that are collected during the execution of the given application APP:

likwid-perfctr -f -C S0:0-11@S1:12-23 -g EVENTS APP

Here, the application (APP) is pinned during its execution to the physical cores (0-11, 12-23) of our platform. Since Likwid does not provide an option to bind an application to memory, we have used numactl, a Linux command-line tool, with the option --membind to pin our applications to memory blocks (for our platform, numactl gives 2 memory blocks, 0 and 1). The list of comma-separated PMCs is specified in EVENTS. For example, the following command:

likwid-perfctr -f -C S0:0-11@S1:12-23 -g ICACHE_ACCESSES:PMC0,ICACHE_MISSES:PMC1 numactl --membind=0,1 APP

determines the counts for two PMCs, ICACHE_ACCESSES:PMC0 and ICACHE_MISSES:PMC1.

Collection of all PMCs requires significant programming effort and execution time because only a limited number of PMCs can be obtained in a single application run due to the limited number of registers dedicated to collecting PMCs. In addition, to ensure the reliability of our results, we follow a detailed statistical methodology where the sample mean of a PMC is used. It is calculated by executing the application repeatedly until the sample mean lies in the 95% confidence interval and a precision of 0.050 (5.0%) has been achieved. For this purpose, Student's t-test is used assuming that the individual observations are independent and their population follows the normal distribution. We verify the validity of these assumptions by plotting the distributions of observations.
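The repeated-measurement procedure can be sketched as follows. This is an assumed reading of the methodology, not the authors' tool: the stopping rule used here is that the half-width of the 95% confidence interval (Student's t) is at most 5.0% of the sample mean, and `measure`, `min_runs` and `max_runs` are hypothetical names introduced for the illustration.

```python
import math
from statistics import mean, stdev

# Two-sided 95% critical values of Student's t distribution for 1..30 degrees of freedom.
T_95 = [12.706, 4.303, 3.182, 2.776, 2.571, 2.447, 2.365, 2.306, 2.262, 2.228,
        2.201, 2.179, 2.160, 2.145, 2.131, 2.120, 2.110, 2.101, 2.093, 2.086,
        2.080, 2.074, 2.069, 2.064, 2.060, 2.056, 2.052, 2.048, 2.045, 2.042]


def t_critical(df):
    # Beyond the table, fall back to the normal approximation (1.96).
    return T_95[df - 1] if df <= len(T_95) else 1.96


def sample_mean_with_precision(measure, precision=0.05, min_runs=3, max_runs=100):
    """Call measure() until the 95% CI half-width is within `precision` of the mean.

    Returns (sample mean, number of runs). Stops at max_runs even if not converged.
    """
    samples = [measure() for _ in range(min_runs)]
    while True:
        m = mean(samples)
        n = len(samples)
        half_width = t_critical(n - 1) * stdev(samples) / math.sqrt(n)
        if half_width <= precision * abs(m) or n >= max_runs:
            return m, n
        samples.append(measure())
```

Here `measure` stands for one execution of the application that returns the count of a single PMC; in practice it would wrap a call to the collection tool.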

Likwid provides 167 PMCs for our platform. In order to collect all of them for an application,

we have to run the application 53 times. We wrote a software tool to automate this collection

process, SLOPE-PMC-LIKWID [12].

Before we apply the additivity test, we remove a few PMCs such as IIO_CREDIT (related to I/O and QPI) and OFFCORE_RESPONSE, since they exhibit zero counts. We also remove PMCs having very low counts (less than 10). The resulting dataset contains 151 performance events, which are then input to the additivity test.

3.2. PAPI PMCs

In this section, we explain the experimental methodology to obtain PAPI PMCs.

We check the available PAPI PMCs for our Intel Haswell platform using the command-line invocation papi_avail -a. We found that a total of 53 PMCs are available. The number of PMCs that can be gathered in a single application run varies. While gathering a set of 4 PMCs is common, there are a few event sets that can contain at most 2 or 3 PMCs. We found that the application has to be executed 14 times in order to collect all the PMCs for the application on our platform.
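The run count can be related to the event-set size with a short sketch. The function below is hypothetical and simply splits a list of PMC names into fixed-size groups; it does not model the hardware constraints that limit some event sets to 2 or 3 PMCs, so the number of groups it produces should be read as a lower bound on the number of application runs.

```python
def batch_events(events, max_per_run=4):
    """Split PMC names into groups, each small enough to collect in one application run."""
    return [events[i:i + max_per_run] for i in range(0, len(events), max_per_run)]


# Placeholder names standing in for the 53 PAPI PMCs reported for the platform.
placeholder_events = [f"EVENT_{i}" for i in range(53)]
runs_needed = len(batch_events(placeholder_events))
```

With 53 events and at most 4 per run, this grouping yields 14 runs, which matches the number of executions reported above.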

We wrote a software tool to automate the process of collection of PMCs, SLOPE-PMC-

PAPI [12]. It is to be noted that for ensuring the reliability of our experimental results, we

follow the same statistical methodology that was followed for determining Likwid PMCs.

4. Additivity of Likwid PMCs

In this section, we determine the additivity of Likwid PMCs. We execute all the compound

applications where each application is composed of two base applications in our testsuite (shown

in Tab. 2).

The list of potentially additive PMCs is shown in Tab. 3. The list of non-additive PMCs

is presented in Tab. 4, which also reports the maximum percentage error for each PMC.

It is noteworthy that some non-additive PMCs are used as predictor variables in many

energy predictive models [5, 6, 10, 20, 23]. These are ICache events, L2 Transactions, and L2

Requests.

5. Additivity of PAPI PMCs

In this section, we determine the additivity of PAPI PMCs. We again execute all the com-

pound applications where each application is composed of two base applications in our testsuite

(shown in Tab. 2).

The list of potentially additive PMCs is shown in Tab. 5. The list of non-additive PMCs is

shown in Tab. 6, which also reports the maximum percentage error for each PMC.


Table 3. List of Potentially Additive Likwid PMCs

BR_INST_EXEC_ALL_BRANCHES                         IDQ_UOPS_NOT_DELIVERED_CYCLES_0_UOPS_DELIV_CORE
BR_MISP_EXEC_ALL_BRANCHES                         IDQ_UOPS_NOT_DELIVERED_CYCLES_FE_WAS_OK
BR_INST_RETIRED_ALL_BRANCHES                      UOPS_EXECUTED_PORT_PORT_0
BR_MISP_RETIRED_ALL_BRANCHES                      UOPS_EXECUTED_PORT_PORT_1
DRAM_CLOCKTICKS                                   UOPS_EXECUTED_PORT_PORT_2
SNOOPS_RSP_AFTER_DATA_LOCAL                       UOPS_EXECUTED_PORT_PORT_3
SNOOPS_RSP_AFTER_DATA_REMOTE                      UOPS_EXECUTED_PORT_PORT_4
RXL_FLITS_G1_DRS_NONDATA                          UOPS_EXECUTED_PORT_PORT_5
RXL_FLITS_G0_NON_DATA                             UOPS_EXECUTED_PORT_PORT_6
TXL_FLITS_G0_NON_DATA                             UOPS_EXECUTED_PORT_PORT_7
CPU_CLK_UNHALTED_ANY                              UOPS_EXECUTED_PORT_PORT_0_CORE
CPU_CLOCK_UNHALTED_THREAD_P                       UOPS_EXECUTED_PORT_PORT_1_CORE
CPU_CLOCK_UNHALTED_THREAD_P_ANY                   UOPS_EXECUTED_PORT_PORT_2_CORE
CPU_CLOCK_UNHALTED_REF_XCLK                       UOPS_EXECUTED_PORT_DATA_PORTS
CPU_CLOCK_UNHALTED_REF_XCLK_ANY                   L2_RQSTS_ALL_DEMAND_REFERENCES
HA_R2_BL_CREDITS_EMPTY_LO_HA0                     L2_RQSTS_L2_PF_MISS
HA_R2_BL_CREDITS_EMPTY_LO_HA1                     MEM_UOPS_RETIRED_ALL
CPU_CLOCK_THREAD_UNHALTED_ONE_THREAD_ACTIVE       UOPS_EXECUTED_PORT_PORT_3_CORE
CPU_CLOCK_UNHALTED_TOTAL_CYCLES                   UOPS_EXECUTED_PORT_PORT_4_CORE
OFFCORE_REQUESTS_OUTSTANDING_DEMAND_DATA_RD       UOPS_EXECUTED_PORT_PORT_5_CORE
OFFCORE_REQUESTS_OUTSTANDING_CYCLES_WITH_DATA_RD  UOPS_EXECUTED_PORT_PORT_6_CORE
OFFCORE_REQUESTS_OUTSTANDING_DEMAND_DATA_RD_C6    UOPS_EXECUTED_PORT_PORT_7_CORE
UOPS_EXECUTED_PORT_DATA_PORTS                     UOPS_EXECUTED_PORT_ARITH_PORTS
OFFCORE_REQUESTS_DEMAND_DATA_RD                   UOPS_EXECUTED_PORT_ARITH_PORTS_CORE
HA_R2_BL_CREDITS_EMPTY_HI_R2_NCB                  UOPS_EXECUTED_PORT_DATA_PORTS
CPU_CLOCK_UNHALTED_THREAD_P                       UOPS_RETIRED_CORE_TOTAL_CYCLES
CPU_CLOCK_UNHALTED_THREAD_P_ANY                   LSD_CYCLES_4_UOPS
CPU_CLOCK_UNHALTED_REF_XCLK                       UOPS_EXECUTED_THREAD
CPU_CLOCK_UNHALTED_REF_XCLK_ANY                   UOPS_EXECUTED_USED_CYCLES
CPU_CLOCK_THREAD_UNHALTED_ONE_THREAD_ACTIVE       UOPS_EXECUTED_STALL_CYCLES
CPU_CLOCK_UNHALTED_TOTAL_CYCLES                   UOPS_EXECUTED_TOTAL_CYCLES
ICACHE_MISSES                                     UOPS_EXECUTED_CYCLES_GE_1_UOPS_EXEC
L2_RQSTS_RFO_MISS                                 UOPS_EXECUTED_CYCLES_GE_2_UOPS_EXEC
L2_RQSTS_ALL_RFO                                  UOPS_EXECUTED_CYCLES_GE_3_UOPS_EXEC
L2_RQSTS_CODE_RD_HIT                              UOPS_EXECUTED_CYCLES_GE_4_UOPS_EXEC
L2_RQSTS_CODE_RD_MISS                             UOPS_EXECUTED_CORE
UOPS_EXECUTED_PORT_DATA_PORTS                     UOPS_EXECUTED_CORE_USED_CYCLES
MEM_LOAD_UOPS_RETIRED_ALL_ALL                     UOPS_EXECUTED_CORE_STALL_CYCLES
UOPS_ISSUED_ANY                                   UOPS_EXECUTED_CORE_TOTAL_CYCLES
UOPS_ISSUED_USED_CYCLES                           UOPS_EXECUTED_CORE_CYCLES_GE_1_UOPS_EXEC
UOPS_ISSUED_STALL_CYCLES                          UOPS_EXECUTED_CORE_CYCLES_GE_2_UOPS_EXEC
UOPS_ISSUED_TOTAL_CYCLES                          UOPS_EXECUTED_CORE_CYCLES_GE_3_UOPS_EXEC
UOPS_ISSUED_CORE_USED_CYCLES                      UOPS_EXECUTED_CORE_CYCLES_GE_4_UOPS_EXEC
UOPS_ISSUED_CORE_STALL_CYCLES                     UOPS_RETIRED_ALL
UOPS_ISSUED_CORE_TOTAL_CYCLES                     UOPS_RETIRED_CORE_ALL
IDQ_MITE_ALL_UOPS                                 UOPS_RETIRED_RETIRE_SLOTS
IDQ_DSB_UOPS                                      UOPS_RETIRED_CORE_RETIRE_SLOTS
IDQ_MS_UOPS                                       UOPS_RETIRED_USED_CYCLES
IDQ_ALL_DSB_CYCLES_ANY_UOPS                       UOPS_RETIRED_STALL_CYCLES
IDQ_ALL_DSB_CYCLES_4_UOPS                         UOPS_RETIRED_TOTAL_CYCLES
IDQ_ALL_MITE_CYCLES_ANY_UOPS                      UOPS_RETIRED_CORE_USED_CYCLES
IDQ_UOPS_NOT_DELIVERED_CORE                       UOPS_RETIRED_CORE_STALL_CYCLES
CAS_COUNT_RD                                      CAS_COUNT_WR
CAS_COUNT_ALL

It should be mentioned that some of these non-additive PMCs, such as PAPI_L1_ICM and PAPI_L2_ICM, have been widely used in energy and performance predictive models [2, 4, 7, 17, 27, 28]. These represent L1 and L2 instruction cache misses.


Table 4. List of Non-additive Likwid PMCs

Event Name                                          Maximum Percentage Error (%)
UNCORE_CLOCK                                        16.98
CBOX_CLOCKTICKS                                     16.98
SBOX_CLOCKTICKS                                     17.08
WBOX_CLOCKTICKS                                     17.57
BBOX_CLOCKTICKS                                     16.98
PBOX_CLOCKTICKS                                     16.98
RBOX_CLOCKTICKS                                     16.98
QBOX_CLOCKTICKS                                     17.57
HA_R2_BL_CREDITS_EMPTY_LO_R2_NCB                    45.27
HA_R2_BL_CREDITS_EMPTY_LO_R2_NCS                    48.28
HA_R2_BL_CREDITS_EMPTY_HI_HA0                       203.15
HA_R2_BL_CREDITS_EMPTY_HI_HA1                       213.15
HA_R2_BL_CREDITS_EMPTY_HI_R2_NCS                    250.56
OFFCORE_RESPONSE_0_DMND_DATA_RD_ANY                 47.50
ICACHE_IFETCH_STALL                                 86.60
L2_RQSTS_RFO_HIT                                    27.44
ARITH_DIVIDER_UOPS                                  3075.23
IDQ_UOPS_NOT_DELIVERED_CYCLES_LE_1_UOP_DELIV_CORE   163.64
IDQ_UOPS_NOT_DELIVERED_CYCLES_LE_2_UOP_DELIV_CORE   89.16
L2_RQSTS_L2_PF_HIT                                  39.41
ICACHE_HIT                                          105.45
RXL_FLITS_G0_DATA                                   176.62
OFFCORE_REQUESTS_OUTSTANDING_ALL_DATA_RD            33.76
OFFCORE_REQUESTS_ALL_DATA_RD                        42.45
IDQ_MITE_UOPS                                       42.06
L2_RQSTS_ALL_DEMAND_DATA_RD                         52.76
L2_TRANS_DEMAND_DATA_RD                             24.29
L2_RQSTS_ALL_DEMAND_DATA_RD_MISS                    29.14
L2_RQSTS_ALL_DEMAND_DATA_RD_HIT                     35.09
L2_RQSTS_ALL_DEMAND_DATA_RD                         39.43
L2_TRANS_DEMAND_DATA_RD                             52.43
L2_RQSTS_ALL_DEMAND_DATA_RD_MISS                    56.23
L2_RQSTS_ALL_DEMAND_DATA_RD_HIT                     72.32
L2_RQSTS_ALL_DEMAND_DATA_RD                         35.03
L2_TRANS_DEMAND_DATA_RD                             75.24
L2_RQSTS_ALL_DEMAND_DATA_RD                         80.33
RXL_FLITS_G2_NCB_DATA                               100
RXL_FLITS_G2_NCB_NONDATA                            100
TXL_FLITS_G0_DATA                                   100
TXL_FLITS_G1_DRS_DATA                               100
TXL_FLITS_G1_DRS_NONDATA                            100
TXL_FLITS_G2_NCB_DATA                               100
LSD_UOPS                                            42

5.1. Core and Memory Pinning

We ran two sets of experiments, one with the application pinned to the cores and the other with the application pinned to both the cores and the memory. The percentage errors decreased slightly when the application was pinned to both the cores and the memory. However, memory pinning had no effect on the additive PMCs and, most importantly, the non-additive PMCs remained non-additive (within a tolerance of 5%).

6. Discussion

From Tab. 3 and Tab. 4, which show the potentially additive and non-additive Likwid PMCs respectively, one can observe that 43 out of a total of 151 PMCs are non-additive.


Table 5. List of Potentially Additive PAPI PMCs

PAPI_L1_DCM     PAPI_FUL_CCY    PAPI_L2_DCW
PAPI_L2_DCM     PAPI_BR_UCN     PAPI_L3_DCW
PAPI_CA_SHR     PAPI_BR_CN      PAPI_L3_TCR
PAPI_CA_CLN     PAPI_BR_TKN     PAPI_L2_TCW
PAPI_CA_INV     PAPI_BR_NTK     PAPI_L3_TCW
PAPI_CA_ITV     PAPI_BR_MSP     PAPI_REF_CYC
PAPI_L1_STM     PAPI_BR_PRC     PAPI_L1_TCM
PAPI_L2_LDM     PAPI_TOT_INS    PAPI_L2_TCM
PAPI_L2_STM     PAPI_L2_DCR     PAPI_BR_INS
PAPI_PRF_DM     PAPI_L3_DCR     PAPI_RES_STL
PAPI_TOT_CYC    PAPI_L2_DCA     PAPI_L3_DCA
PAPI_L2_TCA     PAPI_L2_TCR     PAPI_L3_TCA

Table 6. List of Non-additive PAPI PMCs

Event Name      Maximum Percentage Error (%)
PAPI_CA_SNP     40.23
PAPI_TLB_DM     31.54
PAPI_TLB_IM     23.70
PAPI_STL_CCY    31.43
PAPI_LD_INS     32.06
PAPI_SR_INS     21.98
PAPI_LST_INS    45.87
PAPI_L1_ICM     37.28
PAPI_L2_ICM     37.50
PAPI_L2_ICH     107.12
PAPI_L2_ICA     30.65
PAPI_L3_ICA     30.2
PAPI_L2_ICR     30.65
PAPI_L3_TCM     14.54
PAPI_L3_LDM     74.68
PAPI_L1_LDM     200.82
PAPI_L3_ICR     19.48

The event ARITH_DIVIDER_UOPS exhibits the highest maximum percentage error of about 3075%. This event belongs to the µOPS group of Likwid PMCs responsible for gathering PMCs related to the instruction pipeline.

Several PMCs (HA_R2_BL_CREDITS_EMPTY_HI_HA0, HA_R2_BL_CREDITS_EMPTY_HI_HA1, HA_R2_BL_CREDITS_EMPTY_HI_R2_NCS) show maximum percentage errors between about 200% and 250%. These events belong to the Home Agent (HA) group of Likwid PMCs. The HA is the central unit that is responsible for providing PMCs from the protocol side of memory interactions.


Several PMCs show a maximum percentage error of about 100%. They are mainly from the QPI group of Likwid PMCs responsible for packetizing requests from the caching agent on the way out to the system interface.

Similarly, from Tab. 5 and Tab. 6, which show the potentially additive and non-additive PAPI PMCs respectively, 17 out of a total of 53 PMCs are non-additive. One PMC, PAPI_L1_LDM, demonstrates the highest maximum percentage error of about 200%. It represents L1 load misses. Another PMC, PAPI_L2_ICH, demonstrates a maximum percentage error of over 100%. It represents L2 instruction cache hits.

If we increase the tolerance to about 20%, then only 8 non-additive Likwid PMCs become potentially additive. For PAPI, only two non-additive PMCs become potentially additive, PAPI_L3_TCM and PAPI_L3_ICR. They represent L3 cache misses and L3 instruction cache reads respectively. Increasing the tolerance to about 30% results in a further 3 non-additive Likwid PMCs and 5 non-additive PAPI PMCs becoming potentially additive.
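The effect of relaxing the tolerance can be checked directly against the maximum percentage errors in Tab. 6. The following minimal sketch copies those values into a dictionary (event names are written with underscores, as PAPI spells them) and lists the PMCs whose maximum error does not exceed a given tolerance. The helper name `within_tolerance` is hypothetical.

```python
# Maximum percentage errors of the non-additive PAPI PMCs, copied from Tab. 6.
TABLE6_MAX_ERROR = {
    "PAPI_CA_SNP": 40.23,
    "PAPI_TLB_DM": 31.54,
    "PAPI_TLB_IM": 23.70,
    "PAPI_STL_CCY": 31.43,
    "PAPI_LD_INS": 32.06,
    "PAPI_SR_INS": 21.98,
    "PAPI_LST_INS": 45.87,
    "PAPI_L1_ICM": 37.28,
    "PAPI_L2_ICM": 37.50,
    "PAPI_L2_ICH": 107.12,
    "PAPI_L2_ICA": 30.65,
    "PAPI_L3_ICA": 30.2,
    "PAPI_L2_ICR": 30.65,
    "PAPI_L3_TCM": 14.54,
    "PAPI_L3_LDM": 74.68,
    "PAPI_L1_LDM": 200.82,
    "PAPI_L3_ICR": 19.48,
}


def within_tolerance(max_errors, tolerance):
    """Return the events whose maximum percentage error is <= tolerance."""
    return [name for name, err in max_errors.items() if err <= tolerance]


print(within_tolerance(TABLE6_MAX_ERROR, 20.0))
```

At a tolerance of 20% the sketch returns PAPI_L3_TCM and PAPI_L3_ICR, as stated above. Because several entries sit just above 30% (30.2 and 30.65), the count obtained at "about 30%" depends on exactly where the cut-off is placed.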

Thus, one can see that there are still a large number of PMCs that are non-additive even

after increasing the tolerance to as high as 30%. Some of these PMCs have been used as key

predictor variables in energy predictive models [2, 4–7, 10, 17, 20, 23, 27, 28].

To summarize, the non-additive PMCs that exceed a specified tolerance must be excluded from the list of PMCs to be considered as predictor variables for energy predictive modeling, because they can damage the prediction accuracy of these models due to their highly non-deterministic nature. Also, the list of potentially additive PMCs must be tested exhaustively on more diverse applications and platforms to gain more confidence in their additivity.

In our future work, we will study how much the prediction error is affected by the presence of non-additive PMCs in all the linear predictive energy models that we surveyed.

Conclusion

Performance events (PMCs) are now the dominant predictor variables for modeling energy consumption. Considering the large set of PMCs offered by modern processors, several techniques have been devised to select the best subset of PMCs to be used for energy predictive modeling. However, the existing techniques have not considered one fundamental property of predictor variables that should have been taken into account in the first place to remove PMCs unsuitable for modeling energy. We have addressed this oversight in this paper.

We proposed a novel selection criterion for PMCs called additivity, which can be used to

determine the subset of PMCs that can potentially be considered for reliable energy predictive

modeling. It is based on the experimental observation that the energy consumption of a serial

execution of two applications is the sum of energy consumptions observed for the individual

execution of each application. A linear predictive energy model is consistent if and only if

its predictor variables are additive in the sense that the vector of predictor variables for a

serial execution of two applications is the sum of vectors for the individual execution of each

application.
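The link between additive predictor variables and a consistent linear model can be illustrated numerically. The sketch below assumes a linear model with no intercept term and uses made-up coefficients and counts; all names and numbers are hypothetical. With additive counts, the predicted energy of the serial execution equals the sum of the predictions for the two applications; a single non-additive counter breaks this equality.

```python
def predict(coeffs, counts):
    """Linear energy model without intercept: E = sum(coeff_i * count_i)."""
    return sum(c * x for c, x in zip(coeffs, counts))


coeffs = [2.0, 0.5]      # hypothetical model coefficients
app_a = [100.0, 40.0]    # hypothetical PMC vector of application A
app_b = [60.0, 20.0]     # hypothetical PMC vector of application B

# Additive case: the compound vector is the element-wise sum of the base vectors.
additive_ab = [a + b for a, b in zip(app_a, app_b)]
# Non-additive case: the second counter reads 90 instead of the expected 60.
non_additive_ab = [160.0, 90.0]

e_a = predict(coeffs, app_a)
e_b = predict(coeffs, app_b)

print(predict(coeffs, additive_ab) == e_a + e_b)      # consistent prediction
print(predict(coeffs, non_additive_ab) == e_a + e_b)  # inconsistent prediction
```

Since the measured energy of the serial execution is the sum of the individual energies, the second prediction cannot match the measurement even when the model is exact for A and B individually.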

We studied the additivity of PMCs offered by two popular tools, Likwid and PAPI, using a

detailed statistical experimental methodology on a modern Intel Haswell multicore server CPU.

We showed that many PMCs in Likwid and PAPI are non-additive and that some of these

PMCs are key predictor variables in energy predictive models thereby bringing into question the

reliability and reported prediction accuracy of these models.


In our future work, we will classify the non-additivity of a PMC into application-specific and platform-specific categories. We will also look at the additivity of PMCs offered by accelerators such as Graphics Processing Units (GPUs). For instance, Nvidia GPUs provide the CUDA Profiling Tools Interface (CUPTI), which offers functions to determine around 140 PMCs. However, implementing a compound application (or kernel) from two or more base applications (kernels) is not straightforward. While CUPTI allows a continuous event collection mode, we found that it is not widely supported and hence is presently unusable for the implementation of compound applications.

Acknowledgement

This publication has emanated from research conducted with the financial support of Science

Foundation Ireland (SFI) under Grant Number 14/IA/2474.

This paper is distributed under the terms of the Creative Commons Attribution-Non Commercial 3.0 License which permits non-commercial use, reproduction and distribution of the work without further permission provided the original work is properly cited.

References

1. Bircher, W.L., John, L.K.: Complete system power estimation using processor performance events. IEEE Transactions on Computers 61(4), 563–577 (Apr 2012), DOI: 10.1109/TC.2011.47

2. Chadha, M., Ilsche, T., Bielert, M., Nagel, W.E.: A statistical approach to power estimation for x86 processors. In: Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2017 IEEE International. pp. 1012–1019. IEEE (2017), DOI: 10.1109/IPDPSW.2017.98

3. CUPTI: Cuda profiling tools interface (2017), https://developer.nvidia.com/cuda-profiling-tools-interface, accessed: 2017-04-10

4. Dauwe, D., Friese, R., Pasricha, S., Maciejewski, A.A., Koenig, G.A., Siegel, H.J.: Modeling the effects on power and performance from memory interference of co-located applications in multicore systems. In: Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA). p. 1. The Steering Committee of The World Congress in Computer Science, Computer Engineering and Applied Computing (WorldComp) (2014)

5. Dolz, M.F., Kunkel, J., Chasapis, K., Catalan, S.: An analytical methodology to derive power models based on hardware and software metrics. Computer Science-Research and Development 31(4), 165–174 (2016), DOI: 10.1007/s00450-015-0298-8

6. Dolz Zaragoza, M.F., Kunkel, J., Chasapis, K., Catalan Pallares, S.: An analytical methodology to derive power models based on hardware and software metrics (2015), DOI: 10.1007/s00450-015-0298-8

7. Eidenbenz, S.J., Djidjev, H.N., Nadiga, B.T., Park, E.J.: Simulation-based and analytical models for energy use prediction. Tech. rep., Los Alamos National Laboratory (LANL) (2016)


8. Goel, B., McKee, S.A., Gioiosa, R., Singh, K., Bhadauria, M., Cesati, M.: Portable, scalable, per-core power estimation for intelligent resource management. In: Green Computing Conference, 2010 International (2010), DOI: 10.1109/GREENCOMP.2010.5598313

9. Gschwandtner, P., Knobloch, M., Mohr, B., Pleiter, D., Fahringer, T.: Modeling CPU energy consumption of HPC applications on the IBM POWER7. In: Parallel, Distributed and Network-Based Processing (PDP), 2014 22nd Euromicro International Conference on. pp. 536–543. IEEE (2014), DOI: 10.1109/PDP.2014.112

10. Haj-Yihia, J., Yasin, A., Asher, Y.B., Mendelson, A.: Fine-grain power breakdown of modern out-of-order cores and its implications on skylake-based systems. ACM Transactions on Architecture and Code Optimization (TACO) 13(4), 56 (2016), DOI: 10.1145/3018112

11. HCL: HCLWattsUp: API for power and energy measurements using WattsUp Pro Meter (2016), http://git.ucd.ie/hcl/hclwattsup, accessed: 2017-04-24

12. HCL: SLOPE-PMC: Towards the automation of PMCs collection for Intel based multicore platforms (2017), https://git.ucd.ie/hcl/SLOPE/tree/master/SLOPE-PMC, accessed: 2017-04-24

13. Intel Optimized HPCG: Overview of the Intel optimized HPCG, https://software.intel.com/en-us/node/599524, accessed: 2017-04-24

14. IntelPCM: Intel performance counter monitor - a better way to measure CPU utilization. (2012), https://software.intel.com/en-us/articles/intel-performance-counter-monitor, accessed: 2017-04-22

15. Jarus, M., Oleksiak, A., Piontek, T., Weglarz, J.: Runtime power usage estimation of HPC servers for various classes of real-life applications. Future Generation Computer Systems 36 (2014), DOI: 10.1016/j.future.2013.07.012

16. Kadayif, I., Chinoda, T., Kandemir, M., Vijaykrishnan, N., Irwin, M.J., Sivasubramaniam, A.: vec: Virtual energy counters. In: Proceedings of the 2001 ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering. pp. 28–31. PASTE ’01, ACM (2001), DOI: 10.1145/379605.379639

17. Lively, C., Wu, X., Taylor, V., Moore, S., Chang, H.C., Su, C.Y., Cameron, K.: Power-aware predictive models of hybrid (MPI/OpenMP) scientific applications on multicore systems. Computer Science-Research and Development 27(4), 245–253 (2012), DOI: 10.1007/s00450-011-0190-0

18. O’Brien, K., Pietri, I., Reddy, R., Lastovetsky, A., Sakellariou, R.: A survey of power and energy predictive models in HPC systems and applications. ACM Computing Surveys 50(3) (2017), DOI: 10.1145/3078811

19. PAPI: Performance application programming interface 5.5.1 (2017), http://icl.cs.utk.edu/papi/, accessed: 2017-04-24

20. Singh, K., Bhadauria, M., McKee, S.A.: Real time power estimation and thread scheduling via performance counters. ACM SIGARCH Computer Architecture News 37(2), 46–55 (2009), DOI: 10.1145/1577129.1577137


21. Song, S., Su, C., Rountree, B., Cameron, K.W.: A simplified and accurate model of power-performance efficiency on emergent GPU architectures. In: 27th IEEE International Parallel & Distributed Processing Symposium (IPDPS). pp. 673–686. IEEE Computer Society (2013), DOI: 10.1109/IPDPS.2013.73

22. Treibig, J., Hager, G., Wellein, G.: Likwid: A lightweight performance-oriented tool suite for x86 multicore environments. In: Parallel Processing Workshops (ICPPW), 2010 39th International Conference on. pp. 207–216. IEEE (2010), DOI: 10.1109/ICPPW.2010.38

23. Wang, S.: Software power analysis and optimization for power-aware multicore systems. Wayne State University (2014), https://digitalcoomons.wayne.edu/oa_dissertations/933/, accessed: 2017-04-24

24. Waterland, A.: Stress. https://people.seas.harvard.edu/~apw/stress/ (2001), accessed: 2017-04-24

25. Wiki, P.: perf: Linux profiling with performance counters (2017), https://perf.wiki.kernel.org/index.php/Main_Page, accessed: 2017-04-23

26. Witkowski, M., Oleksiak, A., Piontek, T., Weglarz, J.: Practical power consumption estimation for real life HPC applications. Future Gener. Comput. Syst. 29(1) (Jan 2013), DOI: 10.1016/j.future.2012.06.003

27. Wu, X., Chang, H.C., Moore, S., Taylor, V., Su, C.Y., Terpstra, D., Lively, C., Cameron, K., Lee, C.W.: Mummi: multiple metrics modeling infrastructure for exploring performance and power modeling. In: Proceedings of the Conference on Extreme Science and Engineering Discovery Environment: Gateway to Discovery. p. 36. ACM (2013), DOI: 10.1145/2484762.2484773

28. Wu, X., Lively, C., Taylor, V., Chang, H.C., Su, C.Y., Cameron, K., Moore, S., Terpstra, D., Weaver, V.: Mummi: multiple metrics modeling infrastructure. In: Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD), 2013 14th ACIS International Conference on. pp. 289–295. IEEE (2013), DOI: 10.1109/SNPD.2013.73

29. Wu, X., Taylor, V., Cook, J., Mucci, P.J.: Using Performance-Power modeling to improve energy efficiency of HPC applications. Computer 49(10), 20–29 (2016), DOI: 10.1109/MC.2016.311
