This paper is included in the Proceedings of the 2017 USENIX Annual Technical Conference (USENIX ATC ’17).
July 12–14, 2017 • Santa Clara, CA, USA
ISBN 978-1-931971-38-6
Open access to the Proceedings of the 2017 USENIX Annual Technical Conference is sponsored by USENIX.

E-Team: Practical Energy Accounting for Multi-Core Systems
Till Smejkal and Marcus Hähnel, TU Dresden; Thomas Ilsche, Center for Information Services and High Performance Computing (ZIH), Technische Universität Dresden; Michael Roitzsch, TU Dresden; Wolfgang E. Nagel, Center for Information Services and High Performance Computing (ZIH), Technische Universität Dresden; Hermann Härtig, TU Dresden
https://www.usenix.org/conference/atc17/technical-sessions/presentation/smejkal


E-Team: Practical Energy Accounting for Multi-Core Systems

Till Smejkal1, Marcus Hähnel1, Thomas Ilsche2, Michael Roitzsch1, Wolfgang E. Nagel2, and Hermann Härtig1

1Operating Systems Group, TU Dresden; 2Center for Information Services and High Performance Computing (ZIH), TU Dresden

[email protected]

Abstract
Energy-based billing as well as energy-efficient software require accurate knowledge of energy consumption. Model-based energy accounting and external measurement hardware are the main methods to obtain energy data, but cost and the need for frequent recalibration have impeded their large-scale adoption. Running Average Power Limit (RAPL) by Intel® enables non-intrusive, off-the-shelf energy monitoring, but only on a per-socket level. To enable apportioning of energy to individual applications we present E-Team, a non-intrusive, scheduler-based, easy-to-use energy-accounting mechanism. By leveraging RAPL, our method can be used on any Intel system built after 2011 without the need for external infrastructure, application modification, or model calibration. E-Team allows starting and stopping measurements at arbitrary points in time while maintaining a low performance overhead. E-Team provides high accuracy, compared to external instrumentation, with an error of less than 3.5 %.

1 Introduction
Energy has become the major factor constraining the utility of today’s systems. For mobile platforms, which rely heavily on battery life, energy efficiency is an important differentiator for applications and devices. Being more energy efficient is a competitive advantage. In data centers, energy nowadays dominates operating costs, necessitating energy-based payment models [21].

Accurate accounting of energy is paramount to optimize energy consumption and to enable energy-based billing. Software developers rely on energy-consumption statistics to find and fix energy bugs [31] and improve the energy efficiency of their algorithms [18]. But software development already demands developers’ attention to non-functional properties, like responsiveness and security. To enable energy-efficient systems, developers need a measurement infrastructure that is easy to use and cost-effective to deploy.

As energy characteristics often only manifest during the runtime of the deployed application, such infrastructure must be non-intrusive in production environments, incurring no performance loss or energy penalty when not in use. Still, enabling on-the-fly measurement of individual applications or parts of the system should be as easy as executing a simple command.

1.1 State of the Art
External measurement hardware is accurate [23], but can only provide machine-level measurements. Inference-based solutions [35, 25, 7] are more flexible, but require calibration. We strive for a solution combining their respective advantages.

Intel introduced the Running Average Power Limit (RAPL) technology in Sandy Bridge™ CPUs [33]. It provides a power-limiting infrastructure that is automatically calibrated during startup and exposes energy measurements. While not providing per-application energy values, it removes the need for expensive, specialized external measurement hardware and comes with zero setup effort. We introduce RAPL in Section 2.

Simple inference-based models, using CPU time or retired instructions, fail to accurately capture the energy consumption of complex workloads, making energy apportioning infeasible. We illustrate this point by measuring a busy loop that does not touch any data, and FIRESTARTER [14], a CPU-burner application designed for high power usage. Figure 1 shows the result of the experiment. We establish a baseline by measuring the energy consumption of each app in isolation using RAPL. Then we run both programs at the same time, scheduled by Linux’ CFS, measure the system-level end-to-end energy consumption, and apportion it based on CPU time and based on instructions retired. Both methods are incapable of correctly attributing energy consumption. We also give a short glimpse of the result of our solution, E-Team, which is able to accurately capture the energy consumption of both applications.

USENIX Association 2017 USENIX Annual Technical Conference 589

Figure 1: Energy attribution based on instructions retired (Insns) and CPU time, compared to E-Team; bars show energy (CPU+Memory) in joules for busy-loop and FIRESTARTER under the baseline, CPU-time, Insns, and E-Team methods.

1.2 Contributions
We present the design and implementation of an operating system service for accurate and efficient measurement of per-application CPU energy use on multi-core systems using RAPL. Our key contributions are:
• A scheduler design to circumvent the limitations of RAPL using team scheduling (Section 3.1) and a Linux implementation (Section 4).
• A user-accessible interface to start and stop energy accounting of individual thread groups (Section 4.2).
• A scheduler integration of RAPL for short code paths (Section 2.3).
• An evaluation using standard benchmarks (NPB [1]) and real-world scenarios with multiple individually measured applications running in parallel (Section 5).
• A validation of our implementation’s accuracy using a precise external measurement setup (Section 6).

2 RAPL for Energy Measurements
Starting with the Sandy Bridge generation, Intel CPUs provide the Running Average Power Limit (RAPL) technology [33]. As the name implies, RAPL is intended for power limiting, but it also provides energy counters. Due to its widespread availability and the fact that it requires no additional instrumentation, RAPL is used extensively for power and energy estimation [12, 16, 11, 39].

2.1 Basic RAPL Operation
RAPL provides energy measurements for four domains:
Package (PKG): the whole processor package,
Cores (PP0): aggregate of all cores in a package,
Graphics (PP1): the CPU-integrated graphics processing unit (not available on server platforms), and
Memory (DRAM): memory. Although officially only supported on server platforms [20], this domain is also available on desktop processors since Haswell.

The initial implementation of RAPL was based on a model using micro-architectural events to estimate energy consumption [8]. Hackenberg et al. [13] have revealed systematic errors in the RAPL energy counters, e.g., bias towards certain workloads and contradictory results when using Hyper-Threading. For Haswell-generation processors, RAPL has been demonstrated to provide accurate measurements without systematic errors [15]; hence the results presented in Sections 5 and 6 were produced on Haswell desktop and server systems.

2.2 Limitations of Basic RAPL
Contrary to performance counters, RAPL counters are exposed exclusively through Model-Specific Registers (MSRs) that are only readable from kernel space. A number of methods exist in Linux to read the MSRs in the kernel and make the values accessible to applications: performance-monitoring libraries such as PAPI [26] or LIKWID [38], and dedicated third-party drivers [24]. Since Linux 3.12, RAPL is usable via the powercap driver. Since Linux 3.14, RAPL is accessible via the perf performance-monitoring framework as a system-wide performance counter.

Another fundamental difference from conventional performance counters is that RAPL values are updated with an approximate frequency of only 1 kHz [20], while performance counters are updated continuously. The update is not associated with a timestamp, preventing identification of stale values. The discrete updates make it difficult to measure code paths running shorter than or close to the counter’s 1 ms update interval. The number of updates cannot be accurately determined: considering, for example, a piece of code running for 2.5 ms, it makes a large difference in terms of attributed energy whether there were two or three updates during that time.

This is especially visible in time-shared systems where switching between programs happens frequently. In these systems, traditional performance counters are multiplexed by saving and restoring their values on every context switch. Such a technique cannot be trivially applied to RAPL because of the aforementioned update behavior: counter values may be outdated at the point of context switching, leading to significant measurement errors.

Similar to other measurement-based methods mentioned in Section 1.1, the RAPL design cannot measure energy for individual cores or applications. Instead, RAPL accounts the combined energy for all cores in a socket. This makes apportioning energy to an application executing in a multi-processor system with multiple,


Figure 2: Synchronized measurement for short code paths: (a) an unsynchronized measurement window, offset against the code executing between call and return; (b) the window synchronized to RAPL reads.

concurrently running applications non-trivial. In Section 3 we present the design of our scheduler-based measurement service. It ensures that, at any point in time, the cores of one socket are assigned exclusively to programs that should be measured together.

2.3 Measuring Short Code Paths
To address the problem of RAPL’s fixed update intervals, we use a method from our previous work on measuring short code paths [16]. When measuring a short code path, both the start and the end of the measurement may fall between the update points of the RAPL energy counter, as illustrated in Figure 2 (a). The shorter the measured code path (here less than 3 ms), the higher the influence of measurement inaccuracies on the result. The measurement window, indicated by the dotted arrow, is offset against the code execution, delineated by call and return. The offset results in the inclusion of irrelevant code at the start and the omission of relevant code at the end. A solution to this problem is to synchronize the measurement time to counter updates. For the start of the measurement, this is achieved by repeatedly reading the ENERGY_STATUS register until it changes, indicating a RAPL update. Only then is the measured code executed.

Synchronizing the end of the function is not as simple. Just waiting for the next update would skew the measurement, as the result would include the energy consumed while waiting. In our previous work, we propose to fill the time until the next update with a workload of fixed and known energy consumption [16]. Since polling the counter is needed to detect the update, using the polling loop as this defined workload elegantly solves the problem. Listing 1 shows pseudo-code for the algorithm executed when the function of interest terminates. The value of ePerClock is determined in a one-time calibration step performed by measuring the cost of repeatedly reading the RAPL counter over an interval of several updates (less than one second in total). Subtracting the known cost of the loop ensures that the returned value only contains energy consumed by measured code, thus effectively increasing the RAPL resolution and removing the constraint imposed by the update interval. Figure 2 (b) shows how this solution synchronizes the measurement with RAPL updates. The striped part before the measured code may have arbitrary energy consumption; the part at the end is the described polling loop. We call this mechanism short-time RAPL.

    finish_measurement() {
        cycles = rdtscp();
        e_start = RAPLRead();
        while (RAPLRead() == e_start) { }
        cycles = rdtscp() - cycles;
        return RAPLRead() - ePerClock * cycles;
    }

Listing 1: Algorithm for leaving measured code

Figure 3: Misaccounted energy in normal scheduling (process PE measured on core C1 while an unrelated process PO runs on core C2)

2.4 Accounting Individual Processes
Figure 3 illustrates the challenge in accounting energy for individual processes using socket-local measurements. When using RAPL directly to measure a process PE running on core C1 while an unrelated process PO executes in parallel on core C2, the final energy consumption will be incorrect: during the time marked in the graph, the measurement includes the energy consumed by both programs. The example in Section 1.1 shows that time-proportional accounting does not suffice to circumvent this problem, as instruction types and access patterns are not taken into account. More sophisticated performance-counter-based models require extensive calibration, leading to a significant deployment cost. Accordingly, we propose a different approach.

3 Energy Measurement as a Kernel Service
Our solution, E-Team, enables energy accounting on a per-application basis in multi-core systems through specialized scheduling. The main challenge of introducing a measurement infrastructure into a production system is to minimize the accompanying performance impact. The system should run as usual if there are no active measurements, and hardware modifications should not be necessary. However, some overhead may be acceptable during troubleshooting or application development. Even in development scenarios, isolating individual applications that are part of a complex system may be worthwhile to pinpoint energy bugs. Naïvely restricting the system to a single core or running the application of interest exclusively in the system solves the energy-accounting problem but does not resemble production-system behavior.

For the remainder of this paper, we refer to individually executing programs as processes. Processes may be comprised of many execution contexts called threads. Each thread is scheduled as a task by the scheduler and has an assigned task structure in the kernel.

We introduce the concept of teams. A team is an arbitrary group of tasks whose energy is accounted together. Teams get exclusive access to their assigned CPU socket to prevent tasks of different teams from running on the same socket in parallel. This enables the use of the socket-wide energy measurements of RAPL to account the team’s energy consumption. Tasks of a team are scheduled on the team’s socket according to any scheduling scheme. Thereby, energy characteristics caused by interaction between measured tasks are largely preserved and performance degradation is limited.

We call this approach team scheduling. For the remainder of this paper we refer to a team that is measured as a measured team. There exists exactly one team containing all tasks that should not be measured (the non-measured team). To enforce the team-scheduling policy, we added a new scheduler to Linux.

3.1 Team Scheduling
The design of the E-Team scheduler guarantees that no tasks belonging to different teams run on the same socket at the same time. We want to enforce the following properties in our energy measurement service:

Property 1 (Team Interactivity): Teams are interruptible to enable interaction between different teams and maintain system responsiveness.

Property 2 (Task Interactivity): Tasks of a team share the team’s cores fairly to enable task interaction and preserve the team’s energy characteristics.

Property 3 (Accuracy): The scheduler limits switches between teams to curtail measurement errors due to multiplexing and uses short-time RAPL as required.

Property 4 (Non-invasiveness): In the absence of measurements, the system behaves like an unmodified system.

Property 5 (Usability): Starting and stopping measurements is possible at any point in time, either initiated by the user or the program itself. Teams grow and shrink when tasks are created, destroyed, added, or removed.

To account energy for individual processes, we propose to assign sockets exclusively to teams (measured teams or non-measured) in a time-multiplexed fashion. This leads to the main invariant of our scheduler:

Invariant 1: On any socket, only tasks of the same team can run concurrently at any point in time.

Property 2 requires cores to be time-multiplexed between tasks of a team. The team scheduler controls which tasks can run on which socket at any given time, based on policy and team membership. A task scheduler distributes the tasks assigned to a socket between its cores. The team scheduler makes no assumptions about the task-scheduler policy, and the policy can be set per team. This allows using the Completely Fair Scheduler (CFS) as the task scheduler.

The team scheduler manages a list of teams (the team runqueue), where each team consists of tasks called team members. One item in the team runqueue is the non-measured team. Teams are activated and deactivated by the team scheduler only as a whole. To activate a team, the team scheduler first deactivates the currently running team. It then dequeues the new team, affinitizes its tasks to the socket, and notifies the responsible task scheduler to reschedule. Deactivation of a team entails removing all its tasks from the socket and adding the team back to the team runqueue. This design enforces Invariant 1: only tasks of one team are available to the socket-local task scheduler to be scheduled.

When all tasks in the system belong to the same team, team scheduling is reduced to a no-op. This is the case when no measurements are taken, as the non-measured team then contains all tasks. Together with the possibility to run arbitrary scheduling schemes within the task schedulers, this enforces Property 4.

While not implemented by us, the extension of the proposed scheme to multiple sockets is straightforward. A running team can occupy multiple sockets at the same time. The team scheduler may remove or add sockets to a team as necessary. The maximum number of simultaneously active teams is limited by the number of sockets. Having multiple teams active on different sockets does not affect accounting accuracy, as each socket has its own RAPL domains.

The architecture of E-Team can be thought of as core-local scheduling with socket-level coordination.

3.2 Fine-Grained Context Switching
Property 3 is the hardest to enforce. Although the team-scheduling approach guarantees that energy is only accounted for measured tasks, we still need to ensure that processes which execute for short times due to blocking are accounted accurately. To minimize overhead we try to switch teams only every 100 ms or more. This results in an error of about 1 % due to the fixed RAPL counter-update intervals. The team-scheduling frequency is tunable, allowing the system operator to trade overhead against interactivity. More frequent team switches lead to higher system responsiveness at the expense of more overhead from the E-Team mechanism. Irrespective of the configured time-slice length, the measurement has to be stopped when all tasks within a measured team yield the CPU, which is regularly the case with I/O-bound workloads. To avoid measurement errors due to short executions, we employ the short-time RAPL technique introduced in Section 2.3 if less than 50 ms of the time slice are used. Otherwise we read the RAPL counters directly. This allows us to guarantee the accuracy property for all workloads, even interactive and I/O-heavy ones, while limiting the performance impact for compute-intensive workloads and execution phases.

Figure 4: Linux scheduler architecture (the core scheduler invokes the scheduling classes, STOP down to IDLE, which use per-core runqueues)

4 Implementation
We implemented our energy measurement service E-Team as a scheduling class in the Linux kernel. The kernel patch and user tools are available on GitHub1.

4.1 Scheduler Implementation
The general architecture of the Linux scheduling framework is illustrated in Figure 4. It consists of a core scheduler, which invokes the scheduling classes implementing the actual policies. Scheduling classes are sorted by priority. The highest priority is given to the STOP class, the lowest to the IDLE class. Schedulers maintain per-core runqueues, enabling them to make core-local scheduling decisions, which removes one of the bottlenecks in many-core systems. Usually, tasks in Linux are scheduled by the Completely Fair Scheduler (CFS). We prioritize the team scheduler above CFS.

The team scheduler must ensure that only tasks belonging to the same team are assigned to the same socket. The team scheduler maintains a list of teams (the team runqueue), where a team is a pointer to a list of the tasks that comprise the team. We illustrate the team scheduling process in Figure 5. As soon as the team scheduler decides, based on its team-scheduling policy, to switch

1https://github.com/TUD-OS

Figure 5: Scheduling a team for energy measurement (the team scheduler picks a team from the team runqueue (1), distributes its tasks to the cores’ task runqueues (2), and triggers the core-local task schedulers (3))

the team running on a socket, it clears the core-local task runqueues. The team scheduler then picks the new team (step 1) and distributes its tasks to the cores of the socket (step 2) by enqueuing them in the task runqueues of the cores. CPU affinity is respected during this step. The core-local task schedulers are then triggered to reschedule the tasks in their task runqueues according to their scheduling policy (step 3). If there are not enough tasks in a team to occupy all cores of the assigned sockets, the idle task is scheduled on the remaining cores. This causes these cores to enter energy-saving states. When all the tasks in a team have terminated or the team’s time slice is exhausted, the next team is scheduled.

Non-measured tasks are treated as an implicit team which is not managed by the team scheduler. In our implementation, they are not realized as an actual team but as tasks kept in the core-local runqueues of CFS. These tasks are scheduled in between measured teams by yielding to CFS, in order to enforce the interactivity property. How frequently E-Team yields to CFS depends on the number of non-measured tasks and the number of tasks in the measured teams, allowing fairness properties similar to CFS to be enforced.

Time-sliced round-robin with a base time-slice length of 100 ms was chosen as the team-scheduling policy. The base time-slice length is configurable. The actual length of the time slice depends on the number of ready tasks in the teams (i.e., the load). A team with more tasks waiting in its runqueues will get proportionally more time than a less-loaded team. This leads to fair multiplexing of CPU time between the team scheduler and the regular tasks in the system scheduled by CFS. We found a 100 ms base time-slice length to be a good compromise between overhead, accounting accuracy, and system responsiveness. CFS chooses a similar base time-slice length for a system with CPU-intensive load [6, Table 7-2]. Shortening the default time slice can improve responsiveness at the cost of higher overhead for frequent switching and more frequent use of the short-time RAPL mechanism. Please note that the default time-slice length is independent of the timer frequency. For all our experiments the timer


still ticked with a frequency of 1 kHz, thus invoking the scheduler every millisecond. The default time slice is a scheduler parameter that determines the default amount of time each process gets before it is rescheduled. The actual time may be less.

Tasks in the task runqueue are scheduled using time-sliced round-robin, but it would also be possible to use CFS as the task-scheduling policy. Especially when measuring large teams containing many tasks with different priorities, CFS would better preserve the execution characteristics of the unmodified system.

When the E-Team scheduler does not schedule a measured team, it yields to CFS, which then schedules the non-measured tasks as it normally would. This is an advantage of implementing the non-measured team using the normal CFS runqueues. Accordingly, the system performs exactly as if it were unmodified whenever no tasks are in measured teams.

4.2 User-Level Tooling
Teams are formed by assigning a process to the E-Team scheduler. E-Team then automatically adds threads created by the process to the process’s team. Although this simplified grouping was sufficient for our evaluation, it would also be possible to move individual threads to a team, disable the automatic addition of newly created threads, or combine several processes in one team.

The decision to use a specialized scheduler supports the usability property (see Section 3.1). Scheduler assignment is performed by starting the measured program through a tool such as schedtool. Alternatively, our own tool energy can be used, with the added benefit of outputting the energy consumption after program termination (like the Unix time utility does for time). Applications can start and stop measurement at arbitrary points in time by calling sched_setscheduler to move the process between the E-Team scheduler and CFS.

Applications can read their energy consumption from a file in their procfs subdirectory. Procfs provides runtime parameters and statistics of each process. We added two entries, energystat and loopstat, which provide access to the energy consumption and to statistics about the scheduler’s operation (number of time slices executed, short-time RAPL statistics, etc.), respectively. Listing 2 shows example output for the energy data.

Both files can be read during program execution to obtain regularly updated energy and statistics values. The content of the files does not change while the process is not measured: once the process stops being measured by leaving the E-Team scheduling class, the files retain their final values.

root@measure$ cat /proc/100/energystat
package (uJ): 1052792342
dram (uJ): 54277983
core (uJ): 842365434
gpu (uJ): 89234
updates: 303
avg_loop_time (us): 180

Listing 2: Example data provided by E-Team

5 Evaluation
To deliver on our promise of accurate energy accounting, we evaluate our system by first establishing a baseline using an unmodified system and then analyzing the energy and time overhead in three scenarios. We start by measuring a single application running alone on a Linux system. We then add background load and execute two applications in parallel, measuring them individually. Finally, we investigate the influence of short scheduling intervals and the effects of short-time RAPL.

Measurements were performed on a single-socket quad-core Intel® Haswell Core™ i7-4770 machine with 3.4 GHz nominal frequency and 2×4 GiB of DDR3-CL9 RAM clocked at 1333 MHz. We disabled Hyper-Threading and Turbo Boost to make the individual measurements more deterministic and maintain comparability between single-application and multi-application runs. These options could otherwise lead to different behavior based on thread assignment and the decisions of Turbo Boost. We used a Linux 4.2.3 kernel in our experiments. Although we measured energy for all available RAPL domains, we only present PKG energy for brevity. The other domains showed comparable results.

5.1 Baseline and Overhead
Our first measurements establish the baseline for the rest of our evaluation. Baseline measurements were performed on a Linux system stripped down to the minimum necessary to run the benchmarks: we ran the system from an initrd with no system services interfering with execution. We believe the measured energy to conform to the energy consumed by the benchmarks. We use the NAS Parallel Benchmarks (NPB) [1], version 3.1, as benchmarks. Presented data is averaged over 20 consecutive runs. Error bars are not given in graphs if the standard deviation is below 1 %. Figure 6 shows the end-to-end measurement of the benchmarks for wall-clock time, CPU time, and package energy (measured by the PKG counter). The benchmarks were scheduled using CFS. No parts of our kernel modification were active during the runs. Time was determined using the time command, while energy was measured by reading the RAPL MSR at the start and end of the benchmark. We measured each benchmark running with one to four threads.


Figure 6: Baselines for wall-clock time, CPU time and PKG energy for different NPB kernels (bt.A, cg.B, ep.B, ft.B, is.C, lu.A, mg.C, sp.A, ua.A) with one to four threads

We do not include the DC benchmark in our measurements because it mixes computation with extensive I/O. We found that its CPU time deviates significantly (>10 %) from wall-clock time when run as a single application with one thread. The effect increases with the number of threads. This makes an end-to-end measurement meaningless, as too much of the time is spent outside the benchmark. This is one of the cases that cannot be measured reliably without E-Team. We evaluate similar cases in Section 5.4 and give a detailed discussion of DC in Section 6, when comparing against external measurements. For the other benchmarks, wall-clock time matched CPU time for the single-core case, resulting in a usable end-to-end baseline for energy.

Next, we repeated the baseline measurement using E-Team. This measurement and all the following in this section were performed on a normal Arch Linux system that was not stripped down. Ideally, the results obtained from E-Team would show the same wall-clock time and CPU time as the baseline. We also expected slightly lower energy consumption than the baseline, since E-Team does not account energy that is consumed by kernel tasks or by other processes in the system. Figure 7 shows the results of our measurements relative to the baseline. Team scheduling increases the wall-clock time of each benchmark. The more threads the program has, the longer it executes compared to the baseline, since

Figure 7: Wall-clock time, CPU time, and PKG energy measured using team scheduling, compared to baseline.

background load in the system, even if single-threaded, blocks the whole measured program from running on the CPU. This is expected, and the worst-case overhead is approximately 4 %. As we had hoped, CPU time did not increase significantly, which shows that the performance impact of our scheduler is negligible, at less than 1 % in most cases. As long as there are enough tasks in all the teams, the total performance of the system will not suffer. For package energy, our measurements are in the expected range, with a difference relative to the baseline of less than 2 %. For most benchmarks we even measure less consumption, due to the exclusion of unrelated work performed by the system. The measurements prove that our method combines low overhead with high precision.

5.2 Surveying Individual Groups of Tasks

After demonstrating that E-Team performs as well as the end-to-end measurements, we will show that our measurements stay accurate even in the presence of other tasks that are scheduled by the system. We introduce background load by running a single-threaded busy loop concurrently to the NPB suite. An empty busy loop does not touch any data and thus avoids any cache interference that could lead to changes in energy consumption.

Figure 8 shows the results of this experiment. We omit wall-clock time, as it is not a useful metric to compare against in this case. Wall-clock time will increase compared to the baseline in any case due to the introduced background load. For CPU time, we see an overhead of at most 2 %, while energy measurements are slightly below the baseline. We suspect the encountered energy reduction to be an artifact of precision limits of the RAPL counters. The busy loop consumes significantly less energy than the benchmarks and we speculate that internal RAPL state influenced by this low-power activity bleeds into the results for the much more energy-consuming benchmarks. To test this hypothesis, we replaced the busy loop with FIRESTARTER, which consumes more energy than the benchmarks. In this experiment, energy consumption increased relative to the baseline (e.g., 1.6 % for ft.B). This result indicates that inaccuracies within RAPL caused the measurement errors we observed. RAPL counters for DRAM proved less susceptible to this effect.

Figure 8: Per-application CPU time and PKG energy (relative to the baseline) as determined by E-Team with a background application running in parallel.

USENIX Association 2017 USENIX Annual Technical Conference 595

5.3 Multiple Measurements

One feature of our scheduler is that we can extract and measure a single application out of a number of applications running in parallel on the system. To demonstrate this feature we executed all application pairs of the NPB suite (except for DC, due to the lack of a meaningful baseline), measuring only one application of the pair. The results can be seen in Figure 9. We used scheduling slices of 100 ms to limit interference between the benchmarks. We ran this benchmark for one to four threads and compared the results against the baseline.

As a guide to read Figure 9, consider the row with ft.B in the rightmost pane showing four threads in Figure 9b: Selecting the column is.C shows that the measured energy consumption of ft.B, when running concurrently with is.C, is 4 % below the baseline. No statement is made about is.C in this cell. Our worst-case error is 6 % for the DRAM energy (not shown) when running ua.A concurrently with itself. This may be attributed either to measurement errors introduced by our scheduler, errors in the RAPL model (i.e., incorrect energy values), or interference between the programs, despite the long scheduling interval. We will discuss the cause of this divergence in Section 6. Even a 6 % error is still on par with model-based estimation techniques [32, 37, 5].

5.4 Short Scheduling Intervals

Particularly challenging for E-Team are applications that execute in short bursts, blocking in-between execution phases. Interactive GUI or multimedia applications as well as I/O-bound applications are examples that exhibit such behavior. They require rescheduling more often than our default time slice of 100 ms by yielding the CPU. Every time all the threads in the currently running team yield the CPU, we must switch to another team. If the last switch was not at least 50 ms ago, we need to perform short-time RAPL (refer to Section 2.3) to avoid inaccuracies introduced by the time-discrete updates of the RAPL counters. To evaluate the benefits of short-time RAPL for scheduling, we implemented a synthetic, interactive load that executes a busy loop for 5.4 ms, subsequently blocks for 1 s, and then repeats the procedure 50 times. We compare short-time RAPL and naïve, update-oblivious multiplexing of the counter. As baseline we measure a busy loop that occupies the CPU for as long as our synthetic workload (270 ms), but runs uninterrupted. Figure 10 shows the results. The energy measured by E-Team matches the baseline. When using naïve, update-oblivious multiplexing, our measurements exhibit an error of up to 10 %. In contrast, short-time measurements only exhibit an error of 0.2 %. We conclude that E-Team can reliably measure interactive and I/O-intensive tasks that yield the CPU frequently.
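The synthetic interactive load can be sketched as below. The parameters mirror the numbers above (5.4 ms bursts, 1 s blocking, 50 repetitions), but the function itself is our illustration, not the original benchmark code:

```python
import time

def interactive_load(burst_s=0.0054, idle_s=1.0, repeats=50):
    """Spin for a short burst, then block, repeatedly. Each blocking
    phase forces the scheduler to switch teams, which is what makes
    short-time RAPL readings necessary."""
    completed = 0
    for _ in range(repeats):
        end = time.monotonic() + burst_s
        while time.monotonic() < end:
            pass                      # busy phase: occupy the CPU
        time.sleep(idle_s)            # blocking phase: yield the CPU
        completed += 1
    return completed

# With the default parameters the busy phases sum to 270 ms of CPU
# time, matching the uninterrupted baseline described above.
```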

5.5 Practical Scenarios

Virtual machines We used qemu-kvm to run two VMs with Debian Jessie 8.4 64-bit, each given one core and 2 GiB of RAM. One VM was serving files over HTTP, the other was a malicious VM wasting CPU cycles by executing FIRESTARTER. We ran both VMs in parallel on Arch Linux using E-Team. Each VM received 300 s of CPU time. The fileserver used 1034.1 J while the malicious VM used 7013.8 J. Based on this information, a data-center operator could apply appropriate billing or reduce the CPU time allocated to the malicious VM.
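Energy-based billing from such per-VM readings is simple arithmetic; a sketch with a hypothetical per-kWh price (the price is our assumption, not from the paper):

```python
def energy_cost(energy_joules, price_per_kwh=0.30):
    """Convert a measured energy value [J] to a billing amount.
    The per-kWh price is an assumed example."""
    return energy_joules / 3.6e6 * price_per_kwh  # 1 kWh = 3.6e6 J

# Per-VM energies measured by E-Team in the scenario above:
fileserver_cost = energy_cost(1034.1)
malicious_cost = energy_cost(7013.8)
```

With per-VM joule counts available, billing becomes proportional to actual consumption rather than to wall-clock or CPU-time allocation.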

Figure 9: Benchmarks in the rows are measured while running concurrently with the benchmarks in the columns. Shown is the relative difference to the baseline, for (a) CPU time and (b) package energy. We repeated this experiment for thread counts from one to four.

Figure 10: Short-time RAPL and update-oblivious measurements compared to baseline (baseline: 6.31 J, short-time: 6.3 J, naïve: 5.68 J).

Figure 11: Redis throughput vs. ep.B measurement accuracy at various sampling rates.

Single-Core Sampling We show the effectiveness of sampling to reduce overhead for single-threaded workloads. Our setup consists of a single-threaded measured instance of ep.B running in parallel with unmeasured Redis, a popular in-memory database. To reduce the performance impact of isolating ep.B we employ random sampling. Figure 11 shows the throughput of memtier_benchmark at its default configuration. At 10 % sampling we achieve 86.4 % of the baseline performance of Redis, while maintaining 99 % measurement accuracy for ep.B. Like DC from NPB, Redis' I/O-intensive nature complicates determining an energy baseline. We thus omit its energy values.
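With random sampling, only a fraction of the scheduling slices is measured, and the total is estimated by scaling. A minimal sketch of this extrapolation (our illustration; the paper does not spell out the estimator):

```python
import random

def sampled_energy(slice_energies, rate, rng=random.Random(42)):
    """Measure each scheduling slice with probability `rate` and
    extrapolate the total energy from the measured subset."""
    measured = [e for e in slice_energies if rng.random() < rate]
    return sum(measured) / rate  # unbiased estimate of the total

slices = [1.0] * 1000            # 1 J per slice, 1000 slices
estimate = sampled_energy(slices, rate=0.10)
```

The estimator is unbiased, but its variance grows as the sampling rate shrinks, which is the accuracy/overhead trade-off discussed above.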

6 External Validation

When running multiple teams in parallel, as done in Section 5.3, we do not know the cause of any aberrations from the baseline. Causes may be interference between threads, RAPL inaccuracies, or accounting errors in E-Team. To rule out the latter, we verify E-Team results using a secondary, external measurement infrastructure.

6.1 Measurement Setup

To verify the accuracy of our results, we use a sophisticated high-resolution power measurement infrastructure, which has been thoroughly verified [19]. It has been adapted to a two-socket Haswell-EP system with Xeon E5-2690 v3 processors and a total of 256 GiB DDR4-2133 ECC RAM, which we use as the evaluation platform in this section.

We compare the results of RAPL against direct current (DC) measurements at the inputs of each socket's voltage regulators. Both sockets are measured at a sampling rate of 500 kSa/s to track power consumption between scheduling events. Data obtained from the external measurement infrastructure correspond to the sum of PKG and DRAM consumption according to RAPL. Because we measure at the input of the voltage regulators, the external measurements cover some components on the mainboard that are not measured by RAPL. Therefore RAPL reports less power consumption than the external measurements, even if both are perfectly accurate in their own power domain.

Figure 12: ft.B and is.C PKG energy using E-Team and external measurement with 12 threads per application (is.C: 312.54 J vs. 321.89 J; ft.B: 783.42 J vs. 792.34 J).

Figure 13: dc.W measured with E-Team (995.1 J) and externally (1,034.86 J).

The verification is focused on identifying potential systematic inaccuracies introduced by our novel technique. To compare the measured reference against the power domain of RAPL, we apply a model to map between the two. The model is trained on measurements of different workload kernels executed at various thread counts and configurations as described in [4]. Training is performed on a non-modified Linux system using continuous RAPL and reference measurements. Linear regression provides the final slopes and intercepts separately for each socket with R² > 0.999.
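The regression step can be sketched in a few lines of plain Python. This is our re-derivation of ordinary least squares on made-up calibration points, not the authors' training code:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept,
    returning the coefficients and the R^2 goodness of fit."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - slope * x - intercept) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot

# Hypothetical calibration points: (external power [W], RAPL power [W])
slope, intercept, r2 = fit_line([40.0, 60.0, 80.0, 100.0],
                                [30.5, 48.5, 66.5, 84.5])
```

In the paper's setting, one such (slope, intercept) pair is fitted per socket and then used to map every external reading into the RAPL power domain.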

Since the external measurement traces not only contain the power usage of the measured program but also that of other tasks executed in parallel, a post-processing step was necessary to identify the regions in the traces during which the program of interest actually executed. For this purpose, we used an additional trace, generated by the E-Team scheduler, which indicates when each program was scheduled on the processor. We had to synchronize the traces, because they carry timestamps from different clocks. We generated a special energy pattern before and after every measurement to correlate the measurement and scheduler traces.
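One simple way to align two traces recorded against different clocks is to slide a known pattern over the trace and pick the shift with the best correlation. The following is a toy sketch of that idea; the paper does not specify the authors' matching procedure at this level of detail:

```python
def align_offset(pattern, trace):
    """Return the sample offset at which `pattern` (the injected
    energy pattern) correlates best with `trace`, using a plain
    sliding dot product."""
    best_shift, best_score = 0, float("-inf")
    for shift in range(len(trace) - len(pattern) + 1):
        score = sum(p * t for p, t in zip(pattern, trace[shift:]))
        if score > best_score:
            best_shift, best_score = shift, score
    return best_shift

# The power spike of the pattern appears 3 samples into this trace.
offset = align_offset([0.0, 10.0, 0.0], [1.0, 1.0, 1.0, 0.0, 10.0, 0.0])
```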

6.2 Results

To validate our measurements from Section 5, we execute selected benchmarks on the instrumented hardware. We present the case of ft.B running together with is.C, which we already used in Section 5.3, as they exhibit significantly different power usage of 110 W and 80 W per socket, respectively. The results in Figure 12 show that our measurement is very accurate, with an error of 1.1 % for ft.B and 2.9 % for is.C. The 12-core configuration used for the figure represents the worst case for this benchmark. The error decreased with fewer threads. We also examined the DC benchmark, which we were not able to evaluate in Section 5. We measured dc.W running with two threads and compared the external measurement to the E-Team result. Figure 13 shows that even for this I/O-intensive benchmark E-Team's error is only 3.5 %. Over 20 consecutive runs we observed a standard deviation well below 1 % in all cases.

Figure 14: Power characteristics over time (is.C and ft.B as measured by E-Team vs. the system-wide external measurement).

Figure 14 shows that our E-Team implementation accurately tracks energy consumption over time. We ran is.C and ft.B in parallel, each in its own measured team, and read their respective procfs entries repeatedly. We used a time slice of 200 ms. The characteristics of the external measurement match those of the internal one. The dips visible in the power consumption reported by E-Team (e.g., at 18.95 s) are caused by switches between teams or scheduling of the non-measured team. Tasks in the non-measured team yielded after a very short time, leading to short interruptions of the measured teams.
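Differencing the cumulative energy readings obtained by polling the team's procfs entry yields a power-over-time trace like the one in Figure 14. A sketch of the differencing step, with the readings injected as data so the example runs without the E-Team kernel module:

```python
def to_power(samples):
    """Convert cumulative (timestamp [s], energy [J]) readings, as
    obtained by periodically polling a team's procfs entry, into the
    average power [W] of each interval."""
    return [(e1 - e0) / (t1 - t0)
            for (t0, e0), (t1, e1) in zip(samples, samples[1:])]

# Three hypothetical readings taken 0.2 s apart (the time slice above):
trace = to_power([(0.0, 0.0), (0.2, 10.0), (0.4, 30.0)])
```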

7 Limitations

E-Team provides accurate energy accounting for arbitrary groups of threads using socket-wide energy measurements. But this feature comes at a cost: a team always needs exclusive access to the socket. Accordingly, resources remain unused if teams cannot spread across all cores of the socket. The pathological example for this is a team that consists of a single thread. However, E-Team allows on-the-fly starting and stopping of measurements and thereby supports random sampling. This creates a trade-off space between measurement accuracy and performance overhead. In cases where the performance overhead of E-Team is prohibitive, limiting the measurement duration to short sampling intervals allows for acceptable performance at a slight loss of accuracy.

For I/O-bound workloads short-time RAPL is required more often, incurring additional overhead. In the worst case this translates to 1 ms of overhead per team switch. Our experiments with dc.W showed a performance degradation of only 4.7 % in the worst case. We further examined a worst-case scenario for I/O-bound workloads by running grep recursively over the Linux source tree. We identified the extreme case showing 150 % overhead (155.32 s vs. 62.30 s) when flushing the buffer cache before the run. We also measured Redis running memtier_benchmark and achieved 30 % to 80 % of native performance for data sizes of 32 B to 128 kB despite its I/O-intensive nature. To measure such scenarios, we advise the use of random sampling.

8 Related Work

As energy efficiency is a cross-cutting concern, it has been approached from both the hardware and software side. On the hardware side, external measurement methods, such as those proposed by Hönig et al. [18], have improved significantly in sampling speed and accuracy over existing solutions, such as the frequently used Watts-Up power meter [9]. External measurements as data sources integrate well with our method, but introduce the need for additional hardware. Intel's RAPL addresses this problem by providing self-calibrating models [33]. Hackenberg et al. have shown that RAPL produces accurate energy estimates in recent versions [15] and compare various measurement methods [13].

Below the application layer, system architects construct runtimes and scheduling frameworks to model [30], account [29], and control [34] platform energy use. Several methods using performance-counter-based power models [22, 9, 36, 3, 2] exist. They exhibit relative errors in the range of 5 % to 10 % but can, contrary to RAPL, include other components such as disks. However, models require calibration, which has to be performed for each individual CPU. McCullough et al. found variations between individual CPUs of the exact same type to be too large to calibrate based on the CPU model alone and have shown that linear CPU energy models are intrinsically limited in their accuracy [27].

There are various approaches using performance-counter-based models to apportion energy to VMs or applications. Shen et al. investigate Power Containers, which use model-based apportioning of energy to applications [36]. They use external recalibration during runtime, thus relying on additional hardware. Their methods exhibit relative errors of up to 11 % on Sandy Bridge CPUs. Bertran et al. account energy for VMs using a model-based approach and report 5 % relative error [3].

For high performance computing (HPC) systems, Georgiou et al. have presented a SLURM-based job management system, which allows accounting of energy to jobs [12]. Their approach is limited to accounting energy at the per-node level. While suitable for typical HPC systems, it does not cover cloud or data-center scenarios with multiple simultaneous users per machine.

To schedule groups of tasks, Ousterhout introduced co-scheduling [28] and Feitelson et al. presented gang scheduling [10]. Our method builds on these approaches.

9 Conclusion & Future Work

We presented the design and implementation of E-Team, a facility that enables accurate measurement of the energy consumption of individual threads or groups of threads in a system. We isolate groups of interest using team scheduling. This enables us to use a system-wide measurement method, such as Intel's RAPL, while still being able to apportion energy consumption per thread or group of threads. To address the discrete nature of the RAPL readings, we employ short-time measurements to accommodate applications that are interactive or yield the CPU often. We are able to isolate arbitrary parts of a system and apportion their energy with an error of at most 3.5 % compared to external measurements. Our methods provide greater accuracy than many existing model-based approaches and our validation shows that E-Team can apportion energy faithfully. To the best of our knowledge, our implementation is the first to allow practical, high-precision, per-application energy attribution in a multi-core system without relying on manual calibration or external measurement equipment.

Our implementation is applicable to a wide range of devices. E-Team does not rely on RAPL but can use other energy measurement techniques, such as the sensors available on mobile platforms [17] or hand-held devices.

Some ideas of our design are not yet implemented and are left for future work. We did not implement simultaneous execution of different teams on different sockets. The challenge in accounting energy on multiple sockets concurrently is that applications running on one socket can cause energy usage on another socket. Remote memory access is one example of such behavior. We leave the implementation of a cgroup-like interface to future work as well. Such an interface could prove useful to combine threads of multiple applications into one measured team. While we implemented random sampling, a detailed discussion of the performance and accuracy implications is left for future work, due to space constraints.

In summary, our work represents a significant step forward for data-center energy accounting, energy-based billing, and energy profiling of applications in production systems. E-Team provides a cheap, accurate, and easy-to-use solution for on-the-fly energy accounting.

Acknowledgements

This work is supported by the German Research Foundation (DFG) within the CRC 912 - HAEC. The authors would like to thank Mario Bielert for his work on the verification of the external measurement system.


References

[1] BAILEY, D. H., BARSZCZ, E., BARTON, J. T., BROWNING, D. S., CARTER, R. L., DAGUM, L., FATOOHI, R. A., FREDERICKSON, P. O., LASINSKI, T. A., SCHREIBER, R. S., ET AL. The NAS parallel benchmarks. International Journal of High Performance Computing Applications 5, 3 (1991), 63–73.

[2] BASMADJIAN, R., AND DE MEER, H. Evaluating and modeling power consumption of multi-core processors. In Future Energy Systems: Where Energy, Computing and Communication Meet (e-Energy), 2012 Third International Conference on (2012), IEEE, pp. 1–10.

[3] BERTRAN, R., BECERRA, Y., CARRERA, D., BELTRAN, V., GONZALEZ, M., MARTORELL, X., TORRES, J., AND AYGUADE, E. Accurate energy accounting for shared virtualized environments using PMC-based power modeling techniques. In Grid Computing (GRID), 2010 11th IEEE/ACM International Conference on (2010), IEEE, pp. 1–8.

[4] BIELERT, M. Evaluating power estimation techniques: A methodological approach. Master's thesis, Technische Universität Dresden, 2016.

[5] BOSE, P., MARTONOSI, M., AND BROOKS, D. Modeling and analyzing CPU power and performance: Metrics, methods, and abstractions. Tutorial, ACM SIGMETRICS (2001).

[6] BOVET, D., AND CESATI, M. Understanding The Linux Kernel. O'Reilly & Associates Inc., 2005.

[7] COLMANT, M., KURPICZ, M., FELBER, P., HUERTAS, L., ROUVOY, R., AND SOBE, A. Process-level power estimation in VM-based systems. In Proceedings of the Tenth European Conference on Computer Systems (2015), ACM, p. 14.

[8] DAVID, H., GORBATOV, E., HANEBUTTE, U. R., KHANNA, R., AND LE, C. RAPL: Memory power estimation and capping. In Proceedings of the 2010 ACM/IEEE International Symposium on Low-Power Electronics and Design (2010), ISLPED, IEEE, pp. 189–194.

[9] DO, T., RAWSHDEH, S., AND SHI, W. pTop: A process-level power profiling tool. In Proceedings of the 2nd Workshop on Power Aware Computing and Systems (HotPower '09) (2009).

[10] FEITELSON, D. G., AND RUDOLPH, L. Gang scheduling performance benefits for fine-grain synchronization. Journal of Parallel and Distributed Computing 16, 4 (1992), 306–318.

[11] GAUTHAM, A., KORGAONKAR, K., SLPSK, P., BALACHANDRAN, S., AND VEEZHINATHAN, K. The implications of shared data synchronization techniques on multi-core energy efficiency. In Presented as part of the 2012 Workshop on Power-Aware Computing and Systems (2012).

[12] GEORGIOU, Y., CADEAU, T., GLESSER, D., AUBLE, D., JETTE, M., AND HAUTREUX, M. Energy accounting and control with SLURM resource and job management system. In Distributed Computing and Networking. Springer, 2014, pp. 96–118.

[13] HACKENBERG, D., ILSCHE, T., SCHÖNE, R., MOLKA, D., SCHMIDT, M., AND NAGEL, W. E. Power measurement techniques on standard compute nodes: A quantitative comparison. In Performance Analysis of Systems and Software (ISPASS), 2013 IEEE International Symposium on (April 2013), pp. 194–204.

[14] HACKENBERG, D., OLDENBURG, R., MOLKA, D., AND SCHÖNE, R. Introducing FIRESTARTER: A processor stress test utility. In Green Computing Conference (IGCC), 2013 International (2013), IEEE, pp. 1–9.

[15] HACKENBERG, D., SCHÖNE, R., ILSCHE, T., MOLKA, D., SCHUCHART, J., AND GEYER, R. An energy efficiency feature survey of the Intel Haswell processor. In Parallel and Distributed Processing Symposium Workshop (IPDPSW), 2015 IEEE International (2015), IEEE, pp. 896–904.

[16] HÄHNEL, M., DÖBEL, B., VÖLP, M., AND HÄRTIG, H. Measuring energy consumption for short code paths using RAPL. SIGMETRICS Perform. Eval. Rev. 40, 3 (Jan. 2012), 13–17.

[17] HÄHNEL, M., AND HÄRTIG, H. Heterogeneity by the numbers: A study of the ODROID XU+E big.LITTLE platform. In 6th Workshop on Power-Aware Computing and Systems (HotPower 14) (2014).

[18] HÖNIG, T., JANKER, H., EIBEL, C., MIHELIC, O., AND KAPITZA, R. Proactive energy-aware programming with PEEK. In 2014 Conference on Timely Results in Operating Systems (TRIOS 14) (2014).

[19] ILSCHE, T., HACKENBERG, D., GRAUL, S., SCHUCHART, J., AND SCHÖNE, R. Power measurements for compute nodes: Improving sampling rates, granularity and accuracy. In 2015 Sixth International Green and Sustainable Computing Conference (IGSC) (Dec. 2015), pp. 1–8.

[20] INTEL. Intel 64 and IA-32 Architectures Software Developer's Manual Volume 3A, 3B, and 3C: System Programming Guide, Sep 2014. Section 14.9.

[21] JIMENEZ, V., GIOIOSA, R., CAZORLA, F. J., VALERO, M., KURSUN, E., ISCI, C., BUYUKTOSUNOGLU, A., AND BOSE, P. Energy-aware accounting and billing in large-scale computing facilities. IEEE Micro, 3 (2011), 60–71.

[22] KANSAL, A., ZHAO, F., LIU, J., KOTHARI, N., AND BHATTACHARYA, A. A. Virtual machine power metering and provisioning. In Proceedings of the 1st ACM Symposium on Cloud Computing (New York, NY, USA, 2010), SoCC '10, ACM, pp. 39–50.

[23] KONSTANTAKOS, V., CHATZIGEORGIOU, A., NIKOLAIDIS, S., AND LAOPOULOS, T. Energy consumption estimation in embedded systems. Instrumentation and Measurement, IEEE Transactions on 57, 4 (2008), 797–804.

[24] Krapl: Intel RAPL driver exposing the RAPL interface in sysfs. https://github.com/TUD-OS/krapl.

[25] LI, T., AND JOHN, L. K. Run-time modeling and estimation of operating system power consumption. In Proceedings of the 2003 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (New York, NY, USA, 2003), SIGMETRICS '03, ACM, pp. 160–171.

[26] MCCRAW, H., RALPH, J., DANALIS, A., AND DONGARRA, J. Power monitoring with PAPI for extreme scale architectures and dataflow-based programming models. In Proceedings of the 2014 IEEE International Conference on Cluster Computing (2014), CLUSTER, IEEE, pp. 385–391.

[27] MCCULLOUGH, J. C., AGARWAL, Y., CHANDRASHEKAR, J., KUPPUSWAMY, S., SNOEREN, A. C., AND GUPTA, R. K. Evaluating the effectiveness of model-based power characterization. In Proceedings of the 2011 USENIX Conference on USENIX Annual Technical Conference (Berkeley, CA, USA, 2011), USENIXATC'11, USENIX Association, pp. 12–12.

[28] OUSTERHOUT, J. K. Scheduling techniques for concurrent systems. In ICDCS (1982), vol. 82, pp. 22–30.


[29] PATHAK, A., HU, Y. C., AND ZHANG, M. Where is the energy spent inside my app?: Fine grained energy accounting on smartphones with eprof. In Proceedings of the 7th ACM European Conference on Computer Systems (New York, NY, USA, 2012), EuroSys '12, ACM, pp. 29–42.

[30] PATHAK, A., HU, Y. C., ZHANG, M., BAHL, P., AND WANG, Y.-M. Fine-grained power modeling for smartphones using system call tracing. In Proceedings of the 6th ACM European Conference on Computer Systems (2011), EuroSys, ACM, pp. 153–168.

[31] PATHAK, A., JINDAL, A., HU, Y. C., AND MIDKIFF, S. P. What is keeping my phone awake?: Characterizing and detecting no-sleep energy bugs in smartphone apps. In Proceedings of the 10th International Conference on Mobile Systems, Applications, and Services (2012), MobiSys, ACM, pp. 267–280.

[32] RIVOIRE, S., RANGANATHAN, P., AND KOZYRAKIS, C. A comparison of high-level full-system power models. HotPower 8 (2008), 3–3.

[33] ROTEM, E., NAVEH, A., ANANTHAKRISHNAN, A., RAJWAN, D., AND WEISSMANN, E. Power-management architecture of the Intel microarchitecture code-named Sandy Bridge. IEEE Micro 32, 2 (2012), 20–27.

[34] ROY, A., RUMBLE, S. M., STUTSMAN, R., LEVIS, P., MAZIÈRES, D., AND ZELDOVICH, N. Energy management in mobile devices with the Cinder operating system. In Proceedings of the 6th ACM European Conference on Computer Systems (2011), EuroSys, ACM, pp. 139–152.

[35] RYFFEL, S. LEA2P – The Linux energy attribution and accounting platform. Master's thesis, Swiss Federal Institute of Technology (ETH), Zurich, Switzerland (2009).

[36] SHEN, K., SHRIRAMAN, A., DWARKADAS, S., ZHANG, X., AND CHEN, Z. Power containers: An OS facility for fine-grained power and energy management on multicore servers. In Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems (New York, NY, USA, 2013), ASPLOS '13, ACM, pp. 65–76.

[37] SNOWDON, D. C., PETTERS, S. M., AND HEISER, G. Accurate on-line prediction of processor and memory energy usage under voltage scaling. In Proceedings of the 7th ACM & IEEE International Conference on Embedded Software (New York, NY, USA, 2007), EMSOFT '07, ACM, pp. 84–93.

[38] TREIBIG, J., HAGER, G., AND WELLEIN, G. LIKWID: A lightweight performance-oriented tool suite for x86 multicore environments. In Proceedings of the 39th International Conference on Parallel Processing Workshops (2010), ICPPW, IEEE, pp. 207–216.

[39] WANG, W., CAVAZOS, J., AND PORTERFIELD, A. Energy auto-tuning using the polyhedral approach. In Proceedings of the 4th International Workshop on Polyhedral Compilation Techniques, S. Rajopadhye and S. Verdoolaege, Eds., Vienna, Austria (2014).
