Energy And Power Characterization of Parallel Programs Running on the Intel Xeon Phi

1

ENERGY AND POWER CHARACTERIZATION OF PARALLEL PROGRAMS RUNNING ON THE INTEL XEON PHI

JOAL WOOD, ZILIANG ZONG, QIJUN GU, RONG GE

EMAIL: {JW1772, ZILIANG, QIJUN}@TXSTATE.EDU, [email protected]

2

THE XEON PHI COPROCESSOR

Equipped with 60 x86-based cores, each capable of running 4 threads simultaneously.

Designed for high computation density.

Used in both Tianhe-2 and Stampede supercomputers.

3

OVERVIEW OF OUR WORK

• We profile the power and energy of multiple algorithms with contrasting workloads.

• We concentrate on the performance and energy impact of increasing the number of threads, running code in native versus offloaded mode, and co-running selected algorithms on the Xeon Phi.

• We describe how to correctly profile the instantaneous power of the Xeon Phi using the built-in power sensors.

4

XEON PHI POWER DATA

• Power data is collected using the MICAccessAPI - a C/C++ library that allows users to monitor and configure several metrics (including power) of the coprocessor.

• The power results that we present are measured and recorded by issuing the MicGetPowerUsage() call to the MICAccessAPI during execution of each experiment.
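The sampling pattern described above can be sketched as a simple polling loop. This is a minimal illustration in Python, not the authors' measurement code: `read_power` is a hypothetical stand-in for whatever wrapper issues the MicGetPowerUsage() call, and the duration and interval values are assumptions. The point of the sketch is that each reading is stored together with the time at which it was taken.

```python
import time


def sample_power(read_power, duration_s, interval_s):
    """Poll a power reader and return (elapsed_seconds, watts) pairs.

    read_power is a hypothetical callable standing in for a wrapper
    around the real sensor query; it is not part of the MICAccessAPI.
    """
    samples = []
    start = time.monotonic()
    while True:
        elapsed = time.monotonic() - start
        if elapsed > duration_s:
            break
        # Store the timestamp with the reading; polling is never
        # perfectly periodic, so the index alone is not a time axis.
        samples.append((elapsed, read_power()))
        time.sleep(interval_s)
    return samples


# Usage with a fake reader that always reports 100 W (assumed value).
trace = sample_power(lambda: 100.0, duration_s=0.05, interval_s=0.01)
```

Keeping the elapsed time next to each reading is what later allows the trace to be plotted against real time rather than sample number.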

5

SELECTED ALGORITHMS

• Barnes-Hut simulation – O(n log n) n-body approximation.

• Shellsort – comparison based exchange/insertion sort.

• SSSP – Single Source Shortest Path (Dijkstra’s algorithm) graph searching.

• Fibonacci – calculates 45 Fibonacci numbers.

6

POWER TRACING

• Graphing the instantaneous power of these algorithms allows us to confirm much of what can be inferred about the performance and energy from the implementation.

• It can help us identify features of different applications that aren’t otherwise obvious, and facilitate new findings.

7

BARNES-HUT

• Designed to solve the n-body simulation problem by approximating the forces acting on each body.

• Uses an octree data structure to achieve a time complexity of O(n log n).

• Memory access and control flow patterns are irregular, since different parts of the octree must be traversed to compute forces from each body.

• Balanced workload, as each thread will perform the same amount of force calculation per iteration.

8

SHELLSORT

• Comparison based in-place sorting algorithm.

• Starts by sorting elements far from each other, reducing the gap between them.

• Workload gradually reduces because fewer swaps occur as the data set becomes relatively sorted.
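The gap-reducing behaviour described above can be illustrated with a short serial sketch. It is not the parallel implementation used in the experiments: the halving gap sequence and the per-pass shift counter are assumptions added here so that the amount of work in each pass can be inspected.

```python
def shellsort(values):
    """Sort a copy of values and report element shifts per gap pass.

    Serial sketch using the simple n/2, n/4, ..., 1 gap sequence
    (an assumption; the gap sequence in the experiments is not stated).
    """
    data = list(values)
    shifts_per_gap = []
    gap = len(data) // 2
    while gap > 0:
        shifts = 0
        # Gapped insertion sort: compare elements that are 'gap' apart.
        for i in range(gap, len(data)):
            item = data[i]
            j = i
            while j >= gap and data[j - gap] > item:
                data[j] = data[j - gap]
                j -= gap
                shifts += 1
            data[j] = item
        shifts_per_gap.append((gap, shifts))
        gap //= 2
    return data, shifts_per_gap


# Usage on a small, hypothetical input.
sorted_data, work = shellsort([9, 3, 7, 1, 8, 2, 6, 4])
```

Inspecting `work` for a given input shows how many shifts each pass performed, which is one way to observe how the workload changes as the data becomes more ordered.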

9

SSSP

• Returns the distance between 2 chosen nodes of the input graph.

• Amount of parallelism changes throughout execution.

• Unbalanced workload, as each thread is given a different number of neighbor nodes for which to compute the distance.
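As a point of reference for the bullets above, the following is a minimal serial sketch of Dijkstra's algorithm that returns the distance between two chosen nodes. It is not the parallel code that was profiled; the adjacency-list representation and the example graph are assumptions made for illustration.

```python
import heapq


def shortest_distance(graph, source, target):
    """Return the shortest-path distance from source to target.

    graph maps each node to a list of (neighbor, weight) pairs with
    non-negative weights. Returns float('inf') if target is unreachable.
    """
    best = {source: 0}
    heap = [(0, source)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == target:
            return dist
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry, a shorter path was found already
        # Nodes have different numbers of neighbors, so the work done
        # per node varies; this is the source of the imbalance noted above.
        for neighbor, weight in graph.get(node, []):
            candidate = dist + weight
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return float("inf")


# Hypothetical example graph.
example_graph = {
    "a": [("b", 4), ("c", 1)],
    "b": [("d", 1)],
    "c": [("b", 2), ("d", 7)],
    "d": [],
}
distance_a_to_d = shortest_distance(example_graph, "a", "d")
```

In this example the shortest route from "a" to "d" goes through "c" and "b", for a total distance of 4.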

10

FIBONACCI

• Calculates 45 Fibonacci sequence numbers.

• Each sequence position is assigned to a thread, which calculates the corresponding number.

• Highly unbalanced workload, as threads assigned to larger Fibonacci numbers (positions 45 and 46) require much more work.

• Changing the OMP_WAIT_POLICY environment variable seemed to have no influence on the power trace of Fibonacci.
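The imbalance described above can be made concrete with a small sketch. It assumes each thread computes its number with the naive doubly recursive definition, which the slides do not state, and counts how many function calls that recursion would make for a given position without actually performing them.

```python
def recursive_call_count(position):
    """Number of calls made by naive recursive fib(position).

    Assumes fib(0) and fib(1) are base cases costing one call each and
    fib(n) = fib(n - 1) + fib(n - 2) otherwise. The count is built up
    iteratively so that large positions are cheap to evaluate.
    """
    if position < 0:
        raise ValueError("position must be non-negative")
    prev, curr = 1, 1  # call counts for positions 0 and 1
    for _ in range(2, position + 1):
        prev, curr = curr, 1 + curr + prev
    return curr if position >= 1 else prev


# Compare the work of a thread given a small position with one given
# a large position (positions chosen for illustration).
small_work = recursive_call_count(10)
large_work = recursive_call_count(45)
imbalance_ratio = large_work / small_work
```

Under this assumption the call count grows exponentially with the position, so the threads holding the last few positions dominate the total run time while the others finish early.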

11

CORRECTLY PLOTTING THE INSTANTANEOUS POWER DATA

CORRECT POWER TRACE – X AXIS AS TIMESTAMP

INCORRECT POWER TRACE – X AXIS INCREMENTING BY SAMPLE NUMBER
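The difference between the two traces also matters when power is integrated into energy. The sketch below applies the trapezoidal rule to the same power readings twice: once with the recorded timestamps and once assuming every sample is evenly spaced. The sample values and the 0.1 s nominal period are hypothetical, chosen only to show how the two results diverge when sampling is uneven.

```python
def energy_joules(timestamps_s, power_w):
    """Integrate power (watts) over time (seconds), trapezoidal rule."""
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power must have equal length")
    total = 0.0
    for i in range(1, len(power_w)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


# Hypothetical trace with unevenly spaced samples.
timestamps = [0.0, 0.1, 0.2, 0.5, 1.0]
power = [100.0, 100.0, 200.0, 200.0, 100.0]

# Correct: use the recorded timestamps as the time axis.
energy_by_timestamp = energy_joules(timestamps, power)

# Incorrect: treat the sample number as time, assuming a fixed 0.1 s period.
assumed_times = [i * 0.1 for i in range(len(power))]
energy_by_sample_number = energy_joules(assumed_times, power)
```

With these values the timestamp-based integral is about 160 J, while the sample-number version gives about 60 J, because the long gaps between later samples are silently compressed.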

12

NATIVE VS. OFFLOADED EXECUTION

• The Xeon Phi offers native and offloaded execution modes. During native execution, the program runs entirely on the coprocessor.

• Building a native application is a fast way to get existing software running with minimal code changes.

• Offloaded mode is a heterogeneous programming model where developers can designate specific code sections to run on the Xeon Phi.

• For our experiments, we offload the entire execution onto the Xeon Phi.

13

OFFLOADED SSSP

The energy consumption is slightly higher for offloaded mode compared to native mode across each number of threads. This is because the performance is consistently slightly worse than native execution.

However, the performance and energy deficit grows smaller as more threads are used.

Intuitively, offloading to the Xeon Phi with a high number of threads (120, 240) implies energy savings, assuming the host CPU is utilized.

14

OFFLOADED SHELLSORT

Offloaded shellsort reveals a much higher performance and energy deficit compared to its native version.

Native shellsort consistently performs 3-4X faster than the offloaded version.

These results show great benefit in terms of performance and energy when running codes in native mode.

Based on these results, codes that do not perform extensive I/O operations and that require only a modest memory footprint should generally be executed in native mode.

15

CO-RUNNING PROGRAMS

• The Xeon Phi contains 60 physical cores and is capable of high computation density. We explore the viability of co-running complementary workloads on the Xeon Phi.

• We chose the Fibonacci calculation code as the ideal co-runner, as it performs best with lower thread counts.

• We are mostly interested in revealing whether co-running these codes will incur significant performance and energy losses.

16

BARNES-HUT & FIBONACCI CO-RUN

These codes are able to co-run well because Barnes-Hut is a very balanced workload and benefits from using more threads.

Fibonacci actually declines in performance as a large number of threads are used. This allows us to give as many threads as possible to Barnes-Hut while leaving a small thread pool to execute Fibonacci.

17

SSSP & FIBONACCI CO-RUN

Fibonacci is an example of a workload that will co-run well when paired with other programs with a high degree of parallelism.

SSSP is still a good candidate to co-run with Fibonacci. It yields results similar to those of co-running Barnes-Hut.

Assuming memory contention is low, each of these co-running programs will return with little performance cost.

18

CONCLUSIONS

• The power trace generated from the built-in power sensors of the Xeon Phi can accurately capture the run-time program behavior.

• Running code in native mode yields better performance and consumes less energy compared to offload mode.

• Co-running programs with complementary workloads has the potential to conserve energy with negligible performance degradation.

19

FUTURE WORK

• We need to investigate the heterogeneous power and energy implications of offloading work to the Xeon Phi from the host CPU. (Currently, we exclusively look at data from the Xeon Phi.)

• Compare the performance and energy of these algorithms with corresponding CPU and GPU implementations.

20

ACKNOWLEDGEMENT

• The work reported in this paper is supported by the U.S. National Science Foundation under Grants CNS-1305359, CNS-1305382, CNS-1212535, and a grant from the Texas State University Research Enhancement Program.
