
www.bsc.es

Scalable Tools Workshop (Tahoe)

August 2017

Judit Gimenez (judit@bsc.es)

Is it me, or is it the machine?


Parallel efficiency model

Parallel efficiency = LB eff * Comm eff

[Timeline figure: per-rank Computation and Communication phases (MPI_Send / MPI_Recv). The waiting time comes from the computation imbalance, so do not blame MPI. Annotated factors: LB, Comm.]

[Second timeline figure: even with LB = 1, dependencies between MPI_Send / MPI_Recv pairs serialize the execution. Do not blame MPI. Annotated factors: LB, µLB (serialization), Transfer.]

Parallel efficiency = LB eff * Serialization eff * Transfer eff
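A minimal sketch (not the actual Paraver/Dimemas implementation; the function and the sample numbers are made up for illustration) of how these factors can be derived from per-rank useful-computation times:

```python
# Minimal sketch of the multiplicative efficiency model.
# useful[i] = time rank i spends in useful computation (outside MPI)
# elapsed   = elapsed time of the parallel region (same for all ranks)

def efficiencies(useful, elapsed):
    avg = sum(useful) / len(useful)
    mx = max(useful)
    lb_eff = avg / mx              # load balance efficiency
    comm_eff = mx / elapsed        # communication efficiency
    parallel_eff = avg / elapsed   # equals lb_eff * comm_eff
    return lb_eff, comm_eff, parallel_eff

# Hypothetical numbers for 4 ranks, just to show the arithmetic
lb, comm, par = efficiencies([9.0, 8.5, 9.5, 8.0], elapsed=10.0)
print(f"LB {lb:.2%}  Comm {comm:.2%}  Parallel {par:.2%}")
```

LB eff compares the average useful time against the maximum, Comm eff compares the maximum useful time against the elapsed time, and their product is the parallel efficiency.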


My favorite default app for trainings: Lulesh 2.0

Easy to install and does not require large input files

Iterative behaviour, well balanced except for one region with instruction imbalance

Requires a cubic number of MPI ranks

my target = 27 ranks; no nodes/sockets sized 27 :)

– No OpenMP

Expected problem: some extra imbalance due to the unbalanced mapping

[Figure: timeline colored by instructions (low to high) across MPI ranks, highlighting the imbalanced region]


Same code, different behavior

Code               Parallel efficiency (%)   Communication efficiency (%)   Load Balance efficiency (%)
lulesh@machine1    90.55                     99.22                          91.26
lulesh@machine2    69.15                     99.12                          69.76
lulesh@machine3    70.55                     96.56                          73.06
lulesh@machine4    83.68                     95.48                          87.64
lulesh@machine5    90.92                     98.59                          92.20
lulesh@machine6    73.96                     97.56                          75.81
lulesh@machine7    75.48                     88.84                          84.06
lulesh@machine8    77.28                     92.33                          83.70
lulesh@machine9    88.20                     98.45                          89.57
lulesh@machine10   81.26                     91.58                          88.73

Huge variability and worse than expected. Can I explain why?

Warning: higher parallel efficiency does not mean faster!
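As a quick consistency check of the multiplicative model on the table above (values as fractions):

\[
0.9922 \times 0.9126 \approx 0.9055 \quad\text{(machine1)}, \qquad
0.9912 \times 0.6976 \approx 0.6915 \quad\text{(machine2)}
\]

so each row is simply the product of its communication and load-balance factors.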


Same code, same machine… still different behaviors

Playing with both (MPI and configuration) @ C

MPI / configuration   Parallel efficiency (%)   Communication efficiency (%)   Load Balance efficiency (%)
BullMPI / default     84.00                     93.41                          89.35
OpenMPI / default     79.45                     98.35                          80.73
OpenMPI / binding     82.10                     95.08                          86.35
BullMPI / binding     85.15                     96.59                          88.18

Playing with binding @ B

Configuration   Parallel efficiency (%)   Communication efficiency (%)   Load Balance efficiency (%)
default         81.26                     91.58                          88.73
binding         75.10                     97.44                          77.07

Playing with MPI @ A

MPI    Parallel efficiency (%)   Communication efficiency (%)   Load Balance efficiency (%)
IMPI   85.65                     95.09                          90.07
MPT    70.55                     96.56                          73.06
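The mapping and binding a given MPI library actually applies can be captured with a tiny per-rank probe; a minimal sketch assuming mpi4py and a Linux host (the script and its name are illustrative, not from the talk):

```python
# Minimal per-rank "mapping photo": report the host and the CPUs each MPI rank
# is allowed to run on. Assumes mpi4py and a Linux kernel (sched_getaffinity).
import os
import socket
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
host = socket.gethostname()
cpus = sorted(os.sched_getaffinity(0))  # CPU ids this rank is bound to

# Gather everything on rank 0 so the report comes out in rank order
info = comm.gather((rank, host, cpus), root=0)
if rank == 0:
    for r, h, c in info:
        print(f"rank {r:3d} on {h}: cpus {c}")
```

Running it under each MPI / binding combination (e.g. mpirun -np 27 python mapping_photo.py) shows whether ranks alternate between sockets or fill one socket first.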


The expected

Balance between nodes and across sockets

Less frequent than expected!

Parallel eff. 90.55%

Comm 99.22%

LB 91.26%

• Using 2 nodes × 2 sockets

• 3 sockets with 7 ranks, 1 socket with 6 ranks → small time imbalance (the 6 ranks on the less-populated socket have more resources per rank)
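A rough back-of-envelope estimate (an assumption for illustration, not a measurement from the talk): if per-rank time scaled linearly with socket occupancy, the 21 ranks sharing the 7-rank sockets would set the critical path and

\[
\mathrm{LB} \approx \frac{\bar t}{t_{\max}} = \frac{(21\cdot 7 + 6\cdot 6)/27}{7} \approx 0.97
\]

so the uneven placement alone would cost only a few percent; the remaining gap to the measured 91.26% is consistent with the instruction imbalance mentioned earlier.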


The good

27 ranks fit in a single node

Ranks alternate between the two sockets

Small IPC variability in the two main regions

Parallel eff. 90.92%

Comm 98.59%

LB 92.20%


The ugly

Same code! But now there are two trends (one per node)

Slow node: significantly lower IPC in almost all the regions

Parallel eff. 69.15%

Comm 99.12%

LB 69.76%

But all nodes are equal!


The ugly

Clock frequency sanity check

Cluster Name          Cluster 1    Cluster 2    Cluster 3    Cluster 4    Cluster 8
Avg. Duration         347038882    218366724    191517484    76087437     57184253
PAPI_L1_DCM             4253914      6369746      6360328     2195915      1540034
PAPI_L2_DCM             1647293      1984225      1864857      448603       258565
PAPI_L3_TCM              945265      1399329      1390967      160690       111273
PAPI_TOT_INS          881368852    880939731    881303626    275104547    274379604
PAPI_TOT_CYC          900031546    566347103    496687512    197363737    148307597
RESOURCE_STALLS:SB    257407726    154854806    106684134    74990020     38463538
RESOURCE_STALLS:ROB     3191720      3170735      2722761     2684869       562054
RESOURCE_STALLS:RS     43484959     63717139     63017342    38768779     32699838

Insight comes from checking the differences in the hardware counters

Guess: Memory problem – confirmed by sysadmin tests!
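One derived metric that makes the problem visible is IPC (instructions per cycle), computed directly from the table above; clusters 1 and 3 execute essentially the same instructions but need very different cycle counts:

\[
\mathrm{IPC}_{\text{Cluster 1}} = \frac{881368852}{900031546} \approx 0.98, \qquad
\mathrm{IPC}_{\text{Cluster 3}} = \frac{881303626}{496687512} \approx 1.77
\]

Together with the much higher RESOURCE_STALLS:SB of Cluster 1, this points at the memory subsystem rather than the code.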


The unbalanced

Balanced between nodes, but not between sockets:

• balance between nodes: 9 ranks per node

• fill the first socket first: 6 + 3 ranks per node

[Figures: what lulesh does vs. what the machine does running lulesh]

Parallel eff. 79.45%

Comm 98.35%

LB 80.73%


The unbalanced

The two main regions suffer the penalty of the different socket occupancy

Most frequent behaviour!


The bipolar

The MPI library modifies the behavior of the computing phases!

MPT:  Parallel eff. 70.55%   Comm 96.56%   LB 73.06%

IMPI: Parallel eff. 85.65%   Comm 95.09%   LB 90.07%


The bipolar

… because it can select a different mapping / binding

[Figures, MPT vs. IMPI: histogram of useful duration and process-mapping "photo" for each run]


The undecided

Same number of instructions per iteration, but a noisy behavior with respect to time

… this usually corresponds to a system performing unnecessary process migrations

Parallel eff. 77.28%

Comm 92.33%

LB 83.70%
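One way to check that suspicion outside the tracer (hypothetical, not part of the talk) is to sample which core each rank is running on; a minimal sketch assuming psutil on Linux, with made-up PIDs:

```python
# Minimal migration watcher: sample the CPU each target process is running on
# and report when it changes. psutil.Process.cpu_num() is Linux-only.
import time
import psutil

def watch(pids, interval=0.5, samples=20):
    last = {}
    for _ in range(samples):
        for pid in pids:
            cpu = psutil.Process(pid).cpu_num()
            if pid in last and cpu != last[pid]:
                print(f"pid {pid} migrated: cpu {last[pid]} -> {cpu}")
            last[pid] = cpu
        time.sleep(interval)

# Hypothetical usage: pass the PIDs of the lulesh ranks on this node
# watch([12345, 12346])
```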


The braking

But noise is not always caused by migrations

Parallel eff. 75.48%

Comm 88.84%

LB 84.06%

[Figure: per-process timelines grouped by node id]


The braking

Clue: the noise affects all processes within a node

The OS of each node reduces the clock frequency for a while, asynchronously across nodes!
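A hypothetical way to catch this kind of throttling while the job runs (not shown in the talk) is to poll the frequency that Linux cpufreq reports for every core; a minimal sketch:

```python
# Minimal sketch: periodically read the current frequency of every core via
# sysfs (Linux cpufreq) and flag cores running well below the peak seen so far.
import glob
import time

def read_freqs():
    freqs = {}
    for path in glob.glob("/sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_cur_freq"):
        cpu = path.split("/")[5]          # e.g. "cpu17"
        with open(path) as f:
            freqs[cpu] = int(f.read())    # value in kHz
    return freqs

peak = {}
for _ in range(30):
    for cpu, khz in read_freqs().items():
        peak[cpu] = max(peak.get(cpu, 0), khz)
        if khz < 0.8 * peak[cpu]:
            print(f"{cpu} throttled: {khz/1000:.0f} MHz (peak {peak[cpu]/1000:.0f} MHz)")
    time.sleep(1)
```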


Conclusions

As a code developer, do not assume the machine will do a good job running your code just because you did a good job programming your application

As a performance analyst, do not assume where the bottlenecks are; be open-minded and equipped with flexible tools (like Paraver ;)

As Bruce Lee said “Be water my friend!”