
Towards a Performance Tool Interface for OpenMP: An Approach Based on Directive Rewriting

Bernd Mohr, Felix Wolf
Forschungszentrum Jülich
John von Neumann - Institut für Computing
Zentralinstitut für Angewandte Mathematik
52425 Jülich
{b.mohr,f.wolf}@fz-juelich.de

Allen Malony, Sameer Shende
University of Oregon
Department of Computer and Information Science
Eugene, Oregon 97403
{malony,sameer}@cs.uoregon.edu

© 2001 Forschungszentrum Jülich, University of Oregon [2]

Outline

• Introduction

• Proposed OpenMP Performance Tool Interface

• Prototype Implementation

• Examples

• Future Work


Introduction

• Motivation
  • “Standard” OpenMP performance tools interface, similar in spirit to the MPI profiling interface (PMPI)
• Goals
  • Expose OpenMP parallel execution to the performance measurement system
  • Define it at the abstraction level of the OpenMP programming model
  • Make the performance measurement interface portable
    – across different platforms
    – across all OpenMP supported languages
    – across different performance tools
  • Allow flexibility in how the interface is applied


Proposed OpenMP Performance Tool Interface

• POMP
  • OpenMP Directive Instrumentation
  • OpenMP Runtime Library Routine Instrumentation
  • Performance Monitoring Library Control
  • User Code Instrumentation
  • Context Descriptors
  • Conditional Compilation
  • Conditional / Selective Transformations
• Remarks
  • C/C++ OpenMP Pragma Instrumentation
  • Implementation Issues
  • Open Issues


OpenMP Directive Instrumentation

• Insert calls to pomp_NAME_TYPE(d) at appropriate places around directives
  • NAME: name of the OpenMP construct
  • TYPE:
    – fork, join: mark change in parallelism grade
    – enter, exit: flag entering/exiting OpenMP construct
    – begin, end: mark start/end of body of construct
  • d: context descriptor
• Observation of implicit barrier at DO, SECTIONS, WORKSHARE, SINGLE constructs
  • Add NOWAIT to construct
  • Make barrier explicit


Example: !$OMP PARALLEL DO Instrumentation

Original:

    !$OMP PARALLEL DO clauses...
          do loop
    !$OMP END PARALLEL DO

Instrumented (combined directive split into PARALLEL and DO, implicit barrier made explicit):

          call pomp_parallel_fork(d)
    !$OMP PARALLEL other-clauses...
          call pomp_parallel_begin(d)
          call pomp_do_enter(d)
    !$OMP DO schedule-clauses, ordered-clauses, lastprivate-clauses
          do loop
    !$OMP END DO NOWAIT
          call pomp_barrier_enter(d)
    !$OMP BARRIER
          call pomp_barrier_exit(d)
          call pomp_do_exit(d)
          call pomp_parallel_end(d)
    !$OMP END PARALLEL
          call pomp_parallel_join(d)


OpenMP Runtime Library Routine Instrumentation

• Transform
  • omp_###_lock() → pomp_###_lock()
  • omp_###_nest_lock() → pomp_###_nest_lock()
  [ ### = init | destroy | set | unset | test ]
• POMP version
  • Calls omp version internally
  • Can do extra stuff before and after call
• Transformations of other OpenMP API functions necessary?
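
The wrapper idea above can be sketched in C as follows. This is only an illustration, not the actual POMP library: the call counters are hypothetical bookkeeping, and the serial stand-ins for the lock routines are an assumption added so the sketch also builds without OpenMP support.

```c
#include <assert.h>
#include <stddef.h>

#ifdef _OPENMP
#include <omp.h>
#else
/* Serial stand-ins (assumption) so the sketch builds without OpenMP. */
typedef int omp_lock_t;
static void omp_init_lock(omp_lock_t* l)  { *l = 0; }
static void omp_set_lock(omp_lock_t* l)   { *l = 1; }
static void omp_unset_lock(omp_lock_t* l) { *l = 0; }
#endif

/* Hypothetical bookkeeping. Plain counters: not safe when several
   threads work on different locks at the same time. */
static long pomp_set_lock_calls = 0;
static long pomp_unset_lock_calls = 0;

void pomp_init_lock(omp_lock_t* l) {
    assert(l != NULL);
    omp_init_lock(l);
}

void pomp_set_lock(omp_lock_t* l) {
    /* "before" hook: a tool could record a timestamp here */
    omp_set_lock(l);            /* the omp version does the real work */
    /* "after" hook: the lock is now held */
    pomp_set_lock_calls++;
}

void pomp_unset_lock(omp_lock_t* l) {
    pomp_unset_lock_calls++;    /* counted while the lock is still held */
    omp_unset_lock(l);
}
```

After directive rewriting, user code that called omp_set_lock(&l) would call pomp_set_lock(&l) instead, so a measurement library sees every lock operation without changes to the OpenMP runtime.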


Performance Monitoring Library Control

• Give programmer control over performance monitoring at runtime
  • !$OMP INST [ INIT | FINALIZE | ON | OFF ]
• Translated into
  • pomp_init(), pomp_finalize()
  • pomp_on(), pomp_off()
• Ignored in “normal” OpenMP compilation mode
• Alternatives
  • !$POMP?
  • Use conditional compilation with explicit POMP calls
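
One way a monitoring library might honor these four control calls is sketched below. The state flags, the event counter, and pomp_record_event() are hypothetical names introduced for illustration; the slides only specify the four entry points, and the choice that pomp_init() also switches monitoring on is an assumption.

```c
#include <assert.h>

static int  pomp_initialized = 0;  /* hypothetical library state */
static int  pomp_tracing     = 0;
static long pomp_events      = 0;  /* hypothetical: events recorded so far */

/* Assumption: initializing the library also switches monitoring on. */
void pomp_init(void)     { pomp_initialized = 1; pomp_tracing = 1; }
void pomp_finalize(void) { pomp_tracing = 0; pomp_initialized = 0; }

void pomp_on(void) {
    assert(pomp_initialized);      /* pomp_on() before pomp_init() is a bug */
    pomp_tracing = 1;
}

void pomp_off(void) { pomp_tracing = 0; }

/* Hypothetical helper that every pomp_NAME_TYPE() routine would call:
   events are dropped while monitoring is switched off. */
void pomp_record_event(void) {
    if (pomp_tracing)
        pomp_events++;
}
```

With this layout, a program could bracket an uninteresting phase with !$OMP INST OFF and !$OMP INST ON and pay only for a flag test per event inside it.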


User Code Instrumentation

• Compiler / transformation tool should insert
  • pomp_begin(d)
  • pomp_end(d)
  calls at beginning and end of each(?) user function
• Allow user-specified arbitrary (non-function) code regions
  • !$OMP INST BEGIN ( <region name> )
      arbitrary user code
    !$OMP INST END ( <region name> )
• Alternatives
  • !$POMP?
  • Use conditional compilation with explicit POMP calls (descriptor?)


Context Descriptors

• Describe execution contexts through context descriptor

    typedef struct ompregdescr {
      char* name;                      /* construct */
      char* sub_name;                  /* region name */
      int   num_sections;
      char* filename;                  /* src filename */
      int   begin_line1, begin_lineN;  /* begin line # */
      int   end_line1, end_lineN;      /* end line # */
      WORD  data[4];                   /* perf. data */
      struct ompregdescr* next;
    } OMPRegDescr;

• Generate context descriptors in global static memory:

    OMPRegDescr rd42675 = { "critical", "phase1", 0, "foo.c", 5, 5, 13, 13 };

• Pass address to POMP functions
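
A minimal, self-contained sketch of how a POMP routine might use such a descriptor is given below. Several parts are assumptions rather than the authors' code: WORD is defined here as void* because the slides leave it undefined, the string fields are declared as const char* so the struct is valid C, and RegionStats with its fixed pool is a hypothetical stand-in for tool-specific data. The sketch is not thread-safe.

```c
#include <assert.h>
#include <stddef.h>

typedef void* WORD;  /* assumption: one pointer-sized slot per entry */

typedef struct ompregdescr {
    const char* name;               /* construct */
    const char* sub_name;           /* region name */
    int num_sections;
    const char* filename;           /* src filename */
    int begin_line1, begin_lineN;   /* begin line # */
    int end_line1, end_lineN;       /* end line # */
    WORD data[4];                   /* perf. data, zero-initialized */
    struct ompregdescr* next;
} OMPRegDescr;

/* Descriptor as the instrumenter would emit it: filled in at compile
   time, so no allocation is needed when the region is entered. */
static OMPRegDescr rd42675 = { "critical", "phase1", 0, "foo.c", 5, 5, 13, 13 };

/* Hypothetical tool-side data attached lazily through data[0]. */
typedef struct { long enters; } RegionStats;
static RegionStats stats_pool[16];
static size_t stats_used = 0;

void pomp_critical_enter(OMPRegDescr* d) {
    assert(d != NULL);
    RegionStats* s = (RegionStats*)d->data[0];
    if (s == NULL) {                /* first visit: attach tool data */
        assert(stats_used < 16);
        s = &stats_pool[stats_used++];
        d->data[0] = s;
    }
    s->enters++;
}

long region_enters(const OMPRegDescr* d) {
    const RegionStats* s = (const RegionStats*)d->data[0];
    return s ? s->enters : 0;
}
```

Instrumented code would then call pomp_critical_enter(&rd42675): a single pointer argument gives the library the construct name, the region name, and the source location.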


Conditional Compilation

• C, C++, [Fortran, if supported]

    #ifdef _POMP
      arbitrary user code
    #endif

• Fortran Free Form

    !P$ arbitrary user code

• Fortran Fixed Form

    CP$ arbitrary
    *P$ user
    !P$ code

• Usual restrictions apply
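
For the C case, a small sketch of the guard in use follows. It assumes _POMP is defined only in instrumented builds; the function sum_to() and the diagnostic message are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Sums 1..n. The guarded line is compiled only when _POMP is defined,
   so a normal build carries no trace of the measurement code. */
long sum_to(long n) {
    assert(n >= 0);
    long sum = 0;
#ifdef _POMP
    fprintf(stderr, "sum_to(%ld): instrumented build\n", n);
#endif
    for (long i = 1; i <= n; i++)
        sum += i;
    return sum;
}
```

The result is the same in both build modes; only the side channel to the measurement system differs.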


Conditional / Selective Transformations

• (Temporarily) disable / re-enable POMP instrumentation at compile time
  • !$OMP NOINSTRUMENT
  • !$OMP INSTRUMENT
• Alternative:
  • !$POMP?


C/C++ OpenMP Pragma Instrumentation

• No END pragmas
  • instrumentation for “closing” part follows structured block
  • adding nowait has to be done in the “opening part”

    #pragma omp XXX
    {
      pomp_###_begin(d);
      structured block;
      pomp_###_end(d);
    }

• Simple differences in language
  • no “call” keyword
  • “;”
  • !$OMP → #pragma omp


Example: #pragma omp sections Instrumentation

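Original:

    #pragma omp sections
    {
      #pragma omp section
        structured block;
      #pragma omp section
        structured block;
    }

Instrumented (nowait added, implicit barrier made explicit):

    pomp_sections_enter(d);
    #pragma omp sections nowait
    {
      #pragma omp section
      { pomp_section_begin(d);
        structured block;
        pomp_section_end(d); }
      #pragma omp section
      { pomp_section_begin(d);
        structured block;
        pomp_section_end(d); }
    }
    pomp_barrier_enter(d);
    #pragma omp barrier
    pomp_barrier_exit(d);
    pomp_sections_exit(d);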
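
The same rewriting pattern, applied to a `for` loop, can be written as compilable C. This is a sketch under stated assumptions: the four pomp_* routines are hypothetical stand-ins that only count calls, they take a void* instead of a real descriptor, and the counters are plain ints, so the function is meant to be called from serial code (where the orphaned worksharing construct simply runs on one thread).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for a POMP library: they only count calls. */
static int n_enter, n_exit, n_barrier_enter, n_barrier_exit;
static void pomp_for_enter(void* d)     { (void)d; n_enter++; }
static void pomp_for_exit(void* d)      { (void)d; n_exit++; }
static void pomp_barrier_enter(void* d) { (void)d; n_barrier_enter++; }
static void pomp_barrier_exit(void* d)  { (void)d; n_barrier_exit++; }

/* What an instrumenter could produce from
       #pragma omp for
       for (...) out[i] = 2 * in[i];
   The implicit barrier is removed with nowait and replaced by an
   explicit one, so the time spent waiting in it becomes observable. */
void scale_instrumented(const int* in, int* out, int n) {
    assert(in != NULL && out != NULL);
    pomp_for_enter(NULL);
    #pragma omp for nowait
    for (int i = 0; i < n; i++)
        out[i] = 2 * in[i];
    pomp_barrier_enter(NULL);
    #pragma omp barrier
    pomp_barrier_exit(NULL);
    pomp_for_exit(NULL);
}
```

Without OpenMP support the pragmas are ignored and the function still computes the same result, which is what makes source-level rewriting portable across compilers.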


Implementation Issues

• pomp_NAME_TYPE(d) more efficient / simpler than pomp_event(POMP_TYPE, POMP_NAME, fname, line#, ...)
  • Inlining of POMP calls possible
• Context descriptors
  • Full context information available, incl. source reference
  • But minimal runtime overhead
    – just one argument needs to be passed
    – no need to dynamically allocate memory for data!!
    – context data initialization at compile time
  • Context data is kept together with executable
  • Allows for separate compilation
• Potentially too much overhead for ATOMIC, CRITICAL, MASTER, SINGLE, and OpenMP lock calls
  • --pomp-disable=construct-list


Open Issues

• ORDERED?
• FLUSH?
• Instrumentation of PARALLEL DO / FOR loop iterations
  • Potentially allows measurement of influence of loop scheduling policies
  • Overhead??
• Allow passing additional user information to POMP library
  • Conditional compilation
  • Extra parameter to !$OMP INST BEGIN/END
  • ...
• Specification of extent of user code instrumentation
  • Additional pragmas/directives?
  • Separate (outside source code) specification?
• OpenMP Runtime Instrumentation necessary?


Prototype Implementation: OPARI

• OpenMP Pragma And Region Instrumentor (OPARI)
• Source-to-Source translator to insert POMP calls around OpenMP constructs and API functions
• Supports
  • Fortran77 and Fortran90, OpenMP 2.0
  • C and C++, OpenMP 1.0
  • Runtime Library Control (init, finalize, on, off)
  • (Manual) User Code Instrumentation (begin, end)
  • Conditional Compilation (#ifdef _POMP, !P$)
  • Conditional / Selective Transformation ([no]instrument)
• Preserves source code information (#line line file)
• ~ 2000 lines of C++ code


OPARI

• Limitations
  • Fortran:
    – END DO and END PARALLEL DO directives required
    – atomic expression on line by itself
  • C/C++:
    – structured blocks: simple expression statement or block (compound statement)
    – Exception: for statement after parallel for
  • Could be fixed by enhancing OPARI’s parsing capabilities
• Source code and documentation available at http://www.fz-juelich.de/zam/kojak/opari/


Prototype Implementation: POMP Library

• EXtensible PERformance Tool (EXPERT)
  • Automatic event trace analyzer
  • http://www.fz-juelich.de/zam/kojak/expert/
• Tuning and Analysis Utilities (TAU)
  • Performance analysis framework
  • http://www.acl.lanl.gov/tau/
• Required ~ 1 day to implement tool specific POMP libraries


Prototype Implementation: EXPERT POMP Library

    void pomp_for_enter(OMPRegDescr* r) {
      /* Get EPILOG region descriptor stored in r */
      ElgRegion* e = (ElgRegion*)(r->data[0]);

      /* If not yet there, initialize and store it */
      if (! e) e = ElgRegion_Init(r);

      /* Record enter event */
      elg_enter(e->rid);
    }

    void pomp_for_exit(OMPRegDescr* r) {
      /* Record collective exit event */
      elg_omp_collexit();
    }


Prototype Implementation: TAU POMP Library

    TAU_GLOBAL_TIMER(tfor, "for enter/exit", "[OpenMP]", OpenMP);

    void pomp_for_enter(OMPRegDescr* r) {
    #ifdef TAU_AGGREGATE_OPENMP_TIMINGS
      TAU_GLOBAL_TIMER_START(tfor);
    #endif
    #ifdef TAU_OPENMP_REGION_VIEW
      TauStartOpenMPRegionTimer();
    #endif
    }

    void pomp_for_exit(OMPRegDescr* r) {
      ...
    }


Examples

• EXPERT
  • REMO: Weather Forecast
  • DKRZ Germany
  • MPI + OpenMP (experimental)
• TAU
  • Stommel: Ocean Circulation Simulation
  • SDSC
  • MPI + OpenMP
  • event trace based Vampir
  • profile based RACY


Future Work

• Measure typical POMP calling overhead
  • EPCC OpenMP Microbenchmarks?
• Investigate “formal” standardization with OpenMP forum [OpenMP Supplemental Standard?]
• OpenMP programmers:
  – What do you expect from an OpenMP performance tool?
• Tool developers:
  – Download and try out OPARI
  – Implement POMP interface for your tool
  – Tell us about problems, comments, enhancements
• OpenMP ARB members:
  – What do we need to do next?


Conclusion

• POMP OpenMP Performance Tool Interface
  • Portable
  • Flexible
  • Efficient
  • Defined at the abstraction level of the OpenMP programming model
  • Standard?
• Prototype Software
  • OpenMP Pragma And Region Instrumentor (OPARI)
    http://www.fz-juelich.de/zam/kojak/opari/
  • Tuning and Analysis Utilities (TAU)
    http://www.acl.lanl.gov/tau/


!$OMP PARALLEL Instrumentation

          call pomp_parallel_fork(d)
    !$OMP PARALLEL
          call pomp_parallel_begin(d)
          structured block
          call pomp_barrier_enter(d)
    !$OMP BARRIER
          call pomp_barrier_exit(d)
          call pomp_parallel_end(d)
    !$OMP END PARALLEL
          call pomp_parallel_join(d)


!$OMP DO Instrumentation

          call pomp_do_enter(d)
    !$OMP DO
          do loop
    !$OMP END DO NOWAIT
          call pomp_barrier_enter(d)
    !$OMP BARRIER
          call pomp_barrier_exit(d)
          call pomp_do_exit(d)


!$OMP WORKSHARE Instrumentation

          call pomp_workshare_enter(d)
    !$OMP WORKSHARE
          structured block
    !$OMP END WORKSHARE NOWAIT
          call pomp_barrier_enter(d)
    !$OMP BARRIER
          call pomp_barrier_exit(d)
          call pomp_workshare_exit(d)


!$OMP SECTIONS Instrumentation

          call pomp_sections_enter(d)
    !$OMP SECTIONS
    !$OMP SECTION
          call pomp_section_begin(d)
          structured block
          call pomp_section_end(d)
    !$OMP SECTION
          call pomp_section_begin(d)
          structured block
          call pomp_section_end(d)
    !$OMP END SECTIONS NOWAIT
          call pomp_barrier_enter(d)
    !$OMP BARRIER
          call pomp_barrier_exit(d)
          call pomp_sections_exit(d)


Synchronization Constructs Instrumentation 1

          call pomp_single_enter(d)
    !$OMP SINGLE
          call pomp_single_begin(d)
          structured block
          call pomp_single_end(d)
    !$OMP END SINGLE NOWAIT
          call pomp_barrier_enter(d)
    !$OMP BARRIER
          call pomp_barrier_exit(d)
          call pomp_single_exit(d)

    !$OMP MASTER
          call pomp_master_begin(d)
          structured block
          call pomp_master_end(d)
    !$OMP END MASTER


Synchronization Constructs Instrumentation 2

          call pomp_critical_enter(d)
    !$OMP CRITICAL
          call pomp_critical_begin(d)
          structured block
          call pomp_critical_end(d)
    !$OMP END CRITICAL
          call pomp_critical_exit(d)

          call pomp_barrier_enter(d)
    !$OMP BARRIER
          call pomp_barrier_exit(d)

          call pomp_atomic_enter(d)
    !$OMP ATOMIC
          atomic expression
          call pomp_atomic_exit(d)


Automatic Analysis

• EXtensible PERformance Tool (EXPERT)
  • programmable, extensible, flexible performance property specification
  • based on event patterns
  • analyzes along three hierarchical dimensions
    – performance properties (general → specific)
    – dynamic call tree position
    – location (machine → node → process → thread)
• Done: fully functional demonstration prototype
• Work in Progress:
  – optimization / generalization
  – more performance properties
  – source code and time line displays


Expert Result Presentation

• Interconnected weighted tree browser
  • scalable, still accurate
• Each node has weight
  • Percentage of CPU allocation time
  • i.e. time spent in subtree of call tree
• Displayed weight depends on state of node
  • Collapsed (including weight of descendants)
  • Expanded (without weight of descendants)
• Displayed using
  • Color: allows hot spots (bottlenecks) to be identified easily
  • Numerical value: detailed comparison

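Example: collapsed, the node is displayed as “100 main”; expanded, it is displayed as “10 main” with children “60 bar” and “30 foo”.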
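
The collapsed/expanded rule can be sketched in C with the numbers from this slide (main 10, bar 60, foo 30, summing to 100 when collapsed). The Node type and both function names are hypothetical and are not taken from EXPERT.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tree node; weights are percentages of total time. */
typedef struct node {
    const char*  name;
    int          own;       /* time spent in this node itself */
    struct node* child;     /* first child, or NULL */
    struct node* sibling;   /* next sibling, or NULL */
    int          expanded;  /* 1 if the node is shown expanded */
} Node;

/* Weight of the node plus all of its descendants. */
int subtree_weight(const Node* n) {
    assert(n != NULL);
    int w = n->own;
    for (const Node* c = n->child; c != NULL; c = c->sibling)
        w += subtree_weight(c);
    return w;
}

/* Collapsed nodes show the whole subtree, expanded nodes only
   their own share, so visible numbers never count time twice. */
int displayed_weight(const Node* n) {
    return n->expanded ? n->own : subtree_weight(n);
}
```

Because an expanded node hands its descendants' weight over to the now-visible children, the numbers shown on screen always add up to the same total regardless of how far the tree is unfolded.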


Performance Properties View

[ Screenshot annotations: “Main Problem: Idle Threads”; “Fine: User code”; “Fine: OpenMP + MPI” (two callouts) ]


Dynamic Call Tree View

1st Optimization Opportunity

2nd Optimization Opportunity

3rd Optimization Opportunity


Locations View

• Supports locations up to Grid scale
• Easily allows exploration of load balance problems on different levels
• [ Of course, Idle Thread Problem only applies to slave threads ]