Mike Noeth, Frank Mueller (North Carolina State University)
Martin Schulz, Bronis R. de Supinski (Lawrence Livermore National Laboratory)
Scalable Compression and Replay of Communication Traces in Massively Parallel Environments
2
Outline
- Introduction
- Intra-Node Compression Framework
- Inter-Node Compression Framework
- Replay Mechanism
- Experimental Results
- Conclusion and Future Work
3
Introduction
Contemporary HPC systems
- Size > 1000 processors
- For example, IBM Blue Gene/L: 64K processors
Challenges on HPC systems (large-scale scientific applications)
- Communication scaling (MPI)
- Communication analysis
- Task mapping
Procurements also require performance prediction of future systems
4
Communication Analysis
Existing approaches and their shortcomings:
- Source code analysis
  + Does not require machine time
  - Wrong abstraction level: source code is often complicated
  - No dynamic information
- Lightweight statistical analysis (mpiP)
  + Low instrumentation cost
  - Less information available (i.e., aggregate metrics only)
- Full (lossless) trace capture (Vampir, VNG)
  + Full trace available for offline analysis
  - Traces generated per task: not scalable
  - Traces gathered only on a subset of nodes (viz/I/O nodes)
  - Requires a cluster for visualization
5
Our Approach
Trace-driven approach to analyze MPI communication
Goals
- Extract the entire communication trace
- Maintain structure
- Full replay (independent of the original application)
- Scalable
- Lossless
- Rapid instrumentation
- MPI implementation independence
6
Design Overview
Two parts:
Recording traces
- Use the MPI profiling layer
- Compress at the task level
- Compress across all nodes
Replaying traces
7
Outline
Introduction Intra-Node Compression Framework Inter-Node Compression Framework Replay Mechanism Experimental Results Conclusion and Future Work
8
Intra-Node Compression Framework
Umpire [SC'00] wrapper generator for the MPI profiling layer
- Initialization wrapper
- Tracing wrapper
- Termination wrapper
Intra-node compression of MPI calls
- Provides load scalability
Interoperability with the cross-node framework
Event aggregation
- Special handling of MPI_Waitsome
Maintain the structure of call sequences via stack walk signatures
- XOR signature for speed
- An XOR match is necessary (but not sufficient) for the same call sequence
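In the spirit of the Umpire-generated wrappers (a minimal sketch, not their actual code), a tracing wrapper over the MPI profiling layer can define an MPI call itself and forward to the real implementation through the PMPI entry point; record_op() is a hypothetical hook standing in for the intra-node compression logic:

/* Profiling-layer tracing wrapper: intercepts MPI_Send, records the
 * event, then forwards to the real implementation via PMPI_Send. */
#include <stdio.h>
#include <mpi.h>

static void record_op(const char *name, int peer, int count)
{
    /* hypothetical: feed the event into the compression framework */
    fprintf(stderr, "trace: %s peer=%d count=%d\n", name, peer, count);
}

int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    record_op("MPI_Send", dest, count);   /* record, then forward */
    return PMPI_Send(buf, count, datatype, dest, tag, comm);
}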
9
Intra-Node Compression Example
Consider the MPI operation stream: op1, op2, op3, op4, op5, op3, op4, op5
[Figure: step-by-step compression of the stream; target head/tail and merge head/tail pointers slide over the queue until the trailing op3, op4, op5 matches the preceding op3, op4, op5 and the two are merged]
10
Intra-Node Compression Example
Consider the MPI operation stream: op1, op2, op3, op4, op5, op3, op4, op5
The full algorithm is given in the paper.
Compressed result: op1, op2, ((op3, op4, op5), iters = 2)
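As a rough illustration (a simplification of the paper's queue-based matching, using hypothetical integer op ids), trailing-repeat detection can be sketched as:

/* Returns k > 0 if the last k ops repeat the k ops just before them,
 * i.e., the tail can be folded into ((subsequence), iters). */
#include <string.h>

static int trailing_repeat(const int *ops, int n, int max_k)
{
    for (int k = 1; k <= max_k && 2 * k <= n; k++)
        if (memcmp(&ops[n - k], &ops[n - 2 * k], k * sizeof ops[0]) == 0)
            return k;
    return 0;
}

On the stream above (n = 8), k = 3 matches, so op3, op4, op5 is folded into a single entry with iters = 2.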
11
Event Aggregation
Domain-specific encoding improves compression.
MPI_Waitsome(incount, array_of_requests, outcount, ...)
- Blocks until one or more requests are satisfied
- The number of Waitsome calls in a loop is nondeterministic
Take advantage of typical usage to delay compression:
- MPI_Waitsome is not compressed until a different operation is executed
- Accumulate the output parameter outcount
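A minimal sketch of this delayed aggregation (flush_waitsome() is a hypothetical stand-in for the framework's hooks; wrappers of other MPI calls would invoke the flush before recording their own event):

#include <stdio.h>
#include <mpi.h>

static int pending_calls = 0;   /* consecutive MPI_Waitsome calls */
static int pending_done  = 0;   /* accumulated outcount           */

void flush_waitsome(void)       /* called when a different op arrives */
{
    if (pending_calls > 0) {
        fprintf(stderr, "trace: MPI_Waitsome x%d, total outcount=%d\n",
                pending_calls, pending_done);  /* one aggregated event */
        pending_calls = pending_done = 0;
    }
}

int MPI_Waitsome(int incount, MPI_Request reqs[], int *outcount,
                 int indices[], MPI_Status statuses[])
{
    int rc = PMPI_Waitsome(incount, reqs, outcount, indices, statuses);
    pending_calls++;                    /* delay compression */
    if (*outcount != MPI_UNDEFINED)
        pending_done += *outcount;      /* just accumulate outcount */
    return rc;
}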
12
Outline
Introduction Intra-Node Compression Framework Inter-Node Compression Framework Replay Mechanism Experimental Results Conclusion and Future Work
13
Inter-Node Framework Interoperability
Exploits the Single Program, Multiple Data (SPMD) nature of MPI codes.
Match operations across nodes by manipulating parameters:
- Source/destination offsets
- Request offsets
14
Location-Independent Encoding
Point-to-point communication specifies targets by MPI rank
- MPI rank parameters will not match across tasks; use offsets instead
16-processor (4x4) 2D stencil example:
- MPI rank targets (source/destination):
  - 9 communicates with 8, 5, 10, 13
  - 10 communicates with 9, 6, 11, 14
- MPI offsets:
  - Both 9 and 10 communicate with -1, -4, +1, +4
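A minimal sketch of the encoding (nothing more than rank arithmetic, shown for completeness):

/* Encode a point-to-point peer as an offset from the local rank so that
 * structurally identical ops match across tasks: in the 4x4 stencil,
 * rank 9 -> {8, 5, 10, 13} and rank 10 -> {9, 6, 11, 14} both encode
 * to {-1, -4, +1, +4}. */
static int encode_peer(int my_rank, int peer_rank)
{
    return peer_rank - my_rank;     /* stored in the trace */
}

static int decode_peer(int my_rank, int offset)
{
    return my_rank + offset;        /* recovered at replay */
}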
15
Request Handles
Asynchronous MPI operations are associated with an MPI_Request handle
- Handle values are nondeterministic across tasks, causing parameter mismatches in the inter-node framework
Handled with a circular request buffer (preset size):
- On an asynchronous MPI operation, store the handle in the buffer
- On lookup of a handle, use its offset from the current position in the buffer
- Requires special handling in the replay mechanism
[Figure: circular request buffer; handles H1, H2, H3 are stored as the Current cursor advances, and a later lookup of H1 yields offset -2]
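A minimal sketch of the circular request buffer (RB_SIZE is an assumed preset size; error handling elided):

#include <mpi.h>

#define RB_SIZE 128
static MPI_Request ring[RB_SIZE];
static int cur = 0;                      /* next slot to fill */

void rb_store(MPI_Request h)             /* on MPI_Isend/MPI_Irecv */
{
    ring[cur] = h;
    cur = (cur + 1) % RB_SIZE;
}

int rb_lookup(MPI_Request h)             /* on MPI_Wait*: return offset */
{
    for (int back = 1; back <= RB_SIZE; back++) {
        int i = (cur - back + RB_SIZE) % RB_SIZE;
        if (ring[i] == h)
            return -back;                /* e.g., H1 stored two ops ago -> -2 */
    }
    return 0;                            /* not found (sketch only) */
}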
16
Inter-Node Compression Framework
Invoked after all computation is done (in the MPI_Finalize wrapper).
Merges the operation queues produced by the task-level framework; provides job-size scalability.
[Figure: Tasks 0 and 1 hold identical queues op1, op2, op3, which match and merge; Tasks 2 and 3 hold op4, op5, op6]
17
Inter-Node Compression Framework (continued)
There is more: relaxed reordering of events from different nodes (with a dependence check); see the paper.
[Figure: after Tasks 0 and 1 merge, the matching op4, op5, op6 queues of Tasks 2 and 3 merge as well]
18
Reduction over Binary Radix Tree
The cross-node framework merges the operation queues of all tasks.
The merge algorithm combines two queues at a time (details in the paper).
The radix layout facilitates compression (constant stride between nodes).
A control mechanism is needed to order the merging process; a sketch follows.
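A sketch of that control mechanism, assuming a binary-tree reduction in which, at each step, every task whose rank is a multiple of twice the step receives and merges the queue of the task one step above it (communication and merge calls elided; task 0 ends up holding the fully merged trace):

/* Binary radix tree reduction skeleton driving the inter-node merge. */
void radix_merge(int rank, int nprocs)
{
    for (int step = 1; step < nprocs; step <<= 1) {
        if (rank % (2 * step) == 0) {
            if (rank + step < nprocs) {
                /* receive the queue of rank + step and run the
                 * two-queue merge into the local queue */
            }
        } else {
            /* send the local queue to rank - step and drop out */
            break;
        }
    }
}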
19
Outline
- Introduction
- Intra-Node Compression Framework
- Inter-Node Compression Framework
- Replay Mechanism
- Experimental Results
- Conclusion and Future Work
20
Replay Mechanism
Motivation
- Traces can be replayed on any architecture
  - Useful for rapid prototyping and procurements
  - Communication tuning (Miranda, SC'05)
  - Communication analysis (patterns)
  - Communication tuning (inefficiencies)
Replay design
- Replays the comprehensive trace produced by the recording framework
- Parses the trace and loads the task-level operation queues (inverse of the merge algorithm)
- Replays on the fly (inverse of the compression algorithm)
- Preserves timing deltas
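A minimal sketch of a replay driver under these assumptions (trace_op is a hypothetical record type; payload contents are irrelevant, only the communication structure matters):

#include <mpi.h>

typedef struct { int kind; int peer_offset; int count; } trace_op;
enum { OP_SEND, OP_RECV };

void replay(const trace_op *ops, int n, int my_rank, MPI_Comm comm)
{
    static char buf[1 << 16];          /* assumes count fits the buffer */
    for (int i = 0; i < n; i++) {
        int peer = my_rank + ops[i].peer_offset;   /* decode rank offset */
        if (ops[i].kind == OP_SEND)
            MPI_Send(buf, ops[i].count, MPI_BYTE, peer, 0, comm);
        else
            MPI_Recv(buf, ops[i].count, MPI_BYTE, peer, 0, comm,
                     MPI_STATUS_IGNORE);
    }
}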
21
Experimental Results
Environment
- 1024-node BG/L at Lawrence Livermore National Laboratory
- Stencil micro-benchmarks
- Raptor, a real-world application [Greenough'03]
22
Trace System Validation
- Compared uncompressed trace dumps
- Verified replay against matching mpiP profiles
23
Task Scalability Performance
Varied
- Strong (task) scaling: number of nodes
Examined metrics
- Trace file size
- Memory usage
- Compression time (or write time)
Results for
- 1D/2D/3D stencils
- Raptor application
- NAS Parallel Benchmarks
24
Trace File Size – 3D Stencil
Constant size for fully compressed traces (inter-node); log scale.
[Figure: log-scale trace file sizes with marked values of 100MB, 0.5MB, and 10kB]
25
Memory Usage – 3D Stencil
Constant memory usage for fully compressed traces (inter-node).
- min = leaves, avg = middle layer (decreases with node count), max ~ task 0
Average memory usage decreases with more processors.
[Figure: memory usage with a marked value of 0.5MB]
26
Trace File Size – Raptor Application
Sub-linear increase for fully compressed traces (inter-node); NOT on a log scale.
[Figure: trace file sizes with marked values of 10kB, 80MB, and 93MB]
27
Memory Usage – Raptor
Constant memory usage for fully compressed traces (inter-node); average memory usage decreases with more processors.
[Figure: memory usage with a marked value of 500MB]
28
Load Scaling – 3D Stencil
Both intra- and inter-node compression result in constant trace size; log scale.
29
Trace File Size – NAS PB Codes
Log-scale file size [Bytes], 32-512 CPUs; three bars per code for no / intra-node / inter-node compression. Focus: blue = full compression.
Near-constant size (EP, also IS and DT)
- Instead of exponential growth
Sub-linear (MG, also LU)
- Still good
Non-scalable (FT, also BT and CG)
- Still 2-4 orders of magnitude smaller
- But could be improved
- Due to complex communication patterns
  - Along the diagonal of the 2D layout
  - Even with a varying number of endpoints
30
Memory Usage – NAS PB Codes
Log-scale memory [Bytes], 32-512 CPUs; bars for min, avg, max, and root (task 0).
Near-constant size (EP, also IS and DT)
- Also constant in memory
Sub-linear (MG, also LU)
- Sometimes constant in memory
Non-scalable (FT, also BT and CG)
- Non-scalable in memory
31
Compression/Write Overhead – NAS PB
Log-scale time [ms], 32-512 CPUs; no / intra-node / inter-node (full) compression.
Near-constant size (EP, also IS and DT)
- Inter-node compression fastest
Sub-linear (LU, also MG)
- Intra-node faster than inter-node
Non-scalable (FT, also BT and CG)
- Not competitive; better to write with intra-node compression only
32
Conclusion
Contributions: a scalable approach to capturing the full communication trace
- Near-constant trace sizes for some applications (others need more work)
- Near-constant memory requirement
- Rapid analysis via the replay mechanism
Lossless MPI tracing of any number of nodes is feasible
- MPI traces may be stored and visualized on a desktop machine
Future work
- Task layout model (i.e., Miranda)
- Post-analysis stencil identification
- Tuning: detect non-scalable MPI usage
- Support for procurements
- Scalable replay
- Offloading compression to I/O nodes
33
Acknowledgements
Mike Noeth (NCSU), Martin Schulz (LLNL), Bronis R. de Supinski (LLNL), Prasun Ratn (NCSU)
Available under the BSD license: moss.csc.ncsu.edu/~mueller/scala.html
Funded in part by NSF CCF-0429653, CNS-0410203, and CAREER CCR-0237570. Part of this work was performed under the auspices of the U.S. Department of Energy by the University of California, Lawrence Livermore National Laboratory, under contract No. W-7405-Eng-48, UCRL-???
34
Global Compression/Write Time [ms]
[Figure: average time per node and maximum time for BT, CG, DT, EP, FT, IS, LU, and MG]
35
Trace File Size – Raptor
Near-constant size for fully compressed traces; linear scale.
36
Memory Usage – Raptor
Results are the same as for the 3D stencil; investigating the minimum memory usage at 1024 tasks.
37
Intra-Node Compression Algorithm
- Intercept MPI call
- Identify target
- Identify merger
- Match verification
- Compression
38
Call Sequence ID: Stack Walk Signature
Maintain structure by distinguishing between operations from different calling contexts.
XOR signature for speed
- An XOR match is necessary (but not sufficient) for the same calling context
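A minimal sketch of such a signature using the glibc backtrace() call (the framework's own stack walk may differ); because XOR collisions are possible, a full comparison still verifies each match:

#include <execinfo.h>
#include <stdint.h>

uintptr_t stack_signature(void)
{
    void *frames[64];
    int n = backtrace(frames, 64);    /* return addresses on the stack */
    uintptr_t sig = 0;
    for (int i = 0; i < n; i++)
        sig ^= (uintptr_t)frames[i];  /* fast, order-insensitive hash */
    return sig;
}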
39
Inter-Node Merge Algorithm
- Iterate through both queues
- Find match
- Maintain order
- Compress operations
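A self-contained sketch of the strict-ordering two-queue merge over a simplified array representation (sequence ids with participant bitmasks; the real framework works on richer queue structures):

/* Merge slave[0..ns) into master[0..*nm). A matching sequence absorbs
 * the slave's participants; an unmatched slave sequence is inserted
 * before the current match position to preserve operation order.
 * The caller must ensure master has capacity for *nm + ns entries. */
typedef struct {
    int id;            /* call-sequence identifier            */
    unsigned tasks;    /* participant bitmask, bit i = task i */
} seq_t;

void merge_queues(seq_t *master, int *nm, const seq_t *slave, int ns)
{
    int mpos = 0;                              /* master iterator */
    for (int i = 0; i < ns; i++) {
        int found = -1;
        for (int j = mpos; j < *nm; j++)       /* search for a match */
            if (master[j].id == slave[i].id) { found = j; break; }
        if (found >= 0) {
            master[found].tasks |= slave[i].tasks;  /* merge participants */
            mpos = found + 1;
        } else {
            for (int k = (*nm)++; k > mpos; k--)    /* shift right */
                master[k] = master[k - 1];
            master[mpos++] = slave[i];         /* move unmatched op */
        }
    }
}

On the example that follows (master: Sequences 1 and 4 from Task 0; slave: Sequences 1-4 from Task 1), this yields Sequences 1-4 with merged participant lists on 1 and 4.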
40
Inter-Node Merge Example
Consider two tasks, each with its own operation queue.
[Figure: master queue: Sequence 1 and Sequence 4 (participant: Task 0); slave queue: Sequences 1-4 (participant: Task 1). The master and slave iterators advance until Sequence 1 matches in both queues.]
41
Inter-Node Merge Example (continued)
[Figure: Sequence 1's participant list now includes Task 1; the slave iterator advances past Sequences 2 and 3 until Sequence 4 matches.]
42
Cross-Node Merge Example (continued)
[Figure: the unmatched slave Sequences 2 and 3 (participant: Task 1) are moved into the master queue, and Sequence 4 matches with participants Task 0 and Task 1.]
43
Temporal Cross-Node Reordering
Requirement: the queue maintains the order of operations.
The merge algorithm maintains order too strictly:
- Unmatched sequences in the slave queue are always moved to the master
- This results in poorer compression
Solution: move only operations that must be moved (see the sketch below):
- Intersect the task participation lists of matched and unmatched operations
- Empty intersection: no dependency
- Otherwise, the operations must be moved
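A minimal sketch of the dependency test, assuming participant lists are plain arrays of task ranks:

#include <stdbool.h>

/* True if two participant lists intersect; a shared task means the
 * relative order of the two sequences must be preserved. */
bool depends(const int *a, int na, const int *b, int nb)
{
    for (int i = 0; i < na; i++)
        for (int j = 0; j < nb; j++)
            if (a[i] == b[j])
                return true;   /* shared task: must move the op */
    return false;              /* disjoint: safe to leave in place */
}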
44
Dependency Example
Consider a 4-task job (tasks 1 and 3 have already merged):
[Figure: master queue: Sequence 1 (Task 0), Sequence 2 (Task 0); slave queue: Sequence 2 (Task 1), Sequence 1 (Task 3). The iterators find a match on Sequence 2.]
45
Dependency Example (continued)
[Figure: after the match on Sequence 2, the slave's Sequence 1 (participant: Task 3) remains unmatched ahead of it.]
46
Dependency Example (continued)
[Figure: strictly moving the unmatched Sequence 1 (participant: Task 3) into the master queue would create duplicates of Sequence 1.]
47
Dependency Example (continued)
[Figure: since the participant lists do not intersect, Sequence 1 is not moved; its participant lists (Task 3 and Task 1 shown) are combined on the existing entry.]