Detailed Cache Coherence Characterization for OpenMP Benchmarks
Transcript of Detailed Cache Coherence Characterization for OpenMP Benchmarks
6/28/2004 NC State University
Detailed Cache Coherence Characterization for OpenMP Benchmarks
Jaydeep Marathe (1), Anita Nagarajan (2), Frank Mueller (1)
(1) Department of Computer Science, NCSU
(2) Intel Technology India Pvt. Ltd.
Our Focus
• Target shared memory SMP systems.
• Characterize coherence behavior at the application level.
• Metrics guide coherence optimization.

[Diagram: bus-based shared-memory SMP. Processor-1 through Processor-N, each with its own cache, connected by a bus running a coherence protocol.]
Invalidation-based coherence protocol
• Cache lines have state bits.
• Data migrates between processor caches; state transitions maintain coherence.
• MESI protocol, 4 states: M (Modified), E (Exclusive), S (Shared), I (Invalid)

[Example, Processors A and B:
 1. A reads "x": A's line goes I -> E ("exclusive").
 2. B reads "x": B's line goes I -> S and A's line E -> S ("shared").
 3. A writes "x": A's line goes S -> M ("modified/dirty") and B's line S -> I ("invalid"): B's cache line was invalidated.]
A performance perspective
• Writes to shared variables cause invalidation traffic (E->I, M->I, S->I).
• Worse, invalidations lead to coherence misses!

Reducing coherence misses and invalidations => improved performance!

["Coherence bottleneck" example:
 1. Proc A writes shared vars Var_1 ... Var_n: invalidations!
 2. Proc B reads Var_1 ... Var_n: coherence misses in Proc B!
 3. Proc A writes Var_1 ... Var_n: invalidations again!]
Question: Does a coherence bottleneck exist?
One Approach: Time-based Profilers

Problem: Implicit information. Does imbalance / speedup loss indicate a coherence bottleneck? We can't tell!

[Figure: load imbalance between 2 threads (KAI GuideView), showing parallel time vs. imbalance time.]
Does a coherence bottleneck exist? (contd.)
Another Approach: Using Hardware Counters
+ Detect potential coherence bottlenecks.
- Block-level statistics only; perturbation prevents fine-grained monitoring.
- Can't diagnose the cause! Which source code refs? Which data structures?

Need more detailed information!

[Figure: total number of misses, coherence misses, and invalidations per processor (P-1 through P-4).]
What we offer
• Hierarchical levels of detail: overall statistics -> per-reference metrics -> invalidator set for each reference.

Per-reference metrics:
Ref        | Coherence Misses | Inv. True (In / Across) | Inv. False (In / Across)
V1[]_Read  | 8627             | 4 / 7517                | 31 / 1342
Timer_Read | 3182             | 1 / 0                   | 3122 / 0
Clock_Read | 2165             | 2166 / 0                | 0 / 0
...

Invalidator set for each reference:
Invalidator | True-Sharing | False-Sharing
V2          | 4811         | 20
V3          | 3199         | 0
Clock       | 200          | 1000
...

• Rich metrics: coherence misses, true & false sharing, per-reference invalidators.
• Facilitates easy isolation of bottleneck references!
Our Framework
• Bound OpenMP threads for SMP parallelism.
• Access traces captured using dynamic binary rewriting.
• Traces used for incremental SMP cache simulation (L1 + L2 + coherence).

[Diagram, three stages: Instrumentation -> Trace Generation -> Simulation. The controller instruments the target OpenMP executable and extracts a target descriptor (instruction line & file, global & local variables). During execution, each thread (Thread-0 through Thread-N) runs a handler that emits a compressed access trace. The SMP cache simulator consumes the access traces and the descriptor and produces detailed coherence metrics.]
Target Executable Instrumentation
Source:
  myfunc() {
    ...
    #pragma omp parallel for
    for (i = 0; i < N; i++) {
      A[i] = B[i] + C[i];
    } // end parallel for
    ...
    #pragma omp barrier

Machine code (instrumentation points marked *):
  myfunc:
    ...
    CALL _xlsmpParallelDoSetup
    ...
  * LOAD B[I]
  * LOAD C[I]
  * STORE A[I]
    ...
    Exit_from_parallel_for
    ...
    CALL _xlsmpBarrier

The controller drives DynInst over the machine-code CFG to place the instrumentation points.

• Enhanced DynInst dynamic binary rewriting package (U. Wisconsin).
• Instrument memory access instructions (LD/ST).
• Instrument compiler-generated OpenMP construct functions.
Per-reference metrics
• Uniprocessor misses
• Coherence misses
• Invalidations, classified along two axes: "In"-Region vs. "Across"-Region, and true sharing vs. false sharing
• List & count of invalidator references

"Region" in the fork-join model: each parallel construct delimits one region.

  // serial code
  ...
  #pragma omp parallel
  {                          <- "Region"
    ...
  } // end parallel
  ...
  // serial code
  ...
  #pragma omp parallel do
  {                          <- "Region"
    ...
  } // end parallel
  ...
In-depth Example: SMG2000
• ASCI Purple benchmark
• Has been scaled up to 3150 processors
• Hybrid OpenMP + MPI parallelization
• Code is known to be memory-intensive

Code Characteristics
• 72 files, 24213 lines (non-whitespace)

Instrumentation Characteristics
• 4 OpenMP threads, default workload
• Functions instrumented: 313 (69 OpenMP, 244 others)
• Access points instrumented: 10692
  - 8531 loads (2184 64-bit, 6329 32-bit, 18 8-bit)
  - 2161 stores (722 64-bit, 1425 32-bit, 14 8-bit)
• Tracing: 16.73 million accesses logged.
Overall Performance: SMG2000
• Most L2 misses are coherence misses.
• Most invalidations result in coherence misses.
• Only ~280 out of 10692 access points show coherence activity (2.6%!).
• Only ~10% of these points account for >= 90% of the coherence misses.

[Figure A: overall misses per processor (P1-P4): total L2 misses, coherence misses, and invalidations, ranging up to ~800,000.
 Figure B: cumulative % of coherence misses vs. % of participating access points, one curve per processor (Processor-1 through Processor-4).]
Drilling Down: Per-Access Point Metrics
• Top-5 metrics for Processor-1:

No | File::Line              | Ref (Group)                 | Coherence Misses | Inv. True (In / Across) | Inv. False (In / Across)
1  | smg_residual.c::289     | rp[]_Read (Group 1)         | 168545           | 0 / 0                   | 158842 / 9672
2  | smg_residual.c::289     | rp[]_Read                   | 81729            | 0 / 0                   | 74242 / 7587
3  | smg_residual.c::289     | rp[]_Write                  | 43338            | 0 / 0                   | 42684 / 3648
4  | cyclic_reduction.c::853 | xp[]_Write                  | 22467            | 0 / 0                   | 21388 / 1128
5  | threading.c::24         | num_threads_Write (Group 2) | 16553            | 17402 / 0               | 0 / 0

• Group-1 refs: false-sharing In-Region (same OpenMP region) invalidations dominate.
• Group-2 ref: true-sharing In-Region invalidations only.
Drilling Further: A Ref & its Invalidators

File::Line          | Ref       | Inv. True | Inv. False (In / Across)
smg_residual.c::289 | rp[]_Read | 0         | 158842 / 9672

Invalidator list:
No | Invalidator Reference | True-Sharing Invalidations | False-Sharing Invalidations
1  | Proc_1: rp[]_Write    | 0                          | 77820
2  | Proc_1: rp[]_Write    | 0                          | 86161
3  | Proc_2: rp[]_Write    | 0                          | 2352

The parallelized loop:
  for k = 0 to Kmax
    for j = 0 to Jmax
      #pragma omp parallel do
      for i = 0 to Imax {
        ...
        rp[k][j][i] = rp[k][j][i] - Ai[] * xp[];
      } // end omp do

• Large number of False In-Region invalidations!
• Sub-optimal parallelization: fine-grained sharing.
• Solution: parallelize the outermost loop (coarsening).

[Diagram: P1-P4 writing neighboring rp[] elements that fall in the same cache lines.]
Another Optimization
File::Line      | Ref               | Inv. True (In / Across) | Inv. False (In / Across)
threading.c::24 | num_threads_Write | 17402 / 0               | 0 / 0

Invalidator list:
No | Invalidator Reference     | True-Sharing Invalidations | False-Sharing Invalidations
1  | Proc_1: num_threads_Write | 17402                      | 0

The offending code:
  #pragma omp parallel
    num_threads = omp_get_num_threads();

• Multiple threads updating the same shared variable!
• Solution: remove unnecessary sharing (Shared Removal); in serial code:
  num_threads = omp_get_max_threads();

[Diagram: P1-P4 all writing num_threads, which lives in a single cache line.]
Impact of Optimizations
• SMG2000 run on IBM SP Blue Horizon (POWER3).
• Wall-clock times for recommended full-sized workloads (threads = 1, 2, 4, 8).
• Maximum of 73% improvement for the 4th workload.

[Figure: wall-clock seconds (0-25) for workloads 1-4, comparing Original, Coarsening, and Coarsening + Shared Removal.]
Highlights & Future Directions
Highlights
• First tool for characterizing OpenMP SMP performance
• Dynamic binary rewriting: no source code modification!
• Detailed source-correlated statistics.
• Rich set of coherence metrics.

Future Directions
• Use of partial access traces (intermittent instrumentation).
• Other threading models: Pthreads, etc.
• Characterizing perennial server applications (Apache).
The End
Simulator Accuracy: Comparing #invalidations
• NAS 3.0 OpenMP benchmarks + NBF.
• Total invalidations: hardware counters (IBM SP: HPM) vs. simulator (ccSIM).
• OpenMP runtime overhead is accounted for in the HPM counts.
• 16.5% maximum absolute error; most benchmarks have <= 7% error.

                  | IS     | MG    | CG     | FT     | BT     | SP     | NBF
HPM (corrected)   | 162964 | 13629 | 100487 | 325257 | 157384 | 258922 | 135926
ccSIM-Interleaved | 163073 | 13174 | 117117 | 302630 | 157503 | 268334 | 137498
%ERROR            | -0.006 | 3.3   | -16.5  | 6.9    | -0.07  | -3.6   | -1.15

                                     | IS     | MG    | CG     | FT     | BT     | SP     | NBF
HPM-Raw                              | 165246 | 24631 | 134964 | 326595 | 185317 | 282269 | 474121
HPM after OpenMP run-time correction | 162964 | 13629 | 100487 | 325257 | 157384 | 258922 | 135926
Related Work
MemSpy: Martonosi et al. (1992)
+ Execution-driven simulation
+ Classifies by code & data objects
+ Invalidations & coherence misses
- No true/false sharing
- No invalidator lists
- Compiler-inserted instrumentation
- Uniprocessor-simulated parallel threads

SM-Prof: Brorsson et al. (1995)
+ Variable classification tool
+ Access classes: shared/private, read/write, few/many
- Can't detect true/false sharing or its magnitude
- No coherence misses, no invalidator lists

Rsim, Proteus, SimOS
- Architecture-oriented simulators, only bulk statistics
- Not meant for application developers
Tracing Overhead (earlier work: METRIC)
[Figure: overhead factor on a log scale (1 to 10000) for NULL instrumentation vs. instrumentation + compression, across SPEC-applu, SPEC-mgrid, SPEC-swim, SPEC-tomcatv, SPEC-hydro2d, ADI, Matrix Multiply, and Tiled Matrix Multiply.]

• 1-3 orders of magnitude overhead in most cases.
• Conventional breakpoints (TRAP) have > 4 orders of magnitude overhead.
Trap-based instrumentation
Application  | # Operations | Ops/sec   | DynInst time (sec) | gdb time (sec)
compress95   | 32,513       | 406,655.7 | 0.08               | 74.35
li (xlmatch) | 110,209      | 43,607.7  | 2.53               | 221.04
li (compare) | 4,475        | 640.2     | 6.99               | 16.39
li (binary)  | 401          | 19.4      | 20.69              | 21.62

(From the DynInst documentation)
Compression Ratio (earlier work: METRIC)
• Comparison of uncompressed and compressed stream sizes.
• 1 million accesses logged.
• 2-4 orders of magnitude compression in most cases.

[Figure: stream size in bytes on a log scale (1 to 10^7), uncompressed vs. compressed, across SPEC-applu, SPEC-mgrid, SPEC-swim, SPEC-tomcatv, SPEC-hydro2d, ADI, Matrix Multiply, and Tiled Matrix Multiply.]