Detailed Cache Coherence Characterization for OpenMP Benchmarks

6/28/2004 NC State University

Jaydeep Marathe (1), Anita Nagarajan (2), Frank Mueller (1)

(1) Department of Computer Science, NCSU
(2) Intel Technology India Pvt. Ltd.


Our Focus

• Target shared memory SMP systems.
• Characterize coherence behavior at application level.
• Metrics guide coherence optimization.

[Figure: Bus-Based Shared Memory SMP. Processor-1 through Processor-N each have a private cache; the caches are connected by a bus and kept consistent by a coherence protocol.]


Invalidation-based coherence protocol

• Cache lines have state bits.
• Data migrates between processor caches; state transitions maintain coherence.
• MESI protocol, 4 states: M: Modified, E: Exclusive, S: Shared, I: Invalid

Step               Processor A                  Processor B
1. A reads "x"     I -> E ("exclusive")
2. B reads "x"     E -> S ("shared")            I -> S ("shared")
3. A writes "x"    S -> M ("modified/dirty")    S -> I ("invalid"): cache line was invalidated
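The transitions in this example can be sketched as a small state function for one cache line in one cache. This is a minimal illustration of the textbook MESI rules, not the simulator's code; the event names and the `others_have_copy` flag are hypothetical.

```c
#include <assert.h>

/* The four MESI states named on the slide. */
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;

/* Events seen by one cache for one line (hypothetical names). */
typedef enum {
    LOCAL_READ,    /* this processor reads the line */
    LOCAL_WRITE,   /* this processor writes the line */
    REMOTE_READ,   /* another processor's read is seen on the bus */
    REMOTE_WRITE   /* another processor's write is seen on the bus */
} event_t;

/* Next state of the line in this cache. others_have_copy only matters
 * when an INVALID line is read: it selects SHARED over EXCLUSIVE. */
mesi_t mesi_next(mesi_t state, event_t ev, int others_have_copy)
{
    assert(others_have_copy == 0 || others_have_copy == 1);
    switch (ev) {
    case LOCAL_READ:
        if (state == INVALID)
            return others_have_copy ? SHARED : EXCLUSIVE;
        return state;                /* read hit: no change */
    case LOCAL_WRITE:
        return MODIFIED;             /* writer holds the only valid copy */
    case REMOTE_READ:
        if (state == INVALID)
            return INVALID;
        return SHARED;               /* M or E is downgraded, S stays S */
    case REMOTE_WRITE:
        return INVALID;              /* E->I, M->I, S->I: an invalidation */
    }
    return state;                    /* not reached for valid events */
}
```

Replaying the three steps with one state variable per processor gives A: I, E, S, M and B: I, S, I.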


A performance perspective

• Writes to shared variables cause invalidation traffic (E->I, M->I, S->I).
• Worse, invalidations lead to coherence misses!

Reducing coherence misses and invalidations improves performance!

"Coherence Bottlenecks"

1. Proc A writes shared vars Var_1, Var_2, ..., Var_n: invalidations!
2. Proc B reads Var_1, Var_2, ..., Var_n: coherence misses in Proc B!
3. Proc A writes Var_1, Var_2, Var_3: invalidations!


Question: Does a coherence bottleneck exist?

One Approach: Time-based Profilers

Problem: Implicit information. Does imbalance/speedup loss indicate a coherence bottleneck?
- We can't tell!

[Figure: Load imbalance between 2 threads (KAI GuideView), showing parallel time and imbalance time.]


Does a coherence bottleneck exist? (contd)

Another Approach: Using Hardware Counters

+ Detect potential coherence bottlenecks.
- Block-level statistics: perturbation prevents fine-grained monitoring.
- Can't diagnose the cause! Which source code refs? Which data structures?

Need more detailed information!

[Figure: Bar chart of total misses, coherence misses, and invalidations for processors P-1 to P-4.]

What we offer..

• Hierarchical levels of detail: overall statistics, per-reference metrics, invalidator set for each reference
• Rich metrics: coherence misses, true & false sharing, per-reference invalidators
• Facilitates easy isolation of bottleneck references!

Per-reference metrics

Ref          Coherence   True-sharing invalidations   False-sharing invalidations
             Misses      In        Across             In        Across
V1[]_Read    8627        4         7517               31        1342
Timer_Read   3182        1         0                  3122      0
Clock_Read   2165        2166      0                  0         0
. . .

Invalidator set for each reference

Invalidator   True-Sharing   False-Sharing
V2            4811           20
V3            3199           0
Clock         200            1000
....          ....           ....


Our Framework

• Bound OpenMP threads for SMP parallelism
• Access traces using dynamic binary rewriting
• Traces used for incremental SMP cache simulation (L1 + L2 + coherence)

[Figure: Framework overview in three phases. Instrumentation: the controller instruments the target OpenMP executable and extracts a target descriptor (instruction line and file, global & local variables). Trace Generation: the executable runs, and a Handler() in each thread (Thread-0 to Thread-N) writes a compressed access trace. Simulation: the SMP cache simulator consumes the access traces and produces detailed coherence metrics.]


Target Executable Instrumentation

Source code:

    myfunc() {
        ...
        #pragma omp parallel for
        for (I = 0; I < N; I++) {
            A[I] = B[I] + C[I];
        } // end parallel for
        ...
        #pragma omp barrier

Machine code:

    myfunc:
        ...
        CALL _xlsmpParallelDoSetup
        ...
        LOAD B[I]
        LOAD C[I]
        STORE A[I]
        ...
        Exit_from_parallel_for
        ...
        CALL _xlsmpBarrier

[Figure: The controller uses DynInst to obtain the machine code CFG and to insert instrumentation points.]

• Enhanced DynInst dynamic binary rewriting package (U. Wisconsin)
• Instrument memory access instructions (LD/ST)
• Instrument compiler-generated OpenMP construct functions.


Per-reference metrics

• Uniprocessor misses
• Coherence misses
• Invalidations, classified as "In"-Region or "Across"-Region, and within each as True Sharing or False Sharing
• List & count of invalidator references

"Fork-Join" Model

    // serial code
    ...
    #pragma omp parallel
    {
        ...
    } // end parallel
    ...
    // serial code
    ...
    #pragma omp parallel do
    {
        ...
    } // end parallel
    ...

[Figure: "Fork-Join" model, with four "Region" labels marking segments of the code above.]
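The four invalidation classes can be sketched as a small classifier. This assumes the usual definition of true sharing (the invalidating write overlaps the bytes the invalidated reference uses) and treats "in-region" as both accesses carrying the same region id; the struct layout and names are hypothetical, not the tool's data structures.

```c
#include <assert.h>
#include <stdint.h>

/* One memory access as a trace might record it (hypothetical layout). */
typedef struct {
    uint64_t addr;    /* first byte touched */
    uint32_t size;    /* number of bytes touched */
    uint32_t region;  /* id of the OpenMP region the access ran in */
} access_t;

typedef enum { TRUE_IN, TRUE_ACROSS, FALSE_IN, FALSE_ACROSS } inval_class_t;

/* Nonzero when the two byte ranges intersect. */
static int overlaps(const access_t *a, const access_t *b)
{
    return a->addr < b->addr + b->size && b->addr < a->addr + a->size;
}

/* Classify an invalidation of the line used by `victim`, caused by the
 * write `invalidator` on another processor. */
inval_class_t classify_invalidation(const access_t *victim,
                                    const access_t *invalidator,
                                    uint64_t line_size)
{
    assert(line_size > 0);
    /* Both accesses must fall in the same cache line. */
    assert(victim->addr / line_size == invalidator->addr / line_size);

    int true_sharing = overlaps(victim, invalidator);
    int same_region = victim->region == invalidator->region;

    if (true_sharing)
        return same_region ? TRUE_IN : TRUE_ACROSS;
    return same_region ? FALSE_IN : FALSE_ACROSS;
}
```

For example, two 8-byte accesses to neighbouring words of one 64-byte line in the same region come out as `FALSE_IN`.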


In-depth Example: SMG2000

• ASCI Purple benchmark
• Has been scaled up to 3150 processors
• Hybrid OpenMP + MPI parallelization
• Code is known to be memory-intensive

Code Characteristics
• 72 files, 24213 lines (non-whitespace)

Instrumentation Characteristics
• 4 OpenMP threads, default workload
• Functions instrumented: 313 (69 OpenMP, 244 others)
• Access points instrumented: 10692
    8531 Load (2184 64-bit, 6329 32-bit, 18 8-bit)
    2161 Store (722 64-bit, 1425 32-bit, 14 8-bit)
• Tracing: 16.73 million accesses logged.


Overall Performance: SMG2000

• Most L2 misses are Coherence misses

• Most Invalidations result in Coherence Misses

• Only ~280 out of 10692 access points show coherence activity (2.6% !)

• Only ~10% of these points account for >= 90% of the coherence misses

[Figure A: Overall Misses. Total L2 misses, coherence misses, and invalidations for processors P1 to P4.]

[Figure B: Cumulative Coherence Misses. % of coherence misses vs. % of participating access points, for Processor-1 to Processor-4.]
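A cumulative curve like panel B can be derived from per-access-point miss counts. The sketch below is one way to compute how many access points are needed to reach a given share of the coherence misses; the function name is hypothetical and it is not taken from the tool.

```c
#include <assert.h>
#include <stdlib.h>

/* qsort comparator: larger counts first. */
static int by_count_desc(const void *a, const void *b)
{
    unsigned long long x = *(const unsigned long long *)a;
    unsigned long long y = *(const unsigned long long *)b;
    return (x < y) - (x > y);
}

/* Smallest number of access points whose coherence misses add up to at
 * least `percent` percent of the total. Sorts `misses` in place. */
size_t points_to_cover(unsigned long long *misses, size_t n, unsigned percent)
{
    assert(percent >= 1 && percent <= 100);

    unsigned long long total = 0, running = 0;
    for (size_t i = 0; i < n; i++)
        total += misses[i];
    if (total == 0)
        return 0;

    qsort(misses, n, sizeof misses[0], by_count_desc);
    for (size_t i = 0; i < n; i++) {
        running += misses[i];
        /* Integer form of running / total >= percent / 100. */
        if (running * 100 >= total * percent)
            return i + 1;
    }
    return n;
}
```

Dividing the result by the number of participating access points gives the x-coordinate at which the curve crosses the chosen y-value.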


Drilling Down: Per-Access Point Metrics

• Top-5 Metrics for Processor-1

No  File                Line  Ref                Group  Coherence  True-sharing invalidations  False-sharing invalidations
                                                        Misses     In-Region  Across-Region    In-Region  Across-Region
1   smg_residual.c      289   rp[]_Read          1      168545     0          0                158842     9672
2   smg_residual.c      289   rp[]_Read          1      81729      0          0                74242      7587
3   smg_residual.c      289   rp[]_Write         1      43338      0          0                42684      3648
4   cyclic_reduction.c  853   xp[]_Write         1      22467      0          0                21388      1128
5   threading.c         24    num_threads_Write  2      16553      17402      0                0          0

• Group-1 Refs: False-sharing In-Region (Same OpenMP region) invalidations dominate

• Group-2 Ref: True-sharing In-Region invalidations only


Drilling Further: A Ref & its invalidators

File::Line            Ref         True-sharing     False-sharing invalidations
                                  invalidations    In        Across
smg_residual.c::289   rp[]_Read   0                158842    9672

Invalidator List

No  Reference           True-sharing Invalidations  False-Sharing Invalidations
1   Proc_1:rp_Write[]   0                           77820
2   Proc_1:rp_Write[]   0                           86161
3   Proc_2:rp_Write[]   0                           2352

    for k = 0 to Kmax
      for j = 0 to Jmax
        #pragma omp parallel do
        for i = 0 to Imax {
          ...
          rp[k][j][i] = rp[k][j][i] - Ai[] * xp[];
        } // end omp do

[Figure: Processors P1 to P4 accessing neighbouring elements that fall within the same cache lines.]

• Large number of False In-Region invalidations!
• Sub-optimal parallelization: fine-grained sharing
• Solution: parallelize the outermost loop (Coarsening)
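The coarsening fix named on this slide can be sketched in C as follows. The grid sizes, the scalar `a`, and the same-index `xp` operand are stand-ins for the elided `Ai[]` and `xp[]` subscripts, so this illustrates the loop restructuring only, not the SMG2000 kernel.

```c
#include <assert.h>

enum { KMAX = 4, JMAX = 4, IMAX = 8 };   /* hypothetical small grid */

/* Fine-grained version, as on the slide: only the innermost loop is
 * parallel, so neighbouring elements of a row go to different threads. */
void residual_fine(double rp[KMAX][JMAX][IMAX], double a,
                   double xp[KMAX][JMAX][IMAX])
{
    assert(rp != xp);                    /* the arrays must not alias */
    for (int k = 0; k < KMAX; k++)
        for (int j = 0; j < JMAX; j++) {
            #pragma omp parallel for
            for (int i = 0; i < IMAX; i++)
                rp[k][j][i] -= a * xp[k][j][i];
        }
}

/* Coarsened version: the outermost loop is parallel, so each thread
 * owns whole k-planes and writes larger contiguous blocks. */
void residual_coarse(double rp[KMAX][JMAX][IMAX], double a,
                     double xp[KMAX][JMAX][IMAX])
{
    assert(rp != xp);
    #pragma omp parallel for
    for (int k = 0; k < KMAX; k++)
        for (int j = 0; j < JMAX; j++)
            for (int i = 0; i < IMAX; i++)
                rp[k][j][i] -= a * xp[k][j][i];
}
```

Both functions compute the same values; only the assignment of iterations to threads changes. Without OpenMP support the pragmas are ignored and the code runs serially.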


Another Optimization

File::Line        Ref                 True-sharing invalidations  False-sharing invalidations
                                      In        Across            In        Across
threading.c::24   num_threads_Write   17402     0                 0         0

Invalidator List

No  Reference                   True-sharing Invalidations  False-Sharing Invalidations
1   Proc_1:num_thread_Write[]   17402                       0

Before:

    #pragma omp parallel
    num_threads = omp_get_num_threads();

[Figure: Processors P1 to P4 all writing the cache line that holds num_thread.]

• Multiple threads updating the same shared variable!
• Solution: remove unnecessary sharing (SharedRemoval)

After:

    num_threads = omp_get_max_threads();
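A minimal sketch of the same idea: the shared variable is written once from serial code, and parallel regions only read it. The function names are hypothetical, and the fallback definition exists only so the sketch also builds without OpenMP support.

```c
#include <assert.h>

#ifdef _OPENMP
#include <omp.h>
#else
/* Fallback so the sketch also builds without OpenMP support. */
static int omp_get_max_threads(void) { return 1; }
#endif

/* Hypothetical shared variable, standing in for the one in threading.c. */
static int num_threads;

/* Write the shared variable once, outside any parallel region. */
int init_num_threads(void)
{
    num_threads = omp_get_max_threads();
    assert(num_threads >= 1);
    return num_threads;
}

/* Parallel code only reads num_threads, so its cache line can stay in
 * the shared state in every cache and is never invalidated here. */
long sum_thread_counts(int iterations)
{
    long total = 0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < iterations; i++)
        total += num_threads;            /* read-only access */
    return total;
}
```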


Impact of Optimizations

• SMG2000 run on IBM SP Blue Horizon (POWER3).
• Wall-clock times for recommended full-sized workloads (threads = 1, 2, 4, 8)
• Maximum of 73% improvement for 4th workload

[Figure: Wall-clock time in seconds for workloads 1 to 4, comparing Original, Coarsening, and Coarsening + Shared Removal.]


Highlights & Future Directions

Highlights

• First tool for characterizing OpenMP SMP performance
• Dynamic binary rewriting: no source code modification!
• Detailed source-correlated statistics.
• Rich set of coherence metrics.

Future Directions

• Use of partial access traces (intermittent instrumentation)
• Other threading models: Pthreads, etc.
• Characterizing perennial server applications (Apache)


The End


Simulator Accuracy: Comparing #invalidations

• NAS 3.0 OpenMP benchmarks + NBF

• Total Invalidations: Hardware Counters (IBM SP: HPM) vs. Simulator (ccSIM)

• Account for OpenMP runtime overhead in HPM.

• 16.5% Maximum absolute error, most benchmarks have <= 7% error.

                    IS       MG      CG       FT       BT       SP       NBF
HPM (Corrected)     162964   13629   100487   325257   157384   258922   135926
ccSIM-Interleaved   163073   13174   117117   302630   157503   268334   137498
% ERROR             -0.006   3.3     -16.5    6.9      -0.07    -3.6     -1.15

                                       IS       MG      CG       FT       BT       SP       NBF
HPM-Raw                                165246   24631   134964   326595   185317   282269   474121
HPM-After OpenMP run-time correction   162964   13629   100487   325257   157384   258922   135926
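The error row can be reproduced from the two count rows. The slides do not state the formula; the sign convention below (hardware count minus simulated count, relative to the hardware count) is inferred from the tabulated values, and the function name is hypothetical.

```c
#include <assert.h>

/* Signed relative error of the simulated invalidation count against the
 * corrected hardware-counter count, in percent. The sign convention is
 * an assumption inferred from the table, not stated on the slide. */
double percent_error(double hpm, double ccsim)
{
    assert(hpm > 0.0);
    return 100.0 * (hpm - ccsim) / hpm;
}
```

For CG, `percent_error(100487, 117117)` is about -16.5, matching the maximum absolute error quoted above.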


Related Work

MemSpy: Martonosi et al. (1992)
+ Execution-driven simulation
+ Classifies by code & data objects
+ Invalidations & coherence misses
- No true/false sharing
- No invalidator lists
- Compiler-inserted instrumentation
- Uniprocessor-simulated parallel threads

SM-Prof: Brorsson et al. (1995)
+ Variable classification tool
+ Access classes of "shared/private read/write few/many"
- Can't detect true/false sharing, magnitude
- No coherence misses, invalidator lists

Rsim, Proteus, SimOS
- Architecture-oriented simulators, only bulk statistics
- Not meant for application developers


Tracing Overhead (earlier work: METRIC)

[Figure: Overhead factor (log scale, 1 to 10000) of NULL instrumentation and of instrumentation + compression for SPEC-applu, SPEC-mgrid, SPEC-swim, SPEC-tomcatv, SPEC-hydro2d, ADI, MatrixMultiply, and Tiled MatrixMultiply.]

• 1-3 orders of magnitude overhead, in most cases.
• Conventional breakpoints (TRAP) have > 4 orders of magnitude overhead.


Trap-based instrumentation

application     # operations   ops/sec     dyninst time (sec)   gdb time (sec)
compress 95     32,513         406,655.7   0.08                 74.35
li (xlmatch)    110,209        43,607.7    2.53                 221.04
li (compare)    4,475          640.2       6.99                 16.39
li (binary)     401            19.4        20.69                21.62

From DynInst documentation


Compression Ratio (earlier work: METRIC)

• Comparison of uncompressed and compressed stream sizes
• 1 million accesses logged
• 2-4 orders of magnitude compression, in most cases.

[Figure: Stream size in bytes (log scale, 1 to 10000000), uncompressed vs. compressed, for SPEC-applu, SPEC-mgrid, SPEC-swim, SPEC-tomcatv, SPEC-hydro2d, ADI, MatrixMultiply, and Tiled MatrixMultiply.]
