NAVO MSRC PET Program Towards More Meaningful Machine Comparisons Dr. Allan Snavely PMaC (...

NAVO MSRC PET Program

Towards More Meaningful Machine Comparisons

Dr. Allan Snavely

PMaC (Performance Modeling & Characterization) Group Leader

www.sdsc.edu/PMaC

SDSC

PMaC Mission

• To bring scientific rigor to the art of performance prediction

– for procurement

– for architectural tradeoffs

– for guiding applications to the best-suited machine

– for performance tuning

PMaC Mission

• To bridge the gap between benchmarks and cycle-accurate simulation

– Benchmarks have dubious relevancy to real apps, particularly on future machines

– Cycle-accurate simulations take too long

Projects

• MAPS (Memory Access Patterns)

– memory subsystem & interconnect signatures

• MetaSim

– an on-the-fly simulator for playing “what if?” (4 orders of magnitude faster than cycle-accurate simulation)

• Pseudocode Cache Simulator

• Scientific Application Loop Set

• Terascale Application Information

• IDC HPC List

People

• Dr. Allan Snavely, Group Leader

– Dr. Laura Carrington, Xiaofeng Gao (MAPS)

– Dr. Stuart Johnson (Pseudocode simulator)

– Dr. Larry Carter (senior technical advisor)

– Dr. Wayne Pfeiffer (Scientific Application Loop Set)

– Nicole Wolter (Paraver/Dimemas)

– Dr. Bob Leary (resident mathematician)

What’s wrong with benchmarks?

• May anti-correlate to actual performance [1]

[1] John L. Gustafson and Rajat Todi, "Conventional Benchmarks as a Sample of the Performance Spectrum," Ames Laboratory, USDOE

PMaC Methods

• Performance modeling via separation of concerns

– Machine signatures

– Application profiles

– Convolution methods

Memory Bandwidth vs. Size for Loads on Blue Horizon

[Figure: bandwidth (MB/s, 0 to 7000) versus size (W, 1000 to 10000000, log scale) for loads with 1, 2, 3, and 4 streams.]

Memory Bandwidth vs. Size for Loads on Blue Horizon

[Figure: bandwidth (MB/s, 0 to 7000) versus size (W, 1000 to 10000000, log scale) for loads with 1, 2, 3, and 4 streams, annotated with the memory hierarchy:]

• L1: 8192 word, 128 way, 16 block

• TLB: 131072 word, 4KB pages, 2 way

• L2: 1048576 word, 4 way, 16 block
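The capacities annotated on the plot suggest where the bandwidth curve should step down as the working set grows. The sketch below simply classifies a working-set size against those capacities. It assumes the TLB figure is the amount of memory the TLB can map (its "reach"), which is an interpretation of the annotation, and the function name is hypothetical.

```python
# Capacities in words, taken from the plot annotations.
# Reading "TLB 131072 word" as TLB reach is an assumption.
CAPACITY_WORDS = [
    ("L1", 8192),
    ("TLB reach", 131072),
    ("L2", 1048576),
]


def smallest_level_that_fits(size_words):
    """Return the first hierarchy level whose capacity holds the working set."""
    for name, capacity in CAPACITY_WORDS:
        if size_words <= capacity:
            return name
    return "main memory"


if __name__ == "__main__":
    for size in (1000, 10000, 100000, 1000000, 10000000):
        print(size, smallest_level_that_fits(size))
```

The sizes in the usage loop are the tick marks on the plot's x-axis.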

Memory Bandwidth vs. Size for Loads on BH, T3E, SX-4

[Figure: bandwidth (MB/s, 0 to 8000) versus size (W, 1000 to 10000000, log scale) for BH 1 stream, T3E 1 stream, and SX-4.]

MAPS

• Useful in its own right for more meaningful machine comparisons at a glance

• Work going forward to port to Compaq TCS1, SX-5, T90, SV1, MTA, Sun HPC 10K, Origin, others?

• Provides input to MetaSim (next)

Meta-Sim

A meta-simulator tool

Meta-Sim

• Takes 2 inputs

– a program

– a description of a machine

• Consumes instrumented trace data “on-the-fly”

– 100 fold slowdown (as opposed to 1M fold!)

• Performs an automated predictive convolution

Meta-Sim

• Models caches and TLB

– any number of levels

– arbitrary sizes, line lengths, associativities

• Does accounting on the Basic Block level

• Looks for memory access patterns
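To make the idea of a configurable cache model concrete, here is a minimal sketch of a set-associative cache with arbitrary size, line length, and associativity, fed by an address stream one reference at a time. It is not Meta-Sim's implementation: the class and function names are hypothetical, and LRU replacement is an assumption, since the slides do not name a replacement policy.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """One cache level with LRU replacement (LRU is an assumption)."""

    def __init__(self, size_words, line_words, ways):
        assert size_words % (line_words * ways) == 0
        self.line_words = line_words
        self.ways = ways
        self.num_sets = size_words // (line_words * ways)
        # Each set maps tag -> True, ordered from least to most recently used.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, word_address):
        """Record one reference; return True on a hit, False on a miss."""
        line = word_address // self.line_words
        lines = self.sets[line % self.num_sets]
        tag = line // self.num_sets
        if tag in lines:
            lines.move_to_end(tag)  # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(lines) >= self.ways:
            lines.popitem(last=False)  # evict the least recently used line
        lines[tag] = True
        return False


def simulate(trace, levels):
    """Consume a trace lazily; each reference stops at the first level that hits."""
    for address in trace:
        for cache in levels:
            if cache.access(address):
                break
    return [(cache.hits, cache.misses) for cache in levels]


if __name__ == "__main__":
    # Parameters read off the Blue Horizon plot annotation, interpreting
    # "16 block" as a 16-word line (an assumption).
    l1 = SetAssociativeCache(size_words=8192, line_words=16, ways=128)
    print(simulate(range(16384), [l1]))
```

Because `simulate` accepts any iterable, the trace can be a generator, so references are consumed as they are produced rather than stored first.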

A (simplistic) Convolution

$$\mathrm{MFLOPS} = \sum_{i=1}^{n} \mathrm{Wt.BB}_i \times \mathrm{Rate\,BB}_i \times \mathrm{Intensity\,BB}_i$$

$\mathrm{Wt.BB}_i$ = % of total memory references

$\mathrm{Rate\,BB}_i$ = sustained rate of memory references

$\mathrm{Intensity\,BB}_i$ = ratio of floating point ops to memory ops
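The convolution can be sketched in a few lines. The example assumes the weight is expressed as a fraction rather than a percentage and the rate is in millions of memory references per second, so that the sum comes out in MFLOPS. The `BasicBlock` type and the numbers in the usage section are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from dataclasses import dataclass


@dataclass
class BasicBlock:
    weight: float     # fraction of total memory references (0.0 to 1.0)
    rate: float       # sustained memory references per second, in millions
    intensity: float  # floating point ops per memory op


def predicted_mflops(blocks):
    """Sum weight * rate * intensity over all basic blocks."""
    return sum(b.weight * b.rate * b.intensity for b in blocks)


if __name__ == "__main__":
    # Hypothetical profile: two basic blocks, not measured data.
    blocks = [
        BasicBlock(weight=0.75, rate=400.0, intensity=0.5),
        BasicBlock(weight=0.25, rate=100.0, intensity=2.0),
    ]
    print(predicted_mflops(blocks))  # 0.75*400*0.5 + 0.25*100*2.0 = 200.0
```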

How to determine rate of memory access for BB?

• sum = sum + a(k)*b(colidx(k))

• Even if only 33% of memory references in a BB fall out to MM, they may slow down the whole BB to the speed of MM accesses

• Why?
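One way to see the effect is with simple arithmetic, under the assumption that references are serviced one after another with no overlap (for example, because the indirect load through colidx(k) must complete before the multiply can proceed). The time per reference is then a weighted sum of per-level times, so the slow references dominate. The rates below are hypothetical and the function name is illustrative.

```python
def effective_rate(fractions_and_rates):
    """Rate of a reference mix when each reference is serviced in turn.

    fractions_and_rates: iterable of (fraction_of_references, rate) pairs.
    """
    time_per_reference = sum(
        fraction / rate for fraction, rate in fractions_and_rates
    )
    return 1.0 / time_per_reference


# Hypothetical rates in arbitrary units (not measured values).
CACHE_RATE = 3000.0
MEMORY_RATE = 300.0

# Two thirds of the references hit cache; one third go to main memory.
mixed = effective_rate([(2 / 3, CACHE_RATE), (1 / 3, MEMORY_RATE)])

if __name__ == "__main__":
    print(round(mixed, 3))
```

With these numbers the block sustains about a quarter of the cache rate. However fast the cache is, under this model the block can never exceed three times the main-memory rate when a third of its references go to main memory.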

Results

NAS FP Kernels

[Figure: predicted versus observed MFLOPS (0 to 300) for cg s, cg w, cg a, ft s, ft w, mg s, mg w.]

Results

Error for NAS FP Kernels

[Figure: % error (axis from -2.00% to 5.00%) for cg s, cg w, cg a, ft s, ft w, mg s, mg w.]

Occam’s Razor

• Only add complexity if required to explain observed phenomena

• Observation: this approach is just as accurate as SMTSIM (Tullsen, Snavely, et al.) but 4 orders of magnitude faster!

Conventional Benchmarks as a Sample of the Performance Spectrum

[Figure: log MW/s (1 to 1000) versus log MW (ticks 1000 to 10000000) for random loads and random stores, with points marked for FT S, FT W, CG S, CG W, CG A, MG S, MG W and annotations "90% L1" and "80% L1".]

Apps Results

NAS Apps

[Figure: predicted versus observed MFLOPS (0 to 600) for bt s, lu s, lu w, sp s, sp w.]

Apps Results

Error for NAS Apps

[Figure: % error (axis from -10.00% to 30.00%) for bt s, lu s, lu w, sp s, sp w.]

Apps as a Sample of the Performance Spectrum (?)

[Figure: log MW/s (1 to 1000) versus log MW (ticks 1000 to 10000000) for random loads and random stores, with points marked for BT S, LU S/W, SP S, SP W and annotation "90% L1".]

Work going forward

• Development of probes à la MAPS for floating point and integer functional unit issue, logical operations, I/O

• Increase sophistication of convolutions as required to fit observed facts

• Big goal: a robust set of metrics and methods for performance modeling and characterization

PMaC Thanks Our Sponsors

• Now includes DOE SciDAC award (SUPREME)

• Support from HPC Users Forum

• DoD HPC Modernization was 1st to fund us and their vision made this work possible
