NAVO MSRC PET Program Towards More Meaningful Machine Comparisons Dr. Allan Snavely PMaC (...
NAVO MSRC PET Program
Towards More Meaningful Machine Comparisons
Dr. Allan Snavely
PMaC (Performance Modeling & Characterization) Group Leader
www.sdsc.edu/PMaC
SDSC
PMaC Mission
• To bring scientific rigor to the art of performance prediction
– for procurement
– for architectural tradeoffs
– for guiding applications to the best-suited machine
– for performance tuning
PMaC Mission
• To bridge the gap between benchmarks and cycle-accurate simulation
– Benchmarks have dubious relevancy to real apps, particularly on future machines
– Cycle-accurate simulations take too long
Projects
• MAPS (Memory Access Patterns)
– memory subsystem & interconnect signatures
• MetaSim
– an on-the-fly simulator for playing “what if?” (4 orders of magnitude faster than cycle-accurate simulation)
• Pseudocode Cache Simulator
• Scientific Application Loop Set
• Terascale Application Information
• IDC HPC List
People
• Dr. Allan Snavely, Group Leader
– Dr. Laura Carrington, Xiaofeng Gao (MAPS)
– Dr. Stuart Johnson (Pseudocode simulator)
– Dr. Larry Carter (senior technical advisor)
– Dr. Wayne Pfeiffer (Scientific Application Loop Set)
– Nicole Wolter (Paraver/Dimemas)
– Dr. Bob Leary (resident mathematician)
What’s wrong with benchmarks?
• May anti-correlate to actual performance [1]
[1] John L. Gustafson, Rajat Todi, "Conventional Benchmarks as a Sample of the Performance Spectrum," Ames Laboratory, USDOE
PMaC Methods
• Performance modeling via separation of concerns
– Machine signatures
– Application profiles
– Convolution methods
Memory Bandwidth vs. Size for Loads on Blue Horizon
[Figure: x-axis: Size (W), 1000 to 10000000, log scale; y-axis: Bandwidth (MB/s), 0 to 7000; series: 1, 2, 3, and 4 streams.]
Memory Bandwidth vs. Size for Loads on Blue Horizon
[Figure: same plot as the previous slide (x-axis: Size (W), 1000 to 10000000, log scale; y-axis: Bandwidth (MB/s), 0 to 7000; series: 1, 2, 3, and 4 streams), annotated with the memory hierarchy:]
• L1: 8192 word, 128 way, 16 block
• L2: 1048576 word, 4 way, 16 block
• TLB: 131072 word, 4KB pages, 2 way
Memory Bandwidth vs. Size for Loads on BH, T3E, SX-4
[Figure: x-axis: Size (W), 1000 to 10000000, log scale; y-axis: Bandwidth (MB/s), 0 to 8000; series: BH 1 stream, T3E 1 stream, SX-4.]
MAPS
• Useful in its own right for more meaningful machine comparisons at a glance
• Work going forward to port to Compaq TCS1, SX-5, T90, SV1, MTA, Sun HPC 10K, Origin, others?
• Provides input to MetaSim (next)
Meta-Sim
A meta-simulator tool
Meta-Sim
• Takes 2 inputs
– a program
– a description of a machine
• Consumes instrumented trace data “on-the-fly”
– 100-fold slowdown (as opposed to 1M-fold!)
• Performs an automated predictive convolution
Meta-Sim
• Models caches and TLB
– any number of levels
– arbitrary sizes, line lengths, associativities
• Does accounting on the Basic Block level
• Looks for memory access patterns
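As an illustration of what modeling a multi-level cache from an address trace can look like, here is a minimal sketch. It is not Meta-Sim's code: the class name, the `simulate` helper, and the LRU replacement policy are assumptions made for the example, and the slide's "16 block" is read here as a 16-word line, which is also an assumption.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Hypothetical set-associative cache with LRU replacement (sizes in words)."""

    def __init__(self, size_words, line_words, ways):
        if size_words % (line_words * ways) != 0:
            raise ValueError("size must be a multiple of line_words * ways")
        self.line_words = line_words
        self.ways = ways
        self.num_sets = size_words // (line_words * ways)
        # One ordered dict per set; insertion order tracks recency of use.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, word_addr):
        """Record one reference; return True on a hit, False on a miss."""
        line = word_addr // self.line_words
        cache_set = self.sets[line % self.num_sets]
        if line in cache_set:
            cache_set.move_to_end(line)  # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(cache_set) >= self.ways:
            cache_set.popitem(last=False)  # evict the least recently used line
        cache_set[line] = True
        return False


def simulate(trace, levels):
    """Send each address down the hierarchy; count references that miss every level."""
    to_memory = 0
    for addr in trace:
        for cache in levels:
            if cache.access(addr):
                break
        else:
            to_memory += 1
    return to_memory


# Example: the L1 and L2 sizes from the Blue Horizon slide, unit-stride trace read twice.
l1 = SetAssociativeCache(size_words=8192, line_words=16, ways=128)
l2 = SetAssociativeCache(size_words=1048576, line_words=16, ways=4)
main_memory_refs = simulate(list(range(4096)) * 2, [l1, l2])
print(main_memory_refs, l1.hits, l1.misses)
```

Because each level is just an object in a list, adding a level or changing a size, line length, or associativity is a one-line change, which is the kind of "what if?" flexibility the bullets describe.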
A (simplistic) Convolution
$$\mathrm{MFLOPS} = \sum_{i=1}^{n} \mathrm{Wt.\,BB}_i \times \mathrm{Rate\,BB}_i \times \mathrm{Intensity\,BB}_i$$
Wt. BB_i = % of total memory references
Rate BB_i = sustained rate of memory references
Intensity BB_i = ratio of floating point ops to memory ops
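The simplistic convolution above can be sketched in a few lines. This is only an illustration of the formula as written on the slide: the function name, the dictionary keys, and the numbers in the example are hypothetical, and rates are assumed to be in millions of memory references per second so that the result comes out in MFLOPS.

```python
def predict_mflops(basic_blocks):
    """Sum weight * rate * intensity over all basic blocks.

    Each block is a dict with hypothetical keys:
      weight    - fraction of the program's memory references (0..1)
      rate      - sustained memory reference rate, in millions per second
      intensity - floating point ops per memory op
    """
    total_weight = sum(bb["weight"] for bb in basic_blocks)
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError("basic block weights should sum to 1")
    return sum(bb["weight"] * bb["rate"] * bb["intensity"] for bb in basic_blocks)


# Made-up two-block program: a cache-friendly block and a memory-bound block.
blocks = [
    {"weight": 0.75, "rate": 400.0, "intensity": 0.5},
    {"weight": 0.25, "rate": 40.0, "intensity": 2.0},
]
print(predict_mflops(blocks))  # 0.75*400*0.5 + 0.25*40*2.0 = 170.0
```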
How to determine rate of memory access for BB?
• sum = sum + a(k)*b(colidx(k))
• Even if only 33% of memory references in a BB fall out to MM, they may slow down the whole BB to the speed of MM accesses
• Why?
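One way to see the effect, though not necessarily the answer the speaker had in mind, is that access times add up while rates do not. The sketch below assumes references are serviced one after another with no overlap; the function names and the two rates (1000 and 50 million references per second) are made up for illustration.

```python
def effective_rate(fractions_and_rates):
    """Overall rate when each class of reference is serviced serially.

    fractions_and_rates: list of (fraction_of_references, rate) pairs.
    Time per reference is sum(fraction / rate), so the overall rate is
    the weighted harmonic mean of the per-level rates.
    """
    time_per_ref = sum(frac / rate for frac, rate in fractions_and_rates)
    return 1.0 / time_per_ref


def time_share(fractions_and_rates, index):
    """Fraction of total time spent on the class at the given index."""
    times = [frac / rate for frac, rate in fractions_and_rates]
    return times[index] / sum(times)


# Hypothetical block: 67% of references hit cache, 33% go to main memory (MM).
mix = [(0.67, 1000.0), (0.33, 50.0)]
print(effective_rate(mix))  # far closer to the MM rate than to the cache rate
print(time_share(mix, 1))   # share of time spent waiting on MM
```

With these made-up numbers, the third of references that go to main memory accounts for roughly nine tenths of the block's time, so the block as a whole runs at close to main-memory speed.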
Results
[Figure: NAS FP Kernels. Predicted MFLOPS vs. Observed MFLOPS (y-axis: MFLOPS, 0 to 300) for cg s, cg w, cg a, ft s, ft w, mg s, mg w.]
Results
[Figure: Error for NAS FP Kernels. % error (y-axis: -2.00% to 5.00%) for cg s, cg w, cg a, ft s, ft w, mg s, mg w.]
Occam’s Razor
• Only add complexity if required to explain observed phenomena
• Observation: this approach is just as accurate as SMTSIM (Tullsen, Snavely, et al.) but 4 orders of magnitude faster!
Conventional Benchmarks as a Sample of the Performance Spectrum
[Figure: x-axis: log MW, 1000 to 10000000; y-axis: log MW/s, 1 to 1000; series: Random Loads, Random Stores; labeled points: FT S, FT W, CG S, CG W, CG A, MG S, MG W; annotations: "90% L1", "80% L1".]
Apps Results
[Figure: NAS Apps. Predicted MFLOPS vs. Observed MFLOPS (y-axis: MFLOPS, 0 to 600) for bt s, lu s, lu w, sp s, sp w.]
Apps Results
[Figure: Error for NAS Apps. % error (y-axis: -10.00% to 30.00%) for bt s, lu s, lu w, sp s, sp w.]
Apps as a Sample of the Performance Spectrum (?)
[Figure: x-axis: log MW, 1000 to 10000000; y-axis: log MW/s, 1 to 1000; series: Random Loads, Random Stores; labeled points: BT S, LU S/W, SP S, SP W; annotation: "90% L1".]
Work going forward
• Development of probes à la MAPS for floating point and integer functional unit issue, logical operations, I/O
• Increase sophistication of convolutions as required to fit observed facts
• Big goal: a robust set of metrics and methods for performance modeling and characterization
PMaC Thanks Our Sponsors
• Now includes DOE SciDAC award (SUPREME)
• Support from HPC Users Forum
• DoD HPC Modernization was 1st to fund us and their vision made this work possible