CMP Design Choices
Finding Parameters that Impact CMP Performance
Sam Koblenski and Peter McClone
Outline
Introduction
Assumptions
Plackett & Burman Analysis
  Simulation methods
  Statistical Design
  Plackett & Burman Results
Mean Value Analysis
  MVA Implementation
  MVA Results
  AMVA Implementation
  AMVA Results
Complementary Results
Conclusions
Introduction
2 part study
Design space is huge, how can we reduce it?
Method 1
  Plackett & Burman (PB) Analysis finds critical parameters
  Design uses extreme values of parameters
  Detailed architecture design can focus on a few parameters
Introduction (cont.)
Method 2
  Mean Value Analysis Model of a CMP
  Simply designed to compute throughput
  Design choices can be narrowed down quickly
  Intuition is gained and patterns/parameter relationships identified
Assumptions - PB Design
In-Order approximated as OoO with small window
Die Size = 300 mm² (16 MB Cache @ 65 nm)
L2 Cache Size expanded to fill the die
  Discrete sizes: 4, 8, 12 MB
  Associativity can be non-power-of-2
Core size measured in Cache Byte Equivalents (CBE):

Pipeline      Width  CBE
In-Order      1      50 kB
In-Order      4      100 kB
Out-of-Order  1      75 kB
Out-of-Order  4      250 kB
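The area budget above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `l2_size_mb` is hypothetical, L1 cache area is ignored, 1 MB is taken as 1024 kB, and the L2 is assumed to round down to the largest discrete size that still fits. The slides do not spell out the rounding rule.

```python
# Die budget expressed in cache-byte equivalents: 16 MB of cache fills the die.
DIE_CBE_KB = 16 * 1024

# Core sizes from the CBE table, keyed by (pipeline organization, width).
CORE_CBE_KB = {
    ("in-order", 1): 50,
    ("in-order", 4): 100,
    ("out-of-order", 1): 75,
    ("out-of-order", 4): 250,
}

# Discrete L2 sizes allowed in the study.
L2_SIZES_MB = (4, 8, 12)


def l2_size_mb(cores, organization, width):
    """Largest discrete L2 size (MB) that fits next to the cores, or None.

    Assumption: L1 area is not counted and the L2 rounds down to a
    discrete size; neither detail is confirmed by the slides.
    """
    core_area_kb = cores * CORE_CBE_KB[(organization, width)]
    remaining_mb = (DIE_CBE_KB - core_area_kb) / 1024
    fitting = [size for size in L2_SIZES_MB if size <= remaining_mb]
    return max(fitting) if fitting else None


if __name__ == "__main__":
    for cores, org, width in [(2, "in-order", 1), (16, "out-of-order", 4)]:
        print(cores, org, width, "->", l2_size_mb(cores, org, width), "MB")
```

Under these assumptions, both extreme core configurations of the study still leave room for the largest L2 size.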
Simulation Methodology
Simics with Ruby & Opal
16P sims used cache warmup files
2P sims ran for more transactions
Attempted OLTP and JBB benchmarks

Benchmark  Processors  Transactions
OLTP       2           200
OLTP       16          100
JBB        2           20000
JBB        16          10000
Plackett & Burman Design
Motivation
  Narrow a huge design space
  Minimize simulation runs (experiments)
Preliminaries
  Performance Measure
  Extreme Parameter Values
  Number of Parameters (N < 4Xn-1)
PB Design Example

A    B    C    D    E    F    G    Time
+    +    +    -    +    -    -    9
-    +    +    +    -    +    -    11
-    -    +    +    +    -    +    2
+    -    -    +    +    +    -    1
-    +    -    -    +    +    +    9
+    -    +    -    -    +    +    74
+    +    -    +    -    -    +    7
-    -    -    -    -    -    -    4
-    -    -    +    -    +    +    17
+    -    -    -    +    -    +    76
+    +    -    -    -    +    -    6
-    +    +    -    -    -    +    31
+    -    +    +    -    -    -    19
-    +    -    +    +    -    -    33
-    -    +    -    +    +    -    6
+    +    +    +    +    +    +    112
191  19   111  -13  79   55   239  (effects)
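The effect of each parameter is the sum of the run times, each multiplied by that parameter's sign in the run. A minimal sketch follows, assuming the standard construction of a PB design with foldover: seven cyclic shifts of the first row, one row of all minus signs, then the same eight rows with every sign reversed. The function names are hypothetical. With that construction the example's run times reproduce the effects row 191, 19, 111, -13, 79, 55, 239.

```python
# First row of the 7-parameter PB design and the 16 measured run times
# from the example (rows in the standard foldover order).
FIRST_ROW = [+1, +1, +1, -1, +1, -1, -1]
TIMES = [9, 11, 2, 1, 9, 74, 7, 4, 17, 76, 6, 31, 19, 33, 6, 112]
NAMES = ["A", "B", "C", "D", "E", "F", "G"]


def pb_foldover_design(first_row):
    """Build a PB design with foldover from its first row."""
    n = len(first_row)
    # Each subsequent row is the previous one rotated right by one position.
    base = [first_row[-i:] + first_row[:-i] for i in range(n)]
    base.append([-1] * n)  # final base row: every parameter at its low value
    # Foldover: repeat all rows with every sign reversed.
    return base + [[-s for s in row] for row in base]


def effects(design, responses):
    """Sum of sign * response for each parameter column."""
    columns = range(len(design[0]))
    return [sum(row[j] * y for row, y in zip(design, responses)) for j in columns]


def rank_parameters(names, effect_values):
    """Order parameters by effect magnitude, largest first (sign is ignored)."""
    return [name for name, _ in
            sorted(zip(names, effect_values), key=lambda p: abs(p[1]), reverse=True)]


if __name__ == "__main__":
    design = pb_foldover_design(FIRST_ROW)
    values = effects(design, TIMES)
    print(values)
    print(rank_parameters(NAMES, values))
```

Only the magnitude matters for ranking, so in this example G, A and C are the parameters that deserve detailed study.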
PB Design Parameter Values

Parameter              Low Value (-)   High Value (+)
Number of Cores        2               16
Pipeline Organization  In-Order        Out-of-Order
Pipeline Width         1               4
L1 Cache Size          16 kB           128 kB
L1 Associativity       Direct Mapped   32-Way
L2 Cache Size          Die Area – Core Area
L2 Associativity       Direct Mapped   32-Way
L2 Banks               2               32
L2 Latency             50 Cycles       12 Cycles
L2 Directory Latency   25 Cycles       6 Cycles
Pin Bandwidth          400             10000
Memory Latency         300 Cycles      100 Cycles
PB Results
Extreme values stressed the simulator
Have not yet completed an entire set of runs
Possibly necessary to build a custom L2 network for each run
PB Results for JBB
[Bar chart, vertical axis 0 to 20, one bar per parameter: Cores, In/Out, Width, L1 Size, L1 Assoc, L2 Assoc, L2 Banks, L2 Latency, Directory Latency, Pin BW, Memory Latency]
Assumptions - MVA
Distribution of time between memory requests is exponential
Processor cores exhibit the same average behavior with respect to their service times and miss rates.
Doubling the size of the cache scales the miss rate by a factor of 1/√2
An in-order core takes approximately the same area as 50 kB of cache
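The square-root rule generalizes to any size ratio: the miss rate scales with (new size / old size) raised to the power -1/2. A small sketch, assuming a known baseline miss rate at a known cache size; the function name and the numbers in the usage lines are hypothetical.

```python
import math


def scaled_miss_rate(base_miss_rate, base_size_kb, new_size_kb):
    """Miss rate at new_size_kb under the square-root rule.

    Each doubling of the cache multiplies the miss rate by 1/sqrt(2),
    so a size ratio r multiplies it by r ** -0.5.
    """
    return base_miss_rate * (new_size_kb / base_size_kb) ** -0.5


if __name__ == "__main__":
    base = 0.04  # hypothetical miss rate at 4 MB
    for size_mb in (4, 8, 12):
        print(size_mb, "MB:", scaled_miss_rate(base, 4 * 1024, size_mb * 1024))
    # One doubling is exactly the 1/sqrt(2) factor from the assumption.
    print(math.isclose(scaled_miss_rate(base, 4096, 8192), base / math.sqrt(2)))
```

Combined with the 50 kB core area, this shows the trade-off the model explores: every added in-order core removes cache area and so raises the miss rate slightly.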
MVA Design
Simple Closed Model:
MVA Design
Two phases of this Model design
First: Use the exact MVA equations
  Use average time between memory access as an application parameter
  Solve for throughput
Second: Use Approximate MVA (AMVA)
  Use an iterative method to converge on this service time
  Solve for throughput
Exact MVA
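To solve the MVA equations, we determine the mean residence time at each service center:
  $R_p$: processor/L1 residence time
  $R_{L2}$: L2 residence time
  $R_M$: memory residence time
The case with one core is trivial. Use this case to solve for additional cores:
  $R_{n,p} = D_p \, (1 + Q_{n-1,p})$
where $D_p$ is the service demand at the processor and $Q_{n-1,p}$ is the mean queue length there with $n-1$ cores, so $Q_{0,p} = 0$ and $R_{1,p} = D_p$.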
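The recursion can be sketched as the textbook exact MVA loop for a closed network. The residence-time equation is the one above, applied to every center; the throughput step (Little's law, X = n / total residence time) and the queue-length update are the standard companions of that equation rather than something stated on the slides. The service demands in the usage lines are hypothetical placeholders, not measured values.

```python
def exact_mva(demands, n_customers, think_time=0.0):
    """Exact MVA for a closed network of queueing centers.

    demands: service demand per center (e.g. processor/L1, L2, memory).
    n_customers: population of the closed network (here, the core count).
    Returns (throughput, residence_times) at the full population.
    """
    queue = [0.0] * len(demands)  # with zero customers every queue is empty
    throughput = 0.0
    resid = list(demands)
    for n in range(1, n_customers + 1):
        # R_{n,k} = D_k * (1 + Q_{n-1,k})
        resid = [d * (1.0 + q) for d, q in zip(demands, queue)]
        # Little's law applied to the whole network
        throughput = n / (think_time + sum(resid))
        # Little's law applied to each center
        queue = [throughput * r for r in resid]
    return throughput, resid


if __name__ == "__main__":
    # Hypothetical demands in cycles: processor/L1, L2, memory.
    demands = [40.0, 12.0, 100.0]
    for cores in (1, 2, 4, 8, 16):
        x, r = exact_mva(demands, cores)
        print(cores, "cores: throughput", x, "residence", r)
```

Throughput saturates at the reciprocal of the largest demand, which makes the bottleneck center easy to spot.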
Exact MVA results
Using data from simulation runs throughput was calculated
  Miss rates, number of memory requests
Results are erratic
  Not consistent with simulation results
  Source of the problem is most likely processor service time!
Approximate MVA Design
An iterative method can be used to converge on a service time
  Uses total R as an input parameter
Iterative method works well with approximate MVA
  Goal is to match total average residence time of a memory request
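One way such an iteration could look is sketched below. The slides do not say which approximation or search was used, so both pieces are assumptions: Schweitzer's approximation (queue length with n-1 customers estimated as (n-1)/n times the queue length with n) stands in for AMVA, and a bisection search adjusts the processor service demand until the model's total residence time matches the measured one. Total R is taken as the sum over all centers, and all function names are hypothetical.

```python
def schweitzer_amva(demands, n_customers, tol=1e-10, max_iter=10_000):
    """Schweitzer approximate MVA; returns (throughput, residence_times).

    Returns None if the fixed-point iteration does not converge.
    Assumes at least one demand is positive.
    """
    k = len(demands)
    queue = [n_customers / k] * k  # initial guess: customers spread evenly
    scale = (n_customers - 1) / n_customers
    for _ in range(max_iter):
        resid = [d * (1.0 + scale * q) for d, q in zip(demands, queue)]
        throughput = n_customers / sum(resid)
        new_queue = [throughput * r for r in resid]
        if max(abs(a - b) for a, b in zip(new_queue, queue)) < tol:
            return throughput, resid
        queue = new_queue
    return None


def fit_processor_demand(target_r, other_demands, n_customers, steps=60):
    """Bisect on the processor demand so total residence time hits target_r.

    Returns the fitted demand, or None when the target cannot be reached
    (even a zero processor demand gives a larger total residence time).
    """
    def total_r(d_p):
        result = schweitzer_amva([d_p] + list(other_demands), n_customers)
        return None if result is None else sum(result[1])

    lo, hi = 0.0, target_r  # total R always exceeds the demand, so hi overshoots
    r_lo = total_r(lo)
    if r_lo is None or r_lo > target_r:
        return None
    for _ in range(steps):
        mid = (lo + hi) / 2.0
        r_mid = total_r(mid)
        if r_mid is None:
            return None
        if r_mid < target_r:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


if __name__ == "__main__":
    # Hypothetical: L2 and memory demands in cycles, 4 cores, measured total R.
    print(fit_processor_demand(600.0, [12.0, 100.0], 4))
    print(fit_processor_demand(50.0, [12.0, 100.0], 4))  # unreachable target
```

The search fails cleanly when the measured total residence time is smaller than what the L2 and memory centers alone produce, which is one way a model and a parameter set can be mutually inconsistent.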
Approximate MVA Results
Convergence using the AMVA equations does not always occur
Total measured residence time cannot be reached with this model and parameter set.
Variation of input values without convergence implies flaws in the model structure
There is a complex relationship between the memory system and the rate at which a core issues requests that must be modeled
Complementary Results
Initial goal: use PB results to identify the parameters the MVA model should focus on
Results from both approaches could cross-verify correctness
Conclusions
Simics has a STEEP learning curve
  <5 weeks is not enough time for valid/any results
Refinement of a PB Design leads to long lead times on valid results
CMPs complicate the relationship between cores and memory subsystem
Design methodologies that focus simulation runs are necessary
More results and conclusions to follow
Questions?