CMP Design Choices

Finding Parameters that Impact CMP Performance

Sam Koblenski and Peter McClone

Outline

Introduction
Assumptions
Plackett & Burman Analysis
  Simulation methods
  Statistical Design
  Plackett & Burman Results
Mean Value Analysis
  MVA Implementation
  MVA Results
  AMVA Implementation
  AMVA Results
Complementary Results
Conclusions

Introduction

Two-part study
Design space is huge; how can we reduce it?

Method 1
  Plackett & Burman (PB) Analysis finds critical parameters
  Design uses extreme values of parameters
  Detailed architecture design can focus on a few parameters

Introduction (cont.)

Method 2
  Mean Value Analysis Model of a CMP
  Simply designed to compute throughput
  Design choices can be narrowed down quickly
  Intuition is gained and patterns/parameter relationships identified

Assumptions - PB Design

In-Order approximated as OoO with small window
Die Size = 300 mm² (16 MB Cache @ 65 nm)
L2 Cache Size expanded to fill the die
  Discrete sizes: 4, 8, 12 MB
  Associativity can be non-power-of-2
Core size measured in Cache Byte Equivalents (CBE):

Pipeline       Width   CBE
In-Order       1       50 kB
In-Order       4       100 kB
Out-of-Order   1       75 kB
Out-of-Order   4       250 kB
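The area bookkeeping above can be sketched in a few lines. This is a minimal illustration, not the authors' procedure: it assumes 1 MB = 1024 kB, assumes the L2 is rounded down to the largest discrete size that fits, and introduces a hypothetical `extra_kb_per_core` knob for per-core area the CBE table does not cover (the slides do not say how such area was counted).

```python
# Core areas in cache byte equivalents (kB), taken from the CBE table.
CBE_KB = {
    ("in-order", 1): 50,
    ("in-order", 4): 100,
    ("out-of-order", 1): 75,
    ("out-of-order", 4): 250,
}

DIE_BUDGET_KB = 16 * 1024      # 16 MB of cache-equivalent area (assumes 1 MB = 1024 kB)
L2_SIZES_MB = (12, 8, 4)       # discrete L2 sizes, largest first


def l2_size_mb(cores, pipeline, width, extra_kb_per_core=0):
    """Return the largest discrete L2 size (MB) that fits beside the cores.

    extra_kb_per_core is a hypothetical parameter, not from the slides.
    The round-down rule is also an assumption.
    """
    core_kb = cores * (CBE_KB[(pipeline, width)] + extra_kb_per_core)
    free_kb = DIE_BUDGET_KB - core_kb
    for size_mb in L2_SIZES_MB:
        if size_mb * 1024 <= free_kb:
            return size_mb
    raise ValueError("cores leave no room for the smallest L2 size")


print(l2_size_mb(2, "in-order", 1))        # 12
print(l2_size_mb(16, "out-of-order", 4))   # 12
```

Under these assumptions even sixteen 4-wide out-of-order cores (4000 kB) leave room for a 12 MB L2, so the smaller discrete sizes only appear once additional per-core area is charged against the die.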

Simulation Methodology

Simics with Ruby & Opal
16P sims used cache warmup files
2P sims ran for more transactions
Attempted OLTP and JBB benchmarks

Benchmark   Processors   Transactions
OLTP        2            200
OLTP        16           100
JBB         2            20000
JBB         16           10000

Plackett & Burman Design

Motivation
  Narrow a huge design space
  Minimize simulation runs (experiments)
Preliminaries
  Performance Measure
  Extreme Parameter Values
  Number of Parameters (N < 4Xn-1)

PB Design Example

  A    B    C    D    E    F    G   Time
  +    +    +    -    +    -    -      9
  -    +    +    +    -    +    -     11
  -    -    +    +    +    -    +      2
  +    -    -    +    +    +    -      1
  -    +    -    -    +    +    +      9
  +    -    +    -    -    +    +     74
  +    +    -    +    -    -    +      7
  -    -    -    -    -    -    -      4
  -    -    -    +    -    +    +     17
  +    -    -    -    +    -    +     76
  +    +    -    -    -    +    -      6
  -    +    +    -    -    -    +     31
  +    -    +    +    -    -    -     19
  -    +    -    +    +    -    -     33
  -    -    +    -    +    +    -      6
  +    +    +    +    +    +    +    112

Effect of each parameter (sum of sign × Time down the column):
191   19  111  -13   79   55  239
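The effect row can be reproduced with a short script. This is a sketch assuming the standard PB construction with foldover: seven cyclic right shifts of the first row, one all-minus row, then the sign-inverted copy of those eight rows. The function and variable names are illustrative only.

```python
def pb_foldover(first_row):
    """Build a PB design with foldover from its first row of +1/-1 signs."""
    n = len(first_row)
    # Cyclic right shifts of the first row (shift 0 is the row itself).
    rows = [first_row[-i:] + first_row[:-i] for i in range(n)]
    rows.append([-1] * n)                          # closing all-minus row
    rows += [[-s for s in row] for row in rows]    # foldover: invert every sign
    return rows


names = list("ABCDEFG")
first_row = [+1, +1, +1, -1, +1, -1, -1]
# Execution times from the example, in run order.
times = [9, 11, 2, 1, 9, 74, 7, 4, 17, 76, 6, 31, 19, 33, 6, 112]

design = pb_foldover(first_row)
# Effect of a parameter: sum over runs of (sign in that column) * (time).
effects = [sum(row[j] * t for row, t in zip(design, times))
           for j in range(len(names))]
print(effects)   # [191, 19, 111, -13, 79, 55, 239]

# Only the magnitude matters when ranking parameters.
ranking = [name for _, name in
           sorted(zip((abs(e) for e in effects), names), reverse=True)]
print(ranking)   # ['G', 'A', 'C', 'E', 'F', 'B', 'D']
```

In this example G, A and C dominate, which is the kind of short list the PB step is meant to hand to detailed design.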

PB Design Parameter Values

Parameter               Low Value (-)    High Value (+)
Number of Cores         2                16
Pipeline Organization   In-Order         Out-of-Order
Pipeline Width          1                4
L1 Cache Size           16 kB            128 kB
L1 Associativity        Direct Mapped    32-Way
L2 Cache Size           Die Area – Core Area
L2 Associativity        Direct Mapped    32-Way
L2 Banks                2                32
L2 Latency              50 Cycles        12 Cycles
L2 Directory Latency    25 Cycles        6 Cycles
Pin Bandwidth           400              10000
Memory Latency          300 Cycles       100 Cycles
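To make the saving in simulation runs concrete, the sketch below counts runs for a PB design against a full two-level factorial. It relies on the standard property that a PB design with X runs (X a multiple of 4) covers up to X - 1 parameters. Treating the table as 11 independently varied parameters (L2 Cache Size being derived from the others) and using foldover are assumptions here, not statements from the slides.

```python
def pb_runs(num_params, foldover=True):
    """Runs needed by a PB design for num_params two-level parameters.

    X is the smallest multiple of 4 strictly greater than num_params;
    foldover doubles the run count.
    """
    x = 4 * (num_params // 4 + 1)
    return 2 * x if foldover else x


num_params = 11                      # assumed count of varied parameters
print(pb_runs(num_params, False))    # 12
print(pb_runs(num_params))           # 24
print(2 ** num_params)               # 2048 runs for a full factorial
```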

PB Results

Extreme values stressed the simulator
Have not completed an entire set of runs yet
Possibly necessary to build a custom L2 network for each run

PB Results for JBB

[Bar chart: one bar per parameter, vertical axis from 0 to 20. Parameters: Cores, In/Out, Width, L1 Size, L1 Assoc, L2 Assoc, L2 Banks, L2 Latency, Directory Latency, Pin BW, Memory Latency]

Assumptions - MVA

Distribution of time between memory requests is exponential

Processor cores exhibit the same average behavior with respect to their service times and miss rates.

Doubling the size of the cache scales the miss rate by a factor of 1/√2

An in-order core takes approximately the same area as 50 kB of cache
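The cache-size rule amounts to a miss rate proportional to 1/√size. A minimal sketch, with a hypothetical baseline (10% miss rate at 4 MB) chosen purely for illustration:

```python
import math


def scaled_miss_rate(base_miss_rate, base_size_kb, size_kb):
    """Miss rate at size_kb, given a measured rate at base_size_kb.

    Applies the square-root rule: each doubling of the cache
    multiplies the miss rate by 1/sqrt(2).
    """
    return base_miss_rate * math.sqrt(base_size_kb / size_kb)


# Hypothetical baseline, not a measured value from the study.
base_rate, base_kb = 0.10, 4096
for size_kb in (4096, 8192, 16384):
    print(size_kb, round(scaled_miss_rate(base_rate, base_kb, size_kb), 4))
```

Quadrupling the cache halves the miss rate under this rule, so area spent on cache shows diminishing returns compared with area spent on additional cores.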

MVA Design

Simple Closed Model:

MVA Design (cont.)

Two phases of this model design
First: Use the exact MVA equations
  Use average time between memory accesses as an application parameter
  Solve for throughput
Second: Use Approximate MVA (AMVA)
  Use an iterative method to converge on the processor service time
  Solve for throughput

Exact MVA

To solve the MVA equations, we determine the mean residence time at all service centers:
  $R_p$: processor/L1 residence time
  $R_{L2}$: L2 residence time
  $R_M$: memory residence time

The case with one core is trivial. Use this case to solve for additional cores:
  $R_{n,p} = D_p \, (1 + Q_{n-1,p})$
where $D_p$ is the service demand at the processor and $Q_{n-1,p}$ is the mean queue length there with $n-1$ cores.
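The recursion can be sketched as the textbook exact MVA algorithm for a closed network of queueing centers, here with three centers (processor/L1, L2, memory) and no separate think time. The service demands in the example are hypothetical placeholders, not values from the simulation runs.

```python
def exact_mva(demands, n_cores):
    """Exact MVA for a closed network of queueing service centers.

    demands: service demand D_k at each center, per memory request.
    Returns (throughput, residence times) with n_cores customers.
    """
    if n_cores < 1:
        raise ValueError("need at least one core")
    queue = [0.0] * len(demands)            # Q_{0,k} = 0: empty system
    for n in range(1, n_cores + 1):
        # R_{n,k} = D_k * (1 + Q_{n-1,k})
        residence = [d * (1.0 + q) for d, q in zip(demands, queue)]
        throughput = n / sum(residence)     # X_n = n / sum_k R_{n,k}
        queue = [throughput * r for r in residence]   # Little's law
    return throughput, residence


# Hypothetical demands for [processor/L1, L2, memory], arbitrary time units.
demands = [1.0, 0.5, 0.5]
for n in (1, 2, 4, 8, 16):
    x, _ = exact_mva(demands, n)
    print(n, round(x, 4))
```

With one core the residence times equal the demands, which is the trivial base case; each additional core only needs the queue lengths from the previous step.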

Exact MVA results

Throughput was calculated using data from simulation runs
  Miss rates, number of memory requests
Results are erratic
  Not consistent with simulation results
  Source of the problem is most likely processor service time!

Approximate MVA Design

An iterative method can be used to converge on a service time
  Uses total R as an input parameter
Iterative method works well with approximate MVA
  Goal is to match total average residence time of a memory request
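One way such an iteration could look is sketched below. It is not necessarily the authors' procedure: it assumes the Schweitzer approximation for AMVA (queue length seen on arrival ≈ (N-1)/N of the mean queue length) and a bisection search on the processor service demand until the total residence time matches a target. All names and numbers are hypothetical.

```python
def amva(demands, n, tol=1e-10, max_iter=10000):
    """Schweitzer approximate MVA; returns (throughput, residence times)."""
    queue = [n / len(demands)] * len(demands)   # start with an even spread
    for _ in range(max_iter):
        residence = [d * (1.0 + (n - 1) / n * q)
                     for d, q in zip(demands, queue)]
        throughput = n / sum(residence)
        new_queue = [throughput * r for r in residence]
        if max(abs(a - b) for a, b in zip(new_queue, queue)) < tol:
            return throughput, residence
        queue = new_queue
    raise RuntimeError("AMVA iteration did not converge")


def fit_processor_demand(target_r, d_l2, d_mem, n, steps=100):
    """Bisect on the processor demand so total residence time = target_r.

    d_l2 and d_mem must be positive. Returns None when the target lies
    below what the L2 and memory centers alone already require.
    """
    def total_r(d_p):
        return sum(amva([d_p, d_l2, d_mem], n)[1])

    lo, hi = 0.0, target_r        # total R >= d_p, so target_r bounds d_p
    if total_r(lo) > target_r:
        return None
    for _ in range(steps):
        mid = (lo + hi) / 2.0
        if total_r(mid) < target_r:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


print(fit_processor_demand(10.0, 0.5, 0.5, 4))   # fitted processor demand
print(fit_processor_demand(0.5, 0.5, 0.5, 1))    # None: target unreachable
```

The `None` case shows one way a measured total residence time can be out of reach for a given model and parameter set: the other service centers alone already exceed it.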

Approximate MVA Results

Convergence using the AMVA equations does not always occur

Total measured residence time cannot be reached with this model and parameter set.

Variation of input values without convergence implies flaws in the model structure

There is a complex relationship between the memory system and the rate at which a core issues requests that must be modeled

Complementary Results

The initial goal was to use the PB results to find the parameters to focus on in the MVA model

Results from both approaches could cross-verify correctness

Conclusions

Simics has a STEEP learning curve
  <5 weeks is not enough time for valid/any results

Refinement of a PB Design leads to long lead times on valid results

CMPs complicate the relationship between cores and memory subsystem

Design methodologies that focus simulation runs are necessary

More results and conclusions to follow

Questions

Questions?
