Dynamically Trading Frequency for Complexity in a GALS Microprocessor Steven Dropsho, Greg Semeraro,...

Dynamically Trading Frequency for Complexity in a GALS Microprocessor

Steven Dropsho, Greg Semeraro, David H. Albonesi, Grigorios Magklis, Michael L. Scott

University of Rochester

The gist of the paper…

Radical idea: Trade off frequency and hardware complexity dynamically at runtime rather than statically at design time

The new twist: A Globally-Asynchronous, Locally-Synchronous (GALS) microarchitecture is key to making this worthwhile

Application phase behavior

Varying behavior over time

[Sherwood, Sair, Calder, ISCA 2003]

Can exploit to save power

[Figure: gcc behavior over time: IPC, L1I misses, L1D misses, L2 misses, branch mispredictions, and energy per interval]

[Buyuktosunoglu, et al., GLSVLSI 2001]

adaptive issue queue

What about performance?

Lower power and faster access time!

RAM delay
entries:         32    24    16    8
relative delay:  1.0   0.77  0.52  0.31

CAM delay
entries:         32    24    16    8
relative delay:  1.0   0.77  0.55  0.34

[Buyuktosunoglu, GLSVLSI 2001]


How do we exploit the faster speed?

Variable latency

Increase frequency when downsizing

Decrease frequency when upsizing


[Albonesi, ISCA 1998]

[Figure: block diagram of a processor driven by a single clock. Components shown: Fetch Unit, Br Pred, L1 I-Cache; Dispatch, Rename, ROB; integer and FP Issue Queues with ALUs & RF; Ld/St Unit, L1 D-Cache, L2 Cache, Main Memory]


[Albonesi, ISCA 1998]

[Figure: bar chart of average TPI (ns), y-axis 0.0 to 1.2, comparing Best Conventional and Process-level Adaptive for m88ksim, gcc, compress, li, ijpeg, perl, vortex, airshed, stereo, radar, appcg, tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5, and the average]

Enter GALS…

[Figure: processor block diagram partitioned into Front-end, Integer, FP, Memory, and External domains. Components shown: Fetch Unit, Br Pred, L1 I-Cache, Issue Queues, ALUs & RF, Ld/St Unit, L1 D-Cache, L2 Cache, Main Memory]

[Semeraro et al., HPCA 2002][Iyer and Marculescu, ISCA 2002]

Outline

Motivation and background
Adaptive GALS microarchitecture
Control mechanisms
Evaluation methodology
Results
Conclusions and future work

Adaptive GALS microarchitecture

[Figure: GALS block diagram in which Br Pred, L1 I-Cache, L1 D-Cache, L2 Cache, and the Issue Queues are each drawn in several copies. Other labels shown: Fetch Unit, ALUs & RF, Ld/St Unit, Main Memory, Front-end Domain, Memory Domain, External Domain]

Adaptive GALS operation

[Figure: the adaptive GALS block diagram in operation, with Br Pred, L1 I-Cache, L1 D-Cache, L2 Cache, and the Issue Queues each drawn in several copies. Other labels shown: Fetch Unit, ALUs & RF, Ld/St Unit, Main Memory, Front-end Domain, Memory Domain, External Domain]

Resizable cache organization

Access A part first, then B part on a miss
Swap A and B blocks on an A miss, B hit
Select A/B split according to application phase behavior
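The A/B access rule above can be illustrated with a minimal sketch. It assumes a single cache set modeled as one recency-ordered list, with the first `a_ways` positions forming the A part; the class name `ABSet`, the 4-way default, and the fill-into-A policy on a miss are illustrative assumptions, not details taken from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class ABSet:
    """One cache set whose ways are split into an A part and a B part."""
    a_ways: int                                  # ways assigned to the A part
    total_ways: int = 4                          # hypothetical associativity
    blocks: list = field(default_factory=list)   # tags, most recently used first

    def access(self, tag):
        """Return 'A', 'B', or 'miss' and update the recency order."""
        if tag in self.blocks:
            position = self.blocks.index(tag)
            self.blocks.remove(tag)
            outcome = "A" if position < self.a_ways else "B"
        else:
            outcome = "miss"
            if len(self.blocks) == self.total_ways:
                self.blocks.pop()                # evict the least recently used block
        # Moving the block to the front models the swap on a B hit:
        # the block enters A and the oldest A block slides into B.
        self.blocks.insert(0, tag)
        return outcome


cache_set = ABSet(a_ways=1)
for tag in ["x", "y", "x", "x"]:
    print(tag, cache_set.access(tag))
```

Changing `a_ways` changes only which hits count as A hits and which as B hits; the contents of the set stay the same, which is what allows one set of counters to describe every split.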

Resizable cache control

[Figure: example accesses and the resulting MRU state over positions 0 to 3, between the MRU and LRU ends. A hit at position i increments its counter: MRU[0]++, MRU[1]++, MRU[2]++, or MRU[3]++]

Config A1 B3
• hitsA = MRU[0]
• hitsB = MRU[1] + MRU[2] + MRU[3]

Config A2 B2
• hitsA = MRU[0] + MRU[1]
• hitsB = MRU[2] + MRU[3]

Config A3 B1
• hitsA = MRU[0] + MRU[1] + MRU[2]
• hitsB = MRU[3]

Config A4 B0
• hitsA = MRU[0] + MRU[1] + MRU[2] + MRU[3]
• hitsB = 0

• Calculate the cost for each possible configuration:
A access costs = (hitsA + hitsB + misses) * CostA
B access costs = (hitsB + misses) * CostB
Miss access costs = misses * CostMiss
Total access cost = A + B + Miss (normalized to frequency)
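The cost formulas above can be evaluated for every A/B split from one set of MRU counters. The sketch below is illustrative only: the function name `config_costs`, all numeric values, and the reading of "normalized to frequency" as dividing the cycle count by a relative clock frequency are assumptions, not values or definitions from the slides.

```python
def config_costs(mru, misses, cost_a, cost_b, cost_miss, rel_freq):
    """Return {a_ways: total cost} for every A/B split of a len(mru)-way cache.

    mru[i] counts hits at MRU position i during the interval.
    cost_b and rel_freq map a_ways to the B access cost and to the
    relative frequency of that configuration (hypothetical inputs).
    """
    results = {}
    for a_ways in range(1, len(mru) + 1):
        hits_a = sum(mru[:a_ways])
        hits_b = sum(mru[a_ways:])
        a_cost = (hits_a + hits_b + misses) * cost_a
        b_cost = (hits_b + misses) * cost_b[a_ways]
        miss_cost = misses * cost_miss
        # Assumed normalization: divide cycles by the relative frequency.
        results[a_ways] = (a_cost + b_cost + miss_cost) / rel_freq[a_ways]
    return results


# Hypothetical interval statistics and costs.
costs = config_costs(
    mru=[700, 150, 100, 50],
    misses=20,
    cost_a=2,
    cost_b={1: 4, 2: 4, 3: 4, 4: 0},   # A4 B0 has no B part
    cost_miss=50,
    rel_freq={1: 1.4, 2: 1.2, 3: 1.1, 4: 1.0},
)
best = min(costs, key=costs.get)
print(costs, best)
```

Because every configuration is scored from the same counters, all splits can be compared at the end of a single interval without actually running each one.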

Resizable issue queue control

Measures the exploitable ILP for each queue size
Timestamp counter is reset at the start of an interval and incremented each cycle
During rename, a destination register is given a timestamp based on the timestamp + execution latency of its slowest source operand
The maximum timestamp, MAXN, is maintained for each of the four possible queue sizes over N fetched instructions (N=16, 32, 48, 64)
ILP is estimated as N/MAXN
Queue size with highest ILP (normalized to frequency) is selected

Read the paper
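A simplified sketch of the ILP estimate described on this slide follows. It is not the hardware mechanism: it ignores the per-cycle timestamp counter, treats registers not yet written in the window as ready at time 0, and interprets a destination's timestamp as the ready time of its slowest source plus the instruction's own latency. The function names, the instruction tuple format, and the relative frequencies are hypothetical.

```python
def estimate_ilp(instructions, sizes=(16, 32, 48, 64)):
    """Estimate N / MAX_N for each window size N.

    instructions: list of (dest_reg, source_regs, latency) in fetch order.
    """
    ready = {}      # register -> timestamp at which its value is ready
    max_ts = 0      # largest timestamp seen so far in the window
    ilp = {}
    for count, (dest, sources, latency) in enumerate(instructions, start=1):
        # Slowest source operand decides when this instruction can finish.
        start = max((ready.get(src, 0) for src in sources), default=0)
        finish = start + latency
        if dest is not None:
            ready[dest] = finish
        max_ts = max(max_ts, finish)
        if count in sizes:
            ilp[count] = count / max(max_ts, 1)
    return ilp


def pick_queue_size(ilp, rel_freq):
    """Choose the size with the highest frequency-weighted ILP."""
    return max(ilp, key=lambda n: ilp[n] * rel_freq[n])


rel_freq = {16: 1.6, 32: 1.3, 48: 1.1, 64: 1.0}   # hypothetical values
chain = [(f"r{i}", (f"r{i - 1}",), 1) for i in range(1, 65)]   # fully dependent
independent = [(f"r{i}", (), 1) for i in range(64)]            # no dependences
print(pick_queue_size(estimate_ilp(chain), rel_freq))
print(pick_queue_size(estimate_ilp(independent), rel_freq))
```

The two synthetic streams mark the extremes: a dependence chain gains nothing from a larger queue, so the faster small queue is preferred, while fully independent instructions favor the largest queue despite its lower clock.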

Resizable hardware – some details

Front end domain
• Icache “A”: 16KB 1-way, 32KB 2-way, 48KB 3-way, 64KB 4-way
• Branch predictor sized with Icache
– gshare PHT: 16KB-64KB
– Local BHT: 2KB-8KB
– Local PHT: 1024 entries
– Meta: 16KB-64KB

Load/store domain
• Dcache “A”: 32KB 1-way, 64KB 2-way, 128KB 4-way, 256KB 8-way
• L2 cache “A” sized with Dcache
– 256KB 1-way, 512KB 2-way, 1MB 4-way, 2MB 8-way

Integer and floating point domains
• Issue queue: 16, 32, 48, or 64 entries

Evaluation methodology

SimpleScalar and Cacti
40 benchmarks from SPEC, Mediabench, and Olden
Baseline: best overall performing fully synchronous 21264-like design found out of 1,024 simulated options
Adaptive MCD costs imposed:
• Additional branch penalty of 2 integer domain cycles and 1 front end domain cycle (overpipelined)
• Frequency penalty as much as 31%
Mean PLL locking time of 15 µsec
Program-Adaptive: profile application and pick the best adaptive configuration for the whole program
Phase-Adaptive: use online cache and issue queue control mechanisms

Performance improvement

[Figure: performance improvement per benchmark, grouped into Mediabench, Olden, and SPEC]

Phase behavior – art

[Figure: issue queue entries (16, 32, 48, 64) over a 100 million instruction window]

Phase behavior – apsi

[Figure: Dcache “A” size (32KB, 64KB, 128KB, 256KB) over a 100 million instruction window]

Performance summary

Program Adaptive: 17% performance improvement
Phase Adaptive: 20% performance improvement
• Automatic
• Never degrades performance for 40 applications
• Few phases in chosen application windows – could perhaps do better

Distribution of chosen configurations for Program Adaptive:

Columns: Integer IQ, FP IQ, D/L2 Cache, Icache

Issue queue entries: 16 (85%), 32 (5%), 48 (5%), 64 (5%)
D/L2 cache: 32KB/256KB (50%), 64KB/512KB (18%), 128KB/1MB (23%), 256KB/2MB (10%)
Icache: 16KB (55%), 32KB (18%), 48KB (8%), 64KB (20%)
Issue queue entries: 16 (73%), 32 (15%), 48 (8%), 64 (5%)

Domain frequency versus IQ size

[Figure: relative frequency (y-axis, 0.4 to 1.8) versus issue queue size (16, 32, 48, 64)]

Conclusions

Application phase behavior can be exploited to improve performance in addition to power savings

GALS approach is key to localizing the impact of slowing the clock

Cache and queue control mechanisms can evaluate all possible configurations within a single interval

Phase adaptive approach improves performance by as much as 48% and by an average of 20%

Future work

Explore multiple adaptive structures in each domain
Better take into account the branch predictor
Resize the instruction cache by sets rather than ways
Explore better issue queue design alternatives
Build circuits
Dynamically customized heterogeneous multi-core architectures using phase-adaptive GALS cores
