Dynamically Trading Frequency for Complexity in a GALS Microprocessor Steven Dropsho, Greg Semeraro, David H. Albonesi, Grigorios Magklis, Michael L. Scott University of Rochester

Page 1:

Dynamically Trading Frequency for Complexity in a GALS Microprocessor

Steven Dropsho, Greg Semeraro, David H. Albonesi, Grigorios Magklis, Michael L. Scott

University of Rochester

Page 2:

The gist of the paper…

Radical idea: Trade off frequency and hardware complexity dynamically at runtime rather than statically at design time

The new twist: A Globally-Asynchronous, Locally-Synchronous (GALS) microarchitecture is key to making this worthwhile

Page 3:

Application phase behavior

Varying behavior over time [Sherwood, Sair, Calder, ISCA 2003]

[Figure: gcc behavior over time; traces: IPC, L1I misses, L1D misses, L2 misses, branch mispred, E per interval]

Can exploit to save power: adaptive issue queue [Buyuktosunoglu, et al., GLSVLSI 2001]

Page 4:

What about performance?

Lower power and faster access time!

RAM delay
entries:        32    24    16    8
relative delay: 1.0   0.77  0.52  0.31

CAM delay
entries:        32    24    16    8
relative delay: 1.0   0.77  0.55  0.34

[Buyuktosunoglu, GLSVLSI 2001]

Page 5:

What about performance?

How do we exploit the faster speed?

Variable latency

Increase frequency when downsizing

Decrease frequency when upsizing

Page 6:

What about performance?

[Albonesi, ISCA 1998]

[Figure: processor block diagram driven by a single clock. Blocks: Fetch Unit, Br Pred, L1 I-Cache, Dispatch/Rename/ROB, integer and FP Issue Queues, ALUs & RF, Ld/St Unit, L1 D-Cache, L2 Cache, Main Memory]

Page 7:

What about performance?

[Albonesi, ISCA 1998]

[Figure: bar chart of Avg TPI (ns), y-axis 0.0 to 1.2, per benchmark: m88ksim, gcc, compress, li, ijpeg, perl, vortex, airshed, stereo, radar, appcg, tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5, average. Series: Best Conventional, Process-level Adaptive]

Page 8:

Enter GALS…

[Figure: processor block diagram partitioned into clock domains. Domains: Front-end Domain, Integer Domain, FP Domain, Memory Domain, External Domain. Blocks: Fetch Unit, Br Pred, L1 I-Cache, Dispatch/Rename/ROB, two Issue Queues, ALUs & RF, Ld/St Unit, L1 D-Cache, L2 Cache, Main Memory]

[Semeraro et al., HPCA 2002] [Iyer and Marculescu, ISCA 2002]

Page 9:

Outline

Motivation and background

Adaptive GALS microarchitecture

Control mechanisms

Evaluation methodology

Results

Conclusions and future work

Page 10:

Adaptive GALS microarchitecture

[Figure: GALS processor block diagram in which Br Pred, L1 I-Cache, L1 D-Cache, L2 Cache, and the Issue Queues are each drawn in several sizes. Domains: Front-end Domain, Integer Domain, FP Domain, Memory Domain, External Domain. Other blocks: Fetch Unit, Dispatch/Rename/ROB, ALUs & RF, Ld/St Unit, Main Memory]

Page 11:

Adaptive GALS operation

[Figure: the same GALS processor block diagram, with Br Pred, L1 I-Cache, L1 D-Cache, L2 Cache, and the Issue Queues each drawn in several sizes. Domains: Front-end Domain, Integer Domain, FP Domain, Memory Domain, External Domain]

Page 12:

Resizable cache organization

Access the A part first, then the B part on a miss

Swap the A and B blocks on an A miss, B hit

Select the A/B split according to application phase behavior
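The access order on this slide can be sketched in a few lines of Python. This is a minimal model under stated assumptions, not the hardware design: the class name `ABSet`, the recency-ordered list, and the fill policy on a full miss (new block enters A, the least recently used block of the set is evicted) are all illustrative choices that the slide does not specify.

```python
from dataclasses import dataclass, field


@dataclass
class ABSet:
    """One cache set split into a primary A partition and a secondary B partition.

    Hypothetical model: blocks are kept in recency order (most recent first),
    and the first `a_ways` positions are treated as the A partition.
    """
    a_ways: int                                  # ways currently assigned to A
    total_ways: int                              # A ways plus B ways
    blocks: list = field(default_factory=list)   # tags, most recently used first

    def access(self, tag):
        """Probe A, then B, and return 'A hit', 'B hit', or 'miss'."""
        if tag in self.blocks:
            pos = self.blocks.index(tag)
            self.blocks.remove(tag)
            # Moving the tag to the front models the swap: on a B hit the
            # block enters A and the oldest A block drops into B.
            self.blocks.insert(0, tag)
            return "A hit" if pos < self.a_ways else "B hit"
        # Assumed fill policy: the new block goes into A and the least
        # recently used block of the whole set is evicted.
        self.blocks.insert(0, tag)
        del self.blocks[self.total_ways:]
        return "miss"


small_a = ABSet(a_ways=1, total_ways=4)
print([small_a.access(t) for t in "ABAA"])
```

With a single A way, the second access to block A finds it in B and swaps it back, so the third access to A is an A hit. Widening A (a larger `a_ways`) turns more of those B hits into A hits, at the price of a slower A probe.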

Page 13:

Resizable cache control

[Figure: example accesses to blocks A, B, C, D, showing the MRU state (positions 0 to 3) after each access and the counter incremented on each hit: MRU[0]++, MRU[1]++, MRU[2]++, or MRU[3]++]

Config A1 B3
• hitsA = MRU[0]
• hitsB = MRU[1] + MRU[2] + MRU[3]

Config A2 B2
• hitsA = MRU[0] + MRU[1]
• hitsB = MRU[2] + MRU[3]

Config A3 B1
• hitsA = MRU[0] + MRU[1] + MRU[2]
• hitsB = MRU[3]

Config A4 B0
• hitsA = MRU[0] + MRU[1] + MRU[2] + MRU[3]
• hitsB = 0

• Calculate the cost for each possible configuration:
– A access costs = (hitsA + hitsB + misses) * CostA
– B access costs = (hitsB + misses) * CostB
– Miss access costs = misses * CostMiss
– Total access cost = A + B + Miss (normalized to frequency)
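The cost formulas above can be evaluated for every A/B split from a single set of MRU counters. The Python sketch below is illustrative only: the function name `ab_costs`, all cost and frequency values, and the reading of "normalized to frequency" as dividing a cycle count by the configuration's relative frequency are assumptions, not figures from the slides.

```python
def ab_costs(mru, misses, configs, cost_miss):
    """Evaluate every A/B split from one interval's MRU counters.

    mru[i]  : hits observed at MRU position i (0 = most recently used)
    configs : {a_ways: (cost_a, cost_b, rel_freq)}; all values are hypothetical
    Returns {a_ways: total access cost divided by relative frequency}.
    """
    totals = {}
    for a_ways, (cost_a, cost_b, rel_freq) in configs.items():
        hits_a = sum(mru[:a_ways])     # hits that land in the A partition
        hits_b = sum(mru[a_ways:])     # hits that need a second probe into B
        a_cost = (hits_a + hits_b + misses) * cost_a   # every access probes A
        b_cost = (hits_b + misses) * cost_b            # A misses go on to B
        miss_cost = misses * cost_miss
        # Assumed normalization: cycles divided by relative clock frequency.
        totals[a_ways] = (a_cost + b_cost + miss_cost) / rel_freq
    return totals


# Made-up counters and costs for a 4-way structure (A1 B3 through A4 B0).
mru = [900, 60, 30, 10]
configs = {1: (2, 6, 1.6), 2: (2, 6, 1.3), 3: (2, 6, 1.1), 4: (2, 0, 1.0)}
totals = ab_costs(mru, misses=20, configs=configs, cost_miss=50)
best = min(totals, key=totals.get)
print(best, totals)
```

In this made-up interval most hits fall on MRU position 0, so the smallest A partition wins: its extra B probes cost less than the frequency it gains. Because all four totals come from the same counters, no configuration has to be tried out in order to be scored.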

Page 14:

Resizable issue queue control

Measures the exploitable ILP for each queue size

Timestamp counter is reset at the start of an interval and incremented each cycle

During rename, a destination register is given a timestamp based on the timestamp + execution latency of its slowest source operand

The maximum timestamp, MAXN, is maintained for each of the four possible queue sizes over N fetched instructions (N=16, 32, 48, 64)

ILP is estimated as N/MAXN

Queue size with highest ILP (normalized to frequency) is selected

Read the paper
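A rough software analogue of the timestamp scheme is sketched below in Python. It is a simplification under stated assumptions: registers not written inside the window are treated as ready at time 0, the running cycle counter is left out, each queue size is scored over the first N instructions of a trace, and "normalized to frequency" is read as multiplying the ILP estimate by a relative frequency. The function names and the frequency values are hypothetical.

```python
def estimate_ilp(trace, n):
    """Estimate ILP over the first n instructions of a trace.

    trace: list of (dest, sources, latency) tuples with registers as strings.
    """
    ready = {}     # register -> timestamp at which its value is available
    max_ts = 0     # plays the role of MAXN for this window
    window = trace[:n]
    for dest, sources, latency in window:
        # The slowest source operand decides when this instruction can start.
        start = max((ready.get(src, 0) for src in sources), default=0)
        ready[dest] = start + latency
        max_ts = max(max_ts, ready[dest])
    return len(window) / max_ts if max_ts else 0.0


def pick_queue_size(trace, rel_freq):
    """Return (best size, scores), scoring each size as ILP * relative frequency."""
    scores = {n: estimate_ilp(trace, n) * f for n, f in rel_freq.items()}
    return max(scores, key=scores.get), scores


# Hypothetical relative frequencies for the four queue sizes.
rel_freq = {16: 1.6, 32: 1.3, 48: 1.1, 64: 1.0}

# A serial dependence chain: a larger window exposes no extra parallelism.
chain = [(f"r{i + 1}", [f"r{i}"], 1) for i in range(64)]
# Fully independent instructions: a larger window always helps.
independent = [(f"r{i}", [], 1) for i in range(64)]

print(pick_queue_size(chain, rel_freq)[0])
print(pick_queue_size(independent, rel_freq)[0])
```

The dependence chain has an estimated ILP of 1 at every size, so the smallest and fastest queue scores best. The independent stream gains enough ILP from a larger window to outweigh the slower clock, so the largest queue scores best.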

Page 15:

Resizable hardware – some details

Front end domain
• Icache “A”: 16KB 1-way, 32KB 2-way, 48KB 3-way, 64KB 4-way
• Branch predictor sized with Icache
– gshare PHT: 16KB-64KB
– Local BHT: 2KB-8KB
– Local PHT: 1024 entries
– Meta: 16KB-64KB

Load/store domain
• Dcache “A”: 32KB 1-way, 64KB 2-way, 128KB 4-way, 256KB 8-way
• L2 cache “A” sized with Dcache
– 256KB 1-way, 512KB 2-way, 1MB 4-way, 2MB 8-way

Integer and floating point domains
• Issue queue: 16, 32, 48, or 64 entries

Page 16:

Evaluation methodology

SimpleScalar and Cacti

40 benchmarks from SPEC, Mediabench, and Olden

Baseline: best overall performing fully synchronous 21264-like design found out of 1,024 simulated options

Adaptive MCD costs imposed:
• Additional branch penalty of 2 integer domain cycles and 1 front end domain cycle (overpipelined)
• Frequency penalty as much as 31%

Mean PLL locking time of 15 µsec

Program-Adaptive: profile application and pick the best adaptive configuration for the whole program

Phase-Adaptive: use online cache and issue queue control mechanisms

Page 17:

Performance improvement

[Figure: performance improvement per benchmark, grouped by suite: Mediabench, Olden, SPEC]

Page 18:

Phase behavior – art

[Figure: issue queue entries (16, 32, 48, 64) over a 100 million instruction window]

Page 19:

Phase behavior – apsi

[Figure: Dcache “A” size (32KB, 64KB, 128KB, 256KB) over a 100 million instruction window]

Page 20:

Performance summary

Program Adaptive: 17% performance improvement

Phase Adaptive: 20% performance improvement
• Automatic
• Never degrades performance for 40 applications
• Few phases in chosen application windows – could perhaps do better

Distribution of chosen configurations for Program Adaptive:

Columns: Integer IQ, FP IQ, D/L2 Cache, Icache

IQ entries:   16 85%, 32 5%, 48 5%, 64 5%
D/L2 Cache:   32KB/256KB 50%, 64KB/512KB 18%, 128KB/1MB 23%, 256KB/2MB 10%
Icache:       16KB 55%, 32KB 18%, 48KB 8%, 64KB 20%
IQ entries:   16 73%, 32 15%, 48 8%, 64 5%

Page 21:

Domain frequency versus IQ size

[Figure: relative frequency (y-axis, 0.4 to 1.8) versus issue queue size (x-axis: 16, 32, 48, 64)]

Page 22:

Conclusions

Application phase behavior can be exploited to improve performance in addition to power savings

GALS approach is key to localizing the impact of slowing the clock

Cache and queue control mechanisms can evaluate all possible configurations within a single interval

Phase adaptive approach improves performance by as much as 48% and by an average of 20%

Page 23:

Future work

Explore multiple adaptive structures in each domain

Better take into account the branch predictor

Resize the instruction cache by sets rather than ways

Explore better issue queue design alternatives

Build circuits

Dynamically customized heterogeneous multi-core architectures using phase-adaptive GALS cores
