Hiding Synchronization Delays in a GALS Processor Microarchitecture

Greg Semeraro, David H. Albonesi, Grigorios Magklis, Michael L. Scott, Steven G. Dropsho, Sandhya Dwarkadas

ASYNC 2004 - University of Rochester

Why GALS?

Simplified clock distribution network

Reduced clock power dissipation

Allows modular design of the processor

Can run each domain at optimal frequency

Can use conventional design and testing methods

Fine-grained DVS/DFS

But there is a cost…

Inter-domain synchronization can hurt performance

Synchronization circuits cost area and power

We have to be careful how we divide the processor

The MCD Microprocessor

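[Figure: block diagram of the MCD processor. Labels as grouped in the diagram:]

Front end: L1 instr. cache, fetch, IFQ, branch predict, rename, dispatch, ROB

Integer: IIQ, int. register file, int. FUs

Floating Pt: FIQ, fp. register file, fp. FUs

Memory: LSQ, L1 data cache, L2 unified cache

Main Memory (drawn separately from the CPU)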

Inter-domain Synchronization

Queue design based on Chelcea and Nowick (WVLSI ’00), modified for Issue Queue configuration

Synchronization circuit based on Nyström and Martin (WCED ’02), converted to single-rail logic

Timing analysis based on Sjogren and Myers (ARVLSI ’97); skip a cycle rather than pause the clock

Synchronization via Queues

[Figure: two queue structures used for inter-domain synchronization, labeled FIFO Queue and Issue Queue]

Timing Analysis

Source runs with CLK1, destination with CLK2

Source writes at edge 1

If T > Ts then the data can be used at edge 2

If T < Ts then the data can be used at edge 3

25% < Ts < 35%

[Figure: timing diagram of CLK1 and CLK2 showing clock edges 1 through 4 and the interval T]
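
The edge-selection rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function name, the integer-picosecond time base, and the `dst_phase_ps` parameter are hypothetical, and the case T = Ts (which the slide leaves open) is treated conservatively as a skipped cycle.

```python
def first_usable_edge(write_ps, dst_period_ps, ts_ps, dst_phase_ps=0):
    """Return the time (ps) of the first destination edge allowed to use the data.

    All arguments are integers in picoseconds; the names are illustrative.
    """
    # Destination edges fall at dst_phase_ps + k * dst_period_ps.
    # Find the first one strictly after the source write (edge 2 on the slide).
    k = (write_ps - dst_phase_ps) // dst_period_ps + 1
    edge2 = dst_phase_ps + k * dst_period_ps
    t = edge2 - write_ps  # the interval T from the slide
    if t > ts_ps:
        return edge2
    # T <= Ts (equality is an assumption): skip one destination cycle (edge 3).
    return edge2 + dst_period_ps


# Arbitrary example values: 1000 ps destination period, Ts = 300 ps.
early = first_usable_edge(500, 1000, 300)  # T = 500 ps, larger than Ts
late = first_usable_edge(800, 1000, 300)   # T = 200 ps, smaller than Ts
```

With these example values the early write can be consumed at the next destination edge, while the late write waits one extra destination cycle, which is the "skip a cycle rather than pause the clock" behavior.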

Simulation Methodology

Two processor pipelines: Alpha 21264 and StrongARM SA-1110

Synchronization penalty was measured against an identical synchronous design

30 benchmarks from MediaBench, Olden, and SPEC 2000

Simulation Methodology

SimpleScalar + Wattch + MCD

Independent clock for each domain

Independent jitter for each domain; next edge based on period, last edge, and jitter

When source and destination clock edges are too close, a one-cycle penalty is assessed
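
A minimal sketch of this clock model follows. It is not the SimpleScalar/MCD code: the class and function names are hypothetical, jitter is assumed to be uniformly distributed (the slides do not give a distribution), and a write is assumed at every source edge, whereas a real simulator would assess the penalty only when data actually crosses domains.

```python
import random


class DomainClock:
    """Hypothetical per-domain clock: next edge = last edge + period + jitter."""

    def __init__(self, period_ps, jitter_ps, seed):
        self.period_ps = period_ps
        self.jitter_ps = jitter_ps      # assumed uniform in [-jitter_ps, +jitter_ps]
        self.rng = random.Random(seed)  # independent jitter source per domain
        self.last_edge_ps = 0.0

    def next_edge(self):
        jitter = self.rng.uniform(-self.jitter_ps, self.jitter_ps)
        self.last_edge_ps += self.period_ps + jitter
        return self.last_edge_ps


def count_penalties(src, dst, ts_ps, n_writes):
    """Count source writes whose next destination edge arrives within ts_ps."""
    penalties = 0
    dst_edge = dst.next_edge()
    for _ in range(n_writes):
        write = src.next_edge()
        # Advance the destination clock to its first edge after the write.
        while dst_edge <= write:
            dst_edge = dst.next_edge()
        # Edges too close together: assess a one-cycle penalty.
        if dst_edge - write <= ts_ps:
            penalties += 1
    return penalties


# Jitter-free example: 1000 ps source, 900 ps destination, 300 ps window.
no_jitter = count_penalties(DomainClock(1000, 0, 1), DomainClock(900, 0, 2), 300, 9)
# Same periods with 20 ps of assumed jitter on each clock.
with_jitter = count_penalties(DomainClock(1000, 20, 1), DomainClock(900, 20, 2), 300, 1000)
```

The periods, window, and jitter amplitude here are arbitrary illustration values. Seeding each clock separately keeps the two jitter streams independent and makes runs repeatable.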

Synchronization Analysis

OoO and superscalar capabilities removed from Alpha

[Figure: bar chart. Y-axis: Percent (0 to 30). Categories: Performance Degradation, Synchronization Time. Series: Out-of-order, full superscalar; In-order issue, less superscalar. Data labels as extracted: 1.4, 24.3, 2.4, 21.5]

Synchronization Analysis

OoO and superscalar capabilities added to StrongARM

[Figure: bar chart. Y-axis: Percent (0 to 14). Categories: Performance Degradation, Synchronization Time. Series: In-order; Out-of-order, partially superscalar. Data labels as extracted: 1.9, 12.2, 0.7, 10.9]

What we have learned

Synchronization penalty doesn’t mean performance loss

Out-of-order execution allows useful work to be performed when instructions are delayed

Superscalar design means that synchronization penalties can be “shared” across multiple instructions

For Alpha, 95% of penalty hidden

For StrongARM++, 63% of penalty hidden

We have to be careful

Cannot have too many domains

Careful where you split!

Conclusions

GALS is a good idea for real processors:

small IPC loss

clock network simplification

reduction in power dissipation

higher frequency independent domain tuning