Hiding Synchronization Delays in a GALS Processor Microarchitecture
Greg Semeraro, David H. Albonesi, Grigorios Magklis, Michael L. Scott, Steven G. Dropsho, Sandhya Dwarkadas
ASYNC 2004 - University of Rochester 2
Why GALS?
Simplified clock distribution network
Reduced clock power dissipation
Allows modular design of the processor
Can run each domain at optimal frequency
Can use conventional design and testing methods
Fine-grained DVS/DFS
But there is a cost…
Inter-domain synchronization can hurt performance
Synchronization circuit costs in area and power
We have to be careful how we divide the processor
The MCD Microprocessor
[Figure: block diagram of the MCD processor. Labels: Front end (L1 instr. cache, fetch, IFQ, branch predict, rename, dispatch, ROB); Integer (IIQ, int. register file, int. FUs); Floating Pt (FIQ, fp. register file, fp. FUs); Memory (LSQ, L1 data cache, L2 unified cache); Main Memory; CPU.]
Inter-domain Synchronization
Queue design based on Chelcea and Nowick (WVLSI ’00), modified for issue queue configuration
Synchronization circuit based on Nyström and Martin (WCED ’02), converted to single-rail logic
Timing analysis based on Sjogren and Myers (ARVLSI ’97), skipping a cycle rather than pausing the clock
Synchronization via Queues
[Figure: two queue structures used for synchronization, labeled FIFO Queue and Issue Queue.]
Timing Analysis
Source runs with CLK1, destination with CLK2
Source writes at edge 1
If T > Ts then the data can be used at edge 2
If T < Ts then the data can be used at edge 3
25% < Ts < 35%
[Figure: timing diagram of CLK1 and CLK2 with edges numbered 1 to 4 and the interval T marked.]
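The edge-selection rule above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function name first_usable_edge is hypothetical, T is assumed to be the interval from the source write edge to a destination edge, and the case T = Ts (which the slide leaves open) is assumed to take the conservative path and wait for the following edge.

```python
def first_usable_edge(write_time, dest_edges, ts):
    """Return the time of the first destination edge that can use the data.

    write_time: time of the source-domain write edge (edge 1 on the slide)
    dest_edges: destination-domain edge times, sorted ascending
    ts:         synchronization time Ts, in the same time units
    """
    for edge in dest_edges:
        t = edge - write_time  # T on the slide (assumed definition)
        if t > ts:
            # Enough time has elapsed since the write: use this edge.
            return edge
        # Otherwise the edge is too close to (or before) the write: skip it.
    return None  # no usable edge in the supplied list


# Destination edge 0.2 after the write, Ts = 0.3: skip to the next edge.
print(first_usable_edge(0.0, [0.2, 1.2, 2.2], 0.3))  # 1.2
# Destination edge 0.5 after the write, Ts = 0.3: usable immediately.
print(first_usable_edge(0.0, [0.5, 1.5], 0.3))       # 0.5
```

The skipped edge corresponds to the "skip a cycle rather than pause the clock" choice: the destination clock keeps running and the data simply becomes visible one edge later.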
Simulation Methodology
Two processor pipelines: Alpha 21264 and StrongARM SA-1110
Synchronization penalty was measured against an identical synchronous design
30 benchmarks: MediaBench, Olden, SPEC 2000
Simulation Methodology
SimpleScalar + Wattch + MCD
Independent clock for each domain
Independent jitter for each domain
Next edge based on period, last edge, jitter
When source and destination clock edges are too close, a one-cycle penalty is assessed
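A minimal sketch of this clock model in Python, under stated assumptions: the names clock_edges and count_penalties are hypothetical, jitter is drawn uniformly from ±jitter_max and added to each period (the slides do not give the distribution), and "too close" is taken to mean that the next destination edge arrives no more than Ts after the source edge. It is not the simulator's actual code.

```python
import random


def clock_edges(period, jitter_max, n_edges, rng, start=0.0):
    """Generate n_edges edge times for one domain.

    Each next edge = last edge + period + jitter, with jitter assumed
    uniform in [-jitter_max, +jitter_max] (jitter_max < period).
    """
    edges = []
    last = start
    for _ in range(n_edges):
        last = last + period + rng.uniform(-jitter_max, jitter_max)
        edges.append(last)
    return edges


def count_penalties(src_edges, dst_edges, ts):
    """Count source edges whose next destination edge is within ts."""
    penalties = 0
    j = 0
    for s in src_edges:
        # Advance to the first destination edge strictly after the source edge.
        while j < len(dst_edges) and dst_edges[j] <= s:
            j += 1
        if j == len(dst_edges):
            break  # ran out of destination edges
        if dst_edges[j] - s <= ts:
            penalties += 1  # edges too close: assess a one-cycle penalty
    return penalties


rng = random.Random(0)  # fixed seed so a run is repeatable
src = clock_edges(period=1.0, jitter_max=0.05, n_edges=1000, rng=rng)
dst = clock_edges(period=1.1, jitter_max=0.05, n_edges=1000, rng=rng)
print(count_penalties(src, dst, ts=0.3))
```

Each domain gets its own period and its own jitter draws, so the relative phase of the two clocks drifts over time and only a fraction of transfers pay the penalty.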
Synchronization Analysis
OoO and superscalar capabilities removed from Alpha
[Figure: bar chart, vertical axis in percent (0 to 30). Categories: Performance Degradation, Synchronization Time. Series: Out-of-order, full superscalar; In-order issue, less superscalar. Data labels as extracted: 1.4, 24.3, 2.4, 21.5.]
Synchronization Analysis
OoO and superscalar capabilities added to StrongARM
[Figure: bar chart, vertical axis in percent (0 to 14). Categories: Performance Degradation, Synchronization Time. Series: In-order; Out-of-order, partially superscalar. Data labels as extracted: 1.9, 12.2, 0.7, 10.9.]
What we have learned
A synchronization penalty does not necessarily mean a performance loss
Out-of-order execution allows useful work to be performed when instructions are delayed
Superscalar design means that synchronization penalties can be “shared” across multiple instructions
For Alpha, 95% of penalty hidden
For StrongARM++, 63% of penalty hidden
We have to be careful
Cannot have too many domains
Careful where you split!
Conclusions
GALS is a good idea for real processors
small IPC loss
clock network simplification
reduction in power dissipation
higher frequency independent domain tuning