Overview


1. Motivation (Kevin)
2. Thermal issues (Kevin)
3. Power modeling (David)
4. Thermal management (David)
5. Optimal DTM (Lev)
6. Clustering (Antonio)
7. Power distribution (David)
8. What current chips do (Lev)
9. HotSpot (Kevin)

The clustering approach

• Reduce complexity by partitioning
  – Less latency, area, power and temperature
• Fast, simple, distributed units
  – Communication latency is heterogeneous and exposed to the microarchitecture
• Localize critical communication within clusters (fast wires)

[Figure: cluster0 to cluster3 and the global resources, connected by an interconnection network]

The clustering approach (...)

• Smaller structures consume less power
  – Higher power efficiency [Zyuban, IEEE Transactions 01]
• Partitioning simplifies power management
  – Via clock/power gating techniques [Bahar, ISCA 01]
  – Via dynamic cluster resizing [González, ICCD 03]
  – Via DVS/DFS
• Partitioning reduces temperature
  – Activity is distributed [Chaparro, TACS 04]
  – Hopping schemes can be applied [Chaparro, TACS 04]
  – Adds flexibility for temperature-effective layouts
• IPC overheads due to communication/imbalance
  – Compensated by shorter latency/clock period [Palacharla, ISCA 97], [Canal, HPCA 00]

Clustered microarchitecture

• Dynamic steering
• Distributed Issue, Registers, FUs
• Inter-cluster register communication

[Figure: Icache, Fetch & decode and Steering logic feeding clusters C0 to C3; each cluster holds an Issue-Queue, a Register File and FUs, and the clusters are linked by the IC Network]

On-demand communication

• Map table tracks locations of register values
• At rename
  – allocate register for result, in the assigned cluster
  – if a source operand is in a remote cluster
    • insert a copy instruction in remote cluster
    • allocate register for a copy
• At commit
  – free allocated register(s) by previous mapping

[Figure: Register Map Table, mapping each logical register (log. reg.) to a physical register (phys. reg.) in clusters C0 to C3]

[Canal, PACT99]

Rename

Instruction (logical registers): src1 = 2, src2 = 3, dst = 1 (src3 to src5 unused)
Steering logic assigns the instruction to Cluster 1.
Renamed instruction (physical registers): src1 = 3, src2 = 10, dst = 14

Renaming table before renaming (X = not mapped in that cluster):

Log  CL0  CL1  CL2  CL3
0    X    27   X    X
1    18   X    X    9
2    X    3    15   X
3    5    10   X    13
4    X    X    12   14
5    4    X    X    X
6    X    1    24   X
7    2    X    X    X
8    X    2    X    9

After renaming, only entry 1 changes:

Log  CL0  CL1  CL2  CL3
1    X    14   X    X

Copy instructions

Instruction (logical registers): src1 = 2, src2 = 3, dst = 1 (src3 to src5 unused)
Steering logic assigns the instruction to Cluster 2.
Source lookup in Cluster 2: src1 = 15, src2 = X (logical register 3 is not mapped in Cluster 2)

Renaming table before renaming (X = not mapped in that cluster):

Log  CL0  CL1  CL2  CL3
0    X    27   X    X
1    18   X    X    9
2    X    3    15   X
3    5    10   X    13
4    X    X    12   14
5    4    X    X    X
6    X    1    24   X
7    2    X    X    X
8    X    2    X    9

Copy instruction: src1 = CL1:10, dst = CL2:27. Entry 3 becomes:

Log  CL0  CL1  CL2  CL3
3    5    10   27   13

Renamed instruction (physical registers): src1 = 15, src2 = 27, dst = 14. Entry 1 becomes:

Log  CL0  CL1  CL2  CL3
1    X    X    14   X
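The rename-time bookkeeping described above (map table lookup, copy insertion, register allocation) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `rename`, the free-list representation, and the rule for picking which remote cluster supplies a copy (closest cluster index, lowest index on ties) are all assumptions made for the example.

```python
NUM_CLUSTERS = 4


def slide_table():
    """Renaming table from the slides; None stands for X (not mapped)."""
    X = None
    return {
        0: [X, 27, X, X], 1: [18, X, X, 9], 2: [X, 3, 15, X],
        3: [5, 10, X, 13], 4: [X, X, 12, 14], 5: [4, X, X, X],
        6: [X, 1, 24, X], 7: [2, X, X, X], 8: [X, 2, X, 9],
    }


def rename(map_table, free_regs, srcs, dst, cluster):
    """Rename one instruction that was steered to `cluster`.

    map_table[logical] lists one physical register (or None) per cluster.
    free_regs[cluster] lists the free physical registers of that cluster.
    Assumes every source is mapped in at least one cluster.
    Returns (physical sources, physical destination, copies,
    registers of the previous mapping to free at commit).
    """
    copies = []
    phys_srcs = []
    for logical in srcs:
        row = map_table[logical]
        if row[cluster] is None:
            # Operand lives only in remote clusters: pick a source cluster
            # (assumed policy: closest index) and allocate a local register.
            holders = [c for c in range(NUM_CLUSTERS) if row[c] is not None]
            remote = min(holders, key=lambda c: abs(c - cluster))
            row[cluster] = free_regs[cluster].pop(0)
            copies.append(((remote, row[remote]), (cluster, row[cluster])))
        phys_srcs.append(row[cluster])
    previous = map_table[dst]
    new_row = [None] * NUM_CLUSTERS
    new_row[cluster] = free_regs[cluster].pop(0)
    map_table[dst] = new_row
    to_free = [(c, p) for c, p in enumerate(previous) if p is not None]
    return phys_srcs, new_row[cluster], copies, to_free


table = slide_table()
result = rename(table, {2: [27, 14]}, srcs=[2, 3], dst=1, cluster=2)
```

With the free list of Cluster 2 seeded as `[27, 14]`, the call reproduces the copy example: one copy from CL1:10 to CL2:27, physical sources 15 and 27, and destination 14.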

Broadcast communication

• Values sent to all register files
  – Local file is updated earlier than remote ones
  – Registers are replicated in all files
    • Register storage waste
    • Increase in power
  – Values are written multiple times
    • Increase in power
  – May reduce communication penalties
    • Values are present everywhere
      – But not at the same time
  – E.g.: Alpha 21264

Cluster assignment schemes

• Main goals
  – Minimize inter-cluster communication penalty
  – Maximize workload balance
• Main approaches
  – Static approaches [Farkas, Micro 97] [Sastry, PLDI 98]
    • Less flexible than dynamic ones: poor load balancing
  – Dynamic, dependence-based [Palacharla ISCA 97] [Alpha 21264] [Kemp, ICPP 96]
    • Only consider dependences through unavailable operands
    • Lack specific balancing mechanisms
  – Dynamic, workload balance oriented [Baniasadi 00]
    • Only suitable with low communication penalty architectures
  – Dynamic, dependence-based and workload balance oriented [Canal HPCA 2000, Parcerisa PACT 2002]
    • Tries to find best trade-off between communications and workload balance

Cluster assignment schemes

• Accurate-Rebalancing Priority RMB
  1- To minimize communication penalties:
     If unavailable source register: choose producer’s cluster
     Else: select clusters with highest number of source regs. mapped
  2- Choose the least loaded one of the above
  Exception: if imbalance > threshold, then exclude clusters with positive workload, prior to applying rules 1 and 2
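The two rules and the exception above can be written down as a short selection function. This is a sketch under stated assumptions rather than the published algorithm: the slide does not define "imbalance" or "positive workload", so here imbalance is taken as the gap between the most and least loaded cluster, and workload is measured relative to the mean load. The names `ar_priority_steer`, `mapped_in` and `pending_producer` are hypothetical.

```python
def ar_priority_steer(src_regs, mapped_in, pending_producer, loads, threshold):
    """Pick a cluster for one instruction.

    src_regs: logical source registers of the instruction.
    mapped_in: register -> set of clusters where it is currently mapped.
    pending_producer: register -> cluster of its producer, only for
        source registers whose value is not yet available.
    loads: per-cluster load (e.g. instructions waiting in the schedulers).
    """
    n = len(loads)
    mean = sum(loads) / n
    candidates = list(range(n))

    # Exception: under heavy imbalance, drop clusters loaded above the mean.
    # The least loaded cluster is never above the mean, so the list stays
    # non-empty.
    if max(loads) - min(loads) > threshold:
        candidates = [c for c in candidates if loads[c] - mean <= 0]

    # Rule 1: follow the producer of an unavailable source operand...
    pool = [pending_producer[r] for r in src_regs
            if r in pending_producer and pending_producer[r] in candidates]
    if not pool:
        # ...else prefer the clusters with most source registers mapped.
        counts = {c: sum(c in mapped_in.get(r, ()) for r in src_regs)
                  for c in candidates}
        best = max(counts.values())
        pool = [c for c in candidates if counts[c] == best]

    # Rule 2: among those, choose the least loaded cluster.
    return min(pool, key=lambda c: loads[c])


mapped = {2: {1, 2}, 3: {0, 1, 3}}
choice = ar_priority_steer([2, 3], mapped, {}, [4, 4, 4, 4], threshold=8)
```

In the usage line, both sources are mapped in cluster 1 and the load is balanced, so rule 1 alone selects cluster 1.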

Evaluation

[Figure: SpecInt95 results (harmonic mean) comparing Modulo and AR-Priority steering on three metrics: IPC, NREADY imbalance, and communications per instruction]

Dynamic vs. static steering

[Figure: speedup (%) for perl, gcc, compress, m88ksim and their harmonic mean, comparing Static LdSt slice, Dynamic LdSt slice and Advanced RMB]

S. Sastry, S. Palacharla and J.E. Smith, PLDI 1998

Data cache architectures

• Centralized
  – Dcache is a cluster
  – Single Load/Store queue
  – Simple disambiguation

[Figure: Backend, a single L1 Dcache and the UL2]

[González, WMPI 04]

Data cache architecture (II)

• Attraction caches
  – Lines are copied on demand
  – A coherence scheme is needed
  – Steering must exploit data locality

[Figure: back-ends BE 1 to BE 4, each with its own DL1, above the UL2]

Data cache architecture (III)

• Replicated
  – Area cost
  – Traffic and activity due to store broadcast

[Figure: back-ends BE 1 to BE 4, each with its own DL1, above the UL2]

Data cache architecture (IV)

• Interleaved
  – Word/line interleaved
  – Steering needs to predict the bank

[Figure: back-ends BE 1 to BE 4, each with one DL1 bank, above the UL2]
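To make the word/line distinction concrete, the following sketch computes which bank holds an address under each interleaving. The sizes (8-byte words, 32-byte lines, four banks) and the function name `bank_of` are assumptions for illustration; the slides do not give these parameters.

```python
WORD_BYTES = 8    # assumed word size
LINE_BYTES = 32   # assumed cache line size
NUM_BANKS = 4     # one DL1 bank per back-end


def bank_of(address, granularity_bytes, num_banks=NUM_BANKS):
    """Bank holding `address` when consecutive chunks of
    `granularity_bytes` are spread round-robin over the banks."""
    return (address // granularity_bytes) % num_banks


word_bank = bank_of(0x48, WORD_BYTES)   # word-interleaved mapping
line_bank = bank_of(0x48, LINE_BYTES)   # line-interleaved mapping
```

The bank depends on the effective address, which is not known when the instruction is steered; this is why the steering logic has to predict it.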

Memory issues

• Disambiguation
  – Load/Store queues are distributed
  – Stores are allocated in all clusters
  – Address is computed in one and broadcast
  – Loads go to memory once previous stores know their addresses
• Memory coherence
  – Write-Invalidate / Write-Update protocols
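The disambiguation rule above (a load waits until all previous stores know their addresses, and each store address is broadcast to every cluster) can be sketched as follows. The queue layout and the function names are hypothetical; a real load/store queue would also compare addresses and forward data, which this sketch leaves out.

```python
def load_may_issue(load_seq, store_queue):
    """True when every store older than the load has a known address.

    store_queue holds (sequence number, address or None) pairs.
    """
    return all(addr is not None
               for seq, addr in store_queue
               if seq < load_seq)


def broadcast_store_address(store_queues, store_seq, address):
    """Record a store address, computed in one cluster, in the copy of
    that store kept in every cluster's queue."""
    for queue in store_queues:
        for i, (seq, _) in enumerate(queue):
            if seq == store_seq:
                queue[i] = (seq, address)


# One queue per cluster; store 3 has not computed its address yet.
queues = [[(1, 0x10), (3, None)] for _ in range(4)]
blocked = not load_may_issue(5, queues[2])
broadcast_store_address(queues, 3, 0x40)
released = all(load_may_issue(5, q) for q in queues)
```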

Performance comparison

[Figure: performance relative to Attraction (y-axis 0.6 to 1.4) for ammp, bzip2, eon, gzip, mesa, mgrid, parser, swim, tpc, wupwise and AVG, comparing Attraction, Replicated, Phy-Dist and Centralized]

Thermal benefits of clustering

[Figure: floorplan for a quad-cluster architecture, with blocks labelled Unified L2 Cache, Trace Cache, Reorder Buffer, Branch Predictors, DECO, ITLB, Rename Table, Cluster 0 to Cluster 3, Integer Scheduler, FP Scheduler, Memory Scheduler, Copy Scheduler, Integer Register File, FP Register File, Integer Execution Units, Floating Point Execution Units, Data Cache Level 1 and DTLB]

[Chaparro, TACS 04]

Temperature metrics

• AbsMax
  – Maximum sensed temperature
• Average
  – Average temperature across time and area
• AverageMax
  – Average temperature across time of maximum sensed temperature
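The three metrics can be computed from a log of per-block temperature samples. The sketch below assumes one row of readings per time step and one area value per block, with the "across area" average taken as an area-weighted mean; the function name and data layout are illustrative choices, not something the slides specify.

```python
def temperature_metrics(samples, areas):
    """samples[t][b]: temperature of block b at time step t.
    areas[b]: area of block b (any consistent unit).
    Returns (AbsMax, Average, AverageMax)."""
    total_area = sum(areas)
    steps = len(samples)

    # AbsMax: highest temperature sensed anywhere, at any time.
    abs_max = max(max(row) for row in samples)

    # Average: area-weighted mean per time step, then mean over time.
    average = sum(
        sum(t * a for t, a in zip(row, areas)) / total_area
        for row in samples
    ) / steps

    # AverageMax: mean over time of the hottest reading of each step.
    average_max = sum(max(row) for row in samples) / steps

    return abs_max, average, average_max


metrics = temperature_metrics([[60, 80], [70, 90]], areas=[1, 3])
```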

Clustering reduces temperature

– If clustering is smart

[Figure: reduction (y-axis -40% to 40%) in Average, AbsMax and AverageMax temperature, together with IPC loss, for 2 clusters and 4 clusters; series: Backends, UL2, Frontend, Processor]

Clustering effects

• May end up with higher power densities!
  – Simpler and smaller units may create hotspots
  – Layout must be thermal-effective
    • Surround hotspots by cold areas
  – Activity steering must be smart
• Other techniques (e.g. throttling) can be applied at smaller granularity
  – Aim at particular clusters without affecting others

Dynamic cluster resizing

• Motivation

[Figure: best ED2P-aware configuration for the gzip application; number of clusters (1 to 4) over time]

[González, ICCD 03]

Dynamic cluster resizing (...)

• Proposal
  – Dynamically compute the energy of blocks
    • Schedulers, FUs, DL0s, etc.
  – Dynamically compute the energy × delay² (ED2P) of the processor
  – Use different configurations for different intervals
  – Measure the optimal configuration
  – Gate-off (disable) useless units
    • Scheduler level
    • Backend level

Dynamic cluster resizing (...)

[Figure: front-end (I$, Decode, Rename, Steer) feeding back-ends BE1 to BEn, connected through the memory bus and the disamb. bus to the UL2 cache. Starting from configuration X with ED2P_x, neighbouring configurations X-1, X-2, X-3, ..., X-y and X+1, X+2, X+3, ..., X+y are evaluated and their ED2P values compared (ED2P_x < ED2P_x+1 < ED2P_x-1 ?)]
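The ED2P comparison between a configuration and its neighbours can be sketched as a simple local search. This is only an illustration under assumptions: `measure` is a hypothetical callback returning the energy and delay observed when running one interval with a given number of active clusters, and the search only looks one step up and down, which may differ from the exploration order used in [González, ICCD 03].

```python
def ed2p(energy, delay):
    """Energy-delay-squared product: lower is better."""
    return energy * delay ** 2


def next_configuration(current, measure, min_clusters=1, max_clusters=4):
    """Compare the current number of active clusters with its neighbours
    and return the one with the lowest ED2P."""
    candidates = [c for c in (current - 1, current, current + 1)
                  if min_clusters <= c <= max_clusters]
    return min(candidates, key=lambda c: ed2p(*measure(c)))


# Made-up (energy, delay) measurements per configuration, for illustration.
observed = {1: (10, 3.0), 2: (14, 2.0), 3: (20, 1.8), 4: (28, 1.7)}

config = 4
history = [config]
for _ in range(3):
    config = next_configuration(config, observed.get)
    history.append(config)
```

With these made-up numbers the search walks from four clusters down to two and stays there, since two clusters give the lowest ED2P among the neighbours.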

ED2P improvement

[Figure: ED2P improvement relative to 4-cluster (y-axis 0.6 to 2) for ammp, bzip2, eon, gzip, mesa, mgrid, parser, swim, tpc, wupwise and AVG; series: Gating scheduler, Gating cluster]

Cluster hopping

• Motivation
  – Power and average temperature savings when statically Vdd gating clusters

[Figure: percentage of reductions (y-axis -20% to 80%) in Leakage, Average, AbsMax and AverageMax, plus Slowdown; series: Cluster 2; Cluster 0; Clusters 2 and 3; Clusters 0, 2 and 3]

* Temperatures in the backend area when gating all but the indicated cluster(s). Reductions over in-box ambient temperature (45 ºC) with respect to a baseline quad-cluster architecture.

[Chaparro, TACS 04]

Cluster hopping

• Based on activity migration [Heo, ISLPED 03]
  – Vdd gate a subset of clusters
  – Rotate clusters to spread activity over time
  – Gated clusters cannot provide any register value
    • Before gating, some register values must be evicted
  – Cache/DTLB contents are lost
    • Unless some low power (e.g. drowsy) mode is used
  – Proactive and/or reactive behavior
    • Proactive: Per interval basis
    • Reactive: On thermal events
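A proactive, per-interval rotation and the register eviction it requires can be sketched as below. The rotation shown is a plain round-robin chosen for illustration; it is not claimed to match any specific scheme from [Chaparro, TACS 04], and both function names are hypothetical.

```python
def gated_clusters(interval, num_clusters=4, num_gated=1):
    """Clusters that are Vdd-gated during `interval`, rotating the gated
    window by one cluster every interval."""
    start = interval % num_clusters
    return {(start + i) % num_clusters for i in range(num_gated)}


def registers_to_evict(mapped_in, gated):
    """Logical registers whose only copies live in clusters about to be
    gated, so their values must be moved out first.

    mapped_in: register -> set of clusters holding a copy.
    """
    return sorted(reg for reg, clusters in mapped_in.items()
                  if clusters and clusters <= gated)


about_to_gate = gated_clusters(3, num_gated=2)
evict = registers_to_evict({1: {0}, 2: {1, 2}, 3: {0, 3}}, about_to_gate)
```

In the usage lines, clusters 3 and 0 are gated during interval 3, so registers 1 and 3 (held only there) need eviction while register 2 does not.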

Cluster hopping schemes

[Figure: the four hopping schemes 1dis-rot, 2dis-dia, 2dis-alt and 3dis-rot]

[Figure: percentage of reductions (y-axis 0% to 90%) for 1dis-rot + Non-thermal, 2dis-dia + Non-thermal, 2dis-alt + Non-thermal and 3dis-rot + Non-thermal]

Effective at reducing average temperature (thus leakage) but not max temperature

Thermal-aware steering

• Try to minimize max temperature
  – Take into account cluster temperature when deciding destination
• Some examples
  – Cold
    • Dispatch to coldest cluster with available resources
      – Lowest average temperature
      – Lowest peak temperature
  – T-Cold
    • Like Cold but discard clusters that are too hot
      – If difference in temperature with previous cluster (ordered by temperature) is higher than a threshold

[Chaparro, TACS 04]
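The Cold and T-Cold policies can be sketched as follows. The per-cluster temperature passed in may be either the average or the peak over the cluster's blocks, matching the two Cold variants. The reading of T-Cold used here is an assumption: clusters are sorted by temperature and everything after the first gap larger than the threshold is discarded, and dispatch stalls (returns `None`) if no remaining cluster has free resources.

```python
def steer_cold(temps, free_slots, allowed=None):
    """Coldest cluster with available resources, or None if all are full."""
    clusters = range(len(temps)) if allowed is None else allowed
    ready = [c for c in clusters if free_slots[c] > 0]
    return min(ready, key=lambda c: temps[c]) if ready else None


def t_cold_candidates(temps, threshold):
    """Clusters kept by T-Cold: walk clusters from coldest to hottest and
    stop at the first jump in temperature larger than `threshold`."""
    order = sorted(range(len(temps)), key=lambda c: temps[c])
    kept = [order[0]]
    for prev, cur in zip(order, order[1:]):
        if temps[cur] - temps[prev] > threshold:
            break
        kept.append(cur)
    return kept


def steer_t_cold(temps, free_slots, threshold):
    """Cold steering restricted to the clusters that are not too hot."""
    return steer_cold(temps, free_slots, t_cold_candidates(temps, threshold))


temps = [70, 62, 64, 80]
cold_choice = steer_cold(temps, free_slots=[1, 0, 0, 1])
t_cold_choice = steer_t_cold(temps, free_slots=[1, 0, 0, 1], threshold=4)
```

In the usage lines the two coldest clusters are full: Cold falls back to cluster 0, while T-Cold treats clusters 0 and 3 as too hot and stalls instead.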

Thermal-aware steering (...)

  – T-Thermal
    • Minimize communications unless candidate cluster is too hot
      – If temperature difference > threshold: priority to the colder
      – Otherwise: priority to the one that minimizes communications and, in case of a tie, maximizes workload balance (#instructions in the schedulers)
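One possible reading of T-Thermal is sketched below: the cluster preferred on communication grounds is compared against the coldest cluster, and the colder one wins only when the temperature gap exceeds the threshold. The exact comparison used in [Chaparro, TACS 04] is not spelled out on the slide, so this pairing, the function name and the parameters are assumptions.

```python
def steer_t_thermal(temps, comms, occupancy, threshold):
    """temps[c]: temperature of cluster c.
    comms[c]: inter-cluster communications needed if dispatching to c.
    occupancy[c]: instructions waiting in the schedulers of c.
    """
    clusters = range(len(temps))
    # Fewest communications first; ties go to the least occupied cluster.
    preferred = min(clusters, key=lambda c: (comms[c], occupancy[c]))
    coldest = min(clusters, key=lambda c: temps[c])
    if temps[preferred] - temps[coldest] > threshold:
        return coldest   # preferred cluster is too hot
    return preferred


hot_case = steer_t_thermal([80, 60, 62, 61], [0, 1, 2, 1], [5, 9, 3, 4], 10)
mild_case = steer_t_thermal([80, 60, 62, 61], [0, 1, 2, 1], [5, 9, 3, 4], 25)
```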

Thermal-aware steering (...)

• Thermal-aware steering standalone

[Figure: percentage of reductions (y-axis -6% to 8%) for Cold (avg. of cluster), Cold (max of cluster), T-Cold and T-Thermal]

Hopping + thermal steering

• Putting it all together

[Figure: percentage of reductions (y-axis -10% to 70%) for T-Thermal, 1dis-rot + Non-thermal, 1dis-rot + T-Thermal, 2dis-dia + T-Thermal and 2dis-alt + T-Thermal]

[Reviewer note (pchaparr): the charts could be simplified, keeping only the most relevant combinations.]

Clustering the front-end

[Figure: front-end stages Br. Prediction, Fetch, Decode, Rename, Cluster Assignment and Dependence Checking, exchanging PC, hit/miss, src/dst regs., assignments and steering information, in front of the Distributed Back-end]

[Parcerisa, TR 02]

Distributed branch predictor

– Broadcast every prediction (next PC) to all clusters
– Hardware loop: predictor uses PC as index
  • insert bubble when switching the predictor cluster (2)
  • if interleaving by low order bits: frequent bubbles
– Solution
  • Pipeline prediction ahead of I-cache + interleave by hi-bits
  • Bubble only when high level interleave boundary crossed (2)

[Figure: Predictor Table distributed over Cluster 0 to Cluster 3, with paths labelled (1) and (2), feeding the pipeline stages BrP, F, Dec, DR, St and Back-end]
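The difference between interleaving the predictor banks by low-order and by high-order PC bits can be seen by counting bank switches along a fetch stream. The sketch assumes 4-byte instructions, four banks, 1 KB regions for the high-order scheme and a straight-line instruction stream; none of these values come from the slides.

```python
NUM_BANKS = 4


def bank_low_bits(pc, inst_bytes=4):
    """Consecutive instructions map to consecutive banks."""
    return (pc // inst_bytes) % NUM_BANKS


def bank_high_bits(pc, region_bytes=1024):
    """A whole region of consecutive instructions maps to one bank."""
    return (pc // region_bytes) % NUM_BANKS


def count_bank_switches(pcs, bank_fn):
    """Number of consecutive fetches that land in different banks;
    each one would cost a bubble."""
    return sum(1 for a, b in zip(pcs, pcs[1:]) if bank_fn(a) != bank_fn(b))


stream = list(range(0, 4096, 4))   # 1024 sequential instructions
low_switches = count_bank_switches(stream, bank_low_bits)
high_switches = count_bank_switches(stream, bank_high_bits)
```

On this stream the low-order scheme switches banks on every fetch, while the high-order scheme switches only at the three region boundaries.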

Impact of distributing branch predictor

• Bank switching
  – SpecInt95: every 24 instructions
  – Mediabench (Mbench): every 133 instructions
• IPC loss
  – SpecInt95: 0.5%
  – Mediabench (Mbench): no loss

Distributed cluster assignment

– Make local assignments and broadcast them to all clusters
– Loop: steering logic uses assignments made by other clusters
– Partial solution: use outdated info (2 cycles)
– Problem: outdated dependences generate communications
– Solution:
  • anticipate dependence-checking and
  • override assignment, if dependence was violated

[Figure: pipeline stages BrP, F, Dec, DR, St and Back-end, shown twice. * Broadcast assignments. ** Broadcast register designators. In the second pipeline a Dep stage (**) is added to override assignments]

Impact of distributing assignment

• W/o assignment overriding
  – 0.42 communications / instruction
  – More than 10% IPC loss
• With assignment overriding
  – 0.17 communications / instruction
  – Less than 2% IPC loss

[Figure: IPC (y-axis 0 to 2.5) for SpecInt95 and Mediabench; series: Baseline, Clustered, Clustered+Overriding]

Thermal benefits

• Clustering the rename table and the reorder buffer [Chaparro, 04]

[Figure: reductions (y-axis 0% to 35%) in Average, AbsMax and AverageMax, plus Slowdown and Extra area; series: Backends, UL1, Frontends, Processor]

Summary

• Clustering is thermal-effective (in addition to complexity-effective)
  – Reduces power
  – Distributes activity
• Clustering enables effective temperature control schemes
  – Adaptive configuration
  – DVS/DFS
  – Cluster hopping
  – Thermal steering
