Overview

Page 1: Overview

1

Overview

1. Motivation (Kevin)
2. Thermal issues (Kevin)
3. Power modeling (David)
4. Thermal management (David)
5. Optimal DTM (Lev)
6. Clustering (Antonio)
7. Power distribution (David)
8. What current chips do (Lev)
9. HotSpot (Kevin)

Page 2: Overview

2

The clustering approach

• Reduce complexity by partitioning
  – Less latency, area, power and temperature
• Fast, simple, distributed units
  – Communication latency is heterogeneous and exposed to the microarchitecture
• Localize critical communication within clusters (fast wires)

[Figure: clusters 0 to 3 and the global resources, connected by an interconnection network]

Page 3: Overview

3

The clustering approach (...)

• Smaller structures consume less power
  – Higher power efficiency [Zyuban, IEEE Transactions 01]
• Partitioning simplifies power management
  – Via clock/power gating techniques [Bahar, ISCA 01]
  – Via dynamic cluster resizing [González, ICCD 03]
  – Via DVS/DFS
• Partitioning reduces temperature
  – Activity is distributed [Chaparro, TACS 04]
  – Hopping schemes can be applied [Chaparro, TACS 04]
  – Adds flexibility for temperature-effective layouts
• IPC overheads due to communication/imbalance
  – Compensated by shorter latency/clock period [Palacharla, ISCA 97], [Canal, HPCA 00]

Page 4: Overview

4

Clustered microarchitecture

• Dynamic steering
• Distributed Issue, Registers, FUs
• Inter-cluster register communication

[Figure: Icache, Fetch & decode and Steering logic feeding clusters C0 to C3 over the IC Network; each cluster contains an Issue-Queue, a Register File and FUs]

Page 5: Overview

5

On-demand communication

• Map table tracks locations of register values
• At rename
  – allocate register for result, in the assigned cluster
  – if a source operand is in a remote cluster
    • insert a copy instruction in remote cluster
    • allocate register for a copy
• At commit
  – free allocated register(s) by previous mapping

[Figure: Register Map Table, mapping each log. reg. to a phys. reg. in C0, C1, C2 and C3]

[Canal, PACT99]
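
The rename-time steps above can be sketched in a few lines of Python. This is a minimal illustration, not the mechanism of [Canal, PACT99]: the function name `rename`, the list-based free lists and the rule for picking which remote cluster to copy from (nearest cluster index, lowest index on a tie) are all assumptions made for the example. The register values are taken from the copy-instruction example later in these slides.

```python
NUM_CLUSTERS = 4

def rename(table, free, srcs, dst, cluster):
    """Rename one instruction that was steered to `cluster`.

    table[log][c] is the physical register holding logical register `log`
    in cluster c, or None if it is not mapped there.
    free[c] is a list of free physical registers in cluster c (assumed).
    Returns (copies, phys_srcs, phys_dst, to_free_at_commit).
    """
    copies, phys_srcs = [], []
    for s in srcs:
        row = table[s]
        if row[cluster] is None:
            # Operand only lives in remote clusters: insert a copy instruction.
            holders = [c for c in range(NUM_CLUSTERS) if row[c] is not None]
            # Assumed policy: copy from the nearest holder, lowest index on ties.
            remote = min(holders, key=lambda c: (abs(c - cluster), c))
            new_reg = free[cluster].pop(0)  # register allocated for the copy
            copies.append((remote, row[remote], cluster, new_reg))
            row[cluster] = new_reg
        phys_srcs.append(row[cluster])
    # Allocate the result register in the assigned cluster.
    phys_dst = free[cluster].pop(0)
    # Registers of the previous mapping are released when this commits.
    to_free = [(c, r) for c, r in enumerate(table[dst]) if r is not None]
    table[dst] = [phys_dst if c == cluster else None for c in range(NUM_CLUSTERS)]
    return copies, phys_srcs, phys_dst, to_free

# Instruction "r1 <- r2 op r3" steered to cluster 2.
table = {1: [18, None, None, 9], 2: [None, 3, 15, None], 3: [5, 10, None, 13]}
free = {2: [27, 14]}
copies, phys_srcs, phys_dst, to_free = rename(table, free, [2, 3], 1, 2)
```

Here logical register 3 is not mapped in cluster 2, so one copy is generated from CL1:10 into CL2:27 before the destination is mapped to CL2:14.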

Page 6: Overview

6

Rename

Renaming Table (before)

Log  CL0  CL1  CL2  CL3
0    X    27   X    X
1    18   X    X    9
2    X    3    15   X
3    5    10   X    13
4    X    X    12   14
5    4    X    X    X
6    X    1    24   X
7    2    X    X    X
8    X    2    X    9

Instruction, steered to Cluster 1 by the Steering Logic

          src1  src2  src3  src4  src5  dst
Logical   2     3     X     X     X     1
Physical  3     10    X     X     X     14

Renaming Table (after)

Log  CL0  CL1  CL2  CL3
0    X    27   X    X
1    X    14   X    X
2    X    3    15   X
3    5    10   X    13
4    X    X    12   14
5    4    X    X    X
6    X    1    24   X
7    2    X    X    X
8    X    2    X    9

Page 7: Overview

7

Copy instructions

Renaming Table (before)

Log  CL0  CL1  CL2  CL3
0    X    27   X    X
1    18   X    X    9
2    X    3    15   X
3    5    10   X    13
4    X    X    12   14
5    4    X    X    X
6    X    1    24   X
7    2    X    X    X
8    X    2    X    9

Instruction, steered to Cluster 2 by the Steering Logic

          src1  src2  src3  src4  src5  dst
Logical   2     3     X     X     X     1
Physical  15    X!!!  X     X     X

Copy instruction

src1     dst
CL1:10   CL2:27

Renaming Table (after the copy)

Log  CL0  CL1  CL2  CL3
0    X    27   X    X
1    13   X    X    5
2    X    3    15   X
3    5    10   27   13
4    X    X    12   14
5    4    X    X    X
6    X    1    24   X
7    2    X    X    X
8    X    2    X    9

Instruction after renaming

          src1  src2  src3  src4  src5  dst
Physical  15    27    X     X     X     14

Renaming Table (after)

Log  CL0  CL1  CL2  CL3
0    X    27   X    X
1    X    X    14   X
2    X    3    15   X
3    5    10   27   13
4    X    X    12   14
5    4    X    X    X
6    X    1    24   X
7    2    X    X    X
8    X    2    X    9

Page 8: Overview

8

Broadcast communication

• Values sent to all register files
  – Local file is updated earlier than remote ones
  – Registers are replicated in all files
    • Register storage waste
    • Increase in power
  – Values are written multiple times
    • Increase in power
  – May reduce communication penalties
    • Values are present everywhere
      – But not at the same time
  – E.g.: Alpha 21264

Page 9: Overview

9

Cluster assignment schemes

• Main goals
  – Minimize inter-cluster communication penalty
  – Maximize workload balance
• Main approaches
  – Static approaches [Farkas, Micro 97] [Sastry, PLDI 98]
    • Less flexible than dynamic ones: poor load balancing
  – Dynamic, dependence-based [Palacharla ISCA 97] [Alpha 21264] [Kemp, ICPP 96]
    • Only consider dependences through unavailable operands
    • Lack specific balancing mechanisms
  – Dynamic, workload balance oriented [Baniasadi 00]
    • Only suitable with low communication penalty architectures
  – Dynamic, dependence-based and workload balance oriented [Canal HPCA 2000, Parcerisa PACT 2002]
    • Tries to find best trade-off between communications and workload balance

Page 10: Overview

10

Cluster assignment schemes

• Accurate-Rebalancing Priority RMB
  1- To minimize communication penalties:
     If unavailable source register: choose producer's cluster
     Else: select clusters with highest number of source regs. mapped
  2- Choose the least loaded one of the above
  Exception: if imbalance > threshold, then exclude clusters with positive workload, prior to applying rules 1 and 2
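
The two rules and the exception can be read as a small selection function. The sketch below is one possible reading, not the published implementation: the name `steer`, the representation of workload as a signed deviation from the average (positive meaning overloaded) and the imbalance metric (maximum minus minimum workload) are assumptions introduced for the example.

```python
def steer(pending_producers, mapped_counts, workload, threshold):
    """Pick a cluster for one instruction.

    pending_producers: set of clusters that will produce a source operand
                       that is not available yet.
    mapped_counts[c]:  number of source registers currently mapped in c.
    workload[c]:       signed load of c relative to the average (assumed).
    """
    clusters = list(range(len(workload)))
    imbalance = max(workload) - min(workload)  # assumed imbalance metric
    if imbalance > threshold:
        # Exception: drop overloaded clusters before applying rules 1 and 2.
        clusters = [c for c in clusters if workload[c] <= 0] or clusters
    # Rule 1: follow the producer of an unavailable operand, if any ...
    candidates = [c for c in clusters if c in pending_producers]
    if not candidates:
        # ... else keep the clusters with the most source registers mapped.
        best = max(mapped_counts[c] for c in clusters)
        candidates = [c for c in clusters if mapped_counts[c] == best]
    # Rule 2: among those, choose the least loaded cluster.
    return min(candidates, key=lambda c: workload[c])

follow_producer = steer({1}, [0, 1, 1, 0], [0, 2, -1, -1], 10)
most_mapped = steer(set(), [1, 0, 2, 2], [0, 0, 1, -1], 10)
rebalanced = steer({1}, [0, 1, 1, 0], [-2, 6, -3, -1], 4)
```

In the third call the producer's cluster is overloaded and the imbalance exceeds the threshold, so the instruction goes to the remaining cluster that holds a source register, at the cost of one communication.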

Page 11: Overview

11

Evaluation

[Figure: SpecInt95 results for Modulo and AR-Priority steering. Three panels: H-mean IPC, NREADY imbalance, and communications per instruction]

Page 12: Overview

12

Dynamic vs. static steering

[Figure: speedup (%) for perl, gcc, compress, m88ksim and H-mean; series: Static LdSt slice, Dynamic LdSt slice, Advanced RMB]

S. Sastry, S. Palacharla and J. E. Smith, PLDI 1998

Page 13: Overview

13

Data cache architectures

• Centralized
  – Dcache is a cluster
  – Single Load/Store queue
  – Simple disambiguation

[Figure: four Backends sharing a single L1 Dcache, backed by the UL2]

[González, WMPI 04]

Page 14: Overview

14

Data cache architecture (II)

• Attraction caches
  – Lines are copied on demand
  – A coherence scheme is needed
  – Steering must exploit data locality

[Figure: UL2 backing four DL1 caches, one per backend (BE 1 to BE 4)]

Page 15: Overview

15

Data cache architecture (III)

• Replicated
  – Area cost
  – Traffic and activity due to store broadcast

[Figure: UL2 backing four DL1 caches, one per backend (BE 1 to BE 4)]

Page 16: Overview

16

Data cache architecture (IV)

• Interleaved
  – Word/line interleaved
  – Steering needs to predict the bank

[Figure: UL2 backing four DL1 banks, each attached to one backend (BE 1 to BE 4)]
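
As a rough illustration of word versus line interleaving, the bank that owns an address can be computed from the interleaving granule. The sketch below is not taken from the slides: the 4-byte word, the 32-byte line, the function `bank_of` and the last-value predictor `LastBankPredictor` are all hypothetical choices, the latter standing in for whatever bank predictor the steering logic would actually use.

```python
def bank_of(addr, n_banks=4, granule=32):
    """Bank holding `addr`; granule=4 gives word interleaving (assumed
    4-byte words), granule=32 gives line interleaving (assumed 32-byte lines)."""
    return (addr // granule) % n_banks

class LastBankPredictor:
    """Hypothetical predictor: a load goes to the bank it accessed last time."""

    def __init__(self, n_banks=4, granule=32):
        self.n_banks, self.granule = n_banks, granule
        self.last = {}  # load PC -> bank of its previous access

    def predict(self, pc):
        # Default to bank 0 for loads that have not been seen yet.
        return self.last.get(pc, 0)

    def update(self, pc, addr):
        self.last[pc] = bank_of(addr, self.n_banks, self.granule)

line_bank = bank_of(0x40, 4, 32)   # third 32-byte line
word_bank = bank_of(0x44, 4, 4)    # word 17
predictor = LastBankPredictor()
first_guess = predictor.predict(0x400)
predictor.update(0x400, 0x40)
second_guess = predictor.predict(0x400)
```

The steering logic would send a load to the predicted bank's cluster at dispatch, before the address is known, and pay a penalty on a misprediction.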

Page 17: Overview

17

Memory issues

• Disambiguation
  – Load/Store queues are distributed
  – Stores are allocated in all clusters
  – Address is computed in one and broadcast
  – Loads go to memory once previous stores know their addresses
• Memory coherence
  – Write-Invalidate / Write-Update protocols

Page 18: Overview

18

Performance comparison

[Figure: performance relative to Attraction for ammp, bzip2, eon, gzip, mesa, mgrid, parser, swim, tpc, wupwise and AVG; series: Attraction, Replicated, Phy-Dist, Centralized]

Page 19: Overview

19

Thermal benefits of clustering

[Figure: floorplan for a quad-cluster architecture. Shared blocks: Unified L2 Cache, Trace Cache, Reorder Buffer, Branch Predictors, DECO, ITLB, Rename Table. Clusters 0 to 3, each with: Integer Scheduler, FP Scheduler, Memory Scheduler, Copy Scheduler, Integer Register File, FP Register File, Integer Execution Units, Floating Point Execution Units, Data Cache Level 1, DTLB]

[Chaparro, TACS 04]

Page 20: Overview

20

Temperature metrics

• AbsMax
  – Maximum sensed temperature
• Average
  – Average temperature across time and area
• AverageMax
  – Average temperature across time of maximum sensed temperature
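
The three metrics can be computed from a trace of sensor readings as follows. This is a minimal sketch: the function name `temperature_metrics` and the trace layout (one list of readings per time step) are assumptions, and the area average treats every sensor as covering an equal area.

```python
def temperature_metrics(trace):
    """trace[t][s] is the reading of sensor s at time step t.

    Returns (AbsMax, Average, AverageMax). The area average assumes that
    each sensor covers the same area.
    """
    # AbsMax: hottest reading of any sensor at any time.
    abs_max = max(max(step) for step in trace)
    # Average: mean over all time steps and all sensors.
    average = sum(sum(step) for step in trace) / sum(len(step) for step in trace)
    # AverageMax: mean over time of the hottest reading at each step.
    average_max = sum(max(step) for step in trace) / len(trace)
    return abs_max, average, average_max

trace = [[60.0, 70.0], [80.0, 70.0], [65.0, 75.0]]
abs_max, average, average_max = temperature_metrics(trace)
```

For this trace the metrics are 80.0, 70.0 and 75.0 respectively, which shows how AverageMax sits between the overall average and the absolute peak.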

Page 21: Overview

21

Clustering reduces temperature

– If clustering is smart

[Figure: reduction (%) in Average, AbsMax and AverageMax temperature, plus IPC loss, for 2 Clusters and 4 Clusters; series: Backends, UL2, Frontend, Processor]

Page 22: Overview

22

Clustering effects

• May end up with higher power densities!
  – Simpler and smaller units may create hotspots
  – Layout must be thermal-effective
    • Surround hotspots by cold areas
  – Activity steering must be smart
• Other techniques (e.g. throttling) can be applied at smaller granularity
  – Aim at particular clusters without affecting others

Page 23: Overview

23

Dynamic cluster resizing

• Motivation

[Figure: best ED2P-aware configuration for the gzip application; number of clusters (1 to 4) over time]

[González, ICCD 03]

Page 24: Overview

24

Dynamic cluster resizing

• Proposal
  – Dynamically compute the energy of blocks
    • Schedulers, FUs, DL0s, etc.
  – Dynamically compute the energy × delay² (ED2P) of the processor
  – Use different configurations for different intervals
  – Measure the optimal configuration
  – Gate-off (disable) useless units
    • Scheduler level
    • Backend level
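
The selection criterion is the energy-delay-squared product, ED2P = E × D². A minimal sketch of picking the best configuration for one interval is shown below; the function names and the energy and delay figures are invented for illustration and are not measurements from [González, ICCD 03].

```python
def ed2p(energy, delay):
    """Energy-delay-squared product: E * D^2."""
    return energy * delay ** 2

def best_configuration(measurements):
    """measurements maps a cluster count to the (energy, delay) observed
    while running one interval in that configuration."""
    return min(measurements, key=lambda n: ed2p(*measurements[n]))

# Hypothetical per-interval measurements (arbitrary units).
measurements = {1: (1.0, 4.0), 2: (1.5, 2.5), 3: (2.0, 2.2), 4: (3.0, 2.0)}
best = best_configuration(measurements)
```

Because delay is squared, a configuration that saves energy but runs much slower (one cluster here) loses to a moderately larger one; units that the chosen configuration does not need can then be gated off.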

Page 25: Overview

25

Dynamic cluster resizing

[Figure: front-end (I$, Decode, Rename, Steer) feeding backends BE1 to BEn over the memory bus and the disamb. bus, backed by the UL2 cache. The ED2P of the current configuration X is compared with that of its neighbours X-1 and X+1 (ED2P_x < ED2P_x+1 < ED2P_x-1 ?), and the search continues through X-2, X-3, ..., X-y or X+2, X+3, ..., X+y]

Page 26: Overview

26

Dynamic cluster resizing

[Figure: ED2P improvement relative to 4-cluster for ammp, bzip2, eon, gzip, mesa, mgrid, parser, swim, tpc, wupwise and AVG; series: Gating scheduler, Gating cluster]

Page 27: Overview

27

Cluster hopping

• Motivation
  – Power and average temperature savings when statically Vdd gating clusters

[Figure: percentage of reductions in Leakage, Average, AbsMax, AverageMax and Slowdown; series: Cluster 2; Cluster 0; Clusters 2 and 3; Clusters 0, 2 and 3]

* Temperatures in the backend area when gating all but the indicated cluster(s). Reductions over in-box ambient temperature (45 °C) with respect to a baseline quad-cluster architecture.

[Chaparro, TACS 04]

Page 28: Overview

28

Cluster hopping

• Based on activity migration [Heo, ISLPED 03]
  – Vdd gate a subset of clusters
  – Rotate clusters to spread activity over time
  – Gated clusters cannot provide any register value
    • Before gating, some register values must be evicted
  – Cache/DTLB contents are lost
    • Unless some low power (e.g. drowsy) mode is used
  – Proactive and/or reactive behavior
    • Proactive: Per interval basis
    • Reactive: On thermal events
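
The proactive, per-interval variant can be pictured as a rotation of the gated subset. The sketch below is an assumed round-robin policy written for illustration; the generator name `hopping_schedule` and the choice of adjacent cluster indices are not taken from the slides, which do not spell out how each scheme picks its subset.

```python
def hopping_schedule(n_clusters, n_gated, n_intervals):
    """Yield, for each interval, the set of clusters that is Vdd-gated.

    Assumed policy: gate `n_gated` consecutive clusters and shift the
    window by one cluster at every interval boundary.
    """
    for i in range(n_intervals):
        yield {(i + k) % n_clusters for k in range(n_gated)}

one_gated = list(hopping_schedule(4, 1, 5))
two_gated = list(hopping_schedule(4, 2, 3))
```

Before each hop, the register values that live only in a cluster about to be gated would have to be evicted to an active cluster, and that cluster's cache and DTLB contents are lost unless a drowsy mode retains them.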

Page 29: Overview

29

Cluster hopping schemes

[Figure: the four hopping schemes: 1dis-rot, 2dis-dia, 2dis-alt and 3dis-rot]

[Figure: percentage of reductions in Leakage, Average, AbsMax, AverageMax and Slowdown; series: 1dis-rot + Non-thermal, 2dis-dia + Non-thermal, 2dis-alt + Non-thermal, 3dis-rot + Non-thermal]

Effective at reducing average temperature (thus leakage) but not max temperature

Page 30: Overview

30

Thermal-aware steering

• Try to minimize max temperature
  – Take into account cluster temperature when deciding destination
• Some examples
  – Cold
    • Dispatch to coldest cluster with available resources
      – Lowest average temperature
      – Lowest peak temperature
  – T-Cold
    • Like Cold but discard clusters that are too hot
      – If difference in temperature with previous cluster (ordered by temperature) is higher than a threshold

[Chaparro, TACS 04]
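
A possible reading of Cold and T-Cold in code is given below. It is a sketch under assumptions: the function names, the per-cluster temperature list (which could hold either the average or the peak temperature of each cluster) and the decision to return `None` when no eligible cluster has free resources are illustrative choices, not details from [Chaparro, TACS 04].

```python
def cold(temps, has_resources):
    """Coldest cluster that still has available resources, or None."""
    ready = [c for c in range(len(temps)) if has_resources[c]]
    return min(ready, key=lambda c: temps[c]) if ready else None

def t_cold(temps, has_resources, threshold):
    """Like cold(), but first discard clusters that are too hot.

    Clusters are ordered by temperature; the first one that is more than
    `threshold` degrees hotter than its predecessor is discarded together
    with every hotter cluster.
    """
    order = sorted(range(len(temps)), key=lambda c: temps[c])
    allowed = [order[0]]
    for prev, cur in zip(order, order[1:]):
        if temps[cur] - temps[prev] > threshold:
            break
        allowed.append(cur)
    ready = [c for c in allowed if has_resources[c]]
    return min(ready, key=lambda c: temps[c]) if ready else None

temps = [70, 62, 64, 75]
cold_pick = cold(temps, [True, False, True, True])
t_cold_pick = t_cold(temps, [True, False, True, True], 5)
cold_hot_only = cold(temps, [True, False, False, True])
t_cold_hot_only = t_cold(temps, [True, False, False, True], 5)
```

The last two calls show the difference: when only hot clusters have free resources, Cold still dispatches to the cooler of them, while this reading of T-Cold holds the instruction back.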

Page 31: Overview

31

Thermal-aware steering

  – T-Thermal
    • Minimize communications unless candidate cluster is too hot
      – If temperature difference > threshold: priority to the colder
      – Otherwise: priority to the one that minimizes communications and, in case of a tie, maximizes workload balance (#instructions in the schedulers)
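
The T-Thermal priority rule can be sketched as a pairwise comparison applied across the clusters. The function name `t_thermal`, its arguments and the left-to-right scan are assumptions made for this example; since the pairwise rule is not transitive, a real implementation may order the comparisons differently.

```python
def t_thermal(temps, comms, queued, threshold):
    """Pick a cluster by scanning candidates and keeping the preferred one.

    temps[c]:  temperature of cluster c.
    comms[c]:  inter-cluster communications needed if c is chosen.
    queued[c]: instructions waiting in the schedulers of c (workload).
    """
    best = 0
    for c in range(1, len(temps)):
        if abs(temps[c] - temps[best]) > threshold:
            # Temperatures differ too much: priority to the colder cluster.
            if temps[c] < temps[best]:
                best = c
        elif (comms[c], queued[c]) < (comms[best], queued[best]):
            # Otherwise fewer communications wins; ties go to the
            # cluster with fewer queued instructions.
            best = c
    return best

thermal_pick = t_thermal([70, 71, 60, 72], [0, 1, 2, 1], [5, 2, 1, 2], 5)
comm_pick = t_thermal([70, 71, 69, 72], [1, 1, 2, 0], [5, 2, 1, 9], 5)
```

In the first call cluster 2 is more than 5 degrees colder than the rest and wins despite needing two communications; in the second all clusters are within the threshold, so the cluster needing no communication wins.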

Page 32: Overview

32

Thermal-aware steering

• Thermal-aware steering standalone

[Figure: percentage of reductions in Leakage, Average, AbsMax, AverageMax and Slowdown; series: Cold (avg. of cluster), Cold (max of cluster), T-Cold, T-Thermal]

Page 33: Overview

33

Hopping + thermal steering

• Putting it all together

[Figure: percentage of reductions in Leakage, Average, AbsMax, AverageMax and Slowdown; series: T-Thermal, 1dis-rot + Non-thermal, 1dis-rot + T-Thermal, 2dis-dia + T-Thermal, 2dis-alt + T-Thermal]

Reviewer comment (pchaparr): The charts can be simplified to keep only the most relevant combinations.

Page 34: Overview

34

Clustering the front-end

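[Figure: clustered front-end pipeline with stages Br. Prediction, Fetch, Decode, Rename, Cluster Assignment and Dependence Checking. Information exchanged between front-end clusters: PC, hit/miss, src/dst regs., assignments. Steering feeds the Distributed Back-end]

[Parcerisa, TR 02]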

Page 35: Overview

35

Distributed branch predictor

– Broadcast every prediction (next PC) to all clusters
– Hardware loop: predictor uses PC as index
  • insert bubble when switching the predictor cluster (2)
  • if interleaving by low order bits: frequent bubbles

[Figure: Predictor Table banked across Clusters 0 to 3, with paths (1) and (2); pipeline stages BrP, F, Dec, DR, St, Back-end]

– Solution
  • Pipeline prediction ahead of I-cache + interleave by hi-bits
  • Bubble only when high level interleave boundary crossed (2)
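
The effect of the interleaving bits on bubble frequency can be seen by counting bank switches along an instruction stream. In the sketch below the 4-byte instruction size, the bit positions (`shift=2` for low-order, `shift=10` for high-order interleaving) and the function name `bank_switches` are assumptions chosen for illustration.

```python
def bank_switches(pcs, n_banks, shift):
    """Count how often consecutive PCs fall in different predictor banks
    when the bank is selected by the PC bits starting at `shift`."""
    banks = [(pc >> shift) % n_banks for pc in pcs]
    return sum(a != b for a, b in zip(banks, banks[1:]))

# 64 sequential instructions, assuming 4 bytes per instruction.
pcs = list(range(0x1000, 0x1000 + 4 * 64, 4))
low_order = bank_switches(pcs, 4, 2)    # interleave on the lowest PC bits
high_order = bank_switches(pcs, 4, 10)  # interleave on higher PC bits
```

With low-order interleaving every sequential fetch moves to another bank (63 switches for 64 instructions, each costing a bubble), whereas with high-order interleaving this straight-line run stays in one bank.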

Page 36: Overview

36

Impact of distributing branch predictor

• Bank switching
  – SpecInt95: every 24 instructions
  – Mbench: every 133 instructions
• IPC loss
  – SpecInt95: 0.5%
  – Mbench: no loss

Page 37: Overview

37

Distributed cluster assignment

– Make local assignments and broadcast them to all clusters
– Loop: steering logic uses assignments made by other clusters
– Partial solution: use outdated info (2 cycles)
– Problem: outdated dependences generate communications

[Figure: two pipelines with stages BrP, F, Dec, DR, St, Back-end. * Broadcast assignments. ** Broadcast register designators. The second pipeline adds a Dep stage that can override assignments]

– Solution:
  • anticipate dependence-checking and
  • override assignment, if dependence was violated

Page 38: Overview

38

Impact of distributing assignment

• W/o assignment overriding
  – 0.42 communications / instruction
  – More than 10% IPC loss
• With assignment overriding
  – 0.17 communications / instruction
  – Less than 2% IPC loss

[Figure: IPC for SpecInt95 and Mediabench; series: Baseline, Clustered, Clustered+Overriding]

Page 39: Overview

39

Thermal benefits

• Clustering the rename table and the reorder buffer [Chaparro, 04]

[Figure: reductions (0% to 35%) in Average, AbsMax and AverageMax temperature, plus Slowdown-Extra area; series: Backends, UL1, Frontends, Processor]

Page 40: Overview

40

Summary

• Clustering is thermal-effective (in addition to complexity-effective)
  – Reduces power
  – Distributes activity
• Clustering enables effective temperature control schemes
  – Adaptive configuration
  – DVS/DFS
  – Cluster hopping
  – Thermal steering