Overview
1. Motivation (Kevin)
2. Thermal issues (Kevin)
3. Power modeling (David)
4. Thermal management (David)
5. Optimal DTM (Lev)
6. Clustering (Antonio)
7. Power distribution (David)
8. What current chips do (Lev)
9. HotSpot (Kevin)
The clustering approach
• Reduce complexity by partitioning
– Less latency, area, power and temperature
• Fast, simple, distributed units
– Communication latency is heterogeneous and exposed to the microarchitecture
• Localize critical communication within clusters (fast wires)
[Figure: cluster0 to cluster3 and the global resources, connected by an interconnection network]
The clustering approach (...)
• Smaller structures consume less power
– Higher power efficiency [Zyuban, IEEE Transactions 01]
• Partitioning simplifies power management
– Via clock/power gating techniques [Bahar, ISCA 01]
– Via dynamic cluster resizing [González, ICCD 03]
– Via DVS/DFS
• Partitioning reduces temperature
– Activity is distributed [Chaparro, TACS 04]
– Hopping schemes can be applied [Chaparro, TACS 04]
– Adds flexibility for temperature-effective layouts
• IPC overheads due to communication/imbalance
– Compensated by shorter latency/clock period [Palacharla, ISCA 97], [Canal, HPCA 00]
Clustered microarchitecture
• Dynamic steering
• Distributed Issue, Registers, FUs
• Inter-cluster register communication
[Figure: Icache, Fetch & decode and Steering logic feed clusters C0 to C3; each cluster holds an Issue-Queue, a Register File and FUs, linked by the IC Network]
On-demand communication
• Map table tracks locations of register values
• At rename
– allocate register for result, in the assigned cluster
– if a source operand is in a remote cluster
  • insert a copy instruction in remote cluster
  • allocate register for a copy
• At commit
– free allocated register(s) by previous mapping
[Figure: Register Map Table mapping each log. reg. to a phys. reg. in each of C0, C1, C2, C3]
[Canal, PACT99]
Rename
Renaming Table (before renaming)
Log  CL0  CL1  CL2  CL3
0    X    27   X    X
1    18   X    X    9
2    X    3    15   X
3    5    10   X    13
4    X    X    12   14
5    4    X    X    X
6    X    1    24   X
7    2    X    X    X
8    X    2    X    9
Instruction (logical registers)
src1  src2  src3  src4  src5  dst
2     3     X     X     X     1
Steering Logic assigns the instruction to Cluster1
Instruction (physical registers)
src1  src2  src3  src4  src5  dst
3     10    X     X     X     14
Renaming Table (after renaming)
Log  CL0  CL1  CL2  CL3
0    X    27   X    X
1    X    14   X    X
2    X    3    15   X
3    5    10   X    13
4    X    X    12   14
5    4    X    X    X
6    X    1    24   X
7    2    X    X    X
8    X    2    X    9
Copy instructions
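Renaming Table (before renaming)
Log  CL0  CL1  CL2  CL3
0    X    27   X    X
1    18   X    X    9
2    X    3    15   X
3    5    10   X    13
4    X    X    12   14
5    4    X    X    X
6    X    1    24   X
7    2    X    X    X
8    X    2    X    9
Instruction (logical registers)
src1  src2  src3  src4  src5  dst
2     3     X     X     X     1
Steering Logic assigns the instruction to Cluster2
Instruction (physical registers): src1 = 15, src2 = X!!! (logical register 3 is not mapped in CL2)
Copy instruction
src1    dst
CL1:10  CL2:27
Renaming Table (after the copy instruction)
Log  CL0  CL1  CL2  CL3
0    X    27   X    X
1    13   X    X    5
2    X    3    15   X
3    5    10   27   13
4    X    X    12   14
5    4    X    X    X
6    X    1    24   X
7    2    X    X    X
8    X    2    X    9
Instruction (physical registers, after the copy)
src1  src2  src3  src4  src5  dst
15    27    X     X     X     14
Renaming Table (after renaming)
Log  CL0  CL1  CL2  CL3
0    X    27   X    X
1    X    X    14   X
2    X    3    15   X
3    5    10   27   13
4    X    X    12   14
5    4    X    X    X
6    X    1    24   X
7    2    X    X    X
8    X    2    X    9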
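The rename step with on-demand copies can be sketched in Python as below. This is a minimal illustration, not the mechanism of [Canal, PACT99]: the function name `rename`, the dictionary layout of the map table, the free lists, and the rule for picking the source cluster of a copy (nearest cluster index, lower index on ties) are all assumptions, chosen so that the values of the copy-instruction example above (CL1:10 copied to CL2:27, destination 14) are reproduced.

```python
X = None  # "X" in the slides: logical register not mapped in that cluster


def rename(map_table, free_regs, srcs, dst, cluster):
    """Rename one instruction that steering assigned to `cluster`.

    Returns (copies, phys_srcs, phys_dst, to_free_at_commit).
    Each copy is (src_cluster, src_reg, dst_cluster, dst_reg).
    """
    copies, phys_srcs = [], []
    for log in srcs:
        if map_table[log][cluster] is None:
            # Operand is only in remote clusters: insert a copy instruction.
            holders = [c for c, p in map_table[log].items() if p is not None]
            # Assumption: copy from the nearest cluster index, lower index on ties.
            remote = min(holders, key=lambda c: (abs(c - cluster), c))
            new_reg = free_regs[cluster].pop(0)  # register allocated for the copy
            copies.append((remote, map_table[log][remote], cluster, new_reg))
            map_table[log][cluster] = new_reg
        phys_srcs.append(map_table[log][cluster])
    # Registers of the previous mapping of dst are freed when this instruction commits.
    to_free = {c: p for c, p in map_table[dst].items() if p is not None}
    phys_dst = free_regs[cluster].pop(0)  # register allocated for the result
    map_table[dst] = {c: (phys_dst if c == cluster else None) for c in map_table[dst]}
    return copies, phys_srcs, phys_dst, to_free


rows = {
    0: [X, 27, X, X], 1: [18, X, X, 9], 2: [X, 3, 15, X],
    3: [5, 10, X, 13], 4: [X, X, 12, 14], 5: [4, X, X, X],
    6: [X, 1, 24, X], 7: [2, X, X, X], 8: [X, 2, X, 9],
}
map_table = {log: dict(enumerate(row)) for log, row in rows.items()}
free_regs = {2: [27, 14]}  # assumed free list of Cluster2, ordered to match the slide
result = rename(map_table, free_regs, srcs=[2, 3], dst=1, cluster=2)
```

Here `result[0]` holds the single copy instruction and `result[3]` the registers of the old mapping of logical register 1, which would be released at commit.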
Broadcast communication
• Values sent to all register files
– Local file is updated earlier than remote ones
– Registers are replicated in all files
  • Register storage waste
  • Increase in power
– Values are written multiple times
  • Increase in power
– May reduce communication penalties
  • Values are present everywhere
  • But not at the same time
– E.g.: Alpha 21264
Cluster assignment schemes
• Main goals
– Minimize inter-cluster communication penalty
– Maximize workload balance
• Main approaches
– Static approaches [Farkas, Micro 97] [Sastry, PLDI 98]
  • Less flexible than dynamic ones: poor load balancing
– Dynamic, dependence-based [Palacharla ISCA 97] [Alpha 21264] [Kemp, ICPP 96]
  • Only consider dependences through unavailable operands
  • Lack specific balancing mechanisms
– Dynamic, workload balance oriented [Baniasadi 00]
  • Only suitable with low communication penalty architectures
– Dynamic, dependence-based and workload balance oriented [Canal HPCA 2000, Parcerisa PACT 2002]
  • Tries to find best trade-off between communications and workload balance
Cluster assignment schemes
• Accurate-Rebalancing Priority RMB
1- To minimize communication penalties:
   If unavailable source register: choose producer’s cluster
   Else: Select clusters with highest number of source regs. mapped
2- Choose the least loaded one of the above
Exception: if imbalance > threshold, then exclude clusters with positive workload, prior to applying rules 1 and 2
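The two rules and the exception above can be sketched as follows. This is an illustrative reading, not the published implementation: the function name, the tuple encoding of source registers, the signed workload counters (positive meaning overloaded) and the imbalance measure (maximum minus minimum workload) are assumptions.

```python
def ar_priority_steer(sources, workload, threshold):
    """Pick a cluster for one instruction.

    sources:  one (available, producer_cluster, mapped_clusters) tuple per
              source register (hypothetical encoding).
    workload: signed per-cluster load counters; positive means overloaded.
    """
    clusters = list(range(len(workload)))
    candidates = clusters
    # Exception: under strong imbalance, exclude clusters with positive workload.
    if max(workload) - min(workload) > threshold:
        candidates = [c for c in clusters if workload[c] <= 0] or clusters

    # Rule 1: follow the producer of an unavailable source register...
    pending = {prod for avail, prod, _ in sources
               if not avail and prod in candidates}
    if pending:
        preferred = sorted(pending)
    else:
        # ...else keep the clusters with the most source registers mapped.
        counts = {c: sum(c in mapped for _, _, mapped in sources)
                  for c in candidates}
        best = max(counts.values())
        preferred = [c for c in candidates if counts[c] == best]

    # Rule 2: choose the least loaded of the preferred clusters.
    return min(preferred, key=lambda c: workload[c])


workload = [3, -1, -2, 0]
srcs = [(False, 0, {0}), (True, None, {1, 3})]
balanced = ar_priority_steer(srcs, workload, threshold=10)
rebalancing = ar_priority_steer(srcs, workload, threshold=4)
```

With the loose threshold the instruction follows its pending producer into cluster 0; with the tight threshold cluster 0 is excluded first, so the choice falls on the least loaded cluster that still holds a source operand.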
Evaluation
[Figure: SpecInt95 results for Modulo vs. AR-Priority steering; panels: Hmean IPC, NREADY imbalance, Communications / instruction]
Dynamic vs. static steering
[Figure: Speedup (%) for perl, gcc, compress, m88ksim and H-mean; series: Static LdSt slice, Dynamic LdSt slice, Advanced RMB]
S. Sastry, S. Palacharla and J.E. Smith, PLDI 1998
Data cache architectures
• Centralized
– Dcache is a cluster
– Single Load/Store queue
– Simple disambiguation
[Figure: Backends connected to a single L1 Dcache, backed by the UL2]
[González, WMPI 04]
Data cache architecture (II)
• Attraction caches
– Lines are copied on demand
– A coherence scheme is needed
– Steering must exploit data locality
[Figure: UL2 and four DL1 caches, one per backend (BE 1 to BE 4)]
Data cache architecture (III)
• Replicated
– Area cost
– Traffic and activity due to store broadcast
[Figure: UL2 and four DL1 caches, one per backend (BE 1 to BE 4)]
Data cache architecture (IV)
• Interleaved
– Word/line interleaved
– Steering needs to predict the bank
[Figure: UL2 and four DL1 banks, one per backend (BE 1 to BE 4)]
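A small sketch of word vs. line interleaving, to show which DL1 bank an address would map to. The function name and the sizes (4 banks, 64-byte lines, 8-byte words) are assumptions for illustration; the slides do not give them.

```python
def bank_of(address, mode="line", n_banks=4, line_bytes=64, word_bytes=8):
    """Return the DL1 bank that holds `address` under simple modulo interleaving."""
    unit = line_bytes if mode == "line" else word_bytes
    return (address // unit) % n_banks


line_bank = bank_of(0x1040, mode="line")   # consecutive lines rotate over banks
word_bank = bank_of(0x1048, mode="word")   # consecutive words rotate over banks
```

Because the address is not yet known when the instruction is steered, the steering logic would have to predict this bank rather than compute it.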
Memory issues
• Disambiguation
– Load/Store queues are distributed
– Stores are allocated in all clusters
– Address is computed in one and broadcast
– Loads go to memory once previous stores know their addresses
• Memory coherence
– Write-Invalidate / Write-Update protocols
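The disambiguation rule above can be sketched as below. The class and method names, and the use of a program-order sequence number per memory instruction, are hypothetical; only the behaviour (stores allocated everywhere, address broadcast, loads waiting for older stores) comes from the slide.

```python
class DistributedStoreQueues:
    """One store queue per cluster; every store has an entry in all of them."""

    def __init__(self, n_clusters):
        # Each queue maps a store's sequence number to its address (None = unknown).
        self.queues = [dict() for _ in range(n_clusters)]

    def allocate_store(self, seq):
        for queue in self.queues:          # stores are allocated in all clusters
            queue[seq] = None

    def broadcast_address(self, seq, address):
        for queue in self.queues:          # computed in one cluster, sent to all
            queue[seq] = address

    def load_may_issue(self, cluster, load_seq):
        # A load goes to memory once every older store knows its address.
        return all(addr is not None
                   for seq, addr in self.queues[cluster].items()
                   if seq < load_seq)


lsq = DistributedStoreQueues(n_clusters=4)
lsq.allocate_store(seq=1)
lsq.allocate_store(seq=3)
blocked = lsq.load_may_issue(cluster=2, load_seq=2)
lsq.broadcast_address(seq=1, address=0x100)
released = lsq.load_may_issue(cluster=2, load_seq=2)
```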
Performance comparison
[Figure: Performance relative to Attraction (0.6 to 1.4) for ammp, bzip2, eon, gzip, mesa, mgrid, parser, swim, tpc, wupwise and AVG; series: Attraction, Replicated, Phy-Dist, Centralized]
Thermal benefits of clustering
[Figure: Floorplan for a quad-cluster architecture. Labeled blocks: Unified L2 Cache, Trace Cache, Reorder Buffer, Branch Predictors, DECO, ITLB, Rename Table, Cluster 0 to Cluster 3, Integer Scheduler, FP Scheduler, Memory Scheduler, Copy Scheduler, Integer Register File, FP Register File, Integer Execution Units, Floating Point Execution Units, Data Cache Level 1, DTLB]
[Chaparro, TACS 04]
Temperature metrics
• AbsMax
– Maximum sensed temperature
• Average
– Average temperature across time and area
• AverageMax
– Average temperature across time of maximum sensed temperature
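The three metrics can be computed from a grid of sensor readings as sketched below. The function name and data layout are assumptions, and the area average is taken as an unweighted mean, which presumes that every sensor covers the same area.

```python
def temperature_metrics(samples):
    """samples[t][s] is the temperature of sensor s at time step t."""
    per_step_max = [max(step) for step in samples]
    all_values = [temp for step in samples for temp in step]
    return {
        "AbsMax": max(per_step_max),                             # hottest reading ever
        "Average": sum(all_values) / len(all_values),            # over time and area
        "AverageMax": sum(per_step_max) / len(per_step_max),     # mean of per-step maxima
    }


metrics = temperature_metrics([[50.0, 60.0], [70.0, 40.0]])
```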
Clustering reduces temperature
– If clustering is smart
[Figure: Reduction (-40% to 40%) in Average, AbsMax and AverageMax temperature, plus IPC Loss, for 2 Clusters and 4 Clusters; series: Backends, UL2, Frontend, Processor]
Clustering effects
• May end up with higher power densities!
– Simpler and smaller units may create hotspots
– Layout must be thermal-effective
  • Surround hotspots by cold areas
– Activity steering must be smart
• Other techniques (e.g. throttling) can be applied at smaller granularity
– Aim at particular clusters without affecting others
Dynamic cluster resizing
• Motivation
[Figure: Best ED2P aware configuration. Gzip application; x-axis: Time, y-axis: # Clusters (1 to 4)]
[González, ICCD 03]
Dynamic cluster resizing
• Proposal
– Dynamically compute the energy of blocks
  • Schedulers, FUs, DL0s, etc.
– Dynamically compute the energy × delay² (ED2P) of the processor
– Use different configurations for different intervals
– Measure the optimal configuration
– Gate-off (disable) useless units
  • Scheduler level
  • Backend level
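The selection step of the proposal can be sketched as below: compute ED2P for each configuration measured during an interval and keep the lowest. The function names and the sample energy and delay values are made up for illustration and are not results from [González, ICCD 03].

```python
def ed2p(energy, delay):
    """Energy x delay^2 product; lower is better."""
    return energy * delay ** 2


def best_configuration(measurements):
    """measurements maps a cluster count to (energy, delay) for one interval."""
    return min(measurements, key=lambda n: ed2p(*measurements[n]))


# Hypothetical measurements for one interval: {clusters: (energy, delay)}
measurements = {1: (10.0, 4.0), 2: (14.0, 3.0), 4: (24.0, 2.5)}
chosen = best_configuration(measurements)
```

In this example the two-cluster configuration wins: the four-cluster one is faster but spends enough extra energy to end up with a higher ED2P, so the remaining clusters would be gated off for the interval.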
Dynamic cluster resizing
[Figure: I$, Decode, Rename and Steer stages and the UL2 cache, connected to backends BE1, BE2, BE3, BE4, BE5, ..., BEn through the memory bus and the disamb. bus. The current configuration X, with ED2P_x, is compared against configurations X-1, X-2, X-3, ..., X-y and X+1, X+2, X+3, ..., X+y and their ED2P values, testing ED2P_x < ED2P_x+1 < ED2P_x-1 ?]
Dynamic cluster resizing
[Figure: ED2P improvement relative to 4-cluster (0.6 to 2) for ammp, bzip2, eon, gzip, mesa, mgrid, parser, swim, tpc, wupwise and AVG; series: Gating scheduler, Gating cluster]
Cluster hopping
• Motivation
– Power and average temperature savings when statically Vdd gating clusters
[Figure: Percentage of reductions (-20% to 80%) in Leakage, Average, AbsMax, AverageMax and Slowdown; series: Cluster 2, Cluster 0, Clusters 2 and 3, Clusters 0, 2 and 3]
* Temperatures in the backend area when gating all but the indicated cluster(s). Reductions over in-box ambient temperature (45º) with respect to a baseline quad-cluster architecture.
[Chaparro, TACS 04]
Cluster hopping
• Based on activity migration [Heo, ISLPED 03]
– Vdd gate a subset of clusters
– Rotate clusters to spread activity over time
– Gated clusters cannot provide any register value
  • Before gating, some register values must be evicted
– Cache/DTLB contents are lost
  • Unless some low power (e.g. drowsy) mode is used
– Proactive and/or reactive behavior
  • Proactive: Per interval basis
  • Reactive: On thermal events
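The hop trigger and the rotation can be sketched as below. The function names, the interval length, the temperature limit and the rotate-by-one policy are assumptions; register eviction and cache/DTLB state loss before gating are not modelled.

```python
def should_hop(cycle, interval, temps, gated, limit):
    """Decide whether to move the gated set now."""
    proactive = cycle % interval == 0                 # per-interval basis
    reactive = any(temp > limit                       # thermal event in an active cluster
                   for cluster, temp in enumerate(temps)
                   if cluster not in gated)
    return proactive or reactive


def rotate(gated, n_clusters):
    """Shift every gated cluster to its neighbour to spread activity over time."""
    return {(cluster + 1) % n_clusters for cluster in gated}


gated = {0}
if should_hop(cycle=1500, interval=1000, temps=[60, 90, 60, 60], gated=gated, limit=85):
    gated = rotate(gated, n_clusters=4)   # reactive hop: cluster 1 overheated
```

After this hop, cluster 1 is gated and can cool down while cluster 0 resumes execution.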
Cluster hopping schemes
[Figure: Hopping schemes 1dis-rot, 2dis-dia, 2dis-alt and 3dis-rot]
[Figure: Percentage of reductions (0% to 90%) in Leakage, Average, AbsMax, AverageMax and Slowdown; series: 1dis-rot + Non-thermal, 2dis-dia + Non-thermal, 2dis-alt + Non-thermal, 3dis-rot + Non-thermal]
Effective at reducing average temperature (thus leakage) but not max temperature
Thermal-aware steering
• Try to minimize max temperature
– Take into account cluster temperature when deciding destination
• Some examples
– Cold
  • Dispatch to coldest cluster with available resources
    – Lowest average temperature
    – Lowest peak temperature
– T-Cold
  • Like Cold but discard clusters that are too hot
    – If difference in temperature with previous cluster (ordered by temperature) is higher than a threshold
[Chaparro, TACS 04]
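Cold and T-Cold can be sketched as below. The function names and argument layout are assumptions; `temps` may hold either the average or the peak temperature of each cluster, matching the two Cold variants. Returning `None` (no dispatch this cycle) when no eligible cluster has free resources is also an assumption.

```python
def cold(temps, has_resources):
    """Coldest cluster that still has available resources, or None."""
    order = sorted(range(len(temps)), key=lambda c: temps[c])
    return next((c for c in order if has_resources[c]), None)


def t_cold(temps, has_resources, threshold):
    """Like cold(), but drop clusters past the first large temperature jump."""
    order = sorted(range(len(temps)), key=lambda c: temps[c])
    kept = [order[0]]
    for prev, cur in zip(order, order[1:]):
        if temps[cur] - temps[prev] > threshold:
            break                      # cur and every hotter cluster are too hot
        kept.append(cur)
    return next((c for c in kept if has_resources[c]), None)


temps = [70, 62, 80, 63]
cold_choice = cold(temps, [True, False, True, False])
t_cold_choice = t_cold(temps, [True, False, True, False], threshold=5)
```

In this example Cold still dispatches to cluster 0 at 70 degrees, while T-Cold rejects it because it is more than 5 degrees hotter than the next colder cluster.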
Thermal-aware steering
– T-Thermal
  • Minimize communications unless candidate cluster is too hot
    – If temperature difference > threshold: priority to the colder
    – Otherwise: priority to the one that minimizes communications and, in case of a tie, maximizes workload balance (#instructions in the schedulers)
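One way to read the T-Thermal priority rule is as a pairwise comparison folded over all clusters, as sketched below. The function names, the pairwise reduction, and the use of fewer queued instructions as the measure of better workload balance are assumptions.

```python
from functools import reduce


def prefer(a, b, temps, comms, queued, threshold):
    """Return the preferred cluster of the pair (a, b)."""
    if abs(temps[a] - temps[b]) > threshold:
        return a if temps[a] < temps[b] else b       # one is too hot: colder wins
    if comms[a] != comms[b]:
        return a if comms[a] < comms[b] else b       # fewer communications wins
    return a if queued[a] <= queued[b] else b        # tie: fewer queued instructions


def t_thermal(temps, comms, queued, threshold):
    pick = lambda a, b: prefer(a, b, temps, comms, queued, threshold)
    return reduce(pick, range(len(temps)))


choice = t_thermal(temps=[60, 61, 75, 60], comms=[2, 0, 0, 1],
                   queued=[5, 5, 1, 5], threshold=5)
```

Cluster 2 needs no communications and is the least loaded, but it is rejected for being too hot, so cluster 1 is chosen.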
Thermal-aware steering
• Thermal-aware steering standalone
[Figure: Percentage of reductions (-6% to 8%) in Leakage, Average, AbsMax, AverageMax and Slowdown; series: Cold, avg. of cluster; Cold, max of cluster; T-Cold; T-Thermal]
Hopping + thermal steering
• Putting it all together
[Figure: Percentage of reductions (-10% to 70%) in Leakage, Average, AbsMax, AverageMax and Slowdown; series: T-Thermal, 1dis-rot + Non-thermal, 1dis-rot + T-Thermal, 2dis-dia + T-Thermal, 2dis-alt + T-Thermal]
Clustering the front-end
[Figure: Front-end stages Br. Prediction, Fetch, Decode, Rename, Cluster Assignment and Dependence Checking in front of a Distributed Back-end; labeled signals: PC, hit/miss, src/dst regs., assignments, steering]
[Parcerisa, TR 02]
Distributed branch predictor
– Broadcast every prediction (next PC) to all clusters
– Hardware loop: predictor uses PC as index
  • insert bubble when switching the predictor cluster (2)
  • if interleaving by low order bits: frequent bubbles
[Figure: Predictor Table banks in Cluster 0 to Cluster 3, with paths marked (1) and (2); pipeline labels: BrP, F, Dec, DR, St, Back-end]
– Solution
  • Pipeline prediction ahead of I-cache + interleave by hi-bits
  • Bubble only when high level interleave boundary crossed (2)
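The effect of the interleaving choice on bubbles can be sketched by counting how often consecutive predictions hit a different bank. The function name, the 4-byte instruction size, the four banks, the bit positions (bit 2 for low-order, bit 8 for high-order interleaving) and the straight-line PC stream are assumptions for illustration.

```python
def count_bubbles(pcs, shift, n_banks=4):
    """Count predictor-bank switches; each switch costs one bubble."""
    banks = [(pc >> shift) % n_banks for pc in pcs]
    return sum(a != b for a, b in zip(banks, banks[1:]))


pcs = [4 * i for i in range(128)]           # 128 sequential 4-byte instructions
low_bits = count_bubbles(pcs, shift=2)      # bank changes at every instruction
high_bits = count_bubbles(pcs, shift=8)     # bank changes every 64 instructions
```

With these assumed parameters, low-order interleaving switches banks on every fetch, whereas high-order interleaving switches only when the stream crosses a 256-byte boundary.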
Impact of distributing branch predictor
• Bank switching
– SpecInt95: every 24 instructions
– Mediabench: every 133 instructions
• IPC loss
– SpecInt95: 0.5%
– Mediabench: no loss
Distributed cluster assignment
– Make local assignments and broadcast them to all clusters
– Loop: steering logic uses assignments made by other clusters
– Partial solution: use outdated info (2 cycles)
– Problem: outdated dependences generate communications
[Figure: Pipeline BrP, F, Dec, DR, St, Back-end, where * marks broadcast assignments and ** marks broadcast register designators; a second pipeline adds a Dep stage that overrides assignments]
– Solution:
  • anticipate dependence-checking and
  • override assignment, if dependence was violated
Impact of distributing assignment
• W/o assignment overriding
– 0.42 communications / instruction
– More than 10% IPC loss
• With assignment overriding
– 0.17 communications / instruction
– Less than 2% IPC loss
[Figure: IPC (0 to 2.5) for SpecInt95 and Mediabench; series: Baseline, Clustered, Clustered+Overriding]
Thermal benefits
• Clustering the rename table and the reorder buffer [Chaparro, 04]
[Figure: Reductions (0% to 35%) in Average, AbsMax, AverageMax, Slowdown and Extra area; series: Backends, UL1, Frontends, Processor]
Summary
• Clustering is thermal-effective (in addition to complexity-effective)
– Reduces power
– Distributes activity
• Clustering enables effective temperature control schemes
– Adaptive configuration
– DVS/DFS
– Cluster hopping
– Thermal steering