Microarchitectural Wire Management for Performance and Power in Partitioned Architectures

Rajeev Balasubramonian, Naveen Muralimanohar, Karthik Ramani, Venkatanand Venkatachalapathy

University of Utah, Feb 14th 2005


Overview/Motivation

Wire delays are costly for performance and power

Latencies of 30 cycles to reach ends of a chip

50% of dynamic power is in interconnect switching (Magen et al., SLIP 04)

Abundant number of metal layers


Wire Characteristics

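Wire resistance and capacitance per unit length

$$C_{wire} = \epsilon_0 \left( 2K\,\epsilon_{horiz}\,\frac{\text{thickness}}{\text{spacing}} + 2\,\epsilon_{vert}\,\frac{\text{width}}{\text{layerspacing}} \right) + \text{fringe}(\epsilon_{horiz}, \epsilon_{vert})$$

$$R_{wire} = \frac{\rho}{(\text{thickness} - \text{barrier})(\text{width} - 2\,\text{barrier})}$$

Increasing width and spacing lowers delay (as delay ∝ RC) but also lowers bandwidth

[Table: effect of width and spacing on resistance, capacitance, and bandwidth]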
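
The per-unit-length expressions on this slide can be explored numerically. The sketch below assumes the usual parallel-plate form (sidewall term, inter-layer term, constant fringe term). The function names and every numeric value (resistivity, barrier thickness, dielectric constants, the factor K, the fringe term, the geometry) are placeholders chosen only to show the trend, not parameters from the talk.

```python
EPS0 = 8.854e-12  # vacuum permittivity in F/m

def wire_resistance(rho, width, thickness, barrier):
    """Resistance per unit length (ohm/m) of a wire with a thin barrier layer."""
    return rho / ((thickness - barrier) * (width - 2 * barrier))

def wire_capacitance(width, thickness, spacing, layer_spacing,
                     k=1.5, eps_horiz=2.7, eps_vert=2.7, fringe=4e-11):
    """Capacitance per unit length (F/m); k, eps_* and fringe are placeholder values."""
    sidewall = 2 * k * eps_horiz * thickness / spacing   # to neighbours in the same layer
    vertical = 2 * eps_vert * width / layer_spacing      # to the layers above and below
    return EPS0 * (sidewall + vertical) + fringe

# Hypothetical geometry in metres.
base = dict(width=200e-9, thickness=400e-9, spacing=200e-9, layer_spacing=400e-9)
wide = dict(base, width=400e-9, spacing=400e-9)          # double width and spacing

for name, g in (("base", base), ("wide", wide)):
    r = wire_resistance(rho=2.2e-8, width=g["width"],
                        thickness=g["thickness"], barrier=10e-9)
    c = wire_capacitance(**g)
    print(f"{name}: R = {r:.3e} ohm/m, C = {c:.3e} F/m, RC = {r * c:.3e}")
```

With these placeholder numbers, doubling width and spacing lowers both R and C, and therefore the RC product, at the cost of fitting fewer wires in the same metal area.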


Design Space Exploration

Tuning wire width and spacing

[Figure: B wires at spacing d and L wires at spacing 2d, annotated with resistance, capacitance, and bandwidth]


Transmission Lines

Allow extremely low delay

High implementation complexity and overhead!

Large width

Large spacing between wires

Design of sensing circuit

Shielding power and ground lines adjacent to each line

Implemented in test CMOS chips

Not employed in this study


Design Space Exploration

Tuning repeater size and spacing

Traditional wires: large repeaters, optimum spacing

Power-optimal wires: smaller repeaters, increased spacing

[Figure: delay and power of the two repeater configurations]


Design Space Exploration

Base case: B wires

Bandwidth optimized: W wires

Power optimized: P wires

Power and B/W optimized: PW wires

Fast, low bandwidth: L wires


Outline

Overview

Wire Design Space Exploration

Employing L wires for Performance

PW wires: The Power Optimizers

Results

Conclusions


Evaluation Platform

L1 DCache Cluster

Centralized front-end

I-Cache & D-Cache

LSQ

Branch Predictor

Clustered back-end


Cache Pipeline

[Figure: Functional Unit, LSQ, and L1 DCache]

Baseline pipeline

Eff. address transfer: 10c

Mem. dep. resolution: 5c

Cache access: 5c

Data return at 20c

Pipeline with partial address transfer

Eff. address transfer: 10c

8-bit transfer: 5c

Partial mem. dep. resolution: 3c

Cache access: 5c

Data return at 14c
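
The cycle counts in the two pipelines can be checked with a few lines of arithmetic. This is only a sketch: the baseline stages are assumed to be fully serialized (which reproduces the 20c figure), while the 14c figure is taken from the slide as given rather than derived, since the exact overlap of stages is not spelled out.

```python
# Stage latencies in cycles, as read off the baseline pipeline.
baseline = {"eff_address_transfer": 10, "mem_dep_resolution": 5, "cache_access": 5}

baseline_return = sum(baseline.values())   # stages assumed to run back to back
optimized_return = 14                      # reported on the slide, not derived here

saved = baseline_return - optimized_return
print(f"baseline data return: {baseline_return}c")
print(f"cycles saved: {saved} ({saved / baseline_return:.0%})")
```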


L wires: Accelerating cache access

Transmit least-significant bits (LSBs) of effective address through L wires

Faster memory disambiguation

  Partial comparison of loads and stores in LSQ

  Introduces false dependences (< 9%)

Indexing data and tag RAM arrays

  LSBs can prefetch data out of L1$

  Reduce access latency of loads
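
The partial comparison in the LSQ can be sketched as follows. This is an illustration under stated assumptions, not the hardware described in the talk: the function names, the default of 8 compared bits, and the three-way classification are hypothetical, and addresses are treated as plain integers.

```python
def lsb(addr, bits):
    """Keep only the `bits` least-significant bits of an address."""
    return addr & ((1 << bits) - 1)

def check_load(load_addr, older_store_addrs, bits=8):
    """Classify a load against older stores using only partial addresses.

    "no_dependence":    no store matches on the low bits, the load can go early
    "true_dependence":  a store matching on the low bits has the same full address
    "false_dependence": low bits match but the full addresses differ
    """
    partial_hits = [s for s in older_store_addrs
                    if lsb(s, bits) == lsb(load_addr, bits)]
    if not partial_hits:
        return "no_dependence"
    if load_addr in partial_hits:
        return "true_dependence"
    return "false_dependence"   # only resolved once the full address arrives

stores = [0x1000_0040, 0x2000_0080]
print(check_load(0x3000_00C0, stores))
print(check_load(0x1000_0040, stores))
print(check_load(0x5000_0080, stores))
```

A partial mismatch is always safe, because two addresses that differ in their low bits cannot be equal. A partial match may be a false dependence, which is the cost the slide quantifies at under 9%.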


L wires: Narrow Bit Width Operands

PowerPC: Data bit-width determines FU latency

Transfer of 10 bit integers on L wires

Can introduce scheduling difficulties

A predictor table of saturating counters

Accuracy of 98%

Reduction in branch mispredict penalty
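
A predictor table of saturating counters for narrow results might look like the sketch below. The table size, the 2-bit counters, the indexing by instruction address, and the unsigned 10-bit test are all assumptions made for illustration; the slide only states that such a table exists and reaches 98% accuracy.

```python
class NarrowWidthPredictor:
    """Table of saturating counters indexed by instruction address (hypothetical design)."""

    def __init__(self, entries=1024, counter_bits=2):
        self.entries = entries
        self.max_count = (1 << counter_bits) - 1
        self.threshold = (self.max_count + 1) // 2   # predict narrow at or above this
        self.table = [0] * entries

    def _index(self, pc):
        return pc % self.entries

    def predict_narrow(self, pc):
        """True if the instruction at `pc` is predicted to produce a narrow result."""
        return self.table[self._index(pc)] >= self.threshold

    def update(self, pc, result, narrow_bits=10):
        """Train with the actual result once it is known."""
        i = self._index(pc)
        if 0 <= result < (1 << narrow_bits):          # fits in 10 bits (unsigned view)
            self.table[i] = min(self.table[i] + 1, self.max_count)
        else:
            self.table[i] = max(self.table[i] - 1, 0)

predictor = NarrowWidthPredictor()
for value in (3, 17, 512):
    predictor.update(0x400, value)
print(predictor.predict_narrow(0x400))
```

The prediction lets the scheduler decide ahead of time whether a result can travel on the L wires, which is where the scheduling difficulty mentioned on the slide comes from.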


Power Efficient Wires

Base case: B wires

Power and B/W optimized: PW wires

Idea: steer non-critical data through energy-efficient PW interconnect


PW wires: Power/Bandwidth Efficient

Ready register operands

  Transfer of data at instruction dispatch

  Transfer of input operands to remote register file

  Covered by long dispatch-to-issue latency

Store data

  Could stall commit process

  Delay dependent loads

[Figure: Rename & Dispatch stage feeding four clusters, each with an IQ, Regfile, and FU]

Operand is ready at cycle 90

Consumer instruction dispatched at cycle 100
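
The steering idea can be written as a small decision function. This is a simplified sketch, not the policy evaluated in the talk: the function name and the fallback to B wires are assumptions, and the caveats listed for store data (possible commit stalls, delayed dependent loads) are not modeled.

```python
def choose_wires(ready_cycle, dispatch_cycle, is_store_data=False):
    """Pick an interconnect class for one operand transfer (illustrative policy)."""
    if is_store_data:
        return "PW"    # store data is treated as non-critical here
    if ready_cycle <= dispatch_cycle:
        return "PW"    # operand already ready when the consumer is dispatched
    return "B"         # value produced after dispatch: keep it on the default wires

# The example from the slide: operand ready at cycle 90, consumer dispatched at 100.
print(choose_wires(ready_cycle=90, dispatch_cycle=100))
```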



Evaluation Methodology

L1 DCache

B wires (2 cycles)

L wires (1 cycle)

PW wires (3 cycles)

Cluster

Simplescalar-3.0 augmented to simulate a dynamically scheduled 4-cluster model

Crossbar interconnects (L, B and PW wires)


Heterogeneous Interconnects

Intercluster global interconnect

72 B wires (64 data bits and 8 control bits)

  Repeaters sized and spaced for optimum delay

18 L wires

  Wide wires and large spacing

  Occupies more area

  Low latencies

144 PW wires

  Poor delay

  High bandwidth

  Low power
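
The relative metal area of a link mix can be tallied with simple arithmetic. The per-wire weights below are not stated on this slide. They are inferred from the "Relative metal area" column of the four-cluster results later in the talk (288 PW wires and 144 B wires both have area 1.0, and adding 36 L wires adds another 1.0), so treat them as an assumption.

```python
# Area of one wire relative to one B wire (inferred, see the note above).
AREA_PER_WIRE = {"B": 1.0, "PW": 0.5, "L": 4.0}
BASELINE_B_WIRES = 144   # the link whose area is normalized to 1.0

def relative_metal_area(link):
    """`link` maps wire class to wire count; returns area relative to 144 B wires."""
    total = sum(AREA_PER_WIRE[kind] * count for kind, count in link.items())
    return total / BASELINE_B_WIRES

for link in ({"B": 144}, {"PW": 288}, {"B": 144, "L": 36}, {"PW": 144, "L": 36}):
    print(link, relative_metal_area(link))
```

Under these weights one L wire costs as much metal as four B wires or eight PW wires, which is why the L-wire network is kept narrow.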


Analytical Model

$$C = C_a + W_s C_b + \frac{C_c}{W_s}$$

1. C_a: fringing capacitance

2. W_s C_b: capacitance between different layers of wires

3. C_c / W_s: capacitance between wires of same metal layer

RC model of the wire

Total Power = Short-Circuit Power + Switching Power + Leakage Power
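
The shape of the capacitance model is easy to examine numerically. The sketch below treats Ws as a positive scaling factor applied to width and spacing, which is an interpretation the slide does not spell out, and the coefficient values are arbitrary placeholders. Setting dC/dWs = Cb - Cc/Ws^2 to zero gives the capacitance-minimizing value Ws = sqrt(Cc/Cb).

```python
import math

def capacitance(ws, ca, cb, cc):
    """C = Ca + Ws*Cb + Cc/Ws, for Ws > 0."""
    return ca + ws * cb + cc / ws

def ws_min_capacitance(cb, cc):
    """Ws at which dC/dWs = Cb - Cc/Ws**2 is zero."""
    return math.sqrt(cc / cb)

# Placeholder coefficients in arbitrary units, not values from the talk.
ca, cb, cc = 1.0, 0.5, 2.0
best = ws_min_capacitance(cb, cc)
print(best, capacitance(best, ca, cb, cc))
```

Below this point the same-layer coupling term dominates, and above it the inter-layer term does. Capacitance is only one input to delay and power, so this minimum is not by itself the best design point.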


Evaluation methodology

[Figure: I-Cache, D-cache, LSQ, and clusters connected by crossbars and a ring interconnect]

Simplescalar-3.0 augmented to simulate a dynamically scheduled 16-cluster model

Ring latencies

  B wires (4 cycles)

  PW wires (6 cycles)

  L wires (2 cycles)


IPC improvements: L wires

L wires improve performance by 4.2% on a four-cluster system and 7.1% on a sixteen-cluster system

[Figure: IPC (axis 0 to 2.5) for ammp, applu, apsi, art, bzip2, crafty, eon, equake, fma3d, galgel, gap, gcc, gzip, lucas, mcf, mesa, mgrid, parser, swim, twolf, vortex, vpr, wupwise, and AM. Series: "Baseline: 144 B-Wires" and "Low-latency optimizations: 144 B-Wires and 36 L-Wires"]


Four Cluster System: ED2 Improvements

Link            Relative     IPC    Relative processor   Relative ED2   Relative ED2
                metal area          energy (10%)         (10%)          (20%)
144 B           1.0          0.95   100                  100            100
288 PW          1.0          0.92   97                   103.4          100.2
288 PW, 36 L    2.0          0.97   99                   94.4           93.2
144 B, 36 L     2.0          0.99   101                  93.3           94.5
288 B           2.0          0.98   103                  96.6           99.2
144 PW, 36 L    1.5          0.96   97                   95.0           92.1
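
The ED2 columns follow from the energy and IPC columns. As a sketch, assuming a fixed instruction count so that delay scales as 1/IPC, relative ED2 is relative energy times the squared ratio of baseline IPC to the configuration's IPC. The function name is hypothetical; the inputs are the 10% energy column and the IPC column of the table.

```python
def relative_ed2(energy, ipc, base_energy=100.0, base_ipc=0.95):
    """ED^2 relative to the baseline, in percent.

    For a fixed instruction count, delay scales as 1/IPC, so
    ED^2 scales as energy / IPC**2.
    """
    return 100.0 * (energy / base_energy) * (base_ipc / ipc) ** 2

# (relative processor energy at 10%, IPC) taken from the table.
links = {
    "144 B": (100, 0.95),
    "288 PW": (97, 0.92),
    "144 PW, 36 L": (97, 0.96),
}
for name, (energy, ipc) in links.items():
    print(f"{name}: {relative_ed2(energy, ipc):.1f}")
```

These three rows reproduce the tabulated 100, 103.4, and 95.0. The remaining rows agree to within about one point, presumably because the tabulated IPC and energy values are rounded.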


Sixteen Cluster system: ED2 gains

Link            IPC    Relative processor   Relative ED2
                       energy (20%)         (20%)
144 B           1.11   100                  100
144 PW, 36 L    1.05   94                   105.3
144 B, 36 L     1.19   102                  88.7
288 B, 36 L     1.22   107                  88.7
288 B           1.18   105                  93.1


Conclusions

Exposing the wire design space to the architecture

A case for micro-architectural wire management!

A low latency low bandwidth network alone helps improve performance by up to 7%

ED2 improvements of about 11% compared to a baseline processor with homogeneous interconnect

Entails hardware complexity


Future work

3-D wire model for the interconnects

Design of heterogeneous clusters

Interconnects for cache coherence and L2$


Questions and Comments?

Thank you!


Backup


L wires: Accelerating cache access

TLB access for page lookup

  Transmit a few bits of virtual page number on L wires

  Prefetch data out of L1$ and TLB

  18 L wires (6 tag bits, 8 L1 index and 4 TLB index bits)

Wire type    Crossbar delay (cycles)    Ring hop delay (cycles)
PW wires     3                          6
B wires      2                          4
L wires      1                          2
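
The 18-bit budget on the L wires (6 tag bits, 8 L1 index bits, 4 TLB index bits) can be pictured as one packed word. The sketch below is purely illustrative: the field names, their order within the word, and the pack/unpack helpers are assumptions, since the slide gives only the widths.

```python
# Field widths from the slide; the ordering within the word is an assumption.
FIELDS = (("tag", 6), ("l1_index", 8), ("tlb_index", 4))

def pack(values):
    """Pack field values into one integer, first field in the highest bits."""
    word = 0
    for name, width in FIELDS:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | value
    return word

def unpack(word):
    """Inverse of pack()."""
    values = {}
    for name, width in reversed(FIELDS):
        values[name] = word & ((1 << width) - 1)
        word >>= width
    return values

total_bits = sum(width for _, width in FIELDS)
msg = {"tag": 0b101010, "l1_index": 0xC3, "tlb_index": 0x9}
print(total_bits, bin(pack(msg)), unpack(pack(msg)) == msg)
```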


Model parameters

Simplescalar-3.0 with separate integer and floating point queues

32 KB 2 way Instruction cache

32 KB 4 way Data cache

128 entry 8 way I and D TLB


Overview/Motivation:

Three wire implementations employed in this study

B wires: traditional

  Optimal delay

  Huge power consumption

L wires

  Faster than B wires

  Lower bandwidth

PW wires

  Reduced power consumption

  Higher bandwidth compared to B wires

  Increased delay through the wires
