Yuchun Ma Joint Work with Jason Cong, Yongxiang Liu, Glenn Reinman, and Yan Zhang International...

Yuchun Ma

Joint Work with Jason Cong, Yongxiang Liu, Glenn Reinman, and Yan Zhang

International Center for Design on Nanotechnologies Workshop

Outline

• Micro-architecture Design
• 3-D IC Technology
• 3D Architecture Exploration with 2D blocks
• 3D Architecture Design with cubic folded blocks
  • 3D cubic packing algorithm
  • 3D architecture exploration with folded blocks
• Pipelining Optimization with Throughput-Aware Floorplanning
• Summary and Future Work


Superscalar Processors

Superscalar processing is the ability of a microprocessor to initiate multiple instructions into multiple pipelines so that the computations of many instructions can be done in parallel if they are not dependent on each other.

Alpha 21264

Performance of a microprocessor

Performance is measured as the time taken to complete a given task
• Operating systems
• Compiler optimizations
• Workload used for studying the performance
• Microprocessor organization

Typically, the processor performance is measured in MIPS or BIPS
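As a small illustration of these metrics: MIPS is millions of instructions completed per second and BIPS is billions per second. The sketch below computes both from an instruction count and a run time. The function names and sample numbers are illustrative assumptions, not values from the talk.

```python
def mips(instructions, seconds):
    """Millions of instructions per second for a completed run."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return instructions / seconds / 1e6


def bips(instructions, seconds):
    """Billions of instructions per second for a completed run."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return instructions / seconds / 1e9


# Hypothetical workload: 8 billion instructions completed in 2 seconds.
print(bips(8e9, 2.0), mips(8e9, 2.0))
```

Because both metrics divide by run time, they account for clock frequency and instructions per cycle together rather than for either one alone.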


Motivations of 3-D ICs

Alternative ways for device integration as we approach the limit of CMOS scaling
• Interconnect length/delay reduction
• System performance improvement [Black04]
• Power reduction [Black04]
• Integration of heterogeneous technologies

No existing flow to evaluate 3D implementations of architectures systematically
• Performance
• Thermal

Technology background

Wafer bonding 3D IC technologies
• With flipping the top layer
• Without flipping the top layer

[Figure: A 3D IC example with two device layers. (a) With flipping the top layer; (b) without flipping the top layer]

Thermal Resistive Network [Wilkerson04]

• Circuit stack partitioned into tiles
• Tiles connected through thermal resistances
  • Lateral resistances: fixed
  • Vertical resistances: proportional to 1/#via
• Heat sources modeled as current sources
  • Current value = power
• Heat sinks modeled as ground nodes
• Thermal vias: after floorplanning, we can further reduce the temperature by thermal via insertion.

[Figure: (a) Tiles stack array; (b) single tile stack with nodes 1 to 5, power sources P1 to P5, vertical resistances R1 to R5, and lateral resistance Rlateral]
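The resistive analogy can be sketched for a single tile stack as follows. This is a simplified illustration, not the model from [Wilkerson04]: it ignores lateral resistances, assumes layer 0 sits next to the heat sink, and the helper names `vertical_resistance` and `stack_temperatures` are hypothetical.

```python
def vertical_resistance(base_resistance, num_vias):
    """Vertical resistance of a tile, assumed to scale as 1/#via."""
    if num_vias < 1:
        raise ValueError("num_vias must be at least 1")
    return base_resistance / num_vias


def stack_temperatures(powers, resistances, sink_temperature=0.0):
    """Temperature of each layer in one tile stack.

    powers[i] is the power (current source) injected at layer i.
    resistances[i] is the vertical resistance between layer i and the
    node below it (the heat sink, i.e. ground, for layer 0).
    Heat from layer i and every layer above it flows through resistances[i].
    """
    temperatures = []
    temperature = sink_temperature
    for layer, resistance in enumerate(resistances):
        heat_flow = sum(powers[layer:])
        temperature += heat_flow * resistance
        temperatures.append(temperature)
    return temperatures


# Two layers, 1 W each, without and with three thermal vias in the upper tile.
no_vias = stack_temperatures([1.0, 1.0], [2.0, 3.0])
with_vias = stack_temperatures([1.0, 1.0], [2.0, vertical_resistance(3.0, 3)])
print(no_vias, with_vias)
```

In this toy case only the layer far from the heat sink cools down when vias are added, because the vias shorten the path that its heat must cross.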


MEVA-3D

An Automated Design Flow for 3D Architecture Evaluation (MEVA-3D)
• Evaluate 3D implementations of micro-architectures systematically and study them from both performance and thermal perspectives.

MEVA-3D Flow
• Automated 2D/3D floorplanning
  • Reduce the latency along critical loops in the micro-architecture by considering interconnect pipelining at a given target frequency.
• Thermal evaluation
  • Resistive network model considering white-space and thermal via insertion.
• 3D router

3D Architecture Evaluation with Physical Planning

Optimize
• BIPS (not IPC or Freq)
  • Consider interconnect pipelining based on early floorplanning for critical paths
  • Use IPC sensitivity model [Jagannathan05]
• Area/wirelength
• Temperature

[Figure: MEVA-3D flow in two phases, ESTIMATION and VALIDATION. Stages: 2D/3D floorplanning for performance and thermal with interconnect pipelining; performance simulation with interconnect latencies; 2D/3D thermal simulation. Data labels: micro-architecture configuration; target frequency; critical architectural paths and sensitivity; power density estimates; estimated performance, temperature, and interconnect data; power density with interconnect consideration; performance, power and temperature.]
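To show why BIPS rather than IPC or frequency alone is the objective, here is a minimal sketch that combines the two. It assumes a linear sensitivity model, in which each extra cycle on a critical loop removes a fixed fraction of the baseline IPC. That linear form, the function name `estimate_bips`, and all numbers are assumptions for illustration and are not taken from [Jagannathan05].

```python
def estimate_bips(base_ipc, freq_ghz, sensitivities, extra_cycles):
    """Estimate BIPS = IPC * frequency (GHz) after interconnect pipelining.

    sensitivities[name]: assumed fractional IPC loss per extra cycle on a loop.
    extra_cycles[name]: extra pipeline cycles the floorplan adds to that loop.
    """
    loss = sum(sensitivities[name] * extra_cycles.get(name, 0)
               for name in sensitivities)
    ipc = max(base_ipc * (1.0 - loss), 0.0)
    return ipc * freq_ghz


# Hypothetical loops and sensitivities at a 4 GHz target frequency.
sens = {"wakeup": 0.10, "l2_access": 0.02}
print(estimate_bips(1.0, 4.0, sens, {"wakeup": 1, "l2_access": 2}))
print(estimate_bips(1.0, 4.0, sens, {"wakeup": 0, "l2_access": 1}))
```

A floorplan that shortens the wires on a highly sensitive loop recovers more BIPS than one that shortens an insensitive loop by the same number of cycles, which is the reason the floorplanner needs the sensitivities as input.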

Design Example

An out-of-order superscalar processor micro-architecture with 4 banks of L2 cache in 70 nm technology

Critical paths

Baseline Processor Parameters

2D vs 3D Layout

Assume two device layers
• 2D EV6-like core: BIPS = 2.75
• 3D EV6-like core (2 layers): BIPS = 2.94

• Wakeup loop: the extra cycle is eliminated.
• Branch misprediction resolution loop and the L2 cache access latency: some of the extra cycles are eliminated.

Simulation Results

The 3D architecture outperforms the 2D design by about 11.7% when the frequency is 4 GHz.

[Figure: Chart labelled 128_4G comparing 2D and 3D; vertical axis from 0 to 3]

Performance for the micro-architecture with 2D and 3D layout at different target frequencies

3D integration can help improve the performance by 11% by eliminating most of the wire latencies in 2D.

Maximum On-Chip Temperature

HS denotes a heat sink. 3D integration allows thermal vias to be inserted to reduce the temperature.

3D integration shows a temperature increase of over 4.78 on average. After thermal via insertion, we can reduce the maximum on-chip temperature by an average of about 62%.


3D Design w/ Component Folding and Stacking

Explore 3D design of architectural structures that are
• Timing/throughput critical
• Expensive in terms of power consumption and/or thermal output

Possible candidates for 3D component folding
• Instruction scheduling window
  • Issue queue can be partitioned into multiple levels via matchlines or taglines.
• On-chip caches
  • Regular structure lends itself to a wide range of partitionings
• Register file
  • Thermally critical resource; also has a regular structure

3D Architectural Block Design and Modeling

First explore how to design blocks in 3D
• Wordline folding
  • Fold block horizontally
• Port partitioning
  • Extend ports to different layers

Tools
• CACTI
  • Caches and cache-like structures
  • Register files
• HSpice
  • Issue queue

Then explore the design space for a microprocessor with these blocks

3D Issue Queue

[Figure: (a) 2D issue queue with 4 taglines; (b) block folding; (c) port partitioning]

Block folding
• Fold the entries and place them on different layers
• Effectively shortens the tag lines

Port partitioning
• Place tag lines and ports on multiple layers, thus reducing both the height and width of the ISQ.
• The reduction in tag and matchline wires can help reduce both power and delay.

Benefits from IQ folding

Maximum delay reduction of 50%, maximum area reduction of 90% and a maximum reduction in power consumption of 40%

Legend: nL = n layers; FB = folding banks; TP = tag/ports partitioning

Improvements for blocks

• Port folding performs better than wordline folding for area (72% vs 51%).
• Wordline folding is more effective in reducing the block delay (13% vs 5%).
• Port folding also performs better in reducing power (13% vs 5%).

3D packing with folded blocks

Exploring vertical integration in microprocessor design requires consideration of both physical design and architecture.
• True 3D packing
• Architectural alternative selection
  • The number of layers in folded blocks
  • The partitioning method: block folding or port partitioning

3D Corner Block List Representation

(S, L, T) composes a 3D CBL.
• S: a record of block names
• L: corner cubic block orientations (X-, Y- or Z-oriented)
• T: the sequence {T_n, T_{n-1}, …, T_2} recording the number of attached tri-branches covered by each corner cubic block

Example with blocks 1 to 5:
S = {1 2 3 4 5}
L = (Y, Z, Y, X)
T = (10, 110, 10, 1110)
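The (S, L, T) triple from the example can be held in a small data structure, sketched below. The class name `CornerBlockList3D` and its checks are hypothetical. The sketch also assumes that each T entry uses the unary encoding of the 2D corner block list, a run of 1s terminated by a 0, so that `110` means two covered tri-branches. It stores and validates the lists only and does not perform packing.

```python
from dataclasses import dataclass
from typing import List


def decode_unary(bits: str) -> int:
    """Count the 1s in a string such as '110' (assumed unary code ended by 0)."""
    if not bits.endswith("0") or set(bits[:-1]) - {"1"}:
        raise ValueError(f"not a unary code: {bits!r}")
    return len(bits) - 1


@dataclass
class CornerBlockList3D:
    names: List[int]          # S: block names in packing order
    orientations: List[str]   # L: one of 'X', 'Y', 'Z' for blocks 2..n
    covered: List[str]        # T: one unary code for blocks 2..n

    def __post_init__(self):
        n = len(self.names)
        if len(self.orientations) != n - 1 or len(self.covered) != n - 1:
            raise ValueError("L and T need one entry per block after the first")
        if set(self.orientations) - {"X", "Y", "Z"}:
            raise ValueError("orientation must be X, Y or Z")

    def covered_counts(self) -> List[int]:
        """Number of attached tri-branches covered, per T entry."""
        return [decode_unary(bits) for bits in self.covered]


# The example from the slide.
cbl = CornerBlockList3D([1, 2, 3, 4, 5],
                        ["Y", "Z", "Y", "X"],
                        ["10", "110", "10", "1110"])
print(cbl.covered_counts())
```

Keeping the three lists separate mirrors the representation itself, so a perturbation in a search-based packer can change one entry of S, L, or T without touching the others.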

Packings with folded blocks

Performance

On average, multi-layer (3D) block configurations have 11% lower temperature as well as a 14% improvement in BIPS.

Temperatures

Temperatures can be below 100 degrees with thermal vias inserted.

Temperature profile

[Figure: Temperature profiles for 1 layer and for 2 layers with no via inserted]

Temperature profile (2 layers with thermal vias)

Pipelining Optimization with Throughput-Aware Floorplanning

Micro-architecture Pipelining Optimization

Previous works assume that the blocks are separately designed subject to a clock frequency, and that wire pipelining is then carried out on the global wires of the circuits.
• Sub-optimal due to possible unutilized slacks in block pipeline designs

We propose a novel optimization methodology of architecture pipelining with physical design, so that block pipelining and interconnect pipelining can be considered simultaneously.

[Figure: Blocks A and B with annotated delays, comparing a pipeline with pre-designed blocks against a path-based pipeline]
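The slack argument can be illustrated with a short calculation. The sketch assumes a simplified model in which a pre-designed block or a pipelined wire is registered at its own boundary, so each segment rounds up to whole cycles separately, while a path-based pipeline rounds up only the total delay. The delays (in picoseconds) and function names are hypothetical and are not the values from the figure.

```python
def ceil_div(a, b):
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-a // b)


def cycles_predesigned(segment_delays_ps, clock_ps):
    """Each block or wire is pipelined on its own: round up per segment."""
    return sum(max(1, ceil_div(d, clock_ps)) for d in segment_delays_ps)


def cycles_path_based(segment_delays_ps, clock_ps):
    """Flip-flops may sit anywhere on the path: round up the total only."""
    return max(1, ceil_div(sum(segment_delays_ps), clock_ps))


# Block A (175 ps), wire (100 ps), block B (75 ps) at a 250 ps clock (4 GHz).
path = [175, 100, 75]
print(cycles_predesigned(path, 250), cycles_path_based(path, 250))
```

Here the per-segment design spends three cycles on 350 ps of logic and wire, while the path-based design needs two, because the slack left in each segment is reused by its neighbours.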

Simultaneous Block and Interconnect Pipelining

We define path-based pipelining as the Simultaneous Block and Interconnect Pipelining (SBIP) problem
• Represent the micro-architecture design by a path graph G(V,E).
• The delay between any two flip-flops along the same path is less than the clock period.
• The performance of the architecture can be evaluated by the weighted sum of the number of FFs on each edge e_i (n_ei) along the paths.
• Therefore the objective is to find a feasible solution with the optimal performance.

[Figure: Example path graph with nodes A, B, C, D, E and their copies A', B', C', D', E']

MILP Formulation

We define a term $a(P, v)$ that represents the arrival time at node $v$ along path $P$, which is the longest delay from a flip-flop to the node $v$ along path $P$.

With the given clock period $T_{clk}$ and the set of paths $\mathcal{P}$, we can then formulate the problem as the following MILP:

$$\text{Obj.}\quad \min \sum_{P_i \in \mathcal{P}} w_i \Big( \sum_{e_i \in P_i} n_{e_i} \Big)$$

$$\text{s.t.}\quad 0 \le a(P_i, v) \le T_{clk}, \quad \forall v \in V \text{ and } P_i \text{ passes } v \qquad (1)$$

$$n_{e_i} \ge 0, \quad \forall e_i \in E \qquad (2)$$

$$a(P_i, v) \ge a(P_i, u) + d_{e_i} - T_{clk} \cdot n_{e_i}, \quad \forall e_i \in E \text{ where } e_i \text{ is a connection from node } u \text{ to node } v \text{ along path } P_i \qquad (3)$$
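To make constraints (1) to (3) concrete, the sketch below builds one feasible assignment for a single path and checks it. It is neither an MILP solver nor the heuristic from the talk: it is a greedy pass that adds a flip-flop whenever the arrival time would exceed the clock period. Delays are integers (picoseconds) to avoid rounding issues, and all names and numbers are illustrative assumptions.

```python
def place_flip_flops(edge_delays, clock_period):
    """Greedy single-path assignment of flip-flop counts and arrival times."""
    counts, arrivals = [], []
    arrival = 0  # the path is assumed to start at a flip-flop
    for delay in edge_delays:
        arrival += delay
        n = 0
        while arrival > clock_period:  # insert a flip-flop, restart the clock
            arrival -= clock_period
            n += 1
        counts.append(n)
        arrivals.append(arrival)
    return counts, arrivals


def is_feasible(edge_delays, counts, arrivals, clock_period):
    """Check constraints (1) to (3) along one path."""
    previous = 0
    for delay, n, arrival in zip(edge_delays, counts, arrivals):
        if not 0 <= arrival <= clock_period:                 # constraint (1)
            return False
        if n < 0:                                            # constraint (2)
            return False
        if arrival < previous + delay - clock_period * n:    # constraint (3)
            return False
        previous = arrival
    return True


def weighted_objective(counts_per_path, weights):
    """Weighted sum of flip-flop counts over all paths (the MILP objective)."""
    return sum(w * sum(c) for w, c in zip(weights, counts_per_path))


delays = [175, 100, 75]
counts, arrivals = place_flip_flops(delays, 250)
print(counts, arrivals, is_feasible(delays, counts, arrivals, 250))
```

The greedy pass treats one path in isolation. When paths share edges, the flip-flop counts interact across paths, which is the case the MILP and the graph-based heuristic are meant to handle.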

Graph-based heuristic algorithm

Traverse the graph to decide the optimal insertion of flip-flops such that the weighted sum of cycle numbers of paths is minimized
• Dynamic scanning for combinational circuits
  • Slacks along paths are used to compute the optimal positions for FFs.
• Near-optimal method for sequential circuits
  • Break the cycle into a path from s to t

Throughput-aware floorplanning with pipelining
• The path-based pipelining design guides the block design to optimize the performance for the whole design.

Experimental Results

We compare four sets of results: the wire-pipelining results (WP), the solutions obtained from the MILP solver (MILP), the ideal upper bound used in [6][8] (UB), and our graph-based heuristic approach (GH).

Impact of frequencies
• Path-based pipelining gives about a 27% performance improvement over wire pipelining.

Integrated with floorplanning optimization

Frequency    UB+post_MILP                       GH
(GHz)        Area(mm2)  Wire(mm)  BIPS      Area(mm2)  Wire(mm)  BIPS
2            32.        115.6     1.492     31.8       142       1.714
3            34.6       103.7     2.139     33.3       108.4     2.22
4            32.4       98.7      2.776     36.1       124.3     2.828
5            32.8       126.2     2.885     32.6       94.17     3.35
6            36.0       108.4     3.636     33.7       100.3     3.882
7            35.9       112.5     3.479     36.8       129.9     3.906
Comparison   1          1         1         1.003      1.05      1.091

• UB+post_MILP: the MILP approach applied as a post-process at the end of floorplanning
• GH: our approach integrated with throughput-driven floorplanning
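The Comparison row appears to be the ratio of column totals (equivalently, of column averages), normalized to UB+post_MILP. That reading is an inference from the numbers, not something the slide states. The sketch below recomputes the wirelength and BIPS entries from the table; the area column is left out because its first UB+post_MILP entry is truncated in the source.

```python
# Values copied from the table, frequencies 2 to 7 GHz.
ub_wire = [115.6, 103.7, 98.7, 126.2, 108.4, 112.5]
gh_wire = [142, 108.4, 124.3, 94.17, 100.3, 129.9]
ub_bips = [1.492, 2.139, 2.776, 2.885, 3.636, 3.479]
gh_bips = [1.714, 2.22, 2.828, 3.35, 3.882, 3.906]


def comparison(baseline, candidate):
    """Ratio of column totals, with the baseline normalized to 1."""
    return sum(candidate) / sum(baseline)


print(round(comparison(ub_wire, gh_wire), 2))   # wirelength ratio
print(round(comparison(ub_bips, gh_bips), 3))   # BIPS ratio
```

Under this reading, the integrated approach trades roughly 5% more wirelength for roughly 9% higher BIPS across the six frequencies.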

Summary

3D Architecture Exploration
• Coupled with 3D physical planning
• Consider both 3D component stacking and folding

MEVA-3D can systematically evaluate a 3D architecture from both the performance side and the thermal side.

We propose an optimization methodology of architecture pipelining with physical design, which simultaneously optimizes the pipeline design and physical packing in terms of system throughput. In our experiments, system performance improves by about 27% over wire pipelining.

Ongoing Work

• 3D multi-core architecture design and implementation
• Deep pipeline design in microarchitecture with interconnect considered
• The slacks in 3D design may be used to enlarge the sizes of blocks and get better performance.

Thank You!

[email protected]@tsinghua.org.cn
