Yuchun Ma
Joint Work with Jason Cong, Yongxiang Liu, Glenn Reinman, and Yan Zhang
International Center for Design on Nanotechnologies Workshop
2
Outline
Micro-architecture Design
3-D IC Technology
3D Architecture Exploration with 2D blocks
3D Architecture Design with cubic folded blocks
3D cubic packing algorithm
3D architecture exploration with folded blocks
Pipelining Optimization with Throughput-Aware Floorplanning
Summary and Future Work
4
Superscalar Processors
Superscalar processing is the ability of a microprocessor to issue multiple instructions into multiple pipelines, so that the computations of many instructions can be done in parallel when they do not depend on each other.
5
Alpha 21264
6
Performance of a microprocessor
Performance is measured as the time taken to complete a given task
Operating systems
Compiler optimizations
Workload used for studying the performance
Microprocessor organization
Typically, the processor performance is measured in MIPS or BIPS
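As a rough illustration of the BIPS metric, the sketch below uses the standard relation BIPS = IPC x clock frequency in GHz. The function name and the two design points are hypothetical and are not taken from the slides; they only show why a higher clock does not automatically mean higher BIPS when extra pipeline stages lower the IPC.

```python
def bips(ipc: float, freq_ghz: float) -> float:
    """Billions of instructions per second = instructions per cycle x GHz."""
    return ipc * freq_ghz


# Hypothetical design points (illustrative numbers only).
slow_clock = bips(ipc=0.90, freq_ghz=3.0)  # fewer wire-latency stalls
fast_clock = bips(ipc=0.60, freq_ghz=4.0)  # extra cycles on critical loops

# The faster clock loses here because its IPC drops too far.
best = "slow_clock" if slow_clock > fast_clock else "fast_clock"
```

This is why a flow that optimizes BIPS directly can pick a different design than one that optimizes IPC or frequency alone.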
8
Motivations of 3-D ICs
Alternative ways for device integration as we approach the limit of CMOS scaling
Interconnect length/delay reduction
System performance improvement [Black04]
Power reduction [Black04]
Integration of heterogeneous technologies
No existing flow to evaluate 3D implementations of architectures systematically
Performance
Thermal
[Figure from [Black04]]
9
Technology background
Wafer bonding 3D IC technologies
With flipping the top layer;
Without flipping the top layer;
[Figure: a 3D IC example with two device layers. (a) With flipping the top layer; (b) without flipping the top layer]
10
Thermal Resistive Network [Wilkerson04]
Circuit stack partitioned into tiles
Tiles connected through thermal resistances
Lateral resistances: fixed
Vertical resistances: proportional to 1/#via
Heat sources modeled as current sources
Current value = power
Heat sinks modeled as ground nodes
Thermal vias: after floorplanning, we can further reduce the temperature by thermal via insertion.
[Figure: (a) tile stack array; (b) single tile stack with nodes 1-5, heat sources P1-P5, vertical resistances R1-R5, and lateral resistances Rlateral]
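The resistive-network analogy can be sketched for a single tile stack. This is a simplified illustration, not the model of [Wilkerson04]: it ignores lateral resistances, treats the heat sink as ground below layer 0, and assumes each vertical resistance is exactly a base value divided by the via count. The function and parameter names are hypothetical.

```python
def stack_temperatures(powers, base_resistances, vias):
    """Temperature rise above the heat sink for each layer of one tile stack.

    powers[i]           heat injected at layer i (the "current source")
    base_resistances[i] resistance between layer i and the layer below it
                        (or the heat sink for i == 0) with a single via
    vias[i]             number of thermal vias on that vertical link
    Assumption: vertical resistance = base_resistances[i] / vias[i].
    """
    rises = []
    rise = 0.0
    for i in range(len(powers)):
        resistance = base_resistances[i] / vias[i]
        # All heat generated at layer i and above flows through link i.
        heat_through_link = sum(powers[i:])
        rise += resistance * heat_through_link
        rises.append(rise)
    return rises


# Two layers, layer 0 next to the heat sink.
no_extra_vias = stack_temperatures([1.0, 2.0], [4.0, 6.0], [1, 1])
with_vias = stack_temperatures([1.0, 2.0], [4.0, 6.0], [2, 3])
```

Under these assumptions, adding vias lowers every link resistance and therefore the temperature rise of every layer above it, which is the effect thermal via insertion relies on.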
12
MEVA-3D
An Automated Design Flow for 3D Architecture Evaluation (MEVA-3D)
Evaluate 3D implementations of micro-architectures systematically and study them from both performance and thermal perspectives.
MEVA-3D Flow
Automated 2D/3D floorplanning;
• Reduce the latency along critical loops in the micro-architecture by considering interconnect pipelining at a given target frequency.
Thermal Evaluation
• Resistive network model considering white-space and thermal via insertion.
3D router
13
3D Architecture Evaluation with Physical Planning
Optimize
BIPS (not IPC or Freq)
• Consider interconnect pipelining based on early floorplanning for critical paths
• Use IPC sensitivity model [Jagannathan05]
Area/wirelength
Temperature
[Figure: flow diagram with an ESTIMATION stage and a VALIDATION stage. Labels: microarchitecture configuration; target frequency; critical architectural paths and sensitivity; power density estimates; 2D/3D floorplanning for performance and thermal with interconnect pipelining; estimated performance, temperature, and interconnect data; performance simulation with interconnect latencies; power density with interconnect consideration; 2D/3D thermal simulation; performance, power and temperature]
14
Design Example
An out-of-order superscalar processor micro-architecture with 4 banks of L2 cache in 70 nm technology
Critical paths
15
Baseline Processor Parameters
16
2D vs 3D Layout
2D EV6-like core: BIPS = 2.75
3D EV6-like core (2 layers): BIPS = 2.94
Wakeup loop: the extra cycle is eliminated.
Branch misprediction resolution loop and the L2 cache access latency: some of the extra cycles are eliminated.
Assume two device layers
17
Simulation Results
The 3D architecture outperforms the 2D design by about 11.7% when the frequency is 4 GHz.
[Figure: bar chart comparing 2D and 3D for configuration 128_4G; vertical axis from 0 to 3]
18
Performance for the micro-architecture with 2D and 3D layout at different target frequencies
3D integration can help improve the performance by 11% by eliminating most of the wire latencies in 2D.
19
Maximum On-Chip Temperature
HS denotes a heat sink; 3D integration allows thermal vias to be inserted to reduce the temperature.
3D integration shows a temperature increase of over 4.78 on average. After thermal via insertion, we can reduce the maximum on-chip temperature by an average of about 62%.
21
3D Design w/ Component Folding and Stacking
Explore 3D design of architectural structures that are
Timing/Throughput Critical
Expensive in Terms of Power Consumption and/or Thermal Output
Possible candidates for 3D component folding
Instruction Scheduling Window
• Issue Queue can be partitioned into multiple levels via matchlines or taglines.
On-Chip Caches
• Regular structure lends itself to a wide range of partitionings
Register File
• Thermally critical resource; also has a regular structure
22
3D Architectural Block Design and Modeling
First explore how to design blocks in 3D
Wordline folding
• Fold block horizontally
Port Partitioning
• Extend ports to different layers
Tools
CACTI
• Caches and cache-like structures
• Register files
HSpice
• Issue Queue
Then explore design space for a microprocessor with these blocks
23
3D Issue Queue
[Figure: (a) 2D issue queue with 4 taglines; (b) block folding; (c) port partitioning]
Block folding
Fold the entries and place them on different layers
Effectively shortens the tag lines
Port partitioning
Place tag lines and ports on multiple layers, thus reducing both the height and width of the ISQ.
The reduction in tag and matchline wires can help reduce both power and delay.
24
Benefits from IQ folding
Maximum delay reduction of 50%, maximum area reduction of 90% and a maximum reduction in power consumption of 40%
Legend: nL = n layers; FB = folding banks; TP = tag/ports partitioning
25
Improvements for blocks
Port folding performs better than wordline folding for area (72% vs 51%);
Wordline folding is more effective in reducing the block delay (13% vs 5%);
Port folding also performs better in reducing power (13% vs 5%)
26
3D packing with folded blocks
Exploring vertical integration in microprocessor design requires considering both physical design and architecture.
True 3D packing
Architectural Alternative Selection
• The number of layers in folded blocks
• The partitioning method: block folding or port partitioning
27
3D Corner Block List Representation
(S, L, T) composes a 3D CBL.
S: a record of block names
L: corner cubic block orientation (X-, Y- or Z-oriented)
T: the sequence {T_n, T_n-1, …, T_2} recording the number of attached tri-branches covered by each corner cubic block
Example: S = {1 2 3 4 5}, L = (Y, Z, Y, X), T = (10, 110, 10, 1110)
[Figure: example packing of blocks 1-5]
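The (S, L, T) triple from the example can be held in a small data structure. The sketch below only validates and decodes the triple; it does not reconstruct the packing. It assumes, by analogy with the 2D corner block list, that each entry of T is a run of 1s terminated by a 0 whose number of 1s is the count of covered tri-branches. That encoding and the function name are assumptions, not stated on the slide.

```python
def decode_cbl(s, l, t):
    """Validate a 3D CBL triple and return the tri-branch count per T entry.

    s: list of block names
    l: list of orientations, each "X", "Y" or "Z" (one fewer than blocks)
    t: list of bit strings, assumed unary: k ones followed by a single zero
    """
    if len(l) != len(s) - 1 or len(t) != len(s) - 1:
        raise ValueError("L and T must each have len(S) - 1 entries")
    for orientation in l:
        if orientation not in ("X", "Y", "Z"):
            raise ValueError(f"bad orientation: {orientation!r}")
    counts = []
    for bits in t:
        # Assumed encoding: only 1s before the final 0.
        if not bits.endswith("0") or set(bits[:-1]) - {"1"}:
            raise ValueError(f"bad T entry: {bits!r}")
        counts.append(len(bits) - 1)
    return counts


# The example from the slide.
counts = decode_cbl(
    s=[1, 2, 3, 4, 5],
    l=["Y", "Z", "Y", "X"],
    t=["10", "110", "10", "1110"],
)
```

Under this reading, the example records 1, 2, 1 and 3 covered tri-branches, in the order the T entries are listed.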
28
Packings with folded blocks
29
30
Performance
On average, multi-layer (3D) block configurations have 11% lower temperature as well as a 14% improvement in BIPS.
31
Temperatures
Temperatures can be kept below 100 degrees with thermal vias inserted.
32
Temperature profile
[Figure panels: 1 layer; 2 layers with no via inserted]
33
Temperature profile (2 layers with thermal vias)
35
Micro-architecture Pipelining Optimization
Previous works assume that the blocks are separately designed subject to a clock frequency, and the wire pipelining is then carried out on the global wires of the circuits.
Sub-optimal due to the possible unutilized slacks in block pipeline designs
We propose a novel optimization methodology of architecture pipelining with physical design, so that block pipelining and interconnect pipelining can be considered simultaneously.
[Figure: blocks A and B annotated with stage and wire delays, shown as a pipeline with pre-designed blocks and as a path-based pipeline]
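A small worked example shows where the lost slack comes from. The sketch below compares cycle counts when each block and wire is pipelined on its own against treating the whole path as one unit. It assumes delays can be cut at any point, and the delay values and function names are hypothetical, chosen only for illustration.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division."""
    return -(-a // b)


def cycles_predesigned(segment_delays, period):
    """Each block or wire is pipelined separately, so each rounds up alone."""
    return sum(ceil_div(d, period) for d in segment_delays)


def cycles_path_based(segment_delays, period):
    """The whole path is pipelined together, so slack is shared."""
    return ceil_div(sum(segment_delays), period)


# Hypothetical path: block A, a global wire, block B (delays in ps).
path = [1400, 700, 1400]
period = 1000  # 1 GHz clock

separate = cycles_predesigned(path, period)  # 2 + 1 + 2
combined = cycles_path_based(path, period)   # ceil(3500 / 1000)
```

Here the separately pipelined design needs 5 cycles while the path-based one needs 4, because the unused 600 ps in each block's second stage can absorb part of the wire delay.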
36
Simultaneous Block and Interconnect Pipelining
We define path-based pipelining as the Simultaneous Block and Interconnect Pipelining (SBIP) problem.
Represent the micro-architecture design by a path graph G(V,E).
The delay between any two flip-flops along the same path is less than the clock period.
The performance of the architecture can be evaluated by the weighted sum of the number of FFs on each edge ei (n_ei) along the paths.
Therefore the objective is to find a feasible solution with the optimal performance.
[Figure: example block graph with nodes A to E and the corresponding path graph with nodes A, A', B, B', C, C', D, D', E, E']
37
MILP Formulation
We define a term $a(P, v)$ that represents the arrival time at node $v$ along path $P$, which is the longest delay from a flip-flop to the node $v$ along path $P$.
With the given clock period $T_{clk}$ and the set of paths $\mathcal{P}$, we can then formulate the problem as the following MILP:
Obj. $\min \sum_{P_i \in \mathcal{P}} \Big( w_{P_i} \sum_{e_i \in P_i} n_{e_i} \Big)$
s.t.
$0 \le a(P_i, v) \le T_{clk}$, $\forall v \in V$ and $P_i$ passes $v$ (1)
$n_{e_i} \ge 0$, $\forall e_i \in E$ (2)
$a(P_i, v) \ge a(P_i, u) + d_{e_i} - T_{clk} \cdot n_{e_i}$, $\forall e_i \in E$ where $e_i$ is a connection from node $u$ to node $v$ along path $P_i$ (3)
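Constraints (1) to (3) can be checked on a single path without an MILP solver. The sketch below is a brute-force illustration rather than the authors' solver: it assumes one path with weight 1, integer flip-flop counts per edge, and an arrival time bounded above by the clock period. The names `feasible`, `min_flip_flops` and `max_per_edge` are hypothetical.

```python
from itertools import product


def feasible(delays, ffs, t_clk):
    """Check constraints (1)-(3) along one path.

    delays[i] is the delay d_e of edge i, ffs[i] the flip-flop count n_e.
    The smallest arrival time allowed by (1) and (3) is propagated; using
    the smallest value can never hurt feasibility further along the path.
    """
    arrival = 0  # the path starts at a flip-flop output
    for d, n in zip(delays, ffs):
        arrival = max(0, arrival + d - t_clk * n)
        if arrival > t_clk:
            return False
    return True


def min_flip_flops(delays, t_clk, max_per_edge=3):
    """Exhaustively search flip-flop counts and minimize their sum."""
    best = None
    for ffs in product(range(max_per_edge + 1), repeat=len(delays)):
        if feasible(delays, ffs, t_clk):
            if best is None or sum(ffs) < sum(best):
                best = ffs
    return best


# Hypothetical path delays in ps with a 1000 ps clock period.
solution = min_flip_flops([1400, 700, 1400], 1000)
```

Exhaustive search is only practical for tiny instances; it is used here so that the constraint logic stays visible.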
38
Graph-based heuristic algorithm
Traverse the graph to decide the optimal insertion of flip-flops such that the weighted sum of cycle numbers of paths is minimized
Dynamic scanning for combinational circuits
Slacks along paths are used to compute the optimal positions for FFs.
Near-optimal method for sequential circuits
• Break the cycle into a path from s to t
Throughput-aware floorplanning with pipelining
The path-based pipelining design guides the block design to optimize the performance for the whole design.
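The scanning idea can be illustrated on a single chain of indivisible segments. This is a simple greedy stand-in, not the authors' graph-based heuristic: it scans the chain once and inserts a flip-flop whenever the accumulated delay would otherwise exceed the clock period. The function name and the segment delays are hypothetical.

```python
def place_flip_flops(segment_delays, t_clk):
    """Return indices i such that a flip-flop is placed before segment i.

    Greedy scan: keep extending the current stage until the next segment
    would exceed the clock period, then cut. On a single chain this uses
    the fewest possible flip-flops.
    """
    positions = []
    accumulated = 0
    for i, delay in enumerate(segment_delays):
        if delay > t_clk:
            raise ValueError("a single segment exceeds the clock period")
        if accumulated + delay > t_clk:
            positions.append(i)
            accumulated = 0
        accumulated += delay
    return positions


# Hypothetical chain of segment delays in ps with a 1000 ps clock period.
cuts = place_flip_flops([400, 300, 500, 600, 200], 1000)
```

For this chain the scan cuts before segments 2 and 3, giving stages of 700, 500 and 800 ps. A real design has many paths sharing edges, which is where path weights and the graph traversal come in.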
39
Experimental Results
We compare the results with the wire-pipelining results (WP), the solutions obtained from the MILP solver (MILP), the ideal upper bound used in [6][8] (UB), and our graph-based heuristic approach (GH).
Impact of frequencies
Path-based pipelining gives about a 27% performance improvement over wire pipelining
40
Integrated with floorplanning optimization

Frequency   UB+post_MILP                       GH
(GHz)       Area (mm2)  Wire (mm)  BIPS        Area (mm2)  Wire (mm)  BIPS
2           32.         115.6      1.492       31.8        142        1.714
3           34.6        103.7      2.139       33.3        108.4      2.22
4           32.4        98.7       2.776       36.1        124.3      2.828
5           32.8        126.2      2.885       32.6        94.17      3.35
6           36.0        108.4      3.636       33.7        100.3      3.882
7           35.9        112.5      3.479       36.8        129.9      3.906
Comparison  1           1          1           1.003       1.05       1.091

MILP approach as a post-process at the end of the floorplanning
Integrate our approach with the throughput-driven floorplanning.
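The Comparison row appears to be the ratio of column totals, GH over UB+post_MILP; the slide does not say so, and this reading is inferred from the numbers. The sketch below recomputes it for the wire-length and BIPS columns, whose values are complete in the table. The area column is left out because its first entry is truncated in the source.

```python
# Columns copied from the table, frequencies 2 to 7 GHz.
ub_wire = [115.6, 103.7, 98.7, 126.2, 108.4, 112.5]
gh_wire = [142, 108.4, 124.3, 94.17, 100.3, 129.9]
ub_bips = [1.492, 2.139, 2.776, 2.885, 3.636, 3.479]
gh_bips = [1.714, 2.22, 2.828, 3.35, 3.882, 3.906]


def total_ratio(candidate, baseline):
    """Ratio of column totals, candidate over baseline."""
    return sum(candidate) / sum(baseline)


wire_ratio = total_ratio(gh_wire, ub_wire)
bips_ratio = total_ratio(gh_bips, ub_bips)
```

Both ratios round to the values in the Comparison row (1.05 and 1.091), so GH gains about 9% in BIPS at the cost of about 5% more wire length.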
41
Summary
3D Architecture Exploration
Coupled with 3D physical planning
Consider both 3D component stacking and folding
MEVA-3D can systematically evaluate the 3D architecture from both the performance side and the thermal side.
We propose an optimization methodology of architecture pipelining with physical design, which simultaneously optimizes the pipeline design and physical packing in terms of system throughput. The performance of the system can be improved substantially over wire pipelining.
42
Ongoing Work
3D multi-core architecture design and implementation
Deep pipeline design in microarchitecture with interconnect considered
The slacks in 3D design may be used to enlarge the sizes of blocks and get better performance.
Thank You!
[email protected]@tsinghua.org.cn