Team Members Instructor - Aseem Sayalaseemsayal.in/wp-content/uploads/2016/05/Design-of... ·...
VLSI II – Final Review
Team Members
Aseem Sayal (ICS)
Kai Wang (CS)
Kshitiz Gupta (ICS)
Ronak Oswal (ICS)
Vipul Goyal (ICS)
Instructor
Prof. Mark McDermott
Team 3
High Performance
Examples of 800MHz processors…
Agenda
• Design Specification
• Organization Chart
• ASIC Design Flow and Methodology
• Design Updates
– Synthesis – Optimizations & Results
– Floorplanning – Optimizations & Results
– Powerplanning – Optimizations & Results
– Placement – Optimizations & Results
– CTS – Optimizations & Results
– Route – Optimizations & Results
– ECO – Optimizations & Results
• Timing Convergence
• Formal Verification – Optimizations & Results
• Design Caveats
• Results
• Major Issues faced and their resolution
• How to further improve design?
• Key Learnings – “What would we like to do differently?”
• Economic Aspects
Design Specifications
QoR                  Target Specification   Results Achieved   Improvement
Cycle Time (ns)      1.25                   1.25               MET
Total Power (mW)     280                    218.40             22%
Die Area (mm^2)      1.05                   1.00               4.76%
Utilization (%)      65                     70.37              8.26%
Max. IR Drop (mV)    50                     48.42              3.16%
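The improvement figures in the specification table can be reproduced from the target and achieved values. The Python below is a minimal sketch of that arithmetic; the function name and the `higher_is_better` flag are illustrative choices, and it assumes improvement is measured relative to the target value.

```python
def improvement_pct(target, achieved, higher_is_better=False):
    """Improvement over the target, as a percentage of the target.

    For power, area and IR drop a lower value is better; for
    utilization a higher value is better (assumption used here).
    """
    delta = (achieved - target) if higher_is_better else (target - achieved)
    return 100.0 * delta / target


# (target, achieved, higher_is_better) taken from the specification table
rows = {
    "Total Power (mW)": (280, 218.40, False),
    "Die Area (mm^2)": (1.05, 1.00, False),
    "Utilization (%)": (65, 70.37, True),
    "Max. IR Drop (mV)": (50, 48.42, False),
}

results = {name: improvement_pct(t, a, hib) for name, (t, a, hib) in rows.items()}
for name, pct in results.items():
    print(f"{name}: {pct:.2f}%")
```

Rounded to two decimals this gives 22.00%, 4.76%, 8.26% and 3.16%, matching the Improvement column.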
Organization Chart
Program Manager (Prof. Mark McDermott)
Netlist/Cons Delivery
Owner: Kai, Vipul; Support: Kshitiz
Floor planning, Power planning
Owner: Ronak, Vipul; Support: Aseem, Kai
PNR
Owner: Kshitiz, Aseem; Support: Ronak, Vipul
Timing, Formal and Power Signoff
Owner: Aseem, Ronak; Support: Kshitiz, Kai
Application Engineer Support (TA Wuxi Li)
ASIC Design Flow and Methodology
DC -> Floor Planning -> Power Planning -> Placement -> CTS -> Route -> Signoff (Timing/Power)
Generic Optimization:
1. Creating a path group to optimize near-critical paths:
By default, DC optimizes only the worst critical path on a particular clock.
When a path group is created with a critical range, DC optimizes all paths
within that range.
Synthesis
2. Feeding the DEF from placement back to DC in topographical mode:
This gives Design Compiler more accurate physical design information, instead of relying
on approximations based on the wire load model.
Synthesis
• The critical path involves the execute stage in the Amber cores.
• This stage reads the register file and also performs the ALU operations.
• Reading the register file accounts for ~20% of the critical path delay; it involves
combinational logic for selection (mux) and condition encoding.
• Ideally the selection would be performed in the previous stage (decode).
Synthesis
• The two Amber cores don’t interact with each other.
– Should be symmetrically placed ideally
• The Amber cores don’t talk to the I/O pins directly.
• Ethernet and Boot Memory were not a part of the critical paths.
– Timed on slower clocks
– In our previous (midterm review) floorplan, the Ethernet logic was
disrupting the placement of standard cells for Core 0.
• The aspect ratio should ideally be 1:1
– keeps horizontal and vertical metal layer resources balanced
• Standard cell utilization depends on the area of the core/die.
Floorplan Considerations
[Figure: floorplan diagrams with blocks labeled Core 0 D-Cache, Core 0 I-Cache, Core 1 I-Cache, Core 1 D-Cache, Boot Memory, and Ethernet RAM]
• Minimize corners inside the floorplan.
• Have placement blockages around the memories.
• To save area but not reduce utilization
– Added a notch in our design. (No spec to keep it rectangular)
[Figure: final floorplan with blocks labeled Core 0 D-Cache, Core 0 I-Cache, Core 1 I-Cache, Core 1 D-Cache, Boot Memory, and Ethernet RAM, plus input ports, output and in/out ports, and a blockage]
• Die area: 1 mm^2
• Memories Placed
• Blockages Added
• Pin placement
• Pre-route SRAMs
Floorplan
[Figure: power plan showing power rings, vertical power straps on the M8 layer, and horizontal power straps on the M1 layer]
Power Plan Optimizations
– psynopt
• Command that runs timing-driven optimizations on top of place_opt. Expected to either improve
timing or leave it unchanged.
• Gave good results when used around 5-6 times (depends on the design).
• Degrades results when used more often than that.
– magnet_placement
• Marks certain cells as “magnets” so that the cells connected to them are placed
close by.
• Works only for macro pins and/or I/O pins.
– create_bounds
• Assigns a bound to a group of cells. Can either specify physical locations or just mark
the cells as a special group to be placed close to each other.
• When used on one path: no effect, as another path becomes the critical one.
• When used on all paths: degrades timing.
Placement Optimizations & Results
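The observation that psynopt helps for a few runs and then degrades the design suggests a simple stopping rule: check timing after every run and stop when the gain dries up. The Python below is a minimal sketch of that rule, not the team's script. `run_pass` is a hypothetical callback standing in for one optimization run followed by a timing report, and the WNS values in the example are made up.

```python
from typing import Callable, List, Tuple


def passes_worth_running(
    run_pass: Callable[[], float],
    max_passes: int = 10,
    min_gain_ps: float = 0.5,
) -> Tuple[int, List[float]]:
    """Run passes until the setup WNS violation stops shrinking.

    run_pass (hypothetical) performs one optimization pass and returns
    the WNS violation in ps, where smaller is better. Returns the number
    of passes that still helped, plus the full WNS history.
    """
    history: List[float] = []
    best = float("inf")
    for _ in range(max_passes):
        wns = run_pass()
        history.append(wns)
        if best - wns < min_gain_ps:
            # The latest pass gained too little (or made things worse).
            return len(history) - 1, history
        best = wns
    return len(history), history


# Made-up WNS values (ps) standing in for successive timing reports.
made_up_wns = iter([62.0, 48.0, 40.0, 35.0, 33.0, 32.8, 36.0])
useful, history = passes_worth_running(lambda: next(made_up_wns))
print(useful, history)
```

In a real flow the last pass has already been executed by the time the rule triggers, so the design would need to be restored from a checkpoint saved before that pass.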
– “place_opt -effort high -cts” without congestion constraint
• Congestion led to longer paths
– Optimal number of “psynopt” commands
• An arbitrary number of optimization commands will not necessarily yield ideal results
• Need to monitor results after every iteration
– Optimizing skew using the “clock-balancing” switch
• Analyzes the design to determine which paths can be used for useful skew.
– Optimizing clock with the “inter_clock_balance” switch
– Optimize skews for critical path groups, i.e. amber0 and amber1 groups
– Usage of the “clock_opt -only_psyn” command
– Use the “fix hold” switch with clock_opt
– Optimize clock for concurrent clock and data
– Incrementally optimize clock for both clock and data
The order in which the above commands are run also matters!
CTS Optimizations
[Figure: Clock Tree for sys_clk and sys_clk_slow]
[Figure: Clock Tree for mrx_clk, mtx_clk, and brd_clk]
[Figure: Clock Buffers for sys_clk and sys_clk_slow]
[Figure: Critical Path (CTS)]
[Figure: Critical Timing Path Slack]
[Figure: CTS Results]
– Pre-routing instances and standard cells
– Using non-default routing rules for critical group paths
• define wire width and spacing rules and via types
– Exploring switches inside “set_route_options” for global routing
• Timing Driven
• Congestion Weights
• Track Assignment Timing Driven
– Setting up repair loops within “set_route_opt_strategy”
• Specifies the number of detail routing iterations performed by the route_opt
command.
– Restricting critical group paths to specific metal layers using
“set_net_routing_layer_constraints”
– Pre-routing critical paths and clock nets
– Stages of “route_opt” and incremental “route_opt”
Route Optimizations & Results
[Figure: Critical Path]
ECO Optimizations and Results
• POST ROUTE RESULTS:
• POST ECO ITERATION 1 RESULTS:
– Fixed max_transition, max_capacitance, and hold violations
• POST ECO ITERATION 2 RESULTS:
– Fixed hold violations
• POST ECO ITERATION 3 RESULTS:
– Fixed max_capacitance and hold violations
Post ECO IR Drop Maps
[Figure: VSS IR drop map and VDD IR drop map]
Timing Convergence
Initial Floorplan (Bad) – Without PNR optimizations
             DC       Placement   CTS        Route      ECO + PT
Setup WNS:   17ps     250ps       170ps      220ps      224ps
Setup TNS:   8.63ns   471.9ns     226ns      297ns      237.9ns
Hold WNS:    -        -           240ps      250ps      193ns
Hold TNS:    -        -           28.58ns    26.81ns    26.7ns
Final Floorplan (Good) – Without PNR optimizations
             DC       Placement   CTS        Route      ECO + PT
Setup WNS:   6ps      62ps        51ps       88ps       144ps
Setup TNS:   0.8ns    84.9ns      78ns       127ns      189.9ns
Hold WNS:    -        -           222ps      250ps      0ps
Hold TNS:    -        -           128.58ns   116.8ns    0
Timing Convergence
Final Floorplan (Good) – With PNR optimizations
DC Placement CTS Route ECO + PT
Optimizations:
- Path Groups (Amber1, Amber0)
- Psynopt
- Magnet placement
- Create bounds
- Clock_opt
- Clock balancing
- Inter clock balancing
- psynopt
- hold fix
- Clock tree optimizations
- optimal fanout
- selective layers
- Skew optimization
- Timing driven
- Concur. clock/data opt.
- Hold fix
- Trans fix
- Max cap fix
- Setup fix
- PBA mode
- Uncertainty
- Routing options
- track driven
- timing driven
- congestion weights
- Route opt
- NDR critical path pre route
- Search loop strategy
Timing Convergence
Final Floorplan (Good) – With PNR optimizations
             DC       Placement   CTS        Route      ECO + PT
Setup WNS:   2.2ps    50.34ps     2.20ps     28.2ps     0ps
Setup TNS:   96.3ps   20.62ns     24.62ps    5.79ns     0ps
Hold WNS:    -        -           217ps      240ps      0ps
Hold TNS:    -        -           13.68ns    49.92ns    0ps
[Figure: Critical Path (Max mode)]
[Figure: Critical Path (Min mode)]
[Figure: Timing QoR]
We verified the RTL against the DC netlist and the RTL against the PNR netlist.
All verification passed.
Formality
• Before verification, the nets of the reference and implementation designs must be
matched. In our case the matching is based on names, which can
change after PNR.
• To ensure correct name matching (without manually writing regex rules),
synopsys_auto_setup must be set so that FM can read the SVF file, which contains
the name-change information.
• For some reason, synopsys_auto_setup is not set by default.
Formality
• In this project, we ran Formality verification several times during
design iterations, and at relatively late stages.
• In more complex projects, regression tests should be run more often,
after a certain amount of change, to reduce the cost of debugging.
Regression Test
• To deal with high-fanout components, such as the muxes for the register file
and for the barrel shifter, setting constraints is also a potential
optimization technique, but it was not adopted in the final design.
• In our experiments, we found that this approach introduces trade-offs
in other aspects.
For example, setting max_transition_time should theoretically
force the tool to increase drive strength on the critical path; however,
it also increased the levels of logic by 7% and increased
capacitance. The same is true for fanout.
Synthesis Exploration
• Reducing the placement area for core logic (by using blockages) helps improve
the timing of the critical path, since it reduces interconnect length.
• However, as we found out, excessively reducing the placement area has a negative
impact on timing. We believe this is due to increased capacitance between adjacent wires and/or
increased congestion.
[Figure: normalized TNS and number of violations for various blockage sizes]
Placement Blockage Experiments
Results
Specification                Achieved Value/Spec.
Cycle Time (ns)              1.25
Die Area (mm^2)              1.00
Utilization (%)              60.40
Power Consumption (mW)       218.40
Max. IR Drop VDD (mV)        47.672
Max. IR Drop VSS (mV)        48.421
LVS Check                    PASS
Formality – DC vs RTL        PASS
Formality – ICC vs RTL       PASS
Max. Trans Violations        0
Max. Fanout Violations       1
Max. Cap. Violations         23
Placement Utilization Fix
[Figure: SOFT BLOCKAGE region, annotated “Hardly any cells sitting here…”]
[Figure: REPLACE WITH HARD BLOCKAGE]
Results
Specification                Achieved Value/Spec.
Cycle Time (ns)              1.25
Die Area (mm^2)              1.00
Utilization (%)              70.37
Power Consumption (mW)       218.40
Max. IR Drop VDD (mV)        47.672
Max. IR Drop VSS (mV)        48.421
LVS Check                    PASS
Formality – DC vs RTL        PASS
Formality – ICC vs RTL       PASS
Max. Trans Violations        0
Max. Fanout Violations       1
Max. Cap. Violations         23
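Utilization rose from 60.40% to 70.37% at the same 1.00 mm^2 die area once the soft blockage was replaced with a hard one. One way to read this, as a rough sketch only: if utilization is cell area divided by the area the tool counts as placeable, and the total cell area did not change, the ratio of the two utilization values gives the change in counted placeable area. Both of these are assumptions for illustration, including the idea that area under a hard blockage drops out of the denominator while area under a soft blockage does not.

```python
def placeable_area_change_pct(util_before_pct, util_after_pct):
    """Percent change in counted placeable area for a fixed cell area.

    Assumes utilization = cell_area / placeable_area, so
    area_after / area_before = util_before / util_after.
    """
    if util_before_pct <= 0 or util_after_pct <= 0:
        raise ValueError("utilization values must be positive")
    return 100.0 * (util_before_pct / util_after_pct - 1.0)


# Utilization values from the two Results tables.
change = placeable_area_change_pct(60.40, 70.37)
print(f"{change:.1f}%")
```

Under these assumptions the counted placeable area shrinks by roughly 14%, which is the share of the core that the blockage was effectively keeping empty.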
– Floorplan
• We had to spend a lot of time getting the floorplan right.
• We ran multiple iterations through route with different floorplans to see the
effect of notches, memory placement, etc. on timing.
Major Issues Faced (and their resolution)
• With WNS at CTS around 60ps, WNS at Route was around 90ps.
• Brought WNS/TNS at CTS down to 2ps/24ps
– Route was still around 85ps/64ns.
• Tried Route optimizations, but to no avail.
• Bringing WNS at CTS below 1ps did not help either.
• It turned out that our utilization at the time was around 75%.
– We looked into the detailed reports on a hunch.
– 60% of the design had a utilization of >87.5%. Congestion too high!!!
• Went back and increased the area at floorplan.
– Fixed the issue.
– Took about 1-2 weeks to figure out.
Major Issues Faced (and their resolution)
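The congestion problem was hidden because a chip-level utilization of about 75% says nothing about how the cells are distributed. The Python below is a minimal sketch of the kind of check that exposes this: it compares the average utilization with the share of regions above a threshold. The equal-area bins and their values are made up for illustration and are not taken from the team's reports.

```python
from typing import List, Tuple


def utilization_summary(
    bins: List[float], threshold: float = 0.875
) -> Tuple[float, float]:
    """Return (average utilization, fraction of bins above threshold).

    Assumes every bin covers the same area, so a plain mean is valid.
    """
    if not bins:
        raise ValueError("need at least one bin")
    average = sum(bins) / len(bins)
    hot_fraction = sum(1 for u in bins if u > threshold) / len(bins)
    return average, hot_fraction


# Made-up example: six dense bins and four sparse ones.
made_up_bins = [0.90] * 6 + [0.525] * 4
average, hot_fraction = utilization_summary(made_up_bins)
print(f"average={average:.1%}, bins above 87.5%={hot_fraction:.0%}")
```

With these numbers the average is about 75% even though 60% of the bins sit above 87.5%, which is the pattern described above.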
• Another such issue came up during LVS.
• After meeting timing, we ran “verify_lvs”
– Got 2 opens (VDD and VSS).
– Essentially, the standard cells were not connected to the power rails.
• Our power grid at the time had only vertical straps.
• Redesigned the complete power grid using horizontal straps in M1.
– Started getting shorts.
– Had to tune the strap widths precisely to remove both opens and shorts.
• The main issue was that we had to do this after meeting timing. The TAT
was therefore very high, since the runs took longer, and our finalized
design could have become unusable.
– We also had to fix timing for multiple designs.
Major Issues Faced (and their resolution)
– Remove the blockages in the design, either by using notches or by shifting two
of the boot memories up into those slots.
• Saves area!
– One routing strategy was to pre-route the critical paths using switches inside
set_route_opt_strategy. Our runs using this did not work; they hung on the
search loop variable.
• Could possibly improve performance
– Since our power spec was relaxed and our focus was on timing, we did not
attempt to optimize power. We could force non-critical paths (which may be
using LVT cells right now) to use only HVT/RVT cells.
• Saves power!
– We are meeting the given specs, so any improvement on these fronts is a
tradeoff with $$$$!
How to improve the design?
• LVS and M1 power straps from the start – 1 week
• Over-designing at Placement doesn’t help much; CTS has great potential to
meet timing.
• Take the optimizations from the manuals with a pinch of salt. Not all switches
help, for example:
– ‘psynopt’ run multiple times degraded the design after a certain point,
– ‘focal_opt’ didn’t help in the Route stage,
– clock_opt -concurrent_clock_and_data did not give good results;
however, -incremental_concurrent_clock_and_data works!
– Random experimentation is not a good option; small experiments followed by
analysis work.
Key Learnings – “What would we like to do differently?”
Economic Aspects
Assumptions:
• This goes in the Apple A5 (800MHz)
• Total SoC cost = $50 (Source: QC)
• CPU cost = 10% (Source: QC)
• No. of shipments = 500 million (Source: Appleinsider, Statista.com)
• Applications: iPhone 4S, iPad 2, iPad mini, iPod touch
Area Savings in $$$
• Saving 1 mm^2 on a 100 mm^2 die (1%) saves $1 per SoC (Source: QC)
• For 5% area savings on 1.05 mm^2, the overall savings per SoC is 10 cents per die
Dollar savings = 0.05*0.1*500 million = $2.5 million
Power Savings in $$$
• Power is a huge concern these days. Snapdragon markets butter test!!
• Saving 22% power in the CPU will result in more than 3-4% overall power savings
• Assuming the SoC is sold at 0.1% higher cost ($5/CPU, $50/SoC)
Additional dollar earnings = 0.001*50*500 million = $25 million
TAT Reduction in $$$
• Assume a physical design phase of 20 weeks (Source: QC)
• TAT reduction = 1 week (5%)
• Assuming 80% is NRE cost and 20% is RE
• RE costs include engineering, infrastructure and licensing cost
Dollar savings = 0.05*0.2*5*500 million = $25 million
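The three dollar estimates are plain products of the stated assumptions, so they are easy to re-evaluate. The Python below is a minimal sketch that evaluates the formulas exactly as written on the slides; the variable names are illustrative, and reading the factor 5 in the TAT formula as the $5 CPU cost is an assumption.

```python
SHIPMENTS = 500_000_000  # units shipped, from the stated assumptions

# Area: 0.05 * 0.1 * shipments, as written on the slide
area_savings = 0.05 * 0.1 * SHIPMENTS

# Power: 0.1% price uplift on a $50 SoC
power_earnings = 0.001 * 50 * SHIPMENTS

# TAT: 5% schedule reduction * 20% RE share * 5 (assumed to be the $5 CPU cost)
tat_savings = 0.05 * 0.2 * 5 * SHIPMENTS

for label, value in [
    ("Area savings", area_savings),
    ("Power earnings", power_earnings),
    ("TAT savings", tat_savings),
]:
    print(f"{label}: ${value / 1e6:.1f} million")
```

This reproduces the $2.5 million, $25 million and $25 million figures. Changing `SHIPMENTS` or any single factor shows how sensitive each estimate is to its assumptions.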
Questions?
Thank You!
Q&A