Team Members Instructor - Aseem Sayalaseemsayal.in/wp-content/uploads/2016/05/Design-of... ·...

Examples of 800MHz processors…


Agenda

• Design Specification

• Organization Chart

• ASIC Design Flow and Methodology

• Design Updates

– Synthesis – Optimizations & Results

– Floorplanning – Optimizations & Results

– Powerplanning – Optimizations & Results

– Placement – Optimizations & Results

– CTS – Optimizations & Results

– Route – Optimizations & Results

– ECO – Optimizations & Results

• Timing Convergence

• Formal Verification – Optimizations & Results

• Design Caveats

• Results

• Major Issues faced and their resolution

• How to further improve design?

• Key Learnings – “What would we like to do differently?”

• Economic Aspects

Design Specifications

QoR                 Target Specification   Results Achieved   Improvement
Cycle Time (ns)     1.25                   1.25               MET
Total Power (mW)    280                    218.40             22%
Die Area (mm^2)     1.05                   1.00               4.76%
Utilization (%)     65                     70.37              8.26%
Max. IR Drop (mV)   50                     48.42              3.16%
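The improvement figures can be re-derived from the target and achieved values. The sketch below is only a cross-check of that arithmetic; the function names and the `higher_is_better` flag are introduced here for illustration and are not part of the original slides.

```python
# Target vs. achieved values copied from the Design Specifications slide.
# The third field is an assumption made for this sketch: utilization is
# treated as "higher is better", the other metrics as "lower is better".
SPECS = {
    "Total Power (mW)": (280.0, 218.40, False),
    "Die Area (mm^2)": (1.05, 1.00, False),
    "Utilization (%)": (65.0, 70.37, True),
    "Max. IR Drop (mV)": (50.0, 48.42, False),
}


def improvement_pct(target, achieved, higher_is_better):
    """Relative improvement over the target, in percent of the target."""
    delta = achieved - target if higher_is_better else target - achieved
    return 100.0 * delta / target


def frequency_mhz(cycle_time_ns):
    """Clock frequency in MHz implied by a cycle time in nanoseconds."""
    return 1000.0 / cycle_time_ns


if __name__ == "__main__":
    print(f"Cycle time 1.25 ns -> {frequency_mhz(1.25):.0f} MHz")
    for name, (target, achieved, hib) in SPECS.items():
        print(f"{name}: {improvement_pct(target, achieved, hib):.2f}%")
```

The 1.25 ns cycle time corresponds to the 800 MHz clock mentioned at the start of the deck.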

Organization Chart

• Program Manager: Prof. Mark McDermott

• Application Engineer Support: TA Wuxi Li

• Netlist/Cons Delivery

– Owner: Kai, Vipul

– Support: Kshitiz

• Floor planning, Power planning

– Owner: Ronak, Vipul

– Support: Aseem, Kai

• PNR

– Owner: Kshitiz, Aseem

– Support: Ronak, Vipul

• Timing, Formal and Power Signoff

– Owner: Aseem, Ronak

– Support: Kshitiz, Kai

ASIC Design Flow and Methodology

1. DC

2. Floor Planning

3. Power Planning

4. Placement

5. CTS

6. Route

7. Signoff (Timing/Power)

Synthesis

Generic Optimization:

1. Creating a group path to optimize near-critical paths:

By default, DC optimizes only the worst critical path on a particular clock. When a group path is created with a critical range, DC optimizes all paths within that range.
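As a toy illustration of what a critical range selects, the sketch below picks every path whose slack lies within a given range of the worst slack. It is not DC's actual algorithm; the path names and slack values are hypothetical, and the function name is introduced here only for illustration.

```python
def paths_in_critical_range(slacks_ns, critical_range_ns):
    """Return the names of paths whose slack is within critical_range_ns
    of the worst (smallest) slack, sorted by name."""
    worst = min(slacks_ns.values())
    return sorted(
        name
        for name, slack in slacks_ns.items()
        if slack <= worst + critical_range_ns
    )


if __name__ == "__main__":
    # Hypothetical slacks in ns; negative values are violations.
    slacks = {
        "amber0/execute": -0.20,
        "amber1/execute": -0.15,
        "ethmac/rx": -0.02,
        "boot_mem/rd": 0.30,
    }
    # Range of 0: only the single worst path is selected.
    print(paths_in_critical_range(slacks, 0.0))
    # Range of 0.1 ns: the near-critical amber1 path is selected too.
    print(paths_in_critical_range(slacks, 0.1))
```

With a range of zero only the worst path is a target, which mirrors the default behaviour described above; widening the range pulls in the near-critical paths as well.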

Synthesis

2. Feeding the DEF from placement back to DC in topographical mode

This gives DC more accurate physical design information, so it does not have to rely on approximations based on wire load models.

Synthesis

• The critical path involves the execution stage of the Amber cores.

• This stage reads the register file and also performs the ALU operations.

• Reading the register file constitutes ~20% of the critical path delay; it involves combinational logic for selection (mux) and condition encoding.

• Ideally, the selection should be performed in the previous stage (decoding).

Floorplan Considerations

• The two Amber cores don’t interact with each other.

– Ideally, they should be placed symmetrically.

• The Amber cores don’t talk to the I/O pins directly.

• Ethernet and Boot Memory were not part of the critical paths.

– Timed on slower clocks.

– In our previous (midterm review) floorplan, the Ethernet logic was disrupting the placement of standard cells for Core 0.

• The aspect ratio should ideally be 1:1.

– Horizontal and vertical metal layers

• Standard cell utilization depends on the core/die area.

Floorplan

[Figure: floorplan showing Core 0 D-Cache, Core 0 I-Cache, Core 1 I-Cache, Core 1 D-Cache, Ethernet RAM, and Boot Memory]

• Minimize corners inside the floorplan.

• Have placement blockages around the memories.


• To save area without reducing utilization

– Added a notch to our design (there was no spec requiring it to be rectangular).

Floorplan

[Figure: floorplan showing Core 0 D-Cache, Boot Memory, Core 1 D-Cache, and Ethernet RAM, with input ports, output and in/out ports, and blockages marked]

• Die area: 1 mm^2

• Memories placed

• Blockages added

• Pin placement

• Pre-route SRAMs

Power Plan Optimizations

• Power rings

• Vertical power straps: M8 layer

• Horizontal power straps: M1 layer


Placement Optimizations & Results


– psynopt

• Command for timing-driven optimization on top of place_opt. Expected to either improve timing or leave it unchanged.

• Gave good results when run around 5-6 times (depends on the design).

• Degrades results when run more often than that.

– magnet_placement

• Assigns certain cells as “magnets” so that the cells connected to them are placed close by.

• Works only for macro pins and/or I/O pins.

– create_bounds

• Assigns a bound to a group of cells. Can either specify physical locations or just mark them as a special group to be placed close to each other.

• When used on one path: no effect, as another path becomes critical.

• When used on all paths: degrades timing.
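Because repeated psynopt passes help only up to a point, the WNS after each pass has to be recorded and compared. A minimal sketch of that bookkeeping follows; the WNS values are hypothetical (not measurements from this design) and `passes_to_keep` is a name introduced here for illustration.

```python
def passes_to_keep(wns_ps):
    """Given the WNS violation (in ps, smaller is better) before any extra
    pass (index 0) and after each pass (index i), return the number of
    passes after which the violation is smallest. Ties go to fewer passes."""
    if not wns_ps:
        raise ValueError("need at least the baseline WNS")
    return min(range(len(wns_ps)), key=lambda i: wns_ps[i])


if __name__ == "__main__":
    # Hypothetical WNS log: baseline, then one entry per optimization pass.
    wns_log = [62.0, 55.0, 48.0, 41.0, 37.0, 34.0, 35.5, 39.0]
    print("keep", passes_to_keep(wns_log), "passes")
```

In this made-up log the violation shrinks for five passes and then grows again, so the run after the fifth pass is the one to keep.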




CTS Optimizations

– “place_opt -effort high -cts” without congestion constraint

• Congestion led to longer paths.

– Optimal number of “psynopt” commands

• A random number of optimization commands will not necessarily yield ideal results.

• Need to monitor results after every iteration.

– Optimizing skew using the “clock-balancing” switch

• Analyses the design to determine which paths can be used for useful skew.

– Optimizing clock with “inter_clock_balance” switch

– Optimize skews for critical path groups, i.e. amber0 and amber1 groups

– Usage of “clock_opt -only_psyn” command

– Use “fix hold” switch with clock_opt

– Optimize clock for concurrent clock and data

– Incrementally optimize clock for both clock and data

The order in which the above commands are run also matters!

[Figure: Clock Tree – sys_clk, sys_clk_slow]

[Figure: Clock Tree – mrx_clk, mtx_clk, brd_clk]

[Figure: Clock Buffers – sys_clk, sys_clk_slow]

[Figure: Critical Path (CTS)]

[Figure: Critical Timing Path Slack]

[Figure: CTS Results]


Route Optimizations & Results

– Pre-routing instances and standard cells

– Using non-default routing rules for critical group paths

• Define wire width and spacing rules and via types.

– Exploring switches inside “set_route_options” for global routing

• Timing Driven

• Congestion Weights

• Track Assignment Timing Driven

– Setting up repair loops within “set_route_opt_strategy”

• Specifies the number of detail routing iterations performed by the route_opt command.

– Restricting critical group paths to specific metal layers using “set_net_routing_layer_constraints”

– Pre-routing critical paths and clock nets

– Stages of “route_opt” and incremental “route_opt”

[Figures: Critical Path (two slides)]

ECO Optimizations and Results

• POST ROUTE RESULTS:

• POST ECO ITERATION 1 RESULTS:

– Fixed max_transition, max_capacitance, and hold violations

– Fixed hold violations

– Fixed max_capacitance and hold violations

Post ECO IR Drop Maps

[Figures: VSS IR drop map; VDD IR drop map]

Timing Convergence

Initial Floorplan (Bad) – Without PNR optimizations

             DC       Placement   CTS       Route     ECO + PT
Setup WNS    17ps     250ps       170ps     220ps     224ps
Setup TNS    8.63ns   471.9ns     226ns     297ns     237.9ns
Hold WNS     -        -           240ps     250ps     193ns
Hold TNS     -        -           28.58ns   26.81ns   26.7ns

Timing Convergence

Final Floorplan (Good) – Without PNR optimizations

             DC       Placement   CTS        Route     ECO + PT
Setup WNS    6ps      62ps        51ps       88ps      144ps
Setup TNS    0.8ns    84.9ns      78ns       127ns     189.9ns
Hold WNS     -        -           222ps      250ps     0ps
Hold TNS     -        -           128.58ns   116.8ns   0


Timing Convergence

Final Floorplan (Good) – With PNR optimizations

Optimizations:

- Path Groups (Amber1, Amber0)

- Psynopt

- Magnet placement

- Create bounds

- Clock_opt

- Clock balancing

- Inter clock balancing

- psynopt

- hold fix

- Clock tree optimizations

- optimal fanout

- selective layers

- Skew optimization

- Timing driven

- Concur. clock/data opt.

- Hold fix

- Trans fix

- Max cap fix

- Setup fix

- PBA mode

- Uncertainty

- Routing options

- track driven

- timing driven

- congestion weights

- Route opt

- NDR critical path pre route

- Search loop strategy

             DC       Placement   CTS       Route     ECO + PT
Setup WNS    2.2ps    50.34ps     2.20ps    28.2ps    0ps
Setup TNS    96.3ps   20.62ns     24.62ps   5.79ns    0ps
Hold WNS     -        -           217ps     240ps     0ps
Hold TNS     -        -           13.68ns   49.92ns   0ps
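For reference, WNS and TNS summarize per-path slack: WNS is the largest single violation and TNS is the sum of all violations. The sketch below assumes, as the timing convergence numbers appear to, that both are reported as positive magnitudes with 0 meaning timing is met; the slack values are hypothetical and the function name is introduced here for illustration.

```python
def wns_tns(slacks_ps):
    """Return (WNS, TNS) as positive violation magnitudes in ps.

    slacks_ps: per-endpoint slack in ps, negative when the path violates.
    WNS is the worst single violation; TNS is the sum of all violations.
    Both are 0 when no path violates.
    """
    violations = [-s for s in slacks_ps if s < 0]
    wns = max(violations, default=0.0)
    tns = sum(violations)
    return wns, tns


if __name__ == "__main__":
    # Hypothetical endpoint slacks in ps.
    slacks = [-28.0, -12.5, 4.0, 30.0, -1.5]
    wns, tns = wns_tns(slacks)
    print(f"WNS = {wns} ps, TNS = {tns} ps")
```

A design can have a small WNS and still a large TNS when many paths violate by a little, which is why both are tracked at every stage.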

[Figure: Critical Path (Max mode)]

[Figure: Critical Path (Min mode)]

[Figure: Timing QoR]

Formality

We verified the RTL against the DC netlist, and the RTL against the PNR netlist.

All verification passed.

Formality

• Before verification, the nets of the reference and implementation designs must be matched. In our case, the matching is based on names, which can change after PNR.

• To ensure correct name matching (other than by manually using regex), synopsys_auto_setup must be set so that FM can read the svf file, which contains the name-change information.

• For some reason, synopsys_auto_setup is not set by default.

Regression Test

• In this project, we ran Formality verification several times during the design iterations, and only in relatively late stages.

• In more complex projects, regression tests should be run more often, after a certain amount of changes, to reduce the cost of debugging.

Synthesis Exploration

• For high-fanout components, such as the muxes for the register file and the barrel shifter, setting constraints is also a potential optimization technique, but it was not adopted in the final design.

• In our experiments, we found that this approach introduces trade-offs in other aspects.

For example, setting a max_transition constraint should theoretically force the tool to increase drive strength on the critical path; however, it also increased the number of logic levels by 7% and increased capacitance. The same is true for fanout constraints.

Placement Blockage Experiments

• Reducing the placement area for core logic (by using blockages) helps improve the timing of the critical path, since it reduces interconnect length.

• However, as we found out, excessively reducing the placement area has a negative impact on timing. We believe this is due to increased adjacent capacitance and/or increased congestion.

The following shows the normalized TNS and number of violations for various blockage sizes:

[Figure: normalized TNS and number of violations vs. blockage size]

Results

Specification                Achieved Value/Spec.
Cycle Time (ns)              1.25
Die Area (mm^2)              1.00
Utilization (%)              60.40
Power Consumption (mW)       218.40
Max. IR Drop VDD (mV)        47.672
Max. IR Drop VSS (mV)        48.421
LVS Check                    PASS
Formality – DC vs RTL        PASS
Formality – ICC vs RTL       PASS
Max. Trans Violations        0
Max. Fanout Violations       1
Max. Cap. Violations         23

Placement Utilization Fix

• Soft blockage: hardly any cells sitting here…

• Replace with hard blockage

Results

Specification                Achieved Value/Spec.
Cycle Time (ns)              1.25
Die Area (mm^2)              1.00
Utilization (%)              70.37
Power Consumption (mW)       218.40
Max. IR Drop VDD (mV)        47.672
Max. IR Drop VSS (mV)        48.421
LVS Check                    PASS
Formality – DC vs RTL        PASS
Formality – ICC vs RTL       PASS
Max. Trans Violations        0
Max. Fanout Violations       1
Max. Cap. Violations         23
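One way to read the change in reported utilization from 60.40% to 70.37% is sketched below. It assumes that the standard cell area stayed the same and that converting the soft blockage to a hard blockage simply removed that region from the area counted as placeable. Both are assumptions made for this illustration, not statements from the tool reports, and the function names are hypothetical.

```python
def utilization(cell_area, core_area, hard_blockage_area=0.0):
    """Utilization = standard cell area / area available for placement."""
    return cell_area / (core_area - hard_blockage_area)


def implied_excluded_fraction(util_before, util_after):
    """Fraction of the placeable area that must have been excluded for the
    utilization to rise from util_before to util_after with the same cells."""
    return 1.0 - util_before / util_after


if __name__ == "__main__":
    frac = implied_excluded_fraction(60.40, 70.37)
    print(f"implied excluded area: {100 * frac:.1f}% of the placeable area")
    # Round trip with a normalized core area of 1.0.
    print(f"utilization after: {100 * utilization(0.6040, 1.0, frac):.2f}%")
```

Under these assumptions, roughly 14% of the previously counted area was sitting nearly empty behind the soft blockage.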

Major Issues Faced (and their resolution)

– Floorplan

• We had to spend a lot of time getting the floorplan right.

• We ran multiple iterations with different floorplans all the way through route to see the effect of notches, memory placement, etc. on timing.


• When WNS at CTS was ~60ps, WNS at Route was ~90ps.

• Brought WNS/TNS at CTS down to 2ps/24ps.

– Route was still around 85ps/64ns.

• Tried Route optimizations, but to no avail.

• Bringing WNS at CTS below 1 ps also had no effect.

• It turned out that our utilization at the time was around 75%.

– We looked into the detailed reports based on a hunch.

– 60% of the design had a utilization of >87.5%. Congestion was too high!

• Went back and increased the area at floorplan.

– This fixed the issue.

– It took about 1-2 weeks to figure out.

• Another such issue came up during LVS.

• After meeting timing, we ran “verify_lvs”.

– Got 2 opens (VDD and VSS).

– Basically, the standard cells were not connecting to the power rails.

• Our power grid at the time had only vertical straps.

• Redesigned the complete power grid using horizontal straps in M1.

– Started getting shorts.

– Had to tune the strap widths precisely to remove both opens and shorts.

• The main issue was that we had to do this after meeting timing. The TAT was therefore very high, as simulations ran for a longer time, and our finalized design could possibly have become useless.

– We also had to fix timing for multiple designs.


How to improve the design?

– Removing the blockages in the design, either by using notches or perhaps by shifting two of the boot memories up into those slots.

• Saves area!

– One strategy for routing was pre-routing the critical paths using switches inside set_route_opt_strategy. Our simulations using this did not work; they hung on the search loop variable.

• Possibly improves performance

– Since our power spec was relaxed and our focus was on timing, we did not attempt to optimize for power. We could force non-critical paths (which may be using LVT cells right now) to use only HVT/RVT cells to save power.

• Saves power!

– We are meeting the given specs, so any improvement on these fronts is a tradeoff with $$$$!

Key Learnings – “What would we like to do differently?”

• LVS and M1 power straps from the start – 1 week

• Over-designing at Placement doesn’t help much; CTS has great potential to meet timing.

• Take the optimizations from the manuals with a pinch of salt. Not every switch helps, for example:

– ‘psynopt’ run multiple times degraded the design after a certain point,

– ‘focal_opt’ didn’t help in the Route stage,

– clock_opt -concurrent_clock_and_data did not give good results; however, -incremental_concurrent_clock_and_data works!

• Random experimentation is not a good option; small experiments followed by analysis work.

Economic Aspects

Assumptions:

• This goes in the Apple A5 (800 MHz)

• Total SoC cost = $50 (Source: QC)

• CPU cost = 10% (Source: QC)

• No. of shipments = 500 million (Source: Appleinsider, Statista.com)

• Applications: iPhone 4S, iPad 2, iPad mini, iPod touch

Economic Aspects

Area Savings in $$$

• Saving 1 mm^2 in a 100 mm^2 die (1%) results in a $1 saving per SoC (Source: QC)

• For 5% area savings in 1.05 mm^2, the overall savings per SoC is 10 cents per die

Dollar savings = 0.05*0.1*500 million = $2.5 million

Power Savings in $$$

• Power is a huge concern these days. Snapdragon markets its butter test!

• Saving 22% power in the CPU will result in more than 3-4% overall power savings

• Assuming the SoC is sold at a 0.1% higher cost ($5/CPU, $50/SoC)

Additional dollar earnings = 0.001*50*500 million = $25 million

TAT Reduction in $$$

• Assume a physical design phase of 20 weeks (Source: QC)

• TAT reduction = 1 week (5%)

• Assuming 80% is NRE cost and 20% is RE

• RE costs include engineering, infrastructure and licensing cost

Dollar savings = 0.05*0.2*5*500 million = $25 million
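The three estimates above are simple products of the stated assumptions. The sketch below only reproduces that arithmetic as written on the slides; the parameter names are introduced here for illustration, the 0.1 factor in the area formula is taken as given, and reading the factor 5 in the TAT formula as the $5 CPU cost is an assumption.

```python
SHIPMENTS = 500e6        # units shipped, from the assumptions
SOC_COST_USD = 50.0      # total SoC cost, from the assumptions
CPU_COST_USD = 5.0       # 10% of the SoC cost


def area_savings_usd(area_saving=0.05, slide_factor=0.1, shipments=SHIPMENTS):
    """Area estimate as written: 0.05 * 0.1 * shipments."""
    return area_saving * slide_factor * shipments


def power_earnings_usd(price_uplift=0.001, soc_cost=SOC_COST_USD,
                       shipments=SHIPMENTS):
    """Power estimate: a 0.1% higher selling price on a $50 SoC."""
    return price_uplift * soc_cost * shipments


def tat_savings_usd(tat_reduction=0.05, re_share=0.2, cpu_cost=CPU_COST_USD,
                    shipments=SHIPMENTS):
    """TAT estimate as written: 0.05 * 0.2 * 5 * shipments."""
    return tat_reduction * re_share * cpu_cost * shipments


if __name__ == "__main__":
    print(f"Area savings:   ${area_savings_usd() / 1e6:.1f} million")
    print(f"Power earnings: ${power_earnings_usd() / 1e6:.1f} million")
    print(f"TAT savings:    ${tat_savings_usd() / 1e6:.1f} million")
```

Changing any single assumption (for example the shipment count) scales the corresponding estimate linearly.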

Questions?

Thank You!

Q&A
