High Performance FPGA Design

High Performance FPGA Designs

Kartik Subramanian Iyer

Nusrat Ali

Date: Nov-03-2006

Copyright Notice

This document contains proprietary information of HCL Technologies Ltd. No part of this document

may be reproduced, stored, copied, or transmitted in any form or by means of electronic, mechanical,

photocopying or otherwise, without the express consent of HCL Technologies. This document is

intended for internal circulation only and not meant for external distribution.

Table of Contents

1. Introduction
2. Coding Guidelines for Good Synthesis results
2.1. Identify critical Blocks
2.2. Limiting levels of logic
2.3. Multiple Clocks Design and Clock Enable
2.4. Single clock edge to clock data
2.5. Registered outputs from each leaf-level block
2.6. Reset Strategy
2.7. Proper partition
2.8. Design for Testability
2.9. Resources Used
2.10. FIFO Uses
2.11. Core Generator
2.12. Xilinx specific components
2.13. Device architecture
3. Core-Gen/Third Party IP Integration in Synthesis
3.1. Core-Gen FIFO IP core support in Synplify Pro
3.2. Xilinx implementation perspective
4. Analyzing Timing Reports
5. Implementation Options and Guidelines for ISE
5.1. Translate Properties
5.2. Map Properties
5.3. Place & Route Properties
5.4. Multi-Pass Place-and-Route
6. Guideline for MAP & PAR Options
7. Guideline for Placement using Floor-Planner
7.1. Area grouping constraints
8. Example of Pin Locking Constraints
9. Common P&R & Map Errors
10. Plan Ahead advantage
10.1. PlanAhead Flow

1. Introduction

This paper shares guidelines and tips for writing high-performance FPGA designs, along with the authors' experience of designing a high-performance DDR2 controller IP. It covers all aspects of FPGA design, from RTL coding through Map and Place & Route.

2. Coding Guidelines for Good Synthesis results

Following are some RTL coding guidelines for achieving high performance in FPGA.

2.1. Identify critical Blocks

The famous 80-20 rule holds here as well. In most cases, about 20% of the design (blocks) fails timing and creates problems for the complete design. These blocks should be identified when creating the design document. Most of the time they are counters, state machines, decoders, and data path logic. Always register their outputs before using them in another block or logic.

2.2. Limiting levels of logic

The designer should have a rough idea of how many levels of logic the design can tolerate at the desired frequency.

Tip: Around 5-8 levels of logic can achieve 250 MHz in Virtex-4.

2.3. Multiple Clocks Design and Clock Enable

The design should be partitioned so that all clock-domain-crossing logic sits in a single block. This also reduces the effort of creating the timing constraints.

Use clock enables instead of gated clocks. Clock enables save clock resources and can improve the timing characteristics and timing analysis of the design.

To gate an entire clock domain for power reduction, it is preferable to use the clock-enabled global buffer resource (BUFGCE). For applications that only need to pause the clock for a few cycles on small areas of the design, the preferred method is to use the clock-enable pin of the FPGA register.

Not suggested coding style:

    assign GATECLK = (IN1 & IN2 & CLK);

    always @(posedge GATECLK)
    begin
        if (LOAD)
            OUT1 <= DATA;
    end

Suggested coding style:

    assign ENABLE = (IN1 & IN2 & LOAD);

    always @(posedge CLOCK)
    begin
        if (ENABLE)
            DOUT <= DATA;
    end
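For the domain-level gating case mentioned in this section, a minimal sketch of instantiating the clock-enabled global buffer might look as follows. The module and signal names (gated_domain_clk, clk_in, domain_en, clk_gated) are illustrative assumptions and do not come from the DDR2 design. BUFGCE is the Xilinx library primitive with ports I, CE and O, so simulating this sketch assumes the Xilinx UNISIM library is available.

```verilog
// Sketch only: wraps the Xilinx BUFGCE primitive.
// All names other than BUFGCE and its ports (I, CE, O) are hypothetical.
module gated_domain_clk (
    input  wire clk_in,     // ungated source clock
    input  wire domain_en,  // 1 = clock runs, 0 = output held low
    output wire clk_gated   // global clock for the gated domain
);
    BUFGCE u_bufgce (
        .I  (clk_in),
        .CE (domain_en),
        .O  (clk_gated)
    );
endmodule
```

Registers that only need to pause for a few cycles would keep using the plain clock together with the ENABLE style shown in the suggested coding style.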

2.4. Single clock edge to clock data

Use a single clock edge to clock data. Using both edges can make timing hard to meet, especially when there is logic between the two edges and the flops working on different edges are placed far apart.

2.5. Registered outputs from each leaf-level block

The outputs of each major partition block, state machine, and FIFO should be registered. This is quite helpful when modifying the design for better performance, and it provides flexibility in routing and floor planning.

2.6. Reset Strategy

Synchronous resets in Xilinx devices allow better performance than asynchronous ones. More generally, an unnecessary or poorly chosen reset can:

• Prevent the use of synchronous elements of dedicated hardware blocks.

Example: Both multiplier blocks and RAM registers contain only synchronous resets. If an asynchronous reset is coded for these functions, the registers within these blocks cannot be used. This has a severe effect on performance.

• Prevent optimizations of the logic inside the fabric.

• Severely constrain placement and routing, because reset signals often have high fanout.

• Prevent the use of a device library component, such as the shift register look-up table (SRL).

Example: A reset cannot be described in the code when inferring performance-optimized shift registers (SRL), because the SRL library component does not have a reset. Using resets in code that infers shift registers requires either several flip-flops or additional logic around the SRL to provide the reset function.
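As an illustration of the synchronous reset style recommended above, the following is a minimal sketch of a register with a synchronous reset and a clock enable. The module name, parameter and port names are hypothetical and chosen only for this example.

```verilog
// Sketch only: names are illustrative, not taken from the DDR2 design.
module sync_reset_reg #(
    parameter WIDTH = 8
) (
    input  wire             clk,
    input  wire             rst,   // synchronous, active high
    input  wire             en,    // clock enable
    input  wire [WIDTH-1:0] d,
    output reg  [WIDTH-1:0] q
);
    // rst is not in the sensitivity list, so it is sampled only on
    // the rising clock edge and can map to a synchronous reset.
    always @(posedge clk) begin
        if (rst)
            q <= {WIDTH{1'b0}};
        else if (en)
            q <= d;
    end
endmodule
```

An asynchronous version would add `posedge rst` to the sensitivity list, which is the form that blocks the use of the dedicated multiplier and RAM registers described above.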

2.7. Proper partition

Synthesis tools do not optimize logic across hierarchy boundaries as efficiently as they optimize logic within one hierarchy. Moreover, designers often want to retain the hierarchy to help them understand the implementation. Here are a few guidelines that greatly help in achieving high performance.

1. Partition the logic based on interaction. As a general guideline, keep no more than 300 lines of code in one partition, and register both its inputs and outputs. If that is not possible, at least register the outputs.

2. Keep related logic together for better optimization, especially if you are using FIFOs, state machines, etc. Make sure that all the related logic is in one block. This helps you route the block independently.

3. Place all I/O components including any instantiated I/O buffers, registers, DDR circuitry,

SerDes, or delay elements on the top-level of the design. If it is not possible to place them on

the top-level, ensure that they are all contained within a single hierarchy.

4. Any logic in which the synthesis tool employs resource sharing should be contained within

the same hierarchy.

5. Manually duplicate registers with high fan-outs at hierarchy boundaries.

Tip: Avoid glue logic at the top level

2.8. Design for Testability

• Avoid tri-state buses, as FPGAs have a limited number of tri-state resources. If you have to use tri-state buses, pass the tri-state enable through an AND gate so that the scan_enable signal can control the bus and testability is preserved.

• Place a multiplexer at the output of a derived clock before it is fed to the input of another flip-flop. Make the other mux input the primary clock and the select line “scan_enable”. This ensures that the primary clock is used during test.

• In power-savvy (gated clock) designs, add an OR gate after the clock-gating AND gate, with scan_enable as the second input to the OR gate alongside the output of the AND gate.

• For a derived or internally generated reset, add the “scan_enable” signal to the other input of the “OR” gate. In test mode, asserting “scan_enable” ensures that the asynchronous reset is disabled, to avoid losing data in scan mode.
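The derived-clock guideline above can be sketched as follows. This is only an illustration under assumed names (scan_clk_mux, div_clk, flop_clk); the divide-by-two clock stands in for whatever derived clock a real design uses. In an FPGA a dedicated clock multiplexer such as BUFGMUX would normally be preferred over a fabric mux for clock signals.

```verilog
// Sketch only: all names are hypothetical.
module scan_clk_mux (
    input  wire clk,          // primary clock
    input  wire scan_enable,  // 1 = test mode
    input  wire d,
    output reg  q
);
    // Derived clock: primary clock divided by two.
    reg div_clk = 1'b0;
    always @(posedge clk)
        div_clk <= ~div_clk;

    // In test mode the flop is clocked directly by the primary clock,
    // so the scan chain never depends on internally generated clocks.
    wire flop_clk = scan_enable ? clk : div_clk;

    always @(posedge flop_clk)
        q <= d;
endmodule
```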

2.9. Resources Used

Always think about resource requirements, routing issues, and frequency requirements before coding the logic.

Suppose grant logic must assert the grant 8 clocks after the request, with at most 4 outstanding requests at a time. This is easily modeled with a shift register. When the delay reaches 64, you may want to use the dedicated SRLs available in Virtex devices, as they use resources efficiently. When the delay reaches 256 or more, the SRL implementation may no longer be suitable: it occupies many LUTs, which are spread far apart and may increase the routing delay of the related logic. A better approach is to use a FIFO.
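The shift-register form of the grant logic described above might be sketched like this. The module, parameter and signal names are assumptions made for the example. No reset is described, which leaves the synthesis tool free to map longer delays onto SRL primitives; whether it actually does so depends on the tool and its settings.

```verilog
// Sketch only: names are illustrative.
module grant_delay #(
    parameter DELAY = 8       // request-to-grant latency in clocks (>= 2)
) (
    input  wire clk,
    input  wire req,          // one-cycle request pulse
    output wire grant         // same pulse, DELAY clocks later
);
    // Each request pulse travels down the chain independently, so
    // several outstanding requests are handled without extra logic.
    reg [DELAY-1:0] pipe = {DELAY{1'b0}};

    always @(posedge clk)
        pipe <= {pipe[DELAY-2:0], req};

    assign grant = pipe[DELAY-1];
endmodule
```

Changing DELAY to 64 keeps the same code while changing the resource trade-off discussed in the text.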

2.10. FIFO Uses

Always register the FIFO outputs, i.e. the FIFO empty and FIFO full signals. If they are used unregistered by logic in another RTL block that is placed far from the block containing the FIFO, the routing delay can seriously affect timing.

2.11. Core Generator

The Xilinx CORE Generator™ tool comes with many basic cores which are quite useful and timing efficient. These cores can be considered when:

• The synthesis tool is not inferring the proper resources.

• Synthesis does not meet the timing/area requirements.

• Ready-to-use, proven cores are needed to save engineering time and money.

2.12. Xilinx specific components

Xilinx provides some ready-made components. Make use of them for better device utilization and performance. Ensure that you include the corresponding library files along with your HDL code at the synthesis stage.

For example, some components for a DDR design are glbl, IDDR, IDELAY, IDELAYCTRL, IOBUF, and ODDR. The designer should have some idea of the setup requirements of the FPGA primitives/components used. For example, the setup requirements for DDR registers are quite high.

Tip: Register the outputs and inputs to the third party IP and Xilinx components

2.13. Device architecture

Always keep the device structure in mind before coding the RTL. Virtex-4 has a columnar architecture in which the FIFOs and block RAMs are arranged in columns. Knowledge of the architecture helps in using resources efficiently and achieving performance.

Figure 1: Virtex 4 Device Architecture

Example: If an RTL block interacts with multiple FIFOs and also with the I/O pins, the routing delays to any RAM/FIFO located far from the I/O pins will be larger. In this scenario you may want to reconsider using a FIFO if the FIFO size is small. Figure 2 illustrates the issue.

[Figure 1 block labels: BUFR/BUFIO, slice logic, FIFO / block RAM, DSP, IDELAYCTRL, ODDR/IDDR]

Figure 2: Routing delay when placed FIFO is far apart

Figure 2 shows a scenario in our DDR2 design where the ODDR/IDDR pins were locked and the nearby block RAMs were occupied, so the desired block RAM was placed far away, causing significant routing delay.

The following changes helped us achieve the frequency.

The desired FIFO (placed far apart) was small in our case, so migrating it to distributed RAM helped. The dedicated I/O registers causing timing violations were moved from the I/O into the FPGA fabric. To place registers manually, the following steps must be performed.

• Disable any global I/O register placement options for the synthesis tool

• Specify whether the register should be placed into the I/O by adding an IOB=TRUE in

the UCF file or source HDL code

• Disable the Map option "Pack I/O Registers/Latches into IOBs" in ISE Project Navigator.

This disables automatic pushing of registers into the I/O

Tip: Disable global packing of registers into I/O cells. Instead, only constrain registers for which timing

is critical on the printed circuit board to be packed into the FPGA I/O cell.
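A per-register constraint in the source HDL might be sketched as follows. This assumes the XST-style Verilog-2001 attribute syntax for the IOB constraint; other synthesis tools use their own directives, so treat the attribute form as an assumption to check against your tool. The module and signal names are hypothetical.

```verilog
// Sketch only: names are illustrative; attribute syntax assumes XST.
module iob_reg_example (
    input  wire clk,
    input  wire pad_in,     // driven from a device pin
    output wire to_core
);
    // Board-level timing is critical here, so this register is
    // explicitly requested to be packed into the I/O cell.
    (* IOB = "TRUE" *)  reg in_reg;

    // Internal timing is critical here, so this register is kept in
    // the fabric where the placer can move it freely.
    (* IOB = "FALSE" *) reg stage_reg;

    always @(posedge clk) begin
        in_reg    <= pad_in;
        stage_reg <= in_reg;
    end

    assign to_core = stage_reg;
endmodule
```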

3. Core-Gen/Third Party IP Integration in Synthesis

3.1. Core-Gen FIFO IP core support in Synplify Pro

Xilinx Core Generator creates structural EDIF netlists (Xilinx/EDIF version 2.00) with both .ndf and .edn filename extensions. The .edn and .ndf netlist files are used in the ISE translate stage.

If the design uses Core-Gen FIFOs, they can be declared as black boxes, and the .edn and .ndf netlist files can be used in the translate stage. The Synplify tool can also read the EDIF-formatted files generated by the Xilinx Core Generator, which reflect the black box contents.

Note: Where part of the design is available only in netlist format, the synthesis tool cannot optimize the interface as efficiently, though it will use the .ndf and .edn files in generating the timing report.
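A black-box declaration for a generated FIFO might be sketched as follows, using the Synplify syn_black_box directive. The module name and the port list (clk, rst, din, wr_en, rd_en, dout, full, empty) are assumptions for illustration; in a real project they must match the names in the netlist that Core Generator produced.

```verilog
// Sketch only: module and port names are hypothetical and must match
// the generated core. The body is intentionally empty; the logic
// comes from the .edn/.ndf netlist at the translate stage.
module addr_fifo (
    input  wire        clk,
    input  wire        rst,
    input  wire [31:0] din,
    input  wire        wr_en,
    input  wire        rd_en,
    output wire [31:0] dout,
    output wire        full,
    output wire        empty
) /* synthesis syn_black_box */;
endmodule
```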

3.2. Xilinx implementation perspective

Most of the cores generated by the Core-Gen tool are a combination of .edn and .ndf files. An .ndf file is a Xilinx binary file equivalent to an .edf file; it has only LUT functionality (it conveys only resource and timing information) and can only be read by Xilinx software. The .edn file has complete functionality (it is used both for logic implementation and for communicating resource and timing information).

During the translate stage, all the netlist files generated by the Core-gen tool for the IP core need to be added to the project file of the design.

Note: These include the files with extensions .edn and .ndf, and any other lower-level netlist files in the hierarchy. Failure to add any one of these files will result in implementation errors during the translate stage.

4. Analyzing Timing Reports

The logic-level timing report gives the first measure of design performance. The following are important points to consider.

• Allow an extra 20% margin on the delays reported by synthesis, since routing delays are only estimated at that stage.

• Do a simple synthesis with just the clock constraints and observe the results before adding more constraints. With many constraints, compile time is going to be large and the implementation tools may not be able to reach your timing goal (be very aware of this).

• Use the Post Layout Timing Report to verify that your constraints were met by the implementation tools. This is easier than opening the Timing Analyzer on a very large Virtex design, which might take a couple of minutes.

• Use the Timing Analyzer to generate detailed timing information about your design. The

Timing Analyzer will provide a wealth of timing information on designs that use timing

constraints. Unconstrained designs will generate a Default Path Analysis that is only slightly

helpful.

• The Report Paths in Timing Constraints report shows each constraint's delay paths in descending order of slack. The Report Paths Failing Timing Constraints report shows each failing delay path.

• The Custom Report shows all the delay paths between groups of path endpoints created by selecting Sources and Destinations. This report can be used to find the timing information for a particular delay path without having to review a large report.

• The Report Paths Not Covered report shows all of the delay paths in the design, in descending order of length. This report can be used to find any unconstrained delay paths.

• The Timing Analyzer reports can show how many levels of logic are being inferred. This is very important, since most designers are not aware of how much logic they are generating with their synthesis tool, or how much optimization the synthesis tool is doing for them. If a delay path infers too many levels of logic, it will have to be re-synthesized (with code changes or different synthesis option settings) to meet your timing objective.

• Hide unwanted messages in the ISE timing report by setting filters in ISE.

5. Implementation Options and Guidelines for ISE

5.1. Translate Properties

Make sure that the LOC Constraints box is enabled if you already have some LOC constraints in

the UCF file.

5.2. Map Properties

• Timing-Driven Packing and Placement: The timing-driven packing option uses the timing constraints to guide the packing of critical-path logic into slices. It ensures that the critical paths are placed and routed before other, non-critical paths.

Tip: Try timing-driven packing with a regular PAR effort level of High first; then try the extra-effort level, starting with Normal.

• Map Effort Level: It is better to start from the standard effort level. If the design does not meet the timing requirements, try a High map effort with a Normal Extra Effort.

• Combinatorial Logic Optimization: Enable this option to remove any extra (unused) combinatorial logic.

• Register Duplication & Global Optimization: Be careful with these options, as you may over-constrain the implementation tools into performing many actions on the design, and they may issue an error.

• Replicate Logic to Allow Logic Level Reduction: Register replication increases the speed of critical paths by making copies of registers to reduce the fan-out of a given signal. Enable this option to potentially improve timing results. Manual replication can also be tried if the tool is not able to replicate the logic. This increases the area.

Tip: Enable this option when high-fanout nets with long route delays are reported as critical paths in the timing report.

Manual replication of a high fan-out net:

    // Keep both copies: stop XST from merging equivalent registers.
    (* EQUIVALENT_REGISTER_REMOVAL="NO" *)
    reg signal_1, signal_2;

    // Two identical registers, each driving half of the load.
    always @(posedge clk)
    begin
        signal_1 <= signal1_high_fan_out;
        signal_2 <= signal1_high_fan_out;
    end

    always @(posedge clk)
    begin
        if (signal_1)
            data_out[7:0] <= data[7:0];
        if (signal_2)
            data_out[15:8] <= data[15:8];
    end

Note: Many times an additional synthesis constraint needs to be added to ensure that a manually

duplicated register is not optimized away by the synthesis tool. In the above example, the XST syntax

was used (EQUIVALENT_REGISTER_REMOVAL).

5.3. Place & Route Properties

• Place & Route Mode: The default value is Standard. Standard gives the fastest run time but the least effort in meeting your timing objectives. High gives the most effort at meeting your timing objectives, at the expense of increased run time. Try effort level Standard, then Medium, then High as the final choice.

• Placer Effort Level: For shorter tool run times, it is better to choose a Medium effort level.

• Router Effort Level: The router effort level can be increased from Medium to High. This improves timing when there are a significant number of timing violations caused by routing delays, typically by about 5%.

Tip: Routing and timing are largely determined by the placement of logic. Therefore, it is usually most beneficial to use a High effort level for placement and limit the routing effort level to Standard. While the quality of the routing depends on the placement, the best placement will not always produce the best timing.

• Extra Effort: It is better not to choose this option unless you have tried all other implementation strategies and still cannot meet your timing objectives. Enabling it results in significantly longer tool run times. It may also result in P&R errors if you have already enabled other implementation options, such as timing-driven packing and placement, during the MAP stage.

5.4. Multi-Pass Place-and-Route

Multi-Pass Place-and-Route is the part of the Xilinx tool set that fully implements the design

based on a cost table (often referred to as a "seed"). 100 different cost tables can be attempted.

Each one will provide a fully implemented design with a different placement (and different

routing), which provides different timing.

• Multi-Pass Place-and-Route is very time-consuming and should be used only after nearly all other options have been attempted.

Tip: Start with a low placer cost table value (preferably the default) and generally use a small value for the Number of PAR Iterations (about 3 to 5). For example, target 5 iterations of PAR in MPPR mode with the option of saving the results from the 3 best runs. This should give you a fair idea of the improvement in timing achieved through each run.

• For most designs, you can expect a 15 to 20% difference in performance between the very best and the very worst cost tables. Typically, you might gain a 5% improvement over a normal place and route.

• With reference to the placement points in the previous section, you can run many different placements and, once you have found the best placement, increase the routing effort level to High to finish the routing on the best one or two placements.

6. Guideline for MAP & PAR Options

1. Ensure that the clocks are routed on global/dedicated clock resources. This reduces clock skew, which minimizes the possibility of hold violations and increases the reliability of the design.

2. Route Reset and Set on the global routing resources. Use the Global Set/Reset (GSR) resources to reduce the skew on a set/reset in older device families. Don’t use the GSR in Virtex: it has too much delay, and general interconnect will distribute this signal quickly.

3. Provide a proper Max Skew attribute in the UCF for control signals that are routed on general interconnect and have high fan-out.

4. For both placement and routing, the designer has to be aware of the contents of the placed (routed) block, otherwise placement (routing) will fail. Always provide a 20% margin of logic elements for any design to be placed and routed; this reduces tool run time and increases the chances of a successful route.

5. At times the logic being developed is intermediate (part of a bigger design). The following points need to be taken care of in such cases.

6. The I/O generally shows large delays that will not occur in the real design, where the input could be the registered output of some other block. In such cases the path from the I/O to the first register can be declared a false path.

7. An intermediate design may have more I/O than is available in the device, in which case Place and Route will fail. Design a wrapper in which all the inputs are registered and fed to a mux. The select line of the mux is generated by a free-running counter, and the output of the mux is fed to the real design. This ensures that the wrapper is not synthesized away. Declare the I/O-to-register paths as false paths.

Figure 3 : Wrapper for ISE implementation

7. Guideline for Placement using Floor-Planner

The location of the block to be placed can be specified as X and Y coordinates on the FPGA device, either by entering the coordinates manually or by dragging the placed block in the device editor and letting the tool generate the slice coordinates automatically. In doing so, make sure that the chosen area contains all the logic resources required for the block. It is good to have a 20% margin.

[Figure 3 block labels: Input Ports, Input Register Stage, Free-running counter, DATA-MUX, Internal Inputs, DDR Controller IP, Memory interface signals, Multiple outputs, Output Register Stage, Output Ports, DDR Wrapper; three paths are marked "False path"]

A rough resource estimate for the block to be placed can be taken from the Map report (.mrp). A rough estimate of the resources contained in the chosen area can be obtained from the device architecture and the percentage of the device area that the coordinates cover.

Tip: Specify the coordinates diagonally, e.g. take X0Y0 as one corner and X10Y10 as the other. This ensures that all FPGA resources (slices, FIFOs, RAMs, DSPs) within this area are available to the placer for placing the logic in the design.

Example: INST "u_mem_intf" RANGE = SLICE_X0Y247:SLICE_X80Y167;

Note: The drawback of this approach is the approximation involved in choosing the coordinates. The approximation becomes more complex when common logic overlaps between two blocks.

7.1. Area grouping constraints

The preferred method of placing related logic on an FPGA is to use Area Grouping constraints. If an Area Group is attached to a hierarchical block, all sub-blocks in the block are assigned to the group. Once defined, an AREA_GROUP can have a variety of additional constraints associated with it to control its implementation. All these AREA_GROUP constraints should be specified in the UCF file.

Example:

AREA_GROUP "AG_mem_intf_grp" GROUP = OPEN;
AREA_GROUP "AG_mem_intf_grp" PLACE = OPEN;
AREA_GROUP "AG_mem_intf_grp" RANGE = SLICE_X0Y247:SLICE_X80Y167;
AREA_GROUP "AG_mem_intf_grp" RANGE = RAMB16_X0Y21:RAMB16_X0Y30,
    RAMB16_X1Y21:RAMB16_X1Y30, RAMB16_X2Y21:RAMB16_X2Y30;
INST "u_ddr_mmr" AREA_GROUP = "AG_mem_intf_grp";
INST "u_ddr_controller" AREA_GROUP = "AG_mem_intf_grp";
INST "u_alg_dmapio_mux" AREA_GROUP = "AG_mem_intf_grp";
INST "u_alg_dmapio_arbiter" AREA_GROUP = "AG_mem_intf_grp";
INST "u_alg_synchronizer" AREA_GROUP = "AG_mem_intf_grp";
INST "u_addr_fifo" AREA_GROUP = "AG_mem_intf_grp";

As can be seen from the example above the RANGE for resource usage can be set.

Tip: Using AREA_GROUP constraints like GROUP & PLACE we can include & place logic

that is outside the AREA_GROUP with the logic which is within the AREA_GROUP. To enable

this set AREA_GROUP as OPEN.

8. Example of Pin Locking Constraints

There are 3 possible ways in which the pin-locking can be done.

1. The pin-locking LOC constraints can be specified in the UCF (User Constraints File). Here the user has to write the desired coordinates manually in the UCF file.

2. Pin locking can be done through the PACE editor, where the tool itself generates the coordinates for the placed block.

3. Pin locking can be done through the design browser available in the PlanAhead software suite. PlanAhead is not part of ISE and requires a separate license. The tool also provides built-in DRC.

Examples:

IDELAY Control related pin locking constraints

INST "u_ddr_controller/ddr_idelayctrl_0" LOC=IDELAYCTRL_X0Y0;

BUFG related pin locking constraints

INST "u_ddr_controller/u_BUFG_IDELAYCTRL" LOC=BUFGCTRL_X0Y0;

ODDR related pin locking constraints

NET "ddr2_dq_out[71]" LOC = "T36";

IDDR related pin locking constraints

NET "MEM_DM[8]" LOC = "M37";

Here MEM_DM is an inout type of port.

9. Common P&R & Map Errors

IDELAYCTRL Uses: When only one IDELAYCTRL is instantiated in the HDL design code, LOC constraints are not required. When multiple IDELAYCTRLs are instantiated, LOC constraints are required, otherwise the tool will report an error. The reference clock (REFCLK) port of the IDELAYCTRL should be driven by a global clock buffer (BUFGCTRL), otherwise the tool will report an error.

Tip: Always provide a LOC constraint for the IDELAYCTRL primitive, even if the design uses only one IDELAYCTRL. It will decrease the power consumption and resource usage.
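Putting the two IDELAYCTRL rules together, a minimal sketch might look as follows. It assumes XST-style Verilog attribute syntax for the LOC constraint (the same constraint can be written in the UCF instead), reuses the site name IDELAYCTRL_X0Y0 from the pin-locking examples, and uses hypothetical module and signal names. BUFG and IDELAYCTRL are Xilinx library primitives, so simulation assumes the UNISIM library.

```verilog
// Sketch only: wrapper and signal names are hypothetical.
module idelayctrl_example (
    input  wire refclk_in,   // reference clock source
    input  wire rst,
    output wire idelay_rdy
);
    wire refclk;

    // REFCLK must come from a global clock buffer.
    BUFG u_bufg_refclk (
        .I (refclk_in),
        .O (refclk)
    );

    // Locked to one site so the tools do not replicate the
    // controller into every clock region.
    (* LOC = "IDELAYCTRL_X0Y0" *)
    IDELAYCTRL u_idelayctrl (
        .REFCLK (refclk),
        .RST    (rst),
        .RDY    (idelay_rdy)
    );
endmodule
```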

ODDR Uses: For outputs wider than one bit that are driven by ODDR instances, we must ensure that each bit of the output is driven by a separate ODDR instance. ODDR use is explained below.

Wrong Code

output [`DDR2DS_WIDTH -1 :0] ddr2_dqs_out ;

//---------------------------- Internal Wire Declarations --------------------------------------

wire [`DDR2DS_WIDTH -1 :0] ddr2_dqs_out ;

wire [`DDR2DS_WIDTH -1 :0] ddr2_dqs_out_int /* synthesis syn_keep=1 */ ;

reg dqs_en_in_reg /* synthesis syn_preserve=1 */ ;

wire mem_dqs_in /* synthesis syn_keep=1 */ ;

ODDR #("SAME_EDGE",0,"SYNC") U_dqs0_oddr (

.C ( MEM_CLK_in ), // in

.CE ( dqs_en_in_reg ), // in

.R ( 1'b0 ), // in

.S ( 1'b0 ), // in

.D1 (1'b1 ), // in

.D2 (1'b0 ), // in

.Q ( mem_dqs_in ) // out

) ;

assign ddr2_dqs_out_int = mem_dqs_in ;

assign ddr2_dqs_out = ddr2_dqs_out_int ;

always @ (posedge core_clk_in )

begin

dqs_en_in_reg <= dintrf_memintrf_wrdat_en_out;

end

In the above code the output of a single ODDR viz. mem_dqs_in is being used to drive a vectored

output port. Logically this is not possible as it amounts to packing multiple outputs to a single

ODDR. Under these circumstances the implementation tools will issue an error.

Correct Code

reg dqs_en_in_reg /* synthesis syn_preserve=1 */ ;

wire [`DDR2DS_WIDTH - 1:0] mem_dqs_in /* synthesis syn_keep=1 */ ;

genvar i;

generate for (i=0; i < `DDR2DS_WIDTH ; i= i+1)

begin : dqs_Test

ODDR #("SAME_EDGE",0,"SYNC") U_dqs0_oddr (

.C ( MEM_CLK_in ), // in

.CE ( dqs_en_in_reg ), // in

.R ( 1'b0 ), // in

.S ( 1'b0 ), // in

.D1 (1'b1 ), // in

.D2 (1'b0 ), // in

.Q ( mem_dqs_in[i] ) // out

) ;

end

endgenerate

assign ddr2_dqs_out_int = mem_dqs_in ;

assign ddr2_dqs_out = ddr2_dqs_out_int ;

always @ (posedge core_clk_in )

begin

dqs_en_in_reg <= dintrf_memintrf_wrdat_en_out;

end

In the code above, a for loop within a generate statement instantiates one ODDR for each bit of the output. Note that the mem_dqs_in signal is now declared as a vectored signal.

Note: When IDELAYCTRL is instantiated without LOC constraints, the implementation tools automatically replicate IDELAYCTRL instances throughout the entire device, even in clock regions that do not use the delay element. This results in higher power consumption due to higher resource utilization (one global clock resource is used in every clock region) and a greater use of routing resources. There are eight global clock lines per clock region.

10. PlanAhead Advantage

The following are the major limitations of the ISE place and route tool:

1- The user has to make a rough approximation of the resources required for the block placed.

2- The user also has to approximate the total area in which the desired block will be placed and routed, and has to ensure that this area can accommodate the logic.

3- A major disadvantage of the ISE place and route engine is that when area grouping constraints are assigned to a design with overlapping logic, the design invariably fails at the MAP stage of implementation. In such cases the placement area (co-ordinates) has to be increased further.

PlanAhead provides a solution to the above-mentioned issues. It improves the performance of the design by reducing route delay through floor planning. It gives insight into routing issues and allows the user to decide on the appropriate placement of the logic.

It can hierarchically partition the design into smaller, more manageable physical blocks (called Pblocks). It maintains a physical hierarchy that is independent of the logic hierarchy. This enables Pblocks to include logic modules and primitive logic from anywhere in the logic hierarchy. Critical or associated logic can be tightly grouped into a single Pblock, which prevents logic migration and thus limits interconnect lengths and reduces delays.

10.1. PlanAhead Flow

The PlanAhead tool sits between synthesis and the ISE place and route tools. Any FPGA synthesis tool targeting Xilinx FPGAs can be used for the design. PlanAhead uses the synthesized netlist and the design constraint files for analysis. The tool allows you to export an EDIF netlist and a single design constraint (UCF) file to drive the ISE tools.

Figure 4: FPGA Design flow using PlanAhead

A floor plan can be created in the following ways:

1. The netlist from Synplify Pro, along with the relevant user constraints file (UCF), can be input directly to the PlanAhead tool to create a new floor plan.

2. If the design has already been run through ISE, the results can help in floor plan creation. The ExploreAhead tool within PlanAhead can be used to load the existing placement into the PlanAhead floor plan.

Figure 5 : Typical Placement view in PlanAhead

Floor Planning with PlanAhead: Floor planning is an iterative process. To begin with, create a new Pblock containing the logic of the critical path (the path violating timing) and place it near the logic it interacts with. The “show connectivity” feature in PlanAhead can be used to find the appropriate place for the Pblock.

Figure 6 : Using the Show Connectivity command

Run TimeAhead after the placement. If the violations disappear with the new placement, save the UCF constraints and run ExploreAhead.

Tip: PlanAhead has an embedded static timing analysis engine and environment called TimeAhead.

This provides a good and fast approximation of the timing for the placed design. The analysis can be

run with zero interconnect delays or with estimated delays.

If the above approach does not work and a lot of common logic is involved in the Pblock creation, you may have to repartition larger Pblocks into smaller Pblocks.

Tip: As a general rule, the smaller the logic constrained in a Pblock, the more predictable its timing becomes.

The figures below illustrate the re-partitioning of a Pblock.

Step-1: Select the top module of the design. Suppose it contains three sub-blocks, receiver, led and channel, as shown in figure-6. Create Pblocks for them: pblock_receiver, pblock_led and pblock_channel. Place them and run the TimeAhead tool.

Step-2: If TimeAhead reports any violation, as shown in figure-7, the Pblock containing the critical path needs to be re-partitioned. In this case, repeat step-1 with pblock_receiver set as the top (since it contains the critical path).
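In UCF terms, Pblocks such as those created in Step-1 are typically expressed as area-group constraints. The fragment below is only a sketch: the instance paths (receiver, led, channel at the top level) and all slice ranges are hypothetical placeholders, and the real values come from the floor plan drawn in PlanAhead.

```ucf
# Hypothetical instance paths and slice ranges, for illustration only.
INST "receiver" AREA_GROUP = "pblock_receiver";
AREA_GROUP "pblock_receiver" RANGE = SLICE_X0Y0:SLICE_X23Y31;

INST "led" AREA_GROUP = "pblock_led";
AREA_GROUP "pblock_led" RANGE = SLICE_X24Y0:SLICE_X31Y15;

INST "channel" AREA_GROUP = "pblock_channel";
AREA_GROUP "pblock_channel" RANGE = SLICE_X0Y32:SLICE_X31Y63;
```

The ranges here are deliberately non-overlapping. Re-partitioning in Step-2 would replace the single pblock_receiver group with several smaller groups, one per sub-block of the receiver.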

Tip: As a rule of thumb, the Pblocks with the heaviest bundle nets should be placed close together.

Figure 7: Initial Pblock Placement

Figure 8: Refined Pblock Placement

ExploreAhead: PlanAhead contains a tool called ExploreAhead which allows multiple

implementation attempts using various ISE command options.

Users can create and save ISE “Strategies”, which are sets of option configurations for each ISE implementation command. These Strategies are then applied to floor plans for implementation using ISE. Users can monitor progress, view log reports, and quickly identify and import the best implementation results.

11. Conclusion

Although synthesis tools are becoming more and more advanced, a sound understanding of the design and proper planning always pay off, especially in designs with large gate counts, multiple clocks and high speeds.