Winning with HDL. AGENDA Introduction HDL coding techniques Virtex hardware Summary.

Winning with HDL

AGENDA

• Introduction
• HDL coding techniques
• Virtex hardware
• Summary

Coding for Performance

Gate Arrays are relatively tolerant of poor coding styles and design practices

66 MHz is easy for a Gate Array

Designs coded for a Gate Array tend to perform 3x slower when converted to an FPGA

It is not uncommon to see FPGA designs with up to 30 levels of logic running at only 10-20 MHz

6-8 FPGA Logic Levels = 50 MHz

FPGAs require different coding styles and more effective design methodologies to reach gate array system speeds.

Coding for Performance

Common mistake is to ignore hardware and start coding as if programming. To achieve best performance, the designer must think about the hardware.

Improve performance by:
• avoiding unnecessary priority structures in logic
• optimizing logic for late-arriving signals
• structuring arithmetic for performance
• avoiding area-inefficient code
• buffering high fanout signals
• pipelining for high performance
• exploiting high performance cores from CoreGen

Effective Coding Style: Case vs. If-Then-Else

[Figure: a 4:1 mux with inputs in0, in1, in2, in3, select sel, and output mux_out, beside a priority chain with stages for sel=00, sel=01, and sel=10 driving p_encoder_out]

module mux (in0, in1, in2, in3, sel, mux_out);
  input in0, in1, in2, in3;
  input [1:0] sel;
  output mux_out;
  reg mux_out;

  always @(in0 or in1 or in2 or in3 or sel) begin
    case (sel)
      2'b00:   mux_out = in0;
      2'b01:   mux_out = in1;
      2'b10:   mux_out = in2;
      default: mux_out = in3;
    endcase
  end
endmodule

module p_encoder (in0, in1, in2, in3, sel, p_encoder_out);
  input in0, in1, in2, in3;
  input [1:0] sel;
  output p_encoder_out;
  reg p_encoder_out;

  always @(in0 or in1 or in2 or in3 or sel) begin
    if (sel == 2'b00)
      p_encoder_out = in0;
    else if (sel == 2'b01)
      p_encoder_out = in1;
    else if (sel == 2'b10)
      p_encoder_out = in2;
    else
      p_encoder_out = in3;
  end
endmodule

Generally, If-Else is slower unless you intend to build a priority encoder!

Priority Encoder “if-then-else”: When to Use?

• Assign highest priority to a late arriving critical signal
• Nested “if-then-else” can increase area and delay
• Use “case” statement if possible to describe the same function

always @(sel or in)
begin
  if (sel == 3'h0)
    out = in[0];
  else if (sel == 3'h1)
    ...
    out = in[4];
  else
    out = in[5];
end

[Figure: chain of 2:1 mux stages, each with select S, fed by in[0] through in[4]]
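As a rough illustration of the first point, the sketch below gives a late-arriving control the first branch of the if-else and decodes the early selects with a case statement. The module name late_priority and all of its port names are hypothetical and do not come from the slides; which signal is actually late depends on the design and has to be confirmed in the timing report.

```verilog
// Sketch only: module and signal names are illustrative assumptions.
module late_priority (late_en, sel, late_data, d0, d1, d2, y);
  input        late_en;     // assumed to be the late-arriving control
  input  [1:0] sel;         // assumed to arrive early
  input        late_data, d0, d1, d2;
  output       y;
  reg          y;

  always @(late_en or sel or late_data or d0 or d1 or d2) begin
    // The first branch has the highest priority, which places it in the
    // last mux stage before the output, so late_en sees the fewest levels.
    if (late_en)
      y = late_data;
    else
      // Early signals have no priority relationship; decode them in parallel.
      case (sel)
        2'b00:   y = d0;
        2'b01:   y = d1;
        default: y = d2;
      endcase
  end
endmodule
```

Only the one signal that needs priority gets an if branch; everything else stays in the case statement so the tool is free to build a flat mux.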

Benefits of “case” statement

always @(C or D or E or F or G or H or I or J or S)
begin
  case (S)
    3'b000  : Z = C;
    3'b001  : Z = D;
    3'b010  : Z = E;
    3'b011  : Z = F;
    3'b100  : Z = G;
    3'b101  : Z = H;
    3'b110  : Z = I;
    default : Z = J;
  endcase
end

[Figure: 8:1 mux with data inputs C, D, E, F, G, H, I, J, select S, and output Z]

• Compact and delay optimized implementation
• Implemented in a single CLB
• Synthesis maps to MUXF5 and MUXF6 functions
• 4:1 MUX is implemented in a single CLB slice

Effective Coding Style: Optimize for the Critical Path

[Figure: two gate-level structures producing out from in0, in1, in2, in3, and critical, one for each module below]

module critical_bad (in0, in1, in2, in3, critical, out);
  input in0, in1, in2, in3, critical;
  output out;

  assign out = (((in0&in1) & ~critical) | ~in2) & ~in3;

endmodule

module critical_good (in0, in1, in2, in3, critical, out);
  input in0, in1, in2, in3, critical;
  output out;

  assign out = ((in0&in1) | ~in2) & ~in3 & ~critical;

endmodule

Minimize the critical path where possible
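Note that critical_bad and critical_good, as written, compute different values when critical = 1, in2 = 0, and in3 = 0 (the first gives 1, the second gives 0), so the second is a drop-in replacement only when that combination cannot occur. One possible way to keep the original function while still moving critical to the last logic level is to evaluate the expression for both values of critical and let critical select between them. The module name critical_equiv and the two internal wires are illustrative and not part of the original slides.

```verilog
// Sketch only: same function as critical_bad, with critical as the
// final 2:1 select. Names other than the ports are assumptions.
module critical_equiv (in0, in1, in2, in3, critical, out);
  input  in0, in1, in2, in3, critical;
  output out;
  wire   out_if_0, out_if_1;

  // critical_bad's expression evaluated with critical = 0
  assign out_if_0 = ((in0 & in1) | ~in2) & ~in3;

  // critical_bad's expression evaluated with critical = 1:
  // the (in0 & in1) term is masked, leaving only ~in2 & ~in3
  assign out_if_1 = ~in2 & ~in3;

  // critical only passes through this last select stage
  assign out = critical ? out_if_1 : out_if_0;
endmodule
```

A synthesis tool may still restructure this logic during optimization, so the resulting depth on the critical input should be checked after mapping.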

-- No parentheses
OUT1 <= I1 + I2 + I3 + I4;

-- With parentheses
OUT1 <= (I1 + I2) + (I3 + I4);

[Figure: two adder structures producing OUT1 from I1, I2, I3, and I4, one for each coding above]

Structuring Arithmetic for Performance

Know your tools: use Synthesis directives, options (vendor specific)

Area, Speed, Ungrouping and flattening, Resource sharing, "DesignWare" libraries

Attributes: ripple, look-ahead, fastest, smallest, e.g. // exemplar attribute out1 modgen_sel fastest

Use LogiBlox or CORE Generator if the vendor hasn't fully tuned its arithmetic yet

Use parentheses to control logical structure
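The same idea can be written in Verilog. The sketch below puts both forms side by side; the module name add4, the port names, and the 8-bit width are arbitrary choices made for illustration.

```verilog
// Sketch only: names and the 8-bit width are illustrative assumptions.
module add4 (i1, i2, i3, i4, out_chain, out_tree);
  input  [7:0] i1, i2, i3, i4;
  output [7:0] out_chain, out_tree;

  // No parentheses: evaluated left to right as ((i1 + i2) + i3) + i4,
  // which describes three adders in series.
  assign out_chain = i1 + i2 + i3 + i4;

  // With parentheses: two additions in parallel followed by one more,
  // which describes two adder levels.
  assign out_tree = (i1 + i2) + (i3 + i4);
endmodule
```

The balanced form is usually the better starting point when all four operands arrive together. If one operand is known to be late, a chain with that operand in the last addition may be preferable.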

How to use the Carry-In in FPGA Express

In FPGA Express, concatenate the Carry-In to get an adder with carry (Adder_c). Without concatenation (Adder_b), you would end up with 2 adders.

In other tools, like Leonardo, Adder_b will generate a single adder with carry-in -- no concatenation is necessary.

// ADDER_A
// No carry-in
AOUT = AIN1 + AIN2;

// ADDER_B
// Carry-in used but 2 adders
BOUT = BIN1 + BIN2 + BCARRYIN;

// ADDER_C
// Carry-in used with only 1 adder required
// Concatenate
{COUT, CCARRYOUT} = {CIN1, CCARRYIN} + {CIN2, CCARRYIN};

Verilog Notes

For CASE statements, be sure to use your synthesis vendor’s syntax to ensure optimum performance.

• Full_case syntax allows you to avoid unwanted latches
• Parallel_case syntax allows you to ensure a parallel (as opposed to priority encoded) hardware implementation in case statements where all cases are mutually exclusive.

Use “Don’t-Cares” to speed up your design and reduce area
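A minimal sketch of these notes, assuming a one-hot select. The comment form `// synopsys full_case parallel_case` is the Synopsys spelling of the directive; other synthesis vendors may use a different spelling, so check the tool's documentation. The module name onehot_select and its ports are hypothetical.

```verilog
// Sketch only: names are illustrative; sel is assumed to be one-hot.
module onehot_select (a, b, c, sel, y_pragma, y_dontcare);
  input        a, b, c;
  input  [2:0] sel;
  output       y_pragma, y_dontcare;
  reg          y_pragma, y_dontcare;

  // Variant 1: vendor directive tells synthesis that the listed cases
  // are complete (no latch) and mutually exclusive (no priority logic).
  always @(a or b or c or sel) begin
    case (sel) // synopsys full_case parallel_case
      3'b001: y_pragma = a;
      3'b010: y_pragma = b;
      3'b100: y_pragma = c;
    endcase
  end

  // Variant 2: an explicit don't-care default covers the unused codes,
  // letting synthesis pick whatever value minimizes the logic.
  always @(a or b or c or sel) begin
    case (sel)
      3'b001:  y_dontcare = a;
      3'b010:  y_dontcare = b;
      3'b100:  y_dontcare = c;
      default: y_dontcare = 1'bx;
    endcase
  end
endmodule
```

The directive in variant 1 is invisible to a simulator, which will hold the previous value for unlisted codes, so simulation and synthesized hardware can disagree if sel is ever not one-hot. Variant 2 makes that case visible as an X in simulation.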

Avoid inefficient code

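[Figure: two implementations of sum: one with two adders (a0 + b0 and a1 + b1) followed by a mux on sel, the other with sel muxing a0/a1 and b0/b1 into a single adder]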

module poor_resource_sharing (a0, a1, b0, b1, sel, sum);
  input a0, a1, b0, b1, sel;
  output sum;
  reg sum;

  always @(a0 or a1 or b0 or b1 or sel) begin
    if (sel)
      sum = a1 + b1;
    else
      sum = a0 + b0;
  end
endmodule

module good_resource_sharing (a0, a1, b0, b1, sel, sum);
  input a0, a1, b0, b1, sel;
  output sum;
  reg sum;
  reg a_temp, b_temp;

  always @(a0 or a1 or b0 or b1 or sel) begin
    if (sel) begin
      a_temp = a1;
      b_temp = b1;
    end
    else begin
      a_temp = a0;
      b_temp = b0;
    end
    sum = a_temp + b_temp;
  end
endmodule

Use 2 muxes rather than 2 adders to reduce resource usage

Duplicate Registers to Reduce Fan-Out

module low_fanout (in, en, clk, out);
  input [23:0] in;
  input en, clk;
  output [23:0] out;
  reg [23:0] out;
  reg tri_en1, tri_en2;

  // Two copies of the enable register, each driving half of the bus
  always @(posedge clk) begin
    tri_en1 <= en;
    tri_en2 <= en;
  end

  always @(tri_en1 or in) begin
    if (tri_en1) out[23:12] = in[23:12];
    else         out[23:12] = 12'bZ;
  end

  always @(tri_en2 or in) begin
    if (tri_en2) out[11:0] = in[11:0];
    else         out[11:0] = 12'bZ;
  end
endmodule

module high_fanout (in, en, clk, out);
  input [23:0] in;
  input en, clk;
  output [23:0] out;
  reg [23:0] out;
  reg tri_en;

  // A single enable register drives all 24 tri-state buffers
  always @(posedge clk)
    tri_en <= en;

  always @(tri_en or in) begin
    if (tri_en) out = in;
    else        out = 24'bZ;
  end
endmodule

[Figure: high_fanout, where en is registered once by clk and tri_en drives 24 loads between [23:0]in and [23:0]out; low_fanout, where en is registered twice and tri_en1 and tri_en2 each drive 12 loads]

Design Partition - Reg at Boundary

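[Figure: two placements of the registers around the adder that computes sum from a0 and a1, clocked by clk, one for each module below]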

module reg_at_boundary (a0, a1, clk, sum);
  input a0, a1, clk;
  output sum;
  reg sum;

  // Output is registered at the module boundary
  always @(posedge clk) begin
    sum <= a0 + a1;
  end
endmodule

module reg_in_module (a0, a1, clk, sum);
  input a0, a1, clk;
  output sum;
  reg sum;
  reg a0_temp, a1_temp;

  // Inputs are registered; the adder output leaves the module unregistered
  always @(posedge clk) begin
    a0_temp <= a0;
    a1_temp <= a1;
  end

  always @(a0_temp or a1_temp) begin
    sum = a0_temp + a1_temp;
  end
endmodule

Pipeline for Performance

module no_pipeline (a, b, c, clk, out);
  input a, b, c, clk;
  output out;
  reg out;
  reg a_temp, b_temp, c_temp;

  // Multiply and add both sit between the input and output registers
  always @(posedge clk) begin
    out    <= (a_temp * b_temp) + c_temp;
    a_temp <= a;
    b_temp <= b;
    c_temp <= c;
  end
endmodule

module pipeline (a, b, c, clk, out);
  input a, b, c, clk;
  output out;
  reg out;
  reg a_temp, b_temp, c_temp1, c_temp2, mult_temp;

  // Stage 1: register the inputs and the product
  always @(posedge clk) begin
    mult_temp <= a_temp * b_temp;
    a_temp    <= a;
    b_temp    <= b;
  end

  // Stage 2: c is delayed one extra cycle to line up with the product
  always @(posedge clk) begin
    out     <= mult_temp + c_temp2;
    c_temp2 <= c_temp1;
    c_temp1 <= c;
  end
endmodule

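[Figure: no_pipeline, with the multiplier (*) and adder (+) in 1 cycle between inputs a, b, c and out; pipeline, with the multiplier and adder split across 2 cycles]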

Pipeline to increase performance

Take Advantage of Virtex Hardware

Use flip-flops and pipeline! FPGAs contain hordes of flip-flops.

Virtex gives you 4 DLLs that can be used to synchronize clocks for superior system timing

Use the optimized cores from CoreGen to get high performance, pipelined arithmetic and sophisticated functional blocks.

RTL Flexibility for Register Configurations

Register mapping for:
• Registers with sync/async set and reset
• Clocks, inverted clocks, and clock enable

Positive edge triggered flip-flop with clock enable, synchronous clear, and asynchronous preset

always @(posedge clk or posedge preset)
begin
  if (preset)
    q <= 1;
  else if (reset)
    q <= 0;
  else if (CE)
    q <= data;
end

[Figure: flip-flop with inputs data, clk, ce, reset, and preset, and output q]

Timing Driven Register IOB Mapping

• Technology mapping will not duplicate registers
• Critical signal will not be absorbed in the IOB register

process (Clk) begin
  if (Clk'event and Clk = '1') then
    Tri_R <= Tri;
  end if;
end process;

-- "out" is a reserved word in VHDL, so the output is named Data_out here
process (Tri_R, Data_in) begin
  if (Tri_R = '1') then
    Data_out <= Data_in;
  else
    Data_out <= (others => 'Z');
  end if;
end process;

[Figure: TRI is registered by CLK into TRI_R, which enables all 24 tri-state buffers between DATA[23:0] and OUT[23:0]; fanout = 24]

Timing Driven Register IOB Mapping

• Duplicate register on critical path for fanout of 1
• Mapping will absorb register in IOB

process (Clk) begin
  if (Clk'event and Clk = '1') then
    Tri_R1 <= Tri;
    Tri_R2 <= Tri;
  end if;
end process;

-- "out" is a reserved word in VHDL, so the output is named Data_out here
process (Tri_R1, Data_in) begin
  if (Tri_R1 = '1') then
    Data_out(23) <= Data_in(23);
  else
    Data_out(23) <= 'Z';
  end if;
end process;

process (Tri_R2, Data_in) begin
  if (Tri_R2 = '1') then
    Data_out(22 downto 0) <= Data_in(22 downto 0);
  else
    Data_out(22 downto 0) <= (others => 'Z');
  end if;
end process;

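[Figure: TRI is registered by CLK into TRI_R1, which enables only the buffer from DATA[23] to OUT[23] (fanout = 1), and into TRI_R2, which enables the buffers from DATA[22:0] to OUT[22:0] (fanout = 23)]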

Area Efficient Muxes using TBUFs

• Improve area efficiency by using tri-states
• Each CLB has 2 TBUFs

assign Q[7:0] = E0 ? A[7:0] : 8'bz;
assign Q[7:0] = E1 ? B[7:0] : 8'bz;
assign Q[7:0] = E2 ? C[7:0] : 8'bz;
assign Q[7:0] = E3 ? D[7:0] : 8'bz;

case (E)
  4'b0001 : Q[7:0] = A[7:0];
  4'b0010 : Q[7:0] = B[7:0];
  4'b0100 : Q[7:0] = C[7:0];
  4'b1000 : Q[7:0] = D[7:0];
endcase

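[Figure: a mux selecting A[7:0], B[7:0], C[7:0], or D[7:0] with E[3:0] onto Z[7:0], beside tri-state buffers enabled by E0, E1, E2, and E3 driving the same four inputs onto a shared Z[7:0] bus]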

TBUFs as Muxes Performance Summary

• Improve area efficiency by using tri-states
• But slower than equivalent muxes under most circumstances
• Too much delay getting onto TBUF
• Each CLB has 2 TBUFs
• PAR can connect tri-states on multiple horizontal long lines to build wide muxes

Distributed RAM Inferencing: System Memory

module ramtest (q, addr, d, we, clk);
  output [3:0] q;
  input [3:0] d;
  input [2:0] addr;
  input we;
  input clk;

  reg [3:0] mem [7:0];

  // Asynchronous read, synchronous write
  assign q = mem[addr];

  always @(posedge clk) begin
    if (we) mem[addr] <= d;
  end
endmodule

[Figure: Synplicity result (RAM 8x4): Addr[2:0], D[3:0], clk, and we feed RAM16x1S primitives (pins A0 to A3, D, WCLK, WE), one per bit of q[3:0]]

• Synplify and LeonardoSpectrum can infer distributed RAM
• FPGA Express will support RAM inferencing in future

Registered IO Mapping: System Interfaces

• System timing: chip to chip performance limits system speeds
• No need to instantiate IOB register cells
• Implementation tools will pack registers in the IO: map -pr b
  b (both input and output)
  i (input only)
  o (output only)
• IOB = TRUE attribute
• Mapping for data and enable ports

[Figure: IOB registers with D, CE, CLK, S/R, and Q pins: an output register driving an OBUF and an input register fed by an IBUF]

Instantiating Technology Specific Features

• Block RAM: system memory
• CLKDLL: minimizes clock skew
• Special IOs: interfacing with standard buses
• LUTs for datapath pipelining: add latency with minimal area impact

LUTs for Datapath Pipelining

• LUT can be used in place of registers to balance pipeline stages
• Area efficient implementation
• SRL16E can delay an input value up to 16 clock cycles
• Sync up operands before the next operation

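[Figure: 32-bit operands A[31:0], B[31:0], and C[31:0] pass through blocks F, G, and H (8 cycles, 5 cycles, 1 cycle) to output Z. SRL16E elements (pins D, CE, CLK, A3, A2, A1, A0) balance the paths: tapped at Q7, 32 LUTs replace 256 registers; tapped at Q12, 32 LUTs replace 416 registers]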
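As a hedged sketch of how such a delay might be described behaviorally, the module below is a plain shift register with a clock enable and no reset. Whether a given synthesis tool maps this to SRL16E primitives rather than flip-flops depends on the tool and its settings; adding a reset typically prevents that mapping. The module name delay_line and the parameters WIDTH and DEPTH are illustrative assumptions, not from the slides.

```verilog
// Sketch only: names and parameter defaults are illustrative.
// DEPTH is the delay in clock cycles and is assumed to be 1 to 16.
module delay_line (clk, ce, d, q);
  parameter WIDTH = 32;
  parameter DEPTH = 8;

  input              clk, ce;
  input  [WIDTH-1:0] d;
  output [WIDTH-1:0] q;

  reg [WIDTH-1:0] sr [0:DEPTH-1];
  integer i;

  // No reset: the shift register contents are simply shifted through.
  always @(posedge clk) begin
    if (ce) begin
      sr[0] <= d;
      for (i = 1; i < DEPTH; i = i + 1)
        sr[i] <= sr[i-1];
    end
  end

  // Output is the value that entered DEPTH enabled clock edges earlier.
  assign q = sr[DEPTH-1];
endmodule
```

With the defaults this corresponds to the 8-cycle, 32-bit case in the figure: 256 flip-flops if built from registers, or 32 LUTs if each bit is mapped to one SRL16E.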

Block RAM: System Memory

RAMB4_S1 U1 (.WE(WE), .EN(EN), .RST(RST), .CLK(CLK), .ADDR(ADDR), .DI(DI), .DO(DO));

component RAMB4_S1
  port (WE, EN, RST, CLK: in STD_LOGIC;
        ADDR: in STD_LOGIC_VECTOR(11 downto 0);
        DO: out STD_LOGIC;
        DI: in STD_LOGIC_VECTOR(0 downto 0));
end component;

begin U1: RAMB4_S1 port map(WE=>WE, EN=>EN, RST=>RST, CLK=>CLK, DI=>DI, ADDR=>ADDR, DO=>DO);

[Figure: RAMB4_S1 block with inputs ADDR, WE, EN, RST, DI, and CLK (driven by addr, we, en, rst, di, and clk) and output DO (driving do)]

• Instantiate single and dual port RAM
• Use CoreGen to build RAM and FIFO (Q1 '99)

wire clk_fb;
BUFGDLL U4 (.I(clkin), .O(clk_fb));

[Figure: BUFGDLL instance U4 (clkin, rst, clk_fb), built from IBUFG, CLKDLL, and BUFG. CLKDLL pins: CLKIN, CLKFB, RST, CLK0, CLK90, CLK180, CLK270, CLK2X, CLKDV, LOCKED]

Virtex CLKDLL: Minimize Clock to Out Pad Delay

• Removes all delay from external GCLKPAD pin to the registers and RAM
• BUFGDLL is available for instantiation
• Other configurations can be built by instantiating the CLKDLL macro
• UCF is the only way to configure CLKDLL or BUFGDLL
• In future would like to use generics (VHDL) and parameters (Verilog), but synthesizers don't pass them on yet
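For configurations that BUFGDLL does not cover, the CLKDLL can be instantiated directly. The sketch below shows one commonly documented hookup using the IBUFG, CLKDLL, and BUFG primitives and the CLKDLL pin names listed on the earlier slide. It assumes the vendor primitive library is available to the synthesis and simulation tools; the module name clk_dll_sketch and the wire names are hypothetical.

```verilog
// Sketch only: requires the Xilinx primitive library for IBUFG, CLKDLL,
// and BUFG. Module and wire names are illustrative assumptions.
module clk_dll_sketch (clkin_pad, rst, clk, locked);
  input  clkin_pad, rst;
  output clk, locked;

  wire clkin_buf, clk0;

  // Dedicated clock input buffer
  IBUFG u_ibufg (.I(clkin_pad), .O(clkin_buf));

  // DLL: the globally buffered clock is fed back on CLKFB so the DLL
  // can cancel the clock distribution delay. Unused outputs are left open.
  CLKDLL u_dll (
    .CLKIN  (clkin_buf),
    .CLKFB  (clk),
    .RST    (rst),
    .CLK0   (clk0),
    .CLK90  (),
    .CLK180 (),
    .CLK270 (),
    .CLK2X  (),
    .CLKDV  (),
    .LOCKED (locked)
  );

  // Global clock buffer driving the internal clock network
  BUFG u_bufg (.I(clk0), .O(clk));
endmodule
```

Downstream logic would normally be held in reset until locked goes high. DLL properties such as the CLKDV divide ratio are set through constraints, as noted above.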

Special IO Buffers: System Interfaces

Default IO buffer is LVTTL (12mA), available via inference

• Process technology leads to mixed voltage systems
• High performance, low power signal standards emerging

Instantiate IO buffers for:
• non default current drive
• non default voltage standard
• non default slew

OBUF_AGP U0 (.I(awire), .O(oport));

OBUF_F_24 U1 (.I(awire), .O(oport));

[Figure: awire drives oport through U0 (OBUF_AGP) and through U1 (OBUF_F_24)]

• OBUF_AGP: Accelerated Graphics Port bus interface (Pentium II graphics)
• OBUF_F_24: fast slew rate and 24 mA drive strength

Summary

• Efficient HDL coding allows designers to build high performance designs
• Designers should consider the underlying hardware as they code, to achieve best results
• Exploit the hardware’s features for best performance
