Basic FPGA Architecture - swedishcr.weebly.com · F6 MUX are shown in this diagram) ... • Or 4x1...


11/25/2010


CS411 Digital System Design

Dr. Arshad Aziz

Basic FPGA Architecture

Technology Timeline

The Design Warrior’s Guide to FPGAs: Devices, Tools, and Flows. ISBN 0750676043

Copyright © 2004 Mentor Graphics Corp. (www.mentor.com)

Major FPGA vendors

SRAM-based FPGAs
• Xilinx Inc. – www.xilinx.com
• Altera Corp. – www.altera.com
• Atmel Corp. – www.atmel.com
• Lattice Semiconductor Corp. – www.latticesemi.com

Antifuse and flash-based FPGAs
• Actel Corp. – www.actel.com
• QuickLogic Corp. – www.quicklogic.com

Feature | SRAM | Antifuse | E2PROM / FLASH
Technology node | State-of-the-art | One or more generations behind | One or more generations behind
Reprogrammable | Yes (in system) | No | Yes (in-system or offline)
Reprogramming speed (inc. erasing) | Fast | ---- | 3x slower than SRAM
Volatile (must be programmed on power-up) | Yes | No | No (but can be if required)
Requires external configuration file | Yes | No | No
Good for prototyping | Yes (very good) | No | Yes (reasonable)
Instant-on | No | Yes | Yes
IP Security | Acceptable (especially when using bitstream encryption) | Very Good | Very Good
Size of configuration cell | Large (six transistors) | Very small | Medium-small (two transistors)
Power consumption | Medium | Low | Medium
Rad Hard | No | Yes | Not really

The Programmable Marketplace, Q1 Calendar Year 2005

Source: Company reports. Latest information available; computed on a 4-quarter rolling basis.

[Pie charts]
• PLD Segment: Xilinx 51%, Altera 33%, Lattice and Actel (slices of 5% and 7%), QuickLogic 2%, Other 2%
• FPGA Sub-Segment: Xilinx 58%, Altera 31%, All Others 11%

FPGA Families

Xilinx
– Low-cost: Spartan 3, Spartan 3E, Spartan 3L
– High-performance: Virtex 4 LX / SX / FX, Virtex 5 LX

Altera
– Low-cost: Cyclone II
– High-performance: Stratix II, Stratix II GX


Xilinx
• Primary products: FPGAs and the associated CAD software
– Programmable Logic Devices
– ISE Alliance and Foundation Series Design Software
• Main headquarters in San Jose, CA
• Fabless* Semiconductor and Software Company
– UMC (Taiwan) {*Xilinx acquired an equity stake in UMC in 1996}
– Seiko Epson (Japan)
– TSMC (Taiwan)

Source: [Xilinx Inc.]

Xilinx FPGA Families
• Old families
– XC3000, XC4000, XC5200
– Old 0.5µm, 0.35µm and 0.25µm technology. Not recommended for modern designs.
• Low-cost families
– Spartan/XL – derived from XC4000
– Spartan-II – derived from Virtex
– Spartan-IIE – derived from Virtex-E
– Spartan-3 (90 nm)
– Spartan-3E (90 nm)
– Spartan-3A (90 nm)
• High-performance families
– Virtex (220 nm)
– Virtex-E, Virtex-EM (180 nm)
– Virtex-II, Virtex-II PRO (130 nm)
– Virtex-4 (90 nm)
– Virtex-5 (65 nm)

Source: [Xilinx Inc.]


General structure of an FPGA

[Figure: Xilinx FPGA showing Configurable Logic Blocks, I/O Blocks, and Block RAMs]

Generic FPGA architecture

[Figure labels: Configurable Logic Block (CLB), Connection Block, Switch Block, Routing Channels, I/O pad, Wire segments]


Xilinx CLB: Xilinx Point of Reference
• A Xilinx CLB has FOUR slices
– Each slice has TWO logic cells
– Each logic cell has ONE LUT plus other logic (carry and control) plus a flip-flop/latch
• For SLICEL slices, these LUTs can be configured as:
1. LUT
• For SLICEM slices, these LUTs can be configured as:
1. LUT
2. 16 x 1 Distributed RAM (16 words x 1 bit/word)
3. 16-bit Shift Register

CLB Structure of Spartan 3
• Each Virtex-II CLB contains four slices
– Local routing provides feedback between slices in the same CLB, and it provides routing to neighboring CLBs
– A switch matrix provides access to general routing resources

[Figure: CLB with switch matrix, slices S0 to S3, local routing, two BUFTs, CIN and COUT carry connections, and SHIFT]


Simplified view of a Xilinx Logic Cell

[Figure: Slice 0 with two logic cells, each a LUT feeding carry logic and a D flip-flop with CE, PRE, and CLR inputs]

Simplified Slice Structure
• Each slice has four outputs
– Two registered outputs, two non-registered outputs
– Two BUFTs associated with each CLB, accessible by all 16 CLB outputs
• Carry logic runs vertically, up only
– Two independent carry chains per CLB

Detailed Slice Structure
• The next few slides discuss the slice features
– LUTs
– MUXF5, MUXF6, MUXF7, MUXF8 (only the F5 and F6 MUX are shown in this diagram)
– Carry Logic
– MULT_ANDs
– Sequential Elements


SRAM Cell (Pass Transistor)
• An SRAM cell can drive the gate (G) terminal of an NMOS transistor.
• If SRAM (M) = 1, the signal passes from S to D.
• An SRAM cell can be attached to the select line of a MUX to control it.

Look-Up Tables
• Combinatorial logic is stored in Look-Up Tables (LUTs)
– Also called Function Generators (FGs)
– Capacity is limited by the number of inputs, not by the complexity
• Delay through the LUT is constant

[Figure: combinatorial logic with inputs A, B, C, D and output Z, next to the equivalent LUT contents]

A B C D | Z
0 0 0 0 | 0
0 0 0 1 | 0
0 0 1 0 | 0
0 0 1 1 | 1
0 1 0 0 | 1
0 1 0 1 | 1
. . .
1 1 0 0 | 0
1 1 0 1 | 0
1 1 1 0 | 0
1 1 1 1 | 1

Look Up Table (LUT)
• The LUT is used to realize any Boolean function.
• Assume the function to be realized is y = (a&b) | !c
• This could be achieved by loading the LUT with the appropriate output values
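
As a rough illustration of "loading the LUT with the appropriate output values", the sketch below builds the 8-entry table for y = (a&b) | !c and then evaluates it by using the inputs as an address. The helper names `build_lut` and `lut_read` are made up for this example, and treating `a` as the most significant address bit is an assumption.

```python
from itertools import product

def build_lut(func, n_inputs):
    """Return the 2**n_inputs output bits a LUT would store for func."""
    return [func(*bits) & 1 for bits in product((0, 1), repeat=n_inputs)]

def lut_read(table, *inputs):
    """Evaluate the LUT: the inputs form the address of the stored bit."""
    address = 0
    for bit in inputs:  # first input is the most significant address bit
        address = (address << 1) | bit
    return table[address]

# y = (a & b) | !c, with !c written as (1 - c) for 0/1 values
lut = build_lut(lambda a, b, c: (a & b) | (1 - c), 3)

for a, b, c in product((0, 1), repeat=3):
    print(a, b, c, "->", lut_read(lut, a, b, c))
```

Once the table is filled, evaluation never looks at the Boolean expression again, which is why the delay through a LUT does not depend on the complexity of the function.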

LUT (Look-Up Table) Functionality

• Look-Up tables are primary elements for logic implementation

• Each LUT can implement any function of 4 inputs

[Figure: a 4-input LUT with inputs x1, x2, x3, x4 and output y, shown with two example configurations. For x1 x2 x3 x4 = 0000 through 1111, the stored output column y is 0100010101001100 in the first example and 1111111111110000 in the second.]

5-Input Functions implemented using two LUTs

• One CLB slice can implement any function of 5 inputs
• Logic function is partitioned between two LUTs
• F5 multiplexer selects LUT

[Figure: the two LUTs of a slice (each usable as LUT, ROM, or RAM, with inputs A1 to A4, WS, DI and output D) feeding the F5 multiplexer, which is selected by BX and drives the F5 and X outputs]

5-Input Functions implemented using two LUTs

[Figure: truth table of a 5-input function Y(X5, X4, X3, X2, X1), split between two LUTs whose outputs are combined into OUT]

X4 X3 X2 X1 | Y (X5 = 0) | Y (X5 = 1)
0 0 0 0 | 0 | 0
0 0 0 1 | 1 | 0
0 0 1 0 | 0 | 0
0 0 1 1 | 0 | 0
0 1 0 0 | 1 | 0
0 1 0 1 | 1 | 0
0 1 1 0 | 0 | 0
0 1 1 1 | 0 | 1
1 0 0 0 | 1 | 0
1 0 0 1 | 0 | 1
1 0 1 0 | 0 | 0
1 0 1 1 | 1 | 1
1 1 0 0 | 1 | 0
1 1 0 1 | 1 | 1
1 1 1 0 | 1 | 0
1 1 1 1 | 1 | 0
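
A minimal sketch of this partitioning, assuming X5 drives the F5 multiplexer select and each 4-input LUT holds one half of the slide's truth table. The names `lut4` and `f5` are illustrative only.

```python
# Y column of the slide's truth table, split on X5.
# Index = X4 X3 X2 X1 read as a 4-bit number (0000 .. 1111).
LUT_X5_0 = [int(b) for b in "0100110010011111"]  # half used when X5 = 0
LUT_X5_1 = [int(b) for b in "0000000101010100"]  # half used when X5 = 1

def lut4(table, x4, x3, x2, x1):
    """4-input LUT: the inputs address one of 16 stored bits."""
    return table[(x4 << 3) | (x3 << 2) | (x2 << 1) | x1]

def f5(x5, x4, x3, x2, x1):
    """Both LUTs see the same four inputs; the F5 mux picks one output."""
    low = lut4(LUT_X5_0, x4, x3, x2, x1)
    high = lut4(LUT_X5_1, x4, x3, x2, x1)
    return high if x5 else low

print(f5(0, 0, 0, 0, 1), f5(1, 0, 1, 1, 1), f5(1, 1, 1, 1, 1))
```

Any 5-input function can be handled the same way, because fixing one input always leaves two 4-input functions.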


Dedicated Expansion Multiplexers
• MUXF5 combines 2 LUTs to create
– Any 5-input function (LUT5)
– Or selected functions up to 9 inputs
– Or 4x1 multiplexer
• MUXF6 combines 2 slices to form
– Any 6-input function (LUT6)
– Or selected functions up to 19 inputs
– Or 8x1 multiplexer
• Dedicated muxes are faster and more space efficient

[Figure: CLB with two slices, each holding two LUTs and a MUXF5, joined by a MUXF6]

Connecting Look-Up Tables
• MUXF5 combines LUTs in each slice
• MUXF6 combines slices S0 and S1
• MUXF6 combines slices S2 and S3
• MUXF7 combines the two MUXF6 outputs
• MUXF8 combines the two MUXF7 outputs (from the CLB above or below)

[Figure: CLB with slices S0 to S3; each slice contains an F5 mux plus one wider mux (F6, F7, or F8)]

Programmable Logic Block
• Early devices were based on the concept of a programmable logic block, which comprised
– a 3-input lookup table (LUT),
– a register that could act as a flip-flop or a latch,
– a multiplexer, along with a few other elements.

3-, 4-, 5-, or 6-input LUTs?
• The key feature of an n-input LUT is that it can implement any possible n-input combinational logic function.
• Adding more inputs allows you to represent more complex functions, but every time you add an input, you double the number of SRAM cells!
• The first FPGAs were based on 3-input LUTs.
• FPGA vendors and researchers studied the relative merits of 3-, 4-, 5-, and even 6-input LUTs.
• The current consensus is that 4-input LUTs offer the optimal balance of pros and cons.
• In the past, some devices were created using a mixture of different LUT sizes because this offered the promise of optimal device utilization.
• However, current logic synthesis tools prefer uniformity and regularity.
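
The doubling can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: `sram_cells` and `function_count` are names invented here, and the function count 2^(2^n) is the standard number of distinct n-input Boolean functions rather than a figure from the slides.

```python
def sram_cells(n_inputs):
    """One SRAM cell per truth-table row: 2**n cells for an n-input LUT."""
    return 2 ** n_inputs

def function_count(n_inputs):
    """Number of distinct Boolean functions an n-input LUT can hold."""
    return 2 ** sram_cells(n_inputs)

for n in (3, 4, 5, 6):
    print(f"{n}-input LUT: {sram_cells(n)} SRAM cells, "
          f"{function_count(n)} possible functions")
```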

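FPGA Function generators
• LUT Example: Implement the function F, a sum of three product terms over the variable groups ABD, BCD, and ABC (the complement bars on the literals are not recoverable here), using:
– 2-input LUTs
– 3-input LUTs
– 4-input LUTs

[Figures: three implementations of F, one per LUT size, with inputs A, B, C, D]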

Fast Carry Logic
• Each CLB contains separate logic and routing for the fast generation of sum & carry signals
– Increases efficiency and performance of adders, subtractors, accumulators, comparators, and counters
• Carry logic is independent of normal logic and routing resources

[Figure: carry logic and routing running from LSB to MSB]


Fast Carry Logic
• Simple, fast, and complete arithmetic logic
– Dedicated XOR gate for single-level sum completion
– Uses dedicated routing resources
– All synthesis tools can infer carry logic

[Figure: CLB with two carry chains, each running from CIN to COUT. The first chain passes through slices S0 and S1 and continues to S0 of the next CLB; the second passes through slices S2 and S3 and continues to CIN of S2 of the next CLB.]

Accessing Carry Logic
• All major synthesis tools can infer carry logic for arithmetic functions
– Addition (SUM <= A + B)
– Subtraction (DIFF <= A - B)
– Comparators (if A < B then…)
– Counters (count <= count + 1)
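
To show what the inferred structure computes, here is a bit-level sketch of an adder built the carry-chain way. It assumes the usual arrangement in which the LUT produces a propagate signal, a dedicated XOR forms the sum bit, and a carry multiplexer passes the carry upward; that arrangement and the name `carry_chain_add` are assumptions of this example, not details stated on the slides.

```python
def carry_chain_add(a, b, width, carry_in=0):
    """Add two unsigned values bit by bit, LSB first, as a carry chain would."""
    carry = carry_in
    total = 0
    for i in range(width):
        a_i = (a >> i) & 1
        b_i = (b >> i) & 1
        propagate = a_i ^ b_i          # computed in the LUT
        sum_i = propagate ^ carry      # dedicated XOR gate
        # carry mux: pass the incoming carry if propagating, else generate a_i
        carry = carry if propagate else a_i
        total |= sum_i << i
    return total, carry                # (sum, carry out)

print(carry_chain_add(5, 3, 4))    # 4-bit addition of 5 and 3
print(carry_chain_add(15, 1, 4))   # overflows into the carry out
```

In a design flow none of this is written by hand: an expression such as SUM <= A + B is enough for the synthesis tool to map onto the dedicated chain.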

Flexible Sequential Elements
• Either flip-flops or latches
• Two in each slice; eight in each CLB
• Inputs come from LUTs or from an independent CLB input
• Separate set and reset controls
– Can be synchronous or asynchronous
• All controls are shared within a slice
– Control signals can be inverted locally within a slice

[Figure: symbols for the FDCPE flip-flop (D, CE, PRE, CLR, Q), the FDRSE flip-flop (D, CE, S, R, Q), and the LDCPE_1 latch (D, CE, G, PRE, CLR, Q)]

Shift Register
• Each LUT can be configured as shift register
– Serial in, serial out
• Dynamically addressable delay up to 16 cycles
• For programmable pipeline
• Cascade for greater cycle delays
• Use CLB flip-flops to add depth

[Figure: a LUT drawn as a chain of D flip-flops with shared CE and CLK, a serial input IN, and an output OUT selected by DEPTH[3:0]]
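
A behavioural sketch of a LUT used as a 16-bit shift register with a dynamically addressable tap. The class name `ShiftRegisterLUT` and the convention that address 0 selects the most recently shifted-in bit (a one-cycle delay) are assumptions made for this example.

```python
class ShiftRegisterLUT:
    """16-bit serial-in, serial-out shift register with an addressable tap."""

    def __init__(self):
        self.bits = [0] * 16

    def clock(self, data_in, ce=1):
        """On a clock edge with CE high, shift the new bit in at position 0."""
        if ce:
            self.bits = [data_in & 1] + self.bits[:-1]

    def out(self, depth):
        """Read the tap chosen by the 4-bit DEPTH value (delay = depth + 1)."""
        return self.bits[depth & 0xF]

srl = ShiftRegisterLUT()
for bit in (1, 0, 0, 0, 0):   # a single 1 followed by four 0s
    srl.clock(bit)
print(srl.out(4), srl.out(3))  # the 1 now sits at tap 4, not at tap 3
```

Changing `depth` between reads changes the delay without reloading the register, which is what makes the pipeline length programmable.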

Shift Register
• Register-rich FPGA
– Allows for addition of pipeline stages to increase throughput
• Data paths must be balanced to keep desired functionality

[Figure: two 64-bit paths. One passes through Operation A (4 cycles) and Operation B (8 cycles), 12 cycles in total; the other passes through Operation C (3 cycles), leaving a 9-cycle imbalance.]

Shift Register LUT Example

[Figure: the same two paths with Operation D, a 9-cycle NOP, added after Operation C so that both paths take 12 cycles. Paths are statically balanced.]
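
The cost of balancing the two paths can be estimated with simple arithmetic. The sketch below assumes the "64" in the figure is the bus width in bits and that each LUT in shift-register mode delays one bit by up to 16 cycles; `balance` is a hypothetical helper, not part of any tool.

```python
import math

def balance(long_path_cycles, short_path_cycles, width, srl_depth=16):
    """Estimate the resources needed to delay the shorter path."""
    delay = abs(long_path_cycles - short_path_cycles)
    luts_per_bit = math.ceil(delay / srl_depth)
    return {
        "delay_cycles": delay,
        "srl_luts": width * luts_per_bit,   # one LUT chain per data bit
        "flip_flops": width * delay,        # cost if built from flip-flops
    }

# Operation A (4 cycles) + Operation B (8 cycles) versus Operation C (3 cycles)
print(balance(4 + 8, 3, width=64))
```

Under these assumptions the 9-cycle NOP takes 64 LUTs in shift-register mode, compared with 576 flip-flops for the same delay.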


Distributed RAM
• CLB LUT configurable as Distributed RAM
– An LUT equals 16x1 RAM
– Cascade LUTs to increase RAM size
• Synchronous write
• Asynchronous read
– Naturally, distributed RAM read is asynchronous
– Can create a synchronous read by using extra flip-flops
• Two LUTs can make
– 32 x 1 single-port RAM
– 16 x 2 single-port RAM
– 16 x 1 dual-port RAM

[Figure: one LUT = RAM16X1S (D, WE, WCLK, A0 to A3, O); two LUTs = RAM32X1S (D, WE, WCLK, A0 to A4, O), or RAM16X2S (D0, D1, WE, WCLK, A0 to A3, O0, O1), or RAM16X1D (D, WE, WCLK, A0 to A3, DPRA0 to DPRA3, SPO, DPO)]
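
The following behavioural sketch models a 16 x 1 distributed RAM with a clocked write and an unclocked read, then cascades two of them into a 32 x 1 RAM. The class names are borrowed loosely from the primitive names in the figure; using address bit A4 to choose between the two LUTs is an assumption of this example.

```python
class Ram16x1:
    """One LUT as RAM: synchronous write, asynchronous read."""

    def __init__(self):
        self.cells = [0] * 16

    def read(self, addr):
        """Asynchronous read: the output follows the address immediately."""
        return self.cells[addr & 0xF]

    def clock(self, addr, data, we):
        """Synchronous write: the cell changes only on a clock edge with WE high."""
        if we:
            self.cells[addr & 0xF] = data & 1

class Ram32x1:
    """Two cascaded LUTs; address bit 4 selects which LUT is used."""

    def __init__(self):
        self.banks = (Ram16x1(), Ram16x1())

    def read(self, addr):
        return self.banks[(addr >> 4) & 1].read(addr)

    def clock(self, addr, data, we):
        self.banks[(addr >> 4) & 1].clock(addr, data, we)

ram = Ram32x1()
ram.clock(17, 1, we=1)           # write a 1 to address 17
print(ram.read(17), ram.read(1))
```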

Xilinx Multipurpose LUT

Simplified view of a Xilinx Logic Cell

RAM Blocks and Multipliers in Xilinx FPGAs

Embedded RAM Blocks
• A lot of applications require the use of memory, so FPGAs now include relatively large chunks of embedded RAM called e-RAM or Block RAM (BRAM).
• Depending on the architecture of the component, these blocks might be positioned around the periphery of the device or organized as columns.
• These blocks can be used for a variety of purposes, such as implementing standard single- or dual-port RAMs, FIFOs, etc.

Block RAM
• Most efficient memory implementation
– Dedicated blocks of memory
• Ideal for most memory requirements
– 4 to 104 memory blocks
• 18 kbits = 18,432 bits per block (16 k without parity bits)
– Use multiple blocks for larger memories
• Builds both single and true dual-port RAMs
• Synchronous write and read (different from distributed RAM)

[Figure: Spartan-3 dual-port block RAM with Port A and Port B]


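Spartan-3 Block RAM Amounts

Block RAM can have various configurations (port aspect ratios)

[Figure: one block RAM drawn in five aspect ratios]
• 16k x 1 (addresses 0 to 16,383)
• 8k x 2 (addresses 0 to 8,191)
• 4k x 4 (addresses 0 to 4,095)
• 2k x (8+1) (addresses 0 to 2,047)
• 1024 x (16+2) (addresses 0 to 1,023)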

Block RAM Port Aspect Ratios

Single-Port Block RAM

Dual-Port Block RAM

Dual-Port Bus Flexibility
• Each port can be configured with a different data bus width
• Provides easy data width conversion without any additional logic

[Figure: RAMB4_S16_S8 dual-port block RAM. Port A: 18-bit width, 1K depth, with DIA[17:0], DOA[17:0], ADDRA[9:0], WEA, ENA, RSTA, CLKA. Port B: 9-bit width, 2K depth, with DIB[8:0], DOB[8:0], ADDRB[10:0], WEB, ENB, RSTB, CLKB.]
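
As an illustration of width conversion through a dual-port memory, the sketch below writes 16-bit words on one port and reads bytes on the other. It leaves out the parity bits, and the byte ordering (low byte at the even byte address) is an assumption chosen for the example; the class and method names are hypothetical.

```python
class DualPortRam:
    """16 Kbit of data seen as 1K x 16 on port A and 2K x 8 on port B."""

    def __init__(self):
        self.mem = bytearray(2048)   # shared storage, parity bits omitted

    def write_a(self, addr, word16):
        """Port A write: one 16-bit word occupies two consecutive bytes."""
        self.mem[2 * addr] = word16 & 0xFF             # low byte (assumed order)
        self.mem[2 * addr + 1] = (word16 >> 8) & 0xFF  # high byte

    def read_a(self, addr):
        """Port A read: reassemble the 16-bit word."""
        return self.mem[2 * addr] | (self.mem[2 * addr + 1] << 8)

    def read_b(self, addr):
        """Port B read: one byte per address, twice as many addresses."""
        return self.mem[addr]

ram = DualPortRam()
ram.write_a(3, 0xABCD)
print(hex(ram.read_b(6)), hex(ram.read_b(7)))
```

No extra logic sits between the two views: the conversion falls out of addressing the same storage with a different width.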


Two Independent Single-Port RAMs
• Added advantage of True Dual-Port
– No wasted RAM bits
• Can split a Dual-Port 16K RAM into two Single-Port 8K RAMs
– Simultaneous independent access to each RAM
• To access the lower RAM
– Tie the MSB address bit to Logic Low
• To access the upper RAM
– Tie the MSB address bit to Logic High

[Figure: RAMB4_S1_S1 dual-port block RAM used as two RAMs. Each port is 1 bit wide and 8K deep, with DI[0], DO[0], ADDR[12:0], WE, EN, RST, and CLK pins; one port is addressed with 0, ADDR[12:0] and the other with 1, ADDR[12:0].]

Embedded Multipliers
• Some functions, like multipliers, are inherently slow if they are implemented by connecting a large number of programmable logic blocks together.
• Current FPGAs incorporate special hard-wired multiplier blocks, which are typically located in close proximity to the embedded RAM blocks (arithmetic-based applications).

18 x 18 Embedded Multiplier
• Fast arithmetic functions
– Optimized to implement multiply / accumulate modules
• 18 x 18 signed multiplier
• Fully combinational
• Optional registers with CE & RST (pipeline)
• Independent from adjacent block RAM

18 x 18 Multiplier
• Embedded 18-bit x 18-bit multiplier
– 2’s complement signed operation
• Multipliers are organized in columns

[Figure: 18 x 18 multiplier with inputs Data_A (18 bits) and Data_B (18 bits) and Output (36 bits)]

Positions of Multipliers

Asynchronous 18-bit Multiplier
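
The 2's complement behaviour of an 18 x 18 signed multiplier can be sketched in a few lines. The helper names `to_signed` and `mult18x18` are illustrative; the example only models the arithmetic, not the optional pipeline registers.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a 2's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def mult18x18(data_a, data_b):
    """Multiply two 18-bit signed operands and return the 36-bit result pattern."""
    product = to_signed(data_a, 18) * to_signed(data_b, 18)
    return product & ((1 << 36) - 1)

# 0x3FFFF is -1 as an 18-bit 2's complement value
print(hex(mult18x18(0x3FFFF, 2)), to_signed(mult18x18(0x3FFFF, 2), 36))
```

Thirty-six output bits are enough for every input pair: the largest magnitude, (-2^17) x (-2^17) = 2^34, still fits in a 36-bit signed result.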

18-bit Multiplier with Register

A simple clock tree

Digital Clock Managers (DCM)
• The clock pin is usually connected to a special hard-wired function called a clock manager that generates “daughter clocks”.
• The daughter clocks may be used to drive internal clock trees or external output pins that can be used to provide clocking services to other devices on the host circuit board.
• There might be multiple clock managers supporting only a subset of features (jitter removal, frequency synthesis, …)

DCM: Jitter Removal
• In the real world, clock edges may arrive a little early or a little late.
• A fuzzy clock would result (jitter) due to the delay encountered.
• The FPGA clock manager can be used to detect and correct for this jitter and provide a “clean” daughter clock signal for use inside the device.

DCM: Frequency Synthesis
• The frequency of the clock signal being presented to the FPGA from the outside world might not be exactly what the design engineer wishes for.
• The clock manager can be used to generate daughter clocks with frequencies that are derived by multiplying or dividing the original signal.
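
As a small worked example of frequency synthesis, a daughter clock can be described as the input frequency scaled by an integer multiply and an integer divide. The function name `synthesize` and its parameters are hypothetical, and real clock managers restrict the allowed multiply and divide values, which this sketch does not model.

```python
from fractions import Fraction

def synthesize(f_in_hz, multiply=1, divide=1):
    """Daughter clock frequency = input frequency * multiply / divide."""
    return Fraction(f_in_hz) * multiply / divide

# A 50 MHz input clock scaled by 3/2
f_out = synthesize(50_000_000, multiply=3, divide=2)
print(float(f_out) / 1e6, "MHz")
```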


DCM: Phase Shifting

• Certain designs require the use of clocks that are phase shifted (delayed) with respect to each other.
• Some clock managers allow you to select from fixed phase shifts of common values such as 120° and 240° (for a three-phase clocking scheme)
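
A phase shift in degrees corresponds to a time delay of that fraction of one clock period. The sketch below converts between the two; `phase_delay_seconds` is an illustrative name, and the 100 MHz clock is just an example value.

```python
from fractions import Fraction

def phase_delay_seconds(f_hz, degrees):
    """Delay equivalent to a phase shift: (degrees / 360) of one period."""
    return Fraction(degrees, 360) / Fraction(f_hz)

# Three-phase scheme on a 100 MHz clock (10 ns period)
for degrees in (0, 120, 240):
    delay_ns = float(phase_delay_seconds(100_000_000, degrees) * 10**9)
    print(f"{degrees:3d} degrees -> {delay_ns:.3f} ns")
```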

Basic I/O Block Structure

[Figure: I/O block with three flip-flops (D, EC, Q, and SR pins), one each on the three-state control path, the output path, and the input path. Signals: Three-State, Output, Clock, Set/Reset, Direct Input, Registered Input, and an FF Enable for each flip-flop.]

IOB Functionality
• IOB provides interface between the package pins and CLBs
• Each IOB can work as uni- or bi-directional I/O
• Outputs can be forced into High Impedance
• Inputs and outputs can be registered
– advised for high-performance I/O
• Inputs can be delayed

Configurable I/O Impedances
• The signals used to connect devices on today’s circuit boards often have fast edge rates.
• In order to prevent signals reflecting back, it is necessary to apply appropriate terminating resistors to the FPGA input and output pins.
• In the past, resistors were applied as discrete components (outside the FPGA).
• Today's FPGAs allow the use of internal terminating resistors whose value can be configured by the user.

Spartan 3 Family Attributes

FPGA Nomenclature


Spartan-3 FPGA Family Members

2001 – Virtex-II FPGA Family
• Virtex-II FPGA introduced, followed by Virtex-II Pro in 2003
– 444 18x18 multipliers & 18 kbit block RAMs introduced
– Gbit serial I/O communications & PowerPC processors introduced
– Complex floating-point algorithm implementation now possible
• Virtex-II / Pro
– 44,000 logic slices
– 444 18 Kbit BRAMs
– 444 18x18 multipliers
– 2 PowerPC processors
– 20 Gbit I/O
– 1164 max user I/O

Virtex-II Pro Floorplan
• 1 to 4 PowerPCs
• 4 to 16 multi-gigabit transceivers
• 12 to 216 multipliers
• 3,000 to 50,000 logic cells
• 200k to 4M bits RAM
• 204 to 852 I/Os

[Figure: Virtex-II Pro (selection) floorplan showing logic cells, PowerPCs, and up to 16 serial transceivers at 622 Mbps to 3.125 Gbps]

Embedded Processor Cores (Hard and Soft)
• The majority of designs make use of microprocessors.
• These used to appear as discrete devices on the circuit board.
• Lately, high-end FPGAs have become available that contain one or more embedded microprocessors (referred to as microprocessor cores).
• There are two types of cores:
– A hard microprocessor core is implemented as a dedicated predefined block (two approaches)
– A soft microprocessor core is implemented by configuring a group of programmable logic blocks to act as a microprocessor.

Embedded Core (Inside)
• Xilinx and Altera tend to embed one or more microprocessor cores directly into the main FPGA fabric (PowerPC)
• In this case the design tools have to be able to take account of the presence of these blocks in the fabric (any memory used by the core is formed from the embedded RAM blocks).
• The main advantage of this scheme is the inherent speed advantage to be gained from having the processor core in intimate proximity to the FPGA fabric.


Soft Core
• As opposed to embedding a microprocessor physically into the fabric of the chip, it is possible to configure a group of programmable logic blocks to act as a microprocessor.
• Soft cores are simpler (more primitive) and slower than their hard-core counterparts.

ADVANTAGE?
1. The main advantage of this scheme is that the user need only implement a core if he/she needs it.
2. Also, the user can instantiate as many cores as they require until they run out of resources!

Virtex Architectures
• Built for high-performance applications
• Other families include
– Virtex-II Pro
– Virtex-4
– Virtex-5
• Latest family
– Virtex-6

Virtex-II Pro Architecture
• Contains embedded processors and multi-gigabit transceivers
• High-performance true dual-port RAM: 8 Mb
• SelectIO™-Ultra Technology: 1164 I/O
• Advanced FPGA logic: 99k logic cells
• XtremeDSP functionality: embedded multipliers
• RocketIO™ and RocketIO X high-speed serial transceivers: 622 Mbps to 3.125 Gbps
• PowerPC™ processors, 400+ MHz clock rate: 2
• XCITE digitally controlled impedance: any I/O
• DCM™ digital clock management: 12
• 130 nm, 9-layer copper in 300 mm wafer technology

Virtex-4 Family

Advanced Silicon Modular Block (ASMBL) Architecture: optimized for logic, embedded, and signal processing

Resource | LX | FX | SX
Logic | 14K–200K LCs | 12K–140K LCs | 23K–55K LCs
Memory | 0.9–6 Mb | 0.6–10 Mb | 2.3–5.7 Mb
DCMs | 4–12 | 4–20 | 4–8
DSP Slices | 32–96 | 32–192 | 128–512
SelectIO | 240–960 | 240–896 | 320–640
RocketIO | N/A | 0–24 Channels | N/A
PowerPC | N/A | 1 or 2 Cores | N/A
Ethernet MAC | N/A | 2 or 4 Cores | N/A

Virtex-4 Architecture
• Advanced CLBs: 200K logic cells
• XtremeDSP™ Technology slices: 256 18x18 GMACs
• Smart RAM: new block RAM/FIFO
• Xesium clocking technology: 500 MHz
• 1 Gbps SelectIO™: ChipSync™ source synch, XCITE active termination
• PowerPC™ 405 with APU interface: 450 MHz, 680 DMIPS
• Tri-mode Ethernet MAC: 10/100/1000 Mbps
• RocketIO™ multi-gigabit transceivers: 622 Mbps–10.3 Gbps

Virtex-5 Family

Virtex™-5 Platforms: optimized for logic, embedded, signal processing, and high-speed connectivity
• LX: Logic
• LXT: Logic/Serial
• SXT: DSP/Serial
• FXT: Emb./Serial

[Figure: per-platform comparison of Logic, On-chip RAM, DSP Capabilities, Parallel I/Os, Serial I/Os, and PowerPC® Processors]


Virtex-5 Architecture
• 550 MHz Clock Management Tile with DCM and PLL
• Tri-Mode 10/100/1000 Mbps Ethernet MACs
• Next Generation PowerPC® Embedded Processor
• SelectIO with ChipSync Technology and XCITE DCI
• RocketIO™ Transceiver Options
– Low-Power GTP: Up to 3.75 Gbps
– High-Performance GTX: Up to 6.5 Gbps
• PCI Express® Endpoint Block
• Most Advanced High-Performance Real 6LUT Logic Fabric
• Advanced Configuration Options
• 36Kbit Dual-Port Block RAM / FIFO with Integrated ECC
• 25x18 DSP Slice with Integrated ALU
• System Monitor Function with Built-in ADC

[Figure legend: each block is marked New or Enhanced]

The Spartan-3 Family
• Built for high volume, low-cost applications
• 18x18 bit embedded pipelined multipliers for efficient DSP
• Configurable 18K block RAMs + distributed RAM
• 4 I/O banks (Bank 0 to Bank 3), support for all I/O standards including PCI, DDR333, RSDS, mini-LVDS
• Up to eight on-chip Digital Clock Managers to support multiple system clocks

Spartan-3 Family

Based upon Virtex-II architecture, optimized for lower cost
• Smaller process = lower core voltage
– .09 micron versus .15 micron
– Vccint = 1.2V versus 1.5V
• Logic resources
– Only one-half of the slices support RAM or SRL16s (SLICEM)
– Fewer block RAMs and multiplier blocks
• Clock resources
– Fewer global clock multiplexers and DCM blocks
• I/O resources
– Fewer pins per package
– No internal 3-state buffers
– Support for different standards: new standards are 1.2V LVCMOS, 1.8V HSTL, and SSTL; default is LVCMOS, versus LVTTL

SLICEM and SLICEL
• Each Spartan™-3 CLB contains four slices
– Similar to the Virtex™-II
• Slices are grouped in pairs
– Left-hand SLICEM (Memory): LUTs can be configured as memory or SRL16
– Right-hand SLICEL (Logic): LUT can be used as logic only

[Figure: CLB with switch matrix and fast connects. Left-hand SLICEM pair: slices X0Y0 and X0Y1, with SHIFTIN and SHIFTOUT. Right-hand SLICEL pair: slices X1Y0 and X1Y1. Each pair has its own CIN to COUT carry chain.]

Multiple Domain-optimized Platforms

Spartan-3E Features
• More gates per I/O than Spartan-3
• Removed some I/O standards
– Higher-drive LVCMOS
– GTL, GTLP
– SSTL2_II
– HSTL_II_18, HSTL_I, HSTL_III
– LVDS_EXT, ULVDS
• DDR Cascade
– Internal data is presented on a single clock edge
• 16 BUFGMUXes on left and right sides
– Drive half the chip only
– In addition to eight global clocks
• Pipelined multipliers
• Additional configuration modes
– SPI, BPI
– Multi-Boot mode


Spartan-3A DSP Features

• Increased amount of block memory (BRAM)– 1512K of S3A1800 vs 648 K of S3E1600

• More XtremeDSP DSP48A slices– Replaces Embedded multiplier of Spartan-3E

• 3400A – 126 DSP48As• 1800A – 84 DSP48As

Basic Architecture 85

Spartan-3A DSP: Tuning DSP Performance

• Integrated XtremeDSP Slice
  – Application optimized capacity
  – Integrated pre-adder optimized for filters
  – 250 MHz operation, standard speed grade
  – Compatible with Virtex-DSP
• Increased memory capacity and performance
  – Also important for embedded processing, complex IP, etc.

[Figure: XtremeDSP DSP48A Slice]

DSP48 Comparison

| Function | DSP48 | DSP48E | DSP48A | Benefit |
|---|---|---|---|---|
| Multiplier | 18 x 18 | 25 x 18 | 18 x 18 | Reduces FPGA resource needs for DSP algorithms. |
| Pre-Adder | No | No | Yes | Reduces the critical path timing in FIR filter applications for better performance. Important in FIR filter construction. |
| Cascade Inputs | One | Two | One | Enables fast data path chaining of DSP48 blocks for larger filters. |
| Cascade Output | Yes | Yes | Yes | Enables fast data path chaining of DSP48 blocks for larger filters. |
| Dedicated C input | No | Yes | Yes | The C input supports many 3-input mathematical functions, such as 3-input addition and 2-input multiplication with a single addition and the very valuable rounding of multiplication away from zero. |
| Adder | 3 input 48 bit | 3 input 48 bit | 2 input 48 bit | Supports simple add and accumulate functions. |
| Dynamic Opmodes | Yes | Yes | Yes | One DSP48 can provide more than one function: multiply, multiply-add, multiply-accumulate, etc. |
| ALU Logic Functions | No | Yes | No | Similar to the ALU of a microprocessor. Enables the selection of the ALU function (add, subtract, or compare) on a clock cycle basis. |
| Pattern Detect | No | Yes | No | This feature supports convergent rounding, underflow/overflow detection for saturation arithmetic, and auto-resetting counters/accumulators. |
| SIMD ALU Support | No | Yes | No | Enables parallel ALU operations on multiple data sets. |
| Carry Signals | Carry In | Carry In & Out | Carry In & Out | Supports fast carry functions between DSP blocks. Often a speed limiting path. |

Spartan-3A Device Table

| | Spartan-3A: XC3S1400A | Spartan-3A DSP: XC3SD1800A | Spartan-3A DSP: XC3SD3400A |
|---|---|---|---|
| XtremeDSP DSP48A Slices | - | 84 | 126 |
| Dedicated Multipliers | 32 | DSP48As | DSP48As |
| Block RAM Blocks | 32 | 84 | 126 |
| Block RAM (Kb) | 576 | 1,512 | 2,268 |
| Distributed RAM (Kb) | 176 | 260 | 373 |
| FFs/LUTs | 22,528 | 33,280 | 47,744 |
| Logic Cells | 25,344 | 37,440 | 53,712 |
| DCMs | 8 | 8 | 8 |
| Max Diff I/O Pairs | 227 | 227 | 213 |
| CS484 19x19mm (0.8mm pitch) | - | 309 | 309 |
| *FG676 27x27mm (1.0mm pitch) | 502 | 519 | 469 |
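The two block RAM rows of the device table are related by simple arithmetic: capacity in Kb is the block count times the size of one block. The Python sketch below checks this for the three columns. It assumes an 18 Kb block size, which the table does not state explicitly, and the helper name `bram_kbits` is illustrative only.

```python
# Block RAM figures copied from the device table.
BLOCK_KBITS = 18  # assumed size of one block RAM, in Kb

devices = {
    "XC3S1400A": {"blocks": 32, "kbits": 576},
    "XC3SD1800A": {"blocks": 84, "kbits": 1512},
    "XC3SD3400A": {"blocks": 126, "kbits": 2268},
}

def bram_kbits(blocks: int, block_kbits: int = BLOCK_KBITS) -> int:
    """Total block RAM capacity in Kb for a given number of blocks."""
    return blocks * block_kbits

for name, row in devices.items():
    print(name, bram_kbits(row["blocks"]), "Kb computed,", row["kbits"], "Kb in table")
```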


Latest Families: Architecture Alignment
Enables IP Portability, Protects Design Investments

[Figure: Virtex-6 FPGAs (760K logic cell device) and Spartan-6 FPGAs (150K logic cell device) side by side, with their common resources highlighted. Blocks labeled in the figure: LUT-6 CLB, DSP Slices, BlockRAM, High-performance Clocking, Parallel I/O, FIFO Logic, HSS Transceivers*, PCIe® Interface, Tri-mode EMAC, System Monitor, Hardened Memory Controllers, 3.3 Volt compatible I/O.]

*Optimized for target application in each family


Addressing the Broad Range of Technical Requirements

[Figure: market size plotted against application market segments (+ 100s more), with each family positioned by focus:]
• Spartan-6 LX: Lowest cost logic + DSP
• Spartan-6 LXT: Lowest cost logic + high-speed serial
• Virtex-6 LXT: High logic density + serial connectivity
• Virtex-6 SXT: DSP + logic + serial connectivity
• Virtex-6 HXT: Ultra high-speed serial connectivity + logic


Designers Eccentrics

• Higher System Performance
  – More design margin to simplify designs
  – Higher integrated functionality
• Lower System Cost
  – Reduce BOM
  – Implement design in a smaller device & lower speed-grade
• Lower Power
  – Help meet power budgets
  – Eliminate heat sinks & fans
  – Prevent thermal runaway


Virtex-6 Family


Virtex® Product & Process Evolution

Delivering Balanced Performance, Power, and Cost

[Figure: timeline of six Virtex generations and their process nodes: Virtex (220-nm), Virtex-E (180-nm), Virtex-II (150-nm), Virtex-II Pro (130-nm), Virtex-4 (90-nm), Virtex-5 (65-nm), Virtex-6 (40-nm).]


Strong Focus on Power Reduction

• Static Power Reduction
  – Higher distribution of low leakage transistors
• Dynamic Power Reduction
  – Reduced capacitance through device shrink
• Reduced Core Voltage Devices Lower Overall Power
  – VCCINT = 0.9V option allows power / performance tradeoff
• I/O Power Improvements
  – Dynamic termination
• System Monitor
  – Allows sophisticated monitoring of temperature and voltage

Up to 50% Power Reduction vs. Previous Generation

Virtex-6 Logic Fabric

• Virtex-6 Configurable Logic Block (CLB)
  – Each CLB contains two slices
  – Each slice contains four 6-input Lookup Tables (6LUT)
    • Slices that implement logic functions (slice_l)
    • Slices for memories and shift registers (slice_m)
• LUT6 implements
  – All functions of up to 6 variables
  – Two functions of up to 5 variables each
  – Shift registers up to 32 stages long
  – Memories of 64 bits
• Multiple configurations within a slice

Power Consumption Benefits
• Shift register mode greatly reduces power consumption over FF implementation

Performance Benefits
• Increased ratio of slice_m: memories available closer to the source or target logic

Cost Benefits
• Can pack logic and memory functions more efficiently

[Figure: CLBs, each made up of slices that contain four LUTs.]
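To make the LUT6 description concrete: a 6-input LUT can be modeled as a 64-entry truth table addressed by its inputs, which is also why the same cell can serve as a 64-bit memory. The Python below is an illustrative software model only; `make_lut6` and `lut6_read` are hypothetical helper names, not vendor primitives.

```python
from typing import Callable

def make_lut6(func: Callable[[int, int, int, int, int, int], int]) -> int:
    """Build the 64-bit truth table for a 6-input Boolean function."""
    init = 0
    for addr in range(64):
        bits = [(addr >> i) & 1 for i in range(6)]  # bits[0] is input 0
        if func(*bits) & 1:
            init |= 1 << addr
    return init

def lut6_read(init: int, *inputs: int) -> int:
    """Look up one output bit: the six inputs form the table address."""
    addr = sum((bit & 1) << i for i, bit in enumerate(inputs))
    return (init >> addr) & 1

# A 6-input AND and a 6-input XOR (parity) each fit in a single LUT6.
and6 = make_lut6(lambda a, b, c, d, e, f: a & b & c & d & e & f)
xor6 = make_lut6(lambda a, b, c, d, e, f: a ^ b ^ c ^ d ^ e ^ f)

print(hex(and6))  # only the entry at address 63 is set
print(lut6_read(xor6, 1, 0, 1, 1, 0, 0))
```

Any function of up to six variables maps to some 64-bit table in this way; only the table contents change, not the hardware.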


Higher DSP Performance

• Most advanced DSP architecture
  – New optional pre-adder for symmetric filters
  – 25x18 multiplier
    • High resolution filters
    • Efficient floating point support
  – ALU-like second stage enables mapping of advanced operations
    • Programmable op-code
    • SIMD support
    • Addition / Subtraction / Logic functions
  – Pattern detector
• Lowest power consumption
• Highest DSP slice capacity
  – Up to 2K DSP Slices
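The pre-adder item refers to a standard trick for symmetric FIR filters: because coefficient k equals coefficient N-1-k, the two samples that share a coefficient can be added first and multiplied once, roughly halving the number of multipliers. The Python sketch below illustrates the idea with integer data; the function names and sample values are made up for illustration and do not describe the DSP slice's actual datapath.

```python
def fir_direct(coeffs, window):
    """Reference FIR output: one multiply per tap."""
    return sum(c * x for c, x in zip(coeffs, window))

def fir_preadd(coeffs, window):
    """Symmetric FIR output: add mirrored samples first, then multiply once."""
    n = len(coeffs)
    assert coeffs == coeffs[::-1], "coefficients must be symmetric"
    acc = 0
    for k in range(n // 2):
        acc += coeffs[k] * (window[k] + window[n - 1 - k])  # pre-adder, then multiplier
    if n % 2:  # odd length: the centre tap has no partner
        acc += coeffs[n // 2] * window[n // 2]
    return acc

coeffs = [1, 3, 5, 3, 1]   # symmetric 5-tap filter
window = [4, -2, 7, 0, 9]  # five most recent samples
print(fir_direct(coeffs, window), fir_preadd(coeffs, window))
```

In this example the 5-tap filter needs three multiplications instead of five, and both functions return the same value.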


Virtex®-6 LXT / SXT FPGAs


Spartan-6 Family

Spartan-6
Delivering the Optimal Balance of Cost, Power & Performance

• Next Generation 45nm Spartan Family
  – Increased performance & density
  – Evolutionary feature enhancements
  – Dramatic cost & power reductions
• Two Silicon Platforms
  – LX: Cost optimized Logic, Memory
  – LXT: LX features plus High-Speed Serial Connectivity
  – More unified & integrated with Virtex


Spartan-6 Logic Evolution
Higher Performance, Increased Utilization

• Modified Virtex 6-input LUT
  – 4 additional flip-flops per slice
  – Higher utilization for register intensive designs
• Efficient & Capable
  – Logic
  – Arithmetic functions
  – Distributed RAM & shift registers
  – Interconnect
• Up to 25% Higher Performance

[Figure: Spartan-3A Series & Earlier: great general-purpose logic, LUT / FF pair (4LUT). Spartan-6: NEW efficient design, 6-input LUT & 2nd flip-flop for higher utilization, LUT / dual FF pair (6LUT).]


Spartan-6 CLB Logic Slices

• SliceM (25%): LUT6, 8 Registers, Carry Logic, Wide Function Muxes, Distributed RAM / SRL logic
• SliceL (25%): LUT6, 8 Registers, Carry Logic, Wide Function Muxes
• SliceX (50%): LUT6 Optimized for Logic, 8 Registers

Slice mix chosen for the optimal balance of Cost, Power & Performance


Spartan-6 Lowest Total Power

• Static power reductions
  – Process & architectural innovations
• Dynamic power reduction
  – Lower node capacitance & architectural innovations
• More hard IP functionality
  – Integrated transceivers & other logic reduces power
  – Hard IP uses less current & power than soft IP
• Lower IO power
• Low power option -1L reduces power even further
• Fewer supply rails reduces power


Spartan-6 Hard Memory Controller

• New Hard Block Memory Controller
  – Up to 4 controllers per device
• Why a Hard Memory Block?
  – Very common design component
  – Multiple customer benefits

| Customer Requests | Spartan-6 Hard Block Memory Controller Benefits |
|---|---|
| Higher performance | Up to 800 Mbps |
| Lower cost | Saves soft logic, smaller die |
| Lower power | Dedicated logic |
| Easier designs | Timing closure no longer an issue; configurable MultiPort user interface; CoreGen/MIG wizard & EDK support |


Memory Controller

• Only low cost FPGA with a “hard” memory controller
• Guaranteed memory interface performance providing
  – Reduced engineering & board design time
  – DDR, DDR2, DDR3 & LP DDR support
  – Up to 12.8 Gbps bandwidth for each memory controller
• Automatic calibration features
• Multiport structure for user interface
  – Six 32-bit programmable ports from fabric
  – Controller interface to 4, 8 or 16 bit memory devices

[Figure: Spartan-6 memory controller connected to DRAM (DDR, DDR2, DDR3, LP DDR), shown alongside other external memory types: SRAM, FLASH, EEPROM.]
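The per-controller bandwidth follows from the per-pin data rate and the interface width. The small Python check below assumes the 800 Mbps per-pin rate quoted for the hard memory controller and the 4, 8 and 16 bit interface widths listed above; the function name is illustrative.

```python
def peak_bandwidth_mbps(rate_mbps_per_pin: float, data_width_bits: int) -> float:
    """Peak interface bandwidth = per-pin data rate x number of data pins."""
    return rate_mbps_per_pin * data_width_bits

for width in (4, 8, 16):
    mbps = peak_bandwidth_mbps(800, width)
    print(f"x{width}: {mbps / 1000:.1f} Gbps")
```

At 800 Mbps per pin, the x16 interface works out to 12,800 Mbps, that is 12.8 Gbps. This is a peak figure; sustained throughput depends on the access pattern.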


Integrated DSP Slice

• 250 MHz implementation
  – Fast multiplier & 48 bit adder
  – ASIC-like performance
• Input and output registers for higher speed

Optimizes FIR filter applications

[Figure: XtremeDSP DSP48A1 Slice]


Better, More BRAM

• More Block RAMs
  – 2x higher BRAM to Logic Cell ratio than Spartan-3A platform
• More port flexibility
  – 18K can be split into two 9K BRAM blocks and can be independently addressed
• Improves buffering, caching & data storage
  – Excellent for embedded processing, communication protocols
  – Enables DSP blocks to provide more efficient video and surveillance algorithms
• Lower Static Power

[Figure: one 18K BRAM OR two 9K BRAMs]


Compare to Spartan-3A
Twice the Capabilities, Half the Power, Hard Blocks!

| Feature | Extended Spartan-3A (90nm) | Spartan-6 (45nm) |
|---|---|---|
| Logic Cells | Up to 55K | Up to 150K |
| LUT Design | 4-input LUT + FF | 6-input LUT + 2FF |
| Block RAM (Mbit) | Up to 2 Mbit | Up to 5 Mbit |
| Transceiver Count / Speed | No | Up to 8 / Up to 3.125 Gbps |
| Voltage Scaling | No (1.2V only) | Yes (1.2V, 1.0V) |
| Static Power (typ mW) | 11 mW (smallest density) | Up to 60% less! |
| Memory Interface | 400 Mbps | DDR3 800 Mbps |
| Max Differential IO | 640 Mbps | 1050 Mbps |
| Multipliers/DSP | Up to 126 Multipliers / DSP | Up to 184 DSP48 Blocks |
| Memory Controllers | No | Up to 4 Hard Blocks |
| Clock Management | DCM Only | DCM & PLL |
| PCI Express Endpoint | No | Yes, Gen 1 |
| Security | Device DNA Only | Device DNA & AES |


Spartan-6 LX / LXT FPGAs

** All memory controllers support a x16 interface, except in the CS225 package, where only x8 is supported


FPGA Design Flow

Design process (1)

Specification (example): Design and implement a simple unit that speeds up encryption with an RC5-like cipher with a fixed key set on an 8031 microcontroller. Unlike in experiment 5, this time your unit has to be able to perform an encryption algorithm by itself, executing 32 rounds…

library IEEE;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;

entity RC5_core is
    port(
        clock, reset, encr_decr: in std_logic;
        data_input: in std_logic_vector(31 downto 0);
        data_output: out std_logic_vector(31 downto 0);
        out_full: in std_logic;
        key_input: in std_logic_vector(31 downto 0);
        key_read: out std_logic  -- no semicolon after the last port
    );
end RC5_core;

[Figure: design flow. Each design step is paired with its verification step.]
• Specification
• Verilog description (Your Verilog Source Files): Functional simulation
• Synthesis: Post-synthesis simulation

Design process (2)

• Implementation (Mapping, Placing & Routing): Timing simulation
• Configuration: On chip testing

Design Process control from Active-HDL

Logic Synthesis

VHDL description:

architecture MLU_DATAFLOW of MLU is

    signal A1: STD_LOGIC;
    signal B1: STD_LOGIC;
    signal Y1: STD_LOGIC;
    signal MUX_0, MUX_1, MUX_2, MUX_3: STD_LOGIC;

begin
    A1 <= A when (NEG_A='0') else
          not A;
    B1 <= B when (NEG_B='0') else
          not B;
    Y  <= Y1 when (NEG_Y='0') else
          not Y1;

    MUX_0 <= A1 and B1;
    MUX_1 <= A1 or B1;
    MUX_2 <= A1 xor B1;
    MUX_3 <= A1 xnor B1;

    with (L1 & L0) select
        Y1 <= MUX_0 when "00",
              MUX_1 when "01",
              MUX_2 when "10",
              MUX_3 when others;

end MLU_DATAFLOW;

[Figure: logic synthesis turns the VHDL description into a circuit netlist.]
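For functional simulation it helps to have a software reference for the MLU dataflow description. The Python function below is a sketch that mirrors the VHDL signal by signal. The name `mlu` and the use of 0/1 integers in place of STD_LOGIC values are assumptions made for illustration.

```python
def mlu(a: int, b: int, neg_a: int, neg_b: int, neg_y: int, l1: int, l0: int) -> int:
    """Bit-level reference model of the MLU_DATAFLOW architecture."""
    a1 = a if neg_a == 0 else a ^ 1      # optional input inversion (A1)
    b1 = b if neg_b == 0 else b ^ 1      # optional input inversion (B1)
    mux = {
        (0, 0): a1 & b1,                 # MUX_0: and
        (0, 1): a1 | b1,                 # MUX_1: or
        (1, 0): a1 ^ b1,                 # MUX_2: xor
        (1, 1): (a1 ^ b1) ^ 1,           # MUX_3: xnor
    }
    y1 = mux[(l1, l0)]                   # selected by L1 & L0
    return y1 if neg_y == 0 else y1 ^ 1  # optional output inversion (Y)

# NAND: select the AND function and invert the output.
nand_table = [mlu(a, b, 0, 0, 1, 0, 0) for a in (0, 1) for b in (0, 1)]
print(nand_table)
```

A testbench can compare the simulated outputs against this model for all 128 input combinations.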


Synthesis Tools

… and others

XST

Features of synthesis tools

• Interpret RTL code
• Synplify Pro: Produces synthesized circuit netlist in a standard EDIF (.edf) format
  – Can optionally produce .VHM (VHDL code merged into one) file for post-synthesis simulation
• XST: Produces synthesized circuit netlist in NGC format
• Netlist is composed of gates in the particular Xilinx implementation library
  – http://toolbox.xilinx.com/docsan/xilinx9/books/manuals.pdf has information on libraries
• Give preliminary performance estimates
• Some can display circuit schematics corresponding to EDIF netlist

Timing report after synthesis

Performance Summary
*******************

Worst slack in design: -0.924

                  Requested   Estimated   Requested   Estimated             Clock      Clock
Starting Clock    Frequency   Frequency   Period      Period      Slack     Type       Group
------------------------------------------------------------------------------------------------------
exam1|clk         85.0 MHz    78.8 MHz    11.765      12.688      -0.924    inferred   Inferred_clkgroup_0
System            85.0 MHz    86.4 MHz    11.765      11.572      0.193     system     default_clkgroup
======================================================================================================
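The columns of the synthesis timing report are tied together by two relations: period (ns) = 1000 / frequency (MHz), and slack = requested period minus estimated period. The short Python sketch below applies them to the numbers in the report; the function names are illustrative.

```python
def period_ns(freq_mhz: float) -> float:
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / freq_mhz

def slack_ns(requested_period: float, estimated_period: float) -> float:
    """Positive slack: constraint met. Negative slack: constraint missed."""
    return requested_period - estimated_period

print(round(period_ns(85.0), 3))           # requested period for 85.0 MHz
print(round(slack_ns(11.765, 12.688), 3))  # exam1|clk: negative, constraint missed
print(round(slack_ns(11.765, 11.572), 3))  # System: positive, constraint met
```

Computed from the rounded periods, the exam1|clk slack comes out as -0.923 rather than the reported -0.924. The difference is most likely because the tool works with unrounded values, though the report itself does not say so.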

Implementation

• After synthesis the entire implementation process is performed by FPGA vendor tools

Mapping

[Figure: logic mapped onto lookup tables LUT0 to LUT5 and flip-flops FF1 and FF2.]


Placing

[Figure: mapped logic placed into CLB slices on the FPGA.]

Routing

[Figure: programmable connections routed between the placed slices on the FPGA.]

Map report header

Release 7.1.03i Map H.41
Xilinx Mapping Report File for Design 'exam1'

Design Information
------------------
Command Line   : c:\Xilinx\bin\nt\map.exe -p 2S200FG256-6 -o map.ncd -pr b -k 4 -cm area -c 100 -tx off exam1.ngd exam1.pcf
Target Device  : xc2s200
Target Package : fg256
Target Speed   : -6
Mapper Version : spartan2 -- $Revision: 1.26.6.4 $
Mapped Date    : Wed Nov 02 11:15:15 2005

Map report

Design Summary
--------------
Number of errors:      0
Number of warnings:    0
Logic Utilization:
  Number of Slice Flip Flops:                        144 out of 4,704    3%
  Number of 4 input LUTs:                            173 out of 4,704    3%
Logic Distribution:
  Number of occupied Slices:                         145 out of 2,352    6%
  Number of Slices containing only related logic:    145 out of   145  100%
  Number of Slices containing unrelated logic:         0 out of   145    0%
    *See NOTES below for an explanation of the effects of unrelated logic
Total Number 4 input LUTs:                           210 out of 4,704    4%
  Number used as logic:                              173
  Number used as a route-thru:                         5
  Number used as 16x1 RAMs:                           32
Number of bonded IOBs:                                74 out of   176   42%
Number of GCLKs:                                       1 out of     4   25%
Number of GCLKIOBs:                                    1 out of     4   25%
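The percentages in the map report are resource counts divided by the device totals, and the total LUT count is the sum of the three usage categories. The Python sketch below reproduces them. It assumes the tool truncates to a whole percent, which matches every row shown (plain rounding would give 4% for 173 out of 4,704); the report does not document its rounding, and the helper name is illustrative.

```python
def utilization_pct(used: int, available: int) -> int:
    """Whole-number utilization percentage, truncated (assumed rounding mode)."""
    return 100 * used // available

rows = {
    "Slice Flip Flops": (144, 4704),
    "4 input LUTs (logic)": (173, 4704),
    "Occupied Slices": (145, 2352),
    "Total 4 input LUTs": (173 + 5 + 32, 4704),  # logic + route-thru + 16x1 RAM
    "Bonded IOBs": (74, 176),
    "GCLKs": (1, 4),
}

for name, (used, available) in rows.items():
    print(f"{name}: {used} out of {available} = {utilization_pct(used, available)}%")
```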

Place & route report

Timing Score: 0

Asterisk (*) preceding a constraint indicates it was not met.
This may be due to a setup or hold violation.

--------------------------------------------------------------------------------
  Constraint                                | Requested  | Actual     | Logic
                                            |            |            | Levels
--------------------------------------------------------------------------------
  TS_clk = PERIOD TIMEGRP "clk" 11.765 ns   | 11.765ns   | 11.622ns   | 13
  HIGH 50%                                  |            |            |
--------------------------------------------------------------------------------
  OFFSET = OUT 11.765 ns AFTER COMP "clk"   | 11.765ns   | 11.491ns   | 1
--------------------------------------------------------------------------------
  OFFSET = IN 11.765 ns BEFORE COMP "clk"   | 11.765ns   | 11.442ns   | 2
--------------------------------------------------------------------------------

Post layout timing report

Timing summary:
---------------

Timing errors: 0  Score: 0

Constraints cover 42912 paths, 0 nets, and 1038 connections

Design statistics:
   Minimum period: 11.622ns (Maximum frequency: 86.044MHz)
   Minimum input required time before clock: 11.442ns
   Minimum output required time after clock: 11.491ns


Post-place-and-route simulation

• After place-and-route has been performed, you can run a post-place-and-route simulation
  – Now have real timing information!
  – Also can do static timing analysis: shows the worst case critical path in the circuit

Configuration

• Once a design is implemented, you must create a file that the FPGA can understand
  – This file is called a bit stream: a BIT file (.bit extension)
• The BIT file can be downloaded directly to the FPGA, or can be converted into a PROM file which stores the programming information

The Design Warrior’s Guide to FPGAsDevices, Tools, and Flows. ISBN 0750676043

Copyright © 2004 Mentor Graphics Corp. (www.mentor.com)

Configuration of SRAM based FPGAs

System Gates vs. Real Gates

• One common metric used to measure the size of a device in the ASIC world is that of equivalent gates (e-gate)
• Convention used:
  • A 2-input NAND function to represent one equivalent gate.
  • An equivalent gate consists of an arbitrary number of transistors.
• Different vendors provide different functions in their cell libraries, where each implementation of each function requires a different number of transistors (difficult to compare capacity/complexity)
• Solution: Assign each function an equivalent gate value and sum all these values.
• How can we establish a basis for comparison between FPGAs and ASICs?
• Can an ASIC of 500,000 equivalent gates that needs to be migrated into an FPGA fit into a particular FPGA?

FPGAs: System Gates

• System Gates: A 4-input LUT can be used to represent anywhere between one and more than twenty 2-input primitive logic gates.
• Rule of thumb?
  • Divide the system gates value by three, so a three million FPGA system gates would equate to one million ASIC equivalent gates!!
• However, to make comparisons between two different implementations on an FPGA (i.e. Floating point adder vs. Fixed point adder) designers should use the resources available in an FPGA:
  • Number of 4-input LUTs used
  • Number of embedded multipliers
  • Number of embedded RAM blocks
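The rule of thumb can be written as a one-line conversion. The Python sketch below applies it in both directions, including the migration question about an ASIC of 500,000 equivalent gates. The factor of three is the slide's rough estimate, not a vendor specification, and the function names are illustrative.

```python
SYSTEM_GATES_PER_ASIC_GATE = 3  # rule-of-thumb ratio from the slide

def asic_equivalent_gates(system_gates: int) -> int:
    """Approximate ASIC equivalent gates for an FPGA system-gate count."""
    return system_gates // SYSTEM_GATES_PER_ASIC_GATE

def system_gates_needed(asic_gates: int) -> int:
    """Approximate FPGA system gates needed to hold an ASIC design."""
    return asic_gates * SYSTEM_GATES_PER_ASIC_GATE

print(asic_equivalent_gates(3_000_000))  # FPGA with three million system gates
print(system_gates_needed(500_000))      # ASIC of 500,000 equivalent gates
```

By this estimate the 500,000-gate ASIC needs an FPGA of roughly 1.5 million system gates. For a real migration the LUT, multiplier and RAM block counts listed above give a more reliable answer.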

State-of-the-Art FPGAs

• 65-90 nm process on 300 mm wafers
  • Lower cost per function (LUT + register)
  • Smaller and faster transistors: Higher speed
• System speed up to 500 MHz
  • Mainly through smart interconnects, clock management, dedicated circuits, flexible I/O.
  • Integrated transceivers running at 10 Gigabits/sec
• More Logic and Better Features:
  • >100,000 LUTs & flip-flops
  • >200 embedded RAMs, and the same number of 18 x 18 multipliers
  • 1156 pins (balls) with >800 GP I/O
  • 50 I/O standards, incl. LVDS with internal termination
  • 16 low-skew global clock lines
  • Multiple clock management circuits
  • On-chip microprocessor(s) and multi-Gbps transceivers


Latest Devices: Capacity & Features

Xilinx Virtex-5
• 65nm process
• Up to 960 I/Os
• >200000 logic cells
• Up to 552 18kb block RAMs (~10Mb RAM)
• 450 DSP slices (18x18 multiplier-accumulator)
• 20 digital clock managers (DCM)
• 24 high-speed serial transceivers (622Mb/s to 11.1Gb/s)
• Up to four PowerPC 405 cores

Altera Stratix-II
• 90nm process
• Up to 1170 I/Os
• 179000 logic elements
• 9.6Mb embedded RAM
• 96 DSP blocks: 380 18x18 multipliers
• 12 PLLs
• Serial I/O up to 1Gb/s
• No hard processor cores

FPGAs Becoming More Attractive

[Figure: capacity, speed, and price plotted by year from 1/91 to 1/99. Source: Xilinx]
• 21 X Bigger
• 5.5 X Faster
• 50 X Less Expensive

FPGA Shortcomings

• Circuit Delay
  • Delay increases due to programmable switches in the FPGA routing architecture
• Area
  • Configuration cells and programmable resources incur substantial area penalty
• Power
  • Typically not suited for low power applications

[Figure: ASIC vs. FPGA compared on performance, cost, and time to market, with an annotation "Need to improve" on the FPGA side.]

Conclusion

• FPGAs are the main enabler of Reconfigurable Computing Systems
• FPGAs fill the gap between Instruction Set Processors (GPs) and ASICs.
  – Advantages: Flexible, programmable
  – Disadvantages: Power dissipation, performance w.r.t. ASIC
• Applicability of FPGAs relies on CAD tools provided by different vendors such as Xilinx and Altera
• RCS can be realized with several technologies:
  – FPGAs: Fine/Medium Grain
  – Coarse Grain Reconfigurable Architectures: CGRAs