A Survey of Fault Tolerant Methodologies for FPGAs


Gökhan Kabukcu, 2006703357

Outline

Introduction to FPGAs

Device-Level Fault Tolerance Methods

Configuration-Level Fault Tolerance Methods

Comparison of Methodologies

Conclusion

Introduction (FPGA)

A field programmable gate array (FPGA) is a semiconductor device containing programmable logic components and programmable interconnects

Consists of regular arrays of programmable logic blocks (PLBs) and a programmable routing matrix

Configuration of the FPGA includes: the functionality of the FPGA; which PLBs will be used; the functionality of the PLBs; which wire segments will be used for connecting PLBs

Introduction (FPGA)

PLBs are multi-input, multi-output circuits and allow: sequential designs; combinational designs

PLBs include: Look Up Tables (LUTs, or small ROMs); multiplexers; flip-flops

Introduction (FPGA)

Look Up Tables (LUTs): 4-input, 1-output units

Can be used as: RAM; ROM; shift register; functional unit

Configured by a 16-bit “INIT” value

Introduction (FPGA)

An Example: y=(x1+x2)*x3+x4 (here + is logical OR and * is logical AND)

Create the truth table

Assign “INIT” to the LUT

Since there are 4 inputs and 1 output, 1 LUT is enough to represent the equation

The LUT can be put into any PLB in the FPGA

x1  x2  x3  x4  y
0   0   0   0   0
0   0   0   1   1
0   0   1   0   0
0   0   1   1   1
0   1   0   0   0
0   1   0   1   1
0   1   1   0   1
0   1   1   1   1
1   0   0   0   0
1   0   0   1   1
1   0   1   0   1
1   0   1   1   1
1   1   0   0   0
1   1   0   1   1
1   1   1   0   1
1   1   1   1   1
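The "create truth table, assign INIT" steps can be sketched in a few lines. This is a minimal illustration, not a vendor tool flow: the function name `lut_init` is hypothetical, and the bit ordering (bit i of INIT holds the output of row i, with x1 as the most significant address bit) is an assumption, since the slides do not state one.

```python
def lut_init(func, n_inputs=4):
    """Pack the truth table of func into an integer INIT value.

    Assumption: bit i of INIT is the output for the row whose binary
    value is i, with the first input (x1) as the most significant bit.
    """
    init = 0
    for row in range(2 ** n_inputs):
        # Unpack the row index into input bits, x1 first.
        bits = [(row >> (n_inputs - 1 - k)) & 1 for k in range(n_inputs)]
        if func(*bits):
            init |= 1 << row
    return init


def y(x1, x2, x3, x4):
    # y = (x1 + x2) * x3 + x4, with + as OR and * as AND
    return ((x1 | x2) & x3) | x4


init_y = lut_init(y)
print(f"INIT = 0x{init_y:04X}")  # 0xEEEA under the assumed bit ordering
```

A different address-pin ordering would permute the bits of INIT, so the hex value is only meaningful together with the stated ordering.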

Introduction (FPGA)

Another Example: y=(x1+x2)*x3+x4, z=y*x5

Create the truth tables

Assign “INIT”s to the LUTs

Since there are 5 inputs and 1 output, 2 LUTs are needed to represent the equations

The LUTs can be put into any PLBs in the FPGA

A1 and A0 are “don’t care”s

Truth table for y:

x1  x2  x3  x4  y
0   0   0   0   0
0   0   0   1   1
0   0   1   0   0
0   0   1   1   1
0   1   0   0   0
0   1   0   1   1
0   1   1   0   1
0   1   1   1   1
1   0   0   0   0
1   0   0   1   1
1   0   1   0   1
1   0   1   1   1
1   1   0   0   0
1   1   0   1   1
1   1   1   0   1
1   1   1   1   1

Truth table for z:

y   x5  z
0   0   0
0   1   0
1   0   0
1   1   1
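A rough sketch of how the two LUTs cascade is given below. It assumes the same bit ordering as above (first input is the most significant address bit) and assumes that A1 and A0 are the two unused low address pins of the second LUT; neither detail is confirmed by the slides. The helper `lut_read` and both INIT constants are illustrative only.

```python
def lut_read(init, *inputs):
    """Return the INIT bit addressed by the inputs (first input is the MSB)."""
    address = 0
    for bit in inputs:
        address = (address << 1) | bit
    return (init >> address) & 1


INIT_Y = 0xEEEA  # y = (x1 + x2) * x3 + x4, address pins x1 x2 x3 x4
INIT_Z = 0xF000  # z = y * x5, address pins y x5 A1 A0 (A1, A0 unused)


def z(x1, x2, x3, x4, x5, a1=0, a0=0):
    # The first LUT computes y, the second LUT combines y with x5.
    y = lut_read(INIT_Y, x1, x2, x3, x4)
    return lut_read(INIT_Z, y, x5, a1, a0)
```

Because the output of the second LUT does not depend on A1 and A0, its INIT simply repeats the same value for all four combinations of those pins.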

Introduction (FPGA)

An example of a full design on an FPGA

Fault Tolerance

Device-Level Fault Tolerance:

Attempts to deal with faults at the level of FPGA hardware

Selects redundant HW to replace the faulty one

Solution with extra HW resources

Configuration-Level Fault Tolerance:

Tolerates faults at the level of FPGA configuration

When a circuit is placed, fault-free resources are selected

Status of the resources is considered each time a circuit is placed-and-routed

Solution with extra reconfiguration time

Device-Level FT Methods(1) Extra Rows

One extra spare row is added

Selection Logic is added to bypass the defective row

Vertical wire segments are increased by one row

Faults in one row can be tolerated

More than one spare row is needed to tolerate faults in multiple rows
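The effect of the selection logic can be sketched as a mapping from logical rows to fault-free physical rows. This is only a software model under assumed names (`map_rows` is hypothetical); on the device the bypass is done in hardware.

```python
def map_rows(n_logical_rows, n_spares, faulty_rows):
    """Map each logical row to a fault-free physical row, skipping faulty ones.

    Returns a list of physical row indices, or None when there are more
    faulty rows than spare rows.
    """
    physical = [r for r in range(n_logical_rows + n_spares)
                if r not in faulty_rows]
    if len(physical) < n_logical_rows:
        return None
    return physical[:n_logical_rows]


print(map_rows(4, 1, {2}))     # row 2 is bypassed: [0, 1, 3, 4]
print(map_rows(4, 1, {1, 2}))  # one spare cannot cover two rows: None
```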

Device-Level FT Methods(2) Reconfiguration Network

Four architectural changes: additional routing resources (bypass lines); Reconfiguration Memory to store locations of faulty resources; on-chip circuitry for reconfiguration routing; additional column of PLBs

Device-Level FT Methods(2) Reconfiguration Network

Test and identify faulty resources

Create fault map

Load map into Reconfiguration Memory

On-board router avoids faulty resources

The network is constructed by shifting all PLBs in the fault-containing row towards the right

Method can tolerate 1 fault in each row if there is one extra spare column.
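The row-shifting step can be modelled as follows. This is a simplified sketch that assumes the fault map is a set of (row, column) pairs and that only the PLBs at or to the right of the fault move; `reroute` and its data layout are hypothetical, not the on-chip router.

```python
def reroute(design, fault_map):
    """Place a W-column design on a (W+1)-column array with one spare column.

    design:    list of rows, each a list of W PLB functions
    fault_map: set of (row, col) positions of faulty physical PLBs
    Returns the physical rows, with None marking an unused position.
    """
    placed = []
    for r, row in enumerate(design):
        faults = sorted(c for (fr, c) in fault_map if fr == r)
        if len(faults) > 1:
            raise ValueError(f"row {r} has more than one fault")
        if faults:
            f = faults[0]
            # PLBs left of the fault stay; the rest shift one position right.
            placed.append(row[:f] + [None] + row[f:])
        else:
            placed.append(row + [None])  # spare column stays unused
    return placed


print(reroute([["a", "b", "c"], ["d", "e", "f"]], {(0, 1)}))
```

The `ValueError` branch reflects the stated limit: with a single spare column, a second fault in the same row cannot be absorbed.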

Device-Level FT Methods(3) Self-Repairing Architecture

Sub-arrays of PLBs

Routers between sub-arrays

Extra columns of PLBs

PLBs constantly test themselves

If a fault is detected: the column of the affected PLB is shifted one position to the right; the inter-array routers are adjusted

Area overhead of this method is significant

If there is 1 spare column and N sub-arrays in the vertical direction, the method can tolerate N faults at a time

Device-Level FT Methods(4) Block-Structured Architecture

Goal: tolerate larger and denser patterns of defects efficiently

Blocks of PLBs

FPGA is configured by a loading arm

The block at the end of the loading arm is configured

Device-Level FT Methods(4) Block-Structured Architecture

A block is selected by the loading arm and tested

If the test is passed, it is configured, otherwise designated as faulty

Loading arm configures blocks one by one

If the arm cannot extend any further in a path, it’s retracted by one block

Fault tolerance is provided by redundant rows and/or columns

Area overhead is significant
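The extend, test, and retract behaviour of the loading arm resembles a depth-first traversal with backtracking. The sketch below models it that way; the grid representation, the neighbour order, and the name `configure_blocks` are assumptions made for illustration and are not taken from the architecture itself.

```python
def configure_blocks(block_ok, start=(0, 0)):
    """Walk a loading arm over a grid of blocks.

    block_ok[r][c] is True when the block passes its test.
    Returns (configured blocks in order, set of blocks marked faulty).
    """
    rows, cols = len(block_ok), len(block_ok[0])
    configured, faulty = [], set()
    tested = {start}
    if not block_ok[start[0]][start[1]]:
        faulty.add(start)
        return configured, faulty
    configured.append(start)
    arm = [start]  # current path of the arm, tip last
    while arm:
        r, c = arm[-1]
        for dr, dc in ((0, 1), (1, 0), (0, -1), (-1, 0)):
            nxt = (r + dr, c + dc)
            if not (0 <= nxt[0] < rows and 0 <= nxt[1] < cols) or nxt in tested:
                continue
            tested.add(nxt)
            if block_ok[nxt[0]][nxt[1]]:
                configured.append(nxt)  # test passed: configure and extend
                arm.append(nxt)
                break
            faulty.add(nxt)             # test failed: designate as faulty
        else:
            arm.pop()                   # cannot extend: retract by one block
    return configured, faulty


print(configure_blocks([[True, False], [True, True]]))
```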

Device-Level FT Methods(5) Fault Tolerant Segments/Grids

Fault Tolerant Segments: adds one track of spare segments to each wiring channel; if a faulty segment is found, the segment is shifted to the spare; a single fault can be tolerated

Fault Tolerant Grids: an entire spare routing grid is added; no additional elements in the routing channel, so no extra time delay

Device-Level FT Methods(6) SRAM Shifting

Based on shifting the entire circuit on the FPGA

PLBs should be placed in one of 2 ways:

King Allocation: 8 PLBs use one spare, circuit can move in 8 directions

Horse Allocation: 4 PLBs use one spare, circuit can move in 4 directions

Testing determines the faulty cells and feeds the information to the shifter circuitry on the FPGA.

Device-Level FT Methods(6) SRAM Shifting

Additional spare PLBs surrounding the FPGA

Horse Allocation used in the figure

The circuit is shifted up and right

Advantages of the Method: no external reconfiguration algorithm is required; the timing of the circuit is almost fixed; any single fault can be tolerated
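A small model of the shifting idea for King Allocation is sketched below. It assumes that the spare sits at the centre of every 3x3 group of PLBs and that the ring of spare PLBs around the array absorbs cells pushed past the edge; both are assumptions for illustration, and `find_shift` is a hypothetical helper. The four directions of Horse Allocation are not specified in the slides, so they are left as a parameter.

```python
# The eight directions a king can move on a chessboard.
KING_MOVES = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
              if (dr, dc) != (0, 0)]


def find_shift(used_cells, faulty_cells, moves):
    """Return the first shift (dr, dc) after which no used cell is faulty.

    (0, 0) is tried first, so a fault that already lies on a spare
    needs no shift. Returns None when no allowed shift works.
    """
    for dr, dc in [(0, 0)] + list(moves):
        shifted = {(r + dr, c + dc) for r, c in used_cells}
        if not shifted & faulty_cells:
            return (dr, dc)
    return None


# Assumed King Allocation on a 6x6 array: centre of each 3x3 group is spare.
used = {(r, c) for r in range(6) for c in range(6)
        if not (r % 3 == 1 and c % 3 == 1)}
print(find_shift(used, {(0, 0)}, KING_MOVES))  # (-1, -1)
```

Under this layout every single faulty cell can be made to coincide with a spare by one of the eight shifts, which matches the claim that any single fault can be tolerated.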

Configuration-Level FT Methods(1) Pebble Shifting

Find an initial circuit configuration, then move pieces away from faulty units

Occupied PLBs are called pebbles

Pair pebbles on faulty cells with unique, unused cells such that the sum of weighted Manhattan distances is minimized

Start shifting pebbles

If a pebble finds an empty cell other than the intended cell, this empty cell becomes the destination

No limit to the number of faults that can be tolerated

Configuration-Level FT Methods(1) Pebble Shifting

Example: 1 and 6 are on faulty cells

Using a minimum-cost, maximum-matching algorithm, the pairings are: 1->v11 and 6->v32

Element 1 is shifted to its new position

To move 6, we shift 3, 8 and 7

Now all elements are on non-faulty cells and allocation is done
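The pairing step can be illustrated with a brute-force search over assignments. This is a sketch for small inputs only: a real implementation would use a proper minimum-cost matching algorithm, and the cell coordinates, weights, and the name `pair_pebbles` are assumptions. The subsequent shifting of intermediate pebbles is not modelled.

```python
from itertools import permutations


def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def pair_pebbles(faulty_pebbles, free_cells, weights=None):
    """Pair each pebble on a faulty cell with a unique unused cell.

    Minimizes the sum of weighted Manhattan distances by exhaustive search.
    Returns (pairing dict, total cost).
    """
    if len(free_cells) < len(faulty_pebbles):
        raise ValueError("not enough unused cells")
    weights = weights or [1] * len(faulty_pebbles)
    best_cost, best_choice = None, None
    for choice in permutations(free_cells, len(faulty_pebbles)):
        cost = sum(w * manhattan(p, c)
                   for w, p, c in zip(weights, faulty_pebbles, choice))
        if best_cost is None or cost < best_cost:
            best_cost, best_choice = cost, choice
    return dict(zip(faulty_pebbles, best_choice)), best_cost


pairing, cost = pair_pebbles([(0, 0), (2, 2)], [(0, 3), (3, 2), (0, 1)])
print(pairing, cost)  # {(0, 0): (0, 1), (2, 2): (3, 2)} 2
```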


Configuration-Level FT Methods(2) Mini-Max Grid Matching

Uses a grid matching algorithm to match faulty logic to empty, non-faulty locations

Like Pebble Shifting, uses a minimum-cost, maximum-matching algorithm

Minimizes the maximum distance between the pairings, since the circuit’s performance is set by the critical (longest) path

Can tolerate faults until there are no unused cells
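The difference from a minimum-sum pairing is the objective: here the longest single move is minimized. The brute-force sketch below shows that objective on a tiny example; it is not the grid matching algorithm itself, and `minimax_match` and the coordinates are illustrative assumptions.

```python
from itertools import permutations


def minimax_match(faulty, free):
    """Match faulty locations to unused cells, minimizing the longest move.

    Returns (longest Manhattan distance, pairing dict), or None when
    there are fewer unused cells than faulty locations.
    """
    best = None
    for choice in permutations(free, len(faulty)):
        worst = max((abs(fr - cr) + abs(fc - cc)
                     for (fr, fc), (cr, cc) in zip(faulty, choice)),
                    default=0)
        if best is None or worst < best[0]:
            best = (worst, dict(zip(faulty, choice)))
    return best


# A minimum-sum pairing would pick distances 1 and 5 (sum 6, longest 5);
# the mini-max pairing picks 4 and 4 (sum 8, longest 4).
print(minimax_match([(0, 0), (4, 1)], [(0, 1), (1, 3)]))
```

In the example the mini-max choice has a larger total distance but a shorter worst-case move, which is the quantity tied to the critical path.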

Configuration-Level FT Methods(3) Node-Covering and Cover Segments

When a fault is discovered, nodes are shifted along the chain (row) towards the right

The last PLB of a chain is reserved as a spare

One fault in a row can be tolerated

Needs no reconfiguration if local routing configurations are present

Configuration-Level FT Methods(4) Tiling

Partition FPGA into tiles

Precompiled configurations of tiles are stored in memory

Each tile contains system function, some spare logic and interconnect resources

When a logic fault occurs in a tile, the configuration of the tile is replaced by a configuration that does not use the faulty resources

Many logic faults can be tolerated

Local interconnect faults can be tolerated, but global ones can’t be tolerated
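Selecting a replacement configuration for a tile amounts to a lookup over the stored alternatives. A minimal sketch, assuming each precompiled configuration is stored together with the set of resources it uses (the names and resource labels below are hypothetical):

```python
def pick_tile_config(configs, faulty_resources):
    """Return the name of the first precompiled configuration that avoids
    every faulty resource, or None if no stored configuration qualifies.

    configs: list of (name, set of resources used by that configuration)
    """
    for name, used in configs:
        if not used & faulty_resources:
            return name
    return None


tile_configs = [
    ("cfg_a", {"plb0", "plb1", "wire0"}),  # plb2 is the spare
    ("cfg_b", {"plb0", "plb2", "wire0"}),  # plb1 is the spare
    ("cfg_c", {"plb1", "plb2", "wire1"}),  # plb0 is the spare
]
print(pick_tile_config(tile_configs, {"plb1"}))  # cfg_b
```

Since only the affected tile is swapped, no place-and-route run is needed at repair time; the cost is the memory holding the alternatives.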


Configuration-Level FT Methods(5) Cluster-Based

Intracluster tolerance in a PLB

Basic Logic Elements (BLEs or LUTs)

For simple LUT faults, the preferred solution is to use another LUT in the PLB

Instead of changing PLB, try to find a solution in the same PLB

In the example, T is faulty and the 4th PLB is used instead of the 2nd PLB

Configuration-Level FT Methods(6) Column-Based

Treats the design as a set of functional units, each unit is a column

Like Tiling, uses precompiled configurations, at lower cost

At least one column should be spare

If there is a faulty cell in a column, the column is shifted toward the spare column

Method can tolerate m faulty columns, where m is the number of columns not occupied by system functions

Comparison of Methodologies(1)

Device Level (DL) Methods need extra HW and have more area cost

DL Methods use one initial reconfiguration and have no extra reconfiguration cost

Configuration Level (CL) Methods need more than one reconfiguration and sometimes result in high time cost

CL Methods don’t need extra HW and have no additional area cost

Comparison of Methodologies(2)

DL Methods are less flexible, therefore less able to improve reliability

CL Methods usually tolerate more faults than DL Methods

Performance impact of fault tolerance is less for DL Methods than CL Methods

Conclusion

No single Fault Tolerance methodology is better than the others in all cases.

DL Methods have less impact on performance, but are not flexible

CL Methods tolerate more faults but have more impact on performance
