
FPGA Implementations of the Massively Parallel GCA Model

Wolfgang Heenes, Rolf Hoffmann, Sebastian Kanthak
Computer Architecture Group
Darmstadt University of Technology

WMPP 05, Friday, April 8, 2005

Outline

• Introduction & Motivation
• The GCA Model
• Hardware Architectures
• Implementations & Results
• Conclusion & Future Work


Introduction

Classical Cellular Automata
• optimal model for applications with inherent local neighborhood

[Figure: Game of Life, after 240 generations]


Introduction

Applications
• physical fields, lattice-gas models, models of growth, moving particles, fluid flow, logic simulation, numerical algorithms, routing problems, picture processing, genetic algorithms, cellular neural networks, pseudo-random generators


Motivation

Restriction of the Model
• only local access to fixed neighbors


Motivation

Restriction of the Model
• global communication between remote cells is sequential in space

[Figure: cells A, B, C, D, E]


The GCA Model

GCA?
Global Cellular Automata

[Figure labels: center cell; generation t (solid line); generation t+1 (dashed line)]


The GCA Model

Features
• Direct dynamic read access to global neighbors
• Massively parallel
• Suited for a wide class of parallel algorithms, e.g. FFT, bitonic sort, matrix multiplication, vector reduction
• And of course, classical cellular automata can also be described with the GCA model


The GCA Model

Cell Field: C = array [0..n-1] of State

State of each cell:
  State = record
    Data: Datatype   // the data
    L1: 0..n-1       // points to the first global cell
    L2: 0..n-1       // points to the second global cell
  endrecord

Local Rule: function f(Self: State, Neighbor1: State, Neighbor2: State): State

Next Generation:
  for i := 0..n-1 do in parallel
    C[i] ← f(C[i], C[C[i].L1], C[C[i].L2])
  endfor
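The definition above can be sketched in software. The following is a minimal Python illustration, not the authors' implementation: the names `State`, `next_generation` and `sum_rule` are assumptions chosen for this example, and `sum_rule` is a hypothetical local rule. All cells read generation t and the results form a new list, which mimics the synchronous "do in parallel" update.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class State:
    data: int  # Data: the data of the cell
    l1: int    # L1: index of the first global cell
    l2: int    # L2: index of the second global cell


Rule = Callable[[State, State, State], State]


def next_generation(cells: List[State], f: Rule) -> List[State]:
    """Compute generation t+1 from generation t.

    Every cell reads only the old list, so the sequential loop gives the
    same result as a parallel update of all cells.
    """
    return [f(c, cells[c.l1], cells[c.l2]) for c in cells]


def sum_rule(self_: State, n1: State, n2: State) -> State:
    """Hypothetical rule: add the data of both global neighbors, keep pointers."""
    return State(self_.data + n1.data + n2.data, self_.l1, self_.l2)


n = 4
# Cell i holds i+1 and points to cells i+1 and i+2 (modulo n).
cells = [State(data=i + 1, l1=(i + 1) % n, l2=(i + 2) % n) for i in range(n)]
new_cells = next_generation(cells, sum_rule)
print([c.data for c in new_cells])
```

A rule may also return new values for `l1` and `l2`, which is how the neighborhood can change from one generation to the next.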


The GCA Model

Possibilities for changing the neighborhood
1. time dependent
2. data dependent
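The two possibilities can be illustrated with two small pointer functions. This is only a sketch under assumptions: both function names and both formulas are hypothetical examples, not rules taken from the slides. The time-dependent variant derives the pointer from the cell index and the generation counter, the data-dependent variant derives it from the cell's data.

```python
def time_dependent_pointer(i: int, t: int, n: int) -> int:
    """Pointer that depends only on the cell index i and the generation t.

    Hypothetical pattern: the partner lies at distance n / 2**(t+1),
    so the distance halves with every generation (n is a power of two).
    """
    return i ^ (n >> (t + 1))


def data_dependent_pointer(data: int, n: int) -> int:
    """Pointer that is computed from the data held by the cell.

    Hypothetical pattern: the data value, reduced to a valid cell index.
    """
    return data % n


n = 8
for t in range(3):
    print("generation", t, [time_dependent_pointer(i, t, n) for i in range(n)])
print(data_dependent_pointer(13, n))
```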


Hardware Architectures

Parallel Hardware for Cellular Automata
1. One of the first machines was the CAM6 from MIT (Margolus, Toffoli)
2. CEPRA family (using FPGAs), first implementations in 1994



How can we implement this model in hardware?
1. Fully Parallel Architecture

The problem can be fully implemented in hardware if n is small enough. The implementation needs n register cells, n functional units and n(n-1) connections from the remote cells to the center cell.
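The resource counts stated above grow very differently with n. A minimal sketch, assuming a hypothetical helper named `fully_parallel_resources`, makes the quadratic growth of the connections visible:

```python
def fully_parallel_resources(n: int) -> dict:
    """Resource counts of the fully parallel architecture for n cells.

    Uses the counts from the text: n register cells, n functional units
    and n * (n - 1) connections from remote cells to center cells.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return {
        "register_cells": n,
        "functional_units": n,
        "connections": n * (n - 1),
    }


for n in (4, 16, 128):
    print(n, fully_parallel_resources(n))
```

Registers and functional units grow linearly, while the connections grow quadratically, which is one reason the architecture is only feasible for small n.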



How can we implement this model in hardware?
2. Memory-Based Sequential and Parallel Architectures

The fully parallel architecture needs a lot of resources and is restricted to a certain n. To handle a large number n of cells and to make the architecture scalable in resources, we design memory-based architectures.



[Figure: memory-based parallel architecture. Labels: Master; Crosspoint Matrix; several Processing Cells, each with an LP and an LM block]



Excalibur Device Features
• ARM CPU (200 MHz) and FPGA (ca. 4100 LEs) on one die
• High-speed bus (AMBA) for communication between CPU and FPGA
• 32 MB DRAM, LAN, etc.

Cyclone Device Features
• ca. 20000 LEs
• 294,912 memory bits


Implementations
• The bitonic merge algorithm sorts a bitonic sequence
• A sequence of numbers is called bitonic if the first part of the sequence is ascending and the second part is descending, or if such a sequence is cyclically shifted
• Bitonic merge is the second part of a complete sorting algorithm

Total complexity for sorting is O((log N)^2)

[Figure: bitonic merge with eight cells]

Cell values with their addresses in generation 0:
2 (000), 4 (001), 6 (010), 8 (011), 7 (100), 5 (101), 3 (110), 1 (111)

Cell values after each of the three generations:
2 4 3 1 7 5 6 8
2 1 3 4 6 5 7 8
1 2 3 4 5 6 7 8
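The merge of the eight example values can be reproduced with a short simulation. This is a sketch under assumptions, not the authors' hardware rule: the function name `bitonic_merge_gca` is hypothetical, and the access pattern (partner address = own address XOR distance, with the distance halving each generation) is the standard bitonic merge pattern. Cells whose address has a 0 at the distance bit keep the minimum, the others keep the maximum.

```python
from typing import List


def bitonic_merge_gca(values: List[int]) -> List[List[int]]:
    """Merge a bitonic sequence and return every generation.

    The length must be a power of two. Each generation is computed only
    from the previous one, which mimics the synchronous update of all cells.
    """
    n = len(values)
    if n == 0 or n & (n - 1) != 0:
        raise ValueError("length must be a power of two")
    generations = [list(values)]
    distance = n // 2
    while distance >= 1:
        current = generations[-1]
        nxt = []
        for i in range(n):
            partner = i ^ distance  # global neighbor of cell i
            if i & distance == 0:
                nxt.append(min(current[i], current[partner]))
            else:
                nxt.append(max(current[i], current[partner]))
        generations.append(nxt)
        distance //= 2
    return generations


for generation in bitonic_merge_gca([2, 4, 6, 8, 7, 5, 3, 1]):
    print(generation)
```

For n cells the merge needs ld n generations, three in this example.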


Results: Measured Values on a PC

In software, one computation of a new state consists of the following steps:
• access the cell's state from a vector in memory
• access the remote cell states via the pointer
• compute the result
• buffer the result

After all new state values have been stored in the buffer memory, the buffer memory and the memory are exchanged.

number of sorting cells n   time for the whole algorithm tw   time for one cell operation T = tw / (n ld n)
16                          1.18 us                           18.4 ns
32                          3.10 us                           19.4 ns
128                         17.34 us                          19.3 ns
512                         81.5 us                           17.7 ns
1024                        183.48 us                         17.9 ns
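The last column follows from the first two: each of the n cells performs one operation in each of the ld n generations, so T = tw / (n ld n). A minimal sketch of that arithmetic, with the measured values copied from the table and a hypothetical function name:

```python
import math

# (number of cells n, measured time tw in microseconds), copied from the table
measurements = [(16, 1.18), (32, 3.10), (128, 17.34), (512, 81.5), (1024, 183.48)]


def time_per_cell_operation_ns(n: int, tw_us: float) -> float:
    """T = tw / (n * ld n), converted from microseconds to nanoseconds."""
    return tw_us * 1000.0 / (n * math.log2(n))


for n, tw_us in measurements:
    print(f"n = {n:4d}: T = {time_per_cell_operation_ns(n, tw_us):.2f} ns")
```

The computed values agree with the table to within rounding and stay between roughly 17.7 ns and 19.4 ns for all array sizes.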


Results: Fully Parallel Architecture

Prototype platform: Cyclone
• The number of bits per cell is eight
• A maximum of 128 cells could be implemented in the device because of the limited number of logic elements

array size n   logic elements   max. clock   time T for one cell operation, T = 1 / (max. clock · n)
4              122              154 MHz      2 ns
8              349              133 MHz      939 ps
16             1,044            125 MHz      500 ps
32             2,184            118 MHz      265 ps
64             4,936            102 MHz      153 ps
128            14,785           83 MHz       94 ps

Compared to the software solution for an array size of n = 128, the hardware implementation is around 190 times faster.
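The time per cell operation and the speedup can be checked with a few lines of arithmetic. This is a sketch under one assumption: the software time per cell operation is taken as the "around 18 ns" quoted for the software solution, and the function name is hypothetical.

```python
# (array size n, maximum clock in MHz), copied from the table
fully_parallel = [(4, 154), (8, 133), (16, 125), (32, 118), (64, 102), (128, 83)]

SOFTWARE_T_NS = 18.0  # assumption: "around 18 ns" per cell operation in software


def time_per_cell_operation_ps(n: int, max_clock_mhz: float) -> float:
    """T = 1 / (max. clock * n), expressed in picoseconds."""
    return 1e6 / (max_clock_mhz * n)


for n, clock in fully_parallel:
    print(f"n = {n:3d}: T = {time_per_cell_operation_ps(n, clock):7.1f} ps")

t_128_ps = time_per_cell_operation_ps(128, 83)
speedup = SOFTWARE_T_NS * 1000.0 / t_128_ps
print(f"speedup for n = 128: about {speedup:.0f}")
```

For n = 4 the formula gives about 1.6 ns, so the 2 ns entry in the table is presumably rounded. For n = 128 the result is about 94 ps, which gives a speedup of roughly 190 under the stated assumption.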


Results: Sequential Architecture

• Operational clock rate around 100 MHz; time for one cell operation 10 ns
• As opposed to the software solution (around 18 ns), the hardware implementation is 1.8 times faster

The memory requirements for the sequential architecture:

data width (bits)   address width (bits)   memory usage
64                  10                     282,432 bits (96%)
28                  11                     274,432 bits (93%)
10                  12                     262,144 bits (88%)
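The percentages in the memory table can be related to the device data. The sketch below assumes that they refer to the 294,912 memory bits listed for the Cyclone device; the slides do not state this explicitly, and the function name is hypothetical.

```python
CYCLONE_MEMORY_BITS = 294_912  # from the Cyclone device features

# (data width, address width, used memory bits), copied from the table
configurations = [(64, 10, 282_432), (28, 11, 274_432), (10, 12, 262_144)]


def utilization_percent(used_bits: int) -> float:
    """Share of the on-chip memory bits that a configuration occupies."""
    return 100.0 * used_bits / CYCLONE_MEMORY_BITS


for data_width, address_width, used_bits in configurations:
    print(f"data {data_width:2d} bit, address {address_width} bit: "
          f"{utilization_percent(used_bits):.1f} %")

# Speedup of the sequential architecture: 18 ns in software vs. 10 ns at 100 MHz
print(f"speedup: {18.0 / 10.0:.1f}")
```

Under this assumption the computed shares are about 95.8 %, 93.1 % and 88.9 %, which matches the table to within one percentage point.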


Conclusion

• This is a first approach to implementing the GCA model in hardware
• For the (simple) bitonic merge problem, the FPGA implementation is faster than the software solution
• Further implementations are necessary to evaluate its capability
• An application for the model is the acceleration of dedicated algorithms in embedded systems


Future Work

A more general purpose processing cell

[Figure: block diagram of the processing cell. Labels: registers R0..RF (R2-CC, R3-CA, R4-OCC, R5-MD, R6-MA, RF-CID); ALU; Program-memory; Data-memory; CID, CA, CC, BSI, OCC, STW; A1, B1, C1]

Thank you very much for your attention.
