FPGA Implementations of the Massively Parallel GCA Model

Wolfgang Heenes, Rolf Hoffmann, Sebastian Kanthak
Computer Architecture Group
Darmstadt University of Technology



WMPP 05, Friday, April 8, 2005

Outline

• Introduction & Motivation
• The GCA Model
• Hardware Architectures
• Implementations & Results
• Conclusion & Future Work


Introduction

Classical Cellular Automata
• optimal model for applications with inherent local neighborhood

[Figure: Game of Life after 240 generations]


Introduction

Applications
• physical fields, lattice-gas models, models of growth, moving particles, fluid flow, logic simulation, numerical algorithms, routing problems, picture processing, genetic algorithms, cellular neural networks, pseudo-random generators


Motivation

Restriction of the Model
• only local access to fixed neighbors


Motivation

Restriction of the Model
• global communication between remote cells is sequential in space

[Figure: cells labeled A, B, C, D, E]


The GCA Model

GCA? Global Cellular Automata

[Figure: center cell, with generation t (solid line) and generation t+1 (dashed line)]


The GCA Model

Features
• Direct dynamic read access to global neighbors
• Massively parallel
• Suited for a wide class of parallel algorithms, e.g. FFT, bitonic sort, matrix multiplication, vector reduction
• And, of course, the classical cellular automata can also be described with the GCA model


The GCA Model

Cell Field:
    C = array [0..n-1] of State

State of each cell:
    State = record
        Data: Datatype   // the data
        L1: 0..n-1       // points to the first global cell
        L2: 0..n-1       // points to the second global cell
    endrecord

Local Rule:
    function f(Self: State, Neighbor1: State, Neighbor2: State): State

Next Generation:
    for i := 0..n-1 do in parallel
        C[i] := f(C[i], C[C[i].L1], C[C[i].L2])
    endfor
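The definitions above can be mirrored in a few lines of Python. This is only a minimal sketch, not the authors' implementation: the names `Cell`, `next_generation` and the example rule `add_neighbors` are hypothetical, and the parallel loop is emulated by building the new cell field from the unchanged old one.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Cell:
    data: int  # the data
    l1: int    # index of the first global cell
    l2: int    # index of the second global cell


Rule = Callable[[Cell, Cell, Cell], Cell]


def next_generation(field: List[Cell], f: Rule) -> List[Cell]:
    """Apply the local rule f to every cell, reading only the old generation."""
    return [f(c, field[c.l1], field[c.l2]) for c in field]


def add_neighbors(me: Cell, n1: Cell, n2: Cell) -> Cell:
    """Hypothetical example rule: sum the neighbors' data, keep the pointers."""
    return Cell(n1.data + n2.data, me.l1, me.l2)


# Four cells; cell i points to cells i+1 and i+2 (modulo 4).
field = [Cell(data=i, l1=(i + 1) % 4, l2=(i + 2) % 4) for i in range(4)]
new_field = next_generation(field, add_neighbors)
print([c.data for c in new_field])
```

Because the rule returns a complete new state, it may also return new values for the pointers, which is how the neighborhood can change between generations.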


The GCA Model

Possibilities for changing the neighborhood
1. time dependent
2. data dependent
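The two possibilities can be illustrated with two small pointer functions. Both are hypothetical choices made only for this sketch (they do not come from the slides): in the first, the neighbor index depends on the generation counter t; in the second, it is derived from the cell's own data.

```python
def time_dependent_pointer(i: int, t: int, n: int) -> int:
    """Neighbor index depends only on the cell index i and the generation t.

    Hypothetical choice: the access distance halves in each generation.
    n is assumed to be a power of two and t < log2(n).
    """
    distance = (n // 2) >> t
    return i ^ distance


def data_dependent_pointer(data: int, n: int) -> int:
    """Neighbor index is derived from the cell's data (hypothetical choice)."""
    return data % n


n = 8
print([time_dependent_pointer(i, 0, n) for i in range(n)])
print([time_dependent_pointer(i, 1, n) for i in range(n)])
print(data_dependent_pointer(13, n))
```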


Hardware Architectures

Parallel Hardware for Cellular Automata
1. One of the first machines was the CAM6 from MIT (Margolus, Toffoli)
2. CEPRA family (using FPGAs), first implementations in 1994


Hardware Architectures

How can we implement this model in hardware?
1. Fully Parallel Architecture

The problem can be fully implemented in hardware if n is small enough. The implementation needs n register cells, n functional units and n(n-1) connections in total, because each cell, acting as center cell, needs a connection from each of the n-1 remote cells.
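The resource counts stated above can be tabulated directly. The following is a small sketch with a hypothetical helper name; it only evaluates the formulas n, n and n(n-1) and shows how quickly the number of connections grows with n.

```python
def fully_parallel_resources(n: int) -> dict:
    """Resource counts of the fully parallel architecture for n cells."""
    return {
        "register_cells": n,
        "functional_units": n,
        "connections": n * (n - 1),  # every cell reads from all n-1 other cells
    }


for n in (4, 16, 128):
    print(n, fully_parallel_resources(n))
```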


Hardware Architectures

How can we implement this model in hardware?
2. Memory-Based Sequential and Parallel Architectures

The fully parallel architecture needs a lot of resources and is restricted to small n. In order to handle a large number n of cells and to make the architecture scalable in resources, we design memory-based architectures.


Hardware Architectures

[Figure: a master and a crosspoint matrix connected to a series of processing cells, each consisting of an LP and an LM]


Hardware Architectures

Excalibur Device Features
• ARM CPU (200 MHz) and FPGA (ca. 4100 LEs) on one die
• High-speed bus (AMBA) for communication between CPU and FPGA
• 32 MB DRAM, LAN, etc.

Cyclone Device Features
• ca. 20000 LEs
• 294,912 memory bits


Implementations

• The bitonic merge algorithm sorts a bitonic sequence
• A sequence of numbers is called bitonic if the first part of the sequence is ascending and the second part is descending, or if such a sequence is cyclically shifted
• Bitonic merge is the second part of a complete sorting algorithm

Total complexity for sorting is O((log N)^2) parallel steps

[Figure: bitonic merge of the sequence 2(000) 4(001) 6(010) 8(011) 7(100) 5(101) 3(110) 1(111), with the binary cell index in parentheses. Successive generations: 2 4 3 1 7 5 6 8, then 2 1 3 4 6 5 7 8, then 1 2 3 4 5 6 7 8.]
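The merge of the example sequence 2 4 6 8 7 5 3 1 can be reproduced with a short sketch. It uses one common formulation of bitonic merge, which is an assumption here and not necessarily the exact cell rule of the authors: in each generation every cell i reads one global neighbor at index i XOR distance, keeps the minimum if its own distance bit is 0 and the maximum otherwise, and the distance is halved from n/2 down to 1. The function name is hypothetical.

```python
from typing import List


def bitonic_merge_generations(values: List[int]) -> List[List[int]]:
    """Return all generations of a bitonic merge (ascending result).

    The input is assumed to be a bitonic sequence whose length is a
    power of two. Every generation is computed from the previous one
    only, which emulates the synchronous parallel update of all cells.
    """
    n = len(values)
    assert n > 0 and (n & (n - 1)) == 0, "length must be a power of two"
    generations = [list(values)]
    distance = n // 2
    while distance >= 1:
        old = generations[-1]
        new = []
        for i in range(n):
            partner = i ^ distance  # global neighbor in this generation
            if (i & distance) == 0:
                new.append(min(old[i], old[partner]))
            else:
                new.append(max(old[i], old[partner]))
        generations.append(new)
        distance //= 2
    return generations


for generation in bitonic_merge_generations([2, 4, 6, 8, 7, 5, 3, 1]):
    print(generation)
```

For n = 8 cells the merge needs ld n = 3 generations, and the neighbor of each cell changes from generation to generation.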


Results: Measured Values on a PC

In software, one computation of a new state consists of the following steps:
• accessing the cell's state from a vector in memory
• accessing the remote cell states via the pointers
• computing the result
• buffering the result

After all new state values have been stored in the buffer memory, the buffer memory and the memory are exchanged.

n = number of sorting cells, tw = time for the whole algorithm, T = tw / (n · ld n) = time for one cell operation

n       tw          T
16      1.18 µs     18.4 ns
32      3.10 µs     19.4 ns
128     17.34 µs    19.3 ns
512     81.5 µs     17.7 ns
1024    183.48 µs   17.9 ns
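The last column follows from the first two through T = tw / (n · ld n), where ld is the logarithm to base 2. The sketch below recomputes it from the measured values; the variable and function names are chosen only for this example. The recomputed values agree with the listed ones to within rounding.

```python
import math

# (number of cells n, time for the whole algorithm tw in microseconds)
measurements = [(16, 1.18), (32, 3.10), (128, 17.34), (512, 81.5), (1024, 183.48)]


def time_per_cell_operation_ns(n: int, tw_us: float) -> float:
    """T = tw / (n * ld n), converted from microseconds to nanoseconds."""
    return tw_us * 1000.0 / (n * math.log2(n))


for n, tw_us in measurements:
    print(n, round(time_per_cell_operation_ns(n, tw_us), 2), "ns")
```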


Results: Fully Parallel Architecture

Prototype platform: Cyclone
• The number of bits per cell is eight
• A maximum of 128 cells could be implemented in the device because of the limited number of logic elements

n = array size, T = 1 / (max. clock · n) = time for one cell operation

n      logic elements   max. clock   T
4      122              154 MHz      2 ns
8      349              133 MHz      939 ps
16     1,044            125 MHz      500 ps
32     2,184            118 MHz      265 ps
64     4,936            102 MHz      153 ps
128    14,785           83 MHz       94 ps

Compared to the software solution for an array size of n = 128, the hardware implementation is around 190 times faster.
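The time per cell operation and the speedup can be recomputed from the clock rates. This is a sketch with hypothetical names; it assumes the rounded software figure of about 18 ns per cell operation. Using the 19.3 ns measured for n = 128 instead would give a somewhat larger factor.

```python
# (array size n, maximum clock in MHz) for the fully parallel architecture
clock_rates = [(4, 154), (8, 133), (16, 125), (32, 118), (64, 102), (128, 83)]

SOFTWARE_NS = 18.0  # assumed rounded software time per cell operation


def cell_time_ps(n: int, max_clock_mhz: float) -> float:
    """T = 1 / (max. clock * n), returned in picoseconds."""
    return 1e6 / (max_clock_mhz * n)


for n, mhz in clock_rates:
    print(n, round(cell_time_ps(n, mhz), 1), "ps")

speedup = SOFTWARE_NS * 1000.0 / cell_time_ps(128, 83)
print("speedup for n = 128:", round(speedup))
```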


Results: Sequential Architecture

• Operational clock rate of around 100 MHz, i.e. a time of 10 ns for one cell operation
• As opposed to the software solution (around 18 ns), the hardware implementation is 1.8 times faster

The memory requirements for the sequential architecture:

Data width (bits)   Address width (bits)   Memory usage
64                  10                     282,432 bits (96%)
28                  11                     274,432 bits (93%)
10                  12                     262,144 bits (88%)
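The usage percentages are consistent with the 294,912 memory bits listed for the Cyclone device. That this is the reference value is an assumption of the following sketch, which only recomputes the ratios; the names are hypothetical.

```python
TOTAL_MEMORY_BITS = 294_912  # Cyclone memory bits, assumed to be the 100% value

# (data width in bits, address width in bits, used memory bits)
configurations = [(64, 10, 282_432), (28, 11, 274_432), (10, 12, 262_144)]


def usage_percent(used_bits: int) -> float:
    """Share of the device memory that is used, in percent."""
    return 100.0 * used_bits / TOTAL_MEMORY_BITS


for data_width, address_width, used_bits in configurations:
    print(data_width, address_width, round(usage_percent(used_bits), 1), "%")
```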


Conclusion

• This is a first approach to implementing the GCA model in hardware
• For the (simple) bitonic merge problem, the FPGA implementation is faster than the software solution
• Further implementations are necessary to evaluate the capability of the model
• One application for the model is the acceleration of dedicated algorithms in embedded systems


Future Work

A more general purpose processing cell

[Figure: block diagram of the processing cell with registers R0..RF (among them R2-CC, R3-CA, R4-OCC, R5-MD, R6-MA, RF-CID), an ALU, program memory and data memory]

Thank you very much for your attention.
