FPGA Implementations of the Massively Parallel GCA Model
Wolfgang Heenes, Rolf Hoffmann, Sebastian Kanthak
Computer Architecture Group
Darmstadt University of Technology
WMPP 05, Friday, April 8, 2005
Outline
• Introduction & Motivation
• The GCA Model
• Hardware Architectures
• Implementations & Results
• Conclusion & Future Work
Introduction
Classical Cellular Automata
• optimal model for applications with inherent local neighborhood
[Figure: Game of Life after 240 generations]
Introduction
Applications
physical fields, lattice-gas models, models of growth, moving particles, fluid flow, logic simulation, numerical algorithms, routing problems, picture processing, genetic algorithms, cellular neural networks, pseudo-random generators
Motivation
Restriction of the Model
• only local access to fixed neighbors
Motivation
Restriction of the Model
• global communication between remote cells is sequential in space
[Figure: cells labeled A, B, C, D, E]
The GCA Model
GCA? Global Cellular Automata
[Figure: a center cell and its links; generation t is drawn with solid lines, generation t+1 with dashed lines]
The GCA Model
Features
• Direct dynamic read access to global neighbors
• Massively parallel
• Suited for a wide class of parallel algorithms, e.g. FFT, bitonic sort, matrix multiplication, vector reduction
• Classical cellular automata can also be described with the GCA model
The GCA Model
Cell Field: C = array [0..n-1] of State
State of each cell:
State = record
    Data: Datatype   // the data
    L1: 0..n-1       // points to the first global cell
    L2: 0..n-1       // points to the second global cell
endrecord
Local Rule: function f(Self: State, Neighbor1: State, Neighbor2: State): State
Next Generation:
for i := 0..n-1 do in parallel
    C[i] ← f(C[i], C[C[i].L1], C[C[i].L2])
endfor
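The definition above can be sketched in software as follows. This is a minimal Python illustration, not the authors' implementation: the names `State`, `next_generation`, and `max_rule` are assumptions chosen for this example, and the parallel loop is emulated by building the whole new generation from the old one before replacing it.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class State:
    data: int  # the data
    l1: int    # index of the first global cell
    l2: int    # index of the second global cell


Rule = Callable[[State, State, State], State]


def next_generation(cells: List[State], f: Rule) -> List[State]:
    """Compute generation t+1 from generation t.

    Every read uses the old list, so the result matches a synchronous
    'do in parallel' update.
    """
    return [f(c, cells[c.l1], cells[c.l2]) for c in cells]


def max_rule(self_: State, n1: State, n2: State) -> State:
    # Hypothetical local rule: keep the largest data value, pointers unchanged.
    return State(max(self_.data, n1.data, n2.data), self_.l1, self_.l2)


cells = [State(3, 1, 2), State(9, 0, 3), State(4, 3, 0), State(1, 2, 1)]
cells = next_generation(cells, max_rule)
print([c.data for c in cells])  # [9, 9, 4, 9]
```

Because the rule returns a complete `State`, it may also return new values for the two pointers, which is how the neighborhood can change from one generation to the next.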
The GCA Model
Possibilities for changing the neighborhood
1. time dependent
2. data dependent
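The two possibilities can be illustrated with a small Python sketch. Both functions are assumptions made for this illustration and do not come from the slides: `reduce_sum` uses a time-dependent neighbor (the partner index depends only on the generation counter t), while `pointer_jump` uses a data-dependent neighbor (the new pointer is read from the neighbor's state).

```python
def reduce_sum(values):
    """Time-dependent neighborhood: in generation t, cell i reads cell (i + 2**t) mod n."""
    n = len(values)
    assert n > 0 and n & (n - 1) == 0, "this sketch assumes n is a power of two"
    data = list(values)
    step = 1
    while step < n:
        # All cells read the old generation, then the new one replaces it.
        data = [data[i] + data[(i + step) % n] for i in range(n)]
        step *= 2
    return data


def pointer_jump(link):
    """Data-dependent neighborhood: each cell adopts the pointer stored in its neighbor."""
    return [link[link[i]] for i in range(len(link))]


print(reduce_sum([1, 2, 3, 4]))    # [10, 10, 10, 10]
print(pointer_jump([1, 2, 3, 3]))  # [2, 3, 3, 3]
```

In the first case every cell ends up holding the total after log2(n) generations; in the second case the distance each pointer spans doubles with every generation.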
Hardware Architectures
Parallel Hardware for Cellular Automata
1. One of the first machines was the CAM6 from MIT (Margolus, Toffoli)
2. CEPRA family (using FPGAs), first implementations in 1994
Hardware Architectures
How can we implement this model in hardware?
1. Fully Parallel Architecture
The problem can be fully implemented in hardware if n is small enough. The implementation needs n register cells, n functional units, and n (n-1) connections, one from every remote cell to every center cell.
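The resource counts stated above grow quadratically in the number of connections. A small Python sketch makes this concrete; the function name `fully_parallel_resources` is hypothetical, and the numbers are plain arithmetic from the formulas on the slide, not measured values.

```python
def fully_parallel_resources(n):
    """Resource counts of the fully parallel architecture for n cells."""
    return {
        "register_cells": n,
        "functional_units": n,
        "connections": n * (n - 1),  # every cell can read every other cell
    }


for n in (4, 16, 128):
    print(n, fully_parallel_resources(n))
```

For n = 128 this gives 16,256 connections, which illustrates why the fully parallel variant is restricted to small n.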
Hardware Architectures
How can we implement this model in hardware?
2. Memory-Based Sequential and Parallel Architectures
The fully parallel architecture needs a lot of resources and is restricted to a certain n. To handle a large number n of cells and to make the architecture scalable in resources, we designed memory-based architectures.
Hardware Architectures
[Figure: block diagram. Labels: Master, Crosspoint Matrix, Processing Cell, LP, LM (LP and LM repeated)]
Hardware Architectures
Excalibur Device Features
• ARM CPU (200 MHz) and FPGA (ca. 4100 LEs) on one die
• High-speed bus (AMBA) for communication between CPU and FPGA
• 32 MB DRAM, LAN, etc.
Cyclone Device Features
• ca. 20000 LEs
• 294,912 memory bits
Implementations
Bitonic Merge
• The algorithm sorts a bitonic sequence
• A sequence of numbers is called bitonic if its first part is ascending and its second part is descending, or if it is a cyclic shift of such a sequence
• Bitonic merge is the second part of a complete sorting algorithm
• Total complexity for sorting is O((log N)^2)
[Figure: bitonic merge of eight cells; the cell index is given in binary in parentheses, and each following row is the next generation]
2(000) 4(001) 6(010) 8(011) 7(100) 5(101) 3(110) 1(111)
2 4 3 1 7 5 6 8
2 1 3 4 6 5 7 8
1 2 3 4 5 6 7 8
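The merge of the example sequence 2 4 6 8 7 5 3 1 can be reproduced with a short Python sketch in the style of the GCA model. It is an illustration under assumptions, not the authors' hardware description: the function name `bitonic_merge` and the choice to compute the partner index as `i XOR distance` are ours, and n is assumed to be a power of two.

```python
def bitonic_merge(values):
    """Sort a bitonic sequence in log2(n) synchronous generations."""
    n = len(values)
    assert n > 0 and n & (n - 1) == 0, "this sketch assumes n is a power of two"
    data = list(values)
    trace = [list(data)]
    dist = n // 2
    while dist >= 1:
        new = []
        for i in range(n):
            partner = i ^ dist  # time-dependent global neighbor
            a, b = data[i], data[partner]
            # The cell with the lower index keeps the minimum, the other the maximum.
            new.append(min(a, b) if (i & dist) == 0 else max(a, b))
        data = new  # exchange the old and the new generation
        trace.append(list(data))
        dist //= 2
    return data, trace


result, trace = bitonic_merge([2, 4, 6, 8, 7, 5, 3, 1])
for row in trace:
    print(row)
```

Each generation performs n cell operations and there are ld n generations, so one merge costs n · ld n cell operations in a sequential simulation.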
Results: Measured Values on a PC
In software, one computation of a new state consists of the following steps:
• access the cell's state from a vector in memory
• access the remote cell states via the pointers
• compute the result
• buffer the result
After all new state values have been stored in the buffer memory, the buffer memory and the memory are exchanged.
number of sorting cells n | time for the whole algorithm tw | time for one cell operation T = tw / (n · ld n)
16   | 1.18 µs   | 18.4 ns
32   | 3.10 µs   | 19.4 ns
128  | 17.34 µs  | 19.3 ns
512  | 81.5 µs   | 17.7 ns
1024 | 183.48 µs | 17.9 ns
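The last column can be recomputed from the first two. The following Python sketch assumes that ld denotes the base-2 logarithm and that the formula reads T = tw / (n · ld n); the helper name `time_per_cell_op_ns` is hypothetical.

```python
import math

# Measured values from the table: n -> tw in microseconds
measurements_us = {16: 1.18, 32: 3.10, 128: 17.34, 512: 81.5, 1024: 183.48}


def time_per_cell_op_ns(n, tw_us):
    """T = tw / (n * ld n), returned in nanoseconds."""
    return tw_us * 1000.0 / (n * math.log2(n))


for n, tw in measurements_us.items():
    print(f"n = {n:4d}: T = {time_per_cell_op_ns(n, tw):.2f} ns")
```

The recomputed values agree with the table to within rounding, at roughly 18 to 19 ns per cell operation.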
Results: Fully Parallel Architecture
Prototype platform: Cyclone
• The number of bits per cell is eight
• A maximum of 128 cells could be implemented in the device because of the limited number of logic elements
array size n | logic elements | max. clock | time T for one cell operation, T = 1 / (max. clock · n)
4   | 122    | 154 MHz | 2 ns
8   | 349    | 133 MHz | 939 ps
16  | 1,044  | 125 MHz | 500 ps
32  | 2,184  | 118 MHz | 265 ps
64  | 4,936  | 102 MHz | 153 ps
128 | 14,785 | 83 MHz  | 94 ps
Compared to the software solution for an array size of n = 128, the hardware implementation is around 190 times faster.
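The time per cell operation and the speedup can be checked with a few lines of Python. This is a sketch under assumptions: the helper name `hw_time_per_cell_op_ps` is hypothetical, and the software reference of 18 ns is the rounded value quoted elsewhere in the talk, not the 19.3 ns measured for n = 128.

```python
def hw_time_per_cell_op_ps(n, max_clock_mhz):
    """T = 1 / (max. clock * n), returned in picoseconds."""
    return 1e6 / (max_clock_mhz * n)


# Rows from the table: n -> max. clock in MHz
hardware = {4: 154, 8: 133, 16: 125, 32: 118, 64: 102, 128: 83}

for n, clock in hardware.items():
    print(f"n = {n:3d}: T = {hw_time_per_cell_op_ps(n, clock):.1f} ps")

software_ns = 18.0  # assumed rounded software time per cell operation
speedup = software_ns * 1000.0 / hw_time_per_cell_op_ps(128, 83)
print(f"speedup for n = 128: about {speedup:.0f}x")
```

With the assumed 18 ns the ratio comes out close to the stated factor of about 190; using the 19.3 ns measured for n = 128 would give a somewhat larger factor.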
Results: Sequential Architecture
• Operational clock rate around 100 MHz; time for one cell operation 10 ns
• As opposed to the software solution (around 18 ns), the hardware implementation is 1.8 times faster
The memory requirements for the sequential architecture:
Data width in bits | Address width in bits | Memory usage
64 | 10 | 282,432 bits (96%)
28 | 11 | 274,432 bits (93%)
10 | 12 | 262,144 bits (88%)
Conclusion
• This is a first approach to implementing the GCA model in hardware
• For the (simple) bitonic merge problem, the FPGA implementation is faster than the software solution
• Further implementations are necessary to evaluate the capability of the model
• One application of the model is the acceleration of dedicated algorithms in embedded systems
Future Work
A more general purpose processing cell
[Figure: block diagram of the processing cell. Labels: registers R0..RF (including R2-CC, R3-CA, R4-OCC, R5-MD, R6-MA, RF-CID), ALU, program memory, data memory, OCC, STW]
Thank you very much for your attention.