Parallel Processor for Graphics Acceleration

Design and Development of a Floating-point Co-processor for the Acceleration of Graphics functions

Sandip Jassar

Department of Electronics, The University of York

Academic Supervisors: Andy Tyrrell, Jonathan Dell

Xilinx Development Centre, Edinburgh

Industrial Supervisor: Richard Walke

15th June, 2005

Final Project Report for the degree of MEng in Electronic and Computer Engineering

Abstract

This report details the design and development of a processing architecture, complete with a Controller and DMA unit. The reader is shown how the architecture was optimised for executing Nested-Loop Programs, and in particular those found in the geometric transformations stage of the OpenGL pipeline.

Contents

Section 1 : Introduction
  1.1 : Parallel Processing for Graphics Functions
  1.2 : OpenGL
    1.2.1 : The OpenGL Pipeline
    1.2.2 : The Geometric Transformation Process
      1.2.2.1 : The ModelView Transformation
      1.2.2.2 : The Projection Transformation
      1.2.2.3 : Perspective Division and the ViewPort Transformation
  1.3 : Overview of Transformations on Nested Loop Programs

Section 2 : FIR Filter Analysis
  2.1 : The Single MACC FIR Filter
  2.2 : The Transposed FIR Filter
  2.3 : The Systolic FIR Filter
  2.4 : The Semi-Parallel FIR Filter

Section 3 : Matrix-Vector Multiplication Analysis
  3.1 : The Sequential Model of Computation
  3.2 : Exploiting the Inherent Parallelism
    3.2.1 : Unrolling the Inner-Loop
    3.2.2 : Unrolling the Outer-Loop
  3.3 : Using Pipelined MACC Units

Section 4 : The Floating Point Unit
  4.1 : Dealing with Hazards
    4.1.1 : Using the Scoreboard to Detect Data Hazards
    4.1.2 : Managing Data Hazards
    4.1.3 : Managing Structural Hazards

Section 5 : The Controller
  5.1 : Initial Look-Up Table Design
  5.2 : Optimising the Controller for Running Geometric Transformation Programs
    5.2.1 : The Use of Induction Variables
    5.2.2 : Performing Strength Reduction on Induction Variables
    5.2.3 : Optimising a Geometric Transformation for the Controller
    5.2.4 : Designing the Optimal Controller
      5.2.4.1 : The Data Address Generator Design
      5.2.4.2 : The Use of Data Address Generators as Part of the Controller
      5.2.4.3 : The Loop Counter Structure
    5.2.5 : The High-Level Controller
    5.2.6 : The DMA Unit and Context Switching

Section 6 : Testing and Integration
  6.1 : Testing of the FPU's Control Unit
  6.2 : Testing and Integrating the Register File
  6.3 : Testing and Integrating the Execution Pipeline
  6.4 : Testing and Integration of the Look-Up Table Controller
  6.5 : Testing of the Optimal Controller's Data Address Generator
  6.6 : Testing and Integration of the Optimal Controller's Loop Counter Structure, Program Memory and DAGs
  6.7 : Testing and Integration of the DMA Unit
  6.8 : Testing and Integration of the High-Level Controller
  6.9 : The OpenGL Demonstration
  6.10 : Progress in Developing a Hardware Based Demonstration

Section 7 : Conclusion

Section 8 : Appendices

Section 9 : Acknowledgments

Section 10 : References

1 : Introduction

This section begins by describing the 3-D graphics rendering application domain, and the characteristics of the associated algorithms which allow for parallel execution. The industry standard OpenGL 3-D graphics rendering pipeline is analysed, with special focus given to the geometric transformations stage, which consists of a pipeline of matrix-vector manipulation operations that implement a key part of OpenGL's operation in defining what a scene looks like in terms of the position and orientation of the objects in it. The section then closes by looking at how Nested-Loop Programs (NLPs) such as the FIR filter can be transformed to exploit their inherent parallelism by making it more explicit.

Section 2 follows on from this and examines the different combinations of transformations that can be applied to the FIR filter, and the resulting implementation trade-offs. Section 3 carries out a similar analysis on another prominent NLP in graphics processing, the matrix-vector multiplication algorithm: similarities and differences with the FIR filter are analysed, and trade-offs between the different ways to optimise the algorithm's implementation are discussed. Section 4 details the design of an FPU which, by employing temporal parallelism, can exploit the characteristics inherent in the matrix-vector multiplication algorithm once they have been exposed by transforming the code as discussed in the previous section.

Section 5 goes through the design process for an Optimal Controller for the FPU, using a typical matrix-vector multiplication algorithm found in the geometric transformations stage of the OpenGL pipeline, and goes on to show how the processor architecture as a whole was optimised for this particular section of the OpenGL pipeline. Section 6 explains how the complete processor architecture was built up from its constituent blocks in a hierarchical fashion, after they were tested against their specification and integrated together into sub-systems, and ends by describing the demonstration of OpenGL's geometric transformations based on the processor architecture developed. Section 7 finishes by drawing conclusions from the project and putting the results achieved into context.

1.1 : Parallel Processing for Graphics Functions

Currently, multiprocessing is the technique used to carry out the significant arithmetic processing required to implement the realistic rendering techniques used in advanced 3-D graphics applications. Most often this takes the form of clusters of commodity PCs [1].

As discussed previously, graphics arithmetic largely consists of matrix and vector manipulation, and thus lends itself to parallel processing because of the independence of the individual operations required within the high-level matrix/vector function. Most high-level functions encountered in a graphics environment are also 'order-independent', as the order they are executed in has no effect on the final display. Such functions can thus be executed in parallel to achieve higher throughput, which is one of the fundamental strengths of FPGAs. However, some 'sequential' functions will be encountered which must be executed after all preceding and before all subsequent functions, so if a parallel processing architecture is employed it must deal with sequential functions with minimal degradation to the overall system performance.

Another benefit of having a number of identical processors operating in parallel is that the programming of each processor can be the same, so this method of obtaining high performance can also greatly simplify the software development process.

1.2 : OpenGL

The Open Graphics Library (OpenGL) is the industry's most widely used and supported 2-D and 3-D graphics API [2]. As such, there are thousands of applications based on OpenGL that are used to render compelling 2-D and 3-D graphics, in markets ranging from broadcasting, CAD/CAM, entertainment and cinematics to medical imaging and virtual reality; the most famous of all is Pixar's RenderMan (used by the movie industry to create special effects). Individual function calls in the OpenGL environment can be executed on dedicated tuned hardware, run as a software routine on the generic system CPU, or implemented as a combination of both. As a result of this implementation flexibility, OpenGL hardware acceleration can range from hardware which simply renders 2-D lines and polygons to a more advanced floating-point processor capable of transforming and computing geometric data.

1.2.1 : The OpenGL Pipeline

The two types of data that are input to the OpenGL pipeline are pixel data and geometric data. Pixel data is the RGBA data associated with pixels on the screen, and comes in the form of individual pixel colour values, images and bitmaps. Geometric data is used to model objects (ranging in complexity from simple 2-D shapes to realistic 3-D objects), and comes in the form of points, lines and polygons, which are OpenGL's three geometric primitives. All geometric data is eventually described as vertices. The data associated with each vertex is its 3-D positional coordinate vector, normal vector and material properties (used in lighting calculations), pixel colour value (an RGBA value or an index into a colour-map), and its texture coordinates (used to map a texture onto the vertex's parent object). Figure 1 below shows an overview of the OpenGL pipeline in terms of its various stages and the order in which operations occur.

Figure 1 : showing an overview of the OpenGL pipeline

As can be seen in figure 1 above, the vertex and pixel data are initially processed differently before both being used in the rasterization stage. All data can be input to the pipeline from the application and processed immediately, or saved in a display list which sends the data to the pipeline when the list is executed.

In the per-vertex operations stage, the three vectors associated with each vertex (spatial coordinates, normal and texture coordinates) are transformed (multiplied) by the current modelview matrix, its inverse transpose and the current texture matrix respectively. These transformations are carried out in order to transform the position, orientation and size of the vertices' parent objects in the scene. Lighting calculations are then performed on each vertex using its transformed spatial coordinate and normal vectors, material properties and the current lighting model. These calculations act so as to scale each vertex's pixel colour value to reflect the location and orientation of its parent polygon relative to the light source(s).

The viewing volume of the scene is defined by six clipping planes. In the primitive assembly stage the spatial coordinate vector of every vertex is transformed (multiplied) by the projection matrix, so that the scene can be clipped against the six planes of the viewing volume. Depending on the type of clipping employed, primitives may have vertices rejected, modified or added. The programmer may also define additional clipping planes to further restrict the viewing volume, in order to create cut-away views of objects and other similar effects. The equations that represent such additional clipping planes are used to transform the spatial coordinate vectors before the projection matrix is applied.

The spatial coordinate vectors of all vertices then go through perspective division, where primitives are scaled in size to reflect their distance from the viewer. This is followed by the viewport transformation, where the spatial coordinate vectors of all vertices are transformed so as to map the 3-D scene onto the 2-D viewport (viewing window) of the computer screen.

In the rasterization stage, all primitives (points, lines and polygons) are rasterized into fragments, where the squares of the viewport's integer pixel-grid that are occupied by each primitive are determined. If enabled, advanced features that make the rendered scene more realistic are also implemented in this stage. The most commonly used of these is anti-aliasing, which is used to smooth the jagged edges that result from having to map non-vertical and non-horizontal lines to the square pixel-grid of the viewport. Anti-aliasing calculates the portion of each square within the pixel-grid that would be occupied by a line if the line were to be drawn as originally defined (before the viewport transformation), and this value is known as a pixel's coverage value. A pixel's coverage value is used to scale the alpha component of its RGBA colour value. Figure 2 illustrates how pixel coverage values are evaluated.

Figure 2 : showing an example to illustrate how a pixel's coverage value is evaluated

With reference to figure 2 above, the green and orange show how a diagonal line looks before and after it is subjected to the viewport transformation respectively, and the coverage values for the pixels occupied by the line are given on the right.

In the pixel operations stage, pixel data that is input to the pipeline is scaled, biased and processed using a colour-map, after which the colour values are clamped to a certain range. The resulting pixel data is then either rasterized into fragments or written to texture memory for use in texture mapping. Data from the framebuffer can be read back and placed in the host processor memory, and data for texture mapping can be taken from either the host processor memory or the framebuffer.

If texturing is enabled, then in the per-fragment operations stage the texture coordinate vector of each vertex (of a fragment's primitive) is used to map the vertices to specific points on a two-dimensional texture image, so that the texture (known as the texel for that particular fragment) can be mapped onto the primitive appropriately after its position and orientation are transformed in the per-vertex operations stage. If enabled, other advanced features such as blending (used to create a photo-realism effect) are also implemented in the per-fragment operations stage. It is also in this stage that the coverage values calculated in the rasterization stage are applied, if anti-aliasing is enabled.

The final pixel values are then drawn (written) into the framebuffer.

1.2.2 : The Geometric Transformation Process

The overall transformation process for producing a scene for viewing is analogous to that carried out by a camera when it is used to take a photograph. This transformation process is carried out on the spatial coordinate vector of each vertex in the OpenGL pipeline, and is depicted below in figure 3.

Figure 3 : showing the stages of transformation for spatial coordinate vectors

As can be seen in figure 3 above, a vertex's spatial coordinate vector consists of not only the vertex's three-dimensional x, y, z coordinates, but also a w component which is used in the perspective division stage. All three vectors of a vertex in OpenGL have four elements, and thus all of the matrices that they are multiplied by are 4x4.

1.2.2.1 : The ModelView Transformation

A vertex's spatial coordinates are first presented to the pipeline as object coordinates. In this form the spatial coordinates specify a vertex's location in 3-D space when its parent primitive is centred on the origin and oriented in such a way that makes it easy for the programmer to visualise where its vertices are in 3-D space (i.e. with the primitive's edges parallel with and perpendicular to the axes). The modelview transformation is the combination of the modelling transformation and the viewing transformation, represented by the modelling and viewing matrices respectively. The modelling transformation is always carried out on an object before the viewing transformation, as by default the modelview matrix is formed by the viewing matrix pre-multiplying the modelling matrix.

The modelling transformation positions an object at a particular location in 3-D space relative to the origin (by performing translation), rotates the object relative to the axes (by performing rotation) and scales the object in size (by performing scaling). These three transformations are represented by their own matrices, which are depicted in figure 4 below. The order in which the modelling transformation carries out these transformations is determined by the order in which their respective matrices are multiplied together to form the modelling matrix. The transformation represented by a post-multiplying matrix is carried out before that represented by the pre-multiplying matrix, and this principle holds true in all instances where transformation matrices are combined together through multiplication.

T  =  | 1  0  0  x |
      | 0  1  0  y |
      | 0  0  1  z |
      | 0  0  0  1 |

Matrix T translates the object from being centred at the origin to the location defined by (x, y, z); the object's orientation and size are maintained.

S  =  | x  0  0  0 |
      | 0  y  0  0 |
      | 0  0  z  0 |
      | 0  0  0  1 |

Matrix S scales the object such that it is stretched by a factor of x, y, z in the direction of the corresponding axes; the object's orientation and position are maintained.

Rx  =  | 1    0      0     0 |
       | 0  cosΦ  -sinΦ    0 |
       | 0  sinΦ   cosΦ    0 |
       | 0    0      0     1 |

Ry  =  |  cosΦ  0  sinΦ  0 |
       |   0    1    0   0 |
       | -sinΦ  0  cosΦ  0 |
       |   0    0    0   1 |

Rz  =  | cosΦ  -sinΦ  0  0 |
       | sinΦ   cosΦ  0  0 |
       |   0      0   1  0 |
       |   0      0   0  1 |

Matrices Rx, Ry and Rz rotate the object clockwise by Φ degrees about the x, y and z axes respectively; rotation about more than one axis is achieved by multiplying the relevant R matrices together. The object's position and size are maintained.

Figure 4 : showing the translation, scaling and rotation matrices that are multiplied together to form the modelling matrix

With reference to figure 4 above, it can be seen that the translation transformation on a vertex is achieved by adding to each of its x, y and z components the product of its w component and the corresponding component of the translation vector down column 4 of the T matrix. The scaling transformation on a vertex is achieved by multiplying each of its components by the corresponding component of the scaling vector along the leading diagonal of the S matrix. The action of the rotation matrices is rather more complex, although it can be seen from the R matrices above that the vertex component associated with the axis being rotated about remains unchanged after the transformation.
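The ordering principle can be made concrete with a short code sketch. The following C fragment is a minimal illustration and not part of the report's processor design; the mat4 type, the mat4_mul helper and the row-major layout are assumptions made for this example. Forming the modelling matrix as M = T.R.S means a vertex is scaled first, rotated next and translated last, because the post-multiplying matrix acts first.

typedef struct { float m[4][4]; } mat4;   /* row-major 4x4 matrix */

/* c = a.b : the standard 4x4 matrix product */
static mat4 mat4_mul( const mat4 *a, const mat4 *b )
{
    mat4 c;
    for( int row = 0; row < 4; row++ )
        for( int col = 0; col < 4; col++ )
        {
            float acc = 0.0f;
            for( int k = 0; k < 4; k++ )
                acc += a->m[row][k] * b->m[k][col];
            c.m[row][col] = acc;
        }
    return c;
}

/* Modelling matrix M = T.R.S : since M.v = T.(R.(S.v)), the object is
 * scaled first, then rotated, then translated */
static mat4 modelling_matrix( const mat4 *T, const mat4 *R, const mat4 *S )
{
    mat4 RS = mat4_mul( R, S );
    return mat4_mul( T, &RS );
}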

The viewing transformation is analogous to adjusting the camera's location and the direction in which it points when taking a photograph of the scene. This transformation is comprised of a combination of translation (adjusting the viewpoint's location) and rotation (adjusting the viewpoint's direction), and the associated matrices are the same as those used to produce the modelling matrix (shown above in figure 4). These two transformations within the viewing transformation (on the viewpoint) have the exact reverse effect on the appearance of the scene to the corresponding transformations within the modelling transformation (on the scene's objects). The default viewpoint is at the origin and points in the negative z-direction. As objects are most often initially defined as being centred on the origin, the modelview transformation as a whole must perform a transformation simply for the objects in the scene to be visible, although any of the elementary transformations can be omitted by simply not including their respective matrices in the product forming the modelview matrix.

1.2.2.2 : The Projection Transformation

The eye coordinates resulting from the modelview transformation then go through the projection transformation, where they are converted to clip coordinates. The projection transformation defines the viewing volume. The shape of the viewing volume determines how objects are projected onto the screen and which objects (or portions of objects) are clipped out of the final scene. The most common type of projection used is perspective projection, which employs a frustum-shaped viewing volume, as illustrated below in figure 5.

Figure 5 : showing the frustum-shaped viewing volume employed by perspective projection

Perspective projection implements foreshortening, whereby the further away from the viewpoint an object is, the smaller it appears in the scene, thus emulating the way the human eye (or a camera) works. The projection matrix that represents the (perspective) projection transformation is depicted below in figure 6.

P  =  | 2n/(r-l)      0        (r+l)/(r-l)       0       |
      |    0       2n/(t-b)    (t+b)/(t-b)       0       |
      |    0          0       -(f+n)/(f-n)   -2fn/(f-n)  |
      |    0          0           -1             0       |

where n : near, f : far, l : left, r : right, b : bottom, t : top.

Matrix P clips the objects against the six planes of the viewing volume; the w component of each vertex is set to -z (the distance of the vertex from the origin, in a direction away from the viewpoint).

Figure 6 : showing the projection matrix for perspective projection

1.2.2.3 : Perspective Division and the ViewPort Transformation

The clip coordinates resulting from the projection transformation are then converted to normalized device coordinates through the process of perspective division, where the x, y and z coordinates of each vertex are divided by its w component (which is set to -z in the projection transformation, as described previously in figure 6). This scales the objects down in size to implement foreshortening, as discussed previously in regard to perspective projection. After the perspective division stage the four-element spatial coordinate vector of each vertex becomes a three-element vector, as the w component is discarded.

The last stage of the process is the viewport transformation, which maps the three-dimensional normalized device coordinates to the two-dimensional viewport, converting them to window coordinates. The equations that perform the viewport transformation are shown below in figure 7.

xw = ( ( xnd + 1 ) × ( width ÷ 2 ) ) + xo
yw = ( ( ynd + 1 ) × ( height ÷ 2 ) ) + yo

Figure 7 : showing the equations that perform the viewport transformation

With reference to figure 7 above, the (xo, yo) coordinate is the position of the bottom left-hand corner of the viewport (viewing window) on the screen, relative to the corresponding corner of the screen. The width and height parameters are the dimensions of the viewport (in pixels).
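As a concrete illustration, the last two stages of the process can be expressed in a few lines of C. This is a minimal sketch based on the definitions above; the vec4 struct and the function name are assumptions made for this example, not code from the project.

typedef struct { float x, y, z, w; } vec4;

/* Perspective division followed by the viewport transformation.
 * clip          : clip coordinates (x, y, z, w), with w set to -z by the projection matrix
 * xo, yo        : bottom left-hand corner of the viewport on the screen
 * width, height : dimensions of the viewport in pixels */
void viewport_transform( vec4 clip, float xo, float yo, float width, float height,
                         float *xw, float *yw )
{
    /* perspective division yields normalized device coordinates */
    float xnd = clip.x / clip.w;
    float ynd = clip.y / clip.w;

    /* viewport transformation (the equations of figure 7) */
    *xw = ( ( xnd + 1.0f ) * ( width / 2.0f ) ) + xo;
    *yw = ( ( ynd + 1.0f ) * ( height / 2.0f ) ) + yo;
}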

1.3 : Overview of Transformations on Nested Loop Programs

A key problem in parallel processing is the way in which the program for each processor is generated, such that the overall program is executed efficiently. Most graphics functions are described as a set of nested for-loops [3].

The Nested-Loop Program (NLP) form of an algorithm represents its sequential model of computation. The FIR filter operation is a simple example of an NLP, and its NLP form is shown below in figure 8 for calculating four output values.

for j = N:N+3
    for i = 0:N-1
        y[j] = y[j] + ( x[j - i] * h[i] )
    end
end

Figure 8 : showing the NLP form of the FIR filter operation

This sequential model of computation of an algorithm is representative of the way in which the algorithm would be implemented as software running on a standard single-threaded processor.

The algorithmic transformation of unrolling transforms an NLP such that its task-level parallelism is enhanced. As a result, tasks that are independent of each other are explicitly shown to be mutually exclusive, although the resulting representation of the algorithm is still functionally equivalent to the original. The independent tasks can then be mapped to separate processors, as the sketch below illustrates.
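For example, unrolling the outer j-loop of the figure 8 NLP by a factor of two makes the independence of consecutive output samples explicit. The following C-style fragment is illustrative only (the variable names are assumptions): the two accumulations in the loop body touch different output elements and could therefore be executed on separate processors.

/* Outer loop of the FIR NLP unrolled by a factor of two: the two
 * accumulations in each iteration are independent of one another */
for( j = N; j <= N + 3; j += 2 )
{
    for( i = 0; i < N; i++ )
    {
        y[j]     += x[j - i] * h[i];         /* task A */
        y[j + 1] += x[(j + 1) - i] * h[i];   /* task B, independent of task A */
    }
}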

The algorithmic transformation of skewing makes the dependences between operations less demanding, and thus allows for latency in the physical operators that will actually execute them.

The dependency graph is a useful way of visualising the dependences between the operations in an NLP, and represents an intermediate level of abstraction between the NLP and its implementation. An example dependency graph is shown in figure 9 for the FIR filter NLP of figure 8, where the outer j-loop has been unrolled by a factor of two and N = 4. Data dependences between tasks (represented as circles) are shown by one task forwarding data to another, and the two independent tasks are highlighted in different colours.

[Dependency graph diagram: the multiply-accumulate chains for the four output samples y[4] to y[7], with the two independent tasks from each unrolled iteration highlighted in different colours]

Figure 9 : showing the effect of unrolling the outer loop of the FIR filter NLP by a factor of 2

When the outer loop is unrolled by a factor of two, this is essentially the same as making two copies of it that can be executed in parallel. As there are two copies, in the implementation of this modified NLP each copy of the loop will need its own set of registers for storing coefficient and data values. The unrolling transformation thus translates to spatial parallelism being employed in the implementation.

The skewing transformation, however, translates to temporal parallelism (pipelining) being employed in the implementation, and as such a single set of registers is shared by the different iterations of the inner loop, whose executions are overlapped. These two transformations can be used to transform a sequential model of computation into something that is closer to a data-flow model, thus making it more suitable for efficient implementation (exploiting parallelism) in hardware.

Section 2 : FIR Filter Analysis

The FIR filter operation essentially carries out a vector dot-product in calculating each value of y[n]. This is illustrated below in figure 10 for N = 4.

[Diagram: the input samples x[n], x[n-1], x[n-2], x[n-3] multiplied pairwise by the coefficients h[0] to h[3] and summed]

x[n].h[0] + x[n-1].h[1] + x[n-2].h[2] + x[n-3].h[3]

Figure 10 : showing how the FIR filter operation is comprised of vector dot-products

Section 2.1 : The Single MACC FIR Filter

The Single MACC FIR filter (shown below in figure 11) is an implementation of the FIR filter's sequential model of computation and, as its name suggests, it is based on a single MACC (multiply-accumulate) unit. As such, the algorithmic description of this implementation is identical to that of the NLP description of the FIR filter (shown below in figure 12) without applying any unrolling or skewing transformations (which were discussed earlier in section 1.3). For simplicity it is assumed throughout this section, unless otherwise stated, that all references to MACC units refer to non-pipelined MACCs with a total latency of 1 clock cycle.

Figure 11 : showing an example H/W implementation of the Single MACC FIR filter [4]

The primary trade-off between sequential and parallel implementations of the same algorithm is the amount of hardware resources required versus the throughput achieved. As the Single MACC FIR filter implements the FIR filter function in a completely sequential manner, the required hardware resources are reduced by a factor of N, although so too is the throughput, as compared to a fully parallel implementation that would use one MACC unit for each of the N coefficients (where the N MACCs would be cascaded).

void singleMaccFirFilter( int num_taps, int num_samples, const float *x, const float *h, float *y )
{
    int i, j;          /* 'j' is the outer-loop counter and 'i' is the inner-loop index */
    float y_accum;     /* output sample is accumulated into 'y_accum' */
    const float *k;    /* pointer to the required input sample */

    for( j = 0; j < num_samples; j++ )
    {
        k = x++;       /* x points to x[n+j] and is incremented (post assignment)
                          to point to x[(n+j)+1] */
        y_accum = 0.0f;

        for( i = 0; i < num_taps; i++ )
        {
            y_accum += h[i] * *(k--);   /* y[n+j] += h[i] * x[(n+j) - i] */
        }

        *y++ = y_accum;   /* y points to the register address where y[n+j] is to be
                             written and is incremented (post assignment) to point to
                             the register address for the next output sample y[(n+j)+1] */
    }
}

Figure 12 : A code description of the Single MACC FIR filter

With reference to the code description of the Single MACC FIR filter shown above in figure 12, all of the required input samples are assumed to be stored in the register file (with a stride of 1) of the processor (a single MACC unit in this case) executing the code, with x initially pointing to the input sample x[n] corresponding to the first output sample to be calculated, y[n]. It is also assumed that all of the required coefficients are stored in the same way in a group of registers used to store the h[ ] array.

As can be seen from figure 12 above, and more clearly from the dependency graph of the Single MACC FIR filter (shown below in figure 13), this implementation evaluates (accumulates) only one output value at a time. Assuming that each multiplication of x[(n+j)-i] by h[i] takes one clock cycle, the performance of this implementation is given by the following equation:

Throughput = Clock frequency ÷ Number of coefficients
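As an illustrative (not measured) example: with a 400 MHz clock and N = 4 coefficients, the Single MACC FIR filter would produce 400 MHz ÷ 4 = 100 million output samples per second, whereas a fully parallel implementation with four cascaded MACC units could sustain the full 400 million samples per second.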

[Dependency graph diagram: a single MACC unit (MACC_1) sequentially accumulating the four inner-products of each output sample into y_accum, completing one output sample per iteration of the j-loop]

Figure 13 : dependency graph showing the operation of the Single MACC FIR filter

If the coefficients of the filter possess symmetry (i.e. h[0] = h[N-1], h[1] = h[N-2], etc.), a doubling of throughput can be achieved at the same clock frequency using a variation of the Single MACC FIR filter called the Symmetric MACC FIR filter. This implementation uses a single coefficient in place of each pair that are equal, and as such only one multiplication by that single coefficient is required, although the other multiplicand is now the sum of the two data values corresponding to those equal coefficients. Thus the cost of this performance enhancement is another adder at the input of the multiplier, as well as another RAM block (or the use of one dual-port block RAM), as the two input samples corresponding to the single coefficient need to be fetched simultaneously. However, as the number of coefficients to be processed is halved, so too is the amount of storage required for them. If N is an odd number, the Symmetric MACC FIR filter reduces the number of coefficients to N/2 + 1 (with integer division). The Symmetric MACC FIR is derived from the Single MACC FIR by unrolling its inner-loop by a factor of two and reversing the order in which one of the loops processes its respective coefficient-data pairs; a sketch of the resulting inner loop is given below. This is a unique example of employing spatial parallelism.
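The following C fragment is a minimal sketch of the idea for an even value of N (the names are assumptions made for this example): the inner loop now runs only N/2 times, and each iteration uses the pre-adder to sum the two input samples that share a coefficient before performing the single multiplication.

/* Symmetric FIR inner loop for even N, where h[i] == h[N-1-i]: the pair
 * of samples sharing coefficient h[i] is summed by the pre-adder before
 * the single multiplication, halving the number of MACC operations */
y_accum = 0.0f;
for( i = 0; i < N / 2; i++ )
{
    float pair = x[n - i] + x[n - (N - 1 - i)];   /* pre-adder */
    y_accum += h[i] * pair;                       /* one MACC per coefficient pair */
}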

Employing spatial parallelism is one way to enhance the performance of the Single MACC FIR filter, and essentially uses more than one MACC unit to evaluate each output sample. As a result each MACC unit evaluates an equal share of the coefficient-data sample multiplications, and as such if M MACC units are employed, the throughput is increased by a factor of M over the Single MACC FIR filter, although so too are the required hardware resources.

Section 2.2 : The Transposed FIR Filter

The Transposed FIR filter (an example H/W implementation of which is shown below in figure 14) is a fully parallel implementation, as one MACC unit is used for each of the N coefficients. Unlike the Direct form type I fully parallel implementation (which employs an adder tree structure), the Transposed FIR filter employs an adder chain, and as such the MACC units are much easier to connect together and the implementation can easily be scaled up or down in terms of N. With regard to targeting the design at the Xilinx Virtex-4 FPGA, because of this adder-chain structure the Transposed implementation can be contained entirely within the dedicated Xilinx DSP48 slices, as opposed to using generic FPGA fabric, which would yield a less efficient mapping.

Figure 14 : showing an example H/W implementation of the Transposed FIR filter [4]

The input data samples are broadcast to all MACC units simultaneously, and with this implementation the coefficients are assigned (in ascending order) starting from the right-most MACC unit (from which the final output is taken). As well as the spatial parallelism arising from the use of N MACC units, temporal parallelism is also employed, as the evaluation of successive output samples is overlapped, although this doesn't serve to increase throughput or decrease the latency before the first output sample appears relative to the Direct form type I implementation.

A code description for the Transposed FIR filter (with the number of taps N = 4) is given in figure A1 of appendix A; a simplified sketch of the same structure is given below.
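Since appendix A is not reproduced in this transcript, the following C model is offered as a hedged stand-in for the structure of figure 14 (the names and the behavioural style are assumptions): each element of s[ ] models one register of the adder chain, every multiplier sees the same broadcast input sample, and the output is taken from the right-most stage.

/* Behavioural C model of one clock cycle of a 4-tap Transposed FIR filter.
 * s[0..2] model the partial-sum registers of the adder chain; the current
 * input sample x_in is broadcast to all four multipliers */
#define TAPS 4

float transposed_fir_step( float x_in, const float h[TAPS], float s[TAPS - 1] )
{
    float y = s[2] + h[0] * x_in;   /* right-most stage produces the output */

    s[2] = s[1] + h[1] * x_in;      /* each chain register takes its left */
    s[1] = s[0] + h[2] * x_in;      /* neighbour's sum plus its own product */
    s[0] =        h[3] * x_in;      /* left-most stage starts a new sum */

    return y;
}

Calling this once per input sample, with s[ ] initialised to zero, reproduces the behaviour described below: after the chain has filled, one output sample emerges per call.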

The Transposed FIR filter design is yielded by a complete unrolling of the inner-loop (i-loop) of the original Single MACC FIR filter code description, which results in the number of MACC units required increasing to N. A skewing of the outer-loop (j-loop) by a factor of 1 is also performed, which results in the temporal overlapping of successive output sample calculations. This skewing is required to schedule apart the dependences that arise because the N MACC operations within any single iteration of the outer-loop are dependent on the MACC in the previous iteration of the inner-loop (for their third, accumulator argument).

The dependency graph of the Transposed FIR filter's operation is shown below (with N = 4) in figure 15.

[Dependency graph diagram: N = 4 MACC units with each input sample broadcast to all of them on the same cycle, and the partial sums (y_accum values) forwarded along the adder chain between successive outer-loop iterations]

Figure 15 : dependency graph showing the operation of the Transposed FIR filter (with N = 4)

As can be seen above in figure 15, the initial latency before the first output sample emerges is the same as that seen with the fully-parallel Direct form type I FIR filter. Once this initial latency (of the spin-up procedure, whereby the pipeline is filled over the first N cycles) has been endured, the throughput yielded by the Transposed FIR filter implementation is the same as that of the Direct form type I implementation (equal to the clock frequency). The latency between when each input sample is applied and the emergence of the corresponding output sample is also the same as that seen with the Direct form type I implementation, and is equal to the latency of a single MACC unit.

Section 2.3 : The Systolic FIR Filter

As with the Transposed FIR filter, the Systolic FIR filter (an example H/W implementation of which is shown below in figure 16) is a fully parallel implementation, and also uses an adder chain to accumulate each value of y[n].

Figure 16 : showing an example H/W implementation of the Systolic FIR filter [4]

The Systolic FIR filter also employs temporal parallelism in addition to spatial parallelism, in the same way that the Transposed FIR filter does. However, the Systolic FIR filter's coefficients are assigned (in ascending order) starting from the left-most MACC unit (to which the input samples are applied), which is the opposite of how the coefficients are assigned to the Transposed FIR filter's MACC units. As such, the Systolic FIR filter evaluates the inner-products (of which each value of y[n] consists) in the reverse order to the Transposed FIR filter.

The input data samples are fed into a cascade of registers which have the effect of a data buffer. The Systolic FIR filter differs from the Direct form type I implementation not only in its use of an adder chain to accumulate each value of y[n], but also in the additional register between each of the taps. A code description of the Systolic FIR filter (with the number of taps N = 4) is given in figure A2 of appendix A; a simplified sketch is again given below.
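Appendix A is again not reproduced here, so the following C model is a hedged sketch of the figure 16 structure (the names are assumptions). It models the tapped delay line with two data registers per stage (the two-cycle delay between MACC units) and one partial-sum register per stage.

/* Behavioural C model of one clock cycle of a 4-tap Systolic FIR filter.
 * d[k][0..1] model the two data registers between taps, and p[k] the
 * partial-sum register of stage k; stages are updated from right to left
 * so that each reads its neighbour's previous-cycle values, as the
 * hardware registers would */
#define TAPS 4

float systolic_fir_step( float x_in, const float h[TAPS], float d[TAPS][2], float p[TAPS] )
{
    for( int k = TAPS - 1; k >= 0; k-- )
    {
        float data_in = ( k == 0 ) ? x_in : d[k - 1][1];
        float sum_in  = ( k == 0 ) ? 0.0f : p[k - 1];

        p[k]    = sum_in + h[k] * d[k][0];   /* MACC for tap k */
        d[k][1] = d[k][0];                   /* second delay register */
        d[k][0] = data_in;                   /* first delay register */
    }
    return p[TAPS - 1];   /* valid N cycles after the corresponding input */
}

With d[ ][ ] and p[ ] initialised to zero, the model reproduces the N-cycle input-to-output latency discussed below.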

As with the Transposed FIR filter, the Systolic FIR filter is yielded by a complete unrolling of the inner-loop (i-loop) of the original Single MACC FIR filter, and a skewing of its outer-loop by a factor of 1. However, in generating the Systolic FIR filter from the Single MACC FIR, the outer-loop is skewed in the opposite direction to how it is skewed in generating the Transposed FIR filter. This means that the inner-products which are summed together to produce an output sample are evaluated in the opposite order to that in which they're evaluated by the Transposed FIR filter. This difference is reflected in the two H/W implementations: the Transposed FIR implementation employs a broadcast structure to feed the same input sample to each of its MACC units on each clock cycle, whereas the Systolic implementation employs a tapped delay line with a delay of two clock cycles between each MACC unit. The dependency graph of the Systolic FIR filter's operation is shown below in figure 17 (with N = 4).

[Dependency graph diagram: input samples advancing along the two-cycle tapped delay line (blue arrows) while partial accumulations are forwarded along the adder chain (green arrows)]

Figure 17 : dependency graph showing the operation of the Systolic FIR filter (with N = 4)

The green arrows represent the forwarding of a partial accumulation of an output sample through the adder chain, whilst the effect of the two registers between each of the MACC units is represented by the blue arrows, which show the forwarding of input samples between successive MACC units.

As with the Transposed FIR filter, the initial latency before the first output sample emerges, and the throughput thereafter, are the same as those seen with the Direct form type I implementation. However, because the Systolic FIR filter evaluates and accumulates the inner-products in the opposite order to the Transposed FIR filter, the latency between each input sample being applied to the filter and the corresponding output sample emerging is N clock cycles (assuming the latency of each MACC unit is 1 cycle). This latency is thus N times that seen with both the Transposed and Direct form type I implementations. However, the advantage that the Systolic FIR filter holds over the Transposed implementation is that its input is only applied to one MACC unit, unlike that of the Transposed FIR filter, whose input is broadcast to all of its MACC units and thus has a high fan-out. The Systolic implementation is therefore more suitable than the Transposed implementation for higher values of N.

Section 2.4 : The Semi-Parallel FIR Filter

The Semi-Parallel FIR filter (sometimes called the hardware-folded implementation) divides its N coefficients amongst M multiply-accumulate units. An example implementation of the Semi-Parallel FIR filter is shown below in figure 18 (with N = 16, M = 4).

Figure 18 : showing an example H/W implementation of the Semi-Parallel FIR filter (with N = 16, M = 4) [4]

Each group of N/M coefficients is assigned to one of the MACC units and stored in order within the associated coefficient memory. The first group (coefficients 0 to (N/M - 1)) is assigned to the left-most MACC unit (to which the input samples are applied), with ascending coefficient groups being assigned to the MACC units from left to right. If N is not exactly divisible by M, then the higher-order coefficient memories are padded with zeros.

Like the Transposed and Systolic implementations, the Semi-Parallel FIR filter employs both spatial and temporal parallelism, but the degree to which it does so depends on the ratio M:N, with a higher M:N ratio resulting in a higher degree of both spatial parallelism (as more MACC units are used) and temporal parallelism (as each output sample is evaluated more quickly, and thus the evaluation of more output samples can be overlapped in time). The trade-off is therefore the performance obtained versus the resources required, as can be seen from the equation for the performance of a Semi-Parallel FIR filter implementation:

Throughput = ( Clock frequency ÷ N ) × M
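In other words, one output sample is produced every N/M clock cycles (every 4 cycles for N = 16, M = 4). The following C fragment is a hedged behavioural sketch of that schedule for a single output sample (the names are assumptions, and the overlapping of successive output samples shown in figure 19 is deliberately omitted for clarity): M MACC units each work through their own group of N/M coefficients, and the output accumulator then combines the M partial sums.

/* Behavioural sketch of one output sample of a Semi-Parallel FIR filter:
 * in each of the N/M cycles all M MACC units fire in parallel, each
 * processing its own coefficient group; x[i] stands for the delayed
 * input sample x[n - i] */
#define N 16   /* number of taps */
#define M 4    /* number of MACC units */

float semi_parallel_fir_sample( const float h[N], const float x[N] )
{
    float macc_accum[M] = { 0.0f };
    float y = 0.0f;
    int cycle, unit;

    for( cycle = 0; cycle < N / M; cycle++ )
        for( unit = 0; unit < M; unit++ )
            macc_accum[unit] += h[unit * (N / M) + cycle] * x[unit * (N / M) + cycle];

    for( unit = 0; unit < M; unit++ )
        y += macc_accum[unit];   /* output accumulator */

    return y;
}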

The Semi-Parallel implementation may be extrapolated either towards being a fully-parallel implementation like the Transposed and Systolic implementations by using more MACC units, or the other way towards being a Single MACC FIR filter by using fewer MACC units. A code description of the Semi-Parallel FIR filter (with N = 16, M = 4) is given in figure A3 of appendix A. The dependency graph of the Semi-Parallel FIR filter's operation (with N = 16, M = 4) is shown below in figure 19.

[Dependency graph diagram: four MACC units each cycling through its group of four coefficients, with the output accumulator (ACC_1) combining the partial sums; successive output samples are colour-coded as their evaluations overlap in time]

Figure 19 : dependency graph showing the operation of the Semi-Parallel FIR filter

The red circles represent MACC units being used to calculate the inner-products of y[n], and the dark-red circles represent the output accumulator being used to accumulate the inner-products of y[n]. The blue and dark-blue circles represent the same for y[n-1], whilst the yellow circles represent MACC units being used to calculate the inner-products of y[n+1]. As can be seen in figure 19, the address (which lies in the range [0 : (N/M) - 1] for all MACC units) applied to the data buffer and coefficient memory of each MACC unit lags one behind the corresponding address of the immediately preceding MACC unit (to its immediate left), and all such addresses continuously and monotonically cycle from 0 to (N/M) - 1. This is necessary in order to employ temporal parallelism by overlapping (in time) the evaluation of successive output samples in the way shown in figure 19. This temporal parallelism is in turn necessary to achieve the Semi-Parallel implementation's maximum throughput of one output sample every N/M clock cycles, because once an output sample has been retrieved from the accumulator by the capture register, the accumulator must be reset (to either zero or its input value). For its input value to be the first sum of M inner-products of the next output sample to be evaluated, the evaluation of this sum needs to have finished in the previous clock cycle; otherwise the accumulator has to be reset to zero at the start of a new result-cycle (in which an output sample is accumulated). If the accumulator were set to zero between result-cycles in this way, one extra clock cycle would be required for the evaluation of each output sample, thus degrading performance.

Section 3 : Matrix-Vector Multiplication Analysis

Matrix-vector multiplication is essentially a series of vector dot-products, as element (r, 1) of the resultant vector is the dot-product of row r of the matrix with the multiplicand column vector. This is illustrated below in figure 20, which shows how the (4x1) resultant column vector R is formed from the multiplication of the (4x4) matrix M and the (4x1) column vector V.

R(1) = M(1,1).V(1) + M(1,2).V(2) + M(1,3).V(3) + M(1,4).V(4)
R(2) = M(2,1).V(1) + M(2,2).V(2) + M(2,3).V(3) + M(2,4).V(4)
R(3) = M(3,1).V(1) + M(3,2).V(2) + M(3,3).V(3) + M(3,4).V(4)
R(4) = M(4,1).V(1) + M(4,2).V(2) + M(4,3).V(3) + M(4,4).V(4)

Figure 20 : showing how matrix-vector multiplication is comprised of a series of vector dot-products

Section 3.1 : The Sequential Model of Computation

Matrix-vector multiplication is related to the FIR filter, as both algorithms consist of a series of vector dot-products. Considering both algorithms in their sequential form, their outer for-loop essentially iterates over the number of vector dot-products required, and their inner for-loop iterates over the vector/matrix-row element pairs. Figures 21 and 22 show a code description and a dependency graph of the matrix-vector multiplication problem's sequential model of computation respectively. For simplicity it is assumed throughout this section, unless otherwise stated, that all references to MACC units refer to non-pipelined MACCs with a total latency of 1 clock cycle.

void sequentialMatrixVectorMultiply( int num_matrix_rows, int num_matrix_cols, const float *m, const float *v, float *r )
{
    int row, col;

    for( row = 0; row < num_matrix_rows; row++ )
    {
        r[row] = 0.0f;

        for( col = 0; col < num_matrix_cols; col++ )
        {
            r[row] += m[row * num_matrix_cols + col] * v[col];   /* matrix is processed in ROW-MAJOR order */
        }
    }
}

Figure 21 : showing the code description of the sequential model of computation of the matrix-vector multiplication algorithm

[Dependency graph diagram: a single MACC unit (MACC_1) sequentially accumulating the four inner-products of each element R[row], processing one row of the matrix per iteration of the outer loop]

Figure 22 : showing the dependency graph of the matrix-vector multiplication algorithm's sequential model of computation (for a 4x4 matrix and 4x1 vectors)

However, as can be seen from figure 20 above, in matrix-vector multiplication each element of the matrix is a multiplicand of exactly one inner-product in one vector dot-product; thus, with reference to the program of figure 21, each element of the matrix is a multiplicand of one MACC operation in one specific iteration of the inner-loop within one specific iteration of the outer-loop. This is in contrast to the sequential (Single MACC) FIR filter algorithm, where each input sample is a multiplicand of one inner-product in each of N successive vector dot-products.

The multiplicand column vector of a matrix-vector multiplication is analogous to the coefficient vector used in the FIR filter algorithm, as all vector dot-products performed by both algorithms multiply these vectors by another vector.

Section 3.2 : Exploiting the Inherent Parallelism

The Transposed and Systolic FIR filter implementations discussed previously in sections 2.2 and 2.3 respectively were formed by completely unrolling the inner-loop and then skewing the outer-loop (by a factor of 1 in opposite directions) of the original sequential FIR filter code. With reference to figure 21 above, if the inner-loop of the matrix-vector multiplication algorithm is unrolled by any factor, then as previously discussed in section 2.2 (with regard to the FIR filter algorithm) the outer-loop has to be skewed for the MACC operations scheduled for simultaneous execution (by different MACC units) to be independent of one another. However, as already discussed, each matrix element is used as a multiplicand only once throughout the execution of the entire algorithm, and thus, unlike with the FIR filter algorithm, the direction in which the outer-loop is skewed essentially makes no difference: each MACC unit still has to access a separate matrix element at the start of each clock cycle. So unlike the Transposed FIR filter, the access to each matrix element (analogous to each input sample of the FIR filter) cannot be shared among all MACC units; similarly, unlike the Systolic FIR filter, there is no sense in feeding the matrix elements through a tapped delay line in order to amortise the overhead of accessing them.

Section 3.2.1 : Unrolling the Inner-Loop

Figure 23 below shows the dependency diagram of the matrix-vector multiplication code (in its sequential form) of figure 21 after its inner-loop has been completely unrolled (with each iteration executed on a separate MACC unit) and its outer-loop has subsequently been skewed by a factor of +1, in a way analogous to that used in creating the Systolic FIR filter. With this series of transformations, each MACC unit employed effectively processes one column of the matrix.

[Dependency graph diagram: four MACC units, one per matrix column, with the outer-loop iterations skewed so that the partial sums of R(1) to R(4) are passed between units; the spin-up and spin-down phases are highlighted in magenta and purple, and the steady-state in red]

Figure 23 : showing the dependency diagram of the matrix-vector multiplication algorithm's sequential model of computation after completely unrolling its inner-loop and skewing its outer-loop by a factor of +1

With reference to figure 23 above, the magenta and purple circles represent MACC units used during the spin-up and spin-down procedures respectively, whilst the red circles represent MACC units used during the steady-state. As can be seen from figure 23, execution is only in the steady-state for one clock cycle. In order to achieve better utilisation of the MACC units employed, several such matrix-vector multiplication problems could be carried out in succession to amortise the overhead of the spin-up and spin-down procedures. By doing this, the problems would execute approximately four times faster in total than on a single MACC unit, as used in the matrix-vector multiplication problem's sequential model of computation detailed previously in figures 21 and 22.

Alternatively, the inner-loop could be unrolled only by a factor of two (thus employing only two MACC units), where the sum of the first two inner-products of each vector dot-product would need to be stored as an intermediate value in a register file. This implementation of the matrix-vector multiplication algorithm would be approximately twice as fast as the implementation of its sequential model of computation, and without as much need to chain together several problems to amortise the spin-up and spin-down latency.

An advantage of this implementation (regardless of the factor by which the inner-loop is unrolled) is that each MACC unit uses the same vector element throughout the processing of its respective column of the matrix, and thus the fetch operation for each vector element is amortised over the execution of the entire problem. A sketch of the column-per-MACC schedule is given below.
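The following C fragment is a minimal sketch of this column-per-MACC schedule (the names are assumptions made for this example). The skew is modelled explicitly: at clock cycle t, MACC unit u, which owns column u and vector element v[u], works on output row (t - u) and receives the partial sum produced by unit u-1 on the previous cycle.

/* Skewed column-per-MACC schedule for a 4x4 matrix-vector multiplication.
 * partial[u][row] holds the partial sum leaving unit u-1 for output row
 * 'row'; the t-loop models the clock cycles (spin-up, one steady-state
 * cycle, spin-down), and the u-loop models the four MACC units */
#define DIM 4

void mv_inner_unrolled( const float m[DIM][DIM], const float v[DIM], float r[DIM] )
{
    float partial[DIM + 1][DIM];

    for( int t = 0; t < 2 * DIM - 1; t++ )
    {
        for( int u = 0; u < DIM; u++ )
        {
            int row = t - u;
            if( row < 0 || row >= DIM )
                continue;   /* unit u is idle during spin-up/spin-down */

            float sum_in = ( u == 0 ) ? 0.0f : partial[u][row];
            partial[u + 1][row] = sum_in + m[row][u] * v[u];   /* v[u] is reused every cycle */
        }
        if( t >= DIM - 1 )
            r[t - DIM + 1] = partial[DIM][t - DIM + 1];   /* completed dot-product */
    }
}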

Section 3.2.2 : Unrolling the Outer-Loop

Figure 24 below shows the dependency diagram of the implementation that results if instead the outer-loop is completely unrolled, so that each MACC unit employed processes one row of the matrix.

[Figure 24 : showing the dependency diagram of the matrix-vector multiplication algorithm's sequential model of computation after its outer-loop is completely unrolled. Each of MACC_1 to MACC_4 processes one row of the matrix, accumulating R(i) = V(1).M(i, 1) + V(2).M(i, 2) + V(3).M(i, 3) + V(4).M(i, 4) from an initial value of zero, with all four units sharing the same vector element V(col) in each cycle.]

As can be seen from figure 24 above, there are no dependencies across separate iterations of the outer-loop, so after this unrolling there is no need to skew any instance of the inner-loop. Therefore execution is always in the steady state (as only

spatial parallelism is employed), meaning that all MACC units are always utilised

during execution. This implementation of the matrix-vector multiplication algorithm

would be four times faster than the implementation of its sequential model of

computation, and there is no need to amortise any spin-up and spin-down latency over

the execution of multiple problems as was the case for the implementation that results

from unrolling the inner-loop. Another advantage of this implementation is that each vector-element is fetched only once (as is also the case when the inner-loop is unrolled), since the MACC units employed always share the same vector-element argument.
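A corresponding minimal C sketch of the fully unrolled outer loop is given below (again with illustrative names). Each statement corresponds to one MACC unit accumulating its own row's dot-product, with all four units sharing the same v[col] argument in every iteration:

void outerUnrolledMatrixVectorMultiply(const float m[4][4], const float v[4], float r[4])
{
    int col;

    r[0] = r[1] = r[2] = r[3] = 0.0f;
    for( col = 0; col < 4; col++ )
    {
        r[0] += m[0][col] * v[col]; // MACC_1 : row 1
        r[1] += m[1][col] * v[col]; // MACC_2 : row 2
        r[2] += m[2][col] * v[col]; // MACC_3 : row 3
        r[3] += m[3][col] * v[col]; // MACC_4 : row 4
    }
}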

3.3 : Using Pipelined MACC Units

Until now, the MACC units discussed have been non-pipelined, where new operands

are only issued to such a unit after it has finished processing its previous operands.

The Transposed, Systolic and Semi-Parallel FIR filter implementations discussed

previously in section 2, used a pipeline of non-pipelined MACC units. As was


discussed in section 2, pipelining temporally overlaps (skews) multiple execution

threads, and thus once the initial latency (whilst the pipeline is filled) has been

endured, the subsequent throughput achievable is n times greater (where n is the

degree of pipelining employed, and the pipeline is balanced). This results from the

non-pipelined execution unit being segmented into n stages, where each stage

contributes an nth

of the overall latency, and is thus able to be clocked at n times the

rate of the non-pipelined equivalent. Thus if a single MACC unit was pipelined by a

degree of n (and clocked at n times the clock frequency), once the initial latency of

filling the pipeline had been endured, the subsequent throughput would be n times

greater than that possible with the non-pipelined version. For simplicity, every

instance of the word pipeline throughout this document will refer to a balanced

pipeline, unless otherwise stated. The benefit of a pipelined MACC unit over its non-

pipelined equivalent is depicted below in figure 25.

[Figure 25 : showing an example to illustrate the benefit of a pipelined MACC over its non-pipelined equivalent (with n = 4) — over the same twelve clock cycles, the non-pipelined unit completes only MACC0 to MACC2, whilst the four-stage pipelined unit overlaps and completes MACC0 to MACC8.]

3.3.1 : Optimising the Code for Execution on a Pipelined MACC Unit

As discussed previously in section 2.2, the series of MACC operations that a specific vector dot-product comprises have data-dependencies. Thus if the matrix-vector multiplication problem's sequential model of computation (shown previously in

figure 21) was executed on a pipelined MACC (consisting of a pipelined multiplier

followed by a pipelined adder), the achievable throughput would not be any higher

than that with the non-pipelined version of the MACC. This is illustrated below in


figure 26. When executing the sequential model of computation, the pipelined

MACC effectively skews the outer-loop.

[Figure 26 : showing an example to illustrate the effect of issuing successive MACC instructions (that are dependent) to a pipelined MACC unit — with the dependencies, the pipelined unit completes MACC0 to MACC3 no faster than its non-pipelined equivalent. In both cases : MACC0 : R(1) += M(1, 1) * V(1); MACC1 : R(1) += M(1, 2) * V(2); MACC2 : R(1) += M(1, 3) * V(3); MACC3 : R(1) += M(1, 4) * V(4); where R(1) begins as zero.]

For simplicity, it is assumed that all three arguments are supplied as part of a MACC

instruction when it is issued to a MACC unit.

The code description of the matrix-vector multiplication algorithm shown below in

figure 27 is a re-write of the sequential code shown previously in figure 21. The outer

and inner loops have been swapped around, which thus requires the matrix to be

processed in column-major order (as opposed to row-major order as was the case for

the sequential code). The reason the two loops have been swapped around is so that

dependent MACC operations are scheduled as far apart as is possible, and this is

illustrated in the dependency diagram of this code which is shown below in figure 28.


void pipelinedMatrixVectorMultiply(int num_matrix_rows, int num_matrix_cols,
                                   const float m[num_matrix_rows][num_matrix_cols],
                                   const float v[], float r[])
{
    int row, col;

    for( col = 0; col < num_matrix_cols; col++ )
    {
        for( row = 0; row < num_matrix_rows; row++ )
        {
            r[row] += m[row][col] * v[col]; // matrix is processed in COLUMN-MAJOR order
        }
    }
}

Figure 27 : showing the code description of the sequential matrix-vector multiplication algorithm re-written for execution on a single pipelined MACC unit

[Figure 28 : Showing the dependency diagram of the code of figure 27 above, with the degree of pipelining n = 4. The MACC operations of the four dot-products are interleaved column by column, so that the four operations that update any one result element R(i) = V(1).M(i, 1) + V(2).M(i, 2) + V(3).M(i, 3) + V(4).M(i, 4) are scheduled four clock cycles apart, with each R(i) beginning as zero.]

As can be seen from figure 28, this re-written code description of the matrix-vector

multiplication algorithm essentially overlaps the execution of the vector dot-products

by interleaving their constituent MACC operations. In this way, the first inner-

product of all the vector dot-products in turn is calculated and accumulated, after

which the same is done for the second inner-product of all the vector dot-products,

and so on.

The number of vector-dot products a particular matrix-vector multiplication consists

of is equal to the number of elements in the resultant vector, which is equal to the


number of rows in the matrix. With regards to the matrix-vector multiplication

algorithm detailed in figures 27 and 28 above, as this number (represented by the

variable num_matrix_rows in figure 27) is increased, dependent MACC operations are

scheduled further apart in time. If this number is greater than or equal to the number

of pipeline stages within the MACC, then the optimum throughput of the MACC can be achieved (n times that of its non-pipelined version), as each time a MACC instruction is issued, all those it depends on will have completed, thus allowing it to be executed immediately.

As has been demonstrated, if the code is written such that dependencies are scheduled far enough apart, the use of a pipelined MACC can increase throughput by a factor of n (where n is the degree of pipelining).


Section 4 : The Floating Point Unit

The implementations of the matrix-vector multiplication algorithms discussed

previously in section 3 are all based on what were termed MACC units, which in

concept have the capabilities of triple-word read, write and multiply-accumulate.

This section details the design and implementation of a Floating-Point Unit (FPU) that

acts as one of those MACC units, and is pipelined in accordance with the findings of

section 3. As was seen previously in section 1.2.1, multiply-accumulate is not the

only elemental operation required to implement high-level OpenGL functions (and

graphics rendering functions in general). With this in mind, the FPU has been

designed in such a way that its instruction set is easy to extend. At the core of the

FPU is a 5-stage pipelined multiplier and a 3-stage pipelined adder. These may be

used in immediate succession to execute a multiply-accumulate (MACC) instruction,

or individually to execute either a multiply or an add instruction. These particular

pipeline lengths were chosen by considering the type of OpenGL program it was envisaged would be executed on the FPU during its developmental phase, and the FPU has been designed such that these pipeline lengths can easily be changed.

Figure 29 below depicts this initial FPU architecture.

Figure 29 : showing the initial design of the FPU

The control unit of the FPU is modelled as a program written in M-code, which is encapsulated within the Embedded Matlab Function block labelled FPU_Control_Unit on the diagram shown in figure B1 of appendix B. M-code is Matlab's equivalent to C-code, and the inputs to the FPU_Control_Unit block are passed through to and processed by the embedded program. This program is


executed and assigns values to its outputs once per step of Simulink’s simulation time.

One of these simulation time steps is analogous to one clock cycle.

The instruction format of the FPU has been devised so that it is compliant with the IBM

PowerPC interface standard, and is shown below in figure 30. This has been done to

allow for the future possibility of employing the FPU alongside a PowerPC RISC as

the latter’s Fabric Co-Processor Module (FCM), as the PowerPC is known to run

OpenGL code in this configuration.

[Figure 30 : showing the format of the FPU's instruction word — MNEMONIC in bits 26 to 21, RT/S in bits 20 to 14, RA in bits 13 to 7, and RB in bits 6 to 0.]

With reference to figure 30 above, the mnemonic field of an arithmetic instruction

tells the FPU_Control_Unit program which of the three types of arithmetic

instruction it is. The RT/S field is the register number of the instruction’s destination

register (and third source register of a MACC), and the RA and RB fields are the

instruction’s source register numbers.

The word length of all data processed by the FPU is 32 bits (in accordance with the standard IEEE-754 single-precision binary representation of floating-point numbers). Also part of the FPU is a

100-word register file, which all data is read from and written to. Throughout this

section it is assumed that all input data has already been loaded into this register file,

with section 5.2.6 later describing a DMA unit that was designed and developed to

transfer data in and out of the register file without holding the FPU up. The register

file facilitates three simultaneous reads and one write per clock cycle. Throughout

section 3 it was assumed that when a MACC instruction was issued to a pipelined

MACC unit, all three arguments were also supplied at once. However, when the FPU

begins the execution of a MACC instruction, only the two multiplicand arguments are

fetched immediately, and the third argument is fetched at the start of the accumulate

stage. Thus three reads on the register file per clock cycle must be provided so that

the FPU has the capability to begin executing a new MACC or multiply instruction in

the same clock cycle that it starts executing the accumulate stage of a down-stream

MACC instruction.

4.1 : Dealing with Hazards

The ability to issue the FPU any instruction that it supports in any clock cycle abstracts the programmer from the architecture. This allows them to get working code earlier in the design cycle (before optimisation), as opposed to code only working when it is exactly optimised for this FPU. In the future, if a compiler is

developed, this capability will allow the FPU to execute code written for other

architectures. The FPU has this capability as a result of being designed to prevent

structural and data hazards from manifesting into errors.


The FPU has been designed to deal with the structural hazard that occurs when the

new instruction issued is an add, and the accumulate part of a MACC instruction is

due to begin in that clock cycle. The conflict here is that the accumulate part of a

MACC instruction also requires its associated input arguments to be entered into the

adder unit. In this event, priority is given to the accumulate, and the FPU is stalled,

whereby it does not allow new instructions to be issued until the adder becomes

available and execution of the pending add instruction has subsequently begun.

The FPU has also been designed to deal with the data hazards that arise when a newly

issued instruction has a variable (represented by a specific register) that is the output

variable of an instruction in the Execution_pipeline at that time. These data hazards

are prevented from manifesting into an error by the FPU stalling whenever such an

instruction is issued to it. This prevents any instruction executed from fetching one of

its source registers before the register’s contents have been updated by a down-stream

instruction, and similarly writing to a register before its contents have been fetched by

the accumulate stage of a down-stream MACC instruction.

4.1.1 : Using the Scoreboard to Detect Data Hazards

The Scoreboard is essentially a table that maintains which registers in the register file

are a destination register of an instruction currently in the pipeline and when they will

be written to (updated). The FPU_Control_Unit program maintains the FPU's Scoreboard as a binary vector, with one 6-bit field for each of the 100 registers of the register file. Each of these fields is split into two sub-fields as illustrated below in figure 31.


[Figure 31 : Illustrating the concept of the Scoreboard and how it is represented. Conceptually the Scoreboard is a table with a count down and an execution unit entry for each register (R1 to R100) of the register file; within the FPU_Control_Unit it is actually represented as a binary vector of one 6-bit field per register (R1 in bits 5:0 up to R100 in bits 599:594), where each field holds COUNT DOWN in bits 5:2 and EXEC_UNIT in bits 1:0, with EXEC_UNIT codes 01 = adder and 10 = multiplier.]

As can be seen above in figure 31, the 2 least significant bits of a field (its exec_unit

subfield) represent the actual execution unit that will produce the result to be written

into the field’s respective register, and the 4 most significant bits of a field (its count

down subfield) represent the number of clock cycles before that write-back operation

will occur. After consideration of the type of NLPs to be executed on the FPU (as

detailed previously in section 3 and even section 2), for simplicity the FPU was not

designed to have the capability of arbitrating the execution of multiple write-back

operations. Thus only programs that produce strictly no more than one write-back

operation per clock cycle are supported.
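As a minimal C sketch of this encoding (the helper names are hypothetical), packing and unpacking one 6-bit Scoreboard field might look as follows:

// bits 5:2 = count down, bits 1:0 = exec_unit (01 = adder, 10 = multiplier)
unsigned packField(unsigned count_down, unsigned exec_unit)
{
    return ((count_down & 0xF) << 2) | (exec_unit & 0x3);
}

unsigned fieldCountDown(unsigned field) { return (field >> 2) & 0xF; }
unsigned fieldExecUnit(unsigned field)  { return field & 0x3; }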

The position of the field within the Scoreboard vector as a whole is representative of

the actual register it represents, where the least significant field represents register R1,

and successive fields represent the registers in ascending order. In each simulation

time-step the Scoreboard is updated to decrement any non-zero count down sub-fields

and add details of a new instruction if one is submitted to the Execution_pipeline by

the FPU_Control_Unit program.

As stated previously, there are 100 registers in the register file and thus the Scoreboard binary vector consists of 6 x 100 = 600 bits. In Simulink unsigned integers are represented using 32 bits, and thus to model this vector it was broken up into twenty 30-bit vectors. This is illustrated below in figure 32.

[Figure 32 : Illustrating how the Scoreboard binary vector is split into twenty segments in order to represent it in Simulink — Scoreboard_1 holds the fields of R1 to R5, Scoreboard_2 those of R6 to R10, and so on up to Scoreboard_20 holding R96 to R100, each segment being a 30-bit vector of five 6-bit fields.]

With reference to figure 32 above, Scoreboard_1 holds the status of registers R1 to

R5, Scoreboard_2 holds the status of registers R6 to R10, and so on for successive

Scoreboards up to Scoreboard_20. Each of the twenty Scoreboard segments is represented within the FPU_Control_Unit program as a persistent variable.

Figure 33 below shows how the Scoreboard is updated each time the FPU submits a

new instruction to its Execution_pipeline.


[Figure 33 : showing an example to illustrate how the FPU updates the Scoreboard each time a new instruction is submitted to the Execution_Pipeline. The instructions issued are: clock cycle 0 : R5 += R1 * R2 (MACC, count down set to 9, exec unit 1); clock cycle 1 : R1 = R3 * R4 (multiply, count down set to 6, exec unit 2); clock cycle 2 : R4 = R2 + R3 (add, count down set to 4, exec unit 1); with each destination register's count down decrementing in every subsequent clock cycle.]

Figure 33 above shows how the state of the Scoreboard (for the registers of concern) changes over three successive clock cycles, in which MACC, multiply and add instructions are issued to the FPU and subsequently submitted to the

Execution_pipeline where their execution begins. As can be seen in figure 33, each

time an instruction is submitted to the Execution_pipeline the count down entry in the

Scoreboard for that instruction’s destination register is set to the latency of the

instruction (in clock cycles) and this value is subsequently decremented in each clock

cycle thereafter. With reference to figure 33, registers R4, R1 and R5 will be written

to in clock cycles 6, 7, and 9 respectively. The exec_unit entry for the instruction’s

destination register is set to the code representing the particular execution_unit within

the Execution_pipeline that will produce its result as detailed previously in figure 31.

A synopsis of the FPU_Control_Unit sub-function updateScore() that is responsible

for updating the Scoreboard (in the way shown in figure 33 above) each time a new

instruction is submitted to the Execution_pipeline is shown in figure B2 followed by a

description in appendix B.
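As a behavioural C model of the updateScore() idea (the names and enumerations below are illustrative, not taken from the M-code; the latencies are the initial countdown values seen in figure 33), submitting an instruction might update the entries as follows:

enum mnem { MNEM_ADD, MNEM_MULT, MNEM_MACC };
enum { LAT_ADD = 4, LAT_MULT = 6, LAT_MACC = 9 };      // countdowns from figure 33
enum { EXEC_NONE = 0, EXEC_ADDER = 1, EXEC_MULT = 2 }; // codes from figure 31

static unsigned countdown[101], exec_unit[101];        // entries for R1 to R100

void updateScore(int dest_reg, enum mnem m)
{
    switch( m )
    {
    case MNEM_ADD:  countdown[dest_reg] = LAT_ADD;  exec_unit[dest_reg] = EXEC_ADDER; break;
    case MNEM_MULT: countdown[dest_reg] = LAT_MULT; exec_unit[dest_reg] = EXEC_MULT;  break;
    case MNEM_MACC: countdown[dest_reg] = LAT_MACC; exec_unit[dest_reg] = EXEC_ADDER; break;
    }
}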


The code of the sboardCycle() sub-function of FPU_Control_Unit is shown in figure

B3, followed by a description in appendix B. This sub-function is executed every

clock cycle (once per Scoreboard segment) to update the Scoreboard so as to reflect

the passing of one clock cycle. This is done by decrementing all non-zero countdown

sub-fields throughout the entire Scoreboard. A countdown value of 1 is encountered

when its parent field represents the register on which a write back operation is

scheduled to occur in the current clock cycle (as the countdown value will become

zero when it’s decremented in that same execution of sboardCycle()), thus in this

event sboardCycle() asserts the write back operation. After asserting a write back

operation, sboardCycle() sets the field’s exec_unit sub-field to zero. Thus all

Scoreboard fields whose respective registers are not destination registers of an

instruction currently in the Execution_pipeline are maintained with a zero value in

both sub-fields.
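A behavioural C model of the sboardCycle() idea is sketched below, continuing the countdown and exec_unit arrays of the sketch above (assertWriteBack() is a hypothetical stand-in for the write-back logic, not a function from the design):

void assertWriteBack(int reg, unsigned unit); // hypothetical write-back hook

void sboardCycle(void)
{
    int r;
    for( r = 1; r <= 100; r++ )
    {
        if( countdown[r] == 0 )
            continue;                         // register not pending
        if( countdown[r] == 1 )
            assertWriteBack(r, exec_unit[r]); // write back falls in this cycle
        if( --countdown[r] == 0 )
            exec_unit[r] = EXEC_NONE;         // field returns to the idle state
    }
}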

4.1.2 : Managing Data Hazards

As already discussed at the beginning of this section, when the Controller

(discussed later in section 5) issues the FPU with a new instruction,

FPU_Control_Unit checks the Scoreboard to decide whether or not submitting the

instruction to the Execution_pipeline may cause an error to occur due to a data

hazard. This is illustrated below in figure 34.

[Figure 34 : showing an example to illustrate how the Scoreboard is checked to prevent data hazards. The instructions issued are: clock cycle 0 : R5 += R1 * R2 (MACC, red bar); clock cycle 1 : R1 = R3 * R4 (multiply, blue bar, passes the check); clock cycle 2 : R5 = R1 + R2 (add, green bar, fails the check on both R1 and R5).]

With reference to figure 34 above, the multiply instruction issued to the FPU in clock

cycle 1 (represented by the blue bar) passes the Scoreboard check (assuming there are

no instructions in the Execution_pipeline before clock cycle 0) because in clock cycle

1 neither of its source registers nor its destination register is the target of a pending

write back operation. However the add instruction issued to the FPU in clock cycle 2

(represented by the green bar) fails the Scoreboard check for two reasons. Firstly,

one of its source registers (R1) is the target of the write back operation scheduled in

clock cycle 7 after the multiply instruction. Thus if this add instruction was submitted

to the Execution_pipeline in clock cycle 2, it would fetch and use the contents of R1

in the same clock cycle, before they had been updated in clock cycle 7 by the multiply

instruction.

In order to abstract the programmer from the architecture it must be assumed that they

don’t consider the latency of the instructions and that their intention in this event

would be for the result of the multiply instruction to be added to R2 by the add

instruction. Thus if this was the only cause for the Scoreboard check failure, the add

instruction could be submitted to the Execution_pipeline (without causing a data

hazard) in clock cycle 8 or thereafter.

However, a second cause of the add instruction's Scoreboard check failure is that its

destination register (R5) is the target of a write back operation scheduled in clock

cycle 9 after the MACC instruction (represented by the red bar). Thus a data hazard

exists because the destination register of a MACC instruction is first used as its third

source register. As such, if the add instruction was submitted to the

Execution_pipeline whilst the MACC instruction was still being executed, it could

write to R5 before this register had been fetched for the accumulate stage of the

MACC instruction. For simplicity, this event always results in a Scoreboard check

failure, regardless of whether or not the instruction’s write back would occur after the

register fetch for the add stage of the MACC instruction.

The sub-function of FPU_Control_Unit that is employed to check the Scoreboard for

a particular register is checkScore(), a synopsis of which is shown in figure B4,

followed by a description in appendix B.
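A behavioural C model of the checkScore() idea, again continuing the arrays of the earlier sketches (the names are illustrative):

int checkScore(int reg) // 1 = safe, 0 = a write back to this register is pending
{
    return countdown[reg] == 0;
}

// an instruction may be submitted only if every register it names passes the check
int passesScoreboardChecks(int rt_s, int ra, int rb)
{
    return checkScore(rt_s) && checkScore(ra) && checkScore(rb);
}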

4.1.3 : Managing Structural Hazards

As well as the data hazards discussed previously in sections 4.1.1 and 4.1.2, a potential structural hazard also exists, as the Execution_pipeline's adder is used in executing both the add instruction and the accumulate stage of the MACC instruction. Figure 35 below shows an example to illustrate this structural hazard, and how it is dealt with.


[Figure 35 : showing an example to illustrate the structural hazard concerning the use of the Execution_pipeline's adder by both MACC and add instructions. A MACC instruction (R5 += R1 * R2) is issued in clock cycle 0 and split into its multiply-stage and accumulate-stage, with the accumulate-stage due to enter the adder in clock cycle 5; of the two alternative instructions issued in clock cycle 5, the multiply (R1 = R3 * R4) is accepted, but the add (R4 = R2 + R3) is refused because it would require immediate use of the adder.]

Figure 35 above shows a MACC instruction (issued in clock cycle 0) split into its two

stages, where the red bar represents the multiply-stage and the purple bar represents

the accumulate-stage. The dark red section of the combined bar represents the clock

cycle in which the last stage of the multiply and the first stage (register fetch) of the

add are conducted in parallel. This is done so as to hide the latency of the add-stage’s

register fetch. Figure 35 shows the outcome of issuing two alternative instructions to

the FPU in that clock cycle (5). As is the case with both the multiply and MACC

instructions, if the instruction issued in this clock cycle does not require immediate

use of the adder then it is eligible for submission to the Execution_pipeline.

However, as can be seen in figure 35 this is not the case with the add instruction, and

as such FPU_Control_Unit would not submit an add to the Execution_pipeline in this

situation, regardless of whether or not it passed its Scoreboard checks. Priority is

always given to the accumulate-stage of a MACC in this way for simplicity. In the

situation depicted by figure 35, the earliest time after this clock cycle that an add

instruction could be submitted to the Execution_pipeline would be clock cycle 6.

Details of how the FPU program implements the execution of the different

instructions and the management of hazards are contained in appendix B.


Section 5 : The Controller

The Controller is responsible for issuing the FPU with the next instruction to be

executed. If the FPU stalls in the event of detecting a structural or data hazard, it

asserts a ‘1’ on its stall output and the Controller must re-issue the stalled instruction

in subsequent clock cycles until the FPU does submit it to its Execution_pipeline and

asserts a '0' back on its stall output. In the clock cycle after this submission occurs,

the Controller must issue the FPU with the next instruction to be executed.

5.1 : Initial Look-Up Table Design

The initial design of the Controller was essentially a look-up table, where every

instruction of the program to be run was stored sequentially in program memory.

This look-up table design of the Controller is shown below in figure 36.

Figure 36 : showing the initial look-up table design of the Controller

For simplicity, during this developmental phase the PROG_COUNTER (counter) was

initialised with its count_from and count_to block parameters before running the

simulation. Similarly, the PROG_MEMORY (single-port RAM) was initialised with

the sequence of program instructions through its initial_value_vector block parameter.

The output of PROG_MEMORY is separated out into its constituent fields

(mnemonic, RT/S, RA and RB) as bit-slicing is not supported in Simulink’s Embedded

Matlab function block, thus preventing this from being carried out within

FPU_Control_Unit. With reference to figure 36 above, when the stall input is ‘0’,

both PROG_COUNTER and PROG_MEMORY are enabled, thus allowing the

program counter to progress by 1 and the output register of the RAM block to be

written to. As such, when the stall signal is ‘0’ successive instructions of the program

are output by PROG_MEMORY at a rate of one per clock cycle. Figure 37 below

uses an example to show how the Controller deals with an FPU stall.


[Figure 37 : showing an example to illustrate how an FPU stall is dealt with by the Controller — traces over clock cycles 0 to 8 of the program counter (a0 to a6), the instruction word issued (i0 to i5) and the stall signal, with the stall asserted whilst instruction i2 is pending.]

In the example shown in figure 37 above, the FPU is issued with instruction i2 in

clock cycle 3 but cannot submit it to the Execution_pipeline for two successive clock

cycles (3 and 4). As can be seen in figure 37 above, when the FPU stalls and asserts a

‘1’ on the stall signal, PROG_COUNTER has advanced to the address of the next

instruction (a3) by the time the count has been stopped. However, as the output register of PROG_MEMORY is disabled, its output value remains

as the word-value of the stalled instruction. In the clock cycle that the FPU does

submit i2 to the Execution_pipeline, it asserts a ‘0’ back on the stall signal, thus

enabling PROG_MEMORY’s output register, which is then written to with the word-

value of the next instruction to be executed. This can be seen in figure 37, where the

stalled instruction (i2) is submitted for execution in clock cycle 5, and in the

subsequent clock cycle the FPU is issued with the next instruction (i3).

5.2 : Optimising the Controller for Running Geometric Transformation

Programs

As the problem size (number of instructions the program consists of) increases,

storing every single instruction requires the program memory capacity to be bigger

than is practical. For example, the matrix-vector multiplication program shown in figure 21 and discussed previously in section 3.1 (where the matrix is 4x4 and the vector is 4x1) is executed using 16 separate MACC instructions (ignoring any load and store operations required to get data in and out of the register file). Thus the

size of the program memory required to store this program when completely unrolled

is 16 instruction words.

Figure 38 below illustrates a similar program to that shown in figure 21 which

multiplies eight 4x1 Vectors by the same 4x4 matrix. This is a typical example of the

program run in carrying out both the modelview and projection transformations in the

per-vertex operations stage of the OpenGL pipeline, as discussed previously in

section 1.2.2.


[Figure 38 : illustrating an example of the matrix-vector multiplication program carried out by OpenGL's modelview and projection transformations — a single 4x4 matrix M multiplies each of the eight 4x1 vectors V1 to V8 to produce the eight 4x1 result vectors R1 to R8.]

With reference to figure 38 above, the M matrix represents either the modelview or

projection matrix depending on which transformation is to be performed, and vectors

V1 to V8 represent the object or eye coordinate vectors of an object in the scene. The

object being transformed in this particular program has eight vertices, the simplest

example of which would be a cube. Figure 39 below shows a code description of this

program.

void pipelined1Matrix8VectorMultiply(int num_matrix_rows, int num_matrix_cols,
        const float m[num_matrix_rows][num_matrix_cols],
        const float v1[], const float v2[], const float v3[], const float v4[],
        const float v5[], const float v6[], const float v7[], const float v8[],
        float r1[], float r2[], float r3[], float r4[],
        float r5[], float r6[], float r7[], float r8[])
{
    int row, col;

    for( col = 0; col < num_matrix_cols; col++ )
    {
        for( row = 0; row < num_matrix_rows; row++ )
        {
            r1[row] += m[row][col] * v1[col]; // matrix m is processed in COLUMN-MAJOR order
            r2[row] += m[row][col] * v2[col];
            r3[row] += m[row][col] * v3[col];
            r4[row] += m[row][col] * v4[col];
            r5[row] += m[row][col] * v5[col];
            r6[row] += m[row][col] * v6[col];
            r7[row] += m[row][col] * v7[col];
            r8[row] += m[row][col] * v8[col];
        }
    }
}

Figure 39 : showing a code description of the example geometric transformation program depicted previously in figure 38


Previously in section 3.3.1 an analysis of how to optimise the matrix-vector

multiplication algorithm for execution on a pipelined MACC unit was detailed. In

conjunction with the findings of this analysis, the two loops in the code of figure 39 above are arranged such that the matrix is processed in column-major order, so as to

schedule dependent MACC instructions further apart in time and thus avoid periods of

latency due to FPU stalls. However, this program has eight vectors and the eight

separate matrix-vector multiplication problems are interleaved so as to schedule the

dependent MACC instructions (within each individual problem) even further apart in

time, allowing for an even greater overall pipeline depth and thus a higher throughput

to be achieved.

Completely unrolling both loops will yield the fastest execution speed, as it entirely removes the overhead of having to test loop conditions and execute branch operations (to set the program counter back to the beginning of a loop). These overheads can also be eliminated without unrolling any loops, by implementing the loop tests and branch operations within the Controller, although this would be at the expense of added hardware resources. However, as discussed previously, the disadvantage of completely unrolling both loops is that the size of program memory required is increased by a factor equal to the degree of unrolling.

As can be seen from figure 39 above, the inner-loop of the program contains 8 MACC

instructions, and so if this program was represented with both loops completely

unrolled, the size of the required program memory would be 8x4x4=128 instruction-

words. With both loops completely unrolled, transformations of larger sizes (i.e. where there are more vertices in the scene overall) could be solved by running the program on small sets of vectors (vertices) at a time, thus reducing the number of instructions that need to be stored at any one time, and likewise the size of program memory required, although this approach would introduce the overhead of switching between these smaller programs.

5.2.1 : The Use of Induction Variables

As can be seen from the code of figure 39 above, all of the instructions are MACC

instructions. Thus when the program is represented with both loops completely

unrolled, the instructions stored in program memory would all have the same

mnemonic, and differ only in their three source/destination register fields (RT/S, RA

and RB). All of the program’s instruction-words have the same mnemonic, and their

RT/S, RA and RB fields always address the same arrays.

To simplify the addressing of arrays in NLPs, the majority of DSP compilers introduce induction variables, which by definition are derived from the loop index values. Considering the geometric transformation program of figure

39 for just one vector, a code description of this illustrating how induction variables

would be used to address the output and two input arrays is shown below in figure 40.


for( col = 0; col < num_matrix_cols; col++ )
{
    for( row = 0; row < num_matrix_rows; row++ )
    {
        // matrix m is processed in column-major order
        p = (col * num_matrix_rows) + row; // p is an induction variable
        *(r1 + row) += *(m + p) * *(v1 + col);
    }
}

Figure 40 : showing an example to illustrate the use of induction variables for addressing arrays

As can be seen in figure 40 above, the m array is indexed by adding its respective

induction variable p to its base pointer. Although p is the only new variable

introduced, the r1 and v1 arrays are also addressed in this way, where their respective

induction variables are exactly the values of the loop indices. With reference to figure

40 above, it can be seen how the use of induction variables allows for the successive

program instructions to be generated, with only a single generic instruction-word

stored in program memory of the form shown below in figure 41.

[Figure 41 : showing the single generic instruction-word from which all program instructions could be derived for the program of figure 40 — MACC in the MNEMONIC field (bits 26 to 21), R1_base_pointer in RT/S (bits 20 to 14), M_base_pointer in RA (bits 13 to 7), and V1_base_pointer in RB (bits 6 to 0).]

With reference to figure 41 above, the Controller would pass the mnemonic field

straight on to the FPU, but between issuing successive instructions it would have to

evaluate the values of the RT/S, RA and RB fields, which would require additional

hardware resources. If this penalty was migrated into software by issuing the FPU

with instructions to calculate the induction variable values, this would eliminate the

need for extra resources (apart from the extra program memory required) at the cost of

increased execution time; this is not an option, however, as the FPU does not have a register-address internal data-format. With reference to the code of figure 40 above,

for this program these extra hardware resources would amount to two adders for

evaluating the R1 array addresses, likewise another two adders for evaluating the V1

array addresses and an adder and a multiply-add unit for evaluating the M array

address.

5.2.2 : Performing Strength Reduction on Induction Variables

In applying strength reduction to all three induction variables of the program of figure

40, the overhead cost of each inner-loop iteration is reduced. Figure 42 below shows

the code description after all three induction variables have undergone strength

reduction.


p = 0;
r = 0;
c = 0;
for( col = 0; col < num_matrix_cols; col++ )
{
    for( row = 0; row < num_matrix_rows; row++ )
    {
        *(r1 + r) += *(m + p) * *(v1 + c); // matrix m is processed in column-major order
        r++;
        p++;
    }
    r = 0;
    c++;
}

Figure 42 : showing the code description of the matrix-vector multiplication algorithm after all three induction variables have undergone strength reduction

As can be seen from figure 42 above, there is no longer the need for a multiplication

in evaluating successive values of p, thus the additional hardware required to evaluate

the p induction variable is now down to two adders. To facilitate this strength

reduction on p, the m matrix must be stored in column major order. Strength

reduction also removes any dependencies of induction variables on the loop indices

(as is the case for those associated with the R1 and V1 arrays), which provides more

flexibility, as not all programs will have induction variables that directly correspond

to the loop indices.

5.2.3 : Optimising a Geometric Transformation for the Controller

Considering the code shown in figure 39 above, where eight separate matrix-vector

multiplication problems are interleaved, this could be executed using separate base

pointers for the arrays of the separate problems, whilst using the same induction

variables across all problems. A code description of this solution is shown below in

figure 43.

p = 0;
r = 0;
c = 0;
for( col = 0; col < num_matrix_cols; col++ )
{
    for( row = 0; row < num_matrix_rows; row++ )
    {
        // matrix m is processed in column-major order
        *(r1 + r) += *(m + p) * *(v1 + c);
        *(r2 + r) += *(m + p) * *(v2 + c);
        *(r3 + r) += *(m + p) * *(v3 + c);
        *(r4 + r) += *(m + p) * *(v4 + c);
        *(r5 + r) += *(m + p) * *(v5 + c);
        *(r6 + r) += *(m + p) * *(v6 + c);
        *(r7 + r) += *(m + p) * *(v7 + c);
        *(r8 + r) += *(m + p) * *(v8 + c);
        r++;
        p++;
    }
    r = 0;
    c++;
}

Figure 43 : showing a code description of the program with eight matrix-vector multiplication problems interleaved, with all induction variables having undergone strength reduction.

With reference to figure 43 above, to run this code the Controller would have to store one generic instruction word of the form shown previously in figure 41 for each of the eight problems (i.e. one instruction word per vertex). This disadvantage arises from the

corresponding array base pointers having different values across the eight different

problems. The number of vertices an object has, or the number that are in a scene can

be huge (reaching the tens of thousands in very detailed scenes), thus it is desired that

only one generic instruction word be stored in program memory, from which all

successive program instructions are derived.

In order to achieve this, the corresponding arrays across the different problems need

to be combined. Considering the way the geometric transformation interleaves the

execution of the problems, in order to keep the expressions for evaluating the

induction variable values as simple as possible, the best way to combine the arrays is

to interleave them. As such, the register numbers of successive array elements

accessed can be evaluated largely by simple increment operations. This is illustrated

below in figure 44 which illustrates an example register file arrangement for the three

arrays.


[Figure 44 : showing an example arrangement within the FPU's register file, where the three arrays of eight matrix-vector multiplication problems are interleaved together — the M matrix is stored in column-major order from R1 (M(1, 1) in R1, M(2, 1) in R2, through to M(4, 3) in R12 and so on); the source vectors are interleaved from R17 (V1(1) to V8(1) in R17 to R24, then V1(2) to V8(2) in R25 to R32, and so on); and the result vectors are interleaved likewise from R49 (R1(1) to R8(1) in R49 to R56, then R1(2) to R8(2) in R57 to R64, and so on).]

As is illustrated in figure 44 above, the eight source and corresponding eight resultant vectors are stored in the register file such that the first elements of all eight vectors are stored adjacent to each other and in the order the vector pairs are processed by the program (1 to 8), followed by the second elements of all eight vectors, and so on. The M

matrix is stored in column major order, as was discussed earlier in this section. The

arrangement of the source and destination registers means that the complexity of

interleaving and de-interleaving the arrays is handled outside the Controller by

whatever loads the data in and stores it out of the FPU’s register file. Subsequently

the register number sequences that need to be generated by the Controller are simpler

(than if the arrays were simply concatenated), and this will ease compiler

development if it is undertaken in the future. A code description of the program using

this register file arrangement, and requiring only one generic instruction word to be

stored in program memory is shown below in figure 45.


res = 0;
res_base = 0;
p = 0;
vec_base = 0;
vec = 0;
for( col = 0; col < num_matrix_cols; col++ )
{
    for( row = 0; row < num_matrix_rows; row++ )
    {
        for( vec_no = 0; vec_no < num_vectors; vec_no++ )
        {
            *(r0 + res) += *(m + p) * *(v0 + vec);
            res++;
            vec++;
        }
        p++;
        vec = vec_base;
    }
    res = res_base;
    vec = vec_base = vec_base + num_vectors;
}

Figure 45 : showing a code description of the program executing eight interleaved matrix-vector multiplication problems with only one generic instruction word stored in program memory

5.2.4 : Designing the Optimal Controller

Considering the code of figure 45 above, it can be seen that to evaluate the three induction variables between issuing successive instructions, the operations the Controller must be able to carry out on an induction variable to produce its subsequent value are: incrementing it, setting it to a base value, and adding a constant to that base value.

As was discussed previously in section 1.2.1, as well as matrix-vector multiplication,

the two other prominent NLPs executed within the OpenGL pipeline (and the graphics

rendering domain in general) are the FIR filter and matrix-matrix multiplication. For

flexibility and thus the ability to efficiently support a wider range of NLPs, the

Controller needs to have the ability to perform any combination of these operations on

any of the three induction variables in any one evaluation cycle between the issuing of

successive instructions.


5.2.4.1 : The Data Address Generator Design

A hardware block that generates successive address values by evaluating induction

variable values and then adding them to array base pointers in the way discussed

previously is commonly known in the field of DSP design as a Data Address

Generator (DAG). Such a block was designed to support the three induction variable

operations discussed previously, with a view to using three separate instances of it as

part of the Controller, for evaluating the RT/S, RA and RB field values between the

issuing of successive instructions. The design of this DAG block is shown below in

figure 46.

Figure 46 : Showing the Data Address Generator block used in the Controller

Before the DAG can be used it must first be initialised with the base pointer

(orig_base_pointer) of the array it is to generate addresses (register numbers) for, and

the constant value that can be added to this (val_1). This initialisation is part of

loading the program to be run into the Controller. With reference to figure 46 above,

the base pointer is loaded into the orig_base_pointer register and the initialisation also

acts to simultaneously load this base pointer value into the current_accumulation

register, so that this value appears at the DAG's output (Out1) in the subsequent clock

cycle. Also in the same step, the constant is loaded into the val_1 register. This

combined triple-action instruction initialises the DAG in one clock cycle and is

referred to as the DAG’s initialisation instruction in the remainder of this document.

Figure 47 below shows the instruction words of the DAG instructions used in

executing the optimised geometric transformation program of figure 45.


[Figure 47 : showing the DAG instruction words used in executing the optimised program of figure 45 — doNothing (all bits zero); current_accumulation++, post-incrementing the DAG output (bit In8 set); current_accumulation = orig_base_pointer (bit In4 set); current_accumulation = orig_base_pointer = orig_base_pointer + val_1 (bits In4 and In5 set); and initialiseDAG (bits In2, In3, In4 and In5 set).]

With reference to figure 47 above, the initialiseDAG instruction is supplied in

conjunction with the 7-bit orig_base_pointer and val_1 values. As can be seen in

figure 47, bit 4 of all instructions detailed is zero, although there is one other

instruction that the DAG supports, but which is not used in executing the geometric

transformation program of figure 45. This instruction is :

current_accumulation = orig_base_pointer + val_1; where orig_base_pointer is not

modified. This instruction is included as it would be used by other NLPs which have a requirement to increase the array address by val_1, and later go back to an earlier base value (still held in orig_base_pointer).
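A behavioural C model of the DAG's register transfers is sketched below (the function names are illustrative), covering the operations of figure 47 plus the fifth, unused operation just described:

typedef struct
{
    unsigned orig_base_pointer;    // array base pointer (register number)
    unsigned val_1;                // constant loaded at initialisation
    unsigned current_accumulation; // the value appearing at the DAG's output (Out1)
} DAG;

void initialiseDAG(DAG *d, unsigned base, unsigned val_1) // the triple-action instruction
{
    d->orig_base_pointer = base;
    d->current_accumulation = base; // base value appears at Out1 in the next cycle
    d->val_1 = val_1;
}

void dagIncrement(DAG *d)   { d->current_accumulation++; } // post-increment the output

void dagResetToBase(DAG *d) { d->current_accumulation = d->orig_base_pointer; }

void dagAdvanceBase(DAG *d)                                // base moves on by val_1
{
    d->orig_base_pointer += d->val_1;
    d->current_accumulation = d->orig_base_pointer;
}

void dagBasePlusVal(DAG *d)                                // base is not modified
{
    d->current_accumulation = d->orig_base_pointer + d->val_1;
}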


5.2.4.2 : The Use of Data Address Generators as part of the Controller

Figure 48 below shows the geometric transformation program (outlined previously in

figure 45) using the set of DAG instructions shown previously in figure 47.

void pipelinedDAG1Matrix8VectorMultiply(int num_matrix_rows, int num_matrix_cols, int num_vectors,
                                        const float *m, const float *v, float *r)
{
    int row, col, vec_no;

    // DAG initialisation pragmas
    assign_DAG1: r_vectors;   // DAG1 <= R[ ]
    assign_DAG2: mx;          // DAG2 <= M[ ]
    assign_DAG3: v_vectors;   // DAG3 <= V[ ]

    // DAG initialisation data
    initDAG1: r_vectors_orig_base = r;
    initDAG1: r_vectors_val1 = 0;
    initDAG2: mx_orig_base = m;
    initDAG2: mx_val1 = 0;
    initDAG3: v_vectors_orig_base = v;
    initDAG3: v_vectors_val1 = num_vectors;

    // NLP program
    for( col = 0; col < num_matrix_cols; col++ )
    {
        for( row = 0; row < num_matrix_rows; row++ )
        {
            for( vec_no = 0; vec_no < num_vectors; vec_no++ )
            {
                *(r_vectors_curr_accum) += *(mx_curr_accum) * *(v_vectors_curr_accum);
                DAG1: r_vectors_curr_accum++; // DAG instruction
                DAG3: v_vectors_curr_accum++; // DAG instruction
            }
            DAG2: mx_curr_accum++;                            // DAG instruction
            DAG3: v_vectors_curr_accum = v_vectors_orig_base; // DAG instruction
        }
        DAG1: r_vectors_curr_accum = r_vectors_orig_base;     // DAG instruction
        // DAG instruction
        DAG3: v_vectors_curr_accum = v_vectors_orig_base = v_vectors_orig_base + v_vectors_val1;
    }
}

Figure 48 : showing a code description of the optimal geometric transformation program that interleaves eight matrix-vector multiplications, rewritten to show how DAGs in the Controller would be used

With reference to figure 48 above, compiling this code would simply involve

constructing the DAG initialisation instruction word for each of the three DAGs,

constructing the instruction word for each of the three loops, and extracting the loop

bounds to initialise the count_to values of the loop counters (discussed later in section

5.2.4.3). The three loop instruction words would be used by the Controller to

generate the program instruction words of this program, and are depicted below in

figure 49.

[Figure 49 : showing the three instruction words used by the Controller to run the interleaved matrix-vector multiplication program of figure 48 — inner-loop : MNEMONIC = MACC, I_DAG_RT/S = r_vectors_curr_accum++, I_DAG_RA = doNothing, I_DAG_RB = v_vectors_curr_accum++; middle-loop : MNEMONIC = doNothing, I_DAG_RT/S = doNothing, I_DAG_RA = mx_curr_accum++, I_DAG_RB = (v_vectors_curr_accum = v_vectors_orig_base); outer-loop : MNEMONIC = doNothing, I_DAG_RT/S = (r_vectors_curr_accum = r_vectors_orig_base), I_DAG_RA = doNothing, I_DAG_RB = (v_vectors_curr_accum = v_vectors_orig_base = v_vectors_orig_base + val_1).]


Figure 50 below shows the final Controller design, which employs three instances of

the DAG block, and a loop counter structure to act as a form of program counter.

Figure 50 : showing the final optimised Controller design

5.2.4.3 : The Loop Counter Structure

With reference to figure 50 above, the loop counter structure

(Controller_Loop_Counters) keeps track of where in the NLP program execution currently is (in terms of the iteration number of each loop), and issues the generic loop

instructions (of the form shown previously in figure 49) accordingly. This loop

counter structure is shown below in figure 51.


Figure 51 : showing the loop counter structure that acts as the program counter of the

Controller

In consideration of the key NLPs that it is envisaged would be run by this Controller,

for simplicity this loop counter structure is designed to support programs with only

one generic instruction (of the form shown previously in figure 41) in each loop.

With reference to figure 51 above, when a NLP is being run by the Controller,

counter_1, 2 and 3 hold the loop index for the inner, middle and outer loops

respectively. The three counters are connected together via logic gates (AND and NOT) and registers (delays).

In all clock cycles up to and including when counter_1 reaches its count_to value (i.e.

when the inner loop is on its last iteration) the address applied to the program memory

is that of the inner-loop’s generic instruction. In the cycle after counter_1 reaches its

count_to value, its count is reset back to zero and disabled there for one cycle whilst

the count of counter_2 is advanced by one. In this cycle the address input to the

program memory is that of the middle loop’s generic instruction.

In the clock cycle after counter_2 reaches its count_to value, it behaves in the same

way as just described for counter_1, and the count of counter_1 in this event is

disabled at zero for two successive cycles. The address applied to program memory

in the clock cycle after counter_2 reaches its count_to value is that of the outer-loop’s

generic instruction.

In the clock cycle that counter_3 reaches its count_to value a ‘1’ is asserted on the

switch_context signal shown in figure 51 above, which signals that the last generic

instruction of the NLP has been issued.
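An approximate behavioural C model of this counting sequence is sketched below (the names are illustrative, and the exact per-cycle enable/reset timing between the counters is as shown in figure 51 and appendix C); it selects which generic loop instruction's program-memory address is applied in each clock cycle:

static int counter_1, counter_2, counter_3;    // inner, middle and outer loop indices
static int count_to_1, count_to_2, count_to_3; // loaded at initialisation
static int switch_context;                     // asserted when the NLP completes

enum { INNER_ADDR, MIDDLE_ADDR, OUTER_ADDR };  // generic instruction addresses

int nextProgramMemoryAddress(void)
{
    if( counter_1 < count_to_1 ) { counter_1++; return INNER_ADDR; }
    counter_1 = 0;                                    // inner loop wraps
    if( counter_2 < count_to_2 ) { counter_2++; return MIDDLE_ADDR; }
    counter_2 = 0;                                    // middle loop wraps
    if( counter_3 == count_to_3 ) switch_context = 1; // last generic instruction issued
    counter_3++;
    return OUTER_ADDR;
}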

When a ‘1’ is asserted on the stall signal by the FPU, the program memory of the

optimal Controller behaves in the same way as the initial look-up table design detailed


previously in section 5.1. In the event of a stall, the three DAGs are disabled, and thus all are effectively issued with doNothing instructions. The loop counter arrangement has been designed such that every feedback signal between the counters is frozen, so that when the stall is lifted the count sequence will carry on exactly as it would have done had no stall been encountered (from the point at which it was halted by the stall event).

The internal contents of the three counter blocks of figure 51 are shown in appendix

C.

5.2.5 : The High-Level Controller

Before the Controller can run a NLP, the generic loop instructions have to be loaded

into its program memory and its DAGs need initialising. This initialisation is carried

out by the High-level controller, which essentially stores three Controller

initialisation instructions in its block RAM. In initialising the Controller to run a

NLP, these three instructions are issued to the Controller. With reference to figure 51

shown above, when the High-level controller is issuing the Controller with one of

these instructions, it asserts a ‘1’ on the Controller's run input. In the clock cycle after

the last of these has been issued, the High-level controller asserts a ‘0’ on the run

signal and the Controller begins running the program. When the Controller asserts a

‘1’ on its switch_context output, the High-level controller then goes through the same

initialisation procedure with the Controller for the next NLP to be run. Thus all NLPs

for which the High-level controller has a set of Controller initialisation instructions

are successively loaded into the Controller and run in this way. The High-level

controller is shown in appendix D.

One of these Controller initialisation instructions acts to load one of the generic loop

instructions into the Controller’s program memory, initialise one of the DAGs, load

one of the loop instruction addresses, and initialise one of the counters with its

count_to value. Thus it can be seen that three of these instructions need to be issued

in order to load a NLP into the Controller.
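As a rough C sketch, one such initialisation instruction can be pictured as the following record. The field names and widths here are illustrative assumptions, not the actual encoding; the DAG initialisation values correspond to the DAG block's orig_base_pointer, val_1 and curr_accum registers.

    // Hypothetical view of one Controller initialisation instruction.
    typedef struct {
        unsigned generic_instr;     // generic loop instruction, written into program memory
        unsigned instr_addr;        // program memory address of that loop instruction
        unsigned dag_base_pointer;  // DAG initialisation : orig_base_pointer
        unsigned dag_val_1;         // DAG initialisation : val_1
        unsigned dag_curr_accum;    // DAG initialisation : curr_accum
        unsigned count_to;          // count_to value for the corresponding loop counter
    } ControllerInitInstr;

    ControllerInitInstr nlp_init[3];  // one instruction per loop : inner, middle, outer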

5.2.6 : The DMA Unit and Context Switching

Until now it was always assumed that all input data (i.e. matrices and vectors) that

programs operated on was already in the register file. Likewise, no attention was paid

as to how data would be output by the FPU. Implementing this as load and store

instructions in the FPU would mean long periods of latency whilst entire arrays were

loaded in and stored out element by element, in between the execution of the actual

computational instructions.

A DMA unit was designed to load data into and store data out of the FPU’s register

file whilst leaving the FPU free to carry out the execution of a program. However,

with only one register file in the FPU, only three reads and one write per cycle were

available. Instructions being executed by the FPU would want to read the register file


in order to fetch their arguments, and all three of the available reads can be required in the same clock cycle (where a new instruction is submitted to the Execution_pipeline and the accumulate stage of a downstream MACC instruction fetches its third source register), leaving no read port free for the DMA unit. Thus using only one register file would mean slow overall execution time, as the next NLP to be run could not do so until all of its input data had been loaded into the register file.

This problem was overcome by introducing a second register file in the FPU, where in

each context cycle one register file is read from and written to by the NLP being

executed on the FPU, and the other has data stored out of it (the results of the previous

NLP to have been executed) and loaded into it (input data for the next NLP to be

executed) by the DMA unit.

This DMA unit is shown below in figure 52.

Figure 52 : showing the DMA unit

With reference to figure 52 above, as with the Controller the DMA unit must be

initialised in between context cycles. The High-level controller detailed previously in

section 5.2.5, was extended to carry out this initialisation of the DMA unit in the same

way as, and in parallel with the Controller. In this initialisation, a set of load and

store instructions are loaded into the DMA unit’s RAM block, the format of which is

shown below in figure 53.


[16-bit DMA instruction word : MNEMONIC (2 bits), First Register (7 bits) and Last Register (7 bits) fields, with field boundaries at bits 0-6, 7-13 and 14-15]

Figure 53 : showing the format of the DMA instructions

With reference to figure 53 above, the mnemonic field indicates whether the

instruction is a load or store, which the DMA unit uses to decide which protocol to

implement. The first_register and last_register fields represent pointers (register

numbers) to the beginning and end of the array being loaded in or stored out, and the

DMA unit generates the numbers in between in a monotonic fashion. The next

number in this sequence is generated every time a new data value is received (at input

In3) for the DMA to load into the register file, or it needs to fetch another data value

from the register file to store.
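A minimal C sketch of this decode-and-generate behaviour is given below, assuming the field positions of the reconstructed figure 53 (the mnemonic in the top two bits, the two 7-bit register pointers below it); stream_halted() and transfer() are hypothetical stand-ins for the handshake and the register file access.

    // Sketch of the DMA instruction decode and monotonic pointer sequence.
    void dmaRunInstruction( unsigned instr )
    {
        unsigned mnemonic = (instr >> 14) & 0x3;   // load or store (2-bit field)
        unsigned ptr      = (instr >> 7) & 0x7F;   // first_register : start of the array
        unsigned last_r   =  instr       & 0x7F;   // last_register : end of the array

        while( ptr <= last_r )
        {
            if( stream_halted() )
                continue;                  // counter disabled : ptr is frozen, nothing skipped
            transfer( ptr, mnemonic );     // load into, or store out of, the register file
            ptr++;                         // next register number in the monotonic sequence
        }
    }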

If at any stage the input data stream to the DMA is halted when loading, or the external entity receiving the data being stored out temporarily cannot receive, the counter generating the pointer values is disabled. As with the loop counter structure of the optimal Controller discussed previously in section 5.2.4.3, the DMA unit has been designed such that all internal feedback signals are frozen, so that a data stream halt in either direction does not cause a malfunction (such as a pointer value being skipped). Figure 54 below describes the protocols implemented by the DMA unit in load and store mode.

Protocol for loading data into the register file:
ACCEPT_MORE = LOAD_INSTRUCTION && ALLOWED_BY_FPU && NOT(LAST_REGISTER) && IN_RUN_MODE

Protocol for storing data out of the register file:
NEW_DATA_OUT = STORE_INSTRUCTION && ALLOWED_BY_FPU && NOT(LAST_REGISTER) && IN_RUN_MODE

Figure 54 : describing the protocols implemented by the DMA unit in loading data in

and storing data out of the FPU
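Restated directly in C, the two handshake outputs of figure 54 are the same guard applied to the two instruction types:

    // Direct restatement of the figure 54 protocol equations (signal names
    // as in figure 54, lower-cased; all are 1-bit flags).
    int accept_more  = load_instruction  && allowed_by_fpu && !last_register && in_run_mode;
    int new_data_out = store_instruction && allowed_by_fpu && !last_register && in_run_mode;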

Both register files of the FPU are 100 words in depth, and the input and output signals

of both are multiplexed between the DMA unit and the FPU using a context_bit

provided by the High-level controller which toggles between successive context

cycles.
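The effect of the context_bit can be sketched in C as follows (RegisterFile, rf, fpu_rf and dma_rf are illustrative names; in the actual design the selection is done by multiplexers on the register files' input and output signals):

    // Sketch of the double-buffered register files selected by context_bit.
    RegisterFile rf[2];                           // both 100 words deep

    RegisterFile *fpu_rf = &rf[context_bit];      // read/written by the NLP being executed
    RegisterFile *dma_rf = &rf[1 - context_bit];  // drained and refilled by the DMA unit

    context_bit ^= 1;  // toggled by the High-level controller between context cycles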


Section 6 : Testing and Integration

After each block had been designed and implemented as a Simulink simulation model, it was thoroughly tested against its specification. Corner cases and perverse situations were also examined, so as to be aware of these when integrating a block with others. When the blocks were deemed to have met their specification they were immediately integrated with the other blocks they were to interact with, and tested as part of a sub-system. In this way the complete system, comprising the High-level controller, the optimal FPU Controller, the DMA unit, and the optimised FPU, was developed by integrating the sub-systems together, predominantly in a bottom-up hierarchical manner.

6.1 : Testing of the FPU’s Control Unit

Firstly, the FPU_Control block was tested alone to verify that it could submit all three

types of instruction for execution. This was done in simulation by inputting add,

multiply and MACC instructions and observing the outputs of the block. The register

field values were set to values in the low, mid and high range of the register file’s

range of register numbers (1 to 100). For each of the instructions, two and three of

the register fields were set to the same value (which is legal). All cases were tested

with mnemonics in the instruction word that were unknown to the FPU, to ensure that

the FPU_Control block did not try to submit anything for execution in that clock

cycle. As well as using it to test the block, this process was also used to develop the

implementation of the FPU_Control_Unit Embedded Matlab function within the

FPU_Control block in an iterative way.

Register field values outside the register file’s range were also input as part of an

instruction word. Whenever a register field value of 0 was used (with a valid

mnemonic) the FPU_Control_Unit program would assign -1 (in 7-bit binary form) to

the appropriate address signal, as all register numbers are decremented by one to map

them to an address in the register file (i.e. register numbers in the FPU start from 1,

but block RAM addresses start from 0). Register numbers between 101 and 127

inclusive were processed in the same way as those in the legal range of 1 to 100.
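The mapping just described amounts to the following C sketch (the behaviour for out-of-range register numbers simply falls out of the arithmetic, since no bounds check is performed):

    // Register-number-to-address mapping used by FPU_Control_Unit.
    int addr = reg_no - 1;  // register numbers start from 1, block RAM addresses from 0
    // reg_no == 0 therefore yields -1 (0b1111111 in 7-bit binary form), and
    // reg_no in the range 101..127 yields addresses 100..126, processed in
    // exactly the same way as an in-range register number.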

The basic updating of the Scoreboard was verified by outputting the persistent Scoreboard segments of interest from the function and observing their decimal values over successive clock cycles. It was found that register numbers outside the legal range had no effect on the Scoreboard (when input as part of an instruction word with a valid mnemonic), as expected, since the updateScore() sub-function would not be able to assign them to one of the Scoreboard segments; execution of the FPU_Control_Unit function was found to resume as normal in subsequent clock cycles.

Sequences of instructions were then input to the block to check that the

FPU_Control_Unit program could handle instructions being issued to it at the rate of

one per clock cycle (and slower), that the Scoreboard detected all cases of data


hazards, the structural hazard concerning the use of the adder, and combinations of

both types of hazard presented at the same time.

Data hazard detection was verified using instruction streams representing vector dot-

products, with dependent MACC operations scheduled one after the other, as analysed

previously in section 3.1. The same cases were also tested with the vector dot-

products implemented as a series of individual multiply and add instructions.

Structural hazard detection was verified by issuing a MACC instruction followed by a

series of add instructions (devised so as to not carry data hazards) and observing when

the stall signal was asserted high, and whether it was high for only one clock cycle (as intended).

Also verified was the subsequent behaviour of the block when different types of instructions were issued after the stall signal had gone high. As expected, when one instruction was stalled and then, perversely (compared with how the Controller is designed to react in the event of a stall), an entirely different instruction carrying no hazards was issued in the subsequent clock cycle, this instruction would be submitted for execution.

6.2 : Testing and Integrating the Register File

The register file was first tested alone to verify that it could support three (or fewer) reads and one write per clock cycle. The one write per cycle was tested by feeding address sequences into its WB_ADDR input along with a corresponding data stream

into its WB_DATA input. At the same time, address sequences were sent to the three

ADDR_R inputs (addressing a different part of the register file than that being written

into) and the data fetched out was compared to that expected (based on the

initial_value_vector the register file was initialised with). After the address and data

streams had finished, the contents of the register file were streamed out in a serial

manner from a fourth block RAM that was initialised and subsequently written to in

exactly the same way as the three block RAMs of the register file. This was done in

order to verify the write operation over multiple cycles. This analysis also verified

that the register file could support a read-after-write operation on the same register in

the same cycle, although for simplicity the FPU has not been designed to take

advantage of this.

The register file and FPU_Control block were then integrated together, where a series

of instructions were issued to FPU_Control in the simulation, and the three data

outputs of the register file were observed to make sure the right values (registers)

were fetched out after each instruction was submitted for execution, where the register

file was again initialised pre-simulation.


6.3 : Testing and Integrating the Execution Pipeline

The Execution_pipeline was integrated with both the FPU_Control and register file

blocks. The main issue in getting all three to work together was refining the latency

model of the Execution_pipeline held by the FPU_Control_Unit program, and this

was done via an iterative process using an instruction stream consisting of a series of

independent MACCs, and observing various signals along the Execution_pipeline.

Up until this point, all data words had been of type 8-bit unsigned, and this was now

changed to 12-bit signed (with binary point after bit-4) in order to better represent the

typical data in the OpenGL pipeline that it was envisaged would require processing by

the FPU. The issue faced here was the need to remove intermediate type converter

blocks (which were not able to handle this data format). A simple matrix-vector

multiplication problem was used to verify this sub-system, by streaming out the

register file contents after the program’s execution and comparing the contents with

those calculated using Matlab and also a 20-digit calculator. After experimenting

with different binary point positions, the one stated above was found to be very

adequate.

6.4 : Testing and Integration of the Look-Up Table Controller

The Look-up table Controller was first initialised prior to running the simulation using

Matlab environment variables for the count_to block parameter of its

PROG_COUNTER and initial_value_vector of its PROG_MEMORY. The basic

operation was verified by checking that it reacted correctly to the stall signal going high, and that it could cope with multiple stalls in immediate succession, whereby

the stall signal would effectively toggle between ‘1’ and ‘0’ in successive clock

cycles.

The Look-up table Controller was then integrated with the FPU. Its

initial_value_vector was initialised to hold a matrix(4x4)-vector(4x1) multiplication

program (implemented as MACC instructions) completely unrolled. This program

processed the matrix in column-major order so as to schedule the dependent operations apart and avoid data hazards arising. Stalls were made to occur at several points within the

execution of the program, by repeating instances of the same instruction at the

corresponding points in program memory. This was also done with multiple stalls

occurring in immediate succession. The contents of the register file were streamed

out at the end of the program execution and compared with those evaluated using

Matlab to ensure that the stalls did not cause an error.

6.5 : Testing of the Optimal Controller’s Data Address Generator

The DAG block was first initialised with its orig_base_pointer, val_1 and

curr_accum values using workspace variables as the initial values of the

corresponding registers. The block was initially tested to check that it could cope

with any sequence of instructions issued to it, and that the latency of all these

instructions was 1 clock cycle (i.e. so that the result would be ready in time for the


issuing of the next program instruction to the FPU). It was also checked to see that

the DAG block could cope with stalls (in isolation and succession) appropriately, using a similar method to that just described for the Look-up table Controller.

6.6 : Testing and Integration of the Optimal Controller’s Loop Counter

Structure, Program Memory and DAGs

The loop counter structure used as the Optimal Controller’s program counter was

developed through an iterative process of designing and testing. This was first done

to get the counters interacting in the right way (through register delays and logic

gates) in order for the individual counters to work in synchronisation and together

create the right count sequence (depending on the individual loop bounds). After this

basic functionality was achieved, further developing the loop counter structure to cope

with stalls occurring (in isolation or succession) anywhere in the count sequence involved another iterative process of designing and testing.

The loop counter structure was then integrated with the Program memory of the

Optimal Controller, where the state of the counters provides control inputs to a

network of MUXs which then select the address of the appropriate loop instruction as

the address assigned to Program memory. This subsystem was tested with values 1,

2, and 3 representing the instruction words of the three loops for clarity. The reaction

of the sub-system to a stall occurring was tested using the same method as previously

described in section 6.4 with regards to the Look-up table Controller.

Three instances of the Data Address Generator block were then integrated with this

sub-system, thus completing the Optimal Controller. This was initialised to run the

optimised geometric transformation program detailed previously in section 5.2.3

using workspace variables to represent the various block parameters. The program

was run on the Optimal Controller and the resulting instruction stream was inspected

to verify that the correct sequence of instructions was being generated.

6.7 : Testing and Integration of the DMA Unit

Developing the DMA unit was an iterative process of design and testing. The main

issues faced were getting the right delays in certain control signal paths, so that the

DMA unit could cope with data stream halts occurring at any point in time. The other

major problem faced was an asynchronous feedback loop between the DMA unit and the FPU, which had to be removed in order to integrate the DMA unit with the FPU in the Xilinx System Generator environment.

In testing the load protocol, the input data was supplied by an Embedded Matlab

program which was used to implement the other side of the protocol. A similar

thing was done to test the store protocol.

Once the operation of the DMA unit had been verified in isolation, the FPU was

modified to contain two register files as detailed previously in section 5.2.6, and the

DMA unit was integrated with it. The operation of this sub-system was verified by

using the DMA to store data out of and load data into one register file, whilst the FPU


executed the optimised geometric transformation program using the other register file.

The contents of both register files were then streamed out and the contents were

checked. The store instructions of the DMA unit were verified by inspecting the data

values it fetched out (where the register file being stored out of had previously been

initialised prior to running the simulation).

6.8 : Testing and Integration of the High-Level Controller

The High-level Controller was first tested in integration with the DMA unit. A series

of initialisation instructions were issued to the DMA unit, after which it was put into

run mode. The DMA needed an interface with the High-Level Controller in order to correctly process initialisation instructions, and this was largely developed via an iterative process of design and test. Once this interface was working correctly, the subsequent operation of the DMA was seen to comply with what it had been initialised to do.

A very similar process was carried out with the High-level Controller and the Optimal

Controller.

The High-level Controller was then integrated together with both the DMA Unit and

Optimal Controller, as well as the FPU to form the complete system. The operation

of the system was verified using five different instances of the optimised geometric

transformation, loaded into the Optimal Controller and subsequently executed on the

FPU in succession. Whilst each of these NLPs was executed (over one context cycle) the DMA unit was used to store out the resultant vectors of the previous NLP from the other register file, and then load in the source vectors and matrix for the subsequent NLP to be run. In successive context cycles, the register files the FPU and DMA unit operated on were swapped, so as to hide the latency of loading in and storing out data, thus in effect not letting it degrade execution time.


6.9 : The OpenGL Demonstration

As previously discussed in section 5.2, the optimised geometric transformation program essentially represents the program executed by the OpenGL pipeline in carrying out both the modelview and projection transformations, which are part of the per-vertex stage of the OpenGL pipeline as detailed previously in section 1.2.2. A model of this geometric transformation process was developed in the Matlab environment, with the final processor architecture used to execute the modelview transformation. This model is shown below in figure 55.

Figure 55 : showing the demonstration of the OpenGL pipeline’s geometric

transformation process, with the modelview transformation executed on the final

processor architecture

With reference to figure 55 above, the OpenGL program that is executed by the model is the rendering of a cube (modelled by its eight vertices), which between each successive frame is rotated and translated along the trailing diagonal of the viewport. The viewpoint is such that the viewer is looking down over the top of the cube.

With reference to figure 55 above, the outputs of the High-level Controller are

essentially the fields of the initialisation instructions of the Optimal Controller and the

DMA unit. Between successive context cycles, the High-level Controller initialises

the Optimal Controller to run the optimised geometric transformation program, each

time swapping the contexts (i.e. register files) of the FPU and DMA unit by toggling

the context_bit. The eight spatial coordinate vectors of the cube (in their object


coordinate form, with the cube centred at the origin) and the modelview matrix to be

used in rendering five successive scenes are sent to the DMA unit (element by

element and according to the DMA’s load protocol) by an Embedded Matlab function

used to model the processing element feeding the geometric transformation pipeline

within the per-vertex stage of the OpenGL pipeline. The FPU executes the

modelview transformation for one frame, whilst the DMA unit stores out the resultant

eye coordinate vectors for the previous frame and loads in the source object

coordinate vectors for the subsequent frame. The de_interleave function block de-

interleaves the output array being stored out, and extracts the eight eye coordinate

vectors before passing them on to the geometric_trans function block which

implements the remainder of the geometric transformation process before the cube is

drawn into the framebuffer. Figure 56 below shows the five successive frames

produced in an overlapped manner, so that the effect of the successive executions of the modelview transformation on the final processor architecture may be seen.

Figure 56 : showing the five successive frames produced by the model of OpenGL’s

geometric transformation process shown previously in figure 55

6.10 : Progress in Developing a Hardware Based Demonstration

The Optimal Controller has been successfully targeted to the Xilinx Virtex 4 FPGA

board using the System Generator tool chain. The Optimal Controller (in hardware)

has also been integrated with the rest of the Processor architecture (in simulation)

successfully.


The DMA unit has also been targeted to hardware successfully, but has not yet been successfully integrated with the rest of the Processor architecture (in simulation). The problem faced is that the feedback loop between the FPU and the DMA used in implementing the two protocols cannot be evaluated by System Generator.

Section 7 : Conclusion

The processor architecture that has been developed is capable of carrying out both the

modelview and projection transformations within the geometric transformations

process of the OpenGL pipeline. These are two of the most prominent

transformations carried out within the OpenGL pipeline, as they each implement key

steps in constructing the scene.

These transformations within the OpenGL pipeline are executed in succession on all objects in the scene, for every frame. Through the context switching implemented by the architecture, it is able to process successive NLPs without any latency (during which execution would have to halt) due to transferring the associated data in or out.

The objects in a scene are most likely to have different numbers of vertices, and as

such create programs of different sizes to be run. The Optimal Controller developed

can efficiently support a wide range of program sizes as it only ever stores one instruction per loop; thus as well as being efficient in terms of the amount of program memory required, the architecture is also initialised very quickly by the High-Level Controller to run the next NLP.

The processing architecture is therefore a good fit for the processing engine required

to perform the modelview and projection transformations.

It has been shown that when code is written specifically for the Optimal Controller,

the steps involved in constructing the initialisation instruction words to be issued by

the High-Level controller in loading the program into the Optimal Controller are

trivial, and that this alone would be all that was necessary to constitute compilation of

the code.

Although it has not been explicitly shown in this report, when considering the range of induction variable operations supported by the Data Address Generator block it is reasonable to assume that the Optimal Controller, and the processor architecture as a whole, would efficiently support a wider range of NLPs, especially the FIR filter and matrix-matrix multiplication, which alongside matrix-vector multiplication are also prominent NLPs in graphics processing.


Section 8 : Appendices

Appendix A

void transposedFirFilter( int num_taps, int num_samples, const float *x, const float h[], float *y )
{
    int j;                 // the only loop counter
    const float *k = x;    // 'x' initially points to the input sample (x[n]) corresponding to the first
                           // output sample to be calculated (y[n])
    register float temp3 = 0, temp2 = 0, temp1 = 0, temp0 = 0;  // registers used to store each result
                                                                // at different stages of the accumulation

    // Spin-up procedure : filling up the pipeline
    temp2 = ( h[3] * *(k-3) ) + temp3;

    // PAR construct means execute the statements housed in the following pair of braces simultaneously
    // (i.e. issue them to separate processors)
    PAR {
        temp2 = ( h[3] * *(k-2) ) + temp3;
        temp1 = ( h[2] * *(k-2) ) + temp2;
    }
    PAR {
        temp2 = ( h[3] * *(k-1) ) + temp3;
        temp1 = ( h[2] * *(k-1) ) + temp2;
        temp0 = ( h[1] * *(k-1) ) + temp1;
    }

    // Steady state : where the pipeline is full
    for( j = 0; j < (num_samples - (num_taps-1)); j++ )
    {
        PAR {
            *y++  = ( h[0] * *k ) + temp0;  // y points to the register address where y[n+j] is to be
            temp0 = ( h[1] * *k ) + temp1;  // written and is incremented (post assignment) to point to
            temp1 = ( h[2] * *k ) + temp2;  // the register address where the next output sample
            temp2 = ( h[3] * *k ) + temp3;  // y[(n+j)+1] is to be written
        }
        k = x++;  // 'x' initially points to x[n+j] and is incremented (post assignment) to point to x[(n+j)+1]
    }

    // Spin-down procedure : emptying the pipeline
    PAR {
        *y++  = ( h[0] * *k ) + temp0;
        temp0 = ( h[1] * *k ) + temp1;
        temp1 = ( h[2] * *k ) + temp2;
    }
    PAR {
        *y++  = ( h[0] * *(k+1) ) + temp0;
        temp0 = ( h[1] * *(k+1) ) + temp1;
    }
    *y++ = ( h[0] * *(k+2) ) + temp0;
}

Figure A1 : A code description of the Transposed FIR filter

void systolicFirFilter( int num_taps, int num_samples, const float *x, const float h[], float *y )
{
    int j;                  // the only loop counter
    const float *k = x++;   // 'x' initially points to the input sample (x[n]) corresponding to the first
                            // output sample to be calculated (y[n])
    register float temp3 = 0, temp2 = 0, temp1 = 0, temp0 = 0;  // registers used to store each result
                                                                // at different stages of the accumulation

    // Spin-up procedure : filling up the pipeline
    temp2 = ( h[0] * *k ) + temp3;
    k = x++;

    // PAR construct means execute the statements housed in the following pair of braces simultaneously
    // (i.e. issue them to separate processors)
    PAR {
        temp1 = ( h[1] * *(k-2) ) + temp2;
        temp2 = ( h[0] * *k ) + temp3;
    }
    k = x++;
    PAR {
        temp0 = ( h[2] * *(k-4) ) + temp1;
        temp1 = ( h[1] * *(k-2) ) + temp2;
        temp2 = ( h[0] * *k ) + temp3;
    }

    // Steady state : where the pipeline remains full
    for( j = 0; j < (num_samples - (num_taps-1)); j++ )
    {
        k = x++;  // 'x' initially points to x[n+j] and is incremented (post assignment) to point to x[(n+j)+1]
        PAR {
            *y++  = ( h[3] * *(k-6) ) + temp0;  // y points to the register address where y[n+j] is to be
            temp0 = ( h[2] * *(k-4) ) + temp1;  // written and is incremented (post assignment) to point to
            temp1 = ( h[1] * *(k-2) ) + temp2;  // the register address where the next output sample
            temp2 = ( h[0] * *k ) + temp3;      // y[(n+j)+1] is to be written
        }
    }

    // Spin-down procedure : emptying the pipeline
    k = x++;
    PAR {
        *y++  = ( h[3] * *(k-6) ) + temp0;
        temp0 = ( h[2] * *(k-4) ) + temp1;
        temp1 = ( h[1] * *(k-2) ) + temp2;
    }
    k = x++;
    PAR {
        *y++  = ( h[3] * *(k-6) ) + temp0;
        temp0 = ( h[2] * *(k-4) ) + temp1;
    }
    k = x;
    *y++ = ( h[3] * *(k-6) ) + temp0;
}

Figure A2 : A code description of the Systolic FIR filter (with N = 4)

void semi_parallelFirFilter( int num_taps, int num_samples, const float *x, const float h[], float *y )
{
    int i, j;
    const float *k0, *k1, *k2, *k3;
    register int i0, i1, i2, i3;
    register float temp3 = 0, temp2 = 0, temp1 = 0, temp0 = 0, y_partial = 0, y_accum = 0;

    // Spin-up procedure
    i0 = 0;
    k0 = x++;
    temp2 = ( h[i0] * *k0 ) + temp3;

    i1 = 4 + i0++;
    k1 = (k0--) - 4;
    PAR {
        temp1 = ( h[i1] * *k1 ) + temp2;
        temp2 = ( h[i0] * *k0 ) + temp3;
    }

    i2 = i1 + 4;
    i1 = 4 + i0++;
    k2 = k1 - 4;
    k1 = (k0--) - 4;
    PAR {
        temp0 = ( h[i2] * *k2 ) + temp1;
        temp1 = ( h[i1] * *k1 ) + temp2;
        temp2 = ( h[i0] * *k0 ) + temp3;
    }

    i3 = i2 + 4;
    i2 = i1 + 4;
    i1 = 4 + i0++;
    k3 = k2 - 4;
    k2 = k1 - 4;
    k1 = (k0--) - 4;
    PAR {
        y_partial = ( h[i3] * *k3 ) + temp0;
        temp0 = ( h[i2] * *k2 ) + temp1;
        temp1 = ( h[i1] * *k1 ) + temp2;
        temp2 = ( h[i0] * *k0 ) + temp3;
    }

    i3 = i2 + 4;
    i2 = i1 + 4;
    i1 = 4 + i0;
    i0 = 0;
    k3 = k2 - 4;
    k2 = k1 - 4;
    k1 = k0 - 4;
    k0 = x++;

    // Steady state
    for( j = 0; j < (((num_taps / 4) * num_samples) - 4); j++ )
    {
        y_accum = 0;
        for( i = 0; i < 4; i++ )
        {
            PAR {
                y_accum += y_partial;
                y_partial = ( h[i3] * *k3 ) + temp0;
                temp0 = ( h[i2] * *k2 ) + temp1;
                temp1 = ( h[i1] * *k1 ) + temp2;
                temp2 = ( h[i0] * *k0 ) + temp3;
            }
            PAR {
                i3 = i2 + 4;
                i2 = i1 + 4;
                i1 = i0 + 4;
                i0++;
            }
            PAR {
                k3 = k2 - 4;
                k2 = k1 - 4;
                k1 = k0 - 4;
                k0--;
            }
        }
        i0 = 0;
        *y++ = y_accum;
    }

    // Spin-down procedure
    PAR {
        i3 = i2 + 4;
        i2 = i1 + 4;
        i1 = i0 + 4;
    }
    PAR {
        k3 = k2 - 4;
        k2 = k1 - 4;
        k1 = k0 - 4;
    }
    PAR {
        y_accum += y_partial;
        y_partial = ( h[i3] * *k3 ) + temp0;
        temp0 = ( h[i2] * *k2 ) + temp1;
        temp1 = ( h[i1] * *k1 ) + temp2;
    }
    PAR {
        i3 = i2 + 4;
        i2 = i1 + 4;
    }
    PAR {
        k3 = k2 - 4;
        k2 = k1 - 4;
    }
    PAR {
        y_accum += y_partial;
        y_partial = ( h[i3] * *k3 ) + temp0;
        temp0 = ( h[i2] * *k2 ) + temp1;
    }
    PAR {
        i3 = i2 + 4;
    }
    PAR {
        k3 = k2 - 4;
    }
    PAR {
        y_accum += y_partial;
        y_partial = ( h[i3] * *k3 ) + temp0;
    }
    y_accum += y_partial;
    *y = y_accum;
}

Figure A3 : a code description of the Semi-Parallel FIR filter (with N = 16, M = 4)


Appendix B

Figure B1 : Showing the FPU_Control block of figure 29 at the next level of

abstraction down

function [scb1,scb2,scb3,....] = updateScore(register_no, op_mn, scb1,scb2,scb3,....)

    field = 0;      // variable to store the Scoreboard field of the register as a decimal value
    f_number = 0;   // variable to store the field number of the register within its respective Scoreboard segment
    countdown = 0;  // variable to store the countdown sub-field of the field as a decimal value
    execunit = 0;   // variable to store the exec_unit sub-field of the field as a decimal value

    if(op_mn == 2)  // if instruction to be submitted is a MACC
        countdown = 10;
        execunit = 1;
        field = (countdown * 4) + execunit;  // countdown sub-field is shifted up by 2 bits and
                                             // concatenated with exec_unit
    end
    if(op_mn == 4)  // if instruction to be submitted is a multiply
        countdown = 7;
        execunit = 2;
        field = (countdown * 4) + execunit;
    end
    if(op_mn == 3)  // if instruction to be submitted is an add
        countdown = 5;
        execunit = 1;
        field = (countdown * 4) + execunit;
    end

    switch register_no
        case {1,2,3,4,5}       // if register's field is held in Scoreboard_1
            f_number = register_no;
            scb1 = scb1 + ((field) * (2 ^ (6 * (f_number - 1))));  // set the right field within Scoreboard_1
                                                                   // to this field value
        case {6,7,8,9,10}      // if register's field is held in Scoreboard_2
            f_number = register_no - 5;
            scb2 = scb2 + ((field) * (2 ^ (6 * (f_number - 1))));  // set the right field within Scoreboard_2
                                                                   // to this field value
        case {11,12,13,14,15}  // if register's field is held in Scoreboard_3
            f_number = register_no - 10;
            scb3 = scb3 + ((field) * (2 ^ (6 * (f_number - 1))));  // set the right field within Scoreboard_3
                                                                   // to this field value
        .
        .
        .
        .

Figure B2 : Showing a synopsis of the updateScore() sub-function within the

FPU_Control_Unit.

The inputs to the sub-function are the mnemonic of the instruction being submitted

(op_mn), its destination register number (register_no) and the Scoreboard in its

segmented form (scb1 to scb20). Firstly the destination register’s Scoreboard field is

constructed as detailed previously in figure 31, where the countdown and exec_unit

sub-fields are set to the appropriate values for the type of instruction being submitted

(represented by the op_mn variable). In updateScore() the countdown value is set to

one more than the latency of the instruction being submitted, because the

sboardCycle() sub-function (discussed below) would be executed after an execution

of updateScore(), and by default sboardCycle() decrements all non-zero countdown

values throughout the entire Scoreboard. The switch case block sets the appropriate

field within the appropriate Scoreboard segment to the newly constructed field value.

The redundancy of inputting and subsequently outputting all Scoreboard segments is

done in order to minimise the number of switch case or if-then-else type blocks

employed to assign the field to the correct Scoreboard segment.
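As a worked example of this packing (in C, for illustration): submitting a MACC whose destination is register 7 gives countdown = 10 and exec_unit = 1, and register 7 is field number 2 of Scoreboard_2:

    // Worked example of the field packing performed by updateScore().
    unsigned field = (10 * 4) + 1;     // = 41 : countdown shifted up 2 bits, exec_unit in the low bits
    unsigned f_number = 7 - 5;         // register 7 is field 2 of Scoreboard_2
    scb2 = scb2 + field * (1u << (6 * (f_number - 1)));  // = scb2 + 41 * 64 = scb2 + 2624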

function [sboardOut, WB_EN, WB_ADDR, WRITE_MODE, WB_ARB] = sboardCycle(SBOARD_IN, board_num, opcode, WB_EN, WB_ADDR, WRITE_MODE, WB_ARB)

    inter_holder = SBOARD_IN;  // variable to store the successive slices of the Scoreboard segment in decimal form
    new_holder = 0;            // variable to store the Scoreboard segment after decrementing all non-zero countdowns
    inter_field = 0;           // variable to store the fields of the Scoreboard segment as they are stripped away
    inter_count = 0;           // variable to store their countdown sub-fields
    inter_xunit = 0;           // variable to store their exec_unit sub-fields
    num_xunit_bits = 2;        // variable to store the number of bits in the exec_unit sub-field : to help
                               // facilitate the addition of more execution units in the Execution_pipeline

    for(i=1:5)  // for all 5 fields of the Scoreboard segment : starting with the most significant field
        inter_field = floor(inter_holder / (2 ^ (6*(5-i))));               // place the field value in variable inter_field
        inter_count = floor(inter_field / (2 ^ num_xunit_bits));           // place its countdown sub-field in inter_count
        inter_xunit = inter_field - (inter_count * (2 ^ num_xunit_bits));  // place its exec_unit sub-field in inter_xunit
        inter_holder = inter_holder - ((inter_field) * (2 ^ (6*(5-i))));   // remove this field from the rest of
                                                                           // the Scoreboard segment
        if(inter_count == 1)  // if the countdown value is 1
            WB_EN = 1;                                // assert the write back operation on the register
            WB_ADDR = (5-i) + ((board_num - 1) * 5);  // represented by this field of the Scoreboard
            WRITE_MODE = 0;
            if(inter_xunit == 2)  // taking the write back data value from the execution unit
                WB_ARB = 1;       // represented by the exec_unit sub-field
            end
            if(inter_xunit == 1)
                WB_ARB = 0;
            end
            inter_field = inter_field - inter_xunit;  // strip the exec_unit sub-field away from the field : if a
                                                      // write back operation has been asserted on this register
        end

        inter_field = inter_field - (inter_count * (2 ^ num_xunit_bits));  // strip the countdown sub-field away
                                                                           // from the field
        if(inter_count > 0)
            inter_count = inter_count - 1;  // decrement the countdown sub-field value if it's bigger than zero
        end
        inter_field = inter_field + (inter_count * (2 ^ num_xunit_bits));  // set the countdown sub-field within
                                                                           // the field to the new value
        new_holder = new_holder + (inter_field * (2 ^ (6*(5-i))));  // set the field in the Scoreboard segment
                                                                    // representing this register to the new field value
    end

    sboardOut = new_holder;  // output the modified Scoreboard segment

Figure B3 : Showing the code of the sboardCycle() sub-function of

FPU_Control_Unit.

The inputs to the sboardCycle() sub-function are SBOARD_IN (the Scoreboard

segment to be processed), board_num (the Scoreboard segment number), and all of

the control signals used by FPU_Control_Unit to assert a write back operation. If

sboardCycle() detects that a write back operation is scheduled to occur in the current

clock cycle it assigns the appropriate values to these control signals, otherwise they

are passed out unchanged.

function decision = checkScore(reg_no, sc1, sc2, sc3, sc4, sc5, . . . . . . . . . . . sc19, sc20)

    xunit_bits = 2;    // variable to store number of bits in an exec_unit sub-field : to help facilitate the
                       // addition of more execution units in the Execution_pipeline
    field_number = 0;  // variable to store the Scoreboard segment field number of the register being checked
    score = 0;         // variable to store successive slices of the Scoreboard segment as fields are stripped away
    field = 0;         // variable to store the Scoreboard field representing the register being checked
    countdown = 0;     // variable to store the countdown sub-field value of the register being checked
    ii = 0;            // variable used as a loop index

    switch reg_no
        case {1,2,3,4,5}  // if the register being checked is in Scoreboard1
            field_number = reg_no;
            score = sc1;
        case {6,7,8,9,10}  // if the register being checked is in Scoreboard2
            field_number = (reg_no - 5);
            score = sc2;
        .
        .
        .
        .
        case {96,97,98,99,100}  // if the register being checked is in Scoreboard20
            field_number = (reg_no - 95);
            score = sc20;
    end

    for(ii=1:(6-field_number))  // starting with the most significant field : strip successive fields away from
        field = floor(score / (2 ^ (6*(5-ii))));     // the Scoreboard segment and place them in the variable
        score = score - (field * (2 ^ (6*(5-ii))));  // field : until it contains the field value of the register
    end                                              // being checked

    countdown = field / (2 ^ xunit_bits);  // extract the countdown sub-field from this field
    if(countdown == 0)
        decision = 1;  // decision is used as a boolean type variable to represent
    else               // whether or not the register passes its Scoreboard check
        decision = 0;
    end

Figure B4 : showing a synopsis of the checkScore() sub-function of

FPU_Control_Unit, used to check the Scoreboard for any data hazards that would

arise from submitting a particular instruction to the Execution_pipeline

As can be seen in figure B4 above, the inputs to checkScore() are the number of the

register to be checked (reg_no) and the Scoreboard in its segmented form. The

Scoreboard field representing the register is extracted from the appropriate

Scoreboard segment and its countdown sub-field is inspected. If at that point in time

the register is the target of an upcoming write back operation, the countdown value

will be greater than zero, in which case a 0 is assigned to the output variable

(decision) to signal that the Scoreboard check has failed. Otherwise the register

passes the Scoreboard check and a 1 is assigned to decision. Only if all three

registers associated with the issued instruction pass the Scoreboard check will

FPU_Control_Unit submit the instruction to the Execution_pipeline.

Further detail on how the FPU submits instructions to the Execution_pipeline

Whenever a MACC instruction is issued to the FPU, if it is submitted to the

Execution_pipeline, FPU_Control_Unit also assigns the instruction’s destination

register number to its MQ_DELAYED output in the same clock cycle. This output of

FPU_Control_Unit is fed back to its MQD_IN input via a delay block. The latency of

this delay block is 5 clock cycles, and thus as can be seen by observing figure 35,

when the execution of a MACC instruction is at the stage where the register fetch for

the accumulate-stage needs carrying out, its destination (and third source) register

number appears at MQD_IN, and FPU_Control_Unit uses this in asserting the register

fetch. If the instruction is either an individual multiply or add, the Control unit

assigns zero to its MQ_DELAYED output.

When submitting an add or a multiply instruction, and when not submitting any

instruction FPU_Control_Unit always assigns zero to its MQ_DELAYED output.

Thus when the FPU is issued with an add instruction in a cycle in which the register

fetch of a MACC’s accumulate stage must be carried out, FPU_Control_Unit knows


that the instruction must not be submitted to the Execution_pipeline as it can see that

the value of MQD_IN is non-zero.
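A minimal C sketch of this feedback path is given below; the array-based shift register is an illustrative stand-in for the 5-cycle delay block, and submitted_macc and dest_reg are hypothetical names.

    // Sketch of the MQ_DELAYED feedback path through the 5-cycle delay block.
    unsigned mq_delay[5] = {0};  // models the delay block's 5 pipeline stages

    // executed once per clock cycle :
    unsigned mqd_in = mq_delay[4];                // destination register number submitted 5 cycles ago
    for( int n = 4; n > 0; n-- )
        mq_delay[n] = mq_delay[n-1];              // advance the delay line
    mq_delay[0] = submitted_macc ? dest_reg : 0;  // MACC : destination register number, otherwise zero

    if( mqd_in != 0 )
    {
        // this cycle carries the register fetch for a MACC's accumulate stage,
        // so a newly issued add must not be submitted to the Execution_pipeline
    }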

Figure B5 below shows the FPU’s register file.

Figure B5 : showing the FPU’s register file

If the instruction issued to the execution pipeline is an arithmetic instruction (MACC,

multiply or add), the ADDR_RA and ADDR_RB outputs of the FPU’s Control unit are

set to the addresses of source registers RA and RB respectively, and it sets its

WRITE_EN output to a ‘0’. This causes the contents of source registers RA and RB

to be fetched from the register file.

As stated previously in section 4, a MACC instruction is executed in two stages, with

the first stage being to use the multiplier to multiply the contents of registers RA and

RB, and the second stage being to use the adder to add the result of the multiplication

to register RT/S.

The multiply part of the multiply-add instruction is fed through the pipelined

multiplier and in subsequent cycles the Control unit is able to submit other

instructions to the Execution_pipeline (as long as checking the Scoreboard does not

reveal that there is a possibility of errors occurring, as discussed previously in section 4).

The destination register number that the Control unit assigned to its MQ_DELAYED output appears at its MQD_IN input the cycle before the Execution_pipeline's

multiplier places the result of the corresponding multiply operation into its output

register.


Figure B6 below shows the FPU’s Execution_Pipeline.

Figure B6 : showing the FPU’s Execution_pipeline

The contents of register RT are then fetched from the register file. This register fetch

operation is issued one cycle before the result of the multiplication with which it is to

be added is ready because it takes one cycle for the contents of register RT to emerge

from the register file. Thus by scheduling the fetch operation for register RT in this

way its latency is hidden. As a result of this, the Control unit's SEL_MADD outputs are both fed through a delay block with a latency of 1 cycle, so that MUX_1 and

MUX_2 within the Execution unit pass the contents of register RT and the result of

the multiplication through to the adder in the cycle that they both emerge from the

register file and the output register of the multiplier respectively. All multiplexers

used have zero latency.

As also discussed previously in section 4, when the sboardCycle() sub-function of the

FPU_Control_Unit program detects that a write-back operation is scheduled to be

issued in the subsequent cycle, it issues it. The Control unit does this by assigning the

address of the detected destination register to its WB_ADDR output, which is

connected to the WRITE_ADDRESS input of the register file. The Control unit

assigns a ‘1’ to its WB_EN output which is connected to the WRITE_ENABLE input

of the register file, to enable a write operation into the register file. The

WRITE_MODE output of the Control unit is set to a ‘0’ so a MUX passes the output

of the Execution unit through to the DATA input of the register file. The WB_ARB

output of the Control unit is assigned based on the Execution-unit sub-field of the

destination register’s Scoreboard field. This sub-field holds a value of 1 when the

Execution unit’s adder produces the final result to be written into the register file (in

which case a ‘0’ is assigned to the WB_ARB output of the Control unit), whereas if


the Execution unit’s multiplier produces the final result then this sub-field holds a

value of 2 (in which case a ‘1’ is assigned to the WB_ARB output of the Control

unit).

If a newly issued instruction is found to carry a hazard, the instruction is not submitted to the Execution_pipeline and the Control Unit asserts a ‘1’ on its STALL output. The Controller will then continue to issue that same instruction until the Control unit submits it to the Execution_pipeline and asserts its STALL output back to a ‘0’.

Appendix C

Figure C1 : showing counter_1 of the optimal Controller’s loop counter structure


Figure C2 : showing counter_2 of the optimal Controller’s loop counter structure

Figure C3 : showing counter_3 of the optimal Controller’s loop counter structure


Appendix D

Figure D1 : showing the optimal Controller’s High-level controller


Section 9 : Acknowledgements

I would like to give a note of thanks to my Industrial supervisor Richard Walke, as

well as my two Academic supervisors Andy Tyrrell and Jonathan Dell.

