Parallel Processor for Graphics Acceleration
Design and Development of a Floating-point Co-processor for the Acceleration of Graphics functions
Sandip Jassar
Department of Electronics, The University of York
Academic Supervisors: Andy Tyrrell, Jonathan Dell
Xilinx Development Centre, Edinburgh
Industrial Supervisor: Richard Walke
15th June, 2005
Final Project Report for degree of MEng in Electronic and Computer
Engineering
Abstract
This report details the design and development of a processing architecture, complete
with a Controller and DMA unit. The reader is shown how the architecture was
optimised for executing Nested-Loop Programs, and in particular those found in the
geometric transformation processes stage of the OpenGL pipeline.
Contents
Section 1 : Introduction
    1.1 Parallel Processing for Graphics Functions
    1.2 OpenGL
        1.2.1 The OpenGL Pipeline
        1.2.2 The Geometric Transformation Process
            1.2.2.1 The ModelView Transformation
            1.2.2.2 The Projection Transformation
            1.2.2.3 Perspective Division and the ViewPort Transformation
    1.3 Overview of Transformations on Nested Loop Programs
Section 2 : FIR Filter Analysis
    2.1 The Single MACC FIR Filter
    2.2 The Transposed FIR Filter
    2.3 The Systolic FIR Filter
    2.4 The Semi-Parallel FIR Filter
Section 3 : Matrix-Vector Multiplication Analysis
    3.1 The Sequential Model of Computation
    3.2 Exploiting the Inherent Parallelism
        3.2.1 Unrolling the Inner-Loop
        3.2.2 Unrolling the Outer-Loop
    3.3 Using Pipelined MACC Units
        3.3.1 Optimising the Code for Execution on a Pipelined MACC Unit
Section 4 : The Floating Point Unit
    4.1 Dealing with Hazards
        4.1.1 Using the Scoreboard to Detect Data Hazards
        4.1.2 Managing Data Hazards
        4.1.3 Managing Structural Hazards
Section 5 : The Controller
    5.1 Initial Look-Up Table Design
    5.2 Optimising the Controller for Running Geometric Transformation Programs
        5.2.1 The Use of Induction Variables
        5.2.2 Performing Strength Reduction on Induction Variables
        5.2.3 Optimising a Geometric Transformation for the Controller
        5.2.4 Designing the Optimal Controller
            5.2.4.1 The Data Address Generator Design
            5.2.4.2 The Use of Data Address Generators as Part of the Controller
            5.2.4.3 The Loop Counter Structure
        5.2.5 The High-Level Controller
        5.2.6 The DMA Unit and Context Switching
Section 6 : Testing and Integration
    6.1 Testing of the FPU’s Control Unit
    6.2 Testing and Integrating the Register File
    6.3 Testing and Integrating the Execution Pipeline
    6.4 Testing and Integration of the Look-Up Table Controller
    6.5 Testing of the Optimal Controller’s Data Address Generator
    6.6 Testing and Integration of the Optimal Controller’s Loop Counter Structure, Program Memory and DAGs
    6.7 Testing and Integration of the DMA Unit
    6.8 Testing and Integration of the High-Level Controller
    6.9 The OpenGL Demonstration
    6.10 Progress in Developing a Hardware-Based Demonstration
Section 7 : Conclusion
Section 8 : Appendices
Section 9 : Acknowledgments
Section 10 : References
1 : Introduction
This section begins by describing the 3-D graphics rendering application domain, and
the characteristics of the associated algorithms which allow for parallel execution.
The industry standard OpenGL 3-D graphics rendering pipeline is analysed, with
special focus given to the geometric transformations stage, which consists of a
pipeline of matrix-vector manipulation operations that implement a key part of
OpenGL’s operation in defining what a scene looks like in terms of the position and
orientation of the objects in it. The section then closes by looking at how Nested
Loop Programs such as the FIR filter can be transformed to make their inherent
parallelism more explicit, so that it can be exploited.
Section 2 then follows on from this and examines the different combinations of
transformations that can be applied to the FIR filter and the resulting implementation
tradeoffs. Section 3 then carries out a similar analysis on another prominent NLP in
graphics processing, the matrix-vector multiplication algorithm, where similarities
and differences with the FIR filter are analysed, and tradeoffs between the different
ways to optimise the algorithm’s implementation are discussed. Section 4 then details
the design of a FPU which can exploit the characteristics inherent in the matrix-vector
multiplication algorithm if they are exposed by transforming the code as discussed in
the previous section, by employing temporal parallelism.
Section 5 goes through the design process for an Optimal Controller for the FPU
using a typical matrix-vector multiplication algorithm found in the geometric
transformations stage of the OpenGL pipeline, and goes on to show how the
processor architecture as a whole was optimised for this particular section of the
OpenGL pipeline. Section 6 explains how the complete processor architecture was
built up from its constituent blocks in a hierarchical fashion, after they were tested
against their specification and integrated together into sub-systems, and ends by
describing the demonstration of OpenGL’s geometric transformations based on the
processor architecture developed. Section 7 finishes by drawing conclusions from the
project and putting the results achieved into context.
1.1 : Parallel Processing for Graphics Functions
Currently multiprocessing is the technique used to carry out the significant arithmetic
processing required to implement the realistic rendering techniques used in advanced
3-D graphics applications. Most often this takes the form of clusters of commodity
PCs [1].
As discussed previously, graphics arithmetic consists largely of matrix and vector
manipulation, and thus lends itself to parallel processing because of the independence
of the individual operations required within the high-level matrix/vector function.
Most high-level functions encountered in a graphics environment are also ‘order-
independent’, as the order in which they are executed has no effect on the final
display. Such functions can thus be executed in parallel to achieve higher
throughput, which is one of the fundamental strengths of FPGAs. However, some
‘sequential’ functions will also be encountered, which must be executed after all
preceding and before all subsequent functions; if a parallel processing architecture is
employed it must therefore handle sequential functions with minimal degradation of
the overall system performance.
Another benefit of having a number of identical processors operating in parallel is
that the programming of each processor can be the same, so this method of
obtaining high performance can also greatly simplify the software development process.
1.2 : OpenGL
The Open Graphics Library (OpenGL) is the industry’s most widely used and
supported 2D and 3D graphics API [2]. As such there are thousands of applications
based on OpenGL that are used to render compelling 2D and 3D graphics, in markets
including broadcasting, CAD/CAM, entertainment, cinematics, medical imaging
and virtual reality; the most famous of these is Pixar’s RenderMan (used by the
movie industry to create special effects). Individual function calls in the OpenGL
environment can be executed on dedicated tuned hardware, run as a software routine
on the generic system CPU or implemented as a combination of both. As a result of
this implementation flexibility, OpenGL hardware acceleration can range from that
which simply renders 2-D lines and polygons to the more advanced floating-point
processor capable of transforming and computing geometric data.
1.2.1 : The OpenGL Pipeline
The two types of data that are input to the OpenGL pipeline are pixel data and
geometric data. Pixel data is the RGBA data associated with pixels on the screen, and
comes in the form of individual pixel colour values, images and bitmaps. Geometric
data is used to model objects (ranging in complexity from simple 2-D shapes to
realistic 3-D objects), and comes in the form of points, lines and polygons, which are
OpenGL’s three geometric primitives. All geometric data is eventually described as
vertices. The data associated with each vertex is its 3-D positional coordinate vector,
normal vector and material properties (used in lighting calculations), pixel colour
value (RGBA value or an index to a colour-map), and its texture coordinates (used to
map a texture onto the vertex’s parent object). Figure 1 below shows an overview of
the OpenGL pipeline in terms of its various stages and the order in which operations
occur.
Figure 1 : showing an overview of the OpenGL pipeline
As can be seen in figure 1 above the vertex and pixel data are initially processed
differently before both being used in the rasterization stage. All data can be input to
the pipeline from the application and processed immediately, or saved in a display list
which sends the data to the pipeline when the list is executed.
In the per-vertex operations stage, the three vectors associated with each vertex
(spatial coordinates, normal and texture coordinates) are transformed (multiplied) by
the current modelview matrix, its inverse transpose and the current texture matrix
respectively. These transformations are carried out in order to transform the position,
orientation and size of the vertices’ parent objects in the scene. Lighting calculations
are then performed on each vertex using their transformed spatial coordinate and
normal vectors, material properties and the current lighting model. These calculations
act so as to scale each vertex’s pixel colour value to reflect the location and
orientation of its parent polygon relative to the light source(s).
The viewing volume of the scene is defined by six clipping planes. In the primitive
assembly stage the spatial coordinate vector of all vertices is transformed (multiplied)
by the projection matrix, so as to clip the scene against the six planes of the viewing
volume. Depending on the type of clipping employed primitives may have vertices
rejected, modified or added. The programmer may also define additional clipping
planes to further restrict the viewing volume, in order to create cut-away views of
objects, and other similar effects. The equations that represent such additional
clipping planes are used to transform the spatial coordinate vectors before the
projection matrix.
The spatial coordinate vectors of all vertices then go through perspective division,
where primitives are scaled in size to reflect their distance from the viewer. This is
followed by the viewport transformation where the spatial coordinate vectors of all
vertices are transformed so as to map the 3-D scene onto the 2-D viewport (viewing
window) of the computer screen.
In the rasterization processing stage, all primitives (points, lines and polygons) are
rasterized into fragments, where the squares of the viewport’s integer pixel-grid that
are occupied by each primitive are determined. If enabled, advanced features to make
the rendered scene more realistic are also implemented in this stage. The most
commonly used of these is anti-aliasing, which is used to smooth jagged edges that
result from having to map non-vertical and non-horizontal lines to the square pixel-
grid of the viewport. Anti-aliasing calculates the portion of each square within the
pixel-grid that would be occupied by a line, if the line were to be drawn as originally
defined (before the viewport transformation), and this value is known as a pixel’s
coverage value. A pixel’s coverage value is used to scale the alpha component of its
RGBA colour value. Figure 2 illustrates how pixel coverage values are evaluated.
Figure 2 : showing an example to illustrate how a pixel’s coverage value is evaluated
With reference to figure 2 above, the green and orange show how a diagonal line
looks before and after it’s subjected to the viewport transformation respectively, and
the coverage values for the pixels occupied by the line are given on the right.
In the pixel operations stage, pixel data that is input to the pipeline is scaled, biased
and processed using a colour-map, after which the colour values are clamped to a
certain range. The resulting pixel data is then either rasterized into fragments or
written to texture memory for use in texture mapping. Data from the framebuffer can
be read back and placed in the host processor memory. Data for texture mapping can
be taken from either the host processor memory or the framebuffer.
If texturing is enabled, in the per-fragment operations processing stage, the texture
coordinate vector of each vertex (of a fragment’s primitive) is used to map the
vertices to specific points on a two-dimensional texture image, so that the texture
(known as the texel for that particular fragment) can be mapped onto the primitive
appropriately after its position and orientation are transformed in the per-vertex
operations stage. If enabled, other advanced features such as blending (used to create
a photo-realism effect) are also implemented in the per-fragment operations stage.
It’s also in this stage that the coverage values calculated in the rasterization stage are
applied, if anti-aliasing is enabled.
The final pixel values are then drawn (written) into the framebuffer.
1.2.2 : The Geometric Transformation Process
The overall transformation process for producing a scene for viewing is analogous to
that carried out by a camera when it’s used to take a photograph. This transformation
process is carried out on the spatial coordinate vector of each vertex in the OpenGL
pipeline. This process is depicted below in figure 3.
Figure 3 : showing the stages of transformation for spatial coordinate vectors
As can be seen in figure 3 above, a vertex’s spatial coordinate vector consists of not
only the vertex’s three-dimensional x, y, z coordinates, but also a w component which
is used in the perspective division stage. All three vectors of a vertex in OpenGL
have four elements, thus all matrices (that they are multiplied by) are 4x4.
1.2.2.1 : The ModelView Transformation
A vertex’s spatial coordinates are first presented to the pipeline as object coordinates.
In this form the spatial coordinates specify a vertex’s location in 3-D space when its
parent primitive is centred on the origin and oriented in such a way that makes it easy
for the programmer to visualise where its vertices are in 3-D space (i.e. with the
primitive’s edges parallel with and perpendicular to the axes). The modelview
transformation is the combination of the modelling transformation and the viewing
transformation, represented by the modelling and viewing matrices respectively. The
modelling transformation is always carried out on an object before the viewing
transformation, as by default the modelview matrix is formed by the viewing matrix
pre-multiplying the modelling matrix.
The modelling transformation positions an object at a particular location in 3-D space
relative to the origin (by performing translation), rotates the object relative to the
axes (by performing rotation) and scales the object in size (by performing scaling).
These three transformations are represented by their own matrices, and these are
depicted in figure 4 below. The order in which the modelling transformation carries
out these transformations is determined by the order in which their respective matrices
are multiplied together to form the modelling matrix. The transformation represented
by a post-multiplying matrix is carried out before that represented by the pre-
multiplying matrix, and this principle holds true in all instances where transformation
matrices are combined together through multiplication.
T :
| 1  0  0  x |
| 0  1  0  y |
| 0  0  1  z |
| 0  0  0  1 |
Matrix T translates the object from being centered at the origin to the location defined by x, y, z; the object’s orientation and size are maintained.

S :
| x  0  0  0 |
| 0  y  0  0 |
| 0  0  z  0 |
| 0  0  0  1 |
Matrix S scales the object such that it is stretched by a factor of x, y, z in the direction of the corresponding axes; the object’s orientation and position are maintained.

Rx :
| 1    0      0     0 |
| 0   cosΦ  -sinΦ   0 |
| 0   sinΦ   cosΦ   0 |
| 0    0      0     1 |

Ry :
|  cosΦ   0   sinΦ   0 |
|   0     1    0     0 |
| -sinΦ   0   cosΦ   0 |
|   0     0    0     1 |

Rz :
| cosΦ  -sinΦ   0   0 |
| sinΦ   cosΦ   0   0 |
|  0      0     1   0 |
|  0      0     0   1 |
Matrices Rx, Ry and Rz rotate the object by Φ degrees about the x, y and z axes respectively; rotation about more than one axis is achieved by multiplying the relevant R matrices together; the object’s position and size are maintained.
Figure 4 : showing the translation, scaling and rotation matrices that are multiplied
together to form the modelling matrix
With reference to figure 4 above, it can be seen that the translation transformation on
a vertex is achieved by adding to each of its x, y and z components the product of its
w component and the corresponding component of the translation vector down
column 4 of the T matrix. The scaling transformation on a vertex is achieved by
multiplying each of its components by the corresponding component of the scaling
vector along the leading diagonal of the S matrix. The action of the rotation matrices
is rather more complex, although it can be seen from the R matrices above that the
vertex component associated with the axis being rotated about remains unchanged
after the transformation.
The viewing transformation is analogous with adjusting the camera location and the
direction in which it points when taking a photograph of the scene. This
transformation is comprised of a combination of translation (adjusting the viewpoint’s
location) and rotation (adjusting the viewpoint’s direction), and the associated
matrices are the same as those used to produce the modelling matrix (shown above in
figure 4). These two transformations within the viewing transformation (on the
viewpoint) have the exact reverse effect on the appearance of the scene to the
corresponding transformations within the modelling transformation (on the scene’s
objects). The default viewpoint is at the origin and points in the negative z-direction.
As objects are most often initially defined as being centred on the origin, the
modelview transformation as a whole must perform a transformation simply for the
objects in the scene to be visible, although any of the elementary transformations can
be omitted by simply not including their respective matrices in the product forming
the modelview matrix.
1.2.2.2 : The Projection Transformation
The eye coordinates resulting from the modelview transformation then go through the
projection transformation where they are converted to clip coordinates. The
projection transformation defines the viewing volume. The shape of the viewing
volume determines how objects are projected onto the screen and which objects (or
portions of objects) are clipped out of the final scene. The most common type of
projection used is perspective projection which employs a frustum shaped viewing
volume, as illustrated below in figure 5.
Figure 5 : showing the frustum shaped viewing volume employed by perspective
projection
Perspective projection implements foreshortening, whereby the further away from the
viewpoint an object is, the smaller it appears in the scene, thus emulating the way the
human eye (or a camera) works. The projection matrix that represents the
(perspective) projection transformation is depicted below in figure 6.
P :
| 2n/(r-l)     0       (r+l)/(r-l)      0       |
|    0      2n/(t-b)   (t+b)/(t-b)      0       |
|    0         0      -(f+n)/(f-n)  -2fn/(f-n)  |
|    0         0          -1            0       |

where n : near, f : far, r : right, l : left, t : top, b : bottom.

Matrix P clips the objects against the six planes of the viewing volume; the w component of each vertex is set to -z (the distance of the vertex from the origin, in a direction away from the viewpoint).
Figure 6 : showing the projection matrix for perspective projection
1.2.2.3 : Perspective Division and the ViewPort Transformation
The clip coordinates resulting from the projection transformation are then converted
to normalized device coordinates through the process of perspective division, where
the x, y, and z coordinates of each vertex are divided by its w component (which is set
to –z in the projection transformation as described previously in figure 6). This scales
the objects down in size to implement foreshortening as discussed previously in
regards to perspective projection. After the perspective division stage the four
element spatial coordinate vector of each vertex becomes a three element vector as
the w component is discarded.
The last stage of the process is the viewport transformation which maps the three-
dimensional normalized device coordinates to the two-dimensional viewport, in
converting them to window coordinates. The equations that perform the viewport
transformation are shown below in figure 7.
xw = ( ( xnd + 1 ) × ( width ÷ 2 ) ) + xo
yw = ( ( ynd + 1 ) × ( height ÷ 2 ) ) + yo
Figure 7 : showing the equations that perform the viewport transformation
With reference to figure 7 above, the (xo, yo) coordinate is the position of the bottom
left-hand corner of the viewport (viewing window) on the screen, relative to the
corresponding corner of the screen. The width and height parameters are the
dimensions of the viewport (in pixels).
1.3 : Overview of Transformations on Nested Loop Programs
A key problem in parallel processing is the way in which the program for each
processor is generated, in a way such that the overall program is executed efficiently.
Most graphics functions are described as a set of Nested for-loops [3].
The Nested-Loop Program (NLP) form of an algorithm represents its sequential
model of computation. The FIR filter operation is a simple example of a NLP and
this is shown below in figure 8 for calculating four output values.
for j = N:N+3
    for i = 0:N-1
        y[j] = y[j] + ( x[j-i] * h[i] )
    end
end
Figure 8 : showing the NLP form of the FIR filter operation
This sequential model of computation of an algorithm is representative of the way in
which the algorithm would be implemented as Software running on a standard single
thread processor.
The algorithmic transformation of unrolling transforms an NLP such that its task-level
parallelism is enhanced. As a result, tasks that are independent of each other are
shown explicitly to be mutually independent, although the resulting representation of
the algorithm is still functionally equivalent to the original. The independent tasks can
then be mapped to separate processors.
The algorithmic transformation of skewing makes the dependences of operations less
demanding, and thus allows for latency in the physical operators that will actually
execute them.
The dependency graph is a useful way of visualising the dependences between the
operations in a NLP, and represents an intermediate level of abstraction between the
NLP and its implementation. An example of a dependency graph is shown for the
FIR filter NLP of figure 8, where the outer j loop has been unrolled by a factor of two,
and N = 4. Data dependences between tasks (represented as circles) are shown by
one task forwarding data to another. The two independent tasks are highlighted in
different colours.
[Figure 9 graph: four accumulation chains, one per output value, each starting from 0 and summing h[0]·x[j] + h[1]·x[j-1] + h[2]·x[j-2] + h[3]·x[j-3] for j = 4 to 7; the two independent unrolled tasks are shown in different colours.]
Figure 9 : showing the effect of unrolling the outer loop of the FIR filter NLP by a
factor of 2
When the outer loop is unrolled by a factor of two, this is essentially the same as
making two copies of it that can be executed in parallel. As there are two copies, in
the implementation of this modified NLP each copy of the loop will need its own set
of registers for storing coefficient and data values. The unrolling transformation thus
translates to spatial parallelism being employed in the implementation.
The skewing transformation however translates to temporal parallelism (pipelining)
being employed in the implementation, and as such, a single set of registers is shared
by the different iterations of the internal loop, whose executions are overlapped.
These two transformations can be used to transform a sequential model of
computation into something that is closer to a data-flow model, thus making it more
suitable for efficient implementation (exploiting parallelism) on hardware.
Section 2 : FIR Filter Analysis
The FIR filter operation essentially carries out a vector dot-product in calculating each
value of y[n]. This is illustrated below in figure 10 for N = 4.
[Figure 10 graph: the input samples x[n], x[n-1], x[n-2], x[n-3] are paired with the coefficients h[0], h[1], h[2], h[3] to give x[n]·h[0] + x[n-1]·h[1] + x[n-2]·h[2] + x[n-3]·h[3].]
Figure 10 : showing how the FIR filter operation is comprised of vector dot-products
Section 2.1 : The Single MACC FIR Filter
The Single MACC FIR filter (shown below in figure 11) is an implementation of the
FIR filter’s sequential model of computation and as its name suggests it is based on a
single MACC (multiply-accumulate) unit. As such, the algorithmic description of this
implementation is identical to that of the NLP description of the FIR filter (shown
below in figure 12) without applying any unrolling or skewing transformations
(which were discussed earlier in section 1.3). For simplicity it is assumed throughout
this section unless otherwise stated that all references to MACC units refer to non-
pipelined MACCs with a total latency of 1 clock cycle.
Figure 11 : showing an example H/W implementation of the Single MACC FIR filter
[4]
The primary trade-off between sequential and parallel implementations of the same
algorithm is the amount of hardware resources required versus the throughput
achieved. As the Single MACC FIR filter implements the FIR filter function in a
completely sequential manner, the required hardware resources are reduced by a
factor of N, although so too is the throughput as compared to a fully parallel
implementation that would use one MACC unit for each of the N coefficients (where
the N MACCs would be cascaded).
void singleMaccFirFilter( int num_taps, int num_samples, const float *x, const float *h, float *y )
{
    int i, j;         // ‘j’ is the outer-loop counter and ‘i’ is the inner-loop index
    float y_accum;    // output sample is accumulated into ‘y_accum’
    const float *k;   // pointer to the required input sample

    for( j = 0; j < num_samples; j++ )
    {
        k = x++;      // x points to x[n+j] and is incremented (post assignment) to point to x[(n+j)+1]
        y_accum = 0.0f;
        for( i = 0; i < num_taps; i++ )
        {
            y_accum += h[i] * *(k--);   // y[n+j] += h[i] * x[(n+j) - i]
        }
        *y++ = y_accum;   // y points to the register address where y[n+j] is to be written and is
                          // incremented (post assignment) to point to where y[(n+j)+1] is to be written
    }
}
Figure 12 : A code description of the Single MACC FIR filter
With reference to the code description of the Single MACC FIR filter shown above in
figure 12, all of the required input samples are assumed to be stored in the register file
(with a stride of 1) of the processor (a single MACC unit in this case) executing the
code, with x initially pointing to the input sample x[n] corresponding to the first
output sample to be calculated y[n]. It is also assumed that all of the required
coefficients are stored in the same way in a group of registers used to store the h[ ]
array.
As can be seen from figure 12 above, and more clearly from the dependency graph of
the Single MACC FIR filter (shown below in figure 13), this implementation
evaluates (accumulates) only one output value at a time. Assuming that each
multiplication of x[(n+j)-i] and h[i] takes one clock cycle, then the performance of
this implementation is given by the following equation:
Throughput = Clock frequency ÷ Number of coefficients
[Figure 13 graph: a 4x4 grid of MACC_1 operations (inner-loop index i = 0 to 3 across, outer-loop index j = 0 to 3 down); each row accumulates one output into y_accum, e.g. h(0)·x[n] + h(1)·x[n-1] + h(2)·x[n-2] + h(3)·x[n-3], so only one output value is evaluated at a time on the single MACC unit.]
Figure 13 : dependency graph showing the operation of the Single MACC FIR filter
If the coefficients of the filter possess symmetry (i.e. h[0]=h[N-1], h[1]=h[N-2], etc.)
a doubling of throughput can be achieved at the same clock frequency using a
variation of the Single MACC FIR filter called the Symmetric MACC FIR filter. This
implementation uses a single coefficient in place of those that are equal, and as such
only one multiplication by that single coefficient is required, although the other
multiplicand is now the sum of the data values corresponding to those equal
coefficients. Thus, the cost of this performance enhancement is another adder at the
input of the multiplier, as well as another RAM block (or one dual-port block
RAM), as the two input samples corresponding to the single coefficient need to be
fetched simultaneously. However, as the number of coefficients is halved, so too is
the amount of storage required for them. If N is an odd number, the Symmetric
MACC FIR filter reduces the number of coefficients to (N+1)/2. The Symmetric
MACC FIR is derived from the Single MACC FIR by unrolling its inner-loop by a
factor of two and reversing the order in which one of the loops processes its
respective coefficient-data pairs. This is a unique example of employing spatial
parallelism.
Employing spatial parallelism is one way to enhance the performance of the Single
MACC FIR filter, and essentially uses more than one MACC unit to evaluate each
output sample. As a result each MACC unit evaluates an equal share of the
coefficient-data sample multiplications, and as such if M MACC units are employed,
the throughput is increased by a factor of M over the Single MACC FIR filter,
although so too are the required hardware resources.
Section 2.2 : The Transposed FIR Filter
The Transposed FIR filter (an example H/W implementation of which is shown below
in figure 14) is a fully parallel implementation, as one MACC unit is used for each of
the N coefficients. Unlike the Direct Form Type I fully parallel implementation
(which employs an adder tree structure), the Transposed FIR filter employs an adder
chain, and as such the MACC units are much easier to connect together and the
implementation can easily be scaled up or down in terms of N. With regards to
targeting the design to the Xilinx Virtex-4 FPGA, because of this adder-chain
structure the Transposed implementation can be entirely contained within the
dedicated Xilinx DSP48 slices as opposed to using generic FPGA fabric, which would
yield a less efficient mapping.
Figure 14 : showing an example H/W implementation of the Transposed FIR filter [4]
The input data samples are broadcast to all MACC units simultaneously, and with this
implementation the coefficients are assigned (in ascending order) starting from the
right-most MACC unit (from which the final output is taken). As well as the spatial
parallelism through the use of N MACC units, temporal parallelism is also employed
as the evaluation of successive output samples is overlapped, although this does not
increase the throughput, or decrease the latency before the first output sample
appears, relative to the Direct form type I implementation.
A code description for the Transposed FIR filter (with the number of taps N = 4) is
given in figure A1 of appendix A.
The Transposed FIR filter design is yielded by a complete unrolling of the inner-loop
(i-loop) of the original Single MACC FIR filter code description, which results in the
number of MACC units required increasing to N. A skewing of the outer-loop (j-
loop) by a factor of 1 is also performed, which results in the temporal overlapping of
successive output sample calculations. This skewing is required to schedule apart the
dependences that arise because the N MACC operations within any single iteration of
the outer-loop are dependent on the MACC in the previous iteration of the inner-loop
(for their third argument).
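A minimal software model of one clock cycle of this structure (a sketch, not the code of figure A1) makes the adder chain explicit; s[] is assumed to hold the N - 1 partial-sum registers between the MACC units:

```c
#define N 4
/* One clock of a transposed FIR: the new sample x is broadcast to every tap,
   and the partial sums shift one place towards the output each cycle. */
float transposed_fir_step(float x, const float h[N], float s[N - 1])
{
    int i;
    float y = h[0] * x + s[0];           /* right-most MACC produces the output */
    for (i = 0; i < N - 2; i++)
        s[i] = h[i + 1] * x + s[i + 1];  /* each register takes its left neighbour's sum */
    s[N - 2] = h[N - 1] * x;             /* left-most MACC seeds the chain */
    return y;
}
```

Feeding an impulse through this model reproduces the coefficients h(0) to h(3) in order, one per cycle.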
The dependency graph of the Transposed FIR filter’s operation is shown below (with
N = 4) in figure 15.
[Figure content: a dependency graph in which MACC units MACC_1 to MACC_4 pass
partial accumulations (y_accum0 = 0 through y_accum3) along the adder chain,
evaluating y[n] = h(0).x[n] + h(1).x[n-1] + h(2).x[n-2] + h(3).x[n-3] and, one
iteration of j later, y[n+1] = h(0).x[n+1] + h(1).x[n] + h(2).x[n-1] + h(3).x[n-2]]
Figure 15 : dependency graph showing the operation of the Transposed FIR filter
(with N = 4)
As can be seen above in figure 15, the initial latency before the first output sample
emerges from the output is the same as that seen with the fully-parallel FIR filter.
Once this initial latency (the spin-up procedure whereby the pipeline is filled over
the first N cycles) has been endured, the throughput yielded by the Transposed FIR
filter implementation is the same as that of the Direct form type I implementation
(equal to the clock frequency). The latency between when each input sample is
applied and the emergence of the corresponding output sample is also the same as
that seen with the Direct form type I implementation, and is equal to the latency of
a single MACC unit.
Section 2.3 : The Systolic FIR Filter
As with the Transposed FIR filter, the Systolic FIR filter (an example H/W
implementation of which is shown below in figure 16) is also a fully parallel
implementation, and also uses an adder-chain to accumulate each value of y[n].
Figure 16 : showing an example H/W implementation of the Systolic FIR filter [4]
The Systolic FIR filter also employs temporal parallelism in addition to spatial
parallelism in the same way that the Transposed FIR filter does. However the
Systolic FIR filter’s coefficients are assigned (in ascending order) starting from the
left-most MACC unit (to which the input samples are applied), which is the opposite
way to how the coefficients are assigned to the Transposed FIR filter’s MACC units.
As such, the Systolic FIR filter evaluates the inner-products (of which each value of
y[n] consists) in the reverse order to the Transposed FIR filter.
The input data samples are fed into a cascade of registers which act as a data
buffer. The Systolic FIR filter differs from the Direct form type I implementation
not only in its use of an adder-chain to accumulate each value of y[n], but also in
the additional register between each of the taps. A code
description of the Systolic FIR filter (with the number of taps N = 4) is given in figure
A2 of appendix A.
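A cycle-level software model of this structure (a sketch, not the code of figure A2) captures both the two-register delay line and the registered adder chain; d[] and p[] are assumed to hold the delay-line and partial-sum registers respectively:

```c
#define N 4
/* One clock of a systolic FIR model: d[] is the tapped delay line with two
   registers per tap (tap i reads d[2*i]), and p[] holds the per-tap
   partial-sum registers of the adder chain. */
float systolic_fir_step(float x, const float h[N], float d[2 * N], float p[N])
{
    int i;
    float y = p[N - 1];                    /* output emerges from the last tap */
    for (i = N - 1; i > 0; i--)
        p[i] = p[i - 1] + h[i] * d[2 * i]; /* registered adder chain */
    p[0] = h[0] * d[0];
    for (i = 2 * N - 1; i > 0; i--)        /* delay line shifts one place */
        d[i] = d[i - 1];
    d[0] = x;
    return y;
}
```

An impulse applied to this model emerges as h(0) to h(3) only after several cycles of pipeline fill, reflecting the longer input-to-output latency discussed later in this section.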
As with the Transposed FIR filter, the Systolic FIR filter is yielded by a complete
unrolling of the inner-loop (i-loop) of the original Single MACC FIR filter, and a
skewing of its outer-loop by a factor of 1. However, in generating the Systolic FIR
filter from the Single MACC FIR, the outer-loop is skewed in the opposite direction to
how it is skewed in generating the Transposed FIR filter. This means that the inner-
products which are summed together to produce an output sample are evaluated in the
opposite order to that in which they’re evaluated by the Transposed FIR filter. This
difference is reflected in the two H/W implementations as the Transposed FIR
implementation employs a broadcast structure to feed the same input sample to each
of its MACC units on each clock cycle, whereas the Systolic implementation employs
a tapped delay line with a delay of two clock cycles between each MACC unit. The
dependency graph of the Systolic FIR filter’s operation is shown below in figure 17
(with N = 4).
[Figure content: a dependency graph in which MACC units MACC_1 to MACC_4 read
input samples from the tapped delay line (two registers between units) and pass
partial accumulations (y_accum0 = 0 through y_accum3) along the adder chain,
evaluating y[n] = h(0).x[n] + h(1).x[n-1] + h(2).x[n-2] + h(3).x[n-3] and, one
iteration of j later, y[n+1] = h(0).x[n+1] + h(1).x[n] + h(2).x[n-1] + h(3).x[n-2]]
Figure 17 : dependency graph showing the operation of the Systolic FIR filter
(with N = 4)
The green arrows represent the forwarding of a partial accumulation of an output
sample through the adder-chain, whilst the effect of the two registers between each of
the MACC units is represented by the blue arrows as they represent the forwarding of
input samples between successive MACC units.
As with the Transposed FIR filter, the initial latency before the first output sample
emerges, and the throughput thereafter, are the same as those seen with the Direct
form type I implementation. However, because the Systolic FIR filter evaluates and
accumulates the inner-products in the opposite order to the Transposed FIR filter,
the latency between each input sample being applied to the filter and the
corresponding output sample emerging is N clock cycles (assuming the latency of
each MACC unit is 1 cycle). Thus this latency increases by a factor of N compared
to that seen with both the Transposed and Direct form type I implementations.
However, the advantage that the Systolic FIR filter holds over the Transposed
implementation is that its input is only applied to one MACC unit, whereas the
Transposed FIR filter's input is broadcast to all of its MACC units and thus has a
high fan-out. The Systolic implementation is therefore more suitable than the
Transposed implementation for higher values of N.
Section 2.4 : The Semi-Parallel FIR Filter
The Semi-Parallel FIR filter (sometimes called the Hardware-folded implementation)
divides its N coefficients amongst M multiply-add units. An example implementation
of the Semi-Parallel FIR filter is shown below in figure 18 (with N = 16, M = 4).
Figure 18 : showing an example H/W implementation of the Semi-Parallel FIR filter
(with N = 16, M = 4) [4]
Each group of N/M coefficients is assigned to one of the MACC units and stored in
order within the associated coefficient-memory. The first group (coefficients 0 to
(N/M – 1)) is assigned to the left-most MACC unit (to which the input samples are
applied), with ascending coefficient groups being assigned to the MACC units from
left to right. If N is not exactly integer-divisible by M, then the higher-order
coefficient-memories are padded with zeros.
Like the Transposed and Systolic implementations, the Semi-Parallel FIR filter
employs both spatial and temporal parallelism, but the degree to which it does this
depends on the ratio of M:N, with a higher M:N ratio resulting in a higher degree of
both spatial parallelism (as more MACC units are used) and temporal parallelism (as
each output sample is evaluated more quickly and thus the evaluation of more output
samples can be overlapped in time). Thus the trade-off is the performance obtained
versus the resources required, as can be seen from the equation for the throughput of
a Semi-parallel FIR filter implementation:
Throughput = ( Clock frequency ÷ N ) * M
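A behavioural software model of this partitioning (a sketch, not the code of figure A3) shows how the N coefficients are divided among the M units; x[i] is assumed to hold the buffered sample x[n-i], and each unit's inner loop stands for its N/M sequential clock cycles:

```c
#define N   16
#define M   4
#define CPM (N / M)                       /* coefficients (and cycles) per unit */

float semi_parallel_fir(const float h[N], const float x[N])
{
    float partial[M];
    float y = 0.0f;
    int k, i;
    for (k = 0; k < M; k++) {             /* the M MACC units run concurrently in H/W */
        partial[k] = 0.0f;
        for (i = 0; i < CPM; i++)         /* N/M cycles of sequential MACCs per unit */
            partial[k] += h[k * CPM + i] * x[k * CPM + i];
    }
    for (k = 0; k < M; k++)               /* adder chain into the output accumulator */
        y += partial[k];
    return y;                             /* one output sample every N/M clock cycles */
}
```

Because the M partial sums are ready together after N/M cycles, the model reproduces the throughput of (clock frequency / N) * M given above.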
The Semi-parallel implementation may be extrapolated either towards being a fully-
parallel implementation like the Transposed and Systolic implementations by using
more MACC units, or the other way towards being a Single MACC FIR filter by
using fewer MACC units. A code description of the Semi-parallel FIR filter (with N
= 16, M = 4) is given in figure A3 of appendix A. The dependency graph of the
Semi-parallel FIR filter’s operation (with N = 16, M = 4) is shown below in figure 19.
[Figure content: a dependency graph in which MACC units MACC_1 to MACC_4 and the
output accumulator ACC_1 evaluate y[n] = h(0).x[n] + h(1).x[n-1] + ... +
h(15).x[n-15] and y[n-1] = h(0).x[n-1] + h(1).x[n-2] + ... + h(15).x[n-16], with
accumulator values y_accum0 = 0 through y_accum4 feeding y_out, and the memory
addresses of each unit (i = 0 to 3) lagging one behind those of the unit to its left]
Figure 19 : dependency graph showing the operation of the Semi-Parallel FIR filter
The red circles represent MACC units being used to calculate the inner-products of
y[n], and the dark-red circles represent the output-accumulator being used to
accumulate the inner-products of y[n]. The blue and dark-blue circles represent the
same for y[n-1], whilst the yellow circles represent MACC units being used to
calculate the inner-products of y[n+1]. As can be seen above in figure 19, the address
(which lies in the range [0:((N/M) – 1)] for all MACC units) applied to the data-
buffer and coefficient memory of each MACC unit lags one behind the corresponding
address of the MACC unit to its immediate left, and all such addresses cycle
continuously and monotonically from 0 to ((N/M) – 1). This is necessary in order to
employ temporal parallelism by overlapping (in time) the evaluation of successive
output samples in the way shown in figure 19 above. This temporal parallelism is in
turn necessary to achieve the Semi-Parallel implementation's maximum throughput of
one output sample every N/M clock cycles, because once an output sample has been
retrieved from the accumulator by the capture register, the accumulator must be reset
(to either zero or its input value). For its input value to be the first sum of M
inner-products of the next output sample, the evaluation of that sum must have
finished in the previous clock cycle; otherwise the accumulator has to be reset to
zero at the start of each new result-cycle (in which an output sample is
accumulated). If the accumulator were reset to zero between result-cycles in this
way, one extra clock cycle would be required for the evaluation of each output
sample, thus degrading performance.
Section 3 : Matrix-Vector Multiplication Analysis
Matrix-vector multiplication is essentially a series of vector dot-products, as element
(r,1) of the resultant vector is the dot-product of row r of the matrix with the
multiplicand column vector. This is illustrated below in figure 20, which shows how
the (4x1) resultant column vector R is formed from the multiplication of the (4x4)
matrix M and the (4x1) column vector V.
[Figure content: R = M x V, where
R(1) = M(1,1).V(1) + M(1,2).V(2) + M(1,3).V(3) + M(1,4).V(4)
R(2) = M(2,1).V(1) + M(2,2).V(2) + M(2,3).V(3) + M(2,4).V(4)
R(3) = M(3,1).V(1) + M(3,2).V(2) + M(3,3).V(3) + M(3,4).V(4)
R(4) = M(4,1).V(1) + M(4,2).V(2) + M(4,3).V(3) + M(4,4).V(4)]
Figure 20 : showing how matrix-vector multiplication is comprised of a series of
vector dot-products
Section 3.1 : The Sequential Model of Computation
Matrix-vector multiplication is related to the FIR filter in that both algorithms
consist of a series of vector dot-products. Considering both algorithms in their
sequential form, their outer for-loop essentially iterates over the number of vector
dot-products required, and their inner for-loop iterates over the vector/matrix-row
element pairs. Figures 21 and 22 show a code description and dependency graph,
respectively, of the matrix-vector multiplication problem's sequential model of
computation. For simplicity it is assumed throughout this section, unless otherwise
stated, that all references to MACC units refer to non-pipelined MACCs with a total
latency of 1 clock cycle.
void sequentialMatrixVectorMultiply(int num_matrix_rows, int num_matrix_cols,
                                    const float *m, const float *v, float *r)
{
    int row, col;
    for( row = 0; row < num_matrix_rows; row++ )
    {
        r[row] = 0.0f;
        for( col = 0; col < num_matrix_cols; col++ )
        {
            r[row] += m[row * num_matrix_cols + col] * v[col]; // matrix is processed in ROW-MAJOR order
        }
    }
}
Figure 21 : showing the code description of the sequential model of computation of
the matrix-vector multiplication algorithm
[Figure content: a dependency graph in which the single unit MACC_1 evaluates, one
per clock cycle, the sixteen MACC operations R[row] += M(row, col).V(col) for
row = 0 to 3 and col = 0 to 3, producing in turn
R(1) = V(1).M(1, 1) + V(2).M(1, 2) + V(3).M(1, 3) + V(4).M(1, 4),
R(2) = V(1).M(2, 1) + V(2).M(2, 2) + V(3).M(2, 3) + V(4).M(2, 4),
R(3) = V(1).M(3, 1) + V(2).M(3, 2) + V(3).M(3, 3) + V(4).M(3, 4), and
R(4) = V(1).M(4, 1) + V(2).M(4, 2) + V(3).M(4, 3) + V(4).M(4, 4)]
Figure 22 : showing the dependency graph of the matrix-vector multiplication
algorithm’s sequential model of computation (for a 4x4 matrix and 4x1 vectors)
However, as can be seen from figure 20 above, in matrix-vector multiplication each
element of the matrix is a multiplicand of strictly one inner-product in one vector dot-
product, and thus with reference to the program of figure 21, each element of the
matrix is strictly a multiplicand of one MACC operation in one specific iteration of
the inner-loop within one specific iteration of the outer-loop. This is in contrast to the
sequential (Single MACC) FIR filter algorithm where each input sample is a
multiplicand of one inner-product in N successive vector dot-products.
The multiplicand column-vector of a matrix-vector multiplication is analogous to the
coefficient vector used in the FIR filter algorithm, as all vector-dot products
performed by both algorithms multiply these vectors by another.
Section 3.2 : Exploiting the Inherent Parallelism
The Transposed and Systolic FIR filter implementations discussed previously in
sections 2.2 and 2.3 respectively, were formed by completely unrolling the inner-loop
and then skewing the outer-loop (by a factor of 1 in opposite directions) of the original
sequential FIR filter code. With reference to figure 21 above, if the inner-loop of the
matrix-vector multiplication algorithm is unrolled by any factor, as was previously
discussed in section 2.2 (with regards to the FIR filter algorithm) the outer-loop then
has to be skewed for the MACC operations scheduled for simultaneous execution (by
different MACC units) to be independent of one another. However, as already
discussed, each matrix element is used as a multiplicand only once throughout the
execution of the entire algorithm, and thus, unlike with the FIR filter algorithm, the
direction in which the outer-loop is skewed essentially makes no difference, as each
MACC unit would still have to access a separate matrix element at the start of each
clock cycle. Thus, unlike in the Transposed FIR filter, the access of each matrix
element (analogous to each input sample of the FIR filter) cannot be shared among
all MACC units. Similarly, unlike in the Systolic FIR filter, there is no sense in
feeding the matrix elements through a tapped delay line in order to amortise the
overhead of accessing them.
Section 3.2.1 : Unrolling the Inner-Loop
Figure 23 below shows the dependency diagram of the matrix-vector multiplication
code (in its sequential form) of figure 21 after its inner-loop has been completely
unrolled (with each iteration executed on a separate MACC unit) and its outer-loop
has subsequently been skewed by a factor of +1 in a way analogous to that used in
creating the Systolic FIR filter. With this series of transformations, each MACC unit
employed effectively processes one column of the matrix.
[Figure content: a dependency graph in which MACC units MACC_1 to MACC_4 each
process one column of the matrix, with start times skewed by one cycle per unit and
partial results forwarded along the chain from the initial values R(1) = 0 to
R(4) = 0, yielding R(1) = V(1).M(1, 1) + V(2).M(1, 2) + V(3).M(1, 3) + V(4).M(1, 4)
and likewise R(2), R(3) and R(4)]
Figure 23 : showing the dependency diagram of the matrix-vector multiplication
algorithm's sequential model of computation after completely unrolling its inner-loop
and skewing its outer-loop by a factor of +1
With reference to figure 23 above, the magenta and purple circles represent MACC
units used during the spin-up and spin-down procedures respectively, whilst the red
circles represent MACC units used during the steady-state. As can be seen from
figure 23, execution is only in the steady-state for one clock cycle. In order to
achieve better utilisation of the MACC units employed, several such matrix-vector
multiplication problems could be carried out in succession to amortise the overhead
of the spin-up and spin-down procedures. By doing this, the total execution time of
all the problems would be reduced to approximately a quarter of that required to
execute them on a single MACC unit, as is done in the matrix-vector multiplication
problem's sequential model of computation detailed previously in figures 21 and 22.
Alternatively, the inner-loop could be unrolled by a factor of only two (thus
employing only two MACC units), in which case the sum of the first two inner-
products of each vector dot-product would need to be stored as an intermediate value
in a register file. This implementation of the matrix-vector multiplication algorithm
would be approximately twice as fast as the implementation of its sequential model
of computation, and without as much need to chain together several problems to
amortise the spin-up and spin-down latency.
An advantage of this implementation (regardless of the factor the inner-loop is
unrolled by) is that each MACC unit uses the same vector-element throughout
processing its respective column of the matrix, and thus the fetch operation for each
vector element is amortised over the execution of the entire problem.
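This column-per-unit partitioning can be sketched in software as follows (a behavioural model, not H/W); conceptual MACC unit k owns column k and re-uses the same vector element v[k] for every row it processes:

```c
#define ROWS 4
#define COLS 4
/* Inner-loop fully unrolled: unit k holds v[k] for the whole problem, so each
   vector element is fetched once; the partial sum travels along the chain. */
void matvec_inner_unrolled(const float m[ROWS][COLS], const float v[COLS],
                           float r[ROWS])
{
    int row, k;
    for (row = 0; row < ROWS; row++) {
        float chain = 0.0f;              /* partial sum forwarded unit to unit */
        for (k = 0; k < COLS; k++)       /* conceptual MACC unit k, column k */
            chain += m[row][k] * v[k];
        r[row] = chain;
    }
}
```

In hardware the inner loop above is spread across the four units, with the skewed start times shown in figure 23.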
Section 3.2.2 : Unrolling the Outer-Loop
Figure 24 below shows the dependency diagram of the implementation that results if
instead the outer-loop is completely unrolled, and thus each MACC unit employed
processes one row of the matrix.
[Figure content: a dependency graph in which MACC units MACC_1 to MACC_4 each
process one row of the matrix; on every cycle all four units consume the same
broadcast vector element V(col), each accumulating its own result from R(1) = 0 to
R(4) = 0, so that R(1) = V(1).M(1, 1) + V(2).M(1, 2) + V(3).M(1, 3) + V(4).M(1, 4)
and likewise R(2), R(3) and R(4)]
Figure 24 : showing the dependency diagram of the matrix-vector multiplication
algorithm’s sequential model of computation after its outer-loop is completely
unrolled
As can be seen from figure 24 above, there are no dependencies across separate
iterations of the outer-loop, so after this unrolling there is no need to skew any
instance of the inner-loop. Execution is therefore always in the steady state (as only
spatial parallelism is employed), meaning that all MACC units are always utilised
during execution. This implementation of the matrix-vector multiplication algorithm
would be four times faster than the implementation of its sequential model of
computation, and there is no need to amortise any spin-up and spin-down latency over
the execution of multiple problems as was the case for the implementation that results
from unrolling the inner-loop. Another advantage of this implementation is that each
vector-element is fetched only once (as is also the case when the inner-loop is
unrolled) where the MACC units employed always share the same vector-element
argument.
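This row-per-unit arrangement can be sketched in software as follows (a behavioural model; in H/W the inner loop below is what executes concurrently across the four units):

```c
#define ROWS 4
#define COLS 4
/* Outer-loop fully unrolled: on each cycle the same v[col] is broadcast to
   all units, and each unit accumulates its own row independently. */
void matvec_outer_unrolled(const float m[ROWS][COLS], const float v[COLS],
                           float r[ROWS])
{
    int col, u;
    for (u = 0; u < ROWS; u++)
        r[u] = 0.0f;                     /* per-unit accumulators start at zero */
    for (col = 0; col < COLS; col++)     /* one clock cycle per iteration */
        for (u = 0; u < ROWS; u++)       /* MACC units, operating in parallel */
            r[u] += m[u][col] * v[col];  /* v[col] fetched once and broadcast */
}
```

With no cross-iteration dependences, every unit performs useful work on every cycle, matching the steady-state behaviour described above.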
3.3 : Using Pipelined MACC Units
Until now, the MACC units discussed have been non-pipelined, where new operands
are only issued to such a unit after it has finished processing its previous operands.
The Transposed, Systolic and Semi-Parallel FIR filter implementations discussed
previously in section 2, used a pipeline of non-pipelined MACC units. As was
discussed in section 2, pipelining temporally overlaps (skews) multiple execution
threads, and thus once the initial latency (whilst the pipeline is filled) has been
endured, the subsequent throughput achievable is n times greater (where n is the
degree of pipelining employed, and the pipeline is balanced). This results from the
non-pipelined execution unit being segmented into n stages, where each stage
contributes an nth
of the overall latency, and is thus able to be clocked at n times the
rate of the non-pipelined equivalent. Thus if a single MACC unit was pipelined by a
degree of n (and clocked at n times the clock frequency), once the initial latency of
filling the pipeline had been endured, the subsequent throughput would be n times
greater than that possible with the non-pipelined version. For simplicity, every
instance of the word pipeline throughout this document will refer to a balanced
pipeline, unless otherwise stated. The benefit of a pipelined MACC unit over its non-
pipelined equivalent is depicted below in figure 25.
[Figure content: a timing diagram over twelve clock cycles; the non-pipelined MACC
completes only MACC0 to MACC2 (one operation every four cycles), whereas the 4-stage
pipelined MACC accepts a new operation (MACC0 to MACC8) on every clock cycle]
Figure 25 : showing an example to illustrate the benefit of a pipelined MACC over its
non-pipelined equivalent (with n = 4)
3.3.1 : Optimising the Code for Execution on a Pipelined MACC Unit
As discussed previously in section 2.2 the series of MACC operations that a specific
vector dot-product is comprised of have data-dependencies. Thus if the matrix-vector
multiplication problem’s sequential model of computation (shown previously in
figure 21) was executed on a pipelined MACC (consisting of a pipelined multiplier
followed by a pipelined adder), the achievable throughput would not be any higher
than that with the non-pipelined version of the MACC. This is illustrated below in
figure 26. When executing the sequential model of computation, the pipelined
MACC effectively skews the outer-loop.
[Figure content: a timing diagram over sixteen clock cycles in which the dependent
operations MACC0 : R(1) += M(1, 1) * V(1); MACC1 : R(1) += M(1, 2) * V(2);
MACC2 : R(1) += M(1, 3) * V(3); MACC3 : R(1) += M(1, 4) * V(4) (where R(1) begins
as zero) complete no faster on the 4-stage pipelined MACC than on the non-pipelined
MACC, as each operation must wait for its predecessor to leave the pipeline]
Figure 26 : showing an example to illustrate the effect of issuing successive MACC
instructions (that are dependent) to a pipelined MACC unit
For simplicity, it is assumed that all three arguments are supplied as part of a MACC
instruction when it is issued to a MACC unit.
The code description of the matrix-vector multiplication algorithm shown below in
figure 27 is a re-write of the sequential code shown previously in figure 21. The
outer and inner loops have been swapped around, which requires the matrix to be
processed in column-major order (as opposed to the row-major order used by the
sequential code). The two loops have been swapped so that dependent MACC
operations are scheduled as far apart as possible, as illustrated in the dependency
diagram of this code shown below in figure 28.
void pipelinedMatrixVectorMultiply(int num_matrix_rows, int num_matrix_cols,
                                   const float *m, const float *v, float *r)
{
    int row, col;
    for( row = 0; row < num_matrix_rows; row++ )
    {
        r[row] = 0.0f;
    }
    for( col = 0; col < num_matrix_cols; col++ )
    {
        for( row = 0; row < num_matrix_rows; row++ )
        {
            r[row] += m[row * num_matrix_cols + col] * v[col]; // matrix is processed in COLUMN-MAJOR order
        }
    }
}
Figure 27 : showing the code description of the sequential matrix-vector
multiplication algorithm re-written for execution on a single pipelined MACC unit
[Figure content: a dependency graph in which the single pipelined unit MACC_1
executes the MACC operations in column-major order (V(1).M(1, 1), V(1).M(2, 1),
V(1).M(3, 1), V(1).M(4, 1), then V(2).M(1, 2), and so on), accumulating from initial
values of zero R(1) = V(1).M(1, 1) + V(2).M(1, 2) + V(3).M(1, 3) + V(4).M(1, 4) and
likewise R(2), R(3) and R(4)]
Figure 28 : Showing the dependency diagram of the code of figure 27 above, with the
degree of pipelining n = 4
As can be seen from figure 28, this re-written code description of the matrix-vector
multiplication algorithm essentially overlaps the execution of the vector dot-products
by interleaving their constituent MACC operations. In this way, the first inner-
product of all the vector dot-products in turn is calculated and accumulated, after
which the same is done for the second inner-product of all the vector dot-products,
and so on.
The number of vector dot-products a particular matrix-vector multiplication consists
of is equal to the number of elements in the resultant vector, which in turn equals
the number of rows in the matrix. With regards to the matrix-vector multiplication
algorithm detailed in figures 27 and 28 above, as this number (represented by the
variable num_matrix_rows in figure 27) is increased, dependent MACC operations are
scheduled further apart in time. If this number is greater than or equal to the
number of pipeline stages within the MACC, then the optimum throughput of the
MACC can be achieved (n times that of its non-pipelined version), as each time a
MACC instruction is issued, all those it depends on will have completed, allowing it
to execute immediately.
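The scheduling distance can be made concrete with a small helper function (an illustrative sketch, not part of the report's code): under the column-major order of figure 27, the MACC for matrix element (row, col) is issued on cycle col * num_rows + row, so consecutive dependent MACCs on the same row are exactly num_rows cycles apart.

```c
/* Issue cycle of the MACC for element (row, col) under column-major order. */
int issue_cycle(int row, int col, int num_rows)
{
    return col * num_rows + row;
}
```

With num_rows = 4 and a 4-stage MACC pipeline, the gap of 4 cycles is just enough for each dependent MACC to find its predecessor already complete.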
As has been demonstrated, if the code is written such that dependent operations are
scheduled far enough apart, the use of a pipelined MACC can increase throughput by
a factor of n (where n is the degree of pipelining).
Section 4 : The Floating Point Unit
The implementations of the matrix-vector multiplication algorithms discussed
previously in section 3 are all based on what were termed MACC units, which in
concept have the capabilities of triple-word read, write and multiply-accumulate.
This section details the design and implementation of a Floating-Point Unit (FPU) that
acts as one of those MACC units, and is pipelined in accordance with the findings of
section 3. As was seen previously in section 1.1.2, multiply-accumulate is not the
only elemental operation required to implement high-level OpenGL functions (and
graphics rendering functions in general). With this in mind, the FPU has been
designed in such a way that its instruction set is easy to extend. At the core of the
FPU is a 5-stage pipelined multiplier and a 3-stage pipelined adder. These may be
used in immediate succession to execute a multiply-accumulate (MACC) instruction,
or individually to execute either a multiply or an add instruction. These particular
pipeline lengths were chosen by considering the type of OpenGL program it was
envisaged would be executed on the FPU during its development, and the FPU has
been designed so that these pipeline lengths can easily be changed.
Figure 29 below depicts this initial FPU architecture.
Figure 29 : showing the initial design of the FPU
The control unit of the FPU is modelled as a program written in M-code, which is
encapsulated within the Embedded Matlab Function block labelled
FPU_Control_Unit on the diagram shown in figure B1 of appendix B. M-code is
Matlab's equivalent of C-code, and the inputs to the FPU_Control_Unit block are
passed through to and processed by the embedded program. This program is
executed, and assigns values to its outputs, once per step of Simulink's simulation
time. One of these simulation time steps is analogous to one clock cycle.
The instruction format of the FPU has been devised to be compliant with the IBM
PowerPC interface standard, and is shown below in figure 30. This has been done to
allow for the future possibility of employing the FPU alongside a PowerPC RISC
processor as the latter's Fabric Co-Processor Module (FCM), as the PowerPC is
known to run OpenGL code in this configuration.
| MNEMONIC (bits 26-21) | RT/S (bits 20-14) | RA (bits 13-7) | RB (bits 6-0) |
Figure 30 : showing the format of the FPU’s instruction word
With reference to figure 30 above, the mnemonic field of an arithmetic instruction
tells the FPU_Control_Unit program which of the three types of arithmetic
instruction it is. The RT/S field is the register number of the instruction’s destination
register (and third source register of a MACC), and the RA and RB fields are the
instruction’s source register numbers.
The word length of all data processed by the FPU is 32 bits, in accordance with the
IEEE 754 single-precision representation of floating-point numbers. Also part of the FPU is a
100-word register file, which all data is read from and written to. Throughout this
section it is assumed that all input data has already been loaded into this register file,
with section 5.2.6 later describing a DMA unit that was designed and developed to
transfer data in and out of the register file without holding the FPU up. The register
file facilitates three simultaneous reads and one write per clock cycle. Throughout
section 3 it was assumed that when a MACC instruction was issued to a pipelined
MACC unit, all three arguments were also supplied at once. However, when the FPU
begins the execution of a MACC instruction, only the two multiplicand arguments are
fetched immediately, and the third argument is fetched at the start of the accumulate
stage. Thus three reads on the register file per clock cycle must be provided so that
the FPU has the capability to begin executing a new MACC or multiply instruction in
the same clock cycle that it starts executing the accumulate stage of a down-stream
MACC instruction.
4.1 : Dealing with Hazards
The ability to issue the FPU any instruction that it supports in any clock cycle
abstracts the programmer from the architecture. This allows them to get working
code earlier in the design cycle (before optimisation), as opposed to code only
working when it is exactly optimised for this FPU. In the future, if a compiler is
developed, this capability will allow the FPU to execute code written for other
architectures. The FPU has this capability as a result of being designed to prevent
structural and data hazards from manifesting into errors.
The FPU has been designed to deal with the structural hazard that occurs when the
new instruction issued is an add, and the accumulate part of a MACC instruction is
due to begin in that clock cycle. The conflict here is that the accumulate part of a
MACC instruction also requires its associated input arguments to be entered into the
adder unit. In this event, priority is given to the accumulate, and the FPU is stalled,
whereby it does not allow new instructions to be issued until the adder becomes
available and execution of the pending add instruction has subsequently begun.
The FPU has also been designed to deal with the data hazards that arise when a newly
issued instruction has a variable (represented by a specific register) that is the output
variable of an instruction in the Execution_pipeline at that time. These data hazards
are prevented from manifesting into an error by the FPU stalling whenever such an
instruction is issued to it. This prevents any instruction executed from fetching one of
its source registers before the register’s contents have been updated by a down-stream
instruction, and similarly writing to a register before its contents have been fetched by
the accumulate stage of a down-stream MACC instruction.
4.1.1 : Using the Scoreboard to Detect Data Hazards
The Scoreboard is essentially a table that records which registers in the register file
are a destination register of an instruction currently in the pipeline and when they will
be written to (updated). The FPU_Control_Unit program maintains the FPU’s
Scoreboard as a binary vector, with one 6 bit field for each of the 100 registers of the
register file. Each of these fields is split into two sub-fields as illustrated below in
figure 31.
The Scoreboard is conceptually a table with a count down entry and an execution
unit entry for each register (R1 to R100) within the register file. It is actually
represented within the FPU_Control_Unit program as a binary vector of 6-bit fields,
one field per register, laid out as:

| COUNT DOWN (bits 5-2) | EXEC_UNIT (bits 1-0) |

where an exec_unit value of 01 denotes the adder and 10 denotes the multiplier.
Figure 31 : Illustrating the concept of the Scoreboard and how it is represented
As can be seen above in figure 31, the 2 least significant bits of a field (its exec_unit
subfield) represent the actual execution unit that will produce the result to be written
into the field’s respective register, and the 4 most significant bits of a field (its count
down subfield) represent the number of clock cycles before that write-back operation
will occur. After consideration of the type of NLPs to be executed on the FPU (as
detailed previously in sections 2 and 3), for simplicity the FPU was not
designed to have the capability of arbitrating the execution of multiple write-back
operations. Thus only programs that produce strictly no more than one write-back
operation per clock cycle are supported.
The position of the field within the Scoreboard vector as a whole is representative of
the actual register it represents, where the least significant field represents register R1,
and successive fields represent the registers in ascending order. In each simulation
time-step the Scoreboard is updated to decrement any non-zero count down sub-fields
and add details of a new instruction if one is submitted to the Execution_pipeline by
the FPU_Control_Unit program.
As stated at the beginning of this section, there are 100 registers in the register file
and thus the Scoreboard binary vector consists of 6 x 100 = 600 bits. In Simulink,
unsigned integers are represented using 32 bits, and thus to model this vector it was
broken up into twenty 30-bit vectors. This is illustrated below in figure 32.
The 600-bit Scoreboard vector is split into twenty 30-bit segments, Scoreboard_1 to
Scoreboard_20, each holding five 6-bit register fields: Scoreboard_1 holds R1 (bits
5-0) through R5 (bits 29-24), Scoreboard_2 holds R6 through R10, and so on up to
Scoreboard_20, which holds R96 through R100.
Figure 32: Illustrating how the Scoreboard binary vector is split into 20 segments in
order to represent it in Simulink
With reference to figure 32 above, Scoreboard_1 holds the status of registers R1 to
R5, Scoreboard_2 holds the status of registers R6 to R10, and so on for successive
Scoreboards up to Scoreboard_20. Each of the twenty Scoreboard segments are
represented within the FPU_Control_Unit program as persistent variables.
Figure 33 below shows how the Scoreboard is updated each time the FPU submits a
new instruction to its Execution_pipeline.
The instruction issued in clock cycle 0 is : R5 += R1 * R2
The instruction issued in clock cycle 1 is : R1 = R3 * R4
The instruction issued in clock cycle 2 is : R4 = R2 + R3

Scoreboard state (count down, execution unit) for the registers of concern:

                 after cycle 0    after cycle 1    after cycle 2
R1                   0, 0             6, 2             5, 2
R2                   0, 0             0, 0             0, 0
R3                   0, 0             0, 0             0, 0
R4                   0, 0             0, 0             4, 1
R5                   9, 1             8, 1             7, 1
Figure 33 : showing an example to illustrate how the FPU updates the Scoreboard
each time a new instruction is submitted to the Execution_Pipeline
Figure 33 above shows how the state of the Scoreboard (for the registers of concern)
changes over three successive clock cycles, in which MACC, multiply and add
instructions are issued to the FPU and subsequently submitted to the
Execution_pipeline where their execution begins. As can be seen in figure 33, each
time an instruction is submitted to the Execution_pipeline the count down entry in the
Scoreboard for that instruction’s destination register is set to the latency of the
instruction (in clock cycles) and this value is subsequently decremented in each clock
cycle thereafter. With reference to figure 33, registers R4, R1 and R5 will be written
to in clock cycles 6, 7, and 9 respectively. The exec_unit entry for the instruction’s
destination register is set to the code representing the particular execution_unit within
the Execution_pipeline that will produce its result as detailed previously in figure 31.
A synopsis of the FPU_Control_Unit sub-function updateScore() that is responsible
for updating the Scoreboard (in the way shown in figure 33 above) each time a new
instruction is submitted to the Execution_pipeline is shown in figure B2 followed by a
description in appendix B.
The code of the sboardCycle() sub-function of FPU_Control_Unit is shown in figure
B3, followed by a description in appendix B. This sub-function is executed every
clock cycle (once per Scoreboard segment) to update the Scoreboard so as to reflect
the passing of one clock cycle. This is done by decrementing all non-zero countdown
sub-fields throughout the entire Scoreboard. A countdown value of 1 is encountered
when its parent field represents the register on which a write back operation is
scheduled to occur in the current clock cycle (as the countdown value will become
zero when it’s decremented in that same execution of sboardCycle()), thus in this
event sboardCycle() asserts the write back operation. After asserting a write back
operation, sboardCycle() sets the field’s exec_unit sub-field to zero. Thus all
Scoreboard fields whose respective registers are not destination registers of an
instruction currently in the Execution_pipeline are maintained with a zero value in
both sub-fields.
4.1.2 : Managing Data Hazards
As discussed at the beginning of this section, when the Controller
(discussed later in section 5) issues the FPU with a new instruction,
FPU_Control_Unit checks the Scoreboard to decide whether or not submitting the
instruction to the Execution_pipeline may cause an error to occur due to a data
hazard. This is illustrated below in figure 34.
The instruction issued in clock cycle 0 is : R5 += R1 * R2
The instruction issued in clock cycle 1 is : R1 = R3 * R4
The instruction issued in clock cycle 2 is : R5 = R1 + R2

Scoreboard state (count down, execution unit) for the registers of concern:

                 after cycle 0    after cycle 1    after cycle 2
R1                   0, 0             6, 2             5, 2
R2                   0, 0             0, 0             0, 0
R3                   0, 0             0, 0             0, 0
R4                   0, 0             0, 0             0, 0
R5                   9, 1             8, 1             7, 1
Figure 34 : showing an example to illustrate how the Scoreboard is checked to
prevent data hazards
With reference to figure 34 above, the multiply instruction issued to the FPU in clock
cycle 1 (represented by the blue bar) passes the Scoreboard check (assuming there are
no instructions in the Execution_pipeline before clock cycle 0) because in clock cycle
1 neither of its source registers nor its destination register is the target of a pending
write back operation. However, the add instruction issued to the FPU in clock cycle 2
(represented by the green bar) fails the Scoreboard check for two reasons. Firstly,
one of its source registers (R1) is the target of the write back operation scheduled in
clock cycle 7 after the multiply instruction. Thus if this add instruction was submitted
to the Execution_pipeline in clock cycle 2, it would fetch and use the contents of R1
in the same clock cycle, before they had been updated in clock cycle 7 by the multiply
instruction.
In order to abstract the programmer from the architecture it must be assumed that they
don’t consider the latency of the instructions and that their intention in this event
would be for the result of the multiply instruction to be added to R2 by the add
instruction. Thus if this was the only cause for the Scoreboard check failure, the add
instruction could be submitted to the Execution_pipeline (without causing a data
hazard) in clock cycle 8 or thereafter.
However, a second cause of the add instruction’s Scoreboard check failure is that its
destination register (R5) is the target of a write back operation scheduled in clock
cycle 9 after the MACC instruction (represented by the red bar). Thus a data hazard
exists because the destination register of a MACC instruction is first used as its third
source register. As such, if the add instruction was submitted to the
Execution_pipeline whilst the MACC instruction was still being executed, it could
write to R5 before this register had been fetched for the accumulate stage of the
MACC instruction. For simplicity, this event always results in a Scoreboard check
failure, regardless of whether or not the instruction’s write back would occur after the
register fetch for the add stage of the MACC instruction.
The sub-function of FPU_Control_Unit that is employed to check the Scoreboard for
a particular register is checkScore(), a synopsis of which is shown in figure B4,
followed by a description in appendix B.
4.1.3 : Managing Structural Hazards
As well as the data hazards discussed previously in section 4.1.2, a potential
structural hazard also exists, as the Execution_pipeline’s adder is used in executing
both the add instruction and the accumulate stage of the MACC instruction. Figure
35 below shows an example to illustrate this structural hazard, and how it is dealt with.
The instruction issued in clock cycle 0 is : R5 += R1 * R2 (MACC_mult in clock
cycles 0-4, MACC_add from clock cycle 5)
The instruction issued in clock cycle 5 is : R1 = R3 * R4 (MULT)
The instruction issued in clock cycle 5 is : R4 = R2 + R3 (ADD)

Scoreboard state (count down, execution unit) for the registers of concern:

                 after cycle 0    cycle 5, MULT submitted    cycle 5, ADD held back
R1                   0, 0                 6, 2                      0, 0
R2                   0, 0                 0, 0                      0, 0
R3                   0, 0                 0, 0                      0, 0
R4                   0, 0                 0, 0                      0, 0
R5                   9, 1                 4, 1                      4, 1
Figure 35 : showing an example to illustrate the structural hazard concerning the use
of the Execution_pipeline’s adder by both MACC and add instructions
Figure 35 above shows a MACC instruction (issued in clock cycle 0) split into its two
stages, where the red bar represents the multiply-stage and the purple bar represents
the accumulate-stage. The dark red section of the combined bar represents the clock
cycle in which the last stage of the multiply and the first stage (register fetch) of the
add are conducted in parallel. This is done so as to hide the latency of the add-stage’s
register fetch. Figure 35 shows the outcome of issuing two alternative instructions to
the FPU in that clock cycle (5). As is the case with both the multiply and MACC
instructions, if the instruction issued in this clock cycle does not require immediate
use of the adder then it is eligible for submission to the Execution_pipeline.
However, as can be seen in figure 35 this is not the case with the add instruction, and
as such FPU_Control_Unit would not submit an add to the Execution_pipeline in this
situation, regardless of whether or not it passed its Scoreboard checks. Priority is
always given to the accumulate-stage of a MACC in this way for simplicity. In the
situation depicted by figure 35, the earliest time after this clock cycle that an add
instruction could be submitted to the Execution_pipeline would be clock cycle 6.
Details of how the FPU program implements the execution of the different
instructions and the management of hazards are contained in appendix B.
Section 5 : The Controller
The Controller is responsible for issuing the FPU with the next instruction to be
executed. If the FPU stalls in the event of detecting a structural or data hazard, it
asserts a ‘1’ on its stall output and the Controller must re-issue the stalled instruction
in subsequent clock cycles until the FPU submits it to its Execution_pipeline and
asserts a ‘0’ back on its stall output. In the clock cycle after this submission occurs,
the Controller must issue the FPU with the next instruction to be executed.
5.1 : Initial Look-Up Table Design
The initial design of the Controller was essentially a look-up table, where every
instruction of the program to be run was stored sequentially in program memory.
This look-up table design of the Controller is shown below in figure 36.
Figure 36 : showing the initial look-up table design of the Controller
For simplicity, during this developmental phase the PROG_COUNTER (counter) was
initialised with its count_from and count_to block parameters before running the
simulation. Similarly, the PROG_MEMORY (single-port RAM) was initialised with
the sequence of program instructions through its initial_value_vector block parameter.
The output of PROG_MEMORY is separated out into its constituent fields
(mnemonic, RT/S, RA and RB) as bit-slicing is not supported in Simulink’s Embedded
Matlab function block, thus preventing this from being carried out within
FPU_Control_Unit. With reference to figure 36 above, when the stall input is ‘0’,
both PROG_COUNTER and PROG_MEMORY are enabled, thus allowing the
program counter to progress by 1 and the output register of the RAM block to be
written to. As such, when the stall signal is ‘0’ successive instructions of the program
are output by PROG_MEMORY at a rate of one per clock cycle. Figure 37 below
uses an example to show how the Controller deals with an FPU stall.
clock cycle          0    1    2    3    4    5    6    7    8
program counter      a0   a1   a2   a3   a3   a3   a4   a5   a6
instruction issued   X    i0   i1   i2   i2   i2   i3   i4   i5
stall                0    0    0    1    1    0    0    0    0
Figure 37 : showing an example to illustrate how an FPU stall is dealt with by the
Controller
In the example shown in figure 37 above, the FPU is issued with instruction i2 in
clock cycle 3 but cannot submit it to the Execution_pipeline for two successive clock
cycles (3 and 4). As can be seen in figure 37 above, when the FPU stalls and asserts a
‘1’ on the stall signal, PROG_COUNTER has advanced to the address of the next
instruction (a3) by the time the count has been stopped. However, as the output
register of PROG_MEMORY is disabled, its output value remains
as the word-value of the stalled instruction. In the clock cycle that the FPU does
submit i2 to the Execution_pipeline, it asserts a ‘0’ back on the stall signal, thus
enabling PROG_MEMORY’s output register, which is then written to with the word-
value of the next instruction to be executed. This can be seen in figure 37, where the
stalled instruction (i2) is submitted for execution in clock cycle 5, and in the
subsequent clock cycle the FPU is issued with the next instruction (i3).
5.2 : Optimising the Controller for Running Geometric Transformation
Programs
As the problem size (number of instructions the program consists of) increases,
storing every single instruction requires the program memory capacity to be bigger
than is practical. For example, the matrix-vector multiplication program shown in
figure 21 and discussed previously in section 3.1 (where the matrix is 4x4 and the
vector is 4x1) is executed using 16 separate MACC instructions (ignoring any
load and store operations required to get data in and out of the register file). Thus the
size of the program memory required to store this program when completely unrolled
is 16 instruction words.
Figure 38 below illustrates a similar program to that shown in figure 21 which
multiplies eight 4x1 Vectors by the same 4x4 matrix. This is a typical example of the
program run in carrying out both the modelview and projection transformations in the
per-vertex operations stage of the OpenGL pipeline, as discussed previously in
section 1.2.2.
[ R1 | R2 | ... | R8 ]  =  M  x  [ V1 | V2 | ... | V8 ]

where M is a 4x4 matrix and V1 to V8 are eight 4x1 vectors, i.e.
Rn(i) = M(i,1)Vn(1) + M(i,2)Vn(2) + M(i,3)Vn(3) + M(i,4)Vn(4) for n = 1 to 8.
Figure 38 : illustrating an example of the matrix-vector multiplication program carried
out by OpenGL’s modelview and projection transformations.
With reference to figure 38 above, the M matrix represents either the modelview or
projection matrix depending on which transformation is to be performed, and vectors
V1 to V8 represent the object or eye coordinate vectors of an object in the scene. The
object being transformed in this particular program has eight vertices, the simplest
example of which would be a cube. Figure 39 below shows a code description of this
program.
void pipelined1Matrix8VectorMultiply( int num_matrix_rows, int num_matrix_cols,
    const float m[ ][ 4 ],
    const float v1[ ], const float v2[ ], const float v3[ ], const float v4[ ],
    const float v5[ ], const float v6[ ], const float v7[ ], const float v8[ ],
    float r1[ ], float r2[ ], float r3[ ], float r4[ ],
    float r5[ ], float r6[ ], float r7[ ], float r8[ ] )
{
    int row, col;
    for( col = 0; col < num_matrix_cols; col++ )
    {
        for( row = 0; row < num_matrix_rows; row++ )
        {
            r1[row] += m[row][col] * v1[col]; // matrix m is processed in COLUMN-MAJOR order
            r2[row] += m[row][col] * v2[col];
            r3[row] += m[row][col] * v3[col];
            r4[row] += m[row][col] * v4[col];
            r5[row] += m[row][col] * v5[col];
            r6[row] += m[row][col] * v6[col];
            r7[row] += m[row][col] * v7[col];
            r8[row] += m[row][col] * v8[col];
        }
    }
}
Figure 39 : showing a code description of the example geometric transformation
program depicted previously in figure 38
Previously in section 3.3.1 an analysis of how to optimise the matrix-vector
multiplication algorithm for execution on a pipelined MACC unit was detailed. In
conjunction with the findings of this analysis, the two loops in code of figure 39
above are arranged such that the matrix is processed in column major order, so as to
schedule dependent MACC instructions further apart in time and thus avoid periods of
latency due to FPU stalls. However, this program has eight vectors and the eight
separate matrix-vector multiplication problems are interleaved so as to schedule the
dependent MACC instructions (within each individual problem) even further apart in
time, allowing for an even greater overall pipeline depth and thus a higher throughput
to be achieved.
Completely unrolling both loops will yield the fastest execution speed, as it entirely
removes the overhead of having to test loop conditions and execute branch operations
(to set the program counter back to the beginning of a loop). These overheads can
also be eliminated without unrolling any loops, by implementing the loop tests and
branch operations within the Controller, although this comes at the expense of added
hardware resources. However, as discussed previously, the disadvantage of
completely unrolling both loops is that the size of program memory required is
increased by a factor equal to the degree of unrolling.
As can be seen from figure 39 above, the inner-loop of the program contains 8 MACC
instructions, and so if this program was represented with both loops completely
unrolled, the size of the required program memory would be 8x4x4=128 instruction-
words. With both loops completely unrolled, transformations of larger sizes (i.e.
where there are more vertices in the scene overall) could be solved by running the
program on small sets of vectors (vertices) at a time, thus reducing the number of
instructions that need to be stored at any one time, and likewise the size of program
memory required. However, this approach would introduce the overhead of switching
between these smaller programs.
5.2.1 : The Use of Induction Variables
As can be seen from the code of figure 39 above, all of the instructions are MACC
instructions. Thus when the program is represented with both loops completely
unrolled, the instructions stored in program memory would all have the same
mnemonic and differ only in their three source/destination register fields (RT/S, RA
and RB), whose values always address the same three arrays.
To simplify the addressing of arrays in NLPs, the majority of DSP compilers
introduce induction variables, which by definition are derived from
the loop index values. Considering the geometric transformation program of figure
39 for just one vector, a code description of this illustrating how induction variables
would be used to address the output and two input arrays is shown below in figure 40.
for( col = 0; col < num_matrix_cols; col++ )
{
    for( row = 0; row < num_matrix_rows; row++ )
    {
        // matrix m is processed in column major order
        p = (col * num_matrix_rows) + row; // p is an induction variable
        *(r1 + row) += *(m + p) * *(v1 + col);
    }
}
Figure 40 : showing an example to illustrate the use of induction variables for
addressing arrays
As can be seen in figure 40 above, the m array is indexed by adding its respective
induction variable p to its base pointer. Although p is the only new variable
introduced, the r1 and v1 arrays are also addressed in this way, where their respective
induction variables are exactly the values of the loop indices. With reference to figure
40 above, it can be seen how the use of induction variables allows for the successive
program instructions to be generated, with only a single generic instruction-word
stored in program memory of the form shown below in figure 41.
| MACC (MNEMONIC, bits 26-21) | R1_base_pointer (RT/S, bits 20-14) | M_base_pointer (RA, bits 13-7) | V1_base_pointer (RB, bits 6-0) |
Figure 41 : showing the single generic instruction-word from which all program
instructions could be derived for the program of figure 40
With reference to figure 41 above, the Controller would pass the mnemonic field
straight on to the FPU, but between issuing successive instructions it would have to
evaluate the values of the RT/S, RA and RB fields, which would require additional
hardware resources. If this penalty was migrated into software by issuing the FPU
with instructions to calculate the induction variable values, this would eliminate the
need for extra resources (apart from the extra program memory required) at the cost of
increased execution time. However, this is not an option, as the FPU does not have an
internal data-format for register addresses. With reference to the code of figure 40 above,
for this program these extra hardware resources would amount to two adders for
evaluating the R1 array addresses, likewise another two adders for evaluating the V1
array addresses and an adder and a multiply-add unit for evaluating the M array
address.
5.2.2 : Performing Strength Reduction on Induction Variables
In applying strength reduction to all three induction variables of the program of figure
40, the overhead cost of each inner-loop iteration is reduced. Figure 42 below shows
the code description after all three induction variables have undergone strength
reduction.
p = 0;
r = 0;
c = 0;
for( col = 0; col < num_matrix_cols; col++ )
{
    for( row = 0; row < num_matrix_rows; row++ )
    {
        *(r1 + r) += *(m + p) * *(v1 + c); // matrix m is processed in column major order
        r++;
        p++;
    }
    r = 0;
    c++;
}
Figure 42 : showing the code description of the matrix-vector multiplication algorithm
after all three induction variables have undergone strength reduction
As can be seen from figure 42 above, there is no longer the need for a multiplication
in evaluating successive values of p, thus the additional hardware required to evaluate
the p induction variable is now down to two adders. To facilitate this strength
reduction on p, the m matrix must be stored in column major order. Strength
reduction also removes any dependencies of induction variables on the loop indices
(as is the case for those associated with the R1 and V1 arrays), which provides more
flexibility, as not all programs will have induction variables that directly correspond
to the loop indices.
5.2.3 : Optimising a Geometric Transformation for the Controller
Considering the code shown in figure 39 above, where eight separate matrix-vector
multiplication problems are interleaved, this could be executed using separate base
pointers for the arrays of the separate problems, whilst using the same induction
variables across all problems. A code description of this solution is shown below in
figure 43.
p = 0;
r = 0;
c = 0;
for( col = 0; col < num_matrix_cols; col++ )
{
    for( row = 0; row < num_matrix_rows; row++ )
    {
        // matrix m is processed in column major order
        *(r1 + r) += *(m + p) * *(v1 + c);
        *(r2 + r) += *(m + p) * *(v2 + c);
        *(r3 + r) += *(m + p) * *(v3 + c);
        *(r4 + r) += *(m + p) * *(v4 + c);
        *(r5 + r) += *(m + p) * *(v5 + c);
        *(r6 + r) += *(m + p) * *(v6 + c);
        *(r7 + r) += *(m + p) * *(v7 + c);
        *(r8 + r) += *(m + p) * *(v8 + c);
        r++;
        p++;
    }
    r = 0;
    c++;
}
Figure 43 : showing a code description of the program with eight matrix-vector
multiplication problems interleaved, with all induction variables having undergone
strength reduction.
With reference to figure 43 above, to run this code the Controller would have to store
one generic instruction word of the form shown previously in figure 41 for each of the
eight problems (i.e. one instruction word per vertex). This disadvantage arises from the
corresponding array base pointers having different values across the eight different
problems. The number of vertices an object has, or the number that are in a scene can
be huge (reaching the tens of thousands in very detailed scenes), thus it is desired that
only one generic instruction word be stored in program memory, from which all
successive program instructions are derived.
In order to achieve this, the corresponding arrays across the different problems need
to be combined. Considering the way the geometric transformation interleaves the
execution of the problems, in order to keep the expressions for evaluating the
induction variable values as simple as possible, the best way to combine the arrays is
to interleave them. As such, the register numbers of successive array elements
accessed can be evaluated largely by simple increment operations. This is illustrated
below in figure 44 which illustrates an example register file arrangement for the three
arrays.
R1  to R16 : the M matrix in column-major order —
             M(1,1), M(2,1), M(3,1), M(4,1), M(1,2), M(2,2), ..., M(4,4)
R17 to R48 : the eight source vectors interleaved —
             V1(1), V2(1), ..., V8(1), V1(2), V2(2), ..., V8(4)
R49 to R80 : the eight resultant vectors interleaved —
             R1(1), R2(1), ..., R8(1), R1(2), R2(2), ..., R8(4)
Figure 44 : showing an example arrangement within the FPU’s register file, where the
three arrays of eight matrix-vector multiplication problems are interleaved together
As is illustrated in figure 44 above, the eight source and corresponding eight resultant
vectors are stored in the register file such that the first element of all eight vectors is
stored adjacent to the others, in the order the vector pairs are processed by the
program (1 to 8), followed by the second element of all eight vectors, and so on. The M
matrix is stored in column-major order, as was discussed earlier in this section. The
arrangement of the source and destination registers means that the complexity of
interleaving and de-interleaving the arrays is handled outside the Controller, by
whatever loads the data into and stores it out of the FPU’s register file. Consequently
the register number sequences that need to be generated by the Controller are simpler
(than if the arrays were simply concatenated), and this will ease compiler
development if it is undertaken in the future. A code description of the program using
this register file arrangement, and requiring only one generic instruction word to be
stored in program memory, is shown below in figure 45.
res = 0;
res_base = 0;
p = 0;
vec_base = 0;
vec = 0;
for( col = 0; col < num_matrix_cols; col++ )
{
for( row = 0; row < num_matrix_rows; row++ )
{
for( vec_no = 0; vec_no < num_vectors; vec_no++ )
{
*(r0 + res) += *(m + p) * *(v0 + vec);
res++;
vec++;
}
p++;
vec = vec_base;
}
res = res_base;
vec = vec_base = vec_base + num_vectors;
}
Figure 45 : showing a code description of the program executing eight interleaved
matrix-vector multiplication problems with only one generic instruction word stored
in program memory
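A directly runnable C rendering of the figure 45 program is sketched below (illustrative only: the pointer arithmetic is rewritten as array indexing, the result array is explicitly zeroed, and the dimensions are fixed at the 4x4 matrix and eight vectors of the example):

```c
#include <assert.h>

#define ROWS 4
#define COLS 4
#define NVEC 8

/* m: ROWS*COLS matrix in column-major order; v: element-interleaved source
   vectors (element-major, vector-minor); r: resultant vectors, same layout.
   Only simple induction-variable updates appear between MACC operations. */
static void interleaved_mv(const float *m, const float *v, float *r)
{
    int res = 0, res_base = 0, p = 0, vec_base = 0, vec = 0;
    for (int i = 0; i < ROWS * NVEC; i++) r[i] = 0.0f;  /* clear results */
    for (int col = 0; col < COLS; col++) {
        for (int row = 0; row < ROWS; row++) {
            for (int vec_no = 0; vec_no < NVEC; vec_no++) {
                r[res] += m[p] * v[vec];   /* the single MACC instruction */
                res++;
                vec++;
            }
            p++;                /* next matrix element (column-major)     */
            vec = vec_base;     /* rewind to first vector of this column  */
        }
        res = res_base;         /* results accumulate across columns      */
        vec = vec_base = vec_base + NVEC;   /* advance to next element    */
    }
}
```

Each of the eight products accumulates r_vec[row] = sum over col of M(row, col) * v_vec[col], exactly as the un-interleaved problems would.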
5.2.4 : Designing the Optimal Controller
Considering the code of figure 45 above, it can be seen that to evaluate the three
induction variables between issuing successive instructions, the Controller must be
able to carry out three operations on an induction variable to produce its subsequent
value: incrementing it, resetting it to a base value, and adding a constant to that base
value.
As was discussed previously in section 1.2.1, as well as matrix-vector multiplication,
the two other prominent NLPs executed within the OpenGL pipeline (and the graphics
rendering domain in general) are the FIR filter and matrix-matrix multiplication. For
flexibility and thus the ability to efficiently support a wider range of NLPs, the
Controller needs to have the ability to perform any combination of these operations on
any of the three induction variables in any one evaluation cycle between the issuing of
successive instructions.
5.2.4.1 : The Data Address Generator Design
A hardware block that generates successive address values by evaluating induction
variable values and then adding them to array base pointers in the way discussed
previously is commonly known in the field of DSP design as a Data Address
Generator (DAG). Such a block was designed to support the three induction variable
operations discussed previously, with a view to using three separate instances of it as
part of the Controller, for evaluating the RT/S, RA and RB field values between the
issuing of successive instructions. The design of this DAG block is shown below in
figure 46.
Figure 46 : showing the Data Address Generator block used in the Controller
Before the DAG can be used it must first be initialised with the base pointer
(orig_base_pointer) of the array it is to generate addresses (register numbers) for, and
the constant value that can be added to this (val_1). This initialisation is part of
loading the program to be run into the Controller. With reference to figure 46 above,
the base pointer is loaded into the orig_base_pointer register, and the initialisation
simultaneously loads this value into the current_accumulation register, so that it
appears at the DAG’s output (Out1) in the subsequent clock cycle. In the same step,
the constant is loaded into the val_1 register. This
combined triple-action instruction initialises the DAG in one clock cycle and is
referred to as the DAG’s initialisation instruction in the remainder of this document.
Figure 47 below shows the instruction words of the DAG instructions used in
executing the optimised geometric transformation program of figure 45.
DAG instruction word format : the six bits (numbered 5 down to 0) correspond to the
inputs In8, In1, In2, In3, In4 and In5 respectively.

doNothing                                                 : 000000
current_accumulation++ (post-incrementing the DAG output) : 100000
current_accumulation = orig_base_pointer                  : 000010
current_accumulation = orig_base_pointer
                     = orig_base_pointer + val_1          : 000011
initialiseDAG                                             : 001111
Figure 47 : showing the DAG instruction words used in executing the optimised
program of figure 45
With reference to figure 47 above, the initialiseDAG instruction is supplied in
conjunction with the 7-bit orig_base_pointer and val_1 values. As can be seen in
figure 47, bit 4 is zero in all of the instructions detailed; there is one other
instruction that the DAG supports which is not used in executing the geometric
transformation program of figure 45:
current_accumulation = orig_base_pointer + val_1, in which orig_base_pointer is not
modified. This instruction is included as it would be used by other NLPs that need to
advance the array address by val_1 and later return to an earlier base value (still
held in orig_base_pointer).
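The DAG's five instructions can be summarised behaviourally with the following C sketch (an interpretation of the description above, not the Simulink block itself; one function call stands in for one clock cycle's update, and the names are taken from figure 46):

```c
#include <assert.h>

/* Behavioural model of the DAG block's three registers. */
typedef struct {
    int orig_base_pointer;     /* array base register number        */
    int val_1;                 /* constant loaded at initialisation */
    int current_accumulation;  /* value driven onto Out1            */
} dag_t;

/* initialiseDAG: the combined triple-action initialisation instruction. */
static void dag_init(dag_t *d, int base, int val)
{
    d->orig_base_pointer = base;
    d->val_1 = val;
    d->current_accumulation = base;  /* base appears at Out1 next cycle */
}

static void dag_inc(dag_t *d)    { d->current_accumulation++; }
static void dag_reload(dag_t *d) { d->current_accumulation = d->orig_base_pointer; }

/* current_accumulation = orig_base_pointer = orig_base_pointer + val_1 */
static void dag_reload_add(dag_t *d)
{
    d->orig_base_pointer += d->val_1;
    d->current_accumulation = d->orig_base_pointer;
}

/* The unused fifth instruction: orig_base_pointer is left unmodified. */
static void dag_peek_add(dag_t *d)
{
    d->current_accumulation = d->orig_base_pointer + d->val_1;
}
```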
5.2.4.2 : The Use of Data Address Generators as part of the Controller
Figure 48 below shows the geometric transformation program (outlined previously in
figure 45) using the set of DAG instructions shown previously in figure 47.
void pipelinedDAG1Matrix8VectorMultiply(int num_matrix_rows, int num_matrix_cols, int num_vectors, const float *m[ ], const float *v[ ], const float *r[ ])
{
int row, col, vec_no;
// DAG initialisation pragmas
assign_DAG1: r_vectors; // DAG1 <= R[ ]
assign_DAG2: mx; // DAG2 <= M[ ]
assign_DAG3: v_vectors; // DAG3 <= V[ ]
// DAG initialisation data
initDAG1: r_vectors_orig_base = r;
initDAG1: r_vectors_val1 = 0;
initDAG2: mx_orig_base = m;
initDAG2: mx_val1 = 0;
initDAG3: v_vectors_orig_base = v;
initDAG3: v_vectors_val1 = num_vectors;
// NLP program
for( col = 0; col < num_matrix_cols; col++ )
{
for( row = 0; row < num_matrix_rows; row++ )
{
for( vec_no = 0; vec_no < num_vectors; vec_no++ )
{
*(r_vectors_curr_accum) += *(mx_curr_accum) * *(v_vectors_curr_accum);
DAG1: r_vectors_curr_accum++; // DAG instruction
DAG3: v_vectors_curr_accum++; // DAG instruction
}
DAG2: mx_curr_accum++; // DAG instruction
DAG3: v_vectors_curr_accum = v_vectors_orig_base; // DAG instruction
}
DAG1: r_vectors_curr_accum = r_vectors_orig_base; // DAG instruction
// DAG instruction
DAG3: v_vectors_curr_accum = v_vectors_orig_base = v_vectors_orig_base + v_vectors_val1;
}
}
Figure 48 : showing a code description of the optimal geometric transformation
program that interleaves eight matrix-vector multiplications, rewritten to show how
DAGs in the Controller would be used
With reference to figure 48 above, compiling this code would simply involve
constructing the DAG initialisation instruction word for each of the three DAGs,
constructing the instruction word for each of the three loops, and extracting the loop
bounds to initialise the count_to values of the loop counters (discussed later in section
5.2.4.3). The three loop instruction words would be used by the Controller to
generate the program instruction words of this program, and are depicted below in
figure 49.
Controller instruction word format : bits 24..18 hold the MNEMONIC field, bits 17..12
the I_DAG_RT/S field, bits 11..6 the I_DAG_RA field, and bits 5..0 the I_DAG_RB field.

Controller instruction word for the inner-loop’s instruction :
  MNEMONIC   : MACC
  I_DAG_RT/S : r_vectors_curr_accum++
  I_DAG_RA   : doNothing
  I_DAG_RB   : v_vectors_curr_accum++

Controller instruction word for the middle-loop’s instruction :
  MNEMONIC   : doNothing
  I_DAG_RT/S : doNothing
  I_DAG_RA   : mx_curr_accum++
  I_DAG_RB   : v_vectors_curr_accum = v_vectors_orig_base

Controller instruction word for the outer-loop’s instruction :
  MNEMONIC   : doNothing
  I_DAG_RT/S : r_vectors_curr_accum = r_vectors_orig_base
  I_DAG_RA   : doNothing
  I_DAG_RB   : v_vectors_curr_accum = v_vectors_orig_base = v_vectors_orig_base + val_1
Figure 49 : showing the three instruction words used by the Controller to run the
interleaved matrix-vector multiplication program of figure 48
Figure 50 below shows the final Controller design, which employs three instances of
the DAG block, and a loop counter structure to act as a form of program counter.
Figure 50 : showing the final optimised Controller design
5.2.4.3 : The Loop Counter Structure
With reference to figure 50 above, the loop counter structure
(Controller_Loop_Counters) keeps track of where program execution currently is within
the NLP (in terms of the iteration number of each loop) and issues the generic loop
instructions (of the form shown previously in figure 49) accordingly. This loop
counter structure is shown below in figure 51.
Figure 51 : showing the loop counter structure that acts as the program counter of the
Controller
In consideration of the key NLPs that it is envisaged would be run by this Controller,
for simplicity this loop counter structure is designed to support programs with only
one generic instruction (of the form shown previously in figure 41) in each loop.
With reference to figure 51 above, when a NLP is being run by the Controller,
counter_1, counter_2 and counter_3 hold the loop indices for the inner, middle and
outer loops respectively. The three counters are connected together via logic gates
(AND and NOT) and registers (delays).
In all clock cycles up to and including when counter_1 reaches its count_to value (i.e.
when the inner loop is on its last iteration) the address applied to the program memory
is that of the inner-loop’s generic instruction. In the cycle after counter_1 reaches its
count_to value, its count is reset back to zero and disabled there for one cycle whilst
the count of counter_2 is advanced by one. In this cycle the address input to the
program memory is that of the middle loop’s generic instruction.
In the clock cycle after counter_2 reaches its count_to value, it behaves in the same
way as just described for counter_1, and the count of counter_1 in this event is
disabled at zero for two successive cycles. The address applied to program memory
in the clock cycle after counter_2 reaches its count_to value is that of the outer-loop’s
generic instruction.
In the clock cycle that counter_3 reaches its count_to value a ‘1’ is asserted on the
switch_context signal shown in figure 51 above, which signals that the last generic
instruction of the NLP has been issued.
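The count sequence this structure produces can be sketched as follows (a behavioural C model of the description above, not the counter hardware; the values 0, 1 and 2 stand for the program memory addresses of the inner, middle and outer loop instructions, and the function name is illustrative):

```c
#include <assert.h>

/* Records which loop's generic instruction is addressed in each cycle:
   nvec inner issues per row iteration, then one middle issue while
   counter_1 is held at zero; after the last row of a column, one outer
   issue while counter_1 is held for a second cycle. `out` must hold
   cols * (rows * (nvec + 1) + 1) entries. Returns the cycle count; the
   final issue is the one that raises switch_context. */
static int issue_sequence(int nvec, int rows, int cols, int *out)
{
    int n = 0;
    for (int col = 0; col < cols; col++) {
        for (int row = 0; row < rows; row++) {
            for (int v = 0; v < nvec; v++)
                out[n++] = 0;   /* inner-loop instruction (the MACC) */
            out[n++] = 1;       /* middle-loop instruction           */
        }
        out[n++] = 2;           /* outer-loop instruction            */
    }
    return n;
}
```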
When a ‘1’ is asserted on the stall signal by the FPU, the program memory of the
optimal Controller behaves in the same way as the initial look-up table design detailed
previously in section 5.1. In the event of a stall, the three DAGs are disabled, so all
three are effectively issued doNothing instructions. The loop counter arrangement has
been designed such that every feedback signal between the counters is frozen, so that
when the stall is lifted the count sequence carries on exactly as it would have done
had no stall been encountered (from the point at which it was halted).
The internal contents of the three counter blocks of figure 51 are shown in appendix
C.
5.2.5 : The High-Level Controller
Before the Controller can run a NLP, the generic loop instructions have to be loaded
into its program memory and its DAGs need initialising. This initialisation is carried
out by the High-level controller, which essentially stores three Controller
initialisation instructions in its block RAM. In initialising the Controller to run a
NLP, these three instructions are issued to the Controller. With reference to figure 51
shown above, when the High-level controller is issuing the Controller with one of
these instructions, it asserts a ‘1’ on the Controller’s run input. In the clock cycle after
the last of these has been issued, the High-level controller asserts a ‘0’ on the run
signal and the Controller begins running the program. When the Controller asserts a
‘1’ on its switch_context output, the High-level controller then goes through the same
initialisation procedure with the Controller for the next NLP to be run. Thus all NLPs
for which the High-level controller has a set of Controller initialisation instructions
are successively loaded into the Controller and run in this way. The High-level
controller is shown in appendix D.
One of these Controller initialisation instructions acts to load one of the generic loop
instructions into the Controller’s program memory, initialise one of the DAGs, load
one of the loop instruction addresses, and initialise one of the counters with its
count_to value. Thus three of these instructions need to be issued in order to load a
NLP into the Controller.
5.2.6 : The DMA Unit and Context Switching
Until now it was always assumed that all input data (i.e. matrices and vectors) that
programs operated on was already in the register file. Likewise, no attention was paid
as to how data would be output by the FPU. Implementing this as load and store
instructions in the FPU would mean long periods of latency whilst entire arrays were
loaded in and stored out element by element, in between the execution of the actual
computational instructions.
A DMA unit was designed to load data into and store data out of the FPU’s register
file whilst leaving the FPU free to carry out the execution of a program. However,
with only one register file in the FPU, only three reads and one write per cycle are
available, and instructions being executed by the FPU need to read the register file in
order to fetch their arguments; all three of the available reads can be required in the
same clock cycle (where a new instruction is submitted to the Execution_pipeline and
the accumulate stage of a downstream MACC instruction fetches its third source
register). Thus using only one register file would mean slow overall execution time,
as the next NLP to be run could not start until all of its input data had been loaded
into the register file.
This problem was overcome by introducing a second register file in the FPU, where in
each context cycle one register file is read from and written to by the NLP being
executed on the FPU, and the other has data stored out of it (the results of the previous
NLP to have been executed) and loaded into it (input data for the next NLP to be
executed) by the DMA unit.
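The resulting ping-pong arrangement can be summarised with a small C sketch (illustrative only; in hardware the FPU and DMA unit operate on their respective register files concurrently, and the function name is hypothetical):

```c
#include <assert.h>

/* For context cycle n, report which of the two register files the FPU
   executes from and which the DMA unit drains and refills. The
   context_bit toggles between successive context cycles, swapping the
   two roles so that data movement overlaps NLP execution. */
static void context_assignment(int cycle, int *fpu_rf, int *dma_rf)
{
    int context_bit = cycle & 1;  /* toggled by the High-level controller */
    *fpu_rf = context_bit;        /* register file the current NLP uses   */
    *dma_rf = 1 - context_bit;    /* register file the DMA unit services  */
}
```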
This DMA unit is shown below in figure 52.
Figure 52 : showing the DMA unit
With reference to figure 52 above, as with the Controller the DMA unit must be
initialised in between context cycles. The High-level controller detailed previously in
section 5.2.5 was extended to carry out this initialisation of the DMA unit in the same
way as, and in parallel with, the Controller. In this initialisation, a set of load and
store instructions are loaded into the DMA unit’s RAM block, the format of which is
shown below in figure 53.
MNEMONIC (bits 15..14) | First Register (bits 13..7) | Last Register (bits 6..0)
Figure 53 : showing the format of the DMA instructions
With reference to figure 53 above, the mnemonic field indicates whether the
instruction is a load or a store, which the DMA unit uses to decide which protocol to
implement. The first_register and last_register fields are pointers (register numbers)
to the beginning and end of the array being loaded in or stored out; the DMA unit
generates the numbers in between monotonically. The next number in this sequence
is generated every time a new data value is received (at input In3) for the DMA to
load into the register file, or whenever it needs to fetch another data value from the
register file to store.
If at any stage during loading the input data stream to the DMA is halted, or the
external entity receiving the data being stored out temporarily cannot receive it, the
counter generating the pointer values is disabled. As with the loop counter structure
of the optimal Controller discussed previously in section 5.2.4.3, the DMA unit has
been designed such that all internal feedback signals are frozen, so that a data stream
halt in either direction does not cause a malfunction (such as a pointer value being
skipped). Figure 54 below describes the protocols implemented by the DMA unit in
load and store mode.
ACCEPT_MORE = LOAD_INSTRUCTION && ALLOWED_BY_FPU && NOT(LAST_REGISTER) && IN_RUN_MODE
(protocol for loading data into the register file)

NEW_DATA_OUT = STORE_INSTRUCTION && ALLOWED_BY_FPU && NOT(LAST_REGISTER) && IN_RUN_MODE
(protocol for storing data out of the register file)
Figure 54 : describing the protocols implemented by the DMA unit in loading data in
and storing data out of the FPU
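The two protocol terms of figure 54 translate directly into C (a sketch; the signal names follow the figure, while the struct packaging and function names are illustrative rather than part of the design):

```c
#include <assert.h>
#include <stdbool.h>

/* Snapshot of the DMA unit's control signals in one clock cycle. */
typedef struct {
    bool load_instruction;   /* mnemonic decoded as a load            */
    bool store_instruction;  /* mnemonic decoded as a store           */
    bool allowed_by_fpu;     /* FPU-side handshake                    */
    bool at_last_register;   /* pointer counter == last_register      */
    bool in_run_mode;        /* DMA has been initialised and released */
} dma_state_t;

/* Load protocol: may another input data word be accepted? */
static bool accept_more(const dma_state_t *s)
{
    return s->load_instruction && s->allowed_by_fpu
        && !s->at_last_register && s->in_run_mode;
}

/* Store protocol: may a new output data word be driven out? */
static bool new_data_out(const dma_state_t *s)
{
    return s->store_instruction && s->allowed_by_fpu
        && !s->at_last_register && s->in_run_mode;
}
```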
Both register files of the FPU are 100 words in depth, and the input and output signals
of both are multiplexed between the DMA unit and the FPU using a context_bit
provided by the High-level controller which toggles between successive context
cycles.
Section 6 : Testing and Integration
After each block had been designed and implemented as a Simulink simulation
model, it was thoroughly tested against its specification. Corner cases and perverse
situations were also examined, so as to be aware of them when integrating a block
with others. When a block was deemed to have met its specification it was
immediately integrated with the other blocks it was to interact with, and tested as
part of a sub-system. In this way the complete system, comprising the High-level
controller, the optimal FPU Controller, the DMA unit and the optimised FPU, was
developed by integrating the sub-systems together, predominantly in a bottom-up
hierarchical manner.
6.1 : Testing of the FPU’s Control Unit
Firstly, the FPU_Control block was tested alone to verify that it could submit all three
types of instruction for execution. This was done in simulation by inputting add,
multiply and MACC instructions and observing the outputs of the block. The register
field values were set to values in the low, mid and high range of the register file’s
range of register numbers (1 to 100). For each of the instructions, cases with two and
with three of the register fields set to the same value (which is legal) were tested. All
cases were also tested with mnemonics in the instruction word that were unknown to
the FPU, to ensure that the FPU_Control block did not try to submit anything for
execution in that clock cycle. As well as testing the block, this process was also used
to develop the implementation of the FPU_Control_Unit Embedded Matlab function
within the FPU_Control block in an iterative way.
Register field values outside the register file’s range were also input as part of an
instruction word. Whenever a register field value of 0 was used (with a valid
mnemonic) the FPU_Control_Unit program would assign -1 (in 7-bit binary form) to
the appropriate address signal, as all register numbers are decremented by one to map
them to an address in the register file (i.e. register numbers in the FPU start from 1,
but block RAM addresses start from 0). Register numbers between 101 and 127
inclusive were processed as those in the legal range of 1 to 100.
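The mapping just described can be stated as a one-line C sketch (illustrative; the function name is hypothetical): every register number is decremented by one, so a field value of 0 yields -1 and out-of-range numbers pass through unchecked.

```c
#include <assert.h>

/* Register numbers start at 1; block RAM addresses start at 0.
   No range checking is performed, matching the behaviour observed
   for field values of 0 and for numbers 101..127. */
static int reg_to_addr(int reg_no)
{
    return reg_no - 1;
}
```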
The basic updating of the Scoreboard was verified, by outputting the persistent
Scoreboard segments of interest from the function and observing their decimal values
over successive clock cycles. It was found that register numbers outside the legal
range had no effect on the Scoreboard (when input as part of an instruction word with
a valid mnemonic), as expected since the updateScore() sub-function would not be
able to assign it to one of the Scoreboard segments, although execution of the
FPU_Control_Unit function was found to resume as normal in subsequent clock
cycles.
Sequences of instructions were then input to the block to check that the
FPU_Control_Unit program could handle instructions being issued to it at the rate of
one per clock cycle (and slower), that the Scoreboard detected all cases of data
hazards, the structural hazard concerning the use of the adder, and combinations of
both types of hazard presented at the same time.
Data hazard detection was verified using instruction streams representing vector dot-
products, with dependent MACC operations scheduled one after the other, as analysed
previously in section 3.1. The same cases were also tested with the vector dot-
products implemented as a series of individual multiply and add instructions.
Structural hazard detection was verified by issuing a MACC instruction followed by a
series of add instructions (devised so as not to carry data hazards) and observing
when the stall signal was asserted high, and whether it was high for only one clock
cycle (as intended).
The subsequent behaviour of the block was also verified, in terms of its reaction to
different types of instruction issued after the stall signal had gone high. As expected,
when one instruction was stalled and then, perversely (compared to how the
Controller is designed to react in the event of a stall), an entirely different instruction
carrying no hazards was issued in the subsequent clock cycle, this instruction would
be submitted for execution.
6.2 : Testing and Integrating the Register File
The register file was first tested alone to verify that it could support three (or fewer)
reads and one write per clock cycle. The one write per cycle was tested by feeding
address sequences into its WB_ADDR input along with a corresponding data stream
into its WB_DATA input. At the same time, address sequences were sent to the three
ADDR_R inputs (addressing a different part of the register file than that being written
to) and the data fetched out was compared with that expected (based on the
initial_value_vector the register file was initialised with).
streams had finished, the contents of the register file were streamed out in a serial
manner from a fourth block RAM that was initialised and subsequently written to in
exactly the same way as the three block RAMs of the register file. This was done in
order to verify the write operation over multiple cycles. This analysis also verified
that the register file could support a read-after-write operation on the same register in
the same cycle, although for simplicity the FPU has not been designed to take
advantage of this.
The register file and FPU_Control block were then integrated together, where a series
of instructions were issued to FPU_Control in the simulation, and the three data
outputs of the register file were observed to make sure the right values (registers)
were fetched out after each instruction was submitted for execution, where the register
file was again initialised pre-simulation.
6.3 : Testing and Integrating the Execution Pipeline
The Execution_pipeline was integrated with both the FPU_Control and register file
blocks. The main issue in getting all three to work together was refining the latency
model of the Execution_pipeline held by the FPU_Control_Unit program; this was
done via an iterative process using an instruction stream consisting of a series of
independent MACCs, and observing various signals along the Execution_pipeline.
Up until this point, all data words had been of type 8-bit unsigned, and this was now
changed to 12-bit signed (with binary point after bit-4) in order to better represent the
typical data in the OpenGL pipeline that it was envisaged would require processing by
the FPU. The issue faced here was the need to remove intermediate type converter
blocks (which were not able to handle this data format). A simple matrix-vector
multiplication problem was used to verify this sub-system, by streaming out the
register file contents after the program’s execution and comparing the contents with
those calculated using Matlab and also with a 20-digit calculator. After experimenting
with different binary point positions, the one stated above was found to be entirely
adequate.
6.4 : Testing and Integration of the Look-Up Table Controller
The Look-up table Controller was first initialised prior to running the simulation using
Matlab environment variables for the count_to block parameter of its
PROG_COUNTER and initial_value_vector of its PROG_MEMORY. The basic
operation was verified in checking that it reacted to the stall signal going high
correctly, and that it could cope with multiple stalls in immediate succession, whereby
the stall signal would effectively toggle between ‘1’ and ‘0’ in successive clock
cycles.
The Look-up table Controller was then integrated with the FPU. Its
initial_value_vector was initialised to hold a matrix(4x4)-vector(4x1) multiplication
program (implemented as MACC instructions) completely unrolled. This program
processed the matrix in column-major order so as to schedule apart the dependencies
and avoid data hazards arising. Stalls were made to occur in several points within the
execution of the program, by repeating instances of the same instruction at the
corresponding points in program memory. This was also done with multiple stalls
occurring in immediate succession. The contents of the register file were streamed
out at the end of the program execution and compared with those evaluated using
Matlab to ensure that the stalls did not cause an error.
6.5 : Testing of the Optimal Controller’s Data Address Generator
The DAG block was first initialised with its orig_base_pointer, val_1 and
curr_accum values using workspace variables as the initial values of the
corresponding registers. The block was initially tested to check that it could cope
with any sequence of instructions issued to it, and that the latency of all these
instructions was 1 clock cycle (i.e. so that the result would be ready in time for the
issuing of the next program instruction to the FPU). It was also checked that the
DAG block could cope with stalls (in isolation and in succession), using a similar
method to that just described for the Look-up table Controller.
6.6 : Testing and Integration of the Optimal Controller’s Loop Counter
Structure, Program Memory and DAGs
The loop counter structure used as the Optimal Controller’s program counter was
developed through an iterative process of designing and testing. This was first done
to get the counters interacting in the right way (through register delays and logic
gates) in order for the individual counters to work in synchronisation and together
create the right count sequence (depending on the individual loop bounds). After this
basic functionality was achieved, further developing the loop counter structure to cope
with stalls occurring (in isolation or in succession) anywhere in the count sequence
involved another iterative process of designing and testing.
The loop counter structure was then integrated with the Program memory of the
Optimal Controller, where the state of the counters provides control inputs to a
network of MUXs which then select the address of the appropriate loop instruction as
the address assigned to Program memory. This subsystem was tested with values 1,
2, and 3 representing the instruction words of the three loops for clarity. The reaction
of the sub-system to a stall occurring was tested using the same method as previously
described in section 6.4 with regards to the Look-up table Controller.
Three instances of the Data Address Generator block were then integrated with this
sub-system, thus completing the Optimal Controller. This was initialised to run the
optimised geometric transformation program detailed previously in section 5.2.3
using workspace variables to represent the various block parameters. The program
was run on the Optimal Controller and the resulting instruction stream was inspected
to verify that the correct sequence of instructions was being generated.
6.7 : Testing and Integration of the DMA Unit
Developing the DMA unit was an iterative process of design and testing. The main
issues faced were getting the right delays in certain control signal paths, so that the
DMA unit could cope with data stream halts occurring at any point in time. The other
major problem faced was that of removing an asynchronous feedback loop between
the DMA unit and the FPU, which was required in order to integrate the DMA unit
with the FPU in the Xilinx System Generator environment.
In testing the load protocol, the input data was supplied by an Embedded Matlab
programme which was used to implement the other side of the protocol. A similar
thing was done to test the store protocol.
Once the operation of the DMA unit had been verified in isolation, the FPU was
modified to contain two register files as detailed previously in section 5.2.6, and the
DMA unit was integrated with it. The operation of this sub-system was verified by
using the DMA to store data out of and load data into one register file, whilst the FPU
executed the optimised geometric transformation program using the other register file.
The contents of both register files were then streamed out and the contents were
checked. The store instructions of the DMA unit were verified by inspecting the data
values it fetched out (where the register file being stored out of had previously been
initialised prior to running the simulation).
6.8 : Testing and Integration of the High-Level Controller
The High-level Controller was first tested in integration with the DMA unit. A series
of initialisation instructions was issued to the DMA unit, after which it was put into
run mode. The DMA unit needed an interface with the High-Level Controller in order
to correctly process initialisation instructions, and this was largely developed via an
iterative process of design and test. Once this was working correctly, the subsequent
operation of the DMA unit was seen to comply with what it had been initialised to
do.
A very similar process was carried out with the High-level Controller and the Optimal
Controller.
The High-level Controller was then integrated together with both the DMA Unit and
Optimal Controller, as well as the FPU to form the complete system. The operation
of the system was verified using five different instances of the optimised geometric
transformation, loaded into the Optimal Controller and subsequently executed on the
FPU in succession. Whilst each of these NLPs was executed (over one context cycle),
the DMA unit was used to store out the resultant vectors of the previous NLP from
the other register file, and then load in the source vectors and matrix for the
subsequent NLP. In successive context cycles, the register files the FPU and DMA
unit operated on were swapped, so as to hide the latency of loading in and storing out
data and thus prevent it from degrading execution time.
6.9 : The OpenGL Demonstration
As previously discussed in section 5.2, the optimised geometric transformation
program essentially represents the program executed by the OpenGL pipeline in
carrying out both the modelview and projection transformations, which form part of
the per-vertex stage of the OpenGL pipeline as detailed previously in section 1.2.2.
A model of this geometric transformation process was developed in the Matlab
environment, with the final processor architecture used to execute the modelview
transformation. This model is shown below in figure 55.
Figure 55 : showing the demonstration of the OpenGL pipeline’s geometric
transformation process, with the modelview transformation executed on the final
processor architecture
With reference to figure 55 above, the OpenGL program executed by the model
renders a cube (modelled by its eight vertices), which between successive frames is
rotated and translated along the trailing diagonal of the viewport. The viewpoint is
such that the viewer is looking down over the top of the cube.
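The per-frame workload handed to the FPU is, in essence, a 4x4 matrix multiplied into each of the cube's eight homogeneous vertices. A minimal C sketch of that computation (illustrative only, not the report's Matlab model):

```c
#include <assert.h>

/* Transforming the cube's eight homogeneous object-coordinate vertices by
 * a 4x4 modelview matrix: a 4-deep MACC accumulation per output element. */

typedef struct { float v[4]; } Vec4;
typedef struct { float m[4][4]; } Mat4;

static Vec4 mat4_mul_vec4(const Mat4 *m, Vec4 x)
{
    Vec4 y;
    for (int r = 0; r < 4; r++) {
        float acc = 0.0f;                /* accumulator for one output element */
        for (int c = 0; c < 4; c++)
            acc += m->m[r][c] * x.v[c];  /* one MACC per inner-loop iteration */
        y.v[r] = acc;
    }
    return y;
}

/* One frame: transform all eight vertices of the cube. */
static void transform_cube(const Mat4 *mv, const Vec4 in[8], Vec4 out[8])
{
    for (int i = 0; i < 8; i++)
        out[i] = mat4_mul_vec4(mv, in[i]);
}
```

The two loops here are exactly the nested-loop structure the optimised geometric transformation NLP executes on the FPU, with the inner loop mapping onto the MACC unit.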
With reference to figure 55 above, the outputs of the High-level Controller are
essentially the fields of the initialisation instructions of the Optimal Controller and the
DMA unit. Between successive context cycles, the High-level Controller initialises
the Optimal Controller to run the optimised geometric transformation program, each
time swapping the contexts (i.e. register files) of the FPU and DMA unit by toggling
the context_bit. The eight spatial coordinate vectors of the cube (in their object
coordinate form, with the cube centred at the origin) and the modelview matrix to be
used in rendering five successive scenes are sent to the DMA unit (element by
element and according to the DMA’s load protocol) by an Embedded Matlab function
used to model the processing element feeding the geometric transformation pipeline
within the per-vertex stage of the OpenGL pipeline. The FPU executes the
modelview transformation for one frame, whilst the DMA unit stores out the resultant
eye coordinate vectors for the previous frame and loads in the source object
coordinate vectors for the subsequent frame. The de_interleave function block de-
interleaves the output array being stored out, and extracts the eight eye coordinate
vectors before passing them on to the geometric_trans function block which
implements the remainder of the geometric transformation process before the cube is
drawn into the framebuffer. Figure 56 below shows the five successive frames
produced in an overlapped manner so that the effect of the successive executions of
the modelview transformation executed by the final processor architecture may be
seen.
Figure 56 : showing the five successive frames produced by the model of OpenGL’s
geometric transformation process shown previously in figure 55
6.10 : Progress in Developing a Hardware Based Demonstration
The Optimal Controller has been successfully targeted to the Xilinx Virtex 4 FPGA
board using the System Generator tool chain. The Optimal Controller (in hardware)
has also been integrated with the rest of the Processor architecture (in simulation)
successfully.
The DMA unit has also been targeted to hardware successfully, but has not yet been
integrated with the rest of the Processor architecture (in simulation) successfully. The
problem faced is that the feedback loop between the FPU and the DMA unit, used in
implementing the two protocols, cannot be evaluated by System Generator.
Section 7 : Conclusion
The processor architecture that has been developed is capable of carrying out both the
modelview and projection transformations within the geometric transformations
process of the OpenGL pipeline. These are two of the most prominent
transformations carried out within the OpenGL pipeline, as they each implement key
steps in constructing the scene.
These transformations within the OpenGL pipeline are executed in succession on all
objects in the scene, once per frame. Through the context switching implemented by
the architecture, successive NLPs can be processed with no latency (during which
execution would have to halt) incurred by transferring the associated data in or out.
The objects in a scene are likely to have different numbers of vertices, and as such
create programs of different sizes to be run. The Optimal Controller developed can
efficiently support a wide range of program sizes as it only ever stores one instruction
per loop; as well as being efficient in terms of the amount of program memory
required, the architecture is therefore also initialised very quickly by the High-Level
Controller to run the next NLP.
The processing architecture is therefore a good fit for the processing engine required
to perform the modelview and projection transformations.
It has been shown that when code is written specifically for the Optimal Controller,
the steps involved in constructing the initialisation instruction words to be issued by
the High-Level Controller in loading the program into the Optimal Controller are
trivial, and that this alone would be all that is necessary to constitute compilation of
the code.
Although it has not been explicitly shown in this report, when considering the range
of induction variable operations supported by the Data Address Generator block it is
reasonable to assume that the Optimal Controller, and the processor architecture as a
whole, would efficiently support a wider range of NLPs, in particular the FIR filter
and matrix-matrix multiplication, which alongside matrix-vector multiplication are
also prominent NLPs in graphics processing.
Section 8 : Appendices
Appendix A
void transposedFirFilter( int num_taps, int num_samples, const float *x, const float h[ ], float *y )
{
    int j;                 // the only loop counter
    const float *k = x;    // 'x' initially points to the input sample (x[n]) corresponding to the
                           // first output sample to be calculated (y[n])
    register float temp3 = 0, temp2 = 0, temp1 = 0, temp0 = 0;  // registers used to store each result
                                                                // at different stages of the accumulation

    // Spin-up procedure : filling up the pipeline
    temp2 = ( h[3] * *(k-3) ) + temp3;

    // PAR construct means execute the statements housed in the following pair of braces
    // simultaneously (i.e. issue them to separate processors)
    PAR {
        temp2 = ( h[3] * *(k-2) ) + temp3;
        temp1 = ( h[2] * *(k-2) ) + temp2;
    }
    PAR {
        temp2 = ( h[3] * *(k-1) ) + temp3;
        temp1 = ( h[2] * *(k-1) ) + temp2;
        temp0 = ( h[1] * *(k-1) ) + temp1;
    }

    // Steady state : where the pipeline is full
    for( j = 0; j < (num_samples - (num_taps-1)); j++ )
    {
        PAR {
            *y++  = ( h[0] * *k ) + temp0;  // y points to the register address where y[n+j] is
            temp0 = ( h[1] * *k ) + temp1;  // to be written and is incremented (post
            temp1 = ( h[2] * *k ) + temp2;  // assignment) to point to the register address
            temp2 = ( h[3] * *k ) + temp3;  // where the next output sample y[(n+j)+1] is to
        }                                   // be written
        k = x++;  // 'x' initially points to x[n+j] and is incremented (post assignment)
                  // to point to x[(n+j)+1]
    }

    // Spin-down procedure : emptying the pipeline
    PAR {
        *y++  = ( h[0] * *k ) + temp0;
        temp0 = ( h[1] * *k ) + temp1;
        temp1 = ( h[2] * *k ) + temp2;
    }
    PAR {
        *y++  = ( h[0] * *(k+1) ) + temp0;
        temp0 = ( h[1] * *(k+1) ) + temp1;
    }
    *y++ = ( h[0] * *(k+2) ) + temp0;
}
Figure A1 : A code description of the Transposed FIR filter
void systolicFirFilter( int num_taps, int num_samples, const float *x, const float h[ ], float *y )
{
    int j;                 // the only loop counter
    const float *k = x++;  // 'x' initially points to the input sample (x[n]) corresponding to the
                           // first output sample to be calculated (y[n])
    register float temp3 = 0, temp2 = 0, temp1 = 0, temp0 = 0;  // registers used to store each result
                                                                // at different stages of the accumulation

    // Spin-up procedure : filling up the pipeline
    temp2 = ( h[0] * *k ) + temp3;
    k = x++;

    // PAR construct means execute the statements housed in the following pair of braces
    // simultaneously (i.e. issue them to separate processors)
    PAR {
        temp1 = ( h[1] * *(k-2) ) + temp2;
        temp2 = ( h[0] * *k ) + temp3;
    }
    k = x++;
    PAR {
        temp0 = ( h[2] * *(k-4) ) + temp1;
        temp1 = ( h[1] * *(k-2) ) + temp2;
        temp2 = ( h[0] * *k ) + temp3;
    }

    // Steady state : where the pipeline remains full
    for( j = 0; j < (num_samples - (num_taps-1)); j++ )
    {
        k = x++;  // 'x' initially points to x[n+j] and is incremented (post assignment)
                  // to point to x[(n+j)+1]
        PAR {
            *y++  = ( h[3] * *(k-6) ) + temp0;  // y points to the register address where y[n+j] is
            temp0 = ( h[2] * *(k-4) ) + temp1;  // to be written and is incremented (post
            temp1 = ( h[1] * *(k-2) ) + temp2;  // assignment) to point to the register address
            temp2 = ( h[0] * *k ) + temp3;      // where the next output sample y[(n+j)+1] is to
        }                                       // be written
    }

    // Spin-down procedure : emptying the pipeline
    k = x++;
    PAR {
        *y++  = ( h[3] * *(k-6) ) + temp0;
        temp0 = ( h[2] * *(k-4) ) + temp1;
        temp1 = ( h[1] * *(k-2) ) + temp2;
    }
    k = x++;
    PAR {
        *y++  = ( h[3] * *(k-6) ) + temp0;
        temp0 = ( h[2] * *(k-4) ) + temp1;
    }
    k = x;
    *y++ = ( h[3] * *(k-6) ) + temp0;
}
Figure A2 : A code description of the Systolic FIR filter (with N = 4)
void semi_parallelFirFilter( int num_taps, int num_samples, const float *x, const float h[ ], float *y )
{
    int i, j;
    const float *k0, *k1, *k2, *k3;
    register int i0, i1, i2, i3;
    register float temp3 = 0, temp2 = 0, temp1 = 0, temp0 = 0, y_partial = 0, y_accum = 0;

    // Spin-up procedure
    i0 = 0;
    k0 = x++;
    temp2 = ( h[i0] * *k0 ) + temp3;

    i1 = 4 + i0++;
    k1 = (k0--) - 4;
    PAR {
        temp1 = ( h[i1] * *k1 ) + temp2;
        temp2 = ( h[i0] * *k0 ) + temp3;
    }

    i2 = i1 + 4;
    i1 = 4 + i0++;
    k2 = k1 - 4;
    k1 = (k0--) - 4;
    PAR {
        temp0 = ( h[i2] * *k2 ) + temp1;
        temp1 = ( h[i1] * *k1 ) + temp2;
        temp2 = ( h[i0] * *k0 ) + temp3;
    }

    i3 = i2 + 4;
    i2 = i1 + 4;
    i1 = 4 + i0++;
    k3 = k2 - 4;
    k2 = k1 - 4;
    k1 = (k0--) - 4;
    PAR {
        y_partial = ( h[i3] * *k3 ) + temp0;
        temp0     = ( h[i2] * *k2 ) + temp1;
        temp1     = ( h[i1] * *k1 ) + temp2;
        temp2     = ( h[i0] * *k0 ) + temp3;
    }

    i3 = i2 + 4;
    i2 = i1 + 4;
    i1 = 4 + i0;
    i0 = 0;
    k3 = k2 - 4;
    k2 = k1 - 4;
    k1 = k0 - 4;
    k0 = x++;

    // Steady state
    for( j = 0; j < (((num_taps / 4) * num_samples) - 4); j++ )
    {
        y_accum = 0;
        for( i = 0; i < 4; i++ )
        {
            PAR {
                y_accum  += y_partial;
                y_partial = ( h[i3] * *k3 ) + temp0;
                temp0     = ( h[i2] * *k2 ) + temp1;
                temp1     = ( h[i1] * *k1 ) + temp2;
                temp2     = ( h[i0] * *k0 ) + temp3;
            }
            PAR {
                i3 = i2 + 4;
                i2 = i1 + 4;
                i1 = i0 + 4;
                i0++;
            }
            PAR {
                k3 = k2 - 4;
                k2 = k1 - 4;
                k1 = k0 - 4;
                k0--;
            }
        }
        i0 = 0;
        *y++ = y_accum;
    }

    // Spin-down procedure
    PAR {
        i3 = i2 + 4;
        i2 = i1 + 4;
        i1 = i0 + 4;
    }
    PAR {
        k3 = k2 - 4;
        k2 = k1 - 4;
        k1 = k0 - 4;
    }
    PAR {
        y_accum  += y_partial;
        y_partial = ( h[i3] * *k3 ) + temp0;
        temp0     = ( h[i2] * *k2 ) + temp1;
        temp1     = ( h[i1] * *k1 ) + temp2;
    }
    PAR {
        i3 = i2 + 4;
        i2 = i1 + 4;
    }
    PAR {
        k3 = k2 - 4;
        k2 = k1 - 4;
    }
    PAR {
        y_accum  += y_partial;
        y_partial = ( h[i3] * *k3 ) + temp0;
        temp0     = ( h[i2] * *k2 ) + temp1;
    }
    PAR {
        i3 = i2 + 4;
    }
    PAR {
        k3 = k2 - 4;
    }
    PAR {
        y_accum  += y_partial;
        y_partial = ( h[i3] * *k3 ) + temp0;
    }
    y_accum += y_partial;
    *y = y_accum;
}
Figure A3 : A code description of the Semi-Parallel FIR filter
(with N = 16, M = 4)
Appendix B
Figure B1 : Showing the FPU_Control block of figure 29 at the next level of
abstraction down
function [scb1,scb2,scb3,....] = updateScore(register_no, op_mn, scb1,scb2,scb3,....)

field = 0;      // variable to store the Scoreboard field of the register as a decimal value
f_number = 0;   // variable to store the field number of the register within its respective Scoreboard segment
countdown = 0;  // variable to store the countdown sub-field of the field as a decimal value
execunit = 0;   // variable to store the exec_unit sub-field of the field as a decimal value

if(op_mn == 2)  // if instruction to be submitted is a MACC
    countdown = 10;
    execunit = 1;
    field = (countdown * 4) + execunit;  // countdown sub-field is shifted up by 2 bits and
                                         // concatenated with exec_unit
end
if(op_mn == 4)  // if instruction to be submitted is a multiply
    countdown = 7;
    execunit = 2;
    field = (countdown * 4) + execunit;
end
if(op_mn == 3)  // if instruction to be submitted is an add
    countdown = 5;
    execunit = 1;
    field = (countdown * 4) + execunit;
end

switch register_no
    case {1,2,3,4,5}        // if register's field is held in Scoreboard_1
        f_number = register_no;
        scb1 = scb1 + ((field) * (2 ^ (6 * (f_number - 1))));  // set the right field within Scoreboard_1
                                                               // to this field value
    case {6,7,8,9,10}       // if register's field is held in Scoreboard_2
        f_number = register_no - 5;
        scb2 = scb2 + ((field) * (2 ^ (6 * (f_number - 1))));  // set the right field within Scoreboard_2
                                                               // to this field value
    case {11,12,13,14,15}   // if register's field is held in Scoreboard_3
        f_number = register_no - 10;
        scb3 = scb3 + ((field) * (2 ^ (6 * (f_number - 1))));  // set the right field within Scoreboard_3
                                                               // to this field value
.
.
.
.
Figure B2 : Showing a synopsis of the updateScore() sub-function within the
FPU_Control_Unit.
The inputs to the sub-function are the mnemonic of the instruction being submitted
(op_mn), its destination register number (register_no) and the Scoreboard in its
segmented form (scb1 to scb20). Firstly the destination register’s Scoreboard field is
constructed as detailed previously in figure 31, where the countdown and exec_unit
sub-fields are set to the appropriate values for the type of instruction being submitted
(represented by the op_mn variable). In updateScore() the countdown value is set to
one more than the latency of the instruction being submitted, because the
sboardCycle() sub-function (discussed below) would be executed after an execution
of updateScore(), and by default sboardCycle() decrements all non-zero countdown
values throughout the entire Scoreboard. The switch case block sets the appropriate
field within the appropriate Scoreboard segment to the newly constructed field value.
All Scoreboard segments are input and subsequently output, despite the redundancy
this entails, in order to minimise the number of switch case or if-then-else type blocks
employed to assign the field to the correct Scoreboard segment.
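Assuming the encoding described above (a 6-bit field per register, formed by shifting the countdown up past the 2 exec_unit bits, with five fields packed into a segment), the core of updateScore() can be sketched in C as follows. The names are illustrative, not the actual implementation.

```c
#include <assert.h>

#define XUNIT_BITS 2   /* width of the exec_unit sub-field */
#define FIELD_BITS 6   /* total width of one register's field */

/* Build a field from the instruction mnemonic, using the values in the
 * report: 2 = MACC, 4 = multiply, 3 = add. */
static unsigned make_field(int op_mn)
{
    unsigned countdown = 0, execunit = 0;
    if (op_mn == 2) { countdown = 10; execunit = 1; }  /* MACC */
    if (op_mn == 4) { countdown = 7;  execunit = 2; }  /* multiply */
    if (op_mn == 3) { countdown = 5;  execunit = 1; }  /* add */
    /* countdown is shifted up by 2 bits and concatenated with exec_unit */
    return (countdown << XUNIT_BITS) | execunit;
}

/* Insert a register's field (f_number = 1..5 within its segment) into a
 * previously-zero slot of a Scoreboard segment. */
static unsigned update_segment(unsigned segment, int f_number, unsigned field)
{
    return segment + (field << (FIELD_BITS * (f_number - 1)));
}
```

Using shifts in place of the report's multiply-by-powers-of-two arithmetic gives the same packing; a MACC's field, for instance, encodes as (10 << 2) | 1 = 41.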
function [sboardOut, WB_EN, WB_ADDR, WRITE_MODE, WB_ARB] = sboardCycle(SBOARD_IN, board_num, opcode, WB_EN, WB_ADDR, WRITE_MODE, WB_ARB)

inter_holder = SBOARD_IN;  // variable to store the successive slices of the Scoreboard segment in decimal form
new_holder = 0;      // variable to store the Scoreboard segment after decrementing all non-zero countdowns
inter_field = 0;     // variable to store the fields of the Scoreboard segment as they are stripped away
inter_count = 0;     // variable to store their countdown sub-fields
inter_xunit = 0;     // variable to store their exec_unit sub-fields
num_xunit_bits = 2;  // variable to store the number of bits in the exec_unit sub-field : to help facilitate
                     // the addition of more execution units in the Execution_pipeline

for(i=1:5)  // for all 5 fields of the Scoreboard segment : starting with the most significant field
    inter_field = floor(inter_holder / (2 ^ (6*(5-i))));              // place the field value in variable inter_field
    inter_count = floor(inter_field / (2 ^ num_xunit_bits));          // place its countdown sub-field in inter_count
    inter_xunit = inter_field - (inter_count * (2 ^ num_xunit_bits)); // place its exec_unit sub-field in inter_xunit
    inter_holder = inter_holder - ((inter_field) * (2 ^ (6*(5-i))));  // remove this field from the rest of
                                                                      // the Scoreboard segment
    if(inter_count == 1)  // if the countdown value is 1
        WB_EN = 1;        // assert the write back operation on the register
        WB_ADDR = (5-i) + ((board_num - 1) * 5);  // represented by this field of the Scoreboard
        WRITE_MODE = 0;
        if(inter_xunit == 2)  // taking the write back data value from the execution unit represented
            WB_ARB = 1;       // by the exec_unit sub-field
        end
        if(inter_xunit == 1)
            WB_ARB = 0;
        end
        inter_field = inter_field - inter_xunit;  // strip the exec_unit sub-field away from the field : if a
                                                  // write back operation has been asserted on this register
    end
    inter_field = inter_field - (inter_count * (2 ^ num_xunit_bits));  // strip the countdown sub-field away
                                                                       // from the field
    if(inter_count > 0)
        inter_count = inter_count - 1;  // decrement the countdown sub-field value if it's bigger than zero
    end
    inter_field = inter_field + (inter_count * (2 ^ num_xunit_bits));  // set the countdown sub-field within
                                                                       // the field to the new value
    new_holder = new_holder + (inter_field * (2 ^ (6*(5-i))));  // set the field in the Scoreboard segment
                                                                // representing this register to the new field value
end

sboardOut = new_holder;  // output the modified Scoreboard segment
Figure B3 : Showing the code of the sboardCycle() sub-function of
FPU_Control_Unit.
The inputs to the sboardCycle() sub-function are SBOARD_IN (the Scoreboard
segment to be processed), board_num (the Scoreboard segment number), and all of
the control signals used by FPU_Control_Unit to assert a write back operation. If
sboardCycle() detects that a write back operation is scheduled to occur in the current
clock cycle it assigns the appropriate values to these control signals, otherwise they
are passed out unchanged.
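One sboardCycle() pass over a single segment can be sketched in C, under the same assumed 6-bit field encoding as above (illustrative names, not the actual Embedded Matlab code): any field whose countdown is 1 fires a write back and has its exec_unit code stripped, and every non-zero countdown is decremented.

```c
#include <assert.h>

#define XUNIT_BITS     2
#define FIELD_BITS     6
#define FIELDS_PER_SEG 5

/* Process one 30-bit Scoreboard segment for one clock cycle. On return,
 * *wb_field is the 1-based field number whose write back is due this
 * cycle, or 0 if none is. */
static unsigned cycle_segment(unsigned seg, int *wb_field)
{
    unsigned out = 0;
    *wb_field = 0;
    for (int i = 0; i < FIELDS_PER_SEG; i++) {
        unsigned field = (seg >> (FIELD_BITS * i)) & 0x3F;
        unsigned count = field >> XUNIT_BITS;
        unsigned xunit = field & ((1u << XUNIT_BITS) - 1);
        if (count == 1) {     /* write back due this cycle */
            *wb_field = i + 1;
            xunit = 0;        /* strip exec_unit once the write back fires */
        }
        if (count > 0)
            count--;          /* decrement every non-zero countdown */
        out |= ((count << XUNIT_BITS) | xunit) << (FIELD_BITS * i);
    }
    return out;
}
```

The decrement-on-every-cycle behaviour here is why updateScore() sets the countdown to one more than the instruction's latency: the first pass of this loop immediately takes one off.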
function decision = checkScore(reg_no, sc1, sc2, sc3, sc4, sc5, . . . . . . . . . . . sc19, sc20)

xunit_bits = 2;    // variable to store number of bits in an exec_unit sub-field : to help facilitate the
                   // addition of more execution units in the Execution_pipeline
field_number = 0;  // variable to store the Scoreboard segment field number of the register being checked
score = 0;         // variable to store successive slices of the Scoreboard segment as fields are stripped away
field = 0;         // variable to store the Scoreboard field representing the register being checked
countdown = 0;     // variable to store the countdown sub-field value of the register being checked
ii = 0;            // variable used as a loop index

switch reg_no
    case {1,2,3,4,5}      // if the register being checked is in Scoreboard1
        field_number = reg_no;
        score = sc1;
    case {6,7,8,9,10}     // if the register being checked is in Scoreboard2
        field_number = (reg_no - 5);
        score = sc2;
.
.
.
.
    case {96,97,98,99,100}  // if the register being checked is in Scoreboard20
        field_number = (reg_no - 95);
        score = sc20;
end

for(ii=1:(6-field_number))  // starting with the most significant field : strip successive fields away
    field = floor(score / (2 ^ (6*(5-ii))));     // from the Scoreboard segment and place them in the
    score = score - (field * (2 ^ (6*(5-ii))));  // variable field : until it contains the field value of
end                                              // the register being checked

countdown = field / (2 ^ xunit_bits);  // extract the countdown sub-field from this field

if(countdown == 0)
    decision = 1;  // decision is used as a boolean type variable to represent
else               // whether or not the register passes its Scoreboard check
    decision = 0;
end
Figure B4 : Showing a synopsis of the checkScore() sub-function of
FPU_Control_Unit, used to check the Scoreboard for any data hazards that would
arise from submitting a particular instruction to the Execution_pipeline
As can be seen in figure B4 above, the inputs to checkScore() are the number of the
register to be checked (reg_no) and the Scoreboard in its segmented form. The
Scoreboard field representing the register is extracted from the appropriate
Scoreboard segment and its countdown sub-field is inspected. If at that point in time
the register is the target of an upcoming write back operation, the countdown value
will be greater than zero, in which case a 0 is assigned to the output variable
(decision) to signal that the Scoreboard check has failed. Otherwise the register
passes the Scoreboard check and a 1 is assigned to decision. Only if all three
registers associated with the issued instruction pass the Scoreboard check will
FPU_Control_Unit submit the instruction to the Execution_pipeline.
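Under the same assumed field layout, the hazard test itself reduces to inspecting one countdown sub-field; a hedged C sketch (names illustrative):

```c
#include <assert.h>

#define XUNIT_BITS 2
#define FIELD_BITS 6

/* Scoreboard check for one register: extract its 6-bit field from the
 * segment (f_number = 1..5) and pass only if the countdown sub-field is
 * zero, i.e. no write back to this register is still in flight. */
static int check_register(unsigned segment, int f_number)
{
    unsigned field = (segment >> (FIELD_BITS * (f_number - 1))) & 0x3F;
    unsigned countdown = field >> XUNIT_BITS;
    return countdown == 0;   /* 1 = passes the check, 0 = hazard */
}
```

An instruction is submitted only when this test returns 1 for all three of its associated registers; otherwise the issue is stalled.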
Further detail on how the FPU submits instructions to the Execution_pipeline
Whenever a MACC instruction is issued to the FPU, if it is submitted to the
Execution_pipeline, FPU_Control_Unit also assigns the instruction’s destination
register number to its MQ_DELAYED output in the same clock cycle. This output of
FPU_Control_Unit is fed back to its MQD_IN input via a delay block. The latency of
this delay block is 5 clock cycles, and thus as can be seen by observing figure 35,
when the execution of a MACC instruction is at the stage where the register fetch for
the accumulate-stage needs carrying out, its destination (and third source) register
number appears at MQD_IN, and FPU_Control_Unit uses this in asserting the register
fetch. When submitting an individual multiply or add instruction, and when not
submitting any instruction at all, FPU_Control_Unit always assigns zero to its
MQ_DELAYED output.
Thus when the FPU is issued with an add instruction in a cycle in which the register
fetch of a MACC’s accumulate stage must be carried out, FPU_Control_Unit knows
that the instruction must not be submitted to the Execution_pipeline as it can see that
the value of MQD_IN is non-zero.
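The MQ_DELAYED feedback path can be modelled as a 5-deep shift register; the sketch below uses hypothetical names, not the actual System Generator delay block. A MACC's destination register number enters the line when the MACC is submitted, and when that number re-emerges at MQD_IN the accumulate-stage register fetch owns the cycle, so a newly issued add must not be submitted then.

```c
#include <assert.h>
#include <string.h>

#define MACC_DELAY 5   /* latency of the MQ_DELAYED -> MQD_IN delay block */

typedef struct {
    int line[MACC_DELAY];  /* pending MACC destination register numbers */
} MaccQueue;

/* Advance the delay line one clock cycle: push this cycle's MQ_DELAYED
 * value (the destination register of a submitted MACC, or 0 otherwise)
 * and return the value emerging at MQD_IN. */
static int mq_step(MaccQueue *q, int submitted_macc_dest)
{
    int out = q->line[MACC_DELAY - 1];
    memmove(&q->line[1], &q->line[0], (MACC_DELAY - 1) * sizeof(int));
    q->line[0] = submitted_macc_dest;
    return out;
}
```

Because zero is pushed on every non-MACC cycle, a non-zero MQD_IN value is an unambiguous signal that an accumulate-stage fetch is due, which is exactly the condition the Control unit tests before submitting an add.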
Figure B5 below shows the FPU’s register file.
Figure B5 : showing the FPU’s register file
If the instruction issued to the execution pipeline is an arithmetic instruction (MACC,
multiply or add), the ADDR_RA and ADDR_RB outputs of the FPU’s Control unit are
set to the addresses of source registers RA and RB respectively, and it sets its
WRITE_EN output to a ‘0’. This causes the contents of source registers RA and RB
to be fetched from the register file.
As stated previously in section 4, a MACC instruction is executed in two stages, with
the first stage being to use the multiplier to multiply the contents of registers RA and
RB, and the second stage being to use the adder to add the result of the multiplication
to register RT/S.
The multiply part of the multiply-add instruction is fed through the pipelined
multiplier and in subsequent cycles the Control unit is able to submit other
instructions to the Execution_pipeline (as long as checking the Scoreboard does not
reveal that there is a possibility of errors occurring as discussed previously in section
5).
The destination register number that the Control unit assigned to its MQ_DELAYED
output appears at its MQD_IN input the cycle before the Execution_pipeline’s
multiplier places the result of the corresponding multiply operation into its output
register.
Figure B6 below shows the FPU’s Execution_Pipeline.
Figure B6 : showing the FPU’s Execution_pipeline
The contents of register RT are then fetched from the register file. This register fetch
operation is issued one cycle before the result of the multiplication with which it is to
be added is ready because it takes one cycle for the contents of register RT to emerge
from the register file. Thus by scheduling the fetch operation for register RT in this
way its latency is hidden. As a result of this, the Control unit’s SEL_MADD outputs
are both fed through a delay block with a latency of 1 cycle, so that MUX_1 and
MUX_2 within the Execution unit pass the contents of register RT and the result of
the multiplication through to the adder in the cycle that they both emerge from the
register file and the output register of the multiplier respectively. All multiplexers
used have zero latency.
As also discussed previously in section 5, when the sboardCycle() sub-function of the
FPU_Control_Unit program detects that a write-back operation is scheduled to be
issued in the subsequent cycle, it issues it. The Control unit does this by assigning the
address of the detected destination register to its WB_ADDR output, which is
connected to the WRITE_ADDRESS input of the register file. The Control unit
assigns a ‘1’ to its WB_EN output which is connected to the WRITE_ENABLE input
of the register file, to enable a write operation into the register file. The
WRITE_MODE output of the Control unit is set to a ‘0’ so a MUX passes the output
of the Execution unit through to the DATA input of the register file. The WB_ARB
output of the Control unit is assigned based on the Execution-unit sub-field of the
destination register’s Scoreboard field. This sub-field holds a value of 1 when the
Execution unit’s adder produces the final result to be written into the register file (in
which case a ‘0’ is assigned to the WB_ARB output of the Control unit), whereas if
the Execution unit’s multiplier produces the final result then this sub-field holds a
value of 2 (in which case a ‘1’ is assigned to the WB_ARB output of the Control
unit).
If a newly issued instruction is found to carry a hazard, the instruction is not
submitted to the Execution_pipeline and the Control unit asserts a ‘1’ on its STALL
output. The Controller will then continue to issue that same instruction until the
Control unit submits it to the Execution_pipeline and returns its STALL output to ‘0’.
Appendix C
Figure C1 : showing counter_1 of the Optimal Controller’s loop counter structure
Figure C2 : showing counter_2 of the Optimal Controller’s loop counter structure
Figure C3 : showing counter_3 of the Optimal Controller’s loop counter structure
Appendix D
Figure D1 : showing the Optimal Controller’s High-level Controller
Section 9 : Acknowledgements
I would like to give a note of thanks to my Industrial supervisor Richard Walke, as
well as my two Academic supervisors Andy Tyrrell and Jonathan Dell.
Section 10 : References
[1] A Parallel Processor Architecture for Graphics Arithmetic Operations; John G.
Torborg; Computer Graphics, Volume 21, Number 4, July 1987
[2] OpenGL website; http://www.opengl.org/about/overview.html
[3] Compilation from Matlab to Process Networks Realized in FPGA; Tim Harriss,
Richard Walke; QinetiQ Ltd, Malvern, UK
[4] Xilinx XtremeDSP User Guide; http://www.xilinx.com