6 Month Technical Report


RECONFIGURABLE SYSTEM FOR VIDEO PROCESSING

(SIX MONTH TECHNICAL REPORT)

Student:

BEN COPE

Supervisors:

PROF PETER CHEUNG & PROF WAYNE LUK

Research Group:

CIRCUITS & SYSTEMS, EEE

Submission Date:

FRIDAY 22ND APRIL, 2005


2 Literature Survey

This chapter covers a literature survey of the use of FPGAs and graphics hardware for video processing.

Firstly, I will look at current architectures using GPUs (Graphics Processing Units) and FPGAs for video processing, followed by interconnect structures (including Network-on-Chip). FPGA architectural features which make the device adaptable to such applications will then be shown, along with tools used for debugging and programming.

The implementation of video processing applications is moving away from being predominantly software based towards more hardware based solutions. This can be seen by comparison with the first video cards, where all image processing was performed before the data reached the card; today much more of the processing is performed on the card, such as lighting effects. Graphics hardware has progressed from pure display hardware to user-programmable devices, opening the way to non-graphics applications.

There have also been advances in research into interconnect structures: the bus is no longer seen as the only way to connect hardware cores together. Switch-boxes and networks are emerging; it is likely that more topologies will be considered in the future and that some of today's new ideas will become commonplace.

2.1 Current Architectures

This section is split into the use of GPUs and FPGAs individually for video applications, and discusses the possibility of interlinking these modules.

GPU Architectures:

Emerging Field: Research into the use of GPUs for general-purpose computation began on the Ikonas machine[1], developed in 1978. This was used for the Genesis Planet sequence in Star Trek: The Wrath of Khan and has also been used in TV commercials and the Superman III video game. This highlighted early on the potential for graphics hardware to do much more than output successive images. GPUs were first used for non-graphics applications (www.GPGPU.org) by Lengyel et al in 1990, for robot motion planning; this marked the start of the era of GPGPU (General-Purpose GPUs).

Recent Developments: Trendall and Stewart in 2000[2] gave a summary of the calculations possible on a GPU, with a real-time calculation of refractive caustics. These capabilities have progressed further since, with more on-board memory


and larger processing ability. More recently, Moreland and Angel[3] implemented an FFT routine on the GPU, performing FFT, filtering and IFFT on a 512x512 image in under 1 second on a GeForce 5800. This has been made possible by graphics hardware manufacturers (Nvidia and ATI being the largest) allowing programmers more control over the GPU. This is facilitated through shader programs (released in 2001), written in languages such as Cg (C for Graphics) by Nvidia or DirectX by Microsoft, which have since been enhanced for greater user control (DirectX 9.0 now allows 128-bit precision, i.e. 32 bits per RGBA pixel component[4]). There is still room for progress: some hardware functionality is still hidden from the user, there is no pointer support, and debugging is difficult. Recognition for non-graphical applications of GPUs was given at the Siggraph / Eurographics Hardware Workshop in San Diego (2003), showing its emergence as a recognised field.

Nvidia: The intentions of the manufacturers are clear: in an article in July 2002[5], Nvidia's CEO announced teaming with AMD to develop nForce, which will be capable of handling multimedia tasks and will bring theatre-style DVD playback to the home computer. Previously the CPU offloaded tasks, such as video processing, onto plug-in cards, which were later shrunk and encapsulated within the CPU. This minimisation was beneficial to the likes of Intel and less so to GPU producers such as Nvidia. The implementation of multimedia applications on graphics cards (and their importance to the customer) means the screen is now seen as the computer, rather than the network it sits on. In the development of Microsoft's Xbox, more money was given to Nvidia than Intel; this trend is likely to continue, showing a power shift towards graphics card manufacturers.

Performance: The rate of increase of GPU processing performance has been 2.8 times/year since 1993[4] (compared to 2 times/1.5 years for CPUs according to Moore's Law), a trend which is expected to continue until 2013. The GeForce 5900 performs at 20 GFLOPS, which is equivalent to a 10 GHz Pentium processor[4]. This shows the potential of graphics hardware to out-perform CPUs, with a new generation being unveiled every 6 months. It is expected that TFLOP performance will be seen from graphics hardware in 2005. For example, in Strzodka and Garbe's implementation of motion estimation[6], a GeForce 5800 Ultra out-performs a P4 2 GHz processor by 4 to 5 times.

Increases in performance also benefit filtering operations: in the previously mentioned FFT implementation[3], the potential for frequency-domain filtering was shown. The number of computations required for filtering is reduced by working in the frequency domain, from an O(NM^2) problem to an O(NM) + FFT + IFFT (about O(MN(log M + log N))) one. Moreland and Angel implemented clever tricks with indexing (dynamic programming), frequency compression and


splitting the 2D problem into two 1D problems to achieve this speed-up. With the rapidly increasing power of graphics cards, it can be expected that the computation time will drop below 1 second, allowing real-time processing. A final factor which aids this is that 32-bit precision calculations, vital for such work, are now possible on GPUs.

Cost: Another benefit of the GPU is cost: a top-end graphics card (capable of near-TFLOP performance) can be purchased for less than £300. Such a graphics card performs equivalently to image generators costing thousands in 1999[4]. This gives the opportunity for real-time video processing capabilities on a standard workstation.

Parallelism: The architecture of graphics hardware is equivalent to that of Stretch computers (designed for fast floating-point arithmetic). GPUs use stream processing, which requires a sequence of data in some order. This method exploits the dataflow in the organisation of the processing elements to reduce caching (CPUs are typically 60% cache[4]). Other features are the exploitation of the spatial parallelism of images, and the fact that pixels are generally independent.

Strzodka and Garbe, in their paper on motion estimation and visualisation on graphics cards[6], show how a parallel computing application can be implemented in graphics hardware. They identify GPUs as not the best solution, but as having a better price-performance ratio than other hardware solutions. In such applications the data stream controls the flow rather than the instructions, facilitating the cache benefit above. Moreland and Angel[3] go further, branding the GPU no longer a fixed-pipeline architecture but a SIMD (Single Instruction-stream, Multiple Data-stream) parallel processor, which highlights its flexibility in the eyes of the programmer.

How to program: There are two programming sources in graphics hardware programming[6]:

Flowware: assembly and direction of dataflow

Configware: configuration of processing elements

In an FPGA these two features are implemented together, whereas in graphics hardware they are explicitly separate. Careful implementation of GPU code is necessary for platform (e.g. DX 9.0) and system (e.g. C++) independence. APIs also handle Flowware and Configware separately. This becomes important when considering programming FPGAs and GPUs simultaneously.


FPGA Architectures:

ASIC solutions for processing tasks are optimal in speed, power and size[7]; however, they are expensive and inflexible. DSPs allow more flexibility and can be reused for many applications, but are energy inefficient and can introduce delays if not optimised per task. For these reasons it is often favourable to implement such applications on a reconfigurable device.

An Alternative: When deciding which hardware to use for a graphics sub-system, there is a trade-off between operating speed and flexibility. To maximise its benefit, an FPGA implementation must give more flexibility than custom graphics processors and be faster than a general-purpose processor. The need for flexibility is justified because one may need to change an algorithm (e.g. a compression standard) post-manufacture. By exploiting its re-programmability, a small FPGA can appear as a large and efficient device.

Example: Singh and Bellec in 1994[8] implemented three graphics applications on an FPGA, namely a circle outline, a filled circle and a fast-sphere algorithm. They found a RAM-based FPGA favourable because of the large storage requirements. Performance in drawing circles was satisfactory, out-performing a general-purpose display processor (TMS34020) by a factor of 6 (achieving 16 million pixels/sec). It was, however, worse for fast-sphere rendering, at only 2627 spheres/sec against 14,300 from a custom hardware block. Improvements are expected with newer FPGAs, such as the Virtex-4, which have larger on-chip storage and more processing power. FPGAs today also have more built-in blocks to speed up operations such as multiplication (these are considered later).

Sonic / Ultra-Sonic: Two more possibilities for accelerating video processing, which highlight the benefits of a reconfigurable architecture. The hardware is systolic: one data item is clocked into and one out of each module on every clock cycle, maintaining a high throughput rate although latency can vary. The challenges involved, highlighted in [9, 10], are: correct hardware/software partitioning, spatial and temporal resolution, hardware integration with software, keeping memory accesses low, and real-time throughput. Sonic approaches these challenges with PIPEs (Plug-In Processing Elements), each with three main components: an engine (for computations), a router (for routing, formatting and data access) and memory for storing video data.

Typical applications are the removal of distortions introduced by watermarking an image[10], 2D filtering[11] and 2D convolution[12]. In the latter, an implementation at half the clock rate of state-of-the-art technology was adequate, suggesting


a lower power solution. 2D filtering was split into two 1D filters and showed a 5.5 times speed-up when using one PIPE, with greater speed-up from more PIPEs.

Bottlenecks: In contrast to memory, the FPGA's bottleneck isn't bus speed but configuration time. Configurations can be stored in a memory bank and copied into a local cache as required. Singh and Bellec[8] propose partitioning the FPGA into zones, each with good periphery access to the network and a different size. The capability of partial reconfiguration is important here: if a new task is required, only a section of the FPGA need be reconfigured, leaving other sections untouched for later reuse. A practical example is seen in the Sonic architecture above: the router and engine are implemented on separate FPGA elements. If a different application required only a different memory access pattern (e.g. a 2x1D implementation of a 2D filter[11]), only the router need be reconfigured; this separation also provides abstraction. Another architecture where the bus bottleneck is reduced is seen in [7], where a daughter board is incorporated to perform D/A conversion. Sharing the data and configuration control path reduces the bottleneck; data loss occurs during the configuration phase, but this is seen as acceptable.

Parallelism: Task-level parallelism is often ignored in designs; by proposing a design method focused on the system dataflow, Sedcole et al[12] hope to overcome this. Taking Sonic as an example: spatial parallelism is exploited by distributing parts of each frame across multiple hardware blocks (PIPEs in this case). Temporal parallelism can be exploited by distributing entire frames over these blocks. Further, these elements can be grouped to perform bigger tasks; Singh and Bellec[8] similarly suggest grouping zones of a partitioned FPGA in a design. Sedcole et al propose the following general issues to be considered in such large-scale implementations:

Design complexity

Modularisation - allocation / management of resources

Connectivity / communication between modules

Power minimisation (ties in with low memory accesses)

Hardware or Software: The benefits of a software implementation are seen with irregular, data-dependent or floating-point calculations. A hardware solution is beneficial for regular, parallel computation at high speed[7]. Tasks must be split optimally between these two methods. Advances in hardware mean that some of the problems with floating-point calculations and the like have been overcome; hardware can now perform as well as, or better than, software. The software designer needs


a good software model of the hardware, and the hardware designer requires good abstraction[11]. Hardware acceleration is particularly suited to video processing applications because of their parallelism and relatively simple calculations.

In the Sonic example, PIPEs act as plug-ins, analogous to software plug-ins, which provides an easy path for Sonic into software. This overcomes a previous problem with reconfigurable hardware: that there were no good software models.

Co-operation: Another way to look at the use of FPGAs in a graphics system is as virtual hardware extending the instruction set of a host processor. This idea is approached by Vermeulen et al[13], where a processor is mixed with some hardware to extend its instruction set. In general this hardware could be another processor or an ASIC component; again, there are issues in getting the components to communicate and work together.

The requirements of a reconfigurable implementation are therefore to be flexible, powerful, low cost, run-time / partially reconfigurable, and to fit in well with software. The current FPGA limitations highlighted in [9, 14] are: configuration speed, debugging, number of gates, partial reconfiguration (Altera previously had no support) and the PCI bus bottleneck. These considerations would be important when combining an FPGA with other hardware, and some or all of these requirements may also apply to such a mixed system.

2.2 Interconnects

Interconnects currently used for graphics card to processor communications will be discussed, followed by a look at some System-on-Chip (SOC) and Network-on-Chip (NOC) architectures.

GPU view: GPU components are implemented in conjunction with CPUs, acting as graphics sub-systems and working as co-processors. To do this a high-speed interface is required, as GPUs can process large amounts of data in parallel, with the required bandwidth doubling every 2 years[15]. The AGP standard has progressed from 1x through to the current 8x model (peaking at 2.1 GBytes/sec); however, with new GPUs working at higher bit precisions (128-bit/RGBA in the GeForce 6800 series), greater throughput was required. AGP uses parallel point-to-point interconnections with timing relative to the source. As transfer speeds increased, the capacitance and inductance of the connectors needed to be reduced; this became restrictive past 8x. A new transfer method was required: serial differential point-to-point offers a high-speed interconnect at


The Network: The advantages of a network are high performance and bandwidth, modularity, the ability to handle concurrent communications, and better electrical properties than a bus or switch. As chip sizes increase, global synchrony becomes infeasible, since a signal takes several clock cycles to travel across a chip. The NOC overcomes this problem by being a GALS (Globally-Asynchronous Locally-Synchronous) architecture.

Dally and Towles[18] propose a mesh-structured NOC as a general-purpose interconnect structure. The advantage of being general purpose is that the frequency of use would be greater, justifying more design effort; the disadvantage is that one could do better by optimising for a certain application (though this may not be financially viable).

In Dally and Towles' example, a chip is divided into an array of 16 tiles, numbered 0 through 3 in each axis. Interconnections between tiles are made in a folded torus topology (i.e. in the order 0,2,3,1), which attempts to minimise the number of tiles a packet must pass through to reach its destination. Each tile therefore has N, S, E and W connections, plus an input and an output path to inject data into the network or remove it. The data, address and control signals are grouped and sent as a single flit. Area is dominated by buffers (6.6% of tile area in their example). The limitations are opposite to those of computer networks: less constraint on the number of interconnections, but more on buffer space. The network could be run at 4 GB/s (at least twice the speed of the tiles) to increase efficiency, but this would increase the space required for buffers.
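A small model of the hop count in this 4x4 folded-torus example: along each axis the tiles form a ring 0-1-2-3-0, so the distance is the shorter way round the ring, and the folded physical layout (order 0,2,3,1) keeps every ring neighbour physically adjacent, avoiding long wrap-around wires. The functions below are an illustrative model, not Dally and Towles' actual router.

```cpp
#include <cstdlib>

// Distance between two positions on a 4-node ring (torus wrap-around).
int ring_dist(int a, int b) {
    int d = std::abs(a - b);
    return d < 4 - d ? d : 4 - d;
}

// Hop count between tiles (x1,y1) and (x2,y2) on the 4x4 torus:
// independent ring distances in each axis, summed.
int hops(int x1, int y1, int x2, int y2) {
    return ring_dist(x1, x2) + ring_dist(y1, y2);
}
// Thanks to wrap-around, opposite corners are only 2 hops apart, and no
// pair of tiles is more than 4 hops apart.
```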

The disadvantage of the above example is that tiles will not always be the same size, so space would be wasted by smaller designs. Jantsch[19] proposes a solution which overcomes this, using a similar mesh structure. The main differences are that he no longer uses the torus topology but a standard connection to a tile's neighbours, and that he provides a region wrapper, around any block considerably larger than the others, which emulates the presence of the original network.

Jantsch suggests two possibilities for the future: many NOC designs for many applications (expensive in design time) or one NOC design for many applications (inflexible). The latter would justify the design cost; however, one would need to decide on the correct architecture (mix of CPU, DSP etc.), language (to configure the NOC), operating system (for run-time) and design method for a set of tasks.


There are other suggested interconnect methods. Hemani et al[20] suggest a honeycomb structure where each component connects to six others. Benini and De Micheli[21] introduce SPIN (Scalable, Programmable, Interconnect Network), a tree structure of routers with the nodes as the leaves. Dobkin et al[22] propose a mesh structure similar to Jantsch's, but include bit-serial long-range links; they use a non-return-to-zero method for the bit-serial connection and believe it best for NOC. This is a snapshot of current NOC ideas, of which there are possibly as many topologies as for our standard computer networks today.

2.3 FPGA Internal Structure

FPGAs were first designed to be as programmable as possible, comprising configurable logic blocks and interconnects. As they have developed, manufacturers have introduced standard components into them, such as embedded memory blocks and, in some of the latest Xilinx FPGAs, PowerPC processors. There is potential for future work in developing new blocks which could be placed into an FPGA to improve functionality. In this section, interesting modules which could be used within FPGAs in the future are considered.

Multipliers: The motivation for embedded multipliers is that implementations of binary multiplication in FPGAs are often too large and slow. A possible solution is Programmable Array Modules (PAMs); these are fixed in size, however, and waste space if small bit-length multiplications are required. Other solutions are trees or pre-processing methods, although these are difficult to generalise. A better solution, presented by Haynes and Cheung[23], is to use reconfigurable multiplier blocks. They designed a Flexible Array Block (FAB) capable of multiplying two 4-bit numbers; FABs combine to multiply numbers of lengths 4n and 4m. The two input numbers can independently be signed or unsigned. The speed of the FABs is comparable to that of non-configurable blocks, at the cost of being twice the size and having twice the number of interconnects. The latter isn't a problem, thanks to the many metal layers in an FPGA, and the blocks are also smaller than pure FPGA implementations.
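The way fixed 4-bit blocks combine for wider operands can be sketched for the unsigned case: an 8x8 multiply assembled from four 4x4 partial products, schoolbook-style in base 16. The real FABs also handle signed operands and arbitrary multiples of 4 bits; this is a minimal sketch only, with hypothetical function names.

```cpp
#include <cstdint>

// Stands in for one 4-bit multiplier block: 4-bit x 4-bit -> 8-bit product.
uint16_t mul4x4(uint8_t a, uint8_t b) {
    return static_cast<uint16_t>((a & 0xF) * (b & 0xF));
}

// 8x8 unsigned multiply from four 4x4 partial products:
// a*b = (aH*bH << 8) + ((aH*bL + aL*bH) << 4) + aL*bL, digits in base 16.
uint16_t mul8x8(uint8_t a, uint8_t b) {
    uint8_t a_lo = a & 0xF, a_hi = a >> 4;
    uint8_t b_lo = b & 0xF, b_hi = b >> 4;
    return static_cast<uint16_t>(
        (mul4x4(a_hi, b_hi) << 8) +
        ((mul4x4(a_hi, b_lo) + mul4x4(a_lo, b_hi)) << 4) +
        mul4x4(a_lo, b_lo));
}
```

The same recombination extends to 4n-by-4m operands with n*m partial products, which is the composition the FAB array performs in hardware.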

A modification was proposed later by Haynes, Ferrari and Cheung[24], with a design based on the radix-4 overlapped multiple-bit scanning algorithm, which is more speed- and area-efficient. The MFAB (Modified FAB) multiplies two numbers of length 8, or less with redundancy; the length must be greater than 7 to make a space saving over the FAB. The blocks are 1/30th the size of the equivalent pure FPGA implementation and need only 40% usage to make them a worthwhile asset.


Function Evaluation: A more specific block is one for function evaluation, such as that proposed by Sidahao, Constantinides and Cheung[25]. Previously a Lookup Table (LUT) approach was used; their architecture provides a lower-area solution at the cost of execution speed.

Memory: In video applications the storage of frames of data is important, so it is useful to be able to store this data in memory efficiently. Embedded dual-port RAMs, currently available in devices such as the Xilinx Virtex-II Pro family, enable two concurrent accesses. This technology is likely to progress further, perhaps to an Autonomous Memory Block (AMB), proposed by Melis, Cheung and Luk[26], which can generate its own memory addresses.

2.4 Debugging tools / coding

The testing of a hardware module can be split into two phases: pre-load and post-load. A downside of FPGAs compared with ASICs lies in pre-load testing, specifically back-annotated compared with initial tests: in ASIC design only wiring capacitance is missing from pre-synthesis tests, whereas in FPGA design module placement is decided at synthesis, drastically affecting timing.

Pre-load: The most widely known pre-load test environments are ModelSim (Xilinx) and Quartus (Altera). COMPASS (Avant!) is an automated design tool, creating a level of abstraction for the user. The benefits are highlighted by Singh and Bellec in 1994[8]: the user can enter a design as a state machine or dataflow and therefore implement at the system level rather than a lower (e.g. VHDL) one.

Post-load: Post-load testing is currently approached by using part of the FPGA for a debugging environment, invoked during on-board test. A previously popular strategy was "bed of nails", where pins are connected directly to the chip and a logic analyser; with the large pin counts of today's devices this is impractical, and even if possible it would significantly alter the timing. This was followed by boundary scanning via JTAG (Joint Test Action Group); however, this only probes external signals. Better still is Xilinx ChipScope: an embedded black box which resides inside the FPGA as a probe unit. The downside is that it uses the slow JTAG interface to communicate readings.

An example of an on-chip debugging environment which uses the faster interface (the PCI bus) is the SONICmole[27], used with UltraSonic[14]. It takes up only 4% of a Virtex XCV1000 chip (512 slices). Its function is to act as a logic analyser,


viewing and driving signals, whilst being as small as possible and having a good software interface. It uses the PIPE memory to store signal captures. It has been implemented at the UltraSonic maximum frequency of 66 MHz[27] and is portable to other reconfigurable systems.

Coding: FPGAs can be programmed through well-known languages such as VHDL and Verilog at the lower level, and MATLAB (System Generator) and, more recently, SystemC (see systemc.org) and Handel-C at the higher level. The focus of this sub-section is programming the GPU, as FPGA coding is widely understood.

Cg Language: Cg[28] was developed by Nvidia to let developers program GPUs in a C-like manner. The features of C beneficial for an equivalent GPU programming tool are performance, portability, generality and user control over machine-level operations. The main difference from C is the stream-processing model used for parallelism in GPUs.

Cg supports high-level programming, but is linkable with assembly code for optimised units, giving the programmer more control. Cg supports user-defined compound types (e.g. arrays and structures), which are useful for non-graphics applications. It also provides vectors of floating-point numbers up to size 4 (e.g. RGBA), along with matrices up to size 4x4 (for operations on the vectors). A downside is that Cg doesn't support pointers or recursive calls (as there is no stack structure); pointers may be implemented at a later date.
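The 4-vector type is central to this model. A minimal C++ emulation of the component-wise operations Cg provides natively might look as follows; this struct is a sketch for illustration, not Nvidia's definition of float4.

```cpp
// Emulation of Cg's 4-component vector type: component-wise arithmetic on
// RGBA-style vectors, the basic currency of fragment programs.
struct float4 {
    float x, y, z, w;
};

float4 operator+(const float4& a, const float4& b) {
    return {a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w};
}

float4 operator*(float s, const float4& v) {   // scalar scaling
    return {s * v.x, s * v.y, s * v.z, s * v.w};
}
// e.g. blending two RGBA colours: 0.5f * (c1 + c2)
```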

Nvidia separates the programming of the two GPU processors (vertex and fragment) to avoid branching and loop problems, so they are accessed independently. The downside is that optimisations across this boundary aren't possible; a solution is a meta-programming system which merges the boundary. Nvidia introduce the concept of profiles to handle differences between generations of GPUs: each GPU era has a profile of what it is capable of implementing, and there is also a profile level common to all GPUs, necessary for portable code.

In developing the PlayStation 2, Sony supported a full C implementation combining the on-chip GPUs with off-chip resources; this shows the trend towards more user programmability. When developing Cg, Nvidia worked closely with other companies (such as Microsoft) who were developing similar tools. An aim of Cg was to support non-shading uses of the GPU, which is of particular interest here. (Fernando and Kilgard[29] provide a tutorial on using Cg to program graphics hardware.) For


the non-programmable parts of a GPU, CgFX[28] handles the configuration settings and parameters.

2.5 Literature Survey Conclusions

In summary, some current architectural uses of GPUs and FPGAs have been considered, including an FFT routine on the GPU and some graphics routines on an FPGA. The Sonic architecture was examined, particularly its use as a hardware accelerator for graphics applications. This was followed by interconnect structures, looking at buses, switches and networks, and specifically their advantages and disadvantages. The internal structure of an FPGA was then considered, investigating embedded components that could be useful in video applications, such as multipliers, memory and function evaluators. Finally, tools used before and after device function load, and in device programming, were analysed, specifically the Cg language.

3 Research Questions

The interconnect between cores in a design is a common bottleneck. It is important to have a good model of the interconnect, in order to eliminate or reduce this delay. Many architectures have been proposed or developed for module interconnects (groupable as bus, switch and network), as discussed in the literature survey. This leads to the first research question: to investigate suitable interconnect architectures for mixed-core hardware blocks, and to find adequate ways to model interconnect behaviour. A model is important for deciding the best interconnect for a task without the need for full implementation.

The potential of graphics hardware has long been exploited in the gaming industry, focusing on its high pixel throughput and fast processing. It has been shown to be particularly efficient where there is no inter-dependence between pixels. Programming this hardware was historically difficult: one could use assembly-level language, which takes a long time to prototype, or an API such as OpenGL, which limits the programmer to a set of functions. In 2003 Nvidia produced the Cg language, allowing high-level programming without losing the control of assembly-level coding. Following this, non-graphical applications were explored, for example Moreland and Angel's FFT algorithm[3].

The adaptability of graphics hardware to non-standard tasks leads to the second research question: to further investigate graphics hardware used in a mixed-core architecture. This takes advantage of the price-performance ratio of graphics hardware, whilst maintaining the current benefits of FPGA / processor cores. FPGA cores allow for high levels of parallelism


and flexibility, as many designs can be implemented on the same hardware. Processors can be optimised for certain types of instructions and run many permutations of them without the costly reprogramming associated with FPGAs.

When resizing an image there are two possibilities for determining the new pixel values: filtering or interpolation. Filtering could use a FIR (Finite Impulse Response) low-pass filter, with complexity varying in the number of taps. Interpolation could be a bi-linear, bi-cubic or spline method, each of varying complexity. The final research question is: to investigate the perceived quality versus computational complexity of the two methods. Theory suggests FIR filtering, of a long enough tap length, should produce a smoother result; this may not, however, be perceptually the best, and could be too computationally complex.
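Of the interpolation options, bi-linear is the simplest: the new pixel value is a weighted average of the four surrounding source pixels, costing a handful of multiply-adds per output pixel (versus one multiply-add per tap per pixel for FIR filtering). A sketch, assuming the sample point lies strictly inside the image:

```cpp
#include <vector>
#include <cmath>

// Bi-linear interpolation at a fractional position (x, y).
// Assumes 0 <= x < width-1 and 0 <= y < height-1 (no border handling).
double bilinear(const std::vector<std::vector<double>>& img, double x, double y) {
    int x0 = static_cast<int>(std::floor(x));
    int y0 = static_cast<int>(std::floor(y));
    double fx = x - x0, fy = y - y0;   // fractional position within the cell
    // Weighted average of the four surrounding pixels.
    return (1 - fx) * (1 - fy) * img[y0][x0]
         +      fx  * (1 - fy) * img[y0][x0 + 1]
         + (1 - fx) *      fy  * img[y0 + 1][x0]
         +      fx  *      fy  * img[y0 + 1][x0 + 1];
}
```

The 4-tap cost per output pixel is the baseline against which longer FIR filters and bi-cubic or spline methods would be compared in the quality/complexity study.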

4 Interconnect Model

My first task was to implement a high-level model of the ARM AMBA bus, which would model its performance for varying numbers of masters and slaves and be cycle accurate. SystemC, a relatively new hardware modelling library, was used for this. The motivation came from a paper by Vermeulen and Catthoor[13], where an ARM7 processor was used, in addition to custom hardware, to allow for up to 10% post-manufacture functional modification.

    A multiply function, for a communicating processor and memory, was modelled: two values to be multiplied are loaded in consecutive cycles, multiplied, then returned to memory using an interconnect. This consists of data plus control signals, as a simple bus model. This demonstrates how to display and debug the results of a hardware model. SystemC is used to create a VCD (Value Change Dump) file, which can be displayed in a waveform viewer such as ModelSim. The results are shown in Figure 1.

    Figure 1. Waveform for multiplier implementation
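    The SystemC source is not reproduced here, but the cycle behaviour of the modelled transaction can be sketched in plain C++ as follows. The names and the one-cycle-per-access timing are illustrative assumptions, chosen only to show the shape of the transfer seen in the waveform.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Cycle-level sketch of the modelled transaction: two operands are read from
// memory on consecutive cycles, multiplied, and the product written back.
struct SimpleBus {
    std::vector<uint32_t>& mem;   // shared memory behind the interconnect
    int cycles = 0;               // cycle counter, as a VCD trace would show

    uint32_t read(std::size_t addr)            { ++cycles; return mem[addr]; }
    void write(std::size_t addr, uint32_t v)   { ++cycles; mem[addr] = v; }
};

// Returns the number of cycles the transfer occupied on the simple bus.
int multiply_transaction(SimpleBus& bus, std::size_t a, std::size_t b,
                         std::size_t dst) {
    uint32_t x = bus.read(a);     // cycle 1: load first operand
    uint32_t y = bus.read(b);     // cycle 2: load second operand
    ++bus.cycles;                 // cycle 3: multiply in the processor
    bus.write(dst, x * y);        // cycle 4: return result to memory
    return bus.cycles;
}
```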

    Figure 3. Test output showing reset and bus request / grant procedure

    A number of meetings were held with Ray Cheung from Computing (currently modelling processors) to discuss possible interoperability between an AMBA bus model and a processor model. A fully flexible bus and processor model was suggested, which could later be extended to include other hardware blocks such as FPGAs.

    Following this, my attention turned to the design of such a bus model. A physical interpretation of how the AMBA AHB bus blocks fit together can be seen in Figure 2. Missing from Figure 2 are the global clock and reset signals, which are routed to each block. HWDATA and HRDATA carry write and read data respectively, and the H prefix denotes the AHB bus as opposed to the ASB. The control signals are requests from masters and split (resume transfer) signals from slaves. The complexity in coding the multiplexer blocks lay in making them general: constants were used, in place of literal numbers, for data and address signal widths throughout. The master multiplexer uses a delayed master-select signal from the arbiter to pipeline the address and data buses, so that one master can use the data bus whilst another controls the address bus.

    For the decoder, an assumption was made about how a slave is chosen. The number of address bits used to select a slave is calculated as log2(number of slaves), rounded up, and these bits are taken as the MSBs of the address. The literal value of this binary number indicates which slave to use; for example, 01 selects slave 1.
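    The slave-selection assumption can be sketched directly; the function name and the 32-bit address width are illustrative assumptions, not from the model's source.

```cpp
#include <cstdint>

// Decoder sketch: ceil(log2(num_slaves)) address MSBs are interpreted as a
// slave index, so e.g. with 4 slaves, MSBs "11" select slave 3.
int select_slave(uint32_t haddr, int num_slaves, int addr_bits = 32) {
    int sel_bits = 0;                              // ceil(log2(num_slaves))
    while ((1 << sel_bits) < num_slaves) ++sel_bits;
    if (sel_bits == 0) return 0;                   // one slave needs no bits
    return static_cast<int>(haddr >> (addr_bits - sel_bits));
}
```

    Note that when the slave count is not a power of two, some index values decode to no physical slave; a real decoder would return an error response for those addresses.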

    A test procedure was produced which loads stimulus from a text file, with the result viewed as a waveform, as with the multiplier example. The file consists of lines containing either a variable and value pair, or tick followed by a number of cycles to run for. Initially, simple tests were carried out to check for correct reset behaviour and that the two multiplexers worked (with a setup of one master and two slaves). An example test output is shown in Figure 3.
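    The stimulus format described above could be parsed as follows; this is a sketch of the format only, and the function and signal names are illustrative rather than taken from the test harness.

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Stimulus-file sketch: each line is either "<variable> <value>" (drive a
// signal) or "tick <n>" (run the model for n cycles). Returns the total
// cycle count requested and records the final value of each variable.
int run_stimulus(std::istream& in, std::map<std::string, int>& vars) {
    int total_cycles = 0;
    std::string name;
    int value;
    while (in >> name >> value) {
        if (name == "tick")
            total_cycles += value;   // advance the simulation by 'value' cycles
        else
            vars[name] = value;      // drive the named signal
    }
    return total_cycles;
}
```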

    In the example, as the HSEL signals change at the bottom of the waveform, the two read-data signals are multiplexed. On reset, all outputs are set to zero irrespective of the inputs, which is what would be expected. When the master requests the bus, the arbiter waits until HREADY goes high before granting access through HGRANT. In the case of more than one master, the HMASTER signal changes immediately (with HBUSREQ) to the correct master, allowing for multiplexing and letting the slaves know which master is communicating.
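    The grant behaviour observed in Figure 3 can be summarised in a small combinational sketch. This deliberately ignores priority between multiple requesters and the sequential nature of the real arbiter; the names mirror the AHB signals but the function itself is illustrative.

```cpp
// Sketch of the arbitration behaviour seen in the waveform: HGRANT is only
// asserted once HREADY is high, while HMASTER follows the request at once.
struct ArbiterState {
    bool hgrant  = false;
    int  hmaster = 0;
};

ArbiterState arbitrate(bool hbusreq, int requesting_master, bool hready) {
    ArbiterState s;
    s.hmaster = hbusreq ? requesting_master : 0;   // changes with HBUSREQ
    s.hgrant  = hbusreq && hready;                 // grant waits for HREADY
    return s;
}
```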

    The model was further tested with two masters and two slaves, a common configuration. The correct, cycle-accurate results were seen. Within this, the sending of packets consisting of one and multiple data items was experimented with, along with split transfers and error responses from slaves. The waveforms for these become complicated and large very quickly, but are of a similar form to Figure 3.

    5 Primary Colour Correction

    Primary Colour Correction is a non-graphical application, as is the FFT-on-a-GPU algorithm discussed above; I will now discuss my optimised version of it. The algorithm performs three main transformations per pixel: Input Correction, Histogram Equalisation and Colour Balancing (see Figure 4).

    Input Correction and Colour Balancing require the RGB signal to be converted to HSL (Hue, Saturation and Luminance) space. In my optimisations, I converted only half way, to a chroma representation (YCbCr), and implemented the algorithm at this level, which showed considerable speed-up.
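    The exact coefficients of the conversion used are not given in this report; a common form, the ITU-R BT.601 full-range RGB-to-YCbCr conversion, is sketched below for illustration, with inputs assumed normalised to [0, 1].

```cpp
#include <cmath>

struct YCbCr { double y, cb, cr; };

// Illustrative RGB-to-YCbCr conversion (ITU-R BT.601 full-range weights);
// the report's implementation may use different coefficients.
YCbCr rgb_to_ycbcr(double r, double g, double b) {
    double y  = 0.299 * r + 0.587 * g + 0.114 * b;  // luma
    double cb = (b - y) * 0.564;                    // blue-difference chroma
    double cr = (r - y) * 0.713;                    // red-difference chroma
    return {y, cb, cr};
}
```

    The attraction on a GPU is that the conversion is a single matrix-vector style operation per pixel, whereas a full conversion to HSL involves comparisons and divisions that map poorly onto fragment hardware.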

    Other key optimisations were to perform calculations in vector space and to remove, where possible, conditional statements, which are inefficient on GPUs. The lessons learnt are summarised below:

    [Figure 4 block diagram: the R, G and B inputs are fetched from a 2D texture, scaled to the range [0, 1], and passed through three stages - InputCorrect (black, white, saturation and hue), HistogramCorrect (black level, gamma, white level, output black/white levels and channel selection) and ColorBalance (hue, saturation and luminance shifts, highlight-midtone and midtone-highlight cross, area selection) - producing the corrected outputs R', G' and B'.]

    Figure 4. Primary Colour Correction Block Diagram

    • Perform calculations in vectors & matrices

    • Use in-built functions to replace complex maths & conditional statements

    • Pre-compute uniform inputs, where possible, avoiding repetition for each pixel

    • Consider what is happening at the assembly-code level - decipher the code if necessary

    • Don't convert between colour spaces if not explicitly required
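    The lesson about replacing conditionals with built-in functions can be illustrated in plain C++: a branch such as "if (x > 1) x = 1" is better expressed with clamp/min-style operations, which map onto single GPU instructions. The function names below mirror Cg's clamp and lerp but are illustrative sketches, not the report's shader code.

```cpp
#include <algorithm>

// Branchless clamp to [0, 1], mirroring the Cg clamp() intrinsic.
double clamp01(double x) {
    return std::min(std::max(x, 0.0), 1.0);
}

// Linear blend, mirroring Cg's lerp(): selects between a and b without a
// conditional when t is computed from a comparison result.
double lerp(double a, double b, double t) {
    return a + t * (b - a);
}
```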

    Table 1 shows the performance results for the initial and optimised designs on various generations of GPU. There is a large variation in the throughput rates of the devices, although only 2-3 years separate them. For more information on the optimisation of the primary colour correction algorithm see [30].

    Architecture   Throughput (Final) MP/s   Throughput (Initial) MP/s
    6800 Ultra     116.36                    44.14
    6800 GT        101.82                    38.62
    6600            72.73                    27.59
    5700 Ultra      12.67                     2.12
    5200 Ultra       7.08                     1.24

    Table 1: Performance comparison on GeForce architectures for the optimised (Final) and initial designs

    For efficient optimisation of an algorithm it is important to understand the performance penalty of each section. A detailed breakdown of the delay of the above primary colour correction algorithm was carried out. Among the performance bottlenecks in the implementation were compare and clamping operations; the Colour Balancing function, which includes many of each, was seen to be the slowest of the three main blocks. The conversion between colour spaces carried a large delay penalty, due mainly to the conversion from RGB to XYL space. In Histogram Equalisation, pow also added greatly to the delay, accounting for almost 50% of it (0.00089 s/MP).

    The register usage, although minimal, was larger in calculations than in compare operations. This is due to the large number of min-terms in the calculations, with less intermediate storage required in compares. In this case register usage was not a limiting factor in the implementation; however, it may be for other algorithms. The breakdown of delay for each block is shown below; for more detail see [31].

    Block               Cycles   Rregs   Hregs   Instructions   Throughput (MP/s)   Delay (s/MP)
    Input Correction    16       3       1       35             350.00              0.00286
    Histogram Correct   12       2       1       25             466.67              0.00214
    Colour Balancing    23       3       1       56             243.47              0.00411

    Table 2: Effect on Performance of Each Block of the Primary Colour Correction Algorithm
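    The delay column in Table 2 is simply the reciprocal of the throughput (delay in s/MP = 1 / throughput in MP/s), which gives a quick consistency check on the figures:

```cpp
// Per-megapixel delay from throughput: 350 MP/s corresponds to roughly
// 0.00286 s/MP, matching the Input Correction row of Table 2.
double delay_from_throughput(double mp_per_s) {
    return 1.0 / mp_per_s;
}
```

    Summing the three delays (0.00911 s/MP) predicts an overall rate of about 110 MP/s, close to the 116 MP/s measured for the 6800 Ultra in Table 1.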

    6 Plan of Work Leading to Transfer

    The next step in the modelling of interconnects is to consider a general bus structure; this can also consist of multiple masters and slaves, varying methods of arbitration, clock speeds, shared or individual read/write lines, etcetera. This requires a more abstract implementation, which the SystemC library allows for. Models of crossbar switches and a network-on-chip structure are other possibilities for future work on interconnect modelling.

    The next stage on the question of graphics hardware is to implement the primary colour correction algorithm on a Pentium processor and on an FPGA. An optimised implementation in MATLAB completed the computation, on a Pentium 4, for a 512x512 image in 2.3 seconds. This equates to 0.1 MP/s, which is much slower than the graphics card. An implementation in C/C++ is expected to perform better, but still to be 1-2 orders of magnitude worse. The FPGA implementation is expected to

    outperform both, if a large enough device is used. When limited to a device of equivalent cost to a graphics card, the FPGA is expected to perform worse than the graphics card but better than the CPU.
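    The 0.1 MP/s figure above follows directly from the image size and measured time (a 512x512 image is about 0.26 megapixels, processed in 2.3 seconds); the helper below is illustrative:

```cpp
#include <cmath>

// Throughput in megapixels per second from image dimensions and run time.
// 512 * 512 pixels in 2.3 s gives roughly 0.11 MP/s.
double throughput_mp_per_s(int w, int h, double seconds) {
    return (static_cast<double>(w) * h) / 1e6 / seconds;
}
```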

    A comparison of the visual differences between filtering and interpolation will be performed, along with the computation time required by each. The algorithms will be tried on the graphics hardware and any limitations of the interconnect, either on or off the board, noted. Implementations may also be prototyped on an FPGA device and a Pentium 4 processor for further comparison of computational capabilities. The literature survey will also be updated to include documents relating to interpolation and filtering algorithms, particularly in hardware.

    An updated Gantt chart for my work intentions, up to transfer, can be found in Appendix 1 at the rear of this document. This relates to my above aims.

    7 Conclusion

    A literature survey of work related to my chosen research area has been presented, highlighting possibilities for work in the areas of interconnects and the utilisation of graphics hardware in a mixed-core system. My three main research questions were then explained: investigating interconnects and their modelling; the use of graphics hardware for video processing; and the comparison of FIR filtering and interpolation. The work covered to date on interconnect modelling and the primary colour correction implementation on a graphics card was summarised, followed by a plan of my future work, including a Gantt chart.

    References

    [1] J.N. England, A System for Interactive Modelling of Physical Curved Surface Objects, SIGGRAPH '78, 1978, pp. 336-340

    [2] Chris Trendall and A. James Stewart, General Calculations using Graphics Hardware, with Applications to Interactive Caustics, 2000

    [3] Kenneth Moreland and Edward Angel, The FFT on a GPU, The Eurographics Association, 2003, pp. 112-136

    [4] Michael Macedonia, The GPU Enters Computing's Mainstream, Entertainment Computing, pp. 106-108, 2003

    [5] Jeffrey M. O'Brien, Nvidia, www.wired.com, Issue 10.07, 2002

    [6] Robert Strzodka and Christoph Garbe, Real-Time Motion Estimation and Visualisation on Graphics Cards, University of Duisburg, 2004

    [7] Wayne Luk, P. Andreou, A. Derbyshire, F. Dupont-De-Dinechin, J. Rice, N. Shirazi and D. Siganos, A Reconfigurable Engine for Real-Time Video Processing, Lecture Notes in Computer Science, 1998

    [8] Satnam Singh and Pierre Bellec, Virtual Hardware for Graphics Applications Using FPGAs, FCCM, 1994

    [9] Simon Haynes, John Stone, Peter Cheung and Wayne Luk, Video Image Processing with the Sonic Architecture, Computer, pp. 50-57, 2000

    [10] Wim Melis, Peter Cheung and Wayne Luk, Image Registration of Real-Time Broadcast Video Using the UltraSONIC Reconfigurable Computer, FPL, pp. 1148-1151, 2002

    [11] Simon Haynes, John Stone, Peter Cheung and Wayne Luk, SONIC - A Plug-In Architecture for Video Processing, FPGA, pp. 21-30, 1999

    [12] Pete Sedcole, Peter Cheung, G.A. Constantinides and Wayne Luk, A Reconfigurable Platform for Real-Time Embedded Video Image Processing, FPGA, 2003

    [13] Fredrick Vermeulen and Francky Catthoor, Power-Efficient Flexible Processor Architecture for Embedded Applications, IEEE Transactions on VLSI Systems, Vol. 11, pp. 376-385, 2003

    [14] Simon Haynes, Sonic - A Reconfigurable Image Processing Architecture, Poster, IEEE Symposium on FPGAs for Custom Computing Machines, 1999

    [15] Intel Developer Network for PCI Express Architecture, Why PCI Express Architectures for Graphics, www.express-lane.org, 2004

    [16] AMBA Specification (Rev 2.0), ARM, 1999

    [17] Jiang Xu, Wayne Wolf, Joerg Henkel, Srimat Chakradhar and Tiehan Lv, A Case Study in Networks-on-Chip Design for Embedded Video, Design, Automation and Test in Europe Conference, 2004

    [18] William J. Dally and Brian Towles, Route Packets, Not Wires: On-Chip Interconnection Networks, DAC, 2001

    [19] Axel Jantsch, Networks on Chip, 2002

    [20] Ahmed Hemani, Axel Jantsch, Shashi Kumar, Adam Postula, Johnny Oberg, Mikael Millberg and Dan Lindqvist, Network on Chip: An Architecture for the Billion Transistor Era, Proceedings of the IEEE NorChip Conference, 2000

    [21] Luca Benini and Giovanni De Micheli, Networks on Chips: A New SoC Paradigm, Computer, pp. 70-78, 2002

    [22] Rostislav Dobkin, Israel Cidon, Ran Ginosar, Avinoam Kolodny and Arkadiy Morgenshtein, Fast Asynchronous Bit-Serial Interconnects for Network-on-Chip, 2004

    [23] Simon Haynes and Peter Cheung, A Reconfigurable Multiplier Array for Video Image Processing Tasks, Suitable for Embedding in an FPGA Structure, IEEE Symposium on Field-Programmable Custom Computing Machines, 1998

    [24] Simon Haynes, Antonio Ferrari and Peter Cheung, Flexible Reconfigurable Multiplier Blocks Suitable for Enhancing the Architecture of FPGAs, Proceedings of the Custom Integrated Circuits Conference, 1999

    [25] Nalin Sidahao, George Constantinides and Peter Cheung, Architectures for Function Evaluation on FPGAs, IEEE Symposium on Circuits and Systems, pp. 804-807, 2003

    [26] Wim Melis, Peter Cheung and Wayne Luk, Autonomous Memory Block for Reconfigurable Computing, ISCAS, pp. 581-584, 2004

    [27] T. Wiangtong, C.T. Ewe and P.Y.K. Cheung, SONICmole: A Debugging Environment for the UltraSONIC Reconfigurable Computer, ISCAS, pp. 808-811, 2003

    [28] William R. Mark, R. Stephen Glanville, Kurt Akeley and Mark J. Kilgard, Cg: A System for Programming Graphics Hardware in a C-like Language, ACM Transactions on Graphics, pp. 896-907, 2003

    [29] R. Fernando and M.J. Kilgard, The Cg Tutorial: The Definitive Guide to Programmable Real-Time Graphics, Addison-Wesley, 2003

    [30] Ben Cope, Efficient Implementation of Primary Colour Correction on Graphics Hardware, available from the author, 2005

    [31] Ben Cope, Breakdown of Performance for Primary Colour Correction, available from the author, 2005
