The Application of Parallel DSP Architectures to Radar Signal Processing

by J. Tulodd, Senior Engineer

Marconi Radar Systems, Chelmsford

Summary

Increasing requirements for radar system performance have led to the need for improvements in signal processing computational capacity. Coupled with the need for the flexibility and adaptability of programmable solutions, this has led to the development of parallel DSP architectures. Such solutions offer potential benefits for fault tolerant and scalable systems. Marconi Radar Systems has implemented massively parallel signal processing architectures based on the Inmos Transputer and Texas Instruments TMS320C40 devices. The characteristics of each device differ in ways which offer advantages for certain architectural configurations and applications. Examples are given from practical experience. Issues such as processing bandwidth, bus architecture, communications bandwidth and topology, multi-processor support and development environment are considered. Future trends in algorithms and architectures are discussed. The coupling between these two design factors is seen to depend on the development of an automated parallel DSP design environment.

The Radar Problem

The performance requirements of a radar system vary according to many different factors. It is the scale of this variation that demands a flexible approach to system design. The largest computational demand is placed upon military systems. Although the volume of civil air traffic is rapidly increasing, the signal processing algorithms required to detect and track cooperative aircraft are much simpler than those required in defence applications. The technological limitations of a radar system are specifically exploited in military scenarios. Sea skimming missiles deliberately attempt to hide their signal returns amongst the returns from the sea itself. Stealth targets attempt to reduce the effective cross sectional area presented to the radar. Radar jammers attempt to confuse detection algorithms by broadcasting high levels of noise. Alongside these tactical techniques are the natural obstacles within a radar environment. Unwanted returns can be generated from sea, land and rain. These returns are known as 'clutter'. All these factors increase the complexity of the required processing algorithms.

The speed of military targets can exceed several times the speed of sound. This can lead to a maximum allowable processing latency in the order of tens of milliseconds.

The Benefits of Parallelism

Parallel architectures allow the design of a system to accurately reflect the scale of a given problem. A conventional hardware approach can lead to overengineering. Moreover, the reusability of a parallel solution from one application to the next reduces the cost of future hardware alterations. With a large number of identical boards, an economy of scale leads to a reduction in product life cycle cost (such as in manufacturing and test). Parallel solutions also reduce the diversity of component spares required. Inherently, a parallel architecture offers scope for fault tolerance and graceful degradation above that of a more conventional approach by providing alternative routes for data flow. Also, software methodologies can be chosen such that the functionality mapped onto an array of processors is very much independent of the array size. This generalisation can be taken further by providing a layer of abstraction whereby the functionality can be transported onto a number of different physical platforms (so enhancing reusability and transportability).

Hardware Options

DSP devices provide customized architectures in support of classical DSP algorithms such as the Fourier transform and time-domain filtering. This is achieved by the customisation of the arithmetic hardware and instruction set. General purpose processors provide a lower cost solution aimed at a larger market. Categorizing the available products on the market is often difficult and arguably only a marketing issue. Having selected a suitable device, there is then the inter-connection choice between off-the-shelf parallel processing hardware and in-house board design. This choice will largely depend on project time-scales, product availability and the amount of software support provided by the supplier.

Hardware Features

In addition to factors such as the availability of a device and the likely software and hardware support, the most significant hardware issues considered are typically:-

* the numerical accuracy
* the FLOPS or MIPS figure
* the number/speed of communication links
* the amount of on-chip memory
* the amount of possible off-chip memory
* the amount of on-chip cache
* the support for link handling
* the support for process scheduling
* the support for debugging

© 1995 The Institution of Electrical Engineers. Printed and published by the IEE, Savoy Place, London WC2R 0BL, UK.

Two devices that have proved successful in radar applications are the Inmos family of general purpose processors known as the Transputer and the Texas Instruments TMS320C40 DSP device.

Radar System Design

Having identified an application that justifies a parallel hardware solution, the overall constraints on the signal processor must be identified. These will partly be defined by the customer and secondly by the nature of the expected radar environment. Typical considerations are:-

* the algorithms required
* the reliability figure (MTBF)
* the numerical accuracy required
* the maximum dynamic range
* the maximum acceptable latency
* the input data rate
* the maximum task rate

These factors determine which hardware features are of most importance. For example, floating point arithmetic is more appropriate than fixed point for Fourier analysis. However, this can be a significantly more expensive option.

Partitioning the Problem

A signal processor is composed of many separable functions such as data distribution, Fourier analysis, plot and clutter thresholding and the packaging and output of results. A strategy is required to map these functions onto a parallel resource. An efficient scheme is required whereby processor idling time is reduced to an absolute minimum.

The two most common partitioning options are data farming and process farming (or a combination of both). Process farming allocates a sub-set of the required functionality to each processor and acts upon the input data in a pipeline fashion. Data farming allocates a sub-set of the input data to each processor whilst applying the complete required functionality.
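The difference between the two partitioning options can be sketched in a few lines of Python. This is only an illustration: the workers are simulated by ordinary loops, the function names (`split_evenly`, `run_chain`, `data_farm`, `process_farm`) are hypothetical, and the stages are assumed to act on each data item independently so that both schemes give the same result.

```python
from typing import Callable, List, Sequence

Stage = Callable[[List[float]], List[float]]


def split_evenly(data: Sequence[float], n_workers: int) -> List[List[float]]:
    """Split data into n_workers contiguous blocks whose sizes differ by at most one."""
    base, extra = divmod(len(data), n_workers)
    blocks, start = [], 0
    for worker in range(n_workers):
        size = base + (1 if worker < extra else 0)
        blocks.append(list(data[start:start + size]))
        start += size
    return blocks


def run_chain(stages: Sequence[Stage], data: Sequence[float]) -> List[float]:
    """Apply every stage in order on a single processor."""
    out = list(data)
    for stage in stages:
        out = stage(out)
    return out


def data_farm(stages: Sequence[Stage], data: Sequence[float], n_workers: int) -> List[float]:
    """Data farming: each simulated worker runs ALL stages on its own block of data."""
    results: List[float] = []
    for block in split_evenly(data, n_workers):
        results.extend(run_chain(stages, block))
    return results


def process_farm(stages: Sequence[Stage], batches: Sequence[Sequence[float]]) -> List[List[float]]:
    """Process farming: each simulated worker owns ONE stage; batches pass down the pipeline."""
    in_flight = [list(batch) for batch in batches]
    for stage in stages:  # one simulated worker per stage
        in_flight = [stage(batch) for batch in in_flight]
    return in_flight
```

In a real system the blocks or stages would run concurrently on separate devices; the sketch only shows how the work is divided.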

Classical radar signal processing algorithms, required to extract signals from noise and other interference, are largely independent of the range at which they are applied. Moreover, they may be adapted to operate over a limited range extent. A good example is range-averaging CFAR (Constant False Alarm Rate) for plot thresholding [1]. Thus, it is often appropriate to use data farming by allocating a subset of range to each device. This is especially true where the processing does not reduce data bandwidth such as in producing a video output.
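As an illustration of why such an algorithm suits data farming, the sketch below implements a simple cell-averaging CFAR in Python and then applies it block by block over range. It is a minimal sketch under stated assumptions, not the algorithm used in any particular radar: the window sizes, the threshold `scale` and the function names `ca_cfar` and `ca_cfar_farmed` are all hypothetical. Each block is extended by a small overlap so that every cell sees the same training cells as it would in the full data set.

```python
from typing import List, Sequence


def ca_cfar(power: Sequence[float], n_train: int = 4, n_guard: int = 1,
            scale: float = 4.0) -> List[int]:
    """Return indices of cells whose power exceeds scale * mean of nearby training cells."""
    detections = []
    n = len(power)
    for i in range(n):
        train = []
        # Training cells lie on both sides of the cell under test, beyond the guard cells.
        for offset in range(n_guard + 1, n_guard + n_train + 1):
            if i - offset >= 0:
                train.append(power[i - offset])
            if i + offset < n:
                train.append(power[i + offset])
        if train and power[i] > scale * sum(train) / len(train):
            detections.append(i)
    return detections


def ca_cfar_farmed(power: Sequence[float], n_blocks: int, n_train: int = 4,
                   n_guard: int = 1, scale: float = 4.0) -> List[int]:
    """Run ca_cfar independently on contiguous range blocks (simulated data farming)."""
    n = len(power)
    if n == 0:
        return []
    halo = n_train + n_guard            # overlap needed on each side of a block
    size = -(-n // n_blocks)            # ceiling division: cells per block
    detections = []
    for start in range(0, n, size):
        stop = min(start + size, n)
        lo, hi = max(0, start - halo), min(n, stop + halo)
        local = ca_cfar(power[lo:hi], n_train, n_guard, scale)
        # Keep only detections that belong to this block, not to its overlap region.
        detections.extend(lo + i for i in local if start <= lo + i < stop)
    return detections
```

With the overlap included, the farmed version returns the same detections as the single-processor version, which is the property that makes range a convenient axis for partitioning.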

There is a practical limitation on the number of processors that can be applied to a given data set. Over partitioning is constrained by two factors.

Firstly, a point is reached whereby over-partitioning of data can lead to unacceptable communication overheads. This is an inherent property of parallel systems. Secondly, algorithms will dictate the minimum number of range cells that must be processed collectively. Ideally then, these range cells should reside on a single processor.
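Both constraints can be illustrated with a toy cost model. The model below is an assumption made purely for illustration: compute time falls as the data is shared among more processors, while a distribution overhead grows linearly with the processor count. The names `block_time` and `best_processor_count` and the cost parameters are hypothetical.

```python
def block_time(n_cells: int, n_procs: int, t_cell: float, t_comm_per_proc: float) -> float:
    """Toy model: per-processor compute time plus an overhead that grows with n_procs."""
    cells_per_proc = -(-n_cells // n_procs)  # ceiling division
    return cells_per_proc * t_cell + n_procs * t_comm_per_proc


def best_processor_count(n_cells: int, min_cells_per_proc: int,
                         t_cell: float, t_comm_per_proc: float) -> int:
    """Pick the processor count with the lowest modelled time.

    The search is capped so that no processor holds fewer than min_cells_per_proc
    range cells, reflecting cells that must be processed collectively.
    """
    max_procs = max(1, n_cells // min_cells_per_proc)
    return min(range(1, max_procs + 1),
               key=lambda p: block_time(n_cells, p, t_cell, t_comm_per_proc))
```

For 1024 range cells with a unit compute cost per cell and an overhead of 4 units per processor, the model is minimised at 16 processors; raising the minimum block size to 128 cells caps the answer at 8.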

System Architecture

Input interfaces must more than accommodate the specified data rate. Distribution and collation of data must be implemented with the minimum possible communication overhead. Secondly, suitable processor inter-connectivity must exist to reflect the algorithmic requirements. The architecture should be modular and expandable with few board types. The total number of devices should account for the likely memory and code space requirements. Obvious bottlenecks in data flow can be identified at this stage.

Performance Prediction

Other than costing and technical risk analysis, the main task that remains is to predict the achievable performance. Inherently, parallel systems are more complex to analyse than conventional designs. This is mainly due to the more complex data flow. This can lead to memory contention, input/output bottlenecks, communication and general interrupt handling overheads.

In an ideal processor environment, the behaviour of a radar processing algorithm can be classified in terms of the dependence of its execution time on the data. A Fourier transform will be deterministic whereas a thresholding task will depend on the number of items being thresholded. Some algorithms not only depend on the number of items to be processed but on the actual values in the data such as in a sorting task. Such variations can be investigated by prototyping code on the target hardware or commonly on a hardware simulator. Also, instruction sets can be analysed together with data books to estimate the worst case number of instruction cycles required.

The concluding performance may prompt more thought on the distribution of functionality in the system. A combination of process farming and data farming may be considered.
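The value dependence of a sorting task, and the conversion of a worst case cycle count into a time, can be sketched as follows. Insertion sort is used only because its comparison count is easy to follow; the function names and the example figures are assumptions, not measurements from either system.

```python
from typing import Sequence


def insertion_sort_comparisons(values: Sequence[float]) -> int:
    """Sort a copy of values by insertion and return the number of comparisons made."""
    data = list(values)
    comparisons = 0
    for i in range(1, len(data)):
        j = i
        while j > 0:
            comparisons += 1
            if data[j - 1] <= data[j]:
                break                      # item already in place
            data[j - 1], data[j] = data[j], data[j - 1]
            j -= 1
    return comparisons


def worst_case_time_s(worst_cycles_per_item: int, n_items: int, clock_hz: float) -> float:
    """Estimate a worst case execution time from a per-item cycle count and a clock rate."""
    return worst_cycles_per_item * n_items / clock_hz
```

Five items already in order need 4 comparisons, whereas the same five items in reverse order need 10, so the execution time depends on the data values and not just on their number. As a purely illustrative figure, 200 cycles per item over 1000 items at 20 MHz gives an estimate of 10 ms.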

Ground Based 3-D Air Surveillance

The first application that will be considered is that of 3-dimensional air surveillance from a land based site. These radars are required in early warning scenarios where long range coverage is required. For each detection, the range, azimuth and elevation are required. Also, general background clutter maps must be created. Such systems use one of two principles to obtain elevation information.


The results from the node are sent to an executive processing module where corresponding data from other beams are used to derive the elevation estimates.

The inter-node architecture is designed to allow alternate routes to be defined to overcome individual node failure. Each node monitors itself and its immediate neighbours. If a fault is detected, it is reported to the executive. A reconfiguration programme then maps out the faulty node so allowing it to be physically replaced whilst the system remains on line. This new configuration employs a previously redundant node to maintain full system performance. The system is also capable of graceful degradation in performance by reconfiguring with reduced range coverage.

The internal node data flow is based upon four worker chains (See Figure 3). Four head workers exchange data directly with the server (S). Each head worker exchanges data with lower level workers to complete the distribution. Each worker chain communicates with the foreman (F). Fixed-point hardware is employed. This is adequate because the surveillance algorithms used do not require a very large dynamic range.
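The reconfiguration step can be sketched as a simple remapping of work from a failed node to a spare. This is a hypothetical illustration only: the node names, the dictionary layout and the `reconfigure` function are assumptions and do not describe the actual TAP reconfiguration programme.

```python
from typing import Dict, List, Tuple


def reconfigure(assignment: Dict[str, str], spares: List[str],
                failed_node: str) -> Tuple[Dict[str, str], List[str]]:
    """Map out failed_node and return (new_assignment, remaining_spares).

    assignment maps each active node name to the portion of coverage it processes.
    If a spare exists it takes over the failed node's portion; otherwise that
    portion is dropped, which models graceful degradation with reduced coverage.
    """
    new_assignment = dict(assignment)
    remaining = [s for s in spares if s != failed_node]
    if failed_node not in new_assignment:
        return new_assignment, remaining   # the failed node was idle or a spare
    portion = new_assignment.pop(failed_node)
    if remaining:
        new_assignment[remaining.pop(0)] = portion
    return new_assignment, remaining
```

The inputs are left unmodified, so a caller could compare the old and new configurations before committing to the change.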

Figure 3: TAP Node Data Flow (not all workers shown; low bandwidth data is passed to the executive)

The amount of data allocated to each processor is fixed as already described. Also, the data rate and task rate are fixed. This very much simplifies system analysis and performance prediction.

The processing latency budget typically means that each node must process its allocation of data in tens of milliseconds. To determine the worst case process time, the worst case execution time per range cell must be determined. In the surveillance radar application, the computation required is reasonably deterministic with algorithms such as integration, moving target indication (MTI) and temporal thresholding [1].

The behaviour of the software is also well defined. This is because the implementation of software using occam on the Transputer is inherently efficient. The philosophy of this approach means that it is largely not necessary or in fact wise to hard-wire occam with lower level machine language. This was very much the case in the development of the TAP system. The result was that very little software optimization was required at a machine level.

All these factors assist the designer in generally matching the processing rate to the required input task rate. When this is not the case, any short term overlap between processing the current data and processing the next task is managed by the Transputer scheduler (i.e. there is the capability to cope with more than one azimuth sector at a given instant in time).

The TAP node has proved to be easily upgradable whilst retaining the same basic architecture. A 32-bit floating point node (based upon the T801 Transputer) has been constructed within the same mechanical frame as that used by the fixed point node. Also, for the fixed point node, pin compatible Transputer upgrades to 25 and 30 MHz are possible.

Multifunction Radar

This category of radar systems is the second application that will be considered. The principle is to combine surveillance capabilities with tracking by suitable control of a narrow beam [2] (See Figure 4). This permits the tracking of multiple targets simultaneously (in contrast with a dedicated mechanically steered tracker). The transmitted waveforms are tailored to the mode of operation. There will be a predictable search pattern in normal surveillance mode. In a threat scenario, the beam management software must redistribute the scanning priority, for example, by giving higher priority to tracking

Figure 4: Multifunction Radar

The algorithms selected can also depend on the beam orientation. This demands rapid real-time algorithm reconfiguration under very short latency constraints (tens of milliseconds). Signal returns from sea clutter are most significant at low elevation angles. Under these conditions it is more appropriate to use coherent processing to provide Doppler discrimination.

One method is to scan the elevation plane for each azimuth position using electronic beam forming. The second method is to permanently create a number of beams in elevation using a single beam forming antenna array. The relative signal strength seen in two neighbouring beams is used to interpolate the elevation angle (See Figure 1).

Figure 1: 3-D Air Surveillance
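The idea of interpolating elevation from two neighbouring beams can be sketched with a simple amplitude-weighted average of the two beam centre angles. This weighting is an assumption chosen for clarity; a practical system would normally use a calibrated relationship based on the actual beam shapes. The function name `interpolate_elevation` is hypothetical.

```python
def interpolate_elevation(elev_a_deg: float, amp_a: float,
                          elev_b_deg: float, amp_b: float) -> float:
    """Estimate target elevation as the amplitude-weighted mean of two beam centres."""
    total = amp_a + amp_b
    if total <= 0:
        raise ValueError("at least one beam must contain a positive return")
    return (elev_a_deg * amp_a + elev_b_deg * amp_b) / total
```

Equal returns in beams centred at 2 and 4 degrees place the target midway at 3 degrees, while a return three times stronger in the lower beam moves the estimate to 2.5 degrees.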

The Inmos Transputer

The Transputer family of general purpose processors offers a considerable number of building blocks for use in the design of a radar signal processor. The most important hardware features are listed below (using the specific case of a T225 running at 20 MHz):-

* 20 MIPS (peak)
* 4 Kbytes on-chip static RAM
* 16 bit fixed point architecture
* 4 x 20 Mbit/sec serial links
* 64 Kbytes addressable off-chip memory
* dedicated link interface
* microcoded scheduler

In addition to a good technical performance, the Transputer offers a purpose built programming environment. Occam is a high level language designed specifically around the Transputer architecture. Many parallel processes can be defined with point to point communication automatically synchronized and unbuffered. The hardware scheduler removes the need for a software kernel and so reduces the associated interrupt overheads. The scheduler, together with a link interface, is able to automatically deschedule an I/O process for the duration of a message transfer, effectively decoupling the CPU from the link transfer.

The Transputer Array Processor

The Marconi Radar Systems Transputer Array Processor (TAP) was designed specifically for multi-beam surveillance radars. Several of these systems have been successfully employed around the world.

A fixed eight beam arrangement is used.

The signal processor is typically composed of ~100 TAP nodes. A TAP node is the lowest level line replaceable unit in the system. Each node employs 50 Transputers giving an overall peak system performance of ~100 GIPS (at 1000 MIPS per node). An array of nodes is configured in such a way that each can communicate directly with other immediately neighbouring nodes (See Figure 2). Each node is composed of three main parts. The server is a data interface module (using a T225 Transputer and an ASIC). This links the node to a high speed interface using the FDDI (Fibre Distributed Data Interface) protocol. Raw data is recognized by the node using tagged data packets. On the arrival of new data, a packet is captured and replaced with processed data from the node. The main core of the node is a collection of child modules also known as the workers (each a 20 MHz T225 Transputer). These each perform exactly the same processing on a sub-set of data. The data processed by each worker is that produced from a single beam (representing a fixed elevation sector) at the current azimuth for a fixed subset of range.
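The peak performance figures quoted for the TAP follow from simple multiplication, as the short check below shows. It assumes the 20 MIPS peak figure of the 20 MHz T225 and round numbers of 50 Transputers per node and 100 nodes; the helper name `peak_mips` is hypothetical.

```python
def peak_mips(n_devices: int, mips_per_device: int) -> int:
    """Aggregate peak MIPS for a group of identical devices (no overheads considered)."""
    return n_devices * mips_per_device


node_mips = peak_mips(50, 20)                 # 50 Transputers at 20 MIPS each
system_gips = peak_mips(100, node_mips) / 1000  # 100 nodes, converted to GIPS
print(node_mips, system_gips)
```

These are peak figures only; the sustained rate depends on the communication and scheduling overheads discussed elsewhere in the paper.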

Figure 2: TAP System Architecture (node array and data interface)

The child modules transform the raw data into processed video suitable for display at an operator's terminal. This data is routed back to the FDDI interface. Secondly, the child modules produce candidate detections and clutter estimates that are routed to a single processor on the node known as the foreman. This device (a T425 Transputer) produces the final confirmed detection list and clutter map for the node.


These waveforms require longer integration periods. In clear search environments (high elevation angles), it is more appropriate to use non-coherent algorithms with shorter integration periods.

These algorithms operate only in the range dimension (no Doppler discrimination) and so are simpler to implement and potentially offer a more efficient use of the search time budget. The search flexibility is partly made possible with phase scanned array antenna technology which has provided the required versatility. Electronic as well as mechanical control is used to steer a narrow beam in elevation and azimuth.

The Texas Instruments TMS320C40 DSP Processor

Abbreviated as the C40, this processor has proved to be a high performance DSP device ideal for parallel processing in radar applications. The most important features of the device are listed below (using the example case of a 50 MHz part).

* 32 bit floating point architecture
* 50 MFLOPS (peak)
* 6 x 20 MBytes/sec serial links
* 8 KBytes on-chip RAM
* 16 GBytes addressable off-chip memory
* 512 Bytes on-chip cache
* dedicated DMA coprocessor
* JTAG debugging support

One of the most notable features of the C40 is its powerful debugging environment provided by the enhanced use of JTAG (Joint Test Action Group) technology together with the PDM (Parallel Debugging Manager) tool. A chain of processors is physically linked by JTAG. They can then be accessed from a global level command line, typically through an OS/2 windows interface. Any sub-set of processors can be halted at any time and examined. They may then be restarted to allow other dependent processors to be examined in the same way.

The C40 DMA (Direct Memory Access) Coprocessor is used to copy data from memory to link and vice versa. This operation is largely independent of the CPU.

The EMPAR Signal Processor

The Marconi Radar Systems EMPAR (European Multifunction Phased Array Radar) signal processor has recently been developed for and delivered to a customer. The system is composed of a number of identical C40 processing modules known as EC40 boards. Each contains nine C40s providing a peak performance of 450 MFLOPS per board. C40s are also used for raw data distribution and control in the EPIC (EMPAR Processor Interface Card) units.

The main system architecture is based upon two identical systems (known as slices) operating side by side (See Figure 5). Digital Pulse Compression (DPC) is applied to the data before the EPICs. The DPC boards are not shown.

Figure 5: EMPAR Signal Processor (EC40 boards)

Plot extraction and display processing as well as task scheduling are performed by a higher level control unit known as the RMC (Radar Management Computer).

Functionally, the system operates at a regional level in the main processing channel (the sum channel) using data farming on small sub-sets of range across a given number of pulse intervals. At this level, a degree of uncertainty exists in the extracted data. At a global level, more certainty is added by collating the information from the regional EC40s onto a single processor. This head global processor decides upon further processing, if necessary, and this is communicated back to a regional level. When global processing has no further need for regional processing, it releases it to allow the next batch of regional data to be processed. At the same time, global processing performs further thresholding tasks on the current data. On completion of this task, results are sent to difference processing where azimuth and elevation angles are computed. Thus, the EMPAR system is a combination of data farming and process farming with a multi-stage pipeline architecture.

The number of boards in each slice is sufficient to process the largest expected data set without exceeding the regional processing latency budget (tens of milliseconds). Secondly, the combination of two slices and several boards per slice is such that throughput can be maintained for any possible sequence of tasks. The multi-slice architecture can be tailored to deal with the expected load variation of any application. The number of slices in the system and the number of processing boards within each slice can easily be modified. The number of boards within each slice need not be the same. In this case, the slice selection logic could not only select the least loaded slice but also choose the slice that is most suitable for the size of the next data set.

Within each regional EC40, processed data is farmed onto and off the board via a head regional worker in a tree based structure (See Figure 6). The overall system tree is headed by the head global worker with a top level branch to each regional EC40. This structure is duplicated for each slice in the system.
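The sizing rule for a slice can be sketched as a small calculation. The version below assumes ideal data farming with no communication overhead, works in whole microseconds to avoid rounding surprises, and uses hypothetical parameter names and example figures; a real budget would need margin for data distribution and global processing.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def boards_needed(max_cells: int, worst_us_per_cell: int,
                  latency_budget_us: int, workers_per_board: int) -> int:
    """Boards required so the largest data set fits within the latency budget.

    Assumes the work divides perfectly among workers (an optimistic assumption).
    """
    total_work_us = max_cells * worst_us_per_cell
    workers = ceil_div(total_work_us, latency_budget_us)
    return ceil_div(workers, workers_per_board)
```

As an illustrative case, 40000 range cells at a worst case of 20 microseconds per cell represent 0.8 s of work; a 20 ms budget then needs 40 workers, or 5 boards at 8 workers per board.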

Figure 6: Regional EC40 Data Flow

The system is implemented using a mixture of C programming and C40 assembly coding. The former is used for high level control code. The latter is used for time critical operations such as data communications, sorting tasks and FFTs (Fast Fourier Transforms). Thus, maximum use is made of the C40 facilities such as the parallel instruction set, circular addressing, delayed branching and the on-chip RAM. At the same time, the high level code provides the required readability at a system level (simplifying the task of high level debugging).

The dynamic behaviour of the system is dependent on the exact sequence of tasks (with varying size data sets) given to the processor and also on the data content within each task. Process times, on the same amount of data, may vary two-fold or more between a clear data set and one containing many candidate detections. To manage these variables a queue management system is required.

A task allocation programme monitors the loading on each slice by logging tasks onto the system and logging them off. In this way, the least loaded slice can be given the next task that arrives. This degree of task management arises from the expected variation in the input task rate. Although the incoming data rate is constant, the amount of data to be processed cannot be predicted by the processor. On the regional EC40s, a server processor gathers new data for the board from an EPIC. This is farmed out to four worker processors where it is placed on a fixed size queue. Data are released from the queue when all eight workers are ready for the next task. Many tasks may be queued but only a single task is processed at any one time. DMA is used to manage the raw data transfers with minimal impact on CPU time.

As already described, a feedback loop exists whereby the data distribution control software has an estimate of the current task queue depth.

With this information, not only can the optimum slice be selected for the next task, but also any indication of potential queue saturation can be detected and acted upon. The regional queue size is determined by estimating the worst case task arrival sequence. This is possible because although the amount of data to be processed is unknown, the minimum and maximum amounts are well defined. The exact optimum queue size is determined by analysis and simulation.

The EMPAR signal processor has the capability to provide fault tolerance and graceful degradation. Redundant boards could stand by within each slice. Simple reconfiguration of the software could map the boards into and out of the system. Also, if the application demands, graceful degradation could be achieved by operating with only one slice or by reducing the range coverage within either or both slices. In the latter case, the slice selection logic could selectively degrade the range coverage for only the least critical tasks.

As with the TAP node, the EC40 board has proved to be easily upgradable with no changes to the board or system architecture. Early development of the system operated at 40 MHz. The current implementation operates at 50 MHz. A 60 MHz upgrade is also possible (the system can in fact operate with a mixture of clock rates).
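The logging-on and logging-off scheme can be sketched as a small bookkeeping class. This is a hypothetical illustration: the class name `SliceLoadMonitor`, its methods and the use of data set size as the measure of load are assumptions, not details of the EMPAR software.

```python
from typing import Dict, List, Tuple


class SliceLoadMonitor:
    """Track outstanding work per slice and give each new task to the least loaded slice."""

    def __init__(self, n_slices: int) -> None:
        self.outstanding: List[int] = [0] * n_slices   # work logged on but not yet logged off
        self._where: Dict[str, Tuple[int, int]] = {}    # task id -> (slice index, size)

    def log_on(self, task_id: str, size: int) -> int:
        """Allocate a task to the least loaded slice and return that slice's index."""
        slice_id = min(range(len(self.outstanding)), key=self.outstanding.__getitem__)
        self.outstanding[slice_id] += size
        self._where[task_id] = (slice_id, size)
        return slice_id

    def log_off(self, task_id: str) -> None:
        """Record completion of a task so that its slice's load estimate falls."""
        slice_id, size = self._where.pop(task_id)
        self.outstanding[slice_id] -= size
```

Ties are resolved in favour of the lowest numbered slice. A fuller version could also weigh the number of boards in each slice when slices are of unequal size.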
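Queue sizing by simulation can be sketched with a single-server model in which only one task is processed at a time. The function below is a minimal sketch under stated assumptions: arrival and service times are supplied in integer time units (for example milliseconds), tasks are served in arrival order, and the name `max_queue_depth` is hypothetical.

```python
from typing import List, Sequence


def max_queue_depth(arrival_times: Sequence[int], service_times: Sequence[int]) -> int:
    """Largest number of tasks waiting (not in service) seen at any arrival instant.

    arrival_times must be in non-decreasing order. The queue can only grow when
    a task arrives, so checking the depth at each arrival is sufficient.
    """
    finish_times: List[int] = []
    server_free_at = 0
    deepest = 0
    for arrive, service in zip(arrival_times, service_times):
        start = max(arrive, server_free_at)     # wait if the processor is busy
        server_free_at = start + service
        finish_times.append(server_free_at)
        in_system = sum(1 for finish in finish_times if finish > arrive)
        deepest = max(deepest, in_system - 1)   # one task in the system is being served
    return deepest
```

Feeding the function the shortest expected inter-arrival times together with the longest expected process times gives an estimate of the queue size needed to avoid saturation.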

Concluding remarks

Two very different applications have been described. It is not intended that the reader should simply choose between them. The important issue is the versatility of parallel design. Both systems have demonstrated the benefits of this approach. Both are evolving systems with scope for scalability and upgradability. They achieve their objectives in very different but appropriate ways. They both provide a cost effective high performance machine that is not constrained to any specific requirement. Each supports its own category of requirements, namely surveillance and multifunction radar.

Future Algorithms

Two areas of continuing research are neural networks and chaos mathematics. Both are of interest in the unpredictable world of radar data. As higher resolution radars are developed, more information may be extracted from the environment. The use of parallel architectures is a natural choice for neural mathematics. One or more neurons may be physically allocated to a single processor.

Future Architectures

Optical back-planes for parallel systems have already been developed [3]. Potentially, they provide more choice of inter-connection topologies but with only a small communications overhead.

One principle is to use a holographic plate to define a board level inter-connection.

Silicon technology within a single device is approaching the physical limits of the silicon wall. This has led to a trend towards multi-processor fabrication on a single piece of silicon.

Cross-Bar switching and shared memory technology are continually improving. Hardware is becoming more reprogrammable with gate array technology. All these factors add to the efficiency with which parallel systems can be implemented.

Automation of Design

The ideal design environment is one where the software developer need not have any knowledge of the parallel machine architecture. Also, the machine architecture need not make any assumptions about the application. The physical inter-processor connectivity can be defined by the application. This connectivity and the functional mapping of a given algorithm onto a parallel resource should be an automated process. To apply this level of automation to a large collection of dependent algorithms, equating to a radar signal processor, is a very difficult task. The coupling between algorithms and architectures in this context is the subject of much research. Some tools are already emerging. The Ptolemy development tool (University of California) allows a graphical interpretation of a system to be allocated basic functionality with automatic code generation. Gate array technology has been applied in an attempt to automate the design of an occam based parallel system. Problem-specific hardware can be realised entirely by a software process [4]. Another field of research is in parallel programming languages and compilers. The objective is to provide complete abstraction between the required functionality and the physical solution [5].

A more certain development is the increased use of parallel systems across the radar product range. More use in commercial radar applications will maximize the returns from this relatively new technology.

Acknowledgements

The author wishes to thank all members of Marconi Radar Systems who have provided comment on this paper.

References

[1] SKOLNIK M.I., Introduction to Radar Systems, 2nd Edition, McGraw-Hill, 1981.

[2] INGIB R.J., THOMAS A.S., Signal Processing for Multifunction Radars, GEC Review, Vol. 10, No. 1, 1995.

[3] FELDMAN M., Holographic Optical Interconnects for Multichip Modules, Electronic Engineering, September 1992.

[4] PAGE I., LUK W., Compiling occam into FPGAs, in FPGAs, eds. MOORE W., LUK W., 271-283, Abingdon EE&CS Books, 1991.

[5] BISSELING R.H., McCOLL W.F., Scientific Computing on Bulk Synchronous Parallel Architectures, Technical Report 836, Department of Mathematics, University of Utrecht, December 1994.
