
Advanced U.S. Department of Defense (DoD) surveillance sensors are continually pushing the need for ever-increasing electronics performance. Full-time, large-area ground surveillance capability, for instance, has been proven to be of great importance in recent conflicts; it drives the requirement for the signal processing of growing numbers of radar channels with increasing bandwidth. At the same time, the surveillance platforms carrying these sensors, typified by unmanned aerial vehicles (UAVs) or other small aircraft or spacecraft, are being pushed to be smaller, lighter, cheaper, and to have longer mission durations.

On the Defense Advanced Research Projects Agency (DARPA)-sponsored Mission-Specific Processing (MSP) project, the team of Northrop Grumman Electronic Systems, Purdue University, University of Illinois at Urbana-Champaign, Morpho Technologies, and Atmel Corporation developed and demonstrated low-power, high-performance application-specific integrated circuit (ASIC) cells and computation techniques to provide 10× throughput performance per watt improvements to meet next-generation DoD sensor-processing applications. Northrop Grumman has defined a set of cells that form the core of the digital signal processing (DSP) functions needed for emerging military radio-frequency (RF) applications. The University of Illinois has developed a fluid intellectual property (IP) core generator that produces fully laid out optimized DSP function cells using transistor and voltage scaling to maximize ASIC performance per watt. Purdue University developed low-computation complexity algorithms and advanced-circuit techniques to achieve additional improvements in operational performance and input/output (I/O) throughput per watt. Morpho Technologies analyzed application of these MSP techniques to their reconfigurable core cell, which provides on-the-fly reconfiguration with reduced power and improved performance. MSP cells were integrated into the design process of a commercial ASIC foundry (Atmel Corporation) standard cell library, allowing designers to achieve ASIC performance levels normally associated with full custom designs, yet using a standard-cell design flow to reduce the design cost by a factor of ten. Northrop Grumman defined a wideband adaptive processing ASIC chip set incorporating these cells and projected their performance in a multichannel real-time surveillance radar processor, exceeding the 10× improvement goals. An overview of the project is shown in Figure 1.

Mission-Specific Processing Techniques Enable Low-Power, High-Performance Surveillance Sensors

Michael Lucas, Naresh Shanbhag, Kaushik Roy, Fadi Kurdahi, and John Fagan

IEEE CIRCUITS & DEVICES MAGAZINE, JULY/AUGUST 2004. 8755-3996/04/$20.00 ©2004 IEEE

KEY ASIC METRICS FOR DOD SENSORS

Emerging DoD sensors need more sensor channels, wider bandwidth, and adaptive cancellation of interference, jamming, and clutter. Projections of sensor signal processing throughput needs are on the order of tens of trillions of operations per second (TO/s). Many of the sensor platforms have severe payload constraints on the size, weight, and power available for this processing. The use of commercial off-the-shelf (COTS) programmable processing falls several orders of magnitude short in providing the required throughput in the available physical envelope. The vast majority of the processing throughput required in these advanced sensors is in dedicated functions that are specific to the sensor, so optimized mission-specific ASICs are an effective solution to meet the processing requirements within the platform constraints.

A key metric on the project is the ASIC performance in terms of operations per second per watt (O/s/W). This parameter is critical to military systems, as they frequently must operate over wide temperature ranges with severely constrained cooling abilities, as compared to commercial computing hardware placed in air-conditioned rooms. Also, since the bulk of the targeted functions are closely coupled to high data rate sensors, low-power, high-bandwidth, chip-to-chip I/O measured in gigabits per second per watt (Gb/s/W) is also a key metric.

SENSOR REQUIREMENTS ANALYSIS

MSP ASICs are targeted for the mission-specific application domain, which is typically in the front end of the signal processing sequence. These functions are usually tightly coupled to the sensor configuration and require very high arithmetic throughputs and data rates. Northrop Grumman performed system analysis and processing requirements projections for emerging DoD RF sensors to identify which functions provided the highest payback if implemented as optimized MSP cells. The resulting cell set ranged from very large functions, such as finite-impulse response (FIR) filters and fast Fourier transforms (FFTs), down to low-level functions, such as improved flip flops that support multivoltage operation and lower power chip-to-chip I/O. The set of MSP cells addressed on the project consists of the following:

✦ scaleable bit-width core arithmetic cells: multipliers, adders, and multiply/accumulate

✦ large-scale DSP functions: FIR filters, polyphase filters, and FFT core

✦ low-power enabling cells: multivoltage level converting, and dual-edge transition flip flops

✦ chip-to-chip communication: mux/demux I/O cell, low-power gigahertz I/O buffers

✦ high-performance, low-power reconfigurable core.

The MSP project is developing this set of high-performance cells by incorporating optimization techniques that include a fluid IP core generator, reduced-complexity algorithms, advanced I/O circuits, and reconfigurable cores.

Fluid IP Core Generator: Power/Speed Optimized ASIC Cells Without Synthesis

The University of Illinois at Urbana-Champaign developed a fluid IP core generator that provides custom quality layouts in a short (comparable to synthesis) design cycle. The fluid IP core generator accomplishes this by sacrificing the generality of a synthesis methodology in order to obtain significant improvements in power, speed, and area efficiencies. This is achieved by having the core generator target the design of specific DSP subsystems such as FIR filters, FFTs, polyphase filters, and forward error-correction blocks. The fluid core generator employs algorithmic and architectural considerations to optimize the circuit fabric. The resulting layouts are based on finely tuned transistor sizes, hence the term “fluid.” More importantly, the core generator bypasses logic synthesis completely, providing designs with predictable quality. The MSP project initially focused on the design and demonstration of a fluid IP core generator for a complex data FIR filter; extensions to other functions were also performed.

[Figure 1. MSP integrates complementary ASIC design techniques to meet high-performance DoD system throughput and power requirements. The diagram shows the MSP ASIC customization technology (optimized cell techniques: Purdue University; fluid IP core generator: University of Illinois at Urbana-Champaign; low-power, high-speed chip-to-chip I/O: Northrop Grumman/Purdue University; reconfigurable core: Morpho Technologies) producing new MSP cell libraries optimized for high-performance military signal processing. These are combined with the existing standard cell library in the Atmel Corporation commercial IC foundry design process and ASIC fabrication to yield a wideband adaptive processing demonstration chip set (beamformer, pulse compression, adaptive weight computation) with 10× O/s/W and 10× Gb/s/W I/O improvement, for wideband radar and wideband intelligence applications.]

IC design methodologies in common use today can be broadly classified into full custom design, logic synthesis based, and the more recent core-based methodology. Custom designs offer the best energy, delay, and area benefits but require a long and expensive design cycle, making it economically unfeasible for the small production quantities typical of military designs. Synthesis-based methodologies reduce the design cycle time and cost by employing an automated logic synthesis step followed by automatic placement and routing of standard cells but suffer from unpredictable quality and sacrifice area, power, and throughput. Soft cores, usually in the form of VHDL code, are portable from one technology generation to the next but suffer from similar problems as synthesized designs. Hard cores, in the form of completed layouts, are predictable but are not portable across technology generations.

The Illinois fluid IP core generator builds upon a core-based design methodology between these extremes. The core generator encapsulates cross-domain optimizations, such as those that span algorithmic, architectural, and circuit domains, and makes them available to the average designer. For example, a large number of algorithmic and architectural optimization techniques exist for signal processing and communication systems, such as pipelining, strength-reduction, folding/unfolding, parallel processing, and DECOR [1], [2]. These techniques, when combined with circuit-level optimizations, provide significant improvements in power and delay.

Fluid IP Core Generator Architecture

Figure 2 illustrates the structure of the fluid IP core generator. The IP core generator has two major components: a fluid core optimizer and a layout synthesizer. It accepts as inputs power and delay models, a library of template transforms, algorithmic specifications, and power and delay specifications, and generates optimized physical layouts.

Fluid Core Optimizer

The fluid core optimizer begins the design process with a data flow graph (DFG) based on the desired type of filter or DSP function. The DFG represents computational units, such as multipliers and adders, as nodes and the flow of data between the computational units as edges. Architecture-specific information, such as word sizes and delays, is embedded in the graph. Using the computation-efficient delay models developed on the project, the delay of each node in the DFG can be computed. A shortest-path algorithm is then used to determine the critical path of the DFG. This can be compared with the desired throughput specified by the user.

[Figure 2. Fluid IP core generator produces optimized physical layouts from architecture, power, delay, and process specifications. Inputs shown: process specifications, power and delay models, architectural specifications, power and delay specifications, and template transforms. The fluid core optimizer draws on fluid pipelines, fluid macros, and fluid cells; its architecture output, together with design rules, feeds layout synthesis, which produces the physical layout.]

After the application of transform templates, the fluid core optimizer reduces transistor sizes in the fluid cells to minimize power consumption. The fluid cell library consists of one parameterized cell for every logic cell in the library, and each cell can be instantiated at virtually any size desired. This provides a tremendous benefit over traditional standard cell libraries, which consist of cells with a limited number of drive strengths. The fluid cell library also has cells with multiple scale factors, each corresponding to a different group of transistors within the cell. While a buffer has a single scale factor for the entire cell, a full adder has four scale factors that separately affect the p-channel metal-oxide semiconductor (PMOS) and n-channel MOS (NMOS) transistors in the carry-out and sum paths. By carefully choosing the groups of transistors to be scaled, as well as the relative scaling of transistors within groups, a cell with one or two scale factors can be as effective as customized sizing of each transistor. Figure 3 shows a full adder cell instantiated at two different sizes. The scale factor αC controls scaling in the first stage, which generates the carry output, and αS controls scaling in the second stage, which generates the sum output. Thus, αC affects the delay through both the carry and sum paths, while αS affects only the sum delay. It is infeasible to characterize each cell for every possible instantiated size. Therefore, a large part of the effort involved modeling development to quickly and accurately describe the delay and power of the cells in terms of the scaling parameters αC and αS.

The fluid cells are used to reduce the overall power consumption of structures. The existing core generator uses ripple-carry adders and Baugh-Wooley multipliers due to their regularity in layout. These adder and multiplier architectures have many paths shorter than the critical path. The shorter paths can be exploited to reduce power by employing the scaling factors to shrink the transistors in cells off the critical path.
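The critical-path and off-critical-path ideas described for the fluid core optimizer can be illustrated with a small sketch. It is not the project's tool: the node names and integer delays are hypothetical, and the critical path is found as the longest path through the DFG in topological order (equivalent to running a shortest-path algorithm on negated delays). Nodes with positive slack are the candidates whose transistors could be shrunk.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def critical_path(delays, edges):
    """Return (critical_delay, slack_per_node) for a DAG.

    delays: {node: delay}; edges: list of (src, dst) data dependencies.
    """
    preds = {n: set() for n in delays}
    succs = {n: set() for n in delays}
    for src, dst in edges:
        preds[dst].add(src)
        succs[src].add(dst)
    order = list(TopologicalSorter(preds).static_order())

    # Forward pass: earliest time each node's output is ready.
    arrival = {}
    for n in order:
        arrival[n] = delays[n] + max((arrival[p] for p in preds[n]), default=0)
    critical = max(arrival.values())

    # Backward pass: latest finish time that does not stretch the critical path.
    required = {}
    for n in reversed(order):
        required[n] = min((required[s] - delays[s] for s in succs[n]),
                          default=critical)
    slack = {n: required[n] - arrival[n] for n in delays}
    return critical, slack

# Hypothetical fragment of a filter tap: two multipliers feed an adder,
# whose result is added to a registered value.
delays = {"mult1": 4, "mult2": 3, "reg": 1, "add1": 2, "add2": 2}
edges = [("mult1", "add1"), ("mult2", "add1"), ("add1", "add2"), ("reg", "add2")]
critical, slack = critical_path(delays, edges)
downsizing_candidates = sorted(n for n, s in slack.items() if s > 0)
print(critical, downsizing_candidates)
```

In this toy graph the critical path runs through mult1, add1, and add2, so only mult2 and reg have slack to trade for lower power.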

Core Generator Layout Synthesis

The layout synthesizer accepts architectural parameters and transistor scaling factors from the core optimizer to synthesize a layout. As the architectural template is known, the synthesizer starts off with a good initial placement and routing. From the power estimates of the filter and taps, the synthesizer calculates the number of power and ground lines and their dimensions. At the end, the layout synthesizer routes the clock, which is buffered as determined by the amount of load that the buffer is expected to face and routed in the opposite direction to that of the data in order to combat the effect of skew.

Core Generator Filter Test Case Results

A test chip was designed and fabricated with three core-generated filters and a synthesized filter for comparison. All filters were five-tap, complex FIR filters, with 8-b input, 12-b coefficient, and 23-b accumulator precision. The filters were developed for the Taiwan Semiconductor Manufacturing Corporation (TSMC) 0.25-µm, 2.5-V complementary MOS (CMOS) process. Figure 4 shows the layout of the test chip with the three core-generated filters and one synthesized filter.

The first core-generated filter (Filter 1) was a 20-MHz design using a fixed cell size with a scale factor of one, which is comparable to a standard cell with a drive of 1×. The second core-generated filter (Filter 2) is identical, but employs fluid cells. The third core-generated filter (Filter 3) was targeted for 150-MHz throughput with fluid cells. Here, the core generator employed fine-grained pipelining of the taps to meet the throughput requirements. These examples show the ability of the core generator to design high-performance, low-power cores for a wide range of design specifications. The synthesized filter (Filter 4) was targeted for 20-MHz area/power critical operation as a benchmark. The design was synthesized with a 43-cell standard cell library, consisting of logic gates at multiple drive strengths, several flip-flops, and two full adder cells: low-speed, low-power static logic, and high-speed, high-power pass-transistor logic. A VHDL description of the filter was pipelined at the tap level. Synopsys Design Compiler was employed for synthesis and Silicon Ensemble for timing-driven placement and routing.

[Figure 3. Full adder cell instantiated with different scaling values for carry and sum: αC = 1, αS = 1 and αC = 2, αS = 3.]

[Figure 4. Test chip for fluid IP core generator, showing Filter 1, Filter 2, Filter 3, and Filter 4 (synthesis).]

Table 1. Comparison between core-generated and synthesized filters.

| Filter Number | Design Methodology | Speed (MHz) | Power Consumption | Area (mm^2) | Power-Delay Product (pW-s) |
|---|---|---|---|---|---|
| 1 | Core generated slow filter (fixed cells) | 20 | 0.95 mW at 0.98 V | 1.12 | 47.5 |
| 2 | Core generated slow filter (fluid cells) | 20 | 0.90 mW at 0.98 V | 1.11 | 45.0 |
| 3 | Core generated fast filter | 160 | 109.8 mW at 2.5 V | 1.47 | 686.3 |
| 4 | Synthesized slow filter | 20 | 6.47 mW at 1.5 V | 1.59 | 323.5 |

Power and performance results for the FIR filters are shown in Table 1. In all cases, the core generator produced lower power filters with less area than synthesis. These results show that the core-generated slow filters offer a 7× improvement in energy-efficiency over the corresponding synthesized filter, with a 30% reduction in area. In the fast-filter case, the core-generated filter is shown to provide 2.5× improvement over the synthesized filter along with an area reduction of 45%.
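As a quick check on how the power-delay figures relate to the measured values, the sketch below recomputes the power-delay product as power divided by clock rate, using only the numbers listed in Table 1. The variable and function names are illustrative, and the slow-filter comparison assumes Filter 2 (fluid cells) against Filter 4 (synthesized).

```python
# (methodology, speed in MHz, power in mW, area in mm^2), copied from Table 1.
filters = {
    1: ("core generated slow, fixed cells", 20, 0.95, 1.12),
    2: ("core generated slow, fluid cells", 20, 0.90, 1.11),
    3: ("core generated fast", 160, 109.8, 1.47),
    4: ("synthesized slow", 20, 6.47, 1.59),
}

def power_delay_pws(power_mw, speed_mhz):
    """Energy per clock cycle in pW-s (numerically equal to pJ).

    mW / MHz gives nJ, so multiply by 1000 to express the result in pJ.
    """
    return power_mw / speed_mhz * 1000.0

pdp = {num: power_delay_pws(p, f) for num, (_, f, p, _) in filters.items()}

# Slow-filter comparison: synthesized (4) versus core generated with fluid cells (2).
energy_ratio = pdp[4] / pdp[2]
area_reduction = (filters[4][3] - filters[2][3]) / filters[4][3]

for num in sorted(pdp):
    print(f"Filter {num}: {pdp[num]:.1f} pW-s")
print(f"energy ratio {energy_ratio:.1f}x, area reduction {area_reduction:.0%}")
```

The recomputed ratio of roughly 7.2 and the area reduction of roughly 30% agree with the slow-filter figures quoted in the text.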

Reduced Complexity Algorithms: Complex DSP Results With Less Computation

Recasting mathematical functions into different forms can provide the same results as the original form while requiring fewer arithmetic operations. Purdue University has used this approach to develop several methodologies for the design of low-energy consumption, high-performance DSP functions, such as filter banks, FFTs, and discrete cosine transforms (DCTs), using low-complexity design techniques at different levels of design abstraction. Two examples of these complexity reduction techniques, differential coefficients and shared multipliers, were the focus of the MSP project.

Differential-Coefficients Method

The differential coefficients method (DCM) is a novel algorithm-level technique for realization of low-power FIR/IIR filters with a large number of taps (of the order of hundreds) [4]. DCM relies on reducing computations to reduce power. The algorithms for the differential coefficients use various orders of differences between the coefficients in conjunction with stored precomputed results, rather than the coefficients themselves, to compute the canonical form convolution. These algorithms result in fewer computations per convolution as compared to direct form computation. However, they require more storage and storage accesses and, hence, more energy for storage operations. The net energy savings using DCM is dependent on various parameters such as the order of differences used, energy dissipated in a storage access, and the word widths used for the digitized input data and coefficients. DCM can also lead to a reduction in the time needed for computing each convolution and may provide an added advantage of higher speed computation. Analogous to the savings in energy, the speed enhancement obtained is dependent on the order of differences used and various other parameters. DCM does not rely on using a specific binary encoding of the coefficients or input data for the results and is applicable to a wide range of FIR filters.

This technique is useful if, and only if, the differences between coefficients are small compared to the coefficients themselves. In the multiplication for computing a product term using this algorithm, the technique trades a long multiplier for a short one and some additional overhead. If the energy savings in multiplication is greater than the net energy dissipated due to the overheads, there is a net energy savings as compared to using the direct form. The algorithm can be generalized to use with higher order differences (differences of differences) for even higher performance, as shown in Figure 5.
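A minimal sketch of the first-order case may help make the trade concrete. It rests on one identity: the product c[k]·x[j-k] needed for output j equals the stored product c[k-1]·x[j-k] from output j-1 plus the short product (c[k] - c[k-1])·x[j-k]. The function names, coefficients, and samples below are illustrative assumptions, not the Purdue implementation.

```python
def fir_direct(coeffs, samples):
    """Direct-form FIR: y[j] = sum over k of c[k] * x[j-k], with x = 0 before the start."""
    n = len(coeffs)
    return [sum(coeffs[k] * samples[j - k] for k in range(n) if j - k >= 0)
            for j in range(len(samples))]

def fir_dcm_first_order(coeffs, samples):
    """First-order differential-coefficients FIR (illustrative sketch)."""
    n = len(coeffs)
    # deltas[k-1] = c[k] - c[k-1]; small when adjacent coefficients are similar.
    deltas = [coeffs[k] - coeffs[k - 1] for k in range(1, n)]
    prev = [0] * n  # prev[k] holds c[k] * x[(j-1)-k] from the previous output
    out = []
    for j, x_j in enumerate(samples):
        cur = [coeffs[0] * x_j]  # the only full-width multiplication per output
        for k in range(1, n):
            x_old = samples[j - k] if j - k >= 0 else 0
            # Stored product plus a short multiplication by the difference.
            cur.append(prev[k - 1] + deltas[k - 1] * x_old)
        out.append(sum(cur))
        prev = cur  # the extra storage that DCM pays for
    return out

coeffs = [100, 103, 105, 104, 101]       # slowly varying, so differences are small
samples = [3, -1, 4, 1, -5, 9, 2, -6]
y_direct = fir_direct(coeffs, samples)
y_dcm = fir_dcm_first_order(coeffs, samples)
print(y_direct == y_dcm)
```

Here the multiplications by differences involve values no larger than 3 in magnitude, compared with coefficients near 100, at the cost of keeping one stored product per tap.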

Shared Multiplication Technique for Digital Filters

Computational complexity can also be reduced by sharing multiplications for vector-scaling operations, such as FIR filtering. The computation shared multiplication (CSHM) scheme considerably reduces redundant computation by decomposing the vectors in a manner that results in maximal computation sharing, resulting in a faster and potentially low-power implementation [3].

Consider a product of a given vector, representing coefficients, with the scalar x. In this operation, the input x is multiplied by all the coefficients simultaneously. One can carefully select a set of small bit sequences so that the same multiplication result can be obtained by only add and shift operations. For instance, (1011) · x can be decomposed as (0011) · x + 2^3 (0001) · x. If both (0011) · x and x are available, the entire multiplication process is reduced to a few add and shift operations. These chosen basic bit sequences are referred to as alphabets. Also, an alphabet set is a set of alphabets that spans all the coefficients in the vector.

[Figure 5. DCM computation reductions for a FIR filter for different word bitwidths and first and second orders (m) of differential coefficients. Plot of % savings (axis from 0 to 50) for word widths of 16 b, 24 b, 32 b, 48 b, and 64 b, with one series for m = 1 and one for m = 2.]

[Figure 6. CSHM FIR test chip, panels (a) and (b). Labeled blocks: precomputer, select/shift and adder units with flip flops, adders and flip flop, and a single select/shift and adder unit.]

Depending on the selection of the alphabet set, the number of required add and shift operations changes. As the number of coefficients increases, there can be many choices for alphabet sets on the coefficients, and each alphabet set gives rise to a different combination of add and shift operations. Obviously, an alphabet set should cover all the coefficients in the coefficient vector. In addition, there are two other desirable characteristics of a “good” alphabet set. First, the total number of add operations should be minimized. Because the add operations lie on the critical path, reduction of the number of add operations improves overall performance. Second, the number of alphabets in an alphabet set should be minimized [5]. By slightly modifying the coefficients, the number of alphabets can be reduced. In the above example, the coefficient is 01101011. If we change the coefficient to 01101100 by adding 1, the coefficient can be computed as 2^5 (11) + 2^2 (11). In this case, the alphabet set is reduced to {11}. Reducing the number of alphabets gives rise to a lower complexity vector scaler, which leads to high performance as well as low-power design.
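The decompositions quoted above can be reproduced with a short sketch. The greedy, least-significant-bit-first matching used here is an assumption made for illustration and is not claimed to be the alphabet selection procedure of the CSHM work; the function names are hypothetical, and the precomputed products stand in for the hardware precomputer.

```python
def decompose(coeff, alphabets):
    """Split a non-negative coefficient into (alphabet, shift) terms.

    Greedy sketch: at the lowest set bit, use the largest alphabet whose bit
    pattern matches the coefficient bits at that position. Alphabets are
    assumed to be odd, so their least significant bit is 1.
    """
    terms = []
    remaining = coeff
    ordered = sorted(alphabets, reverse=True)
    while remaining:
        shift = (remaining & -remaining).bit_length() - 1  # lowest set bit
        window = remaining >> shift
        for a in ordered:
            mask = (1 << a.bit_length()) - 1
            if (window & mask) == a:
                terms.append((a, shift))
                remaining -= a << shift
                break
        else:
            raise ValueError(f"alphabet set does not cover {coeff:b}")
    return terms

def scale_vector(coeffs, x, alphabets):
    """Multiply x by every coefficient using shared alphabet products."""
    shared = {a: a * x for a in alphabets}  # computed once, reused by all taps
    return [sum(shared[a] << s for a, s in decompose(c, alphabets))
            for c in coeffs]

print(decompose(0b1011, [0b0001, 0b0011]))   # (0011) + 2^3 (0001)
print(decompose(0b01101100, [0b11]))         # 2^2 (11) + 2^5 (11)
print(scale_vector([0b1011, 0b01101100, 0b0110], 13, [0b0001, 0b0011]))
```

With the alphabet set reduced to {11}, the unmodified coefficient 01101011 cannot be covered and the sketch raises an error, while the modified coefficient 01101100 needs only two shifted copies of the same shared product.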

A ten-tap programmable FIR filter was designed and fabricated in TSMC 0.25-µm technology based on the architectural and circuit-level techniques. Separate power supplies for the core and I/O pads allowed exact measurement of the power consumption of the core of the chip. The CSHM layout and test chip are shown in Figure 6.

Table 2 shows the resulting characteristics of the CSHM test chip versus FIR filters using a Wallace tree multiplier (WTM) and a carry save array multiplier (CSAM) for comparison. (WTM and CSAM are the two most widely used multipliers. Generally, WTM has better performance than CSAM due to the tree-like structure of partial-sum adders. However, WTM has the disadvantage of having very irregular interconnect.) As shown in the table and based on the simulation results, the FIR filter using CSHM has a 19 and 43% performance improvement over the FIR filter using WTM and CSAM, respectively. In terms of power consumption, the CSHM scheme has a 17 and 20% improvement with respect to the FIR filter based on WTM and CSAM, providing power-delay product improvements of 1.7 and 2.6×, respectively.

Table 2. Comparison of CSHM-based FIR filter with conventional FIR filters.

| | FIR using CSHM (Measurement) | FIR using WTM (Simulation) | FIR using CSAM (Simulation) |
|---|---|---|---|
| Minimum Cycle Time (ns) | 5.7 (7)* | 7.0 | 10 |
| Power Consumption at 100 MHz (mW) | 238.75 | 344.3 | 357.1 |
| Power-Delay Product (pW-s) | 1,362 | 2,408 | 3,571 |

*Simulation showed 5.7 ns clock cycle, but test instruments were limited to 7 ns.

New I/O Techniques Improve Chip-to-Chip Gb/s/W

DoD radar sensor systems typically have high data input rates and chip-to-chip communications requirements. In many cases the chip I/O accounts for half of the power consumed by the chip. MSP-class ASICs typically have high communication requirements with chip-to-chip data rates of 50–100 Gb/s. Northrop Grumman developed an MSP I/O receiver and driver to provide a greater than 10× improvement in Gb/s/W over the existing low-power standard, low-voltage differential signaling (LVDS). The I/O cell implements a lower swing (100 mV) version of the LVDS standard, as shown in Figure 7, and provides higher speed: up to 4 Gb/s double data rate (DDR).

[Figure 7. High-speed I/O uses a reduced swing for lower power and higher data rate. Signal swing is 400 mV for LVDS and 100 mV for MSP; the voltage axis is marked at 0 V, 0.55 V, 1 V, 1.25 V, and 2.4 V, with the common mode range indicated.]

An I/O test ASIC was designed for 2-GHz operation and contained drivers, receivers, and internal test support structures. The I/O ASIC was tested with a 12-GHz bit error rate tester (BERT) using pseudorandom bit stream (PRBS) data sets. Bit error rate tests were performed at frequencies up to 2.2 GHz, demonstrating better than 10^-12 bit error rates. Figure 8 shows the test structures and eye patterns for the test chip.

The power savings over current LVDS I/O designs was significant. The comparisons were made against the existing LVDS design in 0.18-µm technology with a common supply voltage of 1.8 V. The existing LVDS cell has a maximum speed of 600 MHz, equivalent to a DDR rate of 1.2 Gb/s. The MSP I/O cells operated at over 2 GHz (4 Gb/s DDR). The power consumption for a 2-GHz link was 11 mW and resulted in a 13× improvement in Gb/s/W over existing LVDS. A companion mux/demux by Purdue was also designed and simulated, and simulation data showed operational speed capable of supporting the 4-Gb/s I/O rate.
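The I/O efficiency metric can be worked through from the figures quoted above. The sketch below is arithmetic only: the baseline LVDS efficiency is back-calculated from the stated 13× ratio and is not a measurement reported in the article, and the link counts assume that every link runs at the full 4 Gb/s with no encoding overhead.

```python
import math

msp_rate_gbps = 4.0      # 2-GHz link at double data rate
msp_power_w = 11e-3      # 11 mW per link, as quoted

msp_efficiency = msp_rate_gbps / msp_power_w      # Gb/s per watt
implied_lvds_efficiency = msp_efficiency / 13.0   # back-calculated, assumption

def links_needed(aggregate_gbps, per_link_gbps=msp_rate_gbps):
    """Number of links required to carry an aggregate chip-to-chip rate."""
    return math.ceil(aggregate_gbps / per_link_gbps)

print(f"MSP I/O: {msp_efficiency:.0f} Gb/s/W")
print(f"implied LVDS baseline: {implied_lvds_efficiency:.0f} Gb/s/W")
for rate in (50, 100):
    n = links_needed(rate)
    print(f"{rate} Gb/s needs {n} links, about {n * msp_power_w * 1e3:.0f} mW of link power")
```

Under these assumptions a single MSP link delivers on the order of 360 Gb/s/W, and the 50–100 Gb/s range cited for MSP-class ASICs corresponds to 13 to 25 links.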

Reconfigurable MSP Core Provides Flexible Performance Where Needed

A reconfigurable core cell was also defined as part of the MSP approach. This core was targeted at functions that need higher performance-per-watt than central processing unit (CPU) or field-programmable gate array (FPGA) devices can provide but more flexibility to respond to changes in modes and missions than is feasible to put in completely dedicated ASICs. The Morpho Technologies MS1 reconfigurable DSP (rDSP) architecture was selected for this core cell. Architectural optimizations to accommodate high-performance radar requirements and cell-level optimizations using the MSP core generator and low power cells were analyzed to define the final core design.

[Figure 8. I/O test ASIC with <10^-12 bit error rates and >4 Gb/s operation, panels (a) and (b). Labeled test structures: 1× and 2× driver ring oscillator; divide-by-1024 receiver; 2× driver/receiver with isolated power; Atmel buffer communication; 1× driver/receiver with isolated power; variation of core line lengths.]

MS1 rDSP Architecture

rDSP solutions are provided based on a reconfigurable array-processing paradigm known as the MS1 rDSP. Figure 9 shows the MS1 architecture. As in conventional reconfigurable systems [6], the MS1 Core contains a reconfigurable block, called the RC array, and a 32-b reduced-instruction set computer (RISC) processor, called the mRISC. The RC array consists of reconfigurable cells (RCs) interconnected by a reconfigurable network. Both the functionality of the RCs and the network interconnections are determined by a configuration program, called Context. By writing the appropriate context, the developer can use the RC array to exploit the parallelism available in the application. The mRISC core controller determines the application’s control flow, executes the sequential tasks of the application, and starts transfers to/from the off-core memory. The instructions for the mRISC are stored in the instruction memory. The MS1 Core also contains a context memory, a data buffer, and an I/O Controller.

The RC is the programmable element within the RC array. It performs general-purpose operations as well as word-level and bit-level DSP functions. Input operands can be either internal to the RC, from other RCs, or from the data buffer. Due to the flexible interconnect structure and reconfiguration capabilities, the RC array can operate in several modes whereby groups of cells can perform similar operations. The size and composition of these groups, as well as the mode of operation, can be changed dynamically by switching contexts during runtime. The operation of the RC array is coordinated by the mRISC core controller, which selects and switches contexts at runtime.

[Figure 9. The MS1 core architecture. Blocks shown: memory controllers, Inst Cache, code generator, interleaver engine, DMA controller, DMA engine, mRISC processor, reconfigurable cell array, frame buffer, and context memory.]

Reconfigurable MSP (RMSP) Core Optimization

To develop the optimized RMSP core, several techniques were introduced for performance enhancement of the MS1. Briefly, these techniques fall into two approaches:

✦ optimization of the RC components while maintaining the overall architecture

✦ enhancements that required architectural changes in RCs and other MS1 components.

The original MS1 core was optimized for low-cost commercial applications and had relatively narrow (16 b) data paths and limited register files within the RC array. The DoD radar problem requires higher precision data paths and has large data set sizes requiring more intermediate storage. A number of architectural changes were identified to add these capabilities. Also, two implementation-related MSP enhancements, fluid IP cores from the Illinois core generator and the low-power circuit techniques from Purdue, were evaluated. When implemented in a multicore ASIC, these improvements can provide a 17× improvement in O/s/W over the existing design when applied to radar applications.

MSP CELLS BECOME PART OF AN ASIC FOUNDRY’S LIBRARY

The MSP program also demonstrated a path to commercial silicon production. The approach for application of the optimized MSP cells is to incorporate them into a commercial ASIC foundry standard cell library, so they can be used in a conventional design flow along with the foundry’s existing standard cells in the library. Techniques for automatically generating front-end and back-end design tool libraries were explored, and only a minimal tailoring effort was required for the integration of MSP cell designs into the library. Initial cells were transferred to the commercial foundry team member, Atmel, to demonstrate the approach. This approach is divided into two areas: standard cell components and compiled macrocell functional blocks.

✦ Standard Cell Components: Purdue’s research produced candidate cells for improved power or speed performance at the basic logic functional level. An example of this type of cell is the edge-triggered flip flop that clocks data on both rising and falling edges, thereby reducing the clock distribution power on an integrated circuit. Cells of this type are designed to match the standard cell requirements of Atmel’s 0.18-µm CMOS design kits. Cells are characterized with HSPICE for functional behavior, timing, and power over process, temperature, and voltage corners.

✦ Macrocell Blocks: Macrocell blocks, such as a compiled FFT or FIR filter, are created by compiling a design of leaf cells with variable transistor sizing for power optimization. The leaf cells are created using Atmel’s 0.18-µm CMOS design rules and technology files. Components of this type are created with the core generator software each time they are used. This software has been delivered to the government for use in macrocell block generation.

COMPLETE SYSTEM MODELINGENABLES PRECISE BIT-WIDTH TUNING

The culmination of the MSP ASIC technology development wasthe detailed definition of an MSP chipset and performance projec-tions of a TO/s-throughput radar preprocessor. An end-to-endmodeling methodology of the entire radar processing chain wasdeveloped and demonstrated as part of this ASIC definition pro-cess. Emerging DoD mission requirements were used to drive theprocessor architecture and MSP cell and device requirements.System configurations, modes, and performance parameters werederived from wideband surveillance programs, and the targetedsystem addresses synthetic aperture radar (SAR) and ground mov-ing-target indicator (GMTI) surveillance modes in a dual-bandradar configuration. A MATLAB model of the radar modes wasbuilt and used to derive and optimize the MSP processing archi-tecture and MSP ASICs. The modeling activity consisted of fourmajor parts: a data generator, preprocessor model, SAR post-pro-cessing model, and GMTI post-processing model. The end-to-endmodel with all these elements was integrated as shown inFigure 10.

The model was then exercised using the various data cubes to optimize the system performance. A key use of the model was to determine the required precision and word size needed in the MSP chip set computations. System performance parameters were determined as a function of bit length for different stages of the preprocessing. This allowed the radar performance targets to be met using only the minimum word sizes required, which provides the optimum O/s/W in the chip set.
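The word-size search described above can be sketched as a sweep over candidate bit lengths that stops at the smallest one meeting a performance target. The example below is a simplified stand-in, not the project's MATLAB model: it uses signal-to-quantization-noise ratio (SQNR) on a synthetic tone as the performance metric, whereas the article's model evaluated radar performance parameters on SAR and GMTI data cubes. All function names, the stimulus, and the 60-dB target are assumptions.

```python
import math


def quantize(x, bits):
    """Round x in [-1, 1) to a signed fixed-point value with `bits` total bits."""
    scale = 2 ** (bits - 1)
    q = round(x * scale)
    q = max(-scale, min(scale - 1, q))  # saturate to the representable range
    return q / scale


def sqnr_db(signal, bits):
    """Signal-to-quantization-noise ratio in dB for one word size."""
    sig_power = sum(s * s for s in signal)
    noise_power = sum((s - quantize(s, bits)) ** 2 for s in signal)
    if noise_power == 0:
        return math.inf
    return 10 * math.log10(sig_power / noise_power)


def minimum_word_size(signal, target_db, candidates=range(4, 25)):
    """Return the smallest candidate word size whose SQNR meets the target."""
    for bits in candidates:
        if sqnr_db(signal, bits) >= target_db:
            return bits
    return None  # no candidate was wide enough


# Hypothetical stimulus: a 0.9-amplitude tone standing in for radar data.
tone = [0.9 * math.sin(2 * math.pi * 0.0137 * n) for n in range(4096)]

for bits in (8, 12, 16):
    print(bits, "bits ->", round(sqnr_db(tone, bits), 1), "dB")
print("minimum word size for a 60 dB target:", minimum_word_size(tone, 60.0))
```

Each extra bit buys roughly 6 dB of quantization headroom, so stopping at the first word size that meets the target avoids paying multiplier and register power for precision the system does not need.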

MSP ASIC PROCESSOR PROVIDES REAL-TIME ADAPTIVE BEAMFORMING ON A BOARD

Emerging DoD sensors need more sensor channels, wider bandwidth, and adaptive cancellation of interference, jamming, and clutter.

A three-chip set was defined to perform the wideband adaptive beamforming to support upcoming sensors. Two chips form the core of the beamforming function and use MSP cells from the core generator, reduced-complexity algorithms, and low-power flip-flops and I/O to optimize O/s/W. A third ASIC was defined using the optimized reconfigurable MS1 core implemented with the MSP techniques to obtain a low-power, high-performance, on-the-fly reconfiguration capability. A beamformer module was defined for use with a real-time multichannel wideband radar test bed at Northrop Grumman. The MSP beamformer design has the capability to replace seven state-of-the-art custom ASIC boards with a single board, providing nearly a TO/s throughput at a 10× improvement in O/s/W.
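To make the adaptive beamforming function concrete, the sketch below computes interference-cancelling weights for a two-element array. The article does not state which weight-computation algorithm the chip set uses; the minimum-variance distortionless-response (MVDR) formula here is a common textbook choice used purely for illustration, and the array geometry, angles, and power levels are hypothetical.

```python
import cmath
import math


def steering_vector(angle_deg, n_elements=2, spacing_wavelengths=0.5):
    """Phase progression across a uniform linear array for a plane wave."""
    phase = 2 * math.pi * spacing_wavelengths * math.sin(math.radians(angle_deg))
    return [cmath.exp(1j * phase * k) for k in range(n_elements)]


def covariance(jammer_deg, jammer_power, noise_power):
    """2x2 interference-plus-noise covariance: R = Pj * a a^H + Pn * I."""
    a = steering_vector(jammer_deg)
    r = [[jammer_power * a[i] * a[j].conjugate() for j in range(2)]
         for i in range(2)]
    for i in range(2):
        r[i][i] += noise_power
    return r


def invert_2x2(m):
    """Closed-form inverse of a 2x2 complex matrix."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]


def mvdr_weights(r, look_deg):
    """w = R^-1 s / (s^H R^-1 s): unit gain toward look_deg, minimum output power."""
    s = steering_vector(look_deg)
    r_inv = invert_2x2(r)
    r_inv_s = [sum(r_inv[i][j] * s[j] for j in range(2)) for i in range(2)]
    norm = sum(s[i].conjugate() * r_inv_s[i] for i in range(2))
    return [x / norm for x in r_inv_s]


def response(weights, angle_deg):
    """Magnitude of the beamformer output w^H a for a unit plane wave."""
    a = steering_vector(angle_deg)
    return abs(sum(w.conjugate() * x for w, x in zip(weights, a)))


# Hypothetical scenario: look at broadside, strong jammer at 40 degrees.
r = covariance(jammer_deg=40.0, jammer_power=1000.0, noise_power=1.0)
w = mvdr_weights(r, look_deg=0.0)
print("gain toward target:", round(response(w, 0.0), 3))
print("gain toward jammer:", round(response(w, 40.0), 5))
```

The weights keep unit gain in the look direction while placing a deep null on the jammer. A wideband, many-channel system repeats this kind of covariance solve and weight application across channels and frequency, which is why the operation count reaches the TO/s range cited above.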

ACKNOWLEDGMENTS

This work was supported by Dr. Robert Reuss of DARPA/MTO, under the Mission-Specific Processing Contract NBCHC010038. The authors thank the participating members of the team, most notably Ken Delson, Dave Fry, Rob Kober, Joe Matesic, Steve Shauck, and Jennifer Trotta of Northrop Grumman; Timothy M. Wilson, Jeffrey A. Geib, Eric J. Martina, Ming Zhang, Byonghyo Shim, and Brian Lam of the University of Illinois; Yongtao Wang, Hamid Mahmoodi-Meimand, Hunsoo Choo, Woopyo Jeong, Jongsun Park, Lihyih Chiou, and Mark Johnson of Purdue University; and S. Safavi, M. H. Lee, and A. Niktash of Morpho Technologies.

REFERENCES[1] R. Jain, P.T. Yang, B.Y. Chung, C. Chien, L.K. Tan, and T. Yoshino,

“FIRGEN: A CAD system for automatic layout generation of high-perfor-mance FIR filters,” IEEE Trans. Signal Processing, vol. 39, pp.1655–1668, July 1991.

[2] N.R. Shanbhag, “Algorithms transformation techniques for low-powerwireless VLSI systems design,” Int. J. Wireless Inform. Networks, vol. 5,no. 2, pp. 147–171, 1998.

[3] J. Park, H. Choo, K. Muhammad, S. Choi, Y. Im, and K. Roy “Non-adap-tive and adaptive filter implementation based on sharing multiplication,”in Proc. Int. Conf. Acoustics, Speech, and Signal Processing, 2000(ICASSP 2000), pp. 460–463.

[4] K. Muhammad and K. Roy, “A novel design methodology for high per-formance and low power digital filters,” in Proc. 1999 IEEE/ACM Int.Conf. on Computer-Aided Design, pp. 80–83.

[5] K. Muhammad and K. Roy, “Graph theoretic approach for design andsynthesis of multiplierless FIR filters,” in 1999 IEEE Int. Symp. SystemSynthesis, pp. 94–99.

[6] W.H. Mangione-Smith, B. Hutchings, D. Andrews, A. DeHon, C. Ebel-ing, R. Hartenstein, O. Mencer, J. Morris, K. Palem, V.K. Prasanna,H.A.E. Spaanenburg, “Seeking solutions in configurable computing,”IEEE Computer, vol. 30, pp. 38–43, Dec. 1997.

Michael Lucas is with Northrop Grumman Electronic Systems in Baltimore, Maryland. Naresh Shanbhag is with the University of Illinois at Urbana-Champaign. Kaushik Roy is with Purdue University in Indiana. Fadi Kurdahi is with Morpho Technologies and UC Irvine. John Fagan is with Atmel in Columbia, Maryland.


Figure 10. An end-to-end system modeling approach was used to specify ASICs. [Flow diagram; block labels: Mission Analysis (wideband surveillance requirements, define modes and operations); Define Radar Systems (define radar parameters); Preprocessor Algorithms and Architecture; Radar Data Cubes (6 GMTI cubes, 6 SAR cubes; UHF and X-band; targets, clutter, interference); Weight Computation; Adaptive Beamforming; Pulse Compression; GMTI Post-Processing Algorithm; SAR Post-Processing Algorithm; Ideal Performance MATLAB Model; Bit-Level MATLAB Model; MSP ASICs Partitioning and Specification; Performance Projections; Verification Test Data.]
