8/7/2019 Sanjay(High Speed Dsp Architectures)
1/39
High Performance DSP Architectures
CHAPTER 1
EVOLUTION OF DSP PROCESSORS
INTRODUCTION
Digital Signal Processing is carried out by mathematical operations. Digital Signal Processors are microprocessors specifically designed to handle Digital Signal Processing
tasks. These devices have seen tremendous growth in the last decade, finding use in
everything from cellular telephones to advanced scientific instruments. In fact, hardware
engineers use "DSP" to mean Digital Signal Processor, just as algorithm developers use
"DSP" to mean Digital Signal Processing.
DSP has become a key component in many consumer, communications, medical, and
industrial products. These products use a variety of hardware approaches to implement
DSP, ranging from the use of off-the-shelf microprocessors to field-programmable gate
arrays (FPGAs) to custom integrated circuits (ICs). Programmable DSP processors, a
class of microprocessors optimized for DSP, are a popular solution for several reasons.
In comparison to fixed-function solutions, they have the advantage of potentially being
reprogrammed in the field, allowing product upgrades or fixes. They are often more cost-
effective than custom hardware, particularly for low-volume applications, where the
development cost of ICs may be prohibitive. DSP processors often have an advantage in
terms of speed, cost, and energy efficiency.
DSP ALGORITHMS MOULD DSP ARCHITECTURES
From the outset, DSP processor architectures have been moulded by DSP algorithms. For
nearly every feature found in a DSP processor, there are associated DSP algorithms whose
computation is in some way eased by inclusion of this feature. Therefore, perhaps the best
way to understand the evolution of DSP architectures is to examine typical DSP
algorithms and identify how their computational requirements have influenced the
architectures of DSP processors.
FAST MULTIPLIERS
The FIR filter is mathematically expressed as a dot product between a vector of input data and a vector of filter coefficients. For each tap of the filter, a data sample is multiplied by a filter coefficient, and the result is added to a running sum over all of the taps. Hence, the main
component of the FIR filter algorithm is a dot product: multiply and add, multiply and add.
These operations are not unique to the FIR filter algorithm; in fact, multiplication is one of
the most common operations performed in signal processing. Convolution, IIR filtering, and
Fourier transforms all involve heavy use of multiply-accumulate operations.
Originally, microprocessors implemented multiplications by a series of shift and add
operations, each of which consumed one or more clock cycles. As might be expected, faster
multiplication hardware yields faster performance in many DSP algorithms, and for this
reason all modern DSP processors include at least one dedicated single-cycle multiplier or
combined multiply-accumulate (MAC) unit.
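As a concrete sketch, the FIR dot product described above can be written in C; on a DSP, each loop iteration maps onto the single-cycle MAC unit (the function name and tap count are illustrative):

```c
#include <stdint.h>

/* Illustrative FIR filter kernel: one multiply-accumulate per tap.
   x holds the input samples, h the filter coefficients. */
int32_t fir_filter(const int16_t *x, const int16_t *h, int ntaps)
{
    int32_t acc = 0;                  /* running sum for all of the taps */
    for (int k = 0; k < ntaps; k++)
        acc += (int32_t)x[k] * h[k];  /* multiply and add */
    return acc;
}
```

On a general-purpose processor each multiply here may take several cycles; a dedicated MAC unit retires one tap per cycle.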
Department Of Electronics & Communication Engineering, GEC Thrissur. 1
DATA FORMAT
DSP applications typically must pay careful attention to numeric fidelity. Since numeric
fidelity is far more easily maintained using a floating-point format, it may seem surprising
that most DSP processors use a fixed-point format. DSP processors tend to use the shortest
data word that will provide adequate accuracy in their target applications. Most fixed-point
DSP processors use 16-bit data words, because that data word width is sufficient for many DSP applications. A few fixed-point DSP processors use 20, 24, or even 32 bits to enable
better accuracy in applications that are difficult to implement well with 16-bit data, such as
high-fidelity audio processing.
To ensure adequate signal quality while using fixed-point data, DSP processors typically
include specialized hardware to help programmers maintain numeric fidelity throughout a
series of computations. For example, most DSP processors include one or more
accumulator registers to hold the results of summing several multiplication products.
Accumulator registers are typically wider than other registers; they often provide extra bits,
called guard bits, to extend the range of values that can be represented and thus avoid
overflow. In addition, DSP processors usually include good support for saturation
arithmetic, rounding, and shifting, all of which are useful for maintaining numeric fidelity.
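A minimal C sketch of these fidelity aids, assuming 16-bit data and a 64-bit software accumulator standing in for a hardware accumulator with guard bits (function names are illustrative):

```c
#include <stdint.h>

/* Accumulate a 16x16 -> 32-bit product into a wide accumulator; the bits
   above bit 31 play the role of guard bits, absorbing intermediate
   overflow across a long series of additions. */
int64_t mac_guarded(int64_t acc, int16_t a, int16_t b)
{
    return acc + (int32_t)a * b;
}

/* Saturation arithmetic: clamp the wide result to the 32-bit range
   instead of letting it wrap around. */
int32_t saturate32(int64_t acc)
{
    if (acc > INT32_MAX) return INT32_MAX;
    if (acc < INT32_MIN) return INT32_MIN;
    return (int32_t)acc;
}
```

A hardware accumulator performs both steps implicitly; the point of the sketch is that overflow is deferred until the final result is extracted.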
ZERO-OVERHEAD LOOPING
DSP algorithms typically spend the vast majority of their processing time in relatively
small sections of software that are executed repeatedly; i.e., in loops. Hence, most DSP
processors provide special support for efficient looping. Often, a special loop or repeat
instruction is provided which allows the programmer to implement a for-next loop without
expending any clock cycles for updating and testing the loop counter or branching back to
the top of the loop. This feature is often referred to as Zero-overhead looping.
STREAMLINED I/O
Finally, to allow low-cost, high-performance input and output, most DSP processors
incorporate one or more specialized serial or parallel I/O interfaces, and streamlined I/O
handling mechanisms, such as low-overhead interrupts and direct memory access (DMA),
to allow data transfers to proceed with little or no intervention from the processor's
computational units.
SPECIALIZED INSTRUCTION SETS
DSP processor instruction sets have traditionally been designed with two goals in mind.
The first is to make maximum use of the processor's underlying hardware, thus increasing
efficiency. The second goal is to minimize the amount of memory space required to store
DSP programs, since DSP applications are often quite cost-sensitive and the cost of
memory contributes substantially to overall chip and/or system cost. To accomplish the
first goal, conventional DSP processor instruction sets generally allow the programmer to
specify several parallel operations in a single instruction, typically including one or two
data fetches from memory in parallel with the main arithmetic operation. With the second
goal in mind, instructions are kept short by restricting which registers can be used with
which operations, and restricting which operations can be combined in an instruction.
CHAPTER 2
TRADITIONAL SOLUTIONS FOR REAL TIME PROCESSING
DSP architecture designs have traditionally focused on meeting real-time constraints. Advanced signal processing algorithms, such as those in base station receivers, present difficulties to the designer due to algorithmic complexity, higher data rates, and the desire for more channels per hardware module. A key constraint from the
manufacturing point of view is attaining a high channel density.
Traditionally, real-time architecture designs employ a mix of DSPs, Co-processors,
FPGAs, ASICs and Application Specific Standard Parts (ASSPs) for meeting real-time
requirements in high performance applications. The chip rate processing is handled by the
ASSP, ASIC or FPGA while the DSPs handle the symbol rate processing and use co-
processors for decoding. The DSP can also implement parts of the MAC layers and control
protocols or can be assisted by a RISC processor.
However, dynamic variations in the system workload such as variations in the number of
users in wireless base-stations, will require a dynamic re-partitioning of the algorithms
which may not be possible to implement in traditional FPGAs and ASICs in real-time.
LIMITATIONS OF SINGLE PROCESSOR DSP ARCHITECTURES
Single-processor DSPs can have only a limited number of arithmetic units and cannot directly extend their architectures to hundreds of arithmetic units. This is because, as the number of arithmetic
units in an architecture increases, the size of the register files and the port interconnections
start to dominate the architecture.
PROGRAMMABLE MULTIPROCESSOR DSP ARCHITECTURES
Multiprocessor architectures can be classified into Single Instruction Multiple Data (SIMD)
and Multiple Instruction Multiple Data (MIMD) architectures. Data-parallel DSPs exploit
data parallelism, instruction-level parallelism, and subword parallelism. Alternative levels of
parallelism, such as thread-level parallelism, exist and can be considered after this
architecture space has been fully studied and explored.
MULTI-CHIP MIMD PROCESSORS
Each processor in a loosely coupled system has a set of I/O devices and a large local
memory. Processors communicate by exchanging messages using some form of message-
transfer system. Loosely coupled systems are efficient when interaction between tasks is
minimal. The tradeoffs of this processor design have been increased programming
complexity and the need for high I/O bandwidth and inter-processor support. Such MIMD solutions are also difficult to scale with the number of processors (e.g., the TI C4XX).
Figure: Register file explosion in traditional DSPs with centralized register files.
The disadvantages of the multi-chip MIMD model and architectures are the following:
1. Load-balancing algorithms for such MIMD architectures are not straightforward, as with heterogeneous systems. This makes it difficult to partition algorithms on this architecture model, especially when the workload changes dynamically.
2. The loosely coupled model is not scalable with the number of processors due to interconnection and I/O bandwidth issues.
3. I/O impacts the real-time performance and power consumption of the architecture.
4. Designing a compiler for a MIMD model on a loosely coupled architecture is difficult, so the burden of deciding how to partition the algorithm across the multiprocessor is left to the programmer.
SINGLE-CHIP MIMD PROCESSORS
Single-chip MIMD processors can be classified into three categories: single-threaded chip multiprocessors (CMPs), multi-threaded multiprocessors (MTs), and clustered VLIW
architectures. A CMP integrates two or more complete processors on a single chip.
Therefore, every unit of a processor is duplicated and used independently of its copies.
In contrast, a multi-threaded processor interleaves the execution of instructions of various
threads of control in the same pipeline. Therefore, multiple program counters are available
in the fetch unit and multiple register contexts are stored on the chip. Multi-threading increases the instruction-level parallelism available to the arithmetic units by providing access to more than one independent instruction stream. The programmer is responsible for scheduling the threads of the application.
Clustered VLIW architectures solve the register explosion problem by employing clusters of functional units and register files.
Clustering improves cycle time in two ways: by reducing the distance signals have to
travel within a cycle and by reducing the load on the bus. Clustering is beneficial for
applications that have limited inter-cluster communication. However, compiling for
clustered VLIW architectures can be difficult, since the compiler must schedule across the
clusters while minimizing inter-cluster operations and their latency.
Although single chip MIMD architectures eliminate the I/O bottleneck between multiple
processors, the load balancing and architecture scaling issues still remain. The availability
of data parallelism in signal processing applications is not utilized efficiently in MIMD
architectures.
SIMD ARRAY PROCESSORS
In SIMD processing, identical processors in the architecture execute the same instruction but work on different sets of data in parallel. An SIMD array
processor refers to processor designs targeted at the implementation of arrays or
matrices. There are various interconnection methodologies used for array
processors, such as linear arrays (vectors), rings, stars, trees, meshes, systolic arrays, and
hypercubes; examples include the Illiac-IV and the Burroughs Scientific Processor (BSP). Although
vector processors have been the most popular version of array processors, mesh-based processors are still used in scientific computing.
SIMD VECTOR PROCESSORS
Data parallelism allows vector processors to approach the performance and power efficiency
of custom designs, while simultaneously providing the flexibility of a programmable
processor. Vector machines were the first attempt at building supercomputers, starting
with the Cray-1. These processors execute vector instructions, such as vector adds
and multiplications, out of a vector register file. The number of memory banks is equal to
the number of processors, so that all processors can access memory in parallel.
DATA-PARALLEL DSPS
Data-parallel DSPs are architectures that exploit data parallelism as well as instruction-level parallelism. Stream processors are state-of-the-art
programmable architectures aimed at media processing applications. Stream processors
enhance data-parallel DSPs by providing a bandwidth hierarchy for data flow in signal
processing applications, enabling support for hundreds of arithmetic units in the data-
parallel DSP.
PIPELINING MULTIPLE PROCESSORS
An alternate method to attain high data rates is to provide multiple processors that are
pipelined. Such processors would be able to take advantage of the streaming flow of data
through the system. The disadvantages of such a design are that the architecture would
need to be carefully designed to match the system throughput and is not flexible enough to
adapt to changes in system workload. Also, such a pipelined system would be difficult to
program and would suffer from I/O bottlenecks unless implemented as an SoC. However, this is the
only way to provide the desired system performance if the amount of parallelism exploited
does not otherwise meet the system requirements.
CHAPTER 3
CURRENT DSP LANDSCAPE
CONVENTIONAL DSP PROCESSORS
The performance and price range among DSP processors is very wide. In the low-cost,
low-performance range are the industry workhorses, which are based on conventional DSP
architecture. They issue and execute one instruction per clock cycle, and use the complex,
multi-operation type of instructions described earlier. These processors typically include a
single multiplier or MAC unit and an ALU, but few additional execution units, if any.
Included in this group are Analog Devices ADSP-21xx family, Texas Instruments
TMS320C2xx family, and Motorola's DSP560xx family. These processors generally
operate at around 20-50 MHz, and provide good DSP performance while maintaining very
modest power consumption and memory usage. Midrange DSP processors achieve higher
performance than the low-cost DSPs described above through a combination of increased clock speeds and somewhat more sophisticated architectures.
ENHANCED CONVENTIONAL DSP PROCESSORS
DSP processor architects improved performance by extending conventional DSP
architectures by adding parallel execution units, typically a second multiplier and adder.
These hardware enhancements are combined with an extended instruction set that takes
advantage of the additional hardware by allowing more operations to be encoded in a single instruction and executed in parallel. We refer to this type of processor as an enhanced-
conventional DSP processor, because it is based on the conventional DSP processor
architectural style rather than being an entirely new approach. With this increased
parallelism, enhanced-conventional DSP processors can execute significantly more work
per clock cycle; for example, two MACs per cycle instead of one.
Enhanced-conventional DSP processors typically have wider data buses to allow them to
retrieve more data words per clock cycle in order to keep the additional execution units fed.
They may also use wider instruction words to accommodate specification of additional
parallel operations within a single instruction.
MULTI-ISSUE ARCHITECTURES
With the goals of achieving high performance and creating an architecture that lends itself
to the use of compilers, some newer DSP processors use a multi-issue approach.
In contrast to conventional and enhanced-conventional processors, multi-issue processors
use very simple instructions that typically encode a single operation. These processors
achieve a high level of parallelism by issuing and executing instructions in parallel groups
rather than one at a time. Using simple instructions simplifies instruction decoding and
execution, allowing multi-issue processors to execute at higher clock rates than
conventional or enhanced-conventional DSP processors (e.g., the TMS320C62xx). The two classes of architectures that execute multiple instructions in parallel are referred to as
VLIW and superscalar. These architectures are quite similar, differing mainly in how
instructions are grouped for parallel execution.
VLIW and superscalar architectures provide many execution units, each of which executes
its own instruction. VLIW DSP processors typically issue a maximum of between four and
eight instructions per clock cycle, which are fetched and issued as part of one long super-
instruction, hence the name Very Long Instruction Word. Superscalar processors typically
issue and execute fewer instructions per cycle, usually between two and four. In a VLIW
architecture, the assembly language programmer specifies which instructions will be
executed in parallel. Hence, instructions are grouped at the time the program is assembled,
and the grouping does not change during program execution. Superscalar processors, in
contrast, contain specialized hardware that determines which instructions will be executed
in parallel based on data dependencies and resource contention, shifting the burden of
scheduling parallel instructions from the programmer to the processor. The processor may
group the same set of instructions differently at different times in the program's execution;
for example, it may group instructions one way the first time it executes a loop, then group
them differently for subsequent iterations. The difference in the way these two types of
architectures schedule instructions for parallel execution is important in the context of
using them in real-time DSP applications. Because superscalar processors dynamically
schedule parallel operations, it may be difficult for the programmer to predict exactly how long a given segment of software will take to execute. The execution time may vary based
on the particular data accessed, whether the processor is executing a loop for the first time
or the third, or whether it has just finished processing an interrupt, for example. Dynamic
features also complicate software optimization. As a rule, DSP processors have
traditionally avoided dynamic features for just these reasons; this may be why there is
currently only one example of a commercially available superscalar DSP processor.
In VLIW architectures, a wide instruction word may be required in order to specify
information about which functional unit will execute the instruction. Wider instructions
allow the use of larger, more uniform register sets, which in turn enables higher
performance. There are disadvantages, however, to using wide, simple instructions. Since each VLIW instruction is simpler than a conventional DSP processor instruction, VLIW
processors tend to require many more instructions to perform a given task. Combined with
the fact that the instruction words are typically wider than those found on conventional
DSP processors, this characteristic results in relatively high program memory usage. High
program memory usage, in turn, may result in higher chip or system cost because of the
need for additional ROM or RAM.
VLIW processors typically use either wide buses or a large number of buses to access data
memory and keep the multiple execution units fed with data. The architectures of VLIW
DSP processors are in some ways more like those of general-purpose processors than like
those of the highly specialized conventional DSP architectures.
VLIW and superscalar processors often suffer from high energy consumption relative to
conventional DSP processors; in general, multi-issue processors are designed with
an emphasis on increased speed rather than energy efficiency. These processors often have
more execution units active in parallel than conventional DSP processors, and they require
wide on-chip buses and memory banks to accommodate multiple parallel instructions and
to keep the multiple execution units supplied with data, all of which contribute to increased
energy consumption.
Because they often have high memory usage and energy consumption, VLIW
and superscalar processors have mainly targeted applications which have very demanding
computational requirements but are not very sensitive to cost or energy efficiency. For example, a VLIW processor might be used in a cellular base station, but not in a portable
cellular phone.
On DSP processors with SIMD capabilities, the underlying hardware that supports SIMD
operations varies widely. Analog Devices, for example, modified their basic conventional
floating-point DSP architecture, the ADSP-2106x, by adding a second set of execution
units that exactly duplicates the original set. The augmented architecture can issue a single
instruction and execute it in parallel in both sets of execution units using different data,
effectively doubling performance in some algorithms.
In contrast, instead of having multiple sets of the same execution units, some DSP
processors can logically split their execution units into multiple sub-units that process
narrower operands. These processors treat operands in long registers as multiple short
operands. Perhaps the most extensive SIMD capabilities we have seen in a DSP processor
to date are found in Analog Devices' TigerSHARC processor. TigerSHARC is a VLIW
architecture, and combines the two types of SIMD: one instruction can control execution of
the processor's two sets of execution units, and this instruction can specify a split-
execution-unit operation that will be executed in each set. Using this hierarchical SIMD
capability, TigerSHARC can execute eight 16-bit multiplications per cycle. SIMD is only
effective in algorithms that can process data in parallel; for algorithms that are inherently
serial, SIMD is generally not of use.
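The split-execution-unit style of SIMD can be sketched in C by treating one 32-bit word as two independent 16-bit lanes (the function name is illustrative):

```c
#include <stdint.h>

/* Sub-word SIMD sketch: add two 16-bit lanes packed in 32-bit words.
   Carries are not allowed to propagate from the low lane into the high
   lane, just as a split execution unit keeps the operands independent. */
uint32_t add16x2(uint32_t a, uint32_t b)
{
    uint32_t lo = (uint16_t)((uint16_t)a + (uint16_t)b); /* low lane  */
    uint32_t hi = (uint16_t)((a >> 16) + (b >> 16));     /* high lane */
    return (hi << 16) | lo;
}
```

In hardware this costs no extra cycles: the wide adder is simply cut at the lane boundary, which is why sub-word SIMD is attractive for narrow DSP data.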
CHAPTER 4
DIVERGING ARCHITECTURES
Up until recently, DSP processor designs were improved primarily by incremental
enhancements; new DSPs tended to maintain a close resemblance to their predecessors. In the last couple of years, however, DSP architectures have become much more interesting,
with a number of vendors announcing new architectures that are completely different from
preceding generations.
HIGH-PERFORMANCE DSPS
Processor designers who want higher DSP performance than can be squeezed out of
traditional architectures have come up with a variety of performance-boosting strategies.
The main idea is that if you want to improve performance beyond the increase afforded by
faster clock speeds, you need to increase the amount of useful work that gets done every clock cycle. This is accomplished by increasing the number of operations that are
performed in parallel, which can be implemented in two main ways: by increasing the
number of operations performed by each instruction, or by increasing the number of
instructions that are executed in every instruction cycle.
INCREASING THE WORK PERFORMED BY EACH INSTRUCTION
Traditionally, DSP processors have used complex, compound instructions that allow the
programmer to encode multiple operations in a single instruction. In addition, DSP
processors traditionally issue and execute only one instruction per instruction cycle. This
single-issue, complex-instruction approach allows DSP processors to achieve very strong DSP performance without requiring a large amount of program memory.
One method of increasing the amount of work performed by each instruction while
maintaining the basics of the traditional DSP architecture and instruction set described
above is to augment the data path with extra execution units. We refer to processors that
follow this approach as ``enhanced conventional DSPs''; their basic architecture is similar
to previous generations of DSPs, but has been enhanced by adding execution units.
Lucent Technologies' DSP16000 architecture is based on that of the earlier DSP1600, but
Lucent added a second multiplier, an adder, and a bit-manipulation unit. To support more
parallel operations and keep the processor from starving for data, Lucent also increased the data bus widths to 32 bits. The net result is a processor that is able to sustain a throughput
of two multiply-accumulates per instruction cycle.
EXECUTING MULTIPLE INSTRUCTIONS / CYCLE
A few designers have opted for a more RISC-like instruction set coupled with an
architecture that supports execution of multiple instructions in every instruction cycle,
e.g., the TMS320C62xx family. In TI's version, the processor fetches a 256-bit instruction
``packet,'' parses the packet into eight 32-bit instructions, and routes them to its eight
independent execution units.
VLIW processors typically suffer from high program memory requirements and high
power consumption. Like VLIW processors, superscalar processors issue and execute
multiple instructions in parallel. Unlike VLIW processors, in which the programmer
explicitly specifies which instructions will be executed in parallel, superscalar processors
use dynamic instruction scheduling to determine ``on the fly'' which instructions will be
executed concurrently based on the processor's available resources, on data dependencies,
and on a variety of other factors. Superscalar architectures have long been used in high-
performance general-purpose processors such as the Pentium and PowerPC.
CIRCULAR BUFFERING
In off-line processing, the entire input signal resides in the computer at the same time. The
key point is that all of the information is simultaneously available to the processing
program. This is common in scientific research and engineering, but not in consumer
products. Off-line processing is the realm of personal computers and mainframes.
In real-time processing, the output signal is produced at the same time that the input signal
is being acquired. To calculate an output sample of an FIR filter, we must have access to a certain
number of the most recent samples from the input. When a new sample is acquired, it
replaces the oldest sample in the array, and the pointer is moved one address ahead.
Circular buffers are efficient because only one value needs to be changed when a new
sample is acquired.
Four parameters are needed to manage a circular buffer. First, there must be a pointer that
indicates the start of the circular buffer in memory. Second, there must be a pointer
indicating the end of the array, or a variable that holds its length. Third, the step size of
the memory addressing must be specified. These three values define the size and
configuration of the circular buffer, and do not change during program operation. The
fourth value, the pointer to the most recent sample, must be modified as each new sample is acquired. In other words, there must be program logic that controls how this fourth value is
updated based on the first three.
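The four-parameter scheme above can be sketched in C; here the start pointer, length, and step size are fixed at compile time, and only the pointer to the most recent sample changes at run time (all names are illustrative):

```c
#include <stdint.h>

#define BUF_LEN 8  /* buffer length (second and third parameters fixed) */

typedef struct {
    int16_t data[BUF_LEN];  /* start of the buffer in memory   */
    int     newest;         /* index of the most recent sample */
} circ_buf;

/* Overwrite the oldest sample with the new one and advance the pointer,
   wrapping at the end of the array. */
void circ_push(circ_buf *cb, int16_t sample)
{
    cb->newest = (cb->newest + 1) % BUF_LEN;
    cb->data[cb->newest] = sample;
}

/* Fetch the sample acquired `age` steps ago (age 0 = newest). */
int16_t circ_get(const circ_buf *cb, int age)
{
    return cb->data[(cb->newest - age + BUF_LEN) % BUF_LEN];
}
```

Because only `newest` changes per sample, acquiring a new sample costs a single store plus a pointer update; DSP hardware accelerates the wrap-around itself with dedicated modulo (circular) addressing.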
DSP/MICROCONTROLLER HYBRIDS
Many applications require a mixture of control-oriented software and DSP software. A
prime example is the digital cellular phone, which must implement both supervisory tasks
and voice-processing tasks. In general, microcontrollers provide good performance in
controller tasks and poor performance in DSP tasks; dedicated DSP processors have the
opposite characteristics. Hence, until recently, combination controller/signal processing
applications were typically implemented using two separate processors: a microcontroller
and a DSP.
In the past couple of years, however, a number of microcontroller vendors have begun to
offer DSP-enhanced versions of their microcontrollers as an alternative to the dual-
processor solution.
Using a single processor to implement both types of software is attractive, because it can
potentially:
simplify the design task
save board space
reduce total power consumption
reduce overall system cost
Microcontroller vendors such as Hitachi offer DSP-enhanced versions of their microcontrollers. Hitachi's version, called the SH-DSP, adds a complete 16-bit fixed-point
DSP data path to the SH-2. In contrast, ARM took a different approach and developed a
DSP co-processor, ``Piccolo,'' that is meant to be used as an add-on to their ARM7
microcontroller. Each processor has its own instruction set and processes its own instruction
stream, so it is possible for the two processors to operate in parallel, with the caveat
that Piccolo relies on the ARM7 to perform all data transfers.
RECONFIGURABLE ARCHITECTURES
Reconfigurable architectures are defined as programmable architectures that change the
hardware or the interconnections dynamically so as to provide flexibility with simultaneous
benefits in execution time due to the reconfiguration as opposed to turning off units to
conserve power. There have been various approaches to providing and using this
reconfigurability in programmable architectures. The first approach is the FPGA+
approach, which adds a number of high-level configurable functional blocks to a general-purpose device to optimize it for a specific purpose, such as wireless. The second approach
is to develop a reconfigurable system around a programmable ASSP. The third approach is based on a parallel array of processors on a single die, connected by a reconfigurable
fabric. These kinds of architectures are still in the early stages of their evolution.
CHAPTER 5
NOVEL DSP ARCHITECTURES
"POST-HARVARD" TECHNOLOGY
After remaining unchanged for more than a decade, DSP architectures have started to
evolve, and are even beginning to encompass control operations. Conventional DSPs
typically use a Harvard-style architecture, with separate data and instruction
buses. Their main processing elements are a multiplier, an arithmetic logic unit (ALU), and
an accumulation register, allowing creation of a multiply-accumulate (MAC) unit that
accepts two operands. Depending on the processor, the operands may be 16-, 24-, 32-, or
48-bit words in either fixed-point or floating-point format. Whatever the word width, these
conventional DSPs offer fixed-width instructions, executing one instruction per clock
cycle.
Figure: The conventional DSP architecture uses separate data and memory buses and features fixed-width instructions, executing one instruction per clock cycle.
The instructions themselves can be fairly complex. A single instruction may embody two
data moves, a MAC operation, and two pointer updates. These complex instructions help
the conventional DSP offer a high degree of code density when performing repeated
mathematical operations on arrays of numbers. As control devices, however, they leave
something to be desired. The fixed-width instructions are inefficient when tasked with
performing simple counter increments as part of a control loop, for instance. Even if the
counter is only going as high as 10, the processor needs to use the full word width for the
values. Conventional DSPs are also weak at bit-level data manipulation beyond bit shifting.
Still, because of their number-crunching proficiency, conventional DSPs soon gained
popularity in communications and media applications. The communications devices, including modem and telephony processors, needed the computational power for echo
canceling, voice coding, and filtering. Media applications, including digital audio, video,
and imaging, needed computational power for compression and filtering along with
program flexibility to track evolving standards. DSPs also found a home in disk-drive and
other servo-motor-control applications.
ENHANCED DSPS EMERGE
As semiconductor process technology evolved, conventional DSPs began to acquire a
number of on-chip peripherals such as local memory, I/O ports, timers, and DMA
controllers. Their basic architecture, however, didn't change for more than a decade. Eventually, though, the relative weakness in bit-level manipulation began to catch up with
conventional DSPs, as did the incessant demand for greater performance.
One common feature of these enhanced DSPs is the presence of a second MAC, which
allows for some parallelism in computation. In many cases, this parallelism extends to
other elements in the DSP, allowing the device to perform single-instruction, multiple-data
(SIMD) operations. Often this is accomplished with data packing, which allows registers,
data paths, and the like to handle two half-word operands each clock cycle. Along with
data packing, many enhanced DSPs allow the instructions themselves to use fractional word widths, which allows multiple instructions to launch simultaneously.
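Data packing can be modeled in a few lines. The sketch below (plain Python, illustrative only; the names are invented) packs two 16-bit half-words into one 32-bit "register" and performs a dual 16-bit add per call, taking care that a carry out of the low half does not ripple into the high half:

```python
MASK16 = 0xFFFF

def pack(hi, lo):
    # Pack two 16-bit values into one 32-bit "register".
    return ((hi & MASK16) << 16) | (lo & MASK16)

def simd_add16(a, b):
    # Dual 16-bit add on packed operands: each half wraps
    # independently, so a carry out of the low half must not
    # ripple into the high half.
    lo = (a + b) & MASK16
    hi = ((a >> 16) + (b >> 16)) & MASK16
    return (hi << 16) | lo
```

With this representation, one register read and one "add" process two operand pairs per cycle, which is exactly the throughput gain the enhanced DSPs obtain.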
The enhanced DSPs also tend to incorporate features that speed execution of algorithms in
a specific application space as well as add special-purpose peripherals and memory. The
exact nature of the specialization varies with the application an enhanced DSP targets,
which makes direct comparisons difficult. Many include hardware accelerators for
frequently-used operations as well as provide specialized addressing modes and augmented
instruction sets that target the application space. The augmented instruction sets may
include both special DSP instructions and RISC-like instructions for improved control
operation.
Consider, for instance, the Blackfin DSP family from Analog Devices. This family targets
voice, video, and data communications signal processing along with control operations.
The core includes dual 16-bit MACs, dual 40-bit arithmetic logic units (ALUs), a 40-bit
barrel shifter, and quad 8-bit ALUs for video operations. Because the architecture allows
data packing, the 40-bit ALUs can handle two 40-bit numbers or four 16-bit numbers. In
addition, a control unit handles sequencing of instructions so that a mix of 16-bit control
and 32-bit DSP instructions can pack for simultaneous execution. Data can be in 8-, 16-, or
32-bit format.
Figure: Analog Devices Blackfin DSP architecture handles multi-width data words and can simultaneously execute 16-bit control and 32-bit DSP instructions.
The core also includes two data address generators (DAGs) to simplify both DSP and
control operations. DSP addressing operations include circular buffering, for matrix
operations, and bit-reversal, for unscrambling FFT results. Control operations include auto-
increment, auto-decrement, and base-plus-immediate-offset addressing modes not found in
conventional DSPs.
INSTRUCTION SETS TARGET APPLICATIONS
The instruction set of the Blackfin core includes both general DSP instructions and RISC-
like control instructions. In addition, the core has complex instructions geared toward the
needs of the intended applications. For Huffman coding, used in communications algorithms, there is a "Field Deposit/Extract" command. For the Discrete Cosine
Transform, used in imaging and video, an IEEE 1180 rounding operation is available.
Video compression algorithms can take advantage of the "Sum Absolute Difference"
instruction.
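The "Sum Absolute Difference" operation itself is simple to state; what the hardware instruction provides is computing it in a single cycle during block-matching motion estimation. A plain-Python rendering of the computation (illustrative only, not the Blackfin instruction's exact operand format):

```python
def sad(block_a, block_b):
    # Sum of Absolute Differences between two pixel blocks:
    # the core similarity measure of block-matching motion
    # estimation in video compression.
    return sum(abs(a - b) for a, b in zip(block_a, block_b))
```

A motion estimator evaluates this over many candidate blocks and keeps the one with the smallest SAD, which is why a one-cycle hardware version pays off.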
These specialty instructions are one way that the Blackfin family targets applications. The
other way is the peripheral mix each family member offers. The ADSP-21532, for
example, aims at low-cost consumer multimedia applications by including peripherals
supporting surround-sound and video-specific operating modes. The ADSP-21535 goes
after high-performance communications applications with USB and PCI interfaces as well
as substantial amounts of on-chip SRAM.
The range and variety of variations within the Blackfin family as well as the nature of its
specialized instructions mirror the diversity of enhanced conventional DSPs, available from
companies such as Cirrus Logic, Motorola, and Texas Instruments. But for all the
enhancements, these DSPs follow basically the same programming model as the
conventional device.
Other DSP architectures have emerged that follow a different programming model. In
search of the highest performance levels, these architectures allow the DSP to launch
multiple instructions at the same time for parallel execution. While these approaches result in greater code execution speed, they also make software more difficult to optimize. They
require careful instruction ordering to avoid needing simultaneous access to the same data.
They also need to avoid attempting simultaneous execution of instructions where one
instruction depends on the results of the other for its operands. Not all DSP application
software has a structure suitable for multiple-launch execution, but when it does, these
DSPs offer the highest performance.
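The dependency problem can be seen at the source level. In the sketch below (illustrative Python), the single-accumulator loop forms a serial chain of additions, while splitting the sum across two independent accumulators — a classic transformation for multiple-launch machines — creates MACs with no mutual dependence that could issue in the same cycle:

```python
def dot_serial(x, y):
    # Single accumulator: each add depends on the previous one,
    # so the additions form a chain a multi-issue DSP cannot overlap.
    acc = 0
    for a, b in zip(x, y):
        acc += a * b
    return acc

def dot_parallel(x, y):
    # Two independent accumulators: the even and odd MACs have no
    # data dependence on each other, so a dual-MAC DSP can issue
    # them in the same cycle.
    acc0 = acc1 = 0
    for i in range(0, len(x) - 1, 2):
        acc0 += x[i] * y[i]
        acc1 += x[i + 1] * y[i + 1]
    if len(x) % 2:                 # leftover element for odd lengths
        acc0 += x[-1] * y[-1]
    return acc0 + acc1
```

Both functions compute the same dot product; only the second exposes the parallelism a VLIW or superscalar DSP needs.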
PARALLELISM ARISES
Two different forms of multiple-launch DSPs have arisen: very long instruction word
(VLIW) and superscalar architectures. Both have multiple execution units configured to
operate in parallel and use RISC-like instruction sets. The instructions of a VLIW
architecture are explicitly parallel, being composed of several sub-instructions that control
different resources. The superscalar architectures, on the other hand, load instructions in
bulk, then use hardware run-time scheduling to identify instructions that can run in parallel
and map them to the execution units.
Of the multi-launch architectures, VLIW designs are the most common. Devices from
Adelante Technologies, Equator Technologies, Siroyan, and Texas Instruments fall into
this category, although they vary considerably with the type and number of parallel
execution units they offer. The TI TMS320C64xx processors, for instance, have eight
execution units that can handle both 8- and 16-bit SIMD operations. The Siroyan OneDSP,on the other hand, is scalable from two to 32 clusters, each with several execution units.
The Adelante Saturn DSP core as shown in the following figure demonstrates the essence
of the VLIW approach. It uses multiple data buses in a dual-Harvard configuration to
deliver data and 96-bit wide instructions to an array of execution units simultaneously.
These units include two multipliers (MPY), four 16-bit ALUs that can combine to form
two 20-bit ALUs, a barrel shifter with saturation logic (SST/BRS), program (PCU) and
loop (LCU) controllers, address controllers (ACU), and an ability for design teams to add
application-specific execution units (AXU) to speed processing.
Figure: Adelante's Saturn DSP core handles VLIW instructions that can comprise several sub-instructions that control different resources. The core also handles application-specific execution units (AXUs) to accelerate processing.
The Saturn core uses a unique approach to get around one of the problems the wide word
widths of VLIW architectures cause. Accessing external memory is a challenge for these
DSPs, because of their need to work with buses that can be as wide as 128 bits. The Saturn
core uses 16-bit program memory, which it maps into the 96-bit instruction word it uses
internally. Adelante developed this mapping after analyzing millions of lines of code for
common applications. However, the core also allows developers to create their own
application-specific instructions that map into the VLIW.
SUPERSCALAR DSPS
While the 16-bit external instruction width of the Saturn processor is unusual for VLIW
architectures, it is typical for superscalar architectures. These devices pull in several
instructions at a time and dynamically map them to the execution units. Internally the effect
is much the same as a VLIW architecture in that execution units are operating in parallel.
But from the software development viewpoint the approach reduces programming
complexity. With hardware handling the sequencing and arranging of instructions, the
developer is free to work with the more manageable short instructions.
Consider the structure of a sample superscalar DSP, the LSI Logic ZSP600. Because it is a
core, its memory interface isn't constrained, making it look like a VLIW architecture. But
the presence of the instruction-sequencing unit (ISU) and the pipeline control unit betrays its
superscalar nature. The ZSP600 fetches eight instructions at a time, and can execute as
many as six, using its four MAC and two ALU execution units simultaneously. Data
packing allows the units to perform 16- or 32-bit operations. The architecture also allows
for the addition of coprocessors to speed specific DSP functions.
Figure: Superscalar DSPs, such as LSI Logic's ZSP600, use several instructions simultaneously and dynamically map these instructions to the execution units.
This ability to add coprocessors is becoming a common feature of high-performance DSP
cores. In many cases the core's creators have also created coprocessors for functions such
as DES (data encryption standard) and Viterbi coding. If a pre-designed coprocessor isn't
available, however, creating your own can be a major design challenge.
A recently-introduced DSP architecture, the PulseDSP from Systolix, might make the task
easier. Similar to an FPGA, the PulseDSP offers a massively parallel, repetitive structure. It is designed as a systolic array, which means that all data transfers occur synchronously on a
clock edge. Each processing element in the array has selectable I/O paths, local data
memory, and an ALU. Both the I/O and the ALU are programmable, and the array has a
programming bus running through it. The combination makes the array reprogrammable,
either statically or dynamically. The array structure is intended to handle low-complexity
but high-speed processing tasks using 16- to 64-bit arithmetic, which makes it suitable as a
coprocessor.
Figure: Systolix's PulseDSP is a systolic array that can run as a coprocessor or as a standalone unit for applications such as filters and FFTs. The array is programmable,
with each processing element having its own selectable I/O paths, local data memory, and an ALU.

The array can also be used as a stand-alone processor for some types of algorithms, such as
filters and FFTs. One of the commercial implementations of the array, in fact, is to provide
filtering in an Analog Devices data acquisition part, the AD7725. The device combines the
PulseDSP with a sigma-delta A/D converter to provide post-processing of the acquired
data. The DSP array implements various filter algorithms.
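The systolic behaviour described above — synchronous shifts on every clock edge, with each processing element holding one coefficient — can be modeled briefly. The sketch below (illustrative Python, not the PulseDSP programming model) implements an FIR filter in this style:

```python
def systolic_fir(coeffs, samples):
    # Each processing element (PE) holds one coefficient and a
    # delayed sample; on every "clock edge" samples shift one PE
    # along the array and each PE adds its product into the sum.
    delay = [0] * len(coeffs)      # per-PE sample registers
    out = []
    for s in samples:
        delay = [s] + delay[:-1]   # synchronous shift of the array
        out.append(sum(c * d for c, d in zip(coeffs, delay)))
    return out
```

One output emerges per clock regardless of filter length, since every tap computes in parallel — the property that makes systolic arrays attractive for low-complexity, high-speed tasks.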
Innovations such as the PulseDSP as well as the proliferation within the other DSP
architectures are a strong indicator of how important these once-arcane processors have
become. In many applications, especially communications, they share the spotlight with the
RISC processor. The DSP handles the data and the RISC handles the protocols. There are
problems with the two-processor approach, of course, including increased cost and
software development complexity. One reason many DSPs are adding RISC-like
instructions to their set is to be able to edge out the other processor in such applications.
The same thing is happening with some RISC processors. Extensible cores, such as the
Tensilica Xtensa and the ARC International ARCtangent, are offering DSP enhancements
so that communications applications need only one processor. These enhancements follow
the architecture of the conventional DSP, but merge the DSP functions into the instruction
set of the RISC core.
The ARCtangent demonstrates how the two get blended. The DSP instruction decode and
processing elements both connect with the rest of the core, allowing them to use the core's
resources as well as their own. The extensions have full access to registers and operate in
the same instruction stream as the RISC core. ARC's DSP offerings include MACs in
varying widths, saturation arithmetic, and X-Y memory for DSP data. The extensions also
support DSP addressing modes such as bit-reversal.
Figure: The ARCtangent core from ARC International blends DSP functionality into a RISC processor. Both DSP instruction-decode and processing elements connect with the rest of the core, allowing these elements to use the core's resources as well as their own.
These extended RISC processors, enhanced conventional DSPs, and high-performance
architectures have all proliferated in the last few years, a sure sign of the importance DSPs
have acquired. Furthermore, that proliferation is likely to continue. With process
technology allowing integration of multiple peripherals with DSP cores and instruction sets
extending to match application needs, DSPs are headed the way of the microcontroller.
From obscure, specialized parts, they are evolving to become a fundamental building block
for virtually any system.
CHAPTER 6 ARCHITECTURE OF LATEST DSP PROCESSORS
TEXAS INSTRUMENTS TMS320C67xx FAMILY
OVERVIEW
The TMS320C67xx family comprises the highest-performance floating-point DSPs in TI's
line. It is based on the advanced VelociTI very-long-instruction-word (VLIW) architecture,
which allows it to execute up to eight RISC-like instructions per clock cycle and makes it
an excellent choice for multichannel and multifunction applications. It adds support for
floating-point arithmetic and 64-bit data, delivering up to 1 giga floating-point operations
per second (GFLOPS) at a clock rate of 167 MHz. It uses a 1.8-volt core supply and
executes up to 334 million MACs per second at 167 MHz. The TMS320C67xx's two data
paths extend hardware support for 64-bit data and IEEE-754 32-bit single-precision and
64-bit double-precision floating-point arithmetic. Each data path includes a set of four
execution units, a general-purpose register file, and paths for moving data between memory
and registers.
The four execution units in each data path comprise two ALUs, a multiplier, and an
adder/subtractor which is used for address generation. The ALUs support both integer and
floating point operations, and the multipliers can perform both 16x16-bit and 32x32-bit
integer multiplies and 32-bit and 64-bit floating point multiplies. The two register files each
contain sixteen 32-bit general-purpose registers. These registers can be used for storing
addresses or data. To support 64-bit floating point arithmetic, pairs of adjacent registers can
be used to hold 64-bit data.
The C6701 DSP possesses the operational flexibility of high-speed controllers and the
numerical capability of array processors. This processor has 32 general-purpose registers of
32-bit word length and eight highly independent functional units. The eight functional units
provide four floating-/fixed-point ALUs, two fixed-point ALUs, and two floating-/fixed-
point multipliers. Program memory consists of a 64K-byte block that is user-configurable
as cache or memory-mapped program space. Data memory consists of two 32K-byte
blocks of RAM. The peripheral set includes two multichannel buffered serial ports
(McBSPs), two general-purpose timers, a host-port interface (HPI), and a glueless external memory interface (EMIF) capable of interfacing to SDRAM or SBSRAM and asynchronous peripherals.
The on-chip memory system of the TMS320C67xx implements a modified
Harvard architecture, providing separate address spaces for program and data memory.
Program memory has a 32-bit address bus and a 256-bit data bus. Each of the two data
paths is connected to data memory by a 32-bit address bus and two 32-bit data buses. Since
there are two 32-bit data buses for each data path, the TMS320C67xx can load two 64-bit
words per instruction cycle. TMS320C6701 has 64 Kbytes of 32-bit on-chip program RAM
and 64 Kbytes of 16-bit on-chip data RAM.
The TMS320C6701 has one external memory interface, which provides a 23-bit address
bus and a 32-bit data bus. These buses are multiplexed between program and data memory
accesses. Addressing modes supported include register-direct, register-indirect, indexed
register-indirect, and modulo addressing. Immediate data is also supported.
The TMS320C67xx does not support hardware looping, and hence all loops must be
implemented in software. However, the parallel architecture of the processor allows the
implementation of software loops with virtually no overhead.
The peripherals on the TMS320C6701 include a host port, a four-channel DMA controller,
two TDM-capable buffered serial ports, and two 32-bit timers.
CPU ARCHITECTURE
CPU DESCRIPTION
Fetch packets are always 256 bits wide; however, the execute packets can vary in size. The variable-length execute packets are a key memory-saving feature, distinguishing the C67x CPU from other VLIW architectures.
The CPU features two sets of functional units. Each set contains four units and a register
file. One set contains functional units .L1, .S1, .M1, and .D1; the other set contains units
.D2, .M2, .S2, and .L2. The two register files contain sixteen 32-bit registers each, for a
total of 32 general-purpose registers. The two sets of functional units, along with the two
register files, compose sides A and B of the CPU.
The four functional units on each side of the CPU can freely share the 16 registers
belonging to that side. Additionally, each side features a single data bus connected to all registers on the other side, by which the two sets of functional units can access data from
the register files on opposite sides.
In addition to the C62x DSP fixed-point instructions, six of the eight functional units
(.L1, .S1, .M1, .M2, .S2, and .L2) also execute floating-point instructions. The remaining
two functional units (.D1 and .D2) also execute the new LDDW instruction, which loads 64
bits per CPU side for a total of 128 bits per cycle.
Another key feature of the C67x CPU is the load/store architecture, where all instructions
operate on registers. Two sets of data-addressing units (.D1 and .D2) are responsible for all
data transfers between the register files and the memory. The data address driven by the .D units allows data addresses generated from one register file to be used to load or store data
to or from the other register file. The C67x CPU supports a variety of indirect-addressing
modes using either linear- or circular-addressing modes with 5- or 15-bit offsets. All
instructions are conditional, and most can access any one of the 32 registers. Some
registers, however, are singled out to support specific addressing or to hold the condition
for conditional instructions. The two .M functional units are dedicated multipliers.
The two .S and .L functional units perform a general set of arithmetic, logical, and branch
functions with results available every clock cycle. The processing flow begins when a 256-
bit-wide instruction fetch packet is fetched from a program memory. The 32-bit
instructions destined for the individual functional units are linked together by 1 bits in
the least significant bit (LSB) position of the instructions. The instructions that are
chained together for simultaneous execution compose an execute packet. A 0 in the
LSB of an instruction breaks the chain, effectively placing the instructions that follow it in
the next execute packet. If an execute packet crosses the fetch-packet boundary (256 bits
wide), the assembler places it in the next fetch packet, while the remainder of the current
fetch packet is padded with NOP instructions. The number of execute packets within a
fetch packet can vary from one to eight. Execute packets are dispatched to their respective
functional units at the rate of one per clock cycle and the next 256-bit fetch packet is not
fetched until all the execute packets from the current fetch packet have been dispatched.
After decoding, the instructions simultaneously drive all active functional units for a
maximum execution rate of eight instructions every clock cycle. While most results are
stored in 32-bit registers, they can be subsequently moved to memory as bytes or half-
words as well. All load and store instructions are byte-, half-word-, or word-addressable.
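The p-bit chaining rule lends itself to a small worked example. The sketch below (illustrative Python; instruction words are just integers here, and the function name is invented) groups a fetch packet into execute packets exactly as described — a 1 in the LSB chains the next instruction into the same packet, and a 0 ends it:

```python
def split_execute_packets(fetch_packet):
    # Walk a fetch packet of 32-bit instruction words, grouping them
    # into execute packets: a 1 in the LSB (the p-bit) chains the
    # NEXT instruction into the same packet; a 0 ends the packet.
    packets, current = [], []
    for word in fetch_packet:
        current.append(word)
        if word & 1 == 0:          # p-bit clear: chain broken
            packets.append(current)
            current = []
    if current:                    # trailing chain (kept for safety;
        packets.append(current)    # well-formed code ends on a 0)
    return packets
```

Dispatching one of these groups per clock while the fetch unit waits for the whole fetch packet to drain is precisely the behaviour the paragraph above describes.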
ANALOG DEVICES ADSP-21XX FAMILY
OVERVIEW
The ADSP-21xx is the first single chip DSP processor family from Analog Devices. The
family consists of a large number of processors based on a common 16-bit fixed-point
architecture core with a 24-bit instruction word. Each processor combines the core DSP
architecture (computation units, data address generators, and program sequencer) with
differentiating features such as on-chip program and data RAM, a programmable
timer, and one or two serial ports.
The fastest members of the family operate at 75 MIPS at 2.5 volts, 52 MIPS at 3.3 volts,
and 40 MIPS at 5.0 volts. Analog Devices has recently announced the ADSP-219x series,
which offers projected speeds of up to 300 MIPS, as well as architectural enhancements.
ADSP-21xx processors are targeted at modem, audio, PC multimedia, and digital cellular
applications.
Fabricated in a high speed, submicron, double-layer metal CMOS process, the highest-
performance ADSP-21xx processors operate at 25 MHz with a 40 ns instruction cycle time.
Every instruction can execute in a single cycle. Fabrication in CMOS results in low power
dissipation. The ADSP-2100 Family's flexible architecture and comprehensive instruction
set support a high degree of parallelism.
The ADSP-21xx data path consists of three separate arithmetic execution units: an arithmetic/logic unit (ALU), a multiplier/accumulator (MAC), and a barrel shifter. Each
unit is capable of single-cycle execution, but only one of these units can be active during a
single instruction cycle. The ALU operates on 16-bit data. In addition to the usual ALU
operations, the ALU provides increment/decrement, absolute value, and add-with-carry
functions. ALU results are saturated upon overflow if the appropriate configuration bit is
set by the programmer. The MAC unit includes a 16x16->32-bit multiplier, four input
registers, a feedback register, a 40-bit adder, and a single 40-bit result register/accumulator
providing eight guard bits. Besides signed operands, the multiplier can operate on
unsigned/unsigned or on signed/unsigned operands, thus supporting multi-precision
arithmetic. The barrel shifter shifts 16-bit inputs from an input register or from the
ALU/MAC/barrel shifter result registers into a 32-bit result register. Logical and arithmetic shifts are supported left or right up to 32 bits. The barrel shifter also supports block
floating-point arithmetic with block exponent detect (which determines a maximum
exponent of a block of data), single-word exponent detect, normalize, and exponent adjust
instructions.
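Block floating-point hinges on the exponent-detect step: finding how far an entire block can be shifted left without overflow, limited by its largest-magnitude element. A sketch of that computation for 16-bit data (illustrative Python; the function names are invented, not ADSP-21xx mnemonics):

```python
def redundant_sign_bits(x, width=16):
    # Count leading bits equal to the sign bit (beyond the sign bit
    # itself): how far x can shift left without overflowing `width`.
    if x < 0:
        x = ~x                     # fold negatives onto the same scale
    n = 0
    while n < width - 1 and (x >> (width - 2 - n)) == 0:
        n += 1
    return n

def block_exponent(block):
    # Block exponent detect: the safe shift for the whole block is
    # limited by its largest-magnitude element.
    return min(redundant_sign_bits(v) for v in block)
```

Normalizing every element by this common shift keeps maximum precision while all values in the block share one exponent, which is the essence of block floating-point arithmetic.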
ADSP-21xx processors use a modified Harvard architecture with separate memory spaces
and on-chip bus sets for program and data. All processors in the ADSP-21xx family
include on-chip program RAM or ROM and on-chip data RAM.
On-chip program memory can be used for both instructions and data, and it can be accessed via a 14-bit address bus and a 24-bit data bus. On-chip program memory is dual-
ported to allow the processor to fetch both a data operand and the next instruction in a
single instruction cycle. The on-chip data memory can be accessed via a 14-bit address bus
and a 16-bit data bus. One access to the on-chip data memory can be performed in a single
instruction cycle. Three memory accesses (one instruction and two data operands) can be
performed in one instruction cycle.
Both of the on-chip memory spaces can be extended off-chip. All ADSP-21xx processors have one external memory interface, providing a 14-bit address bus and a 24-bit data bus.
This external interface is multiplexed between program and data memory accesses.
The ADSP-21xx supports register-direct, memory-direct and register-indirect addressing
modes. Immediate data is also supported. The ADSP-21xx provides zero-overhead
program looping through its DO instruction. Any length sequence of instructions can be
contained in a hardware loop, and up to 16,384 repetitions are supported.
ARCHITECTURE OVERVIEW
The processors contain three independent computational units: the ALU, the
multiplier/accumulator (MAC), and the shifter. The ALU performs a standard set of
arithmetic and logic operations; division primitives are also supported. The MAC performs
single-cycle multiply, multiply/add, and multiply/subtract operations. The shifter performs logical and arithmetic shifts, normalization, denormalization, and derive-exponent
operations. The shifter can be used to efficiently implement numeric format control
including multiword floating-point representations. The internal result (R) bus directly
connects the computational units so that the output of any unit may be used as the input of
any unit on the next cycle. A powerful program sequencer and two dedicated data address
generators ensure efficient use of these computational units. The sequencer supports
conditional jumps, subroutine calls, and returns in a single cycle. With internal loop
counters and loop stacks, the ADSP-21xx executes looped code with zero overhead; no
explicit jump instructions are required to maintain the loop. Two data address generators
(DAGs) provide addresses for simultaneous dual operand fetches (from data memory and
program memory). Each DAG maintains and updates four address pointers. Whenever a pointer is used to access data (indirect addressing), it is post-modified by the value of one
of four modify registers. A length value may be associated with each pointer to implement
automatic modulo addressing for circular buffers. The circular buffering feature is also
used by the serial ports for automatic data transfers to on-chip memory. Efficient data
transfer is achieved with the use of five internal buses, namely: the Program Memory
Address (PMA) Bus, Program Memory Data (PMD) Bus, Data Memory Address (DMA)
Bus, Data Memory Data (DMD) Bus, and the Result (R) Bus.
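The DAG behaviour described above — post-modify on every access, plus automatic modulo wrap when a length is associated with the pointer — can be modeled compactly (illustrative Python, not the ADSP-21xx register model; the class name is invented):

```python
class DAG:
    # Model of one address pointer in a data address generator: every
    # access returns the current address, then post-modifies it; when
    # a length is set, the pointer wraps automatically, implementing
    # circular buffering with no per-access software overhead.
    def __init__(self, base, modify, length=None):
        self.base, self.ptr = base, base
        self.modify, self.length = modify, length

    def next_address(self):
        addr = self.ptr
        self.ptr += self.modify                # post-modify
        if self.length is not None:            # modulo addressing
            self.ptr = self.base + (self.ptr - self.base) % self.length
        return addr
```

A filter's delay line set up this way never needs explicit wrap-around tests, which is exactly why the serial ports can reuse the same mechanism for automatic transfers.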
The two address buses (PMA, DMA) share a single external address bus, allowing memory
to be expanded off-chip, and the two data buses (PMD, DMD) share a single external data
bus. The BMS, DMS, and PMS signals indicate which memory space is using the external
buses. Program memory can store both instructions and data, permitting the ADSP-21xx to
fetch two operands in a single cycle, one from program memory and one from data
memory. The processor can fetch an operand from on-chip program memory and the next
instruction in the same cycle. The memory interface supports slow memories and
memory-mapped peripherals with programmable wait-state generation. External devices can
gain control of the processor's buses with the use of the bus request/grant signals.
One bus grant execution mode (GO Mode) allows the ADSP-21xx to continue running
from internal memory. A second execution mode requires the processor to halt while buses
are granted. Each ADSP-21xx processor can respond to several different interrupts. There
can be up to three external interrupts, configured as edge- or level-sensitive. Internal
interrupts can be generated by the timer, serial ports, and, on the ADSP-2111, the host
interface port. There is also a master RESET signal. Booting circuitry provides for loading
on-chip program memory automatically from byte-wide external memory. After reset, three
wait states are automatically generated. This allows, for example, a 60 ns ADSP-2101 to
use a 200 ns EPROM as external boot memory. Multiple programs can be selected and
loaded from the EPROM with no additional hardware. The data receive and transmit pins
on SPORT1 (Serial Port 1) can be alternatively configured as a general-purpose input flag and output flag. You can use these pins for event signalling to and from an external device.
A programmable interval timer can generate periodic interrupts. A 16-bit count register
(TCOUNT) is decremented every n cycles, where n-1 is a scaling value stored in an 8-bit
register (TSCALE). When the value of the count register reaches zero, an interrupt is
generated and the count register is reloaded from a 16-bit period register (TPERIOD).
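The timer's operation can be checked with a small simulation. The sketch below (illustrative Python, not a cycle-accurate device model) decrements TCOUNT every TSCALE+1 cycles and reloads it from TPERIOD when it reaches zero, counting the interrupts generated:

```python
def timer_interrupts(tcount, tscale, tperiod, cycles):
    # Model of the ADSP-21xx interval timer: TCOUNT is decremented
    # every (TSCALE + 1) clock cycles; when it reaches zero an
    # interrupt fires and TCOUNT is reloaded from TPERIOD.
    interrupts, prescale = 0, tscale
    for _ in range(cycles):
        if prescale == 0:
            prescale = tscale          # reload the prescaler
            tcount -= 1
            if tcount == 0:
                interrupts += 1        # periodic interrupt
                tcount = tperiod       # reload from the period register
        else:
            prescale -= 1
    return interrupts
```

With TSCALE = 0 the counter decrements on every cycle, so TCOUNT = TPERIOD = 3 yields one interrupt every three cycles.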
BLACKFIN PROCESSOR
Blackfin Processors are a new breed of embedded media processor. Based on the Micro
Signal Architecture (MSA) jointly developed with Intel Corporation, Blackfin Processors
combine a 32-bit RISC-like instruction set and dual 16-bit multiply accumulate (MAC)
signal processing functionality with the ease-of-use attributes found in general-purpose
microcontrollers. This combination of processing attributes enables Blackfin Processors to
perform equally well in both signal processing and control processing applications, in many
cases eliminating the need for separate heterogeneous processors.
This processor family also offers industry-leading power efficiency, as low as 0.15 mW/MMAC at 0.8 V. This combination of high performance and low power is
essential in meeting the needs of today's and future signal processing applications including
broadband wireless, audio/video capable Internet appliances, and mobile communications.
HIGH PERFORMANCE SIGNAL PROCESSING
The core architecture employs a fully interlocked instruction pipeline, multiple parallel computational blocks, efficient DMA capability, and instruction set enhancements designed to accelerate video processing.
FULLY INTERLOCKED INSTRUCTION PIPELINE
All Blackfin Processors utilize a multi-stage fully interlocked pipeline that guarantees code
is executed as you would expect and that all data hazards are hidden from the programmer.
This type of pipeline guarantees result accuracy by stalling when necessary to achieve
proper results.
HIGHLY PARALLEL COMPUTATIONAL BLOCKS
The basis of the Blackfin Processor architecture is the Data Arithmetic Unit that includes two 16-bit Multiplier Accumulators (MACs), two 40-bit Arithmetic Logic Units (ALUs),
four 8-bit video ALUs, and a single 40-bit barrel shifter. Each MAC can perform a 16-bit
by 16-bit multiply on four independent data operands every cycle. The 40-bit ALUs can operate on either two 40-bit numbers or four 16-bit numbers per cycle. With this architecture, 8-, 16-, and 32-bit data word sizes can be processed natively for maximum efficiency.
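The arithmetic above can be illustrated with a short sketch of one MAC step. This is plain C, not Blackfin assembly, and the 40-bit modelling in an int64_t is our own device; the 8 guard bits above bit 31 are what let many full-scale products accumulate without overflow.

```c
#include <assert.h>
#include <stdint.h>

/* Keep the low 40 bits of x, sign-extended (arithmetic right shift of a
   signed value is assumed, as on all mainstream compilers). */
static int64_t sext40(int64_t x) {
    return (int64_t)((uint64_t)x << 24) >> 24;
}

/* One MAC step: signed 16x16 multiply accumulated into a 40-bit value. */
static int64_t mac40(int64_t acc, int16_t a, int16_t b) {
    return sext40(acc + (int32_t)a * (int32_t)b);
}
```

Even 256 consecutive full-scale products (0x7FFF * 0x7FFF each) fit inside the 40-bit accumulator, which is exactly the headroom the guard byte provides.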
Two Data Address Generators (DAGs) are complex load/store units designed to generate
addresses to support sophisticated DSP filtering operations. For DSP addressing, bit-reversed addressing and circular buffering are supported. The DAGs also include two loop
counters for nested zero overhead looping and hardware support for on-the-fly saturation
and clipping.
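Bit-reversed addressing, mentioned above, is easiest to see in software. A DAG does this in hardware with a reverse-carry adder; the loop below is only an illustrative sketch of the index transformation an FFT needs to reorder its inputs or outputs.

```c
#include <assert.h>
#include <stdint.h>

/* Reverse the low `bits` bits of an index: index 001 -> 100, etc. */
static uint32_t bit_reverse(uint32_t index, int bits) {
    uint32_t r = 0;
    for (int i = 0; i < bits; i++) {
        r = (r << 1) | (index & 1);  /* move the lowest bit to the top */
        index >>= 1;
    }
    return r;
}
```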
HIGH BANDWIDTH DMA CAPABILITY
All Blackfin Processors have multiple, independent DMA controllers that support
automated data transfers with minimal overhead from the processor core. DMA transfers
can occur between the internal memories and any of the many DMA-capable peripherals.
VIDEO INSTRUCTIONS
In addition to native support for 8-bit data, the word size common to many pixel processing
algorithms, the Blackfin Processor architecture includes instructions specifically defined to enhance performance in video processing applications. The enhanced instructions accelerate common video compression algorithms.
EFFICIENT CONTROL PROCESSING
The Blackfin architecture offers control processing efficiency similar to that of RISC control processors. These features
include a hierarchical memory architecture, superior code density, and a variety of
microcontroller-style peripherals including a watch-dog timer, real-time clock, and an
integrated SDRAM controller. The L1 memory is connected directly to the processor core,
runs at full system clock speed, and offers maximum system performance for time critical
algorithm segments. The L2 memory is a larger, bulk memory storage block that offers slightly reduced performance, but is still faster than off-chip memory.
The L1 memory structure has been implemented to provide the performance needed for
signal processing while offering the programming ease found in general purpose
microcontrollers. By supporting both SRAM and cache programming models, system
designers can allocate critical DSP data sets that require high bandwidth and low latency
into SRAM, while maintaining the simple programming model of the data cache for
operating system (OS) and microcontroller code.
The Memory Management Unit provides for a memory protection format that can support a
full OS Kernel. The OS Kernel runs in Supervisor mode and partitions blocks of memory
and other system resources for the actual application software to run in User mode. This is
a unique and powerful feature not present on traditional DSPs.
SUPERIOR CODE DENSITY
The Blackfin Processor architecture supports multi-length instruction encoding. Very
frequently used control-type instructions are encoded as compact 16-bit words, with more
mathematically intensive DSP instructions encoded as 32-bit values.
DYNAMIC POWER MANAGEMENT
Blackfin Processors employ multiple power-saving techniques. They are based on a gated
clock core design that selectively powers down functional units on an instruction-by-
instruction basis. They also support multiple power-down modes for periods where little or
no CPU activity is required. Lastly, and probably most importantly, Blackfin Processors
support a dynamic power management scheme whereby the operating frequency and
voltage can be tailored to meet the performance requirements of the algorithm currently
being executed.
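The reason frequency and voltage are scaled together follows from the dynamic CMOS power relation P = C * V^2 * f: a voltage reduction pays off quadratically. The helper below just evaluates that ratio; the numbers in the usage are illustrative, not Blackfin datasheet figures.

```c
#include <assert.h>

/* Dynamic power relative to nominal, given voltage and frequency ratios.
   Derived from P = C * V^2 * f with C held constant. */
static double relative_power(double v_ratio, double f_ratio) {
    return v_ratio * v_ratio * f_ratio;
}
```

Halving both frequency and voltage cuts dynamic power to one eighth of nominal, whereas halving frequency alone only halves it, which is why the scheme adjusts both.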
BLACKFIN PROCESSOR CORE BASICS
The Blackfin Processor core is a load-store architecture consisting of a Data Arithmetic
Unit, an Address Arithmetic Unit, and a sequencer unit. Blackfin Processors combine a
high-performance, dual-MAC DSP architecture with the programming ease of a RISC MCU into a single instruction set architecture.
GENERAL PURPOSE REGISTER FILES
The Blackfin Processor core includes an 8-entry by 32-bit data register file for general use
by the computational units. Supported data types include 8-, 16-, or 32-bit signed or
unsigned integer and 16- or 32-bit signed fractional. In every clock cycle, this multiported register file supports two 32-bit reads and two 32-bit writes. It can also be accessed as a 16-entry by 16-bit data register file.
The address register file provides a general purpose addressing mechanism in addition to
supporting circular buffering and stack maintenance. This register file consists of 8 entries
and includes a frame pointer and a stack pointer. The frame pointer is useful for subroutine
parameter passing, while the stack pointer is useful for storing the return address from
subroutine calls.
DATA ARITHMETIC UNIT
It contains:
Two 16-bit MACs
Two 40-bit ALUs
Four 8-bit video ALUs
Single barrel shifter
All computational resources can process 8-, 16-, or 32-bit operands from the data register file, R0 through R7. Each register can be accessed as a 32-bit register or as a 16-bit register
high or low half.
In a single clock cycle, the dual data paths can read and write up to two 32-bit values. However, since the high and low halves of the R0 through R7 registers are individually addressable (Rx, Rx.H, or Rx.L), each computational block can choose from either two 32-bit input values or four 16-bit input values with no restrictions on input data. The results of the computation can be written back into the register file as either a 32-bit entity or as the high or low 16-bit half of the register. Additionally, the method of accumulation can vary between data paths.
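The Rx / Rx.H / Rx.L register views can be modelled with masks and shifts on a 32-bit value. The helper names below are our own; only the Rx.H/Rx.L naming comes from the architecture.

```c
#include <assert.h>
#include <stdint.h>

/* Read and write the 16-bit halves of a 32-bit register value. */
static uint16_t reg_lo(uint32_t r) { return (uint16_t)(r & 0xFFFFu); }
static uint16_t reg_hi(uint32_t r) { return (uint16_t)(r >> 16); }
static uint32_t set_lo(uint32_t r, uint16_t v) {
    return (r & 0xFFFF0000u) | v;                    /* write only Rx.L */
}
static uint32_t set_hi(uint32_t r, uint16_t v) {
    return (r & 0x0000FFFFu) | ((uint32_t)v << 16);  /* write only Rx.H */
}
```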
Both accumulators are 40 bits in length, providing 8 bits of extended precision. Similar to
the general purpose registers, both accumulators can be accessed in 16-, 32-, or 40-bit
increments. The Blackfin architecture also supports a combined add/subtract instruction
that can generate two 16-, 32-, or 40-bit results or four 16-bit results. In the case where four
16-bit results are desired, the high and low half results can be interchanged. This is a very
powerful capability and significantly improves, for instance, the FFT benchmark results.
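The reason the combined add/subtract helps FFTs is that a radix-2 butterfly needs both (a + b) and (a - b) from the same operand pair, which the single instruction produces at once. The C sketch below only shows the data flow for one 16-bit pair; wrap-around arithmetic is assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Produce sum and difference of one operand pair in a single step,
   mirroring what the combined add/subtract instruction computes. */
static void add_sub16(int16_t a, int16_t b, int16_t *sum, int16_t *diff) {
    *sum  = (int16_t)(a + b);
    *diff = (int16_t)(a - b);
}
```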
ADDRESS ARITHMETIC UNIT
Two data address generators (DAGs) provide addresses for simultaneous dual operand
fetches from memory. The DAGs share a register file that contains four sets of 32-bit index (I), length (L), base (B), and modify (M) registers. There are also eight additional 32-bit address registers (P0 through P5, the frame pointer, and the stack pointer) that can be used as pointers for general indexing of variables and stack locations.
The four sets of I, L, B, and M registers are useful for implementing circular buffering.
Used together, each set of index, length, and base registers can implement a unique circular buffer in internal or external memory. The Blackfin architecture also supports a variety of addressing modes, including indirect, auto increment and decrement, indexed, and bit
addressing modes, including indirect, auto increment and decrement, indexed, and bit
reversed. Last, all address registers are 32 bits in length, supporting the full 4 Gbyte
address range of the Blackfin Processor architecture.
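A circular-buffer post-modify using the I, L, B, and M registers described above can be sketched as follows. The register names follow the text; the single-wrap assumption (|M| < L) and the function itself are ours, not hardware behaviour.

```c
#include <assert.h>
#include <stdint.h>

/* Advance index I by modify value M, wrapping inside buffer [B, B + L). */
static uint32_t circ_update(uint32_t i, uint32_t b, uint32_t l, int32_t m) {
    int64_t next = (int64_t)i + m;
    if (next >= (int64_t)b + l) next -= l;   /* wrapped past the top   */
    else if (next < (int64_t)b) next += l;   /* wrapped below the base */
    return (uint32_t)next;
}
```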
PROGRAM SEQUENCER UNIT
The program sequencer controls the flow of instruction execution and supports conditional
jumps and subroutine calls, as well as nested zero-overhead looping. A multistage, fully interlocked pipeline guarantees code is executed as expected and that all data hazards are hidden from the programmer. This type of pipeline guarantees result accuracy by stalling when necessary to achieve proper results.
The Blackfin architecture supports 16- and 32-bit instruction lengths in addition to limited
multi-issue 64-bit instruction packets. This ensures maximum code density by encoding the
most frequently used control instructions as compact 16-bit words and the more
challenging math operations as 32-bit double words.
LSI LOGIC ZSP600-QUAD MAC SUPERSCALAR CORE
OVERVIEW
The ZSP600 is a quad-MAC superscalar DSP core that addresses the high-performance data throughput and signal processing requirements of emerging communications platforms. The ZSP600 supports up to six-instruction-per-cycle (IPC) DSP performance at a peak rate of 300 MHz. It includes quad MAC and quad ALU computational resources, a high-performance load/store memory architecture, and dedicated co-processor interfaces, combined with state-of-the-art power reduction techniques. These attributes make the ZSP600 core an ideal solution for a variety of embedded DSP algorithms, including those required for wireless infrastructure, mobile (3G), IAD/home gateway, central office, and access/network applications. ZSP600 instruction parallelism is supported by user-transparent instruction grouping and pipeline control to deliver superscalar DSP performance while programming with a RISC instruction set.
The ZSP600 is a fully synthesizable, single-phase clocked architecture, with all core I/Os registered for ease of process migration and design flexibility. The ZSP600 provides
extensive computational resources, including four 16-bit multipliers/MACs, dual 40-bit
ALUs, and dual 16-bit ALUs, all capable of supporting 16- and 32-bit operations. The
ZSP600 can perform four independent 16x16 MUL/MAC operations into four 16-bit or
two 40-bit results, two 32x32-bit MUL/MACs into a 32-bit result, or two Viterbi (add-
compare-select) results per cycle. The ZSP600 is based upon a high-bandwidth memory architecture with a separate eight-instruction-per-cycle prefetch interface and dual 64-bit data interfaces over a 24-bit address space. The instruction memory architecture allows multi-instruction-per-cycle prefetch into an integrated instruction cache. The data memory architecture incorporates dual independent 64-bit load/store units, with dedicated address generation, allowing up to eight 16-bit word or four 32-bit word load/store operations per cycle. The ZSP600 integrates a bi-directional co-processor interface to support hardware acceleration. The memory subsystem (MSS) is decoupled from the DSP operations to provide increased flexibility in support of different memory schemes. The core also includes instruction set enhancements to the RISC architecture for improved broadband and wireless application support.
A WORD ON SUPERSCALAR DSP
A superscalar architecture simply implies that the architecture is responsible for resolving
the operand and resource hazards and that it has the resources to achieve an instruction throughput greater than one instruction per clock. Logic dedicated to pipeline control
is kept to a minimum by enforcing in-order execution and by isolating the control to a
single stage at the head of the pipeline. This stage issues sequential groups of instructions
that have no data dependencies or other resource conflicts. Once a group of instructions has
been issued, they advance through the pipeline in lock step.
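The grouping rule just described can be sketched in a few lines: issue consecutive instructions together until one reads a register written by an earlier instruction in the same group (a read-after-write hazard). The three-operand encoding below is invented for illustration, and write-write and resource conflicts are ignored to keep the sketch short.

```c
#include <assert.h>

/* Toy instruction: one destination register, two source registers. */
typedef struct { int dst, src1, src2; } insn;

/* How many instructions starting at insns[0] can issue as one group. */
static int group_size(const insn *insns, int n, int max_issue) {
    int limit = n < max_issue ? n : max_issue;
    for (int i = 1; i < limit; i++) {
        for (int j = 0; j < i; j++) {
            if (insns[i].src1 == insns[j].dst ||
                insns[i].src2 == insns[j].dst)
                return i;   /* RAW hazard: close the group before insn i */
        }
    }
    return limit;
}
```

For instance, r1=r2+r3 and r4=r5+r6 can issue together, but r7=r1+r4 must wait for the next group because it reads both earlier results.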
A VLIW machine does not employ instruction scheduling or pipeline protection.
Instructions in a VLIW pipeline are statically issued, and it is the programmer's responsibility to prevent data hazards and resource conflicts. Superscalar architectures also facilitate software compatibility, not only between implementations of the same architecture but also from one generation of the architecture to the next, thus extending software lifetime.
ARCHITECTURE OVERVIEW
The G2 architecture is scalable in terms of arithmetic resources, data bandwidth, and
pipeline capacity. This scalable nature allows the architecture to support multiple implementations that target different application spaces.
All address and data I/O communication across the core boundary is registered. This feature is highly desirable from an SOC system designer's point of view for a number of reasons, one being the removal of timing budget ambiguities between system logic and the core.
The prefetch unit (PFU) sits at the head of the instruction pipeline. The ZSP600 can prefetch
eight 16-bit words per cycle. It is responsible for maximizing the probability that the
instruction cache has the data required by the instruction sequencing unit (ISU) for any
given fetch cycle. The prefetch unit performs limited decoding to identify code
discontinuities and to apply static branch prediction when necessary. The ISU is
responsible for instruction fetch and decode, instruction grouping, and instruction issue.
Instruction grouping refers to the pipeline stage in which operand dependencies are
resolved. The ISU issues groups of in-order instructions that will not cause any operand
conflicts. This is the only unit (and only stage in the execution pipeline) that enforces
pipeline protection. Isolating the pipeline protection logic in this manner simplifies pipeline
control logic significantly.
The ZSP600 ISU can issue up to six instructions per cycle, one to each of the six primary
datapaths: two address generation units (AGUs), two arithmetic logic units (ALUs), and
two multiply/accumulate/arithmetic units (MAUs) that are capable of performing up to four MAC operations per cycle. The pipeline control unit (PCU) stages control associated with each of the primary data paths and the bypass logic. The PCU is also responsible for managing interrupt control, the co-processor interface, the debug interface, and the on-core timers. The bypass unit (BYP) handles all the data forwarding between execution units.
PIPELINE
The pipeline of the G2 architecture is an eight-stage pipeline. The existing architecture uses a data prefetch mechanism, called data linking, to efficiently sustain the required data bandwidth for its dual MAC. All pipeline protection and resource allocation is performed
during the grouping stage. Instruction groups are issued by the grouping stage and advance
in lock step down the remainder of the pipeline.
Data address generation is performed in the AG stage. This stage is also responsible for
enforcing the boundaries of the circular buffers. A load or store that straddles a boundary of
the circular buffer is split by the AGU into two sequential accesses. Stages M0 and M1 are
allocated for data memory loads. They are optimized for systems using synchronous RAM.
M0 is allocated for address decode and M1 for data access and return. Load and store
requests are registered and issued to the memory subsystem in M0. The memory interface is stallable. If the MSS determines that it cannot return requested data during M1, it stalls the core until the data is ready.
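The AG-stage split described above, where a load or store straddling a circular-buffer boundary becomes two sequential accesses, can be sketched as follows. The struct and word-based units are our own simplification, not the actual AGU interface.

```c
#include <assert.h>

/* One memory access: starting address and size, in words. */
typedef struct { unsigned addr, size; } mem_access;

/* Split an access of `size` words at `addr` inside circular buffer
   [base, base + len). Returns the number of accesses written to out. */
static int split_access(unsigned addr, unsigned size,
                        unsigned base, unsigned len, mem_access out[2]) {
    unsigned top = base + len;
    if (addr + size <= top) {                 /* fits in one access */
        out[0].addr = addr; out[0].size = size;
        return 1;
    }
    out[0].addr = addr; out[0].size = top - addr;         /* up to boundary */
    out[1].addr = base; out[1].size = size - out[0].size; /* wrapped part   */
    return 2;
}
```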
ARITHMETIC RESOURCES
By adding two AGUs, along with dedicated address registers, the arithmetic throughput of G2 demonstrates an immediate improvement. The two AGUs allow the core to issue any
combination of two loads or stores per cycle. The data size of the load/store is
implementation specific. Each data port in the ZSP600 is 64-bits wide, allowing a total of
128-bits (8 words) of data to be loaded per cycle. The AGUs have dedicated hardware to
support four circular buffers and reverse-carry addressing. The circular buffer support has
been enhanced in functionality to support load/store operations with positive and negative offsets and signed indexes. Circular buffer logic also applies to address arithmetic and has no alignment restrictions.
REGISTER RESOURCES
With the 32-bit address registers, the architecture allows implementations of the core to
remain flexible in defining the physical linear address space. The actual address register
remains a 32-bit register to ensure pointer sizes remain the same from one implementation
to the next. This also allows the address registers to be used as temporary registers for the
GPRs. Dedicated address registers simplify the instruction decoder and issue logic as it can
now identify address related operations and assign the datapath resources appropriately.
The primary operand resource of the AGUs is the address register file, allowing the
general-purpose register file to be physically optimized for data moving to and from the
ALUs and MAUs. The current generation defines two 32-bit registers and another 16-bit register whose low and high bytes correspond to the upper byte of each accumulator, respectively, thus resulting in a 40-bit accumulator. A guard byte is now available for each of the eight extended 32-bit registers of the GPRs. Accumulators are also recognized in the
programming model by providing associated instruction set support for 40-bit arithmetic
and data loads and stores.
INSTRUCTION SET ENHANCEMENTS
A powerful enhancement to the new architecture is the ability to conditionally execute
instructions. The programming model for G2 allows programmers to define packets of
instructions that are predicated on a specified condition. The programmer then defines a
bracketed set of up to eight instructions that will be predicated in the execution pipeline.