8/7/2019 Sanjay(High Speed Dsp Architectures)
1/39
High Performance DSP Architectures
CHAPTER 1
EVOLUTION OF DSP PROCESSORS
INTRODUCTION
Digital Signal Processing is carried out by mathematical operations. Digital Signal Processors are microprocessors specifically designed to handle Digital Signal Processing
tasks. These devices have seen tremendous growth in the last decade, finding use in
everything from cellular telephones to advanced scientific instruments. In fact, hardware
engineers use "DSP" to mean Digital Signal Processor, just as algorithm developers use
"DSP" to mean Digital Signal Processing.
DSP has become a key component in many consumer, communications, medical, and
industrial products. These products use a variety of hardware approaches to implement
DSP, ranging from the use of off-the-shelf microprocessors to field-programmable gate
arrays (FPGAs) to custom integrated circuits (ICs). Programmable DSP processors, a
class of microprocessors optimized for DSP, are a popular solution for several reasons.
In comparison to fixed-function solutions, they have the advantage of potentially being
reprogrammed in the field, allowing product upgrades or fixes. They are often more cost-
effective than custom hardware, particularly for low-volume applications, where the
development cost of ICs may be prohibitive. DSP processors often have an advantage in
terms of speed, cost, and energy efficiency.
DSP ALGORITHMS MOULD DSP ARCHITECTURES
From the outset, DSP processor architectures have been moulded by DSP algorithms. For
nearly every feature found in a DSP processor, there are associated DSP algorithms whose
computation is in some way eased by inclusion of this feature. Therefore, perhaps the best
way to understand the evolution of DSP architectures is to examine typical DSP
algorithms and identify how their computational requirements have influenced the
architectures of DSP processors.
FAST MULTIPLIERS
The FIR filter is mathematically expressed as a dot product between a vector of input data and a vector of filter coefficients. For each tap of the filter, a data sample is multiplied by a filter coefficient, and the result is added to a running sum over all of the taps. Hence, the main
component of the FIR filter algorithm is a dot product: multiply and add, multiply and add.
These operations are not unique to the FIR filter algorithm; in fact, multiplication is one of
the most common operations performed in signal processing. Convolution, IIR filtering, and
Fourier transforms all involve heavy use of multiply-accumulate operations.
Originally, microprocessors implemented multiplications by a series of shift and add
operations, each of which consumed one or more clock cycles. As might be expected, faster
multiplication hardware yields faster performance in many DSP algorithms, and for this
reason all modern DSP processors include at least one dedicated single-cycle multiplier or
combined multiply-accumulate (MAC) unit.
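As a concrete sketch, the FIR dot product described above can be written in C; on a DSP, each loop iteration maps onto the single-cycle MAC unit (the function name and tap count are illustrative):

```c
#include <stdint.h>

/* Illustrative FIR filter kernel: one multiply-accumulate per tap.
   x holds the input samples, h the filter coefficients. */
int32_t fir_filter(const int16_t *x, const int16_t *h, int ntaps)
{
    int32_t acc = 0;                  /* running sum for all of the taps */
    for (int k = 0; k < ntaps; k++)
        acc += (int32_t)x[k] * h[k];  /* multiply and add */
    return acc;
}
```

On a general-purpose processor each multiply here may take several cycles; a dedicated MAC unit retires one tap per cycle.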
Department Of Electronics & Communication Engineering, GEC Thrissur. 1
DATA FORMAT
DSP applications typically must pay careful attention to numeric fidelity. Since numeric
fidelity is far more easily maintained using a floating-point format, it may seem surprising
that most DSP processors use a fixed-point format. DSP processors tend to use the shortest
data word that will provide adequate accuracy in their target applications. Most fixed-point
DSP processors use 16-bit data words, because that data word width is sufficient for many DSP applications. A few fixed-point DSP processors use 20, 24, or even 32 bits to enable
better accuracy in applications that are difficult to implement well with 16-bit data, such as
high-fidelity audio processing.
To ensure adequate signal quality while using fixed-point data, DSP processors typically
include specialized hardware to help programmers maintain numeric fidelity throughout a
series of computations. For example, most DSP processors include one or more
accumulator registers to hold the results of summing several multiplication products.
Accumulator registers are typically wider than other registers; they often provide extra bits,
called guard bits, to extend the range of values that can be represented and thus avoid
overflow. In addition, DSP processors usually include good support for saturation
arithmetic, rounding, and shifting, all of which are useful for maintaining numeric fidelity.
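A minimal C sketch of these fidelity aids, assuming 16-bit data and a 64-bit software accumulator standing in for a hardware accumulator with guard bits (function names are illustrative):

```c
#include <stdint.h>

/* Accumulate a 16x16 -> 32-bit product into a wide accumulator; the bits
   above bit 31 play the role of guard bits, absorbing intermediate
   overflow across a long series of additions. */
int64_t mac_guarded(int64_t acc, int16_t a, int16_t b)
{
    return acc + (int32_t)a * b;
}

/* Saturation arithmetic: clamp the wide result to the 32-bit range
   instead of letting it wrap around. */
int32_t saturate32(int64_t acc)
{
    if (acc > INT32_MAX) return INT32_MAX;
    if (acc < INT32_MIN) return INT32_MIN;
    return (int32_t)acc;
}
```

A hardware accumulator performs both steps implicitly; the point of the sketch is that overflow is deferred until the final result is extracted.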
ZERO-OVERHEAD LOOPING
DSP algorithms typically spend the vast majority of their processing time in relatively
small sections of software that are executed repeatedly; i.e., in loops. Hence, most DSP
processors provide special support for efficient looping. Often, a special loop or repeat
instruction is provided which allows the programmer to implement a for-next loop without
expending any clock cycles for updating and testing the loop counter or branching back to
the top of the loop. This feature is often referred to as Zero-overhead looping.
STREAMLINED I/O
Finally, to allow low-cost, high-performance input and output, most DSP processors
incorporate one or more specialized serial or parallel I/O interfaces, and streamlined I/O
handling mechanisms, such as low-overhead interrupts and direct memory access (DMA),
to allow data transfers to proceed with little or no intervention from the processor's
computational units.
SPECIALIZED INSTRUCTION SETS
DSP processor instruction sets have traditionally been designed with two goals in mind.
The first is to make maximum use of the processor's underlying hardware, thus increasing
efficiency. The second goal is to minimize the amount of memory space required to store
DSP programs, since DSP applications are often quite cost-sensitive and the cost of
memory contributes substantially to overall chip and/or system cost. To accomplish the
first goal, conventional DSP processor instruction sets generally allow the programmer to
specify several parallel operations in a single instruction, typically including one or two
data fetches from memory in parallel with the main arithmetic operation. With the second
goal in mind, instructions are kept short by restricting which registers can be used with
which operations, and restricting which operations can be combined in an instruction.
CHAPTER 2
TRADITIONAL SOLUTIONS FOR REAL TIME PROCESSING
DSP architecture designs have traditionally focused on meeting real-time constraints. Advanced signal processing algorithms, such as those in base station receivers, present difficulties to the designer due to algorithmic complexity, higher data rates, and the desire for more channels per hardware module. A key constraint from the
manufacturing point of view is attaining a high channel density.
Traditionally, real-time architecture designs employ a mix of DSPs, Co-processors,
FPGAs, ASICs and Application Specific Standard Parts (ASSPs) for meeting real-time
requirements in high performance applications. The chip rate processing is handled by the
ASSP, ASIC or FPGA while the DSPs handle the symbol rate processing and use co-
processors for decoding. The DSP can also implement parts of the MAC layers and control
protocols or can be assisted by a RISC processor.
However, dynamic variations in the system workload such as variations in the number of
users in wireless base-stations, will require a dynamic re-partitioning of the algorithms
which may not be possible to implement in traditional FPGAs and ASICs in real-time.
LIMITATIONS OF SINGLE PROCESSOR DSP ARCHITECTURES
Single-processor DSPs can have only a limited number of arithmetic units and cannot directly extend their architectures to hundreds of arithmetic units. This is because, as the number of arithmetic
units in an architecture increases, the size of the register files and the port interconnections
start to dominate the architecture.
PROGRAMMABLE MULTIPROCESSOR DSP ARCHITECTURES
Multiprocessor architectures can be classified into Single Instruction Multiple Data (SIMD)
and Multiple Instruction Multiple Data (MIMD) architectures. Data-parallel DSPs exploit
data parallelism, instruction-level parallelism, and subword parallelism. Alternative levels of
parallelism, such as thread-level parallelism, exist and can be considered after this
architecture space has been fully studied and explored.
MULTI-CHIP MIMD PROCESSORS
Each processor in a loosely coupled system has a set of I/O devices and a large local
memory. Processors communicate by exchanging messages using some form of message-
transfer system. Loosely coupled systems are efficient when interaction between tasks is
minimal. The tradeoffs of this processor design have been increased programming
complexity and the need for high I/O bandwidth and inter-processor support. Such MIMD solutions are also difficult to scale with the number of processors (e.g., the TI C4XX).
Figure: Register file explosion in traditional DSPs with centralized register files.
The disadvantages of the multi-chip MIMD model and architectures are the following:
1. Load-balancing algorithms for such MIMD architectures are not straightforward, as with heterogeneous systems. This makes it difficult to partition algorithms on this architecture model, especially when the workload changes dynamically.
2. The loosely coupled model is not scalable with the number of processors due to interconnection and I/O bandwidth issues.
3. I/O impacts the real-time performance and power consumption of the architecture.
4. Designing a compiler for a MIMD model on a loosely coupled architecture is difficult, so the burden of deciding how to partition the algorithm across the multiprocessor is left to the programmer.
SINGLE-CHIP MIMD PROCESSORS
Single-chip MIMD processors can be classified into three categories: single-threaded chip multiprocessors (CMPs), multi-threaded multiprocessors (MTs), and clustered VLIW
architectures. A CMP integrates two or more complete processors on a single chip.
Therefore, every unit of a processor is duplicated and used independently of its copies.
In contrast, a multi-threaded processor interleaves the execution of instructions of various
threads of control in the same pipeline. Therefore, multiple program counters are available
in the fetch unit and multiple register contexts are stored on the chip. Multi-threading increases the instruction-level parallelism available to the arithmetic units by providing access to more than one independent instruction stream. The programmer is responsible for scheduling the threads of the application.
Clustered VLIW architectures solve the register explosion problem by employing clusters of functional units and register files.
Clustering improves cycle time in two ways: by reducing the distance signals have to
travel within a cycle and by reducing the load on the bus. Clustering is beneficial for
applications that have limited inter-cluster communication. However, compiling for
clustered VLIW architectures can be difficult, since the compiler must schedule across the
clusters while minimizing inter-cluster operations and their latency.
Although single chip MIMD architectures eliminate the I/O bottleneck between multiple
processors, the load balancing and architecture scaling issues still remain. The availability
of data parallelism in signal processing applications is not utilized efficiently in MIMD
architectures.
SIMD ARRAY PROCESSORS
In SIMD processing, identical processors in the architecture execute the same instruction but work on different sets of data in parallel. An SIMD array
processor refers to processor designs targeted at the implementation of arrays or
matrices. There are various interconnection methodologies used for array
processors, such as linear arrays (vectors), rings, stars, trees, meshes, systolic arrays, and
hypercubes; examples include the Illiac-IV and the Burroughs Scientific Processor (BSP). Although
vector processors have been the most popular version of array processors, mesh-based processors are still used in scientific computing.
SIMD VECTOR PROCESSORS
Data parallelism allows vector processors to approach the performance and power efficiency
of custom designs, while simultaneously providing the flexibility of a programmable
processor. Vector machines were the first attempt at building supercomputers, starting
with the Cray-1. These processors execute vector instructions, such as vector adds
and multiplications, out of a vector register file. The number of memory banks is equal to
the number of processors, so that all processors can access memory in parallel.
DATA-PARALLEL DSPS
Data-parallel DSPs are architectures that exploit data parallelism as well as instruction-level parallelism. Stream processors are state-of-the-art
programmable architectures aimed at media processing applications. Stream processors
enhance data-parallel DSPs by providing a bandwidth hierarchy for data flow in signal
processing applications, enabling support for hundreds of arithmetic units in the data-
parallel DSP.
PIPELINING MULTIPLE PROCESSORS
An alternate method to attain high data rates is to provide multiple processors that are
pipelined. Such processors would be able to take advantage of the streaming flow of data
through the system. The disadvantages of such a design are that the architecture would
need to be carefully designed to match the system throughput and is not flexible enough to
adapt to changes in system workload. Also, such a pipelined system would be difficult to
program and would suffer from I/O bottlenecks unless implemented as an SoC. However, this is the
only way to provide the desired system performance if the amount of parallelism exploited
does not otherwise meet the system requirements.
CHAPTER 3
CURRENT DSP LANDSCAPE
CONVENTIONAL DSP PROCESSORS
The performance and price range among DSP processors is very wide. In the low-cost,
low-performance range are the industry workhorses, which are based on conventional DSP
architecture. They issue and execute one instruction per clock cycle, and use the complex,
multi-operation type of instructions described earlier. These processors typically include a
single multiplier or MAC unit and an ALU, but few additional execution units, if any.
Included in this group are Analog Devices ADSP-21xx family, Texas Instruments
TMS320C2xx family, and Motorola's DSP560xx family. These processors generally
operate at around 20-50 MHz, and provide good DSP performance while maintaining very
modest power consumption and memory usage. Midrange DSP processors achieve higher
performance than the low-cost DSPs described above through a combination of increased clock speeds and somewhat more sophisticated architectures.
ENHANCED CONVENTIONAL DSP PROCESSORS
DSP processor architects improved performance by extending conventional DSP
architectures by adding parallel execution units, typically a second multiplier and adder.
These hardware enhancements are combined with an extended instruction set that takes
advantage of the additional hardware by allowing more operations to be encoded in a single instruction and executed in parallel. We refer to this type of processor as an enhanced-
conventional DSP processor, because it is based on the conventional DSP processor
architectural style rather than being an entirely new approach. With this increased
parallelism, enhanced-conventional DSP processors can execute significantly more work
per clock cycle; for example, two MACs per cycle instead of one.
Enhanced-conventional DSP processors typically have wider data buses to allow them to
retrieve more data words per clock cycle in order to keep the additional execution units fed.
They may also use wider instruction words to accommodate specification of additional
parallel operations within a single instruction.
MULTI-ISSUE ARCHITECTURES
With the goals of achieving high performance and creating an architecture that lends itself
to the use of compilers, some newer DSP processors use a multi-issue approach.
In contrast to conventional and enhanced-conventional processors, multi-issue processors
use very simple instructions that typically encode a single operation. These processors
achieve a high level of parallelism by issuing and executing instructions in parallel groups
rather than one at a time. Using simple instructions simplifies instruction decoding and
execution, allowing multi-issue processors to execute at higher clock rates than
conventional or enhanced-conventional DSP processors (e.g., the TMS320C62xx). The two classes of architectures that execute multiple instructions in parallel are referred to as
VLIW and superscalar. These architectures are quite similar, differing mainly in how
instructions are grouped for parallel execution.
VLIW and superscalar architectures provide many execution units, each of which executes
its own instruction. VLIW DSP processors typically issue a maximum of between four and
eight instructions per clock cycle, which are fetched and issued as part of one long super-
instruction, hence the name Very Long Instruction Word. Superscalar processors typically
issue and execute fewer instructions per cycle, usually between two and four. In a VLIW
architecture, the assembly language programmer specifies which instructions will be
executed in parallel. Hence, instructions are grouped at the time the program is assembled,
and the grouping does not change during program execution. Superscalar processors, in
contrast, contain specialized hardware that determines which instructions will be executed
in parallel based on data dependencies and resource contention, shifting the burden of
scheduling parallel instructions from the programmer to the processor. The processor may
group the same set of instructions differently at different times in the program's execution;
for example, it may group instructions one way the first time it executes a loop, then group
them differently for subsequent iterations. The difference in the way these two types of
architectures schedule instructions for parallel execution is important in the context of
using them in real-time DSP applications. Because superscalar processors dynamically
schedule parallel operations, it may be difficult for the programmer to predict exactly how long a given segment of software will take to execute. The execution time may vary based
on the particular data accessed, whether the processor is executing a loop for the first time
or the third, or whether it has just finished processing an interrupt, for example. Dynamic
features also complicate software optimization. As a rule, DSP processors have
traditionally avoided dynamic features for just these reasons; this may be why there is
currently only one example of a commercially available superscalar DSP processor.
In VLIW architectures, a wide instruction word may be required in order to specify
information about which functional unit will execute the instruction. Wider instructions
allow the use of larger, more uniform register sets, which in turn enables higher
performance. There are disadvantages, however, to using wide, simple instructions. Since each VLIW instruction is simpler than a conventional DSP processor instruction, VLIW
processors tend to require many more instructions to perform a given task. Combined with
the fact that the instruction words are typically wider than those found on conventional
DSP processors, this characteristic results in relatively high program memory usage. High
program memory usage, in turn, may result in higher chip or system cost because of the
need for additional ROM or RAM.
VLIW processors typically use either wide buses or a large number of buses to access data
memory and keep the multiple execution units fed with data. The architectures of VLIW
DSP processors are in some ways more like those of general-purpose processors than like
those of the highly specialized conventional DSP architectures.
VLIW and superscalar processors often suffer from high energy consumption relative to
conventional DSP processors; in general, multi-issue processors are designed with
an emphasis on increased speed rather than energy efficiency. These processors often have
more execution units active in parallel than conventional DSP processors, and they require
wide on-chip buses and memory banks to accommodate multiple parallel instructions and
to keep the multiple execution units supplied with data, all of which contribute to increased
energy consumption.
Because they often have high memory usage and energy consumption, VLIW
and superscalar processors have mainly targeted applications which have very demanding
computational requirements but are not very sensitive to cost or energy efficiency. For example, a VLIW processor might be used in a cellular base station, but not in a portable
cellular phone.
On DSP processors with SIMD capabilities, the underlying hardware that supports SIMD
operations varies widely. Analog Devices, for example, modified their basic conventional
floating-point DSP architecture, the ADSP-2106x, by adding a second set of execution
units that exactly duplicates the original set. The augmented architecture can issue a single
instruction and execute it in parallel in both sets of execution units using different data,
effectively doubling performance in some algorithms.
In contrast, instead of having multiple sets of the same execution units, some DSP
processors can logically split their execution units into multiple sub-units that process
narrower operands. These processors treat operands in long registers as multiple short
operands. Perhaps the most extensive SIMD capabilities we have seen in a DSP processor
to date are found in Analog Devices' TigerSHARC processor. TigerSHARC is a VLIW
architecture, and combines the two types of SIMD: one instruction can control execution of
the processor's two sets of execution units, and this instruction can specify a split-
execution-unit operation that will be executed in each set. Using this hierarchical SIMD
capability, TigerSHARC can execute eight 16-bit multiplications per cycle. SIMD is only
effective in algorithms that can process data in parallel; for algorithms that are inherently
serial, SIMD is generally not of use.
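The split-execution-unit style of SIMD can be sketched in C by treating one 32-bit word as two independent 16-bit lanes (the function name is illustrative):

```c
#include <stdint.h>

/* Sub-word SIMD sketch: add two 16-bit lanes packed in 32-bit words.
   Carries are not allowed to propagate from the low lane into the high
   lane, just as a split execution unit keeps the operands independent. */
uint32_t add16x2(uint32_t a, uint32_t b)
{
    uint32_t lo = (uint16_t)((uint16_t)a + (uint16_t)b); /* low lane  */
    uint32_t hi = (uint16_t)((a >> 16) + (b >> 16));     /* high lane */
    return (hi << 16) | lo;
}
```

In hardware this costs no extra cycles: the wide adder is simply cut at the lane boundary, which is why sub-word SIMD is attractive for narrow DSP data.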
CHAPTER 4
DIVERGING ARCHITECTURES
Up until recently, DSP processor designs were improved primarily by incremental
enhancements; new DSPs tended to maintain a close resemblance to their predecessors. In the last couple of years, however, DSP architectures have become much more interesting,
with a number of vendors announcing new architectures that are completely different from
preceding generations.
HIGH-PERFORMANCE DSPS
Processor designers who want higher DSP performance than can be squeezed out of
traditional architectures have come up with a variety of performance-boosting strategies.
The main idea is that if you want to improve performance beyond the increase afforded by
faster clock speeds, you need to increase the amount of useful work that gets done every clock cycle. This is accomplished by increasing the number of operations that are
performed in parallel, which can be implemented in two main ways: by increasing the
number of operations performed by each instruction, or by increasing the number of
instructions that are executed in every instruction cycle.
INCREASING THE WORK PERFORMED BY EACH INSTRUCTION
Traditionally, DSP processors have used complex, compound instructions that allow the
programmer to encode multiple operations in a single instruction. In addition, DSP
processors traditionally issue and execute only one instruction per instruction cycle. This
single-issue, complex-instruction approach allows DSP processors to achieve very strong DSP performance without requiring a large amount of program memory.
One method of increasing the amount of work performed by each instruction while
maintaining the basics of the traditional DSP architecture and instruction set described
above is to augment the data path with extra execution units. We refer to processors that
follow this approach as ``enhanced conventional DSPs''; their basic architecture is similar
to previous generations of DSPs, but has been enhanced by adding execution units.
Lucent Technologies' DSP16000 architecture is based on that of the earlier DSP1600, but
Lucent added a second multiplier, an adder, and a bit-manipulation unit. To support more
parallel operations and keep the processor from starving for data, Lucent also increased the data bus widths to 32 bits. The net result is a processor that is able to sustain a throughput
of two multiply-accumulates per instruction cycle.
EXECUTING MULTIPLE INSTRUCTIONS / CYCLE
A few designers have opted for a more RISC-like instruction set coupled with an
architecture that supports execution of multiple instructions in every instruction cycle,
e.g., the TMS320C62xx family. In TI's version, the processor fetches a 256-bit instruction
``packet,'' parses the packet into eight 32-bit instructions, and routes them to its eight
independent execution units.
VLIW processors typically suffer from high program memory requirements and high
power consumption. Like VLIW processors, superscalar processors issue and execute
multiple instructions in parallel. Unlike VLIW processors, in which the programmer
explicitly specifies which instructions will be executed in parallel, superscalar processors
use dynamic instruction scheduling to determine ``on the fly'' which instructions will be
executed concurrently based on the processor's available resources, on data dependencies,
and on a variety of other factors. Superscalar architectures have long been used in high-
performance general-purpose processors such as the Pentium and PowerPC.
CIRCULAR BUFFERING
In off-line processing, the entire input signal resides in the computer at the same time. The
key point is that all of the information is simultaneously available to the processing
program. This is common in scientific research and engineering, but not in consumer
products. Off-line processing is the realm of personal computers and mainframes.
In real-time processing, the output signal is produced at the same time that the input signal
is being acquired. To calculate an output sample of an FIR filter, we must have access to a certain
number of the most recent samples from the input. When a new sample is acquired, it
replaces the oldest sample in the array, and the pointer is moved one address ahead.
Circular buffers are efficient because only one value needs to be changed when a new
sample is acquired.
Four parameters are needed to manage a circular buffer. First, there must be a pointer that
indicates the start of the circular buffer in memory. Second, there must be a pointer
indicating the end of the array, or a variable that holds its length. Third, the step size of
the memory addressing must be specified. These three values define the size and
configuration of the circular buffer, and do not change during program operation. The
fourth value, the pointer to the most recent sample, must be modified as each new sample is acquired. In other words, there must be program logic that controls how this fourth value is
updated based on the first three.
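The four-parameter scheme above can be sketched in C; here the start pointer, length, and step size are fixed at compile time, and only the pointer to the most recent sample changes at run time (all names are illustrative):

```c
#include <stdint.h>

#define BUF_LEN 8  /* buffer length (second and third parameters fixed) */

typedef struct {
    int16_t data[BUF_LEN];  /* start of the buffer in memory   */
    int     newest;         /* index of the most recent sample */
} circ_buf;

/* Overwrite the oldest sample with the new one and advance the pointer,
   wrapping at the end of the array. */
void circ_push(circ_buf *cb, int16_t sample)
{
    cb->newest = (cb->newest + 1) % BUF_LEN;
    cb->data[cb->newest] = sample;
}

/* Fetch the sample acquired `age` steps ago (age 0 = newest). */
int16_t circ_get(const circ_buf *cb, int age)
{
    return cb->data[(cb->newest - age + BUF_LEN) % BUF_LEN];
}
```

Because only `newest` changes per sample, acquiring a new sample costs a single store plus a pointer update; DSP hardware accelerates the wrap-around itself with dedicated modulo (circular) addressing.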
DSP/MICROCONTROLLER HYBRIDS
Many applications require a mixture of control-oriented software and DSP software. A
prime example is the digital cellular phone, which must implement both supervisory tasks
and voice-processing tasks. In general, microcontrollers provide good performance in
controller tasks and poor performance in DSP tasks; dedicated DSP processors have the
opposite characteristics. Hence, until recently, combination controller/signal processing
applications were typically implemented using two separate processors: a microcontroller
and a DSP.
In the past couple of years, however, a number of microcontroller vendors have begun to
offer DSP-enhanced versions of their microcontrollers as an alternative to the dual-
processor solution.
Using a single processor to implement both types of software is attractive, because it can
potentially:
simplify the design task
save board space
reduce total power consumption
reduce overall system cost
Microcontroller vendors such as Hitachi offer DSP-enhanced versions of their microcontrollers. Hitachi's version, called the SH-DSP, adds a complete 16-bit fixed-point
DSP data path to the SH-2. In contrast, ARM took a different approach and developed a
DSP co-processor, ``Piccolo,'' that is meant to be used as an add-on to their ARM7
microcontroller. Each processor has its own instruction set and processes its own instruction
stream, so it is possible for the two processors to operate in parallel, with the caveat
that Piccolo relies on the ARM7 to perform all data transfers.
RECONFIGURABLE ARCHITECTURES
Reconfigurable architectures are defined as programmable architectures that change the
hardware or the interconnections dynamically so as to provide flexibility with simultaneous
benefits in execution time due to the reconfiguration as opposed to turning off units to
conserve power. There have been various approaches to providing and using this
reconfigurability in programmable architectures. The first approach is the FPGA+
approach, which adds a number of high-level configurable functional blocks to a general-purpose device to optimize it for a specific purpose, such as wireless. The second approach
is to develop a reconfigurable system around a programmable ASSP. The third approach is based on a parallel array of processors on a single die, connected by a reconfigurable
fabric. These kinds of architectures are still in the early stages of their evolution.
CHAPTER 5
NOVEL DSP ARCHITECTURES
"POST-HARVARD" TECHNOLOGY
After remaining unchanged for more than a decade, DSP architectures have started to
evolve, and are even beginning to encompass control operations. Conventional DSPs
typically use a Harvard-style architecture, with separate data and instruction
buses. Their main processing elements are a multiplier, an arithmetic logic unit (ALU), and
an accumulation register, allowing creation of a multiply-accumulate (MAC) unit that
accepts two operands. Depending on the processor, the operands may be 16-, 24-, 32-, or
48-bit words in either fixed-point or floating-point format. Whatever the word width, these
conventional DSPs offer fixed-width instructions, executing one instruction per clock
cycle.
Figure: The conventional DSP architecture uses separate data and memory buses and features fixed-width instructions, executing one instruction per clock cycle.
The instructions themselves can be fairly complex. A single instruction may embody two
data moves, a MAC operation, and two pointer updates. These complex instructions help
the conventional DSP offer a high degree of code density when performing repeated
mathematical operations on arrays of numbers. As control devices, however, they leave
something to be desired. The fixed-width instructions are inefficient when tasked with
performing simple counter increments as part of a control loop, for instance. Even if the
counter is only going as high as 10, the processor needs to use the full word width for the
values. Conventional DSPs are also weak at bit-level data manipulation beyond bit shifting.
Still, because of their number-crunching proficiency, conventional DSPs soon gained
popularity in communications and media applications. The communications devices, including modem and telephony processors, needed the computational power for echo
canceling, voice coding, and filtering. Media applications, including digital audio, video,
and imaging, needed computational power for compression and filtering along with
program flexibility to track evolving standards. DSPs also found a home in disk-drive and
other servo-motor-control applications.
ENHANCED DSPS EMERGE
As semiconductor process technology evolved, conventional DSPs began to acquire a
number of on-chip peripherals such as local memory, I/O ports, timers, and DMA
controllers. Their basic architecture, however, didn't change for more than a decade. Eventually, though, the relative weakness in bit-level manipulation began to catch up with
conventional DSPs, as did the incessant demand for greater performance.
One common feature of these enhanced DSPs is the presence of a second MAC, which
allows for some parallelism in computation. In many cases, this parallelism extends to
other elements in the DSP, allowing the device to perform single-instruction, multiple-data
(SIMD) operations. Often this is accomplished with data packing, which allows registers,
data paths, and the like to handle two half-word operands each clock cycle. Along with
data packing, many enhanced DSPs allow the instructions themselves to use fractional word widths, which allows multiple instructions to launch simultaneously.
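Data packing can be modeled in a few lines. The sketch below (plain Python, illustrative only; the names are invented) packs two 16-bit half-words into one 32-bit "register" and performs a dual 16-bit add per call, taking care that a carry out of the low half does not ripple into the high half:

```python
MASK16 = 0xFFFF

def pack(hi, lo):
    # Pack two 16-bit values into one 32-bit "register".
    return ((hi & MASK16) << 16) | (lo & MASK16)

def simd_add16(a, b):
    # Dual 16-bit add on packed operands: each half wraps
    # independently, so a carry out of the low half must not
    # ripple into the high half.
    lo = (a + b) & MASK16
    hi = ((a >> 16) + (b >> 16)) & MASK16
    return (hi << 16) | lo
```

With this representation, one register read and one "add" process two operand pairs per cycle, which is exactly the throughput gain the enhanced DSPs obtain.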
The enhanced DSPs also tend to incorporate features that speed execution of algorithms in
a specific application space as well as add special-purpose peripherals and memory. The
exact nature of the specialization varies with the application an enhanced DSP targets,
which makes direct comparisons difficult. Many include hardware accelerators for
frequently-used operations as well as provide specialized addressing modes and augmented
instruction sets that target the application space. The augmented instruction sets may
include both special DSP instructions and RISC-like instructions for improved control
operation.
Consider, for instance, the Blackfin DSP family from Analog Devices. This family targets
voice, video, and data communications signal processing along with control operations.
The core includes dual 16-bit MACs, dual 40-bit arithmetic logic units (ALUs), a 40-bit
barrel shifter, and quad 8-bit ALUs for video operations. Because the architecture allows
data packing, the 40-bit ALUs can handle two 40-bit numbers or four 16-bit numbers. In
addition, a control unit handles sequencing of instructions so that a mix of 16-bit control
and 32-bit DSP instructions can pack for simultaneous execution. Data can be in 8-, 16-, or
32-bit format.
Figure: Analog Devices Blackfin DSP architecture handles multi-width data words and can simultaneously execute 16-bit control and 32-bit DSP instructions.
The core also includes two data address generators (DAGs) to simplify both DSP and
control operations. DSP addressing operations include circular buffering, for matrix
operations, and bit-reversal, for unscrambling FFT results. Control operations include auto-
increment, auto-decrement, and base-plus-immediate-offset addressing modes not found in
conventional DSPs.
INSTRUCTION SETS TARGET APPLICATIONS
The instruction set of the Blackfin core includes both general DSP instructions and RISC-
like control instructions. In addition, the core has complex instructions geared toward the
needs of the intended applications. For Huffman coding, used in communications algorithms, there is a "Field Deposit/Extract" command. For the Discrete Cosine
Transform, used in imaging and video, an IEEE 1180 rounding operation is available.
Video compression algorithms can take advantage of the "Sum Absolute Difference"
instruction.
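The "Sum Absolute Difference" operation itself is simple to state; what the hardware instruction provides is computing it in a single cycle during block-matching motion estimation. A plain-Python rendering of the computation (illustrative only, not the Blackfin instruction's exact operand format):

```python
def sad(block_a, block_b):
    # Sum of Absolute Differences between two pixel blocks:
    # the core similarity measure of block-matching motion
    # estimation in video compression.
    return sum(abs(a - b) for a, b in zip(block_a, block_b))
```

A motion estimator evaluates this over many candidate blocks and keeps the one with the smallest SAD, which is why a one-cycle hardware version pays off.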
These specialty instructions are one way that the Blackfin family targets applications. The
other way is the peripheral mix each family member offers. The ADSP-21532, for
example, aims at low-cost consumer multimedia applications by including peripherals
supporting surround-sound and video-specific operating modes. The ADSP-21535 goes
after high-performance communications applications with USB and PCI interfaces as well
as substantial amounts of on-chip SRAM.
The range and variety of variations within the Blackfin family as well as the nature of its
specialized instructions mirror the diversity of enhanced conventional DSPs, available from
companies such as Cirrus Logic, Motorola, and Texas Instruments. But for all the
enhancements, these DSPs follow basically the same programming model as the
conventional device.
Other DSP architectures have emerged that follow a different programming model. In
search of the highest performance levels, these architectures allow the DSP to launch
multiple instructions at the same time for parallel execution. While these approaches result in greater code execution speed, they also make software more difficult to optimize. They
require careful instruction ordering to avoid needing simultaneous access to the same data.
They also need to avoid attempting simultaneous execution of instructions where one
instruction depends on the results of the other for its operands. Not all DSP application
software has a structure suitable for multiple-launch execution, but when it does, these
DSPs offer the highest performance.
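The dependency problem can be seen at the source level. In the sketch below (illustrative Python), the single-accumulator loop forms a serial chain of additions, while splitting the sum across two independent accumulators — a classic transformation for multiple-launch machines — creates MACs with no mutual dependence that could issue in the same cycle:

```python
def dot_serial(x, y):
    # Single accumulator: each add depends on the previous one,
    # so the additions form a chain a multi-issue DSP cannot overlap.
    acc = 0
    for a, b in zip(x, y):
        acc += a * b
    return acc

def dot_parallel(x, y):
    # Two independent accumulators: the even and odd MACs have no
    # data dependence on each other, so a dual-MAC DSP can issue
    # them in the same cycle.
    acc0 = acc1 = 0
    for i in range(0, len(x) - 1, 2):
        acc0 += x[i] * y[i]
        acc1 += x[i + 1] * y[i + 1]
    if len(x) % 2:                 # leftover element for odd lengths
        acc0 += x[-1] * y[-1]
    return acc0 + acc1
```

Both functions compute the same dot product; only the second exposes the parallelism a VLIW or superscalar DSP needs.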
PARALLELISM ARISES
Two different forms of multiple-launch DSPs have arisen: very long instruction word
(VLIW) and superscalar architectures. Both have multiple execution units configured to
operate in parallel and use RISC-like instruction sets. The instructions of a VLIW
architecture are explicitly parallel, being composed of several sub-instructions that control
different resources. The superscalar architectures, on the other hand, load instructions in
bulk, then use hardware run-time scheduling to identify instructions that can run in parallel
and map them to the execution units.
Of the multi-launch architectures, VLIW designs are the most common. Devices from
Adelante Technologies, Equator Technologies, Siroyan, and Texas Instruments fall into
this category, although they vary considerably with the type and number of parallel
execution units they offer. The TI TMS320C64xx processors, for instance, have eight
execution units that can handle both 8- and 16-bit SIMD operations. The Siroyan OneDSP,on the other hand, is scalable from two to 32 clusters, each with several execution units.
The Adelante Saturn DSP core as shown in the following figure demonstrates the essence
of the VLIW approach. It uses multiple data buses in a dual-Harvard configuration to
deliver data and 96-bit wide instructions to an array of execution units simultaneously.
These units include two multipliers (MPY), four 16-bit ALUs that can combine to form
two 20-bit ALUs, a barrel shifter with saturation logic (SST/BRS), program (PCU) and
loop (LCU) controllers, address controllers (ACU), and an ability for design teams to add
application-specific execution units (AXU) to speed processing.
Figure: Adelante's Saturn DSP core handles VLIW instructions that can comprise several sub-instructions that control different resources. The core also handles application-specific execution units (AXUs) to accelerate processing.
The Saturn core uses a unique approach to get around one of the problems the wide word
widths of VLIW architectures cause. Accessing external memory is a challenge for these
DSPs, because of their need to work with buses that can be as wide as 128 bits. The Saturn
core uses 16-bit program memory, which it maps into the 96-bit instruction word it uses
internally. Adelante developed this mapping after analyzing millions of lines of code for
common applications. However, the core also allows developers to create their own
application-specific instructions that map into the VLIW.
SUPERSCALAR DSPS
While the 16-bit external instruction width of the Saturn processor is unusual for VLIW
architectures, it is typical for superscalar architectures. These devices pull in several
instructions at a time and dynamically map them to the execution units. Internally the effect
is much the same as a VLIW architecture in that execution units are operating in parallel.
But from the software development viewpoint the approach reduces programming
complexity. With hardware handling the sequencing and arranging of instructions, the
developer is free to work with the more manageable short instructions.
Consider the structure of a sample superscalar DSP, the LSI Logic ZSP600. Because it is a
core, its memory interface isn't constrained, making it look like a VLIW architecture. But
the presence of the instruction-sequencing unit (ISU) and the pipeline control unit betrays its
superscalar nature. The ZSP600 fetches eight instructions at a time, and can execute as
many as six, using its four MAC and two ALU execution units simultaneously. Data
packing allows the units to perform 16- or 32-bit operations. The architecture also allows
for the addition of coprocessors to speed specific DSP functions.
Figure: Superscalar DSPs, such as LSI Logic's ZSP600, use several instructions simultaneously and dynamically map these instructions to the execution units.
This ability to add coprocessors is becoming a common feature of high-performance DSP
cores. In many cases the core's creators have also created coprocessors for functions such
as DES (data encryption standard) and Viterbi coding. If a pre-designed coprocessor isn't
available, however, creating your own can be a major design challenge.
A recently-introduced DSP architecture, the PulseDSP from Systolix, might make the task
easier. Similar to an FPGA, the PulseDSP offers a massively parallel, repetitive structure. It is designed as a systolic array, which means that all data transfers occur synchronously on a
clock edge. Each processing element in the array has selectable I/O paths, local data
memory, and an ALU. Both the I/O and the ALU are programmable, and the array has a
programming bus running through it. The combination makes the array reprogrammable,
either statically or dynamically. The array structure is intended to handle low-complexity
but high-speed processing tasks using 16- to 64-bit arithmetic, which makes it suitable as a
coprocessor.
Figure: Systolix's PulseDSP is a systolic array that can run as a coprocessor or as a standalone unit for applications such as filters and FFTs. The array is programmable,
with each processing element having its own selectable I/O paths, local data memory, and an ALU.

The array can also be used as a stand-alone processor for some types of algorithms, such as
filters and FFTs. One of the commercial implementations of the array, in fact, is to provide
filtering in an Analog Devices data acquisition part, the AD7725. The device combines the
PulseDSP with a sigma-delta A/D converter to provide post-processing of the acquired
data. The DSP array implements various filter algorithms.
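The systolic behaviour described above — synchronous shifts on every clock edge, with each processing element holding one coefficient — can be modeled briefly. The sketch below (illustrative Python, not the PulseDSP programming model) implements an FIR filter in this style:

```python
def systolic_fir(coeffs, samples):
    # Each processing element (PE) holds one coefficient and a
    # delayed sample; on every "clock edge" samples shift one PE
    # along the array and each PE adds its product into the sum.
    delay = [0] * len(coeffs)      # per-PE sample registers
    out = []
    for s in samples:
        delay = [s] + delay[:-1]   # synchronous shift of the array
        out.append(sum(c * d for c, d in zip(coeffs, delay)))
    return out
```

One output emerges per clock regardless of filter length, since every tap computes in parallel — the property that makes systolic arrays attractive for low-complexity, high-speed tasks.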
Innovations such as the PulseDSP as well as the proliferation within the other DSP
architectures are a strong indicator of how important these once-arcane processors have
become. In many applications, especially communications, they share the spotlight with the
RISC processor. The DSP handles the data and the RISC handles the protocols. There are
problems with the two-processor approach, of course, including increased cost and
software development complexity. One reason many DSPs are adding RISC-like
instructions to their set is to be able to edge out the other processor in such applications.
The same thing is happening with some RISC processors. Extensible cores, such as the
Tensilica Xtensa and the ARC International ARCtangent, are offering DSP enhancements
so that communications applications need only one processor. These enhancements follow
the architecture of the conventional DSP, but merge the DSP functions into the instruction
set of the RISC core.
The ARCtangent demonstrates how the two get blended. The DSP instruction decode and
processing elements both connect with the rest of the core, allowing them to use the core's
resources as well as their own. The extensions have full access to registers and operate in
the same instruction stream as the RISC core. ARC's DSP offerings include MACs in
varying widths, saturation arithmetic, and X-Y memory for DSP data. The extensions also
support DSP addressing modes such as bit-reversal.
Figure: The ARCtangent core from ARC International blends DSP functionality into a RISC processor. Both DSP instruction-decode and processing elements connect with the rest of the core, allowing these elements to use the core's resources as well as their own.
These extended RISC processors, enhanced conventional DSPs, and high-performance
architectures have all proliferated in the last few years, a sure sign of the importance DSPs
have acquired. Furthermore, that proliferation is likely to continue. With process
technology allowing integration of multiple peripherals with DSP cores and instruction sets
extending to match application needs, DSPs are headed the way of the microcontroller.
From obscure, specialized parts, they are evolving to become a fundamental building block
for virtually any system.
CHAPTER 6 ARCHITECTURE OF LATEST DSP PROCESSORS
TEXAS INSTRUMENTS TMS320C67xx FAMILY
OVERVIEW
The TMS320C67xx family comprises the highest-performance floating-point DSPs in TI's
line. It is based on the advanced VelociTI very-long-instruction-word (VLIW) architecture,
which allows it to execute up to eight RISC-like instructions per clock cycle and makes it
an excellent choice for multichannel and multifunction applications. It adds support for
floating-point arithmetic and 64-bit data, delivering up to 1 giga floating-point operations
per second (GFLOPS) at a clock rate of 167 MHz. It uses a 1.8-volt core supply and
executes up to 334 million MACs per second at 167 MHz. The TMS320C67xx's two data
paths extend hardware support for 64-bit data and IEEE-754 32-bit single-precision and
64-bit double-precision floating-point arithmetic. Each data path includes a set of four
execution units, a general-purpose register file, and paths for moving data between memory
and registers.
The four execution units in each data path comprise two ALUs, a multiplier, and an
adder/subtractor which is used for address generation. The ALUs support both integer and
floating point operations, and the multipliers can perform both 16x16-bit and 32x32-bit
integer multiplies and 32-bit and 64-bit floating point multiplies. The two register files each
contain sixteen 32-bit general-purpose registers. These registers can be used for storing
addresses or data. To support 64-bit floating point arithmetic, pairs of adjacent registers can
be used to hold 64-bit data.
The C6701 DSP possesses the operational flexibility of high-speed controllers and the
numerical capability of array processors. This processor has 32 general-purpose registers of
32-bit word length and eight highly independent functional units. The eight functional units
provide four floating-/fixed-point ALUs, two fixed-point ALUs, and two floating-/fixed-
point multipliers. Program memory consists of a 64K-byte block that is user-configurable
as cache or memory-mapped program space. Data memory consists of two 32K-byte
blocks of RAM. The peripheral set includes two multichannel buffered serial ports
(McBSPs), two general-purpose timers, a host-port interface (HPI), and a glueless external memory interface (EMIF) capable of interfacing to SDRAM or SBSRAM and asynchronous peripherals.
The on-chip memory system of the TMS320C67xx implements a modified
Harvard architecture, providing separate address spaces for program and data memory.
Program memory has a 32-bit address bus and a 256-bit data bus. Each of the two data
paths is connected to data memory by a 32-bit address bus and two 32-bit data buses. Since
there are two 32-bit data buses for each data path, the TMS320C67xx can load two 64-bit
words per instruction cycle. TMS320C6701 has 64 Kbytes of 32-bit on-chip program RAM
and 64 Kbytes of 16-bit on-chip data RAM.
The TMS320C6701 has one external memory interface, which provides a 23-bit address
bus and a 32-bit data bus. These buses are multiplexed between program and data memory
accesses. Addressing modes supported include register-direct, register-indirect, indexed
register-indirect, and modulo addressing. Immediate data is also supported.
The TMS320C67xx does not support hardware looping, and hence all loops must be
implemented in software. However, the parallel architecture of the processor allows the
implementation of software loops with virtually no overhead.
The peripherals on the TMS320C6701 include a host port, a four-channel DMA controller,
two TDM-capable buffered serial ports, and two 32-bit timers.
CPU ARCHITECTURE
CPU DESCRIPTION
Fetch packets are always 256 bits wide; however, the execute packets can vary in size. The variable-length execute packets are a key memory-saving feature, distinguishing the C67x CPU from other VLIW architectures.
The CPU features two sets of functional units. Each set contains four units and a register
file. One set contains functional units .L1, .S1, .M1, and .D1; the other set contains units
.D2, .M2, .S2, and .L2. The two register files contain sixteen 32-bit registers each, for a
total of 32 general-purpose registers. The two sets of functional units, along with the two
register files, compose sides A and B of the CPU.
The four functional units on each side of the CPU can freely share the 16 registers
belonging to that side. Additionally, each side features a single data bus connected to all registers on the other side, by which the two sets of functional units can access data from
the register files on opposite sides.
In addition to the C62x DSP fixed-point instructions, six of the eight functional units
(.L1, .S1, .M1, .M2, .S2, and .L2) also execute floating-point instructions. The remaining
two functional units (.D1 and .D2) also execute the new LDDW instruction, which loads 64
bits per CPU side for a total of 128 bits per cycle.
Another key feature of the C67x CPU is the load/store architecture, where all instructions
operate on registers. Two sets of data-addressing units (.D1 and .D2) are responsible for all
data transfers between the register files and the memory. The data address driven by the .D units allows data addresses generated from one register file to be used to load or store data
to or from the other register file. The C67x CPU supports a variety of indirect-addressing
modes using either linear- or circular-addressing modes with 5- or 15-bit offsets. All
instructions are conditional, and most can access any one of the 32 registers. Some
registers, however, are singled out to support specific addressing or to hold the condition
for conditional instructions. The two .M functional units are dedicated multipliers.
The two .S and .L functional units perform a general set of arithmetic, logical, and branch
functions with results available every clock cycle. The processing flow begins when a 256-
bit-wide instruction fetch packet is fetched from a program memory. The 32-bit
instructions destined for the individual functional units are linked together by 1 bits in
the least significant bit (LSB) position of the instructions. The instructions that are
chained together for simultaneous execution compose an execute packet. A 0 in the
LSB of an instruction breaks the chain, effectively placing the instructions that follow it in
the next execute packet. If an execute packet crosses the fetch-packet boundary (256 bits
wide), the assembler places it in the next fetch packet, while the remainder of the current
fetch packet is padded with NOP instructions. The number of execute packets within a
fetch packet can vary from one to eight. Execute packets are dispatched to their respective
functional units at the rate of one per clock cycle and the next 256-bit fetch packet is not
fetched until all the execute packets from the current fetch packet have been dispatched.
After decoding, the instructions simultaneously drive all active functional units for a
maximum execution rate of eight instructions every clock cycle. While most results are
stored in 32-bit registers, they can be subsequently moved to memory as bytes or half-
words as well. All load and store instructions are byte-, half-word-, or word-addressable.
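The p-bit chaining rule lends itself to a small worked example. The sketch below (illustrative Python; instruction words are just integers here, and the function name is invented) groups a fetch packet into execute packets exactly as described — a 1 in the LSB chains the next instruction into the same packet, and a 0 ends it:

```python
def split_execute_packets(fetch_packet):
    # Walk a fetch packet of 32-bit instruction words, grouping them
    # into execute packets: a 1 in the LSB (the p-bit) chains the
    # NEXT instruction into the same packet; a 0 ends the packet.
    packets, current = [], []
    for word in fetch_packet:
        current.append(word)
        if word & 1 == 0:          # p-bit clear: chain broken
            packets.append(current)
            current = []
    if current:                    # trailing chain (kept for safety;
        packets.append(current)    # well-formed code ends on a 0)
    return packets
```

Dispatching one of these groups per clock while the fetch unit waits for the whole fetch packet to drain is precisely the behaviour the paragraph above describes.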
ANALOG DEVICES ADSP-21XX FAMILY
OVERVIEW
The ADSP-21xx is the first single chip DSP processor family from Analog Devices. The
family consists of a large number of processors based on a common 16-bit fixed-point
architecture core with a 24-bit instruction word. Each processor combines the core DSP
architecture (computation units, data address generators, and program sequencer) with
differentiating features such as on-chip program and data RAM, a programmable
timer, and one or two serial ports.
The fastest members of the family operate at 75 MIPS at 2.5 volts, 52 MIPS at 3.3 volts,
and 40 MIPS at 5.0 volts. Analog Devices has recently announced the ADSP-219x series,
which offers projected speeds of up to 300 MIPS, as well as architectural enhancements.
ADSP-21xx processors are targeted at modem, audio, PC multimedia, and digital cellular
applications.
Fabricated in a high speed, submicron, double-layer metal CMOS process, the highest-
performance ADSP-21xx processors operate at 25 MHz with a 40 ns instruction cycle time.
Every instruction can execute in a single cycle. Fabrication in CMOS results in low power
dissipation. The ADSP-2100 Family's flexible architecture and comprehensive instruction
set support a high degree of parallelism.
The ADSP-21xx data path consists of three separate arithmetic execution units: an arithmetic/logic unit (ALU), a multiplier/accumulator (MAC), and a barrel shifter. Each
unit is capable of single-cycle execution, but only one of these units can be active during a
single instruction cycle. The ALU operates on 16-bit data. In addition to the usual ALU
operations, the ALU provides increment/decrement, absolute value, and add-with-carry
functions. ALU results are saturated upon overflow if the appropriate configuration bit is
set by the programmer. The MAC unit includes a 16x16->32-bit multiplier, four input
registers, a feedback register, a 40-bit adder, and a single 40-bit result register/accumulator
providing eight guard bits. Besides signed operands, the multiplier can operate on
unsigned/unsigned or on signed/unsigned operands, thus supporting multi-precision
arithmetic. The barrel shifter shifts 16-bit inputs from an input register or from the
ALU/MAC/barrel shifter result registers into a 32-bit result register. Logical and arithmetic shifts are supported left or right up to 32 bits. The barrel shifter also supports block
floating-point arithmetic with block exponent detect (which determines a maximum
exponent of a block of data), single-word exponent detect, normalize, and exponent adjust
instructions.
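Block floating-point hinges on the exponent-detect step: finding how far an entire block can be shifted left without overflow, limited by its largest-magnitude element. A sketch of that computation for 16-bit data (illustrative Python; the function names are invented, not ADSP-21xx mnemonics):

```python
def redundant_sign_bits(x, width=16):
    # Count leading bits equal to the sign bit (beyond the sign bit
    # itself): how far x can shift left without overflowing `width`.
    if x < 0:
        x = ~x                     # fold negatives onto the same scale
    n = 0
    while n < width - 1 and (x >> (width - 2 - n)) == 0:
        n += 1
    return n

def block_exponent(block):
    # Block exponent detect: the safe shift for the whole block is
    # limited by its largest-magnitude element.
    return min(redundant_sign_bits(v) for v in block)
```

Normalizing every element by this common shift keeps maximum precision while all values in the block share one exponent, which is the essence of block floating-point arithmetic.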
ADSP-21xx processors use a modified Harvard architecture with separate memory spaces
and on-chip bus sets for program and data. All processors in the ADSP-21xx family
include on-chip program RAM or ROM and on-chip data RAM.
On-chip program memory can be used for both instructions and data, and it can be accessed via a 14-bit address bus and a 24-bit data bus. On-chip program memory is dual-
ported to allow the processor to fetch both a data operand and the next instruction in a
single instruction cycle. The on-chip data memory can be accessed via a 14-bit address bus
and a 16-bit data bus. One access to the on-chip data memory can be performed in a single
instruction cycle. Three memory accesses (one instruction and two data operands) can be
performed in one instruction cycle.
Both of the on-chip memory spaces can be extended off-chip. All ADSP-21xx processors have one external memory interface, providing a 14-bit address bus and a 24-bit data bus.
This external interface is multiplexed between program and data memory accesses.
The ADSP-21xx supports register-direct, memory-direct and register-indirect addressing
modes. Immediate data is also supported. The ADSP-21xx provides zero-overhead
program looping through its DO instruction. Any length sequence of instructions can be
contained in a hardware loop, and up to 16,384 repetitions are supported.
ARCHITECTURE OVERVIEW
The processors contain three independent computational units: the ALU, the
multiplier/accumulator (MAC), and the shifter. The ALU performs a standard set of
arithmetic and logic operations; division primitives are also supported. The MAC performs
single-cycle multiply, multiply/add, and multiply/subtract operations. The shifter performs logical and arithmetic shifts, normalization, denormalization, and derive-exponent
operations. The shifter can be used to efficiently implement numeric format control
including multiword floating-point representations. The internal result (R) bus directly
connects the computational units so that the output of any unit may be used as the input of
any unit on the next cycle. A powerful program sequencer and two dedicated data address
generators ensure efficient use of these computational units. The sequencer supports
conditional jumps, subroutine calls, and returns in a single cycle. With internal loop
counters and loop stacks, the ADSP-21xx executes looped code with zero overhead; no
explicit jump instructions are required to maintain the loop. Two data address generators
(DAGs) provide addresses for simultaneous dual operand fetches (from data memory and
program memory). Each DAG maintains and updates four address pointers. Whenever a pointer is used to access data (indirect addressing), it is post-modified by the value of one
of four modify registers. A length value may be associated with each pointer to implement
automatic modulo addressing for circular buffers. The circular buffering feature is also
used by the serial ports for automatic data transfers to on-chip memory. Efficient data
transfer is achieved with the use of five internal buses, namely: the Program Memory
Address (PMA) Bus, Program Memory Data (PMD) Bus, Data Memory Address (DMA)
Bus, Data Memory Data (DMD) Bus, and the Result (R) Bus.
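The DAG behaviour described above — post-modify on every access, plus automatic modulo wrap when a length is associated with the pointer — can be modeled compactly (illustrative Python, not the ADSP-21xx register model; the class name is invented):

```python
class DAG:
    # Model of one address pointer in a data address generator: every
    # access returns the current address, then post-modifies it; when
    # a length is set, the pointer wraps automatically, implementing
    # circular buffering with no per-access software overhead.
    def __init__(self, base, modify, length=None):
        self.base, self.ptr = base, base
        self.modify, self.length = modify, length

    def next_address(self):
        addr = self.ptr
        self.ptr += self.modify                # post-modify
        if self.length is not None:            # modulo addressing
            self.ptr = self.base + (self.ptr - self.base) % self.length
        return addr
```

A filter's delay line set up this way never needs explicit wrap-around tests, which is exactly why the serial ports can reuse the same mechanism for automatic transfers.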
The two address buses (PMA, DMA) share a single external address bus, allowing memory
to be expanded off-chip, and the two data buses (PMD, DMD) share a single external data
bus. The BMS, DMS, and PMS signals indicate which memory space is using the external
buses. Program memory can store both instructions and data, permitting the ADSP-21xx to
fetch two operands in a single cycle, one from program memory and one from data
memory. The processor can fetch an operand from on-chip program memory and the next
instruction in the same cycle. The memory interface supports slow memories and
memory-mapped peripherals with programmable wait-state generation. External devices can
gain control of the processor's buses with the use of the bus request/grant signals.
One bus grant execution mode (GO Mode) allows the ADSP-21xx to continue running
from internal memory. A second execution mode requires the processor to halt while buses
are granted. Each ADSP-21xx processor can respond to several different interrupts. There
can be up to three external interrupts, configured as edge- or level-sensitive. Internal
interrupts can be generated by the timer, serial ports, and, on the ADSP-2111, the host
interface port. There is also a master RESET signal. Booting circuitry provides for loading
on-chip program memory automatically from byte-wide external memory. After reset, three
wait states are automatically generated. This allows, for example, a 60 ns ADSP-2101 to
use a 200 ns EPROM as external boot memory. Multiple programs can be selected and
loaded from the EPROM with no additional hardware. The data receive and transmit pins
on SPORT1 (Serial Port 1) can be alternatively configured as a general-purpose input flag and output flag. You can use these pins for event signalling to and from an external device.
A programmable interval timer can generate periodic interrupts. A 16-bit count register
(TCOUNT) is decremented every n cycles, where n-1 is a scaling value stored in an 8-bit
register (TSCALE). When the value of the count register reaches zero, an interrupt is
generated and the count register is reloaded from a 16-bit period register (TPERIOD).
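The timer's operation can be checked with a small simulation. The sketch below (illustrative Python, not a cycle-accurate device model) decrements TCOUNT every TSCALE+1 cycles and reloads it from TPERIOD when it reaches zero, counting the interrupts generated:

```python
def timer_interrupts(tcount, tscale, tperiod, cycles):
    # Model of the ADSP-21xx interval timer: TCOUNT is decremented
    # every (TSCALE + 1) clock cycles; when it reaches zero an
    # interrupt fires and TCOUNT is reloaded from TPERIOD.
    interrupts, prescale = 0, tscale
    for _ in range(cycles):
        if prescale == 0:
            prescale = tscale          # reload the prescaler
            tcount -= 1
            if tcount == 0:
                interrupts += 1        # periodic interrupt
                tcount = tperiod       # reload from the period register
        else:
            prescale -= 1
    return interrupts
```

With TSCALE = 0 the counter decrements on every cycle, so TCOUNT = TPERIOD = 3 yields one interrupt every three cycles.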
BLACKFIN PROCESSOR
Blackfin Processors are a new breed of embedded media processor. Based on the Micro
Signal Architecture (MSA) jointly developed with Intel Corporation, Blackfin Processors
combine a 32-bit RISC-like instruction set and dual 16-bit multiply accumulate (MAC)
signal processing functionality with the ease-of-use attributes found in general-purpose
microcontrollers. This combination of processing attributes enables Blackfin Processors to
perform equally well in both signal processing and control processing applications, in many
cases eliminating the need for separate heterogeneous processors.
This processor family also offers industry-leading power efficiency, as low as 0.15 mW/MMAC at 0.8 V. This combination of high performance and low power is
essential in meeting the needs of today's and future signal processing applications including
broadband wireless, audio/video capable Internet appliances, and mobile communications.
HIGH PERFORMANCE SIGNAL PROCESSING
The core architecture employs a fully interlocked instruction pipeline, multiple parallel computational blocks, efficient DMA capability, and instruction set enhancements designed to accelerate video processing.
FULLY INTERLOCKED INSTRUCTION PIPELINE
All Blackfin Processors utilize a multi-stage fully interlocked pipeline that guarantees code
is executed as you would expect and that all data hazards are hidden from the programmer.
This type of pipeline guarantees result accuracy by stalling when necessary to achieve
proper results.
HIGHLY PARALLEL COMPUTATIONAL BLOCKS
The basis of the Blackfin Processor architecture is the Data Arithmetic Unit that includes two 16-bit Multiplier Accumulators (MACs), two 40-bit Arithmetic Logic Units (ALUs),
four 8-bit video ALUs, and a single 40-bit barrel shifter. Each MAC can perform a 16-bit
by 16-bit multiply on four independent data operands every cycle. The 40-bit ALUs can operate on either two 40-bit numbers or four 16-bit numbers per cycle. With this architecture, 8-, 16-, and 32-bit data word sizes can be processed natively for maximum efficiency.
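The arithmetic above can be illustrated with a short sketch of one MAC step. This is plain C, not Blackfin assembly, and the 40-bit modelling in an int64_t is our own device; the 8 guard bits above bit 31 are what let many full-scale products accumulate without overflow.

```c
#include <assert.h>
#include <stdint.h>

/* Keep the low 40 bits of x, sign-extended (arithmetic right shift of a
   signed value is assumed, as on all mainstream compilers). */
static int64_t sext40(int64_t x) {
    return (int64_t)((uint64_t)x << 24) >> 24;
}

/* One MAC step: signed 16x16 multiply accumulated into a 40-bit value. */
static int64_t mac40(int64_t acc, int16_t a, int16_t b) {
    return sext40(acc + (int32_t)a * (int32_t)b);
}
```

Even 256 consecutive full-scale products (0x7FFF * 0x7FFF each) fit inside the 40-bit accumulator, which is exactly the headroom the guard byte provides.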
Two Data Address Generators (DAGs) are complex load/store units designed to generate
addresses to support sophisticated DSP filtering operations. For DSP addressing, bit-reversed addressing and circular buffering are supported. The DAGs also include two loop
counters for nested zero overhead looping and hardware support for on-the-fly saturation
and clipping.
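Bit-reversed addressing, mentioned above, is easiest to see in software. A DAG does this in hardware with a reverse-carry adder; the loop below is only an illustrative sketch of the index transformation an FFT needs to reorder its inputs or outputs.

```c
#include <assert.h>
#include <stdint.h>

/* Reverse the low `bits` bits of an index: index 001 -> 100, etc. */
static uint32_t bit_reverse(uint32_t index, int bits) {
    uint32_t r = 0;
    for (int i = 0; i < bits; i++) {
        r = (r << 1) | (index & 1);  /* move the lowest bit to the top */
        index >>= 1;
    }
    return r;
}
```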
HIGH BANDWIDTH DMA CAPABILITY
All Blackfin Processors have multiple, independent DMA controllers that support
automated data transfers with minimal overhead from the processor core. DMA transfers
can occur between the internal memories and any of the many DMA-capable peripherals.
VIDEO INSTRUCTIONS
In addition to native support for 8-bit data, the word size common to many pixel processing
algorithms, the Blackfin Processor architecture includes instructions specifically defined to enhance performance in video processing applications. The enhanced instructions accelerate common video compression algorithms.
EFFICIENT CONTROL PROCESSING
The Blackfin architecture offers control processing efficiency similar to that of RISC control processors. These features
include a hierarchical memory architecture, superior code density, and a variety of
microcontroller-style peripherals including a watch-dog timer, real-time clock, and an
integrated SDRAM controller. The L1 memory is connected directly to the processor core,
runs at full system clock speed, and offers maximum system performance for time critical
algorithm segments. The L2 memory is a larger, bulk memory storage block that offers slightly reduced performance, but is still faster than off-chip memory.
The L1 memory structure has been implemented to provide the performance needed for
signal processing while offering the programming ease found in general purpose
microcontrollers. By supporting both SRAM and cache programming models, system
designers can allocate critical DSP data sets that require high bandwidth and low latency
into SRAM, while maintaining the simple programming model of the data cache for
operating system (OS) and microcontroller code.
The Memory Management Unit provides for a memory protection format that can support a
full OS Kernel. The OS Kernel runs in Supervisor mode and partitions blocks of memory
and other system resources for the actual application software to run in User mode. This is
a unique and powerful feature not present on traditional DSPs.
SUPERIOR CODE DENSITY
The Blackfin Processor architecture supports multi-length instruction encoding. Very
frequently used control-type instructions are encoded as compact 16-bit words, with more
mathematically intensive DSP instructions encoded as 32-bit values.
DYNAMIC POWER MANAGEMENT
Blackfin Processors employ multiple power-saving techniques. They are based on a gated
clock core design that selectively powers down functional units on an instruction-by-
instruction basis. They also support multiple power-down modes for periods where little or
no CPU activity is required. Lastly, and probably most importantly, Blackfin Processors
support a dynamic power management scheme whereby the operating frequency and
voltage can be tailored to meet the performance requirements of the algorithm currently
being executed.
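The reason frequency and voltage are scaled together follows from the dynamic CMOS power relation P = C * V^2 * f: a voltage reduction pays off quadratically. The helper below just evaluates that ratio; the numbers in the usage are illustrative, not Blackfin datasheet figures.

```c
#include <assert.h>

/* Dynamic power relative to nominal, given voltage and frequency ratios.
   Derived from P = C * V^2 * f with C held constant. */
static double relative_power(double v_ratio, double f_ratio) {
    return v_ratio * v_ratio * f_ratio;
}
```

Halving both frequency and voltage cuts dynamic power to one eighth of nominal, whereas halving frequency alone only halves it, which is why the scheme adjusts both.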
BLACKFIN PROCESSOR CORE BASICS
The Blackfin Processor core is a load-store architecture consisting of a Data Arithmetic
Unit, an Address Arithmetic Unit, and a sequencer unit. Blackfin Processors combine a
high-performance, dual-MAC DSP architecture with the programming ease of a RISC MCU into a single instruction set architecture.
GENERAL PURPOSE REGISTER FILES
The Blackfin Processor core includes an 8-entry by 32-bit data register file for general use
by the computational units. Supported data types include 8-, 16-, or 32-bit signed or
unsigned integer and 16- or 32-bit signed fractional. In every clock cycle, this multiported register file supports two 32-bit reads and two 32-bit writes. It can also be accessed as a 16-entry by 16-bit data register file.
The address register file provides a general purpose addressing mechanism in addition to
supporting circular buffering and stack maintenance. This register file consists of 8 entries
and includes a frame pointer and a stack pointer. The frame pointer is useful for subroutine
parameter passing, while the stack pointer is useful for storing the return address from
subroutine calls.
DATA ARITHMETIC UNIT
It contains:
Two 16-bit MACs
Two 40-bit ALUs
Four 8-bit video ALUs
Single barrel shifter
All computational resources can process 8-, 16-, or 32-bit operands from the data register file, R0 through R7. Each register can be accessed as a 32-bit register or as a 16-bit register
high or low half.
In a single clock cycle, the dual data paths can read and write up to two 32-bit values. However, since the high and low halves of the R0 through R7 registers are individually addressable (Rx, Rx.H, or Rx.L), each computational block can choose from either two 32-bit input values or four 16-bit input values with no restrictions on input data. The results of the computation can be written back into the register file as either a 32-bit entity or as the high or low 16-bit half of the register. Additionally, the method of accumulation can vary between data paths.
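The Rx / Rx.H / Rx.L register views can be modelled with masks and shifts on a 32-bit value. The helper names below are our own; only the Rx.H/Rx.L naming comes from the architecture.

```c
#include <assert.h>
#include <stdint.h>

/* Read and write the 16-bit halves of a 32-bit register value. */
static uint16_t reg_lo(uint32_t r) { return (uint16_t)(r & 0xFFFFu); }
static uint16_t reg_hi(uint32_t r) { return (uint16_t)(r >> 16); }
static uint32_t set_lo(uint32_t r, uint16_t v) {
    return (r & 0xFFFF0000u) | v;                    /* write only Rx.L */
}
static uint32_t set_hi(uint32_t r, uint16_t v) {
    return (r & 0x0000FFFFu) | ((uint32_t)v << 16);  /* write only Rx.H */
}
```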
Both accumulators are 40 bits in length, providing 8 bits of extended precision. Similar to
the general purpose registers, both accumulators can be accessed in 16-, 32-, or 40-bit
increments. The Blackfin architecture also supports a combined add/subtract instruction
that can generate two 16-, 32-, or 40-bit results or four 16-bit results. In the case where four
16-bit results are desired, the high and low half results can be interchanged. This is a very
powerful capability and significantly improves, for instance, the FFT benchmark results.
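The reason the combined add/subtract helps FFTs is that a radix-2 butterfly needs both (a + b) and (a - b) from the same operand pair, which the single instruction produces at once. The C sketch below only shows the data flow for one 16-bit pair; wrap-around arithmetic is assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Produce sum and difference of one operand pair in a single step,
   mirroring what the combined add/subtract instruction computes. */
static void add_sub16(int16_t a, int16_t b, int16_t *sum, int16_t *diff) {
    *sum  = (int16_t)(a + b);
    *diff = (int16_t)(a - b);
}
```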
ADDRESS ARITHMETIC UNIT
Two data address generators (DAGs) provide addresses for simultaneous dual operand
fetches from memory. The DAGs share a register file that contains four sets of 32-bit index (I), length (L), base (B), and modify (M) registers. There are also eight additional 32-bit address registers (P0 through P5, the frame pointer, and the stack pointer) that can be used as pointers for general indexing of variables and stack locations.
The four sets of I, L, B, and M registers are useful for implementing circular buffering.
Used together, each set of index, length, and base registers can implement a unique circular buffer in internal or external memory. The Blackfin architecture also supports a variety of addressing modes, including indirect, auto increment and decrement, indexed, and bit
addressing modes, including indirect, auto increment and decrement, indexed, and bit
reversed. Last, all address registers are 32 bits in length, supporting the full 4 Gbyte
address range of the Blackfin Processor architecture.
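A circular-buffer post-modify using the I, L, B, and M registers described above can be sketched as follows. The register names follow the text; the single-wrap assumption (|M| < L) and the function itself are ours, not hardware behaviour.

```c
#include <assert.h>
#include <stdint.h>

/* Advance index I by modify value M, wrapping inside buffer [B, B + L). */
static uint32_t circ_update(uint32_t i, uint32_t b, uint32_t l, int32_t m) {
    int64_t next = (int64_t)i + m;
    if (next >= (int64_t)b + l) next -= l;   /* wrapped past the top   */
    else if (next < (int64_t)b) next += l;   /* wrapped below the base */
    return (uint32_t)next;
}
```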
PROGRAM SEQUENCER UNIT
The program sequencer controls the flow of instruction execution and supports conditional
jumps and subroutine calls, as well as nested zero-overhead looping. A multistage, fully interlocked pipeline guarantees code is executed as expected and that all data hazards are hidden from the programmer. This type of pipeline guarantees result accuracy by stalling when necessary to achieve proper results.
The Blackfin architecture supports 16- and 32-bit instruction lengths in addition to limited
multi-issue 64-bit instruction packets. This ensures maximum code density by encoding the
most frequently used control instructions as compact 16-bit words and the more
challenging math operations as 32-bit double words.
LSI LOGIC ZSP600-QUAD MAC SUPERSCALAR CORE
OVERVIEW
The ZSP600 is a quad-MAC superscalar DSP core that addresses the high-performance data throughput and signal processing requirements of emerging communications platforms. The ZSP600 supports up to six-instruction-per-cycle (IPC) DSP performance at a peak rate of 300 MHz. It includes quad MAC and quad ALU computational resources, a high-performance load/store memory architecture, and dedicated co-processor interfaces, combined with state-of-the-art power reduction techniques. These attributes make the ZSP600 core an ideal solution for a variety of embedded DSP algorithms, including those required for wireless infrastructure, mobile (3G), IAD/home gateway, central office, and access/network applications. ZSP600 instruction parallelism is supported by user-transparent instruction grouping and pipeline control to deliver superscalar DSP performance while programming with a RISC instruction set.
The ZSP600 is a fully synthesizable, single-phase clocked architecture, with all core I/Os registered for ease of process migration and design flexibility. The ZSP600 provides
extensive computational resources, including four 16-bit multipliers/MACs, dual 40-bit
ALUs, and dual 16-bit ALUs, all capable of supporting 16- and 32-bit operations. The
ZSP600 can perform four independent 16x16 MUL/MAC operations into four 16-bit or
two 40-bit results, two 32x32-bit MUL/MACs into a 32-bit result, or two Viterbi (add-
compare-select) results per cycle. The ZSP600 is based upon a high-bandwidth memory architecture with a separate eight-instruction-per-cycle prefetch interface and dual 64-bit data interfaces over a 24-bit address space. The instruction memory architecture allows multi-instruction-per-cycle prefetch into an integrated instruction cache. The data memory architecture incorporates dual independent 64-bit load/store units, with dedicated address generation, allowing up to eight 16-bit word or four 32-bit word load/store operations per cycle. The ZSP600 integrates a bi-directional co-processor interface to support hardware acceleration. The memory subsystem (MSS) is decoupled from the DSP operations to provide increased flexibility in support of different memory schemes. The core also includes instruction set enhancements to the RISC architecture for improved broadband and wireless application support.
A WORD ON SUPERSCALAR DSP
A superscalar architecture simply implies that the architecture is responsible for resolving
the operand and resource hazards and that it has the resources to achieve an instruction throughput greater than one instruction per clock. Logic dedicated to pipeline control
is kept to a minimum by enforcing in-order execution and by isolating the control to a
single stage at the head of the pipeline. This stage issues sequential groups of instructions
that have no data dependencies or other resource conflicts. Once a group of instructions has
been issued, they advance through the pipeline in lock step.
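The grouping rule just described can be sketched in a few lines: issue consecutive instructions together until one reads a register written by an earlier instruction in the same group (a read-after-write hazard). The three-operand encoding below is invented for illustration, and write-write and resource conflicts are ignored to keep the sketch short.

```c
#include <assert.h>

/* Toy instruction: one destination register, two source registers. */
typedef struct { int dst, src1, src2; } insn;

/* How many instructions starting at insns[0] can issue as one group. */
static int group_size(const insn *insns, int n, int max_issue) {
    int limit = n < max_issue ? n : max_issue;
    for (int i = 1; i < limit; i++) {
        for (int j = 0; j < i; j++) {
            if (insns[i].src1 == insns[j].dst ||
                insns[i].src2 == insns[j].dst)
                return i;   /* RAW hazard: close the group before insn i */
        }
    }
    return limit;
}
```

For instance, r1=r2+r3 and r4=r5+r6 can issue together, but r7=r1+r4 must wait for the next group because it reads both earlier results.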
A VLIW machine does not employ instruction scheduling or pipeline protection.
Instructions in a VLIW pipeline are statically issued, and it is the programmer's responsibility to prevent data hazards and resource conflicts. Superscalar architectures also facilitate software compatibility, not only between implementations of the same architecture but also from one generation of the architecture to the next, thus extending software lifetime.
ARCHITECTURE OVERVIEW
The G2 architecture is scalable in terms of arithmetic resources, data bandwidth, and
pipeline capacity. This scalable nature allows the architecture to support multiple implementations that target different application spaces.
All address and data I/O communication across the core boundary is registered. This feature is highly desirable from an SOC system designer's point of view for a number of reasons, one being the removal of timing budget ambiguities between system logic and the core.
The prefetch unit (PFU) sits at the head of the instruction pipeline. The ZSP600 can prefetch
eight 16-bit words per cycle. It is responsible for maximizing the probability that the
instruction cache has the data required by the instruction sequencing unit (ISU) for any
given fetch cycle. The prefetch unit performs limited decoding to identify code
discontinuities and to apply static branch prediction when necessary. The ISU is
responsible for instruction fetch and decode, instruction grouping, and instruction issue.
Instruction grouping refers to the pipeline stage in which operand dependencies are
resolved. The ISU issues groups of in-order instructions that will not cause any operand
conflicts. This is the only unit (and only stage in the execution pipeline) that enforces
pipeline protection. Isolating the pipeline protection logic in this manner simplifies pipeline
control logic significantly.
The ZSP600 ISU can issue up to six instructions per cycle, one to each of the six primary
datapaths: two address generation units (AGUs), two arithmetic logic units (ALUs), and
two multiply/accumulate/arithmetic units (MAUs) that are capable of performing up to four MAC operations per cycle. The pipeline control unit (PCU) stages control associated with each of the primary data paths and the bypass logic. The PCU is also responsible for managing interrupt control, the co-processor interface, the debug interface, and the on-core timers. The bypass unit (BYP) handles all the data forwarding between execution units.
PIPELINE
The pipeline of the G2 architecture is an eight-stage pipeline. The existing architecture uses a data prefetch mechanism, called data linking, to efficiently sustain the required data bandwidth for its dual MAC. All pipeline protection and resource allocation is performed
during the grouping stage. Instruction groups are issued by the grouping stage and advance
in lock step down the remainder of the pipeline.
Data address generation is performed in the AG stage. This stage is also responsible for
enforcing the boundaries of the circular buffers. A load or store that straddles a boundary of
the circular buffer is split by the AGU into two sequential accesses. Stages M0 and M1 are
allocated for data memory loads. They are optimized for systems using synchronous RAM.
M0 is allocated for address decode and M1 for data access and return. Load and store
requests are registered and issued to the memory subsystem in M0. The memory interface is stallable. If the MSS determines that it cannot return requested data during M1, it stalls the core until the data is ready.
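The AG-stage split described above, where a load or store straddling a circular-buffer boundary becomes two sequential accesses, can be sketched as follows. The struct and word-based units are our own simplification, not the actual AGU interface.

```c
#include <assert.h>

/* One memory access: starting address and size, in words. */
typedef struct { unsigned addr, size; } mem_access;

/* Split an access of `size` words at `addr` inside circular buffer
   [base, base + len). Returns the number of accesses written to out. */
static int split_access(unsigned addr, unsigned size,
                        unsigned base, unsigned len, mem_access out[2]) {
    unsigned top = base + len;
    if (addr + size <= top) {                 /* fits in one access */
        out[0].addr = addr; out[0].size = size;
        return 1;
    }
    out[0].addr = addr; out[0].size = top - addr;         /* up to boundary */
    out[1].addr = base; out[1].size = size - out[0].size; /* wrapped part   */
    return 2;
}
```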
ARITHMETIC RESOURCES
By adding two AGUs, along with dedicated address registers, the arithmetic throughput of G2 demonstrates an immediate improvement. The two AGUs allow the core to issue any
combination of two loads or stores per cycle. The data size of the load/store is
implementation specific. Each data port in the ZSP600 is 64-bits wide, allowing a total of
128-bits (8 words) of data to be loaded per cycle. The AGUs have dedicated hardware to
support four circular buffers and reverse-carry addressing. The circular buffer support has
been enhanced in functionality to support load/store operations with positive and negative offsets and signed indexes. Circular buffer logic also applies to address arithmetic and has no alignment restrictions.
REGISTER RESOURCES
With the 32-bit address registers, the architecture allows implementations of the core to
remain flexible in defining the physical linear address space. The actual address register
remains a 32-bit register to ensure pointer sizes remain the same from one implementation
to the next. This also allows the address registers to be used as temporary registers for the
GPRs. Dedicated address registers simplify the instruction decoder and issue logic as it can
now identify address related operations and assign the datapath resources appropriately.
The primary operand resource of the AGUs is the address register file, allowing the
general-purpose register file to be physically optimized for data moving to and from the
ALUs and MAUs. The current generation defines two 32-bit registers and another 16-bit register whose low and high bytes correspond to the upper byte of each accumulator, respectively, thus resulting in a 40-bit accumulator. A guard byte is now available for each of the eight extended 32-bit registers of the GPRs. Accumulators are also recognized in the
programming model by providing associated instruction set support for 40-bit arithmetic
and data loads and stores.
INSTRUCTION SET ENHANCEMENTS
A powerful enhancement to the new architecture is the ability to conditionally execute
instructions. The programming model for G2 allows programmers to define packets of
instructions that are predicated on a specified condition. The programmer then defines a
bracketed set of up to eight instructions that will be predicated in the execution pipeline.