DSP Processors: Introduction
Overview
What is a Digital Signal Processor (DSP)?
Processor Trends – Architectures.
Signal Processing Hardware Trends – what are the other processor options?
What is available in Market?
How to Choose a DSP?
Conclusions
What is a DSP?
Digital Signal Processors are microprocessors specifically designed to handle digital signal processing tasks.
DSPs must also have a predictable execution time.
DSPs are designed to operate in real time.
Processor Trends - Architectures
Hardware Units in DSP Processors
Multiplier Accumulator (MAC) Unit.
The most common operation in digital signal processing is array multiplication: corresponding elements of two arrays are multiplied and the products are summed.
Consider the implementation of an FIR digital filter, the most common DSP technique.
To implement this operation in real time, we require a hardware multiplier unit that gives the result of a multiplication in a single clock cycle.
We also need to add, or accumulate, the results, so we need an adder.
Together these are known as a MAC unit, one of the mandatory requirements of a programmable DSP.
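The multiply-accumulate pattern described above can be sketched in a few lines of Python. This is only an illustration, not how any particular DSP implements it: the function name `fir_output`, the argument layout and the sample values are assumptions made for the example. Each pass through the loop corresponds to one MAC operation, which a DSP aims to complete in a single clock cycle.

```python
def fir_output(coeffs, recent_samples):
    """Compute one FIR output sample as a sum of products.

    coeffs[k] is the k-th filter coefficient and recent_samples[k] is the
    input sample from k steps ago, so index 0 is the newest sample.
    """
    acc = 0  # plays the role of the accumulator in the MAC unit
    for a_k, x_k in zip(coeffs, recent_samples):
        acc += a_k * x_k  # one multiply-accumulate per filter tap
    return acc


# Hypothetical 4-tap filter and made-up samples, for illustration only.
print(fir_output([1, 1, 1, 1], [4, 3, 2, 1]))  # prints 10
```

An N-tap filter needs N such operations for every output sample, which is why a single-cycle MAC matters for real-time operation.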
Circular Buffers:
To calculate the output sample, we must have access to a certain number of the most recent samples from the input.
For example, suppose we use eight coefficients in this filter, a0, a1, ..., a7. This means we must know the values of the eight most recent samples from the input signal, x[n], x[n-1], ..., x[n-7].
These eight samples must be stored in memory and continually updated as new samples are acquired.
The best way to manage these stored samples is circular buffering.
The circular buffer is placed in eight consecutive memory locations, 20041 to 20048.
The idea of circular buffering is that the end of this linear array is connected to its beginning; memory location 20041 is viewed as being next to 20048, just as 20044 is next to 20045.
We keep track of the array with a pointer that indicates where the most recent sample resides.
When a new sample is acquired, it replaces the oldest sample in the array, and the pointer is moved one address ahead.
Four parameters are needed to manage a circular buffer.
A pointer that indicates the start of the circular buffer in memory (in this example, 20041).
A pointer indicating the end of the array (e.g., 20048), or a variable that holds its length (e.g., 8).
The step size of the memory addressing.
These first three values define the size and configuration of the circular buffer, and do not change during program operation.
The fourth value, the pointer to the most recent sample, must be modified as each new sample is acquired.
There must be program logic that controls how this fourth value is updated based on the first three values.
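The four parameters can be made concrete with a small software model. This is a sketch under stated assumptions: the class name `CircularBuffer` and its methods are hypothetical, list offsets 0 to 7 stand in for memory addresses 20041 to 20048, and the step size is assumed to be 1. A DSP would normally do the wrap-around in its address-generation hardware rather than in program code.

```python
class CircularBuffer:
    """Software model of a circular buffer (illustrative names only)."""

    def __init__(self, length, step=1):
        self.data = [0] * length  # offset 0 is the start, length - 1 the end
        self.length = length      # fixed: length of the buffer
        self.step = step          # fixed: address step size (assumed 1)
        self.newest = length - 1  # changes: offset of the most recent sample

    def push(self, sample):
        # Move the pointer one step ahead, wrapping from the end to the start,
        # so that it lands on the oldest sample, which is then overwritten.
        self.newest = (self.newest + self.step) % self.length
        self.data[self.newest] = sample

    def recent(self, k):
        """Return the sample acquired k steps ago (k = 0 is the newest)."""
        return self.data[(self.newest - k * self.step) % self.length]


buf = CircularBuffer(8)
for sample in range(1, 11):  # acquire ten made-up samples: 1, 2, ..., 10
    buf.push(sample)
print([buf.recent(k) for k in range(8)])  # prints [10, 9, 8, 7, 6, 5, 4, 3]
```

Only the `newest` pointer changes as samples arrive; no sample is ever moved in memory, which is the point of the technique.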
Modified Bus Structures and Memory Access Schemes:
Multiple Access Memory:
The number of memory accesses per clock cycle can also be increased by using a high-speed memory that permits more than one access per clock period (e.g., dual-access RAM, DARAM).
Multiple access RAM can be connected to the processing unit of the Harvard architecture.
Multiported Memory:
Multiported memories dispense with the need for storing the program and data in two different memory chips.
They are more expensive.
Processor Trends - Architectures
Processor Architecture Trends are
VLIW
Advanced Super Harvard
SIMD
Simplified instruction sets (RISC): architectures to increase clock speeds and compatibility.
More complex instruction sets (CISC) for higher performance.
Mixed-width instruction sets to reduce memory usage.
Deeper pipelines to enable higher clock speeds.
DSP-enhanced GPPs.
Architecture Evolution
In the traditional Von Neumann architecture there is only a single memory and a single bus for transferring data into and out of the CPU.
In the Harvard architecture, there are separate memories for data and program, with a separate bus for each.
Since the buses operate independently, program instructions and data can be fetched at the same time, improving speed.
Another improvement is the Super Harvard architecture.
A handicap of the basic Harvard design is that the data memory bus is busier than the program memory bus.
To improve upon this situation, we start by relocating part of the "data" to program memory.
For instance, we might place the filter coefficients in program memory, while keeping the input signal in data memory.
At first glance, this relocation only moves the bottleneck: the program memory bus must now carry both the instructions and the coefficients.
However, DSP algorithms generally spend most of their execution time in loops.
This means that the same set of program instructions will continually pass from program memory to the CPU.
The Super Harvard architecture takes advantage of this situation by including an instruction cache in the CPU.
This is a small memory that contains about 32 of the most recent program instructions.
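The effect of these bus arrangements can be illustrated with a deliberately simplified model. It assumes that one filter tap needs three memory accesses (one instruction, one coefficient, one input sample), that each bus moves one word per cycle, and that writes of results are ignored. The function and dictionary names are hypothetical.

```python
def cycles_per_tap(bus_traffic):
    """Cycles needed for one filter tap: the busiest bus sets the pace."""
    return max(bus_traffic.values())


# Words moved per tap on each bus, under the assumptions stated above.
von_neumann = {"shared bus": 3}                    # instruction + two data words
harvard = {"program bus": 1, "data bus": 2}        # both data words on one bus
relocated = {"program bus": 2, "data bus": 1}      # coefficient moved, no cache
super_harvard = {"program bus": 1, "data bus": 1}  # instruction comes from cache

for name, traffic in [("Von Neumann", von_neumann), ("Harvard", harvard),
                      ("Coefficients relocated", relocated),
                      ("Super Harvard", super_harvard)]:
    print(name, cycles_per_tap(traffic))
```

In this model, relocating the coefficients alone does not help; the gain appears only once the loop's instructions are served from the cache, leaving the program bus free for coefficients.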
An I/O controller is connected to data memory; this is how signals enter and exit the system.
Most processors contain both serial and parallel communications ports.
Dedicated hardware allows these data streams to be transferred directly into memory (Direct Memory Access, or DMA), without having to pass through the CPU's registers.
This type of high-speed I/O is a key characteristic of DSPs.
Some DSPs have onboard analog-to-digital and digital-to-analog converters, a feature called mixed signal.
Exploiting ILP - VLIW
ILP (Instruction Level Parallelism) is the ability to perform multiple operations (or instructions), from a single instruction stream, in parallel.
ILP is a set of design techniques that speed up programs by executing several RISC-style operations in parallel, such as memory loads and stores, integer additions, and floating-point multiplications.
These operations are taken from a single stream of execution rather than from parallel tasks.
Available ILP: the parallelism inherent in a region of the code.
Achievable ILP: the parallelism provided by the hardware.
ILP Hardware: Hardware can offer ILP in several ways.
Several of the functional units found in a processor can execute at the same time. Here we allow operations to execute simultaneously on each of the functional units. Having separate register banks for the integer and floating-point data helps by reducing potential hardware resource conflicts.
Multiple copies of the functional units, possibly accessing different register files to add register bandwidth, can be added for the purpose of executing in parallel.
Functional units with latency longer than one cycle can be pipelined. That is, the floating-point and cache operations are pipelined so that one can be initiated each cycle, even though each might take several cycles to finish.
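The benefit of pipelining a multi-cycle functional unit can be shown with simple arithmetic. The sketch below assumes the operations are independent of one another; the function name `cycles_needed` and the 3-cycle latency are illustrative choices, not figures for any real processor.

```python
def cycles_needed(num_ops, latency, pipelined):
    """Cycles to finish num_ops independent operations on one functional unit."""
    if num_ops == 0:
        return 0
    if pipelined:
        # A new operation starts every cycle; the last one finishes
        # `latency` cycles after it starts.
        return latency + (num_ops - 1)
    # Without pipelining, each operation must finish before the next starts.
    return num_ops * latency


print(cycles_needed(4, 3, pipelined=False))  # prints 12
print(cycles_needed(4, 3, pipelined=True))   # prints 6
```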
[Figure: General ILP organization. Blocks: instruction memory, instruction fetch unit, instruction decode unit, functional units FU-1 to FU-5, register file, bypassing network, data memory, CPU.]
Example: Consider the code,
Cycle 1: add t3 = t1, t2
Cycle 2: store [addr 0] = t3
Cycle 3: fmul f6 = f7, f14
Cycle 4: ....waiting....
Cycle 5: ....waiting....
Cycle 6: fmul f7 = f7, f15
Cycle 7: ....waiting....
Cycle 8: ....waiting....
Cycle 9: add t1 = p2, p7
Cycle 10: add t5 = p2, p10
Cycle 11: add t4 = t1, t5
Cycle 12: store [addr 1] = t4
If we have three integer units, one floating-point unit and one load/store unit, then the code can be arranged as,
Cycle 1:
add t3 = t1, t2
add t1 = p2, p7
add t5 = p2, p10
fmul f6 = f7, f14
Cycle 2:
add t4 = t1, t5
fmul f7 = f7, f15
store [addr 0] = t3
Cycle 3:
store [addr 1] = t4
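One way to arrive at such an arrangement is a simple greedy scheduler, sketched below. It is not presented as the method any real compiler uses. The tuple encoding of the operations and the function name `schedule` are hypothetical, and the model assumes that a result can be used in the cycle after its operation is issued, that registers are read before they are written within a cycle, and that the floating-point unit is pipelined.

```python
from collections import defaultdict

# (unit kind, destination register or None, source registers), in program order.
PROGRAM = [
    ("int", "t3", ("t1", "t2")),   # add   t3 = t1, t2
    ("mem", None, ("t3",)),        # store [addr 0] = t3
    ("fp", "f6", ("f7", "f14")),   # fmul  f6 = f7, f14
    ("fp", "f7", ("f7", "f15")),   # fmul  f7 = f7, f15
    ("int", "t1", ("p2", "p7")),   # add   t1 = p2, p7
    ("int", "t5", ("p2", "p10")),  # add   t5 = p2, p10
    ("int", "t4", ("t1", "t5")),   # add   t4 = t1, t5
    ("mem", None, ("t4",)),        # store [addr 1] = t4
]
UNITS = {"int": 3, "fp": 1, "mem": 1}  # functional units available per cycle


def schedule(program, units):
    """Return the issue cycle chosen for each operation, in program order."""
    used = defaultdict(int)  # (cycle, kind) -> units already taken
    last_write = {}          # register -> cycle of its most recent write
    last_read = {}           # register -> latest cycle in which it is read
    cycles = []
    for kind, dst, srcs in program:
        earliest = 1
        for reg in srcs:  # wait for operands (read after write)
            if reg in last_write:
                earliest = max(earliest, last_write[reg] + 1)
        if dst is not None:
            # Do not overwrite a register before earlier readers have read it.
            earliest = max(earliest, last_read.get(dst, 1))
            if dst in last_write:  # keep writes to one register in order
                earliest = max(earliest, last_write[dst] + 1)
        cycle = earliest
        while used[(cycle, kind)] >= units[kind]:  # wait for a free unit
            cycle += 1
        used[(cycle, kind)] += 1
        for reg in srcs:
            last_read[reg] = max(last_read.get(reg, 0), cycle)
        if dst is not None:
            last_write[dst] = cycle
        cycles.append(cycle)
    return cycles


print(schedule(PROGRAM, UNITS))  # prints [1, 2, 1, 2, 1, 1, 2, 3]
```

Under these assumptions the eight operations fit into three cycles, matching the arrangement above: the second fmul waits only for the single floating-point unit, not for the first result, because the two multiplications are independent.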
VLIW
VLIW = Very Long Instruction Word architecture
Instruction format:
operation 1 | operation 2 | operation 3 | operation 4 | operation 5
VLIW Architecture:
VLIW processors have a number of processing units (data paths), i.e., a number of ALUs, MAC units, shifters, etc.
The VLIW is fetched from memory and is used to specify the operands and operations to be performed by each of the data paths.
The multiple functional units share a common multiported register file for fetching the operands and storing the results.
The performance gains that can be achieved with a VLIW architecture depend on the degree of parallelism in the algorithm selected for a DSP application and on the number of functional units.
The throughput will be higher only if the algorithm involves execution of independent operations.
It is the compiler that does the job of determining ILP and scheduling it on the functional units.
[Figure: A VLIW architecture with 7 FUs. Blocks: instruction memory, three integer FUs, two load/store (LD/ST) units, two floating-point FUs, integer register file, floating-point register file, data memory.]
SIMD
SIMD (Single Instruction Multiple Data)
A single stream of instructions is broadcast to a number of processors.
All processors execute the same program but operate on different data.
Nodes have mesh or hypercube connectivity.
Each processing element (PE) can exchange values with its neighbors and has a few registers, some local memory and an ALU.
[Figure: An SIMD organization.]
[Figure: SIMD execution method. Instructions 1 to n are issued in order along the time axis, and each instruction is executed on node 1, node 2, ..., node K.]
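The SIMD execution method can be mimicked in software as a rough illustration. In this sketch each list element stands for the local data of one node, and each "instruction" is an ordinary Python function. The name `simd_run` and the sample program are made up for the example, and Python visits the nodes one after another, whereas SIMD hardware would update them at the same time.

```python
def simd_run(instructions, node_data):
    """Broadcast each instruction in turn; every node applies it to its own data."""
    for instruction in instructions:  # the single instruction stream
        # Same instruction, different data on every node.
        node_data = [instruction(value) for value in node_data]
    return node_data


# Hypothetical two-instruction program: double the value, then add one.
program = [lambda x: x * 2, lambda x: x + 1]
print(simd_run(program, [1, 2, 3]))  # prints [3, 5, 7]
```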
Architecture Trends – The Down Side
VLIW, SIMD and deep pipelines can increase:
Memory use.
Energy consumption.
Code generation complexity and programming difficulty.
Simple instruction sets often increase memory usage, because more instructions are needed to accomplish a given task.
Complex instruction sets hinder compatibility.
Compatibility can bring messy compromises.
Summary
Each processor makes different trade-offs, depending on its target application.
Top speed is often not the goal.
DSP Hardware Trends
Today’s system engineers have a wealth of options for implementing DSP tasks.
GPPs (general-purpose processors)
DSPs
Application-specific DSPs
Customizable cores
ASSPs – Application-Specific Standard Products
ASICs – Application-Specific Integrated Circuits
FPGAs – Field-Programmable Gate Arrays
How to Choose?
Performance Analysis
Comparing benchmarking approaches
Benchmarking approaches
How to Benchmark?
Simplified metrics
E.g., MIPS, MOPS, MMACS
Full DSP applications
E.g., V.90 modem
DSP algorithm “kernel” benchmarks
E.g., FIR filter, FFT, etc.
Algorithm Kernel Benchmarks
Most of the benchmarks are based on DSP algorithm kernels.
DSP algorithm kernels are the most computationally intensive portions of DSP applications.
Examples include FFTs, IIR and FIR filters, and Viterbi decoders.
Benchmark results are used with application profiling to predict overall performance.
[Figure: Application profile (pie chart). IDCT 39%, Window 25%, Denorm 11%, Other 25%.]
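As a rough illustration of how kernel benchmark results and an application profile might be combined, the sketch below weights each kernel's speedup by its share of the run time. The profile fractions are the ones in the application profile above; the per-kernel speedups are invented for the example, and the function name `predicted_speedup` is hypothetical. Parts of the application without a benchmark result are assumed to run at the same speed as before.

```python
# Share of total run time per portion of the application (from the profile).
PROFILE = {"IDCT": 0.39, "Window": 0.25, "Denorm": 0.11, "Other": 0.25}

# Made-up kernel benchmark results: speedup of a candidate processor
# relative to a baseline processor. These are not measured values.
HYPOTHETICAL_SPEEDUPS = {"IDCT": 2.0, "Window": 1.5}


def predicted_speedup(profile, kernel_speedups):
    """Estimate the overall speedup from per-kernel speedups and run-time shares."""
    new_time = sum(fraction / kernel_speedups.get(name, 1.0)
                   for name, fraction in profile.items())
    return 1.0 / new_time


print(round(predicted_speedup(PROFILE, HYPOTHETICAL_SPEEDUPS), 2))  # prints 1.39
```

Even with a kernel running twice as fast, the overall gain stays modest because the unaccelerated portions still account for a large share of the run time.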
Advantages
Relevant: chosen by analysis of real DSP applications.
Kernels are short, allowing:
Functionality to be precisely specified.
Benchmarks to be implemented and optimized in a reasonable amount of time.
Disadvantages
Not practical to implement all important algorithms.
Do not reflect application-level optimizations and trade-offs.
Emerging Benchmarking Challenges
New technologies create performance-analysis challenges:
Multi-core devices
DSP-enhanced FPGAs
Application-specific processors
Customizable processors
Reconfigurable processors
Evolving applications and tools also lead to new challenges:
Increasing reliance on C compilers.
For technologies not well served by kernel benchmarks, such as FPGAs and application-specific processors, practicality concerns can be partly addressed by using off-the-shelf applications wherever available, or by using simplified applications.
What is available in Market?
Latest Processors
High performance processors
Texas Instruments TMS320C64xx
StarCore SC140
Low Power Processors
Texas Instruments TMS320C55xx
Analog Devices Blackfin (ADSP-BF53x)
General-purpose/ DSP Processors
Intel PXA2xx
Texas Instruments OMAP5910
DSP Speed
[Chart: Speed performance.]
1. TI ‘C5502 (300 MHz): 1460
2. ADI ‘BF53x (600 MHz): 3360
3. TI ‘C6414 (720 MHz): 6480
4. StarCore SC140 (300 MHz): 3430
5. Intel PXA2xx (400 MHz): 930
What factors affect DSP speed?
Parallelism
How many parallel operations can be performed per cycle
Instruction set
Suitability for the task at hand
Clock speed
Data types
Data bandwidth
Pipeline depth
Instruction latencies
Support for DSP-oriented features
DSP addressing modes
Zero-overhead looping
Saturation, scaling, rounding
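Of the DSP-oriented features listed above, saturation is easy to illustrate in software. The sketch below assumes 16-bit signed samples; the function names are hypothetical. It compares a saturating addition with ordinary two's-complement wrap-around. On a DSP the clamping would be done by the arithmetic hardware rather than by extra instructions.

```python
INT16_MAX = 32767
INT16_MIN = -32768


def saturating_add(a, b):
    """Add two 16-bit signed values, clamping the result to the 16-bit range."""
    return max(INT16_MIN, min(INT16_MAX, a + b))


def wrapping_add(a, b):
    """Add two 16-bit signed values with two's-complement wrap-around."""
    total = (a + b) & 0xFFFF
    return total - 0x10000 if total >= 0x8000 else total


print(saturating_add(30000, 10000))  # prints 32767
print(wrapping_add(30000, 10000))    # prints -25536
```

With wrap-around, an overflow turns a large positive sample into a large negative one; saturation limits the error to clipping at full scale.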
Memory Use
[Chart: Memory use comparison, in bytes. Lower is better.]
1. TI ‘C55xx (8/16/32/48): 146
2. ADI ‘BF53x (16/32/64): 140
3. TI ‘C64xx (32): 256
4. StarCore SC140 (16/32): 144
5. Intel PXA2xx (16/32): 140
What factors affect memory use?
Processors’ memory usage is affected by:
Instruction set
Wider instructions take more memory.
Mixed-width instructions are becoming popular: use short, simple instructions for simple tasks and longer instructions for more complex tasks.
Suitability of the instruction set for the task at hand.
Architecture
VLIW, SIMD and deeper pipelines may encourage optimizations that increase memory use to obtain speed-optimized code.
Compiler quality (for compiled code)
Energy Efficiency
[Chart: Energy efficiency. Higher is better.]
1. TI ‘C5502 (300 MHz), 1.26 V: 11.8
2. ADI ‘BF53x (600 MHz), 1.2 V: 16.9
3. TI ‘C6414 (300 MHz), 1.0 V: 16.1
4. Motorola MSC8101 (SC140) (300 MHz), 1.5 V: 13.7
5. Intel PXA2xx (400 MHz), 1.0 V: 2.6
What factors affect energy efficiency?
Processors’ energy efficiency is affected by:
Speed
Fabrication process, voltage, circuit design, logic design
Hardware implementation
Memory usage
Compiler quality (for compiled code)
Cost Performance
[Chart: Cost performance. Higher is better.]
1. TI ‘C5502 (300 MHz), $10: 146.2
2. ADI ‘BF53x (600 MHz), $6: 375.9
3. TI ‘C6414 (300 MHz), $45: 98.3
4. Motorola MSC8101 (SC140) (300 MHz), $118: 29
5. Intel PXA2xx (400 MHz), $27: 25.6
What factors affect cost performance?
Speed
Chip cost, which is affected by:
Fabrication process
Size of on-chip memory, influenced by the processor’s memory usage
On-chip peripherals
Manufacturing volume
But good cost-performance results do not necessarily mean a chip is suitable for applications with severe cost constraints.
Users do not want to pay for more performance than needed.
Conclusions
DSP processor architecture innovation has accelerated greatly.
New processor types are increasingly competitive:
DSP-enhanced general-purpose processors
Multiprocessor chips
Customizable cores
Non-processor approaches are increasingly competitive:
DSP-enhanced FPGAs
Architectural options are expanding.
Today’s DSP-oriented processors cannot be meaningfully compared using simplified metrics.
Relevant, meaningful benchmark results are essential for processor evaluation.
There is no ideal processor.
Fastest does not mean best.
The “best” processor depends on the details of the application.
Different architectural approaches make different performance trade-offs. Understanding these is key to selecting a processor.