DSP Processors: Introduction

Overview

What is a Digital Signal Processor (DSP)?

Processor Trends – Architectures

What are the Signal Processing Hardware Trends – Other Processor Options?

What is available in the market?

How to choose a DSP?

Conclusions

What is a DSP?

Digital Signal Processors are microprocessors specifically designed to handle Digital Signal Processing tasks.

DSPs must also have a predictable execution time.

DSPs are designed to operate in real time.


Processor Trends - Architectures

Hardware Units in DSP Processors

Multiplier Accumulator (MAC) Unit

The most common operation in digital signal processing is array multiplication, that is, a sum of products.

Consider the implementation of an FIR digital filter, the most common DSP technique.

To implement this operation in real time, we require a hardware multiplier unit that gives the result of a multiplication in a single clock cycle.

We also need to add, or accumulate, the results, so we need an adder.

Together these form a MAC unit, one of the mandatory requirements of a programmable DSP.
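For an FIR filter with coefficients a0, a1, …, each output sample is the sum y[n] = a0·x[n] + a1·x[n-1] + …, so every term costs one multiply and one accumulate. The following is a minimal Python sketch of that loop, not a DSP implementation; the function names fir_output and fir_filter are illustrative and do not come from the slides, and samples before the start of the signal are assumed to be zero.

```python
def fir_output(coeffs, recent_samples):
    """Compute one FIR output sample as a chain of multiply-accumulate steps.

    coeffs[k] is a_k; recent_samples[k] is x[n-k] (newest sample first).
    """
    if len(coeffs) != len(recent_samples):
        raise ValueError("need one sample per coefficient")
    acc = 0.0  # plays the role of the MAC unit's accumulator
    for a_k, x_k in zip(coeffs, recent_samples):
        acc += a_k * x_k  # one MAC step: multiply, then add into the accumulator
    return acc


def fir_filter(coeffs, signal):
    """Filter a whole signal, treating samples before the start as zero (assumption)."""
    history = [0.0] * len(coeffs)  # x[n], x[n-1], ..., oldest last
    out = []
    for sample in signal:
        history = [sample] + history[:-1]  # shift in the newest sample
        out.append(fir_output(coeffs, history))
    return out
```

On a DSP with a hardware MAC unit, each pass through the inner loop would ideally take a single clock cycle; in this sketch the loop only shows the arithmetic.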


Circular Buffers:

To calculate the output sample, we must have access to a certain number of the most recent samples from the input.

For example, suppose we use eight coefficients in this filter, a0, a1, …, a7. This means we must know the value of the eight most recent samples from the input signal, x[n], x[n-1], …, x[n-7].

These eight samples must be stored in memory and continually updated as new samples are acquired.

The best way to manage these stored samples is circular buffering.


The circular buffer is placed in eight consecutive memory locations, 20041 to 20048.

The idea of circular buffering is that the end of this linear array is connected to its beginning; memory location 20041 is viewed as being next to 20048, just as 20044 is next to 20045.

We keep track of the array with a pointer that indicates where the most recent sample resides.

When a new sample is acquired, it replaces the oldest sample in the array, and the pointer is moved one address ahead.


Four parameters are needed to manage a circular buffer.

A pointer that indicates the start of the circular buffer in memory (in this example, 20041).

A pointer indicating the end of the array (e.g., 20048), or a variable that holds its length (e.g., 8).

The step size of the memory addressing must be specified.

These three values define the size and configuration of the circular buffer and do not change during program operation.

The fourth value, the pointer to the most recent sample, must be modified as each new sample is acquired.

There must be program logic that controls how this fourth value is updated based on the first three values.
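The four parameters and the pointer-update logic can be sketched in Python as follows. This is only an illustration of the idea: the class name CircularBuffer and its methods are hypothetical, memory is modelled as a dictionary keyed by address, and the buffer is assumed to start out filled with zeros.

```python
class CircularBuffer:
    """Sample store addressed like the example above (20041 to 20048)."""

    def __init__(self, start, length, step=1):
        self.start = start    # parameter 1: address of the first location (fixed)
        self.length = length  # parameter 2: number of locations (fixed)
        self.step = step      # parameter 3: address increment per sample (fixed)
        # Parameter 4: pointer to the most recent sample. It starts at the last
        # location so that the first write wraps around to `start`.
        self.newest = start + (length - 1) * step
        self.memory = {start + i * step: 0.0 for i in range(length)}

    def push(self, sample):
        """Move the pointer one step ahead (with wrap-around) and overwrite the oldest sample."""
        span = self.length * self.step
        self.newest = self.start + (self.newest - self.start + self.step) % span
        self.memory[self.newest] = sample

    def recent(self):
        """Return the samples newest first: x[n], x[n-1], ..., x[n-length+1]."""
        span = self.length * self.step
        return [
            self.memory[self.start + (self.newest - self.start - k * self.step) % span]
            for k in range(self.length)
        ]
```

For example, after nine samples have been pushed into an eight-entry buffer starting at 20041, the pointer has wrapped back to 20041 and the first sample has been overwritten. Many DSPs perform this modulo address update in hardware, so it costs no extra instructions.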


Modified Bus Structures and Memory Access Schemes:


Multiple Access Memory:

The number of memory accesses per clock cycle can also be increased by using a high-speed memory that permits more than one access per clock period (e.g., DARAM).

Multiple access RAM can be connected to the processing unit of the Harvard architecture.

Multiported Memory:

Multiported memories dispense with the need for storing the program and data in two different memory chips.

They are more expensive.

Processor Trends - Architectures

Processor Architecture Trends are

VLIW

Advanced Super Harvard

SIMD

Simplified instruction sets (RISC): architectures to increase clock speeds and compatibility.

More complex instruction sets (CISC) for higher performance.

Mixed-width instruction sets to reduce memory usage.

Deeper pipelines to enable higher clock speeds.

DSP-enhanced GPP.

Architecture Evolution

In the traditional Von Neumann architecture there is only a single memory and a single bus for transferring data into and out of the CPU.

In the Harvard architecture, there are separate memories for data and program, with separate buses for each.

Since the buses operate independently, program instructions and data can be fetched at the same time, improving speed.


Another improvement is the Super Harvard Architecture.

A handicap of the basic Harvard design is that the data memory bus is busier than the program memory bus.

To improve upon this situation, we start by relocating part of the "data" to program memory.

For instance, we might place the filter coefficients in program memory, while keeping the input signal in data memory.


However, DSP algorithms generally spend most of their execution time in loops.

This means that the same set of program instructions will continually pass from program memory to the CPU.

The Super Harvard architecture takes advantage of this situation by including an instruction cache in the CPU.

This is a small memory that contains about 32 of the most recent program instructions.


An I/O controller is connected to data memory, through which the signals enter and exit the system.

Most of the processors contain both serial and parallel communications ports.

Dedicated hardware allows these data streams to be transferred directly into memory (Direct Memory Access, or DMA), without having to pass through the CPU's registers.

This type of high-speed I/O is a key characteristic of DSPs.

Some DSPs have onboard analog-to-digital and digital-to-analog converters, a feature called mixed signal.

Exploiting ILP - VLIW

ILP - Instruction Level Parallelism

The ability to perform multiple operations (or instructions), from a single instruction stream, in parallel.

Exploiting ILP relies on a set of design techniques that speed up programs by executing several RISC-style operations in parallel, such as memory loads and stores, integer additions, and floating-point multiplications.

These operations are taken from a single stream of execution rather than from parallel tasks.

Available ILP: inherent in a region of the code.

Achievable ILP: provided by the hardware.

ILP Hardware: Hardware can offer ILP in several ways.

Several of the functional units found in a processor can execute at the same time. Here we allow operations to execute simultaneously on each of the functional units.

Having separate register banks for integer and floating-point data can help by reducing potential hardware resource conflicts.

Multiple copies of the functional units, possibly accessing different register files to add register bandwidth, can be added for the purpose of executing in parallel.

Functional units with latency longer than one cycle can be pipelined. That is, the floating-point and cache operations are pipelined so that one can be initiated each cycle, even though each might take several cycles to finish.
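The benefit of pipelining a multi-cycle functional unit can be shown with a small cycle count. The sketch below assumes the operations are independent of one another and that a pipelined unit accepts one new operation per cycle; the function names and the example latency of three cycles are illustrative, not taken from the slides.

```python
def cycles_unpipelined(num_ops, latency):
    """Each operation must finish before the next one starts."""
    return num_ops * latency


def cycles_pipelined(num_ops, latency):
    """One operation is initiated per cycle; the last one still needs `latency` cycles."""
    if num_ops == 0:
        return 0
    return latency + (num_ops - 1)


# Eight independent multiplications on a unit with a (hypothetical) 3-cycle latency.
print(cycles_unpipelined(8, 3), cycles_pipelined(8, 3))
```

As the number of operations grows, the pipelined unit approaches one result per cycle, while the unpipelined unit stays at one result every `latency` cycles.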

General ILP Organization

[Figure: block diagram with the labels Instruction memory, Instruction fetch unit, Instruction decode unit, FU-1 to FU-5, Register file, Bypassing network, Data memory, and CPU.]

Exploiting ILP

Example: Consider the code,

Cycle 1: add t3=t1,t2

Cycle 2: store [addr 0] = t3

Cycle 3: fmul f6 = f7,f14

Cycle 4: ....waiting….


Cycle 6: fmul f7 = f7, f15




Cycle 9: add t1 = p2, p7

Cycle 10: add t5 = p2, p10

Cycle 11: add t4 = t1,t5

Cycle 12: store [addr 1] = t4

If we have three integer units, one floating-point unit, and one load/store unit, then the code can be arranged as:

Exploiting ILP

Cycle 1:

add t3=t1,t2

add t1 = p2, p7

add t5 = p2, p10

fmul f6 = f7,f14

Cycle 2:

add t4 = t1,t5

fmul f7 = f7, f15

store [addr 0] = t3

Cycle 3:

store [addr 1] = t4
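A rearrangement of this kind can be produced mechanically from register dependences and unit counts. The following Python sketch is a simple greedy, in-order list scheduler, not the method of any particular compiler. It rests on stated assumptions: every operation has a latency of one cycle (so the waiting cycles in the sequential listing are not modelled), registers are read before they are written within a cycle, and memory addresses are ignored. The names OPS, UNITS, earliest_cycle, and schedule are illustrative.

```python
# Each operation: (text, unit type, destination register or None, source registers).
# Stores have no destination register; memory addresses are ignored in this sketch.
OPS = [
    ("add t3 = t1, t2", "int", "t3", ("t1", "t2")),
    ("store [addr 0] = t3", "mem", None, ("t3",)),
    ("fmul f6 = f7, f14", "fp", "f6", ("f7", "f14")),
    ("fmul f7 = f7, f15", "fp", "f7", ("f7", "f15")),
    ("add t1 = p2, p7", "int", "t1", ("p2", "p7")),
    ("add t5 = p2, p10", "int", "t5", ("p2", "p10")),
    ("add t4 = t1, t5", "int", "t4", ("t1", "t5")),
    ("store [addr 1] = t4", "mem", None, ("t4",)),
]
UNITS = {"int": 3, "fp": 1, "mem": 1}  # 3 integer, 1 floating-point, 1 load/store


def earliest_cycle(index, ops, placed):
    """Earliest cycle allowed by dependences on operations earlier in program order."""
    _, _, dest, srcs = ops[index]
    cycle = 1
    for j in range(index):
        _, _, dest_j, srcs_j = ops[j]
        if dest_j is not None and dest_j in srcs:
            cycle = max(cycle, placed[j] + 1)  # read after write: wait one cycle
        if dest is not None and dest == dest_j:
            cycle = max(cycle, placed[j] + 1)  # write after write: keep the order
        if dest is not None and dest in srcs_j:
            cycle = max(cycle, placed[j])      # write after read: same cycle is allowed
    return cycle


def schedule(ops, units):
    """Greedy in-order list scheduling; returns {cycle: [operation text, ...]}."""
    placed, busy, result = [], {}, {}
    for i, (text, unit, _, _) in enumerate(ops):
        cycle = earliest_cycle(i, ops, placed)
        while busy.get((cycle, unit), 0) >= units[unit]:
            cycle += 1  # all units of this type are taken: try the next cycle
        busy[(cycle, unit)] = busy.get((cycle, unit), 0) + 1
        placed.append(cycle)
        result.setdefault(cycle, []).append(text)
    return result


for cycle, operations in sorted(schedule(OPS, UNITS).items()):
    print("Cycle", cycle, operations)
```

Under these assumptions the scheduler packs the eight operations into three cycles, with the same grouping as the arrangement above. With only one unit of each type it needs more cycles, which illustrates the gap between available and achievable ILP.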

VLIW

VLIW = Very Long Instruction Word architecture

Instruction format:

| operation 1 | operation 2 | operation 3 | operation 4 | operation 5 |

VLIW processors have a number of processing units (data paths), i.e., a number of ALUs, MAC units, shifters, etc.

The VLIW is accessed from memory and is used to specify the operands and operations to be performed by each of the data paths.

The multiple functional units share a common multiported register file for fetching the operands and storing the results.

The performance gain that can be achieved with a VLIW architecture depends on the degree of parallelism in the algorithm selected for a DSP application and on the number of functional units.

The throughput will be higher only if the algorithm involves execution of independent operations.

It is the compiler that does the job of determining ILP and scheduling it on the functional units.

A VLIW Architecture with 7 FUs

[Figure: block diagram with the labels Instruction Memory, Int Register File, three Int FUs, two LD/ST units, two FP FUs, Floating Point Register File, and Data Memory.]

SIMD

SIMD (Single Instruction Multiple Data)

A single stream of instructions is broadcast to a number of processors.

All processors execute the same program but operate on different data.

Nodes have mesh or hypercube connectivity.

Each processing element (PE) can exchange values with its neighbors and has a few registers, some local memory, and an ALU.

An SIMD Organization

SIMD Execution Method

[Figure: Instruction 1, Instruction 2, Instruction 3, …, Instruction n along the time axis, each spanning node1, node2, …, node-K.]
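The SIMD execution method, where every node runs the same instruction on its own data, can be sketched in Python as follows. This is a toy model under stated assumptions: each processing element holds a single local register, the two-instruction set (add, mul) is invented for illustration, and the inner loop stands in for what real hardware does simultaneously.

```python
class ProcessingElement:
    """A node with one local register holding its own data item."""

    def __init__(self, value):
        self.acc = value


# Hypothetical instruction set: opcode -> operation on (local register, operand).
INSTRUCTIONS = {
    "add": lambda acc, operand: acc + operand,
    "mul": lambda acc, operand: acc * operand,
}


def broadcast(program, elements):
    """Send each instruction of one stream to every processing element in turn."""
    for opcode, operand in program:
        for pe in elements:  # conceptually simultaneous on SIMD hardware
            pe.acc = INSTRUCTIONS[opcode](pe.acc, operand)
    return [pe.acc for pe in elements]


nodes = [ProcessingElement(v) for v in (1, 2, 3)]
print(broadcast([("mul", 2), ("add", 1)], nodes))
```

The same two instructions produce a different result on each node because each node started with different data.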

Architecture Trends – The Down Side

VLIW, SIMD and deep pipelines can increase:

Memory use.

Energy consumption.

Code generation complexity and programming difficulty.

Simple instruction sets often increase memory usage, because more instructions are needed to accomplish a given task.

Complex instruction sets hinder compatibility.

Compatibility can bring messy compromises.

Summary

Each processor makes different trade-offs, depending on its target application.

Top speed is often not the goal.


DSP Hardware Trends

Today’s system engineers have a wealth of options for implementing DSP tasks.

GPP

DSPs

Application Specific DSPs

Customizable Cores

ASSPs – Application Specific Standard Products

ASICs – Application Specific Integrated Circuits

FPGAs – Field Programmable Gate Arrays


How to Choose?

Performance Analysis

Comparing benchmarking approaches

Benchmarking approaches

How to Benchmark?

Simplified metrics

E.g., MIPS, MOPS, MMACS

Full DSP applications

E.g., V.90 modem

DSP algorithm “kernel” benchmarks

E.g., FIR filter, FFT, etc.

Algorithm Kernel Benchmarks

Most of the benchmarks are based on DSP algorithm kernels

DSP algorithm kernels are the most computationally intensive portions of DSP applications

Examples include FFTs, IIR and FIR filters, and Viterbi decoders

Benchmark results are used with application profiling to predict overall performance


Application Profile

[Chart: IDCT 39%, Window 25%, Denorm 11%, Other 25%.]
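One way to combine kernel benchmark results with an application profile is a weighted estimate: each profiled portion is scaled by how much faster its kernel runs on the candidate processor. The Python sketch below uses the profile shares shown above (IDCT 39%, Window 25%, Denorm 11%, Other 25%) and assumes they are fractions of execution time. The speedup factor in the example is hypothetical, not a measured result, and portions without a benchmark are assumed to run at unchanged speed.

```python
# Profile shares from the chart above, assumed to be fractions of execution time.
PROFILE = {"IDCT": 0.39, "Window": 0.25, "Denorm": 0.11, "Other": 0.25}


def predicted_speedup(profile, kernel_speedups):
    """Estimate the overall speedup from per-kernel speedup factors.

    Portions with no entry in `kernel_speedups` are assumed unchanged (factor 1.0).
    """
    old_time = sum(profile.values())
    new_time = sum(
        share / kernel_speedups.get(name, 1.0) for name, share in profile.items()
    )
    return old_time / new_time


# Hypothetical example: only the IDCT kernel runs twice as fast.
print(round(predicted_speedup(PROFILE, {"IDCT": 2.0}), 3))
```

Even a large gain on one kernel yields a modest overall gain when the rest of the profile is unchanged, which is why kernel results are read together with the profile rather than alone.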


Advantages

Relevant: chosen by analysis of real DSP applications.

Kernels are short, allowing

Functionality to be precisely specified.

Benchmarks to be implemented and optimized in a reasonable amount of time.

Disadvantages

Not practical to implement all important algorithms.

Do not reflect application-level optimizations and trade-offs.

Emerging Benchmarking Challenges

New technologies create performance-analysis challenges

Multi-core Devices

DSP-enhanced FPGAs

Application-specific processors

Customizable processors

Reconfigurable processors

Emerging challenges

Evolving applications and tools also lead to new challenges

Increasing reliance on C compilers

For technologies not well served by kernel benchmarks, such as

FPGAs

Application-specific processors

Practicality concerns can be partly addressed by

Using off-the-shelf applications wherever available, or

Using simplified applications



Latest Processors

High performance processors

Texas Instruments TMS320C64xx

StarCore SC140

Low Power Processors

Texas Instruments TMS320C55xx

Analog Devices Blackfin (ADSP-BF53x)

General-purpose/ DSP Processors

Intel PXA2xx

Texas Instruments OMAP5910

DSP Speed

Speed performance (chart values, one per processor):

1. TI 'C5502 (300 MHz): 1460

2. ADI 'BF53x (600 MHz): 3360

3. TI 'C6414 (720 MHz): 6480

4. StarCore SC140 (300 MHz): 3430

5. Intel PXA2xx (400 MHz): 930

DSP Speed

What factors affect DSP Speed?

Parallelism

How many parallel operations can be performed per cycle

Instruction Set

Suitability for the task at hand

Clock Speed

Data types

Data Bandwidth

Pipeline Depth

Instruction latencies

Support for DSP-oriented features

DSP addressing modes

Zero-overhead looping

Saturation, scaling, rounding
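Saturation, scaling, and rounding can be illustrated with fixed-point arithmetic. The Python sketch below assumes the 16-bit Q15 format (values from -32768 to 32767 representing -1.0 to just under 1.0), which the slides do not name; the helper names saturate, sat_add, and q15_mul are illustrative. DSPs with hardware support perform these steps as part of the arithmetic instruction rather than as separate code.

```python
Q15_MAX = 32767   # largest 16-bit signed value, just under +1.0 in Q15
Q15_MIN = -32768  # smallest 16-bit signed value, -1.0 in Q15


def saturate(value):
    """Clamp to the 16-bit range instead of wrapping around on overflow."""
    return max(Q15_MIN, min(Q15_MAX, value))


def sat_add(a, b):
    """Saturating addition of two Q15 values."""
    return saturate(a + b)


def q15_mul(a, b):
    """Multiply two Q15 values: round to nearest, scale back to Q15, saturate."""
    product = a * b                         # Q30 intermediate result
    rounded = (product + (1 << 14)) >> 15   # add half an LSB, then scale down
    return saturate(rounded)


print(sat_add(30000, 10000))    # clamps instead of wrapping to a negative value
print(q15_mul(16384, 16384))    # 0.5 * 0.5 in Q15
```

Without saturation, the first addition would wrap to a large negative number in 16-bit arithmetic, which in a signal path shows up as a loud glitch rather than mild clipping.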

Memory Use

Memory use comparison, in bytes (lower is better); instruction widths in bits are given in parentheses:

1. TI 'C55xx (8/16/32/48): 146

2. ADI 'BF53x (16/32/64): 140

3. TI 'C64xx (32): 256

4. StarCore SC140 (16/32): 144

5. Intel PXA2xx (16/32): 140

Memory Use

What factors affect Memory use?

Processors’ memory usage is affected by

Instruction Set

Wider instructions take more memory

Mixed-width instructions are becoming popular: use short, simple instructions for simple tasks and longer instructions for more complex tasks

Suitability of the instruction set for the task at hand

Architecture

VLIW, SIMD and deeper pipelines may encourage optimizations that increase memory use to obtain speed-optimized code

Compiler quality (for compiled code)

Energy Efficiency

Energy efficiency (higher is better):

1. TI 'C5502 (300 MHz), 1.26 V: 11.8

2. ADI 'BF53x (600 MHz), 1.2 V: 16.9

3. TI 'C6414 (300 MHz), 1.0 V: 16.1

4. Motorola MSC8101 (SC140) (300 MHz), 1.5 V: 13.7

5. (… MHz), 1.0 V: 2.6

Energy Efficiency

What factors affect Energy efficiency?

Processors’ energy efficiency is affected by

Speed

Fabrication process, voltage, circuit design, logic design

Hardware Implementation

Memory usage

Compiler quality (for compiled code)

Cost Performance

Cost performance (higher is better):

1. TI 'C5502 (300 MHz), $10: 146.2

2. ADI 'BF53x (600 MHz), $6: 375.9

3. TI 'C6414 (300 MHz), $45: 98.3

4. Motorola MSC8101 (SC140) (300 MHz), $118: 29

5. (… MHz), $27: 25.6

Cost Performance

What factors affect Cost Performance?

Speed

Chip Cost, which is affected by

Fabrication process

Size of on-chip memory – influenced by processor’s memory usage

On-chip peripherals

Manufacturing volume

Cost Performance

But good cost-performance results do not necessarily mean a chip is suitable for applications with severe cost constraints.

Users do not want to pay for more performance than needed.


Conclusions

DSP processor architecture innovation has accelerated greatly

New processor types are increasingly competitive

DSP-enhanced general-purpose processors

Multiprocessor chips

Customizable cores

Non-processor approaches are increasingly competitive

DSP-enhanced FPGAs

Architectural options are expanding

Today’s DSP-oriented processors cannot be meaningfully compared using simplified metrics

Relevant, meaningful benchmark results are essential for processor evaluation

There is no ideal processor

Fastest does not mean best

The “best” processor depends on the details of the application

Different architectural approaches make different performance trade-offs

Understanding these is key to selecting a processor
