01 dsp intro_1

An introduction to DSPs

Examples of DSP applications

Why a DSP?

Characteristics of a DSP

Architectures

DSP example: mobile phone

DSP example: mobile phone with video camera

DSP: applications

Why a DSP?

• The short answer: we want an architecture optimized for digital signal processing

• Some versions are further optimized for specific applications

- e.g. very low power consumption for mobile phones

What is the difference between a DSP and a general-purpose processor? (1/4)

Memory architecture and bus

• The first processors (in the 1940s) had a Harvard architecture: separate memories for program and data

• But it is complex -> soon replaced by the Von Neumann architecture: no real difference between program and data (an instruction has two fields: operation and data)

• Problem: the processor cannot access instructions and data simultaneously

• To improve performance: Harvard architecture again!

In particular:

- separate memories and buses for program and data

- possibly, another separate bus for the DMA
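The bus argument above can be put into numbers with a deliberately crude model. The sketch below assumes that every instruction fetch and every data access costs exactly one bus cycle, that a Von Neumann machine serializes them on its single bus, and that a Harvard machine overlaps them perfectly. The function names are hypothetical and real processors are more subtle (caches, multi-port memories).

```python
def von_neumann_cycles(instruction_fetches, data_accesses):
    """One shared bus: fetches and data accesses must take turns."""
    return instruction_fetches + data_accesses


def harvard_cycles(instruction_fetches, data_accesses):
    """Separate program and data buses: both kinds of access overlap."""
    return max(instruction_fetches, data_accesses)


# Hypothetical workload: a 64-tap filter loop, with one instruction fetch
# and one data access per tap.
taps = 64
shared_bus = von_neumann_cycles(taps, taps)
split_bus = harvard_cycles(taps, taps)
```

Under these assumptions the shared bus needs twice as many cycles as the split one for this loop.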


What is the difference between a DSP and a general-purpose processor? (2/4)

A DSP is often used to implement a linear filter

The convolution integral becomes, for sampled signals, a sum:

$y_n = \sum_i x_{n-i} h_i$

- if the number of terms is finite: FIR filter (finite impulse response),

- otherwise: IIR filter (infinite impulse response),

- which can be implemented using two finite sums:

$y_n = \sum_i x_{n-i} b_i + \sum_i y_{n-i} a_i$
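The two sums above translate almost literally into code. The following is a minimal sketch, not an optimized implementation: it assumes samples before n = 0 are zero, that the feedback sum starts at i = 1 (so the list `a` holds a_1, a_2, ...), and it keeps the sign convention of the formula above. The function names are illustrative.

```python
def fir_filter(x, h):
    """y[n] = sum_i x[n-i] * h[i], with x[k] taken as 0 for k < 0."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for i in range(len(h)):
            if n - i >= 0:
                acc += x[n - i] * h[i]
        y.append(acc)
    return y


def iir_filter(x, b, a):
    """y[n] = sum_i x[n-i]*b[i] + sum_{i>=1} y[n-i]*a[i-1]."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for i in range(len(b)):              # feed-forward sum on the input
            if n - i >= 0:
                acc += x[n - i] * b[i]
        for i in range(1, len(a) + 1):       # feedback sum on past outputs
            if n - i >= 0:
                acc += y[n - i] * a[i - 1]
        y.append(acc)
    return y


impulse = [1.0, 0.0, 0.0, 0.0]
fir_response = fir_filter(impulse, [0.5, 0.25])   # dies out after 2 samples
iir_response = iir_filter(impulse, [1.0], [0.5])  # keeps halving forever
```

Fed with an impulse, the FIR output returns to zero once the coefficients run out, while the IIR output keeps decaying without ever reaching zero.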



• A common operation in a FIR or IIR filter is A = B·C + D. We need:

- a hardware multiplier (introduced in DSPs in the 1970s)

- a multiply and accumulate in a single clock cycle: the MAC instruction

• In practice the MAC sits inside a loop, so we also need a zero-overhead loop:

- hardware for address generation (memory is accessed in a regular pattern, not randomly)

- loop management

- auto-increment; circular addressing
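What the MAC loop and circular addressing do together can be sketched in software. This is only a model: on a DSP the index update with wrap-around and the loop counter are handled by dedicated hardware at no cycle cost, whereas here they are ordinary statements. The class name `CircularFir` is hypothetical.

```python
class CircularFir:
    """FIR filter whose delay line is a circular buffer."""

    def __init__(self, h):
        self.h = list(h)
        self.buf = [0.0] * len(self.h)   # delay line, initially silent
        self.pos = 0                     # index of the newest sample

    def step(self, sample):
        n = len(self.h)
        # Advance the write pointer with wrap-around (circular addressing).
        self.pos = (self.pos + 1) % n
        self.buf[self.pos] = sample
        acc = 0.0
        idx = self.pos
        for coeff in self.h:
            acc += coeff * self.buf[idx]  # one multiply-accumulate per tap
            idx = (idx - 1) % n           # step back one sample, wrapping
        return acc


fir = CircularFir([0.5, 0.25])
outputs = [fir.step(s) for s in [1.0, 0.0, 0.0, 0.0]]
```

No sample is ever moved in memory: only the pointer advances, which is why circular addressing in hardware makes the delay line free.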

• Other possible hardware:

- hardware saturation

- instructions to perform a division quickly

- bit reversal for the FFT
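Two of these features are easy to illustrate in software. The sketch below assumes 16-bit signed samples; it contrasts a saturating add with the wrap-around add of an ordinary ALU, and computes the bit-reversed index order used to reorder the data of a radix-2 FFT. The function names are illustrative.

```python
def saturating_add16(a, b):
    """Add two 16-bit signed values, clamping instead of overflowing."""
    return max(-32768, min(32767, a + b))


def wrapping_add16(a, b):
    """Add two 16-bit signed values the way a plain ALU does (wrap-around)."""
    s = (a + b) & 0xFFFF
    return s - 0x10000 if s & 0x8000 else s


def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


clipped = saturating_add16(30000, 10000)    # stays at the positive limit
wrapped = wrapping_add16(30000, 10000)      # flips to a large negative value
fft_order = [bit_reverse(i, 3) for i in range(8)]
```

For audio, the clipped result is a mild distortion, while the wrapped result is a loud click, which is why saturation in hardware is worth having.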



Other possible features:

• Data are often 16 or 8 bits wide (e.g. audio or images)

- a 32-bit ALU can be split into two 16-bit ALUs or four 8-bit ALUs -> 2 or 4 operations in parallel

• Several ALUs working in parallel

• Fixed-point ALUs, or 16-bit ALUs, to reduce power consumption and cost

• Optimized versions:

- cost: for consumer applications

- power: for mobile applications

- for specific applications, e.g. electric motor control
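The idea of splitting a 32-bit ALU into two 16-bit ALUs comes down to cutting the carry chain between the two halves. The sketch below emulates this with a well-known masking trick on a 32-bit word. It is a software illustration under the assumption of wrap-around (non-saturating) lanes, not a description of any particular DSP, and the helper names are hypothetical.

```python
LANE_MSB = 0x80008000   # top bit of each 16-bit lane
LANE_LOW = 0x7FFF7FFF   # the remaining 15 bits of each lane


def pack16x2(hi, lo):
    """Pack two 16-bit values into one 32-bit word."""
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)


def unpack16x2(word):
    """Split a 32-bit word back into its (hi, lo) 16-bit lanes."""
    return (word >> 16) & 0xFFFF, word & 0xFFFF


def packed_add16x2(a, b):
    """Add both 16-bit lanes at once; no carry crosses the lane boundary."""
    low_sum = (a & LANE_LOW) + (b & LANE_LOW)   # carries stay inside a lane
    return low_sum ^ ((a ^ b) & LANE_MSB)       # add the top bits, carry dropped


a = pack16x2(1, 0xFFFF)
b = pack16x2(2, 0x0001)
split_result = unpack16x2(packed_add16x2(a, b))    # lanes add independently
plain_result = unpack16x2((a + b) & 0xFFFFFFFF)    # carry leaks into hi lane
```

With a plain 32-bit add the overflow of the low lane corrupts the high lane; with the split add each lane wraps on its own.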

• Example: 'C30 (Texas Instruments, 1982)

• Example: FIR filter using a 'C30

Note: several of these features, which originated in DSPs, have since been adopted by general-purpose processors

• E.g. the cache in the Pentium processor is Harvard-like (separate instruction and data caches)

• Another example: several units working in parallel, and splittable ALUs (see the MMX extensions) in the Pentium 4 processor

Pipeline…

• Example of a 4-stage pipeline (TI 'C30)

• Each instruction takes 4 clock cycles to execute, but (normally) it can start just 1 cycle after the previous one (its data are needed only 3 cycles later)
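The gain from pipelining can be checked with a small cycle count. The sketch below assumes an ideal 4-stage pipeline with no stalls, in which a new instruction enters every cycle. The stages are simply numbered 0 to 3 and the function names are illustrative.

```python
def pipelined_cycles(n_instructions, stages=4):
    """Cycles to finish n instructions when one new instruction starts per cycle."""
    if n_instructions == 0:
        return 0
    return stages + (n_instructions - 1)


def unpipelined_cycles(n_instructions, stages=4):
    """Cycles when each instruction must finish before the next one starts."""
    return n_instructions * stages


def occupancy(n_instructions, stages=4):
    """For each cycle, the instruction number held by each stage (None = empty)."""
    rows = []
    for cycle in range(pipelined_cycles(n_instructions, stages)):
        row = []
        for stage in range(stages):
            instr = cycle - stage   # instruction i enters stage s at cycle i + s
            row.append(instr if 0 <= instr < n_instructions else None)
        rows.append(row)
    return rows


with_pipeline = pipelined_cycles(100)
without_pipeline = unpipelined_cycles(100)
```

Once the pipeline is full, one instruction completes per cycle, so long sequences approach a fourfold speed-up.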

Pipeline: branch (e.g. on the 'C30)

• Standard branch: the pipeline is flushed to handle the PC correctly -> 4 cycles

• Delayed branch: the pipeline is not flushed, and the 3 following instructions are fetched and executed before the PC is modified -> only 1 cycle needed!

        BRD   label     ; delayed branch
        MPYF            ; executed
        ADDF            ; executed
        SUBF            ; executed
        AND             ; not executed
        …
        …
label   MPYF            ; fetched after SUBF
        …
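The behaviour of the listing above can be reproduced with a toy interpreter. This is a sketch under simplifying assumptions: operands are ignored, every non-branch instruction is counted as 1 cycle, the branch costs are the 4 and 1 cycles quoted above, and a branch inside the delay slots is not handled. `BR` stands for a standard branch, and all Python names are hypothetical.

```python
BRANCH_CYCLES = {"BR": 4, "BRD": 1}   # cycle costs quoted in the text
DELAY_SLOTS = 3                        # instructions executed after a BRD


def run(program, labels, max_steps=50):
    """Return the mnemonics in the order they are actually executed."""
    trace = []
    pc = 0
    slots_left, target = None, None
    while pc < len(program) and len(trace) < max_steps:
        op, arg = program[pc]
        trace.append(op)
        next_pc = pc + 1
        if op == "BR":                       # standard branch: jump at once
            next_pc = labels[arg]
        elif op == "BRD":                    # delayed branch: jump later
            slots_left, target = DELAY_SLOTS, labels[arg]
        elif slots_left is not None:         # we are inside the delay slots
            slots_left -= 1
            if slots_left == 0:
                next_pc, slots_left = target, None
        pc = next_pc
    return trace


def cycles(trace):
    """Total cycles, assuming 1 cycle for every non-branch instruction."""
    return sum(BRANCH_CYCLES.get(op, 1) for op in trace)


delayed = [("BRD", "label"), ("MPYF", None), ("ADDF", None),
           ("SUBF", None), ("AND", None), ("MPYF", None)]
standard = [("MPYF", None), ("ADDF", None), ("SUBF", None),
            ("BR", "label"), ("AND", None), ("MPYF", None)]
labels = {"label": 5}

delayed_trace = run(delayed, labels)
standard_trace = run(standard, labels)
```

Both versions do the same useful work and skip the AND, but the delayed version fills the branch latency with the three instructions instead of wasting it.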

Two architectures

• Two architectures can exploit instruction-level parallelism (ILP):

- Superscalar: the parallelism is managed dynamically by the hardware

- Very Long Instruction Word (VLIW): the parallelism is managed statically by the compiler

What is the problem?

• Data or control dependences can cause conflicts:

- on data (an instruction needs the result of a previous instruction, but the result is not ready yet), or

- on control (a conditional jump whose condition is not ready yet)

-> pipeline stall
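A data conflict and the resulting stall can be counted with a toy in-order model. The sketch assumes a single-issue machine in which a result becomes usable `latency` cycles after its instruction starts. Register names, the latency value and the instruction format are all hypothetical.

```python
from collections import namedtuple

# name: mnemonic, dest: register written, srcs: registers read
Instr = namedtuple("Instr", "name dest srcs")


def count_stalls(program, latency):
    """Count the cycles lost waiting for operands that are not ready yet."""
    ready_at = {}   # register -> first cycle at which its value can be read
    cycle = 0
    stalls = 0
    for ins in program:
        operands_ready = max([ready_at.get(r, 0) for r in ins.srcs] + [cycle])
        stalls += operands_ready - cycle     # wait for the slowest operand
        cycle = operands_ready
        ready_at[ins.dest] = cycle + latency
        cycle += 1                           # the next instruction can start
    return stalls


mpyf = Instr("MPYF", "r0", ("r1", "r2"))
addf = Instr("ADDF", "r3", ("r0", "r4"))   # needs r0 produced by MPYF
subf = Instr("SUBF", "r5", ("r6", "r7"))   # independent of the other two

in_source_order = count_stalls([mpyf, addf, subf], latency=3)
reordered = count_stalls([mpyf, subf, addf], latency=3)
```

Moving the independent SUBF between the producer and the consumer hides part of the wait, which is exactly the reordering that a superscalar core or a VLIW compiler looks for.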

Superscalar

• The hardware finds independent instructions dynamically (which is complex!)

• Instructions can be executed out of order; their completion (commit) is then done in order, so that the CPU state is updated correctly

VLIW

• Very Long Instruction Word (VLIW): the parallelism is managed statically by the compiler

• Independent instructions are identified statically, at compile time;

- the instructions that can be executed in parallel are packed into long instruction words and sent to the functional units in order

• A convenient solution for DSP programs (fixed-length loops, few conditional operations); less convenient for general-purpose applications

• Simpler hardware! But the code must be compiled specifically for each platform

• Deterministic behaviour -> execution times can be computed exactly
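The compiler's job on a VLIW machine can be sketched as packing instructions into fixed-width bundles. The following is a minimal greedy, in-order packer, not a real scheduler: it assumes that every functional unit can run every operation, that all operations in a bundle read their operands before any of them writes, and that each bundle takes one cycle. Names and register numbers are hypothetical.

```python
def pack_bundles(program, width):
    """Pack (name, dest, srcs) instructions into bundles of at most `width`."""
    bundles = []
    current = []
    written = set()   # registers written by the bundle being filled
    for name, dest, srcs in program:
        depends = any(s in written for s in srcs) or dest in written
        if current and (depends or len(current) == width):
            bundles.append(current)          # close the bundle, start a new one
            current, written = [], set()
        current.append(name)
        written.add(dest)
    if current:
        bundles.append(current)
    return bundles


program = [
    ("MPYF", "r0", ("r1", "r2")),
    ("ADDF", "r3", ("r0", "r4")),   # needs r0: cannot share MPYF's bundle
    ("SUBF", "r5", ("r6", "r7")),   # independent: fits next to ADDF
    ("AND",  "r8", ("r5", "r9")),   # previous bundle is already full
]
bundles = pack_bundles(program, width=2)
execution_cycles = len(bundles)     # known at compile time
```

Because the bundles are fixed before the program runs, the cycle count is known in advance, which is the deterministic behaviour noted above.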
