INTRODUCTION TO DIGITAL SIGNAL PROCESSORS (DSPs)

INTRODUCTION TODIGITAL SIGNAL

PROCESSORS (DSPs)

Prof. Brian L. Evans

Contributions byNiranjan Damera-Venkata and

Magesh Valliappan

Embedded Signal Processing Laboratory

The University of Texas at AustinAustin, TX 78712-1084

http://signal.ece.utexas.edu/

Accumulator architecture

Load-store architecture

Memory-register architecture

2 -2

Outline

Signal processing applications

Conventional DSP architecture

Pipelining in DSP processors

RISC vs. DSP processor architectures

TI TMS320C6000 DSP architecture

introduction

Signal processing on general-purpose

processors

Conclusion

2 -3

Signal Processing Applications

Embedded system demand: volume, volume, … 400 Million units/year: automobiles, PCs, cell phones 30 Million units/year: ADSL modems and printers

Embedded system cost and input/output rates Low-cost, medium-throughput: low-end printers,

handsets, sound cards, car audio, disk drives High-cost, high-throughput: high-end printers,

wireless basestations, 3-D sonar, 3-D images from2-D X-rays (tomographic reconstruction)

Embedded processor requirements Inexpensive with small area and volume Predictable input/output (I/O) rates to/from processor Power constraints (severe for handheld devices)

Single DSP

Multiple DSPs

2 -4

Conventional DSP Processors

Low cost: $3/processor in volume Deterministic interrupt service routine latency

guarantees predictable input/output rates On-chip direct memory access (DMA) controllers

Processes streaming input/output separately from CPU Sends interrupt to CPU when block has been read/written

Ping-pong buffering CPU reads/writes buffer 1 as DMA reads/writes buffer 2 After DMA finishes buffer 2, roles of buffers 1 & 2 switch

Low power consumption: 10-100 mW TI TMS320C54 0.32 mA/MIP 76.8 mW at 1.5 V, 160 MHz TI TMS320C55 0.05 mA/MIP 22.5 mW at 1.5 V, 300 MHz

Based on conventional (pre-1996) architecture

2 -5

Conventional DSP Architecture

Multiply-accumulate (MAC) in 1 instruction cycle Harvard architecture for fast on-chip I/O

Data memory/bus separate from program memory/bus One read from program memory per instruction cycle Two reads/writes from/to data memory per inst. cycle

Instructions to keep pipeline (3-6 stages) full Zero-overhead looping (one pipeline flush to set up) Delayed branches

Special addressing modes supported in hardware Bit-reversed addressing (e.g. fast Fourier transforms) Modulo addressing for circular buffers (e.g. FIR filters)

2 -6

Conventional DSP Architecture (con’t)

xN-K+1 xN-K+2xN-1 xN

Data Shifting Using a Linear BufferTime Buffer contents Next sample

xN+1

xN+3

xN+2

n=N

n=N+1

n=N+2 xN-K+3 xN-K+4xN+1 xN+2

xN-K+2 xN-K+3xN xN+1

Modulo Addressing Using a Circular BufferTime Buffer contents Next sample

n=N

n=N+1

n=N+2

xN-2 xN-1xN-K+1 xN-K+2

xN-K+4

xN+1

xN+2

xN+3

xN-2 xN-1xN+1 xN-K+2xN

xN-2 xN-1xN+1 xN+2xN

xN

xN

xN

xN-K+3

xN-K+3 xN-K+4

Buffer of length K Used in finite and

infinite impulse response filters

Linear buffer Sort by time index Update: discard

oldest data, copy old data left, insert new data

Circular buffer Oldest data index Update: insert

new data at oldest index, update oldest index

2 -7

Conventional DSP Processors Summary

Fixed-Point Floating-Point Cost/Unit $3 - $79 $3 - $381 Architecture Accumulator load-store or

memory-register Registers 2-4 data

8 address 8 or 16 data

8 or 16 address Data Words 16 or 24 bit integer

and fixed-point 32 bit integer and fixed/floating-point

On-Chip Memory

2-64 kwords data 2-64 kwords program

8-64 kwords data 8-64 kwords program

Address Space

16-128 kw data 16-64 kw program

16 Mw – 4Gw data 16 Mw – 4 Gw program

Compilers C, C++ compilers; poor code generation

C, C++ compilers; better code generation

Examples TI TMS320C5000; Motorola 56000

TI TMS320C30; Analog Devices SHARC

2 -8

Conventional DSP Processor Families

Floating-point DSPs Used in initial prototyping of algorithms Resurgence due to professional and car audio

Different on-chip configurations in each family Size and map of data and program memory A/D, input/output buffers, interfaces, timers, and

D/A

Drawbacks to conventional DSP processors No byte addressing (needed for images and video) Limited on-chip memory Limited addressable memory on fixed-point DSPs

(exceptions include Motorola 56300 and TI C5409) Non-standard C extensions for fixed-point data type

DSP MarketFixed-point 95%Floating-point 5%

2 -9

Pipelining

Pipelining

•Process instruction stream in stages (like oil flows in an oil pipeline)

•Increase throughput

Managing Pipelines

•Compiler or programmer

•Pipeline interlocking

Sequential (Motorola 56000)

Pipelined (Most conventional DSPs)

Superscalar (Pentium, MIPS)

Superpipelined (TMS320C6000)

Fetch Read ExecuteDecode

Fetch Decode Read Execute



2 -10

Time-stationary pipeline model Programmer controls each cycle Example: Motorola DSP56001 (has

separate X/Y data memories/registers)

Data-stationary pipeline model Programmer specifies data operations Example: TI TMS320C30

Interlocked pipeline “Protection” from pipeline effects May not be reported by simulators:

inner loops may take extra cycles

Pipelining: Operation

MAC X0,Y0,A X:(R0)+,X0 Y:(R4)-,Y0

MPYF *++AR0(1),*++AR1(IR0),R0

DEFGHIJKLL

CDEFGHIJK-L

BCDEFGHIJK-L

ABCDEFGHIJK-L

F D R E

ExecuteReadDecodeFetch

MAC means multiplication-accumulation.

2 -11

A control hazard occurs when a branch instruction is decoded Processor “flushes” the pipeline, or Use delayed branch (expose pipeline)

A data hazard occurs because an operand cannot be read yet Intended by programmer, or Interlock hardware inserts “bubble” TI TMS320C5000 (20 CPU & 16 I/O

registers, one accumulator, and one address pointer ARP implied by *)

Pipelining: Hazards

LAR AR2, ADDR ; load address reg.LACC *- ; load accumulator w/ ; contents of AR2

DEFbrG--XYYZ

F D R E


CDEFbr---X-YZ

BCDEFbr---X-YZ

ABCDEFbr---X-YZ

LAR: 2 cycles to update AR2 & ARP; need NOP after it

2 -12

A repeat instruction repeats one instruction or a block of instructions after repeat

The pipeline is filled with repeated instruction (or block of instructions)

Cost: one pipeline flush only

Pipelining: Avoiding Control Hazards

; repeat TBLR inst. COUNT-1 timesRPT COUNTTBLR *+

High throughput performance of DSPs is helped by on-chip dedicated logic for looping (downcounters/looping registers)

DEF

rptXXXXXXXX

F D R E


CDEF

rpt--XXXXX

BCDEF

rpt--XXXX

ABCDEF

rpt--XXX

2 -13

RISC vs. DSP: Instruction Encoding

RISC: Superscalar, out-of-order execution

DSP: Horizontal microcode, in-order execution

Reorder

Load/store

Integer UnitFloating-Point Unit

Load/store

Load/store

AddressMultiplierALU

Memory

Memory

2 -14

RISC vs. DSP: Memory Hierarchy

RISC

DSP

Registers

Outof

order

I/DCache

Physical memory

TLB

Registers

DMA Controller

I Cache Internal memories

External memories

TLB: Translation Lookaside Buffer

DMA: Direct Memory Access

2 -15

Program RAMData RAM

or Cache

Internal Buses

Control Regs

Regs (B

0-B

15

)

Regs (A

0-A

15

)

.D1

.M1

.L1

.S1

.D2

.M2

.L2

.S2

CPU

Addr

Data

ExternalMemory -Sync -Async

DMA

Serial Port

Host Port

Boot Load

Timers

Pwr Down

TI TMS320C6000 DSP Architecture

Simplified Architectur

e

2 -16


Families: All support same C6000 instruction set

C6200 fixed-pt. 150- 300 MHz ADSL, printers

C6400 fixed pt. 300-1,000 MHz video, wireless

basestations

C6700 floating 100- 225 MHz medical imaging, pro-audio

TMS320C6211: 150 MHz, $21 in volume

300 million multiply-accumulates/s, 1200 RISC MIPS

On-chip memory: 16 kwords program, 16 kwords data

TMS320C6701 Evaluation Module Board 167 MHz

334 million multiply-accumulates/s, 1336 RISC MIPS

On-chip memory: 16 kwords program, 16 kwords data

External: one 133-MHz 64-kword, two 100-MHz 1-Mword

2 -17


Very long instruction word (VLIW) size of 256 bits Eight 32-bit functional units with single cycle throughput

One instruction cycle per clock cycle

Data word size is 32 bits 16 (32 on C64) 32-bit registers in each of two data paths

40 bits can be stored in adjacent even/odd registers

Two parallel data paths Data unit - 32-bit address calculations (modulo, linear) Multiplier unit - 16 bit 16 bit with 32-bit result Logical unit - 40-bit (saturation) arithmetic & compares Shifter unit - 32-bit integer ALU and 40-bit shifter

2 -18

TI TMS320C6000 Instruction Set

.S UnitADD NEGADDK NOTADD2 ORAND SETB SHLCLR SHREXT SSHLMV SUBMVC SUB2MVK XORMVKH ZERO

.L UnitABS NOTADD ORAND SADDCMPEQ SATCMPGT SSUBCMPLT SUBLMBD SUBCMV XORNEG ZERONORM

.M UnitMPY SMPYMPYH SMPYH

.D UnitADD STADDA SUBLD SUBAMV ZERONEG

OtherNOP IDLE

C6000 Instruction Set by Functional Unit

Six of the eight functional units can perform integer add, subtract, and move operations

2 -19

TI TMS320C6000 Instruction SetArithmeti

cABSADD

ADDAADDKADD2MPY

MPYHNEGSMPY

SMPYHSADDSAT

SSUBSUB

SUBASUBCSUB2ZERO

LogicalAND

CMPEQCMPGTCMPLTNOTORSHLSHRSSHLXOR

BitManagement

CLREXT

LMBDNORMSET

DataManagement

LDMV

MVCMVK

MVKHST

ProgramControl

BIDLENOP

C6000 InstructionSet by Category(un)signed multiplicationsaturation/packed arithmetic

2 -20

C6000 vs. C5000 Addressing Modes

ADD #0FFh add .L1 -13,A1,A6

TI C5000

TI C6000

(implied) add .L1 A7,A6,A7

ADD 010h not supported

ADD * ldw .L1 *A5++[8],A1

Immediate The operand is part

of the instruction Register

Operand is specified in a register

Direct Address of operand

is part of the instruction (added to imply memory page)

Indirect Address of operand

is stored in a register

2 -21


Deep pipeline 7-11 stages in C6200: fetch 4, decode 2, execute 1-5 7-16 stages in C6700: fetch 4, decode 2, execute 1-10 Pentium IV has an estimated 20 pipeline stages

Avoid using branch instructions in code Branch instruction in pipeline disables interrupts: latency

of a branch is 5 cycles Avoid branches by using conditional execution: every

instruction can be conditionally executed

No hardware protection against pipeline hazards Compiler and assembler must prevent pipeline hazards

2 -22

TI TMS320C6700 Extensions

.S UnitABSDP CMPLTSP

ABSSP RCPDPCMPEQDP RCPSP CMPEQSP RSARDP CMPGTDP RSQRSP CMPGTSP SPDPCMPLTDP

.L UnitADDDP INTSPADDSP SPINTDPINT SPTRUNCDPSP SUBDPDPTRUNC SUBSPINTDP

.M UnitMPYDP MPYIDMPYI MPYSP

.D UnitADDAD LDDW

C6700 Floating Point Extensions by Unit

Four functional units can perform IEEE single-precision (SP) and double-precision (DP) floating-point add, subtract, move.

Operations beginning with R are reciprocal calculations.

2 -23

C6712 vs. C6713

C6712 150 MHz clock,

900 MFLOPS 4 kB/4kB of L1

program/data memory

64 kB of L2 cache 0 kB of L2 SRAM 1200 MB/s on-chip

data bus bandwidth $13.50 each in

volume

C6713 225 MHz clock,

1350 MFLOPS 4 kB/4kB of L1

program/data memory

256 kB of L2 cache 192 kB of L2 SRAM 1800 MB/s on-chip

data bus bandwidth $26.85 each in

volumeInformation as of December 14, 2003

2 -24

Digital Signal Processor Cores

ASIC with Programmable digital

signal processor core

RAM

ROM

Standard cells

Codec

Peripherals

Gate array

Microcontroller core

2 -25

General Purpose Processors

Multimedia applications on PCs Video, audio, graphics and animation

Repetitive parallel sequences of instructions

Single Instruction Multiple Data (SIMD) One instruction acts on multiple data in parallel Well-suited for graphics

Native signal processing extensions use

SIMD Sun Visual Instruction Set [1995] (UltraSPARC 1/2) Intel MMX [1996] (Pentium I/II/III/IV) Intel Streaming SIMD Extensions (Pentium III)

2 -26

DSP on General Purpose Processors (con’t)

Programming is considerably tougher Ability of compilers to generate code for instruction set

extensions may lag (e.g. four years for Pentium MMX) Libraries of routines using native signal processing Hand code in assembly for best performance

Single-instruction multiple-data (SIMD) approach Pack/unpack data not aligned on SIMD word boundaries Saturation arithmetic in MMX; not supported in VIS Extended-precision accumulation in MMX; none in VIS

Application speedup for Intel MMX and Sun VIS Signal and image processing: 1.5:1 to 2:1 Graphics: 4:1 to 6:1 (no packing/unpacking)

2 -27

Intel MMX Instruction Set

64-bit SIMD register (4 data types) 64-bit quad word Packed byte (8 bytes packed into 64 bits) Packed word (4 16-bit words packed into 64 bits) Packed double word (2 double words packed into 64

bits) 57 new instructions

Pack and unpack Add, subtract, multiply, and multiply/accumulate

Saturation and wraparound arithmetic Maximum parallelism possible

8:1 for 8-bit additions 4:1 for 8 16 multiplication or 16-bit additions

2 -28

Concluding Remarks

Conventional digital signal processors High performance vs. power consumption/cost/volume Excel at one-dimensional processing Per cycle: 1 16 16 MAC & 4 16-bit RISC instructions

TMS320C6000 VLIW DSP family High performance vs. cost/volume Excel at multidimensional signal processing Per cycle: 2 16 16 MACs & 4 32-bit RISC instructions

Native signal processing Available on desktop computers Excels at graphics Per cycle: 2 8 16 MACs OR 8 8-bit RISC instructions

Use assembly for computational kernels and C for main program (control code, interrupt def.)

2 -29

Concluding Remarks

Digital signal processor market40% annual growth 1990-2000: #1 in semiconductor marketRevenue: $4.4B ‘99, $6.1B ‘00, $4.5B ‘01, $4.9B ‘02, $6B ‘032000: 44% TI, 23% Agere, 13% Motorola, 10% Analog

Dev.2001: 40% TI, 16% Agere, 12% Motorola, 8% Analog Dev.2002: 43% TI, 14% Motorola, 14% Agere, 9% Analog Dev.

Independent processor benchmarking by industry Berkeley Design Technology Inc. http://www.bdti.com EDN Embedded Microprocessor Benchmark Consortium

http://www.eembc.org

Web resources Newsgroup comp.dsp: FAQ http://www.bdti.com/faq/dsp_faq.html Embedded processors and systems: http://www.eg3.com On-line courses: http://www.techonline.com

2 -30

References

G. E. Allen, B. L. Evans, and D. C. Schanbacher, “Real-Time Sonar Beamforming on a Unix Workstation,” Proc. IEEE Asilomar Conf. On Signals, Systems, and Computers, pp. 764-768, 1998.http://www.ece.utexas.edu/~bevans/papers/1998/beamforming/

R. Bhargava, R. Radhakrishnan, B. L. Evans, and L. K. John, “Evaluating MMX Technology Using DSP and Multimedia Applications,” Proc. IEEE Sym. On Microarchitecture, pp. 37-46, 1998.http://www.ece.utexas.edu/~ravib/mmxdsp/

W. Chen, H. J. Reekie, S. Bhave, and E. A. Lee, “Native Signal Processing on the UltraSPARC in the Ptolemy Environment,” Proc. IEEE Asilomar Conf. On Signals, Systems, and Computers, 1996.http://www.ece.utexas.edu/~bevans/courses/ee382c/lectures/21_nsp/vis/

B. L. Evans, “EE345S Real-Time DSP Laboratory,” UT Austin. http://www.ece.utexas.edu/~bevans/courses/realtime/

B. L. Evans, “EE382C-9 Embedded Software Systems,” UT Austin.http://www.ece.utexas.edu/~bevans/courses/ee382c/

A. Kulkarni and A. Dube, “Evaluation of the Code Generation Domain in Ptolemy,” http://www.ece.utexas.edu/~bevans/talks/benchmarking97/sld001.htm

P. Lapsley, J. Bier, A. Shoham, and E. A. Lee, DSP Processor Fundamentals, IEEE Press, 1997.

INTRODUCTION TO DIGITAL SIGNAL PROCESSORS (DSPs)

Documents

Transcript of INTRODUCTION TO DIGITAL SIGNAL PROCESSORS (DSPs)