Computing Engine Choices - Rochester Institute of...
EECC722 EECC722 -- ShaabanShaaban#1 lec # 8 Fall 2012 10-10-2012
Computing Engine Choices
• General Purpose Processors (GPPs): Intended for general purpose computing (desktops, servers, clusters, ...)
• Application-Specific Processors (ASPs): Processors with ISAs and architectural features tailored towards specific application domains
  – E.g. Digital Signal Processors (DSPs), Network Processors (NPs), Media Processors, Graphics Processing Units (GPUs), Vector Processors??? ...
• Co-Processors: A hardware (hardwired) implementation of specific algorithms with a limited programming interface (augment GPPs or ASPs)
• Configurable Hardware:
  – Field Programmable Gate Arrays (FPGAs)
  – Configurable array of simple processing elements
• Application Specific Integrated Circuits (ASICs): A custom VLSI hardware solution for a specific computational task
• The choice of one or more depends on a number of factors including:
  - Type and complexity of computational algorithm (general purpose vs. specialized)
  - Desired level of flexibility and programmability
  - Performance requirements
  - Desired level of computational efficiency (e.g. computations per watt or computations per chip area)
  - Power requirements
  - Real-time constraints
  - Development time and cost
  - System cost
  - Expected useful lifecycle of computing element or system
Processors: GPPs have general purpose ISAs (RISC or CISC); ASPs have special purpose ISAs.
The ISA forms an abstraction layer that sets the requirements for both compiler and CPU designers.
(Repeated here from lecture 1)
Computing Engine Choices
[Figure: computing engine choices arranged between two extremes. Toward GPPs: more flexibility and programmability. Toward ASICs: more specialization, higher development cost/time, and higher performance per chip area/watt (computational efficiency).]
• General Purpose Processors (GPPs)
• Application-Specific Processors (ASPs), e.g. Digital Signal Processors (DSPs), Network Processors (NPs), Media Processors, Graphics Processing Units (GPUs), Physics Processors, ...
• Co-Processors
• Configurable Hardware
• Application Specific Integrated Circuits (ASICs)
Selection Factors:
  - Type and complexity of computational algorithm (general purpose vs. specialized)
  - Desired level of flexibility and programmability
  - Performance requirements
  - Desired level of computational efficiency
  - Power requirements
  - Real-time constraints
  - Development time and cost
  - System cost
Processor = Programmable computing element that runs programs written using a pre-defined set of instructions
The ISA sits between software and hardware (processors).
For Application-Specific Processors (ASPs): Application Domain Requirements → ASP ISA → ASP Architecture
(Repeated here from lecture 1)
Computing Element Choices Observation
• Generality and efficiency are in some sense inversely related to one another:
  – The more general-purpose a computing element is, and thus the greater the number of tasks it can perform, the less efficient (e.g. computations per chip area/watt) it will be in performing any of those specific tasks.
  – Design decisions are therefore almost always compromises; designers identify key features or requirements of applications that must be met and make compromises on other less important features.
• To counter the problem of computationally intense and specialized problems for which general purpose processors/machines cannot achieve the necessary performance/other requirements (i.e. computational efficiency):
  – Special-purpose processors (or Application-Specific Processors, ASPs), attached processors, and coprocessors have been designed/built for many years for specific application domains, such as image or digital signal processing (for which many of the computational tasks are specialized and can be very well defined).
Generality = Flexibility = Programmability?
Efficiency = Computational efficiency (computations per watt or chip area)
Why Application-Specific Processors (ASPs)?
Digital Signal Processor (DSP) Architecture
(DSPs are often embedded)
• Classification of Main Processor Types/Applications
• Requirements of Embedded Processors
• DSP vs. General Purpose CPUs
• DSP Cores vs. Chips
• Classification of DSP Applications
• DSP Algorithm Format
• DSP Benchmarks
• Basic Architectural Features of DSPs
• DSP Software Development Considerations
• Classification of Current DSP Architectures and example DSPs:
  – Conventional DSPs: TI TMS320C54xx (DSP generations 1-2)
  – Enhanced Conventional DSPs: TI TMS320C55xx (DSP generation 3)
  – Multiple-Issue DSPs (DSP generation 4):
    • VLIW DSPs: TI TMS320C62xx, TMS320C64xx
    • Superscalar DSPs: LSI Logic ZSP400/500 DSP core
Main Processor Types/Applications
• General Purpose Computing & General Purpose Processors (GPPs):
  – High performance: In general, faster is always better.
  – RISC or CISC: Intel P4, IBM Power4, SPARC, PowerPC, MIPS ...
  – Used for general purpose software
  – End-user programmable
  – Real-time performance may not be fully predictable (due to dynamic arch. features)
  – Heavy weight, multi-tasking OS - Windows, UNIX
  – Normally, low cost and power not a requirement (changing)
  – Servers, Workstations, Desktops (PCs), Notebooks, Clusters ...
• Embedded Processing: Embedded processors and processor cores
  – Cost, power, code-size and real-time requirements and constraints
  – Once real-time constraints are met, a faster processor may not be better
  – e.g.: Intel XScale, ARM, 486SX, Hitachi SH7000, NEC V800 ...
  – Often require Digital signal processing (DSP) support or other application-specific support (e.g. network, media processing)
  – Single or few specialized programs, known at system design time
  – Not end-user programmable
  – Real-time performance must be fully predictable (avoid dynamic arch. features)
  – Lightweight, often real-time OS or no OS
  – Examples: Cellular phones, consumer electronics ...
• Microcontrollers
  – Extremely code size/cost/power sensitive
  – Single program
  – Small word size - 8 bit common
  – Usually no OS
  – Highest volume processors by far
  – Examples: Control systems, automobiles, industrial control, thermostats, ...
[Annotations: cost/complexity increases toward GPPs and volume increases toward microcontrollers. Word sizes marked on the slide: 64 bit, 16-32 bit, 8 bit. Embedded processors are marked as examples of Application-Specific Processors (ASPs).]
The Processor Design Space (Main Types)
[Figure: performance versus processor cost (chip area, power, complexity).]
• Microprocessors (GPPs): Performance is everything & software rules
• Embedded processors (examples of ASPs): Application specific architectures for performance; real-time constraints, specialized applications, low power/cost constraints
• Microcontrollers: Cost is everything
Requirements of Embedded Processors
• Usually must meet strict real-time constraints:
  – Real-time performance must be fully predictable:
    • Avoid dynamic processor architectural features that make real-time performance harder to predict (e.g. cache, dynamic scheduling, hardware speculation ...)
  – Once real-time constraints are met, a faster processor is not desirable (overkill) due to increased cost/power requirements. (Embedded processors: how fast?)
• Optimized for a single (or few) program(s) - code often in on-chip ROM or on/off chip EPROM/flash memory.
• Minimum code size (one of the motivations initially for Java)
• Performance obtained by optimizing datapath
• Low cost
  – Lowest possible area
    • High computational efficiency: Computation per unit area
  – VLSI implementation technology usually behind the leading edge (Good or bad?)
  – High level of integration of peripherals (System-on-Chip, SoC, approach reduces system cost/power)
• Fast time to market
  – Compatible architectures (e.g. ARM family) allow reusable code
  – Customizable cores (System-on-Chip, SoC).
• Low power if application requires portability
Embedded Processors
[Figure: area of processor cores = cost (and power requirements); thus need to minimize chip area. Labeled examples: Nintendo processor, cellular phones, embedded version of a GPP.]
Embedded Processors
[Figure: another figure of merit, computation per unit chip area (computational efficiency). Labeled examples: Nintendo processor, cellular phones, embedded version of a GPP.]
Embedded Processors: Code size (smaller is better)
• If a majority of the chip is the program stored in ROM, then minimizing code size is a critical issue
• How? Common embedded processor ISA features to minimize code size (CISC-like?):
  1. Variable length instruction encoding common:
     • e.g. the Piranha has 3 sized instructions - basic 2 byte, and 2 byte plus 16 or 32 bit immediate
  2. Complex/specialized instructions
  3. Complex addressing modes
Embedded Systems vs. General Purpose Computing
General Purpose Computing Systems (and processors, GPPs):
• Used for general purpose software: intended to run a fully general set of applications that may not be known at design time
• End-user programmable
• In general, no real-time constraints
• Real-time performance may not be fully predictable (due to dynamic processor architectural features):
  – Superscalar: dynamic scheduling, hardware speculation, branch prediction, cache.
• Heavy weight, multi-tasking OS - Windows, UNIX
• Faster (higher-performance) is always better
• Minimizing code size is not an issue
• Higher power and cost constraints/requirements
• No application-specific capability required
Embedded Systems (and embedded processors):
• Run a single or few specialized applications often known at system design time
• Not end-user programmable
• Usually must meet strict real-time constraints (e.g. real-time sampling rate)
• Real-time performance must be fully predictable:
  – Avoid dynamic processor architectural features that make real-time performance harder to predict
• Lightweight, often real-time OS or no OS
• Once real-time constraints are met, a faster processor is not desirable (overkill) due to increased cost/power requirements.
• Minimum code size is highly desirable
• Low power and cost constraints/requirements
• May require application-specific capability (e.g. DSP)
Evolution of GPPs and DSPs
• General Purpose Processors (GPPs) trace roots back to Eckert, Mauchly, Von Neumann (ENIAC + EDSAC: first generation processors)
• Digital Signal Processors (DSPs) are microprocessors designed for efficient mathematical manipulation of digital signals utilizing digital signal processing algorithms.
  – DSPs usually process infinite continuous sampled (digitized) data streams (physical signals) while meeting real-time and power constraints.
  – DSPs evolved from Analog Signal Processors (ASPs) that utilize analog hardware to transform physical signals (classical electrical engineering)
  – ASP to DSP because:
    • DSP insensitive to environment (e.g., same response in snow or desert if it works at all)
    • DSP performance identical even with variations in components; the behavior of 2 analog systems varies even if built with the same components with 1% variation
• Different history and different application requirements led to different ISA design considerations, terms, different metrics, architectures, some new inventions.
  – i.e. for Application-Specific Processors (ASPs): Application Domain Requirements → ASP ISA → ASP Architecture
DSP vs. General Purpose CPUs
• DSPs tend to run one (or few) program(s), not many programs.
  – Hence OSes (if any) are much simpler, there is no virtual memory or protection, ...
• DSPs usually run applications with hard real-time constraints (DSP performance requirements):
  – DSP must meet application signal sampling rate computational requirements:
    • Once the above real-time constraints are met, a faster DSP is overkill (higher DSP cost, power, ...) without additional benefit.
  – You must account for anything that could happen in a time slot (DSP algorithm inner-loop, data sampling rate)
  – All possible interrupts or exceptions must be accounted for and their collective time be subtracted from the time interval.
    • Therefore, exceptions are BAD.
• DSPs usually process infinite continuous data streams:
  – Requires high memory bandwidth (with predictable latency, e.g. no data cache) for streaming real-time data samples and predictable processing time on the data samples
• The design of DSP ISAs and processor architectures is driven by the requirements of DSP algorithms:
  DSP Algorithms → DSP ISAs → DSP Architectures
  – Thus DSPs are application-specific processors
(Similar to other embedded processors)
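The time-slot accounting described above can be sketched as a small calculation. This is only an illustration: the function names and all numbers (clock rate, sample rate, interrupt overhead) are hypothetical and do not come from the lecture.

```python
def cycles_per_sample(clock_hz, sample_rate_hz, worst_case_overhead_cycles=0):
    """Cycles left for the DSP inner loop within one sampling interval.

    The worst-case time of all interrupts/exceptions is subtracted from
    the time slot, as the slide requires.
    """
    slot_cycles = clock_hz // sample_rate_hz  # whole cycles in one time slot
    return slot_cycles - worst_case_overhead_cycles


def meets_deadline(required_cycles, clock_hz, sample_rate_hz,
                   worst_case_overhead_cycles=0):
    """True if the per-sample work fits in the remaining cycle budget."""
    budget = cycles_per_sample(clock_hz, sample_rate_hz,
                               worst_case_overhead_cycles)
    return required_cycles <= budget


# Hypothetical example: 100 MHz DSP, 48 kHz audio, 200 cycles reserved
# for worst-case interrupt handling in every sampling interval.
budget = cycles_per_sample(100_000_000, 48_000, 200)
```

Once `meets_deadline` is true for the worst case, a faster clock only adds cost and power, which is the "overkill" point made on the slide.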
DSP vs. GPP
• The “MIPS/MFLOPS” of DSPs is the speed of Multiply-Accumulate (MAC), i.e. the main performance measure of DSPs is MAC speed.
  – MAC is common in DSP algorithms that involve computing a vector dot product, such as digital filters, correlation, and Fourier transforms.
  – DSPs are judged by whether they can keep the multipliers busy 100% of the time and by how many MACs are performed in each cycle.
• The "SPEC" of DSPs is 4 algorithms:
  – Infinite Impulse Response (IIR) filters
  – Finite Impulse Response (FIR) filters
  – FFT, and
  – convolvers
• In DSPs, target algorithms are important (since DSPs are application domain specific processors):
  – Binary compatibility is not a major issue (unlike general purpose processors)
• High-level software is not as important in DSPs as in GPPs. Why?
  – People still write in assembly language for a product to minimize the die area for ROM in the DSP chip (code size) and improve performance.
Note: While this is still mostly true, programming for DSPs in high level languages (HLLs) has been gaining more acceptance due to the development of more efficient HLL DSP compilers in recent years.
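To make the MAC measure concrete, here is a minimal sketch of a vector dot product written as a chain of multiply-accumulate steps, plus a lower bound on cycles for a given number of MAC units. The helper names are hypothetical, and the bound assumes the ideal case where every MAC unit is kept busy on every cycle.

```python
def dot_product_mac(a, b):
    """Dot product as repeated multiply-accumulate; returns (result, MAC count)."""
    acc = 0
    macs = 0
    for x, y in zip(a, b):
        acc += x * y  # one multiply-accumulate (MAC)
        macs += 1
    return acc, macs


def min_cycles(num_macs, mac_units=1):
    """Lower bound on cycles if each MAC unit completes one MAC per cycle."""
    return -(-num_macs // mac_units)  # ceiling division
```

For example, a 128-element dot product needs at least `min_cycles(128, 1)` cycles on a single-MAC DSP and half that with two MAC units, provided the memory system can feed both units.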
Types of DSP Processors
According to type of arithmetic/operand size supported
• 32-BIT FLOATING POINT (5% of DSP market). Examples:
  – TI TMS320C3X, TMS320C67xx (VLIW)
  – AT&T DSP32C
  – ANALOG DEVICES ADSP21xxx
  – Hitachi SH-4
• 16-BIT (or 24-bit) FIXED POINT (95% of DSP market). Examples:
  – TI TMS320C2X, TMS320C62xx (VLIW)
  – Infineon TC1xxx (TriCore1) (VLIW)
  – MOTOROLA DSP568xx, MSC810x (VLIW)
  – ANALOG DEVICES ADSP21xx
  – Agere Systems DSP16xxx, Starpro2000
  – LSI Logic LSI140x (ZSP400) superscalar
  – Hitachi SH3-DSP
  – StarCore SC110, SC140 (VLIW)
DSP Cores vs. Chips
DSPs are usually available as synthesizable cores or off-the-shelf packaged chips
• Synthesizable Cores:
  – Map into chosen fabrication process
    • Speed, power, and size vary
  – Choice of peripherals, etc. (SoC = System On Chip)
  – Requires extensive hardware development effort, resulting in more development time and cost (very high volume needed to justify development cost)
• Off-the-shelf packaged chips:
  – Highly optimized for speed, energy efficiency, and/or cost.
  – Lower development time/cost/effort.
  – Tools, 3rd-party support often more mature.
  – Faster time to market.
  – Limited performance, integration options.
[Figure: SoC built around a DSP core (IP), with instruction memory, data memory, A/D converter, D/A converter, and serial ports.]
DSP ARCHITECTURE: Enabling Technologies
Time Frame | Approach | Primary Application | Enabling Technologies
Early 1970’s | Discrete logic | Non-real time processing; Simulation | Bipolar SSI, MSI; FFT algorithm
Late 1970’s | Building block | Military radars; Digital Comm. | Single chip bipolar multiplier; Flash A/D
Early 1980’s | Single Chip DSP µP | Telecom; Control | µP architectures; NMOS/CMOS
Late 1980’s | Function/Application specific chips | Computers; Communication | Vector processing; Parallel processing
Early 1990’s | Multiprocessing | Video/Image Processing | Advanced multiprocessing; VLIW, MIMD, etc.
Late 1990’s | Single-chip multiprocessing | Wireless telephony; Internet related | Low power single-chip DSP; VLIW/Multiprocessing
[Annotations: the first microprocessor DSP was the TI TMS32010; the slide also marks generations 1 to 4 of single-chip (microprocessor) DSPs against the rows of this table.]
Texas Instruments TMS320 Family: Multiple DSP µP Generations
Device | First Sample | Bit Size | Clock speed (MHz) | Instruction Throughput | MAC execution (ns) | MOPS | Device density (# of transistors)
Uniprocessor Based (Harvard Architecture):
TMS32010 | 1982 | 16 integer | 20 | 5 MIPS | 400 | 5 | 58,000 (3µ)
TMS320C25 | 1985 | 16 integer | 40 | 10 MIPS | 100 | 20 | 160,000 (2µ)
TMS320C30 | 1988 | 32 flt. pt. | 33 | 17 MIPS | 60 | 33 | 695,000 (1µ)
TMS320C50 | 1991 | 16 integer | 57 | 29 MIPS | 35 | 60 | 1,000,000 (0.5µ)
TMS320C2XXX | 1995 | 16 integer | - | 40 MIPS | 25 | 80 | -
Multiprocessor Based:
TMS320C80 | 1996 | 32 integer/flt. | - | - | - | 2 GOPS, 120 MFLOP | MIMD
TMS320C62XX | 1997 | 16 integer | - | 1600 MIPS | 5 | 20 GOPS | VLIW
TMS320C67XX | 1997 | 32 flt. pt. | - | - | 5 | 1 GFLOP | VLIW
[Annotations: the slide marks generations 1 to 4 of single-chip (microprocessor) DSPs against these rows, with the 4th generation labeled VLIW.]
DSP Applications
• Digital audio applications
  – MPEG Audio
  – Portable audio
• Digital cameras
• Cellular telephones
• Wearable medical appliances
• Storage products:
  – disk drive servo control
• Military applications:
  – radar
  – sonar
• Industrial control
• Seismic exploration
• Networking (telecom infrastructure):
  – Wireless
  – Base station
  – Cable modems
  – ADSL
  – VDSL
  – ...
Current DSP Killer Applications: Cell phones and telecom infrastructure
HDTV? ... Other?
DSP Algorithms & Applications
DSP Algorithm | System Application
Speech Coding | Digital cellular telephones, personal communications systems, digital cordless telephones, multimedia computers, secure communications
Speech Encryption | Digital cellular telephones, personal communications systems, digital cordless telephones, secure communications
Speech Recognition | Advanced user interfaces, multimedia workstations, robotics, automotive applications, cellular telephones, personal communications systems
Speech Synthesis | Advanced user interfaces, robotics
Speaker Identification | Security, multimedia workstations, advanced user interfaces
High-fidelity Audio | Consumer audio, consumer video, digital audio broadcast, professional audio, multimedia computers
Modems | Digital cellular telephones, personal communications systems, digital cordless telephones, digital audio broadcast, digital signaling on cable TV, multimedia computers, wireless computing, navigation, data/fax
Noise cancellation | Professional audio, advanced vehicular audio, industrial applications
Audio Equalization | Consumer audio, professional audio, advanced vehicular audio, music
Ambient Acoustics Emulation | Consumer audio, professional audio, advanced vehicular audio, music
Audio Mixing/Editing | Professional audio, music, multimedia computers
Sound Synthesis | Professional audio, music, multimedia computers, advanced user interfaces
Vision | Security, multimedia computers, advanced user interfaces, instrumentation, robotics, navigation
Image Compression | Digital photography, digital video, multimedia computers, videoconferencing
Image Compositing | Multimedia computers, consumer video, advanced user interfaces, navigation
Beamforming | Navigation, medical imaging, radar/sonar, signals intelligence
Echo cancellation | Speakerphones, hands-free cellular telephones
Spectral Estimation | Signals intelligence, radar/sonar, professional audio, music
Another Look at DSP Applications
• High-end:
  – Military applications (e.g. radar/sonar)
  – Wireless Base Station - TMS320C6000
  – Cable modem
  – Gateways
  – HDTV ...
• Mid-range:
  – Industrial control
  – Cellular phone - TMS320C540
  – Fax/voice server ...
• Low end:
  – Storage products - TMS320C27 (hard drive controllers)
  – Digital camera - TMS320C5000
  – Portable phones
  – Wireless headsets
  – Consumer audio
  – Automobiles, thermostats, ...
[Annotations: cost increases toward the high end; volume increases toward the low end.]
DSP Range of Applications & Possible Target DSPs
[Figure]
Cellular Phone System (Example DSP Application)
[Figure: block diagram with RF modem, physical layer processing, baseband converter, A/D, DAC, speech encode, speech decode, controller, keypad and display.]
Cellular Phone: HW/SW/IC Partitioning (Example DSP Application)
[Figure: the same block diagram partitioned across an analog IC, a DSP, an ASIC, and a microcontroller.]
Mapping Onto System-on-Chip (SoC) (Cellular Phone, Example DSP Application)
[Figure: a single chip containing a µC (micro-controller or embedded processor) with RAM, a DSP core with RAM, ASIC logic, and S/P and DMA blocks. Functions shown on the chip: phone book, protocol, keypad intfc, control, speech quality enhancement, voice recognition, RPE-LTP speech decoder, de-intl & decoder, demodulator and synchronizer, Viterbi equalizer.]
Example Cellular Phone Organization (Example DSP Application)
[Figure: phone organization built around a C540 (DSP) and an ARM7 (µC).]
Multimedia System-on-Chip (SoC) (Example DSP Application)
• Future chips will be a mix of processors, memory and dedicated hardware for specific algorithms and I/O
[Figure: e.g. multimedia terminal electronics, with µP, DSP, Coms, Video Unit (ASIC co-processor or ASP), custom logic (ASIC), and memory. I/O: uplink radio, downlink radio, graphics out, video I/O, voice I/O, pen in.]
DSP Algorithm Format
• DSP culture has a graphical format to represent formulas (i.e. DSP algorithms).
• Like a flowchart for formulas, inner loops, not programs.
• Some seem natural: Σ is add, X is multiply
• Others are obtuse: z^(–1) means take variable from earlier iteration (delay).
• These graphs are trivial to decode
DSP Algorithm Notation
• Uses “flowchart” notation instead of equations
• Multiply is drawn as X
• Add is drawn as + or Σ
• Delay/Storage is drawn as Delay, z^(–1), or D
Typical DSP Algorithm: Finite-Impulse Response (FIR) Filter
• Filters reduce signal noise and enhance image or signal quality by removing unwanted frequencies.
• Finite Impulse Response (FIR) filters compute:
$$y(i) = \sum_{k=0}^{N-1} h(k)\, x(i-k) = h(n) * x(n)$$
where
  – x is the input sequence (signal samples)
  – y is the output sequence
  – h is the impulse response (i.e. filter coefficients)
  – N is the number of taps (coefficients) in the filter
• Output sequence depends only on input sequence and impulse response.
• The sum is a vector dot product: Multiply Accumulate (MAC) operations over N taps.
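A direct transcription of the FIR sum y(i) = Σ h(k) x(i−k), k = 0 .. N−1, into code may help. This is a plain reference sketch, not an optimized DSP kernel; it assumes that samples before the start of the input are zero, which the slide does not specify.

```python
def fir_filter(x, h):
    """Direct-form FIR: y(i) = sum over k of h(k) * x(i - k).

    x: input samples, h: the N filter coefficients (taps).
    Samples before x[0] are treated as zero (an assumption).
    """
    n_taps = len(h)
    y = []
    for i in range(len(x)):
        acc = 0
        for k in range(n_taps):
            if i - k >= 0:
                acc += h[k] * x[i - k]  # one tap = one multiply-accumulate
        y.append(acc)
    return y
```

Feeding a unit impulse through the filter returns the coefficients themselves, which is why h is called the impulse response.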
Typical DSP Algorithms: Finite-Impulse Response (FIR) Filter
• N most recent samples in the delay line (Xi)
• New sample moves data down delay line
• Filter “Tap” is a multiply-add (Multiply And Accumulate, MAC)
• Each tap (N taps total) nominally requires:
  – Two data fetches
  – Multiply
  – Accumulate
  – Memory write-back to update delay line
• Special addressing modes (e.g. modulo)
Performance Goal: At least 1 FIR Tap / DSP instruction cycle
Requires real-time data sample streaming:
• Predictable data bandwidth/latency
• Special addressing modes
• Separate memory banks/busses?
Repetitive computations, multiply and accumulate (MAC):
• Requires efficient MAC support
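The delay line and modulo addressing can be sketched in software as a circular buffer: a new sample overwrites the oldest one, so no data is physically moved down the line. The class below is a hypothetical illustration (its name and layout are not from the lecture); a DSP would do the index wrap-around in its address generation hardware rather than with an explicit `%`.

```python
class CircularFir:
    """FIR filter whose delay line is a circular (modulo-addressed) buffer."""

    def __init__(self, h):
        self.h = list(h)              # filter coefficients h(0) .. h(N-1)
        self.n = len(self.h)          # number of taps N
        self.delay = [0] * self.n     # N most recent samples
        self.newest = 0               # index of the most recent sample

    def step(self, sample):
        """Accept one new sample and return one output value."""
        # Overwrite the oldest sample instead of shifting the whole line.
        self.newest = (self.newest + 1) % self.n
        self.delay[self.newest] = sample
        acc = 0
        idx = self.newest
        for k in range(self.n):
            acc += self.h[k] * self.delay[idx]  # one tap = one MAC
            idx = (idx - 1) % self.n            # modulo address update
        return acc
```

Each call to `step` performs N MACs and a single memory write, compared with N writes if the delay line were shifted on every sample.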
FINITE-IMPULSE RESPONSE (FIR) FILTER
[Figure: signal flow graph. Signal samples X (from A/D) pass through a chain of z^(–1) delays (delayed samples); each delayed sample is multiplied by a filter coefficient h0, h1, ..., hN-2, hN-1 and the products are summed (MAC, with the accumulator register acting as a delay) to give Y (to D/A). One multiply-add stage is one FIR filter tap.]
$$y(i) = \sum_{k=0}^{N-1} h(k)\, x(i-k)$$
i.e. a vector dot product
Performance Goal: at least 1 FIR Tap / DSP instruction cycle
DSP must meet application signal sampling rate computational requirements: A faster DSP is overkill (more cost/power than really needed)
Sample Computational Rates for FIR Filtering (DSP Performance Requirements)
Signal type | Frequency | FIR Type | # taps | Performance
Speech | 8 kHz | 1-D | N = 128 | 20 MOPs
Music | 48 kHz | 1-D | N = 256 | 24 MOPs
Video phone | 6.75 MHz | 2-D | N*N = 81 | 1,090 MOPs
TV | 27 MHz | 2-D | N*N = 81 | 4,370 MOPs (4.37 GOPs)
HDTV | 144 MHz | 2-D | N*N = 81 | 23,300 MOPs (23.3 GOPs)
A 1-D FIR has n_op = 2N operations per sample and a 2-D FIR has n_op = 2N^2. OPs = Operations Per Second
DSP must meet application signal sampling rate computational requirements:
• A faster DSP is overkill (higher DSP cost, power, ...)
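The rate formula (2N operations per sample for a 1-D FIR, 2N^2 for a 2-D FIR, times the sampling frequency) can be checked with a few lines of code. The function name is hypothetical. Under this formula the music, video phone, TV and HDTV rows come out close to the listed values; the speech row evaluates to about 2 MOPs rather than the listed 20 MOPs, so that entry may rest on assumptions not visible in the transcript.

```python
def fir_ops_per_second(sample_rate_hz, n, two_d=False):
    """Operations per second for FIR filtering at a given sampling rate.

    n is the number of taps N for a 1-D filter (2N ops per sample),
    or the side length N of an N x N 2-D filter (2*N*N ops per sample).
    """
    ops_per_sample = 2 * n * n if two_d else 2 * n
    return ops_per_sample * sample_rate_hz


music = fir_ops_per_second(48_000, 256)                 # 1-D, N = 256
tv = fir_ops_per_second(27_000_000, 9, two_d=True)      # 2-D, N*N = 81
```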
FIR Filter on (Simple) General Purpose Processor (GPP)
loop: lw  x0, 0(r0)     ; load first operand
      lw  y0, 0(r1)     ; load second operand
      mul a, x0, y0     ; multiply
      add y0, a, b      ; accumulate
      sw  y0, (r2)      ; store result
      inc r0            ; advance first operand pointer
      inc r1            ; advance second operand pointer
      inc r2            ; advance result pointer
      dec ctr           ; decrement loop counter
      tst ctr           ; test loop counter
      jnz loop          ; repeat until counter reaches zero
• Problems:
  • Bus / memory bandwidth bottleneck
  • Control/loop code overhead
  • No suitable addressing modes, instructions
    – e.g. multiply and accumulate (MAC) instruction
  + GPP real-time performance (to meet signal sampling rate) may not be fully predictable (due to dynamic processor architectural features):
    • Superscalar: dynamic scheduling, hardware speculation, branch prediction, cache.
Typical DSP Algorithms: Infinite-Impulse Response (IIR) Filter
• Infinite Impulse Response (IIR) filters compute:
$$y(i) = \sum_{k=1}^{M-1} a(k)\, y(i-k) + \sum_{k=0}^{N-1} b(k)\, x(i-k)$$
  (filter coefficients: a(k), b(k); each sum is a MAC sequence)
• Output sequence depends on input sequence, previous outputs, and impulse response.
• Both FIR and IIR filters
  – Require vector dot product (multiply-accumulate, MAC) operations
  – Normally use fixed coefficients
• Adaptive filters update their coefficients to minimize the distance between the filter output and the desired signal.
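The IIR recurrence can be sketched the same way as the FIR sum, with a second MAC loop over previous outputs. The sketch assumes the form y(i) = Σ a(k) y(i−k) for k = 1 .. M−1 plus Σ b(k) x(i−k) for k = 0 .. N−1, and zero initial conditions; `a[0]` is an unused placeholder so that list indices match the equation.

```python
def iir_filter(x, a, b):
    """Direct-form IIR filter with zero initial conditions (an assumption).

    b[k] weights the input x(i - k) for k = 0 .. N-1.
    a[k] weights the previous output y(i - k) for k = 1 .. M-1;
    a[0] is ignored so that indices line up with the equation.
    """
    y = []
    for i in range(len(x)):
        acc = 0.0
        for k in range(len(b)):        # feed-forward MACs
            if i - k >= 0:
                acc += b[k] * x[i - k]
        for k in range(1, len(a)):     # feedback MACs on earlier outputs
            if i - k >= 0:
                acc += a[k] * y[i - k]
        y.append(acc)
    return y
```

With a single feedback coefficient of 0.5, a unit impulse produces 1, 0.5, 0.25, ... and never reaches exactly zero in exact arithmetic, which is the "infinite" impulse response.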
Typical DSP Algorithms: Discrete Fourier Transform (DFT)
• The Discrete Fourier Transform (DFT) allows for spectral analysis in the frequency domain.
• It is computed as
$$y(k) = \sum_{n=0}^{N-1} W_N^{nk}\, x(n), \qquad W_N = e^{-j 2\pi / N}, \qquad j = \sqrt{-1}$$
  for k = 0, 1, ..., N-1, where
  – x is the input sequence in the time domain
  – y is an output sequence in the frequency domain
• The Inverse Discrete Fourier Transform is computed as
$$x(n) = \frac{1}{N} \sum_{k=0}^{N-1} W_N^{-nk}\, y(k), \qquad \text{for } n = 0, 1, \ldots, N-1$$
  (the 1/N factor is required for this to invert the forward transform as defined above)
• Both sums are MAC sequences.
• The Fast Fourier Transform (FFT) provides an efficient method for computing the DFT.
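A direct O(N^2) evaluation of the DFT sum shows the MAC structure; it is a reference sketch, not an FFT. The inverse below includes a 1/N scale factor so that `idft(dft(x))` returns `x`; that normalization is an assumption about convention rather than something the transcript states.

```python
import cmath


def dft(x):
    """y(k) = sum over n of W^(n*k) * x(n), with W = exp(-j*2*pi/N)."""
    n_pts = len(x)
    w = cmath.exp(-2j * cmath.pi / n_pts)
    return [sum(x[n] * w ** (n * k) for n in range(n_pts))
            for k in range(n_pts)]


def idft(y):
    """x(n) = (1/N) * sum over k of W^(-n*k) * y(k)."""
    n_pts = len(y)
    w = cmath.exp(-2j * cmath.pi / n_pts)
    return [sum(y[k] * w ** (-n * k) for k in range(n_pts)) / n_pts
            for n in range(n_pts)]
```

Each output bin costs N complex MACs, so the full transform costs N^2; the FFT reduces this by reusing partial sums.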
Typical DSP Algorithms: Discrete Cosine Transform (DCT)
• The Discrete Cosine Transform (DCT) is frequently used in image & video compression (e.g. JPEG, MPEG-2).
• The DCT and Inverse DCT (IDCT) are computed as:
$$y(k) = e(k) \sum_{n=0}^{N-1} \cos\!\left[\frac{(2n+1)k\pi}{2N}\right] x(n), \qquad \text{for } k = 0, 1, \ldots, N-1$$
$$x(n) = \frac{2}{N} \sum_{k=0}^{N-1} e(k) \cos\!\left[\frac{(2n+1)k\pi}{2N}\right] y(k), \qquad \text{for } n = 0, 1, \ldots, N-1$$
  where e(k) = 1/sqrt(2) if k = 0; otherwise e(k) = 1.
• An N-point 1D-DCT requires N^2 MAC operations.
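The DCT/IDCT pair can be evaluated directly to check that the two formulas invert each other. This sketch assumes the forward transform y(k) = e(k) Σ cos[(2n+1)kπ/2N] x(n) and the inverse x(n) = (2/N) Σ e(k) cos[(2n+1)kπ/2N] y(k), with e(0) = 1/sqrt(2) and e(k) = 1 otherwise; the function names are illustrative.

```python
import math


def _e(k):
    """Scale factor e(k): 1/sqrt(2) for k = 0, otherwise 1."""
    return 1 / math.sqrt(2) if k == 0 else 1.0


def dct(x):
    """N-point 1-D DCT by direct summation (N*N multiply-accumulates)."""
    n_pts = len(x)
    return [_e(k) * sum(math.cos((2 * n + 1) * k * math.pi / (2 * n_pts)) * x[n]
                        for n in range(n_pts))
            for k in range(n_pts)]


def idct(y):
    """Inverse of dct() above, including the 2/N scale factor."""
    n_pts = len(y)
    return [(2 / n_pts) * sum(_e(k)
                              * math.cos((2 * n + 1) * k * math.pi / (2 * n_pts))
                              * y[k]
                              for k in range(n_pts))
            for n in range(n_pts)]
```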
DSP BENCHMARKS
• DSPstone: University of Aachen, application benchmarks
  – ADPCM TRANSCODER - CCITT G.721, REAL_UPDATE, COMPLEX_UPDATES
  – DOT_PRODUCT, MATRIX_1X3, CONVOLUTION
  – FIR, FIR2DIM, HR_ONE_BIQUAD
  – LMS, FFT_INPUT_SCALED
• BDTImark2000: Berkeley Design Technology Inc (BDTI)
  – 12 DSP kernels in hand-optimized assembly language:
    • FIR, IIR, Vector dot product, Vector add, Vector maximum, FFT ...
  – Returns single number (higher means faster) per processor
  – Use only on-chip memory (memory bandwidth is the major bottleneck in performance of embedded applications).
• EEMBC (pronounced “embassy”): EDN Embedded Microprocessor Benchmark Consortium
  – 30 companies formed by Electronic Data News (EDN)
  – Benchmark evaluates compiled C code on a variety of embedded processors (microcontrollers, DSPs, etc.)
  – Application domains: automotive-industrial, consumer, office automation, networking and telecommunications
[Figure: DSP performance by generation (1st, 2nd, 3rd, 4th); the 4th generation is > 800x faster than the first generation.]
DSPs from generations 2, 3 and 4 are in use today. Why?
Basic DSP ISA/Architectural Features
Specialized DSP Algorithms/Application Requirements → DSP ISAs → DSP Architectures
• Data path configured for DSP algorithms
  – Fixed-point arithmetic (most DSPs)
    • Saturation to handle overflow (rather than modulo wrap-around)
  – MAC: Multiply-accumulate unit(s)
  – Hardware rounding support
• Multiple memory banks and buses
  – Harvard Architecture
  – Multiple data memories/buses
  – Usually with no data cache, for predictable fast data sample streaming
• Specialized addressing modes
  – Bit-reversed addressing
  – Circular buffers
  – Dedicated address generation units are usually used
• Specialized instruction set and execution control
  – Zero-overhead loops
  – Support for fast MAC
  – Fast Interrupt Handling (to meet real-time signal sampling/processing constraints)
• Specialized peripherals for DSP (System on Chip, SoC, style)
[Annotations: the slide tags each item as a DSP ISA feature, a DSP architectural feature, or both.]
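Bit-reversed addressing is commonly used to reorder data for radix-2 FFTs, where element i is exchanged with the element whose index is i with its address bits reversed. A minimal software sketch of the index mapping follows; the function names are hypothetical, and the number of points is assumed to be a power of two. A DSP address generation unit produces this sequence in hardware at no cycle cost.

```python
def bit_reverse(index, num_bits):
    """Reverse the lowest num_bits bits of index."""
    result = 0
    for _ in range(num_bits):
        result = (result << 1) | (index & 1)  # move the low bit into result
        index >>= 1
    return result


def bit_reversed_order(n_points):
    """Access order for n_points elements (n_points assumed a power of two)."""
    num_bits = n_points.bit_length() - 1
    return [bit_reverse(i, num_bits) for i in range(n_points)]
```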
DSP Data Path: Arithmetic (DSP ISA Features)
• DSPs dealing with numbers representing real world signals
  => Want “reals”/fractions
• DSPs dealing with numbers for addresses
  => Want integers
• Thus DSP ISA (and DSP) must support “fixed point” as well as integers
[Figure: fixed-point fraction format, sign bit S followed immediately by the radix point: –1 ≤ x < 1. Integer format, sign bit S with the radix point at the right end of the word: –2^(N–1) ≤ x < 2^(N–1). Usually 16-bit fixed-point.]
In DSP ISAs: Fixed-point arithmetic must be supported, floating point support is optional and is much less common
• Most Common: Fixed Point (16-bit or 24-bit) + Integer Arithmetic
• Much Less Common: Single Precision Floating-point Support
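A 16-bit fixed-point fraction with the radix point right after the sign bit is often called Q15: the stored integer r represents r / 2^15, covering –1 ≤ x < 1. The sketch below shows conversion and multiplication under that assumption. The names are illustrative, and the multiply simply truncates and does not saturate, which a real DSP data path would usually handle in hardware.

```python
Q15_SCALE = 1 << 15  # 2**15: one sign bit, 15 fraction bits


def float_to_q15(value):
    """Convert a real number to the nearest Q15 integer, clamped to range."""
    raw = int(round(value * Q15_SCALE))
    return max(-Q15_SCALE, min(Q15_SCALE - 1, raw))


def q15_to_float(raw):
    """Real value represented by a Q15 integer."""
    return raw / Q15_SCALE


def q15_mul(a, b):
    """Multiply two Q15 values: the 32-bit product is shifted back by 15 bits.

    Truncating sketch only; no rounding and no saturation.
    """
    return (a * b) >> 15
```

For example, 0.5 is stored as 16384, and `q15_mul(16384, 16384)` gives 8192, which represents 0.25. The value 1.0 itself is not representable and clamps to 32767.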
DSP Data Path: Precision
• Word size affects the precision of fixed-point numbers
• DSPs have 16-bit, 20-bit, or 24-bit data words (16-bit most common)
• Floating-point DSPs cost 2X - 4X vs. fixed point and are slower than fixed point
• DSP programmers will scale values inside code
– SW libraries
– Separate explicit exponent
• "Blocked floating point": a single exponent for a group of fractions
• Floating-point support simplifies development for high-end DSP applications
• In DSP ISAs (DSP ISA features): fixed-point arithmetic (16-bit most common) must be supported; floating-point (single precision) support is optional and is much less common
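The "blocked floating point" idea above can be sketched in Python as follows. This is an illustration under stated assumptions, not a specific library's scheme: mantissas are assumed to be 16-bit fractions (15 fraction bits), and the names `bfp_encode` and `bfp_decode` are hypothetical.

```python
import math

FRAC_BITS = 15  # assumed: 16-bit mantissas (1 sign bit + 15 fraction bits)

def bfp_encode(samples):
    """Return (exponent, mantissas) so that samples[i] ~= mantissas[i] * 2**(exponent - FRAC_BITS)."""
    peak = max(abs(s) for s in samples)
    # frexp gives peak = m * 2**e with 0.5 <= m < 1, so every |sample| / 2**e is below 1
    exponent = math.frexp(peak)[1] if peak else 0
    scale = 2.0 ** (FRAC_BITS - exponent)
    lo, hi = -(1 << FRAC_BITS), (1 << FRAC_BITS) - 1
    mantissas = [max(lo, min(hi, round(s * scale))) for s in samples]
    return exponent, mantissas

def bfp_decode(exponent, mantissas):
    """Rebuild approximate real values from the shared exponent and the integer mantissas."""
    return [m * 2.0 ** (exponent - FRAC_BITS) for m in mantissas]
```

One exponent is stored for the whole block, so the per-sample storage stays at one fixed-point word; the cost is that small samples in a block with a large peak lose precision.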
DSP Data Path: Overflow Handling
• DSPs are descended from analog signal processors, so on overflow:
– Not modulo (wrap-around) arithmetic
– Instead, set the result to the most positive (2^(N-1) - 1) or most negative (-2^(N-1)) value: "saturation"
• Why support saturation? Due to the physical nature of signals
• Many DSP algorithms were developed in this model
DSP ISA feature
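A short Python sketch contrasting the two overflow behaviours for N = 16. The function names are hypothetical; real DSPs do this in hardware in a single cycle.

```python
N = 16
LO, HI = -(1 << (N - 1)), (1 << (N - 1)) - 1  # -32768 and 32767

def sat_add16(a: int, b: int) -> int:
    """Saturating add: clamp the true sum to the most positive or most negative value."""
    return max(LO, min(HI, a + b))

def wrap_add16(a: int, b: int) -> int:
    """Modulo (wrap-around) add, as a plain two's-complement 16-bit adder would produce."""
    return ((a + b - LO) & 0xFFFF) + LO
```

Adding 30000 and 10000 saturates to 32767, while the wrap-around adder returns -25536, a large sign error that would be audible or visible in a processed signal.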
DSP Data Path: Specialized Hardware
• Fast, specialized hardware functional units perform all key arithmetic operations in 1 cycle (DSP architectural features), including:
– Shifters
– Saturation
– Guard bits
– Rounding modes
– Multiplication/addition (MAC)
• 50% of instructions can involve the multiplier => single-cycle latency multiplier
• Need to perform multiply-accumulate (MAC) fast
• n-bit multiplier => 2n-bit product
• Why? To help meet real-time constraints for commonly needed operations, i.e. must optimize common operations
DSP Data Path: Multiply Accumulate (MAC) Unit
• Don't want overflow or to have to scale the accumulator
• Option 1: accumulator wider than product: "guard bits" (G)
– Motorola DSP: 24b x 24b => 48b product, 56b accumulator
• Option 2: shift right and round the product before the adder
• A DSP has one or more MAC units
[Figure: two MAC unit datapaths. Option 1: Multiplier, ALU (add), Accumulator with guard bits G. Option 2: Multiplier, Shift, ALU (add), Accumulator]
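A sketch of why Option 1 works, using the Motorola numbers from the slide (24-bit operands, 48-bit products, 56-bit accumulator, hence 8 guard bits). The helper names are hypothetical, and Python integers stand in for the hardware registers.

```python
def fits_signed(value: int, bits: int) -> bool:
    """True if value is representable as a two's-complement number of the given width."""
    return -(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1

def mac(samples, coeffs) -> int:
    """Accumulate the sum of products exactly (Python ints do not overflow)."""
    acc = 0
    for x, h in zip(samples, coeffs):
        acc += x * h
    return acc

# Worst case: 256 products of the most negative 24-bit value with itself
worst = -(1 << 23)
acc = mac([worst] * 256, [worst] * 256)
```

Each product here is 2^46 and fits in 48 bits, but the sum of 256 of them is 2^54, which needs the 56-bit accumulator. Eight guard bits therefore allow on the order of 2^8 worst-case accumulations before overflow becomes possible.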
DSP Data Path: Rounding Modes
• Even with guard bits, will need to round when storing the accumulator into memory
• 3 standard DSP options (supported in hardware, not in software as in GPPs):
1. Truncation: chop results => biases results down (toward more negative values in two's complement)
2. Round to nearest: < 1/2 round down, ≥ 1/2 round up (more positive) => smaller bias
3. Convergent: < 1/2 round down, > 1/2 round up (more positive), = 1/2 round to make the lsb a zero (+1 if 1, +0 if 0) => no bias
– IEEE 754 calls this round to nearest even
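The three options can be sketched in Python as operations that drop the low `shift` bits of an integer accumulator. The function names are hypothetical; Python's `>>` is an arithmetic shift, which matches two's-complement truncation.

```python
def truncate(acc: int, shift: int) -> int:
    """Chop the low bits: always rounds toward minus infinity."""
    return acc >> shift

def round_nearest(acc: int, shift: int) -> int:
    """Add half an lsb, then chop: exact ties go up (more positive)."""
    return (acc + (1 << (shift - 1))) >> shift

def round_convergent(acc: int, shift: int) -> int:
    """Round to nearest, with exact ties going to the value whose lsb is zero."""
    half = 1 << (shift - 1)
    q = acc >> shift                  # truncated result
    rem = acc & ((1 << shift) - 1)    # discarded bits, in 0 .. 2**shift - 1
    if rem > half or (rem == half and (q & 1)):
        q += 1
    return q
```

With `shift = 1`, the inputs 1, 3, 5 represent 0.5, 1.5, 2.5. Round to nearest gives 1, 2, 3 (every tie moves up), while convergent rounding gives 0, 2, 2, so the tie errors cancel on average.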
Data Path Comparison
DSP Processor:
• Specialized hardware performs all key arithmetic operations in 1 cycle (e.g. MAC)
• Hardware support for managing numeric fidelity:
– Shifters
– Guard bits
– Saturation
– Rounding modes
General-Purpose Processor:
• Multiplies often take >1 cycle
• Shifts often take >1 cycle
• Other operations (e.g., saturation, rounding) typically take multiple cycles
TI 320C54x DSP (1995) Functional Block Diagram
[Figure: functional block diagram; callouts: hardware support for rounding/saturation, MAC unit, multiple memory banks and buses]
First Commercial DSP (1982): Texas Instruments TMS32010
(i.e. single-chip DSP microprocessor)
• 16-bit fixed-point arithmetic
• Introduced at 5 MHz (200 ns) instruction cycle
• "Harvard architecture"
– Separate instruction and data memories
• Accumulator
• Specialized instruction set
– Load and Accumulate
• Two-cycle (400 ns) Multiply-Accumulate (MAC) time
[Figure: processor with separate instruction memory and data memory. Datapath (i.e. MAC unit): Mem, T-Register, Multiplier, P-Register, ALU, Accumulator]
First Generation DSP µP: Texas Instruments TMS32010 - 1982
Features:
• 200 ns instruction cycle (5 MIPS)
• 144 words (16 bit) on-chip data RAM
• 1.5K words (16 bit) on-chip program ROM - TMS32010
• External program memory expansion to a total of 4K words at full speed
• 16-bit instruction/data word
• Single-cycle 32-bit ALU/accumulator
• Single-cycle 16 x 16-bit multiply in 200 ns
• Two-cycle MAC (5 MOPS)
• Zero to 15-bit barrel shifter
• Eight input and eight output channels
First Generation DSP µP: TI TMS32010 Block Diagram
[Figure: block diagram; callouts: MAC unit, program memory (ROM/EPROM), data/samples memory, barrel shifter (1 cycle)]
TMS32010 FIR Filter Code
• Here X4, H4, ... are direct (absolute) memory addresses:
    LT  X4    ; Load T with x(n-4)
    MPY H4    ; P = H4*X4
    LTD X3    ; Load T with x(n-3); x(n-4) = x(n-3); Acc = Acc + P
    MPY H3    ; P = H3*X3
    LTD X2
    MPY H2
    ...
• LTD is the "Load and Accumulate" instruction
• Two instructions per tap, but requires unrolling
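For readers unfamiliar with the assembly above, here is a hedged Python model of what one pass over the taps computes. The function name `fir_step` and the list-based delay line are illustrative assumptions. The TMS32010 code interleaves the delay-line shift with the accumulation (the data-move part of LTD), whereas this sketch does the shift first and then the sum of products.

```python
def fir_step(delay, coeffs, new_sample):
    """Process one input sample: delay[k] holds x(n-k), coeffs[k] holds h(k)."""
    # Shift the delay line by one position and insert the newest sample
    delay.insert(0, new_sample)
    delay.pop()
    # One multiply-accumulate per tap
    acc = 0
    for x, h in zip(delay, coeffs):
        acc += x * h
    return acc

# Usage: the impulse response of the filter reproduces its coefficients
taps = [1, 2, 3]
line = [0, 0, 0]
outputs = [fir_step(line, taps, s) for s in (1, 0, 0, 0)]
```

Feeding a unit impulse produces the coefficient sequence followed by zeros, which is a quick sanity check for any FIR implementation.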
DSP Memory
• FIR tap implies multiple memory accesses
• DSPs require multiple data ports
• Some DSPs have ad hoc techniques to reduce memory bandwidth demand:
– Instruction repeat buffer: do 1 instruction 256 times
– Often disables interrupts, thereby increasing interrupt response time
• Some recent DSPs have instruction caches
– Even then may allow the programmer to "lock in" instructions into the cache
– Option to turn the cache into fast program memory
• Usually DSPs have no data caches. Why? For better real-time performance predictability
• May have multiple data memories (e.g. one for signal data samples and one for filter coefficients)
• DSP architectural features: separate memories for data and program
Conventional "Von Neumann" memory
• AKA unified or Princeton memory architecture
HARVARD MEMORY ARCHITECTURE in DSP (i.e. split memory)
[Figure: program memory (ROM/EPROM/FLASH?), X memory and Y memory; bus labels: GLOBAL, P DATA, X DATA, Y DATA]
• Multiple memory banks and buses
• Data memory banks (SRAM): e.g. one for signal data samples and one for filter coefficients
Memory Architecture Comparison
DSP Processor:
• Harvard architecture (split): processor with separate program memory and data memory
• 2-4 memory accesses/cycle
• No caches: on-chip SRAM (for real-time performance predictability)
General-Purpose Processor:
• Von Neumann architecture: processor with a single memory (i.e. unified memory, but not the L1 cache, which is split)
• Typically 1 access/cycle
• Use caches (makes real-time performance harder to predict)
TI TMS320C3x MEMORY BLOCK DIAGRAM - Harvard Architecture
[Figure: memory block diagram; callouts: instruction cache, two data memories and a program memory, multiple memory banks and buses]
TI 320C62x/67x DSP (1997) – (Fourth Generation DSP)
[Figure: block diagram; callouts: program memory, data memory]
DSP Addressing Modes
• Have standard addressing modes: immediate, displacement, register indirect
• Want to keep the MAC datapath busy
• Assumption: any extra instructions imply additional clock cycles of overhead in the inner loop and larger code size
=> Thus complex addressing is good
• Examples of complex and specialized addressing modes (DSP ISA features):
– Autoincrement/autodecrement register indirect
  • lw r1,0(r2)+ => r1 <- M[r2]; r2 <- r2+1
  • Option to do it before addressing, positive or negative
– "Bit reverse" addressing mode
– "Modulo" or "circular" addressing
• Why? To match data access patterns in DSP algorithms and reduce the number of instructions (code size)
=> Don't use the normal datapath integer unit to calculate complex addressing modes:
– Instead use dedicated address generation units (related DSP architectural feature)
DSP Addressing: FFT
• FFTs start or end with data in butterfly (bit-reversed) order:
0 (000) => 0 (000)
1 (001) => 4 (100)
2 (010) => 2 (010)
3 (011) => 6 (110)
4 (100) => 1 (001)
5 (101) => 5 (101)
6 (110) => 3 (011)
7 (111) => 7 (111)
• How to avoid the overhead of address-checking instructions for FFT?
• Have an optional "bit reverse" addressing mode for use with autoincrement addressing
• Thus most DSPs have "bit reverse" addressing for radix-2 FFT (Bit Reversed Addressing, a DSP ISA feature)
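The mapping in the table above can be reproduced with a few lines of Python. The function name `bit_reverse` is hypothetical; a DSP address generation unit produces the same sequence in hardware as a side effect of the addressing mode, with no extra instructions.

```python
def bit_reverse(index: int, bits: int) -> int:
    """Return index with its low `bits` bits in reversed order."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)  # move the current lsb to the other end
        index >>= 1
    return result

# Access order for an 8-point radix-2 FFT (3 address bits)
order = [bit_reverse(i, 3) for i in range(8)]
```

Bit reversal is its own inverse, so the same mode can be used either to scramble the input or to unscramble the output.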
Bit Reversed Addressing
[Figure: data flow in the radix-2 decimation-in-time FFT algorithm. Inputs in bit-reversed order x(0), x(4), x(2), x(6), x(1), x(5), x(3), x(7) (addresses 000, 100, 010, 110, 001, 101, 011, 111) pass through four 2-point DFTs, two 4-point DFTs and one 8-point DFT to produce outputs F(0) to F(7)]
DSP ISA feature
DSP Addressing: Circular Buffers
• DSPs deal with continuous I/O (sampled signal)
• Often interact with an I/O buffer (delay lines)
• To save memory, buffers are often organized as circular buffers
• What can be done to avoid the overhead of address-checking instructions for a circular buffer?
• Option 1: keep a start register and an end register per address register for use with autoincrement addressing; reset to start when the end of the buffer is reached
• Option 2: keep a buffer length register, assuming buffers start on an aligned address; reset to start when the end is reached
• Every DSP has "modulo" or "circular" addressing
Circular Buffers Addressing Support
• Every DSP has a "modulo" or "circular" addressing mode (DSP ISA feature)
• Instructions accommodate three elements:
– Buffer address
– Buffer size
– Increment
• Why? Allows for cycling through:
– Delay elements (signal samples), e.g. from A/D or to D/A
– Filter coefficients (or other DSP algorithm coefficients) in data memory
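A Python sketch of the address update that a circular addressing mode performs, using the three elements listed above (buffer address, buffer size, increment). Both function names are hypothetical. The second variant assumes a power-of-two buffer size on an aligned base, which is one way the aligned-buffer option can reduce the wrap to a bit mask.

```python
def circ_next(addr: int, base: int, size: int, step: int = 1) -> int:
    """General form: the updated address always stays inside [base, base + size)."""
    return base + (addr - base + step) % size

def circ_next_pow2(addr: int, size: int, step: int = 1) -> int:
    """Aligned power-of-two buffer: only the low log2(size) address bits change."""
    mask = size - 1
    return (addr & ~mask) | ((addr + step) & mask)
```

In both cases the wrap happens as part of the address update, so the inner loop needs no compare-and-branch on the pointer.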
Address Calculation for DSPs
• Dedicated address generation units (do not use the normal integer unit)
• Supports modulo and bit-reversal arithmetic
• Often duplicated to calculate multiple addresses per cycle
DSP architectural features
Addressing Comparison
DSP Processor:
• Dedicated address generation units (DSP architectural feature)
• Specialized addressing modes (DSP ISA feature), e.g.:
– Autoincrement
– Modulo (circular)
– Bit-reversed (for FFT)
• Good immediate data support
General-Purpose Processor:
• Often, no separate address generation units
• General-purpose addressing modes (GPP ISA feature; number minimized in RISC ISAs)
DSP Instructions and Execution
• May specify multiple operations in a single complex instruction (to reduce the number of instructions and reduce code size):
– e.g. a compound instruction may perform: multiply + add + load + modify address register
• Must support Multiply-Accumulate (MAC)
• Need parallel move support
• Usually have special loop support to reduce branch overhead (reduce loop overhead)
– Loop an instruction or sequence
– 0 value in register usually means loop the maximum number of times
– If the loop count is calculated, must make sure a computed 0 is handled, since 0 does not mean zero iterations
• May have saturating shift left arithmetic
• May have conditional execution to reduce branches (in 4th generation VLIW DSPs)
DSP ISA features
DSP Low/Zero Overhead Loops
• Examples (DSP ISA features):
– In ADSP 2100: DO <addr> UNTIL condition
      DO X
      ...
    X:
  Address generation:
      PCS = PC + 1
      if (PC == X && !condition)
          PC = PCS
      else
          PC = PC + 1
– Example FIR inner loop on TI TMS320C54xx: [code listing not recovered; callouts: Repeat, number of filter taps]
• Eliminates a few instructions in loops (lowers loop overhead)
• Important in loops with small bodies
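The address-generation rule above can be traced with a small Python model. This is a sketch under assumptions: the loop terminates on a down counter (standing in for the UNTIL condition), the DO instruction sits at `do_addr`, the loop body runs from `do_addr + 1` to `loop_end`, and `count` is at least 1. The function name is hypothetical.

```python
def trace_do_until(do_addr: int, loop_end: int, count: int):
    """Return the sequence of PC values fetched for a hardware-managed loop."""
    pcs = do_addr + 1          # PCS: saved address of the first loop instruction
    pc = do_addr
    remaining = count
    trace = []
    while pc <= loop_end:
        trace.append(pc)
        if pc == loop_end:
            remaining -= 1
            if remaining > 0:  # condition not yet met: jump back with no branch instruction
                pc = pcs
                continue
        pc += 1                # normal sequential fetch, or fall out of the loop
    return trace

fetched = trace_do_until(do_addr=10, loop_end=12, count=3)
```

Every fetched address belongs either to the DO instruction or to the loop body; no decrement, compare or branch instructions appear in the trace, which is where the saving comes from in loops with small bodies.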
Instruction Set (ISA) Comparison
DSP Processor:
• Specialized, complex instructions (e.g. MAC)
• Multiple operations per instruction
• Zero or reduced overhead loops
• Example:
    mac x0,y0,a  x:(r0)+,x0  y:(r4)+,y0
  Code size = 16 bits (smaller code size)
General-Purpose Processor:
• General-purpose instructions (less complex ISA)
• Typically only one operation per instruction
• No zero or reduced overhead loops support
• Example:
    mov *r0,x0
    mov *r1,y0
    mpy x0,y0,a
    add a,b
    mov y0,*r2
    inc r0
    inc r1
  Code size = 7 x 32 = 224 bits (14X; larger code size)
The above is in addition to the addressing mode differences identified earlier (slide 65, Addressing Comparison)
Specialized Peripherals for DSPs
• Synchronous serial ports
• Parallel ports
• Timers
• On-chip A/D, D/A converters
• Host ports
• Bit I/O ports
• On-chip DMA controller
• Clock generators
• On-chip peripherals often designed for "background" operation, even when the DSP core is powered down
• System on Chip (SoC) approach (DSP architectural features): heavy integration of peripherals/components to reduce cost (chip count)/power. Components shown for the SoC:
– DSP core
– Instruction memory, data memory
– A/D converter, D/A converter, serial ports
– Program/data memory and buses
– Component/system interconnects
– Co-processors
– ASIC
– Micro-controller
– ...
TI TMS320C203/LC203 Block Diagram: DSP Core Approach - 1995
[Figure: block diagram; callouts: integrated DSP peripherals, data memory, program memory]
Summary of Architectural Features of DSPs
• Data path configured for DSP
– Fixed-point arithmetic (most common: 95% of all DSPs)
– Fast MAC: multiply-accumulate
• Multiple memory banks and buses
– Harvard architecture
– Multiple data memories
– Dedicated address generation units
• Specialized addressing modes
– Bit-reversed addressing
– Circular buffers
• Specialized instruction set and execution control
– Zero-overhead loops
– Support for MAC
• Specialized peripherals for DSP (SoC)
• THE ULTIMATE IN BENCHMARK DRIVEN ARCHITECTURE DESIGN (or algorithm driven, DSP algorithms in this case)
• Avoiding dynamic processor architectural features that make real-time performance harder to predict (e.g. dynamic scheduling, hardware speculation, branch prediction, cache). Why? To achieve predictable real-time performance
Each item above is either a DSP ISA feature or a DSP architectural feature
DSP Software Development Considerations
• Different from general-purpose software development (requirements):
– Resource-hungry, complex algorithms
– Specialized and/or complex processor architectures
– Severe cost/storage limitations
– Hard real-time constraints
– Optimization is essential
– Increased testing challenges
• Thus, essential tools:
– Assembler, linker
– Instruction set simulator
– HLL code generation: C compiler
– Debugging and profiling tools
• Increasingly important:
– DSP software libraries (hand optimized)
– Real-time operating systems
• Program in DSP assembly? Most common (for performance) but changing: HLL/tools are becoming more mature and gaining popularity
Classification of Current DSP Architectures
• Modern Conventional DSPs (second generation, late 1980s -; usually one MAC unit; lower cost/power):
– Similar to the original DSPs of the early 1980s
– Single instruction/cycle. Example: TI TMS320C54x
– Complex instructions/not compiler friendly
• Enhanced Conventional DSPs (third generation, early 1990s -; usually more than one MAC unit):
– Add parallel execution units: SIMD operation
– Complex, compound instructions
– Example: TI TMS320C55x
– Not compiler friendly
• Multiple-Issue DSPs (fourth generation, late 1990s -; > 1 MAC unit; higher cost/power/performance):
– VLIW. Example: TI TMS320C62xx, TMS320C64xx
  • Simpler (RISC-like, fixed-width) instructions than conventional DSPs; more instructions and instruction bandwidth needed
  • More compiler friendly; higher cost/power
  • SIMD instruction support added to recent DSPs of this class
– Superscalar. Example: LSI Logic ZSP400, ZSP500
• DSPs from all three of these generations are still available today. Why?
A Conventional DSP: TI TMS320C54xx (second generation DSP, ~1989)
• 16-bit fixed-point DSP
• Issues one 16-bit instruction/cycle
• Has one MAC unit
• Modified Harvard memory architecture
• Peripherals typical of conventional DSPs:
– 2-3 synch. serial ports, parallel port
– Bit I/O, timer, DMA
• Inexpensive (100 MHz ~$5 qty 10K)
• Low power (60 mW @ 1.8 V, 100 MHz)
A Current Conventional DSP: TI TMS320C54xx (second generation DSP)
[Figure: block diagram; callout: one MAC unit]
An Enhanced Conventional DSP: TI TMS320C55xx (third generation DSP, ~1994)
• The TMS320C55xx is based on Texas Instruments' earlier TMS320C54xx family (2nd generation DSP), but adds significant enhancements to the architecture and instruction set, including:
– Two instructions/cycle (limited VLIW?)
  • Instructions are scheduled for parallel execution by the assembly programmer or compiler
– Two MAC units
• Complex, compound instructions:
– Assembly source code compatible with C54xx
– Mixed-width instructions: 8 to 48 bits
– 200 MHz @ 1.5 V, ~130 mW, $17 qty 10k
• Poor compiler target
An Enhanced Conventional DSP: TI TMS320C55xx (third generation DSP)
[Figure: block diagram; callout: 2 MAC units]
16-bit Fixed-Point 8-way VLIW DSP: TI TMS320C6201 Revision 2 (1997)
Multiple-Issue DSPs: example fourth generation DSP
[Figure: C6201 block diagram. C6201 CPU megamodule: program fetch, instruction dispatch, instruction decode; data path 1 (A register file; units L1, S1, M1, D1) and data path 2 (B register file; units L2, S2, M2, D2); control registers, control logic, emulation, test, interrupts. Program cache/program memory: 32-bit address, 256-bit data, 512K bits RAM. Data memory: 32-bit address, 8-, 16-, 32-bit data, 512K bits RAM. Other blocks: ext. memory interface, 4-DMA, host port interface, 2 timers, 2 multi-channel buffered serial ports (T1/E1), PwrDwn]
• The TMS320C62xx is the first fixed-point DSP processor from Texas Instruments that is based on a VLIW-like architecture, which allows it to execute up to eight 32-bit RISC-like instructions per clock cycle
• TMS320C67xx: floating-point version
• More compiler friendly
• Higher cost/power
• SIMD instruction support added to recent DSPs of this class (TMS320C64xx)
TI TMS320C62xx Internal Memory Architecture
• Separate internal program and data spaces
• Program
– 16K 32-bit instructions (2K fetch packets)
– 256-bit fetch width
– Configurable as either
  • Direct-mapped cache
  • Memory-mapped program memory
• Data
– 32K x 16
– Single ported, accessible by both CPU data buses
– 4 x 8K 16-bit banks
  • 2 possible simultaneous memory accesses (4 banks)
  • 4-way interleave; banks and interleave minimize access conflicts
TI TMS320C62xx Datapaths
Fourth generation DSP: example 8-way VLIW
[Figure: two data paths. Registers A0 - A15 with functional units L1, S1, M1, D1; registers B0 - B15 with functional units L2, S2, M2, D2; cross paths 1X and 2X between the two sides. Memory ports: DADR1/DADR2 (address), DDATA_I1/DDATA_I2 (load data), DDATA_O1/DDATA_O2 (store data). Legend: cross paths; 40-bit write paths (8 MSBs); 40-bit read paths/store paths]
TI TMS320C62xx Functional Units (statically scheduled)
• L-Unit (L1, L2)
– 40-bit integer ALU, comparisons
– Bit counting, normalization
• S-Unit (S1, S2)
– 32-bit ALU, 40-bit shifter
– Bitfield operations, branching
• M-Unit (M1, M2)
– 16 x 16 -> 32
• D-Unit (D1, D2)
– 32-bit add/subtract
– Address calculations
TI TMS320C62xx Instruction Packing: Advanced 8-way VLIW (statically scheduled VLIW)
• Fetch packet
– CPU fetches 8 instructions/cycle
• Execute packet
– CPU executes 1 to 8 instructions/cycle
– Fetch packets can contain multiple execute packets
• Parallelism determined at compile/assembly time
• Examples (a fetch packet holding instructions A to H; "//" marks parallel execution):
– Example 1: 8 parallel instructions: A // B // C // D // E // F // G // H
– Example 2: 8 serial instructions: A, B, C, D, E, F, G, H
– Example 3: mixed serial/parallel groups:
  • A // B
  • C
  • D
  • E // F // G // H
• Reduces code size, number of program fetches, power consumption
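A hedged Python sketch of how a fetch packet splits into execute packets. The slide does not say how the grouping is encoded; the sketch assumes one "parallel" flag per instruction, in the style of the C6000 p-bit, where a set flag means the next instruction executes in the same cycle as this one. The function name is hypothetical.

```python
def execute_packets(instrs, p_bits):
    """Group a fetch packet into execute packets using per-instruction parallel flags."""
    packets, current = [], []
    for name, p in zip(instrs, p_bits):
        current.append(name)
        if not p:              # flag clear: this instruction ends the execute packet
            packets.append(current)
            current = []
    if current:                # trailing group if the last flag was set
        packets.append(current)
    return packets

fetch_packet = list("ABCDEFGH")
example1 = execute_packets(fetch_packet, [1, 1, 1, 1, 1, 1, 1, 0])  # fully parallel
example2 = execute_packets(fetch_packet, [0] * 8)                   # fully serial
example3 = execute_packets(fetch_packet, [1, 0, 0, 0, 1, 1, 1, 0])  # mixed
```

Because the grouping is carried in the instruction stream, serial code does not need to be padded with NOPs to fill an 8-wide word, which is the code size benefit noted in the last bullet.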
TI TMS320C62xx Pipeline Operation: Pipeline Phases
• Single-cycle throughput
• Operate in lock step
• Fetch
– PG Program Address Generate
– PS Program Address Send
– PW Program Access Ready Wait
– PR Program Fetch Packet Receive
• Decode
– DP Instruction Dispatch
– DC Instruction Decode
• Execute
– E1 - E5 Execute 1 through Execute 5
Pipeline stages: Fetch (PG PS PW PR), Decode (DP DC), Execute (E1 E2 E3 E4 E5)
Execute Packet 1  PG PS PW PR DP DC E1 E2 E3 E4 E5
Execute Packet 2     PG PS PW PR DP DC E1 E2 E3 E4 E5
Execute Packet 3        PG PS PW PR DP DC E1 E2 E3 E4 E5
Execute Packet 4           PG PS PW PR DP DC E1 E2 E3 E4 E5
Execute Packet 5              PG PS PW PR DP DC E1 E2 E3 E4 E5
Execute Packet 6                 PG PS PW PR DP DC E1 E2 E3 E4 E5
Execute Packet 7                    PG PS PW PR DP DC E1 E2 E3 E4 E5
C62x Pipeline Operation: Delay Slots
(Statically scheduled VLIW, for better real-time performance predictability)
• Delay slots: number of extra cycles until the result is:
– Written to the register file
– Available for use by subsequent instructions
• A multi-cycle NOP instruction can fill delay slots while minimizing code size impact
• Delay slots by instruction type:
– Most instructions (E1): no delay
– Integer multiply (E1 E2): 1 delay slot
– Loads (E1 E2 E3 E4 E5): 4 delay slots
– Branches (E1; the branch target then goes through PG PS PW PR DP DC before its E1): 5 delay slots
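The delay-slot counts above translate directly into a scheduling rule, sketched here in Python. The table values come from the slide; the function names and the cycle-numbering convention (an instruction issued in cycle c enters E1 in cycle c) are assumptions made for illustration.

```python
# Delay slots per instruction class, as listed on the slide
DELAY_SLOTS = {"most": 0, "multiply": 1, "load": 4, "branch": 5}

def result_ready_cycle(issue_cycle: int, kind: str) -> int:
    """First cycle in which a dependent instruction may issue and see the result."""
    return issue_cycle + DELAY_SLOTS[kind] + 1

def nop_cycles_needed(kind: str, useful_cycles: int = 0) -> int:
    """NOP cycles left to insert after filling some delay slots with independent work."""
    return max(0, DELAY_SLOTS[kind] - useful_cycles)
```

For example, a load issued in cycle 0 can feed a dependent instruction in cycle 5 at the earliest. Since the hardware is statically scheduled, the compiler or assembly programmer must either place independent work in the four cycles between or pad them with a multi-cycle NOP.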
C6000 Instruction Set Features: Conditional Instruction Execution
• All instructions can be conditional (similar to Intel IA-64)
– A1, A2, B0, B1, B2 can be used as conditions
– Based on zero or non-zero value
– Compare instructions can allow other conditions (<, >, etc.)
• Reduces branching
• Increases parallelism
C6000 Instruction Set Addressing Features
• Load-store architecture
• Two addressing units (D1, D2)
• Orthogonal
– Any register can be used for addressing or indexing
• Signed/unsigned byte, half-word, word, double-word addressable
– Indexes are scaled by type
• Register or 5-bit unsigned constant index
C6000 Instruction Set Addressing Modes/Features
• Indirect addressing modes
– Pre-increment *++R[index]
– Post-increment *R++[index]
– Pre-decrement *--R[index]
– Post-decrement *R--[index]
– Positive offset *+R[index]
– Negative offset *-R[index]
• 15-bit positive/negative constant offset from either B14 or B15
• Circular addressing
– Fast and low cost: power-of-2 sizes and alignment
– Up to 8 different pointers/buffers, up to 2 different buffer sizes
• Bit-reversal addressing
• Dual endian support
FIR Filter On TMS320C54xx vs. TMS320C62xx
[Figure: side-by-side code listings, not recovered. 2nd gen conventional DSP (TMS320C54xx): smaller code size than VLIW. 4th gen VLIW DSP (TMS320C62xx): two filter taps in parallel; larger code size]
TI TMS320C64xx
• Announced in February 2000, the TMS320C64xx is an extension of Texas Instruments' earlier TMS320C62xx architecture.
• The TMS320C64xx has 64 32-bit general-purpose registers, twice as many as the TMS320C62xx.
• The TMS320C64xx instruction set is a superset of that used in the TMS320C62xx, and, among other enhancements, adds significant SIMD/media processing capabilities:
– 8-bit operations for image/video processing.
• Introduced at 600 MHz clock speed (1 GHz now), but:
– 11-stage pipeline with long latencies
– Dynamic caches.
• $100 qty 10k.
• The only current DSP family with compatible fixed and floating-point versions.
(The media processing SIMD support is not in the C62xx.)
[Figure not recovered; annotations:]
• C64xx (also C62xx and C67xx) VLIW DSPs have higher memory use due to simpler (RISC-like, fixed-width) instructions than conventional DSPs: more instructions and instruction bandwidth needed
• Another (VLIW) entry: also VLIW but with variable-length instruction encoding (16-32 bits), so less memory use than C64xx
[Figure not recovered; surviving labels: (XScale), Computational]
Superscalar DSP: LSI Logic ZSP400 (multiple-issue 4th generation DSP example)
• A 4-way superscalar, dynamically scheduled 16-bit fixed-point DSP core
• 16-bit RISC-like instructions
• Separate on-chip caches for instructions and data
• Two MAC units, two ALU/shifter units
– Limited SIMD support
– MACs can be combined for 32-bit operations
• Good or bad for a DSP?
• Possible disadvantage:
– Dynamic behavior complicates DSP software development:
  • Ensuring real-time behavior
  • Optimizing code
[Figure not recovered; label: 2004]
[Figure not recovered; label: 2010]
[Figure not recovered; labels: GPP, (4th generation TI DSP), 2004]
• TI not actively improving their flagship FP DSP (fixed-point more important!)