
Digital Signal Processing Laboratory Work 521485S

Laboratory Exercises with TMS320C6711 DSK

Miguel Bordallo
University of Oulu

Department of Electrical and Information Engineering
Information Processing Laboratory

December 3, 2009

Contents

1. Introduction
   1.1. General Information
   1.2. Development Environment
   1.3. Before You Begin

2. Exercise Assignments
   2.1. Compilation and Execution
   2.2. Signal Generation
   2.3. Fast Fourier Transform
   2.4. Interpolation and Decimation

A. Bit Reversal for FFT

B. The Overlap-Save Algorithm: An Example

C. Debugging on Workstations
   C.1. Functions in the code template
   C.2. Debugging with Electric Fence

D. Debugging Techniques

E. Tips


1. Introduction

1.1. General Information

Welcome to the Digital Signal Processing (DSP) Laboratory course! The course consists of one exercise, and the assistant will be:

1. TMS320C67x exercise (Miguel Bordallo, [email protected], TS317)

This handout contains the exercise instructions for the TMS320C67x exercise. When the exercise is completed, you will receive 3.5 credit points (2 credit units). The completed answers and all code written should be returned on paper to the corresponding assistant. For inquiries, you should also primarily contact the appropriate assistant.

No grades will be given: each returned exercise will be either accepted or rejected. If an exercise is rejected, the assistant will give instructions which will help you to correct the answers or programs. There is no final exam in the course, but when returning the exercise work, be prepared to answer a few oral questions. If you are an exchange student and need an international ECTS grade, you must ask for it when returning the work. If you are planning to graduate soon after completing the course, you should mention it to the assistant. Otherwise the completion mark will be given within three weeks.

At the beginning of the course, there is a voluntary initial lecture, where some general topics will be discussed and you will have a chance to register for the course. The course can be carried out either alone or in groups of at most two students. Each group should carry out the exercises themselves. The assistants will give advice on request. Copying code or answers from other groups is strictly prohibited, and will lead to penalties!

Before starting the work, each group must register by adding the member names to the registration list. After the initial lecture the list will be placed on the course material shelf on the third floor of the Tietotalo building. You will get several parameters from the registration which you must use in some of the exercises. Write them down here for the C67x part:

Parameter name          Symbol    Value

Block length            N1        ______
Filter length           N2        ______
Interpolation factor    F         ______


Write the parameters also on the front page of the answer sheet. These instructions will also be placed on the shelf and they can be borrowed for copying. Only the first three parameters are needed for this exercise.

You will need a key and an access permission for entering the development classroom TS139. If you have the key for other student facilities, you can request the access permission personally from the course assistants. You may create a temporary working directory on the development computers; however, it must be deleted after use! Always make backups, e.g. on floppy disks. The development computer disks may be cleared without advance notice.

The finished exercise work should be returned before May 30, 2010. The returned report must contain answers to all questions, denoted as Qx.y in these instructions, and all written code in appendices. Additionally, you must demonstrate in the laboratory class that your program is working. If you are unable to finish the work in time, you should ask for additional time.

During the C67x exercise, you will learn the following topics:

• Generation of sine signals using IIR filters

• Decimation

• Interpolation

• Implementation of a fast Fourier transform (FFT)

• DSP architecture and DSP-assembly coding

• Frame based filtering

You should write the code for all exercises in the C language, unless specifically required otherwise.

1.2. Development Environment

TMS320C67x is Texas Instruments’ family of floating point digital signal processors. It is downward compatible with the TMS320C62x fixed point family, and has a Very Long Instruction Word (VLIW)-like architecture, developed by Texas Instruments and called VelociTI. Each 256-bit wide Instruction Fetch Packet (IFP) can contain up to 8 simple 32-bit RISC (Reduced Instruction Set Computer) instructions, whose execution is started simultaneously at the same clock cycle. The instructions are pipelined: new instructions can be dispatched before earlier instructions have finished.

There are special instructions for loading and storing data from and to memory; all arithmetic instructions can directly use only built-in registers. The instructions have flexible addressing modes: register direct, register indirect, and base+index modes are supported. Additionally, the index register can be post- or preincremented or decremented. There are two register sets, which both have sixteen 32-bit registers, or in total 32 registers. Two 32-bit registers can be combined into one 64-bit floating point or a 40-bit fixed point register. The DSP has plenty of computing resources: there are two multipliers and six arithmetic-logical units (ALUs).

The TMS320C6711 DSP is attached to the DSP Starter Kit (DSK), where it is clocked at 150 MHz. The DSK contains a power supply, 16 megabytes of SDRAM, 128 kilobytes of flash ROM, programmable LEDs, line in and speaker connectors, a codec containing 16-bit D/A and A/D converters, and finally a parallel port interface to the PC. The code development can be done with Code Composer Studio (CCS), which is an integrated development environment (IDE) containing a code editor, C compiler, assembler, linker, debugger, and other utilities.

The Code Composer Studio contains a firmware kernel called DSP/BIOS, which provides basic runtime services for scheduling tasks running on the DSP. The codec in the DSK uses direct memory access (DMA) to store sampled data into DSK memory and to play back sound from memory buffers. When a DMA transfer has finished collecting samples into a buffer and playing back a buffer, an interrupt is generated and the DMA is programmed to transfer the next buffer. Thus, the application will receive whole frames at once, and does not need to take care of collecting single samples.

The code template (dsp_lab) used in this course includes a working program that records sound from the computer music player and plays it back into connected speakers. The same template can also be run on a workstation/PC (for example on the computers in classroom TS138), in which case the sound is read from and written to a file and the created file is played back after the program execution.

There are two C67x and two C55x platforms available, and they are located in room TS139.

1.3. Before You Begin

You can also take a look at these documents for more information:

• Official course information: http://www.ee.oulu.fi/opiskelu/kurssit/521485S.shtml

• Course homepage: http://www.ee.oulu.fi/research/tklab/courses/521485S/

• TMS320C6000 Technical Brief: http://www-s.ti.com/sc/psheets/spru197d/spru197d.pdf

• Texas Instruments manuals: http://www.ti.com/sc/docs/psheets/man_dsp.htm

• TMS320C6000 Programmer’s Guide: http://www-s.ti.com/sc/psheets/spru198g/spru198g.pdf

• TMS320C6000 Optimizing C Compiler User’s Guide: http://focus.ti.com/general/docs/lit/getliterature.tsp?literatureNumber=spru187l&fileType=pdf

• TMS320C6000 CPU and Instruction Set Reference Guide: http://www-s.ti.com/sc/psheets/spru189f/spru189f.pdf

• Berkeley Design Technology, Inc. processor overviews: http://www.bdti.com/procsum/tic67xx.htm

• DSP Village Home: http://dspvillage.ti.com/

• B. Champagne & F. Labeau (2003) Discrete Time Signal Processing Course Notes: http://www.ece.mcgill.ca/~info412b/CourseNotes.html

• Nasser Kehtarnavaz & Burc Simsek (2000) C6x-Based Digital Signal Processing. Prentice Hall, New Jersey, 164 pp.

• Brian W. Kernighan & Dennis M. Ritchie (1988) The C Programming Language, 2nd edition

• Introduction to C programming: http://www.ee.oulu.fi/research/tklab/courses/521419A/c_intro.html

• Electric Fence: http://www.pf-lug.de/projekte/haya/efence.php

You may use any published information for completing the work. In any case, you must always reference the original source if you directly cite some material (either text or code). You may discuss your problems with other student groups, but you must write all code yourself!


2. Exercise Assignments

2.1. Compilation and Execution

Create a working directory for yourself under C:\students:

1. Double-click the My Computer icon.

2. Type C:\students into the address bar.

3. Select File ⊲ New ⊲ Folder.

4. Give a unique directory name for your project. You should use only letters a–z and numbers 0–9 in filenames.

5. Close the window.

Download and extract the exercise template as follows:

1. Start Mozilla Firefox by double-clicking its icon on the desktop.

2. Enter URL http://www.ee.oulu.fi/research/tklab/courses/521485S/dsp_lab.zip.

3. Select Open with and 7-Zip.zip. Click OK.

4. Choose Edit ⊲ Select All in the 7-Zip File Manager window and click on Extract.

5. Enter the full path to your project directory, for example C:\students\myproject.

6. Close the window and exit Firefox.

Then you are ready to compile and execute the program. First turn on the DSK card and speakers, start Code Composer Studio, and open your new project:

1. Double-click the CCStudio 3.1 icon on the desktop.

2. Choose Project ⊲ Open..., type C:\students\myproject (or whatever your project directory is) into the File name field, and press Enter.

3. Click on dsp_lab.pjt and click on Open.

4. If you get error dialogs saying “CRegistryInfo::GetSABiosRegKey() could not find BIOS registry entries” or “The project you are opening is based upon an older CDB configuration file which must be converted to the new TConf Script (TCF) format,” ignore the messages and click OK.


5. Compile the project by choosing Project ⊲ Rebuild All. You may get some warnings, but there should be zero errors. In this case the project was built successfully.

6. Connect the CCS to the DSK by selecting Debug ⊲ Connect.

7. Load the compiled program into the DSK by choosing File ⊲ Load Program... and go to directory C:\students\myproject\Debug.

8. Click on dsp_lab.out and click on Open.

9. If you get the error dialog window “Data verification failed at address...”, do this:

a) Click Cancel to close the dialog and then click on OK to close the next dialog complaining “A data verification error occurred, file load failed.”

b) Click on Debug ⊲ Disconnect.

c) Turn the power of the DSK card off and on again.

d) Open the transparent lid on top of the DSK and press the round white button marked Reset.

e) Go to step 6.

10. Finally, you may run the program by selecting Debug ⊲ Run.

The program should now be running and replaying all sound coming from the PC to the loudspeakers. You cannot hear anything before playing some music:

1. Double-click the foobar2000 icon on the desktop.

2. Choose Playlist ⊲ Open.

3. Go to directory C:\music and select some ogg file to play.

4. Select Order: Repeat.

The music will have rather poor quality, because the sampling rate is only 8000 Hz. If you are having problems in compiling or running the program, check from Project ⊲ Build Options that Compiler Basic Build Options are as follows:

Target Version:        671x
Generate Debug Info:   Full Symbolic Debug
Opt Speed vs Size:     Speed Most Critical
Opt Level:             None
Program Level Opt:     None

Also check that the cable from the computer sound card is connected to the DSK line input, that the DSK output is connected to the loudspeakers, and that the speakers are turned on.

The DSK collects the samples from the A/D converter until a complete frame is ready. After that a software interrupt is generated, and the frame is delivered to the frame_full() function in file dsp_lab.c. To change the frame size, you must modify the file dsp_lab.c and change the parameter in the init_audio() function call inside the main() function. The function call argument specifies the buffer size in samples. Then you must rebuild the project and load the new program into the DSK.
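The frame-based calling convention can be illustrated with a minimal pass-through sketch. It assumes 16-bit signed samples and the parameter order shown in Figure 1 (buffer_size, src_sample, dst_sample); the actual prototype in dsp_lab.c may differ, and the name frame_passthrough is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for frame_full(): copies one complete input
 * frame to the output frame. The sample type (short) is an assumption;
 * check the template code for the real prototype. */
void frame_passthrough(int buffer_size, const short *src_sample, short *dst_sample)
{
    int i;

    assert(buffer_size >= 0 && src_sample != NULL && dst_sample != NULL);

    /* Every sample of the frame must be handled before returning,
     * because the next call delivers the next frame. */
    for (i = 0; i < buffer_size; i++) {
        dst_sample[i] = src_sample[i];
    }
}
```

Any real processing replaces the body of the loop; the frame length itself is fixed by the init_audio() argument described above.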

Q2.1 How many samples long is the frame by default? What is the frequency of the software interrupt, and how can you calculate it? With a sampling rate of 8000 Hz, what is the highest frequency that we can hear from the speakers?

After you have tested that the program works, stop it by choosing Debug ⊲ Halt. You can also compile, run, and stop the program by clicking on the corresponding icons at the left edge of the CCS window.

You can use Code Composer Studio for developing your code, but because the code is run on the DSP, it is difficult to debug: small bugs can crash the whole DSK and recovery will be difficult. You should instead download the exercise template package to a UNIX workstation and develop and debug the code there as long as possible. Only when your code appears to be working properly, port it from the workstation to the DSP. The computer should have GNU C Compiler and GNU Make installed (the department Sun workstations, e.g. in classroom TS138, are adequately equipped). You can run the program on UNIX as follows:

1. Open a terminal window with a UNIX command line.

2. Create yourself a project directory: mkdir myproject; cd myproject

3. Download the template code: wget http://www.ee.oulu.fi/research/tklab/courses/521485S/dsp_lab.zip.

4. Extract it: unzip dsp_lab.zip.

5. Compile the program by running GNU make: gmake.

6. Run the program by typing gmake test. This command runs the audio program, which reads the test song audio_in_be.raw (or audio_in_le.raw) and writes audio_out.raw. The resulting sound file is also played if the program sox is installed.

You can use any text editor, for example nedit, for editing the files. See Appendix C for more information on using the functions in the dsp_lab template.


2.2. Signal Generation

A unit impulse

$$x(n) = \begin{cases} 1, & \text{when } n = 0 \\ 0, & \text{when } n \neq 0 \end{cases} \tag{1}$$

contains an equal amount of energy at all frequencies. If we can design a filter which allows exactly one frequency to pass and stops all other frequencies, and filter the unit impulse with this filter, we get an output signal which is a pure sine wave at the pass frequency. This is possible to achieve using infinite impulse response (IIR) filters.

Realizable IIR filters are characterized by the following recursive equation:

$$y(n) = \sum_{k=0}^{N} b_k\, x(n-k) - \sum_{k=1}^{M} a_k\, y(n-k) \tag{2}$$

The transfer function of the IIR filter is

$$H(z) = \frac{\sum_{k=0}^{N} b_k z^{-k}}{1 + \sum_{k=1}^{M} a_k z^{-k}} \tag{3}$$

An IIR filter which passes only one frequency has the following coefficients:

$$b_1 = g \sin(2\pi f / F_s) \tag{4}$$

$$a_0 = 1 \tag{5}$$

$$a_1 = -2 \cos(2\pi f / F_s) \tag{6}$$

$$a_2 = 1 \tag{7}$$

where $g$ = filter gain, $f$ = filter pass frequency, and $F_s$ = sampling frequency. All other $b_n$ and $a_n$ which are not shown are zero.

Q2.2 Implement an IIR filter that produces a sine wave. Add the generated signal to the sound stream. Generate the sine wave at 300 Hz. Code the function in a way that the frequency and the sampling rate of the generated signal can be easily changed.

Remember from the previous section that signal processing is performed using frames: the frame_full() function is called each time a new frame fills, and it must process all samples in a frame. This is shown in Figure 1. The easiest way is to modify the sine_gen() function in the template code to calculate the sine samples with the IIR filter instead of calling the sinf() function. The sine_gen() function should return the next sample of the sine wave each time it is called. Write a loop that goes over all samples in a received frame, adds sine samples to them, and writes the result into the output frame.
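The per-frame loop described above might look like the following sketch. It assumes 16-bit signed samples; sine_gen_stub() is a hypothetical placeholder that returns a constant so that the example is self-contained, and clip16() is an added safeguard against overflow that the template may or may not need.

```c
#include <assert.h>

/* Hypothetical placeholder for the template's sine_gen(): returns a
 * constant instead of a real sine sample. */
static float sine_gen_stub(void)
{
    return 1000.0f;
}

/* Limits a value to the 16-bit sample range before conversion. */
static short clip16(float v)
{
    if (v > 32767.0f) {
        return 32767;
    }
    if (v < -32768.0f) {
        return -32768;
    }
    return (short)v;
}

/* Adds one generated sample to every input sample of the frame and
 * writes the sum to the output frame. */
void mix_frame(int buffer_size, const short *src_sample, short *dst_sample)
{
    int i;

    assert(buffer_size >= 0);
    for (i = 0; i < buffer_size; i++) {
        dst_sample[i] = clip16((float)src_sample[i] + sine_gen_stub());
    }
}
```

Calling the generator exactly once per sample keeps the sine phase continuous across frame boundaries.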

You can edit the dsp_lab.c file with the CCS by double-clicking Projects ⊲ dsp_lab.pjt ⊲ dsp_lab.c. You should also select DSP/BIOS ⊲ Message Log, which opens a panel displaying any messages from the running program (see footnote 1). If you wish to debug the program using CCS, choose View ⊲ Watch Window, where you can add variables to see their values. Then use the single step and step over commands to execute the program line by line.

As a result, you should hear an additional sound in the music coming from the loudspeakers.

Q2.2 Modify the function now so that several frequencies can be generated. The function should now alternate among three frequencies over time. Make the function generate a sine at 300 Hz for about one second, then a sine at 1000 Hz for another second, then a sine at 5000 Hz, and then go back to 300 Hz. As a result, you should hear beeps of different pitches alternating in time in the music coming from the loudspeakers.

[Figure 1: Implementation of a Sine Generator. frame_full(buffer_size, src_sample, dst_sample) adds the output of sine_gen() to each of the buffer_size samples of the input frame (src_sample) and writes the sum to the output frame (dst_sample).]

Footnote 1: Currently there is a problem with this. If you get an error dialog saying “Unknown error”, the Message Log is not available. We are trying to solve the problem.

2.3. Fast Fourier Transform

Filtering is often performed using finite impulse response (FIR) filters. That corresponds to the convolution of the signal and the filter coefficients. However, a long filter requires much computation. How could we perform the filtering with fewer operations?

According to the convolution theorem, convolution of two signals (the actual signal and the filter) corresponds to multiplication of the Fourier transformed signals. Using the Fourier transform, the computation requirements can be lowered: the filter coefficients are transformed, the signal to be filtered is transformed, the transformed sequences are multiplied together element-by-element, and finally the result is inverse transformed. Mathematically, the convolution can be represented as

$$y(t) = x(t) * c(t) \tag{8}$$

where $x(t)$ is the original signal as a function of time, $y(t)$ is the filtered signal, and $c(t)$ are the filter coefficients. The operation $*$ denotes convolution. By applying the Fourier transform and the convolution theorem, we get

$$Y(f) = X(f)\, C(f) \tag{9}$$

where $f$ is frequency and $Y(f)$, $X(f)$, and $C(f)$ are the Fourier transforms of $y(t)$, $x(t)$, and $c(t)$, respectively.

The Fourier transform is defined by

$$X(f) = \sum_{t=0}^{N-1} x(t)\, W^{tf} \tag{10}$$

where $W = e^{-2\pi i/N}$, $i^2 = -1$ and $N$ is the transform length. However, by applying Formula (10) directly, the computation does not diminish, because the actual transform will need more operations than the original time-domain convolution.

There are many fast Fourier transform (FFT) algorithms. The decimation-in-frequency (DIF) radix-2 algorithm flowchart is shown in Figure 2. When the transform is made with an FFT, the total amount of computation, including transforms and multiplications, can be less than with a direct time-domain convolution.
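To put rough numbers on this claim, the sketch below counts complex multiplications for a direct evaluation of Formula (10) (N outputs with N terms each) and butterflies for a radix-2 FFT (N/2 per stage, log2 N stages). The function names are hypothetical, and the counts ignore additions and the elementwise product between the transforms.

```c
#include <assert.h>

/* Number of radix-2 stages, i.e. log2(n) for n a power of two. */
unsigned log2_pow2(unsigned n)
{
    unsigned stages = 0;

    assert(n != 0 && (n & (n - 1)) == 0);
    while (n > 1) {
        n >>= 1;   /* binary shift instead of a division */
        stages++;
    }
    return stages;
}

/* Complex multiplications when Formula (10) is evaluated directly. */
unsigned long direct_dft_mults(unsigned n)
{
    return (unsigned long)n * n;
}

/* Butterflies (one twiddle multiplication each) in a radix-2 FFT. */
unsigned long radix2_fft_butterflies(unsigned n)
{
    return (unsigned long)(n / 2) * log2_pow2(n);
}
```

For a 1024-point transform this gives 1048576 multiplications for the direct sum against 5120 butterflies for the FFT.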

Study the FFT flowchart and a single butterfly (Figure 3) of the radix-2 algorithm: it will help in implementing the FFT. A typical implementation of the FFT contains three nested loops: examine the function fft() which is included with the template code. After the FFT has been performed, the data must be permuted into the correct order, as explained in Appendix A.

The outermost loop goes over all the stages, shown horizontally in Figure 2. In this 32-point example, there are log2 32 = 5 stages. The middle loop counts the butterfly groups. In the figure, there is one butterfly group in the first stage and 16 butterfly groups in the last stage. After each stage, the number of butterfly groups is doubled. The innermost loop goes over the individual butterflies in a single butterfly group. After each stage, the number of butterflies in a single group is halved. Initially there are 16 butterflies per group, but in the last stage, there is just one butterfly in each butterfly group. Note that the figure represents only a 32-point transform: you need to use a longer transform, so the numbers will be different.
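The loop structure described above can be sketched without any arithmetic. The function below only walks the three loops and counts butterflies; it is not the template's fft(), the name count_butterflies is hypothetical, and the index pattern shown in the comments is the usual decimation-in-frequency layout, stated here as an assumption about Figure 2.

```c
#include <assert.h>

/* Walks the stage, group and butterfly loops of a radix-2 transform of
 * length n and returns the total number of butterflies. The number of
 * stages is stored in *stages_out. */
unsigned count_butterflies(unsigned n, unsigned *stages_out)
{
    unsigned stages = 0, total = 0;
    unsigned groups, per_group, g, b;

    assert(n >= 2 && (n & (n - 1)) == 0 && stages_out != 0);

    /* Outermost loop: one pass per stage. The number of groups doubles
     * and the number of butterflies per group halves after each stage. */
    for (groups = 1, per_group = n >> 1; per_group >= 1;
         groups <<= 1, per_group >>= 1) {
        /* Middle loop: butterfly groups within the stage. */
        for (g = 0; g < groups; g++) {
            /* Innermost loop: butterflies within one group. */
            for (b = 0; b < per_group; b++) {
                unsigned upper = g * 2 * per_group + b; /* input a */
                unsigned lower = upper + per_group;     /* input b */

                assert(lower < n);
                total++;
            }
        }
        stages++;
    }
    *stages_out = stages;
    return total;
}
```

For n = 32 the loops visit 5 stages and 80 butterflies, matching 16 butterflies per stage in the description above.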


The inverse Fourier transform is defined by

$$y(t) = \frac{1}{N} \sum_{f=0}^{N-1} Y(f)\, W^{-tf}. \tag{11}$$

This is almost the same as the transform, but the exponent sign is negated and the result is multiplied by $\frac{1}{N}$. The FFT can also be used for computing the inverse transform by replacing $W$ by $W^{-1}$ in Figure 3 and multiplying the result by $\frac{1}{N}$.

Q2.3 Implement the fast Fourier transform and the inverse transform using the radix-2 decimation in frequency or the radix-2 decimation in time algorithm. Use transform length N1, which you obtained in the registration. Remember to use binary shifts to calculate powers of 2. Change the compiler optimization levels. How many clock cycles do the transforms need at each level? How many milliseconds? How many transforms can you perform between two successive audio frames?

To verify that the FFT routine is working correctly, transform some known sequence (e.g. 1, 2, 3, ...) and compare the transform result to the correct result (you can compute the correct result with Matlab).

You must implement the FFT by completing the code template. Therefore, you cannot directly use FFT code from books or the Internet, but you can look at them to help your implementation work.

2.4. Interpolation and Decimation

Interpolation means raising the sampling frequency by an integer factor; in decimation the sampling frequency is lowered by an integer factor. In interpolation, raising the sampling frequency by a factor of F is achieved by adding F − 1 zero samples after each original sample. In decimation by factor F, one sample out of a group of F samples is saved and all other samples are discarded.
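The two rate changes, without the filters discussed next, can be sketched as follows. The function names are hypothetical and float samples are an assumption; the caller is expected to provide output buffers of sufficient size.

```c
#include <assert.h>

/* Raises the sampling rate by an integer factor: each input sample is
 * followed by factor - 1 zero samples. out must hold n_in * factor
 * samples. Returns the number of samples written. */
int upsample_zero_stuff(const float *in, int n_in, int factor, float *out)
{
    int i, k;

    assert(n_in >= 0 && factor >= 1);
    for (i = 0; i < n_in; i++) {
        out[i * factor] = in[i];
        for (k = 1; k < factor; k++) {
            out[i * factor + k] = 0.0f;
        }
    }
    return n_in * factor;
}

/* Lowers the sampling rate by an integer factor: the first sample of
 * each group of factor samples is kept, the rest are discarded.
 * Returns the number of samples written. */
int downsample_keep_first(const float *in, int n_in, int factor, float *out)
{
    int i, n_out;

    assert(n_in >= 0 && factor >= 1);
    n_out = n_in / factor;
    for (i = 0; i < n_out; i++) {
        out[i] = in[i * factor];
    }
    return n_out;
}
```

Choosing a frame length that is a multiple of the factor keeps the sample groups aligned with frame boundaries.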

Without proper filtering, interpolation and decimation will create aliasing or undesired frequencies, which will degrade the signal quality. This is shown in Figure 4. Therefore, antialiasing filters should be used to prevent this. After interpolation, the signal should be filtered with a low-pass filter which removes undesired high frequencies. When performing decimation, the filtering is done before dropping out samples and it should discard those high frequencies that cannot be represented in the signal after the decimation.

The convolution given in Equation (9) is cyclic, but we want to perform linear convolution of a long signal and a short filter. To achieve this, one can use the overlap-save algorithm. The principle is shown in Figure 5: the original signal is divided into blocks of N1 samples. The filter with length N2 is zero-padded to N1 coefficients. Both are transformed, multiplied elementwise, and the result inverse transformed.

[Figure 2: 32-point decimation in time radix-2 fast Fourier transform algorithm flowchart. The flowchart labels the stages from Stage 0 to Stage 4 and marks one group of butterflies.]

The resulting block contains N2 − 1 samples of aliased signal due to cyclic convolution, but the last N1 − N2 + 1 samples correspond to linear convolution. For a long signal, there should be one output sample for each input sample, and therefore the blocks in the input signal must overlap by N2 − 1 samples.
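The block bookkeeping can be tried out with a small sketch in which a time-domain cyclic convolution stands in for the FFT, elementwise multiplication and IFFT. The values N1 = 8 and N2 = 3 and all names are illustrative assumptions; samples before the start of the signal are taken as zero.

```c
#include <assert.h>

#define OS_N1   8                      /* block length (illustrative) */
#define OS_N2   3                      /* filter length (illustrative) */
#define OS_STEP (OS_N1 - OS_N2 + 1)    /* valid output samples per block */

/* Cyclic convolution of one block with the zero-padded filter. This is
 * a stand-in for FFT, elementwise multiplication and IFFT. */
static void cyclic_conv(const float *block, const float *h, float *out)
{
    int n, k;

    for (n = 0; n < OS_N1; n++) {
        float acc = 0.0f;

        for (k = 0; k < OS_N2; k++) {
            acc += h[k] * block[(n - k + OS_N1) % OS_N1];
        }
        out[n] = acc;
    }
}

/* Filters x[0..len-1] with h[0..OS_N2-1] and writes len samples to y. */
void overlap_save(const float *x, int len, const float *h, float *y)
{
    float block[OS_N1], result[OS_N1];
    int start, i;

    assert(len >= 0);
    for (start = 0; start < len; start += OS_STEP) {
        /* The first N2 - 1 samples of a block repeat the end of the
         * previous block, so consecutive blocks overlap by N2 - 1. */
        for (i = 0; i < OS_N1; i++) {
            int idx = start - (OS_N2 - 1) + i;

            block[i] = (idx >= 0 && idx < len) ? x[idx] : 0.0f;
        }
        cyclic_conv(block, h, result);
        /* Discard the N2 - 1 aliased samples, keep the remaining ones. */
        for (i = 0; i < OS_STEP && start + i < len; i++) {
            y[start + i] = result[OS_N2 - 1 + i];
        }
    }
}
```

Each block advances the input by N1 − N2 + 1 samples, so the output contains exactly one sample per input sample.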

[Figure 3: Butterfly of the radix-2 FFT. Inputs a and b; outputs a + b and (a − b)W^n.]

[Figure 4: Interpolation and decimation without filtering. Panels: Interpolation, Decimation. Axes: Energy versus Frequency, with the Nyquist frequency Fs/2 and Fs marked. Labels: original signal, undesired frequencies, aliasing.]

Q2.4A Interpolate the sound signal coming from the sound card from 8000 Hz to F × 8 kHz (you obtained the interpolation factor F in the course registration). Modify the sine signal generation routine to generate the sine signal at a sampling rate of F × 8 kHz, and add the generated signal to the sound signal at the higher sampling rate. Then decimate the sound back to 8000 Hz, so that it can be transmitted to the loudspeakers.

Implement the filters necessary for interpolation and decimation using the fast Fourier transform with N1-point transform length and the overlap-save method. Use N2 coefficients in the filters. For what sampling frequency are the filters designed, and what are the stopband and passband frequencies? Attach the code/method that you used to get the coefficients (e.g. Matlab code).

What frame length did you decide to use, and why? How many FFT and IFFT operations are needed for each frame? How many clock cycles are needed per frame?

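[Figure 5: Overlap-save principle. A block of N1 samples from the signal and the filter with N2 coefficients are each transformed with an FFT, multiplied elementwise, and inverse transformed with an IFFT. The first part of the resulting block is aliased; the remaining N1 − N2 + 1 samples form the filtered signal.]

[Figure 6: Signal flowpath. ADC (8 kHz), Interpolate (F × 8 kHz), addition of the sine_gen() output (F × 8 kHz), Decimate, DAC (8 kHz).]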

The signal flowpath is presented in Figure 6, in which the signal comes from the A/D converter (ADC) and the processed signal goes to the loudspeakers through the D/A converter (DAC). Remember again that the operations are frame-based and must be performed on complete frames.

Compute the filter coefficients with the optimal equiripple (Remez) method, or any other equivalent one. Use a suitable transition bandwidth, for example a few kilohertz. Appendix B gives an example which you can use as a guideline for implementing the overlap-save algorithm. Before writing any code, you should draw a figure of your plan on paper (similar to Fig. 7). If you are unsure, it is recommended that you show the figure to the assistant and ask for acceptance before beginning the implementation work.

You must verify that your program is fast enough to process the incoming audio frames in real time on the DSP. If the FFT function is too slow, you must optimize it. For example, the coefficients (powers of W) required in the FFT algorithm can be computed at the beginning of the program and stored in an array. The same can be done for the permuted indices. This will speed up the execution. However, before doing difficult optimizations, increase the compiler optimization level.

Q2.4B In order to test the code, try first with all the filter coefficients equal to 1.0 + 0.0i in the frequency domain (pass-all filter). What happens when the generated signal frequency is 5000 Hz instead of 300 Hz or 1000 Hz? What do you hear from the loudspeakers? Why do you hear it? (Remember that the sampling rate is now 8 kHz.) Then try with the coefficients previously calculated in Matlab. What happens now?

This exercise is a synthetic example, and in practice some things would be implemented differently.

Q2.4C In the previous system, there were two input signals at different sampling rates combined into one signal. The necessary filters had 65 coefficients or less. How would you modify the system to combine two arbitrary signals at the given sampling rates more efficiently?

• How could the filter operation be made more efficient? How many clock cycles would time-domain filtering approximately take per frame? Assume that a MAC (multiply-accumulate) operation can be done in one clock cycle.

• Would it be possible to remove the filter after interpolation or before decimation? If so, why?

2.5. Interfacing Assembly Routines

Many tasks can be done by using only the C language, without any knowledge of assembly. Modern C compilers can efficiently optimize a wide variety of code. However, knowledge of processor hardware and assembly will help in writing C code which the compiler can easily optimize. There are also some operations which cannot be represented efficiently in C, and writing assembly directly is necessary to achieve sufficient performance.

Q2.4 Write an assembly function to multiply two complex numbers (represented as a pair of single precision float-type values) and return the result. First code a version with no pipelined input, where the instructions wait (add NOPs) for the previous one to complete. The function must be callable from the C language. Explain how the function works, including the effect of each instruction. Test the function and compute the times.

The DSP board includes a lot of resources for computation. Several arithmetic units and multipliers can be used at the same time, pipelining the instructions and/or dispatching them at the same time.

Q2.4 Now specify different multiplication and arithmetic units in your assembly function and parallelize some instructions using the pipeline operator. Try to dispatch at the same time the operations that use different units. Be careful to wait for the needed results before dispatching operations that need a result from a previous operation. Does the assembly routine improve the program speed compared to a version written in C and compiled at a high optimization level? What are the main reasons for the speed difference? Show in a table the computation times of the C multiplication with several optimization levels and the times of the two versions of the assembly function.

The DSP programming tools include an assembly optimizer which can rearrange instructions in linear assembly source code and automatically fill instruction delay slots. Do not use linear assembly in this exercise. You should insert your assembly code into file multiply.s62 in function multiply_asm, which is already included in the project. Note that you cannot do this exercise with a PC or workstation, as they have a completely different assembly language compared to DSPs.

The guidebook “TMS320C6000 CPU and Instruction Set Reference Guide” describes the necessary assembly instructions in Chapters 3 and 4. You must also take the instruction delay slots into account. Chapter 8 in “TMS320C6000 Optimizing Compiler User’s Guide” describes the interface between C and assembly. Pay special attention to the code examples.

To compare the speed between the assembly and the corresponding C code, you can use the complex multiplication function multiply_c() which is included in the given code template. Write a loop to multiply complex numbers a few thousand times and measure the loop execution time. The easiest method for measuring clock cycles is to use the debugger breakpoints and to select Profile ⊲ Clock ⊲ Enable and Profile ⊲ Clock ⊲ View from the Code Composer menu. You can use Mixed Source/Assembly mode to view the assembly code generated by the C compiler and compare it to the hand-written version (click the right mouse button on top of the code window, after the program has been loaded to the DSK). Also use the register window to watch the register values by selecting View ⊲ CPU Registers ⊲ Core Registers from the CCS menu. Click with the right mouse button on the register display and select View Float from the menu.
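For reference, a complex multiplication and a measurement loop could be sketched in C as below. The type cplx_t and the functions cmul_sketch() and cmul_loop() are hypothetical; the prototype of the template's multiply_c() is not reproduced here and may differ.

```c
#include <assert.h>

/* Hypothetical complex type: a pair of single precision values. */
typedef struct {
    float re;
    float im;
} cplx_t;

/* (a.re + i*a.im) * (b.re + i*b.im): four multiplications,
 * one subtraction and one addition. */
cplx_t cmul_sketch(cplx_t a, cplx_t b)
{
    cplx_t r;

    r.re = a.re * b.re - a.im * b.im;
    r.im = a.re * b.im + a.im * b.re;
    return r;
}

/* Repeats the multiplication count times. The running product is
 * returned so that an optimizing compiler cannot discard the loop. */
cplx_t cmul_loop(cplx_t a, cplx_t b, int count)
{
    cplx_t acc = a;
    int i;

    assert(count >= 0);
    for (i = 0; i < count; i++) {
        acc = cmul_sketch(acc, b);
    }
    return acc;
}
```

Breakpoints placed before and after the call to the loop function delimit the region whose clock cycles are read from the profiler.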

Q2.4 Integrate the assembly multiplication into your previous code as a substitute for the multiply_c function. Rerun the whole filtering code with the new assembly function.


A. Bit Reversal for FFT

The input data sequence for the fast Fourier transform is given in an array (or in two arrays, one for the real and one for the imaginary part). The contents of this array must be permuted after the FFT to obtain the correct result. The index into the array requires log2(N1) bits, where N1 is the FFT and array length. Each value in the array is swapped with another value whose location is obtained by reversing the bits of the first value's index.

For example, the index into a 32-point transform array can be represented in five bits. The index 5 is 00101 in binary. When bit-reversed, it becomes 10100, which is 20 in decimal. Thus, the values at indices 5 and 20 must be swapped.

With longer transforms, there are more bits in the indices. Also, care must be taken that each pair in the input array is swapped only once; if a pair were swapped twice, the values would be back in their original order after the permutation.

The array can be permuted with the following pseudocode:

    loop for all locations in the data array
        let index be the current location
        let reversed_index be the corresponding location
            obtained with bit-reversal
        if the values in locations index and reversed_index
            have not yet been swapped, swap them now
    end of loop

It is simple to check whether a value pair has already been swapped by comparing index and reversed_index (for example, swap only when index is smaller than reversed_index). Alternatively, two arrays can be used and the values copied from the first array into the second. In this case no check needs to be made.

The bit reversal of an index can be expressed in pseudocode as follows:

    let index be the index which is to be bit-reversed
    set reversed_index to zero
    loop for all bits to reverse
        shift reversed_index left by one bit position
        if the least significant bit in index is one, set the
            least significant bit in reversed_index to one
        shift index right by one bit position
    end of loop
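The two pieces of pseudocode above can be sketched in C as follows. This is only one possible form; the function names bit_reverse and bit_reverse_permute are made up for this illustration and do not come from the code template.

```c
#include <assert.h>

/* Reverse the lowest `bits` bits of `index`.
 * reversed is shifted left before each new bit is inserted,
 * so no extra shift remains after the last iteration. */
static unsigned bit_reverse(unsigned index, int bits)
{
    unsigned reversed = 0;
    int i;
    assert(bits >= 0 && bits <= 16);
    for (i = 0; i < bits; i++) {
        reversed = (reversed << 1) | (index & 1u);
        index >>= 1;
    }
    return reversed;
}

/* Permute re[] and im[] (length n = 2^bits) in place.
 * The test i < j makes sure each pair is swapped only once. */
static void bit_reverse_permute(float *re, float *im, int n, int bits)
{
    int i;
    for (i = 0; i < n; i++) {
        int j = (int)bit_reverse((unsigned)i, bits);
        if (i < j) {
            float t = re[i]; re[i] = re[j]; re[j] = t;
            t = im[i]; im[i] = im[j]; im[j] = t;
        }
    }
}
```

For the 32-point example above, bit_reverse(5, 5) gives 20, so the values at indices 5 and 20 are exchanged exactly once, when the loop reaches index 5.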


B. The Overlap-Save Algorithm: An Example

The DSP/BIOS gathers a sequence of incoming samples into frames. When a new frame is filled, the frame_full() function is called and the new frame is given to it as a parameter. The function must then process this frame and generate the next frame for output. For question Q2.4A you have to interpolate, filter, add a sine, filter again, and decimate the signal, where the filters are implemented with the overlap-save method.

If the parameters were N1 = 32 (block and FFT length), N2 = 17 (filter length), and F = 2 (decimation factor), one possible implementation is shown in Figure 7. In this example both the input and output frame lengths are set to 16 samples.

[Block diagram: src_sample[16] (original signal, new frame) -> interpolation by 2 -> interpolated[32] -> filter blocks with FFT (using buffer1[16] and fft_buffer[32]) -> filtered1[32] (interpolated and filtered) -> add sine -> with_sine[32] -> filter blocks with FFT (using buffer2[16] and fft_buffer[32]) -> filtered2[32] (filtered for decimation) -> decimation by 2 -> dst_sample[16] (final decimated signal, frame to output)]

Figure 7: Example of the overlap-save algorithm.

Let us assume that the incoming signal is in the array src_sample[16] (where the number in brackets denotes the array length in samples) when a new frame arrives and the audio() function is called. This array is interpolated by adding a zero after each sample and storing the result into another array, interpolated[32]. This array is divided into blocks, which are filtered with the FFT. The blocks overlap by 16 samples. Since the incoming frames in the src_sample[16] array never overlap, some of the blocks to be transformed must be obtained from two consecutive frames. Thus, the array buffer1[16] is necessary for storing the last 16 interpolated samples from the previous frame. The first block to be transformed with the FFT consists of the last 16 samples of the interpolated array from the previous frame and the first 16 samples from the newest frame. Therefore, the last 16 samples from the array must be saved into the array buffer1, from which they are read during the next frame.

The samples to be transformed are copied into the buffer fft_buffer[32]. This array is transformed and its elements are then multiplied with the coefficients of the transformed filter. After the inverse transform, the last 16 samples of the buffer hold the filtered samples. During each frame, two blocks are filtered, requiring two FFTs and two inverse FFTs (the filter needs to be transformed only once, in the beginning of the program). In total, 32 filtered samples are obtained, which are copied into the array filtered1[32]. Only a single array needs to be allocated for the FFT, because the same array can be reused for all blocks.
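The block handling described above can be sketched in C as follows, with N1 = 32 and N2 = 17 as in the example. To keep the sketch self-contained, the FFT, coefficient multiplication, and inverse FFT are replaced by a direct circular convolution, which gives the same result but is much slower. The names circular_filter, overlap_save, and history are illustrative assumptions and not part of the code template; history plays the role of buffer1.

```c
#include <assert.h>
#include <string.h>

#define BLOCK_LEN  32                  /* N1: block and FFT length      */
#define FILTER_LEN 17                  /* N2: filter length             */
#define OVERLAP    (FILTER_LEN - 1)    /* samples shared between blocks */
#define HOP        (BLOCK_LEN - OVERLAP) /* new samples per block       */

/* Circular convolution of one block with the filter.
 * Stand-in for FFT -> multiply -> inverse FFT. */
static void circular_filter(const float *block, const float *h, float *out)
{
    int n, k;
    for (n = 0; n < BLOCK_LEN; n++) {
        float acc = 0.0f;
        for (k = 0; k < FILTER_LEN; k++)
            acc += h[k] * block[(n - k + BLOCK_LEN) % BLOCK_LEN];
        out[n] = acc;
    }
}

/* Filter `len` input samples (a multiple of HOP). `history` holds the
 * last OVERLAP input samples of the previous call; zero it at startup. */
static void overlap_save(const float *in, int len, const float *h,
                         float *history, float *out)
{
    float block[BLOCK_LEN], result[BLOCK_LEN];
    int pos;
    assert(len % HOP == 0);
    for (pos = 0; pos < len; pos += HOP) {
        /* Old samples first, then the new ones. */
        memcpy(block, history, OVERLAP * sizeof(float));
        memcpy(block + OVERLAP, in + pos, HOP * sizeof(float));
        circular_filter(block, h, result);
        /* The first OVERLAP outputs are corrupted by wrap-around;
         * only the last HOP outputs are kept. */
        memcpy(out + pos, result + OVERLAP, HOP * sizeof(float));
        /* Save the tail of this block for the next one. */
        memcpy(history, block + HOP, OVERLAP * sizeof(float));
    }
}
```

Calling overlap_save() once per frame on the 32 interpolated samples processes two blocks, which matches the two FFTs and two inverse FFTs mentioned above.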

Next, the sine signal is added to the samples in the filtered1 array and the result is stored into the array with_sine[32] (see also Figure 1). Since the sine signal might cause aliasing after decimation, the samples in the array have to be filtered. This is performed in the same way as after the interpolation. In this case the samples from the previous frame are stored into an array called buffer2[16]. The same fft_buffer can be used again. Two more FFTs and inverse FFTs are required, making 8 transforms in total per frame.

The filtered samples, after each inverse FFT has been computed, are collected into the filtered2[32] array, from which they are decimated into the array dst_sample[16] by copying only every other sample. dst_sample[16] contains the final resulting samples, which are transmitted from the audio() function.
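The interpolation and decimation steps themselves are simple copy loops. A minimal sketch, assuming a general integer factor (F = 2 in the example); the function names insert_zeros and keep_every are hypothetical:

```c
/* Interpolation step: copy each sample and add factor-1 zeros after it.
 * dst must hold src_len * factor samples. */
static void insert_zeros(const float *src, int src_len, int factor,
                         float *dst)
{
    int i, j;
    for (i = 0; i < src_len; i++) {
        dst[i * factor] = src[i];
        for (j = 1; j < factor; j++)
            dst[i * factor + j] = 0.0f;
    }
}

/* Decimation step: keep only every factor-th sample.
 * dst must hold src_len / factor samples. */
static void keep_every(const float *src, int src_len, int factor,
                       float *dst)
{
    int i;
    for (i = 0; i < src_len / factor; i++)
        dst[i] = src[i * factor];
}
```

With factor 2, insert_zeros() turns src_sample[16] into interpolated[32], and keep_every() turns filtered2[32] into dst_sample[16].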

When designing your algorithm, it is recommended that you first draw a sketch on paper, similar to Figure 7. Try to find a frame length that is reasonably small and which makes the overall algorithm simple. You should initialize all necessary buffers with zeros in the beginning of the program (in the example, buffer1 and buffer2 would need to be initialized). There are several possible solutions, and the example given in this appendix is only for assistance. You are free to implement your algorithm in a different way, as long as it still uses the overlap-save method.


C. Debugging on Workstations

Developing and testing code on workstation computers instead of digital signal processors is usually easier, because workstations have better development tools and memory protection to aid debugging. Also, the number of DSK cards in the laboratory is limited, and it might be easier to find an available UNIX workstation without a DSK.

Developing DSP/BIOS code on UNIX workstations is made easier with the dsp_lab library (http://www.ee.oulu.fi/research/tklab/courses/521485S/dsp_lab.zip). It reads an audio signal from a file on the hard disk and writes the processed sound into another file. The library emulates some basic DSP/BIOS functions and datatypes which are necessary for completing the exercise work. The package also contains a test song and sine sounds at various frequencies. However, the final testing of the program has to be done on the DSP to ensure that the real-time and memory requirements are fulfilled and that the assembly functions are working.

C.1. Functions in the code template

CFG_TESTRUNS When running the program on a workstation/PC, this macro defines how many frames are generated. A larger value processes a longer piece of music, but the execution time increases. This macro can be configured by modifying the file device_fileio.c.

COPY_BUFFER(src, src_start, dst, dst_start, len) This macro copies len values from the array src to the array dst. The starting locations in the arrays are given by src_start and dst_start. If src or dst are arrays (and not just pointers to arrays), bounds checking is performed.

ZERO_BUFFER(dst, dst_start, len) This macro sets len values to zero in the dst array. If dst is an array (and not just a pointer to an array), bounds checking is performed.

void init_audio(int bufsize) Initializes the A/D and D/A conversions and begins collecting samples into frames. After this function has been called, frame_full() will be called when the first frame is full. bufsize denotes the number of samples per frame.

void halt_on_error(char *msg, ...) Displays an error message and halts the program. When using the DSP, you should enable Message Log from the DSP/BIOS menu in the CCS to see any error messages.

void switch_led(int state) Turns the LED on (if state is nonzero) or off (if state is zero) on the DSP card. Can be used for debugging. When running on a workstation/PC, turning the LED on will print the string <*> and turning it off will print <.>.


unsigned char *get_bitstring(int x, int b) Converts x into a bit string and returns the character string. b is the number of bits in the string.

void cmul_c(float ar, float ai, float br, float bi, float *cr, float *ci)

Multiplies two complex numbers and returns the result. ar and ai are the real and imaginary parts of the first number, respectively, and br and bi are the parts of the second number. The result is written to the variables pointed to by cr and ci.

void fft_check_init(int length) Initializes the FFT bug-checking function. Should be called in the beginning of the FFT function. length is the FFT length.

void fft_check_butterfly(int stage, int group, int bfly, int ai, int bi, int n)

Checks whether the calculated values are correct for the counters, array indices, and the W power. This function must be called each time inside the innermost loop, which calculates the butterfly. If some of the values are incorrect, an error message is displayed and the program is halted. Use this function only for debugging because it is very slow.

void slow_fft(float *src_re, float *src_im, float *dst_re, float *dst_im, int n)

Performs an n-point Fourier transform. The real part of the input data is read from the array pointed to by src_re, and the imaginary part is read from src_im. The transformed sequence is stored into the arrays pointed to by the dst_re and dst_im pointers. This function cannot be used with the DSP (only on a workstation). It is very slow and intended only for debugging.

void slow_ifft(float *src_re, float *src_im, float *dst_re, float *dst_im, int n)

Performs an n-point inverse Fourier transform. Similar to the function slow_fft(), see above.

C.2. Debugging with Electric Fence

Many errors lead to accessing arrays beyond the end. Consider the following example:

    int i, my_array[10];
    for (i=0; i<=10; i++) my_array[i] = 0;

The loop writes one value beyond the end of the array, and as a result memory gets corrupted, which may lead to odd behaviour. Electric Fence is a library which uses memory protection hardware to immediately detect bad memory accesses, including even reads. The library replaces the malloc() function with its own version, which protects the memory after the allocated memory region.

To detect the bad memory access in the above code, the array must be allocated with malloc():


    int i, *my_array;
    my_array = malloc(sizeof(*my_array) * 10);
    for (i=0; i<=10; i++) my_array[i] = 0;

When this code is compiled, linked with Electric Fence, and run, the program crashes:

    > gmake test
    ./dsp_lab
    Electric Fence 2.4.10 Copyright (C) 1987-1999 Bruce Perens <[email protected]>
    Copyright (C) 2002-2004 Hayati Ayguen <[email protected]>, Procitec GmbH
    gmake: *** [run] Segmentation Fault

To examine which line causes the problem, run the code under a debugger, for example ddd (Data Display Debugger) or gdb (GNU Debugger):

    > gdb ./dsp_lab
    GNU gdb 6.2.1 Copyright 2004 Free Software Foundation, Inc.
    (gdb) run
    Starting program: dsp_lab
    Electric Fence 2.4.10 Copyright (C) 1987-1999 Bruce Perens <[email protected]>
    Copyright (C) 2002-2004 Hayati Ayguen <[email protected]>, Procitec GmbH
    Program received signal SIGSEGV, Segmentation fault.
    0x00011b6c in main () at dsp_lab.c:31
    31 for (i=0; i<=10; i++) my_array[i] = 0;
    (gdb) display i
    1: i = 10
    (gdb) quit

The debugger immediately shows in the above example that line 31 in dsp_lab.c causes the segmentation fault. You can then display the value of the variable i and see that it points beyond the end of the array.

You should remember that even if the code works on a workstation, it might nevertheless have errors which show up only on the DSP. If you test the code on big-endian computers (for example, Sun SPARC), you must write code which also works on the C67x DSP, which is little-endian by default. Therefore, testing on the DSP is also necessary. For more information about Electric Fence, go to the homepage at http://www.pf-lug.de/projekte/haya/efence.php.
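One way to keep code independent of the host byte order is to assemble multi-byte values from individual bytes instead of casting pointers. The following is a minimal sketch under that assumption; the helper names is_little_endian and read_le16 are hypothetical and not part of the dsp_lab library.

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 on a little-endian host (such as the C67x in its default
 * mode), 0 on a big-endian host. */
static int is_little_endian(void)
{
    uint16_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);   /* look at the lowest-addressed byte */
    return first == 1;
}

/* Decode a signed 16-bit little-endian sample from two bytes.
 * Works the same way regardless of the host byte order. */
static int16_t read_le16(const unsigned char *bytes)
{
    unsigned u = (unsigned)bytes[0] | ((unsigned)bytes[1] << 8);
    return (int16_t)(u >= 0x8000u ? (int)u - 65536 : (int)u);
}
```

Code that reads sample data through a helper like read_le16() behaves identically on a big-endian workstation and on the little-endian DSP, whereas code that reinterprets a byte buffer as an array of 16-bit integers does not.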


D. Debugging Techniques

• Look for warning messages

For good code, the compiler should not display any warning messages. If it does, the program still gets compiled and you can run it. If you understand where the warning message comes from and know that it is harmless, you can ignore it. If you do not understand it, you should find out the reason.

• Watch for invalid memory accesses

Make sure that the program does not read from arrays before they are initialized and, even more importantly, that it does not write outside of arrays. The C compiler does not do array bounds checking, but you can insert checks manually into the code using if-statements or use Electric Fence. You should also use the macros COPY_BUFFER and ZERO_BUFFER in the utils.h file when possible.

• Divide and conquer

Divide your code into small modules and test and debug them independently. For example, always verify first that complex number multiplication, sine generation, FFT, and IFFT work before combining them into an interpolation and decimation routine. Verify that the index permutation works before debugging the whole FFT/IFFT function.

• Use debugger or print statements to view values

With the debugger, you can examine variables and memory contents and run the program one line at a time. Alternatively, you can use the LOG_printf() function to display variable values.

• Verify index calculations

DSP algorithms are special in that they usually access data in the same order on every run. First verify that the memory accesses go to the correct locations (for example, the ai and bi variables in the FFT). Use the function fft_check_butterfly() in the utils.h file to verify the FFT indices.

• Use simple data

When the indices are correctly calculated, input some very simple data to the algorithm. The initial try should be full of zeros, which should also produce zeros in most cases. A good second try is a single nonzero number followed by zeros. Follow the propagation of the number in the code and do the same calculations by hand.

• Use random data


If it works for simple data, you should nevertheless verify the output also for more complicated data. You can use the array rand512 in the dsp_lab template, which includes 512 random values between -10 and 10.

• Use Matlab

Matlab has built-in commands for FFT and IFFT. Verify that your routine gives the same output for the same input as the Matlab functions. You could also implement the algorithms in Matlab. This would allow you to print intermediate results from the Matlab code and the C code and then see if they match.

• Run code on workstation computers

The DSP has no memory protection, and bugs in the program may lead to confusing results. The compiler is relatively little tested. The debugger is slow to use, since it communicates via a parallel cable. Therefore, debugging on a workstation is much easier. Memory protection immediately discovers most bad memory accesses and can even tell the exact location in the code. Debugging is faster and the tools are more reliable.

• Watch for memory usage

If your algorithm works on a workstation computer, but not on the DSP board, check the program memory usage. The DSP architecture limits the maximum data size of your program. In the small memory model, this is only 32 kilobytes. Even more important is the stack size limit, because if the stack overflows, you will get random crashes.
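The "use simple data" technique above can be illustrated with a small time-domain FIR filter, for which the expected results are easy to derive by hand: zeros in give zeros out, and a unit impulse reproduces the filter coefficients. This is only a sketch; fir_filter, check_zero_input, and check_impulse are hypothetical names, and the same two checks can be applied to an FFT-based filter.

```c
#define TEST_LEN 32   /* length of the test signals; taps must not exceed it */

/* Direct-form FIR filter: y[n] = sum over k of h[k] * x[n-k]. */
static void fir_filter(const float *x, int len, const float *h, int taps,
                       float *y)
{
    int n, k;
    for (n = 0; n < len; n++) {
        float acc = 0.0f;
        for (k = 0; k < taps && k <= n; k++)
            acc += h[k] * x[n - k];
        y[n] = acc;
    }
}

/* First try: an all-zero input must give an all-zero output. */
static int check_zero_input(const float *h, int taps)
{
    float x[TEST_LEN] = {0.0f}, y[TEST_LEN];
    int n;
    fir_filter(x, TEST_LEN, h, taps, y);
    for (n = 0; n < TEST_LEN; n++)
        if (y[n] != 0.0f) return 0;
    return 1;
}

/* Second try: a single 1 followed by zeros must give the
 * coefficients h[0..taps-1] followed by zeros. */
static int check_impulse(const float *h, int taps)
{
    float x[TEST_LEN] = {0.0f}, y[TEST_LEN];
    int n;
    x[0] = 1.0f;
    fir_filter(x, TEST_LEN, h, taps, y);
    for (n = 0; n < TEST_LEN; n++) {
        float expected = (n < taps) ? h[n] : 0.0f;
        if (y[n] != expected) return 0;
    }
    return 1;
}
```

Both checks return 1 on success, so they can be combined with halt_on_error() or a breakpoint while a module is still being debugged.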


E. Tips

• Check whether the development computer that you are using has additional instructions attached on top of it.

• If the Code Composer Studio or the DSK starts behaving incorrectly, try closing CCS and switching the power off from the DSK for a while (this may take several minutes). Try also lifting the lid and pressing the reset button on the board.

• Computer programs should also be used for computing filter coefficients. See the help for the Matlab functions filter, freqz, fft, ifft, and filtdemo (or eqfir and fft in Scilab).

• The samples are initially 16-bit signed integer numbers from the A/D converter. Convert these to floating point, and keep them as floating point values during all signal processing to avoid rounding errors. Convert the samples only once, at the end, back to 16-bit integers.

• To transfer filter coefficients from Matlab into a C program, on UNIX systems you can use the following commands:

1. Save the column matrix containing the coefficients into a file in Matlab: save 'coeff.txt' coeff -ascii

2. Add commas after each line in the UNIX shell: sed 's/$/,/g' <coeff.txt >coeff.c

3. Copy and paste the coefficients into your C program.

• The DSP has a very small stack, by default only one kilobyte. Do not define large arrays that are allocated from the stack. If you define large arrays in the beginning of a function, define them static or enlarge the stack. If the stack overflows, it will cause random crashes.

• Do not use malloc() when working with the DSK board: it requires a special memory setup. Use malloc() only when debugging with Electric Fence.

• Use different names for different variables, even when the variables are local to different functions. Using the same variable names may confuse the CCS debugger.
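The sample conversion tip above can be sketched as a pair of small helpers. The rounding to the nearest integer and the saturation at the 16-bit limits are assumptions added for this illustration; the handout only asks for a single conversion in each direction. The names sample_to_float and float_to_sample are hypothetical.

```c
/* Convert one 16-bit A/D sample to floating point. */
static float sample_to_float(short s)
{
    return (float)s;
}

/* Convert a processed value back to a 16-bit sample.
 * Assumption: out-of-range values are clipped instead of wrapping,
 * and in-range values are rounded to the nearest integer. */
static short float_to_sample(float x)
{
    if (x >= 32767.0f)  return 32767;
    if (x <= -32768.0f) return -32768;
    return (short)(x >= 0.0f ? x + 0.5f : x - 0.5f);
}
```

sample_to_float() would be applied once when a frame arrives and float_to_sample() once when the output frame is written, with all filtering in between done on float arrays.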
