
Assignment 1 for CMSC5707: Advanced topics in AI - Audio signal processing (v7a)

Appendix: A tutorial on using the htk-mfcc tool

Download the package from http://www.mathworks.com/matlabcentral/fileexchange/32849-htk-mfcc-matlab/content/mfcc/mfcc.m

Run example.m and you will see that it generates MFCCs from the sound file.

The default values are :

Tw = frame duration = 25 ms,

Ts = frame shift = 10 ms, etc.

C = number of cepstral coefficients = 12.

MFCCs is the output matrix of MFCC parameters.

If you want to save the MFCC parameters to a file and read them from another language or package, you may do the following in MATLAB:

>> clear % clear the workspace

>> example % run the example of 32849-htk-mfcc-matlab once

(You may need to change the sound file name in example.m to select your own sound file.)

>> whos % show the variables generated; you should see MFCCs

>> save('foo1.txt', 'MFCCs' ,'-ascii'); % save MFCCs in foo1.txt

You may use other programs to read this foo1.txt to get the parameters.
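As a quick check that the text file can be read back, the round trip can be sketched in MATLAB/Octave itself. This is only a sketch: the small matrix below is a hypothetical stand-in for the real MFCCs, and the file is written to the temporary directory rather than the current one.

```matlab
% Hypothetical 3x4 stand-in for the real MFCC matrix
MFCCs = reshape(1:12, 3, 4) / 8;

fname = fullfile(tempdir, 'foo1.txt');   % temporary location (assumption)
save(fname, 'MFCCs', '-ascii');          % plain text, one matrix row per line

M2 = load(fname);                        % read the numeric text file back
size(M2)                                 % same size as MFCCs
max(abs(M2(:) - MFCCs(:)))               % difference introduced by the text format
```

The -ascii format keeps about 8 significant digits, so small rounding differences are expected with real MFCC values.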

Making a function in MATLAB/Octave from example.m

Edit the file example.m

Comment out the line clear all; close all; clc;, e.g. % clear all; close all; clc;

If example.m assigns a fixed file name to wav_file, comment out that line as well; otherwise it will override the argument passed to the function.

Add as the first line: function MFCCs=wav2mfcc1(wav_file)

Save this file as wav2mfcc1.m

You can now use wav2mfcc1 as a function in a MATLAB/Octave .m file or in the command window.

Example: put the following line in a test.m file

MFCCs_OUT= wav2mfcc1('sound_file.wav');

% Result of running test.m: the resulting MFCC parameters are stored in the matrix MFCCs_OUT

Appendix

FAQ on assignment 1

1. Question: What is the meaning of voiced vowel?

Answer: In the example shown in the diagram, it is the sound "sar" in Cantonese. The sampling frequency is 22050 Hz and the recording has about 2 x 10^4 samples, so the duration is 2 x 10^4 x (1/22050) = 0.9070 seconds. The top diagram below shows the whole duration of the sound. By visual inspection, the consonant "s" runs roughly from sample 0.2 x 10^4 to sample 0.6 x 10^4, and the vowel "ar" from sample 0.62 x 10^4 to sample 1.22 x 10^4. The lower diagram shows a 20 ms segment of the vowel sound "ar", which is (20/1000)/(1/22050) = 441 samples, taken from the middle of the sound (starting at the 1 x 10^4-th sample).

% MATLAB program to produce the plots

% Sound source: http://www.cse.cuhk.edu.hk/~khwong/www2/cmsc5707/sar1.wav

[x,fs]=wavread('sar1.wav'); % read the sound; newer MATLAB releases replace wavread with audioread

fs % sampling frequency, so the sample period is 1/fs

% a 20 ms segment contains (20/1000)/(1/fs) samples

n20ms=round((20/1000)/(1/fs)) % number of samples in 20 ms

len=length(x)

figure(1),clf, subplot(2,1,1),plot(x) % the whole sound

subplot(2,1,2),T1=round(len/2); % starting point: the middle of the sound

plot(x(T1:T1+n20ms-1)) % exactly n20ms samples

2. Question:

a. If my sound recording lasts for 2 s and my speech starts at 600 ms and ends at 1300 ms, should I chop it to become a 0.7 s speech?

Answer: Yes, you may use a tool (such as http://www.goldwave.com/) to cut the file down to a shorter duration, say 0.7 s. The idea is to remove the silent regions, which are of no use to your system.

b. Could I filter out those parts after I've read the data into the program?

Answer: Yes, you may remove those parts after you read the data in your program.
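Trimming inside the program can be sketched as follows. This is a minimal sketch: x is a synthetic 2 s placeholder rather than a real recording, and the 600 ms and 1300 ms boundaries are simply taken from the question; in practice you would read them off a plot of your own sound.

```matlab
fs = 22050;                      % sampling frequency in Hz (same as sar1.wav)
x  = zeros(2*fs, 1);             % placeholder for a 2 s recording
t_start = 0.6;                   % speech start in seconds (from the question)
t_end   = 1.3;                   % speech end in seconds (from the question)

i1 = round(t_start*fs) + 1;      % first sample to keep
i2 = round(t_end*fs);            % last sample to keep
y  = x(i1:i2);                   % trimmed signal

dur = length(y)/fs               % duration in seconds of the part kept
```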

3. Question: How do I use matlab efficiently?

Answer: Writing efficient MATLAB code takes some skill. Code with many for loops can be slow; matrix operations are faster. For example, with a=[1 2 3 4] and b=[4 5 6 7], you could write a loop to multiply the elements one by one, but a.*b gives the element-wise products and a*b' gives the dot product directly and much faster. (Plain a*b is an error here, because both are 1x4 row vectors.)

My implementation of the recognizer runs 10 comparisons in 2 seconds. If you don't want to optimize your code, you can automate the comparisons instead: use a program to set up the comparisons for all combinations and leave it running overnight.
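The loop and the matrix form can be compared side by side. This is a small illustrative sketch using the two vectors from the answer; the variable names are chosen here for illustration.

```matlab
a = [1 2 3 4];
b = [4 5 6 7];

% Loop version: accumulate the sum of products element by element
s_loop = 0;
for k = 1:length(a)
    s_loop = s_loop + a(k)*b(k);
end

% Matrix versions: no explicit loop
s_vec = a*b';      % dot product (1x4 times 4x1), equals s_loop
p     = a.*b;      % element-wise products, a 1x4 vector
```

For vectors this short the speed difference is negligible; it matters when the same pattern is applied to long signals or large MFCC matrices.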

4. Question: Since my recordings of digits have some periods of silence, the MFCC matrix from the sound has many columns of NaNs. Would it affect the results? How should I deal with that? Thanks!

Answer: NaN in MATLAB means "not a number": it is the result of an undefined operation such as 0/0 or Inf-Inf. Such values make no sense as features and will affect the results, so remove the affected columns (frames) from the MFCC sequence.
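Removing the bad frames can be sketched as below. The small matrix is a hypothetical stand-in for a real MFCC matrix with one frame per column; treating Inf the same way as NaN is an extra assumption made here, not something the tool requires.

```matlab
% Hypothetical 3x4 MFCC matrix: column 3 is a bad frame
MFCCs = [1 2 NaN 4; 5 6 NaN 8; 9 10 11 12];

% A column is bad if any of its entries is NaN or Inf
badCols = any(isnan(MFCCs) | isinf(MFCCs), 1);

clean = MFCCs(:, ~badCols)   % keep only the good frames
```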

5. Question: Besides cutting the sound file, do we need to pre-process the input sound with pre-emphasis, a Hamming window and so on?

Answer: No, students do not need to do this. These pre-processing steps are done automatically by the tool at http://www.mathworks.com/matlabcentral/fileexchange/32849-htk-mfcc-matlab

6. Question: I used my recorded file for example.m but it doesn't work. Why?

Answer: I guess your sound file is stereo and has two sound tracks, and the program does not like that. You may send me the sound file, then I will have a better idea of the problem. Solutions:

1) Use the tool at www.goldwave.com to convert it to a single track.

2) Remove one column from your data in MATLAB. If you have a stereo wav file, do the following:

>> x=wavread( 'd:\sounds\stereo1_maid.wav');

>> size(x) %check the size of the sound, if it has two columns it is stereo

ans =

541957 2

>> x1=x(:,1); % keep only the first channel to make it mono

>> size(x1)

ans =

541957 1

7. Question: I found the following algorithm in the ppt slides on speech recognition:

If the energy level and zero-crossing rate of 3 successive frames are high, it is a starting point.

After the starting point, if the energy and zero-crossing rate for 5 successive frames are low, it is the end point.

I don't quite understand these two sentences. Does it mean that if the energy is high and the zero-crossing rate is also high for 3 consecutive frames, that is the starting point? What values do HIGH and LOW mean? Do those values depend on our own wave?

Answer: HIGH and LOW depend on your wave. Plot the waves and you will see the difference between segments with sound and segments without sound.

Get a segment, say 20 ms long, of a silent region, save it as x in MATLAB and run the following.

figure(1), clf, plot(x) % the original wave

figure(2), clf, plot(x.^2) % plot the energy

Repeat the above with a segment of voiced sound. Then you can see the difference between them, and use it to choose the threshold that separates HIGH from LOW.
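Once thresholds are chosen, the start/end rule quoted in the question can be sketched as below. This is only an illustration under several assumptions: the signal is synthetic (silence, a 1 kHz tone, silence) instead of real speech, frames are 20 ms and non-overlapping for simplicity, and the thresholds thE and thZ are made-up values that you would replace with ones read from your own plots.

```matlab
fs = 8000;                                     % assumed sampling rate (Hz)
N  = round(0.020*fs);                          % samples per 20 ms frame
tone = 0.5*sin(2*pi*1000*(0:2399)/fs + 0.1);   % synthetic stand-in for speech
x = [zeros(1,1600), tone, zeros(1,1600)];      % silence + sound + silence

nF = floor(length(x)/N);                       % number of whole frames
frames = reshape(x(1:nF*N), N, nF);            % one frame per column
E = sum(frames.^2, 1);                         % energy of each frame
Z = sum(abs(diff(sign(frames), 1, 1)) > 0, 1); % sign changes in each frame

thE = 0.1*max(E);                              % assumed energy threshold
thZ = 10;                                      % assumed zero-crossing threshold
high = (E > thE) & (Z > thZ);                  % frames counted as HIGH

% Start: first frame that begins a run of 3 successive HIGH frames
startF = 0;
for k = 1:nF-2
    if all(high(k:k+2)), startF = k; break; end
end

% End: last frame before the first run of 5 successive LOW frames
endF = 0;
if startF > 0
    endF = nF;                                 % default: sound runs to the end
    for k = startF+3:nF-4
        if ~any(high(k:k+4)), endF = k-1; break; end
    end
end

speech = x((startF-1)*N+1 : endF*N);           % samples between the end points
```

On this synthetic signal the tone occupies frames 11 to 25, which is what the rule recovers. Real recordings contain background noise, so the thresholds need tuning per recording.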

8. I have a few questions about sound recording for building an isolated word recognition system.

a. First, in your instruction, it states that "Each word should last for 0.6 to 0.8 seconds". Does it mean voiced vowel should last for 0.6-0.8 seconds, or the whole sound (including silence) should last for 0.6-0.8 seconds?

Answer: Your sound may contain silence, unvoiced sound and voiced sound. In the word SAR, for example, S is unvoiced and AR is voiced. The silence should be removed first; the word itself, unvoiced sound (S) plus voiced vowel (AR), may then last for 0.6-0.8 seconds at normal speaking speed.

b. Sounds should be extracted as MFCC parameters. If my sounds contain silent periods before and after the voiced vowel, will they affect the MFCC parameters?

Answer: MFCCs for silence are bad for recognition. Remove them.

c. If so, should I cut out the silent periods of the sound wave?

Answer: Yes. Remove the parts before and after your sound manually, or write a program to filter out the silent periods.

d. If the time shift is 10 ms, a 0.7 second sound will have 70 frame segments, so the MFCC matrix is M(13,70). If my sound wave contains a silent period before the voiced part, should I shift the 70 frame segments to the voiced part?

Answer: I don't quite understand your question, so here is my best guess. Suppose your recording lasts 1 second: 0.2 s of silence at the beginning, then your sound for 70 frames (0.7 s), followed by 0.1 s of silence. You should remove the 0.2 s of silence at the beginning and the 0.1 s at the end. The middle 0.7 seconds will be converted into MFCCs for recognition.

e. Here I have some additional questions about building a speech recognition system. First, 5(a) states that "MFCC matrix of size should be 70 frame segments, because of a 0.7 seconds sound". However, I have 10 sound files in total, and they all have different durations (say, 0.9 seconds or 0.6 seconds). Should I speed up or slow down the sound waves to match 70 frame segments?

Answer: 0.7 seconds and 70 frames are typical values; you may adjust these numbers depending on your recordings. For example, 100 frames (1 second) is fine.

f. Second, about 5(c) and 5(d): 5(c) states "compare each sound pair and show n x n comparison-matrix table" and 5(d) states "plot optimal path on accumulated matrix diagram". Does it mean that I need to write a program to do things like the following picture?

Answer: Marking the optimal path by hand using an editor is fine. Writing a program to plot the path is also welcome.

9. On dynamic programming:

a. I am a little confused by the optimal path problem. I found in some material that the optimal path is searched from the lower left corner, with the indices i, j increasing during the process. What is the difference between searching from the upper right corner and searching from the lower left corner?

Answer: They are the same. The search for the optimal path starts from the end, where i and j are at their maximum.

If the maximum i, j is at the upper right, the search is at the top row and right-most column (as written in the lecture notes).

If the maximum i, j is at the lower right, the search is at the bottom row and right-most column (as captured from the program output).
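The backward search from the end can be sketched with a small dynamic-programming example. This is a minimal sketch under stated assumptions, not the assignment's reference implementation: the sequences a and b are hypothetical 1-D features instead of MFCC vectors, the local distance is the absolute difference, the allowed steps are the three neighbouring cells, and the path is assumed to end exactly at the corner where i and j are both maximum.

```matlab
a = [1 2 3 4 3];                 % hypothetical feature sequence 1
b = [1 1 2 3 4 3];               % hypothetical feature sequence 2
n = length(a);  m = length(b);

% Forward pass: accumulated distance matrix D
D = inf(n, m);
D(1,1) = abs(a(1) - b(1));
for i = 1:n
    for j = 1:m
        if i == 1 && j == 1, continue; end
        best = inf;                                        % cheapest predecessor
        if i > 1,          best = min(best, D(i-1,j));   end
        if j > 1,          best = min(best, D(i,j-1));   end
        if i > 1 && j > 1, best = min(best, D(i-1,j-1)); end
        D(i,j) = abs(a(i) - b(j)) + best;
    end
end

% Backward pass: start where i and j are maximum, walk back to (1,1)
i = n;  j = m;
wpath = [i j];
while i > 1 || j > 1
    if i == 1
        j = j - 1;
    elseif j == 1
        i = i - 1;
    else
        [~, k] = min([D(i-1,j-1), D(i-1,j), D(i,j-1)]);    % smallest neighbour
        if k == 1
            i = i - 1;  j = j - 1;
        elseif k == 2
            i = i - 1;
        else
            j = j - 1;
        end
    end
    wpath = [i j; wpath];        % prepend, so the path reads from start to end
end

total = D(n, m)                  % accumulated distance along the optimal path
```

Whether the matrix is drawn with the end point at the upper right or the lower right only changes the picture; the indices visited in wpath are the same.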

b. In some real test on two sound wav
