
    Sci.Int.(Lahore), 25(3), 483-486, 2013. ISSN 1013-5316; CODEN: SINTE 8

    SPEECH TEXT SYNTHESIS: VARIATIONS OF PITCH AND DURATION ARE INVESTIGATED USING MATLAB

    2Suhail Kazi, 1Yasir Saleem, 3Mujtaba H. Jaffery, 1Hafiz Muhammad Falak Sher, 1Muhammad Munwar Iqbal

    1University of Engineering & Technology, Lahore, Pakistan

    2Universiti Teknologi Malaysia (UTM), Johor, Malaysia

    3CREPS, Department of Electrical Engineering, COMSATS Institute of Information Technology, Lahore, Pakistan

    *Corresponding Author: [email protected]

    ABSTRACT: This paper presents work on speech processing in two parts. The first part covers speech synthesis using the KLATT synthesizer (4), which is software for a cascade/parallel formant synthesizer. The formant values supplied to the KLATT synthesizer are extracted using the WAVESURFER software (6). Speech is produced from these extracted formants and then compared with the original signal for analysis. The aim is to synthesize speech from text using the KLATT synthesizer. The sounds ba, bi, da, di, ga and gi are recorded in PRAAT (7) and then opened in WAVESURFER, where the formants are measured manually. In the second part of the paper, we record a vowel, calculate the total duration of the speech in MATLAB from the length of the speech and the sampling rate, and then extract the samples of a 50 ms segment to calculate the pitch period. The pitch periods can be marked manually to calculate the pitch of the speech signal. Using MATLAB, we vary the pitch (increase or decrease) and the duration of the speech signal and construct a new signal with the varied pitch and duration. Vowels recorded from people of different ages are analyzed.

    Keywords: Speech synthesis, Text-To-Speech, Pitch, Duration, Matlab, Wave Surfer

    1. INTRODUCTION

    Speech is the natural form of human communication. Speech sounds are produced by air pressure vibrations, generated by pushing inhaled air from the lungs through the vibrating vocal cords and vocal tract and out from the lips and nose airways.

    Just as the written form of a language is a sequence of elementary alphabet symbols, speech is a sequence of elementary acoustic sounds or symbols known as phonemes that convey the spoken form of a language. There are about 40-60 phonemes in the English language, from which a very large number of spoken words can be constructed. In practice the production of each phonemic sound is affected by the context of the neighboring phonemes, so diphones and triphones that capture this context are selected carefully. In the KLATT synthesizer, diphones and triphones are used to synthesize the speech: they are selected carefully and concatenated to form the speech.

    2. LITERATURE REVIEW

    Text to speech and speech to text are very active areas of research. Considerable effort has been dedicated over the past two to three decades to developing models of human speech production and perception (Ladefoged 2005). Commonly the two modalities have been studied separately, although a number of investigators have looked at them more closely.

    Vowels provide the most perceptible cue to the pronunciation of English consonants. Some researchers have studied this effect for languages other than English, for example French and Arabic. Such research allows a comparison of word-final English consonants as produced by advanced and intermediate Lebanese speakers of English.

    Speech to Text (STT) applications face the main hurdle of the new technology required for audio information processing. The high cost of the infrastructure required to conduct speech recognition research prevents many small research groups from evaluating new ideas on large-scale tasks. A Speech to Text classification system comprises the following:

    1. Open access through the Internet
    2. Complete documentation and operating guidelines
    3. An advanced system with periodic upgrades
    4. An application with object-oriented design
    5. On-line practical maintenance

    Speech-DM is a KDD application designed to handle speech and discover patterns of pitch variation. It must comprise data pre-processing, management, mining, and post-processing. Figure 1 gives its architecture.

    Figure 1: Speech-DM Architecture.

    Speech-DM provides an interface for monitoring and setting constraints throughout training and testing.


    3. SPEECH SYNTHESIS

    Speech synthesis is the artificial production of human speech. A typical Text-To-Speech (TTS) system converts normal language text into speech.

    Figure 2: Text-to-Speech System

    It is composed of two parts: a front-end and a back-end. The front-end assigns phonetic transcriptions to each word, and divides and marks the text into prosodic units such as phrases, clauses, and sentences. The process of assigning phonetic transcriptions to words is called text-to-phoneme or grapheme-to-phoneme conversion. Phonetic transcriptions and prosody information together make up the symbolic linguistic representation that is output by the front-end. The back-end then converts the symbolic linguistic representation into sound. In certain systems, this part includes the computation of the target prosody (pitch contour, phoneme durations), which is then imposed on the output speech.

    Synthesized speech can be created by concatenating pieces of recorded speech that are already stored. Here the concatenated pieces are in the form of diphones and triphones.

    First we record the sounds ba, bi, da, di, ga and gi using PRAAT and save each as a wav file, then open it in WAVESURFER to measure the formants manually. With WAVESURFER the values F0, F1, F2 and F3 can easily be estimated.

    Table 1: Formants calculated for different sounds using WaveSurfer

    Sr. no  Sound  F0 (Hz)  Time (ms)  F1 (Hz)  F2 (Hz)  F3 (Hz)
    1       Ba     159      675        221      3087     3325
                            725        222      1103     2536
                            1125       662      1213     2626
    2       Bi     171      450        221      1323     2205
                            500        222      1544     2426
                            850        441      2426     3197
    3       Da     79       630        221      1764     2205
                            680        223      1544     2426
                            880        226      2426     3197
    4       Di     153      450        221      1544     2756
                            500        223      1544     2646
                            800        226      2426     3197
    5       Ga     136      415        992      1874     3418
                            435        1103     1433     3528
                            835        772      1213     2426
    6       Gi     145      500        221      1213     2977
                            550        226      1433     3087
                            850        230      2646     3308

    4. SYNTHESIS SCRIPT OF KLATT SYNTHESIZER

    Now put the formant values given in Table 1 into the synthesis script of the KLATT synthesizer for the different sounds. Here we take the sound ga as an example:

    TIME = 000; F1=992; F2=1874; F3=3418; F0=135.53; AV=72
    TIME + 20; F1=1103; F2=1433; F3=3528; AV=72
    TIME + 20; F1=1323; F2=1433; F3=3638; AV=72
    TIME = 400; F1=772; F2=1213; F3=2426; F0=135.53; AV=72
    TIME + 30; AV=0

    We obtain the speech sound and compare it with the original one. It is the same sound that was recorded in PRAAT, differing only in the manner of speaking (robotic speaking). Similarly, we put the formant values of the other sounds into the synthesis script and compare the resulting sounds with their originals.
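    To illustrate what a cascade formant synthesizer does with such a frame, the following Python sketch passes an impulse train at F0 through three second-order resonators, using the resonator difference equation published by Klatt (1980). It is a minimal sketch and not the KLATT software used in this paper: the bandwidths, the 10 kHz sampling rate, and the function names resonator and synthesize_frame are assumptions chosen for illustration.

```python
import math

def resonator(signal, freq_hz, bw_hz, fs):
    """Second-order resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / fs
    c = -math.exp(-2.0 * math.pi * bw_hz * t)
    b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c  # gives unity gain at 0 Hz
    out = []
    y1 = y2 = 0.0
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def synthesize_frame(f0, formants, bandwidths, dur_s, fs=10000):
    """Excite a cascade of formant resonators with an impulse train at f0."""
    n = int(round(dur_s * fs))
    period = int(round(fs / f0))
    source = [1.0 if i % period == 0 else 0.0 for i in range(n)]
    out = source
    for f, bw in zip(formants, bandwidths):
        out = resonator(out, f, bw, fs)
    return out

# First frame of the ga script: F0=135.53, F1=992, F2=1874, F3=3418.
# The bandwidths (60, 90, 150 Hz) are assumed values, not taken from the paper.
frame = synthesize_frame(135.53, [992, 1874, 3418], [60, 90, 150], 0.02)
```

    Each resonator adds one spectral peak at its formant frequency; changing F1 to F3 from frame to frame, as the script does every 20 ms, moves those peaks over time.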

    5. QUALITY CHECKING OF A SPEECH SYNTHESIZER

    The quality of a speech synthesizer is assessed by its similarity to the human voice and by its ability to be understood.

    6. APPLICATIONS OF SPEECH SYNTHESIS

    Speech synthesis allows environmental barriers to be removed for people with a wide variety of disabilities.

    It is also commonly used to support people with severe speech impairment, typically through a dedicated voice output communication aid.

    Speech synthesis methods are also employed in entertainment productions such as games and animations.

    7. VARYING DURATION OF SPEECH SIGNAL

    Before discussing the second part of the paper, first record vowels in PRAAT and save them as wave files. Then follow the steps given below to vary the duration of the speech signal.

    Reading the vowel

    Calculate the total duration of the speech signal using the MATLAB commands:

    close all
    spch = wavread('g:/44100.wav');  % path of the wave file on the drive (audioread in newer MATLAB releases)
    fs = 44100;                      % sampling rate in Hz
    ts = length(spch);               % length of the speech in samples
    t = ts/fs;                       % total duration of the speech in seconds

    Extracting the samples for duration of 50 ms

    Extract the samples for 50 ms from the speech signal using the MATLAB commands:

    time = 0.05;                             % 50 msec
    samp = round(time * fs);                 % samples in 50 msec
    temp = buffer(spch, samp, 1, 'nodelay'); % one 50 msec frame per column
    spch_samp = temp(:,5);                   % take the fifth frame
    spch_samp = spch_samp';
    pitch_period = 0.007;                    % 7 msec, manually calculated
    figure, plot(spch_samp)


    t_pitch_prd = floor(time/pitch_period);  % number of whole pitch periods in the 50 msec vowel segment
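    The duration and framing arithmetic of these two steps can be sketched in Python as follows. This is an illustration and not the authors' MATLAB code: a silent list stands in for the recording, the helper names duration_seconds and extract_frame are hypothetical, and the frames here do not overlap, whereas the buffer call above uses a one-sample overlap.

```python
import math

def duration_seconds(num_samples, fs):
    """Total duration = number of samples / sampling rate."""
    return num_samples / fs

def extract_frame(signal, fs, frame_s, index):
    """Return the index-th (0-based) non-overlapping frame of frame_s seconds."""
    n = int(round(frame_s * fs))
    start = index * n
    if start + n > len(signal):
        raise ValueError("frame lies outside the signal")
    return signal[start:start + n]

fs = 44100
signal = [0.0] * (fs * 2)                    # stand-in for a 2 s recording
total = duration_seconds(len(signal), fs)    # 2.0 seconds
frame = extract_frame(signal, fs, 0.05, 4)   # fifth 50 ms frame, 2205 samples
periods_in_frame = math.floor(0.05 / 0.007)  # whole 7 ms periods in 50 ms
```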

    Extracting one pitch period from vowel

    We want to isolate one pitch period. We extract it from the vowel using the MATLAB commands:

    L = length(spch_samp);                   % total number of samples in 50 msec
    samp_one_pitch = floor(L/t_pitch_prd);   % samples per pitch period
    temp = buffer(spch_samp, samp_one_pitch, 1, 'nodelay');  % one pitch period per column
    one_pitch1 = temp(:,1)';
    one_pitch2 = temp(:,2)';
    one_pitch3 = temp(:,3)';
    one_pitch4 = temp(:,4)';
    one_pitch5 = temp(:,5)';
    one_pitch6 = temp(:,6)';
    plot(one_pitch2)
    l_one = length(one_pitch2);
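    In this paper the pitch period is marked manually. As a cross-check, it can also be estimated automatically; the Python sketch below uses the autocorrelation method, a standard alternative technique that is not part of the paper. The function name estimate_pitch_period and the 60 to 400 Hz search range are assumptions.

```python
import math

def estimate_pitch_period(frame, fs, min_hz=60.0, max_hz=400.0):
    """Return the pitch period in seconds at the autocorrelation peak."""
    min_lag = int(fs / max_hz)
    max_lag = min(int(fs / min_hz), len(frame) - 1)
    best_lag, best_val = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Correlate the frame with a copy of itself shifted by `lag` samples.
        val = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if val > best_val:
            best_lag, best_val = lag, val
    return best_lag / fs

# Synthetic 50 ms test frame with a 7 ms period (56 samples at 8 kHz).
fs = 8000
test_frame = [math.sin(2 * math.pi * i / 56) for i in range(400)]
period_s = estimate_pitch_period(test_frame, fs)
```

    The lag with the largest correlation is taken as one pitch period. On real vowels the result should be compared against the manually marked value, since this simple search can lock onto a multiple or a fraction of the true period.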

    Now we can vary the duration of the speech signal simply by adding pitch periods to the original speech signal as required.

    Suppose we take a speech signal of 200 ms and want to extend its duration to 250 ms. The pitch period for the vowel aaa, calculated using PRAAT, is 7 ms, so we have to add 7 pitch periods to extend the speech signal to approximately 250 ms. Similarly, the duration of speech signals can be varied by several milliseconds. This was the manual way to vary the duration of the speech signal.
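    The count of seven pitch periods in this example follows from a short calculation. The symbols T_orig, T_target, T_0 and n are introduced here only for illustration:

```latex
n = \left\lfloor \frac{T_{\text{target}} - T_{\text{orig}}}{T_0} \right\rfloor
  = \left\lfloor \frac{250 - 200}{7} \right\rfloor = 7,
\qquad
T_{\text{new}} = T_{\text{orig}} + n\,T_0 = 200 + 7 \times 7 = 249\ \text{ms}
```

    Because only whole pitch periods are added, the result (249 ms) is approximately, not exactly, the requested 250 ms.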

    Figure 4. Extracted pitch period

    Changing the duration

    Now we discuss another method to vary the duration of the speech signal: a mathematical way using MATLAB.

    disp('enter the duration of vowel you need in sec')
    dur_time = input('enter duration in seconds = ');
    leng = floor(dur_time/pitch_period);    % number of pitch periods to append
    new_vowel = spch_samp;                  % start from the 50 msec segment
    for i = 1:leng
        new_vowel = [new_vowel one_pitch2]; % append one pitch period
    end
    figure, plot(new_vowel)
    wavwrite(new_vowel, 44100, 'vowel1')

    Key Points about Varying the Duration

    1. To double the duration of the speech signal, just duplicate the pattern; this does not change the pitch period of the speech signal.

    2. To decrease the duration of the speech signal, remove pitch periods at regular intervals until the requirement is met.
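    Both key points can be sketched in one Python function. This is a minimal illustration, not the authors' MATLAB code: change_duration is a hypothetical name, the signal is assumed to consist of whole pitch periods, and for simplicity periods are removed from the end of the signal.

```python
def change_duration(signal, period, target_len):
    """Append or drop whole pitch periods to approach target_len samples."""
    out = list(signal)
    p = len(period)
    # Lengthen: append copies of the period while they still fit.
    while len(out) + p <= target_len:
        out.extend(period)
    # Shorten: drop whole periods from the end while still at or above target.
    while len(out) - p >= target_len and len(out) > p:
        del out[-p:]
    return out

period = [0.0, 1.0, 0.0, -1.0]   # toy four-sample pitch period
signal = period * 5              # 20 samples
doubled = change_duration(signal, period, 40)
shortened = change_duration(signal, period, 8)
```

    Since whole periods are added or removed, the period length, and therefore the pitch, stays the same while the duration changes.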

    Precautions

    1. Samples should be taken carefully. Don't take samples from low frequency components.

    2. Always take samples from high frequency components.

    3. Transitions should not be stretched, because this causes problems in hearing. Instead, stretch the vowel (its middle part).

    VARIATION IN THE PITCH

    Now we will discuss step by step how the pitch of the speech signal can be increased and decreased.

    Pitch period Changing

    The pitch period can be changed with a Hamming window using the MATLAB commands below.

    win = hamming(l_one);            % window as long as one pitch period
    figure, plot(one_pitch2)
    figure, plot(win)
    aq = one_pitch2.*win';           % windowed pitch period
    figure, plot(aq)
    long = [aq aq aq aq aq aq aq aq aq aq aq aq aq aq];  % 14 windowed periods in a row
    wavwrite(long, 44100, 'short');
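    For reference, the symmetric Hamming window of length N is w[k] = 0.54 - 0.46 cos(2*pi*k/(N-1)). The Python sketch below applies it to one pitch period and repeats the result, mirroring the steps above; the function names hamming and windowed_train are chosen for illustration and the constant toy period stands in for real data.

```python
import math

def hamming(n):
    """Symmetric Hamming window of length n."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * k / (n - 1)) for k in range(n)]

def windowed_train(period, repeats):
    """Taper one pitch period with a Hamming window and repeat it."""
    win = hamming(len(period))
    shaped = [s * w for s, w in zip(period, win)]
    return shaped * repeats

w = hamming(5)                             # endpoints 0.08, centre 1.0
train = windowed_train([1.0] * 5, 14)      # 14 windowed periods in a row
```

    The window tapers each period towards its edges, which reduces the discontinuity where one copy of the period meets the next.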

    Increasing Pitch

    A speech signal has high and low frequency components. To increase the pitch we contract the high frequency components, because contracting the low frequency components creates listening problems. The following commands are used in MATLAB to increase the pitch.

    new_pitch = input('enter the pitch');
    diff = floor(new_pitch - (1/0.007));    % difference of pitch in Hz
    samp_less = 2*diff;                     % samples to delete to increase the pitch
    new_samp = length(one_pitch1) - samp_less;
    temp = buffer(aq, new_samp, 1, 'nodelay');
    temp = temp(:,1)';                      % shortened pitch period
    pitch = temp;
    for i = 1:15
        pitch = [pitch temp];               % repeat the shortened period
    end
    wavwrite(pitch, 44100, 'nn');
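    The factor of two samples per hertz in the code above can be checked against the exact relation between pitch and period length, N = fs / f0. Near 143 Hz at 44100 Hz sampling, one hertz corresponds to about fs / f0^2 = 2.16 samples, so the factor is a local approximation. The Python sketch below compares the two; the function names period_samples and samples_to_remove are hypothetical.

```python
def period_samples(fs, pitch_hz):
    """Length of one pitch period in samples: N = fs / f0."""
    return fs / pitch_hz

def samples_to_remove(fs, old_hz, new_hz):
    """Samples to drop from each period to move from old_hz up to new_hz."""
    return int(round(period_samples(fs, old_hz) - period_samples(fs, new_hz)))

fs = 44100
old_hz = 1 / 0.007                                      # about 142.9 Hz (7 ms period)
exact_10 = samples_to_remove(fs, old_hz, old_hz + 10)   # exact relation, +10 Hz
rule_10 = 2 * 10                                        # two samples per hertz
exact_50 = samples_to_remove(fs, old_hz, old_hz + 50)   # exact relation, +50 Hz
rule_50 = 2 * 50
```

    For a 10 Hz increase both give 20 samples, but for a 50 Hz increase the exact relation gives 80 samples against 100 from the rule, so the rule overshoots for larger pitch changes.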


    Figure 5. Increased pitch period

    Decreasing Pitch

    Pitch can be decreased by stretching the high frequency components. The following commands are used in MATLAB to decrease the pitch.

    close all
    new_pitch = input('enter the pitch short ');
    diff = floor((1/0.007) - new_pitch);    % difference of pitch in Hz
    samp_increase = 2*diff;                 % samples to add to decrease the pitch
    rd = random('exp', (1:30));             % 30 exponentially distributed random samples used as padding
    temp = buffer(rd, samp_increase, 1, 'nodelay');
    temp = temp(:,1)';
    pitch_prd = [one_pitch2 temp];          % lengthened pitch period
    pitch = [];                             % start from an empty signal
    for i = 1:7
        pitch = [pitch pitch_prd];          % repeat the lengthened period
    end
    wavwrite(pitch, 44100, 'loo')
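    The idea of lowering the pitch by lengthening each pitch period can be sketched in Python as follows. This is a simplified variant and not the authors' method: it pads each period with zeros where the MATLAB code pads with random samples, and it computes the padded length from the exact relation N = fs / f0. The name lower_pitch and the default of seven repeats are assumptions.

```python
def lower_pitch(period, fs, target_hz, repeats=7):
    """Pad one pitch period with zeros to the target length, then repeat it."""
    target_len = int(round(fs / target_hz))
    if target_len < len(period):
        raise ValueError("target pitch is higher than the current pitch")
    padded = list(period) + [0.0] * (target_len - len(period))
    return padded * repeats

period = [1.0] * 309                     # toy period of about 7 ms at 44100 Hz
out = lower_pitch(period, 44100, 100.0)  # each period becomes 441 samples long
```

    Each period now lasts 441 samples, so the repetition rate of the output is 44100 / 441 = 100 Hz.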

    Figure 6. Decreased pitch period

    Precautions (For Pitch)

    To avoid listening problems, stretch or contract only the high frequency components. Don't use the low frequency components.

    That was all about increasing or decreasing the pitch of the speech signal.

    8. CONCLUSION

    In this paper we have discussed how speech can be synthesized from text data. For this we use the KLATT synthesizer, which gives its output in the form of a speech sound that is almost the same as the original sound. From our observations, the output sound of the KLATT synthesizer is a robotic type of speech sound and is not as clear as the original voice. To supply formants to the KLATT synthesizer, we used WAVESURFER instead of PRAAT for formant estimation, because formants are easier to find in WAVESURFER than in PRAAT; the sounds themselves were recorded in PRAAT.

    We then saw different techniques relating to frequency selection and windowing for varying the pitch and duration of the speech signal. To increase the pitch we have to decrease the pitch period, and vice versa.

    REFERENCES:

    1. Rabiner L.R. and Juang B.H. (1993). Fundamentals of Speech Recognition. Prentice-Hall, Englewood Cliffs, NJ.

    2. Ladefoged, Peter, and Keith Johnson. A course in phonetics. Wadsworth Publishing Company, 2010.

    3. Dutoit, Thierry. An introduction to text-to-speech synthesis. Vol. 3. Springer, 1997.

    4. http://www.asel.udel.edu/speech/tutorials/synthesis/ [Retrieved from Internet on dated: Saturday, January 01, 2013]

    5. James Matthews, SAPI 5.0 Tutorial II: Text-to-Speech. http://www.generation5.org/content/2001/sr01.asp. [Retrieved from Internet on dated: Saturday, January 01, 2013]

    6. http://www.wavesurfer.findmysoft.com/ [Retrieved from Internet on dated: Saturday, January 02, 2013]

    7. http://www.fon.hum.uva.nl/praat/download_win.html [Retrieved from Internet on dated: Saturday, January 01, 2013]

    8. Text-to-speech Demo. http://www.research.att.com/projects/tts/. [Retrieved from Internet on dated: Saturday, January 01, 2013]

    9. Alan W. Black, Perfect synthesis for all of the people all of the time. IEEE TTS Workshop 2002.

    10. Furui S. (1989). Digital Speech Processing, Synthesis and Recognition. Marcel Dekker.

    11. McClellan, J. H., Burrus, C. S., Oppenheim, A. V., Parks, T. W., Schafer, R. W., & Schuessler, H. W. (1998). Computer-based exercises for signal processing using MATLAB 5 (Vol. 5). Prentice Hall.

    12. Schafer, R. and L. Rabiner (1978). Digital processing of speech signals. Englewood Cliffs, NJ: Prentice-Hall.

    13. D. Pennell and Y. Liu, "Normalization of text messages for text-to-speech," ICASSP, 2010.

    14. Martin, J. H. and D. Jurafsky (2000). Speech and language processing. Prentice Hall.