
Digital signal processing

Speech recognition: Matched filter application testing

Krneta Nikola
Department of Control and Electronics
University of Sarajevo
Sarajevo, Bosnia and Herzegovina

[email protected]

Abstract—This document introduces the reader to the field of speech recognition, specifically command recognition. The main method described herein revolves around the well-known matched filter, a logical choice since the main goal is to detect a given template within the input signal.

Keywords—speech recognition; command recognition; matched filter; signal

I. Introduction

Voice analysis, specifically speech and command recognition, is just one part of a larger scientific domain. The term speech recognition describes the process of turning spoken words from a user into text which can be displayed on a computer. The main interest here is the process up to the point of saving the text, because the end result may differ.

Most commonly used in the fields of computer science and artificial intelligence, it allows verbal control of machines by the intended user. The main benefit of such an interface is the intuitive and familiar communication system used by humans. The drawbacks lie in the inability of voice analysis systems to deliver real-time interactivity without fast and expensive hardware, in their limited robustness to noise and to the intensity of the input signal, and in heavy memory requirements, because every letter, word or sentence requires its own template.

In our example, memory and hardware requirements won't be a problem, since our experiments will be done on a small sample of input signals and with limited processing. Limited processing is used here both to keep hardware requirements low and to expose the real problems of speech recognition.

II. Speech Recognition

A. Constraints and assumptions

Speech recognition (SR) is a rather young field of study which has attracted many researchers and engineers. Their work has produced many strategies which make command recognition possible. These strategies span many scientific fields, including signal processing, pattern recognition, artificial intelligence, statistics, information theory, computer algorithms, psychology, linguistics and even biology. Still, almost all speech recognition systems (SRS) rely on a set of constraints and assumptions to work correctly [1]:

- Speaker dependence

- Isolated words

- Small vocabulary

- Constrained grammar

These constraints are caused by four deficiencies:

- Lack of a sophisticated yet tractable model of speech

- Inadequate use of human knowledge of acoustics, phonetics and lexical access in the system

- Lack of consistent units of speech that are trainable and relatively insensitive to context

- Inability to account for between-speaker differences and speaker-specific differences

Speaker dependence is regarded as the most difficult constraint to overcome. This is because most parametric speech representations are highly speaker dependent, and a set of reference patterns suitable for one speaker may perform poorly for another speaker.

There are three main approaches to speaker independence. The first approach uses engineering techniques to find perceptually motivated speech parameters which are relatively invariant between speakers. If such parameters can be found, then speaker-independent recognition becomes as easy as speaker-dependent recognition.

The second approach is to use multiple representations for each reference to capture the between-speaker variations. Typically, each word in the vocabulary is uttered by many speakers; these multiple examples are then divided into several clusters, and a prototype is generated from each cluster. Like the engineering approach, the multiple-representation approach produces good results for limited tasks, but has not been successfully extended to large-vocabulary tasks.
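A minimal sketch of this clustering step, assuming each utterance has already been reduced to a fixed-length feature vector and that scikit-learn is available (both are our assumptions, not the paper's):

import numpy as np
from sklearn.cluster import KMeans

def build_prototypes(utterances, n_clusters=3):
    # Cluster many speakers' examples of one word and return
    # one prototype (the cluster centroid) per cluster
    X = np.asarray(utterances)              # shape: (n_examples, n_features)
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(X)
    return km.cluster_centers_              # shape: (n_clusters, n_features)

# Hypothetical usage: 50 example utterances, 13 features each
prototypes = build_prototypes(np.random.randn(50, 13))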

The final approach tries to use knowledge about the speaker by adapting the recognizer to a new speaker. Speaker adaptation begins with an existing set of parameters and a small set of adaptation sentences from the new speaker. Strictly speaking, however, such systems are not truly speaker-independent.

The other three of the four mentioned constraints can be considered as a single group, since they are system dependent rather than speaker dependent.

Today's SRSs are very complex in order to minimize the effects of these constraints, but we are still far from the ultimate, constraint-free SRS.

The performance of speech recognition systems is usually evaluated in terms of accuracy and speed. Accuracy is usually rated with the word error rate (WER), whereas speed is measured with the real-time factor. Other measures of accuracy include the single word error rate (SWER) and the command success rate (CSR). However, speech recognition by a machine is a very complex problem. Vocalizations vary in terms of accent, pronunciation, articulation, roughness, nasality, pitch, volume and speed. Speech is further distorted by background noise, echoes and electrical characteristics [2].

B. SRS models

Modern speech recognition systems are based upon the statistical approach of Bayes' decision rule. The implementation of this rule in an SRS is based on two kinds of stochastic models: the acoustic model and the language model, which together form the basis of the decision process itself, i.e. the search for the most probable sentence [3].

Fig. 1. Architecture of an automatic speech recognition system

The modules of an automatic speech recognition system are characterized as follows:

The acoustic model captures the acoustic properties of speech and provides the probability of the observed acoustic signal given a hypothesized word sequence.

The acoustic model includes the acoustic analysis, which parameterizes the speech input into a sequence of acoustic vectors; acoustic models for the smallest sub-word units, e.g. phonemes, which are usually modeled context-dependently; and the pronunciation lexicon, which defines the decomposition of words into the sub-word units.
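As an illustration of the acoustic-analysis step, the following minimal sketch parameterizes a recording into a sequence of acoustic vectors; the choice of librosa and MFCC features is our assumption, not something the paper specifies:

import librosa

# Load a speech recording (the file name is hypothetical)
signal, sr = librosa.load("utterance.wav", sr=16000)

# Parameterize the waveform into 13-dimensional acoustic vectors,
# one vector per short analysis frame
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)
print(mfcc.shape)   # (13, number_of_frames)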

The language model captures the linguistic properties of the language and provides the a-priori probability of a word sequence. From an information-theoretic point of view, the syntax, semantics and pragmatics of the language could also be viewed as redundancies. Because of the stochastic nature of such redundancies, language models are usually based on statistical concepts.

The search realizes Bayes' decision criterion on the basis of the acoustic model and the language model. This requires the generation and scoring of competing sentence hypotheses. To obtain the final recognition result, the main objective is then to search for the sentence hypothesis with the best score using dynamic programming. The efficiency of the search process is increased by pruning unlikely hypotheses as early as possible during dynamic programming, without affecting the recognition performance.
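A minimal sketch of such pruning, using a simple beam search over partial hypotheses (the scoring function and beam width are our illustrative assumptions, not the paper's):

def beam_search(vocabulary, step_score, max_len=5, beam_width=3):
    # Grow sentence hypotheses word by word, keeping only the
    # beam_width best-scoring partial hypotheses at each step
    beam = [([], 0.0)]                      # (word sequence, log-score)
    for _ in range(max_len):
        candidates = [(words + [w], score + step_score(words, w))
                      for words, score in beam
                      for w in vocabulary]
        candidates.sort(key=lambda h: h[1], reverse=True)
        beam = candidates[:beam_width]      # prune unlikely hypotheses
    return max(beam, key=lambda h: h[1])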

In computer science, the model shown in figure 1 is extended to include adaptation and communication with certain applications and databases.

Fig. 2. Extended SRS architecture [4]

Applications interface with the decoder to obtain recognition results that may be used to adapt other components in the system.

The speech signal is processed in the signal-processing module, which extracts salient feature vectors for the decoder. The decoder uses both the acoustic and the language models to generate the word sequence that has the maximum posterior probability for the input feature vectors. It can also provide the information needed for the adaptation component to modify either the acoustic or the language models so that improved performance can be obtained.

C. Acoustic model

Since the beginning of the speech recognition field there have been many attempts at creating an accurate acoustic speech model. We will not go into detail, but experiments have shown that, out of all those models, the hidden Markov model (HMM) yields the best results so far.


The HMM approach creates stochastic models from known utterances and compares the probability that the unknown utterance was generated by each model. Since we do not know how to choose the form of this model automatically but, once given a form, have efficient automatic methods of estimating its parameters, we must instead choose the form according to our knowledge of the application domain and train the parameters from known data [5].
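A minimal sketch of this compare-against-every-model idea, assuming the hmmlearn package and sequences of acoustic vectors as input (both our assumptions):

import numpy as np
from hmmlearn import hmm

def train_command_model(feature_sequences, n_states=4):
    # Fit one Gaussian HMM to all known utterances of a command;
    # each utterance is a (frames x features) array
    X = np.vstack(feature_sequences)
    lengths = [len(seq) for seq in feature_sequences]
    model = hmm.GaussianHMM(n_components=n_states)
    model.fit(X, lengths)
    return model

def recognize(utterance, models):
    # Pick the command whose model most probably generated the utterance
    return max(models, key=lambda name: models[name].score(utterance))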

D. Language model

The role of language modeling in speech recognition is to provide the value P(W) in the fundamental equation of speech recognition [3]:

\hat{W} = \arg\max_{W} P(W)\, P(X \mid W)    (1)

where P(W) and P(X|W) constitute the probabilistic quantities computed by the language-modeling and acoustic-modeling components of speech recognition systems, respectively.
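Equation (1) translates directly into code. In the following minimal sketch, lm_logprob and am_logprob stand in for the language and acoustic models; both names are hypothetical:

def decode(X, hypotheses, lm_logprob, am_logprob):
    # Return the word sequence W maximizing P(W) * P(X|W),
    # computed in log space for numerical stability
    return max(hypotheses,
               key=lambda W: lm_logprob(W) + am_logprob(X, W))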

One type of language model is a grammar, which is a formal specification of the permissible structures of the language. A traditional, deterministic grammar gives a probability of one if the structure is permissible, and of zero otherwise. With the advent of bodies of text (corpora) that have had their structures hand-annotated, it is now possible to generalize the formal grammar to include accurate probabilities.

Another, more common type of language model is the stochastic language model, which plays a critical role in building a working spoken-language system.
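As an illustration (our example, not the paper's), a bigram model is a simple stochastic language model that estimates P(W) from counts of adjacent word pairs in a corpus:

import math
from collections import Counter

def train_bigram(corpus):
    # Estimate P(w2 | w1) by counting adjacent word pairs
    pairs, unigrams = Counter(), Counter()
    for sentence in corpus:
        words = ["<s>"] + sentence.split()
        for w1, w2 in zip(words, words[1:]):
            pairs[(w1, w2)] += 1
            unigrams[w1] += 1
    return lambda w1, w2: pairs[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

def sentence_logprob(p, sentence):
    # Sum log-probabilities; the small floor avoids log(0)
    words = ["<s>"] + sentence.split()
    return sum(math.log(max(p(w1, w2), 1e-12))
               for w1, w2 in zip(words, words[1:]))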

III. The Matched Filter

So far, this paper has only discussed the general approach to speech recognition systems and the complexity of their design. We will now concentrate on the main topic of the paper, which is to build a very simplified version of an SRS and test some of the aforementioned constraints.

Instead of using any of the models depicted earlier, due to their complexity, we will use a much simpler approach without any acoustic or language models. This is a brute-force approach in which the input signal is scanned for matching templates from a database of signals, commands in our case. The workhorse of this approach is the matched filter.

The matched filter, also known as the North filter, is obtained by correlating a known signal, or template, with an unknown signal to detect the presence of the template in the unknown signal. The matched filter is the optimal linear filter for maximizing the signal-to-noise ratio (SNR) in the presence of additive stochastic noise. Matched filters are commonly used in radar, and two-dimensional matched filters are commonly used in image processing [6].

The matched filter works as follows [7]:

- Vectorize all data (templates and input data)

- Find their dot product

- Examine calculated values

In mathematical terms these steps translate to:

\epsilon = 1 - x^{T} y    (2)

where x is the template and y represents the search set. As x and y become more and more similar, ε tends to zero. This is a good approach if we want to detect the presence of a signal in a database.
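A minimal sketch of this check, assuming equal-length signal vectors and numpy (our tooling choice):

import numpy as np

def mismatch(template, candidate):
    # Equation (2): eps = 1 - x^T y for unit-length vectors;
    # eps near zero means the candidate matches the template
    x = template / np.linalg.norm(template)
    y = candidate / np.linalg.norm(candidate)
    return 1.0 - float(np.dot(x, y))

def best_match(template, database):
    # Hypothetical usage: find the closest command in a database (a dict)
    return min(database, key=lambda name: mismatch(template, database[name]))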

If we wish to find a template within an input signal larger than the template, then we can take all possible dot products of the template and cut-outs of the input signal. Therefore, we get:

d(k) = [\,x(1)\;\ldots\;x(n)\,] \, [\,y(k+1)\;\ldots\;y(k+n)\,]^{T}    (3)

If we write equation 3 in the form of a summation, we get:

d(k) = \sum_{i} x(i)\, y(k+i)    (4)

Examining equation 4 closely, we find it to be the formula for discrete convolution, but with the template time-reversed. By time-reversing the template we no longer have an actual convolution of signals but rather what is called a (cross-)correlation. In the context of the matched filter, equation 4 is simply referred to as the sliding dot product. This works well when working with unit-length vectors. For use with non-unit-length vectors, a simple modification is required: just divide both vectors (template and input data) by their magnitudes.

d(k) = \frac{\sum_{i} x(i)\, y(k+i)}{\lVert x \rVert \, \lVert y_{k} \rVert}    (5)
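A minimal sketch of equation (5), written directly as a loop over cut-outs of the input (our illustrative implementation):

import numpy as np

def sliding_dot_product(template, signal):
    # Normalized cross-correlation of the template against every
    # template-length cut-out of the signal; values near 1 indicate
    # a likely occurrence of the template
    x = np.asarray(template, dtype=float)
    y = np.asarray(signal, dtype=float)
    n = len(x)
    x_norm = np.linalg.norm(x)
    d = np.zeros(len(y) - n + 1)
    for k in range(len(d)):
        cut = y[k:k + n]
        denom = x_norm * np.linalg.norm(cut)
        d[k] = np.dot(x, cut) / denom if denom > 0 else 0.0
    return d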


Fig. 3. Graphical representation of template detection

The sliding dot product, though simple and easy to implement, is not very efficient in terms of computation time and is unsuitable for large data sets.

For large data sets another approach is preferred. Convolution in the time domain (continuous or discrete) maps to element-wise multiplication in the frequency domain, so using the FFT (fast Fourier transform) will speed up the detection process. After the multiplication, taking the inverse FFT of the product returns us to the time domain. One might think that this detour is pointless, but our experiments have shown that using the FFT speeds up the algorithm roughly eight times. This, of course, depends on the implementation of the FFT algorithm, the detection algorithm and some other factors.
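A minimal sketch of the FFT-based detour, using numpy's FFT routines (our tooling choice); time-reversing the template turns the frequency-domain product into a correlation:

import numpy as np

def fft_correlate(template, signal):
    # Reverse the template, zero-pad both signals to a common length,
    # multiply in the frequency domain, and transform back
    n = len(signal) + len(template) - 1
    X = np.fft.rfft(np.asarray(template)[::-1], n)
    Y = np.fft.rfft(np.asarray(signal), n)
    return np.fft.irfft(X * Y, n)

# The peak of the result marks the most likely template location:
# d = fft_correlate(template, signal); k = int(np.argmax(d))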

IV. The Experiment

A. The experiment data

For our experiment we will use voice samples from six different people.

Each person was asked to speak four commands, which were recorded. The commands are familiar to every computer user:

- cut

- copy

- paste

- select all

Out of the four commands, three are single-word commands and one is a compound command, select all. This one was chosen as a special test case.

These test samples will be tested against data consisting of five sentences. Four of those sentences contain one of the test samples somewhere within them, one sample per sentence. The fifth sentence serves as a control sentence containing none of the test samples. The sentences are:

- If you wish to help, please cut the rope to the required length.

- Why does he always copy someone else's homework?

- Don't use that school paste, use real glue instead.

- Thank you for coming. Please select all accessories you wish to add to your purchase.

- You realize this plan has me walking into hell, too. Hah. Just like old times.

Most people asked to record the test samples come from Sarajevo and have similar pronunciation. One person in the group is a foreigner.

B. The algorithm

In this section we give a short review of the algorithm we are going to implement. A complete sketch follows the step descriptions below.

The first step, once the data is loaded, is to reverse the template so that we adhere to equation 5.

The second step is to transfer both signals, the input signal and the template, into the frequency domain using the FFT.

The third step is the multiplication of the FFTs of the signals.

The fourth step is returning to the time domain using the inverse FFT.

The fifth step is finding the peak value in the result d(k), which suggests, but does not guarantee, the presence of the template.

The sixth step involves extracting a sample from the input. That sample must be the same length as the template, and it must be extracted from the location indicated by the peak value found in the fifth step.

The seventh and final step is correlating the template and the extracted sample to confirm the overlap percentage.
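A minimal sketch of the whole seven-step procedure, using numpy's FFT as in section III (the 75% match threshold anticipates section V; the index arithmetic and guards are our illustrative assumptions):

import numpy as np

def detect_command(template, signal, threshold=0.75):
    m = len(template)
    n = len(signal) + m - 1
    # Steps 1-4: reverse the template, FFT both signals,
    # multiply, and return to the time domain
    d = np.fft.irfft(np.fft.rfft(np.asarray(template)[::-1], n)
                     * np.fft.rfft(np.asarray(signal), n), n)
    # Step 5: the peak value marks the candidate location
    k = int(np.argmax(d)) - (m - 1)
    k = max(0, min(k, len(signal) - m))
    # Step 6: extract a template-length sample at that location
    cut = signal[k:k + m]
    # Step 7: a normalized correlation confirms the overlap percentage
    match = np.dot(template, cut) / (np.linalg.norm(template)
                                     * np.linalg.norm(cut) + 1e-12)
    return k, match, match >= threshold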

V. Experiment Results

The results will be presented one experiment at a time.

The first experiment will be conducted using reference data. The templates will be extracted from the input signals themselves to check the validity of the algorithm.

These graphs show the data being processed. The first graph represents the sentence (blue) tested against the templates. The overlapping colors red, green, yellow and teal represent the templates cut, copy, paste and select all, respectively.

Fig. 4. Sentence 1 tested with control samples


Figure 4 shows the results from the first simulation. As expected, the match factor is 100% for the command CUT (red) and 15%-30% for the rest. Everything below the threshold of 75% can be considered not a match. Keep in mind that this is only reference data.

Fig. 5. Results from the first speaker

The results from the second test are disastrous. We do not have a single match. Worse, the match percentage for the command CUT (12.21%) is even lower than those of the other commands (18.2%, 19.4% and 19.39%), which are not even in the sentence, and the algorithm "found" it in the wrong location.

Fig. 6. Results from the second speaker

This second test also confirms the finding from the first non-reference test. The match percentage is low, around 10%, for all commands.

Fig. 7. Results from the third speaker

Fig. 8. Results from the fourth speaker

Fig. 9. Results from the fifth speaker

Fig. 10. Results from the sixth speaker

Just as a side note, another test was performed with a different data set and different templates. In this test, the reference sentences were combined into one continuous file/data set. The templates in this case were the same sentences used to construct the data set, but they were recorded once again.

The test was done with all five templates, but three out of the five tests forced the algorithm out of bounds.

The results from this test are not that different from the test performed earlier with words as templates, but something surprising surfaced.


Fig. 11. Results showing the first viable test

Fig. 12. Results showing the second viable test

These tests produced match percentages of 7.07% and 8.32%, respectively, but the surprising thing is that in the second test, shown in figure 12, the template is placed almost completely within the marked space where the original sentence is located.

Due to our low sample size, this placement of the template over the original signal cannot be called a match, especially if we take into account that the match percentage is well below 75%, but it does make our initial results and assumptions less certain.

VI. Conclusion

Based on the results from figures 5-10 (and figure 11), the only conclusion we can draw is that the matched filter is not a tool for pattern detection in the case of voice recognition. The matched filter is still a very powerful tool when the input signal and the template are generated by the same machine and on the same machine (e.g. radar). In our case we did not do any preprocessing such as noise filtering, but the hope was that the matched filter's property of noise resistance would do the trick. However, figure 12 can be called an anomaly in a good sense, since it shows that the matched filter still has potential to be used for voice recognition. Using the matched filter for voice recognition would require a good amount of preprocessing of the signals, which is not within the scope of this paper.

The tests performed showed the validity of the initial assumption that each speaker gives a different template for the commands; even though we can understand other people, a machine just does not have that ability. Focusing on figures 4 and 5, we can see the difference in the templates from the same speaker. The first set (figure 4) was extracted from the input, hence the 100% match for the appropriate commands, and the second set (figure 5) was recorded separately.

In essence, what these tests have shown is that the field of voice/command recognition is very hard to break into, requiring experts from many fields working together to make a usable system. This would be a very interesting area to investigate and work on in a group and, perhaps, implement all aspects of a basic system.

For all other tests please refer to the GUI supplied with this paper.

References

[1] Kai-Fu Lee, Automatic Speech Recognition: The Development of the SPHINX System, 4th ed., Kluwer Academic Publishers, 1989, pp. 1, 3-5.

[2] http://en.wikipedia.org/wiki/Speech_recognition

[3] http://www-i6.informatik.rwth-aachen.de/web/Research/speech_recog.html

[4] Xuedong Huang and Li Deng, "An Overview of Modern Speech Recognition," Microsoft Corporation, pp. 340-341, 343-344.

[5] D. B. Paul, "Speech Recognition Using Hidden Markov Models," pp. 41-42.

[6] http://en.wikipedia.org/wiki/Matched_filter

[7] http://courses.engr.illinois.edu/cs598ps/CS598PS/Topics_and_Materials_files/Lecture%208%20-%20Detection%20and%20matched%20filters.pdf