Senior Design Final Report
Winter/Spring 2003
University of California, Riverside Department of Electrical Engineering
Voice Command Recognition: ROBOKART
Prepared by Adrian Abordo, Jon Liao
Technical Faculty Advisor: Yingbo Hua
Submitted: June 6th, 2003
June 2003
2
Table of Contents
Introduction

Part 1: Preparation
  1.1 Summary of Speech Recognition Systems
  1.2 Project Overview & Roadmap
  1.3 Project Goals & Specifications

Part 2: Implementation
  2.1 System Overview & Block Diagrams
  2.2 Project Challenges

Part 3: Theory and Algorithms
  3.1 Guiding Philosophy
  3.2 Overview of the Problem
  3.3 Feature Extraction
  3.4 Pattern Detection
  3.5 Run-Time Operation

Part 4: User's Guide
  4.1 General Overview
  4.2 Template Creation
  4.3 Program Execution

Part 5: Post-Project Analysis
  5.1 Performance Evaluation

Part 6: Administrative
  Expenses
  Equipment List
  Glossary
  References

Appendix
  Photographs
  Source Code
INTRODUCTION
This report is intended to provide a detailed walkthrough of our “Robokart”
Speech Recognition, Senior Design Project. Part 1 covers all necessary preparatory
information—overview of past and current approaches in speech recognition technology,
our project summary and roadmap, project goals and initial design specifications. Part 2
covers implementation details: a system overview and block diagrams, the challenges
faced in creating each part of the project, and how they were solved. Part 3 provides a
description of particularly important recognition sub-systems and lays down the
theoretical and mathematical basis for the operation of our algorithms. Part 4 is a User’s
Guide which is meant to teach any interested individuals how to use the software
included in this report, how to set the various parameters, and how to prepare a speech
recognition session using our current system. Part 5 provides performance evaluation
information, as well as future improvements that need to be made. Part 6 provides
logistical information: project expenses, a parts list, glossary, and references. Our source
code is included in the Appendix at the end of the report.
Project Results at a Glance:
Primary Objective (Recognition system works): Operational, in testing
Secondary Objective (Robokart car runs): Incomplete, on hold
Tertiary Objective (System flexible & adaptable): Achieved, in testing
PART ONE
1.1 Summary of Speech Recognition Systems
Major interest in Automatic Speech Recognition (ASR) had its origins during the
early part of the Cold War, when the U.S. Government needed an automated system that
could analyze and translate intercepted Russian radio transmissions. When initial
government efforts failed to produce a reliable system, the department known today as
the Defense Advanced Research Projects Agency funded programs at top academic
institutions in the country to provide research information needed to make a system
viable, jumpstarting research and commercial efforts that continue to this day.1
In the 1950s, Bell Labs created a recognition system capable of identifying the
spoken digits 0 through 9, while later systems were able to recognize vowels. During the
1960s, users . . . were . . . required . . . to . . . pronounce . . . each . . . word . . . separately.
Eventually, continuous speech systems that accepted naturally spoken sentences became
available in the 1970s. Ironically, modern systems that utilize Hidden Markov Models
(HMMs) work better with continuous sentences than with discrete words.
Today speech recognition systems are commonplace and find applications in
automatic dictation, simple instrument and computer control, personal identification, and
toys. IBM’s ViaVoice software, voice-activated dialing in cellphones, OTG’s SecurNT
voice authentication system, and the Sony Aibo are products that utilize speech
recognition. However, anyone who has ever used an ASR system can attest that modern
systems work reasonably well, but can never recognize speech with the same accuracy
and robustness that human beings can. Even though speech recognition has made great
advances over several decades, the pinnacle of speech recognition ability—mimicry of
human performance—has yet to be reached.
Certain aspects of human speech and hearing make successful machine-
recognition extremely difficult. For starters, psycho-acoustical experiments showed that
human hearing has a nonlinear frequency and intensity response, and moreover, those
responses are coupled. Before delving into details, let’s first review how human hearing
works.
The human hearing system detects tiny variations in air pressure and converts
those changes into nerve signals which the brain converts into sensations of loudness and
pitch. Air pressure for voiced sounds is typically measured in units of micropascals
(µPa) or microbars (µbar) on a decibel scale. The energy per unit area imparted by the
changes in air pressure is called intensity, and is very roughly (but not directly) related to
loudness. It is important to point out that loudness is a sensation and is not an actual
physical quantity, whereas intensity is. Therefore, the answer to the classic question, “If a
tree falls in a forest, and there is nobody there to hear it, does it make a sound?” would
be: No it does not make a sound, but it would probably make some high-intensity air
pressure changes. Generally speaking, higher intensities of fluctuation result in our ears
perceiving a sound that is louder, as long as certain conditions are met.
The second quality related to hearing is the frequency of vibrations. What we
perceive as pure tones are air-pressure changes which fluctuate at a single frequency.
Human voice, far from producing pure tones, imparts pressure changes at a multitude of
different frequencies, which the inner ear is designed to detect. The classic
theory used to explain our ability to recognize frequencies (pitch) says that tens of
thousands of tiny hair-cells arranged carefully in the cochlea act like small bandpass
filters that are individually tuned to a particular frequency. When a particular frequency
of vibration hits the ear, the hair cell(s) that are tuned to that frequency will fire nerve
impulses. The brain then uses these patterns of firing to estimate the frequency content of
the signal. The frequency content of a sound is primarily what the brain relies on to
recognize speech, and it is for the most part very good at detecting different
frequencies—up to a point.
Psychoacoustic experiments performed in the 1930s by Fletcher and Munson showed
that perceived loudness is not simply a function of intensity, but also of the sound’s
frequency:
Figure 1: Equal-loudness contours (after Fletcher and Munson)
Figure 1 shows that tones that have the same intensity but oscillate at different
frequencies can be perceived with different loudness. The bottom curve shows that a 10
dB sound oscillating at 1000 Hz sounds just as loud as a more powerful 76 dB sound at
100 Hz.
Not only does human hearing have a frequency-intensity tradeoff, but we also
perceive some frequencies better than others. The bandwidth of human hearing is from 20
Hz through 20 kHz, with peak sensitivity around 3-4 kHz. Sounds that fall outside the
peak range are gradually attenuated and are not perceived too well. (Interesting side note:
3-4 kHz is roughly the frequency range that babies cry at). Figure 2 shows an A-weighted
approximation of the frequency response of human hearing. Peak perception
between 2-4 kHz and gradual attenuation outside that range can easily be seen.
Figure 2: A-weighted approximation of the frequency response of human hearing
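The A-weighted response follows a standard closed-form approximation. The sketch below (Python, for illustration only; the project's own code was written in MATLAB) evaluates the IEC A-weighting formula and reproduces the behavior described above: roughly 0 dB at 1 kHz, a slight boost near 2-4 kHz, and strong attenuation at low frequencies.

```python
import math

def a_weight_db(f):
    """IEC 61672 A-weighting gain, in dB, at frequency f (Hz)."""
    f2 = f * f
    ra = (12194.0**2 * f2**2) / (
        (f2 + 20.6**2)
        * math.sqrt((f2 + 107.7**2) * (f2 + 737.9**2))
        * (f2 + 12194.0**2)
    )
    # The +2.0 dB offset normalizes the curve to 0 dB at 1 kHz.
    return 20.0 * math.log10(ra) + 2.0

print(round(a_weight_db(100), 1))     # -19.1 dB: low frequencies attenuated
print(round(a_weight_db(1000), 1))    #   0.0 dB: reference frequency
print(round(a_weight_db(3000), 1))    #  +1.2 dB: boost near peak sensitivity
print(round(a_weight_db(10000), 1))   #  -2.5 dB: rolloff at high frequencies
```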
The last property worth mentioning is the frequency resolution of human ears.
Two tones of the same intensity, played at the same time but at two different frequencies
may be perceived as being only one tone if their frequencies are too close to each other, a
condition known as masking. Therefore, if two sounds are too close to each other in
frequency, one sound may mask the other, and the second sound may not be perceived at
all. In order for sounds to be successfully identified, they must be separated in frequency
by a minimum distance called the “critical bandwidth” (CBW). To make matters worse,
the critical bandwidth of hearing is not a constant. It depends on the center frequencies
that each part of the cochlea is tuned to detect. It has been found, for example, that the
group of hair cells tuned to a center frequency of 1000 Hz has a smaller bandwidth
(higher resolution) than hair cells tuned to 8000 Hz (which have poorer resolution).
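The dependence of critical bandwidth on center frequency can be made concrete with Zwicker's empirical fit (an assumption on our part; the report does not say which fit it relied on):

```python
def critical_bandwidth_hz(f):
    """Zwicker's empirical fit for critical bandwidth (Hz) at center frequency f (Hz)."""
    f_khz = f / 1000.0
    return 25.0 + 75.0 * (1.0 + 1.4 * f_khz ** 2) ** 0.69

print(critical_bandwidth_hz(1000))   # roughly 160 Hz
print(critical_bandwidth_hz(8000))   # roughly 1700 Hz
```

The fit agrees with the comparison above: hair cells tuned near 1000 Hz resolve frequency roughly ten times more finely than those tuned near 8000 Hz.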
The problems that automatic speech recognition systems face can finally be
explained. For starters, one of the main points of this section is that what we perceive as
sound is really not tied to any single physical property. A great deal of biological and
psychological conditioning takes place between the time the pressure waves first hit our
eardrums and before they are “perceived,” so much so that what we finally sense is quite
different from the physical event that generated it. The intensity and frequency
dependence of human hearing is merely the tip of the iceberg—our brains also utilize a
great deal of information based on timing, duration, learning, and contextual cues.
Modern ASRs vs. Humans:
Microphones, the entry point of practically all current automatic speech
recognition systems, are very poor substitutes for the human ear, as they are limited to
simply measuring the voltages generated by changing air pressures. The electrical signals
of a microphone are not to be confused with the nerve signals that our brains utilize to
recognize sound, mainly because sounds go through many stages of transduction in the
human ear before they are finally processed by the brain. Our inner ear detects frequencies
instantaneously and in parallel, whereas computers utilize the Short-Time Fourier
Transform to estimate the spectral content of a signal, something which can neither be
done instantaneously (a window of samples first needs to be taken) nor in parallel (one
window after another needs to be extracted before frequency changes can be estimated).
Table 1 summarizes the relationship between artificial and human systems, highlighting
the essential differences between the two.
Table 1

Human Beings:
1. Instantaneous detection of frequencies.
2. Frequency measurements are precise.
3. Neural processing is done in parallel.
4. The brain is very good at pattern detection.
5. Humans can recognize words in a wide variety of conditions, environments, and interference.
6. Humans have other sources of information, such as sight or context, to understand what was said.

Artificial Systems:
1. Reliance on the Fourier Transform.
2. Window length dictates precision.
3. Computer processing is sequential.
4. Computers are not good at pattern detection.
5. Artificial recognizers are only as robust as the models they are built on.
6. ASRs can only rely on signals generated by a microphone.
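The window-length limitation noted in Table 1 can be illustrated numerically. The sketch below (Python for illustration; the project's code was MATLAB) analyzes one window at the project's 22.05 kHz sampling rate and shows that frequency can only be resolved to the bin spacing fs/N:

```python
import numpy as np

fs = 22050                 # project's sampling rate (Hz)
n = 512                    # analysis window length (samples)
t = np.arange(n) / fs
x = np.sin(2 * np.pi * 1000 * t)          # 1 kHz test tone

# Windowed FFT of a single frame: the machine's view of the spectrum.
spectrum = np.abs(np.fft.rfft(x * np.hanning(n)))
peak_hz = spectrum.argmax() * fs / n      # bin spacing = fs/n ≈ 43 Hz

print(peak_hz)   # ≈ 990.5 Hz: the closest bin to the true 1000 Hz tone
```

A longer window narrows the bin spacing but delays the result, which is exactly the precision-versus-latency tradeoff the table alludes to.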
To counteract the failings inherent to computers, a variety of signal processing
techniques and algorithms have been employed to bring machines up to par with their
human counterparts. These algorithms are generally aimed at emulating some known
properties of human recognition. DSP “front-ends” are algorithms meant to simulate the
hearing process and try to account for such observed properties as critical bandwidth and
nonlinear frequency response, and more importantly are designed to measure features in
the speech signal (power, rate of change, spectrum, cepstral coefficients, etc.) that are
thought to play an important role in humans’ recognition of speech. A graphical overview
of a class of front-ends that utilize spectral analysis is shown in Figure 3 below.
Figure 3: Overview of spectral-analysis front-ends
“Back-ends” typically work in conjunction with front-ends and employ higher
level algorithms such as pattern recognition, statistical prediction, and neural networks to
make sense of the features that have just been extracted by the front-end. Figure 4 shows
the processing sequence involved in the creation of a speech recognition template starting
with front-end systems and ending with a back-end model.
Figure 4: Processing sequence for speech recognition template creation
With the exception of the Fourier Transform Filter Bank Model (which will be
covered in greater detail in Part 3), explanation of the other systems shown in Figures 3
and 4 is far beyond the scope of this report. For implementation details, interested readers
are strongly encouraged to read Joseph Picone’s paper on “Signal Modeling Techniques
in Speech Recognition.” 2
In the broadest sense, a speech recognition system can either be speaker-
dependent or speaker-independent. The early systems were speaker-dependent: they
were tailored to recognize the voice of a single person, or at most a handful of people,
mainly because those speakers' voice templates were the only training the system
had from which to perform recognition tasks. Security and authentication systems that use
a person's voice to perform personal identification are speaker-dependent systems. The
current trend in speech recognition is toward speaker-independent systems, which are
designed to recognize words from a wide variety of speakers.
Recognition systems can also operate at a variety of levels. Small-vocabulary
systems can operate with a vocabulary of roughly 100 words and can be used for
individual letter and digit dictation. Medium-vocabulary systems are capable of
recognizing around 1000 words, and large-vocabulary systems are meant to recognize
more than 5000 words. Additionally, systems can also be tailored to work at the sentence,
word, syllable, or allophone level. The sentence level, as the name implies, is a collection
of words that are grouped together into a single unit of meaning, such as, “Open the
garage door.” Word-level systems attempt to recognize spoken input through each
individual word, such as “Open” “the” “garage” “door.” Syllable-level systems operate at
a lower level and are designed to recognize syllables: “O” “pen” “the” “ga” “rage”
“door.” Finally, allophone-level systems work at even smaller individual units of speech:
O-p-e-n th-e g-a-r-a-g-e d-oo-r (in practice, though, triphones, rather than individual
allophones, are used. See below).
If a system operates at a lower level of recognition, it can theoretically possess a
larger vocabulary using a smaller set of templates (which is desirable for flexibility and
memory reasons). For example, there are an infinite number of sentences in the English
language, but those sentences utilize roughly 500,000 words, which in turn are built upon
approximately 1,000 syllables, which are composed of about 47 phonemes. If a
recognition system were designed to recognize 47 phonemes (and can recognize them
accurately!), it can theoretically recognize all the words in the English language if it were
also programmed with information on how to chain those individual units together. In
practice, however, ASR accuracy decreases at lower levels of recognition because it has
fewer and shorter samples from which to make a decision. Longer units of meaning (like
sentences) are easy to differentiate from each other because there are more phonetic
features to use for comparison but are necessarily limited in the number of units they can
recognize. Whereas smaller units can be combined to form some very large vocabularies,
they are also harder to differentiate. A computer, for example, might have a difficult time
distinguishing between the vowel sound in “said” versus the vowel sound in “head.”
Therefore, practical voice dictation software utilizes triphones (sets of three allophones
chained together) as a compromise between large vocabulary and accuracy.
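A quick count shows why triphones are a workable compromise (using the unit counts quoted above):

```python
phonemes = 47        # approximate number of English phonemes
words = 500_000      # approximate number of English words

# Upper bound on distinct triphones: ordered triples of allophones.
triphones = phonemes ** 3
print(triphones)     # 103823
```

Even the full triphone inventory is a fraction of the word inventory, yet each unit carries far more phonetic context than a lone allophone.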
The most popular large-vocabulary speech recognition systems today use Hidden
Markov Models that utilize the statistical probabilities of a sound showing up in a
particular word, as well as the probabilities of that sound transitioning into different
allophones. They have been quite successful but require a large database of words spoken
by a wide variety of people from which statistical and probabilistic information is
extracted.
1.2 Project Overview & Roadmap
Our senior design project focused on building a small-vocabulary (24-word) voice
command recognition system to be used in directing the movements of a small “robokart”
in real time. In choosing this project, we were faced with the task of creating a robust and
accurate system that is intelligent enough to recognize spoken word-level commands and
operate with reasonable speed. The commands we used were as follows:
1) Robokart    2) Rotate     3) Clockwise   4) Counterclockwise
5) Proceed     6) Forward    7) Backward    8) Stop
9) Turn        10) Left      11) Right      12) Speed
13) Up         14) Slow      15) Down       16) Dance
17) Charge     18) Retreat   19) Good       20) Bad
21) Go         22) To        23) Sleep      24) Wake
To demonstrate our results, we designed a system to physically display the
obtained solutions via a remote-controlled car. Thus, the project consisted of two main
parts: the software component built to carry out the actual speech recognition and the
hardware component (Robokart) that was planned to carry out the spoken
commands.
Using a microphone attached to a personal computer, we obtained voice samples
for template creation and training. We processed the training input using algorithms
written in MATLAB 6.5 and automated the acquisition and analysis of data in real-time
using Matlab’s Data Acquisition Toolbox v2.2. We used mean zero-crossings, mean
power, and a 24-Bark filter bank to build the feature-space from which to perform
recognition.
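One plausible reading of the two time-domain features can be sketched in a few lines (Python here for illustration; the project's functions were written in MATLAB, and its exact definitions may differ):

```python
import numpy as np

fs = 22050                            # sampling rate used by the project
t = np.arange(2205) / fs              # 0.1 s of samples
x = np.sin(2 * np.pi * 1000 * t)      # stand-in for a voiced frame

def mean_zero_crossings(frame):
    """Average number of sign changes per sample over the frame."""
    signs = np.sign(frame)
    return np.sum(np.abs(np.diff(signs))) / (2.0 * len(frame))

def mean_power(frame):
    """Average signal power over the frame."""
    return float(np.mean(frame ** 2))

print(mean_zero_crossings(x))   # ≈ 0.0907 (= 2 * 1000 / 22050 for a 1 kHz tone)
print(mean_power(x))            # ≈ 0.5 for a unit-amplitude sinusoid
```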
The analog speech input, after being processed by our speech recognition
program, was planned to produce a binary 8-bit output to be transmitted through the
serial port of a personal computer to the Motorola MC68HC11 microcontroller, which is
embedded on the CME119-EVBU evaluation board. The binary commands from Matlab
would be output through the PC's serial port interface and sent to the evaluation board
via the 68HC11's asynchronous serial communication interface (SCI). The 68HC11 in
turn was connected to the transmitter of a remote-controlled car, from which it would
have directed the car's movements.
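The serial link amounts to one byte per recognized command. A minimal encoding sketch follows; the mapping is hypothetical, since the report does not specify which byte value was assigned to each word:

```python
# Hypothetical mapping of the 24 command words to 8-bit codes,
# in the order they are listed in the report.
COMMANDS = [
    "Robokart", "Rotate", "Clockwise", "Counterclockwise",
    "Proceed", "Forward", "Backward", "Stop",
    "Turn", "Left", "Right", "Speed",
    "Up", "Slow", "Down", "Dance",
    "Charge", "Retreat", "Good", "Bad",
    "Go", "To", "Sleep", "Wake",
]

def encode(word):
    """Return the single byte that would be written to the serial port."""
    return bytes([COMMANDS.index(word)])

print(encode("Stop"))   # b'\x07' (command index 7 in this hypothetical scheme)
```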
Our project was divided into the following stages:
1.) Research speech recognition technology and methods.
2.) Acquire parts needed to build initial voice training set (PC microphone, voice
recording program).
3.) Compile the training set via Matlab and incorporate the sounds into a library
which will be used as the starting point for template creation.
4.) Acquire the toolboxes needed to perform signal processing and real-time
acquisition (Signal Processing Toolbox and Data Acquisition Toolbox).
5.) Extract features from the sound library and use them to build the reference
template on which recognition will be based.
6.) Create the actual software engine which will work in real-time, process new
inputs, compare them to the template, and output commands through the serial
port.
7.) Acquire and modify the circuitry of a toy remote-controlled car.
8.) Interface the 68HC11 to the car’s remote control transmitter and write programs
needed for the serial communication interface.
9.) Test the speech recognition engine.
10.) Test the microcontroller-to-transmitter interface.
11.) Test the microcontroller-to-serial port interface.
12.) Integrate the speech recognition engine with 68HC11 microcontroller.
13.) Test the final product.
1.3 Project Goals & Specifications
We wanted to develop a speech recognition system that was flexible, but not
necessarily speaker-independent (for speaker-independence is a very hard thing to
achieve). Since the speech recognition system is geared towards the control of an
instrument, we placed particular importance on accuracy and robustness, envisioning that
this system could one day be incorporated into the hands-free control of certain devices
and instruments (microwave oven, car dashboard, non-critical airplane flight controls).
On a personal level, we wanted a system that relied on our own ingenuity and originality.
We therefore used current speech recognition approaches as a starting point for building a
knowledge base, but we tried not to copy approaches that are already known to work. In
short, we wanted to come up with a speech recognition system that was, for the most part,
original and new (at least to us).
Minimum performance specifications are listed below. Highest priorities are listed
first:
1.) Recognition catch rate (RCR) of at least 50 percent (i.e. the spoken word should
appear in the “Recognized Word Buffer” for at least half the duration that it was
spoken).
2.) Ability of the recognition system to operate without physical prompting (control
should be achieved through voice alone, without requiring the user to press keys,
signal an intent to speak, or to issue a record command by pressing some button.
System, once started, must be completely hands-free).
3.) Ability to perform word-recognition within 1 second of when the person has
finished speaking (near real-time performance).
4.) Ability for the engine to output serial-port commands as soon as a recognized
word has been detected (again, within 1 second).
5.) Ability to physically carry out those commands quickly (this relies on the
microcontroller’s ability to receive and interpret the serial commands within 1
second).
6.) Total system lag from voice-input to Robokart response of no more than 3
seconds (it would be disastrous to say "Stop," only to have Robokart respond too slowly
and plow into a wall).
7.) Be able to run the recognition engine for an indefinite length of time without it
running out of memory.
PART TWO
2.1 System Overview & Block Diagrams
A word about notation: some blocks will have a number written inside them. This
indicates that the block is composed of several sub-components which are not visible in
that particular diagram but which will be covered in more detail later on. For example,
“Block 1.4” indicates that a particular block is the fourth sub-component of “Block 1.”
The number notation allows you to track the hierarchical relationships between systems.
Figure 5: Top-level system block diagram
0: The sound card was set to a sampling rate of 22.05 kHz and to single-channel
(mono) acquisition of sound.
1: All the speech processing takes place in Matlab, using functions written from
scratch. The bulk of the senior design project was spent creating, coding, and
testing the algorithms used in this component.
2: The 68HC11 is the middleman between the software and the actual Robokart car.
Figure 6: Speech recognition engine (Block 1)
1.1: The Data Acquisition Engine (DAQ) is a Matlab toolbox that captures data in real
time. In this case, the engine is instructed to pull data from the computer’s sound
card.
1.2: The feature extractor contains all the functions needed to extract relevant features
from the user’s voice. The output is a feature vector which is fed to the
comparator.
1.3: The comparator is the heart of the speech recognition engine. It finds the most
likely match between the user's input features and the features that are stored
in memory. Using relationships that have been trained into the reference template,
the comparator determines the most likely word that has been spoken.
1.4: The reference template is the brain of the speech recognition engine. It contains
three sub-templates which are crucial for determining what parts of a word have
been said: the word-association matrix, the reference feature vectors, and the
reference time positions.
1.5: Serial port output is an 8-bit binary signal that is sent to the 68HC11 and contains
control instructions.
Figure 7: Data Acquisition Engine (Block 1.1)
1.1.1: Using information stored in “trigger condition,” the trigger detector constantly
scans the digital input stream, looking for an event that would initiate data-
logging.
1.1.2: In order to differentiate background noise from a user attempting to speak, the
trigger condition is set to a band of amplitudes within which the intrinsic noise of
the system is expected to stay. If the microphone detects a voltage level that
leaves this pre-defined band, the trigger detector is instructed to begin logging
samples.
1.1.3: The data-logging system is pre-set to record 0.5 seconds worth of samples every
time the trigger is activated. Too short a value means data might be accidentally
missed. Too long a value slows down the recognition system’s response time.
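A minimal sketch of the trigger-and-log behavior described in 1.1.1 through 1.1.3 (Python for illustration; the noise-band width below is a hypothetical value, not the one used in the project):

```python
import numpy as np

FS = 22050                   # sampling rate (Hz)
NOISE_BAND = 0.02            # hypothetical |amplitude| band the idle system stays within
LOG_SAMPLES = FS // 2        # 0.5 s of samples logged per trigger (1.1.3)

def find_trigger(stream):
    """Index of the first sample that leaves the noise band, or None (1.1.1-1.1.2)."""
    hits = np.flatnonzero(np.abs(stream) > NOISE_BAND)
    return int(hits[0]) if hits.size else None

def log_on_trigger(stream):
    """Return 0.5 s of samples starting at the trigger, if one occurred."""
    start = find_trigger(stream)
    if start is None:
        return None
    return stream[start:start + LOG_SAMPLES]

# Quiet background noise followed by a burst of "speech".
rng = np.random.default_rng(0)
stream = 0.002 * rng.standard_normal(FS)      # 1 s of low-level noise
stream[8000:8000 + LOG_SAMPLES] += 0.5 * np.sin(
    2 * np.pi * 300 * np.arange(LOG_SAMPLES) / FS)

segment = log_on_trigger(stream)
print(len(segment))   # 11025 samples = 0.5 s of logged sound
```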
Figure 8: Feature extractor (Block 1.2)
1.2.1: As an additional safeguard against sudden bursts of noise, the power discriminator
ensures that the signal coming in from the Data Acquisition Engine is of sufficient
power to preclude it from being a random hardware event. Signals whose power
exceeds one and a half times the baseline magnitude are assumed to originate from
the microphone and not from intrinsic noise.
1.2.2: The Data Acquisition Engine always records 0.5 seconds worth of sound every
time it is triggered. Because the user may not be speaking during that entire
period of time, the pure sound extractor passes on portions of the data where it
believes somebody is speaking and deletes those portions where there is silence.
1.2.3: Sound samples that have been extracted and purified are temporarily kept in a
buffer for subsequent processing. However, since the system is very fast, the
partitioned sound buffer is mainly there to allow the user to hear the quality of the
sound that has just been recorded for troubleshooting purposes.
1.2.4: Feature intervals are chunks of samples over which features are measured and
compared. The feature-length parser divides the sound vector into portions
of desired length (often with a sliding window) and passes the portions to the
feature extraction functions.
1.2.5: The mean zero-crossings counter counts the average number of times that the
microphone’s amplitude signal crosses the zero axis over a certain interval. The
number of zero-crossings is a time-domain feature that we believe might be useful
in characterizing certain sounds.
1.2.6: The average signal power over an interval is another time-domain feature that we
employ to help determine what word has been said.
1.2.7: The filter bank is based on a Holmes filter bank architecture and consists of 24
triangular bandpass filters arranged nonlinearly on the frequency axis. Each filter
measures the spectral power within its range and outputs the log10 of that power
amplitude. The filter bank outputs a 24-element power spectrum vector which
plays a very big role in characterizing speech for our system.
1.2.8: The feature vector is a 26-element vector that combines the time-domain and
frequency-domain feature values extracted from each segment of sound. It is sent
to the comparator for analysis.
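A sketch of the filter bank and feature vector assembly described in 1.2.5 through 1.2.8. The Hz-to-Bark conversion and the one-Bark triangular spacing are assumptions on our part (the report does not give the exact filter edges), and Python is used for illustration in place of the project's MATLAB:

```python
import numpy as np

def hz_to_bark(f):
    """Zwicker's Hz-to-Bark approximation."""
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def bark_filter_bank(power_spectrum, fs, n_filters=24):
    """Log10 power in 24 triangular filters spaced evenly on the Bark axis.

    Assumed layout: filter centers at 1..24 Bark, each triangle
    spanning one Bark to either side.
    """
    n_bins = len(power_spectrum)
    freqs = np.arange(n_bins) * (fs / 2.0) / (n_bins - 1)
    barks = hz_to_bark(freqs)
    out = np.empty(n_filters)
    for i in range(n_filters):
        center = i + 1.0
        weights = np.clip(1.0 - np.abs(barks - center), 0.0, None)
        out[i] = np.log10(np.sum(weights * power_spectrum) + 1e-12)
    return out

def feature_vector(frame, fs, zcr, power):
    """26-element vector: 2 time-domain features + 24 Bark-band log powers."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    return np.concatenate(([zcr, power], bark_filter_bank(spectrum, fs)))

fs = 22050
frame = np.sin(2 * np.pi * 1000 * np.arange(512) / fs)
fv = feature_vector(frame, fs, zcr=0.09, power=0.5)
print(fv.shape)   # (26,)
```

For the 1 kHz test frame, the strongest filter output falls near 8.5 Bark, which is where Zwicker's formula places 1000 Hz.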
Figure 9: Comparator (Block 1.3)
1.3.1: The comparator treats the output from the feature extractor as a 26-dimensional
position vector. This component calculates the distance between the current input
position and all positions stored in the reference template.
1.3.2: The resulting distance vectors are sorted from low to high, allowing the most
promising template candidates (those whose positions are the closest to the
current input) to be placed near the top of the stack.
1.3.3: The desired number of possible word candidates is extracted from the top of the
distance stack.
1.3.4: Using a lookup table, the word-groups whose features match the current
best candidates are stored in memory.
1.3.5: The relative position within each word is estimated by a position lookup-table
which matches taught feature vectors with the relative positions within the word
that they are known to occur at.
1.3.6: The identity and position of each recognized word fragment is stored in a large
matrix where trends can be detected over time.
1.3.7: The chain detector looks for those word fragments that occur repeatedly and
consistently and checks to make sure that each candidate word position follows
the forward flow of time. It scores each word fragment according to its
consistency and causality.
1.3.8: Based on the chain detector’s recommendation, the candidate word is chosen as
soon as meaningful input from the user is deemed complete.
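The comparator's distance-and-sort stages (1.3.1 through 1.3.3) and the chain detector's causality test (1.3.7) can be sketched as follows. The function names and the toy causality score are our own illustration in Python, not the report's implementation:

```python
import numpy as np

def best_candidates(input_vec, template_vecs, k=3):
    """Distances from the input to every stored reference vector, sorted
    low to high; return indices and distances of the k closest (1.3.1-1.3.3)."""
    dists = np.linalg.norm(template_vecs - input_vec, axis=1)
    order = np.argsort(dists)
    return order[:k], dists[order[:k]]

def causality_score(positions):
    """Fraction of successive word-position estimates that move forward in
    time -- a toy version of the chain detector's causality test (1.3.7)."""
    steps = np.diff(positions)
    return float(np.mean(steps >= 0)) if len(steps) else 0.0

# Toy template: 4 stored 26-dimensional reference feature vectors.
rng = np.random.default_rng(1)
template = rng.standard_normal((4, 26))
query = template[2] + 0.01 * rng.standard_normal(26)   # input near reference 2

idx, d = best_candidates(query, template, k=2)
print(idx[0])                                   # 2: the closest reference
print(causality_score([0.1, 0.3, 0.6, 0.9]))    # 1.0: positions flow forward
```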
Figure 10: 68HC11 interface (Block 2)
2.1: A short program burned into EEPROM that sets the necessary Serial
Communication Interface control registers, checks the SCI status registers through
polling, and stores 8-bit data that has been received.
2.2: Based on the data received by the SCI, it outputs the programmed movement
signals to the radio-control transmitter connected to Port D.
2.2 Project Challenges
The feature extractor went through a great many changes over a long period of
time, mainly because we were not sure which features were critical in
recognizing human speech. Initially, we relied entirely on 8 time-domain features: mean
zero-crossings, mean sound power, mean diffcode (binary backward-looking difference),
mean variability (average positive and negative peaks over an interval), zero-crossings
delta (multiplicative change from one interval to the next), power delta, diffcode delta,
and variability delta. Subsequent testing showed that most of these time-domain features
were unsuitable for our application. All signals had widely differing power changes,
making their power deltas an unreliable and unpredictable measure. Mean variability and
mean diffcode, due to their reliance on binary backward difference, were much too easily
corrupted by noise, and by extension, so too were their deltas. In the end, we decided to
keep average zero-crossings and average power since our tests showed an acceptable, but
small, correlation between their values and words that were spoken.
Selection of proper frequency-domain features also proved troublesome. We
initially relied on a “top N approach,” where the power-spectral-density (PSD) of each
input interval was measured, the top N strongest peaks were extracted (we used N = 30 at
this point), and those top N peaks’ values and locations were compared with peaks stored
in the template. We discovered that finding the top N peaks slowed down our programs
considerably, and the computations needed to evaluate their closeness to the template in
magnitude and in frequency almost brought the recognition engine to a crawl at run-time.
We therefore endeavored to find a feature set that would give us a set of single numbers,
rather than a set of magnitude-frequency pairs, which became very unwieldy over time.
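The abandoned "top N" scheme can be sketched as follows. This is an illustrative reconstruction in Python/NumPy (the project itself was written in MATLAB), and the name `top_n_peaks` is invented for this example, not taken from the project code:

```python
import numpy as np

def top_n_peaks(frame, n=30, fs=22050):
    """Return the n strongest PSD bins as (frequency, magnitude) pairs.

    Illustrative reconstruction of the report's early "top N" feature set.
    """
    spectrum = np.abs(np.fft.rfft(frame)) ** 2      # power spectral density (periodogram)
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    order = np.argsort(spectrum)[::-1][:n]          # indices of the n largest bins
    return list(zip(freqs[order], spectrum[order]))

# A 1 kHz tone: its bin should dominate the peak list.
t = np.arange(706) / 22050.0
frame = np.sin(2 * np.pi * 1000 * t)
peaks = top_n_peaks(frame, n=5)
```

Even in this toy form, every interval yields a list of magnitude-frequency pairs rather than single numbers, which is exactly what made the approach unwieldy to compare against templates at run-time.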
During further research, we stumbled across early attempts at emulating
the frequency response of the human cochlea. Psychoacoustic researchers discovered that
the ear had a minimum resolution below which it could not differentiate sounds, and that
some sounds were perceived as being louder than others based on their frequency.
Keeping these characteristics in mind, early researchers started with analog bandpass
circuits tuned to certain frequencies and used these “filter banks” to decompose the
spectral content of an input sound in a way, it was assumed, that the human ear roughly
worked. We also discovered that the outputs of these filter banks (a vector of power
amplitudes) were used by neural network researchers as the inputs to their systems.
Since we were interested in solving the speech recognition riddle through an artificial
intelligence approach, we adopted the most common filter bank model—the Holmes filter
bank—and made it part of the spectral feature extraction process.
Subsequent challenges involved optimizing our code to accept massive amounts
of data, compare them to an even more monstrous set of data, perform complex iterative
calculations on huge matrices, sort those matrices, extract the best candidates from each
matrix, and implement some meaningful scoring and weighting scheme, all quickly
enough to run in near real-time. Our early data structures were
very inefficient and the results were disastrous. At one point, it took our 2.53 GHz
machine almost 5 minutes to analyze a 0.04-second sound fragment. We trimmed the
number of features to use (which was another reason why 8 initial time-domain features
were cut down to a mere 2) and attempted to make every portion of our code as
vectorized as possible, since MATLAB works best with matrices rather than loops. It took
almost two weeks, but we got our code running at a speed that resembled real-time.
An even bigger challenge, one that almost derailed the entire project, involved the
actual word-recognition process itself. Our original template data was unreliable, so we
were extracting statistical information that was very misleading. Our early attempts at
recognition produced almost random results because the statistical occurrences of each
word fragment were being misrepresented. We corrected the problem by writing a pair of
comprehensive “intra-entity” and “cross-entity” association functions which found the
similarities within and across words. The “association matrices” are the outputs of those
functions and form the core of our system.
When configuring ROBOKART for asynchronous SCI, we came across some
barriers. Deciding which electronic RC car to use was easy: the Tyco Fast Traxx was the
more versatile of our options and the easier to modify. The only real problem in linking
the two software systems together (MATLAB and IAR Embedded Software) was getting
the COM port to accept signals from MATLAB. The COM1 port was perpetually busy
once we loaded our MCU C code. The hardware setup was configured to take output
responses from the MCU on Port D, but we never had a chance to obtain the input from
MATLAB. We therefore needed more time to get the communication between the
devices working properly.
PART 3
3.1 Guiding Philosophy
Automatic speech recognition is essentially just a very complicated template-
matching problem. On one end, you have some set of predefined values that are
associated with a particular word, and on the other end, you have an input that can vary in
a multitude of ways. The task of matching the two sets of values has been, and still is, a
daunting task, especially when there are many other competing templates involved in the
matching process. We have discovered in the past two quarters that speech recognition—
attempting to match an ever-changing signal to an essentially stationary one—is an
exercise in controlling chaos.
Despite the enormous challenges, there were a set of guiding principles which
were helpful to us and formed the basis for our design approach:
1.) When faced with an ambiguity—when a clear choice for a word has not
been recognized—rather than assume that nothing has been said, the
human brain will almost always fall back to the next closest match.
2.) The key to successful speech recognition does not depend on finding the
magical set of features that absolutely differentiates a word from other
sounds, but rather lies in how those words are perceived by the listener.
This is a bit of a philosophical abstraction, but we strongly believe that what
makes the word “lollipop” stand apart from the word “Volkswagen” is not
some critical feature quantity (like zero-crossings or cepstral coefficients) that
can objectively be quantified. Instead, we treat them as different words
because we have been taught to do so by experience. Basically, words are
different not because of some quantitative property of the signal, but because
we “perceive” them differently.
3.) Therefore, between front-end digital signal processing and back-end
pattern detection, as long as the front-end is stable, the bulk of our work
should be focused on creating an intelligent system that can detect the
patterns of similarities and differences between stored words and use that
pattern to classify future inputs. In other words, we decided that the key to
building a working recognition system did not rest in superior signal
processing, but rather depended on creating an artificially intelligent system
that can teach itself to find similarities and differences between words.
4.) In the end, what we wanted was a system that could truly recognize a word, a
syllable, or a sound without resorting to arbitrarily programmed models,
statistics, or clever signal-processing tricks. We wanted a system that could
learn.
5.) We wanted to do this not by imitating tried and true methods but by
coming up with a (hopefully) new approach.
3.2 Overview of the Problem:
Any sound measured by a microphone is simply a sequence of numbers (in our
case, a sequence of voltages). The reference template is also a sequence of numbers.
Speech recognition is “simply” the process by which one sequence of numbers is
compared to another sequence of numbers in an attempt to find the best fit. The main
difficulty lies in the fact that the template is usually a stationary sequence, whereas the
input sequence of spoken words can change in a variety of unpredictable ways. In
particular, an utterance can differ from a stored template in 3 ways:
1.) Delta-M (∆M) Error such as interference, noise, and other magnitude distortions
corrupt the input signal and can make it sound different from the reference signal.
2.) Delta-T (∆T) Error such as unexpected pauses, unusually fast or slow speaking
styles, and other changes in speed can randomly shift the position of the input
relative to the template.
3.) Combination ∆M and ∆T Error randomly distorts a signal’s values and also
shifts its position randomly in time. Real speech falls under this category because
people never say the same word exactly the same way twice, in addition to
whatever background noise might be present in the environment. People can also
pause unexpectedly, say a word faster or slower than expected, stutter, jump
around, or even be uncooperative. Over a sufficiently long interval, an input
signal can vary from the “ideal” in a multitude of ways.
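The three error classes above can be illustrated with a small sketch (Python/NumPy used here purely for illustration; the gain, noise level, and 80% resampling factor are arbitrary choices for the example, not values from the project):

```python
import numpy as np

rng = np.random.default_rng(0)
template = np.sin(2 * np.pi * np.linspace(0, 4, 200))   # stand-in "stored" signal

# Delta-M: magnitude distortion (a gain change plus additive noise).
delta_m = 1.4 * template + 0.05 * rng.standard_normal(template.size)

# Delta-T: timing distortion (resampled to 80% duration, i.e. faster speech).
fast_idx = np.linspace(0, template.size - 1, int(template.size * 0.8))
delta_t = np.interp(fast_idx, np.arange(template.size), template)

# Combined Delta-M + Delta-T: real speech distorts in both ways at once.
combined = 1.4 * delta_t + 0.05 * rng.standard_normal(delta_t.size)
```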
Two recordings of the word “counterclockwise” spoken by the same person are shown
in Figure 11:
Figure 11
Since both instances of “counterclockwise” are spoken by the same person, these
plots are generally similar. However, there are also differences. Note that
“Counterclockwise” 1 generally has less energy. Note, too, that “Counterclockwise” 2 is
slightly faster (left-shifted in time) compared to “Counterclockwise” 1. These recordings
were made by the same person on the same day.
Figure 12 below shows the same word “counterclockwise” spoken by the same
person but on different days. “Counterclockwise” 3 was recorded approximately 4
months before “Counterclockwise” 1 was recorded.
Figure 12
Notice that the differences between the two signals are magnified. While the
difference in energies is slight, “Counterclockwise” 3 is stretched compared to
“Counterclockwise” 1. They start at roughly the same time, but there is a pause in the
middle that causes “Counterclockwise” 3 to be skewed in time.
Such differences are called intra-speaker differences. The same person can utter
the same word in slightly different ways each time. The person can pause, speak faster,
speak slower, or emphasize certain syllables. A recognition system needs to be robust
enough to understand that these different pronunciations of the same word are not
entirely different words, but simply different examples.
Matching words across different speakers is an even more challenging task.
Whereas differences between words spoken by the same person are relatively small,
inter-speaker differences are huge. Figure 13 shows “counterclockwise” spoken by two
different people:
Figure 13
As can be seen in the figure above, the difference in magnitude (∆M) and the
difference in time (∆T) is large across different speakers, even if they are saying the same
word. Not only does “Counterclockwise” 4 have different energies (on a per-syllable
basis, not across the entire word) than “Counterclockwise” 1, it is substantially stretched
in time.
At this point you might wonder if the differences between words are truly as
substantial as we claim. A quick visual review of “Counterclockwise” 1-4 shows that
each has roughly 4 lumps (syllables) separated by periods of low energy.
Moreover, except for small variations, these lumps have roughly the same size, and with
a little bit of stretching here and a little bit of imagination there, you may visualize them
as occurring at roughly the same times. So what’s the big deal?
The problem is that computers do not have the faculties that humans have. They
operate at the numerical level. They are also not capable of asking “what if”—what if I
stretch this portion here? or stretch this portion there? or superimpose these two lumps?
All computers have to work with is the data that is given to them, as it is given, without
recourse to imagination. They crunch numbers, and depending on the numbers that come
out, they classify their inputs accordingly. To teach a computer “what if” and try to teach
it all the fuzzy and indeterminate situations where it can adjust the data would be quite an
accomplishment in artificial intelligence. “Counterclockwise” might seem easy to
recognize because it is the only word in our command set that has four syllables, but what
if we decide to include another four-syllable word in the future? Numerically,
“Counterclockwise” 1-4 look like:
Table 2
Counterclockwise Duration (s) Signal Energy Signal Power
1 1.4048 9.0664 2.9270e-004
2 1.5905 24.4054 6.9591e-004
3 1.4623 12.0885 3.7492e-004
4 1.7524 9.6690 2.5023e-004
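The columns of Table 2 are mutually consistent if "Signal Power" is read as signal energy divided by the number of samples: for example, 9.0664 / (1.4048 × 22050) ≈ 2.9270e-4. A small sketch under that assumption (Python used for illustration; `signal_stats` is not a project function):

```python
import numpy as np

FS = 22050  # project sampling rate (Hz)

def signal_stats(x, fs=FS):
    """Duration (s), energy (sum of squares), and mean power per sample.

    With these definitions power * duration * fs == energy, matching the
    relationship among the columns of Table 2.
    """
    duration = len(x) / fs
    energy = float(np.sum(x ** 2))
    power = energy / len(x)
    return duration, energy, power

x = 0.1 * np.ones(22050)          # one second of a constant 0.1 "signal"
duration, energy, power = signal_stats(x)
```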
By looking at the table and these particular features, you might notice that there
are some big variations over the same word. On the other hand, you might be able to find
a pattern in the similarities among the four examples of “Counterclockwise.” If so, you
are using a human trait—the innate ability to find similarities between examples. With
some clever programming, computers might be able to do the same thing, but the task
becomes much more difficult when the number of possible words is not 1 but 24.
The computer then has to ask itself, How long is long enough? When should I begin
measuring a sound? When should I stop measuring a sound? How do I know that this is
really the beginning of the word? How do I know that this is really the end of the word,
and the speaker simply didn’t pause? Statistically, if this half of the word is similar to this
other word, but the later half is closer to this one, how do I decide which example is
which? How do I even know whether I am measuring these words correctly and at the
right times? How do I know if the input is gibberish?
The problem of signal distortion is less severe in the frequency domain but is not
the cure-all needed for recognition. Figure 14 shows the power spectral magnitude of
“Counterclockwise” 1 and 4 as well as “Robokart.”
Figure 14
“Counterclockwise” 1 and 4 look fairly similar; they both dip below -60 dB at
roughly the 5700 Hz mark, whereas “Robokart” dips as early as 3900 Hz. From
experience, when we examined the spectrum of all 24 words, there were many instances
when the distinction was blurred. We saw that there were gray areas where the spectra
of different examples of the same word would be close, but there would sometimes be another
word whose example was even closer, and the problem turned into determining which
example should belong to which word. Setting thresholds by saying, “If this input’s
spectrum dips below so-and-so at this frequency and at this point in time, then
it belongs to this group, otherwise it belongs to that group” turned out to be futile. Time
and time again, we have found exceptions to the artificial boundaries that we have
attempted to set, almost as if the words themselves were taunting us. And once we made
an exception for one case, it became hard to determine under what conditions exceptions
were justified. Eventually, once all examples of all words were considered, the distinction
between them became muddled.
To illustrate this “feature ambiguity” which applies to both the frequency and
time domains, Figure 15 shows our attempt at using a time-domain feature called
“variability” to distinguish between different syllables:
Figure 15
Overlapping segments are undesirable because they are syllables for which the
value of a particular feature is similar. In theory, different syllables ought to have
different values, yet this graph shows that for the feature known as “variability,” many
syllables are indistinguishable because they share a similar range of values.
The astute reader might point out that we should simply add more features. After
all, in an environment encompassing 100 different features (dimensions), syllables or
words overlapping in one feature cannot (or should not!) overlap when all other
dimensions are considered. If a chicken walks like a duck, and “walking” is the only
feature that you are measuring, you might come to the conclusion that a chicken is the
exact same thing as a duck. But if you consider more features and test to see if a chicken
walks like a duck and looks like a duck and sounds like a duck, then you would realize
that ducks and chickens are different because they neither look nor sound the same
(though they might walk the same).
Realizing this, our early attempts utilized as many as 38 features simultaneously:
8 time-domain features, and 30 frequency domain features (highest spectral peaks). Using
these features, we attempted to build statistical models of occurrences, hoping to find the
one pattern that distinguished, say, the word “stop” from the word “up.” The process was
very time-consuming, and we realized that there were invariably unusual exceptions
(extreme feature values) that skewed the means and standard deviations of our different
“patterns.” It was difficult to decide whether to keep these deviant values or whether to
throw them out, and trying to decide the conditions they were useful for was extremely
difficult. For 24 words, trying to calculate patterns and set thresholds across different
features of a signal bogged us down in work. During this endeavor we realized that our
models would only work for the particular set of words that we were working with. What
if the user wanted to work with a whole different set of words? What if the user wanted to
program his or her own unique voice into the system? Would the user have to wade
through the sea of statistical data and manually establish the relationships between new
words? How could our current system even begin to handle words that are completely
new, words for which we have no statistical models?
In the end, we decided that for the sake of flexibility (and our own sanity), we
would somehow force the computer to make those decisions for itself. And so, after
almost four months of trying different signal representations in the time and frequency
domains, after trying to find statistical patterns buried in the signal, and after extracting
all sorts of different features that hopefully would make one word stand out from another,
we concluded that words are not uniquely identifiable through some all-encompassing,
objective property of the signal. Rather, we decided that humans recognize words
because the experience of hearing a word is somehow associated with other experiences
already stored in memory, and an association is formed among similar “experiences”
which allows us to relate new inputs to old.
Like human beings, a speech recognition system would have to be taught that
certain examples belong to the same word. It would also have to determine exactly what
makes such examples similar and what makes them different from other words. Instead of
manually organizing the pattern ourselves and trying to set up a myriad of laws and rules,
the computer would have to form “associations” for itself.
Henceforth, we decided not to torture ourselves by trying to find the perfect
feature extraction algorithm and instead focused on building an intelligent, automated
pattern-recognizer.
3.3 Feature Extraction
Finalized Extraction Procedure:
All features are extracted across a constant interval. We selected a feature length
of 32 ms (706 samples at 22.05 kHz sampling rate) over which to extract each feature.
Succeeding feature chunks are taken by using a sliding, half-overlap window (353 old
samples + 353 new samples). For time-domain features, the samples were unity-
weighted. For frequency-domain features, the samples were weighted with a Hamming
window.
Commonly used measuring intervals are from 20-40 ms. In the frequency domain,
shorter intervals give you good time resolution but poorer frequency resolution, and
longer intervals give you poorer time resolution but better frequency resolution. In the
time domain, 20-40 ms is also the range over which individual components of the speech
signal (the “r” sound in “robokart,” for example) remain essentially the same, allowing us
to measure relevant features across specific “sounds.” 32 ms was selected as a
compromise between having a feature-length short enough to resolve individual sound
details, but long enough to process the signal quickly.
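The sliding half-overlap scheme can be sketched as follows (Python/NumPy used for illustration; the project implemented this in MATLAB, and the function name `frames` is our own):

```python
import numpy as np

FS = 22050
FRAME_LEN = 706          # about 32 ms at 22.05 kHz
HOP = FRAME_LEN // 2     # half-overlap: 353 old samples + 353 new samples

def frames(x, frame_len=FRAME_LEN, hop=HOP, window=None):
    """Split x into half-overlapping frames; optionally apply a window.

    Unity weighting (window=None) is used for time-domain features,
    Hamming weighting for frequency-domain features, as in the report.
    """
    n = 1 + max(0, (len(x) - frame_len) // hop)
    out = np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])
    if window is not None:
        out = out * window
    return out

x = np.arange(22050, dtype=float)             # one second of dummy samples
td = frames(x)                                # unity-weighted, time-domain features
fd = frames(x, window=np.hamming(FRAME_LEN))  # Hamming-weighted, spectral features
```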
Our final system consists of two time-domain features (zero-crossings, mean
power) coupled with 24 spectral power outputs of a digital filter bank, for a total feature
vector of 26 elements. Zero-crossings is the average number of times a signal crosses the
zero-axis over an interval. Mean power for a signal g(t) over an interval N is simply
given by:
p = (1/N) · Σ[t = 0 to N] g(t)²   (Equation 1)
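Both remaining time-domain features are short to compute. A sketch (Python/NumPy for illustration; the sign-change counting below is one common definition of zero-crossings and may differ in detail from our MATLAB implementation):

```python
import numpy as np

def zero_crossings(frame):
    """Average number of sign changes per sample over the frame."""
    signs = np.sign(frame)
    return np.count_nonzero(np.diff(signs)) / len(frame)

def mean_power(frame):
    """Mean power per Equation 1: (1/N) * sum of g(t) squared."""
    return float(np.sum(frame ** 2)) / len(frame)

# A 1 kHz tone at 22.05 kHz crosses zero twice per cycle.
t = np.arange(706) / 22050.0
g = np.sin(2 * np.pi * 1000 * t)
zc = zero_crossings(g)   # roughly 64 crossings over 706 samples
p = mean_power(g)        # roughly 0.5 for a unit-amplitude sine
```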
The time-domain features were very easy to implement, but choosing the right frequency-
domain features took a lot of research. We eventually chose one of the earliest systems
that tried to emulate the way human hearing works.
The Digital Filter Bank
Based in part on psychoacoustic measurements that seemed to demonstrate that
the human inner ear has a finite frequency resolution*, early speech researchers designed
an overlapping bank of bandpass filters to mimic the frequency response of the human
cochlea. The bandpass filters were tuned to different frequencies, and the passbands were
made similar to the observed bandwidths of the human ear. Of several types of filter
banks available, the Holmes Filter Bank was used for this project.
To simulate the limited frequency resolution of our ears, the filter bank consists of
bandpass filters whose center frequencies are arranged nonlinearly on the frequency axis.
The bandpass filters also have different bandwidths, which are thought to account for our
ears’ limited spectral resolution at high frequencies. To determine the placement of each
bandpass filter, the standard frequency axis is first warped onto a nonlinear scale, where
the integer values on that scale determine where the filter centers are placed.
The two scales commonly used to achieve nonlinear frequency warping are the
Mel Scale and the Bark Scale. Both scales attempt to model experimental data where the
ear’s critical bandwidth at different input frequencies were measured. These scales
therefore approximate human ears’ limited ability to differentiate tones that are too close
in frequency.
* Subjects were tested on how well they could perceive pure tones that were played simultaneously. The limited frequency resolution of hearing is based only on simultaneous tones. In reality, the ear has complex mechanisms in place that allow people to separate tones that have even small changes in timing. But accounting for these timing mechanisms would have made the model too complex. Therefore, a simplified version is used here.
An equation for the Mel scale is given by:
m = 1125 log (0.0016f + 1) (Equation 2)
An equation for the Bark scale (Traunmuller’s version, 1990) is given by:
B = 26.81f / (1960 + f) − 0.53   (Equation 3)
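Both warping functions are easy to implement directly from Equations 2 and 3 (Python for illustration; in Equation 2 we take "log" to be the natural logarithm, consistent with the common natural-log form of the Mel scale, and that reading is our assumption):

```python
import math

def hz_to_mel(f):
    """Mel scale per Equation 2; 'log' assumed to be the natural logarithm."""
    return 1125 * math.log(0.0016 * f + 1)

def hz_to_bark(f):
    """Bark scale per Equation 3 (Traunmuller, 1990)."""
    return 26.81 * f / (1960 + f) - 0.53

# The Bark value of the Nyquist frequency (11.025 kHz) stays below the
# scale's maximum of 24, which is why the Bark scale fits our sampling rate.
nyquist_bark = hz_to_bark(22050 / 2)
```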
The Mel scale is in common use by engineers but is typically only used for sampling
frequencies at or below 10 kHz. A side-by-side comparison of Bark and Mel scale center
frequencies and bandwidths is shown in Table 3:
Table 3
In the end we selected the Bark scale because it is used for a higher range of
frequencies (it can cover up to 27 kHz, easily accommodating our sampling rate of 22.05
kHz), but more importantly because it seemed to approximate empirical data much better.
A plot of the frequency-to-bark-transformation is shown in Figure 16:
Figure 16
Before going into the details of the filter bank itself, it is worth mentioning that
people’s increased sensitivity to particular frequencies (see Figure 2) can be modeled by
a pre-emphasis filter which amplifies raw frequencies around the 4 kHz range and
attenuates them outside that range. One such pre-emphasis filter is given by:
Equation 4
A pre-emphasis filter greatly attenuates frequencies below 100 Hz, amplifies
frequencies around 3-4 kHz, and gradually attenuates frequencies above 6 kHz. It is
meant to approximate the different sensitivity of human ears to different frequencies.
Early models of our feature extractor utilized a pre-emphasis filter in order to
more closely approximate human hearing, but we had difficulty determining the
appropriate gain for the transfer function. Should we boost the 4 kHz region by a factor
of 10? A factor of 100? A factor of 300? By what constant factor should we attenuate
frequencies outside that range? Because the raw spectral magnitudes of our inputs tended
to change unpredictably, a wrong pre-emphasis gain could squelch the entire spectrum or
excessively amplify an already large 4 kHz response. Rather than spend a great deal of
time tuning the filter, in the end we decided not to use a pre-emphasis filter and
instead fed the raw, unweighted input spectrum directly to our filter bank, hoping that our
pattern-recognition back-end could sort it all out later.
Because the Bark scale goes up to a maximum value of 24, a filter bank based on
the Bark scale utilizes 24 bandpass filters which are centered around the published Bark
center frequencies and whose bandwidths are equivalent to the accepted Bark critical
bandwidths. Because these bandpass filters generally overlap, a triangular weighting
scheme is applied to each filter in order to give the center frequency the greatest weight.
Figure 17 shows a plot of the actual filter bank that we utilized in this project:
Figure 17
To ensure that no spectral data is lost, the filters overlap by a large amount.
Resolution is intentionally decreased at higher frequencies (the filter bandwidths are
made progressively larger), and the filter centers are nonlinearly spaced. This represents
early attempts at mimicking some observed properties of human hearing.
Figure 18 shows an example of an input to the digital filter bank:
Figure 18
Figure 19 shows the output of the digital filter bank for this particular input
sequence:
Figure 19
The entire frequency-domain feature extraction process is shown in Figure 20:
Figure 20
1: A portion of the sound vector is captured and weighted by a sliding Hamming window.
2: If the captured portion is shorter than the pre-defined feature length (N), zero-pad it to
N samples.
3: Perform the N-point FFT. Discard the upper-half of the data, which is redundant.
4: The spectrum is fed to the filter bank. Each set of frequencies is triangularly weighted,
and the base-10 log power of the spectrum is calculated over each filter interval.
5: Individual power values are concatenated together to form a single 24-element feature
vector.
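The five steps above can be sketched end-to-end as follows (Python/NumPy for illustration; the evenly Bark-spaced triangular filters here are a simplified stand-in for the published Bark center frequencies and critical bandwidths that the project used):

```python
import numpy as np

FS = 22050
N = 706  # feature length in samples (~32 ms)

def bark_to_hz(b):
    """Invert Equation 3 to place filter edges back on the frequency axis."""
    return 1960 * (b + 0.53) / (26.28 - b)

def spectral_features(frame, n_filters=24, fs=FS):
    """Steps 1-5: window, zero-pad, FFT, triangular Bark filters, log power."""
    frame = frame * np.hamming(len(frame))                   # 1: Hamming weighting
    if len(frame) < N:
        frame = np.pad(frame, (0, N - len(frame)))           # 2: zero-pad to N
    spectrum = np.abs(np.fft.rfft(frame, N)) ** 2            # 3: FFT, lower half only
    freqs = np.fft.rfftfreq(N, 1.0 / fs)
    edges = bark_to_hz(np.linspace(0.5, 22.0, n_filters + 2))
    feats = []
    for lo, center, hi in zip(edges[:-2], edges[1:-1], edges[2:]):
        rising = (freqs - lo) / (center - lo)
        falling = (hi - freqs) / (hi - center)
        tri = np.clip(np.minimum(rising, falling), 0, None)  # 4: triangular weighting
        feats.append(np.log10(np.sum(tri * spectrum) + 1e-12))
    return np.array(feats)                                   # 5: 24-element vector

t = np.arange(N) / FS
vec = spectral_features(np.sin(2 * np.pi * 1000 * t))
```

Feeding in a 1 kHz tone, the largest of the 24 outputs should come from the filter whose passband contains 1 kHz, illustrating how the bank localizes spectral energy.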
3.4 Pattern Detection:
At the end of the feature extraction stage, 2 time-domain features and 24 frequency-domain power values are concatenated to form a single 26-element feature
vector. At every point in time (or rather, at every “feature interval”), a sound is
decomposed into a vector which can be thought of as a position coordinate in a 26-
dimensional feature space. Since 26 dimensions cannot be visualized, a 3-dimensional
feature space is shown in Figure 21:
Figure 21
As can be seen in Table 4, the output of the feature extractor is simply a set of
values specifying a position within this feature space (only feature elements 1-10 are
included to allow the table to fit on the page):
Table 4
Rows correspond to sampling intervals (time), and columns correspond to
individual features (vector components). Thus, each horizontal slice represents a
particular 26-D position at a particular point in time.
Template Creation:
Before the recognition engine can recognize new voice inputs, it must first be able
to recognize the original data provided for its training. This process is called “template
creation” and only needs to be performed once for a particular training set. It is, however,
the most time-consuming, typically taking up to two hours to complete for a 24-word
training set comprised of 5 examples each. Figure 22 shows the parts involved in
template creation:
Figure 22
The two most important functions are the Intra-Entity Associator and the Cross-
Entity Associator. In general, the Intra-Entity Associator tells the recognition system
where and how examples of the same word are similar, while the Cross-Entity Associator
finds similarities across different word groups. Both are based on the same general
principles, however.
The Intra-Entity Associator:
If the training set is reliable (i.e. all word examples were recorded in a reasonably
noise-free environment by a trainer who spoke clearly and consistently), the feature-
vectors of words and examples that sound similar should cluster around similar regions in
the feature space. If the front-end used is a reasonable approximation to human hearing
(but then again, no one yet knows how hearing truly works), the feature vectors ought to
correspond to how particular units of speech actually sound. Thus, perceptual similarity
ought to be reflected in the spatial similarity of groups within the feature space. The set
of points belonging to “dog” is expected to be near the set of points describing “hog.”
Likewise, the points belonging to “height” should be closer to “kite” than they are to
“brown.” Of course, this is a bit of a simplification, since longer words that are composed
of multiple distinct sounds such as “counterclockwise” or “dilapidated” will probably be
spread out in some complex pattern in the feature space. This is alright, because what we
really want is not so much the distribution pattern of whole words, but rather the
distribution of the individual units of sound within that word.
Because our feature-length is so short (a mere 706 samples long), our feature
vectors are capable of showing word contents at individual units of sound. Therefore,
given a sufficiently small feature-length, the individual “r” “o” “b” “o” “k” “a” “r” “t”
sounds in “robokart” are plotted in the feature space. The job of the intra-entity associator
is to find the units of sound that are similar across different examples of the same word,
as well as find their positions in time. How does it do it?
Given two position vectors V1 and V2 which represent point-positions within
some N-dimensional space, the distance between the points described by V1 and V2 is
given by the familiar Euclidean distance formula:
D = √( sum( (V1 − V2)² ) )   (Equation 5)
The square operator is applied individually to each coordinate pair. In a 3-
dimensional space where two points are described by X1, Y1, Z1 and X2, Y2, Z2, the
distance between the two points is given by:
D = √( (X2 − X1)² + (Y2 − Y1)² + (Z2 − Z1)² )   (Equation 6)
In a 26-dimensional space described by coordinates C1, C2, C3, C4, . . . C26, the
distance between points A and B is given by:
D = √( (C1B − C1A)² + (C2B − C2A)² + (C3B − C3A)² + (C4B − C4A)² + … + (C26B − C26A)² )   (Equation 7)
Keep in mind that we are working with a matrix where each row corresponds to a
point in time and each column corresponds to a particular feature coordinate. The
distance-finding formula is very easy to implement over matrices.
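In a vectorized environment the whole distance computation is one line. A sketch (Python/NumPy used here; the MATLAB equivalent is equally short, and `distances_to_reference` is our own name):

```python
import numpy as np

def distances_to_reference(ref, points):
    """Euclidean distance from one reference row to every row of `points`.

    Rows are time intervals and columns are the 26 feature coordinates,
    matching the matrix layout described in the text.
    """
    return np.sqrt(np.sum((points - ref) ** 2, axis=1))

ref = np.zeros(26)
points = np.ones((5, 26))            # five dummy feature vectors
d = distances_to_reference(ref, points)
```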
A note about our matrices:
The data for all our feature coordinates are stored in a 4-dimensional matrix. Each
row (y-direction) corresponds to a point in time, each column (x-direction) corresponds
to a particular feature, each Z slice corresponds to a particular example, and each block in
the 4th-dimension corresponds to a particular word. Because word samples are generally
non-uniform in length, empty spaces in the matrices are filled with NaN’s (not-a-
number). Figure 23 is a graphical summary of our storage scheme.
Figure 23
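The storage scheme can be sketched in a few lines (Python/NumPy for illustration; the dimensions match the report's 26 features, 5 examples, and 24 words, while MAX_T = 100 intervals is an arbitrary choice for this example):

```python
import numpy as np

MAX_T, N_FEATS, N_EXAMPLES, N_WORDS = 100, 26, 5, 24

# Rows (y) = time, columns (x) = features, Z slices = examples,
# 4th dimension = words. Unused trailing rows stay NaN because
# word examples are generally non-uniform in length.
store = np.full((MAX_T, N_FEATS, N_EXAMPLES, N_WORDS), np.nan)

example = np.ones((60, N_FEATS))     # a 60-interval example of word 0
store[:60, :, 0, 0] = example
```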
The intra-entity associator iteratively peeks at each example of each word. Within
each example, it latches on to a point in time and establishes the feature coordinates in
that period as the “reference point.” Labeling the reference point as Point A, the
associator calculates the distances of all other points belonging to different examples of
the same word. It produces a 3-dimensional matrix of values that correspond to the
distances of different example points from Point A. Within each Z-slice (example), it
sorts the distances in ascending order. The distances at the top of each sample stack
belong to the points closest to the reference Point A, which are assumed to sound the most
similar to the unit of sound that Point A represents. The closest example feature distances
are extracted and stored in a 1-dimensional array. The position in time of each distance
is also stored in a 1-dimensional array. Algorithm flow is as follows:
For word = 1:max(word),
    For example = 1:max(examples),
        For time = 1:max(time),
            Reference point = feature coordinates at (time, example, word)
            Calculate distances of all other points from Reference point
            Find the minimum distance within each example
            Record the locations in time at which they occur
            Calculate the standard deviation of minimum distances
            Calculate the standard deviation of time-locations
            Calculate the best mean distance from Reference point
            Store the previous three values in such a way that they can be
            retrieved easily for that Reference point.
        end
    end
end
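The loop above can be rendered as a simplified Python sketch. It handles a single word, uses a plain Euclidean distance, and stores only the three summary values described in the text; all names are ours, not the report's:

```python
import math
import statistics

def feature_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def intra_associate(word_examples):
    """For each (example, time) reference point, find the closest point in
    every other example of the same word, then summarize those matches as
    (mean distance, stdev of distances, stdev of time locations)."""
    results = {}
    for e, example in enumerate(word_examples):
        for t, ref in enumerate(example):
            best_dists, best_times = [], []
            for e2, other in enumerate(word_examples):
                if e2 == e:
                    continue
                # distance from the reference point to every point in the
                # other example; keep the closest one and its time index
                dists = [feature_distance(ref, p) for p in other]
                m = min(dists)
                best_dists.append(m)
                best_times.append(dists.index(m))
            results[(e, t)] = (statistics.mean(best_dists),
                               statistics.pstdev(best_dists),
                               statistics.pstdev(best_times))
    return results

# Three 1-feature examples of one word, two time steps each:
res = intra_associate([[[0.0], [1.0]], [[0.1], [1.1]], [[0.0], [0.9]]])
```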
The Inter-Entity Associator:
Whereas the intra-entity associator finds the distances and times of those points
that are closest to a unit of sound at each point in time for each example within a
particular word, the inter-entity associator finds the distances and times of points
belonging to other words. Using essentially the same algorithm as the intra-entity
associator, the inter-entity associator calculates distances of points outside the current
word. It once again finds the distance and time of each external point and records the particular word that each belongs to. It sorts the candidates and finds the closest points for each example within each word. How does it determine whether a point is close enough to the reference point to sound the same? If the mean distance of the word cluster is <= the mean distance of examples within the word, AND the standard deviation of word distances is <= the standard deviation of the best example distances of the reference point, AND the standard deviation of word times is <= the best standard deviation of example times associated with the reference point, then the two groups are assumed to be equal and an association is formed between them.
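That acceptance test is a three-part predicate in which every condition must hold at once; a minimal Python sketch (variable names are ours, not the report's):

```python
def close_enough(word_mean, word_std, word_time_std,
                 ref_mean, ref_std, ref_time_std):
    """A candidate cluster from another word is accepted only if it is at
    least as tight, on all three measures, as the reference point's own
    within-word matches."""
    return (word_mean <= ref_mean
            and word_std <= ref_std
            and word_time_std <= ref_time_std)

# Tighter on all three measures: an association is formed.
assert close_enough(0.4, 0.1, 2.0, 0.5, 0.2, 3.0)
# Failing any one condition rejects the candidate.
assert not close_enough(0.6, 0.1, 2.0, 0.5, 0.2, 3.0)
```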
The inter-entity associator has 3 outputs: an organized list of feature coordinates,
an organized list of word associations, and an organized list of word-times. All three
matrices are arranged in lookup-table form, where the indices of one matrix correspond
exactly to the same data in another matrix. The feature matrix (truncated) looks like:
Table 5
In the future, if the recognition system encounters a feature vector resembling
(.0042493, 2.3811e-006, -1.6506, -2.2281, -3.205, -3.264, -3.4798, -3.8434, -4.2645,
-4.175 …) it will be referred to Row Index 4. The recognition system can then go to the
association matrix:
Table 6
Here, in Row Index 4, the system will discover that the new input’s feature vector
is most similar to feature vectors located in words 4, 13, 14, 16, 20, and 24. Presumably,
therefore, the unit of sound that has been input sounds a great deal like units in
those particular words. To get more specific, the recognition engine then looks up the
time matrix:
Table 7
In Row Index 4, the recognition engine would find the relative times that the matching
sounds occur at within those words. It uses this time information to track potential word
candidates across time.
3.5 Run-Time Operation
The Feature Matrix, Association Matrix, and Time Matrix are the components
that the system needs to perform recognition. Suppose the user speaks into the
microphone. The data acquisition engine captures the sound, front-end algorithms remove
periods of silence, and the feature extractor creates a set of feature vectors. The distances
between the current input sequence and the coordinates stored in the Feature Matrix are calculated. The row index that corresponds to the closest match is retrieved. Using the
row index, potential words are retrieved from the Association Matrix and stored in a Run-
Time Word Matrix. Using the row index, matching time locations are retrieved from the
Time Matrix and stored in a Run-Time Time Matrix. As the user inputs more samples,
the system finds word-numbers that occur repeatedly in the Run-Time Word Matrix. It
also examines the time-locations in the Run-Time Time Matrix. Using Matlab’s
intersect( ) and diff( ) functions, if a word number occurs repeatedly in each following
set and the corresponding time locations are causal, then that word is selected.
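A rough Python analogue of that selection rule, with set intersection standing in for Matlab's intersect( ) and successive differences for diff( ); reducing the decision to a single surviving candidate is our simplification:

```python
def pick_word(candidate_sets, candidate_times):
    """candidate_sets: for each input frame, the set of word numbers pulled
    from the Association Matrix.
    candidate_times: word number -> time location seen at each frame.
    A word is selected if it recurs in every frame's candidate set and its
    time locations are causal (strictly increasing)."""
    surviving = set.intersection(*candidate_sets)
    for word in sorted(surviving):
        times = candidate_times[word]
        diffs = [b - a for a, b in zip(times, times[1:])]
        if all(d > 0 for d in diffs):  # causal: times move forward
            return word
    return None

frames = [{4, 13, 16}, {4, 16, 20}, {4, 16}]
times = {4: [1, 3, 5], 16: [7, 2, 9]}
# Word 4 recurs in every frame and its time locations advance.
```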
PART FOUR—USER’S GUIDE
4.1 General Overview
Creating the various functions in our system initially took a great deal of time. To
speed up the development and troubleshooting process, we made each function as
flexible and modular as possible, setting up the system so that we could change important
variables quickly and easily. This had the added benefit of allowing new people, with a
little bit of training, to use our recognition programs to create and run their own templates
and adjust the performance parameters to suit their requirements. Because the system is implemented in a single environment (Matlab), the learning curve should be fairly gentle, but it is assumed that the potential user has at least a passing familiarity with the Matlab environment.
Before you begin, make sure you have the following software:
Matlab 6.5 Release 13 (or later)
Data Acquisition Toolbox v2.2 (or later)
Signal Processing Toolbox
To get started, make sure that you are in a reasonably quiet environment and that
all our functions are located in the same directory. Set the Matlab work path to the directory where the functions are stored. Once that is done, you should first calibrate the
system so that it can measure the ambient noise level and the noise level intrinsic to your
particular hardware:
Step 0: Calibration
Type the following command in the Matlab workspace:
[dc,bp,ln,un] = sound_calibrate;
The command line will be blocked for approximately 50 seconds while the system
calibrates itself. Try not to make any noise during this period as these settings will be
needed for all future readings. Once the program has finished running, you should save
the calibration settings via the save command:
save calibration1;
4.2 Template Creation
After calibration, the first phase involves supplying the system with a library of
sound samples from which to build a template. This can be accomplished via an external
sound recording program and then using Matlab’s built-in wavread( ) function or by
using our voice_record( ) function. Because of the nature of the Data Acquisition
Toolbox, you must first declare the following as a global variable:
global Recorded_Samples
This is the variable where all recorded sound samples will be stored.
The voice_record( ) function requires you to enter 5 input parameters: dc, bp, ln, un,
quantity. The values for dc, bp, ln, and un should come from the sound_calibrate( )
function. The fifth input parameter, quantity, is a value that tells the program how many
examples of each word you want to give. For example, if you want to form a training set
based on 7 examples of each word, you should set the value of quantity to 7. We recommend a value of at least 5 but no more than 8, as higher
numbers of examples will require you to wait a longer amount of time when you reach
the Association Stage.
By default, the voice_record( ) function will give you a window of 3 seconds within
which you can speak a single example. To give yourself a longer or shorter acquisition
duration, open the voice_record( ) function, scroll down to the CUSTOMIZABLE SETTINGS portion of the program, which should be near the top, and change the value of record_length to the duration, in seconds, over which you wish the system to record.
From beginning to end, here is a sample recording session. Suppose you wanted to record
a single word, say, “Proceed” 6 times and store all 6 recordings in a variable called
word_set1:
global Recorded_Samples
voice_record(dc, bp, ln, un, 6);
The system is smart enough to detect the difference between silence and a sound directed
at the microphone. Therefore, at each prompt, the system will wait for you to begin
before it commences recording that example. You must, however, finish the word before
the allotted time expires. At the end of every example, you will see a counter that tracks
your progress by telling you how many recordings you have made so far. Once your set
number of examples has been reached, you should store the samples:
word_set1 = Recorded_Samples;
If you want to continue and record a different word, for example, “stop,” simply press the
up arrow key or type once again:
voice_record(dc, bp, ln, un, 6);
The program will again wait for you to begin. When you are finished, you should store
your new set:
word_set2 = Recorded_Samples;
IMPORTANT: When you make a template of, say, 5 words, you MUST record the same
number of examples for each word. Once you’ve decided on how many samples you
want the recognition system to work with, you must supply it with exactly that many
samples. In the case above, once you’ve decided on six samples, the “Proceed” command
must be recorded six times, the “Stop” command must be recorded six times, and any
additional words you want to train it with must also be recorded the same number of
times. Failure to have a consistent number of examples will result in the system crashing.
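A quick guard against this failure mode can be sketched as follows (illustrative Python; the actual MATLAB system performs no such check and simply crashes):

```python
def check_example_counts(word_sets):
    """Raise if the recorded word sets do not all contain the same number
    of examples, mirroring the consistency rule above."""
    counts = [len(ws) for ws in word_sets]
    if len(set(counts)) > 1:
        raise ValueError("inconsistent example counts: %s" % counts)
    return counts[0]

# Three words, six recordings each: passes and reports the common count.
n = check_example_counts([["p"] * 6, ["s"] * 6, ["c"] * 6])  # n == 6
```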
Finally, let’s say you want to record a third word, such as “Charge.” Either press the up
arrow key or type in:
voice_record(dc, bp, ln, un, 6);
Record your six samples and store them into a variable:
word_set3 = Recorded_Samples;
When you have finished recording all the words you want the system to recognize, you
must group each set of words into the same array:
complete_set = {word_set1, word_set2, word_set3}
The array known as complete_set should contain all the examples of each word that you
want the system to process. It is strongly recommended that you
save complete_set
before continuing.
The next step involves extracting the relevant features from your voice samples. Do this
by executing the following line:
[td,fd] = make_template(complete_set);
There might be a short pause while the system extracts the required features. When the
system is finished, you must convert the cell arrays td and fd into matrices:
td = ccm_to_matrix(td);
fd = ccm_to_matrix(fd);
When that is finished, you must combine the time-domain and frequency-domain
matrices into a single matrix:
combined_template = cat(2,td,fd);
Now you are ready to perform “Intra-Entity Association”:
intra_associations = example_matcher(combined_template);
This process can take anywhere from 10 minutes to 1 ½ hours depending on how many
different words you recorded and how many examples of each that you have. A counter
on the screen will update you of its progress.
When finished, you should now perform a “Cross-Entity Association”:
cross_associations = associator(combined_template, intra_associations);
Again, this step can take anywhere from 10 minutes to 1 ½ hours depending on how
many word groups you recorded and how many examples of each that you have
provided. A counter will update you of its progress.
When this step is finished, one last step remains before template creation is complete:
[features,final_associations,times] = final_template(combined_template,
cross_associations);
When this step is finished, you should save all workspace data:
save Finalized_Template;
4.3 Program Execution
You are now ready to run the recognition program!
But first, you need to make sure the engine is set to the appropriate settings. Open the
recognizer( ) function via Matlab. Scroll down until you see the LOAD TEMPLATE
FILES section:
Change the line load Prime_Template_706; to match the workspace name that you just saved all your variables to. In this case, the line should be changed to:
load Finalized_Template;
You should also change the line that says Prime_Features = FEATURES_706 into:
Prime_Features = features;
The two remaining lines should also be changed to:
Associations = final_associations;
Times = times;
Finally, scroll down to the very bottom of the recognizer( ) function and you will see a section where words are output to the screen. Change them into the written form of the words that you have just trained, making sure that they are in the correct order.
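Conceptually, that output section is just an index-to-label table in recording order; a Python sketch of the idea (the real recognizer( ) uses hard-coded MATLAB display statements):

```python
# Word numbers follow the order in which the words were recorded, so the
# labels must be listed in that same order.
WORD_LABELS = {1: "Proceed", 2: "Stop", 3: "Charge"}

def announce(word_number):
    """Return the display text for a recognized word number."""
    return WORD_LABELS.get(word_number, "<unknown word>")
```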
You are now ready to run the program! Type:
recognizer(dc,bp,ln,un)
and give it a shot. You can tweak the variables in the USER SETTINGS section of the
function to obtain better performance. The program does not know when to end, so when
you get tired of running it, hit ctrl-break and the program should stop executing.
PART FIVE
5.1 Performance Evaluation:
For the most part, our voice recognition system works; it has a few bugs that need tweaking for optimized results. It takes about 30 minutes to create a template of 24 commands: 24 commands with 5 samples each = 5 * 24 = 120 total recordings. Each recording takes about 15 seconds, so 120 * 15 = 1800 seconds = 30 minutes. Feeding the template into the association matrix and computing the word associations is a grueling 2-hour process. Because there are 26 feature vectors to associate across 120 recordings, the calculations will test your patience on a slower processor, so we made good use of our processor power for the template loading.
The 2.53 GHz computer that we used also gave us a boost when actually running
tests. Our final product had only about a 2.5 second delay between vocal input and
screen output. Although far from instantaneous, some delay is unavoidable: the system must find the end of a word and then search for a matching template. Our goal was to create smooth communication between man and machine, and in terms of speed our product came pretty close. With the introduction of outside noise into the vocal input stage, the speed did not deteriorate, only the accuracy.
Accuracy was a problem, though it can be tuned by changing settings and thresholds in the “recognizer” program. We sometimes had to repeat the same voice command before the system would output a result. In a strictly controlled environment, with no outside noise or frequency disturbances, we can achieve close to perfect results for most of our commands. Some commands, such as “Clockwise”, were almost never obtained as an output. We attribute this to the fact that it sounds extremely close to “Counterclockwise”. Other words also turned out to be similar: the commands “Up” and “Stop”, for example, were very close because the lengths of the two commands are similar. The “S” sound in “Stop” has a smaller power spectrum than other consonants, so it may be harder to detect. When we introduced noise into the system during testing, the loss of accuracy became much more apparent.
We have identified several areas for future improvement. Due to time constraints, there was only so much polishing that we could accomplish in the final days. One feature of our system that we would like to improve is user-friendliness (see the User’s Guide). The process of creating a user-defined template and loading it into our system has many steps; our vision of a good system would require only the basics: prompt the user for the number of words and the number of samples of each, then automatically load them into the association matrix. We would also like to shorten the loading time to much less than 2 hours. Once that process is complete, the user could define his own threshold values, which affect the system’s speed and accuracy. The process of obtaining vocal input for comparison with the loaded template could also be implemented in a GUI.
Our conclusion from the performance evaluation is that our final product is almost complete; we just need some minor modifications to make the system easier to use, and perhaps better speed optimization. Making the system robust is challenging; removing unwanted noise might be accomplished with the addition of filtering techniques. More testing will surely be required, and we will get closer to perfection each time.
PART 6
Parts List and Costs (items purchased during Senior Design)

Project Purchases                                        Costs
Sony Vaio 2.53 GHz computer                              $1299.99
Noise-canceling microphone                               $24.99
Omnidirectional desktop microphone                       $5.99
MATLAB + 2 module toolboxes                              $159.97
Amerikit Radio Control Car Kit                           $24.95
9.6 V NiMH battery pack, 9 V batteries, solder wire      $34.19
RadioShack digital multimeter                            $39.99

Total (without tax and shipping) = $1590.07
List of Equipment Used (Final Products only)
1) Omnidirectional desktop microphone
2) 2.53 GHz desktop computer (Sony Vaio)
3) MATLAB Student Edition (Release 13)
4) MATLAB Signal Processing Toolbox
5) MATLAB Data Acquisition Toolbox
6) Tyco Fast Traxx
7) 9.6 V NiMH battery and charger
8) Axiom CME119-EVBU (w/ M68HC11 chip)
9) Black and Decker power screwdriver
10) Solder wire and iron
11) AxIDE software
12) IAR Embedded Workbench (v2.0)
Glossary of Terms, Acronyms, and Abbreviations
Peripheral – A hardware device connected to a central hardware device, with a specific corresponding function within a complete hardware system.
Microphone – A peripheral device that accepts an analog sound signal and makes it available in digital form for application use.
ROBOKART – The name of our robot system that will eventually accept voice commands provided by a user.
Feature Vector – A set of values that are related to a defined model in our speech recognition system.
Feature Space – The location where feature vectors are stored and accessed for processing and comparison.
Template – A pre-defined collection of data that acts as a model for a command. It is used by the voice command recognition system for comparison with new input.
Algorithm – A step-by-step software procedure, often described by a flow chart or diagram.
Asynchronous – Describes communication in which two peripherals, or a peripheral and the main system, do not share the same clock timing.
Byte – Eight bits of data.
Noise – Foreign information introduced into the system that disrupts or distorts the input.
SCI – Serial Communications Interface.
68HC11 – Motorola microcontroller model number/type.
MCU – Microcontroller.
CME119-EVBU – Model number and type of evaluation board; contains the Motorola 68HC11 MCU.
RC – Radio control.
Tyco Fast Traxx – Device used for demonstrating ROBOKART; an RC car distributed by Tyco International Ltd.
Turbo King – Alternative device used for radio control testing, built from an electronics kit.
Figure References
1. http://tonydude.net/physics201/p201chapter6.htm
2. http://hyperphysics.phy-astr.gsu.edu/hbase/sound/acont.html#c1
3, 4: http://www.isip.msstate.edu/publications/journals/ieee_proceedings/1993/signal_modeling/paper_v2.pdf
21: http://www.st-andrews.ac.uk/~wjh/dataview/cluster.html
Table References
3: http://www.isip.msstate.edu/publications/journals/ieee_proceedings/1993/signal_modeling/paper_v2.pdf
References
1. “History of Speech Recognition”
http://www.stanford.edu/~jmaurer/history.htm
2. “Signal Modeling Techniques in Speech Recognition” by Joseph Picone
http://www.isip.msstate.edu/publications/journals/ieee_proceedings/1993/signal_modeling/paper_v2.pdf
3. “Columbia University Speech Recognition Lecture Notes”
http://www.ee.columbia.edu/~dpwe/e6820/lectures/

Documents
[1] Blaha, Petr, and Pavel Vaclavek. “Performance Analysis of Serial Port Interface.” Centre for Applied Cybernetics.
[2] Rodman, Robert D. Computer Speech Technology. Artech House, Boston, 1999.
Internet Websites
[1] HC11 compendium, http://ee.cleversoul.com/hc11.html
[2] Motorola website, http://www.motorola.com
PHOTOS
APPENDIX
SOURCE CODE
C code for ROBOKART (tested in the IAR Embedded Workbench)
#include <io6811.h>
#include <intr6811.h>
#include <stdio.h>

// unsigned char baudrate = BAUD;
unsigned char record;
// unsigned char Trdat = 0xa5;

void writesci(unsigned char);
unsigned char readsci(void);

void main(void)
{
    // configure the SCI
    BAUD = 0x30;
    SCCR1 = 0x00;
    SCCR2 = 0x0c;
    record = SCSR;
    record = SCDR;
    DDRD = 0x3c;      // set initial output settings for PORTD
    // DDRC = 0xff;   /* set PORTC as 8-bit output */
    // PORTB = 0x00;  // initialize to zero outputs
    // PORTC = 0x00;
    while (1)
    {
        writesci(record);
        record = readsci();
        /*
        if (CNT_DIR == 0) PORTB = 0x0f;
        else PORTB = 0x70;
        if (ROT_DIR == 0) PORTC = 0x0f;
        else PORTC = 0x70;
        */
    }
}

void writesci(unsigned char data)
{
    // data = 0xf1; (use this for testing)
    // wait until the transmit data register empty flag (TDRE, SCSR bit 7)
    // is set, then load the data for transmission
    while ((SCSR & 0x80) == 0);
    SCDR = data;
}

unsigned char readsci(void)
{
    // wait for RDRF (SCSR bit 5) to set, then act on the received command byte
    long y;
    unsigned char cmd;
    while ((SCSR & 0x20) == 0);
    cmd = SCDR;
    PORTD = 0x00;
    switch (cmd)
    {
    case 0: // Proceed Forward
        SCDR = 0x80;
        PORTD = 0x2b;                     // drive the control lines
        for (y = 0; y < 100000; y++) {}   // hold the command briefly
        PORTD = 0x00;
        break;
    case 1: // Proceed Backward
        SCDR = 0x81;
        PORTD = 0x17;
        for (y = 0; y < 100000; y++) {}
        PORTD = 0x00;
        break;
    case 2: // Turn Left
        SCDR = 0x88;
        PORTD = 0x23;
        for (y = 0; y < 100000; y++) {}
        PORTD = 0x00;
        break;
    case 3: // Turn Right
        SCDR = 0x89;
        PORTD = 0x08;
        for (y = 0; y < 100000; y++) {}
        PORTD = 0x00;
        break;
    case 4: // Rotate Clockwise
        SCDR = 0x10;
        PORTD = 0x27;
        for (y = 0; y < 100000; y++) {}
        PORTD = 0x00;
        break;
    case 5: // Rotate Counterclockwise
        SCDR = 0x11;
        PORTD = 0x1b;
        for (y = 0; y < 100000; y++) {}
        PORTD = 0x00;
        break;
    case 6: // Stop
        SCDR = 0x00;
        PORTD = 0x03;
        for (y = 0; y < 100000; y++) {}
        PORTD = 0x00;
        break;
    case 7: // Charge
        SCDR = 0x1f;
        PORTD = 0x2b;
        for (y = 0; y < 1000000; y++) {}  // longer hold for this command
        PORTD = 0x00;
        break;
    case 8: // Retreat
        SCDR = 0xf1;
        PORTD = 0x17;
        for (y = 0; y < 1000000; y++) {}
        PORTD = 0x00;
        break;
    }
    return (SCDR);
}

function [TD_template,FD_template] = make_template(ccv)
% Standard Template Format:
% Rows of each template correspond to a span (or sampling chunk) in time
% For Time-Domain templates, each column corresponds to a unique feature value
% For Frequency-Domain templates, every column corresponds to the same
% feature type but at a different location in frequency
% TD_template legend:
% TD_template(row,:) = [mean_zero_crossings mean_power]
%                              1                2
% All changes are expressed as a simple ratio between current and previous values

warning off MATLAB:divideByZero

global Bark_Indices
global Ham_Window

% User Customizable Values:
feature_length = 706;
sampling_rate = 22050;
overlap = round(feature_length/2);

% Prepare spectral window
Ham_Window = hamming(feature_length)';

% Find Bark Intervals
f_scale = sampling_rate*(0:ceil(feature_length/2))/feature_length;

% Find the frequency indices that correspond to each Bark number
Bark_Indices = bark_grouper(f_scale);
% Calculate Equal-Loudness Preemphasis weighting vector
preemphasis_weights = preemphasis(f_scale);

for c1_index = 1:length(ccv),
    cv = ccv{c1_index};
    for c2_index = 1:length(cv),
        raw_v = cv{c2_index};
        v = sound_finder(raw_v);
        windowed_samples = slider(v,feature_length,overlap);
        for k = 1:length(windowed_samples),
            % Extract time-domain features
            v_features = vector_features_static(windowed_samples{k});
            TD_template{c1_index}{c2_index}(k,:) = v_features;
            % Extract spectral features
            if length(windowed_samples{k}) < feature_length,
                shortfall = feature_length - length(windowed_samples{k});
                % if length of current data is less than required feature_length, add zero padding:
                windowed_samples{k} = cat(2,windowed_samples{k},zeros(1,shortfall));
            end
            spectrum = abs(fft(windowed_samples{k}.*Ham_Window,feature_length));
            spectrum = spectrum(1:ceil(feature_length/2)+1);
            weighted_spectrum = spectrum;
            bank_powers = filter_bank(weighted_spectrum);
            FD_template{c1_index}{c2_index}(k,:) = bank_powers;
        end
    end
end

% ----------------------------------------------------------------------- %
function [pure_sound] = sound_finder(vector)
baseline_power = 9.698568682625995e-007; % measured 'average' power for a reasonably quiet environment
minimum_power = baseline_power * 1.5;
power_interval = 105;
min_level = -0.00386435942872 * 1.5;
max_level = 0.00386267192857 * 1.5;

% Find Start Index,
for k = 1:length(vector),
    if vector(k) > max_level | vector(k) < min_level,
        sound_chunk = vector(k:(k+power_interval)-1);
        power_chunk = mean(sound_chunk.^2);
        if power_chunk > minimum_power,
            start_index = k;
            break
        end
    end
end

% Find Stop Index,
for k = length(vector):-1:1,
    sound_chunk = vector(k:-1:(k-power_interval)+1);
    power_chunk = mean(sound_chunk.^2);
    if power_chunk > minimum_power,
        stop_index = k;
        break
    end
end

pure_sound = vector(start_index:stop_index);

% ----------------------------------------------------------------------- %
function [windowed_samples] = slider(vector,window_length,overlap)
% For this version, overlap must be greater than zero
% NOTE: windowed_samples{k} = column vector
[rows,cols] = size(vector);
if rows > 1,
    vector = vector';
end
start_indices = 1:window_length-overlap:length(vector)-(window_length-overlap);
stop_indices = window_length:window_length-overlap:length(vector);
if length(stop_indices) < length(start_indices),
    stop_indices = cat(2,stop_indices,length(vector));
end
windowed_samples = cell(length(start_indices),1); % pre-allocation
for k = 1:length(start_indices),
    windowed_samples{k} = vector(start_indices(k):stop_indices(k));
end

% ----------------------------------------------------------------------- %
function [v_features_static] = vector_features_static(row_vector)
mean_zero_crossings = mean(crossing_counter(row_vector,0));
sound_power = mean(row_vector.^2);
v_features_static = [mean_zero_crossings sound_power];

% ----------------------------------------------------------------------- %
function [output] = crossing_counter(vector,target_value)
% forward-looking function that flags instances when a specified value is crossed.
% output will be the same format as the input vector
% output length will be the same length as the input
[in_rows,in_cols] = size(vector);
if in_rows > 1,
    output = zeros(length(vector),1);
end
if in_cols > 1,
    output = zeros(1,length(vector));
end
for index = 1:length(vector)-1,
    % check for upward crossing:
    if vector(index) < target_value & vector(index+1) > target_value,
        output(index+1) = 1;
    elseif vector(index) == target_value & vector(index+1) > target_value,
        output(index) = 1;
    end
    % check for downward crossing:
    if vector(index) > target_value & vector(index+1) < target_value,
        output(index+1) = 1;
    elseif vector(index) == target_value & vector(index+1) < target_value,
        output(index) = 1;
    end
end

% ----------------------------------------------------------------------- %
function [weights] = preemphasis(fvector)
% OFFICIAL (SORT OF) VERSION OF THE PRE-EMPHASIS FUNCTION
% Maximum sensitivity at 4722.37 Hz
% Peak Gain = 390 (approx)
function_boost = 1.640399551056837e+021;
w = 2*pi*fvector;
weights = (((w.^2 + 56.8*10^6).*w.^4)./((w.^2 + 6.3*10^6).*(w.^2 + 0.38*10^9).*(w.^6+9.58*10^26)))*function_boost;

% ----------------------------------------------------------------------- %
function [bank_powers] = filter_bank(spectrum)
% spectrum must be a column vector
global Bark_Indices;
bank_powers = zeros(1,length(Bark_Indices)); % pre-allocate 'bank_powers' as a column vector
for k = 1:length(Bark_Indices),
    triangle_window = triang(length(Bark_Indices{k}))'; % triangle_window is a column vector
    chunk = spectrum(Bark_Indices{k});
    windowed_chunk = triangle_window.*chunk;
    bank_powers(k) = log10(mean(windowed_chunk.^2));
end

% ----------------------------------------------------------------------- %
function [barked_indices] = bark_grouper(fvector)
% output is a CV (Cells of Vectors)
% make sure that fvector is in column form
[rows,cols] = size(fvector);
if rows > 1,
    fvector = fvector';
end
center_freq = [50 150 250 350 450 570 700 840 1000 1170 1370 1600 1850 2150 2500 2900 3400 4000 4800 5800 7000 8500 10500 13500];
bandwidth = [100 100 100 100 110 120 140 150 160 190 210 240 280 320 380 450 550 700 900 1100 1300 1800 2500 3500];
f = fvector;
for k = 1:24,
    if k == 1,
        lower_bound = 0;
        upper_bound = 150;
    else
        lower_bound = center_freq(k) - bandwidth(k);
        upper_bound = center_freq(k) + bandwidth(k);
    end
    test = find(f >= lower_bound & f <= upper_bound);
    if isempty(test) == 1,
        break
    else
        barked_indices{k} = test;
    end
end

function [] = recognizer(dc_offset,base_power,ln,un)
% USER-CUSTOMIZABLE VARIABLES:
global Sampling_Rate;       % frequency rate at which voice will be sampled at
global Trigger_Interval;    % number of samples to acquire during every trigger
global Buffer_Size;         % size of the various storage arrays
global Feature_Length;      % must be an even number
global Overlap;             % number of samples by which each sliding window overlaps
global DC;                  % average value of intrinsic noise
global Baseline_Power;      % average power of intrinsic noise
global Minimum_Power;       % minimum required power that a spoken utterance ought to have
global Min_Power_Interval;  % minimum number of samples required to make an accurate power estimate
global Min_Power_Chunks;    % minimum number of power measurements per trigger
global Elasticity;          % permissible magnitude variance (must be an integer)
global Attention;           % number of promising candidates to pay attention to
global Entity_Groups;       % the number of "words" that the system is required to recognize
global Recog_Span
global Recog_Space
global Mag_Threshold
global Time_Threshold_High
global Time_Threshold_Low
global Chain_Length
global Minimum_Length

% DATA-STORAGE STRUCTURES
global Partitioned_Sound_Buffer;  % place to store trigger-collected samples after noise and silence have been removed
global Vector_Features_Buffer
global Raw_Sound_Chunk;     % the most recently acquired, yet-unprocessed trigger sample
global Recurring_Matrix
global Current_Recurring_Entities
global Previous_Recurring_Entities

% STATUS FLAGS
global PSB_Full;            % flags a '1' if Partitioned Sound Buffer is full
global First_Time;

% INDEX (LOCATION) TRACKERS
global Current_PSB_Count;   % index pointing to the location of the most recent value in the Partitioned Sound Buffer
global Current_VFB_Count;   % index pointing to the location of the most recent value in the Vector Features Buffer
global Recurring_Matrix_Count;

% AUTOMATICALLY-DETERMINED PARAMETERS
global Power_Interval;      % the automatically-determined number of samples to be used on every power measurement
global Power_Chunk_Indices; % an automatic, pre-set vector used to determine the sample cutoff points for each power measurement
global Bark_Indices;        % frequency index values that correspond to each integer Bark value (of 1 through 24)
global Feature_Count;       % Number of features involved in time-domain analysis
global Ham_Window;
global Prime_Features
global Associations
global Times
global Template_Size

% Loaded settings
DC = dc_offset;
Baseline_Power = base_power;
lower_noise = ln;
upper_noise = un;

% User Settings:
Sampling_Rate = 22050;
Trigger_Interval = 11025;   % must be an even number
Buffer_Size = 120;
Feature_Length = 706;       % if using half-overlap, this must be an even number
Overlap = round(Feature_Length/2);   % must be smaller than Feature_Length
Minimum_Power = Baseline_Power * 1.5;
Min_Power_Interval = 60;
Min_Power_Chunks = 7;
Elasticity = 1;
Attention = 5;
Entity_Groups = 24;
Recog_Span = 90;
Mag_Threshold = 0.6;
Time_Threshold_Low = 0.27;
Time_Threshold_High = 0.77;
Chain_Length = 3;
Minimum_Length = 30;

% LOAD TEMPLATE FILES
load Prime_Template_706;
Prime_Features = FEATURES_706;
Associations = ASSOCIATIONS_706;
Times = TIMES_706;

% Extracted parameters:
lower_trigger = (lower_noise * 1.5) + DC;
upper_trigger = (upper_noise * 1.5) + DC;

% Data Acquisition Parameters:
source = analoginput('winsound');
addchannel(source,1);
source.TriggerChannel = source.Channel(1);
source.BitsPerSample = 16;
source.LoggingMode = 'memory';
source.SampleRate = Sampling_Rate;
source.SamplesPerTrigger = Trigger_Interval;
source.TriggerType = 'software';
source.TriggerCondition = 'leaving';
source.TriggerConditionValue = [lower_trigger upper_trigger];
source.TriggerDelay = 0;
source.TriggerDelayUnits = 'samples';
source.TriggerRepeat = Inf;
source.StartFcn = @start_prep;
source.StopFcn = @shutdown;
source.SamplesAcquiredFcnCount = Trigger_Interval;
source.TriggerFcn = @extract_features;
source.SamplesAcquiredFcn = @all_clear;

% Pre-Allocate Storage Structures:
Partitioned_Sound_Buffer = cell(1,Buffer_Size);

% Calculate Hamming window:
Ham_Window = hamming(Feature_Length)';

% Intelligent power interval determination:
upper_choice = floor(Trigger_Interval/Min_Power_Chunks);
choices = Min_Power_Interval:upper_choice;
remainders = mod(Trigger_Interval,choices);
indices = find(remainders == 0);
if isempty(indices) ~= 1,
    qualified_choices = choices(indices);
    sorted_qualified_choices = sort(qualified_choices);
    qualified_indices = find(sorted_qualified_choices >= Min_Power_Interval);
    if isempty(qualified_indices) ~= 1,
        best_choices = sorted_qualified_choices(qualified_indices);
        Power_Interval = best_choices(1);
    else
        backup_qualified_indices = find(sorted_qualified_choices < Min_Power_Interval);
        next_best_choices = sorted_qualified_choices(backup_qualified_indices);
        Power_Interval = max(next_best_choices);
    end
else
    Power_Interval = Trigger_Interval;
end
% Chunk boundaries; the +1 keeps the final chunk (and ensures at least one
% chunk when Power_Interval equals Trigger_Interval):
Power_Chunk_Indices = 1:Power_Interval:Trigger_Interval+1;

% Find Bark intervals:
f_scale = Sampling_Rate*(0:ceil(Feature_Length/2))/Feature_Length;

% Find the frequency indices that correspond to each Bark number:
Bark_Indices = bark_grouper(f_scale);

% Find template size:
[rows,cols] = size(Prime_Features);
Template_Size = rows;

warning off MATLAB:divideByZero
First_Time = 1;
disp(' ')
disp(' ')
disp(' ')
start(source);

% ----------------------------------------------------------------------- %
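The "intelligent power interval determination" above searches for the smallest chunk size that divides the trigger window evenly while staying at or above the minimum interval. A minimal Python sketch of the same search (the function name is ours; the defaults are the settings from this script):

```python
def choose_power_interval(trigger_interval=11025, min_interval=60, min_chunks=7):
    """Smallest divisor of trigger_interval that is at least min_interval
    samples long while still yielding at least min_chunks chunks."""
    upper = trigger_interval // min_chunks
    divisors = [c for c in range(min_interval, upper + 1)
                if trigger_interval % c == 0]
    if divisors:
        return min(divisors)       # smallest qualifying divisor
    return trigger_interval        # fall back to one chunk per trigger window

# With the report's settings, 11025 = 3^2 * 5^2 * 7^2, so the smallest
# divisor of at least 60 samples is 63 (175 chunks per trigger window).
```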
function [] = start_prep(source,Start)

global Partitioned_Sound_Buffer
global Current_PSB_Count;
global Current_VFB_Count;
global Vector_Features_Buffer;
global Raw_Sound_Chunk;
global PSB_Full;
global Entity_Groups;
global Recog_Space
global Recurring_Matrix

Current_PSB_Count = 0;
Current_VFB_Count = 0;
Partitioned_Sound_Buffer = {};
Raw_Sound_Chunk = 0;
PSB_Full = 0;
Vector_Features_Buffer = [];
Recog_Space = [];
Recurring_Matrix = [];

% ----------------------------------------------------------------------- %

function [] = all_clear(source,SamplesAcquired)

global First_Time
First_Time = 0;

% ----------------------------------------------------------------------- %

function [] = shutdown(source,Stop)

delete(source);
clear source;

% ----------------------------------------------------------------------- %

function [] = extract_features(source,Trigger)

global Sampling_Rate;
global Trigger_Interval;
global Buffer_Size;
global Feature_Length;
global Overlap;
global DC;
global Baseline_Power;
global Minimum_Power;
global Min_Power_Interval;
global Min_Power_Chunks;
global Elasticity;
global Attention;
global Entity_Groups;
global Recog_Span
global Recog_Space
global Mag_Threshold
global Time_Threshold_High
global Time_Threshold_Low
global Chain_Length
global Minimum_Length
global Partitioned_Sound_Buffer;
global Vector_Features_Buffer
global Raw_Sound_Chunk;
global Recurring_Matrix;
global Current_Recurring_Entities
global Previous_Recurring_Entities
global PSB_Full;
global First_Time
global Current_PSB_Count;
global Current_VFB_Count;
global Recurring_Matrix_Count;
global Power_Interval;
global Power_Chunk_Indices;
global Bark_Indices;
global Ham_Window;
global Prime_Features
global Associations
global Times
global Template_Size

if First_Time == 1,
    return
end

Raw_Sound_Chunk = getdata(source) - DC;

% Determine whether the acquired trigger sample has sufficient active power,
% and store those samples that do:
[pure_sound,valid] = partitioner(Raw_Sound_Chunk);
if valid == 1,
    Current_PSB_Count = Current_PSB_Count + 1;
    Partitioned_Sound_Buffer{Current_PSB_Count} = pure_sound;   % pure_sound = column vector
end

% Extract features from active samples:
if valid == 1,

    % NOTE: windowed_samples{k} = row vector
    if length(pure_sound) >= Feature_Length,
        windowed_samples = slider(pure_sound,Feature_Length,Overlap);
    else
        windowed_samples{1} = pure_sound';
    end

    % Process each window of samples:
    for k = 1:length(windowed_samples),

        % Extract time-domain features:
        v_features = vector_features_static(windowed_samples{k});
        Current_VFB_Count = Current_VFB_Count + 1;
        Vector_Features_Buffer(Current_VFB_Count,:) = v_features;

        % Extract spectral features; if the current window is shorter than
        % Feature_Length, add zero padding:
        if length(windowed_samples{k}) < Feature_Length,
            shortfall = Feature_Length - length(windowed_samples{k});
            windowed_samples{k} = cat(2,windowed_samples{k},zeros(1,shortfall));
        end
        spectrum = abs(fft(windowed_samples{k}.*Ham_Window,Feature_Length));
        spectrum = spectrum(1:ceil(Feature_Length/2)+1);
        weighted_spectrum = spectrum;
        bank_powers = filter_bank(weighted_spectrum);
        acquired_features = cat(2,v_features,bank_powers);

        % Calculate error against every template entry:
        replicant = repmat(acquired_features,[Template_Size,1]);
        error_distances = sqrt(sum((replicant-Prime_Features).^2,2));
        [Y,I] = sort(error_distances);
        candidate_entities_matrix = Associations(I(1:Attention),:);
        candidate_entities_matrix = candidate_entities_matrix';
        candidate_entities = cat(1,candidate_entities_matrix(:))';
        candidate_entities(candidate_entities == 0) = [];
        candidate_times_matrix = Times(I(1:Attention),:);
        candidate_times_matrix = candidate_times_matrix';
        candidate_times = cat(1,candidate_times_matrix(:))';
        candidate_times(candidate_times == 0) = [];
        pad = zeros(1,Entity_Groups);
        pad(candidate_entities) = candidate_times;
        Recog_Space = cat(1,Recog_Space,pad);
        [recog_rows,cols] = size(Recog_Space);
        if recog_rows > Recog_Span,
            Recog_Space = Recog_Space(2:recog_rows,:);
        end
        [recog_rows,cols] = size(Recog_Space);
        if recog_rows > Chain_Length + 1,
            previous_chunk = Recog_Space(recog_rows-Chain_Length:recog_rows-1,:);
            [d,previous_recog_entities] = find(previous_chunk > 0);
            previous_recog_entities = unique(previous_recog_entities);
            current_recog_line = Recog_Space(recog_rows,:);
            current_recog_entities = find(current_recog_line > 0);
            recurring_entities = intersect(current_recog_entities,previous_recog_entities);

            % Check for causality:
            if isempty(recurring_entities) ~= 1,
                previous_recurring_times = Recog_Space(recog_rows-Chain_Length:recog_rows-1,recurring_entities);
                previous_recurring_times_mean = mean(previous_recurring_times,1);
                current_recurring_times = current_recog_line(recurring_entities);
                current_minus_previous = current_recurring_times - previous_recurring_times_mean;
                pseudo_indices = find(current_minus_previous > -0.1);
                if isempty(pseudo_indices) ~= 1,
                    qualified_recurring_entities = recurring_entities(pseudo_indices);
                    current_qualified_recurring_times = Recog_Space(recog_rows,qualified_recurring_entities);
                    pad = zeros(1,Entity_Groups);
                    pad(qualified_recurring_entities) = current_qualified_recurring_times;
                    Recurring_Matrix = cat(1,Recurring_Matrix,pad);
                    [check_rows,check_cols] = size(Recurring_Matrix);
                    if check_rows > Recog_Span,
                        Recurring_Matrix = Recurring_Matrix(2:check_rows,:);
                    end
                    [Recurring_Matrix_Count,check_cols] = size(Recurring_Matrix);

                    % Check for finalists:
                    current_recurring_matrix_line = Recurring_Matrix(Recurring_Matrix_Count,:);
                    potential_finalist_entities = find(current_recurring_matrix_line >= Time_Threshold_High);
                    if isempty(potential_finalist_entities) ~= 1,
                        absolute_final = zeros(1,Entity_Groups);
                        for k = 1:length(potential_finalist_entities),
                            current_finalist_entity = potential_finalist_entities(k);
                            current_finalist_times = Recurring_Matrix(:,current_finalist_entity);
                            current_finalist_times_locs = find(current_finalist_times > 0);
                            time_loc_span = length(current_finalist_times_locs);
                            if time_loc_span < Minimum_Length,
                                %Recurring_Matrix(:,current_finalist_entity) = 0;
                                continue
                            end
                            % Consider only the nonzero time entries when
                            % checking the earliest time:
                            if min(current_finalist_times(current_finalist_times_locs)) <= Time_Threshold_Low,
                                absolute_final(current_finalist_entity) = time_loc_span;
                            end
                        end
                        if sum(absolute_final) ~= 0,
                            [test_value,Entity] = max(absolute_final);
                            flasher(Entity);
                            Recurring_Matrix(:,Entity) = 0;
                            Recurring_Matrix(:,potential_finalist_entities) = 0;
                            if Entity == 8,
                                disp('ENGINE HAS STOPPED')
                                stop(source);
                            end
                        end
                    end   % if potential finalists exist
                end       % if qualified recurring entities exist
            end           % if recurring entities exist
        end               % if enough recognition history
    end                   % for each window
end                       % if valid

% ----------------------------------------------------------------------- %

function [v_features_static] = vector_features_static(row_vector)
% Time-domain features: mean zero-crossing rate and mean power.

zero_crossings = mean(crossing_counter(row_vector,0));
sound_power = mean(row_vector.^2);
v_features_static = [zero_crossings sound_power];

% ----------------------------------------------------------------------- %

function [output] = crossing_counter(vector,target_value)
% Forward-looking function that flags instances when a specified value is
% crossed.  The output has the same orientation and length as the input.

[in_rows,in_cols] = size(vector);
if in_rows > 1,
    output = zeros(length(vector),1);
end
if in_cols > 1,
    output = zeros(1,length(vector));
end
for index = 1:length(vector)-1,
    % Check for upward crossing:
    if vector(index) < target_value & vector(index+1) > target_value,
        output(index+1) = 1;
    elseif vector(index) == target_value & vector(index+1) > target_value,
        output(index) = 1;
    end
    % Check for downward crossing:
    if vector(index) > target_value & vector(index+1) < target_value,
        output(index+1) = 1;
    elseif vector(index) == target_value & vector(index+1) < target_value,
        output(index) = 1;
    end
end

% ----------------------------------------------------------------------- %

function [pure_sound,valid] = partitioner(vector)

global Minimum_Power;
global Power_Interval;
global Power_Chunk_Indices;

% getdata returns a column vector; work with a row vector instead:
vector = vector';
squared = vector.^2;
power_chunks = zeros(length(Power_Chunk_Indices)-1,1);                 % pre-allocation
vector_chunks = zeros(length(Power_Chunk_Indices)-1,Power_Interval);   % pre-allocation
for k = 1:length(Power_Chunk_Indices)-1,
    power_chunks(k,:) = mean(squared(Power_Chunk_Indices(k):Power_Chunk_Indices(k+1)-1));
    vector_chunks(k,:) = vector(Power_Chunk_Indices(k):Power_Chunk_Indices(k+1)-1);
end
qualified_power_indices = find(power_chunks > Minimum_Power);
if isempty(qualified_power_indices) ~= 1,
    qualified_vector_chunks = vector_chunks(qualified_power_indices,:);
    pure_sound = reshape(qualified_vector_chunks',[],1);   % concatenate the kept chunks into one column vector
    valid = 1;
else
    pure_sound = NaN;
    valid = 0;
end

% ----------------------------------------------------------------------- %

function [windowed_samples] = slider(vector,window_length,overlap)
% For this version, overlap must be greater than zero.
% NOTE: windowed_samples{k} = row vector

[rows,cols] = size(vector);
if rows > 1,
    vector = vector';
end
start_indices = 1:window_length-overlap:length(vector)-(window_length-overlap);
stop_indices = window_length:window_length-overlap:length(vector);
if length(stop_indices) < length(start_indices),
    stop_indices = cat(2,stop_indices,length(vector));
end
windowed_samples = cell(length(start_indices),1);   % pre-allocation
for k = 1:length(start_indices),
    windowed_samples{k} = vector(start_indices(k):stop_indices(k));
end

% ----------------------------------------------------------------------- %

function [barks] = bark(fvector)
% Implements Traunmuller's Bark scale.

f = fvector;
barks = ((26.81*f)./(1960+f)) - 0.53;

% Apply low- and high-frequency corrections:
for index = 1:length(barks),
    if barks(index) < 2,
        barks(index) = barks(index) + 0.15*(2-barks(index));
    elseif barks(index) > 20.1,
        barks(index) = barks(index) + 0.22*(barks(index)-20.1);
    end
end

% ----------------------------------------------------------------------- %

function [bank_powers] = filter_bank(spectrum)
% spectrum must be a row vector

global Bark_Indices;

bank_powers = zeros(1,length(Bark_Indices));   % pre-allocate 'bank_powers' as a row vector
for k = 1:length(Bark_Indices),
    triangle_window = triang(length(Bark_Indices{k}))';   % triangle_window is a row vector
    chunk = spectrum(Bark_Indices{k});
    windowed_chunk = triangle_window.*chunk;
    bank_powers(k) = log10(mean(windowed_chunk.^2));
end

% ----------------------------------------------------------------------- %

function [barked_indices] = bark_grouper(fvector)
% Output is a CV (Cell of Vectors).

% Make sure that fvector is a row vector:
[rows,cols] = size(fvector);
if rows > 1,
    fvector = fvector';
end
center_freq = [50 150 250 350 450 570 700 840 1000 1170 1370 1600 1850 2150 ...
               2500 2900 3400 4000 4800 5800 7000 8500 10500 13500];
bandwidth = [100 100 100 100 110 120 140 150 160 190 210 240 280 320 ...
             380 450 550 700 900 1100 1300 1800 2500 3500];
f = fvector;
for k = 1:24,
    if k == 1,
        lower_bound = 0;
        upper_bound = 150;
    else
        lower_bound = center_freq(k) - bandwidth(k);
        upper_bound = center_freq(k) + bandwidth(k);
    end
    test = find(f >= lower_bound & f <= upper_bound);
    if isempty(test) == 1,
        break
    else
        barked_indices{k} = test;
    end
end
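The Bark-scale conversion used above (Traunmuller's formula with the low- and high-end corrections) can be sanity-checked outside MATLAB. A Python sketch of the same mapping (the function name is ours):

```python
def hz_to_bark(f):
    """Traunmuller's Hz-to-Bark conversion with end corrections."""
    z = (26.81 * f) / (1960.0 + f) - 0.53
    if z < 2.0:                      # low-frequency correction
        z += 0.15 * (2.0 - z)
    elif z > 20.1:                   # high-frequency correction
        z += 0.22 * (z - 20.1)
    return z

# For example, 1000 Hz maps to roughly 8.53 Bark, near the 9th critical band.
```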
% ----------------------------------------------------------------------- %

function [] = flasher(entity)
% Displays the recognized command word, followed by four blank lines.

command_words = {'ROBOKART','ROTATE','CLOCKWISE','COUNTERCLOCKWISE', ...
                 'PROCEED','FORWARD','BACKWARD','STOP COMMAND ISSUED', ...
                 'TURN','LEFT','RIGHT','SPEED','UP','SLOW','DOWN','DANCE', ...
                 'CHARGE','RETREAT','GOOD','BAD','GO','TO','SLEEP','WAKE'};
if entity >= 1 & entity <= length(command_words),
    disp(command_words{entity})
    disp(' ')
    disp(' ')
    disp(' ')
    disp(' ')
end

% ----------------------------------------------------------------------- %

function [intra_entity_synonyms] = example_matcher(combined_template)
% combined_template must be a 4-D matrix.
% intra_entity_synonyms: 3-D cell array (m x o x p) of two row vectors:
% example numbers aligned with sample indices.

% DATA-STORAGE STRUCTURES
global Local_Template

% AUTOMATICALLY-DETERMINED PARAMETERS
global Feature_Count;
global Elasticity

% Settings:
Elasticity = 1;

warning off MATLAB:divideByZero

Template = combined_template;
[m,Feature_Count,o,p] = size(Template);

% Find the duration of each example:
for p_index = 1:p,
    for o_index = 1:o,
        example = Template(:,:,o_index,p_index);
        vertical_slice = example(:,1);
        valid_slice = vertical_slice(finite(vertical_slice));
        duration = length(valid_slice);
        durations(o_index,p_index) = duration;
    end
end

intra_entity_synonyms = cell(m,o,p);
for p_index = 1:p,
    p_index   % progress display
    Local_Template = Template(:,:,:,p_index);
    for o_index = 1:o,
        o_index   % progress display
        example_duration = durations(o_index,p_index);
        for m_index = 1:example_duration,
            acquired_sample = Local_Template(m_index,:,o_index);
            Local_Template(m_index,:,o_index) = repmat(NaN,[1 Feature_Count]);   % remove the sample from its own search space
            [best_examples,best_sample_indices] = local_matcher(acquired_sample);
            Local_Template(m_index,:,o_index) = acquired_sample;                 % restore
            synonyms = cat(2,best_examples,best_sample_indices);
            intra_entity_synonyms{m_index,o_index,p_index} = synonyms;
        end
    end
end

% ----------------------------------------------------------------------- %

function [delta_m_best,best_indices] = local_matcher(acquired_features)

global Elasticity;
global Local_Template
% Create error template:
[m,n,o] = size(Local_Template);
error_template = repmat(NaN,[m n o]);
for feature_index = 1:length(acquired_features),
    current_feature_value = acquired_features(feature_index);
    error_template(:,feature_index,:) = abs((current_feature_value - Local_Template(:,feature_index,:)) ./ Local_Template(:,feature_index,:));
end

% Calculate Delta-M error values:
unsorted_delta_m_cf_error = mean(error_template,2);
[delta_m_cf_error,delta_m_cf_indices] = sort(unsorted_delta_m_cf_error,1);
best_delta_m_cf_error = squeeze(delta_m_cf_error(Elasticity,:,:));
best_delta_m_cf_indices = squeeze(delta_m_cf_indices(Elasticity,:,:));
[Y,I] = sort(best_delta_m_cf_error);
delta_m_best = I;                            % example numbers, row vector
best_indices = best_delta_m_cf_indices(I);   % in-example indices, row vector

% ----------------------------------------------------------------------- %

function [associations] = associator(template,synonyms)
% template must be a 4-D matrix.
% synonyms must be a 3-D cell array.

tic
[m,n,o,p] = size(template);

% Find the duration of each example:
for p_index = 1:p,
    for o_index = 1:o,
        example = template(:,:,o_index,p_index);
        vertical_slice = example(:,1);
        valid_slice = vertical_slice(finite(vertical_slice));
        duration = length(valid_slice);
        durations(o_index,p_index) = duration;
    end
end

for p_index = 1:p,
    reference_entity = p_index   % progress display
    target_entities = 1:p;
    for o_index = 1:o,
        o_index   % progress display
        max_duration = durations(o_index,p_index);
        for m_index = 1:max_duration,
            % Retrieve reference data:
            reference_point = template(m_index,:,o_index,p_index);
            reference_point_neighbor_distances = synonyms{m_index,o_index,p_index}(:,3);
            reference_point_neighbor_times = synonyms{m_index,o_index,p_index}(:,2);

            % Remove the current entity's own example from the reference data
            % to maintain an accurate standard deviation:
            reference_point_neighbor_distances(1) = [];
            reference_point_neighbor_times(1) = [];

            % Set the standards:
            distance_to_beat = mean(reference_point_neighbor_distances);
            distance_dev_to_beat = std(reference_point_neighbor_distances);
            time_dev_to_beat = std(reference_point_neighbor_times);

            % Calculate competing entity stats:
            competition_matrix = template(:,:,:,target_entities);
            [c1,c2,c3,c4] = size(competition_matrix);
            reference_matrix = repmat(reference_point,[c1 1 c3 c4]);
            distance_matrix = sqrt(sum((competition_matrix-reference_matrix).^2,2));
            [sorted_distances,local_times] = sort(distance_matrix,1);
            best_distances = squeeze(sorted_distances(1,:,:,:));
            best_times = squeeze(local_times(1,:,:,:));

            % Make sure the current entity always wins:
            best_distances(:,reference_entity) = 0;
            best_times(:,reference_entity) = m_index;

            % Judgement time:
            consolidated_error = mean(best_distances,1);
            consolidated_error_dev = std(best_distances,0,1);
            consolidated_time_dev = std(best_times,0,1);
            best_error_candidates = find(consolidated_error <= distance_to_beat);
            best_error_dev_candidates = find(consolidated_error_dev <= distance_dev_to_beat);
            best_time_dev_candidates = find(consolidated_time_dev <= time_dev_to_beat);
            pre_candidates = intersect(best_error_candidates,best_error_dev_candidates);
            Best_Candidates = intersect(pre_candidates,best_time_dev_candidates);
            chain = [];
            for k = Best_Candidates,
                [hits,best_local_time] = max(hist(best_times(:,k),[1:max(best_times(:,k))]));
                chain = cat(2,chain,best_local_time);
            end
            % Store the winners:
            Best_Entities = Best_Candidates;
            Best_Times = chain;
            associations{m_index,o_index,p_index} = cat(1,Best_Entities,Best_Times);
        end
    end
end
toc

% ----------------------------------------------------------------------- %

function [matrix] = ccm_to_matrix(ccm)
% Converts a Cell-Cell-Matrix (CCM) into a 4-dimensional matrix.
% C1 is formed along the 4th dimension.
% C2 is formed along the 3rd (Z) dimension.
% M already occupies the 1st (X) and 2nd (Y) dimensions.

% Find maximum lengths:
c1_max = length(ccm);
c2_max = 0;
row_max = 0;
col_max = 0;
for c1_index = 1:length(ccm),
    cm = ccm{c1_index};
    if length(cm) > c2_max,
        c2_max = length(cm);
    end
    for c2_index = 1:length(cm),
        m = cm{c2_index};
        [rows,cols] = size(m);
        if rows > row_max,
            row_max = rows;
        end
        if cols > col_max,
            col_max = cols;
        end
    end
end

% Pre-allocate the matrix, padded with NaN's:
matrix = zeros(row_max,col_max,c2_max,c1_max) + NaN;

% Populate the matrix:
for c1_index = 1:length(ccm),
    cm = ccm{c1_index};
    for c2_index = 1:length(cm),
        m = cm{c2_index};
        [current_rows,current_cols] = size(m);
        matrix(1:current_rows,1:current_cols,c2_index,c1_index) = m;
    end
end

% ----------------------------------------------------------------------- %

function [condensed_template] = condenser(template,intra_synonyms,duration)
% template must be a 4-D matrix.
% intra_synonyms must be a 3-D cell array.
% duration: 1 = shortest, 2 = mean, 3 = longest; controls the size standard
% by which each example is to be compared.

tic
[m,n,o,p] = size(template);

% Find the desired durations:
for p_index = 1:p,
    for o_index = 1:o,
        example = template(:,:,o_index,p_index);
        vert_slice = example(:,1);
        I = find(finite(vert_slice));
        example_lengths(o_index,p_index) = length(I);
    end
    [min_lengths(p_index),min_example(p_index)] = min(example_lengths(:,p_index));
    % Find the example closest to the mean length:
    theoretical_mean = mean(example_lengths(:,p_index));
    error_line = abs((theoretical_mean - example_lengths(:,p_index)) ./ example_lengths(:,p_index));
    [Y,I] = sort(error_line);
    mean_lengths(p_index) = example_lengths(I(1),p_index);
    mean_example(p_index) = I(1);
    [max_lengths(p_index),max_example(p_index)] = max(example_lengths(:,p_index));
end

if duration == 1,       % SHORTEST
    target_lengths = min_lengths;
    target_examples = min_example;
end
if duration == 2,       % MEAN
    target_lengths = mean_lengths;
    target_examples = mean_example;
end
if duration == 3,       % LONGEST
    target_lengths = max_lengths;
    target_examples = max_example;
end

% Pre-allocation:
condensed_template = repmat(NaN,[max(target_lengths) n 3 p]);

for p_index = 1:p,
    p_index   % progress display
    target = target_examples(p_index);
    duration = target_lengths(p_index);
    for m_index = 1:duration,
        associations = intra_synonyms{m_index,target,p_index};
        examples = associations(:,1);
        locations = associations(:,2);
        % Individual feature differentiation:
        for n_index = 1:n,   % n = feature count
            feature_list = [];
            for k = 1:length(examples),
                feature_list = cat(1,feature_list,template(locations(k),n_index,examples(k),p_index));
                min_feature = min(feature_list);
                mean_feature = mean(feature_list);
                max_feature = max(feature_list);
                condensed_template(m_index,n_index,1,p_index) = min_feature;
                condensed_template(m_index,n_index,2,p_index) = mean_feature;
                condensed_template(m_index,n_index,3,p_index) = max_feature;
            end
        end
    end
end
toc

% ----------------------------------------------------------------------- %

function [features,associations,times] = final_template(raw_template,raw_associations)
% raw_template must be a 4-D matrix.
% raw_associations must be a 3-D cell array.

tic
[m,n,o,p] = size(raw_template);

% Find the duration of each example:
for p_index = 1:p,
    for o_index = 1:o,
        example = raw_template(:,:,o_index,p_index);
        vertical_slice = example(:,1);
        valid_slice = vertical_slice(finite(vertical_slice));
        duration = length(valid_slice);
        durations(o_index,p_index) = duration;
    end
    mean_durations(p_index) = mean(durations(:,p_index));
end

[m2,o2,p2] = size(raw_associations);
max_col = 0;
for p2_index = 1:p2,
    for o2_index = 1:o2,
        for m2_index = 1:m2,
            retrieved = raw_associations{m2_index,o2_index,p2_index};
            if isempty(retrieved) ~= 1,
                sample = retrieved(1,:);
                if length(sample) > max_col,
                    max_col = length(sample);
                end
            end
        end
    end
end

k = 1;
for p_index = 1:p,
    p_index   % progress display
    for o_index = 1:o,
        max_duration = durations(o_index,p_index);
        for m_index = 1:max_duration,
            features(k,:) = raw_template(m_index,:,o_index,p_index);
            retrieved = raw_associations{m_index,o_index,p_index}(1,:);
            padding = zeros(1,max_col-length(retrieved));
            associations(k,:) = cat(2,raw_associations{m_index,o_index,p_index}(1,:),padding);
            normalized_time = (raw_associations{m_index,o_index,p_index}(2,:)) ./ mean_durations(retrieved(1,:));
            times(k,:) = cat(2,normalized_time,padding);
            k = k + 1;
        end
    end
end
toc

% ----------------------------------------------------------------------- %

function [output] = statistics(DataVector)
% This function calculates the mean, standard deviation,
% Gaussian minimum, Gaussian maximum, actual minimum, actual maximum,
% and a 'normal distribution figure of merit'.
% INPUT MUST BE A ONE-DIMENSIONAL VECTOR.
% OUTPUT WILL BE A ROW VECTOR.

DataMean = mean(DataVector);
Sdev = std(DataVector);
Gmin = DataMean - 3*Sdev;
Gmax = DataMean + 3*Sdev;
Amin = min(DataVector);
Amax = max(DataVector);
normality = 1 - abs(DataMean - median(DataVector))/DataMean;
output = [DataMean Sdev Gmin Gmax Amin Amax normality];
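For readers without MATLAB, the same summary can be sketched in Python (the names are ours; `statistics.stdev` matches MATLAB's sample standard deviation):

```python
import statistics as st

def summary(data):
    """Mean, sample std, 3-sigma Gaussian bounds, actual extrema, and the
    report's 'normality' figure of merit (1 when mean equals median)."""
    mu = st.mean(data)
    sd = st.stdev(data)                          # sample standard deviation
    normality = 1 - abs(mu - st.median(data)) / mu
    return [mu, sd, mu - 3*sd, mu + 3*sd, min(data), max(data), normality]
```

Note that the figure of merit divides by the mean, so it is only meaningful for data with a nonzero mean, as is the case for the power features it is applied to here.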