Voice Driven Animation System


    Abstract

The goal of this report is to develop a voice driven animation system that can take human voice as commands to generate the desired character animation, based on motion capture data. In this report, the idea of our system is first introduced, followed by a review of the background related to its development. Then we discuss the Microsoft Speech API and the Java Speech API, which are used as the voice recognition engines in the system.

1 Introduction

In a traditional animation system, the animator must use the mouse and keyboard to specify the path along which the character will move and the action that the character will be doing during the movements. This kind of interaction is not very effective, because either clicking on buttons or typing on the keyboard distracts the animator who is trying to focus on creating the animation. To improve the interaction, we can borrow the idea from the filmmaking scenario, where the director uses his voice to tell the actor what to do before the shooting of the scene, and then the actor performs the action exactly as he was told to. This is how we came up with the idea of using the human voice as a medium to make a better interface for the animation system.

2 Background

In 1986, Dr. Jakob Nielsen asked a group of 57 IT professionals to predict what would be the greatest changes in user interfaces by the year 2000. The top five answers are summarized in Table 1.

    Table 1: User Interfaces Prediction Table [1]


While Graphical User Interfaces (GUIs) have clearly been the winner since that time, Voice User Interfaces (VUIs) certainly failed to reach the demand that IT professionals expected. The key issue in interaction design, and the main determinant of usability, is what the user says to the interface. Whether you say it by speaking or by typing at the keyboard matters less to most users. Thus, having voice interfaces will not necessarily free us from the most substantial part of user interface design: determining the structure of the dialogue, what commands or features are available, how the users are to specify what they want, and how the computer is to communicate the feedback. All that voice does is to allow the commands and feedback to be spoken rather than written.

Voice interfaces have their greatest potential in the following cases, where it is problematic to rely on the traditional keyboard-mouse-monitor combination:

Users with various disabilities that prevent them from using a mouse and/or keyboard, or that prevent them from seeing the pictures on the screen.

All users, with or without disabilities, whose hands and eyes are occupied with other tasks. For example, while driving a car or while repairing a complex piece of equipment.

Users who do not have access to a keyboard and/or a monitor. For instance, users accessing a system through a payphone.

So it's not that voice is useless. It's just that it is often a secondary interaction mode if additional media are available. Just as in our system, in addition to using voice commands to select different actions of the character, the user still has to use mouse clicks to specify the location at which the action is taking place. The combination of multiple input media proves to provide better interaction for most users.

As for the voice recognition system, there are two mainstream products available on the current market. One is the IBM ViaVoice Dictation SDK, which is based on many years of development by IBM. The other one is the Microsoft Speech API, also known as SAPI. Each is an SDK toolkit with its own unique features, and it is hard to compare which one is better.

3 Speech Application Programming Interface

    Microsoft Speech API

One of the newest extensions for the Windows operating system is the Speech Application Programming Interface (SAPI). This Windows extension gives workstations the ability to recognize human speech as input and create human-like audio output from printed text. This ability adds a new dimension to human/PC interaction. Speech recognition services can be used to extend the use of PCs to those who find typing too difficult or too time-consuming. Text-to-speech services can be used to provide aural representations of text documents to those who cannot see typical display screens because of physical limitations or due to the nature of their work.

Like other Windows services, SAPI is part of the Windows Open Services Architecture (WOSA) model. Speech recognition (SR) and text-to-speech (TTS) services are actually provided by separate modules called engines. Users can select the speech engine they prefer to use, as long as it conforms to the SAPI interface. In this section you'll learn the basic concepts behind designing and implementing speech recognition and text-to-speech using the SAPI design model. You'll also learn about creating grammar definitions for speech recognition.
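As a concrete illustration of the engine model, the short sketch below speaks a sentence through whichever SAPI-compliant TTS engine is installed. It is a minimal example assuming the SAPI 5.x headers and a working COM environment; it is not code from the report itself.

    // Minimal sketch: speak a sentence through the installed SAPI TTS engine.
    #include <windows.h>
    #include <sapi.h>

    int main()
    {
        if (FAILED(::CoInitialize(NULL)))
            return 1;

        ISpVoice *pVoice = NULL;
        HRESULT hr = CoCreateInstance(CLSID_SpVoice, NULL, CLSCTX_ALL,
                                      IID_ISpVoice, (void **)&pVoice);
        if (SUCCEEDED(hr))
        {
            // Blocks until the engine has finished rendering the text.
            pVoice->Speak(L"Welcome to the voice driven animation system.", 0, NULL);
            pVoice->Release();
        }
        ::CoUninitialize();
        return 0;
    }

Because the application only talks to the ISpVoice interface, the same code works with any vendor's engine that conforms to SAPI.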


    Speech Recognition

Any speech system has, at its heart, a process for recognizing human speech and turning it into something the computer understands. In effect, the computer needs a translator. Research into effective speech recognition algorithms and processing models has been going on almost ever since the computer was invented, and a great deal of mathematics and linguistics go into the design and implementation of a speech recognition system. A detailed discussion of speech recognition algorithms is beyond the scope of our project, but it is important to have a good idea of the commonly used techniques for turning human speech into something a computer understands.

Every speech recognition system uses four key operations to listen to and understand human speech. They are:

Word separation - This is the process of creating discrete portions of human speech. Each portion can be as large as a phrase or as small as a single syllable or word part.

Vocabulary - This is the list of speech items that the speech engine can identify.

Word matching - This is the method that the speech system uses to look up a speech part in the system's vocabulary; it is the search engine portion of the system.

Speaker dependence - This is the degree to which the speech engine is dependent on the vocal tones and speaking patterns of individuals.

These four aspects of the speech system are closely interrelated. If we want to develop a speech system with a rich vocabulary, we'll need a sophisticated word matching system to quickly search the vocabulary. Also, as the vocabulary gets larger, more items in the list could sound similar (for example, yes and yet). In order to successfully identify these speech parts, the word separation portion of the system must be able to determine smaller and smaller differences between speech items.

Finally, the speech engine must balance all of these factors against the aspect of speaker dependence. As the speech system learns smaller and smaller differences between words, the system becomes more and more dependent on the speaking habits of a single user. Individual accents and speech patterns can confuse speech engines. In other words, as the system becomes more responsive to a single user, that same system becomes less able to translate the speech of other users.

The next few sections describe each of the four aspects of a speech engine in a bit more detail.

    Word Separation

The first task of the speech engine is to accept words as input. Speech engines use a process called word separation to gather human speech. Just as the keyboard is used as an input device to accept physical keystrokes for translation into readable characters, the process of word separation accepts the sound of human speech for translation by the computer.

    There are three basic methods of word separation. In ascending order of complexity they are:

    Discrete speech

    Word spotting

    Continuous speech

Systems that use the discrete speech method of word separation require the user to place a short pause between each spoken word. This slight bit of silence allows the speech system to recognize the beginning and ending of each word. The silences separate the words much like the space bar does when we type. The advantage of the discrete speech method is that it requires the least amount of computational resources. The disadvantage of this method is that it is not very user-friendly. Discrete speech systems can easily become confused if a person does not pause between words.

Systems that use word spotting avoid the need for users to pause in between each word by listening only for key words or phrases.

Word spotting systems, in effect, ignore the items they do not know or care about and act only on the words they can match in their vocabulary. For example, suppose the speech system can recognize the word help, and knows to load the Windows Help engine whenever it hears the word. Under word spotting, the following phrases will all result in the speech engine invoking Windows Help:

Please load Help.
Can you help me, please?
These definitions are no help at all!

As we can see, one of the disadvantages of word spotting is that the system can easily misinterpret the user's meaning. However, word spotting also has several key advantages. Word spotting allows users to speak normally, without employing pauses. Also, since word spotting systems simply ignore words they don't know and act only on key words, these systems can give the appearance of being more sophisticated than they really are. Word spotting requires more computing resources than discrete speech, but not as much as the last method of word separation: continuous speech.
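To make the idea concrete, here is a small, self-contained sketch (not SAPI code) of word spotting: scan whatever was said, ignore unknown words, and act only on the key words found in a small assumed vocabulary.

    // Toy word-spotting sketch: act only on known key words, ignore the rest.
    #include <iostream>
    #include <set>
    #include <sstream>
    #include <string>

    int main()
    {
        const std::set<std::string> keywords = {"help", "open", "quit"};   // assumed vocabulary
        const std::string utterance = "these definitions are no help at all";

        std::istringstream words(utterance);
        std::string word;
        while (words >> word)
            if (keywords.count(word))
                std::cout << "key word spotted: " << word << '\n';         // e.g. invoke Windows Help
        return 0;
    }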

Continuous speech systems recognize and process every word spoken. This gives the greatest degree of accuracy when attempting to understand a speaker's request. However, it also requires the greatest amount of computing power. First, the speech system must determine the start and end of each word without the use of silence. This is much like readingtextthathasnospacesinit (see!). Once the words have been separated, the system must look them up in the vocabulary and identify them. This, too, can take precious computing time. The primary advantage of continuous speech systems is that they offer the greatest level of sophistication in recognizing human speech. The primary disadvantage is the amount of computing resources they require.

    Speaker Dependence

Speaker dependence is a key factor in the design and implementation of a speech recognition system. In theory, we would like a system that has very little speaker dependence. This would mean that the same workstation could be spoken to by several people with the same positive results. People often speak quite differently from one another, however, and this can cause problems.

First, there is the case of accents. Just using the United States as an example, we can identify several regional sounds. Add to these the possibility that speakers may also have accents that come from outside the U.S. due to the influence of other languages (Spanish, German, Japanese), and we have a wide range of pronunciation for even the simplest of sentences. Speaker speed and pitch inflection can also vary widely, which can pose problems for speech systems that need to determine whether a spoken phrase is a statement or a question.

Speech systems fall into three categories in terms of their speaker dependence. They can be:

    Speaker independent

    Speaker dependent

    Speaker adaptive


Speaker-independent systems require the most resources. They must be able to accurately translate human speech across as many dialects and accents as possible. Speaker-dependent systems require the least amount of computing resources. These systems require that the user "train" the system before it is able to accurately convert human speech. A compromise between the two approaches is the speaker-adaptive method. Speaker-adaptive systems are prepared to work without training, but increase their accuracy after working with the same speaker for a period of time.

The additional training required by speaker-dependent systems can be frustrating to users. Training can usually take several hours, but some systems can reach 90 percent accuracy or better after just five minutes of training. Users with physical disabilities, or those who find typing highly inefficient, will be most likely to accept using speaker-dependent systems. Systems that will be used by many different people need the power of speaker independence. This is especially true for systems that will have short encounters with many different people, such as greeting kiosks at an airport. In such situations, training is unlikely to occur, and a high degree of accuracy is expected right away.

For systems where multiple people will access the same workstation over a longer period of time, the speaker-adaptive system will work fine. A good example would be a workstation used by several employees to query information from a database. The initial investment spent training the speech system will pay off over time as the same staff uses the system.

    Word Matching

Word matching is the process of performing look-ups into the speech database. As each word is gathered (using the word separation techniques described earlier), it must be matched against some item in the speech engine's database. It is the process of word matching that connects the audio input signal to a meaningful item in the speech engine database. There are two primary methods of word matching:

    Whole-word matching

    Phoneme matching

Under whole-word matching, the speech engine searches the database for a word that matches the audio input. Whole-word matching requires less search capability than phoneme matching, but it requires a greater amount of storage capacity. Under the whole-word matching model, the system must store a word template that represents each possible word that the engine can recognize. While quick retrieval makes whole-word matching attractive, the fact that all words must be known ahead of time limits the application of whole-word matching systems.

Phoneme matching systems keep a dictionary of language phonemes. Phonemes are the smallest unique sound parts of a language, and they can be numerous. For example, while the English language has 26 individual letters, these letters do not represent the total list of possible phonemes. Also, phonemes are not restricted by spelling conventions.

Consider the words Philip and fill up. These words have the same phonemes: f, eh, ul, ah, and pah. However, they have entirely different meanings. Under the whole-word matching model, these words could represent multiple entries in the database. Under the phoneme matching model, the same five phonemes can be used to represent both words.

As we may expect, phoneme matching systems require more computational resources, but less storage space.
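The Philip / fill up example can be pictured as a lexicon keyed by phoneme sequences rather than by spellings. The sketch below is purely illustrative: the phoneme spellings come from the text above, while the container layout is an assumption.

    // Toy phoneme-matching lexicon: one phoneme key covers several written forms.
    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    int main()
    {
        // key: phoneme sequence, value: candidate written forms
        std::map<std::vector<std::string>, std::vector<std::string>> lexicon;
        lexicon[{"f", "eh", "ul", "ah", "pah"}] = {"Philip", "fill up"};

        const std::vector<std::string> heard = {"f", "eh", "ul", "ah", "pah"};
        for (const std::string &candidate : lexicon[heard])
            std::cout << "possible word: " << candidate << '\n';
        return 0;
    }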

    Vocabulary


The final element of a speech recognition system is the vocabulary. There are two competing issues regarding vocabulary: size and accuracy. As the vocabulary size increases, recognition improves. With large vocabularies, it is easy for speech systems to locate a word that matches the one identified in the word separation phase. However, one of the reasons it is easy to find a match is that more than one entry in the vocabulary may match the given input. For example, the words no and go are very similar to most speech engines. Therefore, as vocabulary size grows, the accuracy of speech recognition can decrease.

Contrary to what we might assume, a speech engine's vocabulary does not represent the total number of words it understands. Instead, the vocabulary of a speech engine represents the number of words that it can recognize in a current state or moment in time. In effect, this is the total number of "unidentified" words that the system can resolve at any moment.

For example, let's assume we have registered the following word phrases with our speech engine: "Start running Exchange" and "Start running Word." Before we say anything, the current state of the speech engine has four words: start, running, Exchange, and Word. Once we say "Start running" there are only two words in the current state: Exchange and Word. The system's ability to keep track of the possible next word is determined by the size of its vocabulary. Small vocabulary systems (100 words or less) work well in situations where most of the speech recognition is devoted to processing commands. However, we need a large vocabulary to handle dictation systems. Dictation vocabularies can reach into the tens of thousands of words. This is one of the reasons that dictation systems are so difficult to implement. Not only does the vocabulary need to be large, the resolutions must be made quite quickly.
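The "current state" can be visualized as prefix filtering over the registered phrases. The toy sketch below (the phrase list is taken from the example above; everything else is an assumption) shows how the candidate set shrinks to Exchange and Word once "Start running" has been heard:

    // Toy "current state" sketch: only phrases matching the heard prefix stay active.
    #include <iostream>
    #include <string>
    #include <vector>

    int main()
    {
        const std::vector<std::string> phrases = {"Start running Exchange", "Start running Word"};
        const std::string heardSoFar = "Start running ";

        for (const std::string &p : phrases)
            if (p.compare(0, heardSoFar.size(), heardSoFar) == 0)
                std::cout << "still possible: " << p.substr(heardSoFar.size()) << '\n';
        return 0;
    }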

    Text-to-Speech

A second type of speech service provides the ability to convert written text into spoken words. This is called text-to-speech (or TTS) technology. Just as there are a number of factors to consider when developing speech recognition (SR) engines, there are a few issues that must be addressed when creating and implementing rules for TTS engines. The four common issues that must be addressed when creating a TTS engine are as follows:

    Phonemes

    Voice quality

    TTS synthesis

    TTS diphone concatenation

The first two factors deal with the creation of audio tones that are recognizable as human speech. The last two items are competing methods for interpreting text that is to be converted into audio.

Voice Quality

The quality of a computerized voice is directly related to the sophistication of the rules that identify and convert text into an audio signal. It is not too difficult to build a TTS engine that can create recognizable speech. However, it is extremely difficult to create a TTS engine that does not sound like a computer. Three factors in human speech are very difficult to produce with computers:

    Prosody

    Emotion

    Pronunciation anomalies

Human speech has a special rhythm or prosody, a pattern of pauses, inflections, and emphasis that is an integral part of the language. While computers can do a good job of pronouncing individual words, it is difficult to get them to accurately mimic the tonal and rhythmic inflections of human speech. For this reason, it is always quite easy to differentiate computer-generated speech from a computer playing back a recording of a human voice.

Another factor of human speech that computers have difficulty rendering is emotion. While TTS engines are capable of distinguishing declarative statements from questions or exclamations, computers are still not able to convey believable emotive qualities when rendering text into speech.

Lastly, every language has its own pronunciation anomalies. These are words that do not "play by the rules" when it comes to converting text into speech. Some common examples in English are dough and tough or comb and home. More troublesome are words such as read, which must be understood in context in order to figure out their exact pronunciation. For example, the pronunciations are different in "He read the paper" and "She will now read to the class." Even more likely to cause problems is the interjection of techno-babble such as "SQL," "MAPI," and "SAPI." All these factors make the development of a truly human-sounding computer-generated voice extremely difficult.

Speech systems usually offer some way to correct for these types of problems. One typical solution is to include the ability to enter the phonetic spelling of a word and relate that spelling to the text version. Another common adjustment is to allow users to enter control tags in the text to instruct the speech engine to add emphasis or inflection, or alter the speed or pitch of the audio output. Much of this type of adjustment information is based on phonemes, as described in the next section.
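In SAPI, such control tags are written as XML markup embedded in the text and passed to the engine with the SPF_IS_XML flag. The fragment below is a hedged sketch assuming a SAPI 5.x TTS engine; the tag and attribute values are illustrative only:

    // Sketch: adjusting rate, pitch, and emphasis with SAPI's XML control tags.
    #include <windows.h>
    #include <sapi.h>

    int main()
    {
        if (FAILED(::CoInitialize(NULL)))
            return 1;

        ISpVoice *pVoice = NULL;
        if (SUCCEEDED(CoCreateInstance(CLSID_SpVoice, NULL, CLSCTX_ALL,
                                       IID_ISpVoice, (void **)&pVoice)))
        {
            // SPF_IS_XML tells the engine to interpret the embedded tags.
            pVoice->Speak(L"<rate speed='-3'>Please speak <emph>after</emph> the tone."
                          L"<pitch middle='+5'> Thank you.</pitch>",
                          SPF_IS_XML, NULL);
            pVoice->Release();
        }
        ::CoUninitialize();
        return 0;
    }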

    Phonemes

As we've discussed, phonemes are the sound parts that make up words. Linguists use phonemes to accurately record the vocal sounds uttered by humans when speaking. These same phonemes can also be used to generate computerized speech. TTS engines use their knowledge of grammar rules and phonemes to scan printed text and generate audio output.

    Note

If we are interested in learning more about phonemes and how they are used to analyze speech, refer to the Phonetic Symbol Guide by Pullum and Ladusaw (University of Chicago Press, 1996).

The SAPI design model recognizes and allows for the incorporation of phonemes as a method for creating speech output. Microsoft has developed an expression of the International Phonetic Alphabet (IPA) in the form of Unicode strings. Programmers can use these strings to improve the pronunciation skills of the TTS engine, or to add entirely new words to the vocabulary.

    Note

If we wish to use these phoneme strings directly to alter the behavior of our TTS engine, we will have to program using Unicode. SAPI does not support the direct use of phonemes in ANSI format.

As mentioned in the previous section on voice quality, most TTS engines provide several methods for improving the pronunciation of words. Unless we are involved in the development of a text-to-speech engine, we probably will not use phonemes very often.

    TTS Synthesis


Once the TTS engine knows what phonemes to use to reproduce a word, there are two possible methods for creating the audio output: synthesis or diphone concatenation.

The synthesis method uses calculations of a person's lip and tongue position, the force of breath, and other factors to synthesize human speech. This method is usually not as accurate as the diphone method. However, if the TTS engine uses the synthesis method for generating output, it is very easy to modify a few parameters and then create a new "voice."

Synthesis-based TTS engines require less overall computational resources and less storage capacity. Synthesis-based systems are a bit more difficult to understand at first, but usually offer users the ability to adjust the tone, speed, and inflection of the voice rather easily.

    TTS Diphone Concatenation

The diphone concatenation method of generating speech uses pairs of phonemes (di meaning two) to produce each sound. These diphones represent the start and end of each individual speech part. For example, the word pig contains the diphones silence-p, p-i, i-g, and g-silence. Diphone TTS systems scan the word and then piece together the correct phoneme pairs to pronounce the word.

These phoneme pairs are produced not by computer synthesis, but from actual recordings of human voices that have been broken down to their smallest elements and categorized into the various diphone pairs. Since TTS systems that use diphones are using elements of actual human speech, they can produce much more human-like output. However, since diphone pairs are very language-specific, diphone TTS systems are usually dedicated to producing a single language. Because of this, diphone systems do not do well in environments where numerous foreign words may be present, or where the TTS might be required to produce output in more than one language.
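The pairing step itself is mechanical: pad the phoneme sequence with silence and take overlapping pairs. The toy sketch below reproduces the pig example from the text; the phoneme labels are illustrative:

    // Toy diphone construction: silence-p, p-i, i-g, g-silence for the word "pig".
    #include <iostream>
    #include <string>
    #include <vector>

    int main()
    {
        std::vector<std::string> phonemes = {"p", "i", "g"};
        phonemes.insert(phonemes.begin(), "silence");
        phonemes.push_back("silence");

        for (std::size_t i = 0; i + 1 < phonemes.size(); ++i)
            std::cout << phonemes[i] << "-" << phonemes[i + 1] << '\n';
        return 0;
    }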

    Grammar Rules

The final elements of a speech engine are the grammar rules. Grammar rules are used by speech recognition (SR) software to analyze human speech input and, in the process, attempt to understand what a person is saying. Most of us suffered through a series of lessons in grade school where our teachers attempted to show us just how grammar rules affect our everyday speech patterns. And most of us probably don't remember a great deal from those lessons, but we all use grammar rules every day without thinking about them, to express ourselves and make sense of what others say to us. Without an understanding of and appreciation for the importance of grammars, computer speech recognition systems would not be possible.

There can be any number of grammars, each composed of a set of rules of speech. Just as humans must learn to share a common grammar in order to be understood, computers must also share a common grammar with the speaker in order to convert audio information into text.

Grammars can be divided into three types, each with its own strengths and weaknesses. The types are:

    Context-free grammars

    Dictation grammars

    Limited domain grammars

Context-free grammars offer the greatest degree of flexibility when interpreting human speech. Dictation grammars offer the greatest degree of accuracy when converting spoken words into printed text. Limited domain grammars offer a compromise between the highly flexible context-free grammar and the restrictive dictation grammar. The following sections discuss each grammar type in more detail.

    Context-Free Grammars


Context-free grammars work on the principle of following established rules to determine the most likely candidates for the next word in a sentence. Context-free grammars do not work on the idea that each word should be understood within a context. Rather, they evaluate the relationship of each word and word phrase to a known set of rules about what words are possible at any given moment.

    The main elements of a context-free grammar are:

Words - A list of valid words to be spoken

Rules - A set of speech structures in which words are used

Lists - One or more word sets to be used within rules

Context-free grammars are good for systems that have to deal with a wide variety of input. Context-free systems are also able to handle variable vocabularies. This is because most of the rule-building done for context-free grammars revolves around declaring lists and groups of words that fit into common patterns or rules. Once the SR engine understands the rules, it is very easy to expand the vocabulary by expanding the lists of possible members of a group. For example, rules in a context-free grammar might look something like this:

<Names>=ALT("Aadil","Ather","Imran","Khuram")

<SendMail>=("Send Email to", <Names>)

In the example above, two rules have been established. The first rule, <Names>, creates a list of possible names. The second rule, <SendMail>, creates a rule that depends on <Names>. In this way, context-free grammars allow us to build our own grammatical rules as a predictor of how humans will interact with the system.

Even more importantly, context-free grammars allow for easy expansion at run-time. Since much of the way context-free grammars operate focuses on lists, it is easy to allow users to add list members and, therefore, to improve the value of the SR system quickly. This makes it easy to install a system with only basic components. The basic system can be expanded to meet the needs of various users. In this way, context-free grammars offer a high degree of flexibility with very little development cost or complication.

The construction of quality context-free grammars can be a challenge, however. Systems that only need to do a few things (such as load and run programs, execute simple directives, and so on) are easily expressed using context-free grammars. However, in order to perform more complex tasks or a wider range of chores, additional rules are needed. As the number of rules and the length of lists increase, the computational load rises dramatically. Also, since context-free grammars base their predictions on predefined rules, they are not good for tasks like dictation, where a large vocabulary is most important.


    Dictation Grammars

Unlike context-free grammars, which operate using rules, dictation grammars base their evaluations on vocabulary. The primary function of a dictation grammar is to convert human speech into text as accurately as possible. In order to do this, dictation grammars need not only a rich vocabulary to work from, but also a sample output to use as a model when analyzing speech input. Rules of speech are not important to a system that must simply convert human input into printed text.

The elements of a dictation grammar are:

Topic - Identifies the dictation topic (for example, medical or legal).

Common - A set of words commonly used in the dictation. Usually the common group contains technical or specialized words that are expected to appear during dictation, but are not usually found in regular conversation.

Group - A related set of words that can be expected, but that are not directly related to the dictation topic. The group usually has a set of words that are expected to occur frequently during dictation. The grammar model can contain more than one group.

Sample - A sample of text that shows the writing style of the speaker or the general format of the dictation. This text is used to aid the SR engine in analyzing speech input.

The success of a dictation grammar depends on the quality of the vocabulary. The more items on the list, the greater the chance of the SR engine mistaking one item for another. However, the more limited the vocabulary, the greater the number of "unknown" words that will occur during the course of the dictation. The most successful dictation systems balance vocabulary depth and the uniqueness of the words in the database. For this reason, dictation systems are usually tuned for one topic, such as legal or medical dictation. By limiting the vocabulary to the words most likely to occur in the course of dictation, translation accuracy is increased.

    Limited Domain Grammars

Limited domain grammars offer a compromise between the flexibility of context-free grammars and the accuracy of dictation grammars. Limited domain grammars have the following elements:

Words - This is the list of specialized words that are likely to occur during a session.

Group - This is a set of related words that could occur during the session. The grammar can contain multiple word groups. A single phrase would be expected to include one of the words in the group.

Sample - A sample of text that shows the writing style of the speaker or the general format of the dictation. This text is used to aid the SR engine in analyzing the speech input.

Limited domain grammars are useful in situations where the vocabulary of the system need not be very large. Examples include systems that use natural language to accept command statements, such as "How can I set the margins?" or "Replace all instances of 'New York' with 'Los Angeles.'" Limited domain grammars also work well for filling in forms or for simple text entry.


    SAPI Engines Layout

4 How Speech Recognition Works

You might have already used speech recognition in products, and maybe even incorporated it into your own application, but you still may not know how it works. This section gives a technical overview of speech recognition so you can understand how it works, and better understand some of the capabilities and limitations of the technology.

Speech recognition fundamentally functions as a pipeline that converts PCM (Pulse Code Modulation) digital audio from a sound card into recognized speech. The elements of the pipeline are:

1. Transform the PCM digital audio into a better acoustic representation.
2. Apply a "grammar" so the speech recognizer knows what phonemes to expect. A grammar could be anything from a context-free grammar to full-blown English.
3. Figure out which phonemes are spoken.
4. Convert the phonemes into words.

5 System Requirement

We recommend installing the Integrated Communication Platform only on computers with at least the following software and hardware:

    Windows operating system

    Sound Cards, Speakers, and microphones

System resources are mainly consumed while doing speech recognition and/or text-to-speech synthesis, so the next few sections discuss this issue.

Hardware Requirements for Text-to-Speech Synthesis and Speech Recognition

The Integrated Communication Platform can be resource intensive. It is especially important that SR engines have enough RAM and disk space to respond quickly to user requests. Failure to respond quickly results in additional commands spoken into the system. This has the effect of creating a spiraling degradation in performance: the worse things get, the worse things get. It will not take too much of this before the user decides that our software is more trouble than it's worth!

Our text-to-speech engines can also tax the system. While TTS engines do not always require a great deal of memory to operate, insufficient processor speed can result in halting or unintelligible playback of text.

For these reasons, it is important to establish clear hardware and software requirements when installing the Integrated Communication Platform. The user must have all the memory resources and hard disk space needed for the proper working of SR and TTS services. There are three general categories of workstation resources that should be reviewed:

General hardware, including processor speed and RAM

Software, including the operating system and SR/TTS engines

Special hardware, including sound cards, microphones, speakers, and headphones

The following three sections provide some general guidelines to follow when establishing minimal resource requirements.

    General Hardware Requirements

Speech systems can tax processor and RAM resources. SR services require varying levels of resources depending on the type of SR engine installed and the level of services implemented. TTS engine requirements are rather stable, but also depend on the TTS engine installed.

SR and TTS engines currently available with our application can be successfully installed on systems with a 486/33 processor and an additional 1MB of RAM. However, overall PC performance with this configuration is quite poor and is not recommended.

A good suggested processor is a Pentium (P60 or better) with at least 16MB of total RAM. Systems that will be supporting dictation SR services require the most computational power. It is not unreasonable to expect the workstation to use 32MB of RAM and a P100 or higher processor. Obviously, the more resources, the better the performance.

    SR Processor and Memory Requirements

In general, SR systems that implement command and control services will only need an additional 1MB of RAM (not counting the application's RAM requirement). Dictation services should get at least another 8MB of RAM, preferably more. The type of speech sampling, the analysis performed, and the size of the recognition vocabulary all affect the minimal resource requirements. The table below shows published minimal processor and RAM requirements of speech recognition services.

    Published minimal processor and RAM requirements of SR services.

Levels of Speech-Recognition Services                        Minimal Processor   Minimal Additional RAM

Discrete, speaker-dependent, whole word, small vocabulary    386/16              64K

Discrete, speaker-independent, whole word, small vocabulary  386/33              256K

Continuous, speaker-independent, sub-word, small vocabulary  486/33              1MB

Discrete, speaker-dependent, whole word, large vocabulary    Pentium             8MB

Continuous, speaker-independent, sub-word, large vocabulary  RISC processor      8MB


These memory requirements are in addition to the requirements of the operating system and any loaded applications. The minimal Windows 95 memory model should be 12MB; 16MB is recommended and 24MB is preferred. The minimal NT memory should be 16MB, with 24MB recommended and 32MB preferred.

    TTS Processor and Memory Requirements

TTS engines do not place as much of a demand on workstation resources as SR engines. Usually TTS services require only a 486/33 processor and only 1MB of additional RAM. However, the grammar and prosody rules can demand as much as another 1MB due to the complexity of the language being spoken. It is interesting to note that probably the most complex and demanding language for TTS processing is English. This is primarily due to the irregular spelling patterns of the language.

Most TTS engines use speech synthesis to produce the audio output. However, advanced systems can use diphone concatenation. Since diphone-based systems rely on a set of actual voice samples for reproducing written text, these systems can require an additional 1MB of RAM. To be safe, it is a good idea to suggest a requirement of 2MB of additional RAM, with a recommendation of 4MB for advanced TTS systems.

    Software Requirements-Operating Systems and Speech Engines

The general software requirements are rather simple. The Microsoft Speech API is implemented on Windows 32-bit operating systems. This means the user will need Windows 95 or Windows NT 3.5 or greater on the workstation.

The most important software requirements for implementing speech services are the SR and TTS engines. An SR/TTS engine is the back-end processing module; our application is the front end, and SPEECH.DLL acts as the broker between the two processes. Along with our application software we include a bundle of text-to-speech and speech recognition engines, so the user does not need any additional engines.

Sound Cards, Microphones, and Speakers

Complete speech-capable workstations need three additional pieces of hardware:

A sound card for audio reproduction

Speakers for audio playback

A microphone for audio input

Just about any sound card can support SR/TTS engines. Any of the major vendors' cards are acceptable, including Sound Blaster and its compatibles, Media Vision, ESS Technology, and others. Any card that is compatible with Microsoft's Windows Sound System is also acceptable. A few speech-recognition engines still need a DSP (digital signal processor) card. While it may be preferable to work with newer cards that do not require DSP handling, there are advantages to using DSP technology. DSP cards handle some of the computational work of interpreting speech input. This can actually reduce the resource requirements for providing SR services. In systems where speech is a vital source of process input, DSP cards can noticeably boost performance.

SR engines require the use of a microphone for audio input. This is usually handled by a directional microphone mounted on the PC base. Other options include the use of a lavaliere microphone draped around the neck, or a headset microphone that includes headphones. Depending on the audio card installed, the user may also be able to use a telephone handset for input. Most multimedia systems ship with a suitable microphone built into the PC or as an external device that plugs into the sound card. It is also possible to purchase high-grade unidirectional microphones from audio retailers. Depending on the microphone and the sound card used, you may need an amplifier to boost the input to levels usable by the SR engine.


The quality of the audio input is one of the most important factors in the successful implementation of speech services on a PC. If the system will be used in a noisy environment, close-talk microphones should be used. This will reduce extraneous noise and improve the recognition capabilities of the SR engine.

Speakers or headphones are needed to play back TTS output. In private office spaces, free-standing speakers provide the best sound reproduction and the least danger of ear damage through high levels of playback. However, in larger offices, or in areas where the playback can disturb others, headphones are preferred.

6 Development of the Voice Driven Animation System

Design of Grammar Rules

The grammar for the turn command is built from the fixed words "turn", "by", and "degrees", together with two child rules:

VID_Direction: left, right, around

VID_Degree: ten, twenty, thirty, forty, fifty, sixty, seventy, eighty, ninety
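One plausible way to express this rule in the SAPI 5.x XML grammar format is sketched below. The element layout and property names are assumptions; only the rule names and word lists come from the report.

    <GRAMMAR LANGID="409">
      <RULE NAME="VID_TurnCommand" TOPLEVEL="ACTIVE">
        <P>turn</P>
        <RULEREF NAME="VID_Direction" PROPNAME="Direction"/>
        <O>by</O>
        <O><RULEREF NAME="VID_Degree" PROPNAME="Degree"/></O>
        <O>degrees</O>
      </RULE>
      <RULE NAME="VID_Direction">
        <L>
          <P>left</P> <P>right</P> <P>around</P>
        </L>
      </RULE>
      <RULE NAME="VID_Degree">
        <L>
          <P>ten</P> <P>twenty</P> <P>thirty</P> <P>forty</P> <P>fifty</P>
          <P>sixty</P> <P>seventy</P> <P>eighty</P> <P>ninety</P>
        </L>
      </RULE>
    </GRAMMAR>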

According to this grammar rule, if the user says "turn right by 70 degrees", the speech engine will indicate to the application that the rule named "VID_TurnCommand" has been recognized, with the property of the child rule VID_Direction being "right" and the property of the child rule VID_Degree being "seventy". We have also performed basic testing on the grammar rules we have written, using the grammar compiler and tester provided by the SDK toolkit. All of the grammar rules are recognized from the user's speech very well.
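For reference, the sketch below shows one way a SAPI 5.x application might load such a grammar and read the rule name and child properties when a recognition event arrives. The interface and flag names follow the standard SAPI 5 COM API; the grammar file name and the overall structure are assumptions, and error handling is omitted.

    // Hedged sketch: load a command-and-control grammar and read recognized properties.
    #include <windows.h>
    #include <atlbase.h>
    #include <sapi.h>
    #include <sphelper.h>
    #include <stdio.h>

    int wmain()
    {
        ::CoInitialize(NULL);
        {
            CComPtr<ISpRecognizer>  pRecognizer;
            CComPtr<ISpRecoContext> pContext;
            CComPtr<ISpRecoGrammar> pGrammar;

            pRecognizer.CoCreateInstance(CLSID_SpSharedRecognizer);
            pRecognizer->CreateRecoContext(&pContext);
            pContext->CreateGrammar(1, &pGrammar);

            // Load the grammar file (name assumed) and activate its top-level rules.
            pGrammar->LoadCmdFromFile(L"turn.xml", SPLO_STATIC);
            pGrammar->SetRuleState(NULL, NULL, SPRS_ACTIVE);

            // Block until a recognition event is delivered.
            pContext->SetNotifyWin32Event();
            pContext->SetInterest(SPFEI(SPEI_RECOGNITION), SPFEI(SPEI_RECOGNITION));
            pContext->WaitForNotifyEvent(INFINITE);

            CSpEvent evt;
            if (evt.GetFrom(pContext) == S_OK && evt.eEventId == SPEI_RECOGNITION)
            {
                SPPHRASE *pPhrase = NULL;
                if (SUCCEEDED(evt.RecoResult()->GetPhrase(&pPhrase)))
                {
                    wprintf(L"rule: %s\n", pPhrase->Rule.pszName);         // e.g. VID_TurnCommand
                    for (const SPPHRASEPROPERTY *p = pPhrase->pProperties; p; p = p->pNextSibling)
                        wprintf(L"  %s = %s\n", p->pszName, p->pszValue);  // Direction = right, ...
                    ::CoTaskMemFree(pPhrase);
                }
            }
        }
        ::CoUninitialize();
        return 0;
    }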


7 Integration of the Speech Engine

Before the system gets more complicated, we first tested it with a simple object and simple movements, i.e., using voice commands to drive a ball from one place to another. Here are some snapshots of the running application:

    A Simple Voice Driven Application

The blue ball represents the subject, and the red balls represent the destinations that the subject must pass through in the same order as they were created. The locations of the red balls are specified by the mouse clicks of the user. If the user says "move to here, to here, to here" while doing the mouse clicks, the application will recognize the voice command and, once the mouse clicking has ended, start to move the blue ball towards those red balls. When the subject passes through a destination, the red ball disappears to show that it has been reached, and then the subject heads straight to the next destination, until all the red balls have been reached, as shown in the above images.

Although this application may seem simple enough, it demonstrates that the speech recognition engine has been successfully integrated into the Windows program and that the two can work seamlessly together. This ensures that we can build a more complex system on top of the speech engine.

8 Combination with Motion Capture Data


Now that we have the speech engine working properly, we can combine voice recognition with motion capture data to generate the animation of a character driven by voice commands.

    Voice Driven Animation System

The user can speak a limited set of voice commands to make the character walk in different styles, such as fast, slow, or backwards, and he can use the "faster" or "slower" commands to control the speed of the walking motion. Besides these, the system also supports directional control, so if the user says "turn left (by) sixty (degrees)", the character will make a left turn by sixty degrees. The brackets enclosing "by" and "degrees" mean that these two words are optional, i.e., the system will recognize the user's voice command either with or without those two words being said.

When the system is started, a single walk cycle motion of the character is loaded from the motion capture data, and the system basically replays this walk cycle at different speeds, using different translations and orientations of the character according to the user's voice commands. As for the change of walking direction, in order to make the rotation smoother, we apply a linear interpolation to the rotation angle, so that the orientation of the character is changed by 10 degrees in each successive frame until the desired rotation is achieved. In this way the turning action looks more natural than a straight cut to the new walking direction.
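The turning step can be sketched as follows; the function and variable names are illustrative and not taken from the report, but the 10-degrees-per-frame stepping matches the description above.

    // Sketch: step the character's heading at most 10 degrees per frame toward the target.
    #include <cmath>
    #include <iostream>

    float stepTowards(float currentDeg, float targetDeg, float maxStepDeg = 10.0f)
    {
        float diff = targetDeg - currentDeg;
        // wrap the difference into [-180, 180] so the character turns the short way round
        while (diff > 180.0f)  diff -= 360.0f;
        while (diff < -180.0f) diff += 360.0f;
        if (std::fabs(diff) <= maxStepDeg)
            return targetDeg;                                // close enough: finish the turn
        return currentDeg + (diff > 0.0f ? maxStepDeg : -maxStepDeg);
    }

    int main()
    {
        float heading = 0.0f;
        const float target = 60.0f;                          // "turn left by sixty degrees"
        for (int frame = 1; heading != target; ++frame)
        {
            heading = stepTowards(heading, target);
            std::cout << "frame " << frame << ": " << heading << " degrees\n";
        }
        return 0;
    }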

9 Limitations

Even the most sophisticated speech recognition engine has limitations that affect what it can recognize and how accurate the recognition will be. The following list illustrates many of the limitations found today. The limitations do pose some problems, but they do not prevent the design and development of savvy applications that use dictation.

Microphones and Sound Cards

The microphone is the largest problem that speech recognition encounters. Microphones inherently have the following problems:

Not every user has a sound card. Over time more and more PCs will bundle a sound card.


Not every user has a microphone. Over time more and more PCs will bundle a microphone.

Sound cards (being in the back) don't make it very easy for users to plug in the microphone.

Most microphones that come with computers are cheap, and they don't do as well as more expensive microphones that retail for $50 to $100. Furthermore, many of the cheap microphones that are designed to be worn are uncomfortable. A user will not use a microphone if it is uncomfortable.

Users don't know how to use a microphone. If the microphone is worn on the head, they often wear it incorrectly; if it sits on their desktop, they will lean towards it to speak even though the microphone is designed for the user to speak from their normal sitting position.

Most applications can do little about the microphone. One way that vendors can deal with this is to test and verify the user's microphone setup as part of the installation of any speech component software. Software to test the user's microphone can be delivered along with other components to ensure that the user can periodically test and adjust the microphone and its configuration.

Most users of dictation will wear close-talk microphones for maximum accuracy. Close-talk mikes have the best characteristics for speech recognition; they alleviate a number of the problems encountered in command-and-control recognition and dictation applications that are caused by the weaknesses of typical user microphones.

    Speech Recognizers make mistakes

Speech recognizers make mistakes, and will always make mistakes. The only thing that is changing is that every two years recognizers make half as many mistakes as they did before. But no matter how great a recognizer is, it will always make mistakes.

To make matters worse, dictation engines make misrecognitions that are correctly spelled and often grammatically correct, but mean nothing. Unfortunately, the misrecognitions sometimes mean something completely different than the user intended. These sorts of errors serve to illustrate some of the complexity of speech communication, particularly in that people are not accustomed to attributing strange wording to speech errors. To minimize some of the misrecognitions, an application can:

Make it as easy as possible for users to correct mistakes.

Provide easy access to the "Correction Window" so the user can correct mistakes that the recognizer made.

Allow the user to train the speech recognition system to his/her voice.

    Is it a Command?

When speech recognition is listening for dictation, users will often want to interject commands such as "cross-out" to delete the previous word, or "capitalize-that". Applications should make sure that:

If a command is just one word, it does not replace a word that people like to dictate.

If a command is multiple words, it can't be a phrase that people like to dictate.

    Finite Number of Words

Speech recognizers listen for 20,000 to 100,000 words. Because of this, roughly one out of every fifty words a user speaks isn't recognized, because it isn't among the 20,000 to 100,000 words supported by the engine.

Applications can reduce the error rate of an engine if they tell the engine what words to expect.


    Other Problems

    Some other problems crop up:

Having a user spell out words is a bad idea, since most recognizers are too inaccurate.

An engine also cannot tell who is speaking, although some engines may be able to detect a change in the speaker. Voice-recognition algorithms exist that can be used to identify a speaker, but currently they cannot also determine what the speaker is saying.

An engine cannot detect multiple speakers talking over each other in the same digital-audio stream. This means that a dictation system used to transcribe a meeting will not perform accurately during times when two or more people are talking at once.

Unlike a human being, an engine cannot hear a new word and guess its spelling.

Localization of a speech recognition engine is time-consuming and expensive, requiring extensive amounts of speech data and the skills of a trained linguist. If a language has strong dialects that each represent sizable markets, it is also necessary to localize the engine for each dialect. Consequently, most engines support only five or ten major languages, for example European languages and Japanese, or possibly Korean.

Speakers with accents, or those speaking in nonstandard dialects, can expect more misrecognitions until they train the engine to recognize their speech, and even then, the engine accuracy will not be as high as it would be for someone with the expected accent or dialect. An engine can be designed to recognize different accents or dialects, but this requires almost as much effort as porting the engine to a new language.

10 Future of SAPI

The future of SAPI is wide open. At present, SAPI systems are most successful as command-and-control interfaces. Such interfaces allow users to use voice commands to start and stop basic operations that usually require keyboard or mouse intervention. Current technology offers limited voice playback services. Users can get quick replies or short readings of text without much trouble. However, long stretches of text playback are still difficult to understand.

With the creation of the generalized interfaces defined by Microsoft in the SAPI model, it will not be long before new versions of TTS and SR engines appear on the market, ready to take advantage of the large base of Windows operating systems already installed. With each new release of Windows, and new versions of the SAPI interface, speech services are bound to become more powerful and more user-friendly.

Although we have not yet arrived at the level of voice interaction depicted in Star Trek and other futuristic tales, the release of SAPI for Windows puts us more than one step closer to that reality.


11 References

Amundsen, Michael C., MAPI, SAPI & TAPI Developer's Guide.

Thims, Victor, Programming Windows 98/NT.

Nielsen, J., Will Voice Interfaces Replace Screens?, IBM DeveloperWorks, 1999.

Apaydin, O., Networked Humanoid Animation Driven by Human Voice Using Extensible 3D (X3D), H-Anim and Java Speech Open Standards, Thesis, Naval Postgraduate School, 2002.

Microsoft Speech SDK 5.1 Documentation, 2004.

http://java.sun.com/products/java-media/speech/index.html

    CONTENTS

1. Introduction

2. Background

3. Speech Application Programming Interface

4. How Speech Recognition Works

5. System Requirement

6. Development of the Voice Driven Animation System

7. Integration of the Speech Engine

8. Combination with Motion Capture Data

9. Limitations

10. Future of SAPI

11. References


    Acknowledgment

Prior to presenting the textual details of the study, I take this opportunity to express my heartfelt gratitude to all those who have contributed towards the successful completion of my seminar report.

I was fortunate to have Mr. Santosh Kumar Swain as our seminar guide, who organized this valuable seminar on behalf of the Computer Science department. I would also like to express my gratefulness to the entire faculty of the Computer Science department, who were always there to provide us with all necessary information for our report.

Last but not the least, I would like to thank KIIT Deemed University for every help and support.


    SYNOPSIS

    VOICE DRIVEN ANIMATION SYSTEM

In a traditional animation system, the animator must use the mouse and keyboard to specify the path along which the character will move and the action that the character will be doing during the movements. This kind of interaction is not very effective, because either clicking on buttons or typing on the keyboard distracts the animator who is trying to focus on creating the animation. To improve the interaction, we can borrow the idea from the filmmaking scenario, where the director uses his voice to tell the actor what to do before the shooting of the scene, and then the actor performs the action exactly as he was told to. This is how we came up with the idea of using the human voice as a medium to make a better interface for the animation system.

The goal of this report is to develop a voice driven animation system that can take human voice as commands to generate the desired character animation, based on motion capture data.

Voice interfaces have their greatest potential in the following cases, where it is problematic to rely on the traditional keyboard-mouse-monitor combination:

Users with various disabilities that prevent them from using a mouse and/or keyboard, or that prevent them from seeing the pictures on the screen.

All users, with or without disabilities, whose hands and eyes are occupied with other tasks. For example, while driving a car or while repairing a complex piece of equipment.

Users who do not have access to a keyboard and/or a monitor. For instance, users accessing a system through a payphone.

So it's not that voice is useless. It's just that it is often a secondary interaction mode if additional media are available. Just as in our system, in addition to using voice commands to select different actions of the character, the user still has to use mouse clicks to specify the location at which the action is taking place. The combination of multiple input media proves to provide better interaction for most users.

As for the voice recognition system, there are two mainstream products available on the current market. One is the IBM ViaVoice Dictation SDK, which is based on many years of development by IBM. The other one is the Microsoft Speech API, also known as SAPI. Each is an SDK toolkit with its own unique features, and it is hard to compare which one is better.

One of the newest extensions for the Windows operating system is the Speech Application Programming Interface (SAPI). This Windows extension gives workstations the ability to recognize human speech as input and create human-like audio output from printed text. This ability adds a new dimension to human/PC interaction. Speech recognition services can be used to extend the use of PCs to those who find typing too difficult or too time-consuming. Text-to-speech services can be used to provide aural representations of text documents to those who cannot see typical display screens because of physical limitations or due to the nature of their work.