Automated Speech Driven Lipsynch Facial Animation for...

Automated Speech Driven Lipsynch Facial

Animation for Turkish

by

Zeki Melek

B.S. in Computer Engineering Department

Bogaziçi University, 1996

Submitted to the Institude for Graduate Studies in

Science and Engineering in partial fulfillment of

the requirements for the degree of

Master of Science

in

Computer Engineering

Bogaziçi University

1999

II

AUTOMATED SPEECH DRIVEN

LIPSYNCH FACIAL ANIMATION

FOR TURKISH

APPROVED BY:

Assoc. Prof. Dr. Lale Akarun ................................................

(Thesis Supervisor)

Assoc. Prof. Dr. Levent Arslan ................................................

Prof. Dr. Fikret Gürgen ................................................

DATE OF APPROVAL ................................................

III

ABSTRACT

Talking three-dimensional (3D) synthetic faces are now used in many applications

involving human-computer interaction. The lip-synchronization of the faces is mostly done

mechanically by computer animators. Although there is some work done on automated lip-

synchronized facial animation, these studies are mostly based on text input. In our work we

used speech in Turkish as an input to generate lip-synchronized facial animation. Speakers’

recorded voice is converted into lip-shape classes and applied to the 3D model. Voice is

analyzed and classified using a training set. Lip animation is facilitated by activating facial

muscles and the jaw. Facial muscles are modeled onto our facial model. For more realistic

facial animation, facial tissue is modeled as well, and the interactions between epidermis,

subcutenous layer and bone are taken into account. High-speed natural-looking lip-

synchronized facial animation is achieved. A real-time version of the engine is also

implemented.

IV

ÖZET

Üç boyutlu insan modellerinin konusmasi canlandirmada oldugu kadar insan-

bilgisayar iletisiminde de giderek daha sik kullanilmaktadir. Konusmanin üç boyutlu yüz

modeli ile agiz eszamanlamasi genelde grafik animatörlerce yapilan uzun ve mekanik bir

islemdir. Otomatik agiz eszamanli yüz animasyonu için çesitli çalismalar yapilmistir. Bu

tip çalismalar genelde yazi tabanli olmaktadir. Biz bu çalismamizda sesi girdi olarak

kullandik. Seslendirenin kaydedilen sesi verilen üç boyutlu yüz modelinde dudak

hareketlerine çevrilmektedir. Bunu için kaydedilen ses analiz edilip egitim kümesi ile

karsilastirilarak dudak hareketine siniflandirilmaktadir. Yüz modelimizde dudak

hareketleri yüz kaslari ve çene kullanilarak yapilir. Üç boyutlu yüz modelimiz üzerine

insan yüzünün fiziksel kas yapisi dikkate alinarak yüz kaslari modellendi. Gerçekçi yüz

animasyonu için insan yüzünü olusturan deri, yag, kas ve kemik katmanlari da

modellenerek aralarindaki etkilesimler hesaplandi. Oldukça hizli bir sekilde dogal

görünüslü canlandirma yapilabilmektedir. Gerçek zamanli çalisan kirpilmis bir

canlandirma motoru da hazirlanmistir.

V

TABLE OF CONTENTS

1. INTRODUCTION ____________________________________________________ 1

2. SPEECH PROCESSING ______________________________________________ 4

2.1. Background_____________________________________________________________ 4

2.2. Cepstral Analysis ________________________________________________________ 8

2.3. Mel Cepstrum __________________________________________________________ 12

2.4. Hamming Window ______________________________________________________ 15

2.5. Classification___________________________________________________________ 16

2.5.1. Training Set ______________________________________________________________ 18

2.5.2. Nearest Neighbor Classifier __________________________________________________ 20

2.5.3. Fuzzy Nearest Neighbor Classifier_____________________________________________ 21

2.5.4. Parametric Classifier _______________________________________________________ 22

2.5.5. Tree Classifier ____________________________________________________________ 23

2.6. Error Correction _______________________________________________________ 25

3. FACIAL ANIMATION_______________________________________________ 27

3.1. Background____________________________________________________________ 27

3.2. Facial Muscles__________________________________________________________ 30

3.3. Facial Tissue ___________________________________________________________ 35

3.4. Moving The Jaw ________________________________________________________ 40

3.5. Automated Eyeblink_____________________________________________________ 41

4. LIPSYNCH FACIAL ANIMATION (AGU) ______________________________ 42

4.1. System Overview _______________________________________________________ 42

4.2. Optimizing Performance _________________________________________________ 45

4.2.1. Multilevel Caching_________________________________________________________ 47

5. REAL-TIME LIPSYNCH FACIAL ANIMATION (RT_AGU) _______________ 48

VI

5.1. System Overview _______________________________________________________ 48

5.2. Keyframe Interpolation __________________________________________________ 50

6. CONCLUSIONS ____________________________________________________ 51

APPENDIX A. NAVIGATION HIGHLIGHTS _______________________________ 53

REFERENCES _________________________________________________________ 56

VII

LIST OF FIGURES

Figure 1.1. Overview of AGU _______________________________________________ 2

Figure 2.1.1. Overview of a statistical speech recognizer__________________________ 6

Figure 2.2.1. Block diagram of cepstrum analysis ______________________________ 11

Figure 2.3.1. The mel scale ________________________________________________ 13

Figure 2.3.2. Mel filter bins ________________________________________________ 14

Figure 2.3.3. Block diagram of mel-cepstral analysis____________________________ 14

Figure 2.4.1. Hamming filter _______________________________________________ 15

Figure 2.4.2. Block diagram of mel-cepstral analysis (hamming filter applied)________ 15

Figure 2.5.1. Turkish viseme classes _________________________________________ 17

Figure 2.5.1.1. Training sample for "e"_______________________________________ 18

Figure 2.5.5.1. Block Diagram of the Tree Classifier ____________________________ 24

Figure 2.6.1. Applying median filter for error correction _________________________ 25

Figure 2.6.2. Frames 4 and 8 are misclassifications_____________________________ 25

Figure 3.1.1. Traditional style lip-synchronization ______________________________ 29

Figure 3.2.1. Facial Muscles _______________________________________________ 31

Figure 3.2.2. Effect zones of a linear muscle___________________________________ 32

Figure 3.2.3. Linear muscle ________________________________________________ 33

Figure 3.2.4. Sphincter muscle _____________________________________________ 34

Figure 3.3.1. Skin layers __________________________________________________ 35

Figure 3.3.2. Simple skin implementation (tension net)___________________________ 36

Figure 3.3.3. Deformable lattice implementation of the skin ______________________ 37

Figure 3.3.4. Effect of muscle activation on skin layers __________________________ 38

Figure 3.3.5. Saying "o" without and with obicularis oris, and with facial layers ______ 38

Figure 3.4.1. Jaw definition torus ___________________________________________ 40

Figure 4.1.1. Subunits of AGU system ________________________________________ 43

Figure 4.2.1. Hierarchy of frame requests_____________________________________ 45

Figure 4.2.1.1. Multilevel cache structure_____________________________________ 47

Figure 5.1.1. Overview of RT_AGU sub-units__________________________________ 48

VIII

LIST OF TABLES

Table 2.5.2-1. Performance of the nearest neighbor classifier _____________________ 20

Table 2.5.3-1. Performance of the fuzzy NN classifier ___________________________ 21

Table 2.5.4-1. Performance of the parametric classifier__________________________ 22

Table 2.5.5-1. Performance of the tree classifier________________________________ 23

Table 6-1. Video output confusion matrix _____________________________________ 51

1

1. INTRODUCTION

In recent years there has been considerable interest in computer-based three-

dimensional (3D) facial character animation. The human face is interesting and challenging

because of its familiarity. Animating synthetic objects is most of the time acceptable, but

when it comes to facial animation, we humans tend to criticise and cannot tolerate

unnatural looking details. Realistically animating a speaking face is one of the hardest

animations to be done [1]. Talking 3D synthetic faces are now used in many applications

involving human-computer interaction.

In traditional animation, synchronization between the drawn or synthetic images

and the speech track is usually achieved through the tedious process of reading the

prerecorded speech track to find the frame times of significant speech events [2],[3],[4].

Key frames with corresponding mouth positions and expressions are then either drawn or

rendered to match these key speech events. For a more realistic correspondance, a live

actor is filmed or videotaped while speaking. These recorded frames guide either

traditional or computer animators to obtain the correct mouth positions for each key frame

or even for each frame of the animation (rotoscoping). Both methods require a large

amount of time and resource, and most of the time is spent to match facial key points

mechanically.

To automate this task, various methods are proposed. Some methods use text based

generation of both synthetic faces and synthetic speech [5]. Generating natural speech is

not always possible; it is better to use a voice actor instead. Text based synthetic facial

animation again faces the problem of synchronization. A fully speech driven facial

animation tool is the ultimate solution to this hard task. Proposed solutions generally

require large amount of computer power to process speech data, and none of the engines

are based on Turkish language. One engine using text as input is created [6],[7]. Another

work is using speech input and codebook based face point trajectory method [8].Our work

2

is to create an automated Turkish based, or better language independent real-time speech

driven facial animation system.

Our system, Automated speech driven Graphical facial animation Unit (AGU)

consists of two units (Figure 1.1), one speech processing unit and one facial animation

unit.

Figure 1.1. Overview of AGU

In the speech processing phase the speech is divided into frames and each frame is

classified into one of the eight lip-shape classes using pretrained data using a feature vector

for each frame. Classified lip-shape classes are sent to the facial animation unit, which

deforms the 3D face accordingly. For each frame, a new screen shot is created, and

displayed on the screen or saved as a file.

Speech processing unit is summarized in Chapter II and our feature vector is

analyzed. Different classification schemes are compared and a tree classifier is proposed.

The nature of the speech allow us to create a very efficient error correction routine, as

illustrated in Chapter II.

3

In Chapter III, facial animation is described. After a summary of the background,

physically based facial animation will be covered in detail, facial muscles and facial tissue

layers will be explained, and our implementation of them. Automated eye blink is shortly

mentioned and the jaw movement is covered in the last subsection.

Chapter IV describes the implemented system. Our system overview will be

covered in detail, and ins and outs of each subunit is investigated. Some optimization

techniques are examined in the last subsection.

Chapter V is about the real-time implementation of our engine. Even the off-line

version performs very fast; a real-time version should include some simplifications as well

as some optimizations which are normally not necessary in the off-line version. The real-

time system overview and the performance issues are covered throughout this chapter.

The last chapter is on evaluating our work, and proposing some future work.

Possible uses of this work is also covered in this chapter.

4

2. SPEECH PROCESSING

2.1. Background

Speech recognition is the task of mapping from digitally encoded acoustic signal to

a string of words. Speech recognition systems or algorithms are generally classified as

small, medium or large vocabulary. We would expect the performance and speed of a

particular recognizer to degrade with increasing vocabulary size. Another classification is

isolated word recognizers versus continuous speech recognizers. Isolated word recognizers

are trained with discrete renderings of speech units. In the recognition phase, it is assumed

that the speaker deliberately utters sentences with sufficiently long pauses between words.

When the vocabulary size is large, isolated word recognizers need to be specially

constructed and trained using subword models. Further, if sentences composed of isolated

words are to be recognized, the performance can be enhanced by exploiting probabilistic

relationships among words in the sentences. On the other hand, the most complex

recognition systems are those which perform continuous speech recognition, in which the

user speaks in a relatively unconstrained manner. First, the recognizer must be capable of

dealing with unknown temporal boundaries in the acoustic signal. Second, the recognizer

must be capable of performing well in the presence of all the coarticulation effects and

sloppy articulation that accompany flowing speech. Continuous speech recognizers require

language processing tools. They are concerned with the attempt to recognize a large pattern

by decomposing it into small subpatterns according to the rules, to reduce entropy. Lexical

rules and other subword knowledge are used to recognize the words as smaller units

(below word level processing). The recognition of a sentence is benefited by the

knowledge of superword knowledge that yields word ordering information (above word

level processing). The usual case of a linguistic processing is the more general case in

which linguistic rules are applied both above and below the word level. Most systems

employed in practical applications are of the small vocabulary isolated word type. All

5

perform significantly better if required to recognize only a single speaker who trains the

system [9],[10].

Human languages use a limited repertoire of about 40 or 50 sounds, called phones.

One of the major problems in speech recognition is the existence of homophones: different

words that sound the same. This is a problem in English, in Turkish it is not a problem. To

solve the problem linguistic processing tools are used.

Sound is an analog energy source. The sampling rate is the frequency with which

we look at the signal. Quantization factor determines the precision to which energy at each

sampling point is recorded. Even with a low sampling rate and quantization factor, speech

requires a lot of information to analyze.

The first step in coming up with a better presentation for the signal is to group the

samples together into larger blocks called frames. This makes it possible to analyze the

whole frame for the appearance of some speech phenomena. Within each frame, the sound

is represented with a feature vector. Typical features are number of zero crossings, or

energy of the frame, etc. Recognition is done using this feature vector, calculated for each

frame.

Current speech recognition systems are firmly based on the principles of statistical

pattern recognition. The basic methods of applying these principles to the problem of

speech recognition were pioneered by Baker [11], Jelinek [12], and their colleagues from

IBM in the 1970s, and little has changed since (Figure 2.1.1).

The utterance consists of a sequence of words, W, and it is the job of the recognizer

to determine the most probable word sequence, W’, given the observed acoustic signal Y.

To do this, Bayes’ rule is used to decompose the required probability P(W|Y) into two

components, that is,

6

)(

)|()(maxarg)|(maxarg'

YP

WYPWPYWPW

WW== (2.1)

Front EndParametrization

th ih s ih z s p iy ch

this is speech

Language Model P(W) . P(Y|W)

Acoustic Models

PronouncingDictionary

Y

W

Parametrized Speech Waveform

Figure 2.1.1. Overview of a statistical speech recognizer

P(W) represents the a priori probability of observing W independent of the

observed signal, and this probability is determined by a language model. P(Y|W)

represents the probability of observing the vector sequence Y given some specified word

sequence W, and this probability is determined by an acoustic model. A very popular

implementation of the acoustic model uses Hidden Markov Models (HMMs) [10],[13]. For

each phone there is a corresponding statistical model called HMM. The sequence of

HMMs needed to represent the postulated utterance are concentrated to form a single

composite model, and the probability of that model generating the observed sequence Y is

calculated. Each individual phone is represented by an HMM, which consists of a number

7

of states connected by arcs. HMM phone models typically have three emitting states and a

simple left-right topology. The entry and exit states are provided to make it easy to join

models together. The exit state of one phone model can be merged with the entry state of

another to form a composite HMM. This allows phone models to be joined together to

form words and words to be joined together to form complete utterances. An HMM is a

finite state machine that changes state every time unit and each time that a state is entered

an acoustic vector is generated with some probability density. Furthermore the transitions

are also probabilistic. The joint probability of an acoustic vector is calculated simply as the

product of the transition probabilities and the output probabilities. This process can be

repeated for all possible word sequences with the most likely sequence (the sequence with

the highest combined probability) selected as the recognizer output. HMM parameters

must be estimated from data, and it will never be possible to obtain sufficient data to cover

all possible contexts. Because of that problem, the language model must be able to deal

with word sequences for which no examples occur in the training data. Language model

probability distributions can be easily calculated from text data, and are unique for every

language.

8

2.2. Cepstral Analysis

For many years Linear Prediction (LP) analysis has been the most popular method

for extracting spectral information from the speech signal. Contributing to this popularity

is the enormous amount of theoretical and applied research on the technique, which has

resulted in very well-understood properties and many efficient and readily available

algorithms. For speech recognition, the LP parameters are very useful spectral

representation of the speech because they represent a smoothed version of the spectrum

that has been resolved from the model excitation. However, LP analysis is not without

drawbacks.

LP analysis does not resolve vocal-tract characteristics from the glottal dynamics.

Since these laryngeal characteristics vary from person to person, and even for within

person utterances of the same words, the LP parameters convey some information to a

speech recognizer that degrades performance, particularly for a speaker-independent

system. In the 1980s, researchers began to improve upon the LP parameters with a cepstral

technique. Much of the impetus for this conversion seems to have been a paper by Davis

and Mermelstein [14], which compared a number of parametric representations and found

the cepstral method to outperform the raw LP parameters in monosyllable word

recognition. In fact, the most successful technique in this study was a cepstral technique,

the mel cepstrum, which is not based on LP analysis, but rather a filter bank analysis. We

will deal with mel cepstrum in section 2.3.

According to the usual model, speech is composed of an excitation sequence

convolved with the impulse response of the vocal system model. Since we have only

access to the output, the elimination of one of two combined signals is a very difficult

problem. If two pieces are combined linearly, a linear operation (Fourier transform) will

allow us to examine the component sequences individually. Because the individual parts of

a voiced speech, vocal tract articulation and pseudoperiodic noise source, are composed in

a convolved combination, our linear operator, Fourier transform, cannot separate them. But

9

in the cepstrum, the representatives of the component signals will be separated, and

linearly combined. If it is the purpose to assess some properties of the component signals,

the cepstrum itself might be sufficient to provide the needed information. However, if it is

the purpose to eliminate one of the component signals we are able to use linear filters to

remove undesired cepstral components, since the representatives of the component signals

are linearly combined.

The cepstrum, or cepstral coefficient, c(τ) is defined as the inverse Fourier transform

of the short-time logarithmic amplitude spectrum |X(ω)|. The term cepstrum is essentially a

coined word which includes the meaning of the inverse transform of the spectrum. The

independent parameter for the cepstrum is called quefrency, which is obviously formed

from the word frequency. Since the cepstrum is the inverse transform of the frequency

domain function, the quefrency becomes a time-domain parameter. The special feature of

the cepstrum is that it allows for the separate representation of the spectral envelope and

the fine structure.

Voiced speech x(t) can be regarded as the response of the vocal tract articulation

equivalent filter, which is driven by a pseudoperiodic source g(t). Then x(t) can be given

by the convolution of g(t) and vocal tract impulse h(t) as

∫ −=t

dthgtx0

)()()( τττ (2.2)

which is equivalent to

)()()( ϖϖϖ HGX = (2.3)

where X(ω), G(ω), and H(ω) are the Fourier transforms of x(t), g(t), and h(t)

respectively.

10

If g(t) is a periodic function, |X(ω)| is represented by line spectra, the frequency

intervals of which are the reciprocal of the fundamental period of g(t). Therefore, when

|X(ω)| is calculated by the Fourier transform of a sampled time sequence for a short speech

wave period, it exhibits sharp peaks with equal intervals along the frequency axis. Its

logarithm log|X(ω)| is

)(log)(log)(log ϖϖϖ HGX += (2.4)

The cepstrum, which is the inverse Fourier transform of log|X(ω)|, is

)(log)(log)(log)( 111 ϖϖϖτ HFGFXFc −−− +== (2.5)

where F is the Fourier transform. The first and second terms on the right side of

equation 2.4 correspond to the spectral fine structure and the spectral envelope,

respectively. The former is the periodic pattern, and the latter is the global pattern along

the frequency axis.

Principally, the first function on the right side of equation 2.5 indicates the

formation of a peak in the high-quefrency region, and the second represents a

concentration in the low-qeufrency region. The fourier transform of the low-quefrency

elements produces the logarithmic spectral envelope from which the linear spectral

envelope can be obtained through the exponential transform. The maximum order of low-

quefrency elements used for the transform determines the smoothness of the spectral

envelope. The process of separating the cepstral elements into these two factors is called

liftering, which is derived from filtering. When the cepstrum value is calculated by the

DFT, it is necessary to set the base value of the transform, N, large enough to eliminate the

aliasing. The process steps for extracting the spectral envelope using cepstral method is

given in Figure 2.2.1.

11

WindowSampledsequence

|DFT| Log IDFTCepstral Window

(liftering)

Low quefrencyelements

High quefrencyelements

Peakextraction

DFT

Spectralenvelope

Fundamentalperiod

Figure 2.2.1. Block diagram of cepstrum analysis

First 13 quefrency elements are used as the 13 dimensional feature vector. The

feature vector is self-normalized by normalizing the first coefficient to 1. As the window

length, we selected 10 milliseconds , but they are overlapping by 5 milliseconds.

12

2.3. Mel Cepstrum

Pure cepstral coefficients are not satisfactory enough for classification, hence an

improvement can be obtained by integrating perceptual information. “Mel-based cepstrum”

or simply “mel cepstrum” comes out of the field psychoacoustics. Its use has been shown

empirically to improve recognition accuracy [10].

A mel is a unit of measure of perceived pitch or frequency of a tone (similar to an

octave in music). It does not correspond linearly to the physical frequency of the tone, as

the human auditory system apparently does not perceive pitch in this linear manner. The

precise meaning of the mel scale becomes clear by examining the experiment by which it

is derived. Stevens and Volkman[15] arbitrarily chose the frequency 1000 Hz and

designated this “1000 mels”. Listeners were then asked to change the physical frequency

until the pitch they received was twice the reference, then 10 times, and so on; and then

half the reference, 1/10 and so on. These pitches were labeled 2000 mels, 10000 mels, and

so on; and 500 mels, 100 mels, and so on. The investigators were then able to determine a

mapping between the real frequency scale (Hz) and the perceived frequency scale (mels).

Stephens and Bate [16] showed that the pitch expressed in mels is roughly proportional to

the number of nerve cells terminating on the basilar membrane of the inner ear, counting

from the apical end to the point of maximal stimulation along the membrane.

The mapping is approximately linear below 1kHz and logarithmic above (Figure

2.3.1), and such approximation is usually used in speech recognition. Fant [17] suggested

following approximation:

+=

10001log1000 2

Hzmel

FF (2.5)

13

Figure 2.3.1. The mel scale

To incorporate the perception properties of the human ear into cepstral analysis, it

is desired to compute the cepstral parameters at frequencies that are linearly distributed in

the range 0 – 1 kHz, and logarithmically above 1 kHz. One approach is to oversample the

frequency axis and then selecting those frequency components that represent

approximately the appropriate distribution. Another method is using filter bins (Figure

2.3.2).

The perception of a particular frequency by the auditory system is influenced by the

energy in a critical band of frequencies around it. The bandwidth of a critical band varies

with frequency, beginning at about 100 Hz for frequencies below 1 kHz, and then

increasing logarithmically above 1 kHz. Therefore, rather than simply using the mel-

distributed log magnitude frequency components, using the total log energy in critical

bands around the mel frequencies as input to the final IDFT is preferred.

Triangularly shaped mel-frequency bins are actually visualizations of these critical

bands. For each bin, the weighted sum of log of the sample frequencies is calculated. The

weight of each sample comes from the filter bin’s magnitude at the selected frequency

position, hence 1 at the bin center frequency and decreasing to 0 linearly at the boundaries

of the bin. Bins are positioned linearly separated and of the same size for frequencies

14

below 1 kHz. For frequencies greater than 1 kHz filter bins are positioned logarithmically

separated and growing logarithmically in width. Actually, filter bins are of the same size

and linearly separated in the mel-frequency domain.

Fmel1000 melFHz1000 Hz

Figure 2.3.2. Mel filter bins

Since filter bins require a large number of samples, we increased our sampling

window into 20 milliseconds frames, overlapping in each 10 milliseconds. This gives us

512 sample frames at a 22050 Hz sampling rate, and this number of samples is enough to

work with filter bins. Experimentally 64 filter bins in the range 0-8 kHz give the best

results.

WindowSampledsequence

|DFT| Log IDFT

K mel-cepstralcoefficients

K filterbins

N samples N frequencies K mel filteroutputs

Figure 2.3.3. Block diagram of mel-cepstral analysis

An enhancement can be achieved by using IDCT instead of IFT in the last step.

This has the effect of compressing the spectral information into the lower-order

coefficients, and it also decorrelates them to allow the subsequent statistical modeling to

use diagonal covariance matrixes [10].

15

2.4. Hamming Window

In cepstral analysis, the effect of neighboring frames affect the results of our

analysis. To get rid of this aliasing, each frame is filtered through a hamming window. A

hamming filter is strongest at the middle part of the frame, and drops down to zero at the

edges of the frame (Figure 2.4.1).

Figure 2.4.1. Hamming filter

HammingSampledsequence

|DFT| Log IDFT

K mel-cepstralcoefficients

K filterbins

N samples N frequencies K mel filteroutputs

Figure 2.4.2. Block diagram of mel-cepstral analysis (hamming filter applied)

frame size

16

2.5. Classification

From the point of visual speech perception, different phonetic sounds can be

visually observed as exhibiting the same lip-shape [18]. For example, the visible mouth

movement for phoneme p is similar to the movement for b. The phoneme classes that can

be visually discriminated are called visemes. The difference of the phones with the same

viseme comes from the different position settings of the tongue and teeth. Cartoon

animators are taking advantage of this knowledge for years [3],[4]. Usually eight visemes

are sufficient to achieve realistic mouth animations (however for lip-reading 16 is the

minimum), and it makes our classification work a lot easier, since instead of 50 phone

classes we only have eight viseme classes to distinguish. We defined Turkish viseme

classes as follows:

1. silence : Lips are at rest, jaw is nearly closed

2. b, m, p : Stops. Lips are pressed, jaw is tightly closed

3. c, ç, d, g, h, k, n, r, s, s, t, y, z : Lips and jaw are slightly open

4. a, e : Lips and jaw are open

5. i, i, l : Lips are slightly open, jaw is open but not as much as class 4

6. o, ö : Lips are rounded, jaw fully open

7. u, ü : Lips are rounded and protruding, jaw fairly open

8. f, v : Lower lip is pressing to the upper lip, jaw is closed

17

1 2 3 4

5 6 7 8

Figure 2.5.1. Turkish viseme classes

Acoustically easily distinguished visemes are hard to distinguish visually, since

within each viseme, phonemes with different acoustic features are included. In order to

classify speech into visemes, proper acoustic features selection is crucial. As our feature

vector, we selected 12 Mel scale cepstral coefficients and the log energy of each frame.

These features are easy and fast to calculate, at the same time offer proper information

about vocal tract shapes, and these features are speaker independent features.

18

2.5.1. Training Set

Our first major task in our work was establishing a training set. Since there wasn’t

much work done in Turkish language itself, we did not have any precreated and marked

training set. We had to create one for our needs. Creating a large training set and marking

each phoneme requires large amount of work, and is beyond the scope of our project. So

we decided to create a large enough training set for our needs, and it affects our

classification procedure. Since time and resources did not permit to create a large enough

training set for training an HMM or even Neural Network, we decided to use simpler

classifiers, such as nearest neighbor and parametric classifiers.

S E N I

Figure 2.5.1.1. Training sample for "e"

For each phoneme at least 4 utterances are recorded and carefully cleaned from the

neighboring phonemes and noise (Figure 2.5.1.1) by listening to the speech and by

inspecting the speech waveform. Two sets of data from the same speaker, one for training

and one for evaluating the results were created.

19

Each utterance is analyzed with 20 milliseconds frames (overlapping in 10

milliseconds), and for each frame the feature vector is calculated. Throughout the same

utterance, all frames give similar feature vectors. Since not all the utterances are of the

same length, this gives out different number of feature vectors for each utterance, which is

not desirable. So instead of using all frames for each utterance, some of them are selected

as representative of the training utterance. To select a representative of the utterance,

vector median operation is used.

Given an utterance with n frames (Utterance length is 10(n+1) milliseconds. The

frame size is 20 milliseconds, but frames are overlapped by 10 milliseconds), the set of all

feature vectors is V=(v1....vn). The vector median VM of V is defined as follows,

{ } ∑==i

ikk

jN vvdistjvvvVM ),(minargsuch that ....1 (2.6)

After calculating the set of all representatives for all utterances, we have our

training data, which we can use in our classifiers. The simplest one is nonparametric

nearest neighbor classifier.

20

2.5.2. Nearest Neighbor Classifier

Since our viseme classes include phonemes with different properties, our viseme

classes are not directly linearly-separable. A nearest neighbor classifier does not require a

large set of training data, it can work even with very small sets of training data. The idea is

to classify the test frame, to the mouth shape with the nearest feature vector [19]. All

features are self normalized to the same range (average 0, variance 1) , and the distance

measure is the simple geometric distance. For the test frame F with feature vector (v1...vn)

the distance to the sample frame S with feature vector (s1...sn) is calculated accordingly,

2211 )(....)(),( nn svsvSFd −++−= (2.7)

To enhance the classifier performance, three nearest neighbors are calculated, and

the classification output is selected by voting.

Table 2.5.2-1. Performance of the nearest neighbor classifier

%67 silence bmp cdsty ae ii oö uü fvsilence 96 0 2 0 0 0 0 1bmp 7 37 41 0 0 1 3 8cdsty 0 15 61 0 3 1 1 15ae 0 0 4 84 0 0 0 10ii 1 0 32 0 43 15 2 3oö 0 0 7 0 0 81 8 2uü 0 1 4 0 0 5 83 4fv 10 21 16 0 0 0 2 62

21

2.5.3. Fuzzy Nearest Neighbor Classifier

For some classes, nearest neighbor classifier is not good enough. One enhancement

to the classifier is using the fuzzy class distances instead of the simple distances. Fuzzy

class distance of a test frame F with features (v1...vn) to the classification class c with

samples (S1...Sm), Sk having the feature vector (sk1...skm), is calculated accordingly,

∑ −−

=k

fkSFd

mcFfd )1(

1

),(1

),( , 1<f<2 (2.8)

Minimum fuzzy distant class is selected as the classification output.

Table 2.5.3-1. Performance of the fuzzy NN classifier


22

2.5.4. Parametric Classifier

Nonparametric classifiers use training data directly, and do not make any

assumptions on the distribution of the data. Parametric classifiers, on the other hand,

assume some distribution model, and try to calculate the parameters of this distribution.

Since our viseme classes contain many phonemes with different properties, a direct

implementation of a parametric classifier using the viseme classes is not useful. Instead of

using viseme class distributions, we assume gaussian distribution for each phone class

separately, and try to calculate their distributions. Some phone classes are easily separable

using this technique (such as the silence class, since its energy is distinctly lower than the

other classes).

All classes are assumed to have normal density N(µi,δi), µi is calculated as the

mean of class i and δi is calculated using the standard deviations of each dimension

(feature) of class i.(Note that the features are self normalized to have the mean zero and the

variance one. But since only a subset of the samples belong to class i, class i has different

mean than zero and different variance than one)

Table 2.5.4-1. Performance of the parametric classifier


23

2.5.5. Tree Classifier

Using nearest neighbor classifier performs well in some classes, and fuzzy nearest

neighbor performs well in some other classes, and parametric classifier performs better

than the others in some other classes. Therefore it is better to build a tree classifier. A

simple tree classifier might be running all three algorithms and doing voting. A more

sophisticated one can be created by integrating the results and success rates of each

algorithm on each class.

Figure 2.5.5.1 illustrates the structure of the tree classifier. For example, if

algorithm A has a success rate of 80 per cent on class i and assigns the frame to class i,

then class i gets 0.80 points. If algorithm B has the success rate of 50 per cent on class j,

and assigns the frame to class j, class j gets 0.50 points. And if algorithm C has 10 per cent

success on class j, and classifies the frame to class j, class j gets additional 0.1 points. A

simple voting mechanism will select class j, but since algorithm C has a very poor rate of

success on class j, it is more probable, that the algorithm has misclassified the frame. More

sophisticated classification can be done, by looking at the total points collected from each

classifier. Now class i has more points, since algorithm A is very successful on class i, so

our classification output is class i.

Table 2.5.5-1. Performance of the tree classifier


24

To get the success rates of algorithms, a part of the test set is used, and classified.

The success rates are used as weights of classification algorithms on classes, and the rest of

the test set is used to evaluate the tree classifier. The class weights are more or less the

same as the tables shown in the previous chapters 2.5.2, 2.5.3, 2.5.4 ( Table 2.5.2-1, Table

2.5.3-1, Table 2.5.4-1 ). It is seen that tree classifier performs better than the individual

classifiers.

Test Frame Tree Classifier

3NN

Fuzzy NN

Parametric

Lip-shapeClass

Training Set

First Half

Second Half

3NN

Fuzzy NN

Parametric

Success rates ofalgorithms

Classification and rate of successcalculation

Training Data

Figure 2.5.5.1. Block Diagram of the Tree Classifier

25

2.6. Error Correction

One property of the speech is very useful to correct some misclassifications. To

produce a sound, lips should remain in the same fixed position for some time. This means

that an isolated lip-shape is a potential error. Using the median performed for each lip-

shape class is applied over a time window to detect and correct such misclassifications.

Class b,m,p

Class f,v

time

activation

time

activationtime

activation

time

activation

before median after median

Figure 2.6.1. Applying median filter for error correction

This error correction routine has one drawback, which occurs when we have fast

speech. As the speech becomes faster, lips are moving very fast, and some lip-shapes can

be caught as misclassifications, hence get lost. In case of fast speech many phones are

melted into each other, so skipping a lip-shape might produce even worse looking

animation, but that is the cost of the error correction.

1 2 3 4 5 6 7 8 9 10

Figure 2.6.2. Frames 4 and 8 are misclassifications

26

Another implementation might be using a gaussian filter instead of the median on

each class separately. Actually applying gaussian filters on each class individually has the

same effect as appliying the median filter.

27

3. FACIAL ANIMATION

3.1. Background

In recent years there has been considerable interest in computer-based three-

dimensional (3D) facial character animation. The human face is interesting and challenging

because of its familiarity. Animating synthetic objects is most of the time acceptable, but

when it comes to facial animation, we humans tend to criticize and cannot tolerate

unnatural-looking details.

First computer generated facial animations were done by F. Parke at University of

Utah in the early 70s [20]. In 1974, Parke created the first parametric facial model [21]. In

1977, Ekman and Friesen created Facial Action Coding System (FACS) [22], which is the

basis of the rest of the facial animation studies. In that system, they analyze and break

down the facial action into smaller units called Action Units (AU). Each AU represents an

individual muscle action, or an action of a small group of muscles, into a single

recognizable facial posture. In 1980, Platt [5] and in 1987, Waters [23] published their

studies on physically based muscle controlled facial animation. In 1987, Lewis [24] and in

1988, Hill [25] made the first studies on automated facial animation. Especially in the

second half of the 80s computer generated short films were made with animated speech.

The Academy Award winning film, TinToy, produced by Pixar, was the first feature film

to use facial animation as a part of the story in 1990. Currently facial animation and speech

is frequently used on animation films. Facial tissue and muscles of the face are modeled

and used for natural-looking facial animation [24].

There are at least five fundamental approaches to facial animation. These

approaches are:

28

1. interpolation: It is perhaps the most widely used of the techniques. In its simplest form,

it corresponds to the key-framing approach found in conventional animation [3],[4].

The desired facial expression is specified for a certain point in time (keyframe) and

then again for another point in time some number of frames later. A computer

algorithm generates the rest of the frames (in-betweens) between these keyframes.

2. performance-driven: It involves measuring real human actions to drive synthetic

characters. Data from interactive input devices such as data gloves, instrumented body

suits, and laser- or video-based motion-tracking systems, are used to drive the

animation.

3. direct parameterization: A set of parameters are used to define facial conformation and

to control facial expressions. The direct parameterized models use local region

interpolations, geometric transformations, and mapping techniques to manipulate the

features of the face [21].

4. pseudomuscle-based: Muscle actions are simulated using geometric deformation

operators. Facial tissue dynamics are not simulated. These techniques include abstract

muscle action, and freeform deformation.

5. muscle-based : A mass-and-spring model is used to simulate facial muscles. Two types

of muscles are implemented: linear muscles that pull, and sphincter muscles that

squeeze. Instead of expressions or operators muscles are parameterized and key-framed

[23].

Lip-synchronization is mostly done mechanically by animators by analyzing the

sound track [26]. Pre-recorded speech track is read to find the frame times of significant

speech events (Figure 3.1.1). Key frames with corresponding mouth positions and

expressions are then drawn to match these speech events. For a more realistic

correspondence, a live actor is filmed or videotaped while speaking. These recorded frames

are then rotoscoped to obtain the correct mouth positions for each frame or for each key

frame. With these techniques the speech track is created first and the animation images are

created to match the speech.

29

S E N I

Figure 3.1.1. Traditional style lip-synchronization

Some work has been done on automating the process but they are mostly based on

text input [27]. With these systems, lip-synchronizing is finished by animators after the

recording. Another way of synchronizing speech with the generated images is using

synthetic speech created from the same textual input [24]. The only work done so far for

Turkish also uses text as input [6], [7]. These systems require the speech track to match the

created images. But computer based speech animation allows a third possibility: speech

and images are created simultaneously, which is the work of this thesis.

30

3.2. Facial Muscles

In the general sense muscles are the organs of the motion. By their contractions,

they move the various parts of the body (parts of the face in our case). The energy of their

contraction is made mechanically effective by means of tendons, aponeuroses, and fascia,

which secure the ends of the muscles and control the direction of their pull. Muscles are

usually suspended between two moving parts, such as between two bones, two different

areas of skin, or two organs. Actively, muscles contract. Their relaxation is passive and

comes about through lack of stimulation. A muscle usually is supplied by one or more

nerves that carry the stimulating impulse and thereby cause it to contract.

When a muscle is suspended between two parts, one of which is fixed and the other

is movable, the attachment of the muscle on the fixed part is known as the origin. The

attachment of the muscle to the movable part is referred to as the insertion. The muscles of

facial expression are superficial, and all attach to a layer of subcutaneous fat and skin at

their insertion. Some of the muscles attach to skin at both the origin and the insertion such

as the obicularis oris (Figure 3.2.4). The muscles of the facial expression work

synergistically and not independently. Three types of muscle can be discerned as the

primary motion muscles:

1. linear/parallel muscles: They pull in an angular direction, such as the zygomatic major

2. elliptical/circular sphincter type muscles: They squeeze, such as the obicularis oris.

3. Sheet muscles: They behave as a series of linear muscles spread over an area, such as

the frontalis.

For a natural-looking facial animation lip-shape classes are converted into face

muscle activations. There are 10 linear muscles and one sphincter muscle modeled around

the lips [2].

31

Figure 3.2.1. Facial Muscles

Muscles modeled around the lips and their effects are:

1. Left and right Zygomatic Major : At the edge of the upper lip.

2. Left and right Angular Depressor : At the edge of the lower lip.

3. Left and right Labi Nasi : Pulls the upper lip.

4. Left and right Inner Labi Nasi : Pulls the inner part of the upper lip.

5. Left and right Depressor : Pushes the lower lip to the front and back. Used to produce

sounds “f” and “v”.

32

6. Obicularis Oris : Sphincter muscle. Tightens the lips and pushes to front. Used to

produce the sounds “o”, ”ö”, “u” and “ü”.

Muscles are the principle motivators of facial expression such as that when a

muscle contracts, it attempts to draw its attachments together. For facial muscles this

action usually involves drawing the skin towards the point of skeletal subsurface

attachment.

The effect range of each muscle is predefined and decreases nonlinearly in an effect

zone. Linear muscles have a bone and a skin attachment and follow the major direction of

the muscle vector. Whereas real muscle consists of many individual fibers, this model

assumes a single direction and attachment. With this simplifying assumption, an individual

muscle can be described with the direction and magnitude in three dimensions; the

direction is toward a point of attachment on the bone, and the magnitude of the

displacement depends upon the muscle spring constant and the tension created by a muscle

contraction. There is no displacement at the attachment to the bone, but a maximum

deflection occurs at the point of insertion into the skin. The surrounding skin is contracted

toward the static node of attachment on the bone, until, at a finite distance away, the force

dissipates to zero. Linear muscles have two sectors with different displacement effects.

Figure 3.2.2. Effect zones of a linear muscle

The displacement caused by the linear muscle is calculated according to the

formula below.

33

Figure 3.2.3. Linear muscle

1

1)cos('pv

vpkrpp α+= (3.1)

−−

−

=Bregion )cos(

Aregion )1

cos(

sf

s

s

RR

RDR

D

r (3.2)

where k is a fixed constant representing the elasticity of the skin. Zygomatic

Major, Angular Depressor, Labi Nasi, Inner Labi Nasi and Depressor muscles are linear

muscles.

Unlike linear muscles sphincter muscle contracts around an imaginary central

point. As a result, the surface surrounding the mouth is drawn together like the tightening

of material at the top of a string bag. Displacement caused by a sphincter muscle goes to

zero at the central point, as well as outside of the muscle zone. Displacement is calculated

according to the distance to the central point. Obicularis Oris is a sphincter muscle and can

be simplified to a parametric ellipsoid with a major and minor axis.

34

p

f1 f2c

p’

ly

lx

A

B

Figure 3.2.4. Sphincter muscle

pc

cpkrpp +=' (3.3)

−=Bregion )

2(

Aregion

s

ds

dr (3.4)

yx

yxxy

ll

plpld

2222 += (3.5)

where k is a fixed constant representing the elasticity of the skin and s is the

decreasing rate of the force along the muscle axis.

35

3.3. Facial Tissue

The skin covers the entire surface of the human form and is a highly specialized

interface between the body and its surroundings. It has a multicomponent microstructure,

the basis of which are five intertwined networks of collagen, nerve fibers, small blood

vessels, and lymphatics, covered by a layer of epithelium and transfied at intervals by hairs

and the ducts of sweat glands (Figure 3.3.1).

Epidermis

Dermis

Hypodermis

Hair

Subcutaneous adipose tissue

Figure 3.3.1. Skin layers

Human skin has a layered structure consisting of

1. epidermis: A superficial layer of dead cells and is composed of keratin.

2. dermis: It consists of irregular, moderately dense, soft connective tissue.

3. subcutaneous tissue: It consists of adipose tissue distributed in a network of

connective fibers. This connective tissue is mostly collagen arranged in a lattice with

fat cells. Beneath this superficial fascia lies the deep fascia which coats the bones.

36

This layered structure of skin is nonhomogeneous and nonisotropic. These features

were elaborated in 1861 by Langer [2] who made observations on many cadavers.

Figure 3.3.2. Simple skin implementation (tension net)

The simplest approach to skin tissue emulation is a collection of springs connected

in a network, or tension net (Figure 3.3.2) [28]. In this model the skin is represented as a

warped plane of skin nodes, connecting their neighbors by arcs. The arcs have elastic

material properties that make them behave like springs where the extension is proportional

to the force divided by the spring constant s

xsF ∆= (3.6)

Forces are generated by synthetic muscles. This force causes a displacement of the

muscle node. The force is then reflected along all arcs adjacent to this node; these reflected

forces are then applied to their corresponding adjacent nodes. In this way, the applied force

is propagated out from the initial node across the face. This approach has a distinct

advantage over a purely geometric technique because the displacement of one node can

influence all the other nodes in the surface. Consequently, muscle forces blend together,

providing a unified approach to facial expression modeling. Furthermore, the inherent

nature of springs helps to maintain some geometric integrity allowing the surface to dimple

and bulge, which is characteristic of facial tissue. One drawback of this model is that it

consists of only one layer, and does not take sublayer interactions into account.

37

Figure 3.3.3. Deformable lattice implementation of the skin

To include facial layers and their interactions, a variation to the basic tension net, a

deformable lattice structure can be defined for the face (Figure 3.3.3). Springs are arranged

into layers of tetrahedral elements cross-strutted with springs to resist shearing and twisting

stresses. The springs in the three layers have different stiffness parameters in accordance

with the nonhomogeneity of real facial tissue. The topmost surface of the lattice represents

the epidermis, and the spring stiffnesses are set to make it moderately resistant to

deformation. The springs underneath the epidermis form the dermis. The springs in the

second layer are highly deformable, reflecting the nature of subcutaneous fatty tissue.

Nodes on the bottom-most surface of the lattice are fixed onto the bone surface.

To create this topology, the procedure starts with triangular facial mesh, whose

nodes and springs represent the epidermis [23]. It projects normal vectors from the center

of gravity of each triangle into the face to establish subcutaneous nodes and forms

tetrahedral dermal elements by connecting them to epidermal nodes using dermal springs.

Second, it forms subcutaneous elements by attaching short weak springs from the

subcutaneous nodes downward to muscle layer nodes. Third, it adds the muscle layer,

whose lower nodes are constrained, anchoring them in bone. Finally it inserts the muscle

fibers through the muscle layer from their emergence in bone to their attachments at

muscle layer nodes.

38

Figure 3.3.4. Effect of muscle activation on skin layers

Muscles that apply forces to the tissue run through the second layer of the synthetic

tissue. The displacement of node j in the fascia layer from jx to 'jx due to muscle

contraction is a weighted sum of m muscle activities acting on node j:

∑=

+=m

ijijj xmxx

1

)(' (3.7)

where )( ji xm is the displacement caused by muscle i on node j and calculated as

according to the type and rate of contraction on muscle i as stated in chapter 3.2.

Figure 3.3.5. Saying "o" without and with obicularis oris, and with facial

layers

39

Once all the muscle interactions have been computed, the positions of nodes that

are subject to muscle actions are displaced to their new positions. As a result, the nodes in

the fatty, dermal, and epidermal layers that are not directly influenced by muscle

contractions are in an unstable state, and unbalanced forces propagate through the lattice to

establish a new equilibrium position. To get to this equilibrium position time step

simulation method is used [2]. Only a percent of the total displacement is applied on the

nodes, and new displacements are calculated according to this new unbalanced state.

Displacements below some threshold are set to zero. Applying not the total displacement

but only a percent of it takes care of oscillations. A set of nodes may enter an oscilating

equilibrium, where the change in the state forces the nodes back to the first state. This

unbalanced state again forces them to move, which again creates enough force to go back

to the first state etc. Applying only a percent of the displacement in each time step enables

the system to get into equilibrium in between. At the first pass unbalanced nodes are

tagged, and their total displacement is calculated. At the second pass some percent of the

total displacements are applied to the unbalanced nodes. We repeat the process until every

node is balanced or every unbalanced nodes displacement is under some threshold. An

unbalanced node applies displacement to all its neighbors, except when they are static

(bone) nodes.

40

3.4. Moving The Jaw

Jaw rotation is necessary for the mouth to assume its various speech and expression

postures. This is achieved by rotating the bone nodes of the lower part of the face about a

jaw pivot axis. The unbalanced upperlayer nodes are displaced to get into an equilibrium.

To mark the bone nodes to be rotated by the jaw, a half torus is defined as the jaw affect

region.

βα

r

Figure 3.4.1. Jaw definition torus

Figure 3.4.1 shows the parameters of the jaw definition torus. α is the start angle, β

is the end angle of the effect region torus definition. r is the radius of the torus. Jaw

definition torus is placed near the ears, just between the lips, such that all vertices of the

lower face, including the lower lip are inside the effect region.

Jaw movement is calculated with a rotation angle γ, and all bone nodes inside the

jaw definition torus rotate with an angle γ.

41

3.5. Automated Eyeblink

Sythetic blinking is an important characteristic that should be included in face

models used in conversational interface modes. Speaker eye blinks are an important part of

speech response systems that include synchronized facial visualizations. Eye blink not only

accentuates speech but addresses the physical need to keep eyes wet as well. The structure

of an eye blink is synchronized with speech articulation. It is also emotion dependent.

During fear, tension, anger, excitement, and lying the amount of blinking increases while it

decreases during concentrated thought.

A simple eye blinking model is based on the on-off characteristics of the speakers

voice. A better model is following the pauses in the speech, where stopping sounds such as

“m”, “b” or “p” are also pauses. There is a slight delay between speech pauses and eye

blinks. This delay is about 0.5 seconds. To add some randomness into eye blinks we

selected this delay between 0.4 and 0.6 seconds [3],[4].

A lesson that we learn from traditional animators is that nothing related to humans is

absolutely symmetric. Disney animators never blink both eyes at the same time. We

implement this idea in our automated eye blink model and one eye blinks with some delay

(0.1-0.2 seconds) to the other one, to increase realism.

Eyelids are modeled like the jaw (Section 3.4). Eyelid torus are put and nodes inside

this torus are rotated as the eye blinks.

42

4. LIPSYNCH FACIAL ANIMATION (AGU)

4.1. System Overview

For our speech driven facial animation system, we used K. Waters OpenGL

implementation code of some basic facial expressions using muscles as a starting point

[29]. His implementation focuses on facial expressions and linear muscles related with

them. To be able to do lip-synchronized animation, we implemented 10 linear muscles

around the mouth and one sphincter muscle on the same face. Jaw movement is

parameterized, and made face independent to be able to be used with other face topologies

(We still do not have any other face model). Facial tissue is created as stated in the

previous chapters, and muscle activations are calculated using number crunching time step

simulation algorithm.

For speech analysis, we used 12 mel scale cepstral coefficients and log energy of

the frame. Each frame is of 20 milliseconds size, and are overlapping in 10 milliseconds,

their size is rounded to make it a total of 512 samples, since we work with 22050 Hz

sampling rate. Each frame to be analyzed is hamming windowed to get rid of the effects of

the previous and next frames. A single speaker training set is used to classify the frames. A

tree classifier (explained in detail in section 2.5.5) is used to classify and the classification

results are median filtered for misclassifications. A set consisting of a class per frame, is

sent to the animation engine. Figure 4.1.1 shows the structure of AGU.

43

Speech Processor Animation EngineRecorded

Speech File I/O Subunit

HammingWindowing

FeatureExtraction

Tree ClassifierTrainingSet

Error Correction Facial MuscleExtraction

Lipshape classvs

muscleactivation

BoundaryInterpolation

EyeblinkDetection

Skin LayerCalculation

Display Subunit

Skin PropertiesFrames

FeatureVector

LipshapeClass

LipshapeClass

MuscleActivations

DeformedSkin

Animation File

MuscleProperties

Figure 4.1.1. Subunits of AGU system

Class information is converted into muscle activation through a library of lip-shape

class versus muscle activation list. Muscle activations are blended at the class boundaries

using gaussian filters. The size of the boundary, which is the time to go from one lip-shape

to the next one, is assigned as a static 10 frames (on a 25 frames-per-second (fps) system,

which makes 0.4 seconds), centered at the boundary. For short duration lip-shapes, the

effect is that the lip-shape is blended between the previous and the next lip-shape, but is

not discarded. The blended muscle activations are applied on the facial lattice. Unstable

states are solved using time step simulation method. When the system reaches its

equilibrium, the epidermal nodes and connections between them are sent to the OpenGL

display engine. Display window is saved as a set of numbered bitmap files, to be used with

the sound track as animation.

44

Files and tables used for the system are:

1. face topology

2. muscle topology

3. muscle activations and jaw rotations for facial expressions and lip-shapes

4. tissue topology

5. training set for the speech classifier

6. jaw topology

7. eyelid topology

8. teeth topology

9. recorded speech file: currently only 22050 Hz mono files are supported

45

4.2. Optimizing Performance

Our system works on a frame-by-frame basis, which means sequential processing is

not necessary. For any given recorded speech file, we can specify any frame to work with

and get the desired output frame. This design works definitely slower than a sequential

design, but allows us more degrees of freedom. This special nonsequential design allows

us to easily convert this stand-alone animation system to a module (plugin) in some

commercial animation engine. The drawback is that in cases of error correction, lip-shape

boundary detection, or even eyeblink detection, knowledge from previous frames is

required. To reprocess these frames time is lost, and performance is degraded. To shorten

this reprocessing time a multilevel cache structure is implemented.

Display Frame

Animation Frame

Lipshape Frame

Classification Frame

25 fps Displayvs

100 fps Classification

Boundary Detection

Error Correction

Figure 4.2.1. Hierarchy of frame requests

To display animation frame X on a 25 fps video output (standard PAL rate) we

need to look at four speech frames, since we are classifying on a rate of 100 fps. For

smoother animation results these four display frames should be blended. To display frames

correctly the end of the previous lip-shape class, and the start of the next one should be

known. So we should process the speech back and forth to detect the previous and the next

lip-shape classes. Another issue is the error corrector, which also requires a window of

frames to detect potential misclassifications. All of this requires reprocessing of the same

46

frames without caching. Also if we are processing sequentially each frame, we can spend

redoing a lot of work using caches.

47

4.2.1. Multilevel Caching

Our frame-by-frame design allows each subunit to work independent of the others.

Each sub-unit takes an input, and after processing gives the output to the next subunit. This

modularity allows us to put caches inside some important processing points to enhance

performance. Within this design each subunit requests its input from its input cache, and

sends its output to its output cache. Instead of subunits requesting processing from the

lower level subunits, the cache passes the request to the lower level subunits, if it cannot

fulfill it by itself, means its cache for the requested frame is empty.

Cache

AGU

SpeechFile Speech

Classification

Error Correction

Facial Animation

DisplaySubsystem

Classified SpeechFrames

Corrected SpeechFrames

Animation Frames

Display Frame

Figure 4.2.1.1. Multilevel cache structure

48

5. REAL-TIME LIPSYNCH FACIAL ANIMATION (RT_AGU)

5.1. System Overview

Microphone Capture

Speech Processor

Animation Engine

Speaker Ping Pong Buffer

HammingWindowing

FeatureExtraction

Tree ClassifierTrainingSet

Error Correction

LipshapePrecalculated

Faces

BoundaryInterpolation

EyeblinkDetection

WeightedInterpolation

Display SubunitFrames

FeatureVector

LipshapeClass

LipshapeClass

InterpolatedFace

Animation File

Initialization

Facial MuscleExtraction

Lipshape classvs

muscleactivation

Skin LayerCalculation

I/O Subunit

Skin Properties

MuscleActivations

DeformedSkin

MuscleProperties

Figure 5.1.1. Overview of RT_AGU sub-units

49

First of all, instead of reading a prerecorded speech file, the system should be

capable of capturing directly through a microphone. This can be achieved using Ping-Pong

buffers easily. Each buffer is of the size of a frame. When a buffer is filled, it is processed

and displayed, before the other buffer becomes full. To increase speed, frames are captured

not overlapping, but sequentially. This increases the error rate, but since our classification

rate is decreased from 100 fps to 50 fps we double the processing speed.

For real-time performance some optimizations are required. First of all, since

everything will work sequentially, cache management can be optimized, its size and

management become easier. Another change is on lip-shape boundaries. In the AGU

engine lip-shape boundaries are centered between two lip-shape classes, but we cannot use

this type of schema in case of a real-time engine, since by the time the speech reaches a

lip-shape class boundary, it is too late to change the displayed frames. The new schema is

to trigger the display of the lip-shape change with the lip-shape change in the speech. Now

the problem with the error correction occurs. How can we be sure that the newcoming lip-

shape class is not misclassified ? This can be handled by adding some more delay of a few

frames, just to be sure that the new classified lip-shape is a correct one. A total of five

frames delay (0.2 seconds in case of 25 fps) is sufficient and unavoidable.

To speed things up further we can use a more erroneous but faster classifier, such

as the parametric classifier. Since the parametric classifier does not have to measure the

distance to all the samples, it works much faster than the other ones, or the tree classifier.

The trade-off is decreased performance in classification; the parametric classifier makes

more misclassifications (Table 2.5.4-1 versus Table 2.5.5-1).

Most of the time is spent in processing the facial tissue layers. Enormous speed

increase is achieved by redefining this process, using precalculated key frames for

interpolation instead of recalculating the facial tissue for each frame.

50

5.2. Keyframe Interpolation

The time step simulation algorithm ( Section 3.3) used for solving the unbalanced

force distribution between the facial layers takes a lot of time. This method shows its

advantages between lip-shape classes; its results are more realistic. But to achieve real-

time performance, we have to sacrifice a bit from the realistic results.

Instead of resolving force distributions on each frame, we calculate it once for each

lip-shape class. Using this precalculated face data a table of lip-shape class faces are filled.

For each frame, each lip-shape class contribution is found and this information is used to

weighted interpolate the desired frame to display. An eight-dimensional vector

I={I1,I2,......I8} is passed to the interpolation subunit; where each parameter Ik is the

contribution (weight) of lip-shape class k of the current frame on the display. For each

vertex point Vi of the face is processed as follows:

∑=

=8

1ccici YIV (5.1)

where Yci is the corresponding vertex to Vi on the lip-shape class c.

51

6. CONCLUSIONS

This is the first work on automated speech driven facial animation in Turkish. We

created a system, Automated speech driven Graphical facial animation Unit (AGU),

capable of automatically making very realistic lip-synchronized facial animation using

recorded turkish speech [30],[31][32]. But since our method uses a speaker and language

independent analysis it can be also used on different speakers and on different languages.

We tested it only on Turkish and English so far.

Classification is done at a rate of 100 frames per second, but video output is

generated at a 25 frames per second rate. Since there are more than one frames of

classification per video frame their weighted average is taken for video output. That means

another level of error smoothing. The video confusion matrix is given in Table 6-1.

Average rate of correct classification is 92 per cent.

Table 6-1. Video output confusion matrix

silence bmp cdsty ae ii oö uü fvsilence 98 0 0 0 0 0 0 2bmp 0 60 32 0 0 1 0 8cdsty 2 0 78 0 0 0 0 10ae 0 0 0 100 0 0 0 0ii 0 0 20 0 80 0 0 0oö 0 0 0 0 0 92 8 0uü 0 0 2 0 0 0 98 0fv 0 0 30 0 0 0 0 70

Table 6-1 is created subjectively, by watching the video output of AGU. The

classification success at the video output is naturally higher, because most common errors

are between visually close classes, and can be tolerated when watching the visual output.

The ultimate test would be to have our results checked by a lipreader (for lip-reading 16

lip-shapes are required, but for realistic animation purposes eight are enough).

52

Short time analysis enables us to create a real-time working system. This selection

of the features is also speaker independent. Since no language processing tools are used,

the system is language independent. But since its training set is created using Turkish

language, the system is most powerful in Turkish speech.

Using physically based tissue layers gives natural-looking results, which can be

best seen on voices where jaw is open (like o,ö,u,ü) (Figure 3.3.5).

Results are promising, but the system is not capable of dealing with faster speech,

because of its error correction routines. Another issue is the face model. We used the face

model created in DEC by K. Waters [29]. To apply the system to other face models, more

parameterization is needed.

There is a parallel work on integrating our speech driven lip-shape classifier

module into a commercial 3D animation program (3D MAX).

Our system is fast, speaker and language independent, and requires not large

resources (real-time operation is achieved even on a ordinary PC level (Pentium processor,

32 Mb of RAM)). This is a very useful tool for animators, since it provides real-time

feedback. This tool can also be used in video conferencing applications, using models for

each speaker, and transferring only the speech, not the video. And a very useful application

might be a new interface for human computer interaction systems.

53

APPENDIX A. NAVIGATION HIGHLIGHTS

At initialization the interpolation keyframe faces areprocessed. As the process continues in thebackground, you can rotate the face, change shading,change lighting, but cannot start realtime display or.wav file processing, since the intermediate data is notprocessed yet. It will take a few seconds according toyour computer power.

Display Window Menu Window

Speed mode [default] is fasterbut makes more errors. For moreaccurate lip movement select theother, which is much moreslower but more suitable forapplications such as .bmp output

54

If you have a microphone ready,after selecting microphone startyou can watch what you aresaying in real time. To end theprocess, select microphone stop

During display you still canchange display settings.

You have to select a .wav file with22050 mono sampling rate. Then thefile will be processed and output asnumbered bitmap sequence.IMPORTANT: During the processthe output window should be on top.

55

model withpolygonalpatches

model withtexturemapping

wireframemodel withmuscles visible

smooth shadingof the model[default]

Inside the display window, you canrotate the face [default] any time byclicking left mouse button andmoving the mouse. By selecting lightmove you can change the move modeto light move mode, and move thelight as stated before. Selecting lightmove toggles this mode.

The head will rotaterandomly by itself whenactivated. Selecting secondtime toggles this mode.

56

REFERENCES

[1]. G. Maestri , Digital Character Animation, New Riders, 1996.

[2]. F.I. Parke, K. Waters. Computer Facial Animation, AK Peters, 1996.

[3]. F. Thomas, O. Johnston, Illusion of Life, Disney Animation, Hyperion, 1981.

[4]. T. White, The Animators Workbook, Watson Guptill Publications, 1988.

[5]. S.M. Platt, “A System for Computer Simulation of the Human Face”, The Moore

School, University of Pennsylvania, Philadelphia, 1980.

[6]. B. Uz , U. Güdükbay ,B. Özgüç, “Realistic Speech Animation of Synthetic Faces”,

Computer Animation'98, IEEE Computer Society Publications, Philadelphia, 1998.

[7]. B. Uz, “Realistic Speech Animation of Synthetic Faces”, M.S. Thesis, Bilkent

University, Department of Computer Engineering and Information Science, June 1999.

[8]. L. Arslan, D. Talkin, “Codebook Based Face Point Trajectory Synthesis Algorithm

Using Speech Input”, Elsevier Science, 953, 01-13, December 1998.

[9]. J.R. Deller, J.G. Proakis, J.H.L. Hansen, Discrete Time Processing of Speech

Signals, Macmillan Publishing Company, 1993.

[10]. S. Young, “A Review of Large-vocabulary Continuous-speech Recognition”, IEEE

Signal Processing, 45-57, September 1996.

[11]. J.K. Baker, “Stochastic Modeling for Automatic Speech Understanding”, Speech

Recognition, New York, Academic Press 521-542, 1975.

[12]. F. Jelinek, “Continuous Speech Recognition by Statistical Methods”, Proceedings

of the IEEE, vol64, 532-556, April 1976.

[13]. S. Russel, P. Norvig, Artificial Intelligence A Modern Approach, Prentice Hall

International, 1995.

[14]. S.B. Davis, P. Mermelstein, “Comparison of Parametric Representations for

Monosyllabic Word Recognition in Continuously Spoken Sentences”, IEEE

Transactions on Acoustics, Speech, and Signal Processing, vol 28, 357-366, August

1980.

[15]. S.S. Stevens, J. Volkman, “The Relation of Pitch to Frequency”, American Journal

of Psychology, vol. 53, 329, 1940.

57

[16]. R.W.B. Stevens, A.E. Bate, Acoustics and Vibrational Physics, New York, St.

Martins Press, 1966.

[17]. C.G.M. Fant, “Acoustic Description and Classification of Phonetic Units”, Ericsson

Technics, no 1, 1959.

[18]. S-H. Luo, R.W King, “A Novel Approach for Classifying Continuous Speech into

Visible Mouth-Shape Related Classes”, IEEE, I-465-468, 1994.

[19]. E. Alpaydin, Lecture Notes on Statistical Pattern Recognition, 1995.

[20]. F.I. Parke June, “Computer Generated Animation of Faces”, MS Thesis, University

of Utah, Salt Lake City, UT, UTEC-CSc-72-120, 1972.

[21]. F.I. Parke, “A Parametric Model for Human Faces”, PhD Thesis, University of

Utah, Salt Lake City, UT, UTEC-CSc-75-047, December 1974.

[22]. P. Ekman, W.V. Friesen, Manual for Facial Action Coding System, Consulting

Psychologists Press, Inc., Palo Alto, CA, 1978.

[23]. K. Waters, “A Physical Model for Animating 3D Facial Expressions”, Computer

Graphics (SIGGRAPH ‘87), 21(4):17-24 July, 1987.

[24]. J.P. Lewis, F.I. Parke, “Automatic Lip-Synch and Speech Synthesis for Character

Animation”, Proc. Graphics Interface ’87 CHI+CG ’87, 143-147, Canadian

Information Processing Society, Calgary, 1987.

[25]. Pearce, B. Wyvill, D. Hill, “Speech and Expression: A Computer Solution to Face

Animation”, Proc. Graphics Interface ’86, 136-140, Canadian Information Processing

Society, Calgary, 1986.

[26]. Robertson, “Read My Lips”, Computer Graphics World, 26-36, August 1997.

[27]. Pearce, B. Wyvill, D. Hill, “Animating Speech: An Automated Approach Using

Speech Synthesized by Rules”, Visual Computer, 176-187, 1988.

[28]. S.M. Platt, “A System for Computer Simulation of the Human Face, University of

Pennsylvania, The Moore School, M.S. Thesis, 1980.

[29]. K. Waters, “OpenGL Implementation of the Simple Face”, December 1998,

http://www.crl.research.digital.com/publications/books/waters/Appendix1/opengl/Open

GLW95NT.html.

[30]. L. Akarun, Z. Melek, “Türkçe Ses Eszamanli Yapay Yüz Canlandirma”, Bilisim

’99, 212-217, 1999.

58

[31]. L. Akarun, Z. Melek, “Automated Lipsynchronized Speech Driven Facial

Animation for Turkish”, presented in Confluence of Computer Vision and Graphics

NATO Advanced Research Workshop, Slovenia, August 1999.

[32]. L. Akarun, Z.Melek, “Automated Lipsynchronized Speech Driven Facial

Animation”, submitted to IEEE International Conference of Multimedia, December

1999

This document was created with Win2PDF available at http://www.daneprairie.com.The unregistered version of Win2PDF is for evaluation or non-commercial use only.

http://www.daneprairie.com

Automated Speech Driven Lipsynch Facial Animation for...

Documents

Transcript of Automated Speech Driven Lipsynch Facial Animation for...