Automated Speech Driven Lipsynch Facial
Animation for Turkish
by
Zeki Melek
B.S. in Computer Engineering
Boğaziçi University, 1996
Submitted to the Institute for Graduate Studies in
Science and Engineering in partial fulfillment of
the requirements for the degree of
Master of Science
in
Computer Engineering
Boğaziçi University
1999
AUTOMATED SPEECH DRIVEN
LIPSYNCH FACIAL ANIMATION
FOR TURKISH
APPROVED BY:
Assoc. Prof. Dr. Lale Akarun ................................................
(Thesis Supervisor)
Assoc. Prof. Dr. Levent Arslan ................................................
Prof. Dr. Fikret Gürgen ................................................
DATE OF APPROVAL ................................................
ABSTRACT
Talking three-dimensional (3D) synthetic faces are now used in many applications
involving human-computer interaction. The lip-synchronization of the faces is mostly done
mechanically by computer animators. Although there is some work done on automated lip-
synchronized facial animation, these studies are mostly based on text input. In our work we
used speech in Turkish as the input to generate lip-synchronized facial animation. The speaker's
recorded voice is converted into lip-shape classes and applied to the 3D model. The voice is
analyzed and classified using a training set. Lip animation is facilitated by activating facial
muscles and the jaw. Facial muscles are modeled onto our facial model. For more realistic
facial animation, facial tissue is modeled as well, and the interactions between epidermis,
subcutaneous layer and bone are taken into account. High-speed, natural-looking lip-synchronized facial animation is achieved. A real-time version of the engine is also
implemented.
ÖZET
Talking three-dimensional human models are used increasingly often in human-computer
interaction as well as in animation. Lip synchronization of speech with a three-dimensional
face model is usually a long and mechanical process carried out by graphics animators.
Various studies have been conducted on automated lip-synchronized facial animation, but
such studies are mostly text based. In this work we used speech as the input. The speaker's
recorded voice is converted into lip movements on a given three-dimensional face model.
To do this, the recorded speech is analyzed, compared with a training set, and classified
into lip movements. In our face model, lip movements are produced using the facial
muscles and the jaw. The facial muscles were modeled onto our three-dimensional face
model, taking the physical muscle structure of the human face into account. For realistic
facial animation, the skin, fat, muscle and bone layers that make up the human face were
also modeled, and the interactions between them were computed. Natural-looking
animation can be produced quite rapidly. A trimmed-down animation engine that runs in
real time has also been built.
TABLE OF CONTENTS
1. INTRODUCTION ____________________________________________________ 1
2. SPEECH PROCESSING ______________________________________________ 4
2.1. Background_____________________________________________________________ 4
2.2. Cepstral Analysis ________________________________________________________ 8
2.3. Mel Cepstrum __________________________________________________________ 12
2.4. Hamming Window ______________________________________________________ 15
2.5. Classification___________________________________________________________ 16
2.5.1. Training Set ______________________________________________________________ 18
2.5.2. Nearest Neighbor Classifier __________________________________________________ 20
2.5.3. Fuzzy Nearest Neighbor Classifier_____________________________________________ 21
2.5.4. Parametric Classifier _______________________________________________________ 22
2.5.5. Tree Classifier ____________________________________________________________ 23
2.6. Error Correction _______________________________________________________ 25
3. FACIAL ANIMATION_______________________________________________ 27
3.1. Background____________________________________________________________ 27
3.2. Facial Muscles__________________________________________________________ 30
3.3. Facial Tissue ___________________________________________________________ 35
3.4. Moving The Jaw ________________________________________________________ 40
3.5. Automated Eyeblink_____________________________________________________ 41
4. LIPSYNCH FACIAL ANIMATION (AGU) ______________________________ 42
4.1. System Overview _______________________________________________________ 42
4.2. Optimizing Performance _________________________________________________ 45
4.2.1. Multilevel Caching_________________________________________________________ 47
5. REAL-TIME LIPSYNCH FACIAL ANIMATION (RT_AGU) _______________ 48
5.1. System Overview _______________________________________________________ 48
5.2. Keyframe Interpolation __________________________________________________ 50
6. CONCLUSIONS ____________________________________________________ 51
APPENDIX A. NAVIGATION HIGHLIGHTS _______________________________ 53
REFERENCES _________________________________________________________ 56
LIST OF FIGURES
Figure 1.1. Overview of AGU _______________________________________________ 2
Figure 2.1.1. Overview of a statistical speech recognizer__________________________ 6
Figure 2.2.1. Block diagram of cepstrum analysis ______________________________ 11
Figure 2.3.1. The mel scale ________________________________________________ 13
Figure 2.3.2. Mel filter bins ________________________________________________ 14
Figure 2.3.3. Block diagram of mel-cepstral analysis____________________________ 14
Figure 2.4.1. Hamming filter _______________________________________________ 15
Figure 2.4.2. Block diagram of mel-cepstral analysis (hamming filter applied)________ 15
Figure 2.5.1. Turkish viseme classes _________________________________________ 17
Figure 2.5.1.1. Training sample for "e"_______________________________________ 18
Figure 2.5.5.1. Block Diagram of the Tree Classifier ____________________________ 24
Figure 2.6.1. Applying median filter for error correction _________________________ 25
Figure 2.6.2. Frames 4 and 8 are misclassifications_____________________________ 25
Figure 3.1.1. Traditional style lip-synchronization ______________________________ 29
Figure 3.2.1. Facial Muscles _______________________________________________ 31
Figure 3.2.2. Effect zones of a linear muscle___________________________________ 32
Figure 3.2.3. Linear muscle ________________________________________________ 33
Figure 3.2.4. Sphincter muscle _____________________________________________ 34
Figure 3.3.1. Skin layers __________________________________________________ 35
Figure 3.3.2. Simple skin implementation (tension net)___________________________ 36
Figure 3.3.3. Deformable lattice implementation of the skin ______________________ 37
Figure 3.3.4. Effect of muscle activation on skin layers __________________________ 38
Figure 3.3.5. Saying "o" without and with orbicularis oris, and with facial layers ______ 38
Figure 3.4.1. Jaw definition torus ___________________________________________ 40
Figure 4.1.1. Subunits of AGU system ________________________________________ 43
Figure 4.2.1. Hierarchy of frame requests_____________________________________ 45
Figure 4.2.1.1. Multilevel cache structure_____________________________________ 47
Figure 5.1.1. Overview of RT_AGU sub-units__________________________________ 48
LIST OF TABLES
Table 2.5.2-1. Performance of the nearest neighbor classifier _____________________ 20
Table 2.5.3-1. Performance of the fuzzy NN classifier ___________________________ 21
Table 2.5.4-1. Performance of the parametric classifier__________________________ 22
Table 2.5.5-1. Performance of the tree classifier________________________________ 23
Table 6-1. Video output confusion matrix _____________________________________ 51
1. INTRODUCTION
In recent years there has been considerable interest in computer-based three-dimensional
(3D) facial character animation. The human face is interesting and challenging
because of its familiarity. Animation of most synthetic objects is readily accepted, but
when it comes to facial animation, we humans tend to be critical and cannot tolerate
unnatural-looking details. Realistically animating a speaking face is one of the hardest
animation tasks [1]. Talking 3D synthetic faces are now used in many applications
involving human-computer interaction.
In traditional animation, synchronization between the drawn or synthetic images
and the speech track is usually achieved through the tedious process of reading the
prerecorded speech track to find the frame times of significant speech events [2],[3],[4].
Key frames with corresponding mouth positions and expressions are then either drawn or
rendered to match these key speech events. For a more realistic correspondence, a live
actor is filmed or videotaped while speaking. These recorded frames guide either
traditional or computer animators to obtain the correct mouth positions for each key frame
or even for each frame of the animation (rotoscoping). Both methods require a large
amount of time and resources, and most of that time is spent mechanically matching facial
key points.
To automate this task, various methods have been proposed. Some methods use text-based
generation of both synthetic faces and synthetic speech [5]. Since generating natural
synthetic speech is not always possible, it is often better to use a voice actor instead.
Text-based synthetic facial animation also faces the synchronization problem. A fully
speech-driven facial animation tool is the ultimate solution to this hard task. Proposed
solutions generally require a large amount of computing power to process speech data, and
none of the existing engines is based on the Turkish language. One engine uses text as
input [6],[7]. Another work uses speech input and a codebook-based face point trajectory
method [8]. Our goal is to create an automated Turkish-based, or better, language-independent
real-time speech-driven facial animation system.
Our system, the Automated speech driven Graphical facial animation Unit (AGU),
consists of two units (Figure 1.1): a speech processing unit and a facial animation
unit.
Figure 1.1. Overview of AGU
In the speech processing phase, the speech is divided into frames; a feature vector is
computed for each frame, and each frame is classified into one of eight lip-shape classes
using pretrained data. The classified lip-shape classes are sent to the facial animation unit,
which deforms the 3D face accordingly. For each frame, a new screen shot is created and
displayed on the screen or saved to a file.
The speech processing unit is summarized in Chapter II, where our feature vector is
analyzed. Different classification schemes are compared and a tree classifier is proposed.
The nature of the speech allows us to create a very efficient error correction routine, as
illustrated in Chapter II.
In Chapter III, facial animation is described. After a summary of the background,
physically based facial animation is covered in detail: facial muscles and facial tissue
layers are explained, along with our implementation of them. Automated eye blink is
briefly mentioned, and jaw movement is covered in the last subsection.
Chapter IV describes the implemented system. Our system overview is covered in
detail, and the ins and outs of each subunit are investigated. Some optimization
techniques are examined in the last subsection.
Chapter V is about the real-time implementation of our engine. Even though the off-line
version performs very fast, a real-time version requires some simplifications as well
as some optimizations that are not necessary in the off-line version. The real-time
system overview and the performance issues are covered throughout this chapter.
The last chapter evaluates our work and proposes some future work. Possible uses
of this work are also covered in this chapter.
2. SPEECH PROCESSING
2.1. Background
Speech recognition is the task of mapping from a digitally encoded acoustic signal to
a string of words. Speech recognition systems or algorithms are generally classified as
small, medium or large vocabulary. We would expect the performance and speed of a
particular recognizer to degrade with increasing vocabulary size. Another classification is
isolated word recognizers versus continuous speech recognizers. Isolated word recognizers
are trained with discrete renderings of speech units. In the recognition phase, it is assumed
that the speaker deliberately utters sentences with sufficiently long pauses between words.
When the vocabulary size is large, isolated word recognizers need to be specially
constructed and trained using subword models. Further, if sentences composed of isolated
words are to be recognized, the performance can be enhanced by exploiting probabilistic
relationships among words in the sentences. On the other hand, the most complex
recognition systems are those which perform continuous speech recognition, in which the
user speaks in a relatively unconstrained manner. First, the recognizer must be capable of
dealing with unknown temporal boundaries in the acoustic signal. Second, the recognizer
must be capable of performing well in the presence of all the coarticulation effects and
sloppy articulation that accompany flowing speech. Continuous speech recognizers require
language processing tools. These tools attempt to recognize a large pattern by
decomposing it into small subpatterns according to rules, thereby reducing entropy. Lexical
rules and other subword knowledge are used to recognize words from smaller units
(below-word-level processing). The recognition of a sentence benefits from superword
knowledge that yields word-ordering information (above-word-level processing). The
usual case of linguistic processing is the more general one in which linguistic rules are
applied both above and below the word level. Most systems employed in practical
applications are of the small-vocabulary isolated-word type. All
perform significantly better if required to recognize only a single speaker who trains the
system [9],[10].
Human languages use a limited repertoire of about 40 or 50 sounds, called phones.
One of the major problems in speech recognition is the existence of homophones: different
words that sound the same. This is a problem in English; in Turkish it is not. To solve the
problem, linguistic processing tools are used.
Sound is an analog energy source. The sampling rate is the frequency with which
we look at the signal, and the quantization factor determines the precision to which the
energy at each sampling point is recorded. Even with a low sampling rate and quantization
factor, speech requires a large amount of data to analyze.
The first step in coming up with a better presentation for the signal is to group the
samples together into larger blocks called frames. This makes it possible to analyze the
whole frame for the appearance of some speech phenomena. Within each frame, the sound
is represented with a feature vector. Typical features are number of zero crossings, or
energy of the frame, etc. Recognition is done using this feature vector, calculated for each
frame.
Current speech recognition systems are firmly based on the principles of statistical
pattern recognition. The basic methods of applying these principles to the problem of
speech recognition were pioneered by Baker [11], Jelinek [12], and their colleagues from
IBM in the 1970s, and little has changed since (Figure 2.1.1).
The utterance consists of a sequence of words, W, and it is the job of the recognizer
to determine the most probable word sequence, W’, given the observed acoustic signal Y.
To do this, Bayes’ rule is used to decompose the required probability P(W|Y) into two
components, that is,
W' = \arg\max_W P(W \mid Y) = \arg\max_W \frac{P(W)\, P(Y \mid W)}{P(Y)} \qquad (2.1)
[Figure: the speech waveform is parametrized by a front end into the vector sequence Y; acoustic models with a pronouncing dictionary produce phone hypotheses (e.g. "th ih s ih z s p iy ch" for "this is speech"), and a language model combines P(W) · P(Y|W) to yield the word sequence W.]

Figure 2.1.1. Overview of a statistical speech recognizer
P(W) represents the a priori probability of observing W independent of the
observed signal, and this probability is determined by a language model. P(Y|W)
represents the probability of observing the vector sequence Y given some specified word
sequence W, and this probability is determined by an acoustic model. A very popular
implementation of the acoustic model uses Hidden Markov Models (HMMs) [10],[13]. For
each phone there is a corresponding statistical model, an HMM. The sequence of HMMs
needed to represent the postulated utterance is concatenated to form a single composite
model, and the probability of that model generating the observed sequence Y is
calculated. Each individual phone is represented by an HMM, which consists of a number
of states connected by arcs. HMM phone models typically have three emitting states and a
simple left-right topology. The entry and exit states are provided to make it easy to join
models together. The exit state of one phone model can be merged with the entry state of
another to form a composite HMM. This allows phone models to be joined together to
form words, and words to be joined together to form complete utterances. An HMM is a
finite state machine that changes state once every time unit, and each time a state is
entered an acoustic vector is generated with some probability density. Furthermore, the
transitions are also probabilistic. The joint probability of an acoustic vector sequence is
calculated simply as the
repeated for all possible word sequences with the most likely sequence (the sequence with
the highest combined probability) selected as the recognizer output. HMM parameters
must be estimated from data, and it will never be possible to obtain sufficient data to cover
all possible contexts. Because of that problem, the language model must be able to deal
with word sequences for which no examples occur in the training data. Language model
probability distributions can be easily calculated from text data, and are unique for every
language.
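As a toy illustration of the HMM scoring described above, the sketch below computes the joint probability of one state path as the product of transition and output probabilities. It assumes numpy; the state count and all probabilities are made-up numbers, and real acoustic models emit continuous feature vectors with Gaussian densities rather than the discrete symbols used here.

```python
import numpy as np

# Toy 3-state left-right HMM with discrete output symbols.
trans = np.array([[0.6, 0.4, 0.0],   # P(next state | current state)
                  [0.0, 0.7, 0.3],
                  [0.0, 0.0, 1.0]])
emit = np.array([[0.8, 0.2],         # P(output symbol | state)
                 [0.3, 0.7],
                 [0.5, 0.5]])

def path_probability(states, symbols):
    """Joint probability of one state path and the observed symbols:
    the product of transition and output probabilities along the path."""
    p = emit[states[0], symbols[0]]
    for t in range(1, len(states)):
        p *= trans[states[t - 1], states[t]] * emit[states[t], symbols[t]]
    return p

# One left-right pass through the three states; a recognizer would repeat
# this for all candidate word sequences and keep the most probable one.
print(path_probability([0, 1, 2], [0, 1, 1]))  # 0.8*0.4*0.7*0.3*0.5 = 0.0336
```

A full recognizer sums (or maximizes, as in Viterbi decoding) over all state paths rather than scoring a single one.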
2.2. Cepstral Analysis
For many years Linear Prediction (LP) analysis has been the most popular method
for extracting spectral information from the speech signal. Contributing to this popularity
is the enormous amount of theoretical and applied research on the technique, which has
resulted in very well-understood properties and many efficient and readily available
algorithms. For speech recognition, the LP parameters are a very useful spectral
representation of the speech because they represent a smoothed version of the spectrum
that has been resolved from the model excitation. However, LP analysis is not without
drawbacks.
LP analysis does not resolve vocal-tract characteristics from the glottal dynamics.
Since these laryngeal characteristics vary from person to person, and even between
utterances of the same words by the same person, the LP parameters convey some
information to a speech recognizer that degrades performance, particularly for a
speaker-independent system. In the 1980s, researchers began to improve upon the LP
parameters with a cepstral
technique. Much of the impetus for this conversion seems to have been a paper by Davis
and Mermelstein [14], which compared a number of parametric representations and found
the cepstral method to outperform the raw LP parameters in monosyllable word
recognition. In fact, the most successful technique in this study was a cepstral technique,
the mel cepstrum, which is not based on LP analysis, but rather a filter bank analysis. We
will deal with mel cepstrum in section 2.3.
According to the usual model, speech is composed of an excitation sequence
convolved with the impulse response of the vocal system model. Since we only have
access to the output, eliminating one of the two combined signals is a very difficult
problem. If the two pieces were combined linearly, a linear operation (the Fourier
transform) would allow us to examine the component sequences individually. Because the
individual parts of voiced speech, the vocal tract articulation and the pseudoperiodic
source, are composed in a convolved combination, our linear operator, the Fourier
transform, cannot separate them. But
in the cepstrum, the representatives of the component signals will be separated and
linearly combined. If the purpose is to assess some properties of the component signals,
the cepstrum itself might be sufficient to provide the needed information. However, if the
purpose is to eliminate one of the component signals, we are able to use linear filters to
remove the undesired cepstral components, since the representatives of the component
signals are linearly combined.
The cepstrum, or cepstral coefficients, c(τ), is defined as the inverse Fourier transform
of the short-time logarithmic amplitude spectrum log|X(ω)|. The term cepstrum is essentially a
coined word which includes the meaning of the inverse transform of the spectrum. The
independent parameter for the cepstrum is called quefrency, which is obviously formed
from the word frequency. Since the cepstrum is the inverse transform of the frequency
domain function, the quefrency becomes a time-domain parameter. The special feature of
the cepstrum is that it allows for the separate representation of the spectral envelope and
the fine structure.
Voiced speech x(t) can be regarded as the response of the vocal tract articulation
equivalent filter, which is driven by a pseudoperiodic source g(t). Then x(t) is given
by the convolution of g(t) and the vocal tract impulse response h(t) as

x(t) = \int_0^t g(\tau)\, h(t - \tau)\, d\tau \qquad (2.2)
which is equivalent to

X(\omega) = G(\omega)\, H(\omega) \qquad (2.3)
where X(ω), G(ω), and H(ω) are the Fourier transforms of x(t), g(t), and h(t)
respectively.
If g(t) is a periodic function, |X(ω)| is represented by line spectra, the frequency
intervals of which are the reciprocal of the fundamental period of g(t). Therefore, when
|X(ω)| is calculated by the Fourier transform of a sampled time sequence for a short speech
wave period, it exhibits sharp peaks with equal intervals along the frequency axis. Its
logarithm log|X(ω)| is
\log|X(\omega)| = \log|G(\omega)| + \log|H(\omega)| \qquad (2.4)
The cepstrum, which is the inverse Fourier transform of log|X(ω)|, is
c(\tau) = F^{-1}\log|X(\omega)| = F^{-1}\log|G(\omega)| + F^{-1}\log|H(\omega)| \qquad (2.5)
where F is the Fourier transform. The first and second terms on the right side of
equation 2.4 correspond to the spectral fine structure and the spectral envelope,
respectively. The former is the periodic pattern, and the latter is the global pattern along
the frequency axis.
Principally, the first term on the right side of equation 2.5 indicates the formation
of a peak in the high-quefrency region, and the second represents a concentration in the
low-quefrency region. The Fourier transform of the low-quefrency elements produces the
logarithmic spectral envelope, from which the linear spectral envelope can be obtained
through the exponential transform. The maximum order of low-quefrency elements used
for the transform determines the smoothness of the spectral envelope. The process of
separating the cepstral elements into these two factors is called liftering, a term derived
from filtering. When the cepstrum value is calculated by the DFT, it is necessary to set the
base value of the transform, N, large enough to eliminate aliasing. The process steps for
extracting the spectral envelope using the cepstral method are given in Figure 2.2.1.
[Figure: sampled sequence → window → |DFT| → log → IDFT → cepstral window (liftering). The low-quefrency elements pass through a DFT to give the spectral envelope; the high-quefrency elements go through peak extraction to give the fundamental period.]

Figure 2.2.1. Block diagram of cepstrum analysis
The first 13 quefrency elements are used as the 13-dimensional feature vector. The
feature vector is self-normalized by normalizing the first coefficient to 1. As the window
length we selected 10 milliseconds, with windows overlapping by 5 milliseconds.
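The cepstral feature extraction described above can be sketched as follows. This is a minimal illustration assuming numpy; the 440 Hz test tone and the `real_cepstrum` helper name are ours, not part of the thesis implementation.

```python
import numpy as np

def real_cepstrum(frame, n_coeffs=13):
    """IDFT of the short-time log magnitude spectrum (Figure 2.2.1).
    Keeping only the low-quefrency elements is the liftering step that
    retains the spectral envelope."""
    spectrum = np.fft.rfft(frame)
    log_mag = np.log(np.abs(spectrum) + 1e-10)   # small offset avoids log(0)
    cepstrum = np.fft.irfft(log_mag)
    return cepstrum[:n_coeffs]

# A 10 ms frame at 22050 Hz is 220 samples; a 440 Hz tone stands in for speech.
frame = np.sin(2 * np.pi * 440 * np.arange(220) / 22050)
features = real_cepstrum(frame)
print(features.shape)  # (13,)
```

Overlapping frames would be taken every 5 milliseconds, so consecutive feature vectors share half their samples.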
2.3. Mel Cepstrum
Pure cepstral coefficients are not satisfactory for classification; an improvement can
be obtained by integrating perceptual information. The “mel-based cepstrum”, or simply
“mel cepstrum”, comes out of the field of psychoacoustics. Its use has been shown
empirically to improve recognition accuracy [10].
A mel is a unit of measure of perceived pitch or frequency of a tone (similar to an
octave in music). It does not correspond linearly to the physical frequency of the tone, as
the human auditory system apparently does not perceive pitch in this linear manner. The
precise meaning of the mel scale becomes clear by examining the experiment by which it
is derived. Stevens and Volkman [15] arbitrarily chose the frequency 1000 Hz and
designated this “1000 mels”. Listeners were then asked to change the physical frequency
until the pitch they perceived was twice the reference, then 10 times, and so on; and then
half the reference, 1/10, and so on. These pitches were labeled 2000 mels, 10000 mels, and
so on; and 500 mels, 100 mels, and so on. The investigators were then able to determine a
mapping between the real frequency scale (Hz) and the perceived frequency scale (mels).
Stephens and Bate [16] showed that the pitch expressed in mels is roughly proportional to
the number of nerve cells terminating on the basilar membrane of the inner ear, counting
from the apical end to the point of maximal stimulation along the membrane.
The mapping is approximately linear below 1 kHz and logarithmic above (Figure
2.3.1), and such an approximation is usually used in speech recognition. Fant [17]
suggested the following approximation:

F_{mel} = 1000 \log_2\left(1 + \frac{F_{Hz}}{1000}\right) \qquad (2.5)
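Fant's approximation is straightforward to compute; a minimal sketch (the function name is ours):

```python
import math

def hz_to_mel(f_hz):
    """Fant's approximation: roughly linear below 1 kHz, logarithmic above."""
    return 1000.0 * math.log2(1.0 + f_hz / 1000.0)

print(hz_to_mel(1000.0))  # 1000.0 -- 1000 Hz maps to 1000 mels by construction
```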
Figure 2.3.1. The mel scale
To incorporate the perception properties of the human ear into cepstral analysis, it
is desirable to compute the cepstral parameters at frequencies that are linearly distributed
in the range 0 – 1 kHz and logarithmically distributed above 1 kHz. One approach is to
oversample the frequency axis and then select those frequency components that represent
approximately the appropriate distribution. Another method is to use filter bins (Figure
2.3.2).
The perception of a particular frequency by the auditory system is influenced by the
energy in a critical band of frequencies around it. The bandwidth of a critical band varies
with frequency, beginning at about 100 Hz for frequencies below 1 kHz, and then
increasing logarithmically above 1 kHz. Therefore, rather than simply using the mel-
distributed log magnitude frequency components, using the total log energy in critical
bands around the mel frequencies as input to the final IDFT is preferred.
Triangularly shaped mel-frequency bins are actually visualizations of these critical
bands. For each bin, the weighted sum of the log of the sample frequencies is calculated.
The weight of each sample comes from the filter bin’s magnitude at the selected frequency
position: 1 at the bin center frequency, decreasing linearly to 0 at the boundaries of the
bin. For frequencies below 1 kHz, bins are linearly spaced and of the same size. For
frequencies greater than 1 kHz, filter bins are logarithmically spaced and grow
logarithmically in width. In fact, the filter bins are of the same size and linearly spaced in
the mel-frequency domain.
Figure 2.3.2. Mel filter bins (triangular bins along the F_Hz axis, equally spaced along the F_mel axis; 1000 Hz corresponds to 1000 mels)
Since filter bins require a large number of samples, we increased our analysis
window to 20-millisecond frames, overlapping every 10 milliseconds. This gives us
512-sample frames at a 22050 Hz sampling rate, and this number of samples is enough to
work with filter bins. Experimentally, 64 filter bins in the range 0 – 8 kHz give the best
results.
[Figure: sampled sequence (N samples) → window → |DFT| (N frequencies) → K filter bins (K mel filter outputs) → log → IDFT → K mel-cepstral coefficients.]

Figure 2.3.3. Block diagram of mel-cepstral analysis
An enhancement can be achieved by using the IDCT instead of the IDFT in the last
step. This has the effect of compressing the spectral information into the lower-order
coefficients, and it also decorrelates them, allowing subsequent statistical modeling to use
diagonal covariance matrices [10].
2.4. Hamming Window
In cepstral analysis, the discontinuities at the frame edges affect the results of the
analysis. To reduce this leakage, each frame is filtered through a Hamming window. A
Hamming filter is strongest at the middle of the frame and drops down to zero at the edges
of the frame (Figure 2.4.1).
Figure 2.4.1. Hamming filter (window amplitude across the frame size)

[Figure: sampled sequence (N samples) → Hamming window → |DFT| (N frequencies) → K filter bins (K mel filter outputs) → log → IDFT → K mel-cepstral coefficients.]

Figure 2.4.2. Block diagram of mel-cepstral analysis (Hamming filter applied)
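Putting the pieces of this chapter together, the processing chain of Figure 2.4.2 might be sketched as below. This assumes numpy; the helper names, the plain DCT-II standing in for the final IDCT step, and the bin-edge rounding are our illustrative choices, not the thesis implementation.

```python
import numpy as np

def hz_to_mel(f):
    # Fant's approximation from the text
    return 1000.0 * np.log2(1.0 + f / 1000.0)

def mel_to_hz(m):
    return 1000.0 * (2.0 ** (m / 1000.0) - 1.0)

def mel_filterbank(n_bins, n_fft, fs, f_max):
    """Triangular bins, equally spaced and equally wide on the mel axis."""
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(f_max), n_bins + 2))
    idx = np.floor((n_fft + 1) * edges_hz / fs).astype(int)
    fb = np.zeros((n_bins, n_fft // 2 + 1))
    for i in range(n_bins):
        lo, mid, hi = idx[i], idx[i + 1], idx[i + 2]
        fb[i, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
        fb[i, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)
    return fb

def dct_ii(x):
    """Plain DCT-II, standing in for the final IDCT step."""
    n = len(x)
    k = np.arange(n)
    return np.array([np.sum(x * np.cos(np.pi * (k + 0.5) * m / n))
                     for m in range(n)])

def mel_cepstrum(frame, fs=22050, n_bins=64, n_coeffs=13):
    frame = frame * np.hamming(len(frame))      # suppress frame-edge effects
    power = np.abs(np.fft.rfft(frame)) ** 2     # |DFT|
    fb = mel_filterbank(n_bins, len(frame), fs, 8000)
    log_energy = np.log(fb @ power + 1e-10)     # log of the K filter outputs
    return dct_ii(log_energy)[:n_coeffs]        # K mel-cepstral coefficients

frame = np.sin(2 * np.pi * 440 * np.arange(512) / 22050)  # one 512-sample frame
coeffs = mel_cepstrum(frame)
print(coeffs.shape)  # (13,)
```

With the narrow low-frequency bins used here, some bins can be empty at this FFT resolution; the small offset inside the log keeps the computation finite, which is why this is only a sketch.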
2.5. Classification
From the point of view of visual speech perception, different phonetic sounds can be
visually observed as exhibiting the same lip shape [18]. For example, the visible mouth
movement for the phoneme p is similar to the movement for b. The phoneme classes that
can be visually discriminated are called visemes. The difference between phones with the
same viseme comes from the different position settings of the tongue and teeth. Cartoon
animators have taken advantage of this knowledge for years [3],[4]. Usually eight visemes
are sufficient to achieve realistic mouth animations (for lip-reading, however, 16 is the
minimum), and this makes our classification work a lot easier, since instead of 50 phone
classes we only have eight viseme classes to distinguish. We defined the Turkish viseme
classes as follows:
1. silence : Lips are at rest, jaw is nearly closed
2. b, m, p : Stops. Lips are pressed, jaw is tightly closed
3. c, ç, d, g, h, k, n, r, s, ş, t, y, z : Lips and jaw are slightly open
4. a, e : Lips and jaw are open
5. ı, i, l : Lips are slightly open, jaw is open but not as much as class 4
6. o, ö : Lips are rounded, jaw fully open
7. u, ü : Lips are rounded and protruding, jaw fairly open
8. f, v : Lower lip is pressing to the upper lip, jaw is closed
[Figure: eight face images, numbered 1 – 8, one per viseme class.]

Figure 2.5.1. Turkish viseme classes
Visemes that are easily distinguished visually can be hard to distinguish
acoustically, since within each viseme, phonemes with different acoustic features are
included. In order to classify speech into visemes, proper selection of acoustic features is
crucial. As our feature vector, we selected 12 mel-scale cepstral coefficients and the log
energy of each frame. These features are easy and fast to calculate, offer proper
information about vocal tract shapes, and are speaker independent.
2.5.1. Training Set
Our first major task was establishing a training set. Since there has not been much
work done on the Turkish language itself, we did not have any pre-existing, labeled
training set; we had to create one for our needs. Creating a large training set and marking
each phoneme requires a large amount of work, and is beyond the scope of our project. So
we decided to create a training set just large enough for our needs, and this choice affects
our classification procedure. Since time and resources did not permit creating a training set
large enough for training an HMM or even a neural network, we decided to use simpler
classifiers, such as nearest neighbor and parametric classifiers.
[Figure: a speech waveform segmented into the phones S, E, N, I.]

Figure 2.5.1.1. Training sample for "e"
For each phoneme, at least four utterances were recorded and carefully cleaned of
neighboring phonemes and noise (Figure 2.5.1.1) by listening to the speech and by
inspecting the speech waveform. Two sets of data from the same speaker, one for training
and one for evaluating the results, were created.
Each utterance is analyzed with 20-millisecond frames (overlapping by 10
milliseconds), and for each frame the feature vector is calculated. Throughout the same
utterance, all frames give similar feature vectors. Since not all utterances are of the same
length, this yields a different number of feature vectors for each utterance, which is not
desirable. So instead of using all frames of each utterance, some of them are selected as
representatives of the training utterance. To select a representative of the utterance, the
vector median operation is used.
Given an utterance with n frames (Utterance length is 10(n+1) milliseconds. The
frame size is 20 milliseconds, but frames are overlapped by 10 milliseconds), the set of all
feature vectors is V=(v1....vn). The vector median VM of V is defined as follows,
    VM(V) = vj  such that  j = argmin_{k=1..n} Σ_{i=1}^{n} dist(vk, vi)    (2.6)
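The representative selection can be sketched as follows, a direct O(n²) implementation of Eq. (2.6); the function name is ours.

```python
import numpy as np

def vector_median(vectors):
    """Return the member of `vectors` minimizing the summed distance
    to all other members (the representative frame of an utterance)."""
    V = np.asarray(vectors, dtype=float)
    # Pairwise Euclidean distances between all feature vectors.
    dists = np.linalg.norm(V[:, None, :] - V[None, :, :], axis=2)
    return V[np.argmin(dists.sum(axis=1))]
```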
After calculating the representatives of all utterances, we have our training data for the classifiers. The simplest is the nonparametric nearest neighbor classifier.
2.5.2. Nearest Neighbor Classifier
Since our viseme classes include phonemes with different properties, the classes are not directly linearly separable. A nearest neighbor classifier does not require a large training set; it can work even with very small sets of training data. The idea is to assign the test frame to the mouth shape of the nearest feature vector [19]. All features are normalized to the same range (mean 0, variance 1), and the distance measure is the simple Euclidean distance. For a test frame F with feature vector (v1...vn), the distance to a sample frame S with feature vector (s1...sn) is calculated as
    d(F, S) = sqrt( (v1 - s1)^2 + ... + (vn - sn)^2 )    (2.7)
To enhance the classifier performance, the three nearest neighbors are found, and the classification output is selected by voting.
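A minimal sketch of the voting 3-nearest-neighbor rule described above; the function and variable names are ours.

```python
import numpy as np
from collections import Counter

def classify_3nn(test_vec, train_vecs, train_labels):
    """Assign the test frame to the lip-shape class chosen by a vote
    among its three nearest (Euclidean) training samples."""
    d = np.linalg.norm(np.asarray(train_vecs, float) - np.asarray(test_vec, float),
                       axis=1)
    nearest = np.argsort(d)[:3]                    # indices of 3 closest samples
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]
```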
Table 2.5.2-1. Performance of the nearest neighbor classifier (67 per cent overall, entries in per cent)

            silence  bmp  cdsty  ae  ii  oö  uü  fv
  silence      96     0     2    0   0   0   0   1
  bmp           7    37    41    0   0   1   3   8
  cdsty         0    15    61    0   3   1   1  15
  ae            0     0     4   84   0   0   0  10
  ii            1     0    32    0  43  15   2   3
  oö            0     0     7    0   0  81   8   2
  uü            0     1     4    0   0   5  83   4
  fv           10    21    16    0   0   0   2  62
2.5.3. Fuzzy Nearest Neighbor Classifier
For some classes, the nearest neighbor classifier is not good enough. One enhancement is to use fuzzy class distances instead of simple distances. The fuzzy class distance of a test frame F with features (v1...vn) to a classification class c with samples (S1...Sm), Sk having the feature vector (sk1...skn), is calculated as
    fd(F, c) = 1 / Σ_{k=1}^{m} ( 1 / d(F, Sk) )^{1/(f-1)} ,   1 < f < 2    (2.8)
The class with the minimum fuzzy distance is selected as the classification output.
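The fuzzy distance rule can be sketched as follows. The formula follows our reading of Eq. (2.8), with a small guard added for exact matches; names are illustrative.

```python
import numpy as np

def fuzzy_class_distance(test_vec, class_samples, f=1.5):
    """Fuzzy distance of a frame to one class: the reciprocal of the
    summed inverse sample distances, each raised to 1/(f-1). Samples
    close to the frame dominate the sum and shrink the distance."""
    d = np.linalg.norm(np.asarray(class_samples, float) - np.asarray(test_vec, float),
                       axis=1)
    d = np.maximum(d, 1e-12)                # guard against exact matches
    return 1.0 / np.sum((1.0 / d) ** (1.0 / (f - 1.0)))

def classify_fuzzy(test_vec, classes):
    """classes: dict mapping class name -> list of training samples."""
    return min(classes, key=lambda c: fuzzy_class_distance(test_vec, classes[c]))
```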
Table 2.5.3-1. Performance of the fuzzy NN classifier (72 per cent overall, entries in per cent)

            silence  bmp  cdsty  ae  ii  oö  uü  fv
  silence      97     0     1    0   0   0   0   1
  bmp           8    43    39    0   0   1   0   7
  cdsty         2    13    68    1   2   0   1  10
  ae            0     0     4   82   0   0   0  12
  ii            3     0    29    0  48  15   0   0
  oö            0     0     0    0   0  92   7   0
  uü            0     0     5    0   0   3  90   0
  fv           24    18    16    0   0   0   2  37
2.5.4. Parametric Classifier
Nonparametric classifiers use the training data directly and make no assumptions about its distribution. Parametric classifiers, on the other hand, assume a distribution model and estimate the parameters of this distribution. Since our viseme classes contain many phonemes with different properties, a direct implementation of a parametric classifier on the viseme classes is not useful. Instead of modeling the viseme class distributions, we assume a Gaussian distribution for each phone class separately and estimate its parameters. Some phone classes are easily separable with this technique (such as the silence class, whose energy is distinctly lower than that of the other classes).
All classes are assumed to have normal density N(µi, σi); µi is calculated as the mean of class i, and σi from the standard deviations of each dimension (feature) of class i. (Note that the features are normalized to overall mean zero and variance one; but since only a subset of the samples belongs to class i, class i has a nonzero mean and a variance different from one.)
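A sketch of the per-class diagonal Gaussian classifier described above, deciding by maximum log-likelihood; the class API is illustrative, not the thesis code.

```python
import numpy as np

class GaussianClassifier:
    """Fit a mean and per-feature standard deviation for each phone class,
    then classify a frame by the highest diagonal-Gaussian log-likelihood."""
    def fit(self, vectors, labels):
        vectors, labels = np.asarray(vectors, float), np.asarray(labels)
        self.params = {}
        for c in np.unique(labels):
            X = vectors[labels == c]
            self.params[c] = (X.mean(axis=0), X.std(axis=0) + 1e-6)
        return self

    def predict(self, v):
        def loglik(c):
            mu, sigma = self.params[c]
            # log N(v; mu, diag(sigma^2)) up to a shared constant
            return -np.sum(np.log(sigma)) - 0.5 * np.sum(((v - mu) / sigma) ** 2)
        return max(self.params, key=loglik)
```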
Table 2.5.4-1. Performance of the parametric classifier (66 per cent overall, entries in per cent)

            silence  bmp  cdsty  ae  ii  oö  uü  fv
  silence      77     0    22    0   0   0   0   0
  bmp           6    33    38    0   0   1   0  20
  cdsty         2    16    67    1   3   0   0   8
  ae            0     0     2   88   0   2   1   4
  ii            7     2    25    0  39  16   2   4
  oö            0     2     9    0  17  61   8   0
  uü            0     0     5    0   0   8  84   1
  fv            0    21     8    0   0   0   2  70
2.5.5. Tree Classifier
The nearest neighbor classifier performs well on some classes, the fuzzy nearest neighbor on others, and the parametric classifier on yet others. It is therefore better to build a tree classifier. A simple tree classifier might run all three algorithms and vote; a more sophisticated one can be created by integrating the decisions and the per-class success rates of each algorithm.
Figure 2.5.5.1 illustrates the structure of the tree classifier. For example, if algorithm A has a success rate of 80 per cent on class i and assigns the frame to class i, class i gets 0.80 points. If algorithm B has a success rate of 50 per cent on class j and assigns the frame to class j, class j gets 0.50 points. If algorithm C has a 10 per cent success rate on class j and also classifies the frame as class j, class j gets an additional 0.10 points. A simple voting mechanism would select class j, but since algorithm C succeeds so rarely on class j, it has most probably misclassified the frame. A better decision is made by comparing the total points collected from the classifiers: class i now has more points, because algorithm A is very reliable on class i, so the classification output is class i.
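The weighted voting described in this example can be sketched as follows; the dict-based interface is ours.

```python
def tree_classify(frame_votes, success_rates):
    """Combine per-algorithm decisions, weighting each vote by the
    algorithm's measured success rate on the class it voted for.
    frame_votes: dict algorithm -> predicted class
    success_rates: dict (algorithm, class) -> rate in [0, 1]"""
    points = {}
    for algo, cls in frame_votes.items():
        points[cls] = points.get(cls, 0.0) + success_rates.get((algo, cls), 0.0)
    return max(points, key=points.get)
```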
Table 2.5.5-1. Performance of the tree classifier (76 per cent overall, entries in per cent)

            silence  bmp  cdsty  ae  ii  oö  uü  fv
  silence      97     0     1    0   0   0   0   1
  bmp           8    44    38    0   0   1   0   7
  cdsty         2     7    74    1   1   1   2  10
  ae            0     0     4   93   0   0   0   2
  ii            3     4    23    0  50  15   0   0
  oö            0     0     0    0   0  92   7   0
  uü            0     0     5    0   0   3  90   0
  fv            0     2    32    0   0   0   2  62
To measure the success rates of the algorithms, one part of the test set is classified, and the rates are used as the per-class weights of the classification algorithms; the rest of the test set is used to evaluate the tree classifier. The class weights are roughly those shown in Sections 2.5.2, 2.5.3 and 2.5.4 (Table 2.5.2-1, Table 2.5.3-1, Table 2.5.4-1). The tree classifier performs better than any of the individual classifiers.
Figure 2.5.5.1. Block diagram of the tree classifier: the training set is split into two halves; the 3NN, fuzzy NN and parametric classifiers are trained on the first half and their per-class success rates are measured on the second half; a test frame is then classified by all three and the weighted results are combined into a lip-shape class.
2.6. Error Correction
One property of speech is very useful for correcting misclassifications: to produce a sound, the lips must remain in roughly the same position for some time, so an isolated lip-shape is a potential error. A median filter, applied over a time window to the activation of each lip-shape class, detects and corrects such misclassifications.
Figure 2.6.1. Applying the median filter for error correction (activation of the b,m,p and f,v classes over time, before and after the median)
This error correction routine has one drawback, which appears with fast speech. As speech becomes faster, the lips move very quickly, and some genuine lip-shapes can be caught as misclassifications and get lost. In fast speech many phones blend into each other, so skipping a lip-shape might produce even worse-looking animation, but that is the cost of the error correction.
Figure 2.6.2. A sequence of ten classified frames; frames 4 and 8 are misclassifications
Another implementation might use a Gaussian filter instead of the median on each class separately. In practice, applying Gaussian filters to each class individually has the same effect as applying the median filter.
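The isolated-shape correction can be sketched as a majority filter over the label track, which behaves like a median filter applied to each class's binary activation when the window length is odd; names and the window default are illustrative.

```python
def median_filter_labels(labels, window=5):
    """Replace each lip-shape label by the majority label in a centered
    window, removing isolated (one-frame) misclassifications."""
    half = window // 2
    out = []
    for i in range(len(labels)):
        w = labels[max(0, i - half): i + half + 1]   # window, clipped at edges
        out.append(max(set(w), key=w.count))         # majority label
    return out
```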
3. FACIAL ANIMATION
3.1. Background
In recent years there has been considerable interest in computer-based three-
dimensional (3D) facial character animation. The human face is interesting and challenging
because of its familiarity. Animating synthetic objects is acceptable most of the time, but when it comes to facial animation, we humans tend to criticize and cannot tolerate unnatural-looking details.
The first computer-generated facial animations were done by F. Parke at the University of Utah in the early 70s [20]. In 1974, Parke created the first parametric facial model [21]. In 1978, Ekman and Friesen created the Facial Action Coding System (FACS) [22], which became the basis of much of the subsequent facial animation work. In that system, facial action is analyzed and broken down into smaller units called Action Units (AU); each AU represents the action of an individual muscle, or of a small group of muscles, as a single recognizable facial posture. In 1980, Platt [5] and in 1987, Waters [23] published their studies on physically based, muscle-controlled facial animation. In 1987, Lewis [24] and in 1988, Hill [25] made the first studies on automated facial animation. Especially in the second half of the 80s, computer-generated short films with animated speech were made. The Academy Award winning short film Tin Toy, produced by Pixar in 1988, was one of the first films to use facial animation as part of the story. Facial animation and speech are now frequently used in animated films, and facial tissue and muscles are modeled for natural-looking facial animation [24].
There are at least five fundamental approaches to facial animation. These
approaches are:
1. interpolation: perhaps the most widely used of the techniques. In its simplest form it corresponds to the key-framing approach of conventional animation [3], [4]: the desired facial expression is specified for a certain point in time (a keyframe) and again for another point some number of frames later, and a computer algorithm generates the rest of the frames (the in-betweens) between these keyframes.
2. performance-driven: real human actions are measured to drive synthetic characters. Data from interactive input devices such as data gloves, instrumented body suits, and laser- or video-based motion-tracking systems drive the animation.
3. direct parameterization: a set of parameters is used to define facial conformation and to control facial expressions. Directly parameterized models use local region interpolations, geometric transformations, and mapping techniques to manipulate the features of the face [21].
4. pseudomuscle-based: muscle actions are simulated with geometric deformation operators; facial tissue dynamics are not simulated. These techniques include abstract muscle actions and freeform deformations.
5. muscle-based: a mass-and-spring model simulates the facial muscles. Two types of muscles are implemented: linear muscles that pull, and sphincter muscles that squeeze. Instead of expressions or operators, muscles are parameterized and key-framed [23].
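The in-betweening of approach 1 can be sketched as plain linear interpolation of facial parameters between two keyframes; the parameter-vector representation is an assumption for illustration.

```python
def inbetweens(key_a, key_b, n_frames):
    """Generate n_frames in-between frames by linearly interpolating
    each facial parameter between two keyframes (a minimal sketch)."""
    frames = []
    for f in range(1, n_frames + 1):
        t = f / (n_frames + 1)                  # interpolation factor, 0 < t < 1
        frames.append([a + t * (b - a) for a, b in zip(key_a, key_b)])
    return frames
```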
Lip-synchronization is mostly done manually by animators who analyze the sound track [26]. The pre-recorded speech track is read to find the frame times of significant speech events (Figure 3.1.1). Key frames with corresponding mouth positions and expressions are then drawn to match these speech events. For a more realistic correspondence, a live actor is filmed or videotaped while speaking, and the recorded frames are rotoscoped to obtain the correct mouth positions for each frame or for each key frame. With these techniques the speech track is created first, and the animation images are created to match the speech.
Figure 3.1.1. Traditional style lip-synchronization (phoneme segments S-E-N-I marked on the speech track)
Some work has been done on automating the process, but it is mostly based on text input [27]. With these systems, lip-synchronization is finished by animators after the recording. Another way of synchronizing speech with the generated images is to use synthetic speech created from the same textual input [24]. The only previous work for Turkish also uses text as input [6], [7]. These systems require the speech track to match the created images. Computer-based speech animation, however, allows a third possibility: creating the speech and the images simultaneously, which is the work of this thesis.
3.2. Facial Muscles
In the general sense, muscles are the organs of motion. By their contractions, they move the various parts of the body (parts of the face, in our case). The energy of their contraction is made mechanically effective by means of tendons, aponeuroses, and fasciae, which secure the ends of the muscles and control the direction of their pull. Muscles are usually suspended between two moving parts, such as two bones, two different areas of skin, or two organs. Muscles contract actively; their relaxation is passive and comes about through lack of stimulation. A muscle is usually supplied by one or more nerves that carry the stimulating impulse and thereby cause it to contract.
When a muscle is suspended between two parts, one fixed and the other movable, the attachment of the muscle on the fixed part is known as the origin, and the attachment to the movable part is referred to as the insertion. The muscles of facial expression are superficial, and all attach to a layer of subcutaneous fat and skin at their insertion. Some of the muscles attach to skin at both the origin and the insertion, such as the orbicularis oris (Figure 3.2.4). The muscles of facial expression work synergistically, not independently. Three types of muscle can be discerned as the primary motion muscles:
1. linear/parallel muscles: they pull in an angular direction, such as the zygomatic major.
2. elliptical/circular sphincter muscles: they squeeze, such as the orbicularis oris.
3. sheet muscles: they behave as a series of linear muscles spread over an area, such as the frontalis.
For natural-looking facial animation, lip-shape classes are converted into facial muscle activations. Ten linear muscles and one sphincter muscle are modeled around the lips [2].
Figure 3.2.1. Facial Muscles
Muscles modeled around the lips and their effects are:
1. Left and right Zygomatic Major : At the edge of the upper lip.
2. Left and right Angular Depressor : At the edge of the lower lip.
3. Left and right Labi Nasi : Pulls the upper lip.
4. Left and right Inner Labi Nasi : Pulls the inner part of the upper lip.
5. Left and right Depressor : Pushes the lower lip to the front and back. Used to produce
sounds “f” and “v”.
6. Orbicularis Oris : sphincter muscle; tightens the lips and pushes them forward. Used to produce the sounds "o", "ö", "u" and "ü".
Muscles are the principal motivators of facial expression: when a muscle contracts, it attempts to draw its attachments together. For facial muscles, this action usually involves drawing the skin towards the point of skeletal attachment.
The effect range of each muscle is predefined and decreases nonlinearly in an effect
zone. Linear muscles have a bone and a skin attachment and follow the major direction of
the muscle vector. Whereas real muscle consists of many individual fibers, this model
assumes a single direction and attachment. With this simplifying assumption, an individual
muscle can be described with the direction and magnitude in three dimensions; the
direction is toward a point of attachment on the bone, and the magnitude of the
displacement depends upon the muscle spring constant and the tension created by a muscle
contraction. There is no displacement at the attachment to the bone, but a maximum
deflection occurs at the point of insertion into the skin. The surrounding skin is contracted
toward the static node of attachment on the bone, until, at a finite distance away, the force
dissipates to zero. Linear muscles have two sectors with different displacement effects.
Figure 3.2.2. Effect zones of a linear muscle
The displacement caused by the linear muscle is calculated according to the
formula below.
Figure 3.2.3. Linear muscle
    p' = p + cos(α) k r (pv1 / ||pv1||)    (3.1)

where pv1 is the vector from node p to the bone attachment v1, α is the angle between the muscle vector and pv1, D = ||pv1||, and

    r = { cos(1 - D/Rs)                for p in zone A (D ≤ Rs)
          cos((D - Rs) / (Rf - Rs))    for p in zone B (Rs < D ≤ Rf) }    (3.2)
where k is a fixed constant representing the elasticity of the skin. Zygomatic
Major, Angular Depressor, Labi Nasi, Inner Labi Nasi and Depressor muscles are linear
muscles.
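A sketch of the linear muscle displacement, following our reading of Eqs. (3.1)-(3.2); the sign convention (displacement toward the bone attachment v1) and the zone test are assumptions, not the thesis code.

```python
import numpy as np

def linear_muscle_displace(p, v1, v2, Rs, Rf, k=1.0):
    """Displacement of a skin node p under a linear muscle with bone
    attachment v1 and skin insertion v2. Zone A: D = ||p - v1|| <= Rs;
    zone B: Rs < D <= Rf; outside Rf the muscle has no effect."""
    p, v1, v2 = (np.asarray(x, dtype=float) for x in (p, v1, v2))
    pv = p - v1
    D = np.linalg.norm(pv)
    if D == 0.0 or D > Rf:
        return np.zeros_like(p)
    muscle = v2 - v1
    cos_a = np.dot(muscle, pv) / (np.linalg.norm(muscle) * D)  # angular falloff
    if D <= Rs:
        r = np.cos(1.0 - D / Rs)                # zone A radial falloff
    else:
        r = np.cos((D - Rs) / (Rf - Rs))        # zone B radial falloff
    return cos_a * k * r * ((v1 - p) / D)       # toward bone attachment (assumed)
```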
Unlike a linear muscle, a sphincter muscle contracts around an imaginary central point. As a result, the surface surrounding the mouth is drawn together like the tightening of the material at the top of a string bag. The displacement caused by a sphincter muscle goes to zero at the central point, as well as outside the muscle zone, and is calculated according to the distance to the central point. The Orbicularis Oris is a sphincter muscle and can be simplified to a parametric ellipsoid with a major and a minor axis.
Figure 3.2.4. Sphincter muscle (an ellipse with semi-axes lx and ly, foci f1 and f2, center c, and zones A and B; node p is displaced to p')
    p' = p + k r (pc / ||pc||)    (3.3)

where pc is the vector from node p to the central point c, and

    r = { d / s          for p in zone A (d ≤ s/2)
          (s - d) / s    for p in zone B (s/2 < d ≤ s) }    (3.4)

    d = sqrt( ly^2 px^2 + lx^2 py^2 ) / (lx ly)    (3.5)

where (px, py) are the coordinates of p relative to the center, and lx, ly are the semi-axes of the ellipse,
where k is a fixed constant representing the elasticity of the skin and s is the
decreasing rate of the force along the muscle axis.
3.3. Facial Tissue
The skin covers the entire surface of the human form and is a highly specialized interface between the body and its surroundings. It has a multicomponent microstructure, the basis of which is intertwined networks of collagen, nerve fibers, small blood vessels, and lymphatics, covered by a layer of epithelium and pierced at intervals by hairs and the ducts of sweat glands (Figure 3.3.1).
Figure 3.3.1. Skin layers (epidermis, dermis and hypodermis, with hair and subcutaneous adipose tissue)
Human skin has a layered structure consisting of:
1. epidermis: a superficial layer of dead cells composed of keratin.
2. dermis: irregular, moderately dense, soft connective tissue.
3. subcutaneous tissue: adipose tissue distributed in a network of connective fibers, mostly collagen arranged in a lattice with fat cells. Beneath this superficial fascia lies the deep fascia, which coats the bones.
This layered structure of skin is nonhomogeneous and anisotropic. These properties were described as early as 1861 by Langer [2], who made observations on many cadavers.
Figure 3.3.2. Simple skin implementation (tension net)
The simplest approach to skin tissue emulation is a collection of springs connected in a network, or tension net (Figure 3.3.2) [28]. In this model the skin is represented as a warped plane of skin nodes, each connected to its neighbors by arcs. The arcs have elastic material properties that make them behave like springs, where the extension Δx is proportional to the force divided by the spring constant s:

    F = s Δx    (3.6)
Forces are generated by synthetic muscles. This force causes a displacement of the
muscle node. The force is then reflected along all arcs adjacent to this node; these reflected
forces are then applied to their corresponding adjacent nodes. In this way, the applied force
is propagated out from the initial node across the face. This approach has a distinct
advantage over a purely geometric technique because the displacement of one node can
influence all the other nodes in the surface. Consequently, muscle forces blend together,
providing a unified approach to facial expression modeling. Furthermore, the inherent
nature of springs helps to maintain some geometric integrity allowing the surface to dimple
and bulge, which is characteristic of facial tissue. One drawback of this model is that it
consists of only one layer, and does not take sublayer interactions into account.
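The tension-net force computation, Eq. (3.6) applied along every arc, can be sketched as follows; the edge-list representation is our assumption.

```python
import numpy as np

def spring_forces(nodes, edges, rest_lengths, s=1.0):
    """Net spring force on every node of a tension net.
    nodes: (n, 3) positions; edges: list of (i, j) index pairs;
    rest_lengths: natural length of each arc."""
    F = np.zeros_like(nodes)
    for (i, j), L0 in zip(edges, rest_lengths):
        d = nodes[j] - nodes[i]
        length = np.linalg.norm(d)
        if length == 0.0:
            continue
        f = s * (length - L0) * (d / length)   # pulls the arc toward rest length
        F[i] += f
        F[j] -= f
    return F
```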
Figure 3.3.3. Deformable lattice implementation of the skin
To include facial layers and their interactions, a variation to the basic tension net, a
deformable lattice structure can be defined for the face (Figure 3.3.3). Springs are arranged
into layers of tetrahedral elements cross-strutted with springs to resist shearing and twisting
stresses. The springs in the three layers have different stiffness parameters in accordance
with the nonhomogeneity of real facial tissue. The topmost surface of the lattice represents
the epidermis, and the spring stiffnesses are set to make it moderately resistant to
deformation. The springs underneath the epidermis form the dermis. The springs in the
second layer are highly deformable, reflecting the nature of subcutaneous fatty tissue.
Nodes on the bottom-most surface of the lattice are fixed onto the bone surface.
To create this topology, the procedure starts with a triangular facial mesh whose nodes and springs represent the epidermis [23]. First, normal vectors are projected from the center of gravity of each triangle into the face to establish the subcutaneous nodes, and tetrahedral dermal elements are formed by connecting them to the epidermal nodes with dermal springs. Second, subcutaneous elements are formed by attaching short, weak springs from the subcutaneous nodes down to muscle-layer nodes. Third, the muscle layer is added, whose lower nodes are constrained, anchoring them in bone. Finally, the muscle fibers are inserted through the muscle layer, from their emergence in bone to their attachments at muscle-layer nodes.
Figure 3.3.4. Effect of muscle activation on skin layers
Muscles that apply forces to the tissue run through the second layer of the synthetic tissue. The displacement of node j in the fascia layer from x_j to x'_j due to muscle contraction is the sum of the m muscle activities acting on node j:

    x'_j = x_j + Σ_{i=1}^{m} m_i(x_j)    (3.7)

where m_i(x_j) is the displacement caused by muscle i on node j, calculated according to the type and rate of contraction of muscle i as stated in Section 3.2.
Figure 3.3.5. Saying "o" without and with the orbicularis oris, and with facial layers
Once all the muscle interactions have been computed, the nodes subject to muscle actions are displaced to their new positions. As a result, the nodes in the fatty, dermal, and epidermal layers that are not directly influenced by muscle contractions are left in an unstable state, and unbalanced forces propagate through the lattice to establish a new equilibrium position. To reach this equilibrium, a time-step simulation method is used [2]: only a fraction of the total displacement is applied to the nodes in each step, and new displacements are calculated from this new unbalanced state; displacements below some threshold are set to zero. Applying only a fraction of the total displacement takes care of oscillations. A set of nodes may otherwise enter an oscillating equilibrium, where the change of state forces the nodes back to the first state, this unbalanced state again forces them to move, which again creates enough force to push them back, and so on. Applying only a fraction of the displacement in each time step lets the system settle into equilibrium in between. In the first pass, unbalanced nodes are tagged and their total displacement is calculated; in the second pass, a fraction of the total displacement is applied to the unbalanced nodes. The process repeats until every node is balanced or every unbalanced node's displacement is under some threshold. An unbalanced node applies displacement to all its neighbors, except static (bone) nodes.
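The time-step relaxation can be sketched as follows; the step fraction, threshold, and iteration cap are illustrative values, and `displacement_fn` stands in for the lattice force calculation.

```python
import numpy as np

def relax(positions, displacement_fn, static, step=0.3, eps=1e-4, max_iter=500):
    """Iteratively apply a fraction of each node's desired displacement
    until all displacements fall below eps (damps oscillating equilibria).
    displacement_fn(positions) -> (n, 3) desired displacements;
    static: boolean mask of bone nodes that never move."""
    pos = positions.copy()
    for _ in range(max_iter):
        disp = displacement_fn(pos)
        disp[static] = 0.0                          # bone nodes stay fixed
        if np.max(np.linalg.norm(disp, axis=1)) < eps:
            break                                   # lattice is balanced
        pos += step * disp                          # only a fraction per pass
    return pos
```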
3.4. Moving The Jaw
Jaw rotation is necessary for the mouth to assume its various speech and expression postures. This is achieved by rotating the bone nodes of the lower part of the face about a jaw pivot axis; the unbalanced upper-layer nodes are then displaced to reach an equilibrium. To mark the bone nodes to be rotated with the jaw, a half torus is defined as the jaw effect region.
Figure 3.4.1. Jaw definition torus (angles α and β, radius r)
Figure 3.4.1 shows the parameters of the jaw definition torus: α is the start angle and β the end angle of the effect region, and r is the radius of the torus. The torus is placed near the ears, just between the lips, such that all vertices of the lower face, including the lower lip, are inside the effect region. Jaw movement is specified by a rotation angle γ, and all bone nodes inside the jaw definition torus rotate by γ.
3.5. Automated Eyeblink
Synthetic blinking is an important characteristic that should be included in face models used in conversational interfaces. Speaker eye blinks are an important part of speech response systems that include synchronized facial visualizations. Eye blinks not only accentuate speech but also address the physical need to keep the eyes wet. The structure of an eye blink is synchronized with speech articulation, and it is also emotion dependent: during fear, tension, anger, excitement, and lying the amount of blinking increases, while it decreases during concentrated thought.
A simple eye-blinking model is based on the on-off characteristics of the speaker's voice. A better model follows the pauses in the speech, where stopping sounds such as "m", "b" or "p" also count as pauses. There is a slight delay between speech pauses and eye blinks, about 0.5 seconds. To add some randomness to the eye blinks, we select this delay between 0.4 and 0.6 seconds [3], [4].
A lesson learned from traditional animators is that nothing related to humans is absolutely symmetric; Disney animators never blink both eyes at the same time. We implement this idea in our automated eye-blink model: one eye blinks with a small delay (0.1-0.2 seconds) after the other, which increases realism.
Eyelids are modeled like the jaw (Section 3.4): an eyelid torus is defined, and the nodes inside it are rotated as the eye blinks.
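The blink scheduling described above can be sketched as follows; the delay ranges are taken from the text, while `pause_times` (detected pause instants, in seconds) and the function name are assumptions.

```python
import random

def blink_times(pause_times, seed=None):
    """Schedule eye blinks after speech pauses: each blink lags the pause
    by 0.4-0.6 s, and the second eye lags the first by 0.1-0.2 s."""
    rng = random.Random(seed)
    blinks = []
    for t in pause_times:
        first = t + rng.uniform(0.4, 0.6)       # first eye closes
        second = first + rng.uniform(0.1, 0.2)  # second eye, slightly delayed
        blinks.append((first, second))
    return blinks
```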
4. LIPSYNCH FACIAL ANIMATION (AGU)
4.1. System Overview
For our speech-driven facial animation system, we used K. Waters' OpenGL implementation of some basic muscle-driven facial expressions as a starting point [29]. His implementation focuses on facial expressions and the linear muscles related to them. To do lip-synchronized animation, we implemented ten linear muscles around the mouth and one sphincter muscle on the same face. Jaw movement is parameterized and made face independent so that it can be used with other face topologies (we do not yet have another face model). Facial tissue is created as stated in the previous chapters, and muscle activations are resolved with the compute-intensive time-step simulation algorithm.
For speech analysis, we used 12 mel-scale cepstral coefficients and the log energy of each frame. Frames are 20 milliseconds long, overlap by 10 milliseconds, and are rounded to a total of 512 samples, since we work at a 22050 Hz sampling rate. Each frame to be analyzed is Hamming-windowed to suppress the effects of the previous and next frames. A single-speaker training set is used to classify the frames: a tree classifier (explained in detail in Section 2.5.5) assigns a class, and the classification results are median-filtered to remove misclassifications. The resulting sequence, one class per frame, is sent to the animation engine. Figure 4.1.1 shows the structure of AGU.
Figure 4.1.1. Subunits of the AGU system. The speech processor (file I/O, Hamming windowing, feature extraction, tree classifier with its training set, error correction) turns recorded speech frames into lip-shape classes; the animation engine (lip-shape-class-to-muscle-activation mapping, boundary interpolation, eyeblink detection, skin layer calculation, display subunit) turns them into deformed skin and the animation file.
Class information is converted into muscle activations through a library mapping each lip-shape class to a muscle activation list. Muscle activations are blended at the class boundaries using Gaussian filters. The size of the boundary, the time to go from one lip-shape to the next, is fixed at 10 frames (0.4 seconds on a 25 frames-per-second (fps) system), centered at the boundary. For short-duration lip-shapes, the effect is that the lip-shape is blended between the previous and the next lip-shape, but is not discarded. The blended muscle activations are applied to the facial lattice, and unstable states are solved with the time-step simulation method. When the system reaches equilibrium, the epidermal nodes and the connections between them are sent to the OpenGL display engine. The display window is saved as a set of numbered bitmap files, to be combined with the sound track into the animation.
Files and tables used for the system are:
1. face topology
2. muscle topology
3. muscle activations and jaw rotations for facial expressions and lip-shapes
4. tissue topology
5. training set for the speech classifier
6. jaw topology
7. eyelid topology
8. teeth topology
9. recorded speech file: currently only 22050 Hz mono files are supported
4.2. Optimizing Performance
Our system works on a frame-by-frame basis, so sequential processing is not necessary: for any given recorded speech file, we can request any frame and get the desired output frame. This design is definitely slower than a sequential one, but allows more degrees of freedom; in particular, the nonsequential design makes it easy to convert this stand-alone animation system into a module (plugin) for a commercial animation engine. The drawback is that error correction, lip-shape boundary detection, and even eyeblink detection require knowledge of previous frames; reprocessing those frames loses time and degrades performance. To shorten this reprocessing time, a multilevel cache structure is implemented.
Figure 4.2.1. Hierarchy of frame requests: a display frame (25 fps) requests animation frames, which request lip-shape frames (boundary detection, error correction), which request classification frames (100 fps).
To display animation frame X on a 25 fps video output (the standard PAL rate), we need to look at four speech frames, since we classify at a rate of 100 fps; for smoother animation these four frames should be blended. To display frames correctly, the end of the previous lip-shape class and the start of the next one must be known, so the speech must be processed back and forth to detect the previous and the next lip-shape classes. The error corrector also requires a window of frames to detect potential misclassifications. Without caching, all of this forces the same frames to be reprocessed; even when processing each frame sequentially, caches save us from redoing a lot of work.
4.2.1. Multilevel Caching
Our frame-by-frame design allows each subunit to work independently of the others: each subunit takes an input and, after processing, passes its output to the next subunit. This modularity lets us put caches at important processing points to enhance performance. Within this design, each subunit requests its input from its input cache and sends its output to its output cache. Instead of subunits requesting processing from the lower-level subunits directly, the cache passes the request down only if it cannot fulfill it itself, that is, if its entry for the requested frame is empty.
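The cache-fronted subunit chain can be sketched as follows; the `CachedStage` API is ours, not the AGU code.

```python
class CachedStage:
    """A processing subunit fronted by a frame cache: a request for frame n
    is served from the cache when present, otherwise computed from the
    lower-level stage and stored for later requests."""
    def __init__(self, compute, source=None):
        self.compute = compute    # function(frame_index, upstream_input) -> output
        self.source = source      # lower-level CachedStage, or None
        self.cache = {}

    def get(self, n):
        if n not in self.cache:
            upstream = self.source.get(n) if self.source else None
            self.cache[n] = self.compute(n, upstream)
        return self.cache[n]
```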
Figure 4.2.1.1. Multilevel cache structure: the speech file feeds speech classification, error correction, facial animation and the display subsystem, with caches holding the classified speech frames, corrected speech frames, animation frames and display frame between the stages.
5. REAL-TIME LIPSYNCH FACIAL ANIMATION (RT_AGU)
5.1. System Overview
Figure 5.1.1. Overview of the RT_AGU subunits. Microphone capture feeds a ping-pong buffer; frames pass through Hamming windowing, feature extraction, the tree classifier and error correction. At initialization, a face for each lip-shape is precalculated from the muscle activations and skin layer calculation; at run time the displayed face is produced by weighted interpolation between these precalculated faces.
First of all, instead of reading a prerecorded speech file, the system must be capable
of capturing speech directly from a microphone. This is achieved easily using ping-pong
buffers. Each buffer is the size of one frame: when one buffer is full, it is processed and
displayed, and this must finish before the other buffer fills. To increase speed, frames are
captured sequentially rather than overlapping. This increases the error rate, but since the
classification rate drops from 100 fps to 50 fps, the processing speed doubles.
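The ping-pong capture loop can be sketched as follows (a simplified Python simulation; the frame size and function names are illustrative, and a real implementation would fill one buffer from the sound card while the other is processed):

```python
FRAME = 4  # samples per frame (tiny for illustration; the real frame is larger)

def ping_pong(samples, process):
    """Alternate between two frame-sized buffers: fill one sample by sample,
    hand it off for processing when full, then continue in the other."""
    buffers = [[], []]
    active = 0                              # buffer currently being filled
    for s in samples:
        buffers[active].append(s)
        if len(buffers[active]) == FRAME:   # buffer full:
            process(buffers[active])        # hand it to the speech processor
            buffers[active] = []
            active = 1 - active             # swap; capture continues in the other

frames = []
ping_pong(range(12), frames.append)
# frames are consecutive and non-overlapping, as in the real-time engine
```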
Some optimizations are required for real-time performance. First of all, since
everything works sequentially, cache management can be optimized: cache size and
management become simpler. Another change concerns lip-shape boundaries. In the AGU
engine, lip-shape boundaries are centered between two lip-shape classes, but this scheme
cannot be used in a real-time engine, since by the time the speech reaches a
lip-shape class boundary, it is too late to change the displayed frames. The new scheme
triggers the displayed lip-shape change as soon as the lip-shape changes in the speech. This
raises a problem with error correction: how can we be sure that the incoming lip-shape
class is not a misclassification? This is handled by adding a delay of a few frames, long
enough to confirm that the newly classified lip-shape is correct. A total delay of five
frames (0.2 seconds at 25 fps) is sufficient, and unavoidable.
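The delayed-commit idea can be sketched as follows (a minimal illustrative scheme: the confirmation rule used here, requiring the whole look-ahead window to agree before committing a change, is an assumption rather than the exact rule used in RT_AGU):

```python
from collections import deque

DELAY = 5  # frames of look-ahead (about 0.2 s at 25 fps, as described above)

class DelayedCorrector:
    """Hold the last DELAY classified frames; commit a lip-shape change only
    after the whole look-ahead window agrees, so single-frame
    misclassifications never reach the display."""

    def __init__(self):
        self.window = deque(maxlen=DELAY)
        self.current = "silence"

    def push(self, lip_class):
        self.window.append(lip_class)
        if len(self.window) == DELAY and len(set(self.window)) == 1:
            self.current = self.window[0]   # confirmed: commit the change
        return self.current                 # class shown on screen (delayed)
```

A single glitched frame inside an otherwise stable run is absorbed by the window and never displayed, at the cost of the fixed five-frame latency.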
To speed things up further, a faster but less accurate classifier, such as the
parametric classifier, can be used. Since the parametric classifier does not have to measure
the distance to all the training samples, it works much faster than the other classifiers,
including the tree classifier. The trade-off is decreased classification performance: the
parametric classifier makes more misclassifications (Table 2.5.4-1 versus Table 2.5.5-1).
Most of the time is spent processing the facial tissue layers. An enormous speed
increase is achieved by redefining this process: instead of recalculating the facial tissue for
each frame, precalculated key frames are interpolated.
5.2. Keyframe Interpolation
The time-step simulation algorithm (Section 3.3) used for solving the unbalanced
force distribution between the facial layers takes a lot of time. This method shows its
advantages between lip-shape classes, where its results are more realistic; but to achieve
real-time performance, some of that realism must be sacrificed.
Instead of resolving force distributions on each frame, we calculate them once for each
lip-shape class. Using this precalculated face data, a table of lip-shape class faces is filled.
For each frame, the contribution of each lip-shape class is found, and this information is
used to interpolate the desired display frame as a weighted sum. An eight-dimensional
vector I = {I1, I2, ..., I8} is passed to the interpolation subunit, where each parameter Ik is
the contribution (weight) of lip-shape class k to the current display frame. Each vertex
point Vi of the face is then computed as

Vi = Σ(c=1..8) Ic · Yci    (5.1)

where Yci is the vertex corresponding to Vi on the face of lip-shape class c.
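Equation (5.1) translates directly into code. A minimal Python sketch, using made-up two-vertex faces (the real faces have many more vertices, and the function name is illustrative):

```python
def interpolate_face(weights, keyframes):
    """Eq. (5.1): each displayed vertex V_i is the weighted sum of the
    corresponding vertices Y_ci of the eight precalculated lip-shape faces."""
    n_verts = len(keyframes[0])
    face = []
    for i in range(n_verts):
        # blend vertex i across all lip-shape class faces, axis by axis
        v = tuple(sum(w * kf[i][axis] for w, kf in zip(weights, keyframes))
                  for axis in range(3))
        face.append(v)
    return face

# Toy data: eight two-vertex faces; blend 50/50 between classes 0 and 1.
keyframes = [[(c, c, c), (2 * c, 0.0, 0.0)] for c in range(8)]
face = interpolate_face([0.5, 0.5, 0, 0, 0, 0, 0, 0], keyframes)
```

Since only table look-ups and weighted sums remain per frame, this replaces the expensive tissue-layer solve with a cost linear in the vertex count.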
6. CONCLUSIONS
This is the first work on automated speech-driven facial animation for Turkish. We
created a system, the Automated speech driven Graphical facial animation Unit (AGU),
capable of automatically producing very realistic lip-synchronized facial animation from
recorded Turkish speech [30], [31], [32]. Since our method uses speaker- and language-
independent analysis, it can also be used with different speakers and different languages;
so far we have tested it only on Turkish and English.
Classification is done at a rate of 100 frames per second, but video output is
generated at 25 frames per second. Since there is more than one classification frame per
video frame, their weighted average is taken for the video output, which provides another
level of error smoothing. The video confusion matrix is given in Table 6-1. The average
rate of correct classification is 92 per cent.
Table 6-1. Video output confusion matrix
          silence   bmp   cdsty   ae    ii    oö    uü    fv
silence      98      0      0      0     0     0     0     2
bmp           0     60     32      0     0     1     0     8
cdsty         2      0     78      0     0     0     0    10
ae            0      0      0    100     0     0     0     0
ii            0      0     20      0    80     0     0     0
oö            0      0      0      0     0    92     8     0
uü            0      0      2      0     0     0    98     0
fv            0      0     30      0     0     0     0    70
Table 6-1 was created subjectively, by watching the video output of AGU. The
classification success at the video output is naturally higher, because the most common
errors occur between visually close classes and can be tolerated when watching the visual
output. The ultimate test would be to have our results checked by a lip-reader (lip-reading
requires 16 lip-shapes, but eight are enough for realistic animation purposes).
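The per-video-frame averaging described above can be sketched as follows (Python; treating the relative counts of the four 100 fps classifications inside each 25 fps video frame as blending weights is an illustrative assumption):

```python
# Lip-shape classes as used in Table 6-1.
CLASSES = ["silence", "bmp", "cdsty", "ae", "ii", "oö", "uü", "fv"]

def video_weights(classified, per_video=4):
    """Collapse 100 fps classifications into 25 fps weight vectors: the
    relative count of each class inside a video frame becomes its weight,
    smoothing isolated misclassifications."""
    out = []
    for start in range(0, len(classified), per_video):
        chunk = classified[start:start + per_video]
        out.append([chunk.count(c) / len(chunk) for c in CLASSES])
    return out

weights = video_weights(["ae", "ae", "ae", "ii", "oö", "oö", "oö", "oö"])
```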
Short-time analysis enables us to create a real-time system. The selected features
are also speaker independent. Since no language-processing tools are used, the system is
language independent; but since its training set was created from Turkish speech, the
system performs best on Turkish.
Using physically based tissue layers gives natural-looking results, best seen on
sounds where the jaw is open (such as o, ö, u, ü) (Figure 3.3.5).
The results are promising, but the system cannot deal with faster speech because of
its error correction routines. Another issue is the face model. We used the face model
created at DEC by K. Waters [29]; to apply the system to other face models, more
parameterization is needed.
There is parallel work on integrating our speech-driven lip-shape classifier
module into a commercial 3D animation program (3D MAX).
Our system is fast, speaker and language independent, and does not require large
resources (real-time operation is achieved even on an ordinary PC with a Pentium
processor and 32 MB of RAM). It is a very useful tool for animators, since it provides
real-time feedback. It can also be used in video conferencing applications, using a model
for each speaker and transferring only the speech, not the video. Another very useful
application might be a new interface for human-computer interaction systems.
APPENDIX A. NAVIGATION HIGHLIGHTS
At initialization the interpolation keyframe faces are processed. As the process continues in the background, you can rotate the face, change shading, and change lighting, but you cannot start real-time display or .wav file processing, since the intermediate data is not processed yet. This takes a few seconds, depending on your computer's power.
Display Window Menu Window
Speed mode [default] is faster but makes more errors. For more accurate lip movement select the other mode, which is much slower but more suitable for applications such as .bmp output.
If you have a microphone ready, after selecting microphone start you can watch what you are saying in real time. To end the process, select microphone stop.
During display you can still change display settings.
You have to select a .wav file with a 22050 Hz mono sampling rate. The file will then be processed and output as a numbered bitmap sequence. IMPORTANT: During the process, the output window should be on top.
model with polygonal patches
model with texture mapping
wireframe model with muscles visible
smooth shading of the model [default]
Inside the display window, you can rotate the face [default] at any time by clicking the left mouse button and moving the mouse. By selecting light move you switch to light-move mode and can move the light as stated before. Selecting light move again toggles this mode.
The head will rotate randomly by itself when activated. Selecting it a second time toggles this mode.
REFERENCES
[1]. G. Maestri, Digital Character Animation, New Riders, 1996.
[2]. F.I. Parke, K. Waters, Computer Facial Animation, AK Peters, 1996.
[3]. F. Thomas, O. Johnston, Illusion of Life, Disney Animation, Hyperion, 1981.
[4]. T. White, The Animators Workbook, Watson Guptill Publications, 1988.
[5]. S.M. Platt, “A System for Computer Simulation of the Human Face”, The Moore
School, University of Pennsylvania, Philadelphia, 1980.
[6]. B. Uz, U. Güdükbay, B. Özgüç, “Realistic Speech Animation of Synthetic Faces”,
Computer Animation'98, IEEE Computer Society Publications, Philadelphia, 1998.
[7]. B. Uz, “Realistic Speech Animation of Synthetic Faces”, M.S. Thesis, Bilkent
University, Department of Computer Engineering and Information Science, June 1999.
[8]. L. Arslan, D. Talkin, “Codebook Based Face Point Trajectory Synthesis Algorithm
Using Speech Input”, Elsevier Science, 953, 01-13, December 1998.
[9]. J.R. Deller, J.G. Proakis, J.H.L. Hansen, Discrete Time Processing of Speech
Signals, Macmillan Publishing Company, 1993.
[10]. S. Young, “A Review of Large-vocabulary Continuous-speech Recognition”, IEEE
Signal Processing, 45-57, September 1996.
[11]. J.K. Baker, “Stochastic Modeling for Automatic Speech Understanding”, Speech
Recognition, New York, Academic Press 521-542, 1975.
[12]. F. Jelinek, “Continuous Speech Recognition by Statistical Methods”, Proceedings
of the IEEE, vol. 64, 532-556, April 1976.
[13]. S. Russel, P. Norvig, Artificial Intelligence A Modern Approach, Prentice Hall
International, 1995.
[14]. S.B. Davis, P. Mermelstein, “Comparison of Parametric Representations for
Monosyllabic Word Recognition in Continuously Spoken Sentences”, IEEE
Transactions on Acoustics, Speech, and Signal Processing, vol 28, 357-366, August
1980.
[15]. S.S. Stevens, J. Volkman, “The Relation of Pitch to Frequency”, American Journal
of Psychology, vol. 53, 329, 1940.
[16]. R.W.B. Stevens, A.E. Bate, Acoustics and Vibrational Physics, New York, St.
Martins Press, 1966.
[17]. C.G.M. Fant, “Acoustic Description and Classification of Phonetic Units”, Ericsson
Technics, no 1, 1959.
[18]. S-H. Luo, R.W. King, “A Novel Approach for Classifying Continuous Speech into
Visible Mouth-Shape Related Classes”, IEEE, I-465-468, 1994.
[19]. E. Alpaydin, Lecture Notes on Statistical Pattern Recognition, 1995.
[20]. F.I. Parke, “Computer Generated Animation of Faces”, M.S. Thesis, University
of Utah, Salt Lake City, UT, UTEC-CSc-72-120, June 1972.
[21]. F.I. Parke, “A Parametric Model for Human Faces”, PhD Thesis, University of
Utah, Salt Lake City, UT, UTEC-CSc-75-047, December 1974.
[22]. P. Ekman, W.V. Friesen, Manual for Facial Action Coding System, Consulting
Psychologists Press, Inc., Palo Alto, CA, 1978.
[23]. K. Waters, “A Physical Model for Animating 3D Facial Expressions”, Computer
Graphics (SIGGRAPH ’87), 21(4), 17-24, July 1987.
[24]. J.P. Lewis, F.I. Parke, “Automatic Lip-Synch and Speech Synthesis for Character
Animation”, Proc. Graphics Interface ’87 CHI+CG ’87, 143-147, Canadian
Information Processing Society, Calgary, 1987.
[25]. Pearce, B. Wyvill, D. Hill, “Speech and Expression: A Computer Solution to Face
Animation”, Proc. Graphics Interface ’86, 136-140, Canadian Information Processing
Society, Calgary, 1986.
[26]. Robertson, “Read My Lips”, Computer Graphics World, 26-36, August 1997.
[27]. Pearce, B. Wyvill, D. Hill, “Animating Speech: An Automated Approach Using
Speech Synthesized by Rules”, Visual Computer, 176-187, 1988.
[28]. S.M. Platt, “A System for Computer Simulation of the Human Face”, M.S. Thesis,
The Moore School, University of Pennsylvania, 1980.
[29]. K. Waters, “OpenGL Implementation of the Simple Face”, December 1998,
http://www.crl.research.digital.com/publications/books/waters/Appendix1/opengl/Open
GLW95NT.html.
[30]. L. Akarun, Z. Melek, “Türkçe Ses Eszamanli Yapay Yüz Canlandirma”, Bilisim
’99, 212-217, 1999.
[31]. L. Akarun, Z. Melek, “Automated Lipsynchronized Speech Driven Facial
Animation for Turkish”, presented at the Confluence of Computer Vision and Graphics
NATO Advanced Research Workshop, Slovenia, August 1999.
[32]. L. Akarun, Z. Melek, “Automated Lipsynchronized Speech Driven Facial
Animation”, submitted to IEEE International Conference of Multimedia, December
1999.