Singing Voice Synthesis: History, Current Work, and Future Directions
Author(s): Perry R. Cook
Source: Computer Music Journal, Vol. 20, No. 3 (Autumn, 1996), pp. 38-46
Published by: The MIT Press
Stable URL: http://www.jstor.org/stable/3680822
Accessed: 12/01/2010 06:40
Perry R. Cook
Department of Computer Science and Department of Music
Princeton University
Princeton, New Jersey, USA
This article will briefly review the history of singing voice synthesis, and will highlight some currently active projects in this area. It will survey and discuss the benefits and trade-offs of using different techniques and models. Performance control, some attractions of composing with vocal models, and exciting directions for future research will be highlighted.
Basic Vocal Acoustics
The voice can be characterized as consisting of one or more sources, such as the oscillating vocal folds or turbulence noise, and a system of filters whose properties are controlled by the shape of the vocal tract. By moving various articulators, we change the ways the sources and filters behave. The spectrum of the voice is characterized by resonant peaks called formants. The location and shapes of these resonances are strong perceptual cues that humans use to differentiate and identify vowels and consonants. For a system to generate speech-like sounds, it should allow for manipulation of the resonant peaks of the spectrum, and also for manipulation of source parameters (voice pitch, noise level, etc.) independent of the resonances of the vocal tract. Voice pitch is commonly denoted as f0, and the formant frequencies are commonly denoted as f1, f2, f3, etc.
Computer Music Journal, 20:3, pp. 38-46, Fall 1996. © 1996 Massachusetts Institute of Technology.

Figure 1 shows a vocal tract cross-section forming the vowel /i/ (as in "beet"), where the quasi-periodic oscillations of the vocal folds are shaped by the resonant filter of the vocal tract tube. The spectrum of the vowel shows the harmonics of the voice source outlining the peaks and valleys of the vocal tract response. Figure 2 shows the vocal tract cross-section for forming the consonant /ʃ/ ("shh"), where the "source" is not the vocal folds, but turbulence noise formed by forcing air through a constriction. Also shown is the noise-like spectrum of the consonant, showing two principal formant peaks corresponding to the resonances of the vocal tract upstream from the noise source.
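The source-filter decomposition described above can be sketched numerically: a pulse train at an assumed pitch f0 is passed through simple two-pole resonators at assumed formant frequencies. The frequencies, bandwidths, and filter form below are illustrative choices, not values from the article; the point is only that pitch and resonant peaks are controlled independently.

```python
import numpy as np

def resonator(x, fc, bw, sr):
    """Two-pole resonant filter: one formant at center fc, bandwidth bw."""
    r = np.exp(-np.pi * bw / sr)
    a1 = -2 * r * np.cos(2 * np.pi * fc / sr)
    a2 = r * r
    y = np.zeros(len(x))
    for n in range(len(x)):
        y[n] = x[n] - a1 * (y[n - 1] if n >= 1 else 0.0) - a2 * (y[n - 2] if n >= 2 else 0.0)
    return y

def vowel(f0, formants, sr=8000, dur=0.5):
    """Pulse-train source at pitch f0 through one resonator per formant.
    Changing f0 moves the harmonics; the spectral peaks stay put."""
    source = np.zeros(int(sr * dur))
    source[::int(sr / f0)] = 1.0            # glottal-like impulse train
    out = sum(resonator(source, fc, 80.0, sr) for fc in formants)
    return out / np.max(np.abs(out))
```

Raising f0 in this sketch respaces the harmonics while the resonator peaks stay fixed, which is exactly the independence of source and filter parameters discussed above.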
A Brief History of Digital Singing (Speech) Synthesis
The earliest computer music project at Bell Labs in the late 1950s yielded a number of speech synthesis systems capable of singing, one being the acoustic tube model of Kelly and Lochbaum (1962). This model was actually an early physical model. At that time it was considered too computationally expensive for commercialization as a speech synthesizer, and too expensive to be practical for musical composition. Max Mathews worked with Kelly and Lochbaum to generate some early examples of singing synthesis (Computer Music Journal 1995; Wergo 1995).
Other techniques to arise from the early legacy of speech signal processing include the channel vocoder (VOice CODER) (Dudley 1939) and linear predictive coding (LPC) (Atal 1970; Makhoul 1975). In the vocoder, the spectrum is broken into sections called sub-bands, and the information in each sub-band is analyzed; then parameters are stored or transmitted for reconstruction at another time or site. The parametric data representing the information in each sub-band can be manipulated, yielding transformations such as pitch or time shifting, or spectral shaping. The vocoder does not strictly assume that the signal is speech, and thus generalizes to other sounds. The phase vocoder, implemented using the discrete Fourier transform, has found extensive use in computer music (Moorer 1978; Dolson 1986).
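The sub-band idea can be illustrated with a toy channel vocoder, a minimal sketch assuming rectangular FFT frames and a fixed band count (a real vocoder would use proper filter banks and overlapping windows):

```python
import numpy as np

def channel_vocoder(modulator, carrier, n_bands=16, frame=512):
    """Toy channel vocoder: split each FFT frame into sub-bands, measure
    the modulator's energy per band, and scale the carrier's bands to
    match -- the band energies are the stored/transmitted parameters."""
    n = min(len(modulator), len(carrier)) // frame * frame
    out = np.zeros(n)
    edges = np.linspace(0, frame // 2 + 1, n_bands + 1).astype(int)
    for start in range(0, n, frame):
        M = np.fft.rfft(modulator[start:start + frame])
        C = np.fft.rfft(carrier[start:start + frame])
        for b in range(n_bands):
            lo, hi = edges[b], edges[b + 1]
            e_mod = np.sqrt(np.mean(np.abs(M[lo:hi]) ** 2))
            e_car = np.sqrt(np.mean(np.abs(C[lo:hi]) ** 2)) + 1e-12
            C[lo:hi] *= e_mod / e_car     # impose the modulator's band energy
        out[start:start + frame] = np.fft.irfft(C, frame)
    return out
```

Because only band energies are kept, the parameters can be modified before reconstruction, which is the basis of the pitch-, time-, and spectrum-shaping transformations mentioned above.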
Figure 1. Vocal tract shape and spectrum of vowel /i/ (as in "beet"), showing formants and harmonics of periodic voice source.

Figure 2. Vocal tract shape (left) and spectrum (right) of consonant /ʃ/ ("shh"), showing a noisy spectrum with two formants.
The introduction of linear predictive coding (Atal 1970) revolutionized speech technology, and had a great impact on musical composition as well (Moorer 1979; Steiglitz and Lansky 1981; Lansky 1989). With LPC, a time-varying filter is automatically designed that predicts the next value of the signal, based on past samples. An error signal is produced which, if fed back through the time-varying filter, will yield exactly the original signal. The filter models linear correlations in the signal, which correspond to spectral features such as formants. The error signal models the input to the formant filter, and typically is periodic and impulsive for voiced speech, and noise-like for unvoiced speech.

The success of LPC in speech coding is largely due to the similarity between the source/filter decomposition yielded by the mathematics of linear prediction, and the source/filter model of the human vocal tract. The power of LPC as a speech compression technique (Spanias 1994) stems from its ability to parametrically code and compress the source and filter parameters. The effectiveness of LPC as a compositional tool emerges from its ability to modify the parameters before resynthesis.

There are weaknesses, however, in LPC, which are related to the assumption of linearity inherent in the filter model. Also, all spectral properties are modeled in the filter. In actuality the voice has multiple possible sources of non-linear behavior, including source-tract coupling, non-linear wall vibration losses, and aerodynamic effects. Due to these deviations from the ideal source-filter model, the result of analysis/modification/resynthesis using LPC or a sub-band vocoder often sounds "buzzy."
Cross-Synthesis and Other Compositional Attractions of Vocal Models
The compositional interest in vocal analysis/synthesis has at least three foundations. The first is rooted in the human as a linguistic organism, for it seems in the nature of humans to find interest in voice-like sounds. Any technique or device that allows independent control over pitch and spectral peaks tends to produce sounds that are vocal in nature, and such sounds catch the interest of humans. The second compositional interest in using systems that decompose sounds in a source/filter paradigm is to allow for cross-synthesis. Cross-synthesis involves the analysis of two instruments, typically a voice and a non-voice instrument, with the parameters exchanged and modified on resynthesis. This allows the resonances of the voice to be imposed on the source of a non-voice instrument. The third interest comes from the fact that once pitch and resonance structure are analyzed as they evolve in time, these three dimensions are independently available to some extent for manipulation on resynthesis. The elusive goals of being able to stretch time without changing pitch, to change pitch without changing timbral quality, etc., are all of high interest to computer music composers.
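Cross-synthesis of the kind described can be approximated crudely in the frequency domain: per frame, flatten the non-voice instrument's own spectral envelope and impose the voice's smoothed magnitude envelope instead. Frame size and smoothing width below are arbitrary illustrative choices:

```python
import numpy as np

def cross_synthesize(voice, source, frame=1024, smooth=20):
    """Crude frame-by-frame cross-synthesis: remove the source's own
    spectral envelope, then impose the voice's smoothed magnitude
    envelope, so the voice's resonances ride on the source's excitation."""
    n = min(len(voice), len(source)) // frame * frame
    out = np.zeros(n)
    kernel = np.ones(smooth) / smooth        # moving-average envelope smoother
    for s in range(0, n, frame):
        V = np.abs(np.fft.rfft(voice[s:s + frame]))
        S = np.fft.rfft(source[s:s + frame])
        env = np.convolve(V, kernel, mode='same') + 1e-12
        flat = np.convolve(np.abs(S), kernel, mode='same') + 1e-12
        out[s:s + frame] = np.fft.irfft(S * env / flat, frame)
    return out
```

A fuller system would exchange source and filter parameters symmetrically (e.g., via LPC), but even this envelope swap gives the characteristic "talking instrument" effect.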
Other Popular Synthesis Techniques
Frequency modulation (FM) proved successful for singing synthesis (Chowning 1981, 1989) as well as the synthesis of other sounds. As described in the communications literature, FM involves the modulation of the frequency of one oscillator with the output of another to create a spread spectrum consisting of side-bands surrounding the original carrier (oscillator that is modulated) frequency. In FM sound synthesis, both the carrier and modulator oscillators typically store a sinusoidal waveform, and operate in the audio band. By controlling the amount of modulation, and using multiple carrier/modulator pairs, spectra of somewhat arbitrary shape can be constructed. This technique proved efficient yet sufficiently flexible for music composition, and became the basis for the most successful commercial music synthesizers in history. In vocal modeling, carriers placed near formant locations in the spectrum are modulated by a common modulator oscillator operating at the voice fundamental frequency.
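A minimal sketch of this formant-FM arrangement follows; the carrier frequencies, amplitudes, and modulation index are illustrative assumptions, not Chowning's published values:

```python
import numpy as np

def fm_formant(f0, formants, amps, sr=22050, dur=0.5, index=1.0):
    """Formant-FM sketch: one sinusoidal carrier per formant region, all
    modulated by a common oscillator at the fundamental f0, so each
    carrier grows sidebands spaced at f0 around its formant."""
    t = np.arange(int(sr * dur)) / sr
    mod = np.sin(2 * np.pi * f0 * t)       # common modulator at voice pitch
    out = np.zeros_like(t)
    for fc, a in zip(formants, amps):
        out += a * np.sin(2 * np.pi * fc * t + index * mod)
    return out / max(amps)

# three assumed formant regions for a vowel-like tone at 110 Hz:
tone = fm_formant(110, [300, 2300, 3000], [1.0, 0.5, 0.3])
```

Because every carrier shares the same modulator, all the sideband clusters stay harmonically locked to f0, giving a pitched, vowel-like result from very few oscillators.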
Sinusoidal speech modeling (McAulay and Quatieri 1986) has been improved and applied to music synthesis by Julius Smith and Xavier Serra (Smith and Serra 1987; Serra and Smith 1990), Xavier Rodet and Philippe Depalle (1992), and others. These techniques use Fourier analysis to locate and track individual sinusoidal partials. Individual trajectories (tracks) of sinusoidal amplitude, frequency, and phase as a function of time are extracted from the time-varying peaks in a series of short-time Fourier transforms. To help define tracks, heuristics regarding physical systems and the voice in particular are used, such as the fact that a sinusoid should not appear, disappear, or change frequency or phase instantaneously. The sinusoids can be resynthesized from the track parameters, after modification or coding, by additive synthesis. Noise can be treated as rapidly varying sinusoids, or explicitly as a non-sinusoidal component.
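One analysis step of such a system, picking spectral peaks in a windowed frame and resynthesizing them additively, might look like the sketch below (peak count and window length are arbitrary; real trackers add the frame-to-frame continuity heuristics described above):

```python
import numpy as np

def peak_partials(frame, sr, n_peaks=5):
    """One analysis step of a sinusoidal model: pick the strongest local
    spectral peaks of a Hann-windowed frame as (frequency, amplitude) pairs."""
    w = np.hanning(len(frame))
    X = np.abs(np.fft.rfft(frame * w))
    peaks = [k for k in range(1, len(X) - 1) if X[k] > X[k - 1] and X[k] > X[k + 1]]
    peaks.sort(key=lambda k: -X[k])          # strongest first
    bins = peaks[:n_peaks]
    freqs = np.array(bins) * sr / len(frame)
    amps = X[bins] / (len(frame) / 4)        # rough Hann amplitude correction
    return freqs, amps

def additive(freqs, amps, sr, dur):
    """Resynthesize the tracked partials by additive synthesis."""
    t = np.arange(int(sr * dur)) / sr
    return sum(a * np.sin(2 * np.pi * f * t) for f, a in zip(freqs, amps))
```

Chaining such frames together, and linking peaks whose frequencies move smoothly, yields the amplitude/frequency/phase tracks that the text describes.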
Formant wave functions (FOFs in French) were pioneered by Xavier Rodet (1984) at the Institut de Recherche et Coordination Acoustique/Musique (IRCAM). An FOF is a time-domain waveform model of the impulse response of individual formants, characterized as a sinusoid at the formant center frequency with an amplitude that rises rapidly upon excitation and decays exponentially. By describing a spectral region as a windowed sinusoidal oscillation in the time domain, an FOF can be viewed as a special type of wavelet. The control parameters define the center frequency and bandwidth of the formant being modeled, and the rate at which the FOFs are generated and added determines the base frequency of the voice. The synthesis system for using FOFs was dubbed CHANT, and found application in general synthesis (Rodet, Potard, and Barriere 1984). Gerald Bennett and Xavier Rodet used CHANT to produce a number of impressive singing examples and compositions (Bennett and Rodet 1989).
Formant synthesizers, in which individual formants are modeled by second-order resonant filters, have been investigated by many speech researchers (Rabiner 1968; Klatt 1980). An attractive feature of formant synthesizers is that Fourier or LPC analysis can be used to automatically extract formant frequencies and source parameters from recorded speech. Charles Dodge used such techniques in a composition in 1973 (Dodge 1989). The group that has accomplished the most in the domain of singing synthesis using formant models is the Speech Transmission Laboratory (STL) of the Royal Institute of Technology (KTH), Stockholm. The STL MUSSE DIG (MUsic and Singing Synthesis Equipment, DIGital version) synthesizer (Carlson and Neovius 1990) has been used in singing synthesis (Zera, Gauffin, and Sundberg 1984), for studying performance synthesis-by-rule (Sundberg 1989), and has been adapted for real-time control in performance (Carlson et al. 1991). KTH has conducted and published extensively on speech, and has arguably produced the largest body of research on singing (Sundberg 1987) and music, both acoustics and performance. Robert C. Maher (1995) recently demonstrated singing synthesis using modified forms of the second-order resonant filter which lend themselves to parallel implementation.
Acoustic Tube Models of the Vocal Tract
Acoustic tube models solve the wave equation, usually in one dimension, inside a smoothly varying tube. The one-dimensional approximation is justified by noting that the length of the vocal tract is significantly larger than any width dimension, and thus the longitudinal modes dominate the resonance structure up to about 4,000 Hz. Modal standing waves in an acoustic tube correspond to the formants. The basic Kelly and Lochbaum model (Kelly and Lochbaum 1962) critically samples space and time by approximating the smooth vocal tract tube with cylindrical segments equal in length to the distance traveled by a sound wave in one time sample.
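The Kelly-Lochbaum structure can be sketched as a ladder of scattering junctions, one cylindrical section per sample of travel time. The boundary reflection coefficients and area values below are illustrative assumptions; with one section per sample at an assumed 8 kHz rate, a uniform 8-section closed-open tube resonates at odd multiples of 8000/(4·8) = 250 Hz, the quarter-wave series:

```python
import numpy as np

def kelly_lochbaum(areas, excitation, r_glottis=0.75, r_lips=-0.85):
    """Toy Kelly-Lochbaum ladder: one cylindrical section per sample of
    travel time; pressure waves scatter at each area discontinuity."""
    a = np.asarray(areas, dtype=float)
    k = (a[:-1] - a[1:]) / (a[:-1] + a[1:])   # junction reflection coefficients
    n = len(a)
    f = np.zeros(n)   # right-going waves (at each section's right end)
    b = np.zeros(n)   # left-going waves (at each section's left end)
    out = []
    for e in excitation:
        out.append((1 + r_lips) * f[-1])          # pressure radiated at the lips
        nf, nb = np.zeros(n), np.zeros(n)
        nf[0] = e + r_glottis * b[0]              # glottal boundary
        nb[-1] = r_lips * f[-1]                   # lip boundary
        nf[1:] = (1 + k) * f[:-1] - k * b[1:]     # scattering, rightward
        nb[:-1] = k * f[:-1] + (1 - k) * b[1:]    # scattering, leftward
        f, b = nf, nb
    return np.array(out)
```

Varying the area profile moves the resonances, which is how such a model forms different vowels; all reflection magnitudes stay below one, so the ladder is passive and stable.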
The SPASM and Singer systems (Cook 1992) are based on a physical model of the vocal tract filter, developed using the waveguide formulation (Smith 1987). This model is a direct descendant of the Kelly and Lochbaum model, but with many enhancements, such as a nasal tract, modeling of radiation through the throat wall, various steady and pulsed noise sources (Chafe 1990), and real-time controls. Shinji Maeda's (1982) model numerically integrates the wave equation using the rectangular method in space, and the trapezoidal rule in time. Wall losses are also modeled, and an articulatory layer of control modifies the basic tube shape from higher-order descriptions like tongue and jaw position. Rene Carre's (1992) model is based on distinctive regions (DR) arising from sensitivity analysis, noting that movements in particular regions of the vocal tract affect formant frequencies more than movements in others. Hill, Manzara, and Taube-Schock (1995) have implemented a synthesis-by-rule system using a model based on distinctive regions, with libraries and examples that include examples of singing synthesis. Liljencrants (1985) investigated an undersampled acoustic tube model and derived rules for modifying the shape without adding unnaturally to the energy contained within the vocal tract. The computer music research group in Helsinki (Välimäki and Karjalainen 1994) has used fractional sample interpolation and truncated conical tube segments to derive an improved version of the Kelly and Lochbaum model.
Other Active Singing Synthesis Projects
Pabon (1993) has constructed a singing synthesizer, with real-time formant control via spectrogram-like displays called phonetograms, and source waveform synthesis using FOF-like controls. Titze and Story (1993) have produced a super-computer tenor called "Pavarobotti" that sings duets with Titze, and is used for studying many aspects of the voice, including advanced physical models of normal and pathological vocal folds. Howard and Rossiter (Howard and Rossiter 1993; Rossiter and Howard 1994) have studied source parameters for more natural singing synthesis, as well as interactive singing analysis software for pedagogical applications.
Spectral Models vs. Physical Models
Synthesis models can be loosely broken into two groups: spectral models, which can be viewed as based on perceptual mechanisms, and physical models, which can be viewed as based on production mechanisms. Of the models and techniques discussed above, the spectrally based models include FM, FOFs, vocoders, and sinusoidal models. Acoustic tube models are physically based, while formant synthesizers are spectral models, but could be classified as pseudo-physical because of the source/filter decomposition. It's possible to interpret LPC three ways: as a least-squares linear prediction in the time domain, as a least-squares matching process on the spectrum, and as a source-filter decomposition. Therefore, LPC is both spectral and pseudo-physical, but not strictly a physical model because wave variables are not propagated directly, and no articulation parameters go into the basic model. Since LPC can be mapped to a filter related to the acoustic tube model (Markel and Gray 1976), it may be brought into the physical camp.
Both physical and spectral models have merit, and one or another might be more suitable given a specific goal and set of computational resources. The main attraction of physical models is that most of the control parameters are those that a human uses to control his/her own vocal system. As such, some intuition can be brought into the design and composition processes. Another motivation is that time-varying model parameters can be generated by the model itself, if the model is constructed so that it sufficiently matches the physical system. Disadvantages of physical models are that the number of control parameters can be large, and while some parameters might have intuitive significance for humans (jaw drop), others might not (specific muscles controlling the vocal folds). Further, parameters often interact in non-obvious ways. In general there exist no exact methods for analysis/resynthesis using physical models. Parameter estimation techniques have been investigated, but for physical models of reasonable complexity, especially those involving any non-linear component, identity analysis/resynthesis is a practical and often theoretical impossibility (Cook 1991b; Scavone and Cook 1994).
Model Extensions and Future Work
Work remains to be done in refining techniques for spectral analysis and synthesis of the voice. For example, a spectral envelope estimation technique like that of Galas and Xavier Rodet (1990) allows more accurate formant tracking on even high female tones, which because of the large inter-harmonic spacing have proven difficult for analysis systems in the past. There are far more directions for research to proceed in improving physical models, and source models for pseudo-physical models of the voice. Most of them involve some significant component of non-linearity, and/or higher-dimensional models. The main research areas involve modeling of airflow in the vocal tract, development of more exact models of the inner shape of the vocal tract tube, physical models of the tongue and other articulators, more accurate models of the vocal folds, and facial animation coupled to voice synthesis.
The modeling of flow is a difficult but important task, and until recently it has been confined to theoretical explorations, occasionally verified experimentally with hot-wire anemometry or other flow measurement techniques (Teager 1980). Mico Hirschberg has begun to make advances in actually photographing flow in constructed models of musical instruments and the vocal tract (Pelorson et al. 1994). These techniques, combined with classical and new theories, should yield greater understanding about air flow and how it affects vocal acoustics. Along with more exact solutions to the flow-physics problems, development of efficient means for calculating the flow simulations, allowing the inclusion of these non-linear effects in practical synthesis models, must also emerge (Chafe 1995; Verge 1995).
Constructing a physical model that includes more detailed simulations of the dynamics of the tongue and articulators would allow the model to calculate the time-varying parameters, rather than having the shape, etc., explicitly specified or calculated. Wilhelms-Tricarico (1995) has developed a set of models of soft tissue, and has used these to construct a tongue model. Such models can be calibrated from the results of articulation studies using X-ray pellets, magnetic resonance imaging, and other techniques. All of this can combine to yield models that "behave" correctly in a dynamical sense, and give a better picture of the fine structure of the space inside the vocal tract. This latter information is critical if flow simulations are to be accurate.
Vocal fold models continue to be the target of much research, and, like the case of airflow, theories are difficult to conclusively prove or disprove. More elaborate models of the vocal fold tissue are being developed (Story and Titze 1995), and theoretical and experimental studies revisiting and comparing the classic models are being conducted (Rodet 1995).
Facial animation coupled with speech synthesis is important for a number of reasons. One reason is for pedagogy, where speech synthesizers with animated displays could be used as teaching and rehabilitation tools. Another important reason involves speech perception in general, because humans use a significant amount of lip reading in understanding speech. Work has been done by Massaro (1987) and Hill, Pearce, and Wyvill (1988), employing facial animation to study coupling of visual and auditory information in human speech understanding (McGurk and MacDonald 1976). Musically, we know that the face of the singer can carry even more information about the meaning of music than the actual text being sung (Scotto Di Carlo and Guaitella 1995), further motivating the combination of facial animation with singing synthesis.
Modeling Performance
One of the distinguishing features of the voice is the continuous nature of pitch control, both intentional and uncontrolled. Research in random and periodic pitch deviations (Sundberg 1987; Chowning 1989; Ternstrom and Friberg 1989; Prame 1994; Cook 1995), and the synthesis and perception of short vibrato tones (d'Allessandro and Castellengo 1993), has provided data and models for more natural-sounding voice synthesis. On the macro scale, rule systems for vocal performance and phrasing (Berndtsson 1995), and composition (Rodet and Cointe 1984; Barriere, Iovino, and Laurson 1991) have been constructed. The Stockholm KTH rule system is available on the compact disc Information Technology and Music (KTH 1994). These important areas of research shall remain a topic for a future survey paper.
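The micro-scale pitch behavior just described, periodic vibrato plus slow random drift, can be sketched by frequency-modulating a sinusoid through phase integration. The rate and extent values below are illustrative assumptions, loosely in the ranges such studies report, not figures taken from the cited work:

```python
import numpy as np

def vibrato_tone(f0, sr=22050, dur=1.0, rate=5.5, extent=0.02, jitter=0.003):
    """Sinusoid whose instantaneous frequency carries periodic vibrato
    plus a slowly varying random drift, applied via phase integration."""
    rng = np.random.default_rng(0)
    n = int(sr * dur)
    t = np.arange(n) / sr
    # slow random drift: heavily low-passed noise (long moving average)
    drift = np.convolve(rng.standard_normal(n), np.ones(2000) / 2000, mode='same')
    freq = f0 * (1 + extent * np.sin(2 * np.pi * rate * t) + jitter * drift)
    phase = 2 * np.pi * np.cumsum(freq) / sr
    return np.sin(phase)
```

Removing either component makes the tone noticeably more mechanical, which is the perceptual point of the deviation studies cited above.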
Extended Singing and Language Systems
Investigations into non-Western and traditional (non-Bel-Canto) singing styles, traditions, and acoustics include studies of overtone singing (Bloothooft et al. 1992), traditional Scandinavian shepherd singing (Johnson, Sundberg, and Willbrand 1983), a highly structured system of funeral laments (Ross and Lehiste 1993), and even castrati singing (Depalle, Garcia, and Rodet 1994). Language systems for the SPASM/Singer instruments include an Ecclesiastical Latin system called LECTOR (Cook 1991a), and a system for modern Greek called IGDIS (Cook et al. 1993). The IGDIS system includes support for arbitrary tuning systems, and common vocal ornaments can be called up by name, allowing traditional folk songs and Byzantine chants to be synthesized quickly.
Real-Time Voice Processing and Interactive Karaoke
Recently, commercial products have been introduced that allow for real-time "smart harmonies" to be added to a vocal signal, or implement real-time score following with accompaniment. Vocoders and LPC, by virtue of being analysis/synthesis systems, allow potential for real-time modification of voice signals under the control of rules or real-time computer processes. We will soon see systems that integrate pitch detection, score following, and sophisticated voice processing algorithms into a new generation of interactive karaoke systems. This will remain a topic for a future review paper.
References
Atal,
B.
1970.
"Speech
Analysis
and
Synthesis
by
Linear
Prediction of the
Speech
Wave."
ournalof
the Acousti-
cal
Society
of
America
47:65(A).
Barriere,
J.
B.,
E
Iovino,
and
M.
Laurson.
1991.
"A
New
CHANT Synthesizerin C and its Control Environ-
ment
in
Patchwork."
n
Proceedings
of
the 1991 Inter-
national
Computer
Music
Conference.
San
Francisco,
California:International
Computer
Music
Association,
pp.
11-14.
Bennett, G.,
and X. Rodet.
1989.
"Synthesis
of the
Sing-
ing
Voice."
In
Mathews,
M. and
J. Pierce, eds.,
Current
Directions
in
Computer
Music Research.
Cambridge,
Massachusetts:
The MIT
Press,
pp.
19-44.
Berndtsson,
G.
1995,
"The KTH Rule
System
For
Singing
Synthesis."
Computer
Music
Journal
20(1):76-91.
Bloothooft, G.,
et
al.
1992.
"Acoustics
and
Perception
of
Overtone
Singing."
Journal
of
the
Acoustical
Society of
America
92(4):1827-1836.
Carlson, G.,
and
L.
Neovius.
1990.
"Implementations
of
Synthesis
Models for
Speech
and
Singing."
STL-
Quarterly
Progress
and
Status
Report.
Stockholm:
KTH,
pp. 2/3:63-67.
Carlson, G.,
et
al.
1991. "A
New
Digital System
for
Sing-
ing
Synthesis
Allowing Expressive
Control."
n
Proceed-
ings
of
the
1991
International
Computer
Music
Confer-
ence. San
Francisco,
California:International
Computer
Music
Association,
pp.
315-318.
Carre,
R.
1992.
"Distinctive
Regions
in
Acoustic Tubes."
Journal
d'Acoustique, 5(141):141-159.
Chafe,
C.
1990.
"Pulsed Noise in
Self-Sustained Oscilla-
tions of Musical Instruments." n
Proceedings
of
the
IEEE
nternational
Conference
on
Acoustics,
Speech,
and
Signal
Processing.
New York: EEE
Press,
pp.
1157-1160.
Chafe,
C.
1995.
"Adding
Vortex
Noise to Wind
Instru-
ment
Physical
Models."
In
Proceedings
of
the
1995
In-
ternational
Computer
Music
Conference.
San Fran-
cisco,
California:International
Computer
Music
Association,
pp.
57-60.
Chowning,
J.
1981,
"Computer Synthesis
of the
Singing
Voice."In
Research
Aspects
on
Singing.
Stockholm:
KTH,
pp.
4-13.
Chowning,
J.
1989.
"Frequency
Modulation
Synthesis
of
the
Singing
Voice."
In
Mathews,
M. and
J.
Pierce, eds.,
Current
Directions in
Computer
Music Research. Cam-
bridge,
Massachusetts:
The MIT
Press,
pp.
57-64.
Computer
Music
Journal.1995.
Computer
Music
Journal
Volume 19
Compact
Disc.
Cambridge,
Massachusetts:
The
MIT
Press.
Cook,
P. 1991a. "LECTOR:An EcclesiasticalLatin
Con-
trol
Language
or
the
SPASM/Singer
nstrument."
n
Proceedings
of
the 1991
International
Computer
Mu-
sic
Conference.
San
Francisco,
California:
International
Computer
Music
Association,
pp.
319-321.
Cook,
P.
1991b.
"Non-Linear
Periodic Prediction for On-
Line Identification of Oscillator Characteristics n
WoodwindInstruments." n
Proceedings
of
the
Interna-
tional
Computer
Music
Conference.
San
Francisco,
Cal-
ifornia:
International
Computer
Music
Association,
pp.
157-160.
Cook,
P.
1992. "SPASM:
A
Real-Time Vocal
Tract
Physi-
cal
Model
Editor/Controller
and
Singer:
he
Compan-
ion
Software
Synthesis System."
Computer
Music
Jour-
nal
17(1):30-44.
Cook,
P.
1995.
"A
Study
of Pitch
Deviation in
Singing
as
a Function of Pitch
and
Dynamics."
13th
International
Congressof
Phonetic Sciences.
Stockholm:
KTH,
pp.
1:202-205.
Cook, P.,
et al. 1993. "IGDIS:A ModernGreek Text to
Speech/Singing Program
or
the
SPASM/Singer
nstru-
ment."
In
Proceedings
of
the
International
Computer
Music
Conference.
San
Francisco,
California:
Interna-
tional
Computer
Music
Association,
pp.
387-389.
d'Allessandro,C.,
and M.
Castellengo.
1993. "ThePitch
of
Short-Duration
VibratoTones:
Experimental
Data
and
Numerical Model."In
Proceedingsof
the
Stock-
holm
Music Acoustics
Conference.
Stockholm:
KTH,
pp.
25-30.
Depalle,
P.,
G.
Garcia,
and
X. Rodet.
1994,
"AVirtual
Cas-
trato
( ?)"
n
Proceedings of
the
1994
International
Computer
Music
Conference.
San
Francisco,
Califor-
nia: International
Computer
Music
Association,
pp.
357-360.
Dodge,
C.
1989.
"On
Speech
Songs."
n
Mathews,
M.
and
J.
Pierce,
eds.,
Current
Directions in
Computer
Music
Research.
Cambridge,
Massachusetts: The MIT
Press,
pp.
9-18.
Dolson,
M.
1986,
"The
Phase
Vocoder:
A
Tutorial."
Com-
puter
Music
Journal
10(4):14-27.
Dudley,
H.
1939.
"The
Vocoder."
Bell
Laboratories
Rec-
ord,
December.
Galas,
T.,
and X. Rodet.
1990 "An
mprovedCepstral
Method for
Deconvolution of
Source-Filter
Systems
with
Discrete
Spectra:Application
to
Musical
Sound
Computer
Music
Journal
I
44
-
7/27/2019 Cook - singing voice synthesis
9/10
Signals."
n
Proceedings
of
the 1990 International
Computer
Music
Conference.
San
Francisco,
Califor-
nia: International
Computer
Music
Association,
pp.
82-84.
Hill, D.,
L.
Manzara,
and C. Taube-Schock.1995.
"Real-
Time
Articulatory
Speech-Synthesis-By-Rules."
AVIOS. San
Jose,
California.
Hill, D.,
A.
Pearce,
and B.
Wyvill.
1988.
'Animating
Speech:
An Automated
Approach
Using
Speech
Synthe-
sized
by
Rules."The Visual
Computer
3(5):277-289.
Howard,D.,
and D.
Rossiter. 1993.
"Real-TimeVisual
Displays
for
Use
in
Singing Training:
An
Overview."
n
Proceedings of the Stockholm Music Acoustics Confer-
ence. Stockholm:
KTH,
pp.
191-196.
Johnson, A.,
J.
Sundberg,
and
H.
Willbrand.
1983.
"K61n-
ing:
A
Study
of Phonation and
Articulation
in
a
Type
of Swedish
Herding Song."
n
Proceedings
of
the Stock-
holm Music Acoustics
Conference.
Stockholm:
KTH,
pp.
187-202.
Kelly,
J.,
and C. Lochbaum.
1962.
"Speech Synthesis" (pa-
per
G42).
In
Proceedings
of
the Fourth
International
Congress
on Acoustics.
pp.
1-4.
Klatt,
D.
1980.
"Software or a
Cascade/Parallel
Formant
Synthesizer."
Journalof
the Acoustical
Society
of
America
67(3):971-995.
KTH. 1994. Information
Technology
and Music
(a
com-
pact
disc to celebrate the
75th
anniversary
of
the
Royal
Swedish
Academy
of
EngineeringScience).
Stockholm:
KTH.
Lansky,
P. 1989.
"Compositional
Applications
of
Linear
Predictive
Coding."
n
Mathews,
M.
and
J. Pierce, eds.,
CurrentDirections
in
Computer
Music
Research. Cam-
bridge,
Massachusetts: The MIT
Press,
pp.
5-8.
Liljencrants,J.
1985.
Speech Synthesis
With a
Reflection-
Type
Line
Analog,
DS
Dissertation,
Speech
Communi-
cation and
Music
Acoustics,
Stockholm: KTH.
Maeda,
S. 1982.
"A
Digital
Simulation Method of
the Vo-
cal Tract
System." Speech
Communication 1:199-299.
Maher,
R. 1995. "Tunable
Bandpass
Filtersin Music
Syn-
thesis"
(paper
4098
L2).
In
Proceedings
of
the Audio
Engineering Society Conference.
Makhoul, J.
1975.
"LinearPrediction:A
Tutorial Re-
view."In
Proceedings
of
the IEEE
63:561-580.
Markel, J.,
and A.
Gray.
1976.
Linear
Prediction
of
Speech.
New
York:
Springer.
Massaro,
D.
1987.
Speech
Perception by
Ear and
Eye.
Hillsdale,
New
Jersey:
Erlbaum
Associates.
Mathews,
M.,
and
J.
Pierce,
eds.
1989. CurrentDirections
in
Computer
Music
Research.
Cambridge,
Massachu-
setts: The MIT Press.
McAulay, R., and T. Quatieri. 1986. "Speech Analysis/Synthesis Based on a Sinusoidal Representation." IEEE Transactions on Acoustics, Speech, and Signal Processing 34(4):744-754.

McGurk, H., and J. MacDonald. 1976. "Hearing Lips and Seeing Voices." Nature 264:746-748.
Moorer, A. 1978. "The Use of the Phase Vocoder in Computer Music Applications." Journal of the Audio Engineering Society 26(1/2):42-45.

Moorer, A. 1979. "The Use of Linear Prediction of Speech in Computer Music Applications." Journal of the Audio Engineering Society 27(3):134-140.
Pabon, P. 1993. "A Real-Time Singing Voice Synthesizer." In Proceedings of the Stockholm Music Acoustics Conference. Stockholm: KTH, pp. 288-293.

Pelorson, X., et al. 1994. "Theoretical and Experimental Study of Quasi-Steady Flow Separation Within the Glottis During Phonation. Applications to a Modified Two-Mass Model." Journal of the Acoustical Society of America 96(6):3416-3431.
Prame, E. 1994. "Measurements of the Vibrato Rate of Ten Singers." Journal of the Acoustical Society of America 96(4):1979-1984.

Rabiner, L. 1968. "Digital Formant Synthesizer." Journal of the Acoustical Society of America 43(4):822-828.
Rodet, X. 1984. "Time-Domain Formant-Wave-Function Synthesis." Computer Music Journal 8(3):9-14.

Rodet, X. 1995. "One and Two Mass Model Oscillations for Voice and Instruments." In Proceedings of the 1995 International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 207-210.

Rodet, X., and P. Cointe. 1984. "FORMES: Composition and Scheduling of Processes." Computer Music Journal 8(3):32-50.
Rodet, X., and P. Depalle. 1992. "Spectral Envelopes and Inverse FFT Synthesis" (paper 3393 H3). In Proceedings of the Audio Engineering Society Conference. New York: AES.

Rodet, X., Y. Potard, and J. B. Barriere. 1984. "The CHANT Project: From the Synthesis of the Singing Voice to Synthesis in General." Computer Music Journal 8(3):15-31.
Ross, J., and I. Lehiste. 1993. "Estonian Laments: A Study of Their Temporal Structure." In Proceedings of the Stockholm Music Acoustics Conference. Stockholm: KTH, pp. 244-248.

Rossiter, D., and D. Howard. 1994. "Voice Source and Acoustic Output Qualities for Singing Synthesis." In Proceedings of the 1994 International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 191-196.
Cook 45
Scavone, G., and P. Cook. 1994. "Combined Linear and Non-Linear Periodic Prediction in Calibrating Models of Musical Instruments to Recordings." In Proceedings of the 1994 International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 433-434.

Scotto Di Carlo, N., and I. Guaitella. 1995. "Facial Expressions in Singing." In Proceedings of the 13th International Congress of Phonetic Sciences. Stockholm: KTH, pp. 1:226-229.
Serra, X., and J. Smith. 1990. "Spectral Modeling Synthesis: A Sound Analysis/Synthesis System Based on a Deterministic plus Stochastic Decomposition." Computer Music Journal 14(4):12-24.

Smith, J. 1987. "Musical Applications of Digital Waveguides." Technical Report STAN-M-39. Stanford University Center for Computer Research in Music and Acoustics.
Smith, J., and X. Serra. 1987. "PARSHL: An Analysis/Synthesis Program for Non-Harmonic Sounds Based on a Sinusoidal Representation." In Proceedings of the 1987 International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 290-297.

Spanias, A. 1994. "Speech Coding: A Tutorial Review." Proceedings of the IEEE 82(10):1541-1582.
Steiglitz, K., and P. Lansky. 1981. "Synthesis of Timbral Families by Warped Linear Prediction." Computer Music Journal 5(3):45-49.

Story, B., and I. Titze. 1995. "Voice Simulation With a Body-Cover Model of the Vocal Folds." Journal of the Acoustical Society of America 97(2):3416-3431.
Sundberg, J. 1987. The Science of the Singing Voice. DeKalb, Illinois: Northern Illinois University Press.

Sundberg, J. 1989. "Synthesis of Singing by Rule." In Mathews, M., and J. Pierce, eds., Current Directions in Computer Music Research. Cambridge, Massachusetts: The MIT Press, pp. 45-56.
Teager, H. 1980. "Some Observations on Oral Air Flow During Phonation." IEEE Transactions on Acoustics, Speech, and Signal Processing 28(5):599-601.

Ternström, S., and A. Friberg. 1989. "Analysis and Simulation of Small Variations in the Fundamental Frequency of Sustained Vowels." STL-Quarterly Progress and Status Report 3:1-14.
Titze, I., and B. Story. 1993. "The Iowa Singing Synthesis." In Proceedings of the Stockholm Music Acoustics Conference. Stockholm: KTH, p. 294.

Välimäki, V., and M. Karjalainen. 1994. "Improving the Kelly-Lochbaum Vocal Tract Model Using Conical Tube Sections and Fractional Delay Filtering Techniques." In Proceedings of the 1994 International Conference on Spoken Language Processing. Yokohama, Japan, pp. 18-22.
Verge, M. 1995. Aeroacoustics of Confined Jets, with Applications to the Physics of Recorder-Like Instruments. Thesis, Technical University of Eindhoven (also available from IRCAM).

Wergo. 1995. The Historical CD of Digital Sound Synthesis. WER 2033-2.
Wilhelms-Tricarico, R. 1995. "Physiological Modeling of Speech Production: Methods for Modeling Soft-Tissue Articulators." Journal of the Acoustical Society of America 97(5):3085-3098.

Zera, J., J. Gauffin, and J. Sundberg. 1984. "Synthesis of Selected VCV-Syllables in Singing." In Proceedings of the 1984 International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 83-86.