Straight

9-10 August 2002, Computational Audition

Fixed Point Representations for Very High-Quality Speech and Sound Modification System

Hideki Kawahara
Wakayama University, Japan


Summary

- A functional approach (computational, in Marr's sense) is important and productive.
- Fixed points provide feature values as well as their reliability indices.
  - Using within-channel information
- The fixed point concept may provide a clue for integrating Fourier-based concepts and wavelet-Mellin-transform-based concepts.

Reference systemReference system

“Local” center of gravity


original

STRAIGHT: demo


STRAIGHT demo: morphing

[Figure labels: neutral; angry; interpolation between the two; extrapolation beyond either end]

Word: /hai/ (“Yes” in Japanese)


Background

- “Auditory Brain” Project by CREST
  - Short term goal: speech processing systems based on functional models of auditory functions.
  - STRAIGHT: a very high-quality speech manipulation system
  - Fixed point based algorithms (an alternative way of reducing the dimensionality of auditory representations)
  - Long term goal: “computational” theories of audition
- Frustration with how auditory models are used in ASR and with how speech processing systems are evaluated.


“Auditory Brain” Project

- To develop a very high-quality speech and/or sound manipulation system that is based on perceptually relevant parameters and does not preserve phase/waveform information.
  - Are distance-based quality measures relevant?
  - Why do periodic sounds sound smoother and richer (in the auditory fovea)?
  - Is it relevant to test highly nonlinear speech perception using elementary sounds?


Why high quality?

- Ecological approach to investigating a highly nonlinear system: the human

[Figure labels: not necessarily predictable from elementary test signals; necessary to use ecologically valid stimuli; naturalness]


Hans Moravec: Robot, 2000, Oxford

xbox

iMac

PlayStation-3

Key issue: compatibility

The background figure has been removed. Please visit Hans Moravec’s page for the original figure: “Faster than exponential growth in computing power” (Chapter 3: Power and Presence, page 60), http://www.frc.ri.cmu.edu/~hpm/book98/



- Computational theories of speech/auditory perception
  - Ecological constraints on evolution
  - It cannot be ad hoc.
    - When there is an elegant and reasonable algorithm and it does not violate ecological (biological and environmental) constraints, there is no reason to deny that the algorithm shares common underlying principles with our auditory system.



- Computational theories of speech/auditory perception
  - Periodicity: time-frequency sampling grid
  - Periodicity: stable reference point for the wavelet-Mellin transform
  - Log-linear frequency axis
    - Wavelet-Mellin transform: shape and size
  - Why two ears? ICA
  - Long term correlation (structure)
    - ASR, music


STRAIGHT: a core technology

- Conceptually simple architecture
  - Channel VOCODER
  - Source filter model
- Graded parameters (vs. binary decision)
  - Sensitivity analysis
  - Morphing
- Reliability / Transparency
  - No post-processing
  - Weakly constrained model


STRAIGHT: architecture

[Figure: structure of STRAIGHT]



STRAIGHT: structure

[Figure: structure of STRAIGHT; spectral envelope estimation]


Weakly constrained spectral envelope estimation

[Figure labels: waveform; time window; interference in the time domain; interference in the frequency domain]



[Figure labels: reduction of edge discontinuity; reduction of periodicity interference; smoothing by spline basis; composite window]


Time-frequency smoothing (current implementation)

[Figure labels: F0-synchronous Gaussian window; complementary time window; reduced-interference spectrum]


Compensation of over-smoothing


Fixed point based algorithms

- Fixed points in the frequency domain → F0 extraction
- Fixed points in the time domain → excitation extraction


Fixed point of mapping

[Figure: the curve y = f(x) plotted against x; the fixed point is where it meets the line y = x]

Examples
- Instantaneous frequency of a filter output around a sinusoidal component
- Energy centroid of a windowed signal around an event
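The first example can be sketched numerically. The code below is only an illustration under stated assumptions, not the STRAIGHT implementation: the Gaussian-windowed (Gabor) filter, the 5 ms window, the 10 Hz filter spacing, and the function names `gabor_output`, `instantaneous_frequency`, and `fixed_points` are all hypothetical choices. It maps each filter center frequency to the instantaneous frequency of that filter's output and keeps the points where the map meets the line y = x from above.

```python
import cmath
import math


def gabor_output(x, fs, fc, center, sigma_t):
    """Complex output, at sample `center`, of a Gaussian-windowed filter tuned to fc."""
    half = int(5 * sigma_t * fs)  # truncate the window at about 5 standard deviations
    acc = 0j
    for k in range(-half, half + 1):
        w = math.exp(-0.5 * (k / (sigma_t * fs)) ** 2)
        acc += x[center + k] * w * cmath.exp(-2j * math.pi * fc * k / fs)
    return acc


def instantaneous_frequency(x, fs, fc, center, sigma_t):
    """Instantaneous frequency (Hz) from the phase advance between adjacent samples."""
    y0 = gabor_output(x, fs, fc, center, sigma_t)
    y1 = gabor_output(x, fs, fc, center + 1, sigma_t)
    return cmath.phase(y1 * y0.conjugate()) * fs / (2.0 * math.pi)


def fixed_points(x, fs, centers_hz, center, sigma_t):
    """Frequencies where the map fc -> IF(fc) crosses y = x from above."""
    diffs = [instantaneous_frequency(x, fs, fc, center, sigma_t) - fc
             for fc in centers_hz]
    found = []
    for i in range(len(diffs) - 1):
        if diffs[i] > 0.0 >= diffs[i + 1]:
            frac = diffs[i] / (diffs[i] - diffs[i + 1])  # linear interpolation
            found.append(centers_hz[i] + frac * (centers_hz[i + 1] - centers_hz[i]))
    return found


fs = 8000.0
f0 = 203.0  # test sinusoid, deliberately off the filter grid
signal = [math.cos(2.0 * math.pi * f0 * n / fs) for n in range(1000)]
centers = [150.0 + 10.0 * i for i in range(11)]  # 150, 160, ..., 250 Hz
estimates = fixed_points(signal, fs, centers, center=500, sigma_t=0.005)
print(estimates)
```

For a clean sinusoid every nearby filter reports nearly the same instantaneous frequency, so the map is almost flat around the fixed point. How flat it is there is one of the geometric parameters the slides go on to use as a reliability cue.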


Averaging and fixed point

- Prominent component

[Figure labels: window locations; average of windowed value; fixed point; background; parameters (position, slope, [level])]



F0 estimation



Window selection for reliable representation of mapping

Refinement of F0-synchronous windows


Window with harmonic cancellation


Fixed-point-based sinusoidal component extraction


Fixed-point-based sinusoidal frequency and C/N estimation

C/N information enables optimum F0 estimation based on multiple harmonic components
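One way to read the statement about combining harmonics is as an inverse-variance weighted average. The sketch below rests on an assumption the slides do not state: that the variance of each harmonic's frequency estimate is inversely proportional to its linear C/N. Under that assumption the estimate f_k / k gets weight k² · C/N. The function name `weighted_f0` and the example values are hypothetical.

```python
def weighted_f0(harmonic_hz, cn_db):
    """Combine harmonic frequency estimates into one F0 estimate.

    harmonic_hz[k-1] is the estimated frequency of the k-th harmonic and
    cn_db[k-1] its C/N in dB. Assumes the variance of each frequency
    estimate is proportional to 1 / (linear C/N).
    """
    num = 0.0
    den = 0.0
    for k, (freq, cn) in enumerate(zip(harmonic_hz, cn_db), start=1):
        weight = (k ** 2) * 10.0 ** (cn / 10.0)  # inverse variance of freq / k
        num += weight * freq / k
        den += weight
    return num / den


clean = weighted_f0([120.0, 240.0, 360.0], [30.0, 20.0, 10.0])
# Third harmonic is off by 15 Hz but has a very low C/N, so it barely matters.
noisy = weighted_f0([120.0, 240.0, 375.0], [30.0, 30.0, 0.0])
print(clean, noisy)
```

The second call shows why a graded reliability index is useful: an unreliable harmonic is down-weighted and never has to be accepted or rejected by a binary decision.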


Approximate estimation of C/N


Reliable built-in mechanism for fundamental component selection

[Figure labels: linear filter arrangement; log-linear filter arrangement; mapping filter output]


Fixed points on C/N map

Fundamental component


F0 evaluation based on EGG

Gross error (female): without 0.72%, with 0.32%


Graded source information

- Fixed point based F0 extraction (with C/N map)
  - F0 trajectories (resolution: 1/F0)
  - C/N for each fixed point

Graded aperiodicity information is also extracted.


Fixed points in the time domain

- How to define auditory temporal events
  - Localized energy centroid

Alternative representation


Fixed points in the time domain

[Figure labels: speech waveform; squared whitened signal; Gaussian window; energy centroid; amount of energy concentration]


Fixed point based event detection

[Figure labels: waveform; energy centroid; window center; fixed points]
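The time-domain procedure can be sketched as follows. This is an illustration, not the published algorithm: the signal is assumed to be whitened already, the events are idealized as two isolated pulses, and the window width (20 samples) and the function names `windowed_centroid` and `event_fixed_points` are hypothetical. For every window center the code computes the energy centroid under a Gaussian window, then reports the centers where the centroid coincides with the window center and the map crosses y = x from above.

```python
import math


def windowed_centroid(energy, center, sigma):
    """Energy centroid (in samples) under a Gaussian window placed at `center`."""
    num = 0.0
    den = 0.0
    for n, e in enumerate(energy):
        w = math.exp(-0.5 * ((n - center) / sigma) ** 2)
        num += n * w * e
        den += w * e
    return num / den


def event_fixed_points(energy, sigma):
    """Window centers where the centroid equals the center (downward crossings)."""
    diffs = [windowed_centroid(energy, c, sigma) - c for c in range(len(energy))]
    events = []
    for c in range(len(diffs) - 1):
        if diffs[c] > 0.0 >= diffs[c + 1]:
            events.append(c + diffs[c] / (diffs[c] - diffs[c + 1]))
    return events


# Idealized whitened signal: two pulses of different size and sign.
whitened = [0.0] * 400
whitened[100] = 1.0
whitened[300] = -0.8
energy = [v * v for v in whitened]  # squared whitened signal

events = event_fixed_points(energy, sigma=20.0)
print(events)
```

Between the two pulses the centroid also equals the window center once, but there the map crosses y = x from below, so that point is not reported as an event.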


Definition of event in the time domain

[Figure labels: windowed whitened signal; event location; mean time; duration]


Windowed event location and the original event location

[Figure labels: Gaussian window; approximation of envelope; windowed location; original location; window location]


Duration can be estimated from the geometrical parameter at the fixed point

[Figure labels: slope at fixed point; duration; window parameter]
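A short derivation shows how the slope can carry duration information. It is a sketch under one assumption suggested by the preceding slide's "approximation of envelope" label: both the event's energy envelope e(t) and the window w are Gaussian. The symbols t_0 (event location), σ_e (event duration), σ_w (window parameter), τ (window location), and s (slope) are introduced here for illustration and are not taken from the slides.

```latex
% Assumption: Gaussian energy envelope and Gaussian window.
e(t) = \exp\!\left(-\frac{(t - t_0)^2}{2\sigma_e^2}\right), \qquad
w(t - \tau) = \exp\!\left(-\frac{(t - \tau)^2}{2\sigma_w^2}\right)

% Energy centroid of the windowed event as a function of window location \tau:
T(\tau) = \frac{\int t\, w(t - \tau)\, e(t)\, dt}{\int w(t - \tau)\, e(t)\, dt}
        = \frac{\sigma_w^2\, t_0 + \sigma_e^2\, \tau}{\sigma_e^2 + \sigma_w^2}

% Fixed point, slope at the fixed point, and the duration recovered from it:
T(\tau) = \tau \iff \tau = t_0, \qquad
s = \frac{dT}{d\tau} = \frac{\sigma_e^2}{\sigma_e^2 + \sigma_w^2}
\quad\Longrightarrow\quad
\sigma_e = \sigma_w \sqrt{\frac{s}{1 - s}}
```

In this idealized case the fixed point sits exactly at the event location, and a slope near 0 indicates an event much shorter than the window.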


Equivalence between the time domain definition and the frequency domain definition

[Figure labels: waveform; time domain definition; frequency domain definition; group delay]
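The slide's equations did not survive extraction. As a hedged reconstruction, the standard identity relating the two definitions is given below: the energy-weighted mean time of a signal equals its energy-weighted average group delay. The Fourier convention and symbol names are assumptions, and the original slide may have used a different notation.

```latex
% Assumed convention: X(\omega) = \int x(t)\, e^{-j\omega t}\, dt = |X(\omega)|\, e^{j\varphi(\omega)}
\tau_g(\omega) = -\frac{d\varphi(\omega)}{d\omega}

% Time domain definition (left) and frequency domain definition (right) of the event location:
\langle t \rangle
  = \frac{\int t\, |x(t)|^2\, dt}{\int |x(t)|^2\, dt}
  = \frac{\int \tau_g(\omega)\, |X(\omega)|^2\, d\omega}{\int |X(\omega)|^2\, d\omega}
```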


Inverse problem: Where is the excitation?

[Figure labels: excitation (impulse); minimum phase response; event as the energy centroid; compensation]


Equivalence in definitions

- Frequency domain definition of the event location
- Assuming causality

Group delay


Group delay of a minimum phase response

Computation via the cepstrum
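The cepstral route can be sketched with the textbook construction: take the real cepstrum of the log magnitude spectrum, fold it onto positive quefrencies to obtain the minimum-phase complex cepstrum, and read the group delay off the transform of the quefrency-weighted cepstrum. The code uses a naive DFT to stay dependency-free, assumes an even transform length, and is checked against a one-pole filter whose group delay is known in closed form. The function names are hypothetical and this is not claimed to be the STRAIGHT code.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform (forward or inverse)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[m] * cmath.exp(sign * 2j * math.pi * k * m / n) for m in range(n))
        out.append(s / n if inverse else s)
    return out


def minimum_phase_group_delay(magnitude):
    """Group delay (samples) of the minimum-phase response with this magnitude.

    `magnitude` is the full N-point magnitude spectrum, N even.
    """
    n = len(magnitude)
    log_mag = [math.log(m) for m in magnitude]
    cep = [c.real for c in dft(log_mag, inverse=True)]  # real cepstrum
    folded = [0.0] * n                                  # causal (minimum-phase) cepstrum
    folded[0] = cep[0]
    for q in range(1, n // 2):
        folded[q] = 2.0 * cep[q]
    folded[n // 2] = cep[n // 2]
    # phase = -sum(folded[q] * sin(w q)), so group delay = sum(q * folded[q] * cos(w q))
    weighted = [q * folded[q] for q in range(n)]
    return [v.real for v in dft(weighted)]


# Check with a one-pole filter H(w) = 1 / (1 - a exp(-jw)), which is minimum phase.
a = 0.5
n_fft = 64
magnitude = [1.0 / abs(1.0 - a * cmath.exp(-2j * math.pi * k / n_fft))
             for k in range(n_fft)]
delay = minimum_phase_group_delay(magnitude)
print(delay[0], delay[n_fft // 2])
```

For this filter the closed-form group delay is (a cos ω − a²) / (1 − 2a cos ω + a²), which gives a / (1 − a) at ω = 0 and −a / (1 + a) at ω = π.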


Compensation based on minimum phase group delay

[Figure labels: observed group delay; causal group delay; compensated event location; compensated event duration]


Example

[Figure labels: observed group delay; minimum phase group delay; compensated group delay]


Excitation estimation based on fixed point based event detection

[Figure labels: event based concentration; excitation based concentration; energy centroid; compensated group delay; excitation; vocal fold closure]


Excitation extraction accuracy

Standard deviation


Multiple resolution display of events (fixed points)

[Figure labels: estimated excitation; speech waveform]


Multiple resolution display of events (fixed points)

demo

Fixed points due to one excitation align on a straight line


Phase map of wavelet transform


Instantaneous frequency based fixed points


Group delay based fixed points



Source attribute control



Group delay manipulated mixed-mode excitation source

[Figure labels: group delay asymmetry; impulse response]

...provides continuous coverage from pulse train to random noise
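The idea of moving continuously from a pulse to noise by manipulating group delay can be illustrated with an all-pass response whose group delay is randomized. The sketch is built on assumptions: the independent, uniformly distributed group delay per frequency bin, the `spread` parameter, and the function name `mixed_mode_impulse` are hypothetical and may well differ from how STRAIGHT shapes its group delay. With zero spread the response is a single pulse; as the spread grows, the same unit energy is smeared into a noise-like burst.

```python
import cmath
import math
import random


def mixed_mode_impulse(n_fft, spread, seed=0):
    """Unit-energy all-pass impulse response with random group delay.

    Group delay per bin is drawn uniformly from [-spread, spread] samples.
    """
    rng = random.Random(seed)
    half = n_fft // 2
    phase = [0.0] * (half + 1)
    for k in range(1, half + 1):
        tau = rng.uniform(-spread, spread)  # group delay of this bin, in samples
        phase[k] = phase[k - 1] - tau * 2.0 * math.pi / n_fft  # phase = -integral of delay
    spectrum = [0j] * n_fft
    for k in range(half):
        spectrum[k] = cmath.exp(1j * phase[k])  # unit magnitude everywhere
    spectrum[half] = 1.0 + 0j                   # keep the Nyquist bin real
    for k in range(1, half):
        spectrum[n_fft - k] = spectrum[k].conjugate()  # Hermitian symmetry -> real signal
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * n / n_fft)
            for k in range(n_fft)).real / n_fft
        for n in range(n_fft)
    ]


pulse = mixed_mode_impulse(64, spread=0.0)  # pure pulse
noise = mixed_mode_impulse(64, spread=8.0)  # noise-like burst
print(max(abs(v) for v in pulse), max(abs(v) for v in noise))
```

Both responses have the same flat magnitude spectrum and the same energy. Only the temporal concentration changes, which is what makes the group delay spread usable as a graded source parameter.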


Summary

- A functional approach (computational, in Marr's sense) is important and productive.
- Fixed points provide feature values as well as their reliability indices.
  - Using within-channel information
- The fixed point concept may provide a clue for integrating Fourier-based concepts and wavelet-Mellin-transform-based concepts.


Colleagues

- Haruhiro Katayose, Toshio Irino, Takanobu Nishiura (Wakayama Univ.)
- Minoru Tsuzaki, Hideki Iwasawa (ATR)
- Parham Zolfaghari (NTT)
- Kiyohiro Shikano, Hiroshi Saruwatari (NAIST)
- Fumitada Itakura, Kazuya Takeda, Shoji Kajita, Hideki Banno (CIAIR, Nagoya Univ.)
- Masato Akagi, Masashi Unoki (JAIST)
- Seiichi Nakagawa (Toyohashi Inst. Tech)
- Shigeki Sagayama, Nobuaki Minematsu (Univ. Tokyo)
- Diane Kewley-Port (Indiana Univ., USA)
- Osamu Fujimura (Ohio State Univ., USA)
- Alain de Cheveigné (IRCAM, France)
- Roy D. Patterson (CNBH, UK)



For computational “Audition”

[Figure labels: seed#1; seed#2; F0 trajectory and frequency axis modification; parts preparation; mixing and level adjustment]


Nonlinear time warping based on phase of the F0 component (FM pulse train)

[Figure panels: without time warping; with time warping]


Nonlinear time warping based on phase of the F0 component (vowel sequence /aiueo/)

[Figure panels: without time warping; with time warping]
