Speech Processing

Homomorphic Signal Processing

April 19, 2023 Veton Këpuska

Outline

Principles of Homomorphic Signal Processing

Details of Homomorphic Processing

Variants of Homomorphic Processing

Application of homomorphic systems to speech analysis and synthesis


Principles of Homomorphic Processing

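Superposition Property of Linear Systems:

[Figure: a linear system L maps x[n] to L(x[n]). Inputs x1[n] and x2[n], scaled by a1 and a2 and summed, produce a1 L(x1[n]) + a2 L(x2[n]).]

$$L(x_1[n] + x_2[n]) = L(x_1[n]) + L(x_2[n])$$

$$L(a\,x[n]) = a\,L(x[n])$$

$$L(a_1 x_1[n] + a_2 x_2[n]) = a_1 L(x_1[n]) + a_2 L(x_2[n])$$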



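Example 6.1: If signals fall in non-overlapping frequency bands, then they are separable. Let x[n] = x1[n] + x2[n], with

X1(ω) = ℱ{x1[n]} nonzero only for ω ∈ [0, π/2],

X2(ω) = ℱ{x2[n]} nonzero only for ω ∈ [π/2, π].

Apply a linear filter h[n] with frequency response

$$H(\omega) = \begin{cases} 0, & \omega \in [0, \pi/2] \\ 1, & \omega \in [\pi/2, \pi] \end{cases}$$

Then

y[n] = h[n] ＊ (x1[n] + x2[n]) = h[n] ＊ x1[n] + h[n] ＊ x2[n]

y[n] = h[n] ＊ x2[n] = x2[n]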
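The band separation of Example 6.1 can be sketched numerically. The following is a minimal illustration, not part of the original slides: the two cosines, the length N = 256, and the use of an ideal DFT-domain mask in place of a time-domain h[n] are all assumptions chosen so that each component falls exactly on a DFT bin.

```python
import numpy as np

N = 256
n = np.arange(N)

# Hypothetical test signals: x1 lies in [0, pi/2], x2 lies in [pi/2, pi].
x1 = np.cos(2 * np.pi * 10 * n / N)    # omega ~ 0.245 rad/sample
x2 = np.cos(2 * np.pi * 100 * n / N)   # omega ~ 2.45 rad/sample
x = x1 + x2

# Ideal high-pass response: 0 below pi/2, 1 from pi/2 to pi (both signs of omega).
omega = 2 * np.pi * np.fft.fftfreq(N)  # bin frequencies in [-pi, pi)
H = (np.abs(omega) >= np.pi / 2).astype(float)

# Circular convolution with h[n], carried out as a product in the DFT domain.
y = np.real(np.fft.ifft(H * np.fft.fft(x)))

print(np.max(np.abs(y - x2)))  # deviation of the filter output from x2[n]
```

Because the filter is linear, it acts on x1[n] and x2[n] independently, which is exactly the superposition property that homomorphic systems generalize.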


Generalized Superposition: a concept that would support separation of nonlinearly combined signals. It leads to the notion of Generalized Linear Filtering.

Properties: H(x1[n] □ x2[n]) = H(x1[n]) ○ H(x2[n]) and H(c : x[n]) = c ◈ H(x[n])

Systems that satisfy those two properties are referred to as homomorphic systems and are said to satisfy a generalized principle of superposition.

[Figure: homomorphic system H(·) mapping x[n] to y[n]. Input rule: □ (with scalar operation :). Output rule: ○ (with scalar operation ◈).]



Importance of homomorphic systems for speech processing lies in their capability of transforming nonlinearly combined signals to additively combined signals so that linear filtering can be performed on them.

Homomorphic systems can be expressed as a cascade of three homomorphic sub-systems depicted in the figure below – referred to as the canonic representation:

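[Figure: canonic representation of H as a cascade of three systems. I: characteristic system D□ (input rule □, output rule +), mapping x[n] to x̂[n]. II: linear system L (input rule +, output rule +), mapping x̂[n] to ŷ[n]. III: inverse characteristic system D○⁻¹ (input rule +, output rule ○), mapping ŷ[n] to y[n].]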


Canonic Representation of a Homomorphic System

i. The characteristic system D□: transforms □ into addition “+”, producing x̂[n] from x[n].

ii. The linear system L: transforms addition into addition, producing ŷ[n] from x̂[n].

iii. The inverse system D○⁻¹: transforms addition into ○, producing y[n] from ŷ[n].


Homomorphic Systems

Let the goal be removal of undesired component of the signal (e.g., noise):

Type of combination rule | System | Operation
Signal & additive noise | Linear system | Linear filtering
Signal & multiplicative noise | Multiplicative system | Multiplicative filtering
Signal & convolutional noise | Convolutional system | Convolutional filtering


Multiplicative Homomorphic Systems

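Consider the homomorphic multiplicative system depicted below:

[Figure: multiplicative system M[·] mapping x[n] to y[n] with input and output rule ●, and its canonic form. I: D● (● to +), mapping x[n] to x̂[n]. II: linear system L (+ to +), mapping x̂[n] to ŷ[n]. III: D●⁻¹ (+ to ●), mapping ŷ[n] to y[n].]

Use the characteristic system D● (D□ with □ = ●) to convert MULT into ADD, and its inverse D●⁻¹ to convert ADD back into MULT.

Which rule (operation) transforms MULT into ADD?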



If x[n] = x1[n] ● x2[n], with x1[n] > 0 and x2[n] > 0 for all n,

then log(x1[n] ● x2[n]) = log(x1[n]) + log(x2[n]).

However, x[n] may not always be positive. Generalization to complex signals:

$$x[n] = |x[n]|\, e^{j \arg(x[n])}$$

which requires a definition of the complex log operator.



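An implementation of a multiplicative homomorphic system:

[Figure: x[n] → complex log (● to +) → x̂[n] → linear system (+ to +) → ŷ[n] → complex exp (+ to ●) → y[n]; stages I, II, III.]

Definition of the complex log:

$$\log(x[n]) = \log|x[n]| + j \arg(x[n])$$

Complex exp (inverse operation):

$$e^{\log(x[n])} = e^{\log|x[n]|}\, e^{j \arg(x[n])} = x[n]$$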
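A minimal sketch of the three-stage multiplicative system (complex log, linear system, complex exp) follows. It is an illustration only: the function name `multiplicative_system`, the choice of a scaling by `gamma` as the linear stage, and the positive test values (which sidestep the phase ambiguity) are assumptions, not taken from the slides.

```python
import numpy as np

def multiplicative_system(x, gamma):
    """Hypothetical helper: complex log -> scaling by gamma -> complex exp."""
    x_hat = np.log(np.asarray(x, dtype=complex))  # log|x| + j*arg(x), principal value
    y_hat = gamma * x_hat                          # stage II: a linear operation on x_hat
    return np.exp(y_hat)                           # stage III: back to the product domain

x1 = np.array([1.0, 2.0, 4.0])
x2 = np.array([3.0, 0.5, 2.0])
gamma = 0.5

y = multiplicative_system(x1 * x2, gamma)
y_sep = multiplicative_system(x1, gamma) * multiplicative_system(x2, gamma)

print(np.allclose(y, y_sep))  # generalized superposition: H(x1 ● x2) = H(x1) ● H(x2)
```

With a scaling as the linear stage, the overall system raises its input to the power gamma, which is one way such a system can compress or expand the dynamic range of multiplicatively combined signals.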


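Homomorphic Systems for Convolution

Consider the homomorphic system for convolution depicted below:

[Figure: convolutional system C[·] mapping x[n] to y[n] with input and output rule ＊, and its canonic form. I: D＊ (＊ to +), mapping x[n] to x̂[n]. II: linear system L (+ to +), mapping x̂[n] to ŷ[n]. III: D＊⁻¹ (+ to ＊), mapping ŷ[n] to y[n].]

Use D＊ to convert “＊” into ADD. Use D＊⁻¹ to convert ADD into “＊”.

How to transform “＊” into ADD?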


Homomorphic Systems for Convolution

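Let x[n] = x1[n] * x2[n].

I. The characteristic system D＊ maps x[n] (time) to x̂[n] (“time”) in three steps:

1. z-transform Z[·]: ＊ to ●, giving X(z).

2. log[·]: ● to +, giving X̂(z).

3. Inverse z-transform Z⁻¹[·]: + to +, giving x̂[n].

III. The inverse operation D＊⁻¹ maps ŷ[n] (“time”) to y[n] in three steps:

1. z-transform Z[·]: + to +, giving Ŷ(z).

2. exp[·]: + to ●, giving Y(z).

3. Inverse z-transform Z⁻¹[·]: ● to ＊, giving y[n].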



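For x[n] = x1[n] * x2[n]:

1. X(z) = X1(z) X2(z)

2. log(X(z)) = log(X1(z) X2(z)) = log(X1(z)) + log(X2(z)), using the complex logarithm. This operation requires special handling because:

X(z) is in general complex, so the condition X(z) > 0 required by the real logarithm does not hold.

For complex X(z) the phase is not uniquely defined (it is known only up to a multiple of 2π).

X(z) has to be defined on the unit circle (e.g., the z-transform of a stable sequence).

In practice, operate on the unit circle z = e^{jω}, i.e., with the Fourier transform:

$$\hat{X}(e^{j\omega}) = \log X(e^{j\omega}) = \log|X(e^{j\omega})| + j \arg X(e^{j\omega})$$

$$\hat{x}[n] = \mathcal{F}^{-1}\{\hat{X}(e^{j\omega})\}$$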



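Two cases are possible in computing x̂[n]:

1. Complex Cepstrum (CC):

$$\hat{x}[n] = \mathcal{F}^{-1}\{\log|X(e^{j\omega})| + j \arg X(e^{j\omega})\}$$

2. Real Cepstrum (RC):

$$c[n] = \mathcal{F}^{-1}\{\log|X(e^{j\omega})|\}$$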
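Both cepstra can be approximated with the FFT. The sketch below is a hedged illustration rather than a reference implementation: the function names are hypothetical, `np.unwrap` stands in for a proper phase unwrapper, no linear-phase term is removed, and the test sequence x[n] = δ[n] − a δ[n−1] is an assumption chosen because its complex cepstrum is known in closed form (−aⁿ/n for n > 0).

```python
import numpy as np

def complex_cepstrum(x, nfft=1024):
    """Hypothetical helper: inverse DFT of log|X| + j * (unwrapped phase)."""
    X = np.fft.fft(x, nfft)
    phase = np.unwrap(np.angle(X))          # simple modulo-2*pi unwrapping
    return np.real(np.fft.ifft(np.log(np.abs(X)) + 1j * phase))

def real_cepstrum(x, nfft=1024):
    """Hypothetical helper: inverse DFT of log|X| only."""
    X = np.fft.fft(x, nfft)
    return np.real(np.fft.ifft(np.log(np.abs(X))))

a = 0.5
x = np.array([1.0, -a])                     # X(z) = 1 - a z^-1, zero inside the unit circle
x_hat = complex_cepstrum(x)
c = real_cepstrum(x)

print(x_hat[1], x_hat[2])                   # compare with -a and -a**2 / 2
print(c[1])                                 # even part: half of x_hat[1] here
```

Negative quefrencies wrap around to the end of the arrays, so `x_hat[-1]` corresponds to x̂[−1].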



Example 6.3 Consider a sequence x[n] consisting of a system impulse response h[n] convolved with an impulse train p[n]:

$$p[n] = \sum_k a_k\, \delta[n - kP], \qquad x[n] = h[n] * p[n]$$

[Figure: p[n] → h[n] → x[n]]

Goal is to estimate h[n]. First form the canonical representation for convolution:

$$\hat{x}[n] = D_*(x[n]) = D_*(h[n]) + D_*(p[n]) = \hat{h}[n] + \hat{p}[n]$$

If D* is such that p̂[n] remains a train of pulses and ĥ[n] falls between the impulses, then separation is possible.


Example 6.3 (cont.)

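Let L denote such an operation (i.e., a rectangular window that would separate p̂[n] from ĥ[n]):

$$\hat{y}[n] = L(\hat{x}[n]) = L(\hat{h}[n] + \hat{p}[n]) = L(\hat{h}[n]) + L(\hat{p}[n]) \approx \hat{h}[n], \qquad L(\hat{p}[n]) \approx 0$$

$$h[n] \approx D_*^{-1}(\hat{y}[n])$$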


Example 6.4

a, b real and positive ⇒ log(ab) = log(a) + log(b)

a, b real but b < 0 ⇒ log(ab) = log(a|b|e^{jkπ}) = log(a) + log(|b|) + jkπ, k = 1, 3, 5, …, so log(ab) is ambiguous.

This example indicates that special consideration must be made in defining the logarithm operator for complex X(z) in order to make the logarithm of the product the sum of logarithms.
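The ambiguity can be seen directly with a principal-value logarithm. In this small sketch (the values a = −2 and b = −3 are an assumption, picked so that both factors carry a phase of π), the sum of the logarithms differs from the logarithm of the product by j2π.

```python
import numpy as np

a, b = -2.0, -3.0

lhs = np.log(complex(a * b))                    # principal log of the product
rhs = np.log(complex(a)) + np.log(complex(b))   # sum of principal logs
diff = rhs - lhs

print(lhs, rhs, diff)   # the two sides differ by a multiple of j*2*pi
```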


Homomorphic Systems for Convolution-Complex Logarithm

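Suppose that X(z) is evaluated on the unit circle (z = e^{jω}).

Let x[n] = x1[n] * x2[n] ⇒ X(ω) = X1(ω) X2(ω)

Consider then the complex log of X(ω):

$$\log X(\omega) = \log\left(|X(\omega)|\, e^{j\angle X(\omega)}\right) = \log|X(\omega)| + j\angle X(\omega)$$

Considering that X(ω) = X1(ω) X2(ω), then:

$$\begin{aligned}
\log X(\omega) &= \log\left(|X_1(\omega)||X_2(\omega)|\, e^{j[\angle X_1(\omega) + \angle X_2(\omega)]}\right) \\
&= \log|X_1(\omega)| + \log|X_2(\omega)| + j\left[\angle X_1(\omega) + \angle X_2(\omega)\right] \\
&= \log X_1(\omega) + \log X_2(\omega)
\end{aligned}$$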



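In the previous expression the following was assumed:

$$\log\left(|X_1(\omega)||X_2(\omega)|\right) = \log|X_1(\omega)| + \log|X_2(\omega)|$$

This expression holds if |X1(ω)| > 0 and |X2(ω)| > 0.

Also:

$$\angle\left(X_1(\omega) X_2(\omega)\right) = \angle X_1(\omega) + \angle X_2(\omega)$$

This expression generally does not hold due to the ambiguity in the definition of phase:

$$\angle X(\omega) = PV[\angle X(\omega)] + 2\pi k$$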



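Note that PV denotes the principal value of the phase, which falls in the interval [−π, π]. An arbitrary multiple of 2π can be added to the principal phase value. Thus the additive property generally does not hold.

How to impose uniqueness?

1. Force continuity of phase: select k such that ∠X(ω) = PV[∠X(ω)] + 2πk is a continuous function. Figure 6.5 (next slide).

2. Phase derivative approach:

$$\angle X(\omega) = \int_0^{\omega} \dot{\angle X}(\beta)\, d\beta, \quad \text{where } \dot{\angle X}(\omega) = \frac{d}{d\omega}\angle X(\omega)$$

It can be shown that:

$$\frac{d}{d\omega}\angle X(\omega) = \frac{X_r(\omega)\,\dot{X}_i(\omega) - X_i(\omega)\,\dot{X}_r(\omega)}{|X(\omega)|^2}$$

where X_r and X_i are the real and imaginary parts of X(ω) and the dot denotes differentiation with respect to ω.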


Fourier Transform Phase Continuity



Relationship of complex cepstrum x̂[n] to real cepstrum c[n]. If x[n] is real, then:

|X(ω)| is real and even, and thus log[|X(ω)|] is real and even.

∠X(ω) is odd, and hence

$$\hat{x}[n] = \frac{1}{2\pi}\int_{-\pi}^{\pi} \log[X(\omega)]\, e^{j\omega n}\, d\omega$$

is real. x̂[n] is referred to as the complex cepstrum. The even component of the complex cepstrum, c[n], is referred to as the real cepstrum:

$$c[n] = \frac{\hat{x}[n] + \hat{x}[-n]}{2}$$


Complex Cepstrum of Speech-Like Sequences

Sequences with Rational z-Transform: the general form of this class of sequences is given below:

$$X(z) = A z^{-r}\, \frac{\prod_{k=1}^{M_i}(1 - a_k z^{-1}) \prod_{k=1}^{M_o}(1 - b_k z)}{\prod_{k=1}^{N_i}(1 - c_k z^{-1}) \prod_{k=1}^{N_o}(1 - d_k z)}$$

Mi, Ni are the numbers of zeros and poles inside the unit circle.

Mo, No are the numbers of zeros and poles outside the unit circle.

|ak|, |bk|, |ck|, |dk| are all < 1 ⇒ there are no singularities on the unit circle.

A > 0.



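Applying the complex logarithm (with the delay term z^{-r} set aside; its contribution is treated in Example 6.5) gives:

$$\hat{X}(z) = \log X(z) = \log A + \sum_{k=1}^{M_i}\log(1 - a_k z^{-1}) + \sum_{k=1}^{M_o}\log(1 - b_k z) - \sum_{k=1}^{N_i}\log(1 - c_k z^{-1}) - \sum_{k=1}^{N_o}\log(1 - d_k z)$$

X̂(z) is the z-transform of the sequence x̂[n]:

$$\hat{x}[n] = \mathcal{Z}^{-1}\{\hat{X}(z)\}$$

We want the inverse z-transform x̂[n] to be absolutely summable ⇒ the ROC of X̂(z) must include the unit circle, |z| = 1.

This condition is equivalent to having all constituent terms of X̂(z) have ROCs that include the unit circle, |z| = 1.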



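In order to obtain the ROC for expressions of the form log(1 − αz⁻¹) and log(1 − βz), they are expressed as power series expansions:

$$\log(1 - \alpha z^{-1}) = -\sum_{n=1}^{\infty} \frac{\alpha^n}{n} z^{-n}, \quad |z| > |\alpha|$$

$$\log(1 - \beta z) = -\sum_{n=1}^{\infty} \frac{\beta^n}{n} z^{n}, \quad |z| < |\beta|^{-1}$$

[Figure: two z-plane plots (Re and Im axes, unit circle marked). ROC for log(1 − αz⁻¹): the region outside the circle of radius |α|. ROC for log(1 − βz): the region inside the circle of radius 1/|β|.]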



The ROC of X̂(z) is therefore an annulus bounded by the poles and zeros of X(z) closest to the unit circle:

[Figure: z-plane plot (Re and Im axes, unit circle marked) showing the annular ROC of X̂(z) for a typical rational X(z).]



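The complex cepstrum x̂[n] associated with a rational X(z) can therefore be expressed as:

$$\hat{x}[n] = \log(A)\,\delta[n] - \sum_{k=1}^{M_i}\frac{a_k^n}{n}u[n-1] + \sum_{k=1}^{N_i}\frac{c_k^n}{n}u[n-1] + \sum_{k=1}^{M_o}\frac{b_k^{-n}}{n}u[-n-1] - \sum_{k=1}^{N_o}\frac{d_k^{-n}}{n}u[-n-1]$$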

[Equation: a rational X(z) with gain A and first-order factors in a and b.]


Example 6.5

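Let:

$$X(z) = \frac{z^{-r}(1 - a z^{-1})(1 - b z)}{1 - c z^{-1}}$$

where a, b, c are real and < 1. The ROC of X(z) includes the unit circle so that x[n] is stable. A delay z^{-r} corresponds to a shift in the sequence. Thus the complex cepstrum is given by:

$$\hat{x}[n] = -\frac{a^n}{n}u[n-1] + \frac{c^n}{n}u[n-1] + \frac{b^{-n}}{n}u[-n-1] + \mathcal{Z}^{-1}\{\log z^{-r}\}$$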


Example 6.5 (cont.)

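The inverse z-transform of the shift term, obtained by evaluating log z^{-r} = −jωr on the unit circle, is given by:

$$\mathcal{Z}^{-1}\{\log z^{-r}\} = \begin{cases} -\dfrac{r\cos(\pi n)}{n}, & n \neq 0 \\ 0, & n = 0 \end{cases}$$

The contribution of the z^{-r} term is significant. On the unit circle, z^{-r} = e^{-jωr} = 1∠(−ωr) contributes a linear ramp to the phase and thus, for a large shift r, dominates the phase representation and gives a large discontinuity at π and −π.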



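Relation of complex cepstrum and real cepstrum for x[n] with a rational z-transform that is minimum phase:

The complex cepstrum of a minimum-phase sequence with a rational z-transform is right-sided, so it can be obtained from the real cepstrum:

$$\hat{x}[n] = l[n]\,c[n], \qquad l[n] = \begin{cases} 1, & n = 0 \\ 2, & n > 0 \\ 0, & n < 0 \end{cases}$$

where

$$c[n] = \frac{\hat{x}[n] + \hat{x}[-n]}{2}$$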
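The relation x̂[n] = l[n] c[n] gives a way to rebuild a minimum-phase sequence from its real cepstrum alone. The sketch below is illustrative: the function name is hypothetical, the treatment of the n = N/2 sample (weight 1) is a DFT-specific assumption not covered by the slides, and the test sequence [1, −0.5] is chosen because it is already minimum phase.

```python
import numpy as np

def min_phase_from_real_cepstrum(c):
    """Hypothetical helper: apply l[n] to c[n], then invert via DFT -> exp -> IDFT."""
    N = len(c)                      # assumed even
    l = np.zeros(N)
    l[0] = 1.0                      # n = 0
    l[1:N // 2] = 2.0               # n > 0 (first half of the DFT buffer)
    l[N // 2] = 1.0                 # assumption: sample shared by +N/2 and -N/2
    x_hat = l * c                   # right-sided complex cepstrum
    return np.real(np.fft.ifft(np.exp(np.fft.fft(x_hat))))

N = 1024
x = np.array([1.0, -0.5])           # minimum phase: zero at z = 0.5
c = np.real(np.fft.ifft(np.log(np.abs(np.fft.fft(x, N)))))
x_rec = min_phase_from_real_cepstrum(c)

print(x_rec[:3])                    # compare with [1, -0.5, 0]
```

For a sequence that is not minimum phase, the same procedure returns the minimum-phase sequence with the same spectral magnitude rather than the original.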


Impulse Train Convolved with Rational z-Transform Sequences

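Second class of sequences of interest in the speech context is the train of uniformly spaced unit samples with varying weights, and its interaction with the system:

[Figure: p[n] → h[n] → x[n]]

$$p[n] = \sum_{r=0}^{Q-1} \alpha_r\, \delta[n - rN], \qquad x[n] = h[n] * p[n]$$

The z-transform of p[n] is a polynomial in z^{-N}:

$$P(z) = \sum_{r=0}^{Q-1} \alpha_r z^{-rN} = \sum_{r=0}^{Q-1}\alpha_r (z^N)^{-r} = \prod_{r=0}^{Q-1}\left[1 - a_r (z^N)^{-1}\right]$$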


Impulse Train Convolved with Rational z-Transform Sequences

If p[n] is minimum phase and |a_r (z^N)^{-1}| < 1 (the zeros are inside the unit circle), log[P(z)] can be expressed as:

$$\log P(z) = \sum_{r=0}^{Q-1}\log\left[1 - a_r z^{-N}\right] = -\sum_{r=0}^{Q-1}\sum_{k=1}^{\infty}\frac{a_r^k}{k}\, z^{-kN}$$

Thus p̂[n] = Z⁻¹{log P(z)} is an infinite right-sided sequence of impulses spaced N samples apart.

Note that in general, for non-minimum-phase sequences, the complex cepstrum is two-sided with uniformly spaced impulses.
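The impulse structure of p̂[n] can be checked for the simplest case of two impulses. In this sketch, p[n] = δ[n] + α δ[n − N] with α = 0.5 and N = 50 is an assumed example (a single factor with a = −α), for which the series above predicts p̂[kN] = (−1)^{k+1} α^k / k and zero elsewhere.

```python
import numpy as np

alpha, N_period, nfft = 0.5, 50, 4096

# Two-impulse train: p[n] = delta[n] + alpha * delta[n - N_period]
p = np.zeros(N_period + 1)
p[0], p[N_period] = 1.0, alpha

P = np.fft.fft(p, nfft)
log_P = np.log(np.abs(P)) + 1j * np.unwrap(np.angle(P))
p_hat = np.real(np.fft.ifft(log_P))

# Impulses appear only at multiples of N_period, with alternating sign.
print(p_hat[50], p_hat[100], p_hat[150])   # compare with 0.5, -0.125, 0.0417
print(p_hat[25])                           # between impulses: approximately zero
```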


Example 6.6

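Consider a sequence x[n] = h[n]*p[n] where the z-transform of h[n] is given by:

[Equation: H(z), a rational transfer function with gain A and first-order factors in a and b.]

b, b*, and c, c* are complex-conjugate pairs.

[Figure: z-plane plot (Re and Im axes, unit circle marked) showing the locations labeled a, a*, b, b*.]

Consider p[n] to be a train of periodic pulses; then:

[Figure: p[n] → h[n] → x[n]]

$$p[n] = \sum_{k=0}^{\infty} \beta^k\, \delta[n - kP], \qquad x[n] = h[n] * p[n]$$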


Example 6.6 (cont)

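If β ∈ ℝ and |β| < 1, then p[n] is a train of decaying exponentials:

[Figure: plot of p[n] versus n, a decaying impulse train whose first impulse has height 1.]

The z-transform of p[n] is given by:

$$P(z) = \sum_{k=0}^{\infty} \beta^k z^{-kP} = \frac{1}{1 - \beta z^{-P}}$$

Then, as derived earlier:

$$\hat{x}[n] = \hat{h}[n] + \hat{p}[n]$$

Example 6.6 (cont)

[Figure: plots of h[n] and p[n].]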


Homomorphic Filtering

In the cepstral domain:

Pseudo-time ⇔ quefrency.

Low quefrency ⇔ slowly varying components.

High quefrency ⇔ fast varying components.

Removal of unwanted components (i.e., filtering) can be attempted in the cepstral domain (on the signal x̂[n], in which case filtering is referred to as liftering).

When the complex cepstrum of h[n] resides in a quefrency interval less than a pitch period, then the two components can be separated from each other.
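Low-quefrency liftering can be sketched end to end on a toy signal. Everything concrete here is an assumption made for illustration: h[n] = [1, −0.5] (minimum phase, with a rapidly decaying cepstrum), a two-impulse p[n] with spacing 50, the helper name `cceps`, and `np.unwrap` as the phase unwrapper.

```python
import numpy as np

nfft, period, alpha = 4096, 50, 0.5
h = np.array([1.0, -0.5])                    # hypothetical minimum-phase h[n]
p = np.zeros(period + 1)
p[0], p[period] = 1.0, alpha                 # two impulses, `period` samples apart
x = np.convolve(h, p)                        # x[n] = h[n] * p[n]

def cceps(sig, nfft):
    """Hypothetical helper: complex cepstrum via FFT with np.unwrap."""
    S = np.fft.fft(sig, nfft)
    return np.real(np.fft.ifft(np.log(np.abs(S)) + 1j * np.unwrap(np.angle(S))))

x_hat = cceps(x, nfft)                       # equals h_hat + p_hat

# Low-quefrency lifter: keep |n| < period/2 (negative n wraps to the buffer end).
half = period // 2
lifter = np.zeros(nfft)
lifter[:half] = 1.0
lifter[-(half - 1):] = 1.0
h_hat = lifter * x_hat

# Inverse characteristic system: DFT -> exp -> inverse DFT.
h_est = np.real(np.fft.ifft(np.exp(np.fft.fft(h_hat))))

print(h_est[:3])                             # compare with h = [1, -0.5, 0]
print(h_est[period])                         # the echo at n = period is removed
```

The separation works here because ĥ[n] has decayed to a negligible level before the first impulse of p̂[n] at n = 50.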



If log[X(ω)] is viewed as a “time signal” consisting of low-frequency and high-frequency contributions, the two can be separated with a low-pass or high-pass filter.

One implementation of a low-pass filter:

[Figure: x[n] = h[n]*p[n] → D＊ (＊ to +) → x̂[n] → multiplication by l[n] (+ to +) → ŷ[n] → D＊⁻¹ (+ to ＊) → y[n].]



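Alternate view of the “liftering” operation: a filtering operation L(ω) applied in the log-spectral domain.

Interchange the time and frequency domains by viewing the frequency-domain signal log[X(ω)] as a time signal to be filtered. ⇒

The “cepstrum” can be thought of as the spectrum of log[X(ω)].

The time axis of x̂[n] is referred to as “quefrency”.

The filter l[n] is referred to as the “lifter”.

[Figure: x[n] = h[n]*p[n] → F → log → X̂(ω) → F⁻¹ → x̂[n] → multiplication by l[n] → ŷ[n] → F → Ŷ(ω) → exp → F⁻¹ → y[n]. The three elements F⁻¹, l[n], and F are enclosed in dotted lines and labeled L(ω).]

The three elements in the dotted lines of the previous figure can be replaced by L(ω), which can be viewed as a smoothing function:

$$\hat{Y}(\omega) = L(\omega) \circledast \log[X(\omega)]$$

where ⊛ denotes convolution over ω, the frequency-domain counterpart of multiplying x̂[n] by l[n].

[Figure: x[n] = h[n]*p[n] → F → log → X̂(ω) → L(ω) → Ŷ(ω) → exp → F⁻¹ → y[n].]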


Practical Implementation Issues

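Use the FFT and IFFT for the Fourier transformations. X(ω) is computed by:

$$X(k) = \sum_{n=0}^{N-1} x[n]\, e^{-j 2\pi k n / N}$$

log X(ω) is computed as

$$\hat{X}(k) = \log X(k) = \log|X(k)| + j\angle X(k)$$

and for x̂[n] use

$$\hat{x}_N[n] = \frac{1}{N}\sum_{k=0}^{N-1}\hat{X}(k)\, e^{j 2\pi k n / N}$$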


Practical Implementation Issues

1. The cepstrum x̂[n] is infinitely long, thus x̂N[n] is an aliased version of x̂[n]. That is:

$$\hat{x}_N[n] = \sum_{r=-\infty}^{\infty}\hat{x}[n + rN]$$

Thus it is necessary to use as large an N as possible.

2. The phase component j∠X(k) must be properly unwrapped to ensure phase continuity:

$$\angle X(k) = PV[\angle X(k)] + 2\pi r[k]$$

The goal is to determine r[k] so that ∠X(k) is continuous.
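The aliasing relation can be verified numerically with a deliberately short DFT. The sketch assumes x[n] = δ[n] − a δ[n−1] with a = 0.9, whose true complex cepstrum is −aⁿ/n for n > 0, and N = 8; for this sequence the principal-value logarithm is already continuous, so `np.log` can be applied to X(k) directly without unwrapping.

```python
import numpy as np

a, N = 0.9, 8
x = np.array([1.0, -a])

# N-point discrete complex cepstrum (real part of X(k) is positive, so no unwrapping needed).
X = np.fft.fft(x, N)
x_hat_N = np.real(np.fft.ifft(np.log(X)))

def x_hat_true(n):
    """Closed-form complex cepstrum of 1 - a z^-1."""
    return -(a ** n) / n if n > 0 else 0.0

# Aliased sum over r of x_hat[1 + r*N]; terms for negative r vanish (right-sided).
aliased = sum(x_hat_true(1 + r * N) for r in range(2000))

print(x_hat_N[1], aliased, x_hat_true(1))   # DFT value matches the aliased sum, not -a
```

Increasing N pushes the aliased terms further out along the decaying tail, which is why a long DFT is recommended.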


Modulo 2π Phase Unwrapper

Goal is to determine r[k] so that ∠X(k) is continuous.

[Figure: Phase Representation in Discrete Complex Spectrum. The principal value PV[∠X(ω)] and its samples PV[∠X(k)], taken at a spacing of 2π/N, plotted between −π and π.]


Modulo 2π Phase Unwrapper Algorithm:

If PV[∠X(k)] − PV[∠X(k−1)] > 2π − ε, then r[k] = r[k−1] − 1 # Subtract 2π

Else if PV[∠X(k)] − PV[∠X(k−1)] < −(2π − ε), then r[k] = r[k−1] + 1 # Add 2π

Else r[k] = r[k−1] # Do not change

End

Note: Even with a fine frequency grid (spacing 2π/N, determined by N), it is possible that consecutive PV samples differ by a large natural phase change (the case of poles/zeros close together), which this test cannot reliably distinguish from a 2π jump.
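The algorithm translates almost line for line into code. In this sketch the function name, the default ε = 1.0, and the linear test phase are assumptions; the slides do not specify a value for ε.

```python
import numpy as np

def unwrap_mod_2pi(pv, eps=1.0):
    """Hypothetical helper: returns PV + 2*pi*r[k], with r[k] tracked as in the algorithm."""
    pv = np.asarray(pv, dtype=float)
    r = np.zeros(len(pv), dtype=int)
    for k in range(1, len(pv)):
        jump = pv[k] - pv[k - 1]
        if jump > 2 * np.pi - eps:          # upward wrap detected: subtract 2*pi
            r[k] = r[k - 1] - 1
        elif jump < -(2 * np.pi - eps):     # downward wrap detected: add 2*pi
            r[k] = r[k - 1] + 1
        else:                               # no wrap: keep the current multiple
            r[k] = r[k - 1]
    return pv + 2 * np.pi * r

true_phase = -0.3 * np.arange(100)          # steadily decreasing phase
pv = np.angle(np.exp(1j * true_phase))      # principal value in (-pi, pi]
unwrapped = unwrap_mod_2pi(pv)

print(np.max(np.abs(unwrapped - true_phase)))
```

On this smooth example the result agrees with `np.unwrap(pv)`; the two can differ when the phase changes rapidly between samples, which is the failure case the note above describes.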


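Phase Derivative-Based Phase Unwrapper

The phase derivative is uniquely defined by:

$$\dot{\angle X}(\omega) = \frac{d}{d\omega}\angle X(\omega) = \frac{X_r(\omega)\,\dot{X}_i(\omega) - X_i(\omega)\,\dot{X}_r(\omega)}{|X(\omega)|^2}$$

Then:

$$\angle X(\omega) = \int_0^{\omega}\dot{\angle X}(\beta)\,d\beta$$

However, since only X(ωk) is available, the integral must be estimated from discrete values.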
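The phase derivative can be computed without any unwrapping, because dX/dω is the Fourier transform of −j n x[n]. The sketch below uses that standard property; the function name and the test sequence [1, −0.5] are assumptions, and the comparison against a finite-difference derivative of `np.unwrap` output is only a sanity check.

```python
import numpy as np

def phase_derivative(x, nfft):
    """Hypothetical helper: (Xr * dXi - Xi * dXr) / |X|^2 on the DFT grid."""
    n = np.arange(len(x))
    X = np.fft.fft(x, nfft)
    dX = -1j * np.fft.fft(n * x, nfft)      # dX/domega = F{-j n x[n]}
    num = X.real * dX.imag - X.imag * dX.real
    return num / np.abs(X) ** 2

nfft = 1024
x = np.array([1.0, -0.5])
dphi = phase_derivative(x, nfft)

# Sanity check: numerical derivative of the unwrapped phase.
omega = 2 * np.pi * np.arange(nfft) / nfft
phi = np.unwrap(np.angle(np.fft.fft(x, nfft)))
dphi_num = np.gradient(phi, omega)

print(dphi[0])                                          # a / (1 - a) = 1 at omega = 0
print(np.max(np.abs(dphi[1:-1] - dphi_num[1:-1])))      # small finite-difference error
```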


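Phase Derivative-Based Phase Unwrapper

Re-state the problem:

$$\angle X(\omega_k) = PV[\angle X(\omega_k)] + 2\pi q(\omega_k)$$

where q(ωk) is an integer-valued function.

Assuming that the phase has been correctly unwrapped up to ωk−1 with the value θ(ωk−1), then:

$$\theta(\omega_k) = \theta(\omega_{k-1}) + \int_{\omega_{k-1}}^{\omega_k}\dot{\angle X}(\omega)\,d\omega$$

An approximation (trapezoidal rule):

$$\hat{\theta}(\omega_k) = \theta(\omega_{k-1}) + \frac{\omega_k - \omega_{k-1}}{2}\left[\dot{\angle X}(\omega_{k-1}) + \dot{\angle X}(\omega_k)\right]$$

Select the value of q(ωk) such that E[k] is minimized over q(ωk):

$$E[k] = \left|PV[\angle X(\omega_k)] + 2\pi q(\omega_k) - \hat{\theta}(\omega_k)\right|$$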


Example


Short-Time Homomorphic Analysis of Periodic Sequences

Recall the source-system model of speech production:

[Figure: p[n] → h[n] → x[n] = h[n]*p[n]]

For voiced speech p[n] is quasi-periodic:

$$p[n] = \sum_{k} \delta[n - kP]$$

For unvoiced speech p[n] is noise-like.

In practice a periodic waveform is windowed by a finite-length sequence w[n]:

s[n] = w[n]x[n] = w[n](p[n]*h[n])

Approximation to s[n]:

$$\tilde{x}[n] = (w[n]\,p[n]) * h[n]$$



If w[n] is smooth relative to h[n], and P is large enough so that the replicas h[n−kP] do not substantially overlap, then:

$$s[n] \approx \tilde{x}[n] = (w[n]\,p[n]) * h[n]$$

Then the cepstrum of s[n] is:

$$\hat{s}[n] \approx \hat{p}[n] + \hat{h}[n]$$

where p̂[n] is the complex cepstrum of w[n]p[n].

Can show that:

$$\hat{s}[n] = \hat{p}[n] + D[n]\sum_{k}\hat{h}[n - kP]$$

D[n] is a weighting function depending on w[n].



Cepstral Domain (Quefrency) Perspective

Under what conditions can we perform deconvolution?

Let x[n] be a voiced speech signal produced by an infinite train of periodic impulses:

$$p[n] = \sum_{k=-\infty}^{\infty}\delta[n - kP], \qquad x[n] = p[n] * h[n]$$

Thus the only samples in X(ω) and log[X(ω)] are defined at multiples of the fundamental frequency ωo = 2π/P, i.e., ωk = (2π/P)k:

X(ωk) = P(ωk) H(ωk)

log[X(ωk)] = log[P(ωk)] + log[H(ωk)]



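In the cepstral domain, log[H(ωk)] appears as a set of replicas of ĥ[n] at every kP:

$$\sum_k \hat{h}[n - kP]$$

Thus, aliasing is an issue and needs to be handled properly. That is, can this aliasing be prevented or at least minimized?

Consider:

s[n] = w[n]x[n] = w[n](p[n]*h[n])

Its Fourier transform is:

$$S(\omega) = \frac{1}{2\pi}\,[P(\omega)H(\omega)] \circledast W(\omega) = \frac{1}{P}\sum_k H(k\omega_o)\,W(\omega - k\omega_o)$$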



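Let's rewrite s[n] as:

s[n] = (p[n]w[n]) * g[n], where g[n] ≈ h[n].

Then:

$$S(\omega) = \frac{1}{P}\sum_k W(\omega - k\omega_o)\,G(\omega)$$

Taking the logarithm of the two expressions for S(ω), the one in terms of H(kωo) and the one in terms of G(ω), and solving for log[G(ω)], the following is obtained:

$$\log G(\omega) = \log\Big[\sum_k H(k\omega_o)\,W(\omega - k\omega_o)\Big] - \log\Big[\sum_k W(\omega - k\omega_o)\Big] \qquad (1)$$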



To simplify, assume W(ω) consists of a single rectangular main lobe. That is:

$$W(\omega) = \begin{cases}1, & |\omega| \le \omega_o/2 \\ 0, & \text{otherwise}\end{cases}$$

with ωo = 2π/P.



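Thus the second log term becomes zero:

$$\log G(\omega) = \log\Big[\sum_k H(k\omega_o)\,W(\omega - k\omega_o)\Big] - \underbrace{\log\Big[\sum_k W(\omega - k\omega_o)\Big]}_{0}$$

and, because the shifted lobes W(ω − kωo) do not overlap:

$$\begin{aligned}
\log G(\omega) &= \log\Big[\sum_k H(k\omega_o)\,W(\omega - k\omega_o)\Big] \\
&= \sum_k \log[H(k\omega_o)]\,W(\omega - k\omega_o) \\
&= W(\omega) \circledast \sum_k \log[H(k\omega_o)]\,\delta(\omega - k\omega_o)
\end{aligned} \qquad (2)$$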



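From (1) and (2) we can write:

$$\hat{s}[n] = \hat{p}[n] + \hat{g}[n]$$

where p̂[n] is the complex cepstrum of p[n]w[n], and, in the quefrency domain,

$$\hat{g}[n] = \mathcal{F}^{-1}\{\log G(\omega)\} = w[n]\sum_k \hat{h}[n - kP]$$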



The last equation, ĝ[n] = w[n] Σk ĥ[n−kP], is a special case of the earlier result ŝ[n] = p̂[n] + D[n] Σk ĥ[n−kP] with D[n] = w[n].

As with the purely convolutional model x̃[n] = (w[n]p[n]) * h[n], the contributions of the windowed pulse train and the impulse response are additively combined so that deconvolution is possible.

Now the impulse response contribution is repeated at the pitch period rate. This aliasing is dependent upon pitch, and is different from the aliasing due to an insufficient DFT length (see section 6.4.4).



Conditions under which s[n] ≈ (w[n]p[n])*h[n]:

1. w[n], the time-domain window, should be long enough so that D[n] is smooth for |n| < P, over the extent of ĥ[n].

2. w[n] should be short enough to reduce the contribution of the replicas of ĥ[n]. In practice w[n] is a Hamming window 2-3 pitch periods long.

3. w[n] should be centered at the time origin, n = 0, aligned with h[n].

Under those conditions, for a low-time lifter (a filter in the cepstral domain) l[n] of extent |n| < P/2, the complex cepstrum is close to that derived from the conventional model:

$$\tilde{x}[n] = (w[n]\,p[n]) * h[n]$$

Note that with high-pitched speakers there is a stronger presence of p̂[n] close to the origin (as noted earlier) as well as more aliasing of the replicas of ĥ[n].


Frequency Domain Perspective

Let x[n] = p[n] * h[n], where:

$$p[n] = \sum_{k=-\infty}^{\infty}\delta[n - kP]$$

Then: X(ωk) = P(ωk) H(ωk)

where X(ωk) represents a line spectrum at ωk = (2π/P)k.

The question arises: under what conditions on the window does the convolutional model

$$\tilde{x}[n] = (w[n]\,p[n]) * h[n]$$

come close to the actual windowed signal

s[n] = w[n]x[n] = w[n](p[n]*h[n])?


Frequency Domain Perspective

Define an error measure E(ω) that would reflect degradation in the frequency domain:

$$E(\omega) = \frac{S(\omega)}{\tilde{X}(\omega)}$$

Want to minimize:

$$D = \frac{1}{2\pi}\int_{-\pi}^{\pi}\left|\log E(\omega)\right|^2 d\omega$$

It was found empirically that for a Hamming window this spectral distance measure is minimized for a window length in the range of roughly 2-3 pitch periods.

An implication of this result is that the length of the analysis window should be adapted to the pitch period to make the windowed waveform as close as possible (in the sense described above) to the desired convolutional model.


Short-Time Speech Analysis

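Complex Cepstrum of Voiced Speech

Recall:

H(z) = A G(z) V(z) RL(z)

where A is the gain, G(z) the glottal model, V(z) the vocal tract model, and RL(z) the lip radiation model.

The output speech then is:

$$x[n] = p[n] * h[n] = A\,p[n] * g[n] * v[n] * r_L[n], \qquad h[n] = A\,g[n] * v[n] * r_L[n]$$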


Complex Cepstrum of Voiced Speech

General form for stable V(z):

$$V(z) = \frac{\prod_{k=1}^{M_i}(1 - a_k z^{-1})\prod_{k=1}^{M_o}(1 - b_k z)}{\prod_{k=1}^{N_i}(1 - c_k z^{-1})}$$

Zeros inside and outside the unit circle.

Poles inside the unit circle.

The goal is to separate h[n] from p[n]. Let s[n] = w[n](p[n]*h[n]) be approximately equal to

$$\tilde{x}[n] = (w[n]\,p[n]) * h[n]$$



Recall that x̃[n] ≈ s[n] if the window is 2-3 pitch periods long and its center is aligned with h[n].

Using a DFT of order N, the following denotes the discrete complex cepstrum:

$$\hat{s}_N[n] \approx \hat{p}_N[n] + \hat{h}_N[n]$$

For a typical speaker the duration of the short-time window lies in the range of 20 ms to 40 ms.

Assuming that:

Source and system components lie roughly in separate quefrency regions,

There is negligible aliasing of the replicas of ĥ[n],

Most of ĥ[n] occurs within P/2 of the origin,

The distortion function D[n] is smooth in the same range, |n| < P/2, and thus makes the other higher-order replicas negligible for |n| > P/2.

Then, apply a cepstral lifter function:



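Low-quefrency lifter:

$$l[n] = \begin{cases}1, & |n| < P/2 \\ 0, & \text{elsewhere}\end{cases}$$

to separate ĥ[n] from p̂[n].

Similarly, a high-quefrency lifter can be used to produce the input pulse train (pitch estimation):

$$l[n] = \begin{cases}0, & |n| < P/2 \\ 1, & \text{elsewhere}\end{cases}$$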
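Pitch estimation from the high-quefrency region can be sketched with a synthetic voiced frame. The signal here is an assumption made for illustration (a single decaying resonance driven by an impulse train with a 5 ms period at 10 kHz, 15 ms Hamming window, 1024-point FFT), and the search range of 2 ms to 20 ms is an arbitrary choice, not something the slides prescribe.

```python
import numpy as np

fs, period, nfft = 10_000, 50, 1024           # 5 ms pitch period at 10 kHz

# Hypothetical vocal-tract response: one decaying resonance near 1 kHz.
n = np.arange(40)
h = 0.8 ** n * np.cos(0.2 * np.pi * n)

# Periodic excitation and the resulting "voiced" waveform.
p = np.zeros(600)
p[::period] = 1.0
x = np.convolve(p, h)[:600]

# Short-time analysis: 15 ms Hamming window (3 pitch periods), real cepstrum.
frame = x[200:350] * np.hamming(150)
spectrum = np.abs(np.fft.fft(frame, nfft)) + 1e-12   # small offset guards log(0)
c = np.real(np.fft.ifft(np.log(spectrum)))

# High-quefrency region: look for the strongest peak between 2 ms and 20 ms.
lo, hi = 20, 200
peak = lo + int(np.argmax(c[lo:hi]))
pitch_hz = fs / peak

print(peak, pitch_hz)    # peak quefrency in samples, and the implied pitch in Hz
```

The absence of a clear peak in this region is what would suggest an unvoiced frame.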


Example 6.11

Voiced female speech with a pitch period of 5 ms.

Sampling rate fs = 10 kHz.

Hamming window of 15 ms.

A 1024-point FFT/IFFT is used to obtain the discrete complex cepstrum.

Center the window on h[n] (more about that later).

Example 6.11

[Figure: Example 6.11 plots, with panels labeled Maximum Phase and Minimum Phase.]

Complex Cepstrum of Unvoiced Speech

Recall the transfer function model for the unvoiced speech:

H(z) = AV(z)R(z)

In contrast to the voiced case, there is no glottal volume velocity contribution.

Resulting speech waveform in the time domain, where u[n] is white noise:

x[n] = u[n]*h[n] = u[n]*v[n]*r[n]

Resulting signal after applying the short-time analysis window:

s[n] = w[n](u[n]*h[n])

Similarly to the arguments applied for voiced speech, if:

The duration of the analysis window w[n] is selected so that the formants of the unvoiced speech power spectral density are not significantly broadened, and

w[n] is sufficiently smooth so as to be nearly constant over h[n],

then the following can be assumed:

s[n] ≈ (w[n]u[n])*h[n]

Defining the windowed white noise as q[n] = u[n]w[n], and computing the discrete complex cepstrum with an N-point DFT:

$$\hat{s}_N[n] \approx \hat{q}_N[n] + \hat{h}_N[n]$$

q̂N[n], the discrete complex cepstrum of the noise source, covers all quefrencies, and thus separation is not possible.

Phase unwrapping of noisy signals is very unreliable.

The real cepstrum is adequate for unvoiced speech (phase information is not important for this case), resulting in minimum-phase versions of h[n].

The deconvolved excitation may contain interesting fine source structure for some classes of sounds, e.g., voiced fricatives.


Analysis/Synthesis Structure

In speech analysis underlying parameters of the speech model are estimated

In speech synthesis stage the waveform is reconstructed from the model parameters.

Liftering of low-quefrency region of the cepstrum ⇒ provides an estimate of the system impulse response

Liftering of high-quefrency region of the cepstrum ⇒ provides an estimate of source excitation signal.

Inverting the estimate of the source signal with the homomorphic system D＊⁻¹ yields the excitation function.

Convolution of the two resulting component estimates yields the original short-time segment exactly.

Analysis/Synthesis Structure

With an overlap-add reconstruction from the short-time segments, the entire waveform is recovered.

The homomorphic system performs the transformation with no information reduction.

This process is analogous to reconstructing the waveform, in linear prediction analysis/synthesis, from the convolution of the all-pole filter and the output of its inverse filter.

In speech coding and speech modification applications a more efficient representation is desired.

The complex or real cepstrum provides an approach to such a representation because pitch and voicing can be estimated from the peak (or lack of a peak) in the high-quefrency region of the cepstrum.


Zero and Minimum-Phase Synthesis

Assuming that we have a succinct and accurate characterization of the speech production source (as with linear prediction-based analysis/synthesis), we are able to synthesize an estimate of the speech waveform.

This synthesis can be performed based on any one of several possible phase functions:

Zero-phase

Minimum-phase, maximum-phase

Mixed-phase functions

Zero and Minimum-Phase Synthesis

General framework for homomorphic analysis/synthesis:

[Figure: homomorphic analysis/synthesis framework, labeled with an analysis window of 10-20 ms, a 1024-point real cepstrum, and a lifter of extent P/2.]

Mixed-Phase Synthesis

Example 6.13


Contrasting Linear Prediction and Homomorphic Filtering

Homomorphic filtering is viewed as an alternative to linear prediction.

Linear Prediction | Homomorphic Filtering
Parametric | Non-parametric
Sharp smooth resonances | Wider spurious resonances
All-pole representation | Poles and zeros can be represented
Minimum-phase response estimate only | Minimum-phase as well as mixed-phase if the complex cepstrum is used
Synthesized speech “crisper” but more “mechanical” | Synthesized speech more “natural” but “muffled”

Contrasting Linear Prediction and Homomorphic Filtering

Similar problems with both methods:

Linear Prediction | Homomorphic Filtering
Increased speech distortion with increasing pitch | Aliasing of the vocal tract impulse response at the pitch period repetition rate
Linear prediction windowing results in the prediction of nonzero values of the waveform from zeros outside the window | Windowing a periodic waveform distorts the convolutional model
Number of poles is required | The length of the low-quefrency lifter must be chosen

Best window and order selection is often a function of the pitch of the speaker.


Homomorphic Prediction

A number of speech analysis methods rely on combining homomorphic filtering with linear prediction and are referred to collectively as homomorphic prediction.

Two primary advantages of combining the methods:

1. By reducing the effects of waveform periodicity, an all-pole estimate suffers less from the effect of high-pitch aliasing.

2. By removing ambiguity in waveform alignment, zero estimation can be performed without the requirement of pitch-synchronous analysis.

Waveform Periodicity:

Recall that for a waveform consisting of the convolution of a short-time impulse train and an impulse response,

x[n] = p[n]*h[n]

the autocorrelation function is given by the convolution of the autocorrelation function of the response and that of the impulse train:

rx[τ] = rh[τ]*rp[τ]

Thus, as the spacing between impulses (the pitch period) decreases, the autocorrelation function of the impulse response suffers from increasing distortion.

Thus, if the spectral magnitude of h[n] can be estimated accurately, then linear prediction analysis can be performed with an estimate of rh[τ] free of the waveform periodicity. This leads to the following idea:

1. Use homomorphic filtering to deconvolve an estimate of h[n] by low-pass liftering the real or complex cepstrum of x[n].

2. Use the autocorrelation method of linear prediction analysis on the resulting impulse response estimate to obtain the model parameters.


Example 6.14

Suppose h[n] is a minimum-phase all-pole sequence of order p. Consider a waveform x[n] constructed by convolving h[n] with a sequence p[n] where:

p[n] = δ[n] + αδ[n−N], with α < 1

The complex cepstrum of x[n] is given by:

$$\hat{x}[n] = \hat{p}[n] + \hat{h}[n]$$

where p̂[n] and ĥ[n] are the complex cepstra of p[n] and h[n], respectively.

The autocorrelation function is given by:

rx[τ] = (1+α²) rh[τ] + α rh[τ−N] + α rh[τ+N]

rx[τ] is rh[τ] distorted by its neighboring terms centered at τ = +N and τ = −N.
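The autocorrelation identity of Example 6.14 can be checked numerically. The sketch uses an arbitrary short h[n] drawn from a seeded random generator (an assumption; the identity does not depend on h[n] being all-pole), with α = 0.6 and N = 20 chosen so that the three shifted copies of rh[τ] do not overlap.

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.standard_normal(8)                   # hypothetical short impulse response
alpha, N = 0.6, 20

p = np.zeros(N + 1)
p[0], p[N] = 1.0, alpha                      # p[n] = delta[n] + alpha * delta[n - N]
x = np.convolve(h, p)

def autocorr(v):
    """Deterministic autocorrelation for lags -(len(v)-1) .. +(len(v)-1)."""
    return np.correlate(v, v, mode="full")

r_x, r_h = autocorr(x), autocorr(h)

# Build (1 + alpha^2) r_h[tau] + alpha r_h[tau - N] + alpha r_h[tau + N].
center = len(x) - 1                          # index of lag 0 in r_x
pred = np.zeros_like(r_x)
for shift, gain in ((0, 1 + alpha ** 2), (N, alpha), (-N, alpha)):
    start = center + shift - (len(h) - 1)
    pred[start:start + len(r_h)] += gain * r_h

print(np.max(np.abs(r_x - pred)))            # difference between the two sides
```

When N is smaller than the extent of rh[τ], the shifted copies overlap the central term, which is the distortion the example refers to.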


Homomorphic Prediction

Important points of the previous example:

The first p coefficients of the real cepstrum of x[n] are undistorted (if a long-enough DFT length is used in the computation).

The first p coefficients of the autocorrelation function rx[τ] of the waveform are distorted by aliasing of autocorrelation replicas (regardless of the DFT length).

A cepstral lowpass lifter of duration less than p extracts a smoothed and not aliased version of the spectrum.

Linear prediction coefficients can alternatively be obtained exactly through the recursive relation between the real cepstrum and predictor coefficients of the all-pole model when h[n] is all-pole (Exercise 6.13).
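One common form of that recursion can be sketched as follows. The slides do not state it, so the details are assumptions: the all-pole model is written H(z) = 1 / (1 − Σ a_k z^{-k}), the complex cepstrum of the minimum-phase h[n] is taken as ĥ[n] = 2c[n] for n > 0, and the recursion used is a_n = ĥ[n] − Σ_{k=1}^{n−1} (k/n) ĥ[k] a_{n−k}. The function name and the second-order test filter are hypothetical.

```python
import numpy as np

def lpc_from_cepstrum(h_hat, order):
    """Hypothetical helper: predictor coefficients a[1..order] from h_hat[1..order]."""
    a = np.zeros(order + 1)                  # a[0] is unused
    for n in range(1, order + 1):
        acc = sum((k / n) * h_hat[k] * a[n - k] for k in range(1, n))
        a[n] = h_hat[n] - acc
    return a[1:]

nfft = 4096
a_true = np.array([1.2, -0.72])              # H(z) = 1 / (1 - 1.2 z^-1 + 0.72 z^-2)

# Real cepstrum of h[n], computed from log|H| = -log|A| on the DFT grid.
A = np.fft.fft(np.concatenate(([1.0], -a_true)), nfft)
c = np.real(np.fft.ifft(-np.log(np.abs(A))))

h_hat = 2.0 * c                              # minimum phase: h_hat[n] = 2 c[n] for n > 0
a_est = lpc_from_cepstrum(h_hat, order=2)

print(a_est)                                 # compare with a_true
```

Only the first p cepstral coefficients enter the recursion, which is why an undistorted low-quefrency region is sufficient.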


Homomorphic Prediction

Zero Estimation:

Consider a transfer function of poles and zeros of the form:

$$H(z) = \frac{N(z)}{D(z)}$$

Also consider a sequence x[n] = h[n]*p[n] where p[n] is a periodic impulse train.

Suppose that:

An estimate of h[n] is obtained through homomorphic filtering of x[n],

The number of poles and zeros is known, and

The linear-phase component z^{-r} has been removed.

Then the poles of h[n] can be estimated using the covariance method of linear prediction.

Other methods can be used (e.g., Shanks' method described in Chapter 5) to estimate zeros.


Homomorphic Prediction


Summary

The focus of this chapter was the use of homomorphic filtering with application to deconvolution: separation of the source from a system.

The presented methodology is general and can be applied to more than the deconvolution of the vocal tract from the glottal source.

Example applications:

Control of the dynamic range of multiplicatively combined signals (Exercise 6.19).

Recovery of speech from degraded recordings. Old acoustic recordings suffer from convolutional distortion imparted by an acoustic horn that can be approximated by a linear resonant filter. See Exercise 6.20 for details.

In image processing, homomorphic filtering can be used for contrast enhancement (see Oppenheim and Schafer, “Digital Signal Processing”, p. 487, Prentice Hall, 1975).


Summary

Homomorphic processing is applied in the phase vocoder and sinewave analysis/synthesis.

It also has been found useful in speech coding (Chapter 12) and speaker recognition (Chapter 14).

It is also a basis for the mel-cepstrum: the Fourier transform of a constant-Q filtered log-spectrum.

It is hypothesized that the mel-cepstrum approximates signal processing in the early stages of human auditory perception.

Homomorphic filtering applied along the temporal trajectories of the mel-cepstral coefficients can be used to remove convolutional channel distortions even when the cepstrum of these distortions overlaps the cepstrum of speech (Chapter 13): Cepstral Mean Subtraction and RASTA processing.

END

