Speech signal time frequency representation

Methods and algorithms of speech recognition course. Nikolay V. Karpov, nkarpov(а)hse.ru. Lecture 2: Time-Frequency Representation


Page 1: Speech signal time frequency representation

Methods and algorithms of speech recognition course

Nikolay V. Karpov

nkarpov(а)hse.ru

Lecture 2

Time-Frequency Representation

Page 2: Speech signal time frequency representation

This lecture is concerned with the spectrogram as a representation of how the frequency components within a signal vary with time.

◦ Define what we mean by normalised time and frequency
◦ Define the short-term discrete Fourier transform
◦ Look at the effect of different window lengths on time and frequency resolution
◦ Derive a quantitative description of the frequency resolution of the short-term DFT and compare the performance of common windows
◦ Derive a quantitative description of the time resolution of the short-term DFT
◦ Explain the uncertainty principle and illustrate its effects with some examples
◦ Review some of the properties of the DFT

Time-Frequency Representation

Page 3: Speech signal time frequency representation

The sampling theorem says that when the analog signal x(t) is band-limited to the range from 0 to W Hz and is sampled every T = 1/(2W) seconds, the original signal can be completely reproduced from its samples x(iT).

Sampling in the time domain
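As a rough numerical illustration of the sampling theorem, the sketch below rebuilds a band-limited cosine between its sample points by sinc interpolation. The bandwidth `W`, the test tone `f`, and the helper name `sinc_reconstruct` are assumptions made for this example only. The ideal sum is infinite, so truncating it to 400 samples makes the result approximate.

```python
import numpy as np

def sinc_reconstruct(samples, T, t):
    """Approximate x(t) from samples x(iT), i = 0..len(samples)-1, by sinc interpolation."""
    i = np.arange(len(samples))
    # np.sinc(u) = sin(pi*u)/(pi*u), so each term is x(iT) * sinc((t - iT)/T)
    return np.sum(samples * np.sinc((t[:, None] - i * T) / T), axis=1)

W = 100.0                # assumed bandwidth in Hz
T = 1.0 / (2.0 * W)      # sample period T = 1/(2W)
f = 30.0                 # assumed test tone, well inside the band
i = np.arange(400)
samples = np.cos(2 * np.pi * f * i * T)

# Evaluate halfway between sample points, away from the ends of the record,
# where truncating the infinite sum hurts the most.
t = (np.arange(150, 250) + 0.5) * T
x_hat = sinc_reconstruct(samples, T, t)
max_err = np.max(np.abs(x_hat - np.cos(2 * np.pi * f * t)))
print(f"max reconstruction error: {max_err:.4f}")
```

The error shrinks as more samples are included on either side of the evaluation points.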

Page 4: Speech signal time frequency representation

For signals whose bandwidth is not known, a low-pass filter is used to restrict the bandwidth before sampling. When a signal is sampled in violation of the sampling theorem, aliasing distortion occurs: components above half the sampling frequency are folded back to lower frequencies and distort the spectrum, as shown in the figure.

Sampling in the frequency domain
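Aliasing can be checked numerically with a minimal sketch. The sampling frequency of 8 kHz and the 7 kHz tone are assumed values chosen for illustration. After sampling, the 7 kHz cosine is indistinguishable from a 1 kHz cosine.

```python
import numpy as np

fs = 8000.0                  # assumed sampling frequency in Hz
n = np.arange(64)            # sample index
f_high = 7000.0              # above the Nyquist frequency fs/2 = 4000 Hz
f_alias = fs - f_high        # 1000 Hz, the frequency it folds back to

x_high = np.cos(2 * np.pi * f_high * n / fs)
x_alias = np.cos(2 * np.pi * f_alias * n / fs)

# The two sampled sequences agree to within floating-point error
same = np.allclose(x_high, x_alias)
print(same)
```

This is why the low-pass (anti-aliasing) filter has to be applied before sampling: once the samples are taken, the two tones can no longer be told apart.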

Page 5: Speech signal time frequency representation

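Normalised Frequency

Simple continuous signal:

$$x(t) = A\cos(2\pi f t)$$

Sampled at $t = i/F_s$, with normalised radian frequency $\omega$ and normalised time $i$:

$$x(i) = A\cos\!\left(2\pi f\,\frac{i}{F_s}\right) = A\cos(\omega i) = A\cos\!\left(2\pi\frac{k}{N}\,i\right)$$

$$0 \le \frac{\omega}{2\pi} = \frac{f}{F_s} = \frac{k}{N} \le 0.5$$

Reconstruction of the continuous signal from its samples:

$$x(t) = \sum_{i=-\infty}^{\infty} x\!\left(\frac{i}{2W}\right) g\!\left(t - \frac{i}{2W}\right), \qquad g(t) = \frac{\sin(2\pi W t)}{2\pi W t}$$

Designate:

$$x\!\left(\frac{i}{2W}\right) = x(iT) = x\!\left(\frac{i}{F_s}\right) = x(i)$$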

Page 6: Speech signal time frequency representation

With sampled data systems it is customary to change the time scale and work in units of the sample period. In normalised units the sampling frequency (fs) and period (T) both equal 1. They can therefore be omitted from equations; any such omissions can be deduced using dimensional consistency arguments.

The Nyquist frequency is ½ Hz = π radians/second in normalised units.

To convert back to real units:

◦ any time quantity must be multiplied by the real sample period (or divided by the real sample frequency)

◦ any frequency or angular frequency quantity must be multiplied by the real sample frequency (or divided by the real sample period)
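The two conversion rules can be written as a short sketch. The sampling frequency of 16 kHz and the function names are assumptions for this example.

```python
import math

fs = 16000.0        # assumed real sampling frequency in Hz
T = 1.0 / fs        # real sample period in seconds

def time_to_seconds(n_samples):
    """Normalised time (in samples) -> seconds: multiply by the sample period."""
    return n_samples * T

def freq_to_hz(f_norm):
    """Normalised frequency (cycles/sample) -> Hz: multiply by the sample frequency."""
    return f_norm * fs

def omega_to_rad_per_s(omega_norm):
    """Normalised angular frequency (radians/sample) -> radians/second."""
    return omega_norm * fs

window_s = time_to_seconds(400)             # a 400-sample window
nyquist_hz = freq_to_hz(0.5)                # normalised Nyquist frequency
nyquist_rad = omega_to_rad_per_s(math.pi)   # same frequency as an angular frequency
print(window_s, nyquist_hz, nyquist_rad)
```

With these assumed values, a 400-sample window lasts 25 ms and the Nyquist frequency is 8 kHz.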

Normalised Time

Page 7: Speech signal time frequency representation

The spectrogram shows the energy in a signal at each frequency and at each time. We calculate this by evaluating the short-term discrete Fourier transform

Spectrogram

Z-transform

$$X(z) = \sum_{n=-\infty}^{\infty} x(n)\,z^{-n}$$

$$x(n) = \frac{1}{2\pi i}\oint_{c} X(z)\,z^{n-1}\,dz$$

Page 8: Speech signal time frequency representation

The Fourier transform of x(n) can be computed from the z-transform as:

Fourier transform

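$$X(\omega) = X(z)\big|_{z=e^{i\omega}} = \sum_{n=-\infty}^{\infty} x(n)\,e^{-i\omega n}$$

The Fourier transform may be viewed as the z-transform evaluated around the unit circle. The Discrete Fourier Transform (DFT) is defined as a sampled version of the Fourier transform shown above:

$$X(k) = \sum_{n=0}^{N-1} x(n)\,e^{-2\pi i k n/N}$$

The inverse DFT is:

$$x(n) = \frac{1}{N}\sum_{k=0}^{N-1} X(k)\,e^{2\pi i k n/N}$$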

The Fast Fourier Transform (FFT) is simply an efficient computation of the DFT.
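A minimal sketch of this point: the DFT and inverse DFT sums are evaluated directly and compared with NumPy's FFT. The helper names `dft` and `idft` and the random test signal are assumptions for this example. The direct sums cost O(N²) operations, whereas the FFT gives the same values in O(N log N).

```python
import numpy as np

def dft(x):
    """Direct evaluation of X(k) = sum_n x(n) exp(-2*pi*i*k*n/N)."""
    x = np.asarray(x, dtype=complex)
    N = len(x)
    n = np.arange(N)
    k = n[:, None]                       # one row per frequency index k
    return np.sum(x * np.exp(-2j * np.pi * k * n / N), axis=1)

def idft(X):
    """Direct evaluation of x(n) = (1/N) sum_k X(k) exp(+2*pi*i*k*n/N)."""
    X = np.asarray(X, dtype=complex)
    N = len(X)
    k = np.arange(N)
    n = k[:, None]                       # one row per time index n
    return np.sum(X * np.exp(2j * np.pi * k * n / N), axis=1) / N

rng = np.random.default_rng(0)
x = rng.standard_normal(16)              # arbitrary real test signal
X = dft(x)

matches_fft = np.allclose(X, np.fft.fft(x))
round_trip = np.allclose(idft(X), x)
print(matches_fft, round_trip)
```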

Page 9: Speech signal time frequency representation

The Discrete Cosine Transform (DCT) is simply a DFT of a signal that is assumed to be real and even (real and even signals have real spectra!)

The Discrete Cosine Transform

$$X(k) = x(0) + 2\sum_{n=1}^{N-1} x(n)\cos(2\pi k n/N), \qquad k = 0, 1, \ldots, N-1$$

Note that these are not the only transforms used in speech processing: wavelets, fractals, Wigner distributions, etc. are also used.

Page 10: Speech signal time frequency representation

We often want to estimate the “power spectrum” of a non-stationary signal at a particular instant of time.

Multiply by a finite length window and take the DFT.

For a window of length N ending on sample m we have:

Short-Term Discrete Fourier Transform

$$X(k;m) = \sum_{n=0}^{N-1} w(n)\,x(m-n)\,e^{-2\pi i k (m-n)/N}$$

where k is the frequency index and m is the time index.
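The definition can be sketched directly in code. The function name `short_term_dft`, the rectangular window, and the test tone at bin `k0` are assumptions chosen so that the result is easy to check by hand: for a cosine at exactly bin k0, the value X(k0; m) equals N/2 for every window position m, which illustrates the consistent phase origin.

```python
import numpy as np

def short_term_dft(x, w, m):
    """X(k; m) for k = 0..N-1, with a window of length N = len(w) ending on sample m (m >= N-1)."""
    N = len(w)
    n = np.arange(N)
    k = np.arange(N)[:, None]
    segment = x[m - n]                        # x(m - n): the window runs backwards in time
    phase = np.exp(-2j * np.pi * k * (m - n) / N)
    return np.sum(w * segment * phase, axis=1)

N, k0 = 32, 4
x = np.cos(2 * np.pi * k0 * np.arange(200) / N)   # assumed test tone at bin k0
w = np.ones(N)                                    # rectangular window, for simplicity

X_a = short_term_dft(x, w, m=40)
X_b = short_term_dft(x, w, m=57)
print(X_a[k0], X_b[k0])   # both close to N/2, independent of m
```

Without the (m - n) term in the exponent, the phase of X(k0; m) would rotate as the window slides along the signal.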

Page 11: Speech signal time frequency representation

Short-Term Discrete Fourier Transform: Notes

$$|X(k;m)|^2 = X(k;m)\,X^*(k;m)$$

gives the power at a frequency of k/N Hz (normalised) for a window centred at m - 0.5(N-1).

◦ The (m - n) term in the exponent means that the phase origin remains consistent by cancelling out the linear phase shift introduced by a delay of m samples.
◦ The window samples are numbered backwards in time (for convenience later), hence the summation is performed backwards in time.
◦ The values X(k;m) are based on the N signal values from m-N+1 to m.
◦ The frequency resolution is 1/N Hz (normalised units).
◦ The spectrum is periodic and (since w and x are real) conjugate symmetric:

$$X(k) = X(k+N) = X^*(-k)$$

◦ There are therefore only 0.5N + 1 independent frequency values.

Page 12: Speech signal time frequency representation

Sound «is»

Window multiplication

Hamming window of length 401 samples centred on 400

Page 13: Speech signal time frequency representation

Line spectrum from larynx oscillation (at about 0.01 normalised Hz) is superimposed on vocal tract resonances

Vocal sound

Page 14: Speech signal time frequency representation

Three different Hamming windows:

◦ Wide window (N = 400). Independent frequency values = N/2 + 1 = 201.
◦ Short window (N = 30) eliminates the fine detail. N/2 + 1 = 16.
◦ Zero-padded window gives more spectrum points and an illusion of more detail (N = 480 = 30 × 16). This is the normal case for a spectrogram. N/2 + 1 = 241.
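The effect of zero padding can be sketched as follows. The random data standing in for 30 speech samples is an assumption. Padding the 30-sample frame to 480 points gives 241 spectrum values instead of 16, but every 16th padded value coincides with an unpadded one: the extra points interpolate the same spectrum and add no new information.

```python
import numpy as np

N_short, N_fft = 30, 480
rng = np.random.default_rng(0)
x = rng.standard_normal(N_short)           # stand-in for 30 speech samples (assumed data)
frame = x * np.hamming(N_short)

X_short = np.fft.rfft(frame)               # 30-point DFT: N/2 + 1 = 16 values
X_padded = np.fft.rfft(frame, n=N_fft)     # same samples zero-padded to 480: 241 values

print(len(X_short), len(X_padded))         # prints: 16 241

# Every 16th point of the padded spectrum equals the unpadded spectrum
interpolates = np.allclose(X_padded[::16], X_short)
print(interpolates)
```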

Page 15: Speech signal time frequency representation

Frequency resolution

$$X(k;m) = \sum_{n=0}^{N-1} w(n)\,x(m-n)\,\exp\!\left(-2\pi i k (m-n)/N\right)$$

By setting r = N-1-n we can rewrite this as:

$$X(k;m) = \exp\!\left(-2\pi i k\,\frac{m-N+1}{N}\right)\sum_{r=0}^{N-1} y_m(r)\,\exp\!\left(-2\pi i k r/N\right)$$

where

$$y_m(r) = w(N-1-r)\,x(m-N+1+r)$$

This is a standard DFT multiplied by a phase-shift term that is proportional to k. The phase shift compensates for the starting time of the window, m-N+1.

y_m(r) is a product of two signals, so its DFT is the convolution of the DFTs of w(N-1-r) and x(m-N+1+r).
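The rewritten form can be checked numerically against the defining sum. In this sketch the function names, the random test signal, and the window position m = 150 are assumptions. The second function computes the standard DFT of y_m(r) with NumPy's FFT and then applies the phase-shift term.

```python
import numpy as np

def stdft_direct(x, w, m):
    """Defining sum: X(k; m) = sum_n w(n) x(m-n) exp(-2*pi*i*k*(m-n)/N)."""
    N = len(w)
    n = np.arange(N)
    k = np.arange(N)[:, None]
    return np.sum(w * x[m - n] * np.exp(-2j * np.pi * k * (m - n) / N), axis=1)

def stdft_via_fft(x, w, m):
    """Same quantity as a standard DFT of y_m(r) times a phase-shift term."""
    N = len(w)
    r = np.arange(N)
    y = w[N - 1 - r] * x[m - N + 1 + r]          # y_m(r) = w(N-1-r) x(m-N+1+r)
    k = np.arange(N)
    phase_shift = np.exp(-2j * np.pi * k * (m - N + 1) / N)
    return phase_shift * np.fft.fft(y)

rng = np.random.default_rng(1)
x = rng.standard_normal(300)     # arbitrary real test signal
w = np.hamming(64)

agree = np.allclose(stdft_direct(x, w, 150), stdft_via_fft(x, w, 150))
print(agree)
```

In practice the FFT form is the one worth using, since it replaces the O(N²) sum with an O(N log N) transform per frame.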

Page 16: Speech signal time frequency representation

Two signals resolution

Page 17: Speech signal time frequency representation

Time-domain windowing

$$c_k = \cos\!\left(\frac{2\pi k\,(n - N/2)}{N}\right)$$

Hamming window: $w(n) = 0.54 + 0.46\,c_1$
◦ Sidelobe: -42.5 dB
◦ Mainlobe width (-3 dB): 0.039
◦ Leakage factor: 0.03%
◦ a = 1.81, b = 4

Rectangular window: $w(n) = 1$
◦ Sidelobe: -13.3 dB
◦ Mainlobe width (-3 dB): 0.027
◦ Leakage factor: 9.14%
◦ a = 1.21, b = 2
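The sidelobe figures can be estimated numerically with a short sketch. The window length N = 64, the FFT size, and the helper name `peak_sidelobe_db` are assumptions for this example. The exact values depend on N and on how the window is defined, so the output should be close to the levels quoted above, not necessarily identical.

```python
import numpy as np

def peak_sidelobe_db(w, nfft=1 << 16):
    """Highest sidelobe level relative to the mainlobe peak, in dB."""
    W = np.abs(np.fft.rfft(w, nfft))               # densely sampled magnitude spectrum
    W_db = 20 * np.log10(W / W[0] + 1e-12)         # peak is at zero frequency for these windows
    i = 1
    while i < len(W_db) - 1 and W_db[i + 1] < W_db[i]:
        i += 1                                     # walk down the mainlobe to its first minimum
    return W_db[i:].max()                          # largest value outside the mainlobe

N = 64                                             # assumed window length
rect_db = peak_sidelobe_db(np.ones(N))
hamming_db = peak_sidelobe_db(np.hamming(N))
print(f"rectangular: {rect_db:.1f} dB, Hamming: {hamming_db:.1f} dB")
```

The Hamming window trades a wider mainlobe for sidelobes that are roughly 30 dB lower than those of the rectangular window.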

Page 19: Speech signal time frequency representation

Time-Domain Windowing

Signal spectrum when using the Blackman-Nuttall window

Signal spectrum when using the Blackman window

Page 20: Speech signal time frequency representation

Time Resolution: Filter-Bank Viewpoint

BW = 45 Hz, NT = 44 ms

BW = 300 Hz, NT = 7 ms

Page 21: Speech signal time frequency representation

/aI/ from “my” with 45Hz and 300Hz bandwidth spectrograms

Time Resolution: Filter-Bank Viewpoint

Panels: 45 Hz, 44 ms; 300 Hz, 7 ms

Page 22: Speech signal time frequency representation

Vertical slice through spectrogram: mT = 0.1 s. 45 Hz gives finer frequency resolution.

Horizontal slice through spectrogram: k/(NT) = 3.5 kHz. 300 Hz gives finer time resolution.

45Hz and 300Hz bandwidth spectrograms

Page 23: Speech signal time frequency representation

You cannot get good time resolution and good frequency resolution from the same spectrogram.

Duration of window = N/fs = N×T.

Frequency resolution = a×fs/N
◦ Equal amplitude frequency components with this separation will give distinct peaks
◦ Most windows have a ≈ 2

Time resolution = 2N/(a×fs)
◦ Amplitude variations with this period will be attenuated by 6 dB

For all window functions, the product of the time and frequency resolutions is equal to 2.

Linguistic analysis typically uses a window length of 10–20 ms. The transfer function of the vocal tract does not change significantly in this time.

Uncertainty Principle
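The arithmetic behind the trade-off can be sketched in a few lines. The sampling frequency of 16 kHz, the value a = 2, and the function name `resolutions` are assumptions for this example; the formulas are the ones given on this slide.

```python
def resolutions(N, fs, a=2.0):
    """Window duration (s), frequency resolution (Hz) and time resolution (s)."""
    duration = N / fs                 # N * T
    freq_res = a * fs / N
    time_res = 2.0 * N / (a * fs)
    return duration, freq_res, time_res

fs = 16000.0                          # assumed sampling frequency in Hz
for N in (160, 320):                  # 10 ms and 20 ms windows at 16 kHz
    duration, freq_res, time_res = resolutions(N, fs)
    print(f"N={N}: {duration * 1e3:.0f} ms window, "
          f"{freq_res:.0f} Hz, {time_res * 1e3:.0f} ms, "
          f"product={freq_res * time_res:.1f}")
```

Halving the window length halves the time resolution figure and doubles the frequency resolution figure, so the product stays at 2 whatever N is chosen.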

Page 24: Speech signal time frequency representation

To keep all the information about time variation of spectral components, you need only sample the spectrum twice as fast as the spectral magnitudes are varying. Using normalised frequencies:

The –∞ dB bandwidth for an N-point window = b/N
◦ b = 4 for a Hamming window

Significant variation of spectral components occurs at frequencies below b/(2N)

We must therefore sample the spectrum at a frequency of b/N
◦ the separation between spectral samples is the window width divided by b

Overlapping Windows
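A minimal sketch of the resulting framing rule: successive windows are separated by N/b samples. The function name `frame_signal`, the ramp test signal, and the choice to drop a final partial frame are assumptions for this example.

```python
import numpy as np

def frame_signal(x, N, b=4):
    """Split x into overlapping frames of length N separated by N/b samples."""
    hop = N // b                                   # separation between spectral samples
    starts = np.arange(0, len(x) - N + 1, hop)     # a trailing partial frame is dropped
    frames = np.stack([x[s:s + N] for s in starts])
    return frames, hop

x = np.arange(1000, dtype=float)                   # ramp, so frame starts are easy to read
frames, hop = frame_signal(x, N=400, b=4)          # b = 4 for a Hamming window
print(frames.shape, hop)                           # prints: (7, 400) 100
```

With a 400-sample Hamming window, this gives a hop of 100 samples, so consecutive frames overlap by 75%.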

Page 25: Speech signal time frequency representation

The DFT can be interpreted as:
◦ Exact line-spectrum of a periodic signal {x_m}
◦ Sampled continuous spectrum of zero-extended {x_m}
◦ Sampled continuous spectrum of infinite {x_m} convolved with spectrum of rectangular window

The FFT is an algorithm for calculating the DFT.

Energy Conservation (Parseval's theorem):

$$\sum_{m=0}^{N-1} |x_m|^2 = \frac{1}{N}\sum_{k=0}^{N-1} |X_k|^2$$

Symmetries, {x_m} ⇒ {X_k}:
◦ Discrete ⇒ Periodic: $X_{k+N} = X_k$
◦ Real ⇒ Hermitian: $X_{-k} = X_k^*$
◦ Periodic: $x_{m+N/r} = x_m$ ⇒ Discrete: $X_k = 0$ for $k \ne ir$
◦ Skew Periodic: $x_{m+N/2r} = -x_m$ ⇒ Odd Harmonics: $X_k = 0$ for $k \ne (2i+1)r$
◦ Even: $x_m = x_{N-m}$ ⇒ Real
◦ Odd: $x_m = -x_{N-m}$ ⇒ Purely Imaginary
◦ Real & Even ⇒ Real & Even
◦ Real & Odd ⇒ Purely Imaginary and Odd

DFT Properties
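Several of these properties can be verified numerically with a short sketch. The random test signal and the variable names are assumptions for this example. Indices are taken modulo N, so x_{N-m} is written as `x[(-k) % N]`.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 16
x = rng.standard_normal(N)            # arbitrary real signal
X = np.fft.fft(x)
k = np.arange(N)
mirror = (-k) % N                     # index N-m, taken modulo N

# Energy conservation (Parseval's theorem)
parseval = np.isclose(np.sum(x**2), np.sum(np.abs(X)**2) / N)

# Real signal => Hermitian spectrum: X[-k] = conj(X[k])
hermitian = np.allclose(X[mirror], np.conj(X))

# Real & even signal => real & even spectrum
x_even = x + x[mirror]
X_even = np.fft.fft(x_even)
real_and_even = np.allclose(X_even.imag, 0.0) and np.allclose(X_even, X_even[mirror])

# Real & odd signal => purely imaginary & odd spectrum
x_odd = x - x[mirror]
X_odd = np.fft.fft(x_odd)
imag_and_odd = np.allclose(X_odd.real, 0.0) and np.allclose(X_odd, -X_odd[mirror])

print(parseval, hermitian, real_and_even, imag_and_odd)
```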