CS 551/651: Structure of Spoken Language Lecture 1: Visualization of the Speech Signal, Introductory...


CS 551/651: Structure of Spoken Language

Lecture 1: Visualization of the Speech Signal, Introductory Phonetics

John-Paul Hosom, Fall 2010


Visualization of the Speech Signal

Most common representations:
• Time-domain waveform
• Energy
• Pitch contour
• Spectrogram (power spectrum)


Visualization of the Speech Signal: Time-Domain Waveform

A time-domain waveform is the signal recorded directly from a microphone, with time on the horizontal axis and amplitude on the vertical axis.

“Variations in air pressure in the form of sound waves move through the air somewhat like ripples on a pond. … A graph of a sound wave is very similar to a graph of the movements of the eardrum.” [Ladefoged, p. 184]

“Sound originates from the motion or vibration of an object. This motion is impressed upon the surrounding medium (usually air) as a pattern of changes in pressure. … The sound generally weakens as it moves away from the source and also may be subject to reflections and refractions…” [Moore, p. 2]


Visualization of the Speech Signal: Time-Domain Waveform

Vertical axis: amplitude, relative sound pressure
• typical unit: μPa (micro-pascals); a digital signal is usually unitless
• quantization (-32768 to 32767)

Horizontal axis: time
• typical unit: msec (milliseconds)
• sampling (8000, 16000, 44.1K samp/sec)
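As a small illustration of the two axes, the sketch below converts a sample index to time in milliseconds and scales a quantized sample to a unitless value. The function names are hypothetical, and the divisor 32768 is an assumption taken from the quantization range listed above.

```python
def sample_time_msec(n, fs):
    """Time in milliseconds of sample index n at a sampling rate of fs samples/sec."""
    return 1000.0 * n / fs


def normalize_sample(sample, full_scale=32768.0):
    """Scale a quantized sample in -32768..32767 to the unitless range [-1.0, 1.0)."""
    return sample / full_scale


# Usage: sample 80 of a signal sampled at 8000 samp/sec lies 10 msec into the recording.
t_msec = sample_time_msec(80, 8000)
```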


Visualization of the Speech Signal: Energy

“Energy” or “Intensity”:
• Intensity is sound energy transmitted per second (power) through a unit area in a sound field. [Moore p. 9]
• Intensity is proportional to the square of the pressure variation. [Moore p. 9]

normalized energy = intensity:

$$\frac{1}{N}\sum_{n=t}^{t+N-1} x_n^2$$

where x_n = signal x at time sample n, and N = number of time samples.
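The normalized energy can be sketched in a few lines. This is a minimal version that assumes the frame is the N samples starting at index t; the function name is hypothetical.

```python
def short_time_energy(x, t, N):
    """Mean of the squared samples over the N samples of x starting at index t."""
    frame = x[t:t + N]
    if N <= 0 or len(frame) != N:
        raise ValueError("frame of N samples starting at t must lie inside x")
    return sum(s * s for s in frame) / N


# Usage: a signal alternating between +1 and -1 has normalized energy 1.0.
energy = short_time_energy([1.0, -1.0, 1.0, -1.0], 0, 4)
```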


Visualization of the Speech Signal: Energy

“Energy” or “Intensity”: the human auditory system is better suited to relative scales:

$$\text{energy (bels)} = \log_{10}\left(\frac{I_1}{I_0}\right)$$

$$\text{energy (decibels, dB)} = 10\,\log_{10}\left(\frac{I_1}{I_0}\right)$$

I0 is a reference intensity. If the signal becomes twice as powerful (I1/I0 = 2), then the energy level is 3 dB (3.0103 dB to be more precise).

A typical reference level for I0 corresponds to a sound pressure of 20 μPa, which is close to the average human absolute threshold for a 1000-Hz sinusoid.
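The decibel conversion is easy to check numerically. The sketch below uses hypothetical function names; the pressure version assumes only that intensity is proportional to the square of the pressure, which turns the factor of 10 into a factor of 20.

```python
import math


def intensity_ratio_to_db(i1, i0=1.0):
    """Level in dB of intensity i1 relative to the reference intensity i0."""
    return 10.0 * math.log10(i1 / i0)


def pressure_ratio_to_db(p1, p0):
    """Level in dB of pressure p1 relative to p0 (intensity ~ pressure squared)."""
    return 20.0 * math.log10(p1 / p0)


# Usage: doubling the intensity raises the level by about 3.01 dB,
# while doubling the pressure raises it by about 6.02 dB.
level_power = intensity_ratio_to_db(2.0)
level_pressure = pressure_ratio_to_db(40e-6, 20e-6)
```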


Visualization of the Speech Signal: Energy

What is a good value of N? Depends on information of interest:

[Figure: four panels computed with N = 1 msec, N = 5 msec, N = 20 msec, and N = 80 msec.]


Visualization of the Speech Signal: Power Spectrum

What makes one phoneme, /aa/, sound different from another phoneme, /iy/?

Different shapes of the vocal tract… /aa/ is produced with the tongue low and in the back of the mouth; /iy/ is produced with the tongue high and toward the front.

The different shapes of the vocal tract produce different “resonant frequencies”, or frequencies at which energy in the signal is concentrated. (Simple example of resonant energy: a tuning fork may have a resonant frequency equal to 440 Hz, or “A”.) A resonance is the tendency of a system to oscillate with larger amplitude at some frequencies than at others. [Wikipedia]

Resonant frequencies in speech (or other sounds) can be displayed by computing a “power spectrum” or “spectrogram,” showing the energy in the signal at different frequencies.


Visualization of the Speech Signal: Power Spectrum

A time-domain signal can be expressed in terms of sinusoids at a range of frequencies using the Fourier transform:

$$X(f) = \int_{t=-\infty}^{\infty} x(t)\,e^{-j 2\pi f t}\,dt = \int_{t=-\infty}^{\infty} x(t)\,[\cos(2\pi f t) - j\sin(2\pi f t)]\,dt$$

where x(t) is the time-domain signal at time t, f is a frequency value from 0 to 1, and X(f) is the spectral-domain representation.

Note: $e^{j\theta} = \cos(\theta) + j\sin(\theta)$

One useful property of the Fourier transform is that it is time-invariant (actually, linear time invariant). While a periodic signal x(t) changes at t, t+ε, t+2ε, etc., the Fourier transform of this signal is constant, making analysis of periodic signals easier.


Visualization of the Speech Signal: Power Spectrum

Since samples are obtained at discrete time steps, and since only a finite section of the signal is of interest, the discrete Fourier transform is more useful:

$$X(n) = \frac{1}{N}\sum_{k=0}^{N-1} x(k)\,e^{-j\frac{2\pi k n}{N}} = \frac{1}{N}\sum_{k=0}^{N-1} x(k)\left[\cos\left(\frac{2\pi k n}{N}\right) - j\sin\left(\frac{2\pi k n}{N}\right)\right] \quad \text{for } n = 0, \ldots, N-1$$

in which x(k) is the amplitude at time sample k, n is a frequency value from 0 to N-1, N is the number of samples or frequency points of interest, and X(n) is the spectral-domain representation of x(k). Note that we assume that the series outside the range (0, N-1) is “extended N-periodic,” that is, x(k) = x(k+N) for all k.
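A direct, unoptimized sketch of the discrete Fourier transform follows, assuming the 1/N scaling and the negative exponent of the standard forward transform. It is meant only to make the sum concrete; a practical implementation would use an FFT routine. The function name is hypothetical.

```python
import cmath
import math


def dft(x):
    """Discrete Fourier transform of the sequence x, scaled by 1/N."""
    N = len(x)
    return [
        sum(x[k] * cmath.exp(-2j * math.pi * k * n / N) for k in range(N)) / N
        for n in range(N)
    ]


# Usage: a cosine with exactly 2 cycles in 8 samples puts half of its amplitude
# at n = 2 and the other half at the mirror-image index n = N - 2 = 6.
x = [math.cos(2 * math.pi * 2 * k / 8) for k in range(8)]
X = dft(x)
```

With the 1/N factor, a unit-amplitude cosine yields two spectral values of 0.5 each, so the result does not grow with the number of samples.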


Visualization of the Speech Signal: Power Spectrum

• The sampling frequency is the rate at which samples are recorded; e.g. 8000 Hz = 8000 samples per second.

• Shannon’s Sampling Theorem states that a continuous signal must be sampled at a rate of at least twice the highest frequency present in the signal. So, the signal must not contain any energy above Fsamp/2 (the Nyquist frequency). If it does, use a low-pass filter to remove these higher frequencies before sampling.

• Because the signal is assumed to be periodic over length N, and this assumption is usually false, the signal is weighted with a window so that both edges of the signal taper toward zero.

Hamming window:

$$x_w(n) = x(n)\left[0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right)\right], \quad n = 0, \ldots, N-1$$
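The Hamming weighting can be sketched as below, assuming the symmetric form with N-1 in the denominator and N of at least 2. The function names are hypothetical.

```python
import math


def hamming(N):
    """Symmetric Hamming window of length N (N >= 2)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]


def apply_window(x):
    """Weight the frame x so that both of its edges taper toward zero."""
    w = hamming(len(x))
    return [xi * wi for xi, wi in zip(x, w)]


# Usage: the edge weights are 0.08 and the center weight is 1.0,
# so a constant frame is tapered at both ends.
tapered = apply_window([1.0] * 5)
```

Note that the edges taper to 0.08 rather than exactly zero, which is a known characteristic of the Hamming window.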


Visualization of the Speech Signal: Power Spectrum

The magnitude and phase of the spectral representation, here written F(n), are:

$$\text{magnitude}_{F(n)} = |F(n)| = \left(F_{real}(n)\,F_{real}(n) + F_{imag}(n)\,F_{imag}(n)\right)^{0.5}$$

(|F(n)| is the absolute value of a complex number.)

$$\text{phase}_{F(n)} = \tan^{-1}\left(\frac{F_{imag}(n)}{F_{real}(n)}\right)$$

Phase information is generally considered not important in understanding speech, and the energy (or power) of the magnitude of F(n) on the decibel scale provides the most relevant information:

$$\text{PowerSpectrum}_{F(n)} = 10\,\log_{10}\left(F_{real}(n)^2 + F_{imag}(n)^2\right)$$

Note: usually we don’t worry about the reference intensity I0 (assume a value of 1.0); the signal strength (in μPa) is unknown anyway.
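These three quantities can be computed from one complex spectral value as sketched below. The function names and the small `floor` guard against log10(0) are assumptions added for illustration; `atan2` is used in place of a plain inverse tangent so that the quadrant of the phase is resolved.

```python
import math


def magnitude(c):
    """Absolute value of the complex spectral value c."""
    return math.sqrt(c.real * c.real + c.imag * c.imag)


def phase(c):
    """Phase of c in radians, using atan2 to resolve the quadrant."""
    return math.atan2(c.imag, c.real)


def power_db(c, floor=1e-12):
    """Power of c on the decibel scale, with reference intensity 1.0."""
    return 10.0 * math.log10(max(c.real ** 2 + c.imag ** 2, floor))


# Usage: the value 3 + 4j has magnitude 5 and power 10*log10(25) dB.
value = complex(3.0, 4.0)
level = power_db(value)
```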


Visualization of the Speech Signal: Power Spectrum

The power spectrum can be plotted like this (vowel /aa/):

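[Figure: time-domain waveform of /aa/ (amplitude, 512 samples) and its power spectrum (spectral power in dB vs. frequency in Hz, from 0 Hz to 4000 Hz); the plot is annotated with 73 dB.]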


Visualization of the Speech Signal: Power Spectrum

If the speech signal is periodic and the number of samples in the window is large enough, then harmonics are seen:

[Figure: three power spectra: periodic signal /aa/, 128 samples; periodic signal /aa/, 2048 samples; aperiodic signal /sh/, 2048 samples. The frequency range is 0 to 4000 Hz in all plots.]

A harmonic is a strong energy component at an integer multiple of the fundamental frequency (pitch), F0.
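Why the window length matters can be sketched with a little arithmetic. The code below assumes a sampling rate of 8000 Hz (implied by the 0 to 4000 Hz plots) and uses a rule of thumb, not a result from the lecture: harmonics are separable when F0 exceeds the main-lobe width of the analysis window, taken here as about 4 frequency bins for a Hamming window. The function names and the example F0 of 100 Hz are hypothetical.

```python
def bin_spacing_hz(fs, n_samples):
    """Spacing in Hz between adjacent frequency points of an n_samples-point spectrum."""
    return fs / n_samples


def harmonics_resolved(fs, n_samples, f0, mainlobe_bins=4.0):
    """Rule of thumb: harmonics separate when F0 exceeds the window's main-lobe width."""
    return f0 > mainlobe_bins * bin_spacing_hz(fs, n_samples)


# Usage: at 8000 Hz, 128 samples give 62.5 Hz per bin and 2048 samples give about 3.9 Hz.
short_window = harmonics_resolved(8000, 128, 100.0)
long_window = harmonics_resolved(8000, 2048, 100.0)
```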


Visualization of the Speech Signal: Formants

Note that the resonant frequencies, or formants, for the two vowels /aa/ and /iy/ can be identified in the spectra.

For recognition of phonemes, the spectral envelope is important (envelope = shape of spectrum without harmonics)

[Figure: power spectra of /aa/ (2048 samples) and /iy/ (2048 samples), each from 0 to 4 kHz, with the envelope marked and two locations labeled “?”.]


Visualization of the Speech Signal: Formants

The harmonics, which are dependent on F0, are not, in theory, significantly related to the resonant frequencies, which are dependent on the vocal tract shape (or phoneme)

[Figure: power spectra of /aa/ with F0 = 80 Hz and with F0 = 164 Hz, from 0 to 4 kHz.]


Visualization of the Speech Signal: Spectrograms

Many power spectra can be plotted over time, creating a “spectrogram” or “spectrograph” (pre-emphasis = 0.97):

[Figure: waveform (amp) and spectrogram (frequency in Hz vs. time in msec) for /aa/ and for /iy/; FFT size = 10 msec.]
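The steps behind such a plot can be sketched as follows: pre-emphasize, cut the signal into frames, window each frame, and take the power spectrum in dB. This is a slow, minimal version under stated assumptions: non-overlapping frames, a Hamming window, a direct DFT instead of an FFT, a non-empty input, and a hypothetical function name. It is not the tool used to produce the figures.

```python
import cmath
import math


def spectrogram(x, fs, frame_msec=10.0, a=0.97, floor=1e-12):
    """List of power spectra (dB), one per non-overlapping frame of frame_msec."""
    N = int(round(fs * frame_msec / 1000.0))
    # Pre-emphasis; the first sample is passed through unchanged.
    pre = [x[0]] + [x[n] - a * x[n - 1] for n in range(1, len(x))]
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]
    frames = []
    for start in range(0, len(pre) - N + 1, N):
        seg = [pre[start + n] * window[n] for n in range(N)]
        row = []
        for k in range(N // 2 + 1):  # frequency points from 0 Hz up to fs/2
            c = sum(seg[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N)) / N
            row.append(10.0 * math.log10(max(abs(c) ** 2, floor)))
        frames.append(row)
    return frames


# Usage: 50 msec of a 1000-Hz tone at 8000 samp/sec gives 5 frames of 41 values each;
# with 100 Hz between frequency points, the tone falls on index 10.
tone = [math.sin(2 * math.pi * 1000 * n / 8000) for n in range(400)]
S = spectrogram(tone, 8000)
```

Changing `frame_msec` trades time resolution against frequency resolution, since each frequency point is fs/N Hz wide.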


Visualization of the Speech Signal: Formants

These formants can be modeled by a “damped sinusoid”, which has a time-domain representation x(t) and a spectral representation S(f), where S(f) is the spectrum at frequency value f, A is the overall amplitude, fc is the center frequency of the damped sine wave, and α is a damping factor. [Olive, p. 48, 58]

[Figure: the damped sinusoid as amplitude vs. time (msec), and its spectrum as power (dB) vs. frequency (Hz), with the center frequency fc and the 0 dB level marked.]
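A damped sinusoid and its spectral peak can be explored numerically. The sketch below assumes one common parameterization, x(t) = A·exp(-αt)·sin(2π·fc·t) with α in units of 1/sec; this form, the function names, and the example values are assumptions for illustration and may differ in scaling from the representation in [Olive]. For this assumed form, the half-power width of the peak is roughly α/π Hz.

```python
import cmath
import math


def damped_sinusoid(A, alpha, fc, fs, n_samples):
    """Samples of A*exp(-alpha*t)*sin(2*pi*fc*t) at t = n/fs (assumed form)."""
    return [
        A * math.exp(-alpha * n / fs) * math.sin(2 * math.pi * fc * n / fs)
        for n in range(n_samples)
    ]


def magnitude_at(x, f, fs):
    """Linear spectral magnitude of x at frequency f (Hz), by direct summation."""
    return abs(sum(s * cmath.exp(-2j * math.pi * f * n / fs) for n, s in enumerate(x)))


def peak_frequency(x, fs, freqs):
    """Frequency among freqs at which the spectral magnitude of x is largest."""
    return max(freqs, key=lambda f: magnitude_at(x, f, fs))


# Usage: a 1000-Hz damped sinusoid sampled at 8000 samp/sec for 100 msec.
x = damped_sinusoid(1.0, 200.0, 1000.0, 8000, 800)
peak = peak_frequency(x, 8000, range(500, 1501, 10))
```

A larger α makes the waveform die out faster and the spectral peak wider.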


Visualization of the Speech Signal: Formants

The bandwidth is defined as the width of the spectral peak measured at the points where the spectral power is ½ the maximum value (equivalently, where the linear spectral magnitude is 1/√2 of the maximum). A reduction of the signal power by a factor of 2 is equivalent to a 3 dB change.

[Figure: power (dB) vs. frequency (Hz) for a spectral peak, with the bandwidth marked and the 0 dB and 3 dB levels labeled.]

Also, the resonator must have a value of 0 dB at 0 Hz.


Visualization of the Speech Signal: Formants

• Formants are specified by a frequency, F, and bandwidth, B.

• A neutral vowel (/ax/) theoretically has formants at 500 Hz, 1500 Hz, 2500 Hz, 3500 Hz, etc. The first formant is called F1, the second is called F2, etc. (The fundamental frequency, or pitch, is F0.)

• F1, F2, and sometimes F3 are usually sufficient for identifying vowels.

• Formants can be thought of as filters, which act on the source waveform. For vowels, the source waveform is air pushed through the vibrating vocal folds. Energy is lost (hence a damped sinusoid model) by sound absorption in the mouth.

• A digital model of a formant can be implemented using an infinite impulse response (IIR) filter.
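One widely used second-order IIR form for a formant is the resonator found in Klatt-style formant synthesizers, sketched below. It is offered as an illustration and is not necessarily the filter used in this course. The coefficients are set from the formant frequency F and bandwidth B so that the gain at 0 Hz is exactly 1 (0 dB); the function names and example values are hypothetical.

```python
import cmath
import math


def resonator_coefficients(F, B, fs):
    """Coefficients (a, b, c) of y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    T = 1.0 / fs
    c = -math.exp(-2.0 * math.pi * B * T)
    b = 2.0 * math.exp(-math.pi * B * T) * math.cos(2.0 * math.pi * F * T)
    a = 1.0 - b - c  # forces unity gain (0 dB) at 0 Hz
    return a, b, c


def resonate(x, F, B, fs):
    """Filter the sequence x through a formant resonator at F Hz with bandwidth B Hz."""
    a, b, c = resonator_coefficients(F, B, fs)
    y1 = y2 = 0.0
    out = []
    for s in x:
        y = a * s + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def gain_db(F, B, fs, f):
    """Gain in dB of the resonator at frequency f (Hz)."""
    a, b, c = resonator_coefficients(F, B, fs)
    z = cmath.exp(-2j * math.pi * f / fs)
    return 20.0 * math.log10(abs(a / (1.0 - b * z - c * z * z)))


# Usage: a resonator at 500 Hz with a 60-Hz bandwidth, sampled at 8000 samp/sec.
peak = max(range(0, 4001), key=lambda f: gain_db(500.0, 60.0, 8000, f))
```

The output depends on the two previous output samples, which is what makes the filter's impulse response infinite in length.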


Visualization of the Speech Signal: Excitation/Source

The vocal-fold vibration source looks like this:

[Figure: amplitude vs. time (msec), and power (dB) vs. frequency (Hz) with a slope of -6 dB/octave.]

In fricatives and other unvoiced speech, the source is turbulent air:

[Figure: amplitude vs. time (msec), and power (dB) vs. frequency (Hz) with a flat slope.]

(Note: there are some gross simplifications here… we’ll go into more detail later in the course.)


Visualization of the Speech Signal: Pre-Emphasis

Because the source for voiced sounds decreases at –6 dB/octave, a simple filter can be used to increase the spectral tilt by +6 dB/octave, thereby making voiced sounds spectrally flat and easier to visualize. (NOTE: unvoiced sounds then have a spectral slope of +6 dB/octave.)

[Figure: two plots of power (dB) vs. frequency (Hz), one with a slope of -6 dB/octave and one with a slope of 0 dB/octave.]

$$x'(n) = x(n) - a\,x(n-1), \quad a = 0.97$$

where x(n) is the time-domain speech signal at sample number n, and x'(n) is the pre-emphasized speech signal at sample n.
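The pre-emphasis filter is a one-line difference equation. In the sketch below the first sample is passed through unchanged, which amounts to assuming that the sample before the start of the signal is zero; the function name is hypothetical.

```python
def pre_emphasize(x, a=0.97):
    """Return x'(n) = x(n) - a*x(n-1), treating the sample before x[0] as zero."""
    if not x:
        return []
    return [x[0]] + [x[n] - a * x[n - 1] for n in range(1, len(x))]


# Usage: a constant (0 Hz) signal is almost removed, while a signal that alternates
# in sign at every sample (the highest representable frequency) is nearly doubled.
low = pre_emphasize([1.0, 1.0, 1.0])
high = pre_emphasize([1.0, -1.0, 1.0])
```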


Visualization of the Speech Signal: Spectrograms

The FFT window size has a large impact on visual properties:

[Figure: waveform (amp) and two spectrograms (frequency in Hz) of /aa/.]

“wideband” = small time window = small FFT size (FFT size = 5 msec)

“narrowband” = large time window = large FFT size (FFT size = 33 msec)


Spectrogram Reading: Vowels

Vowel formant frequencies:


Spectrogram Reading: Vowels

Vowel formants (averages for English, male vs. female):

Vowel   F1 (Hz)   F2 (Hz)   F3 (Hz)
iy      310       2790      3310
ih      430       2480      3070
eh      610       2330      2990
ae      860       2050      2850
ah      760       1400      2780
aa      850       1220      2810
uh      470       1160      2680
uw      370       950       2670

(Chart frequency axis: 0 to 3500 Hz.)

*from Peterson, G.E., and Barney, H.L. (1952). "Control methods used in the study of vowels", Journal of the Acoustical Society of America, 24, 175-184.
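To illustrate how F1 and F2 can separate vowels, the sketch below picks the vowel whose listed average (F1, F2) is closest to a measured pair. The nearest-average rule with plain Euclidean distance in Hz is an assumption chosen for brevity, not a method from the lecture, and the function name is hypothetical; the formant values are the ones listed above.

```python
# Average formant values (F1, F2, F3) in Hz, as listed above.
FORMANTS_HZ = {
    "iy": (310, 2790, 3310),
    "ih": (430, 2480, 3070),
    "eh": (610, 2330, 2990),
    "ae": (860, 2050, 2850),
    "ah": (760, 1400, 2780),
    "aa": (850, 1220, 2810),
    "uh": (470, 1160, 2680),
    "uw": (370, 950, 2670),
}


def nearest_vowel(f1, f2):
    """Vowel label whose average (F1, F2) is closest to the measured (f1, f2)."""
    def squared_distance(vowel):
        avg_f1, avg_f2, _ = FORMANTS_HZ[vowel]
        return (avg_f1 - f1) ** 2 + (avg_f2 - f2) ** 2
    return min(FORMANTS_HZ, key=squared_distance)


# Usage: a low F1 with a high F2 lands on /iy/; a high F1 with a low F2 lands on /aa/.
guess = nearest_vowel(320, 2700)
```

Real measurements vary with speaker and context, so a rule this simple will misclassify many tokens.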


Spectrogram Reading: Vowels

Vowel formants, Peterson and Barney data:


Spectrogram Reading: Vowels

Ratios of 1st and 2nd formant, from Miller (1989) based on Peterson and Barney (1952) data:


Spectrogram Reading: Vowels

Observed values from vowel midpoints from a single speaker, speaking both “clearly” and “conversationally”, in different phonetic contexts:

[Figure: vowel midpoint values, with regions labeled iy, ih, uw, uh, eh, ae, ah, aa.]

(from Amano-Kusumoto, PhD thesis 2010)