Course Notes v17
7/18/2019 Course Notes v17
http://slidepdf.com/reader/full/course-notes-v17 1/82
Part I
Speech Processing
1 Introduction
Speech Processing:
1. Speech/Speaker Recognition
2. Synthesis
3. Speech Coding (Communication)
In this course:
1. Analysis (Modeling, Physiology)
(a) Time domain
(b) Frequency domain
2. Coding
2 Review of Digital Signal Processing Fundamentals
2.1 Discrete-Time Signals
Speech signals occur naturally as continuous-time acoustic signals, x_a(t). However, speech signals can be transduced into electrical signals, and sampled at period T = 1/F_s to become discrete-time signals or sequences:
x(n) = x_a(t)|_{t=nT} = x_a(nT).
• Sampling Theorem: Let x_a(t) be a continuous-time signal, and let X_a(Ω) be the corresponding frequency representation. Suppose x_a(t) is bandlimited:

X_a(Ω) = 0, for |Ω| > Ω_a,

where Ω_a = 2πF_a. If x_a(t) is sampled at period T = 1/F_s to become x(n), then x_a(t) can be uniquely reconstructed from x(n) if F_s > 2F_a. Here, F_a is referred to as the bandwidth of x_a(t). The Nyquist rate is defined as F_N = 2F_a.
If the original continuous-time signal is not band-limited, the sampling component must be preceded by a low-pass filter with cut-off frequency F_c ≤ F_s/2 in order to avoid aliasing.
Note that speech signals are typically not band-limited. However, the majority of useful information is contained below certain frequencies. For example, for speech coding, speech is generally sampled at 8 kHz (narrowband coding) or 16 kHz (wideband coding). For music, the sampling rate is generally much higher, such as 44.1 kHz for PCM-coded music.
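The aliasing that the low-pass filter prevents can be seen numerically. The sketch below (a NumPy illustration; the 7 kHz and 1 kHz tone frequencies are chosen for the example, only F_s = 8 kHz comes from the text) shows that a tone above F_s/2 produces exactly the same samples as a lower-frequency tone:

```python
import numpy as np

Fs = 8000          # sampling rate (Hz)
n = np.arange(64)  # sample indices

# A 7 kHz tone sampled at 8 kHz violates Fs > 2*Fa ...
x_high = np.cos(2 * np.pi * 7000 * n / Fs)

# ... and yields exactly the same samples as a 1 kHz tone,
# since cos(2*pi*7000*n/8000) = cos(2*pi*n - 2*pi*1000*n/8000).
x_alias = np.cos(2 * np.pi * 1000 * n / Fs)

assert np.allclose(x_high, x_alias)
```

Once sampled, the two tones are indistinguishable, which is why the anti-aliasing filter must act before sampling.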
1
7/18/2019 Course Notes v17
http://slidepdf.com/reader/full/course-notes-v17 2/82
2.2 Important Discrete-Time Functions
Several discrete-time functions are important to digital speech processing, and will be used repeatedly in this course:
• The delta function: x(n) = δ(n),

x(n) = 1, for n = 0; 0, otherwise.

• The step function: x(n) = u(n),

x(n) = 1, for n ≥ 0; 0, otherwise.

• The one-sided exponential function: x(n) = a^n u(n),

x(n) = a^n, for n ≥ 0; 0, otherwise.
• Sinusoidal functions:
x(n) = cos(2πf n),  x(n) = sin(2πf n).

• Euler's Identity: e^{jθn} = cos(θn) + j sin(θn). Note that Euler's Identity can be used to expand sinusoidal functions:

cos(θn) = (1/2) (e^{jθn} + e^{−jθn}),
sin(θn) = (1/(2j)) (e^{jθn} − e^{−jθn}).
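Euler's identity and the two expansions above can be verified numerically; the sketch below (a NumPy check with an arbitrary illustrative frequency θ = 0.3) evaluates both sides on a range of sample indices:

```python
import numpy as np

theta = 0.3        # illustrative frequency (rad/sample)
n = np.arange(10)  # sample indices

# Euler's identity: e^{j*theta*n} = cos(theta*n) + j*sin(theta*n)
lhs = np.exp(1j * theta * n)
rhs = np.cos(theta * n) + 1j * np.sin(theta * n)
assert np.allclose(lhs, rhs)

# Expansions of cosine and sine from complex exponentials
cos_expand = (np.exp(1j * theta * n) + np.exp(-1j * theta * n)) / 2
sin_expand = (np.exp(1j * theta * n) - np.exp(-1j * theta * n)) / (2j)
assert np.allclose(cos_expand, np.cos(theta * n))
assert np.allclose(sin_expand, np.sin(theta * n))
```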
2.3 Important Discrete-Time Frequency-Domain Transforms
Frequency-domain transforms allow for spectral analysis of signals and systems. In this section, several common discrete-time transforms will be reviewed:
• The Z-transform expresses an input signal x(n) as a power series in the complex variable z = re^{jω}:

X(z) = Σ_{n=−∞}^{∞} x(n) z^{−n}.

An important topic regarding the Z-transform is the region of convergence (ROC), which is defined as all portions of the Z-plane for which:

Σ_{n=−∞}^{∞} |x(n) z^{−n}| = Σ_{n=−∞}^{∞} |x(n) r^{−n} e^{−jnω}| = Σ_{n=−∞}^{∞} |x(n) r^{−n}| < ∞.
Note that different discrete-time signals may have the same Z-transform, but different ROCs (e.g.,
x_1(n) = a^n u(n) and x_2(n) = −a^n u(−n − 1)).
The inverse Z-transform allows the time-domain signal to be obtained from the transform-domain representation:

x(n) = (1/(2πj)) ∮_C X(z) z^{n−1} dz,
where C denotes a path surrounding the origin and contained in the ROC.
• The Discrete-Time Fourier Transform (DTFT) expresses an input signal as a power series in the complex variable e^{jω}:

X(e^{jω}) = Σ_{n=−∞}^{∞} x(n) e^{−jωn}.
The DTFT represents a transformation from discrete time, n, to continuous frequency, ω. By definition, the DTFT is periodic in frequency (ω) with period 2π. The inverse DTFT allows the time-domain signal to be obtained from the frequency-domain representation:
x(n) = (1/(2π)) ∫_{−π}^{π} X(e^{jω}) e^{jωn} dω.
Note that the DTFT can be interpreted as the Z-transform evaluated along the unit circle (|z| = 1):
X(e^{jω}) = X(z)|_{z=e^{jω}}.
• The N-point Discrete Fourier Transform (DFT) provides a discrete frequency representation of a finite-length discrete-time signal x(n) of length L ≤ N:
X(k) = Σ_{n=0}^{L−1} x(n) e^{−j(2π/N)kn}, for 0 ≤ k ≤ N − 1.
The DFT represents a transformation from discrete time, n, to discrete frequency, k. Note that the length of the sequence is not required to be equal to the size of the DFT, as long as L ≤ N. If L < N, the sequence is said to be zero-padded with N − L zeros. The discrete frequency representation given by the DFT is often easier to handle in practice than the continuous representation provided by the DTFT.
Note that the DFT assumes implicit periodicity of period N , both in the time and frequency domain.
The inverse DFT is given by:

x(n) = (1/N) Σ_{k=0}^{N−1} X(k) e^{j(2π/N)kn}, for 0 ≤ n ≤ N − 1.
Note also that the DFT can be interpreted as a uniform sampling of the DTFT along the unit circle:
X(k) = X(e^{jω})|_{ω=2πk/N}.
There exist computationally efficient algorithms for determining the DFT, namely the Fast Fourier Transforms (FFTs), many of which require the length of the input time-domain signal to be a power of 2. Since this is not generally the case, time signals may be zero-padded to produce lengths of 2^ν (also known as radix-2 lengths).
Figure 1 provides views of the domains of the transforms previously discussed, in the Z-plane.
Figure 1: Views of the domains of various frequency transforms in the Z-plane: Panel a shows the Z-transform. Panel b shows the Discrete-Time Fourier Transform (DTFT) as a subset of the Z-transform, evaluated along the unit circle (r = 1). Panel c shows the Discrete Fourier Transform (DFT) as a uniform sampling of the DTFT along the unit circle (in this case N = 8).
• Example 1: Consider an input signal x (n) = [1, 2, 3, 2, 1], where the underscore denotes the origin.
The Z-transform can be determined as:

X(z) = Σ_{n=0}^{4} x(n) z^{−n} = 1 + 2z^{−1} + 3z^{−2} + 2z^{−3} + z^{−4}.
The DTFT can be determined as:

X(e^{jω}) = Σ_{n=0}^{4} x(n) e^{−jωn} = 1 + 2e^{−jω} + 3e^{−j2ω} + 2e^{−j3ω} + e^{−j4ω}
          = e^{−j2ω} [3 + 4 cos(ω) + 2 cos(2ω)].
The DFT of length N = 8 (note that this implies zero-padding) is determined as:

X(k) = Σ_{n=0}^{7} x(n) e^{−j(2π/N)kn} = 1 + 2e^{−jπk/4} + 3e^{−jπk/2} + 2e^{−j3πk/4} + e^{−jπk}
     = [9, −j5.828, −1, j0.172, 1, −j0.172, −1, j5.828].
The DFT can also be determined by sampling the DTFT uniformly along the unit circle:

X(k) = e^{−j2ω} [3 + 4 cos(ω) + 2 cos(2ω)]|_{ω=2πk/N}
     = [9, −j5.828, −1, j0.172, 1, −j0.172, −1, j5.828].
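The values in Example 1 can be checked numerically. The sketch below (using NumPy) zero-pads x(n) to N = 8, takes the FFT, and also compares against the closed-form DTFT sampled along the unit circle:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 2.0, 1.0])

# 8-point DFT; numpy zero-pads x from length L = 5 to N = 8
X = np.fft.fft(x, n=8)

# Values derived in Example 1 (5.828 = 3 + 2*sqrt(2), 0.172 = 3 - 2*sqrt(2))
expected = np.array([9, -5.8284j, -1, 0.1716j, 1, -0.1716j, -1, 5.8284j])
assert np.allclose(X, expected, atol=1e-3)

# The same values by sampling the closed-form DTFT at w = 2*pi*k/8
w = 2 * np.pi * np.arange(8) / 8
dtft = np.exp(-2j * w) * (3 + 4 * np.cos(w) + 2 * np.cos(2 * w))
assert np.allclose(X, dtft)
```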
2.4 Properties of Frequency-Domain Transforms
There exist several important properties pertaining to frequency-domain transforms of discrete-time signals
that assist in the analysis of signals and systems. Note that if the properties discussed in this sectionapply identically to all transforms previously discussed, they will only be explicitly formalized for theDFT, for the sake of brevity. However, differences occur, the properties will be formalized separately for
each necessary transform.
• Linearity: Let x_1(n) and x_2(n) be time-domain signals, with X_1(k) and X_2(k) being the corresponding DFTs:

x_1(n) ↔ X_1(k),
x_2(n) ↔ X_2(k).

In this case, a linear combination of the time-domain signals will produce the same linear combination of the frequency-domain signals:

αx_1(n) + βx_2(n) ↔ αX_1(k) + βX_2(k).

One subtle aspect of the linearity property is that when dealing with time-domain signals of different lengths, say N_1 and N_2, they must first be zero-padded to at least length N = max(N_1, N_2) to avoid aliasing in the time domain.
• Convolution: Convolution of signals in the time domain results in multiplication of the corresponding signals in the frequency domain. Conversely, multiplication in the time domain results in convolution in the frequency domain. For the DTFT, this property refers to standard convolution, denoted by ∗:

x_1(n) ∗ x_2(n) ↔ X_1(e^{jω}) X_2(e^{jω}),
x_1(n) x_2(n) ↔ (1/(2π)) X_1(e^{jω}) ∗ X_2(e^{jω}).

However, due to the implied periodicity of the DFT, this property refers to circular convolution, denoted by ⊛, in this case:

x_1(n) ⊛ x_2(n) ↔ X_1(k) X_2(k),
x_1(n) x_2(n) ↔ (1/N) X_1(k) ⊛ X_2(k).
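The distinction between standard and circular convolution can be demonstrated numerically. The sketch below (NumPy, with short illustrative sequences) shows that multiplying DFTs of sufficient size reproduces standard convolution, while an undersized DFT causes the tail to wrap around (time-domain aliasing):

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([4.0, 5.0, 6.0, 7.0])

# Standard (linear) convolution has length N1 + N2 - 1 = 6
linear = np.convolve(x1, x2)

# Multiplying DFTs of size N >= N1 + N2 - 1 gives a circular convolution
# that matches the linear result (no time-domain aliasing).
N = len(x1) + len(x2) - 1
circular = np.fft.ifft(np.fft.fft(x1, N) * np.fft.fft(x2, N)).real
assert np.allclose(circular, linear)

# With N too small (here N = 4), the tail wraps around modulo N
N_small = 4
aliased = np.fft.ifft(np.fft.fft(x1, N_small) * np.fft.fft(x2, N_small)).real
assert np.allclose(aliased, linear[:4] + np.concatenate([linear[4:], [0, 0]]))
```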
• Time Shifts: Introducing a time shift results in modulation in the frequency domain by a complex exponential. For the DTFT, this refers to a standard time shift:

x(n − n_0) ↔ e^{−jn_0ω} X(e^{jω}).

However, for the DFT, due to the implied periodicity, this refers to a modular time shift:

x((n − n_0) mod N) ↔ e^{−j2πkn_0/N} X(k).
• Real Time-Domain Signals: Real time-domain signals result in frequency-domain transforms which are conjugate symmetric. That is:

x(n) real ↔ X(k) = X^∗((−k) mod N).
This property is important for speech processing since speech signals are naturally real.
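This conjugate symmetry is easy to confirm numerically; the sketch below (NumPy, with a random real-valued sequence standing in for a speech frame) checks X(k) = X^∗((−k) mod N) for every bin:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(16)   # a real-valued signal, as speech is
X = np.fft.fft(x)

N = len(x)
k = np.arange(N)

# Conjugate symmetry of the DFT of a real signal: X(k) = X*((-k) mod N)
assert np.allclose(X, np.conj(X[(-k) % N]))
```

In practice this is why only the first N/2 + 1 bins of a speech spectrum need to be stored or displayed.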
2.5 Insight Into Discrete-Time Frequencies
For continuous-time signals, frequencies can take values within the entire range (−∞, ∞). Generally, continuous-time frequencies are denoted Ω (radians per second) or F (hertz). However, as can be interpreted from the sampling theorem previously discussed, sampling limits the range of possible frequency components in a signal. For a discrete-time signal, frequencies are generally denoted by ω (radians per sample), which take values within [0, 2π). Figure 2 provides examples of unit-amplitude discrete-time signals at various frequencies.
Figure 2: Examples of Discrete-Time Frequencies: Panel a shows a unit-amplitude signal of frequency ω = 0. Panel b shows a unit-amplitude signal of frequency ω = π/2. Panel c shows a unit-amplitude signal of frequency ω = π, which is the highest possible frequency. Note that the phase is zero in each case.
The relationship between discrete-time frequencies and corresponding continuous-time frequencies depends on the sampling frequency, F_s:

ω = Ω/F_s = 2πF/F_s.
Furthermore, when the DFT is used for spectral analysis, the spacing between frequency bins is given as:

∆ω = 2π/N.
• Example 2:
Let x(n) be a discrete-time signal with sampling rate F_s = 8 kHz. If a segment of the signal corresponding to 25 ms is extracted for spectral analysis, and a radix-2 FFT is to be used, what is the minimum length N of the signal used (after zero-padding) during the transform?

A signal of duration T = 25 ms with sampling rate F_s comprises:

N = 0.025 s × 8000 samples/s = 200 samples.

However, since the length of the signal must be of the form 2^ν, the final length of the signal after zero-padding should be:

ν = ⌈log_2(200)⌉ = 8 → N_FFT = 2^8 = 256 samples.
• Example 3: Let x(n) be a discrete-time signal with sampling rate F_s = 8 kHz. A segment of the signal corresponding to 25 ms is extracted for spectral analysis. If the DFT is used for spectral analysis, what is the minimum length, N, of signal used in order to provide frequency resolution ∆F ≤ 20 Hz? (In this case, don't worry about using FFTs.)
Note that the required frequency resolution is given in the continuous-frequency domain, so the corresponding discrete-frequency resolution must be determined:

∆ω = ∆Ω/F_s = 2π∆F/F_s ≤ 0.005π.
The frequency resolution provided by the DFT is given by ∆ω = 2π/N, so the minimum signal length can be determined as:

2π/N ≤ 0.005π ⇒ N ≥ 400 samples.

The minimum length of the discrete signal, after zero-padding, to guarantee frequency resolution of 20 Hz is N = 400 samples.
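Both examples reduce to one-line computations; the sketch below (plain Python, using only the values given in the examples) reproduces them:

```python
import math

Fs = 8000   # sampling rate (Hz)
T = 0.025   # segment duration (s)

# Example 2: minimum radix-2 FFT length for a 25 ms segment
L = int(T * Fs)               # 200 samples in the segment
nu = math.ceil(math.log2(L))  # smallest power such that 2**nu >= L
N_fft = 2 ** nu
assert (L, nu, N_fft) == (200, 8, 256)

# Example 3: minimum DFT length for frequency resolution dF <= 20 Hz
# (2*pi/N <= 2*pi*dF/Fs is equivalent to N >= Fs/dF)
dF = 20
N_min = math.ceil(Fs / dF)
assert N_min == 400
```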
2.6 Linear Time-Invariant Systems
The properties of linearity and time-invariance greatly simplify the analysis of systems. Consider the
system:
x(n) −→ H (z) −→ y(n)
Here, Y(z) = H(z) X(z), and y(n) is determined as:

y(n) = (h ∗ x)(n) = Σ_{m=−∞}^{∞} h(m) x(n − m),
where h(n) is the impulse response. Note that if the system applies the DFT, circular convolution (⊛) is used instead. To make the outputs of standard and circular convolution equivalent, the size of the DFT needs to be chosen as N ≥ N_1 + N_2 − 1, where N_1 and N_2 are the lengths of the signals being convolved.

Systems comprise zeros and poles (which will be discussed later during the topic of speech production). Zeros and poles are characterized by a spectral location as well as a bandwidth. The effect of zeros is attenuation in the frequency response, whereas the effect of poles is gain in the frequency response.
• Linearity: Let x_1(n) and x_2(n) be input signals, and let y_1(n) and y_2(n) be the corresponding output signals. If the system H(z) is linear, then:

αx_1(n) + βx_2(n) → αy_1(n) + βy_2(n).
• Time Invariance:The property of time-invariance describes a system H (z) as remaining constant throughout time. If
H (z) is time-invariant, then:
x(n − n_0) → y(n − n_0).
As will be seen later in the course, the speech production system is highly time-varying. However,aspects of the speech production system can be sampled at high enough rates to allow the assumption
of stationarity and thus assume time-invariance.
If a system is LTI, then the transfer function H(z), or equivalently the impulse response h(n), completely characterizes the system.
Additionally, systems that are in series can be combined to form a single cascaded system:
x(n) −→ H 1(z) −→ H 2(z) −→ y(n) =⇒ x(n) −→ H (z) −→ y(n)
where H(z) = H_1(z) H_2(z), or equivalently, h(n) = (h_1 ∗ h_2)(n).
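The cascade property can be verified directly; the sketch below (NumPy, with short illustrative FIR impulse responses) shows that filtering through two systems in series equals one pass through their combined impulse response:

```python
import numpy as np

# Impulse responses of two FIR systems in series (illustrative values)
h1 = np.array([1.0, 0.5])
h2 = np.array([1.0, -0.25, 0.125])

x = np.array([1.0, 2.0, 0.0, -1.0, 3.0])

# Passing x through H1 and then H2 ...
y_series = np.convolve(np.convolve(x, h1), h2)

# ... equals a single pass through the cascaded system h = h1 * h2
h = np.convolve(h1, h2)
y_cascade = np.convolve(x, h)

assert np.allclose(y_series, y_cascade)
```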
3 The Speech Production System
Figure 3 provides a simplified overview of the speech production system. Here, the speech production organs can be divided into three main groups: the lungs, the larynx, and the vocal tract. The lungs provide airflow to the larynx stage of the production system. The larynx modulates the flow of air, resulting either in a periodic series of pulses or a noisy flow. The vocal tract spectrally shapes the flow of air. In this section, a linear model will be used to approximate the speech production system, and signal processing techniques will be utilized to analyze the resulting speech signal.
Figure 3: An overview of the speech production system.
The speech production process can be approximated as a linear system:
u (n) Source Signal
−→ V (z) Vocal Tract
−→ s (n) Speech Signal
Here, the source signal, u (n), is created by the lungs and larynx components of the overall production
system illustrated in Figure 3. The source signal then passes through a linear time-invariant system, V (z),
representing the vocal tract. The resulting speech signal, s (n), is then emitted.
3.1 The Source Signal

As discussed previously, the source signal, u(n), is created by the lungs and larynx. During speech, the lungs maintain a slightly raised air pressure, causing a steady flow of air through the trachea. The larynx modulates the air flow from the lungs.

The larynx is a system of muscles, cartilage, and ligaments, whose function is to control the vocal cords or vocal folds. The vocal cords are a set of ligaments which stretch from the back to the front of the
larynx. The slit located between the vocal cords is called the glottis . Figure 4 shows the vocal folds and
glottis, as a downward-looking view of the human larynx.
Figure 4: The vocal folds and glottis, as a downward-looking view of the human larynx.
The vocal cords can be operated in three different states: breathing, unvoiced, and voiced. In the breathing state, the vocal cords are held apart with little tension to allow steady airflow without resistance. In the unvoiced state, the cords are held apart, though closer together and more tense than in the breathing state. This results in some turbulence, referred to as aspiration. In the voiced state, the vocal cords are brought together and held tense, resulting in oscillation of the cords. The breathing state is not directly involved in speech production, so in this section we will analyze the source signals produced during the unvoiced and voiced vocal cord states.
3.1.1 Unvoiced Source Signal
When the vocal cords are in the unvoiced state, air flows through the larynx with slight turbulence, or aspiration. The actual signal produced by unvoiced vocal cords is similar to white noise. The aspiration caused by low tension in the vocal cords results in “whisper-like” speech.
3.1.2 Voiced Source Signal
The close proximity and high tension of the vocal cords during the voiced state result in oscillation of the vocal cords, which leads to periodic modulation of airflow through the larynx. Each period is roughly characterized by a period of closure (closed phase), then a slow opening of the cords (open phase), followed by a rapid closure (return phase).

A study of the voiced source signal is presented in Fig. 5. The top left panel illustrates the airflow corresponding to one period of a simulated glottal closure. Note that the various phases of the vocal cord formations are labeled. The top right panel shows the spectrum of one period of the voiced source function. Note that the spectrum is monotonically decreasing with respect to frequency.

The bottom left panel illustrates a series of glottal closures. The time between adjacent peaks is referred to as the pitch period of the signal. The bottom right panel provides the spectrum of the series of glottal closures. Note that the spectrum approximates a sampled version of the spectrum above.
The pitch period of a speaker determines how “high” or how “low” the resulting speech is. The pitch period is denoted by T_0. A related parameter, the pitch frequency or fundamental frequency, is denoted
Figure 5: The Voiced Source Signal: The top panels show the time waveform and spectrum of one period of a simulatedvoiced source signal. The various temporal segments are labeled. The bottom panels illustrate the time waveform andspectrum of the complete voiced source signal.
F_0 (Hz)    ave.   min.   max.
men          125     80    200
women        225    150    350
children     300    200    500

Table 1: The dependency of pitch on sex and age.
by F_0, and is determined as:

F_0 = 1/T_0 ≈ 1/(2π√(MC)),

where M is the mass of the vocal folds per unit length and C is a measure of the compliance of the vocal folds, C = (stiffness)^{−1}: if you stiffen them, F_0 goes up; if you slacken them, F_0 goes down.
Although pitch varies in time during speech, pitch depends on sex and age: females and children have shorter vocal cords than adult males. Table 1 provides typical fundamental frequency ranges for males, females, and children.
3.2 The Vocal Tract
The vocal tract comprises the oral cavity and the nasal passage, and the two are coupled via the velum (see Figure 3). The main function of the vocal tract is to spectrally shape the source signal to produce desired speech sounds. Another function of the vocal tract is to generate new sources of sound.
3.2.1 Spectral Shaping
It is commonly assumed that the relationship between the source signal airflow and the airflow outputted by the vocal tract can be approximated by a linear filter, V(z). Certain configurations of the vocal tract components create specific resonant frequencies, referred to as formant frequencies or formants, and denoted by F_i. Note that the term formant can refer to information regarding both the spectral location and bandwidth of the corresponding resonance. In terms of the expression for V(z), the energy gain present at formants is due to the p poles. Since speech is naturally a real signal, the vocal tract transfer function's poles are either real or complex conjugate pairs:
V(z) = G / ( Π_{k=1}^{p_1/2} (1 − c_k z^{−1})(1 − c_k^∗ z^{−1}) · Π_{l=1}^{p_2} (1 − r_l z^{−1}) ),   p = p_1 + p_2.
Figure 6 illustrates an approximation of the frequency response of the vocal tract transfer function for the vowel sound in “boot,” Figure 7 shows the corresponding pole-zero plot, and |V(z)| is plotted in Fig. 8. Note that three formants are noticeable in this case. Also, F_1 shows a high peak and narrow bandwidth, as it is located near the unit circle of the Z-plane. Conversely, F_2 and F_3 show lower energy and wider bandwidths, since they are placed farther from the unit circle.
Figure 6: The approximated vocal tract transfer function for the vowel sound in “boot.”
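The link between pole placement and formant peaks can be explored numerically. The sketch below (NumPy; the formant frequencies and bandwidths are illustrative assumptions, not the actual “boot” values from the figures) builds an all-pole V(z) from complex-conjugate pole pairs and confirms that the response peaks near the narrow-bandwidth first formant:

```python
import numpy as np

Fs = 8000  # sampling rate (Hz)

# Illustrative (formant frequency, bandwidth) pairs in Hz -- assumptions
formants = [(300, 60), (870, 120), (2250, 200)]

# Each formant becomes a complex-conjugate pole pair c_k, c_k*
poles = []
for F, B in formants:
    r = np.exp(-np.pi * B / Fs)    # pole radius set by the bandwidth
    theta = 2 * np.pi * F / Fs     # pole angle set by the formant frequency
    poles += [r * np.exp(1j * theta), r * np.exp(-1j * theta)]

# Denominator polynomial of the all-pole V(z); coefficients come out real
a = np.real(np.poly(poles))

# Evaluate |V(e^{jw})| along the unit circle
w = np.linspace(0, np.pi, 4096)
z = np.exp(1j * w)
V = z ** (len(a) - 1) / np.polyval(a, z)   # V = 1 / sum_m a_m z^{-m}
mag = np.abs(V)

# The global peak sits near the first (narrowest, closest-to-circle) pole
peak_w = w[np.argmax(mag)]
assert abs(peak_w - 2 * np.pi * 300 / Fs) < 0.05
```

Increasing a formant's bandwidth moves its pole inward from the unit circle, which lowers and widens the corresponding peak, matching the F_2/F_3 behavior described above.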
3.2.2 Source Generation
Besides spectrally shaping the airflow signal from the larynx, the vocal tract can also generate source signals. Components of the vocal tract can generate two types of source signal: impulse sources and frication. Impulse sources are produced by creating a full closure in the oral cavity (such as by the tongue or lips), followed by a quick release. Frication sources are produced by creating a steady partial closure (such as between the tongue and palate, or between the tongue and teeth), causing turbulence.
3.3 The Complete Speech Production Model
As can be interpreted from the linear speech production model, the final speech signal is determined as the convolution of the source signal, u(n), with the impulse response of the vocal tract transfer function, v(n):
Figure 7: The pole-zero plot for the approximated vocal tract transfer function for the vowel sound in “boot.”
Figure 8: Plot of the amplitude of the vocal tract transfer function for the vowel sound in “boot,” |V (z)|, where thevalues it takes on the unit circle are highlighted in black.
s(n) = Σ_{τ=−∞}^{∞} v(n − τ) u(τ).
In the transform domain, this can be expressed as:
S (k) = V (k) U (k) .
• Example:
Figure 9 shows the spectrum of a steady-state vowel. (Note that in the figure, the harmonics appear as “lobes,” rather than as clean spikes, which is an effect of windowing. This will be addressed as a future topic.) What is the pitch period in samples? If the sampling frequency is known to be F_s = 8 kHz, what is the fundamental frequency in hertz? Is the speaker more likely to be an adult male, an adult female, or a child?
Figure 9: The spectrum of a steady-state vowel. (Note that in the figure, the harmonics appear as “lobes,” rather thanas clean spikes, which is an effect of windowing. This will be addressed as a future topic.)
The harmonics can be observed to be uniformly spaced by π/24. This corresponds to a pitch period of 48 samples. If the signal is sampled at F_s = 8 kHz, the fundamental frequency can be determined as:

F_0 = (1/48 samples) × 8000 samples/s ≈ 167 Hz.
Using Table 1, it can be determined that the speaker is most likely an adult male.
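The arithmetic in this example follows directly from the harmonic spacing; a short sketch (plain Python/NumPy, using only the values stated above):

```python
import numpy as np

Fs = 8000          # sampling rate (Hz)
w0 = np.pi / 24    # harmonic spacing = fundamental frequency (rad/sample)

T0 = 2 * np.pi / w0   # pitch period in samples: 2*pi / w0
F0 = Fs / T0          # fundamental frequency in Hz

assert abs(T0 - 48) < 1e-9
assert abs(F0 - 8000 / 48) < 1e-6   # about 166.7 Hz
```

Since 167 Hz falls inside the 80 to 200 Hz range for men in Table 1, the adult-male conclusion follows.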
3.4 Sound Classification
The basic unit of speech is referred to as a phoneme. Table 2 provides a condensed list of General American English phonemes, along with their International Phonetic Alphabet (IPA) and ARPABET symbols. The phonemes can be classified by considering the previous discussion on the physiological aspects of human speech production. Discriminative features used to classify phonemes include the presence of voicing, the location of formant frequencies, resonance within the nasal cavity, the presence of sound generation within the vocal tract, and many others. The major classes of phonemes are vowels and consonants. The latter include nasals, fricatives, and plosives. Some speech sounds are transitional and can be consonants, vowels, or intermediate sounds.
3.4.1 Vowels
Vowels are characterized by a voiced source signal. Each vowel corresponds to a different vocal tract configuration, resulting in specific formants. In English speech, the nasal cavity is decoupled from the oral cavity across all vowels.
3.4.2 Consonants
For consonants, the airflow through the vocal tract is restricted to some degree. The major consonant classes are described below.
3.4.3 Nasals
Nasals are also characterized by a voiced source signal. For nasals, the velum is lowered, which couples the nasal cavity and oral cavity. Additionally, the oral tract may be constricted, resulting in output airflow being radiated from the nostrils. Various nasal phonemes are distinguished by the location at which the oral cavity is constricted.
3.4.4 Fricatives
The discriminative feature of fricatives is that they entail source generation within the oral cavity. The oral cavity is constricted to a certain extent, causing turbulence. Examples include frication between the tongue and the palate (/s/) or between the upper teeth and the lower lip (/f/). Additionally, fricatives can include a voiced or unvoiced source signal.
3.4.5 Plosives (Stops)
Plosives are characterized by an impulsive burst of airflow generated within the oral cavity. This is produced by creating a full closure with the tongue or lips, followed by a quick release. Plosives can include voiced or unvoiced source signals.
3.4.6 Transitional Speech Sounds
The phoneme types previously discussed assume a degree of stationarity of the speech production system. However, certain phonemes are defined by the transition between steady-state sounds. These phoneme classes include diphthongs, glides, semi-vowels, and affricates.
Phoneme ARPABET Example Phoneme ARPABET Example
/i/ IY beat        /N/ NG sing
/I/ IH bit         /p/ P pet
/ej/ EY bait       /t/ T ten
/E/ EH bet         /k/ K kit
/æ/ AE bat         /b/ B bet
/a/ AA bob         /d/ D debt
/2/ AH but         /g/ G get
/O/ AO bought      /h/ HH hat
/ow/ OW boat       /f/ F fat
/Ú/ UH book        /T/ TH thing
/u/ UW boot        /s/ S sat
/@/ AX about       /S/ SH shut
/1/ IX roses       /v/ V vat
/Ç/ ER bird        /D/ DH that
/Ä/ AXR butter     /z/ Z zoo
/aw/ AW down       /Z/ ZH azure
/aj/ AY bite       /Ù/ CH church
/Oj/ OY boy        /Ã/ JH judge
/j/ Y you          /û/ WH which
/w/ W wit          /l̩/ EL battle
/ô/ R rent         /m̩/ EM bottom
/l/ L let          /n̩/ EN button
/m/ M met          /R/ DX batter
/n/ N net          /P/ Q (glottal stop)
Table 2: Condensed IPA and ARPABET lists for General American English.
Possible classification of General American English phonemes, by vocal-cord vibration:

• Voiced sounds:
  - Vowels: characterized by formant frequencies (resonances).
    Monophthongs: /a, i, e, I, u, o, E, æ, Ç, 2, O, Ú/
    Diphthongs: /aj, ej, aw, ow/
  - Semivowels (glides): /w, j/
  - Liquids: /l, ô/
• Unvoiced sounds:
  - Aspirant: /h/
• Consonants (voiced or unvoiced):
  - Stops: /p, t, k, b, d, g/ (silence + burst)
  - Fricatives: /s, S, z, Z, v, f, T, D/ (turbulence noise generation)
  - Nasals: /m, n, N/
  - Affricates: /Ù, Ã/ (stops + fricatives)
4 Tube Model of the Vocal Tract
How can one find the transfer function V (z)?
Wave equations:

−∂u(x,t)/∂x = (A/(ρc²)) ∂p(x,t)/∂t,
−∂p(x,t)/∂x = (ρ/A) ∂u(x,t)/∂t,

where
A = cross-sectional area (assumed uniform),
t = time,
ρ = air density, 1.14 × 10⁻³ g/cm³,
u = volume velocity, from 0 cm³/s to 1000 cm³/s,
c = speed of sound, ≈ 340 m/s to 350 m/s in air,
p = pressure.
The pressure is usually given in terms of the sound pressure level, SPL, in dB:

SPL = 20 log₁₀(P_rms / P_ref),

where P_ref = 20 µPa is the reference pressure, considered to be the audibility threshold for humans at 1 kHz.
Note that in the equations above we can draw parallels to electric circuits by making the following analogies:
Acoustic quantity                     Analogous electric quantity
p, pressure                           v, voltage
u, velocity                           i, current
ρ/A, acoustic inductance              L, inductance
A/(ρc²), acoustic capacitance         C, capacitance
If the width of the vocal tract (on average around 2 cm) is much smaller than λ = c/F, then, to a first-order approximation, the pressure waves can be considered planar, and the vocal tract can be modeled as a series of tubes. Three main tube types are considered: the quarter-wavelength tube (closed at one end and open at the other) and the two half-wavelength tubes (one closed at both ends and the other open at both ends). For a tube of length l and cross-sectional area A, the resonant frequencies are:

Quarter-wavelength tube: F = (c/4l)(2n − 1), n = 1, 2, . . .
Half-wavelength tubes: F = (c/2l) n, n = 1, 2, . . .
In addition, small constriction tubes that connect larger tubes can be modeled as Helmholtz resonators, characterized by a resonating frequency given by

F = (c/2π) √( A₁ / (l₁ l₂ A₂) ),
Figure 10: Helmholtz resonator (cf. a wine bottle).
where l_1 and A_1 are the length and cross-sectional area of the constriction, and l_2 and A_2 are the length and cross-sectional area of the preceding tube; see Fig. 10.
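Plugging numbers into the tube formulas gives a feel for where formants come from. The sketch below (plain Python; the 17.5 cm tube length is a typical textbook vocal tract length, and the constriction dimensions are illustrative assumptions, neither given in the text) reproduces the classic 500/1500/2500 Hz resonances of a uniform quarter-wavelength tract:

```python
import math

c = 350 * 100   # speed of sound in cm/s (350 m/s, the upper value above)

# Quarter-wavelength tube (closed at the glottis, open at the lips),
# with an assumed typical vocal tract length l = 17.5 cm
l = 17.5
quarter = [c / (4 * l) * (2 * n - 1) for n in (1, 2, 3)]
assert quarter == [500.0, 1500.0, 2500.0]

# Half-wavelength tube of the same length
half = [c / (2 * l) * n for n in (1, 2, 3)]
assert half == [1000.0, 2000.0, 3000.0]

# Helmholtz resonance of a constriction (l1, A1) opening from a cavity
# (l2, A2); the dimensions below are purely illustrative
l1, A1 = 2.0, 0.5    # cm, cm^2
l2, A2 = 15.0, 5.0   # cm, cm^2
F = c / (2 * math.pi) * math.sqrt(A1 / (l1 * l2 * A2))
assert 300 < F < 350  # a low resonance, roughly 320 Hz for these dimensions
```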
5 Short-Time Speech Analysis
We have previously looked at properties and applications of linear, time-invariant (LTI) systems. However, when modeling the speech production system as a linear system, it must be considered time-varying due to the dynamic nature of speech. Thus, in order to correctly analyze the spectral or temporal characteristics of speech, we can extract short segments that can be assumed stationary. That is, we can window speech segments wherein information regarding pitch, formants, etc., doesn't change significantly.
5.1 The Short-Time Fourier Transform
Common tools used to analyze the spectral characteristics of speech with respect to time are the discrete-time short-time Fourier transform (STFT) and the discrete STFT. For an input speech signal s(n), the discrete-time STFT is defined as:

S_n(e^{jω}) = Σ_{m=−∞}^{∞} s(m) w(n − m) e^{−jωm},

where w(n) is the analysis window used to extract the current speech signal segment, and is nonzero only on the range n ∈ [0, N_w − 1].
The STFT provides information regarding the spectral characteristics of speech signals as a function of two variables, n and ω. The STFT can be interpreted in two ways:

1. Fixed time (n): If the time index, n, is fixed, the STFT provides the Fourier transform of x_n(m) = s(m) w(n − m), a speech signal windowed on the range m ∈ [n − N_w + 1, n]:

S_n(e^{jω}) = Σ_{m=−∞}^{∞} x_n(m) e^{−jωm}.
2. Fixed frequency (ω): If the frequency, ω, is fixed, |S_n(e^{jω})|² provides the trajectory of energy values with respect to time contained by the input speech signal at frequency ω.
Note that the discrete-time STFT provides the frequency variable ω as a continuous function, which may not be feasible for some applications. Thus, the discrete STFT can be used:

S_n(k) = Σ_{m=−∞}^{∞} s(m) w(n − m) e^{−j(2π/N)km}.
The discrete STFT can also be expressed as the discrete-time STFT, sampled uniformly in the frequency domain:

S_n(k) = S_n(e^{jω})|_{ω=2πk/N}.
5.2 The Analysis Window
Two common analysis windows used are the rectangular window and the Hamming window:
• The Rectangular Window:

w_r(n) = 1, if 0 ≤ n ≤ N_w − 1; 0, else.

• The Hamming Window:

w_h(n) = 0.54 − 0.46 cos(2πn/(N_w − 1)), if 0 ≤ n ≤ N_w − 1; 0, else.
The rectangular and Hamming windows are illustrated, along with their spectra, in Figure 11. The top panels illustrate the time waveform and spectrum of the rectangular window. The bottom panels illustrate the time waveform and spectrum of the Hamming window. Note that the main lobe of the rectangular window has a small bandwidth (4π/N_w), but the side-lobes are very prominent. The Hamming window shows a main lobe of greater bandwidth (8π/N_w), but has highly attenuated side-lobes.
Figure 11: Examples of common analysis windows: The top panels illustrate the time waveform, w_r(n), and spectrum, W_r(ω), of the rectangular window. The bottom panels illustrate the time waveform, w_h(n), and spectrum, W_h(ω), of the Hamming window.
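The main-lobe/side-lobe trade-off can be measured numerically. The sketch below (NumPy; the window length and FFT size are illustrative choices) builds both windows from the definitions above and compares their peak side-lobe levels:

```python
import numpy as np

Nw = 64
n = np.arange(Nw)

w_rect = np.ones(Nw)
w_hamm = 0.54 - 0.46 * np.cos(2 * np.pi * n / (Nw - 1))

def spectrum_db(w, N=4096):
    # Zero-padded magnitude spectrum, normalized to 0 dB at its peak
    W = np.abs(np.fft.fft(w, N))
    return 20 * np.log10(W / W.max() + 1e-12)

# Peak side-lobe level, searching just beyond each main lobe
# (half-widths: 2*pi/Nw for rectangular, about 4*pi/Nw for Hamming)
rect_sidelobe = spectrum_db(w_rect)[64:2048].max()
hamm_sidelobe = spectrum_db(w_hamm)[140:2048].max()

# Rectangular: narrow main lobe, side-lobes only ~13 dB down.
# Hamming: main lobe twice as wide, side-lobes roughly 40 dB down.
assert -15 < rect_sidelobe < -12
assert hamm_sidelobe < -38
```

This quantifies the statement above: the Hamming window trades a doubled main-lobe width for far stronger side-lobe suppression.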
5.3 Spectral Leakage Associated with Windowing
Applying analysis windows to speech segments during spectral analysis causes energy naturally located in a given spectral location to “leak” into nearby locations. Since the window is multiplied with the speech signal in the time domain, spectral leakage can be interpreted as convolution of the true speech spectrum with the spectrum of the analysis window:
w(n) s(n) ↔ (1/2π) W(e^{jω}) ∗ S(e^{jω}).
Thus, the optimal (yet unrealistic) window would have a frequency response composed of a delta function located at ω = 0. Since this window would be infinite in duration, we instead desire to use windows whose spectra show low-bandwidth main lobes and highly attenuated side-lobes, thus approximating the optimal delta function case.
In the case of voiced speech, spectral leakage will cause harmonics to appear as lobes, as opposed to clean spikes.
• Example: Figure 12 shows the time waveform and discrete STFT of the speech signal “seven.” As was discussed, the STFT can be interpreted in two ways: for a fixed time index n, or for a fixed frequency index k. Figure 13 illustrates both cases: for a fixed time index during the steady-state vowel /e/, and for the fixed frequency bin corresponding to 1813 Hz.
Figure 12: Example of the Discrete STFT: the top panel shows the time waveform for the speech signal “seven.” The bottom panel shows the corresponding discrete STFT, using a Hamming window.
• Example: Figure 14 provides an example spectrum of a steady-state vowel. If spectral analysis was carried out using a Hamming window, what was the length of the analysis window? (Assume negligible effects due to side-lobes.) If the sampling rate was F_s = 8 kHz, what was the duration of the analysis window, T_w, in seconds?

The bandwidth of the main lobe of a Hamming window is 8π/N_w. From Figure 14, it can be observed that the bandwidth of the analysis window is roughly equal to the harmonic spacing, i.e., π/24. Thus:

8π/N_w = π/24 ⇒ N_w = 192 samples.
Figure 13: Illustrating the various interpretations of the STFT: the top panel shows the STFT for a fixed time index during the steady-state vowel /e/; the bottom panel shows the STFT for a fixed frequency bin corresponding to 1813 Hz.
If the sampling rate is F_s = 8 kHz, then:

T_w = 192 samples × (1/8000) s/sample = 0.024 s.
5.4 Wideband vs. Narrowband Analysis
Recall the inverse relationship between analysis window length (in the time domain) and main-lobe bandwidth (in the frequency domain). Thus, as we decrease the duration of the analysis window, we increase the amount of spectral leakage, and “smear” the resulting spectrum. In such a scenario, harmonics will become blurred, but formants will become resolved. Additionally, better time resolution is achieved. This corresponds to wideband (WB) analysis.

Conversely, if we increase the duration of the analysis window, we decrease the spectral leakage, and reduce the amount of “smearing” of the resulting spectrum. In this scenario, harmonics will remain resolved, but it may be difficult to accurately locate formants. Here, better frequency resolution is achieved. This corresponds to narrowband (NB) analysis.

Figure 15 provides examples of wideband and narrowband STFT analysis. The top panel shows WB analysis; the bottom panel shows NB analysis. Note the resolved harmonics in the NB case, whereas the harmonics have been blurred due to spectral leakage in the WB case. However, note that the formants are clearer for the WB case. Similarly, Fig. 16 shows that formants are more clearly visible in the WB analysis of vowels, whereas NB analysis can be used to discern harmonics. WB analysis can be used for segmentation, as shown by the crisper transitions in the WB spectrogram of Fig. 17. Also note the three occurrences of /æ/ at 200 ms, 600 ms, and 1200 ms from the start of the utterance: the wideband spectrogram in Fig. 17 shows that the formants are the same, while the narrowband spectrogram indicates that the pitch varies.
Figure 14: The spectrum of a steady-state vowel.
5.5 Time-Domain Analysis
Although short-time spectral analysis is widely used for speech processing applications, there exist a number of short-time time-domain speech features which are also important.
• Short-Time Energy:

STE(n) = Σ_{m=−∞}^{∞} |s(m)|² w(n − m).

The short-time energy is commonly used in voice-activity detection (VAD) and automatic speech recognition (ASR) of high-energy segments (i.e., vowels). It helps detect pauses, boundaries between phonemes, words, or syllables, and voiced vs. unvoiced sounds.
• Zero-Crossing Rate:

ZCR(n) = Σ_{m=−∞}^{∞} (1/(2N_w)) |sgn(s(m)) − sgn(s(m − 1))| w_r(n − m).

The zero-crossing rate is used in applications such as VAD to differentiate between periodic signals (low ZCR) and noisy signals (high ZCR). Typical values of ZCR are 1400 crossings per second for voiced sounds and 4900 crossings per second for unvoiced sounds. A threshold at about 2500 crossings per second can be used to discriminate voiced vs. unvoiced sounds. ZCR can be used to estimate the frequency of a periodic sound: note that the frequency F_a of a sinusoid is related to the zero-crossing rate (expressed in crossings per sample) by

F_a = ZCR · F_s / 2.
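The frequency-estimation rule can be demonstrated on a synthetic tone. The test frequency, phase offset, and tolerance below are arbitrary choices for illustration; with the crossing count taken over one second, the rate is already per-second, so F_a ≈ ZCR/2:

```python
import math

def zero_crossings(s):
    """Count sign changes between consecutive samples."""
    def sgn(v):
        return 1 if v >= 0 else -1
    return sum(1 for m in range(1, len(s)) if sgn(s[m]) != sgn(s[m - 1]))

Fs = 8000
Fa = 101.0  # test tone frequency (Hz); phase offset avoids exact-zero samples
s = [math.sin(2 * math.pi * Fa * n / Fs + 0.123) for n in range(Fs)]  # 1 s of signal
crossings_per_second = zero_crossings(s)   # about 2 * Fa crossings
Fa_est = crossings_per_second / 2.0        # F_a = ZCR / 2 for a per-second rate
```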
• Short-Time Autocorrelation:

R_n(k) = Σ_{m=−∞}^{∞} x_n(m) x_n(m − k), (1)

where x_n(m) = s(m) w(n − m). The short-time autocorrelation is an important tool in determining Linear Predictive Coefficients (LPCs), which are used in ASR as well as speech coding.

The (non-windowed) autocorrelation function, R(k) = Σ_{m=−∞}^{∞} s(m) s(m − k), has several important properties:
Figure 15: Wideband vs. narrowband STFT analysis: the top panel illustrates wideband analysis of the speech signal “seven”; the bottom panel illustrates narrowband analysis of the speech signal “seven.”
1. R (k) is an even function.
2. If s(n) is periodic, i.e., s(n) = s(n + N) for some N, then R(k) is periodic:

R(k) = Σ_m s(m) s(m − k) = Σ_m s(m) s(m − k − N) = R(k + N).
3. R(k) has its maximum value at k = 0, i.e., R(k) ≤ R(0).
The autocorrelation function of a voiced speech segment has peaks occurring at intervals equal to 1/F₀ seconds.
• Short-time average magnitude difference function (AMDF):

AMDF_n(k) = Σ_{m=−∞}^{∞} |s(m) w_r(n − m) − s(m − k) w_r(n − m + k)|.

The AMDF of voiced (i.e., quasi-periodic) speech segments has nulls at intervals equal to 1/F₀ seconds.
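The pitch-period property of the autocorrelation is easy to demonstrate: for a synthetic signal with period N₀ samples, R(k) peaks at lag N₀. The test signal, window length, and lag search range below are invented for illustration:

```python
import math

def autocorr(x, k):
    """R(k) = sum_m x(m) x(m - k), over the windowed segment x."""
    return sum(x[m] * x[m - k] for m in range(k, len(x)))

N0 = 50  # pitch period in samples, i.e., F0 = Fs / 50
x = [math.sin(2 * math.pi * n / N0) + 0.5 * math.sin(4 * math.pi * n / N0)
     for n in range(400)]
# The largest autocorrelation peak away from lag 0 falls at the pitch period:
lag = max(range(20, 101), key=lambda k: autocorr(x, k))
```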
See also Sections 6.1–6.4.3 of the textbook.
(a) Wideband analysis of five vowels.
(b) Narrowband analysis of five vowels.
Figure 16: Wideband vs. narrowband STFT of five vowels.
Figure 17: Wideband (top) vs. narrowband (bottom) STFT analysis of the sentence “Is Pat sad or mad?” The fifth, tenth, and fifteenth harmonics of two vowels are marked with white squares in the bottom panel. From P. Ladefoged, A Course in Phonetics, 2001.
6 Linear Prediction Analysis
Linear prediction analysis is based on the notion that speech samples can be predicted accurately from previous speech. That is, speech samples can be estimated as a linear combination of past samples. In determining the optimal weights by which to combine previous samples, we reveal important spectral information regarding the signal.
6.1 The All-Pole Vocal Tract Model
As discussed previously, we assume an all-pole model for the vocal-tract transfer function:
V(z) = G / [ Π_{k=1}^{p₁/2} (1 − c_k z⁻¹)(1 − c_k* z⁻¹) · Π_{ℓ=1}^{p₂} (1 − r_ℓ z⁻¹) ], p = p₁ + p₂.

Since speech is naturally a real signal, the poles appear either as complex conjugate pairs, (c_k, c_k*), or as real poles, r_ℓ. An alternative form of the vocal tract transfer function, which will prove useful for linear prediction analysis, is:

V(z) = G / (1 − Σ_{k=1}^{p} a_k z⁻ᵏ). (2)
Recall the expression that relates the source signal U (z) and the output speech S (z) in a linear system:
S (z) = V (z) U (z) . (3)
By substituting the expression for the all-pole transfer function from (2) into the expression for the linear speech production model (3), we obtain:

S(z) = Σ_{k=1}^{p} a_k S(z) z⁻ᵏ + G U(z).

By taking the inverse Z-transform we obtain:

s(n) = Σ_{k=1}^{p} a_k s(n − k) + G u(n).
Thus, the current speech sample, s(n), can be predicted as a function of past speech samples {s(n − 1), …, s(n − p)} and the current source signal sample, u(n). The parameters a_k are referred to as the Linear Predictive Coefficients (LPCs).

The expression A(z) = 1 − Σ_{k=1}^{p} a_k z⁻ᵏ is referred to as the inverse filter, since it theoretically inverts the effect of the vocal tract, and returns the source signal. If the vocal tract truly is an all-pole system and is modeled perfectly during linear predictive analysis, then inverse filtering simply gives the source function:

A(z) S(z) = A(z) V(z) U(z) = A(z) (G / A(z)) U(z) = G U(z).
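The same cancellation can be checked in the time domain: synthesize s(n) from known coefficients, then apply the inverse filter e(n) = s(n) − Σ a_k s(n − k) and recover G u(n). The coefficients, gain, and impulse-train excitation below are arbitrary choices for illustration:

```python
# Assumed example values: a = [0.5, -0.25] (a stable all-pole model), G = 2,
# and an impulse-train excitation u(n).
a = [0.5, -0.25]
G = 2.0
N = 60
u = [1.0 if n % 20 == 0 else 0.0 for n in range(N)]

# Synthesis: s(n) = sum_k a_k s(n - k) + G u(n)
s = []
for n in range(N):
    val = G * u[n]
    for k, ak in enumerate(a, start=1):
        if n - k >= 0:
            val += ak * s[n - k]
    s.append(val)

# Inverse filtering: e(n) = s(n) - sum_k a_k s(n - k) should return G u(n)
e = []
for n in range(N):
    val = s[n]
    for k, ak in enumerate(a, start=1):
        if n - k >= 0:
            val -= ak * s[n - k]
    e.append(val)
```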
6.2 Deriving Linear Predictive Coefficients
To derive the linear predictive coefficients, a_k, we first define the least-square error function:

E = Σ_{n=−∞}^{∞} (s(n) − ŝ(n))² = Σ_{n=−∞}^{∞} (s(n) − Σ_{k=1}^{p} a_k s(n − k))², (4)

where ŝ(n) is the estimate of s(n) based on the past samples.

To derive each optimal coefficient, we minimize the least-square error function by setting the derivative of E with respect to that coefficient to zero:
∂E/∂a_i = ∂/∂a_i Σ_{n=−∞}^{∞} (s(n) − Σ_{k=1}^{p} a_k s(n − k))²

= 2 Σ_{n=−∞}^{∞} (s(n) − Σ_{k=1}^{p} a_k s(n − k)) · ∂/∂a_i (s(n) − Σ_{k=1}^{p} a_k s(n − k))

= −2 Σ_{n=−∞}^{∞} (s(n) s(n − i) − Σ_{k=1}^{p} a_k s(n − k) s(n − i)) = 0.
This leads to the normal equations:

Σ_{n=−∞}^{∞} s(n) s(n − i) = Σ_{k=1}^{p} a_k Σ_{n=−∞}^{∞} s(n − k) s(n − i). (5)

Note that using (5) in (4) yields the minimum least-square error:

E_min = Σ_{n=−∞}^{∞} [ s²(n) − Σ_{k=1}^{p} a_k s(n − k) s(n) ]. (6)
The minimum error is actually the square of the gain used for the LPC-based vocal tract transfer function, if the excitation, u(n), is normalized such that Σ_{n=−∞}^{∞} u²(n) = 1.
The normal equations can be solved by means of two methods: using the autocorrelation function or
using the covariance function. The covariance method is more accurate, but induces a higher computational
complexity. There exist efficient algorithms for determining LPCs using the autocorrelation method.
6.3 The Autocorrelation Method
The autocorrelation method for linear prediction analysis assumes that the signal is only nonzero over an interval of N_w samples, with N_w > p. That is, the signal was windowed prior to analysis. In the following, let us assume that the interval is [0, N_w − 1], and, in the notation of (1), let R(k) = R_n(k), with n = N_w − 1. When applying the autocorrelation function:

R(i) = Σ_{m=i}^{N_w−1} x(m) x(m − i), i = 0, …, N_w − 1,
where x(m) = s(m) w(n − m), n = N_w − 1. Because x(m) is non-zero only for m ∈ [0, N_w − 1], and x(m − i) is non-zero for m ∈ [i, N_w + i − 1], the summation only needs to be carried out over the range [i, N_w − 1]. In this case, the normal equations (5), where we use the windowed signal, x(n), instead of s(n), become:

R(i) = Σ_{k=1}^{p} a_k R(i − k), and more generally

R(i) = Σ_{k=1}^{p} a_k R(|i − k|), i = 1, …, p.
Note that in the above equation we took advantage of the property that the autocorrelation is an even
function. The normal equations (5) can be written in matrix form as:
R a = r, where

R = [ R(0) R(1) ⋯ R(p−1) ; R(1) R(0) ⋯ R(p−2) ; ⋮ ⋱ ⋮ ; R(p−1) R(p−2) ⋯ R(0) ], a = [ a₁ ; ⋯ ; a_p ], r = [ R(1) ; ⋯ ; R(p) ].
The matrix R is Toeplitz, meaning that it is symmetric with identical elements along each diagonal. There exist efficient algorithms to invert Toeplitz matrices, so the vector a can be determined as:

a = R⁻¹ r.
The minimum least-square error (6) is then calculated as:
E_min = R(0) − Σ_{k=1}^{p} a_k R(k).
• Example: Consider the speech segment [3, 2, −1, 1] (note that we used N_w = 4). Find the 2nd-order vocal tract transfer function using linear predictive analysis.

R(i) = [15, 3, −1, 3].

For order p = 2:

R = [ 15 3 ; 3 15 ], r = [ 3 ; −1 ], a = R⁻¹ r = [ 2/9 ; −1/9 ],

E_min = R(0) − Σ_{k=1}^{2} a_k R(k) = 15 − (2/9)(3) − (−1/9)(−1) = 128/9 ≈ 14.22,

G = √E_min ≈ 3.77,

⇒ V(z) = 3.77 / (1 − (2/9) z⁻¹ + (1/9) z⁻²),
with roots at z = c₁ and z = c₁*, with c₁ = (1 + j√8)/9 ≈ 0.11 + j0.31. This corresponds to an estimated formant F₁ ≈ 0.196 F_s.
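The worked example can be reproduced with a few lines of pure Python, solving the 2 × 2 normal equations directly (the closed-form 2 × 2 inverse is used instead of a Toeplitz solver for brevity):

```python
import cmath

x = [3.0, 2.0, -1.0, 1.0]
Nw = len(x)

def R(i):
    """Autocorrelation of the windowed segment: sum over m in [i, Nw - 1]."""
    return sum(x[m] * x[m - i] for m in range(i, Nw))

# Normal equations for p = 2:  [R0 R1; R1 R0] [a1; a2] = [R1; R2]
R0, R1, R2 = R(0), R(1), R(2)
det = R0 * R0 - R1 * R1
a1 = (R0 * R1 - R1 * R2) / det
a2 = (R0 * R2 - R1 * R1) / det
Emin = R0 - a1 * R1 - a2 * R2
G = Emin ** 0.5

# Poles of V(z): roots of z^2 - a1 z - a2 = 0
c1 = (a1 + cmath.sqrt(a1 * a1 + 4 * a2)) / 2
F1_rel = abs(cmath.phase(c1)) / (2 * cmath.pi)   # formant as a fraction of Fs
```

Running this gives a = [2/9, −1/9], E_min = 128/9, and F₁ ≈ 0.196 F_s, matching the example.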
The choice of the order p during LPC analysis should reflect the number of formants expected. Recall that a pair of conjugate poles is required to create a formant. For speech sampled at 8 kHz, p is usually chosen between 10 and 13. Figure 18 shows the effect of model order on LPC analysis. The top panel shows the original spectrum of the steady-state vowel /i/. The bottom panels show the vocal tract transfer function obtained from LPC analysis using orders p = 2, 10, and 25, respectively. Note that when p is chosen to be too large, a number of spurious peaks may turn up.
Figure 18: The effect of model order on LPC analysis: The top panel shows the original spectrum of the steady-state vowel /i/. The bottom panels show the vocal tract transfer function obtained from LPC analysis using orders p = 2, 10, and 25, respectively.
6.4 The Covariance Method
In the autocorrelation method, it was assumed that the signal is nonzero only on the range [0, N_w − 1]. In the covariance method, however, we drop this assumption, and instead define the covariance function as:
φ(i, k) = Σ_{n=0}^{p} s(n − k) s(n − i), 0 ≤ i ≤ p, 0 ≤ k ≤ p. (7)

Note that since the signal is not “windowed,” as in the autocorrelation case, the summation in (7) is always carried out on the range [0, p].
The normal equations for the covariance method become:
φ(i, 0) = Σ_{k=1}^{p} a_k φ(i, k), for 1 ≤ i ≤ p.
In matrix form, the normal equations can be expressed as:

Φ a = φ, where

Φ = [ φ(1,1) φ(1,2) ⋯ φ(1,p) ; φ(2,1) φ(2,2) ⋯ φ(2,p) ; ⋮ ⋱ ⋮ ; φ(p,1) ⋯ φ(p,p) ], a = [ a₁ ; ⋯ ; a_p ], φ = [ φ(1,0) ; ⋯ ; φ(p,0) ].
Note that the matrix Φ is symmetric but not necessarily Toeplitz. Thus it is generally more difficult to determine the inverse of Φ than of R. Nevertheless, the linear predictive coefficients can be solved as:

a = Φ⁻¹ φ.

Furthermore, the minimum error is determined as:

E_min = φ(0, 0) − Σ_{k=1}^{p} a_k φ(0, k).
The covariance method can be interpreted as the error (4) being windowed, as opposed to the signal being windowed (as in the autocorrelation case).
• Example: Consider the speech segment [. . . , −3, 1, −2, 3, 2, −1, 1, −2, 0, 1, . . . ], with the origin at s(0) = 3 (compare this with the previous example). Find the 2nd-order vocal tract transfer function using the covariance method of linear predictive analysis.

For order p = 2:

Φ = [ 17 −2 ; −2 14 ], φ = [ −2 ; −4 ], a = Φ⁻¹ φ = −(1/117) [ 18 ; 36 ],

E_min = φ(0,0) − Σ_{k=1}^{2} a_k φ(0,k) = 14 − (−18/117)(−2) − (−36/117)(−4) = 1458/117 ≈ 12.46,

G = √E_min ≈ 3.53,

⇒ V(z) = 3.53 / (1 + (18/117) z⁻¹ + (36/117) z⁻²),

with poles at z = c₁ and z = c₁*, with c₁ = (1/117)(−9 + j√4131) ≈ −0.077 + j0.55, corresponding to an estimated formant F₁ ≈ 0.272 F_s.
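This example, too, can be reproduced numerically. The sample indexing (origin at s(0) = 3) is inferred from the stated Φ and φ, and the 2 × 2 system is solved in closed form:

```python
import cmath

# Speech segment with s(0) = 3 (indices -3..6), as in the example above:
vals = [-3, 1, -2, 3, 2, -1, 1, -2, 0, 1]
s = {n - 3: float(v) for n, v in enumerate(vals)}

p = 2
def phi(i, k):
    """Covariance function (7): sum over n in [0, p]."""
    return sum(s[n - k] * s[n - i] for n in range(p + 1))

# Solve Phi a = phi_vec for p = 2
p11, p12, p22 = phi(1, 1), phi(1, 2), phi(2, 2)
b1, b2 = phi(1, 0), phi(2, 0)
det = p11 * p22 - p12 * p12
a1 = (p22 * b1 - p12 * b2) / det
a2 = (p11 * b2 - p12 * b1) / det
Emin = phi(0, 0) - a1 * b1 - a2 * b2
G = Emin ** 0.5

c1 = (a1 + cmath.sqrt(a1 * a1 + 4 * a2)) / 2   # pole of the all-pole model
F1_rel = abs(cmath.phase(c1)) / (2 * cmath.pi)
```

This yields a = −(1/117)[18, 36], E_min = 1458/117 ≈ 12.46, and F₁ ≈ 0.272 F_s, matching the example.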
Part II
Image Processing
7 Examples of Applications
• Photography
• Computer vision
• Remote sensing
– Planetary exploration
– Environmental monitoring
– Geology
– Reconnaissance
• Medicine
– Magnetic resonance imaging (MRI)
– Positron emission tomography (PET)
– Angiograms
– Digital radiography
• Communications
– Videoconferencing
– HDTV
• Chemistry
• Etc.
8 Image Processing vs. 1D DSP
• Similarities
– Sampling
– Filtering
– Impulse response
– Transforms
• Differences
– Data size
– Complexity (1D vs. 2D)
– Conceptual challenges due to dimensionality
– Interface with human visual system
9 Overview
• Introduction
• 2D linear system theory
– Math preliminaries
– Continuous 2D systems, including convolution, FT

– Discrete systems
– Image sampling
• Image transforms
– Fourier (2D discrete)
– Discrete cosine transform (DCT)
– Other (KL, etc.)
• Image Enhancement
– Histogram modification
– Edge detection
– Noise filtering
• Image compression
• Human visual perception vs. digital image processing
10 1D vs. 2D
In 1D, a linear, time-invariant (LTI) system satisfies:
If f (x) −→ LTI −→ g(x),
then f (x − a) −→ LTI −→ g(x − a)
and k1f 1(x) + k2f 2(x) −→ LTI −→ k1g1(x) + k2g2(x)
The functions f (x) and g(x) are related by convolution and the impulse response h(x):
g(x) = ∫_{−∞}^{∞} f(u) h(x − u) du = (f ∗ h)(x)
Examples:

• f(x) ∗ δ(x) = f(x): convolving with an impulse reproduces the input.

• f(x) ∗ δ(x − a) = f(x − a): convolving with a shifted impulse shifts the input by a.

• Convolving f(x) with a rect of width 2b smooths f over an interval of width 2b.

• Convolving a rect of height b supported on [−a, a] with itself yields a triangle supported on [−2a, 2a] with peak value 2ab².
Recall that

rect(x) = 1 for |x| < 1/2, 0 for |x| > 1/2,

and

δ(x) = lim_{ε→0} (1/ε) rect(x/ε), δ(x) = 0 for x ≠ 0, ∫_{−∞}^{∞} δ(x) dx = 1, δ(ax) = (1/|a|) δ(x).
The 2D impulse is defined

δ(x, y) = lim_{ε→0} (1/ε) rect(x/ε) (1/ε) rect(y/ε) = δ(x) δ(y),

δ(x, y) = 0 for (x, y) ≠ (0, 0), and ∬ δ(x, y) dx dy = 1.
Note that f(x, y) ∗ δ(x − a, y − b) = f(x − a, y − b).
A 2D linear, shift-invariant (LSI) system satisfies:
If f (x, y) −→ LSI −→ g(x, y),
then f (x − a, y − b) −→ LSI −→ g(x − a, y − b)
and k1f 1(x, y) + k2f 2(x, y) −→ LSI −→ k1g1(x, y) + k2g2(x, y)
An LSI system can be completely defined by its impulse response:
δ(x, y) −→ LSI −→ h(x, y)
Example: Astronomical imaging: the actual image of a star resembles δ(x − a, y − b), and the resulting image is h(x − a, y − b). Several stars produce similar images.
In general, the output of a 2D LSI system can be understood as the superposition of shifted and scaled versions of h(x, y):

g(x, y) = ∬ f(x′, y′) h(x − x′, y − y′) dx′ dy′ = (f ∗ h)(x, y) = (h ∗ f)(x, y).
Examples (note: a dot represents a delta):

• δ(x − a, y) ∗ δ(x − b, y) = δ(x − a − b, y): convolving two shifted impulses adds their shifts.

• Convolving a pattern of impulses f(x, y) with a shifted impulse h(x, y) translates the entire pattern by the same shift.
11 The Fourier Transform
The FT is probably the single most important mathematical tool in image processing. We will review the 1D FT and then study the 2D FT.

Although image processing generally involves 2D discrete functions, it uses many techniques modeled on analog processes. A good understanding of 1D and 2D continuous FT theory is essential to image processing.

Given a 1D continuous function f(x), the FT and inverse FT (IFT) are defined as follows:

F(u) := ∫_{−∞}^{∞} f(x) e^{−j2πux} dx = F[f](u) (8a)

f(x) := ∫_{−∞}^{∞} F(u) e^{j2πux} du = F⁻¹[F](x). (8b)

The units of u are the inverse of the units of x: if x is measured in seconds, then u is measured in hertz; if x is in meters, u is in cycles per meter (spatial frequency).
F(u) is a decomposition of f(x) into its component frequencies.

Example: F(u) = (1/2) δ(u + a) + (1/2) δ(u − a).

Using the sifting property of the delta function:

∫_{−∞}^{∞} δ(x − x₀) f(x) dx = f(x₀),

we obtain

f(x) = ∫_{−∞}^{∞} [ (1/2) δ(u + a) + (1/2) δ(u − a) ] e^{j2πux} du = (1/2) e^{j2πax} + (1/2) e^{−j2πax} = cos(2πax).
In general, the 1D FT provides the decomposition of f(x) in terms of cosines of the form A cos(2πfx + θ). The set of all cosines, at all combinations of A, f, and θ, provides a complete basis to build 1D functions.

Some important transforms:

δ(x) ↔ 1

b rect(x/a) ↔ b|a| sinc(au)

e^{−πx²} ↔ e^{−πu²},

where sinc(x) := sin(πx)/(πx) is the normalized cardinal sine function.
The Convolution Theorem states that if

g(x) = (f ∗ h)(x) = ∫_{−∞}^{∞} f(z) h(x − z) dz = ∫_{−∞}^{∞} h(z) f(x − z) dz,

then

G(u) = F(u) H(u),
where G(u), F (u), and H (u) are the FTs of g(x), f (x), and h(x), respectively.
The 2D FT of f(x, y) expresses f(x, y) as a sum of 2D sinusoids A cos(ax + by + θ). Each sinusoid has spatial frequency ρ = √(a² + b²), period 1/ρ, and its contours of constant amplitude make an angle φ = tan⁻¹(b/a) with respect to the x-axis.

Given a 2D continuous function f(x, y):

F(u, v) := ∬ f(x, y) e^{−j2π(ux+vy)} dx dy = F[f](u, v) (9a)

f(x, y) := ∬ F(u, v) e^{j2π(ux+vy)} du dv = F⁻¹[F](x, y). (9b)
The 2D FT expresses f(x, y) in terms of a combination of 2D sinusoidal corrugations.

Example:

F(u, v) = (1/2) δ(u + a, v) + (1/2) δ(u − a, v) ⇒ f(x, y) = cos(2πax).

In this case, f(x, y) depends only on x, and the corrugations are parallel to the y-axis.
A rotation of f(x, y) causes a corresponding rotation of F(u, v).
Figure 19: Illustration of the Fourier transform of a sinusoidal corrugation.
The set of all sinusoidal corrugations (at all amplitudes, frequencies, phase shifts, and rotations) constitutes a complete basis for any function f(x, y). The 2D FT expresses f(x, y) as a linear combination of such corrugations.

Theorems for the 2D FT are generally analogous to their 1D counterparts:
1D: f(x) ↔ F(u) | 2D: f(x, y) ↔ F(u, v)

Shift: f(x − a) ↔ e^{−j2πau} F(u) | f(x − a, y − b) ↔ e^{−j2π(au+bv)} F(u, v)

Stretch: f(ax) ↔ (1/|a|) F(u/a) | f(ax, by) ↔ (1/|ab|) F(u/a, v/b)

Convolution: (f ∗ g)(x) ↔ F(u) G(u) | (f ∗ g)(x, y) ↔ F(u, v) G(u, v)

Correlation: r_fg(x) = f(x) ∗ g*(−x) ↔ F(u) G*(u) | r_fg(x, y) = f(x, y) ∗ g*(−x, −y) ↔ F(u, v) G*(u, v)

Conjugation: f*(x) ↔ F*(−u) | f*(x, y) ↔ F*(−u, −v)

Reversal: f(−x) ↔ F(−u) | f(−x, −y) ↔ F(−u, −v)

Conjugate reversal: f*(−x) ↔ F*(u) | f*(−x, −y) ↔ F*(u, v)
The cross-correlation between f and g is defined as
r_fg(x) := ∫_{−∞}^{∞} f(u) g*(u − x) du.

Note from the table above that

F[r_fg](u, v) = (F[r_gf](u, v))*,

which shows, upon IFT, that

r_gf(x, y) = r*_fg(−x, −y).
Figure 20: Examples of 2D convolutions.
11.1 FT of Separable Functions
If f(x, y) = f₁(x) f₂(y), then the FT can be written

F(u, v) = ∫ f₁(x) e^{−j2πux} dx ∫ f₂(y) e^{−j2πvy} dy = F₁(u) F₂(v),

where F₁(u) = F[f₁](u) and F₂(v) = F[f₂](v). Therefore, if f(x, y) is separable, so is F(u, v).
Example:
f(x, y) = 1 for |x| < W_x/2 and |y| < W_y/2, 0 elsewhere.

Using the rect(·) function,

rect(x) = 1 for |x| < 1/2, 0 elsewhere,

we have that

f(x, y) = rect(x/W_x) rect(y/W_y).

We also know that

F[rect(x/W_x)](u) = W_x sinc(W_x u), F[rect(y/W_y)](v) = W_y sinc(W_y v).

Therefore

F(u, v) = W_x W_y sinc(W_x u) sinc(W_y v).
12 Discrete Signals
A general 2D discrete signal (or sequence) has the form f(m, n), where m and n are integers. The region of the (m, n) plane where a function f(m, n) can take on non-zero values is the region of support of f(m, n). For example, a photograph has a region of support determined by the dimensions of the photo and the sampling rate.
Examples: 2D impulse

δ(m, n) = 1 for m = n = 0, 0 otherwise.

Line impulse

δ_T(m) = 1 for m = 0, 0 otherwise.

We have that δ(m, n) = δ_T(m) δ_T(n).

Step functions

u(m, n) = 1 for m, n ≥ 0, 0 otherwise,

and

u_T(m) = 1 for m ≥ 0, 0 otherwise.
Figure 21: Examples of a 2D separable function.
We have that u(m, n) = u_T(m) u_T(n).

A discrete sequence f(m, n) is separable if f(m, n) = x₁(m) x₂(n). Both δ(m, n) and u(m, n) are separable.
Figure 22: Examples of a 2D non-separable function.
12.1 Periodicity
A sequence f (m, n) is periodic if there exist integers M and N such that
f (m, n) = f (m + M, n + N )
for all m, n.
12.2 Input-Output
A 2D system produces an output sequence g(m, n) from an input sequence f (m, n):
g(m, n) = T [f (m, n)] .
Linearity:

T[k₁ f₁(m, n) + k₂ f₂(m, n)] = k₁ g₁(m, n) + k₂ g₂(m, n).
Shift-invariance:
If T [f (m, n)] = g(m, n), then T [f (m − m1, n − n1)] = g(m − m1, n − n1).
The response h(m, n) of a 2D LSI system to the input δ(m, n) is the impulse response. The function |h(m, n)|² is sometimes called the point spread function.
12.3 2D Discrete Convolution (Linear Convolution)
Aperiodic signal case:
g(m, n) = (f ∗ h)(m, n) = Σ_{k₁=−∞}^{∞} Σ_{k₂=−∞}^{∞} h(k₁, k₂) f(m − k₁, n − k₂).
Suppose h(m, n) = δ(m, n), then g(m, n) = (f ∗ δ)(m, n) = f (m, n). In general, for integers c and d:
f (m, n) ∗ δ(m − c, n − d) = f (m − c, n − d).
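A direct implementation of this sum for finite-support sequences, checked against the shifted-impulse identity (the input arrays are invented for illustration):

```python
def conv2d(f, h):
    """Linear 2D convolution of two finite-support sequences given as nested lists."""
    M = len(f) + len(h) - 1
    N = len(f[0]) + len(h[0]) - 1
    g = [[0.0] * N for _ in range(M)]
    for k1 in range(len(h)):
        for k2 in range(len(h[0])):
            for m in range(len(f)):
                for n in range(len(f[0])):
                    g[k1 + m][k2 + n] += h[k1][k2] * f[m][n]
    return g

f = [[1.0, 2.0],
     [3.0, 4.0]]
delta_shift = [[0.0, 0.0],
               [0.0, 1.0]]   # delta(m - 1, n - 1)
g = conv2d(f, delta_shift)   # f shifted down and right by one sample
```

The output g reproduces f at offset (1, 1), confirming f(m, n) ∗ δ(m − 1, n − 1) = f(m − 1, n − 1).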
Figure 23: Examples of 2D discrete convolutions.
A general sequence f(m, n) can be expressed as a sum of weighted, shifted impulses:

f(m, n) = f(0, 0) δ(m, n) + f(1, 0) δ(m − 1, n) + f(0, 1) δ(m, n − 1) + ⋯

= Σ_{k₁=−∞}^{∞} Σ_{k₂=−∞}^{∞} f(k₁, k₂) δ(m − k₁, n − k₂).

Therefore the output of an LSI system can be written as

g(m, n) = f(0, 0) h(m, n) + f(1, 0) h(m − 1, n) + f(0, 1) h(m, n − 1) + ⋯

= Σ_{k₁=−∞}^{∞} Σ_{k₂=−∞}^{∞} f(k₁, k₂) h(m − k₁, n − k₂) = (f ∗ h)(m, n).
As with 1D, 2D convolution is commutative, associative, and distributive.

An LSI system is separable if h(m, n) is separable. A separable system's response can be computed with 1D convolutions and requires fewer multiplications to calculate.

An LSI system is stable if every bounded input produces a bounded output.
12.4 Cross-correlation
For aperiodic signals
r_fg(m, n) := Σ_{k₁=−∞}^{∞} Σ_{k₂=−∞}^{∞} f(k₁, k₂) g*(k₁ − m, k₂ − n) = f(m, n) ∗ g*(−m, −n).
Figure 24: Examples of 2D discrete cross-correlations.
12.5 2D DTFT
For a stable aperiodic sequence with

Σ_{m=−∞}^{∞} Σ_{n=−∞}^{∞} |u(m, n)| < ∞,
Figure 25: 2D convolutions vs. cross-correlations.
the 2D DTFT is given by

V(ω₁, ω₂) = Σ_{m=−∞}^{∞} Σ_{n=−∞}^{∞} u(m, n) e^{−jmω₁} e^{−jnω₂} (10a)

u(m, n) = (1/(2π)²) ∫_{−π}^{π} ∫_{−π}^{π} V(ω₁, ω₂) e^{jmω₁} e^{jnω₂} dω₁ dω₂. (10b)
It is clear that V(ω₁ + 2πk, ω₂ + 2πl) = V(ω₁, ω₂) for any integers k and l. In image processing, the 2D DTFT is approximated using the 2D DFT. All properties and theorems for the 1D DTFT hold for the 2D DTFT with the appropriate modifications.

Example:

(f ∗ h)(m, n) ↔ F(ω₁, ω₂) H(ω₁, ω₂).
Example: Real f(m, n) produces a Hermitian F(ω₁, ω₂) = F*(−ω₁, −ω₂).

Example:

u(m, n) = (1/4) δ(m − 1, n) + (1/4) δ(m + 1, n) + (1/8) δ(m, n − 1) + (1/8) δ(m, n + 1)

has 2D DTFT

V(ω₁, ω₂) = (1/4) e^{−jω₁} + (1/4) e^{jω₁} + (1/8) e^{−jω₂} + (1/8) e^{jω₂} = (1/2) cos(ω₁) + (1/4) cos(ω₂).
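This closed form is easy to verify numerically by evaluating the DTFT sum directly; the sample frequencies below are arbitrary:

```python
import cmath
import math

# u(m, n) as a dict of its nonzero samples: (m, n) -> value
u = {(1, 0): 0.25, (-1, 0): 0.25, (0, 1): 0.125, (0, -1): 0.125}

def dtft2(u, w1, w2):
    """V(w1, w2) = sum_{m,n} u(m, n) e^{-j m w1} e^{-j n w2}."""
    return sum(v * cmath.exp(-1j * (m * w1 + n * w2)) for (m, n), v in u.items())

w1, w2 = 0.7, 1.3
V = dtft2(u, w1, w2)
closed_form = 0.5 * math.cos(w1) + 0.25 * math.cos(w2)
```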
13 Image Sampling
Recall that in 1D a comb function (pulse train) is its own transform:

Σ_{n=−∞}^{∞} δ(x − n) ↔ Σ_{k=−∞}^{∞} δ(u − k),

Σ_{n=−∞}^{∞} δ(x − nT) ↔ (1/T) Σ_{k=−∞}^{∞} δ(u − k/T).

A 2D comb function, or bed of nails, is also self-transforming.

Sampling can be described as follows:
Space domain: multiplication of f(x, y) by a bed of nails with spacing of Δx in the x dimension and Δy in the y dimension.

Frequency domain: convolution of F(u, v) with a bed of nails with spacing of 1/Δx in the u dimension and 1/Δy in the v dimension.

A bandlimited 2D function can be completely reconstructed from its samples provided that the Nyquist rates are satisfied. A 2D function is bandlimited if F_a(u, v) is zero outside a bounded region in the (u, v) plane:

F_a(u, v) = 0, |u| > U, |v| > V.
As with the 1D case, sampling replicates the spectrum periodically at periods given by the reciprocals of the sampling intervals, Δx and Δy. Note that Δx and Δy have units of meters. The sampled image spectrum is given by

F_s(u, v) = (1/(ΔxΔy)) Σ_{k=−∞}^{∞} Σ_{l=−∞}^{∞} F_a(u − k/Δx, v − l/Δy).

The spectrum is replicated in both the u and v dimensions, in spacings that are multiples of 1/Δx and 1/Δy, respectively.
Reconstruction is accomplished by using an ideal 2D LPF:

H_LPF(u, v) = ΔxΔy for |u| < 1/(2Δx) and |v| < 1/(2Δy), 0 otherwise.

Recall that

A rect(x/W) ↔ AW sinc(Wu).

The inverse FT of H_LPF(u, v) gives

h_LPF(x, y) = sinc(x/Δx) sinc(y/Δy).

Therefore

f_a(x, y) = Σ_{k=−∞}^{∞} Σ_{l=−∞}^{∞} f_s(kΔx, lΔy) sinc(x/Δx − k) sinc(y/Δy − l).
Reconstruction is less important in image processing because the image is often left in discrete form.
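For a bandlimited, separable example the interpolation formula can still be checked numerically; since the signal is separable, the 2D sinc sum factors into two 1D sums. The test signal, sample spacing, truncation length, and evaluation point below are arbitrary, and the infinite sums are truncated, so the match is only approximate:

```python
import math

def sinc(t):
    """Normalized cardinal sine, sinc(t) = sin(pi t) / (pi t)."""
    return 1.0 if t == 0 else math.sin(math.pi * t) / (math.pi * t)

# f(x, y) = cos(2 pi 0.1 x): bandlimited (0.1 < 1/2), sampled with dx = dy = 1.
dx = dy = 1.0
K = 2000  # truncation of the (infinite) interpolation sums

def f(x, y):
    return math.cos(2 * math.pi * 0.1 * x)

x0, y0 = 3.3, 0.4  # an off-grid point
# Separable signal => the 2D sinc sum factors into an x-sum times a y-sum:
fx = sum(f(k * dx, 0) * sinc(x0 / dx - k) for k in range(-K, K + 1))
fy = sum(sinc(y0 / dy - l) for l in range(-K, K + 1))  # y-profile is constant 1
estimate = fx * fy
```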
Figure 26: Bed of nails (N = 8).
Figure 27: Bed of nails (N = 4).
14 Image Transforms
Topics:
1. 1D DFT
2. Concept of Unitary Transforms
3. 2D DFT
4. DCT
14.1 1D Unitary DFT
The DFT is used for spectral analysis, filtering, convolution, etc. It can be viewed as
1) a transform time(1D)/space(2D) ←→ frequency
2) an expansion of a function into orthogonal basis functions
3) a coordinate rotation
Let u(n) ↔ V(k) denote a DFT pair:

V(k) = (1/√N) Σ_{n=0}^{N−1} u(n) e^{−j2πkn/N} (11a)

u(n) = (1/√N) Σ_{k=0}^{N−1} V(k) e^{j2πkn/N} (11b)
The DFT can be expressed as a matrix operation, V = A u, where u is the input (column vector, size N × 1), A is an N × N transform matrix, and V is the output (column vector, size N × 1).
The columns of AH are the basis vectors of the expansion.
[ V(0) ; V(1) ; ⋯ ; V(N−1) ] = (1/√N) [ e^{−j(2π/N)·0·0} e^{−j(2π/N)·0·1} ⋯ e^{−j(2π/N)·0·(N−1)} ; e^{−j(2π/N)·1·0} e^{−j(2π/N)·1·1} ⋯ e^{−j(2π/N)·1·(N−1)} ; ⋮ ; e^{−j(2π/N)·(N−1)·0} e^{−j(2π/N)·(N−1)·1} ⋯ e^{−j(2π/N)·(N−1)·(N−1)} ] [ u(0) ; u(1) ; ⋯ ; u(N−1) ]
Example: N = 4

[ V(0) ; V(1) ; V(2) ; V(3) ] = (1/2) [ 1 1 1 1 ; 1 −j −1 j ; 1 −1 1 −1 ; 1 j −1 −j ] [ u(0) ; u(1) ; u(2) ; u(3) ]
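A quick numerical check that this matrix is unitary (A Aᴴ = I), which is what makes the transform norm-preserving and its rows orthonormal (a pure-Python sketch for N = 4):

```python
import cmath

N = 4
A = [[cmath.exp(-2j * cmath.pi * k * n / N) / N ** 0.5 for n in range(N)]
     for k in range(N)]

# A times its conjugate transpose should be the N x N identity:
prod = [[sum(A[i][m] * A[j][m].conjugate() for m in range(N)) for j in range(N)]
        for i in range(N)]
```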
Orthogonality

Basis vectors of the DFT are orthogonal. Let a_k* denote the kth basis vector, 0 ≤ k ≤ N − 1.

Example: N = 4, a₁* = (1/2) [1, j, −1, −j]ᵀ, i.e.,

a₁*(0) = 1/2, a₁*(1) = j/2, a₁*(2) = −1/2, a₁*(3) = −j/2.

Orthogonality:

Σ_{n=0}^{N−1} a_k(n) a_l*(n) = 1 for k = l, 0 for k ≠ l.
Recall from EE113 that if an input u(n), 0 ≤ n ≤ N − 1, has exactly one non-zero element, its transform V(k) has the form

V(k) = (A/√N) e^{−j2πn₀k/N}, where n₀ is the location of the non-zero term and A is its value, i.e., u(n) = A δ(n − n₀).

V(k) is a single, rotating phasor with amplitude A (up to the 1/√N normalization), rotating Δθ = −2πn₀/N radians each time k is incremented.
Example 1:

(a) u(n) = [1, 0, 0, 0], A = 1, n₀ = 0 ⇒ Δθ = 0

V(k) = (1/√4) [1, 1, 1, 1]

(b) u(n) = [0, 1, 0, 0], A = 1, n₀ = 1 ⇒ Δθ = −2π/4 = −π/2

V(k) = (1/2) [1, −j, −1, j]

(c) u(n) = [0, 0, 0, j], A = j, n₀ = 3 ⇒ Δθ = −6π/4 = +π/2

V(k) = (j/2) [1, j, −1, −j] = (1/2) [j, −1, −j, 1]
The inverse DFT uses an identical procedure, except that Δθ = +2πk₀/N.

Example 2:

V(k) = [0, 0, 1, 0, 0, 0, 0, 0], Δθ = 2π·2/8 = π/2

⇒ u(n) = (1/√8) [1, j, −1, −j, 1, j, −1, −j]
Any DFT can be found using linear combinations of these techniques.

Example 3:

(a) u(n) = [0, 1, 0, 2 − j] ↔ (1/2) [1, −j, −1, j] + ((2 − j)/2) [1, j, −1, −j]

= (1/2) [1, −j, −1, j] + (1/2) [2 − j, 1 + 2j, −2 + j, −1 − 2j]

= (1/2) [3 − j, 1 + j, −3 + j, −1 − j]

(b) V(k) = [0, 0, 0, 0, 1, 0, 0, 0], Δθ = 2π·4/8 = π

u(n) = (1/√8) [1, −1, 1, −1, 1, −1, 1, −1]

(c) V(k) = (1/√8) [j, 1, −j, −1, j, 1, −j, −1] is the forward DFT of jδ(n − n₀), where −2πn₀/8 = −π/2 ⇒ n₀ = 2

u(n) = [0, 0, j, 0, 0, 0, 0, 0]

(d) V(k) = (1/2) [−j, 1, j, −1] is the forward DFT of −jδ(n − n₀), where −2πn₀/4 = π/2 ⇒ n₀ = −1.

Since u(n) (and V(k)) have period 4, n₀ = −1 is equivalent to n₀ = −1 + 4 = 3

⇒ u(n) = [0, 0, 0, −j]
Example 4: With N = 4, let u(n) = [4, 1, 0, 1].

[4, 0, 0, 0] ←→ (1/2) [4, 4, 4, 4]
[0, 1, 0, 0] ←→ (1/2) [1, −j, −1, j]
[0, 0, 0, 1] ←→ (1/2) [1, j, −1, −j]

∴ [4, 1, 0, 1] ←→ (1/2) [6, 4, 2, 4] = [3, 2, 1, 2].

Using matrix multiplication:

V = Au = (1/2) [ 1  1  1  1
                 1 −j −1  j
                 1 −1  1 −1
                 1  j −1 −j ] [4, 1, 0, 1]^T = [3, 2, 1, 2]^T.
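This matrix computation can be checked numerically. The sketch below (NumPy, assuming the unitary 1/√N normalization used throughout these notes) builds A directly from its definition and reproduces the result:

```python
import numpy as np

N = 4
n = np.arange(N)
# Unitary DFT matrix: A[k, n] = exp(-j*2*pi*k*n/N) / sqrt(N)
A = np.exp(-2j * np.pi * np.outer(n, n) / N) / np.sqrt(N)

u = np.array([4, 1, 0, 1])
V = A @ u                  # forward transform, V = A u
u_back = A.conj().T @ V    # inverse transform, u = A^H V
```

Because A is unitary, applying A^H recovers u exactly.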
V is an expansion of u in terms of the N basis vectors given by the conjugate columns of A.
u = 3 · (1/2) [1, 1, 1, 1]^T + 2 · (1/2) [1, j, −1, −j]^T + 1 · (1/2) [1, −1, 1, −1]^T + 2 · (1/2) [1, −j, −1, j]^T = [4, 1, 0, 1]^T.
Example 5: With N = 4, let u(n) = [3, 1, 0, 2].

V(0) = (1/2) [u(0) + u(1) + u(2) + u(3)] = 3
V(1) = (1/2) [u(0) − j u(1) − u(2) + j u(3)] = 3/2 + j/2
V(2) = (1/2) [u(0) − u(1) + u(2) − u(3)] = 0
V(3) = 3/2 − j/2 = V*(1).

Result: V(k) = [3, 3/2 + j/2, 0, 3/2 − j/2]. V(k) can be viewed as an expansion of u(n) into a basis given by the columns of A^H:

u(n) = (1/2) { 3 [1, 1, 1, 1]^T + (3/2 + j/2) [1, j, −1, −j]^T + (3/2 − j/2) [1, −j, −1, j]^T } = [3, 1, 0, 2]^T.
Recall: If u(n) is real, V(k) has Hermitian symmetry. Since u(n) and V(k) are periodic,

V(k) = V*(−k) = V*(N − k) ⇒ V(1) = V*(4 − 1) = V*(3).

Symmetry of discrete, periodic functions:

Even: f(n) = f(N − n)
Odd: f(n) = −f(N − n)

Symmetry Properties:

u(n) ←→ V(k)
real, even ←→ real, even
real, odd ←→ imaginary, odd

(Others follow from the above.)
14.2 Periodicity and Indexing
Note that the DFT definition implies that both u(n) and V(k) are periodic with period N:

u(n + N) = (1/√N) Σ_{k=0}^{N−1} V(k) e^{j2π(n+N)k/N} = (1/√N) Σ_{k=0}^{N−1} V(k) e^{j2πnk/N} e^{j2πk} = u(n).
In general, u(n) = u(n + N) and V(k) = V(k + N), and the same holds for any integer multiple of N.

Example: u(n) = [0, 1, 2, 3] ⇒ u(n − 1) = [3, 0, 1, 2].

Although any window of length N is sufficient, by convention we use 0 ≤ n ≤ N − 1. This places the origin on the left, not in the center. Make sure you define vectors this way every time you use functions such as fft in Matlab.
14.3 DFT Properties
Parseval: Σ_{n=0}^{N−1} |u(n)|² = Σ_{k=0}^{N−1} |V(k)|².

Reversal: If u(n) ↔ V(k), then u(−n) ↔ V(−k). Note that if u(n) = [a, b, c, d, e, f, g, h], then u(−n) = [a, h, g, f, e, d, c, b].

Circular convolution: (u₁ ⊛ u₂)(n) ↔ V₁(k) V₂(k).

Shift: If u(n) ↔ V(k), then u(n − n₀) ↔ e^{−j2πn₀k/N} V(k).

Example: u(n) = [3, 1, 0, 2] ↔ V(k) = [3, 3/2 + j/2, 0, 3/2 − j/2]; then

u(n − 1) = [2, 3, 1, 0] ↔ e^{−j2πk/4} V(k) = [3, 1/2 − j3/2, 0, 1/2 + j3/2].

Shifting u(n) imposes a progressive phase term on V(k).
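The shift property can be verified numerically. A sketch using NumPy's fft, rescaled by 1/√N to match the unitary convention of these notes:

```python
import numpy as np

N = 4
u = np.array([3, 1, 0, 2])
V = np.fft.fft(u) / np.sqrt(N)            # unitary DFT of u

u_shift = np.roll(u, 1)                   # u(n - 1) = [2, 3, 1, 0]
V_shift = np.fft.fft(u_shift) / np.sqrt(N)

k = np.arange(N)
phase = np.exp(-2j * np.pi * 1 * k / N)   # progressive phase term, n0 = 1
```

Multiplying V(k) by the progressive phase term reproduces the DFT of the shifted sequence.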
15 Unitary Transformations
The general form for a 1D unitary transformation is

V(k) = Σ_{n=0}^{N−1} u(n) a(k, n)        (12a)
u(n) = Σ_{k=0}^{N−1} V(k) a*(k, n).      (12b)

In matrix form, V = Au, where A is unitary, i.e., A⁻¹ = A^H = (A*)^T.
A unitary transform V = Au provides an expansion of u in terms of basis vectors obtained as the columns
of A^H. The coefficients of the expansion are given by the elements of V.

Example: DFT with N = 4,

A = (1/2) [ 1  1  1  1
            1 −j −1  j
            1 −1  1 −1
            1  j −1 −j ].

The four basis vectors are

a*_0 = (1/2) [1, 1, 1, 1]^T,  a*_1 = (1/2) [1, j, −1, −j]^T,  a*_2 = (1/2) [1, −1, 1, −1]^T,  a*_3 = (1/2) [1, −j, −1, j]^T.

u = [3, 1, 0, 2]^T ↔ V = [3, 3/2 + j/2, 0, 3/2 − j/2]^T

⇒ u = 3 a*_0 + (3/2 + j/2) a*_1 + (3/2 − j/2) a*_3.
A 2D unitary transform pair is given by
V(k, l) = Σ_{m=0}^{N−1} Σ_{n=0}^{N−1} u(m, n) a(k, l, m, n)        (13a)
u(m, n) = Σ_{k=0}^{N−1} Σ_{l=0}^{N−1} V(k, l) a*(k, l, m, n).      (13b)

The term a(k, l, m, n) is called the kernel of the transformation. If it can be factored as a(k, l, m, n) = b(k, m) c(l, n), the 2D transform is called separable: one can transform all the rows and then all the columns, or vice versa. Among the unitary transformations are the DFT, DCT, Hadamard, Haar, and Karhunen-Loeve transforms.

Letting b(k, m) be the elements of B and c(l, n) the elements of C, a separable 2D unitary transform can be expressed as

V = B u C^T,

where V and u are N × N matrices, and so are B and C. If C = B = A, then

V = A u A^T.

This expresses u in terms of the basis images A*_kl:

u = Σ_{k=0}^{N−1} Σ_{l=0}^{N−1} V(k, l) A*_kl,

where A*_kl := a*_k (a*_l)^T, and a*_k is the kth column of A^H.
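Separability can be illustrated numerically. A sketch (NumPy) using the unitary DFT matrix as a concrete choice of B = C = A, showing that the matrix form V = A u Aᵀ agrees with transforming all the columns and then all the rows:

```python
import numpy as np

N = 4
idx = np.arange(N)
# Unitary DFT matrix used as the separable transform A
A = np.exp(-2j * np.pi * np.outer(idx, idx) / N) / np.sqrt(N)

u = np.array([[0, 1, 0, 0],
              [0, 1, 0, 0],
              [0, 1, 0, 0],
              [0, 1, 0, 0]], dtype=complex)

V = A @ u @ A.T                         # matrix form V = A u A^T
V_cols_then_rows = (A @ (A @ u).T).T    # transform columns, then rows
```

For an N × N input, A u Aᵀ equals fft2(u)/N, so the result can also be cross-checked against NumPy's 2D FFT.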
16 2D Unitary DFT
V(k, l) = Σ_{m=0}^{N−1} Σ_{n=0}^{N−1} u(m, n) a(k, l, m, n)        (14a)
u(m, n) = Σ_{k=0}^{N−1} Σ_{l=0}^{N−1} V(k, l) a*(k, l, m, n),      (14b)

where the kernel of the transformation is

a(k, l, m, n) = (1/N) e^{−j2π(mk + nl)/N}.

Since the kernel is factorable, the 2D DFT is separable, i.e., it can be implemented by 1D transforms of the rows, followed by 1D transforms of the columns (or vice versa).

Example:

u(m, n) = [ 0 1 0 0
            0 1 0 0
            0 1 0 0
            0 1 0 0 ].

1D transform of columns:

⇒ [ 0 2 0 0
    0 0 0 0
    0 0 0 0
    0 0 0 0 ].

1D transform of rows:

⇒ [ 1 −j −1  j
    0  0  0  0
    0  0  0  0
    0  0  0  0 ].
Example of 2D DFTs. For the 2D DFT with N = 4:

A = (1/2) [ 1  1  1  1
            1 −j −1  j
            1 −1  1 −1
            1  j −1 −j ],

A^H = (1/2) [ 1  1  1  1
              1  j −1 −j
              1 −1  1 −1
              1 −j −1  j ],

and

a*_0 = (1/2) [1, 1, 1, 1]^T,  a*_1 = (1/2) [1, j, −1, −j]^T,  etc.

Basis images (or basis maps):

A*_00 = a*_0 (a*_0)^T = (1/4) [ 1 1 1 1
                                1 1 1 1
                                1 1 1 1
                                1 1 1 1 ],  etc.
Example
u = [ 0 1 0 0
      0 1 0 0
      0 1 0 0
      0 1 0 0 ]   ↔   V = [ 1 −j −1  j
                            0  0  0  0
                            0  0  0  0
                            0  0  0  0 ].

This means that we can express u as follows:

u = (1) A*_00 + (−j) A*_01 + (−1) A*_02 + (j) A*_03.
16.1 2D Periodic (Cyclic) Convolution
(u₁ ⊛ u₂)(m, n) ↔ V₁(k, l) V₂(k, l).

For instance, u₁(m, n) may be an image and u₂(m, n) the impulse response of a digital filter, such as a low-pass filter. The filtering operation can be applied in the frequency domain by simply multiplying V₁(k, l) and V₂(k, l) element by element, and then inverse transforming.
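This frequency-domain filtering recipe can be sketched with NumPy. The sketch uses the un-normalized fft2 convention, for which the cyclic convolution theorem holds with no extra scale factor; the tiny "image" and zero-padded averaging "filter" are arbitrary illustrative choices:

```python
import numpy as np

# a tiny "image" and a 2x2 averaging "filter", zero-padded to the image size
u1 = np.arange(16, dtype=float).reshape(4, 4)
u2 = np.zeros((4, 4))
u2[:2, :2] = 0.25

# pointwise product in the frequency domain = 2D cyclic convolution
V = np.fft.fft2(u1) * np.fft.fft2(u2)
y = np.real(np.fft.ifft2(V))

# direct 2D cyclic convolution, for comparison
y_direct = np.zeros((4, 4))
for m in range(4):
    for n in range(4):
        for p in range(4):
            for q in range(4):
                y_direct[m, n] += u1[p, q] * u2[(m - p) % 4, (n - q) % 4]
```

Since the filter coefficients sum to 1, the total (DC) intensity is preserved.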
17 The Unitary Discrete Cosine Transform (DCT)
The DCT is a real transform, i.e., a real u(n) produces a real V(k). It can be obtained using a DFT, and most fast DCT algorithms use FFTs. Of all unitary transforms, only the DFT is used more than the DCT.

For a 1D sequence u(n), 0 ≤ n ≤ N − 1, the DCT V(k), 0 ≤ k ≤ N − 1, is given by

V = Cu,

where

C(k, n) = { 1/√N,                          k = 0,           0 ≤ n ≤ N − 1,
            √(2/N) cos( π(2n+1)k / 2N ),   1 ≤ k ≤ N − 1,   0 ≤ n ≤ N − 1.

As with the DFT, the DCT basis vectors are sinusoidal. The DCT is best understood in terms of a DFT of a sequence of length 2N derived from u(n).

The DFT implies a periodicity N in the sequence u(n). This periodicity can result in discontinuities that do not reflect the nature of the signal u(n). Consider u(n) = [1, 2, 3, 4]: the periodic extension has discontinuities that cause energy to be placed in high-frequency bins of V(k), the DFT of u(n).

Principle behind the DCT: make the input symmetric by doubling the length of u(n), defining a sequence ũ(n) of period 2N, where ũ(n) = u(n) for 0 ≤ n ≤ N − 1, and ũ(n) = u(2N − 1 − n) for N ≤ n ≤ 2N − 1. The non-unitary definition of the DCT is thus:

V(k) = Σ_{n=0}^{N−1} 2 u(n) cos( 2πk (n + 1/2) / 2N ),   0 ≤ k ≤ N − 1.

The inverse transform is:

u(n) = (1/2N) [ V(0) + Σ_{k=1}^{N−1} 2 V(k) cos( 2πk (n + 1/2) / 2N ) ].
Note that V(0) is treated separately because in the symmetric DFT of length 2N the DC element is represented only once: V(1) pairs with V(2N − 1), V(2) with V(2N − 2), . . ., but V(0) has no partner. The 1D unitary DCT is given by:

V(k) = α(k) Σ_{n=0}^{N−1} u(n) cos( π(2n+1)k / 2N ),   0 ≤ k ≤ N − 1        (15a)
u(n) = Σ_{k=0}^{N−1} α(k) V(k) cos( π(2n+1)k / 2N ),   0 ≤ n ≤ N − 1        (15b)
α(0) = 1/√N,   α(k) = √(2/N),   1 ≤ k ≤ N − 1.                              (15c)
The 2D unitary DCT is given by:

V(k, l) = α(k) α(l) Σ_{m=0}^{N−1} Σ_{n=0}^{N−1} u(m, n) cos( π(2m+1)k / 2N ) cos( π(2n+1)l / 2N )        (16a)
u(m, n) = Σ_{k=0}^{N−1} Σ_{l=0}^{N−1} α(k) α(l) V(k, l) cos( π(2m+1)k / 2N ) cos( π(2n+1)l / 2N ).       (16b)
For signals that generate a smooth ũ(n), the DCT is well localized. In general, the N → 2N mapping implicit in the DCT gives smoother sequences than the periodic N-extension of the DFT. This is why the DCT has better energy compaction for most sequences.

Example: [0.65, 0.27, −0.27, −0.65] has DCT [0, 1, 0, 0] and DFT [0, 0.46 − j0.46, 0, 0.46 + j0.46]. Because the extension [0.65, 0.27, −0.27, −0.65, −0.65, −0.27, 0.27, 0.65] is smooth under 2N-periodicity, its DCT is compact.

Example: [1, 1, 1, 1] has DFT and DCT both equal to [2, 0, 0, 0].

Example: [1, 0, 0, 0] has DCT [0.5, 0.65, 0.5, 0.27] and DFT [0.5, 0.5, 0.5, 0.5]. The DCT of an impulse is not DC.
The DFT kernel is symmetric, i.e., exchanging n with k does not affect the transform. The DCT kernel is not symmetric with respect to n and k.

Example (mid-frequency): [1, 0, −1, 0] has DCT [0, 0.92, 1, −0.38] and DFT [0, 1, 0, 1].

Example (high-frequency): [1, −1, 1, −1] has DCT [0, 0.765, 0, 1.848] and DFT [0, 0, 2, 0].

For signals whose periodic extension is smooth, the energy localization with the DFT can be better than with the DCT. This is because the periodicity assumption of the DFT does not apply to the DCT, which assumes the symmetric 2N extension instead.

For the DCT, C ≠ C^T, and the expansion using the DCT is fundamentally different from the expansion using the IDCT.
18 Image Enhancement
Image enhancement techniques are used to improve the perception of images by human observers. There are subjective and objective measures of quality. The image may also be processed for further machine-based analysis.

Among image enhancement techniques are
1. Contrast enhancement
(a) Histogram (gray scale) modification
(b) High-pass (linear) filtering
(c) Homomorphic processing
2. Denoising and noise smoothing (background noise, Gaussian noise, salt-and-pepper noise, speckle noise, quantization noise)
(a) Low-pass (linear) filtering
(b) Median (nonlinear) filtering
(c) Out-of-range pixel smoothing (nonlinear)
3. Edge detection
(a) Gradient-based edge detection
(b) Laplacian-based edge detection
18.1 Contrast Enhancement
18.1.1 Histogram modification
Let u(m, n) be the intensity level of pixel (m, n) of a given image. Histogram modification allows one to
change the range of values taken on by the intensity function, thereby improving the contrast of the image.
Let y (m, n) be the result of the transformation
y = T [u].
As an example, consider the 4 × 4 image with intensities given by

u = [ 3 3 4 5
      2 3 4 4
      2 3 4 4
      3 3 4 5 ],

where 0 is black and 7 is white. The histogram and the cumulative histogram of the image are given by

u   p(u)   P(u)
0    0      0
1    0      0
2    2      2
3    6      8
4    6     14
5    2     16
6    0     16
7    0     16
Assume that ideally we would like to have a cumulative histogram given by

y   Pd(y)
0     2
1     4
2     6
3     8
4    10
5    12
6    14
7    16
The transformation from u to y can now be obtained as follows:
1. Pick a value of u
2. Look up P (u)
3. Find P d(y) that is closest to P (u)
4. Choose the corresponding y
In our example:
1. u = 0 → P (u) = 0 → P d(y) = 2 → y = 0
2. u = 1 → P (u) = 0 → P d(y) = 2 → y = 0
3. u = 2 → P (u) = 2 → P d(y) = 2 → y = 0
4. u = 3 → P (u) = 8 → P d(y) = 8 → y = 3
5. u = 4 → P (u) = 14 → P d(y) = 14 → y = 6
6. u = 5 → P (u) = 16 → P d(y) = 16 → y = 7
7. u = 6 → P (u) = 16 → P d(y) = 16 → y = 7
8. u = 7 → P (u) = 16 → P d(y) = 16 → y = 7
which corresponds to the transformation

u   y
0   0
1   0
2   0
3   3
4   6
5   7
6   7
7   7
The transformed image is given by

y = [ 3 3 6 7
      0 3 6 6
      0 3 6 6
      3 3 6 7 ].

The histogram and cumulative histogram of the transformed image are given by

y   p(y)   P(y)
0    2      2
1    0      2
2    0      2
3    6      8
4    0      8
5    0      8
6    6     14
7    2     16
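The four-step matching procedure can be sketched directly in NumPy, using the image and desired cumulative histogram of this example:

```python
import numpy as np

u = np.array([[3, 3, 4, 5],
              [2, 3, 4, 4],
              [2, 3, 4, 4],
              [3, 3, 4, 5]])

levels = np.arange(8)                                # 3-bit gray levels, 0..7
hist = np.array([(u == g).sum() for g in levels])
P = np.cumsum(hist)                                  # cumulative histogram P(u)
Pd = np.array([2, 4, 6, 8, 10, 12, 14, 16])          # desired cumulative histogram

# for each input level, pick the output level whose desired cumulative
# count is closest to the actual cumulative count
T = np.array([np.argmin(np.abs(Pd - P[g])) for g in levels])
y = T[u]
```

The lookup table T and transformed image y match the tables worked out above.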
Fig. 28b shows an example of histogram modification based on the image in Fig. 28a.
(a) Original unprocessed image (from Matlab’s demo photolibrary). (b) Image processed by histogram modification.
Figure 28: Example of histogram modification.
18.1.2 High-pass filtering
The image details, such as hair, edges, specks, etc., correspond to high-frequency components, while flat and smooth surfaces correspond to low-frequency components. High-pass filtering is used to increase contrast and sharpen the image. High-pass filtering may also accentuate any noise that is present in the image. In 2D, high-pass filtering may be obtained by using any of the following 3 × 3 filters:

h1 = [  0 −1  0
       −1  5 −1
        0 −1  0 ],

h2 = [  1 −2  1
       −2  5 −2
        1 −2  1 ],

h3 = [ −1 −2 −1
       −2 13 −2
       −1 −2 −1 ].
Note that these filters are such that

Σ_m Σ_n h_i(m, n) = H_i(k, l)|_{k=0, l=0} = 1,   i = 1, 2, 3,

which guarantees that the DC component (i.e., the average intensity) of the image is unaltered. High-pass filtering can also be performed in the log-intensity domain, which can be seen as a form of homomorphic processing (see below). Examples of the effects of high-pass filtering are shown in Fig. 29.
(a) Original unprocessed image. (b) Image processed by high-pass filtering using h1.
(c) Image processed by high-pass filtering using h2. (d) Image processed by high-pass filtering using h3.
Figure 29: Example of high-pass filtering.
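A minimal sketch of high-pass filtering with h1 (plain NumPy with circular boundary handling; a practical implementation would normally use a library convolution routine, and the test image here is a made-up flat patch with one bright pixel):

```python
import numpy as np

h1 = np.array([[ 0, -1,  0],
               [-1,  5, -1],
               [ 0, -1,  0]])

def filter2d(u, h):
    """3x3 correlation with circular (periodic) boundary handling."""
    M, N = u.shape
    y = np.zeros_like(u, dtype=float)
    for m in range(M):
        for n in range(N):
            for p in range(3):
                for q in range(3):
                    y[m, n] += h[p, q] * u[(m + p - 1) % M, (n + q - 1) % N]
    return y

u = np.full((5, 5), 10.0)
u[2, 2] = 20.0                 # a bright "detail" pixel on a flat background
y = filter2d(u, h1)
```

Flat regions pass through unchanged (the coefficients sum to 1, so DC is preserved), while the detail pixel is amplified from 20 to 60.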
18.1.3 Homomorphic processing
Homomorphic processing is based on the observation that the intensity of an image is the product of illumination and reflectance:

u(m, n) = i(m, n) · r(m, n).
The illumination is the principal contributor to the dynamic range of the image’s intensity, and it varies
slowly across the image. The reflectance contributes to the image’s detail, and it changes much more
quickly than the illumination. Imagine a source of light (sun, lamp, etc.) illuminating an object composed of different parts, each with a different reflective characteristic. When the light source is very bright, the contrast is diminished. The goal is to reduce the illumination component and enhance the reflectance component, so as to increase the overall contrast of the image. Because of the multiplicative relationship between illumination and reflectance, it is convenient to work in the log domain:
log u(m, n) = log i(m, n) + log r(m, n).
In this way the illumination can be extracted by low-pass filtering, and the reflectance by high-pass filtering. The low-passed log i(m, n) is then multiplied by a factor α < 1 and the high-passed log r(m, n) is multiplied by a factor β > 1. After exponentiation, the transformed image is given by

y(m, n) = (i(m, n))^α (r(m, n))^β.

The use of homomorphic processing, where the image transformation is achieved in the log-intensity domain, is also justified by the way the human visual system operates. An example of homomorphic processing applied to the image in Fig. 28a is shown in Fig. 30.
Figure 30: Example of homomorphic processing.
18.2 Noise Smoothing
Images may contain different forms of noise: Gaussian noise, speckle, or salt-and-pepper noise. Different types of linear and nonlinear filtering may be applied to mitigate their effects.
18.2.1 Low-pass filtering
Most of the image's energy is contained in the low-frequency components. High-frequency components contain detail as well as much of the noise. Low-pass filtering may reduce much of the noise, at the expense of also reducing some of the detail, i.e., blurring the image. Low-pass filtering is a linear 2D filtering operation that can be achieved by using one of the following filters:

h4 = (1/9) [ 1 1 1
             1 1 1
             1 1 1 ],

h5 = (1/10) [ 1 1 1
              1 2 1
              1 1 1 ],

h6 = (1/16) [ 1 2 1
              2 4 2
              1 2 1 ].
Again, note that the filters are normalized to guarantee that the average intensity of the image is unchanged. Fig. 31 demonstrates the effects of low-pass filtering on an image corrupted by Gaussian noise.
(a) Original image corrupted by Gaussian noise. (b) Image processed by low-pass filtering using h4.
(c) Image processed by low-pass filtering using h5. (d) Image processed by low-pass filtering using h6.
Figure 31: Example of low-pass filtering.
18.2.2 Median filtering
Median filtering is a form of nonlinear filtering that is used to combat impulsive, salt-and-pepper noise. This kind of noise may be the result of random bit flips that occur in the communication of an image. Consider first a 1D median filter. Median filtering is obtained by sliding a window of odd length over the sequence of interest, and replacing the intensity of the middle pixel with the median intensity of all pixels in the window (the median value out of N, N odd, is the one such that (N − 1)/2 values are lower than that value, and (N − 1)/2 are higher; for instance, the median of 4, 7, 15, 33, 255 is 15). 1D median filtering is ideal when some pixels have outlier intensity values; such outliers will be removed. It is also used when edges (discontinuities) need to be preserved. Consider for instance the effects of a low-pass
filter given by [0.2, 0.2, 0.2, 0.2, 0.2] and a 5-point median filter on the sequence

[−0.1, 0.2, 0.1, −0.1, 0.0, 0.1, 0.0, 1.2, 0.9, 1.0, 1.1, 0.9, 1.2, 1.0, 1.1, 1.2].
An important parameter in median filtering is the length of the sliding window: if outliers can come in pairs (two adjacent large values surrounded by small values), then windows of length less than 5 (such as 3) will not be effective at removing the impulsive values.
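The suggested comparison can be carried out numerically. A sketch applying both 5-point filters to the sequence above (edge samples are simply left unfiltered for brevity):

```python
import numpy as np

x = np.array([-0.1, 0.2, 0.1, -0.1, 0.0, 0.1, 0.0, 1.2,
              0.9, 1.0, 1.1, 0.9, 1.2, 1.0, 1.1, 1.2])

avg = x.copy()
med = x.copy()
for n in range(2, len(x) - 2):
    w = x[n - 2:n + 3]
    avg[n] = w.mean()          # 5-point moving average: smears the step
    med[n] = np.median(w)      # 5-point median: preserves the step edge
```

Around the step (between n = 6 and n = 7), the median filter jumps from 0.1 straight to 0.9, while the moving average produces intermediate values (0.44, 0.64, ...) that blur the discontinuity.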
2D median filtering is obtained by applying 1D median filtering in the horizontal and vertical directions separately. The reason is that using a 2D sliding window and applying median filtering in 2D may result in the alteration of edge information where edges meet. Try applying 2D median filtering to the unit step u(m, n), and compare with 1D filtering applied separately to the horizontal and vertical directions. The
result of median filtering on an image corrupted by salt-and-pepper noise is demonstrated in Fig. 32.
(a) Original image corrupted by salt-and-pepper noise. (b) Image processed by median filtering.
Figure 32: Example of median filtering.
18.2.3 Out-of-range pixel smoothing
Similarly to median filtering, out-of-range pixel smoothing is a nonlinear filtering operation obtained by sliding a window over an image and comparing the intensity value of the middle pixel with the average of all the other pixels' intensities; if its intensity is significantly different (based on some predetermined threshold), then its value is replaced by the average. Out-of-range pixel smoothing is used to mitigate salt-and-pepper noise. Fig. 33 shows the effects of out-of-range pixel smoothing on the image of Fig. 32a.
19 Edge Detection

An edge is a boundary between two regions of an image that differ in their reflectivity, illumination, distance, etc. Edge detection can be used to segment an image into different regions. Segmentation is useful for image understanding, for instance, or for object detection and identification.
Figure 33: Example of out-of-range pixel smoothing.
19.0.4 Gradient-based edge detection
A sudden change in intensity may indicate an edge. Consider a 1D intensity function that suddenly changes from low values to large values. The point of change can be determined by computing the gradient of that function, which in 1D is simply a derivative. The amplitude of that derivative is compared to a predetermined threshold; if it exceeds the threshold and is a local maximum, then an edge is detected. An edge detection system may comprise a derivative calculator, an absolute value, a comparator against a given threshold, and a way to determine whether the point is a local maximum. In 2D the gradient is given by
∇u(x, y) = (∂u(x, y)/∂x) î + (∂u(x, y)/∂y) ĵ.
The edge detection mechanism is similar to the 1D system, but in addition the values of |∇u(x, y)| at all the candidate edge points need to be considered along a number of specified directions. If a candidate |∇u(x, y)| is a local maximum in at least one direction, then an edge is detected. Normally, the specified directions are the horizontal and vertical directions. In order to prevent minor, false edge lines from being detected, additional constraints may be added, such as the following:
1. If |∇u(x, y)| has a local maximum at (x₀, y₀) in the horizontal direction but not in the vertical direction, and |∂u/∂x| > 2 |∂u/∂y|, then (x₀, y₀) is an edge point;

2. If |∇u(x, y)| has a local maximum at (x₀, y₀) in the vertical direction but not in the horizontal direction, and |∂u/∂y| > 2 |∂u/∂x|, then (x₀, y₀) is an edge point.
An edge detector that only considers a specific direction is called a directional edge detector. One that is based on the gradient magnitude |∇u(x, y)| is non-directional.

In the discrete 2D domain, the partial derivatives and gradients can be approximated by linear filters. Among the filters that can be used for vertical, horizontal, and diagonal (directional) edge detection are
h7 = [ −1 1
       −1 1
       −1 1 ],

h8 = [  1  1  1
       −1 −1 −1 ],

h9 = [  0  1  1
       −1  0  1
       −1 −1  0 ],

h10 = [ 1  1  0
        1  0 −1
        0 −1 −1 ].
The nondirectional edge detector can be obtained from the approximation of

|∇u(x, y)| = √[ (∂u(x, y)/∂x)² + (∂u(x, y)/∂y)² ]

as in

√[ (u_x(m, n))² + (u_y(m, n))² ],        (17)

where

u_x(m, n) = (u ∗ h_x)(m, n),   u_y(m, n) = (u ∗ h_y)(m, n),

and

h_x = [ −1 1
        −2 2
        −1 1 ],

h_y = [  1  2  1
        −1 −2 −1 ].
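A sketch of the directional responses of h_x and h_y on a synthetic vertical step edge ("valid" correlation with no padding; in (17) the two responses would then be combined as the root-sum-of-squares):

```python
import numpy as np

hx = np.array([[-1, 1],
               [-2, 2],
               [-1, 1]])
hy = np.array([[ 1,  2,  1],
               [-1, -2, -1]])

def corr2d_valid(u, h):
    """'valid' 2D correlation: slide h over u without padding."""
    P, Q = h.shape
    M, N = u.shape
    out = np.zeros((M - P + 1, N - Q + 1))
    for m in range(out.shape[0]):
        for n in range(out.shape[1]):
            out[m, n] = np.sum(u[m:m + P, n:n + Q] * h)
    return out

# a vertical step edge: dark left half, bright right half
u = np.zeros((6, 6))
u[:, 3:] = 1.0

ux = corr2d_valid(u, hx)    # responds to vertical edges
uy = corr2d_valid(u, hy)    # responds to horizontal edges (none here)
```

h_x fires only along the column where the step occurs, while h_y is identically zero for this image, illustrating the directional nature of the two filters.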
Gradient-based edge detection methods can be sensitive to noise. Some form of noise smoothing may have to be applied to the image before using these techniques. Additional removal of false edges may require some post-processing.
19.0.5 Laplacian-based edge detection
An alternative to gradient-based processing can be derived by considering that the second derivative of the intensity function has a zero-crossing in correspondence with an edge. In 2D, the second derivative is replaced by the Laplacian

∇²u(x, y) = ∂²u(x, y)/∂x² + ∂²u(x, y)/∂y².

In discrete domains, the Laplacian is approximated by computing second-order differences, which can be computed by using 2D linear filters. If

∂u(x, y)/∂x ≈ u(m + 1, n) − u(m, n)   or   u(m, n) − u(m − 1, n),

then

∂²u(x, y)/∂x² ≈ u(m + 1, n) − 2u(m, n) + u(m − 1, n),

and

∇²u(x, y) ≈ u(m + 1, n) + u(m − 1, n) + u(m, n + 1) + u(m, n − 1) − 4u(m, n),

which corresponds to the filter

h11 = [ 0  1  0
        1 −4  1
        0  1  0 ].
Other filters can be used, such as
h12 = [ 1  1  1
        1 −8  1
        1  1  1 ],

h13 = [ −1  2 −1
         2 −4  2
        −1  2 −1 ].
The Laplacian-based edge detector may generate many false edges. One way to avoid that is to requirethat the local variance at a candidate edge point be larger than a predetermined threshold. The rationale
is that if the candidate point is indeed an edge point, then the intensity of the pixels around that point
should vary greatly. The variance should be small if the candidate point is simply an outlier. The variance
over a (2M + 1) × (2M + 1) window is computed as follows:
σ²(m, n) = (1/(2M + 1)²) Σ_{k=m−M}^{m+M} Σ_{l=n−M}^{n+M} ( u(k, l) − µ(m, n) )²,

where

µ(m, n) = (1/(2M + 1)²) Σ_{k=m−M}^{m+M} Σ_{l=n−M}^{n+M} u(k, l)

is the sample mean. The value of M is a design parameter, typically M = 2. The threshold will depend on the nature of the image. Examples of edge detection are shown in Fig. 34.
(a) Original unprocessed image(from Matlab’s demo photo li-brary).
(b) Image processed by gradient-based edge detection using h7.
(c) Image processed by gradient-based edge detection using h8.
(d) Image processed by gradient-based edge detection using h9.
(e) Image processed by gradient-based edge detection using h10.
(f) Image processed by gradient-based edge detection using (17).
(g) Image processed by Laplacian-based edge detection using h11, withno zero-crossing detection or variancethresholding.
(h) Image processed by Laplacian-based edge detection using h12, withno zero-crossing detection or variancethresholding.
Figure 34: Example of edge detection.
20 Image Compression and Coding
Goal: to reduce the number of bits needed to encode an image by eliminating redundancy.

Current and future application examples:

Application | Raw data volume (1 Mibit = 2^20 bits)
Color image (256 × 256, 24-bit color) | 1.5 Mibit
Color video (1920 × 1080, 24-bit color, 60 fps) | 2.98 Gibit/s
Digital radiology facility | 64 Mibit/image, 1800 images per month
SAR facility | 800 Mibit/image, 5 to 10 images per day
Fax | 2 Mibit/page
Almost all compression techniques use one or more of the following principles:
1. Resources, i.e., storage space and transmission time, will be conserved if one uses short messages to communicate the occurrence of common events
2. Pixels within a given region are often correlated
3. If loss must occur, discard information less vital to the image
Compression techniques are divided into two classes
1. Reversible, error-free, distortionless: all information is preserved, complete reconstruction is possible.
Typical compression 3 : 1
2. Irreversible, lossy: some information is lost. Typical compression 20 : 1 for single images.
20.1 Information Theory Background
The information of a random event E with probability P(E) is

I(E) = −log P(E).

The base of the logarithm determines the units of information: a bit if the base is 2, a nat if the base is e.

Example: E = {coin toss = heads},

P(E) = 1/2 ⇒ I(E) = −log₂(1/2) = 1 bit.
Now consider an information source with an alphabet of L possible symbols, each with probability p_i, i = 0, 1, . . . , L − 1, with Σ_{i=0}^{L−1} p_i = 1. If M ≫ 1 symbols are generated by this source, symbol i will appear about M p_i times. The information contributed by symbol i is

I_i = −M p_i log₂ p_i bits.

The total information in the string of M symbols is

I = −M Σ_{i=0}^{L−1} p_i log₂ p_i.
The average information per symbol, H, is

H = −Σ_{i=0}^{L−1} p_i log₂ p_i = Entropy.

According to Shannon's noiseless coding theorem, it is always possible to devise a code with an average of B bits per symbol, where

H ≤ B ≤ H + 1/n,

and n is the number of symbols encoded at one time. Usually n = 1.
20.2 Huffman Coding
Huffman coding is a common, reversible image compression technique that encodes one symbol at a time and results in a variable-length code with the lowest possible bit rate.

To generate a Huffman codebook:
1. Map initial symbols to symbol nodes and order based on symbol probabilities
2. Combine the two symbol nodes having the lowest probabilities into a new symbol node, using 0 and 1 to link the new node to the former symbols
3. Repeat until only two symbol nodes remain.
20.3 How to generate a Huffman tree
Let us use an example. Assume your source’s alphabet is given in the following table, together with the
symbols’ probabilities:
Symbol   Probability
a        0.1
b        0.4
c        0.04
d        0.1
e        0.07
f        0.29
Let's first rank the symbols by descending probability and create a node for each symbol (which will become a leaf in the tree):
[Tree leaves, in descending probability: b (0.4), f (0.29), a (0.1), d (0.1), e (0.07), c (0.04)]
Next, let's connect the two nodes with the lowest probability and generate a node with the combined probability:
[Tree: c (0.04) and e (0.07) are combined into a new node with probability 0.11]
The next two least probable nodes are a and d; they can now be connected to form a new node with combined probability 0.2:

[Tree: a (0.1) and d (0.1) are combined into a new node with probability 0.2]
At each stage, continue connecting the two least probable nodes as follows:
[Tree: nodes 0.11 and 0.2 are combined into a new node with probability 0.31]
[Tree: f (0.29) and node 0.31 are combined into a new node with probability 0.6]
Finally the node with probability 1.0 will be the root of the tree.
[Tree: b (0.4) and node 0.6 are combined into the root node, with probability 1.0]
Now, starting from the root, trace your way back to the leaves. Each node in the tree, whether the root or an internal node, has two edges leading out of it. Label one with a "0" and the other with a "1":
[Labeled tree: at each node, one outgoing edge is labeled 1 and the other 0; for instance, at the root b's branch is labeled 1 and node 0.6's branch 0, and at node 0.6 f's branch is labeled 1 and node 0.31's branch 0, and so on down to the leaves]
The strings of binary digits associated with each symbol can be read by traversing the tree from the root to each leaf, and are given in the following table:
Symbol   Binary string
a        0011
b        1
c        0000
d        0010
e        0001
f        01
The average number of binary digits per symbol is
1 × 0.4 + 2 × 0.29 + 4 × 0.1 + 4 × 0.1 + 4 × 0.07 + 4 × 0.04 = 2.22.
Compare this with the source entropy in bits per symbol, given by
−0.4log2(0.4) − 0.29log2(0.29) − 0.1log2(0.1) − 0.1log2(0.1) − 0.07log2(0.07) − 0.04log2(0.04) ≈ 2.165.
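The construction above can be sketched with a priority queue (Python's heapq). Tie-breaking may produce different but equally optimal codewords, so what is checked is the codeword lengths and the average rate, which match the hand computation:

```python
import heapq
from math import log2

probs = {'a': 0.1, 'b': 0.4, 'c': 0.04, 'd': 0.1, 'e': 0.07, 'f': 0.29}

# each heap entry: (probability, tie-breaker, {symbol: codeword-so-far})
heap = [(p, i, {s: ''}) for i, (s, p) in enumerate(probs.items())]
heapq.heapify(heap)
count = len(heap)
while len(heap) > 1:
    p1, _, c1 = heapq.heappop(heap)      # two least probable nodes
    p2, _, c2 = heapq.heappop(heap)
    merged = {s: '0' + w for s, w in c1.items()}   # prepend a bit on each merge
    merged.update({s: '1' + w for s, w in c2.items()})
    heapq.heappush(heap, (p1 + p2, count, merged))
    count += 1
code = heap[0][2]

avg_len = sum(probs[s] * len(w) for s, w in code.items())
entropy = -sum(p * log2(p) for p in probs.values())
```

The resulting lengths are 1 for b, 2 for f, and 4 for a, c, d, e, giving an average of 2.22 binary digits per symbol against an entropy of about 2.165 bits.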
Example: Consider a digital image, where each pixel has 8 bits. The probability of the ith gray level, 0 ≤ i ≤ 255, is p_i. If the pdf is uniform, i.e., p_i = 2⁻⁸ for all i,

H = −Σ_{i=0}^{255} (1/256) log₂ 2⁻⁸ = 8.
The minimum coding rate is 8bit/pixel. Result: when the pdf is uniform, standard PCM coding is optimal.
Example: Suppose the image pdf is

p_i = { 2⁻⁷,   0 ≤ i ≤ 127,
        0,     128 ≤ i ≤ 255.

Then H = 7, and sending 8 bit/pixel is not optimal.

If an image having M bits per pixel has entropy H, the minimum possible distortionless coding scheme uses H bits per pixel, the maximum compression ratio is

C = M/H,

and the efficiency of a code using B bits per pixel is

efficiency = H/B.
Note that although the codewords are of variable length, they are uniquely decodable. No end-of-word code needs to be sent.

Problem: if L different symbols are possible, there are L codewords, and the longest codeword can have length on the order of L.

In modified Huffman coding, each symbol i is written as i = qL₁ + j; the remainder j, taking one of the first L₁ values, is Huffman coded, and the quotient q is coded separately.
20.4 Run-length Coding (RLC)
RLC is especially good for documents, drawings, and other two-tone images. RLC is the basis for fax transmission.
Principle: encode the lengths of runs of consecutive pixels. Possibilities include:

1. Encoding the number of zeros between two consecutive 1s (good when P(1) ≪ 1)

2. Encoding the distance to the next 1 → 0 or 0 → 1 transition.
Often the run-lengths themselves are Huffman coded.
20.5 Bit-plane Encoding
Consider an 8-bit image as 8 separate two-tone images. Encode each plane separately.
20.6 Distortion Function
In communication, the rate-distortion function R_D gives the minimum possible coding rate for a given channel-induced distortion. In image compression, the "channel" is the image compression/decompression process. A quantizer that achieves R_D is called a Shannon quantizer.

Examples:

1. Binary source: d = 1 if a coding error occurs, d = 0 if no coding error occurs, p(1) = 1/2. Note: if D = 0, the minimum rate is 1 bit/symbol; if D = 0.5, R_D = 0.

2. Gaussian source with variance σ²:

R_D = max { 0, (1/2) log₂(σ²/D) }.

Note: if we accept D = σ², no information needs to be sent.
21 Transform Coding and Compression
Steps:
1. Transform u(m, n) to get V (k, l)
2. Quantization: Each element of V(k, l) is called a coefficient. Each coefficient is quantized using n_{k,l} bits, where n_{k,l} is larger for those (k, l) where the variance σ²_{k,l} of V(k, l) is large. The value n_{k,l} is called the bit allocation.

Example of bit allocation with N = 4, coded at 0.5 bit/pixel, i.e., 8 bits per 4 × 4 block:

[ 3 2 0 0
  2 1 0 0
  0 0 0 0
  0 0 0 0 ]

The coded image is the quantized V(k, l).

3. Send the quantized V(k, l) to the receiver.

4. Inverse-transform the quantized V(k, l) to get the reconstructed u(m, n).
22 The JPEG Standard
JPEG = Joint Photographic Experts Group, with representatives from industry and academia from all over the world. "Joint" refers to ISO (International Organization for Standardization) and CCITT (Comité Consultatif International Télégraphique et Téléphonique, now ITU-T, International Telecommunication Union, Telecommunication Standardization Sector).

Goal: a compression standard ensuring compatibility, universality, etc.

There are several modes (progressive, lossless, etc.) in JPEG. We consider only the basic lossy mode. The algorithm (assume the image consists of M-bit integers):
1. Break the image up into non-overlapping 8 × 8 blocks.
2. For each block u(m, n), take the 8 × 8 DCT:
u(m, n) → V (k, l).
3. Quantize each DCT coefficient. The quantization step size for each (k, l) is stored in a look-up table as q(k, l), 1 ≤ q(k, l) ≤ 255:

   V̂(k, l) = round( V(k, l) / q(k, l) ),

where q(k, l) is determined by the just-noticeable threshold: if the contribution of V(k, l) would not be noticed, then V̂(k, l) = 0. The DC coefficient, V̂(0, 0), is then converted to a differential value:

   ΔV = V̂(0, 0)|block n − V̂(0, 0)|block n−1
4. The 64 DCT coefficients are then mapped to a 1D vector of length 64 using a zig-zag scan. This orders V̂(k, l) in approximate order of increasing spatial frequency. Many coefficients of V̂(k, l) are zero or near zero.
5. The 1D vector of length L1 = 64 is run-length coded (DC element excluded). Each non-zero value is represented by two symbols:

   Symbol 1: (run length, size S), where S is the number of bits used to code Symbol 2;

   Symbol 2: the amplitude, expressed as an S-bit number. For M = 8-bit images, level-shifted to the range −128 ≤ p ≤ 127, the maximum possible DCT coefficient amplitude is (1/8) × 64 × 128 = 1024 ⇒ S_max = M + 3.
6. Each symbol is entropy-coded using a modified Huffman code.
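Steps 4 and 5 can be sketched as follows (real JPEG also caps the run field at 15 and uses a ZRL symbol for longer zero runs, which this sketch omits):

```python
def zigzag(N=8):
    """Step 4: zig-zag scan order -- walk the anti-diagonals k + l = s,
    alternating direction, so spatial frequency roughly increases."""
    order = []
    for s in range(2 * N - 1):
        diag = [(k, s - k) for k in range(N) if 0 <= s - k < N]
        if s % 2 == 0:
            diag.reverse()
        order.extend(diag)
    return order

def run_length(ac):
    """Step 5: run-length code the AC coefficients as
    (zero run, size S, amplitude) triples plus an end-of-block marker."""
    out, run = [], 0
    for v in ac:
        if v == 0:
            run += 1
        else:
            out.append((run, abs(v).bit_length(), v))  # S bits for amplitude
            run = 0
    out.append("EOB")
    return out

# A sparse quantized block: only a few low-frequency coefficients survive.
block = [[0] * 8 for _ in range(8)]
block[0][0], block[0][1], block[1][0], block[2][1] = 26, -3, 2, 1
scan = [block[k][l] for (k, l) in zigzag()]
print(run_length(scan[1:]))  # DC element (first entry) excluded
# -> [(0, 2, -3), (0, 2, 2), (5, 1, 1), 'EOB']
```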
Color images contain three components: one luminance (gray-scale) component and two chrominance components. In JPEG, chrominance is coded with less precision than luminance. JPEG performance for color images depends on the quantization table, image statistics, etc.:
   0.25 to 0.5 bit/pixel    moderate quality, but degradation noticeable
   ≈ 1 bit/pixel            excellent, acceptable for most uses
   1.5 to 2 bits/pixel      usually indistinguishable from the original
Other JPEG features include
1. Progressive encoding: send the most important information first. Two options:

(a) Send the most important DCT coefficients first

(b) Send the most significant bits of all DCT coefficients first
2. Lossless predictive coding algorithm based on 2D prediction
23 Video Compression
1. Intraframe techniques compress single images and use spatial redundancy to reduce the bit rate
2. Interframe techniques use temporal redundancies between frames.
A full video compression scheme incorporates both types of redundancy. Video applications can be classified as
• Symmetric: compression and decompression are performed with equal frequency. Example: video conferencing, video telephony, etc.

• Asymmetric: compress only once. Example: movies, electronic publishing, etc.
The goal of most video compression schemes is often a rate of about 1.5 Mbit/s. This is a good rate for LANs, existing communication channels, etc. One of the most recent standards is MPEG (Moving Picture Experts Group). Achieved and projected bit rates (for 30 frames/s, non-interlaced) are:
   Resolution            Compressed bit rate (Mbit/s)
   352 × 240             ≈ 1.5 (VHS-like quality)
   720 × 480             ≈ 7.5
   1920 × 1080 (HDTV)    ≈ 30
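The implied compression ratios can be checked against the raw rates. A sketch, assuming 4:2:0 chroma subsampling (12 bits/pixel on average) and 30 frames/s — both standard for these formats but assumptions not stated in the notes:

```python
# Raw bit rate for a given resolution, assuming 4:2:0 subsampling
# (12 bits/pixel on average) at 30 frames/s.
def raw_mbit_s(w, h, bpp=12, fps=30):
    return w * h * bpp * fps / 1e6

for w, h, target in [(352, 240, 1.5), (720, 480, 7.5), (1920, 1080, 30)]:
    raw = raw_mbit_s(w, h)
    print(f"{w}x{h}: {raw:.1f} Mbit/s raw -> {target} Mbit/s "
          f"(~{raw / target:.0f}:1 compression)")
```

So the targets in the table correspond to roughly 17:1 to 25:1 compression of the raw video.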
In addition to achieving reduced bit rates, video compression algorithms must enable
• Random access
• Audio-visual synchronization
• Coding/decoding delay of less than 150 ms
• Recovery from transmission errors
Rate control in video coding is an extremely difficult issue: the goal is to meet an overall constant rate, and there are buffering considerations. Standards may codify quantization step sizes, inter-/intra-frame compression, and block sizes.
Audio in video: Quantization precision is based on a perceptual model; quantization introduces loss depending on that precision. A non-linear quantizer may be used (where step sizes follow a power law). Stereo may be handled in different ways. Example: one can code L+R and then separately L−R. Or one can exploit the fact that humans find it hard to localize very low and very high frequencies, so those can be coded on a single channel. After quantization, Huffman coding is applied.

CD data rate: 44.1 kHz sampling rate, 16 bits/sample → about 700 kbit/s per channel. Two channels give about 1.4 Mbit/s.

MP3: Initially defined in the early 1990s as MPEG-1/2 Layer 3. MPEG is the name of a family of audio-visual coding standards; it originally specified three different layers of audio codec, where higher layers give better quality at a given bit rate but require more complexity. Originally (in MPEG-1) MP3 allowed output rates from 32 kbit/s to 320 kbit/s; MPEG-2 extended this down to 8 kbit/s. MP3 supports sampling rates of 16 kHz, 22.05 kHz, 24 kHz, 32 kHz, 44.1 kHz, and 48 kHz; it usually uses 44.1 kHz for compatibility with CD audio rates.
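The CD data-rate arithmetic works out as follows (note the per-channel figure is in kilobits per second):

```python
# CD audio: 44.1 kHz sampling, 16 bits per sample.
per_channel = 44_100 * 16   # 705,600 bit/s, i.e. about 700 kbit/s
stereo = 2 * per_channel    # 1,411,200 bit/s, about 1.4 Mbit/s
# A 128 kbit/s MP3 stream is then roughly an 11:1 reduction.
print(per_channel, stereo)  # 705600 1411200
```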
Figure 35: MP3 block diagram.
Based on subband filtering. The first step is decomposition into 32 subbands; then an MDCT is used to create 18 frequency components for each, for a total of 576 frequency bins. (The MDCT is a modified DCT: modified because it has overlap between successive windows.)
The psychoacoustic (or perceptual) model exploits masking. Masking describes the fact that the perception of a given sound depends not only on the level of the sound itself but also on its context. In coding, this allows assigning fewer bits to portions of the audio signal that humans would be unlikely to hear.
AAC (Advanced Audio Coding): Also developed by MPEG, in the mid-1990s. Like MP3, it uses the MDCT and perceptual coding. It performs better than MP3: music sounds better at the same bit rate, and the same quality is achieved at a lower bit rate. It is supported in iTunes, many video games, standards such as MPEG-4, etc.