Lecture 1:Introduction & DSP
Michael Mandel [email protected]
CUNY Graduate Center, Computer Science Programhttp://mr-pc.org/t/csc83060
With much content from Dan Ellis’ EE 6820 course
Michael Mandel (83060 SAU) Intro & DSP 1 / 67
Outline
1 Sound and information
2 Course Structure
3 Time domain signal processing
4 Frequency-domain signal processing
5 Time-frequency signal processing
Michael Mandel (83060 SAU) Intro & DSP 2 / 67
Acknowledgments: This lecture uses content from
Dan Ellis’ EE 4810 course (Lectures 1,2,3)https://www.ee.columbia.edu/~dpwe/e4810
Dan Ellis’ EE 6820 course (Lecture 1)https://www.ee.columbia.edu/~dpwe/e6820/
Michael Mandel (83060 SAU) Intro & DSP 3 / 67
Outline
1 Sound and information
2 Course Structure
3 Time domain signal processing
4 Frequency-domain signal processing
5 Time-frequency signal processing
Michael Mandel (83060 SAU) Intro & DSP 4 / 67
Sound and information
Sound is air pressure variation
Mechanical vibration
Pressure waves in air
Motion of sensor
Time-varying voltage
+ + + +
t
v(t)
Transducers convert air pressure ↔ voltage
Michael Mandel (83060 SAU) Intro & DSP 5 / 67
What use is sound?
Footsteps examples:
0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5-0.5
0
0.5
0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5-0.5
0
0.5
time / s
Hearing confers an evolutionary advantage
useful information, complements vision
. . . at a distance, in the dark, around corners
listeners are highly adapted to ‘natural sounds’ (incl. speech)
Michael Mandel (83060 SAU) Intro & DSP 6 / 67
The scope of audio processing
Michael Mandel (83060 SAU) Intro & DSP 7 / 67
The acoustic communication chain
message signal channel receiver decoder
!
synthesis audio processing recognition
Sound is an information bearer
Received sound reflects source(s)plus effect of environment (channel)
Michael Mandel (83060 SAU) Intro & DSP 8 / 67
Levels of abstraction
Much processing concerns shifting between levels of abstraction
sound p(t)
representation (e.g. t-f energy)
‘information’
abstract
concrete
An
alys
is
Syn
thesis
Different representations serve different tasks
separating aspects, making things explicit, . . .
Michael Mandel (83060 SAU) Intro & DSP 9 / 67
Outline
1 Sound and information
2 Course Structure
3 Time domain signal processing
4 Frequency-domain signal processing
5 Time-frequency signal processing
Michael Mandel (83060 SAU) Intro & DSP 10 / 67
Course structureGoals
I survey topics in sound analysis & processingI develop an intuition for sound signalsI learn some specific technologies
Course structureI weekly reading (20% + 10%)I midterm project proposal (20%)I final project (30% + 10%)
Required textbook for first part of course. Available athttp://dicklyon.com/hmh/Lyon_Hearing_book_
01jan2018.pdf
Richard Lyon (2018), Human and MachineHearing: Extracting meaning from sound.Cambridge University Press.
Michael Mandel (83060 SAU) Intro & DSP 11 / 67
Web-based
Course website:
http://mr-pc.org/t/csc83060
for lecture notes, articles, examples, . . .
+ Blackboard for homework, etc.
Michael Mandel (83060 SAU) Intro & DSP 12 / 67
Assignments: every week
Everyone reads book chapters and/or review papersI reviewing topic of next classI discuss in class
One student presents a contribution paperI journal & conference papers presenting a novel contributionI everyone else reads it as wellI everyone writes a two-paragraph response on blackboard:¶1: summary and ¶2: future directions
Write one-paragraph project statusI ¶3: weekly status updates
Michael Mandel (83060 SAU) Intro & DSP 13 / 67
Final project
Most significant part of course (60%) of grade
Oral proposals mid-semester;Presentations in final class+ paper on week later
ScopeI practical (Python or Matlab recommended)I identify a problem; try some solutionsI evaluation
TopicI few restrictions within world of audioI investigate other resourcesI develop in discussion with me
No copying
Michael Mandel (83060 SAU) Intro & DSP 14 / 67
Examples of past projects
Use of “bubble noise” to compare native Korean listeners toKorean learners – became masters thesis
Use of siamese networks as learned distance function betweenclean and noisy speech – became INTERSPEECH paper
Implementation of classic noise tracking algorithms andapplication to beamforming
Implementation of LSTM-based single-channel speechenhancement
Michael Mandel (83060 SAU) Intro & DSP 15 / 67
Some project ideas
Podcast transcriber and diarizer
Podcast episode recommender
Analysis of long-term environmental recordings
Heatmapping tools applied to automatic speech recognizers
Use of neural speech synthesis in speech enhancement
Michael Mandel (83060 SAU) Intro & DSP 16 / 67
Example responses: ¶1: Contribution article summary
The paper for this week is Mohri, Mehryar, Fernando Pereira, and Michael Riley. “Speech recognition with weightedfinite-state transducers.” Springer Handbook of Speech Processing. Springer Berlin Heidelberg, 2008. 559-584.The paper describes how to use weighted finite-state transducers(WFST) for automatic speech recognitions. Thetransducers can be efficiently used to represent major components of speech recognition like Hidden MarkovModel(HMM), context-dependency model, pronunciation lexicon and statistical grammars. At first authors presentgeneral description of transducers , algorithms to determinize and minimize them. Then they describe how thesemethods are very good match for combining and optimizing different parts of the speech recognizer. A WFST canbe described informally as a machine with fixed set of states, it has a start state, some final states and intermediatestates. Given a input symbol they can transitions from one state to another with producing output symbol. WFSTis weighted in the sense that transitions from state to state can have different weights. Generally these weightsrepresents the probability or likelihood of different paths. It transduces a string that can be read along the pathfrom start state to final state to output string with particular weight. In another word they represent a binaryrelation between input and output strings. For example pronunciation lexicons can be represented by transducers asmapping from pronunciation to word. Pronunciation lexicon (L) for all of the words can be generated by composingall word pronunciation transducers together thorough a super-initial state. Applying a Kleene closure on each of theword transducer allows for words to repeat in any order. If we add the mapping context dependent triphone tocontext independent phone with this, we get a model(C) that maps from context dependent phones to wordstrings. For efficiency we need to determinize and minimize the transducer. Converting non-deterministictransducer to a deterministic counter part is known as a detereminizer, there are algorithms like semiring applicablefor this operation. Deterministic transducers are efficient to search given a particular input-output pair, thus wewant to determinize the transducers for efficiency. Now the problem is all of the models may not be determinizableto start with, for example pronunciation has homophones , different words that sounds the same. Thus they are notdeterminizable,we need to add auxiliary phone symbols to make it determinizable. A unweighted transducer can beminimized with classical algorithms, but for weighted version there is an extra step pushing /reweighting. Finally ifwe combine HMM model (H) with this we get a transducer that maps from distribution to word strings.
My response: Good summary
Michael Mandel (83060 SAU) Intro & DSP 17 / 67
Example responses: ¶2: Contribution article response
The paper for this week talks about transducers and their application to ASR. They first introduce acceptors andthen define transducers. They also mention different algorithms for composition, determinization and minimizationof transducers. According to me it is a very well written paper, the examples are specially well described and easyto reproduce. With the algorithms I found the mathematical descriptions very informative and useful as well. Iwanted to know more about other available algorithms and whether they can be used for speech recognizer. Forcomposition they described state-wise pairing with semiring method which increases the state space of thecomposite automation. Though that is expected as we are combining two transducers. I wanted to know a little bitmore about other available algorithms and why we can or can’t use them. I felt similarly about weight pushingalgorithm, whether there are other algorithms available for reweighing the transducer. Or if there is any algorithmfor directly minimizing weighted transducer pushing and then minimizing.
My response: Good questions, I don’t know all of the answers to them. A semiring isn’t an algorithm or a method,it is a mathematical object that has certain characteristics. Informally, it’s the way to keep track of the weights andhow to combine them. The semiring to represent probabilities is different from, but equivalent to, the semiring torepresent log probabilities. Basically you need a plus operation and a times operation. For probabilities, plus is +and times is *. For log probabilities, plus is log(exp(x1)+exp(x2)) and times is +. So applying the plus operationto two probabilities is equivalent to applying the plus operation to two log probabilities. And similarly for times.Anyway, I think there are several algorithms for weight pushing, minimization, etc.
Michael Mandel (83060 SAU) Intro & DSP 18 / 67
Example responses: ¶3: Project status
For the project I am working to train the deep neural network, which is the main part of the project.
This week I have written a code for simple WARP loss function.
I have also created the training data from CHiME2-GRID. The training data contains one clean chunk, noisy chunkand another negative example.
I am also working towards setting the DNN properly to train with training data.
My response: Ok, have you checked the code into github on your own branch yet? If so, send me a link to it. Ifnot, please do.
Michael Mandel (83060 SAU) Intro & DSP 19 / 67
Outline
1 Sound and information
2 Course Structure
3 Time domain signal processingTime domain signalsTime-domain systemsConvolution
4 Frequency-domain signal processing
5 Time-frequency signal processing
Michael Mandel (83060 SAU) Intro & DSP 20 / 67
Signals
Signals: Information-bearing function
E.g. sound: air pressure variation at a point as a function of time
DimensionalityI Sound: 1-dimensionalI Greyscale image: 2-dimensional, i(x , y)I Video: 3 x 3-D, r(x , y , t), g(x , y , t), b(x , y , t)
Michael Mandel (83060 SAU) Intro & DSP 21 / 67
Digital signal processing
DSP = signal processing on a computer
Two effects: discrete-time, discrete level
time
xd[n] = Q( xc(nT ) )
Discrete-time sampling limits bandwidth
Discrete-level quantization
limits dynamic range
T
ε
Michael Mandel (83060 SAU) Intro & DSP 22 / 67
DSP vs analog signal processing
Analog signal processing
Processorp(t) q(t)
Digital signal processing system
Processorp(t) q(t)A/D D/Ap[n] q[n]
Michael Mandel (83060 SAU) Intro & DSP 23 / 67
Digital vs analog
Digital prosI Noise performance – quantized signalI Use general-purpose computer – flexible, upgradeableI Stability/replicability
Digital consI Limitations of analog-digital convertersI Baseline complexity/power consumptionI Latency
Michael Mandel (83060 SAU) Intro & DSP 24 / 67
Digital limitation: sampling sinusoids
Samples of a sinusoid could equally correspond to others
0 1 2 3 4 5 6 7 8 9 10-1
-0.5
0
0.5
1
x1[n] = sin(ω0n)
x2[n] = sin((ω0 + 2πr)n) = sin(ω0n) = x1[n]
Michael Mandel (83060 SAU) Intro & DSP 25 / 67
Aliasing
For example, for cos(ωn), ω = 2πr ± ω0
all (integer) r values appear to the same after sampling
We say that a larger ω appears aliased to a lower frequency
Therefore we generally consider frequencies 0 ≤ ω0 ≤ πi.e., less than 1
2 cycle per sample
Michael Mandel (83060 SAU) Intro & DSP 26 / 67
Sampling and aliasingDiscrete-time signals equal the continuous time signal at discretesampling instants:
xd [n] = xc(nT )
Sampling cannot represent rapid fluctuations
0 1 2 3 4 5 6 7 8 9 10 1
0.5
0
0.5
1
sin
((ΩM +
2π
T
)Tn
)= sin(ΩMTn) ∀n ∈ Z
Nyquist limit (ΩT/2) from periodic spectrum:
ΩΩM−ΩT ΩT −ΩM ΩT - ΩM−ΩT + ΩM
Gp(jΩ)Ga(jΩ) “alias” of “baseband”
signal
Michael Mandel (83060 SAU) Intro & DSP 27 / 67
The speech signal: time domain
Speech is a sequence of different sound types
-0.2
-0.1
0
0.1
0.2
1.38 1.4 1.42.1
0
.1
1.52 1.54 1.56 1.58-0.1
0
0.1
1.86 1.88 1.921.9-0.05
0
0.05
2.42 2.44 2.46 2.4
-0.02
0
0.02
1.4 1.6 1.8 2 2.2 2.4 2.6 time/s
watch thin as a dimeahas
Vowel: periodic “has”
Fricative: aperiodic “watch”
Glide: smooth transition “watch”
Stop burst: transient “dime”
Michael Mandel (83060 SAU) Intro & DSP 28 / 67
Discrete-time systems
A system converts input to output
x[n] DT System y[n]
y [n] = f (x [n])∀n
For example, a 3-point moving average (M = 3)
y [n] =1
M
M−1∑k=0
x [n − k]
Michael Mandel (83060 SAU) Intro & DSP 29 / 67
3-point moving average example
y [n] =1
3
2∑k=0
x [n − k]
n
x[n]
-1 1 2 3 4 5 6 7 8 9
A
n
x[n-1]
-1 1 2 3 4 5 6 7 8 9
n
x[n-2]
-1 1 2 3 4 5 6 7 8 9
n
y[n]
-1 1 2 3 4 5 6 7 8 9
Michael Mandel (83060 SAU) Intro & DSP 30 / 67
Moving average smoother
Moving average smooths out rapid variationse.g., “12 month moving average”
Example, 5-point moving average
x [n] = s[n]︸︷︷︸signal
+ d [n]︸︷︷︸noise
y [n] =1
5
4∑k=0
x [n − k]
Michael Mandel (83060 SAU) Intro & DSP 31 / 67
Example 2: accumulator
Accumulator: outputaccumulates all past inputs
y [n] =n∑
`=−∞x [`]
= x [n] +n−1∑`=−∞
x [`]
= x [n] + y [n − 1]
n
x[n]
-1 1 2 3 4 5 6 7 8 9
A
n
y[n-1]
-1 1 2 3 4 5 6 7 8 9
n
y[n]
-1 1 2 3 4 5 6 7 8 9
Michael Mandel (83060 SAU) Intro & DSP 32 / 67
System property: Linearity
Linear systems obey superposition
x[n] DT System y[n]
If input x1[n]→ output y1[n], x2 → y2
Given a linear combination of inputs, a linear systemproduces the same linear combination of outputs
x [n] = αx1[n] + βx2[n]
→ y [n] = αy1[n] + βy2[n]
for all α, β, x1, x2
Michael Mandel (83060 SAU) Intro & DSP 33 / 67
System property: Time-invariance
Time-invariant systems don’t depend on an absolute time
Given a time-shifted input, a time-invariant systemproduces a version of the output shifted by the same amount
if x [n]→ y [n], then x [n − n0]→ y [n − n0]
for all n0, x
Michael Mandel (83060 SAU) Intro & DSP 34 / 67
Impulse response
An impulse is a unit sample sequence
δ[n] =
1, n = 00, n 6= 0
Given a system, if x [n] = δ[n] then y [n] , h[n],and h[n] is the impulse response of the system
x[n] DT System y[n]
A linear, time-invariant system is completely specified by h[n]
Michael Mandel (83060 SAU) Intro & DSP 35 / 67
Impulse response example: Moving average
Consider the 3-point moving average filter again
If we put in a standard impulse x [n] = δ[n]we get out the impulse response y [n] = h[n]
n
x[n]
-1 1 2 3 4 5 6 7
1
-2-3
n
y[n]
-1 1 2 3 4 5 6 7-2-3
3/1
Michael Mandel (83060 SAU) Intro & DSP 36 / 67
Convolution
Impulse response means for an LTI system δ[n]→ h[n]
Shift invariance means δ[n − n0]→ h[n − n0]
Linearity means αδ[n− k] + βδ[n− l ]→ αh[n− k] + βh[n− l ]
We can express any sequence as a sum of weighted δs
x [n] = x [0]δ[n] + x [1]δ[n − 1] + x [2]δ[n − 2] + . . .
Therefore, we can compute the output of any system
Michael Mandel (83060 SAU) Intro & DSP 37 / 67
Convolution sum
Another way to write x [n] as a sum of deltas is
x [n] =∞∑
k=−∞x [k]δ[n − k]
Therefore, for an LTI system, we get the convolution sum
y [n] =∞∑
k=−∞x [k]h[n − k] , x [n] ∗ h[n]
Note that this is symmetric in x and h by using ` = n − k
x [n] ∗ h[n] =∞∑
`=−∞x [n − `]h[`] = h[n] ∗ x [n]
Michael Mandel (83060 SAU) Intro & DSP 38 / 67
Interpreting convolution
Passing a signal through an LTI system is equivalent toconvolving it with the system’s impulse response
x[n] h[n] y[n]
Let’s convolve the following two signals in two different ways
x[n]=0 3 1 2 -1
n-1 1 2 3 5 6 7-2-3
→h[n] = 3 2 1
n-1 1 2 3 4 5 6 7-2-3
→
y [n] =∞∑
k=−∞x [k]h[n − k] =
∞∑`=−∞
x [n − `]h[`]
Michael Mandel (83060 SAU) Intro & DSP 39 / 67
Convolution interpretation 1
y [n] =∞∑
k=−∞x [k]h[n − k]
Time-reverse h, shift by n, take inner product against fixed x
k-1 1 2 3 5 6 7-2-3
x[k]
k-1 1 2 3 4 5 6 7-2-3
h[0-k]
k-1 1 2 3 4 5 6 7-2-3
h[1-k]
k-1 1 2 3 4 5 6 7-2-3
h[2-k]
n-1 1 2 3 5 7-2-3
y[n]0
9 911
2-1
Michael Mandel (83060 SAU) Intro & DSP 40 / 67
Convolution interpretation 2
y [n] =∞∑
`=−∞x [n − `]h[`]
Shifted xs weightedby points in h
Or, weighted delayedversions of h
n-1 1 2 3 5 7-2-3
y[n]0
9 911
2-1
n-1 1 2 3 5 6 7-2-3
x[n]
n-1 1 2 3 4 6 7-2-3
x[n-1]
n-1 1 2 3 4 5 7-2-3
x[n-2]
k-1
1234567
-2-3
h[k]
Michael Mandel (83060 SAU) Intro & DSP 41 / 67
Time-domain summary
Computers analyze signals that are discrete in time and in level
Sampling leads to aliasing, i.e., limited frequency bandwidth
Linear time-invariant systems transform signals by convolvingthem with an impulse response
Michael Mandel (83060 SAU) Intro & DSP 42 / 67
Outline
1 Sound and information
2 Course Structure
3 Time domain signal processing
4 Frequency-domain signal processing
5 Time-frequency signal processing
Michael Mandel (83060 SAU) Intro & DSP 43 / 67
The Fourier series
The basic observation (in continuous time) is thatA periodic signal can be decomposed intoa sum of weighted sinusoids at integer multiplesof the fundamental frequency
i.e., if x(t) = x(t + T ) for some fundamental frequency T , then
x(t) ≈M∑k=0
ak cos
(2πk
Tt + φk
)
=M∑k=0
a′k cos
(2πk
Tt
)+ bk sin
(2πk
Tt
)
=M∑k=0
ck︸︷︷︸complex
exp
(2πkj
Tt
)where j =
√−1
Michael Mandel (83060 SAU) Intro & DSP 44 / 67
Fourier series example: square wave
Approximated by the following Fourier series coefficients
φk = 0 ak =
(−1)
k−12
1k k = 1, 3, 5, . . .
0 otherwise
i.e. x(t) ≈M∑k=0
ak cos
(2πk
Tt + φk
)= cos
(2π
Tt
)− 1
3cos
(2π
T3t
)+
1
5cos
(2π
T5t
)− . . .
1.5 1 0.5 0 0.5 1 1.51
0.5
0
0.5
1
Michael Mandel (83060 SAU) Intro & DSP 45 / 67
Fourier analysis
Given a signal, how can we find Fourier coefficients?
Let’s use the a′k and bk representation, so
x(t) =M∑k=0
a′k cos
(2πk
Tt
)+ bk sin
(2πk
Tt
)Take the inner product with complex sinusoids
a′k =1
T
∫ T/2
−T/2x(t) cos
(2πk
Tt
)dt
bk =1
T
∫ T/2
−T/2x(t) sin
(2πk
Tt
)dt
Works because these sinusoids are orthogonal to each otheri.e., each sinusoid picks itself out of the signal
Michael Mandel (83060 SAU) Intro & DSP 46 / 67
Complex numbers
Complex numbers are for us a mathematical convenience thatlead to simple expressions for things like the Fourier transform
Add a second “imaginary” dimension (j ,√−1) to all values
Two equivalent representations:I Rectangular form: x = xr + jxi
with magnitude |x | =√x2r + x2
i
and phase θ = tan−1(xi/xr )I Polar form: x = |x |e jθ = |x | cos θ + j |x | sin(θ)
Euler’s formula: e jθ = exp(jθ) = cos θ + j sin θ
Michael Mandel (83060 SAU) Intro & DSP 47 / 67
Operations on complex numbers
x
real
imag
xre xre+yre
xim+yim
yre
x+y
x·y
xim
yim
|x|
|y|
|x|·|y|
φ
θ
θ+φ
When adding, real and imaginaryparts add
(a + jb) + (c + jd) = (a + c) + j(b + d)
When multiplying, magnitudesmultiply and phases add
re jθ · se jφ = rse j(θ+φ)
Michael Mandel (83060 SAU) Intro & DSP 48 / 67
Complex conjugation
Conjugation operation negates the imaginary part
x∗ = xr − jxi
also negates the imaginary phase
x∗ = |x |e j(−θ)
Useful in extracting real quantities from complex expressions
x + x∗ = xr + jxi + xr − jxi = 2xr
x · x∗ = |x |e jθ|x |e−jθ = |x |2
real
imag
|x|
θ
xx+x*= 2xr
x*
Michael Mandel (83060 SAU) Intro & DSP 49 / 67
Fourier analysis using complex exponentials
Inner product with complex sinusoids
a′k =1
T
∫ T/2
−T/2x(t) cos
(2πk
Tt
)dt
bk =1
T
∫ T/2
−T/2x(t) sin
(2πk
Tt
)dt
Can be written more compactly with complex exponentials
ck =1
T
∫ T/2
−T/2x(t) exp
(−j 2πk
Tt
)dt
Then synthesis of original signal
x(t) =M∑k=0
ck exp
(j2πk
Tt
)
Michael Mandel (83060 SAU) Intro & DSP 50 / 67
Fourier transform
The Fourier series (applicable to periodic signals) extendsnaturally to the Fourier Transform for continuous time signals
Fourier transform X (jΩ) =
∫ ∞−∞
x(t) exp(−jΩt)dt
Inverse Fourier transform x(t) =1
2π
∫ ∞−∞
X (jΩ) exp(jΩt)dΩ
The discrete index k changes to a continuous frequency Ω
Michael Mandel (83060 SAU) Intro & DSP 51 / 67
The Fourier transform is a mappingbetween two continuous functions
2π ambiguity
0 0.002 0.004 0.006 0.008time / sec
level / dB
-0.01
0
0.01
0.02x(t)
0 2000 4000 6000 8000freq / Hz
-80
-60
-40
-20
0|X jΩ( )|
0 2000 4000 6000 8000freq / Hz
0
π/2
-π/2
argX jΩ( )
Michael Mandel (83060 SAU) Intro & DSP 52 / 67
Fourier transforms
Different Fourier analyses possible depending on signals
continuous vs discrete, infinite vs finite (and periodic)
Time Frequency
Fourier series Cont. periodic x(t) Discrete infinite ckFourier transform (FT) Cont. infinite x(t) Cont. infinite X (Ω)
Discrete-time FT Discrete infinite x [n] Cont. periodic X (e jω)Discrete FT Discrete periodic x [n] Discrete periodic X [k]
Michael Mandel (83060 SAU) Intro & DSP 53 / 67
Fourier transforms
Fourier Series (periodic continuous x)
Ω0 =2π
T
x(t) =∑k
ckejkΩ0t
ck =1
2πT
∫ T/2
−T/2x(t)e−jkΩ0tdt k1 2 3 5 6 74
|ck|1.0
1.5 1 0.5 0 0.5 1 1.5 1
0.5
0
0.5
t
x(t)
Fourier Transform (aperiodic continuous x)
x(t) =1
2π
∫X (jΩ)e jΩtdΩ
X (jΩ) =
∫x(t)e−jΩtdt
0 0.002 0.004 0.006 0.008time / sec
level / dB
-0.01
0
0.01
0.02
x(t)
0 2000 4000 6000 8000freq / Hz
-80
-60
-40
-20 |X(jΩ)|
Michael Mandel (83060 SAU) Intro & DSP 54 / 67
Discrete-time Fourier
DT Fourier Transform (aperiodic sampled x)
x [n] =1
2π
∫ π
−πX (e jω)e jωndω
X (e jω) =∑
x [n]e−jωn
n-1 1 2 3 4 5 6 7
0 π
|X(ejω)|
ω2π 3π 4π 5π
1
2
3
x [n]
Discrete Fourier Transform (N-point x)
the only one we will use on computers
x [n] =∑k
X [k]e j 2πknN
X [k] =∑n
x [n]e−j2πknN
k
|X(ejω)||X[k]|
k=1...
n1 2 3 4 5 6 7
x [n]
Michael Mandel (83060 SAU) Intro & DSP 55 / 67
Fourier transform and convolution
The discrete-time Fourier transform is defined as
X (e jω) ,∞∑
n=−∞x [n]e−jωn
If x [n] = g [n] ∗ h[n] is the result of a convolution
then in the frequency domain it becomes multiplication
X (e jω) = G (e jω) · H(e jω)
Michael Mandel (83060 SAU) Intro & DSP 56 / 67
Fourier transform and convolution proof
The discrete-time Fourier transform is defined as
X (e jω) ,∞∑
n=−∞x [n]e−jωn
If x [n] = g [n] ∗ h[n] then
X (e jω) =∞∑
n=−∞(g [n] ∗ h[n])e−jωn
=∞∑
n=−∞
( ∞∑k=−∞
g [k]h[n − k]
)e−jωn
=∞∑
k=−∞g [k]e−jωk
∞∑n=−∞
h[n − k]e−jω(n−k)
= G (e jω) · H(e jω)
Michael Mandel (83060 SAU) Intro & DSP 57 / 67
Speech sounds in the Fourier domain
1.52 1.54 1.56 1.58-0.1
0
0.1
2.42 2.44 2.46 2.48
-0.02
0
0.02
0 1000 2000 3000 4000-100
-80
-60
-40
0 1000 2000 3000 4000-100
-80
-60
1.37 1.38 1.39 1.4 1.41 1.42-0.1
0
0.1
0 1000 2000 3000-100
-80
-60
-40
1.86 1.87 1.88 1.89 1.9 1.91-0.05
0
0.05
0 1000 2000 3000 4000-100
-80
-60
Vowel: periodic “has”
Fricative: aperiodic “watch”
Glide: transition “watch”
Stop: transient “dime”
time domain frequency domain
time / s freq / Hz
ener
gy
/ dB
dB = 20 log10(amplitude) = 10 log10(power)Voiced spectrum has pitch + formants
Michael Mandel (83060 SAU) Intro & DSP 58 / 67
Complex exponentials are eigen-functions of LTI systems
Why do we use complex numbers in signal processing?
B/c complex exponentials are eigen-functions of LTI systems
x [n] = α0 exp(jωn)
y [n] =∞∑
`=−∞x [n − `]h[`]
=∞∑
`=−∞α0 exp(jω(n − `))h[`]
= α0 exp(jωn)∞∑
`=−∞exp(−jω`)h[`]
= α1 exp(jωn)
Output same frequency as input, but different (complex) gain
Michael Mandel (83060 SAU) Intro & DSP 59 / 67
Frequency-domain summary
Any signal can be approximated as sum of weighted sinusoids
Weights of the sinusoids computed using Fourier analysis
Convolution in the time domain is equivalent tomultiplication in the frequency domain
Complex exponentials are eigen-functions of LTI systems
Michael Mandel (83060 SAU) Intro & DSP 60 / 67
Outline
1 Sound and information
2 Course Structure
3 Time domain signal processing
4 Frequency-domain signal processing
5 Time-frequency signal processing
Michael Mandel (83060 SAU) Intro & DSP 61 / 67
Short-time Fourier TransformWant to localize energy in time and frequency
break sound into short-time pieces
calculate DFT of each one
2.35 2.4 2.45
0
4000
3000
2000
1000
2.5 2.55 2.6-0.1
0
0.1
time / s
freq
/ H
z
k →
short-timewindow
DFT
m = 0 m = 1 m = 2 m = 3
L 2L 3L
Mathematically,
X [k ,m] =N−1∑n=0
x [n]w [n −mL] exp
(−j 2πk(n −mL)
N
)Michael Mandel (83060 SAU) Intro & DSP 62 / 67
The Spectrogram
Plot log-amplitude STFT, 20 log10 |X [k ,m]|, as a gray-scale image
time / s
time / s
freq
/ H
zintensity / dB
2.35 2.4 2.45 2.5 2.55 2.60
1000
2000
3000
4000
freq
/ H
z
0
1000
2000
3000
4000
0
0.1
-50
-40
-30
-20
-10
0
10
0 0.5 1 1.5 2 2.5
Michael Mandel (83060 SAU) Intro & DSP 63 / 67
Time-frequency tradeoffLonger window w [n] gains frequency resolutionat cost of time resolution
1.4 1.6 1.8 2 2.2 2.4 2.6
freq
/ H
z
time / s
level/ dB
0
1000
2000
3000
4000
freq
/ H
z
0
1000
2000
3000
4000
0
0.2W
indo
w =
256
pt
ÒNar
row
ban
dÓ
Win
dow
= 4
8 pt
ÒWid
eban
dÓ
-50
-40
-30
-20
-10
0
10
Michael Mandel (83060 SAU) Intro & DSP 64 / 67
Speech sounds on the Spectrogram
Most popular speech visualization
freq
/ H
z
0
1000
2000
3000
4000
1.4 1.6 1.8 2 2.2 2.4 2.6 time/s
watch thin as a dimeahas
Vow
el:
perio
dic
“has
”
Fric
've:
ap
erio
dic
“wat
ch”
Glid
e: tr
ansi
tion
“wat
ch”
Sto
p: tr
ansi
ent
“dim
e”
Wideband (short window) better than narrowband (long window)to see formants
Michael Mandel (83060 SAU) Intro & DSP 65 / 67
A spectrogram is equivalent to a bank of filters1
The STFT can be interpreted as a bank of filters in two ways
X [k,m] =N−1∑n=0
x [n] w [n −mL] exp
(−j 2πk(n −mL)
N
)︸ ︷︷ ︸
Bandpass filter
X [k,m] =N−1∑n=0
w [n −mL]︸ ︷︷ ︸low-pass filter
x [n] exp
(−j 2πk(n −mL)
N
)︸ ︷︷ ︸
Modulated signal
k (really 2πk/N) is the center frequency of each filter
Only look at the output of the filter every L samples
1See https://www.dsprelated.com/freebooks/sasp/Filter_Bank_
Summation_FBS.htmlMichael Mandel (83060 SAU) Intro & DSP 66 / 67
Summary
Information in soundI lots of it, multiple levels of abstraction
Course overviewI survey of audio processing topicsI readings, presentations, project
DSP reviewI digital signals, time domainI Fourier domain, STFT
Michael Mandel (83060 SAU) Intro & DSP 67 / 67
Top Related