Lecture 1: Introduction & DSPm.mr-pc.org/t/csc83060/2016fa/lecture01.pdf · Introduction & DSP...

32
CSC 83060: Speech & Audio Understanding Lecture 1: Introduction & DSP Michael Mandel [email protected] CUNY Graduate Center, Computer Science Program http://mr-pc.org/t/csc83060 With much content from Dan Ellis’ EE 6820 course 1 Sound and information 2 Course Structure 3 DSP review: Timescale modification Michael Mandel (83060 SAU) Intro & DSP 1 / 32

Transcript of Lecture 1: Introduction & DSPm.mr-pc.org/t/csc83060/2016fa/lecture01.pdf · Introduction & DSP...

Page 1: Lecture 1: Introduction & DSPm.mr-pc.org/t/csc83060/2016fa/lecture01.pdf · Introduction & DSP Michael Mandel mim@sci.brooklyn.cuny.edu CUNY Graduate Center, Computer Science Program

CSC 83060: Speech & Audio Understanding

Lecture 1:Introduction & DSP

Michael Mandel [email protected]

CUNY Graduate Center, Computer Science Programhttp://mr-pc.org/t/csc83060

With much content from Dan Ellis’ EE 6820 course

1 Sound and information

2 Course Structure

3 DSP review: Timescale modification

Michael Mandel (83060 SAU) Intro & DSP 1 / 32

Page 2: Lecture 1: Introduction & DSPm.mr-pc.org/t/csc83060/2016fa/lecture01.pdf · Introduction & DSP Michael Mandel mim@sci.brooklyn.cuny.edu CUNY Graduate Center, Computer Science Program

Outline

1 Sound and information

2 Course Structure

3 DSP review: Timescale modification

Michael Mandel (83060 SAU) Intro & DSP 2 / 32

Page 3: Lecture 1: Introduction & DSPm.mr-pc.org/t/csc83060/2016fa/lecture01.pdf · Introduction & DSP Michael Mandel mim@sci.brooklyn.cuny.edu CUNY Graduate Center, Computer Science Program

Sound and information

Sound is air pressure variation

Mechanical vibration

Pressure waves in air

Motion of sensor

Time-varying voltage

+ + + +

t

v(t)

Transducers convert air pressure ↔ voltage

Michael Mandel (83060 SAU) Intro & DSP 3 / 32

Page 4: Lecture 1: Introduction & DSPm.mr-pc.org/t/csc83060/2016fa/lecture01.pdf · Introduction & DSP Michael Mandel mim@sci.brooklyn.cuny.edu CUNY Graduate Center, Computer Science Program

What use is sound?

Footsteps examples:

0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5-0.5

0

0.5

0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5-0.5

0

0.5

time / s

Hearing confers an evolutionary advantage

useful information, complements vision

. . . at a distance, in the dark, around corners

listeners are highly adapted to ‘natural sounds’ (includingspeech)

Michael Mandel (83060 SAU) Intro & DSP 4 / 32

Page 5: Lecture 1: Introduction & DSPm.mr-pc.org/t/csc83060/2016fa/lecture01.pdf · Introduction & DSP Michael Mandel mim@sci.brooklyn.cuny.edu CUNY Graduate Center, Computer Science Program

The scope of audio processing

Michael Mandel (83060 SAU) Intro & DSP 5 / 32

Page 6: Lecture 1: Introduction & DSPm.mr-pc.org/t/csc83060/2016fa/lecture01.pdf · Introduction & DSP Michael Mandel mim@sci.brooklyn.cuny.edu CUNY Graduate Center, Computer Science Program

The acoustic communication chain

message signal channel receiver decoder

!

synthesis audio processing recognition

Sound is an information bearer

Received sound reflects source(s)plus effect of environment (channel)

Michael Mandel (83060 SAU) Intro & DSP 6 / 32

Page 7: Lecture 1: Introduction & DSPm.mr-pc.org/t/csc83060/2016fa/lecture01.pdf · Introduction & DSP Michael Mandel mim@sci.brooklyn.cuny.edu CUNY Graduate Center, Computer Science Program

Levels of abstraction

Much processing concerns shifting between levels of abstraction

sound p(t)

representation (e.g. t-f energy)

‘information’

abstract

concrete

An

alys

is

Syn

thesis

Different representations serve different tasks

separating aspects, making things explicit, . . .

Michael Mandel (83060 SAU) Intro & DSP 7 / 32

Page 8: Lecture 1: Introduction & DSPm.mr-pc.org/t/csc83060/2016fa/lecture01.pdf · Introduction & DSP Michael Mandel mim@sci.brooklyn.cuny.edu CUNY Graduate Center, Computer Science Program

Outline

1 Sound and information

2 Course Structure

3 DSP review: Timescale modification

Michael Mandel (83060 SAU) Intro & DSP 8 / 32

Page 9: Lecture 1: Introduction & DSPm.mr-pc.org/t/csc83060/2016fa/lecture01.pdf · Introduction & DSP Michael Mandel mim@sci.brooklyn.cuny.edu CUNY Graduate Center, Computer Science Program

Course structure

GoalsI survey topics in sound analysis & processingI develop and intuition for sound signalsI learn some specific technologies

Course structureI weekly reading (20% + 10%)I midterm project proposal (20%)I final project (30% + 10%)

Recommended, but not required text

Speech and Audio Signal ProcessingGold, Morgan, & EllisWiley, 2011ISBN: 978-0-470-19536-9

Michael Mandel (83060 SAU) Intro & DSP 9 / 32

Page 10: Lecture 1: Introduction & DSPm.mr-pc.org/t/csc83060/2016fa/lecture01.pdf · Introduction & DSP Michael Mandel mim@sci.brooklyn.cuny.edu CUNY Graduate Center, Computer Science Program

Web-based

Course website:

http://mr-pc.org/t/csc83060

for lecture notes, articles, examples, . . .

+ Blackboard for homework, etc.

Michael Mandel (83060 SAU) Intro & DSP 10 / 32

Page 11: Lecture 1: Introduction & DSPm.mr-pc.org/t/csc83060/2016fa/lecture01.pdf · Introduction & DSP Michael Mandel mim@sci.brooklyn.cuny.edu CUNY Graduate Center, Computer Science Program

Weekly assignments

Review papersI book chapters & journal papers reviewing a topic areaI discuss in classI written summaries & future directions due on Blackboard

Contribution papersI journal & conference papers presenting a novel contributionI student presentations in class

Project statusI written weekly status updates due on Blackboard

Michael Mandel (83060 SAU) Intro & DSP 11 / 32

Page 12: Lecture 1: Introduction & DSPm.mr-pc.org/t/csc83060/2016fa/lecture01.pdf · Introduction & DSP Michael Mandel mim@sci.brooklyn.cuny.edu CUNY Graduate Center, Computer Science Program

Final project

Most significant part of course (60%) of grade

Oral proposals mid-semester;Presentations in final class+ paper

ScopeI practical (Python or Matlab recommended)I identify a problem; try some solutionsI evaluation

TopicI few restrictions within world of audioI investigate other resourcesI develop in discussion with me

Copying

Michael Mandel (83060 SAU) Intro & DSP 12 / 32

Page 13: Lecture 1: Introduction & DSPm.mr-pc.org/t/csc83060/2016fa/lecture01.pdf · Introduction & DSP Michael Mandel mim@sci.brooklyn.cuny.edu CUNY Graduate Center, Computer Science Program

Examples of past projects

Automatic prosody classification Model-based note transcription

Michael Mandel (83060 SAU) Intro & DSP 13 / 32

Page 14: Lecture 1: Introduction & DSPm.mr-pc.org/t/csc83060/2016fa/lecture01.pdf · Introduction & DSP Michael Mandel mim@sci.brooklyn.cuny.edu CUNY Graduate Center, Computer Science Program

Outline

1 Sound and information

2 Course Structure

3 DSP review: Timescale modification

Michael Mandel (83060 SAU) Intro & DSP 14 / 32

Page 15: Lecture 1: Introduction & DSPm.mr-pc.org/t/csc83060/2016fa/lecture01.pdf · Introduction & DSP Michael Mandel mim@sci.brooklyn.cuny.edu CUNY Graduate Center, Computer Science Program

DSP review: digital signals

time

xd[n] = Q( xc(nT ) )

Discrete-time sampling limits bandwidth

Discrete-level quantization

limits dynamic range

T

ε

sampling interval T

sampling frequency ΩT = 2πT

quantizer Q(y) = ε⌊ yε

⌋Michael Mandel (83060 SAU) Intro & DSP 15 / 32

Page 16: Lecture 1: Introduction & DSPm.mr-pc.org/t/csc83060/2016fa/lecture01.pdf · Introduction & DSP Michael Mandel mim@sci.brooklyn.cuny.edu CUNY Graduate Center, Computer Science Program

The speech signal: time domain

Speech is a sequence of different sound types

-0.2

-0.1

0

0.1

0.2

1.38 1.4 1.42.1

0

.1

1.52 1.54 1.56 1.58-0.1

0

0.1

1.86 1.88 1.921.9-0.05

0

0.05

2.42 2.44 2.46 2.4

-0.02

0

0.02

1.4 1.6 1.8 2 2.2 2.4 2.6 time/s

watch thin as a dimeahas

Vowel: periodic “has”

Fricative: aperiodic “watch”

Glide: smooth transition “watch”

Stop burst: transient “dime”

Michael Mandel (83060 SAU) Intro & DSP 16 / 32

Page 17: Lecture 1: Introduction & DSPm.mr-pc.org/t/csc83060/2016fa/lecture01.pdf · Introduction & DSP Michael Mandel mim@sci.brooklyn.cuny.edu CUNY Graduate Center, Computer Science Program

Timescale modification (TSM)Can we modify a sound to make it ‘slower’?i.e. speech pronounced more slowly

e.g. to help comprehension, analysis

or more quickly for ‘speed listening’?

Why not just slow it down?

xs(t) = xo( tr ), r = slowdown factor (> 1→ slower)

equivalent to playback at a different sampling rate

2.35 2.4 2.45 2.5 2.55 2.6-0.1

-0.05

0

0.05

0.1

2.35 2.4 2.45 2.5 2.55 2.6-0.1

-0.05

0

0.05

0.1

time/s

Original

2x slower

r = 2

Michael Mandel (83060 SAU) Intro & DSP 17 / 32

Page 18: Lecture 1: Introduction & DSPm.mr-pc.org/t/csc83060/2016fa/lecture01.pdf · Introduction & DSP Michael Mandel mim@sci.brooklyn.cuny.edu CUNY Graduate Center, Computer Science Program

Time-domain TSM

Problem: want to preserve local time structurebut alter global time structure

Repeat segmentsI but: artifacts from abrupt edges

Cross-fade & overlap

ym[mL + n] = ym−1[mL + n] + w [n] · x[⌊m

r

⌋L + n

]

2.35 2.4 2.45 2.5 2.55 2.6-0.1

0

0.1

4.7 4.75 4.8 4.85 4.9 4.95-0.1

0

0.1

1

1

1 1 2 2 3 3 4 4 5 5 6

2

2

3

3

4

4

5 6

6

5

time / s

time / s

Michael Mandel (83060 SAU) Intro & DSP 18 / 32

Page 19: Lecture 1: Introduction & DSPm.mr-pc.org/t/csc83060/2016fa/lecture01.pdf · Introduction & DSP Michael Mandel mim@sci.brooklyn.cuny.edu CUNY Graduate Center, Computer Science Program

Synchronous overlap-add (SOLA)

Idea: allow some leeway in placing window to optimize alignmentof waveforms

1

2

Km maximizes alignment of 1 and 2

Hence,

ym[mL + n] = ym−1[mL + n] + w [n] · x[⌊m

r

⌋L + n + Km

]Where Km chosen by cross-correlation:

Km = argmax0≤K≤Ku

∑Novn=0 y

m−1[mL + n] · x[⌊

mr

⌋L + n + K

]√∑(ym−1[mL + n])2

∑(x[⌊

mr

⌋L + n + K

])2

Michael Mandel (83060 SAU) Intro & DSP 19 / 32

Page 20: Lecture 1: Introduction & DSPm.mr-pc.org/t/csc83060/2016fa/lecture01.pdf · Introduction & DSP Michael Mandel mim@sci.brooklyn.cuny.edu CUNY Graduate Center, Computer Science Program

The Fourier domain

Fourier Series (periodic continuous x)

Ω0 =2π

T

x(t) =∑k

ckejkΩ0t

ck =1

2πT

∫ T/2

−T/2x(t)e−jkΩ0tdt k1 2 3 5 6 74

|ck|1.0

1.5 1 0.5 0 0.5 1 1.5 1

0.5

0

0.5

t

x(t)

Fourier Transform (aperiodic continuous x)

x(t) =1

∫X (jΩ)e jΩtdΩ

X (jΩ) =

∫x(t)e−jΩtdt

0 0.002 0.004 0.006 0.008time / sec

level / dB

-0.01

0

0.01

0.02

x(t)

0 2000 4000 6000 8000freq / Hz

-80

-60

-40

-20 |X(jΩ)|

Michael Mandel (83060 SAU) Intro & DSP 20 / 32

Page 21: Lecture 1: Introduction & DSPm.mr-pc.org/t/csc83060/2016fa/lecture01.pdf · Introduction & DSP Michael Mandel mim@sci.brooklyn.cuny.edu CUNY Graduate Center, Computer Science Program

Discrete-time Fourier

DT Fourier Transform (aperiodic sampled x)

x [n] =1

∫ π

−πX (e jω)e jωndω

X (e jω) =∑

x [n]e−jωn

n-1 1 2 3 4 5 6 7

0 π

|X(ejω)|

ω2π 3π 4π 5π

1

2

3

x [n]

Discrete Fourier Transform (N-point x)

x [n] =∑k

X [k]e j 2πknN

X [k] =∑n

x [n]e−j 2πknN

k

|X(ejω)||X[k]|

k=1...

n1 2 3 4 5 6 7

x [n]

Michael Mandel (83060 SAU) Intro & DSP 21 / 32

Page 22: Lecture 1: Introduction & DSPm.mr-pc.org/t/csc83060/2016fa/lecture01.pdf · Introduction & DSP Michael Mandel mim@sci.brooklyn.cuny.edu CUNY Graduate Center, Computer Science Program

Sampling and aliasingDiscrete-time signals equal the continuous time signal at discretesampling instants:

xd [n] = xc(nT )

Sampling cannot represent rapid fluctuations

0 1 2 3 4 5 6 7 8 9 10 1

0.5

0

0.5

1

sin

((ΩM +

T

)Tn

)= sin(ΩMTn) ∀n ∈ Z

Nyquist limit (ΩT/2) from periodic spectrum:

ΩΩM−ΩT ΩT −ΩM ΩT - ΩM−ΩT + ΩM

Gp(jΩ)Ga(jΩ) “alias” of “baseband”

signal

Michael Mandel (83060 SAU) Intro & DSP 22 / 32

Page 23: Lecture 1: Introduction & DSPm.mr-pc.org/t/csc83060/2016fa/lecture01.pdf · Introduction & DSP Michael Mandel mim@sci.brooklyn.cuny.edu CUNY Graduate Center, Computer Science Program

Speech sounds in the Fourier domain

1.52 1.54 1.56 1.58-0.1

0

0.1

2.42 2.44 2.46 2.48

-0.02

0

0.02

0 1000 2000 3000 4000-100

-80

-60

-40

0 1000 2000 3000 4000-100

-80

-60

1.37 1.38 1.39 1.4 1.41 1.42-0.1

0

0.1

0 1000 2000 3000-100

-80

-60

-40

1.86 1.87 1.88 1.89 1.9 1.91-0.05

0

0.05

0 1000 2000 3000 4000-100

-80

-60

Vowel: periodic “has”

Fricative: aperiodic “watch”

Glide: transition “watch”

Stop: transient “dime”

time domain frequency domain

time / s freq / Hz

ener

gy

/ dB

dB = 20 log10(amplitude) = 10 log10(power)Voiced spectrum has pitch + formants

Michael Mandel (83060 SAU) Intro & DSP 23 / 32

Page 24: Lecture 1: Introduction & DSPm.mr-pc.org/t/csc83060/2016fa/lecture01.pdf · Introduction & DSP Michael Mandel mim@sci.brooklyn.cuny.edu CUNY Graduate Center, Computer Science Program

Short-time Fourier TransformWant to localize energy in time and frequency

break sound into short-time pieces

calculate DFT of each one

2.35 2.4 2.45

0

4000

3000

2000

1000

2.5 2.55 2.6-0.1

0

0.1

time / s

freq

/ H

z

k →

short-timewindow

DFT

m = 0 m = 1 m = 2 m = 3

L 2L 3L

Mathematically,

X [k ,m] =N−1∑n=0

x [n]w [n −mL] exp

(−j 2πk(n −mL)

N

)Michael Mandel (83060 SAU) Intro & DSP 24 / 32

Page 25: Lecture 1: Introduction & DSPm.mr-pc.org/t/csc83060/2016fa/lecture01.pdf · Introduction & DSP Michael Mandel mim@sci.brooklyn.cuny.edu CUNY Graduate Center, Computer Science Program

The Spectrogram

Plot STFT X [k ,m] as a gray-scale image

time / s

time / s

freq

/ H

zintensity / dB

2.35 2.4 2.45 2.5 2.55 2.60

1000

2000

3000

4000

freq

/ H

z

0

1000

2000

3000

4000

0

0.1

-50

-40

-30

-20

-10

0

10

0 0.5 1 1.5 2 2.5

Michael Mandel (83060 SAU) Intro & DSP 25 / 32

Page 26: Lecture 1: Introduction & DSPm.mr-pc.org/t/csc83060/2016fa/lecture01.pdf · Introduction & DSP Michael Mandel mim@sci.brooklyn.cuny.edu CUNY Graduate Center, Computer Science Program

Time-frequency tradeoffLonger window w [n] gains frequency resolution at cost of timeresolution

1.4 1.6 1.8 2 2.2 2.4 2.6

freq

/ H

z

time / s

level/ dB

0

1000

2000

3000

4000

freq

/ H

z

0

1000

2000

3000

4000

0

0.2W

indo

w =

256

pt

ÒNar

row

ban

Win

dow

= 4

8 pt

ÒWid

eban

-50

-40

-30

-20

-10

0

10

Michael Mandel (83060 SAU) Intro & DSP 26 / 32

Page 27: Lecture 1: Introduction & DSPm.mr-pc.org/t/csc83060/2016fa/lecture01.pdf · Introduction & DSP Michael Mandel mim@sci.brooklyn.cuny.edu CUNY Graduate Center, Computer Science Program

Speech sounds on the Spectrogram

Most popular speech visualization

freq

/ H

z

0

1000

2000

3000

4000

1.4 1.6 1.8 2 2.2 2.4 2.6 time/s

watch thin as a dimeahas

Vow

el:

perio

dic

“has

Fric

've:

ap

erio

dic

“wat

ch”

Glid

e: tr

ansi

tion

“wat

ch”

Sto

p: tr

ansi

ent

“dim

e”

Wideband (short window) better than narrowband (long window)to see formants

Michael Mandel (83060 SAU) Intro & DSP 27 / 32

Page 28: Lecture 1: Introduction & DSPm.mr-pc.org/t/csc83060/2016fa/lecture01.pdf · Introduction & DSP Michael Mandel mim@sci.brooklyn.cuny.edu CUNY Graduate Center, Computer Science Program

TSM with the Spectrogram

Just stretch out the spectrogram?

Time

Fre

quen

cy

0 0.2 0.4 0.6 0.8 1 1.2 1.40

1000

2000

3000

4000

Time

Fre

quen

cy

0 0.2 0.4 0.6 0.8 1 1.2 1.40

1000

2000

3000

4000

how to resynthesize?spectrogram is only |Y [k,m]|

Michael Mandel (83060 SAU) Intro & DSP 28 / 32

Page 29: Lecture 1: Introduction & DSPm.mr-pc.org/t/csc83060/2016fa/lecture01.pdf · Introduction & DSP Michael Mandel mim@sci.brooklyn.cuny.edu CUNY Graduate Center, Computer Science Program

The Phase VocoderTimescale modification in the STFT domain

Magnitude from ‘stretched’ spectrogram:

|Y [k,m]| =∣∣∣X [k , m

r

]∣∣∣I e.g. by linear interpolation

But preserve phase increment between slices:

θY [k,m] = θX

[k,

m

r

]I e.g. by discrete differentiator

Does right thing for single sinusoidI keeps overlapped parts of sinusoid aligned

time

θ = ∆θ∆T

.

∆θ' = θ·2∆T .

∆T

Michael Mandel (83060 SAU) Intro & DSP 29 / 32

Page 30: Lecture 1: Introduction & DSPm.mr-pc.org/t/csc83060/2016fa/lecture01.pdf · Introduction & DSP Michael Mandel mim@sci.brooklyn.cuny.edu CUNY Graduate Center, Computer Science Program

General issues in TSM

Time windowI stretching a narrowband spectrogram

Malleability of different soundsI vowels stretch well, stops lose nature

Not a well-formed problem?I want to alter time without frequency

. . . but time and frequency are not separate!I ‘satisfying’ result is a subjective judgment⇒ solution depends on auditory perception. . .

Michael Mandel (83060 SAU) Intro & DSP 30 / 32

Page 31: Lecture 1: Introduction & DSPm.mr-pc.org/t/csc83060/2016fa/lecture01.pdf · Introduction & DSP Michael Mandel mim@sci.brooklyn.cuny.edu CUNY Graduate Center, Computer Science Program

Summary

Information in soundI lots of it, multiple levels of abstraction

Course overviewI survey of audio processing topicsI practicals, readings, project

DSP reviewI digital signals, time domainI Fourier domain, STFT

Timescale modificationI properties of the speech signalI time-domainI phase vocoder

Michael Mandel (83060 SAU) Intro & DSP 31 / 32

Page 32: Lecture 1: Introduction & DSPm.mr-pc.org/t/csc83060/2016fa/lecture01.pdf · Introduction & DSP Michael Mandel mim@sci.brooklyn.cuny.edu CUNY Graduate Center, Computer Science Program

References

J. L. Flanagan and R. M. Golden. Phase vocoder. Bell System Technical Journal,pages 1493–1509, 1966.

M. Dolson. The Phase Vocoder: A Tutorial. Computer Music Journal, 10(4):14–27,1986.

M. Puckette. Phase-locked vocoder. In IEEE Workshop on Applications of SignalProcessing to Audio and Acoustics, pages 222–225, 1995.

A. T. Cemgil and S. J. Godsill. Probabilistic Phase Vocoder and its application toInterpolation of Missing Values in Audio Signals. In 13th European SignalProcessing Conference, Antalya, Turkey, 2005.

Michael Mandel (83060 SAU) Intro & DSP 32 / 32