
EE E6820: Speech & Audio Processing & Recognition

Lecture 1: Introduction & DSP

Dan Ellis <[email protected]>Mike Mandel <[email protected]>

Columbia University Dept. of Electrical Engineering
http://www.ee.columbia.edu/~dpwe/e6820

January 22, 2009

1 Sound and information

2 Course Structure

3 DSP review: Timescale modification


Outline

1 Sound and information

2 Course Structure

3 DSP review: Timescale modification


Sound and information

Sound is air pressure variation

Mechanical vibration → pressure waves in air → motion of sensor → time-varying voltage v(t)

Transducers convert air pressure ↔ voltage


What use is sound?

Footsteps examples:

[Figure: two footsteps waveforms, amplitude vs. time / s over 0–5 s.]

Hearing confers an evolutionary advantage

useful information, complements vision

. . . at a distance, in the dark, around corners

listeners are highly adapted to ‘natural sounds’ (including speech)


The scope of audio processing


The acoustic communication chain

message → signal → channel → receiver → decoder
(synthesis | audio processing | recognition)

Sound is an information bearer

Received sound reflects source(s) plus effect of environment (channel)


Levels of abstraction

Much processing concerns shifting between levels of abstraction

[Figure: levels from concrete to abstract: sound p(t) → representation (e.g. t-f energy) → ‘information’; analysis moves toward the abstract, synthesis back toward the concrete.]

Different representations serve different tasks

separating aspects, making things explicit, . . .


Outline

1 Sound and information

2 Course Structure

3 DSP review: Timescale modification


Course structure

Goals
- survey topics in sound analysis & processing
- develop an intuition for sound signals
- learn some specific technologies

Course structure
- weekly assignments (25%)
- midterm event (25%)
- final project (50%)

Text: Speech and Audio Signal Processing, Ben Gold & Nelson Morgan, Wiley, 2000, ISBN 0-471-35154-7


Web-based

Course website: http://www.ee.columbia.edu/~dpwe/e6820/
for lecture notes, problem sets, examples, . . .

+ student web pages for homework, etc.


Course outline

Fundamentals
- L1: DSP
- L2: Acoustics
- L3: Pattern recognition
- L4: Auditory perception

Audio processing
- L5: Signal models
- L6: Music analysis/synthesis
- L7: Audio compression
- L8: Spatial sound & rendering

Applications
- L9: Speech recognition
- L10: Music retrieval
- L11: Signal separation
- L12: Multimedia indexing


Weekly assignments

Research papers
- journal & conference publications
- summarize & discuss in class
- written summaries on web page + Courseworks discussion

Practical experiments
- Matlab-based (+ Signal Processing Toolbox)
- direct experience of sound processing
- skills for project

Book sections


Final project

Most significant part of course (50% of grade)

Oral proposals mid-semester; presentations in final class + website

Scope
- practical (Matlab recommended)
- identify a problem; try some solutions
- evaluation

Topic
- few restrictions within world of audio
- investigate other resources
- develop in discussion with me

Citation & plagiarism


Examples of past projects

- Automatic prosody classification
- Model-based note transcription


Outline

1 Sound and information

2 Course Structure

3 DSP review: Timescale modification


DSP review: digital signals

x_d[n] = Q(x_c(nT))

Discrete-time sampling limits bandwidth; discrete-level quantization limits dynamic range.

sampling interval T
sampling frequency Ω_T = 2π/T
quantizer Q(y) = ε ⌊y/ε⌋
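To make the two operations concrete, here is a minimal Matlab sketch (the tone frequency, sampling rate, and 8-bit step size are illustrative choices, not values from the slides):

```matlab
% Sample a 440 Hz tone at fs = 8 kHz (T = 1/fs), then quantize with step eps_q.
fs    = 8000;                          % sampling frequency; Omega_T = 2*pi*fs
n     = 0:fs-1;                        % one second of sample indices
xc    = 0.9*sin(2*pi*440*n/fs);        % "continuous" signal evaluated at t = n*T
eps_q = 2/2^8;                         % quantizer step for 8 bits over [-1, 1)
xd    = eps_q*floor(xc/eps_q);         % Q(y) = eps * floor(y/eps)
fprintf('max quantization error = %g (step %g)\n', max(abs(xd - xc)), eps_q);
```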

The speech signal: time domain

Speech is a sequence of different sound types

[Figure: waveform of the phrase “has a watch thin as a dime” (1.4–2.6 s, time / s), with zoomed-in panels of four different sound types:]

Vowel: periodic “has”

Fricative: aperiodic “watch”

Glide: smooth transition “watch”

Stop burst: transient “dime”


Timescale modification (TSM)

Can we modify a sound to make it ‘slower’, i.e. speech pronounced more slowly?

e.g. to help comprehension, analysis

or more quickly for ‘speed listening’?

Why not just slow it down?

x_s(t) = x_o(t/r),   r = slowdown factor (> 1 → slower)

equivalent to playback at a different sampling rate

[Figure: original waveform (2.35–2.6 s) and the same samples played back 2× slower (r = 2).]

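The point can be heard directly: replaying the same samples at a lower rate stretches time by r but also divides every frequency by r, so the pitch drops. A hedged sketch, assuming a signal x and its sampling rate fs are already in the workspace:

```matlab
% Naive slowdown x_s(t) = x_o(t/r): just play the samples back at fs/r.
r = 2;                           % slowdown factor (> 1 -> slower)
sound(x, fs);                    % original
pause(length(x)/fs + 0.5);       % wait for it to finish
sound(x, fs/r);                  % twice as long, but an octave lower in pitch
```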

Time-domain TSM

Problem: want to preserve local time structure but alter global time structure

Repeat segments
- but: artifacts from abrupt edges

Cross-fade & overlap

y_m[mL + n] = y_{m−1}[mL + n] + w[n] · x[⌊m/r⌋ L + n]

[Figure: input waveform with numbered, overlapping windows 1–6 and the overlap-added output in which windows are repeated; time / s.]

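A sketch of the cross-fade-and-overlap formula in Matlab (the function name ola_tsm, the Hann cross-fade window, and the 50% overlap N = 2L are assumptions of this sketch, not values given on the slide; the 0-based indices of the formula become 1-based here):

```matlab
function y = ola_tsm(x, r, L)
% Time-domain TSM by windowed overlap-add:
%   y[mL + n] <- y[mL + n] + w[n] * x[floor(m/r)*L + n]
x = x(:);
N = 2*L;                                  % analysis window length (50% overlap)
w = 0.5 - 0.5*cos(2*pi*(0:N-1)'/N);       % raised-cosine (Hann) cross-fade window
M = floor(r*(length(x) - N)/L) + 1;       % number of output frames
y = zeros((M-1)*L + N, 1);
for m = 0:M-1
    src = floor(m/r)*L;                   % input frame start (repeats frames when r > 1)
    y(m*L + (1:N)) = y(m*L + (1:N)) + w .* x(src + (1:N));
end
end
```

With r = 2 the output is roughly twice as long, but artifacts remain where mismatched waveform segments are cross-faded; the synchronous overlap-add variant described next addresses this.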

Synchronous overlap-add (SOLA)

Idea: allow some leeway in placing the window to optimize alignment of waveforms

[Figure: two overlapping waveform segments, 1 and 2; K_m maximizes the alignment of 1 and 2.]

Hence,

y_m[mL + n] = y_{m−1}[mL + n] + w[n] · x[⌊m/r⌋ L + n + K_m]

where K_m is chosen by cross-correlation:

K_m = argmax_{0 ≤ K ≤ K_u}  ( Σ_{n=0}^{N_ov} y_{m−1}[mL + n] · x[⌊m/r⌋ L + n + K] ) / √( Σ (y_{m−1}[mL + n])² · Σ (x[⌊m/r⌋ L + n + K])² )

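The same loop with the extra alignment search, again as a hedged Matlab sketch (the name sola_tsm, the overlap length N_ov = L, and the small constant guarding the denominator are implementation choices of this sketch, not from the slide):

```matlab
function y = sola_tsm(x, r, L, Ku)
% Synchronous overlap-add: each new window may slide by K (0 <= K <= Ku),
% chosen to maximize normalized cross-correlation with the existing output.
x   = x(:);
N   = 2*L;                                 % window length (50% overlap)
Nov = L;                                   % overlap region used for alignment
w   = 0.5 - 0.5*cos(2*pi*(0:N-1)'/N);      % Hann window
M   = floor(r*(length(x) - N - Ku)/L) + 1; % number of output frames
y   = zeros((M-1)*L + N, 1);
for m = 0:M-1
    src  = floor(m/r)*L;                   % nominal input frame start
    yseg = y(m*L + (1:Nov));               % existing output in the overlap region
    Km = 0; best = -inf;
    if any(yseg)                           % nothing to align to on the first frame
        for K = 0:Ku
            xseg = x(src + K + (1:Nov));
            c = (yseg'*xseg) / sqrt((yseg'*yseg)*(xseg'*xseg) + 1e-12);
            if c > best, best = c; Km = K; end
        end
    end
    y(m*L + (1:N)) = y(m*L + (1:N)) + w .* x(src + Km + (1:N));
end
end
```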

The Fourier domain

Fourier Series (periodic continuous x)

Ω_0 = 2π/T

x(t) = Σ_k c_k e^{jkΩ_0t}

c_k = (1/T) ∫_{−T/2}^{T/2} x(t) e^{−jkΩ_0t} dt

[Figure: a periodic waveform x(t) and its line spectrum |c_k| at harmonic numbers k = 1, 2, 3, . . .]

Fourier Transform (aperiodic continuous x)

x(t) = (1/2π) ∫ X(jΩ) e^{jΩt} dΩ

X(jΩ) = ∫ x(t) e^{−jΩt} dt

[Figure: an aperiodic waveform x(t) (0–0.008 s) and its continuous spectrum |X(jΩ)| in level / dB over 0–8000 Hz.]

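A quick numerical check of the analysis formula (a sketch; the square-wave test signal and the 1000 samples per period are arbitrary choices): approximating the integral by a sum over one period gives the familiar 1/(πk) odd harmonics of a square wave.

```matlab
% Approximate c_k = (1/T) * integral over one period of x(t) e^{-j*k*Omega0*t} dt
% by a Riemann sum with dt = T/N, which reduces to a scaled DFT of one period.
T  = 0.01;                          % period (s), Omega0 = 2*pi/T
N  = 1000;                          % samples per period
t  = (0:N-1)'*T/N;                  % one period, t in [0, T)
x  = double(t < T/2) - 0.5;         % +/- 0.5 square wave
k  = 1:7;                           % harmonics to compute
E  = exp(-1j*2*pi*t*k/T);           % N x 7 matrix of e^{-j*k*Omega0*t}
ck = (x.'*E).'/N;                   % (1/T) * sum(x .* E) * dt  =  sum/N
disp(abs(ck.'));                    % odd k: ~1/(pi*k); even k: ~0
```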

Discrete-time Fourier

DT Fourier Transform (aperiodic sampled x)

x[n] = (1/2π) ∫_{−π}^{π} X(e^{jω}) e^{jωn} dω

X(e^{jω}) = Σ_n x[n] e^{−jωn}

[Figure: a sampled sequence x[n] and its spectrum |X(e^{jω})|, periodic in ω with period 2π.]

Discrete Fourier Transform (N-point x)

x[n] = (1/N) Σ_k X[k] e^{j2πkn/N}

X[k] = Σ_n x[n] e^{−j2πkn/N}

[Figure: an N-point sequence x[n] and its DFT |X[k]|, which samples |X(e^{jω})| at k = 1, . . .]

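The DFT analysis sum can be transcribed directly and checked against Matlab's built-in fft (a sketch; the random 16-point test vector is arbitrary):

```matlab
% X[k] = sum_n x[n] e^{-j*2*pi*k*n/N}, computed directly and compared to fft().
N = 16;
x = randn(N, 1);                     % arbitrary real test signal
n = 0:N-1;
X = zeros(N, 1);
for k = 0:N-1
    X(k+1) = sum(x.' .* exp(-1j*2*pi*k*n/N));
end
fprintf('max |direct DFT - fft| = %g\n', max(abs(X - fft(x))));
```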

Sampling and aliasing

Discrete-time signals equal the continuous-time signal at discrete sampling instants:

x_d[n] = x_c(nT)

Sampling cannot represent rapid fluctuations

[Figure: the samples of a sinusoid at Ω_M are also fit exactly by a sinusoid at Ω_M + 2π/T:]

sin((Ω_M + 2π/T) Tn) = sin(Ω_M Tn)   ∀ n ∈ ℤ

Nyquist limit (Ω_T/2) from the periodic spectrum:

[Figure: the baseband spectrum G_p(jΩ) at ±Ω_M repeats around multiples of Ω_T, placing an “alias” G_a(jΩ) of the “baseband” signal at ±(Ω_T − Ω_M).]

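That identity is easy to verify numerically (a sketch; the 1 kHz tone and fs = 8 kHz are arbitrary): a tone at f and one at f + fs produce exactly the same samples.

```matlab
% Aliasing: after sampling at fs, frequencies f and f + fs are indistinguishable.
fs = 8000;  T = 1/fs;
n  = 0:31;                            % a few sample indices
f  = 1000;                            % baseband tone (Hz), Omega_M = 2*pi*f
x1 = sin(2*pi*f*n*T);                 % sin(Omega_M * T * n)
x2 = sin(2*pi*(f + fs)*n*T);          % sin((Omega_M + 2*pi/T) * T * n)
fprintf('max difference = %g\n', max(abs(x1 - x2)));   % ~1e-12 (round-off only)
```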

Speech sounds in the Fourier domain

[Figure: four speech sound types shown in the time domain (time / s) and the frequency domain (energy / dB vs. freq / Hz):]

Vowel: periodic “has”

Fricative: aperiodic “watch”

Glide: transition “watch”

Stop: transient “dime”

dB = 20 log10(amplitude) = 10 log10(power)

Voiced spectrum has pitch + formants


Short-time Fourier Transform

Want to localize energy in time and frequency

break sound into short-time pieces

calculate DFT of each one

[Figure: a waveform (2.35–2.6 s) is cut into short-time windows hopped by L samples (m = 0, 1, 2, 3 at 0, L, 2L, 3L); the DFT of each windowed segment gives one column (frequency index k) of a time-frequency display (freq / Hz vs. time / s).]

Mathematically,

X[k, m] = Σ_{n=0}^{N−1} x[n] w[n − mL] exp(−j 2πk(n − mL) / N)
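A minimal STFT sketch following that equation (the function name stft_sketch and the Hann window are assumptions of this sketch; the Signal Processing Toolbox provides spectrogram/specgram routines for the same job):

```matlab
function X = stft_sketch(x, N, L)
% Short-time Fourier transform:
%   X(k+1, m+1) = sum_n x[n] w[n - mL] e^{-j*2*pi*k*(n - mL)/N}
% i.e. the N-point FFT of each windowed segment, hopped by L samples.
x = x(:);
w = 0.5 - 0.5*cos(2*pi*(0:N-1)'/N);        % Hann window (an assumption)
M = floor((length(x) - N)/L) + 1;          % number of frames
X = zeros(N, M);
for m = 0:M-1
    X(:, m+1) = fft(w .* x(m*L + (1:N)));
end
end
```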

The Spectrogram

Plot STFT X[k, m] as a gray-scale image

[Figure: a waveform and its spectrogram: freq / Hz (0–4000) vs. time / s, with intensity / dB from −50 to +10.]

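The gray-scale display is then just the dB magnitude of the STFT (a sketch assuming a signal x and its sampling rate fs are already in the workspace; the window length 256 and hop 64 are arbitrary):

```matlab
% Spectrogram: dB magnitude of the STFT shown as an image (louder = darker).
N = 256;  L = 64;                          % window length and hop (assumptions)
x = x(:);
w = 0.5 - 0.5*cos(2*pi*(0:N-1)'/N);        % Hann window
M = floor((length(x) - N)/L) + 1;
S = zeros(N/2 + 1, M);
for m = 0:M-1
    F = fft(w .* x(m*L + (1:N)));
    S(:, m+1) = 20*log10(abs(F(1:N/2+1)) + eps);   % keep non-negative frequencies
end
imagesc((0:M-1)*L/fs, (0:N/2)*fs/N, S);    % x-axis: time / s, y-axis: freq / Hz
axis xy; colormap(flipud(gray));
xlabel('time / s'); ylabel('freq / Hz');
```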

Time-frequency tradeoff

Longer window w[n] gains frequency resolution at the cost of time resolution

[Figure: “narrowband” spectrogram (window = 256 pt) vs. “wideband” spectrogram (window = 48 pt) of the same utterance; freq / Hz (0–4000) vs. time / s (1.4–2.6), level / dB.]


Speech sounds on the Spectrogram

Most popular speech visualization

[Figure: spectrogram (0–4000 Hz, 1.4–2.6 s) of “has a watch thin as a dime”, annotated: Vowel: periodic “has”; Fricative: aperiodic “watch”; Glide: transition “watch”; Stop: transient “dime”.]

Wideband (short window) better than narrowband (long window) to see formants


TSM with the Spectrogram

Just stretch out the spectrogram?

[Figure: a spectrogram (frequency 0–4000 Hz vs. time) and the same spectrogram stretched along the time axis.]

how to resynthesize? the spectrogram gives only |Y[k, m]| (no phase)


The Phase Vocoder

Timescale modification in the STFT domain

Magnitude from ‘stretched’ spectrogram:

|Y[k, m]| = |X[k, m/r]|
- e.g. by linear interpolation

But preserve phase increment between slices:

θ̇_Y[k, m] = θ̇_X[k, m/r]
- e.g. by discrete differentiator

Does right thing for single sinusoid
- keeps overlapped parts of sinusoid aligned

[Figure: for a single sinusoid, the phase advances by ∆θ over hop ∆T, i.e. θ̇ = ∆θ/∆T; when the hop is stretched to 2∆T, the phase must advance by ∆θ′ = θ̇ · 2∆T to keep the overlapped copies aligned.]

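A compact phase-vocoder sketch along these lines (the name pvoc_sketch, the Hann window with 75% overlap L = N/4, the linear magnitude interpolation, and the phase-unwrapping details are all assumptions of this sketch rather than prescriptions from the slide):

```matlab
function y = pvoc_sketch(x, r, N, L)
% Phase-vocoder timescale modification:
%   |Y[k,m]| = |X[k, m/r]|   (linear interpolation between analysis frames)
%   synthesis phase advances by the measured per-hop phase increment of X.
x = x(:);
w = 0.5 - 0.5*cos(2*pi*(0:N-1)'/N);            % Hann analysis/synthesis window
Ma = floor((length(x) - N)/L) + 1;             % number of analysis frames
X  = zeros(N, Ma);
for m = 0:Ma-1
    X(:, m+1) = fft(w .* x(m*L + (1:N)));
end
wk  = 2*pi*(0:N-1)'*L/N;                       % expected phase advance per hop, bin k
pos = 0:1/r:(Ma - 2);                          % fractional analysis positions m/r
y   = zeros(length(pos)*L + N, 1);
ph  = angle(X(:, 1));                          % running synthesis phase
for m = 1:length(pos)
    i    = floor(pos(m));  frac = pos(m) - i;  % interpolate between frames i, i+1
    mag  = (1 - frac)*abs(X(:, i+1)) + frac*abs(X(:, i+2));
    dph  = angle(X(:, i+2)) - angle(X(:, i+1)) - wk;   % deviation from expected
    dph  = dph - 2*pi*round(dph/(2*pi));               % wrap to [-pi, pi]
    y((m-1)*L + (1:N)) = y((m-1)*L + (1:N)) + w .* real(ifft(mag .* exp(1j*ph)));
    ph   = ph + wk + dph;                      % accumulate phase for the next frame
end
y = y/1.5;                                     % squared-Hann overlap gain at L = N/4
end
```

For example, y = pvoc_sketch(x, 2, 1024, 256) returns a signal roughly twice as long as x with the pitch approximately unchanged.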

General issues in TSM

Time window
- stretching a narrowband spectrogram

Malleability of different sounds
- vowels stretch well, stops lose nature

Not a well-formed problem?
- want to alter time without frequency
- . . . but time and frequency are not separate!
- ‘satisfying’ result is a subjective judgment ⇒ solution depends on auditory perception . . .


Summary

Information in sound
- lots of it, multiple levels of abstraction

Course overview
- survey of audio processing topics
- practicals, readings, project

DSP review
- digital signals, time domain
- Fourier domain, STFT

Timescale modification
- properties of the speech signal
- time-domain
- phase vocoder


References

J. L. Flanagan and R. M. Golden. Phase vocoder. Bell System Technical Journal, pages 1493–1509, 1966.

M. Dolson. The phase vocoder: A tutorial. Computer Music Journal, 10(4):14–27, 1986.

M. Puckette. Phase-locked vocoder. In Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pages 222–225, 1995.

A. T. Cemgil and S. J. Godsill. Probabilistic phase vocoder and its application to interpolation of missing values in audio signals. In 13th European Signal Processing Conference, Antalya, Turkey, 2005.
