Lecture 1: Introduction & DSP - Columbia University › ~dpwe › e6820 › lectures ›...

1/29COLUMBIA UNIVERSITY

IN THE CITY OF NEW YORK

EE E6820: Speech & Audio Processing & Recognition

/

ering

E6820 SAPR - Dan Ellis L01 - 2006-01-19

Lecture 1: Introduction & DSP

Sound and information

Course structure

DSP review: Timescale modification

Dan Ellis <[email protected]>http://www.ee.columbia.edu/~dpwe/e6820

Columbia University Dept. of Electrical EngineSpring 2006

1

2

3



Sound and information

voltage

1

on

air

r

ge

E6820 SAPR - Dan Ellis L01 - 2006-01-19

• Sound is air pressure variation

• Transducers convert air pressure ↔↔↔↔

Mechanical vibrati

Pressure waves in

Motion of senso

Time-varying volta

+ + + +

t

v(t)



What use is sound?

vantage

ioncornersral sounds’

.5 5

.5 5

E6820 SAPR - Dan Ellis L01 - 2006-01-19

• Footsteps examples:

• Hearing confers an evolutionary ad- useful information, complements vis- ...at a distance, in the dark, around - listeners are highly adapted to ‘natu

(including speech)

0 0.5 1 1.5 2 2.5 3 3.5 4 4-0.5

0

0.5

0 0.5 1 1.5 2 2.5 3 3.5 4 4-0.5

0

0.5

time / s



The scope of audio processing

abstract

E6820 SAPR - Dan Ellis L01 - 2006-01-19

AUDIO

PROCESSING

natural

man-made

simple



The acoustic communication chain

l)

r decoder

!

tion

E6820 SAPR - Dan Ellis L01 - 2006-01-19

• Sound is an information bearer

• Received sound reflects source(s) plus effect of environment (channe

message signal channel receive

synthesis audioprocessing recogni



Levels of abstraction

between

erent tasks

xplicit ...

E6820 SAPR - Dan Ellis L01 - 2006-01-19

• Much processing concerns shiftinglevels of abstraction

• Different representations serve diff- separating aspects, making things e

sound p(t)

representation(e.g. t-f energy)

‘information’

abstract

concrete

An

alys

is

Syn

thesis



Course structure

rocessingls

2

E6820 SAPR - Dan Ellis L01 - 2006-01-19

• Goals:- survey topics in sound analysis & p- develop an intuition for sound signa- learn some specific technologies

• Course structure:- weekly assignments (25%)- midterm event (25%)- final project (50%)

• Text:Speech and Audio Signal ProcessingBen Gold & Nelson Morgan, Wiley, 2000 ISBN: 0-471-35154-7



Web-based

820/

mples, ...

etc.

E6820 SAPR - Dan Ellis L01 - 2006-01-19

• Course website:http://www.ee.columbia.edu/~dpwe/e6

for lecture notes, problem sets, exa

• + student web pages for homework



Course outline

rytion

L10:Music

retrieval

L12:Multimediaindexing

E6820 SAPR - Dan Ellis L01 - 2006-01-19

Fundamentals

L1:DSP

L2:Acoustics

L3:Pattern

recognition

L4:Audito

percep

Audio processing

L5:Signalmodels

L6:Music

analysis/synthesis

L7:Audio

compression

L8:Spatial sound& rendering

Applications

L9:Speech

recognition

L11:Signal

separation



Weekly Assignments

g Toolbox)ing

E6820 SAPR - Dan Ellis L01 - 2006-01-19

• Research papers- journal & conference publications- summarize & discuss in class- written summaries on web page

• Practical experiments- MATLAB-based (+ Signal Processin- direct experience of sound process- skills for project

• Book sections



Final Project

of grade)

s

E6820 SAPR - Dan Ellis L01 - 2006-01-19

• Most significant part of course (50%

• Oral proposals mid-semester; Presentations in final class+ website

• Scope- practical (Matlab recommended)- identify a problem; try some solution- evaluation

• Topic- few restrictions within world of audio- investigate other resources- develop in discussion with me

• Copying



Examples of past projects

E6820 SAPR - Dan Ellis L01 - 2006-01-19

• Automatic Prosody Classification

• Model-based note transcription

dream2mSEG.wav

dream2mSEG.wav



DSP review: Digital Signals

3

time

ng

T

ε

E6820 SAPR - Dan Ellis L01 - 2006-01-19

- sampling interval T,

sampling frequency

- quantizer

xd[n] = Q( xc(nT ) )

Discrete-time samplilimits bandwidth

Discrete-levelquantization

limits dynamic range

ΩT2πT------=

Q y( ) ε y εÚ⋅=



The speech signal: time domain

ound types:

1.92

2.46 2.48

2.6 time/s

dime

: transiente”

E6820 SAPR - Dan Ellis L01 - 2006-01-19

• Speech is a sequence of different s

-0.2

-0.1

0

0.1

0.2

1.38 1.4 1.42-0.1

0

0.1

1.52 1.54 1.56 1.58-0.1

0

0.1

1.86 1.88 1.9-0.05

0

0.05

2.42 2.44

-0.02

0

0.02

1.4 1.6 1.8 2 2.2 2.4

watch thin as aahas

Vowel: periodic“has”

Fricative: aperiodic“watch”

Glide: smooth transition“watch”

Stop burst“dim



Timescale modification (TSM)

slower’?y

s?

pling rate

)tor

time/s

r = 2

M

E6820 SAPR - Dan Ellis L01 - 2006-01-19

• Can we modify a sound to make it ‘i.e. speech pronounced more slowl- e.g. to help comprehension, analysi- or more quickly for ‘speed listening’

• Why not just slow it down?

-

- equiv. to playback at a different sam

xs t( ) xotr-- =

(>1 → slower, r = slowdown fac

2.35 2.4 2.45 2.5 2.55 2.6-0.1

-0.05

0

0.05

0.1

2.35 2.4 2.45 2.5 2.55 2.6-0.1

-0.05

0

0.05

0.1

Original

2x slower



Time-domain TSM

e structure

mr---- L n+

e / s

e / s

E6820 SAPR - Dan Ellis L01 - 2006-01-19

• Problem: want to preserve local timbut alter global time structure

• Repeat segments- but: artefacts from abrupt edges

• Cross-fade & overlap

ym

mL n+[ ] ym 1–

mL n+[ ] w n[ ] x⋅+=

2.35 2.4 2.45 2.5 2.55 2.6-0.1

0

0.1

4.7 4.75 4.8 4.85 4.9 4.95-0.1

0

0.1

1

1

1 1 2 2 3 3 4 4 5 5 6

2

2

3

3

4

4

5 6

6

5

tim

tim



Synchronous Overlap-Add (SOLA)

window to

ion

:

L n Km+ +

mr---- L n K+ +

mr---- L n K+ +

2

------------------------------------------

M

E6820 SAPR - Dan Ellis L01 - 2006-01-19

• Idea: Allow some leeway in placingoptimize alignment of waveforms

• Hence,

where Km chosen by cross-correlat

1

2

Km maximizes alignment of 1 and 2

ym

mL n+[ ] ym 1–

mL n+[ ] w n[ ] x mr----⋅+=

Km

ym 1–

mL n+[ ] x⋅n 0=

Nov∑

ym 1–

mL n+[ ]( )2

∑ x

∑

-----------------------------------------------------------------------------0 K KU ≤ ≤

argmax =



The Fourier domain

us x)

k5 6 74

0.5 0 0.5 1 1.5

t1/k

04 0.006 0.008time / sec

x(t)

0 6000 8000freq / Hz

|X(jΩ)|

E6820 SAPR - Dan Ellis L01 - 2006-01-19

Fourier Series (periodic continuous x)

Fourier Transform (aperiodic continuo

x t( ) ck ejkΩ0t

⋅k∑=

ck1

2πT---------- x t( ) e

jkΩ0– t⋅ td

T 2Ú–

T 2Ú∫=

Ω02πT------=

1 2 3

|ck|1.0

1.5 1 1

0.5

0

0.5

x(t)

x t( )1

2π------ X jΩ( ) e

jΩt⋅ Ωd∫=

X jΩ( ) x t( ) ejΩt–

⋅ td∫=

0 0.002 0.0

level / dB

-0.01

0

0.01

0.02

0 2000 400-80

-60

-40

-20



Discrete-time Fourier

led x)

n2 3 4 5 6 7

)|

ω2π 3π 4π 5π

n]

k

|X(ejω)|]|n4 5 6 7

E6820 SAPR - Dan Ellis L01 - 2006-01-19

DT Fourier Transform (aperiodic samp

Discrete Fourier Transform (N-point x)

x n[ ]1

2π------ X e

jω( ) e

jωn⋅ ωd

π–

π∫=

X ejω

( ) x n[ ] ejωn–

⋅∑=

-1 1

0 π

|X(ejω

1

2

3

x [

x n[ ] X k[ ] ej2πkn

N-------------

⋅k∑=

X k[ ] x n[ ] ej2πkn

N-------------–

⋅n∑=

|X[k

k=1...

1 2 3

x [n]



Sampling and aliasing

tinuous tants:

uctuations

pectrum:

10

) n I∈∀

Ω

seband” l

E6820 SAPR - Dan Ellis L01 - 2006-01-19

• Discrete-time signals equal the contime signal at discrete sampling ins

• Sampling cannot represent rapid fl

• Nyquist limit (ΩΩΩΩT/2) from periodic s

xd n[ ] xc nT( )=

0 1 2 3 4 5 6 7 8 9 1

0.5

0

0.5

1

ΩM2πT------+

Tn sin ΩMTn(sin=

ΩM−ΩT ΩT −ΩMΩT - ΩM−ΩT + ΩM

Gp(jΩ)Ga(jΩ) “alias” of “ba

signa



Speech sounds in the Fourier domain

0(power)

nts

2000 3000 4000

2000 3000 4000

0.1

2000 3000

-40

2000 3000 4000

time domain frequency domain

freq / Hz

B

E6820 SAPR - Dan Ellis L01 - 2006-01-19

- dB = 20·log10(amplitude) = 10·log1

• Voiced spectrum has pitch + forma

1.52 1.54 1.56 1.58-0.1

0

0.1

2.42 2.44 2.46 2.48

-0.02

0

0.02

0 1000-100

-80

-60

-40

0 1000-100

-80

-60

1.37 1.38 1.39 1.4 1.41 1.42-0.1

0

0 1000-100

-80

-60

1.86 1.87 1.88 1.89 1.9 1.91-0.05

0

0.05

0 1000-100

-80

-60

Vowel: periodic“has”

Fricative: aperiodic“watch”

Glide: transition“watch”

Stop: transient“dime”

time / s

ener

gy

/ d



Short-time Fourier Transform

e and freq

time / s

2πk n mL–( )N

--------------------------------( )

E6820 SAPR - Dan Ellis L01 - 2006-01-19

• Want to localize energy in both tim→break sound into short-time pieces

calculate DFT of each one

• Mathematically:

2.35 2.4 2.45

0

4000

3000

2000

1000

2.5 2.55 2.6-0.1

0

0.1

freq

/ H

z

k →

short-timewindow

DFT

m = 0 m = 1 m = 2 m = 3

L 2L 3L

X k m,[ ] x n[ ] w n mL–[ ] j–exp⋅ ⋅n 0=N 1–

∑=



The Spectrogram

mage:

time / s

time / s

intensity / dB

-50

-40

-30

-20

-10

0

10

E6820 SAPR - Dan Ellis L01 - 2006-01-19

• Plot STFT as a grayscale iX k m,[ ]fr

eq /

Hz

2.35 2.4 2.45 2.5 2.55 2.60

1000

2000

3000

4000

freq

/ H

z

0

1000

2000

3000

4000

0

0.1

0 0.5 1 1.5 2 2.5



Time-frequency tradeoff

cy

.6 time / s

level/ dB

-50

-40

-30

-20

-10

0

10

E6820 SAPR - Dan Ellis L01 - 2006-01-19

• Longer window w[n] gains frequenresolution at cost of time resolution

1.4 1.6 1.8 2 2.2 2.4 2

freq

/ H

z

0

1000

2000

3000

4000

freq

/ H

z

0

1000

2000

3000

4000

0

0.2W

indo

w =

256

pt

Nar

row

ban

dW

indo

w =

48

ptW

ideb

and



Speech sounds on the Spectrogram

an rmants

6 time/s

dime

E6820 SAPR - Dan Ellis L01 - 2006-01-19

• Most popular speech visualization

• Wideband (short window) better thnarrowband (long window) to see fo

freq

/ H

z

0

1000

2000

3000

4000

1.4 1.6 1.8 2 2.2 2.4 2.

watch thin as aahas

Vo

wel

: pe

riodi

c“h

as”

Fri

c've

: ap

erio

dic

“wat

ch”

Glid

e: tr

ansi

tion

“wat

ch”

Sto

p:

tran

sien

t“d

ime”



TSM with the Spectrogram

1.4

1.4

E6820 SAPR - Dan Ellis L01 - 2006-01-19

• Just stretch out the spectrogram?

- how to resynthesize?

spectrogram is only

Time

Fre

quen

cy

0 0.2 0.4 0.6 0.8 1 1.20

1000

2000

3000

4000

Time

Fre

quen

cy

0 0.2 0.4 0.6 0.8 1 1.20

1000

2000

3000

4000

Y k m,[ ]



The Phase Vocoder

domain

gram:

een slices:

aligned

M

E6820 SAPR - Dan Ellis L01 - 2006-01-19

• Timescale modification in the STFT

• Magnitude from ‘stretched’ spectro

- e.g. by linear interpolation

• But preserve phase increment betw

- e.g. by discrete differentiator

• Does right thing for single sinusoid- keeps overlapped parts of sinusoid

Y k m,[ ] X kmr----,=

θY k m,[ ] θX kmr----,=

time



General issues in TSM

m

re

parate!ement

ption...

E6820 SAPR - Dan Ellis L01 - 2006-01-19

• Time window- stretching a narrowband spectrogra

• Malleability of different sounds- vowels stretch well, stops lose natu

• Not a well-formed problem?- want to alter time without frequency

... but time and frequency are not se- ‘satisfying’ result is a subjective judg→solution depends on auditory perce



Summary

n

E6820 SAPR - Dan Ellis L01 - 2006-01-19

• Information in sound- lots of it, multiple levels of abstractio

• Course overview- survey of audio processing topics- practicals, readings, project

• DSP review- digital signals, time domain- Fourier domain, STFT

• Timescale modification- properties of the speech signal- time-domain- phase vocoder

Lecture 1: Introduction & DSP - Columbia University › ~dpwe › e6820 › lectures ›...

Documents

Transcript of Lecture 1: Introduction & DSP - Columbia University › ~dpwe › e6820 › lectures ›...