E6820 SAPR - Dan Ellis L01 - 2003-01-27 - 1

EE E6820: Speech & Audio Processing & Recognition

Lecture 1: Introduction & DSP

1. Sound and information

2. Course structure

3. DSP review: Timescale modification

Dan Ellis <[email protected]>
http://www.ee.columbia.edu/~dpwe/e6820/

Columbia University Dept. of Electrical Engineering
Spring 2003

Sound and information

• Sound is air pressure variation

• Transducers convert air pressure ↔ voltage

[Figure: mechanical vibration → pressure waves in air → motion of sensor → time-varying voltage v(t), plotted against t]

What use is sound?

• Footsteps examples:

[Figure: two waveform panels; x-axis: time / s (0 to 5), amplitude range -0.5 to 0.5]

• Hearing confers an evolutionary advantage

- useful information, complements vision
- ...at a distance, in the dark, around corners
- listeners are highly adapted to ‘natural sounds’ (including speech)

The scope of audio processing

[Figure: map of AUDIO PROCESSING; labels: natural, man-made, simple, abstract]

The acoustic communication chain

• Sound is an information bearer

• Received sound reflects source(s) plus effect of environment (channel)

[Figure: message → signal → channel → receiver → decoder; stage labels: synthesis, audio processing, recognition]

Levels of abstraction

• Much processing concerns shifting between levels of abstraction

• Different representations serve different tasks

- separating aspects, making things explicit ...

[Figure: levels from concrete to abstract: sound p(t), representation (e.g. t-f energy), ‘information’; arrows labeled Analysis and Synthesis link the levels]

Course structure

• Goals:

- survey topics in sound analysis & processing
- develop an intuition for sound signals
- learn some specific technologies (esp. ASR)

• Course structure:

- weekly assignments (25%)
- midterm exam (25%)
- final project (50%)

• Text:

Speech and Audio Signal Processing, Ben Gold & Nelson Morgan, Wiley, 2000. ISBN: 0-471-35154-7

Web-based

• Course website:

http://www.ee.columbia.edu/~dpwe/e6820/

for lecture notes, problem sets, examples, ...

• + student web pages for homework etc.

Course outline

Fundamentals

- L1: DSP
- L2: Acoustics
- L3: Pattern recognition
- L4: Auditory perception

Audio processing

- L5: Signal models
- L6: Music analysis/synthesis
- L7: Audio compression
- L8: Spatial sound & rendering

Speech recognition

- L9: Speech features
- L10: Sequence recognition
- L11: Recognizer training
- L12: Systems & applications

Weekly Assignments

• Research papers

- journal & conference publications
- summarize & discuss in class
- written summaries on web page

• Practical experiments

- MATLAB-based (+ Signal Processing Toolbox)
- direct experience of sound processing
- skills for project

• Book sections

- plus questions from book

Final Project

• Most significant part of course (50% of grade)

• Oral proposals mid-semester; presentations in final class + website

• Scope

- practical (Matlab recommended)
- identify a problem; try some solutions
- evaluation

• Topic

- few restrictions within world of audio
- investigate other resources
- develop in discussion with me

Examples of past projects

• Detecting airplane noise

- e.g. for environment monitoring

• Separating speakers in recorded meetings

- based on dummy-head binaural cues

dream2mSEG.wav

DSP review: Digital Signals

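$x_d[n] = Q(x_c(nT))$

- sampling interval $T$, sampling frequency $\Omega_T = \frac{2\pi}{T}$
- quantizer $Q(y) = \epsilon \cdot \lfloor y / \epsilon \rfloor$

[Figure: sampled and quantized waveform vs. time, with sample spacing T and level spacing ε. Discrete-time sampling limits bandwidth; discrete-level quantization limits dynamic range.]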
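The sampling and quantization relations can be tried out directly. The following is a minimal sketch in Python (the course itself uses MATLAB), assuming a floor-type uniform quantizer; the 8 kHz rate, the step size and the 440 Hz test tone are illustrative assumptions, not values from the lecture.

```python
import math

def quantize(y, eps):
    """Uniform quantizer Q(y) = eps * floor(y / eps)."""
    return eps * math.floor(y / eps)

def sample_and_quantize(xc, T, eps, num_samples):
    """Return x_d[n] = Q(x_c(nT)) for n = 0 .. num_samples - 1."""
    return [quantize(xc(n * T), eps) for n in range(num_samples)]

# Illustrative values (assumptions, not taken from the lecture)
T = 1.0 / 8000.0            # sampling interval for an 8 kHz rate
eps = 1.0 / 128.0           # quantizer step size
omega_T = 2 * math.pi / T   # sampling frequency in rad/s

def tone(t):
    """Continuous-time test signal x_c(t): a 440 Hz sine."""
    return 0.5 * math.sin(2 * math.pi * 440.0 * t)

xd = sample_and_quantize(tone, T, eps, 16)
```

Each value in `xd` is a multiple of `eps` lying at most one step below the corresponding sample of `tone`.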

The speech signal: time domain

• Speech is a sequence of different sound types:

[Figure: waveform of the phrase “has a watch thin as a dime” (x-axis: time/s, 1.4 to 2.6), with zoomed panels for four sound types]

- Vowel: periodic (“has”)
- Fricative: aperiodic (“watch”)
- Glide: smooth transition (“watch”)
- Stop burst: transient (“dime”)

Timescale modification (TSM)

• Can we modify a sound to make it ‘slower’? i.e. speech pronounced more slowly

- e.g. to help comprehension, analysis
- or more quickly for ‘speed listening’?

• Why not just slow it down?

- $x_s(t) = x_o\left(\frac{t}{r}\right)$, $r$ = slowdown factor
- equiv. to playback at a different sampling rate

[Figure: two waveform panels, “Original” and “2x slower”; x-axis: time/s (2.35 to 2.6), amplitude range -0.1 to 0.1]
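To see what plain slowdown does, the sketch below resamples a sine by linear interpolation so that y[n] ≈ x(n/r). The helper names, the 8 kHz rate and the 200 Hz tone are assumptions for illustration. The stretched signal contains the same number of cycles spread over about twice as many samples, so at a fixed playback rate its frequency is halved.

```python
import math

def slow_down(x, r):
    """Resample x so that y[n] ~ x(n / r), using linear interpolation."""
    n_out = int((len(x) - 1) * r) + 1
    y = []
    for n in range(n_out):
        pos = n / r                       # fractional read position in x
        i = min(int(pos), len(x) - 2)     # left neighbor, kept in range
        frac = pos - i
        y.append((1 - frac) * x[i] + frac * x[i + 1])
    return y

def zero_crossings(x):
    """Count sign changes, a crude proxy for the number of half-cycles."""
    return sum(1 for a, b in zip(x, x[1:]) if (a < 0) != (b < 0))

fs = 8000                                  # assumed sampling rate in Hz
x = [math.sin(2 * math.pi * 200 * n / fs + 0.1) for n in range(fs // 10)]
y = slow_down(x, 2.0)                      # r = 2: twice as slow

cycles_in = zero_crossings(x)
cycles_out = zero_crossings(y)
```

Here `cycles_in` equals `cycles_out` while `len(y)` is roughly `2 * len(x)`: duration and pitch change together, which is the reason for the dedicated TSM methods that follow.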

Time-domain TSM

• Problem: want to preserve local time structure but alter global time structure

• Repeat segments

- but: artefacts from abrupt edges

• Cross-fade & overlap

$y^m[mL+n] = y^{m-1}[mL+n] + w[n] \cdot x\left[\left\lfloor \frac{m}{r} \right\rfloor L + n\right]$

[Figure: original waveform (time / s, 2.35 to 2.6) divided into segments 1 to 6, and the stretched waveform (time / s, 4.7 to 4.95) built from repeated segments 1 1 2 2 3 3 4 4 5 5 6, first abutted and then overlapped with cross-fades]
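The cross-fade and overlap recursion can be sketched as follows. This is a minimal illustration rather than the lecture's own code: the window length N = 2L and the periodic Hann window (whose copies shifted by L sum to 1) are assumptions, and the function name is hypothetical.

```python
import math

def ola_stretch(x, r, L):
    """Overlap-add TSM: output frame m reuses input frame floor(m / r)."""
    N = 2 * L                                  # window spans two hops (assumption)
    # Periodic Hann window: w[n] + w[n + L] == 1, so cross-fades sum to unity
    w = [0.5 - 0.5 * math.cos(2 * math.pi * n / N) for n in range(N)]
    num_in = (len(x) - N) // L + 1             # whole input frames available
    num_out = int(num_in * r)                  # output frames to build
    y = [0.0] * ((num_out - 1) * L + N)
    for m in range(num_out):
        src = int(m / r) * L                   # floor(m / r) * L
        for n in range(N):
            y[m * L + n] += w[n] * x[src + n]  # y^m = y^(m-1) + w[n] * x[...]
    return y

x = [1.0] * 1000                # constant test signal
y = ola_stretch(x, 2.0, 50)     # r = 2: about twice as long
```

With a constant input, the interior of `y` stays at 1.0, confirming that the cross-fades add up to unity gain. For a real waveform, repeated frames are generally not phase-aligned where they overlap, which motivates SOLA.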

Synchronous Overlap-Add (SOLA)

• Idea: allow some leeway in placing window to optimize alignment of waveforms

[Figure: two overlapping waveform segments, 1 and 2; $K_m$ maximizes alignment of 1 and 2]

• Hence,

$y^m[mL+n] = y^{m-1}[mL+n] + w[n] \cdot x\left[\left\lfloor \frac{m}{r} \right\rfloor L + n + K_m\right]$

where $K_m$ chosen by cross-correlation:

$K_m = \underset{0 \le K \le K_U}{\operatorname{argmax}} \; \frac{\sum_{n=0}^{N_{ov}} y^{m-1}[mL+n] \cdot x\left[\left\lfloor \frac{m}{r} \right\rfloor L + n + K\right]}{\sqrt{\sum \left(y^{m-1}[mL+n]\right)^2 \sum \left(x\left[\left\lfloor \frac{m}{r} \right\rfloor L + n + K\right]\right)^2}}$
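The offset search can be sketched as a brute-force scan over K. The function name and argument layout below are hypothetical: `y_prev` stands for the overlap region of the output built so far, `start` for the nominal read position in x, and the correlation is taken over `n_ov` samples.

```python
import math

def best_offset(y_prev, x, start, n_ov, k_max):
    """Return K in 0..k_max maximizing the normalized cross-correlation
    between y_prev[0:n_ov] and x[start+K : start+K+n_ov]."""
    ref = y_prev[:n_ov]
    e_ref = sum(v * v for v in ref)            # energy of the existing output
    best_k, best_score = 0, -float("inf")
    for k in range(k_max + 1):
        seg = x[start + k : start + k + n_ov]
        if len(seg) < n_ov:                    # ran off the end of x
            break
        num = sum(a * b for a, b in zip(ref, seg))
        den = math.sqrt(e_ref * sum(v * v for v in seg))
        score = num / den if den > 0 else 0.0
        if score > best_score:
            best_k, best_score = k, score
    return best_k

# Sine with a 40-sample period; the overlap region was cut from x at offset 10
x = [math.sin(2 * math.pi * n / 40) for n in range(200)]
y_prev = x[10:40]
k = best_offset(y_prev, x, 0, 30, 25)
```

In this example the search recovers the offset 10 at which the segment was cut, since that is where the two waveforms line up within the search range.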

The Fourier domain

Fourier Series (periodic continuous x), with fundamental frequency $\Omega_0 = \frac{2\pi}{T}$:

$x(t) = \sum_k c_k \cdot e^{jk\Omega_0 t}$

$c_k = \frac{1}{T} \int_{-T/2}^{T/2} x(t) \cdot e^{-jk\Omega_0 t} \, dt$

[Figure: periodic waveform x(t) vs. t, and coefficient magnitudes |c_k| vs. k (k = 1 to 7)]

Fourier Transform (aperiodic continuous x)

$x(t) = \frac{1}{2\pi} \int X(j\Omega) \cdot e^{j\Omega t} \, d\Omega$

$X(j\Omega) = \int x(t) \cdot e^{-j\Omega t} \, dt$

[Figure: x(t) vs. time / sec (0 to 0.008), and |X(jΩ)| as level / dB (-80 to -20) vs. freq / Hz (0 to 8000)]
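As a numerical check of the Fourier series analysis integral, the sketch below approximates c_k with the midpoint rule, assuming the usual normalization c_k = (1/T) ∫ x(t) e^{-jkΩ0 t} dt with Ω0 = 2π/T. The square-wave test signal and the step count are illustrative choices.

```python
import cmath
import math

def fourier_coeff(x, k, T, steps=10000):
    """Approximate c_k = (1/T) * integral over [-T/2, T/2] of
    x(t) * exp(-j k Omega0 t) dt, with Omega0 = 2*pi/T (midpoint rule)."""
    omega0 = 2 * math.pi / T
    dt = T / steps
    total = 0j
    for i in range(steps):
        t = -T / 2 + (i + 0.5) * dt
        total += x(t) * cmath.exp(-1j * k * omega0 * t) * dt
    return total / T

def square(t):
    """One period of a +/-1 square wave with period T = 2."""
    return 1.0 if t >= 0 else -1.0

c1 = fourier_coeff(square, 1, 2.0)
c2 = fourier_coeff(square, 2, 2.0)
```

For this square wave the odd harmonics have magnitude 2/(πk) and the even harmonics vanish, so `abs(c1)` comes out close to 2/π and `abs(c2)` close to zero.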

Discrete-time Fourier

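DT Fourier Transform (aperiodic sampled x)

$x[n] = \frac{1}{2\pi} \int_{-\pi}^{\pi} X(e^{j\omega}) \cdot e^{j\omega n} \, d\omega$

$X(e^{j\omega}) = \sum_n x[n] \cdot e^{-j\omega n}$

[Figure: sequence x[n] vs. n, and |X(e^{jω})| vs. ω, periodic with period 2π (axis marked 0, π, 2π, ... 5π)]

Discrete Fourier Transform (N-point x)

$x[n] = \frac{1}{N} \sum_k X[k] \cdot e^{j\frac{2\pi k n}{N}}$

$X[k] = \sum_n x[n] \cdot e^{-j\frac{2\pi k n}{N}}$

[Figure: N-point sequence x[n] vs. n, and |X[k]| vs. k shown as samples of |X(e^{jω})|]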
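The DFT pair can be written out directly as a pair of O(N²) sums. This is a sketch for checking the definitions, assuming the convention with no scaling on the forward transform and 1/N on the inverse; practical code would use an FFT routine instead.

```python
import cmath
import math

def dft(x):
    """X[k] = sum_n x[n] * exp(-j 2 pi k n / N)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    """x[n] = (1/N) * sum_k X[k] * exp(+j 2 pi k n / N)."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

x = [1.0, 2.0, 3.0, 2.0, 1.0, 0.0, 0.0, 0.0]
X = dft(x)
x_back = idft(X)      # recovers x up to rounding error
```

The k = 0 term `X[0]` equals the sum of the samples, and `x_back` matches `x` to within floating-point rounding.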

Sampling and aliasing

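• Discrete-time signals equal the continuous time signal at discrete sampling instants:

$x_d[n] = x_c(nT)$

• Sampling cannot represent rapid fluctuations

$\sin\left(\left(\Omega_M + \frac{2\pi}{T}\right) T n\right) = \sin(\Omega_M T n) \quad \forall n \in \mathbb{Z}$

[Figure: two sinusoids of different frequency passing through the same sample points; x-axis 0 to 10, amplitude -1 to 1]

• Nyquist limit ($\Omega_T / 2$) from periodic spectrum:

[Figure: periodic spectrum $G_p(j\Omega)$ vs. Ω, with the “baseband” signal between $-\Omega_M$ and $\Omega_M$ and its “alias” $G_a(j\Omega)$ around $\pm\Omega_T$, with edges at $\Omega_T - \Omega_M$ and $-\Omega_T + \Omega_M$]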
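The aliasing identity is easy to confirm numerically: a sinusoid at Ω_M and one at Ω_M + 2π/T give the same samples. The 8 kHz rate and 1 kHz tone below are assumed values for illustration.

```python
import math

T = 1.0 / 8000.0                    # assumed sampling interval (8 kHz rate)
omega_T = 2 * math.pi / T           # sampling frequency in rad/s
omega_M = 2 * math.pi * 1000.0      # a 1 kHz tone in rad/s

def sampled_sine(omega, T, num):
    """Return sin(omega * T * n) for n = 0 .. num - 1."""
    return [math.sin(omega * T * n) for n in range(num)]

base = sampled_sine(omega_M, T, 32)
alias = sampled_sine(omega_M + omega_T, T, 32)   # 9 kHz tone, same samples

max_diff = max(abs(a - b) for a, b in zip(base, alias))
```

`max_diff` is zero up to rounding error, so after sampling the 9 kHz tone cannot be told apart from the 1 kHz one.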

Speech sounds in the Fourier domain

- dB = 20·log10(amplitude) = 10·log10(power)

• Voiced spectrum has pitch + formants

[Figure: four sound types, each shown in the time domain (x-axis: time / s) and the frequency domain (x-axis: freq / Hz, 0 to 4000; y-axis: energy / dB, -100 to -40)]

- Vowel: periodic (“has”)
- Fricative: aperiodic (“watch”)
- Glide: transition (“watch”)
- Stop: transient (“dime”)
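The two forms of the dB definition agree because power is proportional to amplitude squared. A small sketch, with hypothetical helper names and a reference level of 1.0 assumed:

```python
import math

def db_from_amplitude(a, ref=1.0):
    """Level in dB from an amplitude ratio: 20 * log10(a / ref)."""
    return 20 * math.log10(a / ref)

def db_from_power(p, ref=1.0):
    """Level in dB from a power ratio: 10 * log10(p / ref)."""
    return 10 * math.log10(p / ref)

amp = 0.1
level_a = db_from_amplitude(amp)      # about -20 dB
level_p = db_from_power(amp ** 2)     # same level, computed from power
```

A tenfold drop in amplitude is a hundredfold drop in power, and both correspond to about -20 dB.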

Short-time Fourier Transform

• Want to localize energy in both time and freq
→ break sound into short-time pieces, calculate DFT of each one

[Figure: waveform (time / s, 2.35 to 2.6) with short-time windows at hops L, 2L, 3L; each windowed piece passes through a DFT to form one column of a time-frequency image (freq / Hz, 0 to 4000)]

• Mathematically:

$X[k,m] = \sum_{n=0}^{N-1} x[n] \cdot w[n-mL] \cdot \exp\left(-j\frac{2\pi k (n-mL)}{N}\right)$
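A direct, unoptimized sketch of the STFT follows: window each hop-L segment and take its N-point DFT, with phase referenced to the start of the window. The Hann window, the 8 kHz rate and the 1 kHz test tone are assumptions, and real code would use an FFT.

```python
import cmath
import math

def stft(x, N, L):
    """Return frames X[m][k]: N-point DFTs of Hann-windowed segments at hop L."""
    w = [0.5 - 0.5 * math.cos(2 * math.pi * n / N) for n in range(N)]
    frames = []
    for start in range(0, len(x) - N + 1, L):      # start = m * L
        seg = [x[start + n] * w[n] for n in range(N)]
        frames.append([sum(seg[n] * cmath.exp(-2j * math.pi * k * n / N)
                           for n in range(N))
                       for k in range(N)])
    return frames

fs = 8000                                           # assumed sampling rate in Hz
x = [math.sin(2 * math.pi * 1000 * n / fs) for n in range(512)]
X = stft(x, N=64, L=32)

# Strongest bin among 0 .. N/2 in each frame
peak_bins = [max(range(33), key=lambda k: abs(frame[k])) for frame in X]
```

With N = 64 at 8 kHz the bins are 125 Hz apart, so the 1 kHz tone peaks in bin 8 of every frame.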

The Spectrogram

• Plot STFT $X[k,m]$ as a grayscale image:

[Figure: speech waveform and its spectrogram; x-axis: time / s, y-axis: freq / Hz (0 to 4000), gray level: intensity / dB (-50 to 10); one view spans 0 to 2.5 s and a zoomed view spans 2.35 to 2.6 s]

Time-frequency tradeoff

• A longer window w[n] gains frequency resolution at the cost of time resolution

[Figure: waveform and two spectrograms of the same speech; x-axis: time / s (1.4 to 2.6), y-axis: freq / Hz (0 to 4000). Window = 256 pt: “Narrowband”. Window = 32 pt: “Wideband”.]
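The tradeoff can be put in numbers for the two window lengths in the figure. The sampling rate is not stated in the slides; 8 kHz is assumed here because the plots extend to 4000 Hz.

```python
fs = 8000.0   # assumed sampling rate in Hz (not stated in the lecture)

def window_tradeoff(num_points, fs):
    """Return (window duration in ms, DFT bin spacing in Hz)."""
    duration_ms = 1000.0 * num_points / fs   # time span of one window
    bin_spacing_hz = fs / num_points         # spacing between DFT bins
    return duration_ms, bin_spacing_hz

narrowband = window_tradeoff(256, fs)   # (32.0, 31.25)
wideband = window_tradeoff(32, fs)      # (4.0, 250.0)
```

Under this assumption the 256-point window spans 32 ms with bins 31.25 Hz apart, fine enough to separate pitch harmonics, while the 32-point window spans only 4 ms with bins 250 Hz apart.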

Speech sounds on the Spectrogram

• Most popular speech visualization

• Wideband (short window) better than narrowband (long window) to see formants

[Figure: spectrogram of the phrase “has a watch thin as a dime”; x-axis: time/s (1.4 to 2.6), y-axis: freq / Hz (0 to 4000), with labeled regions]

- Vowel: periodic (“has”)
- Fricative: aperiodic (“watch”)
- Glide: transition (“watch”)
- Stop: transient (“dime”)

TSM with the Spectrogram

• Just stretch out the spectrogram?

- how to resynthesize? spectrogram is only $|Y[k,m]|$

[Figure: original and time-stretched spectrograms; x-axis: Time (0 to 1.4), y-axis: Frequency (0 to 4000)]

The Phase Vocoder

• Timescale modification in the STFT domain

• Magnitude from ‘stretched’ spectrogram:

$|Y[k,m]| = \left|X\left[k, \frac{m}{r}\right]\right|$

- e.g. by linear interpolation

• But preserve phase increment between slices:

$\dot{\theta}_Y[k,m] = \dot{\theta}_X\left[k, \frac{m}{r}\right]$

- e.g. by discrete differentiator

• Does right thing for single sinusoid

- keeps overlapped parts of sinusoid aligned
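The two phase vocoder rules can be sketched on a list of STFT frames. This is a simplified illustration with a hypothetical function name: magnitudes are linearly interpolated at the fractional frame m/r, and each bin's output phase is accumulated from the phase difference between the two neighboring input frames. Resynthesis by inverse DFT and overlap-add is omitted.

```python
import cmath

def pvoc_stretch(X, r):
    """Stretch STFT frames X[m][k] (at least two frames) by factor r."""
    num_bins = len(X[0])
    num_out = int((len(X) - 1) * r) + 1
    phase = [cmath.phase(c) for c in X[0]]     # start from the input phases
    Y = []
    for m in range(num_out):
        pos = m / r                            # fractional input frame index
        i = min(int(pos), len(X) - 2)
        frac = pos - i
        frame = []
        for k in range(num_bins):
            # Magnitude: linear interpolation between neighboring slices
            mag = (1 - frac) * abs(X[i][k]) + frac * abs(X[i + 1][k])
            frame.append(cmath.rect(mag, phase[k]))
            # Phase: advance by the increment between neighboring slices
            # (multiples of 2*pi do not affect the complex value)
            phase[k] += cmath.phase(X[i + 1][k]) - cmath.phase(X[i][k])
        Y.append(frame)
    return Y

# One bin holding a steady sinusoid: constant magnitude, 0.7 rad per frame
X = [[cmath.rect(2.0, 0.3 + 0.7 * m)] for m in range(10)]
Y = pvoc_stretch(X, 2.0)
```

For the single-sinusoid test, `Y` has about twice as many frames as `X`, each with the same magnitude and the same 0.7 rad advance per frame, so overlapped output frames stay aligned.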

General issues in TSM

• Time window

- stretching a narrowband spectrogram

• Malleability of different sounds

- vowels stretch well, stops lose nature

• Not a well-formed problem?

- want to alter time without frequency ... but time and frequency are not separate!
- ‘satisfying’ result is a subjective judgement
→ solution depends on auditory perception...

Summary

• Information in sound

- lots of it, multiple levels of abstraction

• Course overview

- survey of audio processing topics
- practicals, readings, project

• DSP review

- digital signals, time domain
- Fourier domain, STFT

• Timescale modification

- properties of the speech signal
- time-domain
- phase vocoder
