Lecture 6: Nonspeech and Musicdpwe/e6820/lectures/E... · E6820 SAPR - Dan Ellis L06 - Nonspeech &...

E6820 SAPR - Dan Ellis L06 - Nonspeech & Music 2006-02-23 - 1

EE E6820: Speech & Audio Processing & Recognition

Lecture 6:Nonspeech and Music

Music and nonspeech

Environmental sounds

Music synthesis techniques

Sinewave synthesis

Music analysis

Dan Ellis <[email protected]>http://www.ee.columbia.edu/~dpwe/e6820/

Columbia University Dept. of Electrical EngineeringSpring 2006

1

2

3

4

5


Music & nonspeech

• What is ‘nonspeech’?

- according to research effort: a little music- in the world: most everything

attributes?

1

Originnatural man-made

Info

rmat

ion

cont

ent

low

high

wind &water

animalsounds

speech music

machines & enginescontact/

collision

http://www.ee.columbia.edu/~dpwe/e6820/


Sound attributes

• Attributes suggest model parameters

• What do we notice about ‘general’ sound?

- psychophysics: pitch, loudness, ‘timbre’- bright/dull; sharp/soft; grating/soothing- sound is not ‘abstract’:

tendency is to describe by source-events

• Ecological perspective

- what matters about sound is ‘what happened’

→

our percepts express this more-or-less directly


Motivations for modeling

• Describe/classify

- cast sound into model because want to use the resulting parameters

• Store/transmit

- model implicitly exploits limited structure of signal

• Resynthesize/modify

- model separates out interesting parameters

Sound Model parameterspace


Analysis and synthesis

• Analysis is the converse of synthesis:

• Can exist apart:

- analysis for classification- synthesis of artificial sounds

• Often used together:

- encoding/decoding of compressed formats- resynthesis based on analyses- analysis-by-synthesis

Sound

AnalysisSynthesis

Model / representation


Outline

Music and nonspeech


- Collision sounds- Sound textures


Sinewave synthesis

Music analysis

1

2

3

4

5


Environmental Sounds

• Where sound comes from:mechanical interactions

- contact / collisions- rubbing / scraping- ringing / vibrating

• Interest in environmental sounds

- carry information about events around us.. including indirect hints

- need to create them in virtual environments.. including soundtracks

• Approaches to synthesis

- recording / sampling- synthesis algorithms

2


Collision sounds

• Factors influencing:

- colliding bodies: size, material, damping- local properties at contact point (hardness)- energy of collision

• Source-filter model

- “source” = excitation of collision event(energy, local properties at contact)

- “filter” = resonance and radiation of energy(body properties)

• Variety of strike/scraping sounds

- resonant freqs ~ size/shape- damping ~ material- HF content in excitation/strike ~

mallet, force(from Gaver 1993)

t→

f→


Sound textures

• What do we hear in:

- a city street- a symphony orchestra

• How do we distinguish:

- waterfall- rainfall- applause- static

• Levels of ecological description...

time / s

freq

/ H

z

Applause04

0 1 2 3 40

1000

2000

3000

4000

5000

time / sfr

eq /

Hz

Rain01

0 1 2 3 40

1000

2000

3000

4000

5000


Sound texture modeling

(Athineos)

• Model broad spectral structure with LPC

- could just resynthesize with noise

• Model fine temporal structure in residual with linear prediction in time domain

- precise dual of LPC in frequency- ‘poles’ model temporal events

• Allows modification / synthesis?

TD-LPy[n] = iaiy[n-i] DCT

FD-LPE[k] = ΣΣ ibiE[k-i]Whitened

residualPer-framespectral

parameters

Per-frametemporal envelope

parameters

Residualspectrum

Sound

e[n]y[n] E[k]

-0.02

0

0.02

Temporal envelopes (40 poles, 256ms)

0.05 0.1 0.15 0.2 0.25time / sec

ampl

itude

0.01

0.02

0.03


Outline

Music and nonspeech



- Framework- Historical development

Sinewave synthesis

Music analysis

elements?

1

2

3

4

5



• What is music?

- could be anything

→

flexible synthesis needed!

• Key elements of conventional music

- instruments

→

note-events (time, pitch, accent level)

→

melody, harmony, rhythm- patterns of repetition & variation

• Synthesis framework:

instruments: common framework for many notesscore: sequence of (time, pitch, level) note events

3


The nature of musical instrument notes

• Characterized by instrument (register), note, loudness/emphasis, articulation...

distinguish how?

Time

Fre

quen

cy

Piano

0

1000

2000

3000

4000

0 1 2 3 4 Time

Violin

0

1000

2000

3000

4000

0 1 2 3 4

Time

Fre

quen

cy

Clarinet

0

1000

2000

3000

4000

0 1 2 3 4 Time

Trumpet

0

1000

2000

3000

4000

0 1 2 3 4


Development of music synthesis

• Goals of music synthesis:

- generate realistic / pleasant new notes- control / explore timbre (quality)

• Earliest computer systems in 1960s(voice synthesis, algorithmic)

• Pure synthesis approaches:

- 1970s: Analog synths- 1980s: FM (Stanford/Yamaha)- 1990s: Physical modeling, hybrids

• Analysis-synthesis methods:

- sampling / wavetables- sinusoid modeling- harmonics + noise (+ transients)

others?


Analog synthesis

• The minimum to make an ‘interesting’ sound

• Elements:

- harmonics-rich oscillators- time-varying filters- time-varying envelope- modulation: low frequency + envelope-based

• Result:

- time-varying spectrum, independent pitch

Oscillator Filter

Envelope

PitchTrigger

Vibrato

t

t

f

+

Cutofffreq

GainSound

++


FM synthesis

• Fast frequency modulation

→

sidebands:

- a harmonic series if

ω

c

=

r

·

ω

m

•

J

n

(β) is a Bessel function:

→ Complex harmonic spectra by varying β

ωct β ωmt( )sin+( )cos Jn β( ) ωc nωm+( )t( )cosn ∞–=

∞∑=

phase modulation

0 1 2 3 4 5 6 7 8 9-0.5

0

0.5

1J0

J1 J2 J3 J4Jn(β) ≈ 0 for β < n - 2

modulation index β

time / s

freq

/ H

z

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.80

1000

2000

3000

4000

whatuse?

ωc 2000Hz=

ωm 200Hz=


Sampling synthesis

• Resynthesis from real notes→ vary pitch, duration, level

• Pitch: stretch (resample) waveform

• Duration: loop a ‘sustain’ section

• Level: cross-fade different examples

- need to ‘line up’ source samples

-0.2

-0.1

0

0.1

0.2

0 0.1 0.2 time

0 0.002 0.004 0.006 0.008 time / s-0.2

-0.1

0

0.1

0.2

0 0.002 0.004 0.006 0.008 time / s

596 Hz 894 Hz

-0.2

-0.1

0

0.1

0.2

-0.2

-0.1

0

0.1

0.2

-0.2

-0.1

0

0.1

0.2

0 0.1 0.2 0.3 0 0.1 0.2 0.3time / s time / s

0.204 0.2060.174 0.176

-0.2

-0.1

0

0.1

0.2

-0.2

-0.1

0

0.1

0.2

0 0.05 0.1 0.15 0 0.05 0.1 0.15time / s time / s

Soft Loud

veloc

mix

good& bad?


Outline

Music and nonspeech



Sinewave synthesis (detail)- Sinewave modeling- Sines + residual ...

Music analysis

1

2

3

4

5


Sinewave synthesis

• If patterns of harmonics are what matter,why not generate them all explicitly:

- particularly powerful model for pitched signals

• Analysis (as with speech):

- find peaks in STFT |S[ω,n]| & track

- or track fundamental ω0 (harmonics / autoco)

& sample STFT at k·ω0

→set of Ak[n] to duplicate tone:

• Synthesis via bank of oscillators

4

s n[ ] Ak n[ ] k ω0 n[ ] n⋅ ⋅( )cosk∑=

0 0.05 0.1 0.15 0.2 time / s time / s0

2000

4000

6000

8000

freq

/ H

z

freq / Hz

mag

00.1

0.2

05000

0

1

2


Steps to sinewave modeling - 1

• The underlying STFT:

What value for N (FFT length & window size)?What value for H (hop size: n0 = r·H, r = 0, 1, 2...)?

• STFT window length determines freq. resol’n:

• Choose N long enough to resolve harmonics→ 2-3x longest (lowest) fundamental period- e.g. 30-60 ms = 480-960 samples @ 16 kHz- choose H ≤ N/2

• N too long → lost time resolution- limits sinusoid amplitude rate of change

X k n0,[ ] x n n0+[ ] w n[ ] j2πkn

N------------- –exp⋅ ⋅

n 0=

N 1–

∑=

Xw ejω

( ) X ejω

( ) W ejω

( )*=



• Choose candidate sinusoids at each time by picking peaks in each STFT frame:

• Quadratic fit for peak:

+ linear interpolation of unwrapped phase

time / s

freq

/ H

zle

vel /

dB

0 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.180

2000

4000

6000

8000

0 1000 2000 3000 4000 5000 6000 7000 freq / Hz-60

-40

-20

0

20

400 600 800 freq / Hz

-20

-10

0

10

20

400 600 800 freq / Hz

-10

-5

0

leve

l / d

B

phas

e / r

ad

y

x

y = ax(x-b)

b/2

ab2/4



• Which peaks to pick?Want ‘true’ sinusoids, not noise fluctuations- ‘prominence’ threshold above smoothed spec.

• Sinusoids exhibit stability...- of amplitude in time- of phase derivative in time→compare with adjacent time frames to test?

0 1000 2000 3000 4000 5000 6000 7000 freq / Hz-60

-40

-20

0

20

leve

l / d

B



• ‘Grow’ tracks by appending newly-found peaks to existing tracks:

- ambiguous assignments possible

• Unclaimed new peak- ‘birth’ of new track- backtrack to find earliest trace?

• No continuation peak for existing track- ‘death’ of track- or: reduce peak threshold for hysteresis

time

freq

death

birth

existingtracks

newpeaks


Resynthesis of sinewave models

• After analysis, each track defines contours in frequency, amplitude fk[n], Ak[n] (+ phase?)

- use to drive a sinewave oscillators & sum up

• ‘Regularize’ to exactly harmonic fk[n] = k·f0[n]

0 0.05 0.1 0.15 0.2500

600

7000

1

2

3

0 0.05 0.1 0.15 0.2time / s

time / sfreq

/ Hz

leve

l

-3

-2

-1

0

1

2

3

Ak[n]·cos(2πfk[n]·t)

fk[n]

Ak[n]

n

0 0.05 0.1 0.15 0.20

2000

4000

6000

0 0.05 0.1 0.15 0.2550

600

650

700

time / s time / s

freq

/ H

z

freq

/ H

z

whatto do?


Modification in sinewave resynthesis

• Change duration by warping timebase- may want to keep onset unwarped

• Change pitch by scaling frequencies- either stretching or resampling envelope

• Change timbre by interpolating params

0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.50

1000

2000

3000

4000

5000

time / s

freq

/ H

z

0 1000 2000 3000 40000

10

20

30

40

freq / Hz

leve

l / d

B

0 1000 2000 3000 40000

10

20

30

40

freq / Hzle

vel /

dB


Sinusoids + residual

• Only ‘prominent peaks’ became tracks- remainder of spectral energy was noisy?→ model residual energy with noise

• How to obtain ‘non-harmonic’ spectrum?- zero-out spectrum near extracted peaks?- or: resynthesize (exactly) & subtract waveforms

.. must preserve phase!

• Can model residual signal with LPC→flexible representation of noisy residual

es n[ ] s n[ ] Ak n[ ] 2πn f k n[ ]⋅( )cosk∑–=

0 1000 2000 3000 4000 5000 6000 7000 freq / Hz

mag

/ dB

-80

-60

-40

-20

0

20

original

sinusoids

residualLPC


Sinusoids + noise + transients

• Sound represented as sinusoids and noise:

Parameters are {Ak[n], fk[n]}, hn[n]

• Separate out abrupt transients in residual?

- more specific → more flexible

s n[ ] Ak n[ ] 2πn f k n[ ]⋅( )cosk∑ hn n[ ] b n[ ]*+=

Sinusoids Residual es n[ ]

time / s

freq

/ H

z

0 0.2 0.4 0.60

2000

4000

6000

8000

2000

4000

6000

8000

0 0.2 0.4 0.60

2000

4000

6000

8000

{Ak[n], fk[n]}

hn[n]

es n[ ] tk n[ ]k∑ hn n[ ] b' n[ ]*+=


Outline

Music and nonspeech



Sinewave synthesis

Music analysis- Instrument identification- Pitch tracking

1

2

3

4

5


Music analysis

• What might we want to get out of music?

• Instrument identification- different levels of specificity- ‘registers’ within instruments

• Score recovery- transcribe the note sequence- extract the ‘performance’

• Ensemble performance- ‘gestalts’: chords, tone colors

• Broader timescales- phrasing & musical structure- artist / genre clustering and classification

5


Instrument identification

• Research looks for perceptual ‘timbre space’

• Cues to instrument identification- onset (rise time), sustain (brightness)

• Hierarchy of instrument families- strings / reeds / brass- optimize features at each level

bright

dull

low fluxhi flux

low attack

hi attack

procedure?


Pitch tracking• Fundamental frequency (→ pitch)

is a key attribute of musical sounds→pitch tracking as a key technology

• Pitch tracking for speech- voice pitch & spectrum highly dynamic- speech is voiced and unvoiced ground truth?

• Applications- voice coders (excitation description)- harmonic modeling


Pitch tracking for music

• Pitch in music- pitch is more stable (although vibrato)- but: multiple pitches

• Applications- harmonic modeling- music transcription (→ storage, resynthesis)- source separation

• Approaches: “place” & “time”

Time

Fre

quen

cy

0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 50

1000

2000

3000

4000

??


Meddis & Hewitt pitch model

• Autocorrelation (time) based pitch extraction- fundamental period → peak(s) in autocorrelation

• Compute separately in each frequency band& ‘summarize’ across (perceptual) channels

x t( ) x t T+( )≈ rxx T( ) x t( )x t T+( )∫= max≈→

0 20 40 60 80 100-0.2-0.1

00.10.2

0 20 40 60 80 100-10123

time / samples lag / samples

Waveform x[n] Autocorrelation rxx[l]

80

328

866

1924

4000CF / Hz Autocorrelogram

2.5 5.0 7.5 10.0 12.50

1

lag / ms

Ban

dpas

sfil

ters

Rec

tific

atio

n &

low

-pas

s fil

ter

Periodicity detection

Cross-channel sum

sound

Summary ACG


Tolonen & Karjalainen simplification• Multiple frequency channels

can have different dominant pitches ...

• But equalizing (flattening) the spectrum works:

→ Summary AC as a function of time:

- ‘Enhancement’ = cancel subharmonics

Pre-whiteningsound

Highpass@ 1kHz

Rectify &low-pass

Lowpass@ 1kHz

Periodicitydetection

SACFenhance

ESACFPeriodicitydetection

+

time/s

50

100

200

400

1000f/Hz Periodogram for M/F voice mix

0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8Summary autocorrelation at t=0.775 s

0.001 0.002 0.003 0.004 0.006 0.01 0.02 lag/s

200 Hz(0.005s)

125 Hz(0.008s)

lag vs.freq?


Post-processing of pitch tracks

• Remove outliers with median filtering

• Octave errors are common:- if x(t) ≈ x(t + T ) then x(t) ≈ x(t + 2T ) etc.

→ dynamic programming/HMM

• Validity- “is there a pitch at this time?”- voiced/unvoiced decision for speech

• Event detection- when does a pitch slide indicate a new note?

time

5-pt median


Summary

• ‘Nonspeech audio’- i.e. sound in general- characteristics: ecological

• Music synthesis- control of pitch, duration, loudness, articulation- evolution of techniques- sinusoids + noise + transients

• Music analysis- different aspects: instruments, pitches,

performance

and beyond?

Lecture 6: Nonspeech and Musicdpwe/e6820/lectures/E... · E6820 SAPR - Dan Ellis L06 - Nonspeech &...

Documents

Transcript of Lecture 6: Nonspeech and Musicdpwe/e6820/lectures/E... · E6820 SAPR - Dan Ellis L06 - Nonspeech &...