Extracting Information from Sound - LabROSA - Aboutdpwe/talks/DMASM... · Information from Sound -...

Information from Sound - Dan Ellis 2011-03-01 /191

1. Machine Listening2. Global Classification3. Foreground & Transients4. Outstanding Issues

Extracting Informationfrom Sound

Dan EllisLaboratory for Recognition and Organization of Speech and Audio

Dept. Electrical Eng., Columbia Univ., NY USA

[email protected] http://labrosa.ee.columbia.edu/

mailto:[email protected]:[email protected]://labrosa.ee.columbia.eduhttp://labrosa.ee.columbia.edu


1. Machine Listening

2

• Extracting useful information from sound... like animals doDectect

Classify

Describe

EnvironmentalSound

Speech Music

Task

Domain

ASR MusicTranscription

MusicRecommendation

VAD Speech/Music

Emotion

“Sound Intelligence”

EnvironmentAwareness

AutomaticNarration


Listening to Mixtures

• The world is cluttered& sound is transparentmixtures are inevitable

• Useful information is structured by ‘sources’specific definition of a ‘source’:intentional independence

3


Applications• Audio Lifelog

Diarization

• Consumer Video Classification

4

09:0009:3010:0010:3011:0011:3012:0012:3013:0013:3014:0014:3015:0015:3016:00

16:3017:0017:3018:00

preschool

cafepreschoolcafelecture

officeofficeoutdoorgroup

lab

cafemeeting2

office

office

outdoorlabcafe

Ron

Manuel

Arroyo?

Lesser

2004-09-13

L2cafeoffice

outdoorlecture

outdoormeeting2

outdoorofficecafeoffice

outdoorofficepostlecoffice

DSP03

compmtg

Mike

Sambarta?

2004-09-14


Consumer Video Dataset• 25 “concepts” from

Kodak user studyboat, crowd, cheer, dance, ...

• Grab top 200 videos from YouTube searchthen filter for quality, unedited = 1873 videosmanually relabel with concepts

• Concept overlap:5

8GMM+Bha (9/15)

pLSA500+lognorm (12/15)

1G+KL2 (10/15)

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Overlapped Concepts

group of 3+musiccrowdcheer

singingone person

nightgroup of two

showdancing

beachparkbaby

playgroundparade

boatsports

graduation

sunsetbirthday

animalwedding

picnicmuseum

ski

03 m c c s o n g s d b p b p p b s g s s b a w p m

La

be

led

Co

nce

pts


2. Global Classification• Baseline for soundtrack classification

divide sound into short frames (e.g. 30 ms)calculate features (e.g. MFCC) for each framedescribe clip by statistics of frames (mean, covariance)= “bag of features”

• Classify by e.g. Mahalanobis distance + SVM6

VTS_04_0001 - Spectrogram

freq

/ kH

z

1 2 3 4 5 6 7 8 9012345678

-20

-10

0

10

20

30

time / sec

time / sec

level / dB

value

MFC

C b

in

1 2 3 4 5 6 7 8 92468

101214161820

-20-15-10-505101520

Classify by e.g. Mahalanobis distance + SVMMFCC dimension

MFC

C d

imen

sion

MFCC covariance

5 10 15 20

2

4

6

8

10

12

14

16

18

20

-50

0

50

Video Soundtrack

MFCCfeatures

MFCCCovariance

Matrix


Codebook Histograms• Convert nonplanar distributions to multinomial

• Classify by distance on histogramsKL, Chi-squared+ SVM

7

-20 -10 0 10 20-10

-8

-6

-4

-2

0

2

4

6

8

MFCC(0)

MFC

C(1

)

12

34

56

7

89

10

11

1213

14

15

1 2 3 4 5 6 7 8 9 10 11 12 13 14 150

50

100

150

GMM mixture

coun

t

MFCCfeatures

Global Gaussian

Mixture ModelPer-Category

Mixture ComponentHistogram


Latent Semantic Analysis (LSA)• Probabilistic LSA (pLSA) models each histogram

as a mixture of several ‘topics’.. each clip may have several things going on

• Topic sets optimized through EMp(ftr | clip) = ∑topics p(ftr | topic) p(topic | clip)

use (normalized?) p(topic | clip) as per-clip features8

GMM histogram ftrs GMM histogram ftrs“Topic”

“Top

ic”

AV C

lip

AV C

lipp(ftr | clip)

p(topic

| clip

) p(ftr | topic)

=

*


Global Classification Results

• Wide range in performanceaudio (music, ski) vs. non-audio (group, night)large AP uncertainty on infrequent classes

9

Lee & Ellis ’10

Concept classgro

up of 3

+mu

siccro

wd cheersin

ging

one pe

rson

night

group

of two

showdan

cing

beach par

kbab

y

playgr

oundpar

ade boatspo

rts

gradua

tion skisun

set

birthda

yani

mal

weddi

ngpic

nic

museu

mME

AN0

0.10.20.30.40.50.60.70.80.9

1

Aver

age

Prec

ision Guessing

MFCC + GMM ClassifierSingle−Gaussian + KL28−GMM + Bha1024−GMM Histogram + 500−log(P(z|c)/Pz))


3. Foreground & Transients• Global vs. local class models

tell-tale acoustics may be ‘washed out’ in statisticstry iterative realignment of HMMs:

“background” model shared by all clips

10

Old Way:All frames contribute

New Way:Limited temporal extents

time / s

freq

/ kH

z

YT baby 002: voice baby laugh

5 10 150

1

2

3

4

voice babylaugh

time / s

freq

/ kH

z

5 10 150

1

2

3

4

voice laugh

laugh

bg

bg

voice

baby

baby


Landmark-based Fingerprints

• Sound characterized by time-frequency peaksrobust to channel, background

relies on precisetiming

11

Query audio

0

500

1000

1500

2000

2500

3000

3500

4000

Match: 05−Full Circle at 0.032 sec

0.5 1 1.5 2 2.5 3 3.5 4 4.5 50

500

1000

1500

2000

2500

3000

3500

4000

time / sec

freq

/ Hz

Shazam ’03


VIdeo IMpLQaiHWbE at 195s

VIdeo Yi1hkNkqHBc at 218 s

195.5 196 196.5 197 197.5 198 198.5 199

218.5 219 219.5 220 220.5 221 221.5 2220

1

2

3

4

freq /

kHz

0

1

2

3

4

freq /

kHz

time / sec

time / sec

Soundtrack Fingerprint Matching

• Landmark pairs are a noise-robust fingerprint

• Use to match distinct videos with same sound ambience

12

Cotton & Ellis ’10

0

1

2


Event Landmark Signatures• Build index of Gabor neighbor pairs

recognize repeated events with similar pairs

13

Cotton & Ellis ’09


Transient Features

• Onset detectorfinds energy burstsbest SNR

• PCA basis to represent each300 ms x auditory frq

• “bag of transients”

14

Cotton, Ellis, Loui ’11


4. Outstanding Issues

• How to define “transients”?• How to separate foreground & background?• How to exploit prior knowledge of sounds?• How to make classification discriminative?

• Large-scale soundtrack classification

15


Nonnegative Matrix Factorization• Decompose spectrograms into

templates + activation

fast & forgiving gradient descent algorithm2D patchessparsity control...

16

Smaragdis & Brown ’03Abdallah & Plumbley ’04

Virtanen ’07

X = W · H

Original mixture

Basis 1(L2)

Basis 2(L1)

Basis 3(L1)

freq

/ Hz

442883

1398220334745478

0 1 2 3 4 5 6 7 8 9 10time / s


Sound Textures• Characterize sounds by

perceptually-sufficient statistics

.. verified by matched resynthesis

• Subbanddistributions & env x-corrsMahalanobis distance ...

17

McDermott & Simoncelli ’09Ellis, Zhang, McDermott ’11

SoundAutomatic

gaincontrol

Envelopecorrelation

Cross-bandcorrelations

(318 samples)

Modulationenergy(18 x 6)

mean, var,skew, kurt(18 x 4)

melfilterbank

(18 chans) xx

xxxx

FFT Octave bins0.5,1,2,4,8,16 HzHistogram

617

1273

2404

5

10

15

0.5 2 8 32

0.5 2 8 320.5 2

8 32

5 10 15 8 32

5 10 15

0level

1

freq

/ Hz

mel

ban

d

1159_10 urban cheer clap

Texture features

1062_60 quiet dubbed speech music

0 2 4 6 8 10time / s

moments mod frq / Hz mel band moments mod frq / Hz mel band

0 2 4 6 8 10

K S V MK S V M


Real-World Dictionary• BBC Sound Effects as reference library

18BEH I T A A MEC RB I BRAS EE W S S A A C B H AAEGA L L F HHSST A C CE I B CSF EE I I F FURHT

BBC Sound EffectsExterior AtmosphereHouseholdInterior Atmosphere

TransportationAnimals and BirdsAudiences Children Crowds Foots

MiscEuropean SoundscapesComputers Printers Phones

Rivers Streams WaterBirdsIndustry

Big Ben Taxi Bus AtmospheresRural SoundscapesAutoSport Leisure

Explosions Guns AlarmsElectronically Generated SoundsWeather 1

Ships And Boats 1Ships And Boats 2

AmericaAircraftChina

BabiesHospitals

Africa The Human WorldAfrica The Natural WorldEquestrian EventsGreeceAdventure SportsLivestock 1

Livestock 2Farm MachineryHorsesHorses & Dogs

Schools & CrowdsSpainTrains

Age Of SteamConstruction

CatsEmergencyIstanbulBirdsCrowdsSuburbiaFranceEngland

Exterior Atmospheres-Rural BackgroundIndia & Nepal-City LifeIndia Pakistan Nepal-CountrysidFootsteps 1

Footsteps 2Urban South AmericaRural South AmericaHungary

The Czech Republic - Slovakia

1000+ examples ... comprehensive?

similarity via normalized textures (over 10s chunks)


Summary• Machine Listening:

Getting useful information from sound

• Environmental sound classification... from whole-clip statistics?

• Transients & energy peaks... separate foreground & background

• Useful classification of unconstrained audio... to combine with video analysis

19

Extracting Information from Sound - LabROSA - Aboutdpwe/talks/DMASM... · Information from Sound -...

Documents

Transcript of Extracting Information from Sound - LabROSA - Aboutdpwe/talks/DMASM... · Information from Sound -...