-
Information from Sound - Dan Ellis 2011-03-01 /19
1. Machine Listening
2. Global Classification
3. Foreground & Transients
4. Outstanding Issues
Extracting Information from Sound
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
http://labrosa.ee.columbia.edu/
-
1. Machine Listening
• Extracting useful information from sound... like animals do
[Figure: machine listening applications arranged by Task (Detect, Classify, Describe) and Domain (Environmental Sound, Speech, Music). Applications shown: ASR, Music Transcription, Music Recommendation, VAD, Speech/Music, Emotion, “Sound Intelligence”, Environment Awareness, Automatic Narration]
-
Listening to Mixtures
• The world is cluttered & sound is transparent
  mixtures are inevitable
• Useful information is structured by ‘sources’
  specific definition of a ‘source’: intentional independence
-
Applications
• Audio Lifelog Diarization
• Consumer Video Classification
[Figure: audio lifelog diarization timelines for 2004-09-13 and 2004-09-14, 09:00 to 18:00, segmented by environment (preschool, cafe, lecture, office, outdoor, group, lab, meeting2, L2, postlec, DSP03, compmtg) and annotated with names (Ron, Manuel, Arroyo?, Lesser, Mike, Sambarta?)]
-
Consumer Video Dataset
• 25 “concepts” from Kodak user study
  boat, crowd, cheer, dance, ...
• Grab top 200 videos from YouTube search
  then filter for quality, unedited = 1873 videos
  manually relabel with concepts
• Concept overlap:
[Figure: concept overlap matrix, Labeled Concepts vs. Overlapped Concepts, scale 0 to 1. Concepts: group of 3+, music, crowd, cheer, singing, one person, night, group of two, show, dancing, beach, park, baby, playground, parade, boat, sports, graduation, ski, sunset, birthday, animal, wedding, picnic, museum. Legend entries extracted with this slide: 8GMM+Bha (9/15), pLSA500+lognorm (12/15), 1G+KL2 (10/15)]
-
2. Global Classification
• Baseline for soundtrack classification
  divide sound into short frames (e.g. 30 ms)
  calculate features (e.g. MFCC) for each frame
  describe clip by statistics of frames (mean, covariance) = “bag of features”
• Classify by e.g. Mahalanobis distance + SVM
[Figure: video soundtrack VTS_04_0001 shown as a spectrogram (freq / kHz vs. time / sec, level / dB), as MFCC features (MFCC bin vs. time / sec), and as an MFCC covariance matrix (MFCC dimension vs. MFCC dimension)]
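The “bag of features” baseline can be sketched in a few lines of Python. This is a minimal illustration, not the code behind the slide: the frame values are made-up stand-ins for MFCC vectors, the covariance is simplified to a diagonal (the slide uses the full covariance matrix), and the SVM stage is left out. The names `clip_statistics` and `mahalanobis_diag` are hypothetical.

```python
import math

def clip_statistics(frames):
    """Summarize a clip by the per-dimension mean and variance of its frames."""
    n = len(frames)
    dims = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dims)]
    var = [sum((f[d] - mean[d]) ** 2 for f in frames) / n for d in range(dims)]
    return mean, var

def mahalanobis_diag(mean_a, mean_b, pooled_var, floor=1e-9):
    """Mahalanobis distance between two clip means, assuming a diagonal covariance."""
    return math.sqrt(sum((a - b) ** 2 / max(v, floor)
                         for a, b, v in zip(mean_a, mean_b, pooled_var)))

# Toy two-dimensional "MFCC-like" frames (hypothetical values), three frames per clip.
clip_a = [[0.0, 1.0], [2.0, 3.0], [4.0, 2.0]]
clip_b = [[1.0, 3.0], [3.0, 5.0], [5.0, 4.0]]

mean_a, var_a = clip_statistics(clip_a)
mean_b, var_b = clip_statistics(clip_b)
pooled = [(va + vb) / 2 for va, vb in zip(var_a, var_b)]  # shared variance estimate
dist = mahalanobis_diag(mean_a, mean_b, pooled)
print(mean_a, mean_b, round(dist, 3))
```

Because every frame contributes equally to the mean and variance, the order of frames is discarded, which is what makes this a “bag” representation.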
-
Codebook Histograms
• Convert nonplanar distributions to multinomial
• Classify by distance on histograms
  KL, Chi-squared
  + SVM
[Figure: MFCC features plotted as MFCC(0) vs. MFCC(1) with the 15 numbered components of a global Gaussian mixture model, and the resulting per-category mixture component histogram (count vs. GMM mixture, 1 to 15)]
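A rough sketch of the histogram idea, under stated assumptions: frames are assigned to the nearest codeword by Euclidean distance, which is a simplified stand-in for assigning them to components of the global Gaussian mixture model on the slide. The codebook and frame values are invented for illustration, and the function names are hypothetical.

```python
import math

def nearest_codeword(frame, codebook):
    """Index of the codeword closest to the frame (squared Euclidean distance)."""
    dists = [sum((x - c) ** 2 for x, c in zip(frame, code)) for code in codebook]
    return dists.index(min(dists))

def codebook_histogram(frames, codebook):
    """Normalized count of frames assigned to each codeword."""
    counts = [0] * len(codebook)
    for frame in frames:
        counts[nearest_codeword(frame, codebook)] += 1
    return [c / len(frames) for c in counts]

def chi_squared(p, q):
    """Chi-squared distance between two histograms, skipping bins empty in both."""
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(p, q) if a + b > 0)

def symmetric_kl(p, q, eps=1e-6):
    """KL(p||q) + KL(q||p), with a small floor so empty bins stay finite."""
    ps = [a + eps for a in p]
    qs = [b + eps for b in q]
    zp, zq = sum(ps), sum(qs)
    ps = [a / zp for a in ps]
    qs = [b / zq for b in qs]
    return sum((a - b) * math.log(a / b) for a, b in zip(ps, qs))

# Hypothetical 3-entry codebook and two toy clips of 2-D frames.
codebook = [[0.0, 0.0], [10.0, 10.0], [-10.0, 5.0]]
clip_x = [[1.0, 0.0], [0.0, -1.0], [9.0, 11.0], [-9.0, 4.0]]
clip_y = [[10.0, 9.0], [11.0, 10.0], [0.0, 1.0], [9.0, 9.0]]

hist_x = codebook_histogram(clip_x, codebook)
hist_y = codebook_histogram(clip_y, codebook)
chi = chi_squared(hist_x, hist_y)
skl = symmetric_kl(hist_x, hist_y)
print(hist_x, hist_y, round(chi, 4), round(skl, 4))
```

Either distance could then be turned into a kernel for the SVM stage, which is not shown here.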
-
Latent Semantic Analysis (LSA)
• Probabilistic LSA (pLSA) models each histogram as a mixture of several ‘topics’
  .. each clip may have several things going on
• Topic sets optimized through EM
  p(ftr | clip) = ∑_topics p(ftr | topic) p(topic | clip)
  use (normalized?) p(topic | clip) as per-clip features
[Figure: the decomposition drawn as a matrix product: p(ftr | clip) (AV clip by GMM histogram ftrs) = p(topic | clip) (AV clip by “topic”) * p(ftr | topic) (“topic” by GMM histogram ftrs)]
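The EM fit of p(ftr | clip) = ∑_topics p(ftr | topic) p(topic | clip) can be sketched as below. This is a minimal pure-Python illustration on invented count data, not the implementation used for the reported results; the random initialization, the small floor on the accumulators, and the names `plsa` and `log_likelihood` are all assumptions.

```python
import math
import random

def normalized(row):
    total = sum(row)
    return [v / total for v in row]

def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit p(ftr | topic) and p(topic | clip) to a clip-by-feature count matrix."""
    rng = random.Random(seed)
    n_clips, n_ftrs = len(counts), len(counts[0])
    p_ftr_topic = [normalized([rng.random() + 0.1 for _ in range(n_ftrs)])
                   for _ in range(n_topics)]
    p_topic_clip = [normalized([rng.random() + 0.1 for _ in range(n_topics)])
                    for _ in range(n_clips)]
    for _ in range(n_iter):
        # Accumulators start at a tiny floor so no probability becomes exactly zero.
        new_ft = [[1e-12] * n_ftrs for _ in range(n_topics)]
        new_tc = [[1e-12] * n_topics for _ in range(n_clips)]
        for c in range(n_clips):
            for f in range(n_ftrs):
                if counts[c][f] == 0:
                    continue
                # E-step: posterior over topics for this (clip, feature) pair.
                joint = [p_ftr_topic[t][f] * p_topic_clip[c][t] for t in range(n_topics)]
                z = sum(joint)
                # M-step accumulation: expected counts per topic.
                for t in range(n_topics):
                    w = counts[c][f] * joint[t] / z
                    new_ft[t][f] += w
                    new_tc[c][t] += w
        p_ftr_topic = [normalized(r) for r in new_ft]
        p_topic_clip = [normalized(r) for r in new_tc]
    return p_ftr_topic, p_topic_clip

def log_likelihood(counts, p_ftr_topic, p_topic_clip):
    total = 0.0
    for c, row in enumerate(counts):
        for f, n in enumerate(row):
            if n:
                p = sum(p_ftr_topic[t][f] * p_topic_clip[c][t]
                        for t in range(len(p_ftr_topic)))
                total += n * math.log(p)
    return total

# Toy histograms: clips 0-1 use features 0-1, clips 2-3 use features 2-3.
counts = [[10, 8, 0, 0], [9, 11, 0, 0], [0, 0, 7, 12], [0, 0, 10, 9]]
p_ft_1, p_tc_1 = plsa(counts, n_topics=2, n_iter=1)
p_ft, p_tc = plsa(counts, n_topics=2, n_iter=50)
ll_1 = log_likelihood(counts, p_ft_1, p_tc_1)
ll_50 = log_likelihood(counts, p_ft, p_tc)
print(ll_1, ll_50)
print(p_tc)  # rows are p(topic | clip), usable as per-clip features
```

Each row of `p_tc` sums to one, so it can be passed to a classifier directly as the per-clip feature vector mentioned on the slide.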
-
Global Classification Results
• Wide range in performance
  audio (music, ski) vs. non-audio (group, night)
  large AP uncertainty on infrequent classes
Lee & Ellis ’10
[Figure: Average Precision (0 to 1) per concept class (group of 3+, music, crowd, cheer, singing, one person, night, group of two, show, dancing, beach, park, baby, playground, parade, boat, sports, graduation, ski, sunset, birthday, animal, wedding, picnic, museum) and MEAN. Legend: Guessing; MFCC + GMM Classifier; Single−Gaussian + KL2; 8−GMM + Bha; 1024−GMM Histogram + 500−log(P(z|c)/Pz))]
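For reference, Average Precision (AP) for one concept can be computed from classifier scores as sketched below. This uses the common definition (mean of the precision values at the rank of each positive clip); whether the reported results use exactly this variant is an assumption. The scores and labels are invented.

```python
def average_precision(scores, labels):
    """Mean of precision values measured at the rank of each positive item."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

# Four clips scored for one concept; clips 1 and 3 are true positives.
ap = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
print(ap)  # positives at ranks 1 and 3: (1/1 + 2/3) / 2
```

With only a handful of positive clips, moving a single clip up or down the ranking changes AP substantially, which is consistent with the large AP uncertainty noted for infrequent classes.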
-
3. Foreground & Transients
• Global vs. local class models
  tell-tale acoustics may be ‘washed out’ in statistics
  try iterative realignment of HMMs:
  “background” model shared by all clips
[Figure: two spectrograms (freq / kHz vs. time / s) of clip “YT baby 002: voice baby laugh”. Old Way: all frames contribute to the voice, baby, and laugh labels. New Way: limited temporal extents, with segments marked voice, baby, laugh, or bg]
Landmark-based Fingerprints
• Sound characterized by time-frequency peaks
  robust to channel, background
  relies on precise timing
Shazam ’03
[Figure: spectrograms (freq / Hz, 0 to 4000, vs. time / sec) of the query audio and of the match “05−Full Circle at 0.032 sec”]
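The matching step can be illustrated with a small sketch, assuming peaks have already been picked from the spectrogram and are given as (time frame, frequency bin) pairs. Pairs of nearby peaks are hashed as (f1, f2, time difference), and a match is declared at the time offset that collects the most agreeing hashes. The parameter values, peak lists, and function names are hypothetical and are not taken from the Shazam system.

```python
from collections import Counter

def landmark_hashes(peaks, max_dt=3, max_pairs=3):
    """Pair each peak with a few later peaks; each hash (f1, f2, dt) is anchored at t1."""
    peaks = sorted(peaks)
    hashes = []
    for i, (t1, f1) in enumerate(peaks):
        paired = 0
        for t2, f2 in peaks[i + 1:]:
            dt = t2 - t1
            if dt > max_dt:
                break
            if dt > 0:
                hashes.append(((f1, f2, dt), t1))
                paired += 1
                if paired == max_pairs:
                    break
    return hashes

def best_offset(query_peaks, reference_peaks):
    """Return (offset, votes) for the reference-minus-query offset with most matching hashes."""
    index = {}
    for h, t_ref in landmark_hashes(reference_peaks):
        index.setdefault(h, []).append(t_ref)
    votes = Counter()
    for h, t_query in landmark_hashes(query_peaks):
        for t_ref in index.get(h, []):
            votes[t_ref - t_query] += 1
    return votes.most_common(1)[0] if votes else (None, 0)

# Reference track peaks, and a query cut from frame 4 onward with one spurious peak (2, 99).
reference = [(0, 10), (1, 30), (2, 20), (4, 50), (5, 15), (7, 40), (8, 25)]
query = [(0, 50), (1, 15), (2, 99), (3, 40), (4, 25)]
match = best_offset(query, reference)
print(match)
```

The spurious peak only adds hashes that find no partner in the index, which is one way to see why the scheme tolerates added background sound. Because the hash includes the exact time difference, any time stretching breaks the match, as the “relies on precise timing” bullet warns.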
-
Soundtrack Fingerprint Matching
• Landmark pairs are a noise-robust fingerprint
• Use to match distinct videos with same sound ambience
Cotton & Ellis ’10
[Figure: spectrograms (freq / kHz, 0 to 4, vs. time / sec) of video IMpLQaiHWbE at 195 s and video Yi1hkNkqHBc at 218 s]
-
Event Landmark Signatures
• Build index of Gabor neighbor pairs
  recognize repeated events with similar pairs
Cotton & Ellis ’09
-
Transient Features
• Onset detector
  finds energy bursts
  best SNR
• PCA basis to represent each
  300 ms x auditory frq
• “bag of transients”
Cotton, Ellis, Loui ’11
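As a rough illustration of the first step, the sketch below flags an onset wherever frame energy jumps by at least a threshold over the previous frame, then cuts a fixed-length patch starting at each onset. The threshold, minimum gap, and patch width are arbitrary assumptions, the detector in the cited work may differ, and the PCA projection of each patch is not shown.

```python
def detect_onsets(energy_db, threshold_db=6.0, min_gap=3):
    """Frame indices where energy rises by at least threshold_db over the previous frame."""
    onsets = []
    last = -min_gap
    for i in range(1, len(energy_db)):
        rise = energy_db[i] - energy_db[i - 1]
        if rise >= threshold_db and i - last >= min_gap:
            onsets.append(i)
            last = i
    return onsets

def extract_patches(frames, onsets, width):
    """Cut a fixed-length run of frames starting at each onset (drop patches that run off the end)."""
    return [frames[t:t + width] for t in onsets if t + width <= len(frames)]

# Toy per-frame energy in dB with two bursts (hypothetical values).
energy = [0, 0, 1, 12, 11, 10, 2, 1, 9, 20, 19, 3]
frames = [[e] for e in energy]  # stand-in for per-frame auditory spectra
onsets = detect_onsets(energy)
patches = extract_patches(frames, onsets, width=3)
print(onsets, patches)
```

Each patch would then be described by its coordinates in a PCA basis, and a clip by the collection of those descriptions, giving the “bag of transients”.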
-
4. Outstanding Issues
• How to define “transients”?
• How to separate foreground & background?
• How to exploit prior knowledge of sounds?
• How to make classification discriminative?
• Large-scale soundtrack classification
-
Nonnegative Matrix Factorization
• Decompose spectrograms into templates + activation
  X = W · H
  fast & forgiving gradient descent algorithm
  2D patches
  sparsity control...
Smaragdis & Brown ’03; Abdallah & Plumbley ’04; Virtanen ’07
[Figure: original mixture spectrogram (freq / Hz vs. time / s, 0 to 10 s) with Basis 1 (L2), Basis 2 (L1), Basis 3 (L1)]
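A minimal sketch of the factorization X = W · H, using the widely known multiplicative updates for the squared-error objective (these can be read as gradient descent with an adaptive step size). The slide does not say which update rule or cost function was used, so that choice is an assumption, as are the toy matrix and the helper names. Sparsity control and 2D patches are not included.

```python
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frobenius_error(x, w, h):
    """Sum of squared differences between X and W . H."""
    approx = matmul(w, h)
    return sum((xv - av) ** 2 for xr, ar in zip(x, approx) for xv, av in zip(xr, ar))

def nmf(x, rank, n_iter=200, seed=0, eps=1e-9):
    """Factor nonnegative X (freq x time) into templates W and activations H."""
    rng = random.Random(seed)
    n_freq, n_time = len(x), len(x[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n_freq)]
    h = [[rng.random() + 0.1 for _ in range(n_time)] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W^T X) / (W^T W H), elementwise
        wt = transpose(w)
        num, den = matmul(wt, x), matmul(matmul(wt, w), h)
        h = [[hv * n / (d + eps) for hv, n, d in zip(hr, nr, dr)]
             for hr, nr, dr in zip(h, num, den)]
        # W <- W * (X H^T) / (W H H^T), elementwise
        ht = transpose(h)
        num, den = matmul(x, ht), matmul(w, matmul(h, ht))
        w = [[wv * n / (d + eps) for wv, n, d in zip(wr, nr, dr)]
             for wr, nr, dr in zip(w, num, den)]
    return w, h

# Toy "spectrogram": two spectral templates active at alternating times.
x = [[4.0, 0.0, 4.0, 0.0],
     [2.0, 0.0, 2.0, 0.0],
     [0.0, 3.0, 0.0, 3.0],
     [0.0, 6.0, 0.0, 6.0]]
w0, h0 = nmf(x, rank=2, n_iter=0)   # random initialization only
w, h = nmf(x, rank=2, n_iter=200)
err_start = frobenius_error(x, w0, h0)
err_end = frobenius_error(x, w, h)
print(err_start, err_end)
```

Because the updates only multiply by nonnegative ratios, W and H stay nonnegative without any explicit constraint handling, which is part of what makes the method forgiving.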
-
Sound Textures
• Characterize sounds by perceptually-sufficient statistics
  .. verified by matched resynthesis
• Subband distributions & env x-corrs
  Mahalanobis distance ...
McDermott & Simoncelli ’09; Ellis, Zhang, McDermott ’11
[Figure: texture feature block diagram with blocks labeled Sound, Automatic gain control, mel filterbank (18 chans), Histogram, mean/var/skew/kurt (18 x 4), FFT, Octave bins 0.5, 1, 2, 4, 8, 16 Hz, Modulation energy (18 x 6), Envelope correlation, and Cross-band correlations (318 samples). Below it, spectrograms (freq / Hz vs. time / s, 0 to 10 s) and texture features (moments, mod frq / Hz, mel band) for clips “1159_10 urban cheer clap” and “1062_60 quiet dubbed speech music”]
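Two of the texture statistics, the per-band envelope moments and the cross-band envelope correlation, can be sketched as follows. The envelopes here are short invented sequences standing in for mel-band envelopes. Automatic gain control, the filterbank, and the modulation-energy features are not shown, and the exact normalization used in the cited work is an assumption.

```python
import math

def envelope_moments(env):
    """Mean, variance, skewness, and kurtosis of one subband envelope."""
    n = len(env)
    mean = sum(env) / n
    var = sum((v - mean) ** 2 for v in env) / n
    std = math.sqrt(var)
    if std == 0:
        return mean, var, 0.0, 0.0
    skew = sum(((v - mean) / std) ** 3 for v in env) / n
    kurt = sum(((v - mean) / std) ** 4 for v in env) / n
    return mean, var, skew, kurt

def correlation(a, b):
    """Normalized correlation between two subband envelopes."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / n
    sa = math.sqrt(sum((x - ma) ** 2 for x in a) / n)
    sb = math.sqrt(sum((y - mb) ** 2 for y in b) / n)
    return cov / (sa * sb) if sa > 0 and sb > 0 else 0.0

# Toy envelopes for two bands (hypothetical values).
band_1 = [1.0, 2.0, 3.0, 4.0, 5.0]
band_2 = [2.0, 4.0, 6.0, 8.0, 10.0]
mean, var, skew, kurt = envelope_moments(band_1)
xcorr = correlation(band_1, band_2)
print(mean, var, skew, kurt, xcorr)
```

With 18 mel bands, the four moments per band give the 18 x 4 block of features shown in the figure.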
-
Real-World Dictionary
• BBC Sound Effects as reference library
  1000+ examples ... comprehensive?
  similarity via normalized textures (over 10s chunks)
[Figure: plot labeled with BBC Sound Effects categories: Exterior Atmosphere, Household, Interior Atmosphere, Transportation, Animals and Birds, Audiences Children Crowds Foots, Misc, European Soundscapes, Computers Printers Phones, Rivers Streams Water, Birds, Industry, Big Ben Taxi Bus Atmospheres, Rural Soundscapes, AutoSport Leisure, Explosions Guns Alarms, Electronically Generated Sounds, Weather 1, Ships And Boats 1, Ships And Boats 2, America, Aircraft, China, Babies, Hospitals, Africa The Human World, Africa The Natural World, Equestrian Events, Greece, Adventure Sports, Livestock 1, Livestock 2, Farm Machinery, Horses, Horses & Dogs, Schools & Crowds, Spain, Trains, Age Of Steam, Construction, Cats, Emergency, Istanbul, Birds, Crowds, Suburbia, France, England, Exterior Atmospheres-Rural Background, India & Nepal-City Life, India Pakistan Nepal-Countrysid, Footsteps 1, Footsteps 2, Urban South America, Rural South America, Hungary, The Czech Republic - Slovakia]
-
Summary
• Machine Listening:
  Getting useful information from sound
• Environmental sound classification
  ... from whole-clip statistics?
• Transients & energy peaks
  ... separate foreground & background
• Useful classification of unconstrained audio
  ... to combine with video analysis