Extracting Information from Sound - LabROSA

  • Extracting Information from Sound

    Dan Ellis
    Laboratory for Recognition and Organization of Speech and Audio
    Dept. Electrical Eng., Columbia Univ., NY USA
    [email protected] http://labrosa.ee.columbia.edu/
    2011-03-01

    1. Machine Listening
    2. Global Classification
    3. Foreground & Transients
    4. Outstanding Issues


  • 1. Machine Listening

    • Extracting useful information from sound... like animals do

    [Figure: machine listening applications arranged on a Task x Domain grid. Task: Detect, Classify, Describe. Domain: Environmental Sound, Speech, Music. Applications shown: ASR, Music Transcription, Music Recommendation, VAD, Speech/Music, Emotion, “Sound Intelligence”, Environment Awareness, Automatic Narration]

  • Listening to Mixtures

    • The world is cluttered & sound is transparent
      mixtures are inevitable

    • Useful information is structured by ‘sources’
      specific definition of a ‘source’: intentional independence

  • Applications

    • Audio Lifelog Diarization

    • Consumer Video Classification

    [Figure: audio lifelog diarization timelines for 2004-09-13 and 2004-09-14, 09:00 to 18:00 in 30-minute steps. Segments are labeled by environment (preschool, cafe, lecture, office, outdoor, group, lab, meeting2, postlec) and annotated with Ron, Manuel, Arroyo?, Lesser, L2, DSP03, compmtg, Mike, Sambarta?]

  • Consumer Video Dataset

    • 25 “concepts” from Kodak user study
      boat, crowd, cheer, dance, ...

    • Grab top 200 videos from YouTube search
      then filter for quality, unedited = 1873 videos
      manually relabel with concepts

    • Concept overlap:

    [Figure: “Overlapped Concepts” matrix, labeled concepts vs. overlapped concepts, color scale 0.1 to 1. Concepts: group of 3+, music, crowd, cheer, singing, one person, night, group of two, show, dancing, beach, park, baby, playground, parade, boat, sports, graduation, sunset, birthday, animal, wedding, picnic, museum, ski. Additional labels in the figure: 8GMM+Bha (9/15), pLSA500+lognorm (12/15), 1G+KL2 (10/15)]

  • 2. Global Classification

    • Baseline for soundtrack classification
      divide sound into short frames (e.g. 30 ms)
      calculate features (e.g. MFCC) for each frame
      describe clip by statistics of frames (mean, covariance) = “bag of features”

    • Classify by e.g. Mahalanobis distance + SVM

    [Figure: processing chain for clip VTS_04_0001. Video Soundtrack → spectrogram (freq / kHz vs. time / sec, level / dB) → MFCC features (MFCC bin vs. time / sec) → MFCC covariance matrix (MFCC dimension vs. MFCC dimension)]
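    The baseline above (frame, compute features, summarize by mean and covariance, compare by Mahalanobis distance) can be sketched roughly as follows. This is a minimal illustration, not the implementation behind the slides: `cepstral_features` is a simplified stand-in for MFCCs (DCT of log energies in equal-width bands rather than mel bands), the function names and the `ridge` regularizer are assumptions, and the SVM stage is omitted.

```python
import numpy as np

def frame_signal(x, sr, frame_ms=30.0, hop_ms=15.0):
    """Slice a 1-D signal into overlapping frames, one frame per row."""
    n = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    return np.stack([x[s:s + n] for s in range(0, len(x) - n + 1, hop)])

def cepstral_features(frames, n_bands=20, n_coef=13):
    """Simplified MFCC stand-in: DCT-II of log energies in equal-width bands."""
    window = np.hanning(frames.shape[1])
    power = np.abs(np.fft.rfft(frames * window, axis=1)) ** 2
    bands = np.array_split(power, n_bands, axis=1)
    log_e = np.log(np.stack([b.sum(axis=1) for b in bands], axis=1) + 1e-10)
    k = np.arange(n_coef)[:, None]
    m = np.arange(n_bands)[None, :]
    dct = np.cos(np.pi * k * (m + 0.5) / n_bands)
    return log_e @ dct.T                      # (n_frames, n_coef)

def clip_statistics(feats):
    """'Bag of features' summary: mean vector and covariance matrix."""
    return feats.mean(axis=0), np.cov(feats, rowvar=False)

def mahalanobis(mu_a, mu_b, cov, ridge=1e-6):
    """Distance between two clip means under a shared covariance."""
    d = mu_a - mu_b
    reg = cov + ridge * np.eye(len(d))        # keep the solve well-conditioned
    return float(np.sqrt(d @ np.linalg.solve(reg, d)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sr = 16000
    noise = rng.standard_normal(sr)
    tone = np.sin(2 * np.pi * 440 * np.arange(sr) / sr) + 0.1 * rng.standard_normal(sr)
    mu_n, cov_n = clip_statistics(cepstral_features(frame_signal(noise, sr)))
    mu_t, cov_t = clip_statistics(cepstral_features(frame_signal(tone, sr)))
    print(mahalanobis(mu_n, mu_t, 0.5 * (cov_n + cov_t)))
```

    In a full system the pairwise distances would typically be turned into a kernel for the SVM; that step is left out here.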

  • Codebook Histograms

    • Convert nonplanar distributions to multinomial

    • Classify by distance on histograms
      KL, Chi-squared + SVM

    [Figure: MFCC features (scatter of MFCC(0) vs. MFCC(1) with mixture components 1 to 15) → global Gaussian mixture model → per-category mixture component histogram (count vs. GMM mixture)]
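    A rough sketch of the codebook-histogram idea: frames are assigned to codewords, each clip becomes a normalized histogram of codeword counts, and clips are compared by a histogram distance. As an assumption for brevity, plain k-means with hard assignment stands in for the global Gaussian mixture model on the slide; all function names are illustrative.

```python
import numpy as np

def assign(frames, centers):
    """Index of the nearest codeword for every frame."""
    d2 = ((frames[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def train_codebook(frames, k, n_iter=20, seed=0):
    """Plain k-means, used here as a simple stand-in for the global GMM."""
    rng = np.random.default_rng(seed)
    centers = frames[rng.choice(len(frames), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(frames, centers)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = frames[labels == j].mean(axis=0)
    return centers

def codebook_histogram(frames, centers):
    """Multinomial description of a clip: fraction of frames per codeword."""
    counts = np.bincount(assign(frames, centers), minlength=len(centers))
    return counts / counts.sum()

def chi_squared(p, q, eps=1e-10):
    return float(0.5 * np.sum((p - q) ** 2 / (p + q + eps)))

def symmetric_kl(p, q, eps=1e-10):
    """KL(p||q) + KL(q||p), with a small floor to avoid log(0)."""
    p, q = p + eps, q + eps
    return float(np.sum((p - q) * np.log(p / q)))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    clip_a = rng.normal(-5.0, 1.0, (200, 2))
    clip_b = rng.normal(5.0, 1.0, (200, 2))
    centers = train_codebook(np.vstack([clip_a, clip_b]), k=4)
    h_a = codebook_histogram(clip_a, centers)
    h_b = codebook_histogram(clip_b, centers)
    print(chi_squared(h_a, h_b), symmetric_kl(h_a, h_b))
```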

  • Latent Semantic Analysis (LSA)

    • Probabilistic LSA (pLSA) models each histogram as a mixture of several ‘topics’
      .. each clip may have several things going on

    • Topic sets optimized through EM
      p(ftr | clip) = ∑_topics p(ftr | topic) p(topic | clip)
      use (normalized?) p(topic | clip) as per-clip features

    [Figure: matrix view of the model. p(ftr | clip) (AV Clip x GMM histogram ftrs) = p(topic | clip) (AV Clip x “Topic”) * p(ftr | topic) (“Topic” x GMM histogram ftrs)]
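    The EM optimization of p(ftr | clip) = ∑_topics p(ftr | topic) p(topic | clip) can be sketched as below. This is a textbook-style pLSA update written for small dense matrices, under the assumption that `counts` holds per-clip codebook histogram counts; the function name, random initialization, and iteration count are illustrative choices, not details from the talk.

```python
import numpy as np

def plsa(counts, n_topics, n_iter=100, seed=0):
    """Fit pLSA by EM.

    counts: (n_clips, n_ftrs) nonnegative histogram counts.
    Returns p(topic | clip) with shape (n_clips, n_topics)
    and p(ftr | topic) with shape (n_topics, n_ftrs).
    """
    rng = np.random.default_rng(seed)
    n_clips, n_ftrs = counts.shape
    p_t_c = rng.random((n_clips, n_topics))
    p_t_c /= p_t_c.sum(axis=1, keepdims=True)
    p_f_t = rng.random((n_topics, n_ftrs))
    p_f_t /= p_f_t.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # E-step: posterior p(topic | clip, ftr), shape (clips, topics, ftrs)
        joint = p_t_c[:, :, None] * p_f_t[None, :, :]
        post = joint / (joint.sum(axis=1, keepdims=True) + 1e-12)
        # M-step: re-estimate both factors from expected counts
        expected = counts[:, None, :] * post
        p_f_t = expected.sum(axis=0)
        p_f_t /= p_f_t.sum(axis=1, keepdims=True) + 1e-12
        p_t_c = expected.sum(axis=2)
        p_t_c /= p_t_c.sum(axis=1, keepdims=True) + 1e-12
    return p_t_c, p_f_t

if __name__ == "__main__":
    counts = np.array([[10, 10, 0, 0],
                       [8, 12, 0, 0],
                       [0, 0, 9, 11],
                       [0, 0, 10, 10]], dtype=float)
    p_t_c, p_f_t = plsa(counts, n_topics=2, n_iter=200)
    print(np.round(p_t_c, 3))       # per-clip topic weights, usable as features
    print(np.round(p_t_c @ p_f_t, 3))  # modeled p(ftr | clip)
```

    The rows of `p_t_c` are what the slide proposes to use as per-clip features.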

  • Global Classification Results

    • Wide range in performance
      audio (music, ski) vs. non-audio (group, night)
      large AP uncertainty on infrequent classes

    Lee & Ellis ’10

    [Figure: Average Precision (0 to 1) per concept class: group of 3+, music, crowd, cheer, singing, one person, night, group of two, show, dancing, beach, park, baby, playground, parade, boat, sports, graduation, ski, sunset, birthday, animal, wedding, picnic, museum, MEAN. Legend: Guessing; MFCC + GMM Classifier; Single-Gaussian + KL2; 8-GMM + Bha; 1024-GMM Histogram + 500-log(P(z|c)/P(z))]
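    For reference, average precision (AP) for one concept can be computed from a ranked list as in this small sketch. It uses the common definition (mean of the precision at the rank of each relevant clip); whether the results on the slide use exactly this variant is an assumption.

```python
def average_precision(scores, labels):
    """AP for one concept: mean precision at the rank of each relevant item."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits = 0
    precisions = []
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

if __name__ == "__main__":
    # Relevant clips ranked 1st and 3rd: AP = (1/1 + 2/3) / 2
    print(average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]))
```

    With only a handful of positive clips, moving a single one up or down the ranking changes AP substantially, which is one way to read the slide's remark about large AP uncertainty on infrequent classes.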

  • 3. Foreground & Transients

    • Global vs. local class models
      tell-tale acoustics may be ‘washed out’ in statistics
      try iterative realignment of HMMs:
      “background” model shared by all clips

    [Figure: two spectrograms (freq / kHz vs. time / s) of clip “YT baby 002: voice baby laugh”. Old Way: all frames contribute to the voice, baby, and laugh labels. New Way: limited temporal extents, with segments assigned to voice, baby, laugh, or background (bg)]

  • Landmark-based Fingerprints

    • Sound characterized by time-frequency peaks
      robust to channel, background
      relies on precise timing

    Shazam ’03

    [Figure: spectrograms (freq / Hz, 0 to 4000, vs. time / sec) of the query audio and of its match, “Match: 05−Full Circle at 0.032 sec”]
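    The landmark idea can be illustrated with a toy sketch: pick spectrogram peaks, pair each peak with a few later ones, hash each pair as (f1, f2, dt), and find a match by voting on the time offset between query and reference. The peak picker, `fan_out`, `max_dt`, and all function names are simplifying assumptions; this is not the Shazam algorithm or the LabROSA code, only the general scheme.

```python
import numpy as np
from collections import Counter

def spectrogram(x, n_fft=256, hop=128):
    win = np.hanning(n_fft)
    frames = [x[s:s + n_fft] * win for s in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1)).T   # (freq, time)

def find_peaks(S, n_peaks_per_frame=3):
    """Keep the strongest few local maxima along frequency in each frame."""
    peaks = []
    for t in range(S.shape[1]):
        col = S[:, t]
        is_max = (col[1:-1] > col[:-2]) & (col[1:-1] >= col[2:])
        bins = np.flatnonzero(is_max) + 1
        top = bins[np.argsort(col[bins])[::-1][:n_peaks_per_frame]]
        peaks.extend((t, int(f)) for f in top)
    return peaks

def landmark_hashes(peaks, fan_out=5, max_dt=20):
    """Pair each peak with a few later peaks: ((f1, f2, dt), t1)."""
    peaks = sorted(peaks)
    hashes = []
    for i, (t1, f1) in enumerate(peaks):
        paired = 0
        for t2, f2 in peaks[i + 1:]:
            dt = t2 - t1
            if dt > max_dt or paired == fan_out:
                break
            if dt > 0:
                hashes.append(((f1, f2, dt), t1))
                paired += 1
    return hashes

def build_index(hashes):
    index = {}
    for h, t in hashes:
        index.setdefault(h, []).append(t)
    return index

def match(query_hashes, index):
    """Vote on (reference time - query time); return (offset, votes)."""
    votes = Counter()
    for h, tq in query_hashes:
        for tr in index.get(h, ()):
            votes[tr - tq] += 1
    return votes.most_common(1)[0] if votes else (None, 0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ref = rng.standard_normal(32000)
    query = ref[40 * 128:40 * 128 + 8000]        # excerpt starting at frame 40
    index = build_index(landmark_hashes(find_peaks(spectrogram(ref))))
    print(match(landmark_hashes(find_peaks(spectrogram(query))), index))
```

    Because a match requires many hashes to agree on one offset, the scheme tolerates missing or spurious peaks but depends on the timing between peaks being preserved.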

  • Soundtrack Fingerprint Matching

    • Landmark pairs are a noise-robust fingerprint

    • Use to match distinct videos with same sound ambience

    Cotton & Ellis ’10

    [Figure: aligned spectrograms (freq / kHz, 0 to 4, vs. time / sec) of video IMpLQaiHWbE at 195 s and video Yi1hkNkqHBc at 218 s]

  • Event Landmark Signatures

    • Build index of Gabor neighbor pairs
      recognize repeated events with similar pairs

    Cotton & Ellis ’09

  • Transient Features

    • Onset detector finds energy bursts
      best SNR

    • PCA basis to represent each
      300 ms x auditory frq

    • “bag of transients”

    Cotton, Ellis, Loui ’11
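    A rough sketch of the transient pipeline: detect onsets as bursts in frame energy, cut a fixed-width time-frequency patch at each onset, and describe the patches in a PCA basis. The onset rule (positive log-energy rise above mean plus a multiple of the standard deviation), the parameter values, and the function names are assumptions for illustration; `width` plays the role of the 300 ms patch length on the slide.

```python
import numpy as np

def onset_frames(S, threshold=2.0, min_gap=5):
    """Frames where the rise in total log-energy stands out as a burst."""
    energy = np.log(S.sum(axis=0) + 1e-10)
    rise = np.maximum(np.diff(energy, prepend=energy[0]), 0.0)
    cutoff = rise.mean() + threshold * rise.std()
    onsets, last = [], -min_gap
    for t in np.flatnonzero(rise > cutoff):
        if t - last >= min_gap:
            onsets.append(int(t))
            last = t
    return onsets

def extract_patches(S, onsets, width):
    """Flattened (freq x width) patch starting at each onset that fits."""
    return np.stack([S[:, t:t + width].ravel()
                     for t in onsets if t + width <= S.shape[1]])

def pca_basis(patches, n_components):
    """Mean patch and leading principal directions (rows)."""
    mean = patches.mean(axis=0)
    _, _, vt = np.linalg.svd(patches - mean, full_matrices=False)
    return mean, vt[:n_components]

def project(patches, mean, basis):
    """Low-dimensional coordinates of each transient."""
    return (patches - mean) @ basis.T

if __name__ == "__main__":
    S = np.full((16, 200), 0.01)          # quiet background
    S[0:8, 20:25] += 1.0                  # low-frequency burst
    S[8:16, 80:85] += 1.0                 # high-frequency burst
    S[:, 140:145] += 1.0                  # broadband burst
    onsets = onset_frames(S)
    patches = extract_patches(S, onsets, width=30)
    mean, basis = pca_basis(patches, n_components=2)
    print(onsets, project(patches, mean, basis).shape)
```

    A “bag of transients” would then summarize a clip by the distribution of these projected coordinates, for example with a codebook histogram.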

  • 4. Outstanding Issues

    • How to define “transients”?
    • How to separate foreground & background?
    • How to exploit prior knowledge of sounds?
    • How to make classification discriminative?

    • Large-scale soundtrack classification

  • Nonnegative Matrix Factorization

    • Decompose spectrograms into templates + activation
      X = W · H
      fast & forgiving gradient descent algorithm
      2D patches
      sparsity control
      ...

    Smaragdis & Brown ’03
    Abdallah & Plumbley ’04
    Virtanen ’07

    [Figure: spectrograms (freq / Hz vs. time / s, 0 to 10 s) of the original mixture and of Basis 1 (L2), Basis 2 (L1), Basis 3 (L1)]
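    As a minimal sketch of X = W · H, the following uses the standard multiplicative updates for the Euclidean cost, which can be read as gradient descent with a particular step size. It is an assumption that this matches the variant used for the figure; the 2D-patch and sparsity extensions mentioned on the slide are not included.

```python
import numpy as np

def nmf(X, n_components, n_iter=200, seed=0, eps=1e-10):
    """Factor a nonnegative matrix X (freq x time) as W @ H.

    W: (freq, n_components) spectral templates.
    H: (n_components, time) activations.
    """
    rng = np.random.default_rng(seed)
    n_freq, n_time = X.shape
    W = rng.random((n_freq, n_components)) + eps
    H = rng.random((n_components, n_time)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep W and H nonnegative by construction
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.random((20, 2)) @ rng.random((2, 30))   # synthetic rank-2 "spectrogram"
    W, H = nmf(X, n_components=2)
    print(np.linalg.norm(X - W @ H) / np.linalg.norm(X))
```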

  • Sound Textures

    • Characterize sounds by perceptually-sufficient statistics
      .. verified by matched resynthesis

    • Subband distributions & env x-corrs
      Mahalanobis distance ...

    McDermott & Simoncelli ’09
    Ellis, Zhang, McDermott ’11

    [Figure: texture feature block diagram with stages: Sound, Automatic gain control, mel filterbank (18 chans), Histogram, mean / var / skew / kurt (18 x 4), FFT, Octave bins 0.5, 1, 2, 4, 8, 16 Hz, Modulation energy (18 x 6), Envelope correlation, Cross-band correlations (318 samples)]

    [Figure: spectrograms (freq / Hz vs. time / s, 0 to 10 s) and texture features (moments, mod frq / Hz, mel band) for clips “1159_10 urban cheer clap” and “1062_60 quiet dubbed speech music”]
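    The three families of texture statistics named in the diagram can be sketched from a matrix of subband envelopes as follows. This assumes the envelopes (`env`, bands x frames) have already been computed by a filterbank; the automatic gain control stage is omitted, the octave bin edges (each bin spanning f to 2f) are an assumption, and the number of cross-band terms produced here is simply all band pairs, which is not claimed to match the count on the slide.

```python
import numpy as np

def envelope_moments(env):
    """Per-band mean, variance, skewness, kurtosis: (n_bands, 4)."""
    mu = env.mean(axis=1)
    centered = env - mu[:, None]
    var = (centered ** 2).mean(axis=1)
    sd = np.sqrt(var) + 1e-12
    skew = (centered ** 3).mean(axis=1) / sd ** 3
    kurt = (centered ** 4).mean(axis=1) / sd ** 4
    return np.stack([mu, var, skew, kurt], axis=1)

def modulation_energy(env, frame_rate, edges=(0.5, 1, 2, 4, 8, 16, 32)):
    """Fraction of each band's modulation energy in octave-wide bins."""
    centered = env - env.mean(axis=1, keepdims=True)
    spec = np.abs(np.fft.rfft(centered, axis=1)) ** 2
    freqs = np.fft.rfftfreq(env.shape[1], d=1.0 / frame_rate)
    bins = [spec[:, (freqs >= lo) & (freqs < hi)].sum(axis=1)
            for lo, hi in zip(edges[:-1], edges[1:])]
    out = np.stack(bins, axis=1)
    return out / (out.sum(axis=1, keepdims=True) + 1e-12)

def cross_band_correlations(env):
    """Correlation between every pair of band envelopes (upper triangle)."""
    c = np.corrcoef(env)
    return c[np.triu_indices(env.shape[0], k=1)]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame_rate = 100.0                              # envelope frames per second
    t = np.arange(1000) / frame_rate
    env = rng.random((18, 1000)) + 0.1
    env[0] = 1.0 + 0.5 * np.sin(2 * np.pi * 5.0 * t)   # 5 Hz modulation in band 0
    print(envelope_moments(env).shape)              # (18, 4)
    print(modulation_energy(env, frame_rate).shape) # (18, 6)
    print(cross_band_correlations(env).shape)       # (153,)
```

    Concatenating these blocks gives one feature vector per clip, which can then be compared with a Mahalanobis distance as the slide suggests.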

  • Real-World Dictionary

    • BBC Sound Effects as reference library
      1000+ examples ... comprehensive?
      similarity via normalized textures (over 10s chunks)

    [Figure: similarity matrix over BBC Sound Effects collections: BBC Sound Effects; Exterior Atmosphere; Household; Interior Atmosphere; Transportation; Animals and Birds; Audiences Children Crowds Foots; Misc; European Soundscapes; Computers Printers Phones; Rivers Streams Water; Birds; Industry; Big Ben Taxi Bus Atmospheres; Rural Soundscapes; Auto; Sport Leisure; Explosions Guns Alarms; Electronically Generated Sounds; Weather 1; Ships And Boats 1; Ships And Boats 2; America; Aircraft; China; Babies; Hospitals; Africa The Human World; Africa The Natural World; Equestrian Events; Greece; Adventure Sports; Livestock 1; Livestock 2; Farm Machinery; Horses; Horses & Dogs; Schools & Crowds; Spain; Trains; Age Of Steam; Construction; Cats; Emergency; Istanbul; Birds; Crowds; Suburbia; France; England; Exterior Atmospheres-Rural Background; India & Nepal-City Life; India Pakistan Nepal-Countrysid; Footsteps 1; Footsteps 2; Urban South America; Rural South America; Hungary; The Czech Republic - Slovakia]

  • Summary

    • Machine Listening:
      Getting useful information from sound

    • Environmental sound classification
      ... from whole-clip statistics?

    • Transients & energy peaks
      ... separate foreground & background

    • Useful classification of unconstrained audio
      ... to combine with video analysis