On the use of sparse time-relative auditory codes for musicpift6080/H09/documents/... · Genre...

17
Smith and Lewicki’s sparse auditory codes Our ISMIR paper Outlook On the use of sparse time-relative auditory codes for music Pierre-Antoine Manzagol, Thierry Bertin-Mahieux, Douglas Eck June 25 th 2008 Pierre-Antoine Manzagol, Thierry Bertin-Mahieux, Douglas Eck Machine Hearing @ Google

Transcript of On the use of sparse time-relative auditory codes for musicpift6080/H09/documents/... · Genre...

Page 1: On the use of sparse time-relative auditory codes for musicpift6080/H09/documents/... · Genre recognition results Learning kernels Encoding Tzanetakis - basic genre statistics classical

Smith and Lewicki’s sparse auditory codesOur ISMIR paper

Outlook

On the use of sparsetime-relative auditory codes for music

Pierre-Antoine Manzagol, Thierry Bertin-Mahieux, Douglas Eck

June 25th 2008

Pierre-Antoine Manzagol, Thierry Bertin-Mahieux, Douglas Eck Machine Hearing @ Google

Page 2: On the use of sparse time-relative auditory codes for musicpift6080/H09/documents/... · Genre recognition results Learning kernels Encoding Tzanetakis - basic genre statistics classical

Smith and Lewicki’s sparse auditory codesOur ISMIR paper

Outlook

What’s wrong with conventional features?

time-frequency tradeoff

sensitive to arbitrary alignment of blocks with signal

Pierre-Antoine Manzagol, Thierry Bertin-Mahieux, Douglas Eck Machine Hearing @ Google

Page 3: On the use of sparse time-relative auditory codes for musicpift6080/H09/documents/... · Genre recognition results Learning kernels Encoding Tzanetakis - basic genre statistics classical

Smith and Lewicki’s sparse auditory codesOur ISMIR paper

Outlook

Sparse Auditory codes

The signal x is modelled as a linear superposition of time shiftablekernels φ that are placed at precise time locations τ with a givenscaling coefficient s. Formally:

x(t) =M∑

m=1

nm∑i=1

smi φm(t − τm

i ) + ε(t),

where m runs over the different kernels, i over the differentinstances of a given kernel and ε represents additive noise. Thelength of the kernels is variable.

Pierre-Antoine Manzagol, Thierry Bertin-Mahieux, Douglas Eck Machine Hearing @ Google

Page 4: On the use of sparse time-relative auditory codes for musicpift6080/H09/documents/... · Genre recognition results Learning kernels Encoding Tzanetakis - basic genre statistics classical

Smith and Lewicki’s sparse auditory codesOur ISMIR paper

Outlook

Gammatones

Gammatone kernels are set according to an equivalent rectangularband (ERB) filter bank cochlear model using Slaney’s auditorytoolbox for Matlab [5], as done in [6].

0 100 200 300 400 500-0.15

-0.10

-0.05

0.00

0.05

0.10

0.15

Pierre-Antoine Manzagol, Thierry Bertin-Mahieux, Douglas Eck Machine Hearing @ Google

Page 5: On the use of sparse time-relative auditory codes for musicpift6080/H09/documents/... · Genre recognition results Learning kernels Encoding Tzanetakis - basic genre statistics classical

Smith and Lewicki’s sparse auditory codesOur ISMIR paper

Outlook

C-major scale piano

Pierre-Antoine Manzagol, Thierry Bertin-Mahieux, Douglas Eck Machine Hearing @ Google

Page 6: On the use of sparse time-relative auditory codes for musicpift6080/H09/documents/... · Genre recognition results Learning kernels Encoding Tzanetakis - basic genre statistics classical

Smith and Lewicki’s sparse auditory codesOur ISMIR paper

Outlook

Close-up of piano sounding an A4

4.9 4.95 5 5.05 5.1 5.15 5.2 5.25 5.3−1

−0.5

0

0.5

1

4.9 4.95 5 5.05 5.1 5.15 5.2 5.25 5.30

10

20

30

40

50

60

Time (sec)

Ker

nel

Pierre-Antoine Manzagol, Thierry Bertin-Mahieux, Douglas Eck Machine Hearing @ Google

Page 7: On the use of sparse time-relative auditory codes for musicpift6080/H09/documents/... · Genre recognition results Learning kernels Encoding Tzanetakis - basic genre statistics classical

Smith and Lewicki’s sparse auditory codesOur ISMIR paper

Outlook

Sparse time-relative auditory codes

Sparse: notion that only a small number of things (out of alarge number of possible things) are going on at the sametime.

Time-relative: the events are located at a precise point intime.

Auditory: biologically validated from what goes on in thecochlea.

Efficient

No time-frequency trade-off or sensitivity to arbitraryalignment of signal

Pierre-Antoine Manzagol, Thierry Bertin-Mahieux, Douglas Eck Machine Hearing @ Google

Page 8: On the use of sparse time-relative auditory codes for musicpift6080/H09/documents/... · Genre recognition results Learning kernels Encoding Tzanetakis - basic genre statistics classical

Smith and Lewicki’s sparse auditory codesOur ISMIR paper

Outlook

Extra

Various encoding algorithms.

Possibility of either using predefined kernels (sayGammatones) or of learning kernels.

Pierre-Antoine Manzagol, Thierry Bertin-Mahieux, Douglas Eck Machine Hearing @ Google

Page 9: On the use of sparse time-relative auditory codes for musicpift6080/H09/documents/... · Genre recognition results Learning kernels Encoding Tzanetakis - basic genre statistics classical

Smith and Lewicki’s sparse auditory codesOur ISMIR paper

Outlook

Encoding complex western musicGenre recognition resultsLearning kernels

Contributions

First use of these codes on complex western commercialmusic.

Used the codes as naive input for genre classification (as goodas MFCCs)

Learn some kernels which are qualitatively different, butsomewhat disappointing.

Pierre-Antoine Manzagol, Thierry Bertin-Mahieux, Douglas Eck Machine Hearing @ Google

Page 10: On the use of sparse time-relative auditory codes for musicpift6080/H09/documents/... · Genre recognition results Learning kernels Encoding Tzanetakis - basic genre statistics classical

Smith and Lewicki’s sparse auditory codesOur ISMIR paper

Outlook

Encoding complex western musicGenre recognition resultsLearning kernels

Encoding Tzanetakis

# of kernels 103 104 2.5× 104 5× 104

32 2.82 (0.97) 8.57 (2.17) 12.69 (2.87) 16.60 (2.88)

64 3.02 (1.03) 9.07 (2.29) 13.44 (2.98) 17.43 (2,70)

128 3.08 (1.04) 9.24 (2.31) 13.76 (2.99) 17.76 (2.52)

Table: Mean (standard deviation) of the SNR (dB) obtained over 100songs (ten from each genre), as a function of the number ofGammatones in the codebook and of the maximum number of spikesallowed to reach a maximal value of 20 dB SNR.

64 Gammatones, ERB 20Hz-Nyquist, 20db SNR

About 1 hour to encode a 30 second segment!!!

Pierre-Antoine Manzagol, Thierry Bertin-Mahieux, Douglas Eck Machine Hearing @ Google

Page 11: On the use of sparse time-relative auditory codes for musicpift6080/H09/documents/... · Genre recognition results Learning kernels Encoding Tzanetakis - basic genre statistics classical

Smith and Lewicki’s sparse auditory codesOur ISMIR paper

Outlook

Encoding complex western musicGenre recognition resultsLearning kernels

Encoding Tzanetakis - basic genre statistics

classical disco jazz pop rock blues countryhiphop metal reggae10000

20000

30000

40000

50000

60000

70000

80000

90000

100000

classical disco jazz pop rock blues countryhiphop metal reggae0

5000

10000

15000

20000

25000

30000

Figure: Top: box-and-whisker plot of the number of spikes (out of a maximum of 105) required to encodesongs from various genres up to a SNR of 20dB. The red line represents the median and the box extends from thelower quartile to the upper one. Bottom: Bar plot of the mean number of kernels used per song depending on thegenre. The 64 kernels are split into eight bins going from the lowest frequency Gammatones (left) to the highestones (right).

Pierre-Antoine Manzagol, Thierry Bertin-Mahieux, Douglas Eck Machine Hearing @ Google

Page 12: On the use of sparse time-relative auditory codes for musicpift6080/H09/documents/... · Genre recognition results Learning kernels Encoding Tzanetakis - basic genre statistics classical

Smith and Lewicki’s sparse auditory codesOur ISMIR paper

Outlook

Encoding complex western musicGenre recognition resultsLearning kernels

Genre Recognition on Tzanetakis

Features: basic statistics on the spikegrams for windows of 5seconds.

Results similar to those obtained with MFCCs.

Pierre-Antoine Manzagol, Thierry Bertin-Mahieux, Douglas Eck Machine Hearing @ Google

Page 13: On the use of sparse time-relative auditory codes for musicpift6080/H09/documents/... · Genre recognition results Learning kernels Encoding Tzanetakis - basic genre statistics classical

Smith and Lewicki’s sparse auditory codesOur ISMIR paper

Outlook

Encoding complex western musicGenre recognition resultsLearning kernels

Learned kernels

0.0 1500.0

0

0.0 2000.0

1

0.0 350.0

2

0.0 200.0

3

0.0 800.0

4

0.0 1600.0

5

0.0 800.0

6

0.0 100.0

7

0.0 500.0

8

0.0 1400.0

9

Pierre-Antoine Manzagol, Thierry Bertin-Mahieux, Douglas Eck Machine Hearing @ Google

Page 14: On the use of sparse time-relative auditory codes for musicpift6080/H09/documents/... · Genre recognition results Learning kernels Encoding Tzanetakis - basic genre statistics classical

Smith and Lewicki’s sparse auditory codesOur ISMIR paper

Outlook

Encoding complex western musicGenre recognition resultsLearning kernels

Learned kernels - results

kernel 103 104 2.5× 104 5× 104

gammatones 2.82 (0.97) 8.57 (2.17) 12.69 (2.87) 16.60 (2.88)

learned 1.86 (0.57) 6.06 (1.49) 8.86 (2.00) 11.53 (2,37)

Table: Mean (standard deviation) of the SNR (dB) obtained over 100songs (ten from each genre) with either 32 Gammatones or 32 learnedkernels in the codebook.

Encoding not as good.

Genre recognition results not as good.

Pierre-Antoine Manzagol, Thierry Bertin-Mahieux, Douglas Eck Machine Hearing @ Google

Page 15: On the use of sparse time-relative auditory codes for musicpift6080/H09/documents/... · Genre recognition results Learning kernels Encoding Tzanetakis - basic genre statistics classical

Smith and Lewicki’s sparse auditory codesOur ISMIR paper

Outlook

Future work

Three specific issues that will need to be dealt with:

How to put the features in a usable form?

Learning kernels.

Computational issue.

Pierre-Antoine Manzagol, Thierry Bertin-Mahieux, Douglas Eck Machine Hearing @ Google

Page 16: On the use of sparse time-relative auditory codes for musicpift6080/H09/documents/... · Genre recognition results Learning kernels Encoding Tzanetakis - basic genre statistics classical

Smith and Lewicki’s sparse auditory codesOur ISMIR paper

Outlook

References

T. Blumensath and M. Davies.

Sparse and shift-invariant representations of music.IEEE Transactions on Speech and Audio Processing, 2006.

G. Davis, S. Mallat, and M. Avellaneda.

Adaptive greedy approximations.Constructive approximation, 13(1):57–98, 2004.

S.G. Mallat and Zhifeng Zhang.

Matching pursuits with time-frequency dictionaries.Signal Processing, IEEE Transactions on, 41(12):3397–3415, December 1993.

M. Plumbley, S. Abdallah, T. Blumensath, and M. Davies.

Sparse representations of polyphonic music.Signal Processing, 86(3):417–431, March 2006.

M. Slaney.

Auditory toolbox, 1998.(Tech. Rep. No. 1998-010). Palo Alto, CA:Interval Research Corporation.

E. Smith and M. S. Lewicki.

Efficient coding of time-relative structure using spikes.Neural Computation, 17(1):19–45, 2005.

E. Smith and M. S. Lewicki.

Efficient auditory coding.Nature, 439:978–982, February 2006.

Pierre-Antoine Manzagol, Thierry Bertin-Mahieux, Douglas Eck Machine Hearing @ Google

Page 17: On the use of sparse time-relative auditory codes for musicpift6080/H09/documents/... · Genre recognition results Learning kernels Encoding Tzanetakis - basic genre statistics classical

Smith and Lewicki’s sparse auditory codesOur ISMIR paper

Outlook

Questions?

Pierre-Antoine Manzagol, Thierry Bertin-Mahieux, Douglas Eck Machine Hearing @ Google