Automatic interpretation of music structure...


Source: jblsmith.github.io/documents/smith2017-ismir...

The QP reconstructs the ABAB interpretation using only bass chroma.

The QP approach has clear limitations:

- If two musical attributes explain a section equally, the QP might only point to one. Instead, we can measure correlation.

- Sequences that are repeated but non-homogenous may be overlooked in a point-wise SSM comparison. Instead, we can use segment-indexed SSMs, or apply additional stripe masking.
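The second limitation can be illustrated with a small sketch. It is not the poster's implementation: the function names and the "mean along the block diagonal" rule for a segment-indexed SSM are assumptions, chosen only to show why a repeated but internally varied sequence scores low in a point-wise comparison and high once each segment pair is treated as a unit.

```python
import numpy as np

def frame_ssm(features):
    """Cosine self-similarity between frames; features has shape (n_frames, n_dims)."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-12)
    return unit @ unit.T

def segment_indexed_ssm(ssm, boundaries, mode="diagonal"):
    """Collapse a frame-level SSM to one cell per pair of segments.

    boundaries: frame indices [b0, b1, ..., bK] delimiting K segments.
    mode="block":    mean of the whole block (a point-wise comparison).
    mode="diagonal": mean along the block's diagonal (assumed rule), which
                     rewards sequences that repeat in the same order.
    """
    k = len(boundaries) - 1
    out = np.zeros((k, k))
    for i in range(k):
        rows = np.arange(boundaries[i], boundaries[i + 1])
        for j in range(k):
            cols = np.arange(boundaries[j], boundaries[j + 1])
            if mode == "block":
                out[i, j] = ssm[np.ix_(rows, cols)].mean()
            else:
                # Resample both segments to a common length, then walk the diagonal.
                n = max(len(rows), len(cols))
                r = rows[np.linspace(0, len(rows) - 1, n).round().astype(int)]
                c = cols[np.linspace(0, len(cols) - 1, n).round().astype(int)]
                out[i, j] = ssm[r, c].mean()
    return out

# Two segments that repeat the same four-step sequence; no two frames within a segment match.
features = np.vstack([np.eye(4), np.eye(4)])
ssm = frame_ssm(features)
block = segment_indexed_ssm(ssm, [0, 4, 8], mode="block")
diagonal = segment_indexed_ssm(ssm, [0, 4, 8], mode="diagonal")
```

On this toy input the similarity between the two segments is 0.25 in block mode and 1.0 in diagonal mode, which is the kind of repetition a point-wise comparison may overlook.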

Jordan B. L. Smith, National Institute of Advanced Industrial Science and Technology (AIST), Japan
Elaine Chew, Queen Mary University of London

1. Motivation

Structural descriptions are usually single-dimensional, or perhaps hierarchical:

This annotation tells us that sections A and B are different, but what makes them different? Do listeners think B is defined by a harmonic or melodic progression, or by a timbre? What was the listener’s rationale?

Collecting this information from listeners is onerous, and the introspection required is difficult. Instead, we aim to automatically interpret existing annotations by comparing them to the audio.

If successful, we could visualize structure to see which musical attributes characterize each section:

3. Algorithm

We compute self-similarity matrices (SSMs) from several audio features, each of which is assumed to correlate with a relevant musical attribute.

We generate masked SSM segments, each revealing the relationship of a segment to the rest of the piece.

Then, a quadratic program (QP) estimates coefficients to recreate the ground truth SSM from the masked segments. E.g.:

4. Validation

The suggested improvements all had a positive impact: the best algorithm used the stripe-masked SSMs, indexing by segment, and correlation instead of the QP output.
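The pipeline (feature SSMs, masked segments, coefficients that rebuild the annotation SSM, and the correlation alternative) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the row-shaped mask, the non-negativity constraint, and the projected-gradient solver standing in for the QP are all assumptions, and the feature SSMs are toy matrices rather than SSMs computed from audio.

```python
import numpy as np

def label_ssm(frame_labels):
    """Binary SSM: 1 where two frames carry the same label, else 0."""
    labels = np.asarray(frame_labels)
    return (labels[:, None] == labels[None, :]).astype(float)

def row_mask(n_frames, start, end):
    """Assumed mask: keep the rows of one segment (its relation to the whole piece)."""
    mask = np.zeros((n_frames, n_frames))
    mask[start:end, :] = 1.0
    return mask

def nonneg_least_squares(A, b, n_iter=500):
    """Projected gradient descent for min 0.5*||Ax - b||^2 subject to x >= 0."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2  # inverse Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = np.maximum(0.0, x - step * (A.T @ (A @ x - b)))
    return x

def safe_corr(a, b):
    """Pearson correlation, defined here as 0 when either input is constant."""
    if a.std() == 0 or b.std() == 0:
        return 0.0
    return float(np.corrcoef(a, b)[0, 1])

def interpret(feature_ssms, frame_labels, boundaries):
    """Return names, QP-style coefficients, and correlations, indexed [feature, segment]."""
    target = label_ssm(frame_labels)
    n = target.shape[0]
    names = list(feature_ssms)
    n_seg = len(boundaries) - 1
    columns = []
    corr = np.zeros((len(names), n_seg))
    for f, name in enumerate(names):
        for s in range(n_seg):
            start, end = boundaries[s], boundaries[s + 1]
            # One regressor per (feature, segment): the feature SSM seen through the mask.
            columns.append((feature_ssms[name] * row_mask(n, start, end)).ravel())
            # Correlation alternative: compare the same masked region directly.
            corr[f, s] = safe_corr(feature_ssms[name][start:end].ravel(),
                                   target[start:end].ravel())
    coeffs = nonneg_least_squares(np.column_stack(columns), target.ravel())
    return names, coeffs.reshape(len(names), n_seg), corr

# Toy piece: the annotation is ABAB; "harmony" follows it, "timbre" follows AABB instead.
labels = list("AAABBBAAABBB")
ssms = {"harmony": label_ssm(labels), "timbre": label_ssm(list("AAAAAABBBBBB"))}
names, coeffs, corr = interpret(ssms, labels, [0, 3, 6, 9, 12])
```

On this toy input the coefficients approach 1 for harmony and 0 for timbre in every segment, and the correlations are 1 and 0 respectively. Unlike the coefficients, each correlation is computed per feature and so does not force features to compete when two of them explain a section equally well.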

5. Application

We can use the validated approach to analyze SALAMI annotations:

A: Harmonies stable, orchestration builds up; → harmonies in a and b are unique across the piece.
B: Complex chord sequence, stable timbre; → timbre cannot explain individuated subsections.

Some analyses have prime markers. Whether we treat primed sections as similar or different changes the interpretation.

2. Data

Finding appropriate data is not trivial! To validate the algorithm, we need structural annotations paired with listener rationales.

We obtained the data in a music perception study: we composed stimuli with intended forms, each suited to intended rationales:

[Figure: musical notation for the composed stimuli. Forms shown: AAB and ABB. Stimuli pair Harmony/Timbre and Melody/Rhythm: each excerpt combines Melody A or B with Rhythm A or B, and Harmony A or B with Timbre A (organ) or Timbre B (harpsichord). Examples: AAB justified by rhythm; ABB justified by timbre.]

We also confirmed that listeners perceived these structures with the same rationales:

[Figure: listener response, "...ABB justified by timbre..."]

We have a large number of stimuli, in three styles, with either 3 parts (AAB vs. ABB) or 4 parts (AABB vs. ABAB vs. ABBA).

Automatic interpretation of music structure analyses: A validated technique for post-hoc estimation of the rationale for an annotation

[Figure: attribute plots for three songs: “Another One Bites The Dust”, by Queen; “We Are The Champions”, by Queen; and “Hello, Goodbye”, by The Beatles. Each plot has a horizontal axis running from 20 to 200 (20 to 180 for “We Are The Champions”), large-scale section labels (A, B, C, ...), small-scale section labels (a, a', b, ...), and one row per attribute: Harmony (H), Melody (M), Rhythm (R), Timbre (T).]

[Figure: bar chart of accuracy (vertical axis from 0.4 to 0.8) for the QP and correlation outputs, each under four conditions: regular, segment indexing, stripe masking, and both. Legend: Proposed, Bnull.]

Interpretation

B. chroma   0.112   0.112   0.112   0.112
Melody      0       0       0       0
Tempo       0       0       0       0
MFCC        0       0       0       0

[Figure: example panel captioned "This piece has structure:", with rows labelled Harmony, F0, Melody, Rhythm, and Timbre, and three candidate analyses: ABBA justified by timbre, AABB justified by rhythm, ABAB justified by harmony.]

Interpretation (segment indexing)

B. chroma   0.94    0.885   0.942   0.914
Melody      0.702   0.595   0.522   0.765
Tempo       0.453   0.2     0.215   0.455
MFCC        0.904   0.843   0.898   0.894

Interpretation (stripe masking)

B. chroma   0.888   0.87    0.876   0.855
Melody      0.71    0.659   0.693   0.759
Tempo       0.611   0.489   0.515   0.621
MFCC        0.879   0.857   0.849   0.873

But accuracy varied among musical styles and features, as these confusion plots show:

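Rows are features; columns are the attributes H, M, R, T (harmony, melody, rhythm, timbre). The last column and the last row give the proportions printed in the plot margins.

Style 1

               H     M     R     T
B. chroma      0    13     5     0     0
Chord        384     0     0     0     1
Melody         0   202    10    39   .80
T. chroma      0    65     1     0   .98
Tempo          0    48   235     4   .82
Onset          0    20   133     0   .87
MFCC           0    23     0     9   .28
Low level      0    13     0   332   .96
               1   .70   .96   .89

Style 2

               H     M     R     T
B. chroma     78     0     0    63   .55
Chord        256    74     4     8   .75
Melody         4   238     8    19   .88
T. chroma     10     2     1    16   .07
Tempo         25    17    73   171   .26
Onset          1     0   294    25   .92
MFCC           8    41     0    58   .54
Low level      2    12     4    24   .57
             .87   .63   .96   .21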
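The margin values of the Style 1 plot can be reproduced from its counts. The sketch below assumes that rows are features, columns are the attributes H, M, R, T, and that each feature is a proxy for one attribute; the pairing in `FEATURE_TO_ATTRIBUTE` is inferred from the margin values and is not stated on the poster.

```python
import numpy as np

ATTRIBUTES = ["H", "M", "R", "T"]

# Assumed pairing of each feature with the attribute it is meant to capture.
FEATURE_TO_ATTRIBUTE = {
    "B. chroma": "H", "Chord": "H",
    "Melody": "M", "T. chroma": "M",
    "Tempo": "R", "Onset": "R",
    "MFCC": "T", "Low level": "T",
}

# Style 1 counts, one row per feature in the order above, columns H, M, R, T.
STYLE_1 = np.array([
    [0, 13, 5, 0],
    [384, 0, 0, 0],
    [0, 202, 10, 39],
    [0, 65, 1, 0],
    [0, 48, 235, 4],
    [0, 20, 133, 0],
    [0, 23, 0, 9],
    [0, 13, 0, 332],
])

def margins(counts):
    """Return (per-feature, per-attribute) proportions of matching cases."""
    target = np.array([ATTRIBUTES.index(a) for a in FEATURE_TO_ATTRIBUTE.values()])
    # Per feature: share of its row that falls in the paired attribute's column.
    hits = counts[np.arange(len(counts)), target]
    per_feature = hits / counts.sum(axis=1)
    # Per attribute: share of its column that falls on one of its paired features.
    per_attribute = np.array(
        [counts[target == k, k].sum() for k in range(len(ATTRIBUTES))]
    ) / counts.sum(axis=0)
    return per_feature, per_attribute

per_feature, per_attribute = margins(STYLE_1)
```

Rounded to two decimals, `per_feature` gives 0, 1, .80, .98, .82, .87, .28, .96 and `per_attribute` gives 1, .70, .96, .89, matching the printed margins.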


d=d': Stable, stripped-down harmony throughout. d≠d': Sections feature odd, varying sound effects.

How to read: cells are brightest when a feature is:
1. homogenous throughout that section;
2. similar in other sections with the same label;
3. different in other sections with different labels.
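The three reading criteria can be made concrete with a small sketch. It is only an illustration, not how the plots were produced: the function name, the use of block means, and the assumption that SSM values lie in [0, 1] are all hypothetical.

```python
import numpy as np

def reading_scores(ssm, boundaries, labels, seg):
    """Score one segment of a feature SSM against the three reading criteria.

    ssm:        frame-level self-similarity matrix with values in [0, 1] (assumed)
    boundaries: frame indices [b0, ..., bK] delimiting K segments
    labels:     one section label per segment
    seg:        index of the segment being scored
    """
    rows = slice(boundaries[seg], boundaries[seg + 1])
    same, diff = [], []
    for other, label in enumerate(labels):
        if other == seg:
            continue
        block = ssm[rows, boundaries[other]:boundaries[other + 1]].mean()
        (same if label == labels[seg] else diff).append(block)
    return {
        # 1. homogeneous throughout the section
        "homogeneous": float(ssm[rows, rows].mean()),
        # 2. similar in other sections with the same label
        "same_label_similar": float(np.mean(same)) if same else float("nan"),
        # 3. different in sections with different labels
        "other_label_different": 1.0 - float(np.mean(diff)) if diff else float("nan"),
    }

# Toy ABAB piece, two frames per segment, whose feature follows the labels exactly.
frames = np.array(list("AABBAABB"))
ideal = (frames[:, None] == frames[None, :]).astype(float)
scores = reading_scores(ideal, [0, 2, 4, 6, 8], ["A", "B", "A", "B"], seg=0)

# A feature that never changes satisfies criteria 1 and 2 but fails criterion 3.
flat = reading_scores(np.ones((8, 8)), [0, 2, 4, 6, 8], ["A", "B", "A", "B"], seg=0)
```

For the ideal feature all three scores are 1.0; for the unchanging feature the third score drops to 0.0, so a cell would be bright only in the first case.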