Pitch Analysis for Active Music Discovery - Justin Salamon...1 Pitch Analysis for Active Music...

80
1 Pitch Analysis for Active Music Discovery 1 Music and Audio Research Laboratory 2 Center for Urban Science and Progress New York University, USA Thursday June 23 rd 2016 Justin Salamon 1,2 @justin_salamon [email protected] www.justinsalamon.com Collaborators: Juan Pablo Bello, Rachel Bittner, Jordi Bonada, Juan J. Bosch, Jose Miguel Diaz-Bañez, Chris Cannam, Francisco Escobar, Slim Essid, Emilia Gómez, Paco Gómez, Sankalp Gulati, Helga Jiang, Edith Law, Matthias Mauch, Joaquin Mora, Sergio Oramas, Geoffroy Peeters, Aggelos Pikrakis, Axel Röbel, Bruno Rocha, Joan Serrà, Xavier Serra, Tim Tse, Alex Williams

Transcript of Pitch Analysis for Active Music Discovery - Justin Salamon...1 Pitch Analysis for Active Music...

1

Pitch Analysis for Active Music Discovery

1Music and Audio Research Laboratory2Center for Urban Science and Progress

New York University, USA

Thursday June 23rd 2016

Justin Salamon1,2

@justin_salamon [email protected] www.justinsalamon.com

Collaborators: Juan Pablo Bello, Rachel Bittner, Jordi Bonada, Juan J. Bosch, Jose Miguel Diaz-Bañez, Chris Cannam, Francisco Escobar, Slim Essid, Emilia Gómez, Paco Gómez, Sankalp Gulati, Helga Jiang, Edith Law, Matthias Mauch, Joaquin Mora, Sergio Oramas, Geoffroy Peeters, Aggelos Pikrakis, Axel Röbel, Bruno Rocha, Joan Serrà, Xavier Serra, Tim Tse, Alex Williams

In A Nutshell 2/35

In A Nutshell

‣  GOAL: active music discovery

3/35

In A Nutshell

‣  GOAL: active music discovery

‣  MEANS: pitch content analysis

4/35

In A Nutshell

‣  GOAL: active music discovery

‣  MEANS: pitch content analysis

‣  CHALLENGE: data scarcity (and solution!)

5/35

Active Music Discovery

Music Recommendation 7/35

Music Recommendation 8/35

Active Search & Discovery

9/35

Active Search & Discovery

10/35Song A

Song B

Versions?

Yes! Similarity = 0.92

COVER SONG ID

(Salamon, Serrà & Gómez, 2013)

Active Search & Discovery

11/35Song A

Song B

Versions?

Yes! Similarity = 0.92

COVER SONG ID

(Salamon, Serrà & Gómez, 2013)

Query-by-humming

(Salamon, Serrà & Gómez, 2013)

Active Search & Discovery

12/35Song A

Song B

Versions?

Yes! Similarity = 0.92

COVER SONG ID

(Salamon, Serrà & Gómez, 2013)

Query-by-humming

(Salamon, Serrà & Gómez, 2013)

melodic pattern discovery

(Pikrakis et al., 2012)

Active Search & Discovery

13/35Song A

Song B

Versions?

Yes! Similarity = 0.92

COVER SONG ID

(Salamon, Serrà & Gómez, 2013)

Query-by-humming

(Salamon, Serrà & Gómez, 2013)

100 150 200 250 300 350 400

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Frequency bins (1 bin = 10 cents), Ref: 55Hz

Norm

aliz

ed s

alie

nce

Multipitch Histogram

ρ2

ρ4 ρ3

ρ5

Tonic ID

(Salamon, Gulati & Serra, 2012)

melodic pattern discovery

(Pikrakis et al., 2012)

Active Search & Discovery

14/35Song A

Song B

Versions?

Yes! Similarity = 0.92

COVER SONG ID

(Salamon, Serrà & Gómez, 2013)

Query-by-humming

(Salamon, Serrà & Gómez, 2013)

5.6 5.8 6 6.2 6.4 6.60

0.1

0.2

0.3

0.4

0.5

Mean vibrato rate (vr:mean*)

Me

an

vib

rato

co

ve

rag

e (

v c:me

an

*)

Flamenco

Inst. jazz

Opera

Pop

Vocal jazz

SINGING STYLE classification

(Salamon, Rocha & Gómez, 2012)

100 150 200 250 300 350 400

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Frequency bins (1 bin = 10 cents), Ref: 55Hz

Norm

aliz

ed s

alie

nce

Multipitch Histogram

ρ2

ρ4 ρ3

ρ5

Tonic ID

(Salamon, Gulati & Serra, 2012)

melodic pattern discovery

(Pikrakis et al., 2012)

Machine Learning for Pitch Analysis

16/35Melody Extraction

17/35Melody Extraction

18/35Melody Extraction

19/35Melodia (Salamon & Gómez, 2012)

20/35Melodia (Salamon & Gómez, 2012)

Contour Extraction Melody f0Contour Selection

21/35Melodia (Salamon & Gómez, 2012)

Melody f0Contour Extraction Contour Selection

22/35Melodia (Salamon & Gómez, 2012)

Melody f0Contour Extraction Contour Selection

23/35Melodia (Salamon & Gómez, 2012)

Melody f0Contour Extraction Contour Selection

24/35Melodia (Salamon & Gómez, 2012)

Melody f0Contour Extraction Contour Selection

25/35Melodia (Salamon & Gómez, 2012)

Melody f0

Audio signal

Sinusoid extraction

Salience function

Pitch contour creation

Melody selection

Iterative

Melody f0 sequence

Spectral peaks

Time-frequency salience

Pitch contours

Bin salience mapping with harmonic weighting

Contour characterisation

Peak streaming Peak filtering

Frequency/amplitude correction

Spectral transform

Equal loudness filter

Melody peak selection

Pitch outlier removal

Melody pitch mean

Octave error removal

Voicing detection

Melody pitch mean

Audio signal

Sinusoid extraction

Salience function

Pitch contour creation

Melody selection

Iterative

Melody f0 sequence

Spectral peaks

Time-frequency salience

Pitch contours

Bin salience mapping with harmonic weighting

Contour characterisation

Peak streaming Peak filtering

Frequency/amplitude correction

Spectral transform

Equal loudness filter

Melody peak selection

Pitch outlier removal

Melody pitch mean

Octave error removal

Voicing detection

Melody pitch mean

Contour Extraction Contour Selection

26/35Melodia (Salamon & Gómez, 2012)

Melody f0

Audio signal

Sinusoid extraction

Salience function

Pitch contour creation

Melody selection

Iterative

Melody f0 sequence

Spectral peaks

Time-frequency salience

Pitch contours

Bin salience mapping with harmonic weighting

Contour characterisation

Peak streaming Peak filtering

Frequency/amplitude correction

Spectral transform

Equal loudness filter

Melody peak selection

Pitch outlier removal

Melody pitch mean

Octave error removal

Voicing detection

Melody pitch mean

Audio signal

Sinusoid extraction

Salience function

Pitch contour creation

Melody selection

Iterative

Melody f0 sequence

Spectral peaks

Time-frequency salience

Pitch contours

Bin salience mapping with harmonic weighting

Contour characterisation

Peak streaming Peak filtering

Frequency/amplitude correction

Spectral transform

Equal loudness filter

Melody peak selection

Pitch outlier removal

Melody pitch mean

Octave error removal

Voicing detection

Melody pitch mean

Contour Extraction Contour Selection

27/35Melodia (Salamon & Gómez, 2012)

Melody f0

Audio signal

Sinusoid extraction

Salience function

Pitch contour creation

Melody selection

Iterative

Melody f0 sequence

Spectral peaks

Time-frequency salience

Pitch contours

Bin salience mapping with harmonic weighting

Contour characterisation

Peak streaming Peak filtering

Frequency/amplitude correction

Spectral transform

Equal loudness filter

Melody peak selection

Pitch outlier removal

Melody pitch mean

Octave error removal

Voicing detection

Melody pitch mean

Audio signal

Sinusoid extraction

Salience function

Pitch contour creation

Melody selection

Iterative

Melody f0 sequence

Spectral peaks

Time-frequency salience

Pitch contours

Bin salience mapping with harmonic weighting

Contour characterisation

Peak streaming Peak filtering

Frequency/amplitude correction

Spectral transform

Equal loudness filter

Melody peak selection

Pitch outlier removal

Melody pitch mean

Octave error removal

Voicing detection

Melody pitch mean

Contour Extraction Contour Selection

?

Melodia (Salamon & Gómez, 2012)

Melodia (Salamon & Gómez, 2012)

vibrato

pitch deviation

length

pitch mean

salience

rate extent

1000 2000 3000 4000 5000 60000

0.1

0.2

(a)Pitch mean distribution (cents)

0 100 200 300 400 5000

0.4

0.8

(b)Pitch standard deviation distribution (cents)

0 1 2 30

0.075

0.15

(c)Contour mean salience distribution

0 1 2 3 4 50

0.15

0.3

(d)Contour salience standard deviation distribution

0 2 4 6 8 100

0.4

0.8

(e)Contour total salience distribution

0 2 4 60

0.4

0.8

(f)Contour length distribution (seconds)

Melodia (Salamon & Gómez, 2012)

31/35

Melody f0

Audio signal

Sinusoid extraction

Salience function

Pitch contour creation

Melody selection

Iterative

Melody f0 sequence

Spectral peaks

Time-frequency salience

Pitch contours

Bin salience mapping with harmonic weighting

Contour characterisation

Peak streaming Peak filtering

Frequency/amplitude correction

Spectral transform

Equal loudness filter

Melody peak selection

Pitch outlier removal

Melody pitch mean

Octave error removal

Voicing detection

Melody pitch mean

Audio signal

Sinusoid extraction

Salience function

Pitch contour creation

Melody selection

Iterative

Melody f0 sequence

Spectral peaks

Time-frequency salience

Pitch contours

Bin salience mapping with harmonic weighting

Contour characterisation

Peak streaming Peak filtering

Frequency/amplitude correction

Spectral transform

Equal loudness filter

Melody peak selection

Pitch outlier removal

Melody pitch mean

Octave error removal

Voicing detection

Melody pitch mean

Melodia (Salamon & Gómez, 2012)

Contour Extraction Contour Selection

32/35

Melody f0

Audio signal

Sinusoid extraction

Salience function

Pitch contour creation

Melody selection

Iterative

Melody f0 sequence

Spectral peaks

Time-frequency salience

Pitch contours

Bin salience mapping with harmonic weighting

Contour characterisation

Peak streaming Peak filtering

Frequency/amplitude correction

Spectral transform

Equal loudness filter

Melody peak selection

Pitch outlier removal

Melody pitch mean

Octave error removal

Voicing detection

Melody pitch mean

Melodia (Salamon & Gómez, 2012)

Contour Extraction Contour Selection

33/35

Melody f0

Audio signal

Sinusoid extraction

Salience function

Pitch contour creation

Melody selection

Iterative

Melody f0 sequence

Spectral peaks

Time-frequency salience

Pitch contours

Bin salience mapping with harmonic weighting

Contour characterisation

Peak streaming Peak filtering

Frequency/amplitude correction

Spectral transform

Equal loudness filter

Melody peak selection

Pitch outlier removal

Melody pitch mean

Octave error removal

Voicing detection

Melody pitch mean

Melodia (Salamon & Gómez, 2012)

Contour Extraction Contour Selection

MIREX: 75%MedleyDB: 57%

34/35Contour Classification (Salamon, Peeters & Röbel, 2012)

Melody f0

Audio signal

Sinusoid extraction

Salience function

Pitch contour creation

Melody selection

Iterative

Melody f0 sequence

Spectral peaks

Time-frequency salience

Pitch contours

Bin salience mapping with harmonic weighting

Contour characterisation

Peak streaming Peak filtering

Frequency/amplitude correction

Spectral transform

Equal loudness filter

Melody peak selection

Pitch outlier removal

Melody pitch mean

Octave error removal

Voicing detection

Melody pitch mean

Melody contour features

Accompaniment contour features

Contour Extraction Contour Selection

35/35Contour Classification (Bittner et al., 2015)

Melody f0

Audio signal

Sinusoid extraction

Salience function

Pitch contour creation

Melody selection

Iterative

Melody f0 sequence

Spectral peaks

Time-frequency salience

Pitch contours

Bin salience mapping with harmonic weighting

Contour characterisation

Peak streaming Peak filtering

Frequency/amplitude correction

Spectral transform

Equal loudness filter

Melody peak selection

Pitch outlier removal

Melody pitch mean

Octave error removal

Voicing detection

Melody pitch mean

Contour Extraction Contour Selection

36/35Contour Classification (Bittner et al., 2015)

Melody f0

Audio signal

Sinusoid extraction

Salience function

Pitch contour creation

Melody selection

Iterative

Melody f0 sequence

Spectral peaks

Time-frequency salience

Pitch contours

Bin salience mapping with harmonic weighting

Contour characterisation

Peak streaming Peak filtering

Frequency/amplitude correction

Spectral transform

Equal loudness filter

Melody peak selection

Pitch outlier removal

Melody pitch mean

Octave error removal

Voicing detection

Melody pitch mean

Contour Extraction Contour Selection

37/35Contour Classification (Bosch et al., 2016)

Melody f0Contour Extraction Contour Selection

38/35Contour Classification (???, 2017)

Melody f0Contour Extraction Contour Selection

?

39/35Contour Classification (???, 2017)

Melody f0Deep Melody Net

?

Data Scarcity

Continuous Melody f0 Annotation

Continuous Melody f0 Annotation

Continuous Melody f0 Annotation

MonophonicPitch Tracker

Continuous Melody f0 Annotation

MonophonicPitch Tracker

Tony (Mauch et al., 2015)

Datasets for Melody Extraction Eval

Datasets for Melody Extraction Eval

‣  MIREX‣  ADC2004: 6 minutes (20 excerpts)‣  MIREX05: 12 minutes (25 excerpts)

Datasets for Melody Extraction Eval

‣  MIREX‣  ADC2004: 6 minutes (20 excerpts)‣  MIREX05: 12 minutes (25 excerpts)

‣  MIR1K: 2.2 hr (1000 excerpts), C-Pop

Datasets for Melody Extraction Eval

‣  MIREX‣  ADC2004: 6 minutes (20 excerpts)‣  MIREX05: 12 minutes (25 excerpts)

‣  MIR1K: 2.2 hr (1000 excerpts), C-Pop‣  RWC-Pop: 6.8 hr (100 songs), J-Pop

Datasets for Melody Extraction Eval

‣  MIREX‣  ADC2004: 6 minutes (20 excerpts)‣  MIREX05: 12 minutes (25 excerpts)

‣  MIR1K: 2.2 hr (1000 excerpts), C-Pop‣  RWC-Pop: 6.8 hr (100 songs), J-Pop‣  MedleyDB (Bittner et al., 2014):

‣  7.5 hr (108 songs), varied

Datasets for Melody Extraction Eval

‣  MIREX‣  ADC2004: 6 minutes (20 excerpts)‣  MIREX05: 12 minutes (25 excerpts)

‣  MIR1K: 2.2 hr (1000 excerpts), C-Pop‣  RWC-Pop: 6.8 hr (100 songs), J-Pop‣  MedleyDB (Bittner et al., 2014):

‣  7.5 hr (108 songs), varied‣  MedleyDBv2 (coming soon):

‣  ~17hr (~240 songs), varied

Datasets for Melody Extraction Eval

‣  MIREX‣  ADC2004: 6 minutes (20 excerpts)‣  MIREX05: 12 minutes (25 excerpts)

‣  MIR1K: 2.2 hr (1000 excerpts), C-Pop‣  RWC-Pop: 6.8 hr (100 songs), J-Pop‣  MedleyDB (Bittner et al., 2014):

‣  7.5 hr (108 songs), varied‣  MedleyDBv2 (coming soon):

‣  ~17hr (~240 songs), varied

~50 annotator-hours108 songs

Datasets for Melody Extraction Eval

‣  MIREX‣  ADC2004: 6 minutes (20 excerpts)‣  MIREX05: 12 minutes (25 excerpts)

‣  MIR1K: 2.2 hr (1000 excerpts), C-Pop‣  RWC-Pop: 6.8 hr (100 songs), J-Pop‣  MedleyDB (Bittner et al., 2014):

‣  7.5 hr (108 songs), varied‣  MedleyDBv2 (coming soon):

‣  ~17hr (~240 songs), varied

~50 annotator-hours108 songs 20 million songs

Datasets for Melody Extraction Eval

‣  MIREX‣  ADC2004: 6 minutes (20 excerpts)‣  MIREX05: 12 minutes (25 excerpts)

‣  MIR1K: 2.2 hr (1000 excerpts), C-Pop‣  RWC-Pop: 6.8 hr (100 songs), J-Pop‣  MedleyDB (Bittner et al., 2014):

‣  7.5 hr (108 songs), varied‣  MedleyDBv2 (coming soon):

‣  ~17hr (~240 songs), varied

~50 annotator-hours108 songs

~1057 annotator-years20 million songs

Data Scarcity

Data ScarcityOvercoming

Crowdsourcing Melody Note Annotations

Crowdsourcing Melody Note Annotations

Ensemble (Tse et al., 2016)

Crowdsourcing Melody Note Annotations

Ensemble (Tse et al., 2016)

Data Augmentation: f0 Annotation-by-Synthesis

Data Augmentation: f0 Annotation-by-Synthesis

Data Augmentation: f0 Annotation-by-Synthesis

MonophonicPitch Tracker

Data Augmentation: f0 Annotation-by-Synthesis

Cleaning +Smoothing

MonophonicPitch Tracker

Data Augmentation: f0 Annotation-by-Synthesis

Cleaning +Smoothing

MonophonicPitch Tracker

SinusoidalModelling

Data Augmentation: f0 Annotation-by-Synthesis

Cleaning +Smoothing

MonophonicPitch Tracker

SinusoidalModelling

Data Augmentation: f0 Annotation-by-Synthesis

Cleaning +Smoothing

MonophonicPitch Tracker

SinusoidalModelling

Synthesis

Data Augmentation: f0 Annotation-by-Synthesis

Cleaning +Smoothing

MonophonicPitch Tracker

SinusoidalModelling

SynthesisMixing

Data Augmentation: f0 Annotation-by-Synthesis

Cleaning +Smoothing

MonophonicPitch Tracker

SinusoidalModelling

SynthesisMixing

Data Augmentation: f0 Annotation-by-Synthesis

Cleaning +Smoothing

MonophonicPitch Tracker

SinusoidalModelling

SynthesisMixing

Algorithm 1

Algorithm 2

Algorithm 3

Metrics: A B C D E

Original data + manual annotations

Synthesized data + automatic annotations

Data Augmentation: f0 Annotation-by-Synthesis

Summary 70/35

Summary

‣  Active music discovery: user plays active role in retrieval

71/35

Summary

‣  Active music discovery: user plays active role in retrieval‣  Examples: QBH, search-by-singing-style, computational

(ethno)musicology…

72/35

Summary

‣  Active music discovery: user plays active role in retrieval‣  Examples: QBH, search-by-singing-style, computational

(ethno)musicology…‣  Require (auto) extraction of pitch content: melody extraction

73/35

Summary

‣  Active music discovery: user plays active role in retrieval‣  Examples: QBH, search-by-singing-style, computational

(ethno)musicology…‣  Require (auto) extraction of pitch content: melody extraction

‣  Melody extraction: classification at contour level

74/35

Summary

‣  Active music discovery: user plays active role in retrieval‣  Examples: QBH, search-by-singing-style, computational

(ethno)musicology…‣  Require (auto) extraction of pitch content: melody extraction

‣  Melody extraction: classification at contour level‣  Data scarcity: can’t explore high-capacity (and data

hungry) models

75/35

Summary

‣  Active music discovery: user plays active role in retrieval‣  Examples: QBH, search-by-singing-style, computational

(ethno)musicology…‣  Require (auto) extraction of pitch content: melody extraction

‣  Melody extraction: classification at contour level‣  Data scarcity: can’t explore high-capacity (and data

hungry) models‣  Solutions:

76/35

Summary

‣  Active music discovery: user plays active role in retrieval‣  Examples: QBH, search-by-singing-style, computational

(ethno)musicology…‣  Require (auto) extraction of pitch content: melody extraction

‣  Melody extraction: classification at contour level‣  Data scarcity: can’t explore high-capacity (and data

hungry) models‣  Solutions:

‣  Crowdsourcing

77/35

Summary

‣  Active music discovery: user plays active role in retrieval‣  Examples: QBH, search-by-singing-style, computational

(ethno)musicology…‣  Require (auto) extraction of pitch content: melody extraction

‣  Melody extraction: classification at contour level‣  Data scarcity: can’t explore high-capacity (and data

hungry) models‣  Solutions:

‣  Crowdsourcing ‣  Data augmentation: annotation-by-synthesis

78/35

Summary

‣  Active music discovery: user plays active role in retrieval‣  Examples: QBH, search-by-singing-style, computational

(ethno)musicology…‣  Require (auto) extraction of pitch content: melody extraction

‣  Melody extraction: classification at contour level‣  Data scarcity: can’t explore high-capacity (and data

hungry) models‣  Solutions:

‣  Crowdsourcing ‣  Data augmentation: annotation-by-synthesis

‣  Thanks!

79/35

Summary

‣  Active music discovery: user plays active role in retrieval‣  Examples: QBH, search-by-singing-style, computational

(ethno)musicology…‣  Require (auto) extraction of pitch content: melody extraction

‣  Melody extraction: classification at contour level‣  Data scarcity: can’t explore high-capacity (and data

hungry) models‣  Solutions:

‣  Crowdsourcing ‣  Data augmentation: annotation-by-synthesis

‣  Thanks!

80/35

@justin_salamon [email protected] www.justinsalamon.com