
Analysis of Music Data

Florian Schoppmann

Department of Computer Science

International Graduate School Dynamic Intelligent Systems

University of Paderborn

July 17, 2007

University of Paderborn Florian Schoppmann · 1 / 34


Analysis of Music Data

Goals:

I Measuring similarity of music data

  I Memory research

  I Music psychology

  I Music analysis (“ethnomusicology”)

  I Copyright issues

I Extracting high-level information from general audio signals

  I Restoration of musical sources

  I Music transcription

I Sound characteristics of orchestra instruments

  I Classification of the register of an instrument by timbre (and not pitch)



Music Representation

Symbolically, e.g., also MIDI:

Audio Signals, e.g., in AIFF, MP3, etc.:

MIDI Parameters: Note, Velocity, Modulation, Balance, etc.



Technical Terms

Pitch: Perceived fundamental frequency of a sound

Fundamental Frequency f0: Lowest frequency in a harmonic series

Harmonic Series: (Besides maths...) Multiples of f0

Overtone/Partial: Sinusoidal component of a waveform, of greater frequency than f0

Register: Relative “height” or range of a note, set of pitches, melody, instrument, etc.



Example

Sine wave at 440 Hz (Duration 100 ms)

First 100 ms of piano A4



Spectrum of Sine Wave
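The spectrum of the pure tone can be reproduced numerically. The following is a minimal sketch, not the code behind the slide: the sample rate of 8000 Hz and the naive discrete Fourier transform are assumptions chosen so that the example needs only the Python standard library.

```python
import cmath
import math

SAMPLE_RATE = 8000   # Hz; assumed for this sketch, not taken from the slides
DURATION = 0.1       # seconds (100 ms, as in the example)
FREQ = 440.0         # Hz

def sine_wave(freq, duration, sample_rate):
    """Return samples of a unit-amplitude sine wave."""
    n = int(round(duration * sample_rate))
    return [math.sin(2 * math.pi * freq * t / sample_rate) for t in range(n)]

def magnitude_spectrum(samples):
    """Naive O(N^2) discrete Fourier transform; returns |X[k]| for k = 0..N/2."""
    n = len(samples)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(samples)))
        for k in range(n // 2 + 1)
    ]

samples = sine_wave(FREQ, DURATION, SAMPLE_RATE)
spectrum = magnitude_spectrum(samples)

# The bin with the largest magnitude, converted back to a frequency in Hz.
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
peak_frequency = peak_bin * SAMPLE_RATE / len(samples)
print(peak_frequency)
```

With 800 samples the frequency bins are 10 Hz apart, so the sine falls exactly on one bin and the spectrum shows a single peak. A recorded piano note would instead show several peaks near multiples of the fundamental frequency.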



Spectrum of Piano A4



Disclaimer :-)

I am not an expert in any of the following fields

I signal processing

I musicology

I statistics

Everything presented is to the best of my knowledge

Technical details sometimes omitted. Hopefully:

I Discussion about the overall picture

I Not too many mistakes



Similarity of Melodies

Klaus Frieler (2006): Generalized N-gram Measures for Melodic Similarity. In: Data Science and Classification, Springer, pp. 289–298. http://dx.doi.org/10.1007/3-540-34416-0_31

Abstraction needed: Melody as Time Series

Definition

Let $j \le k \in \mathbb{N}$. A sequence $\phi$ over some set $E$ is a map

$$\phi : [j : k] \to E, \quad i \mapsto \phi_i .$$

$\phi$ is in normal form if $j = 0$.




Melody Abstraction

Definition

Let $\phi = (\phi_i)_{i \in [j:k]}$ be a sequence. An n-gram of $\phi$ is a subsequence $(\phi_i)_{i \in [l:m]}$ where $m - l + 1 = n$ and $j \le l \le m \le k$.

Definition

Let $E$ be an event space. A sequence $\mu : [0 : n-1] \to \mathbb{R} \times E$, $i \mapsto \mu_i =: (t_i, p_i)$, is called melody if

$$i < j \implies t_i < t_j .$$

Assumption here: Onset and pitch are sufficient to capture the “essence” of a melody



Similarity Measures (1/2)

Basic Choice about what to measure:

I Dissimilarity: More common in mathematics (distance measures, metrics)

I Similarity: Judging similarity is far more familiar to musical experts

Definition

Let $\mathcal{M}$ be a set of melodies. A similarity map $\sigma$ is a map $\sigma : \mathcal{M} \times \mathcal{M} \to [0, 1]$ such that the following is fulfilled:

I Symmetry: σ(µ, µ′) = σ(µ′, µ)

I Self-identity: σ(µ, µ) = 1 and σ(ε, µ) = 0

(We denote by ε the empty melody.)



Similarity Measures (2/2)

Further requirement for melodies: Invariance under

I (Pitch) transposition,

I Time Shift,

I Tempo Change.

General idea in the following:

I Counting common n-grams
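One common way to obtain these invariances is to work with differences instead of absolute values. The sketch below assumes that pitches are MIDI note numbers and onsets are given in seconds; the function names are illustrative and do not come from the paper.

```python
def pitch_intervals(melody):
    """Differences between successive pitches (in semitones).

    Invariant under transposition, because adding a constant to every
    pitch cancels out in the differences.
    """
    pitches = [p for _, p in melody]
    return [b - a for a, b in zip(pitches, pitches[1:])]

def onset_ratios(melody):
    """Ratios of successive inter-onset intervals.

    Differences of onsets remove a time shift; ratios of those
    differences additionally remove a global tempo factor.
    """
    onsets = [t for t, _ in melody]
    iois = [b - a for a, b in zip(onsets, onsets[1:])]
    return [b / a for a, b in zip(iois, iois[1:])]

# C major scale as (onset in seconds, MIDI pitch) pairs.
c_major = [(0.5 * i, p) for i, p in enumerate([60, 62, 64, 65, 67, 69, 71, 72])]

# The same melody transposed up 5 semitones, shifted by 2 s, at half tempo.
variant = [(2.0 + 2 * t, p + 5) for t, p in c_major]

print(pitch_intervals(c_major))
print(pitch_intervals(variant) == pitch_intervals(c_major))
print(onset_ratios(variant) == onset_ratios(c_major))
```

Both derived sequences are identical for the original and the variant, so any measure computed on them inherits the three invariances.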



Further Preliminary Definitions

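Let $\phi = (\phi_i)_{i \in [j:k]}$ be a sequence. Then:

I $|\cdot|$ is the length operator,

I $\hat{\cdot}$ is the normalizing operator

such that $|\phi| = k - j + 1$ and $\hat{\phi}_i = \phi_{i+j}$.

For any $n \in \mathbb{N}$:

$$S_n(\phi) := \{ r \mid r \text{ is n-gram of } \phi \} \quad \text{and} \quad \widehat{S}_n(\phi) := \{ \hat{r} \mid r \text{ is n-gram of } \phi \}$$

The frequency of an n-gram $r$ with respect to $\phi$ is defined as:

$$f_\phi(r) = |\{ u \in S_{|r|}(\phi) \mid \hat{r} = \hat{u} \}|$$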
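These definitions translate almost directly into code. In the sketch below, sequences are Python lists and every n-gram is stored as a tuple of its values, which corresponds to comparing n-grams in normal form; the function names are illustrative.

```python
from collections import Counter

def ngrams(seq, n):
    """All n-grams of seq in order of occurrence, as tuples of values."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

def ngram_frequencies(seq, n):
    """Map each distinct n-gram r to the number of times it occurs in seq."""
    return Counter(ngrams(seq, n))

phi = [2, 2, 1, 2, 2, 2, 1]
freq = ngram_frequencies(phi, 2)

print(len(ngrams(phi, 2)))   # number of 2-grams: |phi| - n + 1
print(freq)
```

A sequence of length |φ| has |φ| − n + 1 n-grams, which is where the term 2(n − 1) in the normalizations of the following measures comes from.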



n-Gram Measures (1/2)

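Definition

1. The Count-Distinct measure (CDM) is the count of n-grams common to both sequences:

$$S^{\mathrm{CDM}}_n(\phi, \varphi) := |S_n(\phi) \cap S_n(\varphi)|$$

2. The Sum-Common measure (SCM) is the sum of frequencies of n-grams common to both sequences:

$$S^{\mathrm{SCM}}_n(\phi, \varphi) := \sum_{r \in \widehat{S}_n(\phi) \cap \widehat{S}_n(\varphi)} \left[ f_\phi(r) + f_\varphi(r) \right]$$

Normalization to get similarity measure:

$$\sigma^{\mathrm{CDM}}_n(\phi, \varphi) := \frac{S^{\mathrm{CDM}}_n(\phi, \varphi)}{\frac{1}{2} \left[ |S_n(\varphi)| + |S_n(\phi)| \right]}, \qquad \sigma^{\mathrm{SCM}}_n(\phi, \varphi) := \frac{S^{\mathrm{SCM}}_n(\phi, \varphi)}{|\phi| + |\varphi| - 2(n - 1)}$$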



n-Gram Measures (2/2)

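Definition (continued)

3. The Ukkonen measure (UM) counts the absolute differences of frequencies of all distinct n-grams of both sequences:

$$S^{\mathrm{UM}}_n(\phi, \varphi) := \sum_{r \in \widehat{S}_n(\phi) \cup \widehat{S}_n(\varphi)} |f_\phi(r) - f_\varphi(r)|$$

Transform into similarity measure:

$$\sigma^{\mathrm{UM}}_n(\phi, \varphi) := 1 - \frac{S^{\mathrm{UM}}_n(\phi, \varphi)}{|\phi| + |\varphi| - 2(n - 1)}$$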



Example: Similarity of Melodies (1/2)

C major and C minor scale:

[Score: C major scale. Music engraving by LilyPond 2.10.25, www.lilypond.org]

[Score: C minor scale. Music engraving by LilyPond 2.10.25, www.lilypond.org]

According to invariance w.r.t. transposition, tempo:

φ = (2, 2, 1, 2, 2, 2, 1), ϕ = (2, 1, 2, 2, 1, 2, 2)



Example: Similarity of Melodies (2/2)

Two 6-grams for each melody:

r1 = (2, 2, 1, 2, 2, 2), r2 = (2, 1, 2, 2, 2, 1)

s1 = (2, 1, 2, 2, 1, 2), s2 = (1, 2, 2, 1, 2, 2)

We get $\sigma^{\star}_n(\phi, \varphi) = 0$ for all measures.

Consider only first 6 tones and 4-grams:

φ = (2, 2, 1, 2, 2), ϕ = (2, 1, 2, 2, 1)

Two 4-grams for each melody:

r1 = (2, 2, 1, 2), r2 = (2, 1, 2, 2)

s1 = (2, 1, 2, 2), s2 = (1, 2, 2, 1)

We get $\sigma^{\star}_n(\phi, \varphi) = \frac{1}{2}$ for all measures.
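Both results of this example can be checked with a short script. This is a sketch under two assumptions: n-grams are compared by value, and the sets in the CDM normalization contain distinct n-grams. Both sequences must have at least n elements, otherwise the denominators are zero.

```python
from collections import Counter

def freqs(seq, n):
    """Frequencies of the distinct n-grams of seq (n-grams as value tuples)."""
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))

def sigma_cdm(phi, psi, n):
    """Count-Distinct measure, normalized by the mean number of distinct n-grams."""
    f, g = freqs(phi, n), freqs(psi, n)
    return len(set(f) & set(g)) / (0.5 * (len(f) + len(g)))

def sigma_scm(phi, psi, n):
    """Sum-Common measure, normalized by the total number of n-grams."""
    f, g = freqs(phi, n), freqs(psi, n)
    common = set(f) & set(g)
    return sum(f[r] + g[r] for r in common) / (len(phi) + len(psi) - 2 * (n - 1))

def sigma_um(phi, psi, n):
    """Ukkonen measure, turned into a similarity."""
    f, g = freqs(phi, n), freqs(psi, n)
    union = set(f) | set(g)
    distance = sum(abs(f[r] - g[r]) for r in union)
    return 1 - distance / (len(phi) + len(psi) - 2 * (n - 1))

major = [2, 2, 1, 2, 2, 2, 1]
minor = [2, 1, 2, 2, 1, 2, 2]

for sigma in (sigma_cdm, sigma_scm, sigma_um):
    print(sigma.__name__, sigma(major, minor, 6), sigma(major[:5], minor[:5], 4))
```

For the truncated scales the only common 4-gram is (2, 1, 2, 2), which occurs once in each sequence, so all three measures agree.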


Generalized Similarity Measures

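Definitions of $S^{\star}_n(\phi, \varphi)$ can be rewritten in terms of frequencies:

$$S^{\mathrm{CDM}}_n(\phi, \varphi) := |S_n(\phi) \cap S_n(\varphi)| = \sum_{r \in S_n(\phi)} 1_{S_n(\varphi)}(r) \cdot \frac{1}{f_\phi(r)}$$

$$S^{\mathrm{SCM}}_n(\phi, \varphi) := \sum_{r \in \widehat{S}_n(\phi) \cap \widehat{S}_n(\varphi)} \left[ f_\phi(r) + f_\varphi(r) \right] = \sum_{r \in S_n(\phi)} 1_{S_n(\varphi)}(r) \cdot \left( 1 + \frac{f_\varphi(r)}{f_\phi(r)} \right)$$

$$\vdots$$

Generalize the notion of frequency. For a similarity measure $\sigma$, let

$$\nu_\phi(r) := \sum_{u \in S_{|r|}(\phi)} \sigma(u, r) \ge |\{ u \in S_{|r|}(\phi) \mid \hat{r} = \hat{u} \}| = f_\phi(r) .$$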
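The inequality holds because every exact match contributes σ(u, r) = 1 and every other n-gram contributes a non-negative value. The sketch below illustrates this with a hypothetical similarity on n-grams, the fraction of positions at which two n-grams agree; this choice is made only for illustration and is not taken from the paper.

```python
def ngrams(seq, n):
    """All n-grams of seq in order of occurrence, as tuples of values."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

def positional_similarity(u, r):
    """Hypothetical sigma: fraction of positions where u and r agree."""
    return sum(a == b for a, b in zip(u, r)) / len(r)

def frequency(seq, r):
    """f_seq(r): number of n-grams of seq equal to r."""
    return sum(u == tuple(r) for u in ngrams(seq, len(r)))

def generalized_frequency(seq, r, sigma):
    """nu_seq(r): sum of sigma(u, r) over all |r|-grams u of seq."""
    return sum(sigma(u, tuple(r)) for u in ngrams(seq, len(r)))

phi = [2, 2, 1, 2, 2, 2, 1]
r = (2, 2, 1)

f = frequency(phi, r)
nu = generalized_frequency(phi, r, positional_similarity)
print(f, nu)
```

Here the two exact occurrences of r each contribute 1, and the three remaining 3-grams contribute partial matches, so ν exceeds f.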



Excursion: Edit-Distance

Edit-Distance $d(\phi, \varphi)$ between two sequences in normal form:

I Defined as the minimum number of deletion, insertion, and substitution steps

Typical example of dynamic programming:

       ε  S  a  t  u  r  d  a  y
    ε  0  1  2  3  4  5  6  7  8
    S  1  0  1  2  3  4  5  6  7
    u  2  1  1  2  2  3  4  5  6
    n  3  2  2  2  3  3  4  5  6
    d  4  3  3  3  3  4  3  4  5
    a  5  4  3  4  4  4  4  3  4
    y  6  5  4  4  5  5  5  4  3

Recurrence relation for $i \in [0 : |\phi|]$, $j \in [0 : |\varphi|]$:

$$D(i, j) := \min \left\{ D(i-1, j) + 1,\; D(i, j-1) + 1,\; D(i-1, j-1) + \delta(\phi_i, \varphi_j) \right\}$$

$$D(i, 0) := i, \qquad D(0, j) := j$$

where $\delta(a, b) = 0$ if $a = b$ and $1$ otherwise.
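The recurrence can be filled in row by row. A minimal sketch, assuming unit costs for all three operations, i.e. δ is 0 for equal elements and 1 otherwise, which is consistent with the table above:

```python
def edit_distance(phi, psi):
    """Edit distance between two sequences via dynamic programming."""
    rows, cols = len(phi) + 1, len(psi) + 1
    D = [[0] * cols for _ in range(rows)]

    # Base cases: turning a prefix into the empty sequence and vice versa.
    for i in range(rows):
        D[i][0] = i
    for j in range(cols):
        D[0][j] = j

    for i in range(1, rows):
        for j in range(1, cols):
            delta = 0 if phi[i - 1] == psi[j - 1] else 1
            D[i][j] = min(
                D[i - 1][j] + 1,          # deletion
                D[i][j - 1] + 1,          # insertion
                D[i - 1][j - 1] + delta,  # substitution or match
            )
    return D[-1][-1]

print(edit_distance("Sunday", "Saturday"))
print(edit_distance([2, 2, 1, 2, 2], [2, 1, 2, 2, 1]))
```

The function accepts any indexable sequences, so it works on interval sequences of melodies as well as on strings.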



Summary and Criticism

Summary:

I Goal: Similarity of monophonic melodies

I Method: Melodies as time series

I Generalizing measures based on “n-grams”, i.e., subsequences of fixed length

Criticism (→ discussions? :-))

I Why only n-grams for arbitrary (but fixed) n’s?

I Generalization unmotivated and “arbitrary”

I Burdensome yet still imprecise mathematical formalisms (tried to mitigate here)



Evaluating Similarity Approaches

Daniel Müllensiefen, Klaus Frieler (2006): Evaluating Different Approaches to Measuring the Similarity of Melodies. In: Data Science and Classification, Springer, pp. 299–306. http://dx.doi.org/10.1007/3-540-34416-0_32

Several studies compare similarity measures for melodies:

I “Very different similarity values”

I “not clear which one is the most adequate”




General Idea of Similarity Algorithms

1. Basic transformations (representations)

  I Projections: E.g., to pitch component

  I Differentiation

2. Main transformations, e.g.:

  I Rhythmical weighting: Assume melody is quantized, replace pitch of duration n · T, n > 1, by sequence of n tones with duration T

  I Contourization: Idea that perceptually important notes are the extrema → substitute pitches in between by linear interpolation

3. Computation
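The two main transformations can be sketched as follows. The input formats and the treatment of extrema are assumptions made for this illustration: a note is a pair (pitch, n) with duration n · T, an extremum is a note where the pitch direction strictly changes, and the first and last notes are always kept.

```python
def rhythmical_weighting(notes):
    """Replace each (pitch, n) by n repetitions of pitch, one per time unit T."""
    return [pitch for pitch, n in notes for _ in range(n)]

def contourize(pitches):
    """Keep first, last, and local extrema; interpolate linearly in between."""
    last = len(pitches) - 1
    anchors = [0]
    for i in range(1, last):
        left = pitches[i] - pitches[i - 1]
        right = pitches[i + 1] - pitches[i]
        if left * right < 0:  # direction changes: local maximum or minimum
            anchors.append(i)
    if last > 0:
        anchors.append(last)

    contour = [float(p) for p in pitches]
    for a, b in zip(anchors, anchors[1:]):
        for i in range(a + 1, b):
            slope = (pitches[b] - pitches[a]) / (b - a)
            contour[i] = pitches[a] + slope * (i - a)
    return contour

print(rhythmical_weighting([(60, 2), (62, 1), (64, 3)]))
print(contourize([60, 62, 67, 65, 60, 64]))
```

Plateaus (repeated pitches) are not treated as extrema in this simplified version; a fuller implementation would need a rule for them.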



Experimental Evaluation

Subjects: Musicology students with longtime musical experience

1st test: Similarity of 84 melody pairs with constructed errors

I Generated from 14 original “western popular” melodies

I Rhythm errors, Pitch errors, etc.

2nd and 3rd test: Also completely different melodies

Only let subjects with stable judgement continue:

I 1st test: 23 out of 82

I 2nd test: 12 out of 16

I 3rd test: 5 out of 10



Results

Homogeneity of human similarity judgement:

I High correlation between judgements of selected subjects

I Hypothesis: Objective similarity at least for “western” experts

I “Conceptual Foundation for statistical modeling”

Human notion of similarity is “adaptive”:

I 1st test with only slightly altered melodies: Pitch information sufficient

I 2nd test: When melodies are different, subjects’ ratings best modeled by including rhythmical information

“Unrelated melodies that differ strongly [. . . ] are hard to relate”. :-)



Estimate Pitch in General Audio Data

Katrin Sommer, Claus Weihs (2006): Using MCMC as a Stochastic Optimization Procedure for Musical Time Series. In: Data Science and Classification, Springer, pp. 307-314. http://dx.doi.org/10.1007/3-540-34416-0_33

Pitch estimation of monophonic sound by joint estimation of overtones

I Based on a model by Davy and Godsill (2002)

I Estimating parameters of the model requires computing multi-dimensional integrals

I An MCMC (Markov chain Monte Carlo) algorithm is used




Harmonic Model (1/2)

Basic model:

$$y_t = \sum_{h=1}^{H} \left( a_{h,t} \cos(2\pi h f_0 t) + b_{h,t} \sin(2\pi h f_0 t) \right) + \varepsilon_t$$

where $y_t$ is the instantaneous amplitude at time $t$.

Idea:

I Tone is composed out of harmonics from H partial tones

I First partial is fundamental frequency f0

I Remaining H − 1 partials are called overtones

I Amplitudes of each partial tone are time dependent

I εt is model error
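The model can be used in the forward direction to synthesize a tone from given parameters. The sketch below simplifies it in two ways that are assumptions of this example, not of the paper: the amplitudes are constant over time and the error term is set to zero. The sample rate of 8800 Hz is chosen so that one period of 440 Hz spans exactly 20 samples.

```python
import math

def harmonic_signal(f0, amplitudes, duration, sample_rate):
    """Synthesize y_t from the harmonic model without noise.

    amplitudes: list of (a_h, b_h) for the partials h = 1..H,
    held constant over time in this sketch.
    """
    n = int(round(duration * sample_rate))
    samples = []
    for k in range(n):
        t = k / sample_rate
        y = sum(
            a * math.cos(2 * math.pi * h * f0 * t)
            + b * math.sin(2 * math.pi * h * f0 * t)
            for h, (a, b) in enumerate(amplitudes, start=1)
        )
        samples.append(y)
    return samples

# Fundamental plus two overtones (H = 3) with decaying amplitudes.
partials = [(1.0, 0.0), (0.5, 0.2), (0.25, 0.1)]
samples = harmonic_signal(440.0, partials, 0.1, 8800)
print(len(samples), samples[0])
```

Pitch estimation is the inverse problem: given the samples, recover f0, H, and the amplitudes, which is what makes the parameter estimation discussed next necessary.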



Harmonic Model (2/2)

Estimating the parameters of the former model requires computing a multidimensional integral of the form

$$\int_{\Omega} f(\theta, f_0, H, \sigma_\varepsilon^2)\, p(\theta, f_0, H, \sigma_\varepsilon^2 \mid y)\, d\theta\, df_0\, dH\, d\sigma_\varepsilon^2 .$$

Standard numerical techniques generally inaccurate or too slow

I Therefore, use a Metropolis-Hastings MCMC algorithm

I Sufficient for this talk: Generalization of Monte Carlo Integration
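To give an impression of the technique, here is a generic random-walk Metropolis sampler for a one-dimensional target. It is a textbook sketch, not the sampler from the paper: the Gaussian proposal, the step size, and the standard normal target are all assumptions chosen for illustration.

```python
import math
import random

def metropolis_hastings(log_density, start, steps, step_size, rng):
    """Random-walk Metropolis sampler for a one-dimensional target density.

    log_density may be unnormalized; only density ratios are used.
    """
    x = start
    log_p = log_density(x)
    samples = []
    for _ in range(steps):
        candidate = x + rng.gauss(0.0, step_size)
        log_p_candidate = log_density(candidate)
        # Symmetric proposal: accept with probability min(1, p(candidate) / p(x)).
        if math.log(1.0 - rng.random()) < log_p_candidate - log_p:
            x, log_p = candidate, log_p_candidate
        samples.append(x)
    return samples

def standard_normal_log_density(x):
    """Logarithm of the standard normal density, up to a constant."""
    return -0.5 * x * x

rng = random.Random(0)
samples = metropolis_hastings(standard_normal_log_density, 0.0, 20000, 1.0, rng)

# Averages over the chain approximate expectations under the target.
mean = sum(samples) / len(samples)
second_moment = sum(x * x for x in samples) / len(samples)
print(mean, second_moment)
```

Expectations under the target are approximated by averages over the chain, which is the sense in which the method generalizes Monte Carlo integration to distributions that cannot be sampled directly.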



Mini-Excursion: Monte Carlo Integration

Wanted:

$$I = \int_0^1 f(\omega)\, d\omega$$

Let $(U_i)_{i \in \mathbb{N}}$ be independent random variables with $U_i \sim U(0, 1)$. According to the strong law of large numbers:

$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} f(U_i) = E[f(U_i)] = I$$

The probabilistic error bound decreases as $1/\sqrt{n}$.

Mainly suited for multi-dimensional integrals (dimension = d):

I Standard numerical approaches need exponentially many samples in d

I Monte Carlo approach: Error bound independent of d
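A minimal sketch of the one-dimensional case, using the integrand f(ω) = ω² as an arbitrary example with known integral 1/3:

```python
import random

def monte_carlo_integral(f, n, rng):
    """Estimate the integral of f over [0, 1] as the mean of f at n uniform samples."""
    return sum(f(rng.random()) for _ in range(n)) / n

rng = random.Random(42)  # fixed seed so the run is reproducible
estimate = monte_carlo_integral(lambda w: w * w, 100_000, rng)
print(estimate, abs(estimate - 1 / 3))
```

Quadrupling n roughly halves the typical error, in line with the 1/√n rate. For a d-dimensional integrand only the sampling step changes (d uniform numbers per sample instead of one).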



Summary and Criticism

Goal:

I Estimate pitch from general audio signals

I Davy and Godsill (2002): General model very successful

I New: Introduction of “technical” stochastic optimization

Criticism:

I Hardly (if at all) understandable without looking at the original paper by Davy and Godsill (2002)



Sound characteristics

Claus Weihs, Gero Szepannek, Uwe Ligges, Karsten Luebke, Nils Raabe (2006): Local Models in Register Classification by Timbre. In: Data Science and Classification, Springer, pp. 315–322. http://dx.doi.org/10.1007/3-540-34416-0_34

Follow-up paper pursuing the following goal:

I Identify high and low musical register (Soprano, Alto vs. Tenor, Bass) by timbre

I After pitch information is eliminated from the spectrum




The Data

17 singers performing the song “Tochter Zion” (G.F. Händel)

I Downsampled to 11,025 Hz and standardized to interval [−1, 1]

I Analyses are based on characteristics derived from tones corresponding to single notes (→ suitable segmentation)

I For identified notes: Derive pitch-independent periodogram

[Excerpt shown on the slide; running head: Register Classification by Timbre, page 3]

[Fig. 1. Pitch independent periodogram (professional bass singer). Horizontal axis: FF, OT 1 to OT 9.]

violin-mv. Bass version: bassoon, bflute-flu, bflute-vib, cello-bv, elecbass1, elecbass5, elecbass6, elecguitar1, elecguitar2, elecguitar4, frehorn, frehorn-m, marimba, piano-ld, piano-pl, piano-sft, tromb-ten, tromb-tenm, tuba, viola-mv. Thus, 28 high instruments and 20 low instruments were chosen together with 10 high female singers and 7 male.

From the periodogram corresponding to each tone corresponding to an identified note voice print characteristics are derived (cp. Weihs and Ligges (2003b)). For our purpose we only use the size and the shape corresponding to the first 13 partials, i.e. to the fundamental frequency and the first 12 overtones, in a pitch independent periodogram (cp. Figure 1). In order to measure the size of the peaks in the spectrum, the mass (weight) of the peaks of the partials are determined as the sum of the percentage shares of those parts of the corresponding peak in the spectrum which are higher than a pre-specified threshold. The shape of a peak cannot easily be described. Therefore, we only use one simple characteristic of the shape, namely the width of the peak of the partials. The width of a peak is measured by the half tone distance between the smallest and the biggest frequency of the peak with a spectral height above a pre-specified threshold. Overall, every tone is characterized by the above 26 characteristics which are used as a basis for classification. For details on the computation of the measures see Guttner (2001). Note that pitch information is eliminated in that the frequencies corresponding to fundamentals and overtones are ignored in the pitch independent periodogram. Mass is measured as a percentage (%), whereas width is measured in parts of halftones (pht). Figure 2 illustrates the voice print corresponding to the whole song “Tochter Zion” for a particular singer. For masses and widths boxplots are indicating variation over the involved tones. For the analyses of this paper we ignore halftone distance and formant intensity (cp. Weihs and Ligges (2003b)), and use the other characteristics of the voice print for individual tones, as well as averaged characteristics over all involved tones, leading to only one value for each characteristic per singer or instrument.

3 Classification Methods

On these data we applied supervised classification methods (see, e.g., Michie et al. (1994)) trying to reproduce the pre-defined grouping by means of classi-

I Consider only size and shape of first 13 partials, after cropping Fourier frequencies below some threshold

I Mass: sum of the percentage shares

I Width: difference between max and min frequency
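The two characteristics of a single peak can be sketched as follows. The input format is an assumption made for this illustration: the Fourier frequencies belonging to one peak together with their percentage shares of the periodogram. The function name and the example numbers are hypothetical, and the actual computation in the paper (see the reference to Guttner (2001) in the excerpt) may differ in detail.

```python
def peak_mass_and_width(frequencies, shares, threshold):
    """Mass and width of one spectral peak.

    frequencies: Fourier frequencies (Hz) of the bins belonging to the peak
    shares:      percentage share of each bin in the periodogram
    threshold:   bins with a share at or below this value are cropped
    """
    above = [(f, s) for f, s in zip(frequencies, shares) if s > threshold]
    if not above:
        return 0.0, 0.0
    mass = sum(s for _, s in above)
    width = max(f for f, _ in above) - min(f for f, _ in above)
    return mass, width

# Hypothetical peak around 440 Hz, for illustration only.
frequencies = [430.0, 435.0, 440.0, 445.0, 450.0]
shares = [0.5, 2.0, 6.0, 3.0, 0.25]

print(peak_mass_and_width(frequencies, shares, threshold=1.0))
```

Applying this to each of the first 13 partials gives the 2 · 13 = 26 characteristics per tone.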



Tochter Zion, freue Dich

Georg Friedrich Händel (1685-1759)

[Score: “Tochter Zion, freue Dich”, parts for Sopran/Alt and Bass. Music engraving by LilyPond 2.8.6, www.lilypond.org. CC by-sa, http://creativecommons.org/licenses/by-sa/2.0/de/]



Classification

Common techniques: Linear Discriminant Analysis (LDA)

I 2 · 13 = 26 variables are generated for every note separately

I Averaged over all notes

I Yields one single value of mass and width per harmonic and singer/instrument

Global Model:

I Training set includes samples from all instruments

Local Model:

I Create own classification rules for each instrument

I Problem: Which classification rule to use?

  I Maximum-posterior rule: $k \in \arg\max_k \left( \max_l p_l(k \mid x) \right)$

  I Average-posterior rule: $k \in \arg\max_k \sum_l p_l(k \mid x)$

  I Majority Voting: . . .
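The first two combination rules can be sketched in a few lines. The data layout is an assumption of this example: each local model l contributes a dictionary mapping every class k to its posterior p_l(k | x), and the posterior values below are made up for illustration.

```python
def maximum_posterior_rule(posteriors):
    """Choose the class with the largest single posterior over all local models."""
    classes = posteriors[0].keys()
    return max(classes, key=lambda k: max(p[k] for p in posteriors))

def average_posterior_rule(posteriors):
    """Choose the class with the largest summed posterior over all local models."""
    classes = posteriors[0].keys()
    return max(classes, key=lambda k: sum(p[k] for p in posteriors))

# Hypothetical posteriors of three local models for one observation x.
posteriors = [
    {"high": 0.9, "low": 0.1},
    {"high": 0.3, "low": 0.7},
    {"high": 0.2, "low": 0.8},
]

print(maximum_posterior_rule(posteriors))
print(average_posterior_rule(posteriors))
```

The example is constructed so that the rules disagree: one very confident local model decides the maximum-posterior rule, while the average-posterior rule follows the two moderately confident ones. Ties are resolved in favor of the class listed first.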


Results

Classification of voices alone gives good results. Explanations:

I Human mouth acts as a highpass filter

I The lower the tone, the less the mass of the fundamental frequency compared to the 1st overtone

I Therefore: Sopranos have more mass in the fundamental frequency

I Local model yields improvements when there are instruments

I Voice print of a professional bass singer:

[Figure 2: Voice print of professional bass singer. Panels: Halftone Distance, Formant Intensity, Mass, and Width; Mass and Width are plotted per partial (FF, 2 to 13).]

distance between two Fourier frequencies is 5.38 Hertz. Overall, every tone is characterized by the above 26 characteristics which are used as a basis for classification. For details on the computation of the measures see Guttner (2001). Mass is measured as a percentage (%), whereas width is measured in Hertz.

Figure 2 illustrates the voice print corresponding to the whole song “Tochter Zion” for a particular singer. For masses and widths boxplots are indicating variation over the involved tones. For the analyses of this paper we ignore halftone distance and formant intensity (cp. Weihs and Ligges (2003)), and use the other characteristics of the voice print for individual tones, as well as averaged characteristics over all involved tones, leading to only one value for each characteristic per harmonic and singer or instrument.

3 Global modelling

Classification of the register of different instruments and singers is performed using two very common techniques: the classical linear discriminant analysis [LDA] (Fisher (1936)) as well as Classification trees (more specifically the
