Rhythm Analysis in Music - Northwestern Engineeringpardo/courses/eecs352/lectures/MPM12... · Music...
Transcript of Rhythm Analysis in Music - Northwestern Engineeringpardo/courses/eecs352/lectures/MPM12... · Music...
Time-frequency Masking
EECS 352: Machine Perception of Music & Audio
Zafar RAFII, Spring 2012 1
Real spectrogram
time
fre
qu
en
cy
• The Short-Time Fourier Transform (STFT) is a succession of local Fourier Transforms (FT)
STFT
STFT
Zafar RAFII, Spring 2012 2
+ j* Imaginary spectrogram
time
fre
qu
en
cy
frame i frame i
Time signal
timewindow i
Imaginary spectrum
frequency
Real spectrum
frequency
• If we used a window of N samples, the FT has N values (from 0 to N-1); let’s suppose N = 8…
Time signal
timewindow i
FT
STFT
Zafar RAFII, Spring 2012 3
+ j* window i frame i frame i
-3 -1 0 1 2 1 0 -1 0 -1 0 1 0 -1 0 1 + j* 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7
N frequency values N frequency values
Imaginary spectrum
frequency
Real spectrum
frequency
• Frequency index 0 is the DC component; it is always real (it is the sum of the time values)
Time signal
timewindow i
FT
STFT
Zafar RAFII, Spring 2012 4
+ j* window i frame i frame i
-3 -1 0 1 2 1 0 -1 0 -1 0 1 0 -1 0 1 + j* 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7
Imaginary spectrum
frequency
Real spectrum
frequency
• Frequency indices from 1 to floor(N/2) are the “unique” complex values (a + j*b)
Time signal
timewindow i
FT
STFT
Zafar RAFII, Spring 2012 5
+ j* window i frame i frame i
-3 -1 0 1 2 1 0 -1 0 -1 0 1 0 -1 0 1 + j* 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7
Imaginary spectrum
frequency
Real spectrum
frequency
• Frequency indices from floor(N/2) to N-1 are the “mirrored” complex conjugates (a - j*b)
Time signal
timewindow i
FT
STFT
Zafar RAFII, Spring 2012 6
+ j* window i frame i frame i
-3 -1 0 1 2 1 0 -1 0 -1 0 1 0 -1 0 1 + j* 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7
Imaginary spectrum
frequency
Real spectrum
frequency
• If N is even, there is also a pivot frequency at frequency index N/2; it is always real
Time signal
timewindow i
FT
STFT
Zafar RAFII, Spring 2012 7
+ j* window i frame i frame i
-3 -1 0 1 2 1 0 -1 0 -1 0 1 0 -1 0 1 + j* 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7
Real spectrogram
time
fre
qu
en
cy
• Summary of the frequency indices and values in the STFT (in colors!)
STFT
Zafar RAFII, Spring 2012 8
Imaginary spectrogram
time
fre
qu
en
cy
Frequency 0 = DC component (always real)
Frequency 1 to floor(N/2) = “unique” complex values
Frequency N/2 = “pivot” frequency (always real)
Frequency floor(N/2) to N-1 = “mirrored” complex conjugates
N frequency values = frequency 0 to N-1
+ j*
• The (magnitude) spectrogram is the magnitude (absolute value) of the STFT
Magnitude spectrogram
time
fre
qu
en
cy
Real spectrogram
time
fre
qu
en
cy
Spectrogram
Zafar RAFII, Spring 2012 9
Imaginary spectrogram
time
fre
qu
en
cy
abs
frame i frame i frame i
+ j*
• For a complex number 𝑎 + 𝑗 ∗ 𝑏, the absolute
value is 𝑎 + 𝑗 ∗ 𝑏 = 𝑎2 + 𝑏2
abs
Spectrogram
Zafar RAFII, Spring 2012 10
Imaginary spectrum
frequency
Real spectrum
frequency
+ j* frame i frame i
-3 -1 0 1 2 1 0 -1 0 -1 0 1 0 -1 0 1 + j* 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7
Magnitude spectrum
frequency
=
frame i
1.4 1.4 1.4 1.4 3 0 2 0
1.4 1.4 1.4
• All the N frequency values (frequency indices from 0 to N-1) are real and positive (abs!)
abs
Spectrogram
Zafar RAFII, Spring 2012 11
Imaginary spectrum
frequency
Real spectrum
frequency
+ j* frame i frame i
-3 -1 0 1 2 1 0 -1 0 -1 0 1 0 -1 0 1 + j* 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7
Magnitude spectrum
frequency
0 1 2 3 4 5 6 7
= 1.4 3 0 2 0
frame i
N frequency values
1.4 1.4 1.4
• Frequency indices from 0 to floor(N/2) are the unique frequency values (with DC and pivot)
abs
Spectrogram
Zafar RAFII, Spring 2012 12
Imaginary spectrum
frequency
Real spectrum
frequency
+ j* frame i frame i
-3 -1 0 1 2 1 0 -1 0 -1 0 1 0 -1 0 1 + j* 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7
Magnitude spectrum
frequency
0 1 2 3 4 5 6 7
= 1.4 3 0 2 0
frame i
1.4 1.4 1.4
• Frequency indices from floor(N/2)+1 to N-1 are the mirrored frequency values
abs
Spectrogram
Zafar RAFII, Spring 2012 13
Imaginary spectrum
frequency
Real spectrum
frequency
+ j* frame i frame i
-3 -1 0 1 2 1 0 -1 0 -1 0 1 0 -1 0 1 + j* 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7
Magnitude spectrum
frequency
0 1 2 3 4 5 6 7
= 1.4 3 0 2 0
frame i
1.4 1.4 1.4
• Since there are redundant, we can discard the frequency values from floor(N/2)+1 to N-1
abs
Spectrogram
Zafar RAFII, Spring 2012 14
Imaginary spectrum
frequency
Real spectrum
frequency
+ j* frame i frame i
-3 -1 0 1 2 1 0 -1 0 -1 0 1 0 -1 0 1 + j* 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7
Magnitude spectrum
frequency
0 1 2 3 4 5 6 7
= 1.4 3 0 2 0
frame i
floor(N/2)+1 unique frequency values
• The spectrogram has therefore floor(N/2)+1 unique frequency values (with DC and pivot)
Magnitude spectrogram
time
fre
qu
en
cy
Real spectrogram
time
fre
qu
en
cy
Spectrogram
Zafar RAFII, Spring 2012 15
Imaginary spectrogram
time
fre
qu
en
cy
frame i frame i frame i
abs + j*
Spectrogram
• Why the magnitude spectrogram?
– Easy to visualize (compare with the STFT)
– Magnitude information more important
– Human ear less sensitive to phase
Zafar RAFII, Spring 2012 16
Time signal
time
Magnitude spectrogram
time
fre
qu
en
cy +
0
Spectrogram
• When you display a spectrogram in Matlab…
– imagesc: data is scaled to use the full colormap
– 10*log10(V): magnitude spectrogram V in dB
– set(gca,’YDir’,’normal’): y-axis from bottom to top
Zafar RAFII, Spring 2012 17
Time signal
time
Magnitude spectrogram
time
fre
qu
en
cy +
0
• The signal cannot be reconstructed from the spectrogram (phase information is missing!)
Spectrogram
Zafar RAFII, Spring 2012 18
Time signal
time
Imaginary spectrogram
time
fre
qu
en
cy
Real spectrogram
time
fre
qu
en
cy
Magnitude spectrogram
time
fre
qu
en
cy iSTFT ???
???
• Suppose we have a mixture of two sources: a music signal and a voice signal
Time-frequency Masking
Zafar RAFII, Spring 2012 19
Music signal
time
Mixture signal
time
Voice signal
time
Music spectrogram
time
fre
qu
en
cy
Voice spectrogram
time
fre
qu
en
cy
Mixture spectrogram
timefr
eq
ue
ncy
+ 0
+ 0
+ 0
• We assume that the sources are sparse = most of the time-frequency bins have null energy
Music spectrogram
time
fre
qu
en
cy
Voice spectrogram
time
fre
qu
en
cy
Mixture spectrogram
timefr
eq
ue
ncy
Time-frequency Masking
Zafar RAFII, Spring 2012 20
Music signal
time
Mixture signal
time
Voice signal
time
Music spectrogram
time
fre
qu
en
cy +
0
+ 0
+ 0
Voice spectrogram
time
fre
qu
en
cy
Mixture spectrogram
timefr
eq
ue
ncy
• We assume that the sources are sparse = most of the time-frequency bins have null energy
Music spectrogram
time
fre
qu
en
cy
Voice spectrogram
time
fre
qu
en
cy
Mixture spectrogram
timefr
eq
ue
ncy
Time-frequency Masking
Zafar RAFII, Spring 2012 21
Music signal
time
Mixture signal
time
Voice signal
time
Music spectrogram
time
fre
qu
en
cy +
0
+ 0
+ 0
Voice spectrogram
time
fre
qu
en
cy
Mostly low energy bins
Mostly low energy bins Mixture spectrogram
timefr
eq
ue
ncy
• We assume that the sources are disjoint = their time-frequency bins do not overlap
Music spectrogram
time
fre
qu
en
cy
Voice spectrogram
time
fre
qu
en
cy
Mixture spectrogram
timefr
eq
ue
ncy
Time-frequency Masking
Zafar RAFII, Spring 2012 22
Music signal
time
Mixture signal
time
Voice signal
time
Music spectrogram
time
fre
qu
en
cy +
0
+ 0
+ 0
Voice spectrogram
time
fre
qu
en
cy
Mixture spectrogram
timefr
eq
ue
ncy
• We assume that the sources are disjoint = their time-frequency bins do not overlap
Music spectrogram
time
fre
qu
en
cy
Voice spectrogram
time
fre
qu
en
cy
Mixture spectrogram
timefr
eq
ue
ncy
Time-frequency Masking
Zafar RAFII, Spring 2012 23
Music signal
time
Mixture signal
time
Voice signal
time
Music spectrogram
time
fre
qu
en
cy +
0
+ 0
+ 0
Voice spectrogram
time
fre
qu
en
cy
Mixture spectrogram
timefr
eq
ue
ncy
Not a lot of overlapping
• Assuming sparseness and disjointness, we can discriminate the bins between mixed sources
Music spectrogram
time
fre
qu
en
cy
Voice spectrogram
time
fre
qu
en
cy
Mixture spectrogram
timefr
eq
ue
ncy
Time-frequency Masking
Zafar RAFII, Spring 2012 24
Music spectrogram
time
fre
qu
en
cy +
0
+ 0
+ 0
Voice spectrogram
time
fre
qu
en
cy
Mixture spectrogram
timefr
eq
ue
ncy
Music signal
time
Mixture signal
time
Voice signal
time
• Assuming sparseness and disjointness, we can discriminate the bins between mixed sources
Music spectrogram
time
fre
qu
en
cy
Voice spectrogram
time
fre
qu
en
cy
Mixture spectrogram
timefr
eq
ue
ncy
Time-frequency Masking
Zafar RAFII, Spring 2012 25
Music spectrogram
time
fre
qu
en
cy +
0
+ 0
+ 0
Voice spectrogram
time
fre
qu
en
cy
Mixture spectrogram
timefr
eq
ue
ncy
Music signal
time
Mixture signal
time
Voice signal
time
Source 1 = bright Source 2 = dark
• Bins that are likely to belong to one source are assigned to 1, the rest to 0 = binary masking!
Binary mask
time
fre
qu
en
cy
Music spectrogram
time
fre
qu
en
cy
Voice spectrogram
time
fre
qu
en
cy
Mixture spectrogram
timefr
eq
ue
ncy
Time-frequency Masking
Zafar RAFII, Spring 2012 26
Music spectrogram
time
fre
qu
en
cy +
0
+ 0
+ 0
Voice spectrogram
time
fre
qu
en
cy
Mixture spectrogram
timefr
eq
ue
ncy
Binary mask
time
fre
qu
en
cy 1
0
Music signal
time
Voice signal
time
Source of interest Interfering source
• By multiplying the binary mask to the mixture spectrogram, we could “preview” the estimate
Time-frequency Masking
Zafar RAFII, Spring 2012 27
Binary mask
time
fre
qu
en
cy .x
Mixture spectrogram
time
fre
qu
en
cy
Masked spectrogram
time
fre
qu
en
cy
+ 0
+ 0
1 0
• However, we cannot derive the estimate itself because we cannot invert a spectrogram!
Time-frequency Masking
Zafar RAFII, Spring 2012 28
Binary mask
time
fre
qu
en
cy .x
Mixture spectrogram
time
fre
qu
en
cy
Masked spectrogram
time
fre
qu
en
cy
+ 0
+ 0
Music estimate
time
1 0
• We mirror the redundant frequencies from the unique frequencies (without DC and pivot)
Time-frequency Masking
Zafar RAFII, Spring 2012 29
Binary mask
time
fre
qu
en
cy
Binary mask
time
fre
qu
en
cy
Binary mask
time
frequency
1 0
1 0
• And the we apply this full binary mask to the STFT using a element-wise multiplication
Time-frequency Masking
Zafar RAFII, Spring 2012 30
Binary mask
time
fre
qu
en
cy
Binary mask
time
fre
qu
en
cy
Binary mask
time
frequency
Imaginary spectrogram
time
fre
qu
en
cy
Real spectrogram
time
fre
qu
en
cy
1 0
1 0
.x
• The estimate signal can now be reconstructed via inverse STFT
Time-frequency Masking
Zafar RAFII, Spring 2012 31
Binary mask
time
fre
qu
en
cy
1 0
Masked imaginary
time
fre
qu
en
cy
Masked real
time
fre
qu
en
cy
Music estimate
time
iSTFT
• Sources are not really sparse or disjoint in time-frequency in the mixture
Music spectrogram
time
fre
qu
en
cy
Voice spectrogram
time
fre
qu
en
cy
Mixture spectrogram
timefr
eq
ue
ncy
Time-frequency Masking
Zafar RAFII, Spring 2012 32
Music signal
time
Mixture signal
time
Voice signal
time
Music spectrogram
time
fre
qu
en
cy +
0
+ 0
+ 0
Voice spectrogram
time
fre
qu
en
cy
Mixture spectrogram
timefr
eq
ue
ncy
• Sources are not really sparse or disjoint in time-frequency in the mixture
Music spectrogram
time
fre
qu
en
cy
Voice spectrogram
time
fre
qu
en
cy
Mixture spectrogram
timefr
eq
ue
ncy
Time-frequency Masking
Zafar RAFII, Spring 2012 33
Music signal
time
Mixture signal
time
Voice signal
time
Music spectrogram
time
fre
qu
en
cy +
0
+ 0
+ 0
Voice spectrogram
time
fre
qu
en
cy
Mixture spectrogram
timefr
eq
ue
ncy
Soft mask
time
fre
qu
en
cy
• Bins that are likely to belong to one source are close to 1, the rest close to 0 = soft masking!
Music spectrogram
time
fre
qu
en
cy
Voice spectrogram
time
fre
qu
en
cy
Mixture spectrogram
timefr
eq
ue
ncy
Time-frequency Masking
Zafar RAFII, Spring 2012 34
Music spectrogram
time
fre
qu
en
cy +
0
+ 0
+ 0
Voice spectrogram
time
fre
qu
en
cy
Mixture spectrogram
timefr
eq
ue
ncy
Music signal
time
Voice signal
time
Source of interest Interfering source
1 0
Soft mask
time
fre
qu
en
cy
• Let’s listen to the results!
Time-frequency Masking
Zafar RAFII, Spring 2012 35
Music estimate
time
Music signal
time
demix mix
Mixture signal
time
• How can we efficiently model a binary/soft time-frequency mask for source separation?...
• To be continued…
Question…
Zafar RAFII, Spring 2012 36
Mixture spectrogram
time
fre
qu
en
cy
+ 0
Soft mask
time
fre
qu
en
cy
1 0 ???