
2015 IEEE INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING, SEPT. 17–20, 2015, BOSTON, USA

DICTIONARY EXTRACTION FROM A COLLECTION OF SPECTROGRAMS FOR BIOACOUSTICS MONITORING

J. F. Ruiz-Muñoz^1, Zeyu You^2, Raviv Raich^2, Xiaoli Z. Fern^2

^1 SPR Group, Universidad Nacional de Colombia, Manizales, Colombia 170004
^2 School of EECS, Oregon State University, Corvallis, Oregon 97331-5501

[email protected], {youz, raich, xfern}@eecs.oregonstate.edu

ABSTRACT

Dictionary learning of spectrograms consists of detecting their fundamental spectro-temporal patterns and the associated activation signals. In this paper, we propose an efficient convolutive dictionary learning approach for analyzing repetitive bioacoustic patterns from a collection of audio recordings. Our method is inspired by the convolutive non-negative matrix factorization (CNMF) model. The proposed approach relies on random projection for reduced computational complexity. As a consequence, the non-negativity requirement on the dictionary words is relaxed. Moreover, the proposed approach is well suited for a collection of discontinuous spectrograms. We evaluate our approach on synthetic examples and on two real datasets consisting of audio recordings of multiple bird species. Bird syllable dictionary learning from a real-world dataset is demonstrated. Additionally, we apply the approach to spectrogram denoising in the presence of rain noise artifacts.

Index Terms— Unsupervised Dictionary Learning, Random Matrix Projection

1. INTRODUCTION

In recent years, digital signal processing and machine learning techniques have been widely applied to the task of wildlife monitoring. In particular, the analysis of audio recordings (bioacoustics) is of interest because many species, as pointed out in [1], are easier to detect through sound than sight. In this context, dictionary learning and machine learning have been applied to labeling audio recordings, describing the soundscape, and analyzing the environmental impact of human activity and natural changes [2]. In audio signal analysis, a representation that allows for discovering the basic spectro-temporal patterns of the observed signals is important. Several methods have been proposed for analyzing speech and music signals, but few attempts have been made for bioacoustic applications.

In some cases, the analysis of time-varying patterns of audio signals has been carried out using dictionary learning [3, 4], e.g., to detect basic acoustic units such as phonemes in speech recognition [5]. In [6], convolutive non-negative matrix factorization (CNMF) [7] is used as a dictionary learning method. In the CNMF approach, a signal is represented by a set of atoms and their associated sparse activation patterns [8]. One of the advantages of CNMF is the simplicity of factor dependencies [9], because each recording is recovered by a linear combination of shifted dictionary words.

This work is partially supported by the National Science Foundation grants CCF-1254218, DBI-1356792, IIS-1055113, and the Colciencias' Doctoral Training Support Programme.

The representation of a set of spectrograms using a convolutive mixture of sparse activations and dictionary words is particularly important for further analysis of bird song structure. Despite the utility of CNMF in analyzing time-series signals, a few challenges arise when it is applied to the bioacoustic setting: (i) the high computational requirements of CNMF [10] make it difficult to apply to large amounts of bioacoustic data [11]; (ii) CNMF is typically used in a single-spectrogram setting, whereas bioacoustic data usually consist of a collection of discontinuous recordings; and (iii) it is often assumed that the length of the activation signal equals the length of the spectrogram in the time domain, but a recording may register only part of a vocalization at its beginning or end. In this case, a longer activation signal should allow for representing syllable parts at the beginning and the end of the spectrogram.

In this study, we adapt CNMF to a collection of potentially discontinuous spectrograms in which vocalizations may occur prior to the beginning of the recording, such that only part of them is observed. The proposed modification is designed to better suit the convolutive dictionary learning approach to bioacoustic audio recordings obtained from multiple sources. To illustrate the merit of this approach, we compare it against a standard CNMF approach. To address the computational complexity, we propose a random projected dictionary learning approach. We derive a set of iterations with a choice of step size that guarantees a monotonically non-increasing objective. Furthermore, we present applications of the proposed approach to (i) denoising spectrograms corrupted by rain noise and (ii) unsupervised bird syllable discovery.

The paper is organized as follows. Section 2 reviews the convolutive dictionary learning model based on CNMF. Section 3 presents a random matrix projection approach, develops two-step update equations for dictionary learning and activation signal extraction, and analyzes the proposed algorithm in terms of convergence and computational complexity. Section 4 evaluates the proposed approach on a synthetic dataset as well as on in-situ bird audio recordings. Finally, Section 5 concludes the paper.

2. BACKGROUND AND PROBLEM FORMULATION

Before we introduce our approach, we would like to review previous work on the application of CNMF to speech or audio analysis.



2.1. Background on convolutive NMF

In CNMF, the goal is to approximate a matrix $\mathbf{V} \in \mathbb{R}^{M \times N}$ with a series of two non-negative matrices $\mathbf{W}_t \in \mathbb{R}^{M \times R}$ and $\overset{t\rightarrow}{\mathbf{H}} \in \mathbb{R}^{R \times N}$ in a convolutive way. The CNMF is a type of structured non-negative matrix factorization (NMF) model [7], which can be applied to dictionary learning for speech or audio analysis [6, 5]. Based on a CNMF model, an observed spectrogram $\mathbf{V}$ can be written as

$$\mathbf{V} \approx \sum_{t=0}^{T-1} \mathbf{W}_t \, \overset{t\rightarrow}{\mathbf{H}}, \qquad (1)$$

such that the $ik$ element of $\mathbf{V}$ is $v_{ik} = \sum_{t=0}^{T-1} \sum_{j=1}^{R} w_{ijt} \, \bigl(\overset{t\rightarrow}{h}_{jk}\bigr)$, where $\overset{t\rightarrow}{(\cdot)}$ denotes shifting the columns of its argument $t$ positions to the right, filling with zeros.

The Kullback-Leibler (KL) divergence is used because this approach requires that the factorized matrices are both positive. The solution is achieved by repeatedly alternating between updating $\mathbf{W}$ and $\mathbf{H}$ with a sparseness constraint:

$$\mathbf{H} = \mathbf{H} \otimes \frac{\mathbf{W}_t^T \cdot \overset{t\rightarrow}{\left[\frac{\mathbf{V}}{\boldsymbol{\Lambda}}\right]}}{\mathbf{W}_t^T \cdot \mathbf{1} + \lambda \cdot \mathbf{1}}, \quad \text{and} \qquad (2)$$

$$\mathbf{W}_t = \mathbf{W}_t + \gamma_W \left[ \frac{\mathbf{V}}{\boldsymbol{\Lambda}} \cdot \overset{t\rightarrow}{\mathbf{H}}{}^T - \mathbf{1} \cdot \overset{t\rightarrow}{\mathbf{H}}{}^T \right], \qquad (3)$$

where $\boldsymbol{\Lambda} = \sum_{t=0}^{T-1} \mathbf{W}_t \, \overset{t\rightarrow}{\mathbf{H}}$ and $\otimes$ denotes element-wise multiplication.
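
For intuition, here is a minimal NumPy sketch of the convolutive reconstruction in (1); the column-shift helper and the toy dimensions are illustrative assumptions, not code from the referenced CNMF implementations.

```python
import numpy as np

def shift_right(H, t):
    """Shift the columns of H to the right by t, zero-filling on the left."""
    R, N = H.shape
    shifted = np.zeros_like(H)
    if t < N:
        shifted[:, t:] = H[:, :N - t]
    return shifted

def cnmf_reconstruct(W_list, H):
    """Approximate V as sum_t W_t @ shift_right(H, t), as in Eq. (1)."""
    return sum(W_t @ shift_right(H, t) for t, W_t in enumerate(W_list))

# Toy dimensions: M frequency bins, N time frames, R components, T shifts.
M, N, R, T = 40, 200, 3, 10
rng = np.random.default_rng(0)
W_list = [rng.random((M, R)) for _ in range(T)]
H = rng.random((R, N))
Lambda = cnmf_reconstruct(W_list, H)   # the approximation of the spectrogram V
```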

Since we consider a random projection approach to both spectrograms and dictionary words to reduce the computational complexity, the non-negativity assumption on the dictionary words becomes invalid. Another limitation of the CNMF model is that the activation signal may occur before the time of the first observation. This requires the length of the activations to be greater than the length of the observation signals. To address these issues, we present a convolutive dictionary learning model for bioacoustics.

2.2. Problem formulation

We begin by introducing the notation and symbols used in this paper in Table 1, and proceed with a formulation of the proposed convolutive dictionary learning model.

Notation   Explanation
N          number of spectrograms in the dataset
F          number of frequency bands
r          number of reduced frequency bands
T          number of frames in time per spectrogram
K          number of dictionary words
W          length of each word
L          length of an activation signal (L = T + W - 1)
Y          {Y_i ∈ R^(F×T) | 1 ≤ i ≤ N}, set of N spectrograms
Y^Q        {Y^Q_(i) ∈ R^(r×T) | 1 ≤ i ≤ N}, set of N transformed spectrograms
D          {D_k ∈ R^(F×W) | 1 ≤ k ≤ K}, set of K dictionary words
D^Q        {D^Q_k ∈ R^(r×W) | 1 ≤ k ≤ K}, set of K transformed dictionary words
A          {a^i_k ∈ R^(L×1) | 1 ≤ k ≤ K, 1 ≤ i ≤ N}, set of N × K activation signals

Table 1: Overview of the notation used in this paper

We denote each spectrogram and each dictionary word as a function of frequency and time by $Y_i(f,t)$ and $D_k(f,t)$ respectively, where $\mathbf{Y}_i \in \mathbb{R}^{F \times T}$ for $i = 1, 2, \ldots, N$ and $\mathbf{D}_k \in \mathbb{R}^{F \times W}$. We denote each activation signal as a function of time by $a^i_k(t)$, where $\mathbf{a}^i_k \in \mathbb{R}^{L \times 1}$. We use $(\cdot)$ in one of the coordinates of a matrix to denote the vector formed by stacking all the elements along the marked coordinate, i.e., $\mathbf{D}_k(f,\cdot) = [D_k(f,1), D_k(f,2), \ldots, D_k(f,W)]^T$ for $k = 1, 2, \ldots, K$ and $f = 1, 2, \ldots, F$.

We assume that spectrograms are composed of sequences of successive spectro-temporal units, called dictionary words, that are activated at certain time instants. The convolutive dictionary learning approach targets jointly finding a set of $K$ dictionary words $\mathcal{D} = \{D_1(f,t), D_2(f,t), \ldots, D_K(f,t)\}$ and a set of $N \times K$ sparse activation signals $\mathcal{A} = \{a^1_1(t), \ldots, a^1_K(t), a^2_1(t), \ldots, a^2_K(t), \ldots, a^N_1(t), \ldots, a^N_K(t)\}$ such that $Y_i(f,t) \approx \sum_{k=1}^{K} a^i_k(t) * D_k(f,t)$ for $1 \le f \le F$, $1 \le t \le T$, $1 \le i \le N$ (e.g., see Fig. 1). The goal is to minimize the distance between the original and the reconstructed spectrograms. Learning a convolutive dictionary model can be formulated as the following optimization problem:

$$\begin{aligned}
\underset{\mathcal{D},\mathcal{A}}{\text{minimize}} \quad & \sum_{i=1}^{N}\Bigl[\sum_{f=1}^{F}\sum_{t=1}^{T}\Bigl(Y_i(f,t)-\sum_{k=1}^{K} a^i_k(t) * D_k(f,t)\Bigr)^2 + \lambda \sum_{k=1}^{K}\sum_{t=1}^{L} \bigl|a^i_k(t)\bigr|\Bigr] \\
\text{subject to} \quad & \sum_{f=1}^{F}\sum_{t=1}^{W} \bigl(D_k(f,t)\bigr)^2 \le 1, \quad \forall\, 1 \le k \le K. \qquad (4)
\end{aligned}$$
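
As a sanity check of the model, the following minimal NumPy sketch evaluates the reconstruction $\sum_k a^i_k * D_k$ and the objective in (4); the array layout and the alignment of the observed window (our reading of the convention L = T + W − 1, which lets words start before the first observed frame) are assumptions.

```python
import numpy as np

def reconstruct(D, A_i):
    """Reconstruct one spectrogram as Y_i(f, .) ~ sum_k a_k^i * D_k(f, .).

    D   : (K, F, W) dictionary words
    A_i : (K, L) activations for spectrogram i, with L = T + W - 1.
    """
    K, F, W = D.shape
    T = A_i.shape[1] - W + 1
    Y_hat = np.zeros((F, T))
    for k in range(K):
        for f in range(F):
            # 'valid' keeps exactly the T frames of the observed window
            Y_hat[f] += np.convolve(A_i[k], D[k, f], mode='valid')
    return Y_hat

def objective(Y, D, A, lam):
    """Objective of Eq. (4): squared error plus L1 penalty, summed over spectrograms."""
    return sum(np.sum((Y_i - reconstruct(D, A[i])) ** 2) + lam * np.sum(np.abs(A[i]))
               for i, Y_i in enumerate(Y))
```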

Fig. 1: A convolutive model for dictionary learning (dictionary words $d_1(f,t)$, $d_2(f,t)$, $d_3(f,t)$; activation signals $a^i_k(t)$, $i = 1, 2, 3$, $k = 1, 2, 3$; spectrograms $Y_1(f,t)$, $Y_2(f,t)$, $Y_3(f,t)$).

Under this convolutive model, since the 1-D convolution can be applied to each frequency band of each spectrogram separately, the objective in (4) can be reformulated using Toeplitz matrices. Given a vector $\mathbf{x} = [x(1), x(2), \cdots, x(m)]$, the Toeplitz matrix $\mathbf{T}(\mathbf{x}, c, p, w) \in \mathbb{R}^{(p-c+1) \times w}$ is

$$\mathbf{T} = \begin{bmatrix}
x(c) & x(c-1) & \cdots & x(c-w+1) \\
x(c+1) & x(c) & \cdots & x(c-w+2) \\
\vdots & & \ddots & \vdots \\
x(p) & x(p-1) & \cdots & x(p-w+1)
\end{bmatrix}.$$
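
A small sketch of the Toeplitz constructor $\mathbf{T}(\mathbf{x}, c, p, w)$ as defined above; treating out-of-range indices as zero is an assumption consistent with zero-padded signals, and the function name is ours.

```python
import numpy as np

def toeplitz_T(x, c, p, w):
    """Toeplitz matrix T(x, c, p, w) of size (p - c + 1) x w.

    Entry (i, j) is x(c + i - j) in the paper's 1-based notation; indices
    falling outside 1..len(x) are taken as zero (zero-padding assumption).
    """
    x = np.asarray(x, dtype=float)
    m = x.shape[0]
    T = np.zeros((p - c + 1, w))
    for i in range(p - c + 1):
        for j in range(w):
            idx = c + i - j              # 1-based index into x
            if 1 <= idx <= m:
                T[i, j] = x[idx - 1]
    return T

# Example: T(x, c=3, p=5, w=3) for x of length 5
x = np.arange(1.0, 6.0)                  # x(1..5) = 1..5
print(toeplitz_T(x, 3, 5, 3))
# [[3. 2. 1.]
#  [4. 3. 2.]
#  [5. 4. 3.]]
```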

Using a coordinate descent approach, the minimization in (4) can be facilitated by alternating between the dictionary learning problem

$$\text{(DL)}\quad \underset{\mathbf{d}^1,\mathbf{d}^2,\ldots,\mathbf{d}^F}{\text{minimize}}\ \sum_{f=1}^{F} \|\mathbf{y}^f - \mathbf{T}_A \mathbf{d}^f\|^2 \quad \text{subject to}\ \|\mathbf{D}_k\|^2 \le 1,\ \forall\, 1 \le k \le K, \qquad (5)$$


where $\mathbf{y}^f = [\mathbf{Y}_1(f,\cdot)^T, \mathbf{Y}_2(f,\cdot)^T, \ldots, \mathbf{Y}_N(f,\cdot)^T]^T \in \mathbb{R}^{NT \times 1}$ for $f = 1, 2, \ldots, F$, $\mathbf{d}^f = [\mathbf{D}_1(f,\cdot)^T, \mathbf{D}_2(f,\cdot)^T, \ldots, \mathbf{D}_K(f,\cdot)^T]^T \in \mathbb{R}^{KW \times 1}$ for $f = 1, 2, \ldots, F$, and $\mathbf{T}_A \in \mathbb{R}^{NT \times KW}$ given by

$$\mathbf{T}_A = \begin{bmatrix}
\mathbf{T}(\mathbf{a}^1_1, W, L, W) & \cdots & \mathbf{T}(\mathbf{a}^1_K, W, L, W) \\
\mathbf{T}(\mathbf{a}^2_1, W, L, W) & \cdots & \mathbf{T}(\mathbf{a}^2_K, W, L, W) \\
\vdots & \cdots & \vdots \\
\mathbf{T}(\mathbf{a}^N_1, W, L, W) & \cdots & \mathbf{T}(\mathbf{a}^N_K, W, L, W)
\end{bmatrix}$$

and the activation extraction problem

$$\text{(AE)}\quad \underset{\mathbf{a}^1,\mathbf{a}^2,\ldots,\mathbf{a}^N}{\text{minimize}}\ \sum_{i=1}^{N} \bigl(\|\mathbf{y}^i - \mathbf{T}_D \mathbf{a}^i\|^2 + \lambda \|\mathbf{a}^i\|_1\bigr), \qquad (6)$$

where $\mathbf{y}^i = [\mathbf{Y}_i(1,\cdot)^T, \mathbf{Y}_i(2,\cdot)^T, \ldots, \mathbf{Y}_i(F,\cdot)^T]^T \in \mathbb{R}^{FT \times 1}$ for $i = 1, 2, \ldots, N$, $\mathbf{a}^i = [\mathbf{a}^{iT}_1, \mathbf{a}^{iT}_2, \ldots, \mathbf{a}^{iT}_K]^T \in \mathbb{R}^{KL \times 1}$ for $i = 1, 2, \ldots, N$, and $\mathbf{T}_D \in \mathbb{R}^{FT \times KL}$ given by

$$\mathbf{T}_D = \begin{bmatrix}
\mathbf{T}(\mathbf{D}_1(1,\cdot), W, L, L) & \cdots & \mathbf{T}(\mathbf{D}_K(1,\cdot), W, L, L) \\
\mathbf{T}(\mathbf{D}_1(2,\cdot), W, L, L) & \cdots & \mathbf{T}(\mathbf{D}_K(2,\cdot), W, L, L) \\
\vdots & \cdots & \vdots \\
\mathbf{T}(\mathbf{D}_1(F,\cdot), W, L, L) & \cdots & \mathbf{T}(\mathbf{D}_K(F,\cdot), W, L, L)
\end{bmatrix}.$$

Constructing the Toeplitz matrices $\mathbf{T}_D$ and $\mathbf{T}_A$ is memory inefficient, and solving the above alternating quadratic programming problem with matrix inversion is time consuming. To reduce the computational complexity and the memory requirements of the convolutive model, we propose a random projected convolutive model with a modified gradient descent algorithm that utilizes the convolution operator.

3. RANDOM PROJECTED DICTIONARY LEARNING

The convolutive model provides a natural representation for spectrograms of bird vocalizations. To reduce the computational complexity, we propose to make use of the fact that bird vocalizations are concentrated in a small range of frequencies. Consequently, spectrograms of bird vocalizations tend to have sparse columns. We consider a compressive sampling approach to facilitate the reduction in computational complexity.

3.1. Model formulation

To reduce the computational complexity, we apply the same transformation to both the spectrograms and the dictionary words. In this way, the computational complexity of computing both dictionary words and activations is decreased by reducing the number of unknowns. The new formulation of the dictionary learning problem is

$$\begin{aligned}
\underset{\mathcal{D}^Q,\mathcal{A}}{\text{minimize}} \quad & \sum_{i=1}^{N}\Bigl(\sum_{c=1}^{r}\sum_{t=1}^{T}\Bigl(Y^Q_{(i)}(c,t)-\sum_{k=1}^{K} a^i_k(t) * D^Q_k(c,t)\Bigr)^2 + \lambda \sum_{k=1}^{K}\sum_{t=1}^{L} \bigl|a^i_k(t)\bigr|\Bigr) \\
\text{subject to} \quad & \sum_{c=1}^{r}\sum_{t=1}^{W} D^Q_k(c,t)^2 \le 1, \quad \forall\, 1 \le k \le K \qquad (7)
\end{aligned}$$

with a transformation matrix $\mathbf{Q} = [\mathbf{Q}(1), \mathbf{Q}(2), \ldots, \mathbf{Q}(F)] \in \mathbb{R}^{r \times F}$, where $\mathbf{Q}(f) = [q_1(f), q_2(f), \ldots, q_r(f)]^T \in \mathbb{R}^r$ such that $r < F$. Note that $\mathbf{Y}^Q_{(i)} = \mathbf{Q}\mathbf{Y}_i$ and $\mathbf{D}^Q_k = \mathbf{Q}\mathbf{D}_k$.

Many dimension reduction techniques can be considered when generating the transformation matrix $\mathbf{Q}$, e.g., principal component coefficients (PCC) and Mel-frequency cepstral coefficients (MFCCs). However, the problem of signal or spectrogram distortion and the difficulty of recovering the original signal or spectrogram may arise. For example, if the intensities at several frequency bins are compressed into a single coefficient using MFCC, it is difficult to recover their values from that single coefficient. To prevent a potential distortion problem, we apply a compressive transformation with a random matrix [12]. We rely on the sparsity of the signal and the compressive approach to improve recovery. The recovery of the spectrograms or dictionary words can be implemented using a linear programming approach [13, 14].
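
A minimal sketch of the compressive transformation follows; the Gaussian entries and the 1/sqrt(r) scaling are common choices we assume here, since the paper only requires a random matrix [12].

```python
import numpy as np

def random_projection_matrix(r, F, seed=0):
    """Random Q in R^{r x F}; i.i.d. Gaussian entries scaled by 1/sqrt(r) (a common choice)."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((r, F)) / np.sqrt(r)

def project(Q, mats):
    """Apply the same transformation to spectrograms (F x T) or dictionary words (F x W)."""
    return [Q @ M for M in mats]

# Example: keep r ~ 20% of the F frequency bands
F, r, T, W, K, N = 247, 50, 2497, 200, 15, 8
Q = random_projection_matrix(r, F)
rng = np.random.default_rng(1)
Y = [rng.random((F, T)) for _ in range(N)]      # spectrograms
D = [rng.random((F, W)) for _ in range(K)]      # dictionary words
Y_Q = project(Q, Y)                             # each r x T
D_Q = project(Q, D)                             # each r x W
```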

3.2. Solution approach for dictionary and activation extraction

For the (DL) problem in (5), a least-squares solution with normalization or a projected Newton descent method is simple to derive. For the $L_1$-regularized (AE) problem in (6), the least-angle regression (LARS) algorithm [15] or the feature-sign sparse coding algorithm [16] is also applicable. However, these algorithms require a large matrix inversion to obtain an efficient and exact solution. In our problem, computing $\mathbf{T}_D^T\mathbf{T}_D$ and $\mathbf{T}_A^T\mathbf{T}_A$ requires a computational complexity of the order $O(FKT\log T)$ and $O(NKT\log T)$ respectively, and computing their inverses requires $O((KL)^3)$ and $O((KW)^3)$ respectively, which limits the practical applicability of the approach. Hence, we propose an optimization transfer algorithm to minimize both (4) and (7). The update rules are derived by minimizing a surrogate such that the computation is reduced by utilizing the fast Fourier transform (FFT) implementation of the discrete Fourier transform (DFT) and its inverse.

In optimization transfer, a surrogate function $g(x, x')$ is considered as a replacement for the original objective $f(x)$ such that (i) $f(x) \le g(x, x')$, $\forall x, x'$ and (ii) $f(x') = g(x', x')$, $\forall x'$. The update iteration $x^{(j+1)} = \arg\min_x g(x, x^{(j)})$ guarantees $f(x^{(j+1)}) \le f(x^{(j)})$. To solve the (DL) problem, we use

$$\sum_{f=1}^{r}\Bigl[\frac{\gamma^f}{2}\|\mathbf{d}^f - \mathbf{d}^{f'}\|^2 - \bigl(\mathbf{T}_A^T(\mathbf{y}^f - \mathbf{T}_A\mathbf{d}^{f'})\bigr)^T\mathbf{d}^f + \frac{1}{2}\|\mathbf{y}^f - \mathbf{T}_A\mathbf{d}^{f'}\|^2\Bigr]$$

as a surrogate to $\sum_{f=1}^{r}\frac{1}{2}\|\mathbf{y}^f - \mathbf{T}_A\mathbf{d}^f\|^2$ and consider $\gamma^f$ such that $\gamma^f\|\mathbf{d}^f - \mathbf{d}^{f'}\|^2 \ge \|\mathbf{T}_A(\mathbf{d}^f - \mathbf{d}^{f'})\|^2$. We choose $\gamma_d = \max_f \gamma^f$. Minimizing the surrogate objective subject to the constraints in (5) yields the following update rule for $\mathbf{d}^f_k$:

$$\mathbf{d}^{f(j+1)}_k = \begin{cases}
\mathbf{d}^{f(j)}_k + \frac{1}{\gamma_d}\mathbf{v}^k_D, & \sum_f \bigl\|\mathbf{d}^{f(j)}_k + \frac{1}{\gamma_d}\mathbf{v}^k_D\bigr\|^2 \le 1; \\[6pt]
\dfrac{\mathbf{d}^{f(j)}_k + \frac{1}{\gamma_d}\mathbf{v}^k_D}{\sqrt{\sum_f \bigl\|\mathbf{d}^{f(j)}_k + \frac{1}{\gamma_d}\mathbf{v}^k_D\bigr\|^2}}, & \text{otherwise},
\end{cases} \qquad (8)$$

where $\mathbf{v}^k_D = \mathbf{T}_A^{kT}\bigl(\mathbf{y}^f - \mathbf{T}_A^k\mathbf{D}^{(j)}_k(f,\cdot)\bigr)$ and $\mathbf{T}_A^k = [\mathbf{T}(\mathbf{a}^1_k, W, L, W)^T, \ldots, \mathbf{T}(\mathbf{a}^N_k, W, L, W)^T]^T$. To obtain $\gamma_d$, we propose the following approach.

Optimal step-size for DL: Setting $\gamma_d = \max_{\mathbf{v}} \frac{\|\mathbf{T}_A\mathbf{v}\|^2}{\|\mathbf{v}\|^2} = \lambda_{\max}(\mathbf{T}_A^T\mathbf{T}_A)$ ensures that $\|\mathbf{T}_A\mathbf{v}\|^2 \le \gamma_d\|\mathbf{v}\|^2$ for any $\mathbf{v}$. This conservative approach results in a small step size $1/\gamma_d$, which leads to a slow convergence rate. To improve this, we consider the following tighter bound on $\gamma_d$. Let $\mathbf{g}^f_k = \mathbf{d}^{f(j)}_k + \frac{1}{\gamma_d}\mathbf{v}^k_D$. From (8), we have $\mathbf{d}^f - \mathbf{d}^{f'} = c(\gamma_d)\bigl(\mathbf{d}^{f'} + \frac{1}{\gamma_d}\mathbf{v}_D\bigr) - \mathbf{d}^{f'} = \alpha_1\mathbf{d}^{f'} + \alpha_2\mathbf{v}_D$. Since $\mathbf{d}^f - \mathbf{d}^{f'} \in \operatorname{span}\{\mathbf{d}^{f'}, \mathbf{v}_D\}$, we can further restrict $\gamma_d$ without violating the bound on $\gamma_d$. Using Gram–Schmidt orthogonalization, we obtain the orthogonal basis for $[\mathbf{v}_D, \mathbf{d}^{f'}]$ as $\mathbf{u}_1 = \mathbf{v}_D/\|\mathbf{v}_D\|$ and $\mathbf{u}_2 = \tilde{\mathbf{d}}^{f'}/\|\tilde{\mathbf{d}}^{f'}\|$, where $\tilde{\mathbf{d}}^{f'} = \mathbf{d}^{f'} - (\mathbf{d}^{f'T}\mathbf{u}_1)\mathbf{u}_1$. Consequently, we can bound $\gamma^f = \max_{\alpha_1,\alpha_2} \frac{\|\mathbf{T}_A[\mathbf{u}_1,\mathbf{u}_2][\alpha_1,\alpha_2]^T\|^2}{\|[\alpha_1,\alpha_2]^T\|^2} = \lambda_{\max}\bigl([\mathbf{u}_1,\mathbf{u}_2]^T\mathbf{T}_A^T\mathbf{T}_A[\mathbf{u}_1,\mathbf{u}_2]\bigr)$. Hence, the optimal step-size parameter is $\gamma^*_d = \max_f \gamma^f$.
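
As a rough sketch of how the update (8) and the tighter per-band bound could be computed, consider the following; the helper `TA_matvec`, the array layouts, and the names are our own assumptions, not the authors' implementation. The step size would then be taken as the maximum of `tight_gamma_f` over the bands f.

```python
import numpy as np

def lam_max_2x2(M):
    """Largest eigenvalue of a symmetric 2 x 2 matrix."""
    return float(np.linalg.eigvalsh(M)[-1])

def tight_gamma_f(TA_matvec, d_prev, v_D):
    """Tighter curvature bound for one band f: restrict gamma to span{v_D, d_prev}.

    d_prev, v_D : 1-D arrays (the stacked d^f and v_D for band f)
    TA_matvec   : hypothetical helper computing T_A @ x (e.g. via convolutions
                  with the activations), assumed here.
    """
    u1 = v_D / np.linalg.norm(v_D)
    d_perp = d_prev - (d_prev @ u1) * u1           # Gram-Schmidt step
    nrm = np.linalg.norm(d_perp)
    if nrm < 1e-12:                                # d_prev parallel to v_D: 1-D case
        return float(np.sum(TA_matvec(u1) ** 2))
    u2 = d_perp / nrm
    B = np.column_stack([TA_matvec(u1), TA_matvec(u2)])
    return lam_max_2x2(B.T @ B)                    # lambda_max([u1,u2]^T T_A^T T_A [u1,u2])

def dl_update(d_prev, v_D, gamma_d):
    """Gradient step followed by projection onto the unit ball, as in Eq. (8).

    d_prev, v_D : arrays of shape (F, W) stacking d_k^f and v_D^k over the bands f
                  (a layout assumption for this sketch).
    """
    g = d_prev + v_D / gamma_d
    norm = np.sqrt(np.sum(g ** 2))                 # sqrt of sum_f ||g_k^f||^2
    return g if norm <= 1.0 else g / norm
```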

Next, the optimization transfer approach is applied to the (AE) problem. The update rule for extracting the activation signal $a^{i(j)}_k(t)$ at iteration $j$ follows the iterative soft-thresholding approach:

$$a^{i(j+1)}_k(t) = \begin{cases}
a^{i(j)}_k(t) + \frac{1}{\gamma_a}\bigl(v^k_A(t) - \lambda\bigr), & a^{i(j)}_k(t) + \frac{1}{\gamma_a}\bigl(v^k_A(t) - \lambda\bigr) > 0 \\[4pt]
a^{i(j)}_k(t) + \frac{1}{\gamma_a}\bigl(v^k_A(t) + \lambda\bigr), & a^{i(j)}_k(t) + \frac{1}{\gamma_a}\bigl(v^k_A(t) + \lambda\bigr) < 0 \\[4pt]
0, & \text{otherwise},
\end{cases} \qquad (9)$$

where $\mathbf{v}^k_A = (\mathbf{T}^k_D)^T\bigl(\mathbf{y}^i - \mathbf{T}_D\mathbf{a}^{i(j)}\bigr)$ and $\mathbf{T}^k_D = [\mathbf{T}(\mathbf{D}_k(1,\cdot), W, L, L)^T, \mathbf{T}(\mathbf{D}_k(2,\cdot), W, L, L)^T, \ldots, \mathbf{T}(\mathbf{D}_k(F,\cdot), W, L, L)^T]^T$.
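
A vectorized sketch of the soft-thresholding step (9); `a_prev` and `v_A` stand for the per-word quantities defined above, and the variable names are ours.

```python
import numpy as np

def soft_threshold_update(a_prev, v_A, gamma_a, lam):
    """Iterative soft-thresholding step of Eq. (9), applied to all t at once.

    a_prev : current activation a_k^i (length L)
    v_A    : gradient term (T_D^k)^T (y^i - T_D a^i), same length
    """
    up = a_prev + (v_A - lam) / gamma_a     # candidate kept when it stays positive
    dn = a_prev + (v_A + lam) / gamma_a     # candidate kept when it stays negative
    out = np.zeros_like(a_prev)
    out[up > 0] = up[up > 0]
    out[dn < 0] = dn[dn < 0]                # the two conditions are mutually exclusive
    return out
```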

Optimal step-size for AE: The optimal step-size $1/\gamma^*_a$ for the activation updates satisfies $\gamma_a \ge \|\mathbf{T}_D(\mathbf{a}^i - \mathbf{a}^{i'})\|^2/\|\mathbf{a}^i - \mathbf{a}^{i'}\|^2$. We can bound $\gamma_a$ by $\lambda_{\max}(\mathbf{T}_D^T\mathbf{T}_D)$, the largest eigenvalue of the matrix $\mathbf{T}_D^T\mathbf{T}_D$. Computing the largest eigenvalue of a $KL \times KL$ matrix is costly. Instead, we apply the DFT operator and further bound the maximum eigenvalue of $\mathbf{T}_D^T\mathbf{T}_D$. Since

$$\frac{\|\mathbf{T}_D(\mathbf{a}^i - \mathbf{a}^{i'})\|^2}{\|\mathbf{a}^i - \mathbf{a}^{i'}\|^2}
= \frac{\sum_{f=1}^{F}\bigl\|\sum_{k=1}^{K}\mathbf{D}_k(f,\cdot) * (\mathbf{a}^i_k - \mathbf{a}^{i'}_k)\bigr\|^2}{\sum_{k=1}^{K}\|\mathbf{a}^i_k - \mathbf{a}^{i'}_k\|^2}
= \frac{\sum_{f=1}^{F}\frac{1}{2\pi}\int_{-\pi}^{\pi}\bigl|\sum_{k=1}^{K} D_k(f,\omega)\bigl(a^i_k(\omega) - a^{i'}_k(\omega)\bigr)\bigr|^2 d\omega}{\frac{1}{2\pi}\int_{-\pi}^{\pi}\sum_{k=1}^{K}\bigl|a^i_k(\omega) - a^{i'}_k(\omega)\bigr|^2 d\omega}
\le \sum_{f=1}^{F}\max_{\omega}\sum_{k=1}^{K}\bigl|D_k(f,\omega)\bigr|^2$$

by the Cauchy–Schwarz inequality, we obtain the optimal step-size $\gamma^*_a = \sum_{f=1}^{F}\max_{\omega}\sum_{k=1}^{K}|D_k(f,\omega)|^2$, where $D_k(f,\omega)$ is the $\omega$th coefficient of the discrete Fourier transform of $\mathbf{D}_k(f,\cdot)$ at frequency band $f$ for dictionary word $k$.
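
The bound $\gamma^*_a$ can be evaluated directly with the FFT; a sketch, assuming the DFTs are taken on a grid of length L (the zero-padding length is our choice):

```python
import numpy as np

def optimal_gamma_a(D, L):
    """Step-size bound gamma_a* = sum_f max_w sum_k |D_k(f, w)|^2.

    D : (K, F, W) dictionary words; DFTs are zero-padded to the activation length L.
    """
    D_hat = np.fft.fft(D, n=L, axis=2)             # FFT along time for every word and band
    power = np.sum(np.abs(D_hat) ** 2, axis=0)     # sum over words k -> shape (F, L)
    return float(np.sum(np.max(power, axis=1)))    # max over omega, then sum over bands f
```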

For the random projection approach, we simply replace $\mathbf{Y}$ with $\mathbf{Y}^Q$ and $\mathbf{D}$ with $\mathbf{D}^Q$. We propose an efficient algorithm for practical dictionary extraction. The algorithm consists of three main parts: (i) transforming the original input spectrograms using a random projection matrix, (ii) alternately solving the (DL) and (AE) problems until a convergence criterion is met, and (iii) recovering the uncompressed-domain dictionary words by solving the (DL) problem with the extracted activations and the original data $\mathbf{Y}$; a skeleton of the procedure is sketched below.
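
The skeleton below puts the three parts together; `ae_step`, `dl_step`, and `objective` are placeholders standing in for the updates (9), (8), and the penalized reconstruction error, and all names, shapes, and defaults are our own assumptions rather than the reference implementation.

```python
import numpy as np

def learn_dictionary(Y, K, W, r, lam, ae_step, dl_step, objective,
                     max_iter=100, tol=1e-4, seed=0):
    """Three-phase random projected dictionary learning (a sketch).

    Y         : list of F x T spectrograms
    ae_step   : callable implementing the activation update of Eq. (9)
    dl_step   : callable implementing the dictionary update of Eq. (8)
    objective : callable returning the penalized reconstruction error
    """
    rng = np.random.default_rng(seed)
    F, T = Y[0].shape
    L = T + W - 1

    # phase (i): compress every spectrogram with a random matrix Q
    Q = rng.standard_normal((r, F)) / np.sqrt(r)
    Y_Q = [Q @ Yi for Yi in Y]

    # phase (ii): alternate (AE) and (DL) in the compressed domain
    D_Q = [rng.random((r, W)) for _ in range(K)]
    A = [np.zeros((K, L)) for _ in Y]
    prev = np.inf
    for _ in range(max_iter):
        A = ae_step(Y_Q, D_Q, A, lam)
        D_Q = dl_step(Y_Q, A, D_Q)
        obj = objective(Y_Q, D_Q, A, lam)
        if prev - obj < tol:              # objective is monotonically non-increasing
            break
        prev = obj

    # phase (iii): recover full-band words on the original data with A fixed
    D = dl_step(Y, A, [np.zeros((F, W)) for _ in range(K)])
    return D, A
```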

Convergence analysis: Denote $f(\mathcal{D},\mathcal{A}) = \sum_{i=1}^{N}\bigl(\|\mathbf{y}^i - \mathbf{T}_D\mathbf{a}^i\|^2 + \lambda\|\mathbf{a}^i\|_1\bigr) = \sum_{f=1}^{F}\|\mathbf{y}^f - \mathbf{T}_A\mathbf{d}^f\|^2 + \sum_{i=1}^{N}\lambda\|\mathbf{a}^i\|_1$. By the property of optimization transfer, we have $f(\mathcal{D}^{(j+1)},\mathcal{A}^{(j+1)}) \le f(\mathcal{D}^{(j)},\mathcal{A}^{(j+1)})$ after phase (I) and $f(\mathcal{D}^{(j)},\mathcal{A}^{(j+1)}) \le f(\mathcal{D}^{(j)},\mathcal{A}^{(j)})$ after phase (II). Hence, $f(\mathcal{D}^{(j)},\mathcal{A}^{(j)})$ forms a series of monotonically non-increasing objective values.

3.3. Computational complexity

Since a convolution of length $L$ can be computed efficiently using the FFT and IFFT, the computational complexity of each convolution block of size $L$ is $O(L\log L)$. For the iterative procedure in (8), calculating $\mathbf{T}_A\mathbf{d}^f$, $\mathbf{v}_D$, and $\gamma^*_d$ each requires $O(NKL\log L)$; therefore, the overall computational complexity of (DL) is $O(FNKL\log L)$. Updating the activations has the same computational complexity, $O(NFKL\log L)$. The total computational complexity of the algorithm without random projection is therefore $O(FNKL\log L)$. With random projection, $F$ is replaced by the reduced number of frequency bands $r$, so the complexity scales proportionally. If $r$ is 20% of the original number of frequency bands $F$, the running time will be about five times faster than that of the uncompressed dictionary learning algorithm, which makes the convolutive dictionary learning method more efficient and practical.
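
For completeness, the $O(L\log L)$ convolution referred to above can be realized with zero-padded FFTs; a minimal sketch:

```python
import numpy as np

def conv_fft(a, d):
    """Linear convolution of a (length L) with d (length W) via FFT in O(L log L)."""
    n = len(a) + len(d) - 1                     # zero-pad so circular == linear convolution
    return np.fft.irfft(np.fft.rfft(a, n) * np.fft.rfft(d, n), n)

a = np.random.default_rng(0).random(500)
d = np.random.default_rng(1).random(50)
assert np.allclose(conv_fft(a, d), np.convolve(a, d))
```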

4. EXPERIMENTAL RESULTS

In this section, we evaluate the random projected dictionary learning approach on synthetic data and compare it with the CNMF approach to show that our approach can address the boundary effect while the CNMF approach cannot. Additionally, we evaluate the proposed approach for the problems of denoising and dictionary discovery in bird bioacoustics using in-situ bird audio recordings.

4.1. Analysis on synthetic data

To demonstrate the robustness of our model to boundary effects, we test our approach on a synthetic dataset and compare it with the CNMF approach. We generate three spectrograms with three dictionary words and sparse activation signals for each dictionary word. The dimensions of each spectrogram are fixed to F = 50 and T = 500, and the dimensions of each dictionary word are F = 50 and W = 50. To compare the performance of our approach with the CNMF [17] approach, we run both algorithms for 10,000 iterations on the three spectrograms. The dictionary words and activations learned by both algorithms are shown in Fig. 2. We observe that the proposed approach accurately recovers the dictionary words (see Fig. 2(a)) despite the boundary effect in the first spectrogram of Fig. 2(e). However, CNMF learns each dictionary word as a mixture of the original dictionary words (see Fig. 2(c)), including the part of the dictionary word appearing at the beginning of the first spectrogram.

4.2. Analysis on real-world data

Discovering the syllables of each bird species is an important research direction in bird bioacoustics. To explore the bird song dictionary and the song structure, we test our random projected convolutive dictionary learning approach on both the MLSP 2013 dataset¹ and the HJA dataset [18]. The MLSP 2013 database contains 645 recordings of 19 different bird species, and the HJA database contains a total of 750 recordings from six different locations: PC1, PC4, PC7, PC8, PC13, and PC15. We convert each recording into a two-dimensional spectrogram with F = 247 and T = 2497. In the following, we examine three aspects of the proposed approach: (i) spectrogram denoising, (ii) optimal parameter selection, and (iii) finding a dictionary for bird syllables.

4.2.1. Spectrogram denoising

Some audio recordings are corrupted by rain or other background noise. To remove the rain noise, we train (i.e., learn a dictionary) on the clean HJA dataset and test on a rain-corrupted dataset. The result in Fig. 3 shows that, after running the dictionary learning algorithm, the rain effect is significantly reduced in the reconstructed spectrogram.

4.2.2. Parameter selection for dictionary learning

The model parameters that affect the performance of dictionary learning are the number of dictionary words K and the sparsity parameter of the activations λ.

¹ https://www.kaggle.com/c/mlsp-2013-birds


Fig. 2: Comparison of learned dictionary words and activations between our approach and CNMF [17]: (a) learned dictionary words by our approach; (b) learned activations by our approach; (c) learned dictionary words by CNMF [17]; (d) learned activations by CNMF [17]; (e) reconstructed spectrograms from our approach; (f) reconstructed spectrograms from CNMF [17]; (g) original spectrograms; (h) objective vs. iteration.

To show the relationship between the model parameters and the dictionary learning performance, we present the reconstruction error $\sum_{i=1}^{N}\sum_{f=1}^{F}\sum_{t=1}^{T}\bigl(Y_i(f,t)-\sum_{k=1}^{K}a^i_k(t)*D_k(f,t)\bigr)^2$ against a practical approximation of the $L_0$ norm of the activations (the number of elements in $\mathcal{A}$ that are greater than $\varepsilon = 10^{-2}$). During the training phase, we select 8 spectrograms from location PC15 of the HJA dataset and run the proposed algorithm to extract the dictionary words for each of the parameter values $K \in \{5, 10, 15\}$ and $\lambda \in \{1, 5, 10, 15, 30, 50\}$. In the validation phase, we apply the learned dictionary words to three independent test spectrograms; the performance curves are shown in Fig. 4 (a) and (b). The results show that the reconstruction error decreases with decreasing $\lambda$ and/or increasing $K$, while the $L_0$ norm of the activations increases with decreasing $\lambda$. For a large $\lambda$, the dictionary concentrates on high-energy words and low-energy words are not discovered. For a small $\lambda$, the $L_0$ norm of the activations increases significantly even though the reconstruction error decreases. We select the set of parameters $(\lambda = 10, K = 15)$ to balance the reconstruction error and the sparseness of the activations on the validation set. The extracted dictionary words are shown in Fig. 5.

4.2.3. Extracted dictionary words on the MLSP 2013 dataset

We select four or five syllable-rich spectrograms from each species to learn the bird song dictionary, and show the discovered dictionary words of all 19 species in Fig. 6, using random projected dictionary learning with r = 10% of F and W = 200 for all species.

Fig. 3: Examples of rain denoising on a test spectrogram: (a) reconstructed test spectrogram from PC1; (b) original test spectrogram from PC1 with rain noise.

Fig. 4: Parameter selection: (a) training phase reconstruction error vs. L0 norm of activations for PC15 (the first number for each point is K and the second is λ); (b) validation phase reconstruction error vs. L0 norm of activations for PC15; (c) learned dictionary with K = 15 and λ = 10 for PC15; (d) learned dictionary with K = 15 and λ = 50 for PC15.

Fig. 5: Learned dictionary words for the HJA dataset: (a) PC1; (b) PC4; (c) PC7; (d) PC13.

Fig. 6: Learned bird dictionary words.

5. CONCLUSION

In this paper, a random projection dictionary learning approach was proposed for a bioacoustic application. This approach combines the power of estimating spectro-temporal patterns given by the convolutive model with the computational complexity savings associated with the random projection approach. Additionally, we address the boundary effect arising in a collection of discontinuous spectrograms. Furthermore, we describe how to estimate a step size that produces a monotonic non-increase of the objective function when updating the activations and the dictionary words. The results suggest that the proposed approach can be used to efficiently represent bird audio recordings and has the potential to be the basis of further bird song analysis.

6. REFERENCES

[1] T. Scott Brandes, "Automated sound recording and analysis techniques for bird surveys and conservation," Bird Conservation International, vol. 18, pp. S163–S173, Sept. 2008.

[2] Daniel T. Blumstein, Daniel J. Mennill, Patrick Clemins, Lewis Girod, Kung Yao, Gail Patricelli, Jill L. Deppe, Alan H. Krakauer, Christopher Clark, Kathryn A. Cortopassi, Sean F. Hanser, Brenda McCowan, Andreas M. Ali, and Alexander N. G. Kirschel, "Acoustic monitoring in terrestrial environments using microphone arrays: applications, technological considerations and prospectus," Journal of Applied Ecology, vol. 48, no. 3, pp. 758–767, 2011.

[3] Roland Badeau and M. Plumbley, "Multichannel high resolution NMF for modelling convolutive mixtures of non-stationary signals in the time-frequency domain," Transactions on Audio, Speech and Language Processing, vol. 22, no. 11, pp. 1670–1680, 2013.

[4] Qingju Liu, Wenwu Wang, Philip J. B. Jackson, Mark Barnard, Josef Kittler, and Jonathon Chambers, "Source separation of convolutive and noisy mixtures using audio-visual dictionary learning and probabilistic time-frequency masking," IEEE Transactions on Signal Processing, vol. 61, no. 22, pp. 5520–5535, 2013.

[5] Paul D. O'Grady and Barak A. Pearlmutter, "Discovering speech phones using convolutive non-negative matrix factorisation with a sparseness constraint," Neurocomputing, vol. 72, no. 1–3, pp. 88–101, 2008.

[6] Paul D. O'Grady and Barak A. Pearlmutter, "Convolutive non-negative matrix factorisation with a sparseness constraint," in Proceedings of the IEEE International Workshop on Machine Learning for Signal Processing (MLSP 2006), Maynooth, Ireland, Sept. 2006, pp. 427–432.

[7] Yu-Xiong Wang and Yu-Jin Zhang, "Nonnegative matrix factorization: A comprehensive review," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 6, pp. 1336–1353, 2013.

[8] Maria G. Jafari and Mark D. Plumbley, "Fast dictionary learning for sparse representations of speech signals," IEEE Journal of Selected Topics in Signal Processing, vol. 5, pp. 1025–1031, 2011.

[9] Yoshua Bengio, Aaron Courville, and Pascal Vincent, "Representation learning: A review and new perspectives," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.

[10] Dong Wang, Ravichander Vipperla, and Nicholas W. D. Evans, "Online pattern learning for non-negative convolutive sparse coding," in INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association, Florence, Italy, Aug. 28–31, 2011.

[11] Dan Stowell and Mark D. Plumbley, "Automatic large-scale classification of bird sounds is strongly improved by unsupervised feature learning," PeerJ, vol. 2, 2014.

[12] Fei Wang and Ping Li, Efficient Nonnegative Matrix Factorization with Random Projections, chapter 24, pp. 281–292, 2010.

[13] Richard Baraniuk, "Compressive sensing," IEEE Signal Processing Magazine, vol. 24, no. 4, 2007.

[14] Justin Romberg, "Imaging via compressive sampling [introduction to compressive sampling and recovery via convex programming]," IEEE Signal Processing Magazine, vol. 25, no. 2, pp. 14–20, 2008.

[15] Bradley Efron, Trevor Hastie, Iain Johnstone, Robert Tibshirani, et al., "Least angle regression," The Annals of Statistics, vol. 32, no. 2, pp. 407–499, 2004.

[16] Honglak Lee, Alexis Battle, Rajat Raina, and Andrew Y. Ng, "Efficient sparse coding algorithms," in Advances in Neural Information Processing Systems, 2006, pp. 801–808.

[17] Paris Smaragdis, "Non-negative matrix factor deconvolution; extraction of multiple sound sources from monophonic inputs," in Independent Component Analysis and Blind Signal Separation, pp. 494–499, Springer, 2004.

[18] Forrest Briggs, Balaji Lakshminarayanan, Lawrence Neal, Xiaoli Z. Fern, Raviv Raich, Sarah J. K. Hadley, Adam S. Hadley, and Matthew G. Betts, "Acoustic classification of multiple simultaneous bird species: a multi-instance multi-label approach," The Journal of the Acoustical Society of America, vol. 131, no. 6, pp. 4640–4650, June 2012.