
DEEP NEURAL NETWORK DERIVED BOTTLENECK FEATURES FOR ACCURATE AUDIO CLASSIFICATION

Bihong Zhang1, Lei Xie1,2, Yougen Yuan2, Huaiping Ming3, Dongyan Huang3, Mingli Song4

1 School of Software and Microelectronics, Northwestern Polytechnical University, Xi'an, China
2 School of Computer Science, Northwestern Polytechnical University, Xi'an, China

3 Institute for Infocomm Research, A*STAR, Singapore
4 College of Computer Science, Zhejiang University, Hangzhou, China

{bhzhang, lxie, ygyuan}@nwpu-aslp.org, [email protected], [email protected], [email protected]

ABSTRACT

In this paper, we propose to use a deep neural network (DNN) as an effective tool for audio feature extraction. The DNN-derived features can be effectively used in a subsequent classifier (e.g., an SVM in this study) for audio classification. Specifically, we learn bottleneck features from a multi-layer perceptron (MLP), in which Mel filter bank features are used as the network input and one of the hidden layers has a small number of hidden units compared to the size of the other hidden layers. This narrow hidden layer serves as a bottleneck layer, which creates a constriction in the network that forces the information pertinent to classification into a compact feature representation. We study both unsupervised and supervised bottleneck feature extraction methods and demonstrate that the supervised bottleneck features outperform conventional hand-crafted features and achieve state-of-the-art performance in audio classification.

Index Terms— deep neural networks, audio classification, bottleneck features

1. INTRODUCTION

Social multimedia content is proliferating exponentially due to the fast development of social networks and web applications. As massive and heterogeneous social media data become available, effectively managing and understanding media content has become a more challenging problem than ever before. Audio is an indispensable component of multimedia repositories. It is therefore desirable for social multimedia computing to automatically classify audio content or segment an audio stream into homogeneous regions. This paper addresses the audio classification problem with deep neural network (DNN) [1] derived bottleneck features.

Audio classification and its related topics have a long history of research. Some studies specifically focus on speech and music discrimination [2][3], some deal with mixed types of audio [4][5], while others are task-oriented or work at higher semantic levels, e.g., audio event classification [6], music genre classification [7] and audio scene classification [8]. Despite various efforts in the area, as a typical classification problem, audio classification involves three major components: feature extraction, modeling and classification. Feature engineering has been crucial in audio classification. By elaborately investigating the differences among various audio types, researchers have developed a variety of hand-crafted features, including energy features (e.g., short-time energy [9][5] and 4 Hz modulation energy [10][5]), statistical spectral features (e.g., ZCR [9], spectral flux [9] and spectral centroid [11]), spectral envelope features (e.g., MFCC [10] and LPCC [12]) and pitch features (e.g., pitch [10], spectral peak duration [13], pitch density [5][2] and chroma vector [14]). Previous approaches usually use a combination of the above features, although those features may be highly correlated or redundant. Thus a feature selection stage may be desired [5].

On the other hand, different classifiers have been studied for audio classification, including the Gaussian mixture model (GMM) [15], K-nearest neighbor (KNN) [16], artificial neural networks (ANN) [17] and support vector machines (SVM) [16][9]. In particular, SVM has shown superior performance compared with other classifiers [5][16]. In contrast to other classifiers that minimize the empirical risk, that is, optimize the performance on the training data, SVM minimizes the structural risk, that is, the probability of misclassifying yet-to-be-seen patterns for a fixed but unknown probability distribution of the data [18].

Artificial neural networks, which use a hierarchical layered structure to learn complex non-linear patterns, have intrinsic advantages in the modeling and classification of complex, high-dimensional audio data. Traditionally, ANNs, especially the multi-layer perceptron (MLP), have been used as audio classifiers with limited success, employing only one to two hidden layers due to the heavy computational load of a large number of network parameters and the difficulties in optimizing many layers [17]. Deeper networks have only become practical recently through the use of computing clusters or graphics processing units (GPUs) and, most importantly, the development of effective learning methods [19]. In the past several years, deep neural networks (DNN), i.e., MLPs with multiple hidden layers, have been successfully used in many tasks, such as speech recognition [20], natural language processing [21], and computer vision [19]. Similarly, the DNN has re-emerged as a promising model for audio classification [7].

In this paper, instead of using a DNN as a classifier, we propose to use the DNN as a feature extractor for accurate audio classification. This is motivated by the DNN's ability to learn features automatically. The DNN-derived features can be effectively used in a subsequent classifier for audio classification. Specifically, we learn bottleneck features from a neural network with a single type of low-level representation (e.g., Mel filter bank) as input, keeping feature engineering to a minimum. Popularized by their success in speech recognition [22][23], bottleneck features are generated from a special structure of MLP, in which one of the hidden layers has a small number of hidden units compared to the size of the other hidden layers. This narrow hidden layer serves as a bottleneck layer, which creates a constriction in the network that forces the information pertinent to classification into a compact representation. We study both unsupervised and supervised bottleneck feature extraction methods. The experiments show that supervised bottleneck features outperform conventional hand-crafted features and achieve state-of-the-art performance in audio classification.

2. DEEP NEURAL NETWORK

A deep neural network is essentially a multi-layer perceptron (MLP) [12], i.e., a feed-forward neural network that maps sets of input data onto a set of outputs. In the audio classification task, the inputs and outputs are audio features and class labels, respectively. An MLP usually consists of an input layer, one to several hidden layers and an output layer, and the nodes in each layer are fully connected to the nodes in the adjacent layers. An MLP with multiple hidden layers is also known as a deep neural network (DNN), as shown in Fig. 1. The input layer has no computation capability as it simply attaches the observations to the network. Each hidden layer takes in the activations of the layer below and computes a new set of nonlinear activations for the layer above. The output layer generates either a posterior vector in a classification task or a value in a regression task using the activations from the last hidden layer. Each hidden layer computes its activations h_l via a linear transformation using a weight matrix W_l and a bias vector b_l followed by a nonlinear function f_l(x):

$$ h_l = f_l(W_l h_{l-1} + b_l), \quad 1 \le l < L \qquad (1) $$

where the nonlinear function f_l(x) usually operates in an element-wise manner on the input vector.

[Fig. 1. The structure of a deep neural network.]

The commonly used hidden activation function is the sigmoid logistic function, defined as

$$ \phi(x) = \mathrm{sigmoid}(x) = \left[ \frac{1}{1+\exp(-x_1)} \ \cdots \ \frac{1}{1+\exp(-x_K)} \right]^{T} \qquad (2) $$

for a K-element input vector x. Each sigmoid hidden unit can be regarded as carrying out a logistic linear regression feature extraction process that refines the lower-layer representation into a better one. Using this representation, the discrimination among different classes is much clearer and the prediction can be made with a weak classifier.

The output layer (the Lth layer) predicts either a value (regression) or a class label (classification). The output layer carries out a linear transformation similar to that of the hidden layers, using a weight matrix W_L and a bias vector b_L. However, a different task-dependent nonlinear function is usually adopted. For a classification task like the audio classification in this study, the softmax function is adopted, which converts values of arbitrary ranges into a probabilistic representation. The generated output values can be interpreted as posterior probabilities for each of the classes given the input observation. For a K-element vector x, the softmax function is

$$ \psi(x) = \mathrm{softmax}(x) = \frac{1}{\sum_{k=1}^{K}\exp(x_k)} \left[ \exp(x_1) \ \cdots \ \exp(x_K) \right]^{T} \qquad (3) $$

In summary, the parameters of an L-layer network are (W_1, b_1), (W_2, b_2), ..., (W_L, b_L). They are usually randomly initialized (or initialized through pre-training [19]) and then discriminatively updated using the back-propagation (BP) algorithm [24].
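To make the layer computations concrete, the following is a minimal NumPy sketch of the forward pass defined by Eqs. (1)-(3). The layer sizes and random weights are illustrative assumptions, not the configuration used in the experiments below.

```python
import numpy as np

def sigmoid(x):
    """Eq. (2): element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    """Eq. (3): converts arbitrary-range values into class posteriors."""
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()

def forward(x, params):
    """Forward pass through an L-layer MLP.

    params: list of (W_l, b_l) pairs; the last pair is the output layer.
    """
    h = x
    for W, b in params[:-1]:
        h = sigmoid(W @ h + b)   # Eq. (1): h_l = f_l(W_l h_{l-1} + b_l)
    W, b = params[-1]
    return softmax(W @ h + b)    # posterior probabilities over the classes

# Illustrative network: 40-D input, two 200-unit hidden layers, 5 classes.
rng = np.random.default_rng(0)
sizes = [40, 200, 200, 5]
params = [(0.01 * rng.standard_normal((m, n)), np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]
posteriors = forward(rng.standard_normal(40), params)  # sums to 1
```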


3. BOTTLENECK FEATURES

Bottleneck (BN) features have been widely used in speech recognition [22, 23, 25]. Bottleneck features are usually generated from a multi-layer perceptron in which one of the hidden layers has a small number of hidden units compared to the size of the other hidden layers. This special hidden layer serves as a bottleneck layer, which creates a constriction in the network that forces the information pertinent to classification into a low-dimensional representation. The weighted sums of the outputs of the hidden layer immediately before the bottleneck layer are used as the bottleneck features. In speech recognition, bottleneck features derived by a deep neural network, usually combined with traditional spectral features, are fed into a Gaussian mixture model (GMM) - hidden Markov model (HMM) system [23].

Fig. 2 illustrates how BN features are derived from a trained DNN in this study. Bottleneck features can be extracted in an unsupervised manner from an autoencoder (AE), in which the neural network is trained to predict the input features themselves [19], as shown in the upper part of Fig. 2. Features extracted in this way are named AE-BN. An autoencoder can be considered a method of nonlinear dimensionality reduction, since the activations at the bottleneck layer are a low-dimensional nonlinear function of the original input features. Alternatively, bottleneck features can be extracted in a supervised manner, as shown in the lower part of Fig. 2. These features are named CL-BN (CL means classification). In this approach, the network consists of a variable number of moderately large, fully connected hidden layers and a narrow bottleneck layer, followed by optional additional hidden layer(s) and the final classification layer. Supervised BN features are thus created from a neural network trained to predict the class labels (audio types in this study).

Different from an autoencoder, in which the bottleneck layer is usually in the middle, the bottleneck layer in a supervised network can be placed either in the middle or as the last hidden layer. Bottleneck features represent a nonlinear transformation and often also serve to reduce the dimensionality of the input features. The optimal size of the bottleneck is task dependent and thus should be tuned accordingly.
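As an illustration of the supervised (CL-BN) setup, the sketch below builds a classification network with a narrow bottleneck layer and then truncates it at that layer to obtain the feature extractor. It uses Keras purely for brevity; the experiments in this paper used the CURRENNT toolkit, so this is an illustrative reimplementation, with layer sizes following the CL-BN-FBank structure reported later in Table 2.

```python
import tensorflow as tf

n_input, bn_size, n_classes = 440, 50, 5  # 11 stacked 40-D FBank frames

inputs = tf.keras.Input(shape=(n_input,))
h = tf.keras.layers.Dense(1024, activation="sigmoid")(inputs)
h = tf.keras.layers.Dense(512, activation="sigmoid")(h)
bn = tf.keras.layers.Dense(bn_size, activation="sigmoid",
                           name="bottleneck")(h)
outputs = tf.keras.layers.Dense(n_classes, activation="softmax")(bn)

classifier = tf.keras.Model(inputs, outputs)
classifier.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")
# classifier.fit(x_train, y_train)  # x_train: frames, y_train: audio types

# After training, truncating the network at the bottleneck layer yields
# the compact features that are then fed to the SVM classifier.
extractor = tf.keras.Model(inputs,
                           classifier.get_layer("bottleneck").output)
# bn_features = extractor.predict(x_frames)  # shape: (n_frames, 50)
```

For the unsupervised AE-BN variant, the output layer would instead reconstruct the input features themselves, with the bottleneck in the middle of a symmetric network trained under a reconstruction loss.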

4. EXPERIMENTS

Audio classification experiments were carried out on a rich dataset containing about 240 minutes of audio collected from different sources with various genres: broadcast news from multiple TV channels, music clips downloaded from the Internet, speech clips from the Aurora4 dataset1 and noise clips from the NOISEX-92 dataset2. The audio streams are sampled at 16 kHz, mono with 16 bits per sample, and manually annotated in terms of five pre-defined classes, i.e., pure speech (SPH), music (MSC), background sound (BKG), speech with music (SPH+MSC) and speech with background sound (SPH+BKG).

1 http://aurora.hsnr.de/aurora-4.html
2 http://www.speech.cs.cmu.edu/comp.speech/Section1/Data/noisex.html

[Fig. 2. Bottleneck feature extraction: unsupervised (top) and supervised (bottom). The layers in the dashed-line box are optional.]

The dataset was manually split into a training set (200 minutes) and a testing set (40 minutes). The duration of each audio class is balanced in order to keep a fair experimental comparison. In the experiments, an audio stream was first segmented into 32-ms non-overlapping frames for short-term feature extraction, and audio classifiers were then trained at the frame level. At the classification (testing) stage, the audio type decision was made via majority voting over the classification results of 30 consecutive frames. The CURRENNT toolkit3 was used for neural network training. All neural networks were trained with the same learning rate of 0.000001 and momentum of 0.9. The input features of the neural networks are normalized to the range 0-1.
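The following is a minimal sketch of these two frame-level conventions, assuming arrays of per-frame features and predicted integer labels. The 0-1 normalization is rendered here as per-dimension min-max scaling, which the text does not spell out and is therefore an assumption.

```python
import numpy as np

def minmax_normalize(x):
    """Scale each feature dimension of x (n_frames, dim) to [0, 1]."""
    lo, hi = x.min(axis=0), x.max(axis=0)
    return (x - lo) / np.maximum(hi - lo, 1e-12)

def majority_vote(frame_preds, block=30):
    """Decide the audio type of each block of 30 consecutive frames
    by taking the most frequent frame-level class label."""
    n_blocks = len(frame_preds) // block
    return np.array([np.bincount(frame_preds[i * block:(i + 1) * block]).argmax()
                     for i in range(n_blocks)])
```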

4.1. Baseline Systems

Although our aim is to use a neural network as a feature extractor, as a sanity check, we first tested the performance of the DNN as a classifier. SVM was used for comparison4, with a Gaussian radial basis function as the kernel. A rich set of 50-D features was used as classifier input, including zero crossing rate (ZCR), short-time energy, 13-D MFCC, 18-D LSP, 13-D LPCC, spectral flux, spectral spread, spectral centroid and chroma vector. These features were proven effective in previous audio classification studies [9]. Experimental results are shown in Table 1. As we can see, with a clear performance gap, the DNN is not able to outperform the SVM when used as a classifier.

3 https://sourceforge.net/projects/currennt/
4 LibSVM was used: https://www.csie.ntu.edu.tw/%7ecjlin/libsvm/


Table 1. Audio classification accuracy of SVM and DNN classifiers on conventional features (%).

Audio Type    SVM    DNN
SPH           91.20  77.74
SPH+MSC       90.48  83.14
SPH+BKG       94.50  87.50
MSC           98.52  92.59
BKG           98.67  65.65
Average       94.67  81.32

Please note that we tuned different network configurations, e.g., the number of layers and nodes, pre-training strategies and learning parameters, but the DNN still could not surpass the SVM as a classifier. We believe this is because the SVM is a very strong discriminative classifier, while the DNN only uses softmax as a simple linear classifier on top of its last hidden layer. Although the hierarchical structure of a DNN can be regarded as a powerful non-linear feature extractor, the failure of this kind of usage drives us to find a more effective way to extract discriminative features that benefit from deep non-linear learning abilities.

4.2. Bottleneck Features

We tested the two ways of bottleneck feature extraction described in Section 3: unsupervised (AE-BN) and supervised (CL-BN). This time, SVM was adopted as the classifier due to its superior performance. When extracting supervised BN features (CL-BN), we examined two sets of neural network input: the 50-D conventional features described in Section 4.1, namely CF, and 40-D Mel filter bank features, namely FBank. Recent studies have shown that lower-level representations of the signal, e.g., Mel filter bank features, are superior when used as neural network input [26]. We also used the 40-D FBank features as input when extracting bottleneck features from an autoencoder (AE-BN). Similar to DNN-based speech recognition [27], we took a frame expansion strategy in CL-BN to cover contextual information; that is, a long window of 11 frames (5 left frames + current frame + 5 right frames) was used as the bottleneck network input, as sketched below.
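The frame expansion concatenates each frame with its 5 left and 5 right neighbors, turning 40-D FBank frames into 440-D network inputs. In this sketch the boundaries are handled by repeating the first and last frames, which the paper does not specify and is an assumption here.

```python
import numpy as np

def expand_context(feats, left=5, right=5):
    """feats: (n_frames, dim) -> (n_frames, dim * (left + 1 + right)).

    Row j holds frames j-left .. j+right concatenated, with edge frames
    repeated at the stream boundaries.
    """
    padded = np.concatenate([np.repeat(feats[:1], left, axis=0),
                             feats,
                             np.repeat(feats[-1:], right, axis=0)], axis=0)
    n = len(feats)
    return np.hstack([padded[i:i + n] for i in range(left + right + 1)])

# e.g., 40-D FBank -> 440-D input: expand_context(fbank).shape == (n, 440)
```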

Results are shown in Table 2 and some observations are summarized as follows. The unsupervised bottleneck feature (AE-BN-FBank) shows inferior performance compared with the supervised ones (CL-BN), mainly due to the poor classification performance on background sound (BKG). The main reason is that the background sounds actually involve diverse audio types (recorded in restaurants, airports, streets and cars) and unsupervised learning cannot capture them in a unified feature representation. We also notice that using FBank as the network input for BN feature extraction outperforms using the conventional features. CL-BN-FBank achieves an average classification accuracy of 94.20%, which is comparable to the SVM baseline with conventional features (94.67%).

[Fig. 3. Visualization of different features (CF, AE-BN, CL-BN) extracted for 1-second clips from different audio classes (SPH, MSC, BKG, SPH+MSC, SPH+BKG).]

This promising result motivates us to further fine-tune the BN network structure. Fig. 3 shows a visualization of the different features, in which we can clearly see the patterns that the bottleneck features have learned.

4.3. Bottleneck Features: Number of Layers

We investigated the classification performance for different numbers of hidden layers in the supervised BN network. Again, SVM was used as the classifier with the 50-D CL-BN features as input. Results in Table 3 show that going 'deep' does help in feature extraction: the BN features derived from a 4-hidden-layer network achieve the best classification accuracy (94.95%), which outperforms the state-of-the-art performance previously reached by SVM with conventional features (94.67%, as shown in Table 1).

4.4. Bottleneck Features: Size of BN layer

As a network as deep as 4 hidden layers demonstrated superior performance, we further examined whether extra gains could be achieved by tuning the size of the bottleneck layer (i.e., the BN feature dimension). Results in Table 4 show that further accuracy gains are achieved with a wider BN layer. For example, when the BN size is 150, the audio classification accuracy is as high as 97.74%. The performance might become even better if we further expanded the layer size, but the classification efficiency would become quite low. So in practice, the number of hidden layers and the number of nodes in each hidden layer should be set empirically by considering the tradeoff among trainability, classification accuracy and runtime efficiency.

4.5. Feature Fusion

Previous studies in speech recognition [28] have suggested that the combination of traditional features with bottleneck features can obtain extra performance improvements.


Table 2. Audio classification accuracy of supervised and unsupervised BN features (%). CF: conventional feature set.

BN Feature         CL-BN-CF                     CL-BN-FBank                  AE-BN-FBank
Network Structure  [I]550-1024-512-[BN]50-[O]5  [I]440-1024-512-[BN]50-[O]5  [I]40-200-200-[BN]40-200-200-[O]5
SPH                77.91                        90.31                        89.95
SPH+MSC            94.08                        91.25                        90.34
SPH+BKG            96.21                        94.88                        94.69
MSC                98.33                        97.59                        94.63
BKG                92.40                        96.96                        32.64
Average            91.80                        94.20                        80.45

Table 3. Audio classification accuracy for different numbers of hidden layers in the BN network (%).

Network Structure  1024-[BN]50  1024-512-[BN]50  1024-512-256-[BN]50
SPH                91.56        90.31            93.72
SPH+MSC            92.15        91.25            92.54
SPH+BKG            95.26        94.88            96.39
MSC                97.96        97.59            97.41
BKG                72.68        96.96            94.69
Average            89.92        94.20            94.95

Table 4. Audio classification accuracy for different sizes of the bottleneck layer (%). The network structure is 1024-512-256-[BN]Size, where Size = 30, 50, 100, 150.

BN Size    30     50     100    150
SPH        91.38  93.72  94.61  96.59
SPH+MSC    88.93  92.54  95.88  95.11
SPH+BKG    90.13  96.39  100    100
MSC        97.22  97.41  98.15  99.07
BKG        89.07  94.69  97.15  97.94
Average    91.35  94.95  97.16  97.74

Table 5. Audio classification accuracy of feature fusion: BN features with conventional features (%).

Audio Type  150-D BN  150-D BN + 50-D CF
SPH         96.59     97.49
SPH+MSC     95.11     95.24
SPH+BKG     100       100
MSC         99.07     99.63
BKG         97.94     98.29
Average     97.74     98.13

Hence, in the last experiment, we concatenated the 150-D BN features with the 50-D conventional features and trained an SVM classifier on the combined 200-D features. Results in Table 5 show that, as expected, an extra accuracy increase is achieved after feature fusion. This indicates that neural network derived BN features are complementary to conventional hand-crafted features in the audio classification task.
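The fusion step amounts to a simple concatenation before SVM training. Below is a minimal scikit-learn sketch; the experiments used LibSVM with an RBF kernel (sklearn's SVC wraps the same library), and the random arrays are placeholders standing in for the real features and labels.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
bn_feats = rng.random((600, 150))   # stand-in for the 150-D BN features
cf_feats = rng.random((600, 50))    # stand-in for the 50-D conventional features
labels = rng.integers(0, 5, 600)    # 5 audio classes

fused = np.hstack([bn_feats, cf_feats])     # 200-D combined representation
clf = SVC(kernel="rbf").fit(fused, labels)  # RBF-kernel SVM, as in the baseline
frame_preds = clf.predict(fused)            # then 30-frame majority voting
```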

5. CONCLUSIONS AND FUTURE WORK

We have proposed DNN-derived bottleneck features for audio classification. Without feature engineering, the automatically learned neural network features achieve higher accuracy in audio classification compared with conventional hand-crafted features. When the two sets of features are combined, a further performance gain is observed.

Tandem and stacked approaches are often used with neural networks [1][29]. We may use the learned BN features as the input of another neural network classifier in a tandem way. Alternatively, we can stack multiple consecutive frames of bottleneck features to produce a wide context around the current frame and use this as input to a second DNN to learn stacked bottleneck features (SBN) [30][31]. The stacked approach is able to learn important sequential/contextual information and has previously shown its effectiveness in speech recognition. Moreover, advanced network structures, e.g., convolutional [7] and recurrent [32] neural networks, deserve a study in the audio classification task. We believe these future directions may further improve the classification accuracy.

6. ACKNOWLEDGEMENTS

This work is partially supported by the National Natural Science Foundation of China (61571363) and the Aeronautical Science Foundation of China (20155553038).

7. REFERENCES

[1] Li Deng and Dong Yu, "Deep learning: Methods and applications," Foundations and Trends in Signal Processing, vol. 7, no. 3-4, pp. 197-387, 2014.

[2] Zhong-Hua Fu, Jhing-Fa Wang, and Lei Xie, "Noise robust features for speech/music discrimination in real-time telecommunication," in ICME. IEEE, 2009, pp. 574-577.

[3] Chungsoo Lim, Seong-Ro Lee, and Joon-Hyuk Chang, "Efficient implementation of an SVM-based speech/music classifier by enhancing temporal locality in support vector references," IEEE Transactions on Consumer Electronics, vol. 58, no. 3, pp. 898-904, 2012.

[4] Lei Chen, Sule Gunduz, and M. Tamer Ozsu, "Mixed type audio classification with support vector machine," in ICME, 2006, pp. 781-784.

[5] Lei Xie, Zhong-Hua Fu, Wei Feng, and Yong Luo, "Pitch-density-based features and an SVM binary tree approach for multi-class audio classification in broadcast news," Multimedia Systems, vol. 17, no. 2, pp. 101-112, 2011.

[6] Zvi Kons and Orith Toledo-Ronen, "Audio event classification using deep neural networks," in INTERSPEECH, 2013, pp. 1482-1486.

[7] Honglak Lee, Peter Pham, Yan Largman, and Andrew Y. Ng, "Unsupervised feature learning for audio classification using convolutional deep belief networks," in Advances in Neural Information Processing Systems, 2009, pp. 1096-1104.

[8] Hongchen Jiang, Junmei Bai, Shuwu Zhang, and Bo Xu, "SVM-based audio scene classification," in IEEE NLP-KE'05, 2005, pp. 131-136.

[9] Lie Lu, Hong-Jiang Zhang, and Stan Z. Li, "Content-based audio classification and segmentation by using support vector machines," Multimedia Systems, vol. 8, no. 6, pp. 482-492, 2003.

[10] Michael J. Carey, Eluned S. Parris, and Harvey Lloyd-Thomas, "A comparison of features for speech, music discrimination," in ICASSP. IEEE, 1999, vol. 1, pp. 149-152.

[11] Bojana Gajic and Kuldip K. Paliwal, "Robust speech recognition in noisy environments based on subband spectral centroid histograms," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 2, pp. 600-608, 2006.

[12] Dongge Li, Ishwar K. Sethi, Nevenka Dimitrova, and Tom McGee, "Classification of general audio data for content-based retrieval," Pattern Recognition Letters, vol. 22, no. 5, pp. 533-544, 2001.

[13] Jun Wang, Qiong Wu, Haojiang Deng, and Qin Yan, "Real-time speech/music classification with a hierarchical oblique decision tree," in ICASSP, 2008, pp. 2033-2036.

[14] Ning Hu, Roger B. Dannenberg, and George Tzanetakis, "Polyphonic audio matching and alignment for music retrieval," Computer Science Department, p. 521, 2003.

[15] Eric Scheirer and Malcolm Slaney, "Construction and evaluation of a robust multifeature speech/music discriminator," in ICASSP, 1997, vol. 2, pp. 1331-1334.

[16] Lie Lu, Hong-Jiang Zhang, and Hao Jiang, "Content analysis for audio classification and segmentation," IEEE Transactions on Speech and Audio Processing, vol. 10, no. 7, pp. 504-516, 2002.

[17] Enrique Alexandre, Lucas Cuadra, Lorena Alvarez, and Manuel Rosa-Zurera, "NN-based automatic sound classifier for digital hearing aids," in IEEE International Symposium on Intelligent Signal Processing (WISP), 2007, pp. 1-6.

[18] Corinna Cortes and Vladimir Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273-297, 1995.

[19] Geoffrey E. Hinton and Ruslan R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504-507, 2006.

[20] Geoffrey Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82-97, 2012.

[21] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean, "Efficient estimation of word representations in vector space," arXiv preprint arXiv:1301.3781, 2013.

[22] Frantisek Grezl, Martin Karafiat, Stanislav Kontar, and Jan Cernocky, "Probabilistic and bottle-neck features for LVCSR of meetings," in ICASSP, 2007, vol. 4, pp. IV-757.

[23] Zhi-Jie Yan, Qiang Huo, and Jian Xu, "A scalable approach to using DNN-derived features in GMM-HMM based acoustic modeling for LVCSR," in INTERSPEECH, 2013, pp. 104-108.

[24] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams, "Learning representations by back-propagating errors," Cognitive Modeling, vol. 5, p. 3, 1988.

[25] Yu Zhang, Ekapol Chuangsuwanich, and James R. Glass, "Extracting deep neural network bottleneck features using low-rank matrix factorization," in ICASSP, 2014, pp. 185-189.

[26] Li Deng, Jinyu Li, Jui-Ting Huang, Kaisheng Yao, Dong Yu, Frank Seide, Mike Seltzer, Geoffrey Zweig, Xiaodong He, Julia Williams, et al., "Recent advances in deep learning for speech research at Microsoft," in ICASSP, 2013, pp. 8604-8608.

[27] George E. Dahl, Dong Yu, Li Deng, and Alex Acero, "Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 30-42, 2012.

[28] Zoltan Tuske, Ralf Schluter, and Hermann Ney, "Deep hierarchical bottleneck MRASTA features for LVCSR," in ICASSP, 2013, pp. 6970-6974.

[29] J. Zhang, E. B. Martin, A. J. Morris, and C. Kiparissides, "Inferential estimation of polymer quality using stacked neural networks," Computers & Chemical Engineering, vol. 21, pp. S1025-S1030, 1997.

[30] Jonas Gehring, Yajie Miao, Florian Metze, and Alex Waibel, "Extracting deep bottleneck features using stacked auto-encoders," in ICASSP, 2013, pp. 3377-3381.

[31] Zhizheng Wu, Cassia Valentini-Botinhao, Oliver Watts, and Simon King, "Deep neural networks employing multi-task learning and stacked bottleneck features for speech synthesis," in ICASSP, 2015, pp. 4460-4464.

[32] Matthew M. Botvinick and David C. Plaut, "Short-term memory for serial order: A recurrent neural network model," Psychological Review, vol. 113, no. 2, p. 201, 2006.