Hierarchical feature extraction by multi-layer non-negative matrix factorization network for classification task

Hyun Ah Song a,1, Bo-Kyeong Kim a, Thanh Luong Xuan a, Soo-Young Lee a,b,*

a Department of Electrical Engineering, KAIST, Daejeon, Republic of Korea
b Department of Bio and Brain Engineering, KAIST, Daejeon, Republic of Korea

Article info

Article history:
Received 28 February 2014
Received in revised form 16 June 2014
Accepted 7 August 2014
Available online 23 April 2015

Keywords:
Hierarchical feature extraction
Multi-layer network
Unsupervised learning
Non-negative matrix factorization

Abstract

In this paper, we propose a multi-layer non-negative matrix factorization (NMF) network for the classification task, which provides an intuitively understandable hierarchical feature learning process. A layer-by-layer learning strategy was adopted through stacked NMF layers, which enforce non-negativity of both the features and their coefficients. With the non-negativity constraint, the learning process reveals latent feature hierarchies in complex data in an intuitively understandable manner. The multi-layer NMF network was investigated for the classification task by studying various network architectures and nonlinear functions. The proposed multi-layer NMF network was applied to a document classification task, and we demonstrated that it yields much better classification performance than a single-layered network, even with a small number of features. Also, through the intuitive learning process, the underlying structure of feature hierarchies was revealed for the complex document data.

© 2015 Elsevier B.V. All rights reserved.

1. Introduction

Humans learn complex data easily. This is done by understanding latent features in the complex data. How do we learn latent features out of complex data? To understand complex data, humans conduct hierarchical feature extraction. We take multi-layered steps for understanding complex data: we break a complex problem into several simple problems and solve them one by one over multiple stages [1]. One good example of this hierarchical feature extraction approach is found in the visual cortex [2].

There have been extensive studies on hierarchical learning algorithms that resemble biology. The best known is the Deep Belief Network (DBN) introduced in 2006 [3]. With the DBN, Hinton showed the first success in training deep architectures. Hinton stacked Restricted Boltzmann Machines (RBMs) into several layers, which are greedily learned in a layer-wise manner. By repeatedly stacking simple unit algorithms into several layers, the algorithm solves a simple problem at each layer. By integrating simple solutions throughout the layers, the algorithm is able to solve a complex problem even without involving complex mathematical functions. With the success of training deep architectures, several variants of deep learning have been introduced, such as auto-encoders stacked into several layers [4,5]. They even beat the previous state-of-the-art performances.

These multi-layered algorithms have provided efficient solutions to complex problems by mimicking the hierarchical data processing mechanism in the brain. Although they produce good results in classification tasks, they work as a 'black box': they lack an intuitive explanation of the procedure. Although they take hierarchical approaches in feature extraction, they do not provide us with the concept hierarchies that are learned throughout the hierarchical structure.

There have been several approaches to developing intuitive feature extraction algorithms, namely multi-layer non-negative matrix factorization (NMF). In these works, NMF is stacked into several layers [6-11]. Research related to multi-layer NMF has focused on the intuitive hierarchical feature learning process and its efficiency in blind source separation (BSS) tasks, but not on its efficiency in classification tasks.

In this paper, we propose an optimal structure of the multi-layer NMF algorithm for the classification task, which provides an intuitive understanding of hierarchical data processing steps. By stacking a unit algorithm into a multi-layer structure, we conduct hierarchical learning and demonstrate the feature hierarchies that are learned throughout the layers. For the unit algorithm, we adopted non-smooth NMF (nsNMF) [12]. Due to the non-negativity constraint of NMF, it learns parts-based and sparse features.

It also provides an intuitive understanding of data decomposition and reconstruction by allowing only additive operations; we are able to follow a feature development process that is much like building a house by combining Lego blocks. In this manner, with multi-layer NMF, we are able to demonstrate the underlying feature hierarchies in complex data in an intuitively understandable manner by learning additive relationships between features across the layers. With this hierarchically structured feature extraction algorithm, we thoroughly investigate the optimal structure for classification tasks. We study optimal architectures and the application of nonlinear functions, and show that our proposed structure of multi-layer NMF reaches the maximum performance it can achieve even with a small number of features allowed for data representation. We also describe the characteristics of multi-layer NMF compared to single-layered NMF.

The organization of the paper is as follows. First, in Section 2, we introduce previous works related to hierarchical multi-layer NMF. In Section 3, we describe the motivation for designing our proposed network. In Section 4, we describe how we arrived at the proposed structure of the multi-layer NMF network and explain the learning process in detail. We show experimental results with the Reuters-21578 collection in Section 5, where we explain our careful approach to designing the optimal structure of the multi-layer NMF network for classification tasks using the Reuters data. We discuss remaining future work regarding our proposed network in Section 6. Finally, we close the paper with a conclusion in Section 7.

2. Related works

Extending NMF into a multi-layered structure is not a novel concept. There have been a few approaches that stack NMF several times to extend it into a multi-layer structure. Studies related to multi-layer NMF have focused on its intuitive feature learning process and its efficiency in solving BSS problems.

In [6], the authors proposed the up-propagation algorithm, where NMF is stacked into three layers to form a multi-layer structure, with a nonlinear function applied in the recurrent projection direction. In the up-propagation algorithm, the features are jointly trained using training algorithms from the authors' previous work [13]. Through this work, the authors introduced the notion of a 'biologically plausible' feature extraction algorithm and the interesting property of extending NMF into several layers by demonstrating a hierarchical facial feature development process along the layers. In their later work [7], they applied their algorithm to PET image data and showed that multi-layer NMF not only demonstrates the interesting property of intuitive hierarchical feature extraction, but also does its job as a feature extraction algorithm successfully, as demonstrated by the activation of the feature coefficients.

In [8,9], the authors introduced a multi-layer NMF algorithm of three layers. They used a projected gradient descent rule to train each NMF unit, and trained each layer separately. In their multi-layer NMF, they did not apply a nonlinear function between the layers, which leads to consecutive linear factorizations. They applied their algorithm to the blind source separation (BSS) problem, and showed by measuring SIR values that dividing the feature extraction process into several steps through multiple layers provides a better solution to the separation of mixed signals than single-layered NMF.

In [10], the authors proposed a sparse and transformation-invariant hierarchical NMF algorithm. They focus on the transformation-invariant property of the algorithm rather than its hierarchical properties.

Our proposed multi-layer NMF differs from these previous works in its structure. In [6], the architecture of the multi-layer NMF is not considered thoroughly, and the nonlinear function is applied in the recurrent direction. In [8,9], the architecture of the multi-layer NMF comprises repeating the same process through several layers. The authors did not consider designing a specific architecture of multi-layer NMF, but simply divided the one-step learning process of a single layer into several steps. Also, they did not apply nonlinear functions. In contrast, in our proposed multi-layer NMF network, we pay close attention to designing an optimal architecture for classification tasks and to the application of a nonlinear function in the forward direction, which suppresses differences in data representation resulting from slight deviations between training samples.

More importantly, the research focus of our proposed multi-layer NMF differs from those of previous works. In [6], the authors focused on introducing the novel notion of a multi-layer NMF algorithm that demonstrates the interesting property of an intuitive hierarchical feature learning process, and on proving its success as a feature extraction algorithm [7]. In [8,9], the research focus was on separating mixed signals. In this paper, however, we focus on solving classification problems: we thoroughly investigate the optimal architecture and use of nonlinear functions for efficient classification, and observe the characteristics of our proposed multi-layer NMF structure and its limitations.

3. Motivation

Before we start looking into the structure of hierarchical multi-layer NMF, we provide our hypothesis on the advantages of hierarchical learning, compared to single-layered shallow learning, in learning useful features and thus providing a better solution to classification tasks. Our notions are illustrated in Fig. 1.

Fig. 1. Illustration of the hypothesis on the characteristics of the multi-layer network compared to the single-layer network. (a) Example of three class features. How features are extracted to represent the original data with only three features in the case of (b) single-layer NMF, and (c) multi-layer NMF. (For interpretation of the references to color in the text, the reader is referred to the web version of this paper.)

Let us say we have complex input data that may be represented as an overlapping of three classes: red, blue, and green. Such a data sample, which is not assigned to a single class but represented as a mixture of several overlapping classes, is common in the real world, e.g., document samples belonging to multiple topic categories. In Fig. 1(a), distinct features of the three classes are shown as an example. Overlapping summations of these classes will be used as the input data.

Now, let us say that, for the NMF algorithm to yield its best performance, at least seven features should be provided for data representation; seven features are the minimum number of features needed to break the complex data down into useful pieces. Then imagine a case where an insufficient number of dimensions for perfect learning is provided for the final data representation; let us say, three dimensions. In this case, how will the single-layer network (b) and the hierarchical multi-layer network (c) behave?

If we reduce the dimension directly to three using a single-layer network, then trouble may occur due to the insufficient dimension provided for learning (b). Learning becomes significantly difficult for the network, because it has to extract three meaningful patterns out of complex data that consist of several overlapping classes. Therefore, we may expect that the single-layer network may not succeed in learning meaningful features. However, for the multi-layer network, if we provide a sufficient number of dimensions for the first layer of the multi-layer structure, the first layer may demonstrate its best performance by breaking the entangled complex data into the smallest possible unit blocks. The role of the second layer is then to learn combinations of those unit blocks; the first layer has pre-processed the complex data by breaking the complex pattern into small pieces, and this reduces the burden of learning in the second layer. Therefore, we can expect the second layer to extract three meaningful features, each representing a feature distinct to each class, as illustrated in Fig. 1(c).

As illustrated in Fig. 1, we may expect that the hierarchical learning approach provides a better chance of learning meaningful features out of complex data with the help of the pre-processing steps in the lower layers. This eventually leads to better classification performance even with a small number of features. This is the fundamental motivation of deep learning as well. On top of this fundamental approach of deep learning, we aim to use the non-negativity constraint, which results in a more natural way of learning hierarchical feature relationships through additive operations only.

4. Hierarchical multi-layer NMF

4.1. Non-smooth non-negative matrix factorization (nsNMF)

Our proposed network uses non-smooth non-negative matrix factorization (nsNMF) as the unit algorithm for each layer. NsNMF [12] is a variant of NMF [14,15] with a sparsity constraint. Basic NMF decomposes non-negative input data X into non-negative W and H, which are the features and the corresponding coefficients (data representation), respectively. It aims to reduce the error between the original data X and its reconstruction WH: X ≈ WH. The cost function is given by the following equation:

C = \frac{1}{2}\,\lVert X - WH \rVert^2 = \frac{1}{2}\sum_{m=1}^{M}\sum_{n=1}^{N}\Big(X_{mn} - \sum_{r=1}^{R} W_{mr}H_{rn}\Big)^2    (1)

The feature matrix W and its coefficient matrix H are alternately updated during the iterations using the multiplicative update rules shown in the following equation:

W_{mr} \leftarrow W_{mr}\,\frac{(XH^T)_{mr}}{(WHH^T)_{mr}}, \qquad H_{rn} \leftarrow H_{rn}\,\frac{(W^T X)_{rn}}{(W^T W H)_{rn}}    (2)

To apply the sparsity constraint to standard single-layer NMF, a sparsity matrix S is introduced in the following equation [12]:

S = (1-\theta)\,\mathrm{eye}(h) + \frac{\theta}{h}\,\mathrm{ones}(h)    (3)

In (3), h is the number of features, and \theta is the sparsity parameter, in the range 0-1. \mathrm{eye}(h) is the h \times h identity matrix, and \mathrm{ones}(h) is an h \times h matrix with all components equal to 1. Multiplying a matrix by S reduces its sparsity; the closer \theta is to 1, the greater the loss of sparsity. During the alternating updates, we multiply S with H as H = SH. To compensate for the loss of sparsity in H, W becomes sparse.

4.2. Multi-layer architecture

The proposed hierarchical multi-layer NMF structure is shown in Fig. 2. We stack the unit algorithm into multiple layers. The superscript of each term, l, denotes the layer index, l = 1, 2, ..., L. The outcome of each layer, H^(l), is processed into K^(l) before being introduced as input to the next layer. K^(l) is computed as K^(l)_rn = f(H^(l)_rn), where f(·) is the nonlinear function. We conducted experiments using various types of nonlinear functions, including linear, root, tanh, and logistic sigmoid. The experimental results with document data are shown in Section 5.4. Users may choose the nonlinear function best suited to their data. Using nsNMF, K^(l) is decomposed into W^(l+1) and H^(l+1): K^(l) ≈ W^(l+1) H^(l+1). This process is repeated until l = L.
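The candidate element-wise nonlinearities f(·) used to compute K^(l) = f(H^(l)) can be written down directly; the dictionary below is a sketch with names of our own choosing, following the definitions given in Section 5.4.

    import numpy as np

    # Element-wise nonlinearities considered for K^(l) = f(H^(l)); see Section 5.4.
    NONLINEAR_FUNCS = {
        "linear":   lambda x: x,
        "root":     np.sqrt,
        "tanh":     np.tanh,
        "logistic": lambda x: 1.0 / (1.0 + np.exp(-x)),
    }

Note that all four choices map non-negative inputs to non-negative outputs, so the next layer still receives non-negative data.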

The physical meaning of the decomposition of K^(l) in the next layer is to learn the combination information of the features of the previous layer. By learning which previous-layer features are combined together in the next layer, we develop complex features. This process can be interpreted as learning small building blocks in the lower layer, and then learning to construct commonly used complex blocks by combining several of them through the succeeding layers.

After training up to the last layer, the final data representation K^(L) is acquired. This is the activation information of the corresponding complex features, which integrate features throughout the layers.

Fig. 2. Overall architecture of hierarchical multi-layer NMF network.


4.3. Architecture for classification—determination of the number of hidden neurons, R_l

We conducted experiments with various architectures of multi-layer NMF to find the optimal architecture for classification tasks. We denote the number of hidden neurons in the l-th layer as R_l. The number of hidden neurons in the (L-1)-th layer, which is the number of features in W^(L-1) and the dimension of the data representation in H^(L-1), is determined after running the standard single-layer nsNMF algorithm with an increasing number of features. The number of hidden neurons in the (L-1)-th layer, R_{L-1}, is set to the number of features at which the single-layer network reaches its maximum performance, denoted R_SingleMax. The reason for referencing the single-layer experimental result is the "bottleneck" effect: the maximum performance of the final layer is restricted by the best performance achieved by the previous layer. We found this out through experiments varying R_{L-1}. In order to avoid an undesirable deterioration in the final performance, we provide a sufficient number of features for the layers l = 1, ..., L-1, at least as many as the number at which the single-layer network reaches its maximum performance threshold.

As mentioned in Section 4.2, the layers l = 1, ..., L-1 can be interpreted as pre-processing of the complex input data, where the complex data are decomposed into simple building blocks, much like preparing ingredients into units favorable for cooking. Layer L is responsible for learning how to combine these building blocks. If we do not provide enough features for the lower-layer learning, the building blocks will not be sufficient to represent all possible complex features of the original data. Therefore, even if the upper layers, which are built on the lower-layer learning result, do their best at their task, the final representation will not represent the original data successfully. For this reason, we supply enough features for the pre-processing job and do not restrict the process. We can interpret this point, R_{L-1}, as the minimum number of features necessary to represent the original data, or the number of building blocks necessary to fully construct the original data. This can also be referred to as the "maximum threshold performance" of the algorithm; together with the bottleneck effect, this is the maximum performance the network can achieve in classification tasks, regardless of whether the network is single- or multi-layered. For R_l with l < L-1, the number may be chosen to be any value larger than R_{L-1}.

In Fig. 3, we provide the classification performance of various architectures of multi-layer NMF for the case of L = 2. For the number of neurons for the final data representation, R_1 for the single-layer and R_2 for the multi-layer network, we tried several values increasing from 20 to 200 in intervals of 20: 20, 40, ..., 200. In Fig. 3(a), diagrams of the various architectures of multi-layer NMF are shown. In Fig. 3(b), the SVM classification performance of each architecture is shown. First of all, we can see that the single-layer NMF (blue circles) requires at least 180 features to achieve the best performance it can reach: R_SingleMax = 180. In order to analyze the effect of R_{L-1} on classification, we conducted experiments using three architecture types of the multi-layer NMF network: (1) R_{L-1} < R_SingleMax: 1000-20-R_2 (magenta triangles), (2) R_{L-1} > R_SingleMax: 1000-500-R_2 (green stars), and (3) R_{L-1} = R_SingleMax: 1000-180-R_2 (red crosses). As we can see from Fig. 3(b), if R_{L-1} < R_SingleMax, then the classification performance of the multi-layer NMF network is restricted by the performance attainable with R_{L-1} features, no matter what R_L is, due to the "bottleneck" effect. The classification performance of the R_{L-1} > R_SingleMax architecture shows no significant difference from the R_{L-1} = R_SingleMax case. This result assures us that we may supply as many features as R_SingleMax for R_{L-1} for a guaranteed improvement in classification with a small number of features. We also extended the network by one more layer and conducted an experiment with 1000-500-180-R_2 (black rectangles). The result shows that extending the network by one more layer improves the classification performance compared to the single-layer network, but slightly deteriorates it compared to the two-layer network. (This is discussed in Section 6.)

4.4. Training procedure

We train each layer separately, as described in Section 4.2. We start with a standard single-layer network and extend one layer at a time by decomposing the processed outcome of the previous layer. The number of hidden neurons for the (L-1)-th layer, R_{L-1}, is chosen as described in Section 4.3.

In the first layer, we decompose X_train ∈ R_+^(M×N_train) into W^(1) ∈ R_+^(M×R_1) and H^(1) ∈ R_+^(R_1×N_train) using the nsNMF multiplicative update rules, in an iterative manner. During the iterative updates, we multiply H^(1) with S^(1) = (1-θ_1) eye(R_1) + (θ_1/R_1) ones(R_1). S^(1) ∈ R_+^(R_1×R_1) results in a loss of sparsity in H^(1), and W^(1) becomes sparse to compensate for this loss. After the multiplicative updates of W^(1) and H^(1) are done, we have the trained basis W^(1). Then we apply the nonlinear function f(·) to H^(1) to obtain the final data representation of the first layer, K^(1) ∈ R_+^(R_1×N_train). We have W^(1) and K^(1) as the outcomes of the first layer. K^(1) is fed into the second layer as input data. We repeat the same process for the upper layers, L times in total for L layers. Now we are done with the training procedure.

For a more intuitive understanding, the pseudo-code for the training procedure is described in Appendix A.
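The layer-wise procedure can also be sketched as a short loop that mirrors the pseudo-code in Appendix A; this is a sketch under our own naming, reusing the hypothetical nsnmf_update helper from Section 4.1, with funcs[l] the (possibly identity) nonlinearity chosen for layer l.

    def train_multilayer_nmf(X_train, layer_sizes, thetas, funcs):
        # layer_sizes = [R_1, ..., R_L]; thetas = [theta_1, ..., theta_L]; funcs = [f_1, ..., f_L]
        Ws, K = [], X_train                      # K^(0) is the raw training data
        for R, theta, f in zip(layer_sizes, thetas, funcs):
            W, H = nsnmf_update(K, R, theta)     # K^(l-1) ~ W^(l) H^(l)
            K = f(H)                             # K^(l) = f(H^(l)), fed to the next layer
            Ws.append(W)
        return Ws, K                             # per-layer bases and final representation K^(L)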

4.5. Testing procedure

For the testing procedure, we learn the data representation for the test data using the bases W^(1), ..., W^(L) learned in the training procedure. In the testing procedure, we follow the same steps as in the training procedure, except that the bases W^(l) are held fixed rather than updated.

Fig. 3. Classification performance of various architectures of multi-layer NMF. (a) Diagrams of various architectures of multi-layer NMF. (b) SVM classification performance of each architecture. (For interpretation of the references to color in the text, the reader is referred to the web version of this paper.)


In the first layer, for the test data X_test ∈ R_+^(M×N_test), we learn H^(1)_test ∈ R_+^(R_1×N_test) with W^(1) ∈ R_+^(M×R_1) fixed from the training procedure, using the multiplicative update rules. We apply the nonlinear function to H^(1)_test to obtain K^(1)_test ∈ R_+^(R_1×N_test). K^(1)_test is fed to the second layer as input data. The same process is repeated for the upper layers.

For a more intuitive understanding, the pseudo-code for the testing procedure is also described in Appendix A.
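A corresponding sketch of the testing pass, with the bases W^(l) held fixed and only the coefficients updated; the function name and iteration count are assumptions of ours.

    import numpy as np

    def infer_representation(X_test, Ws, funcs, n_iter=200, eps=1e-9, seed=0):
        rng = np.random.default_rng(seed)
        K = X_test                                     # K^(0)_test is the raw test data
        for W, f in zip(Ws, funcs):
            H = rng.random((W.shape[1], K.shape[1])) + eps
            for _ in range(n_iter):
                H *= (W.T @ K) / (W.T @ W @ H + eps)   # update H only; W stays fixed
            K = f(H)                                   # K^(l)_test = f(H^(l)_test)
        return K                                       # final representation K^(L)_test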

5. Document data feature hierarchies

5.1. Data information

We applied our proposed network to a document database and observed the concept hierarchies in documents. We used the "Reuters-21578 collection, Distribution 1.0"² as the database. The Reuters-21578 database comprises documents from the Reuters newswire in 1987. We used the ModApte split. Each word accounts for one dimension, and each value represents the frequency of appearance of the corresponding word in a sample. The ModApte split originally contains 9603 documents for the training set and 3299 documents for the test set. There are various approaches to utilizing the ModApte split: some use it as it is, and some select the top 10 or 8 topics. In our experiment, we selected the documents of the 10 categories with the most documents, which make up 5786 documents for the training data (N_train = 5786) and 2587 documents for the test data (N_test = 2587). The top 10 categories are earn, acq, grain, money-fx, crude, trade, interest, ship, sugar, and coffee. In Table 1, the number of training and test documents for each category is given.

The original dimension of the Reuters-21578 data is very large, with a sparse distribution. We removed some stop-words: the, of, to, in, and, a, for, it, reuter, on, from, is, that, by, at, be, with, will, was, he, has, would, an, as, not, which, but, or, this, have, are, had, were, about, also, they, up, been, may, alter, could, we, made, when, should, I, you, she, your, his, her, am, can, might, so, into, do, did, does, said, its, talks. As a result, the dimension of our data becomes M = 24,231.

In our experiment, we designed our network as a two-layered network, L = 2. For experiments with the sparsity constraint, we applied the same sparsity to both layers by setting θ_1 = θ_2 = 0.8, and for experiments without the sparsity constraint we set θ_1 = θ_2 = 0, where the subscript denotes the layer index. For the number of hidden neurons of the first layer, we set R_1 = 180, which is the point where the classification result of the single-layered network converges to its maximum performance, R_SingleMax. For the number of neurons for the final data representation, R_2, we tried several values increasing from 20 to 200 in intervals of 20: R_2 = 20, 40, ..., 200. We normalized each column to sum to 1: Σ_{m=1}^{M} X_mn = 1 for all n.

For the classifier, we used an SVM with an RBF kernel, trained with quadratic programming. For the two parameters, the box constraint and the RBF sigma, we chose the optimal values after three rounds of cross-validation on the training data.
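As a rough sketch of this classification stage, assuming scikit-learn's SVC and GridSearchCV (the parameter grid below is illustrative, not the one used in the paper):

    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    def normalize_columns(X, eps=1e-12):
        # Each document (column) is normalized to sum to 1, as described above.
        return X / (X.sum(axis=0, keepdims=True) + eps)

    def classify(K_train, y_train, K_test):
        # RBF-kernel SVM; box constraint C and kernel width gamma tuned by cross-validation.
        grid = GridSearchCV(SVC(kernel="rbf"),
                            param_grid={"C": [1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1]},
                            cv=3)
        grid.fit(K_train.T, y_train)   # samples are columns in the paper's notation
        return grid.predict(K_test.T)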

5.2. Pre-processing of data, from M = 24,231 to M = 1000

After removing stop-words, the dimension of our data becomes M = 24,231. Even after the removal of stop-words, the data are high-dimensional. Documents usually comprise very high-dimensional data distributed in a sparse manner. Because of the high dimensionality, the computational cost of processing them is very high. However, the information that is significant for representing the features may not be that large. With this in mind, dimension-reduction techniques are widely used when handling document data. We also believed that the full 24,231-dimensional vector may not all carry significant information, so we pre-processed the data: we selected only the top 1000 most frequent words, which makes the final dimension M = 1000. In order to make sure that this dimension reduction to M = 1000 is safe for classification tasks, i.e., that it does not lose significant features, we compared the classification properties of the full-dimensional data (M = 24,231) and the pre-processed data (M = 1000).
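A minimal sketch of this pre-processing step (the helper name is ours):

    import numpy as np

    def keep_top_words(X, n_words=1000):
        # X is the word-by-document count matrix (M x N); keep the n_words most frequent words.
        total_counts = X.sum(axis=1)                     # total frequency of each word
        top = np.argsort(total_counts)[::-1][:n_words]   # indices of the most frequent words
        return X[top, :], top                            # reduced matrix and the kept indices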

In Fig. 4, the SVM classification of the single-layer (blue circles) and multi-layer (red circles) networks using the full-dimensional data, M = 24,231 (dotted lines), and the pre-processed data, M = 1000 (solid lines), is shown. The x-axis denotes the number of dimensions for the final data representation; this is R_1 for the single-layer results and R_2 for the multi-layer results. We can see that using the reduced dimension of M = 1000 does not deteriorate classification performance compared to using the full dimension of M = 24,231. Moreover, the classification performance is even better for M = 1000 most of the time. We may interpret this as the high-dimensional vocabulary containing many insignificant words, with the top 1000 words being enough to represent the significant features; the unnecessarily high dimension may have led to worse classification performance than the compact 1000-word data. Therefore, we decided to use the data in the processed form with M = 1000.

5.3. Role of sparsity

Applying a sparsity constraint is a well-known approach in feature learning; it is known that enforcing sparsity helps algorithms learn more useful features. In order to analyze the effect of the sparsity constraint, we conducted experiments with varying sparsity conditions for each layer. To make a fair comparison, we used the same initialization for each of the sparsity conditions. The SVM classification performance under the varying conditions is shown in Fig. 5. In the legend, the sparsity condition is expressed as 'No' for zero sparsity and 'Sp' for the sparsity constraint imposed on W. For example, 'Multi-NoSp' means that no sparsity constraint is applied to the first-layer W^(1), while the sparsity constraint is applied to the second-layer W^(2). We can see that for the classification task, sparsity in W does not have a significant effect. Therefore, we decided to apply no sparsity constraint in the case of the Reuters data: θ_1 = 0 and θ_2 = 0. (Although not included in this paper, when we applied our multi-layer NMF network to image data, applying the sparsity constraint to the Ws of all layers resulted in better classification performance. Therefore, users are encouraged to determine the sparsity constraint depending on the type of their data. In the case of Reuters, the reason why applying no sparsity constraint produced better results may be that document data are already highly sparse, so that forcing even more sparsity ruins the information.)

Table 1
Ten categories and the number of training and test samples in each.

Category    N_train   N_test
Earn        2680      1062
Acq         1447      708
Grain       345       151
Money-fx    299       150
Crude       276       167
Trade       285       142
Interest    153       78
Ship        115       55
Sugar       93        46
Coffee      94        29

² The Reuters-21578, Distribution 1.0 test collection is available from David D. Lewis's professional home page, currently: http://www.research.att.com/~lewis.

5.4. Optimal use of nonlinear functions

In our general multi-layer structure, nonlinear functions are applied between layers in the forward direction. We investigated several different types of nonlinear functions for the classification task on the Reuters data. We tried four types of nonlinear functions: (1) linear: f(x) = x, (2) tanh: f(x) = tanh(x), (3) logistic sigmoid: f(x) = 1/(1 + exp(-x)), and (4) root: f(x) = \sqrt{x}. In Fig. 6, the classification performance of the different types of nonlinear functions, with different combinations for each layer, is shown.

In (a), the classification performance of the four types of nonlinear functions for the single-layer network is plotted. We can see that using tanh or the logistic sigmoid as the nonlinear function does not improve classification performance compared to the linear case. However, using the root as the nonlinear function improves classification performance by a significant amount.

In (b), (c), and (d), the classification performance of four different combinations of nonlinear function usage for each layer of the multi-layer network is plotted for tanh, logistic sigmoid, and root, respectively. The four combinations of nonlinear function usage per layer are (1) Multi-H^(1)-H^(2): the linear case, (2) Multi-H^(1)-K^(2): applying the nonlinear function to the final outcome H^(2) only, (3) Multi-K^(1)-H^(2): applying the nonlinear function to the outcome of the intermediate layer, H^(1), but not to the final outcome H^(2), and (4) Multi-K^(1)-K^(2): applying the nonlinear function throughout all layers. We can see that the best combination of usage differs for each nonlinear function. For tanh, the 'Multi-K^(1)-K^(2)' case worked best, while the 'Multi-K^(1)-H^(2)' and 'Multi-H^(1)-K^(2)' cases worked best for the logistic sigmoid and the root, respectively. Interestingly, in the case of the root nonlinearity, (a) shows that K^(1) provides a better solution than H^(1), but this does not mean that feeding K^(1) to the upper layer always results in a better solution than feeding H^(1): the classification performance of the 'Multi-H^(1)-K^(2)' case was better than that of 'Multi-K^(1)-K^(2)'. This shows that we should investigate all possible combinations of nonlinear function usage for each layer of the multi-layer network.
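For a two-layer network, these four usage patterns amount to choosing, per layer, either the nonlinearity or the identity; a sketch with the root function as the example nonlinearity (the names are our own):

    import numpy as np

    def identity(x):
        return x

    # Per-layer nonlinearity choices for L = 2, illustrated here with the root function.
    COMBINATIONS = {
        "H1-H2": (identity, identity),   # linear case
        "H1-K2": (identity, np.sqrt),    # nonlinearity on the final outcome only
        "K1-H2": (np.sqrt, identity),    # nonlinearity on the intermediate layer only
        "K1-K2": (np.sqrt, np.sqrt),     # nonlinearity throughout all layers
    }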

In (e), we plot only the best combination of nonlinear function usage for each nonlinear function type, along with the linear case. We can see that using the root as the nonlinear function with the combination 'H^(1)-K^(2)' generates the best classification performance for the Reuters data. Therefore, we decided to apply the root nonlinearity only to the final outcome of the multi-layer network.

However, as with the determination of the sparsity constraint, users are encouraged to investigate the best nonlinear function and its combination of usage for their own data.

5.5. Classification performance

Until now, we have conducted several experiments to determine the optimal structure of the multi-layer NMF network for the classification task on the Reuters data. In this section, we compare the final classification performance of the single-layer network, the multi-layer NMF network of [8,9], and our proposed structure of the multi-layer NMF network.

In Fig. 7(a), the SVM classification performance of the three networks is shown. We conducted each experiment three times with different random initializations and calculated the average. We can clearly see that our proposed structure of multi-layer NMF performs much better than either the single-layer network or the multi-layer NMF of [8,9], especially for small numbers of dimensions provided for the final data representation. Our proposed structure of multi-layer NMF network reaches the maximum achievable performance with only a small number of features: it reaches the maximum classification performance with only 40 features, while the single-layer network requires 180 features. The classification performance of the multi-layer NMF of [8,9] is similar to that of the single-layer network without a nonlinear function, and slightly below that of the single-layer network with a nonlinear function. The comparison of the classification performance of these networks is illustrated in the diagram in (b) for easier understanding.

As briefly mentioned in Section 2, the multi-layer network of [8,9] repeats the same process through the layers, dividing the one-step learning process into several steps: R_1 = R_2 = ... = R_L. This type of architecture was proven to be effective for separating mixed signals compared to the single-layer network in [8,9]. However, our experimental results show that this architecture is not effective for classification tasks: the architecture proposed in [8,9] does not improve the classification performance of the single-layer network.

Indeed, we could have predicted this result from Section 4.3, where we investigated various architectures to find the optimal architecture for classification tasks.

Fig. 4. SVM classification performance using the full-dimensional data (M = 24,231) and the reduced-dimensional data (M = 1000) for single- and multi-layer network architectures. (For interpretation of the references to color in the text, the reader is referred to the web version of this paper.)

Fig. 5. SVM classification performance for varying sparsity conditions in W of the single- and multi-layered networks.


From the experiments in Fig. 3, the multi-layer NMF of [8,9] corresponds to case (1), where R_{L-1} < R_SingleMax. From those experiments, we found that R_{L-1} restricts the final classification performance of the multi-layer network. Since the multi-layer NMF network of [8,9] sets R_1 = R_2 = ... = R_L, we could have predicted that its final classification performance would be similar to that of the single-layer network.

From the comparison of the classification performance of the various network architectures, we verified that our proposed structure of the multi-layer NMF network provides a better solution to classification tasks than the single-layer network or the architecture proven to be effective for BSS tasks [8,9].

Fig. 6. Finding the optimal usage of nonlinear functions. (a) Effect of nonlinear functions for the single-layer network. Effect of nonlinear functions for the multi-layer network with (b) tanh, (c) logistic sigmoid, and (d) root. (e) The best combination of nonlinear function usage for each layer of the multi-layer network.

5.6. Complex features learned through multi-layer network

In this section, we look into the intuitive feature learning achieved by our proposed structure of the multi-layer network: we show the complex features learned through the proposed network. For the visualization of the learned features, we display each dimension with its corresponding word, in descending order of values.

In Fig. 8, an example of 20 complex features learned through the multi-layer network is displayed. The top 10 words of the complex features are shown, along with their values and the corresponding strength of activation represented in grayscale, in the form of a matrix CF_{1:10×20}:

CF_{mr} = \frac{\big(W^{(1)} f^{-1}(W^{(2)} H_{act})\big)_{mr}}{\sum_{m=1}^{M} \big(W^{(1)} f^{-1}(W^{(2)} H_{act})\big)_{mr}}, \quad where H_{act} = I_{20×20}.

In (b), H^(2) is shown with samples of the same category grouped together. According to H^(2) in (b), the samples in each category display at least one dominantly activated complex feature, marked with a yellow box; we wrote this category information underneath the corresponding complex features in (a), where they are marked with bold boxes. We can see that each complex feature demonstrates its topic well, especially those in bold boxes, which are the dominant features for each category marked in yellow in (b).

5.7. Concept hierarchies in document sets

We also observed the underlying concept hierarchies in the document data by viewing the features learned in the first layer and the second layer.

The experimental results are shown in Fig. 9. The left side shows six features learned in the first layer, W^(1). According to the combination information learned in the second layer, W^(2), the six features are combined in a weighted summation to form the complex feature on the right side. When we look at the six features on the left, the dominant concept of each feature is 'oil'. However, when we look into each feature in more detail, each displays a distinct sub-concept of oil; the words shown in each W^(1) differ from each other and display unique or distinguishable content. Through the information in W^(1), we are able to understand how oil-related words are sub-grouped together, or how some of the words are related to others so as to be grouped into the same W^(1). When we look at the complex feature on the right side, we can observe that the higher-level feature shows strongly generalized content about 'oil', with less varied oil-related words; the first four words are synonyms of 'oil': oil, gas, petroleum, and energy. We can conclude that lower-level features become generalized to the dominant concept when developed into complex higher-level features.

Fig. 7. (a) Reuters classification performance comparison between the single-layer network, the multi-layer NMF network of [8,9], and the proposed structure of multi-layer NMF network, for an increasing number of dimensions for the final data representation (R_1 for the single-layer network, R_2 for the multi-layer networks). (b) Illustration of the comparison of the classification performance of the networks. The classification performance of the multi-layer network of [8,9] is similar to that of the single-layer network without a nonlinear function, and slightly below that of the single-layer network with a nonlinear function. Meanwhile, our proposed structure of multi-layer network outperforms all of these architectures, achieving the maximum performance threshold even with a small number of features, whereas the single-layer network requires a significantly larger number of features to reach a similar level of classification performance.


Fig. 8. (a) An example of 20 complex features learned through the multi-layer network, with the categories of the dominant features represented in them indicated by bold boxes. (Only the top 10 words of the complex features, CF_{1:10×20}, are displayed.) (b) The corresponding H^(2) with samples grouped into the 10 categories. (For interpretation of the references to color in the text, the reader is referred to the web version of this paper.)


From this example, we are able to understand that Reuters samples under the category 'crude' may be sub-divided into various sub-categories related to 'crude'. If it were a single-layered network, it would have displayed only complex features consisting of synonyms of oil, without any information on how oil-related words are grouped into sub-groups. Without the non-negativity constraint, as in other conventional deep learning algorithms, we would also have had to consider subtraction of features, and we could not understand the feature hierarchy intuitively. With the experimental result demonstrated in Fig. 9, our proposed structure of the multi-layer NMF network proved able to discover the underlying concept hierarchies present in document contents: our proposed network learns not just the big categories, but also intuitively reveals the relationships of the sub-categories under the big categories that form the concept hierarchy structure in documents, and it displays the feature development process in an intuitively understandable manner by simply adding up features.

5.8. Reconstruction property

In our previous work [11], we mistakenly calculated the reconstruction error as a mean absolute error, \frac{1}{MN}\sum_{m=1}^{M}\sum_{n=1}^{N} |X_{mn} - \tilde{X}_{mn}|. In this section, we correct this mistake by re-calculating the reconstruction error as \sum_{m=1}^{M}\sum_{n=1}^{N} (X_{mn} - \tilde{X}_{mn})^2.

In Fig. 10, the reconstruction errors of the single-layer and multi-layer networks are plotted for an increasing number of dimensions for the final data representation (R_1 for single-layer and R_2 for multi-layer). We can see that extending the network into multiple layers does not improve reconstruction performance; by stacking more layers, the reconstruction error accumulates. This corrected result tells us that the multi-layer NMF network may provide a better solution for classification tasks, but not for reconstruction tasks.

6. Beyond the maximum threshold performance

6.1. Augmenting final data representation

In Section 4.3, we observed that we cannot achieve a classification performance higher than the maximum threshold performance of the first layer when using the multi-layered architecture. This is an understandable phenomenon when we consider the process of the multi-layered network, where the second layer learns based on the first layer's training result. Is it really impossible to overcome this "maximum threshold performance"?

In order to seek a solution that improves classification performance beyond the maximum threshold level of the single-layer network, we considered making a new set of features from the already available feature sets K^(1) and K^(2). We concatenated the simple features K^(1) and the complex features K^(2) together so that we have many more building blocks to choose from when representing the original data.

Fig. 9. Concept hierarchies of the Reuters data learned throughout the layers. Several simple features W^(1) are summed with weights to develop into a complex feature W^(2).

Fig. 10. Reconstruction error of single-layer network and multi-layer network.

Fig. 11. Concatenation of K^(1) and K^(2) as the final data representation.


The motivation is that the second layer may be limited in learning all possible combinations of the simple features of the first layer, and augmenting the complex features with the simple features may increase the chance of representing the original data in more detail. In Fig. 11, the process of obtaining the concatenated features is described: we simply concatenate K^(1) and K^(2) and feed the result into the classifier. The classification results for the concatenated features are shown in Fig. 12. The x-axis is the R_1 value for the single-layer network results, and R_2 for the multi-layer network results and the concatenated features. For example, the performance at an x value of 20 is the classification performance of [K^(1); K^(2)] with R_2 = 20. (Note that we fixed R_1 = 180 in the earlier section.) For comparison, the classification performance of the single-layer and multi-layer networks is also shown. As we can see, the concatenation of simple and complex features performs above the maximum threshold performance of both the single-layer network and the multi-layer network. Through this experiment, we demonstrated the potential of the multi-layered network to improve its classification performance beyond the maximum threshold performance limit. Further analysis remains open as future work.
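The augmented representation is a plain concatenation along the feature axis; a minimal sketch (the helper name is ours):

    import numpy as np

    def concatenate_representations(K1, K2):
        # Stack simple features K^(1) (R1 x N) on top of complex features K^(2) (R2 x N).
        return np.vstack([K1, K2])   # shape (R1 + R2, N), then fed to the SVM classifier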

6.2. Extending into more layers

In Fig. 3 in Section 4.3, the experimental results showed that the classification performance of the three-layered network was slightly lower than that of the two-layered network. Although not included in this paper, experiments with image data showed the opposite result: extending the multi-layer network to three layers slightly improved classification performance compared to the two-layered network. These opposite results tell us that, while extending the network to two layers guarantees a stable improvement in classification performance, extending the network further remains uncertain in its stability.

The heart of deep learning lies in the "deep" architecture of the network. Unfortunately, simply adding more layers does not seem to work stably for the multi-layer network with the non-negativity constraint for all types of data. Extending the network into "deeper" layers in a stable manner remains future work.

7. Conclusion

In this paper, we proposed an optimal structure of multi-layer NMF for classification tasks, which provides a natural feature learning process under the non-negativity constraint. By stacking the NMF algorithm into several layers, our proposed network takes a step-by-step approach to learning the features and provides an intuitive explanation of the learning steps in each layer. By investigating various architectures and usages of nonlinear functions, we designed an optimal structure of the multi-layer NMF network for classification tasks. We applied the proposed network to document data and showed that our proposed structure of multi-layer NMF network outperforms the classification performance of the single-layer network and of the multi-layer network known to be effective for BSS tasks. Also, taking advantage of the non-negativity constraint, we demonstrated the intuitive feature learning process of our network: we showed that our proposed multi-layer NMF network can demonstrate the process of hierarchical feature development successfully for document data, and can also discover the underlying concept hierarchies present in the complex document data. As future work, we would like to find ways to improve classification performance beyond the maximum threshold, extending Section 6.

Acknowledgments

The authors would like to thank Cheong-An Lee, Byeong-Yeol Kim, and other CNSL members for their valuable discussion and contribution in developing the idea.

This research was supported by the Brain Research Program through the National Research Foundation of Korea funded by the Ministry of Science, ICT & Future Planning (2013-035100 and 2014-028269).

Appendix A. Pseudo-code for the training and testing procedures of the proposed network

%% Training procedure
Input:
  - Training data X_train ∈ R_+^(M×N_train)
  - R_l, l = 1, 2, ..., L (number of hidden neurons for each layer)
  - θ_l, l = 1, 2, ..., L (sparsity constraint for each layer)
Output:
  - W^(l) ∈ R_+^(R_{l-1}×R_l), where R_0 = M
  - H^(l) ∈ R_+^(R_l×N_train) and K^(l) ∈ R_+^(R_l×N_train)

for l = 1 : L do
    Randomly initialize W^(l) and H^(l)
    S^(l) = (1 - θ_l) eye(R_l) + (θ_l / R_l) ones(R_l)
    if l = 1 then
        K^(0) = X_train
    end if
    for iteration = 1 : (until convergence) do
        H^(l) ← S^(l) H^(l)
        H^(l)_rn ← H^(l)_rn (W^(l)T K^(l-1))_rn / (W^(l)T W^(l) H^(l))_rn
        W^(l)_mr ← W^(l)_mr (K^(l-1) H^(l)T)_mr / (W^(l) H^(l) H^(l)T)_mr
        W^(l)_mr = W^(l)_mr / Σ_{m'=1}^{M} W^(l)_{m'r}
    end for
    K^(l)_rn = f(H^(l)_rn)
end for

Fig. 12. SVM classification performance for K^(1), K^(2), and [K^(1); K^(2)].


%% Testing procedure
Input:
  - Test data X_test ∈ R_+^(M×N_test)
  - R_l, l = 1, 2, ..., L (number of hidden neurons for each layer)
  - W^(l) ∈ R_+^(R_{l-1}×R_l), where R_0 = M
Output:
  - H^(l)_test ∈ R_+^(R_l×N_test) and K^(l)_test ∈ R_+^(R_l×N_test)

for l = 1 : L do
    Randomly initialize H^(l)_test
    if l = 1 then
        K^(0)_test = X_test
    end if
    for iteration = 1 : (until convergence) do
        H^(l)_test,rn ← H^(l)_test,rn (W^(l)T K^(l-1)_test)_rn / (W^(l)T W^(l) H^(l)_test)_rn
    end for
    K^(l)_test,rn = f(H^(l)_test,rn)
end for

References

[1] Y. Bengio, Learning deep architectures for AI, Found. Trends Mach. Learn. 2 (1) (2009) 1-127.
[2] D.H. Hubel, T.N. Wiesel, Receptive fields, binocular interaction and functional architecture in the cat's visual cortex, J. Physiol. 160 (1) (1962) 106.
[3] G.E. Hinton, S. Osindero, Y.-W. Teh, A fast learning algorithm for deep belief nets, Neural Comput. 18 (7) (2006) 1527-1554.
[4] Y. Bengio, P. Lamblin, D. Popovici, H. Larochelle, et al., Greedy layer-wise training of deep networks, in: Advances in Neural Information Processing Systems, vol. 19, 2007, p. 153.
[5] C. Poultney, S. Chopra, Y.L. Cun, et al., Efficient learning of sparse representations with an energy-based model, in: Advances in Neural Information Processing Systems, 2006, pp. 1137-1144.
[6] J.-H. Ahn, S. Choi, J.-H. Oh, A multiplicative up-propagation algorithm, in: Proceedings of the Twenty-First International Conference on Machine Learning, ACM, 2004, p. 3.
[7] J.-H. Ahn, S. Kim, J.-H. Oh, S. Choi, Multiple nonnegative-matrix factorization of dynamic PET images, in: Proceedings of the Asian Conference on Computer Vision, 2004, pp. 1009-1013.
[8] A. Cichocki, R. Zdunek, Multilayer nonnegative matrix factorisation, Electron. Lett. 42 (16) (2006) 947-948.
[9] A. Cichocki, R. Zdunek, Multilayer nonnegative matrix factorization using projected gradient approaches, Int. J. Neural Syst. 17 (06) (2007) 431-446.
[10] S. Rebhan, J. Eggert, H.-M. Groß, E. Körner, Sparse and transformation-invariant hierarchical NMF, in: Artificial Neural Networks - ICANN 2007, Springer, Berlin, Heidelberg, 2007, pp. 894-903.
[11] H.A. Song, S.-Y. Lee, Hierarchical representation using NMF, in: Neural Information Processing, Springer, Berlin, Heidelberg, 2013, pp. 466-473.
[12] A. Pascual-Montano, J.M. Carazo, K. Kochi, D. Lehmann, R.D. Pascual-Marqui, Nonsmooth nonnegative matrix factorization (nsNMF), IEEE Trans. Pattern Anal. Mach. Intell. 28 (3) (2006) 403-415.
[13] J.-H. Oh, H.S. Seung, Learning generative models with the up-propagation algorithm, in: Advances in Neural Information Processing Systems, 1998, pp. 605-611.
[14] D.D. Lee, H.S. Seung, Learning the parts of objects by non-negative matrix factorization, Nature 401 (6755) (1999) 788-791.
[15] D.D. Lee, H.S. Seung, Algorithms for non-negative matrix factorization, in: Advances in Neural Information Processing Systems, 2000, pp. 556-562.

Hyun Ah Song received her Bachelor's degree in Environmental Science and Engineering from Ewha Womans University, Seoul, South Korea, in 2011, and her Master's degree in Electrical Engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea, in 2013. She worked at the Korea Institute of Science and Technology (KIST), Seoul, South Korea, from May 2013 to August 2014. She is currently a PhD student in the Machine Learning Department at Carnegie Mellon University, USA. Her research interests include machine learning and biologically plausible feature extraction algorithms.

Bo-Kyeong Kim received the B.S. degree in Bio and Brain Engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea, in 2012. She is currently working toward the Ph.D. degree under the integrated Master's/Doctoral program in Electrical Engineering, KAIST. Her research interests include feature learning for brain-inspired intelligent systems, brain-computer interfaces, and image processing.

Thanh Luong Xuan received his Bachelor's degree in Control and Informatics in Technical Systems from South Russian State Technical University in 2012. He is currently an M.S. student under the supervision of Professor Soo-Young Lee in the Computational NeuroSystems Laboratory, Korea Advanced Institute of Science and Technology. His research interests lie in the fields of signal processing, machine learning, and neural networks.

Soo-Young Lee (M'83) received a B.S. degree from Seoul National University, Seoul, Korea, in 1975, the M.S. degree from the Korea Advanced Institute of Science and Technology, Daejeon, Korea, in 1977, and a Ph.D. from the Polytechnic Institute of New York in 1984. He was with Taihan Engineering Company, Seoul, from 1977 to 1980, and with the General Physics Corporation, Columbia, MD, from 1982 to 1985. In 1986, he joined the Department of Electrical Engineering, KAIST, as an Assistant Professor, and is now a Full Professor in the same department. From June 2008 to 2009, he was with the Mathematical Neuroscience Laboratory, RIKEN Brain Science Institute, Saitama, Japan, on sabbatical leave. His current research interests include artificial brains, human-like intelligent systems/robots based on biological information processing mechanisms in brains, mathematical models, real-world applications, intelligent man-machine interfaces with electroencephalograms, and eye tracking. He is a Past-President of the Asia-Pacific Neural Network Assembly, and has contributed to the International Conference on Neural Information Processing as Conference Chair (2000), Conference Vice Co-Chair (2003), and Program Co-Chair (1994, 2002). He is on the Editorial Boards of Neural Processing Letters and Cognitive Neurodynamics. He received the Leadership Award and the Presidential Award from the International Neural Network Society in 1994 and 2001, respectively, and the APNNA Service Award and the APNNA Outstanding Achievement Award from the Asia-Pacific Neural Network Assembly in 2004 and 2009, respectively. He has also received from SPIE the Biomedical Wellness Award and the ICA Unsupervised Learning Pioneer Award, in 2008 and 2010, respectively.
