
Application of a staged learning-based resource allocation network to automatic text categorization

Wei Song a,b,*, Peng Chen a,b, Soon Cheol Park c

a School of Internet of Things Engineering, Jiangnan University, Wuxi 214122, China
b Engineering Research Center of Internet of Things Applied Technology, Ministry of Education, China
c Department of Electronics and Information Engineering, Chonbuk National University, Jeonju, Jeonbuk 561756, Republic of Korea
* Corresponding author at: School of IOT Engineering, Jiangnan University, Lihu Avenue, Wuxi, Jiangsu Province 214122, China. E-mail address: [email protected] (W. Song).

Article info

Article history: Received 11 December 2013; Received in revised form 4 April 2014; Accepted 10 July 2014; Communicated by Y. Chang; Available online 18 July 2014. Neurocomputing 149 (2015) 1125–1134.

Keywords: Resource allocation network; Neural network; Staged learning algorithm; Text categorization; Novelty criteria

Abstract

In this paper, we propose a novel learning classifier which utilizes a staged learning-based resource allocation network (SLRAN) for text categorization. According to its learning progress, SLRAN is divided into a preliminary learning phase and a refined learning phase. In the former phase, to reduce the sensitivity to the input data, an agglomerate hierarchical k-means method is utilized to create the initial structure of the hidden layer. Subsequently, a novelty criterion is put forward to dynamically regulate the hidden layer centers. In the latter phase, a least square method is used to enhance the convergence rate of the network and further improve its ability for classification. Such a staged learning-based approach builds a compact structure which decreases the computational complexity of the network and boosts its learning capability. In order to apply SLRAN to text categorization, we utilize a semantic similarity approach which reduces the input scale of the neural network and reveals the latent semantics between text features. The benchmark Reuter and 20-newsgroup datasets are tested in our experiments, and the extensive experimental results reveal that the dynamic learning process of SLRAN improves its classification performance in comparison with conventional classifiers, e.g. RAN, BP, RBF neural networks and SVM.

© 2014 Elsevier B.V. All rights reserved.

1. Introduction

With the rapid development of Internet technology, the quantity of online documents and information is growing exponentially. The demand to rapidly and accurately find useful information in such large collections has become a challenge for modern information retrieval (IR) technologies. Text categorization (TC) is a crucial and well-proven instrument for organizing large volumes of textual information. As a key technique in the IR field, TC has been extensively researched in recent decades. Meanwhile, TC has become a hot spot and supports a series of related applications, including web classification, query recommendation, spam filtering, topic spotting, etc.

In recent years, an increasing number of approaches based on intelligent agents and machine learning, e.g. support vector machines (SVM) [1], decision trees [2,3], K-nearest neighbor (KNN) [4,5], Bayes models [6–8], neural networks [9,10], etc., have been applied to text categorization. Although such methods have been extensively researched, the present automated text classifiers are still imperfect and their effectiveness needs improvement. Thus, text categorization remains a major research field. Since the artificial neural network is still one of the most powerful tools utilized in the field of pattern recognition [11], we employ it as a classifier.

As a basic supervised network, the back propagation (BP) neural network suffers from a slow training rate and a high tendency to become trapped in local minima. In contrast, the relatively simple mechanism of the radial basis function (RBF) neural network [12–14] avoids slow learning and displays a robust global approximation property. It is known that the key to building a successful RBF neural network is to ensure a proper number of units in its hidden layer [15]. More specifically, a lack of hidden layer nodes has a negative influence on its decision-making ability, whereas redundant hidden layer nodes result in a high computational cost [16–18]. That is to say, too small a network architecture may cause under-fitting, while too large an architecture may lead to over-fitting of the data [19,20]. Although more and more learning methods have been studied to regulate the hidden nodes and obtain a suitable structure for the RBF network, the most remarkable approach is the resource allocation network (RAN) learning method put forward by Platt [21].



Platt made a significant contribution through the development of an algorithm which regulates the hidden nodes according to the so-called novelty criteria. In other words, RAN can dynamically manipulate the number of hidden layer units by judging the novelty criteria. However, the novelty criteria are sensitive to the initialized data, which easily increases the training time of the network and reduces its effectiveness in use [22]. Meanwhile, the least mean-square (LMS) algorithm applied in RAN to update its learning parameters usually makes the network suffer from a low convergence rate [23,24]. To tackle these problems, in this paper we propose a staged learning-based resource allocation network (SLRAN) which divides its learning process into a preliminary learning phase and a refined learning phase. In the former phase, to reduce its sensitivity to the initialized data, an agglomerate hierarchical k-means method is utilized to construct the structure of the hidden layer. Subsequently, a novelty criterion is put forward to dynamically add or prune hidden layer centers, and a compact structure is created. That is, the former phase reduces the complexity of the network and builds the initial structure of RAN. In the latter phase, a least square method is used to enhance the convergence rate of the network and further refine its learning ability. Therefore, SLRAN builds a compact structure which decreases the computational complexity of the network and boosts its learning ability for classification.

The rest of this paper is organized as follows. Section 2 introduces the basic concepts of RAN. Section 3 proposes the SLRAN algorithm as an efficient text classifier and describes its details. The steps to generate the latent semantic features of text documents, which help enhance the text categorization performance, are depicted in Section 4. Experimental results and analysis are illustrated in Section 5. Conclusions are discussed in Section 6.

2. Resource allocation network (RAN)

RAN is a promising sequential learning algorithm based on the RBF neural network. The architecture of RAN includes three layers, i.e. an input layer, an output layer and a single hidden layer. The topology of RAN is shown in Fig. 1. During the training process of RAN, an n-dimensional input vector is given to the input layer, and based on the assigned input pattern RAN computes an m-dimensional vector in the output layer. That is to say, the aim of the RAN network is to define a mapping from the n-dimensional input space to the m-dimensional output space. Eventually, the network calculates output vectors that match the desired output vectors.

Fig. 1. The three-layer topology of the RAN neural network.

In the structure of RAN, the input layer, the hidden layer and the output layer are x = (x_1, x_2, ..., x_n), c = (c_1, c_2, ..., c_h) and y = (y_1, y_2, ..., y_m) respectively, and b = (b_1, b_2, ..., b_m) is the offset item of the output layer, where n, h and m are the respective numbers of units in these three layers. The units of the hidden layer use the Gaussian function as their activation function, which implements a locally tuned unit. The Gaussian function of the hidden layer is defined as

\Phi_i(x) = \exp\left( - \frac{\| x - c_i \|^2}{\sigma_i^2} \right)    (1)

where c_i and \sigma_i are the ith node center and the width of this center respectively. The outputs of the hidden layer nodes are linearly weighted for the output layer, and the function of the output layer is given by

f_j(x) = \sum_{i=0}^{h} w_{ij} \Phi_i(x) + b_j,  (j = 1, 2, ..., m)    (2)

where m and h are the respective numbers of nodes in the output layer and hidden layer, x is an input sample, and w_{ij} is the connecting weight between the ith hidden layer node and the jth output layer node.
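To make the mapping in Eqs. (1) and (2) concrete, the following Java sketch evaluates the Gaussian hidden units and the linear output layer for a single input vector. It is an illustrative reimplementation, not the authors' code, and all class and variable names are ours.

// Minimal sketch of the RAN forward pass in Eqs. (1) and (2).
// Class and field names are illustrative, not taken from the paper's implementation.
public class RanForwardPass {

    /** Eq. (1): Gaussian activation of hidden unit i for input x. */
    static double hiddenActivation(double[] x, double[] center, double width) {
        double sq = 0.0;
        for (int d = 0; d < x.length; d++) {
            double diff = x[d] - center[d];
            sq += diff * diff;
        }
        return Math.exp(-sq / (width * width));
    }

    /** Eq. (2): linear combination of hidden activations plus the output offset b[j]. */
    static double[] output(double[] x, double[][] centers, double[] widths,
                           double[][] w, double[] b) {
        int h = centers.length;       // number of hidden units
        int m = b.length;             // number of output units
        double[] phi = new double[h];
        for (int i = 0; i < h; i++) {
            phi[i] = hiddenActivation(x, centers[i], widths[i]);
        }
        double[] f = new double[m];
        for (int j = 0; j < m; j++) {
            double sum = b[j];
            for (int i = 0; i < h; i++) {
                sum += w[i][j] * phi[i];   // w[i][j]: weight from hidden unit i to output j
            }
            f[j] = sum;
        }
        return f;
    }

    public static void main(String[] args) {
        double[][] centers = {{0.0, 0.0}, {1.0, 1.0}};
        double[] widths = {0.5, 0.5};
        double[][] w = {{0.3, -0.2}, {0.8, 0.5}};
        double[] b = {0.1, 0.0};
        double[] f = output(new double[]{0.9, 1.1}, centers, widths, w, b);
        System.out.println(java.util.Arrays.toString(f));
    }
}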

At the beginning of the training stage, there is no hidden neuron in the network. RAN initializes the parameters of the neural network in terms of the first couple of input training samples. Subsequently, if a training sample satisfies the novelty criteria, it is added into the network as a new neuron center in the hidden layer; otherwise LMS is used to update the parameters of the current network, including the hidden layer centers and the connection weights between the hidden layer and the output layer. However, the novelty criteria of RAN are sensitive to the initialized data, which increases the training time. Meanwhile, LMS usually makes the network suffer from a low convergence rate.

3. Staged learning-based resource allocation network (SLRAN)

To handle the above-mentioned problems of RAN, in this paper we propose a staged learning-based resource allocation network (SLRAN) which divides the learning process into two phases. In order to reduce its sensitivity to the initialized data, in the preliminary learning phase of SLRAN an agglomerate hierarchical k-means method is utilized to construct the structure of the hidden layer. Subsequently, a novelty criterion is put forward to dynamically add or prune hidden layer centers. That is, this phase reduces the complexity of the network and builds a compact structure of RAN.

3.1. Determination of initial hidden layer centers

We apply an agglomerate hierarchical k-means algorithm to initialize the structure of the hidden layer, i.e. the centers of the hidden layer and the widths of the clusters for each center. For a given initial document dataset D, after the clustering process the generated cluster centers are defined as C = (c_1, c_2, c_3, ..., c_k), where k is the number of clusters. The k-means algorithm helps obtain the hidden layer centers c_i and the cluster widths σ_i. The algorithm proceeds as follows:

Step 1: The primitive dataset is randomly sampled m times, which ensures that the data are not distorted after sampling and still reflect their natural distribution. The primitive dataset is thus divided into m parts, and the size of each part is n/m, where n is the total number of texts. That is, we get m subsets that can be expressed as S = (s_1, s_2, ..., s_m). Clustering analysis is executed for every subset s_i using the k-means algorithm. In this way, we obtain a group of k' (k' > k) clustering centers for each subset, where k is the predefined number of categories. We take this step to guarantee, as far as possible, that the generated centers represent all clusters and to avoid initializing objects around isolated points. In our method k' is empirically defined as 2 × k. Thus, an appropriate k' can ensure a roughly uniform distribution of centers in each sample and lead to a good sampling performance. In comparison with the effect of k', m is a secondary coefficient. We set it several times empirically and select the proper m according to the sampling results. Note that we still have m subsets; that is, we obtain m × k' cluster centers after this step.
Step 2: As a bottom-up hierarchical clustering strategy, the average-linkage agglomerative algorithm is used to cluster the newly generated m × k' clustering centers. In this algorithm, the two most similar clusters are merged to build a new cluster until only k clusters are left.
Step 3: The k cluster centers acquired from the previous step are selected as the hidden layer centers of the SLRAN learning algorithm. At the same time, we calculate the center width of the ith neuron by

\sigma_i = \frac{1}{N_i} \sum_{x \in c_i} (x - c_i)^T (x - c_i)    (3)

where N_i is the number of samples contained in the cluster of center c_i. In other words, the center width σ_i stands for the mean squared distance between the center c_i and the training samples assigned to this cluster.
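As an illustration of Eq. (3), the sketch below computes the width of one cluster from the samples assigned to it; the clustering that produces the assignments is assumed to be given, and the names are hypothetical.

// Sketch of Eq. (3): the width of a hidden center is the mean squared
// distance between the center and the training samples assigned to its cluster.
// Names are illustrative; the clustering producing 'members' is assumed given.
public class CenterWidth {

    static double clusterWidth(double[] center, double[][] members) {
        double sum = 0.0;
        for (double[] x : members) {
            double sq = 0.0;
            for (int d = 0; d < center.length; d++) {
                double diff = x[d] - center[d];
                sq += diff * diff;            // (x - c)^T (x - c)
            }
            sum += sq;
        }
        return sum / members.length;          // divide by N_i
    }

    public static void main(String[] args) {
        double[] center = {0.5, 0.5};
        double[][] members = {{0.4, 0.6}, {0.6, 0.4}, {0.5, 0.7}};
        System.out.println(clusterWidth(center, members));
    }
}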

3.2. Novelty criteria

In the preliminary learning phase, a novelty criterion is put forward to dynamically regulate the hidden layer centers. The RAN learning algorithm first augments the number of hidden layer units through its novelty criteria. We present criteria that consider the features of both the input and output space, shown as

e_i = \| y_i - f(x_i) \| > \varepsilon    (4)

d_i = \min_{1 \le j \le h} \| x_i - c_j \| > \delta_i    (5)

where ε is a predefined error precision set empirically; it is set to 0.075 in our system for text categorization. h is the number of hidden layer units in the current network when the ith example x_i is input, y_i is the desired output, and f(x_i) is the related real output. d_i is the Euclidean distance between the input vector x_i and the nearest hidden layer center. δ_i = max{exp(iγ) δ_max, δ_min}, where δ_max and δ_min represent the maximum and minimum Euclidean distances of the entire input space respectively, and γ is a decay coefficient whose value is less than zero. In light of their characteristics, (4) and (5) are named the error criterion and the distance criterion respectively. Different from other criteria that use only a single loop for training, we use multiple cycles to ensure that the network obtains sufficient hidden layer units.

If an input pattern x_k and the corresponding real output are far from the nearest center c_i and the desired output respectively, the pattern x_k is regarded as a new hidden layer center in order to reduce the output errors. That is, x_k is added as a new hidden layer center if it satisfies both formulas (4) and (5). Subsequently, the hidden layer center c_new, the hidden layer center width σ_new and the connection weights w_{new,j} of this newly added hidden unit are set as below

c_{new} = x_k    (6)

\sigma_{new} = \alpha \| x_k - c_{nearest} \|    (7)

w_{new,j} = y_j - f_j(x_k),  (j = 1, 2, ..., m)    (8)

where c_nearest is the hidden layer center nearest to x_k, α is a correlation coefficient between 0 and 1, and m is the number of output layer neurons.
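The following sketch shows one way to code the novelty test (4)-(5) and the center insertion (6)-(8). It assumes the current network output f(x) has already been computed; treating the norm in (4) as the maximum absolute output error is our reading, and all names are illustrative.

// Sketch of the novelty criteria (4)-(5) and of adding a hidden center via (6)-(8).
// All names are illustrative; the arrays y and f stand for the desired and actual outputs.
public class NoveltyStep {

    static double dist(double[] a, double[] b) {
        double sq = 0.0;
        for (int d = 0; d < a.length; d++) { double t = a[d] - b[d]; sq += t * t; }
        return Math.sqrt(sq);
    }

    /** Error criterion (4): the output error exceeds epsilon (max absolute error used here). */
    static boolean errorCriterion(double[] y, double[] f, double epsilon) {
        double e = 0.0;
        for (int j = 0; j < y.length; j++) e = Math.max(e, Math.abs(y[j] - f[j]));
        return e > epsilon;
    }

    /** Distance criterion (5): the distance to the nearest existing center exceeds delta. */
    static boolean distanceCriterion(double[] x, java.util.List<double[]> centers, double delta) {
        double dMin = Double.MAX_VALUE;
        for (double[] c : centers) dMin = Math.min(dMin, dist(x, c));
        return dMin > delta;
    }

    /** Eqs. (6)-(8): insert x as a new center; width from the nearest center, weights from the residual. */
    static void addCenter(double[] x, double[] y, double[] f, double alpha,
                          java.util.List<double[]> centers, java.util.List<Double> widths,
                          java.util.List<double[]> weights) {
        double dNearest = Double.MAX_VALUE;
        for (double[] c : centers) dNearest = Math.min(dNearest, dist(x, c));
        double[] wNew = new double[y.length];
        for (int j = 0; j < y.length; j++) wNew[j] = y[j] - f[j];   // Eq. (8)
        centers.add(x.clone());                                      // Eq. (6)
        widths.add(alpha * dNearest);                                // Eq. (7)
        weights.add(wNew);
    }
}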

Otherwise, if x_k does not satisfy both (4) and (5), it is assigned to the existing cluster C_i whose center c_i is closest to x_k, and the hidden layer is updated by

c_{ij} = c_{ij} + \Delta c_{ij}    (9)

\sigma_i = \sigma_i + \Delta \sigma_i    (10)

\Delta c_{ij} = 2 \beta_i \eta \frac{x_{kj} - c_{ij}}{\sigma_i^2} \Phi_i(x_k) \sum_{s=1}^{m} w_{is} (f_s(x_k) - y_s)    (11)

\Delta \sigma_i = 2 \beta_i \eta \frac{\| x_k - c_i \|^2}{\sigma_i^3} \Phi_i(x_k) \sum_{s=1}^{m} w_{is} (f_s(x_k) - y_s)    (12)

where Φ_i(x_k) is the Gaussian activation function of the ith hidden layer neuron given by (1), c_ij is the jth dimension of the hidden layer center c_i, x_kj is the jth component of the input pattern x_k, w_is is the connection weight from the ith hidden layer neuron to the sth output layer neuron, and η is the learning rate. We use β_i to represent the inverse similarity between c_i and x_k, and β_i is given by

\beta_i = \frac{\| x_k - c_i \|}{\| c_{farthest} - c_i \|}    (13)

where c_farthest is the center farthest from the input pattern x_k. Meanwhile, a gradient descent method is employed to adjust the offset b_j and the weight w_ij from the hidden layer neurons to the output layer neurons, which is given by

b_j = b_j + \eta (y_j - f_j(x_k))    (14)

w_{ij} = w_{ij} + \eta (y_j - f_j(x_k)) \Phi_i(x_k)    (15)
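A sketch of the update rules (9)-(15) for a sample assigned to its nearest existing center is given below. It assumes the hidden activations phi and the current outputs f have been computed beforehand; the names and the exact update order are illustrative, not the authors' implementation.

// Sketch of the updates (9)-(15) applied when a sample x is assigned to its
// nearest existing center (index 'nearest'); 'farthest' is the index of the
// center farthest from x. Names are illustrative.
public class GradientUpdate {

    static void update(double[] x, double[] y, double[] f, double[] phi,
                       double[][] centers, double[] widths, double[][] w, double[] b,
                       int nearest, int farthest, double eta) {
        int m = y.length;
        double[] cNear = centers[nearest];

        // beta_i, Eq. (13): distance to x divided by the distance to the farthest center.
        double distToX = dist(x, cNear);
        double beta = distToX / dist(centers[farthest], cNear);

        // Common residual term: sum_s w_is (f_s(x) - y_s).
        double residual = 0.0;
        for (int s = 0; s < m; s++) residual += w[nearest][s] * (f[s] - y[s]);

        double sigma = widths[nearest];
        // Eqs. (9) and (11): move each coordinate of the nearest center.
        for (int d = 0; d < x.length; d++) {
            cNear[d] += 2.0 * beta * eta * (x[d] - cNear[d]) / (sigma * sigma)
                        * phi[nearest] * residual;
        }
        // Eqs. (10) and (12): adjust the width of the nearest center.
        widths[nearest] = sigma + 2.0 * beta * eta * distToX * distToX
                          / (sigma * sigma * sigma) * phi[nearest] * residual;

        // Eqs. (14)-(15): gradient step on the offsets and hidden-to-output weights.
        for (int j = 0; j < m; j++) {
            double err = y[j] - f[j];
            b[j] += eta * err;
            for (int i = 0; i < centers.length; i++) w[i][j] += eta * err * phi[i];
        }
    }

    static double dist(double[] a, double[] c) {
        double sq = 0.0;
        for (int d = 0; d < a.length; d++) { double t = a[d] - c[d]; sq += t * t; }
        return Math.sqrt(sq);
    }
}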

After the growth of the hidden layer neurons, we step into the pruning stage. If a hidden layer neuron was active previously but its significance is gradually reduced as new samples are input, such a hidden neuron has little effect and is therefore pruned. That is, if for all input samples the outputs and the connection weights of a hidden layer neuron are relatively small, we remove this hidden layer neuron. Moreover, redundant hidden layer neurons may result in over-fitting. From the above analysis, a pruning strategy is given as follows.
For each training pair {(x(i), y(i)), i = 1, 2, ..., n}, we first calculate the hidden layer outputs by (1) and then normalize the relative output of the kth hidden neuron by

R_k = \frac{\Phi_k(x_i)}{\Phi_{max}(x_i)} < \xi    (16)

where Φ_k(x_i) is the output of the kth hidden layer neuron for the ith input pattern, Φ_max(x_i) is the maximum hidden layer output for the ith input pattern, and ξ is a predefined threshold. Subsequently, we normalize the relative connection weight between hidden layer neurons and output layer neurons by

\mu_k = \frac{w_{k,j}}{w_{k,max}} < \theta    (17)

where w_{k,j} is the connection weight between the kth hidden layer neuron and the jth output layer neuron, w_{k,max} is the maximum connection weight of the kth hidden layer neuron, and θ is a predefined threshold. Thus, for the kth hidden neuron, if more than P input samples meet both formulas (16) and (17), this neuron is regarded as less important than the other effective hidden neurons and is removed. By utilizing this pruning strategy, we can effectively delete undesirable neurons, and the complexity of the network structure is reduced. The training structure for the preliminary learning phase of SLRAN is shown in Fig. 2.
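Because the text leaves the exact pairing of samples and outputs in Eqs. (16) and (17) partly open, the sketch below encodes one possible reading of the pruning rule: for hidden unit k it counts the training samples on which the normalized activation falls below ξ while the unit's normalized outgoing weight toward the sample's target class falls below θ, and prunes the unit when the count exceeds P. All names are illustrative.

// One possible reading of the pruning rule built on Eqs. (16) and (17).
// 'activations[i][k]' holds Phi_k(x_i) for sample i; 'targetClass[i]' is the
// output index associated with sample i. Names are illustrative.
public class PruningCheck {

    static boolean shouldPrune(int k, double[][] activations, int[] targetClass,
                               double[][] w, double xi, double theta, int pLimit) {
        double wMax = 1e-12;                                  // w_{k,max} in Eq. (17)
        for (double wkj : w[k]) wMax = Math.max(wMax, Math.abs(wkj));

        int count = 0;
        for (int i = 0; i < activations.length; i++) {
            double phiMax = 1e-12;                            // Phi_max(x_i) in Eq. (16)
            for (double v : activations[i]) phiMax = Math.max(phiMax, v);
            boolean lowActivation = activations[i][k] / phiMax < xi;           // Eq. (16)
            boolean lowWeight = Math.abs(w[k][targetClass[i]]) / wMax < theta;  // Eq. (17)
            if (lowActivation && lowWeight) count++;
        }
        return count > pLimit;   // prune when the neuron is negligible on more than P samples
    }
}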

Generally speaking, the preliminary learning phase dynamically adjusts the hidden neurons of SLRAN to maintain a compact training structure which decreases the computational complexity of the network and boosts its learning capability, and in the subsequent phase SLRAN uses the least square method to refine the weights of the network.


3.3. Refine weights

In the refined learning phase, SLRAN uses the least square method to update the weights of the network. Assume that N is the number of training samples and H is the eventual number of hidden layer centers. We obtain the hidden layer output matrix P, which is given by

P = [P_1, P_2, P_3, ..., P_i, ..., P_H]    (18)

P_i = [P_{i1}, P_{i2}, P_{i3}, ..., P_{is}, ..., P_{iN}]^T    (19)

P_{is} = \Phi_i(x_s) = \exp\left( - \frac{\| x_s - c_i \|^2}{\sigma_i^2} \right)    (20)

where Φ_i(x_s) is the Gaussian activation function of the hidden layer. Subsequently, the connection weights from the hidden layer to the output layer are calculated by

W = (P^T P)^{-1} P^T Y    (21)

where Y is the desired output of the whole network. In view of the above discussion, the flow chart of SLRAN is shown in Fig. 3.
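A compact way to realize Eq. (21) is a least-squares solve. The sketch below uses Apache Commons Math, which is our assumption (the paper does not state which linear algebra routine was used); a QR-based solve is mathematically equivalent to the normal-equations form in (21) when P has full column rank.

import org.apache.commons.math3.linear.Array2DRowRealMatrix;
import org.apache.commons.math3.linear.QRDecomposition;
import org.apache.commons.math3.linear.RealMatrix;

// Sketch of the refined learning phase weight computation, Eqs. (18)-(21).
public class LeastSquaresRefinement {

    /** Eq. (21): W = (P^T P)^{-1} P^T Y, computed here via a QR-based least-squares solve. */
    static RealMatrix refineWeights(double[][] p, double[][] y) {
        RealMatrix P = new Array2DRowRealMatrix(p);   // N x H hidden-layer outputs, Eqs. (18)-(20)
        RealMatrix Y = new Array2DRowRealMatrix(y);   // N x m desired outputs
        return new QRDecomposition(P).getSolver().solve(Y);   // H x m weight matrix W
    }
}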

Step 1: Cluster the samples by the agglomerate hierarchical k-means algorithm. We obtain the initial hidden layer centers and the widths of the centers in this step.
Step 2: Calculate the output of the hidden layer neurons with respect to the whole input training dataset using (1).
Step 3: Increase the number of hidden layer neurons by judging both criteria (4) and (5). If an input pattern satisfies these two novelty conditions, it is regarded as a new hidden neuron, and the newly generated hidden layer center, center width and connection weights are calculated by (6)–(8) respectively. Otherwise, the corresponding center and center width are adjusted by the gradient descent method using (9)–(12), and the offsets and weights are updated using (14) and (15) respectively. If no hidden layer neuron is added to the network for n = 5 consecutive epochs, the algorithm jumps to Step 4; otherwise it goes back to Step 2.
Step 4: Calculate and normalize the output of the hidden layer and its weights. SLRAN prunes a hidden layer center if it satisfies both (16) and (17).
Step 5: After the preliminary learning phase, SLRAN has achieved a compact network structure with lower network complexity. In the refined learning phase, SLRAN utilizes the least square method to determine the connection weights of the network. That is, this step helps SLRAN refine its learning ability and improve its classification performance.

Fig. 2. The training structure for the preliminary learning phase of SLRAN.

Fig. 3. The flow chart of SLRAN: start; agglomerate hierarchical k-means; calculate the output of the hidden layer with (1); if (4) and (5) are satisfied, add a hidden center using (6)–(8), otherwise update the hidden layer with (9)–(12) and adjust the weights and offsets with (14) and (15); when the adding process is over, prune the centers of the hidden layer; refine the algorithm using (18)–(21); end.


In summary, the first four steps correspond to the preliminary learning phase, and Step 5 corresponds to the refined learning phase. Specifically, the training examples are first input and stored in memory. In the preliminary learning phase, the training examples are used repeatedly over a number of epochs to update the hidden layer. In the refined learning phase, the least square method is applied once to improve the learning ability of the network. Accordingly, the training examples are used only once in this stage to refine the connection weights of the network.

4. Semantic feature selections

4.1. Vector space model (VSM)

VSM is a commonly used method for representing documents through the weights of the words they comprise. The model is based on the idea that the meaning of a document can be conveyed by its words, and the weight of each feature, which represents the contribution of every word, is evaluated by a statistical rule [25–27]. It is implemented by creating a term-document matrix that represents the whole dataset. In order to create a set of initial feature vectors for representing the training documents, each document is transformed into a feature vector. Suppose a document D_i is comprised of n terms; the feature vector for D_i can be represented by

D_i = \{ w_{i1}, w_{i2}, ..., w_{ik}, ..., w_{in} \}    (22)

where w_ik is the term weight of the kth indexing term in the ith text document D_i. Although VSM can reflect the direct relations in the original dataset, it still has some weaknesses. Since every unique word represents one dimension in the feature space, a long vector is needed to represent these high-dimensional data. Moreover, because the same concept can be represented by numerous different words and a word may have different meanings, the ambiguity of words prevents their semantic similarities from being identified. In this paper, we use a latent semantic feature space model to deal with this problem.

4.2. Latent semantic feature space (LSFS)

In this study, the latent semantic feature space is used to calculate the semantic similarity between words. Singular value decomposition (SVD) is a well-developed mathematical method for reducing the dimensionality of large datasets and extracting the dominant features of the data [28–30], so we apply the SVD technique to implement the latent semantic feature space in our system. The training dataset is first expressed as a document-term matrix D (n × m), where n is the number of documents and m is the number of terms. The transpose of matrix D is the term-document matrix A (m × n):

A = D^T    (23)

Once the matrix A is created, SVD is employed to decompose it so as to construct a semantic vector space which can represent conceptual term-document associations. The singular value decomposition of A is given below

A = U \Sigma V^T    (24)

where U and V are the matrices of term vectors and document vectors respectively, and Σ = diag(σ_1, σ_2, ..., σ_n) is the diagonal matrix of singular values. For the sake of reducing the dimensionality, we select the k largest singular values to take the place of the original term-document matrix A (m × n). The approximation of matrix A by the rank-k matrix A_k is then

A_k = U_k \Sigma_k V_k^T    (25)

where U_k is made up of the first k columns of U, Σ_k = diag(σ_1, σ_2, ..., σ_k) is the diagonal matrix formed by the first k singular values of Σ, and V_k^T is comprised of the first k rows of V^T. The matrix A_k captures most of the important underlying structure in the association of terms and documents, while ignoring noise due to word choice. Subsequently, we use the reduced matrix U_k and the original document vector d to obtain the target vector of each training and testing sample, which is given by

\hat{d} = d U_k    (26)

where d is an original (1 × m) feature vector of a document, and U_k is the (m × k) matrix generated by (25). Once all samples are processed in this way, the newly generated document vectors are used as input for the SLRAN classifier, which determines the appropriate categories the test samples should be assigned to.
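The paper performs SVD with OpenCV; purely as an illustration, the sketch below uses Apache Commons Math (an assumption on our part) to form A = D^T, keep the rank-k factor U_k of Eq. (25), and project a document vector as in Eq. (26).

import org.apache.commons.math3.linear.Array2DRowRealMatrix;
import org.apache.commons.math3.linear.RealMatrix;
import org.apache.commons.math3.linear.SingularValueDecomposition;

// Sketch of the latent semantic projection, Eqs. (23)-(26). Names are illustrative.
public class LsfsProjection {

    /** Eqs. (23)-(25): build A = D^T, decompose it, and keep the first k columns of U. */
    static RealMatrix truncatedU(double[][] docTerm, int k) {
        RealMatrix A = new Array2DRowRealMatrix(docTerm).transpose();   // term-document matrix
        SingularValueDecomposition svd = new SingularValueDecomposition(A);
        return svd.getU().getSubMatrix(0, A.getRowDimension() - 1, 0, k - 1);  // U_k (m x k)
    }

    /** Eq. (26): project an original 1 x m document vector into the k-dimensional space. */
    static double[] project(double[] doc, RealMatrix uK) {
        return uK.preMultiply(doc);   // d_hat = d * U_k
    }
}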

5. Experimental results and analysis

5.1. Data sets

In order to measure the effectiveness of our approach, we conduct our experiments on two standard text corpora, i.e. the Reuter-21578 corpus and the 20-newsgroup corpus. From the former dataset, we choose 1500 documents covering ten categories, i.e. acq, coffee, crude, earn, grain, money-fx, trade, interest, ship and sugar. From the latter dataset, 1200 documents are selected from ten categories, i.e. Alt.atheism, Comp.windows.x, Sci.crypt, Rec.motorcycles, Rec.sport.hockey, Misc.forsale, Talk.politics.guns, Talk.politics.mideast, Sci.space and Sci.med. For each of these two datasets, we divide the documents into three parts: two parts are used for training while the remaining part is used for testing. Subsequently, we preprocess the documents by removing stop words, stemming and calculating the weight of each feature. After preprocessing, we obtain 7856 indexing terms in total from the first dataset and 13,642 indexing terms from the second dataset, so that the documents are respectively expressed as

d_j = \{ w_{j,1}, w_{j,2}, ..., w_{j,i}, ..., w_{j,7856} \}    (27)

d'_j = \{ w_{j,1}, w_{j,2}, ..., w_{j,i}, ..., w_{j,13642} \}    (28)

where w_{j,i} is the weight of the ith feature word in the jth document. Here, we use the Okapi rule [31] to calculate the weight of each feature. The Okapi rule is given as

w_{ij} = tf_{ij} \times idf_j / (tf_{ij} + 0.5 + 1.5 \times dl / avgdl)    (29)

idf_j = \log\left( \frac{N}{n} \right)    (30)

where N is the total number of documents in each dataset, and n is the number of documents in which the ith feature word is contained. tf_{ij} represents the occurrence frequency of the ith feature term in document j, dl is the length of the jth document sample, and avgdl is the average length of the whole document collection. We adopt the Okapi rule in this study, instead of the ordinary tf-idf method, because it empirically normalizes the length of the document samples.
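The Okapi-style weighting of Eqs. (29) and (30) can be computed directly; the sketch below uses hypothetical counts for illustration.

// Sketch of the Okapi-style weighting in Eqs. (29) and (30). Variable names follow
// the text: tf is the term frequency in the document, df the number of documents
// containing the term, dl the document length, avgdl the average document length,
// and nDocs the total number of documents in the dataset.
public class OkapiWeight {

    /** Eq. (30): inverse document frequency. */
    static double idf(int nDocs, int df) {
        return Math.log((double) nDocs / df);
    }

    /** Eq. (29): length-normalized term weight. */
    static double weight(double tf, int nDocs, int df, double dl, double avgdl) {
        return tf * idf(nDocs, df) / (tf + 0.5 + 1.5 * dl / avgdl);
    }

    public static void main(String[] args) {
        // Hypothetical numbers: a term occurring 3 times in a 120-word document,
        // present in 25 of 1500 documents, with an average document length of 150 words.
        System.out.println(weight(3, 1500, 25, 120, 150));
    }
}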

In the research field of neural networks, it is known that the main challenge in their application to text categorization is the high dimensionality of the input feature space; that is, the input size of the neural network depends on the number of selected features. Considering that an overloaded input may cause the neural network to overflow, while a lack of features may cause a poor representation, we sort the features by the sum of their weights. We choose the first 1000 terms for the Reuter dataset and the first 1200 features for the 20-newsgroup dataset. Thus each document in these two datasets is respectively represented by

d_j = \{ w_{j,1}, w_{j,2}, ..., w_{j,i}, ..., w_{j,1000} \}    (31)

d'_j = \{ w_{j,1}, w_{j,2}, ..., w_{j,i}, ..., w_{j,1200} \}    (32)

So far, we have achieved the corresponding document vectors for these two datasets. Subsequently, the latent semantic feature space approach is employed to further represent each document and decrease the number of dimensions by (26).

5.2. Evaluation measures

In the information retrieval (IR) literature, precision P, recall R, and F-measure F [32] are the three main measures used to evaluate the performance of IR systems. In this study, we adopt these three measures to evaluate our categorization algorithm:

P_i = \frac{m_i}{M_i}    (33)

R_i = \frac{m_i}{N_i}    (34)

F_i = \frac{2 \times P_i \times R_i}{P_i + R_i}    (35)

where M_i is the total number of documents assigned to the ith category by the algorithm, and N_i is the number of documents in the ith category, which is predefined for the purpose of evaluation. m_i is the number of documents in the intersection of M_i and N_i; in other words, it is the number of documents that have been correctly assigned to the ith category. The macro-average F-measure F_macro is then given by

F_{macro} = \frac{1}{k} \sum_{i=1}^{k} F_i    (36)

where k is the number of categories. Besides, the micro-average F-measure [33] F_micro is utilized, and it is given by

F_{micro} = \frac{\sum_{i=1}^{k} m_i}{\sum_{i=1}^{k} M_i}    (37)

where \sum_{i=1}^{k} m_i counts the number of documents correctly assigned to their corresponding categories, and \sum_{i=1}^{k} M_i is the total number of documents in the dataset. In addition, we calculate the mean absolute error (MAE), which reflects the training capability of our SLRAN algorithm and is defined as

MAE = \frac{1}{2n} \sum_{i=1}^{n} \sum_{j=1}^{m} \sqrt{ (y_j(x_i) - f_j(x_i))^2 }    (38)

where n is the number of training samples, m is the number of neurons in the output layer, y_j(x_i) is the desired output, and f_j(x_i) is the actual output of the network.
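For reference, the evaluation measures (33)-(38) can be computed as in the sketch below; the array names are ours and the per-category counts are assumed to be available from the classifier output.

// Sketch of the evaluation measures in Eqs. (33)-(38). 'assigned[i]' is the number of
// documents the classifier placed in category i (M_i), 'truth[i]' the predefined size of
// category i (N_i), and 'correct[i]' the overlap (m_i). Names are illustrative.
public class Evaluation {

    static double fMacro(int[] correct, int[] assigned, int[] truth) {
        double sum = 0.0;
        for (int i = 0; i < correct.length; i++) {
            double p = (double) correct[i] / assigned[i];     // Eq. (33)
            double r = (double) correct[i] / truth[i];        // Eq. (34)
            sum += 2.0 * p * r / (p + r);                     // Eq. (35)
        }
        return sum / correct.length;                          // Eq. (36)
    }

    static double fMicro(int[] correct, int[] assigned) {
        int c = 0, a = 0;
        for (int i = 0; i < correct.length; i++) { c += correct[i]; a += assigned[i]; }
        return (double) c / a;                                // Eq. (37)
    }

    /** Eq. (38): mean absolute error between desired and actual network outputs. */
    static double mae(double[][] desired, double[][] actual) {
        double sum = 0.0;
        for (int i = 0; i < desired.length; i++)
            for (int j = 0; j < desired[i].length; j++)
                sum += Math.abs(desired[i][j] - actual[i][j]);
        return sum / (2.0 * desired.length);
    }
}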

5.3. Experimental results

The experiments are conducted on a Pentium PC. We implement the algorithm in Java under the MyEclipse 8.5 IDE. In order to fully reveal the performance of SLRAN, for the first dataset we choose the following dimensions k of LSFS in formula (25): 60, 100, 150, 200, 250, 300, 350, 400, 450, 500, 550 and 600. For the second dataset, we vary k over 60, 100, 150, 200, 250, 300, 350, 400, 450 and 500. The network size and the associated parameters of SLRAN are shown in Table 1. In addition, for the purpose of illustrating the superiority of SLRAN, we compare it with RAN [21], BP [10], RBF [34], and SVM [1]. In the first experiment, shown in Figs. 4 and 5, we provide the Fmacro and the Fmicro of these five categorization algorithms respectively using VSM and LSFS on the first dataset.

We can see from Figs. 4 and 5 that the Fmacro and Fmicro of these five algorithms using the LSFS model increase gradually and subsequently reach their maximum points, which are much higher than those of the five algorithms using VSM. That is to say, no matter which categorization algorithm we use, if a proper number of semantic features is selected, it performs better than when using VSM. Beyond the LSFS dimension at which these five algorithms achieve their maximum points, the Fmacro and Fmicro begin to decrease as the dimension increases. That is because redundant semantic features confuse the identification of documents and affect the performance of the categorization algorithms. In other words, a proper dimension of the LSFS semantic features enhances their categorization performance. Moreover, the Fmacro and Fmicro of SLRAN using LSFS achieve the best maximum in comparison with the performance of the other four categorization algorithms.

At their respective maximum points, we compare the performance of these five algorithms using LSFS with that of the same algorithms using VSM in Table 2. Their computational time (C-time) is also reported.

We can see from Figs. 4 and 5 and Table 2 that SLRAN achieves the best Fmacro and Fmicro when the dimension of LSFS reaches 400 for the first dataset. In Table 2, the maximum Fmacro of SLRAN, SVM, RAN, BP and RBF are 0.9648, 0.9437, 0.9311, 0.9133 and 0.9093 respectively, and the maximum Fmicro of SLRAN, SVM, RAN, BP and RBF are 0.9640, 0.9440, 0.9320, 0.9140 and 0.9100 respectively.

Table 1
The network size and the associated parameters of SLRAN.

Dataset    #Input nodes (LSFS)  #Input nodes (VSM)  #Initial hidden nodes  #Output nodes  α in (7)  η in (11)  ξ in (16)  θ in (17)
Dataset 1  60–600               1000                10                     10             0.80      0.001      0.085      0.085
Dataset 2  60–500               1200                10                     10             0.75      0.001      0.085      0.085

Fig. 4. The Fmacro of the five categorization algorithms respectively using LSFS and VSM for the first dataset.


Moreover, the corresponding dimension of semantic features using LSFS (400) is much smaller than that using VSM (1000), which greatly decreases the time consumption of categorization. Meanwhile, the SLRAN algorithm itself dynamically adds or prunes hidden layer centers and builds a compact structure, which reduces the complexity of the network. Thus, the C-time of SLRAN to categorize the first dataset is 52.6 s, which is shorter than that of RAN, BP and RBF in LSFS, and only a little longer than that of SVM in LSFS.

The subsequent experiment demonstrates the effectiveness of SLRAN using LSFS on the second dataset. Figs. 6 and 7 respectively show the corresponding Fmacro and Fmicro of these five categorization algorithms.

We can see from Figs. 6 and 7 that the Fmacro and Fmicro of the five categorization algorithms using VSM are higher than those using LSFS at first. That is because, at the lowest LSFS dimensions, the insufficient number of semantic features cannot express the content of the documents well and lowers the performance of the categorization algorithms. Then, as the dimension increases, the documents become well described by the LSFS model. Thus, the algorithms subsequently perform better than with VSM and achieve their respective maximum points. That is to say, we can obtain a better result if we utilize LSFS instead of VSM, and the variation trend is similar to Figs. 4 and 5. Note that the performance of SLRAN with LSFS still achieves the best Fmacro and Fmicro in comparison with that of SVM, RAN, BP and RBF in either the LSFS or the VSM space. In Table 3, we compare the dimensions, Fmacro, Fmicro and C-time of these five categorization algorithms for the second dataset.

From Table 3 we can see that the respective best Fmacro of SLRAN, SVM, RAN, BP and RBF are 0.9500, 0.9000, 0.8669, 0.8529 and 0.8239, and the best Fmicro of SLRAN, SVM, RAN, BP and RBF are 0.9500, 0.9050, 0.8675, 0.8525 and 0.8250 respectively. In addition, the respective running times of the five algorithms using LSFS are 48.5, 30.2, 44.4, 78.6 and 42.4 s, which are much shorter than those using VSM. The running time of SLRAN with LSFS is only a little longer than that of SVM, RBF and RAN with LSFS, and it is still much shorter than that of BP with LSFS.

Fig. 5. The Fmicro of the five categorization algorithms respectively using LSFS and VSM for the first dataset.

Table 2
The comparisons of dimensions, Fmacro, Fmicro and C-time for the first dataset (Reuter-21578 corpus).

Algorithm  Dimensions   Fmacro  Fmicro  C-time
SLRAN      1000 (VSM)   0.9442  0.9440  307.1 s
SVM        1000 (VSM)   0.9131  0.9140  105.8 s
RAN        1000 (VSM)   0.9099  0.9100  328.6 s
BP         1000 (VSM)   0.8911  0.8920  360.3 s
RBF        1000 (VSM)   0.8891  0.8900  265.4 s
SLRAN      400 (LSFS)   0.9648  0.9640  52.6 s
SVM        350 (LSFS)   0.9437  0.9440  40.9 s
RAN        300 (LSFS)   0.9311  0.9320  56.4 s
BP         300 (LSFS)   0.9133  0.9140  89.2 s
RBF        300 (LSFS)   0.9093  0.9100  53.4 s

Fig. 6. The Fmacro of the five categorization algorithms respectively using LSFS and VSM for the second dataset.

Fig. 7. The Fmicro of the five categorization algorithms respectively using LSFS and VSM for the second dataset.


That is, although the dimension of 300 in the LSFS space expands the input space of the network and relatively increases the time consumption on one hand, on the other hand the compact structure of SLRAN reduces the complexity of the network and decreases its categorization time.

Furthermore, we calculate the mean absolute error of SLRAN to reflect its training ability in the following experiments.

In Figs. 8 and 9, we provide the MAE of the SLRAN algorithm versus the number of epochs for the first dataset and the second dataset respectively. Note that each epoch here means that all learning samples complete one training pass. We can see from Figs. 8 and 9 that the MAE first rises, then decreases gradually with increasing epochs and falls abruptly at the last epoch. This is mainly due to the local nature of the Gaussian function. That is, when a training sample is added into the hidden layer as a new hidden center, the Gaussian function can only ensure that this sample has no influence on the existing network, while other input samples may cause errors for such newly added centers. Moreover, the rapid increase of hidden layer centers leads to the rise of the MAE shown at the beginning epochs of Figs. 8 and 9. In the following epochs, the MAE decreases gradually because of the relatively smaller variation of the hidden layer centers and the positive effect of the gradient descent method employed. Finally, the MAE decreases abruptly at the last epoch, because the least square method refines the parameters of the network at the end of the algorithm. Thus, the learning ability and the classification accuracy are further improved. Furthermore, we show the variations of the number of hidden layer nodes during the training stage for the two datasets in Table 4.

In Table 4, in light of the best performance of SLRAN discussed above, we count the number of hidden layer nodes in each training epoch and the corresponding MAE. Note that the last epoch corresponds to the refined learning phase, and the other epochs correspond to the preliminary learning phase. During the preliminary learning phase, we can see that the number of hidden layer nodes increases rapidly at the second epoch and the MAE becomes relatively higher due to the local nature of the Gaussian function. Subsequently, as hidden nodes are added, the MAE decreases gradually. In other words, this phase dynamically adjusts the hidden neurons to create a sound foundation for the network. With respect to the refined learning phase, both the number of hidden layer nodes and the MAE decrease rapidly. This phase refines the parameters of the network and achieves the best performance, which provides an optimal (or near optimal) solution to text categorization.

Finally, we evaluate SLRAN by recording the computational time (C-time), which includes the training time and the SVD decomposition time; the relevant mean absolute error (MAE) is also recorded.

Table 3
The comparisons of dimensions, Fmacro, Fmicro and C-time for the second dataset (20-newsgroup corpus).

Algorithm  Dimensions   Fmacro  Fmicro  C-time
SLRAN      1200 (VSM)   0.9218  0.9225  310.2 s
SVM        1200 (VSM)   0.8492  0.8500  97.6 s
RAN        1200 (VSM)   0.8403  0.8450  299.6 s
BP         1200 (VSM)   0.8088  0.8125  354.3 s
RBF        1200 (VSM)   0.7965  0.8025  232.4 s
SLRAN      300 (LSFS)   0.9500  0.9500  48.5 s
SVM        250 (LSFS)   0.9000  0.9050  30.2 s
RAN        200 (LSFS)   0.8669  0.8675  44.4 s
BP         200 (LSFS)   0.8529  0.8525  78.6 s
RBF        200 (LSFS)   0.8239  0.8250  42.4 s

Fig. 8. The mean absolute error of SLRAN versus the number of epochs for the first dataset.

Fig. 9. The mean absolute error of SLRAN versus the number of epochs for the second dataset.

Table 4
The comparisons of MAE in view of the variations of the number of hidden nodes.

Reuter-21578 corpus                              20-newsgroup corpus
Epoch  MAE     #Hidden nodes  Variation         Epoch  MAE     #Hidden nodes  Variation
1      0.1398  10             –                 1      0.2170  10             –
2      0.1503  209            +199              2      0.2260  182            +172
3      0.1460  256            +47               3      0.2192  227            +35
4      0.1420  274            +18               4      0.2140  242            +15
5      0.1380  287            +7                5      0.2093  251            +9
6      0.1344  292            +5                6      0.2050  257            +6
7      0.1314  292            0                 7      0.2016  257            0
8      0.1292  292            0                 8      0.1987  257            0
9      0.1270  292            0                 9      0.1970  257            0
10     0.1256  292            0                 10     0.1950  257            0
11     0.0510  56             −236              11     0.0800  51             −206


These results are summarized in Table 5. In our experiments, SVD is performed using OpenCV (Open Source Computer Vision), a library of programming functions mainly aimed at real-time computer vision. From Table 5, the C-time of SLRAN increases as the dimension rises. The testing time of SLRAN is very short: for all the dimensions listed, it takes only 0.010–0.030 s. Note that, for the Reuter-21578 and 20-newsgroup corpora, SLRAN achieves its best MAE of 0.0510 and 0.0800 respectively, with C-times of 52.6 s and 48.5 s, which are close to the C-times of SLRAN at the lower dimensions.

6. Conclusion and discussion

In this paper, we propose a learning classifier which utilizes a staged learning-based resource allocation network (SLRAN) for text categorization. In terms of its learning progress, we divide SLRAN into a preliminary learning phase and a refined learning phase. In the former phase, an agglomerate hierarchical k-means method is utilized to create the initial structure of the hidden layer. Such a method reduces the sensitivity to the input data and effectively prevents the network from falling into a local optimum. Subsequently, a novelty criterion is put forward to dynamically add and prune hidden layer centers. That is to say, this phase establishes the preliminary structure of SLRAN and reduces the complexity of the network. In the latter phase, the least square method is used to refine the learning ability of the network and improve the categorization accuracy. In summary, SLRAN builds a compact structure in the former phase which decreases the computational complexity of the network, and the latter phase boosts its learning capability. Moreover, the semantic similarity approach is used to organize documents, which greatly decreases the input scale of the network and reveals the latent semantics between text features. In order to demonstrate the superiority of our categorization algorithm, the Reuter and the 20-newsgroup datasets are tested in our experiments, and the extensive experimental results reveal that the dynamic learning process of SLRAN improves its classification performance in comparison with conventional classifiers.

Acknowledgments

The authors thank the editors and reviewers for providing very helpful comments and suggestions. Their insight and comments led to a better presentation of the ideas expressed in this paper. This work was sponsored by the National Natural Science Foundation of China (61103129), the fourth stage of the Brain Korea 21 Project, the Natural Science Foundation of Jiangsu Province (SBK201122266), SRF for ROCS, SEM, and the Specialized Research Fund for the Doctoral Program of Higher Education (20100093120004).

References

[1] H.T. Lin, J.C. Lin, R.C. Weng, A note on Platt's probabilistic outputs for support vector machines, Mach. Learn. 68 (10) (2007) 267–276.
[2] M.C. Wu, S.Y. Lin, C.H. Lin, An effective application of decision tree to stock trading, Expert Syst. Appl. 31 (2) (2006) 270–274.
[3] A. Kumar, M. Hanmandlu, H.M. Gupta, Fuzzy binary decision tree for biometric based personal authentication, Neurocomputing 99 (1) (2013) 87–97.
[4] E.K. Plakua, L. Avraki, Distributed computation of the KNN graph for large high-dimensional point sets, J. Parallel Distrib. Comput. 67 (3) (2007) 346–359.
[5] Y. Gao, F. Gao, Edited AdaBoost by weighted kNN, Neurocomputing 73 (16–18) (2010) 3079–3088.
[6] J.N. Chen, H.K. Huang, F.Z. Tian, Method of feature selection for text categorization with Bayesian classifiers, Comput. Eng. Appl. 44 (13) (2008) 24–27.
[7] H.C. Lin, C.T. Su, A selective Bayes classifier with meta-heuristics for incomplete data, Neurocomputing 106 (15) (2013) 95–102.
[8] J. Wang, P. Neskovic, L.N. Cooper, Bayes classification based on minimum bounding spheres, Neurocomputing 70 (4–6) (2007) 801–808.
[9] S.F. Guo, S.H. Liu, G.S. Wu, Feature selection for neural network-based Chinese text categorization, Appl. Res. Comput. 23 (7) (2006) 161–164.
[10] C.H. Li, S.C. Park, An efficient document classification model using an improved back propagation neural network and singular value decomposition, Expert Syst. Appl. 36 (2) (2009) 3208–3215.
[11] H.H. Song, S.W. Lee, A self-organizing neural tree for large-set pattern classification, IEEE Trans. Neural Netw. 9 (5) (1998) 369–380.
[12] T. Poggio, F. Girosi, Networks for approximation and learning, Proceedings of the IEEE 78 (9) (1990) 1481–1497.
[13] V. Fathi, G.A. Montazer, An improvement in RBF learning algorithm based on PSO for real time applications, Neurocomputing 111 (2) (2013) 169–176.
[14] D. Du, X. Li, M. Fei, G.W. Irwin, A novel locally regularized automatic construction method for RBF neural models, Neurocomputing 98 (3) (2012) 4–11.
[15] R. Parekh, J. Yang, Constructive neural-network learning algorithms for pattern classification, IEEE Trans. Neural Netw. 11 (2) (2000) 436–451.
[16] H.G. Han, Q.L. Chen, J.F. Qiao, An efficient self-organizing RBF neural network for water quality prediction, Neural Netw. 24 (7) (2011) 717–725.
[17] G.B. Huang, P. Saratchandran, N. Sundararajan, An efficient sequential learning algorithm for growing and pruning RBF (GAP–RBF) networks, IEEE Trans. Syst., Man, Cybern.—Part B: Cybern. 34 (6) (2004) 2284–2292.
[18] G.B. Huang, P. Saratchandran, N. Sundararajan, A generalized growing and pruning RBF (GGAP–RBF) neural network for function approximation, IEEE Trans. Neural Netw. 16 (1) (2005) 57–67.
[19] M.M. Islam, M.A. Sattar, M.F. Amin, X. Yao, K. Murase, A new adaptive merging and growing algorithm for designing artificial neural networks, IEEE Trans. Syst., Man, Cybern.—Part B: Cybern. 39 (3) (2009) 705–722.
[20] Z. Miljković, D. Aleksendrić, Artificial neural networks—solved examples with theoretical background, University of Belgrade, Faculty of Mechanical Engineering, Belgrade, 2009.
[21] J. Platt, A resource allocating network for function interpolation, Neural Comput. 3 (2) (1991) 213–225.
[22] W. Manolis, T. Nicolas, K. Stefanos, Intelligent initialization of resource allocating RBF networks, Neural Netw. 18 (2) (2005) 117–122.
[23] L. Xu, Least mean square error reconstruction principle for self-organizing neural-nets, Neural Netw. 6 (5) (1993) 627–648.
[24] H.S. Yazdi, M. Pakdaman, H. Modaghegh, Unsupervised kernel least mean square algorithm for solving ordinary differential equations, Neurocomputing 74 (12) (2011) 2062–2071.
[25] W. Song, L.C. Choi, S.C. Park, Fuzzy evolutionary optimization modeling and its applications to unsupervised categorization and extractive summarization, Expert Syst. Appl. 38 (8) (2011) 9112–9121.
[26] W. Song, S.C. Park, Latent semantic analysis for vector space expansion and fuzzy logic-based genetic clustering, Knowl. Inf. Syst. 22 (3) (2010) 347–369.
[27] W. Song, S.T. Wang, C.H. Li, Parametric and nonparametric evolutionary computing with a content-based feature selection approach for parallel categorization, Expert Syst. Appl. 36 (9) (2009) 11934–11943.
[28] B. Yu, Z.B. Xu, C.H. Li, Latent semantic analysis for text categorization using neural network, Knowl. Based Syst. 21 (8) (2008) 900–904.
[29] F.F. Riverola, E.L. Iglesias, F. Díaz, J.R. Méndez, J.M. Corchado, Spam hunting: an instance-based reasoning system for spam labeling and filtering, Decis. Support Syst. 43 (3) (2007) 722–736.
[30] C.H. Li, W. Song, S.C. Park, An automatically constructed thesaurus for neural network based document categorization, Expert Syst. Appl. 36 (8) (2009) 10969–10975.
[31] S.E. Robertson, S. Walker, S. Jones, M.M. Beaulieu, M. Gatford, Okapi at TREC-3, in: Proceedings of the Third Text Retrieval Conference, Gaithersburg, USA, 1994, pp. 109–123.
[32] W. Song, C.H. Li, S.C. Park, Genetic algorithm for text clustering using ontology and evaluating the validity of various semantic similarity measures, Expert Syst. Appl. 36 (5) (2009) 9095–9104.
[33] Y. Yang, An evaluation of statistical approaches to text categorization, Inf. Retr. 1–2 (1999) 69–90.
[34] Y. Feng, Z.F. Wu, J. Zhong, An enhanced swarm intelligence clustering-based RBFNN classifier and its application in deep web sources classification, Front. Comput. Sci. China 4 (4) (2010) 560–570.

Table 5
The comparisons of MAE and C-time of SLRAN with different dimensions.

Reuter-21578 corpus                    20-newsgroup corpus
Dimensions  MAE     C-time             Dimensions  MAE     C-time
LSFS 250    0.0596  38.2 s             LSFS 150    0.0920  37.7 s
LSFS 300    0.0566  42.7 s             LSFS 200    0.0895  41.2 s
LSFS 350    0.0532  47.1 s             LSFS 250    0.0871  44.6 s
LSFS 400    0.0510  52.6 s             LSFS 300    0.0800  48.5 s
LSFS 450    0.0541  58.8 s             LSFS 350    0.0827  53.4 s
LSFS 500    0.0553  65.3 s             LSFS 400    0.0860  58.9 s
LSFS 550    0.0579  72.1 s             LSFS 450    0.0880  64.2 s


Wei Song received his MS degree in Information and Communication Engineering from Chonbuk National University, Jeonbuk, Korea, in 2006. He received his PhD in Computer Science from Chonbuk National University in 2009. Upon graduation, he joined Jiangnan University in the School of Internet of Things (IOT). His research interests include pattern recognition, information retrieval, evolutionary computing, neural networks, artificial intelligence, data mining and knowledge discovery.

Peng Chen received his B.Eng. degree in 2012 from the Hubei University of Technology, Wuhan, China, and he is currently an MS candidate at Jiangnan University in the School of Internet of Things (IOT). His research interests include information retrieval, neural networks, data mining and knowledge discovery.

Soon Cheol Park received his BS degree in Applied Physics from Inha University, Incheon, Korea, in 1979. He received his PhD in Computer Science from Louisiana State University, Baton Rouge, Louisiana, USA, in 1991. He was a senior researcher in the Division of Computer Research at the Electronics & Telecommunications Research Institute, Korea, from 1991 to 1993. He is currently a Professor at the Department of Electronics and Information Engineering at Chonbuk National University, Jeonbuk, Korea. His research interests include pattern recognition, information retrieval, artificial intelligence, data mining and knowledge discovery.
