
High Performance Text Categorization System Based on a Novel Neural Network Algorithm.

Cheng Hua Li

Division of Electronics and Information Engineering, Chonbuk National University

Jeonju, Jeonbuk, 561-756, Korea. [email protected]

Soon Cheol Park

Division of Electronics and Information Engineering, Chonbuk National University

Jeonju, Jeonbuk, 561-756, Korea. [email protected]

Abstract

This paper describes a novel approach to text categorization based on an improved back-propagation neural network (BPNN). The BPNN has been widely used in classification and pattern recognition; however, it has some generally acknowledged defects, such as slow convergence and a tendency to become trapped in local minima. In this paper, we introduce an improved BPNN that overcomes these defects. We tested the improved model on the standard Reuters-21578 collection, and the results show that the proposed model achieves high categorization effectiveness as measured by precision, recall and F-measure.

1. Introduction

The explosive growth in the amount of information available on the Internet makes it hard to select what is worth reading in the time available. The ability to select important sites, topics and news has become a very important skill, yet it should also be recognized that this problem goes beyond human information-processing ability. Text categorization is an efficient technology for handling and organizing text data, and it has many applications; the most common is document indexing in information retrieval systems.

Text categorization is the process of classifying documents into one or more existing categories according to the themes or concepts present in their contents. Many different approaches have been attempted, including K-Nearest Neighbor [1, 2], Rocchio [3, 4], Decision Tree [5], and Neural Network [6, 7]. The neural network is one of the most popular algorithms for text categorization, and many researchers have found that neural networks achieve very good performance in experiments on different data sets. BPNNs have many advantages compared with other networks, so they are widely used; however, they also have their limitations. The main defects of the BPNN are: slow convergence; difficulty in escaping from local minima; susceptibility to network paralysis; and an uncertain network structure. In previous experiments, it was demonstrated that these limitations are all related to morbidity neurons. Therefore, we propose an improved model called MRBP (Morbidity neuron Rectify Back-Propagation neural network) that detects and rectifies the morbidity neurons. This reformative BPNN divides the whole learning process into a number of learning phases, and after every learning phase it evaluates the learning mode used in that phase. This improves the ability of the neural network, making it more adaptive and robust, so that the network can more easily escape from a local minimum and train itself more effectively.

The rest of this paper is organized as follows. In section 2, we describe the theory of back-propagation neural networks, including the basic theory and improved method. The experiments are discussed in section 3. The evaluation results are given in section 4. Finally, the conclusion and a discussion of future work are given in section 5.

2. Basic BPNN and improved BPNN algorithms

2.1. BPNN algorithm

The back-propagation neural network is a generalization of the delta rule used for training multi-layer feed-forward neural networks with non-linear units. It is simply a gradient descent method designed to minimize the total error (or mean error) of the output computed by the network. Fig. 1 shows such a network.



Fig. 1. A typical three-layer BP network

The network consists of an input layer and an output layer, with one or more hidden layers between them.

The training of a network by back-propagation involves three stages: the feed-forward of the input training pattern, the calculation and back-propagation of the associated error, and the adjustment of the weights and biases. Each stage is explained in detail as follows:

2.1.1. Input pattern feed-forward. Calculate each neuron's input and output. For neuron $j$, the input $net_j$ and the output $O_j$ are

$net_j = \sum_i w_{ij} O_i$   (1)

$O_j = f(net_j + \theta_j)$   (2)

where $w_{ij}$ is the weight of the connection from the $i$th neuron in the previous layer to neuron $j$, $f(net_j + \theta_j)$ is the activation function of the neuron, $O_i$ and $O_j$ are the outputs of the previous neuron $i$ and of neuron $j$, and $\theta_j$ is the bias input to the neuron.

2.1.2. Error calculation. Calculate the total absolute error $E$ in the output layer according to the following formula:

$E = \frac{1}{2} \sum_l (t_l - O_l)^2$   (3)

and the mean absolute error $E_m$ is

$E_m = \frac{1}{2n} \sum_l (t_l - O_l)^2$   (4)

where $n$ is the number of training patterns. The absolute error is used to evaluate the learning effect; training continues until the absolute error falls below some threshold or tolerance level. Calculate the back-propagation errors in the output layer, $\delta_l$, and in the hidden layer, $\delta_j$, according to the following formulas:

$\delta_l = (t_l - O_l)\, \lambda f'(O_l)$   (5)

$\delta_j = \big(\sum_l \delta_l w_{jl}\big)\, \lambda f'(O_j)$   (6)

where $t_l$ is the desired output of the $l$th output neuron, $O_l$ is the actual output in the output layer, $O_j$ is the actual output value in the hidden layer, and $\lambda$ is the adjustable variable in the activation function. The back-propagation error is used to update the weights and biases in both the output layer and the hidden layer.

2.1.3. Weights and biases adjustment. The weights $w_{ji}$ and biases $\theta_i$ are adjusted according to the following formulas:

$w_{ji}(k+1) = w_{ji}(k) + \eta \delta_j y_i$   (7)

$\theta_i(k+1) = \theta_i(k) + \eta \delta_i$   (8)

where $k$ is the epoch number and $\eta$ is the learning rate.
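
To make the training procedure concrete, the following minimal Python sketch implements one training step of the basic algorithm described above: the feed-forward pass of Eqs. (1)-(2), the error terms of Eqs. (5)-(6), and the updates of Eqs. (7)-(8). The layer sizes, learning rate and toy data are placeholders chosen for illustration, not the values used in the paper.

```python
import numpy as np

# Minimal sketch of the basic BPNN of Section 2.1 (assumed sizes and data).
# The activation is f(x) = 2 / (1 + exp(-lam * x)) - 1, as in Eq. (14).

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 8, 5, 3          # placeholder layer sizes
lam, eta = 1.0, 0.01                  # lambda of the activation, learning rate

W1 = rng.uniform(-0.5, 0.5, (n_in, n_hid))   # input -> hidden weights w_ij
b1 = np.zeros(n_hid)                          # hidden biases theta_j
W2 = rng.uniform(-0.5, 0.5, (n_hid, n_out))  # hidden -> output weights w_jl
b2 = np.zeros(n_out)                          # output biases theta_l

def f(x):
    return 2.0 / (1.0 + np.exp(-lam * x)) - 1.0

def f_prime_from_output(o):
    # derivative of f with respect to its net input, expressed via the output o;
    # this plays the role of lambda * f'(O) in Eqs. (5)-(6)
    return 0.5 * lam * (1.0 - o ** 2)

def train_pattern(x, t):
    """One feed-forward / back-propagation step for a single pattern (x, t)."""
    # feed-forward, Eqs. (1)-(2)
    o_hid = f(x @ W1 + b1)
    o_out = f(o_hid @ W2 + b2)
    # back-propagated errors, Eqs. (5)-(6)
    delta_out = (t - o_out) * f_prime_from_output(o_out)
    delta_hid = (delta_out @ W2.T) * f_prime_from_output(o_hid)
    # weight and bias updates, Eqs. (7)-(8)
    W2 += eta * np.outer(o_hid, delta_out)
    b2 += eta * delta_out
    W1 += eta * np.outer(x, delta_hid)
    b1 += eta * delta_hid
    return 0.5 * np.sum((t - o_out) ** 2)     # per-pattern error, Eq. (3)

# toy usage: one epoch over random patterns; the mean corresponds to Eq. (4)
X = rng.uniform(-1, 1, (20, n_in))
T = rng.choice([-0.9, 0.9], (20, n_out))
epoch_error = sum(train_pattern(x, t) for x, t in zip(X, T)) / len(X)
print("mean absolute error:", epoch_error)
```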

2.2. BPNN defect analysis and commonly used improved methods

The three main defects of the BPNN and some commonly used improved methods are as follows:

2.2.1. Slow convergence. In the beginning, the learning process proceeds very quickly and makes rapid progress in each epoch; however, it slows down in the later stages [8]. There are two commonly used methods of improving the speed of training for BPNNs.

a) Introduce momentum into the network. Convergence is sometimes faster if a momentum term is added to the weight update formula. The weight update formula for a BPNN with momentum is

$W_{ij}(k+1) = W_{ij}(k) + \eta \delta_j x_i + u\,\big(W_{ij}(k) - W_{ij}(k-1)\big)$   (9)

where the momentum parameter $u$ is constrained to the range from 0 to 1. The new weights for training step $k+1$ are based on the weights at training steps $k$ and $k-1$.

b) Use an adaptive learning rate. The role of the adaptive learning rate is to allow each weight to have its own learning rate and to let the learning rates vary with time as training progresses. The formula for a BPNN with an adaptive learning rate is

$\eta_{n+1} = \eta_n \times \dfrac{E_{n-1}}{E_n}$   (10)

where n is the epoch during the training process, and E is the absolute error in each epoch. When E decreases, the learning effect will increase (the weight may change to a greater extent). Otherwise, the learning effect will decrease.

These two kinds of methods accelerate the convergence of the BPNN, but they can not solve other problems associated with the BPNN, especially when the size of the network is large.
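
As an illustration only, the following sketch shows how Eqs. (9) and (10) might be applied in code; the previous-weight buffer and the per-epoch error arguments are placeholder names of ours, not part of the original paper.

```python
import numpy as np

def update_with_momentum(W, W_prev, delta, x, eta, u=0.8):
    """Momentum update, Eq. (9): W(k+1) = W(k) + eta*delta*x + u*(W(k) - W(k-1))."""
    W_new = W + eta * np.outer(x, delta) + u * (W - W_prev)
    return W_new, W            # the current weights become W(k-1) for the next step

def adapt_learning_rate(eta, prev_epoch_error, epoch_error):
    """Adaptive learning rate, Eq. (10): eta grows while the error keeps falling."""
    return eta * (prev_epoch_error / epoch_error)
```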

2.2.2. Local minimum. When training a BPNN, it is easy to become trapped in a local minimum, and genetic algorithms (GA) and simulated annealing have usually been used to solve this problem. These algorithms can reduce the risk of entering a local minimum, but they cannot ensure that the network will reach the global minimum, and they are even slower than the traditional BPNN.

2.2.3. Network paralysis. During training, the values of the weights may become very large and, consequently, the inputs of the neurons will be very large. Thus, the output value of the activation function, $O_j$ (or $O_l$), tends to $\pm 1$ and, according to the error back-propagation formulas, the back-propagation error will tend to 0. This phenomenon is referred to as saturation. The speed of training becomes very slow when saturation occurs; finally the weights stop changing altogether, which leads to network paralysis. P.D. Wasserman [9] suggested a formula that limits the weights to the range $(-a, a)$, but it is only used for weight initialization. It cannot prevent the weight values from increasing during training, so network paralysis remains possible.

2.3. MRBP Algorithm

The defects mentioned above are all related to saturation. In the case of saturation, convergence becomes slower and the system switches to a higher learning rate. The weights then become larger because of the larger learning rate, and this causes the output value of the activation function to tend to 1. In this situation, the network can easily enter a local minimum and ultimately become entrapped in network paralysis. Based on our experience with such problems, we also found another phenomenon that can cause such defects: for some neurons, the input values are restricted to a small range during each epoch, so that the output values are extremely close to each other in each epoch, and the error changes only slowly from epoch to epoch. In other words, the speed of convergence is slow. Eventually, this situation leads to a local minimum or even network paralysis. In this paper, we refer to these two phenomena as neuron overcharge and neuron tiredness, respectively, and we call such neurons morbidity neurons. In general, if morbidity neurons occur within the network, it cannot function effectively.

The idea of the MRBP method is as follows: during the learning process, neurons face two kinds of morbidity, overcharge and tiredness. If we avoid the appearance of morbidity neurons during the learning phase, or rectify the problem in time, then the network can train and evolve effectively.

[Definition 1]: Neuron overcharged. If the input value of a neuron is very large or very small, the output value will tend to $-1$ or $1$, and the back-propagation error will tend to 0. We refer to such a neuron as being overcharged. That is, for the activation function

$f(net_j + \theta_j) = \dfrac{2}{1 + e^{-\lambda (net_j + \theta_j)}} - 1$   (11)

if

$f(net_j + \theta_j) \to 1 \;\vee\; f(net_j + \theta_j) \to -1$   (12)

so that $\delta_j \to 0$, then we refer to neuron $j$ as being overcharged.

[Definition 2]: Neuron tiredness. If a certain neuron always receives very similar stimulation, its responses to these stimulations will also be very similar, so that it becomes difficult to distinguish different stimulations by its response. We refer to such a neuron as being tired. That is, when neuron $j$, during one learning phase (defined below), satisfies

$\max_k f(net_j^k + \theta_j^k) - \min_k f(net_j^k + \theta_j^k) \to 0$   (13)

then neuron $j$ is tired.

[Definition 3]: Learning phase. We choose N iterations (or learning steps) as a period; during this period we record some important data and calculate the effect of the learning process, which serves as the direction for the next period. We call this period the learning phase and, based on our experience, we use 50 epochs as one learning phase.

According to the definitions of an overcharged neuron and a tired neuron, we know that they are directly related to the activation function. The conventional activation function is

$f(x) = \dfrac{2}{1 + e^{-\lambda x}} - 1$   (14)

where $\lambda$ is usually set to 1 or another constant; in our model, however, $\lambda$ is an adjustable variable. V.P. Plagianakos [10] also tried an adjustable value of $\lambda$. In fact, different values of $\lambda$ correspond to different learning models. The decision rule for the morbidity neurons is: if

$f(net_j + \theta_j) \geq 0.9 \;\vee\; f(net_j + \theta_j) \leq -0.9$   (15)

then neuron $j$ is overcharged; and if

$\max_k f(net_j^k + \theta_j^k) - \min_k f(net_j^k + \theta_j^k) \leq 0.2$   (16)

then neuron $j$ is tired. The formulas used to rectify the morbidity neurons are

$\theta_j = \theta_j - \dfrac{\max_k f(net_j^k + \theta_j^k) + \min_k f(net_j^k + \theta_j^k)}{2}$   (17)

$\lambda_j = \dfrac{-\ln\!\left(\frac{2}{1.9} - 1\right)}{\max_k f(net_j^k + \theta_j^k) - \min_k f(net_j^k + \theta_j^k)}$   (18)

Formula (17) is used to normalize the maximum and minimum input values in the previous phase in order to make them symmetric with respect to the origin. Formula (18) is used to limit the maximum and minimum output values to the normal range. In our experiments, the range is (-0.9, 0.9). In our study, the morbidity neurons were rectified in each phase after their evaluation.
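
The following sketch illustrates how the per-phase check and rectification described by Eqs. (15)-(18) could be organized. The bookkeeping of per-neuron minima and maxima and the helper names are our own illustrative assumptions, since the paper does not give an implementation, and the rectification formulas follow our reading of the original equations.

```python
import math

# Per-neuron statistics collected over one learning phase (50 epochs in the paper):
# for each neuron j we track the minimum and maximum activation value
# f(net_j + theta_j) observed during the phase.

def is_overcharged(min_out, max_out):
    # Eq. (15): the output saturates near +1 or -1
    return max_out >= 0.9 or min_out <= -0.9

def is_tired(min_out, max_out):
    # Eq. (16): the output barely changes over the whole phase
    return (max_out - min_out) <= 0.2

def rectify(theta_j, lambda_j, min_out, max_out):
    """Rectify a morbidity neuron at the end of a phase.

    Eq. (17) re-centres the bias so that the observed responses become symmetric
    about the origin; Eq. (18) rescales lambda so that the responses fall back
    into the normal range (-0.9, 0.9). Both should be treated as approximations.
    """
    theta_j = theta_j - (max_out + min_out) / 2.0
    spread = max(max_out - min_out, 1e-6)          # avoid division by zero
    lambda_j = -math.log(2.0 / 1.9 - 1.0) / spread
    return theta_j, lambda_j

# usage at the end of a learning phase:
# for j, (lo, hi) in enumerate(phase_min_max):
#     if is_overcharged(lo, hi) or is_tired(lo, hi):
#         theta[j], lam[j] = rectify(theta[j], lam[j], lo, hi)
```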

3. Experiments

3.1. Experimental design

In order to measure the performance of our system, in all of our experiments we used a subset of the documents from the standard Reuters-21578 test collection for training and testing our text categorization model. We chose 1600 documents from the Reuters data set belonging to ten frequent categories; 600 documents were used for training and 1000 documents for testing.

After word stemming, we merged the sets of stems from each of the 600 training documents and removed the duplicates. This resulted in a set of 6122 indexing terms in the vocabulary.

In order to create the set of initial feature vectors representing the 600 training documents, we measured the term weight for each of the 6122 indexing terms. The feature vectors were then formed from these term weights, and each feature vector was of the form

$D_j = (W_{j,1}, W_{j,2}, \ldots, W_{j,6122})$   (19)

where $W_{j,i}$ is the term weight of the $i$th indexing term in document $j$. We created such a feature vector for each of the training and testing documents, each with a dimensionality of 6122.

In our experiments, we employ Porter's stemming algorithm [11] for word stemming and a logarithmic function as the term weight:

$weight_{ij} = \log(1 + tf_{ij})$   (20)

where $tf_{ij}$ is the frequency of the $i$th indexing term in document $j$. A dimensionality of 6122 is too high for neural networks, so we reduce this size by choosing the terms with the highest weights. We choose 1000 terms as the neural network's input, since this offers a reasonable reduction that is neither too specific nor too general.
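
A minimal sketch of this feature construction step is shown below, assuming NLTK's Porter stemmer is available (any Porter implementation would do). The tokenizer, the toy documents and the way the highest-weight terms are aggregated across documents (summed weights) are our own assumptions, since the paper does not specify these details.

```python
import math
import re
from collections import Counter
from nltk.stem import PorterStemmer   # assumed available

stemmer = PorterStemmer()

def stems(text):
    """Lower-case, tokenize and stem a document (a deliberately simple tokenizer)."""
    return [stemmer.stem(tok) for tok in re.findall(r"[a-z]+", text.lower())]

def build_vocabulary(train_docs, n_terms=1000):
    """Merge the stems of all training documents, then keep the n_terms stems
    with the highest summed log(1 + tf) weight, Eq. (20)."""
    total_weight = Counter()
    for doc in train_docs:
        for stem, tf in Counter(stems(doc)).items():
            total_weight[stem] += math.log(1.0 + tf)
    return [stem for stem, _ in total_weight.most_common(n_terms)]

def feature_vector(doc, vocabulary):
    """Eq. (19): a fixed-length vector of log(1 + tf) weights over the vocabulary."""
    tf = Counter(stems(doc))
    return [math.log(1.0 + tf[stem]) for stem in vocabulary]

# toy usage
train_docs = ["Gold prices rose sharply.", "Coffee exports fell as sugar trade grew."]
vocab = build_vocabulary(train_docs, n_terms=10)
print(feature_vector("Gold and coffee trade news.", vocab))
```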

The number of output nodes is equal to the number of pre-defined categories. According to the rules of hidden node selection

$\sum_{i=0}^{n} C_{n_1}^{i} > k$   (21)

$n_1 = \sqrt{n + m} + a$   (22)

$n_1 = \log_2 n$   (23)

where $n_1$ is the number of hidden nodes, $n$ is the number of input nodes, $m$ is the number of output nodes, $a$ is a constant in the range (1, 10) and $k$ is the number of training documents, we selected 15 as the number of hidden nodes.

In fact, these rules are only used as a reference for determining the relationship between the numbers of nodes required in each layer; different rules yield very different numbers of hidden nodes. Our network therefore has three layers consisting of 1000, 15 and 10 nodes.
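
As a quick illustration of how differently these heuristics behave, the short sketch below evaluates Eqs. (22) and (23) for the network size used here; the constant a = 4 is an arbitrary example value within the stated (1, 10) range.

```python
import math

n, m = 1000, 10          # input and output nodes used in this paper
a = 4                    # example constant from the range (1, 10)

rule_22 = math.sqrt(n + m) + a        # Eq. (22): sqrt(n + m) + a  ->  about 35.8
rule_23 = math.log2(n)                # Eq. (23): log2(n)          ->  about 10.0
print(rule_22, rule_23)
```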

3.2. Experimental results

In order to evaluate our proposed approach, we compared the mean absolute error obtained using three different methods. The first method is the conventional BP network, which we refer to as the traditional BPNN. The second method is the commonly used improved method, which we call the Modified BPNN. The third method is our proposed method, which we call the MRBP network.

Fig. 2. Reduction of the mean absolute error during training with the three methods

In our experiments, we used 50 epochs as one learning phase. The training stops after 50 learning phases (2500 epochs).

From Fig. 2 we can see that at the beginning of the training the error is reduced very rapidly, but this reduction slows down after a certain number of epochs. The MRBP is more effective than the traditional BPNN and the Modified BPNN in the later part of the training. The mean absolute errors after 2500 training iterations with the traditional BPNN, the Modified BPNN and the MRBP are 0.0024, 0.0009 and 0.00012, respectively.

4. Evaluations

The performance of text categorization systems can be evaluated based on their categorization effectiveness. The effectiveness is measured by recall, precision and F-measure.

We used the macro-average method to obtain the average values of precision and recall. The F-measure is based on the micro-average values. The performance results are given in Table 1.
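
For reference, the sketch below computes per-category precision and recall, their macro-averages, and a micro-averaged F-measure from per-category counts. The function, variable names and toy counts are ours; the averaging combination mirrors the description above rather than a formula given in the paper.

```python
def evaluate(counts):
    """counts: {category: (true_positives, predicted_positives, actual_positives)}."""
    per_cat = {}
    tp_sum = pred_sum = act_sum = 0
    for cat, (tp, pred, act) in counts.items():
        precision = tp / pred if pred else 0.0
        recall = tp / act if act else 0.0
        per_cat[cat] = (precision, recall)
        tp_sum, pred_sum, act_sum = tp_sum + tp, pred_sum + pred, act_sum + act

    # macro-average: mean of the per-category values
    macro_p = sum(p for p, _ in per_cat.values()) / len(per_cat)
    macro_r = sum(r for _, r in per_cat.values()) / len(per_cat)

    # micro-average: computed from the pooled counts, then combined into an F-measure
    micro_p, micro_r = tp_sum / pred_sum, tp_sum / act_sum
    micro_f = 2 * micro_p * micro_r / (micro_p + micro_r)
    return per_cat, (macro_p, macro_r), micro_f

# toy usage with two categories
_, macro, micro_f = evaluate({"gold": (90, 95, 100), "trade": (80, 98, 90)})
print(macro, micro_f)
```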



Table 1. Comparison of the performances of the three kinds of networks

Category        Traditional BPNN     Modified BPNN        MRBP Networks
                Precision  Recall    Precision  Recall    Precision  Recall
Money-supply    0.829      0.902     0.912      0.914     0.936      0.939
coffee          0.844      0.898     0.880      0.898     0.922      0.928
gold            0.938      0.910     0.938      0.992     0.953      1.000
sugar           0.952      0.884     0.945      0.878     0.925      0.912
trade           0.721      0.783     0.758      0.834     0.821      0.895
crude           0.941      0.896     0.942      0.913     1.000      0.927
grain           0.933      0.923     0.926      0.926     0.945      0.923
Money-fx        0.889      0.853     0.906      0.868     0.916      0.908
Acq             0.934      0.870     0.943      0.889     0.939      0.901
earn            0.930      0.921     0.949      0.928     0.964      0.948
micro-average   0.891      0.884     0.910      0.904     0.932      0.928
F-measure            0.887                0.907                0.930

The size of the network and some parameters used in our experiments are given in Table 2.

Table 2. The network size and parameters

#Input Nodes   #Hidden Nodes   #Output Nodes   Learning Rate   Momentum
1000           15              10              0.01            0.8

5. Conclusions

In this paper, we have described a new algorithm for text categorization, called the Morbidity neuron Rectify Back-Propagation neural network (MRBP), which detects and rectifies the morbidity neurons in each learning phase. This method overcomes the network paralysis problem and has a good ability to escape from local minima. The results of our experiments show that the proposed method outperforms both the traditional BPNN and the commonly used improved BPNN. The superiority of the MRBP is especially obvious when the size of the network is large.

6. References

[1]. Yang, Y. and Liu, X. A Re-examination of Text Categorization Methods. In Proceedings of SIGIR-99, 22nd ACM International Conference on Research and Development in Information Retrieval, (1999) pp 42-49

[2]. Mitchell, T.M. Machine Learning. McGraw Hill, New York, NY, (1996)

[3]. Rocchio, J. J. Relevance Feedback in Information Retrieval. In The SMART Retrieval System: Experiments in Automatic Document Processing, G. Salton, Ed., Prentice-Hall, Englewood Cliffs, New Jersey, (1971)

[4]. Joachims, T. A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization. In Proceedings of ICML-97, 14th International Conference on Machine Learning, (1997) pp 143-151

[5]. Cohen, W. W. and Singer, Y. Context–Sensitive Learning Methods for Text Categorization. ACM Trans. Inform. Syst. 17, 2, (1999) pp 141-173

[6]. Ruiz, M. E. and Srinivasan, P. Hierarchical Neural Networks for Text Categorization. In Proceedings of SIGIR-99, 22nd ACM International Conference on Research and Development in Information Retrieval, (1999) pp 281-282

[7]. Grossman, D. A. and Frieder, O. Information Retrieval: Algorithms and Heuristics. Kluwer Academic Publishers, (2000)

[8]. Wu, W., Feng, G., Li, Z. and Xu, Y. Deterministic Convergence of an Online Gradient Method for BP Neural Networks. IEEE Transactions on Neural Networks, Vol. 16, No. 3, (2005)

[9]. Wasserman, P. D. Neural Computing: Theory and Practice. Van Nostrand Reinhold, New York, (1989)

[10]. Plagianakos, V. P. and Vrahatis, M. N. Training Neural Networks with Threshold Activation Functions and Constrained Integer Weights. In Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN'00), Vol. 5, (2000) pp 51-61

[11]. Porter, M. F. An algorithm for suffix stripping. Program, Vol. 14, No. 3, (1980) pp 130-137
